GitHub user weiqingy opened a pull request:
https://github.com/apache/spark/pull/18127
[SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql
statement 'insert into' on hbase table
## What changes were proposed in this pull request?
The issue in SPARK-6628 is that `org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat` cannot be cast to `org.apache.hadoop.hive.ql.io.HiveOutputFormat`.
The reason is:
```
public interface HiveOutputFormat extends OutputFormat {...}
public class HiveHBaseTableOutputFormat extends TableOutputFormat
    implements OutputFormat {...}
```
From the two snippets above, we can see that `HiveHBaseTableOutputFormat` and `HiveOutputFormat` both `extends`/`implements` `OutputFormat`; they are sibling types, so neither can be cast to the other.
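The sibling-type cast failure can be reproduced with a minimal, self-contained sketch; the `OutputFormat`, `HiveOutputFormat`, and `HiveHBaseTableOutputFormat` names below are local stand-ins for illustration, not the real Hadoop/Hive classes:

```scala
// Local stand-in hierarchy mirroring the shape of the real one.
trait OutputFormat
trait HiveOutputFormat extends OutputFormat
class HiveHBaseTableOutputFormat extends OutputFormat // sibling, not a subtype

object CastDemo {
  // True only if the runtime class actually implements HiveOutputFormat.
  def isHiveFormat(of: OutputFormat): Boolean = of.isInstanceOf[HiveOutputFormat]

  def main(args: Array[String]): Unit = {
    val of: OutputFormat = new HiveHBaseTableOutputFormat
    println(isHiveFormat(of)) // prints "false"
    // An unconditional cast, like the one Spark performed, fails at runtime:
    try of.asInstanceOf[HiveOutputFormat]
    catch { case _: ClassCastException => println("ClassCastException") }
  }
}
```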
In Spark 1.6, 2.0, and 2.1, Spark initializes `outputFormat` in `SparkHiveWriterContainer`; in Spark 2.2+, it is initialized in `HiveFileFormat`:
```
@transient private lazy val outputFormat =
  jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
```
The `outputFormat` above has to be a `HiveOutputFormat`. However, when users insert data into an HBase table, the output format is `HiveHBaseTableOutputFormat`, which is not an instance of `HiveOutputFormat`.
This PR sets `outputFormat` to `null` when the `OutputFormat` is not an instance of `HiveOutputFormat`. This change should be safe, since `outputFormat` is only used to get the file extension in [`getFileExtension()`](https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L101).
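A hedged sketch of the guarded initialization described above, again with local stand-in types rather than the real Hadoop/Hive classes (`toHiveOutputFormat` and `fileExtension` are hypothetical names for illustration):

```scala
// Stand-in hierarchy for illustration only.
trait OutputFormat
trait HiveOutputFormat extends OutputFormat { def fileExtension: String = ".hive" }
class HiveHBaseTableOutputFormat extends OutputFormat

object NullFallbackSketch {
  // Guarded initialization replacing the unconditional asInstanceOf cast:
  // non-Hive formats (e.g. the HBase one) yield null instead of throwing.
  def toHiveOutputFormat(of: OutputFormat): HiveOutputFormat = of match {
    case h: HiveOutputFormat => h
    case _                   => null
  }

  // Mirrors the getFileExtension() usage: empty extension when null.
  def getFileExtension(of: HiveOutputFormat): String =
    Option(of).map(_.fileExtension).getOrElse("")
}
```

Because the only consumer of `outputFormat` tolerates `null` by returning an empty extension, the fallback does not change behavior for regular Hive tables.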
We can also submit this PR against the master branch.
## How was this patch tested?
Manually tested.
(1) Create an HBase table with Hive:
```
CREATE TABLE testwq100 (
  row_key string COMMENT 'from deserializer',
  application string COMMENT 'from deserializer',
  starttime timestamp COMMENT 'from deserializer',
  endtime timestamp COMMENT 'from deserializer',
  status string COMMENT 'from deserializer',
  statusid smallint COMMENT 'from deserializer',
  insertdate timestamp COMMENT 'from deserializer',
  count int COMMENT 'from deserializer',
  errordesc string COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'='cf1:application,cf1:starttime,cf1:endtime,cf1:Status,cf1:StatusId,cf1:InsertDate,cf1:count,cf1:ErrorDesc',
  'line.delim'='\\n',
  'mapkey.delim'='\\u0003',
  'serialization.format'='\\u0001')
TBLPROPERTIES (
  'transient_lastDdlTime'='1489696241',
  'hbase.table.name' = 'xyz',
  'hbase.mapred.output.outputtable' = 'xyz')
```
(2) Verify:
**Before:**
Insert data into the HBase table `testwq100` from Spark SQL:
```
scala> sql(s"INSERT INTO testwq100 VALUES ('AA1M22','AA1M122','2011722','201156','Starte1d6',45,20,1,'ad1')")
17/05/26 00:09:10 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:82)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:81)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:101)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:125)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:94)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.writeToFile(hiveWriterContainers.scala:182)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17/05/26 00:09:10 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
	at org.apache.spa