GitHub user weiqingy opened a pull request: https://github.com/apache/spark/pull/17989
[SPARK-6628][SQL] Fix ClassCastException when executing sql statement 'insert into' on hbase table

## What changes were proposed in this pull request?

The major issue of SPARK-6628 is that
```
org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat
```
cannot be cast to
```
org.apache.hadoop.hive.ql.io.HiveOutputFormat
```
The reason is:
```
public interface HiveOutputFormat<K, V> extends OutputFormat<K, V> {...}

public class HiveHBaseTableOutputFormat extends TableOutputFormat<ImmutableBytesWritable> implements OutputFormat<ImmutableBytesWritable, Object> {...}
```
From the two snippets above, we can see that `HiveHBaseTableOutputFormat` and `HiveOutputFormat` both `extends`/`implements` `OutputFormat`, but neither can be cast to the other.

Spark initializes `outputFormat` in `SparkHiveWriterContainer` in Spark 1.6, 2.0, and 2.1 (or in `HiveFileFormat` in Spark 2.2/master):
```
@transient private lazy val outputFormat =
  jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
```
Notice that this cast assumes the output format is a `HiveOutputFormat`. However, when users write data into HBase, the output format is `HiveHBaseTableOutputFormat`, which is not an instance of `HiveOutputFormat`, so the cast fails.

This PR makes `outputFormat` null when the `OutputFormat` is not an instance of `HiveOutputFormat`. `outputFormat` is only used to get the file extension in `getFileExtension()`.

Spark 2.x also has this issue, so this PR can also be submitted to the master branch.

## How was this patch tested?

Manual test.
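The type-hierarchy problem and the shape of the fix can be illustrated with a minimal, self-contained sketch. The traits below are simplified stand-ins for the real Hadoop/Hive classes (`OutputFormat`, `HiveOutputFormat`, `HiveHBaseTableOutputFormat`), not the actual Spark code, and the `Option`-based guard stands in for the PR's null default:

```scala
// Simplified stand-ins for the Hadoop/Hive interfaces involved.
trait OutputFormat[K, V]
trait HiveOutputFormat[K, V] extends OutputFormat[K, V]

// Like the real HiveHBaseTableOutputFormat, this implements OutputFormat
// directly, so it is NOT a HiveOutputFormat.
class HiveHBaseTableOutputFormat extends OutputFormat[Array[Byte], Any]

object CastSketch {
  val fmt: OutputFormat[_, _] = new HiveHBaseTableOutputFormat

  // The old code did an unconditional cast, which throws ClassCastException:
  //   fmt.asInstanceOf[HiveOutputFormat[AnyRef, Any]]

  // The PR's idea, expressed with Option instead of null: keep the value only
  // when it really is a HiveOutputFormat, otherwise fall back to "no format".
  val safe: Option[HiveOutputFormat[_, _]] = fmt match {
    case h: HiveOutputFormat[_, _] => Some(h)
    case _                         => None
  }

  def main(args: Array[String]): Unit =
    println(safe) // None for the HBase output format
}
```

With the guard in place, sibling subtypes of `OutputFormat` no longer trigger a `ClassCastException`; callers just see an absent format.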
**Before:** The user was trying to write to a Hive-HBase table from Spark SQL using `hiveContext` and was failing with the error below:
```
17/03/30 20:26:08 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25acf50c46d05ce
17/03/30 20:26:08 INFO ZooKeeper: Session: 0x25acf50c46d05ce closed
17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
17/03/30 20:26:08 INFO ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x35acf50c63305c7
17/03/30 20:26:08 INFO ZooKeeper: Session: 0x35acf50c63305c7 closed
17/03/30 20:26:08 INFO ClientCnxn: EventThread shut down
17/03/30 20:26:08 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:74)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:73)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:93)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:119)
	at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:86)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:102)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
17/03/30 20:26:08 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
```
Below is the create table script:
```
CREATE TABLE `0bq_cntl.spark_load_cntl_stats`(
  `row_key` string COMMENT 'from deserializer',
  `application` string COMMENT 'from deserializer',
  `starttime` timestamp COMMENT 'from deserializer',
  `endtime` timestamp COMMENT 'from deserializer',
  `status` string COMMENT 'from deserializer',
  `statusid` smallint COMMENT 'from deserializer',
  `insertdate` timestamp COMMENT 'from deserializer',
  `count` int COMMENT 'from deserializer',
  `errordesc` string COMMENT 'from deserializer')
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'='cf1:application,cf1:starttime,cf1:endtime,cf1:Status,cf1:StatusId,cf1:InsertDate,cf1:count,cf1:ErrorDesc',
  'line.delim'='\n',
  'mapkey.delim'='\u0003',
  'serialization.format'='\u0001')
TBLPROPERTIES (
  'transient_lastDdlTime'='1489696241')
```
Below is the query run via Spark SQL:
```
val df = sqlContext.sql("Insert into table db1.spark_load_cntl_stats select 'AAM-846d55f6-0ffe-4694-b37a-1637a58f34f2','AAM','2017-03-21 04:03:01','2017-03-21 04:03:01','Started',45,'2017-03-21 04:03:01',1,'ad'")
```
**After:** The `ClassCastException` is gone and the insert succeeds.
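Since the description notes that `outputFormat` is only used to derive the file extension, the post-fix fallback can be sketched as below. This is a hypothetical simplification with illustrative names, not the actual `HiveFileFormat`/`SparkHiveWriterContainer` code; `Option` again stands in for the null default:

```scala
// Hypothetical sketch: once outputFormat may be absent, the extension lookup
// degrades to an empty extension for non-HiveOutputFormat writers (such as
// the HBase storage handler). All names here are illustrative only.
object ExtensionSketch {
  // Stand-in for the Hive utility that maps an output format to an extension.
  private def hiveExtension(formatName: String): String =
    "." + formatName.toLowerCase

  def getFileExtension(outputFormat: Option[String]): String =
    outputFormat.map(hiveExtension).getOrElse("") // no extension when absent

  def main(args: Array[String]): Unit = {
    println(getFileExtension(Some("Orc"))) // a normal Hive format keeps its extension
    println(getFileExtension(None))        // HBase-backed writes get none
  }
}
```

Losing only the filename extension is why a null (or `None`) default is safe here: no other code path depends on the cast succeeding.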
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/weiqingy/spark SPARK-6628

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17989.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17989

----

commit 0fa2bb791d1fa9c37fe89c1942ce0ed950a9ee59
Author: Weiqing Yang <yangweiqing...@gmail.com>
Date:   2017-05-16T00:12:16Z

    [SPARK-6628][SQL] Fix ClassCastException when executing sql statement 'insert into' on hbase table