GitHub user weiqingy opened a pull request:

    https://github.com/apache/spark/pull/18127

    [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table

    ## What changes were proposed in this pull request?
    
    The issue reported in SPARK-6628 is that
    ```
    org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat 
    ```
    cannot be cast to
    ```
    org.apache.hadoop.hive.ql.io.HiveOutputFormat
    ```
    The reason is:
    ```
    public interface HiveOutputFormat<K, V> extends OutputFormat<K, V> {…}
    
    public class HiveHBaseTableOutputFormat extends
        TableOutputFormat<ImmutableBytesWritable> implements
        OutputFormat<ImmutableBytesWritable, Object> {...}
    ```
    From the two snippets above, we can see that `HiveHBaseTableOutputFormat` and `HiveOutputFormat` each extend/implement `OutputFormat` independently, so one cannot be cast to the other.
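    As a toy illustration (the simplified types below are hypothetical stand-ins for the real Hive and HBase classes), a cast between sibling subtypes of a common interface compiles but fails at runtime:
    ```
    trait OutputFormat[K, V]
    trait HiveOutputFormat[K, V] extends OutputFormat[K, V]

    // Stands in for HiveHBaseTableOutputFormat: it implements OutputFormat
    // directly, not via HiveOutputFormat.
    class HBaseLikeOutputFormat extends OutputFormat[AnyRef, AnyRef]

    val fmt: OutputFormat[AnyRef, AnyRef] = new HBaseLikeOutputFormat
    // Compiles (the downcast is only checked at runtime), then throws
    // java.lang.ClassCastException, exactly as in the report above:
    fmt.asInstanceOf[HiveOutputFormat[AnyRef, AnyRef]]
    ```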
    
    For Spark 1.6, 2.0, and 2.1, Spark initializes the `outputFormat` in `SparkHiveWriterContainer`. For Spark 2.2+, Spark initializes the `outputFormat` in `HiveFileFormat`.
    ```
    @transient private lazy val outputFormat =
      jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
    ```
    The `outputFormat` above has to be a `HiveOutputFormat`. However, when users insert data into an HBase table, the output format is `HiveHBaseTableOutputFormat`, which is not an instance of `HiveOutputFormat`.
    
    This PR makes `outputFormat` `null` when the `OutputFormat` is not an instance of `HiveOutputFormat`. This change should be safe, since `outputFormat` is only used to get the file extension in [`getFileExtension()`](https://github.com/apache/spark/blob/branch-2.1/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveWriterContainers.scala#L101).
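    A minimal sketch of this idea, in the style of the `SparkHiveWriterContainer` snippet above (the exact shape of the merged patch may differ):
    ```
    // Sketch: fall back to null when the configured OutputFormat is not a
    // HiveOutputFormat (e.g. HiveHBaseTableOutputFormat for HBase tables).
    // getFileExtension() must then treat a null outputFormat as "no extension".
    @transient private lazy val outputFormat =
      jobConf.value.getOutputFormat match {
        case format: HiveOutputFormat[_, _] =>
          format.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]
        case _ => null
      }
    ```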
 
    
    We can also submit this PR to the master branch.
    
    ## How was this patch tested?
    Tested manually.
    (1) create a HBase table with Hive:
    ```
    CREATE TABLE testwq100 (
      row_key string COMMENT 'from deserializer',
      application string COMMENT 'from deserializer',
      starttime timestamp COMMENT 'from deserializer',
      endtime timestamp COMMENT 'from deserializer',
      status string COMMENT 'from deserializer',
      statusid smallint COMMENT 'from deserializer',
      insertdate timestamp COMMENT 'from deserializer',
      count int COMMENT 'from deserializer',
      errordesc string COMMENT 'from deserializer')
    ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      'hbase.columns.mapping'='cf1:application,cf1:starttime,cf1:endtime,cf1:Status,cf1:StatusId,cf1:InsertDate,cf1:count,cf1:ErrorDesc',
      'line.delim'='\\n',
      'mapkey.delim'='\\u0003',
      'serialization.format'='\\u0001')
    TBLPROPERTIES (
      'transient_lastDdlTime'='1489696241',
      'hbase.table.name' = 'xyz',
      'hbase.mapred.output.outputtable' = 'xyz')
    ```
    (2) verify:
    
    **Before:**
    
    Insert data into the HBase table `testwq100` from Spark SQL:
    ```
    scala> sql(s"INSERT INTO testwq100 VALUES 
('AA1M22','AA1M122','2011722','201156','Starte1d6',45,20,1,'ad1')")
    17/05/26 00:09:10 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
    java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:82)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:81)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:101)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:125)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:94)
        at org.apache.spark.sql.hive.SparkHiveWriterContainer.writeToFile(hiveWriterContainers.scala:182)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
        at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    17/05/26 00:09:10 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ClassCastException: org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to org.apache.hadoop.hive.ql.io.HiveOutputFormat
    ```
    
    **After:**
    ```
    scala> sql(s"INSERT INTO testwq100 VALUES 
('AA1M22','AA1M122','2011722','201156','Starte1d6',45,20,1,'ad1')")
    res2: org.apache.spark.sql.DataFrame = []
    
    scala> sql("select * from testwq100").show
    
+-------+-----------+---------+-------+---------+--------+--------------------+-----+---------+
    |row_key|application|starttime|endtime|   status|statusid|          
insertdate|count|errordesc|
    
+-------+-----------+---------+-------+---------+--------+--------------------+-----+---------+
    |   AA1M|       AA1M|     null|   null| Starte1d|      45|                
null|    1|      ad1|
    | AA1M22|    AA1M122|     null|   null|Starte1d6|      45|1970-01-01 
00:00:...|    1|      ad1|
    
+-------+-----------+---------+-------+---------+--------+--------------------+-----+---------+
    ```
    The ClassCastException is gone and the insert succeeds.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/weiqingy/spark SPARK_6628

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18127.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18127
    
----
commit 6a622b071fdf2e86e1849a4473cf9525e2ae3de0
Author: Weiqing Yang <yangweiqing...@gmail.com>
Date:   2017-05-27T04:27:42Z

    [SPARK-6628][SQL][Branch-2.1] Fix ClassCastException when executing sql statement 'insert into' on hbase table

----

