[jira] [Comment Edited] (HIVE-17280) Data loss in CONCATENATE ORC created by Spark

Prasanth Jayachandran (JIRA) Tue, 05 Sep 2017 14:56:42 -0700

    [ https://issues.apache.org/jira/browse/HIVE-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154377#comment-16154377 ]


Prasanth Jayachandran edited comment on HIVE-17280 at 9/5/17 9:55 PM:
----------------------------------------------------------------------

[~mgaido] Posted a patch to HIVE-17403 that will fix the issue (along with 
adding restrictions). Tested this locally and it worked. If concatenation finds 
an incompatible file, it will rename the file to Hive's naming convention to 
avoid the issue that I mentioned above.


was (Author: prasanth_j):
[~mgaido] Posted a patch to HIVE-17280 that will fix the issue (along with 
adding restrictions). Tested this locally and it worked. If concatenation finds 
incompatible file, it will rename to Hive's convention to avoid the issue that 
I mentioned above. 

> Data loss in CONCATENATE ORC created by Spark
> ---------------------------------------------
>
>                 Key: HIVE-17280
>                 URL: https://issues.apache.org/jira/browse/HIVE-17280
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Spark
>    Affects Versions: 1.2.1
>         Environment: Spark 1.6.3
>            Reporter: Marco Gaido
>            Priority: Critical
>
> Hive concatenation causes data loss if the ORC files in the table were 
> written by Spark.
> Here are the steps to reproduce the problem:
>  - create a table;
> {code:java}
> hive
> hive> create table aa (a string, b int) stored as orc;
> {code}
>  - insert 2 rows using Spark;
> {code:java}
> spark-shell
> scala> case class AA(a:String, b:Int)
> scala> val df = sc.parallelize(Array(AA("b",2),AA("c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - change table schema;
> {code:java}
> hive
> hive> alter table aa add columns(aa string, bb int);
> {code}
>  - insert 2 more rows using Spark;
> {code:java}
> spark-shell
> scala> case class BB(a:String, b:Int, aa:String, bb:Int)
> scala> val df = sc.parallelize(Array(BB("b",2,"b",2),BB("c",3,"c",3) )).toDF
> scala> df.write.insertInto("aa")
> {code}
>  - at this point, a select statement run with Hive correctly returns *4 
> rows* from the table; then run the concatenation
> {code:java}
> hive
> hive> alter table aa concatenate;
> {code}
> At this point, a select returns only *3 rows, i.e. a row is missing*.
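
A minimal sketch of the check described in the steps above, assuming the table
aa from the reproduction and a plain select as the "select statement" (the
exact query used is not given in the report):

{code:sql}
hive
hive> select * from aa;
hive> alter table aa concatenate;
hive> select * from aa;
{code}
According to the description, the first select returns 4 rows and the second
returns only 3.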



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
