[ 
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kobefeng updated SPARK-20663:
-----------------------------
    Priority: Major  (was: Minor)

> Data missing after insert overwrite table partition which is created on 
> specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: kobefeng
>              Labels: easyfix
>
> Use spark sql to create partition table first, and alter table by adding 
> partition on specific location, then insert overwrite into this partition by 
> selection, which will cause data missing compared with HIVE.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>          >     id bigint,
>          >     name string,
>          >     dt string
>          > ) using parquet options ('compression'='snappy', 
> 'path'='/user/kofeng/partitioned_table')
>          > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 
> /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition, and the specific location folder gone, 
> data is missing, job is success by attaching _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng hdmi-technology          0 2017-05-08 17:06 
> /user/kofeng/partitioned_table/_SUCCESS
> ----
> ----
> -- Then drop this partition and use hive to add partition and insert 
> overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists 
> partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive also drop the specific location but data is preserved on 
> auto-created partition folder
> hive> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
>       Loading data to table kofeng.partitioned_table partition (dt=20170507)
>       Moved: '/user/kofeng/partitioned_table/dt=20170507/000000_0' to trash 
> at: /user/kofeng/.Trash/Current
>       Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, 
> numRows=1, totalSize=338, rawDataSize=2]
>       MapReduce Jobs Launched:
>       Stage-Stage-1: Map: 2   Cumulative CPU: 10.61 sec   HDFS Read: 9767 
> HDFS Write: 577 SUCCESS
>       Stage-Stage-3: Map: 1   Cumulative CPU: 12.36 sec   HDFS Read: 3635 
> HDFS Write: 338 SUCCESS
> hive> select * from kofeng.partitioned_table;
> OK
> 123   kofeng  20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507
> -rwxr-xr-x   3 kofeng hdmi-technology        338 2017-05-08 17:26 
> /user/kofeng/partitioned_table/dt=20170507/000000_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to