[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kobefeng updated SPARK-20663: ----------------------------- Description: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS ---- ---- -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/20170507/000000_0 {code} was: Use spark sql to create partition table first, and alter table by adding partition on specific location, then insert overwrite into this partition by selection, which will cause data missing compared with HIVE. {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} -- create partition table first $ hadoop fs -mkdir /user/kofeng/partitioned_table $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql spark-sql> create table kofeng.partitioned_table( > id bigint, > name string, > dt string > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table') > partitioned by (dt); -- add partition with specific location spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; $ hadoop fs -ls /user/kofeng/partitioned_table drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507 -- insert overwrite this partition, and the specific location folder gone, data is missing, job is success by attaching _SUCCESS spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name; $ hadoop fs -ls /user/kofeng/partitioned_table -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS ---- ---- -- Then drop this partition and use hive to add partition and insert overwrite this partition data, then verify: spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507'); hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; OK -- could see hive also drop the specific location but data is preserved on auto-created partition folder hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; Loading data to table kofeng.partitioned_table partition (dt=20170507) Moved: '/user/kofeng/partitioned_table/dt=20170507/000000_0' to trash at: /user/kofeng/.Trash/Current Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Cumulative CPU: 10.61 sec HDFS Read: 9767 HDFS Write: 577 SUCCESS Stage-Stage-3: Map: 1 Cumulative CPU: 12.36 sec HDFS Read: 3635 HDFS Write: 338 SUCCESS hive> select * from kofeng.partitioned_table; OK 123 kofeng 20170507 $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507 -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/000000_0 {code} > Data missing after insert overwrite table partition which is created on > specific location > ----------------------------------------------------------------------------------------- > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: kobefeng > Labels: easyfix > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > ---- > ---- > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > spark-sql> alter table kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive also drop the specific location but data is preserved on > auto-created partition folder > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 > -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 > /user/kofeng/partitioned_table/20170507/000000_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org