[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kobefeng updated SPARK-20663:
-----------------------------
    Labels: easyfix  (was: )

> Data missing after insert overwrite table partition which is created on specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: kobefeng
>            Priority: Minor
>              Labels: easyfix
>
> Use Spark SQL to create a partitioned table, add a partition at a specific location with ALTER TABLE, then INSERT OVERWRITE into that partition from a SELECT: the partition's data goes missing, whereas Hive preserves it.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>          >   id bigint,
>          >   name string,
>          >   dt string
>          > ) using parquet options ('compression'='snappy', 'path'='/user/kofeng/partitioned_table')
>          > partitioned by (dt);
>
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng               0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507
>
> -- insert overwrite this partition: the specific location folder is gone and the data is missing, yet the job reports success by writing _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng hdmi-technology      0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS
>
> -- Then drop this partition, use Hive to add the partition and insert overwrite its data, and verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- Hive also drops the specific location, but the data is preserved in the auto-created partition folder
> hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> Loading data to table kofeng.partitioned_table partition (dt=20170507)
> Moved: '/user/kofeng/partitioned_table/dt=20170507/000000_0' to trash at: /user/kofeng/.Trash/Current
> Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2]
> MapReduce Jobs Launched:
> Stage-Stage-1: Map: 2   Cumulative CPU: 10.61 sec   HDFS Read: 9767   HDFS Write: 577   SUCCESS
> Stage-Stage-3: Map: 1   Cumulative CPU: 12.36 sec   HDFS Read: 3635   HDFS Write: 338   SUCCESS
> hive> select * from kofeng.partitioned_table;
> OK
> 123    kofeng    20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507
> -rwxr-xr-x   3 kofeng hdmi-technology    338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/000000_0
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
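[Editorial sketch, not part of the original report.] Until the overwrite path honors a custom partition location, one hedged workaround, assuming the table and paths from the report above, is to let Spark manage the partition directory itself: drop the custom-location partition and let INSERT OVERWRITE create the default `dt=20170507` folder, which matches the layout Hive produces. The `describe formatted ... partition` step to inspect the current location is an assumption about the environment (it requires Spark 2.1+).

{code:title=workaround_sketch.sql|borderStyle=solid}
-- inspect where the partition currently points (hypothetical check, Spark 2.1+)
describe formatted kofeng.partitioned_table partition(dt='20170507');

-- drop the custom-location partition instead of overwriting it in place
alter table kofeng.partitioned_table drop if exists partition(dt='20170507');

-- the overwrite now writes into the auto-created default partition folder,
-- /user/kofeng/partitioned_table/dt=20170507, rather than deleting the custom one
insert overwrite table kofeng.partitioned_table partition(dt='20170507')
select 123 as id, "kofeng" as name;
{code}

This sidesteps the data loss by avoiding custom partition locations altogether; it is not a fix for the underlying bug, which is that Spark removes the user-specified location during overwrite.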