[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

kobefeng (JIRA) Mon, 08 May 2017 17:47:46 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


kobefeng updated SPARK-20663:
-----------------------------
    Docs Text:   (was: 
-- create partition table first
$ hadoop fs -mkdir /user/kofeng/partitioned_table
$ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
spark-sql> create table kofeng.partitioned_table(
         >     id bigint,
         >     name string,
         >     dt string
         > ) using parquet options ('compression'='snappy', 
'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table')
         > partitioned by (dt);

-- add partition with specific location
spark-sql> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';

$ hadoop fs -ls /user/kofeng/partitioned_table
drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 
/user/kofeng/partitioned_table/20170507

-- insert overwrite this partition, and the specific location folder gone, data 
is missing, job is success by attaching _SUCCESS
spark-sql> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name;

$ hadoop fs -ls /user/kofeng/partitioned_table
-rw-r--r--   3 kofeng hdmi-technology          0 2017-05-08 17:06 
/user/kofeng/partitioned_table/_SUCCESS

----
----
-- Then drop this partition and use hive to add partition and insert overwrite 
this partition data, then verify:
spark-sql> alter table kofeng.partitioned_table drop if exists 
partition(dt='20170507');

hive> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
OK

-- could see hive also drop the specific location but data is preserved on 
auto-created partition folder

hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') 
select 123 as id, "kofeng" as name from kofeng.test;
        Loading data to table kofeng.partitioned_table partition (dt=20170507)
        Moved: 
'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/000000_0' to 
trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current
        Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, 
numRows=1, totalSize=338, rawDataSize=2]
        MapReduce Jobs Launched:
        Stage-Stage-1: Map: 2   Cumulative CPU: 10.61 sec   HDFS Read: 9767 
HDFS Write: 577 SUCCESS
        Stage-Stage-3: Map: 1   Cumulative CPU: 12.36 sec   HDFS Read: 3635 
HDFS Write: 338 SUCCESS

hive> select * from kofeng.partitioned_table;
OK
123     kofeng  20170507

$ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507
-rwxr-xr-x   3 kofeng hdmi-technology        338 2017-05-08 17:26 
/user/kofeng/partitioned_table/dt=20170507/000000_0)

> Data missing after insert overwrite table partition which is created on 
> specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: kobefeng
>            Priority: Minor
>
> Use spark sql to create partition table first, and alter table by adding 
> partition on specific location, then insert overwrite into this partition by 
> selection, which will cause data missing compared with HIVE.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

Reply via email to