[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

kobefeng (JIRA) Mon, 08 May 2017 18:18:25 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


kobefeng updated SPARK-20663:
-----------------------------
    Description: 
Use spark sql to create partition table first, and alter table by adding 
partition on specific location, then insert overwrite into this partition by 
selection, which will cause data missing compared with HIVE.

{code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
-- create partition table first
$ hadoop fs -mkdir /user/kofeng/partitioned_table
$ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
spark-sql> create table kofeng.partitioned_table(
         >     id bigint,
         >     name string,
         >     dt string
         > ) using parquet options ('compression'='snappy', 
'path'='/user/kofeng/partitioned_table')
         > partitioned by (dt);

-- add partition with specific location
spark-sql> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';

$ hadoop fs -ls /user/kofeng/partitioned_table
drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 
/user/kofeng/partitioned_table/20170507

-- insert overwrite this partition, and the specific location folder gone, data 
is missing, job is success by attaching _SUCCESS
spark-sql> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name;

$ hadoop fs -ls /user/kofeng/partitioned_table
-rw-r--r--   3 kofeng kofeng         0 2017-05-08 17:06 
/user/kofeng/partitioned_table/_SUCCESS

----
----
-- Then drop this partition and use hive to add partition and insert overwrite 
this partition data, then verify:
spark-sql> alter table kofeng.partitioned_table drop if exists 
partition(dt='20170507');

hive> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
OK

-- could see hive also drop the specific location but data is preserved on 
auto-created partition folder

hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') 
select 123 as id, "kofeng" as name from kofeng.test;
hive> select * from kofeng.partitioned_table;
OK
123     kofeng  20170507

$ hadoop fs -ls /user/kofeng/partitioned_table/20170507
-rwxr-xr-x   3 kofeng kofeng       338 2017-05-08 17:26 
/user/kofeng/partitioned_table/20170507/000000_0
{code}

  was:
Use spark sql to create partition table first, and alter table by adding 
partition on specific location, then insert overwrite into this partition by 
selection, which will cause data missing compared with HIVE.

{code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
-- create partition table first
$ hadoop fs -mkdir /user/kofeng/partitioned_table
$ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
spark-sql> create table kofeng.partitioned_table(
         >     id bigint,
         >     name string,
         >     dt string
         > ) using parquet options ('compression'='snappy', 
'path'='/user/kofeng/partitioned_table')
         > partitioned by (dt);

-- add partition with specific location
spark-sql> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';

$ hadoop fs -ls /user/kofeng/partitioned_table
drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 
/user/kofeng/partitioned_table/20170507

-- insert overwrite this partition, and the specific location folder gone, data 
is missing, job is success by attaching _SUCCESS
spark-sql> insert overwrite table kofeng.partitioned_table 
partition(dt='20170507') select 123 as id, "kofeng" as name;

$ hadoop fs -ls /user/kofeng/partitioned_table
-rw-r--r--   3 kofeng kofeng         0 2017-05-08 17:06 
/user/kofeng/partitioned_table/_SUCCESS

----
----
-- Then drop this partition and use hive to add partition and insert overwrite 
this partition data, then verify:
spark-sql> alter table kofeng.partitioned_table drop if exists 
partition(dt='20170507');

hive> alter table kofeng.partitioned_table add if not exists 
partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
OK

-- could see hive also drop the specific location but data is preserved on 
auto-created partition folder

hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') 
select 123 as id, "kofeng" as name from kofeng.test;
        Loading data to table kofeng.partitioned_table partition (dt=20170507)
        Moved: '/user/kofeng/partitioned_table/dt=20170507/000000_0' to trash 
at: /user/kofeng/.Trash/Current
        Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, 
numRows=1, totalSize=338, rawDataSize=2]
        MapReduce Jobs Launched:
        Stage-Stage-1: Map: 2   Cumulative CPU: 10.61 sec   HDFS Read: 9767 
HDFS Write: 577 SUCCESS
        Stage-Stage-3: Map: 1   Cumulative CPU: 12.36 sec   HDFS Read: 3635 
HDFS Write: 338 SUCCESS

hive> select * from kofeng.partitioned_table;
OK
123     kofeng  20170507

$ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507
-rwxr-xr-x   3 kofeng kofeng       338 2017-05-08 17:26 
/user/kofeng/partitioned_table/dt=20170507/000000_0
{code}


> Data missing after insert overwrite table partition which is created on 
> specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: kobefeng
>              Labels: easyfix
>
> Use spark sql to create partition table first, and alter table by adding 
> partition on specific location, then insert overwrite into this partition by 
> selection, which will cause data missing compared with HIVE.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>          >     id bigint,
>          >     name string,
>          >     dt string
>          > ) using parquet options ('compression'='snappy', 
> 'path'='/user/kofeng/partitioned_table')
>          > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 
> /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition, and the specific location folder gone, 
> data is missing, job is success by attaching _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng kofeng         0 2017-05-08 17:06 
> /user/kofeng/partitioned_table/_SUCCESS
> ----
> ----
> -- Then drop this partition and use hive to add partition and insert 
> overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists 
> partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive also drop the specific location but data is preserved on 
> auto-created partition folder
> hive> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> hive> select * from kofeng.partitioned_table;
> OK
> 123   kofeng  20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/20170507
> -rwxr-xr-x   3 kofeng kofeng       338 2017-05-08 17:26 
> /user/kofeng/partitioned_table/20170507/000000_0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

Reply via email to