[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

Deepanker (JIRA) Fri, 12 Oct 2018 06:48:30 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647916#comment-16647916
 ]


Deepanker commented on SPARK-20236:
-----------------------------------

Hi [~cloud_fan],

1.) Dataset<Row> resultantDataset = <Create from some computation by reading 
from Hive Tables>

2.) 
resultantDataset.write().format("ORC").mode(SaveMode.Overwrite).saveAsTable("*test_table*");

3.) sparkSession.sqlContext().sql("INSERT OVERWRITE TABLE *test_table* SELECT 
<columns>  FROM  another_data_source");

This doesn't work. Like i said earlier for table created from saveAsTable API 
all the partitions get overwritten.

However, if the test_table was created from an already existing hive table like 
"CREATE TABLE test_table like another_test_table" from Beeline CLI and then 
executing line no. 3 doesn't remove rest of the existing partitions in the 
table. This is true for both managed tables and external tables. 

Do let me know if you want additional details here. Will be happy to help!

 

 

 

> Overwrite a partitioned data source table should only overwrite related 
> partitions
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-20236
>                 URL: https://issues.apache.org/jira/browse/SPARK-20236
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, currently Spark will 
> truncate the entire table to write new data, or truncate a bunch of 
> partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, 
> {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions 
> that starts with {{a=1}}.
> This behavior is kind of reasonable as we can know which partitions will be 
> overwritten before runtime. However, hive has a different behavior that it 
> only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 
> 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only 
> one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow hive's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

Reply via email to