[ https://issues.apache.org/jira/browse/SPARK-24116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460472#comment-16460472 ]

Rui Li commented on SPARK-24116:
--------------------------------

[~hyukjin.kwon], sorry for the late response. For example, assume we have two 
non-partitioned tables: one a text table and the other a Parquet table. If we 
INSERT OVERWRITE the text table, the old data goes to HDFS trash. But if we 
INSERT OVERWRITE the Parquet table, the old data does not go to trash.
I believe Spark SQL uses different code paths to load data into different kinds 
of tables, and whether old data goes to trash is inconsistent among these code 
paths. Specifically, {{Hive::loadTable}} moves old data to trash, but the other 
code paths seem to simply delete it. Ideally, Spark SQL would let the user 
specify whether old data goes to trash when overwriting, similar to the feature 
proposed in HIVE-15880.
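A minimal repro sketch of the inconsistency described above (table names are hypothetical; assumes a Spark 2.1.0 deployment with Hive support and HDFS trash enabled via {{fs.trash.interval}}):

```sql
-- Non-partitioned text (Hive SerDe) table: the overwrite goes through
-- Hive::loadTable, so the old files are moved to the HDFS .Trash directory.
CREATE TABLE t_text (id INT) STORED AS TEXTFILE;
INSERT OVERWRITE TABLE t_text SELECT 1;
INSERT OVERWRITE TABLE t_text SELECT 2;  -- old files expected in trash

-- Non-partitioned Parquet table: the overwrite takes a different code path
-- that deletes the old files directly, bypassing trash.
CREATE TABLE t_parquet (id INT) STORED AS PARQUET;
INSERT OVERWRITE TABLE t_parquet SELECT 1;
INSERT OVERWRITE TABLE t_parquet SELECT 2;  -- old files expected to be gone
```

After each second overwrite, inspecting the user's {{.Trash}} directory on HDFS should show the old files for the text table but not for the Parquet table.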

> SparkSQL inserting overwrite table has inconsistent behavior regarding HDFS 
> trash
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-24116
>                 URL: https://issues.apache.org/jira/browse/SPARK-24116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Rui Li
>            Priority: Major
>
> When inserting overwrite a table, the old data may or may not go to trash 
> based on:
>  # File format. E.g. a text table may go to trash while a Parquet table 
> doesn't.
>  # Whether the table is partitioned. E.g. a partitioned text table doesn't go 
> to trash while a non-partitioned one does.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
