[ 
https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Rastogi updated SPARK-32778:
---------------------------------
    Description: 
{code:java}
df.write
  .option("path", "/already/existing/path")
  .mode(SaveMode.Append)
  .format("json")
  .saveAsTable("db.table")
{code}
The code above deleted the data present at the path "/already/existing/path". This 
happened because the table did not yet exist in the Hive metastore, although the 
path already contained data. When the table is not present in the Hive metastore, 
the SaveMode is internally changed to SaveMode.Overwrite irrespective of what the 
user has provided, which leads to data deletion. This behavior was introduced as 
part of https://issues.apache.org/jira/browse/SPARK-19583.
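The mode override described above can be modeled with a small sketch (a hypothetical simplification for illustration, not the actual Spark source):
{code:java}
public class SaveModeOverride {
    public enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    // Simplified model of the reported behavior: when the target table is
    // missing from the metastore, the user's SaveMode is silently replaced
    // with Overwrite before the data is written.
    public static SaveMode effectiveMode(SaveMode userMode, boolean tableExists) {
        return tableExists ? userMode : SaveMode.Overwrite;
    }
}
{code}
Under this model, a user who asked for Append still ends up with Overwrite whenever the metastore has no entry for the table, for example right after migrating to a fresh cluster.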

Now suppose the user is not using an external Hive metastore (the metastore is 
tied to a single cluster), and that cluster goes down or the user has to migrate 
to a new cluster for some other reason. When the user runs the code above on the 
new cluster, it will first delete the data at the path. This could be production 
data, and the user would be completely unaware of the deletion, having specified 
SaveMode.Append or ErrorIfExists. This amounts to accidental data deletion.

 

Repro Steps:

 
 # Save data through a Hive table as in the code above.
 # Create another cluster and save data into a new table on the new cluster, 
giving the same path.

 

Proposed Fix:

Instead of modifying the SaveMode to Overwrite, we should modify it to 
ErrorIfExists in the class CreateDataSourceTableAsSelectCommand.

Change (line 154):

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = false)
{code}
to

 
{code:java}
val result = saveDataIntoTable(
  sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, tableExists = false)
{code}
This should not break CTAS. Even in the CTAS case, the user may not want to 
delete data that already exists at the path, as such a deletion could be accidental.
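The intended effect of the proposed change can be illustrated with a toy model (hypothetical names, not the actual Spark write path): with ErrorIfExists the write fails fast when the path already holds data, instead of silently deleting it as Overwrite does.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class ErrorIfExistsSketch {
    // Stand-in for data already present at filesystem paths.
    public static final Map<String, String> storage = new HashMap<>();

    // Toy model of the proposed fix: refuse to touch a non-empty path
    // when errorIfExists is requested, rather than overwriting it.
    public static void save(String path, String data, boolean errorIfExists) {
        if (errorIfExists && storage.containsKey(path)) {
            throw new IllegalStateException("Path already exists: " + path);
        }
        storage.put(path, data); // Overwrite semantics: old data is lost
    }
}
{code}
In this model, pre-existing data at "/already/existing/path" survives the attempted write, and the user gets an explicit error they can act on.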

 


> Accidental Data Deletion on calling saveAsTable
> -----------------------------------------------
>
>                 Key: SPARK-32778
>                 URL: https://issues.apache.org/jira/browse/SPARK-32778
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Aman Rastogi
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
