[ https://issues.apache.org/jira/browse/SPARK-32778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191468#comment-17191468 ]
Hyukjin Kwon commented on SPARK-32778: -------------------------------------- Spark 2.3 is EOL too .. can you reproduce in Spark 2.4 or 3.0? > Accidental Data Deletion on calling saveAsTable > ----------------------------------------------- > > Key: SPARK-32778 > URL: https://issues.apache.org/jira/browse/SPARK-32778 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Aman Rastogi > Priority: Major > > {code:java} > df.write.option("path", > "/already/existing/path").mode(SaveMode.Append).format("json").saveAsTable(db.table) > {code} > Above code deleted the data present in path "/already/existing/path". This > happened because table was already not there in hive metastore however, path > given had data. And if table is not present in Hive Metastore, SaveMode gets > modified internally to SaveMode.Overwrite irrespective of what user has > provided, which leads to data deletion. This change was introduced as part of > https://issues.apache.org/jira/browse/SPARK-19583. > Now, suppose if user is not using external hive metastore (hive metastore is > associated with a cluster) and if cluster goes down or due to some reason > user has to migrate to a new cluster. Once user tries to save data using > above code in new cluster, it will first delete the data. It could be a > production data and user is completely unaware of it as they have provided > SaveMode.Append or ErrorIfExists. This will be an accidental data deletion. > > Repro Steps: > > # Save data through a hive table as mentioned in above code > # create another cluster and save data in new table in new cluster by giving > same path > > Proposed Fix: > Instead of modifying SaveMode to Overwrite, we should modify it to > ErrorIfExists in class CreateDataSourceTableAsSelectCommand. > Change (line 154) > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.Overwrite, tableExists = > false) > > {code} > to > > {code:java} > val result = saveDataIntoTable( > sparkSession, table, tableLocation, child, SaveMode.ErrorIfExists, > tableExists = false){code} > This should not break CTAS. Even in case of CTAS, user may not want to delete > data if already exists as it could be accidental. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org