[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717917#comment-17717917 ]
Vaibhav Beriwala commented on SPARK-43106:
------------------------------------------

[~dongjoon] Thank you for taking a look at this issue.

1) This is not specific to Spark 3.3.2; the issue exists on the master branch as well.

2) As [~itskals] mentioned above, at Uber we mostly use HDFS as the storage backend, but the same issue would exist for cloud object storage.

3) Running any INSERT OVERWRITE TABLE query against an unpartitioned table reproduces the issue quickly: you will see that Spark first cleans up the table's output path and only then launches the job that computes the new data. A minimal repro sketch is included after the quoted description below.

Some code pointers:

1) See the class InsertIntoHadoopFsRelation, method run.

2) Inside run, we first call deleteMatchingPartitions (which cleans up the table data) and only later call FileFormatWriter.write (which triggers the actual job). A paraphrased sketch of this ordering also follows below.

> Data lost from the table if the INSERT OVERWRITE query fails
> ------------------------------------------------------------
>
>                 Key: SPARK-43106
>                 URL: https://issues.apache.org/jira/browse/SPARK-43106
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Vaibhav Beriwala
>            Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark 3,
> Spark behaves as follows:
> 1) It first cleans up all the data at the table's path.
> 2) It then launches a job that performs the actual insert.
>
> There are two major issues with this approach:
> 1) If the insert job launched in step 2 fails for any reason, the data from
> the original table is lost.
> 2) If the insert job in step 2 takes a long time to complete, the table's
> data is unavailable to other readers for the entire duration of that job.
> The behavior is the same for partitioned tables when static partitioning is
> used; with dynamic partitioning, we do not delete the table data before the
> job launch.
>
> Is there a reason why we perform this delete before the job launch rather
> than as part of the job commit operation? Hive does not have this issue,
> presumably because it cleans up the old data as part of the job commit
> operation. As part of SPARK-19183 we did add a new hook to the commit
> protocol for exactly this purpose, but its default behavior is still to
> delete the table data before the job launch.
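For a quick reproduction, here is a minimal, self-contained sketch. The table name, the local master setting, and the failing "boom" UDF are hypothetical; any task failure during the overwrite job exposes the same data loss:

{code:scala}
import scala.util.Try
import org.apache.spark.sql.SparkSession

object InsertOverwriteDataLossRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-43106-repro")
      .master("local[2]")
      .getOrCreate()

    // An unpartitioned datasource table with some pre-existing rows.
    spark.sql("CREATE TABLE IF NOT EXISTS t (id INT) USING PARQUET")
    spark.sql("INSERT INTO t VALUES (1), (2), (3)")

    // A UDF that always throws, simulating any failure in the insert job.
    spark.udf.register("boom", (i: Int) =>
      if (i >= 0) throw new RuntimeException("simulated task failure") else i)

    // Spark deletes the table path *before* running this job, so when the
    // job fails, the original rows (1, 2, 3) are already gone.
    Try(spark.sql(
      "INSERT OVERWRITE TABLE t SELECT boom(id) FROM VALUES (1), (2) AS v(id)"))

    spark.sql("SELECT * FROM t").show() // empty: the old data was lost
  }
}
{code}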
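And a paraphrased sketch of the ordering in InsertIntoHadoopFsRelation.run described above (not Spark's exact code). The delete is routed through the FileCommitProtocol.deleteWithJob hook added by SPARK-19183; as I read it, the hook's default implementation simply deletes immediately instead of deferring to job commit:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Assumed shape of the default SPARK-19183 hook: an immediate recursive
// delete of the table location before the write job runs. A safer
// protocol could instead record `path` and delete it only inside
// commitJob, once the new data is safely in place.
object DefaultDeleteWithJobSketch {
  def deleteWithJob(fs: FileSystem, path: Path, recursive: Boolean): Boolean =
    fs.delete(path, recursive)

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // Hypothetical table location, for illustration only.
    deleteWithJob(fs, new Path("/tmp/warehouse/t"), recursive = true)
  }
}
{code}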