[ https://issues.apache.org/jira/browse/SPARK-43106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17717917#comment-17717917 ]

Vaibhav Beriwala commented on SPARK-43106:
------------------------------------------

[~dongjoon] Thank you for taking a look at this issue.

1) This is not specific to just Spark-3.3.2. The issue exists in the master 
branch as well.

2) As [~itskals] mentioned above, at Uber we mostly use HDFS as the storage 
backend, but the same issue would exist with cloud object storage as well.

3) Running any *_INSERT OVERWRITE TABLE_* query over an unpartitioned table 
will quickly reproduce this issue. You will notice that Spark first cleans up 
the table output path and only then launches the job that computes the new 
data.
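
For example, a sketch along these lines should reproduce it from a spark-shell 
session (the table name and the failing UDF are made up for illustration; any 
failure inside the insert job shows the same behavior):

{code:scala}
// Register a UDF that always fails during task execution, so the overwrite
// job fails only after the table path has already been cleaned up.
spark.udf.register("boom", (x: Long) => {
  if (x >= 0) throw new RuntimeException("simulated task failure")
  x
})

spark.sql("CREATE TABLE t_unpartitioned (id BIGINT) USING parquet")
spark.sql("INSERT INTO t_unpartitioned VALUES (1), (2), (3)")

// The overwrite fails, but the old rows have already been deleted.
try {
  spark.sql("INSERT OVERWRITE TABLE t_unpartitioned SELECT boom(id) FROM range(10)")
} catch {
  case e: Exception => println(s"insert job failed: ${e.getClass.getSimpleName}")
}

// Expected 3 rows; the table is now empty instead.
spark.sql("SELECT COUNT(*) FROM t_unpartitioned").show()
{code}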

 

Some code pointers on this:

1) Refer to the class _*InsertIntoHadoopFsRelationCommand*_ -> method *_run_*.

2) Inside the _*run*_ method, you will see that we first call 
_*deleteMatchingPartitions*_ (this cleans up the existing table data) and only 
later call _*FileFormatWriter.write*_ (this triggers the actual write job).
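
To make the ordering concrete, here is a stripped-down, self-contained sketch 
of that flow (illustrative only, not the actual Spark source - the method name 
and path handling are simplified):

{code:scala}
import java.nio.file.{Files, Path}
import scala.util.control.NonFatal

// Mirrors the order of operations in InsertIntoHadoopFsRelationCommand.run
// for a static/unpartitioned overwrite: delete first, write second.
def insertOverwrite(tablePath: Path, writeJob: Path => Unit): Unit = {
  // 1) deleteMatchingPartitions analogue: wipe the existing output up front.
  if (Files.exists(tablePath)) {
    Files.list(tablePath).forEach(f => Files.delete(f)) // assumes a flat layout
  } else {
    Files.createDirectories(tablePath)
  }

  // 2) FileFormatWriter.write analogue: only now run the job that computes
  //    the new data. If it throws, the old data is already gone.
  try writeJob(tablePath)
  catch {
    case NonFatal(e) =>
      println(s"write job failed after the old data was deleted: ${e.getMessage}")
      throw e
  }
}
{code}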

> Data lost from the table if the INSERT OVERWRITE query fails
> ------------------------------------------------------------
>
>                 Key: SPARK-43106
>                 URL: https://issues.apache.org/jira/browse/SPARK-43106
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: Vaibhav Beriwala
>            Priority: Major
>
> When we run an INSERT OVERWRITE query for an unpartitioned table on Spark-3, 
> Spark has the following behavior:
> 1) It will first clean up all the data from the actual table path.
> 2) It will then launch a job that performs the actual insert.
>  
> There are 2 major issues with this approach:
> 1) If the insert job launched in step 2 above fails for any reason, the data 
> from the original table is lost. 
> 2) If the insert job in step 2 above takes a long time to complete, the 
> table data is unavailable to other readers for the entire duration of the 
> job.
> This behavior is the same even for partitioned tables when using static 
> partitioning. For dynamic partitioning, we do not delete the table data 
> before the job launch.
>  
> Is there a reason why we perform this delete before the job launch and not 
> as part of the job commit operation? Hive does not have this issue - 
> presumably because it cleans up the old data as part of the job commit 
> operation. As part of SPARK-19183, we did add a new hook in the commit 
> protocol for this exact purpose, but its default behavior still appears to 
> be to delete the table data before the job launch.


