[ 
https://issues.apache.org/jira/browse/SPARK-36121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380289#comment-17380289
 ] 

Hyukjin Kwon commented on SPARK-36121:
--------------------------------------

Did you enable speculation?

> Write data loss caused by stage retry when enable v2 FileOutputCommitter
> ------------------------------------------------------------------------
>
>                 Key: SPARK-36121
>                 URL: https://issues.apache.org/jira/browse/SPARK-36121
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1, 3.0.1
>            Reporter: gaoyajun02
>            Priority: Critical
>
> All of our ETL jobs are configured with 
> mapreduce.fileoutputcommitter.algorithm.version=2. When a shuffle FetchFailed 
> occurs, a stage retry is triggered, and the zombie stage and the retry 
> stage may then write tasks for the same part at the same time, using exactly 
> the same task directory and file name. This can cause loss of part data 
> due to conflicts between delete and rename operations.
> For example, in a data loss case I encountered recently, stage 5.0 
> is a zombie stage caused by a shuffle FetchFailed, and stage 5.1 is the retry 
> stage. Each has a task concurrently writing the same part file: 
> part-00298.
>  # The task of stage 5.1 first creates the part file part-00298 and 
> writes data to it.
>  # While the task of stage 5.1 is committing, the task of stage 5.0 tries 
> to create the same part file. Because the file already exists, it throws an 
> exception and deletes the task's temporary directory.
>  # Then stage 5.0 starts commitTask, which traverses the sub-directories and 
> executes rename. Because the files have already been deleted, it 
> moves nothing and raises no exception, which causes data loss.
>  
> After reading this part of the code, I currently have two ideas:
>  # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
>  # Check the number of files after commitTask, and throw an exception 
> directly if any are found to be missing.
>  
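
The sequence described in the report, and the first proposed idea, can be sketched on a local filesystem. This is only an illustrative model in Python, not Spark's or Hadoop's actual committer code: the directory layout, the function names (task_attempt_path, commit_task, run_scenario) and the "s<N>" path component are all assumptions made up for the example. It shows that when both stage attempts resolve to the same task directory, a cleanup by one attempt makes a later commit move nothing without raising, and that a path including the stage attempt number keeps the two attempts apart.

```python
import os
import shutil
import tempfile


def task_attempt_path(out_dir, task_id, task_attempt, stage_attempt=None):
    """Hypothetical temporary-directory layout for one task attempt.

    When stage_attempt is None, both stage attempts map to the same path,
    which is the situation described in the report.
    """
    name = f"attempt_{task_id:05d}_{task_attempt}"
    if stage_attempt is not None:
        # Idea 1: make the path unique per stage attempt (assumed naming).
        name = f"attempt_s{stage_attempt}_{task_id:05d}_{task_attempt}"
    return os.path.join(out_dir, "_temporary", name)


def commit_task(attempt_dir, out_dir):
    """v2-style commit model: rename each file straight into out_dir.

    Returns the number of files moved. A missing directory is not an
    error here, which is what makes the loss silent.
    """
    if not os.path.isdir(attempt_dir):
        return 0
    moved = 0
    for name in os.listdir(attempt_dir):
        os.replace(os.path.join(attempt_dir, name), os.path.join(out_dir, name))
        moved += 1
    return moved


def run_scenario(use_stage_attempt):
    """Replay the three steps; return True if part-00298 reaches out_dir."""
    out_dir = tempfile.mkdtemp()
    try:
        zombie_dir = task_attempt_path(
            out_dir, 298, 0, 0 if use_stage_attempt else None)
        retry_dir = task_attempt_path(
            out_dir, 298, 0, 1 if use_stage_attempt else None)

        # Step 1: the retry stage's task creates the part file and writes data.
        os.makedirs(retry_dir)
        with open(os.path.join(retry_dir, "part-00298"), "w") as f:
            f.write("rows")

        # Step 2: the zombie stage's task fails and deletes its temporary
        # directory, which may be the very same directory as retry_dir.
        shutil.rmtree(zombie_dir, ignore_errors=True)

        # Step 3: a commit traverses the task directory and renames
        # whatever it still finds there.
        commit_task(retry_dir, out_dir)
        return os.path.exists(os.path.join(out_dir, "part-00298"))
    finally:
        shutil.rmtree(out_dir)
```

In this model, run_scenario(False) loses the part file and run_scenario(True) keeps it, because the zombie attempt's cleanup only touches its own directory.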
>  
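
The second idea, verifying the result of commitTask, might look roughly like the following Python sketch. It assumes the writer can tell the committer which file names it produced; that expected_files argument, the CommitVerificationError class, and the commit_task_checked and demo functions are hypothetical names for illustration and do not correspond to any Spark or Hadoop API.

```python
import os
import shutil
import tempfile


class CommitVerificationError(RuntimeError):
    """Raised when a commit moved fewer files than the task wrote."""


def commit_task_checked(attempt_dir, out_dir, expected_files):
    """Rename files from attempt_dir into out_dir, then verify the result.

    expected_files is the list of file names the task believes it wrote
    (an assumption of this sketch). Returns the sorted names moved.
    """
    moved = []
    if os.path.isdir(attempt_dir):
        for name in sorted(os.listdir(attempt_dir)):
            os.replace(os.path.join(attempt_dir, name),
                       os.path.join(out_dir, name))
            moved.append(name)

    # Fail loudly instead of committing an incomplete result.
    missing = sorted(set(expected_files) - set(moved))
    if missing:
        raise CommitVerificationError(
            f"commit moved {len(moved)} of {len(expected_files)} "
            f"expected files; missing: {missing}")
    return moved


def demo():
    """Delete the task directory before commit and report what happens."""
    out_dir = tempfile.mkdtemp()
    try:
        attempt_dir = os.path.join(out_dir, "_temporary", "attempt_00298_0")
        os.makedirs(attempt_dir)
        with open(os.path.join(attempt_dir, "part-00298"), "w") as f:
            f.write("rows")

        # Simulate the concurrent cleanup performed by the other attempt.
        shutil.rmtree(attempt_dir)

        try:
            commit_task_checked(attempt_dir, out_dir, ["part-00298"])
        except CommitVerificationError as err:
            return str(err)
        return "committed"
    finally:
        shutil.rmtree(out_dir)
```

A check like this does not prevent the conflict; it only turns a silent loss into a task failure that can be retried, so it would complement rather than replace unique task attempt paths.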



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

