[ https://issues.apache.org/jira/browse/SPARK-36121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380289#comment-17380289 ]
Hyukjin Kwon commented on SPARK-36121:
--------------------------------------

Did you enable speculation?

> Write data loss caused by stage retry when enable v2 FileOutputCommitter
> ------------------------------------------------------------------------
>
>                 Key: SPARK-36121
>                 URL: https://issues.apache.org/jira/browse/SPARK-36121
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.1, 3.0.1
>            Reporter: gaoyajun02
>            Priority: Critical
>
> All our ETL jobs are configured with
> mapreduce.fileoutputcommitter.algorithm.version=2. When a shuffle
> FetchFailed occurs, a stage retry is triggered, and the zombie stage and the
> retry stage may then write tasks for the same part at the same time, using
> exactly the same task directory and file name. This can cause the loss of a
> data part due to conflicts between delete and rename operations.
>
> For example, here is a data loss case I encountered recently: stage 5.0 is a
> zombie stage caused by a shuffle FetchFailed, and stage 5.1 is the retry
> stage. Each has a task concurrently writing the same part file, part-00298.
> # The task of stage 5.1 creates the part file part-00298 first and writes
> data to it.
> # While the task of stage 5.1 is committing, the task of stage 5.0 tries to
> create the same part file to write data. Because the file already exists, it
> throws an exception and deletes the task's temporary directory.
> # Then stage 5.0 starts commitTask, which traverses the sub-directories and
> executes rename. Because the file has already been deleted, it ends up
> moving nothing, without raising any exception, which causes the data loss.
>
> I read this part of the code, and I currently have two ideas:
> # Add stageAttemptNumber to taskAttemptPath to avoid conflicts.
> # Check the number of files after commitTask, and throw an exception
> directly if any are found to be missing.
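
The two ideas in the description can be sketched roughly as follows. This is only an illustration in Python on a local filesystem, not Spark's or Hadoop's actual committer code: the function names (task_attempt_dir, commit_task, demo), the directory naming scheme, and the expected_files parameter are all hypothetical and chosen for the example.

```python
import os
import tempfile


def task_attempt_dir(output_dir, partition, task_attempt, stage_attempt=None):
    """Return a hypothetical working directory for one task attempt.

    Without stage_attempt, a zombie stage and its retry map the same
    partition and task attempt to the same directory. Including the stage
    attempt number (idea 1) keeps the two apart.
    """
    if stage_attempt is None:
        name = f"attempt_{partition:05d}_{task_attempt}"
    else:
        name = f"attempt_{stage_attempt}_{partition:05d}_{task_attempt}"
    return os.path.join(output_dir, "_temporary", name)


def commit_task(attempt_dir, output_dir, expected_files):
    """Rename every file in attempt_dir into output_dir, then verify the count.

    Raising when the number of moved files differs from the expected number
    (idea 2) turns a silent empty commit into a visible task failure.
    """
    moved = 0
    if os.path.isdir(attempt_dir):
        for name in sorted(os.listdir(attempt_dir)):
            os.replace(os.path.join(attempt_dir, name),
                       os.path.join(output_dir, name))
            moved += 1
    if moved != expected_files:
        raise RuntimeError(
            f"commit moved {moved} file(s), expected {expected_files}")
    return moved


def demo():
    """Commit one part file, then show that an emptied directory is rejected."""
    with tempfile.TemporaryDirectory() as out:
        work = task_attempt_dir(out, 298, 0, stage_attempt=1)
        os.makedirs(work)
        with open(os.path.join(work, "part-00298"), "w") as f:
            f.write("rows")
        first = commit_task(work, out, expected_files=1)
        try:
            commit_task(work, out, expected_files=1)  # directory is now empty
            second = "committed"
        except RuntimeError:
            second = "rejected"
        return first, second
```

In this sketch the two stage attempts get distinct working directories only when stage_attempt is passed, and a commit that finds nothing to rename fails loudly instead of succeeding with no output. How the expected file count would be obtained in a real committer is left open here.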