cloud-fan commented on a change in pull request #23608:
URL: https://github.com/apache/spark/pull/23608#discussion_r655446703
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########

```diff
@@ -170,7 +170,7 @@ object FileFormatWriter extends Logging {
         description = description,
         sparkStageId = taskContext.stageId(),
         sparkPartitionId = taskContext.partitionId(),
-        sparkAttemptNumber = taskContext.attemptNumber(),
+        sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,
```

Review comment:

After more than two years, I revisited this code path and realized that this is not the best fix. The original motivation is still correct: Spark violates the contract of `TaskAttemptID`, because Spark resets the task attempt number after a stage retry, which makes `TaskAttemptID` non-unique.

The root cause is that a Spark job has stages, while a Hadoop job has tasks directly (no DAG). We map the Spark stage id to the Hadoop job id, which is inaccurate because it does not account for the stage attempt number.

I think a better fix would be to generate the Hadoop job id from both the stage id and the stage attempt number, or to generate the Hadoop task attempt number from both the Spark task attempt number and the stage attempt number. That said, the current fix also works, since this value only determines the name of the intermediate staging directory, which we don't care about as long as it's unique. I'm leaving this comment for future reference.
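To make the two approaches concrete, here is a minimal sketch (not Spark's actual code; the object and method names are made up for illustration). It shows the current fix, which masks the unique `Long` task attempt id down to a non-negative `Int`, and the alternative suggested above, which folds the stage attempt number into the Hadoop job id so stage retries get distinct ids:

```scala
object AttemptIdSketch {
  // Current fix: fold the globally unique Long task attempt id into an Int.
  // ANDing with Integer.MAX_VALUE clears the sign bit, so the result is
  // always >= 0 (a Hadoop attempt number must be non-negative), at the cost
  // of possible collisions once ids exceed 2^31.
  def maskedAttemptNumber(taskAttemptId: Long): Int =
    taskAttemptId.toInt & Integer.MAX_VALUE

  // Alternative fix from the comment: derive the Hadoop job id from both the
  // stage id and the stage attempt number, so a retried stage does not reuse
  // the same job id as its previous attempt.
  def jobIdWithStageAttempt(stageId: Int, stageAttemptNumber: Int): String =
    s"job_${stageId}_$stageAttemptNumber"
}
```

For example, `maskedAttemptNumber` maps an id just past `Int.MaxValue` back into the non-negative range instead of overflowing to a negative attempt number.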