cloud-fan commented on a change in pull request #23608:
URL: https://github.com/apache/spark/pull/23608#discussion_r655446703



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
##########
@@ -170,7 +170,7 @@ object FileFormatWriter extends Logging {
             description = description,
             sparkStageId = taskContext.stageId(),
             sparkPartitionId = taskContext.partitionId(),
-            sparkAttemptNumber = taskContext.attemptNumber(),
+            sparkAttemptNumber = taskContext.taskAttemptId().toInt & Integer.MAX_VALUE,

Review comment:
      After more than two years, I revisited this code path and realized that this is not the best fix.
   
   The original motivation is still correct: Spark violates the contract of `TaskAttemptID`, because Spark resets the task attempt number after a stage retry, which makes `TaskAttemptID` not unique.
   
   The root cause is that a Spark job has stages, while a Hadoop job directly has tasks (no DAG). We map the Spark stage id to the Hadoop job id, which is inaccurate because it doesn't account for the stage attempt number.
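   To make the collision concrete, here is a minimal sketch (a hypothetical helper, not the actual Spark code; it only mirrors the mapping described above) showing that a task from the first stage attempt and its retry after a stage failure map to the same Hadoop `TaskAttemptID`, because only the stage id, partition id and per-stage attempt number are used:

```scala
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

object TaskAttemptIdCollision {
  // Hypothetical helper mirroring the mapping above: Spark stage id -> Hadoop
  // job id, Spark partition id -> Hadoop task id, per-stage task attempt
  // number -> Hadoop attempt number. The stage attempt number is not used.
  def hadoopTaskAttemptId(
      jobTrackerId: String,
      sparkStageId: Int,
      sparkPartitionId: Int,
      sparkAttemptNumber: Int): TaskAttemptID = {
    val jobId = new JobID(jobTrackerId, sparkStageId)
    val taskId = new TaskID(jobId, TaskType.MAP, sparkPartitionId)
    new TaskAttemptID(taskId, sparkAttemptNumber)
  }

  def main(args: Array[String]): Unit = {
    // Stage 3, partition 0, task attempt 0 -- first stage attempt.
    val first = hadoopTaskAttemptId("20210621", 3, 0, 0)
    // After a stage retry the task attempt number is reset to 0, so the
    // retried task gets an identical Hadoop id: the uniqueness contract breaks.
    val retried = hadoopTaskAttemptId("20210621", 3, 0, 0)
    println(first == retried) // prints true
  }
}
```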
   
   I think a better fix is to generate the Hadoop job id from both the stage id and the stage attempt number, or to generate the Hadoop task attempt number from both the Spark task attempt number and the stage attempt number (a rough sketch of both options follows). However, the current fix also works, as this only decides the intermediate staging directory name, which we don't care about as long as it's unique.
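   For illustration only, here is a sketch of those two alternatives (the helper names and the exact encoding are mine, not an agreed-upon design):

```scala
import org.apache.hadoop.mapreduce.JobID

object UniqueHadoopIds {
  // Alternative 1 (illustrative): fold the stage attempt number into the
  // Hadoop job id, so a retried stage writes under a different job id and
  // therefore a different staging directory.
  def jobIdWithStageAttempt(
      jobTrackerId: String,
      stageId: Int,
      stageAttemptNumber: Int): JobID = {
    new JobID(s"$jobTrackerId-stageAttempt$stageAttemptNumber", stageId)
  }

  // Alternative 2 (illustrative): fold the stage attempt number into the
  // Hadoop task attempt number instead, keeping the job id unchanged.
  // Unique as long as a task is retried fewer than maxTaskAttempts times
  // within a single stage attempt.
  def attemptNumberWithStageAttempt(
      stageAttemptNumber: Int,
      taskAttemptNumber: Int,
      maxTaskAttempts: Int = 10000): Int = {
    stageAttemptNumber * maxTaskAttempts + taskAttemptNumber
  }
}
```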
   
   I'm leaving this comment just for future reference.



