Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21577#discussion_r196247811
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
    @@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
        * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
        *                       the maximum possible value of `context.partitionId`).
        */
    -  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
    +  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
         stageStates(stage) = new StageState(maxPartitionId + 1)
    --- End diff --
    
    I don't think the semantics are changing. It's always been racy, in that 
either of the concurrent tasks from different stage attempts may succeed first. 
And I'm almost sure the assumption is that both task attempts are equivalent 
(i.e. the output is deterministic or at least should be), so it should be fine 
for either to be committed.
    
    The problem is that without this change the coordinator would allow both 
attempts to commit, and that is kinda bad.
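    
    To make the intended semantics concrete, here is a minimal, hypothetical Scala sketch (not Spark's actual `OutputCommitCoordinator`; the names `CommitCoordinatorSketch`, `canCommit`, and the simplified `TaskIdentifier` are illustrative): the first attempt that asks to commit a partition is authorized, and any later request for the same partition is denied, even if it comes from a different stage attempt.
    
    ```scala
    import scala.collection.mutable
    
    // Hypothetical sketch, not Spark's real coordinator: it only models the
    // "one committer per partition" rule discussed above.
    object CommitCoordinatorSketch {
    
      // Identifies which attempt currently holds the right to commit a partition.
      private case class TaskIdentifier(stageAttempt: Int, taskAttempt: Int)
    
      // partition id -> the attempt that has been authorized to commit its output
      private val authorizedCommitters = mutable.Map.empty[Int, TaskIdentifier]
    
      // The first attempt to ask for a partition wins; later requests are denied,
      // regardless of which stage attempt they belong to.
      def canCommit(partition: Int, stageAttempt: Int, taskAttempt: Int): Boolean =
        synchronized {
          if (authorizedCommitters.contains(partition)) {
            false
          } else {
            authorizedCommitters(partition) = TaskIdentifier(stageAttempt, taskAttempt)
            true
          }
        }
    }
    
    // Two concurrent attempts from different stage attempts race for partition 0:
    // only the first caller may commit, the second is told not to.
    object CommitCoordinatorSketchDemo extends App {
      println(CommitCoordinatorSketch.canCommit(partition = 0, stageAttempt = 0, taskAttempt = 0)) // true
      println(CommitCoordinatorSketch.canCommit(partition = 0, stageAttempt = 1, taskAttempt = 0)) // false
    }
    ```
    
    Spark's real coordinator of course tracks more state (e.g. per-stage bookkeeping and handling of failed committers); the sketch only illustrates the single-committer-per-partition behavior being discussed.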

