[ https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526834#comment-16526834 ]
Ryan Blue commented on SPARK-24684:
-----------------------------------

Yeah, I just backported this wrong and moved to using unique ids in the canCommit calls. I don't think it affects master or the on-going patch releases. Sorry for the false alarm.

> DAGScheduler reports the wrong attempt number to the commit coordinator
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24684
>                 URL: https://issues.apache.org/jira/browse/SPARK-24684
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.1.3, 2.3.2
>            Reporter: Ryan Blue
>            Priority: Major
>
> SPARK-24552 changes writers to pass the task ID to the output coordinator so
> that the coordinator tracks each task uniquely because attempt numbers can be
> reused across stage attempts. However, the DAGScheduler still passes the
> attempt number when notifying the coordinator that a task has finished. The
> result is that when a task is authorized and then fails due to OOM or a
> similar error, the scheduler is notified but doesn't remove the commit
> authorization because the attempt number doesn't match. This causes infinite
> task retries.
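
For readers following the quoted description, the failure mode can be sketched with a toy model. This is only an illustration under stated assumptions, not Spark's actual OutputCommitCoordinator or DAGScheduler code: the class `ToyCommitCoordinator`, the function `run_retries`, and the task IDs starting at 100 are all hypothetical. It shows why an authorization keyed by one identifier is never cleared when the failure is reported under a different identifier.

```python
class ToyCommitCoordinator:
    """Hypothetical model: at most one authorized committer per (stage, partition)."""

    def __init__(self):
        # (stage, partition) -> identifier of the attempt allowed to commit
        self._authorized = {}

    def can_commit(self, stage, partition, attempt_id):
        # The first attempt to ask becomes the authorized committer.
        holder = self._authorized.setdefault((stage, partition), attempt_id)
        return holder == attempt_id

    def task_completed(self, stage, partition, attempt_id, succeeded):
        # On failure, release the authorization only if the identifier matches.
        key = (stage, partition)
        if not succeeded and self._authorized.get(key) == attempt_id:
            del self._authorized[key]


def run_retries(id_reported_on_failure, max_retries=5):
    """Return the retry number that is finally authorized, or None if none is."""
    coord = ToyCommitCoordinator()
    # The first attempt (task ID 100, attempt number 0) is authorized, then fails.
    assert coord.can_commit(0, 0, 100)
    coord.task_completed(0, 0, id_reported_on_failure(100, 0), succeeded=False)
    for retry in range(1, max_retries + 1):
        task_id = 100 + retry  # writers always ask with the unique task ID
        if coord.can_commit(0, 0, task_id):
            return retry
        # Denied: the retry fails and is reported to the coordinator as well.
        coord.task_completed(0, 0, id_reported_on_failure(task_id, retry), succeeded=False)
    return None


# Scheduler reports the attempt number: the stale authorization is never removed.
mismatched = run_retries(lambda task_id, attempt_number: attempt_number)
# Scheduler reports the same unique ID the writer used: the first retry commits.
matched = run_retries(lambda task_id, attempt_number: task_id)
print(mismatched, matched)  # None 1
```

In the mismatched case every retry is denied up to `max_retries`; without that cap the loop would continue indefinitely, which corresponds to the infinite task retries in the description.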