[ https://issues.apache.org/jira/browse/SPARK-24684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16526834#comment-16526834 ]

Ryan Blue commented on SPARK-24684:
-----------------------------------

Yeah, I just backported this incorrectly and moved to using unique IDs in the 
canCommit calls. I don't think it affects master or the ongoing patch 
releases. Sorry for the false alarm.

> DAGScheduler reports the wrong attempt number to the commit coordinator
> -----------------------------------------------------------------------
>
>                 Key: SPARK-24684
>                 URL: https://issues.apache.org/jira/browse/SPARK-24684
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.1.3, 2.3.2
>            Reporter: Ryan Blue
>            Priority: Major
>
> SPARK-24552 changes writers to pass the task ID to the output coordinator so 
> that the coordinator tracks each task uniquely because attempt numbers can be 
> reused across stage attempts. However, the DAGScheduler still passes the 
> attempt number when notifying the coordinator that a task has finished. The 
> result is that when a task is authorized and then fails due to OOM or a 
> similar error, the scheduler is notified but doesn't remove the commit 
> authorization because the attempt number doesn't match. This causes infinite 
> task retries.
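
The mismatch described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's actual coordinator or scheduler code: the class `CommitCoordinator`, the function `simulate`, and the ID values are all hypothetical names made up for illustration. It assumes the coordinator grants one commit authorization per (stage, partition) and releases it only when the failure report carries the same ID that was used to obtain it.

```python
class CommitCoordinator:
    """Toy stand-in for an output commit coordinator (hypothetical, not Spark's class)."""

    def __init__(self):
        # (stage, partition) -> ID of the attempt currently authorized to commit
        self._authorized = {}

    def can_commit(self, stage, partition, attempt_id):
        # The first caller for a partition is authorized; later callers are
        # authorized only if they present the same ID.
        holder = self._authorized.setdefault((stage, partition), attempt_id)
        return holder == attempt_id

    def task_failed(self, stage, partition, attempt_id):
        # Release the authorization only if the reported ID matches the holder.
        key = (stage, partition)
        if self._authorized.get(key) == attempt_id:
            del self._authorized[key]


def simulate(report_unique_id):
    """Return True if a retry can commit after the authorized attempt fails."""
    coordinator = CommitCoordinator()
    task_id, attempt_number = 1001, 0  # hypothetical values for the first attempt

    # The writer asks for permission using the unique task ID.
    assert coordinator.can_commit(0, 0, task_id)

    # The attempt then fails (for example, due to OOM). The scheduler reports
    # either the unique task ID or the per-stage attempt number.
    reported = task_id if report_unique_id else attempt_number
    coordinator.task_failed(0, 0, reported)

    # The retry runs with a new unique task ID and asks for permission.
    retry_task_id = 1002
    return coordinator.can_commit(0, 0, retry_task_id)


if __name__ == "__main__":
    print("scheduler reports attempt number:", simulate(report_unique_id=False))
    print("scheduler reports unique task ID:", simulate(report_unique_id=True))
```

In this model, when the failure is reported with the attempt number, the stale authorization is never released, so every retry is denied; that corresponds to the endless retries described in the issue. Reporting the same unique ID on both paths lets the retry commit.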


