[ 
https://issues.apache.org/jira/browse/SPARK-26634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26634:
---------------------------------
    Description: 
In our production Spark cluster, we encountered a case in which a task of a 
stage retried after a FetchFailure was denied permission to commit, even 
though it was the first attempt of that task in the retry stage.

After careful investigation, we found that canCommit in 
OutputCommitCoordinator allows a task from the stage attempt that hit the 
FetchFailure (with the same partition number as the new task in the retry 
stage) to commit, which results in TaskCommitDenied for every task with that 
partition in the retry stage. Because TaskCommitDenied does not count towards 
task failures, the application might hang forever.

 
{code:java}
2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: 
Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 
456, partition 138, PROCESS_LOCAL, 5829 bytes)
2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: 
Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on 
zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): 
TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, 
attemptNumber: 1
2019-01-09,08:45:57,373 INFO 
org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied committing, 
stage: 5, partition: 138, attempt number: 0, attempt number(counting failed 
stage): 1
{code}
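
The sequence in the log above can be sketched with a small model. This is only an illustration under stated assumptions, not Spark's actual OutputCommitCoordinator: the class CommitCoordinatorSketch, its nested Coordinator, and the "stage:partition" key are hypothetical names chosen for this example. The one assumption that matters is that the commit lock is keyed by stage and partition alone, so a task from the earlier stage attempt can take the lock before the retry stage's task asks for it.

```java
import java.util.HashMap;
import java.util.Map;

public class CommitCoordinatorSketch {

    // Hypothetical, simplified stand-in for a driver-side commit coordinator.
    static final class Coordinator {
        // Assumption: the lock is keyed by stage and partition only, so the
        // stage attempt (5.0 vs 5.1) does not distinguish lock holders.
        private final Map<String, String> authorized = new HashMap<>();

        // Returns true if this requester holds (or now acquires) the commit lock.
        boolean canCommit(int stage, int stageAttempt, int partition, int taskAttempt) {
            String key = stage + ":" + partition;
            String requester = stageAttempt + "." + taskAttempt;
            // First requester wins; later requesters for the same key are denied.
            String holder = authorized.putIfAbsent(key, requester);
            return holder == null || holder.equals(requester);
        }
    }

    public static void main(String[] args) {
        Coordinator coordinator = new Coordinator();

        // Task for partition 138 from the original stage attempt (5.0) asks first.
        boolean oldStageTask = coordinator.canCommit(5, 0, 138, 0);

        // First attempt of the same partition in the retry stage (5.1) asks next.
        boolean retryStageTask = coordinator.canCommit(5, 1, 138, 0);

        // A re-launched task in the retry stage is denied as well, because the
        // lock is still held by the task from stage attempt 0.
        boolean retryStageTaskAgain = coordinator.canCommit(5, 1, 138, 1);

        System.out.println("stage 5.0, partition 138 allowed: " + oldStageTask);
        System.out.println("stage 5.1, partition 138 allowed: " + retryStageTask);
        System.out.println("stage 5.1, partition 138 (relaunch) allowed: " + retryStageTaskAgain);
    }
}
```

In this model every task of the retry stage for partition 138 is denied for as long as the lock is held by the earlier stage attempt. If each denial is rescheduled without counting towards the task failure limit, nothing bounds the number of relaunches, which matches the hang described above.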

  was:
In our production Spark cluster, we encountered a case in which a task of a 
stage retried after a FetchFailure was denied permission to commit, even 
though it was the first attempt of that task in the retry stage.

After careful investigation, we found that canCommit in 
OutputCommitCoordinator allows a task from the stage attempt that hit the 
FetchFailure (with the same partition number as the new task in the retry 
stage) to commit, which results in TaskCommitDenied for all tasks of the 
retry stage. This is a correctness bug.


> OutputCommitCoordinator may allow task of FetchFailureStage commit again
> ------------------------------------------------------------------------
>
>                 Key: SPARK-26634
>                 URL: https://issues.apache.org/jira/browse/SPARK-26634
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.4.0
>            Reporter: liupengcheng
>            Priority: Major
>
> In our production Spark cluster, we encountered a case in which a task of a 
> stage retried after a FetchFailure was denied permission to commit, even 
> though it was the first attempt of that task in the retry stage.
> After careful investigation, we found that canCommit in 
> OutputCommitCoordinator allows a task from the stage attempt that hit the 
> FetchFailure (with the same partition number as the new task in the retry 
> stage) to commit, which results in TaskCommitDenied for every task with that 
> partition in the retry stage. Because TaskCommitDenied does not count 
> towards task failures, the application might hang forever.
>  
> {code:java}
> 2019-01-09,08:39:53,676 INFO org.apache.spark.scheduler.TaskSetManager: 
> Starting task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, 
> executor 456, partition 138, PROCESS_LOCAL, 5829 bytes)
> 2019-01-09,08:43:37,514 INFO org.apache.spark.scheduler.TaskSetManager: 
> Finished task 138.0 in stage 5.0 (TID 30634) in 466958 ms on 
> zjy-hadoop-prc-st1212.bj (executor 1632) (674/5000)
> 2019-01-09,08:45:57,372 WARN org.apache.spark.scheduler.TaskSetManager: Lost 
> task 138.0 in stage 5.1 (TID 31437, zjy-hadoop-prc-st159.bj, executor 456): 
> TaskCommitDenied (Driver denied task commit) for job: 5, partition: 138, 
> attemptNumber: 1
> 2019-01-09,08:45:57,373 INFO 
> org.apache.spark.scheduler.OutputCommitCoordinator: Task was denied 
> committing, stage: 5, partition: 138, attempt number: 0, attempt 
> number(counting failed stage): 1
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
