[ https://issues.apache.org/jira/browse/SPARK-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307596#comment-16307596 ]

Keith Sun commented on SPARK-22902:
-----------------------------------

I would like to propose a simple solution for this: once a task in a stage is 
marked as successful, Spark should ignore any subsequent failure of another 
attempt of the same task (not only an active kill, but also real failures such 
as OOM, a system crash, etc.).
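
A minimal sketch of that rule, assuming a small tracker that mirrors the 
per-task counters the scheduler keeps (the names below only approximate 
Spark's TaskSetManager internals; this is an illustration, not a patch):

{noformat}
// Sketch only: models the proposed counting rule, not actual Spark code.
class TaskFailureTracker(numTasks: Int, maxTaskFailures: Int) {
  private val successful  = new Array[Boolean](numTasks)
  private val numFailures = new Array[Int](numTasks)

  def markSuccessful(index: Int): Unit = successful(index) = true

  // Returns true if this failure should abort the job.
  def recordFailure(index: Int): Boolean = {
    if (successful(index)) {
      // Another attempt already finished this partition, so a killed
      // speculative attempt (or any late failure) is not counted.
      return false
    }
    numFailures(index) += 1
    numFailures(index) >= maxTaskFailures
  }
}
{noformat}

With spark.task.maxFailures=2, the log below would then play out as: the 
failure of 239.0 counts (1 of 2), the success of 239.1 marks partition 239 as 
done, and the kill of 239.2 is ignored instead of aborting the job.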


> do not count a failure when a speculative task is killed because the same 
> task finished on another executor
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22902
>                 URL: https://issues.apache.org/jira/browse/SPARK-22902
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.1.1
>            Reporter: Keith Sun
>            Priority: Minor
>
> This is a logic issue, so I have not included much environment detail, just 
> the log below.
> Spark conf related to this issue:
> spark.task.maxFailures=2
> spark.speculation=true
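> A minimal sketch of a job configured this way (a hypothetical setup for 
> illustration, not taken from the ticket):
> {noformat}
> import org.apache.spark.sql.SparkSession
>
> // Hypothetical repro setup: only one retry allowed, speculation enabled.
> val spark = SparkSession.builder()
>   .appName("speculation-failure-count")
>   .config("spark.task.maxFailures", "2") // retries allowed = 2 - 1 = 1
>   .config("spark.speculation", "true")   // allow speculative attempts
>   .getOrCreate()
> {noformat}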
> My case is this: task 239 first failed on one executor and was restarted on 
> another executor. Because the retry was running slowly, Spark launched a 
> speculative attempt, as we have speculative execution enabled.
> A short time later, the retried attempt (239.1) finished and Spark killed 
> the speculative attempt (239.2).
> But this caused the whole Spark job to abort, because the failure count for 
> the task reached 2 (the first failure, due to some other issue, plus the 
> killed speculative attempt).
> This is confusing, as task 239 actually finished successfully and the 
> speculative attempt was killed; it did not fail on its own.
> Shall we ignore the speculative-attempt failure caused by an active kill?
> In the Spark configuration doc, I found this explanation of 
> spark.task.maxFailures:
> {noformat}
> Number of failures of any particular task before giving up on the job. The 
> total number of failures spread across different tasks will not cause the job 
> to fail; a particular task has to fail this number of attempts. Should be 
> greater than or equal to 1. Number of allowed retries = this value - 1.
> {noformat}
> My log:
> {noformat}
> 17/12/25 12:25:02 INFO TaskSetManager: Starting task 239.0 in stage 1.0 (TID 
> 10254, host-620-1507-026.lvs02xxxx, executor 208, partition 239, 
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:36:18 INFO TaskSetManager: 
> Lost task 239.0 in stage 1.0 (TID 10254) on host-620-1507-026.lvs02xxxx, 
> executor 208: org.apache.spark.SparkException (Task failed while writing 
> rows) [duplicate 1]
> 17/12/25 12:36:18 INFO TaskSetManager: Starting task 239.1 in stage 1.0 (TID 
> 10601, host-620-1507-038.lvs01xxxx, executor 343, partition 239, 
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:19 INFO TaskSetManager: Marking task 239 in stage 1.0 (on 
> host-620-1507-038.lvs01xxxx) as speculatable because it ran more than 45608 ms
> 17/12/25 12:39:19 INFO TaskSetManager: Starting task 239.2 in stage 1.0 (TID 
> 15142, host-620-1507-030.lvs03xxxx, executor 361, partition 239, 
> PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:22 INFO TaskSetManager: Killing attempt 2 for task 239.2 in 
> stage 1.0 (TID 15142) on host-620-1507-030.lvs03xxxx as the attempt 1 
> succeeded on host-620-1507-038.lvs01xxxx
> 17/12/25 12:39:22 INFO TaskSetManager: Finished task 239.1 in stage 1.0 (TID 
> 10601) in 183663 ms on host-620-1507-038.lvs01xxxx (executor 343) (4606/5000)
> 17/12/25 12:39:28 INFO TaskSetManager: Task 239.2 in stage 1.0 (TID 15142) 
> failed, but another instance of the task has already succeeded, so not 
> re-queuing the task to be re-executed.
> 17/12/25 12:39:28 ERROR TaskSetManager: Task 239 in stage 1.0 failed 2 times; 
> aborting job
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Cancelling stage 1
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Stage 1 was cancelled
> 17/12/25 12:39:28 INFO DAGScheduler: ResultStage 1 (sql at 
> SparkStatement.scala:61) failed in 865.935 s due to Job aborted due to stage 
> failure: Task 239 in stage 1.0 failed 2 times, most recent failure: Lost task 
> 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361): 
> org.apache.spark.SparkException: Task failed while writing rows
> {noformat}


