[ https://issues.apache.org/jira/browse/SPARK-22902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307596#comment-16307596 ]
Keith Sun commented on SPARK-22902:
-----------------------------------

I could propose a simple solution for this: once a task in a stage is marked as successful, Spark could ignore any subsequent failures recorded for the same task (not only the active kill of a speculative attempt, but also real failures such as OOM or a system crash).

> do not count failures if the speculative task is killed as the same task
> finished in other executor
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22902
>                 URL: https://issues.apache.org/jira/browse/SPARK-22902
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.1.1
>            Reporter: Keith Sun
>            Priority: Minor
>
> This is a logic issue, so I have not included much environment detail; my log is in the ticket.
> Spark configuration related to this issue:
> spark.task.maxFailures=2
> spark.speculation=true
> My case is this: task 239 first failed on one executor and was restarted on
> another executor. Because it was running slowly, Spark started a speculative
> attempt, as we have speculative execution enabled.
> A short time later, the second attempt finished, and Spark killed the
> speculative attempt.
> However, this aborted the whole Spark job, because the task's failure count
> reached 2 (the first failure, caused by an unrelated issue, plus the killed
> speculative attempt).
> This is confusing, because task 239 actually finished successfully and the
> speculative attempt was killed rather than failing on its own.
> Shall we ignore the failure recorded for a speculative attempt that was
> actively killed?
> In the Spark configuration docs, I found this explanation of
> spark.task.maxFailures:
> {noformat}
> Number of failures of any particular task before giving up on the job. The
> total number of failures spread across different tasks will not cause the job
> to fail; a particular task has to fail this number of attempts. Should be
> greater than or equal to 1. Number of allowed retries = this value - 1.
> {noformat}
> My log:
> {noformat}
> 17/12/25 12:25:02 INFO TaskSetManager: Starting task 239.0 in stage 1.0 (TID 10254, host-620-1507-026.lvs02xxxx, executor 208, partition 239, PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:36:18 INFO TaskSetManager: Lost task 239.0 in stage 1.0 (TID 10254) on host-620-1507-026.lvs02xxxx, executor 208: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 1]
> 17/12/25 12:36:18 INFO TaskSetManager: Starting task 239.1 in stage 1.0 (TID 10601, host-620-1507-038.lvs01xxxx, executor 343, partition 239, PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:19 INFO TaskSetManager: Marking task 239 in stage 1.0 (on host-620-1507-038.lvs01xxxx) as speculatable because it ran more than 45608 ms
> 17/12/25 12:39:19 INFO TaskSetManager: Starting task 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361, partition 239, PROCESS_LOCAL, 5910 bytes)
> 17/12/25 12:39:22 INFO TaskSetManager: Killing attempt 2 for task 239.2 in stage 1.0 (TID 15142) on host-620-1507-030.lvs03xxxx as the attempt 1 succeeded on host-620-1507-038.lvs01xxxx
> 17/12/25 12:39:22 INFO TaskSetManager: Finished task 239.1 in stage 1.0 (TID 10601) in 183663 ms on host-620-1507-038.lvs01xxxx (executor 343) (4606/5000)
> 17/12/25 12:39:28 INFO TaskSetManager: Task 239.2 in stage 1.0 (TID 15142) failed, but another instance of the task has already succeeded, so not re-queuing the task to be re-executed.
> 17/12/25 12:39:28 ERROR TaskSetManager: Task 239 in stage 1.0 failed 2 times; aborting job
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Cancelling stage 1
> 17/12/25 12:39:28 INFO YarnClusterScheduler: Stage 1 was cancelled
> 17/12/25 12:39:28 INFO DAGScheduler: ResultStage 1 (sql at SparkStatement.scala:61) failed in 865.935 s due to Job aborted due to stage failure: Task 239 in stage 1.0 failed 2 times, most recent failure: Lost task 239.2 in stage 1.0 (TID 15142, host-620-1507-030.lvs03xxxx, executor 361): org.apache.spark.SparkException: Task failed while writing rows
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
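The failure accounting described above, and the fix proposed in the comment, can be sketched as a small simulation. This is a hypothetical Python model, not Spark's actual TaskSetManager code: it only assumes that every non-successful attempt (including a speculative attempt killed after another attempt succeeded) increments the task's failure counter, which is what the log shows.

```python
# Hypothetical model of per-task failure accounting under
# spark.task.maxFailures (NOT Spark's real TaskSetManager code).
MAX_FAILURES = 2  # spark.task.maxFailures=2, as in the report


def job_aborts(attempt_outcomes, ignore_failures_after_success=False):
    """Return True if the task's failure count reaches MAX_FAILURES.

    attempt_outcomes is an ordered list of attempt results:
    'failed', 'succeeded', or 'killed' (a speculative attempt killed
    because another attempt already succeeded).
    """
    failures = 0
    succeeded = False
    for outcome in attempt_outcomes:
        if outcome == "succeeded":
            succeeded = True
        elif ignore_failures_after_success and succeeded:
            # Proposed fix: once the task has succeeded, later
            # failures (including the killed speculative attempt)
            # no longer count toward maxFailures.
            continue
        else:
            # Reported behaviour: the killed speculative attempt
            # counts as a failure just like a real one.
            failures += 1
        if failures >= MAX_FAILURES:
            return True
    return False


# Task 239's history from the log: attempt 0 failed, attempt 1
# succeeded, attempt 2 (speculative) was killed and counted as failed.
history = ["failed", "succeeded", "killed"]

print(job_aborts(history))                                      # True: job aborts
print(job_aborts(history, ignore_failures_after_success=True))  # False: job survives
```

Under the proposed rule the killed speculative attempt is ignored because the task already succeeded, so the failure count stays at 1 and the job is not aborted; a task that genuinely fails twice before succeeding would still abort the job as before.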