Ryan Williams created SPARK-9038: ------------------------------------ Summary: Missing TaskEnd event when task attempt is superseded by another (speculative) attempt Key: SPARK-9038 URL: https://issues.apache.org/jira/browse/SPARK-9038 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.1 Reporter: Ryan Williams
Yesterday I ran a job that produced [this event log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0]. There are 17314 {{TaskStart}}'s and 17313 {{TaskEnd}}'s; task ID 15820 (aka 13.0.526.0) is missing a {{TaskEnd}} event. A speculative second attempt, ID 16295 (13.0.526.1) finished before it; 15820 was the last taskattempt running in stage-attempt 13.0 and job 3, and when it finished the latter two were each marked as succeeded. At the conclusion of stage 13 / job 3, I observed a few things to be in conflicting/inconsistent states: *Reflecting 15820 as having finished successfully:* * The "stage page" for 13.0 [showed "SUCCESS" in the "Status" column of the per-task-attempt table|http://cl.ly/image/2O0O42382p2W?_ga=1.265890767.118106744.1401937910]. * The driver stdout reported 15820's successful finish, and that it was being ignored due to another attempt of the same task (16295, per above) having already succeeded: {code} 15/07/13 23:30:40 INFO scheduler.TaskSetManager: Ignoring task-finished event for 526.0 in stage 13.0 because task 526 has already completed successfully 15/07/13 23:30:40 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose tasks have all completed, from pool 15/07/13 23:30:40 INFO scheduler.DAGScheduler: Job 3 finished: collect at JointHistogram.scala:107, took 579.659523 s {code} *Not reflecting 15820 as having finished at all:* * As I mentioned before, [the event log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0] is missing a {{TaskEnd}} for 15820. * The {{AllJobsPage}} shows 11258 tasks finished in job 3; it would have been 11259 with 15820. ** Additionally, inspecting the page in the DOM revealed a 1-task-wide sliver of light-blue (i.e. running task(s)) in the progress bar. ** [This screenshot|http://cl.ly/image/3O201z0e0G2C?_ga=1.265890767.118106744.1401937910] shows both of these on the {{AllJobsPage}}. * A history server, pointed at the event log, consistently shows 15820 as not having finished. ** This is somewhat unsurprising given that the event log powering the history server doesn't {{TaskEnd}} 15820, but seems notable nonetheless since the live UI seemingly *did* partially record the task as having ended (cf. stage page showing SUCCESS). ** Stage page shows 15820 as RUNNING. ** AllJobsPage shows 11258 tasks succeeded, 1 running. I've gone over the relevant task-success code paths and can't understand how the stage page would show me SUCCESS in the live UI, without anything having been written to the event log or the AllJobsPage's counters having been updated. [Here is a bunch of my driver stdout|https://www.dropbox.com/s/pr7rswt4o2umm20/3010.stdout?dl=0], that shows nothing abnormal afaict; and [the dreaded message about events being dropped|https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40] did not appear anywhere while the app was running, which was one of my only guesses about how this could have happened (but which wouldn't fully explain all of the above anyway). Interested in hearing anyones' thoughts about how I might have arrived at this inconsistent state. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org