Ryan Williams created SPARK-9038:
------------------------------------

             Summary: Missing TaskEnd event when task attempt is superseded by 
another (speculative) attempt
                 Key: SPARK-9038
                 URL: https://issues.apache.org/jira/browse/SPARK-9038
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.4.1
            Reporter: Ryan Williams


Yesterday I ran a job that produced [this event 
log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0].

There are 17314 {{TaskStart}}'s and 17313 {{TaskEnd}}'s; task ID 15820 (aka 
13.0.526.0) is missing a {{TaskEnd}} event.

A speculative second attempt, ID 16295 (13.0.526.1) finished before it; 15820 
was the last taskattempt running in stage-attempt 13.0 and job 3, and when it 
finished the latter two were each marked as succeeded.

At the conclusion of stage 13 / job 3, I observed a few things to be in 
conflicting/inconsistent states:

*Reflecting 15820 as having finished successfully:*
* The "stage page" for 13.0 [showed "SUCCESS" in the "Status" column of the 
per-task-attempt 
table|http://cl.ly/image/2O0O42382p2W?_ga=1.265890767.118106744.1401937910].
* The driver stdout reported 15820's successful finish, and that it was being 
ignored due to another attempt of the same task (16295, per above) having 
already succeeded:
{code}
15/07/13 23:30:40 INFO scheduler.TaskSetManager: Ignoring task-finished event 
for 526.0 in stage 13.0 because task 526 has already completed successfully
15/07/13 23:30:40 INFO cluster.YarnScheduler: Removed TaskSet 13.0, whose tasks 
have all completed, from pool
15/07/13 23:30:40 INFO scheduler.DAGScheduler: Job 3 finished: collect at 
JointHistogram.scala:107, took 579.659523 s
{code}

*Not reflecting 15820 as having finished at all:*
* As I mentioned before, [the event 
log|https://www.dropbox.com/s/y90rz0gxao5w9z9/application_1432740718700_3010?dl=0]
 is missing a {{TaskEnd}} for 15820.
* The {{AllJobsPage}} shows 11258 tasks finished in job 3; it would have been 
11259 with 15820.
** Additionally, inspecting the page in the DOM revealed a 1-task-wide sliver 
of light-blue (i.e. running task(s)) in the progress bar.
** [This 
screenshot|http://cl.ly/image/3O201z0e0G2C?_ga=1.265890767.118106744.1401937910]
 shows both of these on the {{AllJobsPage}}.
* A history server, pointed at the event log, consistently shows 15820 as not 
having finished.
** This is somewhat unsurprising given that the event log powering the history 
server doesn't {{TaskEnd}} 15820, but seems notable nonetheless since the live 
UI seemingly *did* partially record the task as having ended (cf. stage page 
showing SUCCESS).
** Stage page shows 15820 as RUNNING.
** AllJobsPage shows 11258 tasks succeeded, 1 running.

I've gone over the relevant task-success code paths and can't understand how 
the stage page would show me SUCCESS in the live UI, without anything having 
been written to the event log or the AllJobsPage's counters having been 
updated. [Here is a bunch of my driver 
stdout|https://www.dropbox.com/s/pr7rswt4o2umm20/3010.stdout?dl=0], that shows 
nothing abnormal afaict; and [the dreaded message about events being 
dropped|https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40]
 did not appear anywhere while the app was running, which was one of my only 
guesses about how this could have happened (but which wouldn't fully explain 
all of the above anyway).

Interested in hearing anyones' thoughts about how I might have arrived at this 
inconsistent state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to