Dr Stephen A Hellberg created SPARK-10976:
---------------------------------------------

             Summary: java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter
                 Key: SPARK-10976
                 URL: https://issues.apache.org/jira/browse/SPARK-10976
             Project: Spark
          Issue Type: Bug
          Components: Scheduler, Spark Core
    Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0
         Environment: Has arisen on a variety of OSes and platforms. It is highly intermittent but annoying; we've seen it through the 1.4.x and 1.5.x releases.

My environment of current interest happens to be zLinux, which potentially represents a higher degree of concurrency than many others; I'm using an IBM Java 1.8.0, but this problem has been experienced in other environments, with other vendors' Java, e.g. see External URL
            Reporter: Dr Stephen A Hellberg
            Priority: Minor


This issue surfaces in the "misbehaved resultHandler should not crash DAGScheduler and SparkContext" test, part of the DAGSchedulerSuite.  I've been trying to determine the cause of this problem when it arises (infrequent as that is) by surfacing some of the state transitions in the JobWaiter code responsible for throwing the j.l.UnsupportedOperationException.

Of relevance, the UnsupportedOperationException is thrown on the first occasion that taskSucceeded() is called (after object instantiation): the executing thread throws the exception because it finds _jobFinished to be 'true' - yes, before any of the tasks being waited upon have reported their success/failure.  That is, _jobFinished (a volatile variable) is perceived to have been set true during object initialisation... as if its value is/was based on the boolean expression 'totalTasks == 0' (totalTasks is one of the formal arguments of the class constructor).  In fact, the correct initial state for the relevant DAGSchedulerSuite test is totalTasks == 2, and hence _jobFinished should be false.  We are apparently seeing a race condition among the reads and writes performed by the threads involved; is the volatile annotation on _jobFinished the only thing providing any thread safety?
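
For reference, here is a minimal standalone sketch of the JobWaiter state transitions under discussion (paraphrased from my reading of the 1.4.x/1.5.x source; the member names match the real class, but this is not a verbatim copy of it):

// Paraphrased sketch of org.apache.spark.scheduler.JobWaiter's state
// handling; not a verbatim copy of the Spark source.
class JobWaiterSketch[T](totalTasks: Int, resultHandler: (Int, T) => Unit) {
  private var finishedTasks = 0

  // A zero-task job is trivially finished, hence this initialisation;
  // with totalTasks == 2 the expression should evaluate to false.
  @volatile private var _jobFinished: Boolean = totalTasks == 0

  def taskSucceeded(index: Int, result: Any): Unit = synchronized {
    if (_jobFinished) {
      // The throw we are seeing, despite no task having completed yet.
      throw new UnsupportedOperationException(
        "taskSucceeded() called on a finished JobWaiter")
    }
    resultHandler(index, result.asInstanceOf[T])
    finishedTasks += 1
    if (finishedTasks == totalTasks) {
      _jobFinished = true
      notifyAll()
    }
  }

  def jobFailed(exception: Exception): Unit = synchronized {
    // A job failure also marks the waiter finished, so any subsequent
    // taskSucceeded() call would hit the throw above.
    _jobFinished = true
    notifyAll()
  }
}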

The DAGSchedulerSuite test then fails because the ScalaTest asserts that it receives the deliberately thrown exception, DAGSchedulerSuiteDummyException, from the resultHandler function - albeit as a check on the setup of the test?  In our problem scenario it instead _first_ captures the RuntimeException - the UnsupportedOperationException - produced by the (incompletely initialised?) JobWaiter code.

The test suggests that the objective is that the DAGScheduler and SparkContext are 'not crashed'... it proceeds to run a count operation against the SparkContext, which succeeds... that is, neither the DAGScheduler nor the SparkContext is apparently crashed... which should be a positive outcome?
It would be... except for this occasional RuntimeException clouding the issue.
(Is this deliberate... or is this a deficiency of the current testcase?  A sketch of the test's shape follows.)
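
For context, the test in question follows roughly this shape (a paraphrased sketch from memory, assuming the suite's usual local SparkContext 'sc'; the partitioning and assertion details may differ from the exact source):

// Paraphrased sketch of the DAGSchedulerSuite test; not a verbatim quote.
test("misbehaved resultHandler should not crash DAGScheduler and SparkContext") {
  val e = intercept[SparkDriverExecutionException] {
    sc.runJob[Int, Int](
      sc.parallelize(1 to 10, 2),
      (context: TaskContext, iter: Iterator[Int]) => iter.size,
      // The resultHandler deliberately misbehaves:
      (part: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
  }
  // The assertion that fails in our scenario: the captured cause is the
  // UnsupportedOperationException from JobWaiter, not the dummy exception.
  assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])

  // 'Not crashed' check: a subsequent job on the same SparkContext succeeds.
  assert(sc.parallelize(1 to 10, 2).count() === 10)
}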

- misbehaved resultHandler should not crash DAGScheduler and SparkContext *** FAILED ***
  java.lang.UnsupportedOperationException: taskSucceeded() called on a finished JobWaiter was not instance of org.apache.spark.scheduler.DAGSchedulerSuiteDummyException (DAGSchedulerSuite.scala:869)
Failed: failing job... exception: org.apache.spark.scheduler.DAGSchedulerSuiteDummyException
Succeeded: 0 (0 of 2)
Succeeded: 1 (1 of 2)

(My additional diagnostics presented here are minimal: I've surfaced the exception passed to the jobFailed() routine, and the index, finishedTasks, and totalTasks (the ".. of .." values) as the "Succeeded" message from taskSucceeded().)
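
Concretely, the instrumentation amounts to something like the following (a hypothetical sketch of my debug additions to JobWaiter; the actual patch may differ in detail):

// Hypothetical sketch of the diagnostics producing the output above.
override def taskSucceeded(index: Int, result: Any): Unit = synchronized {
  println(s"Succeeded: $index ($finishedTasks of $totalTasks)")
  // ... original taskSucceeded body unchanged ...
}

override def jobFailed(exception: Exception): Unit = synchronized {
  println(s"Failed: failing job... exception: $exception")
  // ... original jobFailed body unchanged ...
}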

I thought I was close - I still might be - to proposing a fix, although the intermittency of the problem is hampering my efforts.  Nevertheless, I wanted to submit my hypothesis for any feedback.


