[ https://issues.apache.org/jira/browse/SPARK-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954757#comment-14954757 ]

Sean Owen commented on SPARK-11066:
-----------------------------------

Yes, the problem is that anyone who submits a JIRA presumably wants to see it 
addressed, and soon. Few JIRAs are actually actionable, valid, and followed 
through on by the submitter. Hence Target Version ought to be set only by 
someone who is willing and able to drive the issue to a resolution. Then the 
view of JIRAs targeted at a release is a somewhat reliable picture of what 
could happen in that release. It's still used unevenly, but that's the reason.

If an issue is likely to be resolved rapidly, like this one, I usually don't 
even bother; but it would be valid to target 1.6 / 1.5.2 after seeing that it's 
probably a fine change that passes tests, etc. (there are still some style 
failures).

> Flaky test o.a.scheduler.DAGSchedulerSuite.misbehavedResultHandler 
> occasionally fails due to j.l.UnsupportedOperationException concerning a 
> finished JobWaiter
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11066
>                 URL: https://issues.apache.org/jira/browse/SPARK-11066
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core, Tests
>    Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>         Environment: Multiple OS and platform types.
> (Also observed by others, e.g. see External URL)
>            Reporter: Dr Stephen A Hellberg
>            Priority: Minor
>
> The DAGSchedulerSuite test for the "misbehaved ResultHandler" has an inherent 
> problem: it creates a job for the DAGScheduler comprising multiple (2) tasks. 
> The job will fail and a SparkDriverExecutionException will be returned, but a 
> race condition determines which exception becomes its cause: either the first 
> task's (deliberately) thrown exception fails the job, setting the cause to 
> the DAGSchedulerSuiteDummyException thrown in the test's setup, or the second 
> (and subsequent) tasks complete equally well but instead have the 
> DAGScheduler's legitimate UnsupportedOperationException (a subclass of 
> RuntimeException) set as the causing exception.  This race is likely 
> associated with the vagaries of processing quanta, and the expense of 
> throwing two exceptions (under interpreted execution) per thread of control; 
> the race is usually 'won' by the first task throwing the 
> DAGSchedulerSuiteDummyException, as desired (and expected)... but not always.
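> For reference, here is a minimal sketch of the test's shape as described 
> above (not the verbatim suite code; it assumes the suite's SparkContext sc 
> and the DAGSchedulerSuiteDummyException class mentioned in this report):
> {code:scala}
> // Sketch: a 2-partition RDD yields a 2-task job.
> val rdd = sc.parallelize(1 to 10, 2)
> val e = intercept[SparkDriverExecutionException] {
>   sc.runJob[Int, Int](
>     rdd,
>     (context: TaskContext, iter: Iterator[Int]) => iter.size,
>     Seq(0, 1),  // both partitions => two tasks
>     // The "misbehaved" ResultHandler: throws on the first completed task.
>     (partition: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> }
> // The race: e.getCause is usually the DAGSchedulerSuiteDummyException (from
> // the first task), but can be the DAGScheduler's UnsupportedOperationException
> // raised for a task that completes after the job has already failed.
> assert(e.getCause.isInstanceOf[DAGSchedulerSuiteDummyException])
> {code}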
> The problem for the testcase is that the first assertion largely concerns 
> the test setup, and doesn't (can't? Sorry, still not a ScalaTest expert) 
> capture all the causes of SparkDriverExecutionException that can legitimately 
> arise from a correctly working (not crashed) DAGScheduler.  Arguably, this 
> assertion might test something of the DAGScheduler... but not all the 
> possible outcomes for a working DAGScheduler.  Nevertheless, this test - when 
> comprising a multiple-task job - will report a failure when in fact the 
> DAGScheduler is working as designed (and not crashed ;-).  Furthermore, the 
> test has already failed before it actually tries to use the SparkContext a 
> second time (for an arbitrary processing task), which I think is the real 
> subject of the test?
> The solution, I submit, is to ensure that the job is composed of just one 
> task; that single task will result in the call to the compromised 
> ResultHandler, causing the test's deliberate exception to be thrown and 
> exercising the relevant (DAGScheduler) code paths.  Given that tasks are 
> scoped by the number of partitions of an RDD, this could be achieved with a 
> single-partition RDD (indeed, doing so seems to exercise/would test some 
> default parallelism support of the TaskScheduler?); the pull request offered, 
> however, is based on the minimal change of using just a single partition of 
> the 2 (or more) partition parallelized RDD.  This will result in scheduling a 
> job of just one task: one successful task calling the user-supplied 
> compromised ResultHandler function, which results in failing the job and 
> unambiguously wrapping our DAGSchedulerSuiteDummyException inside a 
> SparkDriverExecutionException; there are no other tasks that, on running 
> successfully, will find the job failed and cause the 'undesired' 
> UnsupportedOperationException to be thrown instead.  This, then, satisfies 
> the test's setup assertion.
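> In code terms, the minimal change is then (again as a sketch against the 
> shape above, not the exact pull request diff) to pass a single partition id 
> to runJob:
> {code:scala}
> // Sketch of the minimal fix: schedule only partition 0 of the 2-partition
> // RDD, so the job comprises exactly one task and exactly one exception.
> sc.runJob[Int, Int](
>   rdd,  // still the 2-partition RDD
>   (context: TaskContext, iter: Iterator[Int]) => iter.size,
>   Seq(0),  // one partition => one task
>   (partition: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> {code}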
> I have tested this hypothesis by parameterising the number of partitions, N, 
> used by the "misbehaved ResultHandler" job, and have observed the 1 x 
> DAGSchedulerSuiteDummyException first, followed by the legitimate N-1 x 
> UnsupportedOperationExceptions... whichever exception propagates back from 
> the job is simply the result of the race between task threads, hence the 
> intermittent failures observed.
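> That experiment was essentially the parameterised variant of the same sketch 
> (N is a hypothetical parameter here, not a value from the suite):
> {code:scala}
> // Hypothetical N-task variant: one DAGSchedulerSuiteDummyException is thrown
> // first, then N-1 UnsupportedOperationExceptions from the remaining tasks
> // that complete after the job has already been failed.
> val N = 8
> val rdd = sc.parallelize(1 to 100, N)
> sc.runJob[Int, Int](
>   rdd,
>   (context: TaskContext, iter: Iterator[Int]) => iter.size,
>   0 until N,  // all N partitions => N tasks
>   (partition: Int, result: Int) => throw new DAGSchedulerSuiteDummyException)
> {code}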


