[ 
https://issues.apache.org/jira/browse/SPARK-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10431:
------------------------------
    Summary: Flaky test: o.a.s.metrics.InputOutputMetricsSuite - input metrics 
with cache and coalesce  (was: Intermittent test failure in 
InputOutputMetricsSuite)

> Flaky test: o.a.s.metrics.InputOutputMetricsSuite - input metrics with cache 
> and coalesce
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10431
>                 URL: https://issues.apache.org/jira/browse/SPARK-10431
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.5.0
>            Reporter: Pete Robbins
>            Priority: Minor
>              Labels: flaky-test
>
> I sometimes get test failures such as:
> - input metrics with cache and coalesce *** FAILED ***
>   5994472 did not equal 6044472 (InputOutputMetricsSuite.scala:101)
> Tracking this down by adding some debug output, it seems this is a timing 
> issue in the test.
> test("input metrics with cache and coalesce") {
>     // prime the cache manager
>     val rdd = sc.textFile(tmpFilePath, 4).cache()
>     rdd.collect()     // <== #1
>     val bytesRead = runAndReturnBytesRead {      // <== #2
>       rdd.count()
>     }
>     val bytesRead2 = runAndReturnBytesRead {
>       rdd.coalesce(4).count()
>     }
>     // for count and coalesce, the same bytes should be read.
>     assert(bytesRead != 0)
>     assert(bytesRead2 == bytesRead) // fails
>   }
> What is happening is that the runAndReturnBytesRead function (#2) adds a 
> SparkListener to monitor TaskEnd events and total the bytes read by, e.g., 
> the rdd.count().
> In the failing case the listener receives a TaskEnd event from an earlier 
> task (e.g. #1), which skews the total. This happens because the asynchronous 
> thread that processes the event queue and notifies the listeners has not yet 
> processed one of the taskEnd events when the new listener is added, so the 
> new listener also receives that event.
> There is a simple fix to the test to wait for the event queue to be empty 
> before adding the new listener and I will submit a pull request for that.
> I also notice that many of the tests add a listener, and as there is no 
> removeSparkListener API the number of listeners on the context builds up 
> as the suite runs. This is probably why I see this issue when running on 
> slow machines.
> A wider question may be: should a listener receive events that occurred 
> before it was added?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
