[ https://issues.apache.org/jira/browse/SPARK-10431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-10431:
------------------------------
    Summary: Flaky test: o.a.s.metrics.InputOutputMetricsSuite - input metrics with cache and coalesce  (was: Intermittent test failure in InputOutputMetricsSuite)

> Flaky test: o.a.s.metrics.InputOutputMetricsSuite - input metrics with cache
> and coalesce
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10431
>                 URL: https://issues.apache.org/jira/browse/SPARK-10431
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.5.0
>            Reporter: Pete Robbins
>            Priority: Minor
>              Labels: flaky-test
>
> I sometimes get test failures such as:
>
>   - input metrics with cache and coalesce *** FAILED ***
>     5994472 did not equal 6044472 (InputOutputMetricsSuite.scala:101)
>
> Tracking this down by adding some debug output, it seems this is a timing issue in the test:
>
>   test("input metrics with cache and coalesce") {
>     // prime the cache manager
>     val rdd = sc.textFile(tmpFilePath, 4).cache()
>     rdd.collect()  // <== #1
>     val bytesRead = runAndReturnBytesRead {  // <== #2
>       rdd.count()
>     }
>     val bytesRead2 = runAndReturnBytesRead {
>       rdd.coalesce(4).count()
>     }
>     // for count and coalesce, the same bytes should be read
>     assert(bytesRead != 0)
>     assert(bytesRead2 == bytesRead)  // fails
>   }
>
> What is happening is that the runAndReturnBytesRead function (#2) adds a
> SparkListener that monitors TaskEnd events to total the bytes read by, e.g.,
> the rdd.count().
> In the failing case the listener also receives a TaskEnd event from an
> earlier task (e.g. #1), which throws off the total. This happens because the
> asynchronous thread that processes the event queue and notifies the
> listeners has not yet processed one of those earlier TaskEnd events when the
> new listener is added, so the new listener receives that event as well.
> There is a simple fix to the test: wait for the event queue to be empty
> before adding the new listener. I will submit a pull request for that.
> I also notice that many of the tests add a listener, and since there is no
> removeSparkListener API, the number of listeners on the context builds up as
> the suite runs. This is probably why I see this issue when running on slow
> machines.
> A wider question may be: should a listener receive events that occurred
> before it was added?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
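The race described above can be sketched outside Spark with a toy listener bus. Everything below (ToyListenerBus, bytesSeenByLateListener, the gate latch) is illustrative and not a Spark API; it only models the pattern: events are delivered from a single asynchronous dispatcher thread, and a listener registered while an earlier TaskEnd is still queued also receives that stale event. Draining the queue before registering, as the proposed fix does, avoids the over-count.

```scala
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue}
import scala.collection.mutable.ListBuffer

object ListenerRaceDemo {
  // Toy stand-in for a TaskEnd event carrying an input-bytes metric.
  final case class TaskEnd(bytesRead: Long)

  // Minimal model of an asynchronous listener bus: events are queued and
  // delivered to all *current* listeners from a single dispatcher thread.
  // `gate` lets the demo hold delivery back so the race is deterministic.
  final class ToyListenerBus(gate: CountDownLatch) {
    private val queue = new LinkedBlockingQueue[Option[TaskEnd]]()
    private val listeners = ListBuffer.empty[TaskEnd => Unit]
    private val pending = new AtomicLong(0)

    private val dispatcher = new Thread(() => {
      gate.await()
      var next = queue.take()
      while (next.isDefined) {
        listeners.synchronized(listeners.foreach(_.apply(next.get)))
        pending.decrementAndGet()
        next = queue.take()
      }
    })
    dispatcher.start()

    def post(e: TaskEnd): Unit = { pending.incrementAndGet(); queue.put(Some(e)) }
    def addListener(l: TaskEnd => Unit): Unit = listeners.synchronized { listeners += l }
    // Analogue of the proposed fix: block until all queued events are delivered.
    def waitUntilEmpty(): Unit = while (pending.get() != 0) Thread.sleep(1)
    def stop(): Unit = { queue.put(None); dispatcher.join() }
  }

  // Registers a byte-counting listener AFTER an earlier TaskEnd is already
  // queued, mirroring runAndReturnBytesRead (#2) running after rdd.collect() (#1).
  def bytesSeenByLateListener(drainFirst: Boolean): Long = {
    val gate = new CountDownLatch(1)
    val bus = new ToyListenerBus(gate)
    bus.post(TaskEnd(1000)) // TaskEnd from an earlier job, still in the queue
    if (drainFirst) {
      gate.countDown()      // let the dispatcher run ...
      bus.waitUntilEmpty()  // ... and drain the stale event before listening
    }
    val total = new AtomicLong(0)
    bus.addListener(e => total.addAndGet(e.bytesRead)) // the late listener
    if (!drainFirst) gate.countDown()
    bus.post(TaskEnd(500))  // the task the caller actually wants to measure
    bus.waitUntilEmpty()
    bus.stop()
    total.get()
  }

  def main(args: Array[String]): Unit = {
    println(bytesSeenByLateListener(drainFirst = false)) // 1500: stale event counted
    println(bytesSeenByLateListener(drainFirst = true))  // 500: only the new task
  }
}
```

The `drainFirst = true` path is the essence of the fix: with the queue empty at registration time, the new listener can only observe events posted afterwards, so the byte totals for the two counts match.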