[ https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920212#comment-16920212 ]
Wing Yew Poon commented on SPARK-28770:
---------------------------------------

On my branch from which [https://github.com/apache/spark/pull/23767] was merged into master, I modified ReplayListenerSuite following [https://gist.github.com/dwickern/6ba9c5c505d2325d3737ace059302922], and ran "End-to-end replay with compression" 100 times on my MacBook Pro. I encountered no failures.

The instance of failure that Jungtaek cited appears to be due to a failed comparison of two SparkListenerStageExecutorMetrics events (one from the original, the other from the replay). One event came from the driver and the other came from executor "1". SparkListenerStageExecutorMetrics events are logged at stage completion if spark.eventLog.logStageExecutorMetrics.enabled is set to true. The failure could be due to these events being in a different order in the replay than in the original.

In the commit that first introduced these events, there was some code in the testApplicationReplay method of ReplayListenerSuite to filter out these events. (The code filtered the events out of the original, not out of the replay, which I didn't understand.) Maybe we could filter out the SparkListenerStageExecutorMetrics events (from both original and replay) in testApplicationReplay (which is called by "End-to-end replay" and "End-to-end replay with compression"), to avoid this flakiness.

> Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression failed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-28770
>                 URL: https://issues.apache.org/jira/browse/SPARK-28770
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 2.4.3
>         Environment: Community jenkins and our arm testing instance.
>            Reporter: huangtianhua
>            Priority: Major
>
> Test org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with compression failed; see
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/]
>
> The test also fails on our arm instance. I sent an email to spark-dev before, and we suspect it is related to the commit [https://github.com/apache/spark/pull/23767]; we tried reverting it and the tests passed:
> ReplayListenerSuite:
> - ...
> - End-to-end replay *** FAILED ***
>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
> - End-to-end replay with compression *** FAILED ***
>   "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>
> Not sure what's wrong, hope someone can help to figure it out, thanks very much.
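
The filtering idea proposed in the comment (drop the per-stage executor metrics events from both the original and the replayed sequence before comparing them) could look roughly like the sketch below. This is only an illustration under stated assumptions: `Event`, `StageCompleted`, `StageExecutorMetrics`, and `ReplayFilterSketch` are hypothetical stand-ins defined here, not Spark's actual listener classes, and the real testApplicationReplay method compares events differently.

```scala
// Hypothetical stand-ins for Spark listener events. The real classes
// (e.g. SparkListenerStageExecutorMetrics) carry more fields.
sealed trait Event
final case class StageCompleted(stageId: Int) extends Event
final case class StageExecutorMetrics(execId: String, stageId: Int) extends Event

object ReplayFilterSketch {
  // Drop executor metrics events, whose relative order is assumed
  // not to be stable between the original run and the replay.
  def withoutExecutorMetrics(events: Seq[Event]): Seq[Event] =
    events.filterNot(_.isInstanceOf[StageExecutorMetrics])

  // Same events, but the driver/executor metrics arrive in a different order.
  val original: Seq[Event] = Seq(
    StageExecutorMetrics("driver", 0),
    StageExecutorMetrics("1", 0),
    StageCompleted(0))

  val replayed: Seq[Event] = Seq(
    StageExecutorMetrics("1", 0),
    StageExecutorMetrics("driver", 0),
    StageCompleted(0))
}
```

Comparing `original` and `replayed` position by position would pair the "driver" event with the executor "1" event, which resembles the reported `"[driver]" did not equal "[1]"` mismatch. Applying `withoutExecutorMetrics` to both sides before the comparison leaves only the order-stable events.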