[ 
https://issues.apache.org/jira/browse/SPARK-28770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920212#comment-16920212
 ] 

Wing Yew Poon commented on SPARK-28770:
---------------------------------------

On the branch from which [https://github.com/apache/spark/pull/23767] was 
merged into master, I modified ReplayListenerSuite following 
[https://gist.github.com/dwickern/6ba9c5c505d2325d3737ace059302922] and ran 
"End-to-end replay with compression" 100 times on my MacBook Pro. I 
encountered no failures.

The failure that Jungtaek cited appears to be caused by a failed comparison 
of two SparkListenerStageExecutorMetrics events (one from the original, the 
other from the replay). One event came from the driver and the other came from 
executor "1". SparkListenerStageExecutorMetrics events are logged at stage 
completion if spark.eventLog.logStageExecutorMetrics.enabled is set to true. 
The failure could be due to these events being in a different order in the 
replay than in the original.

The commit that first introduced these events had some code in the 
testApplicationReplay method of ReplayListenerSuite to filter them out. (That 
code filtered the events out of the original only, not out of the replay, 
which I didn't understand.) Maybe we could filter out the 
SparkListenerStageExecutorMetrics events from both the original and the replay 
in testApplicationReplay, which is called by "End-to-end replay" and 
"End-to-end replay with compression", to avoid this flakiness.

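To illustrate the suggestion above, here is a minimal sketch of filtering the 
metrics events out of both sequences before comparing them. It does not use 
the real Spark classes: Event, StageCompleted, StageExecutorMetrics, 
withoutExecutorMetrics and sameSequence are hypothetical stand-ins, the sample 
event order is made up, and the position-by-position comparison is only my 
assumption about how the test compares the two logs.

```scala
object ReplayFilterSketch {
  // Hypothetical stand-ins for Spark's listener events; NOT the real
  // org.apache.spark.scheduler classes, just enough structure for the sketch.
  sealed trait Event
  final case class StageCompleted(stageId: Int) extends Event
  final case class StageExecutorMetrics(execId: String, stageId: Int) extends Event

  // Drop the per-executor metrics events, whose relative order may differ
  // between the original run and the replay.
  def withoutExecutorMetrics(events: Seq[Event]): Seq[Event] =
    events.filter {
      case _: StageExecutorMetrics => false
      case _ => true
    }

  // Position-by-position comparison (assumed to mirror what the test does).
  def sameSequence(a: Seq[Event], b: Seq[Event]): Boolean =
    a.length == b.length && a.zip(b).forall { case (x, y) => x == y }

  // Made-up sample data: same events, metrics events in a different order.
  val original: Seq[Event] = Seq(
    StageExecutorMetrics("driver", 0),
    StageExecutorMetrics("1", 0),
    StageCompleted(0))

  val replayed: Seq[Event] = Seq(
    StageExecutorMetrics("1", 0),
    StageExecutorMetrics("driver", 0),
    StageCompleted(0))

  def main(args: Array[String]): Unit = {
    // Unfiltered: the first pair already differs ("driver" vs "1").
    println(sameSequence(original, replayed))
    // Filtered on BOTH sides: only the order-stable events are compared.
    println(sameSequence(withoutExecutorMetrics(original),
      withoutExecutorMetrics(replayed)))
  }
}
```

With this sample data the unfiltered comparison prints false and the filtered 
one prints true. Filtering only one side would leave the two sequences with 
different lengths, so the filter has to be applied to both.
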
> Flaky Tests: Test ReplayListenerSuite.End-to-end replay with compression 
> failed
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-28770
>                 URL: https://issues.apache.org/jira/browse/SPARK-28770
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 2.4.3
>         Environment: Community jenkins and our arm testing instance.
>            Reporter: huangtianhua
>            Priority: Major
>
> Test
> org.apache.spark.scheduler.ReplayListenerSuite.End-to-end replay with 
> compression failed; see 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/267/testReport/junit/org.apache.spark.scheduler/ReplayListenerSuite/End_to_end_replay_with_compression/]
>  
> The test also fails on our arm instance. I sent an email to spark-dev 
> before, and we suspect it is related to the commit 
> [https://github.com/apache/spark/pull/23767]; we tried reverting it and the 
> tests passed:
> ReplayListenerSuite:
>        - ...
>        - End-to-end replay *** FAILED ***
>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622)
>        - End-to-end replay with compression *** FAILED ***
>          "[driver]" did not equal "[1]" (JsonProtocolSuite.scala:622) 
>  
> Not sure what's wrong, hope someone can help to figure it out, thanks very 
> much.



