Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20244#discussion_r167138603
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2399,6 +2424,121 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         }
       }
     
    +  /**
    +   * In this test, we simulate the scene in concurrent jobs using the same
    +   * rdd which is marked to do checkpoint:
    +   * Job one has already finished the spark job, and start the process of doCheckpoint;
    +   * Job two is submitted, and submitMissingTasks is called.
    +   * In submitMissingTasks, if taskSerialization is called before doCheckpoint is done,
    +   * while part calculates from stage.rdd.partitions is called after doCheckpoint is done,
    +   * we may get a ClassCastException when execute the task because of some rdd will do
    +   * Partition cast.
    +   *
    +   * With this test case, just want to indicate that we should do taskSerialization and
    +   * part calculate in submitMissingTasks with the same rdd checkpoint status.
    +   */
    +  test("SPARK-23053: avoid ClassCastException in concurrent execution with checkpoint") {
    --- End diff --
    
    hi @ivoson -- I haven't come up with a better way to test this, so I think for now you should
    
    (1) change the PR to *only* include the changes to the DAGScheduler (and also undo the `protected[spark]` changes elsewhere)
    (2) put this repro on the jira, as it's pretty good for showing what's going on.
    
    if we come up with a way to test it, we can always do that later on.
    
    thanks and sorry for the back and forth

