Github user szhem commented on the issue:

    https://github.com/apache/spark/pull/19373
  
    @felixcheung, 
    Unfortunately, RDDs, which `PeriodicRDDCheckpointer` is based on, do not have `checkpoint(eager: true)` yet.
    That functionality is only available on Datasets.
    
    I've experimented with a similar method for RDDs ...
    
    ```scala
    def checkpoint(eager: Boolean): RDD[T] = {
      checkpoint()
      if (eager) {
        count()
      }
      this
    }
    ```
    
    ... and it does not work for `PeriodicRDDCheckpointer` in some scenarios.
    Consider the following example:
    
    ```scala
    val checkpointInterval = 2
    
    val checkpointer = new PeriodicRDDCheckpointer[(Int, Int)](checkpointInterval, sc)
    val rdd1 = sc.makeRDD((0 until 10).map(i => i -> i))
    
    // rdd1 is not materialized yet, checkpointer(update=1, checkpointInterval=2)
    checkpointer.update(rdd1)
    // rdd2 depends on rdd1
    val rdd2 = rdd1.filter(_ => true)
    
    // rdd1 is materialized, checkpointer(update=2, checkpointInterval=2)
    checkpointer.update(rdd1)
    // rdd3 depends on rdd1
    val rdd3 = rdd1.filter(_ => true)
    
    // rdd3 is not materialized yet, checkpointer(update=3, checkpointInterval=2)
    checkpointer.update(rdd3)
    // rdd3 is materialized, rdd1's files are removed,
    // checkpointer(update=4, checkpointInterval=2)
    checkpointer.update(rdd3)
    
    // fails with FileNotFoundException because
    // rdd1's files were removed on the previous step and
    // rdd2 depends on rdd1
    rdd2.count()
    ```
    It fails with `FileNotFoundException` even with `eager` checkpointing, and passes when parent checkpointed RDDs are preserved, as is done in this PR.
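
    To make the failure mode easier to follow without a cluster, here is a minimal toy model of the sequence above. It is only a sketch and not Spark code: `ToyRdd`, `ToyCheckpointer` and `ToyCheckpointDemo` are hypothetical names, and the assumption is that the checkpointer checkpoints on every `interval`-th update and then deletes the files of all older checkpoints.

    ```scala
    import scala.collection.mutable

    // Hypothetical stand-in for an RDD: a name, an optional parent, and checkpoint state.
    final class ToyRdd(val name: String, val parent: Option[ToyRdd] = None) {
      var checkpointed = false // once true, reads go to checkpoint files, not to the lineage
      var filesExist = false   // whether those checkpoint files are still on disk

      // Models eager checkpointing: the files are written immediately.
      def checkpointEagerly(): Unit = { checkpointed = true; filesExist = true }

      // A checkpointed RDD is readable only while its files exist;
      // otherwise readability depends on the parent (a source RDD is always readable).
      def canBeComputed: Boolean =
        if (checkpointed) filesExist else parent.forall(_.canBeComputed)
    }

    // Hypothetical, simplified checkpointer (assumed behaviour, see the lead-in).
    final class ToyCheckpointer(interval: Int) {
      private val queue = mutable.Queue.empty[ToyRdd]
      private var updates = 0

      def update(rdd: ToyRdd): Unit = {
        updates += 1
        if (updates % interval == 0) {
          rdd.checkpointEagerly()
          queue.enqueue(rdd)
          // Keep only the newest checkpoint; remove the files of older ones.
          while (queue.size > 1) queue.dequeue().filesExist = false
        }
      }
    }

    object ToyCheckpointDemo {
      // Replays the four updates from the example; returns (rdd2 readable, rdd3 readable).
      def run(): (Boolean, Boolean) = {
        val checkpointer = new ToyCheckpointer(2)
        val rdd1 = new ToyRdd("rdd1")
        checkpointer.update(rdd1)                   // update 1: no checkpoint
        val rdd2 = new ToyRdd("rdd2", Some(rdd1))
        checkpointer.update(rdd1)                   // update 2: rdd1 is checkpointed
        val rdd3 = new ToyRdd("rdd3", Some(rdd1))
        checkpointer.update(rdd3)                   // update 3: no checkpoint
        checkpointer.update(rdd3)                   // update 4: rdd3 checkpointed, rdd1's files removed
        (rdd2.canBeComputed, rdd3.canBeComputed)
      }

      def main(args: Array[String]): Unit = {
        val (rdd2Ok, rdd3Ok) = run()
        println(s"rdd2 readable: $rdd2Ok, rdd3 readable: $rdd3Ok")
      }
    }
    ```

    In this model eagerness does not matter: `rdd2` still points at `rdd1`, whose files are gone after the fourth update, so only keeping the parent's checkpoint around avoids the problem.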

