Github user szhem commented on the issue:

    https://github.com/apache/spark/pull/19373

@felixcheung,

Unfortunately, the RDDs that `PeriodicRDDCheckpointer` is based on do not have `checkpoint(eager: true)` yet. That functionality is only available for Datasets.

I've experimented with a similar method for RDDs ...

```scala
def checkpoint(eager: Boolean): RDD[T] = {
  checkpoint()
  if (eager) {
    // Run a job so that the checkpoint files are written right away
    count()
  }
  this
}
```

... and it does not work for `PeriodicRDDCheckpointer` in some scenarios.

Please consider the following example:

```scala
val checkpointInterval = 2
val checkpointer = new PeriodicRDDCheckpointer[(Int, Int)](checkpointInterval, sc)
val rdd1 = sc.makeRDD((0 until 10).map(i => i -> i))

// rdd1 is not materialized yet, checkpointer(update=1, checkpointInterval=2)
checkpointer.update(rdd1)
// rdd2 depends on rdd1
val rdd2 = rdd1.filter(_ => true)

// rdd1 is materialized, checkpointer(update=2, checkpointInterval=2)
checkpointer.update(rdd1)
// rdd3 depends on rdd1
val rdd3 = rdd1.filter(_ => true)

// rdd3 is not materialized yet, checkpointer(update=3, checkpointInterval=2)
checkpointer.update(rdd3)
// rdd3 is materialized, rdd1's files are removed, checkpointer(update=4, checkpointInterval=2)
checkpointer.update(rdd3)

// fails with FileNotFoundException because
// rdd1's files were removed on the previous step and
// rdd2 depends on rdd1
rdd2.count()
```

It fails with `FileNotFoundException` even with `eager` checkpointing, and passes when the parent checkpointed RDDs are preserved, as is done in this PR.
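The same sequence can be reproduced without a cluster using a toy model. This is only a sketch: `Node`, `ToyCheckpointer`, and `canCompute` are hypothetical names, not Spark classes, and the cleanup policy (keep only the newest checkpoint's files) is an assumption inferred from the comments in the example above, not a copy of `PeriodicRDDCheckpointer`.

```scala
import scala.collection.mutable

object CheckpointCleanupSketch {
  // Toy stand-in for an RDD: a name plus an optional parent (its lineage).
  final case class Node(name: String, parent: Option[Node] = None)

  // Hypothetical model of a periodic checkpointer; not Spark's implementation.
  final class ToyCheckpointer(interval: Int) {
    private var updateCount = 0
    private val checkpointQueue = mutable.Queue.empty[Node]
    private val checkpointed = mutable.Set.empty[String] // lineage is truncated at these nodes
    private val filesOnDisk = mutable.Set.empty[String]  // checkpoint files that still exist

    def update(node: Node): Unit = {
      updateCount += 1
      if (updateCount % interval == 0) {
        // Eager checkpoint: assume the files are written immediately.
        checkpointed += node.name
        filesOnDisk += node.name
        checkpointQueue.enqueue(node)
        // Assumed cleanup policy: keep only the newest checkpoint's files.
        while (checkpointQueue.size > 1) {
          filesOnDisk -= checkpointQueue.dequeue().name
        }
      }
    }

    // A node is computable if the nearest checkpointed node in its lineage
    // still has its files; nodes with no checkpointed ancestor always are.
    def canCompute(node: Node): Boolean =
      if (checkpointed.contains(node.name)) filesOnDisk.contains(node.name)
      else node.parent.forall(canCompute)
  }

  // Replays the four updates from the example and reports
  // (rdd2 computable, rdd3 computable).
  def demo(): (Boolean, Boolean) = {
    val checkpointer = new ToyCheckpointer(interval = 2)
    val rdd1 = Node("rdd1")
    checkpointer.update(rdd1)           // update 1: nothing is checkpointed
    val rdd2 = Node("rdd2", Some(rdd1))
    checkpointer.update(rdd1)           // update 2: rdd1 is checkpointed
    val rdd3 = Node("rdd3", Some(rdd1))
    checkpointer.update(rdd3)           // update 3: nothing is checkpointed
    checkpointer.update(rdd3)           // update 4: rdd3 is checkpointed, rdd1's files are dropped
    (checkpointer.canCompute(rdd2), checkpointer.canCompute(rdd3))
  }
}
```

In this model `rdd2` becomes uncomputable after the fourth update while `rdd3` stays computable, which mirrors the `FileNotFoundException` on `rdd2.count()`. Making the checkpoint eager does not change the outcome, because the files are removed by the cleanup step, not lost through lazy materialization.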