GitHub user szhem opened a pull request: https://github.com/apache/spark/pull/19373
[SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs

## What changes were proposed in this pull request?

Fix for the [SPARK-22150](https://issues.apache.org/jira/browse/SPARK-22150) JIRA issue.

When checkpointing RDDs that depend on previously checkpointed RDDs (for example, in iterative algorithms), PeriodicCheckpointer removes the checkpoint files of already materialized RDDs too early, which leads to FileNotFoundExceptions.

Consider the following snippet:

    // create a periodic checkpointer with interval of 2
    val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)

    val rdd1 = createRDD(sc)
    checkpointer.update(rdd1)
    // on the second update rdd1 is checkpointed
    checkpointer.update(rdd1)
    // on action checkpointed rdd is materialized and its lineage is truncated
    rdd1.count()

    // rdd2 depends on rdd1
    val rdd2 = rdd1.filter(_ => true)
    checkpointer.update(rdd2)
    // on the second update rdd2 is checkpointed and checkpoint files of rdd1 are deleted
    checkpointer.update(rdd2)
    // on action it's necessary to read already removed checkpoint files of rdd1
    rdd2.count()

## How was this patch tested?

Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/szhem/spark SPARK-22150-early-checkpoints

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19373

----
commit 0c3338cd645f5824f08fe37fd7174e25c416529b
Author: Sergey Zhemzhitsky <szhemzhit...@gmail.com>
Date:   2017-09-27T21:33:18Z

    [SPARK-22150][CORE] preventing too early removal of checkpoints in case of dependant RDDs

----
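
The failure sequence described in this pull request can be modelled without a Spark cluster. The sketch below is a toy simulation, not Spark code: `ToyRdd`, `ToyCheckpointer`, `runAction` and `CheckpointDemo` are hypothetical names introduced only for illustration, and the cleanup rule is a simplified assumption about how an eager "delete the previous checkpoint" policy behaves. It only considers a direct parent and does not reproduce the actual PeriodicCheckpointer implementation or the fix in this patch.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable
import scala.util.Try

// Toy stand-ins for illustration only; none of these are Spark classes.
final class ToyRdd(val name: String, val parent: Option[ToyRdd] = None) {
  var markedForCheckpoint = false // a checkpoint was requested
  var checkpointed = false        // checkpoint data was written by an action
}

final class ToyCheckpointer(interval: Int) {
  val files = mutable.Set.empty[String]           // simulated checkpoint directory
  private val queue = mutable.Queue.empty[ToyRdd] // checkpoints not yet cleaned up
  private var updates = 0

  def update(rdd: ToyRdd): Unit = {
    updates += 1
    if (updates % interval == 0) {
      rdd.markedForCheckpoint = true
      queue.enqueue(rdd)
      // Eager cleanup (assumed policy): drop the oldest checkpoint as soon as
      // it has been written, without checking whether the newest RDD still
      // has to read it.
      while (queue.size > 1 && queue.head.checkpointed) {
        files -= queue.dequeue().name
      }
    }
  }

  // Simulates an action such as count().
  def runAction(rdd: ToyRdd): Unit = {
    if (!rdd.checkpointed) {
      // A parent with truncated lineage can only be read from its checkpoint files.
      rdd.parent.filter(_.checkpointed).foreach { p =>
        if (!files.contains(p.name))
          throw new FileNotFoundException(s"checkpoint of ${p.name} is missing")
      }
      if (rdd.markedForCheckpoint) {
        files += rdd.name
        rdd.checkpointed = true
      }
    }
  }
}

object CheckpointDemo {
  // Replays the sequence from the snippet; returns true if the final action fails.
  def reproduce(): Boolean = {
    val cp = new ToyCheckpointer(interval = 2)
    val rdd1 = new ToyRdd("rdd1")
    cp.update(rdd1)
    cp.update(rdd1)    // rdd1 is marked for checkpointing
    cp.runAction(rdd1) // rdd1's checkpoint is written

    val rdd2 = new ToyRdd("rdd2", Some(rdd1))
    cp.update(rdd2)
    cp.update(rdd2)    // rdd2 is marked; rdd1's files are removed
    Try(cp.runAction(rdd2)).isFailure
  }
}
```

In this model `CheckpointDemo.reproduce()` returns `true`: the action on `rdd2` fails because the files of `rdd1` were removed before `rdd2` had been materialized, which mirrors the FileNotFoundException reported in the JIRA issue.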