GitHub user szhem opened a pull request:

    https://github.com/apache/spark/pull/19373

    [SPARK-22150][CORE] PeriodicCheckpointer fails in case of dependant RDDs

    ## What changes were proposed in this pull request?
    
    Fix for the [SPARK-22150](https://issues.apache.org/jira/browse/SPARK-22150) JIRA issue.
    
    When checkpointing RDDs that depend on previously checkpointed RDDs (for example, in iterative algorithms), PeriodicCheckpointer removes the checkpoint files of already materialized RDDs too early, which leads to FileNotFoundExceptions.
    
    Consider the following snippet:
    
        // create a periodic checkpointer with interval of 2
        val checkpointer = new PeriodicRDDCheckpointer[Double](2, sc)
        
        val rdd1 = createRDD(sc)
        checkpointer.update(rdd1)
        // on the second update rdd1 is checkpointed
        checkpointer.update(rdd1)
        // on action, the checkpointed RDD is materialized and its lineage is truncated
        rdd1.count() 
        
        // rdd2 depends on rdd1
        val rdd2 = rdd1.filter(_ => true)
        checkpointer.update(rdd2)
        // on the second update rdd2 is checkpointed and the checkpoint files of rdd1 are deleted
        checkpointer.update(rdd2)
        // on action, the already removed checkpoint files of rdd1 have to be read, which fails
        rdd2.count()
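
The same sequence of events can be reproduced without Spark. The following is a minimal sketch of a toy model, not Spark's actual implementation: `Node`, `ToyContext`, `ToyCheckpointer` and the `checkpoint-<name>` file names are hypothetical stand-ins introduced only for illustration. It assumes a checkpointer that deletes the checkpoint of an older dataset as soon as that dataset has been materialized, even if a newer dataset that depends on it has not been materialized yet.

```scala
import java.io.FileNotFoundException
import scala.collection.mutable

object EarlyCheckpointRemovalDemo {
  // Hypothetical stand-in for an RDD: computed from its parent unless it has a checkpoint file.
  final class Node(val name: String, val parent: Option[Node]) {
    var markedForCheckpoint: Boolean = false
    var checkpointFile: Option[String] = None
  }

  // Hypothetical stand-in for the checkpoint directory plus a simple "action".
  final class ToyContext {
    val files: mutable.Set[String] = mutable.Set.empty

    // Reading a checkpointed node needs its file; otherwise the parent chain is read.
    private def read(n: Node): Unit = n.checkpointFile match {
      case Some(f) => if (!files(f)) throw new FileNotFoundException(f)
      case None    => n.parent.foreach(read)
    }

    // An action computes the node and then writes its checkpoint if one was requested.
    def count(n: Node): Unit = {
      read(n)
      if (n.markedForCheckpoint && n.checkpointFile.isEmpty) {
        val f = s"checkpoint-${n.name}"
        files += f
        n.checkpointFile = Some(f)
      }
    }
  }

  // Toy checkpointer: marks every `interval`-th update and cleans up older checkpoints eagerly.
  final class ToyCheckpointer(interval: Int, ctx: ToyContext) {
    private val queue = mutable.Queue.empty[Node]
    private var updates = 0

    def update(n: Node): Unit = {
      updates += 1
      if (updates % interval == 0) {
        n.markedForCheckpoint = true
        queue.enqueue(n)
        // Older checkpoints are removed although the newest node is not materialized yet.
        while (queue.size > 1 && queue.head.checkpointFile.isDefined) {
          ctx.files -= queue.dequeue().checkpointFile.get
        }
      }
    }
  }

  // Returns true if the final action fails because rdd1's checkpoint file is gone.
  def reproduce(): Boolean = {
    val ctx = new ToyContext
    val checkpointer = new ToyCheckpointer(2, ctx)

    val rdd1 = new Node("rdd1", None)
    checkpointer.update(rdd1)
    checkpointer.update(rdd1) // rdd1 is marked for checkpointing
    ctx.count(rdd1)           // rdd1's checkpoint file is written

    val rdd2 = new Node("rdd2", Some(rdd1))
    checkpointer.update(rdd2)
    checkpointer.update(rdd2) // rdd2 is marked, rdd1's checkpoint file is deleted

    try { ctx.count(rdd2); false }
    catch { case _: FileNotFoundException => true }
  }

  def main(args: Array[String]): Unit =
    println(s"failure reproduced: ${reproduce()}")
}
```

In this model the failure comes purely from the ordering: the file of `rdd1` is deleted when `rdd2` is only marked for checkpointing, before any action has written the checkpoint of `rdd2`.
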
    
    ## How was this patch tested?
    
    Unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/szhem/spark SPARK-22150-early-checkpoints

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19373
    
----
commit 0c3338cd645f5824f08fe37fd7174e25c416529b
Author: Sergey Zhemzhitsky <szhemzhit...@gmail.com>
Date:   2017-09-27T21:33:18Z

    [SPARK-22150][CORE] preventing too early removal of checkpoints in case of dependant RDDs

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org