[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

viirya Tue, 15 Jul 2014 05:08:34 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/1418


    [SPARK-2490] Change recursive visiting on RDD dependencies to iterative 
approach

    
    When transformations are applied to an RDD over many iterations, its 
dependency chain can become very long. Recursively visiting these dependencies 
in Spark core can then easily cause a StackOverflowError. For example:
    
        var rdd = sc.makeRDD(Array(1))
        for (i <- 1 to 1000) { 
          rdd = rdd.coalesce(1).cache()
          rdd.collect()
        }
    
    This PR replaces the recursive visiting of an RDD's dependencies with an 
iterative approach to avoid the StackOverflowError. 
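    
    As a rough illustration of the idea (not the actual patch), the sketch 
below walks a dependency graph with an explicit stack kept on the heap instead 
of the JVM call stack. `Node`, `IterativeVisit`, `visitAll` and `chain` are 
hypothetical names used only for this example; they are not Spark APIs.

```scala
// Hypothetical stand-in for an RDD: an id plus the nodes it depends on.
// A plain class is used so that no recursive hashCode/toString is generated.
final class Node(val id: Int, val deps: Seq[Node])

object IterativeVisit {
  // Visit every node reachable from `root` exactly once. Pending nodes are
  // kept in a heap-allocated list, so graph depth is not bounded by the
  // thread's stack size.
  def visitAll(root: Node): Seq[Int] = {
    val visited = scala.collection.mutable.LinkedHashSet.empty[Int]
    var waiting: List[Node] = List(root)
    while (waiting.nonEmpty) {
      val node = waiting.head
      waiting = waiting.tail
      // `add` returns true only the first time an id is seen.
      if (visited.add(node.id)) {
        node.deps.foreach(d => waiting = d :: waiting)
      }
    }
    visited.toSeq
  }

  // Build a linear chain of `length` nodes; node i depends on node i - 1.
  def chain(length: Int): Node =
    (1 until length).foldLeft(new Node(0, Nil))((prev, i) => new Node(i, Seq(prev)))
}
```

    Calling `IterativeVisit.visitAll(IterativeVisit.chain(100000))` walks the 
whole chain in a loop, whereas a naive recursive walk over the same chain 
would need one stack frame per node.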
    
    Besides the recursive visiting, the Java serializer has a known 
[bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that also 
causes StackOverflowError when serializing or deserializing a large graph of 
objects, so this PR solves only part of the problem. Replacing the Java 
serializer with KryoSerializer might help. However, since KryoSerializer is 
not currently supported for `spark.closure.serializer`, I cannot test whether 
KryoSerializer completely avoids the Java serializer's problem. 
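    
    The serializer issue can be explored outside Spark with a small sketch 
like the one below. `Link` and `DeepSerialization` are hypothetical names for 
this example only. Java serialization walks object references recursively, so 
a long enough chain may overflow the stack; the length at which that happens 
depends on the JVM and its stack size, so no threshold is assumed here.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical singly linked node; each node references the next one.
final class Link(val next: Link) extends Serializable

object DeepSerialization {
  // Build a chain of `length` links in a loop, so construction itself
  // cannot overflow the stack.
  def chain(length: Int): Link = {
    var head: Link = null
    var i = 0
    while (i < length) {
      head = new Link(head)
      i += 1
    }
    head
  }

  // Returns true if Java serialization completed, false if it overflowed
  // the stack while following the chain of references.
  def serializes(root: Link): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(root)
      true
    } catch {
      case _: StackOverflowError => false
    }
  }
}
```

    Trying `DeepSerialization.serializes(DeepSerialization.chain(n))` with 
increasing `n` gives a feel for how deep an object graph the Java serializer 
tolerates on a given JVM.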
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 remove_recursive_visit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1418
    
----
commit 900538bbcb61683bf1418534c2466463a630569f
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2014-07-15T10:58:45Z

    change recursive visiting on rdd's dependencies to iterative approach to 
avoid stackoverflowerror.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
