Just to add some more clarity to the discussion: there is a difference
between caching to memory and checkpointing when considered from the
lineage point of view.

When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any
Hadoop-API-compatible fault-tolerant storage) and the lineage of the RDD is
truncated. This is okay because, in case of a worker failure, the RDD data
can be read back from the fault-tolerant storage.
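
For illustration, a minimal sketch (assuming an existing SparkContext sc;
the checkpoint directory path is just an example):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()           // only marks the RDD; nothing is saved yet
    rdd.count()                // first action materializes the RDD and writes it out
    println(rdd.toDebugString) // lineage now starts from the checkpoint data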

When an RDD is cached, the data of the RDD is cached in memory, but the
lineage is not truncated. This is because if the in-memory data is lost,
the lineage is required to recompute the data.
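
Contrast that with caching (again assuming sc): the lineage survives
materialization, which is exactly what allows lost partitions to be
recomputed.

    val cached = sc.parallelize(1 to 1000).map(_ * 2).cache()
    cached.count()                // materializes the partitions in memory
    println(cached.toDebugString) // still shows the full chain back to parallelize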

So to deal with StackOverflowErrors due to long lineage, caching alone is
not going to help. You have to checkpoint the RDD, and as far as I can
tell, the correct way to do this is the following (a sketch follows the
list).
1. Mark the RDD of every Nth iteration for both caching and checkpointing.
2. Before generating the (N+1)th iteration RDD, force the materialization
of this RDD by calling rdd.count(). This persists the RDD in memory, saves
it to HDFS, and truncates the lineage. If you mark every Nth iteration RDD
for checkpointing but only force the materialization after ALL the
iterations (rather than after every Nth iteration, as suggested here), you
will still get StackOverflowErrors.
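
A minimal sketch of that pattern; the transformation, iteration count,
interval N, and paths are placeholders for whatever your job actually does:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("iterative-job"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // fault-tolerant storage

    val N = 50               // checkpoint interval; tune for your workload
    val numIterations = 1000
    var rdd: RDD[Int] = sc.parallelize(1 to 1000000)

    for (i <- 1 to numIterations) {
      rdd = rdd.map(_ + 1)   // stand-in for the real per-iteration transformation
      if (i % N == 0) {
        rdd.cache()          // step 1: keep the Nth RDD in memory
        rdd.checkpoint()     // step 1: also mark it for checkpointing
        rdd.count()          // step 2: force materialization before the next iteration
      }
    }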

Yes, this checkpointing and materialization will definitely decrease
performance, but that is a limitation of the current implementation.

If you are brave enough, you can try the following instead of relying on
checkpointing to HDFS to truncate the lineage.
1. Persist the Nth RDD with replication (see the different StorageLevels);
this replicates the in-memory RDD between workers within Spark. Let's call
this RDD R.
2. Force it to materialize in memory.
3. Create a modified RDD R' that has the same data as R but no lineage.
This is done by creating a new BlockRDD from the IDs of the blocks holding
the in-memory data of R (I can elaborate on that if you want; a rough
sketch follows below).
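
A rough sketch of steps 1-3. Note that BlockRDD and RDDBlockId are Spark
internals, not public API, so this would have to be compiled into the
org.apache.spark namespace and may break between versions; the helper name
is mine, not anything Spark ships:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{BlockRDD, RDD}
    import org.apache.spark.storage.{BlockId, RDDBlockId, StorageLevel}
    import scala.reflect.ClassTag

    def truncateLineage[T: ClassTag](sc: SparkContext, r: RDD[T]): RDD[T] = {
      r.persist(StorageLevel.MEMORY_ONLY_2) // step 1: in-memory, replicated 2x
      r.count()                             // step 2: force materialization
      // Cached blocks of RDD r are registered as RDDBlockId(r.id, partitionIndex)
      val ids: Array[BlockId] =
        (0 until r.partitions.length).map(i => RDDBlockId(r.id, i): BlockId).toArray
      new BlockRDD[T](sc, ids)              // step 3: same data, no lineage
    }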

This avoids writing to HDFS (the replication stays within Spark's memory),
but still truncates the lineage (by creating new BlockRDDs) and avoids the
StackOverflowError.

Hope this helps

TD



On Wed, May 14, 2014 at 3:33 AM, lalit1303 <la...@sigmoidanalytics.com> wrote:

> If we do cache() + count() after, say, every 50 iterations, the whole
> process becomes very slow.
> I have tried checkpoint(), cache() + count(), saveAsObjectFiles().
> Nothing works.
> Materializing RDDs leads to a drastic decrease in performance, and if we
> don't materialize, we face StackOverflowError.
>
>
>
> -----
> Lalit Yadav
> la...@sigmoidanalytics.com
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-when-calling-count-tp5649p5699.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
