Hi,
Thanks, TD, for your reply. I am still not able to resolve the problem for my
use case.
I have, say, 1000 different RDDs. I am applying a transformation function to
each RDD, and I want the outputs of all the RDDs combined into a single
output RDD. To do this, I am doing the following:

<Loop Start>
tempRDD = jaRDD.rdd().repartition(1).mapPartitions(....).toJavaRDD();  // creating a new RDD in every iteration
outRDD = outRDD.union(tempRDD);  // keep unioning the RDDs to get the output into a single RDD

// after every 10th iteration, in order to truncate the lineage
cachRDD = outRDD.cache();
cachRDD.checkpoint();
System.out.println(cachRDD.collect().size());
outRDD = new JavaRDD<String>(cachRDD.rdd(), cachRDD.classTag());
<Loop Ends>

// finally, after the whole computation
outRDD.saveAsTextFile(..)
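
For reference, here is a minimal sketch of the pattern I am using (not my
exact code: transformFn, inputRDDs and the paths are just placeholders, and
it assumes the checkpoint directory has been set on the context, which I
believe is required before checkpoint() does anything):

import java.util.ArrayList;
import org.apache.spark.api.java.JavaRDD;

// sc is the existing JavaSparkContext; the checkpoint path is only an example
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints");

JavaRDD<String> outRDD = sc.parallelize(new ArrayList<String>());  // start with an empty RDD
int i = 0;
for (JavaRDD<String> jaRDD : inputRDDs) {                  // inputRDDs: the ~1000 input RDDs
    JavaRDD<String> tempRDD =
        jaRDD.repartition(1).mapPartitions(transformFn);   // transformFn: placeholder FlatMapFunction
    outRDD = outRDD.union(tempRDD);

    if (++i % 10 == 0) {          // every 10th iteration, try to truncate the lineage
        outRDD.cache();           // keep the unioned data in memory
        outRDD.checkpoint();      // only marks the RDD; it is written out by the next job
        outRDD.count();           // action that actually materializes the checkpoint
    }
}
outRDD.saveAsTextFile("hdfs:///tmp/output");               // example output path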

The above approach is slow overall; it runs successfully when fewer
iterations are performed, i.e. ~100. But when the number of iterations is
increased to ~1000, the whole job takes more than 30 minutes and ultimately
breaks down with an OutOfMemory error. The total size of the data is around
1.4 MB. As of now, I am running the job in Spark standalone mode with 2 cores
and 2.9 GB of memory.

I also observed that when the collect() operation is performed, the number of
tasks keeps increasing as the loop proceeds: about 22 total tasks on the
first collect(), then ~40 total tasks, ... up to ~300 tasks for a single
collect().
Does this mean that all the operations are being performed repeatedly, and
that the RDD lineage is not being broken?
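
For what it is worth, I was going to check this by printing the lineage after
each checkpoint with something like the snippet below (just an illustration,
not output from my job); if the checkpoint really took effect, I would expect
the debug string to start from a CheckpointRDD rather than the full chain of
unions:

// check whether the checkpoint has actually replaced the lineage
System.out.println("isCheckpointed = " + outRDD.rdd().isCheckpointed());
System.out.println(outRDD.toDebugString());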


Could you please elaborate on the point from your last post, i.e. how to
perform: "Create a modified RDD R` which has the same data as RDD R but
does not have the lineage. This is done by creating a new BlockRDD using the
ids of blocks of data representing the in-memory R"?



-----
Lalit Yadav
la...@sigmoidanalytics.com