Hello,

For the needs of my application, I need to periodically "shuffle" the data
across nodes/partitions of a reasonably large dataset. This is an expensive
operation, but I only need to do it every now and then. However, it seems
that I am doing something wrong, because as the iterations go on, the memory
usage keeps increasing, causing the job to spill onto HDFS, which eventually
fills up. I am also getting some "Lost executor" errors that I don't get if
I don't repartition.

Here's a basic piece of code which reproduces the problem:

data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
data.count()
for i in range(1000):
    data = data.repartition(50).persist()
    # several operations are then done on data


What am I doing wrong? I tried the following, but it doesn't solve the issue:

for i in range(1000):
    data2 = data.repartition(50).persist()
    data2.count()    # materialize the new RDD
    data.unpersist() # unpersist the previous version
    data = data2
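
In case it is relevant, here is the kind of variant I am considering trying
next (just a sketch, not something I have verified fixes anything):
checkpointing the new RDD every few iterations to truncate the lineage that
the repeated repartition calls keep growing. The checkpoint directory path
and the every-10-iterations interval are arbitrary choices on my part:

sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # arbitrary path

for i in range(1000):
    data2 = data.repartition(50).persist()
    if i % 10 == 0:
        data2.checkpoint()  # truncate the lineage every few iterations
    data2.count()           # materialize the new RDD (and its checkpoint)
    data.unpersist()        # unpersist the previous version
    data = data2
    # several operations are then done on data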


Help and suggestions on this would be greatly appreciated! Thanks a lot!




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Memory-efficient-successive-calls-to-repartition-tp24358.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
