Hello,

For the needs of my application, I need to periodically "shuffle" the data across nodes/partitions of a reasonably large dataset. This is an expensive operation, but I only need to do it every now and then. However, it seems that I am doing something wrong, because as the iterations progress the memory usage increases, causing the job to spill onto HDFS, which eventually fills up. I am also getting some "Lost executor" errors that I don't get if I don't repartition.
Here's a basic piece of code which reproduces the problem:

    data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
    data.count()

    for i in range(1000):
        data = data.repartition(50).persist()
        # below, several operations are done on data

What am I doing wrong? I tried the following, but it doesn't solve the issue:

    for i in range(1000):
        data2 = data.repartition(50).persist()
        data2.count()      # materialize the new RDD
        data.unpersist()   # unpersist the previous version
        data = data2

Help and suggestions on this would be greatly appreciated! Thanks a lot!
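P.S. In case it helps to reproduce this, below is a self-contained sketch of the second variant. The SparkContext setup and the body of parseLine are simplified stand-ins for what my real application does (the actual parser reads one GIST feature vector per line), so treat them as placeholders rather than the exact code I run:

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-memory-test")

    def parseLine(line):
        # placeholder parser: one whitespace-separated feature vector per line
        return [float(x) for x in line.split()]

    data = sc.textFile("ImageNet_gist_train.txt", 50).map(parseLine).cache()
    data.count()

    for i in range(1000):
        data2 = data.repartition(50).persist()
        data2.count()      # force materialization of the new RDD
        data.unpersist()   # drop the cached copy from the previous iteration
        data = data2
        # ... several other operations on data happen here ...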