lost executor due to large shuffle spill memory

2016-04-05 Thread lllll
I have a task that remaps the indices in ALS prediction results to the actual UUIDs, but it consistently fails due to lost executors. I noticed there's a large shuffle spill (memory), but I don't know how to reduce it. I've tried to

zip two RDD in pyspark

2014-07-28 Thread lllll
I have a file in S3 and I want to pair each line with an index. Here is my code:

    input_data = sc.textFile('s3n:/myinput', minPartitions=6).cache()
    N = input_data.count()
    index = sc.parallelize(range(N), 6)
    index.zip(input_data).collect()

... 14/07/28 19:49:31 INFO DAGScheduler: Completed
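
A note on the zip approach above: RDD.zip expects both RDDs to have the same number of partitions and the same number of elements in each partition. A text file is split by input size while parallelize(range(N), 6) is split evenly by count, so the two may not line up. Whether that is what went wrong here cannot be told from the truncated snippet. Newer PySpark releases provide input_data.zipWithIndex(), which yields (line, index) pairs without needing a second RDD. The following is a minimal pure-Python sketch of the idea behind it, not Spark's actual implementation; zip_with_index and the sample partitions are hypothetical names made up for illustration.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Pair every element with a global index (illustrative stand-in for
    the idea behind RDD.zipWithIndex(); this is not Spark code)."""
    # First pass: count the elements in each partition.
    sizes = [len(part) for part in partitions]
    # A partition's starting offset is the total size of all earlier partitions.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Second pass: each partition can now be indexed independently.
    return [
        [(item, start + i) for i, item in enumerate(part)]
        for part, start in zip(partitions, offsets)
    ]


# Hypothetical, unevenly sized partitions, as a size-based file split might produce.
lines_by_partition = [["a", "b", "c"], ["d"], ["e", "f"]]
indexed = zip_with_index(lines_by_partition)
```

Because the indices are derived from the data's own partition sizes, no second RDD with a matching layout is required. Note that the pairs come out as (line, index), the reverse of index.zip(input_data).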