I have had a lot of success with Spark on large datasets,
both in terms of performance and flexibility.
However, I hit a wall with reduceByKey when the RDD contains billions of
items.
I am reducing with simple functions like addition for building histograms,
so the reduction should run in constant memory.
I am using tens of AWS EC2 machines, each with 60 GB of memory and 30 cores.
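
For reference, here is a minimal sketch of the kind of histogram job I mean;
the input path and the binOf() function are just placeholders, not my actual code:

    // Minimal sketch of the histogram-style reduction described above.
    // The S3 paths and binOf() are hypothetical placeholders.
    import org.apache.spark.{SparkConf, SparkContext}

    object HistogramSketch {
      // Placeholder binning function; the real one depends on the data.
      def binOf(line: String): Int = (line.hashCode & Int.MaxValue) % 1000

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("histogram-sketch"))

        val records = sc.textFile("s3://my-bucket/input/*")  // placeholder input

        // Emit a (bin, 1L) pair per record, then sum the counts per bin.
        // The reduce function is plain addition, so each key only ever needs
        // a single running Long on the reduce side.
        val histogram = records
          .map(line => (binOf(line), 1L))
          .reduceByKey(_ + _)

        histogram.saveAsTextFile("s3://my-bucket/output/histogram")  // placeholder output
        sc.stop()
      }
    }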

After a while the whole process just hangs.
I have not been able to isolate the root problem from the logs,
but I suspect that the problem is in the shuffling.
Simple mapping and filtering transformations work fine,
and the reduceByKey goes through if I first cut the data down to
around 10^8 items.
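
For concreteness, the scaled-down case that does go through looks roughly like
the snippet below, building on the sketch above. The sample fraction and the
partition count are illustrative only; I am just using the optional
numPartitions argument that reduceByKey accepts:

    // Illustrative only: a sampled input and an explicit partition count
    // for the shuffle side of reduceByKey.
    val sampledPairs = records
      .sample(withReplacement = false, fraction = 0.01, seed = 42L)
      .map(line => (binOf(line), 1L))

    val histogram = sampledPairs.reduceByKey(_ + _, 2000)  // 2000 shuffle partitions, illustrative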

What do I need to do to make reduceByKey work for more than 10^9 items?

thanks
Daniel
