I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However, I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions such as addition to build histograms, so the reduction itself should use constant memory per key. I am running on tens of AWS EC2 machines, each with 60 GB of memory and 30 processors.
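To make the shape of the job concrete, here is a minimal sketch of the kind of reduction I am running (the input path, the key extraction, and the SparkContext setup are placeholders, not my actual code):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "histogram-sketch")

// RDD of (bin, count) pairs; the real job has billions of items.
// "s3://some-bucket/events" and the tab-separated key are hypothetical.
val pairs = sc.textFile("s3://some-bucket/events")
  .map(line => (line.split("\t")(0), 1L))   // key = histogram bin

// The reduce function is plain addition, so memory per key should stay constant.
val histogram = pairs.reduceByKey(_ + _)

histogram.take(10).foreach(println)
```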
After a while the whole process just hangs. I have not been able to isolate the root cause from the logs, but I suspect the problem is in the shuffle. Simple mapping and filtering transformations work fine, and the reduction also succeeds if I first cut the data down to around 10^8 items. What do I need to do to make reduceByKey work for more than 10^9 items? Thanks, Daniel