spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Hi All, Im using spark 1.4.1 to to analyze a largish data set (several Gigabytes of data). The RDD is partitioned into 2048 partitions which are more or less equal and entirely cached in RAM. I evaluated the performance on several cluster sizes, and am witnessing a non linear (power)

Re: spark performance non-linear response

2015-10-07 Thread Sean Owen
OK, next question then is: if this is wall-clock time for the whole process, then, I wonder if you are just measuring the time taken by the longest single task. I'd expect the time taken by the longest straggler task to follow a distribution like this. That is, how balanced are the partitions?

Re: spark performance non-linear response

2015-10-07 Thread Yadid Ayzenberg
Additional missing relevant information: Im running a transformation, there are no Shuffles occurring and at the end im performing a lookup of 4 partitions on the driver. On 10/7/15 11:26 AM, Yadid Ayzenberg wrote: Hi All, Im using spark 1.4.1 to to analyze a largish data set (several

Re: spark performance non-linear response

2015-10-07 Thread Jonathan Coveney
I've noticed this as well and am curious if there is anything more people can say. My theory is that it is just communication overhead. If you only have a couple of gigabytes (a tiny dataset), then spotting that into 50 nodes means you'll have a ton of tiny partitions all finishing very quickly,