Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
of data). The RDD is partitioned into 2048 partitions, which are more or
less equal in size and entirely cached in RAM.
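For reference, the setup looks roughly like this (a sketch; the input
path, the use of spark-shell's `sc`, and the storage level are my
assumptions):

    import org.apache.spark.storage.StorageLevel

    // Sketch of the setup described above; the input source is hypothetical.
    val rdd = sc.textFile("hdfs:///data/input")
      .repartition(2048)                  // 2048 roughly equal partitions
      .persist(StorageLevel.MEMORY_ONLY)  // entirely cached in RAM
    rdd.count()                           // materialize the cache up front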
I evaluated the performance on several cluster sizes and am witnessing
a non-linear (power) relationship between the size of the cluster and
the execution time.
OK, next question then: if this is wall-clock time for the whole
process, I wonder if you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?
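A quick way to check is to count records per partition (a sketch,
assuming `rdd` is the cached RDD in question; note that `it.size`
consumes each partition's iterator, which is fine here since we only
need the count):

    // Record count per partition, largest first.
    val sizes = rdd.mapPartitionsWithIndex(
      (i, it) => Iterator((i, it.size))
    ).collect()
    sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
      println(s"partition $i: $n records")
    }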
Additional relevant information I left out: I'm running a
transformation, there are no shuffles occurring, and at the end I'm
performing a lookup of 4 partitions on the driver.
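If "lookup" here means PairRDDFunctions.lookup on a pair RDD with a
known partitioner, then each call runs a job against only the partition
that owns the key, so four lookups touch four partitions. A minimal
sketch (the data, keys, and transformation are made up):

    import org.apache.spark.HashPartitioner

    // Hypothetical reconstruction of the job shape: transform, cache,
    // then look up a handful of keys from the driver.
    val pairs = sc.parallelize(0 until 1000000)
      .map(i => (i, i * 2))                    // stand-in transformation
      .partitionBy(new HashPartitioner(2048))
      .cache()
    // With a partitioner set, lookup() scans only the owning partition.
    val results = Seq(1, 42, 4096, 999999).map(k => pairs.lookup(k))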
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions, all finishing very quickly,
so the fixed per-task scheduling and communication overhead ends up
dominating the actual compute time.
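To make that concrete, here is a back-of-envelope model (every number is
an assumption, not a measurement): with a fixed partition count,
per-task compute shrinks as the cluster grows while per-task overhead
and the serial driver-side work stay constant, so wall-clock time
flattens out well short of linear speedup.

    // Toy model: wall time = serial driver work
    //                      + task waves * (compute per task + task overhead).
    object ScalingModel extends App {
      val totalComputeSec = 600.0  // assumed total CPU time across all tasks
      val perTaskOverhead = 0.2    // assumed scheduling/serialization cost per task
      val driverSerialSec = 5.0    // assumed fixed DAG/collect work on the driver
      val numPartitions   = 2048
      val coresPerNode    = 8      // assumed
      for (nodes <- Seq(2, 4, 8, 16, 32, 64)) {
        val slots   = nodes * coresPerNode
        val waves   = math.ceil(numPartitions.toDouble / slots)
        val perTask = totalComputeSec / numPartitions + perTaskOverhead
        val wall    = driverSerialSec + waves * perTask
        println(f"$nodes%3d nodes: $wall%7.1f s (ideal ${totalComputeSec / slots}%6.1f s)")
      }
    }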