Hi All,
I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes
of data). The RDD is partitioned into 2048 partitions, which are more or
less equal in size and entirely cached in RAM.
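For reference, the setup looks roughly like this (a sketch; the input
path, the use of spark-shell's `sc`, and the storage level are my
assumptions):

    import org.apache.spark.storage.StorageLevel

    // Sketch of the setup described above; the input source is hypothetical.
    val rdd = sc.textFile("hdfs:///data/input")
      .repartition(2048)                  // 2048 roughly equal partitions
      .persist(StorageLevel.MEMORY_ONLY)  // entirely cached in RAM
    rdd.count()                           // materialize the cache up front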
I evaluated the performance on several cluster sizes and am witnessing
a non-linear (power) relationship between the size of the cluster and
the execution time.
OK, next question then: if this is wall-clock time for the whole
process, I wonder if you are just measuring the time taken by the
longest single task. I'd expect the time taken by the longest straggler
task to follow a distribution like this. That is, how balanced are the
partitions?
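A quick way to check is to count records per partition (a sketch,
assuming `rdd` is the cached RDD in question; note that `it.size`
consumes each partition's iterator, which is fine here since we only
need the count):

    // Record count per partition, largest first.
    val sizes = rdd.mapPartitionsWithIndex(
      (i, it) => Iterator((i, it.size))
    ).collect()
    sizes.sortBy(-_._2).take(10).foreach { case (i, n) =>
      println(s"partition $i: $n records")
    }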
Additional relevant information I left out: I'm running a
transformation, there are no shuffles occurring, and at the end I'm
performing a lookup of 4 partitions on the driver.
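If "lookup" here means PairRDDFunctions.lookup on a pair RDD with a
known partitioner, then each call runs a job against only the partition
that owns the key, so four lookups touch four partitions. A minimal
sketch (the data, keys, and transformation are made up):

    import org.apache.spark.HashPartitioner

    // Hypothetical reconstruction of the job shape: transform, cache,
    // then look up a handful of keys from the driver.
    val pairs = sc.parallelize(0 until 1000000)
      .map(i => (i, i * 2))                    // stand-in transformation
      .partitionBy(new HashPartitioner(2048))
      .cache()
    // With a partitioner set, lookup() scans only the owning partition.
    val results = Seq(1, 42, 4096, 999999).map(k => pairs.lookup(k))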
I've noticed this as well and am curious if there is anything more people
can say.
My theory is that it is just communication overhead. If you only have a
couple of gigabytes (a tiny dataset), then splitting that across 50 nodes
means you'll have a ton of tiny partitions, all finishing very quickly,
so the fixed per-task scheduling and communication overhead ends up
dominating the actual compute time.
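To make that concrete, here is a back-of-envelope model (every number is
an assumption, not a measurement): with a fixed partition count,
per-task compute shrinks as the cluster grows while per-task overhead
and the serial driver-side work stay constant, so wall-clock time
flattens out well short of linear speedup.

    // Toy model: wall time = serial driver work
    //                      + task waves * (compute per task + task overhead).
    object ScalingModel extends App {
      val totalComputeSec = 600.0  // assumed total CPU time across all tasks
      val perTaskOverhead = 0.2    // assumed scheduling/serialization cost per task
      val driverSerialSec = 5.0    // assumed fixed DAG/collect work on the driver
      val numPartitions   = 2048
      val coresPerNode    = 8      // assumed
      for (nodes <- Seq(2, 4, 8, 16, 32, 64)) {
        val slots   = nodes * coresPerNode
        val waves   = math.ceil(numPartitions.toDouble / slots)
        val perTask = totalComputeSec / numPartitions + perTaskOverhead
        val wall    = driverSerialSec + waves * perTask
        println(f"$nodes%3d nodes: $wall%7.1f s (ideal ${totalComputeSec / slots}%6.1f s)")
      }
    }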