Hello.

Is there a way to prevent Spark from sending small chunks of data near the
end of an RDD.count?

Let me try to explain:

I am running count on 1T of data stored on HDFS, using 10 m1.large machines
on EC2.
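
For context, the job is essentially just a textFile().count(); the master
URL and HDFS path below are placeholders, not my real ones:

    import org.apache.spark.SparkContext

    // Placeholder cluster URL and input path -- the actual job just counts
    // the lines of a ~1T text dataset on HDFS (~11,300 input splits).
    val sc = new SparkContext("spark://master:7077", "count-1T")
    val lines = sc.textFile("hdfs://namenode:9000/data/input")
    // Each task counts its own split; the driver sums the partial counts.
    println(lines.count())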

This produces around 11,300 tasks, and when the job gets near task 11,000,
the locality of most tasks becomes ANY, and my network (I can see it with
Ganglia) drops to sending and receiving around 30M/sec, whereas at the
beginning of the job it peaked at around 500M.
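
From the docs, my understanding is that the fallback to ANY is governed by
spark.locality.wait (how long the scheduler waits for a data-local slot
before giving up). For reference, it would be set like this, where 3000 ms
is just the documented default, not something I have tuned:

    // Assumption: default scheduler settings. spark.locality.wait is the
    // time in milliseconds to wait for a data-local slot before falling
    // back to a less local level (ultimately ANY). Set before creating
    // the SparkContext.
    System.setProperty("spark.locality.wait", "3000")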

So I assume that since it is a count, one machine has to aggregate all the
partial results. But if I run it on less data it is really fast and this
bottleneck does not appear.

For example, with 300G Spark counts about 1 million lines per second, but
with 1T it drops to about 600,000 lines per second. So I wonder: how can I
avoid this slowdown with 1T of data?

If any of you wants the Ganglia graphs to understand the problem, I can
send them.

Thanks a lot
