Hello. Is there a way to stop Spark from sending little chunks of data over the network near the end of an RDD.count?
Let me explain: I am trying to run count on 1 TB of data stored on HDFS, using 10 m1.large machines on EC2. This produces around 11,300 tasks, and when the job gets near task 11,000, most of the tasks' locality becomes ANY, and my network (as I can see in Ganglia) drops to sending and receiving around 30 MB/s, whereas at the beginning of the job it peaked at 500 MB/s. So I assume that since it is a count, one machine has to aggregate all the information.

With less data it runs fast and this bottleneck does not appear: with 300 GB, Spark counts about 1 million lines per second, but with 1 TB it drops to 600,000 lines per second. How can I avoid this with 1 TB of data? If any of you want the Ganglia info to understand the problem, I can send it. Thanks a lot.
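For reference, here is a minimal sketch of the job I am running (the HDFS path and app name are placeholders; the spark.locality.wait setting is just something I have been wondering about, to make the scheduler wait longer for node-local slots instead of falling back to ANY):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountLines {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CountLines") // placeholder name
      // Assumption on my part: raising the locality wait might keep
      // tasks node-local longer. The value format ("10s" vs. "10000"
      // milliseconds) depends on the Spark version.
      .set("spark.locality.wait", "10s")

    val sc = new SparkContext(conf)

    // Count lines of the 1 TB dataset on HDFS (path is a placeholder).
    val lines = sc.textFile("hdfs:///path/to/1tb-dataset")
    println(s"line count: ${lines.count()}")

    sc.stop()
  }
}
```

Would tuning something like this help, or is the end-of-job slowdown inherent to how count aggregates the per-partition results?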