Your understanding is basically correct.  Each task creates its own
local accumulator, and only those per-task results get merged together.

However, there are some performance limitations to be aware of.  First, you
need enough memory on the executors to hold whatever those intermediate
results are.  Second, all the work of *merging* the results from each task
is done by the *driver*.  So if there is a lot of stuff to merge, that step
can be slow, as it's not distributed at all.
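To make that concrete, here is a rough sketch of how a set-union accumulable
could be wired up with the Spark 1.x AccumulableParam API.  The SetUnionParam
object and the usage lines are just illustrative, not taken from your code:

    import org.apache.spark.AccumulableParam

    // Set-union accumulable: each task builds its own local Set[String],
    // and the driver unions the per-task results at the end.
    object SetUnionParam extends AccumulableParam[Set[String], String] {
      // called on the executors when a task does `acc += elem`
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      // called on the driver to merge the per-task sets
      def addInPlace(a: Set[String], b: Set[String]): Set[String] = a ++ b
      // the starting value for each task's local accumulator
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
    }

    // Hypothetical usage, assuming sc: SparkContext and rdd: RDD[Map[String, Any]]:
    //   val entityTypes = sc.accumulable(Set.empty[String])(SetUnionParam)
    //   rdd.foreach(m => entityTypes += m("entityType").toString)
    //   entityTypes.value   // merged on the driver once the action finishes

Note that all of the addInPlace merging happens on the driver, which is
exactly where the cost shows up if the per-task sets get big.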

Hope that helps a little

Imran
On Jan 14, 2015 6:21 PM, "Corey Nolet" <cjno...@gmail.com> wrote:

> Just noticed an error in my wording.
>
> Should be " I'm assuming it's not immediately aggregating on the driver
> each time I call the += on the Accumulator."
>
> On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> What are the limitations of using Accumulators to get a union of a bunch
>> of small sets?
>>
>> Let's say I have an RDD[Map[String, Any]] and I want to do:
>>
>> rdd.map(m => accumulator += Set(m.get("entityType").get))
>>
>>
>> What implications does this have for performance? I'm assuming it's not
>> immediately aggregating each time I call += on the Accumulator. Is it
>> doing a local combine and then occasionally sending the results for the
>> current partition back to the driver?
