Your understanding is basically correct. Each task creates its own local accumulator, and just those results get merged together.
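To make that concrete, here is a plain-Scala sketch of those semantics (no Spark dependency; the partition data, record shape, and names are made up for illustration): each "task" folds records into its own local set, and the driver merges the per-task results at the end.

```scala
// A plain-Scala sketch of accumulator semantics (no Spark dependency).
// The partition data and names below are illustrative, not Spark's API.
object AccumulatorSketch {
  type Record = Map[String, Any]

  // Stand-in for the partitions of an RDD[Map[String, Any]]
  val partitions: Seq[Seq[Record]] = Seq(
    Seq(Map("entityType" -> "user"), Map("entityType" -> "order")),
    Seq(Map("entityType" -> "user"), Map("entityType" -> "item"))
  )

  // Each "task" builds its own local set; the driver then merges the
  // per-task results. Note the merge step itself is not distributed.
  def mergedEntityTypes(parts: Seq[Seq[Record]]): Set[Any] = {
    val perTask: Seq[Set[Any]] =
      parts.map(_.flatMap(_.get("entityType")).toSet)
    perTask.foldLeft(Set.empty[Any])(_ union _)
  }

  def main(args: Array[String]): Unit =
    println(mergedEntityTypes(partitions).map(_.toString).toSeq.sorted.mkString(","))
}
```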
However, there are some performance limitations to be aware of. First, you need enough memory on the executors to build up whatever those intermediate results are. Second, all the work of *merging* the results from each task is done by the *driver*. So if there is a lot of stuff to merge, that can be slow, as it's not distributed at all.

Hope that helps a little

Imran

On Jan 14, 2015 6:21 PM, "Corey Nolet" <cjno...@gmail.com> wrote:

> Just noticed an error in my wording.
>
> It should be "I'm assuming it's not immediately aggregating on the driver
> each time I call += on the Accumulator."
>
> On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet <cjno...@gmail.com> wrote:
>
>> What are the limitations of using Accumulators to get a union of a bunch
>> of small sets?
>>
>> Let's say I have an RDD[Map[String, Any]] and I want to do:
>>
>> rdd.map(accumulator += Set(_.get("entityType").get))
>>
>> What implication does this have on performance? I'm assuming it's not
>> immediately aggregating each time I call += on the Accumulator. Is it
>> doing a local combine and then occasionally sending the results on the
>> current partition back to the driver?