Just noticed an error in my wording. It should be: "I'm assuming it's not immediately aggregating on the driver each time I call the += on the Accumulator."
On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet <cjno...@gmail.com> wrote:

> What are the limitations of using Accumulators to get a union of a bunch
> of small sets?
>
> Let's say I have an RDD[Map[String,Any]] and I want to do:
>
> rdd.map(accumulator += Set(_.get("entityType").get))
>
> What implication does this have on performance? I'm assuming it's not
> immediately aggregating each time I call the += on the Accumulator. Is it
> doing a local combine and then occasionally sending the results on the
> current partition back to the driver?
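To make the quoted snippet above concrete, here is roughly what I have in mind. This is only a sketch, assuming the Spark 1.x AccumulatorParam API. The SetUnionParam helper, the EntityTypeUnion object, the local[2] master, and the sample data are all made up for illustration. I've also used foreach instead of map here, since map on its own is lazy and would never run the updates.

```scala
import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

// Hypothetical helper (not part of Spark): combines two sets by union.
object SetUnionParam extends AccumulatorParam[Set[Any]] {
  // The "empty" value each task-side copy of the accumulator starts from.
  def zero(initialValue: Set[Any]): Set[Any] = Set.empty[Any]
  // How two partial results are merged together.
  def addInPlace(s1: Set[Any], s2: Set[Any]): Set[Any] = s1 ++ s2
}

object EntityTypeUnion {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, just so the sketch is self-contained.
    val conf = new SparkConf().setAppName("entity-type-union").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for the real RDD[Map[String,Any]].
      val rdd = sc.parallelize(Seq[Map[String, Any]](
        Map("entityType" -> "person"),
        Map("entityType" -> "place"),
        Map("entityType" -> "person")
      ))

      val entityTypes = sc.accumulator(Set.empty[Any])(SetUnionParam)

      // foreach is an action, so the += calls actually execute.
      // Records without an "entityType" key are skipped instead of throwing.
      rdd.foreach(m => m.get("entityType").foreach(t => entityTypes += Set(t)))

      // The accumulated value is only readable on the driver.
      println(entityTypes.value)
    } finally {
      sc.stop()
    }
  }
}
```

The question still stands for this version: how often does addInPlace end up being called on the driver, and how much data gets shipped back when the sets are small but there are many records?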