If you use parallelize, the data is distributed across the available nodes;
the sum is computed locally within each partition, and the partial results
are then merged. The driver coordinates the entire process. Is my
understanding correct? Can someone please correct me if I am wrong?

On Fri, Oct 10, 2014 at 9:37 AM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) <
abaghdasa...@bloomberg.net> wrote:

> Hello,
> I was wondering what the Spark accumulator does under the covers.
> I’ve implemented my own associative addInPlace function for the
> accumulator, where is this function being run? Let’s say you call something
> like myRdd.map(x => sum += x) is “sum” being accumulated locally in any
> way, for each element or partition or node? Is “sum” a broadcast variable?
> Or does it only exist on the driver node? How does the driver node get
> access to the “sum”?
> Thanks,
> Areg
>



-- 
Regards,
Haripriya Ayyalasomayajula
