Thanks for the link to the bug.

Unfortunately, using accumulators this way is being passed around as a 
recommended practice despite the bug.


From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

You should never use accumulators for this purpose because you may get 
incorrect answers. Accumulators can count the same thing multiple times - you 
cannot rely upon the correctness of the values they compute. See 
SPARK-732<https://issues.apache.org/jira/browse/SPARK-732> for more info.
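
To make the double counting concrete, here is a minimal sketch in plain Python. It does not use Spark; it only models the situation where a transformation with a side effect is recomputed for each action. The names Counter and make_lazy_map are hypothetical stand-ins for this illustration, not Spark APIs.

```python
class Counter:
    """Hypothetical stand-in for an accumulator: updated as a side effect."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def make_lazy_map(data, counter):
    """Return a function that recomputes the mapped data on every call,
    modelling a lineage that is re-evaluated for each action."""

    def compute():
        out = []
        for x in data:
            counter.add(1)  # side effect inside the transformation
            out.append(x * 2)
        return out

    return compute


data = [1, 2, 3, 4]
seen = Counter()
doubled = make_lazy_map(data, seen)

total = sum(doubled())    # first "action"
largest = max(doubled())  # second "action" recomputes the map

# The side-effect counter has now seen every record twice.
side_effect_count = seen.value

# A count taken from the input records themselves is unaffected
# by how many times the map was recomputed.
returned_count = len(data)

print(side_effect_count, returned_count)
```

The counter ends up at twice the number of records, while anything computed from the returned values stays correct no matter how many times the work is repeated.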

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L 
<nathan.l.segerl...@intel.com<mailto:nathan.l.segerl...@intel.com>> wrote:
Hi All.

I am trying to get my head around why using accumulators and accumulables seems 
to be the most recommended method for accumulating running sums, averages, 
variances and the like, whereas the aggregate method seems to me to be the 
right one. I have no performance measurements as of yet, but it seems that 
aggregate is simpler and more intuitive (and it does what one might expect an 
accumulator to do) whereas the accumulators and accumulables seem to have some 
extra complications and overhead.
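
As an illustration of the aggregate approach for running sums, averages, and variances, here is a rough sketch in plain Python. The aggregate function below is a hypothetical local model of the zero value / per-partition operator / combine operator contract, not Spark itself; with PySpark the corresponding call would be rdd.aggregate(zero, seq_op, comb_op). The sample partitions are made up for the example.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Local model of the aggregate contract: fold each partition with
    seq_op starting from zero, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


# The running state is (count, sum, sum of squares).
zero = (0, 0.0, 0.0)


def seq_op(acc, x):
    """Fold one record into a partition's running state."""
    n, s, sq = acc
    return (n + 1, s + x, sq + x * x)


def comb_op(a, b):
    """Merge the running states of two partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


# Made-up data split into three "partitions".
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0]]

n, s, sq = aggregate(partitions, zero, seq_op, comb_op)
mean = s / n
variance = sq / n - mean * mean  # population variance

print(n, s, mean, variance)
```

The result is the value returned by the action, so it does not depend on any side effect surviving a recomputation. Note that the sum-of-squares formula is the simplest one to write down but can lose precision on large or badly scaled data.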

So…

What’s the real difference between an accumulator/accumulable and aggregating 
an RDD? When is one method of aggregation preferred over the other?

Thanks,
Nate



--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io<mailto:daniel.siegm...@velos.io> W: 
www.velos.io<http://www.velos.io>
