Thanks for the link to the bug. Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug.
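
To make the contrast concrete, here is a minimal sketch in plain Python rather than Spark code. It only imitates the shape of aggregate (fold each partition with one function, merge the partition summaries with another) and compares it with a side-effecting counter. The data, the Counter class, run_partition and the simulated retry are all hypothetical stand-ins for illustration; they are not Spark APIs and do not reproduce Spark's scheduler.

```python
from functools import reduce

# Hypothetical data, already split into "partitions" the way an RDD would be.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

ZERO = (0, 0.0, 0.0)  # (count, sum, sum of squares)


def seq_op(acc, x):
    """Fold one element into a partition-local summary."""
    n, s, ss = acc
    return (n + 1, s + x, ss + x * x)


def comb_op(a, b):
    """Merge two partition summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def aggregate(parts, zero, seq, comb):
    """Stand-in for the shape of aggregate: fold each partition, then merge."""
    return reduce(comb, (reduce(seq, p, zero) for p in parts), zero)


def mean_and_variance(summary):
    """Derive the mean and the population variance from a summary tuple."""
    n, s, ss = summary
    mean = s / n
    return mean, ss / n - mean * mean


class Counter:
    """Hypothetical accumulator-like object: updated only by side effect."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n


def run_partition(part, counter):
    """Process one partition: bump the counter and return the summary."""
    for _ in part:
        counter.add(1)
    return reduce(seq_op, part, ZERO)


# Aggregate style: the answer is built only from returned values.
summary = aggregate(partitions, ZERO, seq_op, comb_op)
mean, variance = mean_and_variance(summary)

# Accumulator style, with a simulated re-execution of the first partition.
counter = Counter()
results = [run_partition(p, counter) for p in partitions]
results[0] = run_partition(partitions[0], counter)  # simulated retry
retried_summary = reduce(comb_op, results, ZERO)

print("aggregate summary:", summary)
print("summary after retry:", retried_summary)
print("counter after retry:", counter.value, "for", summary[0], "elements")
```

In this simulation the retried partition's return value simply replaces the earlier one, so the merged summary is unchanged, while the counter has been bumped a second time for the same elements. That is the kind of double counting the quoted reply warns about.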

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

You should never use accumulators for this purpose because you may get incorrect answers. Accumulators can count the same thing multiple times - you cannot rely upon the correctness of the values they compute. See SPARK-732 <https://issues.apache.org/jira/browse/SPARK-732> for more info.

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L <nathan.l.segerl...@intel.com> wrote:

Hi All.

I am trying to get my head around why using accumulators and accumulables seems to be the most recommended method for accumulating running sums, averages, variances and the like, whereas the aggregate method seems to me to be the right one. I have no performance measurements as of yet, but it seems that aggregate is simpler and more intuitive (And it does what one might expect an accumulator to do) whereas the accumulators and accumulables seem to have some extra complications and overhead.

So… What’s the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other?

Thanks,
Nate

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io