Flashback: RDD.aggregate versus accumulables...
And Lord Joe, you were right: later versions did protect accumulators in actions. I wonder if anyone has a "modern" take on the accumulator vs. aggregate question. It seems that if I need to aggregate by key or control partitioning, I would use aggregate.

Bottom-line question and reason for this post: does anyone have more ideas about using aggregate instead? Am I right to think accumulables are always present on the driver, whereas an aggregate needs to be pulled to the driver manually?

Details: both give me the option to write custom adds and merges. For example, here is a class I am stubbing out:

    class DropEvalAccumulableParam implements AccumulableParam<DropEvaluation, DropResult> {

        // Add additional data to the accumulator value. Is allowed to modify and
        // return r for efficiency (to avoid allocating objects).
        // r is the first value
        @Override
        public DropEvaluation addAccumulator(DropEvaluation dropEvaluation, DropResult dropResult) {
            return null;
        }

        // Merge two accumulated values together. Is allowed to modify and return
        // the first value for efficiency (to avoid allocating objects).
        @Override
        public DropEvaluation addInPlace(DropEvaluation masterDropEval, DropEvaluation r1) {
            return null;
        }

        // Return the "zero" (identity) value for an accumulator type, given its
        // initial value. For example, if R was a vector of N dimensions,
        // this would return a vector of N zeroes.
        @Override
        public DropEvaluation zero(DropEvaluation dropEvaluation) {
            // technically the "additive identity" of a DropEvaluation would be
            return dropEvaluation;
        }
    }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p26456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
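One way to see how the three stubbed methods line up with a keyed aggregate is the plain-Java sketch below. It does not use Spark: each inner list stands in for a partition, and the fields on `DropResult` and `DropEvaluation` (`key`, `dropped`, `seen`) are hypothetical, invented only for illustration. In this sketch `zero`, `add`, and `merge` play the roles of `zero`, `addAccumulator`, and `addInPlace`, which are the same roles as the zero value, seqOp, and combOp of an aggregate.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DropEvalByKeySketch {

    /** Hypothetical element type: one result tagged with a grouping key. */
    static final class DropResult {
        final String key;
        final boolean dropped;
        DropResult(String key, boolean dropped) { this.key = key; this.dropped = dropped; }
    }

    /** Hypothetical accumulated type: how many results were seen and how many were dropped. */
    static final class DropEvaluation {
        long seen;
        long dropped;
    }

    /** Same role as zero(): a fresh identity value. */
    static DropEvaluation zero() { return new DropEvaluation(); }

    /** Same role as addAccumulator (seqOp): fold one element in, mutating and returning acc. */
    static DropEvaluation add(DropEvaluation acc, DropResult r) {
        acc.seen++;
        if (r.dropped) acc.dropped++;
        return acc;
    }

    /** Same role as addInPlace (combOp): merge b into a and return a. */
    static DropEvaluation merge(DropEvaluation a, DropEvaluation b) {
        a.seen += b.seen;
        a.dropped += b.dropped;
        return a;
    }

    /** Models a keyed aggregate: build one partial map per partition, then merge per key. */
    static Map<String, DropEvaluation> aggregateByKey(List<List<DropResult>> partitions) {
        Map<String, DropEvaluation> merged = new HashMap<>();
        for (List<DropResult> partition : partitions) {
            Map<String, DropEvaluation> partial = new HashMap<>();
            for (DropResult r : partition) {
                add(partial.computeIfAbsent(r.key, k -> zero()), r);
            }
            for (Map.Entry<String, DropEvaluation> e : partial.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), DropEvalByKeySketch::merge);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<DropResult>> partitions = Arrays.asList(
            Arrays.asList(new DropResult("a", true), new DropResult("b", false)),
            Arrays.asList(new DropResult("a", false), new DropResult("a", true)));
        for (Map.Entry<String, DropEvaluation> e : aggregateByKey(partitions).entrySet()) {
            System.out.println(e.getKey() + ": seen=" + e.getValue().seen
                + " dropped=" + e.getValue().dropped);
        }
    }
}
```

On the driver question: in Spark, `aggregate` is an action and its result comes back to the driver as the return value, whereas a keyed aggregate such as `aggregateByKey` yields another RDD that still has to be collected.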
Re: RDD.aggregate?
Any explanation of how aggregate works would be much appreciated. I already looked at the Spark example and am still confused about the seqOp and combOp... Thanks.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434p20634.html
Re: RDD.aggregate?
There's some explanation and an example here:
http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246

-kr, Gerard.

On Thu, Dec 11, 2014 at 7:15 PM, ll duy.huynh@gmail.com wrote:

> Any explanation of how aggregate works would be much appreciated. I already
> looked at the Spark example and am still confused about the seqOp and
> combOp... Thanks.
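For readers stuck on seqOp and combOp, here is a minimal plain-Java model of what `aggregate(zero, seqOp, combOp)` computes. It is a sketch, not Spark code: the `aggregate` helper and the `SumCount` class are made up for illustration, and each inner list stands in for one partition. The seqOp folds one element into a partition's running value, starting from the zero value; the combOp merges the per-partition results. The accumulated type (a sum and a count) differs from the element type (an integer), which is why two functions are needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

class AggregateSketch {

    /** Immutable running (sum, count) pair: the accumulated type U. */
    static final class SumCount {
        final long sum;
        final long count;
        SumCount(long sum, long count) { this.sum = sum; this.count = count; }
        double average() { return count == 0 ? 0.0 : (double) sum / count; }
    }

    /** Plain-Java model of aggregate: fold each partition with seqOp, merge partials with combOp. */
    static <T, U> U aggregate(List<List<T>> partitions, U zero,
                              BiFunction<U, T, U> seqOp, BinaryOperator<U> combOp) {
        U result = zero;
        for (List<T> partition : partitions) {
            U partial = zero;                            // every partition starts from the zero value
            for (T element : partition) {
                partial = seqOp.apply(partial, element); // fold one element into the partial result
            }
            result = combOp.apply(result, partial);      // merge this partition's partial result
        }
        return result;
    }

    /** The classic average example: accumulate (sum, count), divide at the end. */
    static double average(List<List<Integer>> partitions) {
        SumCount total = aggregate(
            partitions,
            new SumCount(0, 0),
            (acc, x) -> new SumCount(acc.sum + x, acc.count + 1),       // seqOp
            (a, b) -> new SumCount(a.sum + b.sum, a.count + b.count));  // combOp
        return total.average();
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(4, 5),
            Arrays.asList(6));
        System.out.println(average(partitions)); // prints 3.5
    }
}
```

The zero value is shared across partitions here only because `SumCount` is immutable. With a mutable accumulator, each partition would need its own fresh copy.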
RDD.aggregate?
Can someone please explain how RDD.aggregate works? I looked at the average example done with aggregate(), but I'm still confused about this function... Much appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434.html
Re: RDD.aggregate versus accumulables...
You should *never* use accumulators for this purpose because you may get incorrect answers. Accumulators can count the same thing multiple times - you cannot rely upon the correctness of the values they compute. See SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732) for more info.

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L nathan.l.segerl...@intel.com wrote:

> Hi All.
>
> I am trying to get my head around why using accumulators and accumulables
> seems to be the most recommended method for accumulating running sums,
> averages, variances and the like, whereas the aggregate method seems to me
> to be the right one. I have no performance measurements as of yet, but it
> seems that aggregate is simpler and more intuitive (And it does what one
> might expect an accumulator to do) whereas the accumulators and accumulables
> seem to have some extra complications and overhead.
>
> So… What’s the real difference between an accumulator/accumulable and
> aggregating an RDD? When is one method of aggregation preferred over the
> other?
>
> Thanks,
> Nate

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
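The double-counting problem can be illustrated without Spark. The sketch below is a toy model, not Spark's scheduler: `runTask`, `runWithRetry`, and the shared `sideEffectCounter` are all hypothetical names. A task updates a shared counter as a side effect and also returns its own partial sum. When the first attempt on a partition fails partway and is retried, the side-effect counter keeps the failed attempt's updates, while the sum of returned values only includes the successful attempt.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class RetryDoubleCountSketch {

    /** Stand-in for a shared counter that tasks update as a side effect. */
    static final AtomicLong sideEffectCounter = new AtomicLong();

    /**
     * Runs one "task" over a partition, bumping the shared counter per element and
     * returning the partition's partial sum. If failAfter >= 0, the task throws after
     * processing that many elements, simulating a lost task attempt.
     */
    static long runTask(List<Integer> partition, int failAfter) {
        long partial = 0;
        int processed = 0;
        for (int x : partition) {
            if (processed == failAfter) {
                throw new IllegalStateException("simulated task failure");
            }
            sideEffectCounter.addAndGet(x); // side effect survives even if this attempt fails
            partial += x;
            processed++;
        }
        return partial;
    }

    /** Retries once on failure; only the successful attempt's return value is kept. */
    static long runWithRetry(List<Integer> partition, int failAfterOnFirstAttempt) {
        try {
            return runTask(partition, failAfterOnFirstAttempt);
        } catch (IllegalStateException e) {
            return runTask(partition, -1);
        }
    }

    /** Returns {sum of returned partials, final value of the side-effect counter}. */
    static long[] simulate() {
        sideEffectCounter.set(0);
        List<Integer> p0 = Arrays.asList(1, 2, 3);
        List<Integer> p1 = Arrays.asList(4, 5, 6);
        // The first attempt on p1 fails after two elements and is retried from scratch.
        long returned = runWithRetry(p0, -1) + runWithRetry(p1, 2);
        return new long[] { returned, sideEffectCounter.get() };
    }

    public static void main(String[] args) {
        long[] r = simulate();
        System.out.println("returned-value sum: " + r[0]); // prints 21
        System.out.println("side-effect sum: " + r[1]);    // prints 30
    }
}
```

The returned-value path is what an aggregate relies on, which is why its result is unaffected by the retry in this model.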
Re: RDD.aggregate versus accumulables...
We use Algebird for calculating things like min/max, stddev, variance, etc.

https://github.com/twitter/algebird/wiki

-Suren

On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io wrote:

> You should *never* use accumulators for this purpose because you may get
> incorrect answers. Accumulators can count the same thing multiple times -
> you cannot rely upon the correctness of the values they compute. See
> SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732) for more info.
>
> On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L nathan.l.segerl...@intel.com wrote:
>
>> Hi All.
>>
>> I am trying to get my head around why using accumulators and accumulables
>> seems to be the most recommended method for accumulating running sums,
>> averages, variances and the like, whereas the aggregate method seems to me
>> to be the right one. I have no performance measurements as of yet, but it
>> seems that aggregate is simpler and more intuitive (And it does what one
>> might expect an accumulator to do) whereas the accumulators and
>> accumulables seem to have some extra complications and overhead.
>>
>> So… What’s the real difference between an accumulator/accumulable and
>> aggregating an RDD? When is one method of aggregation preferred over the
>> other?
>>
>> Thanks,
>> Nate

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
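Statistics such as variance fit the aggregate pattern when the running state can be merged across partitions. The sketch below is not Algebird's API. It is a hand-written Java illustration that uses Welford's update for adding one value and the pairwise merge formula of Chan et al. for combining two partial results. The `Moments` class and the `fold` helper are hypothetical names.

```java
class MomentsSketch {

    /** Immutable running count, mean, and sum of squared deviations from the mean (m2). */
    static final class Moments {
        final long n;
        final double mean;
        final double m2;

        Moments(long n, double mean, double m2) { this.n = n; this.mean = mean; this.m2 = m2; }

        static Moments zero() { return new Moments(0, 0.0, 0.0); }

        /** seqOp role: fold one value in (Welford's update). */
        Moments add(double x) {
            long n1 = n + 1;
            double delta = x - mean;
            double mean1 = mean + delta / n1;
            double m21 = m2 + delta * (x - mean1);
            return new Moments(n1, mean1, m21);
        }

        /** combOp role: merge two partial results (pairwise formula of Chan et al.). */
        Moments merge(Moments o) {
            if (o.n == 0) return this;
            if (n == 0) return o;
            long total = n + o.n;
            double delta = o.mean - mean;
            double mean1 = mean + delta * o.n / total;
            double m21 = m2 + o.m2 + delta * delta * ((double) n * o.n / total);
            return new Moments(total, mean1, m21);
        }

        double populationVariance() { return n == 0 ? Double.NaN : m2 / n; }

        double populationStdDev() { return Math.sqrt(populationVariance()); }
    }

    /** Folds the values of one "partition" starting from the zero value. */
    static Moments fold(double... values) {
        Moments m = Moments.zero();
        for (double v : values) {
            m = m.add(v);
        }
        return m;
    }

    public static void main(String[] args) {
        // Two partitions of the data set {2, 4, 4, 4, 5, 5, 7, 9}.
        Moments merged = fold(2, 4, 4, 4).merge(fold(5, 5, 7, 9));
        System.out.println("mean=" + merged.mean
            + " variance=" + merged.populationVariance()
            + " stddev=" + merged.populationStdDev());
    }
}
```

For this data set the exact population mean is 5 and the population variance is 4. The floating-point results should match those values up to rounding.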
RE: RDD.aggregate versus accumulables...
Thanks for the link to the bug. Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug.

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

> You should never use accumulators for this purpose because you may get
> incorrect answers. Accumulators can count the same thing multiple times -
> you cannot rely upon the correctness of the values they compute. See
> SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732) for more info.
>
> On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L nathan.l.segerl...@intel.com wrote:
>
>> Hi All.
>>
>> I am trying to get my head around why using accumulators and accumulables
>> seems to be the most recommended method for accumulating running sums,
>> averages, variances and the like, whereas the aggregate method seems to me
>> to be the right one. I have no performance measurements as of yet, but it
>> seems that aggregate is simpler and more intuitive (And it does what one
>> might expect an accumulator to do) whereas the accumulators and
>> accumulables seem to have some extra complications and overhead.
>>
>> So… What’s the real difference between an accumulator/accumulable and
>> aggregating an RDD? When is one method of aggregation preferred over the
>> other?
>>
>> Thanks,
>> Nate
RE: RDD.aggregate versus accumulables...
I have been playing with accumulators (despite the possible error with multiple attempts). They provide a convenient way to get some numbers while still performing business logic. I posted some sample code at http://lordjoesoftware.blogspot.com/.

Even if accumulators are not perfect today, future versions may improve them, and they are a great way to monitor execution and get a sense of performance on lazily executed systems.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p19102.html
RDD.aggregate versus accumulables...
Hi All. I am trying to get my head around why using accumulators and accumulables seems to be the most recommended method for accumulating running sums, averages, variances and the like, whereas the aggregate method seems to me to be the right one. I have no performance measurements as of yet, but it seems that aggregate is simpler and more intuitive (And it does what one might expect an accumulator to do) whereas the accumulators and accumulables seem to have some extra complications and overhead. So... What's the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other? Thanks, Nate