Flashback: RDD.aggregate versus accumulables...

2016-03-10 Thread jiml
And Lord Joe, you were right: future versions did protect accumulators in
actions. I wonder if anyone has a "modern" take on the accumulator vs.
aggregate question. It seems that if I need to aggregate by key or control
partitioning, I would use aggregate.

Bottom line question / reason for post: I wonder if anyone has more ideas
about using aggregate instead? Am I right to think accumulables are always
present on the driver, whereas an aggregate needs to be pulled to the driver
manually?

Details: both approaches let me write custom adds and merges. For example,
here is a class I am stubbing out:

import org.apache.spark.AccumulableParam;

class DropEvalAccumulableParam implements
        AccumulableParam<DropEvaluation, DropResult> {

    // Add additional data to the accumulator value. Is allowed to
    // modify and return r for efficiency (to avoid allocating objects).
    // r is the first value
    @Override
    public DropEvaluation addAccumulator(DropEvaluation dropEvaluation,
            DropResult dropResult) {
        return null;
    }

    // Merge two accumulated values together. Is allowed to modify and
    // return the first value for efficiency (to avoid allocating objects).
    @Override
    public DropEvaluation addInPlace(DropEvaluation masterDropEval,
            DropEvaluation r1) {
        return null;
    }

    // Return the "zero" (identity) value for an accumulator type, given
    // its initial value. For example, if R was a vector of N dimensions,
    // this would return a vector of N zeroes.
    @Override
    public DropEvaluation zero(DropEvaluation dropEvaluation) {
        // technically the "additive identity" of a DropEvaluation would
        // be
        return dropEvaluation;
    }
}
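
The three stubbed methods play the same roles as the arguments to aggregate:
addAccumulator corresponds to seqOp, addInPlace to combOp, and zero to the
zero value. The following is a minimal plain-Java sketch of that
correspondence. It does not use Spark, and the fields of DropEvaluation and
DropResult, as well as the DropEvalSketch helper, are hypothetical, since the
post does not show those classes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in: the original post does not show this class. */
final class DropResult {
    final boolean dropped;
    DropResult(boolean dropped) { this.dropped = dropped; }
}

/** Hypothetical stand-in: a mutable tally of how many results were dropped. */
final class DropEvaluation {
    long total;
    long dropped;

    /** Same role as addAccumulator / seqOp: fold one result in, mutating this. */
    DropEvaluation add(DropResult r) {
        total++;
        if (r.dropped) {
            dropped++;
        }
        return this;
    }

    /** Same role as addInPlace / combOp: merge another partial tally into this one. */
    DropEvaluation merge(DropEvaluation other) {
        total += other.total;
        dropped += other.dropped;
        return this;
    }
}

final class DropEvalSketch {
    /** Folds one partition, starting from a fresh, empty zero value. */
    static DropEvaluation foldPartition(List<DropResult> partition) {
        DropEvaluation acc = new DropEvaluation();
        for (DropResult r : partition) {
            acc.add(r);
        }
        return acc;
    }

    public static void main(String[] args) {
        DropEvaluation left = foldPartition(
            Arrays.asList(new DropResult(true), new DropResult(false)));
        DropEvaluation right = foldPartition(
            Arrays.asList(new DropResult(true)));
        DropEvaluation merged = left.merge(right);
        System.out.println(merged.dropped + " of " + merged.total + " dropped");
    }
}
```

Note that the sketch starts every partition from an empty tally. Returning
the initial value unchanged from zero, as the stub does, is only an identity
when that initial value is itself empty.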





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p26456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: RDD.aggregate?

2014-12-11 Thread ll
Any explanation of how aggregate works would be much appreciated. I already
looked at the Spark example and am still confused about the seqOp and
combOp... Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434p20634.html



Re: RDD.aggregate?

2014-12-11 Thread Gerard Maas
There's some explanation and an example here:
http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246

-kr, Gerard.
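
For readers without the link at hand, here is a minimal plain-Java sketch of
what aggregate does with seqOp and combOp. It models partitions as lists and
does not call Spark; the names AggregateSketch, SumCount and average are made
up for illustration. The idea: seqOp folds each element of a partition into a
running value that starts from the zero value, and combOp merges the
per-partition results into one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

/** Plain-Java model of aggregate(zeroValue, seqOp, combOp); no Spark involved. */
final class AggregateSketch {

    /** Immutable running (sum, count) pair: the result type U. */
    static final class SumCount {
        final double sum;
        final long count;
        SumCount(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    /**
     * Folds each partition with seqOp starting from zero, then merges the
     * per-partition results with combOp.
     */
    static <T, U> U aggregate(List<List<T>> partitions, U zero,
                              BiFunction<U, T, U> seqOp, BinaryOperator<U> combOp) {
        U result = zero;
        for (List<T> partition : partitions) {
            U partial = zero;                        // each partition starts from zero
            for (T element : partition) {
                partial = seqOp.apply(partial, element);
            }
            result = combOp.apply(result, partial);  // merge the partial results
        }
        return result;
    }

    /** The usual "average" example: accumulate (sum, count), divide at the end. */
    static double average(List<List<Double>> partitions) {
        SumCount total = aggregate(
            partitions,
            new SumCount(0.0, 0L),
            (acc, x) -> new SumCount(acc.sum + x, acc.count + 1),
            (a, b) -> new SumCount(a.sum + b.sum, a.count + b.count));
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
            Arrays.asList(1.0, 2.0),
            Arrays.asList(3.0, 4.0, 5.0));
        System.out.println(average(partitions));  // prints 3.0
    }
}
```

The result type (SumCount) differs from the element type (Double), which is
why two functions are needed: seqOp takes (U, T) and combOp takes (U, U). The
sketch keeps SumCount immutable because the same zero object is reused for
every partition here.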





RDD.aggregate?

2014-12-04 Thread ll
Can someone please explain how RDD.aggregate works? I looked at the average
example done with aggregate(), but I'm still confused about this function...
Much appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434.html



Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
You should *never* use accumulators for this purpose because you may get
incorrect answers. Accumulators can count the same thing multiple times -
you cannot rely upon the correctness of the values they compute. See
SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732) for more info.
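
To make the double-counting concern concrete, here is a toy plain-Java model
of a re-executed task. It is only a sketch under simplifying assumptions: it
does not use Spark, the names RetrySketch, runTask and runJobWithOneRetry are
invented, and real behaviour depends on the Spark version and on whether the
update happens inside an action or a transformation.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of a retried task; it does not use Spark. */
final class RetrySketch {

    /** Shared counter updated as a side effect, standing in for an accumulator. */
    static long sideEffectTotal = 0;

    /** Processes one partition: bumps the shared counter and returns its own count. */
    static long runTask(List<String> partition) {
        long local = 0;
        for (String record : partition) {
            sideEffectTotal++;   // the side effect survives even if this attempt is discarded
            local++;
        }
        return local;
    }

    /** Runs every partition once, plus one repeated attempt of the first partition. */
    static long runJobWithOneRetry(List<List<String>> partitions) {
        sideEffectTotal = 0;
        long merged = 0;
        for (int i = 0; i < partitions.size(); i++) {
            long result = runTask(partitions.get(i));
            if (i == 0) {
                // Pretend the first attempt's result was lost, so the task is re-run.
                result = runTask(partitions.get(i));
            }
            merged += result;    // only one result per partition is merged
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b", "c"),
            Arrays.asList("d", "e"));
        long merged = runJobWithOneRetry(partitions);
        System.out.println("merged per-partition results: " + merged);    // 5
        System.out.println("side-effect counter: " + sideEffectTotal);    // 8
    }
}
```

The value returned per partition is counted once no matter how many attempts
ran, which is the property aggregate relies on; the side-effect counter sees
every attempt.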

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L 
nathan.l.segerl...@intel.com wrote:

  Hi All.



 I am trying to get my head around why using accumulators and accumulables
 seems to be the most recommended method for accumulating running sums,
 averages, variances and the like, whereas the aggregate method seems to me
 to be the right one. I have no performance measurements as of yet, but it
 seems that aggregate is simpler and more intuitive (And it does what one
 might expect an accumulator to do) whereas the accumulators and
 accumulables seem to have some extra complications and overhead.



 So…



 What’s the real difference between an accumulator/accumulable and
 aggregating an RDD? When is one method of aggregation preferred over the
 other?



 Thanks,

 Nate




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Surendranauth Hiraman
We use Algebird for calculating things like min/max, stddev, variance, etc.

https://github.com/twitter/algebird/wiki
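
As a rough illustration of why mergeable summaries suit this kind of
statistic, here is a hand-rolled plain-Java sketch of a count/mean/variance
summary that can be built per partition and then merged. It is not Algebird's
API; the class name Moments and its methods are invented for this example,
and it uses the standard pairwise update for merging means and squared
deviations.

```java
/** Mergeable count/mean/M2 summary; a hand-rolled sketch, not Algebird's API. */
final class Moments {
    final long count;
    final double mean;
    final double m2;   // sum of squared deviations from the mean

    static final Moments ZERO = new Moments(0, 0.0, 0.0);

    Moments(long count, double mean, double m2) {
        this.count = count;
        this.mean = mean;
        this.m2 = m2;
    }

    /** Role of seqOp: fold one value into the summary (Welford's update). */
    Moments add(double x) {
        long n = count + 1;
        double delta = x - mean;
        double newMean = mean + delta / n;
        double newM2 = m2 + delta * (x - newMean);
        return new Moments(n, newMean, newM2);
    }

    /** Role of combOp: merge two summaries built from disjoint data. */
    Moments merge(Moments other) {
        if (count == 0) {
            return other;
        }
        if (other.count == 0) {
            return this;
        }
        long n = count + other.count;
        double delta = other.mean - mean;
        double newMean = mean + delta * other.count / n;
        double newM2 = m2 + other.m2 + delta * delta * count * other.count / n;
        return new Moments(n, newMean, newM2);
    }

    /** Population variance; NaN when no values have been added. */
    double variance() {
        return count == 0 ? Double.NaN : m2 / count;
    }

    public static void main(String[] args) {
        // Two "partitions" summarised independently, then merged.
        Moments a = ZERO.add(2).add(4).add(4).add(4);
        Moments b = ZERO.add(5).add(5).add(7).add(9);
        Moments all = a.merge(b);
        System.out.println("count=" + all.count
            + " mean=" + all.mean
            + " variance=" + all.variance());
    }
}
```

For the eight values above the mean is 5 and the population variance is 4, up
to floating-point rounding. The same add and merge pair could be passed to
aggregate as seqOp and combOp with ZERO as the zero value.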

-Suren


-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io


RE: RDD.aggregate versus accumulables...

2014-11-17 Thread Segerlind, Nathan L
Thanks for the link to the bug.

Unfortunately, using accumulators like this is getting spread around as a 
recommended practice despite the bug.




RE: RDD.aggregate versus accumulables...

2014-11-17 Thread lordjoe
I have been playing with accumulators (despite the possible error with
multiple attempts). They provide a convenient way to get some numbers while
still performing business logic.
I posted some sample code at http://lordjoesoftware.blogspot.com/.
Even if accumulators are not perfect today, future versions may improve
them, and they are a great way to monitor execution and get a sense of
performance on lazily executed systems.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p19102.html



RDD.aggregate versus accumulables...

2014-11-16 Thread Segerlind, Nathan L
Hi All.

I am trying to get my head around why using accumulators and accumulables seems 
to be the most recommended method for accumulating running sums, averages, 
variances and the like, whereas the aggregate method seems to me to be the 
right one. I have no performance measurements as of yet, but it seems that 
aggregate is simpler and more intuitive (And it does what one might expect an 
accumulator to do) whereas the accumulators and accumulables seem to have some 
extra complications and overhead.

So...

What's the real difference between an accumulator/accumulable and aggregating 
an RDD? When is one method of aggregation preferred over the other?

Thanks,
Nate