Hi guys,

thanks for the information, I'll give it a try with Algebird,
thanks again,
Richard
@Patrick, thanks for the release calendar

On Mon, Mar 24, 2014 at 12:16 AM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey All,
>
> I think the old thread is here:
> https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J
>
> The method proposed in that thread is to create a utility class for
> doing single-pass aggregations. Using Algebird is a pretty good way to
> do this and is a bit more flexible since you don't need to create a
> new utility each time you want to do this.
>
> In Spark 1.0 and later you will be able to do this more elegantly with
> the schema support:
> myRDD.groupBy('user).select(Sum('clicks) as 'clicks,
>   Average('duration) as 'duration)
>
> and it will use a single pass automatically... but that's not quite
> released yet :)
>
> - Patrick
>
>
> On Sun, Mar 23, 2014 at 1:31 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > i currently typically do something like this:
> >
> > scala> val rdd = sc.parallelize(1 to 10)
> > scala> import com.twitter.algebird.Operators._
> > scala> import com.twitter.algebird.{Max, Min}
> > scala> rdd.map{ x => (
> >      |   1L,
> >      |   Min(x),
> >      |   Max(x),
> >      |   x
> >      | )}.reduce(_ + _)
> > res0: (Long, com.twitter.algebird.Min[Int], com.twitter.algebird.Max[Int], Int) = (10,Min(1),Max(10),55)
> >
> > however for this you need twitter algebird dependency. without that you have
> > to code the reduce function on the tuples yourself...
> >
> > another example with 2 columns, where i do conditional count for first
> > column, and simple sum for second:
> > scala> sc.parallelize((1 to 10).zip(11 to 20)).map{ case (x, y) => (
> >      |   if (x > 5) 1 else 0,
> >      |   y
> >      | )}.reduce(_ + _)
> > res3: (Int, Int) = (5,155)
> >
> >
> > On Sun, Mar 23, 2014 at 2:26 PM, Richard Siebeling <rsiebel...@gmail.com>
> > wrote:
> >>
> >> Hi Koert, Patrick,
> >>
> >> do you already have an elegant solution to combine multiple operations on
> >> a single RDD?
> >> Say for example that I want to do a sum over one column, a count and an
> >> average over another column,
> >>
> >> thanks in advance,
> >> Richard
> >>
> >>
> >> On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling <rsiebel...@gmail.com>
> >> wrote:
> >>>
> >>> Patrick, Koert,
> >>>
> >>> I'm also very interested in these examples, could you please post them if
> >>> you find them?
> >>> thanks in advance,
> >>> Richard
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>>>
> >>>> not that long ago there was a nice example on here about how to combine
> >>>> multiple operations on a single RDD. so basically if you want to do a
> >>>> count() and something else, how to roll them into a single job. i think
> >>>> patrick wendell gave the examples.
> >>>>
> >>>> i cant find them anymore.... patrick can you please repost? thanks!
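
For anyone reading along who would rather not pull in the Algebird dependency, here is a rough sketch of what "coding the reduce function on the tuples yourself" might look like for the count/min/max/sum case above. The names `SinglePassStats` and `mergeStats` are made up for illustration and are not part of Spark or Algebird. A local Scala collection stands in for the RDD so the sketch runs without a cluster; the assumption is that the same function could be handed to `RDD.reduce`, since it is associative and commutative.

```scala
object SinglePassStats {
  // (count, min, max, sum) carried through the reduce as a single tuple
  type Stats = (Long, Int, Int, Int)

  // Hypothetical helper: combines two partial results element by element.
  // It is associative and commutative, so the order of merging does not matter.
  def mergeStats(a: Stats, b: Stats): Stats =
    (a._1 + b._1, math.min(a._2, b._2), math.max(a._3, b._3), a._4 + b._4)

  def main(args: Array[String]): Unit = {
    // Local stand-in for the RDD. With Spark this would presumably be:
    //   sc.parallelize(1 to 10).map(x => (1L, x, x, x)).reduce(mergeStats)
    val stats = (1 to 10)
      .map(x => (1L, x, x, x))
      .reduce((a, b) => mergeStats(a, b))

    val (count, min, max, sum) = stats
    // The average falls out of the same pass as sum / count
    val average = sum.toDouble / count

    println(s"count=$count min=$min max=$max sum=$sum average=$average")
    // prints: count=10 min=1 max=10 sum=55 average=5.5
  }
}
```

The average is not reduced directly; it is derived afterwards from the sum and count, which is why both are carried in the tuple.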