Ignoring my warning about overflow, here's an even more functional take: just use reduceByKey.
Since your main operation is just a bunch of summing, you've got a commutative-associative reduce operation, and Spark will do everything cluster-parallel, then shuffle the (small) result set and merge appropriately. For example:

    input
      .map { case (k, v) => (k, (1, v, v*v)) }   // per record: (count, sum, sum of squares)
      .reduceByKey { case ((c1, s1, ss1), (c2, s2, ss2)) =>
        (c1 + c2, s1 + s2, ss1 + ss2)
      }
      .map { case (k, (count, sum, sumsq)) =>
        // assumes v is a Double (add a .toDouble above otherwise);
        // note this is the variance, so take math.sqrt of it for the standard deviation
        (k, sumsq / count - (sum / count) * (sum / count))
      }

This is by no means the most memory/time-efficient way to do it, but I think it's a nice example of how to think about using Spark at a higher level of abstraction.

- Evan

On Fri, Aug 1, 2014 at 2:00 PM, Sean Owen <so...@cloudera.com> wrote:
> Here's the more functional-programming-friendly take on the
> computation (but yeah, this is the naive formula):
>
> rdd.groupByKey.mapValues { mcs =>
>   val values = mcs.map(_.foo.toDouble)
>   val n = values.size
>   val sum = values.sum
>   val sumSquares = values.map(x => x * x).sum
>   math.sqrt(n * sumSquares - sum * sum) / n
> }
>
> This gives you a bunch of (key, stdev) pairs. I think you want to compute
> this RDD and *then* do something to save it if you like. Sure, that
> could be collecting it locally and saving to a DB. Or you could use
> foreach to do something remotely for every key-value pair. More
> efficient would be to mapPartitions and do something to a whole
> partition of key-value pairs at a time.
>
> On Fri, Aug 1, 2014 at 9:56 PM, kriskalish <k...@kalish.net> wrote:
> > So if I do something like this, does Spark handle the parallelization
> > and recombination of sum and count on the cluster automatically? I
> > started peeking into the source and saw that foreach does submit a job
> > to the cluster, but it looked like the inner function needed to return
> > something to work properly.
> >
> > val grouped = rdd.groupByKey()
> > grouped.foreach { x =>
> >   val iterable = x._2
> >   var sum = 0.0
> >   var count = 0
> >   iterable.foreach { y =>
> >     sum = sum + y.foo
> >     count = count + 1
> >   }
> >   val mean = sum / count
> >   // save mean to database...
> > }
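Incidentally, if you'd rather not hand-roll the count/sum/sum-of-squares triple at all, Spark ships org.apache.spark.util.StatCounter, which tracks count, mean, and variance with a numerically stable online update, so it also sidesteps the overflow issue I warned about. A minimal sketch, assuming input is an RDD[(K, Double)] (the names here are mine; add a .toDouble first if your values aren't already Doubles):

    import org.apache.spark.util.StatCounter

    val statsByKey = input
      .mapValues(v => StatCounter(v))      // one single-value StatCounter per record
      .reduceByKey((a, b) => a.merge(b))   // merging stats is commutative-associative
      .mapValues(s => (s.mean, s.stdev))   // stdev is the population standard deviation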
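And on the saving-to-a-database step: instead of opening a connection per record inside foreach, grab a whole partition at a time, which is the batching Sean is pointing at with mapPartitions; for pure side effects, foreachPartition is the variant to reach for. A rough sketch over any RDD of (key, (mean, stdev)) pairs; openConnection and insertRow are hypothetical stand-ins for whatever DB client you actually use:

    resultsByKey.foreachPartition { iter =>
      val conn = openConnection()          // hypothetical: one connection per partition, not per record
      try {
        iter.foreach { case (k, (mean, stdev)) =>
          insertRow(conn, k, mean, stdev)  // hypothetical helper that writes one row
        }
      } finally {
        conn.close()                       // always release the connection
      }
    }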