Computing the variance is similar to this example; you just need to keep the sum of squares around as well.

The formula for variance is (sumsq/n) - (sum/n)^2. But with big datasets or large values you can quickly run into overflow and precision issues, since the two terms can each be huge and then mostly cancel. MLlib handles this by maintaining the average sum of squares in an online fashion (see:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L83 ).
You might consider just calling into the MLlib stats module directly.
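To make that concrete, here is a minimal one-pass sketch against the RDD[(String, MyClass)] from the thread below. MyClass and its numeric field "foo" come from the original question; the case class stand-in and all other names are illustrative. It carries (sum, sum of squares, count) per key through a single reduceByKey and applies the formula at the end:

import org.apache.spark.rdd.RDD

case class MyClass(foo: Double)  // stand-in for the poster's class

def meanAndVariance(rdd: RDD[(String, MyClass)]): RDD[(String, (Double, Double))] = {
  rdd
    // per record: (sum, sum of squares, count), seeded with a single value
    .map { case (k, v) => (k, (v.foo, v.foo * v.foo, 1L)) }
    // combine the partial sums element-wise; one shuffle, no groupByKey
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    // variance = E[x^2] - (E[x])^2; exact in theory, but prone to
    // cancellation for large values, as noted above
    .mapValues { case (sum, sumSq, n) =>
      val mean = sum / n
      (mean, sumSq / n - mean * mean)
    }
}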
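Alternatively, Spark core already ships org.apache.spark.util.StatCounter (likely the class the original poster refers to as "StatCollector" below), which tracks count, mean, and a running variance term with a numerically stable merge. A sketch, not from the thread, of folding it per key with combineByKey:

import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

def statsByKey(rdd: RDD[(String, MyClass)]): RDD[(String, StatCounter)] = {
  rdd
    .map { case (k, v) => (k, v.foo) }
    .combineByKey(
      (x: Double) => StatCounter(x),                      // first value seen for a key
      (s: StatCounter, x: Double) => s.merge(x),          // fold in another value
      (s1: StatCounter, s2: StatCounter) => s1.merge(s2)  // merge partition results
    )
}

// Each StatCounter then exposes .mean, .variance, and .stdev
// (plus the sample-based .sampleVariance and .sampleStdev).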
On Fri, Aug 1, 2014 at 1:48 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> I meant not sure how to do variance in one shot :-)
>
> With the mean in hand, you can obviously broadcast it and do another
> map/reduce pass to calculate the variance per key.
>
> On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
>
>> val res = rdd.map(t => (t._1, (t._2.foo, 1)))
>>              .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
>>              .collect()
>>
>> This gives you a list of (key, (tot, count)), from which you can easily
>> calculate the mean. Not sure about the variance.
>>
>> On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:
>>
>>> I have what seems like a relatively straightforward task to accomplish,
>>> but I cannot seem to figure it out from the Spark documentation or by
>>> searching the mailing list.
>>>
>>> I have an RDD[(String, MyClass)] that I would like to group by key and
>>> then calculate the mean and standard deviation of the "foo" field of
>>> MyClass. It "feels" like I should be able to use groupByKey to get an
>>> RDD for each unique key, but it gives me an iterable instead.
>>>
>>> As in:
>>>
>>> val grouped = rdd.groupByKey()
>>>
>>> grouped.foreach { g =>
>>>   val mean = g.map(x => x.foo).mean()
>>>   val dev = g.map(x => x.foo).stddev()
>>>   // do fancy things with the mean and deviation
>>> }
>>>
>>> However, there seems to be no way to convert the iterable into an RDD.
>>> Is there some other technique for doing this? I'm to the point where
>>> I'm considering copying and pasting the StatCollector class and
>>> changing the type from Double to MyClass (or making it generic).
>>>
>>> Am I going down the wrong path?
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-mean-and-standard-deviation-by-key-tp11192.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.