Computing the variance is similar to this example; you just need to keep
the sum of squares around as well.

The formula for the (population) variance is (sumsq/n) - (sum/n)^2, where n
is the count, sum is the total, and sumsq is the sum of squares.
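
For example, here is a minimal per-key sketch of that formula (assuming an
RDD[(String, Double)] with one observation per record; the names are just
for illustration, and this naive version is exactly what can overflow or
lose precision on large values):

import org.apache.spark.rdd.RDD

// Keep (sum, sum of squares, count) per key, then apply the formula.
def naiveVarianceByKey(data: RDD[(String, Double)]): RDD[(String, Double)] = {
  data
    .mapValues(x => (x, x * x, 1L))
    .reduceByKey { case ((s1, q1, n1), (s2, q2, n2)) =>
      (s1 + s2, q1 + q2, n1 + n2)
    }
    .mapValues { case (sum, sumsq, n) =>
      val mean = sum / n
      sumsq / n - mean * mean   // population variance
    }
}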

But with big datasets or large values, you can quickly run into overflow
and precision issues. MLlib handles this by maintaining a running average
of the squares in an online fashion (see:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala#L83
)
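
Roughly, the trick is a Welford-style update: keep a running count, mean,
and sum of squared deviations (M2), and merge partial results. This is just
a sketch of that idea (not MLlib's actual code):

// State is (count, mean, M2); population variance = M2 / n.
case class RunningStats(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
  def add(x: Double): RunningStats = {
    val n1 = n + 1
    val delta = x - mean
    val newMean = mean + delta / n1
    RunningStats(n1, newMean, m2 + delta * (x - newMean))
  }
  def merge(o: RunningStats): RunningStats =
    if (o.n == 0) this
    else if (n == 0) o
    else {
      val total = n + o.n
      val delta = o.mean - mean
      RunningStats(total,
        mean + delta * o.n / total,
        m2 + o.m2 + delta * delta * n * o.n / total)
    }
  def variance: Double = if (n > 0) m2 / n else Double.NaN
}

With aggregateByKey you would use add as the seqOp and merge as the combOp.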

You might consider just calling into the MLlib stats module directly.
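
Spark core also ships org.apache.spark.util.StatCounter (probably the
"StatCollector" mentioned below), which does this online update for you and
can be folded per key with aggregateByKey. A rough sketch, assuming
MyClass.foo is a Double:

import org.apache.spark.util.StatCounter

val statsByKey = rdd
  .mapValues(_.foo)
  .aggregateByKey(new StatCounter())(
    (acc, v) => acc.merge(v),            // add one value within a partition
    (acc1, acc2) => acc1.merge(acc2))    // combine per-partition accumulators

statsByKey.mapValues(s => (s.mean, s.stdev)).collect()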


On Fri, Aug 1, 2014 at 1:48 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:

> I meant not sure how to do variance in one shot :-)
>
> With the mean in hand, you can obviously broadcast it and do another
> map/reduce to calculate the variance per key.
>
>
> On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
>
>> val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) =>
>> (x._1 + y._1, x._2 + y._2)).collect()
>>
>> This gives you a list of (key, (tot, count)), from which you can easily
>> calculate the mean. Not sure about variance.
>>
>>
>> On Fri, Aug 1, 2014 at 2:55 PM, kriskalish <k...@kalish.net> wrote:
>>
>>> I have what seems like a relatively straightforward task to accomplish,
>>> but I
>>> cannot seem to figure it out from the Spark documentation or searching
>>> the
>>> mailing list.
>>>
>>> I have an RDD[(String, MyClass)] that I would like to group by the key,
>>> and
>>> calculate the mean and standard deviation of the "foo" field of MyClass.
>>> It
>>> "feels" like I should be able to use group by to get an RDD for each
>>> unique
>>> key, but it gives me an iterable.
>>>
>>> As in:
>>>
>>> val grouped = rdd.groupByKey()
>>>
>>> grouped.foreach{g =>
>>>    val mean = g.map( x => x.foo).mean()
>>>    val dev = g.map( x => x.foo ).stddev()
>>>    // do fancy things with the mean and deviation
>>> }
>>>
>>> However, there seems to be no way to convert the iterable into an RDD. Is
>>> there some other technique for doing this? I'm to the point where I'm
>>> considering copying and pasting the StatCollector class and changing the
>>> type from Double to MyClass (or making it generic).
>>>
>>> Am I going down the wrong path?
>>>
>>>
>>>
>>>
>>
>>
>
