[ 
https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193827#comment-15193827
 ] 

Todd Lisonbee commented on FLINK-3613:
--------------------------------------

I checked the Spark code base; it looks like they used the same technique 
described in the links above:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/StatCounter.scala

I'm going to expand this JIRA to also include adding Mean and Variance to the 
list of Aggregations.  There is code overlap for all three, so it probably 
makes sense to solve them together (like StatCounter.scala).
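For reference, a single-pass, merge-friendly accumulator in the spirit of 
StatCounter.scala can be sketched roughly as follows (this uses Welford's 
online algorithm; class and method names here are illustrative only, not 
actual Flink or Spark API):

```java
// Single-pass, numerically stable mean/variance/stddev accumulator.
// Illustrative sketch; not part of the Flink aggregation API.
public class StatAccumulator {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // note: uses the updated mean
    }

    public double mean() { return mean; }

    // population variance; divide by (n - 1) instead for the sample variance
    public double variance() { return n > 0 ? m2 / n : Double.NaN; }

    public double stdDev() { return Math.sqrt(variance()); }

    // Combine two partial accumulators, as needed for a distributed aggregation.
    public void merge(StatAccumulator other) {
        if (other.n == 0) return;
        double delta = other.mean - mean;
        long total = n + other.n;
        mean += delta * other.n / total;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        n = total;
    }
}
```

The merge step is what makes this usable as a combinable aggregation: partial 
results from parallel partitions can be combined without losing stability.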

I noticed there is already an "Average" aggregation that is commented out 
(possibly because of the numerical stability problems it would have).

I'll search JIRA for possible overlap.

> Add standard deviation to list of Aggregations
> ----------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>
> Implement Standard Deviation for 
> org.apache.flink.api.java.aggregation.Aggregations
> Ideally the implementation should be single-pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et 
> al, International Conference on Data Engineering 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces 
> the numerical errors that occur when adding a sequence of finite precision 
> floating point numbers. Numerical errors arise due to truncation and 
> rounding. These errors can lead to numerical instability when calculating 
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
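The compensated summation referenced in the description can be sketched as 
follows (an illustrative helper, not part of the Flink API):

```java
// Kahan (compensated) summation: a compensation term carries the low-order
// bits that would otherwise be lost when adding a small value to a large
// running sum. Illustrative sketch only.
public class KahanSum {
    public static double sum(double[] values) {
        double sum = 0.0;
        double c = 0.0;            // running compensation for lost low-order bits
        for (double v : values) {
            double y = v - c;      // apply the correction to the new term
            double t = sum + y;    // low-order bits of y may be lost here...
            c = (t - sum) - y;     // ...but are recovered into c
            sum = t;
        }
        return sum;
    }
}
```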



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
