[ 
https://issues.apache.org/jira/browse/FLINK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207054#comment-15207054
 ] 

Todd Lisonbee edited comment on FLINK-3613 at 3/22/16 7:02 PM:
---------------------------------------------------------------

Attached is a design for improvements to DataSet.aggregate() needed to 
implement additional aggregations like Standard Deviation.

To maintain the public API, the best path seems to be having 
AggregateOperator implement CustomUnaryOperation, but that feels odd because 
no other Operator is built that way. The other options I see aren't 
consistent with the existing Operators either.
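To make the trade-off concrete, here is a minimal sketch of the CustomUnaryOperation pattern under discussion. DataSet is stubbed as a thin list wrapper so the wiring is visible outside Flink; the real org.apache.flink.api.java.DataSet is far richer, and MeanOperation is a hypothetical operator, not the attached design.

```java
import java.util.List;

// Sketch of the CustomUnaryOperation wiring: DataSet is a stub standing
// in for Flink's DataSet<T>, and MeanOperation is a hypothetical operator.
public class AggregateSketch {

    // Stub standing in for org.apache.flink.api.java.DataSet<T>.
    static final class DataSet<T> {
        final List<T> elements;
        DataSet(List<T> elements) { this.elements = elements; }

        // Mirrors DataSet#runOperation(CustomUnaryOperation).
        <X> DataSet<X> runOperation(CustomUnaryOperation<T, X> op) {
            op.setInput(this);
            return op.createResult();
        }
    }

    // Shape of org.apache.flink.api.java.operators.CustomUnaryOperation.
    interface CustomUnaryOperation<IN, OUT> {
        void setInput(DataSet<IN> inputData);
        DataSet<OUT> createResult();
    }

    // Hypothetical aggregation operator: an AggregateOperator built this
    // way would hide its plan construction behind createResult().
    static final class MeanOperation implements CustomUnaryOperation<Double, Double> {
        private DataSet<Double> input;

        public void setInput(DataSet<Double> inputData) { this.input = inputData; }

        public DataSet<Double> createResult() {
            double sum = 0.0;
            for (double x : input.elements) sum += x;
            return new DataSet<>(List.of(sum / input.elements.size()));
        }
    }
}
```

The appeal of this shape is that the aggregation logic stays behind createResult(); the awkwardness is that no existing Operator is invoked through runOperation() this way.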

I really could use some feedback on this.  Thanks.

Also, should I be posting this to the Dev mailing list?



> Add standard deviation, mean, variance to list of Aggregations
> --------------------------------------------------------------
>
>                 Key: FLINK-3613
>                 URL: https://issues.apache.org/jira/browse/FLINK-3613
>             Project: Flink
>          Issue Type: Improvement
>            Reporter: Todd Lisonbee
>            Priority: Minor
>         Attachments: DataSet-Aggregation-Design-March2016-v1.txt
>
>
> Implement standard deviation, mean, and variance for 
> org.apache.flink.api.java.aggregation.Aggregations.
> Ideally implementation should be single pass and numerically stable.
> References:
> "Scalable and Numerically Stable Descriptive Statistics in SystemML", Tian et 
> al., International Conference on Data Engineering (ICDE) 2012
> http://dl.acm.org/citation.cfm?id=2310392
> "The Kahan summation algorithm (also known as compensated summation) reduces 
> the numerical errors that occur when adding a sequence of finite precision 
> floating point numbers. Numerical errors arise due to truncation and 
> rounding. These errors can lead to numerical instability when calculating 
> variance."
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
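For reference, the two cited techniques can be sketched as follows. This is a generic illustration of Welford's online algorithm (one-pass, numerically stable mean/variance) and Kahan compensated summation, not Flink code and not the attached design.

```java
// Sketch of a single-pass, numerically stable accumulator (Welford's
// algorithm) plus Kahan compensated summation, as in the cited references.
public class StableStats {

    // Welford's online algorithm: updates mean and the sum of squared
    // deviations in one pass without catastrophic cancellation.
    static final class Welford {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // running sum of squared deviations from the mean

        void add(double x) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }

        double variance() { return n > 1 ? m2 / (n - 1) : 0.0; } // sample variance
        double stddev()   { return Math.sqrt(variance()); }
    }

    // Kahan summation: carries a correction term so low-order bits
    // lost in each addition are re-applied on the next one.
    static double kahanSum(double[] values) {
        double sum = 0.0, c = 0.0;
        for (double x : values) {
            double y = x - c;    // apply the stored correction
            double t = sum + y;  // big + small: low-order bits of y are lost
            c = (t - sum) - y;   // recover what was lost
            sum = t;
        }
        return sum;
    }
}
```

An aggregation built this way also combines cleanly across partitions, since two (n, mean, m2) triples can be merged, which matters for a distributed DataSet.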



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)