[ https://issues.apache.org/jira/browse/SPARK-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351316#comment-15351316 ]
Martin Mauch commented on SPARK-10597:
--------------------------------------

Is there any news on this, or is there another way to calculate weighted statistics for RDDs or DataFrames?
I've found https://github.com/mraad/spark-stat, but it would be great to have a built-in solution, especially since the difference between non-weighted and weighted statistics is usually just replacing counts with sums of weights.

> MultivariateOnlineSummarizer for weighted instances
> ---------------------------------------------------
>
>                 Key: SPARK-10597
>                 URL: https://issues.apache.org/jira/browse/SPARK-10597
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>
> MultivariateOnlineSummarizer for weighted instances is implemented as a private
> API for SPARK-7685.
> In SPARK-7685, the online, numerically stable version of the unbiased variance
> estimate defined by reliability weights
> (https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Reliability_weights)
> is implemented, but we would like to make it a public API since there are
> different use cases.
> Currently, `count` returns the actual number of instances and ignores
> instance weights, but `numNonzeros` returns the weighted number of nonzeros.
> We need to decide on their behavior before making it public.
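
To illustrate the point that weighted statistics mostly come down to replacing counts with sums of weights, here is a minimal single-feature sketch in plain Python. It is not Spark's MultivariateOnlineSummarizer: the class name `WeightedSummarizer`, its fields, and the choice to skip zero-weight instances are assumptions made for this example. The variance uses the reliability-weights correction referenced in the issue, and `count` and `num_nonzeros` mirror the behavior the issue describes (unweighted count, weighted nonzeros).

```python
from dataclasses import dataclass


@dataclass
class WeightedSummarizer:
    """Hypothetical one-feature online summarizer (illustration only)."""
    count: int = 0              # actual number of instances, weights ignored
    weight_sum: float = 0.0     # sum of w_i
    weight_sq_sum: float = 0.0  # sum of w_i^2
    mean: float = 0.0           # weighted running mean
    m2: float = 0.0             # sum of w_i * (x_i - mean)^2
    num_nonzeros: float = 0.0   # weighted number of nonzero values

    def add(self, x: float, w: float = 1.0) -> "WeightedSummarizer":
        if w < 0:
            raise ValueError("weight must be non-negative")
        if w == 0:
            return self  # assumption of this sketch: zero-weight rows are skipped
        self.count += 1
        self.weight_sum += w
        self.weight_sq_sum += w * w
        delta = x - self.mean
        # Weighted incremental update: the weight sum plays the role of the count.
        self.mean += (w / self.weight_sum) * delta
        self.m2 += w * delta * (x - self.mean)
        if x != 0:
            self.num_nonzeros += w
        return self

    def merge(self, other: "WeightedSummarizer") -> "WeightedSummarizer":
        """Combine two partial summaries, e.g. from different partitions."""
        total = self.weight_sum + other.weight_sum
        if total == 0:
            return self
        delta = other.mean - self.mean
        self.m2 += other.m2 + delta * delta * self.weight_sum * other.weight_sum / total
        self.mean += delta * other.weight_sum / total
        self.count += other.count
        self.weight_sum = total
        self.weight_sq_sum += other.weight_sq_sum
        self.num_nonzeros += other.num_nonzeros
        return self

    def variance(self) -> float:
        """Unbiased estimate under reliability weights: m2 / (W - sum(w^2) / W)."""
        if self.weight_sum == 0:
            return 0.0
        denom = self.weight_sum - self.weight_sq_sum / self.weight_sum
        return self.m2 / denom if denom > 0 else 0.0
```

With all weights equal to 1, the denominator reduces to n - 1, so the result matches the usual unweighted sample variance. In a Spark job, something like this could be driven through `RDD.aggregate` or `treeAggregate`, with `add` as the per-element step and `merge` as the combiner, though that wiring is not shown here.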