[ https://issues.apache.org/jira/browse/SPARK-21818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-21818.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19029
[https://github.com/apache/spark/pull/19029]

> MultivariateOnlineSummarizer.variance generate negative result
> --------------------------------------------------------------
>
>                 Key: SPARK-21818
>                 URL: https://issues.apache.org/jira/browse/SPARK-21818
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.2.0
>            Reporter: Weichen Xu
>             Fix For: 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Because of numerical error, MultivariateOnlineSummarizer.variance can
> return a negative variance.
> This is a serious bug because many algorithms in MLlib use the stddev computed as
> sqrt(variance); a negative variance yields NaN and crashes the whole algorithm.
> The bug can be reproduced with the following code:
> {code}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
>
> // Four single-sample summarizers, all with the same value (3.0) but different weights.
> val summarizer1 = (new MultivariateOnlineSummarizer)
>   .add(Vectors.dense(3.0), 0.7)
> val summarizer2 = (new MultivariateOnlineSummarizer)
>   .add(Vectors.dense(3.0), 0.4)
> val summarizer3 = (new MultivariateOnlineSummarizer)
>   .add(Vectors.dense(3.0), 0.5)
> val summarizer4 = (new MultivariateOnlineSummarizer)
>   .add(Vectors.dense(3.0), 0.4)
> val summarizer = summarizer1
>   .merge(summarizer2)
>   .merge(summarizer3)
>   .merge(summarizer4)
> // Every sample is 3.0, so the true variance is 0; the reported bug is a negative result here.
> println(summarizer.variance(0))
> {code}
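
The failure mode described in the report, a slightly negative variance turning into NaN once sqrt is applied, can be guarded against by clamping the variance at zero before taking the square root. The following is a minimal sketch of that guard in plain Scala, with no Spark dependency. It is an illustration only and is not claimed to be the change made in pull request 19029; `VarianceGuard`, `clampVariance`, and `safeStd` are hypothetical names introduced here.

```scala
object VarianceGuard {
  /** Clamps a computed variance at zero, so rounding error cannot leave it negative. */
  def clampVariance(variance: Double): Double = math.max(variance, 0.0)

  /** Standard deviation derived from a variance that may be slightly negative. */
  def safeStd(variance: Double): Double = math.sqrt(clampVariance(variance))
}

// A tiny negative value stands in for a variance corrupted by rounding error.
val noisyVariance = -1e-17

println(math.sqrt(noisyVariance))              // sqrt of a negative number is NaN
println(VarianceGuard.safeStd(noisyVariance))  // clamped first, so the result is 0.0
```

Clamping leaves a correctly computed variance untouched, since a true variance is never negative, so the guard only changes results that were already wrong.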