Weichen Xu created SPARK-21818:
----------------------------------

             Summary: MultivariateOnlineSummarizer.variance generate negative result
                 Key: SPARK-21818
                 URL: https://issues.apache.org/jira/browse/SPARK-21818
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 2.2.0
            Reporter: Weichen Xu

Because of numerical error, MultivariateOnlineSummarizer.variance can return a negative variance.

This is a serious bug: many algorithms in MLlib compute the standard deviation as sqrt(variance), so a negative variance produces NaN and crashes the whole algorithm.

We can reproduce this bug with the following code:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

// Four single-sample summarizers, all holding the same value (3.0)
// but with different weights.
val summarizer1 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.7)
val summarizer2 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)
val summarizer3 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.5)
val summarizer4 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)

// Merge them; every sample is identical, so the variance should be 0.
val summarizer = summarizer1
  .merge(summarizer2)
  .merge(summarizer3)
  .merge(summarizer4)

// Prints a negative value instead.
println(summarizer.variance(0))
{code}
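
To illustrate why a slightly negative variance is harmful downstream, here is a minimal sketch of the sqrt(variance) step and one possible guard. It does not use Spark at all. `VarianceGuard` and `safeStd` are hypothetical names introduced only for this example, the input value is made up for illustration and is not the output of the reproduction above, and clamping at zero is an assumed workaround, not necessarily how the issue should be fixed inside the summarizer.

```scala
object VarianceGuard {
  // Hypothetical helper, not a Spark API: treat a negative variance caused by
  // floating-point round-off as zero before taking the square root.
  def safeStd(variance: Double): Double =
    math.sqrt(math.max(variance, 0.0))

  def main(args: Array[String]): Unit = {
    // Illustrative tiny negative value, standing in for a round-off error.
    val roundOff = -1.0e-17

    println(math.sqrt(roundOff)) // NaN
    println(safeStd(roundOff))   // 0.0
  }
}
```

A guard of this kind only hides the symptom in the caller; the value returned by variance itself would still be negative.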