[ https://issues.apache.org/jira/browse/SPARK-16561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-16561:
--------------------------------------
    Summary: Potential numerical problem in MultivariateOnlineSummarizer min/max  (was: Potential numerial problem in MultivariateOnlineSummarizer min/max)

> Potential numerical problem in MultivariateOnlineSummarizer min/max
> -------------------------------------------------------------------
>
>                 Key: SPARK-16561
>                 URL: https://issues.apache.org/jira/browse/SPARK-16561
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.0.0
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>             Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The min/max methods of `MultivariateOnlineSummarizer` use the check
> "nnz(i) < weightSum", which causes a numerical problem and makes the
> result unstable.
> For example, add two vectors:
> [10, -10] with weight 1e10
> [0, 0] with weight 1e-10
> Using MultivariateOnlineSummarizer.min/max we get
> minVector = [10, -10]
> maxVector = [10, -10]
> but the correct result is
> minVector = [0, -10]
> maxVector = [10, 0]
> The cause of the bug is that
> (1e10 + 1e-10) == 1e10 (Double type)
> because of floating-point rounding.
> A different accumulating or merging order may also give a different result.
> For example:
> [10, -10] with weight 1e10
> [0, 0] with weight 1e-7
> ....
> (100 lines of data [0, 0] with weight 1e-7)
> With the input order listed above, we get:
> minVector = [10, -10]
> maxVector = [10, -10]
> but if the input order is:
> [0, 0] with weight 1e-7
> ....
> (100 lines of data [0, 0] with weight 1e-7)
> [10, -10] with weight 1e10
> then the result is:
> minVector = [0, -10]
> maxVector = [10, 0]
> That is because:
> 1e10 + 1e-7 + ... + 1e-7 (added 100 times) == 1e10 (Double type)
> but
> 1e-7 + ... + 1e-7 (added 100 times) + 1e10 = 1.000000000000001E10 != 1e10 (Double type)
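
The order dependence described in the issue can be reproduced outside Spark with a small sketch. The Python below is a simplified model, not Spark's code: `ToySummarizer` and `summarize` are hypothetical names, and the only thing taken from the issue is the "nnz(i) < weightSum" check, read here as "column i has seen at least one zero". Python floats are IEEE 754 doubles, so the rounding behaves like Scala's Double.

```python
# Simplified model of the weight-based zero check described in the issue.
# ToySummarizer is hypothetical; it is NOT Spark's MultivariateOnlineSummarizer.

class ToySummarizer:
    def __init__(self, dim):
        self.weight_sum = 0.0                  # total weight of all samples
        self.nnz = [0.0] * dim                 # per-column weight of nonzero entries
        self.curr_min = [float("inf")] * dim   # min over nonzero entries only
        self.curr_max = [float("-inf")] * dim  # max over nonzero entries only

    def add(self, vector, weight):
        for i, v in enumerate(vector):
            if v != 0.0:                       # zeros are skipped, as for sparse input
                self.curr_min[i] = min(self.curr_min[i], v)
                self.curr_max[i] = max(self.curr_max[i], v)
                self.nnz[i] += weight
        self.weight_sum += weight

    def min(self):
        # Column i is assumed to contain a zero only if nnz[i] < weight_sum.
        return [0.0 if n < self.weight_sum and m > 0.0 else m
                for n, m in zip(self.nnz, self.curr_min)]

    def max(self):
        return [0.0 if n < self.weight_sum and m < 0.0 else m
                for n, m in zip(self.nnz, self.curr_max)]


def summarize(samples):
    """Feed (vector, weight) pairs in order and return (min, max)."""
    s = ToySummarizer(2)
    for vector, weight in samples:
        s.add(vector, weight)
    return s.min(), s.max()


big = ([10.0, -10.0], 1e10)
tiny = [([0.0, 0.0], 1e-7)] * 100

big_first = summarize([big] + tiny)    # each 1e-7 is absorbed by 1e10
tiny_first = summarize(tiny + [big])   # the 1e-7 weights accumulate first

print(big_first)    # ([10.0, -10.0], [10.0, -10.0])
print(tiny_first)   # ([0.0, -10.0], [10.0, 0.0])
```

Both runs see the same 101 samples, yet only the second one notices the zero entries. A check based on integer sample counts rather than accumulated weights would not depend on rounding; that remark is an assumption made here, not something stated in the issue text.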