Weichen Xu created SPARK-16561:
----------------------------------

             Summary: Potential numerical problem in MultivariateOnlineSummarizer min/max
                 Key: SPARK-16561
                 URL: https://issues.apache.org/jira/browse/SPARK-16561
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.0.0, 2.0.1, 2.1.0
            Reporter: Weichen Xu

The min/max methods of `MultivariateOnlineSummarizer` use the check `nnz(i) < weightSum` to decide whether column i contains any zero entries. This comparison is numerically unstable and makes the result depend on rounding.

For example, add two vectors:

    [10, -10] with weight 1e10
    [0, 0] with weight 1e-10

Using MultivariateOnlineSummarizer.min/max we get:

    minVector = [10, -10]
    maxVector = [10, -10]

but the correct result is:

    minVector = [0, -10]
    maxVector = [10, 0]

The cause is that (1e10 + 1e-10) == 1e10 because of floating-point rounding, so the weight of the zero vector is lost and `nnz(i) < weightSum` evaluates to false.

In addition, a different accumulation or merge order can produce a different result. Take this input:

    [10, -10] with weight 1e10
    [0, 0] with weight 1e-7
    ... (100 rows of [0, 0] with weight 1e-7)

With the input order listed above, the result is:

    minVector = [10, -10]
    maxVector = [10, -10]

But if the input order is:

    [0, 0] with weight 1e-7
    ... (100 rows of [0, 0] with weight 1e-7)
    [10, -10] with weight 1e10

then the result is:

    minVector = [0, -10]
    maxVector = [10, 0]

That is because:

    1e10 + 1e-7 + ... + 1e-7 (added 100 times) == 1e10

but

    1e-7 + ... + 1e-7 (added 100 times) + 1e10 = 1.000000000000001E10 != 1e10
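
The behaviour described above can be reproduced outside Spark with a small model of the bookkeeping. The sketch below is an illustration in Python, not Spark's Scala code: the class `WeightedMinMax` and the helper `summarize` are hypothetical names, and the model assumes that `nnz(i)` accumulates the weights of the non-zero entries in column i, which is what the `nnz(i) < weightSum` check implies. Python floats are IEEE 754 doubles, so the rounding matches what the JVM does.

```python
class WeightedMinMax:
    """Simplified, hypothetical model of the min/max bookkeeping in the issue."""

    def __init__(self, n):
        self.nnz = [0.0] * n                 # summed weight of non-zero entries per column
        self.weight_sum = 0.0                # summed weight of all rows
        self.curr_min = [float("inf")] * n   # min over non-zero entries only
        self.curr_max = [float("-inf")] * n  # max over non-zero entries only

    def add(self, vec, weight):
        for i, v in enumerate(vec):
            if v != 0.0:                     # zero entries are skipped
                self.nnz[i] += weight
                self.curr_min[i] = min(self.curr_min[i], v)
                self.curr_max[i] = max(self.curr_max[i], v)
        self.weight_sum += weight
        return self

    def min(self):
        # The check under discussion: a column is assumed to contain a zero
        # only if its non-zero weight is strictly below the total weight.
        return [0.0 if self.nnz[i] < self.weight_sum and m > 0.0 else m
                for i, m in enumerate(self.curr_min)]

    def max(self):
        return [0.0 if self.nnz[i] < self.weight_sum and m < 0.0 else m
                for i, m in enumerate(self.curr_max)]


def summarize(rows):
    """Feed (vector, weight) pairs in order and return (min, max)."""
    s = WeightedMinMax(2)
    for vec, w in rows:
        s.add(vec, w)
    return s.min(), s.max()


heavy = ([10.0, -10.0], 1e10)
light = [([0.0, 0.0], 1e-7)] * 100

two_rows = summarize([heavy, ([0.0, 0.0], 1e-10)])
heavy_first = summarize([heavy] + light)
light_first = summarize(light + [heavy])

print(1e10 + 1e-10 == 1e10)  # True: the small weight is absorbed
print(two_rows)              # ([10.0, -10.0], [10.0, -10.0])
print(heavy_first)           # ([10.0, -10.0], [10.0, -10.0])
print(light_first)           # ([0.0, -10.0], [10.0, 0.0])
```

Only `light_first` gives the expected vectors, because the small weights are summed before being added to 1e10 and so survive the rounding.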