[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-6480:
-----------------------------------

    Assignee: Apache Spark  (was: Sean Owen)

> histogram() bucket function is wrong in some simple edge cases
> --------------------------------------------------------------
>
>                 Key: SPARK-6480
>                 URL: https://issues.apache.org/jira/browse/SPARK-6480
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: Sean Owen
>            Assignee: Apache Spark
>
> (Credit to a customer report here) This test would fail now:
> {code}
> val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
> assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
> {code}
> Because it returns 3, 1, 0. The problem ultimately traces to the 'fast'
> bucket function that judges buckets based on a multiple of the gap between
> first and second elements. Errors multiply and the end of the final bucket
> fails to include the max.
> Fairly plausible use case actually.
> This can be tightened up easily with a slightly better expression. It will
> also fix this test, which is actually expecting the wrong answer:
> {code}
> val rdd = sc.parallelize(6 to 99)
> val (histogramBuckets, histogramResults) = rdd.histogram(9)
> val expectedHistogramResults =
>   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
> {code}
> (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
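To make the rounding problem described in the issue concrete, here is a minimal standalone sketch in plain Scala (no Spark required). It is not the Spark source or the committed patch: `HistogramSketch`, `gapBasedBucket`, `ratioBasedBucket`, and `counts` are hypothetical names, and the bucket boundaries are built by hand as `min + i * (max - min) / count`, which is an assumption about how evenly spaced boundaries might be produced. The first function judges buckets by the gap between the first two boundaries, as the report describes; the second is one possible "slightly better expression" that divides by the full range first.

```scala
object HistogramSketch {
  // Gap-based variant (illustrative): the bucket index is a multiple of the
  // gap between the first and second boundaries, so any rounding error in
  // that gap is scaled up for values near the top of the range.
  def gapBasedBucket(min: Double, gap: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN) None
    else {
      val bucketNumber = (e - min) / gap
      if (bucketNumber > count || bucketNumber < 0) None
      else Some(math.min(bucketNumber.toInt, count - 1))
    }
  }

  // Ratio-based variant (illustrative, assumes max > min): take the fraction
  // of the full range first, then scale by the bucket count. e == max maps to
  // the last bucket, whose upper end is inclusive.
  def ratioBasedBucket(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) None
    else {
      val bucketNumber = (((e - min) / (max - min)) * count).toInt
      Some(math.min(bucketNumber, count - 1))
    }
  }

  // Tally how many values land in each bucket; values mapped to None are dropped.
  def counts(data: Seq[Double], count: Int, bucketOf: Double => Option[Int]): Array[Long] = {
    val result = new Array[Long](count)
    data.foreach(e => bucketOf(e).foreach(i => result(i) += 1))
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 1, 1, 2, 3, 3).map(_.toDouble)
    val (min, max, count) = (data.min, data.max, 3)
    // Hand-built, evenly spaced boundaries (an assumption of this sketch).
    val boundaries = Array.tabulate(count + 1)(i => min + i * (max - min) / count)
    val gap = boundaries(1) - boundaries(0)

    println(counts(data, count, gapBasedBucket(min, gap, count)).mkString(", "))
    println(counts(data, count, ratioBasedBucket(min, max, count)).mkString(", "))
  }
}
```

In this sketch the gap comes out slightly smaller than 2/3, so `(3 - 1) / gap` lands just above 3 and the maximum value is rejected, which reproduces the 3, 1, 0 result from the report. The ratio-based variant yields 3, 1, 2 for the same data.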