[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381588#comment-14381588 ]

Frank Rosner commented on SPARK-6480:
-------------------------------------

[~srowen] will do today!

histogram() bucket function is wrong in some simple edge cases
--------------------------------------------------------------

Key: SPARK-6480
URL: https://issues.apache.org/jira/browse/SPARK-6480
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen

(Credit to a customer report here)

This test would fail now:

{code}
val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
{code}

Because it returns 3, 1, 0. The problem ultimately traces to the 'fast' bucket function that judges buckets based on a multiple of the gap between first and second elements. Errors multiply and the end of the final bucket fails to include the max.

Fairly plausible use case actually.

This can be tightened up easily with a slightly better expression. It will also fix this test, which is actually expecting the wrong answer:

{code}
val rdd = sc.parallelize(6 to 99)
val (histogramBuckets, histogramResults) = rdd.histogram(9)
val expectedHistogramResults = Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
{code}

(Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
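For readers following the thread, the failure mode in the issue description can be reproduced without a cluster. The block below is only a sketch under stated assumptions, not the Spark source and not the change in the pull request: the object HistogramBucketSketch and the helpers boundaries, incrementBucket, ratioBucket and counts are hypothetical names made up for illustration. incrementBucket follows the description's wording (bucket index as a multiple of the gap between the first two boundaries); ratioBucket is one possible "slightly better expression" that divides by the full range first and rejects values outside [min, max].

```scala
object HistogramBucketSketch {
  // Hypothetical helper: evenly spaced boundaries, with the last one pinned to max.
  def boundaries(min: Double, max: Double, count: Int): Array[Double] = {
    val span = max - min
    (0 until count).map(i => min + (i * span) / count).toArray :+ max
  }

  // Assumed shape of the problematic approach: divide by the gap between the
  // first two boundaries. Rounding error in that gap is multiplied for values
  // near max, so max itself can land just past the final bucket.
  def incrementBucket(min: Double, increment: Double, count: Int)(e: Double): Option[Int] = {
    val bucketNumber = (e - min) / increment
    if (e.isNaN || bucketNumber > count || bucketNumber < 0) None
    else Some(math.min(bucketNumber.toInt, count - 1))
  }

  // Assumed alternative: position along the whole range, then scale by count.
  // e == max gives exactly count, which is clamped into the last bucket.
  def ratioBucket(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) None
    else Some(math.min((((e - min) / (max - min)) * count).toInt, count - 1))
  }

  // Tally how many values fall in each bucket; values mapped to None are dropped.
  def counts(data: Seq[Double], count: Int)(bucket: Double => Option[Int]): Array[Long] = {
    val result = new Array[Long](count)
    data.foreach(e => bucket(e).foreach(i => result(i) += 1))
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1.0, 1.0, 1.0, 2.0, 3.0, 3.0)
    val b = boundaries(data.min, data.max, 3)
    // Gap-based variant: the two 3.0 values fall outside the last bucket.
    println(counts(data, 3)(incrementBucket(b(0), b(1) - b(0), 3)).mkString(", ")) // 3, 1, 0
    // Ratio-based variant: max is counted in the last bucket.
    println(counts(data, 3)(ratioBucket(b.head, b.last, 3)).mkString(", "))        // 3, 1, 2
  }
}
```

With the same ratio-based sketch, the values 6 to 99 split into 9 buckets give 11, 10, 10, 11, 10, 10, 11, 10, 11, which matches the corrected expectation quoted in the description.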
[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379913#comment-14379913 ]

Sean Owen commented on SPARK-6480:
----------------------------------

[~frosner] can you have a peek at the PR and see if it makes sense to you? I'd like to get another set of eyes on it before committing.
[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377611#comment-14377611 ]

Frank Rosner commented on SPARK-6480:
-------------------------------------

Thanks for picking it up [~srowen]!
[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases
[ https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14376894#comment-14376894 ]

Apache Spark commented on SPARK-6480:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5148