[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-26 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381588#comment-14381588
 ] 

Frank Rosner commented on SPARK-6480:
-

[~srowen] will do today!

 histogram() bucket function is wrong in some simple edge cases
 --

 Key: SPARK-6480
 URL: https://issues.apache.org/jira/browse/SPARK-6480
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Sean Owen
Assignee: Sean Owen

 (Credit to a customer report here) This test would fail now: 
 {code}
 val rdd = sc.parallelize(Seq(1, 1, 1, 2, 3, 3))
 assert(Array(3, 1, 2) === rdd.map(_.toDouble).histogram(3)._2)
 {code}
 It fails because histogram() returns 3, 1, 0. The problem ultimately traces 
 to the 'fast' bucket function, which judges buckets based on a multiple of 
 the gap between the first and second elements. Rounding errors multiply, and 
 the end of the final bucket fails to include the max.
 This is a fairly plausible use case.
 The expression can be tightened up easily. The fix will also correct this 
 test, which currently expects the wrong answer:
 {code}
 val rdd = sc.parallelize(6 to 99)
 val (histogramBuckets, histogramResults) = rdd.histogram(9)
 val expectedHistogramResults =
   Array(11, 10, 11, 10, 10, 11, 10, 10, 11)
 {code}
 (Should be {{Array(11, 10, 10, 11, 10, 10, 11, 10, 11)}})
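
The failure mode described above can be reproduced without Spark. The following is only a sketch under stated assumptions: `boundaries`, `gapBucket`, `rangeBucket` and `counts` are hypothetical names, the boundaries are assumed to be built as `min + i * increment`, and neither lookup is claimed to be the actual Spark code or the change in the pull request. It contrasts a lookup that divides by the gap between the first two boundaries with one that scales by the full range.

```scala
// Illustrative sketch only: all names here are hypothetical, not Spark's API.
object HistogramSketch {
  // Evenly spaced bucket boundaries, assumed to be built as min + i * increment.
  def boundaries(min: Double, max: Double, count: Int): Array[Double] = {
    val increment = (max - min) / count
    Array.tabulate(count + 1)(i => min + i * increment)
  }

  // Gap-based lookup: divides by the gap between the first two boundaries.
  // A gap that is slightly too small pushes max just past the last bucket.
  def gapBucket(min: Double, gap: Double, count: Int)(e: Double): Option[Int] = {
    val bucketNumber = (e - min) / gap
    if (e.isNaN || bucketNumber < 0 || bucketNumber > count) None
    else Some(math.min(bucketNumber.toInt, count - 1))
  }

  // Range-based lookup: scales by the full range, so max maps to the last bucket.
  def rangeBucket(min: Double, max: Double, count: Int)(e: Double): Option[Int] = {
    if (e.isNaN || e < min || e > max) None
    else Some(math.min((((e - min) / (max - min)) * count).toInt, count - 1))
  }

  // Counts values per bucket; values mapped to None are silently dropped.
  def counts(data: Seq[Double], count: Int)(bucketOf: Double => Option[Int]): Array[Long] = {
    val result = new Array[Long](count)
    data.flatMap(bucketOf(_)).foreach(i => result(i) += 1)
    result
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1.0, 1.0, 1.0, 2.0, 3.0, 3.0)
    val b = boundaries(1.0, 3.0, 3)
    // Gap-based: the two 3.0 values fall outside the last bucket.
    println(counts(data, 3)(gapBucket(b(0), b(1) - b(0), 3)).mkString(", ")) // 3, 1, 0
    // Range-based: the max is included in the last bucket.
    println(counts(data, 3)(rangeBucket(1.0, 3.0, 3)).mkString(", "))        // 3, 1, 2
  }
}
```

In this sketch the second boundary rounds to 1.6666666666666665, so the derived gap is one ulp smaller than 2/3 and `(3.0 - 1.0) / gap` comes out slightly above 3, which the gap-based check rejects.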



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14379913#comment-14379913
 ] 

Sean Owen commented on SPARK-6480:
--

[~frosner] Can you have a peek at the PR and see if it makes sense to you? I'd 
like to get another set of eyes on it before committing.




[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-24 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377611#comment-14377611
 ] 

Frank Rosner commented on SPARK-6480:
-

Thanks for picking it up [~srowen]!




[jira] [Commented] (SPARK-6480) histogram() bucket function is wrong in some simple edge cases

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376894#comment-14376894
 ] 

Apache Spark commented on SPARK-6480:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5148
