[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Ah, that's fine :). It was just an option. I will follow discussion and help sort it out in any event. --- - To

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon These two JIRAs change percentile_approx in different ways, so maybe it's better to use different JIRAs? --- - To

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon uh...just saw this, already created a new [JIRA](url) and [PR](https://github.com/apache/spark/pull/19438), is it also ok? ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Oh, optionally, we can just edit the JIRA I guess. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen @jiangxb1987 OK, I'll close this JIRA and creating a new JIRA as improvement instead of bugfix. --- - To unsubscribe,

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19406 This is actually a bugfix instead of improvement, I think we should follow the approach that @srowen have suggested. --- -

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82419/ Test FAILed. ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread gatorsmile
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19406 cc @WeichenXu123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 Your proposed way is more consistent with the paper and I also think it's more natural to start from 0. The purpose of this patch is to not introduce accuracy regression for other cases, and

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 I think we should probably follow the paper, no? this should fix more cases. Yes, this case also failed for me. The answer 499 is OK too. ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen I tried your way on my laptop, it works fine. But an existing test case failed: (500 expected , 499 returned) ``` test("percentile_approx, supports constant folding for parameter

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 Your new test also passes when changing ... ``` val targetError = relativeError * count ... var i = 0 ``` I think this is more likely to be the correct fix as

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 I read the original paper at http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf and think there might be two bugs in the implementation here that might actually cause the

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen Before the change, the answer in the test case is `2, 2, 2`. Based on the code before the change, percentile_approx would never return the first element when percentile is in

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82384/ Test FAILed. ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 (What is the answer in your test case before the change?) Yea, I guess I am not sure why this method has to use relativeError at all, but I didn't think about it much. If it didn't I think it

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 Yes it's ok if consider it's approximate with relativeError. But the point of this pr is that there could be a further improvement in those special cases. IMHO it's more accurate and intuitive for

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82379/ Test FAILed. ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 It's probably OK, though kind of special cases the first quantile. This expression doesn't incorporate relativeError. I'm actually not sure why it needs to account for relativeError here; it's an

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10% percentage. Without this fix, it

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 The JIRA doesn't explain what this is meant to fix. What case does this help get more correct? --- - To unsubscribe, e-mail:

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82368/ Test FAILed. ---

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-09-30 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit

[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-09-30 Thread wzhfy
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 cc @thunterdb @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands,