[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Ah, that's fine :). It was just an option. I will follow discussion and help sort it out in any event. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon These two JIRAs change percentile_approx in different ways, so maybe it's better to use different JIRAs? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @HyukjinKwon uh...just saw this, already created a new [JIRA](url) and [PR](https://github.com/apache/spark/pull/19438), is it also ok? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19406 Oh, optionally, we can just edit the JIRA I guess. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen @jiangxb1987 OK, I'll close this JIRA and creating a new JIRA as improvement instead of bugfix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19406 This is actually a bugfix instead of improvement, I think we should follow the approach that @srowen have suggested. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82419/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit [`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19406 cc @WeichenXu123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 Your proposed way is more consistent with the paper and I also think it's more natural to start from 0. The purpose of this patch is to not introduce accuracy regression for other cases, and improve some corner cases. I can't tell which one is better, I think both ways can work. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 I think we should probably follow the paper, no? this should fix more cases. Yes, this case also failed for me. The answer 499 is OK too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen I tried your way on my laptop, it works fine. But an existing test case failed: (500 expected , 499 returned) ``` test("percentile_approx, supports constant folding for parameter accuracy and percentages") { withTempView(table) { (1 to 1000).toDF("col").createOrReplaceTempView(table) checkAnswer( spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 + 800D) FROM $table"), Row(Seq(500D)) ) } } ``` It seems to me both way (rounding targetError and starting from 1/ not rounding targetError and starting from 0) can get the answer within relative error. One way is more accurate in some cases, and the other is more accurate in some other cases. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 Your new test also passes when changing ... ``` val targetError = relativeError * count ... var i = 0 ``` I think this is more likely to be the correct fix as it matches the paper and isn't just patching the case of the first element. It does make another test fail, just barely -- seems still within tolerances. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)** for PR 19406 at commit [`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 I read the original paper at http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf and think there might be two bugs in the implementation here that might actually cause the problem. First, targetError should not be rounded up. The algorithm in 2.2.1 does not show that. Second, this starts by considering sampled(1), not sampled(0). But the impl and algorithm in the paper seem to be 0-based. I would try it myself but for some reason I'm getting null from your query. I'll try again on my end but what do you think of this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen Before the change, the answer in the test case is `2, 2, 2`. Based on the code before the change, percentile_approx would never return the first element when percentile is in (relativeError, 1/N], where relativeError default is 1/1, and N is the total number of elements. But ideally, percentiles in [0, 1/N] should all return the first element as the answer. > why this method has to use relativeError `QuantileSummaries` is a sampled data structure, `relativeError` is used to compute `targeError`, which decides the error bound of the target rank (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L204). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82384/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit [`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 (What is the answer in your test case before the change?) Yea, I guess I am not sure why this method has to use relativeError at all, but I didn't think about it much. If it didn't I think it already does the same thing you've done. Which then leads me to ask why there's an approximation in this part of the process at all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 Yes it's ok if consider it's approximate with relativeError. But the point of this pr is that there could be a further improvement in those special cases. IMHO it's more accurate and intuitive for end users, such as return 1 instead of 2 as 10% percentile among 1 to 10. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)** for PR 19406 at commit [`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82379/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82379 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit [`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82379 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)** for PR 19406 at commit [`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 It's probably OK, though kind of special cases the first quantile. This expression doesn't incorporate relativeError. I'm actually not sure why it needs to account for relativeError here; it's an extra level of approximation. But the asymmetry seems a bit wrong. It is after all approximate? the loop below actually accounts for the same case, but incorporates relativeError. When in doubt ask the author: @clockfly --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 @srowen For example, given input data 1 to 10, if a user queries 10% (or even less) percentile, it should return 1, because the first value 1 already reaches 10% percentage. Without this fix, it returned 2. I've updated description by adding this example in JIRA and PR, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19406 The JIRA doesn't explain what this is meant to fix. What case does this help get more correct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit [`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82368/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19406 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19406 **[Test build #82368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)** for PR 19406 at commit [`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/19406 cc @thunterdb @cloud-fan @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org