[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19406
  
Ah, that's fine :). It was just an option. I will follow discussion and 
help sort it out in any event.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon These two JIRAs change percentile_approx in different ways, so 
maybe it's better to use different JIRAs?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@HyukjinKwon uh...just saw this, already created a new [JIRA](url) and 
[PR](https://github.com/apache/spark/pull/19438), is it also ok?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19406
  
Oh, optionally, we can just edit the JIRA I guess.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-05 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen @jiangxb1987 OK, I'll close this JIRA and creating a new JIRA as 
improvement instead of bugfix.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/19406
  
This is actually a bugfix instead of improvement, I think we should follow 
the approach that @srowen have suggested.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82419/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82419 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)**
 for PR 19406 at commit 
[`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19406
  
cc @WeichenXu123 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
Your proposed way is more consistent with the paper and I also think it's 
more natural to start from 0.
The purpose of this patch is to not introduce accuracy regression for other 
cases, and improve some corner cases.
I can't tell which one is better, I think both ways can work.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
I think we should probably follow the paper, no? this should fix more 
cases. Yes, this case also failed for me. The answer 499 is OK too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen I tried your way on my laptop, it works fine. But an existing test 
case failed: (500 expected , 499 returned)
```
  test("percentile_approx, supports constant folding for parameter accuracy 
and percentages") {
withTempView(table) {
  (1 to 1000).toDF("col").createOrReplaceTempView(table)
  checkAnswer(
spark.sql(s"SELECT percentile_approx(col, array(0.25 + 0.25D), 200 
+ 800D) FROM $table"),
Row(Seq(500D))
  )
}
  }
```
It seems to me both way (rounding targetError and starting from 1/ not 
rounding targetError and starting from 0) can get the answer within relative 
error. One way is more accurate in some cases, and the other is more accurate 
in some other cases.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
Your new test also passes when changing ...

```
val targetError = relativeError * count
...
var i = 0
```

I think this is more likely to be the correct fix as it matches the paper 
and isn't just patching the case of the first element.

It does make another test fail, just barely -- seems still within 
tolerances.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82419 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82419/testReport)**
 for PR 19406 at commit 
[`9815ce8`](https://github.com/apache/spark/commit/9815ce8e17e34422f8c915d115061a9635abd119).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
I read the original paper at 
http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf and 
think there might be two bugs in the implementation here that might actually 
cause the problem.

First, targetError should not be rounded up. The algorithm in 2.2.1 does 
not show that. 

Second, this starts by considering sampled(1), not sampled(0). But the impl 
and algorithm in the paper seem to be 0-based.

I would try it myself but for some reason I'm getting null from your query. 
I'll try again on my end but what do you think of this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-03 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen Before the change, the answer in the test case is `2, 2, 2`.
Based on the code before the change, percentile_approx would never return 
the first element when percentile is in (relativeError, 1/N], where 
relativeError default is 1/1, and N is the total number of elements.
But ideally, percentiles in [0, 1/N] should all return the first element as 
the answer.

> why this method has to use relativeError

`QuantileSummaries` is a sampled data structure, `relativeError` is used to 
compute `targeError`, which decides the error bound of the target rank 
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L204).




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82384/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82384 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)**
 for PR 19406 at commit 
[`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
(What is the answer in your test case before the change?)
Yea, I guess I am not sure why this method has to use relativeError at all, 
but I didn't think about it much. If it didn't I think it already does the same 
thing you've done. Which then leads me to ask why there's an approximation in 
this part of the process at all.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
Yes it's ok if consider it's approximate with relativeError. But the point 
of this pr is that there could be a further improvement in those special cases. 
IMHO it's more accurate and intuitive for end users, such as return 1 instead 
of 2 as 10% percentile among 1 to 10.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82384 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82384/testReport)**
 for PR 19406 at commit 
[`dbc3d47`](https://github.com/apache/spark/commit/dbc3d47b0a56113032d2a4565180932e4ef26219).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82379/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82379 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)**
 for PR 19406 at commit 
[`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82379 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82379/testReport)**
 for PR 19406 at commit 
[`8c8c22d`](https://github.com/apache/spark/commit/8c8c22dbebe99def6127b49988dfc4f886797bd6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-02 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
It's probably OK, though kind of special cases the first quantile. This 
expression doesn't incorporate relativeError. I'm actually not sure why it 
needs to account for relativeError here; it's an extra level of approximation. 
But the asymmetry seems a bit wrong.

It is after all approximate? the loop below actually accounts for the same 
case, but incorporates relativeError.

When in doubt ask the author: @clockfly 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
@srowen For example, given input data 1 to 10, if a user queries 10% (or 
even less) percentile, it should return 1, because the first value 1 already 
reaches 10% percentage. Without this fix, it returned 2. I've updated 
description by adding this example in JIRA and PR, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19406
  
The JIRA doesn't explain what this is meant to fix. What case does this 
help get more correct?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82368 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)**
 for PR 19406 at commit 
[`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82368/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-10-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19406
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-09-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19406
  
**[Test build #82368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82368/testReport)**
 for PR 19406 at commit 
[`24f8295`](https://github.com/apache/spark/commit/24f8295498a7ad6d2d99ea27a196ccf154165907).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19406: [SPARK-22179] percentile_approx should choose the first ...

2017-09-30 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/19406
  
cc @thunterdb @cloud-fan @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org