GitHub user wzhfy opened a pull request: https://github.com/apache/spark/pull/19438
[SPARK-22208] [SQL] Improve percentile_approx by not rounding up targetError and starting from index 0

## What changes were proposed in this pull request?

Currently, percentile_approx never returns the first element when the percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 and N is the total number of elements. Ideally, percentiles in [0, 1/N] should all return the first element as the answer. For example, given the input data 1 to 10, if a user queries the 10% (or even smaller) percentile, the result should be 1, because the first value already covers 10% of the data. Currently it returns 2.

Based on the paper, targetError should not be rounded up, and the search index should start from 0 instead of 1. Following the paper fixes the cases mentioned above.

## How was this patch tested?

Added a new test case and fixed existing test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark improve_percentile_approx

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19438.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19438

----

commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-09-30T16:04:32Z

    return the first element for small percentage

commit 8c8c22dbebe99def6127b49988dfc4f886797bd6
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-10-02T10:24:28Z

    fix test

commit dbc3d47b0a56113032d2a4565180932e4ef26219
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-10-02T14:53:04Z

    fix test

commit 9815ce8e17e34422f8c915d115061a9635abd119
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-10-03T14:51:55Z

    fix pyspark test

commit f2b153800ebdf10999d4a8bb3578101a12f6d631
Author: Zhenhua Wang <wzh_...@163.com>
Date: 2017-10-05T15:47:27Z

    follow the paper and fix sparkR test

----
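The intended semantics can be illustrated with a minimal sketch. This is not Spark's actual quantile-summary implementation (which uses an approximate algorithm with a relativeError bound); it is a hypothetical exact-percentile function showing why the target rank `p * N` should not be rounded up and why the search should start from index 0: the element at index i covers ranks up to i + 1, so any percentile p in (0, 1/N] maps to the first element.

```python
def percentile(sorted_data, p):
    """Return the value at percentile p (0 < p <= 1) of sorted_data.

    Illustrative exact version of the fixed semantics: the target
    rank p * n is NOT rounded up, and the scan starts at index 0.
    """
    n = len(sorted_data)
    rank = p * n  # target rank, kept fractional
    # Element at index i covers ranks up to i + 1, so return the
    # first element whose covered rank reaches the target.
    for i in range(n):
        if i + 1 >= rank:
            return sorted_data[i]
    return sorted_data[-1]

data = list(range(1, 11))  # 1 .. 10
print(percentile(data, 0.10))  # 1 (the fix: previously 2 was returned)
print(percentile(data, 0.15))  # 2 (rank 1.5 exceeds the first element)
```

With the old behavior (rounding the target rank up and starting the scan at index 1), the 10% query on 1..10 would skip the first element and return 2, even though the first value alone already accounts for 10% of the data.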