GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19438

    [SPARK-22208] [SQL] Improve percentile_approx by not rounding up 
targetError and starting from index 0

    ## What changes were proposed in this pull request?
    
    Currently percentile_approx never returns the first element when the 
percentile is in (relativeError, 1/N], where relativeError defaults to 1/10000 
and N is the total number of elements. Ideally, percentiles in [0, 1/N] should 
all return the first element as the answer.
    
    For example, given input data 1 to 10, if a user queries the 10% percentile 
(or any smaller percentile), it should return 1, because the first value 1 
already covers 10% of the data. Currently it returns 2.
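
    The expected behaviour in the example above can be sketched with an exact 
(non-approximate) percentile. This is only an illustration, not Spark code: the 
helper name `exact_percentile` is hypothetical, and it assumes the percentile is 
defined as the smallest element whose cumulative fraction reaches the requested 
percentage.

```python
import math


def exact_percentile(values, percentage):
    """Return the smallest element whose cumulative fraction >= percentage.

    Illustrative helper (not part of Spark): assumes a non-empty input and
    0 <= percentage <= 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    # 1-based rank of the answer; any percentage in [0, 1/N] maps to rank 1.
    rank = max(1, math.ceil(percentage * n))
    return ordered[rank - 1]


data = list(range(1, 11))  # input data 1 to 10
print(exact_percentile(data, 0.10))  # first element already reaches 10%
print(exact_percentile(data, 0.05))  # anything below 1/N also maps to the first element
```

    Under this definition every percentage in [0, 1/N] resolves to the first 
element, which is the result the description asks percentile_approx to match.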
    
    According to the paper, targetError should not be rounded up, and the 
search index should start from 0 instead of 1. By following the paper, we 
should be able to fix the cases mentioned above.
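
    A simplified sketch of how these two changes affect the lookup is given 
below. It is not the actual Spark implementation: `Sample`, `query`, and the 
`round_up_error` / `start_index` switches are hypothetical names introduced for 
illustration, and the summary is assumed to be uncompressed, so every sample 
has g = 1 and delta = 0.

```python
import math
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    """One entry of a quantile summary (hypothetical, simplified)."""
    value: float
    g: int      # gap between this sample's minimum rank and the previous one's
    delta: int  # uncertainty of this sample's rank


def query(sampled: List[Sample], count: int, quantile: float,
          relative_error: float = 1e-4,
          round_up_error: bool = False,
          start_index: int = 0) -> Optional[float]:
    """Look up a quantile in the summary.

    round_up_error=True with start_index=1 mimics the behaviour described as
    current; the defaults mimic the proposed behaviour. Both are assumptions
    made for this sketch.
    """
    if not sampled:
        return None
    if quantile <= relative_error:
        return sampled[0].value
    if quantile >= 1 - relative_error:
        return sampled[-1].value

    rank = math.ceil(quantile * count)
    target_error = relative_error * count
    if round_up_error:
        target_error = math.ceil(target_error)

    min_rank = 0
    for i in range(start_index, len(sampled) - 1):
        cur = sampled[i]
        min_rank += cur.g
        max_rank = min_rank + cur.delta
        # Accept the first sample whose rank bounds lie within the error band.
        if max_rank - target_error <= rank <= min_rank + target_error:
            return cur.value
    return sampled[-1].value


# Input data 1 to 10, stored without compression (assumption of this sketch).
summary = [Sample(float(v), 1, 0) for v in range(1, 11)]

old_result = query(summary, 10, 0.10, round_up_error=True, start_index=1)
new_result = query(summary, 10, 0.10)
print(old_result, new_result)
```

    In this sketch the rounded-up error band combined with a start at index 1 
skips the first sample, so the 10% query lands on 2; with the unrounded error 
and a start at index 0 it lands on 1.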
    
    ## How was this patch tested?
    
    Added a new test case and fixed existing test cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark improve_percentile_approx

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19438.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19438
    
----
commit 24f8295498a7ad6d2d99ea27a196ccf154165907
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-09-30T16:04:32Z

    return the first element for small percentage

commit 8c8c22dbebe99def6127b49988dfc4f886797bd6
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-10-02T10:24:28Z

    fix test

commit dbc3d47b0a56113032d2a4565180932e4ef26219
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-10-02T14:53:04Z

    fix test

commit 9815ce8e17e34422f8c915d115061a9635abd119
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-10-03T14:51:55Z

    fix pyspark test

commit f2b153800ebdf10999d4a8bb3578101a12f6d631
Author: Zhenhua Wang <wzh_...@163.com>
Date:   2017-10-05T15:47:27Z

    follow the paper and fix sparkR test

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
