[GitHub] spark pull request #14868: Implements percentile_approx aggregation function...

clockfly Mon, 29 Aug 2016 14:25:44 -0700

GitHub user clockfly opened a pull request:

    https://github.com/apache/spark/pull/14868


    Implements percentile_approx aggregation function which supports partial 
aggregation.

    ## What changes were proposed in this pull request?
    
    This PR implements the aggregation function `percentile_approx`, which
    returns the approximate percentile(s) of a column at the given
    percentage(s). A percentile is a watermark value below which a given
    percentage of the column values fall. For example, the percentile of column
    `col` at percentage 50% is the median value of column `col`.
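
As a reference point for what is being approximated, here is a minimal sketch of an exact percentile under the nearest-rank definition. The function name `exact_percentile` and the choice of nearest-rank (no interpolation) are assumptions made for illustration; they are not taken from this PR, and the approximate function may return a slightly different value.

```python
import math


def exact_percentile(values, percentage):
    """Return the smallest value v such that at least `percentage`
    of the input values are less than or equal to v (nearest-rank)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be in [0.0, 1.0]")
    ordered = sorted(values)
    # 1-based rank of the answer; clamp to 1 so that 0.0 returns the minimum.
    rank = max(1, math.ceil(percentage * len(ordered)))
    return ordered[rank - 1]


# With an odd number of values, percentage 0.5 picks the middle element.
sample = [40, 15, 50, 20, 35]
median = exact_percentile(sample, 0.5)
```

This exact version sorts the whole column, so its memory use grows with the input. That cost is what a bounded-memory approximation avoids.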
    
    ### Syntax:
    ```
    # Returns the percentile at the given percentage. The approximation error
    # can be reduced by increasing the accuracy parameter, at the cost of memory.
    percentile_approx(col, percentage [, accuracy])
    
    # Returns an array of percentiles, one for each percentage in the given array
    percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy])
    ```
    
    ### Features:
    1. This function supports partial aggregation.
    2. The memory consumption is bounded. The larger the `accuracy` parameter,
    the smaller the approximation error. The default accuracy is 10000, which
    matches Hive's default setting. Choose a smaller value for a smaller memory
    footprint.
    3. This function supports window function aggregation.
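
To illustrate what partial aggregation with bounded memory means, here is a minimal sketch that builds one summary per partition, merges the summaries, and evaluates the percentile once at the end. It deliberately swaps in a fixed-range histogram in place of `QuantileSummaries`. The class `HistogramSummary`, its `bins` parameter (playing a role loosely similar to `accuracy`), and the known value range are all assumptions for illustration, not this PR's implementation.

```python
import math


class HistogramSummary:
    """Toy fixed-range histogram standing in for a quantile summary buffer."""

    def __init__(self, lo, hi, bins):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.counts = [0] * bins  # memory is bounded by `bins`, not by input size
        self.total = 0

    def update(self, value):
        # Map the value to a bin index, clamping out-of-range values to the edges.
        idx = int((value - self.lo) * self.bins / (self.hi - self.lo))
        self.counts[min(max(idx, 0), self.bins - 1)] += 1
        self.total += 1

    def merge(self, other):
        # Partial aggregation: two buffers combine by element-wise addition.
        merged = HistogramSummary(self.lo, self.hi, self.bins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        merged.total = self.total + other.total
        return merged

    def percentile(self, percentage):
        # Upper edge of the first bin whose cumulative count reaches the target rank.
        target = max(1, math.ceil(percentage * self.total))
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.lo + (idx + 1) * (self.hi - self.lo) / self.bins
        raise ValueError("summary is empty")


def summarize(values, bins):
    summary = HistogramSummary(0.0, 1000.0, bins)
    for v in values:
        summary.update(v)
    return summary


data = [float(v) for v in range(1000)]
partitions = [data[i::4] for i in range(4)]       # four hypothetical partitions
partials = [summarize(p, 100) for p in partitions]
merged = partials[0]
for partial in partials[1:]:
    merged = merged.merge(partial)
approx_median = merged.percentile(0.5)
```

In this toy, the merged buffer is identical to the one a single pass over all the data would produce, and the estimate is off by at most one bin width. More bins give a smaller error at the cost of more memory, which is the same trade-off the `accuracy` parameter exposes.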
    
    ### Example usages:
    ```
    -- Returns the 25th percentile value, with the default accuracy
    SELECT percentile_approx(col, 0.25) FROM table
    
    -- Returns an array of percentile values (25th, 50th, 75th), with the
    -- default accuracy
    SELECT percentile_approx(col, array(0.25, 0.5, 0.75)) FROM table
    
    -- Returns the 25th percentile value, with a custom accuracy of 100
    -- (a larger accuracy yields a smaller approximation error)
    SELECT percentile_approx(col, 0.25, 100) FROM table
    
    -- Returns the 25th and 50th percentile values, with a custom accuracy
    -- of 100
    SELECT percentile_approx(col, array(0.25, 0.5), 100) FROM table
    ```
    
    ### NOTE:
    1. The `percentile_approx` implementation is different from Hive's, so the
    result of the same query may differ slightly from Hive's result. This
    implementation uses `QuantileSummaries` as the underlying probabilistic data
    structure, and mainly follows the paper "Space-efficient Online Computation
    of Quantile Summaries" by Michael Greenwald and Sanjeev Khanna
    (http://dx.doi.org/10.1145/375663.375670).
    2. The current implementation of `QuantileSummaries` doesn't support 
automatic compression. This PR has a rule to do compression automatically at 
the caller side, but it may not be optimal.
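
The idea of compressing at the caller side can be sketched as follows: the summary's own insert never compresses, so the caller checks the buffer size after each insert and triggers compression itself. Everything here is hypothetical: the class names, the `threshold` value, and the pair-folding compression strategy are placeholders, and none of them is the actual rule used in this PR.

```python
import bisect


class UncompressedSummary:
    """Toy summary whose insert() never compresses on its own."""

    def __init__(self):
        self.samples = []  # sorted list of (value, weight) pairs

    def insert(self, value):
        bisect.insort(self.samples, (value, 1))

    def compress(self):
        # Placeholder strategy: fold each adjacent pair into its larger value,
        # summing the weights so the total count is preserved.
        folded = []
        for i in range(0, len(self.samples) - 1, 2):
            (_, w1), (v2, w2) = self.samples[i], self.samples[i + 1]
            folded.append((v2, w1 + w2))
        if len(self.samples) % 2 == 1:
            folded.append(self.samples[-1])  # odd element out is kept as-is
        self.samples = folded


class CallerSideCompressor:
    """The caller, not the summary, decides when to compress."""

    def __init__(self, threshold=100):
        self.threshold = threshold  # hypothetical tuning knob
        self.summary = UncompressedSummary()

    def add(self, value):
        self.summary.insert(value)
        if len(self.summary.samples) > self.threshold:
            self.summary.compress()


agg = CallerSideCompressor(threshold=100)
for v in range(10_000):
    agg.add(float(v))
```

A fixed threshold like this keeps the buffer bounded, but it is a blunt rule: compressing too often costs CPU and accuracy, and compressing too rarely costs memory. That is one way such a rule may not be optimal.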
    
    ## How was this patch tested?
    
    Unit tests and SQL query tests.
    
    ## Acknowledgement
    1. This PR's work is based on @lw-lin's PR
    https://github.com/apache/spark/pull/14298, with improvements such as
    supporting partial aggregation and fixing an out-of-memory issue.
    2. Thanks to @cloud-fan for the implementation of serialization using 
Expression Encoder.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/clockfly/spark appro_percentile_try_2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14868
    
----
commit d71397cc283493a946c42124e41309d7e64d0b12
Author: Sean Zhong <seanzh...@databricks.com>
Date:   2016-08-19T16:34:56Z

    percentile approximate

----


