[ https://issues.apache.org/jira/browse/HIVE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Lahiri updated HIVE-1387:
--------------------------------

    Attachment: HIVE-1387.2.patch
                median_approx_quality.png

I've attached HIVE-1387.2.patch, which does the following:

(1) Creates a percentile_approx() UDAF which uses the histogram_numeric() UDAF 
to estimate quantiles from a histogram. The syntax matches the existing 
percentile() UDAF, and extends it with a third parameter that specifies the 
number of histogram bins to use (and thus, the accuracy of quantile estimation):

-- Estimate the median.
SELECT percentile_approx(val, 0.5) FROM random;
-- Estimate three quantiles at once.
SELECT percentile_approx(val, array(0.5, 0.95, 0.98)) FROM random;
-- Estimate the median using 1,000 histogram bins instead of the default of 10,000.
SELECT percentile_approx(val, 0.5, 1000) FROM random;
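
To illustrate the general idea of estimating a quantile from a bounded-size 
histogram, here is a minimal Python sketch. It is not the code in the patch. 
It assumes a histogram that merges its two closest bins whenever it exceeds 
its bin budget, and a quantile estimate that interpolates linearly between 
neighbouring bin centers. The class and method names (StreamingHistogram, 
add, quantile) are hypothetical.

```python
import bisect
import random


class StreamingHistogram:
    """Hypothetical bounded-size histogram; NOT the class from the patch."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.centers = []  # bin centers, kept sorted
        self.counts = []   # counts[i] = number of values summarized by centers[i]

    def add(self, value):
        i = bisect.bisect_left(self.centers, value)
        if i < len(self.centers) and self.centers[i] == value:
            self.counts[i] += 1  # value already has its own bin
            return
        self.centers.insert(i, float(value))
        self.counts.insert(i, 1)
        if len(self.centers) > self.max_bins:
            self._merge_closest_pair()

    def _merge_closest_pair(self):
        # Assumed strategy: replace the two adjacent bins with the smallest
        # gap by a single bin at their count-weighted mean.
        j = min(range(len(self.centers) - 1),
                key=lambda k: self.centers[k + 1] - self.centers[k])
        n = self.counts[j] + self.counts[j + 1]
        c = (self.centers[j] * self.counts[j]
             + self.centers[j + 1] * self.counts[j + 1]) / n
        self.centers[j:j + 2] = [c]
        self.counts[j:j + 2] = [n]

    def quantile(self, q):
        if not self.centers:
            raise ValueError("histogram is empty")
        target = q * sum(self.counts)
        below = 0  # total count in the bins before bin i
        for i, (c, n) in enumerate(zip(self.centers, self.counts)):
            if below + n >= target:
                if i == 0:
                    return c
                prev = self.centers[i - 1]
                # Interpolate between the previous center and this one.
                return prev + (target - below) / n * (c - prev)
            below += n
        return self.centers[-1]


if __name__ == "__main__":
    rng = random.Random(0)
    hist = StreamingHistogram(max_bins=100)
    for _ in range(5000):
        hist.add(rng.random())
    print("estimated median:", hist.quantile(0.5))
```

Memory use is bounded by the number of bins rather than by the number of 
distinct values, which is the property that makes this kind of estimate 
attractive for double columns.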

(2) I've left the existing percentile() UDAF as it is, for the following 
reasons. When the number of unique values in a column is relatively small, 
percentile_approx() will return an exact result. When the number of unique 
values in a column is very large (as one might expect with doubles), 
percentile() will run out of memory and crash. Either way, 
percentile_approx() already covers double columns, so there is no need to 
modify the existing percentile() to support doubles.

(3) The accuracy of quantile estimation seems to be pretty good. I've 
attached a graph (median_approx_quality.png) showing approximation quality 
for the median using different histogram sizes on random datasets of 
100,000 numbers. The default of 10,000 histogram bins appears to work quite 
well.
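
An experiment of this kind can be sketched in a few lines of Python. The 
sketch below is an assumption-laden stand-in, not the procedure used for the 
attached graph: it uses a plain equal-width histogram over [0, 1) rather 
than the histogram_numeric() implementation, uniform random data, and 
hypothetical function names. It only shows how the median error can be 
measured for several bin counts.

```python
import math
import random


def approx_quantile_equal_width(values, q, num_bins, lo=0.0, hi=1.0):
    """Estimate the q-quantile of values in [lo, hi) from equal-width bin counts."""
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        b = min(num_bins - 1, int((v - lo) / width))  # clamp against rounding
        counts[b] += 1
    target = q * len(values)
    below = 0  # total count in the bins before bin b
    for b, n in enumerate(counts):
        if n > 0 and below + n >= target:
            # Assume the values are spread evenly inside the bin.
            return lo + (b + (target - below) / n) * width
        below += n
    return hi


def median_error_by_bins(n_values=100_000,
                         bin_counts=(10, 100, 1_000, 10_000),
                         seed=0):
    """Return {bin count: absolute error of the estimated median}."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]
    exact = sorted(values)[math.ceil(0.5 * n_values) - 1]
    return {bins: abs(approx_quantile_equal_width(values, 0.5, bins) - exact)
            for bins in bin_counts}


if __name__ == "__main__":
    for bins, err in median_error_by_bins().items():
        print(f"{bins:>6} bins: absolute error {err:.6f}")
```

With equal-width bins the estimate always falls in the same bin as the true 
value, so the error cannot exceed one bin width. That bound applies to this 
simplified histogram only and says nothing definite about the patch.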

(4) This patch also refactors the histogram_numeric() class to put all the 
generic histogram functionality into a re-usable inner class. 

> Make PERCENTILE work with double data type
> ------------------------------------------
>
>                 Key: HIVE-1387
>                 URL: https://issues.apache.org/jira/browse/HIVE-1387
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Vaibhav Aggarwal
>            Assignee: Mayank Lahiri
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1387.2.patch, median_approx_quality.png, 
> patch-1387-1.patch
>
>
> The PERCENTILE UDAF does not work with double datatype.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
