[ https://issues.apache.org/jira/browse/HIVE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayank Lahiri updated HIVE-1387:
--------------------------------

    Attachment: HIVE-1387.2.patch
                median_approx_quality.png

I've attached HIVE-1387.2.patch, which does the following:

(1) Creates a percentile_approx() UDAF, which uses the histogram_numeric() UDAF to estimate quantiles from a histogram. The syntax matches the existing percentile() UDAF and extends it with an optional third parameter that specifies the number of histogram bins to use (and thus the accuracy of the quantile estimate):

    SELECT percentile_approx(val, 0.5) FROM random;                     -- estimates the median
    SELECT percentile_approx(val, array(0.5, 0.95, 0.98)) FROM random;  -- estimates 3 quantiles
    SELECT percentile_approx(val, 0.5, 1000) FROM random;               -- estimates the median using 1,000 histogram bins instead of the default of 10,000

(2) Leaves the existing percentile() UDAF as it is, for two reasons. When the number of unique values in a column is relatively small, percentile_approx() will return an exact result. When the number of unique values in a column is very large (as one might expect with doubles), percentile() will run out of memory and crash. So there's really no need to modify the existing percentile() to support doubles.

(3) The accuracy of quantile estimation seems to be pretty good. I've attached a graph (median_approx_quality.png) showing the approximation quality for the median using different histogram sizes on random datasets of 100,000 numbers. The default number of histogram bins is 10,000, which appears to work quite well.

(4) Refactors the histogram_numeric() class to put all of the generic histogram functionality into a reusable inner class.
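To make the idea in (1) concrete, here is a minimal Python sketch of estimating a quantile from a fixed-size streaming histogram. It is an illustration only, not the code in the patch: the class name StreamingHistogram, the merge-closest-bins rule, and the choice to return a bin center without interpolation are all assumptions made for this example.

```python
import bisect


class StreamingHistogram:
    """Illustrative fixed-size histogram (hypothetical, not Hive's class).

    Bins are (center, count) pairs kept sorted by center.
    """

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.centers = []
        self.counts = []

    def add(self, value):
        # Reuse an existing bin if this exact value was seen before.
        i = bisect.bisect_left(self.centers, value)
        if i < len(self.centers) and self.centers[i] == value:
            self.counts[i] += 1
            return
        # Otherwise open a new bin, then shrink back to max_bins if needed.
        self.centers.insert(i, value)
        self.counts.insert(i, 1)
        if len(self.centers) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Assumed rule: merge the two adjacent bins whose centers are closest,
        # replacing them with their count-weighted average.
        j = min(range(len(self.centers) - 1),
                key=lambda k: self.centers[k + 1] - self.centers[k])
        total = self.counts[j] + self.counts[j + 1]
        self.centers[j] = (self.centers[j] * self.counts[j]
                           + self.centers[j + 1] * self.counts[j + 1]) / total
        self.counts[j] = total
        del self.centers[j + 1]
        del self.counts[j + 1]

    def quantile(self, q):
        # Return the center of the first bin at which the cumulative count
        # reaches q of the total (no interpolation in this sketch).
        if not self.centers:
            raise ValueError("histogram is empty")
        target = q * sum(self.counts)
        running = 0
        for center, count in zip(self.centers, self.counts):
            running += count
            if running >= target:
                return center
        return self.centers[-1]
```

With at least as many bins as distinct values, no bins are ever merged, so the estimate is an actual data value. For example, adding the values 1.0 through 5.0 twenty times each to StreamingHistogram(10) and calling quantile(0.5) returns 3.0. With fewer bins than distinct values, neighboring values are averaged together and the result is only an approximation.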
> Make PERCENTILE work with double data type
> ------------------------------------------
>
>                 Key: HIVE-1387
>                 URL: https://issues.apache.org/jira/browse/HIVE-1387
>             Project: Hadoop Hive
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Vaibhav Aggarwal
>            Assignee: Mayank Lahiri
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1387.2.patch, median_approx_quality.png, patch-1387-1.patch
>
>
> The PERCENTILE UDAF does not work with double datatype.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.