[jira] Updated: (HIVE-1397) histogram() UDAF for a numerical column

Mayank Lahiri (JIRA) Thu, 10 Jun 2010 16:13:17 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mayank Lahiri updated HIVE-1397:
--------------------------------

    Status: Patch Available  (was: Open)

I've implemented and tested the algorithm. I'm still running experiments to 
measure how far from optimal (in terms of MSE) this streaming algorithm gets, 
but so far it seems to perform well when the number of data points is a few 
orders of magnitude larger than the number of bins. For example, I'm getting 
good histograms with 100,000 data points and 20-80 histogram bins.

As I noted before, there are no approximation guarantees in terms of how close 
to optimal the histogram is.
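
One way to put a number on "far from optimal" is sketched below. It assumes 
that MSE here means the mean squared distance from each data point to its 
nearest histogram bin center; the comment does not spell out the exact 
definition, so treat this as an illustration only. The function name 
quantization_mse is hypothetical and not part of the patch.

```python
import bisect


def quantization_mse(points, centers):
    """Mean squared distance from each point to its nearest bin center.

    Assumed error measure for illustration; `centers` must be non-empty.
    """
    centers = sorted(centers)
    total = 0.0
    for x in points:
        i = bisect.bisect_left(centers, x)
        # The nearest center is either the one just below x or just above it.
        candidates = centers[max(i - 1, 0):i + 1]
        total += min((x - c) ** 2 for c in candidates)
    return total / len(points)
```

Comparing this value for the streaming histogram against the value for an 
exhaustively optimized binning of the same data would give a rough measure of 
the gap.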

> histogram() UDAF for a numerical column
> ---------------------------------------
>
>                 Key: HIVE-1397
>                 URL: https://issues.apache.org/jira/browse/HIVE-1397
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Mayank Lahiri
>            Assignee: Mayank Lahiri
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1397.1.patch
>
>
> A histogram() UDAF to generate an approximate histogram of a numerical (byte, 
> short, double, long, etc.) column. The result is returned as a map of (x,y) 
> histogram pairs, and can be plotted in Gnuplot using impulses (for example). 
> The algorithm is currently adapted from "A streaming parallel decision tree 
> algorithm" by Ben-Haim and Tom-Tov, JMLR 11 (2010), and uses space 
> proportional to the number of histogram bins specified. It has no 
> approximation guarantees, but seems to work well when there is a lot of data 
> and a large number (e.g. 50-100) of histogram bins specified.
> A typical call might be:
> SELECT histogram(val, 10) FROM some_table;
> where the result would be a histogram with 10 bins, returned as a Hive map 
> object.
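
The following is a minimal sketch of the basic update and merge steps from the 
Ben-Haim and Tom-Tov paper, written in Python for brevity. It is not the code 
in HIVE-1397.1.patch, which is described only as "adapted from" the paper; the 
class name StreamingHistogram and its methods are hypothetical. Each bin is a 
(center, count) pair, and whenever the number of bins exceeds the limit, the 
two closest adjacent bins are replaced by their count-weighted average.

```python
import bisect


class StreamingHistogram:
    """Fixed-size approximate histogram (illustrative sketch only)."""

    def __init__(self, max_bins):
        if max_bins < 1:
            raise ValueError("max_bins must be at least 1")
        self.max_bins = max_bins
        self.bins = []  # sorted list of [center, count]

    def add(self, x):
        """Insert one data point, then shrink back to max_bins if needed."""
        centers = [b[0] for b in self.bins]
        i = bisect.bisect_left(centers, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1  # exact match: bump the existing bin
        else:
            self.bins.insert(i, [x, 1])
            self._shrink()

    def merge(self, other):
        """Combine with a histogram built on another partition of the data."""
        combined = self.bins + [list(b) for b in other.bins]
        self.bins = sorted(combined, key=lambda b: b[0])
        self._shrink()

    def _shrink(self):
        while len(self.bins) > self.max_bins:
            # Find the adjacent pair of bins with the smallest gap.
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            n = n1 + n2
            # Replace the pair with one bin at the count-weighted mean.
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / n, n]]
```

The state never holds more than max_bins + 1 pairs during add(), which matches 
the description's claim of space proportional to the number of bins. The 
merge() step is what would let partial histograms from different tasks be 
combined into one result.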

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
