[ 
https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469614#comment-13469614
 ] 

Shreepadma Venugopalan commented on HIVE-1362:
----------------------------------------------

@Shrikanth: Thank you for your comments. We can certainly add a new UDAF with 
the Flajolet-Martin sketch that returns a serialized numDV estimator. I've 
already filed a new JIRA (HIVE-3516) for the incremental stats computation 
work. I'll add the UDAF as part of that JIRA. 
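
To make the idea of a serialized numDV estimator concrete, here is a minimal 
sketch of a Flajolet-Martin style estimator in Python. It is only an 
illustration, not the Hive implementation: the class name FMSketch, the byte 
layout, the choice of BLAKE2b as the hash, and the use of 64 bitmaps whose 
results are averaged are all assumptions. It assumes that the point of 
serializing the estimator is that sketches built over separate batches of rows 
can later be merged with a bitwise OR.

```python
import hashlib
import struct

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


class FMSketch:
    """Hypothetical FM-style distinct-value estimator (not Hive's class)."""

    def __init__(self, num_bitmaps=64, bitmaps=None):
        self.num_bitmaps = num_bitmaps
        self.bitmaps = list(bitmaps) if bitmaps is not None else [0] * num_bitmaps

    @staticmethod
    def _hash(value, seed):
        # 64-bit hash; the seed selects one of num_bitmaps hash functions.
        digest = hashlib.blake2b(
            repr(value).encode("utf-8"),
            digest_size=8,
            salt=struct.pack("<Q", seed),
        ).digest()
        return int.from_bytes(digest, "little")

    def add(self, value):
        for i in range(self.num_bitmaps):
            h = self._hash(value, i)
            # Index of the least-significant 1 bit (63 if the hash is zero).
            rho = (h & -h).bit_length() - 1 if h else 63
            self.bitmaps[i] |= 1 << rho

    def estimate(self):
        # R is the index of the lowest unset bit; average R over all bitmaps.
        total_r = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total_r += r
        return (2 ** (total_r / self.num_bitmaps)) / PHI

    def merge(self, other):
        # OR-ing bitmaps gives the sketch of the union of both inputs.
        merged = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return FMSketch(self.num_bitmaps, merged)

    def serialize(self):
        return struct.pack("<I%dQ" % self.num_bitmaps, self.num_bitmaps, *self.bitmaps)

    @classmethod
    def deserialize(cls, data):
        (n,) = struct.unpack_from("<I", data)
        return cls(n, struct.unpack_from("<%dQ" % n, data, 4))
```

With this shape, an incremental run would deserialize the stored sketch, merge 
it with the sketch for the new rows, and call estimate() on the result.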

There are a couple of reasons why we decided to create a new compute_stats 
aggregation operator instead of generating more expressions in the SQL:

1. We felt it's a lot cleaner to encapsulate the stats for a column within a 
single UDAF. The compute_stats UDAF returns a struct with the relevant stats 
for the column's data type, which keeps both the parsing and the SQL we 
generate simple.

2. Adding a new compute_stats UDAF allows the gathering of statistical 
summaries of the underlying data even outside of the column stats framework. 
One use I can think of is modeling the statistical properties of the data, 
which in turn can be used to generate data whose statistical properties mimic 
those of the underlying data.
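
As a rough illustration of the first point, the following Python sketch shows 
a single aggregation that returns a different set of fields depending on the 
column's type. The function name mirrors compute_stats, but the field names 
and the type dispatch are assumptions made for this example and are not the 
struct layout in the patch. The exact distinct count via a set stands in for 
an estimator.

```python
def compute_stats(values):
    """Return a dict of stats whose fields depend on the column type.

    Hypothetical stand-in for a struct-returning UDAF; field names are
    illustrative only.
    """
    non_null = [v for v in values if v is not None]
    stats = {"count_nulls": len(values) - len(non_null)}
    if not non_null:
        return stats

    sample = non_null[0]
    if isinstance(sample, bool):  # check bool first: bool is a subclass of int
        trues = sum(1 for v in non_null if v)
        stats.update(
            type="boolean",
            count_trues=trues,
            count_falses=len(non_null) - trues,
        )
    elif isinstance(sample, (int, float)):
        stats.update(
            type="numeric",
            min=min(non_null),
            max=max(non_null),
            num_distinct=len(set(non_null)),
        )
    elif isinstance(sample, str):
        lengths = [len(v) for v in non_null]
        stats.update(
            type="string",
            max_length=max(lengths),
            avg_length=sum(lengths) / len(lengths),
            num_distinct=len(set(non_null)),
        )
    return stats
```

The caller issues one aggregation per column and reads back one result, 
rather than assembling and parsing a separate expression for each statistic.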

Even though max, min, and total count exist as UDAFs today, we need these to 
be part of the histogram UDAF. Estimating quantiles for an equi-height 
histogram is a lot more efficient if we know the range of values the column 
can take, and we need to know the total_count to generate the histogram bins. 
Given that we need these stats for generating histograms, I think it's a good 
idea to encapsulate all of these stats within the compute_stats UDAF. Thanks.
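
One way the range and the total count could feed into an equi-height 
histogram is sketched below in Python. This is an assumed approach for 
illustration, not the algorithm in the patch: with min and max known, a fixed 
grid of counters can be laid over the range, and with total_count known, the 
target height of each bin is total_count / num_bins. The function name, 
parameters, and grid size are all hypothetical.

```python
def equi_height_bounds(values, lo, hi, total_count, num_bins, grid_size=1024):
    """Approximate upper bounds of equi-height bins over [lo, hi].

    lo, hi and total_count are assumed to be known already (for example
    from the min, max and count collected by the same aggregation).
    """
    if hi == lo:
        return [hi] * num_bins

    # Count values into a fixed-width grid spanning the known range.
    width = (hi - lo) / grid_size
    counts = [0] * grid_size
    for v in values:
        idx = min(int((v - lo) / width), grid_size - 1)
        counts[idx] += 1

    # Walk the cumulative counts; close a bin each time it reaches its height.
    target = total_count / num_bins
    bounds = []
    running = 0
    for i, c in enumerate(counts):
        running += c
        while len(bounds) < num_bins - 1 and running >= target * (len(bounds) + 1):
            bounds.append(lo + (i + 1) * width)
    bounds.append(hi)  # the last bin always ends at the column maximum
    return bounds
```

Without the range, the grid could not be sized up front, and without the 
count, the bin height would not be known until the data had been scanned.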
                
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt, 
> HIVE-1362.3.patch.txt, HIVE-1362.4.patch.txt, 
> HIVE-1362-gen_thrift.1.patch.txt, HIVE-1362-gen_thrift.2.patch.txt, 
> HIVE-1362-gen_thrift.3.patch.txt, HIVE-1362-gen_thrift.4.patch.txt
>
>


