[ https://issues.apache.org/jira/browse/HIVE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469614#comment-13469614 ]
Shreepadma Venugopalan commented on HIVE-1362:
----------------------------------------------

@Shrikanth: Thank you for your comments. We can certainly add a new UDAF that uses the Flajolet-Martin sketch and returns a serialized numDV estimator. I've already filed a new JIRA (HIVE-3516) for the incremental stats computation work, and I'll add the UDAF as part of that JIRA.

We decided to create a new compute_stats aggregation operator, instead of generating more expressions in the SQL, for a couple of reasons:

1. We felt it's a lot cleaner to encapsulate the stats for a column within a single UDAF. The compute_stats UDAF returns a struct with the relevant stats for the column's data type, which keeps both the parsing and the SQL we generate simple.

2. Adding a new compute_stats UDAF allows statistical summaries of the underlying data to be gathered even outside of the column stats framework. One use I can think of is modeling the statistical properties of the data, which in turn can be used to generate data whose statistical properties mimic those of the underlying data.

Even though max, min, and total count exist as UDAFs today, we need them to be part of the histogram UDAF. Estimating quantiles for an equi-height histogram is a lot more efficient if we know the range of values the column can take, and we need the total count to generate the histogram bins. Given that we need these stats for generating histograms, I think it's a good idea to encapsulate all of them within the compute_stats UDAF.

Thanks.
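To illustrate why a serialized numDV estimator suits incremental stats, here is a minimal Python sketch of a Flajolet-Martin style estimator (the stochastic-averaging variant). It is not the code from the HIVE-1362 patches: the class name FMSketch, its method names, the SHA-1 based hash, and the comma-separated hex serialization are all assumptions made for illustration. The point it shows is that two sketches built over separate slices of the data can be merged with a bitwise OR, giving the same state as a single pass over all the data.

```python
import hashlib


class FMSketch:
    """Illustrative Flajolet-Martin distinct-value estimator (hypothetical API)."""

    PHI = 0.77351  # bias-correction constant from Flajolet and Martin

    def __init__(self, num_maps=64, bits=32):
        self.num_maps = num_maps
        self.bits = bits
        self.bitmaps = [0] * num_maps

    @staticmethod
    def _hash(value):
        # Assumption: any well-mixed 64-bit hash would do; SHA-1 is used
        # here only because it ships with the standard library.
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        h = self._hash(value)
        idx = h % self.num_maps    # which bitmap this value updates
        rest = h // self.num_maps  # remaining hash bits
        # Find the position of the lowest set bit in the remaining bits.
        r = 0
        while r < self.bits - 1 and not (rest >> r) & 1:
            r += 1
        self.bitmaps[idx] |= 1 << r

    def merge(self, other):
        # Sketches over disjoint or overlapping inputs combine with OR.
        if self.num_maps != other.num_maps:
            raise ValueError("cannot merge sketches of different sizes")
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        # For each bitmap, take the index of its lowest zero bit, average
        # those indexes, and scale by the number of bitmaps.
        total = 0
        for bm in self.bitmaps:
            r = 0
            while (bm >> r) & 1:
                r += 1
            total += r
        mean_r = total / self.num_maps
        return self.num_maps / self.PHI * 2 ** mean_r

    def serialize(self):
        return ",".join(format(bm, "x") for bm in self.bitmaps)

    @classmethod
    def deserialize(cls, text):
        parts = text.split(",")
        sketch = cls(num_maps=len(parts))
        sketch.bitmaps = [int(p, 16) for p in parts]
        return sketch


# Usage: build one sketch per data slice, persist it, and merge later.
old_slice = FMSketch()
for v in range(5000):
    old_slice.add(v)
stored = old_slice.serialize()

new_slice = FMSketch()
for v in range(2500, 10000):
    new_slice.add(v)

combined = FMSketch.deserialize(stored)
combined.merge(new_slice)
approx_num_dv = combined.estimate()
```

Because the merged state depends only on the set of values seen, stats for newly added data could in principle be folded into a stored sketch without rescanning the older data.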
> column level statistics
> -----------------------
>
>                 Key: HIVE-1362
>                 URL: https://issues.apache.org/jira/browse/HIVE-1362
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Statistics
>            Reporter: Ning Zhang
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-1362.1.patch.txt, HIVE-1362.2.patch.txt,
>                      HIVE-1362.3.patch.txt, HIVE-1362.4.patch.txt,
>                      HIVE-1362-gen_thrift.1.patch.txt, HIVE-1362-gen_thrift.2.patch.txt,
>                      HIVE-1362-gen_thrift.3.patch.txt, HIVE-1362-gen_thrift.4.patch.txt