[ 
https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641947#action_12641947
 ] 

Prasad Chakka commented on HADOOP-4488:
---------------------------------------

Some comments and questions:

1- For each partition (or table, for non-partitioned tables), we should store 
the number of files as well (so we can optimize the number of mappers).

2- We should make the number of bins optional and use a default. We might need 
some trial and error to figure out the optimal number, depending on the number 
of distinct values/row count.

3- How do you compute distinct values for floats? By rounding them, or by not 
storing them at all?

4- For strings, could we store stats for some prefix of the string?

5- In histograms, we should store the number of distinct values in each bucket 
as well.

6- Can we store the correlation between two columns? It would help in 
estimating selectivity more accurately.
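
To make points 2 and 5 concrete, here is a minimal Python sketch of an 
equi-width histogram with an optional bin count and a per-bucket distinct 
count. It is not Hive code: the names (DEFAULT_NUM_BINS, Bucket, Histogram, 
build_histogram, estimate_equality_rows) and the default of 10 bins are 
assumptions made up for illustration, and the exact set of values per bucket 
stands in for whatever approximate distinct counter a real implementation 
would use.

```python
from dataclasses import dataclass, field

DEFAULT_NUM_BINS = 10  # hypothetical default; the issue does not specify one


@dataclass
class Bucket:
    row_count: int = 0
    # Exact set of values, for illustration only; a real system would
    # probably keep an approximate distinct counter instead.
    values: set = field(default_factory=set)

    @property
    def distinct_count(self):
        return len(self.values)


@dataclass
class Histogram:
    low: float
    width: float
    buckets: list

    def index_of(self, value):
        # Map a value to its bucket, clamping out-of-range values
        # to the first or last bucket.
        idx = int((value - self.low) / self.width)
        return min(max(idx, 0), len(self.buckets) - 1)


def build_histogram(column, num_bins=DEFAULT_NUM_BINS):
    """Build an equi-width histogram; num_bins is optional (point 2)."""
    low, high = min(column), max(column)
    width = (high - low) / num_bins or 1.0  # avoid zero width if all values are equal
    hist = Histogram(low, width, [Bucket() for _ in range(num_bins)])
    for v in column:
        bucket = hist.buckets[hist.index_of(v)]
        bucket.row_count += 1
        bucket.values.add(v)  # per-bucket distinct values (point 5)
    return hist


def estimate_equality_rows(hist, value):
    """Estimate rows matching `col = value`, assuming rows are spread
    evenly over the distinct values inside the bucket."""
    bucket = hist.buckets[hist.index_of(value)]
    if bucket.distinct_count == 0:
        return 0.0
    return bucket.row_count / bucket.distinct_count


hist = build_histogram([1, 1, 1, 2, 3, 10, 10, 20], num_bins=2)
print(estimate_equality_rows(hist, 2))
```

With only the row count per bucket, an equality predicate has no basis for an 
estimate beyond the bucket total; the per-bucket distinct count lets the 
estimate be rows divided by distinct values for that bucket.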



> [Hive]: Add ability to compute statistics on hive tables
> --------------------------------------------------------
>
>                 Key: HADOOP-4488
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4488
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add commands to collect partition and column level statistics in hive.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
