[ https://issues.apache.org/jira/browse/HADOOP-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641947#action_12641947 ]
Prasad Chakka commented on HADOOP-4488:
---------------------------------------
Some comments and questions:
1- For each partition (or for the table, if it is not partitioned), we should
store the number of files as well (so we can optimize the number of mappers).
2- We should make the number of bins optional and fall back to a default. We
might need some trial and error to figure out the optimal number, depending on
the number of distinct values and the row count.
3- How do you compute distinct values for floats? By rounding them, or are they
not stored at all?
4- For strings, could we store stats for some prefix of the string?
5- In histograms, we should store the number of distinct values in each bucket
as well.
6- Can we store the correlation between two columns? It would help in
estimating selectivity more accurately.
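
To make items 2, 4 and 5 concrete, here is a rough sketch of what I have in mind. It is Python for brevity and not a proposal for the actual Hive code; the names (DEFAULT_NUM_BINS, build_histogram, string_prefix_distinct), the default of 10 bins, the prefix length of 4 and the choice of equi-width buckets are all assumptions made for illustration.

```python
DEFAULT_NUM_BINS = 10  # hypothetical default; the right value needs trial and error


def build_histogram(values, num_bins=DEFAULT_NUM_BINS):
    """Equi-width histogram; each bucket keeps a row count and a distinct count."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    # Fall back to width 1 when all values are equal, to avoid dividing by zero.
    width = (hi - lo) / num_bins or 1
    rows = [0] * num_bins
    distinct = [set() for _ in range(num_bins)]
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        idx = min(int((v - lo) / width), num_bins - 1)
        rows[idx] += 1
        distinct[idx].add(v)
    return [
        {
            "low": lo + i * width,
            "high": lo + (i + 1) * width,
            "rows": rows[i],
            "distinct": len(distinct[i]),
        }
        for i in range(num_bins)
    ]


def string_prefix_distinct(values, prefix_len=4):
    """Count distinct values of a fixed-length prefix of each string (item 4)."""
    return len({s[:prefix_len] for s in values})
```

With the per-bucket distinct count, an equality predicate on a value that falls in a bucket can be estimated as rows / distinct for that bucket instead of assuming every value in the range is present. The exact sets used here would not scale to large tables; a real implementation would presumably need an approximate distinct counter.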
> [Hive]: Add ability to compute statistics on hive tables
> --------------------------------------------------------
>
> Key: HADOOP-4488
> URL: https://issues.apache.org/jira/browse/HADOOP-4488
> Project: Hadoop Core
> Issue Type: New Feature
> Components: contrib/hive
> Reporter: Ashish Thusoo
> Assignee: Ashish Thusoo
>
> Add commands to collect partition and column level statistics in hive.