[ https://issues.apache.org/jira/browse/HIVE-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146586#comment-15146586 ]
Dhanasekar commented on HIVE-9689:
----------------------------------
Is anyone currently working on this issue? I am a GSoC 2016 aspirant and
would like to know if I can work on it this summer.
> Store histograms and distinct value estimator's bit vectors in metastore
> ------------------------------------------------------------------------
>
> Key: HIVE-9689
> URL: https://issues.apache.org/jira/browse/HIVE-9689
> Project: Hive
> Issue Type: New Feature
> Reporter: Prasanth Jayachandran
> Labels: gsoc, gsoc2015, hive, java
>
> Hive currently uses the PCSA (Probabilistic Counting with Stochastic
> Averaging) algorithm to estimate the number of distinct values (NDV) of a
> column. Only the NDV value computed by the UDF is stored in the metastore,
> not the underlying bit vectors. This makes it impossible to estimate the
> overall NDV across all partitions (or a selected subset of partitions),
> because per-partition NDV values cannot be combined, whereas bit vectors can
> be merged. We should ideally store the bit vectors in the metastore and merge
> them on the server side. We could also replace the current PCSA algorithm
> with HyperLogLog if space is a constraint.
>
> Hive also has a UDF for computing histograms. We can persist the histograms
> in the metastore so that the Hive optimizer can make better decisions.
> Histograms in the metastore can also help with order by, skew join, and
> count distinct + group by optimizations.
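
To illustrate why the description asks for the bit vectors rather than the
final NDV, here is a minimal PCSA-style sketch in Java. It is not Hive's
implementation: the class name PcsaSketch, the bucket count, and the
SplitMix64-style hash are assumptions made only for this example. It shows
that two per-partition sketches can be merged with a bitwise OR, and that
the merged sketch is identical to one built over all rows at once.

```java
import java.util.Arrays;

/** Illustrative PCSA-style sketch; hypothetical, not Hive's actual code. */
class PcsaSketch {
    // Flajolet-Martin correction constant.
    private static final double PHI = 0.77351;

    // One 32-bit bitmap per bucket; together these are the "bit vectors".
    private final int[] bitmaps;

    PcsaSketch(int numBuckets) {
        this.bitmaps = new int[numBuckets];
    }

    /** SplitMix64-style mixer, used here as a stand-in hash function. */
    private static long mix(long x) {
        x += 0x9e3779b97f4a7c15L;
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    /** Records one value: pick a bucket, then set the bit at the rank of the hash. */
    void add(long value) {
        long h = mix(value);
        int bucket = (int) Long.remainderUnsigned(h, bitmaps.length);
        long rest = Long.divideUnsigned(h, bitmaps.length);
        int rank = Math.min(Long.numberOfTrailingZeros(rest), 31);
        bitmaps[bucket] |= 1 << rank;
    }

    /** Merges another sketch into this one; this is the server-side merge step. */
    void merge(PcsaSketch other) {
        if (other.bitmaps.length != bitmaps.length) {
            throw new IllegalArgumentException("bucket counts differ");
        }
        for (int i = 0; i < bitmaps.length; i++) {
            bitmaps[i] |= other.bitmaps[i];
        }
    }

    /** PCSA estimate: (m / PHI) * 2^(mean position of the lowest zero bit). */
    double estimate() {
        long sum = 0;
        for (int b : bitmaps) {
            sum += Integer.numberOfTrailingZeros(~b);
        }
        double mean = (double) sum / bitmaps.length;
        return bitmaps.length / PHI * Math.pow(2.0, mean);
    }

    /** Returns a copy of the bit vectors, i.e. what would be persisted. */
    int[] bitmaps() {
        return Arrays.copyOf(bitmaps, bitmaps.length);
    }

    public static void main(String[] args) {
        PcsaSketch part1 = new PcsaSketch(64);
        PcsaSketch part2 = new PcsaSketch(64);
        // Two partitions whose value ranges overlap in [500, 1000).
        for (long v = 0; v < 1000; v++) part1.add(v);
        for (long v = 500; v < 1500; v++) part2.add(v);

        double naiveSum = part1.estimate() + part2.estimate();
        part1.merge(part2);
        System.out.println("sum of per-partition NDVs: " + naiveSum);
        System.out.println("NDV from merged vectors:   " + part1.estimate());
    }
}
```

Adding the two per-partition estimates counts the overlapping values twice,
and the stored NDV numbers alone give no way to correct for that. The merged
bit vectors do not have this problem, which is the motivation for persisting
them.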
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)