[ https://issues.apache.org/jira/browse/HIVE-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171186#comment-15171186 ]
Rajdeep Surolia commented on HIVE-9689:
---------------------------------------
Hi Prasanth,
I am a Computer Science undergraduate student from Kolkata, India. I program
in C/C++ and Java, and I have a good understanding of how Apache Hadoop and
Hive work. I am very interested in Big Data and would like to work on projects
related to it. I am also a GSoC '16 aspirant, and this looks like the right
project for me. Could you tell me what the prerequisites for this project are?
Your help would be much appreciated.
Cheers,
Rajdeep
> Store histograms and distinct value estimator's bit vectors in metastore
> ------------------------------------------------------------------------
>
> Key: HIVE-9689
> URL: https://issues.apache.org/jira/browse/HIVE-9689
> Project: Hive
> Issue Type: New Feature
> Reporter: Prasanth Jayachandran
> Labels: gsoc, gsoc2015, hive, java
>
> Hive currently uses the PCSA (Probabilistic Counting with Stochastic
> Averaging) algorithm to estimate the number of distinct values (NDV) of a
> column. Only the NDV value computed by the UDF is stored in the metastore,
> not the underlying bit vectors. This makes it impossible to estimate the
> overall NDV across all partitions (or a selected subset of partitions),
> because per-partition NDV values cannot be combined, whereas bit vectors
> can. We should ideally store the bit vectors in the metastore and merge them
> on the server side. We could also replace the current PCSA algorithm with
> HyperLogLog if space is a constraint.
> Hive also has a UDF for computing histograms. We can persist the histograms
> in the metastore so that the Hive optimizer can make better decisions. Having
> histograms in the metastore can also help with order by, skew join, and
> count distinct + group by optimizations.
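To illustrate why storing bit vectors rather than NDV values makes cross-partition estimates possible, here is a minimal PCSA-style sketch in Java. It is not Hive's implementation: the class name `PcsaSketch`, its methods, the 32-bit bitmap width, and the SplitMix64-based hash are all assumptions chosen for illustration. The point it shows is that two sketches can be merged with a bitwise OR, and the result is identical to a sketch built over the union of the inputs.

```java
// Illustrative PCSA-style sketch; all names here are hypothetical and this is
// not Hive's actual implementation.
final class PcsaSketch {
    // Flajolet-Martin bias correction constant.
    private static final double PHI = 0.77351;

    private final int[] bitmaps;   // one 32-bit bit vector per bucket
    private final int bucketBits;  // log2(number of buckets)

    PcsaSketch(int numBuckets) {
        if (numBuckets <= 0 || Integer.bitCount(numBuckets) != 1) {
            throw new IllegalArgumentException("numBuckets must be a power of two");
        }
        this.bitmaps = new int[numBuckets];
        this.bucketBits = Integer.numberOfTrailingZeros(numBuckets);
    }

    // SplitMix64 mixing step, used here as a stand-in hash function.
    private static long mix64(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Record one value: the low hash bits pick a bucket, and the position of
    // the lowest set bit in the remaining hash bits picks the bit to set.
    void add(long value) {
        long h = mix64(value);
        int bucket = (int) (h & (bitmaps.length - 1));
        long rest = h >>> bucketBits;
        int rho = Math.min(Long.numberOfTrailingZeros(rest), 31);
        bitmaps[bucket] |= 1 << rho;
    }

    // Merge another sketch into this one. Because add() only ever sets bits,
    // OR-ing the bit vectors equals sketching the union of both inputs.
    void merge(PcsaSketch other) {
        if (other.bitmaps.length != bitmaps.length) {
            throw new IllegalArgumentException("bucket counts differ");
        }
        for (int i = 0; i < bitmaps.length; i++) {
            bitmaps[i] |= other.bitmaps[i];
        }
    }

    // PCSA estimate: (m / PHI) * 2^(average index of the lowest zero bit).
    double estimate() {
        int sum = 0;
        for (int b : bitmaps) {
            sum += Integer.numberOfTrailingZeros(~b);
        }
        double m = bitmaps.length;
        return m / PHI * Math.pow(2.0, sum / m);
    }

    // Copy of the raw bit vectors, i.e. what would be persisted.
    int[] bitmaps() {
        return bitmaps.clone();
    }
}
```

With one such sketch per partition, calling `merge` over any chosen subset of partitions gives an NDV estimate for exactly that subset. The per-partition `estimate()` values alone cannot provide this, since the same value may occur in several partitions.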