[
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144645#comment-16144645
]
Hudson commented on PHOENIX-418:
--------------------------------
FAILURE: Integrated in Jenkins build Phoenix-master #1753 (See
[https://builds.apache.org/job/Phoenix-master/1753/])
PHOENIX-418 Support approximate COUNT DISTINCT (Ethan Wang) (jtaylor: rev
d6381afc3af976ccdbb874d4458ea17b1e8a1d32)
* (add)
phoenix-core/src/main/java/org/apache/phoenix/expression/function/DistinctCountHyperLogLogAggregateFunction.java
* (add)
phoenix-core/src/it/java/org/apache/phoenix/end2end/CountDistinctApproximateHyperLogLogIT.java
* (edit) dev/release_files/NOTICE
* (edit) phoenix-core/pom.xml
* (edit)
phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add)
phoenix-core/src/main/java/org/apache/phoenix/parse/DistinctCountHyperLogLogAggregateParseNode.java
> Support approximate COUNT DISTINCT
> ----------------------------------
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: Ethan Wang
> Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch,
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch,
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to
> all distinct values (since this will not scale well when the number of
> distinct values is huge). The Apache Drill folks have had some interesting
> discussions on this
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
> They recommend using [Welford's
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
> I'm open to having a config option that uses exact versus approximate. I
> don't have experience implementing an approximate implementation, so I'm not
> sure how much state is required to keep on the server and return to the
> client (other than realizing it'd be much less that returning all distinct
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using
> hyperloglog, see discuss below.
> Merged patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)