[jira] [Comment Edited] (PHOENIX-418) Support approximate COUNT DISTINCT

Ethan Wang (JIRA) Thu, 24 Aug 2017 13:35:28 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140468#comment-16140468
 ]


Ethan Wang edited comment on PHOENIX-418 at 8/24/17 8:34 PM:
-------------------------------------------------------------

Thanks for the suggestions [~jamestaylor].

{quote}Make sure to call TestUtil.analyzeTable(connection, fullTableName) prior 
to running your TABLESAMPLE queries. You'll get more rows back, since you'll 
have guideposts in addition to region boundaries.{quote}

Correct me if I'm mistaken: the test for approximate count distinct does not 
use the table sampling technique, since HyperLogLog doesn't rely on sampling 
(hence no run-time speed gain during counting). So updating the guideposts is 
likely to have little impact on approximate count distinct.
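
To illustrate why sampling does not enter the picture, here is a minimal 
HyperLogLog sketch. It is only an illustration under stated assumptions, not 
the code in the attached patches: the class name HllSketch, the hash function, 
and the precision parameter p are all hypothetical. Every input value is 
hashed and folded into a fixed array of 2^p one-byte registers, so the cost of 
counting still scales with the number of rows scanned; only the state kept per 
aggregate is small and bounded.

```java
import java.nio.charset.StandardCharsets;

// Illustrative HyperLogLog; names and hash choice are assumptions,
// not Phoenix's implementation.
final class HllSketch {
    private final int p;            // number of index bits
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // max rank seen per register

    HllSketch(int p) {
        if (p < 7 || p > 16) {
            throw new IllegalArgumentException("p must be in [7, 16]");
        }
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // 64-bit FNV-1a followed by a finalizer mix to spread the bits.
    static long hash64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    // Every value is hashed: there is no sampling step.
    void add(String value) {
        long h = hash64(value.getBytes(StandardCharsets.UTF_8));
        int idx = (int) (h >>> (64 - p));   // top p bits pick a register
        long rest = h << p;                 // remaining bits, left-aligned
        int rank = (rest == 0) ? (64 - p + 1)
                               : Long.numberOfLeadingZeros(rest) + 1;
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    // Merging two partial sketches is a register-wise max.
    void merge(HllSketch other) {
        if (other.p != p) {
            throw new IllegalArgumentException("precision mismatch");
        }
        for (int i = 0; i < m; i++) {
            if (other.registers[i] > registers[i]) {
                registers[i] = other.registers[i];
            }
        }
    }

    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) {
                zeros++;
            }
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant for m >= 128
        double raw = alpha * m * m / sum;
        if (raw <= 2.5 * m && zeros > 0) {
            // Small-range correction: fall back to linear counting.
            return m * Math.log((double) m / zeros);
        }
        return raw;
    }

    // Size of the state that would need to be shipped between nodes.
    int stateBytes() {
        return registers.length;
    }
}
```

With p = 14 the state is 16384 bytes regardless of how many distinct values 
are seen, and partial sketches can be combined with merge(), which is what 
makes the approach attractive for an aggregate that is computed in pieces and 
then combined.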


was (Author: aertoria):
Thanks for suggestions [~jamestaylor].

bq. Make sure to call TestUtil.analyzeTable(connection, fullTableName) prior to 
running your TABLESAMPLE queries. You'll get more rows back, since you'll have 
guideposts in addition to region boundaries.

If I understand this part right, within the test for approximate count 
distinct, table sampling technique was not used since HyperLogLog doesn't rely 
on sampling (hence no run time speed gain during counting). 

> Support approximate COUNT DISTINCT
> ----------------------------------
>
>                 Key: PHOENIX-418
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-418
>             Project: Phoenix
>          Issue Type: Task
>            Reporter: James Taylor
>            Assignee: Ethan Wang
>              Labels: gsoc2016
>         Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
