[
https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154274#comment-16154274
]
Ethan Wang edited comment on PHOENIX-4160 at 9/5/17 8:46 PM:
-------------------------------------------------------------
[~jamestaylor] [~lhofhansl]
was (Author: aertoria):
[~jamestaylor][~lhofhansl]
> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
> Key: PHOENIX-4160
> URL: https://issues.apache.org/jira/browse/PHOENIX-4160
> Project: Phoenix
> Issue Type: Improvement
> Environment: [link title|http://example.com]
> Reporter: Ethan Wang
>
> Now that -PHOENIX-418- is finished, the hash size is currently hard-coded as
> 25/16 bits by default (a design we follow from Apache Druid; for discussion see
> CALCITE-1588). We now want to study what a proper size would be.
> Note:
> 1, the standard error of HyperLogLog is bounded by ~1/sqrt(number of buckets), i.e.,
> sqrt(3\*ln(2)-1)/sqrt(2^precision).
> see the page 129 of this
> [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
> The performance of HLL under different bucket/hash sizes has been studied.
> See details here:
> https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
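The error bound in note 1 can be checked numerically. A minimal sketch (plain Python, not Phoenix code) that evaluates sqrt(3\*ln(2)-1)/sqrt(2^precision) for a few candidate precisions, including the 16 and 25 mentioned above:

```python
import math

def hll_std_error(precision: int) -> float:
    """Theoretical relative standard error of HyperLogLog with 2^precision
    buckets: sqrt(3*ln(2) - 1) / sqrt(2^precision), per Flajolet et al. 2007."""
    return math.sqrt(3 * math.log(2) - 1) / math.sqrt(2 ** precision)

for p in (11, 14, 16, 25):
    print(f"precision={p:2d}  buckets={2**p:>9}  std error ~ {hll_std_error(p):.4%}")
```

For precision 16 this gives a relative standard error of roughly 0.4%, and for precision 25 roughly 0.02%, at the cost of 2^25 buckets of memory per sketch.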
> 2, When the estimated cardinality is large enough, the performance of HLL
> becomes problematic because of hash collisions (saturation). For details,
> see the study done by Flajolet et al. In fact, Timok proposed that any number
> larger than 2^{32}/30 should be considered "too large" for a 32-bit hash. See the study
> [Google’s Take On Engineering
> HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
> and the Figure 8 of
> [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
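For a sense of scale, the 2^{32}/30 saturation threshold from note 2 works out to roughly 143 million distinct values. A quick check (plain Python, not Phoenix code):

```python
# Beyond roughly 2^32/30 distinct values, collisions in a 32-bit hash
# make the classic HLL large-range correction unreliable.
threshold = 2 ** 32 // 30
print(f"32-bit HLL saturation threshold ~ {threshold:,} distinct values")
```

So for workloads whose true cardinality may approach the hundreds of millions, a 32-bit hash is a real limitation regardless of the bucket count.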
> Alternatively, we can instruct users not to use it in exceedingly
> high-cardinality scenarios.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)