[ https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154274#comment-16154274 ]
Ethan Wang edited comment on PHOENIX-4160 at 9/5/17 8:46 PM:
-------------------------------------------------------------
[~jamestaylor] [~lhofhansl]

was (Author: aertoria): [~jamestaylor] [~lhofhansl]

> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4160
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4160
>             Project: Phoenix
>          Issue Type: Improvement
>         Environment: [link title|http://example.com]
>            Reporter: Ethan Wang
>
> Now that -PHOENIX-418- is finished, the hash size is currently hard coded as
> 25/16 bits by default (a design that follows Apache Druid; for discussion see
> CALCITE-1588). We now want to study what a proper size would be.
> Notes:
> 1. The standard error of HyperLogLog is bounded by roughly 1/sqrt(number of
> buckets), i.e., sqrt(3\*ln(2)-1)/sqrt(2^precision).
> See page 129 of this
> [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
> The performance of HLL under different bucket/hash sizes has been studied;
> see details here:
> https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
> 2. When the estimated cardinality is large enough, the performance of HLL
> becomes problematic because of hash collisions (saturation). For details see
> the study done by Flajolet et al. In fact, Timok proposed that any number
> larger than 2^{32}/30 should be considered "too large" for a 32-bit hash. See
> [Google's Take On Engineering
> HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
> and Figure 8 of this
> [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf].
> Alternatively, we can instruct users not to use it in exceedingly high
> cardinality scenarios.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
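To make the error bound and the saturation threshold above concrete, here is a small Python sketch (the function and constant names are mine, not Phoenix code) that evaluates the Flajolet formula sqrt(3*ln(2)-1)/sqrt(2^precision) for a few precisions and the 2^32/30 cutoff for a 32-bit hash:

```python
import math

def hll_std_error(precision):
    """Theoretical HyperLogLog standard error with 2^precision buckets,
    per Flajolet et al.: sqrt(3*ln(2) - 1) / sqrt(2^precision)."""
    return math.sqrt(3 * math.log(2) - 1) / math.sqrt(2 ** precision)

# Large-range saturation threshold for a 32-bit hash (2^32 / 30), beyond
# which estimates degrade due to hash collisions (per the studies cited above).
SATURATION_32BIT = 2 ** 32 // 30

for p in (11, 16, 25):
    print(f"precision={p:2d}  buckets=2^{p}  std error ~ {hll_std_error(p):.4%}")

print(f"32-bit hash saturation threshold: ~{SATURATION_32BIT:,}")
```

For the default precision of 16, this gives a theoretical standard error of about 0.4%, and the 32-bit saturation cutoff comes out to roughly 143 million distinct values, which is one way to frame how large is "too large".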