Ethan Wang created PHOENIX-4160:
-----------------------------------
Summary: research for a proper hash size set for
APPROX_COUNT_DISTINCT
Key: PHOENIX-4160
URL: https://issues.apache.org/jira/browse/PHOENIX-4160
Project: Phoenix
Issue Type: Improvement
Environment: [link title|http://example.com]
Reporter: Ethan Wang
Now after -PHOENIX-418- finishes, currently the hash size is hard coded as
25/16 bits by default (a design we follow Apache Druid. discussion see
CALCITE-1588). And now we want to study to find a proper size.
Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,
sqrt(3\*ln(2)-1)/sqrt(2^precision).
see the page 129 of this
[paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf]
the performance of hll under different bucket/hash size has been studied.
See details here:
https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
2, When the estimate cardinalities is large enough, this performance of hll
will become problematic because the hash collisions (saturation). For detail
see the study done by Flajolet et al. In fact, Timok proposed that any number
larger than 2^{32}/30 should consider "to large" for a 32 bit hash. See study
[Google’s Take On Engineering
HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
and the Figure 8 of
[paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)