[
https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Wang updated PHOENIX-4160:
--------------------------------
Environment:
{code:java}
// Some comments here
public String getFoo()
{
return foo;
}
{code}
was:[link title|http://example.com]
Description:
Now after -PHOENIX-418- finishes, currently the hash size is hard coded as
25/16 bits by default (a design we follow Apache Druid. discussion see
CALCITE-1588). And now we want to study to find a proper default size.
Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,
{code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129 of
this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
To try on a bigger size, the performance of hll under different bucket/hash
size has been studied here:
https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
2, When the estimate cardinalities is large enough, this performance of hll
will become problematic because the hash collisions (saturation). For detail
see the study done by Flajolet et al. In fact, Timok proposed that any number
larger than {code:java}2^{32}/30{code} should consider "to large" for a 32 bit
hash. See study [Google’s Take On Engineering
HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
and the Figure 8 of
[paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
Alternatively we can instruct user to not only use it in exceedingly huge
cardinally scenario.
was:
Now after -PHOENIX-418- finishes, currently the hash size is hard coded as
25/16 bits by default (a design we follow Apache Druid. discussion see
CALCITE-1588). And now we want to study to find a proper default size.
Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,
sqrt(3\*ln(2)-1)/sqrt(2^precision). Detail see the page 129 of this
[paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
To try on a bigger size, the performance of hll under different bucket/hash
size has been studied here:
https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
2, When the estimate cardinalities is large enough, this performance of hll
will become problematic because the hash collisions (saturation). For detail
see the study done by Flajolet et al. In fact, Timok proposed that any number
larger than 2^{32}/30 should consider "to large" for a 32 bit hash. See study
[Google’s Take On Engineering
HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
and the Figure 8 of
[paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
Alternatively we can instruct user to not only use it in exceedingly huge
cardinally scenario.
> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
> Key: PHOENIX-4160
> URL: https://issues.apache.org/jira/browse/PHOENIX-4160
> Project: Phoenix
> Issue Type: Improvement
> Environment: {code:java}
> // Some comments here
> public String getFoo()
> {
> return foo;
> }
> {code}
> Reporter: Ethan Wang
>
> Now after -PHOENIX-418- finishes, currently the hash size is hard coded as
> 25/16 bits by default (a design we follow Apache Druid. discussion see
> CALCITE-1588). And now we want to study to find a proper default size.
> Note:
> 1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,
> {code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129
> of this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
> To try on a bigger size, the performance of hll under different bucket/hash
> size has been studied here:
> https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
> 2, When the estimate cardinalities is large enough, this performance of hll
> will become problematic because the hash collisions (saturation). For detail
> see the study done by Flajolet et al. In fact, Timok proposed that any number
> larger than {code:java}2^{32}/30{code} should consider "to large" for a 32
> bit hash. See study [Google’s Take On Engineering
> HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
> and the Figure 8 of
> [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
> Alternatively we can instruct user to not only use it in exceedingly huge
> cardinally scenario.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)