[ 
https://issues.apache.org/jira/browse/PHOENIX-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-4160:
--------------------------------
    Description: 
Now after -PHOENIX-418- finishes, we want to study to find a proper default 
size for hll hash. Currently the hash size is hard coded as 25/16 bits by 
default (a design we follow Apache Druid. discussion see CALCITE-1588).

Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,  
{code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129 of 
this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
To try on a bigger size, the performance of hll under different bucket/hash 
size has been studied here: 
https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html

2, When the estimate cardinalities is large enough, as Timok and Flajolet et al 
found out, this performance of hll will become problematic because the hash 
collisions (saturation). In fact, Timok proposed that any number larger than 
{code}2^{32}/30{code} should consider "to large" for a 32 bit hash. See study 
[Google’s Take On Engineering 
HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
 and the Figure 8 of 
[paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]

Alternatively we can instruct user to not only use it in exceedingly huge 
cardinally scenario.

  was:
Now after -PHOENIX-418- finishes, we want to study to find a proper default 
size for hll hash. Currently the hash size is hard coded as 25/16 bits by 
default (a design we follow Apache Druid. discussion see CALCITE-1588).

Note:
1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,  
{code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129 of 
this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
To try on a bigger size, the performance of hll under different bucket/hash 
size has been studied here: 
https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html

2, When the estimate cardinalities is large enough, as Timok et al found out, 
this performance of hll will become problematic because the hash collisions 
(saturation). For detail see the study done by Flajolet et al. In fact, Timok 
proposed that any number larger than {code}2^{32}/30{code} should consider "to 
large" for a 32 bit hash. See study [Google’s Take On Engineering 
HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
 and the Figure 8 of 
[paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]

Alternatively we can instruct user to not only use it in exceedingly huge 
cardinally scenario.


> research for a proper hash size set for APPROX_COUNT_DISTINCT
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4160
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4160
>             Project: Phoenix
>          Issue Type: Improvement
>         Environment: 
>            Reporter: Ethan Wang
>
> Now after -PHOENIX-418- finishes, we want to study to find a proper default 
> size for hll hash. Currently the hash size is hard coded as 25/16 bits by 
> default (a design we follow Apache Druid. discussion see CALCITE-1588).
> Note:
> 1, the std error of hyperloglog is bound by 1/sqrt(size of hash). i.e.,  
> {code:java}sqrt(3\*ln(2)-1)/sqrt(2^precision){code} Detail see the page 129 
> of this [paper|http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf].
> To try on a bigger size, the performance of hll under different bucket/hash 
> size has been studied here: 
> https://metron.apache.org/current-book/metron-analytics/metron-statistics/HLLP.html
> 2, When the estimate cardinalities is large enough, as Timok and Flajolet et 
> al found out, this performance of hll will become problematic because the 
> hash collisions (saturation). In fact, Timok proposed that any number larger 
> than {code}2^{32}/30{code} should consider "to large" for a 32 bit hash. See 
> study [Google’s Take On Engineering 
> HLL|https://research.neustar.biz/2013/01/24/hyperloglog-googles-take-on-engineering-hll/]
>  and the Figure 8 of 
> [paper|https://stefanheule.com/papers/edbt13-hyperloglog.pdf]
> Alternatively we can instruct user to not only use it in exceedingly huge 
> cardinally scenario.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to