[
https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156083#comment-16156083
]
Ethan Wang commented on PHOENIX-4164:
-------------------------------------
P.S., For folks that want to play around with precision configuration:
In this hll implementation there are two parameters that is configurable:
NormalSetPrecision (p)
SparseSetPrecision (sp)
In short: the actually leading zero counting space = (64-p). so less p, more
room for counting.
For detail:
Due to the reason that hyperloglog performs poorly when cardinality is low,
[Stefan et
al.|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf]
came up an idea to use a normal hash set in the scenario when cordiality is
low so that we can achieve the best at the both worlds (except we didn't end up
using a normal hash set, we use a sparse set instead). So, in the case when
cardinality goes up, the counting will switch to the hyper log log, because
"sparse has the advantage on accuracy per unit of memory at low cardinality but
quickly falls behind."
Based on this design, this is how streamlib come up a way of using this 64-bits
hash. See
[this|https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java]
{code}
* ******************************************************* <- hashed
length of bits
* | p bits = idx || look for leading zeros here |
* | sp bits = idx' |
{code}
So, during normal hll model, p is the bucket size, the 64-p is the size used
for counting leading zeros.
Though, note that, with default p=16, there is up to 48 bits left for hll, that
is 2^48 of hashing space. The load for a 30million data test set is relatively
trivial.
> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
> Key: PHOENIX-4164
> URL: https://issues.apache.org/jira/browse/PHOENIX-4164
> Project: Phoenix
> Issue Type: Bug
> Reporter: Lars Hofhansl
> Assignee: Ethan Wang
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1) |
> +-----------+
> | 26931816 |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1) |
> +----------------------------+
> | 17221394 |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is
> close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine
> and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)