[ 
https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156083#comment-16156083
 ] 

Ethan Wang edited comment on PHOENIX-4164 at 9/6/17 9:51 PM:
-------------------------------------------------------------

P.S. For folks who want to play around with the precision configuration:

In this HLL implementation there are two parameters that are configurable:
NormalSetPrecision (p)
SparseSetPrecision (sp)

In short: the actual leading-zero counting space is (64 - p) bits, so the smaller p is, the more room there is for counting.
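As a hedged illustration (the parameter values here are just examples, not necessarily Phoenix's defaults), this is how the two knobs map onto stream-lib's HyperLogLogPlus constructor:
{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

// p  = normal-set precision: 2^p registers in the dense representation.
// sp = sparse-set precision; must be >= p (sp = 0 disables the sparse
//      representation entirely). Values below are illustrative.
HyperLogLogPlus hll = new HyperLogLogPlus(16, 25);

hll.offer("some-value");           // add an element
long estimate = hll.cardinality(); // current distinct-count estimate
{code}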

In detail:
Because HyperLogLog performs poorly when the cardinality is low, [Heule et al.|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf] came up with the idea of using a normal hash set while the cardinality is low, so that we get the best of both worlds (except that we don't end up using a normal hash set here; we use a sparse set instead). Then, as the cardinality goes up, the counting switches over to HyperLogLog, because "sparse has the advantage on accuracy per unit of memory at low cardinality but quickly falls behind."
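To make the switch-over concrete, a minimal sketch against stream-lib (the exact conversion point is internal to the library; the counts and parameters below are illustrative):
{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

HyperLogLogPlus hll = new HyperLogLogPlus(16, 25);

// Low cardinality: the sparse set keeps the estimate near-exact.
for (int i = 0; i < 1000; i++) {
    hll.offer("key-" + i);
}
System.out.println(hll.cardinality()); // ~1000, typically exact

// High cardinality: the library converts to the normal 2^p register
// array, with the usual HLL relative error of about 1.04/sqrt(2^p),
// i.e. ~0.4% for p=16.
for (int i = 1000; i < 1000000; i++) {
    hll.offer("key-" + i);
}
System.out.println(hll.cardinality()); // ~1000000, within ~0.4%
{code}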

Based on this design, here is how stream-lib uses the 64-bit hash. See the code [here|https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java]:
{code}
     * *******************************************************   <- hashed length of bits
     * | p bits = idx ||     look for leading zeros here     |
     * |      sp bits = idx'     |
{code}
So, in the normal HLL mode, p determines the bucket (register) index, and the remaining 64 - p bits are used for counting leading zeros.
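As a sketch of that layout (this is the standard HLL+ extraction; stream-lib's exact bit twiddling may differ):
{code}
int p = 16;
long hash = 0x1A2B3C4D5E6F7081L;  // stand-in for a 64-bit hash of the value

int idx   = (int) (hash >>> (64 - p));           // top p bits -> register index
long rest = hash << p;                           // the remaining 64 - p bits
int rank  = Long.numberOfLeadingZeros(rest) + 1; // position of the first 1-bit
// The register for idx keeps the max rank seen; rest == 0 would give
// rank 65 and gets capped in a real implementation.
{code}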

Note, though, that with the default p=16 there are up to 48 bits left for the HLL rank, i.e. a hashing space of 2^48 ≈ 2.8 x 10^14 values. Against that, a 30-million-row test set (about 3 x 10^7 values) is relatively trivial.



> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4164
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4164
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Ethan Wang
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 26931816  |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1)  |
> +----------------------------+
> | 17221394                   |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is 
> close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine 
> and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]


