[
https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155556#comment-16155556
]
Ethan Wang commented on PHOENIX-4164:
-------------------------------------
Can you provide a sample of your v1? I built a local test table with 30 million
rows, and the approximate count is within about 0.005 (relative error) of the
exact count.
{code}
0: jdbc:phoenix:localhost:2181:/hbase> select count(id) from test;
+------------+
| COUNT(ID)  |
+------------+
| 30000000   |
+------------+
0: jdbc:phoenix:localhost:2181:/hbase> select approx_count_distinct(id) from test;
+----------------------------+
| APPROX_COUNT_DISTINCT(ID)  |
+----------------------------+
| 30048464                   |
+----------------------------+
{code}
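For reference, here is a minimal sketch of how the relative error can be worked out from the two counts above. The `relative_error` helper is hypothetical (it is not part of Phoenix), and the second call assumes the cardinality of v1 equals the row count, which the report below only describes as "close to".

```python
def relative_error(approx: int, exact: int) -> float:
    """Signed relative error of an approximate count against the exact count."""
    if exact == 0:
        raise ValueError("exact count must be non-zero")
    return (approx - exact) / exact


# Local 30-million-row test table: approx_count_distinct(id) vs count(id).
local_err = relative_error(30048464, 30000000)

# Reported case: approx_count_distinct(v1) vs count(*).
# Assumption: every v1 value is unique, so the exact cardinality is the row count.
reported_err = relative_error(17221394, 26931816)

print(f"local test:    {local_err:+.4%}")
print(f"reported case: {reported_err:+.4%}")
```

Under that assumption, the local test overestimates by well under one percent, while the reported case underestimates by roughly a third, so the two results are not in the same range.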
> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
> Key: PHOENIX-4164
> URL: https://issues.apache.org/jira/browse/PHOENIX-4164
> Project: Phoenix
> Issue Type: Bug
> Reporter: Lars Hofhansl
> Assignee: Ethan Wang
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 26931816  |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1)  |
> +----------------------------+
> | 17221394                   |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is
> close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine
> and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]