[ https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155556#comment-16155556 ]
Ethan Wang commented on PHOENIX-4164:
-------------------------------------

Can you provide a sample of your v1?
I made a local test table with 30 million rows, and the approximate count gives me around 0.005 inaccuracy.
{code}
0: jdbc:phoenix:localhost:2181:/hbase> select count(id) from test;
+------------+
| COUNT(ID)  |
+------------+
| 30000000   |
+------------+
0: jdbc:phoenix:localhost:2181:/hbase> select approx_count_distinct(id) from test;
+----------------------------+
| APPROX_COUNT_DISTINCT(ID)  |
+----------------------------+
| 30048464                   |
+----------------------------+
{code}

> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4164
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4164
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Ethan Wang
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 26931816  |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1)  |
> +----------------------------+
> | 17221394                   |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is
> close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine
> and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
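
For reference, the relative error implied by the two query outputs in this thread can be checked with a few lines of Python. This is only a sketch and not part of the original exchange. The helper name `relative_error` is hypothetical, and it assumes the true distinct count equals the row count in both tables. The issue description states that only approximately for v1 ("close to the number of rows"), so the second figure is an upper-bound estimate rather than a measured error.

```python
def relative_error(estimate: int, actual: int) -> float:
    """Absolute difference between estimate and actual, as a fraction of actual."""
    return abs(estimate - actual) / actual


# Comment's test table: approx_count_distinct(id) vs count(id).
id_error = relative_error(30048464, 30000000)

# Reported table: approx_count_distinct(v1) vs count(*).
# Assumption: every v1 value is unique, which the report only claims approximately.
v1_error = relative_error(17221394, 26931816)

print(f"id: {id_error:.2%}")
print(f"v1: {v1_error:.2%}")
```

Under that assumption, the first table is off by well under one percent, while the reported table undercounts by roughly a third, which is why a sample of the v1 data is needed to reproduce the problem.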