[ https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16159770#comment-16159770 ]
Ethan Wang commented on PHOENIX-4164:
-------------------------------------

I see. Sounds good. I have marked this as related to PHOENIX-4160 so that people can share our findings.

> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4164
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4164
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 26931816  |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1)  |
> +----------------------------+
> | 17221394                   |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]
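
For anyone following along, here is a rough sketch of how far off the reported estimate is. It treats the row count as a stand-in for the true cardinality, which the description says is only "close to" the real value, so the result is an approximation. The helper names are hypothetical. The second helper uses the textbook HyperLogLog standard error of about 1.04/sqrt(m) for m registers, on the assumption that the function is backed by a HyperLogLog-style sketch. The precision values of 14 and 16 bits are illustrative only and are not taken from Phoenix's configuration.

```python
import math

def relative_error(estimate: int, reference: int) -> float:
    """Signed relative error of an estimate against a reference value."""
    return (estimate - reference) / reference

def hll_standard_error(precision_bits: int) -> float:
    """Textbook HyperLogLog standard error for m = 2**precision_bits registers."""
    m = 2 ** precision_bits
    return 1.04 / math.sqrt(m)

# Numbers copied from the sqlline output in the issue description.
row_count = 26931816        # COUNT(*), used here as a proxy for the true cardinality
approx_distinct = 17221394  # APPROX_COUNT_DISTINCT(V1)

observed = relative_error(approx_distinct, row_count)
print(f"observed relative error: {observed:.1%}")

# Illustrative precisions only; the value Phoenix actually uses is an assumption.
for p in (14, 16):
    print(f"p={p}: expected standard error ~ {hll_standard_error(p):.2%}")
```

If the true cardinality really is close to the row count, the estimate is low by roughly a third. That is far outside the sub-percent error expected at these precisions, which points to a bug rather than ordinary sketch noise.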