[ https://issues.apache.org/jira/browse/PHOENIX-3836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samarth Jain updated PHOENIX-3836:
----------------------------------
    Attachment: PHOENIX-3836.patch

Patch with test that repros the issue along with the fix. It turned out that, at least in 0.98, when HBase runs major compaction, it imposes a limit on the number of key values that can be returned in one internalScanner.next() call. As a result, in our DefaultStatisticsCollector, we may end up counting the row more than once. The issue is reproducible only when the number of key values in a row is greater than 10 (which is the default for hbase.hstore.compaction.kv.max).

[~jamestaylor], please review.

> Estimated row count is twice the actual row count when stats are updated via major compaction
> ---------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3836
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3836
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Mujtaba Chohan
>            Assignee: Samarth Jain
>            Priority: Minor
>         Attachments: PHOENIX-3836.patch
>
>
> Estimated row count for a 2M table is 3986498 after stats updated via major compaction vs 1993250 with {{update statistics}}.
> {noformat}
> Explain plan for count(*) on 2M row table after major compaction:
> +--------------------------------------------------------------------------------------+
> | PLAN                                                                                 |
> +--------------------------------------------------------------------------------------+
> | CLIENT 364-CHUNK 3986498 ROWS 3774892993 BYTES PARALLEL 1-WAY FULL SCAN OVER T       |
> | SERVER FILTER BY FIRST KEY ONLY                                                      |
> | SERVER AGGREGATE INTO SINGLE ROW                                                     |
> +--------------------------------------------------------------------------------------+
> Explain plan for count(*) on 2M row table after update statistics:
> +--------------------------------------------------------------------------------------+
> | PLAN                                                                                 |
> +--------------------------------------------------------------------------------------+
> | CLIENT 364-CHUNK 1993250 ROWS 3774892993 BYTES PARALLEL 1-WAY FULL SCAN OVER T       |
> | SERVER FILTER BY FIRST KEY ONLY                                                      |
> | SERVER AGGREGATE INTO SINGLE ROW                                                     |
> +--------------------------------------------------------------------------------------+
> {noformat}
> Following schema was used with 2M rows and 10MB guidepost width:
> {noformat}
> CREATE TABLE IF NOT EXISTS T (PKA CHAR(15) NOT NULL, PKF CHAR(3) NOT NULL,
>  PKP CHAR(15) NOT NULL, CRD DATE NOT NULL, EHI CHAR(15) NOT NULL, STD_COL VARCHAR, INDEXED_COL INTEGER,
>  CONSTRAINT PK PRIMARY KEY ( PKA, PKF, PKP, CRD DESC, EHI))
>  VERSIONS=1,MULTI_TENANT=true,IMMUTABLE_ROWS=true
> {noformat}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
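
The double counting described in the comment can be illustrated with a small standalone sketch. This is not the code from PHOENIX-3836.patch and it uses no HBase or Phoenix API: the class, the `Cell` type and every method name below are hypothetical. It only assumes what the comment states, namely that one row's key values can arrive split across several batches when a per-call limit (10 here, mirroring the default of hbase.hstore.compaction.kv.max) is in effect. Counting one row per batch then overcounts, while counting only when the row key changes does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a limited scanner and two ways of counting rows.
final class BatchedRowCountSketch {

    /** Stand-in for a key value; only the row key matters for counting. */
    static final class Cell {
        final String rowKey;
        final int column;
        Cell(String rowKey, int column) {
            this.rowKey = rowKey;
            this.column = column;
        }
    }

    /** Builds rowCount rows, each holding cellsPerRow cells. */
    static List<List<Cell>> makeRows(int rowCount, int cellsPerRow) {
        List<List<Cell>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            List<Cell> row = new ArrayList<>();
            for (int c = 0; c < cellsPerRow; c++) {
                row.add(new Cell("row" + r, c));
            }
            rows.add(row);
        }
        return rows;
    }

    /** Splits each row into batches of at most kvMax cells, like a limited next() call. */
    static List<List<Cell>> batches(List<List<Cell>> rows, int kvMax) {
        List<List<Cell>> out = new ArrayList<>();
        for (List<Cell> row : rows) {
            for (int i = 0; i < row.size(); i += kvMax) {
                out.add(row.subList(i, Math.min(i + kvMax, row.size())));
            }
        }
        return out;
    }

    /** Naive count: assumes every non-empty batch is a complete row. */
    static long countPerBatch(List<List<Cell>> batches) {
        long count = 0;
        for (List<Cell> batch : batches) {
            if (!batch.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    /** Counts a row only when the row key differs from the previous batch's row key. */
    static long countPerDistinctRow(List<List<Cell>> batches) {
        long count = 0;
        String lastRowKey = null;
        for (List<Cell> batch : batches) {
            if (batch.isEmpty()) {
                continue;
            }
            String rowKey = batch.get(0).rowKey;
            if (!rowKey.equals(lastRowKey)) {
                count++;
                lastRowKey = rowKey;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 3 rows of 15 cells with a limit of 10: each row arrives as two batches.
        List<List<Cell>> split = batches(makeRows(3, 15), 10);
        System.out.println("per batch: " + countPerBatch(split)
                + ", per distinct row: " + countPerDistinctRow(split));
        // prints "per batch: 6, per distinct row: 3"
    }
}
```

In this model the two counts agree whenever a row has at most 10 cells, which is consistent with the issue reproducing only for rows with more than 10 key values.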