[ https://issues.apache.org/jira/browse/DRILL-6442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Volodymyr Vysotskyi updated DRILL-6442:
---------------------------------------
    Labels: ready-to-commit  (was: )

> Adjust Hbase disk cost & row count estimation when filter push down is applied
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6442
>                 URL: https://issues.apache.org/jira/browse/DRILL-6442
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.14.0
>
>
> Disk cost for an HBase scan is calculated from the scan size in bytes:
> {noformat}
> float diskCost = scanSizeInBytes * ((columns == null || columns.isEmpty()) ?
> 1 : columns.size() / statsCalculator.getColsPerRow());
> {noformat}
> The scan size in bytes is estimated by {{TableStatsCalculator}} with the help of sampling (a sketch of this calculation follows the description below).
> When we estimate the size for the first time (before filter push down is applied), sampling uses random rows. When estimating after filter push down, sampling uses only rows that satisfy the filter condition. As a result, the average row size can turn out higher after filter push down than before. Since disk cost depends on these estimates, a plan with filter push down can get a higher cost than a plan without it.
> Possible enhancements:
> 1. The default row count is currently 1 million, but if sampling returns fewer rows than it asked for, the query cannot return more rows than were sampled. We can use that number instead of the default row count to get better cost estimates (see the second sketch below).
> 2. When filter push down is applied, the row count is halved to ensure that the plan with filter push down gets a lower cost. The same should be done for the disk cost as well (see the third sketch below).
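>
> Below is a minimal, self-contained sketch of the disk cost formula quoted above. The inputs ({{scanSizeInBytes}}, the projected columns, and the columns-per-row statistic) are hard-coded stand-ins for the values Drill obtains from {{TableStatsCalculator}}, and the cast to {{float}} is an assumption added to keep the column ratio from being truncated by integer division; it is not a claim about how the actual code reads.
> {noformat}
> import java.util.Arrays;
> import java.util.List;
>
> public class DiskCostSketch {
>
>   // Mirrors the quoted formula: scale the scan size by the fraction of
>   // columns actually projected; an empty projection reads full rows.
>   static float diskCost(long scanSizeInBytes, List<String> columns, int colsPerRow) {
>     float columnRatio = (columns == null || columns.isEmpty())
>         ? 1.0f
>         : (float) columns.size() / colsPerRow; // assumed cast, see lead-in
>     return scanSizeInBytes * columnRatio;
>   }
>
>   public static void main(String[] args) {
>     // 100 MB scan with 2 of 10 columns projected -> disk cost of ~20 MB.
>     System.out.println(diskCost(100_000_000L, Arrays.asList("a", "b"), 10));
>   }
> }
> {noformat}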
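>
> A sketch of enhancement 1, under stated assumptions: the names {{sampledRowCount}}, {{sampleLimit}} and {{defaultRowCount}} are illustrative and do not come from Drill's code base. The idea is that if sampling asked for N rows but received fewer, the data cannot contain more rows than were sampled, so the sampled count is a tighter estimate than the 1 million default.
> {noformat}
> // Hypothetical helper, not Drill's actual API.
> static long estimateRowCount(long sampledRowCount, int sampleLimit, long defaultRowCount) {
>   if (sampledRowCount < sampleLimit) {
>     // Sampling ran out of rows: the sampled count is an upper bound on
>     // the real row count, so prefer it over the default.
>     return sampledRowCount;
>   }
>   return defaultRowCount; // e.g. 1,000,000 by default
> }
> {noformat}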
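>
> And a sketch of enhancement 2, mirroring for disk cost the halving that is already applied to the row count (the method name and the boolean flag are illustrative; the 0.5 factor follows from "reduced by half" in the description):
> {noformat}
> // Hypothetical helper: apply the same reduction factor that is already
> // applied to the row count, so a plan with a pushed-down filter never
> // gets a higher disk cost than the plan without push down.
> static float adjustDiskCost(float diskCost, boolean filterPushedDown) {
>   return filterPushedDown ? diskCost * 0.5f : diskCost;
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)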