Aman Sinha created IMPALA-9911:
----------------------------------

             Summary: IS [NOT] NULL predicate selectivity estimate is wrong if #nulls is 0
                 Key: IMPALA-9911
                 URL: https://issues.apache.org/jira/browse/IMPALA-9911
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.4.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha

Consider the tpcds customer table. Its c_current_addr_sk column has #Nulls = 0, as shown below.
{noformat}
tpcds> show column stats customer;
+------------------------+--------+------------------+--------+----------+-------------------+
| Column                 | Type   | #Distinct Values | #Nulls | Max Size | Avg Size          |
+------------------------+--------+------------------+--------+----------+-------------------+
....
| c_current_cdemo_sk     | INT    | 91558            | 3438   | 4        | 4                 |
| c_current_hdemo_sk     | INT    | 7376             | 3431   | 4        | 4                 |
| c_current_addr_sk      | INT    | 42003            | 0      | 4        | 4                 |
....
{noformat}

The cardinality estimates for the following predicates show that a default selectivity of 10% is being applied, which is not correct:
{noformat}
explain select c_current_addr_sk from customer where c_current_addr_sk is not null;
| 00:SCAN HDFS [tpcds.customer]                              |
|    HDFS partitions=1/1 files=1 size=12.60MB                |
|    predicates: c_current_addr_sk IS NOT NULL               |
|    row-size=4B cardinality=10.00K                          |
+------------------------------------------------------------+

explain select c_current_addr_sk from customer where c_current_addr_sk is null;
| 00:SCAN HDFS [tpcds.customer]                              |
|    HDFS partitions=1/1 files=1 size=12.60MB                |
|    predicates: c_current_addr_sk IS NULL                   |
|    row-size=4B cardinality=10.00K                          |
{noformat}
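
For reference, here is a rough sketch of the estimates one would expect from these column stats. It is not Impala's actual planner code. The helper name `is_null_selectivity`, the use of `None` to model missing statistics, and the row count of about 100,000 (inferred from cardinality=10.00K at 10% selectivity) are all assumptions made for illustration. The point it shows is that a known null count of 0 is a usable statistic and should not trigger the default selectivity.

```python
from typing import Optional

# The 10% default selectivity visible in the plans above.
DEFAULT_SELECTIVITY = 0.1


def is_null_selectivity(num_nulls: Optional[int],
                        num_rows: Optional[int],
                        negated: bool = False) -> float:
    """Hypothetical helper: selectivity of `col IS [NOT] NULL` from column stats.

    `None` models a missing statistic (an assumption of this sketch).
    A null count of 0 is a valid, known value and must not be treated as missing.
    """
    if num_nulls is None or num_rows is None or num_rows <= 0:
        # Only fall back to the default when the stats are genuinely unavailable.
        return DEFAULT_SELECTIVITY
    # Clamp to [0, 1] in case the stats are stale and num_nulls > num_rows.
    null_fraction = min(max(num_nulls / num_rows, 0.0), 1.0)
    return 1.0 - null_fraction if negated else null_fraction


# Assumed row count, inferred from cardinality=10.00K at 10% selectivity.
NUM_ROWS = 100_000

# c_current_addr_sk has #Nulls = 0.
addr_is_null = is_null_selectivity(0, NUM_ROWS)
addr_is_not_null = is_null_selectivity(0, NUM_ROWS, negated=True)

# c_current_cdemo_sk has #Nulls = 3438.
cdemo_is_null = is_null_selectivity(3438, NUM_ROWS)

print(addr_is_null, addr_is_not_null, cdemo_is_null)
```

Under these assumptions, `c_current_addr_sk IS NULL` would be estimated at 0 rows and `c_current_addr_sk IS NOT NULL` at all rows of the table, instead of 10.00K for both.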