Aman Sinha created IMPALA-10615:
-----------------------------------

             Summary: Cardinality estimates for some scalar functions could be improved
                 Key: IMPALA-10615
                 URL: https://issues.apache.org/jira/browse/IMPALA-10615
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.4.0
            Reporter: Aman Sinha

The 10% default cardinality estimate for predicates involving most scalar functions can be a significant under-estimate. Consider the following cardinality estimate with UPPER():

{noformat}
[localhost:21050] tpch> explain select * from nation where upper(n_name) is not null;
| 00:SCAN HDFS [tpch.nation]                                 |
|    HDFS partitions=1/1 files=1 size=2.15KB                 |
|    predicates: upper(n_name) IS NOT NULL                   |
|    row-size=109B cardinality=3                             |
+------------------------------------------------------------+
{noformat}

Since n_name is non-null, the actual cardinality is 25, as shown below:

{noformat}
[localhost:21050] tpch> explain select * from nation where n_name is not null;
| 00:SCAN HDFS [tpch.nation]                                 |
|    HDFS partitions=1/1 files=1 size=2.15KB                 |
|    predicates: n_name IS NOT NULL                          |
|    row-size=109B cardinality=25                            |
+------------------------------------------------------------+
{noformat}

In general, if a scalar function cannot change the nullability of its input, we should compute the same selectivity.

Note that for explicit CAST, we do the right thing:

{noformat}
[localhost:21050] tpch> explain select * from nation where cast(n_name as varchar(10)) is not null;
| 00:SCAN HDFS [tpch.nation]                                 |
|    HDFS partitions=1/1 files=1 size=2.15KB                 |
|    predicates: CAST(n_name AS VARCHAR(10)) IS NOT NULL     |
|    row-size=109B cardinality=25                            |
{noformat}
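
The proposed rule can be illustrated with a small standalone sketch. This is not Impala frontend code: the `ColumnStats` class, the `NULL_PRESERVING_FUNCTIONS` set, and both function names are hypothetical, and rounding the estimate up is an assumption chosen only because it reproduces the cardinality=3 shown for the 25-row nation table.

```python
import math
from dataclasses import dataclass
from typing import Optional

# The 10% default selectivity described in this issue.
DEFAULT_SELECTIVITY = 0.1

# Hypothetical list of scalar functions assumed to return NULL only when
# their input is NULL (i.e. they cannot change nullability).
NULL_PRESERVING_FUNCTIONS = {"upper", "lower", "cast"}


@dataclass
class ColumnStats:
    """Hypothetical per-column statistics."""
    num_rows: int
    num_nulls: int


def is_not_null_selectivity(stats: ColumnStats, func_name: Optional[str] = None) -> float:
    """Selectivity of `f(col) IS NOT NULL`, or `col IS NOT NULL` if func_name is None."""
    if func_name is not None and func_name.lower() not in NULL_PRESERVING_FUNCTIONS:
        # Unknown effect on nullability: fall back to the default guess.
        return DEFAULT_SELECTIVITY
    if stats.num_rows == 0:
        return 0.0
    # Nullability is unchanged, so reuse the column's own non-null fraction.
    return 1.0 - stats.num_nulls / stats.num_rows


def estimate_cardinality(stats: ColumnStats, func_name: Optional[str] = None) -> int:
    """Estimated output rows; rounding up is an assumption of this sketch."""
    return math.ceil(stats.num_rows * is_not_null_selectivity(stats, func_name))


# n_name in tpch.nation: 25 rows, no NULLs.
n_name = ColumnStats(num_rows=25, num_nulls=0)
print(estimate_cardinality(n_name))              # bare column
print(estimate_cardinality(n_name, "upper"))     # null-preserving function
print(estimate_cardinality(n_name, "some_udf"))  # falls back to the 10% default
```

With such a rule, upper(n_name) IS NOT NULL would get the same estimate as n_name IS NOT NULL, while functions whose effect on nullability is unknown would keep the 10% default.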