[ https://issues.apache.org/jira/browse/IMPALA-8039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on IMPALA-8039 started by Paul Rogers.
-------------------------------------------
> Incorrect selectivity estimate for not-equals predicate
> -------------------------------------------------------
>
>                 Key: IMPALA-8039
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8039
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Suppose we write a query that uses the not-equals predicate:
> {code:sql}
> select *
> from functional.alltypestiny
> where id != 10
> {code}
> How many rows will we get? Let's reason it out. Suppose we do this:
> {code:sql}
> select *
> from functional.alltypestiny
> where id = 10
> {code}
> We know that {{id}} is unique and the table has 8 rows. So, in the second
> query, we'll get only one row: where {{id = 10}}. Using this, we can see that
> the first query will return all the rows that the second one did not, that is
> {{8 - 1 = 7}}.
> Let's see what the planner says:
> {noformat}
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 00:SCAN HDFS [functional.alltypestiny]
>    partitions=4/4 files=4 size=460B
>    predicates: id != CAST(10 AS INT)
>    tuple-ids=0 row-size=89B cardinality=1
> {noformat}
> So, the planner says that both equality and inequality give the same number
> of rows. Clearly, this is wrong. It is a symptom of the fact that Impala does
> not attempt to calculate selectivity for predicates other than equality
> (IMPALA-7601).
> The correct selectivity estimate for inequality is:
> {noformat}
> sel(c != x) = 1 - 1/ndv(c)
> {noformat}
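The arithmetic behind sel(c != x) = 1 - 1/ndv(c) can be checked against the alltypestiny example with a small sketch. This is illustrative Python, not Impala's planner code: the function names are hypothetical, and it assumes values are uniformly distributed over the distinct values and that the column has no NULLs.

```python
def eq_selectivity(ndv: int) -> float:
    """Estimated fraction of rows matching `c = x`.

    Assumes rows are spread uniformly over `ndv` distinct values.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    return 1.0 / ndv


def ne_selectivity(ndv: int) -> float:
    """Estimated fraction of rows matching `c != x`: 1 - 1/ndv(c)."""
    return 1.0 - eq_selectivity(ndv)


def estimate_cardinality(row_count: int, selectivity: float) -> int:
    """Scale the table's row count by a selectivity and round to whole rows."""
    return round(row_count * selectivity)


# functional.alltypestiny as described above: 8 rows, unique id, so ndv(id) = 8.
rows, ndv = 8, 8
eq_rows = estimate_cardinality(rows, eq_selectivity(ndv))
ne_rows = estimate_cardinality(rows, ne_selectivity(ndv))
print(f"id = 10  -> estimated rows: {eq_rows}")
print(f"id != 10 -> estimated rows: {ne_rows}")
```

Under these assumptions the two estimates add up to the table's row count, which is the property the plan above violates by reporting cardinality=1 for the not-equals predicate.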