[ 
https://issues.apache.org/jira/browse/IMPALA-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8038:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Incorrect estimation of selectivity of OR expression
> ----------------------------------------------------
>
>                 Key: IMPALA-8038
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8038
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Suppose we have a query in which we include an OR expression:
> {code:sql}
> select *
> from functional.alltypestiny t
> where t.id = 10
>    or t.bool_col = true
> {code}
> We can compute the join cardinality, {{|join|}}, by observing that:
> * Either id will match,
> * Or, bool_col will match,
> * Or, both will match.
> The number of rows matching an equality predicate is estimated as {{|T|/NDV(col)}}, and:
> * {{|T| = 8}}
> * {{NDV(id) = |T| = 8}}
> * {{NDV(bool_col) = 2}}
> * {{|T 𝜎 id=10| = |T|/NDV(id) = 1}}
> * {{|T 𝜎 bool_col=true| = |T|/NDV(bool_col) = 4}}
> So, if one or the other (or both) columns match, this means that we get 
> either 1 row or 4 rows. So, the estimated cardinality should be the maximum 
> of the two terms:
> {noformat}
> |bool_col| = NDV(bool_col)
> |id| = NDV(id)
> |join| = max( |T|/|id|, |T|/|bool_col| )
>          = max(1, 4)
>          = 4
> {noformat}
> Let's see what we actually get:
> {noformat}
> PLAN-ROOT SINK
> |  mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> 00:SCAN HDFS [functional.alltypestiny t]
>    partitions=4/4 files=4 size=460B
>    predicates: t.id = CAST(10 AS INT) OR t.bool_col = TRUE
>    tuple-ids=0 row-size=89B cardinality=5
> {noformat}
> So, it looks like the code adds the two terms, {{|T|/|id| + |T|/|bool_col|}} 
> = 1 + 4 = 5, instead of taking the correct {{max( |T|/|id|, |T|/|bool_col| )}} = 4.
> While the error here is small, it can be larger in other cases. Bad errors 
> lead to bad join estimation, leading to bad joins, leading to poor 
> performance, leading to unhappy users.
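
The arithmetic in the description can be checked with a short Python sketch. This is not Impala's planner code: the function and variable names are hypothetical, and the only inputs are the row count and NDV values quoted above. It assumes the equality-predicate estimate |T|/NDV(col) and compares the max-based estimate the report proposes with the sum that matches the plan's cardinality=5.

```python
# Illustrative sketch only: hypothetical names, not Impala's planner code.

def equality_cardinality(table_rows: int, ndv: int) -> float:
    """Estimated rows matching `col = constant`, i.e. |T| / NDV(col)."""
    return table_rows / ndv

# Statistics quoted in the issue for functional.alltypestiny.
TABLE_ROWS = 8
NDV = {"id": 8, "bool_col": 2}

# Per-predicate estimates: |T σ id=10| and |T σ bool_col=true|.
per_predicate = {
    col: equality_cardinality(TABLE_ROWS, ndv) for col, ndv in NDV.items()
}

# Estimate proposed in the issue: the larger of the two terms.
max_estimate = max(per_predicate.values())

# Estimate consistent with the plan output: the two terms added together.
sum_estimate = sum(per_predicate.values())

print(per_predicate)
print("max:", max_estimate)
print("sum:", sum_estimate)
```

With the statistics above, the per-predicate estimates are 1 and 4 rows, so the max-based estimate is 4 and the sum is 5, the value shown in the plan.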



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
