Aman Sinha created IMPALA-12006:
-----------------------------------

             Summary: Left outer join cardinality highly overestimated
                 Key: IMPALA-12006
                 URL: https://issues.apache.org/jira/browse/IMPALA-12006
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 4.2.0
            Reporter: Aman Sinha


In one of the use cases, we have seen the cardinality estimate for left outer 
join highly overestimated.  The plan is complex and only a partial output is 
shown below (with the column names anonymized): 

{noformat}
  57:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
  |  hash-table-id=121
  |  hash predicates: a.id = a.id
  |  fk/pk conjuncts: none
  |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
  |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N,29N,31N 
row-size=2.63KB cardinality=3.90T
  |  in pipelines: 06(GETNEXT), 26(OPEN)
  |
  |--F1253:PLAN FRAGMENT hosts=13 instances=13
  |  |  Per-Instance Resources: mem-estimate=1.10GB mem-reservation=204.00MB 
thread-reservation=1
  |  JOIN BUILD
  |  |  join-table-id=121 plan-id=122 cohort-id=25
  |  |  build expressions: a.id
  |  |  mem-estimate=1.08GB mem-reservation=204.00MB spill-buffer=2.00MB 
thread-reservation=0
  |  |
  |  1758:EXCHANGE [BROADCAST]
  |  |  mem-estimate=20.87MB mem-reservation=0B thread-reservation=0
  |  |  tuple-ids=29,31 row-size=85B cardinality=9.56M
  |  |  in pipelines: 26(GETNEXT)
  |  |
  ...
  ...
  ...
  56:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
  |  hash predicates: ifnull(a.id, a.id) = a.id
  |  fk/pk conjuncts: assumed fk/pk
  |  mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB thread-reservation=0
  |  tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N row-size=2.55KB 
cardinality=14.97G
  |  in pipelines: 06(GETNEXT), 22(OPEN)
{noformat}

Note that the left input of the join is estimated as 14.97G rows, right input 
as 9.56M rows but the LOJ estimate is 3.9T rows.  We need to investigate why 
that is so and fix it.  The NDV of the based column involved in the join is 
36661  but in the lower join there are functions involved in the join 
condition. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to