Tim Armstrong created IMPALA-7749:
-------------------------------------

             Summary: Merge aggregation node memory estimate is incorrectly influenced by limit
                 Key: IMPALA-7749
                 URL: https://issues.apache.org/jira/browse/IMPALA-7749
             Project: IMPALA
          Issue Type: Sub-task
          Components: Frontend
    Affects Versions: Impala 2.12.0, Impala 3.0, Impala 2.11.0, Impala 3.1.0
            Reporter: Tim Armstrong
            Assignee: Bikramjeet Vig


In the query below, the memory estimate for node 03 is too low. If you remove the limit, the estimate is correct.

{noformat}
[localhost:21000] default> set explain_level=2; explain select l_orderkey, l_partkey, l_linenumber, count(*) from tpch.lineitem group by 1, 2, 3 limit 5;
EXPLAIN_LEVEL set to 2
Query: explain select l_orderkey, l_partkey, l_linenumber, count(*) from tpch.lineitem group by 1, 2, 3 limit 5
+-------------------------------------------------------------------------------------------+
| Explain String                                                                             |
+-------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=43.94MB Threads=4                                |
| Per-Host Resource Estimates: Memory=450MB                                                  |
|                                                                                            |
| F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                                      |
| |  Per-Host Resources: mem-estimate=0B mem-reservation=0B thread-reservation=1             |
| PLAN-ROOT SINK                                                                             |
| |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                 |
| |                                                                                          |
| 04:EXCHANGE [UNPARTITIONED]                                                                |
| |  limit: 5                                                                                |
| |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                 |
| |  tuple-ids=1 row-size=28B cardinality=5                                                  |
| |  in pipelines: 03(GETNEXT)                                                               |
| |                                                                                          |
| F01:PLAN FRAGMENT [HASH(l_orderkey,l_partkey,l_linenumber)] hosts=3 instances=3            |
| Per-Host Resources: mem-estimate=10.00MB mem-reservation=1.94MB thread-reservation=1       |
| 03:AGGREGATE [FINALIZE]                                                                    |
| |  output: count:merge(*)                                                                  |
| |  group by: l_orderkey, l_partkey, l_linenumber                                           |
| |  limit: 5                                                                                |
| |  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0   |
| |  tuple-ids=1 row-size=28B cardinality=5                                                  |
| |  in pipelines: 03(GETNEXT), 00(OPEN)                                                     |
| |                                                                                          |
| 02:EXCHANGE [HASH(l_orderkey,l_partkey,l_linenumber)]                                      |
| |  mem-estimate=0B mem-reservation=0B thread-reservation=0                                 |
| |  tuple-ids=1 row-size=28B cardinality=6001215                                            |
| |  in pipelines: 00(GETNEXT)                                                               |
| |                                                                                          |
| F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3                                             |
| Per-Host Resources: mem-estimate=440.27MB mem-reservation=42.00MB thread-reservation=2     |
| 01:AGGREGATE [STREAMING]                                                                   |
| |  output: count(*)                                                                        |
| |  group by: l_orderkey, l_partkey, l_linenumber                                           |
| |  mem-estimate=176.27MB mem-reservation=34.00MB spill-buffer=2.00MB thread-reservation=0  |
| |  tuple-ids=1 row-size=28B cardinality=6001215                                            |
| |  in pipelines: 00(GETNEXT)                                                               |
| |                                                                                          |
| 00:SCAN HDFS [tpch.lineitem, RANDOM]                                                       |
|    partitions=1/1 files=1 size=718.94MB                                                    |
|    stored statistics:                                                                      |
|      table: rows=6001215 size=718.94MB                                                     |
|      columns: all                                                                          |
|    extrapolated-rows=disabled max-scan-range-rows=1068457                                  |
|    mem-estimate=264.00MB mem-reservation=8.00MB thread-reservation=1                       |
|    tuple-ids=0 row-size=20B cardinality=6001215                                            |
|    in pipelines: 00(GETNEXT)                                                               |
+-------------------------------------------------------------------------------------------+
{noformat}

The bug is that we use {{cardinality_}} to cap the estimated number of distinct groups, but {{cardinality_}} is itself capped at the output limit, so the merge aggregation's memory estimate shrinks along with the limit even though it must still aggregate every distinct group before the limit applies.
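A minimal sketch of the arithmetic (this is not the actual Impala planner code; the class and method names are hypothetical, and only the row size 28B and the cardinalities 6001215 / 5 come from the plan above). It shows how capping the distinct-value estimate with the limit-capped cardinality collapses the memory estimate:

{noformat}
// Hypothetical illustration of the estimate, not Impala's AggregationNode code.
public class AggMemEstimateSketch {
    // A grouping aggregation must hold roughly one hash-table entry per
    // distinct group, so the estimate scales with the group count, not
    // with the number of rows returned after LIMIT.
    static long estimateMemBytes(long distinctGroups, long rowSizeBytes) {
        return distinctGroups * rowSizeBytes;
    }

    public static void main(String[] args) {
        long preLimitGroups = 6001215L;                       // distinct groups from the child
        long postLimitCardinality = Math.min(preLimitGroups, 5L); // cardinality_ after LIMIT 5

        // Buggy: capping by the post-limit cardinality gives 5 * 28 = 140 bytes.
        System.out.println(estimateMemBytes(postLimitCardinality, 28));
        // Using the pre-limit group count gives 6001215 * 28 = 168034020 bytes (~160MB),
        // which matches the scale of the no-limit estimate.
        System.out.println(estimateMemBytes(preLimitGroups, 28));
    }
}
{noformat}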



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)