[ 
https://issues.apache.org/jira/browse/IMPALA-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bharath v updated IMPALA-7564:
------------------------------
    Summary: Conservative FK/PK join type detection with complex equi-join 
conjuncts  (was: Conservative FK/PK join type estimation with complex equi-join 
conjuncts)

> Conservative FK/PK join type detection with complex equi-join conjuncts
> -----------------------------------------------------------------------
>
>                 Key: IMPALA-7564
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7564
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.12.0, Impala 2.13.0, Impala 3.1.0
>            Reporter: bharath v
>            Priority: Major
>
> With IMPALA-5547, we predict whether a join is an FK/PK join as follows.
> {noformat}
>  // Iterate over all groups of conjuncts that belong to the same joined tuple 
> id pair.
>     // For each group, we compute the join NDV of the rhs slots and compare 
> it to the
>     // number of rows in the rhs table.
>     for (List<EqJoinConjunctScanSlots> fkPkCandidate: 
> scanSlotsByJoinedTids.values()) {
>       double jointNdv = 1.0;
>       for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *= 
> slots.rhsNdv();
>       double rhsNumRows = fkPkCandidate.get(0).rhsNumRows();
>       if (jointNdv >= Math.round(rhsNumRows * (1.0 - 
> FK_PK_MAX_STATS_DELTA_PERC))) {
>         // We cannot disprove that the RHS is a PK.
>         if (result == null) result = Lists.newArrayList();
>         result.addAll(fkPkCandidate);
>       }
>     }
> {noformat}
> We iterate through all the "simple" equi join conjuncts on the RHS, multiply 
> their NDVs and check if it close to rhsNumRows. The issue here is that this 
> can result in conservative FK/Pk detection if the equi-join conjuncts are not 
> simple (of the form <slotRef> = <slotRef>)
> {noformat}
> /**
>      * Returns a new EqJoinConjunctScanSlots for the given equi-join conjunct 
> or null if
>      * the given conjunct is not of the form <SlotRef> = <SlotRef> or if the 
> underlying
>      * table/column of at least one side is missing stats.
>      */
>     public static EqJoinConjunctScanSlots create(Expr eqJoinConjunct) {
>       if (!Expr.IS_EQ_BINARY_PREDICATE.apply(eqJoinConjunct)) return null;
>       SlotDescriptor lhsScanSlot = 
> eqJoinConjunct.getChild(0).findSrcScanSlot();
>       if (lhsScanSlot == null || !hasNumRowsAndNdvStats(lhsScanSlot)) return 
> null;
>       SlotDescriptor rhsScanSlot = 
> eqJoinConjunct.getChild(1).findSrcScanSlot();
> {noformat}
> For example, the following query contains a complex equi-join conjunct 
> {{substr(l.c3, 1, 6) = substr(r.c3, 1,6)}}, so while detecting if the left 
> outer join is an FK/PK, we just check if 
> {{NDVs(r.c1) * NDVs(r.c2) ~ r.numRows()}} which is incorrect. 
> {noformat}
> [localhost:21000]> explain select * from test_left l left outer join 
> test_right r on l.c1 = r.c1 and l.c2 = r.c2 and substr(l.c3, 1, 6) = 
> substr(r.c3, 1,6);
> Query: explain select * from test_left l left outer join test_right r on l.c1 
> = r.c1 and l.c2 = r.c2 and substr(l.c3, 1, 6) = substr(r.c3, 1,6)
> +-----------------------------------------------------------------------------------------+
> | Explain String                                                              
>             |
> +-----------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5                  
>             |
> | Per-Host Resource Estimates: Memory=66MB                                    
>             |
> |                                                                             
>             |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                       
>             |
> | |  Per-Host Resources: mem-estimate=0B mem-reservation=0B 
> thread-reservation=1          |
> | PLAN-ROOT SINK                                                              
>             |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                  
>             |
> | |                                                                           
>             |
> | 04:EXCHANGE [UNPARTITIONED]                                                 
>             |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                  
>             |
> | |  tuple-ids=0,1N row-size=94B cardinality=49334767023                      
>             |
> | |  in pipelines: 00(GETNEXT)                                                
>             |
> | |                                                                           
>             |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                              
>             |
> | Per-Host Resources: mem-estimate=33.94MB mem-reservation=1.95MB 
> thread-reservation=2    |
> | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                                   
>             |
> | |  hash predicates: l.c1 = r.c1, l.c2 = r.c2, substr(l.c3, 1, 6) = 
> substr(r.c3, 1, 6)   |
> | |  fk/pk conjuncts: none                                                    
>             |
> | |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB 
> thread-reservation=0 |
> | |  tuple-ids=0,1N row-size=94B cardinality=49334767023                      
>             |
> | |  in pipelines: 00(GETNEXT), 01(OPEN)                                      
>             |
> | |                                                                           
>             |
> | |--03:EXCHANGE [BROADCAST]                                                  
>             |
> | |  |  mem-estimate=0B mem-reservation=0B thread-reservation=0               
>             |
> | |  |  tuple-ids=1 row-size=47B cardinality=2510                             
>             |
> | |  |  in pipelines: 01(GETNEXT)                                             
>             |
> | |  |                                                                        
>             |
> | |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                           
>             |
> | |  Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB 
> thread-reservation=2 |
> | |  01:SCAN HDFS [cdh72724.test_right r, RANDOM]                             
>             |
> | |     partitions=1/1 files=1 size=8B                                        
>             |
> | |     stored statistics:                                                    
>             |
> | |       table: rows=2510 size=8B                                            
>             |
> | |       columns: all                                                        
>             |
> | |     extrapolated-rows=disabled max-scan-range-rows=2510                   
>             |
> | |     mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1      
>             |
> | |     tuple-ids=1 row-size=47B cardinality=2510                             
>             |
> | |     in pipelines: 01(GETNEXT)                                             
>             |
> | |                                                                           
>             |
> | 00:SCAN HDFS [cdh72724.test_left l, RANDOM]                                 
>             |
> |    partitions=1/1 files=1 size=8B                                           
>             |
> |    stored statistics:                                                       
>             |
> |      table: rows=589658570 size=8B                                          
>             |
> |      columns: all                                                           
>             |
> |    extrapolated-rows=disabled max-scan-range-rows=589658570                 
>             |
> |    mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1         
>             |
> |    tuple-ids=0 row-size=47B cardinality=589658570                           
>             |
> |    in pipelines: 00(GETNEXT)                                                
>             |
> +-----------------------------------------------------------------------------------------+
> Fetched 48 row(s) in 0.02s
> {noformat}
> We should either consider NDVs of complex conjuncts (error-prone for obvious 
> reasons) or generally assume a join to be FK/PK until and unless proven 
> otherwise.
> Here is the plan if the complex conjunct is relaxed to a simple <slotRef> = 
> <slotRef> type. We can see that it detects the join to be of fk/pk type.
> {noformat}
> [localhost:21000] cdh72724> explain select * from test_left l left outer join 
> test_right r on l.c1 = r.c1 and l.c2 = r.c2 and l.c3 = r.c3;
> Query: explain select * from test_left l left outer join test_right r on l.c1 
> = r.c1 and l.c2 = r.c2 and l.c3 = r.c3
> +-----------------------------------------------------------------------------------------+
> | Explain String                                                              
>             |
> +-----------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5                  
>             |
> | Per-Host Resource Estimates: Memory=66MB                                    
>             |
> |                                                                             
>             |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1                       
>             |
> | |  Per-Host Resources: mem-estimate=0B mem-reservation=0B 
> thread-reservation=1          |
> | PLAN-ROOT SINK                                                              
>             |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                  
>             |
> | |                                                                           
>             |
> | 04:EXCHANGE [UNPARTITIONED]                                                 
>             |
> | |  mem-estimate=0B mem-reservation=0B thread-reservation=0                  
>             |
> | |  tuple-ids=0,1N row-size=94B cardinality=589658570                        
>             |
> | |  in pipelines: 00(GETNEXT)                                                
>             |
> | |                                                                           
>             |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                              
>             |
> | Per-Host Resources: mem-estimate=33.94MB mem-reservation=1.95MB 
> thread-reservation=2    |
> | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                                   
>             |
> | |  hash predicates: l.c1 = r.c1, l.c2 = r.c2, l.c3 = r.c3                   
>             |
> | |  fk/pk conjuncts: l.c1 = r.c1, l.c2 = r.c2, l.c3 = r.c3                   
>             |
> | |  mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB 
> thread-reservation=0 |
> | |  tuple-ids=0,1N row-size=94B cardinality=589658570                        
>             |
> | |  in pipelines: 00(GETNEXT), 01(OPEN)                                      
>             |
> | |                                                                           
>             |
> | |--03:EXCHANGE [BROADCAST]                                                  
>             |
> | |  |  mem-estimate=0B mem-reservation=0B thread-reservation=0               
>             |
> | |  |  tuple-ids=1 row-size=47B cardinality=2510                             
>             |
> | |  |  in pipelines: 01(GETNEXT)                                             
>             |
> | |  |                                                                        
>             |
> | |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1                           
>             |
> | |  Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB 
> thread-reservation=2 |
> | |  01:SCAN HDFS [cdh72724.test_right r, RANDOM]                             
>             |
> | |     partitions=1/1 files=1 size=8B                                        
>             |
> | |     stored statistics:                                                    
>             |
> | |       table: rows=2510 size=8B                                            
>             |
> | |       columns: all                                                        
>             |
> | |     extrapolated-rows=disabled max-scan-range-rows=2510                   
>             |
> | |     mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1      
>             |
> | |     tuple-ids=1 row-size=47B cardinality=2510                             
>             |
> | |     in pipelines: 01(GETNEXT)                                             
>             |
> | |                                                                           
>             |
> | 00:SCAN HDFS [cdh72724.test_left l, RANDOM]                                 
>             |
> |    partitions=1/1 files=1 size=8B                                           
>             |
> |    stored statistics:                                                       
>             |
> |      table: rows=589658570 size=8B                                          
>             |
> |      columns: all                                                           
>             |
> |    extrapolated-rows=disabled max-scan-range-rows=589658570                 
>             |
> |    mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1         
>             |
> |    tuple-ids=0 row-size=47B cardinality=589658570                           
>             |
> |    in pipelines: 00(GETNEXT)                                                
>             |
> +-----------------------------------------------------------------------------------------+
> Fetched 48 row(s) in 0.01s
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to