Michael Brown created IMPALA-5504:
-------------------------------------

             Summary: wrong results with LEFT JOIN, inline view, and COALESCE()
                 Key: IMPALA-5504
                 URL: https://issues.apache.org/jira/browse/IMPALA-5504
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 2.8.0, Impala 2.9.0
            Reporter: Michael Brown
            Priority: Blocker


*Summary*

This query

{noformat}
USE tpch;
SELECT
COUNT(t1.ps_suppkey)
FROM partsupp t1
LEFT JOIN  (
SELECT
COALESCE(t2.o_custkey, -1) AS coalesce_col
FROM orders t2
LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
{noformat}

produces a different result compared to PostgreSQL. This occurs in both 2.8- 
and 2.9-based builds, so it doesn't seem to be a recent regression.

The problem is the {{COALESCE()}} function in the inline view and its 
relationship to the {{LEFT JOIN}} in the outer query.

*Analysis*

Because {{t2.o_custkey}} is a primary key column and does not contain {{NULL}} 
values:

{noformat}
[localhost:21000] > select count(o_custkey) from orders where o_custkey is null;
+------------------+
| count(o_custkey) |
+------------------+
| 0                |
+------------------+
Fetched 1 row(s) in 0.64s
[localhost:21000] >
{noformat}

This means I can remove the use of {{COALESCE(t2.o_custkey, -1)}} and just 
replace it with {{t2.o_custkey}}:

{noformat}
SELECT
COUNT(t1.ps_suppkey) c
FROM partsupp t1
LEFT JOIN  (
SELECT
t2.o_custkey AS coalesce_col
FROM orders t2
LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
ORDER BY c
{noformat}

When I do that, the results end up matching Postgres. This is my grounds for 
justifying this as an Impala bug, not a Postgres bug: either with or without 
the COALESCE(), the results should have been the same.

Consider the query profiles for the query without the COALESCE, that produces 
the correct results:

{noformat}
Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail                        
-----------------------------------------------------------------------------------------------------------------------------
10:AGGREGATE           1  196.528us  196.528us        1           1   24.00 KB  
     10.00 MB  FINALIZE                      
09:EXCHANGE            1   38.678us   38.678us        2           1          0  
            0  UNPARTITIONED                 
05:AGGREGATE           2   68.671ms   75.556ms        2           1   50.89 KB  
     10.00 MB                                
04:HASH JOIN           2  460.980ms  472.610ms    8.23M     800.00K  114.07 MB  
      6.71 MB  RIGHT OUTER JOIN, PARTITIONED 
|--08:EXCHANGE         2   53.569ms   60.459ms  800.00K     800.00K          0  
            0  HASH((t1.ps_suppkey))         
|  00:SCAN HDFS        1   19.808ms   19.808ms  800.00K     800.00K   33.02 MB  
    176.00 MB  tpch.partsupp t1              
07:EXCHANGE            2   91.832ms  101.949ms    1.50M       1.50M          0  
            0  HASH(t2.o_custkey)            
03:HASH JOIN           2  206.506ms  228.767ms    1.50M       1.50M  178.05 MB  
      6.71 MB  LEFT OUTER JOIN, BROADCAST    
|--06:EXCHANGE         2   95.342ms   97.771ms  800.00K     800.00K          0  
            0  BROADCAST                     
|  02:SCAN HDFS        1   34.058ms   34.058ms  800.00K     800.00K   33.02 MB  
    176.00 MB  tpch.partsupp t3              
01:SCAN HDFS           2   93.905ms  101.139ms    1.50M       1.50M   33.25 MB  
    176.00 MB  tpch.orders t2   
{noformat}

Now compare this to the profile with the incorrect results:

{noformat}
Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak Mem  
Est. Peak Mem  Detail                         
------------------------------------------------------------------------------------------------------------------------------
10:AGGREGATE           1  198.204us  198.204us        1           1   24.00 KB  
     10.00 MB  FINALIZE                       
09:EXCHANGE            1   93.555us   93.555us        2           1          0  
            0  UNPARTITIONED                  
05:AGGREGATE           2    3.066ms    3.717ms        2           1    8.04 MB  
     10.00 MB                                 
04:HASH JOIN           2  140.679ms  236.316ms  800.00K     800.00K  114.06 MB  
      6.71 MB  RIGHT OUTER JOIN, PARTITIONED  
|--08:EXCHANGE         2   51.184ms   53.408ms  800.00K     800.00K          0  
            0  HASH((t1.ps_suppkey))          
|  00:SCAN HDFS        1    9.601ms    9.601ms  800.00K     800.00K   33.02 MB  
    176.00 MB  tpch.partsupp t1               
07:EXCHANGE            2   88.551ms  177.099ms    1.50M       1.50M          0  
            0  HASH(if(TupleIsNull(), NULL... 
03:HASH JOIN           2  216.025ms  241.128ms    1.50M       1.50M  178.05 MB  
      6.71 MB  LEFT OUTER JOIN, BROADCAST     
|--06:EXCHANGE         2   93.506ms  105.640ms  800.00K     800.00K          0  
            0  BROADCAST                      
|  02:SCAN HDFS        1  112.021ms  112.021ms  800.00K     800.00K   32.63 MB  
    176.00 MB  tpch.partsupp t3               
01:SCAN HDFS           2   50.246ms   86.962ms    1.50M       1.50M   33.25 MB  
    176.00 MB  tpch.orders t2   
{noformat}

In the second, incorrect results with COALESCE() case, we see that the {{04 
HASH JOIN}} operator processed 800K rows, whereas in the first, without 
COALESCE, it processed 8.23M. It turns out the {{partsupp}} table has exactly 
800k rows. So in the buggy case, somehow the rows coming in from the {{orders}} 
side are not being processed by the JOIN.

Note this was originally found by the random query generator; this was the 
original query:

{noformat}
USE tpch;
SELECT
COALESCE(AVG(a4.o_custkey), 30.0083477187, (COALESCE(a4.o_custkey, 
a1.ps_partkey, a1.ps_suppkey)) * (-65.4342458142)) AS float_col,
COUNT(a1.ps_suppkey) AS int_col,
COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey) AS int_col_1
FROM partsupp a1
LEFT JOIN (
SELECT
a2.o_custkey,
COALESCE(COALESCE(a3.ps_availqty, a3.ps_partkey, a3.ps_suppkey), a3.ps_suppkey, 
COALESCE(a2.o_custkey, a3.ps_availqty)) AS int_col,
a2.o_orderstatus,
COALESCE(a2.o_custkey, LEAST(COALESCE(a2.o_custkey, 825), 
COALESCE(COALESCE(a2.o_orderkey, a3.ps_availqty, a2.o_orderkey), 538)), 
a3.ps_availqty) AS int_col_1,
COALESCE(a2.o_orderdate, a2.o_clerk) AS char_col
FROM orders a2
LEFT JOIN partsupp a3 ON (a2.o_shippriority) = (a3.ps_partkey)
WHERE
((a2.o_custkey) IS DISTINCT FROM (a3.ps_partkey)) OR ((a2.o_totalprice) IS NULL)
) a4 ON (False) OR ((a1.ps_suppkey) = (a4.int_col_1))
WHERE
((a4.int_col_1) IS NOT DISTINCT FROM (a1.ps_suppkey)) AND ((a4.int_col) NOT IN 
(a1.ps_partkey, a1.ps_availqty))
GROUP BY
COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to