[jira] [Created] (CALCITE-6778) Inconsistent null behavior with correlated queries + quantify operators

Rafael Acevedo (Jira) Fri, 10 Jan 2025 06:06:04 -0800

Rafael Acevedo created CALCITE-6778:
---------------------------------------


             Summary: Inconsistent null behavior with correlated queries + 
quantify operators
                 Key: CALCITE-6778
                 URL: https://issues.apache.org/jira/browse/CALCITE-6778
             Project: Calcite
          Issue Type: Bug
          Components: core
    Affects Versions: 1.38.0
            Reporter: Rafael Acevedo


Hi all!

I believe I've found a bug in the quantify operators. The following queries 
should be equivalent, but yield different results:

Query 1:
SQL:
{code:sql}
--returns null, as expected
SELECT TRUE = ALL (
    SELECT b
    FROM (SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) AS x1(b))
    AS test;
{code}
Plan:
{code}
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[0], expr#4=[=($t0, $t3)], 
expr#5=[<>($t1, $t0)], expr#6=[1], expr#7=[<=($t1, $t6)], 
expr#8=[null:BOOLEAN], expr#9=[AND($t5, $t7, $t2, $t8)], expr#10=[=($t1, $t6)], 
expr#11=[=($t1, $t0)], expr#12=[>($t1, $t6)], expr#13=[OR($t11, $t12)], 
expr#14=[AND($t10, $t2, $t13)], expr#15=[OR($t4, $t9, $t14)], TEST=[$t15])
  EnumerableAggregate(group=[{}], c=[COUNT()], d=[COUNT($0)], m=[MAX($0)])
      EnumerableValues(tuples=[[{ true }, { null }]])
      
{code}

Query 2:
SQL:
{code:sql}
--returns true but should be null
WITH tb as (select array(SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) as a)
SELECT TRUE = ALL (
    SELECT b
    FROM UNNEST(a) AS x1(b)
) AS test
FROM tb;
{code}
Plan:
{code}
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)], expr#6=[0], 
expr#7=[=($t1, $t6)], expr#8=[OR($t5, $t7)], expr#9=[IS TRUE($t8)], 
expr#10=[<>($t2, $t1)], expr#11=[1], expr#12=[<=($t2, $t11)], 
expr#13=[AND($t10, $t12)], expr#14=[IS TRUE($t13)], expr#15=[null:BOOLEAN], 
expr#16=[IS NOT TRUE($t8)], expr#17=[AND($t14, $t3, $t15, $t16)], 
expr#18=[=($t2, $t11)], expr#19=[IS TRUE($t18)], expr#20=[IS NOT TRUE($t13)], 
expr#21=[AND($t19, $t3, $t16, $t20)], expr#22=[OR($t9, $t17, $t21)], 
TEST=[$t22])
  EnumerableCorrelate(correlation=[$cor0], joinType=[left], 
requiredColumns=[{0}])
      EnumerableCollect(field=[x])
            EnumerableValues(tuples=[[{ true }, { null }]])
                EnumerableCalc(expr#0..2=[{inputs}], expr#3=[true], 
proj#0..3=[{exprs}])
                      EnumerableAggregate(group=[{}], c=[COUNT() FILTER $3], 
d=[COUNT($0) FILTER $2], m=[MIN($1) FILTER $3])
                              EnumerableCalc(expr#0..2=[{inputs}], expr#3=[0], 
expr#4=[=($t2, $t3)], expr#5=[1], expr#6=[=($t2, $t5)], proj#0..1=[{exprs}], 
$g_0=[$t4], $g_1=[$t6])
                                        EnumerableAggregate(group=[{0}], 
groups=[[{0}, {}]], m=[MAX($0)], $g=[GROUPING($0)])
                                                    EnumerableUncollect
                                                                  
EnumerableCalc(expr#0=[{inputs}], expr#1=[$cor0], expr#2=[$t1.A], A=[$t2])
                                                                                
  EnumerableValues(tuples=[[{ 0 }]])
                                                                                
  
{code}


Looking at the code, {{SubQueryRemoveRule}} produces different plans because 
query 2 is correlated. [Code 
pointer|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L230].

A possible cause for different values is that for correlated queries (query 2 
in this case) {{SubQueryRemoveRule}} 
[uses|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L408-L409]
 {{count(distinct *)}} and {{count(distinct <col>)}}. Non-correlated queries 
(query 1 in this case) use {{count ( * )}} and {{count(<col>)}}, which behave 
differently with null values. I assume there's a reason to use distinct, right?

One possible change is modifying the correlated case to use non-distinct 
counts. I tested this change locally and no test breaks, but I'd like to get 
some better understanding of the reasoning behind using distinct.

Can anyone help me? maybe [~nobigo]?

Thanks!




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (CALCITE-6778) Inconsistent null behavior with correlated queries + quantify operators

Reply via email to