Rafael Acevedo created CALCITE-6778:
---------------------------------------
Summary: Inconsistent null behavior with correlated queries +
quantify operators
Key: CALCITE-6778
URL: https://issues.apache.org/jira/browse/CALCITE-6778
Project: Calcite
Issue Type: Bug
Components: core
Affects Versions: 1.38.0
Reporter: Rafael Acevedo
Hi all!
I believe I've found a bug in the quantify operators. The following queries
should be equivalent, but yield different results:
Query 1:
SQL:
{code:sql}
--returns null, as expected
SELECT TRUE = ALL (
SELECT b
FROM (SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) AS x1(b))
AS test;
{code}
Plan:
{code}
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[0], expr#4=[=($t0, $t3)],
expr#5=[<>($t1, $t0)], expr#6=[1], expr#7=[<=($t1, $t6)],
expr#8=[null:BOOLEAN], expr#9=[AND($t5, $t7, $t2, $t8)], expr#10=[=($t1, $t6)],
expr#11=[=($t1, $t0)], expr#12=[>($t1, $t6)], expr#13=[OR($t11, $t12)],
expr#14=[AND($t10, $t2, $t13)], expr#15=[OR($t4, $t9, $t14)], TEST=[$t15])
EnumerableAggregate(group=[{}], c=[COUNT()], d=[COUNT($0)], m=[MAX($0)])
EnumerableValues(tuples=[[{ true }, { null }]])
{code}
Query 2:
SQL:
{code:sql}
--returns true but should be null
WITH tb as (select array(SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) as a)
SELECT TRUE = ALL (
SELECT b
FROM UNNEST(a) AS x1(b)
) AS test
FROM tb;
{code}
Plan:
{code}
EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)], expr#6=[0],
expr#7=[=($t1, $t6)], expr#8=[OR($t5, $t7)], expr#9=[IS TRUE($t8)],
expr#10=[<>($t2, $t1)], expr#11=[1], expr#12=[<=($t2, $t11)],
expr#13=[AND($t10, $t12)], expr#14=[IS TRUE($t13)], expr#15=[null:BOOLEAN],
expr#16=[IS NOT TRUE($t8)], expr#17=[AND($t14, $t3, $t15, $t16)],
expr#18=[=($t2, $t11)], expr#19=[IS TRUE($t18)], expr#20=[IS NOT TRUE($t13)],
expr#21=[AND($t19, $t3, $t16, $t20)], expr#22=[OR($t9, $t17, $t21)],
TEST=[$t22])
EnumerableCorrelate(correlation=[$cor0], joinType=[left],
requiredColumns=[{0}])
EnumerableCollect(field=[x])
EnumerableValues(tuples=[[{ true }, { null }]])
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[true],
proj#0..3=[{exprs}])
EnumerableAggregate(group=[{}], c=[COUNT() FILTER $3],
d=[COUNT($0) FILTER $2], m=[MIN($1) FILTER $3])
EnumerableCalc(expr#0..2=[{inputs}], expr#3=[0],
expr#4=[=($t2, $t3)], expr#5=[1], expr#6=[=($t2, $t5)], proj#0..1=[{exprs}],
$g_0=[$t4], $g_1=[$t6])
EnumerableAggregate(group=[{0}],
groups=[[{0}, {}]], m=[MAX($0)], $g=[GROUPING($0)])
EnumerableUncollect
EnumerableCalc(expr#0=[{inputs}], expr#1=[$cor0], expr#2=[$t1.A], A=[$t2])
EnumerableValues(tuples=[[{ 0 }]])
{code}
Looking at the code, {{SubQueryRemoveRule}} produces different plans because
query 2 is correlated. [Code
pointer|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L230].
A possible cause for different values is that for correlated queries (query 2
in this case) {{SubQueryRemoveRule}}
[uses|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L408-L409]
{{count(distinct *)}} and {{count(distinct <col>)}}. Non-correlated queries
(query 1 in this case) use {{count ( * )}} and {{count(<col>)}}, which behave
differently with null values. I assume there's a reason to use distinct, right?
One possible change is modifying the correlated case to use non-distinct
counts. I tested this change locally and no test breaks, but I'd like to get
some better understanding of the reasoning behind using distinct.
Can anyone help me? maybe [~nobigo]?
Thanks!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)