Ruben Q L created CALCITE-7266:
----------------------------------

             Summary: Optimize the "well-known count bug" fix
                 Key: CALCITE-7266
                 URL: https://issues.apache.org/jira/browse/CALCITE-7266
             Project: Calcite
          Issue Type: Improvement
          Components: core
            Reporter: Ruben Q L
             Fix For: 1.42.0


CALCITE-7010 fixed the "well-known count bug" in the RelDecorrelator.

As shown 
[here|https://github.com/apache/calcite/blob/8b5c17e51e0c9c3f8e3db17c8d449e67e4e2974a/core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java#L819],
the root cause of this bug is a mismatch in behavior when no match is found: 
the original (correlated) plan returns NULL (or 0 for COUNT), whereas the 
(bugged) decorrelated plan returned an empty result set.
This has been extensively explained in the original Jira, the linked papers, 
the PR and the comments in the new code introduced by CALCITE-7010.
The fix for this issue relied on introducing an extra join in order to avoid 
"missing" any result in the decorrelated plan.

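To make the mismatch concrete, here is a small sketch in plain Python (not 
Calcite code). The tables dept and emp and all function names are made up for 
illustration. It models an INNER Correlate over a scalar aggregate and a naive 
decorrelation into GROUP BY plus INNER Join, and shows that the department 
without employees disappears from the decorrelated result.

```python
# Toy model of the count bug in plain Python; this is not Calcite code.
# dept and emp are hypothetical tables; emp rows are (deptno, sal).
dept = [10, 20, 30]
emp = [(10, 100), (10, 200), (20, 50)]  # dept 30 has no employees


def sql_count(values):
    return len(values)  # COUNT over empty input is 0


def sql_sum(values):
    return sum(values) if values else None  # SUM over empty input is NULL


def correlated_inner(agg):
    # INNER Correlate: the scalar aggregate on the right always yields
    # exactly one row per dept row, even when its input is empty.
    return [(d, agg([sal for dno, sal in emp if dno == d])) for d in dept]


def naive_decorrelated_inner(agg):
    # Naive decorrelation: GROUP BY deptno over emp, then INNER Join.
    # A dept without employees has no group, so its row is dropped.
    groups = {}
    for dno, sal in emp:
        groups.setdefault(dno, []).append(sal)
    return [(d, agg(groups[d])) for d in dept if d in groups]


print(correlated_inner(sql_count))          # [(10, 2), (20, 1), (30, 0)]
print(naive_decorrelated_inner(sql_count))  # [(10, 2), (20, 1)]
print(correlated_inner(sql_sum))            # [(10, 300), (20, 50), (30, None)]
print(naive_decorrelated_inner(sql_sum))    # [(10, 300), (20, 50)]
```
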
I'd like to explore the possibility of optimizing this fix.
Specifically, I'd like to discuss the situation where the original Correlate is 
of type LEFT. The examples introduced in CALCITE-7010 were all about INNER 
Correlates (which become INNER Joins); however, I wonder whether LEFT 
Correlates could be handled differently. I'd argue that, when we have a LEFT 
Correlate (and no COUNT in the Aggregate), we do not need the extra Join 
introduced by rewriteScalarAggregate, and the pre-bugfix decorrelated plan was 
just fine. The reason is that, in this scenario, returning NULL on a mismatch 
and returning an empty set on a mismatch are effectively the same: due to the 
nature of the LEFT join type, an empty set on the RHS results in NULL values 
being populated for the RHS columns, which is precisely what the original plan 
was doing.
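
The argument can be sketched in plain Python (again, not Calcite code; the 
tables and function names are hypothetical, and pre_fix_decorrelated_left only 
mimics what I assume the pre-bugfix plan shape to be: GROUP BY plus LEFT Join). 
It only covers the simple case where the Aggregate sits directly under the 
Correlate, so it illustrates the idea rather than proving it for every plan.

```python
# Toy model of a LEFT Correlate in plain Python; this is not Calcite code.
# dept and emp are hypothetical tables; emp rows are (deptno, sal).
dept = [10, 20, 30]
emp = [(10, 100), (10, 200), (20, 50)]  # dept 30 has no employees


def sql_count(values):
    return len(values)  # COUNT over empty input is 0


def sql_sum(values):
    return sum(values) if values else None  # SUM over empty input is NULL


def correlated_left(agg):
    # LEFT Correlate over a scalar aggregate: the right side always
    # returns one row, so every dept row gets the aggregate value.
    return [(d, agg([sal for dno, sal in emp if dno == d])) for d in dept]


def pre_fix_decorrelated_left(agg):
    # Assumed pre-bugfix shape: GROUP BY deptno over emp, then LEFT Join.
    # A dept without a matching group is padded with NULL on the RHS.
    groups = {}
    for dno, sal in emp:
        groups.setdefault(dno, []).append(sal)
    return [(d, agg(groups[d]) if d in groups else None) for d in dept]


# SUM: NULL from the aggregate and NULL from LEFT Join padding coincide.
print(correlated_left(sql_sum) == pre_fix_decorrelated_left(sql_sum))
# COUNT: the original returns 0, the LEFT Join pads with NULL.
print(correlated_left(sql_count) == pre_fix_decorrelated_left(sql_count))
```

In this toy model the two plans agree for SUM and differ for COUNT (0 vs 
NULL for dept 30), which is why the COUNT case is excluded above.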

Maybe I'm missing something... but I wanted to open the discussion to see if we 
can optimize this fix and avoid (if possible in certain scenarios) the extra 
join, which can be quite expensive depending on the data that the query is 
handling.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
