Ruben Q L created CALCITE-7266:
----------------------------------
Summary: Optimize the "well-known count bug" fix
Key: CALCITE-7266
URL: https://issues.apache.org/jira/browse/CALCITE-7266
Project: Calcite
Issue Type: Improvement
Components: core
Reporter: Ruben Q L
Fix For: 1.42.0
CALCITE-7010 fixed the "well-known count bug" in the RelDecorrelator.
As shown
[here|https://github.com/apache/calcite/blob/8b5c17e51e0c9c3f8e3db17c8d449e67e4e2974a/core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java#L819],
the root cause of this bug is a misalignment when no match is found: the
original (correlated) plan returns NULL (or 0 for COUNT) when no match is
found, whereas the (bugged) decorrelated plan returned an empty result set when
no match is found.
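
A minimal sketch of that misalignment, written in plain Python rather than
against Calcite itself. The table contents and the function names
(correlated_count, naive_decorrelated_count) are hypothetical and only
illustrate the row-count difference; they are not the RelDecorrelator logic.

```python
# Hypothetical toy data: departments (outer side) and employees (inner side).
# Department 30 has no employees.
DEPTS = [10, 20, 30]
EMPS = [(10, 100), (10, 200), (20, 50)]  # (deptno, sal)


def correlated_count(depts, emps):
    # Correlated plan: the scalar aggregate is evaluated once per outer row,
    # so it always yields exactly one value, even when nothing matches.
    return [(d, sum(1 for deptno, _ in emps if deptno == d)) for d in depts]


def naive_decorrelated_count(depts, emps):
    # Naive decorrelation: group the inner side by the correlation key, then
    # INNER join. A key with no inner rows never forms a group, so the
    # corresponding outer row disappears from the result.
    groups = {}
    for deptno, _ in emps:
        groups[deptno] = groups.get(deptno, 0) + 1
    return [(d, groups[d]) for d in depts if d in groups]


print(correlated_count(DEPTS, EMPS))
print(naive_decorrelated_count(DEPTS, EMPS))
```

Under these assumptions the correlated version keeps department 30 with a
count of 0, while the naive join-based version drops that row entirely.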
This has been extensively explained in the original Jira, the linked papers,
the PR and the comments in the new code introduced by CALCITE-7010.
The fix for this issue relied on introducing an extra join in order to avoid
"missing" any result in the decorrelated plan.
I'd like to explore the possibility of optimizing this fix.
Specifically, I'd like to discuss the situation where the original Correlate is
of type LEFT. The examples introduced in CALCITE-7010 were all about INNER
Correlates (which become INNER Joins); however, I wonder whether LEFT
Correlates could be handled differently. I'd argue that when we have a LEFT
Correlate (and no COUNT on the Aggregate) we do not need the extra Join
introduced by rewriteScalarAggregate, and the pre-bugfix decorrelated plan was
just fine. The reason is that, in this scenario, returning NULL on a mismatch
and returning an empty set on a mismatch are effectively the same: due to the
nature of the LEFT type, an empty set results in NULL values being populated
on the RHS, which is precisely what the original plan was doing.
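
To make the argument concrete, here is a small Python sketch (not Calcite
code) that models the LEFT case. The data and the helper names
(correlated_left, left_join_grouped, sql_sum, sql_count) are hypothetical;
None stands in for SQL NULL.

```python
# Hypothetical toy data; department 30 has no matching employees.
DEPTS = [10, 20, 30]
EMPS = [(10, 100), (10, 200), (20, 50)]  # (deptno, sal)


def sql_sum(values):
    # SQL SUM over an empty input is NULL.
    return sum(values) if values else None


def sql_count(values):
    # SQL COUNT over an empty input is 0, never NULL.
    return len(values)


def correlated_left(depts, emps, agg):
    # Original plan: evaluate the scalar aggregate once per outer row.
    return [(d, agg([sal for deptno, sal in emps if deptno == d]))
            for d in depts]


def left_join_grouped(depts, emps, agg):
    # Pre-bugfix style plan: group the inner side by the correlation key,
    # then LEFT join. Unmatched outer rows get NULL on the right-hand side.
    groups = {}
    for deptno, sal in emps:
        groups.setdefault(deptno, []).append(sal)
    aggregated = {key: agg(vals) for key, vals in groups.items()}
    return [(d, aggregated.get(d)) for d in depts]


# SUM: both plans agree, including the NULL for department 30.
print(correlated_left(DEPTS, EMPS, sql_sum) ==
      left_join_grouped(DEPTS, EMPS, sql_sum))
# COUNT: the plans disagree (0 vs NULL), hence the "no COUNT" condition.
print(correlated_left(DEPTS, EMPS, sql_count) ==
      left_join_grouped(DEPTS, EMPS, sql_count))
```

In this model the two plans match for SUM but differ for COUNT on the
unmatched row, which is why the proposed shortcut is restricted to Aggregates
without COUNT.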
Maybe I'm missing something... but I wanted to open the discussion to see if we
can optimize this fix and avoid (if possible in certain scenarios) the extra
join, which can be quite expensive depending on the data that the query is
handling.