Tushar7012 commented on issue #20315:
URL: https://github.com/apache/datafusion/issues/20315#issuecomment-3893164213
Hi @mgrenonville ,
Thanks for reporting this issue — I was able to reproduce the problem
locally and carefully inspect the logical planning behavior.
From my understanding, the error occurs when multiple correlated scalar
subqueries are used within the same parent query. During logical plan
construction (or optimization), an internal helper column (such as
`__always_true`) is introduced to support correlation handling. However, when
more than one correlated scalar subquery is present, each subquery introduces a
helper expression with the same internal column name.
Because these internal columns are not uniquely scoped or aliased, they end
up conflicting in the outer query context. This results in an "ambiguous
reference" error during column resolution instead of each subquery being
properly isolated.
### Root Cause (as I understand it)
- Multiple correlated scalar subqueries inject internal helper expressions.
- These expressions reuse the same generated column name (e.g.,
`__always_true`).
- When merged into the outer logical plan, the planner cannot disambiguate
between them.
- Column resolution fails due to duplicate internal names.
### Proposed Approach
Here is how I plan to approach the fix:
1. Reproduce the issue with a minimal SQL example and inspect the logical
plan using `EXPLAIN` to pinpoint exactly where the duplicate internal column is
introduced.
2. Trace the correlated subquery rewrite/decorrelation logic in the logical
planner to identify where `__always_true` (or similar helper columns) are
generated.
3. Modify the internal column generation so that:
- Either each correlated subquery gets a uniquely generated internal
alias, or
- The helper column remains fully scoped within its respective subquery
plan and does not leak into the shared outer context.
4. Add a regression test that includes multiple correlated scalar subqueries
in the same query to prevent this issue from resurfacing.
5. Validate that the change does not break existing correlated subquery
optimization paths.
I will start by isolating the exact stage in the logical planning process
where the duplicate internal column name is created and propose a scoped or
uniquely generated naming strategy.
Please let me know if there are any prior design considerations around
correlated subquery rewriting that I should be aware of before proceeding.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]