Tushar7012 commented on issue #20315:
URL: https://github.com/apache/datafusion/issues/20315#issuecomment-3893164213

   Hi @mgrenonville ,
   Thanks for reporting this issue — I was able to reproduce the problem 
locally and carefully inspect the logical planning behavior.
   
   From my understanding, the error occurs when multiple correlated scalar 
subqueries are used within the same parent query. During logical plan 
construction (or optimization), an internal helper column (such as 
`__always_true`) is introduced to support correlation handling. However, when 
more than one correlated scalar subquery is present, each subquery introduces a 
helper expression with the same internal column name.
   
   Because these internal columns are not uniquely scoped or aliased, they end 
up conflicting in the outer query context. This results in an "ambiguous 
reference" error during column resolution instead of each subquery being 
properly isolated.
   
   ### Root Cause (as I understand it)
   
   - Multiple correlated scalar subqueries inject internal helper expressions.
   - These expressions reuse the same generated column name (e.g., 
`__always_true`).
   - When merged into the outer logical plan, the planner cannot disambiguate 
between them.
   - Column resolution fails due to duplicate internal names.
   
   ### Proposed Approach
   
   Here is how I plan to approach the fix:
   
   1. Reproduce the issue with a minimal SQL example and inspect the logical 
plan using `EXPLAIN` to pinpoint exactly where the duplicate internal column is 
introduced.
   2. Trace the correlated subquery rewrite/decorrelation logic in the logical 
planner to identify where `__always_true` (or similar helper columns) are 
generated.
   3. Modify the internal column generation so that:
      - Either each correlated subquery gets a uniquely generated internal 
alias, or  
      - The helper column remains fully scoped within its respective subquery 
plan and does not leak into the shared outer context.
   4. Add a regression test that includes multiple correlated scalar subqueries 
in the same query to prevent this issue from resurfacing.
   5. Validate that the change does not break existing correlated subquery 
optimization paths.
   
   I will start by isolating the exact stage in the logical planning process 
where the duplicate internal column name is created and propose a scoped or 
uniquely generated naming strategy.
   
   Please let me know if there are any prior design considerations around 
correlated subquery rewriting that I should be aware of before proceeding.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to