adriangb opened a new pull request, #21770:
URL: https://github.com/apache/datafusion/pull/21770

   ## Which issue does this PR close?
   
   - Closes #.
   
   <!-- Happy to file a tracking issue if the project prefers. -->
   
   ## Rationale for this change
   
   A recursive CTE whose anchor aliases a computed column (e.g. `upper(val) AS 
val`) and whose recursive term leaves the same expression un-aliased 
(`upper(r.val)`) currently returns the wrong column name — but only when the 
outer query has both `ORDER BY` and `LIMIT`. The plan-level schema is correct 
(taken from the anchor), but `RecursiveQueryExec` forwards recursive-term 
`RecordBatch`es with their native schemas intact. Downstream consumers that key 
on `batch.schema().field(i).name()` — `SortExec`'s TopK path, CSV/JSON writers, 
user-code `collect`ors — then observe the leaked recursive-branch name instead 
of the anchor's.
   
   MRE (fails on `datafusion-cli` pre-fix):
   
   ```sql
   CREATE TABLE records (id VARCHAR NOT NULL, parent_id VARCHAR,
                        ts TIMESTAMP NOT NULL, val VARCHAR);
   INSERT INTO records VALUES
     ('a00', NULL,  TIMESTAMP '2025-01-01 00:00:00', 'v_span'),
     ('a01', 'a00', TIMESTAMP '2025-01-01 00:00:01', 'v_log_1'),
     ('a02', 'a01', TIMESTAMP '2025-01-01 00:00:02', 'v_log_2'),
     ('a03', 'a02', TIMESTAMP '2025-01-01 00:00:03', 'v_log_3');
   
   WITH RECURSIVE descendants AS (
     SELECT id, parent_id, ts, upper(val) AS val
       FROM records WHERE id = 'a00'
     UNION ALL
     SELECT r.id, r.parent_id, r.ts, upper(r.val)
       FROM records r INNER JOIN descendants d ON r.parent_id = d.id
   )
   SELECT id, parent_id, ts, val
     FROM descendants ORDER BY ts ASC LIMIT 10;
   ```
   
   Pre-fix header column reads `upper(r.val)`; expected `val`.
   
   Only `ORDER BY + LIMIT` triggers it because:
   - `SortExec` without fetch re-materialises batches via `ExternalSorter` 
(stable schema).
   - `LimitExec` without sort sits above `RecursiveQueryExec`, never mixing 
branches.
   - `SortExec` with fetch uses the TopK path, which emits 
`interleave_record_batch` output that carries whichever input batch's schema 
was used last.
   
   ## What changes are included in this PR?
   
   In `RecursiveQueryStream::push_batch`, rebind each emitted batch to the 
declared output schema (taken from the anchor term). Logical-plan coercion in 
`LogicalPlanBuilder::to_recursive_query` already guarantees matching column 
types, so this is a zero-copy field rebind. 14 lines of production code + 
comment.
   
   ## Are these changes tested?
   
   Yes.
   
   - 
`datafusion/core/tests/sql/select.rs::test_recursive_cte_batch_schema_stable_with_order_by_limit`
 — runs the MRE and asserts every collected `RecordBatch`'s schema field names 
equal `["id", "parent_id", "ts", "val"]`. Fails pre-fix with `left: ["id", 
"parent_id", "ts", "upper(r.val)"]`.
   - `datafusion/sqllogictest/test_files/cte.slt` — round-trips the buggy query 
through a headered CSV file (whose header is written from each batch's schema) 
and re-reads it as headerless CSV so the header row is compared as a data row. 
SLT otherwise cannot assert column names directly, so this is the only way to 
surface the leak inside SLT.
   
   Both regression tests were verified to fail on the base branch before the 
fix was applied and pass after.
   
   ## Are there any user-facing changes?
   
   Recursive CTEs with mismatched anchor/recursive column names will now emit 
batches with the anchor-declared names consistently, regardless of downstream 
operators. No API changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to