adriangb opened a new pull request, #21770:
URL: https://github.com/apache/datafusion/pull/21770
## Which issue does this PR close?
- Closes #.
<!-- Happy to file a tracking issue if the project prefers. -->
## Rationale for this change
A recursive CTE whose anchor aliases a computed column (e.g. `upper(val) AS
val`) and whose recursive term leaves the same expression un-aliased
(`upper(r.val)`) currently returns the wrong column name — but only when the
outer query has both `ORDER BY` and `LIMIT`. The plan-level schema is correct
(taken from the anchor), but `RecursiveQueryExec` forwards recursive-term
`RecordBatch`es with their native schemas intact. Downstream consumers that key
on `batch.schema().field(i).name()` — `SortExec`'s TopK path, CSV/JSON writers,
user-code `collect`ors — then observe the leaked recursive-branch name instead
of the anchor's.
MRE (fails on `datafusion-cli` pre-fix):
```sql
CREATE TABLE records (id VARCHAR NOT NULL, parent_id VARCHAR,
ts TIMESTAMP NOT NULL, val VARCHAR);
INSERT INTO records VALUES
('a00', NULL, TIMESTAMP '2025-01-01 00:00:00', 'v_span'),
('a01', 'a00', TIMESTAMP '2025-01-01 00:00:01', 'v_log_1'),
('a02', 'a01', TIMESTAMP '2025-01-01 00:00:02', 'v_log_2'),
('a03', 'a02', TIMESTAMP '2025-01-01 00:00:03', 'v_log_3');
WITH RECURSIVE descendants AS (
SELECT id, parent_id, ts, upper(val) AS val
FROM records WHERE id = 'a00'
UNION ALL
SELECT r.id, r.parent_id, r.ts, upper(r.val)
FROM records r INNER JOIN descendants d ON r.parent_id = d.id
)
SELECT id, parent_id, ts, val
FROM descendants ORDER BY ts ASC LIMIT 10;
```
Pre-fix header column reads `upper(r.val)`; expected `val`.
Only `ORDER BY + LIMIT` triggers it because:
- `SortExec` without fetch re-materialises batches via `ExternalSorter`
(stable schema).
- `LimitExec` without sort sits above `RecursiveQueryExec`, never mixing
branches.
- `SortExec` with fetch uses the TopK path, which emits
`interleave_record_batch` output that carries whichever input batch's schema
was used last.
## What changes are included in this PR?
In `RecursiveQueryStream::push_batch`, rebind each emitted batch to the
declared output schema (taken from the anchor term). Logical-plan coercion in
`LogicalPlanBuilder::to_recursive_query` already guarantees matching column
types, so this is a zero-copy field rebind. 14 lines of production code +
comment.
## Are these changes tested?
Yes.
-
`datafusion/core/tests/sql/select.rs::test_recursive_cte_batch_schema_stable_with_order_by_limit`
— runs the MRE and asserts every collected `RecordBatch`'s schema field names
equal `["id", "parent_id", "ts", "val"]`. Fails pre-fix with `left: ["id",
"parent_id", "ts", "upper(r.val)"]`.
- `datafusion/sqllogictest/test_files/cte.slt` — round-trips the buggy query
through a headered CSV file (whose header is written from each batch's schema)
and re-reads it as headerless CSV so the header row is compared as a data row.
SLT otherwise cannot assert column names directly, so this is the only way to
surface the leak inside SLT.
Both regression tests were verified to fail on the base branch before the
fix was applied and pass after.
## Are there any user-facing changes?
Recursive CTEs with mismatched anchor/recursive column names will now emit
batches with the anchor-declared names consistently, regardless of downstream
operators. No API changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]