RafaelHerrero opened a new pull request, #21126:
URL: https://github.com/apache/datafusion/pull/21126
## Which issue does this PR close?
- Closes #6543.
## Rationale for this change
We're building a SQL engine on top of DataFusion and hit this while running
TPC-DS benchmarks — Q39 fails during planning with:
```
Projections require unique expression names but the expression
"CAST(inv1.cov AS Decimal128(30, 10))" at position 4 and "inv1.cov"
at position 10 have the same name.
```
The underlying issue is that `CAST` is transparent to `schema_name()`, so
both expressions resolve to `inv1.cov`. But this also affects simpler cases
like `SELECT 1, 1` or `SELECT x, x FROM t` — all of which PostgreSQL, Trino,
and SQLite handle without errors.
Looking at the issue discussion, @alamb suggested adding auto-aliases in the
SQL planner:
> "I think that is actually a pretty neat idea -- specifically add the
aliases in the SQL planner. I would be happy to review such a PR"
That's what this PR does.
### TPC-DS Q39 reproduction
The query joins two CTEs that both produce columns named `cov`, `mean`, etc.
When the planner applies implicit casts during type coercion, the cast-wrapped
and original expressions end up with the same schema name:
```sql
WITH inv1 AS (
SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS
cov
FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
stddev_samp(inv_quantity_on_hand) AS stdev,
avg(inv_quantity_on_hand) AS mean
FROM inventory, item, warehouse, date_dim
WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
AND inv_date_sk = d_date_sk AND d_year = 2001
GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
),
inv2 AS (
SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS
cov
FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
stddev_samp(inv_quantity_on_hand) AS stdev,
avg(inv_quantity_on_hand) AS mean
FROM inventory, item, warehouse, date_dim
WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
AND inv_date_sk = d_date_sk AND d_year = 2001 AND d_moy = 2
GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
)
SELECT inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
inv2.w_warehouse_sk, inv2.i_item_sk, inv2.d_moy, inv2.mean, inv2.cov
FROM inv1 JOIN inv2
ON inv1.i_item_sk = inv2.i_item_sk AND inv1.w_warehouse_sk =
inv2.w_warehouse_sk
ORDER BY inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean,
inv1.cov,
inv2.d_moy, inv2.mean, inv2.cov;
```
## What changes are included in this PR?
A dedup pass in `SqlToRel` that runs right after `prepare_select_exprs()`
and before `self.project()`. It detects duplicate `schema_name()` values and
wraps the second (and subsequent) occurrences in an `Alias` with a `:{N}`
suffix:
```sql
SELECT x AS c1, y AS c1 FROM t;
-- produces columns: c1, c1:1
```
The actual code is small (~45 lines of logic across 2 files):
- `datafusion/sql/src/utils.rs` — new `deduplicate_select_expr_names()`
function
- `datafusion/sql/src/select.rs` — one call site between
`prepare_select_exprs()` and `self.project()`
I intentionally kept this scoped to the SQL planner only:
- `validate_unique_names("Projections")` in `builder.rs` is untouched, so
the Rust API (`LogicalPlanBuilder::project`) still rejects duplicates
- No changes to the optimizer, physical planner, or DFSchema
- `validate_unique_names("Windows")` is unchanged
**Known limitation:** `SELECT *, x FROM t` still errors when `x` overlaps
with `*`, because wildcard expansion happens after this dedup pass (inside
`project_with_validation`). Happy to address that in a follow-up if desired.
## Are these changes tested?
New sqllogictest file (`duplicate_column_alias.slt`) with 13 test cases
covering:
- Basic duplicate aliases, literals, and same-column-twice
- Subquery with duplicate names
- ORDER BY resolving to first occurrence
- CTE join (TPC-DS Q39 pattern)
- Three-way duplicates
- CAST producing same schema_name as original column
- GROUP BY and aggregates with duplicates
- ORDER BY positional reference to the renamed column
- `iszero(0.0), iszero(-0.0)` (reported in the issue by @jatin510)
- UNION with duplicate column names
- Wildcard limitation documented as explicit `query error` test
Updated existing tests in `sql_integration.rs` (5 tests), `aggregate.slt`,
and `unnest.slt` that previously asserted the "Projections require unique"
error.
## Are there any user-facing changes?
Yes, this is a behavior change:
- SQL queries with duplicate expression names now succeed instead of erroring
- Duplicate columns get a `:{N}` suffix in the output (e.g., `cov`, `cov:1`)
- First occurrence keeps its original name, so ORDER BY / HAVING references
still work
- The programmatic Rust API is unchanged
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]