RafaelHerrero opened a new pull request, #21126:
URL: https://github.com/apache/datafusion/pull/21126

   ## Which issue does this PR close?
   
   - Closes #6543.
   
   ## Rationale for this change
   
   We're building a SQL engine on top of DataFusion and hit this while running 
TPC-DS benchmarks — Q39 fails during planning with:
   
   ```
   Projections require unique expression names but the expression
   "CAST(inv1.cov AS Decimal128(30, 10))" at position 4 and "inv1.cov"
   at position 10 have the same name.
   ```
   
   The underlying issue is that `CAST` is transparent to `schema_name()`, so 
both expressions resolve to `inv1.cov`. But this also affects simpler cases 
like `SELECT 1, 1` or `SELECT x, x FROM t` — all of which PostgreSQL, Trino, 
and SQLite handle without errors.
   
   Looking at the issue discussion, @alamb suggested adding auto-aliases in the 
SQL planner:
   
   > "I think that is actually a pretty neat idea -- specifically add the 
aliases in the SQL planner. I would be happy to review such a PR"
   
   That's what this PR does.
   
   ### TPC-DS Q39 reproduction
   
   The query joins two CTEs that both produce columns named `cov`, `mean`, etc. 
When the planner applies implicit casts during type coercion, the cast-wrapped 
and original expressions end up with the same schema name:
   
   ```sql
   WITH inv1 AS (
       SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
              stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS 
cov
       FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                    stddev_samp(inv_quantity_on_hand) AS stdev,
                    avg(inv_quantity_on_hand) AS mean
             FROM inventory, item, warehouse, date_dim
             WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
               AND inv_date_sk = d_date_sk AND d_year = 2001
             GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
       WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
   ),
   inv2 AS (
       SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
              stdev, mean, (CASE mean WHEN 0 THEN NULL ELSE stdev/mean END) AS 
cov
       FROM (SELECT w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy,
                    stddev_samp(inv_quantity_on_hand) AS stdev,
                    avg(inv_quantity_on_hand) AS mean
             FROM inventory, item, warehouse, date_dim
             WHERE inv_item_sk = i_item_sk AND inv_warehouse_sk = w_warehouse_sk
               AND inv_date_sk = d_date_sk AND d_year = 2001 AND d_moy = 2
             GROUP BY w_warehouse_name, w_warehouse_sk, i_item_sk, d_moy) foo
       WHERE CASE mean WHEN 0 THEN 0 ELSE stdev/mean END > 1
   )
   SELECT inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, inv1.cov,
          inv2.w_warehouse_sk, inv2.i_item_sk, inv2.d_moy, inv2.mean, inv2.cov
   FROM inv1 JOIN inv2
     ON inv1.i_item_sk = inv2.i_item_sk AND inv1.w_warehouse_sk = 
inv2.w_warehouse_sk
   ORDER BY inv1.w_warehouse_sk, inv1.i_item_sk, inv1.d_moy, inv1.mean, 
inv1.cov,
            inv2.d_moy, inv2.mean, inv2.cov;
   ```
   
   ## What changes are included in this PR?
   
   A dedup pass in `SqlToRel` that runs right after `prepare_select_exprs()` 
and before `self.project()`. It detects duplicate `schema_name()` values and 
wraps the second (and subsequent) occurrences in an `Alias` with a `:{N}` 
suffix:
   
   ```sql
   SELECT x AS c1, y AS c1 FROM t;
   -- produces columns: c1, c1:1
   ```
   
   The actual code is small (~45 lines of logic across 2 files):
   
   - `datafusion/sql/src/utils.rs` — new `deduplicate_select_expr_names()` 
function
   - `datafusion/sql/src/select.rs` — one call site between 
`prepare_select_exprs()` and `self.project()`
   
   I intentionally kept this scoped to the SQL planner only:
   
   - `validate_unique_names("Projections")` in `builder.rs` is untouched, so 
the Rust API (`LogicalPlanBuilder::project`) still rejects duplicates
   - No changes to the optimizer, physical planner, or DFSchema
   - `validate_unique_names("Windows")` is unchanged
   
   **Known limitation:** `SELECT *, x FROM t` still errors when `x` overlaps 
with `*`, because wildcard expansion happens after this dedup pass (inside 
`project_with_validation`). Happy to address that in a follow-up if desired.
   
   ## Are these changes tested?
   
   New sqllogictest file (`duplicate_column_alias.slt`) with 13 test cases 
covering:
   
   - Basic duplicate aliases, literals, and same-column-twice
   - Subquery with duplicate names
   - ORDER BY resolving to first occurrence
   - CTE join (TPC-DS Q39 pattern)
   - Three-way duplicates
   - CAST producing same schema_name as original column
   - GROUP BY and aggregates with duplicates
   - ORDER BY positional reference to the renamed column
   - `iszero(0.0), iszero(-0.0)` (reported in the issue by @jatin510)
   - UNION with duplicate column names
   - Wildcard limitation documented as explicit `query error` test
   
   Updated existing tests in `sql_integration.rs` (5 tests), `aggregate.slt`, 
and `unnest.slt` that previously asserted the "Projections require unique" 
error.
   
   ## Are there any user-facing changes?
   
   Yes, this is a behavior change:
   
   - SQL queries with duplicate expression names now succeed instead of erroring
   - Duplicate columns get a `:{N}` suffix in the output (e.g., `cov`, `cov:1`)
   - First occurrence keeps its original name, so ORDER BY / HAVING references 
still work
   - The programmatic Rust API is unchanged
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to