kosiew opened a new pull request, #20745:
URL: https://github.com/apache/datafusion/pull/20745

   ## Which issue does this PR close?
   
   * Closes #19950.
   
   ## Rationale for this change
   
   `UPDATE ... FROM` was planned incorrectly and effectively unusable in 
DataFusion. The SQL layer rejected the syntax outright, and the underlying 
planning/evaluation path also stripped qualifiers from assignment expressions. 
That meant expressions such as `t2.b` could be rebound as target-table columns, 
so joined values were not applied correctly.
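
To make the binding problem concrete, here is a minimal, self-contained sketch. It does not use DataFusion's types: `ColumnRef`, `JoinedRow`, `resolve`, `strip_qualifier`, and `sample_row` are hypothetical names invented for illustration. It assumes that an unqualified reference falls back to the target table, which is roughly how `t2.b` ends up reading `t1.b` once its qualifier is gone.

```rust
use std::collections::HashMap;

/// A column reference with an optional table qualifier (hypothetical type,
/// not DataFusion's `Column`).
#[derive(Debug, Clone, PartialEq)]
struct ColumnRef {
    qualifier: Option<String>,
    name: String,
}

/// One joined row: values keyed by (table, column).
type JoinedRow = HashMap<(String, String), i64>;

/// Resolve a reference against a joined row. An unqualified reference is
/// assumed to fall back to the target table.
fn resolve(row: &JoinedRow, col: &ColumnRef, target: &str) -> Option<i64> {
    let table = col.qualifier.as_deref().unwrap_or(target);
    row.get(&(table.to_string(), col.name.clone())).copied()
}

/// Drop the qualifier, mimicking what the single-table path does.
fn strip_qualifier(col: &ColumnRef) -> ColumnRef {
    ColumnRef {
        qualifier: None,
        name: col.name.clone(),
    }
}

/// A made-up joined row in which `t1.b` and `t2.b` hold different values.
fn sample_row() -> JoinedRow {
    let mut row = JoinedRow::new();
    row.insert(("t1".to_string(), "b".to_string()), 10);
    row.insert(("t2".to_string(), "b".to_string()), 99);
    row
}
```

With target table `t1`, resolving the qualified `t2.b` against `sample_row()` yields the source value, while resolving the stripped reference silently yields the target's own `b`.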
   
   This change enables the supported single-source `UPDATE ... FROM` flow and 
fixes the core binding issue by preserving source-qualified expressions for 
multi-table updates. It also gives table providers a dedicated execution path 
for updates that depend on joined input rows, instead of forcing all updates 
through the single-table assignment API.
   
   ## What changes are included in this PR?
   
   This PR adds end-to-end support for single-source `UPDATE ... FROM` and 
wires it through planning, provider APIs, MemTable execution, and tests.
   
   At a high level, the changes include:
   
   * removing the SQL-layer `not_impl` guard that previously rejected `UPDATE 
... FROM`;
   * extending `TableProvider` with a new `update_from(...)` hook for 
multi-table updates driven by a physical input plan;
   * updating the physical planner to distinguish between:
   
     * single-table `UPDATE`, which still uses extracted assignment 
expressions; and
     * `UPDATE ... FROM`, which now passes an optimized physical input plan 
plus target-only filters to the provider;
   * preserving qualified source references in assignment extraction for 
multi-table updates, while keeping the existing qualifier-stripping behavior 
for single-table updates;
   * improving identity-assignment detection so aliased target references are 
treated correctly;
   * adding helper logic to detect joins and collect target-table aliases 
during planning;
   * implementing `MemTable::update_from(...)`, including:
   
     * collecting replacement rows from the physical input,
     * validating schema equivalence,
     * counting matched target rows,
     * rejecting plans where replacement row counts do not match the number of 
target rows to update,
     * merging replacement values back into target batches using the update 
mask;
   * clearing MemTable sort-order metadata after mutation, consistent with 
update behavior;
   * updating custom provider DML tests to exercise the new provider path and 
verify that only target-table predicates are forwarded as provider filters;
   * adding planner/unit tests for alias handling and assignment extraction; and
   * adding sqllogictest coverage for explain plans, alias variants, successful 
execution, and mismatch/error behavior.
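
The `MemTable::update_from(...)` steps listed above (count matched rows, reject mismatched replacement counts, merge by mask) can be pictured with a small sketch over plain vectors. This is not the PR's code: `merge_with_mask` is a hypothetical helper, it handles a single `i64` column instead of Arrow record batches, and it assumes replacement rows arrive in the same order as the matched target rows.

```rust
/// Merge replacement values into a target column using a boolean update mask.
///
/// `mask[i] == true` means target row `i` matched the update predicate.
/// Replacement values are consumed in order, one per matched row
/// (an ordering assumption made for this sketch).
fn merge_with_mask(
    target: &[i64],
    mask: &[bool],
    replacements: &[i64],
) -> Result<Vec<i64>, String> {
    if target.len() != mask.len() {
        return Err(format!(
            "mask length {} does not match target length {}",
            mask.len(),
            target.len()
        ));
    }

    // Count matched target rows.
    let matched = mask.iter().filter(|m| **m).count();

    // Reject input whose row count differs from the number of rows to update.
    if matched != replacements.len() {
        return Err(format!(
            "expected {} replacement rows, got {}",
            matched,
            replacements.len()
        ));
    }

    // Walk the target; take the next replacement wherever the mask is set.
    let mut next = replacements.iter();
    let merged = target
        .iter()
        .zip(mask)
        .map(|(old, hit)| {
            if *hit {
                next.next().copied().unwrap_or(*old)
            } else {
                *old
            }
        })
        .collect();
    Ok(merged)
}
```

Failing fast on a count mismatch keeps a join that produces zero or several source rows per target row from silently writing the wrong values.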
   
   This PR keeps the existing limitation that `UPDATE ... FROM` supports 
only a single source table. Queries with multiple tables in the `FROM` clause 
continue to return a `not implemented` error.
   
   ## Are these changes tested?
   
   Yes.
   
   The patch adds and updates tests across several layers:
   
   * physical planner unit tests for assignment extraction in both single-table 
and `UPDATE ... FROM` cases;
   * custom source DML planning tests to verify provider behavior, alias 
handling, and target-filter forwarding;
   * sqllogictests covering:
   
     * logical and physical plans for `UPDATE ... FROM`,
     * successful updates against actual data,
     * target/source alias permutations, and
     * row-count mismatch error handling for invalid joined replacement results.
   
   These tests cover both the original reported bug and the new execution path 
introduced for table providers and MemTable.
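
The target-filter forwarding that these tests verify can be sketched as a simple partition of predicates by the qualifiers they reference. The `Predicate` type and the `pred` and `target_only_filters` helpers are hypothetical, and the real planner works on expression trees, not on pre-tagged strings.

```rust
/// A predicate tagged with the table qualifiers it references
/// (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Predicate {
    text: String,
    tables: Vec<String>,
}

/// Convenience constructor for the sketch.
fn pred(text: &str, tables: &[&str]) -> Predicate {
    Predicate {
        text: text.to_string(),
        tables: tables.iter().map(|t| t.to_string()).collect(),
    }
}

/// Keep only predicates whose every referenced qualifier is a target alias.
/// A predicate that references no table passes trivially in this sketch.
fn target_only_filters(predicates: &[Predicate], target_aliases: &[&str]) -> Vec<Predicate> {
    predicates
        .iter()
        .filter(|p| p.tables.iter().all(|t| target_aliases.contains(&t.as_str())))
        .cloned()
        .collect()
}
```

For a `WHERE` clause such as `dst.a = src.a AND dst.c > 10`, with `dst` as a target alias, only `dst.c > 10` would be forwarded under this scheme; the join condition stays with the input plan.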
   
   ## Are there any user-facing changes?
   
   Yes.
   
   DataFusion now supports single-source `UPDATE ... FROM` statements, 
including target/source aliases and source-qualified assignment expressions 
such as:
   
   ```sql
   UPDATE t1 AS dst
   SET b = src.b, d = src.d
   FROM t2 AS src
   WHERE dst.a = src.a;
   ```
   
   Previously, this syntax was rejected or failed to apply joined source values 
correctly. After this change, supported `UPDATE ... FROM` statements plan and 
execute correctly for MemTable and for providers that implement 
`update_from(...)`.
   
   There is also a small public API change for table providers: `TableProvider` 
now includes a new async `update_from(...)` method for multi-table update 
execution. Providers that do not implement it will continue to return a `not 
implemented` error for this operation.
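
The opt-in behavior described here follows the usual default-trait-method pattern. The sketch below is a simplified stand-in, not DataFusion's `TableProvider`: the `Provider` trait, both provider structs, and the synchronous signature taking a slice of `i64` are assumptions made for illustration. The method in this PR is async and its exact signature is not shown here.

```rust
/// Simplified stand-in for a table provider trait (hypothetical).
trait Provider {
    /// Default behavior: providers that do not override this method
    /// report that the operation is not implemented.
    fn update_from(&self, replacement_rows: &[i64]) -> Result<u64, String> {
        let _ = replacement_rows;
        Err("UPDATE ... FROM is not implemented for this provider".to_string())
    }
}

/// A provider that relies on the default and therefore rejects the operation.
struct ReadOnlyProvider;

impl Provider for ReadOnlyProvider {}

/// A provider that opts in; here it only reports how many rows it was given.
struct CountingProvider;

impl Provider for CountingProvider {
    fn update_from(&self, replacement_rows: &[i64]) -> Result<u64, String> {
        Ok(replacement_rows.len() as u64)
    }
}
```

Because the method has a default body, existing providers keep compiling unchanged and only fail at execution time if an `UPDATE ... FROM` statement actually reaches them.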
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   



