[PR] [SPARK-56599][SQL] Add scan narrowing for column-level UPDATEs in DSv2 [spark]

via GitHub Thu, 23 Apr 2026 11:30:15 -0700


anuragmantri opened a new pull request, #55518:
URL: https://github.com/apache/spark/pull/55518


   **What changes were proposed in this pull request?**

   For SPIP: [SPARK-56599](https://issues.apache.org/jira/browse/SPARK-56599)

   This PR adds three new default methods to the DSv2 connector API to enable
   scan and write-schema narrowing for column-level UPDATEs:

     - `updatedColumns()` on `RowLevelOperationInfo`: Spark informs the
       connector which columns are being assigned (non-identity only) before
       the operation is built.
     - `requiredDataAttributes()` on `RowLevelOperation`: the connector declares
       the exact set of data columns it needs in the write schema, symmetric
       with `requiredMetadataAttributes()`.
     - `supportsColumnUpdates()` on `RowLevelOperation`: explicit opt-in for
       receiving a partial row instead of the full table row.

   When a connector opts in, Spark removes identity assignments from the write
   plan's Project node, which unblocks ColumnPruning so that it narrows the
   physical scan automatically (the merge-on-read, or MOR, path). For
   copy-on-write (CoW), scan narrowing is done at analysis time via
   `buildRelationWithAttrs()`, because GroupBasedRowLevelOperationScanPlanning
   reads DataSourceV2Relation.output before ColumnPruning runs.

   All three methods have default implementations that preserve today's
   full-row behavior. No existing connector is affected.
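
As a rough illustration of the opt-in model described above, the sketch below uses simplified stand-in interfaces rather than Spark's real ones. Only the three method names come from this PR. The return types, the default values, and the two example classes (`LegacyOperation`, `ColumnUpdateOperation`) are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical stand-ins for the DSv2 interfaces named in this PR.
// The real signatures may differ; return types here are assumptions.
interface RowLevelOperationInfo {
    // Columns with non-identity assignments; empty by default in this sketch.
    default List<String> updatedColumns() {
        return List.of();
    }
}

interface RowLevelOperation {
    // Opt-in flag: false keeps today's full-row behavior.
    default boolean supportsColumnUpdates() {
        return false;
    }

    // Data columns the connector needs in the write schema. An empty list is
    // used in this sketch (an assumption) to mean "no narrowing requested".
    default List<String> requiredDataAttributes() {
        return List.of();
    }
}

// A connector that overrides nothing keeps the defaults and is unaffected.
final class LegacyOperation implements RowLevelOperation {}

// A connector that opts in and asks only for the updated columns.
final class ColumnUpdateOperation implements RowLevelOperation {
    private final RowLevelOperationInfo info;

    ColumnUpdateOperation(RowLevelOperationInfo info) {
        this.info = info;
    }

    @Override
    public boolean supportsColumnUpdates() {
        return true;
    }

    @Override
    public List<String> requiredDataAttributes() {
        return info.updatedColumns();
    }
}
```

Because every method has a default, a connector compiled against the older interface would presumably keep working without changes, which is the backward-compatibility property claimed above.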
   
      
   **Why are the changes needed?**

   Today, Spark's analyzer generates identity assignments for every column
   during [UPDATE alignment](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AssignmentUtils.scala#L62).
   These are used to build a Project that references
   [all columns](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteUpdateTable.scala#L179),
   preventing the
   [Optimizer](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1050)
   from narrowing the scan. The cost scales as O(table width) regardless of how
   many columns are being updated.
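
The effect of identity assignments can be seen in a toy model. The sketch below is not Spark's analyzer code: it represents assignments as plain strings, and the class and method names (`AlignmentSketch`, `align`, `dropIdentity`) are hypothetical. It only shows that alignment yields one assignment per table column, and that dropping the identity ones leaves just the columns that actually change.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of UPDATE alignment; not Spark's actual analyzer code.
final class AlignmentSketch {
    // Aligns a SET clause against the table schema: every column without an
    // explicit assignment gets an identity assignment (column -> column).
    static Map<String, String> align(List<String> tableColumns,
                                     Map<String, String> setClause) {
        Map<String, String> aligned = new LinkedHashMap<>();
        for (String column : tableColumns) {
            aligned.put(column, setClause.getOrDefault(column, column));
        }
        return aligned;
    }

    // Drops identity assignments, leaving only the columns that change.
    static Map<String, String> dropIdentity(Map<String, String> aligned) {
        Map<String, String> kept = new LinkedHashMap<>();
        aligned.forEach((column, expr) -> {
            if (!column.equals(expr)) {
                kept.put(column, expr);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("id", "name", "price", "category");
        Map<String, String> aligned = align(columns, Map.of("price", "price * 2"));
        System.out.println(aligned.size());                 // 4
        System.out.println(dropIdentity(aligned).keySet()); // [price]
    }
}
```

In this model the aligned projection always references as many columns as the table has, while the filtered one references only the assigned column, which mirrors the O(table width) cost described above.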
   
   This is especially wasteful for columnar formats like Parquet/Iceberg and is
   a blocker for efficient column-level update implementations in connectors (see
   the [Efficient Column Updates Proposal](https://docs.google.com/document/d/1Bd7JVzgajA8-DozzeEE24mID_GLuz6iwj0g4TlcVJcs/edit?pli=1&tab=t.0)
   in Iceberg).

   **Does this PR introduce any user-facing change?**

   Yes. Three new default methods are added to the public DSv2 connector API:

     - `RowLevelOperation.supportsColumnUpdates()`
     - `RowLevelOperation.requiredDataAttributes()`
     - `RowLevelOperationInfo.updatedColumns()`

   All are opt-in with backward-compatible defaults. Existing connectors see no
   change.

   **How was this patch tested?**

     - 31 new tests in DeltaBasedColumnUpdateTableSuite covering scan
       narrowing, write-schema narrowing, data correctness, identity assignment
       filtering, updatedColumns behavior, and requiredDataAttributes across MOR
       (delta), CoW (ReplaceData), and delete-then-reinsert paths.
     - 6 new updatedColumns tests in DeltaBasedUpdateTableSuiteBase.

   **Was this patch authored or co-authored using generative AI tooling?**

   Generated-by: Claude Sonnet 4.6

   I used Claude Code to generate code and tests and manually reviewed the
   generated code.

