Anurag Mantripragada created SPARK-56599:
--------------------------------------------

             Summary: SPIP: Write schema narrowing for column-level UPDATE in DSv2
                 Key: SPARK-56599
                 URL: https://issues.apache.org/jira/browse/SPARK-56599
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.2.0
            Reporter: Anurag Mantripragada


Row-level operations like UPDATE are essential for modern data workflows, such 
as fixing ingestion errors or refreshing model scores in machine learning 
pipelines. While Spark supports UPDATE today, the current implementation is 
limited to full row rewrites. Even if a user updates only a small subset of 
columns, Spark must read every column in the table and send the entire row 
back to the data source. 

For wide tables with hundreds of columns, this full-row approach is highly 
inefficient. It wastes significant disk and network I/O by reading data that 
isn't needed and forces connectors to rewrite columns that haven't changed. 
This is particularly problematic for AI/ML use cases where only a few features 
are updated at a time in wide tables. 

This proposal improves the DataSourceV2 API to support write schema narrowing. 
It allows connectors to declare exactly which columns they need to receive 
during an update. Spark will then optimize the operation to read only the 
required columns and send a partial row back to the source. This makes updates 
significantly faster and reduces write amplification for connectors like 
Iceberg. 
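
As a rough illustration of the idea above (not the API proposed in the SPIP 
document), the column selection can be sketched in plain Python. The names 
plan_update, UpdatePlan, expression_inputs and declared_write_columns are 
hypothetical and exist only for this sketch. It assumes the connector declares 
the columns it wants to receive, and that Spark additionally reads whatever 
columns the SET and WHERE expressions reference.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class UpdatePlan:
    """Columns Spark reads from the table and columns it sends to the source."""
    read_columns: Tuple[str, ...]
    write_columns: Tuple[str, ...]


def plan_update(
    table_columns: Sequence[str],
    expression_inputs: Sequence[str],
    declared_write_columns: Optional[Sequence[str]] = None,
) -> UpdatePlan:
    """Hypothetical planner: choose read/write columns for an UPDATE.

    table_columns          -- every column in the table, in schema order
    expression_inputs      -- columns referenced by the SET and WHERE clauses
    declared_write_columns -- columns the connector asks to receive, or None
                              if the connector does not narrow the write schema
    """
    known = set(table_columns)
    unknown = (set(expression_inputs) | set(declared_write_columns or ())) - known
    if unknown:
        raise ValueError(f"columns not in table schema: {sorted(unknown)}")

    if declared_write_columns is None:
        # Full-row rewrite: read every column and send the entire row back.
        all_columns = tuple(table_columns)
        return UpdatePlan(read_columns=all_columns, write_columns=all_columns)

    # Narrowed write: send only what the connector declared, and read only
    # what the write and the expressions need. Schema order is preserved.
    write_set = set(declared_write_columns)
    read_set = write_set | set(expression_inputs)
    return UpdatePlan(
        read_columns=tuple(c for c in table_columns if c in read_set),
        write_columns=tuple(c for c in table_columns if c in write_set),
    )


# Example: a wide feature table, UPDATE t SET f7 = f7 * 2 WHERE id = 3
columns = ["id"] + [f"f{i}" for i in range(1, 201)]
full = plan_update(columns, ["id", "f7"])
narrow = plan_update(columns, ["id", "f7"], declared_write_columns=["id", "f7"])
print(len(full.read_columns), len(narrow.read_columns))
```

In this sketch the full-row plan touches all 201 columns, while the narrowed 
plan reads and writes only id and f7. Using id as the row identifier is an 
assumption of the example; a real connector may rely on other metadata to 
locate rows.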

[SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
