[ 
https://issues.apache.org/jira/browse/SPARK-56599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anurag Mantripragada updated SPARK-56599:
-----------------------------------------
    Description: 
Row-level operations like UPDATE are essential for modern data workflows, such 
as fixing ingestion errors or refreshing model scores in machine learning 
pipelines. While Spark supports UPDATE today, the current implementation is 
limited to full row rewrites. Even if a user updates only a small subset of 
columns, Spark must read every column in the table and send the entire row 
back to the data source. 

For wide tables with hundreds of columns, this full-row approach is highly 
inefficient. It wastes significant disk and network I/O by reading data that 
isn't needed and forces connectors to rewrite columns that haven't changed. 
This is particularly problematic for AI/ML use cases where only a few features 
are updated at a time in wide tables. 

This proposal improves the DataSourceV2 API to support write schema narrowing. 
It allows connectors to declare exactly which columns they need to receive 
during an update. Spark will then optimize the operation to read only the 
required columns and send a partial row back to the source. This makes updates 
significantly faster and reduces write amplification for connectors like 
Iceberg. 
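
As a rough illustration of the idea above, the following Python sketch models 
how the read and write column sets could shrink once a connector opts in to 
narrowing. It is not the proposed API: plan_update, UpdatePlan, and the 
narrowing flag are hypothetical names invented for this example, and the rule 
used (write = row ID columns plus assigned columns, read = row ID columns plus 
columns referenced by the SET and WHERE clauses) is an assumption rather than 
something taken from the SPIP document or the PR.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class UpdatePlan:
    """Hypothetical summary of what an UPDATE reads and writes."""
    read_columns: List[str]   # columns scanned from the table
    write_columns: List[str]  # columns sent back to the connector


def _in_table_order(table_columns: Sequence[str], wanted: set) -> List[str]:
    # Preserve the table's column order so the output is deterministic.
    return [c for c in table_columns if c in wanted]


def plan_update(
    table_columns: Sequence[str],
    assigned: Sequence[str],     # columns on the left side of SET
    referenced: Sequence[str],   # columns used in SET expressions and WHERE
    row_id: Sequence[str],       # columns that identify the row to update
    narrowing: bool,             # hypothetical connector opt-in flag
) -> UpdatePlan:
    if not narrowing:
        # Full-row rewrite: every column is read and written back.
        return UpdatePlan(list(table_columns), list(table_columns))
    # Assumed narrowing rule; the real rule is defined by the SPIP, not here.
    write = _in_table_order(table_columns, set(row_id) | set(assigned))
    read = _in_table_order(table_columns, set(row_id) | set(referenced))
    return UpdatePlan(read, write)


# A wide table: one key column plus 200 feature columns.
columns = ["id"] + [f"f{i}" for i in range(200)]

# Models: UPDATE t SET f7 = f7 * 2 WHERE f3 > 0
full = plan_update(columns, ["f7"], ["f3", "f7"], ["id"], narrowing=False)
narrow = plan_update(columns, ["f7"], ["f3", "f7"], ["id"], narrowing=True)

print(len(full.read_columns), len(full.write_columns))
print(narrow.read_columns, narrow.write_columns)
```

In this toy model the full-row plan touches all 201 columns in both 
directions, while the narrowed plan reads three columns and writes two. 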

[SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]

PR: https://github.com/apache/spark/pull/55518

  was:
Row-level operations like UPDATE are essential for modern data workflows, such 
as fixing ingestion errors or refreshing model scores in machine learning 
pipelines. While Spark supports UPDATE today, the current implementation is 
limited to full row rewrites. Even if a user updates only a small subset of 
columns, Spark must read every column in the table and send the entire row 
back to the data source. 

For wide tables with hundreds of columns, this full-row approach is highly 
inefficient. It wastes significant disk and network I/O by reading data that 
isn't needed and forces connectors to rewrite columns that haven't changed. 
This is particularly problematic for AI/ML use cases where only a few features 
are updated at a time in wide tables. 

This proposal improves the DataSourceV2 API to support write schema narrowing. 
It allows connectors to declare exactly which columns they need to receive 
during an update. Spark will then optimize the operation to read only the 
required columns and send a partial row back to the source. This makes updates 
significantly faster and reduces write amplification for connectors like 
Iceberg. 

[SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]


> SPIP: Write schema narrowing for column-level UPDATE in DSv2
> ------------------------------------------------------------
>
>                 Key: SPARK-56599
>                 URL: https://issues.apache.org/jira/browse/SPARK-56599
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Anurag Mantripragada
>            Priority: Major
>              Labels: SPIP, pull-request-available
>
> Row-level operations like UPDATE are essential for modern data workflows, 
> such as fixing ingestion errors or refreshing model scores in machine 
> learning pipelines. While Spark supports UPDATE today, the current 
> implementation is limited to full row rewrites. Even if a user updates only a 
> small subset of columns, Spark must read every column in the table and send 
> the entire row back to the data source. 
> For wide tables with hundreds of columns, this full-row approach is highly 
> inefficient. It wastes significant disk and network I/O by reading data that 
> isn't needed and forces connectors to rewrite columns that haven't changed. 
> This is particularly problematic for AI/ML use cases where only a few 
> features are updated at a time in wide tables. 
> This proposal improves the DataSourceV2 API to support write schema 
> narrowing. It allows connectors to declare exactly which columns they need to 
> receive during an update. Spark will then optimize the operation to read only 
> the required columns and send a partial row back to the source. This makes 
> updates significantly faster and reduces write amplification for connectors 
> like Iceberg. 
> [SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]
> PR: https://github.com/apache/spark/pull/55518



--
This message was sent by Atlassian Jira
(v8.20.10#820010)