[
https://issues.apache.org/jira/browse/SPARK-56599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anurag Mantripragada updated SPARK-56599:
-----------------------------------------
Description:
Row-level operations like UPDATE are essential for modern data workflows, such
as fixing ingestion errors or refreshing model scores in machine learning
pipelines. While Spark supports UPDATE today, the current implementation is
limited to full row rewrites. Even if a user updates only a small subset of
columns, Spark must read every column in the table and send the entire row
back to the data source.
For wide tables with hundreds of columns, this full-row approach is highly
inefficient. It wastes significant disk and network I/O by reading data that
isn't needed and forces connectors to rewrite columns that haven't changed.
This is particularly problematic for AI/ML use cases where only a few features
are updated at a time in wide tables.
This proposal improves the DataSourceV2 API to support write schema narrowing.
It allows connectors to declare exactly which columns they need to receive
during an update. Spark will then optimize the operation to read only the
required columns and send a partial row back to the source. This makes updates
significantly faster and reduces write amplification for connectors like
Iceberg.
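As a rough illustration of the narrowing step described above, the sketch below computes which columns Spark would need to read and send for an UPDATE. It is not the API proposed in the SPIP: the NarrowingConnector interface, the narrowedWriteSchema helper, and the example column names are all hypothetical, and schemas are modeled as plain lists of column names rather than Spark types.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class WriteSchemaNarrowingSketch {

    /** Hypothetical connector hook; NOT an actual Spark DataSourceV2 interface. */
    interface NarrowingConnector {
        /**
         * Columns the connector needs to receive in addition to the columns
         * assigned by the UPDATE (for example a key used to locate the row).
         */
        List<String> requiredColumns(List<String> assignedColumns);
    }

    /**
     * Returns the narrowed write schema: the assigned columns plus whatever the
     * connector declares, kept in table order. Every other column is skipped.
     */
    static List<String> narrowedWriteSchema(List<String> tableColumns,
                                            List<String> assignedColumns,
                                            NarrowingConnector connector) {
        Set<String> wanted = new LinkedHashSet<>(assignedColumns);
        wanted.addAll(connector.requiredColumns(assignedColumns));

        // Reject columns that do not exist in the table.
        for (String column : wanted) {
            if (!tableColumns.contains(column)) {
                throw new IllegalArgumentException("Unknown column: " + column);
            }
        }

        // Preserve table order so the partial row has a stable layout.
        List<String> narrowed = new ArrayList<>();
        for (String column : tableColumns) {
            if (wanted.contains(column)) {
                narrowed.add(column);
            }
        }
        return narrowed;
    }

    public static void main(String[] args) {
        // Hypothetical wide table; only "score" is assigned by the UPDATE.
        List<String> table = List.of("id", "name", "feature_1", "feature_2", "score");
        List<String> assigned = List.of("score");

        // Hypothetical connector that only needs the row key besides the assigned columns.
        NarrowingConnector connector = cols -> List.of("id");

        System.out.println(narrowedWriteSchema(table, assigned, connector)); // prints [id, score]
    }
}
```

With the full-row approach, all five columns in this example would be read and written back; with narrowing, only id and score are.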
[SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]
PR: https://github.com/apache/spark/pull/55518
> SPIP: Write schema narrowing for column-level UPDATE in DSv2
> ------------------------------------------------------------
>
> Key: SPARK-56599
> URL: https://issues.apache.org/jira/browse/SPARK-56599
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Anurag Mantripragada
> Priority: Major
> Labels: SPIP, pull-request-available