Anurag Mantripragada created SPARK-56599:
--------------------------------------------
Summary: SPIP: Write schema narrowing for column-level UPDATE in DSv2
Key: SPARK-56599
URL: https://issues.apache.org/jira/browse/SPARK-56599
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Anurag Mantripragada
Row-level operations like UPDATE are essential for modern data workflows, such
as fixing ingestion errors or refreshing model scores in machine learning
pipelines. While Spark supports UPDATE today, the current implementation is
limited to full row rewrites. Even if a user updates only a small subset of
columns, Spark must read every column in the table and send the entire row
back to the data source.
For wide tables with hundreds of columns, this full-row approach is
inefficient: it spends disk and network I/O reading columns that the update
never touches, and it forces connectors to rewrite columns that have not
changed. This is particularly problematic for AI/ML use cases, where only a
few features in a wide table are updated at a time.
This proposal extends the DataSourceV2 API to support write schema narrowing.
It allows connectors to declare exactly which columns they need to receive
during an update. Spark will then optimize the operation to read only the
required columns and send a partial row back to the source. This makes updates
significantly faster and reduces write amplification for connectors like
Iceberg.
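To make the idea concrete, here is a minimal, self-contained sketch of how the
narrowed read and write column sets could be derived for one UPDATE. It is an
illustration only, not the API proposed in the SPIP: the names narrow_update
and NarrowedUpdate, the use of a plain "id" column as the row identifier, and
the choice to keep columns in table order are all assumptions made for this
example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Sequence, Set, Tuple


@dataclass(frozen=True)
class NarrowedUpdate:
    """Hypothetical result type: columns to read and columns to send back."""
    read_columns: Tuple[str, ...]
    write_columns: Tuple[str, ...]


def narrow_update(table_columns: Sequence[str],
                  assignments: Dict[str, Set[str]],
                  condition_refs: Iterable[str],
                  row_id_columns: Iterable[str]) -> NarrowedUpdate:
    """Derive narrowed schemas for UPDATE ... SET ... WHERE ...

    assignments maps each updated column to the columns its new value reads.
    row_id_columns is whatever the connector needs to locate a row
    (an assumption of this sketch).
    """
    row_ids = set(row_id_columns)
    cond = set(condition_refs)
    targets = set(assignments)
    sources = set().union(*assignments.values()) if assignments else set()

    unknown = (row_ids | cond | targets | sources) - set(table_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")

    # Read only what the predicate, the SET expressions, and row lookup need.
    read_set = row_ids | cond | sources
    # Send back only the row identifier and the columns that actually change.
    write_set = row_ids | targets

    # Preserve the table's column order in both narrowed schemas.
    read = tuple(c for c in table_columns if c in read_set)
    write = tuple(c for c in table_columns if c in write_set)
    return NarrowedUpdate(read, write)


# A 201-column table: UPDATE t SET score = f1 + f2 WHERE id = 7
columns = ["id"] + [f"f{i}" for i in range(1, 200)] + ["score"]
plan = narrow_update(columns, {"score": {"f1", "f2"}}, ["id"], ["id"])
print(plan.read_columns)   # ('id', 'f1', 'f2')
print(plan.write_columns)  # ('id', 'score')
```

In this toy case the update reads 3 of 201 columns and writes back 2, compared
with reading and rewriting all 201 under a full-row rewrite.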
[SPIP Document|https://docs.google.com/document/d/1-Wiw9U54ESpbLakb9Cn_mO4AviM4nrk4TF7rNhI3JZg/edit?usp=sharing]