sergiomartinswhg opened a new pull request, #15152:
URL: https://github.com/apache/iceberg/pull/15152
### Context
This PR addresses a long-standing feature request for handling OVERWRITE
snapshots in Spark Structured Streaming.
**Related issues and PRs:**
[Issue #2788](https://github.com/apache/iceberg/issues/2788) - Original
feature request by @SreeramGarlapati
[PR #2944](https://github.com/apache/iceberg/pull/2944) - Format
version-aware approach by @tprelle
[PR #7295](https://github.com/apache/iceberg/pull/7295) - Enum-based
approach by @karim-ramadan
This implementation builds on the ideas from both previous PRs, adopting
the enum-based design from #7295 while maintaining backward compatibility with
the existing `streaming-skip-overwrite-snapshots` option.
### Summary
This PR adds a new `streaming-overwrite-mode` option that provides more
flexibility for handling OVERWRITE snapshots during Spark Structured Streaming
reads. While users today typically use
`streaming-skip-overwrite-snapshots=true` to skip these snapshots entirely,
this PR introduces an `added-files-only` mode that allows processing the added
files from OVERWRITE snapshots instead of skipping them.
### Motivation
Tables frequently undergo operations that produce OVERWRITE snapshots:
- `INSERT OVERWRITE` to specific partitions
- `MERGE INTO` / `UPDATE` / `DELETE` operations
Today, users handle this by setting
`streaming-skip-overwrite-snapshots=true`, which skips these snapshots
entirely. However, this means any new data added during these operations is
missed by the stream.
This PR gives users a third option: process only the added files from
OVERWRITE snapshots, allowing streams to capture new data from these operations.
### Changes
**New option:** `streaming-overwrite-mode` with three modes:
| Mode | Behavior |
|------|----------|
| `fail` | Throws exception on OVERWRITE snapshots (default) |
| `skip` | Ignores OVERWRITE snapshots entirely |
| `added-files-only` | Processes only added files from OVERWRITE snapshots
|
**Backward compatibility:**
- `streaming-skip-overwrite-snapshots=true` maps to
`streaming-overwrite-mode=skip`
- New option takes precedence when both are specified
- Deprecation warning logged when legacy option is used
### Usage
```java
spark.readStream()
.format("iceberg")
.option("streaming-overwrite-mode", "added-files-only")
.load("catalog.db.table")
```
**Warning for added-files-only mode**
This mode may produce duplicate records when overwrites rewrite existing
data (e.g., MERGE, UPDATE, DELETE). Downstream processing must handle
duplicates (e.g., idempotent writes, deduplication).
### Testing
- Unit tests for StreamingOverwriteMode enum parsing
- Integration tests for all three modes across Spark 3.4, 3.5, 4.0, and 4.1
- Tests verify backward compatibility with legacy option
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]