sergiomartinswhg opened a new pull request, #15152:
URL: https://github.com/apache/iceberg/pull/15152

    ### Context
   
     This PR addresses a long-standing feature request for handling OVERWRITE 
snapshots in Spark Structured Streaming.
   
     **Related issues and PRs:**
   
   [Issue #2788](https://github.com/apache/iceberg/issues/2788) - Original 
feature request by @SreeramGarlapati 
   [PR #2944](https://github.com/apache/iceberg/pull/2944) - Format 
version-aware approach by @tprelle 
   [PR #7295](https://github.com/apache/iceberg/pull/7295) - Enum-based 
approach by @karim-ramadan
   
     This implementation builds on the ideas from both previous PRs, adopting 
the enum-based design from #7295 while maintaining backward compatibility with 
the existing `streaming-skip-overwrite-snapshots` option.
   
     ### Summary
   
     This PR adds a new `streaming-overwrite-mode` option that provides more 
flexibility for handling OVERWRITE snapshots during Spark Structured Streaming 
reads. While users today typically use 
`streaming-skip-overwrite-snapshots=true` to skip these snapshots entirely, 
this PR introduces an `added-files-only` mode that allows processing the added 
files from OVERWRITE snapshots instead of skipping them.
   
     ### Motivation
   
     Tables frequently undergo operations that produce OVERWRITE snapshots:
     - `INSERT OVERWRITE` to specific partitions
     - `MERGE INTO` / `UPDATE` / `DELETE` operations
   
     Today, users handle this by setting 
`streaming-skip-overwrite-snapshots=true`, which skips these snapshots 
entirely. However, this means any new data added during these operations is 
missed by the stream.
   
     This PR gives users a third option: process only the added files from 
OVERWRITE snapshots, allowing streams to capture new data from these operations.
   
     ### Changes
   
     **New option:** `streaming-overwrite-mode` with three modes:
     | Mode | Behavior |
     |------|----------|
     | `fail` | Throws exception on OVERWRITE snapshots (default) |
     | `skip` | Ignores OVERWRITE snapshots entirely |
     | `added-files-only` | Processes only added files from OVERWRITE snapshots 
|
   
     **Backward compatibility:**
     - `streaming-skip-overwrite-snapshots=true` maps to 
`streaming-overwrite-mode=skip`
     - New option takes precedence when both are specified
     - Deprecation warning logged when legacy option is used
   
     ### Usage
   
     ```java
     spark.readStream()
         .format("iceberg")
         .option("streaming-overwrite-mode", "added-files-only")
         .load("catalog.db.table")
   ```
   
   **Warning for added-files-only mode**
   
     This mode may produce duplicate records when overwrites rewrite existing 
data (e.g., MERGE, UPDATE, DELETE). Downstream processing must handle 
duplicates (e.g., idempotent writes, deduplication).
   
   ###  Testing
   
     - Unit tests for StreamingOverwriteMode enum parsing
     - Integration tests for all three modes across Spark 3.4, 3.5, 4.0, and 4.1
     - Tests verify backward compatibility with legacy option
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to