SreeramGarlapati opened a new issue #2788:
URL: https://github.com/apache/iceberg/issues/2788


   ## Background / context
   This is a continuation of work done in PR #2660: reading from an Iceberg table as an incremental streaming source in Spark.
   
   To read a stream out of Iceberg table - the above PR iterates thru all the 
Iceberg Snapshots (the table history) starting from the very FIRST Snapshot and 
grabs the `list of added files` in all these snapshot one-by-one and hands off 
the rows in them to Spark3 DSv2 reader. However, this along with #2752 - 
doesn't handle when the streaming reader encounters a `SNAPSHOT` of type 
`OVERWRITE`. 
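
The snapshot-walking loop described above can be sketched roughly as follows. This is a simplified Python sketch, not the actual Java implementation in PR #2660; the `operation` and `added_files` fields are illustrative stand-ins for what the Iceberg Java API exposes via `Snapshot.operation()` and `Snapshot.addedFiles()`:

```python
# Simplified sketch of the incremental streaming read in PR #2660.
# Each snapshot is modeled as a dict carrying its operation type and
# the list of data files it added.

def stream_added_files(snapshots):
    """Walk the table history oldest-first and yield every added file."""
    for snapshot in snapshots:  # ordered from the very FIRST snapshot
        for data_file in snapshot["added_files"]:
            yield data_file

history = [
    {"operation": "append", "added_files": ["f1.parquet"]},
    {"operation": "append", "added_files": ["f2.parquet", "f3.parquet"]},
]
print(list(stream_added_files(history)))
```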
   
   The current issue is a request to extend the implementation to support that case.
   
   ## Implementation/Idea Proposal
   ### What does an Iceberg `SNAPSHOT` of type `OVERWRITE` do / have ?
   To simplify, imagine an Iceberg table that comprises just one `dataFile` with 100 rows - `[a1, b1].....[a100, b100]`. If a writer performs a `Merge / upsert` operation on `[a1, b1]` to update it to `[a1, b1_upserted]`, then a snapshot of type `OVERWRITE` is written.
   
   ### Iceberg Table - with spec version 1
   In Iceberg `version 1`, this `Merge/Upsert` operation results in a net-new Iceberg table `SNAPSHOT` in which this **one single row** in the `dataFile` is changed to the new value and the file is fully rewritten.
   Now, the table snapshots look like
   `S1` --> `[a1, b1], [a2, b2].....[a100, b100]`
   `S2` --> `[a1, b1_upserted], [a2, b2].....[a100, b100]`
   
   So, when the streaming reader encounters a `SNAPSHOT` of type `OVERWRITE` (i.e., `S2`), streaming all of its added files would result in re-streaming `[a2, b2].....[a100, b100]` - **DUPLICATE** data!
   
   This is a limitation of Iceberg table spec version 1 and can be solved with version 2 (where the `data` and `delete` files are written as part of the snapshot!).
   
   So, the proposal here is to implement a `Spark` option - with the known limitation that it can replay a lot of duplicate data.
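
The copy-on-write duplication problem can be illustrated concretely. This is a toy Python sketch with three rows instead of 100; the row values mirror the example above and are not real Iceberg data:

```python
# Sketch of the v1 (copy-on-write) problem: the OVERWRITE snapshot S2
# rewrites the whole data file, so its "added files" contain every row
# again, not just the upserted one.

s1_added_rows = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]           # original file
s2_added_rows = [("a1", "b1_upserted"), ("a2", "b2"), ("a3", "b3")]  # fully rewritten file

streamed = s1_added_rows + s2_added_rows  # naive replay of all added files
duplicates = [row for row in s2_added_rows if row in s1_added_rows]
print(duplicates)  # the unchanged rows are re-streamed as DUPLICATE data
```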
   
   ### Iceberg Table - with spec version 2
   In Iceberg `version 2`, this `Merge/Upsert` operation results in a net-new Iceberg table `SNAPSHOT` with this additional **one single row** in a **NET NEW** `dataFile` and one single record in a **new** `DeleteFile`.
   `S1` --> DataFiles { `[a1, b1], [a2, b2].....[a100, b100]` }
   `S2` --> DataFiles { `[a1, b1], [a2, b2].....[a100, b100]` , `[a1, 
b1_upserted]`} + DeleteFiles { `a=a1` }
   
   In this case, when the streaming reader encounters a `SNAPSHOT` of type `OVERWRITE` (i.e., `S2`), the addedFiles correspond exactly to the newly added rows - `[a1, b1_upserted]`!
   
   This, in combination with skipping `SNAPSHOT`s of type `REPLACE`, should avoid the possibility of duplicate data.
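
A v2-aware reader that also skips `REPLACE` snapshots could look roughly like this. This is a hedged Python sketch of the proposed behavior, not the actual implementation; operation names match Iceberg's snapshot operation values (`append`, `overwrite`, `replace`), while `added_rows` is an illustrative simplification of a snapshot's added data files:

```python
# Sketch of the proposed v2-aware streaming reader: stream the added
# rows of append and overwrite snapshots, but skip "replace" snapshots
# (compactions), which rewrite existing data without changing contents.

def stream_v2(snapshots):
    for snapshot in snapshots:
        if snapshot["operation"] == "replace":
            continue  # compaction: no new rows, skip to avoid duplicates
        for row in snapshot["added_rows"]:
            yield row

history = [
    {"operation": "append",    "added_rows": [("a1", "b1"), ("a2", "b2")]},
    {"operation": "overwrite", "added_rows": [("a1", "b1_upserted")]},
    {"operation": "replace",   "added_rows": [("a1", "b1_upserted"), ("a2", "b2")]},
]
print(list(stream_v2(history)))  # the overwrite contributes only the upserted row
```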


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


