rdblue commented on issue #2482:
URL: https://github.com/apache/iceberg/issues/2482#issuecomment-897017924


   @ayush-san, let me make sure I understand. Say a table has snapshots 1, 2, 
3, 4, and 5, and there is a Flink streaming process that has consumed through 
snapshot 2. Then someone expires snapshots and removes 1, 2, and 3, so the 
table only has snapshots 4 and 5. The validation that is failing walks back 
from 5 to 4 and then finds that snapshot 3 is missing. Because the read can no 
longer provide the rows appended in 3, the validation fails.
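   For concreteness, here is a minimal sketch of that failure using Iceberg's 
incremental TableScan API; the class and method names are illustrative, and 
the exact point where the error surfaces is an assumption, not the actual 
validation code:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;

public class IncrementalReadSketch {
  // The reader last consumed snapshot 2 and asks for everything appended since
  // then. Covering that range requires snapshots 3, 4, and 5; snapshot 3 no
  // longer exists, so the scan fails rather than silently dropping its rows.
  static void readAppendsSinceLastConsumed(Table table, long lastConsumedSnapshotId) {
    long currentSnapshotId = table.currentSnapshot().snapshotId();  // 5
    TableScan incremental =
        table.newScan().appendsBetween(lastConsumedSnapshotId, currentSnapshotId);
    incremental.planFiles();  // fails once snapshot 3 has been expired
  }
}
```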
   
   It sounds like you're suggesting that we could keep track of the oldest 
snapshot in the table and use that for startingSnapshotId. So instead of the 
Flink job starting from snapshot 2 and reading snapshots 3, 4, and 5, it would 
start from snapshot 3 and read only 4 and 5. The problem is that this silently 
skips the rows appended in snapshot 3, which isn't correct.
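   To make the skip concrete, here is a sketch under the same example; the 
class and method names are illustrative, and appendsBetween treats the 
starting snapshot as exclusive:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.TableScan;

public class OldestSnapshotStartSketch {
  // Treating the oldest remaining snapshot (3) as the exclusive starting point
  // plans only the appends of snapshots 4 and 5, so the rows that snapshot 3
  // itself appended never reach the Flink job.
  static TableScan startFromOldestRemaining(Table table) {
    long oldestRemainingId = 3L;                            // from the example
    long currentId = table.currentSnapshot().snapshotId();  // 5
    return table.newScan().appendsBetween(oldestRemainingId, currentId);
  }
}
```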
   
   It also sounds like you're suggesting that we annotate a snapshot, which I 
don't think would be necessary because we could always traverse parents from 
the current table state and only process the snapshots that are available.
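   A sketch of that traversal, using only the core snapshot and parent-id 
accessors (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class ReachableSnapshotsSketch {
  // Follow parent pointers from the current snapshot; expired snapshots simply
  // drop out of this chain, so no per-snapshot annotation is needed.
  static List<Long> reachableSnapshotIds(Table table) {
    List<Long> ids = new ArrayList<>();
    Snapshot snapshot = table.currentSnapshot();
    while (snapshot != null) {
      ids.add(snapshot.snapshotId());
      Long parentId = snapshot.parentId();
      snapshot = parentId == null ? null : table.snapshot(parentId);
    }
    return ids;  // e.g. [5, 4] after snapshots 1, 2, and 3 were expired
  }
}
```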
   
   What the committer is doing is different. The committer keeps the checkpoint 
ID in the snapshot summary so that it can check whether a given checkpoint has 
already been committed to the table. That makes commits exactly-once as long 
as the retained snapshot history covers the period during which the 
application was unable to commit.
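   A simplified sketch of that check; the summary key name, class, and method 
here are assumptions rather than the committer's actual code:

```java
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;

public class CommittedCheckpointSketch {
  // Walk the snapshot history from newest to oldest and return the most recent
  // checkpoint ID recorded in a snapshot summary. After a restore, checkpoints
  // at or below this ID have already been committed and can be skipped.
  static long maxCommittedCheckpointId(Table table, String summaryKey) {
    Snapshot snapshot = table.currentSnapshot();
    while (snapshot != null) {
      String value = snapshot.summary().get(summaryKey);
      if (value != null) {
        return Long.parseLong(value);
      }
      Long parentId = snapshot.parentId();
      snapshot = parentId == null ? null : table.snapshot(parentId);
    }
    return -1L;  // no checkpoint has been committed to this table yet
  }
}
```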

