ayush-san commented on issue #2482:
URL: https://github.com/apache/iceberg/issues/2482#issuecomment-897338009


   > It sounds like you're suggesting that we could keep track of the oldest 
snapshot in the table and use that for startingSnapshotId. So instead of the 
Flink job starting from snapshot 2 and reading snapshots 3, 4, and 5, it would 
start from snapshot 3 and read only 4 and 5. The problem is that this 
automatically skips rows from snapshot 3, which isn't correct.
   
   @rdblue So what is the ideal way around this? Running regular 
maintenance procedures (compaction, expire-snapshots, deleting orphan files, 
etc.) is really important for streaming pipelines, since they keep the metadata 
size in check. But currently we cannot run expire-snapshots, because the Flink 
validation will fail. 
   
   My reasoning for annotating a snapshot was that any reader/writer would then 
know the table no longer contains all of its snapshots, and could adjust its 
execution accordingly. Yes, we can always traverse parent links from the current 
table state and only process the snapshots that are still available, but 
shouldn't the action that alters the table metadata be the one providing this 
information? 
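   To make the traversal idea concrete, here is a minimal sketch of walking parent links from the current table state and keeping only the snapshots that still exist. Note this is a toy model, not Iceberg's actual API: the `Map<Long, Long>` of snapshot-id to parent-id stands in for real table metadata, and `availableAncestors` is a hypothetical helper:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SnapshotWalk {
    // Toy stand-in for table metadata: snapshotId -> parentId.
    // After expire-snapshots, expired ids simply disappear from this map.
    static Deque<Long> availableAncestors(Map<Long, Long> metadata, long currentId) {
        Deque<Long> chain = new ArrayDeque<>();
        Long id = currentId;
        // Walk parent links; stop once a parent has been expired
        // (i.e. it is no longer present in the metadata).
        while (id != null && metadata.containsKey(id)) {
            chain.addFirst(id);
            id = metadata.get(id);
        }
        return chain;
    }

    public static void main(String[] args) {
        Map<Long, Long> metadata = new HashMap<>();
        // Snapshots 1 and 2 were expired; snapshot 3's parent (2) is gone.
        metadata.put(3L, 2L);
        metadata.put(4L, 3L);
        metadata.put(5L, 4L);
        System.out.println(availableAncestors(metadata, 5L)); // [3, 4, 5]
    }
}
```

   In this toy example, 3 is the oldest snapshot still fully present, which is exactly the situation from the quote above: an incremental read that must not skip rows can at best start from the table state *at* snapshot 3 and then consume 4 and 5 incrementally, rather than treating 3 itself as an incremental delta.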
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


