ayush-san commented on issue #2482: URL: https://github.com/apache/iceberg/issues/2482#issuecomment-897338009
> It sounds like you're suggesting that we could keep track of the oldest snapshot in the table and use that for `startingSnapshotId`. So instead of the Flink job starting from snapshot 2 and reading snapshots 3, 4, and 5, it would start from snapshot 3 and read only 4 and 5. The problem is that this automatically skips rows from snapshot 3, which isn't correct.

@rdblue What would be the ideal way around this? Running regular maintenance procedures (compaction, expire-snapshots, deleting orphan files, etc.) is really important for streaming pipelines, since they keep the metadata size in check. But currently we cannot run expire-snapshots, because the Flink validation will fail.

My reasoning for annotating a snapshot was that any reader or writer would then know that the table no longer contains all of its snapshots, and could adjust its execution accordingly. Yes, we can always traverse parents from the current table state and process only the snapshots that are still available, but shouldn't the action that alters the table metadata be the one providing this information?
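The "traverse parents from the current table state" fallback mentioned above can be sketched as follows. This is a minimal illustration only: it uses a plain `Map` of child-to-parent snapshot ids in place of Iceberg's real table metadata, and the class name, method name, and snapshot ids are hypothetical, not Iceberg API. The idea is to walk back from the current snapshot and stop as soon as an ancestor is missing, i.e. was expired:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotAncestry {

    /**
     * Walk parent pointers back from the current snapshot, keeping only
     * ancestors that still exist in table metadata. Expired snapshots are
     * simply absent from the map, so the walk stops when an id is missing.
     *
     * @param currentId the current snapshot id
     * @param parents   live snapshot id -> parent snapshot id
     * @return available ancestors, oldest first (including currentId)
     */
    static List<Long> availableAncestors(long currentId, Map<Long, Long> parents) {
        List<Long> result = new ArrayList<>();
        Long id = currentId;
        while (id != null && parents.containsKey(id)) {
            result.add(id);
            id = parents.get(id); // a missing or null parent ends the walk
        }
        Collections.reverse(result); // oldest available snapshot first
        return result;
    }

    public static void main(String[] args) {
        // Snapshots 1..5, each the child of the previous one.
        // Snapshots 1 and 2 were expired, so only 3, 4, 5 remain as keys.
        Map<Long, Long> parents = new HashMap<>();
        parents.put(3L, 2L);
        parents.put(4L, 3L);
        parents.put(5L, 4L);
        System.out.println(availableAncestors(5L, parents)); // [3, 4, 5]
    }
}
```

With this, a reader can discover on its own which snapshots survived expiration; the point of the comment above is that it would be cleaner for the expire action itself to record that information in the metadata, rather than forcing every reader to re-derive it.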
