stevenzwu commented on issue #6514: URL: https://github.com/apache/iceberg/issues/6514#issuecomment-1444388008
Just to add to @rdblue's good points above.

Regarding the watermark use cases, we can prefix the watermark (in snapshot metadata) for each writer job to avoid the conflict. The downside is that consumers of the watermark info need to aggregate the per-writer values and take the min value. If we go with the conditional commit approach and fail the second commit with the lower watermark, how would the second application handle the failure? We can also get into a situation where the second application may never be able to commit if its watermark is forever behind the first application's. The condition check will always be false.

Regarding the Kafka data file committer use case, it is undesirable to have all the parallel threads committing to the Iceberg table. If the parallelism is 100 or 1,000, there will be a lot of collisions and retries. The conditional commit can ensure correctness, but it can be inefficient or infeasible with high parallelism. The Flink Iceberg sink coalesces all data files to a single committer task so that there is only one committer thread (in a Flink job) committing to the Iceberg table.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
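To illustrate the per-writer watermark idea from the comment above, here is a minimal sketch of the consumer-side aggregation. It does not use the Iceberg API: snapshot summaries are modeled as plain dictionaries, and the `watermark.` key prefix, the writer ids, and both function names are assumptions made up for this example. The consumer keeps the latest value published by each writer and takes the minimum across writers.

```python
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical key prefix for per-writer watermarks; not an Iceberg convention.
WATERMARK_PREFIX = "watermark."


def latest_watermarks(summaries: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Return the most recent watermark per writer.

    `summaries` stands in for snapshot summary properties, ordered from
    oldest to newest, so a later value overwrites an earlier one.
    """
    latest: Dict[str, int] = {}
    for summary in summaries:
        for key, value in summary.items():
            if key.startswith(WATERMARK_PREFIX):
                writer_id = key[len(WATERMARK_PREFIX):]
                latest[writer_id] = int(value)
    return latest


def table_watermark(summaries: Iterable[Mapping[str, str]]) -> Optional[int]:
    """Table-level watermark: the minimum over each writer's latest value."""
    latest = latest_watermarks(summaries)
    return min(latest.values()) if latest else None


if __name__ == "__main__":
    # Two writer jobs commit independently; neither overwrites the other's key.
    history = [
        {"watermark.job-a": "1000"},
        {"watermark.job-b": "400"},
        {"watermark.job-a": "1500"},
    ]
    print(table_watermark(history))
```

Because each writer only ever touches its own key, no commit has to be rejected for carrying a lower watermark. The cost, as noted in the comment, is that every consumer has to perform this aggregation itself.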
