I'll probably reply the same on SO, but I'm posting here first. This is mentioned in the JIRA ticket, the design doc, and the API doc, but to reiterate: the contract/guarantee of the new API is that it deduplicates events properly when the max distance among all your duplicate events is less than the watermark delay. The internal implementation is slightly complicated and depends on the first-arrived event per set of duplicates, so you cannot expect any strict behavior beyond that contract/guarantee.
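To make the contract above concrete, here is a small plain-Python sketch. It does not call Spark and says nothing about the internal implementation. It only computes the largest event-time gap among events sharing a key and compares it with a candidate watermark delay. The function names, the (key, event_time) tuple layout, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def max_duplicate_spread(events):
    """Return the largest (max - min) event-time gap among events sharing a key.

    `events` is an iterable of (key, event_time) pairs. This layout is a
    hypothetical one chosen for the illustration, not a Spark structure.
    """
    bounds = {}
    for key, ts in events:
        lo, hi = bounds.get(key, (ts, ts))
        bounds[key] = (min(lo, ts), max(hi, ts))
    # An empty input has no duplicates, so its spread is zero.
    return max((hi - lo for lo, hi in bounds.values()), default=timedelta(0))


def within_contract(events, watermark_delay):
    """True when every group of duplicates spans less than the watermark delay."""
    return max_duplicate_spread(events) < watermark_delay


t0 = datetime(2023, 11, 20, 10, 0, 0)
sample = [
    ("a", t0),
    ("a", t0 + timedelta(seconds=40)),   # duplicate of "a", 40 s later
    ("b", t0 + timedelta(seconds=5)),
    ("b", t0 + timedelta(seconds=125)),  # duplicate of "b", 120 s later
]

print(max_duplicate_spread(sample))
print(within_contract(sample, timedelta(minutes=5)))
print(within_contract(sample, timedelta(minutes=1)))
```

In this sample the largest spread is 120 seconds, so a 5-minute delay falls inside the contract while a 1-minute delay does not.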
The main use case of this new API is to cope with writers that guarantee "at-least-once" semantics, which carries a risk of duplication. E.g., writing data to a Kafka topic without a transaction could end up with duplicates. In most cases, duplicated writes for the same data happen within a predictable time frame, and this new API ensures that these duplicated writes are deduplicated as long as users provide the max distance in time (max - min) among duplicated events as the delay threshold of the watermark.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Nov 20, 2023 at 10:18 AM Perfect Stranger <paulpaul1...@gmail.com> wrote:

> Hello, I have trouble understanding how dropDuplicatesWithinWatermark
> works. And I posted this stackoverflow question:
>
> https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work
>
> Could somebody answer it please?
>
> Best Regards,
> Pavel.