I'll probably post the same answer on SO, but I'm replying here first.

This is mentioned in the JIRA ticket, the design doc, and the API doc, but to
reiterate: the contract/guarantee of the new API is that it will
deduplicate events properly when the max distance in time among all your
duplicate events is less than the watermark delay. The internal
implementation is slightly complicated and depends on the first arrived
event per duplication, and the API does not promise any behavior beyond
the contract/guarantee, so you should not rely on any stricter behavior.

The main use case of this new API is to cope with writers that guarantee
only "at-least-once" delivery, which carries a risk of duplication. E.g.,
writing data to a Kafka topic without a transaction could end up with
duplicates. In most cases, duplicated writes of the same data happen within
a predictable time frame, and this new API ensures that these duplicated
writes are deduplicated as long as users provide the max distance of time
(max - min) among duplicated events as the delay threshold of the watermark.
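
To make the "max - min" part concrete, here is a minimal sketch of how one
might estimate that distance from a sample of already-written events. The
helper name max_duplicate_spread, the column names id and event_time, and
the sample data are all hypothetical; only withWatermark and
dropDuplicatesWithinWatermark are actual DataFrame methods (Spark 3.5+).

```python
from datetime import datetime, timedelta


def max_duplicate_spread(events):
    """Return the largest (max - min) event-time gap among events sharing a key.

    `events` is an iterable of (dedup_key, event_time) pairs. This is a
    hypothetical offline helper, not part of Spark.
    """
    bounds = {}
    for key, ts in events:
        lo, hi = bounds.get(key, (ts, ts))
        bounds[key] = (min(lo, ts), max(hi, ts))
    return max((hi - lo for lo, hi in bounds.values()), default=timedelta(0))


def dedup_stream(df, delay="3 minutes"):
    # `df` is assumed to be a streaming DataFrame with hypothetical columns
    # `id` (dedup key) and `event_time` (event-time timestamp).
    return (df.withWatermark("event_time", delay)
              .dropDuplicatesWithinWatermark(["id"]))


# Made-up sample: each key was written twice by an at-least-once writer.
sample = [
    ("a", datetime(2023, 11, 20, 10, 0, 0)),
    ("a", datetime(2023, 11, 20, 10, 0, 40)),  # duplicate of "a", 40 s later
    ("b", datetime(2023, 11, 20, 10, 1, 0)),
    ("b", datetime(2023, 11, 20, 10, 3, 0)),   # duplicate of "b", 2 min later
]
spread = max_duplicate_spread(sample)
print(spread)  # 0:02:00
```

Since the contract requires the distance to be less than the watermark
delay, the sketch defaults to "3 minutes" for a measured spread of 2 minutes
rather than using the spread itself; how much margin to leave is a judgment
call for your workload.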

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Nov 20, 2023 at 10:18 AM Perfect Stranger <paulpaul1...@gmail.com>
wrote:

> Hello, I have trouble understanding how dropDuplicatesWithinWatermark
> works. And I posted this stackoverflow question:
>
> https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work
>
> Could somebody answer it please?
>
> Best Regards,
> Pavel.
>
