[QUESTION] Distinct transform precautions

Cristian Constantinescu Wed, 30 Mar 2022 06:16:04 -0700

Hi everyone,

I have a question about the Distinct [1] transform. I couldn't find what
precautions I need to take when using it, in terms of memory consumption
and performance. Furthermore, how does it behave if the pipeline crashes
or is restarted from state? Is its state restored on rerun (so that it
still removes duplicates it saw during the first run), or does it start
with fresh state? In the latter case, if duplicate messages are split
between the two runs, the pipeline would output at least two of those
messages.


The documentation for the Deduplicate [2] transform seems to hint that
there are windowing considerations to take into account to keep memory
use under control.

My current scenario is that I'm using it in the global window with a few
hundred thousand messages a day.

My data has a field that specifies when each record expires, so the
alternative is to write my own transform that keeps seen records in state
and sets a timer that fires after the record expires to remove it from
the state.
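
The state-plus-timer idea above can be sketched in plain Java as follows.
This is only a model of the logic, not Beam code: the class ExpiringDedup
and its methods offer, onTimer and size are hypothetical names, keys are
assumed to be strings, and the expiry is assumed to be an epoch-millis
value read from the record. In a real pipeline the map would presumably
be per-key state and the sweep an event-time timer callback.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "dedup with expiring state"; not part of any Beam API.
final class ExpiringDedup {
    // key -> expiry timestamp (epoch millis), taken from the record's own field
    private final Map<String, Long> seen = new HashMap<>();

    /** Returns true if the record is new and should be emitted, false if it is a duplicate. */
    public boolean offer(String key, long expiresAtMillis) {
        // putIfAbsent returns null only when the key was not tracked yet
        return seen.putIfAbsent(key, expiresAtMillis) == null;
    }

    /** Stands in for the timer firing: drops every entry whose expiry is at or before now. */
    public int onTimer(long nowMillis) {
        int before = seen.size();
        seen.values().removeIf(expiry -> expiry <= nowMillis);
        return before - seen.size();
    }

    /** Number of keys currently held in state. */
    public int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        ExpiringDedup dedup = new ExpiringDedup();
        System.out.println(dedup.offer("order-1", 1_000L)); // true: first sighting
        System.out.println(dedup.offer("order-1", 1_000L)); // false: duplicate
        dedup.onTimer(1_000L);                               // expiry reached, entry dropped
        System.out.println(dedup.size());                    // 0
    }
}
```

With this approach the state only holds records that have not yet
expired, so its size is bounded by the number of live keys rather than by
everything ever seen. The trade-off is that a key arriving again after
its expiry is emitted a second time.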

Thanks,
Cristian

[1]
https://beam.apache.org/releases/javadoc/2.36.0/org/apache/beam/sdk/transforms/Distinct.html
[2]
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/Deduplicate.html
