Hi everyone,

I have a question about the Distinct [1] transform. I couldn't find any guidance on what precautions I need to take when using it, in terms of memory consumption and performance.

Furthermore, how does it behave if the pipeline crashes and is restarted from state? Is its state restored on the rerun, so that it still removes duplicates of elements it saw during the first run? Or does it start with a new state, meaning that if duplicate messages are split between the two runs, the pipeline will output at least two of those messages?
The documentation for the Deduplicate [2] transform seems to hint that there are windowing considerations to take into account to keep memory use under control. In my current scenario I'm using it on the Global Window with a few hundred thousand messages a day. My data has a field that specifies when the data expires, so as an alternative I could write my own transform that keeps elements in state and sets a timer that fires after the data expires to remove them from the state.

Thanks,
Cristian

[1] https://beam.apache.org/releases/javadoc/2.36.0/org/apache/beam/sdk/transforms/Distinct.html
[2] https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/Deduplicate.html
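
P.S. To make the alternative concrete, here is a rough plain-Java sketch of the expiry logic I have in mind. It is not Beam code: the map only stands in for per-key state, the eviction call stands in for a timer firing, and the names (ExpiringDeduplicator, offer, evictExpired) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for per-key state plus an expiry timer (hypothetical, not Beam API). */
final class ExpiringDeduplicator<K> {
    // key -> expiry timestamp in millis, taken from the record's own expiry field
    private final Map<K, Long> seen = new HashMap<>();

    /**
     * Returns true if the element should be emitted, i.e. its key is not
     * currently held in state. Expired keys are evicted first.
     */
    boolean offer(K key, long expiresAtMillis, long nowMillis) {
        evictExpired(nowMillis);
        if (seen.containsKey(key)) {
            return false; // duplicate of an element that has not expired yet
        }
        seen.put(key, expiresAtMillis);
        return true;
    }

    /** Stands in for the timer callback: drops every key whose expiry has passed. */
    void evictExpired(long nowMillis) {
        seen.values().removeIf(expiry -> expiry <= nowMillis);
    }

    /** Number of keys currently held, i.e. the memory footprint of the "state". */
    int size() {
        return seen.size();
    }
}
```

The point of this design is that the amount of state is bounded by the number of unexpired keys rather than by everything ever seen in the Global Window.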