> Are you referring to worker crashes? Or do you mean stopping onepipeline and then starting a new one?
Both actually. > Depends on the runner, but generally state is not all stored in memory. I'm using Flink. At my knowledge they suggest a state less than 5Mb in size (I have to double check). So even if it's not in memory, I would need to figure out if it causes issues with Flink's state requirement. On Wed, Mar 30, 2022 at 11:35 AM Reuven Lax <re...@google.com> wrote: > > > On Wed, Mar 30, 2022 at 6:16 AM Cristian Constantinescu <zei...@gmail.com> > wrote: > >> Hi everyone, >> >> About the Distinct [1] transformation. I couldn't find what precautions I >> need to take when using it in terms of memory consumption and performance. >> > > Depends on the runner, but generally state is not all stored in memory. > > >> Furthermore, how does it behave if the pipeline crashes/restarted from >> state, is its state restored on rerun (hence removes duplicates that it's >> seen from the first run), or will it start a new state, meaning that if >> duplicate messages are split between the two runs, then the pipeline will >> output at least 2 of those messages. >> > > Are you referring to worker crashes? Or do you mean stopping onepipeline > and then starting a new one? > >> >> The Deduplicate [2] transform seems to hint that there could be some >> considerations to take into account in terms of windowing to keep memory >> use under control. >> >> The current scenario is that I'm using it on the Global Window with a few >> 100k messages a day. >> >> My data has a field when that specifies data expires, so the alternative >> is that I can write my own transform that keeps things in the state with a >> timer that triggers after the data expires to remove it from the state. >> >> Thanks, >> Cristian >> >> [1] >> https://beam.apache.org/releases/javadoc/2.36.0/org/apache/beam/sdk/transforms/Distinct.html >> [2] >> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/Deduplicate.html >> >