Re: [QUESTION] Distinct transform precautions

Cristian Constantinescu Wed, 30 Mar 2022 08:41:18 -0700

> Are  you referring to worker crashes? Or  do you mean stopping
onepipeline and  then starting a new one?


Both actually.

> Depends on the runner, but generally state is not all stored in memory.

I'm using Flink. At my knowledge they suggest a state less than 5Mb in size
(I have to double check). So even if it's not in memory, I would need to
figure out if it causes issues with Flink's state requirement.

On Wed, Mar 30, 2022 at 11:35 AM Reuven Lax <re...@google.com> wrote:

>
>
> On Wed, Mar 30, 2022 at 6:16 AM Cristian Constantinescu <zei...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> About the Distinct [1] transformation. I couldn't find what precautions I
>> need to take when using it in terms of memory consumption and performance.
>>
>
> Depends on the runner, but generally state is not all stored in memory.
>
>
>> Furthermore, how does it behave if the pipeline crashes/restarted from
>> state, is its state restored on rerun (hence removes duplicates that it's
>> seen from the first run), or will it start a new state, meaning that if
>> duplicate messages are split between the two runs, then the pipeline will
>> output at least 2 of those messages.
>>
>
> Are  you referring to worker crashes? Or  do you mean stopping onepipeline
> and  then starting a new one?
>
>>
>> The Deduplicate [2] transform seems to hint that there could be some
>> considerations to take into account in terms of windowing to keep memory
>> use under control.
>>
>> The current scenario is that I'm using it on the Global Window with a few
>> 100k messages a day.
>>
>> My data has a field when that specifies data expires, so the alternative
>> is that I can write my own transform that keeps things in the state with a
>> timer that triggers after the data expires to remove it from the state.
>>
>> Thanks,
>> Cristian
>>
>> [1]
>> https://beam.apache.org/releases/javadoc/2.36.0/org/apache/beam/sdk/transforms/Distinct.html
>> [2]
>> https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/transforms/Deduplicate.html
>>
>

Re: [QUESTION] Distinct transform precautions

Reply via email to