On Thu, Jan 28, 2021 at 8:42 PM Pradyumna Achar <pradyumna.ac...@gmail.com>
wrote:

> hmm, no I need the triggers on the window for two reasons:
>
>   1. Say I get ~6GB of data for each of those hourly windows, and I let
> the window fire only after the watermark naturally crosses, I would need to
> store that 6GB in memory. Whereas if I let early firings happen often, and
> let the stateful DoFn output whenever it has received 100MB worth of data,
> the memory requirement comes down significantly.
>

I think you're misunderstanding. Stateful DoFns receive elements as they
arrive, regardless of triggering and windowing. The only semantic meaning
of windows for stateful DoFns is garbage collection: there is a separate
state per window, and when a window expires, the state for that window is
garbage collected. The stateful DoFn will not wait until the end of the
window to process the elements.
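
To make that concrete, here is a rough sketch in plain Python of the
per-element behaviour described above. It is a toy model, not Beam API:
the SizeBatcher class, its method names, and max_bytes are all made up for
illustration. In a real pipeline the buffer and byte count would live in
Beam state cells, and the final flush would hang off a timer or
window-expiry callback.

```python
from collections import defaultdict


class SizeBatcher:
    """Toy model of per-(key, window) state in a stateful DoFn.

    Hypothetical illustration only; none of these names are Beam API.
    """

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.buffers = defaultdict(list)  # (key, window) -> buffered elements
        self.sizes = defaultdict(int)     # (key, window) -> buffered byte count

    def process(self, key, window, element):
        """Called once per element, as soon as the element arrives."""
        state = (key, window)
        self.buffers[state].append(element)
        self.sizes[state] += len(element)
        # Emit a batch as soon as the byte threshold is reached,
        # without waiting for the window to close.
        if self.sizes[state] >= self.max_bytes:
            return self._flush(state)
        return None

    def on_window_expiry(self, key, window):
        """Emit whatever is left, then drop the state for that window."""
        return self._flush((key, window))

    def _flush(self, state):
        batch = self.buffers.pop(state, [])
        self.sizes.pop(state, None)
        return batch or None
```

With max_bytes set to the equivalent of 100MB, each call to process() only
ever holds up to one batch per key and window, and the leftover elements
are emitted when the window expires.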

The behaviour you describe is how GroupByKey (and similar aggregating
transforms, such as Count and other combiners) works. However, it is not
true that all that data needs to be stored in memory. Generally, runners
shuffle the data using disk or an external shuffle service. The grouped
data is then read by downstream transforms, but generally does not have to
fit in memory.


>
>  2. The window size could be larger than an hour, maybe a day. Early
> firings would let 100MB-pieces of the data be written and get picked up by
> downstream systems at a reduced latency instead of waiting for everything
> to arrive.
>
> GroupIntoBatches doesn't work on sizes (in terms of bytes) AFAIK
>
