Re: [DISCUSS] Do we have all the building block(s) to support iterations in Beam?

Reuven Lax Wed, 23 Jun 2021 13:54:00 -0700

One question I have is whether the use cases for cyclic graphs overlap
substantially with the use cases for event-time watermarks. Many of the
uses I'm aware of are ML-type algorithms (e.g. clustering) or iterative
algorithms on large graphs (connected components, etc.), and it's unclear
how many of these problems have a natural time component.


On Wed, Jun 23, 2021 at 1:49 PM Jan Lukavský <je...@seznam.cz> wrote:

> Reuven, can you please elaborate a little on that? Why do you need
> watermark per iteration? Letting the watermark progress as soon as all the
> keys arriving before the upstream watermark terminate the cycle seems like
> a valid definition without the need to make the watermark multidimensional.
> Yes, it introduces (possibly unbounded) latency in downstream processing,
> but that is something that should be probably expected. The unboundness of
> the latency can be limited by either fixed timeout or number of iterations.
> On 6/23/21 8:39 PM, Reuven Lax wrote:
>
> This was explored in the past, though the design started getting very
> complex (watermarks of unbounded dimension, where each iteration has its
> own watermark dimension). At the time, the exploration petered out.
>
> On Wed, Jun 23, 2021 at 10:13 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi,
>>
>> I'd like to discuss a very rough idea. I didn't walk through all the
>> corner cases and the whole idea has a lot of rough edges, so please bear
>> with me. I was thinking about non-IO applications of splittable DoFn,
>> and the main idea - and why it is called splittable - is that it can
>> handle unbounded outputs per element. Then I was thinking about what can
>> generate unbounded outputs per element _without reading from external
>> source_ (as that would be IO application) - and then I realized that the
>> data can - at least theoretically - come from a downstream transform. It
>> would have to be passed over an RPC (gRPC probably) connection, it would
>> probably require some sort of service discovery - as the feedback loop
>> would have to be correctly targeted based on key - and so on (those are
>> the rough edges).
>>
>> But supposing this can be solved - what iterations actually mean is the
>> we have a side channel, that come from downstream processing - and we
>> need a watermark estimator for this channel, that is able to hold the
>> watermark back until the very last element (at a certain watermark)
>> finishes the iteration. The idea is then we could - in theory - create
>> an Iteration PTransform, that would take another PTransform (probably
>> something like PTransform<PCollection<KV<K, V>>, PCollection<KV<K,
>> IterationResult<K, V>>>, where the IterationResult<K, V> would contain
>> the original KV<K, V> and a stopping condition (true, false) and by
>> creating the feedback loop from the output of this PCollection we could
>> actually implement this without any need of support on the side of
>> runners.
>>
>> Does that seem like something that might be worth exploring?
>>
>>   Jan
>>
>>

Re: [DISCUSS] Do we have all the building block(s) to support iterations in Beam?

Reply via email to