[DISCUSS] Do we have all the building block(s) to support iterations in Beam?

Jan Lukavský Wed, 23 Jun 2021 10:13:18 -0700

Hi,

I'd like to discuss a very rough idea. I didn't walk through all the corner cases and the whole idea has a lot of rough edges, so please bear with me. I was thinking about non-IO applications of splittable DoFn, and the main idea - and why it is called splittable - is that it can handle unbounded outputs per element. Then I was thinking about what can generate unbounded outputs per element _without reading from an external source_ (as that would be an IO application) - and then I realized that the data can - at least theoretically - come from a downstream transform. It would have to be passed over an RPC (probably gRPC) connection, and it would probably require some sort of service discovery - as the feedback loop would have to be correctly targeted based on key - and so on (those are the rough edges).

But supposing this can be solved - what iterations actually mean is that we have a side channel that comes from downstream processing - and we need a watermark estimator for this channel that is able to hold the watermark back until the very last element (at a certain watermark) finishes the iteration. The idea is that we could then - in theory - create an Iteration PTransform that would take another PTransform (probably something like PTransform<PCollection<KV<K, V>>, PCollection<KV<K, IterationResult<K, V>>>>), where the IterationResult<K, V> would contain the original KV<K, V> and a stopping condition (true, false). By creating the feedback loop from the output of this PCollection, we could actually implement this without any need for support on the side of runners.
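
To make the shape of this a bit more concrete, here is a very rough, single-process sketch in plain Java. It does not use the Beam SDK at all: KV, IterationResult, Timestamped, watermarkHold and iterate are all hypothetical names made up for illustration, and an in-memory queue stands in for the RPC feedback channel. The only point is to show the stopping condition and a watermark hold computed as the minimum timestamp of the elements still iterating.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

public class IterationSketch {

  /** Minimal stand-in for a key-value pair; NOT Beam's own KV class. */
  static final class KV<K, V> {
    final K key;
    final V value;
    KV(K key, V value) { this.key = key; this.value = value; }
  }

  /** Hypothetical result type: the element plus a stopping condition. */
  static final class IterationResult<K, V> {
    final KV<K, V> element;
    final boolean done;
    IterationResult(KV<K, V> element, boolean done) {
      this.element = element;
      this.done = done;
    }
  }

  /** An element tagged with its event timestamp (assumed to be millis). */
  static final class Timestamped<K, V> {
    final KV<K, V> kv;
    final long timestamp;
    Timestamped(KV<K, V> kv, long timestamp) {
      this.kv = kv;
      this.timestamp = timestamp;
    }
  }

  /** The watermark may not pass the oldest element that is still iterating. */
  static <K, V> long watermarkHold(Deque<Timestamped<K, V>> inFlight) {
    long hold = Long.MAX_VALUE;
    for (Timestamped<K, V> t : inFlight) {
      hold = Math.min(hold, t.timestamp);
    }
    return hold;
  }

  /**
   * Runs body on every element until it reports done. Unfinished elements go
   * back on the queue, which plays the role of the feedback loop. The hold
   * seen before each step is appended to observedHolds.
   */
  static <K, V> List<KV<K, V>> iterate(
      List<Timestamped<K, V>> input,
      Function<KV<K, V>, IterationResult<K, V>> body,
      List<Long> observedHolds) {
    Deque<Timestamped<K, V>> inFlight = new ArrayDeque<>(input);
    List<KV<K, V>> finished = new ArrayList<>();
    while (!inFlight.isEmpty()) {
      observedHolds.add(watermarkHold(inFlight));
      Timestamped<K, V> next = inFlight.removeFirst();
      IterationResult<K, V> result = body.apply(next.kv);
      if (result.done) {
        finished.add(result.element);
      } else {
        // Feed the element back, keeping its original timestamp.
        inFlight.addLast(new Timestamped<>(result.element, next.timestamp));
      }
    }
    return finished;
  }

  public static void main(String[] args) {
    List<Timestamped<String, Integer>> input = List.of(
        new Timestamped<>(new KV<>("a", 2), 10L),
        new Timestamped<>(new KV<>("b", 8), 20L));
    List<Long> holds = new ArrayList<>();
    // Example body: halve the value until it drops to 1 or below.
    List<KV<String, Integer>> out = iterate(input, kv -> {
      int halved = kv.value / 2;
      return new IterationResult<>(new KV<>(kv.key, halved), halved <= 1);
    }, holds);
    for (KV<String, Integer> kv : out) {
      System.out.println(kv.key + " -> " + kv.value);
    }
    System.out.println("holds: " + holds);
  }
}
```

In a real pipeline the queue would be the keyed feedback channel and the hold would have to come from the watermark estimator of the splittable DoFn, which is exactly the part I have not worked out.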


Does that seem like something that might be worth exploring?

 Jan
