Hi,

I'd like to discuss a very rough idea. I didn't walk through all the corner cases and the whole idea has a lot of rough edges, so please bear with me. I was thinking about non-IO applications of splittable DoFn, and the main idea - and why it is called splittable - is that it can handle unbounded outputs per element. Then I was thinking about what can generate unbounded outputs per element _without reading from external source_ (as that would be IO application) - and then I realized that the data can - at least theoretically - come from a downstream transform. It would have to be passed over an RPC (gRPC probably) connection, it would probably require some sort of service discovery - as the feedback loop would have to be correctly targeted based on key - and so on (those are the rough edges).

But supposing this can be solved - what iterations actually mean is the we have a side channel, that come from downstream processing - and we need a watermark estimator for this channel, that is able to hold the watermark back until the very last element (at a certain watermark) finishes the iteration. The idea is then we could - in theory - create an Iteration PTransform, that would take another PTransform (probably something like PTransform<PCollection<KV<K, V>>, PCollection<KV<K, IterationResult<K, V>>>, where the IterationResult<K, V> would contain the original KV<K, V> and a stopping condition (true, false) and by creating the feedback loop from the output of this PCollection we could actually implement this without any need of support on the side of runners.

Does that seem like something that might be worth exploring?

 Jan

Reply via email to