Re: [DISCUSS] Do we have all the building block(s) to support iterations in Beam?

Jan Lukavský Wed, 23 Jun 2021 14:00:40 -0700

The most qualitatively import use-case I see are ACID transactions -transactions naturally involve cycles, because the most naturalimplementation would be of something like "optimistic locking" where thetransaction is allowed to progress until a downstream "commit" sees aconflict, when it needs to return the transaction back to the beginning(pretty much the same as how git resolves conflict in a push).

Another application would be graph algorithms on changing graphs, whereadding or removing an edge might trigger an iterative algorithm on thegraph (and I'm absolutely not sure that the discussed approach can dothat, this is just something, that would be cool to do :)).


On 6/23/21 10:53 PM, Reuven Lax wrote:

One question I have is whether the use cases for cyclic graphs overlapsubstantially with the use cases for event-time watermarks. Many ofthe uses I'm aware of are ML-type algorithms (e.g. clustering) oriterative algorithms on large graphs (connected components, etc.), andit's unclear how many of these problems have a natural time component.

On Wed, Jun 23, 2021 at 1:49 PM Jan Lukavský <[email protected]<mailto:[email protected]>> wrote:


    Reuven, can you please elaborate a little on that? Why do you need
    watermark per iteration? Letting the watermark progress as soon as
    all the keys arriving before the upstream watermark terminate the
    cycle seems like a valid definition without the need to make the
    watermark multidimensional. Yes, it introduces (possibly
    unbounded) latency in downstream processing, but that is something
    that should be probably expected. The unboundness of the latency
    can be limited by either fixed timeout or number of iterations.

    On 6/23/21 8:39 PM, Reuven Lax wrote:

    This was explored in the past, though the design started getting
    very complex (watermarks of unbounded dimension, where each
    iteration has its own watermark dimension). At the time, the
    exploration petered out.

    On Wed, Jun 23, 2021 at 10:13 AM Jan Lukavský <[email protected]
    <mailto:[email protected]>> wrote:

        Hi,

        I'd like to discuss a very rough idea. I didn't walk through
        all the
        corner cases and the whole idea has a lot of rough edges, so
        please bear
        with me. I was thinking about non-IO applications of
        splittable DoFn,
        and the main idea - and why it is called splittable - is that
        it can
        handle unbounded outputs per element. Then I was thinking
        about what can
        generate unbounded outputs per element _without reading from
        external
        source_ (as that would be IO application) - and then I
        realized that the
        data can - at least theoretically - come from a downstream
        transform. It
        would have to be passed over an RPC (gRPC probably)
        connection, it would
        probably require some sort of service discovery - as the
        feedback loop
        would have to be correctly targeted based on key - and so on
        (those are
        the rough edges).

        But supposing this can be solved - what iterations actually
        mean is the
        we have a side channel, that come from downstream processing
        - and we
        need a watermark estimator for this channel, that is able to
        hold the
        watermark back until the very last element (at a certain
        watermark)
        finishes the iteration. The idea is then we could - in theory
        - create
        an Iteration PTransform, that would take another PTransform
        (probably
        something like PTransform<PCollection<KV<K, V>>,
        PCollection<KV<K,
        IterationResult<K, V>>>, where the IterationResult<K, V>
        would contain
        the original KV<K, V> and a stopping condition (true, false)
        and by
        creating the feedback loop from the output of this
        PCollection we could
        actually implement this without any need of support on the
        side of runners.

        Does that seem like something that might be worth exploring?

          Jan

Re: [DISCUSS] Do we have all the building block(s) to support iterations in Beam?

Reply via email to