Re: [DISCUSS] Side input consistency guarantees for triggers with multiple firings

Ruoyun Huang Thu, 11 Apr 2019 10:36:48 -0700

With little to none experience on Trigger, I am trying to understand the
problem statement in this discussion.


If a user is aware of the potential non-deterministic behavior, isn't it
almost trivial to refactor his/her user code, by putting PCollectionViews S
and T into one single PCollectionView S', to get around the issue?     I
cannot think of a reason (wrong?) why a user *have* to put data into two
separate PCollectionViews in a single ParDo(A).

On Thu, Apr 11, 2019 at 10:16 AM Lukasz Cwik <[email protected]> wrote:

> Even though what Kenn points out is a major reason for me bringing up this
> topic, I didn't want to limit this discussion to how side inputs could work
> but in general what users want from their side inputs when dealing with
> multiple firings.
>
> On Thu, Apr 11, 2019 at 10:09 AM Kenneth Knowles <[email protected]> wrote:
>
>> Luke & I talked in person a bit. I want to give context for what is at
>> stake here, in terms of side inputs in portability. A decent starting place
>> is https://s.apache.org/beam-side-inputs-1-pager
>>
>> In that general design, the runner offers the SDK just one (or a few)
>> materialization strategies, and the SDK builds idiomatic structures on top
>> of it. Concretely, the Fn API today offers a multimap structure, and the
>> idea was that the SDK could cleverly prepare a PCollection<KV<...>> for the
>> runner to materialize. As a naive example, a simple iterable structure
>> could just map all elements to one dummy key in the multimap. But if you
>> wanted a list plus its length, then you might map all elements to an
>> element key and the length to a special length meta-key.
>>
>> So there is a problem: if the SDK is outputting a new KV<"elements-key",
>> ...> and KV<"length-key", ...> for the runner to materialize then consumers
>> of the side input need to see both updates to the materialization or
>> neither. In general, these outputs might span many keys.
>>
>> It seems like there are a few ways to resolve this tension:
>>
>>  - Establish a consistency model so these updates will be observed
>> together. Seems hard and whatever we come up with will limit runners, limit
>> efficiency, and potentially leak into users having to reason about
>> concurrency
>>
>>  - Instead of building the variety of side input views on one primitive
>> multimap materialization, force runners to provide many primitive
>> materializations with consistency under the hood. Not hard to get started,
>> but adds an unfortunate new dimension for runners to vary in functionality
>> and performance, versus letting them optimize just one or a few
>> materializations
>>
>>  - Have no consistency and just not support side input methods that would
>> require consistent metadata. I'm curious what features this will hurt.
>>
>>  - Have no consistency but require the SDK to build some sort of large
>> value since single-element consistency is built in to the model always.
>> Today many runners do concatenate all elements into one value, though that
>> does not perform well. Making this effective probably requires new model
>> features.
>>
>> Kenn
>>
>> On Thu, Apr 11, 2019 at 9:44 AM Reuven Lax <[email protected]> wrote:
>>
>>> One thing to keep in mind: triggers that fire multiple times per window
>>> already tend to be non deterministic. These are element-count or
>>> processing-time triggers, both of which are fairly non deterministic in
>>> firing.
>>>
>>> Reuven
>>>
>>> On Thu, Apr 11, 2019 at 9:27 AM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> Today, we define that a side input becomes available to be consumed
>>>> once at least one firing occurs or when the runner detects that no such
>>>> output could be produced (e.g. watermark is beyond the end of the window
>>>> when using the default trigger). For triggers that fire at most once,
>>>> consumers are guaranteed to have a consistent view of the contents of the
>>>> side input. But what happens when the trigger fire multiple times?
>>>>
>>>> Lets say we have a pipeline containing:
>>>> ParDo(A) --> PCollectionView S
>>>>          \-> PCollectionView T
>>>>
>>>>   ...
>>>>    |
>>>> ParDo(C) <-(side input)- PCollectionView S and PCollectionView T
>>>>    |
>>>>   ...
>>>>
>>>> 1) Lets say ParDo(A) outputs (during a single bundle) X and Y to
>>>> PCollectionView S, should ParDo(C) see be guaranteed to see X only if it
>>>> can also see Y (and vice versa)?
>>>>
>>>> 2) Lets say ParDo(A) outputs (during a single bundle) X to
>>>> PCollectionView S and Y to PCollectionView T, should ParDo(C) be guaranteed
>>>> to see X only if it can also see Y?
>>>>
>>>

-- 
================
Ruoyun  Huang

Re: [DISCUSS] Side input consistency guarantees for triggers with multiple firings

Reply via email to