Re: [DISCUSS] [DOC] Triggering is for sinks!

Robert Bradshaw Wed, 06 Dec 2017 22:45:07 -0800

On Wed, Dec 6, 2017 at 9:53 PM, Kenneth Knowles <[email protected]> wrote:
>
> On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote:
>>>>>
>>>>> Ignoring merging, one perspective is that the window is just a key with
>>>>> a deadline.
>>>>
>>>>
>>>> That is only true when performing an aggregation. Records can be
>>>> associated with a window, and do not require keys at that point. The
>>>> "deadline" only applies when something like a GBK is applied.
>>>
>>>
>>> Yea, that situation -- windows assigned but no aggregation yet -- is
>>> analogous to data being a KV prior to the GBK. The main function that
>>> windows actually serve in the life of data processing is to allow
>>> aggregations over unbounded data with bounded resources. Only aggregation
>>> really needs them - if you just have a pass-through sequence of ParDos
>>> windows don't really do anything.
>>
>>
>> I disagree. There are multiple instances where windowing is used without
>> an aggregation after. Fundamentally windowing is a function on elements.
>> This function is used during aggregations to bound aggregations, but makes
>> sense on its own. Thinking of windowing as a "timeout" makes for an
>> intuitive model, but I don't think it's really the right model. For one
>> thing, that intuitive model makes less sense in batch.
>
> What are the instances where windowing is used without an aggregation?


There are cases where windowing is used without a group-by-key, e.g.
for state and side inputs, but to me these still are a kind of
aggregation, or more generally grouping of multiple distinct elements
into one.

My model of windows is that they form a secondary key. This key, along
with the primary key, is used in grouping (including the non-gbk
operations listed above), and grouping across these composite keys is
not allowed in the model. (Merging windows is simply a case where the
actual key is not computed until the grouping is performed, but if we
had an oracle that could tell us, for example, which session an
element belonged to up front, then we could simply inject this into
the key explicitly and do a "normal" group-by-key. A corollary of this
is that acting on the (proto)window of a merging window fn before
aggregation happens should be disallowed; maybe there's a JIRA to file
here?)
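
To make the "secondary key" view concrete, here is a minimal sketch in
plain Python (not Beam code). The helper names, the 60-second window
size, and the sample records are all assumptions made up for
illustration. Fixed windows can be computed per element up front and
folded into the grouping key, whereas session windows are only known
once the per-element proto-windows have been merged.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; assumed fixed-window size, for illustration only


def assign_window(timestamp, size=WINDOW_SIZE):
    """Return the [start, end) fixed window that contains the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def group_by_key_and_window(records):
    """Group (key, value, timestamp) records by the composite (key, window)."""
    groups = defaultdict(list)
    for key, value, timestamp in records:
        # The window acts as a secondary key next to the primary key.
        groups[(key, assign_window(timestamp))].append(value)
    return dict(groups)


def session_windows(timestamps, gap):
    """Merge per-element proto-windows [t, t + gap) into sessions.

    The final session for an element is only known after merging, which
    is why it cannot be injected into the key ahead of the grouping.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Overlaps the current session: extend it.
            sessions[-1] = (sessions[-1][0], max(sessions[-1][1], t + gap))
        else:
            sessions.append((t, t + gap))
    return sessions


records = [("a", 1, 5), ("a", 2, 59), ("a", 3, 61), ("b", 4, 10)]
grouped = group_by_key_and_window(records)
sessions = session_windows([1, 5, 30], gap=10)
```

With these inputs, key "a" ends up in two separate groups, one per
window, and nothing is ever combined across them.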

The other novel properties of this time-based key are that it has a
known upper bound (so we can use watermarks to make concrete progress
before seeing all the data, though watermarks themselves are not part
of the model (at least not until triggers are introduced)) and that it
is imposed externally and passed explicitly.
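
As a rough illustration of why the upper bound matters, the sketch
below (plain Python; the function name, the data, and the watermark
value are hypothetical) selects the groups whose window end is at or
before a watermark. Those groups can be emitted without waiting for
the rest of the input.

```python
def ready_windows(grouped, watermark):
    """Return groups whose [start, end) window ends at or before the watermark."""
    return {
        key_and_window: values
        for key_and_window, values in grouped.items()
        if key_and_window[1][1] <= watermark
    }


# Groups keyed by (primary key, (window start, window end)).
grouped = {
    ("a", (0, 60)): [1, 2],
    ("a", (60, 120)): [3],
    ("b", (0, 60)): [4],
}
ready = ready_windows(grouped, watermark=60)
```

This ignores late data and triggering entirely; it only shows that a
bounded window gives the runner a point at which a group is complete.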

So, to answer the original question, I don't think windowing is
operational, as given a particular windowing there is one and only one
right output for a given input. This doesn't mean that it couldn't be
interesting to explore expressing the desired windowing on the output
of a transformation and letting it flow up rather than expressing it
on the input and letting it flow down. (There are technical
implications here--an output consumed in multiple ways may need to be
executed multiple times despite it occurring once in the graph which
could be counter-intuitive, though even just changing the triggering
could require this, and runners could probably often be intelligent
about sharing work between similar-enough windowings. It also means the
choice of windowing couldn't be inspected during construction (though
again we may have crossed that bridge if we can't inspect the choice
of triggering).)

Triggering, on the other hand, is very operational, and its intent is
more easily defined at a sink (and propagated upwards) than the way we
do it now, so a huge +1 to this proposal.

- Robert
