On Wed, Dec 6, 2017 at 9:53 PM, Kenneth Knowles <[email protected]> wrote: > > On Wed, Dec 6, 2017 at 9:45 PM, Reuven Lax <[email protected]> wrote: >>>>> >>>>> Ignoring merging, one perspective is that the window is just a key with >>>>> a deadline. >>>> >>>> >>>> That is only true when performing an aggregation. Records can be >>>> associated with a window, and do not require keys at that point. The >>>> "deadline" only applies when something like a GBK is assigned. >>> >>> >>> Yea, that situation -- windows assigned but no aggregation yet -- is >>> analogous to data being a KV prior to the GBK. The main function that >>> windows actually serve in the life of data processing is to allow >>> aggregations over unbounded data with bounded resources. Only aggregation >>> really needs them - if you just have a pass-through sequence of ParDos >>> windows don't really do anything. >> >> >> I disagree. There are multiple instances where windowing is used without >> an aggregation after. Fundamentally windowing is a function on elements. >> This function is used during aggregations to bound aggregations, but makes >> sense on its own. Thinking of windowing as a "timeout" makes for an >> intuitive model, but I don't think it's really the right model. For one >> thing, that intuitive model makes less sense in batch. > > What are the instances where windowing is used without an aggregation?
There are cases where windowing is used without a group-by-key, e.g. for state and side inputs, but to me these still are a kind of aggregation, or more generally grouping of multiple distinct elements into one. My model of windows is that they form a secondary key. This key, along with the primary key, is used in grouping (including the non-gbk operations listed above), and grouping across these composite key is not allowed in the model. (Merging windows is simply a case where the actual key is not computed until the grouping is performed, but if we had an oracle that could tell us, for example, which session an element belonged to up front, then we could simply inject this into the key explicitly and do a "normal" group-by-key. A corollary of this means that acting on the (proto)window of a merging window fn before aggregation happens should be disallowed, maybe there's a JIRA to file here?) The other novel properties associated with this time-based key are that they have a known upper bound (so we can use watermarks to make concrete progress before seeing all the data, though watermarks themselves are not part of the model (at least not until triggers are introduced)) and also that they are imposed externally and passed explicitly. So, to answer the original question, I don't think windowing is operational, as given a particular windowing there is one and only one right output for a given input. This doesn't mean that it couldn't be interesting to explore expressing the desired windowing on the output of a transformation and letting it flow up rather than expressing it on the input and letting it flow down. (There are technical implications here--an output consumed in multiple ways may need to be executed multiple times despite it occurring once in the graph which could be counter-intuitive, though even just changing the triggering could require this, and runners could probably often be intelligent about sharing work between similar-enough windowing. It also means the choice of windowing couldn't be inspected during construction (though again we may have crossed that bridge if we can't inspect the choice of triggering).) Triggering, on the other hand, is very operational, and intention more easily defined at a sink (and propagated upwards) rather than the way we do now, so a huge +1 to this proposal. - Robert
