So, what we want to prohibit is stacked merging aggregations. (Open question: is that a property of the WindowFn, in which case some merging WindowFns could allow stacking, and some non-merging ones prohibit it, or is this really tied to merging itself?)
In order to do this in a cross-language way (e.g. two Java aggregations separated by a Go DoFn) we need to preserve this "don't re-aggregate" bit in the proto. I thought that's what ALREADY_MERGED was for. On Thu, Feb 18, 2021 at 8:30 PM Kenneth Knowles <[email protected]> wrote: > So there's a bit of an open question about the Java SDK behavior and > whether we should keep the unused ALREADY_MERGED in the model proto. > > Here is a proposal that maintains the intent of everything: > > - Remove MergeStatus.ALREADY_MERGED since there is no SDK that has ever > had any semantics like that. > - InvalidWindows is merging and translates as NEEDS_MERGE so that it gets > invoked and crashes. This contradicts *both* PRs linked. > - This means that embracing runners that only support a fixed set of > windowing primitives requires them to at least be able to carry along > InvalidWindows without invoking it > > I think the last bullet is unfortunate. So two proposals that allow > runners to support only a fixed set of windowing primitives: > > (1) Don't convert merging WindowFns to InvalidWindows. Instead set an > "already merged bit" that makes it into a non-merging WindowFn and > translate as ALREADY_MERGED. This would allow a later GBK to make no sense > in the case of sessions because there's not much chance windows will > coincide. But merging WindowFns don't have to work like sessions so maybe > there is some case where actually there's a small number of possible output > windows. > > OR > > (2) Don't convert merging WindowFns to InvalidWindows. Instead leave it > just the way it is (like Python) and translate as NEEDS_MERGE. We still > remove ALREADY_MERGED. This would allow a later GBK to make no sense > because there's not likely to be any merging for the same reason. But > merging WindowFns don't have to work like sessions so they might merge > based on some other interesting criteria. > > I think (2) does seem more likely to have uses. I don't think either are > likely to have very many, especially if there are very few user-authored > merging WindowFns out there (and I agree that this is probably true). > Choice (2) also has the benefit that it matches Python and that it is > trivial to implement. > > Kenn > > On Thu, Feb 18, 2021 at 3:18 PM Robert Bradshaw <[email protected]> > wrote: > >> I think you're right about Python. I think it's fine for the SDK to >> prohibit (or require explicit user action) for ambiguous things like >> stacked sessions. This illegal state wouldn't generally need to be >> represented in proto (but maybe it'd be nice for quicker errors in cross >> language). >> >> On Thu, Feb 18, 2021 at 1:38 PM Kenneth Knowles <[email protected]> wrote: >> >>> Great. Should be easy to sort this out before Go has to make any >>> decisions. >>> >>> I will take this opportunity to get on my soapbox and suggest instead of >>> "custom WindowFn" we simply call them "WindowFn". The suffix "Fn" indicates >>> that it is definable code, not just an enum that selects baked-in >>> functionality. If you can't run user code for a particular type of Fn, you >>> don't support it. If you don't support "custom WindowFns" you don't support >>> WindowFns (but you may support "windowing" in some predefined ways). >>> >> >> Or maybe we should call the ones off the short list "arbitrary >> WindowFns." I think the reason "not supporting WindowFns" feels odd is that >> with the enumerated list one may hist 90+% of usecases, which is much >> better than not supporting the concept of windowing (timestamps, ...) at >> all. >> >> >>> Kenn >>> >>> On Thu, Feb 18, 2021 at 10:13 AM Robert Burke <[email protected]> >>> wrote: >>> >>>> A bit more information: graphx/translate.go has the handling of >>>> WindowingStrategy at pipeline encoding and we only use Non Merging. >>>> >>>> Presumably this is something that would need to be fixed when >>>> supporting Session windows in BEAM-4152 >>>> >>>> >>>> On Thu, Feb 18, 2021, 10:02 AM Robert Burke <[email protected]> wrote: >>>> >>>>> Go has very basic windowing support that is managed entirely by the >>>>> runner. Session windowing isn't implemented yet, let alone custom >>>>> windowfns >>>>> which i asume is what would need to specify these things. >>>>> >>>>> Session windowing is tracked in BEAM-4152 >>>>> and Custome windowFns are tracked in BEAM-11100. >>>>> >>>>> On Wed, Feb 17, 2021, 12:06 PM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> Yet another exciting corner of portability, discovered during >>>>>> debugging. Some discussion at >>>>>> https://github.com/apache/beam/pull/14001 and >>>>>> https://github.com/apache/beam/pull/13998 >>>>>> >>>>>> **In Java since around the beginning of Beam** >>>>>> When a merging WindowFn goes through a GBK/Combine and windows are >>>>>> merged, the downstream windowing is changed to "InvalidWindows" which >>>>>> will >>>>>> fail any downstream GBK. The user is required to re-window before another >>>>>> GBK. >>>>>> >>>>>> It was to protect a user from this: >>>>>> >>>>>> 1. User sets keys and chooses session windowing >>>>>> 2. User groups/combines by session >>>>>> 3. User computes the outputs to produce some new keys >>>>>> 4. User groups again >>>>>> >>>>>> The result usually does not make sense. Because it was forbidden we >>>>>> never decided whether things should merge again or not. >>>>>> >>>>>> **In protos** >>>>>> The MergeStatus enum has NON_MERGING, NEEDS_MERGE, and >>>>>> ALREADY_MERGED. It is documented that ALREADY_MERGED is for >>>>>> sessions/merging windows after a GBK. >>>>>> >>>>>> This is _maybe_ better. It allows the windows to just be carried >>>>>> along. It is a major model change and would require SDK support. But it >>>>>> might still not make sense because the chances that two elements have >>>>>> exactly the same merging window are very low for something like sessions. >>>>>> It may be useful for advanced tricks with merging windows, but noone is >>>>>> doing that because no SDK supports it. >>>>>> >>>>>> **In Python** >>>>>> I think nothing is done. The next GBK will merge again. I could be >>>>>> wrong - I just read the code very quickly and don't know it that well. >>>>>> >>>>>> **In Go** >>>>>> I didn't even check. Maybe someone can add the status to the thread. >>>>>> >>>>>> Kenn >>>>>> >>>>>
