So there's a bit of an open question about the Java SDK behavior and whether we should keep the unused ALREADY_MERGED in the model proto.
Here is a proposal that maintains the intent of everything: - Remove MergeStatus.ALREADY_MERGED since there is no SDK that has ever had any semantics like that. - InvalidWindows is merging and translates as NEEDS_MERGE so that it gets invoked and crashes. This contradicts *both* PRs linked. - This means that embracing runners that only support a fixed set of windowing primitives requires them to at least be able to carry along InvalidWindows without invoking it I think the last bullet is unfortunate. So two proposals that allow runners to support only a fixed set of windowing primitives: (1) Don't convert merging WindowFns to InvalidWindows. Instead set an "already merged bit" that makes it into a non-merging WindowFn and translate as ALREADY_MERGED. This would allow a later GBK to make no sense in the case of sessions because there's not much chance windows will coincide. But merging WindowFns don't have to work like sessions so maybe there is some case where actually there's a small number of possible output windows. OR (2) Don't convert merging WindowFns to InvalidWindows. Instead leave it just the way it is (like Python) and translate as NEEDS_MERGE. We still remove ALREADY_MERGED. This would allow a later GBK to make no sense because there's not likely to be any merging for the same reason. But merging WindowFns don't have to work like sessions so they might merge based on some other interesting criteria. I think (2) does seem more likely to have uses. I don't think either are likely to have very many, especially if there are very few user-authored merging WindowFns out there (and I agree that this is probably true). Choice (2) also has the benefit that it matches Python and that it is trivial to implement. Kenn On Thu, Feb 18, 2021 at 3:18 PM Robert Bradshaw <[email protected]> wrote: > I think you're right about Python. I think it's fine for the SDK to > prohibit (or require explicit user action) for ambiguous things like > stacked sessions. This illegal state wouldn't generally need to be > represented in proto (but maybe it'd be nice for quicker errors in cross > language). > > On Thu, Feb 18, 2021 at 1:38 PM Kenneth Knowles <[email protected]> wrote: > >> Great. Should be easy to sort this out before Go has to make any >> decisions. >> >> I will take this opportunity to get on my soapbox and suggest instead of >> "custom WindowFn" we simply call them "WindowFn". The suffix "Fn" indicates >> that it is definable code, not just an enum that selects baked-in >> functionality. If you can't run user code for a particular type of Fn, you >> don't support it. If you don't support "custom WindowFns" you don't support >> WindowFns (but you may support "windowing" in some predefined ways). >> > > Or maybe we should call the ones off the short list "arbitrary WindowFns." > I think the reason "not supporting WindowFns" feels odd is that with the > enumerated list one may hist 90+% of usecases, which is much better than > not supporting the concept of windowing (timestamps, ...) at all. > > >> Kenn >> >> On Thu, Feb 18, 2021 at 10:13 AM Robert Burke <[email protected]> wrote: >> >>> A bit more information: graphx/translate.go has the handling of >>> WindowingStrategy at pipeline encoding and we only use Non Merging. >>> >>> Presumably this is something that would need to be fixed when supporting >>> Session windows in BEAM-4152 >>> >>> >>> On Thu, Feb 18, 2021, 10:02 AM Robert Burke <[email protected]> wrote: >>> >>>> Go has very basic windowing support that is managed entirely by the >>>> runner. Session windowing isn't implemented yet, let alone custom windowfns >>>> which i asume is what would need to specify these things. >>>> >>>> Session windowing is tracked in BEAM-4152 >>>> and Custome windowFns are tracked in BEAM-11100. >>>> >>>> On Wed, Feb 17, 2021, 12:06 PM Kenneth Knowles <[email protected]> wrote: >>>> >>>>> Hi all, >>>>> >>>>> Yet another exciting corner of portability, discovered during >>>>> debugging. Some discussion at >>>>> https://github.com/apache/beam/pull/14001 and >>>>> https://github.com/apache/beam/pull/13998 >>>>> >>>>> **In Java since around the beginning of Beam** >>>>> When a merging WindowFn goes through a GBK/Combine and windows are >>>>> merged, the downstream windowing is changed to "InvalidWindows" which will >>>>> fail any downstream GBK. The user is required to re-window before another >>>>> GBK. >>>>> >>>>> It was to protect a user from this: >>>>> >>>>> 1. User sets keys and chooses session windowing >>>>> 2. User groups/combines by session >>>>> 3. User computes the outputs to produce some new keys >>>>> 4. User groups again >>>>> >>>>> The result usually does not make sense. Because it was forbidden we >>>>> never decided whether things should merge again or not. >>>>> >>>>> **In protos** >>>>> The MergeStatus enum has NON_MERGING, NEEDS_MERGE, and ALREADY_MERGED. >>>>> It is documented that ALREADY_MERGED is for sessions/merging windows after >>>>> a GBK. >>>>> >>>>> This is _maybe_ better. It allows the windows to just be carried >>>>> along. It is a major model change and would require SDK support. But it >>>>> might still not make sense because the chances that two elements have >>>>> exactly the same merging window are very low for something like sessions. >>>>> It may be useful for advanced tricks with merging windows, but noone is >>>>> doing that because no SDK supports it. >>>>> >>>>> **In Python** >>>>> I think nothing is done. The next GBK will merge again. I could be >>>>> wrong - I just read the code very quickly and don't know it that well. >>>>> >>>>> **In Go** >>>>> I didn't even check. Maybe someone can add the status to the thread. >>>>> >>>>> Kenn >>>>> >>>>
