Before going into the details, I want to set the context (my own
perspective) that the most important thing for the first stable release is
removing possibly dangerous or incomplete pieces, which we can always
restore backwards compatibly after deliberation.

The original spec is this:

"If invoked from startBundle or finishBundle, this will attempt to use the
WindowFn of the input PCollection to determine what windows the element
should be in, throwing an exception if the WindowFn attempts to access any
information about the input element. The output element will have a
timestamp of negative infinity."

First, I want to call out a mistake of mine: I mistakenly read the spec and
code to be that the timestamp of negative infinity was run through the
WindowFn, which would constitute data corruption for most WindowFns.

So, with my corrected understanding, your design where we basically say "it
always outputs to the global window or something like it" is a totally
reasonable alternative to the current status. I'm not the only one with
opinions, here, of course...

In Java, we _can_ now enforce this statically, thanks to Thomas G's
improvements. We could either force it to be globally windowed or - since
it can't inspect the element or timestamp - just run a couple test calls
through "assignWindows", optionally confirming equality of the results. We
can also enforce that the correct subclass of window is output, though this
is not implemented as such right now.

But this behavior is all to support people who want to write transforms
that work with essentially only one windowing strategy, basically
pre-windowing stuff. Are there really any meaningful alternatives to the
global window? If so, I don't think that is realistically what this is
about. Such transforms should be permitted, but not encouraged or supported
automatically, as they are antithetical to the unified model. So I would
also suggest that they should validate that their input is globally
windowed and explicitly output to the global window. Trivial and they
already should be doing that. They also have the capability of storing the
input's WindowFn and producing behavior identical to what you describe.

For a transform to output in FinishBundle and be window-agnostic, the local
state of the DoFn needs to partition the windows manually, consider what
timestamp it wants to use, and output the correct thing to the correct
timestamp and window. I believe that having only the ability to
outputWindowed(value, timestamp, window) makes it quite obvious that this
is necessary. It is not boilerplate to do so, but core functionality.

Kenn

On Fri, May 5, 2017 at 11:41 AM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> The JIRA issue https://issues.apache.org/jira/browse/BEAM-1283
> suggests requiring an explicit Window when emitting from finshBundle.
> I'm starting a thread because JIRA/GitHub probably isn't the best (or
> most efficient) place to have this discussion.
>
> The original Spec requires the ambient WindowFn to be a
> non-element-inspecting, non-timestamp-inspecting Fn (currently, only
> the GlobalWindowsFn) and at the time this was chosen (after much
> deliberation) over requiring a WindowedValue or other options because
> batching in batch mode was a very commonly used pattern (and it should
> be noted that patterns of using state and/or a window expiry callback
> doesn't address this case well due to the lack of key with which to
> store state and distant (if ever) expiration of the window).
>
> The downside, of course, is that trying to use such a DoFn in a
> windowed context will not be caught until runtime. The proposal to
> emit WindowedValues has exactly the same downside, but the runtime
> error obtained when explicitly emitting a GlobalWindow when the
> ambient WindowFn is different will likely be much less clear (an
> encoding exception in the best case, silent data corruption in the
> worse). This also requires more boilerplate on the part of the author,
> and doesn't solve the enumerated issues of limiting which WindowFns
> can be used, choosing a timestamp, or bogus proto windows.
>
> Ideally we could catch such an error at pipeline construction time (or
> earlier) but there have been no proposals along this line yet.
> However, this is a stable-release-blocker, so we should come up with a
> (temporary at least) course of action soon. At the very least it seems
> we should accept emitting a Timestamped value as well, to which most
> WindowFns can be applied.
>
> - Robert
>

Reply via email to