I've just built a pipeline in Beam and after exploring several options for
my use case, I've ended up relying on the deprecated
.outputWithTimestamp() + DoFn#getAllowedTimestampSkew in what seems to me a
quite valid use case. So I suppose this is a vote for un-deprecating this
API (or a teachable moment in which I could be pointed to a more suitable
non-deprecated approach.)

I'll stick with a previously simplification of my use case:

I get these events from my users:
    LOGIN
    CLICK GREEN BUTTON
    LOGOUT

I capture user session duration (logout time *minus* login time) and I want
to perform a PER DAY average (i.e., my window is on CalendarDays) BUT where
the aggregation's timestamp is the time of the CLICK GREEN event.

So once I calculate and emit a single user's session duration I need to
.outputWithTimestamp using the CLICK GREEN event's timestamp. This
involves, of course, outputting with a timestamp *before* the watermark.

In most cases my users LOGOUT in the same day as the CLICK GREEN BUTTON
event, so even though I'm typically outputting a timestamp before the
watermark the CalendarDay window is not yet closed and so most user session
duration's do not affect a late aggregation for that CalendarDay.

Only when a LOGOUT occurs on a day later than the CLICK GREEN event do I
have to contend with potentially late data contributing back to a prior
CalendarDay.

In any case, I have .withAllowedLateness to allow me to make a call here
about what I'm willing tradeoff (keeping windows open vs. dropping data for
users with overly long sessions), etc.

This here seems to be a simple scenario (it is effectively my real-world
scenario) and the .outputWithTimestamp + DoFn#getAllowedTimestampSkew seem
to cover it in a straightforward, effective way.

However of course I don't like building production code on deprecated
capabilities -- so advice on alternatives (or perhaps a reconsideration of
this deprecation :) ) would be appreciated.

Reply via email to