Hi,

After a recent investigation of a data loss bug caused by unintuitive
behavior of some kinds of triggers, we had a discussion about how we can
protect against future issues like this, and I summarized it in
https://issues.apache.org/jira/browse/BEAM-3288 . Copying here:

Current Beam trigger semantics are rather confusing and in some cases
extremely unsafe, especially if the pipeline includes multiple chained
GBKs. One example of that is https://issues.apache.org/jira/browse/BEAM-3169 .

There are multiple issues:

The API allows users to specify terminating top-level triggers (e.g.
"trigger a pane after receiving 10000 elements in the window, and that's
it"), but experience from user support shows that this is nearly always a
mistake and the user did not intend to drop all further data.
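
For concreteness, here's roughly what such a terminating top-level trigger
looks like in the Java SDK (the input collection, window size and element
count below are made up for illustration; classes are from
org.apache.beam.sdk.transforms.windowing and org.joda.time):

    // Hypothetical windowing setup with a terminating top-level trigger:
    // AfterPane.elementCountAtLeast(10000) fires at most once per window and
    // then finishes, so later elements in that window are silently dropped.
    PCollection<KV<String, Integer>> windowed = input.apply(
        Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterPane.elementCountAtLeast(10000))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes());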

In general, triggers are the only place in Beam where data is dropped
without making a lot of very loud noise about it - a practice for which the
PTransform style guide uses the language: "never, ever, ever do this".

Continuation triggers are even worse. For context: the continuation trigger
is the trigger set on the output of a GBK, and it controls further
aggregation of the results of that aggregation by downstream GBKs. The
output shouldn't simply reuse the input's trigger, because e.g. if the input
trigger said "wait for an hour before emitting a pane", that doesn't mean we
should wait for another hour before emitting the result of aggregating that
pane. Continuation triggers try to simulate the behavior "as if a pane of
the input propagated through the entire pipeline", but the implementations
of the individual continuation triggers don't actually achieve that. E.g.
the continuation of the "first N elements in pane" trigger is "first 1
element in pane", so if the results of a first GBK are further grouped by a
second GBK onto a coarser key (e.g. if everything is grouped onto the same
key), then effectively only one of the first GBK's keys survives and all the
others are dropped (which is what happened in the data loss bug).
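
To make the shape of that failure concrete, here is a minimal sketch of such
a pipeline in the Java SDK (the key names, element count and window size are
hypothetical, and the continuation trigger itself is set by Beam, not
written by the user):

    // First aggregation: the user asks for a pane after 100 elements per
    // window. AfterPane.elementCountAtLeast(100) is terminating, so each
    // window fires at most once.
    // "events" is an assumed input PCollection<KV<String, Integer>>.
    PCollection<KV<String, Long>> perKeyCounts = events
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(10)))
            .triggering(AfterPane.elementCountAtLeast(100))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(Count.perKey());

    // Second aggregation: everything is re-keyed onto a single key. The
    // output of the first GBK carries the continuation trigger
    // elementCountAtLeast(1), so this GBK may fire - and, being terminating,
    // stop accepting further input for the window - after seeing the result
    // of just one upstream key, dropping the results of all the others.
    PCollection<KV<String, Iterable<KV<String, Long>>>> collapsed = perKeyCounts
        .apply(WithKeys.of("all"))
        .apply(GroupByKey.<String, KV<String, Long>>create());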

The ultimate fix to all of these things is
https://s.apache.org/beam-sink-triggers . However, that is a huge model
change, and meanwhile we have to do something. The options, in order of
increasing backward incompatibility (where the incompatibility is of the
"rejecting something that was previously accepted but extremely dangerous"
kind), are:

   - *Make the continuation trigger of most triggers be the "always-fire"
   trigger.* It seems this should be the case for all triggers except the
   watermark trigger. This will definitely increase safety, but will lead to
   more eager firing of downstream aggregations. It will also violate a
   user's expectation that a fire-once trigger fires everything downstream
   only once, but that expectation appears impossible to satisfy safely. (A
   sketch of what "always-fire" means in Java SDK terms follows this list.)
   - *Make the continuation trigger of some triggers be the "invalid"
   trigger,* i.e. require the user to set it explicitly: in general, there
   is no good and safe way to infer what a trigger on a second GBK "truly"
   should be based on the trigger of the PCollection input into the first
   GBK. This is especially true for terminating triggers.
   - *Prohibit top-level terminating triggers entirely.* This will ensure
   that the only data that ever gets dropped is "droppably late" data.
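
For reference, by "always-fire" in the first option I mean something roughly
equivalent to the following Java SDK trigger, which emits a pane whenever
new input arrives for the window and never finishes (so it never drops data
by itself):

    // A trigger that fires on every new element and repeats forever,
    // i.e. it never terminates the window's input.
    Trigger alwaysFire = Repeatedly.forever(AfterPane.elementCountAtLeast(1));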


Do people think that these options are sensible?
+Kenn Knowles <k...@google.com> +Thomas Groh <tg...@google.com> +Ben Chambers
<bchamb...@google.com>, is this a fair summary of our discussion?

Thanks!
