Thanks for your suggestions!
It makes sense to complete the work on this feature by exposing it in
the Python API. We can do this as a next step. (There might be questions
on how to do that exactly)
For now, I'm concerned with getting the semantics right and unblocking
users from stalling pipelines.
I wasn't aware that processing timers used the input timestamp as the
timer output timestamp. I've updated the PR accordingly. Please take a
look: https://github.com/apache/beam/pull/12531
-Max
On 12.08.20 05:03, Luke Cwik wrote:
+1 on what Boyuan said. It is important that the defaults for processing
time domain differ from the defaults for the event time domain.
On Tue, Aug 11, 2020 at 12:36 PM Yichi Zhang <zyi...@google.com
<mailto:zyi...@google.com>> wrote:
+1 to expose set_output_timestamp and enrich python set timer api.
On Tue, Aug 11, 2020 at 12:01 PM Boyuan Zhang <boyu...@google.com
<mailto:boyu...@google.com>> wrote:
Hi Maximilian,
It makes sense to set hold_timestamp as fire_timestamp when the
fire_timestamp is in the event time domain. Otherwise, the
system may advance the watermark incorrectly.
I think we can do something similar to Java FnApiRunner[1]:
* Expose set_output_timestamp API to python timer as well
* If set_output_timestamp is not specified and timer is in
event domain, we can use fire_timestamp as hold_timestamp
* Otherwise, use input_timestamp as hold_timestamp.
What do you think?
[1]
https://github.com/apache/beam/blob/edb42952f6b0aa99477f5c7baca6d6a0d93deb4f/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L1433-L1493
On Tue, Aug 11, 2020 at 9:00 AM Maximilian Michels
<m...@apache.org <mailto:m...@apache.org>> wrote:
We ran into problems setting event time timers per-element
in the Python
SDK. Pipeline progress would stall.
Turns out, although the Python SDK does not expose the timer
output
timestamp feature to the user, it sets the timer output
timestamp to the
current input timestamp of an element.
This will lead to holding back the watermark until the timer
fires (the
Flink Runner respects the timer output timestamp when
advancing the
output watermark). We had set the fire timestamp to a
timestamp so far
in the future, that pipeline progress would completely stall
for
downstream transforms, due to the held back watermark.
Considering that this feature is not even exposed to the
user in the
Python SDK, I think we should set the default output
timestamp to the
fire timestamp, and not to the input timestamp. This is also
how timer
work in the Java SDK.
Let me know what you think.
-Max
PR: https://github.com/apache/beam/pull/12531