Re: Default output timestamp of processing-time timers

Kenneth Knowles Tue, 18 Jan 2022 09:13:06 -0800

This is an interesting case, and a legitimate counterexample to consider.
I'd call it a workaround :-). The semantic thing they would want/need is
"output timestamp" associated with buffered data (also implemented with
watermark hold). I do know systems that designed their state with this
built in.
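
To make the "output timestamp as watermark hold" reading concrete, here is a toy model, not Beam's implementation or API. It assumes the output watermark is the minimum of the input watermark and the earliest output timestamp among pending timers. The names `HoldSketch`, `PendingTimer`, and `outputWatermark` are hypothetical, and a null output timestamp stands in for a timer that declares it will produce no output (the ".withoutOutput" idea floated further down the thread).

```java
import java.time.Instant;
import java.util.Collection;
import java.util.List;

// Toy model only: hypothetical names, not Beam's implementation or API.
final class HoldSketch {

    /** A pending timer. A null outputTimestamp models a timer that produces no output. */
    static final class PendingTimer {
        final Instant fireAt;
        final Instant outputTimestamp;

        PendingTimer(Instant fireAt, Instant outputTimestamp) {
            this.fireAt = fireAt;
            this.outputTimestamp = outputTimestamp;
        }
    }

    /** Assumed rule: min(input watermark, earliest output timestamp of any pending timer). */
    static Instant outputWatermark(Instant inputWatermark, Collection<PendingTimer> timers) {
        Instant result = inputWatermark;
        for (PendingTimer t : timers) {
            if (t.outputTimestamp != null && t.outputTimestamp.isBefore(result)) {
                result = t.outputTimestamp;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant inputWatermark = Instant.parse("2022-01-18T16:00:00Z");
        // A timer set purely for its hold: it fires at 17:00 but pins output at 13:00.
        PendingTimer holdOnly = new PendingTimer(
            Instant.parse("2022-01-18T17:00:00Z"), Instant.parse("2022-01-18T13:00:00Z"));
        System.out.println(outputWatermark(inputWatermark, List.of(holdOnly)));
    }
}
```

In this model a timer set only for its hold keeps the output watermark at its output timestamp until it is cleared, while a timer with no output timestamp leaves the watermark alone.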


Kenn

On Tue, Jan 18, 2022 at 8:57 AM Reuven Lax <re...@google.com> wrote:

> One note - some people definitely use timer.withOutputTimestamp as a
> watermark hold.
>
> This is a scenario in which one outputs (from processElement) a timestamp
> behind the current input element timestamp but knows that it is safe
> because there is already an extant timer with an earlier output timestamp
> (state can be used for this). In this case I've seen timers set simply for
> the hold - the actual onTimer never outputs anything.
>
> Reuven
>
> On Tue, Jan 18, 2022 at 6:42 AM Kenneth Knowles <k...@apache.org> wrote:
>
>>
>>
>> On Tue, Dec 14, 2021 at 2:38 PM Steve Niemitz <sniem...@apache.org>
>> wrote:
>>
>>> > I think this wouldn't be very robust to different situations where
>>> > processing time and event time may not be that close to each other.
>>>
>>> if you do something like `min(endOfWindow, max(eventInputTimestamp,
>>> computedFiringTimestamp))` the worst case is that you set a watermark hold
>>> for somewhere in the future, right?  For example, if the watermark is
>>> lagging 3 hours, processing time = 4pm, event input = 1pm, window end =
>>> 5pm, the watermark hold/output time is set to 4pm + T.  This would make the
>>> timestamps "newer" than the input, but shouldn't ever create late data,
>>> correct?
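
The clamp in the message above can be checked with the numbers it gives. This is a minimal sketch using `java.time` so it runs without dependencies (Beam's Java SDK uses Joda-Time instants instead). The class and method names are hypothetical, and the date and T = 30 minutes are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Beam.
final class ClampSketch {

    /** Computes min(endOfWindow, max(eventInputTimestamp, computedFiringTimestamp)). */
    static Instant clampedOutputTimestamp(
            Instant endOfWindow, Instant eventInputTimestamp, Instant computedFiringTimestamp) {
        Instant later = eventInputTimestamp.isAfter(computedFiringTimestamp)
            ? eventInputTimestamp
            : computedFiringTimestamp;
        return later.isBefore(endOfWindow) ? later : endOfWindow;
    }

    public static void main(String[] args) {
        Instant processingTime = Instant.parse("2021-12-14T16:00:00Z"); // 4pm
        Instant eventInput = Instant.parse("2021-12-14T13:00:00Z");     // 1pm
        Instant windowEnd = Instant.parse("2021-12-14T17:00:00Z");      // 5pm
        Duration t = Duration.ofMinutes(30);                            // assumed value of T

        // The timer fires at 4pm + T, which is after the input and before the window end.
        Instant firing = processingTime.plus(t);
        System.out.println(clampedOutputTimestamp(windowEnd, eventInput, firing));
    }
}
```

With these inputs the result is 4pm + T, as described. If T pushed the firing time past 5pm, the clamp would return the window end instead.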
>>>
>>> Also, imo, the timestamps really already cross domains now, because the
>>> watermark (event time) is held until the (processing time) timer fires.
>>>
>>> The concrete issue that brought this up was a pipeline with some state,
>>> and the state was "cleaned up" periodically with a processing time timer
>>> that fired every ~hour.  The author of the pipeline was confused why the
>>> watermark wasn't moving (and thus GBKs firing, etc).  The root cause was
>>> the watermark being held by the timer.
>>>
>>> > It would just save you .withOutputTimestamp(elementTimestamp) on your
>>> > calls to setting the event time timer, right?
>>>
>>> Correct, the main thing I'm trying to solve is having to recalculate an
>>> output timestamp using the same logic that the timer itself is using to set
>>> its firing timestamp.
>>>
>>
>> It sounds like the main use case that you are dealing with is the case
>> where the timer doesn't actually produce output (or set further timers that
>> produce output) so it doesn't need (or want) a watermark hold. That makes
>> sense.
>>
>> In fact, I do not view a "watermark hold" as a fundamental concept. The
>> act of "set a timer with the intent that I am allowed to produce output
>> with timestamp X" is the fundamental concept, and watermark hold is an
>> implementation detail that should really never have been surfaced as an
>> end-user concept, or really even as an SDK author concept. This is why in
>> my proposal for adding output timestamps to timers, I called it
>> "withOutputTimestamp", and this is why the design does not include any
>> watermark holds - there is a self-loop on a transform where timers produce
>> an input watermark distinct from the watermark on input elements, and that
>> is enough. There is not now, and never has been, a need for the concept of
>> a hold at the level of the Beam model.
>>
>> I wonder if we can automate this behavior by noticing that there are no
>> OutputReceiver parameters on the timer callback (and, transitively, on the
>> callbacks of any timers it sets). Or just work around it by saying
>> ".withoutOutput" on the timer.
>>
>> Kenn
>>
>>
>>>
>>>
>>>
>>> On Tue, Dec 14, 2021 at 4:10 PM Kenneth Knowles <k...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Dec 7, 2021 at 7:27 AM Steve Niemitz <sniem...@apache.org>
>>>> wrote:
>>>>
>>>>> If I have a processing-time timer, is there any way to automatically
>>>>> set the output timestamp to the timer firing timestamp (similar to how
>>>>> event-time timers work)?
>>>>>
>>>>> A common use case would be to do something like:
>>>>> timer.offset(X).align(Y).setRelative()
>>>>>
>>>>> but have the output timestamp be the firing timestamp.  In order to do
>>>>> this now you need to re-calculate the output timestamp (using the same
>>>>> logic as the timer does internally) and manually use withOutputTimestamp.
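
The manual recalculation described above might look roughly like the sketch below. It assumes `align(Y)` rounds `now + X` up to the next multiple of Y measured from the epoch, and leaves a value that is already on a boundary unchanged. The exact rounding Beam applies should be checked against the SDK source. `AlignSketch` and `firingTimestamp` are hypothetical names, and `java.time` is used only to keep the example dependency-free (Beam itself uses Joda-Time).

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of Beam. The rounding rule is an assumption.
final class AlignSketch {

    /** Returns now + offset, rounded up to the next multiple of period since the epoch. */
    static Instant firingTimestamp(Instant now, Duration offset, Duration period) {
        Instant target = now.plus(offset);
        if (period.isZero()) {
            return target; // no alignment requested
        }
        long remainder = Math.floorMod(target.toEpochMilli(), period.toMillis());
        return remainder == 0 ? target : target.plusMillis(period.toMillis() - remainder);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-12-07T15:27:00Z");
        // offset(10 minutes).align(1 hour): 15:37 rounds up to the next hour boundary.
        System.out.println(firingTimestamp(now, Duration.ofMinutes(10), Duration.ofHours(1)));
    }
}
```

The value computed this way is what would then be passed to `withOutputTimestamp` by hand, which is the duplication the question is trying to avoid.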
>>>>
>>>>
>>>> I think this wouldn't be very robust to different situations where
>>>> processing time and event time may not be that close to each other. In
>>>> general I'm skeptical of reusing timestamps across time domains, for just
>>>> this sort of reason. I wouldn't recommend doing this manually either.
>>>>
>>>>
>>>>> I'm not sure what the API would look like here, but it would also be
>>>>> nice to allow event-time timers to do the same in reverse (use the element
>>>>> input timestamp rather than the firing timestamp).  Maybe something like
>>>>> `withDefaultOutputTimestampFrom(...)` and an enum of FIRING_TIMESTAMP,
>>>>> ELEMENT_TIMESTAMP?
>>>>>
>>>>
>>>> It would just save you .withOutputTimestamp(elementTimestamp) on your
>>>> calls to setting the event time timer, right? It doesn't work in general
>>>> because a timer can be set from other OnTimer methods, where there is no
>>>> "element" per se, but just the output timestamp of the fired timer.
>>>>
>>>> Kenn
>>>>
>>>
