I created this PR: https://github.com/apache/beam/pull/7556

Feel free to review or comment on it.
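
For anyone skimming the thread, the comparison being discussed can be sketched in plain Python. This is only an illustration of the proposed wording ("smaller timestamp than the current watermark"), not Beam's implementation; the function name `is_late` and the integer timestamps are hypothetical.

```python
def is_late(element_timestamp, watermark):
    """Return True when the element's event timestamp is strictly
    smaller than the current watermark (hypothetical helper)."""
    return element_timestamp < watermark


# Assumed example values: timestamps are plain integers here.
current_watermark = 100
print(is_late(90, current_watermark))   # smaller than the watermark: late
print(is_late(110, current_watermark))  # larger than the watermark: not late
```

Under this reading, "late" means the timestamp is behind the watermark, which is the opposite of how some readers interpreted "after".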

-Rui

On Thu, Jan 17, 2019 at 2:37 PM Rui Wang <[email protected]> wrote:

> It might be better to keep something like "the watermark usually moves
> forward consistently". But "Elements that arrive with a smaller timestamp
> than the current watermark are considered late data." already gives the
> ordering between late-data timestamps and the watermark.
>
>
> -Rui
>
> On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <[email protected]> wrote:
>
>> Reuven - I don't think I realized it was possible to have late data with
>> the global window, so I'm definitely learning things through this
>> discussion.
>>
>> New suggested wording, then:
>>
>>     Elements that arrive with a smaller timestamp than the current
>> watermark are considered late data.
>>
>> That says basically the same thing as the wording currently in the guide,
>> but uses "smaller" (which implies a less-than-watermark comparison) rather
>> than "later" (which folks have interpreted as a greater-than-watermark
>> comparison).
>>
>> On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <[email protected]> wrote:
>>
>>> Though it's not tied to a window. You could be in the global window, so
>>> the watermark never advances past the end of the window, yet still get late
>>> data.
>>>
>>> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <[email protected]> wrote:
>>>
>>>> How about: "Once the watermark progresses past the end of a window, any
>>>> further elements that arrive with a timestamp in that window are considered
>>>> late data."
>>>>
>>>> On Thu, Jan 17, 2019 at 1:43 PM Rui Wang <[email protected]> wrote:
>>>>
>>>>> Hi Community,
>>>>>
>>>>> In the Beam programming guide [1], there is a sentence: "Data that
>>>>> arrives with a timestamp after the watermark is considered *late data*"
>>>>>
>>>>> It seems people get confused by it. For example, see Stack Overflow
>>>>> comment [2]. Basically, it makes people think that an event timestamp
>>>>> that is bigger than the watermark is considered late (because of that
>>>>> "after").
>>>>>
>>>>> Although there is an example right after this sentence to explain late
>>>>> data, the sentence seems incomplete to me. A complete version could
>>>>> be: "The watermark consistently advances from -inf to +inf. Data
>>>>> that arrives with a timestamp after the watermark is considered late
>>>>> data."
>>>>>
>>>>> Am I understanding this correctly? Is there a better description of the
>>>>> order of late data and the watermark? I would be happy to send a PR to
>>>>> update the Beam documentation.
>>>>>
>>>>> -Rui
>>>>>
>>>>> [1]:
>>>>> https://beam.apache.org/documentation/programming-guide/#windowing
>>>>> [2]:
>>>>> https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971
>>>>>
>>>>>
>>>>>
