I created this PR: https://github.com/apache/beam/pull/7556
Feel free to review or comment on it.

-Rui

On Thu, Jan 17, 2019 at 2:37 PM Rui Wang <[email protected]> wrote:

> It might be better to keep something like "watermark usually consistently
> moves forward". But "Elements that arrive with a smaller timestamp than the
> current watermark are considered late data." already gives the order of
> late-data timestamps relative to the watermark.
>
> -Rui
>
> On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <[email protected]> wrote:
>
>> Reuven - I don't think I realized it was possible to have late data with
>> the global window, so I'm definitely learning things through this
>> discussion.
>>
>> New suggested wording, then:
>>
>>     Elements that arrive with a smaller timestamp than the current
>>     watermark are considered late data.
>>
>> That says basically the same thing as the wording currently in the guide,
>> but uses "smaller" (which implies a less-than-watermark comparison) rather
>> than "later" (which folks have interpreted as a greater-than-watermark
>> comparison).
>>
>> On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <[email protected]> wrote:
>>
>>> Though it's not tied to a window. You could be in the global window, so
>>> the watermark never advances past the end of the window, yet still get
>>> late data.
>>>
>>> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <[email protected]> wrote:
>>>
>>>> How about: "Once the watermark progresses past the end of a window, any
>>>> further elements that arrive with a timestamp in that window are
>>>> considered late data."
>>>>
>>>> On Thu, Jan 17, 2019 at 1:43 PM Rui Wang <[email protected]> wrote:
>>>>
>>>>> Hi Community,
>>>>>
>>>>> In the Beam programming guide [1], there is a sentence: "Data that
>>>>> arrives with a timestamp after the watermark is considered *late data*"
>>>>>
>>>>> It seems that people get confused by it. For example, see the
>>>>> Stack Overflow comment [2]. Basically, it makes people think that an
>>>>> event timestamp that is bigger than the watermark is considered late
>>>>> (because of that "after").
>>>>>
>>>>> Although there is an example right after this sentence to explain late
>>>>> data, it seems to me that this sentence is incomplete. The complete
>>>>> sentence, to me, could be: "The watermark consistently advances from
>>>>> -inf to +inf. Data that arrives with a timestamp after the watermark is
>>>>> considered late data."
>>>>>
>>>>> Am I understanding this correctly? Is there a better description of the
>>>>> order of late data and the watermark? I would be happy to send a PR to
>>>>> update the Beam documentation.
>>>>>
>>>>> -Rui
>>>>>
>>>>> [1]: https://beam.apache.org/documentation/programming-guide/#windowing
>>>>> [2]:
>>>>> https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971
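
For anyone skimming the thread, the comparison being discussed can be sketched in a few lines of plain Python. This is only an illustration of the "smaller timestamp than the current watermark" wording, not Beam's actual implementation and not a Beam API: the `Element` class, the `is_late` helper, and the sample arrival data are all hypothetical, and treating a timestamp equal to the watermark as on time is an assumption taken from the word "smaller".

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a pipeline element (not a Beam type)."""
    value: str
    timestamp: int  # event time, e.g. seconds since some epoch


def is_late(element: Element, watermark: int) -> bool:
    # Late means the event timestamp is *smaller* than the current watermark.
    # Assumption: a timestamp equal to the watermark is treated as on time.
    return element.timestamp < watermark


# Each pair is (element, watermark at the moment the element arrives).
# The watermark values are made up and only ever move forward.
arrivals = [
    (Element("a", 10), 5),
    (Element("b", 12), 11),
    (Element("c", 9), 11),   # timestamp 9 is behind watermark 11
    (Element("d", 20), 15),
]

late_values = [e.value for e, wm in arrivals if is_late(e, wm)]
print(late_values)
```

Note that no window appears anywhere in the sketch, which matches Reuven's point that lateness is a comparison against the watermark and is not tied to a window.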
