That is correct. For the global window there is no such thing as late data.

Kenn

On Fri, Jan 18, 2019, 11:13 Ruoyun Huang <ruo...@google.com> wrote:

> Very helpful discussion (and the fixing PR).
>
> To make sure my takeaway is correct: the status quo is a) "for a Global
> Window, there is *no possible scenario* where data is identified as
> late", rather than b) "for a global window we *no longer* compare the
> watermark to identify late data, but *there are still other criteria*
> that determine whether data is late".
>
> a) is correct and b) is not. Is that so?
>
> On Thu, Jan 17, 2019 at 8:57 PM Kenneth Knowles <k...@google.com> wrote:
>
>> Actually, Reuven, that's no longer the case.
>>
>> It used to be that incoming data was compared to the watermark, but it
>> is not today. Instead, Jeff's first phrasing is perfect.
>>
>> One way to see it is to think about the consequences of late data: if
>> there is a grouping/aggregation by key+window, the window determines
>> when the grouping is complete. We go ahead and include any data that
>> shows up before the window is complete. And if you set up allowed
>> lateness it matches exactly: any data that arrives before the ON_TIME
>> output gets to be in that output.
>>
>> Previously, when we compared incoming elements to the watermark
>> directly, you could have a window that was still being aggregated while
>> the elements that fell in the window were dropped. There was no
>> technical benefit to losing this data, so we stopped dropping it. We
>> also had lots of tricky bugs and hard-to-manage code related to what we
>> do if an element arrives after the watermark. And you could have an
>> ON_TIME firing that included a bunch of "late" data, which is confusing.
>>
>> Now it is simple: if the window is still alive, the element goes into it.
>>
>> I very rarely use the term "late data" when describing Beam's semantics
>> anyhow. I always found the term / definition a bit arbitrary.
>>
>> Kenn
>>
>> On Thu, Jan 17, 2019 at 8:13 PM Rui Wang <ruw...@google.com> wrote:
>>
>>> I created this PR: https://github.com/apache/beam/pull/7556
>>>
>>> Feel free to review or comment on it.
>>>
>>> -Rui
>>>
>>> On Thu, Jan 17, 2019 at 2:37 PM Rui Wang <ruw...@google.com> wrote:
>>>
>>>> It might be better to keep something like "watermark usually
>>>> consistently moves forward". But "Elements that arrive with a smaller
>>>> timestamp than the current watermark are considered late data." has
>>>> already given the order of the late data timestamp and the watermark.
>>>>
>>>> -Rui
>>>>
>>>> On Thu, Jan 17, 2019 at 1:39 PM Jeff Klukas <jklu...@mozilla.com>
>>>> wrote:
>>>>
>>>>> Reuven - I don't think I realized it was possible to have late data
>>>>> with the global window, so I'm definitely learning things through
>>>>> this discussion.
>>>>>
>>>>> New suggested wording, then:
>>>>>
>>>>> Elements that arrive with a smaller timestamp than the current
>>>>> watermark are considered late data.
>>>>>
>>>>> That says basically the same thing as the wording currently in the
>>>>> guide, but uses "smaller" (which implies a less-than-watermark
>>>>> comparison) rather than "later" (which folks have interpreted as a
>>>>> greater-than-watermark comparison).
>>>>>
>>>>> On Thu, Jan 17, 2019 at 3:40 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> Though it's not tied to the window. You could be in the global
>>>>>> window, so the watermark never advances past the end of the window,
>>>>>> yet still get late data.
>>>>>>
>>>>>> On Thu, Jan 17, 2019, 11:14 AM Jeff Klukas <jklu...@mozilla.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How about: "Once the watermark progresses past the end of a window,
>>>>>>> any further elements that arrive with a timestamp in that window
>>>>>>> are considered late data."
>>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 1:43 PM Rui Wang <ruw...@google.com> wrote:
>>>>>>>
>>>>>>>> Hi Community,
>>>>>>>>
>>>>>>>> In the Beam programming guide [1], there is a sentence: "Data that
>>>>>>>> arrives with a timestamp after the watermark is considered *late
>>>>>>>> data*"
>>>>>>>>
>>>>>>>> It seems people get confused by it. For example, see the Stack
>>>>>>>> Overflow comment [2]. Basically, it makes people think that an
>>>>>>>> event timestamp that is bigger than the watermark is considered
>>>>>>>> late (due to that "after").
>>>>>>>>
>>>>>>>> Although there is an example right after this sentence to explain
>>>>>>>> late data, it seems to me that this sentence is incomplete. The
>>>>>>>> complete sentence, to me, could be: "The watermark consistently
>>>>>>>> advances from -inf to +inf. Data that arrives with a timestamp
>>>>>>>> after the watermark is considered late data."
>>>>>>>>
>>>>>>>> Am I understanding this correctly? Is there a better description
>>>>>>>> of the order of late data and the watermark? I would be happy to
>>>>>>>> send a PR to update the Beam documentation.
>>>>>>>>
>>>>>>>> -Rui
>>>>>>>>
>>>>>>>> [1]:
>>>>>>>> https://beam.apache.org/documentation/programming-guide/#windowing
>>>>>>>> [2]:
>>>>>>>> https://stackoverflow.com/questions/54141352/dataflow-to-process-late-and-out-of-order-data-for-batch-and-stream-messages/54188971?noredirect=1#comment95302476_54188971
>>>>>>>>
>
> --
> ================
> Ruoyun Huang
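
The rule Kenn describes in the thread ("if the window is still alive, the element goes into it") can be sketched in a few lines of plain Python. This is only a simplified model, not Beam's API: `Window`, `is_expired`, `accept`, and `GLOBAL_WINDOW_END` are hypothetical names, and the expiry condition (the watermark passing the window end plus allowed lateness) is an assumption drawn from the discussion rather than from the Beam source.

```python
from dataclasses import dataclass

# Assumption: the global window is modeled as a window that never ends.
GLOBAL_WINDOW_END = float("inf")


@dataclass(frozen=True)
class Window:
    end: float                     # event-time end of the window
    allowed_lateness: float = 0.0  # extra time the window stays alive


def is_expired(window: Window, watermark: float) -> bool:
    """A window is expired once the watermark passes end + allowed lateness."""
    return watermark > window.end + window.allowed_lateness


def accept(window: Window, watermark: float) -> bool:
    """Keep an element if its window is still alive.

    The element's own timestamp is deliberately not compared to the
    watermark; only the liveness of its window matters in this model.
    """
    return not is_expired(window, watermark)


fixed = Window(end=60.0, allowed_lateness=30.0)
print(accept(fixed, watermark=80.0))   # window still alive
print(accept(fixed, watermark=91.0))   # window expired
print(accept(Window(end=GLOBAL_WINDOW_END), watermark=1e18))
```

Under this model the global window can never expire, which is one way to read the conclusion at the top of the thread that the global window has no late data.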