Have you tried printing out the timestamps of the rows in each batch, along with the watermark, while adding an artificial delay to batch processing?
First of all, you're technically using "processing time" in your query, so theoretically you should never have "late events". A watermark exists to handle out-of-order events, and you won't need one here. If Spark requires a watermark for technical reasons, you can just set it to 0 and no events should be lost.

> So, let's say during shuffle stage (groupby) or write stage, we have a
> delay of 5 to 10 minutes, will we lose data due to watermark of 2 minutes
> here?

If your batch is delayed, the timestamps in the data are delayed along with it, since that's the nature of "processing time". No data will be lost, but because you're relying on processing time, the result can be affected by various circumstances: with a 3-minute window and 5 to 10 minutes of batch delay, the grouping would effectively only apply within a single batch. Applying a watermark here doesn't help the situation; it just slows down the output unnecessarily.

That's the power of "event time" processing: you get consistent results even under delays, out-of-order events, etc., whereas the trade-off you describe (delayed output vs. discarded late events) actually applies to "event time" processing. (A rough sketch of the event-time variant is at the end of this mail.)

Hope this helps.

Jungtaek Lim (HeartSaVioR)

On Fri, Jan 24, 2020 at 7:19 AM stevech.hu <stevech...@outlook.com> wrote:
> Anyone know the answers or pointers? thanks.
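P.S. For anyone following along, here is a minimal sketch of what the event-time variant could look like. This is an assumption-laden illustration, not the original poster's job: the Kafka source, topic, and column names (events, eventTime, key) are all hypothetical.

// Hypothetical sketch of event-time windowing with a watermark.
// Source/topic/column names are illustrative only.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, count}

object EventTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event-time-window-sketch")
      .getOrCreate()
    import spark.implicits._

    // Kafka's source schema exposes a `timestamp` column; a real job
    // would more likely parse an event time out of the payload itself.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(value AS STRING) AS key", "timestamp AS eventTime")

    // Window on the data's own timestamps. The 2-minute watermark only
    // bounds how long Spark keeps state around for late/out-of-order
    // rows; a delayed batch still lands in the correct 3-minute window
    // because window membership is determined by eventTime, not by
    // arrival time.
    val counts = events
      .withWatermark("eventTime", "2 minutes")
      .groupBy(window($"eventTime", "3 minutes"), $"key")
      .agg(count("*").as("cnt"))

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}

With this shape, a 5 to 10 minute processing delay only postpones when results are emitted; it doesn't change which window each row belongs to.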