Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169146175

    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1051,6 +1053,16 @@ from the aggregation column. For example,
     `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append
     output mode.

    +##### Semantic Guarantees of Aggregation with Watermarking
    +{:.no_toc}
    +
    +- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
    +drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
    +(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
    +
    +- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
    +not guaranteed to be dropped; it may or may not get aggregated. The more delayed the data is, the
    +less likely the engine is to process it.
    --- End diff --

    > However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is not guaranteed to be dropped

    This might contradict an earlier statement, from "Handling Late Data and Watermarking", that says "In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped"
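    For reference, a minimal Scala sketch (not part of this PR) of the pattern the guide's text describes: `withWatermark` is applied to the event-time column *before* the windowed aggregation, which is what makes the query valid in Append output mode. The socket source, host/port, and input format here are illustrative assumptions.

    ```scala
    import java.sql.Timestamp

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical streaming source emitting lines like "2018-02-20 10:00:00,word"
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val events = lines.as[String].map { line =>
      val Array(ts, word) = line.split(",")
      (Timestamp.valueOf(ts), word)
    }.toDF("time", "word")

    // Valid: watermark set before the aggregation, on the same event-time column.
    // Data less than 2 hours late is guaranteed to be aggregated; data more than
    // 2 hours late may or may not be.
    val counts = events
      .withWatermark("time", "2 hours")
      .groupBy(window($"time", "10 minutes"), $"word")
      .count()

    // Invalid in Append mode (per the guide): watermark applied after the aggregation,
    // e.g. events.groupBy("time").count().withWatermark("time", "1 min")

    counts.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
    ```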