GitHub user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169146175
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1051,6 +1053,16 @@ from the aggregation column.
     For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append 
     output mode.
     
    +##### Semantic Guarantees of Aggregation with Watermarking
    +{:.no_toc}
    +
    +- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
    +drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
    +(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
    +
    +- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
    +not guaranteed to be dropped; it may or may not get aggregated. More delayed is the data, less
    +likely is the engine going to process it.
    --- End diff --
    
    > However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is not guaranteed to be dropped
    
    This might contradict an earlier statement, from "Handling Late Data and Watermarking", that says
    
    "In other words, late data within the threshold will be aggregated, but 
data later than the threshold will be dropped"
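
    For background, here is a minimal, hypothetical sketch of the watermark-plus-aggregation pattern the guide describes, in Scala. It is not part of the PR; the `events` stream, the rate source, and the window/column choices are placeholder assumptions used only to make the example self-contained.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    // Placeholder streaming source: "rate" emits (timestamp, value) rows.
    // Rename "timestamp" to "time" so it matches the column used in the guide.
    val events = spark.readStream
      .format("rate")
      .load()
      .withColumnRenamed("timestamp", "time")

    // The watermark is declared on the event-time column *before* the aggregation,
    // which is the ordering the guide marks as valid for Append mode.
    val counts = events
      .withWatermark("time", "2 hours")
      .groupBy(window($"time", "10 minutes"), $"value")
      .count()

    // In Append mode a window is emitted only after the watermark passes its end;
    // data less than 2 hours late is aggregated, later data may or may not be.
    counts.writeStream
      .outputMode("append")
      .format("console")
      .start()
    ```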

