[ https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158406#comment-16158406 ]
Marco Gaido commented on SPARK-21944:
-------------------------------------

[~kevinzhang] you should define the watermark on the column `"time"`, not the column `"window"`

> Watermark on window column is wrong
> -----------------------------------
>
>                 Key: SPARK-21944
>                 URL: https://issues.apache.org/jira/browse/SPARK-21944
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>     .withWatermark("window", "10 seconds")
>     .dropDuplicates("id", "window")
>     .groupBy("window")
>     .count
> {code}
> where events is a dataframe with a timestamp column "time" and long column
> "id".
> I registered a listener to print the event time stats in each batch, and the
> results is like the following
> {code:shell}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z,
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z,
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z,
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong which are always in
> 1970-01-01, so the watermark is calculated wrong.
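
For illustration, here is a minimal sketch of what the comment suggests: the watermark is declared on the timestamp column "time" rather than on the derived "window" column. Several parts are assumptions and not taken from the report: the built-in "rate" source stands in for the reporter's events dataframe, the object name and local master are placeholders, and the "complete" output mode and console sink are guesses based on the printed batches. The sketch also moves withWatermark ahead of the select so that the window column is derived from an already-watermarked "time" column; the comment itself does not say where the call should go.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

// Hypothetical object name, used only for this sketch.
object WatermarkOnEventTime {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WatermarkOnEventTime")
      .master("local[2]") // assumption: local run for experimentation
      .getOrCreate()
    import spark.implicits._

    // Assumption: the "rate" source stands in for the original events.
    // It emits a timestamp column "timestamp" and a long column "value",
    // renamed here to match the "time" and "id" columns from the report.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()
      .select($"timestamp".as("time"), $"value".as("id"))

    val counts = events
      // Watermark on the event-time column "time", not on "window".
      .withWatermark("time", "10 seconds")
      .select(window($"time", "5 seconds"), $"time", $"id")
      .dropDuplicates("id", "window")
      .groupBy("window")
      .count()

    // Assumption: complete mode and console sink, chosen because the
    // batches in the report show every window on each trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}
```

With the watermark on "time", the event time stats reported by a listener should be derived from actual timestamps instead of values in 1970-01-01, although this sketch has not been checked against Spark 2.2.0 specifically.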