Hi Eduardo,

What you have written outputs counts "as fast as possible" for windows of 5 minutes in length with a slide interval of 1 minute. So a record at 10:13 would be included in the counts for the windows 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17, and 10:13-10:18.
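To make the window membership concrete, here is a minimal pure-Python sketch (not Spark code, just an illustration of the same alignment rule) that computes which sliding windows contain a given event time:

```python
from datetime import datetime, timedelta

def sliding_windows(event_time, window=timedelta(minutes=5),
                    slide=timedelta(minutes=1)):
    """Return the [start, end) windows of the given length and slide
    interval that contain event_time (windows aligned to the slide)."""
    windows = []
    # Align the latest candidate window start to the slide interval.
    epoch = datetime(1970, 1, 1)
    start = epoch + ((event_time - epoch) // slide) * slide
    # Walk backwards while the window still covers event_time.
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)

ts = datetime(2017, 9, 11, 10, 13)
for start, end in sliding_windows(ts):
    print(f"{start:%H:%M}-{end:%H:%M}")
# Prints the five windows 10:09-10:14 through 10:13-10:18.
```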
Please take a look at the following blog post for more details:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html

This talk can also be helpful (especially after the 19th minute):
https://www.youtube.com/watch?v=JAb4FIheP28&t=942s

What you seem to be looking for is the "Update" output mode (you may need Spark 2.2 for this, IIRC), with a window duration of 5 minutes, no slide interval, and a processing-time trigger of 1 minute. Note that this still doesn't guarantee one output row per trigger, since late data may arrive (unless you set the watermark accordingly).

Best,
Burak

On Mon, Sep 11, 2017 at 8:04 AM, Eduardo D'Avila <eduardo.dav...@corp.globo.com> wrote:
> Hi,
>
> I'm trying to use Spark 2.1.1 structured streaming to *count the number
> of records* from Kafka *for each time window* with the code in this GitHub gist
> <https://gist.github.com/erdavila/b6ab0c216e82ae77fa8192c48cb816e4>.
>
> I expected that, *once each minute* (the slide duration), it would
> *output a single record* (since the only aggregation key is the window)
> with the *record count for the last 5 minutes* (the window duration).
> However, it outputs several records 2-3 times per minute, like in the
> sample output included in the gist.
>
> Changing the output mode to "append" seems to change the behavior, but
> it is still far from what I expected.
>
> What is wrong with my assumptions about the way it should work? Given
> the code, how should the sample output be interpreted or used?
>
> Thanks,
>
> Eduardo
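P.S. To illustrate why dropping the slide interval changes the output shape, here is a minimal pure-Python sketch (not Spark code; the event times are hypothetical) of tumbling 5-minute windows, where each record falls into exactly one window, so each count is updated by at most one window per record:

```python
from collections import Counter
from datetime import datetime, timedelta

def tumbling_window(event_time, window=timedelta(minutes=5)):
    """Map event_time to the single [start, end) tumbling window containing it."""
    epoch = datetime(1970, 1, 1)
    start = epoch + ((event_time - epoch) // window) * window
    return (start, start + window)

# Hypothetical event times, for illustration only.
events = [datetime(2017, 9, 11, 10, m) for m in (11, 12, 13, 13, 16)]
counts = Counter(tumbling_window(t) for t in events)
for (start, end), n in sorted(counts.items()):
    print(f"{start:%H:%M}-{end:%H:%M}: {n}")
# 10:10-10:15: 4
# 10:15-10:20: 1
```

With the sliding windows in your gist, each of those records would instead contribute to five overlapping windows, which is why you see many rows per trigger.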