Hi Eduardo,

What you have written will output counts "as fast as possible" for 5-minute
windows with a 1-minute slide. So a record at 10:13 would be included in the
counts for the windows 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17,
and 10:13-10:18.
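If it helps to see that window assignment concretely, here is a small
self-contained batch sketch (the column name, timestamp, and local master are
just for illustration, not taken from your gist):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{window, count, col}

  val spark = SparkSession.builder()
    .appName("SlidingWindowDemo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // One record at 10:13; a 5-minute window sliding every 1 minute puts it
  // into five overlapping windows.
  val events = Seq("2017-09-11 10:13:00").toDF("ts")
    .select(col("ts").cast("timestamp").as("timestamp"))

  events
    .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
    .agg(count("*").as("record_count"))
    .orderBy("window")
    .show(truncate = false)  // five rows, 10:09-10:14 through 10:13-10:18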

Please take a look at the following blog post for more details:
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
Also, this talk can be helpful:
https://www.youtube.com/watch?v=JAb4FIheP28&t=942s (especially after the 19th
minute)

What you seem to be looking for is the "Update" output mode (you may need Spark
2.2 for this, IIRC), with a window duration of 5 minutes, no slide interval,
and a processing-time trigger of 1 minute. Note that this still doesn't
guarantee one output row per trigger, since late data may arrive (unless you
set the watermark accordingly).
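A rough sketch of that setup (the Kafka options, watermark delay, and console
sink are placeholders; adjust them to your pipeline):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{window, count, col}
  import org.apache.spark.sql.streaming.{OutputMode, Trigger}

  val spark = SparkSession.builder().appName("WindowedCounts").getOrCreate()

  // Kafka source; bootstrap servers and topic are placeholders.
  val records = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(col("timestamp"))

  // 5-minute tumbling window (no slide interval). The watermark bounds how
  // late data may arrive, so old windows eventually stop updating.
  val counts = records
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .agg(count("*").as("record_count"))

  // "Update" mode emits only the windows whose counts changed since the last
  // trigger; the processing-time trigger fires roughly once a minute.
  val query = counts.writeStream
    .outputMode(OutputMode.Update())
    .trigger(Trigger.ProcessingTime("1 minute"))
    .format("console")
    .start()

  query.awaitTermination()

This uses Kafka's message timestamp as event time; if your records carry their
own event-time field, parse it out of the value column instead.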


Best,
Burak


On Mon, Sep 11, 2017 at 8:04 AM, Eduardo D'Avila <
eduardo.dav...@corp.globo.com> wrote:

> Hi,
>
> I'm trying to use Spark 2.1.1 structured streaming to *count the number
> of records* from Kafka *for each time window* with the code in this
> GitHub gist
> <https://gist.github.com/erdavila/b6ab0c216e82ae77fa8192c48cb816e4>.
>
> I expected that, *once each minute* (the slide duration), it would *output a
> single record* (since the only aggregation key is the window) with the
> *record count for the last 5 minutes* (the window duration). However, it
> outputs several records 2-3 times per minute, like in the sample output
> included in the gist.
>
> Changing the output mode to "append" seems to change the behavior, but
> still far from what I expected.
>
> What is wrong with my assumptions on the way it should work? Given the
> code, how should the sample output be interpreted or used?
>
> Thanks,
>
> Eduardo
>
