> Good morning,
>
> I'm an ICT student at TELECOM BRETAGNE (a French school).
> I followed your presentations on YouTube and found them really
> interesting.
> I'm trying to do some things with Kafka, and I have now been blocked
> for about 3 days.
> I'm trying to control the time at which my processing application sends
> data to the output topic.
> What I'm trying to do is to make the application process data from the
> input topic all the time but send the messages only at the end of a
> minute/an hour/a month ... (the notion of windowing).
> For the moment, what I managed to do is that the application, instead of
> sending data only at the end of the minute, sends it anytime it receives
> it from the input topic.
> Have you any suggestions to help me?
> I would be really grateful.


Preliminary answer for now:

> For the moment, what I managed to do is that the application, instead of
> sending data only at the end of the minute, sends it anytime it receives
> it from the input topic.

This is actually the expected behavior at the moment.

The main reason for this behavior is that, in stream processing, we never
know whether there is still late-arriving data to be received.  For
example, imagine you have 1-minute windows based on event-time.  Here, it
may happen that, after the first 1 minute window has passed, another record
arrives five minutes later but, according to the record's event-time, it
should have still been part of the first 1-minute window.  In this case,
what we typically want to happen is that the first 1-minute window will be
updated/reprocessed with the late-arriving record included.  In other
words, just because 1 minute has passed (= the 1-minute window is "done")
it does not mean that actually all the data for that time interval has been
processed already -- so sending only a single update after 1 minute has
passed would even produce incorrect results in many cases.  For this reason
you currently see a downstream update anytime there is a new incoming data
record ("sends it anytime it receives it from the input topic").  So the
point here is to ensure correctness of processing.

That said, one known drawback of the current behavior is that users haven't
been able to control (read: decrease/reduce) the rate/volume of the
resulting downstream updates.  For example, if you have an input topic with
a rate of 1 million msg/s (which is easy for Kafka), some users want to
aggregate/window results primarily to reduce the input rate to a lower
number (e.g. 1 thousand msg/s) so that the data can be fed from Kafka to
other systems that might not scale as well as Kafka.  To help these use
cases we will have a new configuration parameter in the next major version
of Kafka that allows you to control the rate/volume of downstream updates.
Here, the point is to help users optimize resource usage rather than
correctness of processing.  This new parameter should also help you with
your use case.  But even this new parameter is not based on strict time
behavior or time windows.
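The general idea behind that kind of rate reduction can be sketched as follows (a rough, hypothetical illustration in plain Python, not the actual Kafka parameter or its mechanism): buffer only the *latest* update per key and flush the buffer in batches, so many intermediate updates for the same key collapse into one downstream message.

```python
class UpdateBuffer:
    """Deduplicate downstream updates: keep only the latest value per
    key and flush in batches.  This trades update granularity for a
    lower downstream message rate.  (Illustrative sketch only -- the
    flush trigger and buffer size used here are made up.)"""

    def __init__(self, flush_every):
        self.flush_every = flush_every  # flush after this many inputs
        self.pending = {}               # key -> latest value
        self.seen = 0

    def update(self, key, value):
        """Record a new value for a key.  Returns the list of flushed
        (key, value) updates -- empty most of the time."""
        self.pending[key] = value       # overwrites any older update
        self.seen += 1
        if self.seen % self.flush_every == 0:
            flushed = list(self.pending.items())
            self.pending.clear()
            return flushed
        return []

buf = UpdateBuffer(flush_every=3)
sent = []
for key, value in [("a", 1), ("a", 2), ("b", 1), ("a", 3)]:
    sent.extend(buf.update(key, value))
print(sent)
# [('a', 2), ('b', 1)]  -- 4 inputs produced only 2 downstream updates
```

Four input records produced only two downstream messages, and the intermediate update `("a", 1)` was absorbed by the later `("a", 2)` before the flush -- which is the sense in which this reduces volume without strict time-based emission.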
