Hello,

I'm looking for guidance on how to approach a counting problem. We want to
consume a stream of IDs and emit the aggregated count per ID for each
window of X seconds, using processing time and a hopping time window. For
example, with a window size of 1 second, if we get IDs 1, 2, 2, 2 in the
1st second, then the output would be 1=1, 2=3. If we get IDs 1, 3, 3 in the
2nd second, then the output would be 1=1, 3=2. The aggregated counts will
then be turned into increment commands to a cache and a database.
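
To make the expected output concrete, here is a minimal plain-Java sketch of
the counting itself, with no Kafka dependency. WindowedIdCounter and its
methods are hypothetical names used only for illustration, and the sketch
assumes non-overlapping windows as in the example above, with the caller
supplying the processing time in milliseconds.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not a Kafka API): counts IDs per fixed-size window.
final class WindowedIdCounter {
    private final long windowSizeMs;
    private final Map<Long, Long> counts = new TreeMap<>();
    private long windowStart = -1; // -1 means no window is open yet

    WindowedIdCounter(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Records one ID seen at processing time nowMs. If this record falls
    // into a new window, returns the counts of the window that just closed;
    // otherwise returns an empty map.
    Map<Long, Long> add(long id, long nowMs) {
        long start = nowMs - (nowMs % windowSizeMs);
        Map<Long, Long> closed = new TreeMap<>();
        if (windowStart >= 0 && start != windowStart) {
            closed = flush();
        }
        windowStart = start;
        counts.merge(id, 1L, Long::sum);
        return closed;
    }

    // Returns a copy of the current counts and resets the state.
    Map<Long, Long> flush() {
        Map<Long, Long> snapshot = new TreeMap<>(counts);
        counts.clear();
        return snapshot;
    }
}
```

With a 1000 ms window, feeding IDs 1, 2, 2, 2 in the first second and then
1, 3, 3 in the next second gives the two count maps from the example.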

Obviously we will need to store some state while a window is being counted,
but we only need to keep it for the duration of the window (i.e. a second).
I was thinking this could be achieved by using a persistent store, where
the counts are reset during the punctuate and the store topic uses log
compaction. Alternatively, we could simply have an in-memory store that is
reset during the punctuate. My concern with the in-memory store is that I
don't know when the input topic offset is committed or when the output data
is written, and therefore we could lose data. Ultimately, at the end of the
second, the input offset and output data should be written at the same
time, reducing the likelihood of lost data. We would rather lose data than
have duplicate counts. What is the correct approach? Is there a better way
of tackling the problem?
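
As a rough illustration of the "lose rather than duplicate" ordering, the
plain-Java sketch below commits the input offset before writing the output
at the end of a window. OffsetCommitter, CountSink and AtMostOnceWindow are
hypothetical stand-ins rather than Kafka Streams APIs, and the sketch does
not show how this ordering would actually be enforced in Kafka Streams.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for "commit the input offset" and "write the output".
interface OffsetCommitter { void commit(long nextOffset); }
interface CountSink { void write(Map<Long, Long> counts); }

// In-memory window state that is reset at the end of every window.
final class AtMostOnceWindow {
    private final Map<Long, Long> counts = new HashMap<>();
    private final OffsetCommitter committer;
    private final CountSink sink;
    private long lastOffset = -1; // -1 means nothing seen in this window

    AtMostOnceWindow(OffsetCommitter committer, CountSink sink) {
        this.committer = committer;
        this.sink = sink;
    }

    // Called once per input record.
    void process(long id, long offset) {
        counts.merge(id, 1L, Long::sum);
        lastOffset = offset;
    }

    // Called at the end of the window, e.g. from a punctuate callback.
    void punctuate() {
        if (lastOffset < 0) {
            return; // empty window: nothing to commit or write
        }
        Map<Long, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        // Commit first: a crash after this line loses the window's counts,
        // but a restart will not re-read and re-count the same records.
        // Kafka's convention is to commit the offset of the next record.
        committer.commit(lastOffset + 1);
        lastOffset = -1;
        sink.write(snapshot);
    }
}
```

Swapping the commit and the write would give the opposite trade-off: a crash
between the two steps would replay the window and produce duplicate counts.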

I have put together some code, but it doesn't do exactly what I expect. I'm
happy to share if it helps.

Thanks,
Ben
