Hi,

Given that you need a subtractor, you are probably calling
KGroupedTable.aggregate(). In order to get a KGroupedTable you called
(in the general case) KTable.groupBy().

I.e. you have an original (pre-groupBy) table changelog stream, where
a message key is, say, pageId, and the value is, say, the number of
hits in the current time window (or something like that):

key(pageId)=1, value=1
key(pageId)=2, value=2
key(pageId)=3, value=10
key(pageId)=1, value=3
key(pageId)=2, value=4
key(pageId)=3, value=11

Now you want to build a table that contains the number of hits in the
current time window per page category, so you group your values by,
well, categoryId. Let's say pageId=1 and pageId=2 belong to
categoryId=1, and pageId=3 belongs to some other categoryId.
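
In DSL terms the whole thing might look roughly like this (just a
sketch, assuming a StreamsBuilder named builder, Integer ids and Long
hit counts; the topic name and the categoryOf() helper are made up):

  KTable<Integer, Long> hitsByPage = builder.table("page-hits");

  KTable<Integer, Long> hitsByCategory = hitsByPage
      .groupBy((pageId, hits) -> KeyValue.pair(categoryOf(pageId), hits))
      .aggregate(
          () -> 0L,                               // initializer
          (categoryId, hits, agg) -> agg + hits,  // adder
          (categoryId, hits, agg) -> agg - hits); // subtractor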

Essentially .groupBy() will transform your changelog into the
following (dropping messages from other categories for brevity):

key(categoryId=1), value=1 (old key=1)
key(categoryId=1), value=2 (old key=2)
key(categoryId=1), value=3 (old key=1)
key(categoryId=1), value=4 (old key=2)

Which is how the example I've given in the previous email came to be.

And the final aggregation result will be:

categoryId=1, sum(value) = 7 (3 for pageId=1 + 4 for pageId=2)

And under the hood, Kafka will represent these messages as such:

key(categoryId=1), newValue=1, oldValue=null (oldKey=1)
key(categoryId=1), newValue=2, oldValue=null (oldKey=2)
key(categoryId=1), newValue=3, oldValue=1 (oldKey=1)
key(categoryId=1), newValue=4, oldValue=2 (oldKey=2)
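
Tracing the adder and subtractor calls through those four messages:

  agg = 0
  agg = 0 + 1     = 1   (adder on newValue=1)
  agg = 1 + 2     = 3   (adder on newValue=2)
  agg = 3 + 3 - 1 = 5   (adder on newValue=3, subtractor on oldValue=1)
  agg = 5 + 4 - 2 = 7   (adder on newValue=4, subtractor on oldValue=2)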

So you see: as you compute your aggregation, new values for an old
(pre-groupBy) key arrive and essentially replace the old values for
the same old (pre-groupBy) key. And to do this "replacement" you need
a subtraction operation.

The only caveat is that Kafka doesn't actually carry the oldKey
around; I've shown it above just for demonstration. It isn't
necessary: Kafka simply calls the adder on newValue and the subtractor
on oldValue. But to understand what's going on, I like to think about
it in terms of "old keys".
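
If it helps, here is roughly how I picture the update step in code
(just an illustration: the Change class and the field names are made
up, not Kafka's actual internals):

  class Change<V> {
      final V newValue;  // null when the key is being deleted
      final V oldValue;  // null on the first insert for this key
      Change(V newValue, V oldValue) {
          this.newValue = newValue;
          this.oldValue = oldValue;
      }
  }

  static long applyChange(long agg, Change<Long> change) {
      if (change.oldValue != null) agg -= change.oldValue;  // subtractor
      if (change.newValue != null) agg += change.newValue;  // adder
      return agg;
  }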

Hope that clears it up.

On Mon, Sep 24, 2018 at 12:03 PM Michael Eugene <far...@hotmail.com> wrote:
>
> First off, thanks for taking the time out of your schedule to respond.
>
> You lost me almost at the beginning, specifically at mapping to a
> different key.  If those records come in...
>
> key=1, value=1
> key=2, value=2
> key=1, value=3
> key=2, value=4
>
> Here is all that should happen in my application
> 1. You start with aggregated value zero.
> 2. You handle (key=1, value=1) -> agg=1
> 3. You handle (key=2, value=2) -> agg=2
> 4. You handle (key=1, value=3) -> why not just add 3 to the earlier 1 so it 
> is agg 4?
> 5. You handle (key=2, value=4) -> why not just add 4 to the earlier 2 so it 
> is agg 6?
>
> I have no interest in mapping to different keys.  That's kind of making this 
> exercise more complex.
>
> Also, one of the confusing points is why in older versions of Kafka you
> did not need a subtractor. Only in 2.0 am I required to give a subtractor;
> in 1.1 I didn't need one.
>
> ________________________________
> From: Vasily Sulatskov <vas...@sulatskov.net>
> Sent: Monday, September 24, 2018 9:46 AM
> To: users@kafka.apache.org
> Subject: Re: Subtractor
>
> Hi,
>
> If I am not mistaken it works like this.
>
> Remember that kafka is a streaming system, i.e. there's no way for
> kafka streams to look at all the current values for a given key and
> compute the aggregation by repeatedly calling your adder (starting
> with a zero value). Values arrive at different times, with values for
> different keys in between them, and you expect kafka streams to always
> give you the up-to-date aggregated value.
>
> Put yourself in the place of a kafka-streams application: how would you
> compute, say, a sum of all values whose keys get mapped to a single key,
> with a pen and paper? I bet you would keep track of the last arrived
> value for each key, and of the total aggregated value.
>
> So let's say here's a stream of values that all had originally
> different keys, but you mapped them via groupBy() to a different key,
> and they arrive to you like this:
>
> key=1, value=1
> key=2, value=2
> key=1, value=3
> key=2, value=4
>
> 1. You start with aggregated value zero.
> 2. You handle (key=1, value=1) -> agg=1
> 3. You handle (key=2, value=2) -> agg=3
> 4. You handle (key=1, value=3), now you can't just add 3 to your
> aggregated value, you must add the new value for key=1, and subtract the
> old value for key=1: newAgg = oldAgg + newValueForKey1 - oldValueForKey1:
> agg = 3 + 3 - 1 -> agg = 5
> 5. You handle (key=2, value=4), again you must look up a previous
> value for key=2 and subtract it from the aggregated value: agg = 5 + 4
> - 2 -> agg = 7
>
> And this is basically how it works.
>
> If you look into more details there are some complications though,
> such as kafka-streams transforming a sequence of values into a
> sequence of changes of values, so your KStream[T] becomes more like
> KStream[Change[T]], where a change carries both the new and the old
> value, and over the wire this change gets transmitted as two separate
> kafka messages.
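>
> For example, the key=1 update from value 1 to value 3 would travel as
> roughly this pair of records (a simplified picture, not the exact wire
> format):
>
>   key=1, oldValue=1  -> your subtractor gets called
>   key=1, newValue=3  -> your adder gets called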
> On Mon, Sep 24, 2018 at 10:56 AM Michael Eugene <far...@hotmail.com> wrote:
> >
> > Can someone explain to me the point of the Subtractor in an aggregator?  I
> > have to have one, because there is no concrete default implementation of
> > it, but I am just trying to get a "normal" aggregation working and I don't
> > see why I need a subtractor.  Other than of course I need to make the
> > program compile.
> >
> > I'm using Kafka Streams DSL 2.0
>
>
>
> --
> Best regards,
> Vasily Sulatskov



-- 
Best regards,
Vasily Sulatskov
