Hey Matthias,

Thanks for the quick reply! Unfortunately that's still a very unsatisfying answer, and I'm hoping you or someone else can shed a bit more light on the internals here.

First of all, I've read through the documentation (some parts more closely than others, so it's 100% possible I flat out missed it) and I don't recall seeing any mention of topics being implicitly created by stateful operations. The docs talk at some length about the local state stores necessary for aggregations and joins, but say little about aggregating/joining data that has been re-keyed partway through the topology (such as the word count example I linked in my original post).

Furthermore, if Kafka Streams really is creating topics implicitly, that raises a bunch of follow-up questions:

- Does the user have any control over the configuration of these topics? The number of partitions? The lifespan of the topic?
- Does this mean that a user of Kafka Streams could inadvertently create dozens of intermediate topics with only a moderately complex topology?
- Which KStream/KTable operations create implicit topics?
- How do the consumers coordinate naming the anonymous intermediate topics so that they all use the same one, without conflicting with other topologies running against the same cluster?
- Perhaps most importantly, where specifically is all of this behavior documented (again, I fully admit I may have just skimmed over it)?

I'm happy to go diving through the code if the documentation for this simply doesn't exist yet or is otherwise in flux, so some pointers on where to get started would be greatly appreciated. In the meantime, the only workaround I can picture is making the repartitioning explicit myself; I've sketched what I mean at the bottom of this message.

Finally, towards the end of the original Kafka Streams blog post, the author (Jay) mentions diving further into the end-to-end semantics of Kafka Streams. Is that documentation/blog post still coming? Is there anything I can read now about how at-least-once delivery is guaranteed by Kafka Streams?
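To make that concrete, here's roughly what I'd expect the explicit version of the word count to look like against the 0.10.0 API. This is just a sketch of what I mean, not something I've actually run: "words-by-word" is a topic I'd have to create and configure myself ahead of time (partition count, retention, etc.), and the input/output topic names are borrowed from the demo.

import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class ExplicitRepartitionWordCount {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-explicit");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();
        KStream<String, String> lines = builder.stream("streams-file-input");

        // Split each line into words and re-key every record by the word itself.
        KStream<String, String> words = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .map((key, word) -> new KeyValue<>(word, word));

        // "words-by-word" is a topic I create and configure myself. through()
        // writes to it and reads it back, so all occurrences of a given word
        // land on the same partition before the count happens.
        KTable<String, Long> counts = words
            .through("words-by-word")
            .countByKey("Counts");

        counts.to(Serdes.String(), Serdes.Long(), "streams-wordcount-output");

        new KafkaStreams(builder, props).start();
    }
}

If groupBy() is already doing the equivalent of that through() step for me under the covers, then my questions above are really about how that hidden topic gets named, sized, and cleaned up.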
Cheers,
Michael-Keith

From: Matthias J. Sax <matth...@confluent.io>
Sent: Thursday, July 21, 2016 7:31 AM
To: users@kafka.apache.org
Subject: Re: Kafka Streams: Merging of partial results

Hi,

you answered your question absolutely correctly by yourself (i.e., internal topic creation in groupBy() to repartition the data on words). I cannot add more to explain how it works.

You might want to have a look here for more details about Kafka Streams in general:
http://docs.confluent.io/3.0.0/streams/index.html

-Matthias

On 07/21/2016 04:16 PM, Michael-Keith Bernard wrote:
> Hello Kafka Users,
>
> (I apologize if this landed in your inbox twice; I sent it yesterday afternoon but it never showed up in the archive, so I'm sending it again just in case.)
>
> I've been floating this question around the #apache-kafka IRC channel on Freenode for the last week or two and I still haven't reached a satisfying answer. The basic question is: how does Kafka Streams merge partial results? So let me expand on that a bit...
>
> Consider the following word count example in the official Kafka Streams repo (GitHub mirror):
> https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/wordcount/WordCountDemo.java#L46
>
> Now suppose we run this topology against a Kafka topic with 2 partitions. Into the first partition, we insert the sentence "hello world". Into the second partition, we insert the sentence "goodbye world". We know a priori that the result of this computation is something like:
>
> { hello: 1, goodbye: 1, world: 2 } # made-up syntax for a compacted log/KTable state
>
> And indeed we'd probably see precisely that result from a *single* consumer process that sees *all* the data. However, my question is, what if I have 1 consumer per topic partition (2 consumers total in the above hypothetical)? Under that scenario, consumer 1 would emit { hello: 1, world: 1 } and consumer 2 would emit { goodbye: 1, world: 1 }... But the true answer requires an additional reduction of duplicate keys (in this case with a sum operator, but that needn't be the case for arbitrary aggregations).
>
> So again my question is, how are the partial results that each consumer generates merged into a final result? A simple way to accomplish this would be to produce an intermediate topic keyed by the word and then aggregate that (since each consumer would then see all the data for a given key), but if that's happening it's not explicit anywhere in the example. So what mechanism is Kafka Streams using internally to aggregate the results of a partitioned stream across multiple consumers? (Perhaps groupByKey creating an anonymous intermediate topic?)
>
> I know that's a bit wordy, but I want to make sure my question is extremely clear. If I've still fumbled on that, let me know and I will try to be even more explicit. :)
>
> Cheers,
> Michael-Keith Bernard
>
> P.S. Kafka is awesome and Kafka Streams looks even awesome-er!
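P.P.S. To make the "additional reduction of duplicate keys" from my original message concrete, this is the merge step I have in mind, using my hypothetical partial results from the two consumers. It's a toy plain-Java illustration, not actual Kafka Streams code:

import java.util.HashMap;
import java.util.Map;

public class PartialResultMerge {

    public static void main(String[] args) {
        // The partial counts each consumer would emit for its own partition.
        Map<String, Long> consumer1 = new HashMap<>();
        consumer1.put("hello", 1L);
        consumer1.put("world", 1L);

        Map<String, Long> consumer2 = new HashMap<>();
        consumer2.put("goodbye", 1L);
        consumer2.put("world", 1L);

        // Merging duplicate keys with a sum operator yields the true answer.
        // For arbitrary aggregations the combine step wouldn't be a simple sum,
        // which is exactly why I want to know where this happens internally.
        Map<String, Long> merged = new HashMap<>(consumer1);
        for (Map.Entry<String, Long> entry : consumer2.entrySet()) {
            merged.merge(entry.getKey(), entry.getValue(), Long::sum);
        }

        // Contains hello=1, goodbye=1, world=2 (iteration order not guaranteed).
        System.out.println(merged);
    }
}

My whole question boils down to: which component of Kafka Streams performs this reduction, and over what channel do the two consumers' partial results meet?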