kafka streams internal topic

Srikanth Tue, 17 May 2016 19:37:38 -0700

Hi,

I was reading about Kafka streams and trying to understand its programming
model.
Some observations that I wanted to get some clarity on..


1) Joins & aggregations use an internal topic for shuffle. Source
processors will write to this topic with the key used for join. Then it is
free to commit offset for that run.
Does that mean we have to rely on internal topic's replication to guarantee
at-least-once processing?

  Won't these replicated internal topics put too much load on brokers? We
have to scale up brokers whenever our stream processing needs increase.
Its easy to add/remove processing nodes to run kstream app instance but
adding/removing brokers is not that straight forward.


2) How may partitions are created for the internal topic? What is the
retention policy?
When/how is it created and cleaned up when streaming application is
shutdown permanently?
Any link with deep dive into this will be helpful.


 3) I'm a bit confused on the threading model. Lets take the below example.
Assume "PageViews" and "UserProfile" have four partitions each.
I start two instance of this app both with two threads. So we have four
threads all together.
My understanding is that each thread will now receive records from one
partition in both topics. After reading they'll do the map, filter, etc and
write to internal topic for join.
Then the threads go on to read the next set of records from previous
offset. I guess this can be viewed as some sort of chaining with in a stage.

  Now where does the thread that reads from internal topic run? Do they
share the same set of threads?
Can we increase the parallelism for just this processor? If I know that the
join().map().process() chain is more compute intensive.

  KStream<String, GenericRecord> views = builder.stream("PageViews")
      .map(...)
.filter(...)

  KTable<String, GenericRecord> users = builder.table("UserProfile")
.mapValues(...)

  KStream<String, Long> regionCount = views
               .leftJoin(user, ...)
.map(...)
.process(...)

I hope I was able to explain my questions clear enough for you to
understand.

Thanks,
Srikanth

kafka streams internal topic

Reply via email to