Hi All,

I read through the doc on KStreams here:
http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple

I was wondering how a use case that I have could be solved with KStreams.

Use case: Logs are obtained from a service pool that contains many nodes.
We want to alert if a particular consumer (identified by consumer id) makes
calls to the service more than X times in the last 1 min. The data is
available in logs similar to access logs. The window is a sliding window:
we have to look back 60s from the current event and check whether, together
with the current event, the total would exceed the threshold. Logs are
pushed to Kafka using a random partitioner whose range is [1 to n], where n
is the total number of partitions.

One way of achieving this is to push the data into a first Kafka topic
(using random partitioning) and then have a set of KStream tasks re-shuffle
the data on consumer_id into a second topic. The next set of KStream tasks
operates on the second topic (1 task/partition) and does the aggregation.
If this is an acceptable solution, here are my questions on scaling.
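
For reference, a sketch of that two-topic shape in the Kafka Streams DSL,
assuming a recent Kafka Streams release and a hypothetical extractConsumerId
parser; re-keying before the grouped aggregation is what produces the
internal repartition topic, i.e. the "second topic" above:

    import java.time.Duration;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.SlidingWindows;

    public class ConsumerRateAlerts {
        static final long THRESHOLD = 1000L;               // X, assumed constant here

        static String extractConsumerId(String logLine) {  // hypothetical log parser
            return logLine.split(" ")[0];
        }

        public static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> logs =
                builder.stream("access-logs");             // randomly partitioned input

            logs
                // Re-key every record by consumer_id; marks the stream for repartitioning.
                .selectKey((key, logLine) -> extractConsumerId(logLine))
                // Grouping a re-keyed stream routes it through an internal repartition
                // topic (the "second topic"), so one task sees all events per consumer_id.
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                // 60s sliding windows over event time.
                .windowedBy(SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(1)))
                .count()
                .toStream()
                .filter((windowedId, count) -> count > THRESHOLD)
                .map((windowedId, count) -> KeyValue.pair(windowedId.key(), count.toString()))
                .to("alerts");

            return builder.build();
        }
    }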


   - I can see that the second topic is prone to hotspots. If we get
   billions of requests for one consumer_id and only a few hundred for
   another, the partitions of the second topic become hotspots (a partition
   receiving a large volume of logs can starve the other partitions on the
   same broker). If we try to create more partitions and isolate the
   high-volume partition, we waste resources.
   - The maximum parallelism we can get for a KStream task is the number of
   partitions. This may work for a single stream, but how would it work for
   multi-tenant stream processing, where people want to write multiple
   stream jobs over the same set of data? If the parallelism does not
   suffice, they would have to copy and re-group the data into another topic
   with more partitions. I think we need two knobs: one for scaling Kafka
   (number of partitions) and one for scaling the stream jobs. It sounds
   like with KStreams there is only one knob for both.
   - How would we deal with organic growth of data? Suppose the number of
   partitions we chose for the second topic (where data is grouped by
   consumer_id) is not enough to handle the growth in volume. If we increase
   partitions, data for a given consumer could sit in one partition before
   the flex-up and end up in a different partition after it (see the sketch
   after this list). Since the KStream jobs are unique per partition and
   stateless across partitions, the aggregated result would be incorrect,
   unless we have only one job read all the data, in which case it becomes a
   bottleneck.
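
On the last point, the remapping falls out of how keyed records are placed:
the default partitioner computes hash(key) mod numPartitions, so changing
the partition count moves existing keys. A tiny self-contained sketch
(String.hashCode() stands in for Kafka's murmur2 to keep it dependency-free):

    // Demonstrates why a flex-up re-routes keys: the partition is a function
    // of the partition count, not of the key alone.
    public class PartitionRemapDemo {
        static int partitionFor(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }

        public static void main(String[] args) {
            String consumerId = "consumer-42";
            System.out.println(partitionFor(consumerId, 8));   // partition before the flex-up
            System.out.println(partitionFor(consumerId, 12));  // often a different partition after
        }
    }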

In Storm (or any other streaming engine), the way to solve this would be to
have only one topic (partitioned n ways) with data pushed into it using a
random partitioner (so no hotspots and no scaling issues). We would have n
spouts reading data from those partitions and then m bolts receiving the
data via fields grouping on consumer_id. Since all the data for a given
consumer_id ends up in one bolt, we can do the sliding window and the
alerting there.
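
A sketch of that wiring (AccessLogSpout and SlidingWindowAlertBolt are
hypothetical stand-ins for real implementations; n and m are just
parallelism parameters):

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class ConsumerRateTopology {
        // AccessLogSpout / SlidingWindowAlertBolt are assumed to exist elsewhere.
        public static TopologyBuilder build(int n, int m) {
            TopologyBuilder builder = new TopologyBuilder();
            // n spouts, one per Kafka partition; tuples carry a "consumer_id" field.
            builder.setSpout("logs", new AccessLogSpout(), n);
            // m bolts; fieldsGrouping hashes on consumer_id, so every event for a
            // given consumer reaches the same bolt instance, and m is independent
            // of the Kafka partition count.
            builder.setBolt("alerts", new SlidingWindowAlertBolt(), m)
                   .fieldsGrouping("logs", new Fields("consumer_id"));
            return builder;
        }
    }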

If we solved it the Storm way with KStreams, we would have only one topic
(partitioned n ways) with data pushed into it using a random partitioner
(so no hotspots and no scaling issues). But then we could only have one
KStream task reading all the data and doing the windowing and aggregation,
and that would become a bottleneck for scaling.

So it sounds like KStreams will either have hotspots in the Kafka topic
(since each partition must hold all the data its KStream task needs, so
tasks can work independently) or scaling issues in the KStream task doing
the aggregation.

How would one solve this kind of problem with KStreams?


