Hi All, I read through the doc on KStreams here: http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple <http://www.google.com/url?q=http%3A%2F%2Fwww.confluent.io%2Fblog%2Fintroducing-kafka-streams-stream-processing-made-simple&sa=D&sntz=1&usg=AFQjCNGJu-bDlzStDwxPDIOKpG10Ts9xvA>
I was wondering about how an use-case that I have be solved with KStream? Use-case: Logs are obtained from a service pool. It contains many nodes. We want to alert if a particular consumer (identified by consumer id) is making calls to the service more than X number of times in the last 1 min. The data is available in logs similar to access logs for example. The window is a sliding window. We have to check back 60s from the current event and see if in total (with the current event) it would exceed the threshold. Logs are pushed to Kafka using a random partitioner whose range is [1 to n] where n is the total number of partitions. One way of achieving this is to push data in to the first Kafka topic (using random partitioning) and then a set of KStream tasks re-shuffling the data on consumer_id in to the second topic. The next set of KStream tasks operate on the second topic (1 task/partition) and do the aggregation. If this is an acceptable solution, here are my questions on scaling. - I can see that the second topic is prone to hotspots. If we get billions of requests for a given consumer_id and only few hundreds for another consumer_id, the second kafka topic partitions will become hotspots (and the partition getting lot of volume of logs can suffocate other partitions on the same broker). If we try to create more partitions and probably isolate the partition getting lot of volume, this wastes resources. - The max parallelism that we can get for a KStream task is the number of partitions - this may work for a single stream. How would this work for a multi-tenant stream processing where people want to write multiple stream jobs on the same set of data? If the parallelism does not work, they would have to copy and group the data in to another topic with more partitions. I think like we need two knobs one for scaling Kafka (number of partitions) and one for scaling stream. It sounds like with KStream it is only one knob for both. - How would we deal with organic growth of data? Let us say the partitions we chose for the second topic (where it is grouped by consumer_id) is not enough to deal with organic growth in volume. If we increase partitions, for a given consumer some data could be in one partition before the flex up and data could end up in a different topic after flex up. Since the KStream jobs are unique per partition and are stateless across them, the aggregated result would be incorrect, unless we have only one job to read all the data in which case it will become a bottleneck. In storm (or any other streaming engine), the way to solve it would be to have only one topic (partitioned n-ways) and data pushed in to using a random partitioner (so no hotspots and scaling issues). We will have n Spouts reading data from those partitions and we can then have m bolts getting the data using fields grouping on consumer_id. Since all the data for a given consumer_id ends up in a bolt we will do the sliding window and the alert. If we solve it the Storm way in KStream, we would only have one topic (partitioned n-ways) and data pushed in to using a random partitioner (so no hotspots and scaling issues). But we can only have one KStream task running reading all the data and doing the windowing and aggregation. This will become a bottleneck for scaling. So it sounds like KStreams will either have "hotspots" in kafka topic (as each partition needs to have the data that the KStream task needs and work independently) or scaling issues in the KStream task for "aggregation". How would one solve this kind of problems with KStream? <https://lh3.googleusercontent.com/-uxCWSNe6nw8/Vusvqd5nLAI/AAAAAAAAD5s/aqY-DexRkn097M9egCLLb3D_ANyCKm60w/s1600/Screen%2BShot%2B2016-03-17%2Bat%2B3.30.17%2BPM.png>
