I have a topology that consumes a stream of log events from Kafka (using the new storm-kafka-client spout) and performs an aggregation function in a bolt.
Upstream, the log events are written to Kafka using a VM Group ID as the partitioning key. In my topology, the same key is used to route tuples to the bolt (via fieldsGrouping, which is necessary for the aggregation).

I notice a sharp drop in performance when I increase the number of workers. My understanding is that this is caused by tuples being exchanged between workers, which may happen because:

- Kafka's hashing function is not consistent with Storm's hashing function (the one used by fieldsGrouping), and/or
- Storm has no way of inferring the affinity between spout and bolt instances.

How can I ensure that a spout that reads its data from one or more partitions (and consequently one or more VM Groups) sends its tuples to bolts that live in the same worker, so that I achieve a proper global partitioning of my workload? I've thought about grouping by partition instead of VM Group, but I don't really see how it would help in the end.
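For reference, here is a minimal sketch of how the topology is wired. The broker address, topic name, component names, parallelism values, and the bolt body are placeholders; it assumes the default storm-kafka-client record translator, which emits the fields topic/partition/offset/key/value, so the grouping is on the Kafka key that carries the VM Group ID:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class LogAggregationTopology {

    // Placeholder for the real aggregation bolt; what matters here is only that
    // it is fed via fieldsGrouping on the Kafka key (the VM Group ID).
    public static class AggregationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // per-VM-Group aggregation would happen here
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) throws Exception {
        // storm-kafka-client spout; with the default record translator each tuple
        // carries the fields: topic, partition, offset, key, value
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka:9092", "log-events").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new KafkaSpout<>(spoutConfig), 4);

        // Group on the Kafka message key (the VM Group ID) so that all tuples of a
        // given VM Group reach the same AggregationBolt instance.
        builder.setBolt("aggregation-bolt", new AggregationBolt(), 4)
               .fieldsGrouping("log-spout", new Fields("key"));

        Config conf = new Config();
        conf.setNumWorkers(4); // raising this is where the throughput drop shows up

        StormSubmitter.submitTopology("log-aggregation", conf, builder.createTopology());
    }
}
```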
