I have a topology that consumes a stream of log events from Kafka (using the new storm-kafka-client spout) and performs an aggregation function in a bolt.
Upstream, the log events are written to Kafka using a VM Group ID as the partitioning key. In my topology, the same key is used to route tuples to the bolt (via fieldsGrouping, which is necessary for the aggregation).

I notice a sharp drop in performance when I increase the number of workers. My understanding is that this is caused by tuples being exchanged between workers, which may happen because:

- Kafka's hashing function is not consistent with Storm's hashing function (the one used by fieldsGrouping), and/or
- Storm has no way of inferring the affinity between spout and bolt instances.

How can I ensure that a spout that reads its data from one or more partitions (and consequently one or more VM Groups) sends its tuples to bolts that live in the same worker, so that I achieve a proper global partitioning of my workload? I've thought about grouping by partition instead of VM Group, but I don't really see how it would help in the end.
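For reference, here is a minimal sketch of how the topology is wired. The broker address, topic name, component names, parallelism values, and the bolt body are placeholders; it assumes the default storm-kafka-client record translator, which emits the fields topic/partition/offset/key/value, so the grouping is on the Kafka key that carries the VM Group ID:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class LogAggregationTopology {

    // Placeholder for the real aggregation bolt; what matters here is only that
    // it is fed via fieldsGrouping on the Kafka key (the VM Group ID).
    public static class AggregationBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // per-VM-Group aggregation would happen here
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
        }
    }

    public static void main(String[] args) throws Exception {
        // storm-kafka-client spout; with the default record translator each tuple
        // carries the fields: topic, partition, offset, key, value
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka:9092", "log-events").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new KafkaSpout<>(spoutConfig), 4);

        // Group on the Kafka message key (the VM Group ID) so that all tuples of a
        // given VM Group reach the same AggregationBolt instance.
        builder.setBolt("aggregation-bolt", new AggregationBolt(), 4)
               .fieldsGrouping("log-spout", new Fields("key"));

        Config conf = new Config();
        conf.setNumWorkers(4); // raising this is where the throughput drop shows up

        StormSubmitter.submitTopology("log-aggregation", conf, builder.createTopology());
    }
}
```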
