Hi,
we have a topology with 34 bolts and a single spout, running 101 tasks
in 101 executors (one task per executor).
When running the topology in a single worker, we reach a certain
throughput "x".
Since some of the bolts will be rather memory-intensive, we decided to
split the topology across 4 workers on four physically separate
server machines.
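For context, the setup is submitted roughly like this. This is a
simplified sketch: the class and component names are placeholders, and
I'm assuming the org.apache.storm package namespace (older releases use
backtype.storm):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MyTopology {  // hypothetical name
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // single spout, parallelism 1
        builder.setSpout("spout", new MySpout(), 1);
        // 34 bolts; parallelism hints sum to 101 executors/tasks overall
        builder.setBolt("bolt-1", new MyBolt1(), 3).shuffleGrouping("spout");
        // ... remaining 33 bolts elided ...

        Config conf = new Config();
        conf.setNumWorkers(4);  // was 1 in the single-worker test
        StormSubmitter.submitTopology("my-topology", conf,
                builder.createTopology());
    }
}
```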
When running that 4-worker setup, we see only a slight increase in
throughput. None of the servers is under heavy CPU load, and all bolts
show a capacity below 0.5, according to the web UI.
The spout implementation could easily serve 100 times as many tuples,
so starvation at that level is not the problem.
Am I right in thinking that the bottleneck is not in the topology
implementation but in the inter-worker communication setup? Otherwise
I'd have expected some bolt to run at a capacity of 1.0 or higher.
(We had run the same test on older hardware, where the topology was
able to consume all available CPU. There, spreading across more
hardware led to a significant throughput increase, and "capacity" was
reported as >1 for some bolts.)
How would I proceed to identify the actual bottleneck(s)?
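For example, are settings like the following the first things to look
at? This is just a sketch of what I'd know to try; I'm assuming these
config methods and key names from the Storm releases I've seen, and
the values are made-up starting points, not recommendations:

```java
Config conf = new Config();
// Caps the number of un-acked tuples in flight per spout task; a low
// value can throttle throughput without any bolt showing high capacity.
conf.setMaxSpoutPending(1000);
// Number of acker executors; too few can become a hidden bottleneck.
conf.setNumAckers(4);
// Inter-worker transfer buffer size (key name may differ per release).
conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 1024);
```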
Regards,
Jens