Hi,

I'm wondering what metrics, logs, or other indicators I can use to
understand why my topology gets stuck after upgrading from Storm 1.1.1 to
Storm 2.2.0.



In more detail:
I have a 1.1.1 cluster with 40 workers processing ~400K events/second.
The topology starts by reading from Kinesis via the AWS KCL, which we also
use to implement our own backpressure: when the topology is overloaded with
tuples, we stop reading from Kinesis until enough progress has been made
(i.e. we've been able to checkpoint), and then we resume reading.
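
Roughly, the spout-side gating looks like this (heavily simplified; the
class, the queue, and the threshold are just placeholders to illustrate the
idea, not our real code):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class GatedKinesisSpout extends BaseRichSpout {

    private static final long MAX_IN_FLIGHT = 50_000; // placeholder threshold

    // In the real spout this queue is filled by the KCL record processor;
    // here it just stands in for "records fetched from Kinesis".
    private final ConcurrentLinkedQueue<byte[]> fetched = new ConcurrentLinkedQueue<>();
    private final AtomicLong inFlight = new AtomicLong();
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Our "backpressure": stop handing records to the topology while too
        // many tuples are in flight; resume once acks bring the count down
        // (which is also when we let the KCL checkpoint and keep reading).
        if (inFlight.get() >= MAX_IN_FLIGHT) {
            return;
        }
        byte[] record = fetched.poll();
        if (record != null) {
            inFlight.incrementAndGet();
            collector.emit(new Values(record), UUID.randomUUID().toString());
        }
    }

    @Override
    public void ack(Object msgId) {
        inFlight.decrementAndGet(); // real spout: also advances the checkpoint
    }

    @Override
    public void fail(Object msgId) {
        inFlight.decrementAndGet(); // real spout: re-queues the record for retry
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("record"));
    }
}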

However, with this many workers we rarely see that backpressure is needed,
even at much higher event rates.

We've now created a similar cluster with Storm 2.2.0 and tried deploying
our topology there. However, within a couple of seconds no more Kinesis
records get read, and the topology appears to wait forever without
processing anything.

I would like to troubleshoot this, but I'm not sure where to collect data
from.
My initial suspicion is that the new backpressure mechanism in Storm 2
might have kicked in and that I need to configure it to resolve the issue.
However, this is nothing more than a guess, and I'm not sure how to prove
or disprove it without a lot of trial and error.
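
The only concrete idea I have so far is to register the stock
LoggingMetricsConsumer so that every worker logs Storm's built-in
spout/bolt metrics (emit/ack counts, latencies, and, if they are still
reported this way in 2.x, the queue stats), and then compare a stuck worker
against a healthy one. Something along these lines (is that a reasonable
way to gather evidence, or is there something better?):

import org.apache.storm.Config;
import org.apache.storm.metric.LoggingMetricsConsumer;

public class SubmitWithMetricsLogging {
    public static void main(String[] args) {
        Config conf = new Config();
        // Have Storm hand its built-in spout/bolt metrics to the stock
        // LoggingMetricsConsumer, which writes them to each worker's metrics log.
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
        // ... build and submit the topology with this conf as usual ...
    }
}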

I've found some documentation about backpressure in the performance tuning
chapter, but it concentrates on configuration parameters and doesn't
explain how to actually see what's going on in a running topology.
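
For reference, the kind of parameters I mean are the ones below; the values
are just examples I would try, and the backpressure key is my reading of
the chapter and defaults.yaml, so please correct me if it's wrong:

import org.apache.storm.Config;

public class BackpressureTuningSketch {
    public static void main(String[] args) {
        Config conf = new Config();
        // Limit on un-acked tuples per spout task (the classic overload knob).
        conf.setMaxSpoutPending(5_000);
        // The backpressure wait strategy key as I read it from the chapter and
        // defaults.yaml; the value shown is what I believe is the default.
        conf.put("topology.backpressure.wait.strategy",
                 "org.apache.storm.policy.WaitStrategyProgressive");
    }
}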
