Cascading "not alive" in topology with Storm 0.9.5

Yury Ruchin Sun, 13 Dec 2015 04:23:13 -0800

Hello,

I'm running a large topology using Storm 0.9.5. I have 2.5K executors
distributed over 60 workers, 4-5 workers per node. The topology consumes
data from a Kafka spout.


I regularly observe Nimbus declaring topology workers dead by heartbeat
timeout. It then moves their executors to other workers, but soon another
worker times out, Nimbus moves its executors, and so on. The sequence
repeats over and over - in effect, the topology suffers cascading worker
timeouts from which it cannot recover. The topology itself looks alive but
stops consuming from Kafka and as a result stops processing altogether.
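
To make clear what I mean by "dead by heartbeat timeout", here is a minimal
sketch of the check as I understand it. This is a toy model, not Storm's
actual code: the Heartbeat class, the worker ids and the find_timed_out
function are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    worker_id: str    # hypothetical identifier, not a real Storm worker id
    last_seen: float  # time (in seconds) when the last heartbeat arrived


def find_timed_out(heartbeats, now, timeout_secs):
    """Return ids of workers whose last heartbeat is older than timeout_secs."""
    return [
        hb.worker_id
        for hb in heartbeats
        if now - hb.last_seen > timeout_secs
    ]


# Two made-up workers: one fresh heartbeat, one that is 35 seconds old.
heartbeats = [
    Heartbeat("worker-a", last_seen=100.0),
    Heartbeat("worker-b", last_seen=65.0),
]
timed_out = find_timed_out(heartbeats, now=100.0, timeout_secs=30.0)
```

The point of the model is that a worker is judged only by the age of its
last heartbeat, so a process that is running but not reporting looks the
same as one that has crashed.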

I didn't see any obvious issues with network, so initially I assumed there
might be worker process failures caused by exceptions/errors inside the
process, e.g. OOME. Nothing appeared in worker logs. I then found that the
processes were actually alive when Nimbus declared them dead - it seems
like they simply stopped sending heartbeats for some reason.
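
For reference, a process can be checked for liveness along these lines. This
is only a sketch of the general technique on a POSIX system, not the exact
procedure I used; is_process_alive is a hypothetical helper name, and the
pid would have to be taken from wherever the worker pid is recorded.

```python
import os


def is_process_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        # Signal 0 delivers nothing; the OS only performs its existence
        # and permission checks for the target pid.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but is owned by another user
    return True


# The current interpreter process is certainly alive.
alive = is_process_alive(os.getpid())
```

Note that this reports a zombie process as alive too, so it only shows that
the pid still exists, not that the worker is doing useful work.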

I looked for Java fatal error logs on the assumption that the error might be
caused by some nasty low-level things happening - but found nothing.

I suspected high CPU usage, but it turned out the user CPU + system CPU on
the nodes never went above 50-60% in peaks. The regular load was even less.

I was observing the same issue with Storm 0.9.3, then upgraded to Storm
0.9.5 hoping that fixes for https://issues.apache.org/jira/browse/STORM-329
and https://issues.apache.org/jira/browse/STORM-404 would help, but they
haven't.

Strangely enough, I can only reproduce the issue in this large setup. Small
test setups with 2 workers do not exhibit this issue - even after killing
all worker processes with kill -9 they recover seamlessly.

My other guess is that the large number of workers causes significant
overhead when establishing Netty connections during worker startup, which
somehow prevents heartbeats from being sent. Maybe this is something similar to
https://issues.apache.org/jira/browse/STORM-763 and it's worth upgrading to
0.9.6 - I don't know how to check it.

Any help is appreciated.
