The issue we were seeing was that the heartbeats were so large and
numerous that ZooKeeper was bogged down with writes, and it took Nimbus a long
time to read them. By the time Nimbus finished reading them from ZooKeeper, they
were old enough to cause a timeout.
-- Kyle
On Thursday, December 17, 2015 4:04 AM, Yury Ruchin <[email protected]>
wrote:
Hi Ravi, Kyle, thanks for the input!
I tried increasing the task timeout from 30 to 60 seconds - and still observed the
same issue. Increasing the timeout further does not look reasonable, since it
will affect Nimbus's ability to detect real crashes.
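For reference, the change described above corresponds to this storm.yaml fragment (a sketch only; 30 seconds is the default value being raised here):

```yaml
# storm.yaml - sketch of the timeout change tried above.
# Raising this further delays detection of genuinely crashed workers.
nimbus.task.timeout.secs: 60
```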
I was looking at Zookeeper metrics and didn't notice any anomalies - no load
spikes around the point of the heartbeat timeout. I will double-check, however.
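One way to do that double-check, assuming shell access to the quorum hosts, is ZooKeeper's standard four-letter-word commands; the hostname below is a placeholder for an actual quorum member:

```shell
# Sample a ZooKeeper server's counters via the standard 'mntr' command
# and keep the latency/backlog lines relevant to heartbeat stalls.
# zk1.example.com:2181 is a placeholder; repeat per quorum member.
echo mntr | nc zk1.example.com 2181 \
  | egrep 'zk_(avg|max)_latency|zk_outstanding_requests|zk_num_alive_connections'
```

A spike in zk_max_latency or zk_outstanding_requests around the moment of a timeout would point at the quorum rather than the workers.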
Kyle,
Could you elaborate a bit on what the issue with Zookeeper looked like in your
case? Was it simply that write calls to Zookeeper at times blocked for longer
than nimbus.task.timeout.secs?
2015-12-16 21:53 GMT+03:00 Kyle Nusbaum <[email protected]>:
Yes, I would check Zookeeper. We've seen the exact same thing in large clusters;
it is what this was designed to help solve:
https://issues.apache.org/jira/browse/STORM-885
-- Kyle
On Monday, December 14, 2015 8:45 PM, Ravi Tandon
<[email protected]> wrote:
Try the following:
· Increase the value of "nimbus.monitor.freq.secs" to 120; this will make Nimbus
wait longer before declaring a worker dead. Also check other configs like
"supervisor.worker.timeout.secs" that allow the system to wait longer before
re-assigning/re-launching workers.
· Check the write load on the Zookeepers too; that may be the bottleneck of your
cluster and its coordination, rather than the worker nodes themselves. You can
choose to add ZK nodes or provide better-spec machines for the quorum.
-Ravi
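In storm.yaml form, the settings mentioned above would look roughly like this (values are illustrative, not recommendations; verify the defaults for your Storm version):

```yaml
# storm.yaml - illustrative values for the settings mentioned above
nimbus.monitor.freq.secs: 120       # how often Nimbus scans executors for liveness
supervisor.worker.timeout.secs: 60  # how long a supervisor tolerates a silent worker
```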
From: Yury Ruchin [mailto:[email protected]]
Sent: Sunday, December 13, 2015 4:22 AM
To: [email protected]
Subject: Cascading "not alive" in topology with Storm 0.9.5

Hello,

I'm running a large topology using Storm 0.9.5. I have 2.5K executors distributed
over 60 workers, 4-5 workers per node. The topology consumes data from a Kafka
spout. I regularly observe Nimbus considering topology workers dead due to
heartbeat timeout. It then moves their executors to other workers, but soon
another worker times out, Nimbus moves its executors, and so on. The sequence
repeats over and over - in effect, cascading worker timeouts that the topology
cannot recover from. The topology itself looks alive but stops consuming from
Kafka and, as a result, stops processing altogether.

I didn't see any obvious issues with the network, so initially I assumed there
might be worker process failures caused by exceptions/errors inside the process,
e.g. OOME.
Nothing appeared in the worker logs. I then found that the processes were actually
alive when Nimbus declared them dead - it seems they simply stopped sending
heartbeats for some reason. I looked for Java fatal error logs on the assumption
that some nasty low-level failure might have occurred - but found nothing. I
suspected high CPU usage, but it turned out that user CPU + system CPU on the
nodes never went above 50-60% at peak, and the regular load was even lower. I
observed the same issue with Storm 0.9.3, then upgraded to Storm 0.9.5 hoping
that the fixes for https://issues.apache.org/jira/browse/STORM-329 and
https://issues.apache.org/jira/browse/STORM-404 would help. They haven't.
Strangely enough, I can only reproduce the issue in this large setup. Small test
setups with 2 workers do not exhibit it - even after killing all worker processes
with kill -9, they recover seamlessly. My other guess is that the large number of
workers causes significant overhead in establishing Netty connections during
worker startup, which somehow prevents heartbeats from being sent. Maybe this is
something similar to https://issues.apache.org/jira/browse/STORM-763 and it's
worth upgrading to 0.9.6 - I don't know how to check this.

Any help is appreciated.