Try the following:

·        Increase the value of "nimbus.monitor.freq.secs" (e.g. to 120); this will make 
Nimbus wait longer before declaring a worker dead. Also check related configs 
like "supervisor.worker.timeout.secs" that allow the system to wait longer 
before re-assigning/re-launching workers.

·        Check the write load on the ZooKeeper quorum too; the bottleneck may be 
cluster coordination rather than the worker nodes themselves. You can add 
more ZK nodes or run the quorum on better-spec machines.
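For reference, the timeout-related settings above would be adjusted in storm.yaml roughly as follows. The values here are illustrative starting points for a loaded cluster, not tested recommendations; tune them against your own heartbeat gaps:

```yaml
# How often (in seconds) Nimbus scans worker heartbeats. A larger value
# gives slow or briefly stalled workers more slack before Nimbus
# declares them dead and reassigns their executors.
nimbus.monitor.freq.secs: 120

# How long (in seconds) a supervisor waits without a worker heartbeat
# before killing and relaunching that worker. Raise this together with
# the Nimbus setting so the two do not fight each other.
supervisor.worker.timeout.secs: 60
```

Note that raising these also delays detection of genuinely dead workers, so there is a trade-off between stability under load and recovery time.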

-Ravi

From: Yury Ruchin [mailto:[email protected]]
Sent: Sunday, December 13, 2015 4:22 AM
To: [email protected]
Subject: Cascading "not alive" in topology with Storm 0.9.5

Hello,

I'm running a large topology using Storm 0.9.5. I have 2.5K executors 
distributed over 60 workers, 4-5 workers per node. The topology consumes data 
from Kafka spout.

I regularly observe Nimbus considering topology workers dead by heartbeat 
timeout. It then moves executors to other workers, but soon another worker 
times out. Nimbus moves its executors, and so on. The sequence repeats over and 
over - in effect, cascading worker timeouts from which the topology cannot 
recover. The topology itself looks alive but stops consuming from Kafka and, 
as a result, stops processing altogether.

I didn't see any obvious issues with the network, so initially I assumed there 
might be worker process failures caused by exceptions/errors inside the 
process, e.g. an OOME. Nothing appeared in the worker logs. I then found that the 
processes were actually alive when Nimbus declared them dead - it seems they 
simply stopped sending heartbeats for some reason.

I looked for Java fatal error logs on the assumption that the failure might be 
caused by some nasty low-level problem - but found nothing.

I suspected high CPU usage, but it turned out the user CPU + system CPU on the 
nodes never went above 50-60% in peaks. The regular load was even less.

I was observing the same issue with Storm 0.9.3, then upgraded to Storm 0.9.5 
hoping that the fixes for 
https://issues.apache.org/jira/browse/STORM-329 and 
https://issues.apache.org/jira/browse/STORM-404 
would help. But they haven't.

Strangely enough, I can only reproduce the issue in this large setup. Small test 
setups with 2 workers do not expose this issue - even after killing all worker 
processes with kill -9 they recover seamlessly.

My other guess is that the large number of workers causes significant overhead 
when establishing Netty connections during worker startup, which somehow 
prevents heartbeats from being sent. Maybe this is something similar to 
https://issues.apache.org/jira/browse/STORM-763 
and it's worth upgrading to 0.9.6 - I don't know how to check this.

Any help is appreciated.
