Yes, I would check ZooKeeper. We've seen exactly the same thing in large clusters, which is what this was designed to help solve:
https://issues.apache.org/jira/browse/STORM-885

-- Kyle


    On Monday, December 14, 2015 8:45 PM, Ravi Tandon 
<[email protected]> wrote:
 

Try the following:

- Increase the value of "nimbus.monitor.freq.secs" (e.g. to 120); this makes Nimbus wait longer before declaring a worker dead. Also check related configs such as "supervisor.worker.timeout.secs", which let the system wait longer before re-assigning and re-launching workers.
- Check the write load on the ZooKeeper quorum too; the bottleneck may be ZooKeeper and the coordination it provides rather than the worker nodes themselves. You can add ZooKeeper nodes or move the quorum to better-spec machines.

-Ravi
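A storm.yaml fragment applying these suggestions might look like the sketch below. The values are illustrative starting points to tune, not recommendations from this thread, and "nimbus.task.timeout.secs" is an additional, related knob (the Nimbus-side heartbeat timeout) that is not mentioned above but may also be worth reviewing:

```yaml
# Illustrative storm.yaml fragment - tune values for your cluster.

# How often (seconds) Nimbus runs the monitor loop that checks for
# dead workers and triggers reassignment; raising it delays reassignment.
nimbus.monitor.freq.secs: 120

# How long (seconds) a supervisor waits without worker heartbeats
# before considering the worker dead and restarting it.
supervisor.worker.timeout.secs: 60

# Nimbus-side timeout (seconds) before an executor's heartbeat is
# considered stale (not mentioned in the advice above; listed for context).
nimbus.task.timeout.secs: 60
```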
From: Yury Ruchin [mailto:[email protected]]
Sent: Sunday, December 13, 2015 4:22 AM
To: [email protected]
Subject: Cascading "not alive" in topology with Storm 0.9.5

Hello,

I'm running a large topology using Storm 0.9.5: 2.5K executors distributed over 60 workers, 4-5 workers per node. The topology consumes data from a Kafka spout.

I regularly observe Nimbus declaring topology workers dead on heartbeat timeout. It then moves their executors to other workers, but soon another worker times out, Nimbus moves its executors, and so on. The sequence repeats over and over - in effect, the topology suffers cascading worker timeouts from which it cannot recover. The topology itself looks alive but stops consuming from Kafka and, as a result, stops processing altogether.

I didn't see any obvious issues with the network, so initially I assumed there might be worker process failures caused by exceptions or errors inside the process, e.g. an OOME. Nothing appeared in the worker logs. I then found that the processes were actually alive when Nimbus declared them dead - they seem to have simply stopped sending heartbeats for some reason.

I looked for Java fatal error logs, on the assumption that some nasty low-level failure might be the cause - but found nothing.

I suspected high CPU usage, but it turned out that user CPU + system CPU on the nodes never went above 50-60% at peak. The regular load was even lower.

I observed the same issue with Storm 0.9.3, then upgraded to Storm 0.9.5 hoping that the fixes for https://issues.apache.org/jira/browse/STORM-329 and https://issues.apache.org/jira/browse/STORM-404 would help. They haven't.
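Since ZooKeeper write load is suggested above as a possible bottleneck, one concrete way to check it is to sample the output of ZooKeeper's `mntr` four-letter command (e.g. `echo mntr | nc <zk-host> 2181`) and look at latency and request backlog. A minimal parser sketch, with a made-up sample (the host name, thresholds, and numbers are illustrative, not from this cluster):

```python
# Parse ZooKeeper 'mntr' output to spot signs of quorum overload.
# Capture the raw text with, e.g.:  echo mntr | nc <zk-host> 2181

def parse_mntr(text):
    """Turn 'mntr' key<TAB>value lines into a dict, converting numeric values to int."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = int(value) if value.lstrip("-").isdigit() else value
    return stats

# Illustrative sample output - not measurements from the cluster in this thread.
sample = """zk_version\t3.4.6-1569965
zk_avg_latency\t45
zk_max_latency\t1200
zk_outstanding_requests\t37
zk_num_alive_connections\t180
zk_server_state\tleader"""

stats = parse_mntr(sample)
# Sustained high average latency or a growing request backlog suggests the
# quorum is struggling to keep up with heartbeat/assignment writes.
if stats["zk_avg_latency"] > 10 or stats["zk_outstanding_requests"] > 10:
    print("ZooKeeper looks overloaded: avg latency", stats["zk_avg_latency"], "ms")
```

Sampling this periodically on each quorum member would show whether timeout cascades correlate with ZooKeeper latency spikes.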
Strangely enough, I can only reproduce the issue in this large setup. Small test setups with 2 workers do not expose it - even after killing all worker processes with kill -9, they recover seamlessly.

My other guess is that the large number of workers causes significant overhead in establishing Netty connections during worker startup, which somehow prevents heartbeats from being sent. Maybe this is something similar to https://issues.apache.org/jira/browse/STORM-763 and it's worth upgrading to 0.9.6 - I don't know how to check this.

Any help is appreciated.
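The cascade hypothesis above (startup connection overhead growing with cluster size and outrunning the heartbeat timeout) can be sketched as a toy model. Everything here is an assumption for illustration - the timeout, per-peer connection cost, and the "one more worker dragged in per round" rule are made up, not measurements from this cluster:

```python
# Toy model of cascading worker timeouts. All parameters are hypothetical.
#
# Assumption: a restarted worker must re-establish a connection to every
# peer; if total setup time exceeds the heartbeat timeout, the restarted
# workers are declared dead again and the cascade continues.

TIMEOUT_SECS = 30          # hypothetical heartbeat timeout
SETUP_COST_PER_PEER = 0.6  # hypothetical seconds per peer connection

def cascade_rounds(num_workers, initially_dead):
    """Count reassignment rounds until the cascade stops or engulfs every worker."""
    dead = initially_dead
    rounds = 0
    while 0 < dead < num_workers:
        startup_time = SETUP_COST_PER_PEER * (num_workers - 1)
        if startup_time <= TIMEOUT_SECS:
            return rounds  # startup fits within the timeout: cascade dies out
        # Startup blew past the timeout: restarted workers are declared dead
        # again, and (in this toy model) one more healthy worker joins each round.
        dead = min(num_workers, dead + 1)
        rounds += 1
    return rounds

print(cascade_rounds(num_workers=60, initially_dead=1))  # large cluster: cascades
print(cascade_rounds(num_workers=2, initially_dead=1))   # tiny cluster: recovers at once
```

Under these made-up numbers a 60-worker cluster never stabilizes while a 2-worker one recovers immediately, which matches the qualitative symptom described, though it proves nothing about the real cause.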

  
