Hi Rui,

We are using version 2.6.2.

I have the logs, but I cannot share the complete trace here because of privacy concerns. I have analyzed them, and the findings are below.

The supervisors tried a health check call to the affected Nimbus server and failed (3 times). They then moved on to the next seed host and succeeded. They got their assignments from the master node, but the supervisors were not able to restart the workers. I can see "Worker process exited with code" messages with codes 20, 137 and 143. After some time the supervisor process halted and started a complete restart of the service.

On Nimbus I can see that all the supervisors became blacklisted, and when they returned to a normal state they were unable to cope with the load, so they were considered dead or not alive.

Once we restarted the affected master, all supervisors across the 15 nodes restarted, and after that the cluster became stable.

Please let me know if I can help with anything else. The logs contain network/node info. I will try to replicate the issue in the stage environment and check whether I can share those logs.

Regards,
Sahil

On Wed, 17 Jul 2024 at 5:19 PM, Rui Abreu <rui.ab...@gmail.com> wrote:

> Hi Sahil,
>
> Which Storm version are you using?
> Do you have logs for Nimbus, Supervisors and Workers? If so, can you post
> the errors?
>
> Some extra documentation:
>
> https://storm.apache.org/releases/1.2.3/Daemon-Fault-Tolerance.html
> https://storm.apache.org/releases/1.2.3/nimbus-ha-design.html
>
> On Wed, 17 Jul 2024 at 05:40, Sahil Kamboj <sahilkamboj...@gmail.com>
> wrote:
>
>> Hi all
>>
>> Could somebody explain to me how Nimbus HA can be achieved? We followed
>> the official Apache Storm docs and have all the config for high
>> availability, but it does not seem to be working.
>>
>> Issue -
>>
>> Yesterday we had a master node status check failure on AWS. During this
>> window we were unable to open the Storm UI, and the topologies also went
>> into a halted state.
>> We have the min replication count set to 3 and have 3 master nodes, but
>> despite all of this the Storm process was halted.
>> To get it working we restarted the affected master node, and the
>> topology resumed automatically.
>> Shouldn't Storm do this automatically? If one of the Nimbus processes
>> is down, aren't the others there to support HA?
>>
>>
>> Please let me know if I am missing something.
>>
>> Regards
>> Sahil
>>
>
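
A note on the worker exit codes mentioned in the analysis (20, 137, 143): on Linux, an exit code above 128 conventionally means the process was terminated by signal number (code - 128). The sketch below only applies that general convention. The helper name describe_exit_code is made up for illustration and is not part of Storm, and the sketch assumes the workers run on a POSIX host. It does not say who sent the signal (for example the supervisor, an operator, or the kernel OOM killer), and it does not interpret code 20, which is below 128 and so is an application-defined status whose meaning would have to be checked in the Storm source or logs.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code using the POSIX convention 128 + signal number.

    Hypothetical helper for illustration only; not a Storm API.
    """
    if code > 128:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            # Unknown signal number on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    # Codes up to 128 are set by the process itself (application-defined).
    return f"exited on its own with status {code}"


if __name__ == "__main__":
    for code in (20, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Read this way, 137 and 143 point to workers being killed from outside (SIGKILL and SIGTERM), while 20 is the worker exiting on its own, so the worker logs just before a code 20 exit are probably the most useful ones to look at.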