Hi Rui,

We are using version 2.6.2.

I have the logs, but I cannot share the complete trace here because of
privacy concerns.

I have analysed the logs, though, and my findings are below.

The supervisors tried a health check call to the affected Nimbus server,
which failed (3 times). They then went to the next seed host and
succeeded. They got their assignments from the master node, but the
supervisors were not able to restart the workers. I can see messages like:
Worker process exited with code 20,137,143.
After some time the supervisor process halted and started a complete
restart of the service.
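
A rough reading of those exit codes, as a sketch rather than a diagnosis:
by the usual POSIX shell convention, an exit status above 128 means the
process was terminated by signal (status - 128). On that assumption 137
would be SIGKILL and 143 would be SIGTERM, while 20 is below 128 and so
would be a status the worker process set itself. The helper name
describe_exit_code is hypothetical and is not part of Storm.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code using the common 128 + signal convention.

    Assumption: the codes were reported the way a POSIX shell reports them.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited on its own with this status"


# The three codes seen in the supervisor logs
for code in (20, 137, 143):
    print(code, describe_exit_code(code))
```

If that reading holds, 137 and 143 would mean the workers were killed from
outside (by the supervisor, the OS, or an operator) rather than crashing
on their own.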

On Nimbus I can see that all the supervisors became blacklisted, and when
they returned to a normal state they were unable to cope with the load,
so they were considered dead or not alive.

Once we restarted the affected master, all supervisors across the 15
nodes restarted, and after that the cluster became stable.

Please let me know if I can help with anything else. The logs contain
network/node info. I will replicate the issue in the stage environment and
check whether I can share those logs.

Regards
Sahil

On Wed, 17 Jul 2024 at 5:19 PM, Rui Abreu <rui.ab...@gmail.com> wrote:

> Hi Sahil,
>
> Which Storm version are you using?
> Do you have logs for Nimbus, Supervisors and Workers? If so, can you post
> the errors?
>
> Some extra documentation:
>
> https://storm.apache.org/releases/1.2.3/Daemon-Fault-Tolerance.html
> https://storm.apache.org/releases/1.2.3/nimbus-ha-design.html
>
> On Wed, 17 Jul 2024 at 05:40, Sahil Kamboj <sahilkamboj...@gmail.com>
> wrote:
>
>> Hi all
>>
>> Could somebody explain to me how Nimbus HA can be achieved? We followed
>> the official Apache Storm docs and have all the config for high
>> availability, but this does not seem to be working.
>>
>> Issue -
>>
>> Yesterday we had a master node status check failure on AWS. During this
>> window we were unable to open the Storm UI, and the topologies also went
>> into a halted state.
>> We have the min replication count set to 3 and have 3 master nodes, but
>> despite all of this the Storm process was halted.
>> To get it working we restarted the affected master node, and the
>> topology resumed automatically.
>> So shouldn't Storm do this automatically? If one of the Nimbus processes
>> is down, aren't the others there to support HA?
>>
>>
>> Please let me know if I am missing something.
>>
>> Regards
>> Sahil
>>
>
