Let's imagine the following scenario:

   - the machine running one of the supervisors goes down
   - the machine running nimbus goes down

Because some workers go down as well, a few queues are no longer drained
properly, so they keep growing in size.

To avoid this situation we should rebalance the topology to distribute the
load across all of the remaining supervisors, but to do this we need nimbus
to be up and running. Moreover, basic monitoring information is not
available because the Storm UI is down as well.

My question is: what is the recommended operational procedure when the
machine running nimbus dies, and what can be done to minimize the downtime?
Should we install nimbus on a second machine and start it after the first
machine dies, similar to a failover service? Can we run more than one
nimbus? Or is there a better option?

Thanks for your help.
