You should have at least 3 ZooKeeper nodes. ZooKeeper needs a strict majority of nodes up, so a cluster of n nodes can handle losing floor((n-1)/2) of them; with 3 nodes the cluster survives the loss of 1 node, since the remaining 2 still form a majority. Add more nodes as necessary for your use case (e.g. 5 to be able to handle 2 nodes failing at the same time). Storm assumes that the ZooKeeper cluster is always available, so you should provision it to handle failures.
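To make the quorum arithmetic concrete, here's a quick sketch (plain Python, not part of Storm) showing why odd ensemble sizes are recommended:

```python
# ZooKeeper stays up only while a strict majority of nodes is alive,
# so an ensemble of n nodes tolerates losing floor((n-1)/2) of them.
def tolerated_failures(n: int) -> int:
    return (n - 1) // 2

for n in (3, 4, 5):
    print(n, "nodes tolerate", tolerated_failures(n), "failure(s)")
```

Note that 4 nodes tolerate the same single failure as 3, which is why even ensemble sizes buy you nothing.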
You might also want to add enough supervisors that there are spare worker slots for the topologies you run if one or more supervisors go offline. You can see in the UI how many slots you are using. By default each supervisor has 4 slots, but you can add more in the storm.yaml configuration if you need to.

Regarding multiple Nimbus hosts: as far as I know that mainly matters if you need the UI, or the ability to submit/kill/rebalance topologies, while you're doing maintenance. It can also matter if a supervisor dies while Nimbus is down, since the dead workers won't be reassigned until a Nimbus becomes available again. Please read https://storm.apache.org/releases/2.0.0-SNAPSHOT/nimbus-ha-design.html for how to set this up. I haven't used HA Nimbus myself, but going by the description in that document, you should be able to handle n Nimbus failures as long as you have n + 1 Nimbus nodes with the topology code. You'll also need to decide what to set topology.min.replication.count to; it configures how many Nimbus nodes need to have the topology code before a submit is considered complete.

2017-08-15 18:33 GMT+02:00 Mauro Giusti <mau...@microsoft.com>:

> So we want to keep our topologies always running –
>
> We have a production cluster hosted on K8 with 1 nimbus, 1 zookeeper, 1 UI
> and 3 supervisor containers.
>
> We are wondering whether we need 2 nimbuses and/or 2 zookeepers to make
> sure the topologies are always up when we do maintenance.
>
> We observed that restarting the nimbus does not stop the topologies from
> running (UI was not accessible though).
>
> When we restarted the zookeeper, the configuration was lost though – so we
> had to re-deploy the topologies.
>
> Any pointer to configuration for this case is appreciated -
>
> Thanks –
>
> Mauro Giusti
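P.S. For reference, a minimal storm.yaml sketch covering the settings discussed above; host names and the extra ports are placeholders you'd replace with your own:

```yaml
# ZooKeeper ensemble (3 or 5 nodes)
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"

# HA Nimbus: list every Nimbus host here
nimbus.seeds: ["nimbus1.example.com", "nimbus2.example.com"]

# How many Nimbus nodes must have the topology code
# before a submit is considered complete
topology.min.replication.count: 2

# Worker slots per supervisor (default is 4 ports, 6700-6703;
# add more ports to get more slots)
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
  - 6704
  - 6705
```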