Github user revans2 commented on a diff in the pull request:
https://github.com/apache/storm/pull/354#discussion_r22665893
--- Diff: docs/documentation/nimbus-ha-design.md ---
@@ -0,0 +1,223 @@
+#Highly Available Nimbus design proposal
+##Problem Statement:
+Currently the storm master, aka nimbus, is a process that runs on a single machine under supervision. In most cases a
+nimbus failure is transient and it is restarted by the supervisor. However, when disks fail or network partitions
+occur, nimbus goes down. Under these circumstances the topologies run normally, but no new topologies can be
+submitted, no existing topologies can be killed/deactivated/activated, and if a supervisor node fails the
+reassignments are not performed, resulting in performance degradation or topology failures. With this project we
+intend to resolve this problem by running nimbus in a primary/backup mode to guarantee that even if a nimbus server
+fails, one of the backups will take over.
+##Requirements:
+* Increase overall availability of nimbus.
+* Allow nimbus hosts to leave and join the cluster at any time. A newly joined host should automatically catch up
+and join the list of potential leaders.
+* No topology resubmissions should be required in case of nimbus failovers.
+* No active topology should ever be lost.
+
+##Leader Election:
+The nimbus server will use the following interface:
+
+```java
+public interface ILeaderElector {
+    /**
+     * Queue up for the leadership lock. The call returns immediately and the
+     * caller must check isLeader() to perform any leadership action.
+     */
+    void addToLeaderLockQueue();
+
+    /**
+     * Removes the caller from the leader lock queue. If the caller is the
+     * leader, this also releases the lock.
+     */
+    void removeFromLeaderLockQueue();
+
+    /**
+     * @return true if the caller currently holds the leader lock.
+     */
+    boolean isLeader();
+
+    /**
+     * @return the current leader's address; throws an exception if no one
+     * holds the lock.
+     */
+    InetSocketAddress getLeaderAddress();
+
+    /**
+     * @return the addresses of all current nimbus hosts, including the leader.
+     */
+    List<InetSocketAddress> getAllNimbusAddresses();
+}
+```
+On startup, nimbus will check whether it has the code for all active topologies available locally. Once it reaches
+this state it will call the addToLeaderLockQueue() function. When a nimbus is notified to become the leader it will
+check again that it has all the code locally before assuming the leadership role. If any active topology's code is
+missing, the node will not accept the leadership role; instead it will release the lock and wait until it has all
+the code before re-queueing for the leader lock.
+
+The first implementation will be Zookeeper based. If the zookeeper connection is lost or reset, resulting in loss of
+the lock or the spot in the queue, the implementation will take care of updating the state such that isLeader()
+reflects the current status. The leader-like actions must finish in less than
+minimumOf(connectionTimeout, SessionTimeout) to ensure the lock was held by nimbus for the entire duration of the
+action. (It is not clear whether we just want to state this expectation and ensure the zk timeouts are set high
+enough, which results in higher failover time, or whether we actually want to create some sort of rollback mechanism
+for all actions; the second option needs a lot of code.) If a nimbus that is not the leader receives a request that
+only a leader can perform, it will throw a RuntimeException.
--- End diff --
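Before getting to fencing: the leadership-acceptance flow in the diff (queue up, get notified, verify local code, otherwise decline and re-queue) could be sketched roughly like this. `LocalCodeStore` and `LeadershipGate` are illustrative stand-ins for this comment, not the proposed API:

```java
import java.util.List;

// Hypothetical stand-in for nimbus's local topology code storage.
interface LocalCodeStore {
    List<String> activeTopologyIds();      // topologies active in the cluster
    boolean hasCodeFor(String topologyId); // is the code present locally?
}

// Sketch of the "only accept leadership when all code is local" check.
class LeadershipGate {
    private final LocalCodeStore codeStore;
    private boolean leader = false;

    LeadershipGate(LocalCodeStore codeStore) {
        this.codeStore = codeStore;
    }

    /**
     * Called when the elector offers leadership to this nimbus. Accept only
     * if code for every active topology is available locally; otherwise
     * decline, so the lock is released and we can re-queue once caught up.
     */
    boolean onLeadershipOffered() {
        for (String id : codeStore.activeTopologyIds()) {
            if (!codeStore.hasCodeFor(id)) {
                leader = false;   // decline: release lock and re-queue later
                return false;
            }
        }
        leader = true;            // all code present: accept leadership
        return true;
    }

    boolean isLeader() {
        return leader;
    }
}
```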
For fail-over, what we really want is some fencing around the state
changes. Since most, but not all, state is stored in zookeeper, it feels like
writes to zookeeper are the proper place to put the fencing. There are a few
pieces of state that nimbus holds in memory for a short period of time, for
example topology X is about to be killed. In the first go around I think that
is fine, but it would probably be good to look at persisting them in the future.
The fencing would only really need to happen right before a write to ZK. It
would check that we are still the leader right before we persist the data; if
we are not, throw an exception and don't do the write. That should be simple
enough.
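Concretely, what I have in mind is just a leader re-check immediately before each write; something like the following sketch, where `FencedClusterState` and the in-memory map standing in for ZK are illustrative names for this comment, not actual Storm classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Fencing sketch: every state write re-checks leadership immediately
// before persisting, and refuses the write otherwise.
class FencedClusterState {
    private final BooleanSupplier isLeader;                    // e.g. elector::isLeader
    private final Map<String, String> store = new HashMap<>(); // stand-in for ZK

    FencedClusterState(BooleanSupplier isLeader) {
        this.isLeader = isLeader;
    }

    /** Persist data at a path, but only while we still hold leadership. */
    void write(String path, String data) {
        if (!isLeader.getAsBoolean()) {
            // Fencing: we lost (or never had) the lock, so this write must
            // fail rather than clobber state owned by the new leader.
            throw new RuntimeException("not leader, refusing write to " + path);
        }
        store.put(path, data);
    }

    String read(String path) {
        return store.get(path);
    }
}
```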