Github user revans2 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/354#discussion_r22665893
  
    --- Diff: docs/documentation/nimbus-ha-design.md ---
    @@ -0,0 +1,223 @@
    +#Highly Available Nimbus design proposal
    +##Problem Statement:
    +Currently the storm master, aka nimbus, is a process that runs on a single machine under supervision. In most
    +cases a nimbus failure is transient and it is restarted by the supervisor. However, sometimes when disks fail or
    +network partitions occur, nimbus goes down. Under these circumstances the existing topologies run normally, but
    +no new topologies can be submitted, no existing topologies can be killed/deactivated/activated, and if a
    +supervisor node fails then the reassignments are not performed, resulting in performance degradation or topology
    +failures. With this project we intend to resolve this problem by running nimbus in a primary-backup mode to
    +guarantee that even if a nimbus server fails, one of the backups will take over.
    +##Requirements:
    +* Increase overall availability of nimbus.
    +* Allow nimbus hosts to leave and join the cluster at will at any time. A newly joined host should automatically
    +catch up and join the list of potential leaders.
    +* No topology resubmissions required in case of nimbus failovers.
    +* No active topology should ever be lost. 
    +
    +##Leader Election:
    +The nimbus server will use the following interface:
    +
    +```java
    +public interface ILeaderElector {
    +    /**
    +     * Queue up for the leadership lock. The call returns immediately and the caller
    +     * must check isLeader() to perform any leadership action.
    +     */
    +    void addToLeaderLockQueue();
    +
    +    /**
    +     * Removes the caller from the leader lock queue. If the caller is the leader,
    +     * this also releases the lock.
    +     */
    +    void removeFromLeaderLockQueue();
    +
    +    /**
    +     *
    +     * @return true if the caller currently has the leader lock.
    +     */
    +    boolean isLeader();
    +
    +    /**
    +     *
    +     * @return the current leader's address; throws an exception if no one holds the lock.
    +     */
    +    InetSocketAddress getLeaderAddress();
    +
    +    /**
    +     * 
    +     * @return the list of all current nimbus addresses, including the leader.
    +     */
    +    List<InetSocketAddress> getAllNimbusAddresses();
    +}
    +```
    +On startup, nimbus will check if it has the code for all active topologies available locally. Once it reaches
    +this state it will call the addToLeaderLockQueue() function. When a nimbus is notified to become the leader, it
    +will check that it has all the code locally before assuming the leadership role. If any active topology code is
    +missing, the node will not accept the leadership role; instead it will release the lock and wait until it has all
    +the code before requeueing for the leader lock.
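    +
    +For illustration, this startup flow could look roughly like the sketch below. The code-availability helpers
    +(hasAllActiveTopologyCode, downloadMissingTopologyCode) are hypothetical placeholders, not an existing API.
    +
    +```java
    +// Illustrative sketch only; the two abstract helpers are hypothetical placeholders.
    +public abstract class NimbusLeadershipSketch {
    +    private final ILeaderElector elector;
    +
    +    protected NimbusLeadershipSketch(ILeaderElector elector) {
    +        this.elector = elector;
    +    }
    +
    +    /** True once the code for every active topology is available locally (placeholder). */
    +    protected abstract boolean hasAllActiveTopologyCode();
    +
    +    /** Downloads whatever active topology code is still missing (placeholder). */
    +    protected abstract void downloadMissingTopologyCode();
    +
    +    /** On startup: only queue for leadership once all active topology code is local. */
    +    public void startupAndQueueForLeadership() {
    +        while (!hasAllActiveTopologyCode()) {
    +            downloadMissingTopologyCode();
    +        }
    +        elector.addToLeaderLockQueue();
    +    }
    +
    +    /** When offered leadership: refuse it while code is missing, catch up, then requeue. */
    +    public void onLeadershipOffered() {
    +        if (!hasAllActiveTopologyCode()) {
    +            elector.removeFromLeaderLockQueue();
    +            startupAndQueueForLeadership();
    +        }
    +    }
    +}
    +```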
    +
    +The first implementation will be Zookeeper based. If the zookeeper connection is lost or reset, resulting in the
    +loss of the lock or of the spot in the queue, the implementation will take care of updating the state such that
    +isLeader() reflects the current status. The leader-like actions must finish in less than
    +minimumOf(connectionTimeout, sessionTimeout) to ensure the lock was held by nimbus for the entire duration of the
    +action. (Not sure if we want to just state this expectation and ensure that the zk configurations are set high
    +enough, which will result in higher failover time, or whether we actually want to create some sort of rollback
    +mechanism for all actions; the second option needs a lot of code.) If a nimbus that is not the leader receives a
    +request that only a leader can perform, it will throw a RuntimeException.
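    +
    +As an illustration, a leader-only request handler might guard itself roughly as follows. submitTopology here is
    +just a stand-in for any leader-only operation, and leaderElector is assumed to be the nimbus instance's
    +ILeaderElector.
    +
    +```java
    +// Sketch of a leader-only operation; the real nimbus handler code is not shown here.
    +void submitTopology(String topologyName) {
    +    if (!leaderElector.isLeader()) {
    +        throw new RuntimeException("This nimbus is not the leader, current leader is: "
    +                + leaderElector.getLeaderAddress());
    +    }
    +    // ... perform the leader-only work, keeping it shorter than
    +    // minimumOf(connectionTimeout, sessionTimeout) so the lock is held throughout.
    +}
    +```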
    --- End diff --
    
    For fail-over, what we really want is some fencing around the state changes.  Since most, but not all, state is
    stored in zookeeper, it feels like writes to zookeeper are the proper place to put the fencing.  There are a few
    pieces of state that nimbus holds in memory for a short period of time, for example topology X is about to be
    killed.  In the first go around I think that is fine, but it would probably be good to look at persisting them in
    the future.
    
    The fencing would only really need to be right before a write to ZK.  It would check that we are the leader right
    before we persist the data.  If we are not, throw an exception and don't do the write.  That should be simple
    enough.
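
    Roughly what I'm picturing (the names here are made up, not actual nimbus code): every state write to ZK re-checks
    leadership immediately before persisting.

    ```java
    // Made-up sketch of the fencing: check leadership right before every ZK write.
    void writeState(String zkPath, byte[] data) throws Exception {
        if (!leaderElector.isLeader()) {
            // We are no longer the leader; fail the operation instead of writing.
            throw new RuntimeException("Lost leadership, refusing to write " + zkPath);
        }
        zkWriter.write(zkPath, data); // stand-in for whatever does the actual ZK write
    }
    ```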

