Stefan Egli created SLING-3750:
----------------------------------

             Summary: Delay discovery-service readiness until first vote has 
finished, to avoid leader being overthrown
                 Key: SLING-3750
                 URL: https://issues.apache.org/jira/browse/SLING-3750
             Project: Sling
          Issue Type: Bug
          Components: Extensions
    Affects Versions: Discovery Impl 1.0.8
            Reporter: Stefan Egli
            Assignee: Stefan Egli
            Priority: Critical
             Fix For: Discovery Impl 1.0.10


The current implementation of discovery.impl has a subtle problem at startup. 
Consider the following scenario with two simultaneous starts:

 * two (sling) instances start at roughly the same time
 * the goal is to write a service which runs on one of the two only, ever
 * to achieve that, a TopologyEventListener is used to get hold of the 
latest TopologyView and derive whether the local instance is leader or not
 * currently, upon registration of a TopologyEventListener, a TOPOLOGY_INIT 
event is sent out immediately with the current TopologyView available
 * right after startup though - hence before the first voting has passed - 
discovery.impl considers itself to be in so-called "isolated" mode, creates a 
topology which contains only itself, and makes itself leader (since every 
cluster must have a leader)
 * that means both instances will receive that isolated view in the 
TOPOLOGY_INIT, and each is marked as leader (which is kind of right as they 
don't know about any other instance yet - but also wrong as it is not yet an 
established view)
 * at the same time, they both start voting, then find out about each other and 
establish a view where one of the two is marked as leader - hence for the other 
of the two a 'coup d'etat' is happening (the leader is overthrown even though 
the instance did not crash). 
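
The sequence above can be replayed with a small, self-contained simulation. 
This is only a sketch: SimView, SimInstance and StartupRace are hypothetical 
stand-ins and not the real org.apache.sling.discovery API, and the rule that 
the vote picks the lowest instance id is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified stand-in for a topology view (not the Sling API).
final class SimView {
    final List<String> members;
    final String leaderId;

    SimView(List<String> members, String leaderId) {
        this.members = members;
        this.leaderId = leaderId;
    }

    boolean isLeader(String instanceId) {
        return leaderId.equals(instanceId);
    }
}

// Hypothetical instance that runs a "leader-only" service based on the view.
final class SimInstance {
    final String id;
    final List<String> log = new ArrayList<>();
    boolean serviceRunning;

    SimInstance(String id) {
        this.id = id;
    }

    // Before the first vote: a view containing only itself, itself as leader.
    SimView isolatedView() {
        return new SimView(Collections.singletonList(id), id);
    }

    // What a listener would do: start or stop the service on leader status.
    void onView(String eventType, SimView view) {
        boolean leader = view.isLeader(id);
        if (leader && !serviceRunning) {
            serviceRunning = true;
            log.add(eventType + ":start");
        } else if (!leader && serviceRunning) {
            serviceRunning = false;
            log.add(eventType + ":stop");
        }
    }
}

final class StartupRace {
    // Assumption for illustration only: the vote elects the lowest id.
    static SimView vote(SimInstance a, SimInstance b) {
        String leader = a.id.compareTo(b.id) <= 0 ? a.id : b.id;
        return new SimView(Arrays.asList(a.id, b.id), leader);
    }

    // Delivers the isolated view as TOPOLOGY_INIT to both instances and
    // returns how many of them now run the supposedly single service.
    static int runningCountAfterInit(SimInstance a, SimInstance b) {
        a.onView("TOPOLOGY_INIT", a.isolatedView());
        b.onView("TOPOLOGY_INIT", b.isolatedView());
        return (a.serviceRunning ? 1 : 0) + (b.serviceRunning ? 1 : 0);
    }
}
```

In this sketch both instances start the service on TOPOLOGY_INIT, and the one 
that loses the vote has to stop it again once the established view arrives.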

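This is problematic and should be avoided: until the first vote completes, 
both instances believe they are leader, so a service meant to run on only one 
of them would briefly run on both.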

The suggested way to avoid this is to delay two things until the first voting 
has finished: the registration of the discovery.impl service with OSGi (by 
making it a @Component only and registering it as a service explicitly after 
the first voting), and the sending of TOPOLOGY_INIT.
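
The second half of that suggestion, holding back TOPOLOGY_INIT, could look 
roughly like the sketch below. It is not the actual discovery.impl code: 
InitListener, DelayedInitDiscovery and firstVoteFinished are hypothetical 
names, the view is reduced to a leader id, and the explicit OSGi service 
registration is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener; stands in for a TopologyEventListener receiving INIT.
interface InitListener {
    void onInit(String leaderId);
}

// Sketch: listeners that bind before the first vote are queued, and only
// receive their INIT once an established view (here: a leader id) exists.
final class DelayedInitDiscovery {
    private final List<InitListener> pending = new ArrayList<>();
    private String establishedLeader; // null until the first vote finished

    synchronized void bind(InitListener listener) {
        if (establishedLeader == null) {
            // No established view yet: do not send the isolated view.
            pending.add(listener);
        } else {
            listener.onInit(establishedLeader);
        }
    }

    // Hypothetical hook, called once when the first voting round completes.
    synchronized void firstVoteFinished(String leaderId) {
        if (establishedLeader != null) {
            return; // later votes would be reported as changes, not INIT
        }
        establishedLeader = leaderId;
        for (InitListener listener : pending) {
            listener.onInit(leaderId);
        }
        pending.clear();
    }
}
```

With this arrangement a listener never sees the isolated view, so it cannot 
start a leader-only service that is taken away again right after the vote.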



--
This message was sent by Atlassian JIRA
(v6.2#6252)
