[ 
https://issues.apache.org/jira/browse/CURATOR-622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443225#comment-17443225
 ] 

Tim Black commented on CURATOR-622:
-----------------------------------

In many operational environments, there is no reason to believe that the order 
in which instances contend for a latch is random, and the election shouldn't 
depend on that assumption.  The documentation for LeaderLatch says leadership is assigned 
randomly among threads/processes contending for leadership. However, there is 
nothing in the LeaderLatch code that actually generates any kind of randomness. 
In the current implementation, the LeaderLatch is operating more as a 
leadership queue than an election system, always choosing the next in line 
based on when a process joined the participant pool instead of randomly. 
Adding a random factor to the internal election process would make the 
behavior match the documentation and avoid "clumping" of leadership among 
the longest-running servers/processes.

In my particular case, we have a group of four servers contending for 
leadership of a larger number of jobs (I'll use 20 in this example), each 
delivering data to external clients through various interfaces. Because the 
jobs are managed separately, each has its own LeaderLatch associated with it. 
When enabling the jobs, we can assign them to specific servers on startup to 
balance the load, so each server will initially be leader for 5 tasks, with the 
other servers ready to take over in the event of a server outage. When applying 
patches, updates, or other routine maintenance activities, we generally do a 
rolling restart, allowing the jobs to be automatically re-distributed to the 
currently running servers, minimizing downtime. However, each time we bring a 
server down, all of its jobs go to the server that has been up the longest, 
since it has the lowest sequentially-numbered node for all of the 
LeaderLatches. After rolling through all of the servers, the first server that 
was restarted will now be leader of all 20 jobs, with the other servers idle. 
We're now forced to manually disable and re-enable most of the jobs to force 
them to redistribute among the servers, a much slower process than if the 
leadership had been randomly re-assigned.
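
As a rough illustration of the rolling-restart scenario above, here is a 
minimal sketch in plain Java. It does not use Curator or ZooKeeper; it only 
models each latch as a queue ordered by node sequence number, where the head 
of the queue is the leader. The class and method names are made up for this 
example, and the initial ordering of the non-leader participants is an 
assumption (the end result does not depend on it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model, not Curator code: one queue per latch, lowest node first.
final class RollingRestartDemo {

    /** Returns how many latches each server leads after a full rolling restart. */
    static Map<String, Integer> leadersAfterRollingRestart(int servers, int jobs) {
        List<List<String>> latches = new ArrayList<>();
        for (int j = 0; j < jobs; j++) {
            List<String> queue = new ArrayList<>();
            // Balanced start: job j is led by server (j % servers); the order
            // of the remaining participants is an assumption of this sketch.
            for (int k = 0; k < servers; k++) {
                queue.add("server-" + ((j + k) % servers));
            }
            latches.add(queue);
        }

        // Rolling restart: take each server down in turn, then bring it back.
        for (int s = 0; s < servers; s++) {
            String restarted = "server-" + s;
            for (List<String> queue : latches) {
                queue.remove(restarted); // its ephemeral node disappears
                queue.add(restarted);    // it rejoins with the highest sequence number
            }
        }

        // The head of each queue (lowest sequence number) is the leader.
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> queue : latches) {
            counts.merge(queue.get(0), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four servers, 20 jobs, as in the example above.
        System.out.println(leadersAfterRollingRestart(4, 20)); // prints {server-0=20}
    }
}
```

With four servers and 20 jobs, the first server restarted ends up at the head 
of every queue, which is the clumping described above.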

> Add Randomness to LeaderLatch Elections
> ---------------------------------------
>
>                 Key: CURATOR-622
>                 URL: https://issues.apache.org/jira/browse/CURATOR-622
>             Project: Apache Curator
>          Issue Type: Improvement
>          Components: Recipes
>            Reporter: Tim Black
>            Priority: Major
>
> Currently, LeaderLatch uses EPHEMERAL_SEQUENTIAL nodes, with the next leader 
> chosen by the lowest numbered node. In a multi-server environment where each 
> server is a participant in multiple elections, the result is that the leader 
> will always be the server that has been up the longest (or the first to be 
> restarted during a rolling restart).
> Instead of using sequentially numbered nodes, I propose that the node 
> number for a new participant be created by adding a random number (from a 
> constrained range) to the current leader number (defaults to zero). If a node 
> with that number exists, repeat until an available node is found. After 
> initial node creation, all other aspects of the leader election will remain 
> unchanged.
> I have an implementation for this that I am testing locally and will submit a 
> PR once the tests are complete.
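
The node-numbering scheme proposed in the description above could look 
roughly like the sketch below. This is not the implementation under test and 
it does not touch ZooKeeper; it uses an in-memory set of taken node numbers. 
The class name, method name, and parameters are hypothetical, "repeat" is 
interpreted here as drawing a new random offset, and offsets are assumed to 
run from 1 to the range size so a new node never collides with the leader.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the proposal, not Curator code.
final class RandomNodeNumberSketch {

    /**
     * Picks leaderNumber + a random offset in [1, range], redrawing until the
     * number is not already taken. Throws if every number in that window is taken.
     */
    static int pickNodeNumber(Set<Integer> taken, int leaderNumber, int range, Random random) {
        if (range < 1) {
            throw new IllegalArgumentException("range must be at least 1");
        }
        // Guard against an endless loop when the whole window is occupied.
        boolean anyFree = false;
        for (int offset = 1; offset <= range; offset++) {
            if (!taken.contains(leaderNumber + offset)) {
                anyFree = true;
                break;
            }
        }
        if (!anyFree) {
            throw new IllegalStateException("no free node number within range " + range);
        }
        while (true) {
            int candidate = leaderNumber + 1 + random.nextInt(range);
            if (!taken.contains(candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        TreeSet<Integer> taken = new TreeSet<>();
        Random random = new Random();
        for (int i = 0; i < 4; i++) {
            // The leader is the lowest taken number; zero when nobody has joined yet.
            int leader = taken.isEmpty() ? 0 : taken.first();
            taken.add(pickNodeNumber(taken, leader, 16, random));
        }
        System.out.println(taken); // four distinct node numbers; the lowest is the leader
    }
}
```

In a real latch the availability check and the node creation would have to be 
a single atomic create attempt against ZooKeeper rather than a lookup in a 
local set; the sketch only shows how the number is chosen.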



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
