Hello, everyone! I've been using Curator for several years now, and it has
definitely made my life much easier when it comes to ZooKeeper. The issue
I'm currently having involves the way leadership elections are determined
by the order of connection, with no randomness injected. I found a previous
discussion here, Archived Message Thread
<https://apache.markmail.org/message/4fufdan25fgczjen?q=list:curator+leaderlatch+random#query:list%3Acurator%20leaderlatch%20random+page:1+mid:dujexkniehmk4abt+state:results>,
which indicates that this is the intended behavior, but in our operational
environment, we're experiencing some problems due to this approach.

In my scenario, we have multiple software agents servicing N external
client connections. When we start up a client connection, we can target
which agent runs that particular connection on startup (a preferred
starting place). After startup, each client connection
uses a separate LeaderLatch to allow other agents to monitor the status, so
that if one agent is shut down, the connections it was servicing in theory
would spread out amongst the remaining agents. However, the behavior we
were seeing is that all connections would go to one single agent. During
patches/upgrades, we would do a rolling restart, and all connections would
end up on the first agent restarted. To rebalance, we would have to
manually shut down and re-enable individual connections, which is a much
slower process than the automatic leadership election/switch. While
researching it, I came across the thread mentioned above.

To solve the problem for myself, I modified the LeaderLatch code, moving
away from the ephemeral sequential node approach it currently uses. The latch nodes
are still named using a "latch-[number]" format, but now the number is
generated by adding a random amount from 1-50 to the current leader's
index, repeating if a node with that number already exists. This
effectively randomizes who the next leader will be across all latches,
instead of all of them being determined by connection order.
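
As a rough illustration of the index selection described above, here is a
minimal, self-contained sketch. It is not the actual patch and does not touch
ZooKeeper or Curator: the class name RandomLatchIndex, the method names, and
the in-memory set of taken indices are all hypothetical stand-ins. It also
assumes one particular reading of "repeating", namely that another random
offset is added to the rejected candidate, which guarantees the loop ends.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Curator.
final class RandomLatchIndex {
    // Upper bound of the random offset (1-50, as described in the post).
    static final int MAX_OFFSET = 50;

    // Picks an unused index strictly greater than the current leader's index.
    // 'taken' stands in for the indices of latch nodes that already exist.
    static long nextIndex(long leaderIndex, Set<Long> taken, Random rng) {
        long candidate = leaderIndex;
        do {
            // nextInt(MAX_OFFSET) yields 0..49, so the offset is 1..50.
            candidate += 1 + rng.nextInt(MAX_OFFSET);
        } while (taken.contains(candidate)); // assumption: retry by adding again
        return candidate;
    }

    // Builds the node name in the "latch-[number]" format.
    static String nodeName(long index) {
        return "latch-" + index;
    }
}
```

In a real implementation the check against existing nodes would have to be
made against ZooKeeper itself, and two participants could still pick the same
number at the same time, so node creation would need to handle that collision.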

I'm still testing this modification locally. It passes all test cases, and
I'm working on testing it in a test environment. My primary question is
whether the dev community would be interested in this implementation, either
as an update to the LeaderLatch class or as a separate recipe
(RandomLeaderLatch?). Based on the conversation linked above, I
didn't want to open an issue without discussing it here first.

Sorry if this is a bit long-winded, but I'm trying to cover both what my
use-case is for this, as well as at least a general idea of the solution
that I've implemented and am proposing for the project.

-- 
Tim Black
The Law of Software Entomology:
There is always one more bug.
