I think this sounds like an interesting feature that would be useful to the
community. Modifying the existing LeaderLatch would seem like the sensible
option to me.
cheers

On Fri, Nov 12, 2021 at 11:01 AM Tim Black <[email protected]> wrote:

> Hello, everyone! I've been using Curator for several years now, and it has
> definitely made my life much easier when it comes to ZooKeeper. The issue
> I'm currently having involves the way leadership elections are determined
> by the order of connection, with no randomness injected. I found a previous
> discussion here, Archived Message Thread
> <
> https://apache.markmail.org/message/4fufdan25fgczjen?q=list:curator+leaderlatch+random#query:list%3Acurator%20leaderlatch%20random+page:1+mid:dujexkniehmk4abt+state:results
> >,
> which indicates that this is the intended behavior, but in our operational
> environment, we're experiencing some problems due to this approach.
>
> In my scenario, we have multiple software agents servicing N external
> client connections. When we start up a client connection, we can choose
> which agent runs that particular connection at startup (a preferred
> starting place). After startup, each client connection
> uses a separate LeaderLatch to allow other agents to monitor the status, so
> that if one agent is shut down, the connections it was servicing in theory
> would spread out amongst the remaining agents. However, the behavior we
> were seeing is that all connections would go to one single agent. During
> patches/upgrades, we would do a rolling restart, and all connections would
> end up on the first agent restarted. To rebalance, we would have to
> manually shut down and re-enable individual connections, which is a much
> slower process than the automatic leadership election/switch. While
> researching it, I came across the thread mentioned above.
>
> To solve the problem for myself, I modified the LeaderLatch code, moving
> away from the ephemeral sequential node approach it uses. The latch nodes
> are still named using a "latch-[number]" format, but now the number is
> generated by adding a random amount from 1-50 to the current leader's
> index, repeating if a node with that number already exists. This
> effectively randomizes who the next leader will be across all latches,
> instead of all of them being determined by connection order.
>
> I'm still testing this modification locally. It passes all test cases, and
> I'm working on testing it in a test environment. My primary question is
> whether the dev community would be interested in this implementation,
> either as an update to the LeaderLatch class or as a separate recipe
> (RandomLeaderLatch?). Based on the conversation linked above, I
> didn't want to open an issue without discussing it here first.
>
> Sorry if this is a bit long-winded, but I'm trying to cover both what my
> use-case is for this, as well as at least a general idea of the solution
> that I've implemented and am proposing for the project.
>
> --
> Tim Black
> The Law of Software Entomology:
> There is always one more bug.
>
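
For readers following the thread, the randomized numbering described in the
quoted message could be sketched roughly as below. This is not Tim's patch and
not Curator code: the class RandomLatchIndex, the methods nextIndex and
nodeName, and the in-memory set standing in for the existing latch nodes are
all hypothetical, and the guard against a fully occupied range is an added
assumption. Only the "latch-[number]" naming and the 1-50 offset come from the
message.

```java
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Apache Curator.
final class RandomLatchIndex {
    // Largest random offset added to the current leader's index (1-50 in the message).
    static final int MAX_STEP = 50;

    // Picks an unused index in (leaderIndex, leaderIndex + MAX_STEP],
    // drawing again whenever the candidate is already taken.
    static int nextIndex(int leaderIndex, Set<Integer> taken, Random random) {
        // Assumption: fail fast if every slot in the range is occupied,
        // so the retry loop below is guaranteed to terminate.
        boolean anyFree = false;
        for (int step = 1; step <= MAX_STEP; step++) {
            if (!taken.contains(leaderIndex + step)) {
                anyFree = true;
                break;
            }
        }
        if (!anyFree) {
            throw new IllegalStateException("no free latch index in range");
        }
        int candidate;
        do {
            candidate = leaderIndex + 1 + random.nextInt(MAX_STEP); // offset 1..50
        } while (taken.contains(candidate));
        return candidate;
    }

    // Node names keep the "latch-[number]" format mentioned in the message.
    static String nodeName(int index) {
        return "latch-" + index;
    }

    public static void main(String[] args) {
        Set<Integer> taken = new TreeSet<>();
        taken.add(10); // pretend the current leader holds index 10
        Random random = new Random();
        for (int agent = 0; agent < 3; agent++) {
            int index = nextIndex(10, taken, random);
            taken.add(index);
            System.out.println("agent " + agent + " -> " + nodeName(index));
        }
    }
}
```

Assuming the lowest number still wins the election, as with the sequential
nodes, each latch would draw its own successor order, so a restarted agent
would no longer pick up every connection just because it reconnected first.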
