Hello, everyone! I've been using Curator for several years now, and it has definitely made my life much easier when it comes to ZooKeeper. The issue I'm currently having involves the way leadership elections are determined by the order of connection, with no randomness injected. I found a previous discussion here, Archived Message Thread <https://apache.markmail.org/message/4fufdan25fgczjen?q=list:curator+leaderlatch+random#query:list%3Acurator%20leaderlatch%20random+page:1+mid:dujexkniehmk4abt+state:results>, which indicates that this is the intended behavior, but in our operational environment, we're experiencing some problems due to this approach.
In my scenario, we have multiple software agents servicing N external client connections. When we start up a client connection, we can target which agent runs that particular connection on startup (a preferred starting place). After startup, each client connection uses a separate LeaderLatch to allow other agents to monitor the status, so that if one agent is shut down, the connections it was servicing should, in theory, spread out among the remaining agents.
However, the behavior we were seeing is that all connections would go to a single agent. During patches/upgrades, we would do a rolling restart, and all connections would end up on the first agent restarted. To rebalance, we would have to manually shut down and re-enable individual connections, which is a much slower process than the automatic leadership election/switch. While researching it, I came across the thread mentioned above.
To solve the problem for myself, I modified the LeaderLatch code, moving away from the ephemeral sequential node approach it uses. The latch nodes are still named using a "latch-[number]" format, but now the number is generated by adding a random amount from 1 to 50 to the current leader's index, repeating if a node with that number already exists. This effectively randomizes who the next leader will be across all latches, instead of all of them being determined by connection order.
I'm still testing this modification locally. It passes all test cases, and I'm working on testing it in a test environment. My primary question is whether the dev community would be interested in this implementation, either as an update to the LeaderLatch class or as a separate recipe (RandomLeaderLatch?). Based on the conversation linked above, I didn't want to open an issue without discussing it here first.
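For illustration, the numbering scheme described above can be sketched roughly as follows. This is a standalone simplification, not the modified LeaderLatch code itself: the class and method names (RandomLatchIndex, nextIndex, nodeName) are hypothetical, ZooKeeper is replaced by an in-memory set of taken indices, and it assumes that "repeating" means drawing a fresh random offset whenever the chosen number is already in use.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of the "leader index + random(1..50)" numbering idea.
// It does not talk to ZooKeeper; `taken` stands in for the existing latch nodes.
class RandomLatchIndex {
    static final int MAX_OFFSET = 50; // offsets are drawn from 1..50 inclusive

    // Picks an unused index in (leaderIndex, leaderIndex + MAX_OFFSET] and marks it taken.
    static int nextIndex(int leaderIndex, Set<Integer> taken, Random rng) {
        // Guard (an addition for this sketch): fail instead of looping forever
        // when every slot in the window is already in use.
        boolean anyFree = false;
        for (int offset = 1; offset <= MAX_OFFSET; offset++) {
            if (!taken.contains(leaderIndex + offset)) {
                anyFree = true;
                break;
            }
        }
        if (!anyFree) {
            throw new IllegalStateException("no free index within " + MAX_OFFSET + " of leader");
        }
        while (true) {
            // rng.nextInt(MAX_OFFSET) yields 0..49, so the offset is 1..50.
            int candidate = leaderIndex + 1 + rng.nextInt(MAX_OFFSET);
            // Set.add returns false if the index is already present: draw again.
            if (taken.add(candidate)) {
                return candidate;
            }
        }
    }

    // Node names keep the "latch-[number]" format mentioned in the post.
    static String nodeName(int index) {
        return "latch-" + index;
    }

    public static void main(String[] args) {
        Set<Integer> taken = new HashSet<>();
        Random rng = new Random();
        int leaderIndex = 100; // assumed index of the current leader's node
        taken.add(leaderIndex);
        for (int agent = 0; agent < 3; agent++) {
            System.out.println(nodeName(nextIndex(leaderIndex, taken, rng)));
        }
    }
}
```

Against a real ensemble, the in-memory check would presumably be replaced by attempting to create the ephemeral node and retrying with a new offset when ZooKeeper reports that the node already exists, so that two agents cannot claim the same number.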
Sorry if this is a bit long-winded, but I'm trying to cover both my use case and at least a general idea of the solution that I've implemented and am proposing for the project.
--
Tim Black
The Law of Software Entomology: There is always one more bug.
