Thanks, Erick. What can you tell me about this bit? I'm having trouble
making sense of it.

    if (seq <= intSeqs.get(0)) {
      if (seq == intSeqs.get(0) &&
          !context.leaderSeqPath.equals(holdElectionPath + "/" + seqs.get(0))) {
        // somebody else already became the leader with the same sequence id, not me
        log.info("was going to be leader {} , seq(0) {}",
            context.leaderSeqPath, holdElectionPath + "/" + seqs.get(0));
        // but someone else jumped the line

        // The problem is that deleting the ZK node that's watched by others
        // results in an unpredictable sequencing of the events and sometime
        // the context that comes in for checking this happens to be after
        // the node has already taken over leadership. So just leave out of here.
        // This caused one of the tests to fail on having two nodes with the
        // same name in the queue. I'm not sure the assumption that this is
        // a bad state is valid.
        if (getNodeName(context.leaderSeqPath).equals(getNodeName(seqs.get(0)))) {
          return;
        }
        retryElection(context, false); // join at the tail again
        return;
      }

We ran into this "was going to be leader" message in our logs, but it was
related to Overseer election, where there shouldn't have been a
preferredLeader. I'm struggling to put together the right mental model;
the code is really hard for me to follow.
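
To check my reading of that branch, here is a minimal standalone sketch of
the decision it seems to make. Everything in it is my own simplification and
not Solr's API: the class, the Action enum, and the helpers seqOf and
nodeNameOf are hypothetical, and I'm assuming election node names shaped like
<sessionId>-<nodeName>-n_<sequence>. It only models the comparison logic, not
the watches or ZooKeeper calls.

```java
import java.util.List;

public class HeadCheckSketch {
    /** Hypothetical outcomes; the real method returns void and acts directly. */
    public enum Action { PROCEED_AS_LEADER, ALREADY_AT_HEAD, REJOIN_AT_TAIL, NOT_AT_HEAD }

    // Assumption: names end in "-n_<digits>", e.g. "s1-nodeA-n_0000000003".
    public static int seqOf(String nameOrPath) {
        return Integer.parseInt(nameOrPath.substring(nameOrPath.lastIndexOf("-n_") + 3));
    }

    // Assumption: the node name sits between the first '-' and the "-n_" suffix.
    public static String nodeNameOf(String nameOrPath) {
        String name = nameOrPath.substring(nameOrPath.lastIndexOf('/') + 1);
        return name.substring(name.indexOf('-') + 1, name.lastIndexOf("-n_"));
    }

    /**
     * myPath: full path of my own election node.
     * sortedNames: children of electionPath, already sorted by sequence number.
     */
    public static Action decide(String myPath, String electionPath, List<String> sortedNames) {
        String head = sortedNames.get(0);
        int seq = seqOf(myPath);
        if (seq > seqOf(head)) {
            return Action.NOT_AT_HEAD; // someone with a lower sequence is ahead of me
        }
        String headPath = electionPath + "/" + head;
        if (seq == seqOf(head) && !myPath.equals(headPath)) {
            // Another entry shares my sequence number and sorts first.
            if (nodeNameOf(myPath).equals(nodeNameOf(head))) {
                return Action.ALREADY_AT_HEAD; // same node name: just leave
            }
            return Action.REJOIN_AT_TAIL; // a different node jumped the line
        }
        return Action.PROCEED_AS_LEADER;
    }

    public static void main(String[] args) {
        List<String> queue = List.of("s1-nodeA-n_0000000003", "s3-nodeC-n_0000000004");
        System.out.println(decide("/election/s2-nodeB-n_0000000003", "/election", queue));
    }
}
```

If that reading is right, the log line should only appear when two election
entries share a sequence number, which is why seeing it during Overseer
election surprised me.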
On Thu, Feb 11, 2016 at 1:19 PM, Erick Erickson <[email protected]>
wrote:
> I can talk a little about joinAtHead. That was put in there to work
> with "preferredLeader". Essentially the idea here is that when a node
> registers into the leader election queue, if it has the
> preferredLeader flag set it should watch the current leader instead of
> joining at the end of the queue.
>
> There's also logic in the leader election process whereby each node
> asks "Should I be the next leader"? One consequence of the joinAtHead
> is that if that's true, two nodes can be watching the current leader
> and both receive events if the leader goes away. So there's some
> complexity around "If I might be leader and another node was watching
> the leader node too, which of us should win?"
>
> preferredLeader is the use-case this was put in for, but it could be
> used for any generalized use-case that required controlling whether a
> node should cut into the queue at the head.
>
> I'll leave replacement to someone who knows about it.
>
> Erick
>
> On Thu, Feb 11, 2016 at 10:11 AM, Scott Blum <[email protected]>
> wrote:
> > Hi dev,
> >
> > Who's the best person to ask questions about the design of
> > LeaderElector and ElectionContext?
> >
> > I ask because I've found it to be somewhat brittle in practice. During a
> > rolling restart, it's not uncommon to get into a state where there's no
> > Overseer. I've even experienced this locally with as few as two nodes.
> > When this happens, I've tried (for example) deleting all the children
> > under /solr/overseer_elect/election. In theory, this should trigger
> > all watches on all nodes, forcing everyone to re-register and contend
> > for leadership, but in practice I haven't found this to work.
> >
> > I've been diving into the LeaderElection code, and it seems much more
> > complicated than I would have expected. Can anyone give me the theory of
> > operation, especially around the joinAtHead and replacement flags?
> >
> > Thanks!
> > Scott
> >