[jira] [Updated] (SOLR-6691) REBALANCELEADERS needs to change the leader election queue.

Erick Erickson (JIRA) Tue, 16 Dec 2014 19:49:01 -0800

     [ 
https://issues.apache.org/jira/browse/SOLR-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Erick Erickson updated SOLR-6691:
---------------------------------
    Attachment: BalanceLeaderTester.java
                SOLR-6691.patch

OK, I think this is finally working as I expect. The attached java file is a 
stand-alone program that stresses the heck out of shard leader election. The 
idea is that you fire it up against a collection and it
1> takes the initial state
2> tries to issue the preferred leader command to a random replica on each 
shard.
3> issues the rebalanceleaders command
4> verifies that every shard leader election queue has exactly one entry for 
each of the nodes that were there originally.
5> verifies that the actual leader is the preferred leader
6> goes to <2>.
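
As a rough illustration of steps 2 and 3, here is a minimal sketch of the two 
Collections API requests involved. This is not the attached 
BalanceLeaderTester.java: the class name, the helper methods and the base URL 
are assumptions made for the example, and collection, shard and replica names 
are assumed to be URL-safe.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not the attached test program.
class RebalanceRequestSketch {
    // Assumed address of a local Solr node; adjust for a real cluster.
    static final String BASE = "http://localhost:8983/solr/admin/collections";

    // Step 2: mark one replica of a shard as the preferred leader.
    static String preferredLeaderUrl(String collection, String shard, String replica) {
        return BASE + "?action=ADDREPLICAPROP"
            + "&collection=" + collection
            + "&shard=" + shard
            + "&replica=" + replica
            + "&property=preferredLeader&property.value=true";
    }

    // Step 3: ask Solr to make the preferred leaders the actual leaders.
    static String rebalanceUrl(String collection) {
        return BASE + "?action=REBALANCELEADERS&collection=" + collection;
    }

    // Picks the "random replica" of step 2 from a shard's replica names.
    static String pickRandom(List<String> replicas, Random rnd) {
        return replicas.get(rnd.nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        // Made-up replica names for a single shard.
        List<String> replicas = Arrays.asList("core_node1", "core_node2", "core_node3");
        String chosen = pickRandom(replicas, new Random());
        System.out.println(preferredLeaderUrl("collection1", "shard1", chosen));
        System.out.println(rebalanceUrl("collection1"));
    }
}
```

The sketch only builds the request URLs. Sending them, and the verification in 
steps 4 and 5 (reading the election queues and the cluster state), is left out.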

Note that the guts of this test are in the new unit test.

I had to change the leader election code to make all this predictable, and 
that makes me a little nervous given how difficult it was to get working in 
the first place. But the external test code beats _all_ the leader election 
code up pretty fiercely, which gives me hope.

So I have a couple of options here:
1> go ahead and check it in. 5.0 appears to be receding here so it has some 
time to bake before release
2> check it in to trunk and let it bake there for a while, perhaps until after 
5.0 is cut, then merge and bake.

Opinions?

> REBALANCELEADERS needs to change the leader election queue.
> -----------------------------------------------------------
>
>                 Key: SOLR-6691
>                 URL: https://issues.apache.org/jira/browse/SOLR-6691
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>         Attachments: BalanceLeaderTester.java, SOLR-6691.patch
>
>
> The original code (SOLR-6517) assumed that changes in the clusterstate after 
> issuing a command to the overseer to change the leader indicated that the 
> leader was successfully changed. Fortunately, Noble clued me in that this 
> isn't the case and that the potential leader needs to insert itself in the 
> leader election queue before triggering the change leader command.
> Inserting themselves in the front of the queue should probably happen in 
> BALANCESHARDUNIQUE when the preferredLeader property is assigned as well.
> [~noble.paul] Do evil things happen if a node joins at the head but it's 
> _already_ in the queue? These ephemeral nodes in the queue are watching each 
> other. So if node1 is the leader you have
> node1 <- node2 <- node3 <- node4
> where <- means "watches".
> Now, if node3 puts itself at the head of the list, you have
> {code}
> node1 <- node2
>       <- node3 <- node4
> {code}
> I _think_ when I was looking at this it all "just worked". 
> 1> node 1 goes down. Nodes 2 and 3 duke it out but there's code to ensure 
> that node3 becomes the leader and node2 inserts itself at the end so it's 
> watching node 4.
> 2> node 2 goes down, nobody gets notified and it doesn't matter.
> 3> node 3 goes down, node 4 gets notified and starts watching node 2 by 
> inserting itself at the end of the list.
> 4> node 4 goes down, nobody gets notified and it doesn't matter.
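
The watch relationships in the quoted description can be modeled with a small 
in-memory sketch. This is a toy model of "who watches whom" after node3 joins 
at the head, not Solr's or ZooKeeper's election code. The class and method 
names are hypothetical, and the model only answers which nodes get notified 
when a given node goes down; it does not re-queue nodes afterwards.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model for illustration only: each queue entry watches one other entry.
class WatchChainSketch {
    // Maps a watcher to the node it watches; the leader watches nobody.
    private final Map<String, String> watches = new LinkedHashMap<>();

    void watch(String watcher, String watched) {
        watches.put(watcher, watched);
    }

    // Returns the nodes whose watch fires when the given node goes away.
    List<String> notifiedWhenDown(String node) {
        List<String> notified = new ArrayList<>();
        for (Map.Entry<String, String> e : watches.entrySet()) {
            if (e.getValue().equals(node)) {
                notified.add(e.getKey());
            }
        }
        return notified;
    }

    // The state from the description after node3 puts itself at the head:
    // node2 and node3 both watch node1, and node4 still watches node3.
    static WatchChainSketch afterNode3JoinsAtHead() {
        WatchChainSketch q = new WatchChainSketch();
        q.watch("node2", "node1");
        q.watch("node3", "node1");
        q.watch("node4", "node3");
        return q;
    }

    public static void main(String[] args) {
        WatchChainSketch q = afterNode3JoinsAtHead();
        for (String node : new String[] {"node1", "node2", "node3", "node4"}) {
            System.out.println(node + " down, notified: " + q.notifiedWhenDown(node));
        }
    }
}
```

In this model nothing watches node2 or node4, which matches cases 2> and 4> 
above where nobody gets notified.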



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
