[ 
https://issues.apache.org/jira/browse/SOLR-11472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243764#comment-16243764
 ] 

Shalin Shekhar Mangar commented on SOLR-11472:
----------------------------------------------

Here's the sequence of events:

{code}
core_node3 is leader for .system collection
Test starts a new node at port 50071
NodeAddedTrigger fires and a plan is computed.
action=MOVEREPLICA&collection=.system&targetNode=127.0.0.1:50071_solr&replica=core_node3
        is processed first and core_node8 is added on port 50071
        but before it recovers fully, the leader node core_node3 is unloaded
        core_node6 becomes the leader and asks core_node8 to recover
action=MOVEREPLICA&collection=.system&targetNode=127.0.0.1:50071_solr&replica=core_node6
        now core_node6 is to be moved and core_node10 is added on port 50071
        but before it can recover, core_node6 is also unloaded
        system_shard1_replica_n2 on port 49937 becomes the leader and asks core_node8 and core_node10 to sync with it
        but before they can recover the test stops node 49937.
        The NodeLostTrigger fires and tries to create a new replica
        But leader election cannot happen because no nodes have any data and/or none of them were active before.
{code}

The crux of the issue is that MOVEREPLICA unloaded the leader before the newly 
added replica became active. Andrzej has already fixed this problem in 
SOLR-11448. The leader election failure seen in these logs is a known problem 
in SolrCloud; Mark Miller created SOLR-7065 to address the gridlock of leader 
election in such cases.
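
To make the ordering problem concrete, here is a minimal toy model in Python. It is not Solr code and not the actual SOLR-11448 change; the Replica and Shard classes, the move_replica function, the wait_for_active callback, and the "source_node" name are all hypothetical. It only illustrates the idea that the source replica should be unloaded after the newly added replica has become active, and left in place otherwise.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RECOVERING = "recovering"
    ACTIVE = "active"


@dataclass
class Replica:
    name: str
    node: str
    state: State = State.RECOVERING  # a freshly added replica still has to recover


@dataclass
class Shard:
    replicas: dict = field(default_factory=dict)

    def add(self, replica):
        self.replicas[replica.name] = replica

    def unload(self, name):
        del self.replicas[name]

    def has_active_replica(self):
        return any(r.state is State.ACTIVE for r in self.replicas.values())


def move_replica(shard, source, new_name, target_node, wait_for_active):
    """Hypothetical move: add a replica on target_node, then unload source.

    wait_for_active is a callable that returns True once the new replica is
    active, or False if it gave up waiting (an assumption of this sketch).
    """
    new = Replica(new_name, target_node)
    shard.add(new)
    if not wait_for_active(new):
        # The new replica never became active: undo the addition and keep
        # the source, so the shard still has a replica with data.
        shard.unload(new_name)
        return False
    shard.unload(source)
    return True


def recovers(replica):
    # Stand-in for a successful recovery.
    replica.state = State.ACTIVE
    return True


def never_recovers(replica):
    # Stand-in for a recovery that times out.
    return False


shard = Shard()
shard.add(Replica("core_node3", "source_node", State.ACTIVE))

# Failed recovery: the move is abandoned and core_node3 stays in place.
failed = move_replica(shard, "core_node3", "core_node8",
                      "127.0.0.1:50071_solr", never_recovers)

# Successful recovery: only now is core_node3 unloaded.
moved = move_replica(shard, "core_node3", "core_node8",
                     "127.0.0.1:50071_solr", recovers)
```

In the sequence above, the unload happened without this check, twice in a row, which is how the shard ended up with no replica that had been active.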

I'll check Jenkins again to see whether this test has failed since SOLR-11448 
was committed. If it has not, I'll close this issue.

> Leader election bug
> -------------------
>
>                 Key: SOLR-11472
>                 URL: https://issues.apache.org/jira/browse/SOLR-11472
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.1, master (8.0)
>            Reporter: Andrzej Bialecki 
>            Assignee: Shalin Shekhar Mangar
>         Attachments: 
> Console_output_of_AutoscalingHistoryHandlerTest_failure.txt
>
>
> SOLR-11407 uncovered a bug in leader election, where the same failing node is 
> retried indefinitely. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
