[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159610#comment-15159610
 ] 

Scott Blum commented on SOLR-8697:
----------------------------------

Actually, I'm stupid.  The test is still flaky because I *still* didn't fix the race 
regarding leaderZkNodeParentVersion.  I just made it harder to repro.

The smoking gun is this line:

{code}
      log.info("No version found for ephemeral leader parent node, won't remove previous leader registration.");
{code}

You can repro this pretty easily with the following change:

{code}
--- a/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
+++ b/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
@@ -193,7 +193,7 @@ class ShardLeaderElectionContextBase extends ElectionContext {
           List<OpResult> results;
           
           results = zkClient.multi(ops, true);
           
           Thread.sleep(10000);
           for (OpResult result : results) {
             if (result.getType() == ZooDefs.OpCode.setData) {
               SetDataResult dresult = (SetDataResult) result;
{code}

We need stronger synchronization between becoming leader and canceling.
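
To make that concrete, here is a minimal sketch of what I mean, under loud assumptions: this is not the actual ElectionContext code, the class and method names are hypothetical, and a plain boolean stands in for the leader znode so there is no ZooKeeper call at all. The only point is that registering as leader and recording leaderZkNodeParentVersion happen under the same lock that cancel takes, so cancel can never run in the gap between the two.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch only: this is not Solr's ElectionContext. A boolean
// stands in for the leader znode and an int for the parent node version.
class LeaderRegistrationSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private Integer leaderZkNodeParentVersion; // null until registration completes
    private boolean registered;                // stand-in for the leader znode
    private boolean canceled;

    /** Registers as leader unless cancel() already ran; returns true on success. */
    boolean becomeLeader(int parentVersion) {
        lock.lock();
        try {
            if (canceled) {
                return false; // cancel won the race, so never register
            }
            registered = true;                         // stand-in for the ZK multi op
            leaderZkNodeParentVersion = parentVersion; // recorded under the same lock
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Cancels the election; returns true if a leader registration was removed. */
    boolean cancel() {
        lock.lock();
        try {
            canceled = true;
            if (leaderZkNodeParentVersion == null) {
                return false; // nothing was registered, so nothing to remove
            }
            registered = false;
            leaderZkNodeParentVersion = null;
            return true;
        } finally {
            lock.unlock();
        }
    }

    boolean isRegistered() {
        lock.lock();
        try {
            return registered;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LeaderRegistrationSketch ctx = new LeaderRegistrationSketch();
        System.out.println("became leader: " + ctx.becomeLeader(7));
        System.out.println("removed on cancel: " + ctx.cancel());
        System.out.println("still registered: " + ctx.isRegistered());
    }
}
```

With this shape, either cancel runs first and becomeLeader refuses to register, or becomeLeader finishes first and cancel always sees a version to clean up with. Whether one coarse lock is acceptable around a real ZK round trip is a separate question.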

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>            Assignee: Mark Miller
>              Labels: patch, reliability, solrcloud
>             Fix For: master
>
>         Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, 
> SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.


