[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159610#comment-15159610 ]
Scott Blum commented on SOLR-8697:
----------------------------------

Actually, I'm stupid. The flaky problem is that I *still* didn't fix the race regarding leaderZkNodeParentVersion. I just made it harder to repro. The smoking gun is this line:

{code}
log.info("No version found for ephemeral leader parent node, won't remove previous leader registration.");
{code}

You can repro this pretty easily with the following change:

{code}
--- a/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
+++ b/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java
@@ -193,7 +193,7 @@ class ShardLeaderElectionContextBase extends ElectionContext {
       List<OpResult> results;
       results = zkClient.multi(ops, true);
+      Thread.sleep(10000);
       for (OpResult result : results) {
         if (result.getType() == ZooDefs.OpCode.setData) {
           SetDataResult dresult = (SetDataResult) result;
{code}

We need a harder synchronization around becoming leader vs. canceling.

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>            Assignee: Mark Miller
>              Labels: patch, reliability, solrcloud
>             Fix For: master
>
>         Attachments: OverseerTestFail.log, SOLR-8697-followup.patch, SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, best demonstrated by the following test:
> 1) Start up a small single-node solrcloud. It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer. The old process's ZK leader elect node has not yet disappeared, and the new process fails to set appropriate watches.
> NOTE: this is only reproducible if the new node is able to start up and join the election quickly.
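
For illustration only, here is a minimal sketch of what "harder synchronization around becoming leader vs. canceling" could look like. It is not Solr's ElectionContext code: the class, fields, and methods below are hypothetical, and the ZooKeeper calls are replaced by plain field updates. The point is that registering and recording the parent-node version happen in one critical section, so a cancel never observes a registration without a version.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical stand-in for leader registration state. The only idea taken
 * from the comment is that a parent-node version must be recorded before
 * cancel can safely remove the leader registration.
 */
class LeaderRegistrationSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private Integer parentVersion;   // null until a registration records it
    private boolean registered;      // simulates the leader node existing in ZK
    private boolean cancelled;

    /** Register and record the version as one atomic step. */
    boolean becomeLeader(int versionFromZk) {
        lock.lock();
        try {
            if (cancelled) {
                return false;        // cancel ran first, so never register
            }
            registered = true;       // stands in for the multi op (assumption)
            parentVersion = versionFromZk; // recorded before cancel can run
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Cancel sees either no registration, or a registration with a version. */
    void cancel() {
        lock.lock();
        try {
            cancelled = true;
            if (registered) {
                // parentVersion is non-null here because of the shared lock
                registered = false;  // stands in for the versioned delete (assumption)
                parentVersion = null;
            }
        } finally {
            lock.unlock();
        }
    }

    boolean isRegistered() {
        lock.lock();
        try {
            return registered;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch the sleep from the repro diff would fall inside the locked region, so a concurrent cancel would wait instead of finding no version. Holding a lock across a real ZooKeeper round trip has its own costs, so treat this as a way to picture the race rather than a proposed patch.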