[ https://issues.apache.org/jira/browse/SOLR-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219535#comment-14219535 ]
Mark Miller commented on SOLR-6763:
-----------------------------------

bq. and another spawned by the ReconnectStrategy.

Hmm... this sounds fishy. We should not be spawning any new election thread on ConnectionLoss - only on Expiration.

> Shard leader election thread can persist across connection loss
> ---------------------------------------------------------------
>
>          Key: SOLR-6763
>          URL: https://issues.apache.org/jira/browse/SOLR-6763
>      Project: Solr
>   Issue Type: Bug
>     Reporter: Alan Woodward
>  Attachments: SOLR-6763.patch
>
> A ZK connection loss during a call to ElectionContext.waitForReplicasToComeUp() will result in two leader election processes for the shard running within a single node: the initial election that was waiting, and another spawned by the ReconnectStrategy. After the function returns, the first election will create an ephemeral leader node. The second election will then also attempt to create this node, fail, and try to put itself into recovery. It will also set the 'isLeader' value in its CloudDescriptor to false.
>
> The first election, meanwhile, is happily maintaining the ephemeral leader node. But any updates sent to the shard will cause an exception, due to the mismatch between the cloud state (where this node is the leader) and the local CloudDescriptor leader state.
>
> I think the fix is straightforward: the call to zkClient.getChildren() in waitForReplicasToComeUp should be made with 'retryOnReconnect=false', rather than 'true' as it is currently, because once the connection has dropped we're going to launch a new election process anyway.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
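The race described above comes down to two election threads competing to create one ephemeral leader znode: only the first creation succeeds, but the loser then flips its local isLeader flag, diverging from the cluster state. The following is a minimal, self-contained sketch of that race, not Solr's actual code: LeaderRaceSketch, tryBecomeLeader, and the AtomicReference standing in for the ZK znode are all illustrative names.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hedged sketch of the SOLR-6763 race: two election attempts compete to
// create a single "ephemeral leader node". An AtomicReference stands in
// for the ZooKeeper znode; compareAndSet models the create-if-absent
// semantics of an ephemeral node creation.
public class LeaderRaceSketch {
    // null means no leader node exists yet.
    private static final AtomicReference<String> leaderNode = new AtomicReference<>();

    /** Returns true only for the first election that creates the node. */
    public static boolean tryBecomeLeader(String electionId) {
        return leaderNode.compareAndSet(null, electionId);
    }

    public static void main(String[] args) {
        // The election that was blocked in waitForReplicasToComeUp():
        boolean first = tryBecomeLeader("election-1");
        // The second election, spawned after the connection loss,
        // attempts the same node, fails, and would then mark its local
        // CloudDescriptor isLeader=false - the mismatch in the report.
        boolean second = tryBecomeLeader("election-2");
        System.out.println("first=" + first + " second=" + second);
    }
}
```

The proposed fix avoids the second creation attempt at the source: with retryOnReconnect=false, the getChildren() call in waitForReplicasToComeUp() fails fast on connection loss instead of transparently retrying, so the stale first election dies and only the freshly spawned one proceeds.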