[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154740#comment-15154740
 ] 

Mark Miller commented on SOLR-8697:
-----------------------------------

bq. TBH, the code is pretty hard to follow in its existing form

Yup. It was mildly hairy in its first form (copying the ZK recipe as described) 
and took a while to harden. Then some contributions came in that made it 
insane to follow. I've brought this up before: instead of trying to avoid 
thundering herd issues with what will be a reasonably low number of replicas 
trying to be leader, we probably should just have very simple leader elections. 
All of the original logic, and the logic added later that made it really 
hard for me to follow, would be really simple if we gave up the cool, elegant 
approach we used to avoid a mostly nonexistent thundering herd issue. That 
thicket is a ripe breeding ground for random bugs that our tests just don't 
easily expose.

At this point, though, the effort to change it reliably is probably really high.
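
To make the trade-off concrete, here is a minimal sketch of the two watch layouts. It is a plain in-memory model, not Solr's LeaderElector and not the ZooKeeper client API; the class name, method names, and the simplified node names are all hypothetical. In the recipe style each candidate watches only the node just ahead of it, so one deletion notifies one watcher. In the simple style every candidate watches the leader, so a leader deletion notifies everyone (the "herd"), which stays small when only a handful of replicas compete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of election watch layouts (not Solr code).
public class ElectionWatchSketch {

    // Recipe style: each candidate watches the node immediately ahead of it.
    // The lowest node is the leader and watches nothing.
    static Map<String, String> predecessorWatches(List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        Collections.sort(sorted); // zero-padded sequence numbers sort lexicographically
        Map<String, String> watches = new LinkedHashMap<>();
        for (int i = 1; i < sorted.size(); i++) {
            watches.put(sorted.get(i), sorted.get(i - 1));
        }
        return watches;
    }

    // Simple style: every non-leader candidate watches the leader node.
    static Map<String, String> leaderWatches(List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        Collections.sort(sorted);
        Map<String, String> watches = new LinkedHashMap<>();
        for (int i = 1; i < sorted.size(); i++) {
            watches.put(sorted.get(i), sorted.get(0));
        }
        return watches;
    }

    // Counts how many candidates are notified when the given node is deleted.
    static int notifiedOnDelete(Map<String, String> watches, String deleted) {
        int notified = 0;
        for (String target : watches.values()) {
            if (target.equals(deleted)) {
                notified++;
            }
        }
        return notified;
    }

    public static void main(String[] args) {
        // Simplified, made-up node names; real election node names differ.
        List<String> nodes = Arrays.asList(
                "n_0000000001", "n_0000000002", "n_0000000003", "n_0000000004");
        String leader = "n_0000000001";
        System.out.println("recipe style notified: "
                + notifiedOnDelete(predecessorWatches(nodes), leader));
        System.out.println("simple style notified: "
                + notifiedOnDelete(leaderWatches(nodes), leader));
    }
}
```

The simple layout trades a few extra notifications per leader change for much less bookkeeping: there is no chain of watches to repair when a node in the middle goes away.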

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become Overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.
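
The scenario in the description can be modelled with a toy in-memory election. This is only a sketch under stated assumptions: it does not use ZooKeeper or Solr's LeaderElector, and every name in it (StaleNodeSketch, join, expire, the process labels) is hypothetical. It shows why the watch matters: the new process joins while the killed process's node still exists, so it can only take over if it is notified when that stale node finally disappears.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of an election with a lingering stale node (not Solr code).
public class StaleNodeSketch {
    // Election nodes keyed by sequence number; the lowest sequence is the leader.
    private final TreeMap<Integer, String> electionNodes = new TreeMap<>();
    // Maps a watched sequence number to the owner waiting on its deletion.
    private final Map<Integer, String> watches = new HashMap<>();
    private int nextSeq = 0;
    private String leader;

    // Joins the election; optionally watches the node directly ahead.
    int join(String owner, boolean setWatch) {
        int seq = nextSeq++;
        electionNodes.put(seq, owner);
        Integer ahead = electionNodes.lowerKey(seq);
        if (ahead == null) {
            leader = owner;            // nobody ahead: this owner leads
        } else if (setWatch) {
            watches.put(ahead, owner); // wait for the node ahead to go away
        }
        return seq;
    }

    // Models the session timeout that finally removes an ephemeral node.
    void expire(int seq) {
        String owner = electionNodes.remove(seq);
        if (owner != null && owner.equals(leader)) {
            leader = null;             // the old leader is gone
        }
        String watcher = watches.remove(seq);
        if (watcher != null && !electionNodes.isEmpty()
                && electionNodes.firstEntry().getValue().equals(watcher)) {
            leader = watcher;          // notified watcher is now first in line
        }
    }

    String leader() {
        return leader;
    }

    // Runs the kill -9 and fast-restart scenario and returns the final leader.
    static String run(boolean newProcessSetsWatch) {
        StaleNodeSketch election = new StaleNodeSketch();
        int oldSeq = election.join("old-process", true); // becomes leader
        // kill -9 happens here: the process dies but its node lingers.
        election.join("new-process", newProcessSetsWatch);
        election.expire(oldSeq);                         // session times out
        return election.leader();
    }

    public static void main(String[] args) {
        System.out.println("with watch:    " + run(true));
        System.out.println("without watch: " + run(false));
    }
}
```

In this model the run with the watch ends with the new process as leader, while the run without it ends with no leader at all, which mirrors the symptom described in step 3.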


