[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154888#comment-15154888
 ] 

Scott Blum commented on SOLR-8697:
----------------------------------

Yeah, totally agreed on refactoring and trying to fix core bugs!  Bringing in 
Curator at some point would be something I'd only advocate for incrementally 
and in pieces, like replacing our DQ with Curator's, etc.  Moving everything 
over in a short period of time would be a pipe dream anyway.

Back on the topic of LeaderElector, I think this patch is in a pretty good 
state now.  The only thing I want to consider doing in the short term (after 
this patch) is that, in addition to watching the node ahead of you, I think we 
should also be watching our own node, whether or not we're leader.  If an 
outside party forcibly deletes our node, we should put ourselves at the back of 
the line.  If you think about it, if we could trust that behavior, something 
like RebalanceLeaders wouldn't even need to be a distributed request; overseer 
could just delete the current leader elect node and trust the owner to do the 
right thing.
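
To make the "watch our own node" idea concrete, here is a rough sketch.  It 
is a toy in-memory model, not the ZooKeeper API and not the code in the patch; 
ElectionSketch, join, forceDelete, and line are hypothetical names made up for 
illustration.  It only shows the intended behavior: when an outside party 
deletes a participant's node, the owner reacts by rejoining at the back of the 
line, and the next participant in line becomes leader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy in-memory stand-in for an election directory (assumption: not real ZK). */
class ElectionSketch {
    // sequence number -> participant id, mimicking ephemeral sequential nodes
    private final TreeMap<Long, String> nodes = new TreeMap<>();
    private long nextSeq = 0;

    /** Joins at the back of the line and returns the new node's sequence number. */
    long join(String participant) {
        long seq = nextSeq++;
        nodes.put(seq, participant);
        return seq;
    }

    /** The participant holding the lowest sequence number is the leader. */
    String leader() {
        return nodes.isEmpty() ? null : nodes.firstEntry().getValue();
    }

    /**
     * Simulates an outside party deleting a node. The owner is assumed to be
     * watching its own node, so it reacts by rejoining at the back of the line.
     * Returns the owner's new sequence number.
     */
    long forceDelete(long seq) {
        String owner = nodes.remove(seq);
        if (owner == null) {
            throw new IllegalArgumentException("no such node: " + seq);
        }
        return join(owner);
    }

    /** Current election order, front of the line first. */
    List<String> line() {
        return new ArrayList<>(nodes.values());
    }
}
```

With participants A, B, and C joined in that order, force-deleting A's node 
leaves the line as B, C, A.  That is the behavior something like 
RebalanceLeaders could rely on without a distributed request.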

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud.  It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer.  The old process's ZK leader elect 
> node has not yet disappeared, and the new process fails to set appropriate 
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.
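
As a rough illustration of the watch selection the scenario above depends on, 
here is a minimal sketch.  It is not the patch itself, and the node naming (a 
numeric suffix after "n_") plus the names WatchTargetSketch, seq, and 
nodeToWatch are assumptions made for the example.  The point is that a newly 
joined participant should watch whichever node immediately precedes its own, 
even when that node is a stale one left behind by a killed process.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WatchTargetSketch {
    /** Parses the numeric suffix after the last "n_" (naming scheme is an assumption). */
    static long seq(String node) {
        return Long.parseLong(node.substring(node.lastIndexOf("n_") + 2));
    }

    /**
     * Returns the node that myNode should watch: its immediate predecessor in
     * sequence order, or null if myNode is first in line (i.e. it is the leader).
     */
    static String nodeToWatch(List<String> children, String myNode) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingLong(WatchTargetSketch::seq));
        int idx = sorted.indexOf(myNode);
        if (idx < 0) {
            // Our own node is gone; the caller would need to rejoin the election.
            throw new IllegalStateException("own node missing: " + myNode);
        }
        return idx == 0 ? null : sorted.get(idx - 1);
    }
}
```

In the kill -9 case, the children would briefly contain both the old process's 
node and the new one; the new process should watch the old node and take over 
once it disappears.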



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
