[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154888#comment-15154888 ]
Scott Blum commented on SOLR-8697:
----------------------------------

Yeah, totally agreed on refactoring and trying to fix core bugs! Bringing in Curator at some point is something I'd only advocate doing incrementally and in pieces, e.g. replacing our DQ with Curator's. Moving everything over in a short period of time would be a pipe dream anyway.

Back on the topic of LeaderElector, I think this patch is in a pretty good state now. The only thing I want to consider doing in the short term (after this patch) is that, in addition to watching the node ahead of us, I think we should also watch our own node, whether or not we're leader. If an outside party forcibly deletes our node, we should put ourselves at the back of the line. If you think about it, if we could trust that behavior, something like RebalanceLeaders wouldn't even need to be a distributed request; the overseer could just delete the current leader elect node and trust the owner to do the right thing.

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>   Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector,
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud. It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer. The old process's ZK leader elect
> node has not yet disappeared, and the new process fails to set appropriate
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join
> the election quickly.
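To make the "watch our own node" idea from the comment above concrete, here is a rough in-memory sketch. It does not use ZooKeeper or any Solr class; ElectionSketch and all of its methods are hypothetical names made up for illustration, and a real implementation would react to a ZK watch event rather than a direct method call. It only models the intended behavior: the lowest sequence number leads, each participant watches the node just ahead of it, and a participant whose node is forcibly deleted rejoins at the back of the line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for an election directory of sequential nodes. */
class ElectionSketch {
    // sequence number -> participant id; the lowest sequence number is the leader
    private final TreeMap<Integer, String> queue = new TreeMap<>();
    private int nextSeq = 0;

    /** Join at the back of the line, like creating a new sequential node. */
    int join(String participant) {
        int seq = nextSeq++;
        queue.put(seq, participant);
        return seq;
    }

    /** Owner of the lowest-numbered node, or null if nobody has joined. */
    String leader() {
        return queue.isEmpty() ? null : queue.firstEntry().getValue();
    }

    /** The node a participant would watch: the one just ahead of it (null for the leader). */
    Integer predecessorOf(int seq) {
        return queue.lowerKey(seq);
    }

    /**
     * Simulates an outside party (for example the overseer) deleting a node.
     * The owner's self-watch fires and it rejoins at the back of the line.
     * Returns the owner's new sequence number.
     */
    int forceDelete(int seq) {
        String owner = queue.remove(seq);
        if (owner == null) {
            throw new IllegalArgumentException("no such node: " + seq);
        }
        return onOwnNodeDeleted(owner);
    }

    // Assumed reaction to the self-watch firing: simply rejoin the election.
    private int onOwnNodeDeleted(String owner) {
        return join(owner);
    }

    /** Current election order, leader first. */
    List<String> order() {
        return new ArrayList<>(queue.values());
    }

    public static void main(String[] args) {
        ElectionSketch election = new ElectionSketch();
        int a = election.join("A");
        election.join("B");
        election.forceDelete(a); // e.g. a leader rebalance done by deleting the leader's node
        System.out.println(election.order());
    }
}
```

If participants could be trusted to behave like onOwnNodeDeleted, a leader rebalance would reduce to the single forceDelete call on the current leader's node, which is the simplification the comment is hoping for.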