[ https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154817#comment-15154817 ]
Scott Blum edited comment on SOLR-8697 at 2/19/16 8:31 PM:
-----------------------------------------------------------

I think part of the general problem with a lot of the ZK-interacting code is a lack of clean separation of concerns. The relationships between LeaderElector and the various ElectionContext subclasses are pretty gnarly and incestuous. DistributedQueue had a similar kind of design problem before I extracted the app-specific gnarly parts into OverseerTaskQueue.

Have we considered trying to migrate to, say, Apache Curator (full disclosure: I'm a committer)? There are a lot of advantages to using third-party libs for some of these common patterns like distributed queue, leader election, or even observing changes in a tree. Those components tend to be reusable, better documented, with cleaner APIs, and have a natural resistance to spaghetti invasion. (Examples: OverseerNodePrioritizer and RebalanceLeaders are intricately tied to implementation details of LeaderElector.) A clean, reusable leader election component (with its own tests) that could simply be used in a few different contexts seems like a good place to be longer term.

That said, I hope this patch can simply clean up some of the existing bugs without being too disruptive.

> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>   Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector,
> best demonstrated by the following test:
> 1) Start up a small single-node SolrCloud. It should become Overseer.
> 2) kill -9 the SolrCloud process and immediately start a new one.
> 3) The new process won't become Overseer. The old process's ZK leader elect
> node has not yet disappeared, and the new process fails to set appropriate
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join
> the election quickly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)