[
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154817#comment-15154817
]
Scott Blum edited comment on SOLR-8697 at 2/19/16 8:31 PM:
-----------------------------------------------------------
I think part of the general problem with a lot of the ZK-interacting code is a
lack of clean separation of concerns. The relationships between LeaderElector
and the various ElectionContext subclasses are pretty gnarly and incestuous.
DistributedQueue had a similar kind of design problem before I extracted the
app-specific gnarly parts into OverseerTaskQueue.
Have we considered trying to migrate to, say, Apache Curator (full disclosure:
I'm a committer)? There are a lot of advantages to using third-party libs for
some of these common patterns like distributed queue, leader election, or even
observing changes in a tree. Those components tend to be reusable and better
documented, with cleaner APIs, and they have a natural resistance to spaghetti
invasion. (Examples: OverseerNodePrioritizer and RebalanceLeaders are
intricately tied to implementation details of LeaderElector.)
A clean, reusable leader election component (with its own tests) that could
simply be used in a few different contexts seems like a good place to be longer
term.
That said, I hope this patch can simply clean up some of the existing bugs
without being too disruptive.
> Fix LeaderElector issues
> ------------------------
>
> Key: SOLR-8697
> URL: https://issues.apache.org/jira/browse/SOLR-8697
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 5.4.1
> Reporter: Scott Blum
> Labels: patch, reliability, solrcloud
> Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector,
> best demonstrated by the following test:
> 1) Start up a small single-node solrcloud. It should become Overseer.
> 2) kill -9 the solrcloud process and immediately start a new one.
> 3) The new process won't become overseer. The old process's ZK leader elect
> node has not yet disappeared, and the new process fails to set appropriate
> watches.
> NOTE: this is only reproducible if the new node is able to start up and join
> the election quickly.
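
To make the reproduction steps above concrete, here is a minimal in-memory
sketch of the usual "watch your predecessor" election pattern. It is not Solr's
LeaderElector and it does not talk to ZooKeeper: ElectionSketch, join, expire,
and leader are hypothetical names invented for illustration, and the TreeMap
only stands in for ephemeral sequential nodes. It shows the behavior the test
expects: a candidate that joins while a stale node still exists has to set a
watch on that node, so that it takes over once the old session expires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for an election directory; not Solr code. */
class ElectionSketch {
    // Sequence number -> owner, mimicking ephemeral sequential nodes.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();
    // Watched sequence number -> callbacks to run when that node is deleted.
    private final Map<Integer, List<Runnable>> watches = new HashMap<>();
    private int nextSeq = 0;
    private String leader = null;

    /** Create a node for the candidate, then run the leadership check. */
    int join(String owner) {
        int seq = nextSeq++;
        nodes.put(seq, owner);
        check(seq);
        return seq;
    }

    /** Lead if we hold the lowest node; otherwise watch the node just before ours. */
    private void check(int seq) {
        if (!nodes.containsKey(seq)) {
            return; // our own node is gone, nothing to do
        }
        Integer predecessor = nodes.lowerKey(seq);
        if (predecessor == null) {
            leader = nodes.get(seq);
        } else {
            // This watch must be set even when the predecessor is a stale node
            // left behind by a killed process.
            watches.computeIfAbsent(predecessor, k -> new ArrayList<>())
                   .add(() -> check(seq));
        }
    }

    /** Simulate session expiry: delete the node and fire its watches. */
    void expire(int seq) {
        String owner = nodes.remove(seq);
        if (owner != null && owner.equals(leader)) {
            leader = null;
        }
        List<Runnable> fired = watches.remove(seq);
        if (fired != null) {
            fired.forEach(Runnable::run);
        }
    }

    String leader() {
        return leader;
    }

    public static void main(String[] args) {
        ElectionSketch zk = new ElectionSketch();
        int oldNode = zk.join("old-process"); // step 1: becomes leader
        zk.join("new-process");               // step 2: old node still present
        zk.expire(oldNode);                   // old session finally times out
        System.out.println(zk.leader());      // prints "new-process"
    }
}
```

If the watch in check() were skipped for the new candidate, nothing would
re-run the check when the old node disappears, and the election would be left
with no leader, which matches the symptom in step 3.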