[ 
https://issues.apache.org/jira/browse/SOLR-8697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154817#comment-15154817
 ] 

Scott Blum edited comment on SOLR-8697 at 2/19/16 8:31 PM:
-----------------------------------------------------------

I think part of the general problem with a lot of the ZK-interacting code is a 
lack of clean separation of concerns.  The relationships between LeaderElector 
and the various ElectionContext subclasses are pretty gnarly and incestuous.  
DistributedQueue had a similar kind of design problem before I extracted the 
app-specific gnarly parts into OverseerTaskQueue.

Have we considered trying to migrate to, say, Apache Curator (full disclosure: 
I'm a committer)?  There are a lot of advantages to using third-party libs for 
some of these common patterns like distributed queue, leader election, or even 
observing changes in a tree.  Those components tend to be reusable and better 
documented, have cleaner APIs, and naturally resist spaghetti invasion.  
(Examples of such coupling in the current code: OverseerNodePrioritizer and 
RebalanceLeaders are intricately tied to implementation details of 
LeaderElector.)

A clean, reusable leader election component (with its own tests) that could 
simply be used in a few different contexts seems like a good place to be longer 
term.

That said, I hope this patch can simply clean up some of the existing bugs 
without being too disruptive.



> Fix LeaderElector issues
> ------------------------
>
>                 Key: SOLR-8697
>                 URL: https://issues.apache.org/jira/browse/SOLR-8697
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.4.1
>            Reporter: Scott Blum
>              Labels: patch, reliability, solrcloud
>         Attachments: SOLR-8697.patch
>
>
> This patch is still somewhat WIP for a couple of reasons:
> 1) Still debugging test failures.
> 2) This will need more scrutiny from knowledgeable folks!
> There are some subtle bugs with the current implementation of LeaderElector, 
> best demonstrated by the following test:
> 1) Start up a small single-node SolrCloud.  It should become Overseer.
> 2) kill -9 the SolrCloud process and immediately start a new one.
> 3) The new process won't become Overseer.  The old process's ZK leader 
> election node has not yet disappeared, and the new process fails to set 
> appropriate watches.
> NOTE: this is only reproducible if the new node is able to start up and join 
> the election quickly.
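
For readers unfamiliar with the failure mode in the reproduction steps, here is 
a minimal in-memory sketch of the usual ZooKeeper-style election pattern: each 
candidate creates a sequential node and watches its immediate predecessor. It 
is not Solr's LeaderElector, and it does not use the ZooKeeper or Curator APIs. 
FakeElectionDir, Candidate, and ElectionDemo are hypothetical names invented 
for illustration, and session expiry is simulated by an explicit expire() call. 
The point it tries to show is that a candidate that joins while a dead 
process's node still exists has to watch that node and re-check when it goes 
away.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the election directory (not a ZK API). */
class FakeElectionDir {
    private final TreeMap<Integer, String> nodes = new TreeMap<>(); // seq -> owner
    private final Map<Integer, List<Runnable>> deleteWatches = new HashMap<>();
    private int nextSeq = 0;

    /** Create a simulated "ephemeral sequential" node; returns its sequence number. */
    int create(String owner) {
        int seq = nextSeq++;
        nodes.put(seq, owner);
        return seq;
    }

    /** Highest existing sequence number below seq, or null if there is none. */
    Integer predecessorOf(int seq) {
        return nodes.lowerKey(seq);
    }

    /** Register a one-shot delete watch; returns false if the node is already gone. */
    boolean watchDelete(int seq, Runnable watcher) {
        if (!nodes.containsKey(seq)) {
            return false;
        }
        deleteWatches.computeIfAbsent(seq, k -> new ArrayList<>()).add(watcher);
        return true;
    }

    /** Simulate session expiry: remove the node, then fire its watches. */
    void expire(int seq) {
        nodes.remove(seq);
        List<Runnable> watchers = deleteWatches.remove(seq);
        if (watchers != null) {
            watchers.forEach(Runnable::run);
        }
    }
}

/** One election participant. */
class Candidate {
    private final FakeElectionDir dir;
    private final int mySeq;
    private boolean leader = false;

    Candidate(String name, FakeElectionDir dir) {
        this.dir = dir;
        this.mySeq = dir.create(name);
        checkLeadership();
    }

    /** Lead if no lower node exists; otherwise watch the predecessor. */
    private void checkLeadership() {
        Integer pred = dir.predecessorOf(mySeq);
        if (pred == null) {
            leader = true;
            return;
        }
        // If the predecessor vanished between lookup and watch, look again.
        if (!dir.watchDelete(pred, this::checkLeadership)) {
            checkLeadership();
        }
    }

    boolean isLeader() { return leader; }
    int seq() { return mySeq; }
}

class ElectionDemo {
    public static void main(String[] args) {
        FakeElectionDir dir = new FakeElectionDir();
        Candidate old = new Candidate("old-process", dir);
        // kill -9: the old process is gone, but its node lingers until expiry.
        Candidate fresh = new Candidate("new-process", dir);
        System.out.println("before expiry: " + fresh.isLeader()); // false
        dir.expire(old.seq());
        System.out.println("after expiry: " + fresh.isLeader());  // true
    }
}
```

In this sketch, dropping the watchDelete call in checkLeadership leaves the new 
candidate stuck as a non-leader after the old node expires, which resembles the 
symptom described in step 3.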


