[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926204#comment-15926204 ]

Joshua Humphries edited comment on SOLR-7191 at 3/15/17 1:51 PM:
-----------------------------------------------------------------

Our cluster has many thousands of collections, most of which have only a 
single shard and a single replica. Restarting a single node takes over two 
minutes under good circumstances (an expected restart, e.g. during upgrades of 
Solr or deployment of new/updated plugins). Under bad circumstances, such as 
when machines appear wedged and leader-election issues have already caused the 
overseer queue to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size during our latest observation of this 
slowness, I saw that the down-node messages take *way* too long to process. I 
tracked that down to an issue where processing a down-node message results in 
a ZK write for *every* collection, not just the collections that had shard 
replicas on that node. In our case, each down-node message was processing 
about 40 times more collections than necessary, so every node restart touches 
every collection and a rolling restart of the whole cluster is effectively 
O(n^2) instead of O(n) in terms of the writes to ZK.
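
To make the distinction concrete, here is a rough, purely illustrative sketch 
(not Solr's actual Overseer code; the class and method names below are 
hypothetical) of restricting the ZK state writes for a down-node event to just 
the collections that actually have replicas on that node:

{code:java}
// Hypothetical sketch only -- these are not Solr's real cluster-state classes.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DownNodeSketch {

    static class Replica {
        final String nodeName;
        Replica(String nodeName) { this.nodeName = nodeName; }
    }

    static class CollectionState {
        final String name;
        final List<Replica> replicas;
        CollectionState(String name, List<Replica> replicas) {
            this.name = name;
            this.replicas = replicas;
        }
        boolean hasReplicaOn(String node) {
            return replicas.stream().anyMatch(r -> r.nodeName.equals(node));
        }
    }

    // Only these collections need a state write to ZK when 'downedNode' goes
    // away; writing state for every collection in the cluster is what makes
    // the down-node message so expensive.
    static List<String> collectionsNeedingUpdate(Map<String, CollectionState> clusterState,
                                                 String downedNode) {
        return clusterState.values().stream()
                .filter(c -> c.hasReplicaOn(downedNode))
                .map(c -> c.name)
                .collect(Collectors.toList());
    }
}
{code}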

See SOLR-10277.


> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Noble Paul
>              Labels: performance, scalability
>             Fix For: 6.3
>
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, 
> SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.


