[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926204#comment-15926204 ]
Joshua Humphries commented on SOLR-7191:
----------------------------------------

Our cluster has many thousands of collections, most of which have only a single shard and a single replica. Restarting a single node takes over two minutes in good circumstances (an expected restart, such as during upgrades of Solr or deployment of new/updated plugins). In bad circumstances, such as when machines appear wedged and leader election issues have already caused the overseer queue to grow large, restarting a server can take over 10 minutes!

While watching the overseer queue size during our latest observation of this slowness, I saw that the down node messages take *way* too long to process. I tracked that to an issue where each down node message results in a ZK write for *every* collection, not just the collections that had shard-replicas on that node. In our case, it was processing about 40 times too many collections, making a rolling restart of the whole cluster effectively O(n^2) instead of O(n) in terms of the writes to ZK.

See SOLR-10277.

> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Noble Paul
>              Labels: performance, scalability
>             Fix For: 6.3
>
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup. Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability. It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
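
For illustration only, the O(n^2) versus O(n) write count described in the comment above can be sketched with a toy model. This is not Solr code: the function names, the 40-node cluster size, the 4000-collection count, and the round-robin placement are all assumptions chosen so that the ratio comes out near the "about 40 times" figure mentioned in the comment. It assumes one ZK write per collection touched by each down node message.

```python
# Toy model (hypothetical, not Solr code): count ZK state writes caused by
# down node messages during a rolling restart of every node in a cluster.

def build_cluster(num_nodes, num_collections):
    """Assumed layout: one shard, one replica per collection, placed round-robin."""
    cluster = {f"node{i}": set() for i in range(num_nodes)}
    for c in range(num_collections):
        cluster[f"node{c % num_nodes}"].add(f"coll{c}")
    return cluster

def rolling_restart_writes(replicas_by_node, per_node_only):
    """Sum the writes over one down node message per restarted node.

    per_node_only=False models the reported behavior (every collection is
    written); per_node_only=True models writing only the collections that
    have a replica on the restarted node.
    """
    all_collections = set().union(*replicas_by_node.values())
    writes = 0
    for hosted in replicas_by_node.values():
        touched = hosted if per_node_only else all_collections
        writes += len(touched)
    return writes

cluster = build_cluster(num_nodes=40, num_collections=4000)  # assumed sizes
buggy = rolling_restart_writes(cluster, per_node_only=False)
fixed = rolling_restart_writes(cluster, per_node_only=True)
print(buggy, fixed, buggy // fixed)
```

Under these assumptions the first count grows with nodes times collections, while the second grows only with the total number of replicas, which is the difference the comment describes.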