[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354868#comment-14354868 ]
Shawn Heisey commented on SOLR-7191:
------------------------------------

The branch_5x code is a lot more stable than 5.0, but full recovery after a cluster restart is very slow, and as Damien noted, creation of new collections is also very slow. On my dev server, by the time it had reached 4000 collections, each new collection was taking about 20 seconds to create.

The biggest problem I noticed is that any little change on the cluster (even creating a new collection) seems to cause a flood of ZkStateReader "Updating data for XXXXX to ver NN" messages. Each time it must update every single collection it has ... when that happens, it doesn't take an extreme amount of time, but I've noticed that it does it repeatedly, especially on node startup.

I have some logs from a cluster restart on a 2 node cluster, but they are far too large to attach here, even compressed. I have loaded the zip onto my dropbox account.

https://www.dropbox.com/s/28mu7asdvbgwkqt/logs-restart1-branch_5x.zip?dl=0

These logs are also far too large to load into Notepad or Notepad++ on Windows, but GNU's "less" program can deal with them just fine.

To create these logs, I shut down the cluster, deleted all logfiles, and then restarted each node. Looking only at the log for node1, node startup began at 06:54 UTC. At the following point, half an hour later, it appears that the node was fully operational - every collection shard had become active:

{noformat}
INFO  - 2015-03-06 07:27:21.586; org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=2 message={
  "operation":"state",
  "core_node_name":"core_node2",
  "numShards":"2",
  "shard":"shard2",
  "roles":null,
  "state":"active",
  "core":"mycoll3892_shard2_replica1",
  "collection":"mycoll3892",
  "node_name":"10.100.1.39:7574_solr",
  "base_url":"http://10.100.1.39:7574/solr"}
{noformat}

At 08:38 UTC, just before I went to bed, I sent a request to create a new collection, named "lotsashards".
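
For readers who want to reproduce this step, here is a minimal sketch of how a Collections API CREATE request of this kind could be built. The comment does not give the exact parameters that were sent, so the shard count, the replication factor, and the helper name `build_create_url` are assumptions for illustration; only the base URL (taken from the log excerpt above) and the collection name come from the comment. The optional `async` request id is a Collections API parameter that lets the call return before the overseer has processed it; nothing in the comment says it was used here.

```python
from urllib.parse import urlencode


def build_create_url(base_url, name, num_shards, replication_factor=1, async_id=None):
    """Build a Collections API CREATE URL (hypothetical helper, not part of Solr)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,            # assumed value; not stated in the comment
        "replicationFactor": replication_factor,
        "wt": "json",
    }
    if async_id is not None:
        # With an async id, Solr queues the operation and returns immediately;
        # the result would then be polled with action=REQUESTSTATUS.
        params["async"] = async_id
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Base URL from the log excerpt; numShards=2 is a placeholder, not the real value.
url = build_create_url("http://10.100.1.39:7574/solr", "lotsashards", num_shards=2)
print(url)
```

The sketch only builds the URL; sending it (for example with `urllib.request.urlopen`) is left out so that it can be run without a live cluster.
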
The HTTP request for collection creation timed out three minutes later, at 08:41, but the request was in the overseer queue, so eventually it did happen. It took until 15:29 for that collection creation to begin - nearly seven hours after I started it - because Solr was busy handling a very large number of those ZkStateReader updates. It appears that it took another ten minutes for the create to finish, and there are other messages related to lotsashards right up through the end of the log at 16:54 UTC.

> Improve stability and startup performance of SolrCloud with thousands of
> collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>   Affects Versions: 5.0
>            Reporter: Shawn Heisey
>              Labels: performance, scalability
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3,
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many
> problems myself even before I was able to get 4000 collections created on a
> 5.0 example cloud setup. Restarting Solr takes a very long time, and it is
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance
> and scalability. It doesn't help that I'm running both Solr nodes on one
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org