[jira] [Commented] (SOLR-7191) Improve stability and startup performance of SolrCloud with thousands of collections

Shawn Heisey (JIRA) Tue, 10 Mar 2015 06:25:40 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354868#comment-14354868
 ]


Shawn Heisey commented on SOLR-7191:
------------------------------------

The branch_5x code is a lot more stable than 5.0, but full recovery after a 
cluster restart is very slow, and as Damien noted, creation of new collections 
is also very slow.  On my dev server, by the time it had reached 4000 
collections, each new collection was taking about 20 seconds to create.

The biggest problem I noticed is that any small change on the cluster (even 
creating a new collection) seems to cause a flood of ZkStateReader "Updating 
data for XXXXX to ver NN" messages.  Each time, it must update every single 
collection it has.  A single pass doesn't take an extreme amount of time, but 
I've noticed that it happens repeatedly, especially on node startup.

I have some logs from a restart of a 2 node cluster, but they are far too 
large to attach here, even compressed.  I have uploaded the zip to my 
Dropbox account.

https://www.dropbox.com/s/28mu7asdvbgwkqt/logs-restart1-branch_5x.zip?dl=0

These logs are also far too large to load into Notepad or Notepad++ on Windows, 
but the GNU "less" program can deal with them just fine.

To create these logs, I shut down the cluster, deleted all log files, and then 
restarted each node.

Looking only at the log for node1, node startup began at 06:54 UTC.  About 
half an hour later, at the point shown below, the node appears to have been 
fully operational, because every collection shard had become active:

{noformat}
INFO  - 2015-03-06 07:27:21.586; org.apache.solr.cloud.overseer.ReplicaMutator; 
Update state numShards=2 message={
  "operation":"state",
  "core_node_name":"core_node2",
  "numShards":"2",
  "shard":"shard2",
  "roles":null,
  "state":"active",
  "core":"mycoll3892_shard2_replica1",
  "collection":"mycoll3892",
  "node_name":"10.100.1.39:7574_solr",
  "base_url":"http://10.100.1.39:7574/solr"}
{noformat}

At 08:38 UTC, just before I went to bed, I sent a request to create a new 
collection, named "lotsashards".  The HTTP request for collection creation 
timed out three minutes later, at 08:41, but the request was already in the 
overseer queue, so eventually it did happen.  Creation of that collection did 
not begin until 15:29, nearly seven hours after I sent the request, because 
Solr was busy handling a very large number of those ZkStateReader updates.  It 
appears that it took another ten minutes for the create to finish, and there 
are other messages related to lotsashards right up through the end of the log 
at 16:54 UTC.
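
The intervals in this timeline can be checked with a few lines of Python.  
This is only a sketch of the arithmetic: it assumes all of the times fall on 
the same UTC day, and the helper name is made up for illustration.

```python
from datetime import datetime


def elapsed(start, end, fmt="%H:%M"):
    """Return end - start for two clock times on the same day."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


print(elapsed("06:54", "07:27"))  # node startup until all shards active
print(elapsed("08:38", "08:41"))  # create request until the HTTP timeout
print(elapsed("08:38", "15:29"))  # create request until creation began
```

The last interval works out to 6 hours 51 minutes, which is the "nearly seven 
hours" mentioned above.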


> Improve stability and startup performance of SolrCloud with thousands of 
> collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>              Labels: performance, scalability
>         Attachments: lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 
> 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many 
> problems myself even before I was able to get 4000 collections created on a 
> 5.0 example cloud setup.  Restarting Solr takes a very long time, and it is 
> not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance 
> and scalability.  It doesn't help that I'm running both Solr nodes on one 
> machine (I started with 'bin/solr -e cloud') and that ZK is embedded.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
