[ 
https://issues.apache.org/jira/browse/SOLR-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366563#comment-15366563
 ] 

Erick Erickson commented on SOLR-7280:
--------------------------------------

The symptom is OOM errors ("unable to create new native thread"), which 
result in replicas that never come up, sometimes in never-ending recovery 
cycles, and so on. Decreasing Xss or increasing Xmx doesn't help (maybe 
some other settings would?). In my testing with 400 replicas in a JVM, 
something on the order of 1K temporary threads were spun up when I tried to 
start the JVM. More precisely, that many threads were running; the rest 
(number unknown) never started at all.

What the default should be is certainly debatable. The curious thing was that 
at no time did the JVM memory appear to be stressed (using jconsole to 
spot-check, not rigorous at all).

I've assumed that by ordering the replicas to come up based on what collection 
they belong to, they won't get stuck waiting for a leader election just because 
instance1 happened to try to bring up collection1 then collection2 whereas 
instance2 tried to bring them up in the reverse order. More like skip-lists in 
terms of waiting...
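
To make the ordering argument concrete, here is a minimal Python sketch. It is 
not Solr code: the core names and the helpers load_order and shared_order are 
made up for illustration. It only shows that if every instance sorts its local 
cores by collection, two instances visit the collections they share in the 
same relative order, whatever order they discovered them in.

```python
# Hypothetical illustration, not Solr code. Each instance holds replicas
# (cores) of several collections, discovered in arbitrary order.

def load_order(cores):
    """Return cores sorted by (collection, core name).

    `cores` is a list of (collection, core_name) tuples.
    """
    return sorted(cores)


def shared_order(order, shared):
    """Collections from `shared`, in the order they first appear in `order`."""
    seen = []
    for collection, _core in order:
        if collection in shared and collection not in seen:
            seen.append(collection)
    return seen


# Two instances that discovered their cores in opposite orders.
instance1 = [("collection1", "c1_shard1_replica1"),
             ("collection2", "c2_shard1_replica1")]
instance2 = [("collection2", "c2_shard1_replica2"),
             ("collection1", "c1_shard1_replica2")]

shared = {"collection1", "collection2"}

# Without sorting, the two instances disagree on which collection comes first.
unsorted_agree = shared_order(instance1, shared) == shared_order(instance2, shared)

# With sorting, both bring up collection1 before collection2.
sorted_agree = (shared_order(load_order(instance1), shared)
                == shared_order(load_order(instance2), shared))

print(unsorted_agree, sorted_agree)  # prints: False True
```

The sketch says nothing about how long any one leader election takes; it only 
covers the "same order everywhere" part of the assumption.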

Sure, with a weird enough topology instance1 could have a long queue to get 
through before getting to collectionX whereas instanceN could start with 
collectionX and have to go through leader vote wait timeouts, but that's way 
better than having a cluster that won't start at all.

And when I tested starting 3 of my 4 JVMs, it was indeed painfully slow waiting 
for leader vote wait timeouts. But the cluster came up.

Using 3 coreLoadThreads and starting all my JVMs at once took just a few 
minutes.
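
As a rough illustration of why a small, fixed number of load threads keeps the 
thread count bounded, here is a Python sketch. It is not a model of Solr's 
internals or of where the temporary threads came from; CORE_LOAD_THREADS, 
load_all and the core names are hypothetical, and the sleep stands in for the 
real work of opening a core.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

CORE_LOAD_THREADS = 3  # illustrative name; mirrors the value used above


def load_all(cores, max_workers=CORE_LOAD_THREADS):
    """Load cores in sorted order on a fixed-size pool.

    Returns (loaded, peak), where `peak` is the highest number of loads
    that were in progress at the same time.
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def load(core):
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.001)  # stand-in for opening a core
        with lock:
            state["running"] -= 1
        return core

    # The pool never runs more than `max_workers` loads at once; the
    # remaining cores wait in its queue instead of each getting a thread.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        loaded = list(pool.map(load, sorted(cores)))
    return loaded, state["peak"]


# 400 made-up cores spread over 40 made-up collections.
cores = [f"collection{i % 40}_shard1_replica{i}" for i in range(400)]
loaded, peak = load_all(cores)
print(len(loaded), peak <= CORE_LOAD_THREADS)  # prints: 400 True
```

All 400 loads complete, but never more than three are in flight, which is the 
trade-off described above: slower per instance, yet the startup does not 
depend on the OS granting hundreds of threads at once.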

And the painfully trappy behavior without this patch is that I can create all 
my collections just fine, but then I can't restart the cluster successfully.



> Load cores in sorted order and tweak coreLoadThread counts to improve cluster 
> stability on restarts
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7280
>                 URL: https://issues.apache.org/jira/browse/SOLR-7280
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Noble Paul
>             Fix For: 5.2, 6.0
>
>         Attachments: SOLR-7280.patch
>
>
> In SOLR-7191, Damien mentioned that by loading solr cores in a sorted order 
> and tweaking some of the coreLoadThread counts, he was able to improve the 
> stability of a cluster with thousands of collections. We should explore some 
> of these changes and fold them into Solr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
