[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325575#comment-15325575 ]
Erick Erickson commented on SOLR-7191:
--------------------------------------

I had to chase after this for a while, so I'm recording the results of some testing for posterity.

> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 replicas, 4 shards each): 1,600 total shards.
> Note that the cluster is fine at this point; everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr instance on a different box. I did this to eliminate the chance that the Overseer was somehow involved, since it is now on the machine with no replicas. I don't think this matters much, though.
> Bring up one JVM.
> Wait for all the nodes on that JVM to come up. Now every shard has a leader and the collections are all green; 3 of the 4 replicas for each shard are "gone", of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM errors on the _second_ JVM but not the first.
> The number of threads on the first JVM is about 1,200. On the second, it goes over 2,000. Whether this would drop back down or not is an open question.
> So I tried playing with -Xss to shrink the stack size of the threads; even dropping it by half didn't help.
> Expanding the memory on the second JVM to 32G didn't help.
> I tried increasing the process limit (ulimit -u) to no avail, on a hint that there was a wonky effect there somehow.
> Especially disconcerting is the fact that this node was running fine when the collections were _created_; it just can't get past a restart.
> Changing coreLoadThreads, even down to 2, did not seem to help.
> At no point does the memory consumption reported via jConsole or top come even close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing around: replicas go into recovery, come out of recovery, are permanently down, etc.
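A note on why jConsole/top showed nothing: native thread stacks are allocated outside the Java heap, so heap monitoring never sees the pressure. A rough sketch of the math and of the limits this OOM actually runs into (the 2,000-thread figure is from the test above; the 1 MB stack is the common 64-bit default for -Xss and an assumption here, as is using `$$` as a stand-in for the Solr pid):

```shell
# Back-of-the-envelope: native (off-heap) memory consumed by thread stacks.
# ~2,000 threads observed on the second JVM x 1 MB assumed default -Xss:
THREADS=2000
STACK_KB=1024                       # assumes -Xss1m; halving -Xss only halves this
echo "$(( THREADS * STACK_KB / 1024 )) MB of native stack memory"

# The ceilings "unable to create new native thread" can hit on Linux
# (these are the knobs to inspect, not a guaranteed diagnosis):
ulimit -u                           # max processes/threads per user
cat /proc/sys/kernel/threads-max    # system-wide thread ceiling
ls /proc/$$/task | wc -l            # live thread count of a pid ($$ here as a placeholder)
```

Watching the per-JVM thread count (substituting the real Solr pid for `$$`) while bringing nodes up one at a time would show whether the second JVM climbs past one of these ceilings during restart.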
Of course with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered; I think the OOM is swallowed, perhaps in ZooKeeper client code. Note that if the OOM killer script _did_ get fired, the second and subsequent JVMs would just die.
> The error is OOM: unable to create new native thread.
> Here's a stack trace; there are a _lot_ of these:

ERROR - 2016-06-11 00:05:36.806; [   ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
        at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Shalin Shekhar Mangar
>              Labels: performance, scalability
>         Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup. Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability. It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)