[ https://issues.apache.org/jira/browse/SOLR-7191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325575#comment-15325575 ]
Erick Erickson commented on SOLR-7191:
--------------------------------------

I had to chase after this for a while, so I'm recording the results of some testing for posterity.

> Setup: 4 Solr JVMs, 8G each (64G total RAM on the machine).
> Create 100 4x4 collections (i.e. 4 replicas, 4 shards each): 1,600 total shards.
> Note that the cluster is fine at this point; everything's green.
> No data indexed at all.
> Shut all Solr instances down.
> Bring up a Solr instance on a different box. I did this to eliminate the chance that the Overseer was somehow involved, since it is now on the machine with no replicas. I don't think this matters much, though.
> Bring up one JVM.
> Wait for all the nodes on that JVM to come up. Now every shard has a leader and the collections are all green; 3 of the 4 replicas for each shard are "gone", of course, but it's a functioning cluster.
> Bring up the next JVM: Kabloooey. Very shortly you'll start to see OOM errors on the _second_ JVM but not the first.
> The number of threads on the first JVM is about 1,200. On the second, it goes over 2,000. Whether this would drop back down or not is an open question.
> So I tried playing with -Xss to shrink the stack size of the threads; even dropping it by half didn't help.
> Expanding the memory on the second JVM to 32G didn't help.
> I tried increasing the process limit (ulimit -u) to no avail, on a hint that there was a wonky effect there somehow.
> Especially disconcerting is the fact that this node was running fine when the collections were _created_; it just can't get past a restart.
> Changing coreLoadThreads, even down to 2, did not seem to help.
> At no point does the memory consumption reported via jConsole or top come even close to the allocated JVM limits.
> I'd like to be able to just start all 4 JVMs at once, but didn't get that far.
> If one tries to start additional JVMs anyway, there's a lot of thrashing around: replicas go into recovery, come out of recovery, are permanently down, etc.
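A note on why jConsole/top showed nothing: native thread stacks are allocated outside the Java heap, so heap monitoring never sees the pressure. A rough sketch of the math and of the limits this OOM actually runs into (the 2,000-thread figure is from the test above; the 1 MB stack is the common 64-bit default for -Xss and an assumption here, as is using `$$` as a stand-in for the Solr pid):

```shell
# Back-of-the-envelope: native (off-heap) memory consumed by thread stacks.
# ~2,000 threads observed on the second JVM x 1 MB assumed default -Xss:
THREADS=2000
STACK_KB=1024                       # assumes -Xss1m; halving -Xss only halves this
echo "$(( THREADS * STACK_KB / 1024 )) MB of native stack memory"

# The ceilings "unable to create new native thread" can hit on Linux
# (these are the knobs to inspect, not a guaranteed diagnosis):
ulimit -u                           # max processes/threads per user
cat /proc/sys/kernel/threads-max    # system-wide thread ceiling
ls /proc/$$/task | wc -l            # live thread count of a pid ($$ here as a placeholder)
```

Watching the per-JVM thread count (substituting the real Solr pid for `$$`) while bringing nodes up one at a time would show whether the second JVM climbs past one of these ceilings during restart.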
Of course with OOMs it's unclear what _should_ happen.
> The OOM killer script apparently does NOT get triggered; I think the OOM is swallowed, perhaps in ZooKeeper client code. Note that if the OOM killer script _did_ get fired, the second and subsequent JVMs would just die.
> The error is OOM: unable to create new native thread.
> Here's a stack trace; there are a _lot_ of these:

ERROR - 2016-06-11 00:05:36.806; [   ] org.apache.zookeeper.ClientCnxn$EventThread; Error while calling watcher
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:214)
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
        at org.apache.solr.common.cloud.SolrZkClient$3.process(SolrZkClient.java:266)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)

> Improve stability and startup performance of SolrCloud with thousands of collections
> ------------------------------------------------------------------------------------
>
>                 Key: SOLR-7191
>                 URL: https://issues.apache.org/jira/browse/SOLR-7191
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>            Reporter: Shawn Heisey
>            Assignee: Shalin Shekhar Mangar
>              Labels: performance, scalability
>         Attachments: SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, SOLR-7191.patch, lots-of-zkstatereader-updates-branch_5x.log
>
>
> A user on the mailing list with thousands of collections (5000 on 4.10.3, 4000 on 5.0) is having severe problems with getting Solr to restart.
> I tried as hard as I could to duplicate the user setup, but I ran into many problems myself even before I was able to get 4000 collections created on a 5.0 example cloud setup. Restarting Solr takes a very long time, and it is not very stable once it's up and running.
> This kind of setup is very much pushing the envelope on SolrCloud performance and scalability. It doesn't help that I'm running both Solr nodes on one machine (I started with 'bin/solr -e cloud') and that ZK is embedded.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)