If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By "recent" I mean as of today; it looks like Mark applied the fix early this morning. Several reports indicate that this will solve your problem.
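
For what it's worth, the client-side change is small. Here's a rough sketch of the CloudSolrServer wiring (the ZooKeeper addresses and collection name are placeholders for your own, not anything from your setup):

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudUpdateExample {
        public static void main(String[] args) throws Exception {
            // Point the client at the ZooKeeper ensemble, not at a VIP or a single Solr node.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");

            // With SOLR-4816 applied, CloudSolrServer routes each update straight to the
            // correct shard leader instead of letting whichever node the VIP picked
            // forward it around the cluster.
            server.add(doc);
            server.shutdown();
        }
    }

The main effect is that updates stop bouncing between nodes on their way to the right leader, which should take some pressure off the distributed update path.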
I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232...

Best,
Erick

On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

> Hey guys,
>
> Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud:
>
> 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is that increasing the batch size reduces the likelihood of this issue happening.
>
> 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud.
>
> 3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads).
>
> To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand.
>
> If we happen to encounter issues with the above 3 changes, the next steps (which I could use some advice on) are:
>
> 1) Increase the number of shards (2x) - the theory here is that this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all?
>
> 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go "direct" to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts?
>
> Thanks all!
>
> Cheers,
>
> Tim
>
> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
>
>> Enjoy your trip, Mark! Thanks again for the help!
>>
>> Tim
>>
>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.
>>>
>>> Mark
>>>
>>> Sent from my iPhone
>>>
>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>
>>>> Hey Mark,
>>>>
>>>> The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.
>>>>
>>>> During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but I strangely see none of the "OutOfMemory: cannot open native thread" errors that always follow this. Weird!
>>>>
>>>> We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).
>>>>
>>>> More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN severity unless something breaks).
>>>>
>>>> Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):
>>>>
>>>> "java.lang.IllegalStateException:
>>>>     at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>>>>     at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>>>>     at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>>>>     at java.lang.Thread.run(Thread.java:724)"
>>>>
>>>> On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however. I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash?
>>>>
>>>> Metrics needed:
>>>>
>>>> 1) # of live_nodes.
>>>> 2) Full stack traces.
>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
>>>> 4) Solr's JVM thread count (already done).
>>>> 5) ?
>>>>
>>>> Cheers,
>>>>
>>>> Tim Vaillancourt
>>>>
>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
>>>>
>>>>> Did you ever get to index that long before without hitting the deadlock?
>>>>>
>>>>> There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - one where the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused generally happens when there is nothing listening on the port.
>>>>>
>>>>> Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?
>>>>>
>>>>> Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
>>>>>
>>>>> Mark
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> (copy of my post to SOLR-5216)
>>>>>>
>>>>>> We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).
>>>>>>
>>>>>> Our environment:
>>>>>>
>>>>>> Solr 4.3.1 w/SOLR-5216 patch.
>>>>>> Jetty 9, Java 1.7.
>>>>>> 3 Solr instances, 1 per physical server.
>>>>>> 1 collection.
>>>>>> 3 shards.
>>>>>> 2 replicas (each instance is a leader and a replica).
>>>>>> Soft autoCommit is 1000ms.
>>>>>> Hard autoCommit is 15000ms.
>>>>>>
>>>>>> After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously useful logs that I could see.
>>>>>>
>>>>>> I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.
>>>>>>
>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>>>>>>
>>>>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.
>>>>>>
>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Tim Vaillancourt
>>>>>>
>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> -----Original message-----
>>>>>>>> From: Erick Erickson <erickerick...@gmail.com>
>>>>>>>> Sent: Friday 6th September 2013 16:20
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>
>>>>>>>> Markus:
>>>>>>>>
>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>>>>>>>
>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Got an issue to watch?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Markus
>>>>>>>>>
>>>>>>>>> -----Original message-----
>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>>>
>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey guys,
>>>>>>>>>>>
>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
>>>>>>>>>>>
>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point the SolrCloud instances start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.
>>>>>>>>>>>
>>>>>>>>>>> As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr) and the soft/hard autoCommit settings, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help either. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
>>>>>>>>>>>
>>>>>>>>>>> Our current environment is the following:
>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>>>>>>> - 3 x ZooKeeper instances, external Java 7 JVM.
>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
>>>>>>>>>>> - 5000 max Jetty threads (well above what we use when we are healthy); Linux user-threads ulimit is 6000.
>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>>>>>>> - Occurs under several JVM tunings.
>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
>>>>>>>>>>>
>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
>>>>>>>>>>>
>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>>>>>>     at sun.misc.Unsafe.park(Native Method)
>>>>>>>>>>>     - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
>>>>>>>>>>>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>>>>>>     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>>>>>>     at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>>>>>>     at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>>>>>>     at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>>>>>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>>>>>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>>>>>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>>>>>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>>>>>>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>>>>>>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>>>>>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>>>>>     at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>>>>>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>>>>>>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>>>>>>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>>>>>>     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>>>>>>     at java.lang.Thread.run(Thread.java:724)"
>>>>>>>>>>>
>>>>>>>>>>> Some questions I had were:
>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
>>>>>>>>>>> 2) Keeping in mind I do not read or write Java (sorry :D), could someone help me understand "what" Solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next.
>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8"; is there a way to tell what "0x00000007216e68d8" is?
>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>>>>>>>>>>> 5) Wild-ass theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
>>>>>>>>>>>
>>>>>>>>>>> To those interested, I've provided a stack trace of 1 of 3 nodes at this URL in gzipped form:
>>>>>>>>>>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>>>>>>>>>>>
>>>>>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be much appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Thanks so much all!
>>>>>>>>>>>
>>>>>>>>>>> Tim Vaillancourt
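
For reference, the client-side batching Tim describes above (a handful of worker threads draining a shared queue and sending roughly 200 docs per request, with no explicit commits) boils down to something like the following sketch. The queue wiring, batch size, thread count, collection name, and ZooKeeper address are illustrative placeholders, not Tim's actual code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchingUpdateWorker implements Runnable {
        private static final int BATCH_SIZE = 200; // up from 10, per change #1 above

        private final BlockingQueue<SolrInputDocument> queue;
        private final CloudSolrServer server;

        public BatchingUpdateWorker(BlockingQueue<SolrInputDocument> queue,
                                    CloudSolrServer server) {
            this.queue = queue;
            this.server = server;
        }

        @Override
        public void run() {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Block briefly for the first doc, then drain up to a full batch.
                    SolrInputDocument first = queue.poll(1, TimeUnit.SECONDS);
                    if (first == null) {
                        continue;
                    }
                    batch.add(first);
                    queue.drainTo(batch, BATCH_SIZE - batch.size());

                    // One HTTP request per ~200 docs, and no explicit commit:
                    // the servers' soft/hard autoCommit handles visibility.
                    server.add(batch);
                    batch.clear();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // A real app would retry or dead-letter the failed batch.
                e.printStackTrace();
            }
        }

        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");
            BlockingQueue<SolrInputDocument> queue =
                new LinkedBlockingQueue<SolrInputDocument>();

            // 5 update threads, down from 10, per change #3 above.
            for (int i = 0; i < 5; i++) {
                new Thread(new BatchingUpdateWorker(queue, server)).start();
            }
        }
    }

Larger batches mean far fewer concurrent in-flight requests for the same document volume, which matches Tim's observation that the bigger batch size made the hang less likely without actually fixing it.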