Hey Mark,

The longest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.
During the crash I can see a dramatic spike in threads to 10k, which is essentially our ulimit for the JVM, but strangely I see none of the "OutOfMemoryError: unable to create new native thread" errors that always follow this. Weird!

We also noticed a spike in CPU around the crash. The instability caused some shard recovery/replication, though, so that CPU may be a symptom of the replication, or possibly the root cause. The CPU spikes fairly sharply from about 20-30% utilization (system + user) to 60%, so while it spikes, the CPU isn't quite "pinned" (these are very beefy Dell R720s: 16-core Xeons, the whole index fits in 128GB RAM, 6x 15k disks in RAID 10).

More on resources: our disk I/O seemed to spike about 2x during the crash (from about 1300kbps written to 3500kbps), but this may have been the replication, or the ERROR logging (our log level is WARN, so we normally log nothing unless something breaks).

Lastly, I found this stack trace occurring frequently, and I have no idea what it is (it may or may not be useful):

"java.lang.IllegalStateException
    at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
    at org.eclipse.jetty.server.Response.sendError(Response.java:325)
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add it to our monitoring for future tests, however.

I'd be glad to continue testing, but I think more monitoring is needed first to understand this better. Could we come up with a list of metrics that would be useful to see following another test and successful crash? Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done).
5) ?
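Something like the rough Python sketch below is what I had in mind for capturing 1-4 every few seconds during the next run. It's untested and only meant to illustrate: it assumes a Linux host, that we know the Solr JVM's pid, that the JDK's jstack is on the PATH, and that the kazoo ZooKeeper client library is installed; SOLR_PID, the ZK address, and the /live_nodes path (no chroot) are placeholders.

#!/usr/bin/env python
# Rough monitoring sketch (untested): samples the metrics above every INTERVAL
# seconds. Assumes Linux, the JDK's jstack on the PATH, and the "kazoo"
# ZooKeeper client library. SOLR_PID and ZK_HOSTS are placeholders.
import subprocess
import time

from kazoo.client import KazooClient

SOLR_PID = 12345             # placeholder: pid of the Solr JVM
ZK_HOSTS = "localhost:2181"  # placeholder: ZooKeeper connection string
INTERVAL = 10                # seconds between samples

def ps_field(field):
    # Read a single ps(1) field (e.g. nlwp, %cpu) for the Solr JVM only.
    out = subprocess.check_output(["ps", "-o", field + "=", "-p", str(SOLR_PID)])
    return out.decode().strip()

def live_nodes_count(zk):
    # Number of children under /live_nodes in ZooKeeper.
    return len(zk.get_children("/live_nodes"))

def jstack_dump(ts):
    # Full thread dump to a timestamped file so we can diff them later.
    dump = subprocess.check_output(["jstack", str(SOLR_PID)])
    with open("jstack-%s.txt" % ts, "wb") as f:
        f.write(dump)

def main():
    zk = KazooClient(hosts=ZK_HOSTS)
    zk.start()
    try:
        while True:
            ts = time.strftime("%Y%m%dT%H%M%S")
            print("%s live_nodes=%d threads=%s cpu=%s%%" % (
                ts, live_nodes_count(zk), ps_field("nlwp"), ps_field("%cpu")))
            jstack_dump(ts)
            time.sleep(INTERVAL)
    finally:
        zk.stop()

if __name__ == "__main__":
    main()

Even just the timestamped jstack dumps correlated with the live_nodes count should tell us whether nodes drop out of ZK before or after the thread pile-up.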
Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:

> Did you ever get to index that long before without hitting the deadlock?
>
> There really isn't anything negative the patch could be introducing, other
> than allowing for some more threads to possibly run at once. If I had to
> guess, I would say it's likely this patch fixes the deadlock issue and you're
> seeing another issue - which looks like the system cannot keep up with the
> requests or something for some reason - perhaps due to some OS networking
> settings or something (more guessing). Connection refused happens generally
> when there is nothing listening on the port.
>
> Do you see anything interesting change with the rest of the system? CPU
> usage spikes or something like that?
>
> Clamping down further on the overall number of threads might help (which
> would require making something configurable). How many nodes are listed in
> zk under live_nodes?
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com>
> wrote:
>
> > Hey guys,
> >
> > (copy of my post to SOLR-5216)
> >
> > We tested this patch and unfortunately encountered some serious issues after a
> > few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
> > writing about 5000 docs/sec total, using autoCommit to commit the updates
> > (no explicit commits).
> >
> > Our environment:
> >
> > Solr 4.3.1 w/SOLR-5216 patch.
> > Jetty 9, Java 1.7.
> > 3 solr instances, 1 per physical server.
> > 1 collection.
> > 3 shards.
> > 2 replicas (each instance is a leader and a replica).
> > Soft autoCommit is 1000ms.
> > Hard autoCommit is 15000ms.
> >
> > After about 6 hours of stress-testing this patch, we see many of these
> > stalled transactions (below), and the Solr instances start to see each
> > other as down, flooding our Solr logs with "Connection Refused" exceptions,
> > and otherwise no obviously-useful logs that I could see.
> >
> > I did notice some stalled transactions on both /select and /update,
> > however. This never occurred without this patch.
> >
> > Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> > Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >
> > Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
> > My script "normalizes" the ERROR-severity stack traces and returns them in
> > order of occurrence.
> >
> > Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >
> > Thanks!
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >
> >> Thanks!
> >>
> >> -----Original message-----
> >>> From: Erick Erickson <erickerick...@gmail.com>
> >>> Sent: Friday 6th September 2013 16:20
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>
> >>> Markus:
> >>>
> >>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>
> >>>
> >>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> >>> <markus.jel...@openindex.io> wrote:
> >>>
> >>>> Hi Mark,
> >>>>
> >>>> Got an issue to watch?
> >>>>
> >>>> Thanks,
> >>>> Markus
> >>>>
> >>>> -----Original message-----
> >>>>> From: Mark Miller <markrmil...@gmail.com>
> >>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>
> >>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it
> >>>>> is since early this year, but it's never personally been an issue, so it's
> >>>>> rolled along for a long time.
> >>>>>
> >>>>> Mark
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> I am looking into an issue we've been having with SolrCloud since the
> >>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> >>>>>> 4.4.0 yet). I've noticed other users with this same issue, so I'd really
> >>>>>> like to get to the bottom of it.
> >>>>>>
> >>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
> >>>>>> see stalled transactions that snowball to consume all Jetty threads in the
> >>>>>> JVM. This eventually causes the JVM to hang with most threads waiting on
> >>>>>> the condition/stack provided at the bottom of this message. At this point
> >>>>>> SolrCloud instances then start to see their neighbors (who also have all
> >>>>>> threads hung) as down w/"Connection Refused", and the shards become "down"
> >>>>>> in state. Sometimes a node or two survives and just returns 503s "no server
> >>>>>> hosting shard" errors.
> >>>>>>
> >>>>>> As a workaround/experiment, we have tuned the number of threads sending
> >>>>>> updates to Solr, as well as the batch size (we batch updates from client ->
> >>>>>> solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning
> >>>>>> off Client-to-Solr batching (1 update = 1 call to Solr), which did not help
> >>>>>> either. Certain combinations of update threads and batch sizes seem to
> >>>>>> mask/help the problem, but not resolve it entirely.
> >>>>>>
> >>>>>> Our current environment is the following:
> >>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
> >>>>>> a replica of 1 shard).
> >>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
> >>>>>> day.
> >>>>>> - 5000 max jetty threads (well above what we use when we are healthy),
> >>>>>> Linux-user threads ulimit is 6000.
> >>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>> - Occurs under several JVM tunings.
> >>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version
> >>>>>> (I hope I'm wrong).
> >>>>>>
> >>>>>> The stack trace that is holding up all my Jetty QTP threads is the
> >>>>>> following, which seems to be waiting on a lock that I would very much
> >>>>>> like to understand further:
> >>>>>>
> >>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>> at sun.misc.Unsafe.park(Native Method)
> >>>>>> - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>> at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>> at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>> at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>> at java.lang.Thread.run(Thread.java:724)"
> >>>>>>
> >>>>>> Some questions I had were:
> >>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could someone
> >>>>>> help me understand "what" solr is locking in this case at
> >>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>> when performing an update? That will help me understand where to look next.
> >>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8",
> >>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever they
> >>>>>> are) on update, and thus more update throughput?
> >>>>>>
> >>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL
> >>>>>> in gzipped form:
> >>>>>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >>>>>>
> >>>>>> Any help/suggestions/ideas on this issue, big or small, would be much
> >>>>>> appreciated.
> >>>>>>
> >>>>>> Thanks so much all!
> >>>>>>
> >>>>>> Tim Vaillancourt
> >>
>