Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone
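As a rough, hedged sketch only (not anything from this thread's code), this is the kind of thread-count watchdog that could catch a spike like that before it reaches the ulimit; the 8000 warning threshold is an assumption based on the ~10k limit mentioned above. It watches its own JVM via the standard ThreadMXBean, so it would have to run inside the Solr JVM or be adapted to poll Solr's remote JMX.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    public class ThreadCountWatchdog {
        // Assumed threshold: warn well before the ~10k thread ulimit described above.
        private static final int WARN_THRESHOLD = 8000;

        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            while (true) {
                int live = threads.getThreadCount();
                int peak = threads.getPeakThreadCount();
                System.out.printf("live=%d peak=%d%n", live, peak);
                if (live > WARN_THRESHOLD) {
                    System.err.println("WARN: thread count approaching ulimit: " + live);
                }
                Thread.sleep(5000);
            }
        }
    }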
On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

> Hey Mark,
>
> The farthest we've made it at the same batch size/volume was 12 hours
> without this patch, but that isn't consistent. Sometimes we would only get
> to 6 hours or less.
>
> During the crash I can see an amazing spike in threads to 10k, which is
> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
> cannot open native thread" errors that always follow this. Weird!
>
> We also notice a spike in CPU around the crash. The instability caused some
> shard recovery/replication though, so that CPU may be a symptom of the
> replication, or is possibly the root cause. The CPU spikes from about
> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while
> spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole
> index is in 128GB RAM, 6xRAID10 15k).
>
> More on resources: our disk I/O seemed to spike about 2x during the crash
> (about 1300kbps written to 3500kbps), but this may have been the
> replication, or ERROR logging (we generally log nothing due to
> WARN-severity unless something breaks).
>
> Lastly, I found this stack trace occurring frequently, and have no idea
> what it is (may be useful or not):
>
> "java.lang.IllegalStateException:
> at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> at java.lang.Thread.run(Thread.java:724)"
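For what it's worth, that trace usually means Jetty refused to send an error page because the response was already committed: the dump shows sendError() calling resetBuffer(), and resetBuffer() throws IllegalStateException once any part of the response has gone out. A minimal illustration of the condition using the plain Servlet API (this is not Solr's actual SolrDispatchFilter code, just the general pattern):

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    public class ErrorSender {
        // Illustrative helper: calling sendError() on an already-committed response
        // is what triggers the IllegalStateException seen in Response.resetBuffer().
        static void sendErrorSafely(HttpServletResponse resp, int code, String msg) throws IOException {
            if (resp.isCommitted()) {
                // Too late to change the status line or headers; log instead of throwing.
                System.err.println("Response already committed; cannot sendError(" + code + "): " + msg);
                return;
            }
            resp.sendError(code, msg);
        }
    }

In that reading, the exception is a symptom of requests timing out or failing midway, not a root cause in itself.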
> On your live_nodes question, I don't have historical data on this from when
> the crash occurred, which I guess is what you're looking for. I could add
> this to our monitoring for future tests, however. I'd be glad to continue
> further testing, but I think first more monitoring is needed to understand
> this further. Could we come up with a list of metrics that would be useful
> to see following another test and successful crash?
>
> Metrics needed:
>
> 1) # of live_nodes.
> 2) Full stack traces.
> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> 4) Solr's JVM thread count (already done).
> 5) ?
>
> Cheers,
>
> Tim Vaillancourt
>
> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Did you ever get to index that long before without hitting the deadlock?
>>
>> There really isn't anything negative the patch could be introducing, other
>> than allowing for some more threads to possibly run at once. If I had to
>> guess, I would say it's likely this patch fixes the deadlock issue and you're
>> seeing another issue - which looks like the system cannot keep up with the
>> requests for some reason - perhaps due to some OS networking settings or
>> something (more guessing). Connection refused generally happens when there
>> is nothing listening on the port.
>>
>> Do you see anything interesting change with the rest of the system? CPU
>> usage spikes or something like that?
>>
>> Clamping down further on the overall number of threads might help (which
>> would require making something configurable). How many nodes are listed in
>> zk under live_nodes?
>>
>> Mark
>>
>> Sent from my iPhone
>>
>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>
>>> Hey guys,
>>>
>>> (copy of my post to SOLR-5216)
>>>
>>> We tested this patch and unfortunately encountered some serious issues after
>>> a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
>>> writing about 5000 docs/sec total, using autoCommit to commit the updates
>>> (no explicit commits).
>>>
>>> Our environment:
>>>
>>> Solr 4.3.1 w/SOLR-5216 patch.
>>> Jetty 9, Java 1.7.
>>> 3 solr instances, 1 per physical server.
>>> 1 collection.
>>> 3 shards.
>>> 2 replicas (each instance is a leader and a replica).
>>> Soft autoCommit is 1000ms.
>>> Hard autoCommit is 15000ms.
>>>
>>> After about 6 hours of stress-testing this patch, we see many of these
>>> stalled transactions (below), and the Solr instances start to see each
>>> other as down, flooding our Solr logs with "Connection Refused" exceptions,
>>> and otherwise no obviously-useful logs that I could see.
>>>
>>> I did notice some stalled transactions on both /select and /update,
>>> however. This never occurred without this patch.
>>>
>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>>>
>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
>>> My script "normalizes" the ERROR-severity stack traces and returns them in
>>> order of occurrence.
>>>
>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>>>
>>> Thanks!
>>>
>>> Tim Vaillancourt
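Relating to metric #1 above and Mark's "how many nodes are listed in zk under live_nodes" question, a rough sketch of polling live_nodes straight from ZooKeeper with the plain ZooKeeper client. The zkHost string is a placeholder, and the /live_nodes path assumes no chroot (adjust if the Solr tree is chrooted):

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class LiveNodesPoller {
        public static void main(String[] args) throws Exception {
            // Assumed ensemble address; replace with the real zkHost string.
            String zkHost = "zk1:2181,zk2:2181,zk3:2181";
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(zkHost, 15000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            while (true) {
                // Each SolrCloud node registers an ephemeral child here; a drop in
                // the count should line up with nodes being seen as "down".
                List<String> nodes = zk.getChildren("/live_nodes", false);
                System.out.println(System.currentTimeMillis() + " live_nodes=" + nodes.size() + " " + nodes);
                Thread.sleep(5000);
            }
        }
    }

Logging this alongside the thread count would show whether nodes drop out of live_nodes before or after the thread spike.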
>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>>> Thanks!
>>>>
>>>> -----Original message-----
>>>>> From: Erick Erickson <erickerick...@gmail.com>
>>>>> Sent: Friday 6th September 2013 16:20
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>
>>>>> Markus:
>>>>>
>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>>>>
>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>>>> <markus.jel...@openindex.io> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Got an issue to watch?
>>>>>>
>>>>>> Thanks,
>>>>>> Markus
>>>>>>
>>>>>> -----Original message-----
>>>>>>> From: Mark Miller <markrmil...@gmail.com>
>>>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>
>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it
>>>>>>> is since early this year, but it's never personally been an issue, so it's
>>>>>>> rolled along for a long time.
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>>>
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> I am looking into an issue we've been having with SolrCloud since the
>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
>>>>>>>> 4.4.0 yet). I've noticed other users with this same issue, so I'd really
>>>>>>>> like to get to the bottom of it.
>>>>>>>>
>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
>>>>>>>> see stalled transactions that snowball to consume all Jetty threads in the
>>>>>>>> JVM. This eventually causes the JVM to hang with most threads waiting on
>>>>>>>> the condition/stack provided at the bottom of this message. At this point
>>>>>>>> SolrCloud instances then start to see their neighbors (who also have all
>>>>>>>> threads hung) as down w/"Connection Refused", and the shards become "down"
>>>>>>>> in state. Sometimes a node or two survives and just returns 503 "no server
>>>>>>>> hosting shard" errors.
>>>>>>>>
>>>>>>>> As a workaround/experiment, we have tuned the number of threads sending
>>>>>>>> updates to Solr, as well as the batch size (we batch updates from client ->
>>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning
>>>>>>>> off Client-to-Solr batching (1 update = 1 call to Solr), which did not
>>>>>>>> help. Certain combinations of update threads and batch sizes seem to
>>>>>>>> mask/help the problem, but not resolve it entirely.
>>>>>>>>
>>>>>>>> Our current environment is the following:
>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
>>>>>>>> a replica of 1 shard).
>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
>>>>>>>> day.
>>>>>>>> - 5000 max Jetty threads (well above what we use when we are healthy);
>>>>>>>> Linux-user threads ulimit is 6000.
>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>>>> - Occurs under several JVM tunings.
>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version
>>>>>>>> (I hope I'm wrong).
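On the client side described above ("we batch updates from client -> solr"), a minimal SolrJ 4.x sketch of that kind of bounded-concurrency batching, for reference only; the zkHost, collection name, field names, batch size and thread count are placeholders, not the actual test harness from this thread:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchUpdater {
        public static void main(String[] args) throws Exception {
            // Placeholders: replace zkHost and collection with real values.
            final CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");
            final int batchSize = 10;   // 10 docs per batch, as in the test described above
            int senderThreads = 4;      // client-side update threads, a tunable in the test
            ExecutorService senders = Executors.newFixedThreadPool(senderThreads);

            for (int t = 0; t < senderThreads; t++) {
                senders.submit(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);
                                for (int i = 0; i < batchSize; i++) {
                                    SolrInputDocument doc = new SolrInputDocument();
                                    doc.addField("id", java.util.UUID.randomUUID().toString());
                                    doc.addField("text_t", "stress test doc");
                                    batch.add(doc);
                                }
                                solr.add(batch);   // no explicit commit; autoCommit handles it
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
        }
    }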
>>>>>>>>
>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
>>>>>>>> following, which seems to be waiting on a lock that I would very much like
>>>>>>>> to understand further:
>>>>>>>>
>>>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>>> at sun.misc.Unsafe.park(Native Method)
>>>>>>>> - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
>>>>>>>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>>> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>>> at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>>> at java.lang.Thread.run(Thread.java:724)"
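As a rough illustration only (not Solr's actual SolrCmdDistributor or AdjustableSemaphore code), the general shape of what that trace suggests is a fixed-permit semaphore gating outbound update requests, with permits released only when remote calls complete. If the peers cannot complete those calls because their own container threads are all parked in acquire(), every thread ends up waiting on the same semaphore. The permit count below is an assumption for illustration:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Illustrative only: a bounded "distributor" whose permits come back when the
    // remote call finishes. If completions stall, acquire() never returns and all
    // callers park on the same Semaphore - the condition seen in the jstack above.
    public class BoundedDistributor {
        private final Semaphore permits = new Semaphore(16);   // assumed permit count
        private final ExecutorService sender = Executors.newCachedThreadPool();

        public void submit(final Runnable remoteCall) throws InterruptedException {
            permits.acquire();                 // where the dumped threads are parked
            sender.submit(new Runnable() {
                public void run() {
                    try {
                        remoteCall.run();      // forward the update to a replica/leader
                    } finally {
                        permits.release();     // permit only returns on completion
                    }
                }
            });
        }
    }

The hex id in the dump is just the identity of that shared Semaphore's sync object, which is why every parked thread reports the same <0x00000007216e68d8>.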
>>>>>>>>
>>>>>>>> Some questions I had were:
>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could someone
>>>>>>>> help me understand "what" solr is locking in this case at
>>>>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>>>>>>>> when performing an update? That will help me understand where to look next.
>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8";
>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever they
>>>>>>>> are) on update, and thus more update throughput?
>>>>>>>>
>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL
>>>>>>>> in gzipped form:
>>>>>>>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
>>>>>>>>
>>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be much
>>>>>>>> appreciated.
>>>>>>>>
>>>>>>>> Thanks so much all!
>>>>>>>>
>>>>>>>> Tim Vaillancourt
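On Mark's earlier suggestion about clamping down on the overall number of threads: Solr 4.x normally configures this through Jetty's etc/jetty.xml rather than code, but purely as a hedged, embedded-Jetty-9 sketch of the knob involved (the 500/8 values are arbitrary examples, not a recommendation for this workload):

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.ServerConnector;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    public class BoundedJetty {
        public static void main(String[] args) throws Exception {
            // Cap the QTP so a runaway request pile-up hits a ceiling well below the
            // OS thread ulimit instead of climbing toward it (numbers are examples).
            QueuedThreadPool pool = new QueuedThreadPool(500, 8);
            Server server = new Server(pool);
            ServerConnector connector = new ServerConnector(server);
            connector.setPort(8983);
            server.addConnector(connector);
            server.start();
            server.join();
        }
    }

A lower ceiling would not fix the underlying stall, but it keeps the failure mode at "requests queue up" rather than "the JVM exhausts the native thread limit".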