Right, I don't see SOLR-5232 making 4.5, unfortunately. It could perhaps make a
4.5.1 - it does resolve a critical issue - but 4.5 is already in motion and
SOLR-5232 is not quite ready; it still needs some testing.

- Mark

On Sep 12, 2013, at 2:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> My take on it is this, assuming I'm reading this right:
> 1> SOLR-5216 - probably not going anywhere, 5232 will take care of it.
> 2> SOLR-5232 - expected to fix the underlying issue no matter whether
> you're using CloudSolrServer from SolrJ or sending lots of updates from
> lots of clients.
> 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
> meantime (a rough usage sketch follows just below).
> 
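For reference, a minimal SolrJ sketch of the CloudSolrServer approach in 3> above. The ZooKeeper address, collection name, and field names are placeholders rather than details from this thread, and the exact SolrJ API may vary slightly across 4.x releases:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudUpdateExample {
        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble address and collection name.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "example-1");
            doc.addField("title_s", "hello");

            // CloudSolrServer reads cluster state from ZooKeeper and sends the
            // request to a live node, rather than going through an HTTP VIP.
            server.add(doc);
            server.shutdown();
        }
    }
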
> I don't quite know whether SOLR-5232 will make it into 4.5 or not; it
> hasn't been committed anywhere yet. The Solr 4.5 release is imminent - RC0
> looks like it'll be ready to cut next week - so it might not be included.
> 
> Best,
> Erick
> 
> 
> On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> 
>> Lol at breaking during a demo - always the way it is! :) I agree, we are
>> just tip-toeing around the issue, but waiting for 4.5 is definitely an
>> option if we "get by" for now in testing; patched Solr versions seem to
>> make people uneasy sometimes :).
>> 
>> Seeing as there seems to be some danger to SOLR-5216 (in some ways it blows
>> up worse due to fewer limits on threads), I'm guessing only SOLR-5232 and
>> SOLR-4816 are making it into 4.5? I feel those two in combination will make
>> a world of difference!
>> 
>> Thanks so much again guys!
>> 
>> Tim
>> 
>> 
>> 
>> On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>>> Fewer client threads updating makes sense, and going to 1 core also seems
>>> like it might help. But it's all a crap-shoot unless the underlying cause
>>> gets fixed up. Both would improve things, but you'll still hit the problem
>>> sometime, probably when doing a demo for your boss ;).
>>> 
>>> Adrien has branched the code for Solr 4.5 in preparation for a release
>>> candidate tentatively scheduled for next week. You might just start working
>>> with that branch if you can rather than apply individual patches...
>>> 
>>> I suspect there'll be a couple more changes to this code (looks like
>>> Shikhar already raised an issue for instance) before 4.5 is finally cut...
>>> 
>>> FWIW,
>>> Erick
>>> 
>>> 
>>> 
>>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>> 
>>>> Thanks Erick!
>>>> 
>>>> Yeah, I think the next step will be CloudSolrServer with the SOLR-4816
>>>> patch. I think that is a very, very useful patch by the way. SOLR-5232
>>>> seems promising as well.
>>>> 
>>>> I see your point on the more-shards idea; this is obviously a
>>>> global/instance-level lock. If I really had to, I suppose I could run more
>>>> Solr instances to reduce locking then? Currently I have 2 cores per
>>>> instance and I could go 1-to-1 to simplify things.
>>>> 
>>>> The good news is we seem to be more stable since changing to a bigger
>>>> client->solr batch-size and fewer client threads updating.
>>>> 
>>>> Cheers,
>>>> 
>>>> Tim
>>>> 
>>>> On 11/09/13 04:19 AM, Erick Erickson wrote:
>>>> 
>>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
>>>>> copy of the 4x branch. By "recent", I mean like today, it looks like Mark
>>>>> applied this early this morning. But several reports indicate that this
>>>>> will solve your problem.
>>>>> 
>>>>> I would expect that increasing the number of shards would make the
>>>>> problem worse, not better.
>>>>> 
>>>>> There's also SOLR-5232...
>>>>> 
>>>>> Best
>>>>> Erick
>>>>> 
>>>>> 
>>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>> 
>>>>>> Hey guys,
>>>>>> 
>>>>>> Based on my understanding of the problem we are encountering, I feel
>>>>>> we've been able to reduce the likelihood of this issue by making the
>>>>>> following changes to our app's usage of SolrCloud:
>>>>>> 
>>>>>> 1) We increased our document batch size to 200 from 10 - our app batches
>>>>>> updates to reduce HTTP requests/overhead. The theory is that increasing
>>>>>> the batch size reduces the likelihood of this issue happening (a rough
>>>>>> sketch of this batching follows below this list).
>>>>>> 2) We reduced to 1 application node sending updates to SolrCloud - we
>>>>>> write Solr updates to Redis, and previously had 4 application nodes
>>>>>> pushing the updates to Solr (popping off the Redis queue). Reducing the
>>>>>> number of nodes pushing to Solr reduces the concurrency on SolrCloud.
>>>>>> 3) Fewer threads pushing to SolrCloud - due to the increase in batch
>>>>>> size, we were able to go down to 5 update threads on the
>>>>>> update-pushing app (from 10 threads).
>>>>>> 
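A minimal sketch of the client-side batching described in 1) above; the queue source, document fields, and batch size below are illustrative placeholders, not the actual application code:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedUpdater {
        private static final int BATCH_SIZE = 200;  // raised from 10

        public static void pushBatches(SolrServer solr, Iterable<SolrInputDocument> docs)
                throws Exception {
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (SolrInputDocument doc : docs) {    // e.g. documents popped off the Redis queue
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    solr.add(batch);                // one HTTP request per 200 documents
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);                    // flush the remainder
            }
            // No explicit commit: the server-side soft/hard autoCommit handles visibility.
        }
    }
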
>>>>>> To be clear, the above only reduces the likelihood of the issue
>>>>>> happening, and DOES NOT actually resolve the issue at hand.
>>>>>> 
>>>>>> If we happen to encounter issues with the above 3 changes, the next
>>>>>> steps (which I could use some advice on) are:
>>>>>> 
>>>>>> 1) Increase the number of shards (2x) - the theory here is this reduces
>>>>>> the locking on shards because there are more shards. Am I onto something
>>>>>> here, or will this not help at all?
>>>>>> 2) Use CloudSolrServer - currently we have a plain-old least-connection
>>>>>> HTTP VIP. If we go "direct" to what we need to update, this will reduce
>>>>>> concurrency in SolrCloud a bit. Thoughts?
>>>>>> 
>>>>>> Thanks all!
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Tim
>>>>>> 
>>>>>> 
>>>>>> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>> 
>>>>>>> Enjoy your trip, Mark! Thanks again for the help!
>>>>>>> 
>>>>>>> Tim
>>>>>>> 
>>>>>>> 
>>>>>>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Okay, thanks, useful info. Getting on a plane, but I'll look more at this
>>>>>>>> soon. That 10k thread spike is good to know - that's no good and could
>>>>>>>> easily be part of the problem. We want to keep that from happening.
>>>>>>>> 
>>>>>>>> Mark
>>>>>>>> 
>>>>>>>> Sent from my iPhone
>>>>>>>> 
>>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>>>> 
>>>>>>>>> Hey Mark,
>>>>>>>>> 
>>>>>>>>> The farthest we've made it at the same batch size/volume was 12 hours
>>>>>>>>> without this patch, but that isn't consistent. Sometimes we would only
>>>>>>>>> get to 6 hours or less.
>>>>>>>>> 
>>>>>>>>> During the crash I can see an amazing spike in threads to 10k, which is
>>>>>>>>> essentially our ulimit for the JVM, but I strangely see no "OutOfMemory:
>>>>>>>>> cannot open native thread" errors that always follow this. Weird!
>>>>>>>>> 
>>>>>>>>> We also notice a spike in CPU around the crash. The instability caused
>>>>>>>>> some shard recovery/replication though, so that CPU may be a symptom of
>>>>>>>>> the replication, or is possibly the root cause. The CPU spikes from about
>>>>>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the CPU,
>>>>>>>>> while spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core
>>>>>>>>> Xeons, whole index is in 128GB RAM, 6xRAID10 15k).
>>>>>>>>> 
>>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during the
>>>>>>>>> crash (about 1300kbps written to 3500kbps), but this may have been the
>>>>>>>>> replication, or ERROR logging (we generally log nothing at the
>>>>>>>>> WARN-severity log level unless something breaks).
>>>>>>>>> 
>>>>>>>>> Lastly, I found this stack trace occurring frequently, and have no idea
>>>>>>>>> what it is (may be useful or not):
>>>>>>>>> 
>>>>>>>>> "java.lang.IllegalStateException :
>>>>>>>>>      at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
>>>>>>>>>      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
>>>>>>>>>      at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
>>>>>>>>>      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
>>>>>>>>>      at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>>>      at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
>>>>>>>>>      at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
>>>>>>>>>      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>>>      at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>>>      at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>>>      at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
>>>>>>>>>      at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
>>>>>>>>>      at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>>>      at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
>>>>>>>>>      at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>>>      at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
>>>>>>>>>      at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>>>      at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>>>      at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>>>      at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
>>>>>>>>>      at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
>>>>>>>>>      at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>>>      at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
>>>>>>>>>      at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
>>>>>>>>>      at java.lang.Thread.run(Thread.java:724)"
>>>>>>>>> 
>>>>>>>>> On your live_nodes question, I don't have historical data on this from
>>>>>>>>> when the crash occurred, which I guess is what you're looking for. I
>>>>>>>>> could add this to our monitoring for future tests, however. I'd be glad
>>>>>>>>> to continue further testing, but I think first more monitoring is needed
>>>>>>>>> to understand this further. Could we come up with a list of metrics that
>>>>>>>>> would be useful to see following another test and successful crash?
>>>>>>>>> 
>>>>>>>>> Metrics needed (a rough collection sketch follows below this list):
>>>>>>>>> 
>>>>>>>>> 1) # of live_nodes.
>>>>>>>>> 2) Full stack traces.
>>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
>>>>>>>>> 4) Solr's JVM thread count (already done)
>>>>>>>>> 5) ?
>>>>>>>>> 
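A rough sketch of how metrics 1), 3), and 4) could be collected by a monitoring agent. The ZooKeeper address is a placeholder, the /live_nodes path assumes no ZooKeeper chroot, and getProcessCpuLoad() assumes a Java 7+ HotSpot/OpenJDK JVM:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;
    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class SolrCrashMetrics {
        public static void main(String[] args) throws Exception {
            // 4) Thread count of this JVM (run inside the Solr JVM, or poll remotely via JMX).
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            System.out.println("threads=" + threads.getThreadCount());

            // 3) CPU used by this JVM specifically (com.sun.management extension, HotSpot-only).
            com.sun.management.OperatingSystemMXBean os =
                    (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
            System.out.println("processCpuLoad=" + os.getProcessCpuLoad());

            // 1) Number of live_nodes registered in ZooKeeper (placeholder ensemble address).
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live_nodes=" + liveNodes.size());
            zk.close();
        }
    }
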
>>>>>>>>> Cheers,
>>>>>>>>> 
>>>>>>>>> Tim Vaillancourt
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Did you ever get to index that long before without hitting the deadlock?
>>>>>>>>>> 
>>>>>>>>>> There really isn't anything negative the patch could be introducing, other
>>>>>>>>>> than allowing for some more threads to possibly run at once. If I had to
>>>>>>>>>> guess, I would say it's likely this patch fixes the deadlock issue and
>>>>>>>>>> you're seeing another issue - which looks like the system cannot keep up
>>>>>>>>>> with the requests or something for some reason - perhaps due to some OS
>>>>>>>>>> networking settings or something (more guessing). Connection refused
>>>>>>>>>> generally happens when there is nothing listening on the port.
>>>>>>>>>> 
>>>>>>>>>> Do you see anything interesting change with the rest of the system? CPU
>>>>>>>>>> usage spikes or something like that?
>>>>>>>>>> 
>>>>>>>>>> Clamping down further on the overall number of threads might help (which
>>>>>>>>>> would require making something configurable). How many nodes are listed
>>>>>>>>>> in zk under live_nodes?
>>>>>>>>>> 
>>>>>>>>>> Mark
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>> 
>>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey guys,
>>>>>>>>>>> 
>>>>>>>>>>> (copy of my post to SOLR-5216)
>>>>>>>>>>> 
>>>>>>>>>>> We tested this patch and unfortunately encountered some serious issues
>>>>>>>>>>> after a few hours of 500 update-batches/sec. Our update batch is 10 docs,
>>>>>>>>>>> so we are writing about 5000 docs/sec total, using autoCommit to commit
>>>>>>>>>>> the updates (no explicit commits).
>>>>>>>>>>> 
>>>>>>>>>>> Our environment:
>>>>>>>>>>> 
>>>>>>>>>>>   Solr 4.3.1 w/SOLR-5216 patch.
>>>>>>>>>>>   Jetty 9, Java 1.7.
>>>>>>>>>>>   3 solr instances, 1 per physical server.
>>>>>>>>>>>   1 collection.
>>>>>>>>>>>   3 shards.
>>>>>>>>>>>   2 replicas (each instance is a leader and a replica).
>>>>>>>>>>>   Soft autoCommit is 1000ms.
>>>>>>>>>>>   Hard autoCommit is 15000ms.
>>>>>>>>>>> 
>>>>>>>>>>> After about 6 hours of stress-testing this patch, we see many of these
>>>>>>>>>>> stalled transactions (below), and the Solr instances start to see each
>>>>>>>>>>> other as down, flooding our Solr logs with "Connection Refused"
>>>>>>>>>>> exceptions, and otherwise no obviously-useful logs that I could see.
>>>>>>>>>>> 
>>>>>>>>>>> I did notice some stalled transactions on both /select and /update,
>>>>>>>>>>> however. This never occurred without this patch.
>>>>>>>>>>> 
>>>>>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
>>>>>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
>>>>>>>>>>> 
>>>>>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour
>>>>>>>>>>> soak. My script "normalizes" the ERROR-severity stack traces and returns
>>>>>>>>>>> them in order of occurrence (a rough sketch of that kind of normalization
>>>>>>>>>>> follows below).
>>>>>>>>>>> 
>>>>>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
>>>>>>>>>>> 
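A minimal sketch of that kind of normalization, purely illustrative and not the actual script; it assumes plain-text log lines, strips leading date/time tokens and source line numbers, and counts the resulting buckets:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class LogNormalizer {
        public static void main(String[] args) throws IOException {
            // Count ERROR lines after stripping timestamps and ":123)" line numbers,
            // so identical failures collapse into one bucket.
            List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
            Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
            for (String line : lines) {
                if (!line.contains("ERROR")) {
                    continue;
                }
                String key = line
                        .replaceAll("^\\S+\\s+\\S+\\s+", "")  // drop leading date/time tokens (format assumption)
                        .replaceAll(":\\d+\\)", ")");         // drop source line numbers
                Integer n = counts.get(key);
                counts.put(key, n == null ? 1 : n + 1);
            }
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getValue() + "x " + e.getKey());
            }
        }
    }
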
>>>>>>>>>>> Thanks!
>>>>>>>>>>> 
>>>>>>>>>>> Tim Vaillancourt
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> 
>>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>> 
>>>>>>>>>>>>> From: Erick Erickson <erickerick...@gmail.com>
>>>>>>>>>>>>> Sent: Friday 6th September 2013 16:20
>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Markus:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
>>>>>>>>>>>>> <markus.jel...@openindex.io> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Mark,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Got an issue to watch?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Markus
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----Original message-----
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
>>>>>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
>>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it
>>>>>>>>>>>>>>> is since early this year, but it's never personally been an issue, so
>>>>>>>>>>>>>>> it's rolled along for a long time.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Mark
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hey guys,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud since the
>>>>>>>>>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
>>>>>>>>>>>>>>>> 4.4.0 yet). I've noticed other users with this same issue, so I'd really
>>>>>>>>>>>>>>>> like to get to the bottom of it.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
>>>>>>>>>>>>>>>> see stalled transactions that snowball to consume all Jetty threads in
>>>>>>>>>>>>>>>> the JVM. This eventually causes the JVM to hang with most threads waiting
>>>>>>>>>>>>>>>> on the condition/stack provided at the bottom of this message. At this
>>>>>>>>>>>>>>>> point SolrCloud instances then start to see their neighbors (who also
>>>>>>>>>>>>>>>> have all threads hung) as down w/"Connection Refused", and the shards
>>>>>>>>>>>>>>>> become "down" in state. Sometimes a node or two survives and just returns
>>>>>>>>>>>>>>>> 503 "no server hosting shard" errors.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of threads sending
>>>>>>>>>>>>>>>> updates to Solr, as well as the batch size (we batch updates from client
>>>>>>>>>>>>>>>> -> Solr), and the soft/hard autoCommits, all to no avail. We also tried
>>>>>>>>>>>>>>>> turning off client-to-Solr batching (1 update = 1 call to Solr), which
>>>>>>>>>>>>>>>> did not help either. Certain combinations of update threads and batch
>>>>>>>>>>>>>>>> sizes seem to mask/help the problem, but not resolve it entirely.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Our current environment is the following:
>>>>>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
>>>>>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
>>>>>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
>>>>>>>>>>>>>>>>   and a replica of 1 shard).
>>>>>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
>>>>>>>>>>>>>>>>   good day.
>>>>>>>>>>>>>>>> - 5000 max Jetty threads (well above what we use when we are healthy),
>>>>>>>>>>>>>>>>   Linux-user threads ulimit is 6000.
>>>>>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
>>>>>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
>>>>>>>>>>>>>>>> - Occurs under several JVM tunings.
>>>>>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java
>>>>>>>>>>>>>>>>   version (I hope I'm wrong).
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the
>>>>>>>>>>>>>>>> following, which seems to be waiting on a lock that I would very much
>>>>>>>>>>>>>>>> like to understand further:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
>>>>>>>>>>>>>>>>   at sun.misc.Unsafe.park(Native Method)
>>>>>>>>>>>>>>>>   - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
>>>>>>>>>>>>>>>>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>>>>>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>>>>>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>>>>>>>>>>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>>>>>>>>>>>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>>>>>>>>>>>>>>>>   at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>>>>>>>>>>>>>>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>>>>>>>>>>>>>>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>>>>>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>>>>>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>>>>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>>>>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
>>>>>>>>>>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
>>>>>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:724)"
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Some questions I had were:
>>>>>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
>>>>>>>>>>>>>>>> 2) Keeping in mind I do not read or write Java (sorry :D), could someone
>>>>>>>>>>>>>>>> help me understand "what" Solr is locking in this case at
>>>>>>>>>>>>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
>>>>>>>>>>>>>>>> when performing an update? That will help me understand where to look
>>>>>>>>>>>>>>>> next (see the illustrative sketch after this list of questions).
>>>>>>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8";
>>>>>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
>>>>>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
>>>>>>>>>>>>>>>> 5) Wild-ass theory: would more shards provide more locks (whatever they
>>>>>>>>>>>>>>>> are) on update, and thus more update throughput?
>>>>>>>>>>>>>>>> 
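On 2) and 3): the stack above shows SolrCmdDistributor.submit() blocking in AdjustableSemaphore.acquire(), and the "0x00000007216e68d8" in the thread dump is the identity of the shared Semaphore sync object that every parked thread is waiting on. Below is a minimal, illustrative sketch of how a counting semaphore caps in-flight update requests; it is not Solr's actual code, and the permit count and class names are made up:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Illustration only: a semaphore limits how many distributed update requests
    // may be in flight at once. Every thread parked in acquire() is waiting on the
    // same Semaphore object - the single "0x..." identity seen in the dump.
    public class BoundedSubmitter {
        private final Semaphore permits = new Semaphore(16);   // made-up permit count
        private final ExecutorService pool = Executors.newCachedThreadPool();

        public void submit(final Runnable updateRequest) throws InterruptedException {
            permits.acquire();                 // blocks (parks) when all permits are taken
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        updateRequest.run();   // forward the update to a replica/leader
                    } finally {
                        permits.release();     // free a permit so a parked thread can proceed
                    }
                }
            });
        }
    }
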
>>>>>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at
