That makes sense - thanks Erick and Mark for your help! :) I'll see if I can find a place to assist with the testing of SOLR-5232.
Cheers,

Tim

On 12 September 2013 11:16, Mark Miller <markrmil...@gmail.com> wrote:

> Right, I don't see SOLR-5232 making 4.5, unfortunately. It could perhaps
> make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and
> SOLR-5232 is not quite ready - we need some testing.
>
> - Mark
>
> On Sep 12, 2013, at 2:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > My take on it is this, assuming I'm reading this right:
> > 1> SOLR-5216 - probably not going anywhere; 5232 will take care of it.
> > 2> SOLR-5232 - expected to fix the underlying issue no matter whether
> > you're using CloudSolrServer from SolrJ or sending lots of updates from
> > lots of clients.
> > 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
> > meantime.
> >
> > I don't quite know whether SOLR-5232 will make it into 4.5 or not; it
> > hasn't been committed anywhere yet. The Solr 4.5 release is imminent -
> > RC0 looks like it'll be ready to cut next week - so it might not be
> > included.
> >
> > Best,
> > Erick
> >
> > On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >
> >> Lol at breaking during a demo - always the way it is! :) I agree, we
> >> are just tip-toeing around the issue, but waiting for 4.5 is definitely
> >> an option if we "get by" for now in testing; patched Solr versions seem
> >> to make people uneasy sometimes :).
> >>
> >> Seeing there seems to be some danger to SOLR-5216 (in some ways it
> >> blows up worse due to fewer limitations on threads), I'm guessing only
> >> SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those two in
> >> combination will make a world of difference!
> >>
> >> Thanks so much again, guys!
> >>
> >> Tim
> >>
> >> On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Fewer client threads updating makes sense, and going to 1 core also
> >>> seems like it might help. But it's all a crap-shoot unless the
> >>> underlying cause gets fixed up. Both would improve things, but you'll
> >>> still hit the problem sometime, probably when doing a demo for your
> >>> boss ;).
> >>>
> >>> Adrien has branched the code for Solr 4.5 in preparation for a release
> >>> candidate tentatively scheduled for next week. You might just start
> >>> working with that branch if you can, rather than apply individual
> >>> patches...
> >>>
> >>> I suspect there'll be a couple more changes to this code (looks like
> >>> Shikhar already raised an issue, for instance) before 4.5 is finally
> >>> cut...
> >>>
> >>> FWIW,
> >>> Erick
> >>>
> >>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>
> >>>> Thanks Erick!
> >>>>
> >>>> Yeah, I think the next step will be CloudSolrServer with the
> >>>> SOLR-4816 patch. I think that is a very, very useful patch, by the
> >>>> way. SOLR-5232 seems promising as well.
> >>>>
> >>>> I see your point on the more-shards idea; this is obviously a
> >>>> global/instance-level lock. If I really had to, I suppose I could run
> >>>> more Solr instances to reduce locking, then? Currently I have 2 cores
> >>>> per instance, and I could go 1-to-1 to simplify things.
> >>>>
> >>>> The good news is we seem to be more stable since changing to a bigger
> >>>> client->solr batch size and fewer client threads updating.
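For anyone searching the archives later: the client->solr batching described
above looks roughly like the sketch below - a minimal sketch only, with
made-up field names and endpoint, and all commits left to autoCommit:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchPusher {
        private static final int BATCH_SIZE = 200; // raised from 10

        public static void main(String[] args) throws Exception {
            // Endpoint is a placeholder for our HTTP VIP.
            HttpSolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("body_t", "example doc " + i);
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    solr.add(batch); // one HTTP request per 200 docs
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch); // flush the remainder; no explicit commit
            }
            solr.shutdown();
        }
    }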
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> On 11/09/13 04:19 AM, Erick Erickson wrote:
> >>>>
> >>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a
> >>>>> recent copy of the 4x branch. By "recent", I mean like today; it
> >>>>> looks like Mark applied this early this morning. But several reports
> >>>>> indicate that this will solve your problem.
> >>>>>
> >>>>> I would expect that increasing the number of shards would make the
> >>>>> problem worse, not better.
> >>>>>
> >>>>> There's also SOLR-5232...
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> Based on my understanding of the problem we are encountering, I
> >>>>>> feel we've been able to reduce the likelihood of this issue by
> >>>>>> making the following changes to our app's usage of SolrCloud:
> >>>>>>
> >>>>>> 1) We increased our document batch size to 200 from 10 - our app
> >>>>>> batches updates to reduce HTTP requests/overhead. The theory is
> >>>>>> that increasing the batch size reduces the likelihood of this
> >>>>>> issue happening.
> >>>>>> 2) We reduced to 1 application node sending updates to SolrCloud -
> >>>>>> we write Solr updates to Redis, and previously had 4 application
> >>>>>> nodes pushing the updates to Solr (popping off the Redis queue).
> >>>>>> Reducing the number of nodes pushing to Solr reduces the
> >>>>>> concurrency on SolrCloud.
> >>>>>> 3) Fewer threads pushing to SolrCloud - due to the increase in
> >>>>>> batch size, we were able to go down to 5 update threads on the
> >>>>>> update-pushing app (from 10 threads).
> >>>>>>
> >>>>>> To be clear, the above only reduces the likelihood of the issue
> >>>>>> happening, and DOES NOT actually resolve the issue at hand.
> >>>>>>
> >>>>>> If we happen to encounter issues with the above 3 changes, the next
> >>>>>> steps (I could use some advice on) are:
> >>>>>>
> >>>>>> 1) Increase the number of shards (2x) - the theory here is that
> >>>>>> this reduces the locking on shards because there are more shards.
> >>>>>> Am I onto something here, or will this not help at all?
> >>>>>> 2) Use CloudSolrServer - currently we have a plain-old
> >>>>>> least-connection HTTP VIP. If we go "direct" to what we need to
> >>>>>> update, this will reduce concurrency in SolrCloud a bit. Thoughts?
> >>>>>>
> >>>>>> Thanks all!
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
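On next-step 2, for reference: moving from the HTTP VIP to CloudSolrServer
is mostly a change of constructor - you hand it the ZooKeeper connect string
instead of an HTTP URL, and it routes requests using the cluster state. A
minimal sketch; the ZK hosts and collection name are examples:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudPusher {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            // With the SOLR-4816 patch, adds are hashed client-side and sent
            // straight to the correct shard leader instead of a random node.
            solr.add(doc);
            solr.shutdown();
        }
    }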
> >>>>>>
> >>>>>> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>
> >>>>>>> Enjoy your trip, Mark! Thanks again for the help!
> >>>>>>>
> >>>>>>> Tim
> >>>>>>>
> >>>>>>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Okay, thanks - useful info. Getting on a plane, but I'll look
> >>>>>>>> more at this soon. That 10k thread spike is good to know - that's
> >>>>>>>> no good and could easily be part of the problem. We want to keep
> >>>>>>>> that from happening.
> >>>>>>>>
> >>>>>>>> Mark
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
> >>>>>>>>
> >>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Mark,
> >>>>>>>>>
> >>>>>>>>> The farthest we've made it at the same batch size/volume was 12
> >>>>>>>>> hours without this patch, but that isn't consistent. Sometimes
> >>>>>>>>> we would only get to 6 hours or less.
> >>>>>>>>>
> >>>>>>>>> During the crash I can see an amazing spike in threads to 10k,
> >>>>>>>>> which is essentially our ulimit for the JVM, but strangely I see
> >>>>>>>>> none of the "OutOfMemory: cannot open native thread" errors that
> >>>>>>>>> always follow this. Weird!
> >>>>>>>>>
> >>>>>>>>> We also notice a spike in CPU around the crash. The instability
> >>>>>>>>> caused some shard recovery/replication though, so that CPU may
> >>>>>>>>> be a symptom of the replication, or is possibly the root cause.
> >>>>>>>>> The CPU spikes from about 20-30% utilization (system + user) to
> >>>>>>>>> 60% fairly sharply, so the CPU, while spiking, isn't quite
> >>>>>>>>> "pinned" (very beefy Dell R720s - 16-core Xeons, whole index in
> >>>>>>>>> 128GB RAM, 6xRAID10 15k).
> >>>>>>>>>
> >>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during
> >>>>>>>>> the crash (about 1300kbps written to 3500kbps), but this may
> >>>>>>>>> have been the replication, or ERROR logging (we generally log
> >>>>>>>>> nothing due to WARN severity unless something breaks).
> >>>>>>>>>
> >>>>>>>>> Lastly, I found this stack trace occurring frequently, and have
> >>>>>>>>> no idea what it is (may be useful or not):
> >>>>>>>>>
> >>>>>>>>> "java.lang.IllegalStateException:
> >>>>>>>>> at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >>>>>>>>> at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >>>>>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >>>>>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >>>>>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >>>>>>>>> at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>
> >>>>>>>>> On your live_nodes question, I don't have historical data on
> >>>>>>>>> this from when the crash occurred, which I guess is what you're
> >>>>>>>>> looking for. I could add this to our monitoring for future
> >>>>>>>>> tests, however. I'd be glad to continue further testing, but I
> >>>>>>>>> think first more monitoring is needed to understand this
> >>>>>>>>> further. Could we come up with a list of metrics that would be
> >>>>>>>>> useful to see following another test and successful crash?
> >>>>>>>>>
> >>>>>>>>> Metrics needed:
> >>>>>>>>>
> >>>>>>>>> 1) # of live_nodes.
> >>>>>>>>> 2) Full stack traces.
> >>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>>>>> 4) Solr's JVM thread count (already done).
> >>>>>>>>> 5) ?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Tim Vaillancourt
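For metric 1: the live_nodes count can be polled straight from ZooKeeper and
graphed over time. A minimal sketch, assuming no ZK chroot - Solr registers
each live node as an ephemeral child of /live_nodes:

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class LiveNodesCheck {
        public static void main(String[] args) throws Exception {
            // Null watcher: this is a one-shot poll, suitable for cron/graphing.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live_nodes=" + liveNodes.size() + " " + liveNodes);
            zk.close();
        }
    }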
> >>>>>>>>>
> >>>>>>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Did you ever get to index that long before without hitting the
> >>>>>>>>>> deadlock? There really isn't anything negative the patch could
> >>>>>>>>>> be introducing, other than allowing for some more threads to
> >>>>>>>>>> possibly run at once. If I had to guess, I would say it's
> >>>>>>>>>> likely this patch fixes the deadlock issue and you're seeing
> >>>>>>>>>> another issue - which looks like the system cannot keep up with
> >>>>>>>>>> the requests for some reason - perhaps due to some OS
> >>>>>>>>>> networking settings or something (more guessing). Connection
> >>>>>>>>>> refused generally happens when there is nothing listening on
> >>>>>>>>>> the port.
> >>>>>>>>>>
> >>>>>>>>>> Do you see anything interesting change with the rest of the
> >>>>>>>>>> system? CPU usage spikes or something like that?
> >>>>>>>>>>
> >>>>>>>>>> Clamping down further on the overall number of threads might
> >>>>>>>>>> help (which would require making something configurable). How
> >>>>>>>>>> many nodes are listed in zk under live_nodes?
> >>>>>>>>>>
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>
> >>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey guys,
> >>>>>>>>>>>
> >>>>>>>>>>> (copy of my post to SOLR-5216)
> >>>>>>>>>>>
> >>>>>>>>>>> We tested this patch and unfortunately encountered some
> >>>>>>>>>>> serious issues after a few hours of 500 update-batches/sec.
> >>>>>>>>>>> Our update batch is 10 docs, so we are writing about 5000
> >>>>>>>>>>> docs/sec total, using autoCommit to commit the updates (no
> >>>>>>>>>>> explicit commits).
> >>>>>>>>>>>
> >>>>>>>>>>> Our environment:
> >>>>>>>>>>>
> >>>>>>>>>>> Solr 4.3.1 w/SOLR-5216 patch.
> >>>>>>>>>>> Jetty 9, Java 1.7.
> >>>>>>>>>>> 3 Solr instances, 1 per physical server.
> >>>>>>>>>>> 1 collection.
> >>>>>>>>>>> 3 shards.
> >>>>>>>>>>> 2 replicas (each instance is a leader and a replica).
> >>>>>>>>>>> Soft autoCommit is 1000ms.
> >>>>>>>>>>> Hard autoCommit is 15000ms.
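For reference, those two commit settings correspond to the standard
solrconfig.xml autoCommit block - a sketch, not a quote of our config; the
openSearcher=false line is an assumption about the usual setup:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>          <!-- hard commit: flushes to disk -->
        <openSearcher>false</openSearcher>
      </autoCommit>
      <autoSoftCommit>
        <maxTime>1000</maxTime>           <!-- soft commit: new searcher every 1s -->
      </autoSoftCommit>
    </updateHandler>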
> >>>>>>>>>>>
> >>>>>>>>>>> After about 6 hours of stress-testing this patch, we see many
> >>>>>>>>>>> of these stalled transactions (below), and the Solr instances
> >>>>>>>>>>> start to see each other as down, flooding our Solr logs with
> >>>>>>>>>>> "Connection Refused" exceptions, and otherwise no
> >>>>>>>>>>> obviously-useful logs that I could see.
> >>>>>>>>>>>
> >>>>>>>>>>> I did notice some stalled transactions on both /select and
> >>>>>>>>>>> /update, however. This never occurred without this patch.
> >>>>>>>>>>>
> >>>>>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>>>>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>>>>>>>>>
> >>>>>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this
> >>>>>>>>>>> 24-hour soak. My script "normalizes" the ERROR-severity stack
> >>>>>>>>>>> traces and returns them in order of occurrence.
> >>>>>>>>>>>
> >>>>>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>> Tim Vaillancourt
> >>>>>>>>>>>
> >>>>>>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>
> >>>>>>>>>>>>> From: Erick Erickson <erickerick...@gmail.com>
> >>>>>>>>>>>>> Sent: Friday 6th September 2013 16:20
> >>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Markus:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Mark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Got an issue to watch?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Markus
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
> >>>>>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've
> >>>>>>>>>>>>>>> suspected what it is since early this year, but it's never
> >>>>>>>>>>>>>>> personally been an issue, so it's rolled along for a long
> >>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am looking into an issue we've been having with
> >>>>>>>>>>>>>>>> SolrCloud since the beginning of our testing, all the way
> >>>>>>>>>>>>>>>> from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed
> >>>>>>>>>>>>>>>> other users with this same issue, so I'd really like to
> >>>>>>>>>>>>>>>> get to the bottom of it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec),
> >>>>>>>>>>>>>>>> after 1-12 hours we see stalled transactions that
> >>>>>>>>>>>>>>>> snowball to consume all Jetty threads in the JVM. This
> >>>>>>>>>>>>>>>> eventually causes the JVM to hang with most threads
> >>>>>>>>>>>>>>>> waiting on the condition/stack provided at the bottom of
> >>>>>>>>>>>>>>>> this message.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> At this point, SolrCloud instances then start to see
> >>>>>>>>>>>>>>>> their neighbors (who also have all threads hung) as down
> >>>>>>>>>>>>>>>> w/"Connection Refused", and the shards become "down" in
> >>>>>>>>>>>>>>>> state. Sometimes a node or two survives and just returns
> >>>>>>>>>>>>>>>> 503 "no server hosting shard" errors.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of
> >>>>>>>>>>>>>>>> threads sending updates to Solr, as well as the batch
> >>>>>>>>>>>>>>>> size (we batch updates from client -> solr), and the
> >>>>>>>>>>>>>>>> soft/hard autoCommits, all to no avail. We also tried
> >>>>>>>>>>>>>>>> turning off client-to-Solr batching (1 update = 1 call to
> >>>>>>>>>>>>>>>> Solr), which also did not help. Certain combinations of
> >>>>>>>>>>>>>>>> update threads and batch sizes seem to mask/help the
> >>>>>>>>>>>>>>>> problem, but not resolve it entirely.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Our current environment is the following:
> >>>>>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a
> >>>>>>>>>>>>>>>> leader of 1 shard and a replica of 1 shard).
> >>>>>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no
> >>>>>>>>>>>>>>>> movement on a good day.
> >>>>>>>>>>>>>>>> - 5000 max Jetty threads (well above what we use when we
> >>>>>>>>>>>>>>>> are healthy); Linux user-threads ulimit is 6000.
> >>>>>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>>>>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a
> >>>>>>>>>>>>>>>> Jetty or Java version (I hope I'm wrong).
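A side note on the 5000-thread cap for readers: we set it in Jetty's own
config, but the embedded-Jetty equivalent below makes the relationship to
the ulimit clear. This is an illustration only, not our actual jetty.xml:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    public class CappedJetty {
        public static void main(String[] args) throws Exception {
            QueuedThreadPool pool = new QueuedThreadPool();
            pool.setMaxThreads(5000); // keep below the 6000 user-thread ulimit
            Server server = new Server(pool); // Jetty 9 takes the pool in the constructor
            // ... connectors and handlers omitted in this sketch ...
            server.start();
            server.join();
        }
    }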
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP
> >>>>>>>>>>>>>>>> threads is the following, which seems to be waiting on a
> >>>>>>>>>>>>>>>> lock that I would very much like to understand further:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>>>>>>>>> at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>>>>>>>>> - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>>>>>>>>> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>>>>>>>>> at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:724)"
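A note for readers on what that parked lock is: the trace shows
Semaphore.acquire reached through org.apache.solr.util.AdjustableSemaphore
from SolrCmdDistributor.submit - a single JVM-wide pool of permits capping
concurrent outbound (distributed) update requests. Roughly like the
illustration below - this is not Solr's actual code, and the permit count
here is made up:

    import java.util.concurrent.Semaphore;

    public class BoundedSubmit {
        // One shared pool of permits for the whole JVM, like AdjustableSemaphore.
        private static final Semaphore PERMITS = new Semaphore(16);

        static void submit(Runnable forwardUpdate) throws InterruptedException {
            PERMITS.acquire(); // the park() in the stack trace happens here
            try {
                forwardUpdate.run(); // send the update on to a leader/replica
            } finally {
                // In the real code the permit is released when the response
                // completes; if responses stop coming back, permits run out
                // and every Jetty thread parks in acquire() - the hang above.
                PERMITS.release();
            }
        }
    }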
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Some questions I had were:
> >>>>>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>>>>>>>>>> 2) Keeping in mind I do not read or write Java (sorry :D), could someone
> >>>>>>>>>>>>>>>> help me understand "what" Solr is locking in this case at
> >>>>>>>>>>>>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>>>>>>>>>>>> when performing an update? That will help me understand where to look next.
> >>>>>>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8";
> >>>>>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>>>>>>>>>> 5) Wild-ass theory: would more shards provide more locks (whatever they
> >>>>>>>>>>>>>>>> are) on update, and thus more update throughput?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at