That makes sense - thanks Erick and Mark for your help! :) I'll see if I can find a place to assist with the testing of SOLR-5232.
Cheers,

Tim

On 12 September 2013 11:16, Mark Miller <markrmil...@gmail.com> wrote:

> Right, I don't see SOLR-5232 making 4.5, unfortunately. It could perhaps
> make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and
> SOLR-5232 is not quite ready - we need some testing.
>
> - Mark
>
> On Sep 12, 2013, at 2:12 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > My take on it is this, assuming I'm reading this right:
> > 1> SOLR-5216 - probably not going anywhere; 5232 will take care of it.
> > 2> SOLR-5232 - expected to fix the underlying issue no matter whether
> > you're using CloudSolrServer from SolrJ or sending lots of updates from
> > lots of clients.
> > 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the
> > meantime.
> >
> > I don't quite know whether SOLR-5232 will make it into 4.5 or not; it
> > hasn't been committed anywhere yet. The Solr 4.5 release is imminent -
> > RC0 looks like it'll be ready to cut next week - so it might not be
> > included.
> >
> > Best,
> > Erick
> >
> > On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >
> >> Lol at breaking during a demo - always the way it is! :) I agree, we
> >> are just tip-toeing around the issue, but waiting for 4.5 is definitely
> >> an option if we "get by" for now in testing; patched Solr versions seem
> >> to make people uneasy sometimes :).
> >>
> >> Seeing there seems to be some danger to SOLR-5216 (in some ways it
> >> blows up worse due to fewer limitations on threads), I'm guessing only
> >> SOLR-5232 and SOLR-4816 are making it into 4.5? I feel those two in
> >> combination will make a world of difference!
> >>
> >> Thanks so much again, guys!
> >>
> >> Tim
> >>
> >> On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >>> Fewer client threads updating makes sense, and going to 1 core also
> >>> seems like it might help. But it's all a crap-shoot unless the
> >>> underlying cause gets fixed up. Both would improve things, but you'll
> >>> still hit the problem sometime, probably when doing a demo for your
> >>> boss ;).
> >>>
> >>> Adrien has branched the code for Solr 4.5 in preparation for a release
> >>> candidate tentatively scheduled for next week. You might just start
> >>> working with that branch if you can, rather than apply individual
> >>> patches...
> >>>
> >>> I suspect there'll be a couple more changes to this code (looks like
> >>> Shikhar already raised an issue, for instance) before 4.5 is finally
> >>> cut...
> >>>
> >>> FWIW,
> >>> Erick
> >>>
> >>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>
> >>>> Thanks Erick!
> >>>>
> >>>> Yeah, I think the next step will be CloudSolrServer with the
> >>>> SOLR-4816 patch. I think that is a very, very useful patch, by the
> >>>> way. SOLR-5232 seems promising as well.
> >>>>
> >>>> I see your point on the more-shards idea; this is obviously a
> >>>> global/instance-level lock. If I really had to, I suppose I could run
> >>>> more Solr instances to reduce locking, then? Currently I have 2 cores
> >>>> per instance, and I could go 1-to-1 to simplify things.
> >>>>
> >>>> The good news is we seem to be more stable since changing to a bigger
> >>>> client->solr batch size and fewer client threads updating.
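For anyone searching the archives later: the client->solr batching described
above looks roughly like the sketch below - a minimal sketch only, with
made-up field names and endpoint, and all commits left to autoCommit:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchPusher {
        private static final int BATCH_SIZE = 200; // raised from 10

        public static void main(String[] args) throws Exception {
            // Endpoint is a placeholder for our HTTP VIP.
            HttpSolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("body_t", "example doc " + i);
                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    solr.add(batch); // one HTTP request per 200 docs
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch); // flush the remainder; no explicit commit
            }
            solr.shutdown();
        }
    }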
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> On 11/09/13 04:19 AM, Erick Erickson wrote:
> >>>>
> >>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a
> >>>>> recent copy of the 4x branch. By "recent", I mean like today; it
> >>>>> looks like Mark applied this early this morning. But several reports
> >>>>> indicate that this will solve your problem.
> >>>>>
> >>>>> I would expect that increasing the number of shards would make the
> >>>>> problem worse, not better.
> >>>>>
> >>>>> There's also SOLR-5232...
> >>>>>
> >>>>> Best,
> >>>>> Erick
> >>>>>
> >>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>
> >>>>>> Hey guys,
> >>>>>>
> >>>>>> Based on my understanding of the problem we are encountering, I
> >>>>>> feel we've been able to reduce the likelihood of this issue by
> >>>>>> making the following changes to our app's usage of SolrCloud:
> >>>>>>
> >>>>>> 1) We increased our document batch size to 200 from 10 - our app
> >>>>>> batches updates to reduce HTTP requests/overhead. The theory is
> >>>>>> that increasing the batch size reduces the likelihood of this
> >>>>>> issue happening.
> >>>>>> 2) We reduced to 1 application node sending updates to SolrCloud -
> >>>>>> we write Solr updates to Redis, and previously had 4 application
> >>>>>> nodes pushing the updates to Solr (popping off the Redis queue).
> >>>>>> Reducing the number of nodes pushing to Solr reduces the
> >>>>>> concurrency on SolrCloud.
> >>>>>> 3) Fewer threads pushing to SolrCloud - due to the increase in
> >>>>>> batch size, we were able to go down to 5 update threads on the
> >>>>>> update-pushing app (from 10 threads).
> >>>>>>
> >>>>>> To be clear, the above only reduces the likelihood of the issue
> >>>>>> happening, and DOES NOT actually resolve the issue at hand.
> >>>>>>
> >>>>>> If we happen to encounter issues with the above 3 changes, the next
> >>>>>> steps (I could use some advice on) are:
> >>>>>>
> >>>>>> 1) Increase the number of shards (2x) - the theory here is that
> >>>>>> this reduces the locking on shards because there are more shards.
> >>>>>> Am I onto something here, or will this not help at all?
> >>>>>> 2) Use CloudSolrServer - currently we have a plain-old
> >>>>>> least-connection HTTP VIP. If we go "direct" to what we need to
> >>>>>> update, this will reduce concurrency in SolrCloud a bit. Thoughts?
> >>>>>>
> >>>>>> Thanks all!
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Tim
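On next-step 2, for reference: moving from the HTTP VIP to CloudSolrServer
is mostly a change of constructor - you hand it the ZooKeeper connect string
instead of an HTTP URL, and it routes requests using the cluster state. A
minimal sketch; the ZK hosts and collection name are examples:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudPusher {
        public static void main(String[] args) throws Exception {
            CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solr.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "42");
            // With the SOLR-4816 patch, adds are hashed client-side and sent
            // straight to the correct shard leader instead of a random node.
            solr.add(doc);
            solr.shutdown();
        }
    }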
> >>>>>>
> >>>>>> On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>
> >>>>>>> Enjoy your trip, Mark! Thanks again for the help!
> >>>>>>>
> >>>>>>> Tim
> >>>>>>>
> >>>>>>> On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Okay, thanks - useful info. Getting on a plane, but I'll look
> >>>>>>>> more at this soon. That 10k thread spike is good to know - that's
> >>>>>>>> no good and could easily be part of the problem. We want to keep
> >>>>>>>> that from happening.
> >>>>>>>>
> >>>>>>>> Mark
> >>>>>>>>
> >>>>>>>> Sent from my iPhone
> >>>>>>>>
> >>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Mark,
> >>>>>>>>>
> >>>>>>>>> The farthest we've made it at the same batch size/volume was 12
> >>>>>>>>> hours without this patch, but that isn't consistent. Sometimes
> >>>>>>>>> we would only get to 6 hours or less.
> >>>>>>>>>
> >>>>>>>>> During the crash I can see an amazing spike in threads to 10k,
> >>>>>>>>> which is essentially our ulimit for the JVM, but strangely I see
> >>>>>>>>> none of the "OutOfMemory: cannot open native thread" errors that
> >>>>>>>>> always follow this. Weird!
> >>>>>>>>>
> >>>>>>>>> We also notice a spike in CPU around the crash. The instability
> >>>>>>>>> caused some shard recovery/replication though, so that CPU may
> >>>>>>>>> be a symptom of the replication, or is possibly the root cause.
> >>>>>>>>> The CPU spikes from about 20-30% utilization (system + user) to
> >>>>>>>>> 60% fairly sharply, so the CPU, while spiking, isn't quite
> >>>>>>>>> "pinned" (very beefy Dell R720s - 16-core Xeons, whole index in
> >>>>>>>>> 128GB RAM, 6xRAID10 15k).
> >>>>>>>>>
> >>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during
> >>>>>>>>> the crash (about 1300kbps written to 3500kbps), but this may
> >>>>>>>>> have been the replication, or ERROR logging (we generally log
> >>>>>>>>> nothing due to WARN severity unless something breaks).
> >>>>>>>>>
> >>>>>>>>> Lastly, I found this stack trace occurring frequently, and have
> >>>>>>>>> no idea what it is (may be useful or not):
> >>>>>>>>>
> >>>>>>>>> "java.lang.IllegalStateException:
> >>>>>>>>> at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >>>>>>>>> at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >>>>>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >>>>>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >>>>>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >>>>>>>>> at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>>
> >>>>>>>>> On your live_nodes question, I don't have historical data on
> >>>>>>>>> this from when the crash occurred, which I guess is what you're
> >>>>>>>>> looking for. I could add this to our monitoring for future
> >>>>>>>>> tests, however. I'd be glad to continue further testing, but I
> >>>>>>>>> think first more monitoring is needed to understand this
> >>>>>>>>> further. Could we come up with a list of metrics that would be
> >>>>>>>>> useful to see following another test and successful crash?
> >>>>>>>>>
> >>>>>>>>> Metrics needed:
> >>>>>>>>>
> >>>>>>>>> 1) # of live_nodes.
> >>>>>>>>> 2) Full stack traces.
> >>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> >>>>>>>>> 4) Solr's JVM thread count (already done).
> >>>>>>>>> 5) ?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Tim Vaillancourt
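For metric 1: the live_nodes count can be polled straight from ZooKeeper and
graphed over time. A minimal sketch, assuming no ZK chroot - Solr registers
each live node as an ephemeral child of /live_nodes:

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class LiveNodesCheck {
        public static void main(String[] args) throws Exception {
            // Null watcher: this is a one-shot poll, suitable for cron/graphing.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
            List<String> liveNodes = zk.getChildren("/live_nodes", false);
            System.out.println("live_nodes=" + liveNodes.size() + " " + liveNodes);
            zk.close();
        }
    }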
> >>>>>>>>>
> >>>>>>>>> On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Did you ever get to index that long before without hitting the
> >>>>>>>>>> deadlock? There really isn't anything negative the patch could
> >>>>>>>>>> be introducing, other than allowing for some more threads to
> >>>>>>>>>> possibly run at once. If I had to guess, I would say it's
> >>>>>>>>>> likely this patch fixes the deadlock issue and you're seeing
> >>>>>>>>>> another issue - which looks like the system cannot keep up with
> >>>>>>>>>> the requests for some reason - perhaps due to some OS
> >>>>>>>>>> networking settings or something (more guessing). Connection
> >>>>>>>>>> refused generally happens when there is nothing listening on
> >>>>>>>>>> the port.
> >>>>>>>>>>
> >>>>>>>>>> Do you see anything interesting change with the rest of the
> >>>>>>>>>> system? CPU usage spikes or something like that?
> >>>>>>>>>>
> >>>>>>>>>> Clamping down further on the overall number of threads might
> >>>>>>>>>> help (which would require making something configurable). How
> >>>>>>>>>> many nodes are listed in zk under live_nodes?
> >>>>>>>>>>
> >>>>>>>>>> Mark
> >>>>>>>>>>
> >>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>
> >>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey guys,
> >>>>>>>>>>>
> >>>>>>>>>>> (copy of my post to SOLR-5216)
> >>>>>>>>>>>
> >>>>>>>>>>> We tested this patch and unfortunately encountered some
> >>>>>>>>>>> serious issues after a few hours of 500 update-batches/sec.
> >>>>>>>>>>> Our update batch is 10 docs, so we are writing about 5000
> >>>>>>>>>>> docs/sec total, using autoCommit to commit the updates (no
> >>>>>>>>>>> explicit commits).
> >>>>>>>>>>>
> >>>>>>>>>>> Our environment:
> >>>>>>>>>>>
> >>>>>>>>>>> Solr 4.3.1 w/SOLR-5216 patch.
> >>>>>>>>>>> Jetty 9, Java 1.7.
> >>>>>>>>>>> 3 Solr instances, 1 per physical server.
> >>>>>>>>>>> 1 collection.
> >>>>>>>>>>> 3 shards.
> >>>>>>>>>>> 2 replicas (each instance is a leader and a replica).
> >>>>>>>>>>> Soft autoCommit is 1000ms.
> >>>>>>>>>>> Hard autoCommit is 15000ms.
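For reference, those two commit settings correspond to the standard
solrconfig.xml autoCommit block - a sketch, not a quote of our config; the
openSearcher=false line is an assumption about the usual setup:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>          <!-- hard commit: flushes to disk -->
        <openSearcher>false</openSearcher>
      </autoCommit>
      <autoSoftCommit>
        <maxTime>1000</maxTime>           <!-- soft commit: new searcher every 1s -->
      </autoSoftCommit>
    </updateHandler>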
> >>>>>>>>>>>
> >>>>>>>>>>> After about 6 hours of stress-testing this patch, we see many
> >>>>>>>>>>> of these stalled transactions (below), and the Solr instances
> >>>>>>>>>>> start to see each other as down, flooding our Solr logs with
> >>>>>>>>>>> "Connection Refused" exceptions, and otherwise no
> >>>>>>>>>>> obviously-useful logs that I could see.
> >>>>>>>>>>>
> >>>>>>>>>>> I did notice some stalled transactions on both /select and
> >>>>>>>>>>> /update, however. This never occurred without this patch.
> >>>>>>>>>>>
> >>>>>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>>>>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>>>>>>>>>
> >>>>>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this
> >>>>>>>>>>> 24-hour soak. My script "normalizes" the ERROR-severity stack
> >>>>>>>>>>> traces and returns them in order of occurrence.
> >>>>>>>>>>>
> >>>>>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>> Tim Vaillancourt
> >>>>>>>>>>>
> >>>>>>>>>>> On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks!
> >>>>>>>>>>>>
> >>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>
> >>>>>>>>>>>>> From: Erick Erickson <erickerick...@gmail.com>
> >>>>>>>>>>>>> Sent: Friday 6th September 2013 16:20
> >>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Markus:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Mark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Got an issue to watch?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Markus
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -----Original message-----
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> From: Mark Miller <markrmil...@gmail.com>
> >>>>>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>>>>>>>>>> To: solr-user@lucene.apache.org
> >>>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've
> >>>>>>>>>>>>>>> suspected what it is since early this year, but it's never
> >>>>>>>>>>>>>>> personally been an issue, so it's rolled along for a long
> >>>>>>>>>>>>>>> time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Sent from my iPhone
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I am looking into an issue we've been having with
> >>>>>>>>>>>>>>>> SolrCloud since the beginning of our testing, all the way
> >>>>>>>>>>>>>>>> from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed
> >>>>>>>>>>>>>>>> other users with this same issue, so I'd really like to
> >>>>>>>>>>>>>>>> get to the bottom of it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec),
> >>>>>>>>>>>>>>>> after 1-12 hours we see stalled transactions that
> >>>>>>>>>>>>>>>> snowball to consume all Jetty threads in the JVM. This
> >>>>>>>>>>>>>>>> eventually causes the JVM to hang with most threads
> >>>>>>>>>>>>>>>> waiting on the condition/stack provided at the bottom of
> >>>>>>>>>>>>>>>> this message.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> At this point, SolrCloud instances then start to see
> >>>>>>>>>>>>>>>> their neighbors (who also have all threads hung) as down
> >>>>>>>>>>>>>>>> w/"Connection Refused", and the shards become "down" in
> >>>>>>>>>>>>>>>> state. Sometimes a node or two survives and just returns
> >>>>>>>>>>>>>>>> 503 "no server hosting shard" errors.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of
> >>>>>>>>>>>>>>>> threads sending updates to Solr, as well as the batch
> >>>>>>>>>>>>>>>> size (we batch updates from client -> solr), and the
> >>>>>>>>>>>>>>>> soft/hard autoCommits, all to no avail. We also tried
> >>>>>>>>>>>>>>>> turning off client-to-Solr batching (1 update = 1 call to
> >>>>>>>>>>>>>>>> Solr), which also did not help. Certain combinations of
> >>>>>>>>>>>>>>>> update threads and batch sizes seem to mask/help the
> >>>>>>>>>>>>>>>> problem, but not resolve it entirely.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Our current environment is the following:
> >>>>>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a
> >>>>>>>>>>>>>>>> leader of 1 shard and a replica of 1 shard).
> >>>>>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no
> >>>>>>>>>>>>>>>> movement on a good day.
> >>>>>>>>>>>>>>>> - 5000 max Jetty threads (well above what we use when we
> >>>>>>>>>>>>>>>> are healthy); Linux user-threads ulimit is 6000.
> >>>>>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>>>>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a
> >>>>>>>>>>>>>>>> Jetty or Java version (I hope I'm wrong).
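A side note on the 5000-thread cap for readers: we set it in Jetty's own
config, but the embedded-Jetty equivalent below makes the relationship to
the ulimit clear. This is an illustration only, not our actual jetty.xml:

    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.util.thread.QueuedThreadPool;

    public class CappedJetty {
        public static void main(String[] args) throws Exception {
            QueuedThreadPool pool = new QueuedThreadPool();
            pool.setMaxThreads(5000); // keep below the 6000 user-thread ulimit
            Server server = new Server(pool); // Jetty 9 takes the pool in the constructor
            // ... connectors and handlers omitted in this sketch ...
            server.start();
            server.join();
        }
    }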
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP
> >>>>>>>>>>>>>>>> threads is the following, which seems to be waiting on a
> >>>>>>>>>>>>>>>> lock that I would very much like to understand further:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>>>>>>>>> at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>>>>>>>>> - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>>>>>>>>> at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>>>>>>>>> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>>>>>>>>> at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>>>>>>>>> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:724)"
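A note for readers on what that parked lock is: the trace shows
Semaphore.acquire reached through org.apache.solr.util.AdjustableSemaphore
from SolrCmdDistributor.submit - a single JVM-wide pool of permits capping
concurrent outbound (distributed) update requests. Roughly like the
illustration below - this is not Solr's actual code, and the permit count
here is made up:

    import java.util.concurrent.Semaphore;

    public class BoundedSubmit {
        // One shared pool of permits for the whole JVM, like AdjustableSemaphore.
        private static final Semaphore PERMITS = new Semaphore(16);

        static void submit(Runnable forwardUpdate) throws InterruptedException {
            PERMITS.acquire(); // the park() in the stack trace happens here
            try {
                forwardUpdate.run(); // send the update on to a leader/replica
            } finally {
                // In the real code the permit is released when the response
                // completes; if responses stop coming back, permits run out
                // and every Jetty thread parks in acquire() - the hang above.
                PERMITS.release();
            }
        }
    }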
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Some questions I had were:
> >>>>>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>>>>>>>>>> 2) Keeping in mind I do not read or write Java (sorry :D), could someone
> >>>>>>>>>>>>>>>> help me understand "what" Solr is locking in this case at
> >>>>>>>>>>>>>>>> "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> >>>>>>>>>>>>>>>> when performing an update? That will help me understand where to look next.
> >>>>>>>>>>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8";
> >>>>>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>>>>>>>>>> 5) Wild-ass theory: would more shards provide more locks (whatever they
> >>>>>>>>>>>>>>>> are) on update, and thus more update throughput?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at