Right, I don't see SOLR-5232 making 4.5 unfortunately. It could perhaps make a 4.5.1 - it does resolve a critical issue - but 4.5 is in motion and SOLR-5232 is not quite ready - we need some testing.
- Mark On Sep 12, 2013, at 2:12 PM, Erick Erickson <erickerick...@gmail.com> wrote: > My take on it is this, assuming I'm reading this right: > 1> SOLR-5216 - probably not going anywhere, 5232 will take care of it. > 2> SOLR-5232 - expected to fix the underlying issue no matter whether > you're using CloudSolrServer from SolrJ or sending lots of updates from > lots of clients. > 3> SOLR-4816 - use this patch and CloudSolrServer from SolrJ in the > meantime. > > I don't quite know whether SOLR-5232 will make it in to 4.5 or not, it > hasn't been committed anywhere yet. The Solr 4.5 release is imminent, RC0 > is looking like it'll be ready to cut next week so it might not be included. > > Best, > Erick > > > On Thu, Sep 12, 2013 at 1:42 PM, Tim Vaillancourt > <t...@elementspace.com>wrote: > >> Lol, at breaking during a demo - always the way it is! :) I agree, we are >> just tip-toeing around the issue, but waiting for 4.5 is definitely an >> option if we "get-by" for now in testing; patched Solr versions seem to >> make people uneasy sometimes :). >> >> Seeing there seems to be some danger to SOLR-5216 (in some ways it blows up >> worse due to less limitations on thread), I'm guessing only SOLR-5232 and >> SOLR-4816 are making it into 4.5? I feel those 2 in combination will make a >> world of difference! >> >> Thanks so much again guys! >> >> Tim >> >> >> >> On 12 September 2013 03:43, Erick Erickson <erickerick...@gmail.com> >> wrote: >> >>> Fewer client threads updating makes sense, and going to 1 core also seems >>> like it might help. But it's all a crap-shoot unless the underlying cause >>> gets fixed up. Both would improve things, but you'll still hit the >> problem >>> sometime, probably when doing a demo for your boss ;). >>> >>> Adrien has branched the code for SOLR 4.5 in preparation for a release >>> candidate tentatively scheduled for next week. You might just start >> working >>> with that branch if you can rather than apply individual patches... >>> >>> I suspect there'll be a couple more changes to this code (looks like >>> Shikhar already raised an issue for instance) before 4.5 is finally >> cut... >>> >>> FWIW, >>> Erick >>> >>> >>> >>> On Thu, Sep 12, 2013 at 2:13 AM, Tim Vaillancourt <t...@elementspace.com >>>> wrote: >>> >>>> Thanks Erick! >>>> >>>> Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 >>>> patch. I think that is a very, very useful patch by the way. SOLR-5232 >>>> seems promising as well. >>>> >>>> I see your point on the more-shards idea, this is obviously a >>>> global/instance-level lock. If I really had to, I suppose I could run >>> more >>>> Solr instances to reduce locking then? Currently I have 2 cores per >>>> instance and I could go 1-to-1 to simplify things. >>>> >>>> The good news is we seem to be more stable since changing to a bigger >>>> client->solr batch-size and fewer client threads updating. >>>> >>>> Cheers, >>>> >>>> Tim >>>> >>>> On 11/09/13 04:19 AM, Erick Erickson wrote: >>>> >>>>> If you use CloudSolrServer, you need to apply SOLR-4816 or use a >> recent >>>>> copy of the 4x branch. By "recent", I mean like today, it looks like >>> Mark >>>>> applied this early this morning. But several reports indicate that >> this >>>>> will >>>>> solve your problem. >>>>> >>>>> I would expect that increasing the number of shards would make the >>> problem >>>>> worse, not >>>>> better. >>>>> >>>>> There's also SOLR-5232... >>>>> >>>>> Best >>>>> Erick >>>>> >>>>> >>>>> On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt<tim@elementspace. >>> **com<t...@elementspace.com> >>>>>> wrote: >>>>> >>>>> Hey guys, >>>>>> >>>>>> Based on my understanding of the problem we are encountering, I feel >>>>>> we've >>>>>> been able to reduce the likelihood of this issue by making the >>> following >>>>>> changes to our app's usage of SolrCloud: >>>>>> >>>>>> 1) We increased our document batch size to 200 from 10 - our app >>> batches >>>>>> updates to reduce HTTP requests/overhead. The theory is increasing >> the >>>>>> batch size reduces the likelihood of this issue happening. >>>>>> 2) We reduced to 1 application node sending updates to SolrCloud - we >>>>>> write >>>>>> Solr updates to Redis, and have previously had 4 application nodes >>>>>> pushing >>>>>> the updates to Solr (popping off the Redis queue). Reducing the >> number >>> of >>>>>> nodes pushing to Solr reduces the concurrency on SolrCloud. >>>>>> 3) Less threads pushing to SolrCloud - due to the increase in batch >>> size, >>>>>> we were able to go down to 5 update threads on the update-pushing-app >>>>>> (from >>>>>> 10 threads). >>>>>> >>>>>> To be clear the above only reduces the likelihood of the issue >>> happening, >>>>>> and DOES NOT actually resolve the issue at hand. >>>>>> >>>>>> If we happen to encounter issues with the above 3 changes, the next >>> steps >>>>>> (I could use some advice on) are: >>>>>> >>>>>> 1) Increase the number of shards (2x) - the theory here is this >> reduces >>>>>> the >>>>>> locking on shards because there are more shards. Am I onto something >>>>>> here, >>>>>> or will this not help at all? >>>>>> 2) Use CloudSolrServer - currently we have a plain-old >> least-connection >>>>>> HTTP VIP. If we go "direct" to what we need to update, this will >> reduce >>>>>> concurrency in SolrCloud a bit. Thoughts? >>>>>> >>>>>> Thanks all! >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Tim >>>>>> >>>>>> >>>>>> On 6 September 2013 14:47, Tim Vaillancourt<tim@elementspace.**com< >>> t...@elementspace.com>> >>>>>> wrote: >>>>>> >>>>>> Enjoy your trip, Mark! Thanks again for the help! >>>>>>> >>>>>>> Tim >>>>>>> >>>>>>> >>>>>>> On 6 September 2013 14:18, Mark Miller<markrmil...@gmail.com> >> wrote: >>>>>>> >>>>>>> Okay, thanks, useful info. Getting on a plane, but ill look more at >>>>>>>> this >>>>>>>> soon. That 10k thread spike is good to know - that's no good and >>> could >>>>>>>> easily be part of the problem. We want to keep that from happening. >>>>>>>> >>>>>>>> Mark >>>>>>>> >>>>>>>> Sent from my iPhone >>>>>>>> >>>>>>>> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt<tim@elementspace. >> **com< >>> t...@elementspace.com> >>>>>>>>> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hey Mark, >>>>>>>>> >>>>>>>>> The farthest we've made it at the same batch size/volume was 12 >>> hours >>>>>>>>> without this patch, but that isn't consistent. Sometimes we would >>> only >>>>>>>>> >>>>>>>> get >>>>>>>> >>>>>>>>> to 6 hours or less. >>>>>>>>> >>>>>>>>> During the crash I can see an amazing spike in threads to 10k >> which >>> is >>>>>>>>> essentially our ulimit for the JVM, but I strangely see no >>>>>>>>> >>>>>>>> "OutOfMemory: >>>>>> >>>>>>> cannot open native thread errors" that always follow this. Weird! >>>>>>>>> >>>>>>>>> We also notice a spike in CPU around the crash. The instability >>> caused >>>>>>>>> >>>>>>>> some >>>>>>>> >>>>>>>>> shard recovery/replication though, so that CPU may be a symptom of >>> the >>>>>>>>> replication, or is possibly the root cause. The CPU spikes from >>> about >>>>>>>>> 20-30% utilization (system + user) to 60% fairly sharply, so the >>> CPU, >>>>>>>>> >>>>>>>> while >>>>>>>> >>>>>>>>> spiking isn't quite "pinned" (very beefy Dell R720s - 16 core >> Xeons, >>>>>>>>> >>>>>>>> whole >>>>>>>> >>>>>>>>> index is in 128GB RAM, 6xRAID10 15k). >>>>>>>>> >>>>>>>>> More on resources: our disk I/O seemed to spike about 2x during >> the >>>>>>>>> >>>>>>>> crash >>>>>>>> >>>>>>>>> (about 1300kbps written to 3500kbps), but this may have been the >>>>>>>>> replication, or ERROR logging (we generally log nothing due to >>>>>>>>> WARN-severity unless something breaks). >>>>>>>>> >>>>>>>>> Lastly, I found this stack trace occurring frequently, and have no >>>>>>>>> >>>>>>>> idea >>>>>> >>>>>>> what it is (may be useful or not): >>>>>>>>> >>>>>>>>> "java.lang.**IllegalStateException : >>>>>>>>> at >>>>>>>>> >>>>>>>> >> org.eclipse.jetty.server.**Response.resetBuffer(Response.**java:964) >>>>>> >>>>>>> at org.eclipse.jetty.server.**Response.sendError(Response.** >>>>>>>>> java:325) >>>>>>>>> at >>>>>>>>> >>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.sendError(** >>>>>> SolrDispatchFilter.java:692) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** >>>>>> SolrDispatchFilter.java:380) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** >>>>>> SolrDispatchFilter.java:155) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler$CachedChain.** >>>>>> doFilter(ServletHandler.java:**1423) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doHandle(** >>>>>> ServletHandler.java:450) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(** >>>>>> ScopedHandler.java:138) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.security.**SecurityHandler.handle(** >>>>>> SecurityHandler.java:564) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.** >>>>>> doHandle(SessionHandler.java:**213) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.** >>>>>> doHandle(ContextHandler.java:**1083) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doScope(** >>>>>> ServletHandler.java:379) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.** >>>>>> doScope(SessionHandler.java:**175) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.** >>>>>> doScope(ContextHandler.java:**1017) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(** >>>>>> ScopedHandler.java:136) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.**ContextHandlerCollection.** >>>>>> handle(**ContextHandlerCollection.java:**258) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.HandlerCollection.** >>>>>> handle(HandlerCollection.java:**109) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**handler.HandlerWrapper.handle(** >>>>>> HandlerWrapper.java:97) >>>>>> >>>>>>> at org.eclipse.jetty.server.**Server.handle(Server.java:445) >>>>>>>>> at >>>>>>>>> >>>>>>>> >> org.eclipse.jetty.server.**HttpChannel.handle(**HttpChannel.java:260) >>>>>>>> >>>>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.server.**HttpConnection.onFillable(** >>>>>> HttpConnection.java:225) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.io.**AbstractConnection$**ReadCallback.run(** >>>>>> AbstractConnection.java:358) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool.runJob(** >>>>>> QueuedThreadPool.java:596) >>>>>> >>>>>>> at >>>>>>>>> >>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool$3.run(** >>>>>> QueuedThreadPool.java:527) >>>>>> >>>>>>> at java.lang.Thread.run(Thread.**java:724)" >>>>>>>>> >>>>>>>>> On your live_nodes question, I don't have historical data on this >>> from >>>>>>>>> >>>>>>>> when >>>>>>>> >>>>>>>>> the crash occurred, which I guess is what you're looking for. I >>> could >>>>>>>>> >>>>>>>> add >>>>>>>> >>>>>>>>> this to our monitoring for future tests, however. I'd be glad to >>>>>>>>> >>>>>>>> continue >>>>>>>> >>>>>>>>> further testing, but I think first more monitoring is needed to >>>>>>>>> >>>>>>>> understand >>>>>>>> >>>>>>>>> this further. Could we come up with a list of metrics that would >> be >>>>>>>>> >>>>>>>> useful >>>>>>>> >>>>>>>>> to see following another test and successful crash? >>>>>>>>> >>>>>>>>> Metrics needed: >>>>>>>>> >>>>>>>>> 1) # of live_nodes. >>>>>>>>> 2) Full stack traces. >>>>>>>>> 3) CPU used by Solr's JVM specifically (instead of system-wide). >>>>>>>>> 4) Solr's JVM thread count (already done) >>>>>>>>> 5) ? >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Tim Vaillancourt >>>>>>>>> >>>>>>>>> >>>>>>>>> On 6 September 2013 13:11, Mark Miller<markrmil...@gmail.com> >>> wrote: >>>>>>>>> >>>>>>>>> Did you ever get to index that long before without hitting the >>>>>>>>>> >>>>>>>>> deadlock? >>>>>>>> >>>>>>>>> There really isn't anything negative the patch could be >> introducing, >>>>>>>>>> >>>>>>>>> other >>>>>>>> >>>>>>>>> than allowing for some more threads to possibly run at once. If I >>> had >>>>>>>>>> >>>>>>>>> to >>>>>>>> >>>>>>>>> guess, I would say its likely this patch fixes the deadlock issue >>> and >>>>>>>>>> >>>>>>>>> your >>>>>>>> >>>>>>>>> seeing another issue - which looks like the system cannot keep up >>>>>>>>>> >>>>>>>>> with >>>>>> >>>>>>> the >>>>>>>> >>>>>>>>> requests or something for some reason - perhaps due to some OS >>>>>>>>>> >>>>>>>>> networking >>>>>>>> >>>>>>>>> settings or something (more guessing). Connection refused happens >>>>>>>>>> >>>>>>>>> generally >>>>>>>> >>>>>>>>> when there is nothing listening on the port. >>>>>>>>>> >>>>>>>>>> Do you see anything interesting change with the rest of the >> system? >>>>>>>>>> >>>>>>>>> CPU >>>>>> >>>>>>> usage spikes or something like that? >>>>>>>>>> >>>>>>>>>> Clamping down further on the overall number of threads night help >>>>>>>>>> >>>>>>>>> (which >>>>>>>> >>>>>>>>> would require making something configurable). How many nodes are >>>>>>>>>> >>>>>>>>> listed in >>>>>>>> >>>>>>>>> zk under live_nodes? >>>>>>>>>> >>>>>>>>>> Mark >>>>>>>>>> >>>>>>>>>> Sent from my iPhone >>>>>>>>>> >>>>>>>>>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt<tim@elementspace. >>> **com<t...@elementspace.com> >>>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hey guys, >>>>>>>>>>> >>>>>>>>>>> (copy of my post to SOLR-5216) >>>>>>>>>>> >>>>>>>>>>> We tested this patch and unfortunately encountered some serious >>>>>>>>>>> >>>>>>>>>> issues a >>>>>>>> >>>>>>>>> few hours of 500 update-batches/sec. Our update batch is 10 docs, >> so >>>>>>>>>>> >>>>>>>>>> we >>>>>>>> >>>>>>>>> are >>>>>>>>>> >>>>>>>>>>> writing about 5000 docs/sec total, using autoCommit to commit >> the >>>>>>>>>>> >>>>>>>>>> updates >>>>>>>> >>>>>>>>> (no explicit commits). >>>>>>>>>>> >>>>>>>>>>> Our environment: >>>>>>>>>>> >>>>>>>>>>> Solr 4.3.1 w/SOLR-5216 patch. >>>>>>>>>>> Jetty 9, Java 1.7. >>>>>>>>>>> 3 solr instances, 1 per physical server. >>>>>>>>>>> 1 collection. >>>>>>>>>>> 3 shards. >>>>>>>>>>> 2 replicas (each instance is a leader and a replica). >>>>>>>>>>> Soft autoCommit is 1000ms. >>>>>>>>>>> Hard autoCommit is 15000ms. >>>>>>>>>>> >>>>>>>>>>> After about 6 hours of stress-testing this patch, we see many of >>>>>>>>>>> >>>>>>>>>> these >>>>>> >>>>>>> stalled transactions (below), and the Solr instances start to see >>>>>>>>>>> >>>>>>>>>> each >>>>>> >>>>>>> other as down, flooding our Solr logs with "Connection Refused" >>>>>>>>>>> >>>>>>>>>> exceptions, >>>>>>>>>> >>>>>>>>>>> and otherwise no obviously-useful logs that I could see. >>>>>>>>>>> >>>>>>>>>>> I did notice some stalled transactions on both /select and >>> /update, >>>>>>>>>>> however. This never occurred without this patch. >>>>>>>>>>> >>>>>>>>>>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC >>>>>>>>>>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9 >>>>>>>>>>> >>>>>>>>>>> Lastly, I have a summary of the ERROR-severity logs from this >>>>>>>>>>> >>>>>>>>>> 24-hour >>>>>> >>>>>>> soak. >>>>>>>>>> >>>>>>>>>>> My script "normalizes" the ERROR-severity stack traces and >> returns >>>>>>>>>>> >>>>>>>>>> them >>>>>>>> >>>>>>>>> in >>>>>>>>>> >>>>>>>>>>> order of occurrence. >>>>>>>>>>> >>>>>>>>>>> Summary of my solr.log: http://pastebin.com/pBdMAWeb >>>>>>>>>>> >>>>>>>>>>> Thanks! >>>>>>>>>>> >>>>>>>>>>> Tim Vaillancourt >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 6 September 2013 07:27, Markus Jelsma< >>>>>>>>>>> >>>>>>>>>> markus.jel...@openindex.io> >>>>>> >>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks! >>>>>>>>>>>> >>>>>>>>>>>> -----Original message----- >>>>>>>>>>>> >>>>>>>>>>>>> From:Erick Erickson<erickerickson@gmail.**com< >>> erickerick...@gmail.com> >>>>>>>>>>>>>> >>>>>>>>>>>>> Sent: Friday 6th September 2013 16:20 >>>>>>>>>>>>> To: solr-user@lucene.apache.org >>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume >>>>>>>>>>>>> >>>>>>>>>>>>> Markus: >>>>>>>>>>>>> >>>>>>>>>>>>> See: https://issues.apache.org/**jira/browse/SOLR-5216< >>> https://issues.apache.org/jira/browse/SOLR-5216> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma >>>>>>>>>>>>> <markus.jel...@openindex.io>**wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Mark, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Got an issue to watch? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> Markus >>>>>>>>>>>>>> >>>>>>>>>>>>>> -----Original message----- >>>>>>>>>>>>>> >>>>>>>>>>>>>>> From:Mark Miller<markrmil...@gmail.com> >>>>>>>>>>>>>>> Sent: Wednesday 4th September 2013 16:55 >>>>>>>>>>>>>>> To: solr-user@lucene.apache.org >>>>>>>>>>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm going to try and fix the root cause for 4.5 - I've >>> suspected >>>>>>>>>>>>>>> >>>>>>>>>>>>>> what it >>>>>>>>>>>> >>>>>>>>>>>>> is since early this year, but it's never personally been an >>>>>>>>>>>>>> >>>>>>>>>>>>> issue, >>>>>> >>>>>>> so >>>>>>>> >>>>>>>>> it's >>>>>>>>>>>> >>>>>>>>>>>>> rolled along for a long time. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Mark >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sent from my iPhone >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt< >>>>>>>>>>>>>>> >>>>>>>>>>>>>> t...@elementspace.com> >>>>>>>> >>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hey guys, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am looking into an issue we've been having with SolrCloud >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> since >>>>>> >>>>>>> the >>>>>>>>>>>> >>>>>>>>>>>>> beginning of our testing, all the way from 4.1 to 4.3 (haven't >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> tested >>>>>>>>>>>> >>>>>>>>>>>>> 4.4.0 >>>>>>>>>>>>>> >>>>>>>>>>>>>>> yet). I've noticed other users with this same issue, so I'd >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> really >>>>>>>> >>>>>>>>> like to >>>>>>>>>>>>>> >>>>>>>>>>>>>>> get to the bottom of it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Under a very, very high rate of updates (2000+/sec), after >>> 1-12 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> hours >>>>>>>>>>>> >>>>>>>>>>>>> we >>>>>>>>>>>>>> >>>>>>>>>>>>>>> see stalled transactions that snowball to consume all Jetty >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> threads in >>>>>>>>>>>> >>>>>>>>>>>>> the >>>>>>>>>>>>>> >>>>>>>>>>>>>>> JVM. This eventually causes the JVM to hang with most >> threads >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> waiting >>>>>>>>>>>> >>>>>>>>>>>>> on >>>>>>>>>>>>>> >>>>>>>>>>>>>>> the condition/stack provided at the bottom of this message. >> At >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> this >>>>>>>> >>>>>>>>> point >>>>>>>>>>>>>> >>>>>>>>>>>>>>> SolrCloud instances then start to see their neighbors (who >>> also >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> have >>>>>>>>>>>> >>>>>>>>>>>>> all >>>>>>>>>>>>>> >>>>>>>>>>>>>>> threads hung) as down w/"Connection Refused", and the shards >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> become >>>>>>>> >>>>>>>>> "down" >>>>>>>>>>>>>> >>>>>>>>>>>>>>> in state. Sometimes a node or two survives and just returns >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 503s >>>>>> >>>>>>> "no >>>>>>>>>>>> >>>>>>>>>>>>> server >>>>>>>>>>>>>> >>>>>>>>>>>>>>> hosting shard" errors. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As a workaround/experiment, we have tuned the number of >>> threads >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> sending >>>>>>>>>>>> >>>>>>>>>>>>> updates to Solr, as well as the batch size (we batch updates >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> from >>>>>> >>>>>>> client -> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> solr), and the Soft/Hard autoCommits, all to no avail. >> Turning >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> off >>>>>>>> >>>>>>>>> Client-to-Solr batching (1 update = 1 call to Solr), which also >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> did not >>>>>>>>>>>> >>>>>>>>>>>>> help. Certain combinations of update threads and batch sizes >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> seem >>>>>> >>>>>>> to >>>>>>>>>>>> >>>>>>>>>>>>> mask/help the problem, but not resolve it entirely. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Our current environment is the following: >>>>>>>>>>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7. >>>>>>>>>>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM. >>>>>>>>>>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader >>> of >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 1 >>>>>> >>>>>>> shard >>>>>>>>>>>> >>>>>>>>>>>>> and >>>>>>>>>>>>>> >>>>>>>>>>>>>>> a replica of 1 shard). >>>>>>>>>>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> movement >>>>>> >>>>>>> on a >>>>>>>>>>>> >>>>>>>>>>>>> good >>>>>>>>>>>>>> >>>>>>>>>>>>>>> day. >>>>>>>>>>>>>>>> - 5000 max jetty threads (well above what we use when we >> are >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> healthy), >>>>>>>>>>>> >>>>>>>>>>>>> Linux-user threads ulimit is 6000. >>>>>>>>>>>>>>>> - Occurs under Jetty 8 or 9 (many versions). >>>>>>>>>>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions). >>>>>>>>>>>>>>>> - Occurs under several JVM tunings. >>>>>>>>>>>>>>>> - Everything seems to point to Solr itself, and not a Jetty >>> or >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Java >>>>>>>> >>>>>>>>> version >>>>>>>>>>>>>> >>>>>>>>>>>>>>> (I hope I'm wrong). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The stack trace that is holding up all my Jetty QTP threads >>> is >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> the >>>>>>>> >>>>>>>>> following, which seems to be waiting on a lock that I would >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> very >>>>>> >>>>>>> much >>>>>>>>>>>> >>>>>>>>>>>>> like >>>>>>>>>>>>>> >>>>>>>>>>>>>>> to understand further: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> "java.lang.Thread.State: WAITING (parking) >>>>>>>>>>>>>>>> at sun.misc.Unsafe.park(Native Method) >>>>>>>>>>>>>>>> - parking to wait for<0x00000007216e68d8> (a >>>>>>>>>>>>>>>> java.util.concurrent.**Semaphore$NonfairSync) >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> java.util.concurrent.locks.**LockSupport.park(LockSupport.** >>>>>>>>>>>> java:186) >>>>>>>>>>>> >>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> java.util.concurrent.locks.**AbstractQueuedSynchronizer.** >>>>>> parkAndCheckInterrupt(**AbstractQueuedSynchronizer.**java:834) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> java.util.concurrent.locks.**AbstractQueuedSynchronizer.** >>>>>> doAcquireSharedInterruptibly(**AbstractQueuedSynchronizer.**java:994) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> java.util.concurrent.locks.**AbstractQueuedSynchronizer.** >>>>>> acquireSharedInterruptibly(**AbstractQueuedSynchronizer.**java:1303) >>>>>> >>>>>>> at java.util.concurrent.**Semaphore.acquire(Semaphore.**java:317) >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.util.**AdjustableSemaphore.acquire(** >>>>>> AdjustableSemaphore.java:61) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**SolrCmdDistributor.submit(** >>>>>> SolrCmdDistributor.java:418) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**SolrCmdDistributor.submit(** >>>>>> SolrCmdDistributor.java:368) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**SolrCmdDistributor.flushAdds(** >>>>>> SolrCmdDistributor.java:300) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**SolrCmdDistributor.finish(** >>>>>> SolrCmdDistributor.java:96) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**processor.** >>>>>> DistributedUpdateProcessor.**doFinish(**DistributedUpdateProcessor.** >>>>>> java:462) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.update.**processor.** >>>>>> DistributedUpdateProcessor.**finish(**DistributedUpdateProcessor.** >>>>>> java:1178) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.handler.**ContentStreamHandlerBase.** >>>>>> handleRequestBody(**ContentStreamHandlerBase.java:**83) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>> org.apache.solr.handler.**RequestHandlerBase.**handleRequest(** >>>>>> RequestHandlerBase.java:135) >>>>>> >>>>>>> at org.apache.solr.core.SolrCore.**execute(SolrCore.java:1820) >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.execute(** >>>>>> SolrDispatchFilter.java:656) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** >>>>>> SolrDispatchFilter.java:359) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** >>>>>> SolrDispatchFilter.java:155) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler$CachedChain.** >>>>>> doFilter(ServletHandler.java:**1486) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doHandle(** >>>>>> ServletHandler.java:503) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(** >>>>>> ScopedHandler.java:138) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.security.**SecurityHandler.handle(** >>>>>> SecurityHandler.java:564) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.** >>>>>> doHandle(SessionHandler.java:**213) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.** >>>>>> doHandle(ContextHandler.java:**1096) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.servlet.**ServletHandler.doScope(** >>>>>> ServletHandler.java:432) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**session.SessionHandler.** >>>>>> doScope(SessionHandler.java:**175) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.ContextHandler.** >>>>>> doScope(ContextHandler.java:**1030) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.ScopedHandler.handle(** >>>>>> ScopedHandler.java:136) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>> org.eclipse.jetty.server.**handler.**ContextHandlerCollection.* >>>>>> *handle(**ContextHandlerCollection.java:**201) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.HandlerCollection.** >>>>>> handle(HandlerCollection.java:**109) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**handler.HandlerWrapper.handle(** >>>>>> HandlerWrapper.java:97) >>>>>> >>>>>>> at org.eclipse.jetty.server.**Server.handle(Server.java:445) >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**HttpChannel.handle(** >>>>>>>>>>>> HttpChannel.java:268) >>>>>>>>>>>> >>>>>>>>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.server.**HttpConnection.onFillable(** >>>>>> HttpConnection.java:229) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>> org.eclipse.jetty.io.**AbstractConnection$**ReadCallback.run(** >>>>>> AbstractConnection.java:358) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool.runJob(** >>>>>> QueuedThreadPool.java:601) >>>>>> >>>>>>> at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> org.eclipse.jetty.util.thread.**QueuedThreadPool$3.run(** >>>>>> QueuedThreadPool.java:532) >>>>>> >>>>>>> at java.lang.Thread.run(Thread.**java:724)" >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Some questions I had were: >>>>>>>>>>>>>>>> 1) What exclusive locks does SolrCloud "make" when >> performing >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> an >>>>>> >>>>>>> update? >>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> could >>>>>> >>>>>>> someone >>>>>>>>>>>>>> >>>>>>>>>>>>>>> help me understand "what" solr is locking in this case at >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> "org.apache.solr.util.**AdjustableSemaphore.acquire(** >>>>>> AdjustableSemaphore.java:61)" >>>>>> >>>>>>> when performing an update? That will help me understand where >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> to >>>>>> >>>>>>> look >>>>>>>>>>>> >>>>>>>>>>>>> next. >>>>>>>>>>>>>> >>>>>>>>>>>>>>> 3) It seems all threads in this state are waiting for >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> "0x00000007216e68d8", >>>>>>>>>>>>>> >>>>>>>>>>>>>>> is there a way to tell what "0x00000007216e68d8" is? >>>>>>>>>>>>>>>> 4) Is there a limit to how many updates you can do in >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> SolrCloud? >>>>>> >>>>>>> 5) Wild-ass-theory: would more shards provide more locks >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> (whatever >>>>>>>> >>>>>>>>> they >>>>>>>>>>>> >>>>>>>>>>>>> are) on update, and thus more update throughput? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 >>> nodes >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> at >>>>>>> >>>>>>> >>> >>