Re: SolrCloud 4.x hangs under high update volume
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
    at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when
the crash occurred, which I guess is what you're looking for. I could add
this to our monitoring for future tests, however. I'd be glad to continue
further testing, but I think first more monitoring is needed to understand
this further. Could we come up with a list of metrics that would be useful
to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done).
5) ?
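Items 2 and 4 above can both be captured from a single `jstack <pid>` dump of the Solr JVM. As a rough illustration (not part of the original thread), a minimal Python sketch that counts total threads and threads in a given state, assuming only jstack's usual format where each thread entry begins with a double-quoted thread name:

```python
def count_threads(dump: str) -> int:
    """Total thread entries in a jstack-style dump.

    Assumes each thread entry begins with a line like:
      "qtp1214076266-4043" #4043 prio=5 ... waiting on condition
    Stack frame lines ("\tat ...") do not start with a quote, so
    they are not counted.
    """
    return sum(1 for line in dump.splitlines()
               if line.lstrip().startswith('"'))


def count_in_state(dump: str, state: str) -> int:
    """Threads whose reported java.lang.Thread.State matches `state`
    exactly (e.g. "BLOCKED", "WAITING", "RUNNABLE")."""
    marker = "java.lang.Thread.State: " + state
    return sum(1 for line in dump.splitlines() if marker in line)
```

Polling these two numbers once a minute during a soak test would show whether the Jetty pool is filling with BLOCKED/WAITING threads before the hang.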
Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller wrote:

Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other
than allowing for some more threads to possibly run at once. If I had to
guess, I would say it's likely this patch fixes the deadlock issue and
you're seeing another issue - which looks like the system cannot keep up
with the requests for some reason - perhaps due to some OS networking
settings or something (more guessing). Connection refused generally happens
when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU
usage spikes or something like that?

Clamping down further on the overall number of threads might help (which
would require making something configurable). How many nodes are listed in
zk under live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:

Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues
after a few hours of 500 update-batches/sec. Our update batch is 10 docs,
so we are writing about 5000 docs/sec total, using autoCommit to commit the
updates (no explicit commits).

Our environment:

Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with "Connection Refused" exceptions,
and otherwise no obviously useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.
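For reference, the soft/hard autoCommit intervals described in the environment above correspond to a solrconfig.xml block along these lines. This is a sketch built from the stated values only; `openSearcher=false` is a common pairing with frequent soft commits, not something the thread confirms:

```xml
<!-- Sketch of the autoCommit settings described above (Solr 4.x solrconfig.xml). -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flush the transaction log and index to stable storage every 15s -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <!-- assumed, not stated in the thread: leave visibility to soft commits -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: make updates searchable every 1s -->
  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
</updateHandler>
```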
Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script "normalizes" the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:

Thanks!

-----Original message-----
From: Erick Erickson
Sent: Friday 6th September 2013 16:20
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume

Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma wrote:

Hi Mark,

Got an issue to watch?

Thanks,
Markus
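The "normalizing" script mentioned above is not shown anywhere in the thread. As a hedged illustration of the idea, one way such a normalizer might work is to strip timestamps and source line numbers from ERROR lines so equivalent traces collapse into one bucket, reported in order of first occurrence:

```python
import re
from collections import OrderedDict


def normalize(line: str) -> str:
    """Collapse run-specific detail so equivalent errors compare equal.

    Illustrative rules only (the real script's rules are not shown in
    the thread): drop a leading timestamp, and drop source line numbers,
    e.g. "(Server.java:445)" -> "(Server.java)".
    """
    line = re.sub(r"^[\d-]+ [\d:,.]+\s*", "", line.strip())
    return re.sub(r":\d+\)", ")", line)


def summarize_errors(log_text: str) -> OrderedDict:
    """Return normalized ERROR-severity lines with occurrence counts,
    in order of first occurrence."""
    counts = OrderedDict()
    for line in log_text.splitlines():
        if "ERROR" in line:
            key = normalize(line)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Run against a solr.log, this yields a compact "which errors, how many times, in what order" view like the pastebin summary above.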
-----Original message-----
From: Mark Miller
Sent: Wednesday 4th September 2013 16:55
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume

I'm going to try and fix the root cause for 4.5 - I've suspected what it is
since early this year, but it's never personally been an issue, so it's
rolled along for a long time.

Mark

Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt wrote:

Hey guys,

I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
yet). I've noticed other users with this same issue, so I'd really like to
get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
see stalled transactions that snowball to consume all Jetty threads in the
JVM. This eventually causes the JVM to hang with most threads waiting on
the condition/stack provided at the bottom of this message. At this point
SolrCloud instances then start to see their neighbors (who also have all
threads hung) as down w/"Connection Refused", and the shards b
Re: SolrCloud 4.x hangs under high update volume
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other than allowing some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - one where it looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings (more guessing). "Connection refused" generally happens when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?

Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in ZK under live_nodes?

Mark

Sent from my iPhone
Re: SolrCloud 4.x hangs under high update volume
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).

Our environment:

- Solr 4.3.1 w/SOLR-5216 patch.
- Jetty 9, Java 1.7.
- 3 Solr instances, 1 per physical server.
- 1 collection.
- 3 shards.
- 2 replicas (each instance is a leader and a replica).
- Soft autoCommit is 1000ms.
- Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously useful logs that I could see.

I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt
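For reference, soft and hard autoCommit intervals like the ones Tim lists live in the updateHandler section of solrconfig.xml. A sketch matching his settings (the openSearcher value is an assumption for illustration, not stated in the thread):

```xml
<!-- Hard commit: flush the transaction log to stable storage every 15s.
     openSearcher=false (assumed here) keeps hard commits from forcing
     searcher reopens; visibility is left to the soft commit below. -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit: make new documents visible to searches every 1s. -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
```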
Re: SolrCloud 4.x hangs under high update volume
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216
RE: SolrCloud 4.x hangs under high update volume
Thanks!
Re: SolrCloud 4.x hangs under high update volume
Update: It is a bit too soon to tell, but about 6 hours into testing there are no crashes with this patch. :)

We are pushing 500 batches of 10 updates per second to the 3-node, 3-shard cluster I mentioned above - 5000 updates per second total.

More tomorrow after a 24 hr soak!

Tim
Re: SolrCloud 4.x hangs under high update volume
Thanks so much for the explanation Mark, I owe you one (many)!

We have this on our high-TPS cluster and will run it through its paces tomorrow. I'll provide any feedback I can, more soon! :D

Cheers,

Tim
RE: SolrCloud 4.x hangs under high update volume
Hi Mark,

Got an issue to watch?

Thanks,
Markus

-----Original message-----
> From: Mark Miller
> Sent: Wednesday 4th September 2013 16:55
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
>
> I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
>
> Mark
>
> Sent from my iPhone
>
> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt wrote:
>
> > Hey guys,
> >
> > I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
> >
> > Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.
> >
> > As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. We also turned off Client-to-Solr batching (1 update = 1 call to Solr), which did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
> >
> > Our current environment is the following:
> > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > - 3 x Zookeeper instances, external Java 7 JVM.
> > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
> > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
> > - 5000 max Jetty threads (well above what we use when we are healthy); Linux-user threads ulimit is 6000.
> > - Occurs under Jetty 8 or 9 (many versions).
> > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > - Occurs under several JVM tunings.
> > - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
> >
> > The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
> >
> > "java.lang.Thread.State: WAITING (parking)
> >    at sun.misc.Unsafe.park(Native Method)
> >    - parking to wait for <0x0007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >    at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >    at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >    at org.apache.solr.core.SolrCor
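The parked frames in the trace above show request threads blocked in Semaphore.acquire() inside SolrCmdDistributor.submit(). The general shape of that pattern - acquire a permit before handing work to an executor, release it when the work finishes - can be sketched as below. This is a hypothetical illustration, not Solr's actual AdjustableSemaphore/SolrCmdDistributor code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Bounded-submit sketch: at most maxOutstanding requests in flight.
// Callers park in acquire() once every permit is checked out -- which is
// exactly where the QTP threads in the trace above are waiting.
public class BoundedSubmit {
    private final Semaphore permits;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public BoundedSubmit(int maxOutstanding) {
        this.permits = new Semaphore(maxOutstanding);
    }

    public void submit(Runnable request) {
        permits.acquireUninterruptibly(); // parks here when the pool is exhausted
        pool.execute(() -> {
            try {
                request.run();
            } finally {
                permits.release(); // a request that never finishes never releases
            }
        });
    }

    public int availablePermits() {
        return permits.availablePermits();
    }

    public void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If a submitted request stalls indefinitely, its permit is never returned, and every subsequent caller parks in acquire() just like the threads in the dump.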
Re: SolrCloud 4.x hangs under high update volume
The 'lock' or semaphore was added to cap the number of threads that would be used. Previously, the number of threads in use could spike to many, many thousands on heavy updates. A limit on the number of outstanding requests was put in place to keep this from happening - something like 16 * the number of hosts in the cluster.

I assume the deadlock comes from the fact that requests are of two kinds: forwards to the leader, and distrib updates from the leader to replicas. A forward to the leader actually waits for the leader to then distrib the updates to replicas before returning. I believe this is what can lead to deadlock.

This is likely why the patch for CloudSolrServer can help the situation - it removes the need to forward to the leader because it sends to the correct leader to begin with. It's only useful if you are adding docs with CloudSolrServer though, and is more like a workaround than a fix.

The patch uses a separate 'limiting' semaphore for the two cases.

- Mark
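Mark's fix, as he describes it, gives each of the two request kinds its own limiting semaphore, so a forward that blocks waiting on its nested distrib call can never starve that distrib call of a permit. A minimal sketch of the idea (hypothetical names, not Solr's classes; the flat permit count of 16 simplifies his "16 * hosts" figure):

```java
import java.util.concurrent.Semaphore;

// Two independent permit pools, one per request kind. A single shared pool
// is the deadlock-prone arrangement Mark describes: forwards hold permits
// while waiting on nested distrib calls that need permits from the same pool.
public class UpdatePermits {
    private static final Semaphore FORWARD = new Semaphore(16);
    private static final Semaphore DISTRIB = new Semaphore(16);

    /** Non-leader forwards an update to the shard leader and waits for its fan-out. */
    public static boolean forwardToLeader() {
        FORWARD.acquireUninterruptibly();
        try {
            // Still holding a FORWARD permit while the leader fans out, but
            // the fan-out draws from DISTRIB, so other forwards can't starve it.
            return distribToReplicas();
        } finally {
            FORWARD.release();
        }
    }

    /** Leader sends the update on to its replicas. */
    public static boolean distribToReplicas() {
        DISTRIB.acquireUninterruptibly();
        try {
            return true; // the replica HTTP requests would happen here
        } finally {
            DISTRIB.release();
        }
    }

    public static void main(String[] args) {
        System.out.println("forward completed: " + forwardToLeader());
    }
}
```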
Re: SolrCloud 4.x hangs under high update volume
Thanks guys! :)

Mark: this patch is much appreciated; I will try to test it shortly, hopefully today.

For my own curiosity/understanding, could someone quickly explain what locks SolrCloud takes on updates? Was I on to something in thinking that more shards decrease the chance of locking? Secondly, could someone summarize what this patch fixes? I'm not too familiar with Java or the Solr codebase (working on that, though :D).

Cheers,

Tim

On 4 September 2013 09:52, Mark Miller wrote:
> There is an issue if I remember right, but I can't find it right now.
>
> If anyone that has the problem could try this patch, that would be very
> helpful: http://pastebin.com/raw.php?i=aaRWwSGP
>
> - Mark
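On the lock question: the stack traces in this thread show the update threads parked in org.apache.solr.util.AdjustableSemaphore.acquire, which wraps a plain java.util.concurrent.Semaphore used to throttle outgoing distributed updates. A minimal sketch of that pattern (class name, method names, and the permit count here are illustrative, not Solr's actual code):

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of a bounded-permit throttle like the one the
// stack traces park on. Every outgoing distributed update must take a
// permit before it is submitted and return it when done.
public class ThrottleSketch {
    // Fixed pool of permits shared by all update threads.
    static final Semaphore permits = new Semaphore(16);

    static void submitUpdate(Runnable update) throws InterruptedException {
        permits.acquire();     // parks the thread when no permits remain
        try {
            update.run();      // stand-in for forwarding the update to a replica
        } finally {
            permits.release(); // a permit leaked here starves later callers
        }
    }

    public static void main(String[] args) throws Exception {
        submitUpdate(() -> {});
        System.out.println("permits left: " + permits.availablePermits());
    }
}
```

If the permit holders are themselves blocked waiting on replies from other nodes whose threads are parked on their own semaphores, every node ends up waiting on everyone else, which would be consistent with the cluster-wide hang described in this thread.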
Re: SolrCloud 4.x hangs under high update volume
There is an issue if I remember right, but I can't find it right now.

If anyone that has the problem could try this patch, that would be very helpful: http://pastebin.com/raw.php?i=aaRWwSGP

- Mark

On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma wrote:
> Hi Mark,
>
> Got an issue to watch?
>
> Thanks,
> Markus
Re: SolrCloud 4.x hangs under high update volume
I am having this issue as well. I did apply this patch; unfortunately, it did not resolve the issue in my case.

On Wed, Sep 4, 2013 at 7:01 AM, Greg Walters wrote:
> Tim,
>
> Take a look at
> http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html
> and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue
> that you're reporting for a while, then I applied the patch from SOLR-4816
> to my clients and the problems went away.
Re: SolrCloud 4.x hangs under high update volume
I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.

Mark

Sent from my iPhone
RE: SolrCloud 4.x hangs under high update volume
Tim,

Take a look at http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue that you're reporting for a while, then I applied the patch from SOLR-4816 to my clients and the problems went away. If you don't feel like applying the patch, it looks like it should be included in the 4.5 release. Also note that the problem happens more frequently when the replication factor is greater than 1.

Thanks,
Greg

-----Original Message-----
From: Tim Vaillancourt [mailto:t...@elementspace.com]
Sent: Tuesday, September 03, 2013 6:31 PM
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.x hangs under high update volume
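Greg's observation that the problem shows up more often with a replication factor above 1 fits the threading math: a node that accepts an update forwards it synchronously to each additional replica of the shard, so one client request can tie up a servlet thread on several nodes at once. A rough, hypothetical back-of-envelope sketch (the fan-out model is deliberately simplified):

```java
// Hypothetical estimate of cluster-wide thread pressure: one thread on
// the node that accepted the request, plus one on each additional
// replica the update is forwarded to. Numbers are illustrative only.
public class ThreadPressure {
    static int concurrentRequestsClusterWide(int clientThreads, int replicationFactor) {
        return clientThreads * (1 + (replicationFactor - 1));
    }

    public static void main(String[] args) {
        System.out.println(concurrentRequestsClusterWide(100, 1)); // 100
        System.out.println(concurrentRequestsClusterWide(100, 2)); // 200
    }
}
```

Under this model, doubling the replication factor roughly doubles the number of threads a fixed client load can occupy, which shortens the path to exhausting the pool or the semaphore permits.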
SolrCloud 4.x hangs under high update volume
Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang, with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr) and the soft/hard autoCommits, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max Jetty threads (well above what we use when we are healthy); Linux user-threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:

"java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x0007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.
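The client-to-Solr batching being tuned above ("we batch updates from client -> solr") can be sketched as follows. This is a hypothetical stand-in, not our actual client: send() represents one HTTP POST to /update, and BATCH_SIZE is the knob we experimented with (the "1 update = 1 call to Solr" configuration is the BATCH_SIZE = 1 case):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a batching update client. Documents accumulate
// in a buffer and are sent in one call per full batch.
public class BatchingClient {
    static final int BATCH_SIZE = 100;   // 1 would mean "1 update = 1 call"
    final List<String> pending = new ArrayList<>();
    int callsMade = 0;

    void add(String doc) {
        pending.add(doc);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) return;
        send(new ArrayList<>(pending));  // one HTTP call per batch
        pending.clear();
    }

    void send(List<String> docs) {
        callsMade++;                     // stand-in for POSTing docs to /update
    }

    public static void main(String[] args) {
        BatchingClient c = new BatchingClient();
        for (int i = 0; i < 250; i++) c.add("doc" + i);
        c.flush();                       // flush the final partial batch
        System.out.println("calls: " + c.callsMade); // 3 calls for 250 docs
    }
}
```

Larger batches mean fewer concurrent requests for a given document rate, which may explain why some thread-count/batch-size combinations mask the hang without fixing the underlying lock contention.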