Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Tim Vaillancourt
> [stack trace snipped]
>
> On your live_nodes question, I don't have historical data on this from
> when the crash occurred, which I guess is what you're looking for. [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Erick Erickson
> [stack trace snipped]
>
> On your live_nodes question, I don't have historical data on this from
> when the crash occurred, which I guess is what you're looking for. I
> could add this to our monitoring for future tests, however. [...] Could
> we come up with a list of metrics that would be useful to see following
> another test and successful crash? [...]
>
> On 6 September 2013 13:11, Mark Miller wrote:
>
>> Did you ever get to index that long before without hitting the
>> deadlock? [...] Connection refused happens generally when there is
>> nothing listening on the port. [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Mark Miller
>> [stack trace snipped]
>>
>> On your live_nodes question, I don't have historical data on this from
>> when the crash occurred [...] Could we come up with a list of metrics
>> that would be useful to see following another test and successful
>> crash? [...]
>>
>> On 6 September 2013 13:11, Mark Miller wrote:
>>
>>> Did you ever get to index that long before without hitting the
>>> deadlock? There really isn't anything negative the patch could be
>>> introducing [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Tim Vaillancourt
> [stack trace snipped]
>
> On your live_nodes question, I don't have historical data on this from
> when the crash occurred [...] Could we come up with a list of metrics
> that would be useful to see following another test and successful
> crash? [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-12 Thread Erick Erickson
> [stack trace snipped]
>
> On your live_nodes question, I don't have historical data on this from
> when the crash occurred [...]
>
> Metrics needed:
>
> 1) # of live_nodes.
> 2) Full stack traces.
> 3) CPU used by Solr's JVM specifically (instead of system-wide).
> 4) Solr's JVM thread count (already done)
> 5) ?
>
> Cheers,
>
> Tim Vaillancourt
>
> On 6 September 2013 13:11, Mark Miller wrote:
>
>> Did you ever get to index that long before without hitting the
>> deadlock? [...] Do you see anything interesting change with the rest
>> of the system? CPU usage spikes or something like that? [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-11 Thread Tim Vaillancourt
[...]
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
  at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when
the crash occurred, which I guess is what you're looking for. I could add
this to our monitoring for future tests, however. I'd be glad to continue
further testing, but I think first more monitoring is needed to understand
this further. Could we come up with a list of metrics that would be useful
to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done)
5) ?

Cheers,

Tim Vaillancourt
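For reference, the metrics above are all reachable from standard APIs. The
sketch below is purely illustrative: the ZooKeeper connect string, session
timeout, and the assumption of no ZooKeeper chroot are made up, and this is
not the monitoring that was actually used in these tests.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;
import java.util.List;
import java.util.Map;

import org.apache.zookeeper.ZooKeeper;

// Rough monitoring probe for the metrics listed above. Assumes it runs inside
// (or attaches to) the Solr JVM and that ZooKeeper is reachable at zk1:2181;
// a real probe would also wait for the ZooKeeper connection event before querying.
public class SolrCrashProbe {
    public static void main(String[] args) throws Exception {
        // 1) # of live_nodes, read straight from ZooKeeper (path assumes no chroot).
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, null);
        List<String> liveNodes = zk.getChildren("/live_nodes", false);
        System.out.println("live_nodes: " + liveNodes.size() + " " + liveNodes);

        // 2) Full stack traces (the same data a jstack dump would give).
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            System.out.println("\"" + e.getKey().getName() + "\" state=" + e.getKey().getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }

        // 3) CPU used by this JVM specifically (not system-wide); Java 7+.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            System.out.println("process CPU load: "
                + ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuLoad());
        }

        // 4) JVM thread count.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("thread count: " + threads.getThreadCount());

        zk.close();
    }
}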


On 6 September 2013 13:11, Mark Miller wrote:

> Did you ever get to index that long before without hitting the deadlock?
> [...] Clamping down further on the overall number of threads might help
> (which would require making something configurable). How many nodes are
> listed in zk under live_nodes?
>
> Mark
>
> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
>
>> Hey guys,
>>
>> (copy of my post to SOLR-5216)
>>
>> We tested this patch and unfortunately encountered some serious issues
>> after a few hours of 500 update-batches/sec. [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-11 Thread Erick Erickson
>> [...] Connection refused happens generally when there is nothing
>> listening on the port.
>>
>> Do you see anything interesting change with the rest of the system?
>> CPU usage spikes or something like that?
>>
>> Clamping down further on the overall number of threads might help
>> (which would require making something configurable). How many nodes are
>> listed in zk under live_nodes?
>>
>> Mark
>>
>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
>>
>>> Hey guys,
>>>
>>> (copy of my post to SOLR-5216)
>>>
>>> We tested this patch and unfortunately encountered some serious issues
>>> after a few hours of 500 update-batches/sec. [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-10 Thread Tim Vaillancourt
>> [stack trace snipped]
>>
>> On your live_nodes question, I don't have historical data on this from
>> when the crash occurred, which I guess is what you're looking for. [...]
>> Could we come up with a list of metrics that would be useful to see
>> following another test and successful crash? [...]
>>
>> On 6 September 2013 13:11, Mark Miller wrote:
>>
>>> Did you ever get to index that long before without hitting the
>>> deadlock? [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
> > [...] Could we come up with a list of metrics that would be useful to
> > see following another test and successful crash? [...]
> >
> > On 6 September 2013 13:11, Mark Miller wrote:
> >
> >> Did you ever get to index that long before without hitting the
> >> deadlock? [...] How many nodes are listed in zk under live_nodes?
> >>
> >> Mark
> >>
> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
> >>
> >>> Hey guys,
> >>>
> >>> (copy of my post to SOLR-5216) [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
>
> > Hey guys,
> >
> > (copy of my post to SOLR-5216)
> >
> > We tested this patch and unfortunately encountered some serious issues
> > after a few hours of 500 update-batches/sec. [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
>> [...] There really isn't anything negative the patch could be
>> introducing, other than allowing for some more threads to possibly run
>> at once. If I had to guess, I would say it's likely this patch fixes the
>> deadlock issue and you're seeing another issue [...]
>>
>> Clamping down further on the overall number of threads might help (which
>> would require making something configurable). How many nodes are listed
>> in zk under live_nodes?
>>
>> Mark
>>
>> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
>>
>>> Hey guys,
>>>
>>> (copy of my post to SOLR-5216) [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Mark Miller
Did you ever get to index that long before without hitting the deadlock?

There really isn't anything negative the patch could be introducing, other
than allowing for some more threads to possibly run at once. If I had to
guess, I would say it's likely this patch fixes the deadlock issue and you're
seeing another issue - which looks like the system cannot keep up with the
requests or something for some reason - perhaps due to some OS networking
settings or something (more guessing). Connection refused happens generally
when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage 
spikes or something like that?

Clamping down further on the overall number of threads might help (which
would require making something configurable). How many nodes are listed in
zk under live_nodes?

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt  wrote:

> Hey guys,
>
> (copy of my post to SOLR-5216)
>
> We tested this patch and unfortunately encountered some serious issues
> after a few hours of 500 update-batches/sec. Our update batch is 10 docs,
> so we are writing about 5000 docs/sec total, using autoCommit to commit
> the updates (no explicit commits). [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Tim Vaillancourt
Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after
a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are
writing about 5000 docs/sec total, using autoCommit to commit the updates
(no explicit commits).

Our environment:

Solr 4.3.1 w/SOLR-5216 patch.
Jetty 9, Java 1.7.
3 solr instances, 1 per physical server.
1 collection.
3 shards.
2 replicas (each instance is a leader and a replica).
Soft autoCommit is 1000ms.
Hard autoCommit is 15000ms.
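To make the update pattern concrete, here is a rough SolrJ sketch of the kind
of client-side batching described above. The thread never shows the real load
generator, so the client class (HttpSolrServer), URL, and field names below
are assumptions; the point is 10-document batches with no explicit commits,
leaving commits to the soft/hard autoCommit settings listed in the
environment.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative shape of the load generator only; not the actual test harness
// from this thread. Host, collection, and field names are assumptions.
public class BatchUpdateSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://solr1:8983/solr/collection1");

        for (int batchNum = 0; batchNum < 1000; batchNum++) {   // ~500 such batches/sec in the test
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(10);
            for (int i = 0; i < 10; i++) {                      // 10 docs per update batch
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("text_t", "synthetic payload " + i);
                batch.add(doc);
            }
            // No explicit commit here: commits are left to the server-side
            // soft (1000ms) and hard (15000ms) autoCommit settings above.
            server.add(batch);
        }
        server.shutdown();
    }
}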

After about 6 hours of stress-testing this patch, we see many of these
stalled transactions (below), and the Solr instances start to see each
other as down, flooding our Solr logs with "Connection Refused" exceptions,
and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update,
however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script "normalizes" the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb
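Tim's normalization script isn't included in the thread. As one illustration
of the technique (collapsing ERROR stack traces that differ only in
timestamps and line numbers, then reporting them in order of first
occurrence), a rough Java sketch under an assumed Log4j-style layout might
look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: group ERROR entries in solr.log by a normalized signature
// (timestamps dropped, line numbers stripped) and report counts in order of
// first occurrence. Assumes a Log4j-style "<timestamp> ERROR ..." layout.
public class ErrorLogSummary {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));

        StringBuilder entry = null;
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(" ERROR ")) {                 // start of a new ERROR entry
                record(counts, entry);
                entry = new StringBuilder(normalize(line));
            } else if (entry != null && line.trim().startsWith("at ")) {
                entry.append('\n').append(normalize(line)); // stack frame of the entry
            } else {
                record(counts, entry);                      // entry ended
                entry = null;
            }
        }
        record(counts, entry);
        in.close();

        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "x\n" + e.getKey() + "\n");
        }
    }

    // Drop the leading timestamp and any ":<line number>" so equivalent traces collapse.
    private static String normalize(String line) {
        return line.replaceFirst("^[\\d:,.\\- ]+", "").replaceAll(":\\d+\\)", ")");
    }

    private static void record(Map<String, Integer> counts, StringBuilder entry) {
        if (entry == null) return;
        String key = entry.toString();
        Integer n = counts.get(key);
        counts.put(key, n == null ? 1 : n + 1);
    }
}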

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma wrote:

> Thanks!
>
> -----Original message-----
> > From:Erick Erickson
> > Sent: Friday 6th September 2013 16:20
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > Markus:
> >
> > See: https://issues.apache.org/jira/browse/SOLR-5216 [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Erick Erickson
Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
wrote:

> Hi Mark,
>
> Got an issue to watch?
>
> Thanks,
> Markus
>
> -----Original message-----
> > From:Mark Miller
> > Sent: Wednesday 4th September 2013 16:55
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > I'm going to try and fix the root cause for 4.5 - I've suspected what
> > it is since early this year, but it's never personally been an issue,
> > so it's rolled along for a long time. [...]

RE: SolrCloud 4.x hangs under high update volume

2013-09-06 Thread Markus Jelsma
Thanks!
 
-Original message-
> From:Erick Erickson 
> Sent: Friday 6th September 2013 16:20
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
> 
> Markus:
> 
> See: https://issues.apache.org/jira/browse/SOLR-5216
> 
> 
> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> wrote:
> 
> > > Hi Mark,
> > >
> > > Got an issue to watch?
> > >
> > > Thanks,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Mark Miller
> > > > Sent: Wednesday 4th September 2013 16:55
> > > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > > >
> > > > I'm going to try and fix the root cause for 4.5 [...]

Re: SolrCloud 4.x hangs under high update volume

2013-09-05 Thread Tim Vaillancourt
Update: It is a bit too soon to tell, but about 6 hours into testing there
are no crashes with this patch. :)

We are pushing 500 batches of 10 updates per second to the 3-node, 3-shard
cluster I mentioned above, for 5000 updates per second total.

More tomorrow after a 24 hr soak!

Tim

On Wednesday, 4 September 2013, Tim Vaillancourt wrote:

> Thanks so much for the explanation Mark, I owe you one (many)!
>
> We have this on our high TPS cluster and will run it through its paces
> tomorrow. I'll provide any feedback I can, more soon! :D
>
> Cheers,
>
> Tim
>


Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Tim Vaillancourt
Thanks so much for the explanation Mark, I owe you one (many)!

We have this on our high TPS cluster and will run it through its paces
tomorrow. I'll provide any feedback I can, more soon! :D

Cheers,

Tim


RE: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Markus Jelsma
Hi Mark,

Got an issue to watch?

Thanks,
Markus
 
-Original message-
> From:Mark Miller 
> Sent: Wednesday 4th September 2013 16:55
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud 4.x hangs under high update volume
> 
> I'm going to try and fix the root cause for 4.5 - I've suspected what it is 
> since early this year, but it's never personally been an issue, so it's 
> rolled along for a long time. 
> 
> Mark
> 
> Sent from my iPhone
> 
> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt  wrote:
> 
> > Hey guys,
> > 
> > I am looking into an issue we've been having with SolrCloud since the
> > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
> > yet). I've noticed other users with this same issue, so I'd really like to
> > get to the bottom of it.
> > 
> > Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
> > see stalled transactions that snowball to consume all Jetty threads in the
> > JVM. This eventually causes the JVM to hang with most threads waiting on
> > the condition/stack provided at the bottom of this message. At this point
> > SolrCloud instances then start to see their neighbors (who also have all
> > threads hung) as down w/"Connection Refused", and the shards become "down"
> > in state. Sometimes a node or two survives and just returns 503s "no server
> > hosting shard" errors.
> > 
> > As a workaround/experiment, we have tuned the number of threads sending
> > updates to Solr, as well as the batch size (we batch updates from client ->
> > solr), and the Soft/Hard autoCommits, all to no avail. We also tried
> > turning off Client-to-Solr batching (1 update = 1 call to Solr), which did
> > not help either. Certain combinations of update threads and batch sizes
> > seem to mask/help the problem, but not resolve it entirely.
> > 
> > Our current environment is the following:
> > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > - 3 x Zookeeper instances, external Java 7 JVM.
> > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
> > a replica of 1 shard).
> > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
> > day.
> > - 5000 max jetty threads (well above what we use when we are healthy),
> > Linux-user threads ulimit is 6000.
> > - Occurs under Jetty 8 or 9 (many versions).
> > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > - Occurs under several JVM tunings.
> > - Everything seems to point to Solr itself, and not a Jetty or Java version
> > (I hope I'm wrong).
> > 
> > The stack trace that is holding up all my Jetty QTP threads is the
> > following, which seems to be waiting on a lock that I would very much like
> > to understand further:
> > 
> > "java.lang.Thread.State: WAITING (parking)
> >at sun.misc.Unsafe.park(Native Method)
> >- parking to wait for  <0x0007216e68d8> (a
> > java.util.concurrent.Semaphore$NonfairSync)
> >at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >at
> > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >at
> > org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >at
> > org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >at
> > org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >at
> > org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >at
> > org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >at
> > org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >at
> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >at
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >at org.apache.solr.core.SolrCor

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
The 'lock' or semaphore was added to cap the number of threads that would be 
used. Previously, the number of threads in use could spike to many, many 
thousands on heavy updates. A limit on the number of outstanding requests was 
put in place to keep this from happening. Something like 16 * the number of 
hosts in the cluster.

I assume the deadlock comes from the fact that requests are of two kinds - 
forward to the leader and distrib updates from the leader to replicas. Forward 
to the leader actually waits for the leader to then distrib the updates to 
replicas before returning. I believe this is what can lead to deadlock. 

This is likely why the patch for the CloudSolrServer can help the situation - 
it removes the need to forward to the leader because it sends to the correct 
leader to begin with. Only useful if you are adding docs with CloudSolrServer 
though, and more like a workaround than a fix.

The patch uses a separate 'limiting' semaphore for the two cases.

- Mark
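
To make that concrete, here is a rough sketch of the idea -- emphatically not the
real SolrCmdDistributor/AdjustableSemaphore code, nor the patch itself -- showing
why one shared permit pool can deadlock and how two pools break the wait cycle:

import java.util.concurrent.Semaphore;

// Sketch only. With a single shared pool, a forward-to-leader request holds a
// permit while it waits for the leader's distrib-to-replica requests, which need
// permits from the same exhausted pool: a distributed deadlock. Splitting the
// limit into two pools means the replica fan-out can always make progress.
public class UpdatePermits {
    private final Semaphore forwardToLeader;   // caps client -> leader forwards
    private final Semaphore distribToReplicas; // caps leader -> replica fan-out

    public UpdatePermits(int numHosts) {
        // "Something like 16 * the number of hosts in the cluster"
        this.forwardToLeader = new Semaphore(16 * numHosts);
        this.distribToReplicas = new Semaphore(16 * numHosts);
    }

    public void submitForward(Runnable send) throws InterruptedException {
        forwardToLeader.acquire();
        try {
            send.run(); // blocks until the leader has distributed to its replicas
        } finally {
            forwardToLeader.release();
        }
    }

    public void submitDistrib(Runnable send) throws InterruptedException {
        distribToReplicas.acquire(); // no longer competes with forwarded requests
        try {
            send.run();
        } finally {
            distribToReplicas.release();
        }
    }
}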

On Sep 4, 2013, at 10:22 AM, Tim Vaillancourt  wrote:

> Thanks guys! :)
> 
> Mark: this patch is much appreciated, I will try to test this shortly, 
> hopefully today.
> 
> For my curiosity/understanding, could someone explain to me quickly what
> locks SolrCloud takes on updates? Was I on to something in thinking that
> more shards decrease the chance of locking?
> 
> Secondly, I was wondering if someone could summarize what this patch 'fixes'? 
> I'm not too familiar with Java and the solr codebase (working on that though 
> :D).
> 
> Cheers,
> 
> Tim
> 
> 
> 
> On 4 September 2013 09:52, Mark Miller  wrote:
> There is an issue if I remember right, but I can't find it right now.
> 
> If anyone that has the problem could try this patch, that would be very
> helpful: http://pastebin.com/raw.php?i=aaRWwSGP
> 
> - Mark
> 
> 
> On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma 
> wrote:
> 
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From:Mark Miller 
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> > is since early this year, but it's never personally been an issue, so it's
> > rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> > wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I am looking into an issue we've been having with SolrCloud since the
> > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> > 4.4.0
> > > > yet). I've noticed other users with this same issue, so I'd really
> > like to
> > > > get to the bottom of it.
> > > >
> > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> > we
> > > > see stalled transactions that snowball to consume all Jetty threads in
> > the
> > > > JVM. This eventually causes the JVM to hang with most threads waiting
> > on
> > > > the condition/stack provided at the bottom of this message. At this
> > point
> > > > SolrCloud instances then start to see their neighbors (who also have
> > all
> > > > threads hung) as down w/"Connection Refused", and the shards become
> > "down"
> > > > in state. Sometimes a node or two survives and just returns 503s "no
> > server
> > > > hosting shard" errors.
> > > >
> > > > As a workaround/experiment, we have tuned the number of threads sending
> > > > updates to Solr, as well as the batch size (we batch updates from
> > client ->
> > > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > > Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> > > > help. Certain combinations of update threads and batch sizes seem to
> > > > mask/help the problem, but not resolve it entirely.
> > > >
> > > > Our current environment is the following:
> > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> > and
> > > > a replica of 1 shard).
> > 

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Tim Vaillancourt
Thanks guys! :)

Mark: this patch is much appreciated, I will try to test this shortly,
hopefully today.

For my curiosity/understanding, could someone explain to me quickly what
locks SolrCloud takes on updates? Was I on to something in thinking that
more shards decrease the chance of locking?

Secondly, I was wondering if someone could summarize what this patch
'fixes'? I'm not too familiar with Java and the solr codebase (working on
that though :D).

Cheers,

Tim



On 4 September 2013 09:52, Mark Miller  wrote:

> There is an issue if I remember right, but I can't find it right now.
>
> If anyone that has the problem could try this patch, that would be very
> helpful: http://pastebin.com/raw.php?i=aaRWwSGP
>
> - Mark
>
>
> On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma wrote:
>
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -Original message-
> > > From:Mark Miller 
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what
> it
> > is since early this year, but it's never personally been an issue, so
> it's
> > rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> > wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I am looking into an issue we've been having with SolrCloud since the
> > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> > 4.4.0
> > > > yet). I've noticed other users with this same issue, so I'd really
> > like to
> > > > get to the bottom of it.
> > > >
> > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> > we
> > > > see stalled transactions that snowball to consume all Jetty threads
> in
> > the
> > > > JVM. This eventually causes the JVM to hang with most threads waiting
> > on
> > > > the condition/stack provided at the bottom of this message. At this
> > point
> > > > SolrCloud instances then start to see their neighbors (who also have
> > all
> > > > threads hung) as down w/"Connection Refused", and the shards become
> > "down"
> > > > in state. Sometimes a node or two survives and just returns 503s "no
> > server
> > > > hosting shard" errors.
> > > >
> > > > As a workaround/experiment, we have tuned the number of threads
> sending
> > > > updates to Solr, as well as the batch size (we batch updates from
> > client ->
> > > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > > Client-to-Solr batching (1 update = 1 call to Solr), which also did
> not
> > > > help. Certain combinations of update threads and batch sizes seem to
> > > > mask/help the problem, but not resolve it entirely.
> > > >
> > > > Our current environment is the following:
> > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> shard
> > and
> > > > a replica of 1 shard).
> > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> > good
> > > > day.
> > > > - 5000 max jetty threads (well above what we use when we are
> healthy),
> > > > Linux-user threads ulimit is 6000.
> > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > - Occurs under several JVM tunings.
> > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > version
> > > > (I hope I'm wrong).
> > > >
> > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > following, which seems to be waiting on a lock that I would very much
> > like
> > > > to understand further:
> > > >
> > > > "java.lang.Thread.State: WAITING (parking)
> > > >at sun.misc.Unsafe.park(Native Method)
> > > >- parking to wait for  <0x0007216e68d8> (a
>

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
There is an issue if I remember right, but I can't find it right now.

If anyone that has the problem could try this patch, that would be very
helpful: http://pastebin.com/raw.php?i=aaRWwSGP

- Mark


On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma wrote:

> Hi Mark,
>
> Got an issue to watch?
>
> Thanks,
> Markus
>
> -Original message-
> > From:Mark Miller 
> > Sent: Wednesday 4th September 2013 16:55
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> is since early this year, but it's never personally been an issue, so it's
> rolled along for a long time.
> >
> > Mark
> >
> > Sent from my iPhone
> >
> > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt 
> wrote:
> >
> > > Hey guys,
> > >
> > > I am looking into an issue we've been having with SolrCloud since the
> > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> 4.4.0
> > > yet). I've noticed other users with this same issue, so I'd really
> like to
> > > get to the bottom of it.
> > >
> > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> we
> > > see stalled transactions that snowball to consume all Jetty threads in
> the
> > > JVM. This eventually causes the JVM to hang with most threads waiting
> on
> > > the condition/stack provided at the bottom of this message. At this
> point
> > > SolrCloud instances then start to see their neighbors (who also have
> all
> > > threads hung) as down w/"Connection Refused", and the shards become
> "down"
> > > in state. Sometimes a node or two survives and just returns 503s "no
> server
> > > hosting shard" errors.
> > >
> > > As a workaround/experiment, we have tuned the number of threads sending
> > > updates to Solr, as well as the batch size (we batch updates from
> client ->
> > > solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> > > Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> > > help. Certain combinations of update threads and batch sizes seem to
> > > mask/help the problem, but not resolve it entirely.
> > >
> > > Our current environment is the following:
> > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> and
> > > a replica of 1 shard).
> > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> good
> > > day.
> > > - 5000 max jetty threads (well above what we use when we are healthy),
> > > Linux-user threads ulimit is 6000.
> > > - Occurs under Jetty 8 or 9 (many versions).
> > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > - Occurs under several JVM tunings.
> > > - Everything seems to point to Solr itself, and not a Jetty or Java
> version
> > > (I hope I'm wrong).
> > >
> > > The stack trace that is holding up all my Jetty QTP threads is the
> > > following, which seems to be waiting on a lock that I would very much
> like
> > > to understand further:
> > >
> > > "java.lang.Thread.State: WAITING (parking)
> > >at sun.misc.Unsafe.park(Native Method)
> > >- parking to wait for  <0x0007216e68d8> (a
> > > java.util.concurrent.Semaphore$NonfairSync)
> > >at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > >at
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > >at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > >at
> > >
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > >at
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Kevin Osborn
I am having this issue as well. I did apply this patch. Unfortunately, it
did not resolve the issue in my case.


On Wed, Sep 4, 2013 at 7:01 AM, Greg Walters wrote:

> Tim,
>
> Take a look at
> http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html and
> https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue
> that you're reporting for a while then I applied the patch from SOLR-4816
> to my clients and the problems went away. If you don't feel like applying
> the patch it looks like it should be included in the release of version
> 4.5. Also note that the problem happens more frequently when the
> replication factor is greater than 1.
>
> Thanks,
> Greg
>
> -Original Message-
> From: Tim Vaillancourt [mailto:t...@elementspace.com]
> Sent: Tuesday, September 03, 2013 6:31 PM
> To: solr-user@lucene.apache.org
> Subject: SolrCloud 4.x hangs under high update volume
>
> Hey guys,
>
> I am looking into an issue we've been having with SolrCloud since the
> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
> yet). I've noticed other users with this same issue, so I'd really like to
> get to the bottom of it.
>
> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
> see stalled transactions that snowball to consume all Jetty threads in the
> JVM. This eventually causes the JVM to hang with most threads waiting on
> the condition/stack provided at the bottom of this message. At this point
> SolrCloud instances then start to see their neighbors (who also have all
> threads hung) as down w/"Connection Refused", and the shards become "down"
> in state. Sometimes a node or two survives and just returns 503s "no
> server hosting shard" errors.
>
> As a workaround/experiment, we have tuned the number of threads sending
> updates to Solr, as well as the batch size (we batch updates from client ->
> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> help. Certain combinations of update threads and batch sizes seem to
> mask/help the problem, but not resolve it entirely.
>
> Our current environment is the following:
> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> - 3 x Zookeeper instances, external Java 7 JVM.
> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
> a replica of 1 shard).
> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
> day.
> - 5000 max jetty threads (well above what we use when we are healthy),
> Linux-user threads ulimit is 6000.
> - Occurs under Jetty 8 or 9 (many versions).
> - Occurs under Java 1.6 or 1.7 (several minor versions).
> - Occurs under several JVM tunings.
> - Everything seems to point to Solr itself, and not a Jetty or Java
> version (I hope I'm wrong).
>
> The stack trace that is holding up all my Jetty QTP threads is the
> following, which seems to be waiting on a lock that I would very much like
> to understand further:
>
> "java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x0007216e68d8> (a
> java.util.concurrent.Semaphore$NonfairSync)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> at
>
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> at
>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> at
>
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> at
>
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> at
>
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> at
>
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> at
>
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> at
>
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> at
>
> org.apache.so

Re: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Mark Miller
I'm going to try and fix the root cause for 4.5 - I've suspected what it is 
since early this year, but it's never personally been an issue, so it's rolled 
along for a long time. 

Mark

Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt  wrote:

> Hey guys,
> 
> I am looking into an issue we've been having with SolrCloud since the
> beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
> yet). I've noticed other users with this same issue, so I'd really like to
> get to the bottom of it.
> 
> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
> see stalled transactions that snowball to consume all Jetty threads in the
> JVM. This eventually causes the JVM to hang with most threads waiting on
> the condition/stack provided at the bottom of this message. At this point
> SolrCloud instances then start to see their neighbors (who also have all
> threads hung) as down w/"Connection Refused", and the shards become "down"
> in state. Sometimes a node or two survives and just returns 503s "no server
> hosting shard" errors.
> 
> As a workaround/experiment, we have tuned the number of threads sending
> updates to Solr, as well as the batch size (we batch updates from client ->
> solr), and the Soft/Hard autoCommits, all to no avail. Turning off
> Client-to-Solr batching (1 update = 1 call to Solr), which also did not
> help. Certain combinations of update threads and batch sizes seem to
> mask/help the problem, but not resolve it entirely.
> 
> Our current environment is the following:
> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> - 3 x Zookeeper instances, external Java 7 JVM.
> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
> a replica of 1 shard).
> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
> day.
> - 5000 max jetty threads (well above what we use when we are healthy),
> Linux-user threads ulimit is 6000.
> - Occurs under Jetty 8 or 9 (many versions).
> - Occurs under Java 1.6 or 1.7 (several minor versions).
> - Occurs under several JVM tunings.
> - Everything seems to point to Solr itself, and not a Jetty or Java version
> (I hope I'm wrong).
> 
> The stack trace that is holding up all my Jetty QTP threads is the
> following, which seems to be waiting on a lock that I would very much like
> to understand further:
> 
> "java.lang.Thread.State: WAITING (parking)
>at sun.misc.Unsafe.park(Native Method)
>- parking to wait for  <0x0007216e68d8> (a
> java.util.concurrent.Semaphore$NonfairSync)
>at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
>at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
>at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
>at
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
>at
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
>at
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
>at
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
>at
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
>at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
>at
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
>at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
>at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
>at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
>at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
>at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
>at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
>at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
>at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
>at
> org.eclipse.jetty.server.session.SessionHandler.doScope

RE: SolrCloud 4.x hangs under high update volume

2013-09-04 Thread Greg Walters
Tim,

Take a look at 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html
 and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue that 
you're reporting for a while then I applied the patch from SOLR-4816 to my 
clients and the problems went away. If you don't feel like applying the patch 
it looks like it should be included in the release of version 4.5. Also note 
that the problem happens more frequently when the replication factor is greater 
than 1.
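
The client-side piece only matters if you index through CloudSolrServer; a minimal
sketch of that path looks like the following (ZK hosts, collection name and field
names are placeholders):

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch: CloudSolrServer is ZooKeeper-aware, so with the SOLR-4816 routing change
// it can send each document straight to the right shard leader instead of letting
// an arbitrary node forward it.
public class DirectToLeaderIndexing {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title_t", "hello");
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}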

Thanks,
Greg

-Original Message-
From: Tim Vaillancourt [mailto:t...@elementspace.com] 
Sent: Tuesday, September 03, 2013 6:31 PM
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.x hangs under high update volume

Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning 
of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've 
noticed other users with this same issue, so I'd really like to get to the 
bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see 
stalled transactions that snowball to consume all Jetty threads in the JVM. 
This eventually causes the JVM to hang with most threads waiting on the 
condition/stack provided at the bottom of this message. At this point SolrCloud 
instances then start to see their neighbors (who also have all threads hung) as 
down w/"Connection Refused", and the shards become "down"
in state. Sometimes a node or two survives and just returns 503s "no server 
hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending updates
to Solr, as well as the batch size (we batch updates from client -> solr) and the
Soft/Hard autoCommits, all to no avail. We also tried turning off Client-to-Solr
batching (1 update = 1 call to Solr), which did not help either. Certain
combinations of update threads and batch sizes seem to mask the problem, but not
resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a 
replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), 
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I 
hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the following, 
which seems to be waiting on a lock that I would very much like to understand 
further:

"java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007216e68d8> (a
java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(S

SolrCloud 4.x hangs under high update volume

2013-09-03 Thread Tim Vaillancourt
Hey guys,

I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
yet). I've noticed other users with this same issue, so I'd really like to
get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
see stalled transactions that snowball to consume all Jetty threads in the
JVM. This eventually causes the JVM to hang with most threads waiting on
the condition/stack provided at the bottom of this message. At this point
SolrCloud instances then start to see their neighbors (who also have all
threads hung) as down w/"Connection Refused", and the shards become "down"
in state. Sometimes a node or two survives and just returns 503s "no server
hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending
updates to Solr, as well as the batch size (we batch updates from client ->
solr) and the Soft/Hard autoCommits, all to no avail. We also tried turning
off Client-to-Solr batching (1 update = 1 call to Solr), which did not help
either. Certain combinations of update threads and batch sizes seem to mask
the problem, but not resolve it entirely.
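
For reference, the soft/hard autoCommit knobs we have been tuning live in
solrconfig.xml inside <updateHandler>; the values below only illustrate the
shape and are not a recommendation:

<autoCommit>
  <maxTime>15000</maxTime>          <!-- hard commit (flush) every 15s -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>           <!-- soft commit every 1s for search visibility -->
</autoSoftCommit>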

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
day.
- 5000 max jetty threads (well above what we use when we are healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version
(I hope I'm wrong).
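
(The 5000-thread cap is set on Jetty's QueuedThreadPool in jetty.xml along these
lines; how the pool is wired into the Server element differs between Jetty 8 and 9,
and the minThreads value below is only illustrative.)

<New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <Set name="minThreads">10</Set>
  <Set name="maxThreads">5000</Set>
</New>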

The stack trace that is holding up all my Jetty QTP threads is the
following, which seems to be waiting on a lock that I would very much like
to understand further:

"java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007216e68d8> (a
java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at
org.eclipse.jetty.