Thanks Erick!

Yeah, I think the next step will be CloudSolrServer with the SOLR-4816 patch. I think that is a very, very useful patch by the way. SOLR-5232 seems promising as well.
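For reference, roughly what I have in mind for the client side once we switch (untested sketch against SolrJ 4.x with the SOLR-4816 patch applied; the ZK connect string, collection name and fields below are placeholders, not our real values):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Cluster state comes from ZooKeeper instead of a load-balancer VIP,
        // and (with SOLR-4816) updates get routed to the right shard leaders.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        // Keep the same client-side batching we do today (200 docs per add).
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 200; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_s", "example " + i);
            batch.add(doc);
        }
        server.add(batch); // no explicit commit; autoCommit handles it server-side

        server.shutdown();
    }
}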

I see your point on the more-shards idea; this is obviously a global/instance-level lock. If I really had to, I suppose I could run more Solr instances to reduce the locking then? Currently I have 2 cores per instance and could go 1-to-1 to simplify things.

The good news is we seem to be more stable since changing to a bigger client->solr batch-size and fewer client threads updating.

Cheers,

Tim

On 11/09/13 04:19 AM, Erick Erickson wrote:
If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent
copy of the 4x branch. By "recent", I mean like today, it looks like Mark
applied this early this morning. But several reports indicate that this will
solve your problem.

I would expect that increasing the number of shards would make the problem worse, not better.

There's also SOLR-5232...

Best
Erick


On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

Hey guys,

Based on my understanding of the problem we are encountering, I feel we've
been able to reduce the likelihood of this issue by making the following
changes to our app's usage of SolrCloud:

1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is that a larger batch size reduces the likelihood of this issue happening.
2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud.
3) Fewer threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing app (from 10 threads). A rough sketch of this batching/threading setup follows below.
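For concreteness, this is roughly the shape of what our update-pushing app does now (heavily simplified sketch using SolrJ's HttpSolrServer against our HTTP VIP; the Redis-popping part is stubbed out and the hostnames/field names are placeholders, so this isn't our real code):

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedPusherSketch {
    private static final int BATCH_SIZE = 200; // was 10
    private static final int THREADS = 5;      // was 10

    public static void main(String[] args) {
        final HttpSolrServer solr = new HttpSolrServer("http://solr-vip:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            pool.submit(new Runnable() {
                public void run() {
                    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
                    while (true) {
                        batch.add(popFromQueue()); // in the real app this pops off Redis
                        if (batch.size() >= BATCH_SIZE) {
                            try {
                                solr.add(batch); // one HTTP request per 200 docs
                            } catch (Exception e) {
                                e.printStackTrace(); // the real app re-queues the batch
                            }
                            batch.clear();
                        }
                    }
                }
            });
        }
    }

    // Placeholder for the Redis-popping logic; here it just fabricates a doc.
    private static SolrInputDocument popFromQueue() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        return doc;
    }
}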

To be clear, the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand.

If we happen to encounter issues with the above 3 changes, the next steps (which I could use some advice on) are:

1) Increase the number of shards (2x) - the theory here is that more shards means the same update volume is spread over more per-shard locks. Am I onto something here, or will this not help at all?
2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go "direct" to the shards we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts?

Thanks all!

Cheers,

Tim


On 6 September 2013 14:47, Tim Vaillancourt <t...@elementspace.com> wrote:

Enjoy your trip, Mark! Thanks again for the help!

Tim


On 6 September 2013 14:18, Mark Miller <markrmil...@gmail.com> wrote:

Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.

Mark

Sent from my iPhone

On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

Hey Mark,

The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.

During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but strangely I see none of the "OutOfMemory: cannot open native thread" errors that normally follow this. Weird!

We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite "pinned" (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).

More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks).

Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):

"java.lang.IllegalStateException :
      at
org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
      at org.eclipse.jetty.server.Response.sendError(Response.java:325)
      at

org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
      at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
      at

org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
      at

org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
      at

org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
      at

org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
      at

org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
      at

org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
      at

org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
      at

org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
      at

org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
      at

org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
      at

org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
      at

org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
      at

org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
      at

org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
      at org.eclipse.jetty.server.Server.handle(Server.java:445)
      at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
      at

org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
      at

org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
      at

org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
      at

org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
      at java.lang.Thread.run(Thread.java:724)"

On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however. I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash?

Metrics needed:

1) # of live_nodes.
2) Full stack traces.
3) CPU used by Solr's JVM specifically (instead of system-wide).
4) Solr's JVM thread count (already done; a rough JMX polling sketch for 3 and 4 is below)
5) ?
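For 3 and 4, something like this JMX poll against each Solr JVM is what I have in mind (sketch only; assumes JMX remote is enabled on the Solr JVMs, and the host/port are placeholders):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SolrJvmPoller {
    public static void main(String[] args) throws Exception {
        // Assumes the Solr JVM was started with JMX remote enabled on port 18983;
        // host and port are placeholders for our real nodes.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://solr-host-1:18983/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection conn = connector.getMBeanServerConnection();

        ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
        OperatingSystemMXBean os = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME, OperatingSystemMXBean.class);

        while (true) {
            // Thread count/peak covers metric 4; load average is only a system-wide
            // proxy for metric 3 (per-process CPU would need the com.sun.management bean).
            System.out.printf("threads=%d peak=%d load=%.2f%n",
                    threads.getThreadCount(), threads.getPeakThreadCount(),
                    os.getSystemLoadAverage());
            Thread.sleep(10000); // sample every 10s and feed it to our graphs
        }
    }
}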

Cheers,

Tim Vaillancourt


On 6 September 2013 13:11, Mark Miller <markrmil...@gmail.com> wrote:

Did you ever get to index that long before without hitting the deadlock? There really isn't anything negative the patch could be introducing, other than allowing some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused generally happens when there is nothing listening on the port.

Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?

Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
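(If you need a quick way to check, the plain ZooKeeper Java client can list them - rough sketch, and the connect string/chroot are placeholders for your setup:)

import java.util.List;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodesCheck {
    public static void main(String[] args) throws Exception {
        // Point at your ZK ensemble; append your chroot (e.g. "/solr") if you use one.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) { /* no-op */ }
        });
        List<String> liveNodes = zk.getChildren("/live_nodes", false);
        System.out.println("live_nodes (" + liveNodes.size() + "): " + liveNodes);
        zk.close();
    }
}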

Mark

Sent from my iPhone

On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).

Our environment:

   Solr 4.3.1 w/SOLR-5216 patch.
   Jetty 9, Java 1.7.
   3 solr instances, 1 per physical server.
   1 collection.
   3 shards.
   2 replicas (each instance is a leader and a replica).
   Soft autoCommit is 1000ms.
   Hard autoCommit is 15000ms.

After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously-useful logs that I could see.

I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.

Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:

Thanks!

-----Original message-----
From: Erick Erickson <erickerick...@gmail.com>
Sent: Friday 6th September 2013 16:20
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume

Markus:

See: https://issues.apache.org/jira/browse/SOLR-5216


On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:

Hi Mark,

Got an issue to watch?

Thanks,
Markus

-----Original message-----
From: Mark Miller <markrmil...@gmail.com>
Sent: Wednesday 4th September 2013 16:55
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud 4.x hangs under high update volume

I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.

Mark

Sent from my iPhone

On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:

Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503s "no server hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help either. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:

"java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for<0x00000007216e68d8>  (a
java.util.concurrent.Semaphore$NonfairSync)
  at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
  at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
  at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
  at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
  at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
  at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
  at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
  at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
  at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
  at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
  at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
  at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
  at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
  at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
  at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
  at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
  at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
  at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
  at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
  at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
  at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
  at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
  at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
  at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
  at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
  at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
  at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
  at org.eclipse.jetty.server.Server.handle(Server.java:445)
  at
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
  at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
  at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
  at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
  at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
  at java.lang.Thread.run(Thread.java:724)"

Some questions I had were:
1) What exclusive locks does SolrCloud "make" when performing an update?
2) Keeping in mind I do not read or write java (sorry :D), could someone help me understand "what" solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next (my rough mental model is sketched below).
3) It seems all threads in this state are waiting for "0x00000007216e68d8"; is there a way to tell what "0x00000007216e68d8" is?
4) Is there a limit to how many updates you can do in SolrCloud?
5) Wild-ass-theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
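For question 2, my rough mental model of what that AdjustableSemaphore frame is doing, boiled down to plain java.util.concurrent (illustrative sketch only - this is not Solr's actual code, just the pattern the stack trace suggests):

import java.util.concurrent.Semaphore;

// Bounded-permit throttle like the one the trace points at: every outgoing
// distributed update has to take a permit, and when all permits are held,
// callers park inside Semaphore.acquire() - exactly the
// "parking to wait for ... Semaphore$NonfairSync" frames above.
public class UpdateThrottleSketch {
    private final Semaphore permits;

    public UpdateThrottleSketch(int maxOutstandingRequests) {
        this.permits = new Semaphore(maxOutstandingRequests);
    }

    public void submit(Runnable sendUpdateToReplica) throws InterruptedException {
        permits.acquire();             // blocks (parks) when no permits are left
        try {
            sendUpdateToReplica.run(); // stands in for the forwarded update request
        } finally {
            permits.release();         // if releases are ever missed or delayed,
                                       // every other thread ends up parked here
        }
    }
}

If that's roughly right, then question 5 mostly comes down to whether those permits are global or per-shard.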

To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form:
https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz

Any help/suggestions/ideas on this issue, big or small, would be much appreciated.

Thanks so much all!

Tim Vaillancourt
