Hey guys,

I am looking into an issue we've been having with SolrCloud since the
beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0
yet). I've noticed other users with this same issue, so I'd really like to
get to the bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we
see stalled transactions that snowball to consume all Jetty threads in the
JVM. This eventually causes the JVM to hang with most threads waiting on
the condition/stack provided further down in this message. At this point
SolrCloud instances start to see their neighbors (who also have all
threads hung) as down w/"Connection Refused", and the shards become "down"
in state. Sometimes a node or two survives and just returns 503 "no server
hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending
updates to Solr, as well as the batch size (we batch updates from client ->
Solr), and the Soft/Hard autoCommits, all to no avail. Turning off
Client-to-Solr batching (1 update = 1 call to Solr) also did not
help. Certain combinations of update threads and batch sizes seem to
mask/help the problem, but do not resolve it entirely.

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and
a replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good
day.
- 5000 max jetty threads (well above what we use when we are healthy),
Linux-user threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version
(I hope I'm wrong).

The stack trace that is holding up all my Jetty QTP threads is the
following, which seems to be waiting on a lock that I would very much like
to understand further:

"java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
    at java.lang.Thread.run(Thread.java:724)"
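
For context on the top frames: java.util.concurrent.Semaphore.acquire blocks
the calling thread until a permit is free, so a trace like this is what you
would expect when every permit is held and none are being returned. Below is
a toy Python model of that pattern, only to illustrate the idea. It is not
Solr's code; the permit count, the timeout and all names are invented, and it
says nothing about what the permits actually represent inside Solr.

```python
import threading

def submit(permits, name, results, hold):
    # Wait briefly for a permit. The Java acquire() in the trace has no
    # timeout, so the real threads would park here indefinitely.
    got = permits.acquire(timeout=0.2)
    results[name] = got
    if got and not hold:
        permits.release()  # healthy request: hand the permit back

def run_demo(permit_count=2, waiting_requests=3):
    # Hypothetical limit: permit_count stands in for whatever the real cap is.
    permits = threading.Semaphore(permit_count)
    results = {}
    # "Stalled" requests take every permit and never release it.
    for i in range(permit_count):
        submit(permits, "stalled-%d" % i, results, hold=True)
    # Later requests find no permits left and time out (would hang in Java).
    threads = [
        threading.Thread(target=submit, args=(permits, "req-%d" % i, results, False))
        for i in range(waiting_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling run_demo() returns a dict mapping each request name to whether it got
a permit; in this toy setup the "stalled" entries succeed and the "req"
entries do not.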

Some questions I had were:
1) What exclusive locks does SolrCloud "make" when performing an update?
2) Keeping in mind I do not read or write Java (sorry :D), could someone
help me understand "what" Solr is locking in this case at
"org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
when performing an update? That will help me understand where to look next.
3) It seems all threads in this state are waiting for "0x00000007216e68d8".
Is there a way to tell what "0x00000007216e68d8" is?
4) Is there a limit to how many updates you can do in SolrCloud?
5) Wild-ass-theory: would more shards provide more locks (whatever they
are) on update, and thus more update throughput?
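
Related to question 3, one way to check how many threads are parked on the
same address is to tally the "parking to wait for" lines in a jstack dump. A
rough Python sketch follows; the function names are made up for illustration
and are not part of any existing tool, and it only counts addresses, it does
not tell you what the object is.

```python
import gzip
import re
from collections import Counter

# Matches jstack lines such as:
#   - parking to wait for  <0x00000007216e68d8> (a ...
# Only the address is captured, so it does not matter whether the class
# name that follows has been wrapped onto the next line.
PARK_RE = re.compile(r"parking to wait for\s+<(0x[0-9a-fA-F]+)>")

def tally_parked(dump_text):
    """Return a Counter of lock address -> number of threads parked on it."""
    return Counter(PARK_RE.findall(dump_text))

def tally_parked_gz(path):
    """Same as tally_parked, but reads a gzipped jstack dump from disk."""
    with gzip.open(path, "rt", errors="replace") as fh:
        return tally_parked(fh.read())
```

For example, tally_parked_gz("solr-jstack-2013-08-23.gz").most_common(5)
would list the five addresses with the most parked threads, assuming the
dump has been downloaded to the current directory.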

To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL
in gzipped form:
https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz

Any help/suggestions/ideas on this issue, big or small, would be much
appreciated.

Thanks so much all!

Tim Vaillancourt
