I'll give you the high level before delving deep into setup etc. I have been struggeling at work with a seemingly random problem when solr will hang for 10-15 minutes during updates. This outage always seems to immediately be proceeded by an EOF exception on the replica. Then 10-15 minutes later we see an exception on the leader for a socket timeout to the replica. The leader will then tell the replica to recover which in most cases it does and then the outage is over.
Here are the setup details: We are currently using Solr 4.0.0 with an external ZK ensemble of 5 machines. We have 2 active collections each with only 1 shard (we have in total about 15 collections but most are empty or have less than 100 docs). The first index (collection1) is 6.5GB and has ~18M documents. The 2nd index (collection2) is 9GB and has about 13M documents. In all cases the leader resides on 1 server and the replica resides on the other. Both servers are AWS XL High Mem instances. (8 CPUs @ 2.67Ghz, 70GB Ram) with the index residing on a 1TB raid 10 using ephemeral storage disks. We are starting solr using the embedded jetty with the following java memory and GC options: -Xmx16382m -Xms4092m -XX:MaxPermSize=256m -Xss256k -XX:NewSize=1536m -XX:SurvivorRatio=16 -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:ParallelCMSThreads=2 -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSParallelRemarkEnabled Both collections receive a constant stream of updates ~10k per hour (both adds/deletes). Approximately once per day the following events transpire: 1. We see a log entry for a distributed update that takes just over 50000 ms followed by an EOF write exception on the replica. In all cases this exception is triggered by an update to the 9GB collection. 2. Occasionally we'll see a 503 shard update error on the leader but usually not. 3. Approximately 15 minutes after this exception we see a timeout error for a this distributed update request on the leader. 4. The leader then creates a new connection and tells the replica to recover, which it does and everything is OK again. 5. During the 15 minute window from when the replica throws the EOF until the SocketTimeout by the leader no other updates are processed: ERROR ON REPLICA: Jul 8, 2013 6:38:16 PM org.apache.solr.core.SolrCore execute INFO: [collection2_0] webapp=/solr path=/update params={distrib.from=http://Solr4-1-1.domain.com:8983/solr/collection2_0/&update.distrib=FROMLEADER&wt=javabin&version=2} status=0 QTime=50012 Jul 8, 2013 6:38:16 PM org.apache.solr.common.SolrException log SEVERE: null:org.eclipse.jetty.io.EofException at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:154) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:101) at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:203) at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:196) at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:94) at org.apache.solr.response.BinaryResponseWriter.write(BinaryResponseWriter.java:49) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:404) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111) at org.eclipse.jetty.server.Server.handle(Server.java:351) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:952) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534) at java.lang.Thread.run(Thread.java:662) ERROR LEADER: ================ 10 MINUTES AFTER EOF (sometimes) ================ Jul 8, 2013 6:47:32 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://Solr4-1-2.domain.com:8983/solr/collection2_0/ Jul 8, 2013 6:47:32 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://Solr4-1-2.domain.com:8983/solr/collection2_0/:org.apache.solr.common.SolrException: Server at http://Solr4-1-3.domain.com:8983/solr/collection2_0 returned non ok status:503, message:Service Unavailable at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) ================ 15 MINUTES AFTER EOF ================ Jul 8, 2013 6:52:56 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://Solr4-1-2.domain.com:8983/solr/collection2_0/:org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://Solr4-1-2.domain.com:8983/solr/collection2_0 at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:413) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:335) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:309) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.net.SocketException: Connection timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264) at org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(LoggingSessionInputBuffer.java:115) at org.apache.http.impl.conn.DefaultResponseParser.parseHead(DefaultResponseParser.java:98) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:252) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:282) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:247) at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:216) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:298) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:647) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:464) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754) at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352) ... 11 more Jul 8, 2013 6:52:56 PM org.apache.solr.update.processor.DistributedUpdateProcessor doFinish INFO: try and ask http://Solr4-1-2.domain.com:8983/solr to recover Jul 8, 2013 6:52:56 PM org.apache.solr.client.solrj.impl.HttpClientUtil createClient INFO: Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false The machines are certainly not under any load. During the outage disk, network, and cpu levels were dead almost as if nothing was happening. Initially we thought it could be related to GC since we are using the CMS stuff and we have a lot of allocated memory, but the outages are not consistent with the GC times that were logged. After doing some digging we found that jetty is configured with a default maxIdleTime of 50000 ms. We thought that this could explain the EOF exception in this way: Solr4-1-2 accepts an update from the leader, takes to long to do whatever and jetty closes the socket. Then when it goes to finish writing it's response to the socket object it throws an EOF exception. However I am not able to reproduce this! I set the maxIdleTime to 100ms on both machines, run it under the same load but never get the error! Does anyone have any suggestions/expertise that could explain this, or what we can try to further gather information/cause? This has been going on for 3 weeks now and I'M really starting to get mad about it. I'M happy to send logs, and configurations privately to anyone who is willing to help us chase this down. Thanks! Jed