Hello, I'm looking for some guidance on the best steps for tuning a solr cloud cluster which is heavy on writes. We are currently running a solr cloud fleet composed of one core, one shard, and three nodes. The cloud is hosted in AWS, and each solr node is on its own linux r3.2xl instance with 8 cpu and 61 GiB mem, and a 2TB EBS volume attached. Our index is currently 550 GiB over 420M documents, and growing quite rapidly. We are currently doing a bit more than 1000 document writes/deletes per second.
Recently, we've hit some trouble with our production cloud. We have had the process on individual instances die a few times, and we see the following error messages being logged (expanded logs at the bottom of the email): ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException; null:org.eclipse.jetty.io.EofException WARN - 2016-04-26 00:55:29.571; org.eclipse.jetty.servlet.ServletHandler; /solr/panopto/select java.lang.IllegalStateException: Committed WARN - 2016-04-26 00:55:29.571; org.eclipse.jetty.server.Response; Committed before 500 {trace=org.eclipse.jetty.io.EofException Another time we saw this happen, we had java OOM errors (expanded logs at the bottom): WARN - 2016-04-25 22:58:43.943; org.eclipse.jetty.servlet.ServletHandler; Error for /solr/panopto/select java.lang.OutOfMemoryError: Java heap space ERROR - 2016-04-25 22:58:43.945; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space ... Caused by: java.lang.OutOfMemoryError: Java heap space When the cloud goes into recovery during live indexing, it takes about 4-6 hours for a node to recover, but when we turn off indexing, recovery only takes about 90 minutes. Moreover, we see that deletes are extremely slow. We do batch deletes of about 300 documents based on two value filters, and this takes about one minute: Research online suggests that a larger disk cache <https://wiki.apache.org/solr/SolrPerformanceProblems> could be helpful, but I also see from an older page <http://wiki.apache.org/lucene-java/ImproveSearchingSpeed> on tuning for Lucene that turning down the swappiness on our Linux instances may be preferred to simply increasing space for the disk cache. Moreover, to scale in the past, we've simply rolled our cluster while increasing the memory on the new machines, but I wonder if we're hitting the limit for how much we should scale vertically. My impression is that sharding will allow us to warm searchers faster and maintain a more effective cache as we scale. Will we really be helped by sharding, or is it only a matter of total CPU/Memory in the cluster? Thanks! Stephen (206)753-9320 stephen-lewis.net Logs: ERROR - 2016-04-26 00:56:43.873; org.apache.solr.common.SolrException; null:org.eclipse.jetty.io.EofException at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207) at org.apache.solr.util.FastWriter.flush(FastWriter.java:141) at org.apache.solr.util.FastWriter.flushBuffer(FastWriter.java:155) at org.apache.solr.response.TextResponseWriter.close(TextResponseWriter.java:83) at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:42) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) WARN - 2016-04-25 22:58:43.943; org.eclipse.jetty.servlet.ServletHandler; Error for /solr/panopto/select java.lang.OutOfMemoryError: Java heap space ERROR - 2016-04-25 22:58:43.945; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:793) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:434) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Java heap space WARN - 2016-04-26 00:56:43.873; org.eclipse.jetty.server.Response; Committed before 500 {trace=org.eclipse.jetty.io.EofException at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:142) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107) at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221) at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282) at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125) at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207) at org.apache.solr.util.FastWriter.flush(FastWriter.java:141) at org.apache.solr.util.FastWriter.flushBuffer(FastWriter.java:155) at org.apache.solr.response.TextResponseWriter.close(TextResponseWriter.java:83) at org.apache.solr.response.XMLResponseWriter.write(XMLResponseWriter.java:42) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745) ,code=500}