I have recently observer severe performance issues of 1 collection, 2 shard, 4 server SolrCloud (Solr 6.6.0 running on Windows, using AdoptOpenJDK 1.8 JRE, NSSM was used to run Solr as Windows service). During recovery of a replica the network utilization of the server hosting the replica (that is in Recovering state) was very small (around 8 Mbps). Often the recovery failed and re-started. When the recovery failed something like the following warning was logged on the shard leader: Exception while writing response for params: generation=XXXXX&qt=/replication&file=_randomlookingfilename.fdt&offset=YYYYY&checksum=true&wt=filestream&command=filecontent java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 50016/50000 ms at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:219) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:220) at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491) at org.apache.commons.io.output.ProxyOutputStream.write(ProxyOutputStream.java:90) at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:213) at org.apache.solr.common.util.FastOutputStream.write(FastOutputStream.java:83) at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1520) at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2601) at org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49) at org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:809) at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:538) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:534) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Unknown Source) Caused by: java.util.concurrent.TimeoutException: Idle timeout expired: 50016/50000 ms at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166) at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Unknown Source) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ... 1 more
And after some (often around 30) seconds a warning like this was logged on the recovering node: Error in fetching file: _randomlookingfilename.fdt (downloaded NNNNN of MMMMM bytes) org.apache.http.MalformedChunkCodingException: CRLF expected at end of chunk at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:255) at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:227) at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:186) at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79) at org.apache.solr.common.util.FastInputStream.read(FastInputStream.java:128) at org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:166) at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1512) at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1463) at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1444) at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:926) at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:492) at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:301) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:400) at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:219) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:471) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:284) at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) at java.util.concurrent.FutureTask.run(Unknown Source) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) The issue was observable on relatively slow hardware (e.g. Azure B4ms virtual machine), on more powerful hardware the replica was able to recover (however the network usage during recovery was still much lower than expected). When analyzing the resource bottleneck it was observed that conhost.exe had higher CPU usage than java.exe (Solr process). The total CPU usage was around 25% (on 4 core machine). It turned out that debug logging (enabled by -v command-line option) was the root cause of performance issues. Solr was writing excessive amount of log data to console and all the log data was drawn on an invisible console. The drawing was slow-enough and blocking to cause issues for Solr. Redirecting STDOUT and STDERR of Solr to nul (in NSSM settings) increased the network utilization during replica recovery several times and made it possible for the replica to recover and become active. The final solution was to revert to the default logging level (info) by removing the -v command line option. This improved the network utilization during recovery even more than STDOUT/STDERR redirection. Could it be possible to mention in Solr documentation (e.g. in https://cwiki.apache.org/confluence/display/solr/SolrPerformanceProblems and https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceFactors ) that debug logging might seriously hurt performance? Maybe it could be possible to log a warning (that debug logging may seriously hurt performance) if root logging level is set to debug?