We are running Solr 4.6.1 in AWS:

- 2 Solr instances (1 shard, 1 leader, 1 replica)
- 1 CloudSolrServer SolrJ client updating the index
- 3 Zookeepers

The Solr instances are behind a load balancer and are also in an auto scaling group. The ScaleUpPolicy adds up to 9 additional instances (replicas), 1 per minute. Later, the 9 replicas are terminated by the ScaleDownPolicy.

Problem: during the ScaleUpPolicy, when the Solr leader is under heavy query load, the SolrJ indexing client issues a commit which hangs and never returns. Note that the index schema contains 3 ExternalFileFields, which slow down the commit process.

Here's the stack trace of the hung client thread:

Thread 1959: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame)
 - org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=5, line=115 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=62 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame)
 - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame)
 - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame)
 - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=717 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=597, line=522 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest, org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Compiled frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=17, line=199 (Compiled frame)
 - org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(org.apache.solr.client.solrj.impl.LBHttpSolrServer$Req) @bci=132, line=285 (Compiled frame)
 - org.apache.solr.client.solrj.impl.CloudSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=838, line=640 (Compiled frame)
 - org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(org.apache.solr.client.solrj.SolrServer) @bci=17, line=117 (Compiled frame)
 - org.apache.solr.client.solrj.SolrServer.commit(boolean, boolean) @bci=16, line=168 (Interpreted frame)
 - org.apache.solr.client.solrj.SolrServer.commit() @bci=3, line=146 (Interpreted frame)

The Solr leader log shows many connection timeout exceptions from the other Solr replicas during this period. Some of these timeouts may have been caused by replicas being terminated by the ScaleDownPolicy.

From the search client application's point of view, everything looked fine, but indexing stopped until I restarted the SolrJ client.

Does this look like a case where a timeout value needs to be increased somewhere? If so, which one?

Thanks,
Peter
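
P.S. The top frame of the trace is a blocking socketRead0 while waiting for the HTTP response header, which is what I would expect to see if the client socket has no read timeout and the server never answers. Below is a minimal, standalone sketch of that behaviour using only the JDK; it is not SolrJ code, and the class and method names (ReadTimeoutDemo, readTimesOut) are made up for illustration. My assumption, not yet verified against 4.6.1, is that the SolrJ equivalent would be something like setSoTimeout / setConnectionTimeout on the LBHttpSolrServer used by CloudSolrServer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that completes the TCP handshake but never
     * writes a response, then attempts a blocking read with the given
     * SO_TIMEOUT. Returns true if the read was aborted by a
     * SocketTimeoutException instead of blocking indefinitely.
     */
    public static boolean readTimesOut(int soTimeoutMillis) {
        // Port 0 lets the OS pick a free port. accept() is never called, but
        // the listen backlog still lets the client connect, so the peer is
        // reachable yet silent (a stand-in for a server stuck in a long commit).
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            // A value of 0 means "no timeout": read() would wait forever.
            client.setSoTimeout(soTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks in the native socket read
                return false; // only reached if the peer sent data or closed
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after soTimeoutMillis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

With a non-zero timeout the read fails fast and the caller can retry or report an error; with 0 the thread would sit in the read just like Thread 1959 above. If this is the cause, the fix would be setting a finite read timeout on the client rather than increasing an existing one.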