We are running Solr 4.6.1 in AWS:
- 2 Solr instances (1 shard, 1 leader, 1 replica)
- 1 CloudSolrServer SolrJ client updating the index.
- 3 Zookeepers

The Solr instances are behind a load balancer and also in an auto scaling
group. The ScaleUpPolicy will add up to 9 additional instances (replicas),
1 per minute. Later, the 9 replicas are terminated with the ScaleDownPolicy.

Problem: during the ScaleUpPolicy, when the Solr Leader is under heavy
query load, the SolrJ indexing client issues a commit which hangs and never
returns. Note that the index schema contains 3 ExternalFileFields, which slow
down the commit process. Here's the stack trace of the hung client thread:

Thread 1959: (state = IN_NATIVE)
 - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
 - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame)
 - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer() @bci=71, line=166 (Compiled frame)
 - org.apache.http.impl.io.SocketInputBuffer.fillBuffer() @bci=1, line=90 (Compiled frame)
 - org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=137, line=281 (Compiled frame)
 - org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(org.apache.http.util.CharArrayBuffer) @bci=5, line=115 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=16, line=92 (Compiled frame)
 - org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(org.apache.http.io.SessionInputBuffer) @bci=2, line=62 (Compiled frame)
 - org.apache.http.impl.io.AbstractMessageParser.parse() @bci=38, line=254 (Compiled frame)
 - org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader() @bci=8, line=289 (Compiled frame)
 - org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader() @bci=1, line=252 (Compiled frame)
 - org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader() @bci=6, line=191 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=62, line=300 (Compiled frame)
 - org.apache.http.protocol.HttpRequestExecutor.execute(org.apache.http.HttpRequest, org.apache.http.HttpClientConnection, org.apache.http.protocol.HttpContext) @bci=60, line=127 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.tryExecute(org.apache.http.impl.client.RoutedRequest, org.apache.http.protocol.HttpContext) @bci=198, line=717 (Compiled frame)
 - org.apache.http.impl.client.DefaultRequestDirector.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=597, line=522 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.HttpHost, org.apache.http.HttpRequest, org.apache.http.protocol.HttpContext) @bci=344, line=906 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest, org.apache.http.protocol.HttpContext) @bci=21, line=805 (Compiled frame)
 - org.apache.http.impl.client.AbstractHttpClient.execute(org.apache.http.client.methods.HttpUriRequest) @bci=6, line=784 (Compiled frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest, org.apache.solr.client.solrj.ResponseParser) @bci=1175, line=395 (Compiled frame)
 - org.apache.solr.client.solrj.impl.HttpSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=17, line=199 (Compiled frame)
 - org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(org.apache.solr.client.solrj.impl.LBHttpSolrServer$Req) @bci=132, line=285 (Compiled frame)
 - org.apache.solr.client.solrj.impl.CloudSolrServer.request(org.apache.solr.client.solrj.SolrRequest) @bci=838, line=640 (Compiled frame)
 - org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(org.apache.solr.client.solrj.SolrServer) @bci=17, line=117 (Compiled frame)
 - org.apache.solr.client.solrj.SolrServer.commit(boolean, boolean) @bci=16, line=168 (Interpreted frame)
 - org.apache.solr.client.solrj.SolrServer.commit() @bci=3, line=146 (Interpreted frame)

 The Solr leader log shows many connection timeout exceptions from the
other Solr replicas during this period. Some of these timeouts may have
been caused by replicas being terminated by the ScaleDownPolicy. From the
search client application's point of view, everything looked fine, but
indexing stopped until I restarted the SolrJ client.

 Does this look like a case where a timeout value needs to be increased
somewhere? If so, which one?
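
To illustrate what I think the trace shows: the commit thread is sitting in
socketRead0 waiting for response headers, which is how a socket read behaves
when no read timeout (SO_TIMEOUT) is set and the server never answers. Below
is a minimal plain-Java sketch of that behaviour. It is not SolrJ code, and
the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that accepts the connection but never
     * writes a response, then attempts a single read with the given
     * SO_TIMEOUT. Returns true if the read was aborted by a
     * SocketTimeoutException instead of blocking indefinitely.
     */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // Port 0 lets the OS pick a free port on the loopback interface.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort());
             // Keep the accepted side open so the client never sees EOF.
             Socket accepted = silentServer.accept()) {

            // A timeout of 0 means "wait forever", which would reproduce
            // the hang; a positive value bounds the blocking read.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();      // blocks: the server never sends anything
                return false;   // only reached if data or EOF arrives
            } catch (SocketTimeoutException e) {
                return true;    // the read gave up after soTimeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

If that is what is happening here, the SolrJ equivalent would presumably be a
socket timeout on the underlying HTTP client (HttpSolrServer has
setSoTimeout, for example), but I am not sure that is the right knob when
going through CloudSolrServer.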

 Thanks,
 Peter
