I have uncovered some additional details in the shard leader log: 2015-01-11 09:38:00.693 [qtp268575911-3617101] INFO org.apache.solr.update.processor.LogUpdateProcessor – [listings] webapp=/solr path=/update params{distrib.from=http://solr05.search.abebooks.com:8983/solr/listings/&u pdate.distrib=TOLEADER&wt=javabin&version=2} {add=[14065572860 (1490024273004199936)]} 0 707 2015-01-11 09:38:00.913 [updateExecutor-1-thread-35734] ERROR org.apache.solr.update.StreamingSolrServers – error java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:196) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessi onInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java :84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSession InputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe sponseParser.java:140) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe sponseParser.java:57) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.j ava:260) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(Abs tractHttpClientConnection.java:283) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(Def aultClientConnection.java:251) at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader (ManagedClientConnectionImpl.java:197) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestE xecutor.java:271) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.ja va:123) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultReques tDirector.java:682) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDi rector.java:486) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient .java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(Con currentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1 145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 615) at java.lang.Thread.run(Thread.java:745) 2015-01-11 09:38:00.917 [qtp268575911-3616964] WARN org.apache.solr.update.processor.DistributedUpdateProcessor – Error sending update java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:196) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessi onInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java :84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSession InputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe sponseParser.java:140) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpRe sponseParser.java:57) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.j ava:260) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(Abs tractHttpClientConnection.java:283) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(Def aultClientConnection.java:251) at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader (ManagedClientConnectionImpl.java:197) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestE xecutor.java:271) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.ja va:123) at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultReques tDirector.java:682) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDi rector.java:486) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient .java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient .java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(Con currentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1 145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 615) at java.lang.Thread.run(Thread.java:745)
It appears that the connection to the replica is being reset. The number of file open files allowed on all Solr boxes is set to 65536. There doesn’t appear to be any retries here and this triggers the recovery of the replica. Is this expected behaviour? Thanks, Lindsay On 2015-01-12, 11:11 AM, "Lindsay Martin" <lmar...@abebooks.com> wrote: >Here are more details about our setup: > >Zookeeper: >* 3 separate hosts in same rack as Solr cluster >* Zookeeper hosts do not run any other processes > >Solr: >* total servers: 24 (plus 2 cold standbys in case of host failure) >* physical memory: 65931872 kB (62 GB) >* max JVM heap size: -Xmx10880m ( 10 GB) >* only one Solr per host > >On the Œindex¹ directory size front, I am seeing some differences in the >disk usage between leaders and replicas. > >In 1 / 12 shards, there is no difference in size between the leader and >replica. > >In 6 / 12 shards, there 1 1 G difference in size between the leader and >replica. Both have one index directory. > >In 5 / 12 shards, the replica is a multiple of the leader size, due to >multiple index directories on disk. > >For example, shard 1 leader has a directory named >'index.20140624071707699¹ 30 G in size. The replica has two directories: >'index.20150108052156468¹ at 31G and Œindex.20140624071556270¹ at 32G. > >Thanks, > >Lindsay > >On 2015-01-09, 5:01 PM, "Shawn Heisey" <apa...@elyograg.org> wrote: > >>On 1/9/2015 4:54 PM, Lindsay Martin wrote: >>> I am experiencing a problem where Solr nodes go into recovery following >>>an update cycle. >> >><snip> >> >>> For background, here are some details about our configuration: >>> * Solr 4.10.2 (problem also observed with Solr 4.6.1) >>> * 12 shards with 2 nodes per shard >>> * a single updater running in a separate subnet is posting updates >>>using the SolrJ CloudSolrServer client. Updates are triggered hourly. >>> * system is under continuous query load >>> * autoCommit is set to 821 seconds >>> * autoSoftCommit is set to 303 seconds >> >>I would suspect some kind of performance problem that likely results in >>the zkClientTimeout expiring. I have a standard set of questions for >>performance problems. >> >>Questions about zookeeper: >> >>How many ZK nodes? Is zookeeper on separate hardware? If it's on the >>same hardware as Solr, is its database on the same disk spindles as the >>Solr index, or separate spindles? Is zookeeper standalone or embedded >>in Solr? If it's standalone, do you happen to know the java max heap >>for the zookeeper processes? >> >>Questions about Solr and the hardware: >> >>How many total Solr servers? How much RAM is installed on each one? >>What is the max size of the Java heap? Are you running more than one >>Solr (JVM/container) instance per machine? >> >>If you add up all the "index" directories on a server, how much disk >>space does it take? Is the amount of disk space used similar on all of >>the servers? >> >>Thanks, >>Shawn >> >