Hi Shawn,

Here is an update. We found the main issue. We configured our cluster to run under Jetty, and when we tried full indexing we did not see the original Invalid Chunk error. However, the replicas still went into recovery. All this time we had been looking at the replica logs to diagnose the issue, but the problem seems to be on the leader side. When we looked into the leader logs, we found the following on all the leaders:
3439873 [qtp1314570047-92] WARN org.apache.solr.update.processor.DistributedUpdateProcessor – Error sending update
*java.lang.IllegalStateException: Queue full*
        at java.util.AbstractQueue.add(AbstractQueue.java:98)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:182)
        at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
        at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
        at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
        at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
        at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
        at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
3439874 [qtp1314570047-92] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO org.apache.solr.cloud.ZkController – Wrote down to /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO org.apache.solr.cloud.ZkController – Put replica core=search1_shard7_replica1 coreNodeName=core_node214 on XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:8580_solr into leader-initiated recovery.
3439880 [qtp1314570047-92] WARN org.apache.solr.cloud.ZkController – Leader is publishing core=search1_shard7_replica1 coreNodeName=core_node214 state=down on behalf of un-reachable replica http://XXXXXXXXXXXXXXX:YYYYYY/solr/search1_shard7_replica1/; forcePublishState? false
3439881 [zkCallback-2-thread-12] INFO org.apache.solr.cloud.DistributedQueue – LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
3439881 [qtp1314570047-92] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor – *Setting up to try to start recovery on replica http://XXXXXXXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYY/solr/search1_shard7_replica1/ after: java.lang.IllegalStateException: Queue full*
3439882 [qtp1314570047-92] INFO org.apache.solr.core.SolrCore – [search1_shard7_replica2] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=2608
3439882 [updateExecutor-1-thread-153] INFO org.apache.solr.cloud.LeaderInitiatedRecoveryThread – LeaderInitiatedRecoveryThread-search1_shard7_replica1 started running to send REQUESTRECOVERY command to http://XXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY/solr/search1_shard7_replica1/; will try for a max of 600 secs
3439882 [updateExecutor-1-thread-153] INFO org.apache.solr.cloud.LeaderInitiatedRecoveryThread – Asking core=search1_shard7_replica1 coreNodeName=core_node214 on http://XXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY/solr to recover
3439885 [OverseerStateUpdate-93213309377511456-XXXXXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYY_solr-n_0000002822] INFO org.apache.solr.cloud.Overseer
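As a side note on the stack trace itself: java.util.AbstractQueue.add() is specified to throw IllegalStateException("Queue full") when a capacity-bounded queue rejects an element, which is exactly the top frame above. Here is a minimal, self-contained sketch of that JDK behavior (the class name and the capacity of 1 are made up just for illustration):

import java.util.concurrent.LinkedBlockingQueue;

public class QueueFullDemo {
    public static void main(String[] args) {
        // Any bounded BlockingQueue behaves this way; ConcurrentUpdateSolrServer
        // keeps a bounded queue of pending update requests internally.
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1);
        queue.add("update-1"); // fits, returns true
        // add() delegates to offer() and throws rather than blocking or
        // returning false when the queue is at capacity:
        queue.add("update-2"); // java.lang.IllegalStateException: Queue full
    }
}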
There is a similar bug reported around this, https://issues.apache.org/jira/browse/SOLR-5850, and it seems to still be in OPEN status. Is there a way we can configure the queue size and increase it? Or is there a version of Solr that has this issue resolved already? Can you suggest where we go from here to resolve this? We can re-patch the war file if that is what you would recommend.

In the end, our initial speculation that Solr is unable to handle so many updates appears correct: we do not see this issue when the update load is lower.
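For reference, on the client side SolrJ does let you choose the queue size: it is a constructor argument of ConcurrentUpdateSolrServer. The instance that overflows above is created internally on the leader for forwarding updates to replicas, and as far as I can tell its queue size is not exposed through configuration, so raising it would mean patching. A sketch of the client-side usage, with a placeholder URL and made-up sizes to tune for the actual load:

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws IOException, SolrServerException {
        // queueSize buffers pending updates; threadCount drains the queue.
        ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/search1", // placeholder URL
                10000, // queueSize
                4);    // threadCount

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);

        server.blockUntilFinished(); // wait for the queue to drain
        server.commit();
        server.shutdown();
    }
}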
Here is the log from the replica, FYI. There is no Invalid Chunk error under Jetty, but the issue at hand is unrelated:

2382763 [Thread-74] INFO org.apache.solr.cloud.ZkController – Wrote recovering to /collections/search1/leader_initiated_recovery/shard7/core_node214
2382776 [Thread-74] INFO org.apache.solr.update.DefaultSolrCoreState – Running recovery - first canceling any ongoing recovery
2382777 [Thread-74] WARN org.apache.solr.cloud.RecoveryStrategy – Stopping recovery for core=search1_shard7_replica1 coreNodeName=core_node214
2382778 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Starting recovery process. core=search1_shard7_replica1 recoveringAfterStartup=false
2382779 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Publishing state of core search1_shard7_replica1 as recovering, leader is http://XXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr/search1_shard7_replica2/ and I am http://XXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr/search1_shard7_replica1/
2382779 [RecoveryThread] INFO org.apache.solr.cloud.ZkController – publishing core=search1_shard7_replica1 state=recovering collection=search1
2382779 [RecoveryThread] INFO org.apache.solr.cloud.ZkController – numShards not found on descriptor - reading it from system property
2382785 [zkCallback-2-thread-5] INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2382796 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Sending prep recovery command to http://XXXXXXXXXXXXXX:YYYYYYYYYYYYYYYY/solr; WaitForState: action=PREPRECOVERY&core=search1_shard7_replica2&nodeName=XXXXXXXXXXXXXXXXXXXXXX%3AXXXXXXXXXXX_solr&coreNodeName=core_node214&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
2382953 [zkCallback-2-thread-5] INFO org.apache.solr.common.cloud.ZkStateReader – A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 40)
2385815 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Attempting to PeerSync from http://sXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYYYYYYYYYYY/solr/search1_shard7_replica2/ core=search1_shard7_replica1 - recoveringAfterStartup=false
2385815 [RecoveryThread] INFO org.apache.solr.update.PeerSync – PeerSync: core=search1_shard7_replica1 url=http://XXXXXXXXXXXXXX:YYYYYYYY/solr START replicas=[http://XXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYY/solr/search1_shard7_replica2/] nUpdates=100
2385816 [RecoveryThread] WARN org.apache.solr.update.PeerSync – PeerSync: core=search1_shard7_replica1 url=http://XXXXXXXXXXXXXXXX:YYYYYYYYYYYYYY/solr too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
2385817 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – PeerSync Recovery was not successful - trying replication. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Starting Replication Recovery. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Begin buffering updates. core=search1_shard7_replica1
2385817 [RecoveryThread] INFO org.apache.solr.update.UpdateLog – Starting to buffer updates. FSUpdateLog{state=ACTIVE, tlog=tlog{file=/opt/solr/solrnodes/solrnode1/search1_shard7_replica1/data/tlog/tlog.0000000000000000407 refcount=1}}
2385817 [RecoveryThread] INFO org.apache.solr.cloud.RecoveryStrategy – Attempting to replicate from http://XXXXXXXXXXXXXXXXXXXXXX:YYYYYYYYYYYYY/solr/search1_shard7_replica2/. core=search1_shard7_replica1
2386469 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – No value set for 'pollInterval'. Timer Task not started.
2386477 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Master's generation: 963
2386478 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Slave's generation: 962
2386478 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Starting replication process
2386489 [RecoveryThread] INFO org.apache.solr.handler.SnapPuller – Number of files in latest index in master: 250
2386489 [RecoveryThread] INFO org.apache.solr.core.CachingDirectoryFactory – return new directory for /opt/solr/solrnodes/solrnode1/search1_shard7_replica1/data/index.20150127135840219
2386491 [RecoveryThread] WARN org.apache.solr.handler.SnapPuller – File _89e.fdt expected to be 811432 while it is 4204287

regards,
Vijay

On Tue, Jan 27, 2015 at 12:07 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/26/2015 9:34 PM, Vijay Sekhri wrote:
> > Hi Shawn, Erick
> > So it turned out that once we increased our indexing rate to the original
> > full indexing rate, the replicas went back into recovery no matter what
> > the zk timeout setting was. Initially we thought that increasing the
> > timeout was helping, but apparently not. We just decreased the indexing
> > rate, and that caused fewer replicas to go into recovery.
> > Once we went to our full indexing rate, almost all replicas went into
> > recovery no matter what the zk timeout or the ticktime settings were.
> > We reverted the ticktime back to the original 2 seconds.
> >
> > So we investigated further, and after checking the logs we found this
> > exception happening right before the recovery process is initiated. We
> > observed this on two different replicas that went into recovery. We are
> > not sure if this is a coincidence or a real problem. Notice we were also
> > putting some search query load on while indexing to trigger the recovery
> > behavior.
> >
> > <snip>
> >
> > 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32)
> > ClientAbortException: *java.io.IOException: JBWEB002020: Invalid chunk header*
>
> One possibility that my searches on that exception turned up is that
> this is some kind of a problem in the servlet container. The
> information I can see suggests it may be a bug in JBoss, with the
> underlying cause being changes in newer releases of Java 7. Your
> stacktraces do seem to mention JBoss classes, so that seems likely. The
> reason we only recommend running under the Jetty that comes with Solr,
> which has a tuned config, is that it is the only servlet container that
> actually gets tested.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1104273
> https://bugzilla.redhat.com/show_bug.cgi?id=1154028
>
> I can't really verify any other possibility.
>
> Thanks,
> Shawn
>

--
*********************************************
Vijay Sekhri
*********************************************