Re: replicas goes in recovery mode right after update
Hi Erick, @ichattopadhyaya beat me to it already yesterday. So we are good -cheers Vijay On Wed, Jan 28, 2015 at 1:30 PM, Erick Erickson wrote: > Vijay: > > Thanks for reporting this back! Could I ask you to post a new patch with > your correction? Please use the same patch name > (SOLR-5850.patch), and include a note about what you found (I've already > added a comment). > > Thanks! > Erick > > On Wed, Jan 28, 2015 at 9:18 AM, Vijay Sekhri > wrote: > > > Hi Shawn, > > Thank you so much for the assistance. Building is not a problem . Back in > > the days I have worked with linking, compiling and building C , C++ > > software . Java is a piece of cake. > > We have built the new war from the source version 4.10.3 and our > > preliminary tests have shown that our issue (replicas in recovery on high > > load)* is resolved *. We will continue to do more testing and confirm . > > Please note that the *patch is BUGGY*. > > > > It removed the break statement within while loop because of which, > whenever > > we send a list of docs it would hang (API CloudSolrServer.add) , but it > > would work if send one doc at a time. > > > > It took a while to figure out why that is happening. Once we put the > break > > statement back it worked like a charm. > > Furthermore the patch has > > > > > solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java > > which should be > > > > > solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java > > > > Finally checking if(!offer) is sufficient than using if(offer == false) > > Last but not the least having a configurable queue size and timeouts > > (managed via solrconfig) would be quite helpful > > Thank you once again for your help. > > > > Vijay > > > > On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey > wrote: > > > > > On 1/27/2015 2:52 PM, Vijay Sekhri wrote: > > > > Hi Shawn, > > > > Here is some update. We found the main issue > > > > We have configured our cluster to run under jetty and when we tried > > full > > > > indexing, we did not see the original Invalid Chunk error. However > the > > > > replicas still went into recovery > > > > All this time we been trying to look into replicas logs to diagnose > the > > > > issue. The problem seem to be at the leader side. When we looked into > > > > leader logs, we found the following on all the leaders > > > > > > > > 3439873 [qtp1314570047-92] WARN > > > > org.apache.solr.update.processor.DistributedUpdateProcessor – Error > > > > sending update > > > > *java.lang.IllegalStateException: Queue full* > > > > > > > > > > > > > There is a similar bug reported around this > > > > https://issues.apache.org/jira/browse/SOLR-5850 > > > > > > > > and it seem to be in OPEN status. Is there a way we can configure the > > > queue > > > > size and increase it ? or is there a version of solr that has this > > issue > > > > resolved already? > > > > Can you suggest where we go from here to resolve this ? We can > repatch > > > the > > > > war file if that is what you would recommend . > > > > In the end our initial speculation about solr unable to handle so > many > > > > update is correct. We do not see this issue when the update load is > > less. > > > > > > Are you in a position where you can try the patch attached to > > > SOLR-5850? You would need to get the source code for the version > you're > > > on (or perhaps a newer 4.x version), patch it, and build Solr yourself. 
> > > If you have no experience building java packages from source, this > might > > > prove to be difficult. > > > > > > Thanks, > > > Shawn > > > > > > > > > > > > -- > > * > > Vijay Sekhri > > * > > > -- * Vijay Sekhri *
Re: replicas goes in recovery mode right after update
Vijay: Thanks for reporting this back! Could I ask you to post a new patch with your correction? Please use the same patch name (SOLR-5850.patch), and include a note about what you found (I've already added a comment). Thanks! Erick On Wed, Jan 28, 2015 at 9:18 AM, Vijay Sekhri wrote: > Hi Shawn, > Thank you so much for the assistance. Building is not a problem . Back in > the days I have worked with linking, compiling and building C , C++ > software . Java is a piece of cake. > We have built the new war from the source version 4.10.3 and our > preliminary tests have shown that our issue (replicas in recovery on high > load)* is resolved *. We will continue to do more testing and confirm . > Please note that the *patch is BUGGY*. > > It removed the break statement within while loop because of which, whenever > we send a list of docs it would hang (API CloudSolrServer.add) , but it > would work if send one doc at a time. > > It took a while to figure out why that is happening. Once we put the break > statement back it worked like a charm. > Furthermore the patch has > > solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java > which should be > > solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java > > Finally checking if(!offer) is sufficient than using if(offer == false) > Last but not the least having a configurable queue size and timeouts > (managed via solrconfig) would be quite helpful > Thank you once again for your help. > > Vijay > > On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey wrote: > > > On 1/27/2015 2:52 PM, Vijay Sekhri wrote: > > > Hi Shawn, > > > Here is some update. We found the main issue > > > We have configured our cluster to run under jetty and when we tried > full > > > indexing, we did not see the original Invalid Chunk error. However the > > > replicas still went into recovery > > > All this time we been trying to look into replicas logs to diagnose the > > > issue. The problem seem to be at the leader side. When we looked into > > > leader logs, we found the following on all the leaders > > > > > > 3439873 [qtp1314570047-92] WARN > > > org.apache.solr.update.processor.DistributedUpdateProcessor – Error > > > sending update > > > *java.lang.IllegalStateException: Queue full* > > > > > > > > > There is a similar bug reported around this > > > https://issues.apache.org/jira/browse/SOLR-5850 > > > > > > and it seem to be in OPEN status. Is there a way we can configure the > > queue > > > size and increase it ? or is there a version of solr that has this > issue > > > resolved already? > > > Can you suggest where we go from here to resolve this ? We can repatch > > the > > > war file if that is what you would recommend . > > > In the end our initial speculation about solr unable to handle so many > > > update is correct. We do not see this issue when the update load is > less. > > > > Are you in a position where you can try the patch attached to > > SOLR-5850? You would need to get the source code for the version you're > > on (or perhaps a newer 4.x version), patch it, and build Solr yourself. > > If you have no experience building java packages from source, this might > > prove to be difficult. > > > > Thanks, > > Shawn > > > > > > > -- > * > Vijay Sekhri > * >
Re: replicas goes in recovery mode right after update
Hi Shawn,
Thank you so much for the assistance. Building is not a problem. Back in the day I worked with linking, compiling and building C and C++ software; Java is a piece of cake.
We have built the new war from the source version 4.10.3 and our preliminary tests have shown that our issue (replicas in recovery on high load) *is resolved*. We will continue to do more testing and confirm. Please note that the *patch is BUGGY*.

It removed the break statement within the while loop, because of which, whenever we sent a list of docs it would hang (API CloudSolrServer.add), but it would work if we sent one doc at a time. It took a while to figure out why that was happening. Once we put the break statement back it worked like a charm.

Furthermore the patch has solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java which should be solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java

Finally, checking if (!offer) is sufficient, rather than using if (offer == false). Last but not least, having a configurable queue size and timeouts (managed via solrconfig) would be quite helpful.
Thank you once again for your help.

Vijay

On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey wrote: > On 1/27/2015 2:52 PM, Vijay Sekhri wrote: > > Hi Shawn, > > Here is some update. We found the main issue > > We have configured our cluster to run under jetty and when we tried full > > indexing, we did not see the original Invalid Chunk error. However the > > replicas still went into recovery > > All this time we been trying to look into replicas logs to diagnose the > > issue. The problem seem to be at the leader side. When we looked into > > leader logs, we found the following on all the leaders > > > > 3439873 [qtp1314570047-92] WARN > > org.apache.solr.update.processor.DistributedUpdateProcessor – Error > > sending update > > *java.lang.IllegalStateException: Queue full* > > > > > There is a similar bug reported around this > > https://issues.apache.org/jira/browse/SOLR-5850 > > > > and it seem to be in OPEN status. Is there a way we can configure the > queue > > size and increase it ? or is there a version of solr that has this issue > > resolved already? > > Can you suggest where we go from here to resolve this ? We can repatch > the > > war file if that is what you would recommend . > > In the end our initial speculation about solr unable to handle so many > > update is correct. We do not see this issue when the update load is less. > > Are you in a position where you can try the patch attached to > SOLR-5850? You would need to get the source code for the version you're > on (or perhaps a newer 4.x version), patch it, and build Solr yourself. > If you have no experience building java packages from source, this might > prove to be difficult. > > Thanks, > Shawn > > -- * Vijay Sekhri *
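To illustrate the queue behavior being discussed: java.util.concurrent's bounded queues throw IllegalStateException("Queue full") from add(), while offer() with a timeout blocks briefly and fails gracefully. A minimal standalone sketch (not Solr's actual code; class and doc names are illustrative):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class QueueFullSketch {
        public static void main(String[] args) throws InterruptedException {
            // Tiny capacity so the demo fills up immediately.
            BlockingQueue<String> queue = new LinkedBlockingQueue<>(2);
            queue.add("doc1");
            queue.add("doc2");

            // queue.add("doc3") would now throw java.lang.IllegalStateException: Queue full.
            // offer() with a timeout waits briefly and returns false instead, so the
            // caller can back off or retry rather than fail the whole update.
            boolean accepted = queue.offer("doc3", 100, TimeUnit.MILLISECONDS);
            if (!accepted) { // terser than, and equivalent to, (accepted == false)
                System.out.println("Queue full; backing off instead of erroring");
            }
        }
    }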
Re: replicas goes in recovery mode right after update
On 1/27/2015 2:52 PM, Vijay Sekhri wrote: > Hi Shawn, > Here is some update. We found the main issue > We have configured our cluster to run under jetty and when we tried full > indexing, we did not see the original Invalid Chunk error. However the > replicas still went into recovery > All this time we been trying to look into replicas logs to diagnose the > issue. The problem seem to be at the leader side. When we looked into > leader logs, we found the following on all the leaders > > 3439873 [qtp1314570047-92] WARN > org.apache.solr.update.processor.DistributedUpdateProcessor – Error > sending update > *java.lang.IllegalStateException: Queue full* > There is a similar bug reported around this > https://issues.apache.org/jira/browse/SOLR-5850 > > and it seem to be in OPEN status. Is there a way we can configure the queue > size and increase it ? or is there a version of solr that has this issue > resolved already? > Can you suggest where we go from here to resolve this ? We can repatch the > war file if that is what you would recommend . > In the end our initial speculation about solr unable to handle so many > update is correct. We do not see this issue when the update load is less. Are you in a position where you can try the patch attached to SOLR-5850? You would need to get the source code for the version you're on (or perhaps a newer 4.x version), patch it, and build Solr yourself. If you have no experience building java packages from source, this might prove to be difficult. Thanks, Shawn
Re: replicas goes in recovery mode right after update
Hi Shawn,
Here is some update. We found the main issue. We have configured our cluster to run under jetty, and when we tried full indexing we did not see the original Invalid Chunk error. However the replicas still went into recovery. All this time we had been trying to look into the replicas' logs to diagnose the issue. The problem seems to be at the leader side. When we looked into the leader logs, we found the following on all the leaders:

3439873 [qtp1314570047-92] WARN org.apache.solr.update.processor.DistributedUpdateProcessor – Error sending update
*java.lang.IllegalStateException: Queue full*
at java.util.AbstractQueue.add(AbstractQueue.java:98)
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:182)
at org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

3439874 [qtp1314570047-92] INFO org.apache.solr.common.cloud.SolrZkClient – makePath: /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO org.apache.solr.cloud.ZkController – Wrote down to /collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO org.apache.solr.cloud.ZkController – Put replica core=search1_shard7_replica1 coreNodeName=core_node214 on XX:8580_solr into leader-initiated recovery.
3439880 [qtp1314570047-92] WARN org.apache.solr.cloud.ZkController – Leader is publishing core=search1_shard7_replica1 coreNodeName=core_node214 state=down on behalf of un-reachable replica http://XXX:YY/solr/search1_shard7_replica1/; forcePublishState? false
3439881 [zkCallback-2-thread-12] INFO org.apache.solr.cloud.DistributedQueue – LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
3439881 [qtp1314570047-92] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor – *Setting up to try to start recovery on replica http://XXX:YY/solr/search1_shard7_replica1/ after: java.lang.IllegalStateException: Queue full*
3439882 [qtp1314570047-92] INFO org.apache.solr.core.SolrCore – [search1_shard7_replica2] webapp=/solr path=/update params={wt=javabin&version=2} status=0 QTime=2608
3439882 [updateExecutor-1-thread-153] INFO org.apache.solr.cloud.LeaderInitiatedRecoveryThread – LeaderInitiatedRecoveryThread-search1_shard7_replica1 started running to send REQUESTRECOVERY command to http://XXX:/solr/search1_shard7_replica1/; will try for a max of 600 secs
3439882 [updateExecutor-1-thread-153] INFO org.apache.solr.cloud.LeaderInitiatedRecoveryThread – Asking core=search1_shard7_replica1 coreNodeName=core_node214 on http://X:/solr to recover
3439885 [OverseerStateUpdate-93213309377511456-X:_solr-n_002822] INFO org.apache.solr.cloud.Overseer

There is a similar bug reported around this: https://issues.apache.org/jira/browse/SOLR-5850 and it seems to be in OPEN status. Is there a way we can configure the queue size and increase it? Or is there a version of Solr that has this issue resolved already? Can you suggest where we go from here to resolve this? We can repatch the war file if that is what you would recommend. In the end our initial speculation, that Solr is unable to handle so many updates, is correct. We do not see this issue when the update load is less.
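For the client side, SolrJ's ConcurrentUpdateSolrServer does already take a queue size and thread count in its constructor; the queue that overflows in the log above is the one created server-side when the leader forwards updates to replicas, which is what SOLR-5850 covers. A sketch of the client-side knob (the URL is hypothetical and the sizes are illustrative):

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

    public class CussSizing {
        public static void main(String[] args) {
            // Buffer up to 20000 docs and drain them with 4 runner threads;
            // both numbers are illustrative, not recommendations.
            ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://solrhost:8983/solr/search1", 20000, 4);
            server.shutdown();
        }
    }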
Re: replicas goes in recovery mode right after update
On 1/26/2015 9:34 PM, Vijay Sekhri wrote: > Hi Shawn, Erick > So it turned out that once we increased our indexing rate to the original > full indexing rate the replicas went back into recovery no matter what the > zk timeout setting was. Initially we though that increasing the timeout is > helping but apparently not . We just decreased indexing rate and that > caused less replicas to go in recovery. Once we have our full indexing rate > almost all replicas went into recovery no matter what the zk timeout or the > ticktime setting were. We reverted back the ticktime to original 2 seconds > > So we investigated further and after checking the logs we found this > exception happening right before the recovery process is initiated. We > observed this on two different replicas that went into recovery. We are not > sure if this is a coincidence or a real problem . Notice we were also > putting some search query load while indexing to trigger the recovery > behavior > 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] > (http-/10.235.46.36:8580-32) > ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk > header* One possibility that my searches on that exception turned up is that this is some kind of a problem in the servlet container, and the information I can see suggests it may be a bug in JBoss, and the underlying cause is changes in newer releases of Java 7. Your stacktraces do seem to mention jboss classes, so that seems likely. The reason that we only recommend running under the Jetty that comes with Solr, which has a tuned config, is because that's the only servlet container that actually gets tested. https://bugzilla.redhat.com/show_bug.cgi?id=1104273 https://bugzilla.redhat.com/show_bug.cgi?id=1154028 I can't really verify any other possibility. Thanks, Shawn
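For background on what a "chunk header" is here: with Transfer-Encoding: chunked, the request body is framed as chunks, each preceded by a hex length line, and JBWEB002020 means the container could not parse one of those length lines. A schematic example of the framing (not a captured request; host, path and payload are made up):

    POST /solr/search1/update HTTP/1.1
    Host: solrhost:8580
    Transfer-Encoding: chunked

    4          <- chunk header: 4 bytes of data follow (hex)
    Wiki
    5
    pedia
    0          <- zero-length chunk terminates the body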
Re: replicas goes in recovery mode right after update
Hi Shawn, Erick,
From another replica, right after the same error, it seems the leader initiates the recovery of the replicas. This one has a bit different log information than the other one that went into recovery. I am not sure if this helps in diagnosing.

Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
at org.apache.coyote.Request.doRead(Request.java:438)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
... 31 more

21:55:07,678 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.43.57:8680-32) It has been requested that we recover: core=search1_shard4_replica13
21:55:07,678 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.43.57:8680-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard4_replica13&wt=javabin&version=2} status=0 QTime=0
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) publishing core=search1_shard4_replica13 state=recovering collection=search1
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) numShards not found on descriptor - reading it from system property
21:55:07,681 INFO [org.apache.solr.cloud.ZkController] (Thread-443) Wrote recovering to /collections/search1/leader_initiated_recovery/shard4/core_node192

On Mon, Jan 26, 2015 at 10:34 PM, Vijay Sekhri wrote: > Hi Shawn, Erick > So it turned out that once we increased our indexing rate to the original > full indexing rate the replicas went back into recovery no matter what the > zk timeout setting was. Initially we though that increasing the timeout is > helping but apparently not . We just decreased indexing rate and that > caused less replicas to go in recovery. Once we have our full indexing rate > almost all replicas went into recovery no matter what the zk timeout or the > ticktime setting were. We reverted back the ticktime to original 2 seconds > > So we investigated further and after checking the logs we found this > exception happening right before the recovery process is initiated. We > observed this on two different replicas that went into recovery. We are not > sure if this is a coincidence or a real problem . Notice we were also > putting some search query load while indexing to trigger the recovery > behavior > > 22:00:32,493 INFO [org.apache.solr.cloud.RecoveryStrategy] > (rRecoveryThread) Finished recovery process. core=search1_shard5_replica2 > 22:00:32,503 INFO [org.apache.solr.common.cloud.ZkStateReader] > (zkCallback-2-thread-66) A cluster state change: WatchedEvent > state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has > occurred - updating... 
(live nodes size: 22) > 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] > (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger > flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600 > 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] > (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread > state has 12530488 bytes; docInRAM=2051 > 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] > (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread > state has 12984633 bytes; docInRAM=2205 > > > 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] > (http-/10.235.46.36:8580-32) > ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk > header* > at > org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351) > at > org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422) > at > org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373) > at > org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193) > at > org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80) > at > org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89) > at > org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192) > at > org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111) > at > org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173) > at > org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106) > at > org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58) > at > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99) > at > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.j
Re: replicas goes in recovery mode right after update
Hi Shawn, Erick,
So it turned out that once we increased our indexing rate to the original full indexing rate, the replicas went back into recovery no matter what the zk timeout setting was. Initially we thought that increasing the timeout was helping, but apparently not; we had just decreased the indexing rate, and that caused fewer replicas to go into recovery. Once we were at our full indexing rate, almost all replicas went into recovery no matter what the zk timeout or the tickTime settings were. We reverted the tickTime to the original 2 seconds.

So we investigated further, and after checking the logs we found this exception happening right before the recovery process is initiated. We observed this on two different replicas that went into recovery. We are not sure if this is a coincidence or a real problem. Notice we were also putting some search query load on while indexing, to trigger the recovery behavior.

22:00:32,493 INFO [org.apache.solr.cloud.RecoveryStrategy] (rRecoveryThread) Finished recovery process. core=search1_shard5_replica2
22:00:32,503 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12530488 bytes; docInRAM=2051
22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12984633 bytes; docInRAM=2205

22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32) ClientAbortException: *java.io.IOException: JBWEB002020: Invalid chunk header*
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
Re: replicas goes in recovery mode right after update
On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
> Hi Erick,
> In the solr.xml file I had the zk timeout set to <int name="zkClientTimeout">${zkClientTimeout:450000}</int>.
> One thing that made it a bit better now is the zk tickTime and syncLimit settings. I set them to higher values as below. This may not be advisable though.
>
> tickTime=30000
> initLimit=30
> syncLimit=20
>
> Now we observed that replicas do not go in recovery that often as before. In the whole cluster at a given time I would have a couple of replicas in recovery, whereas earlier it was multiple replicas from every shard.
> On the wiki https://wiki.apache.org/solr/SolrCloud it says "The maximum is 20 times the tickTime." in the FAQ, so I decided to increase the tick time. Is this the correct approach?

The default zkClientTimeout on recent Solr versions is 30 seconds, up from 15 in slightly older releases. Those values of 15 or 30 seconds are a REALLY long time in computer terms, and if you are exceeding that timeout on a regular basis, something is VERY wrong with your Solr install. Rather than take steps to increase your timeout beyond the normal maximum of 40 seconds (20 times a tickTime of 2 seconds), figure out why you're exceeding that timeout and fix the performance problem.

The zkClientTimeout value that you have set, 450 seconds, is seven and a half *MINUTES*. Nothing in Solr should ever take that long. "Not enough memory in the server" is by far the most common culprit for performance issues. Garbage collection pauses are a close second.

I don't actually know this next part for sure, because I've never looked into the code, but I believe that increasing the tickTime, especially to a value 15 times higher than default, might make all zookeeper operations a lot slower.

Thanks,
Shawn
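To make the arithmetic concrete: ZooKeeper negotiates each session timeout into the window [2 x tickTime, 20 x tickTime], so a client-requested timeout above the ceiling is silently capped. With the stock sample values below (ZooKeeper's shipped defaults) the ceiling is 20 x 2000 ms = 40 s, which is why a 450000 ms zkClientTimeout cannot take effect without also raising tickTime:

    # zoo.cfg -- ZooKeeper's stock sample values
    tickTime=2000      # ms; session timeouts land in [2*tickTime, 20*tickTime]
    initLimit=10
    syncLimit=5

    <!-- solr.xml: keep the requested timeout inside ZooKeeper's window -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>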
Re: replicas goes in recovery mode right after update
Personally, I never really set maxDocs for autocommit, I just leave things time-based. That said, your settings are so high that this shouldn't matter in the least. There's nothing in the log fragments you posted that's the proverbial "smoking gun". There's nothing here that tells me _why_ the node went into recovery in the first place.

bq: Now we observed that replicas do not go in recovery that often as before.

Hmmm, that at least seems to be pointing us in the direction of ZK not being able to see the server, marking it as down so it goes into recovery. Why this is happening is still a mystery to me though. I'd expect the Solr logs to show "connection timeout", a stack trace or some such. _Some_ indication of why the node thinks it's out of sync. Sorry, I'm a bit clueless here.

Erick

On Mon, Jan 26, 2015 at 1:34 PM, Vijay Sekhri wrote:
> Hi Erick,
> In the solr.xml file I had the zk timeout set to <int name="zkClientTimeout">${zkClientTimeout:450000}</int>.
> One thing that made it a bit better now is the zk tickTime and syncLimit settings. I set them to higher values as below. This may not be advisable though.
>
> tickTime=30000
> initLimit=30
> syncLimit=20
>
> Now we observed that replicas do not go in recovery that often as before. In the whole cluster at a given time I would have a couple of replicas in recovery, whereas earlier it was multiple replicas from every shard.
> On the wiki https://wiki.apache.org/solr/SolrCloud it says "The maximum is 20 times the tickTime." in the FAQ, so I decided to increase the tick time. Is this the correct approach?
>
> One question I have is whether the auto commit settings have anything to do with this or not. Does it induce extra work for the searchers, because of which this would happen? I have tried the following settings:
>
> <autoSoftCommit>
>   <maxDocs>500000</maxDocs>
>   <maxTime>900000</maxTime>
> </autoSoftCommit>
> <autoCommit>
>   <maxDocs>200000</maxDocs>
>   <maxTime>30000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> I have increased the heap size to 15GB for each JVM instance. I monitored how the heap usage looks during full indexing and it never goes beyond 8 GB. I don't see any Full GC happening at any point. I had some attached screenshots but they were marked as spam, so I am not sending them again.
>
> Our rate is a variable rate. It is not a sustained rate of 6000/second; however there are intervals where it would reach that much, come down, grow again and come down. So if I took an average it would be only 600/second, but that is not the real rate at any given time.
> The version of Solr Cloud is 4.10. All indexers are basically Java programs running on different hosts using the CloudSolrServer API.
> As I mentioned, it is much better now than before, however not completely as expected. We would want none of them to go in recovery if there really is no need.
>
> I captured some logs before and after recovery:
>
> 4:13:54,298 INFO [org.apache.solr.handler.SnapPuller] (RecoveryThread) New index installed. Updating index properties... index=index.20150126140904697
> 14:13:54,301 INFO [org.apache.solr.handler.SnapPuller] (RecoveryThread) removing old index directory NRTCachingDirectory(MMapDirectory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417 lockFactory=NativeFSLockFactory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417; maxCacheMB=48.0 maxMergeSizeMB=4.0)
> 14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] (RecoveryThread) Creating new IndexWriter... 
> 14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] > (RecoveryThread) Waiting until IndexWriter is unused... > core=search1_shard7_replica4 > 14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] > (RecoveryThread) Rollback old IndexWriter... core=search1_shard7_replica4 > 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [IW][RecoveryThread]: rollback > 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [IW][RecoveryThread]: all running merges have aborted > 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [IW][RecoveryThread]: rollback: done finish merges > 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [DW][RecoveryThread]: abort > 14:13:54,303 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [DW][RecoveryThread]: done abort; abortedFiles=[] > success=true > 14:13:54,306 INFO [org.apache.solr.update.LoggingInfoStream] > (RecoveryThread) [IW][RecoveryThread]: rollback: > infos=_4qe(4.10.0):C4312879/1370002:delGen=56 > _554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15 > _5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11 > _5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6 > _5kb(4.10.0):C424868/19148:delGen=11 _5f5(4.10.0):C116528/42495:delGen=1 >
Re: replicas goes in recovery mode right after update
Hi Erick,
In the solr.xml file I had the zk timeout set to <int name="zkClientTimeout">${zkClientTimeout:450000}</int>.
One thing that made it a bit better now is the zk tickTime and syncLimit settings. I set them to higher values as below. This may not be advisable though.

tickTime=30000
initLimit=30
syncLimit=20

Now we observed that replicas do not go in recovery that often as before. In the whole cluster at a given time I would have a couple of replicas in recovery, whereas earlier it was multiple replicas from every shard.
On the wiki https://wiki.apache.org/solr/SolrCloud it says "The maximum is 20 times the tickTime." in the FAQ, so I decided to increase the tick time. Is this the correct approach?

One question I have is whether the auto commit settings have anything to do with this or not. Does it induce extra work for the searchers, because of which this would happen? I have tried the following settings:

<autoSoftCommit>
  <maxDocs>500000</maxDocs>
  <maxTime>900000</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxDocs>200000</maxDocs>
  <maxTime>30000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

I have increased the heap size to 15GB for each JVM instance. I monitored how the heap usage looks during full indexing and it never goes beyond 8 GB. I don't see any Full GC happening at any point. I had some attached screenshots but they were marked as spam, so I am not sending them again.

Our rate is a variable rate. It is not a sustained rate of 6000/second; however there are intervals where it would reach that much, come down, grow again and come down. So if I took an average it would be only 600/second, but that is not the real rate at any given time.
The version of Solr Cloud is 4.10. All indexers are basically Java programs running on different hosts using the CloudSolrServer API.
As I mentioned, it is much better now than before, however not completely as expected. We would want none of them to go in recovery if there really is no need.

I captured some logs before and after recovery:

4:13:54,298 INFO [org.apache.solr.handler.SnapPuller] (RecoveryThread) New index installed. Updating index properties... index=index.20150126140904697
14:13:54,301 INFO [org.apache.solr.handler.SnapPuller] (RecoveryThread) removing old index directory NRTCachingDirectory(MMapDirectory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417 lockFactory=NativeFSLockFactory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417; maxCacheMB=48.0 maxMergeSizeMB=4.0)
14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] (RecoveryThread) Creating new IndexWriter...
14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] (RecoveryThread) Waiting until IndexWriter is unused... core=search1_shard7_replica4
14:13:54,302 INFO [org.apache.solr.update.DefaultSolrCoreState] (RecoveryThread) Rollback old IndexWriter...
core=search1_shard7_replica4 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [IW][RecoveryThread]: rollback 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [IW][RecoveryThread]: all running merges have aborted 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [IW][RecoveryThread]: rollback: done finish merges 14:13:54,302 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [DW][RecoveryThread]: abort 14:13:54,303 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [DW][RecoveryThread]: done abort; abortedFiles=[] success=true 14:13:54,306 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [IW][RecoveryThread]: rollback: infos=_4qe(4.10.0):C4312879/1370002:delGen=56 _554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15 _5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11 _5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6 _5kb(4.10.0):C424868/19148:delGen=11 _5f5(4.10.0):C116528/42495:delGen=1 _5nx(4.10.0):C33236/20668:delGen=1 _5ql(4.10.0):C25924/2:delGen=2 _5o8(4.10.0):C27155/7531:delGen=1 _5of(4.10.0):C38545/5677:delGen=1 _5p7(4.10.0):C37457/648:delGen=1 _5r5(4.10.0):C4260 _5qv(4.10.0):C1750 _5qi(4.10.0):C842 _5qp(4.10.0):C2247 _5qm(4.10.0):C2214 _5qo(4.10.0):C1785 _5qn(4.10.0):C1962 _5qu(4.10.0):C2390 _5qy(4.10.0):C2129 _5qx(4.10.0):C2192 _5qw(4.10.0):C2157/1:delGen=1 _5r6(4.10.0):C159 _5r4(4.10.0):C742 _5r8(4.10.0):C334 _5r7(4.10.0):C390 _5r3(4.10.0):C1122 14:13:54,306 INFO [org.apache.solr.update.LoggingInfoStream] (RecoveryThread) [IFD][RecoveryThread]: now checkpoint "_4qe(4.10.0):C4312879/1370002:delGen=56 _554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15 _5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11 _5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6 _5kb(4.10.0):C424868/19148:delGen=11 _5f5(4.10.0):C116528/42495:delGen=1 _5nx(4.10.0):C33236/20668:delGen=1 _5ql(4.10.0):C25924/2:delGen=2 _5o8(4.10.0):C27155/7531:delGen=1 _5of(4.10.0):C38545/5677:
Re: replicas goes in recovery mode right after update
Ah, OK. Whew! because I was wondering how you were running at _all_ if all the memory was allocated to the JVM ;).. What is your Zookeeper timeout? The original default was 15 seconds and this has caused problems like this. Here's the scenario: You send a bunch of docs at the server, and eventually you hit a stop-the-world GC that takes longer than the Zookeeper timeout. So ZK thinks the node is down and initiates recovery. Eventually, you hit this on all the replicas. Sometimes I've seen situations where the answer is giving a bit more memory to the JVM, say 2-4G in your case. The theory here (and this is a shot in the dark) that your peak JVM requirements are close to your 12G, so the garbage collection spends enormous amounts of time collecting a small bit of memory, runs for some fraction of a second and does it again. Adding more to the JVMs memory allows the parallel collections to work without so many stop-the-world GC pauses. So what I'd do is turn on GC logging (probably on the replicas) and look for very long GC pauses. Mark Miller put together a blog here: https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ See the "getting a view into garbage collection". The smoking gun here is if you see full GC pauses that are longer than the ZK timeout. 90M docs in 4 hours across 10 shards is only 625/sec or so per shard. I've seen sustained indexing rates significantly above this, YMMV or course, a lot depends on the size of the docs. What version of Solr BTW? And when you say you fire a bunch of indexers, I'm assuming these are SolrJ clients and use CloudSolrServer? Best, Erick On Sun, Jan 25, 2015 at 4:10 PM, Vijay Sekhri wrote: > Thank you for the reply Eric. > I am sorry I had wrong information posted. I posted our DEV env > configuration by mistake. > After double checking our stress and Prod Beta env where we have found the > original issue, I found all the searchers have around 50 GB of RAM > available and two instances of JVM running (2 different ports). Both > instances have 12 GB allocated. The rest 26 GB is available for the OS. 1st > instance on a host has search1 collection (live collection) and the 2nd > instance on the same host has search2 collection (for full indexing ). > > There is plenty room for OS related tasks. Our issue is not in anyway > related to OS starving as shown from our dashboards. > We have been through > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ > a lot of times but we have two modes of operation > a) 1st collection (Live traffic) - heavy searches and medium indexing > b) 2nd collection (Not serving traffic) - very heavy indexing, no searches > > When our indexing finishes we swap the alias for these collection . So > essentially we need to have a configuration that can support both the use > cases together. We have tried a lot of different configuration options and > none of them seems to work. My suspicion is that solr cloud is unable to > keep up with the updates at the rate we are sending while it is trying to > be consistent with all the replicas. > > > On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson > wrote: > > > Shawn directed you over here to the user list, but I see this note on > > SOLR-7030: > > "All our searchers have 12 GB of RAM available and have quad core > Intel(R) > > Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e > > jboss and solr in it . All 12 GB is available as heap for the java > > process..." 
> > So you have 12G physical memory and have allocated 12G to the Java process? This is an anti-pattern. If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery. Where it spends a lot of time doing... essentially nothing.
> >
> > About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do; you should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?).
> >
> > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> >
> > Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems, you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM, counterintuitive as it seems; try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs.
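Erick's suggestion above to turn on GC logging can be done with the stock HotSpot options; the set below is an illustrative Java 7 era combination (the log path is hypothetical), and the thing to look for in the output is stop-the-world pauses longer than the ZooKeeper session timeout:

    -verbose:gc
    -Xloggc:/var/log/solr/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime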
Re: replicas goes in recovery mode right after update
Thank you for the reply Eric. I am sorry I had wrong information posted. I posted our DEV env configuration by mistake. After double checking our stress and Prod Beta env where we have found the original issue, I found all the searchers have around 50 GB of RAM available and two instances of JVM running (2 different ports). Both instances have 12 GB allocated. The rest 26 GB is available for the OS. 1st instance on a host has search1 collection (live collection) and the 2nd instance on the same host has search2 collection (for full indexing ). There is plenty room for OS related tasks. Our issue is not in anyway related to OS starving as shown from our dashboards. We have been through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ a lot of times but we have two modes of operation a) 1st collection (Live traffic) - heavy searches and medium indexing b) 2nd collection (Not serving traffic) - very heavy indexing, no searches When our indexing finishes we swap the alias for these collection . So essentially we need to have a configuration that can support both the use cases together. We have tried a lot of different configuration options and none of them seems to work. My suspicion is that solr cloud is unable to keep up with the updates at the rate we are sending while it is trying to be consistent with all the replicas. On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson wrote: > Shawn directed you over here to the user list, but I see this note on > SOLR-7030: > "All our searchers have 12 GB of RAM available and have quad core Intel(R) > Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e > jboss and solr in it . All 12 GB is available as heap for the java > process..." > > So you have 12G physical memory and have allocated 12G to the Java process? > This is an anti-pattern. If that's > the case, your operating system is being starved for memory, probably > hitting a state where it spends all of its > time in stop-the-world garbage collection, eventually it doesn't respond to > Zookeeper's ping so Zookeeper > thinks the node is down and puts it into recovery. Where it spends a lot of > time doing... essentially nothing. > > About the hard and soft commits: I suspect these are entirely unrelated, > but here's a blog on what they do, you > should pick the configuration that supports your use case (i.e. how much > latency can you stand between indexing > and being able to search?). > > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ > > Here's one very good reason you shouldn't starve your op system by > allocating all the physical memory to the JVM: > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html > > > But your biggest problem is that you have far too much of your physical > memory allocated to the JVM. This > will cause you endless problems, you just need more physical memory on > those boxes. It's _possible_ you could > get by with less memory for the JVM, counterintuitive as it seems try 8G or > maybe even 6G. At some point > you'll hit OOM errors, but that'll give you a lower limit on what the JVM > needs. > > Unless I've mis-interpreted what you've written, though, I doubt you'll get > stable with that much memory allocated > to the JVM. > > Best, > Erick > > > > On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri > wrote: > > > We have a cluster of solr cloud server with 10 shards and 4 replicas in > > each shard in our stress environment. 
In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:
> >
> > <autoSoftCommit>
> >   <maxDocs>500000</maxDocs>
> >   <maxTime>180000</maxTime>
> > </autoSoftCommit>
> > <autoCommit>
> >   <maxDocs>2000000</maxDocs>
> >   <maxTime>180000</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > We indexed roughly 90 Million docs. We have two different ways to index documents: a) Full indexing. It takes 4 hours to index 90 Million docs and the rate of docs coming to the searcher is around 6000 per second. b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes and the rate of docs coming to the searchers is 2500 per second.
> >
> > We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing we do it in the search1 collection which is serving live traffic.
> >
> > All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8 - 10 GB.
Re: replicas goes in recovery mode right after update
Shawn directed you over here to the user list, but I see this note on SOLR-7030:
"All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it . All 12 GB is available as heap for the java process..."

So you have 12G physical memory and have allocated 12G to the Java process? This is an anti-pattern. If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery. Where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do; you should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?).

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems, you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM, counterintuitive as it seems; try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs.

Unless I've mis-interpreted what you've written, though, I doubt you'll get stable with that much memory allocated to the JVM.

Best,
Erick

On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri wrote:
> We have a cluster of SolrCloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:
>
> <autoSoftCommit>
>   <maxDocs>500000</maxDocs>
>   <maxTime>180000</maxTime>
> </autoSoftCommit>
> <autoCommit>
>   <maxDocs>2000000</maxDocs>
>   <maxTime>180000</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> We indexed roughly 90 Million docs. We have two different ways to index documents: a) Full indexing. It takes 4 hours to index 90 Million docs and the rate of docs coming to the searcher is around 6000 per second. b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes and the rate of docs coming to the searchers is 2500 per second.
>
> We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing we do it in the search1 collection which is serving live traffic.
>
> All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8 - 10 GB. All searchers have a final index size of 9 GB, so in total there are 9x10 (shards) = 90GB worth of index files.
>
> We have observed the following issue when we trigger indexing.
> In about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes more and more replicas start going into recovery mode. After about half an hour all replicas except the leader are in recovery mode. We cannot throttle the indexing load as that will increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes.
>
> We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during our incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed, however the cluster still goes into recovery.
>
> If we leave the cluster as it is, when the indexing finishes it eventually recovers after a while. As it is serving live traffic we cannot have these replicas go into recovery mode, because it also degrades the search performance, as our tests have shown.
>
> We have tried different commit settings like the below:
>
> a) No auto soft commit, no auto hard commit and a commit triggered at the end of indexing
> b) No auto soft commit, yes auto hard commit and a commit at the end of indexing
> c) Yes auto soft commit, no auto hard commit
> d) Yes auto soft commit, yes auto hard commit
> e) Different frequency settings for the commits above.
replicas goes in recovery mode right after update
We have a cluster of SolrCloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

<autoSoftCommit>
  <maxDocs>500000</maxDocs>
  <maxTime>180000</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxDocs>2000000</maxDocs>
  <maxTime>180000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

We indexed roughly 90 Million docs. We have two different ways to index documents: a) Full indexing. It takes 4 hours to index 90 Million docs and the rate of docs coming to the searcher is around 6000 per second. b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing we do it in the search1 collection which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8 - 10 GB. All searchers have a final index size of 9 GB, so in total there are 9x10 (shards) = 90GB worth of index files.

We have observed the following issue when we trigger indexing. In about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes more and more replicas start going into recovery mode. After about half an hour all replicas except the leader are in recovery mode. We cannot throttle the indexing load as that will increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes.

We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during our incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed, however the cluster still goes into recovery.

If we leave the cluster as it is, when the indexing finishes it eventually recovers after a while. As it is serving live traffic we cannot have these replicas go into recovery mode, because it also degrades the search performance, as our tests have shown.

We have tried different commit settings like the below:

a) No auto soft commit, no auto hard commit and a commit triggered at the end of indexing
b) No auto soft commit, yes auto hard commit and a commit at the end of indexing
c) Yes auto soft commit, no auto hard commit
d) Yes auto soft commit, yes auto hard commit
e) Different frequency settings for the commits above. Please NOTE that we have tried a 15 minute soft commit with a 30 minute hard commit setting; the same time settings for both; and a 30 minute soft commit with an hour hard commit setting.

Unfortunately all of the above yield the same behavior. The replicas still go into recovery.

We have increased the zookeeper timeout from 30 seconds to 5 minutes and the problem persists. Is there any setting that would fix this issue?

--
*
Vijay Sekhri
*
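For reference, the alias swap described above is typically done with the Collections API CREATEALIAS action, which repoints an existing alias in one call; the host and alias names here are hypothetical:

    # Point the live alias "search" at the freshly built collection.
    http://solrhost:8983/solr/admin/collections?action=CREATEALIAS&name=search&collections=search2

Re-running the same call with collections=search1 after the next full indexing run swaps traffic back, since CREATEALIAS overwrites an alias that already exists.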