Re: replicas goes in recovery mode right after update

2015-01-29 Thread Vijay Sekhri
Hi Erick,

@ichattopadhyaya beat me to it yesterday. So we are good
-cheers
Vijay

On Wed, Jan 28, 2015 at 1:30 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Vijay:

 Thanks for reporting this back!  Could I ask you to post a new patch with
 your correction? Please use the same patch name
 (SOLR-5850.patch), and include a note about what you found (I've already
 added a comment).

 Thanks!
 Erick

 On Wed, Jan 28, 2015 at 9:18 AM, Vijay Sekhri sekhrivi...@gmail.com
 wrote:

  Hi Shawn,
  Thank you so much for the assistance. Building is not a problem; back in
  the day I worked with linking, compiling, and building C and C++
  software, so Java is a piece of cake.
  We have built the new war from source version 4.10.3, and our
  preliminary tests show that our issue (replicas in recovery on high
  load) *is resolved*. We will continue to do more testing and confirm.
  Please note that the *patch is BUGGY*.

  It removed the break statement within the while loop; because of that,
  whenever we send a list of docs the call hangs (CloudSolrServer.add API),
  although it works if we send one doc at a time.

  It took a while to figure out why that was happening. Once we put the
  break statement back it worked like a charm.
  Furthermore, the patch has

  solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java
  which should be

  solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java

  Finally, checking if (!offer) is sufficient, rather than using if (offer == false).
  Last but not least, having a configurable queue size and timeouts
  (managed via solrconfig) would be quite helpful.
  Thank you once again for your help.

  Vijay
 
  On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey apa...@elyograg.org
 wrote:
 
    On 1/27/2015 2:52 PM, Vijay Sekhri wrote:
     Hi Shawn,
     Here is an update: we found the main issue.
     We have configured our cluster to run under Jetty, and when we tried
     full indexing, we did not see the original Invalid Chunk error.
     However, the replicas still went into recovery.
     All this time we had been trying to look into the replica logs to
     diagnose the issue, but the problem seems to be on the leader side.
     When we looked into the leader logs, we found the following on all
     the leaders:

     3439873 [qtp1314570047-92] WARN
      org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
     sending update
     *java.lang.IllegalStateException: Queue full*

    snip

     There is a similar bug reported around this:
     https://issues.apache.org/jira/browse/SOLR-5850

     and it seems to be in OPEN status. Is there a way we can configure
     the queue size and increase it? Or is there a version of Solr that
     has this issue resolved already?
     Can you suggest where we go from here to resolve this? We can
     re-patch the war file if that is what you would recommend.
     In the end, our initial speculation that Solr is unable to handle so
     many updates is correct. We do not see this issue when the update
     load is lower.
  
   Are you in a position where you can try the patch attached to
   SOLR-5850?  You would need to get the source code for the version
 you're
   on (or perhaps a newer 4.x version), patch it, and build Solr yourself.
   If you have no experience building java packages from source, this
 might
   prove to be difficult.
  
   Thanks,
   Shawn
  
  
 
 
  --
  *
  Vijay Sekhri
  *
 




-- 
*
Vijay Sekhri
*


Re: replicas goes in recovery mode right after update

2015-01-28 Thread Vijay Sekhri
Hi Shawn,
Thank you so much for the assistance. Building is not a problem; back in
the day I worked with linking, compiling, and building C and C++ software,
so Java is a piece of cake.
We have built the new war from source version 4.10.3, and our preliminary
tests show that our issue (replicas in recovery on high load) *is
resolved*. We will continue to do more testing and confirm. Please note
that the *patch is BUGGY*.

It removed the break statement within the while loop; because of that,
whenever we send a list of docs the call hangs (CloudSolrServer.add API),
although it works if we send one doc at a time.

It took a while to figure out why that was happening. Once we put the break
statement back it worked like a charm.
Furthermore, the patch has
solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java
which should be
solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java

Finally, checking if (!offer) is sufficient, rather than using if (offer == false).
Last but not least, having a configurable queue size and timeouts
(managed via solrconfig) would be quite helpful.
Thank you once again for your help.

Vijay
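
For anyone following along, here is a minimal standalone sketch (class and
variable names are illustrative; this is not the actual patch or
ConcurrentUpdateSolrServer code) of the behavior being discussed:
BlockingQueue.add() throws the "java.lang.IllegalStateException: Queue full"
seen in the leader logs when a bounded queue is at capacity, while offer()
with a timeout waits for space and returns false on timeout, which the
caller can handle with a simple if (!offer) check.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueFullSketch {
    public static void main(String[] args) throws InterruptedException {
        // Small bounded queue so the behavior is easy to reproduce.
        BlockingQueue<String> queue = new ArrayBlockingQueue<String>(2);
        queue.add("update-1");
        queue.add("update-2");

        // add() on a full bounded queue throws IllegalStateException: Queue full.
        try {
            queue.add("update-3");
        } catch (IllegalStateException e) {
            System.out.println("add() failed: " + e);
        }

        // offer() with a timeout blocks until space is available or the
        // timeout elapses, then returns false instead of throwing.
        boolean offer = queue.offer("update-3", 100, TimeUnit.MILLISECONDS);
        if (!offer) {
            System.out.println("offer() timed out; queue still full");
        }
    }
}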

On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 1/27/2015 2:52 PM, Vijay Sekhri wrote:
  Hi Shawn,
  Here is an update: we found the main issue.
  We have configured our cluster to run under Jetty, and when we tried full
  indexing, we did not see the original Invalid Chunk error. However, the
  replicas still went into recovery.
  All this time we had been trying to look into the replica logs to diagnose
  the issue, but the problem seems to be on the leader side. When we looked
  into the leader logs, we found the following on all the leaders:

  3439873 [qtp1314570047-92] WARN
   org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
  sending update
  *java.lang.IllegalStateException: Queue full*

 snip

  There is a similar bug reported around this:
  https://issues.apache.org/jira/browse/SOLR-5850

  and it seems to be in OPEN status. Is there a way we can configure the
  queue size and increase it? Or is there a version of Solr that has this
  issue resolved already?
  Can you suggest where we go from here to resolve this? We can re-patch the
  war file if that is what you would recommend.
  In the end, our initial speculation that Solr is unable to handle so many
  updates is correct. We do not see this issue when the update load is lower.

 Are you in a position where you can try the patch attached to
 SOLR-5850?  You would need to get the source code for the version you're
 on (or perhaps a newer 4.x version), patch it, and build Solr yourself.
 If you have no experience building java packages from source, this might
 prove to be difficult.

 Thanks,
 Shawn




-- 
*
Vijay Sekhri
*


Re: replicas goes in recovery mode right after update

2015-01-28 Thread Erick Erickson
Vijay:

Thanks for reporting this back!  Could I ask you to post a new patch with
your correction? Please use the same patch name
(SOLR-5850.patch), and include a note about what you found (I've already
added a comment).

Thanks!
Erick

On Wed, Jan 28, 2015 at 9:18 AM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 Hi Shawn,
 Thank you so much for the assistance. Building is not a problem; back in
 the day I worked with linking, compiling, and building C and C++ software,
 so Java is a piece of cake.
 We have built the new war from source version 4.10.3, and our preliminary
 tests show that our issue (replicas in recovery on high load) *is
 resolved*. We will continue to do more testing and confirm. Please note
 that the *patch is BUGGY*.

 It removed the break statement within the while loop; because of that,
 whenever we send a list of docs the call hangs (CloudSolrServer.add API),
 although it works if we send one doc at a time.

 It took a while to figure out why that was happening. Once we put the break
 statement back it worked like a charm.
 Furthermore, the patch has

 solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrClient.java
 which should be

 solr/solrj/src/java/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.java

 Finally, checking if (!offer) is sufficient, rather than using if (offer == false).
 Last but not least, having a configurable queue size and timeouts
 (managed via solrconfig) would be quite helpful.
 Thank you once again for your help.

 Vijay

 On Tue, Jan 27, 2015 at 6:20 PM, Shawn Heisey apa...@elyograg.org wrote:

  On 1/27/2015 2:52 PM, Vijay Sekhri wrote:
   Hi Shawn,
   Here is an update: we found the main issue.
   We have configured our cluster to run under Jetty, and when we tried
   full indexing, we did not see the original Invalid Chunk error. However,
   the replicas still went into recovery.
   All this time we had been trying to look into the replica logs to
   diagnose the issue, but the problem seems to be on the leader side. When
   we looked into the leader logs, we found the following on all the leaders:

   3439873 [qtp1314570047-92] WARN
    org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
   sending update
   *java.lang.IllegalStateException: Queue full*

  snip

   There is a similar bug reported around this:
   https://issues.apache.org/jira/browse/SOLR-5850

   and it seems to be in OPEN status. Is there a way we can configure the
   queue size and increase it? Or is there a version of Solr that has this
   issue resolved already?
   Can you suggest where we go from here to resolve this? We can re-patch
   the war file if that is what you would recommend.
   In the end, our initial speculation that Solr is unable to handle so many
   updates is correct. We do not see this issue when the update load is
   lower.
 
  Are you in a position where you can try the patch attached to
  SOLR-5850?  You would need to get the source code for the version you're
  on (or perhaps a newer 4.x version), patch it, and build Solr yourself.
  If you have no experience building java packages from source, this might
  prove to be difficult.
 
  Thanks,
  Shawn
 
 


 --
 *
 Vijay Sekhri
 *



Re: replicas goes in recovery mode right after update

2015-01-27 Thread Vijay Sekhri
Hi Shawn,
Here is an update: we found the main issue.
We have configured our cluster to run under Jetty, and when we tried full
indexing, we did not see the original Invalid Chunk error. However, the
replicas still went into recovery.
All this time we had been trying to look into the replica logs to diagnose
the issue, but the problem seems to be on the leader side. When we looked
into the leader logs, we found the following on all the leaders:

3439873 [qtp1314570047-92] WARN
 org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
sending update
*java.lang.IllegalStateException: Queue full*
at java.util.AbstractQueue.add(AbstractQueue.java:98)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner$1.writeTo(ConcurrentUpdateSolrServer.java:182)
at
org.apache.http.entity.EntityTemplate.writeTo(EntityTemplate.java:69)
at
org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:89)
at
org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
at
org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:117)
at
org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:265)
at
org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:203)
at
org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
at
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:121)
at
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486)
at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
3439874 [qtp1314570047-92] INFO  org.apache.solr.common.cloud.SolrZkClient
 – makePath:
/collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO  org.apache.solr.cloud.ZkController  –
Wrote down to
/collections/search1/leader_initiated_recovery/shard7/core_node214
3439879 [qtp1314570047-92] INFO  org.apache.solr.cloud.ZkController  – Put
replica core=search1_shard7_replica1 coreNodeName=core_node214 on
XX:8580_solr into leader-initiated
recovery.
3439880 [qtp1314570047-92] WARN  org.apache.solr.cloud.ZkController  –
Leader is publishing core=search1_shard7_replica1 coreNodeName
=core_node214 state=down on behalf of un-reachable replica
http://XXX:YY/solr/search1_shard7_replica1/;
forcePublishState? false
3439881 [zkCallback-2-thread-12] INFO
 org.apache.solr.cloud.DistributedQueue  – LatchChildWatcher fired on path:
/overseer/queue state: SyncConnected type NodeChildrenChanged
3439881 [qtp1314570047-92] ERROR
org.apache.solr.update.processor.DistributedUpdateProcessor  – *Setting up
to try to start recovery on replica
http://XXX:YY/solr/search1_shard7_replica1/
after: java.lang.IllegalStateException: Queue full*
3439882 [qtp1314570047-92] INFO  org.apache.solr.core.SolrCore  –
[search1_shard7_replica2] webapp=/solr path=/update
params={wt=javabin&version=2} status=0 QTime=2608
3439882 [updateExecutor-1-thread-153] INFO
 org.apache.solr.cloud.LeaderInitiatedRecoveryThread  –
LeaderInitiatedRecoveryThread-search1_shard7_replica1 started running to
send REQUESTRECOVERY command to
http://XXX:/solr/search1_shard7_replica1/;
will try for a max of 600 secs
3439882 [updateExecutor-1-thread-153] INFO
 org.apache.solr.cloud.LeaderInitiatedRecoveryThread  – Asking
core=search1_shard7_replica1 coreNodeName=core_node214 on
http://X:/solr to recover
3439885
[OverseerStateUpdate-93213309377511456-X:_solr-n_002822]
INFO  org.apache.solr.cloud.Overseer


There is a similar bug reported around this:
https://issues.apache.org/jira/browse/SOLR-5850

and it seems to be in OPEN status. Is there a way we can configure the queue
size and increase it? Or is there a version of Solr that has this issue
resolved already?
Can you suggest where we go from here to resolve this? We can re-patch the
war file if 

Re: replicas goes in recovery mode right after update

2015-01-27 Thread Shawn Heisey
On 1/27/2015 2:52 PM, Vijay Sekhri wrote:
 Hi Shawn,
 Here is an update: we found the main issue.
 We have configured our cluster to run under Jetty, and when we tried full
 indexing, we did not see the original Invalid Chunk error. However, the
 replicas still went into recovery.
 All this time we had been trying to look into the replica logs to diagnose
 the issue, but the problem seems to be on the leader side. When we looked
 into the leader logs, we found the following on all the leaders:

 3439873 [qtp1314570047-92] WARN
  org.apache.solr.update.processor.DistributedUpdateProcessor  – Error
 sending update
 *java.lang.IllegalStateException: Queue full*

snip

 There is a similar bug reported around this:
 https://issues.apache.org/jira/browse/SOLR-5850

 and it seems to be in OPEN status. Is there a way we can configure the queue
 size and increase it? Or is there a version of Solr that has this issue
 resolved already?
 Can you suggest where we go from here to resolve this? We can re-patch the
 war file if that is what you would recommend.
 In the end, our initial speculation that Solr is unable to handle so many
 updates is correct. We do not see this issue when the update load is lower.

Are you in a position where you can try the patch attached to
SOLR-5850?  You would need to get the source code for the version you're
on (or perhaps a newer 4.x version), patch it, and build Solr yourself. 
If you have no experience building java packages from source, this might
prove to be difficult.

Thanks,
Shawn



Re: replicas goes in recovery mode right after update

2015-01-26 Thread Vijay Sekhri
Hi Shawn, Erick,
So it turned out that once we increased our indexing rate to the original
full indexing rate, the replicas went back into recovery no matter what the
zk timeout setting was. Initially we thought that increasing the timeout was
helping, but apparently not. We just decreased the indexing rate and that
caused fewer replicas to go into recovery. Once we were at our full indexing
rate, almost all replicas went into recovery no matter what the zk timeout
or the tickTime settings were. We reverted the tickTime back to the original
2 seconds.

So we investigated further, and after checking the logs we found this
exception happening right before the recovery process is initiated. We
observed this on two different replicas that went into recovery. We are not
sure if this is a coincidence or a real problem. Note that we were also
putting some search query load on the cluster while indexing to trigger the
recovery behavior.

22:00:32,493 INFO  [org.apache.solr.cloud.RecoveryStrategy]
(rRecoveryThread) Finished recovery process. core=search1_shard5_replica2
22:00:32,503 INFO  [org.apache.solr.common.cloud.ZkStateReader]
(zkCallback-2-thread-66) A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
occurred - updating... (live nodes size: 22)
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger
flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state
has 12530488 bytes; docInRAM=2051
22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
(http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state
has 12984633 bytes; docInRAM=2205


22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32)
ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk
header*
at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
at
org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
at
org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
at
org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
at
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
at
org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
at
org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
at
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
at
org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
at
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
at
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
at
org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
  at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
at

Re: replicas goes in recovery mode right after update

2015-01-26 Thread Vijay Sekhri
Hi Shawn, Erick,
From another replica, right after the same error, it seems the leader
initiates the recovery of the replica. This one has slightly different log
information than the other one that went into recovery. I am not sure if
this helps in diagnosing.

Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
at
org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
at
org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
at
org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
at org.apache.coyote.Request.doRead(Request.java:438)
at
org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
... 31 more

21:55:07,678 INFO  [org.apache.solr.handler.admin.CoreAdminHandler]
(http-/10.235.43.57:8680-32) It has been requested that we recover:
core=search1_shard4_replica13
21:55:07,678 INFO  [org.apache.solr.servlet.SolrDispatchFilter]
(http-/10.235.43.57:8680-32) [admin] webapp=null path=/admin/cores
params={action=REQUESTRECOVERY&core=search1_shard4_replica13&wt=javabin&version=2}
status=0 QTime=0
21:55:07,678 INFO  [org.apache.solr.cloud.ZkController] (Thread-443)
publishing core=search1_shard4_replica13 state=recovering collection=search1
21:55:07,678 INFO  [org.apache.solr.cloud.ZkController] (Thread-443)
numShards not found on descriptor - reading it from system property
21:55:07,681 INFO  [org.apache.solr.cloud.ZkController] (Thread-443) Wrote
recovering to /collections/search1/leader_initiated_recovery
/shard4/core_node192
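
For reference, the REQUESTRECOVERY seen in that log line is a CoreAdmin API
action; the same request can be issued manually against a replica (host and
port below are placeholders):

http://host:8680/solr/admin/cores?action=REQUESTRECOVERY&core=search1_shard4_replica13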


On Mon, Jan 26, 2015 at 10:34 PM, Vijay Sekhri sekhrivi...@gmail.com
wrote:

 Hi Shawn, Erick,
 So it turned out that once we increased our indexing rate to the original
 full indexing rate, the replicas went back into recovery no matter what the
 zk timeout setting was. Initially we thought that increasing the timeout was
 helping, but apparently not. We just decreased the indexing rate and that
 caused fewer replicas to go into recovery. Once we were at our full indexing
 rate, almost all replicas went into recovery no matter what the zk timeout
 or the tickTime settings were. We reverted the tickTime back to the original
 2 seconds.

 So we investigated further, and after checking the logs we found this
 exception happening right before the recovery process is initiated. We
 observed this on two different replicas that went into recovery. We are not
 sure if this is a coincidence or a real problem. Note that we were also
 putting some search query load on the cluster while indexing to trigger the
 recovery behavior.

 22:00:32,493 INFO  [org.apache.solr.cloud.RecoveryStrategy]
 (rRecoveryThread) Finished recovery process. core=search1_shard5_replica2
 22:00:32,503 INFO  [org.apache.solr.common.cloud.ZkStateReader]
 (zkCallback-2-thread-66) A cluster state change: WatchedEvent
 state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has
 occurred - updating... (live nodes size: 22)
 22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
 (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger
 flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
 22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
 (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread
 state has 12530488 bytes; docInRAM=2051
 22:00:40,450 INFO  [org.apache.solr.update.LoggingInfoStream]
 (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread
 state has 12984633 bytes; docInRAM=2205


 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] 
 (http-/10.235.46.36:8580-32)
 ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk
 header*
 at
 org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
 at
 org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
 at
 org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
 at
 org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
 at
 org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
 at
 org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
 at
 org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
 at
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
 at
 org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
 at
 org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
 at
 org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
 at
 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
 at
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at
 

Re: replicas goes in recovery mode right after update

2015-01-26 Thread Shawn Heisey
On 1/26/2015 9:34 PM, Vijay Sekhri wrote:
 Hi Shawn, Erick
 So it turned out that once we increased our indexing rate to the original
 full indexing rate  the replicas went back into recovery no matter what the
 zk timeout setting was. Initially we though that increasing the timeout is
 helping but apparently not . We just decreased indexing rate and that
 caused less replicas to go in recovery. Once we have our full indexing rate
 almost all replicas went into recovery no matter what the zk timeout or the
 ticktime setting were. We reverted back the ticktime to original 2 seconds
 
 So we investigated further and after checking the logs we found this
 exception happening right before the recovery process is initiated. We
 observed this on two different replicas that went into recovery. We are not
 sure if this is a coincidence or a real problem . Notice we were also
 putting some search query load while indexing to trigger the recovery
 behavior

snip

 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] 
 (http-/10.235.46.36:8580-32)
 ClientAbortException: * java.io.IOException: JBWEB002020: Invalid chunk
 header*

One possibility that my searches on that exception turned up is a problem
in the servlet container; the information I can see suggests it may be a
bug in JBoss, with the underlying cause being changes in newer releases of
Java 7. Your stacktraces do seem to mention JBoss classes, so that seems
likely. The reason we only recommend running under the Jetty that comes
with Solr, which has a tuned config, is that it's the only servlet
container that actually gets tested.

https://bugzilla.redhat.com/show_bug.cgi?id=1104273
https://bugzilla.redhat.com/show_bug.cgi?id=1154028

I can't really verify any other possibility.

Thanks,
Shawn



Re: replicas goes in recovery mode right after update

2015-01-26 Thread Erick Erickson
Personally, I never really set maxDocs for autocommit, I just leave things
time-based. That said, your settings are so high that this shouldn't matter
in the least.

There's nothing in the log fragments you posted that's the proverbial
smoking gun. There's
nothing here that tells me _why_ the node went into recovery in the first
place.

bq: Now we observed that replicas do not go in recovery that often as
before.

Hmmm, that at least seems to be pointing us in the direction of ZK not
being able
to see the server, marking it as down so it goes into recovery.

Why this is happening is still a mystery to me though. I'd expect the Solr
logs to show
connection timeout, a stack trace or some such. _Some_ indication of why
the node
thinks it's out of sync.

Sorry, I'm a bit clueless here.

Erick



On Mon, Jan 26, 2015 at 1:34 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 Hi Erick,
 In the solr.xml file I had the zk timeout set to
 <int name="zkClientTimeout">${zkClientTimeout:45}</int>
 One thing that made it a bit better now is the zk tickTime and syncLimit
 settings. I set them to higher values as below. This may not be advisable
 though.

 tickTime=3
 initLimit=30
 syncLimit=20

 Now we observe that replicas do not go into recovery as often as before.
 In the whole cluster at a given time I would have a couple of replicas in
 recovery, whereas earlier it was multiple replicas from every shard.
 On the wiki https://wiki.apache.org/solr/SolrCloud the FAQ says "The
 maximum is 20 times the tickTime.", so I decided to increase the tickTime.
 Is this the correct approach?

 One question I have is whether the auto commit settings have anything to do
 with this or not. Do they induce extra work for the searchers because of
 which this would happen? I have tried the following settings:

 <autoSoftCommit>
   <maxDocs>50</maxDocs>
   <maxTime>90</maxTime>
 </autoSoftCommit>

 <autoCommit>
   <maxDocs>20</maxDocs>
   <maxTime>3</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>

 I have increased the heap size to 15 GB for each JVM instance. I monitored
 what the heap usage looks like during full indexing and it never goes
 beyond 8 GB. I don't see any full GC happening at any point.
 I had some screenshots attached but they were marked as spam, so I am not
 sending them again.

 Our rate is a variable rate. It is not a sustained rate of 6000/second;
 there are intervals where it would reach that much, come down, grow again,
 and come down. So if I took an average it would be only 600/second, but
 that is not the real rate at any given time.
 The version of Solr Cloud is 4.10. All the indexers are basically Java
 programs running on different hosts using the CloudSolrServer API.
 As I mentioned, it is much better now than before, however not completely
 as expected. We would want none of them to go into recovery if there is
 really no need.

 I captured some logs before and after recovery

  4:13:54,298 INFO  [org.apache.solr.handler.SnapPuller] (RecoveryThread)
 New index installed. Updating index properties...
 index=index.20150126140904697
 14:13:54,301 INFO  [org.apache.solr.handler.SnapPuller] (RecoveryThread)
 removing old index directory
 NRTCachingDirectory(MMapDirectory@
 /opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417
 lockFactory=NativeFSLockFactory@
 /opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417;
 maxCacheMB=48.0 maxMergeSizeMB=4.0)
 14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (RecoveryThread) Creating new IndexWriter...
 14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (RecoveryThread) Waiting until IndexWriter is unused...
 core=search1_shard7_replica4
 14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
 (RecoveryThread) Rollback old IndexWriter... core=search1_shard7_replica4
 14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [IW][RecoveryThread]: rollback
 14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [IW][RecoveryThread]: all running merges have aborted
 14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [IW][RecoveryThread]: rollback: done finish merges
 14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [DW][RecoveryThread]: abort
 14:13:54,303 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [DW][RecoveryThread]: done abort; abortedFiles=[]
 success=true
 14:13:54,306 INFO  [org.apache.solr.update.LoggingInfoStream]
 (RecoveryThread) [IW][RecoveryThread]: rollback:
 infos=_4qe(4.10.0):C4312879/1370002:delGen=56
 _554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15
 _5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11
 _5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6
 

Re: replicas goes in recovery mode right after update

2015-01-26 Thread Shawn Heisey
On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
 Hi Erick,
 In solr.xml file I had zk timeout set to/ int
 name=zkClientTimeout${zkClientTimeout:45}/int/
 One thing that made a it a bit better now is the zk tick time and
 syncLimit settings. I set it to a higher value as below. This may not
 be advisable though. 

 tickTime=3
 initLimit=30
 syncLimit=20

 Now we observed that replicas do not go in recovery that often as
 before. In the whole cluster at a given time I would have a couple of
 replicas in recovery whereas earlier it were multiple replicas from
 every shard .
 On the wiki https://wiki.apache.org/solr/SolrCloudit says the The
 maximum is 20 times the tickTime. in the FAQ so I decided to increase
 the tick time. Is this the correct approach ?

The default zkClientTimeout on recent Solr versions is 30 seconds, up
from 15 in slightly older releases.

Those values of 15 or 30 seconds are a REALLY long time in computer
terms, and if you are exceeding that timeout on a regular basis,
something is VERY wrong with your Solr install.  Rather than take steps
to increase your timeout beyond the normal maximum of 40 seconds (20
times a tickTime of 2 seconds), figure out why you're exceeding that
timeout and fix the performance problem.  The zkClientTimeout value that
you have set, 450 seconds, is seven and a half *MINUTES*.  Nothing in
Solr should ever take that long.

Not enough memory in the server is by far the most common culprit for
performance issues.  Garbage collection pauses are a close second.

I don't actually know this next part for sure, because I've never looked
into the code, but I believe that increasing the tickTime, especially to
a value 15 times higher than default, might make all zookeeper
operations a lot slower.

Thanks,
Shawn



Re: replicas goes in recovery mode right after update

2015-01-26 Thread Vijay Sekhri
Hi Erick,
In the solr.xml file I had the zk timeout set to
<int name="zkClientTimeout">${zkClientTimeout:45}</int>
One thing that made it a bit better now is the zk tickTime and syncLimit
settings. I set them to higher values as below. This may not be advisable
though.

tickTime=3
initLimit=30
syncLimit=20

Now we observe that replicas do not go into recovery as often as before.
In the whole cluster at a given time I would have a couple of replicas in
recovery, whereas earlier it was multiple replicas from every shard.
On the wiki https://wiki.apache.org/solr/SolrCloud the FAQ says "The maximum
is 20 times the tickTime.", so I decided to increase the tickTime. Is this
the correct approach?

One question I have is whether the auto commit settings have anything to do
with this or not. Do they induce extra work for the searchers because of
which this would happen? I have tried the following settings:

<autoSoftCommit>
  <maxDocs>50</maxDocs>
  <maxTime>90</maxTime>
</autoSoftCommit>

<autoCommit>
  <maxDocs>20</maxDocs>
  <maxTime>3</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

I have increased the heap size to 15 GB for each JVM instance. I monitored
what the heap usage looks like during full indexing and it never goes
beyond 8 GB. I don't see any full GC happening at any point.
I had some screenshots attached but they were marked as spam, so I am not
sending them again.

Our rate is a variable rate. It is not a sustained rate of 6000/second;
there are intervals where it would reach that much, come down, grow again,
and come down. So if I took an average it would be only 600/second, but
that is not the real rate at any given time.
The version of Solr Cloud is 4.10. All the indexers are basically Java
programs running on different hosts using the CloudSolrServer API.
As I mentioned, it is much better now than before, however not completely
as expected. We would want none of them to go into recovery if there is
really no need.
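
As a point of reference, below is a minimal sketch of the kind of SolrJ 4.x
indexer described above (a CloudSolrServer client adding documents in
batches). The ZooKeeper addresses, collection name and field names are
placeholders, not the actual setup from this thread.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexerSketch {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble address is a placeholder.
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("search1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("name", "example " + i);
            batch.add(doc);
        }
        // Batched add; commits are left to the autoCommit/autoSoftCommit
        // settings discussed in this thread rather than issued per batch.
        server.add(batch);
        server.shutdown();
    }
}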

I captured some logs before and after recovery

 4:13:54,298 INFO  [org.apache.solr.handler.SnapPuller] (RecoveryThread)
New index installed. Updating index properties...
index=index.20150126140904697
14:13:54,301 INFO  [org.apache.solr.handler.SnapPuller] (RecoveryThread)
removing old index directory
NRTCachingDirectory(MMapDirectory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417
lockFactory=NativeFSLockFactory@/opt/solr/solrnodes/solrnode1/search1_shard7_replica4/data/index.20150126134945417;
maxCacheMB=48.0 maxMergeSizeMB=4.0)
14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
(RecoveryThread) Creating new IndexWriter...
14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
(RecoveryThread) Waiting until IndexWriter is unused...
core=search1_shard7_replica4
14:13:54,302 INFO  [org.apache.solr.update.DefaultSolrCoreState]
(RecoveryThread) Rollback old IndexWriter... core=search1_shard7_replica4
14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [IW][RecoveryThread]: rollback
14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [IW][RecoveryThread]: all running merges have aborted
14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [IW][RecoveryThread]: rollback: done finish merges
14:13:54,302 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [DW][RecoveryThread]: abort
14:13:54,303 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [DW][RecoveryThread]: done abort; abortedFiles=[]
success=true
14:13:54,306 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [IW][RecoveryThread]: rollback:
infos=_4qe(4.10.0):C4312879/1370002:delGen=56
_554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15
_5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11
_5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6
_5kb(4.10.0):C424868/19148:delGen=11 _5f5(4.10.0):C116528/42495:delGen=1
_5nx(4.10.0):C33236/20668:delGen=1 _5ql(4.10.0):C25924/2:delGen=2
_5o8(4.10.0):C27155/7531:delGen=1 _5of(4.10.0):C38545/5677:delGen=1
_5p7(4.10.0):C37457/648:delGen=1 _5r5(4.10.0):C4260 _5qv(4.10.0):C1750
_5qi(4.10.0):C842 _5qp(4.10.0):C2247 _5qm(4.10.0):C2214 _5qo(4.10.0):C1785
_5qn(4.10.0):C1962 _5qu(4.10.0):C2390 _5qy(4.10.0):C2129 _5qx(4.10.0):C2192
_5qw(4.10.0):C2157/1:delGen=1 _5r6(4.10.0):C159 _5r4(4.10.0):C742
_5r8(4.10.0):C334 _5r7(4.10.0):C390 _5r3(4.10.0):C1122
14:13:54,306 INFO  [org.apache.solr.update.LoggingInfoStream]
(RecoveryThread) [IFD][RecoveryThread]: now checkpoint
_4qe(4.10.0):C4312879/1370002:delGen=56
_554(4.10.0):C3995865/780418:delGen=23 _56u(4.10.0):C286775/11906:delGen=15
_5co(4.10.0):C871785/93841:delGen=10 _5m7(4.10.0):C122852/31645:delGen=11
_5hm(4.10.0):C457977/32465:delGen=11 _5q2(4.10.0):C13189/649:delGen=6
_5kb(4.10.0):C424868/19148:delGen=11 

Re: replicas goes in recovery mode right after update

2015-01-25 Thread Erick Erickson
Shawn directed you over here to the user list, but I see this note on
SOLR-7030:
All our searchers have 12 GB of RAM available and have quad core Intel(R)
Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
jboss and solr in it . All 12 GB is available as heap for the java
process...

So you have 12G physical memory and have allocated 12G to the Java process?
This is an anti-pattern. If that's
the case, your operating system is being starved for memory, probably
hitting a state where it spends all of its
time in stop-the-world garbage collection, eventually it doesn't respond to
Zookeeper's ping so Zookeeper
thinks the node is down and puts it into recovery. Where it spends a lot of
time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated,
but here's a blog on what they do, you
should pick the configuration that supports your use case (i.e. how much
latency can you stand between indexing
and being able to search?).

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by
allocating all the physical memory to the JVM:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


But your biggest problem is that you have far too much of your physical
memory allocated to the JVM. This
will cause you endless problems, you just need more physical memory on
those boxes. It's _possible_ you could
get by with less memory for the JVM, counterintuitive as it seems try 8G or
maybe even 6G. At some point
you'll hit OOM errors, but that'll give you a lower limit on what the JVM
needs.

Unless I've mis-interpreted what you've written, though, I doubt you'll get
stable with that much memory allocated
to the JVM.

Best,
Erick



On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 We have a cluster of SolrCloud servers with 10 shards and 4 replicas in
 each shard in our stress environment. In our prod environment we will have
 10 shards and 15 replicas in each shard. Our current commit settings are as
 follows:

 <autoSoftCommit>
   <maxDocs>50</maxDocs>
   <maxTime>18</maxTime>
 </autoSoftCommit>
 <autoCommit>
   <maxDocs>200</maxDocs>
   <maxTime>18</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>


 We indexed roughly 90 million docs. We have two different ways to index
 documents: a) full indexing, which takes 4 hours to index the 90 million
 docs, with the rate of docs coming to the searchers around 6000 per second;
 b) incremental indexing, which takes an hour to index the delta changes.
 Roughly there are 3 million changes and the rate of docs coming to the
 searchers is 2500 per second.

 We have two collections, search1 and search2. When we do full indexing, we
 do it in the search2 collection while search1 is serving live traffic.
 After it finishes we swap the collections using aliases, so that the
 search2 collection serves live traffic while search1 becomes available for
 the next full indexing run. When we do incremental indexing we do it in the
 search1 collection, which is serving live traffic.

 All our searchers have 12 GB of RAM available and have quad core Intel(R)
 Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e.
 JBoss with Solr in it. All 12 GB is available as heap for the Java process.
 We have observed that the heap memory of the Java process averages around
 8-10 GB. All searchers have a final index size of 9 GB. So in total there
 is 9x10 (shards) = 90 GB worth of index files.

 We have observed the following issue when we trigger indexing. About 10
 minutes after we trigger indexing on 14 parallel hosts, the replicas go
 into recovery mode. This happens to all the shards. In about 20 minutes
 more and more replicas start going into recovery mode. After about half an
 hour all replicas except the leader are in recovery mode. We cannot
 throttle the indexing load as that will increase our overall indexing time.
 So to overcome this issue, we remove all the replicas before we trigger the
 indexing and then add them back after the indexing finishes.

 We observe the same behavior of replicas going into recovery when we do
 incremental indexing. We cannot remove replicas during our incremental
 indexing because it is also serving live traffic. We tried to throttle our
 indexing speed, however the cluster still goes into recovery.

 If we leave the cluster as it is, when the indexing finishes it eventually
 recovers after a while. As it is serving live traffic we cannot have these
 replicas go into recovery mode, because our tests have shown that it also
 degrades search performance.

 We have tried different commit settings like the below:

 a) No auto soft commit, no auto hard commit, and a commit triggered at the
 end of indexing
 b) No auto soft commit, yes auto hard commit, and a commit at the end of
 indexing
 c) Yes auto 

Re: replicas goes in recovery mode right after update

2015-01-25 Thread Erick Erickson
Ah, OK. Whew! Because I was wondering how you were running at _all_ if all
the memory was allocated to the JVM ;).

What is your Zookeeper timeout? The original default was 15 seconds and this
has caused problems like this. Here's the scenario:
You send a bunch of docs at the server, and eventually you hit a
stop-the-world
GC that takes longer than the Zookeeper timeout. So ZK thinks the node is
down
and initiates recovery. Eventually, you hit this on all the replicas.

Sometimes I've seen situations where the answer is giving a bit more memory
to the JVM, say 2-4G in your case. The theory here (and this is a shot in
the dark) is that your peak JVM requirements are close to your 12G, so the
garbage collector spends enormous amounts of time collecting a small bit of
memory, runs for some fraction of a second, and does it again. Adding more
memory to the JVMs allows the parallel collections to work without so many
stop-the-world GC pauses.

So what I'd do is turn on GC logging (probably on the replicas) and look for
very long GC pauses. Mark Miller put together a blog here:
https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/

See the "getting a view into garbage collection" section. The smoking gun
here is if you see full GC pauses that are longer than the ZK timeout.
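
A common way to get that view on a Java 7 HotSpot JVM is to add GC logging
flags to the Solr/JBoss startup options, for example (the log path is a
placeholder):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr/gc.log

Then scan the log for pause times and compare the longest ones against the
ZooKeeper session timeout.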

90M docs in 4 hours across 10 shards is only 625/sec or so per shard. I've
seen sustained indexing rates significantly above this; YMMV of course, a
lot depends on the size of the docs.

What version of Solr BTW? And when you say you fire a bunch of indexers,
I'm
assuming these are SolrJ clients and use CloudSolrServer?

Best,
Erick


On Sun, Jan 25, 2015 at 4:10 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

 Thank you for the reply, Erick.
 I am sorry, I had the wrong information posted; I posted our DEV env
 configuration by mistake.
 After double-checking our stress and Prod Beta environments, where we found
 the original issue, I found all the searchers have around 50 GB of RAM
 available and two JVM instances running (on 2 different ports). Both
 instances have 12 GB allocated. The remaining 26 GB is available for the
 OS. The 1st instance on a host has the search1 collection (live collection)
 and the 2nd instance on the same host has the search2 collection (for full
 indexing).

 There is plenty of room for OS-related tasks. Our issue is not in any way
 related to OS starvation, as shown by our dashboards.
 We have been through
 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 a lot of times, but we have two modes of operation:
 a)  1st collection (live traffic) - heavy searches and medium indexing
 b)  2nd collection (not serving traffic) - very heavy indexing, no searches

 When our indexing finishes we swap the alias for these collections. So
 essentially we need to have a configuration that can support both use
 cases together. We have tried a lot of different configuration options and
 none of them seems to work. My suspicion is that SolrCloud is unable to
 keep up with the updates at the rate we are sending them while it is trying
 to be consistent with all the replicas.


 On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Shawn directed you over here to the user list, but I see this note on
  SOLR-7030:
  All our searchers have 12 GB of RAM available and have quad core
 Intel(R)
  Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
  jboss and solr in it . All 12 GB is available as heap for the java
  process...
 
  So you have 12G physical memory and have allocated 12G to the Java
 process?
  This is an anti-pattern. If that's
  the case, your operating system is being starved for memory, probably
  hitting a state where it spends all of its
  time in stop-the-world garbage collection, eventually it doesn't respond
 to
  Zookeeper's ping so Zookeeper
  thinks the node is down and puts it into recovery. Where it spends a lot
 of
  time doing... essentially nothing.
 
  About the hard and soft commits: I suspect these are entirely unrelated,
  but here's a blog on what they do, you
  should pick the configuration that supports your use case (i.e. how much
  latency can you stand between indexing
  and being able to search?).
 
 
 
 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 
  Here's one very good reason you shouldn't starve your op system by
  allocating all the physical memory to the JVM:
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
 
 
  But your biggest problem is that you have far too much of your physical
  memory allocated to the JVM. This
  will cause you endless problems, you just need more physical memory on
  those boxes. It's _possible_ you could
  get by with less memory for the JVM, counterintuitive as it seems try 8G
 or
  maybe even 6G. At some point
  you'll hit OOM errors, but that'll give you a lower limit on what the JVM
  needs.
 
  Unless I've mis-interpreted 

Re: replicas goes in recovery mode right after update

2015-01-25 Thread Vijay Sekhri
Thank you for the reply, Erick.
I am sorry, I had the wrong information posted; I posted our DEV env
configuration by mistake.
After double-checking our stress and Prod Beta environments, where we found
the original issue, I found all the searchers have around 50 GB of RAM
available and two JVM instances running (on 2 different ports). Both
instances have 12 GB allocated. The remaining 26 GB is available for the OS.
The 1st instance on a host has the search1 collection (live collection) and
the 2nd instance on the same host has the search2 collection (for full
indexing).

There is plenty of room for OS-related tasks. Our issue is not in any way
related to OS starvation, as shown by our dashboards.
We have been through
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
a lot of times, but we have two modes of operation:
a)  1st collection (live traffic) - heavy searches and medium indexing
b)  2nd collection (not serving traffic) - very heavy indexing, no searches

When our indexing finishes we swap the alias for these collections. So
essentially we need to have a configuration that can support both use cases
together. We have tried a lot of different configuration options and none of
them seems to work. My suspicion is that SolrCloud is unable to keep up with
the updates at the rate we are sending them while it is trying to be
consistent with all the replicas.


On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Shawn directed you over here to the user list, but I see this note on
 SOLR-7030:
 All our searchers have 12 GB of RAM available and have quad core Intel(R)
 Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
 jboss and solr in it . All 12 GB is available as heap for the java
 process...

 So you have 12G physical memory and have allocated 12G to the Java process?
 This is an anti-pattern. If that's
 the case, your operating system is being starved for memory, probably
 hitting a state where it spends all of its
 time in stop-the-world garbage collection, eventually it doesn't respond to
 Zookeeper's ping so Zookeeper
 thinks the node is down and puts it into recovery. Where it spends a lot of
 time doing... essentially nothing.

 About the hard and soft commits: I suspect these are entirely unrelated,
 but here's a blog on what they do, you
 should pick the configuration that supports your use case (i.e. how much
 latency can you stand between indexing
 and being able to search?).


 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

 Here's one very good reason you shouldn't starve your op system by
 allocating all the physical memory to the JVM:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


 But your biggest problem is that you have far too much of your physical
 memory allocated to the JVM. This
 will cause you endless problems, you just need more physical memory on
 those boxes. It's _possible_ you could
 get by with less memory for the JVM, counterintuitive as it seems try 8G or
 maybe even 6G. At some point
 you'll hit OOM errors, but that'll give you a lower limit on what the JVM
 needs.

 Unless I've mis-interpreted what you've written, though, I doubt you'll get
 stable with that much memory allocated
 to the JVM.

 Best,
 Erick



 On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com
 wrote:

  We have a cluster of SolrCloud servers with 10 shards and 4 replicas in
  each shard in our stress environment. In our prod environment we will have
  10 shards and 15 replicas in each shard. Our current commit settings are
  as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>


  We indexed roughly 90 million docs. We have two different ways to index
  documents: a) full indexing, which takes 4 hours to index the 90 million
  docs, with the rate of docs coming to the searchers around 6000 per
  second; b) incremental indexing, which takes an hour to index the delta
  changes. Roughly there are 3 million changes and the rate of docs coming
  to the searchers is 2500 per second.

  We have two collections, search1 and search2. When we do full indexing, we
  do it in the search2 collection while search1 is serving live traffic.
  After it finishes we swap the collections using aliases, so that the
  search2 collection serves live traffic while search1 becomes available for
  the next full indexing run. When we do incremental indexing we do it in
  the search1 collection, which is serving live traffic.

  All our searchers have 12 GB of RAM available and have quad core Intel(R)
  Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e.
  JBoss with Solr in it. All 12 GB is available as heap for the Java
  process. We have observed 

replicas goes in recovery mode right after update

2015-01-25 Thread Vijay Sekhri
We have a cluster of SolrCloud servers with 10 shards and 4 replicas in
each shard in our stress environment. In our prod environment we will have
10 shards and 15 replicas in each shard. Our current commit settings are as
follows:

<autoSoftCommit>
  <maxDocs>50</maxDocs>
  <maxTime>18</maxTime>
</autoSoftCommit>
<autoCommit>
  <maxDocs>200</maxDocs>
  <maxTime>18</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>


We indexed roughly 90 million docs. We have two different ways to index
documents: a) full indexing, which takes 4 hours to index the 90 million
docs, with the rate of docs coming to the searchers around 6000 per second;
b) incremental indexing, which takes an hour to index the delta changes.
Roughly there are 3 million changes and the rate of docs coming to the
searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we
do it in the search2 collection while search1 is serving live traffic.
After it finishes we swap the collections using aliases, so that the
search2 collection serves live traffic while search1 becomes available for
the next full indexing run. When we do incremental indexing we do it in the
search1 collection, which is serving live traffic.
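
For reference, the alias swap described above maps to the Collections API
CREATEALIAS action; a call along these lines (host and alias name are
placeholders) re-points the alias at whichever collection should serve live
traffic:

http://host:8983/solr/admin/collections?action=CREATEALIAS&name=search&collections=search2

Running it again with collections=search1 after the next full indexing run
swaps it back.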

All our searchers have 12 GB of RAM available and have quad core Intel(R)
Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e.
JBoss with Solr in it. All 12 GB is available as heap for the Java process.
We have observed that the heap memory of the Java process averages around
8-10 GB. All searchers have a final index size of 9 GB. So in total there
is 9x10 (shards) = 90 GB worth of index files.

We have observed the following issue when we trigger indexing. About 10
minutes after we trigger indexing on 14 parallel hosts, the replicas go
into recovery mode. This happens to all the shards. In about 20 minutes
more and more replicas start going into recovery mode. After about half an
hour all replicas except the leader are in recovery mode. We cannot
throttle the indexing load as that will increase our overall indexing time.
So to overcome this issue, we remove all the replicas before we trigger the
indexing and then add them back after the indexing finishes.

We observe the same behavior of replicas going into recovery when we do
incremental indexing. We cannot remove replicas during our incremental
indexing because it is also serving live traffic. We tried to throttle our
indexing speed, however the cluster still goes into recovery.

If we leave the cluster as it is, when the indexing finishes it eventually
recovers after a while. As it is serving live traffic we cannot have these
replicas go into recovery mode, because our tests have shown that it also
degrades search performance.

We have tried different commit settings like the below:

a) No auto soft commit, no auto hard commit, and a commit triggered at the
end of indexing
b) No auto soft commit, yes auto hard commit, and a commit at the end of
indexing
c) Yes auto soft commit, no auto hard commit
d) Yes auto soft commit, yes auto hard commit
e) Different frequency settings for the commits above. Please NOTE that we
have tried a 15 minute soft commit setting with a 30 minute hard commit
setting, the same time settings for both, and a 30 minute soft commit with
an hour hard commit setting.

Unfortunately all of the above yield the same behavior: the replicas still
go into recovery. We have increased the zookeeper timeout from 30 seconds
to 5 minutes and the problem persists. Is there any setting that would fix
this issue?

-- 
*
Vijay Sekhri
*