Hi Shawn, Erick,

From another replica, right after the same error, it seems the leader initiates the recovery of the replica. This one has slightly different log information than the one that went into recovery earlier. I am not sure if this helps in diagnosing.
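As an aside, the REQUESTRECOVERY request that shows up in the log below is the standard CoreAdmin API call, so it can also be rebuilt and issued by hand when testing. A small sketch (host, port, and core name are copied from the log line below; issuing it against your cluster is left to curl or an HTTP client):

```python
from urllib.parse import urlencode

# Rebuild the same CoreAdmin request the leader sends to a replica
# (values taken from the log line below; adjust for your own node/core).
params = {
    "action": "REQUESTRECOVERY",
    "core": "search1_shard4_replica13",
    "wt": "javabin",
    "version": "2",
}
url = "http://10.235.43.57:8680/solr/admin/cores?" + urlencode(params)
print(url)
# issue it with e.g.: curl "<url>"
```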
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
        at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
        at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
        at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
        at org.apache.coyote.Request.doRead(Request.java:438)
        at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
        ... 31 more

21:55:07,678 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.43.57:8680-32) It has been requested that we recover: core=search1_shard4_replica13
21:55:07,678 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.43.57:8680-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard4_replica13&wt=javabin&version=2} status=0 QTime=0
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) publishing core=search1_shard4_replica13 state=recovering collection=search1
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) numShards not found on descriptor - reading it from system property
21:55:07,681 INFO [org.apache.solr.cloud.ZkController] (Thread-443) Wrote recovering to /collections/search1/leader_initiated_recovery/shard4/core_node192

On Mon, Jan 26, 2015 at 10:34 PM, Vijay Sekhri <sekhrivi...@gmail.com> wrote:
> Hi Shawn, Erick,
>
> It turned out that once we increased our indexing rate back to the original full rate, the replicas went back into recovery no matter what the zk timeout setting was. Initially we thought that increasing the timeout was helping, but apparently not. When we decreased the indexing rate, fewer replicas went into recovery. Once we returned to our full indexing rate, almost all replicas went into recovery no matter what the zk timeout or tickTime settings were.
> We reverted the tickTime back to the original 2 seconds.
>
> So we investigated further, and after checking the logs we found this exception happening right before the recovery process is initiated. We observed this on two different replicas that went into recovery. We are not sure if this is a coincidence or a real problem. Note that we were also putting some search query load on the cluster while indexing, to trigger the recovery behavior.
>
> 22:00:32,493 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Finished recovery process. core=search1_shard5_replica2
> 22:00:32,503 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12530488 bytes; docInRAM=2051
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12984633 bytes; docInRAM=2205
>
> 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32) ClientAbortException: java.io.IOException: JBWEB002020: Invalid chunk header
>         at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
>         at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
>         at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
>         at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
>         at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
>         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
>         at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
>         at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
>         at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
>         at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
>         at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
>         at org.apache.coyote.Request.doRead(Request.java:438)
>         at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
>         ... 31 more
>
> 22:00:40,864 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has 0 deleted docs
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has no vectors; no norms; no docValues; prox; freqs
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushedFiles=[_227_Lucene41_0.tip, _227.fdx, _227_Lucene41_0.tim, _227.fdt, _227_Lucene41_0.doc, _227.fnm, _227_Lucene41_0.pos]
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed codec=Lucene410
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed: segment=_227 ramUsed=12.383 MB newFlushedSize(includes docstores)=4.444 MB docs/MB=496.165
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DW][http-/10.235.46.36:8580-27]: publishFlushedSegment seg-private updates=null
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publishFlushedSegment
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [BD][http-/10.235.46.36:8580-27]: push deletes 16553 deleted terms (unique count=16553) bytesUsed=222240 delGen=17 packetCount=1 totBytesUsed=222240
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publish sets newSegment delGen=18 seg=_227(4.10.0):C2205
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: now checkpoint "_12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205" [27 segments ; isCommit = false]
> 22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: 0 msec to checkpoint
> 22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [TMP][http-/10.235.46.36:8580-27]: findMerges: 27 segments
> 10.235.46.36:8580-27]: now merge
> 22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: index: _12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205
> 22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: no more merges pending; now return
> 22:00:41,885 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.46.36:8580-32) It has been requested that we recover: core=search1_shard5_replica2
> 22:00:41,886 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.46.36:8580-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard5_replica2&wt=javabin&version=2} status=0 QTime=1
> 22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) publishing core=search1_shard5_replica2 state=recovering collection=search1
> 22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) numShards not found on descriptor - reading it from system property
> 22:00:41,889 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
> 22:00:41,890 INFO [org.apache.solr.cloud.ZkController] (Thread-472) Wrote recovering to /collections/search1/leader_initiated_recovery/shard5/core_node223
> 22:00:41,892 INFO [org.apache.solr.update.DefaultSolrCoreState] (Thread-472) Running recovery - first canceling any ongoing recovery
> 22:00:41,893 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Starting recovery process. core=search1_shard5_replica2 recoveringAfterStartup=false
> 22:00:41,894 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Publishing state of core search1_shard5_replica2 as recovering, leader is http://XXXXXXXXXXCHANGEDONPURPOSEXXXX:8680/solr/search1/ and I am http://solrx341p.qa.ch3.s.com:8580/solr/search1_shard5_replica2/
> 22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) publishing core=search1_shard5_replica2 state=recovering collection=search1
> 22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
> 22:00:41,896 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Sending prep recovery command to http://XXXXXXXXXCHANGEDONPURPOSEXXXXXXXX:8680/solr; WaitForState: action=PREPRECOVERY&core=search1&nodeName=XXXXXXXXXXXXXXCHANGEDONPURPOSEXXXXXXXXXXXXXXX%3A8580_solr&coreNodeName=core_node223&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
> 22:00:42,006 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
>
> 15 GB of heap is allocated to these searchers, and usage hardly ever goes over 8 GB. Full GCs do not happen. We write GC logs to a separate file and also monitored these processes via VisualVM. They seem to have enough memory.
>
> On Mon, Jan 26, 2015 at 5:13 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
>> > Hi Erick,
>> > In the solr.xml file I had the zk timeout set to <int name="zkClientTimeout">${zkClientTimeout:450000}</int>.
>> > One thing that made it a bit better now is the zk tickTime and syncLimit settings. I set them to higher values, as below. This may not be advisable, though.
>> >
>> > tickTime=30000
>> > initLimit=30
>> > syncLimit=20
>> >
>> > Now we observe that replicas do not go into recovery as often as before. In the whole cluster at a given time I would have a couple of replicas in recovery, whereas earlier there were multiple replicas from every shard. The FAQ on the wiki https://wiki.apache.org/solr/SolrCloud says "The maximum is 20 times the tickTime.", so I decided to increase the tick time. Is this the correct approach?
>>
>> The default zkClientTimeout on recent Solr versions is 30 seconds, up from 15 in slightly older releases.
>>
>> Those values of 15 or 30 seconds are a REALLY long time in computer terms, and if you are exceeding that timeout on a regular basis, something is VERY wrong with your Solr install. Rather than take steps to increase your timeout beyond the normal maximum of 40 seconds (20 times a tickTime of 2 seconds), figure out why you're exceeding that timeout and fix the performance problem. The zkClientTimeout value that you have set, 450 seconds, is seven and a half *MINUTES*. Nothing in Solr should ever take that long.
>>
>> "Not enough memory in the server" is by far the most common culprit for performance issues. Garbage collection pauses are a close second.
>>
>> I don't actually know this next part for sure, because I've never looked into the code, but I believe that increasing the tickTime, especially to a value 15 times higher than default, might make all zookeeper operations a lot slower.
>>
>> Thanks,
>> Shawn
>>
>
> --
> *********************************************
> Vijay Sekhri
> *********************************************

--
*********************************************
Vijay Sekhri
*********************************************
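[Editor's note: to put the numbers from this thread in one place: ZooKeeper negotiates each client session timeout into the range [2 * tickTime, 20 * tickTime], which is why a 450000 ms zkClientTimeout has no effect with the default tickTime of 2000 ms but does take effect once tickTime is raised to 30000 ms. A minimal sketch of that clamping rule (function name is mine, not a ZooKeeper API):]

```python
def effective_session_timeout(requested_ms: int, tick_ms: int) -> int:
    # ZooKeeper clamps the negotiated session timeout to
    # [2 * tickTime, 20 * tickTime], per its admin documentation.
    return min(max(requested_ms, 2 * tick_ms), 20 * tick_ms)

# Default tickTime=2000 ms: the 450000 ms request is clamped to 40 s.
print(effective_session_timeout(450_000, 2_000))   # 40000
# With tickTime=30000 ms the cap is 600 s, so the full 450 s applies.
print(effective_session_timeout(450_000, 30_000))  # 450000
```

This is also why Shawn's advice holds: raising tickTime to make a huge timeout "fit" only masks the underlying pause that is breaking the session in the first place.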