Hi Shawn, Erick,

From another replica, right after the same error, it seems the leader initiates the recovery of the replica. This one has slightly different log information than the one that went into recovery earlier. I am not sure if this helps in diagnosing.
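As an aside, the REQUESTRECOVERY request that shows up in the log below is the standard CoreAdmin API call, so it can also be rebuilt and issued by hand when testing. A small sketch (host, port, and core name are copied from the log line below; issuing it against your cluster is left to curl or an HTTP client):

```python
from urllib.parse import urlencode

# Rebuild the same CoreAdmin request the leader sends to a replica
# (values taken from the log line below; adjust for your own node/core).
params = {
    "action": "REQUESTRECOVERY",
    "core": "search1_shard4_replica13",
    "wt": "javabin",
    "version": "2",
}
url = "http://10.235.43.57:8680/solr/admin/cores?" + urlencode(params)
print(url)
# issue it with e.g.: curl "<url>"
```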
Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
        at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
        at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
        at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
        at org.apache.coyote.Request.doRead(Request.java:438)
        at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
        ... 31 more

21:55:07,678 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.43.57:8680-32) It has been requested that we recover: core=search1_shard4_replica13
21:55:07,678 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.43.57:8680-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard4_replica13&wt=javabin&version=2} status=0 QTime=0
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) publishing core=search1_shard4_replica13 state=recovering collection=search1
21:55:07,678 INFO [org.apache.solr.cloud.ZkController] (Thread-443) numShards not found on descriptor - reading it from system property
21:55:07,681 INFO [org.apache.solr.cloud.ZkController] (Thread-443) Wrote recovering to /collections/search1/leader_initiated_recovery/shard4/core_node192

On Mon, Jan 26, 2015 at 10:34 PM, Vijay Sekhri <sekhrivi...@gmail.com> wrote:
> Hi Shawn, Erick,
>
> It turned out that once we increased our indexing rate back to the original full rate, the replicas went back into recovery no matter what the zk timeout setting was. Initially we thought that increasing the timeout was helping, but apparently not. When we decreased the indexing rate, fewer replicas went into recovery. Once we returned to our full indexing rate, almost all replicas went into recovery no matter what the zk timeout or tickTime settings were.
> We reverted the tickTime back to the original 2 seconds.
>
> So we investigated further, and after checking the logs we found this exception happening right before the recovery process is initiated. We observed this on two different replicas that went into recovery. We are not sure if this is a coincidence or a real problem. Note that we were also putting some search query load on the cluster while indexing, to trigger the recovery behavior.
>
> 22:00:32,493 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Finished recovery process. core=search1_shard5_replica2
> 22:00:32,503 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: trigger flush: activeBytes=101796784 deleteBytes=3061644 vs limit=104857600
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12530488 bytes; docInRAM=2051
> 22:00:40,450 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [FP][http-/10.235.46.36:8580-27]: thread state has 12984633 bytes; docInRAM=2205
>
> 22:00:40,861 ERROR [org.apache.solr.core.SolrCore] (http-/10.235.46.36:8580-32) ClientAbortException: java.io.IOException: JBWEB002020: Invalid chunk header
>         at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:351)
>         at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:422)
>         at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:373)
>         at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:193)
>         at org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:80)
>         at org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:89)
>         at org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:192)
>         at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:111)
>         at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
>         at org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
>         at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
>         at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:99)
>         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:246)
>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
>         at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:145)
>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:559)
>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:336)
>         at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
>         at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:920)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.io.IOException: JBWEB002020: Invalid chunk header
>         at org.apache.coyote.http11.filters.ChunkedInputFilter.parseChunkHeader(ChunkedInputFilter.java:281)
>         at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:134)
>         at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:697)
>         at org.apache.coyote.Request.doRead(Request.java:438)
>         at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:341)
>         ... 31 more
>
> 22:00:40,864 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has 0 deleted docs
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: new segment has no vectors; no norms; no docValues; prox; freqs
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushedFiles=[_227_Lucene41_0.tip, _227.fdx, _227_Lucene41_0.tim, _227.fdt, _227_Lucene41_0.doc, _227.fnm, _227_Lucene41_0.pos]
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed codec=Lucene410
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DWPT][http-/10.235.46.36:8580-27]: flushed: segment=_227 ramUsed=12.383 MB newFlushedSize(includes docstores)=4.444 MB docs/MB=496.165
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [DW][http-/10.235.46.36:8580-27]: publishFlushedSegment seg-private updates=null
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publishFlushedSegment
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [BD][http-/10.235.46.36:8580-27]: push deletes 16553 deleted terms (unique count=16553) bytesUsed=222240 delGen=17 packetCount=1 totBytesUsed=222240
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IW][http-/10.235.46.36:8580-27]: publish sets newSegment delGen=18 seg=_227(4.10.0):C2205
> 22:00:40,865 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: now checkpoint "_12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205" [27 segments ; isCommit = false]
> 22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [IFD][http-/10.235.46.36:8580-27]: 0 msec to checkpoint
> 22:00:40,866 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [TMP][http-/10.235.46.36:8580-27]: findMerges: 27 segments
> 10.235.46.36:8580-27]: now merge
> 22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: index: _12l(4.10.0):C676510/354682:delGen=208 _1wo(4.10.0):C420813/69819:delGen=40 _1bs(4.10.0):C229862/110117:delGen=77 _1ga(4.10.0):C322386/84434:delGen=91 _1nc(4.10.0):C209477/2949:delGen=52 _223(4.10.0):C206753/11:delGen=1 _1ug(4.10.0):C7185/1:delGen=1 _20n(4.10.0):C59384/2287:delGen=7 _1zm(4.10.0):C78927/1581:delGen=12 _20v(4.10.0):C4317 _20u(4.10.0):C5243/1:delGen=1 _20z(4.10.0):C3288 _20y(4.10.0):C4614/1:delGen=1 _21i(4.10.0):C2190 _21f(4.10.0):C16 _21g(4.10.0):C67 _21h(4.10.0):C18 _21o(4.10.0):C116 _21q(4.10.0):C1243/1:delGen=1 _21r(4.10.0):C39 _21s(4.10.0):C92 _21t(4.10.0):C1 _21u(4.10.0):C1 _222(4.10.0):C26988 _224(4.10.0):C28193/1:delGen=1 _225(4.10.0):C25966 _227(4.10.0):C2205
> 22:00:40,867 INFO [org.apache.solr.update.LoggingInfoStream] (http-/10.235.46.36:8580-27) [CMS][http-/10.235.46.36:8580-27]: no more merges pending; now return
> 22:00:41,885 INFO [org.apache.solr.handler.admin.CoreAdminHandler] (http-/10.235.46.36:8580-32) It has been requested that we recover: core=search1_shard5_replica2
> 22:00:41,886 INFO [org.apache.solr.servlet.SolrDispatchFilter] (http-/10.235.46.36:8580-32) [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=search1_shard5_replica2&wt=javabin&version=2} status=0 QTime=1
> 22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) publishing core=search1_shard5_replica2 state=recovering collection=search1
> 22:00:41,886 INFO [org.apache.solr.cloud.ZkController] (Thread-472) numShards not found on descriptor - reading it from system property
> 22:00:41,889 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
> 22:00:41,890 INFO [org.apache.solr.cloud.ZkController] (Thread-472) Wrote recovering to /collections/search1/leader_initiated_recovery/shard5/core_node223
> 22:00:41,892 INFO [org.apache.solr.update.DefaultSolrCoreState] (Thread-472) Running recovery - first canceling any ongoing recovery
> 22:00:41,893 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Starting recovery process. core=search1_shard5_replica2 recoveringAfterStartup=false
> 22:00:41,894 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Publishing state of core search1_shard5_replica2 as recovering, leader is http://XXXXXXXXXXCHANGEDONPURPOSEXXXX:8680/solr/search1/ and I am http://solrx341p.qa.ch3.s.com:8580/solr/search1_shard5_replica2/
> 22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) publishing core=search1_shard5_replica2 state=recovering collection=search1
> 22:00:41,894 INFO [org.apache.solr.cloud.ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
> 22:00:41,896 INFO [org.apache.solr.cloud.RecoveryStrategy] (RecoveryThread) Sending prep recovery command to http://XXXXXXXXXCHANGEDONPURPOSEXXXXXXXX:8680/solr; WaitForState: action=PREPRECOVERY&core=search1&nodeName=XXXXXXXXXXXXXXCHANGEDONPURPOSEXXXXXXXXXXXXXXX%3A8580_solr&coreNodeName=core_node223&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
> 22:00:42,006 INFO [org.apache.solr.common.cloud.ZkStateReader] (zkCallback-2-thread-66) A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 22)
>
> 15 GB of heap is allocated to these searchers, and usage hardly ever goes over 8 GB. Full GCs do not happen. We write GC logs to a separate file and also monitored these processes via VisualVM. They seem to have enough memory.
>
> On Mon, Jan 26, 2015 at 5:13 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 1/26/2015 2:26 PM, Vijay Sekhri wrote:
>> > Hi Erick,
>> > In the solr.xml file I had the zk timeout set to <int name="zkClientTimeout">${zkClientTimeout:450000}</int>.
>> > One thing that made it a bit better now is the zk tickTime and syncLimit settings. I set them to higher values, as below. This may not be advisable, though.
>> >
>> > tickTime=30000
>> > initLimit=30
>> > syncLimit=20
>> >
>> > Now we observe that replicas do not go into recovery as often as before. In the whole cluster at a given time I would have a couple of replicas in recovery, whereas earlier there were multiple replicas from every shard. The FAQ on the wiki https://wiki.apache.org/solr/SolrCloud says "The maximum is 20 times the tickTime.", so I decided to increase the tick time. Is this the correct approach?
>>
>> The default zkClientTimeout on recent Solr versions is 30 seconds, up from 15 in slightly older releases.
>>
>> Those values of 15 or 30 seconds are a REALLY long time in computer terms, and if you are exceeding that timeout on a regular basis, something is VERY wrong with your Solr install. Rather than take steps to increase your timeout beyond the normal maximum of 40 seconds (20 times a tickTime of 2 seconds), figure out why you're exceeding that timeout and fix the performance problem. The zkClientTimeout value that you have set, 450 seconds, is seven and a half *MINUTES*. Nothing in Solr should ever take that long.
>>
>> "Not enough memory in the server" is by far the most common culprit for performance issues. Garbage collection pauses are a close second.
>>
>> I don't actually know this next part for sure, because I've never looked into the code, but I believe that increasing the tickTime, especially to a value 15 times higher than default, might make all zookeeper operations a lot slower.
>>
>> Thanks,
>> Shawn
>>
>
> --
> *********************************************
> Vijay Sekhri
> *********************************************

--
*********************************************
Vijay Sekhri
*********************************************
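[Editor's note: to put the numbers from this thread in one place: ZooKeeper negotiates each client session timeout into the range [2 * tickTime, 20 * tickTime], which is why a 450000 ms zkClientTimeout has no effect with the default tickTime of 2000 ms but does take effect once tickTime is raised to 30000 ms. A minimal sketch of that clamping rule (function name is mine, not a ZooKeeper API):]

```python
def effective_session_timeout(requested_ms: int, tick_ms: int) -> int:
    # ZooKeeper clamps the negotiated session timeout to
    # [2 * tickTime, 20 * tickTime], per its admin documentation.
    return min(max(requested_ms, 2 * tick_ms), 20 * tick_ms)

# Default tickTime=2000 ms: the 450000 ms request is clamped to 40 s.
print(effective_session_timeout(450_000, 2_000))   # 40000
# With tickTime=30000 ms the cap is 600 s, so the full 450 s applies.
print(effective_session_timeout(450_000, 30_000))  # 450000
```

This is also why Shawn's advice holds: raising tickTime to make a huge timeout "fit" only masks the underlying pause that is breaking the session in the first place.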