>
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.


I did not monitor network throughput during that timeframe, and I thought
the first log message already showed that PeerSync had failed. So I am
trying to understand where the time was spent.

Also, in our solr.log, I did not see any messages reporting "Recovery -
retry(1)", "Recovery - retry(2)", "Recovery give up", etc. before the one
that tells us about "tlog replay".
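
For reference, here is a minimal Python sketch of how the gap can be measured
from the three log lines quoted further down. The regex and the helper names
(parse_ts, gaps) are my own assumptions for illustration, not anything that
ships with Solr, and the timestamp layout is assumed to match those lines.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt quoted in this thread (re-joined).
LINES = [
    "ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; "
    "Recovery failed - trying again... (0) core=collection3_shard11_replica2",
    "WARN  - 2015-09-16 20:22:50.343; "
    "org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay",
    "WARN  - 2015-09-16 20:22:53.301; "
    "org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.",
]

# Assumed layout: "<LEVEL> - <yyyy-MM-dd HH:mm:ss.SSS>; <logger>; <message>"
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3});")


def parse_ts(line):
    """Extract the timestamp of one log line as a datetime."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in: " + line)
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")


def gaps(lines):
    """Return the seconds elapsed between each pair of consecutive lines."""
    stamps = [parse_ts(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    wait, replay = gaps(LINES)
    print("wait before replay: %.3f s" % wait)
    print("replay duration:    %.3f s" % replay)
```

On the quoted lines this gives about 1047.9 seconds (roughly 17.5 minutes)
between "Recovery failed" and "Starting log replay", and about 3 seconds for
the replay itself.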

2015-09-21 9:07 GMT-04:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:

> Hi Jeff,
>
> Comments inline:
>
> On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> > Our environment ran in Solr4.7. Recently hit a core recovery failure and
> > then it retries to recover from tlog.
> >
> > We noticed after  20:05:22 said Recovery failed, Solr server waited a long
> > time before it started tlog replay. During that time, we have about 32
> > cores doing such tlog replay. The service took over 40 minutes to make whole
> > service back.
> >
> > Some questions we want to know:
> > 1. Is tlog replay a single thread activity? Can we configure to have
> > multiple threads since in our deployment we have 64 cores for each solr
> > server.
>
> Each core gets a separate recovery thread but each individual log
> replay is single-threaded
>
> >
> > 2. What might cause the tlog replay thread to wait for over 15 minutes
> > before actual tlog replay?  The actual replay seems very quick.
>
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.
>
> >
> > 3. The last message "Log replay finished" does not tell which core it is
> > finished. Given 32 cores to recover, we can not know which core the log is
> > reporting.
>
> Yeah, many such issues were fixed in recent 5.x releases where we use
> MDC to log collection, shard, core etc for each message. Furthermore,
> tlog replay progress/status is also logged since 5.0
>
> >
> > 4. We know 4.7 is pretty old, we'd like to know is this known issue and
> > fixed in late release, any related JIRA?
> >
> > Line 4120: ERROR - 2015-09-16 20:05:22.396;
> > org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> > (0) core=collection3_shard11_replica2
> > WARN  - 2015-09-16 20:22:50.343;
> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> > tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498
> > refcount=2} active=true starting pos=25981
> > WARN  - 2015-09-16 20:22:53.301;
> > org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> > recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> > positionOfStart=25981}
> >
> > Thank you all~
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
Jeff Wu
---------------------------
CSDL Beijing, China
