> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.

I did not monitor major network throughput during that timeframe, and I
thought the first log message already showed that PeerSync had failed, so I
am trying to understand where the time was spent. Also, in our solr.log I
did not see any messages such as "Recovery - retry(1)", "Recovery -
retry(2)", or "Recovery give up" before the log reports "tlog replay".

2015-09-21 9:07 GMT-04:00 Shalin Shekhar Mangar <shalinman...@gmail.com>:

> Hi Jeff,
>
> Comments inline:
>
> On Mon, Sep 21, 2015 at 6:06 PM, Jeff Wu <wuhai...@gmail.com> wrote:
> > Our environment ran in Solr4.7. Recently hit a core recovery failure and
> > then it retries to recover from tlog.
> >
> > We noticed after 20:05:22 said Recovery failed, Solr server waited a long
> > time before it started tlog replay. During that time, we have about 32
> > cores doing such tlog replay. The service took over 40 minutes to make whole
> > service back.
> >
> > Some questions we want to know:
> > 1. Is tlog replay a single thread activity? Can we configure to have
> > multiple threads since in our deployment we have 64 cores for each solr
> > server.
>
> Each core gets a separate recovery thread but each individual log
> replay is single-threaded
>
> > 2. What might cause the tlog replay thread to wait for over 15 minutes
> > before actual tlog replay? The actual replay seems very quick.
>
> Before tlog replay, the replica will replicate any missing index files
> from the leader. I think that is what is causing the time between the
> two log messages. You have INFO logging turned off so there are no
> messages from the replication handler about it.
>
> > 3. The last message "Log replay finished" does not tell which core it is
> > finished. Given 32 cores to recover, we can not know which core the log is
> > reporting.
>
> Yeah, many such issues were fixed in recent 5.x releases where we use
> MDC to log collection, shard, core etc for each message. Furthermore,
> tlog replay progress/status is also logged since 5.0
>
> > 4. We know 4.7 is pretty old, we'd like to know is this known issue and
> > fixed in late release, any related JIRA?
> >
> > Line 4120: ERROR - 2015-09-16 20:05:22.396;
> > org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again...
> > (0) core=collection3_shard11_replica2
> > WARN - 2015-09-16 20:22:50.343;
> > org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay
> > tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498
> > refcount=2} active=true starting pos=25981
> > WARN - 2015-09-16 20:22:53.301;
> > org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished.
> > recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0
> > positionOfStart=25981}
> >
> > Thank you all~
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
Jeff Wu
---------------------------
CSDL
Beijing, China
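
P.S. To put numbers on the gap discussed above, here is a rough sketch of how the delay between consecutive log messages could be measured. It assumes the "LEVEL - timestamp; logger; message" layout seen in the three lines quoted in this thread; the regex and the function names (parse_events, gaps) are only illustrative and are not part of Solr.

```python
import re
from datetime import datetime

# The three log lines quoted in this thread, with wrapped lines rejoined and
# the "Line 4120:" prefix dropped.
LOG = """\
ERROR - 2015-09-16 20:05:22.396; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (0) core=collection3_shard11_replica2
WARN - 2015-09-16 20:22:50.343; org.apache.solr.update.UpdateLog$LogReplayer; Starting log replay tlog{file=/mnt/solrdata1/solr/home/collection3_shard11_replica2/data/tlog/tlog.0000000000000120498 refcount=2} active=true starting pos=25981
WARN - 2015-09-16 20:22:53.301; org.apache.solr.update.UpdateLog$LogReplayer; Log replay finished. recoveryInfo=RecoveryInfo{adds=914 deletes=215 deleteByQuery=0 errors=0 positionOfStart=25981}
"""

# Assumed layout: "LEVEL - yyyy-MM-dd HH:mm:ss.SSS; logger; message".
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+) - "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); "
    r"(?P<logger>[^;]+); "
    r"(?P<msg>.*)$"
)


def parse_events(log_text):
    """Return (timestamp, logger, message) tuples for lines matching LINE_RE."""
    events = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        events.append((ts, m.group("logger"), m.group("msg")))
    return events


def gaps(events):
    """Return (seconds, earlier message, later message) for adjacent events."""
    return [
        ((b[0] - a[0]).total_seconds(), a[2], b[2])
        for a, b in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for seconds, before, after in gaps(parse_events(LOG)):
        print(f"{seconds:10.3f}s  {before[:30]!r} -> {after[:30]!r}")
```

On the lines above this shows roughly 17.5 minutes between "Recovery failed" and "Starting log replay", against about 3 seconds for the replay itself. With 32 cores recovering at once the messages interleave, and since the replay lines carry no core name in 4.7, the pairing is only reliable on a log filtered down to one core.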