Thanks for your suggestion.  Here's a chunk of info from the logging in the 
solr admin page below.  Is there somewhere else I should be looking too?

It looks to me like it's stuck in a never-ending loop of recovery attempts
that fail.

I don't know whether the warnings from IndexFetcher are relevant and, if they
are, what I can do about them.

Our system has been feeding 150k docs a day into this cluster for nearly two
months now.  I have a backlog of approx. 45 million more documents I need to get
loaded, but until I have a healthy-looking cluster it would be foolish to start
loading even more.

Jon


Time (Local) | Level | Core | Logger | Message

8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is checksum 3450541029. expected length is 108 and actual length is 108
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fnm did not match. expected checksum is 2714900770 and actual is checksum 1393668596. expected length is 1265 and actual length is 1265
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.doc did not match. expected checksum is 1374818988 and actual is checksum 1039421217. expected length is 110 and actual length is 433
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tim did not match. expected checksum is 1001343351 and actual is checksum 3395571641. expected length is 2025 and actual length is 7662
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tip did not match. expected checksum is 814607015 and actual is checksum 1271109784. expected length is 301 and actual length is 421
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene54_0.dvd did not match. expected checksum is 875968405 and actual is checksum 4024097898. expected length is 96 and actual length is 144
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.si did not match. expected checksum is 2341973651 and actual is checksum 281320882. expected length is 535 and actual length is 535
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdx did not match. expected checksum is 2874533507 and actual is checksum 3545673052. expected length is 84 and actual length is 84
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvd did not match. expected checksum is 663721296 and actual is checksum 1107475498. expected length is 59 and actual length is 68
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is checksum 471758721. expected length is 1109 and actual length is 7185
8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File segments_h7g8 did not match. expected checksum is 2040860271 and actual is checksum 1873911116. expected length is 2056 and actual length is 1926
8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000321 refcount=2} active=true starting pos=0
8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000322 refcount=2} active=true starting pos=0
8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdt did not match. expected checksum is 4059848174 and actual is checksum 4234063128. expected length is 3060 and actual length is 1772
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdx did not match. expected checksum is 2421590578 and actual is checksum 1492609115. expected length is 84 and actual length is 84
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvd did not match. expected checksum is 2898024557 and actual is checksum 3762900089. expected length is 99 and actual length is 97
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.si did not match. expected checksum is 730964774 and actual is checksum 1292368805. expected length is 535 and actual length is 535
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvd did not match. expected checksum is 2920743481 and actual is checksum 2869652522. expected length is 59 and actual length is 59
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvm did not match. expected checksum is 328126313 and actual is checksum 1484623710. expected length is 108 and actual length is 108
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvm did not match. expected checksum is 3300364001 and actual is checksum 3819493713. expected length is 312 and actual length is 312
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.pos did not match. expected checksum is 93672274 and actual is checksum 1261080786. expected length is 845 and actual length is 403
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fnm did not match. expected checksum is 3818945769 and actual is checksum 2677014577. expected length is 1265 and actual length is 1265
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.doc did not match. expected checksum is 2160820791 and actual is checksum 3998191027. expected length is 110 and actual length is 110
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tim did not match. expected checksum is 2334039520 and actual is checksum 2647062877. expected length is 3923 and actual length is 2573
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tip did not match. expected checksum is 3632944779 and actual is checksum 1632973124. expected length is 325 and actual length is 304
8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File segments_h7gk did not match. expected checksum is 633644562 and actual is checksum 1552102209. expected length is 2121 and actual length is 2056
8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000323 refcount=2} active=true starting pos=0
8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
8/26/2016, 6:22:32 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.00
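
To see at a glance which of the IndexFetcher warnings above involve a changed file length and which differ only in checksum, the entries can be tallied with a short script. This is only a rough sketch: the function name `summarize_mismatches` and the `sample` text are made up for illustration (the two sample entries are copied from the log above), and it assumes the log has been pasted or saved as plain text. It only sorts the warnings; it says nothing about why the files differ.

```python
import re

# Matches one IndexFetcher "did not match" warning. \s+ is used between
# words so the pattern still matches when the mail client wrapped the line.
MISMATCH = re.compile(
    r"File\s+(?P<name>\S+)\s+did\s+not\s+match\.\s+"
    r"expected\s+checksum\s+is\s+(?P<exp_sum>\d+)\s+and\s+actual\s+is\s+"
    r"checksum\s+(?P<act_sum>\d+)\.\s+"
    r"expected\s+length\s+is\s+(?P<exp_len>\d+)\s+and\s+actual\s+length\s+is\s+"
    r"(?P<act_len>\d+)"
)


def summarize_mismatches(log_text):
    """Split mismatched files into those whose length differs and those
    where only the checksum differs (hypothetical helper)."""
    length_differs, checksum_only = [], []
    for m in MISMATCH.finditer(log_text):
        if m.group("exp_len") != m.group("act_len"):
            length_differs.append(m.group("name"))
        else:
            checksum_only.append(m.group("name"))
    return {"length_differs": length_differs, "checksum_only": checksum_only}


# Two entries copied from the log above, wrapped as they were in the mail.
sample = """
File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is
checksum 3450541029. expected length is 108 and actual length is 108
File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is
checksum 471758721. expected length is 1109 and actual length is 7185
"""

if __name__ == "__main__":
    print(summarize_mismatches(sample))
```

Running it over the full log text instead of `sample` would give the same two lists for every recovery attempt in the paste.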



-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, August 25, 2016 10:35 PM
To: solr-user
Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down



This is odd. The ADDREPLICA _should_ be immediately listed as "down", but
should shortly go to "recovering" and then "active". The transition to "active"
may take a while as the index has to be copied from the leader, but you
shouldn't be stuck at "down" for very long.

Take a look at the Solr logs for both the leader of the shard and the replica
you're trying to add. They often have more complete and helpful error
messages...

Also note that you occasionally have to be patient. For instance, there's a
3 minute wait period for leader election at times. It sounds, though, like
things aren't getting better for far longer than 3 minutes.
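
One way to watch for the "down" to "recovering" to "active" transition without refreshing the admin page is to poll the Collections API CLUSTERSTATUS action and list every replica that is not yet active. A minimal sketch, assuming the JSON response has already been fetched and parsed into a dict: the helper names, the host names, the core_node names and the `sample_status` data are all hypothetical, and only the collection and core names are taken from this thread.

```python
def replica_states(cluster_status, collection):
    """Flatten a parsed CLUSTERSTATUS response into
    (shard, replica, node_name, state, is_leader) tuples."""
    rows = []
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in sorted(shards.items()):
        for replica_name, replica in sorted(shard["replicas"].items()):
            rows.append((
                shard_name,
                replica_name,
                replica.get("node_name"),
                replica.get("state"),
                replica.get("leader") == "true",
            ))
    return rows


def not_active(rows):
    """Keep only replicas whose reported state is something other than active."""
    return [row for row in rows if row[3] != "active"]


# Hypothetical, trimmed-down response of the kind returned by
#   /solr/admin/collections?action=CLUSTERSTATUS&collection=transcribedReports&wt=json
# Host and core_node names are invented for the example.
sample_status = {
    "cluster": {
        "collections": {
            "transcribedReports": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {
                                "core": "transcribedReports_shard1_replica1",
                                "node_name": "host1:8983_solr",
                                "state": "active",
                                "leader": "true",
                            },
                            "core_node5": {
                                "core": "transcribedReports_shard1_replica3",
                                "node_name": "host2:8983_solr",
                                "state": "down",
                            },
                        }
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    for row in not_active(replica_states(sample_status, "transcribedReports")):
        print(row)
```

Polling this every minute or so and noting how long a replica stays in "down" or "recovering" makes it easier to tell a slow index copy from a recovery that keeps failing.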



Best,

Erick



On Thu, Aug 25, 2016 at 2:00 PM, Jon Hawkesworth
<jon.hawkeswo...@medquist.onmicrosoft.com> wrote:



> Anyone got any suggestions how I can fix up my solrcloud 6.0.1 replica
> remains down issue?
>
> Today we stopped all the loading and querying, brought down all 4 solr
> nodes, went into zookeeper and deleted everything under
> /collections/transcribedReports/leader_initiated_recovery/shard1/ and
> brought the cluster back up (this seeming to be a reasonably similar
> situation to https://issues.apache.org/jira/browse/SOLR-7021 where this
> workaround is described, albeit for an older version of solr).
>
> After a while things looked ok but when we attempted to move the
> second replica back to the original node (by creating a third and then
> deleting the temp one which wasn't on the node we wanted it on), we
> immediately got a 'down' status on the node (and it's stayed that way
> ever since), with 'Could not publish as ACTIVE after succesful
> recovery' messages appearing in the logs.
>
> It's as if there is something specifically wrong with that node that
> stops us from ever having a functioning replica of shard1 on it.
>
> Weird thing is shard2 on the same (problematic) node seems fine.
>
> Other stuff we have tried includes
>
> issuing a REQUESTRECOVERY
>
> moving from 2 to 4 nodes
>
> adding more replicas on other nodes (new replicas immediately go into
> down state and stay that way).
>
> System is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection
> with 4 shards and I'm trying to have 2 replicas on each of the 4 nodes.
>
> Currently each shard is managing approx 1.2 million docs (mostly just
> text 10-20k in size each usually).
>
> Any suggestions would be gratefully appreciated.
>
> Many thanks,
>
> Jon



Jon Hawkesworth
Software Developer



Hanley Road, Malvern, WR13 6NP. UK
O: +44 (0) 1684 312313
jon.hawkeswo...@mmodal.com
www.mmodal.com<http://www.medquist.com/>

This electronic mail transmission contains confidential information intended 
only for the person(s) named. Any use, distribution, copying or disclosure by 
another person is strictly prohibited. If you are not the intended recipient of 
this e-mail, promptly delete it and all attachments.
