Many thanks for this too, I am digging into this now.

Jon
-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, August 26, 2016 5:01 PM
To: solr-user
Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that 
stubbornly remains down

OK, this is starting to get worrisome. This _looks_ like your index is somehow 
corrupt. Let's back up a bit:

Before you do the ADDREPLICA, is your system healthy? That is, do you have at 
least one leader for each shard that's "active"? You say it looks OK for a 
while; can you successfully query the cluster? And can you successfully query 
every individual replica (use &distrib=false and point at specific cores to 
verify this)?
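
If it helps, a single-replica query with distrib=false looks something like 
this (host, port and core name here are only examples patterned after your 
logs; substitute your own):

  curl "http://localhost:8983/solr/transcribedReports_shard1_replica3/select?q=*:*&rows=0&distrib=false"

If each core comes back with a sensible numFound, the individual indexes are 
at least readable.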

I'd pull out the CheckIndex tool; here's a brief KB on it:
https://support.lucidworks.com/hc/en-us/articles/202091128-How-to-deal-with-Index-Corruption
Use it to verify at least one replica for each shard.

Note the caveat there: the "-fix" option WILL DROP segments it doesn't think 
are OK, so don't use it yet.
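
FWIW, a read-only check (no "-fix") can be run directly against a core's index 
directory with the lucene-core jar that ships with Solr, something along these 
lines (the jar version and both paths below are guesses modelled on your setup, 
so adjust them to your install, and run it while the core isn't being written 
to):

  java -cp server\solr-webapp\webapp\WEB-INF\lib\lucene-core-6.0.1.jar org.apache.lucene.index.CheckIndex E:\solr_home\transcribedReports_shard1_replica3\data\index

With no options it only reports problems and changes nothing.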

So let's assume you can run this successfully on at least one replica for each 
shard. I'd then disable all the other replicas and restart my Solrs. The 
"trick" to disabling them is to find the associated "core.properties" file and 
rename it to something else. Don't worry, this won't remove any data, and you 
can undo it later by renaming the file back to "core.properties" and restarting 
your Solrs. This is "core discovery"; see:
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties.
You've wisely stopped indexing, so the cores won't get out of sync.
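
On Windows that's just a rename, e.g. (the path below is only an example 
modelled on the paths in your logs):

  ren E:\solr_home\transcribedReports_shard1_replica3\core.properties core.properties.disabled

Rename it back to core.properties and restart Solr to bring the core back.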

The goal here is to have a healthy, leader-only cluster that you know is OK 
and that you can query. If you get this far, try the ADDREPLICA again. Assuming 
that works, it's really up to you whether to just ADDREPLICA your way back up 
to a full cluster and nuke all the old replicas, or try to recover the old 
ones.
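
For reference, ADDREPLICA is just a Collections API call, something like this 
(collection, shard and node values are only examples, use your own; node is 
optional if you'd rather let Solr pick the placement):

  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=transcribedReports&shard=shard1&node=192.168.1.21:8983_solr"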

Mind you, this is largely shooting in the dark since this is very peculiar.
Did you have any weird errors, disk full and the like? Note that looking at 
your current free disk space is insufficient, since at times you need as much 
free space as your indexes already occupy.

Best,
Erick

On Thu, Aug 25, 2016 at 10:48 PM, Jon Hawkesworth < 
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Thanks for your suggestion.  Below is a chunk of info from the logging 
> in the solr admin page.  Is there somewhere else I should be looking too?
>
>
>
> It looks to me like it's stuck in a never-ending loop of attempting 
> recovery and failing.
>
>
>
> I don't know if the warnings from IndexFetcher are relevant or not, 
> and if they are, what I can do about them.
>
>
>
> Our system has been feeding 150k docs a day into this cluster for 
> nearly two months now.  I have a backlog of approx 45 million more 
> documents I need to get loaded, but until I have a healthy-looking 
> cluster it would be foolish to start loading even more.
>
>
>
> Jon
>
>
>
>
>
> Time (Local) | Level | Core | Logger | Message
>
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is checksum 3450541029. expected length is 108 and actual length is 108
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fnm did not match. expected checksum is 2714900770 and actual is checksum 1393668596. expected length is 1265 and actual length is 1265
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.doc did not match. expected checksum is 1374818988 and actual is checksum 1039421217. expected length is 110 and actual length is 433
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tim did not match. expected checksum is 1001343351 and actual is checksum 3395571641. expected length is 2025 and actual length is 7662
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tip did not match. expected checksum is 814607015 and actual is checksum 1271109784. expected length is 301 and actual length is 421
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene54_0.dvd did not match. expected checksum is 875968405 and actual is checksum 4024097898. expected length is 96 and actual length is 144
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.si did not match. expected checksum is 2341973651 and actual is checksum 281320882. expected length is 535 and actual length is 535
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdx did not match. expected checksum is 2874533507 and actual is checksum 3545673052. expected length is 84 and actual length is 84
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvd did not match. expected checksum is 663721296 and actual is checksum 1107475498. expected length is 59 and actual length is 68
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is checksum 471758721. expected length is 1109 and actual length is 7185
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File segments_h7g8 did not match. expected checksum is 2040860271 and actual is checksum 1873911116. expected length is 2056 and actual length is 1926
> 8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000321 refcount=2} active=true starting pos=0
> 8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
> 8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
> 8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000322 refcount=2} active=true starting pos=0
> 8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdt did not match. expected checksum is 4059848174 and actual is checksum 4234063128. expected length is 3060 and actual length is 1772
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdx did not match. expected checksum is 2421590578 and actual is checksum 1492609115. expected length is 84 and actual length is 84
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvd did not match. expected checksum is 2898024557 and actual is checksum 3762900089. expected length is 99 and actual length is 97
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.si did not match. expected checksum is 730964774 and actual is checksum 1292368805. expected length is 535 and actual length is 535
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvd did not match. expected checksum is 2920743481 and actual is checksum 2869652522. expected length is 59 and actual length is 59
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvm did not match. expected checksum is 328126313 and actual is checksum 1484623710. expected length is 108 and actual length is 108
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvm did not match. expected checksum is 3300364001 and actual is checksum 3819493713. expected length is 312 and actual length is 312
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.pos did not match. expected checksum is 93672274 and actual is checksum 1261080786. expected length is 845 and actual length is 403
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fnm did not match. expected checksum is 3818945769 and actual is checksum 2677014577. expected length is 1265 and actual length is 1265
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.doc did not match. expected checksum is 2160820791 and actual is checksum 3998191027. expected length is 110 and actual length is 110
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tim did not match. expected checksum is 2334039520 and actual is checksum 2647062877. expected length is 3923 and actual length is 2573
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tip did not match. expected checksum is 3632944779 and actual is checksum 1632973124. expected length is 325 and actual length is 304
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File segments_h7gk did not match. expected checksum is 633644562 and actual is checksum 1552102209. expected length is 2121 and actual length is 2056
> 8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000323 refcount=2} active=true starting pos=0
> 8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
> 8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
> 8/26/2016, 6:22:32 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.00
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, August 25, 2016 10:35 PM
> To: solr-user
> Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that 
> stubbornly remains down
>
>
>
> This is odd. The ADDREPLICA _should_ be immediately listed as "down", 
> but should shortly go to "recovering" and then "active". The transition 
> to "active" may take a while as the index has to be copied from the 
> leader, but you shouldn't be stuck at "down" for very long.
>
>
>
> Take a look at the Solr logs for both the leader of the shard and the 
> replica you're trying to add. They often have more complete and 
> helpful error messages...
>
>
>
> Also note that you occasionally have to be patient. For instance, there's a 
> 3 minute wait period for leader election at times. It sounds, though, like 
> things aren't getting better for far longer than 3 minutes.
>
>
>
> Best,
>
> Erick
>
>
>
> On Thu, Aug 25, 2016 at 2:00 PM, Jon Hawkesworth < 
> jon.hawkeswo...@medquist.onmicrosoft.com> wrote:
>
>
>
> > Does anyone have any suggestions for how I can fix my solrcloud 6.0.1 
> > "replica remains down" issue?
>
> >
>
> >
>
> >
>
> > Today we stopped all the loading and querying, brought down all 4 solr 
> > nodes, went into zookeeper and deleted everything under 
> > /collections/transcribedReports/leader_initiated_recovery/shard1/, and 
> > brought the cluster back up (this seemed to be a reasonably similar 
> > situation to https://issues.apache.org/jira/browse/SOLR-7021, where this 
> > workaround is described, albeit for an older version of solr).
>
> >
>
> >
>
> >
>
> > After a while things looked ok, but when we attempted to move the 
> > second replica back to the original node (by creating a third and then 
> > deleting the temporary one which wasn't on the node we wanted it on), we 
> > immediately got a 'down' status on the node (and it's stayed that way 
> > ever since), with 'Could not publish as ACTIVE after succesful 
> > recovery' messages appearing in the logs.
>
> >
>
> >
>
> >
>
> > It's as if there is something specifically wrong with that node that 
> > stops us from ever having a functioning replica of shard1 on it.
>
> >
>
> >
>
> >
>
> > The weird thing is that shard2 on the same (problematic) node seems fine.
>
> >
>
> >
>
> >
>
> > Other stuff we have tried includes:
> >
> > issuing a REQUESTRECOVERY
> > moving from 2 to 4 nodes
> > adding more replicas on other nodes (new replicas immediately go into 
> > down state and stay that way).
>
> >
>
> >
>
> >
>
> > The system is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection 
> > with 4 shards, and I'm trying to have 2 replicas on each of the 4 nodes.
> >
> > Currently each shard is managing approx 1.2 million docs (mostly just 
> > text, usually 10-20k in size each).
>
> >
>
> >
>
> >
>
> > Any suggestions would be greatly appreciated.
>
> >
>
> >
>
> >
>
> > Many thanks,
>
> >
>
> >
>
> >
>
> > Jon
>
