OK, this is starting to get worrisome. This _looks_ like your index is
somehow corrupt. Let's back up a bit:

Before you do the ADDREPLICA, is your system healthy? That is, do you have
at least one leader that's "active" for each shard? You say it looks OK for
a while; can you successfully query the cluster? Can you successfully query
every individual replica? (Use &distrib=false and point at specific cores to
verify this.)
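For example, something like this against each core in turn (host, port and
core name here are just placeholders, adjust to your own):

  curl "http://localhost:8983/solr/transcribedReports_shard1_replica1/select?q=*:*&rows=0&distrib=false"

If every core answers with a numFound and no errors, the replicas themselves
are at least readable.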

I'd pull out the CheckIndex tool and use it to verify at least one replica
for each shard; here's a brief KB article on it:
https://support.lucidworks.com/hc/en-us/articles/202091128-How-to-deal-with-Index-Corruption

Note the caveat there: the "-fix" option WILL DROP segments it doesn't
think are OK, so don't use it yet.
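For reference, CheckIndex is just a Lucene command-line tool run against a
core's index directory while that core isn't being written to. A rough
sketch only (jar version and paths are illustrative, check your own install
for the exact locations):

  java -cp server/solr-webapp/webapp/WEB-INF/lib/lucene-core-6.0.1.jar \
       org.apache.lucene.index.CheckIndex <solr_home>/<core_dir>/data/index

Run without -fix it only reports problems, which is all you want at this
stage.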

So let's assume you can run this successfully on at least one replica for
each shard. I'd then disable all the other replicas and restart my Solrs.
The "trick" to disabling a replica is to find its associated
"core.properties" file and rename it to something else. Don't worry, this
doesn't remove any data; you can get the replica back later by renaming the
file back to "core.properties" and restarting Solr. This is "core
discovery", see:
https://cwiki.apache.org/confluence/display/solr/Defining+core.properties.
You've wisely stopped indexing, so the cores won't get out of sync.
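Purely as an illustration (the directory name is made up, substitute your
real core directory; on Windows use ren or rename in Explorer):

  cd <solr_home>/transcribedReports_shard1_replica2
  mv core.properties core.properties.disabled    # core won't be discovered on next start
  # later, to bring it back:
  mv core.properties.disabled core.properties

and restart Solr after each rename so core discovery picks up the change.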

The goal here is to have a healthy, leader-only cluster that you know is OK
and can query. If you get that far, try the ADDREPLICA again. Assuming it
works, it's really up to you whether to just ADDREPLICA your way back to a
full cluster and nuke all the old replicas, or try to recover the old ones.
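If you do go the rebuild route, that's just the Collections API again,
roughly like this (the collection name is yours; the shard/node/replica
values below are placeholders):

  curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=transcribedReports&shard=shard1&node=host2:8983_solr"
  curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=transcribedReports&shard=shard1&replica=core_node5"

DELETEREPLICA on a downed replica also removes its entry from the cluster
state, which is usually what you want before adding a fresh one.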

Mind you, this is largely shooting in the dark since this is very peculiar.
Did you have any weird errors, disk full and the like? Note that looking at
your current disk is insufficient, since at times you need as much free
space on your disk as your indexes already occupy.
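As a rough check (Linux-style commands shown just for illustration; on
Windows compare the index folder sizes against free space on the drive):

  du -sh <solr_home>/*/data/index    # size of each core's index
  df -h <solr_home>                  # free space on that volume

If free space isn't comfortably larger than the biggest index, replication
and merging can fail in ways that look a lot like corruption.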

Best,
Erick

On Thu, Aug 25, 2016 at 10:48 PM, Jon Hawkesworth <
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Thanks for your suggestion.  Here's a chunk of info from the logging in
> the solr admin page below.  Is there somewhere else I should be looking too?
>
>
>
> It looks to me like it's stuck in a never-ending loop of attempting
> recovery that fails.
>
>
>
> I don't know if the warnings from IndexFetcher are relevant or not, and if
> they are, what I can do about them?
>
>
>
> Our system has been feeding 150k docs a day into this cluster for nearly
> two months now.  I have a backlog of approx 45 million more documents I need
> to get loaded, but until I have a healthy-looking cluster it would be
> foolish to start loading even more.
>
>
>
> Jon
>
>
>
>
>
> Time (Local) | Level | Core | Logger | Message
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvm did not match. expected checksum is 1754812894 and actual is checksum 3450541029. expected length is 108 and actual length is 108
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fnm did not match. expected checksum is 2714900770 and actual is checksum 1393668596. expected length is 1265 and actual length is 1265
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.doc did not match. expected checksum is 1374818988 and actual is checksum 1039421217. expected length is 110 and actual length is 433
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tim did not match. expected checksum is 1001343351 and actual is checksum 3395571641. expected length is 2025 and actual length is 7662
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene50_0.tip did not match. expected checksum is 814607015 and actual is checksum 1271109784. expected length is 301 and actual length is 421
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b_Lucene54_0.dvd did not match. expected checksum is 875968405 and actual is checksum 4024097898. expected length is 96 and actual length is 144
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.si did not match. expected checksum is 2341973651 and actual is checksum 281320882. expected length is 535 and actual length is 535
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdx did not match. expected checksum is 2874533507 and actual is checksum 3545673052. expected length is 84 and actual length is 84
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.nvd did not match. expected checksum is 663721296 and actual is checksum 1107475498. expected length is 59 and actual length is 68
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File _iu0b.fdt did not match. expected checksum is 2953417110 and actual is checksum 471758721. expected length is 1109 and actual length is 7185
> 8/26/2016, 6:17:52 AM | WARN | false | IndexFetcher | File segments_h7g8 did not match. expected checksum is 2040860271 and actual is checksum 1873911116. expected length is 2056 and actual length is 1926
> 8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000321 refcount=2} active=true starting pos=0
> 8/26/2016, 6:17:53 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
> 8/26/2016, 6:17:53 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
> 8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000322 refcount=2} active=true starting pos=0
> 8/26/2016, 6:18:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdt did not match. expected checksum is 4059848174 and actual is checksum 4234063128. expected length is 3060 and actual length is 1772
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fdx did not match. expected checksum is 2421590578 and actual is checksum 1492609115. expected length is 84 and actual length is 84
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvd did not match. expected checksum is 2898024557 and actual is checksum 3762900089. expected length is 99 and actual length is 97
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.si did not match. expected checksum is 730964774 and actual is checksum 1292368805. expected length is 535 and actual length is 535
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvd did not match. expected checksum is 2920743481 and actual is checksum 2869652522. expected length is 59 and actual length is 59
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.nvm did not match. expected checksum is 328126313 and actual is checksum 1484623710. expected length is 108 and actual length is 108
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene54_0.dvm did not match. expected checksum is 3300364001 and actual is checksum 3819493713. expected length is 312 and actual length is 312
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.pos did not match. expected checksum is 93672274 and actual is checksum 1261080786. expected length is 845 and actual length is 403
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x.fnm did not match. expected checksum is 3818945769 and actual is checksum 2677014577. expected length is 1265 and actual length is 1265
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.doc did not match. expected checksum is 2160820791 and actual is checksum 3998191027. expected length is 110 and actual length is 110
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tim did not match. expected checksum is 2334039520 and actual is checksum 2647062877. expected length is 3923 and actual length is 2573
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File _iu0x_Lucene50_0.tip did not match. expected checksum is 3632944779 and actual is checksum 1632973124. expected length is 325 and actual length is 304
> 8/26/2016, 6:22:12 AM | WARN | false | IndexFetcher | File segments_h7gk did not match. expected checksum is 633644562 and actual is checksum 1552102209. expected length is 2121 and actual length is 2056
> 8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.0000000000000000323 refcount=2} active=true starting pos=0
> 8/26/2016, 6:22:13 AM | WARN | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=1 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
> 8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Could not publish as ACTIVE after succesful recovery
> 8/26/2016, 6:22:13 AM | ERROR | false | RecoveryStrategy | Recovery failed - trying again... (0)
> 8/26/2016, 6:22:32 AM | WARN | false | UpdateLog | Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_replica3\data\tlog\tlog.00
>
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Thursday, August 25, 2016 10:35 PM
> To: solr-user
> Subject: Re: solrcloud 6.0.1 any suggestions for fixing a replica that
> stubbornly remains down
>
>
>
> This is odd. The ADDREPLICA _should_ be immediately listed as "down", but
> should shortly go to "recovering" and then "active". The transition to
> "active" may take a while as the index has to be copied from the leader,
> but you shouldn't be stuck at "down" for very long.
>
>
>
> Take a look at the Solr logs for both the leader of the shard and the
> replica you're trying to add. They often have more complete and helpful
> error messages...
>
>
>
> Also note that you occasionally have to be patient. For instance, there's a
> 3-minute wait period for leader election at times. It sounds, though, like
> things aren't getting better for far longer than 3 minutes.
>
>
>
> Best,
>
> Erick
>
>
>
> On Thu, Aug 25, 2016 at 2:00 PM, Jon Hawkesworth <
> jon.hawkeswo...@medquist.onmicrosoft.com> wrote:
>
>
>
> > Anyone got any suggestions how I can fix up my solrcloud 6.0.1 replica
>
> > remains down issue?
>
> >
>
> >
>
> >
>
> > Today we stopped all the loading and querying, brought down all 4 solr
>
> > nodes, went into zookeeper and deleted everything under /collections/
>
> > transcribedReports/leader_initiated_recovery/shard1/ and brought the
>
> > cluster back up (this seemed to be a reasonably similar situation to
>
> > https://issues.apache.org/jira/browse/SOLR-7021, where that workaround
>
> > is described, albeit for an older version of Solr).
>
> >
>
> >
>
> >
>
> > After a while things looked ok but when we attempted to move the
>
> > second replica back to the original node (by creating a third and then
>
> > deleting the temp one which wasn't on the node we wanted it on), we
>
> > immediately got a 'down' status on the node (and its stayed that way
>
> > ever since), with ' Could not publish as ACTIVE after succesful
>
> > recovery ' messages appearing in the logs
>
> >
>
> >
>
> >
>
> > It's as if there is something specifically wrong with that node that
>
> > stops us from ever having a functioning replica of shard1 on it.
>
> >
>
> >
>
> >
>
> > Weird thing is, shard2 on the same (problematic) node seems fine.
>
> >
>
> >
>
> >
>
> > Other stuff we have tried includes
>
> >
>
> >
>
> >
>
> > issuing a REQUESTRECOVERY
>
> >
>
> > moving from 2 to 4 nodes
>
> >
>
> > adding more replicas on other nodes (new replicas immediately go into
>
> > down state and stay that way).
>
> >
>
> >
>
> >
>
> > System is solrcloud 6.0.1 running on 4 nodes.  There's 1 collection
>
> > with 4 shards, and I'm trying to have 2 replicas on each of the 4
> > nodes.
>
> >
>
> > Currently each shard is managing approx 1.2 million docs (mostly just
>
> > text, usually 10-20k in size each).
>
> >
>
> >
>
> >
>
> > Any suggestions would be greatly appreciated.
>
> >
>
> >
>
> >
>
> > Many thanks,
>
> >
>
> >
>
> >
>
> > Jon
>
> >
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkeswo...@mmodal.com <jon.hawkeswo...@mmodal.com> www.mmodal.com
> <http://www.medquist.com/>*
>
>
>
> *This electronic mail transmission contains confidential information
> intended only for the person(s) named. Any use, distribution, copying or
> disclosure by another person is strictly prohibited. If you are not the
> intended recipient of this e-mail, promptly delete it and all attachments.*
>
>
>
