My guess is that recovery is indeed successful but the leader is repeatedly
marking this replicas as 'down' using what we call
Leader-Initiated-Recovery or LIR. We need to understand why that is
happening. Are there GC issues on this new node?

Can we see the logs on the leader from a bit before and after a line that
should look like the following:
"Put replica core=documents_shard1_replica2 coreNodeName="...

Can we also see the complete cluster state?

On Thu, Sep 1, 2016 at 9:15 PM, Jon Hawkesworth <
jon.hawkeswo...@medquist.onmicrosoft.com> wrote:

> Hi
>
>
>
> If anyone has any suggestions of things I could try to resolve my issue
> where one replica on one of my solcloud 6.0.1 shards refuses to stay up,
> I'd love to hear them.  In fact, I'll get you something off your amazon
> wishlist, within reason, if you can solve this puzzle.
>
>
>
> Today we pruned the dead replica, restarted the machine where it ran and
> once the node had rejoined the cluster, we added a new replica.
>
> The replica was marked as Active for about 10 minutes then went down
>
>
>
> I put some example logging from below, but it looks much the same as last
> time.
>
>
>
> There's a bunch of warnings about a checksum being different even though
> the file size is the same and then RecoveryStrategy
>
> reports 'Could not publish as ACTIVE after succesful recovery'
>
>
>
> I think I've found where that message comes from in the code here:
> https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=
> blob;f=solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java;h=
> abd00aef19a731b42b314f8b526cdb2d77baf89f;hb=refs/heads/master
>
> (I am running 6.0.1 though so could have changed in latest devel).
>
>
>
> So it seems this chunk of codeā€¦
>
>
>
> 451         if (successfulRecovery) {
>
> 452           LOG.info("Registering as Active after recovery.");
>
> 453           try {
>
> 454             zkController.publish(core.getCoreDescriptor(),
> Replica.State.ACTIVE);
>
> 455           } catch (Exception e) {
>
> 456             LOG.error("Could not publish as ACTIVE after succesful
> recovery", e);
>
> 457             successfulRecovery = false;
>
> 458           }
>
> 459
>
>  460           if (successfulRecovery) {
>
> 461             close = true;
>
> 462             recoveryListener.recovered();
>
> 463           }
>
> 464         }
>
>
>
> results in this:
>
>
>
> org.apache.solr.common.SolrException: Cannot publish state of core
> 'documents_shard1_replica2' as active without recovering first!
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1141)
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1097)
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1093)
>
>                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> RecoveryStrategy.java:457)
>
>                at org.apache.solr.cloud.RecoveryStrategy.run(
> RecoveryStrategy.java:224)
>
>                at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
>
>                at java.util.concurrent.FutureTask.run(Unknown Source)
>
>                at org.apache.solr.common.util.ExecutorUtil$
> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>
>                at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
>
>                at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>
>                at java.lang.Thread.run(Unknown Source)
>
>
>
> I don't yet understand the interaction with zookeeper but there's some
> disagreement about whether recovery has happened or not (if it hadn't from
> solr's point of view the successfulRecovery boolean would presumably be
> false.
>
>
>
> Should I raise a JIRA?  Is there any other useful information I could
> gather?
>
>
>
> I haven't really had any similar problems with the other 3 shards, just
> shard1.
>
>
>
> The nodes that it is running on are all pretty similar - all vms built to
> the same specification and the deployment of java and solrcloud is
> automated so there shouldn't be any differences in the stack.
>
>
>
> Many thanks,
>
>
>
> Jon
>
>
>
>
>
>
>
>
>
> Example log output below
>
>
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnux.si did not match. expected checksum is 1186898951 and actual
> is checksum 1994281621. expected length is 417 and actual length is 417
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.nvd did not match. expected checksum is 2200422612 and actual
> is checksum 3635321041. expected length is 63 and actual length is 65
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fdx did not match. expected checksum is 281622189 and actual is
> checksum 838341528. expected length is 84 and actual length is 84
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.nvm did not match. expected checksum is 1875012021 and actual
> is checksum 524812847. expected length is 108 and actual length is 108
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fnm did not match. expected checksum is 1681449973 and actual
> is checksum 3351426142. expected length is 1265 and actual length is 1265
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene54_0.dvm did not match. expected checksum is 355987228
> and actual is checksum 847034886. expected length is 380 and actual length
> is 404
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.pos did not match. expected checksum is 806636274
> and actual is checksum 2272195325. expected length is 1059 and actual
> length is 1172
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.doc did not match. expected checksum is 4041316671
> and actual is checksum 3122885740. expected length is 212 and actual
> length is 281
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.tim did not match. expected checksum is 2891628412
> and actual is checksum 2420913910. expected length is 5346 and actual
> length is 6251
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.tip did not match. expected checksum is 1652105503
> and actual is checksum 807238796. expected length is 336 and actual length
> is 349
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene54_0.dvd did not match. expected checksum is 2664049801
> and actual is checksum 2930561414. expected length is 130 and actual length
> is 145
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fdt did not match. expected checksum is 4175958592 and actual
> is checksum 3650490510. expected length is 4280 and actual length is 4983
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.si did not match. expected checksum is 2223401636 and actual
> is checksum 734463570. expected length is 535 and actual length is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene54_0.dvd did not match. expected checksum is 202072236
> and actual is checksum 4194802930. expected length is 96 and actual
> length is 264
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.tip did not match. expected checksum is 2123658306
> and actual is checksum 435878007. expected length is 298 and actual length
> is 639
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.nvd did not match. expected checksum is 4214748910 and actual
> is checksum 3784036105. expected length is 59 and actual length is 77
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fdt did not match. expected checksum is 3837568601 and actual
> is checksum 2542454689. expected length is 896 and actual length is 20338
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fdx did not match. expected checksum is 2070429440 and actual
> is checksum 3279752998. expected length is 84 and actual length is 86
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.pos did not match. expected checksum is 2299588010
> and actual is checksum 2299553846. expected length is 190 and actual
> length is 5717
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene54_0.dvm did not match. expected checksum is 914650440
> and actual is checksum 2852383192. expected length is 312 and actual length
> is 548
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.nvm did not match. expected checksum is 3037735995 and actual
> is checksum 4023026424. expected length is 108 and actual length is 108
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.doc did not match. expected checksum is 3813274592
> and actual is checksum 189237707. expected length is 110 and actual length
> is 1945
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.tim did not match. expected checksum is 3013245878
> and actual is checksum 2122722316. expected length is 1757 and actual
> length is 16642
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fnm did not match. expected checksum is 2117653105 and actual
> is checksum 3401755804. expected length is 1265 and actual length is 1265
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.si did not match. expected checksum is 2715978927 and actual
> is checksum 3653125964. expected length is 535 and actual length is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fdt did not match. expected checksum is 1699019853 and actual
> is checksum 3731775500. expected length is 15865 and actual length is 12728
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.pos did not match. expected checksum is 2189908204
> and actual is checksum 2338139479. expected length is 4475 and actual
> length is 3431
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.doc did not match. expected checksum is 1522019614
> and actual is checksum 969681917. expected length is 1394 and actual length
> is 1093
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.tim did not match. expected checksum is 813529901
> and actual is checksum 529669468. expected length is 13843 and actual
> length is 12535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.si did not match. expected checksum is 3802482417 and actual
> is checksum 1865633126. expected length is 535 and actual length is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene54_0.dvm did not match. expected checksum is 4236057860
> and actual is checksum 2986112802. expected length is 500 and actual length
> is 476
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fdx did not match. expected checksum is 2497099401 and actual
> is checksum 990046808. expected length is 85 and actual length is 84
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.nvd did not match. expected checksum is 1736308969 and actual
> is checksum 3657480551. expected length is 73 and actual length is 71
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.tip did not match. expected checksum is 1362235492
> and actual is checksum 640196019. expected length is 570 and actual length
> is 531
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fnm did not match. expected checksum is 1975043794 and actual
> is checksum 1035049893. expected length is 1265 and actual length is 1265
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.nvm did not match. expected checksum is 2985228383 and actual
> is checksum 2603407196. expected length is 108 and actual length is 108
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene54_0.dvd did not match. expected checksum is 762056409
> and actual is checksum 1514176651. expected length is 228 and actual length
> is 211
>
> 9/1/2016, 12:37:09 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000005 refcount=2} active=true
> starting pos=20222
>
> 9/1/2016, 12:37:10 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=10 deletes=0
> deleteByQuery=1 errors=0 positionOfStart=20222}
>
> 9/1/2016, 12:37:10 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:37:10 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
> 9/1/2016, 12:37:37 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000006 refcount=2} active=true
> starting pos=0
>
> 9/1/2016, 12:37:38 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0
> deleteByQuery=0 errors=0 positionOfStart=0}
>
> 9/1/2016, 12:37:41 PM
>
>
>
> WARN false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Stopping recovery for core=[transcribedReports_shard1_replica2]
> coreNodeName=[core_node14]
>
> 9/1/2016, 12:42:13 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000007 refcount=2} active=true
> starting pos=1748
>
> 9/1/2016, 12:42:14 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0
> deleteByQuery=0 errors=0 positionOfStart=1748}
>
> 9/1/2016, 12:42:14 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:42:14 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
> 9/1/2016, 12:42:43 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:42:43 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkeswo...@mmodal.com <jon.hawkeswo...@mmodal.com> www.mmodal.com
> <http://www.medquist.com/>*
>
>
>
> *This electronic mail transmission contains confidential information
> intended only for the person(s) named. Any use, distribution, copying or
> disclosure by another person is strictly prohibited. If you are not the
> intended recipient of this e-mail, promptly delete it and all attachments.*
>
>
>



-- 
Regards,
Shalin Shekhar Mangar.

Reply via email to