[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007800#comment-17007800 ] ASF subversion and git services commented on SOLR-13486: Commit a8ab65b1860eaa10605209a060807c292f5497b1 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a8ab65b ] SOLR-13486: Text improvements * force a hard commit of all docs in TestCloudConsistency to work around bug in that test * add new AwaitsFix'ed TestTlogReplayVsRecovery that more explicitly demonstrates the bug via TestInjection.updateLogReplayRandomPause (cherry picked from commit 0fac7c1a26395ed21f14e02a471e6350144074c7) > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13486__test.patch, > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007791#comment-17007791 ] ASF subversion and git services commented on SOLR-13486: Commit 0fac7c1a26395ed21f14e02a471e6350144074c7 in lucene-solr's branch refs/heads/master from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0fac7c1 ] SOLR-13486: Text improvements * force a hard commit of all docs in TestCloudConsistency to work around bug in that test * add new AwaitsFix'ed TestTlogReplayVsRecovery that more explicitly demonstrates the bug via TestInjection.updateLogReplayRandomPause > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13486__test.patch, > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007744#comment-17007744 ] Erick Erickson commented on SOLR-13486: --- [~hossman] What I'm seeing is different, and I haven't a clue what is happening. On the surface, it looks similar because the "assertDocExists" method fails. However, what I can reliably reproduce in asertDocExists is: {code} Server refused connection at: https://127.0.0.1:49190/solr/outOfSyncReplicasCannotBecomeLeader-false org.apache.solr.client.solrj.SolrServerException: Server refused connection at: https://127.0.0.1:49190/solr/outOfSyncReplicasCannotBecomeLeader-false {code} This problem happens on one of the _followers_, BTW, after it apparently successfully syncs after being restarted. So that's what I'm looking at over on the linked JIRA. And I can only reliably make it happen on my MBP, my Mac Pro doesn't seem to generate this. "Reliably" is a bit of a misnomer, it's really between 3 and 10 times per thousand tests. I can also make the "connection refused" error happen when beasting only a single test. So please go ahead and push anything you think will help for the commit issue. AFAIK, there are multiple issues and it looks like I conflated what I'm seeing with the original problem. I'll take whatever else I find over to the linked JIRA. It'd be really weird if this fix also fixed my other issue... > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-13486__test.patch, > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007007#comment-17007007 ] Chris M. Hostetter commented on SOLR-13486: --- Ok, I'm playing catchup on all the comments from the last 2 weeks – most of which are confusing hte hell out of me – so forgive me if this is all over the place... This Jira is *NOT* a generic tracker for "TestCloudConsistency can sometimes fail" ... it is tracking a very specific bug, that can impact real world solr instances in the wild. The fact that it was discovered via TestCloudConsistency, and that that test appears to be (AFAICT) the only test we currently have in the code base for reproducing this _code_ bug does not mean we should use this jira to discuss every possible failure people encounter w/TestCloudConsistency. Please pay attention to the specifics of the failures you encounter, and when in doubt file a new jira (they can always be linked later, but confusing off topic comments are hard to de-tangle from a jira after the fact) [~erickerickson] : if you are seeing NPEs in your logs when you run TestCloudConsistency then that most certainly has nothing to do with the this specific bug being tracked here. (BUt it's impossible to be certain since you dind't attach any logs or specifics) Please open a new jira and move the disucssion of this unrelated problem you have found to that jira. (or open multiple new jiras if you think you've found multiple new problems) {quote}The cluster setup and teardown is done for each test 'cause they're annotated with `@Before` and `@After`. Changing these to `@BeforeClass` and `@AfterClass`, which at least lessens the confusion. I don't think this is a real fix given the comments from Chris and Dat, so I'll see if I can still generate errors with this change. {quote} Please *DO NOT* change this test to setup/teardown the cluster in {{@BeforeClass}} and {{@AfterClass}} – the fact that this test (and others like it) uses a "pristine" cluster for each test method is very specific – because the test is mucking with the collection nodes, and this way a failure from one test (that might leve the cluster in a bad state) won't "bleed over" into another test and cause spurrious failures. {quote}What Hoss saw may be an artifact of the fact that the cluster was being created/destroyed between tests. So far when I only run a single test at a time I'm not seeing failures, but that's not very conclusive at this point. {quote} The bug being tracked here has *NOTHING* to do with a new cluster being recreated for each test. please note all of the logs & analysis i've already posted – everything about this bug has to do with when/how the leader is partitioned & shutdown _during_ the test, and the race condition thta exists as a result when nodes are trying to recover from that leader while it is trying to recover from it's tlog. Dawid's {{org.apache.solr.cloud.TestCloudConsistency.zip}} attachment includes a log that does in fact show the specific problem tracked in this isssue – note the time stamps of the log messages in the last 2 grep comments... {noformat} # here's the replica that is failing to locate doc#4... hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false org.apache.solr.cloud.TestCloudConsistency.html | grep 'Doc with id=4 not found' java.lang.AssertionError: Doc with id=4 not found in http://127.0.0.1:40399/solr/outOfSyncReplicasCannotBecomeLeader-false due to: Path not found: /id; rsp={doc=null} # here's the recovery via replication logging from that replica showing who the leader is... hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false org.apache.solr.cloud.TestCloudConsistency.html | grep 'n:127.0.0.1:40399_solr' | grep recoveryExecutor | grep 'Attempting to replicate from' 1500569 INFO (recoveryExecutor-10537-thread-1-processing-n:127.0.0.1:40399_solr x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node4) [n:127.0.0.1:40399_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node4 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 ] o.a.s.c.RecoveryStrategy Attempting to replicate from [http://127.0.0.1:33461/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1/]. # here's when that leader is doing it's tlog replay... hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false org.apache.solr.cloud.TestCloudConsistency.html | grep 'n:127.0.0.1:33461_solr' | grep 'Replaying tlog' 1515376 INFO (coreZkRegister-10093-thread-1-processing-n:127.0.0.1:33461_solr x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2) [n:127.0.0.1:33461_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005364#comment-17005364 ] Dawid Weiss commented on SOLR-13486: You can leave that in -- I'll post more stack traces once I have them, no problem. I just wanted to let you guys know this one has been pretty bad for me recently. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005357#comment-17005357 ] Erick Erickson commented on SOLR-13486: --- Right. WDYT about you just @Ignore the test for now locally? Hoss' rollups show it failing once in the last week in 30 runs, and trying to run the entire suite enough times to have a hope of showing it fixed will take forever, there's no sense in you being burdened with this wonkiness. So I'll try to clean up what I can reproduce, then push it to the repo and maybe then go from there. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005323#comment-17005323 ] Dawid Weiss commented on SOLR-13486: The one I attached is by far the one I see most often. Like I said -- this happens when running full test suite. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005322#comment-17005322 ] Erick Erickson commented on SOLR-13486: --- [~ab] Yes, I do get that error on some failed tests. I notice that this can be logged in two places, and the second one (AutoScalingSnitch.getRemoteInfo) throws an explicit exception. My guess is that one should continue to throw a SERVER_ERROR, exception correct? [~dweiss] I'm seeing two distinct errors now, but neither one of them is exactly yours. One is coming from assertDocExists like yours, but the root cause seems to be "Server refused connection". Do you see that in your stack traces? The other one I'm getting is a timeout waiting for the collection to become active. Digging more... > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005313#comment-17005313 ] Dawid Weiss commented on SOLR-13486: I attached an example log report file from the failure, perhaps it'll be of some help. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, > org.apache.solr.cloud.TestCloudConsistency.zip > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005284#comment-17005284 ] Andrzej Bialecki commented on SOLR-13486: - [~noble.paul] {{SolrMetricManager}} is initialized and the {{MetricsHandler}} is initialized *before* any core is loaded, so in this sense metrics are always "up" - however, if the metrics for this specific core are missing (e.g. because the core is not there) then an empty response is returned. However, in this case because we're getting an NPE I think it was an IOException due to some other error that caused this. Can you guys check the logs of the failing runs and see if there are any INFO lines starting with "Error on getting remote info, trying again:"? > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005179#comment-17005179 ] Dawid Weiss commented on SOLR-13486: What's interesting is that I get the exception I mentioned almost all of the time but only when I run the full test suite. It doesn't reproduce in isolation. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004897#comment-17004897 ] Noble Paul commented on SOLR-13486: --- Is it that metrics is not up and running and it returned an empty response ? [~ab] ? > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004893#comment-17004893 ] Andrzej Bialecki commented on SOLR-13486: - The logic in {{SolrClientNodeStateProvider.fetchReplicaMetrics}} is a bit convoluted. I don't think it properly handles repeating IOException-s - it simply exits the {{while}} loop in line 199 after 3 iterations and the {{rsp}} object may very well be null if there were 3+ repeating IOExceptions. IMHO this should produce a warning but should not throw an exception (and certainly not an NPE!), it should simply return - this is the intention of the code at the end of this method. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004878#comment-17004878 ] Erick Erickson commented on SOLR-13486: --- Making some progress, but it's slow 'cause it takes a long time to generate... The cluster setup and teardown is done for each test 'cause they're annotated with `@Before` and `@After`. Changing these to `@BeforeClass` and `@AfterClass`, which at least lessens the confusion. I don't think this is a real fix given the comments from Chris and Dat, so I'll see if I can still generate errors with this change. I found one issue, spamming people who've been in the code for opinions. I don't think this is the problem Dawid had though. SolrClientNodeStateProvider.fetchReplicaMetrics is generating an NPE in some of the tests at this line: {code}Object v = Utils.getObjectByPath(frsp.nl, true, Arrays.asList("metrics", key));{code} because `rsp` is null. It's simple to test for null, and maybe log a warning, but is this indicative of some deeper problem? [~ab] [~noble] [~shalin]. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Erick Erickson >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004577#comment-17004577 ] Erick Erickson commented on SOLR-13486: --- At least I can get this to reproduce fairly reliably, digging. Although I'm getting different errors. > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)
[ https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17004269#comment-17004269 ] Dawid Weiss commented on SOLR-13486: I hit this very often on Linux (org.apache.solr.cloud.TestCloudConsistency) and this prevents regular sanity-checks of the gradle branch. Is there anything that can be done about it? {code} org.apache.solr.cloud.TestCloudConsistency > testOutOfSyncReplicasCannotBecomeLeader FAILED java.lang.AssertionError: Doc with id=4 not found in http://127.0.0.1:40399/solr/outOfSyncReplicasCannotBecomeLeader-false due to: Path not found: /id; rsp={doc=null} at __randomizedtesting.SeedInfo.seed([542A0C718771D678:2AC12C614416D942]:0) at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.solr.cloud.TestCloudConsistency.assertDocExists(TestCloudConsistency.java:285) at org.apache.solr.cloud.TestCloudConsistency.assertDocsExistInAllReplicas(TestCloudConsistency.java:269) at org.apache.solr.cloud.TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader(TestCloudConsistency.java:140) at org.apache.solr.cloud.TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader(TestCloudConsistency.java:99) {code} > race condition between leader's "replay on startup" and non-leader's "recover > from leader" can leave replicas out of sync (TestCloudConsistency) > > > Key: SOLR-13486 > URL: https://issues.apache.org/jira/browse/SOLR-13486 > Project: Solr > Issue Type: Bug >Reporter: Chris M. Hostetter >Priority: Major > Attachments: > apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, > apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz > > > I've been investigating some jenkins failures from TestCloudConsistency, > which at first glance suggest a problem w/replica(s) recovering after a > network partition from the leader - but in digging into the logs the root > cause acturally seems to be a thread race conditions when a replica (the > leader) is first registered... > * The {{ZkContainer.registerInZk(...)}} method (which is called by > {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically > run in a background thread (via the {{ZkContainer.coreZkRegister}} > ExecutorService) > * {{ZkContainer.registerInZk(...)}} delegates to > {{ZKController.register(...)}} which is ultimately responsible for checking > if there are any "old" tlogs on disk, and if so handling the "Replaying tlog > for during startup" logic > * Because this happens in a background thread, other logic/requests can be > handled by this core/replica in the meantime - before it starts (or while in > the middle of) replaying the tlogs > ** Notably: *leader's that have not yet replayed tlogs on startup will > erroneously respond to RTG / Fingerprint / PeerSync requests from other > replicas w/incomplete data* > ...In general, it seems scary / fishy to me that a replica can (aparently) > become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog > ... during startup" logic ... particularly since this can happen even for > replicas that are/become leaders. It seems like this could potentially cause > a whole host of problems, only one of which manifests in this particular test > failure: > * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog > ... during startup" check: > ** replicaX can recognize (via zk terms) that it should be the leader(X) > ** this leaderX can then instruct some other replicaY to recover from it > ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX > (either on it's own volition, or because it was instructed to by leaderX) in > an attempt to recover > *** the responses to these recovery requests will not include updates in the > tlog files that existed on leaderX prior to startup that hvae not yet been > replayed > * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... > during startup" can finish > ** replicaY now thinks it is in sync with leaderX, but leaderX has > (replayed) updates the other replicas know nothing about -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org