[ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007007#comment-17007007
 ] 

Chris M. Hostetter commented on SOLR-13486:
-------------------------------------------

Ok, I'm playing catch-up on all the comments from the last 2 weeks – most of 
which are confusing the hell out of me – so forgive me if this is all over the 
place...
----
This Jira is *NOT* a generic tracker for "TestCloudConsistency can sometimes 
fail" ... it is tracking a very specific bug that can impact real-world Solr 
instances in the wild.

The fact that it was discovered via TestCloudConsistency, and that that test 
appears to be (AFAICT) the only test we currently have in the code base for 
reproducing this _code_ bug, does not mean we should use this jira to discuss 
every possible failure people encounter w/TestCloudConsistency. Please pay 
attention to the specifics of the failures you encounter, and when in doubt 
file a new jira (they can always be linked later, but confusing off-topic 
comments are hard to de-tangle from a jira after the fact).
----
[~erickerickson] : if you are seeing NPEs in your logs when you run 
TestCloudConsistency then that most certainly has nothing to do with the 
specific bug being tracked here.

(But it's impossible to be certain since you didn't attach any logs or 
specifics.)

Please open a new jira and move the discussion of this unrelated problem you 
have found to that jira. (Or open multiple new jiras if you think you've found 
multiple new problems.)
----
{quote}The cluster setup and teardown is done for each test 'cause they're 
annotated with `@Before` and `@After`. Changing these to `@BeforeClass` and 
`@AfterClass`, which at least lessens the confusion. I don't think this is a 
real fix given the comments from Chris and Dat, so I'll see if I can still 
generate errors with this change.
{quote}
Please *DO NOT* change this test to set up / tear down the cluster in 
{{@BeforeClass}} and {{@AfterClass}} – the fact that this test (and others like 
it) uses a "pristine" cluster for each test method is very deliberate: because 
the test is mucking with the collection's nodes, this way a failure from one 
test (that might leave the cluster in a bad state) won't "bleed over" into 
another test and cause spurious failures.
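For reference, a minimal sketch of the per-method lifecycle being defended here 
(illustrative only; it assumes the usual {{SolrCloudTestCase}} helpers such as 
{{configureCluster}} and {{shutdownCluster}}, and is not the actual 
TestCloudConsistency source):

{code:java}
import org.apache.solr.cloud.SolrCloudTestCase;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// Illustrative sketch: each test method gets its own brand-new cluster, so a
// method that partitions or stops nodes cannot leak that state into the next one.
public class PristinePerMethodClusterTest extends SolrCloudTestCase {

  @Before
  public void setupCluster() throws Exception {
    // fresh MiniSolrCloudCluster for *this* test method only
    configureCluster(4)
        .addConfig("conf", configset("cloud-minimal"))
        .configure();
  }

  @After
  public void tearDownCluster() throws Exception {
    // always torn down, even if the test failed while nodes were partitioned
    shutdownCluster();
  }

  @Test
  public void testThatMucksWithNodes() throws Exception {
    // ... stop / partition nodes freely; the next test method starts clean ...
  }
}
{code}

Moving that setup into {{@BeforeClass}}/{{@AfterClass}} would share one cluster 
across all of the test methods, which is exactly the "bleed over" risk described 
above.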
{quote}What Hoss saw may be an artifact of the fact that the cluster was being 
created/destroyed between tests. So far when I only run a single test at a time 
I'm not seeing failures, but that's not very conclusive at this point.
{quote}
The bug being tracked here has *NOTHING* to do with a new cluster being 
recreated for each test. Please note all of the logs & analysis I've already 
posted – everything about this bug has to do with when/how the leader is 
partitioned & shut down _during_ the test, and the race condition that exists 
as a result when nodes are trying to recover from that leader while it is 
trying to recover from its tlog.
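To make the shape of that race concrete, here is a self-contained toy model in 
plain {{java.util.concurrent}} (no Solr classes; all names are made up for 
illustration): the registration work, including the startup tlog replay, runs 
on a background executor, so nothing prevents the same core from answering 
fingerprint / RTG / recovery requests before the replay has happened.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only (no Solr code): background registration vs. recovery requests.
public class TlogReplayRaceSketch {

  public static void main(String[] args) throws Exception {
    AtomicBoolean startupTlogReplayed = new AtomicBoolean(false);

    // stand-in for the background "coreZkRegister" style executor
    ExecutorService coreZkRegister = Executors.newSingleThreadExecutor();
    coreZkRegister.submit(() -> {
      sleep(500);                     // registration work before the replay check
      startupTlogReplayed.set(true);  // "Replaying tlog ... during startup" finishes here
      System.out.println("leaderX: finished startup tlog replay");
    });

    // Meanwhile replicaY recovers from leaderX *now*, while the flag is still false:
    // whatever fingerprint / RTG / index it gets back cannot include the un-replayed updates.
    System.out.println("replicaY: recovered; had leaderX replayed its tlog yet? "
        + startupTlogReplayed.get());

    coreZkRegister.shutdown();
    coreZkRegister.awaitTermination(5, TimeUnit.SECONDS);
    // leaderX now has the replayed updates; replicaY believes it is in sync but is not.
  }

  private static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}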
----
Dawid's {{org.apache.solr.cloud.TestCloudConsistency.zip}} attachment includes 
a log that does in fact show the specific problem tracked in this issue – note 
the timestamps of the log messages in the last 2 grep commands below...
{noformat}
# here's the replica that is failing to locate doc#4...

hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false 
org.apache.solr.cloud.TestCloudConsistency.html | grep 'Doc with id=4 not found'
<pre>java.lang.AssertionError: Doc with id=4 not found in 
http://127.0.0.1:40399/solr/outOfSyncReplicasCannotBecomeLeader-false due to: 
Path not found: /id; rsp={doc=null}

# here's the recovery via replication logging from that replica showing who the 
leader is...

hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false 
org.apache.solr.cloud.TestCloudConsistency.html | grep 'n:127.0.0.1:40399_solr' 
| grep recoveryExecutor | grep 'Attempting to replicate from'
1500569 INFO  
(recoveryExecutor-10537-thread-1-processing-n:127.0.0.1:40399_solr 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node4) 
[n:127.0.0.1:40399_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 
r:core_node4 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n3 ] 
o.a.s.c.RecoveryStrategy Attempting to replicate from 
[http://127.0.0.1:33461/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1/].

# here's when that leader is doing its tlog replay...

hossman@slate:~/tmp$ grep outOfSyncReplicasCannotBecomeLeader-false 
org.apache.solr.cloud.TestCloudConsistency.html | grep 'n:127.0.0.1:33461_solr' 
| grep 'Replaying tlog'
1515376 INFO  (coreZkRegister-10093-thread-1-processing-n:127.0.0.1:33461_solr 
x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 
c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 r:core_node2) 
[n:127.0.0.1:33461_solr c:outOfSyncReplicasCannotBecomeLeader-false s:shard1 
r:core_node2 x:outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1 ] 
o.a.s.c.ZkController Replaying tlog for 
http://127.0.0.1:33461/solr/outOfSyncReplicasCannotBecomeLeader-false_shard1_replica_n1/
 during startup... NOTE: This can take a while.

{noformat}
...so in that case at least, we are in fact still on topic talking about the 
correct bug.

[~dweiss] : if you've encountered other failures that don't match this pattern, 
then please file new jiras for those failure patterns – I honestly can't tell 
if the other failures you mentioned seeing are the same as the other failures 
Erick mentioned seeing or not.
----
{quote}I propose that I open a new JIRA for the test failures and push my 
changes from there, then rename this one because what I'm doing does nothing to 
address the analysis Hoss did.
{quote}
[~erickerickson] : you haven't attached any logs showing the errors you got, 
nor have you attached any patch showing what changes you are suggesting – but 
unless those changes address the underlying bug I identified in the description 
of this jira, then yes, as I said before: you should open a new jira.

But I really can't understand why you think this jira should be renamed – this 
jira tracks a bug, and the summary/description make it very clear what that bug 
is. If you have unrelated improvements for the test, or have found an unrelated 
bug that manifests in the same test, then those belong in a new jira.

> race condition between leader's "replay on startup" and non-leader's "recover 
> from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: 
> apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, 
> apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz, 
> org.apache.solr.cloud.TestCloudConsistency.zip
>
>
> I've been investigating some jenkins failures from TestCloudConsistency, 
> which at first glance suggest a problem w/replica(s) recovering after a 
> network partition from the leader - but in digging into the logs the root 
> cause actually seems to be a thread race condition when a replica (the 
> leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by 
> {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically 
> run in a background thread (via the {{ZkContainer.coreZkRegister}} 
> ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to 
> {{ZKController.register(...)}} which is ultimately responsible for checking 
> if there are any "old" tlogs on disk, and if so handling the "Replaying tlog 
> for <URL> during startup" logic
>  * Because this happens in a background thread, other logic/requests can be 
> handled by this core/replica in the meantime - before it starts (or while in 
> the middle of) replaying the tlogs
>  ** Notably: *leaders that have not yet replayed tlogs on startup will 
> erroneously respond to RTG / Fingerprint / PeerSync requests from other 
> replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (apparently) 
> become *ACTIVE* before it's finished its {{registerInZk}} + "Replaying tlog 
> ... during startup" logic ... particularly since this can happen even for 
> replicas that are/become leaders. It seems like this could potentially cause 
> a whole host of problems, only one of which manifests in this particular test 
> failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog 
> ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to leaderX 
> (either of its own volition, or because it was instructed to by leaderX) in 
> an attempt to recover
>  *** the responses to these recovery requests will not include updates in the 
> tlog files that existed on leaderX prior to startup that have not yet been 
> replayed
>  * *AFTER* replicaY has finished its recovery, leaderX's "Replaying tlog ... 
> during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has 
> (replayed) updates the other replicas know nothing about


