[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)

Dawid Weiss (Jira) Fri, 27 Dec 2019 08:40:45 -0800


    [ 
https://issues.apache.org/jira/browse/SOLR-13486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004269#comment-17004269
 ]


Dawid Weiss commented on SOLR-13486:
------------------------------------

I hit this very often on Linux (org.apache.solr.cloud.TestCloudConsistency) and 
this prevents regular sanity-checks of the gradle branch. Is there anything 
that can be done about it?
{code}
org.apache.solr.cloud.TestCloudConsistency > 
testOutOfSyncReplicasCannotBecomeLeader FAILED
    java.lang.AssertionError: Doc with id=4 not found in 
http://127.0.0.1:40399/solr/outOfSyncReplicasCannotBecomeLeader-false due to: 
Path not found: /id; rsp={doc=null}
        at 
__randomizedtesting.SeedInfo.seed([542A0C718771D678:2AC12C614416D942]:0)
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at 
org.apache.solr.cloud.TestCloudConsistency.assertDocExists(TestCloudConsistency.java:285)
        at 
org.apache.solr.cloud.TestCloudConsistency.assertDocsExistInAllReplicas(TestCloudConsistency.java:269)
        at 
org.apache.solr.cloud.TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader(TestCloudConsistency.java:140)
        at 
org.apache.solr.cloud.TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader(TestCloudConsistency.java:99)
{code}

> race condition between leader's "replay on startup" and non-leader's "recover 
> from leader" can leave replicas out of sync (TestCloudConsistency)
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13486
>                 URL: https://issues.apache.org/jira/browse/SOLR-13486
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: 
> apache_Lucene-Solr-BadApples-NightlyTests-master_61.log.txt.gz, 
> apache_Lucene-Solr-BadApples-Tests-8.x_102.log.txt.gz
>
>
> I've been investigating some jenkins failures from TestCloudConsistency, 
> which at first glance suggest a problem w/replica(s) recovering after a 
> network partition from the leader - but in digging into the logs the root 
> cause acturally seems to be a thread race conditions when a replica (the 
> leader) is first registered...
>  * The {{ZkContainer.registerInZk(...)}} method (which is called by 
> {{CoreContainer.registerCore(...)}} & {{CoreContainer.load()}}) is typically 
> run in a background thread (via the {{ZkContainer.coreZkRegister}} 
> ExecutorService)
>  * {{ZkContainer.registerInZk(...)}} delegates to 
> {{ZKController.register(...)}} which is ultimately responsible for checking 
> if there are any "old" tlogs on disk, and if so handling the "Replaying tlog 
> for <URL> during startup" logic
>  * Because this happens in a background thread, other logic/requests can be 
> handled by this core/replica in the meantime - before it starts (or while in 
> the middle of) replaying the tlogs
>  ** Notably: *leader's that have not yet replayed tlogs on startup will 
> erroneously respond to RTG / Fingerprint / PeerSync requests from other 
> replicas w/incomplete data*
> ...In general, it seems scary / fishy to me that a replica can (aparently) 
> become *ACTIVE* before it's finished it's {{registerInZk}} + "Replaying tlog 
> ... during startup" logic ... particularly since this can happen even for 
> replicas that are/become leaders. It seems like this could potentially cause 
> a whole host of problems, only one of which manifests in this particular test 
> failure:
>  * *BEFORE* replicaX's "coreZkRegister" thread reaches the "Replaying tlog 
> ... during startup" check:
>  ** replicaX can recognize (via zk terms) that it should be the leader(X)
>  ** this leaderX can then instruct some other replicaY to recover from it
>  ** replicaY can send RTG / PeerSync / FetchIndex requests to the leaderX 
> (either on it's own volition, or because it was instructed to by leaderX) in 
> an attempt to recover
>  *** the responses to these recovery requests will not include updates in the 
> tlog files that existed on leaderX prior to startup that hvae not yet been 
> replayed
>  * *AFTER* replicaY has finished it's recovery, leaderX's "Replaying tlog ... 
> during startup" can finish
>  ** replicaY now thinks it is in sync with leaderX, but leaderX has 
> (replayed) updates the other replicas know nothing about



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (SOLR-13486) race condition between leader's "replay on startup" and non-leader's "recover from leader" can leave replicas out of sync (TestCloudConsistency)

Reply via email to