[jira] [Assigned] (IGNITE-21181) Failure to resolve a primary replica after stopping a node

Denis Chudov (Jira) Mon, 29 Jan 2024 00:05:42 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Denis Chudov reassigned IGNITE-21181:
-------------------------------------

    Assignee: Denis Chudov

> Failure to resolve a primary replica after stopping a node
> ----------------------------------------------------------
>
>                 Key: IGNITE-21181
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21181
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> The cluster consists of 3 nodes (0, 1, 2), and the primary replica of the 
> sole partition is on node 0. Node 0 is then stopped, and a put is attempted 
> via node 2. The partition still has a majority, but the put fails with the 
> following exception:
>  
> {code:java}
> org.apache.ignite.tx.TransactionException: IGN-REP-5 TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary replica node [consistentId=itrst_ncisasiti_0]
>  
> at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
> at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> at java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
> at java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
> at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
> at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
> at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
> at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
> at org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
> at org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
> at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
> at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
> at org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
> at org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
> at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
> at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
> at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473)
> {code}
>  
> This can be reproduced using 
> ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().
> The reason is that the test transfers leadership of the partition group to 
> node 0, which means that this node will most probably be selected as primary. 
> Node 0 is then stopped, and only after that is the transaction started. Node 0 
> is still the leaseholder for the current time interval, but it has already 
> left the topology.
> We can either fix the test so that it awaits a new primary that is present in 
> the cluster, or retry the very first transactional request. In the latter 
> case, we need to ensure that the request really is the first and only one, 
> with no other request sent from any parallel thread; otherwise we cannot 
> retry the request on another primary.
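
The first option above (awaiting a primary that is actually present in the cluster) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the fix applied in Ignite: PrimaryAwait, Lease, isUsable and awaitUsablePrimary are hypothetical names, and the lease and topology suppliers stand in for whatever the test can query. The point is the condition itself: a leaseholder counts as usable only if its lease has not expired and the node is still in the topology.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical test helper; none of these names are Ignite APIs. */
public final class PrimaryAwait {
    /** Minimal stand-in for a lease: the holder's consistent id and an expiration time. */
    public static final class Lease {
        final String holder;
        final long expiresAtMillis;

        public Lease(String holder, long expiresAtMillis) {
            this.holder = holder;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private PrimaryAwait() {
    }

    /** A leaseholder is usable only if its lease is unexpired AND it is still in the topology. */
    public static boolean isUsable(Lease lease, Set<String> topology, long nowMillis) {
        return lease != null && lease.expiresAtMillis > nowMillis && topology.contains(lease.holder);
    }

    /** Polls until a usable primary appears or the timeout elapses; returns its consistent id. */
    public static Optional<String> awaitUsablePrimary(
            Supplier<Lease> currentLease,
            Supplier<Set<String>> currentTopology,
            Duration timeout,
            Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            Lease lease = currentLease.get();
            if (isUsable(lease, currentTopology.get(), System.currentTimeMillis())) {
                return Optional.of(lease.holder);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // Timed out without seeing a usable primary.
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
                return Optional.empty();
            }
        }
    }
}
```

In a test like the one above, such a wait would sit between stopping node 0 and the put via node 2, so that the put is only issued once the lease has moved to a node that is still in the cluster.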



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
