[jira] [Updated] (IGNITE-21181) Failure to resolve a primary replica after stopping a node

Denis Chudov (Jira) Mon, 15 Jan 2024 03:26:03 -0800


     [ 
https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Denis Chudov updated IGNITE-21181:
----------------------------------
    Description: 
The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary replica 
of the sole partition is on node 0. Then node 0 is stopped and an attempt to do 
a put via node 2 is done. The partition still has majority, but the put results 
in the following:
 
{code:java}
org.apache.ignite.tx.TransactionException: IGN-REP-5 
TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary 
replica node [consistentId=itrst_ncisasiti_0]
 
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
at 
java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
at 
org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
at 
org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
at 
org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code}
 
This can be reproduced using 
ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().

The reason is that the leader of partition group is transferred on node 0, 
which means that this node most probably will be selected as primary, and after 
that the node 0 is stopped, and then the transaction is started. Node 0 is 
still a leaseholder in the current time interval, but it's already left the 
topology.

We can fix the test to make it await the new primary, which would be present in 
the cluster, or make the restries on the very first transactional request. In 
the case of latter, we need to ensure that the request is actually first and 
single, no other request in any parallel thread was sent, otherwise we cant 
retry the request on another primary .

  was:
The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary replica 
of the sole partition is on node 0. Then node 0 is stopped and an attempt to do 
a put via node 2 is done. The partition still has majority, but the put results 
in the following:
 
{code:java}
org.apache.ignite.tx.TransactionException: IGN-REP-5 
TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary 
replica node [consistentId=itrst_ncisasiti_0]
 
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
at 
java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
at 
java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
at 
org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
at 
org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
at 
java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
at 
java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
at 
org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
at 
org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
at 
org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
at 
org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code}

 
This can be reproduced using 
ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().


> Failure to resolve a primary replica after stopping a node
> ----------------------------------------------------------
>
>                 Key: IGNITE-21181
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21181
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>
> The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary 
> replica of the sole partition is on node 0. Then node 0 is stopped and an 
> attempt to do a put via node 2 is done. The partition still has majority, but 
> the put results in the following:
>  
> {code:java}
> org.apache.ignite.tx.TransactionException: IGN-REP-5 
> TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary 
> replica node [consistentId=itrst_ncisasiti_0]
>  
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
> at 
> java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
> at 
> org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
> at 
> org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
> at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> at 
> org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
> at 
> org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
> at 
> org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
> at 
> org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code}
>  
> This can be reproduced using 
> ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().
> The reason is that the leader of partition group is transferred on node 0, 
> which means that this node most probably will be selected as primary, and 
> after that the node 0 is stopped, and then the transaction is started. Node 0 
> is still a leaseholder in the current time interval, but it's already left 
> the topology.
> We can fix the test to make it await the new primary, which would be present 
> in the cluster, or make the restries on the very first transactional request. 
> In the case of latter, we need to ensure that the request is actually first 
> and single, no other request in any parallel thread was sent, otherwise we 
> cant retry the request on another primary .



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (IGNITE-21181) Failure to resolve a primary replica after stopping a node

Reply via email to