[ https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Chudov updated IGNITE-21181: ---------------------------------- Description: The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary replica of the sole partition is on node 0. Then node 0 is stopped and an attempt to do a put via node 2 is done. The partition still has majority, but the put results in the following: {code:java} org.apache.ignite.tx.TransactionException: IGN-REP-5 TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary replica node [consistentId=itrst_ncisasiti_0] at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) at java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965) at org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196) at org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111) at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111) at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102) at org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193) at org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code} This can be reproduced using ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(). The reason is that the leader of partition group is transferred on node 0, which means that this node most probably will be selected as primary, and after that the node 0 is stopped, and then the transaction is started. Node 0 is still a leaseholder in the current time interval, but it's already left the topology. We can fix the test to make it await the new primary, which would be present in the cluster, or make the restries on the very first transactional request. In the case of latter, we need to ensure that the request is actually first and single, no other request in any parallel thread was sent, otherwise we cant retry the request on another primary . was: The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary replica of the sole partition is on node 0. Then node 0 is stopped and an attempt to do a put via node 2 is done. The partition still has majority, but the put results in the following: {code:java} org.apache.ignite.tx.TransactionException: IGN-REP-5 TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary replica node [consistentId=itrst_ncisasiti_0] at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) at java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301) at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965) at org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196) at org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111) at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111) at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102) at org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193) at org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253) at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code} This can be reproduced using ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(). > Failure to resolve a primary replica after stopping a node > ---------------------------------------------------------- > > Key: IGNITE-21181 > URL: https://issues.apache.org/jira/browse/IGNITE-21181 > Project: Ignite > Issue Type: Bug > Reporter: Roman Puchkovskiy > Priority: Major > Labels: ignite-3 > Fix For: 3.0.0-beta2 > > > The scenario is that the cluster consists of 3 nodes (0, 1, 2). Primary > replica of the sole partition is on node 0. Then node 0 is stopped and an > attempt to do a put via node 2 is done. The partition still has majority, but > the put results in the following: > > {code:java} > org.apache.ignite.tx.TransactionException: IGN-REP-5 > TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary > replica node [consistentId=itrst_ncisasiti_0] > > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749) > at > java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) > at > java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946) > at > java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301) > at > org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965) > at > org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196) > at > org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111) > at > java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106) > at > java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235) > at > org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111) > at > org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102) > at > org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193) > at > org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253) > at > org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473){code} > > This can be reproduced using > ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(). > The reason is that the leader of partition group is transferred on node 0, > which means that this node most probably will be selected as primary, and > after that the node 0 is stopped, and then the transaction is started. Node 0 > is still a leaseholder in the current time interval, but it's already left > the topology. > We can fix the test to make it await the new primary, which would be present > in the cluster, or make the restries on the very first transactional request. > In the case of latter, we need to ensure that the request is actually first > and single, no other request in any parallel thread was sent, otherwise we > cant retry the request on another primary . -- This message was sent by Atlassian Jira (v8.20.10#820010)