[ https://issues.apache.org/jira/browse/IGNITE-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-21181:
-------------------------------------
    Reviewer: Alexander Lapin

Failure to resolve a primary replica after stopping a node
----------------------------------------------------------

                 Key: IGNITE-21181
                 URL: https://issues.apache.org/jira/browse/IGNITE-21181
             Project: Ignite
          Issue Type: Bug
            Reporter: Roman Puchkovskiy
            Assignee: Denis Chudov
            Priority: Major
              Labels: ignite-3
             Fix For: 3.0.0-beta2
          Time Spent: 2h 10m
  Remaining Estimate: 0h

The scenario is that the cluster consists of 3 nodes (0, 1, 2). The primary replica of the sole partition is on node 0. Node 0 is then stopped, and a put is attempted via node 2. The partition still has a majority, but the put results in the following:

{code:java}
org.apache.ignite.tx.TransactionException: IGN-REP-5 TraceId:55c59c96-17d1-4efc-8e3c-cca81b8b41ad Failed to resolve the primary replica node [consistentId=itrst_ncisasiti_0]
	at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.lambda$enlist$69(InternalTableImpl.java:1749)
	at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
	at java.base/java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:946)
	at java.base/java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2266)
	at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlist(InternalTableImpl.java:1739)
	at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistWithRetry(InternalTableImpl.java:480)
	at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.enlistInTx(InternalTableImpl.java:301)
	at org.apache.ignite.internal.table.distributed.storage.InternalTableImpl.upsert(InternalTableImpl.java:965)
	at org.apache.ignite.internal.table.KeyValueViewImpl.lambda$putAsync$10(KeyValueViewImpl.java:196)
	at org.apache.ignite.internal.table.AbstractTableView.lambda$withSchemaSync$1(AbstractTableView.java:111)
	at java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
	at java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
	at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:111)
	at org.apache.ignite.internal.table.AbstractTableView.withSchemaSync(AbstractTableView.java:102)
	at org.apache.ignite.internal.table.KeyValueViewImpl.putAsync(KeyValueViewImpl.java:193)
	at org.apache.ignite.internal.table.KeyValueViewImpl.put(KeyValueViewImpl.java:185)
	at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:257)
	at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.putToNode(ItTableRaftSnapshotsTest.java:253)
	at org.apache.ignite.internal.raftsnapshot.ItTableRaftSnapshotsTest.nodeCanInstallSnapshotsAfterSnapshotInstalledToIt(ItTableRaftSnapshotsTest.java:473)
{code}

This can be reproduced using ItTableRaftSnapshotsTest#nodeCanInstallSnapshotsAfterSnapshotInstalledToIt().

The reason is that the test transfers the leader of the partition group to node 0, which means that this node will most probably be selected as primary; after that, node 0 is stopped and the transaction is started. Node 0 is still the leaseholder in the current time interval, but it has already left the topology.

We can either fix the test to make it await a new primary that is present in the cluster, or add retries on the very first transactional request. In the latter case, we need to ensure that the request is actually the first and only one, i.e. that no other request has been sent in any parallel thread; otherwise we can't retry the request on another primary.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
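The second proposed fix (retrying the very first transactional request after a primary-resolution failure) could be sketched roughly as below. This is a minimal illustration, not Ignite's actual API: the `PrimaryReplicaMissException` type and the `retryOnPrimaryMiss` helper are hypothetical placeholders standing in for the real IGN-REP-5 failure and the real enlist path.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Sketch only: retry an action while a (hypothetical) recoverable
// "primary replica not resolved" failure occurs, up to maxAttempts.
public class FirstRequestRetry {

    // Hypothetical marker for a recoverable primary-resolution failure.
    static class PrimaryReplicaMissException extends RuntimeException {
        PrimaryReplicaMissException(String msg) {
            super(msg);
        }
    }

    static <T> T retryOnPrimaryMiss(Supplier<T> action, int maxAttempts) {
        PrimaryReplicaMissException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (PrimaryReplicaMissException e) {
                // Safe only for the first and sole request of the transaction;
                // a new primary may be elected before the next attempt.
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulate a primary that is unresolvable for the first two attempts
        // (the old leaseholder has left the topology) and resolvable afterwards.
        AtomicInteger calls = new AtomicInteger();
        String result = retryOnPrimaryMiss(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new PrimaryReplicaMissException("Failed to resolve the primary replica node");
            }
            return "put-ok";
        }, 5);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```

The key caveat from the description is encoded in the comment: this retry is only valid when no other request of the same transaction has been sent in a parallel thread; otherwise the operation cannot simply be replayed against another primary.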