[ 
https://issues.apache.org/jira/browse/IGNITE-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17965519#comment-17965519
 ] 

Denis Chudov commented on IGNITE-23702:
---------------------------------------

Most recent stacktrace:
{code:java}
[2025-06-04T12:28:24,119][ERROR][%idrmt_trtp_3344%partition-operations-18][FailureManager]
 Critical system error detected. Will be handled accordingly to configured 
handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler 
[ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, 
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=CRITICAL_ERROR]
org.apache.ignite.internal.failure.StackTraceCapturingException: Failed to 
process the lease granted message [msg=LeaseGrantedMessageImpl [force=false, 
groupId=21_part_0, leaseExpirationTime=HybridTimestamp [physical=2025-06-04 
12:30:24:112 +0000, logical=0, composite=114625100127404032], 
leaseStartTime=HybridTimestamp [physical=2025-06-04 12:28:24:112 +0000, 
logical=2, composite=114625092263084034]]].
    at 
org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:183)
 ~[ignite-failure-handler-9.1.127-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.failure.FailureManager.process(FailureManager.java:160)
 ~[ignite-failure-handler-9.1.127-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.replicator.PlacementDriverMessageProcessor.lambda$processPlacementDriverMessage$0(PlacementDriverMessageProcessor.java:148)
 ~[ignite-replicator-9.1.127-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:934)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:911)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:614)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1163)
 [?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:482)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
 [?:?]
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
 [?:?]
    at java.base/java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.util.concurrent.CompletionException: java.lang.AssertionError: 
Unexpected replica reservation with STOPPING state [groupId=21_part_0].
    at 
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
 ~[?:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1159)
 ~[?:?]
    ... 4 more
Caused by: java.lang.AssertionError: Unexpected replica reservation with 
STOPPING state [groupId=21_part_0].
    at 
org.apache.ignite.internal.replicator.ReplicaStateManager.reserveReplica(ReplicaStateManager.java:384)
 ~[ignite-replicator-9.1.127-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.replicator.PlacementDriverMessageProcessor.waitForActualState(PlacementDriverMessageProcessor.java:273)
 ~[ignite-replicator-9.1.127-SNAPSHOT.jar:?]
    at 
org.apache.ignite.internal.replicator.PlacementDriverMessageProcessor.lambda$processLeaseGrantedMessage$6(PlacementDriverMessageProcessor.java:206)
 ~[ignite-replicator-9.1.127-SNAPSHOT.jar:?]
    at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
 ~[?:?]
    ... 4 more {code}

> Incorrect HB in deferred replica stop on partition restart
> ----------------------------------------------------------
>
>                 Key: IGNITE-23702
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23702
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: _Integration_Tests_Integration_CLI_10340.log
>
>
> *Stack trace:*
> {code:java}
> java.lang.AssertionError: Unexpected replica reservation with STOPPING state 
> [groupId=18_part_1].
>     at 
> org.apache.ignite.internal.replicator.ReplicaManager$ReplicaStateManager.reserveReplica(ReplicaManager.java:1560)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaImpl.waitForActualState(ReplicaImpl.java:306)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaImpl.lambda$processLeaseGrantedMessage$6(ReplicaImpl.java:240)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>  ~[?:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaImpl.lambda$processLeaseGrantedMessage$7(ReplicaImpl.java:209)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>  ~[?:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaImpl.processLeaseGrantedMessage(ReplicaImpl.java:209)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaImpl.processPlacementDriverMessage(ReplicaImpl.java:178)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaManager.lambda$onPlacementDriverMessageReceived$7(ReplicaManager.java:539)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
>  ~[?:?]
>     at 
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
>  ~[?:?]
>     at 
> org.apache.ignite.internal.replicator.ReplicaManager.onPlacementDriverMessageReceived(ReplicaManager.java:539)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.network.TrackableNetworkMessageHandler.onReceived(TrackableNetworkMessageHandler.java:52)
>  ~[ignite-network-api-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.network.DefaultMessagingService.handleStartingWithFirstHandler(DefaultMessagingService.java:549)
>  ~[ignite-network-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.network.DefaultMessagingService.lambda$handleMessageFromNetwork$5(DefaultMessagingService.java:440)
>  ~[ignite-network-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
>     at java.base/java.lang.Thread.run(Thread.java:834) [?:?]{code}
> *Scenario:*
>  * The replica has to be stopped because of a partition restart;
>  * the replica turns out to be the primary one, so a stop lease prolongation 
> message is sent;
>  * right after the stop lease prolongation is handled (9 ms later), a 
> LeaseGrantedMessage is received on the same node where the primary replica 
> was located;
>  * 1 ms after the LeaseGrantedMessage is received, the replica is stopped 
> (the stop of the raft node begins);
>  * in parallel, the handling of the LeaseGrantedMessage continues, and the 
> replica is about to be reserved as primary. The state is already STOPPING, 
> while reservedForPrimary is still true (left over from the previous primary 
> state). This combination is incorrect, so the assertion fails. 
> The correct scenario would be: reservedForPrimary becomes false while the 
> PRIMARY_REPLICA_EXPIRED event is handled, and reserveReplica() returns false 
> because of the STOPPING state. No error is thrown, and the replica declines 
> the lease.
> *What needs to be fixed:*
>  * a missing synchronized block, which should wrap the stopReplica call in 
> ReplicaStateManager#planDeferredReplicaStop;
>  * incorrect waiting for lease expiration: the reservation flag may not have 
> been cleared yet when the expiration timestamp is reached. Reaching the 
> timestamp only means that the lease has expired; it does not mean that 
> PRIMARY_REPLICA_EXPIRED has been handled, so reservedForPrimary may still be 
> true and the replica stop still cannot begin. The deferred replica stop 
> should wait not for this timestamp but for the expiration event of the lease 
> with the corresponding start time, or for the election of a new lease whose 
> start time is greater than the expiration timestamp.
>  
>  
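
The intended behaviour described above can be illustrated with a minimal sketch. This is not the actual Ignite code: the class ReplicaStopSketch, its fields, the long-valued timestamps and the method onPrimaryReplicaElected are hypothetical simplifications, and only the names reserveReplica, planDeferredReplicaStop, stopReplica and reservedForPrimary are borrowed from the description. It assumes a single replica, one monitor guarding all state, and that the stop is deferred until the expiration event (or a newer lease) is observed rather than until the expiration timestamp is reached.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified model of one replica's state; not the real ReplicaStateManager. */
final class ReplicaStopSketch {
    enum State { STARTED, STOPPING }

    private State state = State.STARTED;
    private boolean reservedForPrimary;
    private long leaseStartTime;
    private long leaseExpirationTime;
    private CompletableFuture<Void> deferredStop;

    /** Returns false (the lease is declined) instead of asserting when the replica is stopping. */
    synchronized boolean reserveReplica(long leaseStartTime, long leaseExpirationTime) {
        if (state == State.STOPPING) {
            return false;
        }
        reservedForPrimary = true;
        this.leaseStartTime = leaseStartTime;
        this.leaseExpirationTime = leaseExpirationTime;
        return true;
    }

    /** The stop is decided under the same monitor as the reservation, so the two cannot interleave. */
    synchronized CompletableFuture<Void> planDeferredReplicaStop() {
        if (deferredStop == null) {
            deferredStop = new CompletableFuture<>();
            if (!reservedForPrimary) {
                stopReplica();
            }
        }
        return deferredStop;
    }

    /** Expiration event of a lease; only the lease with the matching start time releases the reservation. */
    synchronized void onPrimaryReplicaExpired(long expiredLeaseStartTime) {
        if (reservedForPrimary && expiredLeaseStartTime == leaseStartTime) {
            releaseReservation();
        }
    }

    /** Election of a new lease that starts after the reserved lease's expiration timestamp. */
    synchronized void onPrimaryReplicaElected(long newLeaseStartTime) {
        if (reservedForPrimary && newLeaseStartTime > leaseExpirationTime) {
            releaseReservation();
        }
    }

    // Called only while holding the monitor.
    private void releaseReservation() {
        reservedForPrimary = false;
        if (deferredStop != null) {
            stopReplica();
        }
    }

    // Called only while holding the monitor; in the real system the raft node stop would begin here.
    private void stopReplica() {
        state = State.STOPPING;
        deferredStop.complete(null);
    }

    synchronized State state() {
        return state;
    }
}
```

In this sketch, a reservation for the lease starting at 100 followed by planDeferredReplicaStop() leaves the returned future incomplete until onPrimaryReplicaExpired(100) is called; after that, any further reserveReplica() call returns false rather than failing an assertion.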



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
