[ 
https://issues.apache.org/jira/browse/RATIS-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guangbao zhao updated RATIS-2140:
---------------------------------
    Description: 
hi, [~szetszwo] I found a problem. In our service, when the leader notify the 
follower of InstallSnapshot, the leader may cause the GrcpAppender thread to be 
in the wait state due to timing issues, causing the installation of the 
snapshot to fail, and triggering the follower to not receive the leader's 
heartbeat within the specified timeout period to trigger the election.

The last log that triggered the exception

node1:
{code:java}
2024/08/17 19:36:19,068 
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
2024/08/17 19:36:19,068 
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-GrpcLogAppender: send 
node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: InstallSnapshot 
in progress.
2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
request=AppendEntriesRequest:cid=11690239,entriesCount=0
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: receive 
requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, i:16857386182))
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
reject PRE_VOTE from node2: this server is the leader and still has leadership 
...{code}
node2:
{code:java}
2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
appendEntries as snapshot (17159569079) installation is in progress
2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
inconsistency entries. 
Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.server.impl.SnapshotInstallationHandler: 
node2@group-4F53D3317400: receive installSnapshot: 
node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.server.impl.SnapshotInstallationHandler: 
node2@group-4F53D3317400: reply installSnapshot: 
node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
INSTALL_SNAPSHOT, lastReply: null 
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.impl.FollowerState: 
node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
node2@group-4F53D3317400-FollowerState
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: changes 
role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
...{code}
node2 grpc thread stack:
{code:java}
jstack 118659 | grep -A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748)        jstack 118659 | grep 
-A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748)
jstack 118659 | grep -A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748) {code}
It can be seen from the stack that the GrpcLogAppender node2 thread has been in 
the wait state, causing the above problems. through the code, we can see that 
after appendLog, notifyLogAppender will be called to call 
getEventAwaitForSignal().signal(). If there is no appendLog request at this 
moment, it will always hang here.

And I also did a test, when I perform a random write operation, the snapshot 
will be performed again.(Indicates that responseHandler.isDone() has become 
true)

I think it might be something like this:

 
{code:java}
GrpcLogAppender::notifyInstallSnapshot
while (isRunning() && !responseHandler.isDone()) {  // first execute
  try {
    getEventAwaitForSignal().await();   // fourth execute
  } catch (InterruptedException ignored) {
    Thread.currentThread().interrupt();
  }
} 

InstallSnapshotResponseHandler::close
void close() {  
  done.set(true);   // second execute
  notifyLogAppender();   // third execute
}{code}
Eventually, it remains in the wait state.

I wrote a ut, can you help me find out if this is the problem? [~szetszwo] 
[~adoroszlai] 
{code:java}
public class TestSnapshotHang {
  private AwaitForSignal awaitForSignal;
  private AtomicBoolean done;

  @Before
  public void setUp() {
    awaitForSignal = new AwaitForSignal("test");
    done = new AtomicBoolean(false);
  }

  @Test
  public void test() throws InterruptedException {
    new Thread(this::close).start();

    while (!done.get()) {
      Thread.sleep(100);
      try {
        awaitForSignal.await();
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      }
    }
    System.out.println("test done.");
  }

  private void close() {
    done.set(true);
    awaitForSignal.signal();
    System.out.println("close done.");
  }
} {code}

  was:
hi, [~szetszwo] I found a problem. In our service, when the leader notify the 
follower of InstallSnapshot, the leader may cause the GrcpAppender thread to be 
in the wait state due to timing issues, causing the installation of the 
snapshot to fail, and triggering the follower to not receive the leader's 
heartbeat within the specified timeout period to trigger the election.

The last log that triggered the exception

node1:
{code:java}
2024/08/17 19:36:19,068 
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
2024/08/17 19:36:19,068 
[node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-GrpcLogAppender: send 
node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: InstallSnapshot 
in progress.
2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
org.apache.ratis.grpc.server.GrpcLogAppender: 
node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
request=AppendEntriesRequest:cid=11690239,entriesCount=0
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: receive 
requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, i:16857386182))
2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
reject PRE_VOTE from node2: this server is the leader and still has leadership 
...{code}
node2:
{code:java}
2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
appendEntries as snapshot (17159569079) installation is in progress
2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
inconsistency entries. 
Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.server.impl.SnapshotInstallationHandler: 
node2@group-4F53D3317400: receive installSnapshot: 
node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.server.impl.SnapshotInstallationHandler: 
node2@group-4F53D3317400: reply installSnapshot: 
node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, i:17159569079)
2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
INSTALL_SNAPSHOT, lastReply: null 
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.impl.FollowerState: 
node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
node2@group-4F53D3317400-FollowerState
2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: changes 
role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
...{code}
node2 grpc thread stack:
{code:java}
jstack 118659 | grep -A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748)        jstack 118659 | grep 
-A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748)
jstack 118659 | grep -A 12 
node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
 #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)      
at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
<0x00007f95477a9448> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   at 
org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
 at 
org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
       at 
org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   at 
org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80) 
     at 
org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
 Source)  at java.lang.Thread.run(Thread.java:748) {code}
It can be seen from the stack that the GrpcLogAppender node2 thread has been in 
the wait state, causing the above problems. through the code, we can see that 
after appendLog, notifyLogAppender will be called to call 
getEventAwaitForSignal().signal(). If there is no appendLog request at this 
moment, it will always hang here.

And I also did a test, when I perform a random write operation, the snapshot 
will be performed again.(Indicates that responseHandler.isDone() has become 
true)

I think it might be something like this:

 
{code:java}
GrpcLogAppender::notifyInstallSnapshot
while (isRunning() && !responseHandler.isDone()) {  // first execute
  try {
    getEventAwaitForSignal().await();   // fourth execute
  } catch (InterruptedException ignored) {
    Thread.currentThread().interrupt();
  }
} 

InstallSnapshotResponseHandler::close
void close() {  
  done.set(true);   // second execute
  notifyLogAppender();   // third execute
}{code}
Eventually, it remains in the wait state.

I wrote a ut, can you help me find out if this is the problem? [~szetszwo] 
@adoro
{code:java}
public class TestSnapshotHang {
  private AwaitForSignal awaitForSignal;
  private AtomicBoolean done;

  @Before
  public void setUp() {
    awaitForSignal = new AwaitForSignal("test");
    done = new AtomicBoolean(false);
  }

  @Test
  public void test() throws InterruptedException {
    new Thread(this::close).start();

    while (!done.get()) {
      Thread.sleep(100);
      try {
        awaitForSignal.await();
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      }
    }
    System.out.println("test done.");
  }

  private void close() {
    done.set(true);
    awaitForSignal.signal();
    System.out.println("close done.");
  }
} {code}


> Thread wait when installing snapshot
> ------------------------------------
>
>                 Key: RATIS-2140
>                 URL: https://issues.apache.org/jira/browse/RATIS-2140
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 3.0.1
>            Reporter: guangbao zhao
>            Priority: Major
>
> hi, [~szetszwo] I found a problem. In our service, when the leader notify the 
> follower of InstallSnapshot, the leader may cause the GrcpAppender thread to 
> be in the wait state due to timing issues, causing the installation of the 
> snapshot to fail, and triggering the follower to not receive the leader's 
> heartbeat within the specified timeout period to trigger the election.
> The last log that triggered the exception
> node1:
> {code:java}
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: notifyInstallSnapshot with 
> firstAvailable=(t:138, i:17159569079), followerNextIndex=16857386183
> 2024/08/17 19:36:19,068 
> [node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-GrpcLogAppender: send 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: received a 
> reply node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] INFO 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-InstallSnapshotResponseHandler: 
> InstallSnapshot in progress.
> 2024/08/17 19:36:19,068 [grpc-default-executor-220] WARN 
> org.apache.ratis.grpc.server.GrpcLogAppender: 
> node1@group-4F53D3317400->node2-AppendLogResponseHandler: received 
> INCONSISTENCY reply with nextIndex 16857386183, errorCount=1, 
> request=AppendEntriesRequest:cid=11690239,entriesCount=0
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.RaftServer$Division: node1@group-4F53D3317400: 
> receive requestVote(PRE_VOTE, node2, group-4F53D3317400, 139, (t:138, 
> i:16857386182))
> 2024/08/17 19:36:27,677 [grpc-default-executor-220] INFO 
> org.apache.ratis.server.impl.VoteContext: node1@group-4F53D3317400-LEADER: 
> reject PRE_VOTE from node2: this server is the leader and still has 
> leadership 
> ...{code}
> node2:
> {code:java}
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: Failed 
> appendEntries as snapshot (17159569079) installation is in progress
> 2024/08/17 19:36:19,068 [node2-server-thread482] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> inconsistency entries. 
> Reply:node1<-node2#11690239:FAIL-t139,INCONSISTENCY,nextIndex=16857386183,followerCommit=16857385992,matchIndex=-1
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: receive installSnapshot: 
> node1->node2#0-t139,notify:(t:138, i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.server.impl.SnapshotInstallationHandler: 
> node2@group-4F53D3317400: reply installSnapshot: 
> node1<-node2#0:FAIL-t139,IN_PROGRESS,snapshotIndex=0
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastRequest: node1->node2#0-t139,notify:(t:138, 
> i:17159569079)
> 2024/08/17 19:36:19,068 [grpc-default-executor-631] INFO 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: node2: Completed 
> INSTALL_SNAPSHOT, lastReply: null 
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.FollowerState: 
> node2@group-4F53D3317400-FollowerState: change to CANDIDATE, 
> lastRpcElapsedTime:8607933578ns, electionTimeout:5088ms
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.impl.RoleInfo: node2: shutdown 
> node2@group-4F53D3317400-FollowerState
> 2024/08/17 19:36:27,676 [node2@group-4F53D3317400-FollowerState] INFO 
> org.apache.ratis.server.RaftServer$Division: node2@group-4F53D3317400: 
> changes role from  FOLLOWER to CANDIDATE at term 139 for changeToCandidate
> ...{code}
> node2 grpc thread stack:
> {code:java}
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
> at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748)        jstack 118659 | 
> grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
>   at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748)
> jstack 118659 | grep -A 12 
> node2-GrpcLogAppender-LogAppenderDaemon"node1@group-4F53D3317400->node2-GrpcLogAppender-LogAppenderDaemon"
>  #885 daemon prio=5 os_prio=0 tid=0x00007f84750a5090 nid=0x7df0 waiting on 
> condition [0x00007f85018db000]   java.lang.Thread.State: WAITING (parking)    
> at sun.misc.Unsafe.park(Native Method)  - parking to wait for  
> <0x00007f95477a9448> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)   at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)    at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>     at org.apache.ratis.util.AwaitForSignal.await(AwaitForSignal.java:48)   
> at 
> org.apache.ratis.grpc.server.GrpcLogAppender.notifyInstallSnapshot(GrpcLogAppender.java:794)
>  at 
> org.apache.ratis.grpc.server.GrpcLogAppender.installSnapshot(GrpcLogAppender.java:245)
>        at 
> org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:257)   
> at 
> org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:80)
>       at 
> org.apache.ratis.server.leader.LogAppenderDaemon$$Lambda$821/1075374995.run(Unknown
>  Source)  at java.lang.Thread.run(Thread.java:748) {code}
> It can be seen from the stack that the GrpcLogAppender node2 thread has been 
> in the wait state, causing the above problems. through the code, we can see 
> that after appendLog, notifyLogAppender will be called to call 
> getEventAwaitForSignal().signal(). If there is no appendLog request at this 
> moment, it will always hang here.
> And I also did a test, when I perform a random write operation, the snapshot 
> will be performed again.(Indicates that responseHandler.isDone() has become 
> true)
> I think it might be something like this:
>  
> {code:java}
> GrpcLogAppender::notifyInstallSnapshot
> while (isRunning() && !responseHandler.isDone()) {  // first execute
>   try {
>     getEventAwaitForSignal().await();   // fourth execute
>   } catch (InterruptedException ignored) {
>     Thread.currentThread().interrupt();
>   }
> } 
> InstallSnapshotResponseHandler::close
> void close() {  
>   done.set(true);   // second execute
>   notifyLogAppender();   // third execute
> }{code}
> Eventually, it remains in the wait state.
> I wrote a ut, can you help me find out if this is the problem? [~szetszwo] 
> [~adoroszlai] 
> {code:java}
> public class TestSnapshotHang {
>   private AwaitForSignal awaitForSignal;
>   private AtomicBoolean done;
>   @Before
>   public void setUp() {
>     awaitForSignal = new AwaitForSignal("test");
>     done = new AtomicBoolean(false);
>   }
>   @Test
>   public void test() throws InterruptedException {
>     new Thread(this::close).start();
>     while (!done.get()) {
>       Thread.sleep(100);
>       try {
>         awaitForSignal.await();
>       } catch (InterruptedException ignored) {
>         Thread.currentThread().interrupt();
>       }
>     }
>     System.out.println("test done.");
>   }
>   private void close() {
>     done.set(true);
>     awaitForSignal.signal();
>     System.out.println("close done.");
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to