Sammi Chen created RATIS-2504:
---------------------------------

             Summary: Follower's nextIndex is reset to 0 in 
grpcLogAppender.resetClient()
                 Key: RATIS-2504
                 URL: https://issues.apache.org/jira/browse/RATIS-2504
             Project: Ratis
          Issue Type: Improvement
            Reporter: Sammi Chen


The node becomes leader on term 28, and index 262936. 


{noformat}
2026-04-09 07:24:42,753 INFO 
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
 3497a434-0af8-4e24-9818-3aa5aaeb539d: start 
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderStateImpl
2026-04-09 07:24:42,753 INFO 
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
 
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker: 
Rolling segment log-217137_262936 to index:262936
2026-04-09 07:24:42,759 INFO 
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
 
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker: 
created new log segment 
/metadata/1/hadoop-ozone/datanode/ratis/data9/614975a1-3dad-4dff-85a2-b5d703ef062a/current/log_inprogress_262937
{noformat}

One of the follower is shutdown and another follower is online and healthy. The 
shutdown follower cannot response to any message sent by grpcLogAppender 

{noformat}
2026-04-09 07:24:42,754 WARN 
[grpc-default-executor-38]-org.apache.ratis.grpc.server.GrpcLogAppender: 
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6-AppendLogResponseHandler:
 Failed appendEntries: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
2026-04-09 07:24:42,754 INFO 
[grpc-default-executor-24]-org.apache.ratis.server.leader.FollowerInfo: 
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6:
 decreaseNextIndex nextIndex: updateUnconditionally 262937 -> 0
{noformat}

After received  "io exception", grpcLogAppender.resetClient() reset this 
follower's nextIndex to 0, and then start to fetch logIndex 1 writeChunk 
record, 


{noformat}
2026-04-09 07:24:42,759 WARN 
[ChunkWriter-29-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: 
Operation: ReadChunk , Trace ID:  , Message: java.nio.file.NoSuchFileException: 
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
 , Result: UNABLE_TO_FIND_CHUNK , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.nio.file.NoSuchFileException: 
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
        at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:431)
        at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:213)
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:200)
        at 
org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:106)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:752)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:273)
        at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:236)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:345)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:178)
        at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
        at 
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:177)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:500)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:830)
        at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$read$8(ContainerStateMachine.java:915)
        at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.nio.file.NoSuchFileException: 
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
        at 
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
        at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
        at 
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
        at 
java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
        at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$3(ChunkUtils.java:202)
        at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:371)
        at 
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:201)
        ... 16 more

2026-04-09 07:24:42,760 ERROR 
[ChunkWriter-29-0]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine:
 gid group-B5D703EF062A : ReadStateMachine failed. cmd ReadChunk logIndex 1 msg 
: java.nio.file.NoSuchFileException: 
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
 Container Result: UNABLE_TO_FIND_CHUNK
{noformat}

While the chunk file is not present on disk, so the replayed logIndex 1 
writChunk failed, and in consequence ContainerStateMachine healthy state is set 
to false, all other applyTransactions and write requests are failed followly 
due to statemachine unhealthy. 

The involved container 65078 is a closed state container with all three replica 
same BCSID, so it's likely the missing chunk file is deleted officially, reset 
the nextIndex to 0 on grpc io failure case increase the possibility of read 
deleted chunk files during ReadStateMachine, further cause the pipeline to 
close, and containers of pipeline become QUASI_CLOSED containers. 

One suggestion is if there is "io exception", which is likely network is 
broken, we can keep the same nextIndex, not change it. 







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to