[ 
https://issues.apache.org/jira/browse/RATIS-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reassigned RATIS-2504:
---------------------------------

    Assignee: Tsz-wo Sze

> Follower's nextIndex is reset to 0 in grpcLogAppender.resetClient()
> -------------------------------------------------------------------
>
>                 Key: RATIS-2504
>                 URL: https://issues.apache.org/jira/browse/RATIS-2504
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Assignee: Tsz-wo Sze
>            Priority: Major
>
> The node becomes leader at term 28, index 262936.
> {noformat}
> 2026-04-09 07:24:42,753 INFO 
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
>  3497a434-0af8-4e24-9818-3aa5aaeb539d: start 
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderStateImpl
> 2026-04-09 07:24:42,753 INFO 
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>  
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
>  Rolling segment log-217137_262936 to index:262936
> 2026-04-09 07:24:42,759 INFO 
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>  
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
>  created new log segment 
> /metadata/1/hadoop-ozone/datanode/ratis/data9/614975a1-3dad-4dff-85a2-b5d703ef062a/current/log_inprogress_262937
> {noformat}
> One of the followers is shut down while the other follower is online and healthy. 
> The shut-down follower cannot respond to any message sent by GrpcLogAppender:
> {noformat}
> 2026-04-09 07:24:42,754 WARN 
> [grpc-default-executor-38]-org.apache.ratis.grpc.server.GrpcLogAppender: 
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6-AppendLogResponseHandler:
>  Failed appendEntries: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2026-04-09 07:24:42,754 INFO 
> [grpc-default-executor-24]-org.apache.ratis.server.leader.FollowerInfo: 
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6:
>  decreaseNextIndex nextIndex: updateUnconditionally 262937 -> 0
> {noformat}
> After receiving the "io exception", grpcLogAppender.resetClient() resets this 
> follower's nextIndex to 0, and the leader then starts to fetch the writeChunk 
> record at logIndex 1:
> {noformat}
> 2026-04-09 07:24:42,759 WARN 
> [ChunkWriter-29-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler:
>  Operation: ReadChunk , Trace ID:  , Message: 
> java.nio.file.NoSuchFileException: 
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
>  , Result: UNABLE_TO_FIND_CHUNK , StorageContainerException Occurred.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
>         at 
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:431)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:213)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:200)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:106)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:752)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:273)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:236)
>         at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:345)
>         at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:178)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:177)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:500)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:830)
>         at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$read$8(ContainerStateMachine.java:915)
>         at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>         at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.nio.file.NoSuchFileException: 
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
>         at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>         at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
>         at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>         at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
>         at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$3(ChunkUtils.java:202)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:371)
>         at 
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:201)
>         ... 16 more
> 2026-04-09 07:24:42,760 ERROR 
> [ChunkWriter-29-0]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine:
>  gid group-B5D703EF062A : ReadStateMachine failed. cmd ReadChunk logIndex 1 
> msg : java.nio.file.NoSuchFileException: 
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
>  Container Result: UNABLE_TO_FIND_CHUNK
> {noformat}
> Since the chunk file is no longer present on disk, the replayed logIndex 1 
> writeChunk fails. As a consequence, the ContainerStateMachine healthy state is 
> set to false, and all subsequent applyTransaction and write requests then fail 
> because the state machine is unhealthy. 
> The involved container 65078 is a CLOSED-state container whose three replicas 
> all have the same BCSID, so the missing chunk file was most likely deleted 
> legitimately. Resetting nextIndex to 0 on a gRPC I/O failure increases the 
> chance of reading deleted chunk files during readStateMachineData, which in 
> turn causes the pipeline to close and the containers of the pipeline to become 
> QUASI_CLOSED. 
> One suggestion: when the failure is an "io exception", which most likely means 
> the network is broken, keep the follower's nextIndex unchanged instead of 
> resetting it. 
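The suggested behavior can be sketched in a few lines. This is a hypothetical, self-contained model of the idea only; the class and method names below are illustrative and do not reflect the actual Ratis GrpcLogAppender/FollowerInfo code:

```java
// Hypothetical sketch: keep nextIndex on transport-level failures.
// Not the real Ratis implementation; names are illustrative.
public class NextIndexPolicyDemo {
  /** Minimal stand-in for the per-follower state the leader tracks. */
  static class FollowerState {
    private long nextIndex;

    FollowerState(long nextIndex) {
      this.nextIndex = nextIndex;
    }

    long getNextIndex() {
      return nextIndex;
    }

    /**
     * Suggested handling of an appendEntries failure: a transport-level
     * failure (e.g. gRPC UNAVAILABLE / "io exception") says nothing about
     * the follower's log contents, so keep nextIndex unchanged and simply
     * retry later. Only a non-transport failure falls back to the current
     * reset-to-0 behavior (simplified here).
     */
    void onAppendFailure(boolean isIoException) {
      if (isIoException) {
        return; // network problem: preserve nextIndex
      }
      nextIndex = 0; // other failures: current reset behavior (simplified)
    }
  }

  public static void main(String[] args) {
    FollowerState f = new FollowerState(262937);
    f.onAppendFailure(true);   // gRPC io exception: index preserved
    System.out.println(f.getNextIndex());
    f.onAppendFailure(false);  // non-transport failure: reset as today
    System.out.println(f.getNextIndex());
  }
}
```

With this policy, the transient network failure in the logs above would leave nextIndex at 262937, so the leader would never replay logIndex 1 and never touch the already-deleted chunk files of container 65078.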



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
