[
https://issues.apache.org/jira/browse/RATIS-2504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-2504:
------------------------------
Component/s: gRPC
> Follower's nextIndex is reset to 0 in grpcLogAppender.resetClient()
> -------------------------------------------------------------------
>
> Key: RATIS-2504
> URL: https://issues.apache.org/jira/browse/RATIS-2504
> Project: Ratis
> Issue Type: Improvement
> Components: gRPC
> Reporter: Sammi Chen
> Assignee: Tsz-wo Sze
> Priority: Major
>
> The node becomes leader at term 28 and index 262936.
> {noformat}
> 2026-04-09 07:24:42,753 INFO
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
> 3497a434-0af8-4e24-9818-3aa5aaeb539d: start
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderStateImpl
> 2026-04-09 07:24:42,753 INFO
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
> Rolling segment log-217137_262936 to index:262936
> 2026-04-09 07:24:42,759 INFO
> [3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
>
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
> created new log segment
> /metadata/1/hadoop-ozone/datanode/ratis/data9/614975a1-3dad-4dff-85a2-b5d703ef062a/current/log_inprogress_262937
> {noformat}
> One of the followers is shut down; the other follower is online and healthy.
> The shut-down follower cannot respond to any message sent by grpcLogAppender:
> {noformat}
> 2026-04-09 07:24:42,754 WARN
> [grpc-default-executor-38]-org.apache.ratis.grpc.server.GrpcLogAppender:
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6-AppendLogResponseHandler:
> Failed appendEntries:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> 2026-04-09 07:24:42,754 INFO
> [grpc-default-executor-24]-org.apache.ratis.server.leader.FollowerInfo:
> 3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6:
> decreaseNextIndex nextIndex: updateUnconditionally 262937 -> 0
> {noformat}
> After receiving the "io exception", grpcLogAppender.resetClient() resets this
> follower's nextIndex to 0, and the leader then starts to fetch the writeChunk
> record at logIndex 1:
> {noformat}
> 2026-04-09 07:24:42,759 WARN
> [ChunkWriter-29-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler:
> Operation: ReadChunk , Trace ID: , Message:
> java.nio.file.NoSuchFileException:
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
> , Result: UNABLE_TO_FIND_CHUNK , StorageContainerException Occurred.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
> java.nio.file.NoSuchFileException:
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
> at
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:431)
> at
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:213)
> at
> org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:200)
> at
> org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:106)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:752)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:273)
> at
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:236)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:345)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:178)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:177)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:500)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:830)
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$read$8(ContainerStateMachine.java:915)
> at
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:840)
> Caused by: java.nio.file.NoSuchFileException:
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
> at
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
> at
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
> at
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> at
> java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
> at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
> at
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$3(ChunkUtils.java:202)
> at
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:371)
> at
> org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:201)
> ... 16 more
> 2026-04-09 07:24:42,760 ERROR
> [ChunkWriter-29-0]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine:
> gid group-B5D703EF062A : ReadStateMachine failed. cmd ReadChunk logIndex 1
> msg : java.nio.file.NoSuchFileException:
> /data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
> Container Result: UNABLE_TO_FIND_CHUNK
> {noformat}
> The chunk file is no longer present on disk, so the replayed writeChunk at
> logIndex 1 fails. As a consequence, the ContainerStateMachine healthy state is
> set to false, and all subsequent applyTransactions and write requests fail
> because the state machine is unhealthy.
> The container involved, 65078, is in the closed state and all three replicas
> have the same BCSID, so the missing chunk file was most likely deleted
> legitimately. Resetting nextIndex to 0 on a gRPC io failure increases the
> chance of reading deleted chunk files during ReadStateMachine, which in turn
> causes the pipeline to close and the containers of that pipeline to become
> QUASI_CLOSED containers.
> One suggestion: on an "io exception", which most likely means the network is
> broken, keep the follower's nextIndex unchanged.
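
The suggestion above can be illustrated with a minimal sketch. It is not the
Ratis implementation and is not a proposed patch: the class
NextIndexOnFailure, the Failure enum, the Follower stand-in, and the method
onAppendFailure are all hypothetical names introduced here for illustration.
The sketch assumes the leader can tell a transport-level failure (such as the
UNAVAILABLE status in the log above) apart from a reply in which the follower
reports a log mismatch. It leaves nextIndex untouched in the first case and
steps back by one entry in the second, as in the basic Raft retry rule.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch (not Ratis code): decide the follower's nextIndex
 * after a failed appendEntries call.
 */
public class NextIndexOnFailure {
  /** Hypothetical failure classification; Ratis does not define this enum. */
  enum Failure { TRANSPORT_UNAVAILABLE, LOG_INCONSISTENCY }

  /** Minimal stand-in for the per-follower state kept by the leader. */
  static final class Follower {
    private final AtomicLong nextIndex;

    Follower(long initialNextIndex) {
      this.nextIndex = new AtomicLong(initialNextIndex);
    }

    long getNextIndex() {
      return nextIndex.get();
    }
  }

  /**
   * Returns the nextIndex to use for the retry.
   * A transport failure says nothing about the follower's log, so the
   * index is kept. A reported mismatch moves it back by one, never below 0.
   */
  static long onAppendFailure(Follower follower, Failure failure) {
    if (failure == Failure.TRANSPORT_UNAVAILABLE) {
      return follower.getNextIndex();
    }
    return follower.nextIndex.updateAndGet(i -> Math.max(0L, i - 1));
  }

  public static void main(String[] args) {
    Follower follower = new Follower(262937L);
    // Network failure: nextIndex stays where it was.
    System.out.println(onAppendFailure(follower, Failure.TRANSPORT_UNAVAILABLE));
    // Follower reported a mismatch: step back one entry.
    System.out.println(onAppendFailure(follower, Failure.LOG_INCONSISTENCY));
  }
}
```

With the values from this report, the first call keeps nextIndex at 262937
instead of moving it to 0, so the leader would not re-read the state machine
data for logIndex 1 while the follower is unreachable.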
--
This message was sent by Atlassian Jira
(v8.20.10#820010)