Sammi Chen created RATIS-2504:
---------------------------------
Summary: Follower's nextIndex is reset to 0 in
grpcLogAppender.resetClient()
Key: RATIS-2504
URL: https://issues.apache.org/jira/browse/RATIS-2504
Project: Ratis
Issue Type: Improvement
Reporter: Sammi Chen
The node becomes leader on term 28, and index 262936.
{noformat}
2026-04-09 07:24:42,753 INFO
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.impl.RoleInfo:
3497a434-0af8-4e24-9818-3aa5aaeb539d: start
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderStateImpl
2026-04-09 07:24:42,753 INFO
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-LeaderElection8]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
Rolling segment log-217137_262936 to index:262936
2026-04-09 07:24:42,759 INFO
[3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker]-org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A-SegmentedRaftLogWorker:
created new log segment
/metadata/1/hadoop-ozone/datanode/ratis/data9/614975a1-3dad-4dff-85a2-b5d703ef062a/current/log_inprogress_262937
{noformat}
One of the follower is shutdown and another follower is online and healthy. The
shutdown follower cannot response to any message sent by grpcLogAppender
{noformat}
2026-04-09 07:24:42,754 WARN
[grpc-default-executor-38]-org.apache.ratis.grpc.server.GrpcLogAppender:
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6-AppendLogResponseHandler:
Failed appendEntries:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
2026-04-09 07:24:42,754 INFO
[grpc-default-executor-24]-org.apache.ratis.server.leader.FollowerInfo:
3497a434-0af8-4e24-9818-3aa5aaeb539d@group-B5D703EF062A->f987210a-e081-418c-873b-3107732c7fe6:
decreaseNextIndex nextIndex: updateUnconditionally 262937 -> 0
{noformat}
After received "io exception", grpcLogAppender.resetClient() reset this
follower's nextIndex to 0, and then start to fetch logIndex 1 writeChunk
record,
{noformat}
2026-04-09 07:24:42,759 WARN
[ChunkWriter-29-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler:
Operation: ReadChunk , Trace ID: , Message: java.nio.file.NoSuchFileException:
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
, Result: UNABLE_TO_FIND_CHUNK , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
java.nio.file.NoSuchFileException:
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
at
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.wrapInStorageContainerException(ChunkUtils.java:431)
at
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:213)
at
org.apache.hadoop.ozone.container.keyvalue.impl.FilePerBlockStrategy.readChunk(FilePerBlockStrategy.java:200)
at
org.apache.hadoop.ozone.container.keyvalue.impl.ChunkManagerDispatcher.readChunk(ChunkManagerDispatcher.java:106)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleReadChunk(KeyValueHandler.java:752)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:273)
at
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:236)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:345)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:178)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:177)
at
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:500)
at
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.readStateMachineData(ContainerStateMachine.java:830)
at
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$read$8(ContainerStateMachine.java:915)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.nio.file.NoSuchFileException:
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
at
java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at
java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at
java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
at
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.lambda$readData$3(ChunkUtils.java:202)
at
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.processFileExclusively(ChunkUtils.java:371)
at
org.apache.hadoop.ozone.container.keyvalue.helpers.ChunkUtils.readData(ChunkUtils.java:201)
... 16 more
2026-04-09 07:24:42,760 ERROR
[ChunkWriter-29-0]-org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine:
gid group-B5D703EF062A : ReadStateMachine failed. cmd ReadChunk logIndex 1 msg
: java.nio.file.NoSuchFileException:
/data/14/hadoop-ozone/datanode/data9/hdds/CID-fd30d4ce-0371-4d65-9ec7-033c6d8a7739/current/containerDir127/65078/chunks/117883640223800349.block
Container Result: UNABLE_TO_FIND_CHUNK
{noformat}
While the chunk file is not present on disk, so the replayed logIndex 1
writChunk failed, and in consequence ContainerStateMachine healthy state is set
to false, all other applyTransactions and write requests are failed followly
due to statemachine unhealthy.
The involved container 65078 is a closed state container with all three replica
same BCSID, so it's likely the missing chunk file is deleted officially, reset
the nextIndex to 0 on grpc io failure case increase the possibility of read
deleted chunk files during ReadStateMachine, further cause the pipeline to
close, and containers of pipeline become QUASI_CLOSED containers.
One suggestion is if there is "io exception", which is likely network is
broken, we can keep the same nextIndex, not change it.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)