[jira] [Updated] (HDDS-15443) Close statemachine immediately on writeStateMachineData failure

Tsz-wo Sze (Jira) Wed, 03 Jun 2026 10:32:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tsz-wo Sze updated HDDS-15443:
------------------------------
    Description: 
(Revised by [~szetszwo])

When writeStateMachineData fails on a datanode (e.g. disk-out-of-space) for a
log entry of a client request, applyTransaction can never happen for that
request. Also, the datanode cannot append log entries anymore, since a
writeStateMachineData failure is a RaftLog failure.

 - (Bad case) If the datanode is a leader, it will never respond to the client
for that request. The client will keep waiting for that request and retrying
until either its RetryPolicy stops retrying or the RaftGroup is removed by SCM.
With the default conf, the SCM will remove the RaftGroup in ~5min, while the
client will retry for much longer than 5min. As a result, the client will hang
for ~5min and cannot write any other requests in the meantime.

 - (Better case) If the datanode is a follower, it will stop working since it
cannot append log entries anymore. The client is able to receive a reply
from the leader for that request. Then, it will watch for ALL_COMMIT for the
log entry of that request. Since a follower has failed, the ALL_COMMIT watch
can never receive a reply until the watch times out (default 3min). In this
case, the client can continue writing other requests while it is waiting for
the watch.
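
The timing in the two cases above can be worked through with a small sketch.
This is only an illustrative model, not Ozone or Ratis code: the class and
method names are hypothetical, the 5min and 3min values are the defaults quoted
above, and the 10min retry budget is an assumption (the description only says
"much longer than 5min").

```java
import java.time.Duration;

// Hypothetical model of the two cases; none of these names are Ozone or Ratis APIs.
final class WriteFailureTimeline {
  // Defaults quoted in the description.
  static final Duration GROUP_REMOVAL = Duration.ofMinutes(5); // SCM removes the RaftGroup
  static final Duration WATCH_TIMEOUT = Duration.ofMinutes(3); // watch ALL_COMMIT timeout

  // Leader case: the client stays blocked until whichever happens first,
  // its retry budget running out or SCM removing the RaftGroup.
  static Duration leaderCaseBlocked(Duration retryBudget) {
    return retryBudget.compareTo(GROUP_REMOVAL) < 0 ? retryBudget : GROUP_REMOVAL;
  }

  // Follower case: the leader answers the write, so the client is not blocked;
  // only the ALL_COMMIT watch keeps waiting, up to the watch timeout.
  static Duration followerCaseWatchWait() {
    return WATCH_TIMEOUT;
  }

  public static void main(String[] args) {
    Duration retryBudget = Duration.ofMinutes(10); // assumed value, see lead-in
    System.out.println("leader case, client blocked for: " + leaderCaseBlocked(retryBudget));
    System.out.println("follower case, watch waits for:  " + followerCaseWatchWait());
  }
}
```

With any retry budget longer than 5min, the leader case is bounded by the
RaftGroup removal, which matches the ~5min hang described above.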

  was:
When the leader performs write() and it fails, the Ratis server does not
respond immediately, because it waits for a re-election so that another server
can handle the request with a quorum. But since the leader is still present,
the re-election either does not happen or succeeds only by chance.

Since no reply is returned by the server, the client hangs until a timeout
occurs or the pipeline is closed by SCM on this error.

The state machine is not usable at that point, since no other request is
allowed to be processed. So it is better to close it, giving the behavior
below:

If the leader's write() fails and the state machine closes,
 * the leader replies with ServerNotReadyException immediately
 * the client retries as per its policy, until there is either a new leader or
the raft group is removed
 * a leader election will happen within a few seconds once the leader is closed
 * once a new leader is chosen and the client retries, the request returns
success with a majority commit
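
The close-on-failure behavior listed above could be sketched roughly as
follows. This is a self-contained toy model, not the Ozone or Ratis
implementation: ClosingStateMachineSketch, ChunkWriter and NotReadyException are
hypothetical names, and NotReadyException is only a local stand-in for the
ServerNotReadyException mentioned above.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model: after the first failed write, the state machine closes itself and
// rejects every later write immediately instead of leaving the caller waiting.
final class ClosingStateMachineSketch {
  // Local stand-in for the ServerNotReadyException named in the description.
  static final class NotReadyException extends RuntimeException {
    NotReadyException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Hypothetical storage hook; throwing simulates e.g. disk-out-of-space.
  interface ChunkWriter {
    void write(byte[] data) throws IOException;
  }

  private final ChunkWriter writer;
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private volatile Throwable closeCause;

  ClosingStateMachineSketch(ChunkWriter writer) {
    this.writer = writer;
  }

  boolean isClosed() {
    return closed.get();
  }

  CompletableFuture<Void> write(byte[] data) {
    if (closed.get()) {
      // Already closed: fail fast so the caller can retry elsewhere.
      return CompletableFuture.failedFuture(
          new NotReadyException("state machine is closed", closeCause));
    }
    try {
      writer.write(data);
      return CompletableFuture.completedFuture(null);
    } catch (IOException e) {
      // Record the cause first, then close, so later callers see the cause.
      closeCause = e;
      closed.set(true);
      return CompletableFuture.failedFuture(
          new NotReadyException("write failed, state machine closed", e));
    }
  }

  public static void main(String[] args) {
    ClosingStateMachineSketch sm = new ClosingStateMachineSketch(data -> {
      throw new IOException("No space left on device");
    });
    System.out.println("first write failed:  " + sm.write(new byte[] {1}).isCompletedExceptionally());
    System.out.println("closed afterwards:   " + sm.isClosed());
    System.out.println("second write failed: " + sm.write(new byte[] {2}).isCompletedExceptionally());
  }
}
```

The point of the sketch is only that every future completes, successfully or
exceptionally, so a caller never waits on a reply that cannot arrive.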

 

If one follower's write() fails and its state machine closes, the leader will
still process client requests, committing with success on a majority of nodes.

 

On failure of any node, SCM
 * will close containers after a cool-down time (2.5 minutes by default)
 * will stop allocating any new blocks
 * will close the pipeline after 5 min

This ensures that in-progress writes can finish on the 2 nodes still running,
if any.

 

Impact:
 * Graceful shutdown to finish applyTransaction is not handled; impact:
 ** If the leader closes, it returns a failure to the client waiting for a
reply, and the client can retry
 ** If one follower closes, a majority of nodes is still present to process
requests, and the container closes before the pipeline closes
 ** 2-node follower failure: only one node has the data, as expected.

 

The issues below are to be handled in a separate JIRA:
 # 2-node failure case
 # client configuration for the long wait for commit-all / majority-commit, and
other configs

 

 

        Summary: Close statemachine immediately on writeStateMachineData 
failure  (was: close statemachine immediately on write failure)

[~sumitagrawl], let's focus only on the problem this JIRA is trying to address.
I have revised the Summary and the Description.

> Close statemachine immediately on writeStateMachineData failure
> ---------------------------------------------------------------
>
>                 Key: HDDS-15443
>                 URL: https://issues.apache.org/jira/browse/HDDS-15443
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: Ozone Datanode
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>              Labels: pull-request-available
>
> (Revised by [~szetszwo])
> When writeStateMachineData fails on a datanode (e.g. disk-out-of-space) for a
> log entry of a client request, applyTransaction can never happen for that
> request. Also, the datanode cannot append log entries anymore, since a
> writeStateMachineData failure is a RaftLog failure.
>  - (Bad case) If the datanode is a leader, it will never respond to the
> client for that request. The client will keep waiting for that request and
> retrying until either its RetryPolicy stops retrying or the RaftGroup is
> removed by SCM. With the default conf, the SCM will remove the RaftGroup in
> ~5min, while the client will retry for much longer than 5min. As a result,
> the client will hang for ~5min and cannot write any other requests in the
> meantime.
>  - (Better case) If the datanode is a follower, it will stop working since it
> cannot append log entries anymore. The client is able to receive a
> reply from the leader for that request. Then, it will watch for ALL_COMMIT
> for the log entry of that request. Since a follower has failed, the
> ALL_COMMIT watch can never receive a reply until the watch times out (default
> 3min). In this case, the client can continue writing other requests while it
> is waiting for the watch.



