[ https://issues.apache.org/jira/browse/RATIS-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849603#comment-17849603 ]

Ivan Andika edited comment on RATIS-2089 at 5/27/24 2:42 AM:
-------------------------------------------------------------

Thank you for checking this out; I'm glad to see that this might be useful.

I thought of this improvement while analyzing the Ratis block write process and 
the upper-bound latency of a block write in case of failure (e.g. one datanode 
is down/unreachable). We found three relevant watch configurations that might 
be too high (a configuration sketch follows the list):
 * Client-side watch timeout: hdds.ratis.raft.client.rpc.watch.request.timeout
 ** Ratis equivalent: raft.client.rpc.watch.request.timeout
 ** Ratis default: 10s
 ** Ozone default: 180s
 * Server-side watch timeout (This is the one that throws 
NotReplicatedException): hdds.ratis.raft.server.watch.timeout
 ** Ratis equivalent: raft.server.watch.timeout
 ** Ratis default: 10s
 ** Ozone default: 180s
 * Client-side watch retry timeout: hdds.ratis.client.request.watch.timeout
 ** No Ratis equivalent. This is used in RequestTypeDependentRetryPolicyCreator 
to decide whether to throw RaftRetryFailureException.
 ** I don't think this configuration is exercised often, since the WATCH 
request retry policy is effectively RetryPolicies.noRetry(): 
NotReplicatedException has a NO_RETRY policy (only 
ResourceUnavailableException will be retried).
 ** Ozone default: 3m
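
To make the relationship concrete, here is a minimal sketch (assuming the 
standard Ratis config-key helpers RaftServerConfigKeys.Watch.setTimeout and 
RaftClientConfigKeys.Rpc.setWatchRequestTimeout) of setting the two 
Ratis-level timeouts so that the server-side watch expires first:

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.ratis.client.RaftClientConfigKeys;
import org.apache.ratis.conf.RaftProperties;
import org.apache.ratis.server.RaftServerConfigKeys;
import org.apache.ratis.util.TimeDuration;

public final class WatchTimeoutTuning {
  public static RaftProperties tuneWatchTimeouts() {
    final RaftProperties properties = new RaftProperties();
    // raft.server.watch.timeout: after this duration, the server fails the
    // watch with NotReplicatedException.
    RaftServerConfigKeys.Watch.setTimeout(
        properties, TimeDuration.valueOf(30, TimeUnit.SECONDS));
    // raft.client.rpc.watch.request.timeout: keep it strictly larger than
    // the server-side timeout so the server's NotReplicatedException can
    // reach the client before the client itself gives up.
    RaftClientConfigKeys.Rpc.setWatchRequestTimeout(
        properties, TimeDuration.valueOf(180, TimeUnit.SECONDS));
    return properties;
  }
}
{code}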

All three currently have the same value. With equal values, the client-side 
watch timeout will most likely always be the one that triggers, since it 
starts counting first. This may cause NotReplicatedException to never be 
caught by the client, because the client times out before the server does. I 
believe we need to reduce the server-side watch timeout significantly (e.g. to 
30s or 10s) so that NotReplicatedException is thrown faster. Only then will 
the chance of catching NotReplicatedException be higher, letting us use the 
CommitInfoProto. [~smeng] May I know whether your Ozone cluster changed these 
configurations, and whether there were latency improvements?
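
On the client side, the flow would look roughly like the following sketch. It 
assumes the Ratis AsyncApi (client.async().watch(...)); note that the 
CommitInfoProto inside NotReplicatedException is the behavior proposed by 
RATIS-2089, not what Ratis does today:

{code:java}
import java.util.concurrent.CompletableFuture;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.proto.RaftProtos.ReplicationLevel;
import org.apache.ratis.protocol.RaftClientReply;
import org.apache.ratis.protocol.exceptions.NotReplicatedException;

public final class WatchReplySketch {
  // 'client' and 'logIndex' are assumed to come from the block write path.
  static CompletableFuture<RaftClientReply> watchAllCommitted(
      RaftClient client, long logIndex) {
    return client.async()
        .watch(logIndex, ReplicationLevel.ALL_COMMITTED)
        .thenApply(reply -> {
          final NotReplicatedException nre = reply.getNotReplicatedException();
          if (nre != null) {
            // The server-side watch timed out first. With RATIS-2089 the
            // exception would also carry the CommitInfoProtos, so the
            // client could prune commitInfoMap without a second watch.
          }
          // On success, reply.getCommitInfos() already reports the commit
          // index of every peer.
          return reply;
        });
  }
}
{code}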

 

I will try to come up with a simple patch for RATIS-2089 soon, and will also 
take a look at HDDS-10108.

[~szetszwo] [~smeng] Also, may I ask what was the reason for the ALL_COMMITTED 
watch requests on the block write in the first place? Alternatively, can we 
release the client buffer immediately after a successful WriteChunk/PutBlock 
request? My understanding is that the reply from the Ratis server implies 
MAJORITY_COMMITTED, so there is no need to keep the buffer, since the data 
will eventually be replicated to all the Datanode Ratis servers. Is it to 
ensure that when the client (XceiverClientGrpc) reads from any datanode, all 
the data will already have been replicated to all the datanodes, thereby 
preventing BCSID_MISMATCH during BlockManagerImpl#getBlock?
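
If releasing the buffer on the write reply alone were acceptable, the change 
would be conceptually small. A hypothetical sketch (not Ozone's actual buffer 
management) of that alternative and its trade-off:

{code:java}
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

public final class BufferReleaseSketch {
  // Hypothetical alternative: free the client buffer as soon as the
  // WriteChunk/PutBlock reply arrives (which already implies
  // MAJORITY_COMMITTED), instead of waiting for an ALL_COMMITTED watch.
  static CompletableFuture<Void> writeThenReleaseEarly(
      CompletableFuture<?> writeReply, ByteBuffer buffer) {
    // Trade-off: a subsequent read served by a lagging follower could
    // still observe an older BCSID, i.e. BCSID_MISMATCH in
    // BlockManagerImpl#getBlock.
    return writeReply.thenRun(() -> releaseBuffer(buffer));
  }

  private static void releaseBuffer(ByteBuffer buffer) {
    buffer.clear(); // placeholder for returning the buffer to its pool
  }
}
{code}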


> Add CommitInfoProto in NotReplicatedException
> ---------------------------------------------
>
>                 Key: RATIS-2089
>                 URL: https://issues.apache.org/jira/browse/RATIS-2089
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Minor
>
> In Ozone's XceiverClientRatis#watchForCommit, there are two watch commit 
> requests with different ReplicationLevels:
>  # Watch for ALL_COMMITTED
>  # Watch for MAJORITY_COMMITTED (if the previous watch threw an exception)
> Based on the second watch request, the client will remove the UUIDs of 
> failed datanodes from the commitInfoMap.
> The second watch might not be necessary, since the entries in 
> AbstractCommitWatcher.commitIndexMap imply that the PutBlock request has 
> already been committed to a majority of the servers. From my understanding, 
> the second MAJORITY_COMMITTED watch only serves to gather the information 
> needed to remove entries from commitInfoMap.
> If the first watch fails with NotReplicatedException, we might be able to 
> remove the need for a second watch request. Since NotReplicatedException is 
> a Raft server exception, we can include the CommitInfoProtos in the 
> NotReplicatedException. The client can then use these CommitInfoProtos to 
> remove entries from commitInfoMap without sending another WATCH request.
> These CommitInfoProtos are returned with every RaftClientReply 
> (RaftClientReply.commitInfos), but if there is an exception, it seems the 
> RaftClientReply is not accessible to the client.
> However, if the exception is a client-side exception (e.g. due to the Raft 
> client watch timeout configuration), the client might have no choice but to 
> send another watch request.
> So in this patch, I propose including CommitInfoProto in 
> NotReplicatedException.
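
For reference, a hypothetical paraphrase of the two-watch flow described above 
(not the exact XceiverClientRatis code), showing the round trip the proposal 
would remove:

{code:java}
import java.util.Collection;

import org.apache.ratis.client.RaftClient;
import org.apache.ratis.proto.RaftProtos.CommitInfoProto;
import org.apache.ratis.proto.RaftProtos.ReplicationLevel;

public final class TwoWatchFlowSketch {
  static Collection<CommitInfoProto> watchForCommit(RaftClient client,
      long index) throws Exception {
    try {
      // 1. Wait for every peer to commit up to 'index'.
      return client.async()
          .watch(index, ReplicationLevel.ALL_COMMITTED)
          .get().getCommitInfos();
    } catch (Exception e) {
      // 2. Today: a second MAJORITY_COMMITTED watch, issued only to learn
      // which peers lag so they can be pruned from commitInfoMap. With the
      // CommitInfoProtos embedded in NotReplicatedException, this extra
      // round trip could be skipped.
      return client.async()
          .watch(index, ReplicationLevel.MAJORITY_COMMITTED)
          .get().getCommitInfos();
    }
  }
}
{code}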


