[ https://issues.apache.org/jira/browse/IGNITE-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Chudov updated IGNITE-25421:
----------------------------------
    Description:
In RaftGroupServiceImpl we have the following parameters for retrying requests:
* request timeout: the timeout of a single request to the raft group; the completable future fails on the client side if this timeout is exceeded;
* retry timeout: the total time allowed to get a successful response; it includes all retry attempts;
* retry delay: the delay before the next retry attempt is scheduled after a failure.

The problem is that the retry model is too simple:
* when the raft group is overloaded, it throws "TimeoutException: Send with retry timed out", which gives the user no useful information;
* it performs retries after a short delay, producing more repeated requests to the overloaded group while the old requests are still somewhere in the queue;
* it doesn't limit the number of requests to the raft group.

At the same time, retries are useful:
* the raft leader can change at any moment;
* a network failure, a GC pause on the leader, or anything else may happen that is seen on the client side as a TimeoutException, and the request should be retried. This is also the reason why the request timeout is less than the retry timeout.

*Proposal*

The simplest solution would be:
- dividing the requests into two groups: those that are being retried and those that are newly arriving at the client. The former should be retried until the retry timeout is exceeded; the latter may be rejected with an exception immediately. To achieve this, we may maintain a "request capacity" per remote node;
- increasing the request timeout, in the case of a TimeoutException, until it reaches the retry timeout. This gives requests that are being retried a chance to be processed by the raft group within the timeout. The increased timeout should apply to any request sent by the client to the overloaded node; ideally, it should apply to any request to the same node, because striped disruptors are shared between groups;
- decreasing the request timeout back when the response time over the last N seconds drops below some threshold.

So, there can be some shared context between clients that keeps the remote nodes' capacities and the request timeout for each of them.

> Add the requests throttling to raft client
> ------------------------------------------
>
>                 Key: IGNITE-25421
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
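
The proposal in the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Ignite implementation: the class names `NodeThrottle` and `ThrottleContext`, the doubling and halving steps, and all method names are hypothetical. The issue itself only says that the timeout should grow up to the retry timeout and shrink back once the response time drops below a threshold.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-node state: in-flight request capacity plus an adaptive request timeout. */
final class NodeThrottle {
    private final Semaphore capacity;
    private final long baseTimeoutMs;
    private final long retryTimeoutMs;
    private final AtomicLong requestTimeoutMs;

    NodeThrottle(int maxInFlight, long baseTimeoutMs, long retryTimeoutMs) {
        this.capacity = new Semaphore(maxInFlight);
        this.baseTimeoutMs = baseTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
        this.requestTimeoutMs = new AtomicLong(baseTimeoutMs);
    }

    /** Admits a new incoming request, or returns false so the caller can reject it at once. */
    boolean tryAdmit() {
        return capacity.tryAcquire();
    }

    /** Frees the slot once an admitted request finally succeeds or fails; retries keep their slot. */
    void complete() {
        capacity.release();
    }

    /** On a TimeoutException: doubles the timeout (assumed policy), capped at the retry timeout. */
    long onTimeout() {
        return requestTimeoutMs.updateAndGet(t -> Math.min(retryTimeoutMs, t * 2));
    }

    /** Given the response time seen over the last N seconds: halves the timeout, never below the base. */
    long onResponseTime(long recentResponseMs, long thresholdMs) {
        if (recentResponseMs >= thresholdMs) {
            return requestTimeoutMs.get();
        }
        return requestTimeoutMs.updateAndGet(t -> Math.max(baseTimeoutMs, t / 2));
    }

    long requestTimeoutMs() {
        return requestTimeoutMs.get();
    }
}

/** Hypothetical context shared by all raft clients, keyed by remote node name. */
final class ThrottleContext {
    private final ConcurrentMap<String, NodeThrottle> nodes = new ConcurrentHashMap<>();
    private final int maxInFlight;
    private final long baseTimeoutMs;
    private final long retryTimeoutMs;

    ThrottleContext(int maxInFlight, long baseTimeoutMs, long retryTimeoutMs) {
        this.maxInFlight = maxInFlight;
        this.baseTimeoutMs = baseTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    /** Returns the single throttle instance for a node, creating it on first use. */
    NodeThrottle forNode(String nodeName) {
        return nodes.computeIfAbsent(
                nodeName, n -> new NodeThrottle(maxInFlight, baseTimeoutMs, retryTimeoutMs));
    }
}
```

In this sketch a client would call `tryAdmit()` only for requests newly arriving at the client and fail them immediately when it returns false. A request that is already admitted keeps its slot across retries and uses the current `requestTimeoutMs()` for each attempt. Keying the state by node rather than by raft group reflects the remark that striped disruptors are shared between groups.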