[ 
https://issues.apache.org/jira/browse/IGNITE-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-25421:
----------------------------------
    Description: 
In RaftGroupServiceImpl we have the following parameters for retrying requests:
 * request timeout: the timeout of a single request to the raft group; the 
completable future fails on the client side if the timeout is exceeded;
 * retry timeout: the total timeout to get a successful response; includes all 
retry attempts;
 * retry delay: the delay before the next retry attempt is scheduled in case of failure.
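
As a rough illustration of how these three parameters interact, the following Java sketch estimates how many attempts fit into the retry timeout when every attempt times out. The class RetryParams and its field names are hypothetical and are not taken from RaftGroupServiceImpl; the sketch also assumes that each failed attempt costs exactly one request timeout plus one retry delay.

```java
/** Hypothetical holder for the three retry parameters; not the actual RaftGroupServiceImpl fields. */
final class RetryParams {
    final long requestTimeoutMs;
    final long retryTimeoutMs;
    final long retryDelayMs;

    RetryParams(long requestTimeoutMs, long retryTimeoutMs, long retryDelayMs) {
        if (requestTimeoutMs <= 0 || retryTimeoutMs <= 0 || retryDelayMs < 0) {
            throw new IllegalArgumentException("timeouts must be positive, delay non-negative");
        }
        this.requestTimeoutMs = requestTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
        this.retryDelayMs = retryDelayMs;
    }

    /**
     * Number of attempts started before the retry timeout expires, assuming every
     * attempt fails by request timeout and the next one starts after the retry delay.
     */
    long attemptsIfAllTimeOut() {
        long cycleMs = requestTimeoutMs + retryDelayMs;
        // Attempt k starts at k * cycleMs; count the starts that fall before retryTimeoutMs.
        return (retryTimeoutMs + cycleMs - 1) / cycleMs;
    }
}
```

With illustrative values of 3000 ms for the request timeout, 10000 ms for the retry timeout and 200 ms for the retry delay, attempts start at 0, 3200, 6400 and 9600 ms, so the sketch reports 4 attempts.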

The problem is that the retry model is too simple:
 * in the case of an overloaded raft group it throws "TimeoutException: Send with 
retry timed out", giving no useful information to the user;
 * it performs retries after a short delay, producing more repeated requests to 
the overloaded group while the old requests are still somewhere in the queue;
 * it doesn't limit the number of requests to the raft group.

At the same time, the retries are useful:
 * the raft leader can change at any moment;
 * a network failure, a GC pause on the leader, or anything else may happen that 
will be seen on the client side as a TimeoutException, and the request should be 
retried. This is also the reason why the request timeout is less than the retry 
timeout.

*Proposal*

The simplest solution would be:
 - dividing the requests into two groups: those that are being retried and those 
that are incoming into the client. The former should be retried until the retry 
timeout is exceeded; the latter may be rejected with an exception instantly. To 
achieve this, we may maintain a "request capacity" per remote node;
 - increasing the request timeout until it reaches the retry timeout in the case of 
a TimeoutException. This gives requests that are being retried a chance 
to be processed by the raft group within the timeout. The increased timeout should 
apply to any request sent by the client to the overloaded node; ideally it should 
apply to any request to the same node, because striped disruptors are shared 
between groups;
 - decreasing the request timeout back when the response time in the last N 
seconds becomes less than some threshold.
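
The timeout adjustment described in the last two items could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class AdaptiveTimeout and its method names are hypothetical, doubling on each TimeoutException is an assumed growth policy (the proposal does not fix one), and resetting straight to the base value is one possible way to "decrease back".

```java
/** Hypothetical per-node request timeout that grows on timeouts and shrinks when the node recovers. */
final class AdaptiveTimeout {
    private final long baseMs;   // the configured request timeout
    private final long maxMs;    // the retry timeout, used as the upper bound
    private long currentMs;

    AdaptiveTimeout(long baseMs, long maxMs) {
        if (baseMs <= 0 || maxMs < baseMs) {
            throw new IllegalArgumentException("expected 0 < baseMs <= maxMs");
        }
        this.baseMs = baseMs;
        this.maxMs = maxMs;
        this.currentMs = baseMs;
    }

    synchronized long currentMs() {
        return currentMs;
    }

    /** Called when a request fails with TimeoutException: double the timeout, capped at the retry timeout. */
    synchronized void onTimeout() {
        currentMs = Math.min(maxMs, currentMs * 2);
    }

    /**
     * Called with the slowest response time observed in the last N seconds.
     * Resets to the base timeout only if that time is below the threshold.
     */
    synchronized void onWindow(long slowestResponseMs, long thresholdMs) {
        if (slowestResponseMs < thresholdMs) {
            currentMs = baseMs;
        }
    }
}
```

In this sketch a base timeout of 3000 ms with a retry timeout of 10000 ms grows to 6000 ms and then to 10000 ms after two consecutive timeouts, and returns to 3000 ms once a window reports response times under the threshold.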

So, there can be a shared context between clients that keeps the remote nodes' 
capacities and the request timeout for each of them.
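
One possible shape for the capacity part of such a shared context is sketched below in Java. The class NodeThrottleContext, its methods and the use of a node name as the key are all hypothetical; the sketch assumes that a new request must reserve a slot before it is sent, and that a request being retried keeps the slot it already holds until it completes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical context shared by raft clients; tracks the request capacity per remote node. */
final class NodeThrottleContext {
    private final int capacityPerNode;
    private final ConcurrentHashMap<String, Semaphore> slots = new ConcurrentHashMap<>();

    NodeThrottleContext(int capacityPerNode) {
        if (capacityPerNode <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacityPerNode = capacityPerNode;
    }

    /**
     * Tries to reserve a slot for a new (not yet retried) request.
     * Returns false when the node is at capacity, so the caller can reject the request at once.
     */
    boolean tryAdmit(String nodeName) {
        return slots.computeIfAbsent(nodeName, n -> new Semaphore(capacityPerNode)).tryAcquire();
    }

    /** Frees the slot; must be called exactly once per admitted request, after its last attempt. */
    void release(String nodeName) {
        Semaphore s = slots.get(nodeName);
        if (s != null) {
            s.release();
        }
    }

    /** Free slots for the node; a node never seen before has full capacity. */
    int available(String nodeName) {
        Semaphore s = slots.get(nodeName);
        return s == null ? capacityPerNode : s.availablePermits();
    }
}
```

A caller would invoke tryAdmit before the first attempt, fail fast with a descriptive exception when it returns false, and call release when the request finally succeeds or its retry timeout is exceeded.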


> Add the requests throttling to raft client
> ------------------------------------------
>
>                 Key: IGNITE-25421
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25421
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3


