[
https://issues.apache.org/jira/browse/IGNITE-25421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Pligin reassigned IGNITE-25421:
----------------------------------------
Assignee: Denis Chudov
> Add the requests throttling to raft client
> ------------------------------------------
>
> Key: IGNITE-25421
> URL: https://issues.apache.org/jira/browse/IGNITE-25421
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Assignee: Denis Chudov
> Priority: Major
> Labels: ignite-3
>
> In RaftGroupServiceImpl we have the following parameters for retrying
> requests:
> * request timeout: the timeout of a single request to the raft group; the
> completable future fails on the client side if the timeout is exceeded;
> * retry timeout: the total time allowed to get a successful response,
> including all retry attempts;
> * retry delay: the delay before the next retry attempt is scheduled after a
> failure.
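
For illustration, the interplay of the three parameters above could be sketched roughly as follows. This is a minimal sketch, not the actual RaftGroupServiceImpl code: the class SimpleRetrySketch, its method names and the Supplier-based request are hypothetical, and it retries on any failure, whereas a real client would retry only on recoverable errors.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical illustration of the three retry parameters; not Ignite code. */
class SimpleRetrySketch {
    private final ScheduledExecutorService scheduler;
    private final long requestTimeoutMs; // timeout of a single attempt
    private final long retryTimeoutMs;   // total budget, including all retries
    private final long retryDelayMs;     // pause before the next attempt

    SimpleRetrySketch(ScheduledExecutorService scheduler,
            long requestTimeoutMs, long retryTimeoutMs, long retryDelayMs) {
        this.scheduler = scheduler;
        this.requestTimeoutMs = requestTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
        this.retryDelayMs = retryDelayMs;
    }

    /** Sends the request, retrying failed attempts until the retry timeout expires. */
    <T> CompletableFuture<T> sendWithRetry(Supplier<CompletableFuture<T>> request) {
        CompletableFuture<T> result = new CompletableFuture<>();
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(retryTimeoutMs);
        attempt(request, result, deadlineNanos);
        return result;
    }

    private <T> void attempt(Supplier<CompletableFuture<T>> request,
            CompletableFuture<T> result, long deadlineNanos) {
        // The total budget is exhausted: fail with the uninformative error from the issue.
        if (System.nanoTime() - deadlineNanos >= 0) {
            result.completeExceptionally(new TimeoutException("Send with retry timed out"));
            return;
        }
        request.get()
            // A single attempt fails on the client side once the request timeout is exceeded.
            .orTimeout(requestTimeoutMs, TimeUnit.MILLISECONDS)
            .whenComplete((res, err) -> {
                if (err == null) {
                    result.complete(res);
                } else {
                    // Any failure schedules another attempt after the fixed retry delay.
                    scheduler.schedule(() -> attempt(request, result, deadlineNanos),
                        retryDelayMs, TimeUnit.MILLISECONDS);
                }
            });
    }
}
```

Note that nothing in this loop limits how many attempts are in flight towards a node, and the delay never grows.
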
> The problem is that the retry model is too simple:
> * when the raft group is overloaded, it throws "TimeoutException: Send
> with retry timed out", which gives the user no useful information;
> * it performs retries after a short delay, sending more repeated requests
> to the overloaded group while the old requests are still somewhere in the
> queue;
> * it doesn't limit the number of requests to the raft group.
> At the same time, the retries are useful:
> * the raft leader can change at any moment;
> * a network failure, a GC pause on the leader, or anything else may happen
> that the client side sees as a TimeoutException, and the request should be
> retried. This is also the reason why the request timeout is less than the
> retry timeout.
> *Proposal*
> The simplest solution would be:
> - dividing the requests into two groups: those that are being retried and
> those that are newly incoming into the client. The former should be retried
> until the retry timeout expires; the latter may be rejected with an
> exception instantly. To achieve this, we may maintain a "request capacity"
> per remote node;
> - increasing the request timeout, until it reaches the retry timeout, in the
> case of a TimeoutException. This gives the requests that are being retried
> a chance to be processed by the raft group within the timeout. The increased
> timeout should apply to any request sent by the client to the overloaded
> node; ideally it should apply to any request to the same node, because
> striped disruptors are shared between groups;
> - decreasing the request timeout back when the response time over the last
> N seconds drops below some threshold.
> So, there can be some shared context between clients that keeps remote nodes'
> capacities and request timeouts for each of them.
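
A minimal sketch of what such a shared context might look like, under several assumptions that the description does not specify: the class and method names are hypothetical, capacity is modelled as a cap on in-flight requests per node, the timeout is doubled on each TimeoutException (the proposal only says "increased"), and the caller is expected to measure response times over the last N seconds and report when they fall below the threshold.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical context shared between raft clients; names and policy are assumptions. */
class ThrottlingContextSketch {
    /** Per-node state: in-flight request count and the current request timeout. */
    private static final class NodeState {
        final AtomicInteger inFlight = new AtomicInteger();
        final AtomicLong requestTimeoutMs;

        NodeState(long initialTimeoutMs) {
            requestTimeoutMs = new AtomicLong(initialTimeoutMs);
        }
    }

    private final ConcurrentMap<String, NodeState> nodes = new ConcurrentHashMap<>();
    private final int maxInFlightPerNode;
    private final long baseRequestTimeoutMs;
    private final long retryTimeoutMs;

    ThrottlingContextSketch(int maxInFlightPerNode, long baseRequestTimeoutMs, long retryTimeoutMs) {
        this.maxInFlightPerNode = maxInFlightPerNode;
        this.baseRequestTimeoutMs = baseRequestTimeoutMs;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    private NodeState state(String nodeName) {
        return nodes.computeIfAbsent(nodeName, n -> new NodeState(baseRequestTimeoutMs));
    }

    /** Called for a NEW incoming request; false means "reject instantly". */
    boolean tryAcquire(String nodeName) {
        AtomicInteger inFlight = state(nodeName).inFlight;
        while (true) {
            int cur = inFlight.get();
            if (cur >= maxInFlightPerNode) {
                return false;
            }
            if (inFlight.compareAndSet(cur, cur + 1)) {
                return true;
            }
        }
    }

    /** Called once the request finally completes; retried requests keep their slot until then. */
    void release(String nodeName) {
        state(nodeName).inFlight.decrementAndGet();
    }

    /** Request timeout to use for the next attempt sent to the node. */
    long requestTimeoutMs(String nodeName) {
        return state(nodeName).requestTimeoutMs.get();
    }

    /** On TimeoutException: grow the timeout (doubling is an assumption), capped at the retry timeout. */
    void onTimeout(String nodeName) {
        state(nodeName).requestTimeoutMs.updateAndGet(t -> Math.min(t * 2, retryTimeoutMs));
    }

    /** Called when the recent response time has dropped below the threshold: restore the base timeout. */
    void onFastResponses(String nodeName) {
        state(nodeName).requestTimeoutMs.set(baseRequestTimeoutMs);
    }
}
```

In this sketch a client would call tryAcquire only for new incoming requests and fail them immediately with a descriptive exception when it returns false, while requests that are already being retried keep their slot and pick up the current requestTimeoutMs for each new attempt.
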
--
This message was sent by Atlassian Jira
(v8.20.10#820010)