Good Day All-
We’ve been running AuthByLOADBALANCE for some time now and have noticed that a
message that gets no response from the downstream hosts is retried
indefinitely. This not only keeps the message around forever but, each time it
is tried and fails, it also increases the failure counts for the target hosts,
which makes them more likely to be marked unavailable and causes delivery
problems for other requests.
For example, a malformed request may be sent by an upstream client and handled
by AuthByLOADBALANCE, where the target hosts simply do not respond to the
proxied request because they don’t like it. The request is retried on the
current host Retries times by handle_timeout(), after which it is handed off
to failed(). failed() tracks MaxFailedRequests for the host, marks the host
unavailable if applicable, and then hands the request to forward(), which
calls chooseHost() to find the next available host. The stock chooseHost() in
AuthByRADIUS tracks whether the request has reached the end of the list, but
chooseHost() in AuthByLOADBALANCE always returns a host if one is available,
and it could even be the same host as the last try if MaxFailedRequests has not
been reached for that host. The end result is that the request is retried
forever, incrementing the failure count for downstream hosts and causing them
to be marked unavailable.
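As a rough illustration of the flow described above, here is a toy Perl model. It is not Radiator code: the host names, the limits and the helper names (choose_available, simulate) are invented for the sketch, and the chooser only imitates "always return some available host". In this model a single unanswered request keeps going until every host has reached its failure limit.

```perl
use strict;
use warnings;

# Toy model only. Names and numbers are assumptions, not Radiator's API.
my $RETRIES             = 2;   # stands in for Retries (per-host retransmissions)
my $MAX_FAILED_REQUESTS = 3;   # stands in for MaxFailedRequests
my @hosts = map { { name => $_, failed => 0 } } qw(hostA hostB);

# Imitates a load-balancing chooser: returns any host that is still
# available, with no memory of where this request has already been sent.
sub choose_available {
    my @up = grep { $_->{failed} < $MAX_FAILED_REQUESTS } @hosts;
    return unless @up;
    my ($best) = sort { $a->{failed} <=> $b->{failed} } @up;
    return $best;
}

# Sends one request that no host ever answers and counts transmissions.
sub simulate {
    my $sent = 0;
    while (my $host = choose_available()) {
        $sent += 1 + $RETRIES;   # first attempt plus the per-host retries
        $host->{failed}++;       # what failed() would record after the timeouts
    }
    return $sent;
}

my $sent = simulate();
print "transmissions for one bad request: $sent\n";
print "$_->{name} failure count: $_->{failed}\n" for @hosts;
```

The request only stops in this model because no available host is left, which is exactly the side effect on other traffic described above.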
After some looking at the code I think I could override failed() to track the
number of unique hosts to which a request has been forwarded with something like
$fp->{retryHosts}->{$host}++
and then add a couple of checks in chooseHost() that are similar to the
original one-
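# retryHosts is a hash keyed by host, so count its keys
if (scalar(keys %{$fp->{retryHosts}}) < scalar(@{$self->{Hosts}}))
{
    foreach my $host (@{$self->{Hosts}})
    {
        # skip hosts this request has already been sent to
        next if ($fp->{retryHosts}->{$host});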
…
The end result would be that the request is tried Retries times on each host in
the list, with the next best candidate chosen by the volume algorithm each
time, until all hosts have been tried, and then the request fails. That may not
be the optimal behavior, but it beats trying forever.
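A minimal, self-contained sketch of that idea follows. It is a toy model rather than a patch: choose_untried_host and the host names are hypothetical, $fp is just a plain hash standing in for the per-request state, and the volume-based selection is replaced by simple list order.

```perl
use strict;
use warnings;

# Toy model only. Hypothetical names, not Radiator's API.
my @hosts = qw(hostA hostB hostC);

# Returns the next host this request has not been sent to yet,
# or nothing once every host has been tried.
sub choose_untried_host {
    my ($fp) = @_;
    return if scalar(keys %{ $fp->{retryHosts} }) >= scalar(@hosts);
    for my $host (@hosts) {
        next if $fp->{retryHosts}{$host};   # already tried for this request
        return $host;   # a real chooser would apply its weighting here
    }
    return;
}

my $fp = { retryHosts => {} };
my @order;
while (defined(my $host = choose_untried_host($fp))) {
    push @order, $host;
    $fp->{retryHosts}{$host}++;   # what an overridden failed() would record
}
print "hosts tried before giving up: @order\n";
```

Each host is visited once per request, so the request has a bounded lifetime of (number of hosts) x (1 + Retries) transmissions.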
Before doing that and bearing the burden of maintaining a custom AuthBy, I
figured I’d send it to the list and see if someone else has already solved this
problem or if Open Systems would be willing to revisit the AuthByLOADBALANCE
logic. Perhaps the interpretation of Retries could be changed to mean the total
number of times a request is retried, instead of a per-host number, so that a
request has a finite lifetime? In that case chooseHost() could be called for
each retry in handle_timeout() to increase the chances of success.
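To make that alternative concrete, here is a small sketch in which Retries is a total budget for the request and a host is chosen afresh for every transmission. Again this is only a toy model: choose_host is a round-robin stand-in for the real selection logic, and send_with_budget and the host names are invented for the example.

```perl
use strict;
use warnings;

# Toy model only. Hypothetical names, not Radiator's API.
my $RETRIES = 3;                 # read here as total retries for the request
my @hosts   = qw(hostA hostB);

# Round-robin stand-in for a chooseHost() call made on every retry.
my $next = 0;
sub choose_host {
    my $host = $hosts[ $next % @hosts ];
    $next++;
    return $host;
}

# Sends a request that is never answered; every attempt is assumed to time out.
sub send_with_budget {
    my @attempts;
    for my $try (0 .. $RETRIES) {    # first transmission plus $RETRIES retries
        push @attempts, choose_host();
    }
    return @attempts;                # budget exhausted: the request fails here
}

my @attempts = send_with_budget();
print scalar(@attempts), " attempts: @attempts\n";
```

The request ends after a fixed number of transmissions regardless of how many hosts are configured, and the attempts are spread across hosts instead of being spent on one.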
Regards-
Frank Danielson | S.V.P. Engineering
[email protected]
_______________________________________________
radiator mailing list
[email protected]
https://lists.open.com.au/mailman/listinfo/radiator