+1

I support the goal of making Traffic Control more reliable and think this is a 
step in the right direction.

In some of our other products, we’ve had good luck using a circuit breaker as 
part of the retry process (https://pypi.org/project/circuitbreaker/, linked 
just for the algorithm; I’m not suggesting we actually use that library).

Exponential retries are useful, but they must be capped at a maximum interval 
or number of retries to provide some guarantee about response time to the user.

The state of the retries must also be shared across all callers of the API 
within a certain execution context. If we have a program with many threads that 
all use the TO API, the failure of API calls in one thread should result in a 
back-off or blockage of calls from other threads. Otherwise, we may continue 
overloading the API with requests.
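
To make that concrete, here is a minimal, hypothetical sketch of a process-wide 
back-off gate that every goroutine using the TO API would consult; the package, 
names, and constants are mine, purely for illustration, and none of this comes 
from the libraries proposed below:

    package toclient

    import (
        "math"
        "math/rand"
        "sync"
        "time"
    )

    const (
        baseDelay = 1 * time.Second
        maxDelay  = 60 * time.Second
    )

    // sharedBackoff is a process-wide back-off gate: any goroutine that sees
    // a failed TO API call records it here, and every goroutine waits on it
    // before issuing its next request.
    type sharedBackoff struct {
        mu       sync.Mutex
        failures int
        until    time.Time // no requests should be sent before this time
    }

    // Failure records a failed call and pushes the shared "retry after" time
    // out exponentially (with full jitter), capped at maxDelay.
    func (s *sharedBackoff) Failure() {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.failures++
        d := maxDelay
        if mult := math.Pow(2, float64(s.failures-1)); mult < float64(maxDelay/baseDelay) {
            d = time.Duration(float64(baseDelay) * mult)
        }
        d = time.Duration(rand.Int63n(int64(d))) // full jitter: [0, d)
        s.until = time.Now().Add(d)
    }

    // Success clears the shared state so callers stop backing off.
    func (s *sharedBackoff) Success() {
        s.mu.Lock()
        defer s.mu.Unlock()
        s.failures = 0
        s.until = time.Time{}
    }

    // Wait blocks the calling goroutine until the shared back-off window has
    // passed.
    func (s *sharedBackoff) Wait() {
        s.mu.Lock()
        until := s.until
        s.mu.Unlock()
        if d := time.Until(until); d > 0 {
            time.Sleep(d)
        }
    }

Each caller would Wait() before a request, then report Failure() or Success(), 
so a failure seen by one thread throttles all of them.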

Pulling in a library like this is the first step; employing it properly is a 
larger challenge.

—Eric

On Oct 24, 2018, at 7:43 PM, Dave Neuman <[email protected]> wrote:

+1, seems reasonable.

On Wed, Oct 24, 2018 at 15:03 Rawlin Peters <[email protected]> wrote:

Hey Traffic Controllers,

As a large distributed system, Traffic Control has a lot of components
that make requests to each other over the network, and these requests
are known to fail sometimes and require retrying periodically. Some of
these requests can also be very expensive to make and should require a
more careful retry strategy.

For that reason, I'd like to propose that we make it easier for our Go
components to do exponential backoff retrying with randomness by
introducing a simple Go library that would require minimal changes to
existing retry logic and would be simple to implement in new code that
requires retrying.

I've noticed a handful of places in different Traffic Control
components where there is retrying of operations that may potentially
fail (e.g. making requests to the TO API). Most of these areas simply
retry the failed operation after sleeping for X seconds, where X is
hard-coded or configurable with a sane default. In cases where an
operation fails due to being overloaded (or failing to come back up
due to requests piling up), I think we could make TC more robust by
introducing exponential backoff algorithms to our retry handlers.

In cases where a service is overloaded and consequently taking a long time
to respond to requests, simply retrying the same request every X seconds
is not ideal and could prevent the overloaded service
from becoming healthy again. It is generally a best practice to
exponentially back off on retrying requests, and it is even better
with added randomness that prevents multiple retries occurring at the
same time from different clients.

For example, this is what a lot of retries look like today:
attempt #1 failed
* sleep 1s *
attempt #2 failed
* sleep 1s *
attempt #3 failed
* sleep 1s *
... and so on

This is what it would look like with exponential (factor of 2) backoff
with some randomness/jitter:
attempt #1 failed
* sleep 0.8s *
attempt #2 failed
* sleep 2.1s *
attempt #3 failed
* sleep 4.2s *
... and so on until some max interval like 60s ...
attempt #9 failed
* sleep 61.3s *
attempt #10 failed
* sleep 53.1s *
... and so on
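
(For anyone curious where numbers like those come from: the schedule falls out
of a simple formula. Below is a rough, hand-rolled sketch of one common jitter
variant, just to show the arithmetic; it is not the exact algorithm of either
library proposed below.)

    package main

    import (
        "fmt"
        "math"
        "math/rand"
        "time"
    )

    func main() {
        const (
            min    = 1 * time.Second  // starting interval
            max    = 60 * time.Second // cap on the interval
            factor = 2.0              // exponential growth factor
        )
        for attempt := 0; attempt < 10; attempt++ {
            // exponential growth, capped at max
            d := float64(min) * math.Pow(factor, float64(attempt))
            if d > float64(max) {
                d = float64(max)
            }
            // jitter: pick a random duration in [0.5*d, 1.5*d)
            jittered := time.Duration(d/2 + rand.Float64()*d)
            fmt.Printf("attempt #%d failed, sleep %v\n",
                attempt+1, jittered.Round(100*time.Millisecond))
        }
    }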

I think this would greatly improve our project's story around
retrying, and I've found a couple Go libraries that seem pretty good.
Both are MIT-licensed:

https://github.com/cenkalti/backoff
- provides the backoff algorithm but also a handful of other useful
features like Retry wrappers which handle looping for you (see the
sketch just after this list)
https://github.com/jpillora/backoff
- a bit simpler than the other; provides just the basic backoff
calculator/counter, which returns a Duration that you'd pass to
`time.Sleep()`, leaving you responsible for actually looping over the
retriable operation
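
To illustrate the cenkalti/backoff style mentioned above, usage would look
roughly like this (based on its README; `fetchCRConfig` is just a made-up
placeholder for some expensive TO API call):

    package example

    import (
        "github.com/cenkalti/backoff"
    )

    // fetchCRConfig stands in for an expensive TO API request; return an
    // error to trigger another retry, nil to stop.
    func fetchCRConfig() error {
        // ... make the request ...
        return nil
    }

    func fetchWithRetry() error {
        // backoff.Retry keeps calling the operation, sleeping a jittered,
        // exponentially growing interval between failures, until the
        // operation succeeds or the policy's max elapsed time is exceeded.
        return backoff.Retry(fetchCRConfig, backoff.NewExponentialBackOff())
    }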

I think I'd prefer jpillora/backoff for its simplicity and
conciseness. You basically just build a struct with your backoff
options (min, max, factor, jitter) then it spits out Durations for you
to sleep for before retrying. It doesn't expect much else from you.
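
For what it's worth, a retry loop with jpillora/backoff would look roughly
like the sketch below (per its README; the wrapper function, attempt limit,
and parameter values are just placeholders I made up):

    package example

    import (
        "fmt"
        "time"

        "github.com/jpillora/backoff"
    )

    // retryRequest keeps calling doRequest until it succeeds or we run out
    // of attempts, sleeping a jittered, exponentially growing duration
    // between failures.
    func retryRequest(doRequest func() error) error {
        b := &backoff.Backoff{
            Min:    500 * time.Millisecond, // first retry interval
            Max:    60 * time.Second,       // cap on the interval
            Factor: 2,                      // growth factor per attempt
            Jitter: true,                   // randomize each interval
        }
        const maxAttempts = 10
        var err error
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            if err = doRequest(); err == nil {
                return nil
            }
            time.Sleep(b.Duration()) // each call returns the next, longer sleep
        }
        return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
    }

If the same Backoff value were reused across requests, you'd also call
b.Reset() on success so the next failure starts back at Min.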

Please drop a +1/-1 if you're for or against adding exponential
backoff retrying and vendoring a project like one of the above in
Traffic Control.

Thanks,
Rawlin

