Hey Traffic Controllers,

As a large distributed system, Traffic Control has a lot of components
that make requests to each other over the network, and these requests
sometimes fail and need to be retried. Some of them are also very
expensive to make and call for a more careful retry strategy.

For that reason, I'd like to propose that we make it easier for our Go
components to do exponential backoff retrying with randomness by
introducing a simple Go library that would require minimal changes to
existing retry logic and would be simple to implement in new code that
requires retrying.

I've noticed a handful of places in different Traffic Control
components where there is retrying of operations that may potentially
fail (e.g. making requests to the TO API). Most of these areas simply
retry the failed operation after sleeping for X seconds, where X is
hard-coded or configurable with a sane default. In cases where an
operation fails because the service behind it is overloaded (or cannot
come back up because requests keep piling up), I think we could make
TC more robust by introducing exponential backoff algorithms to our
retry handlers.

In cases where a service is overloaded and taking a long time to
respond to requests because of it, simply retrying the same request
every X seconds is not ideal and could prevent the overloaded service
from becoming healthy again. It is generally a best practice to
exponentially back off on retrying requests, and it is even better
with added randomness, so that retries from different clients do not
all land at the same time.

For example, this is what a lot of retries look like today:
attempt #1 failed
* sleep 1s *
attempt #2 failed
* sleep 1s *
attempt #3 failed
* sleep 1s *
... and so on

This is what it would look like with exponential (factor of 2) backoff
with some randomness/jitter:
attempt #1 failed
* sleep 0.8s *
attempt #2 failed
* sleep 2.1s *
attempt #3 failed
* sleep 4.2s *
... and so on until the interval reaches some max like 60s (jitter still applies around it) ...
attempt #9 failed
* sleep 61.3s *
attempt #10 failed
* sleep 53.1s *
... and so on

I think this would greatly improve our project's story around
retrying, and I've found a couple of Go libraries that seem pretty
good. Both are MIT-licensed:

https://github.com/cenkalti/backoff
- provides the backoff algorithm but also a handful of other useful
features like Retry wrappers which handle looping for you
https://github.com/jpillora/backoff
- a bit simpler than the other and provides just the basic backoff
calculator/counter which provides a Duration that you'd use in
`time.Sleep()`, leaving you responsible for actually looping over the
retriable operation

I think I'd prefer jpillora/backoff for its simplicity and
conciseness. You basically just build a struct with your backoff
options (min, max, factor, jitter) then it spits out Durations for you
to sleep for before retrying. It doesn't expect much else from you.

Please drop a +1/-1 if you're for or against adding exponential
backoff retrying and vendoring a project like one of the above in
Traffic Control.

Thanks,
Rawlin
