+1

> On Oct 25, 2018, at 8:43 AM, Dan Kirkwood <[email protected]> wrote:
> 
> +1
> 
> On Thu, Oct 25, 2018 at 6:49 AM Eric Friedrich (efriedri)
> <[email protected]> wrote:
> 
>> +1
>> 
>> I support the goal of making Traffic Control more reliable and think this
>> is a step in the right direction.
>> 
>> In some of our other products, we’ve had good luck using a circuit breaker
>> as part of the retry process (https://pypi.org/project/circuitbreaker/,
>> linked just for the algorithm; I’m not suggesting we use it).
>> 
>> Exponential retries are useful, but they must be capped at a certain level
>> or number of retries to provide some guarantee about response time to the
>> user.
>> 
>> The state of the retries must also be shared across all callers of the API
>> within a certain execution context. If we have a program with many threads
>> that all use the TO API, the failure of API calls in one thread should
>> result in a back-off or blockage of calls from other threads. Otherwise, we
>> may continue overloading the API with requests.
>> 
>> Pulling in a library like this is the first step, employing it properly is
>> a larger challenge.
>> 
>> —Eric
>> 
>> On Oct 24, 2018, at 7:43 PM, Dave Neuman <[email protected]> wrote:
>> 
>> +1, seems reasonable.
>> 
>> On Wed, Oct 24, 2018 at 15:03 Rawlin Peters <[email protected]> wrote:
>> 
>> Hey Traffic Controllers,
>> 
>> As a large distributed system, Traffic Control has a lot of components
>> that make requests to each other over the network, and these requests
>> are known to fail sometimes and need to be retried. Some of
>> these requests can also be very expensive to make and call for a
>> more careful retry strategy.
>> 
>> For that reason, I'd like to propose that we make it easier for our Go
>> components to do exponential backoff retrying with randomness by
>> introducing a simple Go library that would require minimal changes to
>> existing retry logic and would be simple to implement in new code that
>> requires retrying.
>> 
>> I've noticed a handful of places in different Traffic Control
>> components where there is retrying of operations that may potentially
>> fail (e.g. making requests to the TO API). Most of these areas simply
>> retry the failed operation after sleeping for X seconds, where X is
>> hard-coded or configurable with a sane default. In cases where an
>> operation fails because the service behind it is overloaded (or
>> cannot come back up because requests keep piling up), I think we
>> could make TC more robust by introducing exponential backoff
>> algorithms to our retry handlers.
>> 
>> In cases where a service is overloaded and taking a long time to
>> respond to requests because of it, simply retrying the same request
>> every X seconds is not ideal and could prevent the overloaded service
>> from becoming healthy again. It is generally a best practice to
>> exponentially back off on retrying requests, and it is even better
>> with added randomness that prevents multiple retries occurring at the
>> same time from different clients.
>> 
>> For example, this is what a lot of retries look like today:
>> attempt #1 failed
>> * sleep 1s *
>> attempt #2 failed
>> * sleep 1s *
>> attempt #3 failed
>> * sleep 1s *
>> ... and so on
>> 
>> This is what it would look like with exponential (factor of 2) backoff
>> with some randomness/jitter:
>> attempt #1 failed
>> * sleep 0.8s *
>> attempt #2 failed
>> * sleep 2.1s *
>> attempt #3 failed
>> * sleep 4.2s *
>> ... and so on until some max interval like 60s ...
>> attempt #9 failed
>> * sleep 61.3s *
>> attempt #10 failed
>> * sleep 53.1s *
>> ... and so on
>> 
>> I think this would greatly improve our project's story around
>> retrying, and I've found a couple of Go libraries that seem pretty good.
>> Both are MIT-licensed:
>> 
>> https://github.com/cenkalti/backoff
>> - provides the backoff algorithm but also a handful of other useful
>> features like Retry wrappers which handle looping for you
>> https://github.com/jpillora/backoff
>> - a bit simpler than the other and provides just the basic backoff
>> calculator/counter which provides a Duration that you'd use in
>> `time.Sleep()`, leaving you responsible for actually looping over the
>> retriable operation
>> 
>> I think I'd prefer jpillora/backoff for its simplicity and
>> conciseness. You basically just build a struct with your backoff
>> options (min, max, factor, jitter) then it spits out Durations for you
>> to sleep for before retrying. It doesn't expect much else from you.
>> 
>> Please drop a +1/-1 if you're for or against adding exponential
>> backoff retrying and vendoring a project like one of the above in
>> Traffic Control.
>> 
>> Thanks,
>> Rawlin
>> 
>> 
>> 
