On Tue, Aug 9, 2016 at 3:31 PM William Reade <william.re...@canonical.com>
wrote:

> I feel obliged to note that we also have axw's operation queue, used in
> storageprovisioner, and that it's the only one which doesn't make the
> assumption that the code being retried is the only important thing that
> could possibly be happening in the calling context.
>
> All the other approaches discussed will leave us blocking 99 good machines
> behind a single failure -- and while that's harmless-ish in the current
> provisioner, because it has a strategy of waiting no more than 30s in
> total, it's going to bite us hard as soon as we switch to a strategy that
> actually backs off and keeps retrying long enough to be useful.
>
> Andrew: as I recall, that code is actually pretty general. Is there a
> strong reason it's tucked away in an /internal/ package?
>

Nope. I did propose creating a new repo a while back, but it fell by the
wayside: https://github.com/axw/juju-time. It should be a natural fit for
the machine provisioner.

Cheers,
Andrew


> On Tue, Aug 9, 2016 at 9:02 AM, roger peppe <roger.pe...@canonical.com>
> wrote:
>
>> On 9 August 2016 at 07:28, roger peppe <roger.pe...@canonical.com> wrote:
>> > On 9 August 2016 at 01:22, Katherine Cox-Buday
>> > <katherine.cox-bu...@canonical.com> wrote:
>> >> Hey All,
>> >>
>> >> We currently have 3 ways we're performing retries in Juju:
>> >>
>> >> 1. Archaic, custom, intra-package retry code.
>> >> 2. github.com/juju/utils::{Countdown,BackoffTimer}
>> >> 3. github.com/juju/retry (current best practice)
>> >
>> > There's a fourth way that you haven't mentioned, which fits
>> > somewhere in between 1 and 2 I think (it was the first
>> > explicit general retry code in Juju AFAIK), which is
>> > utils.Attempt.
>> >
>> > I'd suggest that the way it's used matches the pattern
>> > you'd like to write quite well:
>> >
>> >      for attempt := strategy.Start(); attempt.Next(); {
>> >           // Do something
>> >      }
>> >
>> > AFAIR this pattern was arrived at after some discussion between myself
>> > and Gustavo. At the time I was thinking of some kind of
>> > callback-based scheme like the other schemes you mention, but
>> > I think that leaving the caller in control and the loop explicit
>> > is actually nicer - the logic is clear to the reader and more
>> > composable with other primitives (it is also a good match
>> > for the pattern you propose).
>> >
>> > How about making AttemptStrategy a little more flexible?
>> > It was always my intention to admit other kinds of strategy - if
>> > a BackoffFunc field were added to AttemptStrategy and
>> > a Stop argument added to AttemptStrategy.Start it would
>> > become as general as retry.Call, while keeping the nice
>> > usage pattern and general straightforwardness.
>> >
>> >   cheers,
>> >     rog.
>> >
>> >> I have just touched some code that fits #1, and when I attempted to
>> >> retrofit it to #3, I found it a little obtuse. Here's why:
>> >>
>> >> In my case (and I assume most), I wanted something like this:
>> >>
>> >> for range retries { // Loop should break if our parent is cancelled
>> >>     // Do something
>> >>     // Success, retry, or fatal error
>> >> }
>> >>
>> >> To implement this, I do something like this:
>> >>
>> >> args := retry.CallArgs{
>> >>     Func: func() error {
>> >>         // Body of my loop
>> >>     },
>> >>     BackoffFunc: func(delay time.Duration, attempt int) time.Duration {
>> >>         if attempt == 1 {
>> >>             return delay
>> >>         }
>> >>         return delay * factor
>> >>     },
>> >>     Stop: dying,
>> >>     // etc...
>> >> }
>> >>
>> >> retry.Call(args)
>> >>
>> >> We attempt to encapsulate every scenario for retries in
>> >> github.com/juju/retry::CallArgs. We have variables to determine whether
>> >> errors are fatal, how to notify things, how to back off, etc. This feels
>> >> very heavy-weight to me, and a bit like the monolith model rather than
>> >> the unix model. I would prefer that the logic be left to the caller
>> >> (i.e. do I break out of this loop? do I notify something?), and that the
>> >> interface into a retry strategy be much simpler -- say a channel, as in
>> >> time.Tick(d time.Duration) <-chan time.Time.
>> >>
>> >> How would folks feel about something like this?
>> >>
>> >> func BackoffTick(done <-chan struct{}, clk clock.Clock, delay time.Duration, factor int64) <-chan time.Time {
>> >>         signalStream := make(chan time.Time)
>> >>         go func() {
>> >>                 defer close(signalStream)
>> >>                 for {
>> >>                         select {
>> >>                         case <-done:
>> >>                                 return
>> >>                         case t := <-clk.After(delay):
>> >>                                 select {
>> >>                                 case <-done:
>> >>                                         return
>> >>                                 case signalStream <- t:
>> >>                                         delay = time.Duration(delay.Nanoseconds() * factor)
>> >>                                 }
>> >>                         }
>> >>                 }
>> >>         }()
>> >>         return signalStream
>> >> }
>> >>
>> >> With this, the above becomes:
>> >>
>> >> for range BackoffTick(dying, myClock, 1*time.Second, 2) {
>> >>     // Determination of fatality and notifications happen here
>> >> }
>> >>
>> >> If you want a max retry concept, you do this:
>> >>
>> >> for range Take(done, 3, BackoffTick(dying, myClock, 1*time.Second, 2)) {
>> >>     // Determination of fatality and notifications happen here
>> >> }
>> >>
>> >> If you want a max duration you do this:
>> >>
>> >> done = Timeout(done, 5*time.Minute)
>> >> for range Take(done, 3, BackoffTick(dying, myClock, 1*time.Second, 2)) {
>> >>     // Determination of fatality and notifications happen here
>> >> }
>> >>
>> >> Functionally, what's in juju/retry is fine, and I can stuff anything I
>> >> want into the function. It just feels a little odd that I must put the
>> >> body of my loop in a function, and I dislike that the CallArgs struct
>> >> attempts to encapsulate a bunch of different scenarios.
>> >>
>> >> Thoughts?
>> >
>> > I'm not keen on the channel-based approach because if you want to
>> > terminate early, you need to shut down the goroutine - it's just
>> > one more resource to clean up. Also, the goroutine that's sending
>> > the events is one step removed from what's going on in the loop - the
>> > delay is calculated one step before the current iteration is complete.
>> >
>> >   cheers,
>> >     rog.
>> >
>> >>
>> >> Also, is it on the roadmap to address the inconsistent retries
>> >> throughout Juju? I almost used utils.BackoffTimer until I started looking
>> >> deeper. I'm going to submit a PR to flag it as deprecated.
>> >>
>> >> --
>> >> Katherine
>> >>
>> >> --
>> >> Juju-dev mailing list
>> >> Juju-dev@lists.ubuntu.com
>> >> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>>
>
>
