Re: [whimsy.git] [1/1] Commit 60f24d0: if all hosts are down, give up

Sam Ruby Thu, 04 Feb 2016 15:30:14 -0800

I'm not near a computer at the moment. Feel free to revert but please don't
leave it at a place where it doesn't work for long running apps. Thanks
On Feb 4, 2016 4:47 PM, "sebb" <seb...@gmail.com> wrote:


> On 4 February 2016 at 20:42, Sam Ruby <ru...@intertwingly.net> wrote:
> > On Thu, Feb 4, 2016 at 3:15 PM, sebb <seb...@gmail.com> wrote:
> >> On 4 February 2016 at 19:45, Sam Ruby <ru...@intertwingly.net> wrote:
> >>> On Thu, Feb 4, 2016 at 12:55 PM, sebb <seb...@gmail.com> wrote:
> >>>> On 4 February 2016 at 15:31, Sam Ruby <ru...@apache.org> wrote:
> >>>>> Commit 60f24d0cab030cf6d64ce75c23d05b1ac33bdda7:
> >>>>>     if all hosts are down, give up
> >>>>>
> >>>>> Branch: refs/heads/master
> >>>>> Author: Sam Ruby <ru...@intertwingly.net>
> >>>>> Committer: Sam Ruby <ru...@intertwingly.net>
> >>>>> Pusher: rubys <ru...@apache.org>
> >>>>>
> >>>>> ------------------------------------------------------------
> >>>>> lib/whimsy/asf/ldap.rb                                       | ++ --
> >>>>> ------------------------------------------------------------
> >>>>> 4 changes: 2 additions, 2 deletions.
> >>>>> ------------------------------------------------------------
> >>>>>
> >>>>>
> >>>>> diff --git a/lib/whimsy/asf/ldap.rb b/lib/whimsy/asf/ldap.rb
> >>>>> index 0376c64..f73646b 100644
> >>>>> --- a/lib/whimsy/asf/ldap.rb
> >>>>> +++ b/lib/whimsy/asf/ldap.rb
> >>>>> @@ -73,7 +73,7 @@ def self.puppet_ldapservers
> >>>>>
> >>>>>      # connect to LDAP
> >>>>>      def self.connect
> >>>>> -      loop do
> >>>>> +      host.length.times do
> >>>>
> >>>> -1
> >>>>
> >>>> The original code works fine.
> >>>>
> >>>> The new loop tries to loop some 37 times (depending on the size of the
> >>>> host name) and then fails with
> >>>>
> >>>>      iteration reached an end (StopIteration)
> >>>>
> >>>> I think the original code works rather better.
> >>>
> >>> Can you explain how this works for me?
> >>>
> >>> What I see is a "loop do...end" with the only exit being on a success,
> >>> and the code at the end of this method being unreachable.
> >>
> >> loop/end works in conjunction with next_host which is an enum.
> >> enum.next generates StopIteration which is specifically caught by
> loop/end.
> >>
> >> http://ruby-doc.org/core-1.9.3/StopIteration.html
> >
> > That feels a bit too tightly coupled to me.  Having one method's
> > implementation depend on another method's implementation and all that.
>
> Seems reasonable to me - it offers plenty of flexibility in designing
> loops as the getter can be anywhere in the loop.
> Otherwise the getter has to be part of the condition checking at the
> start or end.
>
> > But be that as it may, I missed the fact that once you run out of your
> > enumeration, it is game over.  That doesn't work well with long
> > running applications, like the board agenda.  37 failures and it is
> > time to restart the server.
>
> In fact it's about 7 failures (or however many are in the host list)
>
> The loop maximum was 37 (approx) but it will only try each host once.
>
> > That doesn't work for me.
> >
> >>> I will say that whether there are 4 hosts or 50 hosts, trying exactly
> >>> 37 times (depending on the length of an unrelated string) seems kinda
> >>> weird.
> >>
> >> It does it 37 or so times because host holds the host URL, and that is
> >> roughly its length.
> >
> > That seems totally random.
>
> > I could understand trying each host once.
>
> Which is what it does.
>
> > Or trying each host twice.  Or trying something like [hosts.length,
> > 3].min times.  But trying a number of times based on a url length?
>
> See above, 37 approx is the *max* loop count. I think you
> misunderstood me earlier.
>
> >> The host is the wrong variable to use.
>
> Hence the peculiar loop count.
>
> >> But so is the hosts array wrong, because that will always return the
> >> original number of hosts, even if some have already been used up and
> >> we are resuming the loop.
> >> The loop will then fall off the end of the enum, generating a
> >> StopIteration, which is not caught by the x.length.times do loop.
> >
> > I would rather next_host continuously provide hosts.
>
> OK, but this is a different design.
>
> My intention was to ensure that hosts that have already failed are not
> tried again.
> This makes sense for all the apps I have worked on.
>
> However I agree that does not sit well with long-running apps.
>
> > When it runs out
> > of a batch, it starts a new set.  I was too quick in scanning the code
> > and presumed that's what the @he ||= was doing.
> >
> > Perhaps something like the following instead:
> >
> >        @he = hosts.dup unless @he and not @he.empty?
> >        @he.pop
> >
> > The net effect is that it would try each host in turn.  If there later
> > is a failure, it will start over and loop back around if necessary
> > until it finds a host that works.
> >
> > Thoughts?
>
> The host re-use behaviour should be selectable.
>
> There's no point looping round all the hosts again for the one-shot
> apps that are going to be run again soon.
> If all the hosts have failed, the likelihood is that they will do so
> again if tried again immediately.
> So just give up and hope the fault has cleared for the next run.
>
> For long-running apps, the situation is a bit different.
> It's fine if each host is tried multiple times, so long as that
> happens over a long period.
> However there needs to be a guard against repeatedly trying all the
> hosts in rapid succession.
>
> So I don't think simply resetting the list will do; there needs to be
> a time delay.
> Nor does it make sense to always pause before restarting.
> Perhaps it would work to store the time when the list was last
> started, and add a wait if necessary before restarting.
> This would stop endless rapid recycling, but would still not be ideal
> behaviour e.g. if the external network was down.
>
> Is there a stage when even the long-running apps need to give up?
> And if so, how is that detected?
>
> Note: there was no auto retry at all until recently, so what do the
> long-running apps do currently?
> Do they try to recover from LDAP queries?
>
> If so, it seems more likely that the current design (StopIteration
> generated and not caught by connect) is less likely to play well with
> any recovery.
>
> So whatever the case, I think the change needs to be reverted.
>
> We can then improve the host re-use strategy for the long-running apps.
>
> > - Sam Ruby
> >
> >>> And failing with a StopIteration is less desirable than producing a
> >>> clean message, such as "Failed to connect to any LDAP host".  I can
> >>> see the advantage of raising an exception over returning nil, but I
> >>> would prefer a more meaningful message.
> >>
> >> It only *fails* with StopIteration with the updated code.
> >> That is because the loop count is too large, so next_host is called
> >> more times than there are hosts.
> >>
> >> With the previous code, next_host is also called one extra time, but
> >> the generated StopIteration is caught by the loop/do, as designed.
> >>
> >> http://ruby-doc.org/core-1.9.3/StopIteration.html
> >>
> >> Try defining single failing URL in your .whimsy file and you will see
> >> what I mean.
> >>
> >>> - Sam Ruby
> >>>
> >>>
> >>>>>          host = next_host
> >>>>>          Wunderbar.info "Connecting to LDAP server: #{host}"
> >>>>>
> >>>>> @@ -97,8 +97,8 @@ def self.connect
> >>>>>            Wunderbar.warn "Error connecting to LDAP server #{host}:
> " +
> >>>>>              re.message + " (continuing)"
> >>>>>          end
> >>>>> -
> >>>>>        end
> >>>>> +
> >>>>>        Wunderbar.error "Failed to connect to any LDAP host"
> >>>>>        return nil
> >>>>>      end
>

Re: [whimsy.git] [1/1] Commit 60f24d0: if all hosts are down, give up

Reply via email to