On 06/24/2016 05:41 AM, Adam Spiers wrote:
> Andrew Beekhof <abeek...@redhat.com> wrote:
>> On Fri, Jun 24, 2016 at 1:01 AM, Adam Spiers <aspi...@suse.com> wrote:
>>> Andrew Beekhof <abeek...@redhat.com> wrote:
>>
>>>>> Well, if you're OK with bending the rules like this then that's good
>>>>> enough for me to say we should at least try it :)
>>>>
>>>> I still say you shouldn't only do it on error.
>>>
>>> When else should it be done?
>>
>> I was thinking whenever a stop() happens.
>
> OK, seems we are agreed finally :)
>
>>> IIUC, disabling/enabling the service is independent of the up/down
>>> state which nova tracks automatically, and which based on slightly
>>> more than a skim of the code, is dependent on the state of the RPC
>>> layer.
>>>
>>>>> But how would you avoid repeated consecutive invocations of "nova
>>>>> service-disable" when the monitor action fails, and ditto for "nova
>>>>> service-enable" when it succeeds?
>>>>
>>>> I don't think you can. Not ideal but I'd not have thought a deal breaker.
>>>
>>> Sounds like a massive deal-breaker to me! With op monitor
>>> interval="10s" and 100 compute nodes, that would mean 10 pointless
>>> calls to nova-api every second. Am I missing something?
>>
>> I was thinking you would only call it for the "I detected a failure
>> case" and service-enable would still be on start().
>> So the number of pointless calls per second would be capped at one
>> tenth of the number of failed compute nodes.
>>
>> One would hope that all of them weren't dead.
>
> Oh OK - yeah that wouldn't be nearly as bad.
>
>>> Also I don't see any benefit to moving the API calls from start/stop
>>> actions to the monitor action. If there's a failure, Pacemaker will
>>> invoke the stop action, so we can do service-disable there.
>>
>> I agree. Doing it unconditionally at stop() is my preferred option, I
>> was only trying to provide a path that might be close to the behaviour
>> you were looking for.
>>
>>> If the
>>> start action is invoked and we successfully initiate startup of
>>> nova-compute, the RA can undo any service-disable it previously did
>>> (although it should not reverse a service-disable done elsewhere,
>>> e.g. manually by the cloud operator).
>>
>> Agree
>
> Trying to adjust to this new sensation of agreement ;-)
>
>>>>> Earlier in this thread I proposed
>>>>> the idea of a tiny temporary file in /run which tracks the last known
>>>>> state and optimizes away the consecutive invocations, but IIRC you
>>>>> were against that.
>>>>
>>>> I'm generally not a fan, but sometimes state files are a necessity.
>>>> Just make sure you think through what a missing file might mean.
>>>
>>> Sure. A missing file would mean the RA's never called service-disable
>>> before,
>>
>> And that is why I generally don't like state files.
>> The default location for state files doesn't persist across reboots.
>>
>> t1. stop (ie. disable)
>> t2. reboot
>> t3. start with no state file
>> t4. WHY WONT NOVA USE THE NEW COMPUTE NODE STUPID CLUSTERS
>
> Well then we simply put the state file somewhere which does persist
> across reboots.
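For example, a minimal sketch of that persistent-state-file approach (a
hedged illustration only: the path, the function names, and the
assumption that the RA has the usual OS_* credentials for the nova CLI
are made up, not taken from the actual NovaCompute agent):

statedir="/var/lib/nova-compute-ra"        # hypothetical persistent location
statefile="${statedir}/service_disabled"

nova_compute_stop() {
    # Agreed approach: disable unconditionally on every stop(), and
    # remember that it was this RA (not an operator) that did it.
    if nova service-disable "$(hostname)" nova-compute; then
        mkdir -p "${statedir}"
        touch "${statefile}"
    fi
    # ... then stop the nova-compute process as usual ...
}

nova_compute_start() {
    # Only undo a disable this RA performed itself, so a manual
    # service-disable by the cloud operator is left untouched.
    if [ -f "${statefile}" ]; then
        nova service-enable "$(hostname)" nova-compute &&
            rm -f "${statefile}"
    fi
    # ... then start the nova-compute process as usual ...
}

Because the marker lives under /var/lib rather than /run, the t1-t4
reboot sequence above ends with start() finding the file and
re-enabling the service.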
There's also the possibility of using a node attribute. If you set a
normal node attribute, the cluster will abort the current transition and
calculate a new one, so that's something to take into account. You could
set a private node attribute, which never gets written to the CIB and
thus doesn't abort transitions, but it also does not survive a complete
cluster stop. (A rough attrd_updater sketch of this approach is appended
at the end of this message.)

>>> which means that it shouldn't call service-enable on startup.
>>>
>>>> Unless.... use the state file to store the date at which the last
>>>> start operation occurred?
>>>>
>>>> If we're calling stop() and date - start_date > threshold, then, if
>>>> you must, be optimistic, skip service-disable and assume we'll get
>>>> started again soon.
>>>>
>>>> Otherwise if we're calling stop() and date - start_date <= threshold,
>>>> always call service-disable because we're in a restart loop which is
>>>> not worth optimising for.
>>>>
>>>> ( And always call service-enable at start() )
>>>>
>>>> No Pacemaker feature or Beekhof approval required :-)
>>>
>>> Hmm ... it's possible I just don't understand this proposal fully,
>>> but it sounds a bit woolly to me, e.g. how would you decide a suitable
>>> threshold?
>>
>> roll a dice?
>>
>>> I think I preferred your other suggestion of just skipping the
>>> optimization, i.e. calling service-disable on the first stop, and
>>> service-enable on (almost) every start.
>>
>> good :)
>>
>> And the use of force-down from your subsequent email sounds excellent
>
> OK great! We finally got there :-) Now I guess I just have to write
> the spec and the actual code ;-)

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
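As a rough illustration of the private-node-attribute idea above (again
hedged: it assumes a Pacemaker version whose attrd_updater supports the
--private option, and the attribute name and query parsing are invented
for the example, not part of any existing agent):

attr="nova_compute_disabled_by_ra"     # illustrative attribute name

# In stop(): disable the service, then record that this RA did it.
# --private keeps the attribute out of the CIB, so setting it does not
# abort the current transition; the trade-off is that the flag is lost
# if the whole cluster is stopped.
nova service-disable "$(hostname)" nova-compute &&
    attrd_updater --name "$attr" --update 1 --private

# In start(): re-enable only if this RA set the flag, leaving a manual
# service-disable alone. (The exact --query output format varies
# between Pacemaker versions.)
if attrd_updater --name "$attr" --query 2>/dev/null | grep -q 'value="1"'; then
    nova service-enable "$(hostname)" nova-compute &&
        attrd_updater --name "$attr" --delete
fi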