On 08/27/2015 03:04 AM, Tom Yates wrote:
> On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
>
>> 24.08.2015 13:32, Tom Yates wrote:
>>> if i understand you aright, my problem is that the stop script didn't
>>> return a 0 (OK) exit status, so CRM didn't know where to go.  is the
>>> exit status of the stop script how CRM determines the status of the
>>> stop operation?
>>
>> correct
>>
>>> does CRM also use the output of "/etc/init.d/script status" to
>>> determine continuing successful operation?
>>
>> It definitely does not use *output* of script - only return code. If
>> the question is whether it probes resource additionally to checking
>> stop exit code - I do not think so (I know it does it in some cases
>> for systemd resources).
>
> i just thought i'd come back and follow-up.  in testing this morning, i
> can confirm that the "pppoe-stop" command returns status 1 if pppd isn't
> running.  that makes a standard init.d script, which passes on the
> return code of the stop command, unhelpful to CRM.
>
> i changed the script so that on stop, having run pppoe-stop, it checks
> for the existence of a working ppp0 interface, and returns 0 IFF there
> is none.
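
For readers following along, the stop handling described above could look
roughly like the sketch below. This is not Tom's actual script. The function
names, the STOP_CMD and IFACE variables, the 10-second wait and the
/sys/class/net check are all assumptions made for illustration. It only tests
that the interface exists, which is weaker than "working". The point is that
the exit status comes from the state of the interface, not from pppoe-stop.

```shell
#!/bin/sh
# Sketch only: names, the timeout and the /sys check are illustrative assumptions.

iface_present() {
    # On Linux, /sys/class/net/<name> exists while the kernel knows the interface.
    [ -d "/sys/class/net/$1" ]
}

stop_adsl() {
    stop_cmd="${STOP_CMD:-pppoe-stop}"   # assumed to be in PATH
    iface="${IFACE:-ppp0}"

    # Run the stop command but ignore its exit status on purpose:
    # it reports 1 when pppd was not running, which is not a failure here.
    "$stop_cmd" >/dev/null 2>&1

    # Give pppd up to 10 seconds (arbitrary) to tear the interface down.
    i=0
    while iface_present "$iface" && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done

    # Report success only if the interface is really gone.
    if iface_present "$iface"; then
        return 1
    fi
    return 0
}
```

In an init.d script this would be called from the stop branch, for example
`stop) stop_adsl; exit $? ;;`, so that stopping an already stopped service
still exits 0.
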

Nice

>> If resource was previously active and stop was attempted as cleanup
>> after resource failure - yes, it should attempt to start it again.
>
> that is now what happens.  it seems to try three times to bring up pppd,
> then kicks the service over to the other node.
>
> in the case of extended outages (ie, the ISP goes away for more than
> about 10 minutes), where both nodes have time to fail, we end up back in
> the bad old state (service failed on both nodes):
>
> [root@positron ~]# crm status
> [...]
> Online: [ electron positron ]
>
>  Resource Group: BothIPs
>      InternalIP (ocf::heartbeat:IPaddr):  Started electron
>      ExternalIP (lsb:hb-adsl-helper):     Stopped
>
> Failed actions:
>     ExternalIP_monitor_60000 (node=positron, call=15, rc=7,
>       status=complete): not running
>     ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed
>       Out): unknown exec error
>     ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out):
>       unknown exec error
>
> is there any way to configure CRM to keep kicking the service between
> the two nodes forever (ie, try three times on positron, kick service
> group to electron, try three times on electron, kick back to positron,
> lather rinse repeat...)?
>
> for a service like DSL, which can go away for extended periods through
> no local fault then suddenly and with no announcement come back, this
> would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options

> thanks to all for help with this.  thanks also to those who have
> suggested i rewrite this as an OCF agent (especially to ken gaillot who
> was kind enough to point me to documentation); i will look at that if
> time permits.
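
As a rough illustration of the migration-threshold and failure-timeout
suggestion above, the two meta attributes could be set with crm_resource as
sketched here. The values (3 failures, 600s) are guesses taken from the "three
times" and "about 10 minutes" mentioned in the thread, not tested
recommendations. The configure_failover wrapper and the CRM_RESOURCE override
are hypothetical names used only so the commands can be dry-run.

```shell
#!/bin/sh
# Sketch only: the wrapper name, CRM_RESOURCE override and values are assumptions.

configure_failover() {
    rsc="$1"
    crm_resource_bin="${CRM_RESOURCE:-crm_resource}"

    # Move the resource off a node after this many failures on that node.
    "$crm_resource_bin" --resource "$rsc" --meta \
        --set-parameter migration-threshold --parameter-value 3

    # Forget old failures after this long, so the node becomes eligible again.
    "$crm_resource_bin" --resource "$rsc" --meta \
        --set-parameter failure-timeout --parameter-value 600s
}
```

Calling `configure_failover ExternalIP` on one node would apply both settings;
with `CRM_RESOURCE=echo` it only prints the commands. As far as I know,
expired failures are only noticed when the cluster re-evaluates its state, so
cluster-recheck-interval may need to be shorter than its default for this to
react quickly.
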

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org