On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

24.08.2015 13:32, Tom Yates wrote:
 if i understand you aright, my problem is that the stop script didn't
 return a 0 (OK) exit status, so CRM didn't know where to go.  is the
 exit status of the stop script how CRM determines the status of the stop
 operation?

correct

 does CRM also use the output of "/etc/init.d/script status" to determine
 continuing successful operation?

It definitely does not use *output* of script - only return code. If the question is whether it probes resource additionally to checking stop exit code - I do not think so (I know it does it in some cases for systemd resources).

i just thought i'd come back and follow-up. in testing this morning, i can confirm that the "pppoe-stop" command returns status 1 if pppd isn't running. that makes a standard init.d script, which passes on the return code of the stop command, unhelpful to CRM.

i changed the script so that on stop, having run pppoe-stop, it checks for the existence of a working ppp0 interface, and returns 0 IFF there is none.
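
a minimal sketch of that stop logic, for illustration only (this is not the actual script). the path to pppoe-stop, the function names, and the test for the interface via /sys/class/net are all assumptions; a stricter check for a "working" interface may be wanted.

```shell
#!/bin/sh
# Hypothetical sketch of the "stop" logic of an LSB init script.
# All names and paths below are assumptions, not taken from the real script.
PPPOE_STOP=${PPPOE_STOP:-/usr/sbin/pppoe-stop}
IFACE=${IFACE:-ppp0}
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

# Succeeds if the named network interface currently exists.
iface_present() {
    [ -e "$SYSFS_NET/$1" ]
}

# Run pppoe-stop but ignore its exit status, since it reports 1 when
# pppd was not running. Judge success only by whether the interface is gone.
stop_link() {
    "$PPPOE_STOP" >/dev/null 2>&1 || true
    if iface_present "$IFACE"; then
        return 1    # interface still there: the stop really failed
    fi
    return 0        # nothing left to stop: report success to the cluster
}
```

the stop) branch of the init script would then call stop_link and exit with its return value. a real script might need to wait a moment after pppoe-stop before testing for the interface.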

If resource was previously active and stop was attempted as cleanup after resource failure - yes, it should attempt to start it again.

that is now what happens. it seems to try three times to bring up pppd, then kicks the service over to the other node.

in the case of extended outages (ie, the ISP goes away for more than about 10 minutes), where both nodes have time to fail, we end up back in the bad old state (service failed on both nodes):

[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
     InternalIP (ocf::heartbeat:IPaddr):        Started electron
     ExternalIP (lsb:hb-adsl-helper):   Stopped

Failed actions:
    ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
    ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
    ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error

is there any way to configure CRM to keep kicking the service between the two nodes forever (ie, try three times on positron, kick service group to electron, try three times on electron, kick back to positron, lather rinse repeat...)?

for a service like DSL, which can go away for extended periods through no local fault then suddenly and with no announcement come back, this would be most useful behaviour.
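
a possible starting point, untested on this cluster: pacemaker documents a start-failure-is-fatal cluster property and the migration-threshold and failure-timeout resource meta attributes, which between them look like they cover this. the sketch below wraps the crmsh commands in a hypothetical function; the values 3 and 120s are illustrative assumptions, and the attributes could equally be set on the BothIPs group.

```shell
#!/bin/sh
# Hypothetical helper; the function name and the values are assumptions.
CRM=${CRM:-crm}
RSC=${RSC:-ExternalIP}

configure_retry_policy() {
    # Treat a failed start like any other failure instead of banning
    # the node outright, so migration-threshold applies to it.
    "$CRM" configure property start-failure-is-fatal=false &&
    # Move the resource away after this many failures on one node.
    "$CRM" resource meta "$RSC" set migration-threshold 3 &&
    # Let recorded failures expire, so a node becomes eligible again.
    # Expiry may only be acted on at the next cluster recheck.
    "$CRM" resource meta "$RSC" set failure-timeout 120s
}
```

calling configure_retry_policy once on either node should be enough, since the settings live in the shared cluster configuration.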

thanks to all for help with this. thanks also to those who have suggested i rewrite this as an OCF agent (especially to ken gaillot who was kind enough to point me to documentation); i will look at that if time permits.


--

  Tom Yates  -  http://www.teaparty.net
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
