Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

Tom Yates Thu, 27 Aug 2015 01:06:13 -0700

On Mon, 24 Aug 2015, Andrei Borzenkov wrote:

24.08.2015 13:32, Tom Yates wrote:
 if i understand you aright, my problem is that the stop script didn't
 return a 0 (OK) exit status, so CRM didn't know where to go.  is the
 exit status of the stop script how CRM determines the status of the stop
 operation?
correct
 does CRM also use the output of "/etc/init.d/script status" to determine
 continuing successful operation?
It definitely does not use *output* of script - only return code. If the question is whether it probes resource additionally to checking stop exit code - I do not think so (I know it does it in some cases for systemd resources).

i just thought i'd come back and follow up. in testing this morning, i can confirm that the "pppoe-stop" command returns status 1 if pppd isn't running. that makes a standard init.d script, which passes on the return code of the stop command, unhelpful to CRM.

i changed the script so that on stop, having run pppoe-stop, it checks for the existence of a working ppp0 interface, and returns 0 IFF there is none.
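for anyone wanting to do the same, the stop logic can be sketched roughly as below. this is a minimal illustration, not the actual hb-adsl-helper script: the function names adsl_stop and iface_is_up, the IFACE, STOP_CMD and SYSFS_NET variables, and the use of /sys/class/net to decide whether the interface is still present are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch of the stop logic; names and the interface test are assumptions.
IFACE="${IFACE:-ppp0}"               # interface the ADSL link is assumed to use
STOP_CMD="${STOP_CMD:-pppoe-stop}"   # command that tears the link down

iface_is_up() {
    # Assumption: on Linux, /sys/class/net/<iface> exists only while the
    # kernel knows about the interface. SYSFS_NET is overridable for testing.
    [ -e "${SYSFS_NET:-/sys/class/net}/$1" ]
}

adsl_stop() {
    # pppoe-stop returns 1 when pppd is not running, so its status is
    # deliberately ignored here.
    $STOP_CMD >/dev/null 2>&1 || true
    if iface_is_up "$IFACE"; then
        return 1    # interface still present: report a real stop failure
    fi
    return 0        # nothing left to stop: report success to CRM
}
```

the stop branch of the init script would then call adsl_stop and exit with its status, so that stopping an already-dead link counts as success.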

If resource was previously active and stop was attempted as cleanup after resource failure - yes, it should attempt to start it again.

that is now what happens. it seems to try three times to bring up pppd, then kicks the service over to the other node.

in the case of extended outages (ie, the ISP goes away for more than about 10 minutes), where both nodes have time to fail, we end up back in the bad old state (service failed on both nodes):


[root@positron ~]# crm status
[...]
Online: [ electron positron ]

 Resource Group: BothIPs
     InternalIP (ocf::heartbeat:IPaddr):        Started electron
     ExternalIP (lsb:hb-adsl-helper):   Stopped

Failed actions:
    ExternalIP_monitor_60000 (node=positron, call=15, rc=7, status=complete): not running
    ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
    ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error

is there any way to configure CRM to keep kicking the service between the two nodes forever (ie, try three times on positron, kick service group to electron, try three times on electron, kick back to positron, lather rinse repeat...)?

for a service like DSL, which can go away for extended periods through no local fault then suddenly and with no announcement come back, this would be most useful behaviour.

thanks to all for help with this. thanks also to those who have suggested i rewrite this as an OCF agent (especially to ken gaillot who was kind enough to point me to documentation); i will look at that if time permits.



--

  Tom Yates  -  http://www.teaparty.net

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
