Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-27 Thread Ken Gaillot
On 08/27/2015 03:04 AM, Tom Yates wrote:
 On Mon, 24 Aug 2015, Andrei Borzenkov wrote:
 
 24.08.2015 13:32, Tom Yates wrote:
  if i understand you aright, my problem is that the stop script didn't
  return a 0 (OK) exit status, so CRM didn't know where to go.  is the
  exit status of the stop script how CRM determines the status of the
  stop operation?

 correct

  does CRM also use the output of /etc/init.d/script status to
  determine continuing successful operation?

 It definitely does not use the *output* of the script, only the return
 code. If the question is whether it probes the resource in addition to
 checking the stop exit code, I do not think so (I know it does that in
 some cases for systemd resources).
 
 i just thought i'd come back and follow-up.  in testing this morning, i
 can confirm that the pppoe-stop command returns status 1 if pppd isn't
 running.  that makes a standard init.d script, which passes on the
 return code of the stop command, unhelpful to CRM.
 
 i changed the script so that on stop, having run pppoe-stop, it checks
 for the existence of a working ppp0 interface, and returns 0 if and only
 if there is none.

Nice
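
For anyone following along, the change described above might look roughly like the sketch below. This is not the poster's actual script: the function names and the PPPOE_STOP / IFCONFIG override variables are assumptions made here so the logic can be tried without a real modem.

```sh
#!/bin/bash
# Hypothetical sketch of the revised stop action (not the original script).
# PPPOE_STOP and IFCONFIG are illustrative overrides for the real commands.
PPPOE_STOP=${PPPOE_STOP:-/sbin/pppoe-stop}
IFCONFIG=${IFCONFIG:-/sbin/ifconfig}

# Succeeds only while the ppp0 interface exists.
link_present() {
    "$IFCONFIG" ppp0 >/dev/null 2>&1
}

stop() {
    # pppoe-stop exits 1 when pppd is already gone, so its status is ignored.
    "$PPPOE_STOP" >/dev/null 2>&1
    # Report success (0) only if the interface is really gone.
    if link_present; then
        return 1
    fi
    return 0
}
```

The point of the design is that the exit status reflects the resulting state (no ppp0) rather than whether pppoe-stop itself had anything to kill.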

 If the resource was previously active and the stop was attempted as
 cleanup after a resource failure, then yes, it should attempt to start
 it again.
 
 that is now what happens.  it seems to try three times to bring up pppd,
 then kicks the service over to the other node.
 
 in the case of extended outages (ie, the ISP goes away for more than
 about 10 minutes), where both nodes have time to fail, we end up back in
 the bad old state (service failed on both nodes):
 
 [root@positron ~]# crm status
 [...]
 Online: [ electron positron ]
 
  Resource Group: BothIPs
  InternalIP (ocf::heartbeat:IPaddr):Started electron
  ExternalIP (lsb:hb-adsl-helper):   Stopped
 
 Failed actions:
 ExternalIP_monitor_6 (node=positron, call=15, rc=7, status=complete): not running
 ExternalIP_start_0 (node=positron, call=17, rc=-2, status=Timed Out): unknown exec error
 ExternalIP_start_0 (node=electron, call=6, rc=-2, status=Timed Out): unknown exec error
 
 is there any way to configure CRM to keep kicking the service between
 the two nodes forever (ie, try three times on positron, kick service
 group to electron, try three times on electron, kick back to positron,
 lather rinse repeat...)?
 
 for a service like DSL, which can go away for extended periods through
 no local fault then suddenly and with no announcement come back, this
 would be most useful behaviour.

Yes, see migration-threshold and failure-timeout.

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-options
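
As a rough illustration of those two options: migration-threshold is the number of failures on a node after which the resource is moved away from it, and failure-timeout is how long failures are remembered before the node becomes eligible again. The sketch below sets both as meta attributes through crm_resource. The helper names and the values 3 and 120s are assumptions for illustration, not recommendations, and CRM_RESOURCE is only an override so the commands can be previewed.

```sh
#!/bin/bash
# Sketch only: helper names and the example values are illustrative assumptions.
CRM_RESOURCE=${CRM_RESOURCE:-crm_resource}

# set_meta <resource> <meta-attribute> <value>
set_meta() {
    "$CRM_RESOURCE" --resource "$1" --meta --set-parameter "$2" --parameter-value "$3"
}

# configure_retry_policy <resource> <migration-threshold> <failure-timeout>
configure_retry_policy() {
    set_meta "$1" migration-threshold "$2" &&
    set_meta "$1" failure-timeout "$3"
}
```

Running `configure_retry_policy ExternalIP 3 120s` on a cluster node would apply both settings; with `CRM_RESOURCE=echo` it just prints the two command lines. Expired failures are only noticed when the cluster re-evaluates its state, which by default happens at the cluster-recheck-interval, so the bounce back may take longer than the timeout suggests. Start failures may also need attention, since with the default start-failure-is-fatal=true a single failed start counts as reaching the threshold.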

 thanks to all for help with this.  thanks also to those who have
 suggested i rewrite this as an OCF agent (especially to ken gaillot who
 was kind enough to point me to documentation); i will look at that if
 time permits.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Ken Gaillot
On 08/24/2015 04:52 AM, Andrei Borzenkov wrote:
 24.08.2015 12:35, Tom Yates wrote:
 I've got a failover firewall pair where the external interface is ADSL;
 that is, PPPoE.  i've defined the service thus:

 primitive ExternalIP lsb:hb-adsl-helper \
  op monitor interval=60s

 and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

 #!/bin/bash
 RETVAL=0
 start() {
  /sbin/pppoe-start
 }
 stop() {
  /sbin/pppoe-stop
 }
 case $1 in
start)
  start
  ;;
stop)
  stop
  ;;
status)
  /sbin/ifconfig ppp0 &> /dev/null && exit 0
  exit 1
  ;;
*)
  echo $"Usage: $0 {start|stop|status}"
  exit 3
 esac
 exit $?

Pacemaker expects that LSB agents follow the LSB spec for return codes,
and won't be able to behave properly if they don't:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-lsb
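
To make the return-code point concrete, here is a minimal sketch of the status and start actions with LSB-style exit codes: status returns 0 when the service is running and 3 when it is not, and starting an already-running service counts as success. The function names and the IFCONFIG / PPPOE_START overrides are assumptions for illustration, and treating "ppp0 exists" as "running" is a simplification.

```sh
#!/bin/bash
# Sketch of LSB-style exit codes for the status and start actions.
# IFCONFIG and PPPOE_START are illustrative overrides for the real commands.
IFCONFIG=${IFCONFIG:-/sbin/ifconfig}
PPPOE_START=${PPPOE_START:-/sbin/pppoe-start}

# LSB status: 0 = running, 3 = not running.
lsb_status() {
    if "$IFCONFIG" ppp0 >/dev/null 2>&1; then
        return 0
    fi
    return 3
}

# LSB start: starting an already-running service counts as success;
# otherwise the exit status of the start command is passed on.
lsb_start() {
    if lsb_status; then
        return 0
    fi
    "$PPPOE_START"
}
```

In the script quoted above, status exits 1 when ppp0 is missing, which LSB reserves for "dead but pid file exists"; 3 is the code for a cleanly stopped service.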


However it's just as easy to write an OCF agent, which gives you more
flexibility (accepting parameters, etc.):

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#ap-ocf
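
A bare-bones OCF-style agent for this case might look something like the sketch below. It is only an outline under stated assumptions: a real agent would normally source ocf-shellfuncs rather than define the return codes itself, would also implement validate-all, and would end by calling the dispatcher with "$1" and exiting with its status. The agent name "adsl", the function names, and the timeouts in the metadata are all made up for illustration.

```sh
#!/bin/bash
# Minimal OCF-style sketch; names, timeouts and metadata text are illustrative.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_ERR_UNIMPLEMENTED=3
OCF_NOT_RUNNING=7

IFCONFIG=${IFCONFIG:-/sbin/ifconfig}
PPPOE_START=${PPPOE_START:-/sbin/pppoe-start}
PPPOE_STOP=${PPPOE_STOP:-/sbin/pppoe-stop}

# monitor: 0 while ppp0 exists, 7 (not running) when it is cleanly absent.
adsl_monitor() {
    if "$IFCONFIG" ppp0 >/dev/null 2>&1; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

adsl_start() {
    adsl_monitor && return $OCF_SUCCESS          # already up
    "$PPPOE_START" >/dev/null 2>&1 || return $OCF_ERR_GENERIC
    adsl_monitor || return $OCF_ERR_GENERIC      # start ran but link is absent
    return $OCF_SUCCESS
}

adsl_stop() {
    "$PPPOE_STOP" >/dev/null 2>&1                # exit status deliberately ignored
    adsl_monitor && return $OCF_ERR_GENERIC      # still up: the stop failed
    return $OCF_SUCCESS
}

adsl_meta_data() {
    cat <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="adsl" version="0.1">
  <version>1.0</version>
  <longdesc lang="en">Illustrative PPPoE/ADSL link agent.</longdesc>
  <shortdesc lang="en">PPPoE link</shortdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="60s"/>
    <action name="stop" timeout="60s"/>
    <action name="monitor" timeout="20s" interval="60s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
}

# Dispatcher: returns the status of whichever action was requested.
adsl_main() {
    case "$1" in
        start)     adsl_start ;;
        stop)      adsl_stop ;;
        monitor)   adsl_monitor ;;
        meta-data) adsl_meta_data ;;
        *)         return $OCF_ERR_UNIMPLEMENTED ;;
    esac
}
```

The monitor action returning 7 for a cleanly stopped resource is the OCF counterpart of LSB status returning 3; the rc=7 "not running" entry in the crm status output shown elsewhere in this thread is that same code.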

 The problem is that sometimes the ADSL connection falls over, as they
 do, eg:

 Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
 Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
 Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
 164420300 bytes.
 Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
 Aug 20 11:42:13 positron pppd[2469]: Modem hangup
 Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
 1735: Input/output error
 Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
 Aug 20 11:42:13 positron pppd[2469]: Exit.
 Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
 attempting re-connection.

 CRMd then logs a bunch of stuff, followed by

 Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
 Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
 additional parameters are needed.
 [...]
 Aug 20 11:42:18 positron pppoe-stop: Killing pppd
 Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
 Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
 process 28357 exited with return code 1.


 At this point, the PPPoE connection is down, and stays down.  CRMd
 doesn't fail the group which contains both internal and external
 interfaces over to the other node, but nor does it try to restart the
 service.  I'm fairly sure this is because I've done something
 boneheaded, but I can't get my bone head around what it might be.

 Any light anyone can shed is much appreciated.


 
 If the stop operation fails, the resource state is undefined and
 Pacemaker won't do anything further with this resource. Either make sure
 the script returns success when appropriate, or the only option is to
 fence the node where the resource was active.
 
 


[ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Tom Yates
I've got a failover firewall pair where the external interface is ADSL; 
that is, PPPoE.  i've defined the service thus:


primitive ExternalIP lsb:hb-adsl-helper \
op monitor interval=60s

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

#!/bin/bash
RETVAL=0
start() {
    /sbin/pppoe-start
}
stop() {
    /sbin/pppoe-stop
}
case $1 in
  start)
    start
    ;;
  stop)
    stop
    ;;
  status)
    /sbin/ifconfig ppp0 &> /dev/null && exit 0
    exit 1
    ;;
  *)
    echo $"Usage: $0 {start|stop|status}"
    exit 3
esac
exit $?

The problem is that sometimes the ADSL connection falls over, as they do, 
eg:


Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received 164420300 bytes.
Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
Aug 20 11:42:13 positron pppd[2469]: Modem hangup
Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session 1735: Input/output error
Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
Aug 20 11:42:13 positron pppd[2469]: Exit.
Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost; attempting re-connection.

CRMd then logs a bunch of stuff, followed by

Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no additional parameters are needed.
[...]
Aug 20 11:42:18 positron pppoe-stop: Killing pppd
Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop process 28357 exited with return code 1.


At this point, the PPPoE connection is down, and stays down.  CRMd doesn't 
fail the group which contains both internal and external interfaces over 
to the other node, but nor does it try to restart the service.  I'm fairly 
sure this is because I've done something boneheaded, but I can't get my 
bone head around what it might be.


Any light anyone can shed is much appreciated.


--

  Tom Yates  -  http://www.teaparty.net



Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Andrei Borzenkov

24.08.2015 12:35, Tom Yates wrote:

I've got a failover firewall pair where the external interface is ADSL;
that is, PPPoE.  i've defined the service thus:

primitive ExternalIP lsb:hb-adsl-helper \
 op monitor interval=60s

and in addition written a noddy script /etc/init.d/hb-adsl-helper, thus:

#!/bin/bash
RETVAL=0
start() {
 /sbin/pppoe-start
}
stop() {
 /sbin/pppoe-stop
}
case $1 in
   start)
 start
 ;;
   stop)
 stop
 ;;
   status)
 /sbin/ifconfig ppp0 &> /dev/null && exit 0
 exit 1
 ;;
   *)
 echo $"Usage: $0 {start|stop|status}"
 exit 3
esac
exit $?

The problem is that sometimes the ADSL connection falls over, as they
do, eg:

Aug 20 11:42:10 positron pppd[2469]: LCP terminated by peer
Aug 20 11:42:10 positron pppd[2469]: Connect time 8619.4 minutes.
Aug 20 11:42:10 positron pppd[2469]: Sent 1342528799 bytes, received
164420300 bytes.
Aug 20 11:42:13 positron pppd[2469]: Connection terminated.
Aug 20 11:42:13 positron pppd[2469]: Modem hangup
Aug 20 11:42:13 positron pppoe[2470]: read (asyncReadFromPPP): Session
1735: Input/output error
Aug 20 11:42:13 positron pppoe[2470]: Sent PADT
Aug 20 11:42:13 positron pppd[2469]: Exit.
Aug 20 11:42:13 positron pppoe-connect: PPPoE connection lost;
attempting re-connection.

CRMd then logs a bunch of stuff, followed by

Aug 20 11:42:18 positron lrmd: [1760]: info: rsc:ExternalIP:8: stop
Aug 20 11:42:18 positron lrmd: [28357]: WARN: For LSB init script, no
additional parameters are needed.
[...]
Aug 20 11:42:18 positron pppoe-stop: Killing pppd
Aug 20 11:42:18 positron pppoe-stop: Killing pppoe-connect
Aug 20 11:42:18 positron lrmd: [1760]: WARN: Managed ExternalIP:stop
process 28357 exited with return code 1.


At this point, the PPPoE connection is down, and stays down.  CRMd
doesn't fail the group which contains both internal and external
interfaces over to the other node, but nor does it try to restart the
service.  I'm fairly sure this is because I've done something
boneheaded, but I can't get my bone head around what it might be.

Any light anyone can shed is much appreciated.




If the stop operation fails, the resource state is undefined and
Pacemaker won't do anything further with this resource. Either make sure
the script returns success when appropriate, or the only option is to
fence the node where the resource was active.





Re: [ClusterLabs] CRM managing ADSL connection; failure not handled

2015-08-24 Thread Andrei Borzenkov

24.08.2015 13:32, Tom Yates wrote:

On Mon, 24 Aug 2015, Andrei Borzenkov wrote:


24.08.2015 12:35, Tom Yates wrote:

I've got a failover firewall pair where the external interface is ADSL;
that is, PPPoE.  i've defined the service thus:


If the stop operation fails, the resource state is undefined and
Pacemaker won't do anything further with this resource. Either make sure
the script returns success when appropriate, or the only option is to
fence the node where the resource was active.


andrei, thank you for your prompt and helpful response.

if i understand you aright, my problem is that the stop script didn't
return a 0 (OK) exit status, so CRM didn't know where to go.  is the
exit status of the stop script how CRM determines the status of the stop
operation?


correct


 and if that gives exit code 0, it will then try to do a
/etc/init.d/script start?



If the resource was previously active and the stop was attempted as
cleanup after a resource failure, then yes, it should attempt to start
it again.




does CRM also use the output of /etc/init.d/script status to determine
continuing successful operation?



It definitely does not use the *output* of the script, only the return
code. If the question is whether it probes the resource in addition to
checking the stop exit code, I do not think so (I know it does that in
some cases for systemd resources).



