[Linux-ha-dev] [patch] conntrackd RA
Hi,

While playing with the conntrackd agent on Debian Squeeze, I found out that the method used in the monitor action is not accurate and can sometimes yield false results, believing conntrackd is running when it is not. I am instead checking for the existence of the control socket, which has so far proved more stable. Patch attached.

-- 
Albéric de Pertat
ADELUX: http://www.adelux.fr
Tel: 01 40 86 45 81
GPG: http://www.adelux.fr/societe/gpg/alberic.asc

--- conntrackd	2011-08-18 12:12:36.807562142 +0200
+++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
@@ -111,8 +111,10 @@ conntrackd_monitor() {
 	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary >/dev/null 2>&1 && rc=$OCF_SUCCESS
+	# It does not write a PID file, so check the socket exists after
+	# extracting its path from the configuration file
+	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
+	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
 	if [ $rc -eq $OCF_SUCCESS ]; then
 		# conntrackd is running
 		# now see if it acceppts queries

___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] [patch] conntrackd RA
On Thu, Aug 18, 2011 at 12:28:44PM +0200, Albéric de Pertat wrote:
> Hi,
>
> While playing with the conntrackd agent on Debian Squeeze, I found out the
> method used in the monitor action is not accurate and can sometimes yield
> false results, believing conntrackd is running when it is not.

As in when? Could that be resolved?

> I am instead checking the existence of the control socket and it has so far
> proved more stable.

If you kill -9 conntrackd (or conntrackd should crash for some reason), it will leave behind that socket. So testing for the existence of that socket is in no way more reliable than looking through the process table.

Maybe there is some ping method in the conntrack-tools? Or something that could be used as such? If not, maybe try a connect to that socket, using netcat/socat? Just checking whether the socket exists will not detect conntrackd crashes.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
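A minimal sketch of the point being made, in plain shell (the function name and socket path are illustrative, not the RA's actual code): a socket file's existence alone cannot distinguish a live daemon from one that was killed with -9 and left its socket behind, so the existence test only works as a cheap pre-filter before a real query against the daemon.

```shell
# check_daemon_socket SOCKET QUERY_CMD
# Returns "running" only if QUERY_CMD (a command that talks to the daemon)
# succeeds; a stale socket with no listener is reported separately.
check_daemon_socket() {
    sock="$1"; query_cmd="$2"
    # no socket file at all: definitely not running
    [ -S "$sock" ] || { echo "not running"; return 7; }
    # the authoritative test: does anything actually answer?
    if $query_cmd >/dev/null 2>&1; then
        echo "running"
    else
        echo "socket exists but daemon not responding"
        return 1
    fi
}

# On a machine without conntrackd the pre-filter already reports failure,
# without ever running the query command:
check_daemon_socket /nonexistent/conntrackd.ctl true || true
```

The last line prints "not running"; in the RA, the query command would be the `conntrackd -C ... -s` invocation that the monitor already performs.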
Re: [Linux-ha-dev] [patch] conntrackd RA
On Thu, Aug 18, 2011 at 12:58:21PM +0200, Lars Ellenberg wrote:
> On Thu, Aug 18, 2011 at 12:28:44PM +0200, Albéric de Pertat wrote:
> > While playing with the conntrackd agent on Debian Squeeze, I found out the
> > method used in the monitor action is not accurate and can sometimes yield
> > false results, believing conntrackd is running when it is not.
>
> As in when? Could that be resolved?
>
> > I am instead checking the existence of the control socket and it has so
> > far proved more stable.
>
> If you kill -9 conntrackd (or conntrackd should crash for some reason), it
> will leave behind that socket. So testing for the existence of that socket
> is in no way more reliable than looking through the process table.
>
> Maybe there is some ping method in the conntrack-tools? Or something that
> could be used as such?

Ah, my bad... I should have looked not at the patch only, but at the RA in context. That check whether it accepts queries is done immediately following this test.

So yes, your patch is good. Acked-by lars ;-)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Linux-ha-dev] [patch] conntrackd RA
On Thu, Aug 18, 2011 at 01:24:04PM +0200, Lars Ellenberg wrote:
> On Thu, Aug 18, 2011 at 12:58:21PM +0200, Lars Ellenberg wrote:
> > If you kill -9 conntrackd (or conntrackd should crash for some reason),
> > it will leave behind that socket. So testing for the existence of that
> > socket is in no way more reliable than looking through the process table.
>
> Ah, my bad... I should have looked not at the patch only, but at the RA in
> context. That check whether it accepts queries is done immediately
> following this test.
>
> So yes, your patch is good. Acked-by lars ;-)

Oh, these monologues...

> > > +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)

Is space really the only allowed white space there? I guess the regex has to be changed to use [[:space:]]*:

	local conntrack_socket=$(awk '/^[[:space:]]*UNIX[[:space:]]*{/,/^[[:space:]]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)

Maybe rather do it the other way round, see below.

(untested)

diff --git a/heartbeat/conntrackd b/heartbeat/conntrackd
--- a/heartbeat/conntrackd
+++ b/heartbeat/conntrackd
@@ -110,16 +110,17 @@ conntrackd_set_master_score() {
 }
 
 conntrackd_monitor() {
-	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary >/dev/null 2>&1 && rc=$OCF_SUCCESS
-	if [ $rc -eq $OCF_SUCCESS ]; then
-		# conntrackd is running
-		# now see if it acceppts queries
-		if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s > /dev/null 2>&1; then
+	# see if it acceppts queries
+	if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s > /dev/null 2>&1; then
+		local conntrack_socket=$(awk '/^[[:space:]]*UNIX[[:space:]]*{/,/^[[:space:]]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)
+		if test -S $conntrack_socket ; then
 			rc=$OCF_ERR_GENERIC
-			ocf_log err "conntrackd is running but not responding to queries"
+			ocf_log err "conntrackd control socket exists, but not responding to queries"
+		else
+			rc=$OCF_NOT_RUNNING
 		fi
+	else
+		rc=$OCF_SUCCESS
 		if conntrackd_is_master; then
 			rc=$OCF_RUNNING_MASTER
 			# Restore master setting on probes

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Linux-ha-dev] OCF RA for named
On Wed, Aug 17, 2011 at 12:39 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
> On Tue, Aug 16, 2011 at 08:51:04AM -0600, Serge Dubrouski wrote:
> > On Tue, Aug 16, 2011 at 8:44 AM, Dejan Muhamedagic de...@suse.de wrote:
> > > Hi Serge,
> > >
> > > On Fri, Aug 05, 2011 at 08:19:52AM -0600, Serge Dubrouski wrote:
> > > > No interest?
> > >
> > > Probably not true :) It's just that recently I've been away for a while
> > > and in between really swamped with my daily work. I'm trying to catch
> > > up now, but it may take a while. In the meantime, I'd like to ask you
> > > about the motivation. DNS already has a sort of redundancy built in
> > > through its primary/secondary servers.
> >
> > That redundancy doesn't work quite well. Yes, you can have primary and
> > secondary servers configured in resolv.conf, but if the primary is down,
> > the resolver waits until the request to the primary server times out
> > before it sends a request to the secondary one. The delay can be up to
> > 30 seconds and impacts some applications pretty badly. This is standard
> > behaviour for Linux; Solaris, for example, works differently and isn't
> > impacted by this issue. Workarounds are having a caching DNS server
> > running locally, or making the primary DNS server highly available
> > using Pacemaker :-)
> >
> > Here is what the man page for resolv.conf says:
> >
> > nameserver Name server IP address
> >     Internet address (in dot notation) of a name server that the
> >     resolver should query. Up to MAXNS (currently 3, see resolv.h) name
> >     servers may be listed, one per keyword. If there are multiple
> >     servers, the resolver library queries them in the order listed. If
> >     no nameserver entries are present, the default is to use the name
> >     server on the local machine. (The algorithm used is to try a name
> >     server, and if the query times out, try the next, until out of name
> >     servers, then repeat trying all the name servers until a maximum
> >     number of retries are made.)
>
> options timeout:2 attempts:5 rotate

Right, one can do this. But even with this it would take an additional 10 seconds for requests sent to the server that's down before they time out. In a production environment that's absolutely unacceptable.

> but yes, it is still a valid use case to have a clustered primary name
> server, and possibly multiple backups.

And that's why I created this RA :-)

-- 
Serge Dubrouski.
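For reference, the workaround being discussed would look roughly like this in /etc/resolv.conf (the addresses are placeholders; see resolv.conf(5) for the exact option semantics):

```
# lower the per-server timeout from the 5-second default, and rotate
# queries across the listed servers instead of always trying the first
options timeout:2 attempts:2 rotate
nameserver 192.0.2.1
nameserver 192.0.2.2
```

Even with these options, some fraction of queries still pays the timeout while a listed server is dead, which is the objection raised above; putting the primary server behind a single highly available address avoids that entirely.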
Re: [Linux-ha-dev] [patch] conntrackd RA
On Thursday, August 18, 2011, at 13:43:20, Lars Ellenberg wrote:
> So yes, your patch is good. Acked-by lars ;-)

Oh, thanks :)

> Oh, these monologues...
>
> > > +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
>
> Is space really the only allowed white space there? I guess the regex has
> to be changed to use [[:space:]]*

Well, I don't know about that, but I guess we shouldn't take any chances. Unfortunately, awk doesn't know about [[:space:]]. I would have used [ \t\n] instead, but I'm not sure blank lines in the middle of statements are allowed, so for the sake of clarity I replaced them with [ \t] only.

-- 
Albéric de Pertat
ADELUX: http://www.adelux.fr
Tel: 01 40 86 45 81
GPG: http://www.adelux.fr/societe/gpg/alberic.asc

--- conntrackd	2011-08-18 12:12:36.807562142 +0200
+++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 14:14:48.0 +0200
@@ -111,8 +111,10 @@ conntrackd_monitor() {
 	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary >/dev/null 2>&1 && rc=$OCF_SUCCESS
+	# It does not write a PID file, so check the socket exists after
+	# extracting its path from the configuration file
+	local conntrack_socket=$(awk '/^[ \t]*UNIX[ \t]*{/,/^[ \t]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)
+	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
 	if [ $rc -eq $OCF_SUCCESS ]; then
 		# conntrackd is running
 		# now see if it acceppts queries
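The extraction in the patch can be exercised against a throwaway config; the file content below is a made-up minimal example in the conntrackd.conf block style, not a full configuration:

```shell
# Build a tiny config with a tab-indented UNIX block, then extract the
# socket path with the patch's awk range pattern.
conf=$(mktemp)
printf 'General {\n\tUNIX {\n\t\tPath /var/run/conntrackd.ctl\n\t\tBacklog 20\n\t}\n}\n' > "$conf"

# The range /^[ \t]*UNIX[ \t]*{/ .. /^[ \t]*}/ limits the "Path" match to
# the UNIX block, even though the keys are indented with tabs.
conntrack_socket=$(awk '/^[ \t]*UNIX[ \t]*{/,/^[ \t]*}/ { if ($1 == "Path") { print $2 } }' "$conf")
echo "$conntrack_socket"    # prints /var/run/conntrackd.ctl

rm -f "$conf"
```

Note that with `$1 == "Path"` the field comparison is exact, so any amount of leading whitespace on the Path line is tolerated regardless of the range patterns.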
[Linux-HA] Q: crm shell: things more complex than group
Hi!

Reading the docs, I learned that pacemaker understands more complex dependencies than group, where resources are strictly sequential. For example, one could start a set of resources in parallel, wait until all are done, then start another set of resources, and so on. Now I wonder:

1) Can such a thing (i.e. parallelism) be configured with the crm shell? If so, what is the syntax like?

2) In some resource groups, not all resources are really required. For example in a RAID1, only one of the two legs is really required to start up the RAID (assuming I need to activate some extra resource for each leg of the RAID). Can such a dependency be expressed in the CRM? If so, how do you do it?

I'd wish (I know what you will reply ;-)) that this could be expressed with the crm shell.

Regards,
Ulrich

___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
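For question 1, the crm shell's set notation in ordering constraints is the relevant construct. A sketch with hypothetical resource names (whether your pacemaker/crmsh version supports this should be checked against its own documentation):

```
# Members inside ( ) are unordered among themselves, so they can be
# started in parallel; the parenthesized sets are ordered left to right:
# all of fs1/fs2/fs3 must be started before app1 and app2 begin.
order o-two-phases inf: ( fs1 fs2 fs3 ) ( app1 app2 )
```

Question 2, the "only one of N members required" case, maps to the require-all semantics of resource sets, which, to my knowledge, only appeared in later Pacemaker releases.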
[Linux-HA] Forcing primitive_nfslock away from node
WTH does this mean (from node2)?

pengine: [16069]: notice: clone_print: Master/Slave Set: master_drbd
pengine: [16069]: notice: short_print: Masters: [ node1 ]
pengine: [16069]: notice: short_print: Slaves: [ node2 ]
pengine: [16069]: notice: native_print: filesystem_drbd (ocf::heartbeat:Filesystem): Started node1
pengine: [16069]: notice: native_print: primitive_nfslock (lsb:nfslock): Started node2
pengine: [16069]: info: get_failcount: filesystem_drbd has failed INFINITY times on node2
pengine: [16069]: WARN: common_apply_stickiness: Forcing filesystem_drbd away from node2 after 100 failures (max=100)
pengine: [16069]: info: get_failcount: primitive_nfslock has failed INFINITY times on node1
pengine: [16069]: WARN: common_apply_stickiness: Forcing primitive_nfslock away from node1 after 100 failures (max=100)

Does this mean the NFS filesystem is started on node1 while the statd/lockd for it are started on node2? Despite an inf: colocation constraint?

(SL6 w/ stock rpms plus drbd from atrpms)

Dima

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
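Separate from the colocation question: the INFINITY failcounts in the log are what is forcing each resource away from its node, so after fixing the underlying failures they need to be cleared before placement is re-evaluated. A sketch in crm shell syntax (resource and node names taken from the log above; verify the subcommands against your crmsh version):

```
# inspect the failcount pinning filesystem_drbd off node2
crm resource failcount filesystem_drbd show node2
# clear failcounts and operation history so the policy engine re-places them
crm resource cleanup filesystem_drbd
crm resource cleanup primitive_nfslock
```

With both failcounts at INFINITY, filesystem_drbd is banned from node2 and primitive_nfslock from node1, which is exactly the split the native_print lines show, colocation constraint or not.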