[Linux-ha-dev] [patch] conntrackd RA

2011-08-18 Thread Albéric de Pertat
Hi,

While playing with the conntrackd agent on Debian Squeeze, I found out the 
method used in the monitor action is not accurate and can sometimes yield 
false results, believing conntrackd is running when it is not. I am instead 
checking the existence of the control socket and it has so far proved more 
stable.
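
(For context: pgrep -f matches against the full command line, not just the
process name, so any process whose command line merely contains the binary
path can produce a hit. A hypothetical illustration, not from the report:

    vi /usr/sbin/conntrackd &              # an editor session on the binary
    pgrep -f /usr/sbin/conntrackd          # matches the vi process, too

The report does not pinpoint which stray match caused the false positives.)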

Patch attached. 
-- 
Albéric de Pertat
ADELUX: http://www.adelux.fr
Tel: 01 40 86 45 81
GPG: http://www.adelux.fr/societe/gpg/alberic.asc
--- conntrackd	2011-08-18 12:12:36.807562142 +0200
+++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
@@ -111,8 +111,10 @@
 
 conntrackd_monitor() {
 	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
+	# It does not write a PID file, so check the socket exists after
+	# extracting its path from the configuration file
+	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
+	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
 	if [ $rc -eq $OCF_SUCCESS ]; then
 		# conntrackd is running
 		# now see if it accepts queries




Re: [Linux-ha-dev] [patch] conntrackd RA

2011-08-18 Thread Lars Ellenberg
On Thu, Aug 18, 2011 at 12:28:44PM +0200, Albéric de Pertat wrote:
 Hi,
 
 While playing with the conntrackd agent on Debian Squeeze, I found out the 
 method used in the monitor action is not accurate and can sometimes yield 
 false results, believing conntrackd is running when it is not.

As in when?
Could that be resolved?

 I am instead checking the existence of the control socket and it has
 so far proved more stable.

If you kill -9 conntrackd (or should conntrackd crash for some reason),
it will leave behind that socket.

So testing on the existence of that socket is in no way more reliable
than looking through the process table.

Maybe there is some ping method in the conntrack-tools?
Or something that could be used as such?

If not, maybe try a connect to that socket, using netcat/socat?
Just checking if the socket exists will not detect conntrackd crashes.
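
(A sketch of that idea, assuming socat is available; the socket path is an
example and would really be extracted from conntrackd.conf:

    SOCK=/var/run/conntrackd.ctl
    # Connecting succeeds only while a live process is accepting on the
    # socket; a stale socket left behind by a crash refuses the connection.
    socat -u /dev/null UNIX-CONNECT:$SOCK && echo alive || echo dead

As it turns out further down the thread, the RA already follows the socket
test with a real query, so an extra probe like this is unnecessary.)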


 --- conntrackd	2011-08-18 12:12:36.807562142 +0200
 +++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
 @@ -111,8 +111,10 @@
  
  conntrackd_monitor() {
  	rc=$OCF_NOT_RUNNING
 -	# It does not write a PID file, so check with pgrep
 -	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
 +	# It does not write a PID file, so check the socket exists after
 +	# extracting its path from the configuration file
 +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
 +	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
  	if [ $rc -eq $OCF_SUCCESS ]; then
  		# conntrackd is running
  		# now see if it accepts queries






-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Linux-ha-dev] [patch] conntrackd RA

2011-08-18 Thread Lars Ellenberg
On Thu, Aug 18, 2011 at 12:58:21PM +0200, Lars Ellenberg wrote:
 On Thu, Aug 18, 2011 at 12:28:44PM +0200, Albéric de Pertat wrote:
  Hi,
  
  While playing with the conntrackd agent on Debian Squeeze, I found out the 
  method used in the monitor action is not accurate and can sometimes yield 
  false results, believing conntrackd is running when it is not.
 
 As in when?
 Could that be resolved?
 
  I am instead checking the existence of the control socket and it has
  so far proved more stable.
 
 If you kill -9 conntrackd (or should conntrackd crash for some reason),
 it will leave behind that socket.
 
 So testing on the existence of that socket is in no way more reliable
 than looking through the process table.
 
 Maybe there is some ping method in the conntrack-tools?
 Or something that could be used as such?

Ah, my bad...
should have looked not at the patch only,
but at the RA in context.

That check if it accepts queries is done
immediately following this test.

So yes, your patch is good. Acked-by lars ;-)

  --- conntrackd	2011-08-18 12:12:36.807562142 +0200
  +++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
  @@ -111,8 +111,10 @@
   
   conntrackd_monitor() {
   	rc=$OCF_NOT_RUNNING
  -	# It does not write a PID file, so check with pgrep
  -	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
  +	# It does not write a PID file, so check the socket exists after
  +	# extracting its path from the configuration file
  +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
  +	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
   	if [ $rc -eq $OCF_SUCCESS ]; then
   		# conntrackd is running
   		# now see if it accepts queries

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Linux-ha-dev] [patch] conntrackd RA

2011-08-18 Thread Lars Ellenberg
On Thu, Aug 18, 2011 at 01:24:04PM +0200, Lars Ellenberg wrote:
 On Thu, Aug 18, 2011 at 12:58:21PM +0200, Lars Ellenberg wrote:
  On Thu, Aug 18, 2011 at 12:28:44PM +0200, Albéric de Pertat wrote:
   Hi,
   
   While playing with the conntrackd agent on Debian Squeeze, I found out the
   method used in the monitor action is not accurate and can sometimes yield
   false results, believing conntrackd is running when it is not.
  
  As in when?
  Could that be resolved?
  
   I am instead checking the existence of the control socket and it has
   so far proved more stable.
  
  If you kill -9 conntrackd (or should conntrackd crash for some reason),
  it will leave behind that socket.
  
  So testing on the existence of that socket is in no way more reliable
  than looking through the process table.
  
  Maybe there is some ping method in the conntrack-tools?
  Or something that could be used as such?
 
 Ah, my bad...
 should have looked not at the patch only,
 but at the RA in context.
 
 That check if it accepts queries is done
 immediately following this test.
 
 So yes, your patch is good. Acked-by lars ;-)

Oh, these monologues...

   --- conntrackd	2011-08-18 12:12:36.807562142 +0200
   +++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
   @@ -111,8 +111,10 @@
    
    conntrackd_monitor() {
    	rc=$OCF_NOT_RUNNING
   -	# It does not write a PID file, so check with pgrep
   -	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
   +	# It does not write a PID file, so check the socket exists after
   +	# extracting its path from the configuration file
   +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)

Is space really the only allowed white space there?
I guess the regex has to be changed to use [[:space:]]*

local conntrack_socket=$(awk '/^[[:space:]]*UNIX[[:space:]]*{/,/^[[:space:]]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)

Maybe rather do it the other way round, see below.

   +	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
    	if [ $rc -eq $OCF_SUCCESS ]; then
    		# conntrackd is running
    		# now see if it accepts queries

(untested)
diff --git a/heartbeat/conntrackd b/heartbeat/conntrackd
--- a/heartbeat/conntrackd
+++ b/heartbeat/conntrackd
@@ -110,16 +110,17 @@ conntrackd_set_master_score() {
 }
 
 conntrackd_monitor() {
-	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
-	if [ $rc -eq $OCF_SUCCESS ]; then
-		# conntrackd is running
-		# now see if it accepts queries
-		if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s > /dev/null 2>&1; then
+	# see if it accepts queries
+	if ! $OCF_RESKEY_binary -C $OCF_RESKEY_config -s > /dev/null 2>&1; then
+		local conntrack_socket=$(awk '/^[[:space:]]*UNIX[[:space:]]*{/,/^[[:space:]]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)
+		if test -S "$conntrack_socket"; then
 			rc=$OCF_ERR_GENERIC
-			ocf_log err "conntrackd is running but not responding to queries"
+			ocf_log err "conntrackd control socket exists, but not responding to queries"
+		else
+			rc=$OCF_NOT_RUNNING
 		fi
+	else
+		rc=$OCF_SUCCESS
 		if conntrackd_is_master; then
 			rc=$OCF_RUNNING_MASTER
 			# Restore master setting on probes

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


Re: [Linux-ha-dev] OCF RA for named

2011-08-18 Thread Serge Dubrouski
On Wed, Aug 17, 2011 at 12:39 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 On Tue, Aug 16, 2011 at 08:51:04AM -0600, Serge Dubrouski wrote:
  On Tue, Aug 16, 2011 at 8:44 AM, Dejan Muhamedagic de...@suse.de wrote:
 
   Hi Serge,
  
   On Fri, Aug 05, 2011 at 08:19:52AM -0600, Serge Dubrouski wrote:
No interest?
  
   Probably not true :) It's just that recently I've been away for
   a while and in between really swamped with my daily work. I'm
   trying to catch up now, but it may take a while.
  
   In the meantime, I'd like to ask you about the motivation. DNS
   already has a sort of redundancy built in through its
   primary/secondary servers.
  
 
  That redundancy doesn't work quite well. Yes, you can have primary and
  secondary servers configured in resolv.conf, but if the primary is down, the
  resolver waits until the request to the primary server times out before it
  sends the request to the secondary one. The delay can be up to 30 seconds
  and impacts some applications pretty badly. This is standard behaviour for
  Linux; Solaris, for example, works differently and isn't impacted by this
  issue. Workarounds are having a caching DNS server running locally, or
  making the primary DNS server highly available using Pacemaker :-)
 
  Here is what the man page for resolv.conf says:
 
   nameserver Name server IP address
          Internet address (in dot notation) of a name server that the
          resolver should query. Up to MAXNS (currently 3, see resolv.h)
          name servers may be listed, one per keyword. If there are multiple
          servers, the resolver library queries them in the order listed. If
          no nameserver entries are present, the default is to use the name
          server on the local machine. *(The algorithm used is to try a name
          server, and if the query times out, try the next, until out of
          name servers, then repeat trying all the name servers until a
          maximum number of retries are made.)*

 options timeout:2 attempts:5 rotate


Right, one can do this. But even with this, it would take an additional 10
seconds for requests sent to the server that's down before they time out. In a
production environment that's absolutely unacceptable.
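
For illustration, the resolv.conf shape under discussion; the addresses and
the Pacemaker-VIP role are examples only, not taken from the thread:

    # /etc/resolv.conf -- illustrative values
    nameserver 192.0.2.1      # primary, e.g. a Pacemaker-managed virtual IP
    nameserver 192.0.2.2      # secondary
    options timeout:2 attempts:5 rotate

Even with these options, each query that first lands on a dead server still
burns the 2-second timeout, which is the residual delay mentioned above.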


 but yes, it is still a valid use case to have a clustered primary name
 server, and possibly multiple backups.


And that's why I created this RA :-)



 --
 : Lars Ellenberg
 : LINBIT | Your Way to High Availability
 : DRBD/HA support and consulting http://www.linbit.com

 DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.




-- 
Serge Dubrouski.


Re: [Linux-ha-dev] [patch] conntrackd RA

2011-08-18 Thread Albéric de Pertat
On Thursday 18 August 2011 at 13:43:20, Lars Ellenberg wrote:

  So yes, your patch is good. Acked-by lars ;-)

Oh thanks :)

 Oh, these monologues...
 
 --- conntrackd	2011-08-18 12:12:36.807562142 +0200
 +++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 12:25:20.0 +0200
 @@ -111,8 +111,10 @@
 
  conntrackd_monitor() {
  	rc=$OCF_NOT_RUNNING
 -	# It does not write a PID file, so check with pgrep
 -	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
 +	# It does not write a PID file, so check the socket exists after
 +	# extracting its path from the configuration file
 +	local conntrack_socket=$(awk '/^ *UNIX *{/,/^ *}/ { if ($0 ~ /^ *Path /) { print $2 } }' $OCF_RESKEY_config)
 
 Is space really the only allowed white space there?
 I guess the regex has to be changed to use [[:space:]]*

Well, I don't know about that, but I guess we shouldn't take any chances.
Unfortunately, awk doesn't know about [[:space:]] (at least Debian's default
mawk doesn't). I would have used [ \t\n] instead, but I'm not sure blank lines
in the middle of statements are allowed, so, for the sake of clarity, I
replaced them with [ \t] only.
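
For reference, the range pattern is meant to pull Path out of a
conntrackd.conf section shaped roughly like this (the path and Backlog values
are examples only). Because the UNIX block is usually nested and indented,
the leading-whitespace classes matter:

    General {
        UNIX {
            Path /var/run/conntrackd.ctl
            Backlog 20
        }
    }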
-- 
Albéric de Pertat
ADELUX: http://www.adelux.fr
Tel: 01 40 86 45 81
GPG: http://www.adelux.fr/societe/gpg/alberic.asc
--- conntrackd	2011-08-18 12:12:36.807562142 +0200
+++ /usr/lib/ocf/resource.d/heartbeat/conntrackd	2011-08-18 14:14:48.0 +0200
@@ -111,8 +111,10 @@
 
 conntrackd_monitor() {
 	rc=$OCF_NOT_RUNNING
-	# It does not write a PID file, so check with pgrep
-	pgrep -f $OCF_RESKEY_binary && rc=$OCF_SUCCESS
+	# It does not write a PID file, so check the socket exists after
+	# extracting its path from the configuration file
+	local conntrack_socket=$(awk '/^[ \t]*UNIX[ \t]*{/,/^[ \t]*}/ { if ($1 == "Path") { print $2 } }' $OCF_RESKEY_config)
+	[ -S $conntrack_socket ] && rc=$OCF_SUCCESS
 	if [ $rc -eq $OCF_SUCCESS ]; then
 		# conntrackd is running
 		# now see if it accepts queries




[Linux-HA] Q: crm shell: things more complex than group

2011-08-18 Thread Ulrich Windl
Hi!

Reading the docs, I learned that pacemaker understands more complex
dependencies than a group, where resources are strictly sequential. For
example, one could start a set of resources in parallel, wait until all are
done, then start another set of resources, and so on.

Now I wonder:
1) Can such a thing (i.e. parallelism) be configured with crm shell? If so, 
what is the syntax like?

2) In some resource groups, not all resources are really required. For example
in a RAID1, only one of the two legs is really required to start up the RAID
(assuming I need to activate some extra resource for each leg of the RAID).
Can such a dependency be expressed in CRM? If so, how do you do it?

I wish (I know what you will reply ;-)) that this could be expressed with the
crm shell.
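
Not an authoritative answer, but for question 1 the resource-set syntax in
newer crm shell versions should apply: a parenthesized set starts its members
in parallel. A sketch, with made-up resource names:

    # A and B may start concurrently; C starts only once both are active
    order parallel-then-C inf: ( A B ) C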

Regards,
Ulrich




[Linux-HA] Forcing primitive_nfslock away from node

2011-08-18 Thread Dimitri Maziuk

WTH does this mean (from node2):

pengine: [16069]: notice: clone_print:  Master/Slave Set: master_drbd
pengine: [16069]: notice: short_print:  Masters: [ node1 ]
pengine: [16069]: notice: short_print:  Slaves: [ node2 ]
pengine: [16069]: notice: native_print: filesystem_drbd (ocf::heartbeat:Filesystem): Started node1
pengine: [16069]: notice: native_print: primitive_nfslock (lsb:nfslock): Started node2
pengine: [16069]: info: get_failcount: filesystem_drbd has failed INFINITY times on node2
pengine: [16069]: WARN: common_apply_stickiness: Forcing filesystem_drbd away from node2 after 100 failures (max=100)
pengine: [16069]: info: get_failcount: primitive_nfslock has failed INFINITY times on node1
pengine: [16069]: WARN: common_apply_stickiness: Forcing primitive_nfslock away from node1 after 100 failures (max=100)

Does this mean the nfs filesystem is started on node1 while the statd &
lockd for it are started on node2? Despite the inf: colocation constraint?

(SL6 w/ stock rpms plus drbd from atrpms)
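
For reference, the sort of constraints the question implies, written with the
resource names from the log; the actual configuration is not quoted in the
message, so this is an assumed reconstruction:

    # keep the lock services with the filesystem, and start them after it
    colocation nfslock-with-fs inf: primitive_nfslock filesystem_drbd
    order fs-before-nfslock inf: filesystem_drbd primitive_nfslock

With filesystem_drbd banned from node2 and primitive_nfslock banned from
node1, a mandatory colocation would leave primitive_nfslock nowhere to run,
so the "Started node2" line does look inconsistent with such a constraint.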

Dima
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu


