[Linux-ha-dev] make rpm failed

2006-07-24 Thread Guochun Shi

gmake[2]: Entering directory `/home/gshi/linux-ha/mgmt/client'
msgfmt not found -o haclient.zh_CN.mo haclient.zh_CN.po
gmake[2]: msgfmt: Command not found
gmake[2]: *** [haclient.zh_CN.mo] Error 127


-Guochun

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


[Linux-ha-dev] BSC failure due to CTSproxy.py permission

2006-07-24 Thread Guochun Shi

The failure comes from CTSproxy.py not being executable.

[EMAIL PROTECTED] cts]#  /usr/bin/python /usr/lib/heartbeat/cts/CTSlab.py --bsc
Jul 24 11:29:35 Random seed is: 1153758575
Jul 24 11:29:35  BEGINNING 2 TESTS
Jul 24 11:29:35 HA configuration directory: /etc/ha.d
Jul 24 11:29:35 System log files: /var/log/ha-log-local7
Jul 24 11:29:35 Enable Stonith: 1
Jul 24 11:29:35 Enable Fencing: 1
Jul 24 11:29:35 Enable Standby: 1
Jul 24 11:29:35 Cluster nodes:
/bin/sh: line 1: /usr/lib/heartbeat/cts/CTSproxy.py: Permission denied
Traceback (most recent call last):
  File "/usr/lib/heartbeat/cts/CTSlab.py", line 749, in ?
   /usr/sbin/crm_uuid)
  File "/usr/lib/heartbeat/cts/CTS.py", line 204, in remote_py
    result.pop()
IndexError: pop from empty list

After I manually ran chmod +x on CTSproxy.py, it works fine. I don't know
how to change this in Makefile.am.

Can someone fix it?
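
For anyone who wants to check or repair the installed file from a script, here is a minimal Python sketch of the manual workaround. The helper name ensure_executable is made up for illustration; it is not part of CTS and it does not address the Makefile.am question.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission for user, group and others (roughly `chmod a+x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # Report whether the current user can now execute the file.
    return os.access(path, os.X_OK)
```

It would be called as ensure_executable("/usr/lib/heartbeat/cts/CTSproxy.py") on the test node.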

thanks
-Guochun




Re: [Linux-ha-dev] MAXMSG too small

2006-05-30 Thread Guochun Shi

Andrew Beekhof wrote:

On 5/29/06, Alan Robertson [EMAIL PROTECTED] wrote:

Andrew Beekhof wrote:
 Running CTS on 6 nodes has shown MAXMSG to be too small - the PE cannot
 send its transition graph and the cluster stalls indefinitely.

So, that means the CIB is > 256K compressed?  Or is it > 256K uncompressed?


it's being added with
ha_msg_addstruct_compress(msg, field, xml);
and sent via IPC to the crmd (from the pengine)

whether it's actually been compressed or not I don't know.

It should be compressed if you have specified a compression method in ha.cf.
However, it would be good to have some proof that it is compressed. A message
that is > 256K after compression means the uncompressed one is probably 1M-2M.

Another approach that might be interesting is to provide an API with a much
higher bound, suited for local usage only.






 We could increase the value but looking through the code this seems to
 be an artificial limitation to various degrees...

 * In some cases its used as a substitute for get_netstringlen(msg) - I
 believe these should be fixed

 * In some cases its used to pre-empt checks by child functions - I
 believe these should be removed.

 The two cases that seem to legitimately use MAXMSG are the HBcomm
 plugins and the decompression code (though even that could retry a
 couple of time with larger buffers).


 Alan, can you please take a look at the use of MAXMSG in the IPC
 layer which is really not my area of expertise (especially the HBcomm
 plugins) and verify that my assessment is correct (and possibly get
 someone to look at fixing it).

Unfortunately, this means various buffers get locked into memory at this
size.  Our processes are already pretty huge.  get_netstringlen() is an
expensive call.


Thats basically the tradeoff... either we increase MAXMSG and take a
hit on the process size, or we do more dynamically and take a runtime
hit.

Not being a guru in the IPC layer, I dont know which is worse.

However, my suspicion was that get_(net)stringlen was not too bad for
flat messages and would therefore be preferred.


Why do you think that predicting that child buffers will be too large is
a bad idea?  How do you understand that removing it will help?


For low values of MAXMSG I think its fine to do that.  But we keep
upping the value and   allocating 256k for regular heartbeat packets
seems like a real waste.


Is your concern related to compressed/uncompressed sizes?


As above.  I'm doing my part and indicating that it can/should be
compressed, but i dont know the internals well enough to say for sure.

Andrew, if you can send the log/debug file to me, I may (or may not) find
some clues.
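
To make two ideas from this thread concrete (getting proof that a payload really shrinks, and letting the decompression side retry with larger buffers), here is a rough Python sketch using the standard zlib module. It is an illustration only: heartbeat's compression plugins are C code, and the function name, initial size and retry count below are assumptions.

```python
import zlib


def decompress_with_retry(data, initial_size=1024, max_tries=5):
    """Inflate `data`, doubling the output bound whenever it proves too small."""
    limit = initial_size
    for _ in range(max_tries):
        inflater = zlib.decompressobj()
        out = inflater.decompress(data, limit)
        # Accept the result only if the whole stream was consumed within the bound.
        if inflater.eof and not inflater.unconsumed_tail:
            return out
        limit *= 2
    raise ValueError("message still too large after %d tries" % max_tries)
```

Comparing len(zlib.compress(payload)) with len(payload) before sending is one cheap way to see whether compression is actually taking place.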


-Guochun



Re: [Linux-ha-dev] MAXMSG too small

2006-05-30 Thread Guochun Shi

Alan Robertson wrote:


I think that MAXMSG is inappropriately used for the size of IPC 
messages - which would prevent messages from being sent in some cases.


Are you saying that there should be a higher limit, or no limit, for IPC-only
messages? I think the message layer can provide another API for that.
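
As a sketch of what such a second API could look like, the Python below keeps the network bound and adds a larger, local-only bound. The 256K figure is the one discussed in this thread; MAX_LOCAL_MSG, its value and the check_size name are purely hypothetical.

```python
MAXMSG = 256 * 1024          # network bound discussed in this thread (256K)
MAX_LOCAL_MSG = 4 * MAXMSG   # hypothetical larger bound for local IPC only


def check_size(payload, local_only=False):
    """Return True if `payload` fits the bound for its transport."""
    limit = MAX_LOCAL_MSG if local_only else MAXMSG
    return len(payload) <= limit
```

The point of the split is that ordinary heartbeat packets keep the small bound, while a large transition graph sent over local IPC is checked against the higher one.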


-Guochun




[Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: lib by alan from

2006-02-24 Thread Guochun Shi

There is a build error in mcast.c:

cc1: warnings being treated as errors
mcast.c: In function 'if_getaddr':
mcast.c:703: warning: 'err' may be used uninitialized in this function
gmake[4]: *** [mcast.lo] Error 1
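
One plausible reading of the warning is that, in the quoted change, err is assigned only in the failure branch of the ioctl call, so the compiler cannot prove it is set before strerror(err) is reached. The Python sketch below shows the same retry pattern with the error value initialised up front. It is an illustration of the control flow, not the C fix; the function name and the lookup callback are assumptions.

```python
import errno
import time


def get_addr_with_retry(lookup, maxtry=120, sleep=time.sleep):
    """Call `lookup()` until it succeeds, retrying only on EADDRNOTAVAIL.

    Returns (address, err). On failure address is None and err is the last
    errno seen (0 if no attempt was made).
    """
    err = 0  # initialised so every exit path has a defined value
    for _ in range(maxtry):
        try:
            return lookup(), 0
        except OSError as exc:
            err = exc.errno
            if err != errno.EADDRNOTAVAIL:
                break   # any other error is treated as permanent
            sleep(1)    # address not assigned yet: wait and try again
    return None, err
```

In the C code the equivalent of the first assignment would be initialising err where it is declared.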


linux-ha-cvs@lists.linux-ha.org wrote:


linux-ha CVS committal

Author  : alan
Host: 
Project : linux-ha

Module  : lib

Dir : linux-ha/lib/plugins/HBcomm


Modified Files:
	mcast.c 



Log Message:
Increased how long we'll wait for the network interface to get an address...

===
RCS file: /home/cvs/linux-ha/linux-ha/lib/plugins/HBcomm/mcast.c,v
retrieving revision 1.27
retrieving revision 1.28
diff -u -3 -r1.27 -r1.28
--- mcast.c 24 Feb 2006 00:14:59 -  1.27
+++ mcast.c 24 Feb 2006 02:20:24 -  1.28
@@ -1,4 +1,4 @@
-/* $Id: mcast.c,v 1.27 2006/02/24 00:14:59 alan Exp $ */
+/* $Id: mcast.c,v 1.28 2006/02/24 02:20:24 alan Exp $ */
/*
 * mcast.c: implements hearbeat API for UDP multicast communication
 *
@@ -696,10 +696,9 @@
 static int
 if_getaddr(const char *ifname, struct in_addr *addr)
 {
-	int		fd;
 	struct ifreq	if_info;
 	int		j;
-	int		maxtry = 30;
+	int		maxtry = 120;
 	gboolean	gotaddr = FALSE;
 	int		err;

@@ -716,28 +715,37 @@
 		return 0;
 	}

-	if ((fd=socket(AF_INET, SOCK_DGRAM, 0)) == -1)	{
-		PILCallLog(LOG, PIL_CRIT, "Error getting socket");
-		return -1;
-	}
 	if (Debug > 0) {
 		PILCallLog(LOG, PIL_DEBUG, "looking up address for %s"
 		,	if_info.ifr_name);
 	}
 	for (j=0; j < maxtry && !gotaddr; ++j) {
-		if (ioctl(fd, SIOCGIFADDR, &if_info) < 0) {
-			err = errno;
-			sleep(1);
-		}else{
+		int		fd;
+		if ((fd=socket(AF_INET, SOCK_DGRAM, 0)) == -1)	{
+			PILCallLog(LOG, PIL_CRIT, "Error getting socket");
+			return -1;
+		}
+		if (ioctl(fd, SIOCGIFADDR, &if_info) >= 0) {
 			gotaddr = TRUE;
+		}else{
+			err = errno;
+			switch(err) {
+				case EADDRNOTAVAIL:
+					sleep(1);
+					break;
+				default:
+					close(fd);
+					goto getout;
+			}
 		}
+		close(fd);
 	}
+getout:
 	if (!gotaddr) {
 		PILCallLog(LOG, PIL_CRIT
 		,	"Unable to retrieve local interface address"
 		" for interface [%s] using ioctl(SIOCGIFADDR): %s"
 		,	ifname, strerror(err));
-		close(fd);
 		return -1;
 	}

@@ -750,7 +758,6 @@
 	memcpy(addr, &(SOCKADDR_IN(&if_info.ifr_addr)->sin_addr)
 	,	sizeof(struct in_addr));

-	close(fd);
 	return 0;
 }

@@ -813,6 +820,9 @@

/*
 * $Log: mcast.c,v $
+ * Revision 1.28  2006/02/24 02:20:24  alan
+ * Increased how long we'll wait for the network interface to get an address...
+ *
 * Revision 1.27  2006/02/24 00:14:59  alan
 * Put code into mcast.c to make it retry retrieving the address from
 * the interface if it fails...


___
Linux-ha-cvs mailing list
Linux-ha-cvs@lists.linux-ha.org
http://lists.community.tummy.com/mailman/listinfo/linux-ha-cvs


 





Re: [Linux-ha-dev] different bug fix for bug described in archives

2006-02-24 Thread Guochun Shi

A grep shows MSG_DONTWAIT is still used in ipcsocket.c and IPV6addr.c. I
will have it fixed.

Thanks for the reminder.

-Guochun

Steven Dake wrote:


Folks,

Joe is porting openais to bsd (and Darwin).  During this process, we
found a problem with the portability of our ipc layer because of the
fact that sendmsg doesn't honor the MSG_DONTWAIT flag.  A quick google
search brought up this thread with the same problem in linux-ha:

http://www.gossamer-threads.com/lists/linuxha/dev/0

The fix in this thread was to increase the buffer size of the send queue
in the kernel.  A more portable fix is to set the O_NONBLOCK flag via
the fcntl syscall.  This seems to work properly on Linux and Darwin
(Joe's darwin port now works for me on macosx).  Alan mentioned
rewriting the code - i'm not sure if this has been done yet, but if it
hasn't you might keep this tip in mind.
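
The fcntl approach can be sketched in Python with the standard fcntl module (Unix only). This only illustrates the technique; the openais and heartbeat code is C, and the helper names below are made up.

```python
import fcntl
import os
import socket


def set_nonblocking(fd):
    """Set O_NONBLOCK on a descriptor, preserving its other status flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)


def fill_until_blocked(sock, chunk=b"x" * 4096):
    """Send until the kernel buffer is full; return the number of bytes accepted."""
    sent = 0
    try:
        while True:
            sent += sock.send(chunk)
    except BlockingIOError:
        # The send would have blocked; with O_NONBLOCK it fails fast instead.
        return sent
```

Because the flag lives on the descriptor, every later send on it is non-blocking without passing a per-call flag.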

Regards
-steve



[Linux-ha-dev] CTS result ---- Overall Results:{'failure': 0, 'success': 1000, 'BadNews': 0}

2005-12-02 Thread Guochun Shi

[EMAIL PROTECTED] cts_ha_test]# rpm -qi heartbeat
Name        : heartbeat                    Relocations: (not relocatable)
Version     : 2.0.3                             Vendor: (none)
Release     : 1                             Build Date: Wed 30 Nov 2005 03:15:51 PM CST
Install Date: Wed 30 Nov 2005 03:15:21 PM CST      Build Host: posic066.ncsa.uiuc.edu
Group       : Utilities                     Source RPM: heartbeat-2.0.3-1.src.rpm
Size        : 13641396                         License: GPL/LGPL


Dec 02 12:29:36 
Dec 02 12:29:36 Overall Results:{'failure': 0, 'success': 1000, 'BadNews': 0}
Dec 02 12:29:36 
Dec 02 12:29:36 Detailed Results
Dec 02 12:29:36 Test Flip:  {'elapsed_time': 3056.7287585735321, 'skipped': 0, 'calls': 66, 'success': 66, 'started': 12, 'down-up': 12, 'auditfail': 0, 'failure': 0, 'stopped': 54, 'max_time': 69.569756031036377, 'min_time': 22.751783132553101, 'up-down': 54}
Dec 02 12:29:36 Test Restart:   {'elapsed_time': 2781.6241755485535, 'skipped': 0, 'calls': 65, 'success': 65, 'min_time': 23.147539138793945, 'auditfail': 0, 'failure': 0, 'node:posic043': 15, 'node:posic042': 23, 'node:posic045': 16, 'node:posic044': 11, 'max_time': 60.437736034393311, 'WasStopped': 48}
Dec 02 12:29:36 Test Stonithd:  {'elapsed_time': 20226.280859470367, 'skipped': 0, 'calls': 77, 'success': 77, 'auditfail': 0, 'failure': 0, 'max_time': 305.45212197303772, 'min_time': 237.37988209724426}
Dec 02 12:29:36 Test StartOnebyOne: {'elapsed_time': 10851.152466773987, 'skipped': 0, 'calls': 86, 'success': 86, 'auditfail': 0, 'failure': 0, 'max_time': 128.33868718147278, 'min_time': 107.48243713378906}
Dec 02 12:29:36 Test SimulStart:{'elapsed_time': 4848.6989839076996, 'skipped': 0, 'calls': 70, 'success': 70, 'auditfail': 0, 'failure': 0, 'max_time': 74.827316045761108, 'min_time': 52.387298107147217}
Dec 02 12:29:36 Test SimulStop: {'elapsed_time': 1881.3365070819855, 'skipped': 0, 'calls': 66, 'success': 66, 'auditfail': 0, 'failure': 0, 'max_time': 69.126533985137939, 'min_time': 15.12415599822998}
Dec 02 12:29:36 Test StopOnebyOne:  {'elapsed_time': 7284.4506158828735, 'skipped': 0, 'calls': 80, 'success': 80, 'auditfail': 0, 'failure': 0, 'max_time': 107.76936388015747, 'min_time': 51.32683801651001}
Dec 02 12:29:36 Test RestartOnebyOne:   {'elapsed_time': 20264.56384563446, 'skipped': 0, 'calls': 98, 'success': 98, 'auditfail': 0, 'failure': 0, 'max_time': 256.44388389587402, 'min_time': 167.01707005500793}
Dec 02 12:29:36 Test standby2:  {'elapsed_time': 9871.682421207428, 'skipped': 0, 'calls': 89, 'success': 89, 'auditfail': 0, 'failure': 0, 'max_time': 150.31842494010925, 'min_time': 94.204804182052612}
Dec 02 12:29:36 Test Bandwidth: {'elapsed_time': 2135.1637029647827, 'skipped': 12, 'calls': 73, 'success': 61, 'min': 7464.737207460571, 'max': 8226.9630687767694, 'totalbandwidth': 479334.04886258673, 'auditfail': 0, 'failure': 0, 'max_time': 72.851320028305054, 'min_time': 0.00011706352233886719}
Dec 02 12:29:36 Test ResourceRecover:   {'elapsed_time': 2640.5746552944183, 'skipped': 0, 'calls': 65, 'success': 65, 'auditfail': 0, 'failure': 0, 'max_time': 89.023792028427124, 'min_time': 23.00553297996521}
Dec 02 12:29:36 Test SpecialTest1:  {'elapsed_time': 8043.8690595626831, 'skipped': 0, 'calls': 88, 'success': 0, 'auditfail': 0, 'failure': 0, 'max_time': 99.962066173553467, 'min_time': 75.700264930725098}
Dec 02 12:29:36 Test NearQuorumPoint:   {'elapsed_time': 2532.2484936714172, 'skipped': 7, 'calls': 77, 'success': 70, 'auditfail': 0, 'failure': 0, 'max_time': 212.76548600196838, 'min_time': 0.0004940032958984375}
Dec 02 12:29:36  TESTS COMPLETED



[Linux-ha-dev] [Fwd: Re: [Linux-HA] hb 2.0.3 cvs does not start anymore with config of 2.0.2]

2005-11-17 Thread Guochun Shi

OK, I am forwarding this mail to the dev list. Who is in charge of the
scripts? Sunxun?
---BeginMessage---
Hi Andrew, Serge and Alan,

On Wednesday, 9 November 2005 06:37, Alan Robertson wrote:
 [EMAIL PROTECTED] wrote:
  It is very possible that the problem is that your OCF start scripts don't
  return the $OCF_NOT_RUNNING value when the monitor function is called and
  resources are down. I had a similar problem when I moved my config files
  from 2.0.2 to 2.0.3. BTW: in CVS 2.0.3 some of heartbeat's OCF scripts did
  not comply with this rule.

 And I think that this restriction is a problem.  I know that it's the
 LSB spec - but I doubt it's very often (if at all) followed.

All three of you were right!
I just trusted the ocf compliance of the delivered scripts.
The resources start just fine with my patched versions.
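
For readers who do not want to work through the db2 patch further down, this Python sketch restates the exit-code choices its new case statement makes, as I read the diff. The numeric values are the standard OCF return codes; the function and table names are invented for the illustration.

```python
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Exit code per action when the instance information cannot be gathered.
FALLBACK = {
    "start": OCF_ERR_GENERIC,
    "stop": OCF_SUCCESS,           # stopping something that is absent succeeds
    "status": OCF_NOT_RUNNING,
    "monitor": OCF_NOT_RUNNING,    # the rule discussed in this thread
    "validate-all": OCF_ERR_GENERIC,
}


def exit_code(action, info_ok, action_rc=OCF_SUCCESS):
    """Pick the exit code for `action`; `action_rc` is the underlying command's result."""
    if action not in FALLBACK:
        return OCF_ERR_UNIMPLEMENTED
    if not info_ok:
        return FALLBACK[action]
    if action == "status":
        return OCF_SUCCESS if action_rc == 0 else OCF_NOT_RUNNING
    if action == "validate-all":
        return OCF_SUCCESS
    return action_rc
```

The key case is monitor on a resource that is down: it reports OCF_NOT_RUNNING rather than a generic error.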

The problem with the stop order remains.
To me, it seems the stop starts as expected.
In parallel with the correctly ordered stop operation, the DC role is released,
and as soon as the election timeout pops, another round of stop operations is
started, this time with all operations at once.

It would definitely help, if the lrm would print the exact commands it 
executes and the return codes.

regards,

Joachim Banzhaf
--- db2.cvs	2005-11-09 18:31:51.0 +0100
+++ db2	2005-11-09 21:32:36.0 +0100
@@ -87,7 +87,6 @@
 END
 }
 
-
 #
 # methods: What methods/operations do we support?
 #
@@ -104,9 +103,9 @@
 	!
 }
 
-
-#	Gather up information about our db2 instance
-
+#
+# Gather up information about our db2 instance
+#
 db2info() {
 	instance=$1
 	db2admin=$instance
@@ -118,7 +117,7 @@
 	db2bin=$db2sql/bin
 	db2db2=$db2bin/db2
 
-	#	Let's make sure a few important things are there...
+	# Let's make sure a few important things are there...
 	if
 	  [ -d $db2sql -a  -d $db2bin -a -f $db2profile -a \
 		-x $db2profile -a -x $db2db2 ]
@@ -138,7 +137,7 @@
 }
 
 #
-#	Run the given command in the db2 admin environment...
+# Run the given command in the db2 admin environment...
 #
 runasdb2() {
 	if
@@ -151,7 +150,7 @@
 }
 
 #
-#	Run a command as the DB2 admin, and log the output
+# Run a command as the DB2 admin, and log the output
 #
 logasdb2() {
 	output=`runasdb2 $*`
@@ -166,7 +165,18 @@
 	return $rc
 }
 
-
+#
+# db2 returncodes 2 and 4 are just warnings
+#
+filterdb2rc() {
+  if 
+[ $1 == 2 -o $1 == 4 ]
+  then
+return 0
+  fi
+  return $1
+}
+  
 #
 # db2_start: Start the given db2 instance
 #
@@ -193,6 +203,7 @@
   for DB in `db2_dblist`
   do
 runasdb2 $db2db2 activate database $DB
+filterdb2rc $?
   done
 fi
 return $?
@@ -233,7 +244,6 @@
   return $rc
 }
 
-
 #
 # db2_status: is the given db2 instance running?
 #
@@ -243,6 +253,7 @@
   test $pscount -ge 5
 }
 
+
 our_db2_ps() {
   ps -u $db2admin | grep db2
 }
@@ -250,10 +261,9 @@
 
 db2_dblist() {
   runasdb2 $db2db2 list database directory	\
-  |	grep -i 'Database name.*=' | sed 's%.*= *%%'
+  | awk -F'=' '$1 ~ /lias/ { db = $2 } $1 ~ /o[ck]al/ { print db }'
 }
 
-
 #
 # db2_monitor: Can the given db2 instance do anything useful?
 #
@@ -287,9 +297,8 @@
 }
 
 #
-#	'main' starts here...
+# 'main' starts here...
 #
-
 if
   ( [ $# -ne 1 ] )
 then
@@ -329,45 +338,55 @@
   exit $OCF_ERR_PERM
 fi
 
-#
-#	Grab common db2 information...
-#
-if
-  db2info $instance 
-then
-  : DB2 info is OK!
-else
-  exit $OCF_ERR_GENERIC
-fi
-
-
 # What kind of method was invoked?
 case $1 in
 
-  start)	db2_start $instance
-		exit $?;;
-
-  stop)		db2_stop $instance
-		exit $?;;
+  start)if
+		  db2info $instance
+		then
+		  db2_start $instance
+		  exit $?
+		fi
+		exit $OCF_ERR_GENERIC
+		;;
 
+  stop) if
+		  db2info $instance
+		then
+		  db2_stop $instance
+		  exit $?
+		fi
+		exit $OCF_SUCCESS
+		;;
+		
   status)	if
-		  db2_status $instance
+		  db2info $instance >/dev/null 2>&1 && db2_status $instance
 		then
 		  echo DB2 UDB instance $instance is running
 		  exit $OCF_SUCCESS
-		else
-		  echo DB2 UDB instance $instance is stopped
-		  exit $OCF_NOT_RUNNING
 		fi
+		echo DB2 UDB instance $instance is stopped
+		exit $OCF_NOT_RUNNING
 		;;
 
-  monitor)	db2_monitor $instance
-		exit $?;;
+  monitor)	if 
+  		  db2info $instance
+		then
+		  db2_monitor $instance
+		  exit $?
+		fi
+		exit $OCF_NOT_RUNNING
+		;;
 
-  validate-all)	# OCF_RESKEY_instance has already checked within db2info(),
-		# just exit successfully here.
-		exit $OCF_SUCCESS;;
+  validate-all) if
+		  db2info $instance 
+	then
+		  exit $OCF_SUCCESS
+	fi
+	exit $OCF_ERR_GENERIC
+		;;
 
   *)		db2_methods
-		exit $OCF_ERR_UNIMPLEMENTED;;
+		exit $OCF_ERR_UNIMPLEMENTED
+		;;
 esac
--- drbddisk.orig	2005-11-09 21:36:36.0 +0100
+++ drbddisk	2005-11-09 17:04:08.0 +0100
@@ -33,8 +33,10 @@
 	done
 	;;
 stop)
-	# exec, so the exit code of drbdadm propagates
-	exec $DRBDADM secondary $RES
+	$DRBDADM secondary $RES
+if [ $? = 20 ]; then
+exit 20
+fi
 	;;
 

Re: [Linux-ha-dev] UUID bug?

2005-11-15 Thread Guochun Shi

David Lee wrote:


Heartbeat is failing.

I'm chasing a really weird problem (CVS/HEAD), and the finger of 
suspicion is pointing at the way we handle UUIDs.  I suspect that in 
at least one context we are passing the address of the UUID rather 
than the UUID itself.


Top-down:

1. On Solaris-8, I was getting error messages:
  /etc/opt/LXHAhb/ha.d/harc: ha_log: not found
   (This was then stopping IPaddr being called.)

   But on Solaris-9 and Linux (FC4) things are fine.

2. Noticing that the harc script and the OCF rely heavily on environment
   variables in the shell I then inserted into it a diagnostic:
  env | grep HA_ > /dev/console

   Ouch!

   Linux: Lots of good-looking strings.  Fine.

   S8: Lots of good-looking strings.  But also:
  HA_srcuuid=horrible binary-looking thing
  HA_dstuuid=horrible binary-looking thing

   S9: Very few HA_* strings at all.  (In particular HA_FUNCS was
   absent, which explains my S9 ha_log: not found failure sequence.)

So it looks as if HA_srcuuid and HA_dstuuid are being incorrectly set.

Might the cause be around heartbeat/heartbeat.c:770?

Let's assume this (or something like it) to be the case.  What then 
happens when these HA_* are setenvd in preparation for harc?




Is it S9 or S8 that has the problem? You first said it was S8, then
switched to S9 :)


Anyway, the problem is caused by heartbeat trying to run a script that is
chosen by the message type. Before running the script, every message field
is set in the environment, since each field is a (name, value) pair. In this
case, obviously, we should not set any binary value as an environment
variable.

I will commit a fix soon.
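
The idea of leaving binary fields out of the environment can be sketched in Python as follows. This is not the change committed to CVS; the function name and the printable-ASCII test are assumptions made for the illustration.

```python
def build_env(fields, prefix="HA_"):
    """Return an environment dict built from printable message fields only."""
    env = {}
    for name, value in fields.items():
        if isinstance(value, bytes):
            try:
                value = value.decode("ascii")
            except UnicodeDecodeError:
                continue  # binary payload such as a raw UUID: skip it
        if not value.isprintable():
            continue      # control characters would also corrupt the environment
        env[prefix + name] = value
    return env
```

With this filter a raw 16-byte srcuuid field never becomes HA_srcuuid, while ordinary string fields pass through unchanged.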

-Guochun







Re: [Linux-ha-dev] UUID bug?

2005-11-15 Thread Guochun Shi

fix is in CVS now. Let me know if that works or not

thanks
-Guochun
