Here is one more thing I notice: I modified my haresources file to be this: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 filesystem killnfs Delay::3::0 nfs nfslock
It looks like the filesystem and the killnfs shell scripts get run twice during heartbeat take over. Is something wrong with my configuration? Has anyone else seen this? This is the default heartbeat rpm package that comes with CentOS. Thanks, Nikhil -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nikhil Kulkarni Sent: Tuesday, July 15, 2008 9:32 AM To: General Linux-HA mailing list Subject: RE: [Linux-HA] Wierd heartbeat problem. I'm running the following versions of heartbeat: [EMAIL PROTECTED] ~]# rpm -qa | grep heartbeat heartbeat-pils-2.1.3-3.el5.centos heartbeat-2.1.3-3.el5.centos heartbeat-stonith-2.1.3-3.el5.centos Also I see this in the logs regarding the file system being mounted twice: ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[11112]: 2008/07/11_09:59:31 INFO: Running start for /dev/drbd1 on //mnt/data Filesystem[11101]: 2008/07/11_09:59:31 INFO: Success INFO: Success ResourceManager[10758]: 2008/07/11_09:59:31 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC= 0 Then further down I see this: ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[11799]: 2008/07/11_10:06:26 INFO: Running start for /dev/drbd1 on //mnt/data mount: /dev/drbd1 already mounted or //mnt/data busy mount: according to mtab, /dev/drbd1 is mounted on /mnt/data Filesystem[11799]: 2008/07/11_10:06:26 ERROR: Couldn't mount filesystem /dev/drbd1 on //mnt/data Filesystem[11788]: 2008/07/11_10:06:26 ERROR: Generic error ERROR: Generic error ResourceManager[11535]: 2008/07/11_10:06:26 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC= 1 ResourceManager[11535]: 2008/07/11_10:06:26 ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem ResourceManager[11535]: 2008/07/11_10:06:26 CRIT: Giving up resources due to failure of Filesystem::/dev/drbd1:://mnt/data::e xt3 ResourceManager[11535]: 2008/07/11_10:06:26 info: Releasing resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock So this tells me that the file system is trying to get mounted twice and it fails the second time because it is already mounted the first time and heartbeat then releases all its resources. This is the entire log if it helps: heartbeat[7769]: 2008/07/11_09:59:15 info: Configuration validated. Starting heartbeat 2.1.3 heartbeat[10607]: 2008/07/11_09:59:15 info: heartbeat: version 2.1.3 heartbeat[10607]: 2008/07/11_09:59:15 info: Heartbeat generation: 1215538978 heartbeat[10607]: 2008/07/11_09:59:15 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0 heartbeat[10607]: 2008/07/11_09:59:15 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1 heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_TriggerHandler: Added signal manual handler heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_SignalHandler: Added signal handler for signal 17 heartbeat[10607]: 2008/07/11_09:59:15 info: Local status now set to: 'up' heartbeat[10607]: 2008/07/11_09:59:16 info: Link watchdog-client1:eth0 up. heartbeat[10607]: 2008/07/11_09:59:19 info: Link watchdog-client2:eth0 up. heartbeat[10607]: 2008/07/11_09:59:19 info: Status update for node watchdog-client2: status up heartbeat[10613]: 2008/07/11_09:59:19 debug: notify_world: setting SIGCHLD Handler to SIG_DFL harc[10613]: 2008/07/11_09:59:19 info: Running /etc/ha.d/rc.d/status status heartbeat[10607]: 2008/07/11_09:59:19 info: Comm_now_up(): updating status to active heartbeat[10607]: 2008/07/11_09:59:19 info: Local status now set to: 'active' heartbeat[10607]: 2008/07/11_09:59:19 debug: get_delnodelist: delnodelist= heartbeat[10607]: 2008/07/11_09:59:20 info: Status update for node watchdog-client2: status active heartbeat[10632]: 2008/07/11_09:59:20 debug: notify_world: setting SIGCHLD Handler to SIG_DFL harc[10632]: 2008/07/11_09:59:20 info: Running /etc/ha.d/rc.d/status status heartbeat[10607]: 2008/07/11_09:59:30 info: remote resource transition completed. heartbeat[10607]: 2008/07/11_09:59:30 info: remote resource transition completed. heartbeat[10607]: 2008/07/11_09:59:30 info: Initial resource acquisition complete (T_RESOURCES(us)) IPaddr[10686]: 2008/07/11_09:59:30 INFO: Resource is stopped heartbeat[10650]: 2008/07/11_09:59:30 info: Local Resource acquisition completed. heartbeat[10607]: 2008/07/11_09:59:30 debug: StartNextRemoteRscReq(): child count 1 heartbeat[10737]: 2008/07/11_09:59:30 debug: notify_world: setting SIGCHLD Handler to SIG_DFL harc[10737]: 2008/07/11_09:59:30 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp ip-request-resp[10737]: 2008/07/11_09:59:30 received ip-request-resp IPaddr::10.0.38.71/24/eth0 OK yes ResourceManager[10758]: 2008/07/11_09:59:30 info: Acquiring resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock IPaddr[10785]: 2008/07/11_09:59:30 INFO: Resource is stopped ResourceManager[10758]: 2008/07/11_09:59:30 info: Running /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start ResourceManager[10758]: 2008/07/11_09:59:30 debug: Starting /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start IPaddr[10883]: 2008/07/11_09:59:31 INFO: Using calculated netmask for 10.0.38.71: 255.255.255.0 IPaddr[10883]: 2008/07/11_09:59:31 DEBUG: Using calculated broadcast for 10.0.38.71: 10.0.38.255 IPaddr[10883]: 2008/07/11_09:59:31 INFO: eval ifconfig eth0:0 10.0.38.71 netmask 255.255.255.0 broadcast 10.0.38.255 IPaddr[10883]: 2008/07/11_09:59:31 DEBUG: Sending Gratuitous Arp for 10.0.38.71 on eth0:0 [eth0] IPaddr[10854]: 2008/07/11_09:59:31 INFO: Success INFO: Success ResourceManager[10758]: 2008/07/11_09:59:31 debug: /etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start done. RC=0 ResourceManager[10758]: 2008/07/11_09:59:31 info: Running /etc/ha.d/resource.d/drbddisk r0 start ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting /etc/ha.d/resource.d/drbddisk r0 start ResourceManager[10758]: 2008/07/11_09:59:31 debug: /etc/ha.d/resource.d/drbddisk r0 start done. RC=0 Filesystem[11031]: 2008/07/11_09:59:31 INFO: Resource is stopped ResourceManager[10758]: 2008/07/11_09:59:31 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[11112]: 2008/07/11_09:59:31 INFO: Running start for /dev/drbd1 on //mnt/data Filesystem[11101]: 2008/07/11_09:59:31 INFO: Success INFO: Success ResourceManager[10758]: 2008/07/11_09:59:31 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC= 0 ResourceManager[10758]: 2008/07/11_09:59:31 info: Running /etc/init.d/killnfs start ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting /etc/init.d/killnfs start kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill -l [sigspec] ResourceManager[10758]: 2008/07/11_09:59:32 debug: /etc/init.d/killnfs start done. RC=0 Delay[11221]: 2008/07/11_09:59:35 INFO: Delay is stopped Delay[11210]: 2008/07/11_09:59:35 INFO: Resource is stopped ResourceManager[10758]: 2008/07/11_09:59:35 info: Running /etc/ha.d/resource.d/Delay 3 0 start ResourceManager[10758]: 2008/07/11_09:59:35 debug: Starting /etc/ha.d/resource.d/Delay 3 0 start Delay[11268]: 2008/07/11_09:59:38 INFO: Success INFO: Success ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/ha.d/resource.d/Delay 3 0 start done. RC=0 ResourceManager[10758]: 2008/07/11_09:59:38 info: Running /etc/init.d/nfs start ResourceManager[10758]: 2008/07/11_09:59:38 debug: Starting /etc/init.d/nfs start Starting NFS services: [ OK ] Starting NFS quotas: [ OK ] Starting NFS daemon: [ OK ] Starting NFS mountd: [ OK ] ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/init.d/nfs start done. RC=0 ResourceManager[10758]: 2008/07/11_09:59:38 info: Running /etc/init.d/nfslock start ResourceManager[10758]: 2008/07/11_09:59:38 debug: Starting /etc/init.d/nfslock start Starting NFS statd: [ OK ] ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/init.d/nfslock start done. RC=0 heartbeat[10607]: 2008/07/11_10:06:26 info: Received shutdown notice from 'watchdog-client2'. heartbeat[10607]: 2008/07/11_10:06:26 info: Resources being acquired from watchdog-client2. heartbeat[10607]: 2008/07/11_10:06:26 debug: StartNextRemoteRscReq(): child count 1 heartbeat[11509]: 2008/07/11_10:06:26 info: acquire local HA resources (standby). ResourceManager[11535]: 2008/07/11_10:06:26 info: Acquiring resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock IPaddr[11585]: 2008/07/11_10:06:26 INFO: Running OK heartbeat[11510]: 2008/07/11_10:06:26 info: Local Resource acquisition completed. IPaddr[11586]: 2008/07/11_10:06:26 INFO: Running OK heartbeat[10607]: 2008/07/11_10:06:26 debug: StartNextRemoteRscReq(): child count 1 Filesystem[11718]: 2008/07/11_10:06:26 INFO: Resource is stopped ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start Filesystem[11799]: 2008/07/11_10:06:26 INFO: Running start for /dev/drbd1 on //mnt/data mount: /dev/drbd1 already mounted or //mnt/data busy mount: according to mtab, /dev/drbd1 is mounted on /mnt/data Filesystem[11799]: 2008/07/11_10:06:26 ERROR: Couldn't mount filesystem /dev/drbd1 on //mnt/data Filesystem[11788]: 2008/07/11_10:06:26 ERROR: Generic error ERROR: Generic error ResourceManager[11535]: 2008/07/11_10:06:26 debug: /etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC= 1 ResourceManager[11535]: 2008/07/11_10:06:26 ERROR: Return code 1 from /etc/ha.d/resource.d/Filesystem ResourceManager[11535]: 2008/07/11_10:06:26 CRIT: Giving up resources due to failure of Filesystem::/dev/drbd1:://mnt/data::e xt3 ResourceManager[11535]: 2008/07/11_10:06:26 info: Releasing resource group: watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/init.d/nfslock stop ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/init.d/nfslock stop Stopping NFS locking: [ OK ] Stopping NFS statd: [ OK ] ResourceManager[11535]: 2008/07/11_10:06:26 debug: /etc/init.d/nfslock stop done. RC=0 ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/init.d/nfs stop ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/init.d/nfs stop Shutting down NFS mountd: [ OK ] Shutting down NFS daemon: [ OK ] Shutting down NFS quotas: [ OK ] Shutting down NFS services: [ OK ] Thanks, Nikhil -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lars Marowsky-Bree Sent: Monday, July 14, 2008 11:51 PM To: General Linux-HA mailing list Subject: Re: [Linux-HA] Wierd heartbeat problem. On 2008-07-14T10:13:21, Nikhil Kulkarni <[EMAIL PROTECTED]> wrote: > This is my haresources file: > > watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 > Filesystem::/dev/drbd1:://mnt/data::ext3 kill nfs Delay::3::0 nfs > nfslock Why all the Delay resources? They don't help anything, it just slows your failover. > Now, when I power back client1, it never becomes the secondary and the > file system on client2 for some reason gets un-mounted and never gets > mounted again. drbd going to secondary is outside heartbeat's control; drbd (if used with drbddisk) is started before heartbeat, and heartbeat just moves the primary around. If you mean to say that client1 never automatically switches back the resources (from heartbeat's point of view), you did set auto_failback off, so that won't happen. > The logs on the client1 machine, tell me that it's trying to mount the > file system twice (as shown in bold below) and it fails the second time, > since the file system is already mounted the first time successfully. There is no bold. This is text/plain, not html. > Since, it fails the second time, the heartbeat resource just stops all > its other processes. Filesystem won't fail if started twice. (No RA is allowed to fail if started twice, that's a legitimate operation.) Which version are you running? That seems as if you're running an old and buggy version, and really really should upgrade. Regards, Lars -- Teamlead Kernel, SuSE Labs, Research and Development SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde _______________________________________________ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems