RE: [Linux-HA] Wierd heartbeat problem.

Nikhil Kulkarni Tue, 15 Jul 2008 10:59:39 -0700

Here is one more thing I notice:

I modified my haresources file to be this:
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0 filesystem
killnfs Delay::3::0 nfs nfslock


It looks like the filesystem and the killnfs shell scripts get run twice during 
heartbeat take over.
Is something wrong with my configuration? Has anyone else seen this? This is 
the default heartbeat rpm package that comes with CentOS.

Thanks,
Nikhil

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nikhil Kulkarni
Sent: Tuesday, July 15, 2008 9:32 AM
To: General Linux-HA mailing list
Subject: RE: [Linux-HA] Wierd heartbeat problem.

I'm running the following versions of heartbeat:
[EMAIL PROTECTED] ~]# rpm -qa | grep heartbeat
heartbeat-pils-2.1.3-3.el5.centos
heartbeat-2.1.3-3.el5.centos
heartbeat-stonith-2.1.3-3.el5.centos

Also I see this in the logs regarding the file system being mounted twice:
ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[11112]:      2008/07/11_09:59:31 INFO: Running start for /dev/drbd1 
on //mnt/data
Filesystem[11101]:      2008/07/11_09:59:31 INFO:  Success
INFO:  Success
ResourceManager[10758]: 2008/07/11_09:59:31 debug: 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=
0

Then further down I see this:
ResourceManager[11535]: 2008/07/11_10:06:26 info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[11799]:      2008/07/11_10:06:26 INFO: Running start for /dev/drbd1 
on //mnt/data
mount: /dev/drbd1 already mounted or //mnt/data busy
mount: according to mtab, /dev/drbd1 is mounted on /mnt/data
Filesystem[11799]:      2008/07/11_10:06:26 ERROR: Couldn't mount filesystem 
/dev/drbd1 on //mnt/data
Filesystem[11788]:      2008/07/11_10:06:26 ERROR:  Generic error
ERROR:  Generic error
ResourceManager[11535]: 2008/07/11_10:06:26 debug: 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=
1
ResourceManager[11535]: 2008/07/11_10:06:26 ERROR: Return code 1 from 
/etc/ha.d/resource.d/Filesystem
ResourceManager[11535]: 2008/07/11_10:06:26 CRIT: Giving up resources due to 
failure of Filesystem::/dev/drbd1:://mnt/data::e
xt3
ResourceManager[11535]: 2008/07/11_10:06:26 info: Releasing resource group: 
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd
isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock

So this tells me that the file system is trying to get mounted twice and it 
fails the second time because it is already mounted the first time and 
heartbeat then releases all its resources.

This is the entire log if it helps:
heartbeat[7769]: 2008/07/11_09:59:15 info: Configuration validated. Starting 
heartbeat 2.1.3
heartbeat[10607]: 2008/07/11_09:59:15 info: heartbeat: version 2.1.3
heartbeat[10607]: 2008/07/11_09:59:15 info: Heartbeat generation: 1215538978
heartbeat[10607]: 2008/07/11_09:59:15 info: glib: UDP Broadcast heartbeat 
started on port 694 (694) interface eth0
heartbeat[10607]: 2008/07/11_09:59:15 info: glib: UDP Broadcast heartbeat 
closed on port 694 interface eth0 - Status: 1
heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_TriggerHandler: Added 
signal manual handler
heartbeat[10607]: 2008/07/11_09:59:15 info: G_main_add_SignalHandler: Added 
signal handler for signal 17
heartbeat[10607]: 2008/07/11_09:59:15 info: Local status now set to: 'up'
heartbeat[10607]: 2008/07/11_09:59:16 info: Link watchdog-client1:eth0 up.
heartbeat[10607]: 2008/07/11_09:59:19 info: Link watchdog-client2:eth0 up.
heartbeat[10607]: 2008/07/11_09:59:19 info: Status update for node 
watchdog-client2: status up
heartbeat[10613]: 2008/07/11_09:59:19 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
harc[10613]:    2008/07/11_09:59:19 info: Running /etc/ha.d/rc.d/status status
heartbeat[10607]: 2008/07/11_09:59:19 info: Comm_now_up(): updating status to 
active
heartbeat[10607]: 2008/07/11_09:59:19 info: Local status now set to: 'active'
heartbeat[10607]: 2008/07/11_09:59:19 debug: get_delnodelist: delnodelist=
heartbeat[10607]: 2008/07/11_09:59:20 info: Status update for node 
watchdog-client2: status active
heartbeat[10632]: 2008/07/11_09:59:20 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
harc[10632]:    2008/07/11_09:59:20 info: Running /etc/ha.d/rc.d/status status
heartbeat[10607]: 2008/07/11_09:59:30 info: remote resource transition 
completed.
heartbeat[10607]: 2008/07/11_09:59:30 info: remote resource transition 
completed.
heartbeat[10607]: 2008/07/11_09:59:30 info: Initial resource acquisition 
complete (T_RESOURCES(us))
IPaddr[10686]:  2008/07/11_09:59:30 INFO:  Resource is stopped
heartbeat[10650]: 2008/07/11_09:59:30 info: Local Resource acquisition 
completed.
heartbeat[10607]: 2008/07/11_09:59:30 debug: StartNextRemoteRscReq(): child 
count 1
heartbeat[10737]: 2008/07/11_09:59:30 debug: notify_world: setting SIGCHLD 
Handler to SIG_DFL
harc[10737]:    2008/07/11_09:59:30 info: Running 
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[10737]: 2008/07/11_09:59:30 received ip-request-resp 
IPaddr::10.0.38.71/24/eth0 OK yes
ResourceManager[10758]: 2008/07/11_09:59:30 info: Acquiring resource group: 
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd
isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock
IPaddr[10785]:  2008/07/11_09:59:30 INFO:  Resource is stopped
ResourceManager[10758]: 2008/07/11_09:59:30 info: Running 
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start
ResourceManager[10758]: 2008/07/11_09:59:30 debug: Starting 
/etc/ha.d/resource.d/IPaddr 10.0.38.71/24/eth0 start
IPaddr[10883]:  2008/07/11_09:59:31 INFO: Using calculated netmask for 
10.0.38.71: 255.255.255.0
IPaddr[10883]:  2008/07/11_09:59:31 DEBUG: Using calculated broadcast for 
10.0.38.71: 10.0.38.255
IPaddr[10883]:  2008/07/11_09:59:31 INFO: eval ifconfig eth0:0 10.0.38.71 
netmask 255.255.255.0 broadcast 10.0.38.255
IPaddr[10883]:  2008/07/11_09:59:31 DEBUG: Sending Gratuitous Arp for 
10.0.38.71 on eth0:0 [eth0]
IPaddr[10854]:  2008/07/11_09:59:31 INFO:  Success
INFO:  Success
ResourceManager[10758]: 2008/07/11_09:59:31 debug: /etc/ha.d/resource.d/IPaddr 
10.0.38.71/24/eth0 start done. RC=0
ResourceManager[10758]: 2008/07/11_09:59:31 info: Running 
/etc/ha.d/resource.d/drbddisk r0 start
ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting 
/etc/ha.d/resource.d/drbddisk r0 start
ResourceManager[10758]: 2008/07/11_09:59:31 debug: 
/etc/ha.d/resource.d/drbddisk r0 start done. RC=0
Filesystem[11031]:      2008/07/11_09:59:31 INFO:  Resource is stopped
ResourceManager[10758]: 2008/07/11_09:59:31 info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[11112]:      2008/07/11_09:59:31 INFO: Running start for /dev/drbd1 
on //mnt/data
Filesystem[11101]:      2008/07/11_09:59:31 INFO:  Success
INFO:  Success
ResourceManager[10758]: 2008/07/11_09:59:31 debug: 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=
0
ResourceManager[10758]: 2008/07/11_09:59:31 info: Running /etc/init.d/killnfs  
start
ResourceManager[10758]: 2008/07/11_09:59:31 debug: Starting /etc/init.d/killnfs 
 start
kill: usage: kill [-s sigspec | -n signum | -sigspec] pid | jobspec ... or kill 
-l [sigspec]
ResourceManager[10758]: 2008/07/11_09:59:32 debug: /etc/init.d/killnfs  start 
done. RC=0
Delay[11221]:   2008/07/11_09:59:35 INFO: Delay is stopped
Delay[11210]:   2008/07/11_09:59:35 INFO:  Resource is stopped
ResourceManager[10758]: 2008/07/11_09:59:35 info: Running 
/etc/ha.d/resource.d/Delay 3 0 start
ResourceManager[10758]: 2008/07/11_09:59:35 debug: Starting 
/etc/ha.d/resource.d/Delay 3 0 start
Delay[11268]:   2008/07/11_09:59:38 INFO:  Success
INFO:  Success
ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/ha.d/resource.d/Delay 3 
0 start done. RC=0
ResourceManager[10758]: 2008/07/11_09:59:38 info: Running /etc/init.d/nfs  start
ResourceManager[10758]: 2008/07/11_09:59:38 debug: Starting /etc/init.d/nfs  
start
Starting NFS services:  [  OK  ]
Starting NFS quotas: [  OK  ]
Starting NFS daemon: [  OK  ]
Starting NFS mountd: [  OK  ]
ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/init.d/nfs  start done. 
RC=0
ResourceManager[10758]: 2008/07/11_09:59:38 info: Running /etc/init.d/nfslock  
start
ResourceManager[10758]: 2008/07/11_09:59:38 debug: Starting /etc/init.d/nfslock 
 start
Starting NFS statd: [  OK  ]
ResourceManager[10758]: 2008/07/11_09:59:38 debug: /etc/init.d/nfslock  start 
done. RC=0
heartbeat[10607]: 2008/07/11_10:06:26 info: Received shutdown notice from 
'watchdog-client2'.
heartbeat[10607]: 2008/07/11_10:06:26 info: Resources being acquired from 
watchdog-client2.
heartbeat[10607]: 2008/07/11_10:06:26 debug: StartNextRemoteRscReq(): child 
count 1
heartbeat[11509]: 2008/07/11_10:06:26 info: acquire local HA resources 
(standby).
ResourceManager[11535]: 2008/07/11_10:06:26 info: Acquiring resource group: 
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd
isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock
IPaddr[11585]:  2008/07/11_10:06:26 INFO:  Running OK
heartbeat[11510]: 2008/07/11_10:06:26 info: Local Resource acquisition 
completed.
IPaddr[11586]:  2008/07/11_10:06:26 INFO:  Running OK
heartbeat[10607]: 2008/07/11_10:06:26 debug: StartNextRemoteRscReq(): child 
count 1
Filesystem[11718]:      2008/07/11_10:06:26 INFO:  Resource is stopped
ResourceManager[11535]: 2008/07/11_10:06:26 info: Running 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start
Filesystem[11799]:      2008/07/11_10:06:26 INFO: Running start for /dev/drbd1 
on //mnt/data
mount: /dev/drbd1 already mounted or //mnt/data busy
mount: according to mtab, /dev/drbd1 is mounted on /mnt/data
Filesystem[11799]:      2008/07/11_10:06:26 ERROR: Couldn't mount filesystem 
/dev/drbd1 on //mnt/data
Filesystem[11788]:      2008/07/11_10:06:26 ERROR:  Generic error
ERROR:  Generic error
ResourceManager[11535]: 2008/07/11_10:06:26 debug: 
/etc/ha.d/resource.d/Filesystem /dev/drbd1 //mnt/data ext3 start done. RC=
1
ResourceManager[11535]: 2008/07/11_10:06:26 ERROR: Return code 1 from 
/etc/ha.d/resource.d/Filesystem
ResourceManager[11535]: 2008/07/11_10:06:26 CRIT: Giving up resources due to 
failure of Filesystem::/dev/drbd1:://mnt/data::e
xt3
ResourceManager[11535]: 2008/07/11_10:06:26 info: Releasing resource group: 
watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbdd
isk::r0 Filesystem::/dev/drbd1:://mnt/data::ext3 killnfs Delay::3::0 nfs nfslock
ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/init.d/nfslock  
stop
ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/init.d/nfslock 
 stop
Stopping NFS locking: [  OK  ]
Stopping NFS statd: [  OK  ]
ResourceManager[11535]: 2008/07/11_10:06:26 debug: /etc/init.d/nfslock  stop 
done. RC=0
ResourceManager[11535]: 2008/07/11_10:06:26 info: Running /etc/init.d/nfs  stop
ResourceManager[11535]: 2008/07/11_10:06:26 debug: Starting /etc/init.d/nfs  
stop
Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon: [  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]

Thanks,
Nikhil

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Lars 
Marowsky-Bree
Sent: Monday, July 14, 2008 11:51 PM
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Wierd heartbeat problem.

On 2008-07-14T10:13:21, Nikhil Kulkarni <[EMAIL PROTECTED]> wrote:

> This is my haresources file:
> 
> watchdog-client1 IPaddr::10.0.38.71/24/eth0 drbddisk::r0 Delay::3::0
> Filesystem::/dev/drbd1:://mnt/data::ext3 kill nfs Delay::3::0 nfs
> nfslock

Why all the Delay resources? They don't help anything, it just slows
your failover.

> Now, when I power back client1, it never becomes the secondary and the
> file system on client2 for some reason gets un-mounted and never gets
> mounted again.

drbd going to secondary is outside heartbeat's control; drbd (if used
with drbddisk) is started before heartbeat, and heartbeat just moves the
primary around.

If you mean to say that client1 never automatically switches back the
resources (from heartbeat's point of view), you did set auto_failback
off, so that won't happen.

> The logs on the client1 machine, tell me that it's trying to mount the
> file system twice (as shown in bold below) and it fails the second time,
> since the file system is already mounted the first time successfully.

There is no bold. This is text/plain, not html.

> Since, it fails the second time, the heartbeat resource just stops all
> its other processes.

Filesystem won't fail if started twice. (No RA is allowed to fail if
started twice, that's a legitimate operation.) Which version are you
running? That seems as if you're running an old and buggy version, and
really really should upgrade.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

RE: [Linux-HA] Wierd heartbeat problem.

Reply via email to