Hi All;

I've looked at the log files until I'm blind and I've no idea why a simple two node HA pair failed over.

The setup is straightforward; two boxes with external RAID storage setup for failover NFS services.

Saturday morning the backup server had a problem with the scsi bus, renegotiated to 40 MB/s, and remounted root in read only mode. It effectively went offline for a couple of minutes where ipfail and heartbeat couldn't talk to the primary. When it couldn't see the primary it decided it was to take over, which makes sense.

What doesn't make sense is that a couple of minutes later, when the scsi bit was sorted out and the two servers talked to each other again, is that they decided to failover to the backup.

There was never anything wrong with the primary server.

Any ideas what would have induced the failover?

Relevant bits:

Heartbeat v 2.0.7-2 installed via debian packages. Debian etch, updated a couple of weeks ago. 2.6.18-6-686 kernel.

haresources has only this resource group:

nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all Filesystem::/dev/drbd0::/nfs_exports::ext3 Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server start_dnotify.sh

ha.cfg:

logfacility     local0
keepalive 4
deadtime 30
bcast   eth0 eth1
auto_failback off
node nfs1
node nfs2
respawn hacluster /usr/lib/heartbeat/ipfail
ping 192.168.120.1


Logfile from nfs1 (primary server):

Oct 25 06:43:16 nfs1 heartbeat: [9320]: WARN: node nfs2: is dead
Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Dead node nfs2 gave up resources.
Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 dead.
Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 dead.
Oct 25 06:43:16 nfs1 ipfail: [9342]: info: Status update: Node nfs2 now has status dead
Oct 25 06:43:18 nfs1 ipfail: [9342]: info: NS: We are still alive!
Oct 25 06:43:18 nfs1 ipfail: [9342]: info: Link Status update: Link nfs2/eth0 now has status dead Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: For information on cluster partitions, See URL: http://linux-ha.org/SplitBrain Oct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Deadtime value may be too small. Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: See FAQ for information on tuning deadtime. Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: URL: http://linux-ha.org/FAQ#heavy_load
Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 up.
Oct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Late heartbeat: Node nfs2: interval 33030 ms Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Status update for node nfs2: status active
Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 up.
Oct 25 06:43:19 nfs1 harc[6029]: info: Running /etc/ha.d/rc.d/status status
Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Asking other side for ping node count. Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Checking remote count of ping nodes. Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Link Status update: Link nfs2/eth1 now has status dead Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Asking other side for ping node count. Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Checking remote count of ping nodes. Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Link nfs2/eth1 now has status up Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Status update: Node nfs2 now has status active Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Link nfs2/eth0 now has status up Oct 25 06:43:21 nfs1 heartbeat: [9320]: info: Heartbeat shutdown in progress. (9320)
Oct 25 06:43:21 nfs1 heartbeat: [6039]: info: Giving up all HA resources.
Oct 25 06:43:21 nfs1 ResourceManager[6049]: info: Releasing resource group: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all Filesystem::/dev/drbd0::/nfs_exports::ext3 Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server start_dnotify.sh Oct 25 06:43:21 nfs1 ResourceManager[6049]: info: Running /etc/ha.d/resource.d/start_dnotify.sh stop Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running /etc/init.d/nfs-kernel-server stop
Oct 25 06:43:22 nfs1 ipfail: [9342]: info: No giveup timer to abort.
Oct 25 06:43:22 nfs1 kernel: nfsd: last server has exited
Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 stop Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Running stop for /dev/drbd1 on /llnw_backups
Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Trying to unmount /llnw_backups
Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: unmounted /llnw_backups successfully
Oct 25 06:43:22 nfs1 Filesystem[6139]: INFO: Filesystem Success
Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 stop Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Running stop for /dev/drbd0 on /nfs_exports
Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Trying to unmount /nfs_exports
Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Some processes on /nfs_exports were signalled
Oct 25 06:43:23 nfs1 ipfail: [9342]: info: No giveup timer to abort.
Oct 25 06:43:25 nfs1 Filesystem[6329]: INFO: unmounted /nfs_exports successfully
Oct 25 06:43:25 nfs1 Filesystem[6265]: INFO: Filesystem Success
Oct 25 06:43:25 nfs1 ResourceManager[6049]: info: Running /etc/ha.d/resource.d/drbddisk all stop Oct 25 06:43:26 nfs1 kernel: drbd0: Primary/Secondary --> Secondary/Secondary Oct 25 06:43:26 nfs1 kernel: drbd1: Primary/Secondary --> Secondary/Secondary Oct 25 06:43:26 nfs1 ResourceManager[6049]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 stop Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/route -n del -host 192.168.120.230 Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/ifconfig eth0:0 192.168.120.230 down
Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: IP Address 192.168.120.230 released
Oct 25 06:43:27 nfs1 IPaddr[6443]: INFO: IPaddr Success
Oct 25 06:43:27 nfs1 heartbeat: [6039]: info: All HA resources relinquished.
Oct 25 06:43:28 nfs1 kernel: drbd0: Secondary/Secondary --> Secondary/Primary Oct 25 06:43:28 nfs1 kernel: drbd1: Secondary/Secondary --> Secondary/Primary Oct 25 06:43:28 nfs1 heartbeat: [9320]: info: killing /usr/lib/heartbeat/ipfail process group 9342 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9328 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9329 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBFIFO process 9323 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9324 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9325 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process 9326 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process 9327 with signal 15 Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9328 exited. 7 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9323 exited. 6 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9329 exited. 5 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9325 exited. 4 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9326 exited. 3 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9327 exited. 2 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9324 exited. 1 remaining Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: nfs1 Heartbeat shutdown complete.
Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Heartbeat restart triggered.
Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Restarting heartbeat.
Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Performing heartbeat restart exec.
Oct 25 06:43:48 nfs1 syslogd 1.4.1#18: restart.
Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Core dumps could be lost if multiple dumps occur Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Consider setting /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supportability Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Oct 25 06:44:01 nfs1 heartbeat: [9320]: info: **************************
[finishes restarting heartbeat]


===============================

Log file from nfs2:

Oct 25 06:42:00 nfs2 kernel: mptscsih: ioc0: attempting task abort! (sc=df8111c0)
Oct 25 06:42:00 nfs2 kernel: sd 0:0:1:0:
Oct 25 06:42:00 nfs2 kernel: command: Write(10): 2a 00 28 6c 02 8f 00 00 30 00
Oct 25 06:42:02 nfs2 kernel: mptbase: Initiating ioc0 recovery
Oct 25 06:42:08 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS (sc=df8111c0)
Oct 25 06:42:08 nfs2 kernel:  target0:0:0: Beginning Domain Validation
Oct 25 06:42:09 nfs2 kernel: target0:0:0: Domain Validation skipping write tests
Oct 25 06:42:09 nfs2 kernel:  target0:0:0: Ending Domain Validation
Oct 25 06:42:09 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25 ns, offset 127)
Oct 25 06:42:09 nfs2 kernel:  target0:0:1: Beginning Domain Validation
Oct 25 06:42:09 nfs2 kernel: target0:0:1: Domain Validation skipping write tests
Oct 25 06:42:09 nfs2 kernel:  target0:0:1: Ending Domain Validation
Oct 25 06:42:09 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25 ns, offset 127) Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: attempting task abort! (sc=eb523980) Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 31980 ms (> 50 ms) (GSource: 0x80fbbb8)
Oct 25 06:43:19 nfs2 kernel: sd 0:0:1:0:
Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 29200 ms (> 2010 ms) before being called (GSource: 0x80fbeb0) Oct 25 06:43:19 nfs2 kernel: command: Write(10): 2a 00 00 01 b9 17 00 00 28 00 Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: started at 1751846343 should have started at 1751843423
Oct 25 06:43:19 nfs2 kernel: mptbase: Initiating ioc0 recovery
Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS (sc=eb523980)
Oct 25 06:43:19 nfs2 kernel:  target0:0:0: Beginning Domain Validation
Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node nfs2: interval 33200 ms Oct 25 06:43:19 nfs2 kernel: target0:0:0: Domain Validation skipping write tests
Oct 25 06:43:19 nfs2 kernel:  target0:0:0: Ending Domain Validation
Oct 25 06:43:19 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25 ns, offset 127) Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node nfs1: interval 33980 ms Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node 192.168.120.1: interval 32980 ms Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 31670 ms (> 2010 ms) before being called (GSource: 0x80fc1b8) Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: started at 1751846344 should have started at 1751843177 Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: Dispatch function for update msgfree count was delayed 31990 ms (> 30000 ms) before being called (GSource: 0x80fc288) Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: started at 1751846344 should have started at 1751843145 Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch: Dispatch function for client audit was delayed 27510 ms (> 5000 ms) before being called (GSource: 0x80fc0e8) Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch: started at 1751846344 should have started at 1751843593
Oct 25 06:43:19 nfs2 kernel:  target0:0:1: Beginning Domain Validation
Oct 25 06:43:19 nfs2 kernel: target0:0:1: Domain Validation skipping write tests
Oct 25 06:43:19 nfs2 kernel:  target0:0:1: Ending Domain Validation
Oct 25 06:43:19 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25 ns, offset 127)
Oct 25 06:43:21 nfs2 ipfail: [22731]: info: Ping node count is balanced.
Oct 25 06:43:22 nfs2 ipfail: [22731]: info: Ping node count is balanced.
Oct 25 06:43:26 nfs2 kernel: drbd0: Secondary/Primary --> Secondary/Secondary Oct 25 06:43:26 nfs2 kernel: drbd1: Secondary/Primary --> Secondary/Secondary Oct 25 06:43:27 nfs2 heartbeat: [22707]: info: Received shutdown notice from 'nfs1'. Oct 25 06:43:27 nfs2 heartbeat: [22707]: info: Resources being acquired from nfs1. Oct 25 06:43:27 nfs2 heartbeat: [32715]: info: acquire all HA resources (standby). Oct 25 06:43:28 nfs2 heartbeat: [32716]: info: No local resources [/usr/lib/heartbeat/ResourceManager listkeys nfs2] to acquire. Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Acquiring resource group: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all Filesystem::/dev/drbd0::/nfs_exports::ext3 Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server start_dnotify.sh
Oct 25 06:43:28 nfs2 IPaddr[32759]: INFO: IPaddr Resource is stopped
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 start Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: eval /sbin/ifconfig eth0:0 192.168.120.230 netmask 255.255.255.0 broadcast 192.168.120.255 Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: Sending Gratuitous Arp for 192.168.120.230 on eth0:0 [eth0] Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: /usr/lib/heartbeat/send_arp -i 500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.120.230 eth0 192.168.120.230 auto 192.168.120.230 ffffffffffff
Oct 25 06:43:28 nfs2 IPaddr[409]: INFO: IPaddr Success
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/ha.d/resource.d/drbddisk all start Oct 25 06:43:28 nfs2 kernel: drbd0: Secondary/Secondary --> Primary/Secondary Oct 25 06:43:28 nfs2 kernel: drbd1: Secondary/Secondary --> Primary/Secondary Oct 25 06:43:28 nfs2 Filesystem[697]: INFO: Running status for /dev/drbd0 on /nfs_exports Oct 25 06:43:28 nfs2 Filesystem[697]: INFO: /nfs_exports is unmounted (stopped)
Oct 25 06:43:28 nfs2 Filesystem[633]: INFO: Filesystem Resource is stopped
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 start Oct 25 06:43:28 nfs2 Filesystem[812]: INFO: Running start for /dev/drbd0 on /nfs_exports
Oct 25 06:43:28 nfs2 kernel: kjournald starting.  Commit interval 5 seconds
Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd0, internal journal
Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 25 06:43:28 nfs2 Filesystem[743]: INFO: Filesystem Success
Oct 25 06:43:28 nfs2 Filesystem[921]: INFO: Running status for /dev/drbd1 on /llnw_backups Oct 25 06:43:28 nfs2 Filesystem[921]: INFO: /llnw_backups is unmounted (stopped)
Oct 25 06:43:28 nfs2 Filesystem[853]: INFO: Filesystem Resource is stopped
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 start Oct 25 06:43:28 nfs2 Filesystem[1030]: INFO: Running start for /dev/drbd1 on /llnw_backups
Oct 25 06:43:28 nfs2 kernel: kjournald starting.  Commit interval 5 seconds
Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd1, internal journal
Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Oct 25 06:43:28 nfs2 Filesystem[966]: INFO: Filesystem Success
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/init.d/nfs-kernel-server start Oct 25 06:43:28 nfs2 kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Oct 25 06:43:28 nfs2 kernel: NFSD: starting 90-second grace period
Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running /etc/ha.d/resource.d/start_dnotify.sh start Oct 25 06:43:32 nfs2 heartbeat: [32715]: info: all HA resource acquisition completed (standby). Oct 25 06:43:32 nfs2 heartbeat: [22707]: info: Standby resource acquisition done [all].
Oct 25 06:43:32 nfs2 harc[1324]: info: Running /etc/ha.d/rc.d/status status
Oct 25 06:43:32 nfs2 mach_down[1334]: info: Taking over resource group IPaddr::192.168.120.230/24/eth0 Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Acquiring resource group: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::all Filesystem::/dev/drbd0::/nfs_exports::ext3 Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-server start_dnotify.sh
Oct 25 06:43:32 nfs2 IPaddr[1378]: INFO: IPaddr Running OK
Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Running /etc/ha.d/resource.d/drbddisk all start Oct 25 06:43:32 nfs2 Filesystem[1601]: INFO: Running status for /dev/drbd0 on /nfs_exports Oct 25 06:43:32 nfs2 Filesystem[1601]: INFO: /nfs_exports is mounted (running)
Oct 25 06:43:32 nfs2 Filesystem[1537]: INFO: Filesystem Running OK
Oct 25 06:43:33 nfs2 Filesystem[1700]: INFO: Running status for /dev/drbd1 on /llnw_backups Oct 25 06:43:33 nfs2 Filesystem[1700]: INFO: /llnw_backups is mounted (running)
Oct 25 06:43:33 nfs2 Filesystem[1636]: INFO: Filesystem Running OK
Oct 25 06:43:33 nfs2 ResourceManager[1354]: info: Running /etc/ha.d/resource.d/start_dnotify.sh start Oct 25 06:43:33 nfs2 mach_down[1334]: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired Oct 25 06:43:33 nfs2 mach_down[1334]: info: mach_down takeover complete for node nfs1.
Oct 25 06:43:33 nfs2 heartbeat: [22707]: info: mach_down takeover complete.



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to