[Linux-HA] Strange failover

David Miller Tue, 28 Oct 2008 08:10:53 -0700

Hi All;

I've looked at the log files until I'm blind and I've no idea why asimple two node HA pair failed over.

The setup is straightforward; two boxes with external RAID storage setupfor failover NFS services.

Saturday morning the backup server had a problem with the scsi bus,renegotiated to 40 MB/s, and remounted root in read only mode. Iteffectively went offline for a couple of minutes where ipfail andheartbeat couldn't talk to the primary. When it couldn't see theprimary it decided it was to take over, which makes sense.

What doesn't make sense is that a couple of minutes later, when the scsibit was sorted out and the two servers talked to each other again, isthat they decided to failover to the backup.


There was never anything wrong with the primary server.

Any ideas what would have induced the failover?

Relevant bits:

Heartbeat v 2.0.7-2 installed via debian packages. Debian etch,updated a couple of weeks ago. 2.6.18-6-686 kernel.


haresources has only this resource group:

nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::allFilesystem::/dev/drbd0::/nfs_exports::ext3Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-serverstart_dnotify.sh


ha.cfg:

logfacility     local0
keepalive 4
deadtime 30
bcast   eth0 eth1
auto_failback off
node nfs1
node nfs2
respawn hacluster /usr/lib/heartbeat/ipfail
ping 192.168.120.1


Logfile from nfs1 (primary server):

Oct 25 06:43:16 nfs1 heartbeat: [9320]: WARN: node nfs2: is dead

Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Dead node nfs2 gave upresources.

Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 dead.
Oct 25 06:43:16 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 dead.

Oct 25 06:43:16 nfs1 ipfail: [9342]: info: Status update: Node nfs2 nowhas status dead

Oct 25 06:43:18 nfs1 ipfail: [9342]: info: NS: We are still alive!

Oct 25 06:43:18 nfs1 ipfail: [9342]: info: Link Status update: Linknfs2/eth0 now has status deadOct 25 06:43:19 nfs1 heartbeat: [9320]: info: For information on clusterpartitions, See URL: http://linux-ha.org/SplitBrainOct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Deadtime value may be toosmall.Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: See FAQ for information ontuning deadtime.Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: URL:http://linux-ha.org/FAQ#heavy_load

Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth1 up.

Oct 25 06:43:19 nfs1 heartbeat: [9320]: WARN: Late heartbeat: Node nfs2:interval 33030 msOct 25 06:43:19 nfs1 heartbeat: [9320]: info: Status update for nodenfs2: status active

Oct 25 06:43:19 nfs1 heartbeat: [9320]: info: Link nfs2:eth0 up.
Oct 25 06:43:19 nfs1 harc[6029]: info: Running /etc/ha.d/rc.d/status status

Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Asking other side for pingnode count.Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Checking remote count of pingnodes.Oct 25 06:43:19 nfs1 ipfail: [9342]: info: Link Status update: Linknfs2/eth1 now has status deadOct 25 06:43:21 nfs1 ipfail: [9342]: info: Asking other side for pingnode count.Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Checking remote count of pingnodes.Oct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Linknfs2/eth1 now has status upOct 25 06:43:21 nfs1 ipfail: [9342]: info: Status update: Node nfs2 nowhas status activeOct 25 06:43:21 nfs1 ipfail: [9342]: info: Link Status update: Linknfs2/eth0 now has status upOct 25 06:43:21 nfs1 heartbeat: [9320]: info: Heartbeat shutdown inprogress. (9320)

Oct 25 06:43:21 nfs1 heartbeat: [6039]: info: Giving up all HA resources.

Oct 25 06:43:21 nfs1 ResourceManager[6049]: info: Releasing resourcegroup: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::allFilesystem::/dev/drbd0::/nfs_exports::ext3Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-serverstart_dnotify.shOct 25 06:43:21 nfs1 ResourceManager[6049]: info: Running/etc/ha.d/resource.d/start_dnotify.sh stopOct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running/etc/init.d/nfs-kernel-server stop

Oct 25 06:43:22 nfs1 ipfail: [9342]: info: No giveup timer to abort.
Oct 25 06:43:22 nfs1 kernel: nfsd: last server has exited

Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running/etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 stopOct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Running stop for /dev/drbd1on /llnw_backups

Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: Trying to unmount /llnw_backups

Oct 25 06:43:22 nfs1 Filesystem[6203]: INFO: unmounted /llnw_backupssuccessfully

Oct 25 06:43:22 nfs1 Filesystem[6139]: INFO: Filesystem Success

Oct 25 06:43:22 nfs1 ResourceManager[6049]: info: Running/etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 stopOct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Running stop for /dev/drbd0on /nfs_exports

Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Trying to unmount /nfs_exports

Oct 25 06:43:22 nfs1 Filesystem[6329]: INFO: Some processes on/nfs_exports were signalled

Oct 25 06:43:23 nfs1 ipfail: [9342]: info: No giveup timer to abort.

Oct 25 06:43:25 nfs1 Filesystem[6329]: INFO: unmounted /nfs_exportssuccessfully

Oct 25 06:43:25 nfs1 Filesystem[6265]: INFO: Filesystem Success

Oct 25 06:43:25 nfs1 ResourceManager[6049]: info: Running/etc/ha.d/resource.d/drbddisk all stopOct 25 06:43:26 nfs1 kernel: drbd0: Primary/Secondary -->Secondary/SecondaryOct 25 06:43:26 nfs1 kernel: drbd1: Primary/Secondary -->Secondary/SecondaryOct 25 06:43:26 nfs1 ResourceManager[6049]: info: Running/etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 stopOct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/route -n del -host192.168.120.230Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: /sbin/ifconfig eth0:0192.168.120.230 down

Oct 25 06:43:27 nfs1 IPaddr[6525]: INFO: IP Address 192.168.120.230 released
Oct 25 06:43:27 nfs1 IPaddr[6443]: INFO: IPaddr Success
Oct 25 06:43:27 nfs1 heartbeat: [6039]: info: All HA resources relinquished.

Oct 25 06:43:28 nfs1 kernel: drbd0: Secondary/Secondary -->Secondary/PrimaryOct 25 06:43:28 nfs1 kernel: drbd1: Secondary/Secondary -->Secondary/PrimaryOct 25 06:43:28 nfs1 heartbeat: [9320]: info: killing/usr/lib/heartbeat/ipfail process group 9342 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process9328 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process9329 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBFIFO process9323 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process9324 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process9325 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBWRITE process9326 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: killing HBREAD process9327 with signal 15Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9328 exited.7 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9323 exited.6 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9329 exited.5 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9325 exited.4 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9326 exited.3 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9327 exited.2 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: Core process 9324 exited.1 remainingOct 25 06:43:30 nfs1 heartbeat: [9320]: info: nfs1 Heartbeat shutdowncomplete.

Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Heartbeat restart triggered.
Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Restarting heartbeat.

Oct 25 06:43:30 nfs1 heartbeat: [9320]: info: Performing heartbeatrestart exec.

Oct 25 06:43:48 nfs1 syslogd 1.4.1#18: restart.

Oct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Core dumps could be lostif multiple dumps occurOct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Consider setting/proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximumsupportabilityOct 25 06:44:01 nfs1 heartbeat: [9320]: WARN: Logging daemon is disabled--enabling logging daemon is recommended

Oct 25 06:44:01 nfs1 heartbeat: [9320]: info: **************************
[finishes restarting heartbeat]


===============================

Log file from nfs2:

Oct 25 06:42:00 nfs2 kernel: mptscsih: ioc0: attempting task abort!(sc=df8111c0)

Oct 25 06:42:00 nfs2 kernel: sd 0:0:1:0:

Oct 25 06:42:00 nfs2 kernel: command: Write(10): 2a 00 28 6c 028f 00 00 30 00

Oct 25 06:42:02 nfs2 kernel: mptbase: Initiating ioc0 recovery

Oct 25 06:42:08 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS(sc=df8111c0)

Oct 25 06:42:08 nfs2 kernel:  target0:0:0: Beginning Domain Validation

Oct 25 06:42:09 nfs2 kernel: target0:0:0: Domain Validation skippingwrite tests

Oct 25 06:42:09 nfs2 kernel:  target0:0:0: Ending Domain Validation

Oct 25 06:42:09 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25ns, offset 127)

Oct 25 06:42:09 nfs2 kernel:  target0:0:1: Beginning Domain Validation

Oct 25 06:42:09 nfs2 kernel: target0:0:1: Domain Validation skippingwrite tests

Oct 25 06:42:09 nfs2 kernel:  target0:0:1: Ending Domain Validation

Oct 25 06:42:09 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25ns, offset 127)Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: attempting task abort!(sc=eb523980)Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: G_CH_dispatch_int:Dispatch function for read child took too long to execute: 31980 ms (>50 ms) (GSource: 0x80fbbb8)

Oct 25 06:43:19 nfs2 kernel: sd 0:0:1:0:

Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch:Dispatch function for send local status was delayed 29200 ms (> 2010 ms)before being called (GSource: 0x80fbeb0)Oct 25 06:43:19 nfs2 kernel: command: Write(10): 2a 00 00 01 b917 00 00 28 00Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch:started at 1751846343 should have started at 1751843423

Oct 25 06:43:19 nfs2 kernel: mptbase: Initiating ioc0 recovery

Oct 25 06:43:19 nfs2 kernel: mptscsih: ioc0: task abort: SUCCESS(sc=eb523980)

Oct 25 06:43:19 nfs2 kernel:  target0:0:0: Beginning Domain Validation

Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Nodenfs2: interval 33200 msOct 25 06:43:19 nfs2 kernel: target0:0:0: Domain Validation skippingwrite tests

Oct 25 06:43:19 nfs2 kernel:  target0:0:0: Ending Domain Validation

Oct 25 06:43:19 nfs2 kernel: target0:0:0: FAST-40 SCSI 40.0 MB/s ST (25ns, offset 127)Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Nodenfs1: interval 33980 msOct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Late heartbeat: Node192.168.120.1: interval 32980 msOct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch:Dispatch function for check for signals was delayed 31670 ms (> 2010 ms)before being called (GSource: 0x80fc1b8)Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch:started at 1751846344 should have started at 1751843177Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch:Dispatch function for update msgfree count was delayed 31990 ms (> 30000ms) before being called (GSource: 0x80fc288)Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch:started at 1751846344 should have started at 1751843145Oct 25 06:43:19 nfs2 heartbeat: [22707]: WARN: Gmain_timeout_dispatch:Dispatch function for client audit was delayed 27510 ms (> 5000 ms)before being called (GSource: 0x80fc0e8)Oct 25 06:43:19 nfs2 heartbeat: [22707]: info: Gmain_timeout_dispatch:started at 1751846344 should have started at 1751843593

Oct 25 06:43:19 nfs2 kernel:  target0:0:1: Beginning Domain Validation

Oct 25 06:43:19 nfs2 kernel: target0:0:1: Domain Validation skippingwrite tests

Oct 25 06:43:19 nfs2 kernel:  target0:0:1: Ending Domain Validation

Oct 25 06:43:19 nfs2 kernel: target0:0:1: FAST-40 SCSI 40.0 MB/s ST (25ns, offset 127)

Oct 25 06:43:21 nfs2 ipfail: [22731]: info: Ping node count is balanced.
Oct 25 06:43:22 nfs2 ipfail: [22731]: info: Ping node count is balanced.

Oct 25 06:43:26 nfs2 kernel: drbd0: Secondary/Primary -->Secondary/SecondaryOct 25 06:43:26 nfs2 kernel: drbd1: Secondary/Primary -->Secondary/SecondaryOct 25 06:43:27 nfs2 heartbeat: [22707]: info: Received shutdown noticefrom 'nfs1'.Oct 25 06:43:27 nfs2 heartbeat: [22707]: info: Resources being acquiredfrom nfs1.Oct 25 06:43:27 nfs2 heartbeat: [32715]: info: acquire all HA resources(standby).Oct 25 06:43:28 nfs2 heartbeat: [32716]: info: No local resources[/usr/lib/heartbeat/ResourceManager listkeys nfs2] to acquire.Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Acquiring resourcegroup: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::allFilesystem::/dev/drbd0::/nfs_exports::ext3Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-serverstart_dnotify.sh

Oct 25 06:43:28 nfs2 IPaddr[32759]: INFO: IPaddr Resource is stopped

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/ha.d/resource.d/IPaddr 192.168.120.230/24/eth0 startOct 25 06:43:28 nfs2 IPaddr[491]: INFO: eval /sbin/ifconfig eth0:0192.168.120.230 netmask 255.255.255.0 broadcast 192.168.120.255Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: Sending Gratuitous Arp for192.168.120.230 on eth0:0 [eth0]Oct 25 06:43:28 nfs2 IPaddr[491]: INFO: /usr/lib/heartbeat/send_arp -i500 -r 10 -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.120.230eth0 192.168.120.230 auto 192.168.120.230 ffffffffffff

Oct 25 06:43:28 nfs2 IPaddr[409]: INFO: IPaddr Success

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/ha.d/resource.d/drbddisk all startOct 25 06:43:28 nfs2 kernel: drbd0: Secondary/Secondary -->Primary/SecondaryOct 25 06:43:28 nfs2 kernel: drbd1: Secondary/Secondary -->Primary/SecondaryOct 25 06:43:28 nfs2 Filesystem[697]: INFO: Running status for/dev/drbd0 on /nfs_exportsOct 25 06:43:28 nfs2 Filesystem[697]: INFO: /nfs_exports is unmounted(stopped)

Oct 25 06:43:28 nfs2 Filesystem[633]: INFO: Filesystem Resource is stopped

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/ha.d/resource.d/Filesystem /dev/drbd0 /nfs_exports ext3 startOct 25 06:43:28 nfs2 Filesystem[812]: INFO: Running start for /dev/drbd0on /nfs_exports

Oct 25 06:43:28 nfs2 kernel: kjournald starting.  Commit interval 5 seconds
Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd0, internal journal

Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordereddata mode.

Oct 25 06:43:28 nfs2 Filesystem[743]: INFO: Filesystem Success

Oct 25 06:43:28 nfs2 Filesystem[921]: INFO: Running status for/dev/drbd1 on /llnw_backupsOct 25 06:43:28 nfs2 Filesystem[921]: INFO: /llnw_backups is unmounted(stopped)

Oct 25 06:43:28 nfs2 Filesystem[853]: INFO: Filesystem Resource is stopped

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/ha.d/resource.d/Filesystem /dev/drbd1 /llnw_backups ext3 startOct 25 06:43:28 nfs2 Filesystem[1030]: INFO: Running start for/dev/drbd1 on /llnw_backups

Oct 25 06:43:28 nfs2 kernel: kjournald starting.  Commit interval 5 seconds
Oct 25 06:43:28 nfs2 kernel: EXT3 FS on drbd1, internal journal

Oct 25 06:43:28 nfs2 kernel: EXT3-fs: mounted filesystem with ordereddata mode.

Oct 25 06:43:28 nfs2 Filesystem[966]: INFO: Filesystem Success

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/init.d/nfs-kernel-server startOct 25 06:43:28 nfs2 kernel: NFSD: Using /var/lib/nfs/v4recovery as theNFSv4 state recovery directory

Oct 25 06:43:28 nfs2 kernel: NFSD: starting 90-second grace period

Oct 25 06:43:28 nfs2 ResourceManager[32735]: info: Running/etc/ha.d/resource.d/start_dnotify.sh startOct 25 06:43:32 nfs2 heartbeat: [32715]: info: all HA resourceacquisition completed (standby).Oct 25 06:43:32 nfs2 heartbeat: [22707]: info: Standby resourceacquisition done [all].

Oct 25 06:43:32 nfs2 harc[1324]: info: Running /etc/ha.d/rc.d/status status

Oct 25 06:43:32 nfs2 mach_down[1334]: info: Taking over resource groupIPaddr::192.168.120.230/24/eth0Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Acquiring resourcegroup: nfs1 IPaddr::192.168.120.230/24/eth0 drbddisk::allFilesystem::/dev/drbd0::/nfs_exports::ext3Filesystem::/dev/drbd1::/llnw_backups::ext3 nfs-kernel-serverstart_dnotify.sh

Oct 25 06:43:32 nfs2 IPaddr[1378]: INFO: IPaddr Running OK

Oct 25 06:43:32 nfs2 ResourceManager[1354]: info: Running/etc/ha.d/resource.d/drbddisk all startOct 25 06:43:32 nfs2 Filesystem[1601]: INFO: Running status for/dev/drbd0 on /nfs_exportsOct 25 06:43:32 nfs2 Filesystem[1601]: INFO: /nfs_exports is mounted(running)

Oct 25 06:43:32 nfs2 Filesystem[1537]: INFO: Filesystem Running OK

Oct 25 06:43:33 nfs2 Filesystem[1700]: INFO: Running status for/dev/drbd1 on /llnw_backupsOct 25 06:43:33 nfs2 Filesystem[1700]: INFO: /llnw_backups is mounted(running)

Oct 25 06:43:33 nfs2 Filesystem[1636]: INFO: Filesystem Running OK

Oct 25 06:43:33 nfs2 ResourceManager[1354]: info: Running/etc/ha.d/resource.d/start_dnotify.sh startOct 25 06:43:33 nfs2 mach_down[1334]: info:/usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquiredOct 25 06:43:33 nfs2 mach_down[1334]: info: mach_down takeover completefor node nfs1.

Oct 25 06:43:33 nfs2 heartbeat: [22707]: info: mach_down takeover complete.



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] Strange failover

Reply via email to