Hi,
I'm running into weird problems on a Heartbeat v1 cluster: Heartbeat
restarts itself with the message:
heartbeat[2419]: 2010/01/22_06:30:35 WARN: Exiting HBREAD process 3272 killed by signal 24 [SIGXCPU - CPU limit exceeded].
heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Exiting HBREAD process 3272 dumped core
heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Core heartbeat process died! Restarting.
I've read that this can be caused by debugging being turned on; however,
it still happens after I set debug 0. The Heartbeat version is
2.1.2-2.fc8 (Fedora Core 8 x86_64), running on Dell PowerEdge 2950s.
Yes, I know the version is old and buggy, but with a simple v1 config
we have never run into any problems before.
The coredump doesn't provide any useful info:
Core was generated by `heartbeat: read: seri'.
Program terminated with signal 24, CPU time limit exceeded.
#0 0x0000003b0b2c6e00 in ?? ()
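Since signal 24 (SIGXCPU) is delivered when a process exceeds its CPU-time
resource limit (RLIMIT_CPU), it may help to check which limit is actually
in effect for the HBREAD process. The following is only a minimal Python
sketch, assuming the kernel exposes /proc/<pid>/limits (that file exists
only on newer 2.6 kernels and may be missing here). The function names are
made up for illustration and are not part of Heartbeat.

```python
def cpu_time_limit(limits_text):
    """Return the (soft, hard) 'Max cpu time' limits as strings, or None.

    limits_text is the content of /proc/<pid>/limits.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max cpu time"):
            # The remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len("Max cpu time"):].split()[:2]
            return soft, hard
    return None


def read_cpu_time_limit(pid):
    """Hypothetical helper: read the limits of a running process."""
    with open("/proc/%s/limits" % pid) as fh:
        return cpu_time_limit(fh.read())
```

Calling read_cpu_time_limit() with the PID of the currently running HBREAD
process would show whether the soft limit is a number of seconds or
"unlimited"; a numeric soft limit would at least be consistent with the
SIGXCPU in the log.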
There are several "ttyS0: 1 input overrun(s)" messages on the active
node (fortunately, the Heartbeat restarts happened on the passive node).
I suspected a connection with the serial port communication, but
changing the baud rate and the keepalive interval made no difference.
The cable is fully crossed and has been tested.
cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:1693465738 rx:1692970683 fe:6586 brk:190 oe:114
1: uart:16550A port:000002F8 irq:3 tx:0 rx:0 CTS
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
setserial /dev/ttyS0
/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
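To watch whether the error counters keep growing, the port lines of
/proc/tty/driver/serial can be parsed and compared over time. Below is a
rough Python sketch under the assumption that fe, pe and oe are the
framing, parity and overrun error counters and that each port is given on
a single (unwrapped) line; the function names are hypothetical.

```python
import re

# Fields that hold plain decimal numbers; uart and port stay strings.
NUMERIC_FIELDS = ("irq", "tx", "rx", "fe", "pe", "brk", "oe")


def parse_serial_line(line):
    """Parse one port line (not the 'serinfo' header) into a dict."""
    index, _, rest = line.partition(":")
    info = {"index": int(index)}
    # Bare flags such as CTS have no "key:value" form and are skipped.
    for key, value in re.findall(r"(\w+):(\S+)", rest):
        info[key] = int(value) if key in NUMERIC_FIELDS else value
    return info


def receive_errors(info):
    """Sum of framing, parity and overrun errors (missing ones count as 0)."""
    return sum(info.get(key, 0) for key in ("fe", "pe", "oe"))
```

For the ttyS0 line above this gives 6700 receive errors against roughly
1.69 billion received characters; sampling the counters twice, some time
apart, would show how quickly they grow.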
I have now turned off the serial line in ha.cf to see whether that helps
(interestingly, serial no longer shows up in /proc/interrupts afterwards).
cat /proc/interrupts (active node)
            CPU0        CPU1
   0:        147           1   IO-APIC-edge     timer
   1:          1           1   IO-APIC-edge     i8042
   8:          1           0   IO-APIC-edge     rtc
   9:          0           0   IO-APIC-fasteoi  acpi
  12:          2           2   IO-APIC-edge     i8042
  14:   18637949          57   IO-APIC-edge     libata
  15:          0           0   IO-APIC-edge     libata
  20:     233135          33   IO-APIC-fasteoi  uhci_hcd:usb3
  21:      20703          13   IO-APIC-fasteoi  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
  78:  279961508        5436   IO-APIC-fasteoi  megasas
2292:  711142718          80   PCI-MSI-edge     eth0
2293:          3           3   PCI-MSI-edge     eth2
2294:      25604   462664078   PCI-MSI-edge     eth1
 NMI:          0           0
 LOC: 3753255022    16088395
 ERR:          0
cat /proc/interrupts (passive node)
            CPU0        CPU1
   0:        148           1   IO-APIC-edge     timer
   1:          1           1   IO-APIC-edge     i8042
   8:          1           0   IO-APIC-edge     rtc
   9:          0           0   IO-APIC-fasteoi  acpi
  12:          2           2   IO-APIC-edge     i8042
  14:  136694263          55   IO-APIC-edge     libata
  15:          0           0   IO-APIC-edge     libata
  20:         31     2148038   IO-APIC-fasteoi  uhci_hcd:usb3
  21:     196116          11   IO-APIC-fasteoi  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
  78:  251713976        5915   IO-APIC-fasteoi  megasas
2292: 1859784733         103   PCI-MSI-edge     eth0
2293:          2           2   PCI-MSI-edge     eth2
2294:      12833   475741619   PCI-MSI-edge     eth1
 NMI:          0           0
 LOC: 1892627058   940684479
 ERR:          0
/etc/ha.d/ha.cf:
keepalive 5
deadtime 20
warntime 10
initdead 40
udpport 694
bcast bond0 # Linux
auto_failback off
node mwcls1
node mwcls2
debug 0
use_logd yes
conn_logd_time 10
compression bz2
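
As a quick sanity check of the timing directives above, the file can be
parsed and compared against a few common rules of thumb. This is only an
illustrative Python sketch: the rules (keepalive < warntime < deadtime,
initdead at least twice deadtime) are assumptions about sensible values,
the helper names are invented, and values are assumed to be plain seconds
without an "ms" suffix.

```python
def parse_ha_cf(text):
    """Collect ha.cf directives into a dict of name -> list of values."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue  # skip blank and comment-only lines
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(parts[0], []).append(value)
    return settings


def timing_warnings(settings):
    """Check timing directives against rules of thumb (assumptions)."""
    def seconds(name):
        values = settings.get(name)
        return float(values[0]) if values else None  # plain seconds only

    keepalive, warntime = seconds("keepalive"), seconds("warntime")
    deadtime, initdead = seconds("deadtime"), seconds("initdead")
    warnings = []
    if None not in (keepalive, warntime) and warntime <= keepalive:
        warnings.append("warntime should be larger than keepalive")
    if None not in (warntime, deadtime) and warntime >= deadtime:
        warnings.append("warntime should be smaller than deadtime")
    if None not in (deadtime, initdead) and initdead < 2 * deadtime:
        warnings.append("initdead should be at least twice deadtime")
    return warnings
```

Run against the configuration above, timing_warnings(parse_ha_cf(...))
returns an empty list, so under these assumed rules the timing values
themselves do not look like the cause.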
--
Peter LUCIAK ([email protected])
IBL Software Engineering, http://www.iblsoft.com/
Mierová 103, 82105 Bratislava, Slovakia
Phone: +421-2-32662111, Fax: +421-2-32662110
Direct: +421-2-32662175