Hi,

On Fri, Jan 22, 2010 at 08:39:35AM +0100, Peter Luciak wrote:
> Hi,
>
> I'm running into weird problems on a Heartbeat v1 cluster: Heartbeat
> restarts itself with the message:
>
> heartbeat[2419]: 2010/01/22_06:30:35 WARN: Exiting HBREAD process 3272
> killed by signal 24 [SIGXCPU - CPU limit exceeded].
> heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Exiting HBREAD process 3272
> dumped core
> heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Core heartbeat process died!
> Restarting.

The read process CPU usage is limited to 10 percent. According to
ha.cf below, heartbeats are sent every 5 seconds, which is quite a
low rate.

> I've read that this could be due to debugging being turned on, however
> it continues even after I set debug 0. The heartbeat version is
> 2.1.2-2.fc8 (Fedora Core 8 x86_64) running on Dell PE 2950's. Yeah I
> know the version is old and buggy, but for a simple v1 config we've
> never run into any problems.

If you upgrade there will at least be better handling of failing
media handling processes. Also, your distribution seems to be quite
old. Heartbeat uses glib2 extensively, so it may also be that
there's a problem somewhere with the system libraries.

> The coredump doesn't provide any useful info:
>
> Core was generated by `heartbeat: read: seri'.
> Program terminated with signal 24, CPU time limit exceeded.
> #0 0x0000003b0b2c6e00 in ?? ()
>
> There are several "ttyS0: 1 input overrun(s)" messages on the active
> node (the Heartbeat restarts happened fortunately on the passive node).
> I've speculated it could be connected with the serial port
> communication, however tweaking the baud rate and keepalive interval
> didn't matter. The cable is completely crossed and was tested.
>
> cat /proc/tty/driver/serial
> serinfo:1.0 driver revision:
> 0: uart:16550A port:000003F8 irq:4 tx:1693465738 rx:1692970683 fe:6586
> brk:190 oe:114
> 1: uart:16550A port:000002F8 irq:3 tx:0 rx:0 CTS
> 2: uart:unknown port:000003E8 irq:4
> 3: uart:unknown port:000002E8 irq:3
>
> setserial /dev/ttyS0
> /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
>
> I turned off the serial line in ha.cf (interestingly I stopped seeing
> serial in /proc/interrupts afterwards) to see if that will help.

So, did it? Otherwise, you can try to enable debugging and see if
there's anything in the logs hinting at what went wrong.

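To put the error counters from the quoted /proc/tty/driver/serial
output in perspective, here is a minimal Python sketch that parses one
port line and relates a counter to the number of received bytes. The
function names parse_serial_line and errors_per_million are made up
for illustration, and it assumes the mail-wrapped line has been joined
back into a single line. It takes fe, brk and oe to be the framing
error, break and overrun counters.

```python
# Counters that appear as key:value integers in a port line.
INT_KEYS = ("irq", "tx", "rx", "fe", "pe", "brk", "oe")


def parse_serial_line(line):
    """Parse one port line from /proc/tty/driver/serial into a dict.

    Tokens of the form key:value are stored under their key; bare
    tokens such as CTS are collected in the "flags" list.
    """
    index, _, rest = line.strip().partition(":")
    fields = {"index": int(index), "flags": []}
    for token in rest.split():
        key, sep, value = token.partition(":")
        if not sep:
            fields["flags"].append(key)   # modem status flag, e.g. CTS
        elif key in INT_KEYS:
            fields[key] = int(value)      # numeric counter
        else:
            fields[key] = value           # e.g. uart type, port address
    return fields


def errors_per_million(fields, counter):
    """Return the given error counter per million received bytes."""
    rx = fields.get("rx", 0)
    if rx == 0:
        return 0.0                        # idle port, nothing to relate to
    return fields.get(counter, 0) * 1_000_000 / rx


# The ttyS0 line from the quoted output, joined into one line.
SAMPLE = ("0: uart:16550A port:000003F8 irq:4 tx:1693465738 "
          "rx:1692970683 fe:6586 brk:190 oe:114")

port = parse_serial_line(SAMPLE)
print("overruns per million rx bytes:", errors_per_million(port, "oe"))
print("framing errors per million rx bytes:", errors_per_million(port, "fe"))
```

For the quoted counters this comes out at roughly 0.07 overruns and
just under 4 framing errors per million received bytes. That does not
say whether the serial link is the cause, only how rare the errors are
relative to the traffic.
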
Thanks,

Dejan

> cat /proc/interrupts (active node)
>             CPU0       CPU1
>    0:        147          1  IO-APIC-edge     timer
>    1:          1          1  IO-APIC-edge     i8042
>    8:          1          0  IO-APIC-edge     rtc
>    9:          0          0  IO-APIC-fasteoi  acpi
>   12:          2          2  IO-APIC-edge     i8042
>   14:   18637949         57  IO-APIC-edge     libata
>   15:          0          0  IO-APIC-edge     libata
>   20:     233135         33  IO-APIC-fasteoi  uhci_hcd:usb3
>   21:      20703         13  IO-APIC-fasteoi  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
>   78:  279961508       5436  IO-APIC-fasteoi  megasas
> 2292:  711142718         80  PCI-MSI-edge     eth0
> 2293:          3          3  PCI-MSI-edge     eth2
> 2294:      25604  462664078  PCI-MSI-edge     eth1
>  NMI:          0          0
>  LOC: 3753255022   16088395
>  ERR:          0
>
> cat /proc/interrupts (passive node)
>             CPU0       CPU1
>    0:        148          1  IO-APIC-edge     timer
>    1:          1          1  IO-APIC-edge     i8042
>    8:          1          0  IO-APIC-edge     rtc
>    9:          0          0  IO-APIC-fasteoi  acpi
>   12:          2          2  IO-APIC-edge     i8042
>   14:  136694263         55  IO-APIC-edge     libata
>   15:          0          0  IO-APIC-edge     libata
>   20:         31    2148038  IO-APIC-fasteoi  uhci_hcd:usb3
>   21:     196116         11  IO-APIC-fasteoi  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
>   78:  251713976       5915  IO-APIC-fasteoi  megasas
> 2292: 1859784733        103  PCI-MSI-edge     eth0
> 2293:          2          2  PCI-MSI-edge     eth2
> 2294:      12833  475741619  PCI-MSI-edge     eth1
>  NMI:          0          0
>  LOC: 1892627058  940684479
>  ERR:          0
>
>
>
> /etc/ha.d/ha.cf:
> keepalive 5
> deadtime 20
> warntime 10
> initdead 40
> udpport 694
> bcast bond0 # Linux
> auto_failback off
> node mwcls1
> node mwcls2
> debug 0
> use_logd yes
> conn_logd_time 10
> compression bz2
>
>
> --
> Peter LUCIAK ([email protected])
> IBL Software Engineering, http://www.iblsoft.com/
> Mierová 103, 82105 Bratislava, Slovakia
> Phone: +421-2-32662111, Fax: +421-2-32662110
> Direct: +421-2-32662175
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
