Hi,

On Fri, Jan 22, 2010 at 08:39:35AM +0100, Peter Luciak wrote:
> Hi,
> 
> I'm running into weird problems on a Heartbeat v1 cluster: Heartbeat 
> restarts itself with the message:
> 
> heartbeat[2419]: 2010/01/22_06:30:35 WARN: Exiting HBREAD process 3272 
> killed by signal 24 [SIGXCPU - CPU limit exceeded].
> heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Exiting HBREAD process 3272 
> dumped core
> heartbeat[2419]: 2010/01/22_06:30:35 ERROR: Core heartbeat process died! 
> Restarting.

The read process is limited to 10 percent CPU usage. According to the
ha.cf below, heartbeats are sent only every 5 seconds, which is quite a
low rate.
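
For reference, here is a minimal sketch of how a CPU-time resource limit
turns into signal 24 on Linux. It is not Heartbeat's code and does not
reproduce its 10 percent accounting; the function name and the one-second
limit are made up for illustration. A child process sets a soft RLIMIT_CPU
and spins until the kernel delivers SIGXCPU, which is what the HBREAD
process received.

```python
import os
import resource
import signal


def run_with_cpu_limit(seconds):
    """Fork a child that burns CPU under a soft RLIMIT_CPU.

    Returns the number of the signal that killed the child, or None
    if the child exited without being signalled.
    """
    pid = os.fork()
    if pid == 0:
        try:
            # Illustration only: suppress the core file SIGXCPU would leave.
            resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
            # Soft limit -> SIGXCPU; the hard limit one second later
            # would mean SIGKILL if SIGXCPU were caught or ignored.
            resource.setrlimit(resource.RLIMIT_CPU, (seconds, seconds + 1))
            while True:
                pass  # consume CPU time until the kernel intervenes
        finally:
            # Only reached if setrlimit failed; never fall back into
            # the parent's code path from the child.
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


if __name__ == "__main__":
    sig = run_with_cpu_limit(1)
    print(signal.Signals(sig).name if sig is not None else "exited normally")
```

The limit counts consumed CPU time, not wall-clock time, so a process that
is mostly idle should take a long time to reach it.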

> I've read that this could be due to debugging being turned on, however 
> it continues even after I set debug 0. The heartbeat version is 
> 2.1.2-2.fc8 (Fedora Core 8 x86_64) running on Dell PE 2950's. Yeah I 
> know the version is old and buggy, but for a simple v1 config we've 
> never run into any problems.

If you upgrade, you will at least get better handling of failing
media processes. Also, your distribution seems to be quite old.
Heartbeat uses glib2 extensively, so the problem may also lie
somewhere in the system libraries.

> The coredump doesn't provide any useful info:
> 
> Core was generated by `heartbeat: read: seri'.
> Program terminated with signal 24, CPU time limit exceeded.
> #0  0x0000003b0b2c6e00 in ?? ()
> 
> There are several "ttyS0: 1 input overrun(s)" messages on the active 
> node (the Heartbeat restarts happened fortunately on the passive node). 
> I've speculated it could be connected with the serial port 
> communication, however tweaking the baud rate and keepalive interval 
> didn't matter. The cable is completely crossed and was tested.
> 
> cat /proc/tty/driver/serial
> serinfo:1.0 driver revision:
> 0: uart:16550A port:000003F8 irq:4 tx:1693465738 rx:1692970683 fe:6586 
> brk:190 oe:114
> 1: uart:16550A port:000002F8 irq:3 tx:0 rx:0 CTS
> 2: uart:unknown port:000003E8 irq:4
> 3: uart:unknown port:000002E8 irq:3
> 
> setserial /dev/ttyS0
> /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
> 
> I turned off the serial line in ha.cf (interestingly I stopped seeing 
> serial in /proc/interrupts afterwards) to see if that will help.

So, did it?

Otherwise, you can try enabling debugging to see if there's
anything in the logs hinting at what went wrong.

Thanks,

Dejan

> cat /proc/interrupts (active node)
>             CPU0       CPU1
>    0:        147          1   IO-APIC-edge      timer
>    1:          1          1   IO-APIC-edge      i8042
>    8:          1          0   IO-APIC-edge      rtc
>    9:          0          0   IO-APIC-fasteoi   acpi
>   12:          2          2   IO-APIC-edge      i8042
>   14:   18637949         57   IO-APIC-edge      libata
>   15:          0          0   IO-APIC-edge      libata
>   20:     233135         33   IO-APIC-fasteoi   uhci_hcd:usb3
>   21:      20703         13   IO-APIC-fasteoi   ehci_hcd:usb1, 
> uhci_hcd:usb2, uhci_hcd:usb4
>   78:  279961508       5436   IO-APIC-fasteoi   megasas
> 2292:  711142718         80   PCI-MSI-edge      eth0
> 2293:          3          3   PCI-MSI-edge      eth2
> 2294:      25604  462664078   PCI-MSI-edge      eth1
> NMI:          0          0
> LOC: 3753255022   16088395
> ERR:          0
> 
> cat /proc/interrupts (passive node)
>             CPU0       CPU1
>    0:        148          1   IO-APIC-edge      timer
>    1:          1          1   IO-APIC-edge      i8042
>    8:          1          0   IO-APIC-edge      rtc
>    9:          0          0   IO-APIC-fasteoi   acpi
>   12:          2          2   IO-APIC-edge      i8042
>   14:  136694263         55   IO-APIC-edge      libata
>   15:          0          0   IO-APIC-edge      libata
>   20:         31    2148038   IO-APIC-fasteoi   uhci_hcd:usb3
>   21:     196116         11   IO-APIC-fasteoi   ehci_hcd:usb1, 
> uhci_hcd:usb2, uhci_hcd:usb4
>   78:  251713976       5915   IO-APIC-fasteoi   megasas
> 2292: 1859784733        103   PCI-MSI-edge      eth0
> 2293:          2          2   PCI-MSI-edge      eth2
> 2294:      12833  475741619   PCI-MSI-edge      eth1
> NMI:          0          0
> LOC: 1892627058  940684479
> ERR:          0
> 
> 
> 
> /etc/ha.d/ha.cf:
> keepalive 5
> deadtime 20
> warntime 10
> initdead 40
> udpport       694
> bcast bond0           # Linux
> auto_failback off
> node  mwcls1
> node  mwcls2
> debug 0
> use_logd yes
> conn_logd_time 10
> compression   bz2
> 
> 
> -- 
> Peter LUCIAK ([email protected])
> IBL Software Engineering, http://www.iblsoft.com/
> Mierová 103, 82105 Bratislava, Slovakia
> Phone: +421-2-32662111, Fax: +421-2-32662110
> Direct: +421-2-32662175
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems