On 5/17/2011 8:25 AM, Sascha Hagedorn wrote:
> Hi everyone,
...
> - Pulled the HA network cable
> - Put it back after a couple of seconds
>
> Result:
>
> - Node 2 is being restarted
> - Load average on Node 1 increases until the system becomes
>   unreachable
> - A massive amount of log messages is being produced by OCFS2 on
>   the surviving node (see below)
> - The DRBD partition is not accessible
> - Node 1 cannot be rebooted; only a hard reset brings it back to life

FWIW, I've seen this with heartbeat-2.1.4 on CentOS 5, too. Something
sends it into a tight loop that spews tons of log messages and uses
all the CPU cycles it can get.

RHEL-5 kernels have a tendency to time out sockets under high CPU load,
so that effectively kills the entire cluster.

Dima

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
