Further reading indicates that heartbeat sets a CPU time limit on itself every so often.
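For anyone who wants to check whether their own heartbeat process carries such a limit, here is a minimal sketch that reads the "Max cpu time" row from /proc/<pid>/limits. It assumes Linux and the usual layout of that file; the function names parse_cpu_limit and read_cpu_limit are made up for illustration and are not part of heartbeat.

```python
def parse_cpu_limit(limits_text):
    """Return (soft, hard) in seconds from the 'Max cpu time' row.

    A value of None means 'unlimited'. Returns None if the row is absent.
    Assumes the column order Soft Limit, Hard Limit, Units.
    """
    label = "Max cpu time"
    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            soft, hard = fields[0], fields[1]

            def to_seconds(value):
                return None if value == "unlimited" else int(value)

            return to_seconds(soft), to_seconds(hard)
    return None


def read_cpu_limit(pid):
    """Read the CPU time limit of a running process (hypothetical helper)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_cpu_limit(f.read())
```

Calling read_cpu_limit with the heartbeat PID and getting a small soft value back, next to an unlimited hard value, would match what I saw: a soft limit that something set on the process after it started.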
Heartbeat then exceeds that limit (probably due to a bug). I am sure that is why whoever wrote heartbeat set a CPU limit instead of fixing their bugs. Then it dies with SIGXCPU, leaving everything in an extremely messy state, leading to split brain and destruction of shared resources (DRBD data).

I was trying to be a little patient. A little forgiving. I must say that my patience is rapidly running out. I absolutely cannot use this "solution" as the basis of a high-reliability cluster, because it is the opposite of reliability.

We have an old cluster that works very well with heartbeat V1. But it is getting old, the disks are wearing out, the fans are not getting newer, etc. I set up a new cluster in summer, but never fully trusted it, and it looks like I will not be able to trust it. We never completed a switchover.

At this point I feel rather desperate. Perhaps I should give "pacemaker" another go. I really have no idea and I am running out of options.

i

On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <ichu...@g.mail.com> wrote:

> A few weeks ago I reported that heartbeat died on one of the cluster machines,
> due to SIGXCPU.
>
> Well, it happened again. Heartbeat died, now both machines had the shared
> IP address up, what a god awful mess!!!
>
> Now they have split brain and the whole nine yards!
>
> I looked at /proc/<heartbeat_pid>/limits and found:
>
> Limit           Soft Limit   Hard Limit   Units
> Max cpu time    43           unlimited    seconds
>
> So, this process somehow has a limit set for it.
>
> Does anyone have ANY clue who would set a limit for this process??? WTF?
> Does it do it for itself or what?

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems