Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

Igor Chudov Tue, 04 Jan 2011 05:48:09 -0800

Further reading indicates that heartbeat sets a CPU time limit on itself
every so often.

Then it exceeds the limit (probably due to a bug). I am sure that's why
whoever wrote heartbeat set a CPU limit, instead of fixing their bugs.

Then it dies with SIGXCPU, leaving everything in an extremely messy state,
leading to split brain and destruction of shared resources (DRBD data).
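
For readers unfamiliar with the mechanism, here is a minimal sketch, assuming
Linux and Python 3, of how a process can lower its own soft CPU limit and then
be killed by SIGXCPU once it burns more CPU time than that. It illustrates the
general setrlimit behaviour only; it is not heartbeat's code, and the names
CHILD_SOURCE and run_cpu_limited_child are made up for the example.

```python
import resource
import signal
import subprocess
import sys

# Source of a throwaway child process (hypothetical, not heartbeat code).
# It lowers only its own soft CPU limit to 1 second, leaves the hard limit
# untouched, and then spins until the kernel delivers SIGXCPU.
CHILD_SOURCE = """
import resource
_, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (1, hard))
while True:
    pass
"""


def run_cpu_limited_child():
    """Run the child and return its exit status as seen by the parent."""
    result = subprocess.run([sys.executable, "-c", CHILD_SOURCE])
    # subprocess reports death by signal N as a return code of -N.
    return result.returncode


if __name__ == "__main__":
    status = run_cpu_limited_child()
    print("killed by SIGXCPU:", status == -signal.SIGXCPU)
```

The child has no handler for SIGXCPU, so the default action (terminate)
applies, which matches the way the daemon simply disappears.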

I was trying to be a little patient. A little forgiving. I must say that my
patience is rapidly running out.

I absolutely cannot use this "solution" as a basis of a high reliability
cluster, because it is the opposite of reliability.

We have an old cluster that works very well with heartbeat V1. But it is
getting old, the disks are wearing out, the fans are not getting newer, etc.
I set up a new cluster in the summer, but never fully trusted it, and it looks
like I will not be able to trust it. We never completed a switchover.

At this point I feel rather desperate. Perhaps I should give "pacemaker"
another go. I really have no idea and I am running out of options.

i

On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <ichu...@g.mail.com> wrote:

> A few weeks ago I reported that heartbeat died on one of the cluster
> machines, due to SIGXCPU.
>
> Well, it happened again. Heartbeat died, now both machines had the shared
> IP address up, what a god awful mess!!!
>
> Now they have split brain and the whole nine yards!
>
> I looked at /proc/<heartbeat_pid>/limits and found:
>
> Limit                     Soft Limit           Hard Limit           Units
>
> Max cpu time              43                   unlimited            seconds
>
>
> So, this process somehow has a limit set for it.
>
> Does anyone have ANY clue who would set a limit for this process??? WTF?
> Does it do it for itself or what?
>
>
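
The limits table quoted above can also be read from a script, which makes it
easier to watch the value over time. This is a rough sketch, assuming Linux
and Python 3 and assuming the columns in /proc/<pid>/limits are separated by
two or more spaces, as in the quoted output. The names SAMPLE, parse_limits
and read_limits are hypothetical.

```python
import re

# Sample text in the layout quoted above (columns padded with spaces).
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units\n"
    "Max cpu time              43                   unlimited            seconds\n"
)


def parse_limits(text):
    """Map each limit name to a (soft, hard, units) tuple of strings."""
    limits = {}
    for line in text.splitlines():
        # Split on runs of 2+ spaces so names like "Max cpu time" stay whole.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip blank lines and the header row
        name, soft, hard = fields[0], fields[1], fields[2]
        units = fields[3] if len(fields) > 3 else ""  # some rows have no units
        limits[name] = (soft, hard, units)
    return limits


def read_limits(pid):
    """Read and parse /proc/<pid>/limits for a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as handle:
        return parse_limits(handle.read())


if __name__ == "__main__":
    print(parse_limits(SAMPLE)["Max cpu time"])
```

Calling read_limits with the heartbeat PID and looking up "Max cpu time"
would show whether the soft limit changes between checks.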
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
