Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-18 Thread Igor Chudov
I have set up cron jobs on both servers. I restart heartbeat at 22 hours on one box and at 23 hours on another. It's been 4 days and so far, so good. I will report more results. This could be an ugly solution to an ugly problem, but workable.

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-10 Thread Dimitri Maziuk
Igor Chudov wrote: > My second question is, can heartbeat be configured to restart itself in case > of such a failure. Usually you can't have X restart itself after X dies. You need some kind of Y. If you're running snmpd, see if you can get "proc" to identify "heartbeat: master control proces
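
To illustrate the "you need some kind of Y" point, here is a minimal watchdog sketch in Python. It is not part of heartbeat and nobody in the thread posted it. The function names and the restart command passed in are hypothetical; the sketch only assumes a Linux-style /proc with one numeric directory per process.

```python
import os
import subprocess


def find_pids(needle, proc_root="/proc"):
    """Return the PIDs whose command line contains `needle`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # process exited or is not readable
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
        if needle in cmdline:
            pids.append(int(entry))
    return sorted(pids)


def check_and_restart(needle, restart_cmd, proc_root="/proc",
                      runner=subprocess.call):
    """Run `restart_cmd` if no process matches `needle`.

    Returns True when a restart was triggered, False otherwise.
    `restart_cmd` is whatever the local init system uses (an assumption
    the caller must supply; nothing here is heartbeat-specific).
    """
    if find_pids(needle, proc_root):
        return False
    runner(restart_cmd)
    return True
```

Run from cron or a monitoring agent, something like `check_and_restart("heartbeat", ["/etc/init.d/heartbeat", "restart"])` would play the role of Y. The init script path is an assumption about the local setup, and the watchdog's own command line must not contain the search string, or it will match itself.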

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-10 Thread Igor Chudov
On Tue, Jan 4, 2011 at 10:22 AM, Serge Dubrouski wrote: > On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov wrote: > > Serge, I am not sure of anything, but the self-communication is supposed > to > > be taking place on a single crossover cable between second network cards > of > > the servers. (eth1)

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Serge Dubrouski
On Tue, Jan 4, 2011 at 1:29 PM, Dimitri Maziuk wrote: > Igor Chudov wrote: > >> At this point I feel rather desperate. Perhaps I should give "pacemaker" >> another go. I really have no idea and I am running out of options. > > If all you need is a 2-node active-passive cluster, most (all?) > pacem

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Dimitri Maziuk
Igor Chudov wrote: > At this point I feel rather desperate. Perhaps I should give "pacemaker" > another go. I really have no idea and I am running out of options. If all you need is a 2-node active-passive cluster, most (all?) pacemaker features are useless for you. (Besides, one look at their

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Serge Dubrouski
On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov wrote: > Serge, I am not sure of anything, but the self-communication is supposed to > be taking place on a single crossover cable between second network cards of > the servers. (eth1). Agree, yet something strange and pretty unique is going on with you

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Igor Chudov
Serge, I am not sure of anything, but the self-communication is supposed to be taking place on a single crossover cable between second network cards of the servers. (eth1). Igor On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski wrote: > Are you sure that everything is all right with your network

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Serge Dubrouski
Are you sure that everything is all right with your network? It looks like processes that are responsible for UDP communications are taking too much CPU time. On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov wrote: > Steve, here's some data. > > The OS is Ubuntu 10.04. > > ~# apt-cache policy heart

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Igor Chudov
On Tue, Jan 4, 2011 at 9:40 AM, Serge Dubrouski wrote: > Which OS? > > Ubuntu 10.04 Lucid. > Which version of Heartbeat? > > 3.0.3 ~# apt-cache policy heartbeat heartbeat: Installed: 1:3.0.3-1ubuntu1 Candidate: 1:3.0.3-1ubuntu1 Version table: *** 1:3.0.3-1ubuntu1 0 - PID of which of H

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Igor Chudov
Steve, here's some data. The OS is Ubuntu 10.04. ~# apt-cache policy heartbeat heartbeat: Installed: 1:3.0.3-1ubuntu1 Candidate: 1:3.0.3-1ubuntu1 Version table: *** 1:3.0.3-1ubuntu1 0 500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages 100 /var/lib/dpkg/status

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Dejan Muhamedagic
Hi, On Tue, Jan 04, 2011 at 07:47:10AM -0600, Igor Chudov wrote: > Further reading indicates that heartbeat itself sets a limit for itself > every so often. True. > Then it exceeds the limit (probably due to a bug). I am sure that that's why > whoever wrote heartbeat, set cpu limit, instead of fo

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Serge Dubrouski
Which OS? Which version of Heartbeat? - PID of which of Heartbeat processes? It has several. On Tue, Jan 4, 2011 at 6:32 AM, Igor Chudov wrote: > A few weeks ago I reported that heartbeat died on one of the cluster machines, > due to SIGXCPU. > > Well, it happened again. Heartbeat died, now both

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Steve Davies
On 4 January 2011 13:47, Igor Chudov wrote: > Further reading indicates that heartbeat itself sets a limit for itself > every so often. > > Then it exceeds the limit (probably due to a bug). I am sure that that's why > whoever wrote heartbeat, set cpu limit, instead of fixing their bugs. > > Then i

Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Igor Chudov
Further reading indicates that heartbeat itself sets a limit for itself every so often. Then it exceeds the limit (probably due to a bug). I am sure that that's why whoever wrote heartbeat, set cpu limit, instead of fixing their bugs. Then it dies with SIGXCPU, leaving everything in an extremely m
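
The general mechanism described here (a process sets a CPU-time limit on itself and the kernel sends SIGXCPU once the soft limit is exceeded) can be reproduced with a small Python sketch. This is not heartbeat's code. The 1-second limit and the function name are assumptions chosen only for illustration.

```python
import signal
import subprocess
import sys

# Child program: lower its own soft CPU-time limit to 1 second (keeping
# the existing hard limit), then burn CPU until the kernel intervenes.
CHILD = """
import resource
_, hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (1, hard))
while True:
    pass
"""


def run_until_cpu_limit():
    """Run the child and return its exit status as seen by the parent.

    A negative return code -N means the child was killed by signal N.
    """
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=60)
    return proc.returncode


if __name__ == "__main__":
    status = run_until_cpu_limit()
    if status == -signal.SIGXCPU:
        print("child was killed by SIGXCPU")
    else:
        print("child exited with status", status)
```

The default action for SIGXCPU is to terminate the process, so a daemon that sets such a limit and does not handle the signal simply dies once it has used that much CPU time.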

[Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

2011-01-04 Thread Igor Chudov
A few weeks ago I reported that heartbeat died on one of the cluster machines, due to SIGXCPU. Well, it happened again. Heartbeat died, now both machines had the shared IP address up, what a god-awful mess!!! Now they have split brain and the whole nine yards! I looked at /proc//limits and found:
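
For anyone checking the same thing on their own nodes, a limits file under /proc can be read programmatically. The following is a minimal Python sketch, assuming the usual Linux layout of that file (a header row, then columns separated by runs of spaces). The function names are hypothetical, and the values the poster actually found are not reproduced here.

```python
import re


def parse_limits(text):
    """Parse the text of a /proc limits file.

    Returns {limit name: (soft, hard, units)}; units is "" when the
    kernel prints none for that row.
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # blank or unexpected line
        name, soft, hard = fields[:3]
        units = fields[3] if len(fields) > 3 else ""
        limits[name] = (soft, hard, units)
    return limits


def read_limits(pid="self"):
    """Read and parse the limits of `pid` (default: the calling process)."""
    with open("/proc/%s/limits" % pid) as f:
        return parse_limits(f.read())
```

On a Linux box, `read_limits(pid)["Max cpu time"]` returns the soft and hard CPU-time limits of that process, which is the pair that matters for SIGXCPU.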