Hi,

On Mon, Feb 18, 2008 at 07:55:16PM +0100, Andreas Mather1 wrote:
> Hello,
> 
> I just wanted to let you know that my issues with the cluster are solved
> now.
> 
> Here's what I did:
> 
> *) raising debug to 1

Watch out for the logs getting too large. You should put debug back
to 0 soon.
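
For reference, that's just the debug directive in ha.cf (a sketch of
the relevant line only):

    # /etc/ha.d/ha.cf
    # was "debug 1" while troubleshooting; 0 for normal operation
    debug 0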

> *) put everything into one logfile instead of two (still using syslog)
> *) changing mcast entries to ucast entries in ha.cf

This probably helped. My guess is that multicast doesn't play well
with whatever network technologies you have there, and I suspect
there are quite a few of them ;-)
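
For reference, the ha.cf change looks roughly like this (interface name
and addresses below are made up, just to show the syntax):

    # before: one mcast line per interface
    #mcast eth0 239.0.0.1 694 1 0
    # after: one ucast line per interface and node
    ucast eth0 10.0.0.1
    ucast eth0 10.0.0.2
    ucast eth0 10.0.0.3
    ucast eth0 10.0.0.4

IIRC heartbeat skips the ucast entry pointing to the node itself, so
the same ha.cf can be used on all four nodes.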

> *) cleaning up my customized db2 and WAS_generic RAs
>    (they now return OCF_NOT_RUNNING from the monitor operation instead
> of OCF_ERR_INSTALLED on nodes which don't run the resources)

OK. IIRC, we never finished fixing the db2 resource agent.
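
Just for the record, the monitor op on a node which doesn't run the
resource should do something like this (shell sketch; db2_is_running
stands for whatever probe the agent actually uses):

    db2_monitor() {
        if ! db2_is_running; then
            # not running on this node: plain "not running", not a hard error
            return $OCF_NOT_RUNNING   # 7
        fi
        return $OCF_SUCCESS           # 0
    }

OCF_ERR_INSTALLED should be reserved for the case where the software
really isn't installed; IIRC the CRM treats it as a hard error for
that node.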

> None of these changes sounds like it could have prevented the strange
> behaviour I had before, but something helped...
> The strange messages (late heartbeats, 'link down' when the link should
> still be up) in my logs have also vanished...

Great!

> Thanks for your time and hints!

Thanks,

Dejan

> Andreas
> 
> 
> IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> Sitz: Wien
> Firmenbuchgericht: Handelsgericht Wien, FN 80000y
> 
> [EMAIL PROTECTED] wrote on 02/11/2008 07:25:46 PM:
> 
> > Hi Andreas,
> >
> > On Sun, Feb 10, 2008 at 09:38:45PM +0100, Andreas Mather1 wrote:
> > >
> > > Hi all,
> > >
> > > Please find attached a hb_report for a problem I experienced when
> > > implementing heartbeat.
> > >
> > > The environment:
> > > It's an asymmetric 4-node cluster running heartbeat 2.1.3. All nodes
> > > share a couple of filesystems, all GPFS formatted. Services include
> > > WebSphere (modified RA), DB2 (modified RA), vsftpd (Xinetd), samba,
> > > nfs, MCS (self-written RA) and IHS, and are put into 4 groups
> > > (filesvc, mcs, was, db). Dejan is also familiar with the setup.
> > > OS: SLES 9.3 (x86_64)
> > > heartbeat: built via ./ConfigureMe package
> > >
> > >
> > > The Problem:
> > > In general, everything works fine (crm_standby works for every node,
> > > etc.), but when I simulate a power loss of one node (via IBM RSA)*, a
> > > cluster split occurs when this node rejoins. Suddenly, on every node,
> > > crm_mon shows the node it is running on as 'online' while reporting
> > > the other nodes as 'OFFLINE'. After 1-2 min. the cluster is fully
> > > operational again (all nodes have found each other again), but it
> > > seems as if every resource gets restarted.
> > >
> > > Please let me know, if I can provide further information.
> >
> > From the log on rbxw02:
> >
> > Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
> > Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> > Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> > Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
> > Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
> > Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
> > Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01
> > returning after partition.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
> > Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01
> > returning after partition.
> > Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
> >
> > Strange timestamps. Which node went down? And when? Also,
> > rbxd02:eth0 was not reported as down, and rbxw01:eth0 and
> > rbxd01:eth0 not as up: probably at some point rbxw02:eth0 went
> > down. It would be interesting to see logs from the other nodes.
> > Don't know why hb_report didn't pack them.
> >
> > Two extra nodes became DC around 19:13 for about two minutes, which
> > means that there were three partitions: w02+d02, w01, and d01.
> > Note that none of them had quorum.
> >
> > Looks like a network problem, but an awkward one. Don't know how
> > it got disrupted this much. Perhaps you could try with unicast:
> > replace each mcast directive with four ucast directives.
> >
> > Cheers,
> >
> > Dejan
> >
> > > Thanks,
> > >
> > > Andreas
> > >
> > >
> > > * Sorry, I forgot to test what happens, when I just stop and start
> > > heartbeat on that node - would be useful too, I think... :(
> > >
> > >
> > >
> > >
> > > (See attached file: report_1.tar.gz)
> > >
> > > Mit freundlichen Grüßen / Best regards
> > >
> > > Andreas MATHER
> > > ESLT - Enterprise Services for Linux Technologies
> > >
> > > IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> > > Phone : +43-1-21145/4799
> > > Fax: +43-1-21145/8888
> > > e-mail: [EMAIL PROTECTED]
> > >
> > > IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> > > Sitz: Wien
> > > Firmenbuchgericht: Handelsgericht Wien, FN 80000y
> >
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
