Hi,

On Mon, Feb 18, 2008 at 07:55:16PM +0100, Andreas Mather1 wrote:
> Hello,
>
> I just wanted to let you know that my issues with the cluster are solved
> now.
>
> Here's what I did:
>
> *) raising debug to 1
Watch out for the logs getting too large. You should set debug back to 0
soon.

> *) put everything into one logfile instead of two (still using syslog)
> *) changing mcast entries to ucast entries in ha.cf

This probably helped. My guess is that multicast doesn't go well with
whatever network technology you have there, and I guess that there are
quite a few ;-)

> *) cleaning up my customized db2 and WAS_generic RAs
>    (they now return OCF_NOT_RUNNING on monitor operations instead of
>    OCF_ERR_INSTALLED on nodes which don't run the resources)

OK. IIRC, we never finished fixing the db2 resource agent.

> None of these changes sounds like it could have prevented the strange
> behaviour I had before, but something helped...
> The strange messages (late heartbeats, 'link down' when the link should
> still be up) in my logs have also vanished...

Great!

> Thanks for your time and hints!

Thanks,

Dejan

> Andreas
>
>
> IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> Sitz: Wien
> Firmenbuchgericht: Handelsgericht Wien, FN 80000y
>
> [EMAIL PROTECTED] wrote on 02/11/2008 07:25:46 PM:
>
> > Hi Andreas,
> >
> > On Sun, Feb 10, 2008 at 09:38:45PM +0100, Andreas Mather1 wrote:
> > > ***********************
> > > Warning: Your file, report_1.tar.gz, contains more than 32 files
> > > after decompression and cannot be scanned.
> > > ***********************
> > >
> > >
> > > Hi all,
> > >
> > > Please find attached a hb_report for a problem I experienced when
> > > implementing heartbeat.
> > >
> > > The environment:
> > > It's an asymmetric 4-node cluster, running heartbeat 2.1.3. All
> > > nodes share a couple of filesystems, all GPFS formatted. Services
> > > include WebSphere (modified RA), DB2 (modified RA), vsftpd (Xinetd),
> > > samba, nfs, MCS (self-written RA), and IHS, and are put in 4 groups
> > > (filesvc, mcs, was, db). Dejan is also familiar with the setup.
> > > OS: SLES 9.3 (x86_64)
> > > heartbeat: built via ./ConfigureMe package
> > >
> > >
> > > The Problem:
> > > In general, everything works fine (crm_standby works for every node,
> > > etc.), but when I simulate a power loss of one node (via IBM RSA)*,
> > > a cluster split occurs when this node rejoins. Suddenly, on every
> > > node, crm_mon shows the node it is running on as 'online' while
> > > reporting the other nodes as 'OFFLINE'. After 1-2 min. the cluster
> > > is fully operational again (all nodes found each other again), but
> > > it seems as if every resource gets restarted.
> > >
> > > Please let me know if I can provide further information.
> >
> > From the log on rbxw02:
> >
> > Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
> > Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> > Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> > Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
> > Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
> > Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
> > Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01
> > returning after partition.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
> > Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
> > Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01
> > returning after partition.
> > Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
> >
> > Strange timestamps. Which node went down? And when? Also,
> > rbxd02:eth0 was not reported as down, and rbxw01:eth0 and rbxd01:eth0
> > were not reported as up: probably at some point rbxw02:eth0 went
> > down. It would be interesting to see the logs from the other nodes.
> > Don't know why hb_report didn't pack them.
> >
> > Two extra nodes went DC around 19:13 for about two minutes, which
> > means that there were three partitions: w02,d02 and w01 and d01.
> > Note that none of them had quorum.
> >
> > Looks like a network problem, but an awkward one. Don't know how
> > it got disrupted this much. Perhaps you could try with unicast:
> > replace each mcast directive with four ucast directives.
> >
> > Cheers,
> >
> > Dejan
> >
> > > Thanks,
> > >
> > > Andreas
> > >
> > >
> > > * Sorry, I forgot to test what happens when I just stop and start
> > > heartbeat on that node - that would be useful too, I think... :(
> > >
> > > (See attached file: report_1.tar.gz)
> > >
> > > Mit freundlichen Grüßen / Best regards
> > >
> > > Andreas MATHER
> > > ESLT - Enterprise Services for Linux Technologies
> > >
> > > IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> > > Phone: +43-1-21145/4799
> > > Fax: +43-1-21145/8888
> > > e-mail: [EMAIL PROTECTED]
> > >
> > > IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> > > Sitz: Wien
> > > Firmenbuchgericht: Handelsgericht Wien, FN 80000y
> > >
> > > _______________________________________________________
> > > Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> > > http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> > > Home Page: http://linux-ha.org/
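
For reference, the mcast-to-ucast change suggested above could look
roughly like this in ha.cf; the interface names and peer addresses below
are placeholders, not the actual cluster's values:

```
# Before: one mcast directive per heartbeat interface
#   mcast eth0 225.0.0.1 694 1 0
#   mcast eth2 225.0.0.2 694 1 0

# After: one ucast directive per interface and peer node
# (placeholder addresses for the four nodes)
ucast eth0 192.168.1.11
ucast eth0 192.168.1.12
ucast eth0 192.168.1.13
ucast eth0 192.168.1.14
ucast eth2 192.168.2.11
ucast eth2 192.168.2.12
ucast eth2 192.168.2.13
ucast eth2 192.168.2.14
```

IIRC, heartbeat ignores a ucast directive pointing at the local node's
own address, so listing all four nodes on every node lets you keep ha.cf
identical across the cluster.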
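
As background on the RA fix discussed in this thread: under the OCF
conventions, a monitor operation on a node where the resource is simply
not active should exit with OCF_NOT_RUNNING (7), while OCF_ERR_INSTALLED
(5) tells the CRM that the resource can never run on that node at all. A
minimal sketch of such a monitor action follows; the daemon pidfile path
is made up for illustration, and this is not the actual db2 or
WAS_generic RA:

```shell
#!/bin/sh
# Minimal sketch of an OCF-style monitor action.
# Exit codes per the OCF resource agent conventions:
#   0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING, 5 = OCF_ERR_INSTALLED
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical pidfile; a real RA would derive this from its parameters.
PIDFILE=/var/run/mydaemon.pid

monitor() {
    # On a node where the resource is simply not running, report
    # OCF_NOT_RUNNING rather than OCF_ERR_INSTALLED: the latter makes
    # the CRM conclude the resource can never run on this node.
    [ -f "$PIDFILE" ] || return $OCF_NOT_RUNNING
    # Pidfile exists: check whether the recorded process is alive.
    kill -0 "$(cat "$PIDFILE")" 2>/dev/null && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

rc=0
monitor || rc=$?
# On a node without the pidfile this typically prints 7 (OCF_NOT_RUNNING).
echo "monitor exit code: $rc"
```

The key point is the asymmetric-cluster case: nodes that will never host
the resource still get probed, and a "hard" error code there keeps the
resource from being placed correctly.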