Our reboots turned out to be a timing issue. We have now set the o2cb timeouts to 31, 30000, 2000 and 2000, and since then there have been no more reboots. We also fixed our FC wiring (replaced 3 cables, and it looks like we have 1 bad port on a switch) to get rid of the SCSI errors, and with that multipath no longer disables paths.
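For anyone reading this in the archive: the four numbers above map onto the o2cb cluster timeouts. A minimal sketch follows, assuming the variable names used in ocfs2-tools 1.2.5's /etc/sysconfig/o2cb (check your version's init script for the exact names before copying):

```shell
#!/bin/sh
# Sketch of /etc/sysconfig/o2cb values matching "31, 30000, 2000 and 2000".
# Variable names are assumptions based on ocfs2-tools 1.2.5; verify locally.
O2CB_HEARTBEAT_THRESHOLD=31     # missed disk-heartbeat iterations before self-fence
O2CB_IDLE_TIMEOUT_MS=30000      # network idle timeout
O2CB_KEEPALIVE_DELAY_MS=2000    # network keepalive delay
O2CB_RECONNECT_DELAY_MS=2000    # network reconnect delay

# The disk heartbeat fires every 2000 ms, so the effective write timeout is
# (threshold - 1) * 2000 ms. With 31 that is 60000 ms; the default of 7
# gives the 12000 ms seen in the fence message quoted further down.
echo $(( (O2CB_HEARTBEAT_THRESHOLD - 1) * 2000 ))
```

The higher threshold simply gives a slow SAN path more time to complete a heartbeat write before the node fences itself.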
> -----Original Message-----
> From: Hagmann, Michael [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 30, 2007 23:47
> To: ocfs2-users@oss.oracle.com
> Cc: Ulf Zimmermann; Sunil Mushran
> Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
>
> Hi
>
> I have the same situation here: one node out of 4 rebooted with this
> error:
>
> Aug 30 17:27:47 lilr206c (12,2):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-26 after 120000 milliseconds
> Aug 30 17:27:47 lilr206c Heartbeat thread (12) printing last 24 blocking operations (cur = 13):
> Aug 30 17:27:47 lilr206c Heartbeat thread stuck at waiting for write completion, stuffing current time into that blocker (index 13)
> Aug 30 17:27:47 lilr206c Index 14: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 15: took 1105 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c Index 16: took 18 ms to do msleep
> Aug 30 17:27:47 lilr206c Index 17: took 0 ms to do allocating bios for read
> Aug 30 17:27:47 lilr206c Index 18: took 0 ms to do bio alloc read
> Aug 30 17:27:47 lilr206c Index 19: took 0 ms to do bio add page read
> Aug 30 17:27:47 lilr206c Index 20: took 0 ms to do submit_bio for read
> Aug 30 17:27:47 lilr206c Index 21: took 401 ms to do waiting for read completion
> Aug 30 17:27:47 lilr206c Index 22: took 0 ms to do bio alloc write
> Aug 30 17:27:47 lilr206c Index 23: took 0 ms to do bio add page write
> Aug 30 17:27:47 lilr206c Index 0: took 0 ms to do submit_bio for write
> Aug 30 17:27:47 lilr206c Index 1: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 2: took 276 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c Index 3: took 1322 ms to do msleep
> Aug 30 17:27:47 lilr206c Index 4: took 0 ms to do allocating bios for read
> Aug 30 17:27:47 lilr206c Index 5: took 0 ms to do bio alloc read
> Aug 30 17:27:47 lilr206c Index 6: took 0 ms to do bio add page read
> Aug 30 17:27:47 lilr206c Index 7: took 0 ms to do submit_bio for read
> Aug 30 17:27:47 lilr206c Index 8: took 85285 ms to do waiting for read completion
> Aug 30 17:27:47 lilr206c Index 9: took 0 ms to do bio alloc write
> Aug 30 17:27:47 lilr206c Index 10: took 0 ms to do bio add page write
> Aug 30 17:27:47 lilr206c Index 11: took 0 ms to do submit_bio for write
> Aug 30 17:27:47 lilr206c Index 12: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 13: took 33389 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c *** ocfs2 is very sorry to be fencing this system by restarting ***
>
> Did you find any reason for your reboot? Please let me know.
>
> We have a 4-node HP DL585 G2 cluster here (with multipath
> device-mapper, 2 SAN cards per server, EMC CX3-20 storage) running RHEL4
> U5 and ocfs2-2.6.9-55.0.2.ELsmp-1.2.5-2.
>
> thx mike
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Ulf Zimmermann
> Sent: Monday, 13 August 2007 17:47
> To: Ulf Zimmermann; Sunil Mushran
> Cc: ocfs2-users@oss.oracle.com
> Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
>
> One node of our 4-node cluster rebooted last night:
>
> (11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-1 after 12000 milliseconds
> Heartbeat thread (11) printing last 24 blocking operations (cur = 22):
> Heartbeat thread stuck at waiting for write completion, stuffing current time into that blocker (index 22)
> Index 23: took 0 ms to do checking slots
> Index 0: took 1 ms to do waiting for write completion
> Index 1: took 1997 ms to do msleep
> Index 2: took 0 ms to do allocating bios for read
> Index 3: took 0 ms to do bio alloc read
> Index 4: took 0 ms to do bio add page read
> Index 5: took 0 ms to do submit_bio for read
> Index 6: took 8 ms to do waiting for read completion
> Index 7: took 0 ms to do bio alloc write
> Index 8: took 0 ms to do bio add page write
> Index 9: took 0 ms to do submit_bio for write
> Index 10: took 0 ms to do checking slots
> Index 11: took 0 ms to do waiting for write completion
> Index 12: took 1992 ms to do msleep
> Index 13: took 0 ms to do allocating bios for read
> Index 14: took 0 ms to do bio alloc read
> Index 15: took 0 ms to do bio add page read
> Index 16: took 0 ms to do submit_bio for read
> Index 17: took 7 ms to do waiting for read completion
> Index 18: took 0 ms to do bio alloc write
> Index 19: took 0 ms to do bio add page write
> Index 20: took 0 ms to do submit_bio for write
> Index 21: took 0 ms to do checking slots
> Index 22: took 10003 ms to do waiting for write completion
> *** ocfs2 is very sorry to be fencing this system by restarting ***
>
> There were no SCSI errors on the console or in the logs around the time of this reboot.
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:ocfs2-users-
> > [EMAIL PROTECTED] On Behalf Of Ulf Zimmermann
> > Sent: Monday, July 30, 2007 11:11
> > To: Sunil Mushran
> > Cc: ocfs2-users@oss.oracle.com
> > Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> >
> > Too early to call. Management made the call: "This hardware seems to
> > have been stable, let's use it."
> >
> > > -----Original Message-----
> > > From: Sunil Mushran [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, July 30, 2007 11:07
> > > To: Ulf Zimmermann
> > > Cc: ocfs2-users@oss.oracle.com
> > > Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> > >
> > > So are you suggesting the reason was bad hardware?
> > > Or is it too early to call?
> > >
> > > Ulf Zimmermann wrote:
> > > > I have a serial console set up with logging via conserver, but so far no
> > > > further crash. We also swapped hardware around a bit (another 4-node
> > > > cluster with DL360g5 had been working without a crash for several weeks;
> > > > we swapped those 4 nodes in for the first 4 in the 6-node cluster).
> > > >
> > > >> -----Original Message-----
> > > >> From: Sunil Mushran [mailto:[EMAIL PROTECTED]
> > > >> Sent: Monday, July 30, 2007 10:21
> > > >> To: Ulf Zimmermann
> > > >> Cc: ocfs2-users@oss.oracle.com
> > > >> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> > > >>
> > > >> Do you have a netconsole setup? If not, set it up. That will capture the
> > > >> real reason for the reset. Well, it typically does.
> > > >>
> > > >> Ulf Zimmermann wrote:
> > > >>> We just installed a new cluster with 6 HP DL380g5, dual single-port
> > > >>> Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a
> > > >>> 3Par S400. We are using the 3Par-recommended config for the Qlogic
> > > >>> driver and device-mapper-multipath, giving us 4 paths to the SAN. We do
> > > >>> see some SCSI errors where DM-MP fails a path after getting a 0x2000
> > > >>> error from the SAN controller, but the path gets put back in service in
> > > >>> less than 10 seconds.
> > > >>>
> > > >>> This needs to be fixed, but I don't think it is what is causing our
> > > >>> reboots. Two of the nodes rebooted once while idle (ocfs2 and
> > > >>> clusterware were running, no db); one node rebooted once while idle
> > > >>> (another node was copying our 9i db with fscat from ocfs1 to the ocfs2
> > > >>> data volume) and once while some load was put on it via the upgraded
> > > >>> 10g database. In all cases it is as if someone pressed a hardware reset
> > > >>> button. No kernel panic (at least not one leading to a stop with a
> > > >>> visible message); we do get a dirty write cache for the internal cciss
> > > >>> controller.
> > > >>>
> > > >>> The only messages we get on the other nodes are when the crashed node
> > > >>> is already in reset and has missed its ocfs2 heartbeat (threshold set
> > > >>> to the default of 7), followed later by crs moving the vip.
> > > >>>
> > > >>> Any hints on troubleshooting this would be appreciated.
> > > >>>
> > > >>> Regards, Ulf.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
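For archive readers: the netconsole setup Sunil suggests in the quoted thread can be sketched as below. The IP addresses, interface name, MAC address, and ports are placeholders, not values from this thread; the module parameter syntax is the standard Linux netconsole format.

```shell
#!/bin/sh
# Hypothetical netconsole setup: stream kernel messages over UDP to a log
# host so the real reason for a reset is captured even if the node fences.
# Format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@tgt-ip/[tgt-mac]
modprobe netconsole \
  netconsole=6666@192.168.1.10/eth0,514@192.168.1.20/00:11:22:33:44:55

# On the receiving host (192.168.1.20 here), capture the UDP stream:
#   nc -u -l -p 514 | tee /var/log/netconsole.log
```

Unlike a serial console, this needs no extra cabling, but it does require the sending node's network stack to stay up long enough to emit the oops.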