You may start with what is suggested in the FAQ. http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
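For reference, the arithmetic behind that FAQ entry can be sketched as follows. This assumes the FAQ's rule that O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1, i.e. one disk heartbeat every 2 seconds. The function names are hypothetical, and the 60-second target is only an example value, not a recommendation.

```python
# Sketch of the disk heartbeat arithmetic from the OCFS2 FAQ TIMEOUT entry:
#   O2CB_HEARTBEAT_THRESHOLD = (timeout_in_seconds / 2) + 1
# The 2-second heartbeat interval is taken from that rule (an assumption here).

HEARTBEAT_INTERVAL_S = 2


def threshold_to_timeout_s(threshold: int) -> int:
    """Seconds without a completed disk heartbeat before the node fences."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def timeout_s_to_threshold(timeout_s: int) -> int:
    """Smallest threshold whose timeout is at least timeout_s seconds."""
    # Ceiling division so that odd timeouts round up rather than down.
    return -(-timeout_s // HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    print(threshold_to_timeout_s(7))    # timeout for the default threshold of 7
    print(timeout_s_to_threshold(60))   # threshold for an example 60 s target
```

With the default threshold of 7 this works out to 12 seconds, which leaves little margin over the 10 seconds configured in multipath.conf and the 10003 ms write reported in the quoted message.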
Regards,
Marcos Eduardo Matsunaga
Oracle USA
Linux Engineering

Ulf Zimmermann wrote:
>> -----Original Message-----
>> From: Mark Fasheh [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, August 15, 2007 16:49
>> To: Ulf Zimmermann
>> Cc: Sunil Mushran; ocfs2-users@oss.oracle.com
>> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>>
>> On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:
>>
>>> Index 22: took 10003 ms to do waiting for write completion
>>> *** ocfs2 is very sorry to be fencing this system by restarting ***
>>>
>>> There were no SCSI errors on the console or logs around the time of
>>> this reboot.
>>>
>> It looks like the write took too long - as a first step, you might want to
>> up the disk heartbeat timeouts on those systems. Run:
>>
>> $ /etc/init.d/o2cb configure
>>
>> on each node to do that. That won't hide any hardware problems, but if the
>> problem is just a latency to get the write to disk, it'd help tune it
>> away.
>> --Mark
>>
>
> Ok, we have now had 4 reboots, plus 2 more by my own action, which were by
> OCFS2 fencing. As said in previous emails we were seeing some SCSI
> errors and although device-mapper-multipath seems to take care of it,
> sometimes the 10 seconds configured in multipath.conf and the default
> timings of o2cb are colliding.
>
> On the two clusters where we have run into this, I have now replaced several
> fibre cables and it seems we also have 1 bad port on one of the fibre
> channel switches. Swapped first cable, still problems. Swapped SFP,
> still problems, moved node to another port from where the SFP was swapped
> from, 0 errors.
>
> Now I am still concerned about the timing of device-mapper-multipath and
> o2cb. O2cb is currently set to the default of:
>
> Specify heartbeat dead threshold (>=7) [7]:
> Specify network idle timeout in ms (>=5000) [10000]:
> Specify network keepalive delay in ms (>=1000) [5000]:
> Specify network reconnect delay in ms (>=2000) [2000]:
>
> So the timeout I seem to hit is the 10,000 of network idle timeout? Even
> though this timeout occurs on the disk? What values would you recommend I
> should set this to?
>
> Another question in case someone can answer this. If I get syslog
> entries like:
>
> Aug 16 00:44:33 dbprd01 kernel: SCSI error : <1 0 0 1> return code = 0x20000
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452448
> Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452456
> Aug 16 00:44:33 dbprd01 kernel: SCSI error : <1 0 1 1> return code = 0x20000
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242384
> Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242392
> Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed
> Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
> Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed
> Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2
>
> Does this actually error out all the way or does the request still go
> to one of the remaining paths? If this request doesn't error out,
> because it was still able to fulfill it via the 2 remaining paths, then
> it is really just the timing between device-mapper-multipath recovering
> this request through the remaining paths and our o2cb settings. If not, we
> might still have another problem. We have seen many such errors but only
> had like 8 reboots, all I think attributed to fencing now.
>
> Regards, Ulf.
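
One way to read the quoted syslog excerpt is to tally which paths were failed and the lowest "remaining active paths" count reported per multipath map. The following is a minimal sketch, assuming the message formats are exactly as in the excerpt; the function name and the sample list are made up for illustration. It does not prove that a given request was retried on another path. It only shows whether any active paths were left at that point.

```python
import re

# Patterns modeled on the kernel and multipathd lines in the quoted excerpt.
FAILING = re.compile(r"dm-multipath: Failing\s+path (\d+:\d+)")
REMAINING = re.compile(r"multipathd: (\S+): remaining active paths: (\d+)")


def summarize(lines):
    """Return (failed_paths, min_active).

    failed_paths lists major:minor numbers in the order they were failed;
    min_active maps each multipath map name to the lowest count seen.
    """
    failed = []
    min_active = {}
    for line in lines:
        m = FAILING.search(line)
        if m:
            failed.append(m.group(1))
            continue
        m = REMAINING.search(line)
        if m:
            name, count = m.group(1), int(m.group(2))
            min_active[name] = min(count, min_active.get(name, count))
    return failed, min_active


# Sample input: four of the lines from the excerpt, unwrapped.
SAMPLE = """\
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.
Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2
""".splitlines()

if __name__ == "__main__":
    failed, min_active = summarize(SAMPLE)
    print("failed paths:", failed)
    print("lowest active path count per map:", min_active)
```

If the lowest count for a map never reaches 0, at least one path was still available when the failures were logged, which would point toward a timing issue rather than a complete loss of access to the disk.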
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users