Hello Andrea,

I think you need to consider what Lars told you (upgrade to SP2), or maybe you can try using a different LUN for the SBD and use ionice to set the realtime class for the sbd process.
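For the ionice part, here is a minimal sketch in Python of the command I mean. The helper name ionice_realtime_cmd and the PID 1234 are hypothetical; on a real node you would take the PID from something like `pidof sbd`, run the command as root, and check that your I/O scheduler honours I/O classes at all (CFQ does).

```python
def ionice_realtime_cmd(pid, level=0):
    """Build the ionice command line that moves `pid` into the realtime I/O class.

    Hypothetical helper for illustration; it only builds the argument list
    and does not run anything.
    """
    if not 0 <= level <= 7:
        raise ValueError("realtime priority level must be between 0 and 7")
    # -c 1 selects the realtime class, -n sets the level inside that class
    # (0 is the highest priority), -p names the target process ID.
    return ["ionice", "-c", "1", "-n", str(level), "-p", str(pid)]


if __name__ == "__main__":
    # 1234 is a placeholder PID, not a value taken from the logs in this thread.
    print(" ".join(ionice_realtime_cmd(1234)))
```

This only raises the priority of sbd's I/O on the local node. If the path to the SAN itself stalls, as your sar output suggests, it will not be enough on its own.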
2013/5/7 andrea cuozzo <andrea.cuo...@sysma.it>

> Hi,
>
> Here are three logs from the last watchdog-driven server reboot on Friday
> evening (not that I want you to actually dig into them, it's just to update
> this thread with my new findings), with the SBD watchdog timeout set to 20
> seconds.
>
> 1) sar.txt is the output of sar -d -p 2 (pretty-printed disk statistics at
> two-second intervals), starting right before the reboot
>
> 2) messages.txt is an extract of the server's /var/log/messages starting
> right before the reboot, with QLogic driver, SCSI layer and SBD verbose
> logging enabled
>
> 3) cpu1.txt is the output of sar -P ALL 2 (CPU statistics at two-second
> intervals), filtered by CPU #1, starting right before the reboot
>
> sda is the local drive, sdb and sdc are the same single SAN LUN as seen by
> the two FC ports of the server, san is the LUN multipath alias, san_part1
> is the SBD partition, san_part2 is the Oracle partition.
>
> sar.txt shows that somewhere between 17:46:44 and 17:46:46 all reads and
> writes to/from the san LUN drop to zero, for both the SBD and Oracle
> partitions, right until the 17th second of the SBD countdown, at which time
> something (3.88 wr/s) seems to get written to the Oracle partition.
> %util jumps to 100%, as does %iowait (from cpu1.txt) on 3 of the 24 CPU
> cores this server has (the ones Oracle and SBD were using at the time, I
> suppose).
>
> messages.txt shows at 17:46:44 this QLogic driver message, which is
> different from the rest of the QLogic messages:
>
> May 3 17:46:44 server1 kernel: [66588.156113] qla2xxx [0000:11:00.1]-5816:2: Discard RND Frame -- 1006 02c1 0000.
>
> Since I started facing these problems I have collected gigabytes of
> /var/log/messages from these servers. The QLogic driver writes the
> occasional "dropped frame(s) detected" during normal server operation, but
> it never writes this "Discard RND Frame" message unless an unwanted reboot
> follows right after. No SCSI layer read or write communication on sdb and
> sdc gets recorded by the kernel afterwards, except for a couple of
> "device ready" commands. All this information has already been shared with
> the SAN department.
>
> Yesterday the SAN department made a parameter configuration change on the
> two Brocade switches (and multipath worked smoothly on the servers,
> switching paths back and forth as the respective switches were restarted).
> I hope this fixes the problem; otherwise we might investigate the switch
> port configuration change described in the following link, as it seems to
> apply to our current configuration (8Gb FC, Brocade switches, lots of
> er_bad_os port errors, fill word port mode currently set to 1, and random
> server problems).
>
> http://loopbackconnector.com/2013/02/14/brocade-8-gb-how-to-talk-when-idle-portcfgfillword/
>
> andrea
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 3 May 2013 10:17:12 +0200
> From: Lars Marowsky-Bree <l...@suse.com>
> To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] Frequent SBD triggered server reboots
> Message-ID: <20130503081712.ge3...@suse.de>
> Content-Type: text/plain; charset=iso-8859-1
>
> On 2013-05-03T02:49:54, andrea cuozzo <andrea.cuo...@sysma.it> wrote:
>
> > Unfortunately the OS and SP version for the Oracle project these clusters
> > belong to have been decided several layers over my head. I'll make a
> > point of upgrading to SP2 anyway, I might get lucky.
>
> Good luck with that!
>
> > the SAN department investigate their side of the problem, I'll take a
> > look at trying a different stonith resource; all servers involved
> > have some kind of IBM management console. Thanks for your answers to
> > my questions and for your time, very much appreciated.
>
> You're missing out on many further fixes since SP1 went out of support.
> Not just to sbd, but everything, from kernel to pacemaker to glibc and
> back.
>
> Since support is obviously irrelevant to your management, you could
> consider recompiling sbd from source if you were so inclined, though.
>
> Regards,
> Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
> HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

--
This is my life and I live it for as long as God wills