Alexei,

But given that the problem is o2net_idle_timer, that sort of takes the disk heartbeat out of the equation.
Andy

On Mon, 2006-09-18 at 09:57 -0700, Alexei_Roudnev wrote:
> OCFS have 2 heartbeat thresholds, and only one is configured by this option.
> (I don't remember, which one of 2 - network and disk heartbeats).
>
> ----- Original Message -----
> From: "Mark Maiden" <[EMAIL PROTECTED]>
> To: "Andy Phillips" <[EMAIL PROTECTED]>
> Cc: <[email protected]>
> Sent: Monday, September 18, 2006 4:17 AM
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
>
> We had a similar issue using SLES 9 and a CX300.
>
> We upgraded to the latest ocfs version and changed our
> O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file(on both nodes)
> to the following :
>
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=61
>
> It seemed to sort the issue out for us, but could be a totally different
> issue! ;-)
>
> Mark Maiden
> Systems Administrator
> Globoforce, Ltd
> 6 Beckett Way Parkwest
> Dublin 12
> Ireland
> t: +353 1 625 8812
> f: +353 1 625 8880
> e: [EMAIL PROTECTED]
> www.globoforce.com
>
> http://guidance.gospelcom.net/answer.htm
>
> Andy Phillips wrote:
> > Hi,
> >
> > I've got _exactly_ the same problem. I've not had the time to dive
> > through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
> >
> > For us the problem (same trace as below) was not that repeatable, and
> > was possibly related to the i/o pattern.
> >
> > What seems to happen is that the underlying "network services" of
> > ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> > surrounded by wrapper functions, one of which times when the last packet
> > is received. Its this that decides the socket is dead, then closes the
> > socket. Meanwhile, the upper layers (which are actually sending data
> > regularly) find the carpet yanked out from underneath them, and decide
> > to halt the cluster to protect the data.
> >
> > Highly annoying. I expect it will be some signed 32bit integer
> > wrapping somewhere....
> >
> > Andy
> >
> > On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
> >> Hi,
> >>
> >> We have 2 Dell 1850’s in a cluster, both machines are running Redhat
> >> Enterprise Linux 4 AS, update 2.
> >>
> >> The boxes are connected to a Dell EMC CX300 using emulex HBA’s
> >>
> >> The cluster is running an Oracle 10gR2 std edition RAC.
> >>
> >> We are using ocfs2 to store files generated by our application and not
> >> to store anything to do with the database.
> >>
> >> We’ve been having a few problems were the servers appear to hang, and
> >> have to be shutdown (using the powerbutton) and then started up again.
> >> This seems to be happening every weekend and I don’t really understand
> >> what’s happening, or how to fix it.
> >>
> >> I’ve included an extract from messages in the hope someone can shed
> >> some light on the matter.
> >>
> >> Kind regards
> >>
> >> Andrew
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
> >> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
> >> idle for 10 seconds, shutting it down.
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
> >> some times that might help debug the situation: (tmr 1158527154.993223
> >> now 1158527164.993090 dr 1158527154.993213 adv
> >> 1158527154.993227:1158527154.993228 func (101e0528:505)
> >> 1158527153.796194:1158527153.796200)
> >>
> >> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
> >> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
> >> 10.1.1.110:7777
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 185 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 154 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 123 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 472 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 3239 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 118 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
> >>
> >> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
> >>
> >> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
> >> started.
> >>
> >> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
> >> root=LABEL=/ apic rhgb quiet)
> >>
> >> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
> >> ([EMAIL PROTECTED]) (gcc version 3.4.4 20050721
> >> (Red Hat 3.4.4-2)) #1 SMP
> >>
> >> Andrew Brunton
> >>
> >> Senior Application Developer
> >>
> >> UK Fuels Limited
> >>
> >> Tel +44 (0)1270 655636
> >>
> >> Fax +44 (0)1270 655700
> >>
> >> [EMAIL PROTECTED]
> >>
> >> ________________________________________________________________________
> >> In order to protect our email recipients, Betfair use SkyScan from
> >> MessageLabs to scan all Incoming and Outgoing mail for viruses.
> >>
> >> ________________________________________________________________________
> >> _______________________________________________
> >> Ocfs2-users mailing list
> >> [email protected]
> >> http://oss.oracle.com/mailman/listinfo/ocfs2-users

-- 
Andy Phillips
Systems Architecture Manager, Betfair.com
Office: 0208 8348436
Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6 9HP
(Change address information to reflect company of employment and your work address)
Company No. 5140986
(Modify company number to correspond with company name listed above)

The information in this e-mail and any attachment is confidential and is intended only for the named recipient(s). The e-mail may not be disclosed or used by any person other than the addressee, nor may it be copied in any way.
If you are not a named recipient please notify the sender immediately and delete any copies of this message. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. Any view or opinions presented are solely those of the author and do not necessarily represent those of the company.
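
For anyone who wants to check the numbers in the "here are some times" line from the quoted trace, here is a minimal Python sketch that works out how long the connection looked idle. It only does arithmetic on the timestamps in the log. The helper names (stamp_us, idle_gaps_us) are made up for illustration. Two things are assumptions rather than facts checked against the o2net source: that "tmr" is when the idle timer was last armed and "dr" is when data was last seen on the socket, and that the digits after the dot are a microsecond count.

```python
import re

# The timing fields from the quoted o2net_idle_timer message, with the
# e-mail line wrapping and "> >> " quote prefixes removed.
LOG_LINE = (
    "(tmr 1158527154.993223 now 1158527164.993090 dr 1158527154.993213 adv "
    "1158527154.993227:1158527154.993228 func (101e0528:505) "
    "1158527153.796194:1158527153.796200)"
)


def stamp_us(text, name):
    """Return the timestamp that follows `name`, in whole microseconds.

    Assumption: the part after the dot is a microsecond count, so it is
    read as an integer instead of as a decimal fraction.
    """
    match = re.search(r"\b" + re.escape(name) + r" (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no %r field in log text" % name)
    return int(match.group(1)) * 1000000 + int(match.group(2))


def idle_gaps_us(text):
    """Gaps, in microseconds, between 'now' and the 'tmr' and 'dr' stamps."""
    now = stamp_us(text, "now")
    return {
        "now_minus_tmr": now - stamp_us(text, "tmr"),
        "now_minus_dr": now - stamp_us(text, "dr"),
    }


if __name__ == "__main__":
    for label, gap in sorted(idle_gaps_us(LOG_LINE).items()):
        print("%s: %.6f s" % (label, gap / 1e6))
```

On the trace above both gaps are just under 10 seconds (9999867 and 9999877 microseconds), which matches the "has been idle for 10 seconds" message and is consistent with the network idle timer, not the disk heartbeat, being what fired.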
