Hi,

I was wondering about the timeout, but wasn't sure where to set it.

Why doesn't the restart work? (I assume it's trying to restart.)

Andrew

-----Original Message-----
From: Mark Maiden [mailto:[EMAIL PROTECTED]
Sent: 18 September 2006 12:18
To: Andy Phillips
Cc: Andrew Brunton; [email protected]
Subject: Re: [Ocfs2-users] problem with 2 host cluster

We had a similar issue using SLES 9 and a CX300. We upgraded to the
latest ocfs version and changed our O2CB_HEARTBEAT_THRESHOLD in the
/etc/sysconfig/o2cb file (on both nodes) to the following:

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

It seemed to sort the issue out for us, but could be a totally
different issue! ;-)

Mark Maiden
Systems Administrator
Globoforce, Ltd
6 Beckett Way
Parkwest
Dublin 12
Ireland
t: +353 1 625 8812
f: +353 1 625 8880
e: [EMAIL PROTECTED]
www.globoforce.com
http://guidance.gospelcom.net/answer.htm

Andy Phillips wrote:
> Hi,
>
> I've got _exactly_ the same problem. I've not had the time to dive
> through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
>
> For us the problem (same trace as below) was not that repeatable, and
> was possibly related to the i/o pattern.
>
> What seems to happen is that the underlying "network services" of
> ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> surrounded by wrapper functions, one of which times when the last packet
> is received. It's this that decides the socket is dead, then closes the
> socket. Meanwhile, the upper layers (which are actually sending data
> regularly) find the carpet yanked out from underneath them, and decide
> to halt the cluster to protect the data.
>
> Highly annoying. I expect it will be some signed 32-bit integer
> wrapping somewhere....
>
> Andy
>
>
> On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
>> Hi,
>>
>> We have 2 Dell 1850s in a cluster, both machines are running Red Hat
>> Enterprise Linux 4 AS, update 2.
>>
>> The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
>>
>> The cluster is running an Oracle 10gR2 std edition RAC.
>>
>> We are using ocfs2 to store files generated by our application and not
>> to store anything to do with the database.
>>
>> We've been having a few problems where the servers appear to hang, and
>> have to be shut down (using the power button) and then started up again.
>> This seems to be happening every weekend and I don't really understand
>> what's happening, or how to fix it.
>>
>> I've included an extract from messages in the hope someone can shed
>> some light on the matter.
>>
>> Kind regards
>>
>> Andrew
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
>> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
>> idle for 10 seconds, shutting it down.
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
>> some times that might help debug the situation: (tmr 1158527154.993223
>> now 1158527164.993090 dr 1158527154.993213 adv
>> 1158527154.993227:1158527154.993228 func (101e0528:505)
>> 1158527153.796194:1158527153.796200)
>>
>> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
>> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
>> 10.1.1.110:7777
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 185 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 154 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 123 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 472 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 3239 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 118 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
>>
>> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
>>
>> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>>
>> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
>> root=LABEL=/ apic rhgb quiet)
>>
>> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
>> ([EMAIL PROTECTED]) (gcc version 3.4.4 20050721
>> (Red Hat 3.4.4-2)) #1 SMP
>>
>> Andrew Brunton
>> Senior Application Developer
>> UK Fuels Limited
>>
>> Tel +44 (0)1270 655636
>> Fax +44 (0)1270 655700
>>
>> [EMAIL PROTECTED]
>>
>> ________________________________________________________________________
>> In order to protect our email recipients, Betfair use SkyScan from
>> MessageLabs to scan all Incoming and Outgoing mail for viruses.
>>
>> ________________________________________________________________________
>> _______________________________________________
>> Ocfs2-users mailing list
>> [email protected]
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
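
For anyone trying to relate the numbers in this thread to each other, here is a rough Python sketch. It rests on assumptions that nobody in the thread states: that one disk-heartbeat iteration takes 2 seconds, so the timeout is roughly (threshold - 1) * 2 seconds (this is how I recall the OCFS2 FAQ describing O2CB_HEARTBEAT_THRESHOLD), and that the "tmr" and "now" fields in the o2net_idle_timer line are epoch timestamps in seconds. The function names are made up for illustration. Note that the 10-second figure in the log comes from the o2net idle timer, which may well be a separate timeout from the disk heartbeat threshold Mark changed.

```python
import re
from decimal import Decimal

# Assumption (not from the thread): one heartbeat iteration every 2 seconds.
HEARTBEAT_INTERVAL_S = 2

# Linux errno values matching the "status = -NNN" lines in the log extract.
LINUX_ERRNO = {107: "ENOTCONN", 112: "EHOSTDOWN"}


def threshold_to_seconds(threshold: int) -> int:
    """Approximate disk-heartbeat timeout for a given O2CB_HEARTBEAT_THRESHOLD."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def seconds_to_threshold(timeout_s: int) -> int:
    """Inverse of threshold_to_seconds: threshold needed for a desired timeout."""
    return timeout_s // HEARTBEAT_INTERVAL_S + 1


def idle_seconds(log_text: str) -> Decimal:
    """Return 'now' minus 'tmr' from an o2net_idle_timer debug line."""
    match = re.search(r"tmr\s+(\d+\.\d+)\s+now\s+(\d+\.\d+)", log_text)
    if match is None:
        raise ValueError("no 'tmr ... now ...' pair found")
    # Decimal avoids float rounding on the microsecond digits.
    return Decimal(match.group(2)) - Decimal(match.group(1))


# Extract copied from Andrew's log, with the mail line wrapping removed.
SAMPLE = (
    "(tmr 1158527154.993223 now 1158527164.993090 dr 1158527154.993213 "
    "adv 1158527154.993227:1158527154.993228 func (101e0528:505) "
    "1158527153.796194:1158527153.796200)"
)

if __name__ == "__main__":
    print(threshold_to_seconds(61))   # Mark's setting, under the 2 s assumption
    print(idle_seconds(SAMPLE))       # gap between tmr and now in the log
    print(LINUX_ERRNO[107], LINUX_ERRNO[112])
```

Under those assumptions, Mark's value of 61 works out to about 120 seconds, and the gap between "tmr" and "now" in the log is just under 10 seconds, which matches the "idle for 10 seconds" message. The two status codes in the dlm lines are the Linux errno values for "transport endpoint is not connected" (-107) and "host is down" (-112).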
