RE: [Ocfs2-users] problem with 2 host cluster

Andrew Brunton Mon, 18 Sep 2006 06:08:08 -0700

Hi,

I was wondering about the timeout, but wasn't sure where to set it.


Why doesn't the restart work? (I assume it's trying to restart.)

Andrew


-----Original Message-----
From: Mark Maiden [mailto:[EMAIL PROTECTED]]
Sent: 18 September 2006 12:18
To: Andy Phillips
Cc: Andrew Brunton; [email protected]
Subject: Re: [Ocfs2-users] problem with 2 host cluster

We had a similar issue using SLES 9 and a CX300.

We upgraded to the latest OCFS2 version and changed
O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file (on both nodes)
to the following:

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

It seemed to sort the issue out for us, but could be a totally different 
issue! ;-)
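
For reference, the OCFS2 FAQ gives the relationship between the threshold
and the resulting disk heartbeat timeout as roughly
timeout = (threshold - 1) * 2 seconds. A minimal sketch of that arithmetic,
assuming the 2-second heartbeat interval of the 1.2.x releases (the function
names are made up for illustration):

```python
# Assumption: o2cb writes a disk heartbeat every 2 seconds (OCFS2 1.2.x),
# and a node is declared dead after (threshold - 1) missed iterations.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_to_timeout_secs(threshold: int) -> int:
    """Seconds of missed disk heartbeats tolerated for a given threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs: int) -> int:
    """Threshold to put in /etc/sysconfig/o2cb for a desired timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


print(threshold_to_timeout_secs(61))   # 120
print(timeout_secs_to_threshold(120))  # 61
```

So 61 works out to about two minutes. Note that this is the disk
heartbeat; the "idle for 10 seconds" message in the log comes from the
network side (o2net), so it may well be governed by a different setting.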

Mark Maiden
Systems Administrator
Globoforce, Ltd
  6 Beckett Way Parkwest
  Dublin 12
  Ireland
  t: +353 1 625 8812
  f: +353 1 625 8880
  e: [EMAIL PROTECTED]
   www.globoforce.com

   http://guidance.gospelcom.net/answer.htm


Andy Phillips wrote:
> Hi,
> 
>    I've got _exactly_ the same problem. I've not had the time to dive
> through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
> 
>    For us the problem (same trace as below) was not that repeatable, and
> was possibly related to the i/o pattern. 
> 
>    What seems to happen is that the underlying "network services" of
> ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> surrounded by wrapper functions, one of which times when the last packet
> is received. It's this that decides the socket is dead, then closes the
> socket. Meanwhile, the upper layers (which are actually sending data
> regularly) find the carpet yanked out from underneath them, and decide
> to halt the cluster to protect the data. 
> 
>    Highly annoying. I expect it will be some signed 32-bit integer
> wrapping somewhere....
> 
>    Andy
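
The behaviour Andy describes can be pictured as a receive-side watchdog.
The following is only a toy model of that idea, not the actual o2net code;
the class and method names are invented for illustration, and the 10-second
limit is taken from the log message quoted further down:

```python
import time


class IdleTimer:
    """Toy model of a receive-idle watchdog (not the real o2net code)."""

    def __init__(self, timeout_secs: float, now=time.monotonic):
        self.timeout_secs = timeout_secs
        self._now = now          # injectable clock, handy for testing
        self.last_rx = now()     # time the last packet was received
        self.connected = True

    def on_receive(self) -> None:
        # Only received packets reset the timer; sends do not count.
        self.last_rx = self._now()

    def check(self) -> bool:
        # Called periodically; drops the connection once idle too long.
        if self.connected and self._now() - self.last_rx >= self.timeout_secs:
            self.connected = False
        return self.connected


# Simulated clock so the example runs instantly.
clock = [0.0]
timer = IdleTimer(timeout_secs=10, now=lambda: clock[0])

clock[0] = 9.0
print(timer.check())   # True: idle for 9 s, still under the limit
clock[0] = 10.0
print(timer.check())   # False: idle for 10 s, connection is dropped
```

In this model, a peer that keeps sending but receives nothing for the
timeout period still gets disconnected, which matches the symptom of the
upper layers suddenly losing the socket.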
>  
> 
> On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
>> Hi,
>>
>>  
>>
>> We have 2 Dell 1850s in a cluster, both machines are running Red Hat
>> Enterprise Linux 4 AS, update 2.
>>
>>  
>>
>> The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
>>
>>  
>>
>> The cluster is running an Oracle 10gR2 std edition RAC. 
>>
>>  
>>
>> We are using ocfs2 to store files generated by our application and not
>> to store anything to do with the database.
>>
>>  
>>
>> We've been having a few problems where the servers appear to hang, and
>> have to be shut down (using the power button) and then started up again.
>> This seems to be happening every weekend and I don't really understand
>> what's happening, or how to fix it.
>>
>>  
>>
>> I've included an extract from messages in the hope someone can shed
>> some light on the matter.
>>
>>  
>>
>> Kind regards
>>
>>  
>>
>> Andrew
>>
>>  
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
>> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
>> idle for 10 seconds, shutting it down.
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
>> some times that might help debug the situation: (tmr 1158527154.993223
>> now 1158527164.993090 dr 1158527154.993213 adv
>> 1158527154.993227:1158527154.993228 func (101e0528:505)
>> 1158527153.796194:1158527153.796200)
>>
>> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
>> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
>> 10.1.1.110:7777
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 185 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 154 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 123 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 472 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 3239 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 118 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
>>
>> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
>>
>> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>>
>> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
>> root=LABEL=/ apic rhgb quiet)
>>
>> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
>> ([EMAIL PROTECTED]) (gcc version 3.4.4 20050721
>> (Red Hat 3.4.4-2)) #1 SMP
>>
>>  
>>
>> Andrew Brunton
>>
>> Senior Application Developer
>>
>> UK Fuels Limited
>>
>>  
>>
>> Tel +44 (0)1270 655636
>>
>> Fax +44 (0)1270 655700
>>
>>  
>>
>> [EMAIL PROTECTED]
>>
>>  
>>
>>
>>
>> ________________________________________________________________________
>> In order to protect our email recipients, Betfair use SkyScan from 
>> MessageLabs to scan all Incoming and Outgoing mail for viruses.
>>
>> ________________________________________________________________________
>> _______________________________________________
>> Ocfs2-users mailing list
>> [email protected]
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
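
As an aid to reading the log excerpt quoted above, here is a small sketch
that pulls the two timestamps out of the o2net_idle_timer line and decodes
the dlm status codes. The errno names are the standard Linux values;
reading "tmr" as the time the idle timer was last armed is an assumption on
my part, not something the log states:

```python
import re

# Linux errno values (assumed: the nodes run Linux, as stated in the thread).
LINUX_ERRNO = {
    107: "ENOTCONN (Transport endpoint is not connected)",
    112: "EHOSTDOWN (Host is down)",
}

idle_line = ("o2net_idle_timer:1321 here are some times that might help debug "
             "the situation: (tmr 1158527154.993223 now 1158527164.993090 "
             "dr 1158527154.993213")
dlm_line = "(73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107"

# Gap between the "tmr" and "now" timestamps on the idle-timer line.
m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", idle_line)
tmr, now = float(m.group(1)), float(m.group(2))
print(f"idle for {now - tmr:.3f} s")

# Kernel status codes are negated errno values.
status = int(re.search(r"status = (-?\d+)", dlm_line).group(1))
print(LINUX_ERRNO.get(-status, "unknown"))
```

The gap comes out at almost exactly 10 seconds, consistent with the "idle
for 10 seconds" message, and the flood of -107 errors is what one would
expect from requests sent on a connection that has already been closed.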

