Re: [Ocfs2-users] problem with 2 host cluster

Adam Kenger Tue, 19 Sep 2006 07:49:26 -0700

We had a lot of issues in that same configuration - 4 e1000 NICs bonded against 2 1g switches. I don't know if that's coincidental at all, and we have since downed the whole thing until we can make it work in a more stable fashion. I would suggest, just based on that pattern alone, that you break up the NIC bonding and just let it sit on one interface. See what happens. It would be interesting if there was any connection on the network layer between the bonding and some kind of latency that it's generating - causing the OCFS cluster to oops and panic.


Adam


On Sep 19, 2006, at 9:46 AM, Andrew Brunton wrote:

We've had the system go down under various load conditions (one machine has
gone down before now during the day, which is not good).
In each box we have 4 e1000 network cards bonded into 2 bonded connections, bond0 (public) and bond1 (private). We have 2 1g switches, with a connection
from each of the bonds going into each switch.
http://www.puschitz.com/TuningLinuxForOracle.shtml#ChangingNetworkKernelSettings mentions flow control for the e1000 (options e1000 FlowControl=1).
Could this have something to do with the problem?
Something else I have noticed is that I'm using the public bonded connection for the heartbeat link rather than the private one which is used by the RAC Cluster. I assume I can change it? If so, can I just down the ocfs cluster, change the IP address to the private one and then start them back up again?
What's the recommended way to do it?
The ocfs mount is shared to our Windows clients using samba. I've noticed a large number of errors concerning samba in messages and I'm wondering if
that's what's causing my problem.

How do you work out the O2CB_HEARTBEAT_THRESHOLD?
This is a bit OT, but how do I stop samba from binding to the private bonded
connection?

Andrew



-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Alexei_Roudnev
Sent: 19 September 2006 09:23
To: [EMAIL PROTECTED]
Cc: [email protected]
Subject: ***Bulk SPAM*** Re: [Ocfs2-users] problem with 2 host cluster
I do not remember which timeouts are configurable and which are not,
but, if I am not mistaken, the network timeout is hardcoded.
So, if the disks take more than 12 seconds to reconnect (normally, they reconnect within 60 seconds), you can reconfigure OCFSv2, but if the network reconnection time is > 12 seconds (and it is ALWAYS > 12 seconds! no exceptions) then you
have no choice.
It is a design flaw, in general. The only idea I have: if it is a NETWORK
glitch, you can try a direct cross-connection (but that is unlikely; more
likely it is a server glitch, where the server loops in the kernel and so delays
the service from receiving TCP/IP in time, and you had better find the core reason
for it).



----- Original Message -----
From: "Andy Phillips" <[EMAIL PROTECTED]>
To: "Alexei_Roudnev" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[email protected]>
Sent: Monday, September 18, 2006 10:04 AM
Subject: Re: [Ocfs2-users] problem with 2 host cluster


Alexei

But given that the problem is o2net_idle_timer, that sort of takes the
disk heartbeat out of the equation.

Andy


On Mon, 2006-09-18 at 09:57 -0700, Alexei_Roudnev wrote:
OCFS has 2 heartbeat thresholds, and only one is configured by this
option
(I don't remember which one of the 2, network or disk heartbeat).


----- Original Message -----
From: "Mark Maiden" <[EMAIL PROTECTED]>
To: "Andy Phillips" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Monday, September 18, 2006 4:17 AM
Subject: Re: [Ocfs2-users] problem with 2 host cluster


We had a similar issue using SLES 9 and a CX300.

We upgraded to the latest ocfs version and changed our
O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file (on both nodes)
to the following:
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61
It seemed to sort the issue out for us, but could be a totally different
issue! ;-)
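
A note on how a value such as 61 can be worked out. The rule of thumb usually quoted from the OCFS2 FAQ is threshold = (timeout in seconds / 2) + 1, on the understanding that the disk heartbeat runs every 2 seconds. The sketch below only does that arithmetic; the function names and the 2-second interval constant are assumptions made for illustration and are not part of o2cb. It covers the disk heartbeat only, not the o2net network idle timeout discussed elsewhere in this thread.

```python
# Assumption: the disk heartbeat iterates every 2 seconds and a node is
# considered dead after (threshold - 1) iterations without a heartbeat.
HEARTBEAT_INTERVAL_S = 2


def timeout_for_threshold(threshold, interval_s=HEARTBEAT_INTERVAL_S):
    """Seconds of disk-heartbeat silence tolerated for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_s


def threshold_for_timeout(timeout_s, interval_s=HEARTBEAT_INTERVAL_S):
    """Smallest threshold that tolerates at least timeout_s seconds."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    iterations = -(-timeout_s // interval_s)  # ceiling division
    return iterations + 1


print(timeout_for_threshold(61))   # seconds tolerated with THRESHOLD=61
print(threshold_for_timeout(120))  # threshold needed for a 120-second timeout
print(timeout_for_threshold(7))    # seconds tolerated with THRESHOLD=7
```

Under these assumptions, 61 corresponds to 120 seconds, and a threshold of 7 corresponds to the 12 seconds mentioned elsewhere in this thread.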

Mark Maiden
Systems Administrator
Globoforce, Ltd
  6 Beckett Way Parkwest
  Dublin 12
  Ireland
  t: +353 1 625 8812
  f: +353 1 625 8880
  e: [EMAIL PROTECTED]
   www.globoforce.com

   http://guidance.gospelcom.net/answer.htm


Andy Phillips wrote:
Hi,
I've got _exactly_ the same problem. I've not had the time to dive
through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
For us the problem (same trace as below) was not that repeatable, and
was possibly related to the i/o pattern.

   What seems to happen is that the underlying "network services" of
ocfs2 (o2net) believes that no packets are being sent. The tcp socket is surrounded by wrapper functions, one of which tracks when the last packet was received. It's this that decides the socket is dead, then closes the
socket. Meanwhile, the upper layers (which are actually sending data
regularly) find the carpet yanked out from underneath them, and decide
to halt the cluster to protect the data.
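
The behaviour described above can be modelled with a toy idle timer. This is a minimal sketch and not the o2net code: the class, its method names and the 10-second timeout (borrowed from the log message quoted further down) are all illustrative assumptions. It only shows the shape of the problem, where sending does nothing to keep the connection alive and the sender finds the socket gone once the receive-side timer expires.

```python
class IdleTrackedConnection:
    """Toy connection that is shut down after too long without inbound data."""

    def __init__(self, idle_timeout_s, now):
        self.idle_timeout_s = idle_timeout_s
        self.last_rx = now          # time the last packet was received
        self.connected = True

    def on_receive(self, now):
        """Inbound traffic is the only thing that resets the idle clock."""
        self.last_rx = now

    def send(self, now):
        """Sending does not reset the idle clock; it fails once the link is down."""
        if not self.connected:
            raise ConnectionError("not connected")

    def check_idle(self, now):
        """Close the connection if nothing has arrived for idle_timeout_s."""
        if self.connected and now - self.last_rx >= self.idle_timeout_s:
            self.connected = False
        return self.connected


conn = IdleTrackedConnection(idle_timeout_s=10.0, now=0.0)
conn.on_receive(now=1.0)                  # last inbound packet at t=1
conn.send(now=5.0)                        # upper layer keeps sending
open_at_10 = conn.check_idle(now=10.0)    # 9 s since last receive
open_at_11 = conn.check_idle(now=11.0)    # 10 s since last receive
try:
    conn.send(now=11.5)
    send_failed = False
except ConnectionError:
    send_failed = True
print(open_at_10, open_at_11, send_failed)
```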

   Highly annoying. I expect it will be some signed 32bit integer
wrapping somewhere....

   Andy


On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
Hi,
We have 2 Dell 1850s in a cluster; both machines are running Red Hat
Enterprise Linux 4 AS, update 2.

The boxes are connected to a Dell EMC CX300 using Emulex HBAs.

The cluster is running an Oracle 10gR2 std edition RAC.
We are using ocfs2 to store files generated by our application and not
to store anything to do with the database.
We've been having a few problems where the servers appear to hang, and have to be shut down (using the power button) and then started up again. This seems to be happening every weekend and I don't really understand
what's happening, or how to fix it.



I've included an extract from messages in the hope someone can shed
some light on the matter.



Kind regards



Andrew
Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
idle for 10 seconds, shutting it down.

Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
some times that might help debug the situation: (tmr 1158527154.993223
now 1158527164.993090 dr 1158527154.993213 adv
1158527154.993227:1158527154.993228 func (101e0528:505)
1158527153.796194:1158527153.796200)
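
The tmr and now values in the entry above look like Unix timestamps with microsecond precision. The sketch below subtracts them to show how long the connection had been idle when the timer fired. It assumes that tmr is when the idle timer was last armed and now is when it fired; that reading is not confirmed anywhere in this thread. The helper name is made up for illustration.

```python
import re
from decimal import Decimal

# The relevant part of the o2net_idle_timer debug line, copied from the log.
LINE = ("(tmr 1158527154.993223 now 1158527164.993090 "
        "dr 1158527154.993213 adv 1158527154.993227:1158527154.993228")


def idle_seconds(line):
    """Return now - tmr from an o2net_idle_timer debug line (hypothetical helper)."""
    # \s* tolerates a missing space after the field name, as seen in some archives.
    tmr = Decimal(re.search(r"tmr\s*(\d+\.\d+)", line).group(1))
    now = Decimal(re.search(r"now\s*(\d+\.\d+)", line).group(1))
    return now - tmr


idle = idle_seconds(LINE)
print(idle)
```

Decimal is used instead of float so that the microsecond digits survive the subtraction exactly. The result is just under 10 seconds, which matches the "idle for 10 seconds" message.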

Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
10.1.1.110:7777

Sep 17 22:06:04 argon2 kernel:
(73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112

Sep 17 22:06:04 argon2 kernel:
(73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:05 argon2 last message repeated 185 times

Sep 17 22:06:05 argon2 kernel:
(26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:05 argon2 last message repeated 154 times

Sep 17 22:06:05 argon2 kernel:
(25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:05 argon2 last message repeated 123 times

Sep 17 22:06:05 argon2 kernel:
(73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:05 argon2 last message repeated 472 times

Sep 17 22:06:05 argon2 kernel:
(73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:08 argon2 last message repeated 3239 times

Sep 17 22:06:08 argon2 kernel:
(73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 17 22:06:08 argon2 last message repeated 118 times

Sep 17 22:06:08 argon2 kernel:
(73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107

Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.

Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded

Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
started.

Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
root=LABEL=/ apic rhgb quiet)

Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
([EMAIL PROTECTED]) (gcc version 3.4.4 20050721
(Red Hat 3.4.4-2)) #1 SMP
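
The status = -112 and status = -107 values in the extract above look like negated Linux errno values. Assuming that is what they are, a small lookup makes them readable. The table is hard-coded with the generic Linux numbers rather than taken from Python's errno module, because errno numbering differs between operating systems; the function name is made up for illustration.

```python
# Generic Linux errno numbers (assumption: the DLM reports -errno as status).
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def describe_status(status):
    """Translate a negative status code from the log into an errno name."""
    name, text = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return f"status {status}: {name} ({text})"


for status in (-112, -107):
    print(describe_status(status))
```

Read this way, the DLM errors follow from the o2net connection having been shut down, which is consistent with the o2net_idle_timer message that comes first in the log.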



Andrew Brunton

Senior Application Developer

UK Fuels Limited



Tel +44 (0)1270 655636

Fax +44 (0)1270 655700



[EMAIL PROTECTED]
________________________________________________________________________
In order to protect our email recipients, Betfair use SkyScan from
MessageLabs to scan all Incoming and Outgoing mail for viruses.
________________________________________________________________________
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
--
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6
9HP (Change address information to reflect company of employment and your
work address)

Company No. 5140986 (Modify company number to correspond with company
name listed above)
The information in this e-mail and any attachment is confidential and is
intended only for the named recipient(s). The e-mail may not be
disclosed or used by any person other than the addressee, nor may it be
copied in any way. If you are not a named recipient please notify the
sender immediately and delete any copies of this message. Any
unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. Any view or opinions presented are solely
those of the author and do not necessarily represent those of the
company.


