Re: [Ocfs2-users] OCFS2 v1.4 hangs

2009-06-04 Thread Karim Alkhayer
Hi John,

 

When multiple systems/nodes have access to data via shared storage, the
integrity of the data depends on inter-node communication: each node must
know when the other nodes are writing data. When the coordination between
the nodes fails, the result is a "split brain" condition, a situation in
which two servers independently try to control the storage, potentially
resulting in application failure or even corruption of critical data.

 

I/O fencing is the method of choice (used by vendors' cluster frameworks,
including OCFS2) for ensuring the integrity of critical information by
preventing data corruption. It allows a set of systems to hold temporary
registrations with the disk and to coordinate a write-exclusive reservation
on the disk containing the data. With I/O fencing, the cluster ensures that
errant nodes are "fenced" off and have no access to the shared storage,
while the eligible node(s) retain access to the data, virtually eliminating
the risk of data corruption.
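
For illustration only: on stacks that implement fencing with SCSI-3
persistent reservations (O2CB itself fences by resetting the errant node
rather than by managing disk reservations), the registrations and the
reservation can be inspected with sg_persist from sg3_utils. /dev/sdX is a
placeholder for your shared LUN:

$ sg_persist --in --read-keys --device=/dev/sdX
$ sg_persist --in --read-reservation --device=/dev/sdX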

 

The quorum is the group of nodes in a cluster that is allowed to operate on
the shared storage. When there is a failure in the cluster, the nodes may be
split into groups that can communicate within their own group and with the
shared storage, but not with the other groups.

 

O2QUO determines which group is allowed to continue and initiates fencing of
the other group(s).

Fencing is the act of forcefully removing a node from a cluster. A node with
OCFS2 mounted will fence itself when it realizes that it does not have
quorum in a degraded cluster. It does this so that other nodes won't be
stuck trying to access its resources. However, the resources do NOT get
released.
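
How quickly a node gives up and self-fences is governed by the O2CB disk
heartbeat and network timeouts, normally kept in /etc/sysconfig/o2cb and set
via "service o2cb configure". The values below are illustrative defaults,
not tuning advice:

# /etc/sysconfig/o2cb (excerpt)
O2CB_HEARTBEAT_THRESHOLD=31   # node deemed dead after (31 - 1) * 2 = 60
                              # seconds of missed disk heartbeat writes
O2CB_IDLE_TIMEOUT_MS=30000    # network idle timeout between nodes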

 

O2CB uses a node reset mechanism to fence; this, however, is what causes the
machine(s) to hang instead of handing over seamlessly. In OCFS2 1.4, Oracle
introduced a new fencing mechanism that no longer uses "panic" for fencing;
instead, by default, it uses "machine restart".
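
If your 1.4 build exposes it, the active fencing method can be checked (and,
if you really want the old behaviour, changed) through configfs; substitute
your cluster name for <clustername>:

$ cat /sys/kernel/config/cluster/<clustername>/fence_method
reset

"reset" is the new default (machine restart); writing "panic" to the same
file restores the old panic-based fencing.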

 

In your case, taking the network down the way you did causes the servers to
hang, including the mounted file system, which remains locked until the
OCFS2 cluster services are restarted.

 

RAC handover fails due to exactly this problem: the file system is locked by
another node that was kicked out of the cluster but is still occupying the
file system.

The healthy node will try to continue working, but the databases hosted on
the occupied file system will hang, and possibly the machine as well. At
this time there is no solution but to do the following (a rough command
sequence follows the list):

- Force shutdown the troublesome node(s)

- Shutdown the databases processes

- Restart the OCFS2 services
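
Roughly, and assuming the stock init scripts (adapt the service names and
the forced-shutdown method to your environment):

# on the troublesome node, if it still responds; otherwise power-cycle it
$ shutdown -h now

# on the surviving node(s)
$ srvctl stop database -d <dbname>   # or however you normally stop the DB
$ /etc/init.d/ocfs2 restart
$ /etc/init.d/o2cb restart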

 

Network failures like this can be tolerated where you have set up NIC
bonding for the interconnect, which is highly recommended.
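
For reference, a minimal active-backup bond on a RHEL-style system of that
era looks roughly like this; interface names and the address are
placeholders, not a recommendation:

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 (and likewise for eth2)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none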

 

Best regards,

Karim Alkhayer

 

 

-Original Message-
From: ocfs2-users-boun...@oss.oracle.com
[mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of John Murphy
Sent: Thursday, June 04, 2009 10:15 PM
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] OCFS2 v1.4 hangs

 

I have four database servers in a high-availability, load-balancing
configuration. Each machine has a mount to a common data source, which is
an OCFS2 v1.4 file system. While working on three of the servers, I
restarted the IP network and found afterwards that the fourth machine hung.
I could not reboot and could not unmount the ocfs2 partitions. I am
pretty sure this was all caused by my taking down the network on all
three of the remaining machines; can anyone shed some light on this for me?
Ironically, I have four machines in order to ensure reliability.

TIA

John
-- 
John Murphy
Technical And Managing Director
MANDAC Ltd
Kandoy House
2 Fairview Strand
Dublin 3
p: +353 1 5143001
m: +353 85 711 6844
e: john.mur...@mandac.eu
w: www.mandac.eu

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users

[Ocfs2-users] OCFS2 v1.4 hangs

2009-06-04 Thread John Murphy
I have four database servers in a high-availability, load-balancing
configuration. Each machine has a mount to a common data source, which is
an OCFS2 v1.4 file system. While working on three of the servers, I
restarted the IP network and found afterwards that the fourth machine hung.
I could not reboot and could not unmount the ocfs2 partitions. I am
pretty sure this was all caused by my taking down the network on all
three of the remaining machines; can anyone shed some light on this for me?
Ironically, I have four machines in order to ensure reliability.

TIA

John
-- 
John Murphy
Technical And Managing Director
MANDAC Ltd
Kandoy House
2 Fairview Strand
Dublin 3
p: +353 1 5143001
m: +353 85 711 6844
e: john.mur...@mandac.eu
w: www.mandac.eu



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] O2CB heartbeat not active on 2nd node

2009-06-04 Thread McKinley, Reid
Thanks for your push on this.  I don't know why I didn't check this
before.

It looks like we have a problem with our interconnect.  Our SA is
checking the hardware now.

[r...@nyclx1 ~]# more /etc/ocfs2/cluster.conf
node:
ip_port = 
ip_address = 192.168.0.218
number = 0
name = nyclx1
cluster = tiaa

node:
ip_port = 
ip_address = 192.168.0.217
number = 1
name = nyclx2
cluster = tiaa

cluster:
node_count = 2
name = tiaa

[r...@nyclx1 ~]# ping 192.168.0.217
PING 192.168.0.217 (192.168.0.217) 56(84) bytes of data.
From 192.168.0.218 icmp_seq=2 Destination Host Unreachable
From 192.168.0.218 icmp_seq=3 Destination Host Unreachable
From 192.168.0.218 icmp_seq=4 Destination Host Unreachable
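
"Destination Host Unreachable" reported from your own address usually means
ARP resolution is failing, which points at the link itself. While the
hardware is being checked, link state and the ARP table on both ends may
narrow it down (eth1 here is a placeholder for the interconnect NIC):

$ ethtool eth1 | grep "Link detected"
$ arp -n 192.168.0.217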

-Original Message-
From: Sunil Mushran [mailto:sunil.mush...@oracle.com] 
Sent: Wednesday, June 03, 2009 6:19 PM
To: McKinley, Reid
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] O2CB heartbeat not active on 2nd node

Do on both nodes:
$ netstat -ta --numeric-ports

Maybe port  is already in use.

Check your setup again. Ensure cluster.conf is the same on both nodes,
that the IPs are correct, and that tcpdump was capturing the traffic on
the correct interface, and so on.
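
One way to check for that on each node is to look for an existing listener
on the ip_port from cluster.conf (substitute the actual port number):

$ netstat -tln --numeric-ports | grep <ip_port>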


McKinley, Reid wrote:
> Yes, I had tcpdump running in separate sessions on both servers.
>
> The port is correct.  Here is the cluster.conf.
>
> node:
> ip_port = 
> ip_address = 192.168.0.218
> number = 0
> name = nyclx1
> cluster = tiaa
>
> node:
> ip_port = 
> ip_address = 192.168.0.217
> number = 1
> name = nyclx2
> cluster = tiaa
>
> cluster:
> node_count = 2
> name = tiaa
>
> -Original Message-
> From: Sunil Mushran [mailto:sunil.mush...@oracle.com] 
> Sent: Wednesday, June 03, 2009 5:35 PM
> To: McKinley, Reid
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] O2CB heartbeat not active on 2nd node
>
> Did you have tcpdump running on a terminal when you attempted
> the mount on another terminal? Are the interface and port correct?
>
> It is one thing not to see the packets on nyclx2. But what
> confuses me is that there is no traffic on nyclx1 either.
>   






___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users