Please ignore my last message; it was garbled by a cut-and-paste error. It should read as below:

I cannot do much about checking for network issues because the network is not
under my control.

We are thinking about taking node 0, the node that always gets fenced, out of
the cluster, so I ran the test below just to check that we will not have
issues with nodes 1 and 2.

When I tried to use the private IP, the situation got worse. I have documented
what I did below, and I am hoping that somebody will be able to figure out
what is happening.

I took node 0 out of the picture, i.e. I shut it down (I have not yet removed
it from the RAC or ocfs2 cluster).


Sequence of events with Public IP
1. stop all ocfs2 services/cluster on nodes 1 and 2 (ok)
2. unmount the ocfs2 fs on nodes 1 and 2 (ok)
3. verify that the IP in cluster.conf is the public IP (ok)
4. start all ocfs2 services/cluster on nodes 1 and 2 (ok)
5. On node 2:
   - mount -at ocfs2 (ok)
   - df (shows the ocfs2 fs mounted)
6. On node 1:
   - mount -at ocfs2 (ok, but note the error below)
   ora2:~ # mount -at ocfs2
mount.ocfs2: Device or resource busy while mounting /dev/sdb1 on /u02/oradata/orcl
   ora2:~ # df (shows the ocfs2 fs mounted)

dmesg so far for node 2
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
o2net: connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2_dlm: Node 1 joins domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2

dmesg so far for node 1
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ora3 (num 2) at 10.12.1.37:7777
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 1)


Sequence of events with Private IP
1. unmount the ocfs2 fs on nodes 1 and 2 (ok)
2. stop all ocfs2 services/cluster on nodes 1 and 2 (ok)
3. change the IP in cluster.conf to the private IP (ok)
4. verify that the private IP can be pinged from/to node 1 and node 2 (ok)
5. start all ocfs2 services/cluster on nodes 1 and 2 (ok)
6. On node 2:
   - mount -at ocfs2 (ok)
   - df (shows the ocfs2 fs mounted)
7. On node 1 (NOT OK):
    ora2:~ # mount -at ocfs2
mount.ocfs2: Transport endpoint is not connected while mounting /dev/sdb1 on /u02/oradata/orcl
mount.ocfs2: Transport endpoint is not connected while mounting /dev/sdb1 on /u02/oradata/orcl
    ora2:~ # df (shows no ocfs2 fs mounted)
8. unmount the ocfs2 fs on node 2 (ok)
9. mount the ocfs2 fs on node 1 (ok, but note the message below)
      ora2:~ # mount -at ocfs2
mount.ocfs2: Device or resource busy while mounting /dev/sdb1 on /u02/oradata/orcl
      ora2:~ # df (shows the ocfs2 fs mounted)
10. mount the ocfs2 fs on node 2 (NOT OK)
       ora3:~ # mount -at ocfs2
mount.ocfs2: Transport endpoint is not connected while mounting /dev/sdb1 on /u02/oradata/orcl
       ora2:~ # df (shows no ocfs2 fs mounted)
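
For reference, the relevant cluster.conf stanzas would look roughly like the
following. This is a sketch only: the private addresses are the ones that
appear in the dmesg output below, while the node numbers, names, node count,
and cluster name are assumptions based on the log messages.

```
node:
        ip_port = 7777
        ip_address = 193.168.2.2
        number = 1
        name = ora2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 193.168.2.3
        number = 2
        name = ora3
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2
```

(o2cb only reads cluster.conf at startup, so the cluster has to be fully
stopped and restarted on every node after changing the addresses, as in the
steps above.)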


dmesg now on node 2
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
o2net: connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2_dlm: Node 1 joins domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
o2net: no longer connected to node ora2 (num 1) at 10.12.1.36:7777
ocfs2: Unmounting device (8,17) on (node 2)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 2, slot 0)
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
ocfs2: Unmounting device (8,17) on (node 2)
(15650,0):o2net_start_connect:1390 ERROR: bind failed with -99 at address 193.168.2.3
(15650,0):o2net_start_connect:1421 connect attempt to node ora2 (num 1) at 193.168.2.2:7777 failed with errno -99
(15650,0):o2net_connect_expired:1445 ERROR: no connection established with node 1 after 10 seconds, giving up and returning errors.
(21431,0):dlm_request_join:786 ERROR: status = -107
(21431,0):dlm_try_to_join_domain:934 ERROR: status = -107
(21431,0):dlm_join_domain:1186 ERROR: status = -107
(21431,0):dlm_register_domain:1379 ERROR: status = -107
(21431,0):ocfs2_dlm_init:2007 ERROR: status = -107
(21431,0):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 2)
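
A side note on the negative status codes above: they are Linux kernel errno
values with the sign flipped, and Python's standard errno table can decode
them (a quick illustrative helper, not part of any OCFS2 tooling; the numeric
values are Linux-specific):

```python
import errno
import os

# Kernel log messages report failures as negative errno values.
# Decoding the two codes seen in the dmesg output above:
for status in (-99, -107):
    code = -status
    print(status, "=", errno.errorcode[code], "-", os.strerror(code))
# -99  = EADDRNOTAVAIL ("Cannot assign requested address")
# -107 = ENOTCONN      ("Transport endpoint is not connected")
```

So the bind failure with -99 suggests that 193.168.2.3 was not actually
assigned to any interface at the moment o2net tried to bind to it, and the
-107 failures in dlm_request_join/ocfs2_mount_volume are the downstream
consequence: no cluster connection was ever established.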


dmesg now on node 1
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
o2net: accepted connection from node ora3 (num 2) at 10.12.1.37:7777
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1 2
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 1)
ocfs2_dlm: Node 2 leaves domain A7AE746FB3D34479A4B04C0535A0A341
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1
o2net: no longer connected to node ora3 (num 2) at 10.12.1.37:7777
ocfs2: Unmounting device (8,17) on (node 1)
OCFS2 Node Manager 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLM 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 DLMFS 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
OCFS2 User DLM kernel interface loaded
OCFS2 1.2.3-SLES Wed Aug  9 13:16:58 PDT 2006 (build sles)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with node 2 after 10 seconds, giving up and returning errors.
(19360,1):dlm_request_join:786 ERROR: status = -107
(19360,1):dlm_try_to_join_domain:934 ERROR: status = -107
(19360,1):dlm_join_domain:1186 ERROR: status = -107
(19360,1):dlm_register_domain:1379 ERROR: status = -107
(19360,1):ocfs2_dlm_init:2007 ERROR: status = -107
(19360,1):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 1)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with node 2 after 10 seconds, giving up and returning errors.
(19409,0):dlm_request_join:786 ERROR: status = -107
(19409,0):dlm_try_to_join_domain:934 ERROR: status = -107
(19409,0):dlm_join_domain:1186 ERROR: status = -107
(19409,0):dlm_register_domain:1379 ERROR: status = -107
(19409,0):ocfs2_dlm_init:2007 ERROR: status = -107
(19409,0):ocfs2_mount_volume:1064 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 1)
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 1
kjournald starting.  Commit interval 5 seconds
ocfs2: Mounting device (8,17) on (node 1, slot 0)
(18294,0):o2net_connect_expired:1445 ERROR: no connection established with node 2 after 10 seconds, giving up and returning errors.







----Original Message Follows----
From: Sunil Mushran <[EMAIL PROTECTED]>
To: enohi ibekwe <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED], [EMAIL PROTECTED], ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Wed, 11 Apr 2007 14:14:41 -0700

Use private.

enohi ibekwe wrote:
The IP address on the cluster.conf file is the public IP address for the nodes.

----Original Message Follows----
From: Sunil Mushran <[EMAIL PROTECTED]>
To: enohi ibekwe <[EMAIL PROTECTED]>
CC: [EMAIL PROTECTED], [EMAIL PROTECTED], ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Wed, 11 Apr 2007 10:04:24 -0700

Are you using a private or a public network?

enohi ibekwe wrote:
Thanks for your help so far.

My issue is the frequency at which node 0 gets fenced; it has happened at least once a day for the last 2 days.

More details:

I am attempting to add a node (node 2) to an existing 2-node (node 0 and
node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
i686) + ocfs2 1.2.1-4.2, the ocfs2 package that ships with SLES9. Node 2 is
not part of the RAC cluster yet; I have only installed ocfs2 on it. I can
mount the ocfs2 file system on all nodes, and the ocfs2 file system is
accessible from all nodes.

Node 0 is always the node that gets fenced, and it gets fenced very
frequently. Before I added the kernel.panic parameter, node 0 would get
fenced, panic, and hang. Only a power cycle would make it responsive again.

This is what happened this morning.

I was remotely connected to node 0 via ssh. Then I suddenly lost the
connection. I tried to ssh again but node 0 refused the connection.

Checking node 1's dmesg I found:
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
10 seconds, shutting it down.
(0,3):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
1176196519.600486:1176196519.600489)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777

Checking node 2's dmesg I found:
ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
10 seconds, shutting it down.
(0,0):o2net_idle_timer:1310 here are some times that might help debug the
situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
1176196505.704238:1176196505.704240)
o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777

Since I had set reboot-on-panic on node 0, node 0 restarted. Checking
/var/log/messages I found:
Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
this node because it is only connected to 1 nodes and 2 is needed to make a
quorum out of 3 heartbeating nodes
Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
stopping heartbeat on all active regions.
Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
this system by panicing.
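
The quorum arithmetic in that message is a plain majority rule: a node stays
up only while it is connected to more than half of the heartbeating nodes,
counting itself. A small sketch of that rule (my own illustration, not OCFS2
code):

```python
def has_quorum(connected: int, heartbeating: int) -> bool:
    """True if `connected` nodes (including this one) form a
    majority of the `heartbeating` nodes."""
    needed = heartbeating // 2 + 1
    return connected >= needed

# Node 0 was connected only to itself: 1 of 3, but 2 of 3 are needed.
print(has_quorum(1, 3))  # False -> node fences itself
print(has_quorum(2, 3))  # True  -> node stays up
```

This is why adding node 2 changed the behavior: in a 2-node cluster a
different tiebreaker applies, but with 3 heartbeating nodes any node that
loses its network links to both peers immediately loses quorum and fences.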




----Original Message Follows----
From: "Alexei_Roudnev" <[EMAIL PROTECTED]>
To: "Jeff Mahoney" <[EMAIL PROTECTED]>,"enohi ibekwe" <[EMAIL PROTECTED]>
CC: <ocfs2-users@oss.oracle.com>
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
Date: Mon, 9 Apr 2007 11:00:30 -0700

It's not just an issue; it is really an OCFSv2 killer:
- In 99% of cases it is not a split-brain condition but just a short (20-30
second) network interruption. In most cases the systems can still see each
other over the network or through the voting disk, so they can communicate
one way or another.
- In 90% of cases the system has no pending IO activity, so it has no reason
to fence itself, at least until some IO happens on the OCFSv2 file system.
For example, if OCFSv2 is used for backups, it is active for 3 hours at night
plus restore time only, and the server could simply remount it without any
fencing if it lost consensus.
- The timeouts and other fencing parameters are badly designed, which makes
the problem worse. It can't work out of the box on most SAN networks (where
reconfiguration timeouts are all around 30 seconds to 1 minute by default).
For example, a NetApp cluster takeover takes about 20 seconds and a giveback
about 40 seconds, which kills OCFSv2 with 100% certainty (with default
settings). The STP timeout (in classical mode) is 40 seconds, which kills
OCFSv2 with 100% certainty. Network switch reboot time is about 1 minute for
most switches, which kills OCFSv2 with 100% certainty. The result: if I
reboot a staging network switch, all standalone servers keep working, all RAC
clusters keep working, all other servers keep working, and all the OCFSv2
clusters fence themselves.

For my part, I have banned OCFSv2 from any usage except backups and archive
logs, and only with a cross-connect cable for the heartbeat. All other
scenarios are catastrophic (they cause overall cluster failure in many
cases), and all because of this fencing behavior.

PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known problem
in buffer use: it doesn't release small buffers after a file is
created/deleted (so if you run a create-file/remove-file loop for a long
time, you will deplete system memory in approximately a few days). This is
not an issue if the files are big enough (Oracle backups, Oracle archive
logs, application home), but it must be taken into account if you have more
than 100,000 - 1,000,000 files on your OCFSv2 file system(s).

But the fencing problem exists in all versions (it is a little better in
modern ones, because the developers added a configurable network timeout).
Combined with the _one heartbeat interface only_ design and the _no serial
heartbeat_ design, it really becomes a problem, and that's why I was thinking
about testing OCFSv2 on SLES10 with heartbeat2 (heartbeat2 has a very
reliable heartbeat and has external fencing, but unfortunately SLES10 is de
facto not production-ready for us yet).
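
In OCFS2 releases that support configurable timeouts, they are set through
the o2cb configuration (typically /etc/sysconfig/o2cb on SLES). The variable
names and their availability vary by version, so the fragment below is an
illustrative sketch only, not recommended settings:

```
# /etc/sysconfig/o2cb -- names and availability vary by OCFS2 version
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
# disk heartbeat iterations a node may miss before being declared dead
O2CB_HEARTBEAT_THRESHOLD=31
# network idle timeout in milliseconds (only in releases that added
# the tunable network timeouts)
O2CB_IDLE_TIMEOUT_MS=30000
```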



----- Original Message -----
From: "Jeff Mahoney" <[EMAIL PROTECTED]>
To: "enohi ibekwe" <[EMAIL PROTECTED]>
Cc: <ocfs2-users@oss.oracle.com>
Sent: Saturday, April 07, 2007 12:06 PM
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic


> enohi ibekwe wrote:
> > Is this also an issue on SLES9?
> >
> > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
> > the error on the same box on the cluster.
>
> I'm not sure what you mean by "issue." This is designed behavior. When
> the cluster ends up in a split condition, one or more nodes will fence
> themselves.
>
> - -Jeff
>
> - --
> Jeff Mahoney
> SUSE Labs
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>










