Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

2013-04-12 Thread Kristiansen Morten
All nodes were powered down cleanly. The database was stopped and Grid 
Infrastructure was shut down manually, but the ocfs2 cluster was not stopped 
manually. Nothing happened when the nodes were shut down the first time. After 
the first planned reboot, nodes 1, 2, 3 and 7 seemed to be OK, but the 
sysadmins had to look into nodes 4, 5 and 6 because of disk problems. Grid 
Infrastructure was started on nodes 1, 2 and 3, but it wouldn't start on node 
7. The DBA checked that the node had disks, but not that the disks were in the 
proper order, i.e. that disk02 really was disk02, and so on. The DBA thought a 
reboot would probably fix it, so he disabled Grid Infrastructure, did nothing 
to ocfs2 and rebooted the server. That seemed to reboot all the other nodes as 
well.

After the second reboot, which was uncontrolled, Grid Infrastructure was 
started one by one on nodes 1, 2 and 3. At 03:00 the three nodes were running 
the database. At 04:06 the cluster went down again, unplanned. Nobody knows 
why, but the sysadmins said they saw a kernel panic. In /var/log/messages the 
"Kernel BUG at ...shran/BUILD/ocfs2-1.4.7..." message came again, as it did at 
02:25.

Then, when the nodes came up again, the database was started on nodes 4, 5 and 
6. It wasn't possible to start CRS on nodes 1, 2, 3 and 7. The sysadmins did 
something with those nodes, and after another reboot of just those nodes, CRS 
was able to start again. So the instances on nodes 1, 2 and 3 were started, but 
we didn't start anything on node 7 because we were afraid of shutting down the 
cluster again.

Got a mail from Sunil saying I had to "ping Oracle". So I guess I'll do that.
 
Morten K. 
Tlf: +47 76 16 61 81 | Mob: +47 906 52 903 
Quality - Safety - Respect




Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

2013-04-11 Thread Joel Becker
Did you power down nodes uncleanly?  The message says that one node
lost track of who was doing a particular recovery.  If nodes are shut
down cleanly, they should be communicating that information.

Joel


Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

2013-04-11 Thread Kristiansen Morten
I've had no response to my problem. Is there anybody who can help me with this?

Morten K.

Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect



___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

[Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

2013-03-21 Thread Kristiansen Morten
Hi,

We are running an 8-node cluster on RHEL 2.6.18-128 64-bit. Yesterday the 
server/SAN guys moved the ocfs2 disks to another SAN by mirroring and 
synchronizing the disks. When they rebooted the servers, one of the nodes, 
tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure; the voting 
disk was not found. Then we tried to reboot that node, which caused all nodes 
to reboot, at around 02:25. When examining /var/log/messages I discovered a BUG 
message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've 
tried to google it, but I couldn't find any solution. Is this a well-known bug? 
Does anybody have a solution to this problem?

Below is an extract of the o2net and ocfs2 messages from the /var/log/messages 
file.

/var/log/messages for tos-dipsprod-07:
Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102: has been idle for 10.0 seconds, shutting it down.
Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.
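
For anyone reading an extract like the one above, a small script can make the timing easier to see. What follows is only a sketch, not an OCFS2 tool: it assumes each log entry sits on a single line (mail clients tend to wrap them), and the names `parse_idle_events` and `group_bursts`, the default year, and the two-minute gap are all hypothetical choices made for illustration. It groups the o2net idle-timeout messages into bursts so that the moments when several peers dropped together stand out.

```python
import re
from datetime import datetime, timedelta

# Matches the o2net idle-timeout message format seen in the extract above.
IDLE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: o2net: "
    r"connection to node (?P<peer>\S+) \(num (?P<num>\d+)\) at [\d.]+:? "
    r"has been idle for"
)


def parse_idle_events(lines, year=2013):
    """Return (timestamp, peer name, node number) for each idle-timeout line.

    syslog timestamps carry no year, so `year` is an assumption supplied
    by the caller.
    """
    events = []
    for line in lines:
        m = IDLE_RE.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            events.append((ts, m["peer"], int(m["num"])))
    return events


def group_bursts(events, gap=timedelta(minutes=2)):
    """Group events so that consecutive events at most `gap` apart share a burst."""
    bursts = []
    for ev in sorted(events):
        if bursts and ev[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(ev)
        else:
            bursts.append([ev])
    return bursts


# Four entries copied from the extract, each rejoined onto one line.
SAMPLE = [
    "Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.",
    "Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.",
    "Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.",
    "Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.",
]

if __name__ == "__main__":
    for burst in group_bursts(parse_idle_events(SAMPLE)):
        peers = ", ".join(f"{peer} (num {num})" for _, peer, num in burst)
        print(burst[0][0].strftime("%H:%M:%S"), "->", peers)
```

Run against the full file, the bursts should line up with the reboot times described in the message; a burst only shows that connections went idle together, not why.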

And here from tos-dipsprod-02:
10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
--
17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
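
The DLM messages above name three roles per event: the node that sent the message, the node this host expected to be the recovery master, and the dead node. A rough sketch like the one below can pull those numbers out so the mismatch that precedes each Kernel BUG line is easy to compare across incidents. It is an illustration only: the regular expressions are written against the message text as it appears in this extract, `summarize` is a hypothetical helper, and each entry is assumed to be on one line. It does not reproduce what dlmrecovery.c does.

```python
import re

# Patterns follow the wording of the two DLM messages in the extract above.
BEGIN_RE = re.compile(
    r"dlm_begin_reco_handler:\d+ (?P<domain>[0-9A-F]+): "
    r"dead_node previously set to (?P<old_dead>\d+), "
    r"node (?P<sender>\d+) changing it to (?P<new_dead>\d+)"
)
FINALIZE_RE = re.compile(
    r"dlm_finalize_reco_handler:\d+ ERROR: node (?P<sender>\d+) sent recovery "
    r"finalize msg, but node (?P<expected>\d+) is supposed to be the new master, "
    r"dead=(?P<dead>\d+)"
)


def summarize(lines):
    """Return (kind, fields) tuples for the DLM recovery lines that match."""
    out = []
    for line in lines:
        m = FINALIZE_RE.search(line)
        if m:
            fields = {k: int(v) for k, v in m.groupdict().items()}
            out.append(("finalize_mismatch", fields))
            continue
        m = BEGIN_RE.search(line)
        if m:
            fields = {
                "sender": int(m["sender"]),
                "old_dead": int(m["old_dead"]),
                "new_dead": int(m["new_dead"]),
            }
            out.append(("begin_reco", fields))
    return out


# Three entries copied from the extract, each rejoined onto one line.
SAMPLE = [
    "10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7",
    "10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7",
    "17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7",
]

if __name__ == "__main__":
    for kind, fields in summarize(SAMPLE):
        print(kind, fields)
```

In both incidents the finalize message came from node 6 while this host expected a different master (3 the first time, 255 the second). The value 255 looks like a placeholder for "no master recorded", but that reading is an assumption and is not confirmed anywhere in this thread.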


Morten Kristiansen | Counsellor
Helse Nord IKT | Department of Service Production

Tlf: +47 76 16 61 81 | Mob: +47 906 52 903
Office address:  Amtmann Worsøes gate 63, 8012 Bodø, Norway
Quality  - Safety - Respect



