Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
All nodes were powered down cleanly. The database was stopped and Grid Infrastructure was shut down manually, but the ocfs2 cluster was not stopped manually. Nothing happened when the nodes were shut down the first time.

After the first planned reboot, nodes 1, 2, 3 and 7 seemed to be OK, but the sysadmins had to look into nodes 4, 5 and 6 because of disk problems. Grid Infrastructure was started on nodes 1, 2 and 3, but it wouldn't start on node 7. The DBA checked that the node had disks, but not that the disks were in the proper order, meaning that disk02 really was disk02, etc. The DBA thought a reboot would probably fix it, so he disabled Grid Infrastructure, did nothing to ocfs2, and rebooted the server. That seemed to reboot all the other nodes as well.

After the second reboot, which was uncontrolled, Grid Infrastructure was started one by one on nodes 1, 2 and 3. At 03:00 the three nodes were running the database. At 04:06 the cluster went down again, unplanned. Nobody knows why, but the sysadmins said they saw a kernel panic. In /var/log/messages the "Kernel BUG at ...shran/BUILD/ocfs2-1.4.7..." message came again, as it did at 02:25.

When the nodes came up again, the database was started on nodes 4, 5 and 6. It wasn't possible to start CRS on nodes 1, 2, 3 and 7. The sysadmins did something with those nodes, and after another reboot of just those nodes CRS was able to start again. So the instances on nodes 1, 2 and 3 were started, but we didn't start anything on node 7 because we were afraid of bringing down the cluster again.

I got a mail from Sunil saying I had to "ping Oracle", so I guess I'll do that.

Morten K.
Tel: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect

-----Original Message-----
From: Joel Becker [mailto:jl...@ftp.linux.org.uk] On Behalf Of Joel Becker
Sent: 11 April 2013 21:04
To: Kristiansen Morten
Cc: ocfs2-users@oss.oracle.com
Subject: Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.

Did you power down nodes uncleanly? The message says that one node lost track of who was doing a particular recovery. If nodes are shut down cleanly, they should be communicating that information.

Joel
Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
Did you power down nodes uncleanly? The message says that one node lost track of who was doing a particular recovery. If nodes are shut down cleanly, they should be communicating that information.

Joel

On Thu, Apr 11, 2013 at 12:10:22PM +0200, Kristiansen Morten wrote:
> I've had no response to my problem. Is there anybody who can help me with this?
>
> Morten K.
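For anyone comparing similar logs: the two dlm_finalize_reco_handler errors in the report differ in which node the receiver expected to be the recovery master. A minimal Python sketch that pulls those fields out of /var/log/messages lines could look like the following. It is not from the thread; the regex and the function name `find_finalize_mismatches` are my own, written against the message text quoted in the report, and it assumes each syslog entry sits on a single line.

```python
import re

# Matches the error text printed by dlm_finalize_reco_handler in the report.
# The pattern is an assumption based only on the two lines quoted there.
FINALIZE_RE = re.compile(
    r"node (?P<sender>\d+) sent recovery finalize msg, "
    r"but node (?P<expected>\d+) is supposed to be the new master, "
    r"dead=(?P<dead>\d+)"
)


def find_finalize_mismatches(lines):
    """Return (sender, expected_master, dead_node) for each matching line."""
    results = []
    for line in lines:
        m = FINALIZE_RE.search(line)
        if m:
            results.append(
                (int(m["sender"]), int(m["expected"]), int(m["dead"]))
            )
    return results


# The two error lines from tos-dipsprod-02, rejoined onto single lines.
SAMPLE = [
    "Mar 21 02:25:25 tos-dipsprod-02 kernel: "
    "(o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent "
    "recovery finalize msg, but node 3 is supposed to be the new master, "
    "dead=7",
    "Mar 21 04:06:29 tos-dipsprod-02 kernel: "
    "(o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent "
    "recovery finalize msg, but node 255 is supposed to be the new master, "
    "dead=7",
]

if __name__ == "__main__":
    for sender, expected, dead in find_finalize_mismatches(SAMPLE):
        print(
            f"finalize from node {sender}, "
            f"expected master {expected}, dead node {dead}"
        )
```

In both crashes the finalize message came from node 6 while tos-dipsprod-02 expected a different master, which is the kind of disagreement Joel describes.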
Re: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
I've had no response to my problem. Is there anybody who can help me with this?

Morten K.
Tel: +47 76 16 61 81 | Mob: +47 906 52 903
Quality - Safety - Respect

From: ocfs2-users-boun...@oss.oracle.com [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of Kristiansen Morten
Sent: 21 March 2013 14:47
To: ocfs2-users@oss.oracle.com
Subject: [Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
[Ocfs2-users] Shutting down one node caused all the other nodes to shutdown aswell.
Hi,

We are running an 8-node cluster on RHEL 2.6.18-128 64-bit. Yesterday the server/SAN guys moved the ocfs2 disks to another SAN by mirroring and synchronizing the disks. When they rebooted the servers, one of the nodes, tos-dipsprod-07, wasn't able to start Oracle Grid Infrastructure; the voting disk was not found. Then we tried to reboot that node, which caused all nodes to reboot, at around 02:25. When examining /var/log/messages I discovered a BUG message on one of the nodes that rebooted unexpectedly, tos-dipsprod-02. I've tried to google it, but I couldn't find any solution. Is this a well-known bug? Does anybody have a solution to this problem?

Below is an extract of the o2net and ocfs2 messages from the /var/log/messages files.

/var/log/messages on tos-dipsprod-07:

Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:40 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:45 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 02:25:54 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-04 (num 5) at 192.168.7.103: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:03:17 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:37 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.
Mar 21 04:06:47 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-03 (num 2) at 192.168.7.102: has been idle for 10.0 seconds, shutting it down.
Mar 21 06:04:25 tos-dipsprod-07 kernel: o2net: connection to node tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 seconds, shutting it down.

And here from tos-dipsprod-02:

10474-Mar 21 02:25:15 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 7, node 3 changing it to 7
10646-Mar 21 02:25:25 tos-dipsprod-02 kernel: (o2net,7452,5):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 3 is supposed to be the new master, dead=7
10826:Mar 21 02:25:25 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
10939-Mar 21 02:43:01 tos-dipsprod-02 syslogd 1.4.1: restart.
10995-Mar 21 02:43:02 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.
--
17537-Mar 21 04:06:19 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_begin_reco_handler:2730 992D008CD522447C8333FC34BD46F8CD: dead_node previously set to 6, node 6 changing it to 7
17709-Mar 21 04:06:29 tos-dipsprod-02 kernel: (o2net,7472,1):dlm_finalize_reco_handler:2839 ERROR: node 6 sent recovery finalize msg, but node 255 is supposed to be the new master, dead=7
17891:Mar 21 04:06:29 tos-dipsprod-02 kernel: Kernel BUG at ...shran/BUILD/ocfs2-1.4.7/fs/ocfs2/dlm/dlmrecovery.c:2840
18004-Mar 21 04:38:04 tos-dipsprod-02 syslogd 1.4.1: restart.
18060-Mar 21 04:41:33 tos-dipsprod-02 modprobe: FATAL: Module ocfs2_stackglue not found.

Morten Kristiansen | Counsellor
Helse Nord IKT | Department of Service Production
Tel: +47 76 16 61 81 | Mob: +47 906 52 903
Office address: Amtmann Worsøes gate 63, 8012 Bodø, Norway
Quality - Safety - Respect

___
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users
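
To see how the o2net idle-timeout messages from tos-dipsprod-07 cluster around the two BUG timestamps (02:25 and 04:06), one option is to bucket them by minute. The following is a rough Python sketch, assuming the single-line syslog format shown in the report; `idle_peers_by_minute` and the per-minute bucketing are illustrative choices of mine, not part of any ocfs2 tool.

```python
import re
from collections import defaultdict

# Matches the o2net idle-timeout lines quoted in the report. "stamp" keeps
# month, day, hour and minute; the seconds are deliberately dropped.
IDLE_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d{2}:\d{2}):\d{2} \S+ kernel: o2net: "
    r"connection to node (?P<peer>\S+) \(num \d+\) at \S+ has been idle"
)


def idle_peers_by_minute(lines):
    """Map 'Mon DD HH:MM' to the peers whose connection idled out then."""
    buckets = defaultdict(list)
    for line in lines:
        m = IDLE_RE.match(line)
        if m:
            buckets[m["stamp"]].append(m["peer"])
    return dict(buckets)


# A few of the tos-dipsprod-07 lines from the report.
SAMPLE = [
    "Mar 21 02:08:49 tos-dipsprod-07 kernel: o2net: connection to node "
    "tos-dipsprod-06 (num 3) at 192.168.7.105: has been idle for 10.0 "
    "seconds, shutting it down.",
    "Mar 21 02:25:25 tos-dipsprod-07 kernel: o2net: connection to node "
    "tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 "
    "seconds, shutting it down.",
    "Mar 21 02:25:35 tos-dipsprod-07 kernel: o2net: connection to node "
    "tos-dipsprod-02 (num 1) at 192.168.7.101: has been idle for 10.0 "
    "seconds, shutting it down.",
    "Mar 21 04:06:32 tos-dipsprod-07 kernel: o2net: connection to node "
    "tos-dipsprod-01 (num 0) at 192.168.7.100: has been idle for 10.0 "
    "seconds, shutting it down.",
]

if __name__ == "__main__":
    for stamp, peers in idle_peers_by_minute(SAMPLE).items():
        print(stamp, ", ".join(peers))
```

Bucketing by minute is crude, since a burst that straddles a minute boundary would be split, but it is enough to line the disconnects up against the BUG entries on tos-dipsprod-02.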