We appear to be stuck in a loop. To find out why the node is rebooting, you have to have netconsole set up; the messages a panicking node prints rarely make it to its own /var/log/messages before the reset. Ping support if you need help setting up netconsole.
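
For reference, a minimal netconsole setup on EL5 looks something like the sketch below. The interface eth0, the receiver address 172.25.29.100, its MAC address, and port 6666 are placeholders; substitute your own. The receiver just needs to be a box outside the cluster that stays up:

    # On the node that keeps resetting (e.g. alf3). The module parameter
    # format is netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
    modprobe netconsole netconsole=@/eth0,6666@172.25.29.100/00:0c:29:aa:bb:cc

    # On the receiving box, capture everything the node prints:
    nc -u -l -p 6666 | tee alf3-console.log    # some netcat builds want: nc -u -l 6666

Add the modprobe line to /etc/rc.local (or use the netconsole init script if your install ships one) so it survives reboots. Once it is in place, the panic that precedes the next reboot should show up in alf3-console.log on the receiver.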
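
Also, for what it's worth: the "idle for 30.0 seconds" in the alf1 log quoted below is o2net hitting the O2CB network idle timeout, which defaults to 30000 ms in ocfs2 1.4. You can check what your cluster is running with; the variable names below are from ocfs2-tools 1.4, must be identical on every node, and take effect only after the cluster stack is restarted. Note that raising timeouts only papers over whatever is stalling the node; the netconsole output is what will show the real cause.

    # Inspect the configured cluster timeouts:
    grep O2CB_ /etc/sysconfig/o2cb
    #   O2CB_HEARTBEAT_THRESHOLD=31   # disk heartbeat: node fences after ~(31-1)*2 sec
    #   O2CB_IDLE_TIMEOUT_MS=30000    # network idle timeout seen in the log below
    #   O2CB_KEEPALIVE_DELAY_MS=2000
    #   O2CB_RECONNECT_DELAY_MS=2000

    # Or set the same values interactively on each node:
    service o2cb configure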
Raheel Akhtar wrote:
> Thanks. One of the nodes (alf3) rebooted, and here is the log from
> another node, alf1, showing errors about node 3. Why did node 3 reboot?
>
> -------------------------------
> Jul 29 10:15:57 alf1 kernel: o2net: connection to node alf3 (num 3) at 172.25.29.13:7777 has been idle for 30.0 seconds, shutting it down.
> Jul 29 10:15:57 alf1 kernel: (0,1):o2net_idle_timer:1506 here are some times that might help debug the situation: (tmr 1248876927.861591 now 1248876957.858464 dr 1248876927.861556 adv 1248876927.861622:1248876927.861623 func (0ffa2aed:506) 1248876927.861592:1248876927.861604)
> Jul 29 10:15:57 alf1 kernel: o2net: no longer connected to node alf3 (num 3) at 172.25.29.13:7777
> Jul 29 10:16:27 alf1 kernel: (2600,1):o2net_connect_expired:1667 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
> Jul 29 10:17:27 alf1 last message repeated 2 times
> Jul 29 10:17:30 alf1 kernel: (2618,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 3
> Jul 29 10:17:32 alf1 kernel: (2629,2):dlm_get_lock_resource:844 7BE7E9E2026A40F8801B56257D805C88:$RECOVERY: at least one node (3) to recover before lock mastery can begin
> Jul 29 10:17:32 alf1 kernel: (2629,2):dlm_get_lock_resource:878 7BE7E9E2026A40F8801B56257D805C88: recovery map is not empty, but must master $RECOVERY lock now
> Jul 29 10:17:32 alf1 kernel: (2629,1):dlm_do_recovery:524 (2629) Node 1 is the Recovery Master for the Dead Node 3 for Domain 7BE7E9E2026A40F8801B56257D805C88
> Jul 29 10:17:34 alf1 kernel: o2net: accepted connection from node alf3 (num 3) at 172.25.29.13:7777
> Jul 29 10:17:38 alf1 kernel: ocfs2_dlm: Node 3 joins domain 7BE7E9E2026A40F8801B56257D805C88
> Jul 29 10:17:38 alf1 kernel: ocfs2_dlm: Nodes in domain ("7BE7E9E2026A40F8801B56257D805C88"): 1 2 3 4 5
> Jul 29 11:09:42 alf1 kernel: o2net: connected to node alf0 (num 0) at 172.25.29.10:7777
> Jul 29 11:09:45 alf1 kernel: ocfs2_dlm: Node 0 joins domain 7BE7E9E2026A40F8801B56257D805C88
> Jul 29 11:09:45 alf1 kernel: ocfs2_dlm: Nodes in domain ("7BE7E9E2026A40F8801B56257D805C88"): 0 1 2 3 4 5
> ----------------------------------
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
> Sent: Wednesday, July 29, 2009 1:25 PM
> To: Raheel Akhtar
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Error message while booting system
>
> The ocfs2_stackglue "not found" error message is harmless.
> We use the same init script for all versions of the fs; stackglue
> is present in the current mainline and will be in ocfs2 1.6.
>
> Raheel Akhtar wrote:
>> Hi,
>>
>> While the system is booting we get the error message "modprobe: FATAL:
>> Module ocfs2_stackglue not found" in the log. Some nodes also reboot
>> without any error message.
>>
>> -------------------------------------------------
>> Jul 27 10:02:19 alf3 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
>> Jul 27 10:02:19 alf3 kernel: Netfilter messages via NETLINK v0.30.
>> Jul 27 10:02:19 alf3 kernel: ip_conntrack version 2.4 (8192 buckets, 65536 max) - 304 bytes per conntrack
>> Jul 27 10:02:19 alf3 kernel: e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
>> Jul 27 10:02:20 alf3 setroubleshoot: [server.ERROR] cannot start system DBus service: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
>> Jul 27 10:02:20 alf3 kernel: VMware memory control driver initialized
>> Jul 27 10:02:20 alf3 kernel: e1000: eth0: e1000_set_tso: TSO is Enabled
>> Jul 27 10:02:21 alf3 modprobe: FATAL: Module ocfs2_stackglue not found.
>> Jul 27 10:02:21 alf3 kernel: OCFS2 Node Manager 1.4.2 Wed Jul 1 19:55:44 PDT 2009 (build 0b9eb999c4d39c0d4b66219a2752cda6)
>> Jul 27 10:02:21 alf3 kernel: OCFS2 DLM 1.4.2 Wed Jul 1 19:55:44 PDT 2009 (build 0faae8d4263a8c594749be558d8d7edd)
>> Jul 27 10:02:21 alf3 kernel: OCFS2 DLMFS 1.4.2 Wed Jul 1 19:55:44 PDT 2009 (build 0faae8d4263a8c594749be558d8d7edd)
>> Jul 27 10:02:21 alf3 kernel: OCFS2 User DLM kernel interface loaded
>> Jul 27 10:02:25 alf3 kernel: o2net: connected to node alf0 (num 0) at 172.25.29.10:7777
>> Jul 27 10:02:25 alf3 kernel: o2net: connected to node alf2 (num 2) at 172.25.29.12:7777
>> Jul 27 10:02:25 alf3 kernel: o2net: accepted connection from node alf5 (num 5) at 172.25.29.15:7777
>> Jul 27 10:02:26 alf3 kernel: o2net: accepted connection from node alf4 (num 4) at 172.25.29.14:7777
>> Jul 27 10:02:27 alf3 kernel: o2net: connected to node alf1 (num 1) at 172.25.29.11:7777
>> Jul 27 10:02:31 alf3 kernel: OCFS2 1.4.2 Wed Jul 1 19:55:41 PDT 2009 (build 966fd2793489955b2271e7bb7e691088)
>> Jul 27 10:02:31 alf3 kernel: ocfs2_dlm: Nodes in domain ("7BE7E9E2026A40F8801B56257D805C88"): 0 1 2 3 4 5
>>
>> The kernel log from another node, alf1, for the above node alf3 is:
>>
>> [snipped; identical to the alf1 log quoted at the top of this message]
>>
>> OS = Red Hat 5.2
>>
>> [r...@alf3 /]# uname -a
>> Linux alf3 2.6.18-128.1.16.el5 #1 SMP Fri Jun 26 10:53:31 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>>
>> [r...@alf3 /]# rpm -qa | grep ocfs2
>> ocfs2-tools-1.4.2-1.el5
>> ocfs2-2.6.18-128.1.16.el5-1.4.2-1.el5
>> ocfs2console-1.4.2-1.el5
>>
>> Any help will be appreciated; the OCFS2 cluster is not stable. We mount
>> the file system for file sharing with Alfresco.
>>
>> Thanks
>>
>> Raheel

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users