You may want to try increasing the network timeout. You will have to do it on all nodes.
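As a rough sketch of what that looks like, the o2cb cluster stack reads its timeouts from /etc/sysconfig/o2cb on EL4 systems. The parameter names below are those used by ocfs2-tools 1.2.5 and later (older tools only expose the heartbeat threshold), and the values shown are illustrative, not recommendations; check the FAQ entries referenced below for values appropriate to your setup:

```shell
# /etc/sysconfig/o2cb -- must be identical on every node in the cluster
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2

# Disk heartbeat: number of 2-second iterations before a node
# is declared dead
O2CB_HEARTBEAT_THRESHOLD=31

# Network idle timeout in milliseconds -- this is the "idle for
# 10.0 seconds" that o2net logged below; raising it gives a
# stalled node more time before its connections are torn down
O2CB_IDLE_TIMEOUT_MS=30000

O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
```

After changing the values, the cluster stack has to be restarted on each node (e.g. via the o2cb init script) with the filesystems unmounted; nodes with mismatched timeouts will refuse to join the cluster.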
See the FAQ
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
with special attention to #104 and #105.

Regards,

Marcos Eduardo Matsunaga
Oracle USA
Linux Engineering

paul fretter (TOC) wrote:
> To clarify,
>
> The host "node1" is the OCFS node 0 in the config file.
>
> The log entries are from another system in the cluster.
>
> Kind regards
> Paul
>
>> -----Original Message-----
>> From: paul fretter (TOC)
>> Sent: 09 October 2007 11:41
>> To: [email protected]
>> Subject: Access to OCFS2 volume paused when a node crashes
>>
>> There is a node (node1) on our cluster that for some reason hangs every
>> now and again, but it seems that when it happens it also pauses access
>> to the OCFS2 volume for the other nodes.
>>
>> We are running the latest version of OCFS2 and the tools, on RHEL4
>> (x86_64) with kernel 2.6.9-42. All nodes are connected by
>> fibrechannel to a common LUN for data sharing.
>>
>> I guess there may be something I can do with configuring timeouts
>> etc.(?), but I thought I'd check with this list first. Here is the
>> relevant info from /var/log/messages:
>>
>> Oct 9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num 0)
>> at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it down.
>> Oct 9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
>> some times that might help debug the situation: (tmr 1191925471.993435
>> now 1191925481.994292 dr 1191925471.993425 adv
>> 1191925471.993436:1191925471.993437 func (98e2d068:507)
>> 1191924562.14841:1191924562.14844)
>> Oct 9 11:24:41 jic55124 kernel: o2net: no longer connected to node
>> node1 (num 0) at 10.10.10.1:7777
>> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
>> ERROR: status = -112
>> [EMAIL PROTECTED] ~]# tail /var/log/messages
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct 9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) to recover before lock mastery can begin
>> Oct 9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
>> device (8,80): dlm has evicted node 0
>> Oct 9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) to recover before lock mastery can begin
>> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
>> ERROR: node down! 0
>> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
>> ERROR: status = -11
>> Oct 9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
>> Recovering node 0 from slot 5 on device (8,80)
>> Oct 9 11:33:50 jic55124 kernel: kjournald starting. Commit interval 5
>> seconds
>>
>> Many thanks
>> Paul Fretter
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
