Ah... I had initially mounted the volume with nordirplus, but it looks like I missed it in the fstab entry. I'll see if that fixes the problem.
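For reference, the fstab line I'll be adding on the NFS clients looks roughly like this (the server name, export path, mount point, and the options other than nordirplus are placeholders for our actual setup, not the literal entry):

```
# /etc/fstab on the NFS clients -- names/paths below are placeholders
nfs-server:/export/ocfs2vol  /mnt/shared  nfs  nordirplus,vers=3,tcp,hard  0 0
```

After editing fstab, the shares need to be remounted (or the clients rebooted) for the option to take effect.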
Thanks,
James

On Fri, Mar 25, 2011 at 08:39:36PM +0000, Sunil Mushran wrote:
> Are you mounting with nordirplus?
>
> For more, refer to this email:
> http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html
>
> On 03/25/2011 08:49 AM, James Abbott wrote:
> > Hello,
> >
> > I've recently set up an ocfs2 volume via a 4Gb/s SAN which is directly
> > mounted on two CentOS 5.5 machines (2.6.18-194.32.1.el5). Both servers
> > are exporting the volume via NFSv3 to our HPC cluster. This is to replace
> > a single NFS server exporting an ext3 volume which was unable to keep up
> > with our IO requirements. I switched over to using the new ocfs2 volume on
> > Monday, and it had been performing pretty well overall. This morning,
> > however, I saw significant loads appearing on both the NFS servers (load
> > > 30, which is not unheard of since we are running 32 NFS threads per
> > machine); however, attempting to access the shared volume resulted in a
> > hanging connection.
> >
> > Logging into the NFS servers showed that the ocfs2 volume could be
> > accessed fine and was responsive; however, the load on the machines was
> > clearly coming from nfsd. iostat showed there was no substantial activity
> > on the ocfs2 volume despite the NFS load. dmesg output on both servers
> > shows a number of hung task warnings:
> >
> > Mar 25 12:02:13 bss-adm2 kernel: INFO: task nfsd:996 blocked for more than 120 seconds.
> > Mar 25 12:02:13 bss-adm2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Mar 25 12:02:13 bss-adm2 kernel: nfsd          D ffff81041ab1d7e0     0   996      1          1008  1017 (L-TLB)
> > Mar 25 12:02:13 bss-adm2 kernel: ffff8102b8e05d00 0000000000000046 0000000000000246 ffffffff889678b9
> > Mar 25 12:02:13 bss-adm2 kernel: ffff8103eaa20000 000000000000000a ffff81041262e040 ffff81041ab1d7e0
> > Mar 25 12:02:13 bss-adm2 kernel: 00022676fafd0a61 00000000000021b6 ffff81041262e228 00000007de477ba0
> > Mar 25 12:02:13 bss-adm2 kernel: Call Trace:
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff889678b9>] :ocfs2:ocfs2_cluster_unlock+0x290/0x30d
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff88979236>] :ocfs2:ocfs2_permission+0x137/0x1a4
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8000d9d8>] permission+0x81/0xc8
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882c981>] :nfsd:nfsd_lookup_dentry+0x306/0x418
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab4b4>] :sunrpc:ip_map_match+0x19/0x30
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8882cab5>] :nfsd:nfsd_lookup+0x22/0xb0
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff887ab59e>] :sunrpc:ip_map_lookup+0xbc/0xc3
> > Mar 25 12:02:13 bss-adm2 kernel: [<ffffffff8883347d>] :nfsd:nfsd3_proc_lookup+0xc5/0xd2
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888281db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff887a8651>] :sunrpc:svc_process+0x454/0x71b
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff80064644>] __down_read+0x12/0x92
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff88828746>] :nfsd:nfsd+0x1a5/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff888285a1>] :nfsd:nfsd+0x0/0x2cb
> > Mar 25 12:02:14 bss-adm2 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
> >
> > Although these are obviously nfsd hangs, the fact that they occurred on both
> > servers at the same time makes me suspect something on the ocfs2 side. It
> > was necessary to shut down nfsd and restart the cluster nodes in order
> > for them to resume.
> >
> > Being new to ocfs2, I'm not sure quite where to look for clues as to what
> > caused this. I'm guessing from the ocfs2_cluster_unlock at the top of the
> > stack trace that this is an o2cb locking issue. The NFS traffic is going
> > over the same (1Gb) network connections as the o2cb heartbeat, so I'm
> > wondering if that may have contributed to the problem. I should be able
> > to add a separate fabric for the o2cb heartbeat if that might be the
> > cause; however, neither of the servers was fenced.
> >
> > Anyone have any suggestions?
> >
> > Many thanks,
> > James
>

--
Dr. James Abbott
Bioinformatics Software Developer
Imperial College, London

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users