On 2010-01-06, at 04:25, David Cohen wrote:
> On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
>> On 2010-01-04, at 03:02, David Cohen wrote:
>>> I'm using a mixed environment of 1.8.0.1 MDS and 1.6.6 OSS's (had a
>>> problem with qlogic drivers and rolled back to 1.6.6).
>>> My MDS gets unresponsive each day at 4-5 am local time, no kernel
>>> panic or error messages before.
>
> It was indeed the *locate update, a simple edit of /etc/updatedb.conf
> on the clients and the system is stable again.
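For anyone finding this thread in the archives, the kind of edit David describes would look roughly like the fragment below. This is only a sketch, assuming mlocate's updatedb.conf syntax (slocate's file may differ slightly). Apart from "lustre", the entries in both lists are illustrative, and /mnt/lustre is a hypothetical mount point, not taken from David's system.

```
# /etc/updatedb.conf on each Lustre client (sketch, not David's actual file)

# Filesystem types that updatedb should skip. Append "lustre" to whatever
# list your distribution already ships; the other names are illustrative.
PRUNEFS = "nfs nfs4 proc sysfs tmpfs lustre"

# Optionally exclude the mount point as well. /mnt/lustre is a
# hypothetical path; the other paths are illustrative.
PRUNEPATHS = "/tmp /var/spool /media /mnt/lustre"
```

With the filesystem type excluded, the nightly updatedb run on each client should no longer walk the whole Lustre namespace at the same time.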
I asked the upstream Fedora/RHEL maintainer of mlocate to add "lustre" to
the exception list in updatedb.conf, and he has already done so for Fedora.
There is also a bug filed for RHEL5 to do the same, if anyone is interested
in following it:

https://bugzilla.redhat.com/show_bug.cgi?id=557712

>> Judging by the time, I'd guess this is "slocate" or "mlocate" running
>> on all of your clients at the same time. This used to be a source of
>> extremely high load back in the old days, but I thought that Lustre
>> was in the exclude list in newer versions of *locate. Looking at the
>> installed mlocate on my system, that doesn't seem to be the case...
>> strange.
>>
>>> Some errors and an LBUG appear in the log after force booting the
>>> MDS and mounting the MDT and then the log is clear until next morning:
>>>
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
>>> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
>>> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
>>> Jan 4 06:33:31 tech-mds kernel: Call Trace:
>>> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
>>> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
>>> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>>
>> It shouldn't LBUG during recovery, however.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>
> --
> David Cohen
> Grid Computing
> Physics Department
> Technion - Israel Institute of Technology
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.