Re: [Lustre-discuss] stuck OSS node

Adrian Ulrich Fri, 05 Aug 2011 02:02:21 -0700

Hi Craig,

> Has anyone seen anything like this?


Yes, we have had a similar problem a couple of times.


First, try to umount all OSTs on the affected OSS.

Some OSTs will most likely fail to umount (umount gets stuck because of the 
ll_ost_io_?? thread).
Note the 'broken' OSTs, wait until the 'good' OSTs have finished umounting, 
and then kill the OSS (echo b > /proc/sysrq-trigger).
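
A rough sketch of the umount step, in case it helps. The function names are 
made up for illustration, and it assumes that mounted OSTs show up in 
/proc/mounts with filesystem type "lustre" and that no Lustre client mounts 
live on the same node. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: find the Lustre targets mounted on this OSS and umount them.
# Nothing is executed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"   # overridable, e.g. for testing

# Print the mount point of every filesystem of type "lustre"
# (assumption: on an OSS these are the OSTs).
list_ost_mounts() {
    awk '$3 == "lustre" { print $2 }' "$MOUNTS_FILE"
}

# Start one umount per target in the background, so that a stuck
# umount does not block the remaining targets.
umount_all_osts() {
    for mnt in $(list_ost_mounts); do
        if [ "$DRY_RUN" = "1" ]; then
            echo "umount $mnt"
        else
            umount "$mnt" &
        fi
    done
}
```

Run umount_all_osts, give it some time, then run list_ost_mounts again: 
whatever is still listed are the 'broken' OSTs to write down before the 
reboot.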

Afterwards, run a simple 'e2fsck -f -p' on the bad OSTs. It should complain 
about corrupted directories and other nice things. If it doesn't, upgrade to 
the latest e2fsck from Whamcloud.
(We had a corruption a few months ago that the 1.8.4-sun e2fsprogs either did 
not detect or could not fix.)
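
To make the outcome of the check easier to read, something like the following 
could wrap the e2fsck call. This is only a sketch: check_ost and 
describe_fsck_status are hypothetical helper names, and the device path is 
whatever backs the bad OST. The exit status is the bit mask documented in 
e2fsck(8).

```shell
#!/bin/sh
# Translate an e2fsck exit status (bit mask, see e2fsck(8)) into a verdict.
#   0 = no errors, 1/2 = errors corrected (2: reboot advised),
#   4 = errors left uncorrected, 8 and above = e2fsck itself failed.
describe_fsck_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "clean"
    elif [ $((rc & 4)) -ne 0 ]; then
        echo "errors left uncorrected"
    elif [ $((rc & 184)) -ne 0 ]; then   # 184 = 8 + 16 + 32 + 128
        echo "e2fsck did not complete"
    else
        echo "errors corrected"
    fi
}

# Run the forced, automatic-repair check on one unmounted OST device.
check_ost() {
    dev=$1
    e2fsck -f -p "$dev"
    fsck_rc=$?
    echo "$dev: $(describe_fsck_status "$fsck_rc")"
    return "$fsck_rc"
}
```

If a 'broken' OST comes back as "clean", that is the case where a newer 
e2fsck is worth trying.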



> This is a recent phenomena - we are not 
> sure, but we think it may be related to a particular workload.  Our o2ib 
> clients don't seem to have any trouble.

I don't think this issue is related to the network. It's probably just 
'bad luck' that only the tcp clients hit the corrupted directories.



Regards,
 Adrian
