On Jun 13, 2008, at 5:46 PM, Andreas Dilger wrote:

> On Jun 13, 2008  16:03 -0400, Charles Taylor wrote:
>> We have been running the config below on three different lustre file
>> systems since early January and, for the most part, things have been
>> pretty stable.    We are now experiencing frequent hangs on some
>> clients - particularly our interactive login nodes.    All processes
>> get blocked behind Lustre I/O requests.   When this happens there are
>> *no* messages in either dmesg or syslog on the clients.     They seem
>> unaware of a problem.
>
> This is likely due to "client statahead" problems.  Please disable this
> with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the clients.
> This should also be fixed in 1.6.5.

This seems to have done the trick. Odd, though, that we've been
running this way for several months and it didn't seem to be an issue
until now. We saw the discussions of this go by at one point and we
should have just taken care of it then, whether we were seeing it or
not. Thanks for reminding us of it.
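
For anyone applying the same workaround on a client with more than one Lustre mount, here is a minimal sketch of the quoted command written as a loop. It assumes that the glob can match several files, in which case a single redirection may be rejected by the shell (bash reports an ambiguous redirect). The LLITE_DIR variable and the disable_statahead function name are made up for illustration. Only the /proc path and the value 0 come from the advice above.

```shell
#!/bin/sh
# Sketch: disable statahead on every mounted Lustre client instance.
# LLITE_DIR is a hypothetical override so the loop can be tried against a
# scratch directory; the default is the path quoted in the thread.
LLITE_DIR="${LLITE_DIR:-/proc/fs/lustre/llite}"

disable_statahead() {
    for f in "$LLITE_DIR"/*/statahead_max; do
        # If the glob matched nothing, the literal pattern is left in $f.
        [ -e "$f" ] || continue
        echo 0 > "$f" || echo "could not write $f" >&2
        # Print the value read back so the change can be confirmed.
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

disable_statahead
```

This needs to run as root, and the setting would presumably have to be reapplied after a remount or reboot.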

>
>
>> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
>> period of time.   Most of them appear to be full of garbage and
>> unprintable characters rather than thread stack traces.   Many of them
>> are also zero length.
>
> The lustre-log files are not stack traces.  They are dumped Lustre debug
> logs.

Got it.
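
Since the dumps are binary debug logs, they should become readable once passed through "lctl debug_file". A rough sketch of converting everything in /tmp, skipping the zero-length files, is below. LOG_DIR, LCTL and the decode_lustre_logs function name are hypothetical names chosen for illustration, and the ".txt" output suffix is just a convention picked here.

```shell
#!/bin/sh
# Sketch: convert dumped Lustre debug logs to text with "lctl debug_file".
# LOG_DIR and LCTL are hypothetical overrides for illustration only.
LOG_DIR="${LOG_DIR:-/tmp}"
LCTL="${LCTL:-lctl}"

decode_lustre_logs() {
    for f in "$LOG_DIR"/lustre-log.*; do
        # Skip text output left behind by an earlier run.
        case "$f" in *.txt) continue ;; esac
        # Skip an unmatched glob pattern and zero-length dumps.
        [ -s "$f" ] || continue
        "$LCTL" debug_file "$f" "$f.txt" || echo "failed to decode $f" >&2
    done
}

decode_lustre_logs
```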

>
>
>> We have been adjusting lru_size on the clients but so far it has made
>> no difference.    We have "options mds mds_num_threads=512" and our
>> system timeout is 1000 (sure, go ahead and flame me but if we don't do
>> that we get tons of "endpoint transport failures" on the clients and
>> no, there are no connectivity issues).   :)
>>
>> We are open to suggestion and wondering if we should update the MDSs
>> to 1.6.5.   Can we do that safely without also upgrading the clients
>> and OSTs?
>
> In general the MDS and OSS nodes should run the same level of software,
> as that is what we test, but there isn't a hard requirement for it.

Would it be reasonable, then, to upgrade the MDSs and OSSs but leave
the clients at 1.6.4.2, or is that asking for trouble? I think this
comes up a lot and I'm pretty sure people have said they do it
successfully. I'm just wondering if it is a *design* goal that is
architected in or just something that happens to work most of the time.

Thanks again,

Charlie Taylor
UF HPC Center

>
