Charles Taylor wrote:
> On Feb 3, 2009, at 11:42 AM, Brian J. Murrell wrote:
>
>   
>>> I down one of the servers (normal shutdown, not the MGS/MDS of course).
>>> OK, so the clients seem to be frozen with regard to Lustre.
>>>       
>> Only if they want to access objects (files, or file stripes) on that
>> server that you shut down, yes.
>>     
>
> In our experience, despite what has been said and what we have read,
> if we lose or take down a single OSS, our clients lose access (I/O
> seems blocked) to the file system until that OSS is back up and has
> completed recovery.  That's just our experience, and it has been very
> consistent.  We've never seen otherwise, though we would like to.  :)
>   

You are probably both correct.  Only clients with files on the down OSTs
should be impacted, but it is very easy to be using or accessing files on
a down OST without realizing it.

If your home directory is on Lustre, then a login will almost certainly
hang, since your dotfiles are likely striped across nearly all OSTs.  An
"ls -l" will hang as well, as the directory most likely contains a file
with objects on the down OST.

If you run "lsof" and verify that no open files have objects on a given
OST, then when that OST goes down, the jobs on that node should continue
to run, assuming nothing new tries to access the down OST.
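One way to do that check is to parse "lfs getstripe" output for the OST
indices backing each open file.  The sketch below assumes the 1.6-era
getstripe output format (an "obdidx" column header followed by one row
per stripe -- verify against your client version); "ost_indices" is our
own helper name, not a Lustre command:

```shell
# Sketch: print the OST indices backing a file, by parsing "lfs getstripe"
# output read from stdin.  The obdidx-table format is assumed from
# 1.6-era clients.  "ost_indices" is our own helper name.
ost_indices() {
    awk '/obdidx/ {hdr = 1; next}
         hdr && $1 ~ /^[0-9]+$/ {print $1}'
}

# Typical use on a live client (not run here): list every OST index
# touched by files currently open under the mount point.
#   lsof -w /lustre 2>/dev/null | awk 'NR > 1 {print $NF}' | sort -u | \
#       while read -r f; do
#           lfs getstripe "$f" 2>/dev/null | ost_indices
#       done | sort -un
```

If none of the printed indices match the down OST, the node's jobs should
be safe to leave running.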

>>> Many here
>>> have noted that it should be ok, with the exception of files that  
>>> were
>>> stored on the downed server,
>>>       
>
> Again, not in our experience.  We are currently running 1.6.4.2 and
> have never seen this work.  Losing a single OSS renders the file
> system pretty much unusable until the OSS has recovered.  We could
> be doing something wrong, I suppose, but I'm not sure what.
>
>   
>>> but that does not seem to be the case here.
>>> That is not my main concern, however; the real question is: I bring
>>> the server back up and check its ID by issuing "lctl dl"; I check the
>>> MGS with "cat /proc/fs/lustre/devices" and see the ID in there as UP.
>>> OK, so it all seems well again, but the client is still (somewhat)
>>> stuck.
>>>       
>
> You have to wait for recovery to complete.  You can check the
> recovery status on the OSSs and the MGS/MDS with:
>
> cd /proc/fs/lustre; find . -name "*recov*" -exec cat {} \;
>
> Once all the OSSs and the MGS show recovery "COMPLETE", clients will
> be able to access the file system again.
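For scripting that wait, the per-target state can be pulled from the
"status:" line of each recovery_status file.  The sketch below assumes
the standard /proc/fs/lustre layout; "lustre_recovery_state" is our own
helper name:

```shell
# Sketch: print "<file> <state>" for every Lustre recovery_status entry
# under a given proc root (default /proc/fs/lustre).  The "status:" line
# (COMPLETE, RECOVERING, ...) is what the recovery_status files carry;
# "lustre_recovery_state" is our own helper name.
lustre_recovery_state() {
    root="${1:-/proc/fs/lustre}"
    find "$root" -name "recovery_status" 2>/dev/null |
    while read -r f; do
        awk -v f="$f" '/^status:/ {print f, $2}' "$f"
    done
}

# e.g.  lustre_recovery_state | grep -v COMPLETE   # targets still recovering
```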
>
> We've been running three separate Lustre file systems for over a year
> now and are *very* happy with it.  There are a few things that we
> still don't understand and this is one of them.  We wish that when an
> OSS went down, we only lost access to files/objects on *that* OSS but,
> again, that has not been our experience.  Still, we've kissed a lot
> of distributed/parallel file system frogs.  We'll take Lustre, hands
> down.
>
> Charlie Taylor
> UF HPC Center
>
>   
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>     
>

