> On Oct 19, 2018, at 10:42 AM, Marion Hakanson <hakan...@ohsu.edu> wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so far, 
> that we had to unmount all the clients (which required rebooting most of 
> them), then reboot all the storage servers, to get things unstuck until the 
> problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the MDS, 
> but that did not help.  Could well be I'm not using it right:

Aborting recovery is a server-side action, not something that is done on the 
client.  As mentioned by Peter, you can abort recovery on a single target after 
it is mounted by using "lctl --device <DEV> abort_recover" (a short sketch of 
this follows the mount example below).  But you can also just skip over the 
recovery step when the target is mounted by adding the "-o abort_recov" option 
to the mount command.  For example, 

mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0
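
And if the target is already mounted, a rough sketch of the lctl approach 
(the target name "lustre-MDT0000" below is just a placeholder; "lctl dl" will 
show the real device names/indexes on your server):

lctl dl                                     # list local Lustre devices and their state
lctl --device lustre-MDT0000 abort_recover  # abort recovery on that target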

Both approaches work the same way for OSTs.  So you should be able to just 
unmount your MDT/OST on the running file system and then remount with the 
abort_recov option.  From a client perspective, the Lustre client will get 
evicted but should automatically reconnect.
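
To make that concrete, a rough sequence for one OST (the device path and mount 
point are placeholders; substitute your own):

umount /mnt/lustre/ost0
mount -t lustre -o abort_recov /dev/my/ost0 /mnt/lustre/ost0

On the clients, you should be able to watch for the eviction and reconnect 
messages in the kernel log with something like "dmesg | grep -i evict".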

Some applications can ride through a client eviction without causing issues, 
and some cannot.  I think it depends largely on how the application does its 
I/O and whether there is any I/O in flight when the eviction occurs.  I have 
had to do this a few times on a running cluster, and in my experience most 
applications continue without issues.  Sometimes a few jobs abort, but overall 
this is better than having to stop all jobs and remount Lustre on all the 
compute nodes.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
