> On Oct 19, 2018, at 10:42 AM, Marion Hakanson <hakan...@ohsu.edu> wrote:
>
> Thanks for the feedback. You're both confirming what we've learned so far,
> that we had to unmount all the clients (which required rebooting most of
> them), then reboot all the storage servers, to get things unstuck until the
> problem recurred.
>
> I tried abort_recovery on the clients last night, before rebooting the MDS,
> but that did not help. Could well be I'm not using it right:
Aborting recovery is a server-side action, not something that is done on the client. As mentioned by Peter, you can abort recovery on a single target after it is mounted by using "lctl --device <DEV> abort_recover". But you can also just skip the recovery step when the target is mounted by adding the "-o abort_recov" option to the mount command. For example:

    mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0

And similarly for OSTs. So you should be able to just unmount your MDT/OST on the running file system and then remount with the abort_recov option.

From a client perspective, the lustre client will get evicted but should automatically reconnect. Some applications can ride through a client eviction without causing issues; some cannot. I think it depends largely on how the application does its IO and whether there is any IO in flight when the eviction occurs.

I have had to do this a few times on a running cluster, and in my experience we have had good luck with most of the applications continuing without issues. Sometimes there are a few jobs that abort, but overall this is better than having to stop all jobs and remount lustre on all the compute nodes.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
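[Editor's note: for anyone scripting the unmount/remount sequence described above, here is a minimal dry-run sketch. The device and mount-point paths are the illustrative placeholders from the message, not real paths, and the helper function only prints the commands for review rather than executing them.]

```shell
#!/bin/sh
# Dry-run sketch: emit the commands for remounting a Lustre target with
# recovery skipped (abort_recov). Nothing is executed; the function only
# prints each command line so it can be reviewed before running it by hand.
abort_recov_cmds() {
  # $1 = target block device, $2 = mount point (placeholders, not real paths)
  echo "umount $2"
  echo "mount -t lustre -o abort_recov $1 $2"
}

# Example using the placeholder paths from the message above:
abort_recov_cmds /dev/my/mdt /mnt/lustre/mdt0

# Alternative for an already-mounted target (per Peter's suggestion), where
# <DEV> is the target's Lustre device as listed by 'lctl dl', not the block
# device:
#   lctl --device <DEV> abort_recover
```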