Thanks, Paul, for confirming our planned approach. We did it that way and it worked very well. I have to admit that my palms were a bit sweaty when suspending thousands of running jobs, but everything went through without any problems. I didn't dare to resume all suspended jobs at once, though, but did so in a staggered manner.
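In case someone wants to do the same: one way to stagger the resume is a loop along the following lines (just a sketch, not literally what we ran; the batch size of 100 jobs and the 30 second pause are arbitrary placeholders, not tuned values):

    # Resume suspended jobs in batches rather than all at once,
    # so that slurmctld is not hit by thousands of RPCs in one go.
    # xargs without a command defaults to echo, so this emits one
    # line per batch of up to 100 job IDs.
    squeue -ho %A -t S | xargs -n 100 | while read -r batch; do
        for jobid in $batch; do
            scontrol resume "$jobid"
        done
        sleep 30    # arbitrary pause between batches
    done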
Best regards
Jürgen

* Paul Edmon <ped...@cfa.harvard.edu> [211019 15:15]:
> Yup, we follow the same process for when we do Slurm upgrades, this looks
> analogous to our process.
>
> -Paul Edmon-
>
> On 10/19/2021 3:06 PM, Juergen Salk wrote:
> > Dear all,
> >
> > we are planning to perform some maintenance work on our Lustre file system
> > which may or may not harm running jobs. Although failover functionality is
> > enabled on the Lustre servers, we'd like to minimize the risk for running
> > jobs in case something goes wrong.
> >
> > Therefore, we thought about suspending all running jobs and resuming
> > them as soon as the file systems are back again.
> >
> > The idea would be to stop Slurm from scheduling new jobs as a first step:
> >
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
> >
> > with foo, bar and baz being the configured partitions.
> >
> > Then suspend all running jobs (taking job arrays into account):
> >
> > # squeue -ho %A -t R | xargs -n 1 scontrol suspend
> >
> > Then perform the failover of OSTs to another OSS server.
> > Once done, verify that the file system is fully back and all
> > OSTs are in place again on the client nodes.
> >
> > Then resume all suspended jobs:
> >
> > # squeue -ho %A -t S | xargs -n 1 scontrol resume
> >
> > Finally bring the partitions back up:
> >
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
> >
> > Does that make sense? Is that common practice? Are there any caveats that
> > we must think about?
> >
> > Thank you in advance for your thoughts.
> >
> > Best regards
> > Jürgen
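P.S.: For the verification step quoted above ("all OSTs are in place again on the client nodes"), something like the following can be run on a client node (just a sketch; /lustre is a placeholder for the actual mount point):

    # check that the client can reach the MDTs/OSTs again
    lfs check servers

    # every OST of the file system should show up and report usage
    lfs df -h /lustre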