Hi all,

I am looking for a clean way to set up Slurm's native high-availability feature. I am managing a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I want to set up as a backup control node.

The Slurm documentation says the following about the StateSaveLocation directory:

The directory used should be on a low-latency local disk to prevent file system 
delays from affecting Slurm performance. If using a backup host, the 
StateSaveLocation should reside on a file system shared by the two hosts. We do 
not recommend using NFS to make the directory accessible to both hosts, but do 
recommend a shared mount that is accessible to the two controllers and allows 
low-latency reads and writes to the disk. If a controller comes up without 
access to the state information, queued and running jobs will be cancelled. [1]

My question: How do I implement the shared file system for the StateSaveLocation?
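For reference, the HA setup I am aiming for looks roughly like this in slurm.conf (hostnames and the path are placeholders for my environment):

```
# Primary and backup controllers; the first SlurmctldHost entry is the
# primary, the second takes over if the primary becomes unreachable.
SlurmctldHost=ctl1
SlurmctldHost=ctl2

# Must be readable and writable by both controllers for failover to work.
StateSaveLocation=/var/spool/slurmctld

# Seconds the backup waits before assuming control.
SlurmctldTimeout=120
```

The open question is what file system should back that shared StateSaveLocation path.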

I do not want to introduce a single point of failure by having a single node host the StateSaveLocation, nor do I want to put that directory on the cluster's NFS storage: outages or downtime of the storage system will happen at some point, and I do not want that to take down the Slurm controller as well.

Any help or ideas would be appreciated.

Best,
Pierre


[1] https://slurm.schedmd.com/quickstart_admin.html#Config

--
Pierre Abele, M.Sc.

HPC Administrator
Max-Planck-Institute for Evolutionary Anthropology
Department of Primate Behavior and Evolution

Deutscher Platz 6
04103 Leipzig

Room: U2.80
E-Mail: pierre_ab...@eva.mpg.de
Phone: +49 (0) 341 3550 245


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
