Hi Dean,
You may want to look at the links in my Slurm Wiki page. Both the
official Slurm documentation and other resources are listed. I think
most of your requirements and questions are described in these pages.
My Wiki gives detailed deployment information for a CentOS 7 cluster,
but much of this information should be relevant for Ubuntu as well.
/Ole
On 06-12-2019 22:57, Dean Schulze wrote:
I'm doing my first Slurm installation. The SchedMD docs assume that I
already have a cluster that meets certain (unstated) requirements, but
I don't. I've found a couple of examples showing how to set up a cluster
for Slurm using real hardware (nodes) with GPUs:
https://github.com/mknoxnv/ubuntu-slurm
https://github.com/nateGeorge/slurm_gpu_ubuntu
The requirements for a Slurm cluster seem to be:
- Passwordless SSH is working between the Slurm controller and the Slurm nodes.
- There is shared storage between all the nodes: /storage and /home (NFS).
- The UIDs and GIDs are consistent across all the nodes (LDAP or other).
- Hostnames have to be FQDNs.
- Slurm will be used to control SSH access to compute nodes.
- Compute nodes are DNS resolvable.
- Compute nodes have GPUs and the latest CUDA drivers installed.
- Time has to be synchronized across all nodes and the controller (NTP or
  FreeIPA). (If time isn't synchronized properly, the controller might not
  start.)
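A few of the items listed above (FQDN hostnames, DNS resolution, consistent UIDs) can be checked with a small script before installing anything. The following is only a minimal sketch. The function names, the example hostnames, and the shape of the per-node UID data are all hypothetical, and it does not confirm that Slurm actually requires each item.

```python
import socket
from typing import Dict, Set


def is_fqdn(hostname: str) -> bool:
    """Rough check: the name has at least two non-empty dot-separated labels."""
    labels = hostname.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def resolves(hostname: str) -> bool:
    """Return True if the local resolver (DNS or /etc/hosts) knows the name."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def inconsistent_ids(per_node: Dict[str, Dict[str, int]]) -> Dict[str, Set[int]]:
    """Given {node: {username: uid}}, return users whose UID differs between nodes."""
    seen: Dict[str, Set[int]] = {}
    for ids in per_node.values():
        for user, uid in ids.items():
            seen.setdefault(user, set()).add(uid)
    return {user: uids for user, uids in seen.items() if len(uids) > 1}


if __name__ == "__main__":
    # Hypothetical node names; replace with the real controller and compute nodes.
    for host in ["ctl.cluster.example", "node1.cluster.example"]:
        print(host, "fqdn:", is_fqdn(host), "resolves:", resolves(host))

    # Hypothetical UID data, e.g. collected by running `id -u <user>` on each node.
    print(inconsistent_ids({
        "node1": {"alice": 1000, "bob": 1001},
        "node2": {"alice": 1000, "bob": 1002},
    }))
```

The UID check works on data gathered beforehand rather than logging in to the nodes itself, so it does not depend on passwordless SSH already working.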
My questions are:
Are the cluster requirements above correct and complete?
Can I use virtual machines without GPUs for my nodes?
(This is just to get started. Eventually I'll have real hardware
with GPUs for my nodes.)
From the Ubuntu link on your download page I've downloaded these files:
slurmctld_18.08.6.2-1_amd64.deb 610.9 kB
slurm-client_18.08.6.2-1_amd64.deb 887.7 kB
slurm-wlm_18.08.6.2-1_amd64.deb 12.3 kB
The slurmctld package would be installed on my controller, but what do I
install on my nodes?
The slurm-wlm file is very small. Would I install it on my nodes?
What is the slurm-client package for?