Hello Joe,

You haven't defined any memory allocation or oversubscription in your slurm.conf, so by default Slurm gives a full node's worth of memory to each job. There are several ways to handle this, but what you probably want is to make both CPU and memory consumable resources with the parameter:

SelectTypeParameters=CR_CPU_Memory

Then you'll want to define the amount of memory (in megabytes) on each node as part of its definition with RealMemory=. Lastly, you'll need to define a default amount of memory (in megabytes) per job, typically per CPU, with DefMemPerCPU=.

With those changes, a job that doesn't request memory explicitly is given #CPUs x DefMemPerCPU megabytes by default. You can then use either the --mem or --mem-per-cpu flag to request more or less memory for a job.

There's also oversubscription, which allows jobs to collectively use more memory than a node actually has; then you don't technically need to define memory for each job, but you run the risk that a single job uses all of it and triggers OOM errors on the node.

Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On Jun 16, 2023, at 12:43, Joe Waliga <jwal...@umich.edu> wrote:

Hello,

(This is my first time submitting a question to the list)

We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 jobs onto the test-HPC, we can only run one job per node. We seem to be allocating all of a node's memory to the one job, and other jobs can't run until the memory is freed up. Any ideas on what we need to change in order to free up the memory?

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*

The test-HPC is running on hardware, but we also created a test-HPC using a set of 3 VMs constructed by Vagrant running on a VirtualBox backend.
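For concreteness, the settings Willy describes above might look like the slurm.conf fragment below. The DefMemPerCPU and RealMemory values here are hypothetical placeholders, not measured from these nodes; running `slurmd -C` on a compute node prints a NodeName line with a RealMemory value matching the actual hardware.

```
# Treat both CPUs and memory as consumable resources.
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory

# Default MB per allocated CPU when a job does not request memory
# explicitly; such a job then gets #CPUs x DefMemPerCPU megabytes.
DefMemPerCPU=2048

# Usable RAM per node in MB; the RealMemory value below is a placeholder.
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN
```

Individual jobs can then override the default with, e.g., `sbatch --mem=8G ...` or `sbatch --mem-per-cpu=4G ...`.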
I have included some of the 'slurmctld.log' file, the batch submission script, the slurm.conf file (of the hardware-based test-HPC), and the 'Vagrantfile' file (in case someone wants to recreate our test-HPC in a set of VMs.)

- Joe

----- (some of) slurmctld.log -----------------------------------------

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: core_list:node[0]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 pass: test_only
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources info for JobId=71_7(71) rc=0
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. Reason=Resources. Priority=4294901759. Partition=debug.
[2023-06-15T20:11:56.645] debug: Spawning ping agent for hpc2-comp[01-02]
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02
[2023-06-15T20:11:56.647] debug2: Tree head got back 1
[2023-06-15T20:11:56.647] debug2: Tree head got back 2
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: beginning
[2023-06-15T20:11:57.329] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering _try_sched for JobId=71_*.
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: evaluating JobId=71_*
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* node_mode:Normal alloc_mode:Will_Run
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & exc_cores
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list:hpc2-comp[01-02]
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 requested:1 avail:2
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp01, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not considering node hpc2-comp02, allocated memory = 1 and all memory requested for JobId=71_*
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating JobId=71_* on 0 nodes
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 fail: insufficient resources
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp01
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed JobId=71_5(76) from part debug row 0
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) finished
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 nodes=hpc2-comp02
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330] Mem(MB):1:0 Sockets:2 Cores:8 CPUs:2:0
[2023-06-15T20:11:57.330] Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================

----- batch script -----------------------------------

#!/bin/bash
echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE} "
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"
sleep 3600
echo "END"

Here is the sbatch command to run it:

sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm

----- slurm.conf -----------------------------------

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=hpc2-comp00
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunPortRange=60001-60005
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity,task/cgroup
TaskPlugin=task/none
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=hpc2-comp00
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_env,job_extra,job_script
#JobCompHost=localhost
#JobCompLoc=slurm_jobcomp_db
##JobCompParams=
#JobCompPass=/var/run/munge/munge.socket.2
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/none
#JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
# Enabled next line - 06-15-2023
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
# Enabled next line - 06-15-2023
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
# Added next line : 06-15-2023
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP

----- Vagrantfile file -----------------------------------

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box.
  # You can search for boxes at https://vagrantcloud.com/search.
  config.vm.box = "generic/fedora37"

  # Stop vagrant from generating a new key for each host to allow ssh between
  # machines.
  config.ssh.insert_key = false

  # The Vagrant commands are too limited to configure a NAT network,
  # so run the VBoxManage commands by hand.
  config.vm.provider "virtualbox" do |vbox|
    # Add nic2 (eth1 on the guest VM) as the physical router. Never change
    # nic1, because that's what the host uses to communicate with the guest VM.
    vbox.customize ["modifyvm", :id, "--nic2", "bridged", "--bridge-adapter2", "enp8s0"]
  end

  # Common provisioning for all guest VMs.
  config.vm.provision "shell", inline: <<-SHELL
    # Show which command is being run to associate with command output!
    set -x
    # Remove spurious hosts from the VM image.
    sed -i '/fedora37/d' /etc/hosts
    sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts
    # Add NAT network to /etc/hosts.
    for host in 10.0.1.{100..102}
    do
      hostname=hpc2-comp${host:8}
      grep -q $host /etc/hosts || echo "$host $hostname" >> /etc/hosts
    done
    unset host hostname
    # Use latest set of packages.
    dnf -y update
    # Install MUNGE.
    dnf -y install munge
    # Create the SLURM user.
    id -u slurm || useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm
  SHELL

  config.vm.define "hpc2_comp00" do |hpc2_comp00|
    hpc2_comp00.vm.hostname = "hpc2-comp00"
    hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp00.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x
      # Set static IP address for NAT network.
      HOST=10.0.1.100
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
        /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST
      # Create the MUNGE key.
      [[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v
      cp -av /etc/munge/munge.key /vagrant/
      # Enable and start munge.
      systemctl enable munge
      systemctl start munge
      systemctl status munge
      # Setup database on the head node:
      dnf -y install mariadb-devel mariadb-server
      # Set recommended memory (5%-50% RAM) and timeout.
      CNF=/etc/my.cnf.d/mariadb-server.cnf
      # Note we need to use a double backslash for the newline character below
      # because of Vagrant's inline shell script.
      MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)
      grep -q innodb_buffer_pool_size $CNF || sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF
      grep -q innodb_lock_wait_timeout $CNF || sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF
      unset CNF MYSQL_RAM
      # Run the head node services:
      systemctl enable mariadb
      systemctl start mariadb
      systemctl status mariadb
      # Secure the server.
      #
      # Send interactive commands using printf per
      # https://unix.stackexchange.com/a/112348
      printf "%s\n" "" n n y y y y | mariadb-secure-installation
      # Install the RPM package builder for SLURM.
      dnf -y install rpmdevtools
      # Download SLURM.
      wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2
      # Install the source package dependencies to determine the dependencies
      # to build the binary.
      dnf -y install \
        dbus-devel \
        freeipmi-devel \
        hdf5-devel \
        http-parser-devel \
        json-c-devel \
        libcurl-devel \
        libjwt-devel \
        libyaml-devel \
        lua-devel \
        lz4-devel \
        man2html \
        readline-devel \
        rrdtool-devel
      SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm
      # Create SLURM .rpmmacros file.
      cp -av /vagrant/.rpmmacros .
      [[ -f ${SLURM_SRPM} ]] || rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log
      # Install the source package dependencies to build the binary.
      dnf -y builddep ${SLURM_SRPM}
      unset SLURM_SRPM
      # Build the SLURM binaries.
      SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm
      [[ -f ${SLURM_RPM} ]] || rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log
      unset SLURM_RPM
      # Copy SLURM packages to the compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ && touch /vagrant/sentinel-copied-rpms.done
      # Install all SLURM packages on the head node.
      find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \
        -exec dnf -y install {} +
      unset DIR_RPM
      # Copy the configuration files.
      cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf
      chown slurm:slurm /etc/slurm/slurmdbd.conf
      chmod 600 /etc/slurm/slurmdbd.conf
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf
      # Now create the slurm MySQL user.
      SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)
      DBD_HOST=localhost
      # https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin
      mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
      mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
      mysql -e "show grants for 'slurm'@'$DBD_HOST';"
      DBD_HOST=hpc2-comp00
      mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
      mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
      mysql -e "show grants for 'slurm'@'$DBD_HOST';"
      unset SLURM_PASSWORD DBD_HOST
      systemctl enable slurmdbd
      systemctl start slurmdbd
      systemctl status slurmdbd
      systemctl enable slurmctld
      mkdir -p /var/spool/slurmctld
      chown slurm:slurm /var/spool/slurmctld
      # Open ports for slurmctld (6817) and slurmdbd (6819).
      firewall-cmd --add-port=6817/tcp
      firewall-cmd --add-port=6819/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmctld
      systemctl status slurmctld
      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
    SHELL
  end

  config.vm.define "hpc2_comp01" do |hpc2_comp01|
    hpc2_comp01.vm.hostname = "hpc2-comp01"
    hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp01.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x
      # Set static IP address for NAT network.
      HOST=10.0.1.101
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
        /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST
      # Copy the MUNGE key.
      KEY=/etc/munge/munge.key
      cp -av /vagrant/munge.key /etc/munge/
      chown munge:munge $KEY
      chmod 600 $KEY
      # Enable and start munge.
      systemctl enable munge
      systemctl start munge
      systemctl status munge
      # SLURM packages to be installed on compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      mkdir -p ${DIR_RPM}
      rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
      dnf -y install ${DIR_RPM}/slurm-23*.rpm
      dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
      unset DIR_RPM
      # Copy the configuration file.
      mkdir -p /etc/slurm
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf
      # Only enable slurmd on the worker nodes.
      systemctl enable slurmd
      # Open port for slurmd (6818).
      firewall-cmd --add-port=6818/tcp
      firewall-cmd --runtime-to-permanent
      # Open port range for srun.
      SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)
      firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmd
      systemctl status slurmd
      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
    SHELL
  end

  config.vm.define "hpc2_comp02" do |hpc2_comp02|
    hpc2_comp02.vm.hostname = "hpc2-comp02"
    hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true
    hpc2_comp02.vm.provision :shell, inline: <<-SHELL
      # Show which command is being run to associate with command output!
      set -x
      # Set static IP address for NAT network.
      HOST=10.0.1.102
      ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
      sed "s|address1=10.0.1.100|address1=${HOST}|" \
        /vagrant/eth1.nmconnection > $ETH1
      chmod go-r $ETH1
      nmcli con load $ETH1
      unset HOST
      # Copy the MUNGE key.
      KEY=/etc/munge/munge.key
      cp -av /vagrant/munge.key /etc/munge/
      chown munge:munge $KEY
      chmod 600 $KEY
      # Enable and start munge.
      # SLURM packages to be installed on compute nodes.
      DIR_RPM=~/rpmbuild/RPMS/x86_64
      mkdir -p ${DIR_RPM}
      rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
      dnf -y install ${DIR_RPM}/slurm-23*.rpm
      dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
      unset DIR_RPM
      # Copy the configuration file.
      mkdir -p /etc/slurm
      cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
      chown root:root /etc/slurm/slurm.conf
      chmod 644 /etc/slurm/slurm.conf
      # Only enable slurmd on the worker nodes.
      systemctl enable slurmd
      # Open port for slurmd (6818).
      firewall-cmd --add-port=6818/tcp
      firewall-cmd --runtime-to-permanent
      systemctl start slurmd
      systemctl status slurmd
      # Clear any previous node DOWN errors.
      sinfo -s
      sinfo -R
      scontrol update nodename=ALL state=RESUME
      sinfo -s
      sinfo -R
    SHELL
  end
end

-------------------------------------------------------------------------
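A quick sanity check for the DefMemPerCPU suggestion at the top of the thread: dividing a node's RealMemory by its CPU count gives the largest per-CPU default that still lets every CPU on the node be allocated. A small sketch with hypothetical figures (64000 MB RealMemory, 32 CPUs; substitute the numbers `slurmd -C` reports on a real compute node):

```shell
#!/bin/sh
# Hypothetical node figures; replace with the CPUs and RealMemory
# values that 'slurmd -C' reports on an actual compute node.
REAL_MEMORY_MB=64000   # assumed usable RAM per node, in MB
CPUS=32                # assumed CPU count (2 sockets x 8 cores x 2 threads)

# Largest DefMemPerCPU that still allows all CPUs on a node to be used.
DEF_MEM_PER_CPU=$((REAL_MEMORY_MB / CPUS))
echo "DefMemPerCPU=${DEF_MEM_PER_CPU}"   # prints DefMemPerCPU=2000
```

With these placeholder numbers, a full-node 32-CPU job would default to 32 x 2000 MB = 64000 MB, exactly the node's assumed RealMemory.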