Hello Joe,

You haven't defined any memory allocation or oversubscription in your 
slurm.conf, so by default Slurm allocates a full node's worth of memory to 
each job. There are several options, but what you probably want to do is make 
both CPU and memory consumable resources with the parameter:

SelectTypeParameters=CR_CPU_Memory

Then you'll want to define the amount of memory (in megabytes) on a node as 
part of the node definition with

RealMemory=

Lastly, you'll need to define a default memory allocation (in megabytes) per 
job, typically as memory per CPU, with

DefMemPerCPU=

With those changes, a job submitted without an explicit memory request will by 
default be given #cpus x DefMemPerCPU of memory. You can then use either the 
--mem or --mem-per-cpu flag to request more or less memory for a job.
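
For example, a minimal sketch of the relevant slurm.conf lines (the 
RealMemory and DefMemPerCPU values here are hypothetical; use the memory that 
`slurmd -C` actually reports on your nodes):

SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=2000
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 State=UNKNOWN

With that in place (and slurmctld/slurmd restarted), a one-CPU job would get 
2000 MB by default, and something like "sbatch --mem-per-cpu=4000 ..." would 
request more.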

There's also oversubscription, which allows jobs in total to be allocated more 
memory than is available on the node. Then you don't technically need to 
define memory for each job, but you run the risk that a single job uses all of 
a node's memory and other jobs hit OOM errors on the nodes.
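
If you'd rather go the oversubscription route, that's configured per partition 
with the OverSubscribe parameter, e.g. (the FORCE:4 value is only an 
illustration):

PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:4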

Regards,

--
Willy Markuske

HPC Systems Engineer
MS Data Science and Engineering
SDSC - Research Data Services
(619) 519-4435
wmarku...@sdsc.edu

On Jun 16, 2023, at 12:43, Joe Waliga <jwal...@umich.edu> wrote:

Hello,

(This is my first time submitting a question to the list)

We have a test-HPC with 1 login node and 2 compute nodes. When we submit 90 
jobs onto the test-HPC, we can only run one job per node. We seem to be 
allocating all memory to the one job, and other jobs can't run until the 
memory is freed up.

Any ideas on what we need to change in order to free up the memory?

~ ~

We noticed this from the 'slurmctld.log' ...

[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp01, allocated memory = 1 and all memory requested for 
JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp02, allocated memory = 1 and all memory requested for 
JobId=71_*

The test-HPC is running on hardware, but we also created a test-HPC using a 
3-VM set constructed by Vagrant running on a VirtualBox backend.

I have included some of the 'slurmctld.log' file, the batch submission script, 
the slurm.conf file (of the hardware based test-HPC), and the 'Vagrantfile' 
file (in case someone wants to recreate our test-HPC in a set of VMs.)

- Joe


----- (some of) slurmctld.log -----------------------------------------

[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp01 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] select/cons_tres: node_data_dump: Node:hpc2-comp02 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp01, allocated memory = 1 and all memory requested for 
JobId=71_7(71)
[2023-06-15T20:11:32.631] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp02, allocated memory = 1 and all memory requested for 
JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: evaluating 
JobId=71_7(71) on 0 nodes
[2023-06-15T20:11:32.631] select/cons_tres: _job_test: SELECT_TYPE: test 0 
fail: insufficient resources
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: no job_resources 
info for JobId=71_7(71) rc=-1
[2023-06-15T20:11:32.631] debug2: select/cons_tres: select_p_job_test: 
evaluating JobId=71_7(71)
[2023-06-15T20:11:32.631] select/cons_tres: common_job_test: JobId=71_7(71) 
node_mode:Normal alloc_mode:Test_Only
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: node_list & 
exc_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: nodes: min:1 max:1 
requested:1 avail:2
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp01 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: node_data_dump: Node:hpc2-comp02 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: evaluating 
JobId=71_7(71) on 2 nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: _select_nodes/enter
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 
32 CPUs on hpc2-comp01(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp01 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE:   
Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE:   
Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: can_job_run_on_node: SELECT_TYPE: 
32 CPUs on hpc2-comp02(state:1), mem 1/1
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE: 
Node:hpc2-comp02 Sockets:2 SpecThreads:0 CPUs:Min-Max,Avail:2-32,32 
ThreadsPerCore:2
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE:   
Socket[0] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: _avail_res_log: SELECT_TYPE:   
Socket[1] Cores:8
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/elim_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _eval_nodes: set:0 consec CPUs:64 
nodes:2:hpc2-comp[01-02] begin:0 end:1 required:-1 weight:511
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/choose_nodes
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15,node[1]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
_select_nodes/sync_cores
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
node_list:hpc2-comp01
[2023-06-15T20:11:32.632] select/cons_tres: core_array_log: 
core_list:node[0]:0-15
[2023-06-15T20:11:32.632] select/cons_tres: _job_test: SELECT_TYPE: test 0 
pass: test_only
[2023-06-15T20:11:32.632] select/cons_tres: common_job_test: no job_resources 
info for JobId=71_7(71) rc=0
[2023-06-15T20:11:32.632] debug3: sched: JobId=71_*. State=PENDING. 
Reason=Resources. Priority=4294901759. Partition=debug.
[2023-06-15T20:11:56.645] debug:  Spawning ping agent for hpc2-comp[01-02]
[2023-06-15T20:11:56.645] debug2: Spawning RPC agent for msg_type REQUEST_PING
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp01
[2023-06-15T20:11:56.646] debug2: Tree head got back 0 looking for 2
[2023-06-15T20:11:56.646] debug3: Tree sending to hpc2-comp02
[2023-06-15T20:11:56.647] debug2: Tree head got back 1
[2023-06-15T20:11:56.647] debug2: Tree head got back 2
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp01
[2023-06-15T20:11:56.651] debug2: node_did_resp hpc2-comp02
[2023-06-15T20:11:57.329] debug:  sched/backfill: _attempt_backfill: beginning
[2023-06-15T20:11:57.329] debug:  sched/backfill: _attempt_backfill: 1 jobs to 
backfill
[2023-06-15T20:11:57.329] debug2: sched/backfill: _attempt_backfill: entering 
_try_sched for JobId=71_*.
[2023-06-15T20:11:57.329] debug2: select/cons_tres: select_p_job_test: 
evaluating JobId=71_*
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: JobId=71_* 
node_mode:Normal alloc_mode:Will_Run
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: node_list & 
exc_cores
[2023-06-15T20:11:57.329] select/cons_tres: core_array_log: 
node_list:hpc2-comp[01-02]
[2023-06-15T20:11:57.329] select/cons_tres: common_job_test: nodes: min:1 max:1 
requested:1 avail:2
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp01 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:16 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] select/cons_tres: node_data_dump: Node:hpc2-comp02 
Boards:1 SocketsPerBoard:2 CoresPerSocket:8 ThreadsPerCore:2 TotalCores:16 
CumeCores:32 TotalCPUs:32 PUsPerCore:2 AvailMem:1 AllocMem:1 State:one_row(1)
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp01, allocated memory = 1 and all memory requested for 
JobId=71_*
[2023-06-15T20:11:57.330] debug3: select/cons_tres: _verify_node_state: Not 
considering node hpc2-comp02, allocated memory = 1 and all memory requested for 
JobId=71_*
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: evaluating 
JobId=71_* on 0 nodes
[2023-06-15T20:11:57.330] select/cons_tres: _job_test: SELECT_TYPE: test 0 
fail: insufficient resources
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_5(76): 
overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) 
action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_5(76) nhosts:1 ncpus:1 node_req:1 
nodes=hpc2-comp01
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] debug3: select/cons_tres: job_res_rm_job: removed 
JobId=71_5(76) from part debug row 0
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_5(76) 
finished
[2023-06-15T20:11:57.330] select/cons_tres: _will_run_test: JobId=71_6(77): 
overlap=1
[2023-06-15T20:11:57.330] select/cons_tres: job_res_rm_job: JobId=71_6(77) 
action:normal
[2023-06-15T20:11:57.330] ====================
[2023-06-15T20:11:57.330] JobId=71_6(77) nhosts:1 ncpus:1 node_req:1 
nodes=hpc2-comp02
[2023-06-15T20:11:57.330] Node[0]:
[2023-06-15T20:11:57.330]   Mem(MB):1:0  Sockets:2  Cores:8  CPUs:2:0
[2023-06-15T20:11:57.330]   Socket[0] Core[0] is allocated
[2023-06-15T20:11:57.330] --------------------
[2023-06-15T20:11:57.330] cpu_array_value[0]:2 reps:1
[2023-06-15T20:11:57.330] ====================

----- batch script -----------------------------------

#!/bin/bash

echo "Running on: ${SLURM_CLUSTER_NAME}, node list: ${SLURM_JOB_NODELIST}, node names: ${SLURMD_NODENAME} in: `pwd` at `date`"
echo "SLURM_NTASKS: ${SLURM_NTASKS} SLURM_TASKS_PER_NODE: ${SLURM_TASKS_PER_NODE}"
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_MEM_PER_CPU: ${SLURM_MEM_PER_CPU}"

sleep 3600

echo "END"

Here is the sbatch command to run it:

sbatch -J test -a1-10 -t 999:00:00 -N 1 -n 1 -p debug sbatch.slurm

----- slurm.conf -----------------------------------

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=cluster
SlurmctldHost=hpc2-comp00
#SlurmctldHost=
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SrunPortRange=60001-60005
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
#TaskPlugin=task/affinity,task/cgroup
TaskPlugin=task/none
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
AccountingStorageHost=hpc2-comp00
AccountingStoragePass=/var/run/munge/munge.socket.2
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_comment,job_env,job_extra,job_script
#JobCompHost=localhost
#JobCompLoc=slurm_jobcomp_db
##JobCompParams=
#JobCompPass=/var/run/munge/munge.socket.2
#JobCompPort=3306
#JobCompType=jobcomp/mysql
JobCompType=jobcomp/none
#JobCompUser=slurm
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
# Enabled next line - 06-15-2023
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurmctld.log
# Enabled next line - 06-15-2023
SlurmdDebug=debug5
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=

# Added next line : 06-15-2023
DebugFlags=Cgroup,CPU_Bind,Data,Gres,NodeFeatures,SelectType,Steps,TraceJobs

#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=hpc2-comp[01-02] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=hpc2-comp[01-02] Default=YES MaxTime=INFINITE State=UP

----- Vagrantfile file -----------------------------------

# -*- mode: ruby -*-
# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
 # The most common configuration options are documented and commented below.
 # For a complete reference, please see the online documentation at
 # https://docs.vagrantup.com.

 # Every Vagrant development environment requires a box. You can search for
 # boxes at https://vagrantcloud.com/search.
 config.vm.box = "generic/fedora37"

 # Stop vagrant from generating a new key for each host to allow ssh between
 # machines.
 config.ssh.insert_key = false

 # The Vagrant commands are too limited to configure a NAT network,
 # so run the VBoxManager commands by hand.
 config.vm.provider "virtualbox" do |vbox|
   # Add nic2 (eth1 on the guest VM) as the physical router.  Never change
   # nic1, because that's what the host uses to communicate with the guest VM.
   vbox.customize ["modifyvm", :id,
                   "--nic2", "bridged",
                   "--bridge-adapter2", "enp8s0"]
 end

 # Common provisioning for all guest VMs.
 config.vm.provision "shell", inline: <<-SHELL
   # Show which command is being run to associate with command output!
   set -x

   # Remove spurious hosts from the VM image.
   sed -i '/fedora37/d' /etc/hosts
   sed -i '/^127[.]0[.]1[.]1/d' /etc/hosts

   # Add NAT network to /etc/hosts.
   for host in 10.0.1.{100..102}
   do
       hostname=hpc2-comp${host:8}
       grep -q $host /etc/hosts ||
            echo "$host $hostname" >> /etc/hosts
   done
   unset host hostname

   # Use latest set of packages.
   dnf -y update

   # Install MUNGE.
   dnf -y install munge

   # Create the SLURM user.
   id -u slurm ||
      useradd -r -s /sbin/nologin -d /etc/slurm -c "SLURM job scheduler" slurm
 SHELL

 config.vm.define "hpc2_comp00" do |hpc2_comp00|
   hpc2_comp00.vm.hostname = "hpc2-comp00"
   hpc2_comp00.vm.synced_folder ".", "/vagrant", automount: true
   hpc2_comp00.vm.provision :shell, inline: <<-SHELL
     # Show which command is being run to associate with command output!
     set -x

     # Set static IP address for NAT network.
     HOST=10.0.1.100
     ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
     sed "s|address1=10.0.1.100|address1=${HOST}|" \
         /vagrant/eth1.nmconnection > $ETH1
     chmod go-r $ETH1
     nmcli con load $ETH1
     unset HOST

     # Create the MUNGE key.
     [[ -f /etc/munge/munge.key ]] || sudo -u munge /usr/sbin/mungekey -v
     cp -av /etc/munge/munge.key /vagrant/

     # Enable and start munge.
     systemctl enable munge
     systemctl start munge
     systemctl status munge

     # Setup database on the head node:
     dnf -y install mariadb-devel mariadb-server

     # Set recommended memory (5%-50% of RAM) and timeout.
     CNF=/etc/my.cnf.d/mariadb-server.cnf

     # Note we need to use a double backslash for the newline character below
     # because of Vagrant's inline shell script.

     MYSQL_RAM=$(awk '/^MemTotal/ {printf "%.0f\\n", $2*0.05}' /proc/meminfo)
     grep -q innodb_buffer_pool_size $CNF ||
          sed -i '/InnoDB/a innodb_buffer_pool_size='${MYSQL_RAM}K $CNF
     grep -q innodb_lock_wait_timeout $CNF ||
          sed -i '/innodb_buffer_pool_size/a innodb_lock_wait_timeout=900' $CNF
     unset CNF MYSQL_RAM

     # Run the head node services:
     systemctl enable mariadb
     systemctl start mariadb
     systemctl status mariadb

     # Secure the server.
     #
     # Send interactive commands using printf per
     # https://unix.stackexchange.com/a/112348
     printf "%s\n" "" n n y y y y | mariadb-secure-installation

     # Install the RPM package builder for SLURM.
     dnf -y install rpmdevtools

     # Download SLURM.
     wget -nc https://download.schedmd.com/slurm/slurm-23.02.0.tar.bz2
     # Install the source package dependencies to determine the dependencies
     # to build the binary.
     dnf -y install \
         dbus-devel \
         freeipmi-devel \
         hdf5-devel \
         http-parser-devel \
         json-c-devel \
         libcurl-devel \
         libjwt-devel \
         libyaml-devel \
         lua-devel \
         lz4-devel \
         man2html \
         readline-devel \
         rrdtool-devel
     SLURM_SRPM=~/rpmbuild/SRPMS/slurm-23.02.0-1.fc37.src.rpm
     # Create SLURM .rpmmacros file.
     cp -av /vagrant/.rpmmacros .
     [[ -f ${SLURM_SRPM} ]] ||
           rpmbuild -ts slurm-23.02.0.tar.bz2 |& tee build-slurm-source.log
     # Installs the source package dependencies to build the binary.
     dnf -y builddep ${SLURM_SRPM}
     unset SLURM_SRPM
     # Build the SLURM binaries.
     SLURM_RPM=~/rpmbuild/RPMS/x86_64/slurm-23.02.0-1.fc37.x86_64.rpm
     [[ -f ${SLURM_RPM} ]] ||
           rpmbuild -ta slurm-23.02.0.tar.bz2 |& tee build-slurm-binary.log
     unset SLURM_RPM

     # Copy SLURM packages to the compute nodes.
     DIR_RPM=~/rpmbuild/RPMS/x86_64
     cp -av ${DIR_RPM}/slurm-23*.rpm ${DIR_RPM}/slurm-slurmd-23*.rpm /vagrant/ &&
        touch /vagrant/sentinel-copied-rpms.done

     # Install all SLURM packages on the head node.
     find ${DIR_RPM} -type f -not -name '*-slurmd-*' -not -name '*-torque-*' \
          -exec dnf -y install {} +
     unset DIR_RPM
     # Copy the configuration files.
     cp -av /vagrant/slurmdbd.conf /etc/slurm/slurmdbd.conf
     chown slurm:slurm /etc/slurm/slurmdbd.conf
     chmod 600 /etc/slurm/slurmdbd.conf
     cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
     chown root:root /etc/slurm/slurm.conf
     chmod 644 /etc/slurm/slurm.conf

     # Now create the slurm MySQL user.
     SLURM_PASSWORD=$(awk -vFS='=' '/StoragePass/ {print $2}' /etc/slurm/slurmdbd.conf)
     DBD_HOST=localhost
     # https://docs.fedoraproject.org/en-US/quick-docs/installing-mysql-mariadb/#_start_mysql_service_and_enable_at_loggin
     mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
     mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
     mysql -e "show grants for 'slurm'@'$DBD_HOST';"
     DBD_HOST=hpc2-comp00
     mysql -e "create user if not exists 'slurm'@'$DBD_HOST' identified by '$SLURM_PASSWORD';"
     mysql -e "grant all on slurm_acct_db.* to 'slurm'@'$DBD_HOST';"
     mysql -e "show grants for 'slurm'@'$DBD_HOST';"
     unset SLURM_PASSWORD DBD_HOST

     systemctl enable slurmdbd
     systemctl start slurmdbd
     systemctl status slurmdbd

     systemctl enable slurmctld
     mkdir -p /var/spool/slurmctld
     chown slurm:slurm /var/spool/slurmctld
     # Open ports for slurmctld (6817) and slurmdbd (6819).
     firewall-cmd --add-port=6817/tcp
     firewall-cmd --add-port=6819/tcp
     firewall-cmd --runtime-to-permanent
     systemctl start slurmctld
     systemctl status slurmctld

     # Clear any previous node DOWN errors.
     sinfo -s
     sinfo -R
     scontrol update nodename=ALL state=RESUME
     sinfo -s
     sinfo -R
 SHELL
 end

 config.vm.define "hpc2_comp01" do |hpc2_comp01|
   hpc2_comp01.vm.hostname = "hpc2-comp01"
   hpc2_comp01.vm.synced_folder ".", "/vagrant", automount: true
   hpc2_comp01.vm.provision :shell, inline: <<-SHELL
     # Show which command is being run to associate with command output!
     set -x

     # Set static IP address for NAT network.
     HOST=10.0.1.101
     ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
     sed "s|address1=10.0.1.100|address1=${HOST}|" \
         /vagrant/eth1.nmconnection > $ETH1
     chmod go-r $ETH1
     nmcli con load $ETH1
     unset HOST

     # Copy the MUNGE key.
     KEY=/etc/munge/munge.key
     cp -av /vagrant/munge.key /etc/munge/
     chown munge:munge $KEY
     chmod 600 $KEY

     # Enable and start munge.
     systemctl enable munge
     systemctl start munge
     systemctl status munge

     # SLURM packages to be installed on compute nodes.
     DIR_RPM=~/rpmbuild/RPMS/x86_64
     mkdir -p ${DIR_RPM}
     rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
     dnf -y install ${DIR_RPM}/slurm-23*.rpm
     dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
     unset DIR_RPM
     # Copy the configuration file.
     mkdir -p /etc/slurm
     cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
     chown root:root /etc/slurm/slurm.conf
     chmod 644 /etc/slurm/slurm.conf
     # Only enable slurmd on the worker nodes.
     systemctl enable slurmd
     # Open port for slurmd (6818).
     firewall-cmd --add-port=6818/tcp
     firewall-cmd --runtime-to-permanent
     # Open port range for srun.
     SRUN_PORT_RANGE=$(awk -vFS='=' '/SrunPortRange/ {print $2}' /etc/slurm/slurm.conf)
     firewall-cmd --add-port=$SRUN_PORT_RANGE/tcp
     firewall-cmd --runtime-to-permanent
     systemctl start slurmd
     systemctl status slurmd

     # Clear any previous node DOWN errors.
     sinfo -s
     sinfo -R
     scontrol update nodename=ALL state=RESUME
     sinfo -s
     sinfo -R
 SHELL
 end

 config.vm.define "hpc2_comp02" do |hpc2_comp02|
   hpc2_comp02.vm.hostname = "hpc2-comp02"
   hpc2_comp02.vm.synced_folder ".", "/vagrant", automount: true
   hpc2_comp02.vm.provision :shell, inline: <<-SHELL
     # Show which command is being run to associate with command output!
     set -x

     # Set static IP address for NAT network.
     HOST=10.0.1.102
     ETH1=/etc/NetworkManager/system-connections/eth1.nmconnection
     sed "s|address1=10.0.1.100|address1=${HOST}|" \
         /vagrant/eth1.nmconnection > $ETH1
     chmod go-r $ETH1
     nmcli con load $ETH1
     unset HOST

     # Copy the MUNGE key.
     KEY=/etc/munge/munge.key
     cp -av /vagrant/munge.key /etc/munge/
     chown munge:munge $KEY
     chmod 600 $KEY

     # Enable and start munge.
     systemctl enable munge
     systemctl start munge
     systemctl status munge

     # SLURM packages to be installed on compute nodes.
     DIR_RPM=~/rpmbuild/RPMS/x86_64
     mkdir -p ${DIR_RPM}
     rsync -avP /vagrant/slurm*.rpm ${DIR_RPM}/
     dnf -y install ${DIR_RPM}/slurm-23*.rpm
     dnf -y install ${DIR_RPM}/slurm-slurmd-23*.rpm
     unset DIR_RPM
     # Copy the configuration file.
     mkdir -p /etc/slurm
     cp -av /vagrant/slurm.conf /etc/slurm/slurm.conf
     chown root:root /etc/slurm/slurm.conf
     chmod 644 /etc/slurm/slurm.conf
     # Only enable slurmd on the worker nodes.
     systemctl enable slurmd
     # Open port for slurmd (6818).
     firewall-cmd --add-port=6818/tcp
     firewall-cmd --runtime-to-permanent
     systemctl start slurmd
     systemctl status slurmd

     # Clear any previous node DOWN errors.
     sinfo -s
     sinfo -R
     scontrol update nodename=ALL state=RESUME
     sinfo -s
     sinfo -R
 SHELL
 end
end

-------------------------------------------------------------------------
