[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Hermann Schwärzler via slurm-users

Dear Xaver,

we have a similar setup and yes, we have set the node to "state=DRAIN".
Slurm keeps it this way until you manually change it to e.g. "state=RESUME".
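In case it is useful, the commands we use look roughly like this (just a sketch; the node name "master" and the Reason text are examples, not our real values):

  # take the node out of service; no new jobs will be scheduled on it
  scontrol update NodeName=master State=DRAIN Reason="controller node, no compute jobs"

  # and to make it available again later
  scontrol update NodeName=master State=RESUME

As far as I know you can also put "State=DRAIN" directly into the node's NodeName line in slurm.conf, so it already starts out drained.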

Regards,
Hermann

On 6/24/24 13:54, Xaver Stiensmeier via slurm-users wrote:

Dear Slurm users,

in our project we exclude the master from computing before starting 
Slurmctld. We used to do this by simply not mentioning it in the 
configuration, i.e. just not having:


     PartitionName=SomePartition Nodes=master

or something similar. Apparently this is no longer the way to do it, as 
it now results in a fatal error:


fatal: Unable to determine this slurmd's NodeName

Therefore, my *question*:

What is the best practice for excluding the master node from work?

I mainly see the options of setting the node to DOWN, DRAINED or 
RESERVED. Since we use ReturnToService=2, I guess DOWN is not the way to 
go. RESERVED only matches because of the second part of its description 
("The node is in an advanced reservation and *not generally available*."), 
while the description of DRAINED ("The node is unavailable for use per 
system administrator request.") fits completely.
So is *DRAINED* the correct setting in such a case?


Best regards,
Xaver





--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch and --nodes

2024-05-31 Thread Hermann Schwärzler via slurm-users

Hi Michael,

if you submit a job array, all resource-related options (number of 
nodes, tasks, cpus per task, memory, time, ...) apply *per array task*.
So in your case you start 100 array tasks (you could also call them 
"sub-jobs"), *each* of which (not your job as a whole) is limited to one 
node, one cpu and the default amount of time, memory and so forth. Many 
of them might run in parallel, potentially on many different nodes.


So what you get is the expected behaviour.
If you really want to limit all your array tasks to one node, you have 
to specify that node explicitly with '-w' ('--nodelist'), for example as 
in the sketch below.
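
A minimal sketch of what that could look like (the node name "node01" is only a placeholder):

  sbatch --nodes=1 --cpus-per-task=1 --array=1-100 -w node01 \
         --output test_%A_%a.txt --wrap 'uname -n'

With '-w node01' every single array task is constrained to that one node.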


Regards,
Hermann


On 5/31/24 19:12, Michael DiDomenico via slurm-users wrote:

it's friday and i'm either doing something silly or have a misconfig
somewhere, i can't figure out which

when i run

sbatch --nodes=1 --cpus-per-task=1 --array=1-100 --output
test_%A_%a.txt --wrap 'uname -n'

sbatch doesn't seem to be adhering to the --nodes param.  when i look
at my output files it's spreading them across more nodes.  in the
simple case above it's 50/50, but if i throw a random sleep in,
it'll be more.  and if i expand the array it'll use even more nodes.
i'm using con/tres and have cr_core_memory,cr_one_core_per_task set



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch problem

2024-05-29 Thread Hermann Schwärzler via slurm-users

Hi Mihai,

yes, it's the same problem: when you run

  srun echo $CUDA_VISIBLE_DEVICES

the value that $CUDA_VISIBLE_DEVICES has on the first of the two nodes 
is substituted into the line *before* srun is called.


  srun bash -c 'echo $CUDA_VISIBLE_DEVICES'

is the way to go.

BTW: the job-script I am seeing in your email is not the one that 
produced the output you are showing. The script does not use srun, so 
everything will be run only once and only on the first of the two nodes.


Regards,
Hermann


On 5/29/24 10:18, Mihai Ciubancan wrote:

Dear Hermann,

Sorry to come back to you, but just to understand...if I run the 
following script:


#!/bin/bash

#SBATCH --partition=gpu
#SBATCH --time=24:00:00

#SBATCH --nodes=2
#SBATCH --exclusive

#SBATCH --job-name="test_job"
#SBATCH -o stdout_%j
#SBATCH -e stderr_%j

touch test.txt

# Print the hostname of the allocated node
echo "Running on host: $(hostname)"

# Print the start time
echo "Job started at: $(date)"

# Perform a simple task that takes a few minutes
echo "Starting the task..."
sleep 60

echo "GPU UUIDs:"
nvidia-smi --query-gpu=uuid --format=csv,noheader
echo $CUDA_VISIBLE_DEVICES

echo "Task completed."

# Print the end time
echo "Job finished at: $(date)"

I'm getting the following results:

Starting the task...
GPU UUIDs:
GPU UUIDs:
GPU-d4e002a9-409f-79bb-70e1-56c1a473a188
GPU-33b728e2-0396-368b-b9c3-8f828ca145b1
GPU-7d90f7d8-aadf-ba95-2409-8c57bd40d24b
GPU-30faa03a-0782-4b6c-dda2-e108159ba953
GPU-37d09257-2582-8080-223a-dd5a646fba43
GPU-c71cbb10-4368-d327-e0e5-56372aa4f10f
GPU-a413a75a-15b2-063e-638f-bde063af5c8e
GPU-bf12181a-e615-dcd4-5da2-9a518ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b
0,1,2,3
0,1,2,3
Task completed.

Whereas for the command 'echo $CUDA_VISIBLE_DEVICES' I would expect to get:

0,1,2,3
0,1,2,3,4,5,6,7

Is this for the same reason that I had the problems with hostname?

Thank you,
Mihai


On 2024-05-28 13:31, Hermann Schwärzler wrote:

Dear Mihai,

you are not asking Slurm to provide you with any GPUs:

 #SBATCH --gpus=12

So it doesn't reserve any for you and as a consequence also does not
set CUDA_VISIBLE_DEVICES for you.

nvidia-smi works, because it looks like you are not using cgroups at
all or at least not "ConstrainDevices=yes" in e.g. cgroup.conf.
So it "sees" all the GPUs that are installed in the node it's running
on even if none is reserved for you by Slurm.

Regards,
Hermann

On 5/28/24 12:07, Mihai Ciubancan wrote:

Dear Hermann,
Dear James,

Thank you both for your answers!

I have tried using bash -c as you suggested and it worked.
But when I try the following script, the "bash -c" trick doesn't work:


#!/bin/bash

#SBATCH --partition=eli
#SBATCH --time=24:00:00

#SBATCH --nodelist=mihaigpu2,mihai-x8640
#SBATCH --gpus=12
#SBATCH --exclusive

#SBATCH --job-name="test_job"
#SBATCH -o /data/mihai/stdout_%j
#SBATCH -e /data/mihai/stderr_%j

touch test.txt

# Print the hostname of the allocated node
srun bash -c 'echo Running on host: $(hostname)'

# Print the start time
echo "Job started at: $(date)"

# Perform a simple task that takes a few minutes
echo "Starting the task..."
sleep 20

srun echo "GPU UUIDs:"
srun nvidia-smi --query-gpu=uuid --format=csv,noheader
srun bash -c 'echo $CUDA_VISIBLE_DEVICES'

##echo "Task completed."

# Print the end time
echo "Job finished at: $(date)"

I don't get any output from the command srun bash -c 'echo 
$CUDA_VISIBLE_DEVICES':


Running on host: mihaigpu2
Running on host: mihai-x8640
Job started at: Tue May 28 13:02:59 EEST 2024
Starting the task...
GPU UUIDs:
GPU UUIDs:
GPU-d4e002a9-409f-79bb-70e1-56c1a473a188
GPU-33b728e2-0396-368b-b9c3-8f828ca145b1
GPU-7d90f7d8-aadf-ba95-2409-8c57bd40d24b
GPU-30faa03a-0782-4b6c-dda2-e108159ba953
GPU-37d09257-2582-8080-223a-dd5a646fba43
GPU-c71cbb10-4368-d327-e0e5-56372aa4f10f
GPU-a413a75a-15b2-063e-638f-bde063af5c8e
GPU-bf12181a-e615-dcd4-5da2-9a518ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b


Job finished at: Tue May 28 13:03:20 EEST 2024

...I'm not interested in the output of the other 'echo' commands, 
besides the one with the hostname; that's why I didn't change them.


Best,
Mihai


I will try
On 2024-05-28 12:23, Hermann Schwärzler via slurm-users wrote:

Hi Mihai,

this is a problem that is not Slurm related. It's rather about:
"when does command substitution happen?"

When you write

  srun echo Running on host: $(hostname)

$(hostname) is replaced by the output of the hostname-command *before*
the line is "submitted" to srun. Which means that srun will happily
run it on any (remote) node using the name of the host 

[slurm-users] Re: sbatch problem

2024-05-28 Thread Hermann Schwärzler via slurm-users

Dear Mihai,

you are not asking Slurm to provide you with any GPUs:

 #SBATCH --gpus=12

So it doesn't reserve any for you and as a consequence also does not set 
CUDA_VISIBLE_DEVICES for you.


nvidia-smi works, because it looks like you are not using cgroups at all 
or at least not "ConstrainDevices=yes" in e.g. cgroup.conf.
So it "sees" all the GPUs that are installed in the node it's running on 
even if none is reserved for you by Slurm.
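
For reference, a minimal cgroup.conf sketch with device constraining enabled (only an illustration, the settings you actually need may differ):

  # cgroup.conf
  CgroupPlugin=cgroup/v2
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes

together with "task/cgroup" in the TaskPlugin line of slurm.conf. With that in place, nvidia-smi inside a job should only see the GPUs that Slurm has actually allocated to it.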


Regards,
Hermann

On 5/28/24 12:07, Mihai Ciubancan wrote:

Dear Hermann,
Dear James,

Thank you both for your answers!

I have tried using bash -c as you suggested and it worked.
But when I try the following script, the "bash -c" trick doesn't work:

#!/bin/bash

#SBATCH --partition=eli
#SBATCH --time=24:00:00

#SBATCH --nodelist=mihaigpu2,mihai-x8640
#SBATCH --gpus=12
#SBATCH --exclusive

#SBATCH --job-name="test_job"
#SBATCH -o /data/mihai/stdout_%j
#SBATCH -e /data/mihai/stderr_%j

touch test.txt

# Print the hostname of the allocated node
srun bash -c 'echo Running on host: $(hostname)'

# Print the start time
echo "Job started at: $(date)"

# Perform a simple task that takes a few minutes
echo "Starting the task..."
sleep 20

srun echo "GPU UUIDs:"
srun nvidia-smi --query-gpu=uuid --format=csv,noheader
srun bash -c 'echo $CUDA_VISIBLE_DEVICES'

##echo "Task completed."

# Print the end time
echo "Job finished at: $(date)"

I don't get any output from the command srun bash -c 'echo 
$CUDA_VISIBLE_DEVICES':


Running on host: mihaigpu2
Running on host: mihai-x8640
Job started at: Tue May 28 13:02:59 EEST 2024
Starting the task...
GPU UUIDs:
GPU UUIDs:
GPU-d4e002a9-409f-79bb-70e1-56c1a473a188
GPU-33b728e2-0396-368b-b9c3-8f828ca145b1
GPU-7d90f7d8-aadf-ba95-2409-8c57bd40d24b
GPU-30faa03a-0782-4b6c-dda2-e108159ba953
GPU-37d09257-2582-8080-223a-dd5a646fba43
GPU-c71cbb10-4368-d327-e0e5-56372aa4f10f
GPU-a413a75a-15b2-063e-638f-bde063af5c8e
GPU-bf12181a-e615-dcd4-5da2-9a518ae1af5d
GPU-dfec21c4-e30d-5a36-599d-eef2fd354809
GPU-15a11fe2-33f2-cd65-09f0-9897ba057a0c
GPU-2d971e69-8147-8221-a055-e26573950f91
GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b


Job finished at: Tue May 28 13:03:20 EEST 2024

...I'm not interested in the output of the other 'echo' commands, 
besides the one with the hostname; that's why I didn't change them.


Best,
Mihai


I will try
On 2024-05-28 12:23, Hermann Schwärzler via slurm-users wrote:

Hi Mihai,

this is a problem that is not Slurm related. It's rather about:
"when does command substitution happen?"

When you write

  srun echo Running on host: $(hostname)

$(hostname) is replaced by the output of the hostname-command *before*
the line is "submitted" to srun. Which means that srun will happily
run it on any (remote) node using the name of the host it is running
on.

If you want to avoid this, one possible solution is

  srun bash -c 'echo Running on host: $(hostname)'

In this case the command substitution is happening after srun starts
the process on a (potentially remote) node.

Regards,
Hermann


On 5/28/24 10:54, Mihai Ciubancan via slurm-users wrote:

Hello,

My name is Mihai and I have an issue with a small GPU cluster managed 
with Slurm 22.05.11. I get two different outputs when I try to find out 
the names of the nodes (one correct and one wrong). The script is:


#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/data/mihai/res.txt
#SBATCH --partition=eli
#SBATCH --nodes=2
srun echo Running on host: $(hostname)
srun hostname
srun sleep 15

And the output look like this:

cat res.txt
Running on host: mihai-x8640
Running on host: mihai-x8640
mihaigpu2
mihai-x8640

As you can see, the output of the command 'srun echo Running on host: 
$(hostname)' is the same, as if the job was running twice on the same 
node, while the command 'srun hostname' gives me the correct output.


Do you have any idea why the outputs of the 2 commands are different?

Thank you,
Mihai



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch problem

2024-05-28 Thread Hermann Schwärzler via slurm-users

Hi Mihai,

this is a problem that is not Slurm related. It's rather about:
"when does command substitution happen?"

When you write

  srun echo Running on host: $(hostname)

$(hostname) is replaced by the output of the hostname-command *before* 
the line is "submitted" to srun. Which means that srun will happily run 
it on any (remote) node using the name of the host it is running on.


If you want to avoid this, one possible solution is

  srun bash -c 'echo Running on host: $(hostname)'

In this case the command substitution is happening after srun starts the 
process on a (potentially remote) node.


Regards,
Hermann


On 5/28/24 10:54, Mihai Ciubancan via slurm-users wrote:

Hello,

My name is Mihai and I have an issue with a small GPU cluster managed 
with Slurm 22.05.11. I get two different outputs when I try to find out 
the names of the nodes (one correct and one wrong). The script is:


#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/data/mihai/res.txt
#SBATCH --partition=eli
#SBATCH --nodes=2
srun echo Running on host: $(hostname)
srun hostname
srun sleep 15

And the output look like this:

cat res.txt
Running on host: mihai-x8640
Running on host: mihai-x8640
mihaigpu2
mihai-x8640

As you can see, the output of the command 'srun echo Running on host: 
$(hostname)' is the same, as if the job was running twice on the same 
node, while the command 'srun hostname' gives me the correct output.


Do you have any idea why the outputs of the 2 commands are different?

Thank you,
Mihai



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-27 Thread Hermann Schwärzler via slurm-users

Hi everybody,

On 5/26/24 08:40, Ole Holm Nielsen via slurm-users wrote:
[...]
Whether or not to enable Hyper-Threading (HT) on your compute nodes 
depends entirely on the properties of applications that you wish to run 
on the nodes.  Some applications are faster without HT, others are 
faster with HT.  When HT is enabled, the "virtual CPU cores" obviously 
will have only half the memory available per core.


The VASP code is highly CPU- and memory intensive, and HT should 
probably be disabled for optimal performance with VASP.


Slurm doesn't affect the performance of your codes with or without HT. 
Slurm just schedules tasks to run on the available cores.


This is how we are handling Hyper-Threading in our cluster:
* It's enabled in the BIOS/system settings.
* The important parts in our slurm.conf are:
 TaskPlugin=task/affinity,task/cgroup
 CliFilterPlugins=cli_filter/lua
 NodeName=DEFAULT ... ThreadsPerCore=2
* We make "--hint=nomultithread" the default for jobs by having this in 
cli_filter.lua:

  function slurm_cli_setup_defaults(options, early_pass)
options['hint'] = 'nomultithread'
return slurm.SUCCESS
  end
 So users can still use Hyper-Threading by specifying
 "--hint=multithread" in their job-script which will give them two
 "CPUs/Threads" per Core. Without this option they will get one Core
 per requested CPU.

This works for us and our users. There is only one small side-effect: 
when a job is pending, the expected number is displayed in the "CPUS" 
column of the output of "squeue". But when a job is running, twice that 
number is displayed (as Slurm counts both Hyper-Threads per Core as "CPUs").
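
For illustration, the user-facing difference is roughly this (a sketch; the job-script name is just a placeholder):

  # default (injected by cli_filter.lua): --hint=nomultithread,
  # i.e. one physical core per requested CPU
  sbatch --cpus-per-task=4 job.sh

  # explicit opt-in: two hardware threads ("CPUs") per requested core
  sbatch --hint=multithread --cpus-per-task=4 job.sh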


Regards,
Hermann

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-24 Thread Hermann Schwärzler via slurm-users

Hi Zhao,

my guess is that in your faster case you are using hyperthreading 
whereas in the Slurm case you are not.


Can you check what performance you get when you add

#SBATCH --hint=multithread

to your Slurm script?

Another difference between the two might be
a) the communication channel/interface that is used.
b) the number of nodes involved: when using mpirun directly you might be 
running things on more than one node (a quick way to check this is 
sketched below).
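
A quick way to check both points from inside the Slurm job could be something like this (just a diagnostic sketch):

  # print, for every task, the node it runs on and the CPUs it may use
  srun bash -c 'echo "$(hostname): $(grep Cpus_allowed_list /proc/self/status)"'

Comparing that with where and how the direct mpirun run places its ranks should show whether hyperthreading or task distribution makes the difference.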


Regards,
Hermann

On 5/24/24 15:32, Hongyi Zhao via slurm-users wrote:

Dear Slurm Users,

I am experiencing a significant performance discrepancy when running
the same VASP job through the Slurm scheduler compared to running it
directly with mpirun. I am hoping for some insights or advice on how
to resolve this issue.

System Information:

Slurm Version: 21.08.5
OS: Ubuntu 22.04.4 LTS (Jammy)


Job Submission Script:

#!/usr/bin/env bash
#SBATCH -N 1
#SBATCH -D .
#SBATCH --output=%j.out
#SBATCH --error=%j.err
##SBATCH --time=2-00:00:00
#SBATCH --ntasks=36
#SBATCH --mem=64G

echo '###'
echo "date= $(date)"
echo "hostname= $(hostname -s)"
echo "pwd = $(pwd)"
echo "sbatch  = $(which sbatch | xargs realpath -e)"
echo ""
echo "WORK_DIR= $WORK_DIR"
echo "SLURM_SUBMIT_DIR= $SLURM_SUBMIT_DIR"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_NTASKS= $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE   = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOBID = $SLURM_JOBID"
echo "SLURM_JOB_NODELIST  = $SLURM_JOB_NODELIST"
echo "SLURM_NNODES= $SLURM_NNODES"
echo "SLURMTMPDIR = $SLURMTMPDIR"
echo '###'
echo ""

module purge > /dev/null 2>&1
module load vasp
ulimit -s unlimited
mpirun vasp_std


Performance Observation:

When running the job through Slurm:

werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
   LOOP:  cpu time 14.4893: real time 14.5049
   LOOP:  cpu time 14.3538: real time 14.3621
   LOOP:  cpu time 14.3870: real time 14.3568
   LOOP:  cpu time 15.9722: real time 15.9018
   LOOP:  cpu time 16.4527: real time 16.4370
   LOOP:  cpu time 16.7918: real time 16.7781
   LOOP:  cpu time 16.9797: real time 16.9961
   LOOP:  cpu time 15.9762: real time 16.0124
   LOOP:  cpu time 16.8835: real time 16.9008
   LOOP:  cpu time 15.2828: real time 15.2921
  LOOP+:  cpu time 176.0917: real time 176.0755

When running the job directly with mpirun:


werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
mpirun -n 36 vasp_std
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
   LOOP:  cpu time  9.0072: real time  9.0074
   LOOP:  cpu time  9.0515: real time  9.0524
   LOOP:  cpu time  9.1896: real time  9.1907
   LOOP:  cpu time 10.1467: real time 10.1479
   LOOP:  cpu time 10.2691: real time 10.2705
   LOOP:  cpu time 10.4330: real time 10.4340
   LOOP:  cpu time 10.9049: real time 10.9055
   LOOP:  cpu time  9.9718: real time  9.9714
   LOOP:  cpu time 10.4511: real time 10.4470
   LOOP:  cpu time  9.4621: real time  9.4584
  LOOP+:  cpu time 110.0790: real time 110.0739


Could you provide any insights or suggestions on what might be causing
this performance issue? Are there any specific configurations or
settings in Slurm that I should check or adjust to align the
performance more closely with the direct mpirun execution?

Thank you for your time and assistance.

Best regards,
Zhao


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: srun weirdness

2024-05-15 Thread Hermann Schwärzler via slurm-users

Hi Dj,

could be a memory-limits related problem. What is the output of

 ulimit -l -m -v -s

in both interactive job-shells?

You are using cgroups-v1 now, right?
In that case what is the respective content of

 /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes

in both shells?

Regards,
Hermann


On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
I'm running into a strange issue and I'm hoping another set of brains 
looking at this might help.  I would appreciate any feedback.


I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8 
on Rocky Linux 8.9 machines.  The second cluster is running Slurm 
23.11.6 on Rocky Linux 9.4 machines.


This works perfectly fine on the first cluster:

$ srun --mem=32G --pty /bin/bash

srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help

and the ollama help message appears as expected.

However, on the second cluster:

$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources

and on the resulting shell on the compute node:

$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory
runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
     runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 
pc=0x4605dc

runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
     runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 
sp=0x7ffe6be32648 pc=0x456b7c

runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
     runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 
pc=0x454565

runtime.(*mheap).init(0x127b47e0)
     runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 
pc=0x451885

runtime.mallocinit()
     runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 
pc=0x434f97

runtime.schedinit()
     runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 
pc=0x464397

runtime.rt0_go()
     runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 
pc=0x49421c



If I ssh directly to the same node on that second cluster (skipping 
Slurm entirely), and run the same "/mnt/local/ollama/ollama help" 
command, it works perfectly fine.



My first thought was that it might be related to cgroups.  I switched 
the second cluster from cgroups v2 to v1 and tried again, no 
difference.  I tried disabling cgroups on the second cluster by removing 
all cgroups references in the slurm.conf file but that also made no 
difference.



My guess is something changed with regards to srun between these two 
Slurm versions, but I'm not sure what.


Any thoughts on what might be happening and/or a way to get this to work 
on the second cluster?  Essentially I need a way to request an 
interactive shell through Slurm that is associated with the requested 
resources.  Should we be using something other than srun for this?



Thank you,

-Dj





--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Hermann Schwärzler via slurm-users

Hi Dietmar,

what do you find in the output-file of this job

sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'

On our 64-core machines with hyperthreading enabled I see e.g.

Cpus_allowed:   0400,,0400,
Cpus_allowed_list:  58,122

Greetings
Hermann


On 2/28/24 14:28, Dietmar Rieder via slurm-users wrote:

Hi,

I'm new to Slurm, but maybe someone can help me:

I'm trying to restrict the CPU usage to the actually requested/allocated 
resources using cgroup v2.


For this I made the following settings in slurmd.conf:


ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

And in cgroup.conf

CgroupPlugin=cgroup/v2
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
AllowedRAMSpace=98


cgroup v2 seems to be active on the compute node:

# mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)


# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory pids
# cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
cpuset cpu io memory pids


Now, when I use sbatch to submit the following test script, the python 
script which is started from the batch script is utilizing all CPUs (96) 
at 100% on the allocated node, although I only ask for 4 cpus 
(--cpus-per-task=4). I'd expect that the task cannot use more than 
these 4.


#!/bin/bash
#SBATCH --output=/local/users/appadmin/test-%j.log
#SBATCH --job-name=test
#SBATCH --chdir=/local/users/appadmin
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=64gb
#SBATCH --time=4:00:00
#SBATCH --partition=standard
#SBATCH --gpus=0
#SBATCH --export
#SBATCH --get-user-env=L

export 
PATH=/usr/local/bioinf/jupyterhub/bin:/usr/local/bioinf/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/local/bioinf/miniforge/condabin


source .bashrc
conda activate test
python test.py


The python code in test.py is the following using the cpu_load_generator 
package from [1]:


#!/usr/bin/env python

import sys
from cpu_load_generator import load_single_core, load_all_cores, from_profile


load_all_cores(duration_s=120, target_load=1)  # generates load on all cores



Interestingly, when I use srun to launch an interactive job, and run the 
python script manually, I see with top that only 4 cpus are running at 
100%. And I also see Python errors thrown when the script tries to start the 
5th process (which makes sense):


  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/cpu_load_generator/_interface.py", line 24, in load_single_core
    process.cpu_affinity([core_num])
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/__init__.py", line 867, in cpu_affinity
    self._proc.cpu_affinity_set(list(set(cpus)))
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 1714, in wrapper
    return fun(self, *args, **kwargs)
  File "/usr/local/bioinf/miniforge/envs/test/lib/python3.12/site-packages/psutil/_pslinux.py", line 2213, in cpu_affinity_set
    cext.proc_cpu_affinity_set(self.pid, cpus)
OSError: [Errno 22] Invalid argument


What am I missing, why are the CPU resources not restricted when I use 
sbatch?



Thanks for any input or hint
    Dietmar

[1]: https://pypi.org/project/cpu-load-generator/




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com