[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Hermann Schwärzler via slurm-users
Dear Xaver, we have a similar setup and yes, we have set the node to "state=DRAIN". Slurm keeps it this way until you manually change it to e.g. "state=RESUME". Regards, Hermann On 6/24/24 13:54, Xaver Stiensmeier via slurm-users wrote: Dear Slurm users, in our project we exclude the master
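For reference, a minimal sketch of the scontrol commands involved (the node name "master" is a placeholder):

    # keep new jobs off the node; Slurm preserves this state until changed
    scontrol update NodeName=master State=DRAIN Reason="exclude master from computing"
    # later, put it back into service
    scontrol update NodeName=master State=RESUME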

[slurm-users] Re: sbatch and --nodes

2024-05-31 Thread Hermann Schwärzler via slurm-users
Hi Michael, if you submit a job-array, all resource-related options (number of nodes, tasks, cpus per task, memory, time, ...) apply *per array-task*. So in your case you start 100 array-tasks (you could also call them "sub-jobs") that are *each* (not your whole job) limited to one node,
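A small job-array sketch illustrating this (the program name is a placeholder):

    #!/bin/bash
    #SBATCH --array=0-99     # 100 array-tasks
    #SBATCH --nodes=1        # per array-task, not for the whole array
    #SBATCH --ntasks=1       # per array-task
    #SBATCH --mem=4G         # per array-task
    srun ./my_program "$SLURM_ARRAY_TASK_ID"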

[slurm-users] Re: sbatch problem

2024-05-29 Thread Hermann Schwärzler via slurm-users
c GPU-2d971e69-8147-8221-a055-e26573950f91 GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b 0,1,2,3 0,1,2,3 Task completed. Whereas for the command echo $CUDA_VISIBLE_DEVICES I should get: 0,1,2,3 0,1,2,3,4,5,6,7 Is this for the same reason that I had problems with hostname? Thank you, Mihai On 2024-05-28 1

[slurm-users] Re: sbatch problem

2024-05-28 Thread Hermann Schwärzler via slurm-users
15a11fe2-33f2-cd65-09f0-9897ba057a0c GPU-2d971e69-8147-8221-a055-e26573950f91 GPU-22ee3c89-fed1-891f-96bb-6bbf27a2cc4b Job finished at: Tue May 28 13:03:20 EEST 2024 ...I'm not interested in the output of the other 'echo' commands, besides the one with the hostname, which is why I didn't change

[slurm-users] Re: sbatch problem

2024-05-28 Thread Hermann Schwärzler via slurm-users
Hi Mihai, this is a problem that is not Slurm related. It's rather about: "when does command substitution happen?" When you write srun echo Running on host: $(hostname) $(hostname) is replaced by the output of the hostname-command *before* the line is "submitted" to srun. Which means that
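A sketch of the difference (assuming a bash-like shell):

    # $(hostname) is expanded on the submission host, before srun runs:
    srun echo "Running on host: $(hostname)"
    # single quotes defer the expansion to the compute node:
    srun bash -c 'echo "Running on host: $(hostname)"'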

[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-27 Thread Hermann Schwärzler via slurm-users
Hi everybody, On 5/26/24 08:40, Ole Holm Nielsen via slurm-users wrote: [...] Whether or not to enable Hyper-Threading (HT) on your compute nodes depends entirely on the properties of the applications that you wish to run on the nodes. Some applications are faster without HT, others are faster

[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-24 Thread Hermann Schwärzler via slurm-users
Hi Zhao, my guess is that in your faster case you are using hyperthreading whereas in the Slurm case you are not. Can you check what performance you get when you add #SBATCH --hint=multithread to your Slurm script? Another difference between the two might be a) the communication
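The suggested line in context (a sketch; the rest of the batch script is unchanged):

    #SBATCH --hint=multithread   # allow using both hardware threads of each core

The same hint can also be passed on the command line, e.g. sbatch --hint=multithread job.sh.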

[slurm-users] Re: srun weirdness

2024-05-15 Thread Hermann Schwärzler via slurm-users
Hi Dj, could be a memory-limits related problem. What is the output of ulimit -l -m -v -s in both interactive job-shells? You are using cgroups-v1 now, right? In that case what is the respective content of /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes in both

[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Hermann Schwärzler via slurm-users
Hi Dietmar, what do you find in the output-file of this job: sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status' On our 64-core machines with hyperthreading enabled I see e.g. Cpus_allowed: 04000000,00000000,04000000,00000000 Cpus_allowed_list: 58,122 Greetings

Re: [slurm-users] slurm.conf

2024-01-18 Thread Hermann Schwärzler
Hi Christine, yes, you can either set the environment variable SLURM_CONF to the full path of the configuration-file you want to use and then run any program. Or you can do it like this SLURM_CONF=/your/path/to/slurm.conf sinfo|sbatch|srun|... But I am not quite sure if this is really the
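A sketch of both variants (the path is an example):

    # set it for the whole shell session:
    export SLURM_CONF=/opt/slurm-test/etc/slurm.conf
    sinfo
    # or one-shot, for a single command only:
    SLURM_CONF=/opt/slurm-test/etc/slurm.conf squeue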

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Hermann Schwärzler
d this be the issue? Best regards, Xaver Stiensmeier On 17.07.23 14:11, Hermann Schwärzler wrote: > Hi Xaver, > > what kind of SelectType are you using in your slurm.conf? > > Per https://slurm.schedmd.com/gres.html

Re: [slurm-users] GRES and GPUs

2023-07-17 Thread Hermann Schwärzler
Hi Xaver, what kind of SelectType are you using in your slurm.conf? Per https://slurm.schedmd.com/gres.html you have to consider: "As for the --gpu* option, these options are only supported by Slurm's select/cons_tres plugin." So you can use "--gpus ..." only when you state SelectType
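The relevant slurm.conf fragment would presumably look like this (parameters are illustrative):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    GresTypes=gpu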

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-13 Thread Hermann Schwärzler
group type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate) Distribution and kernel RedHat 8.7 4.18.0-348.2.1.el8_5.x86_64

Re: [slurm-users] cgroupv2 + slurmd - external cgroup changes needed to get daemon to start

2023-07-12 Thread Hermann Schwärzler
Hi Jenny, I *guess* you have a system that has both cgroup/v1 and cgroup/v2 enabled. Which Linux distribution are you using? And which kernel version? What is the output of mount | grep cgroup What if you do not restrict the cgroup-version Slurm can use to cgroup/v2 but omit
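For comparison, sample output of mount | grep cgroup (paths may differ per system): a pure cgroup/v2 system shows a single unified mount, whereas cgroup/v1 shows one mount per controller:

    # cgroup v2 (unified hierarchy):
    cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
    # cgroup v1, one line per controller, e.g.:
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)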

Re: [slurm-users] Troubles with cgroups

2023-05-17 Thread Hermann Schwärzler
Hi everybody, I would like to give you a quick update on this problem (hanging systems when swapping due to cgroup memory-limits is happening): We had opened a case with RedHat's customer support. After some to and fro they could reproduce the problem. Last week they told us to upgrade to

Re: [slurm-users] Problem with cgroup plugin in Ubuntu22.04 and slurm 21.08.5

2023-04-21 Thread Hermann Schwärzler
Hi Ángel, which version of cgroups does Ubuntu 22.04 use? What is the output of mount | grep cgroup on your system? Regards, Hermann On 4/21/23 14:33, Angel de Vicente wrote: Hello, I've installed Slurm in a workstation (this is a single-node install) with Ubuntu 22.04, and have installed

Re: [slurm-users] RLIMIT_NPROCS

2023-03-23 Thread Hermann Schwärzler
Hi Marcus, I am not sure if this is helpful but from looking at the source code of Slurm (line 276 of src/slurmd/slurmstepd/ulimits.c in version 22.05) it looks like you are explicitly using "--propagate..." to set resource limits (the ones you see when running "ulimit -a") on the workers the

[slurm-users] Slurm + IntelMPI

2023-03-21 Thread Hermann Schwärzler
Hi everybody, in our new cluster we have configured Slurm with SelectType=select/cons_tres SelectTypeParameters=CR_Core_Memory ProctrackType=proctrack/cgroup TaskPlugin=task/affinity,task/cgroup which I think is a fairly common setup. After installing Intel MPI (using Spack v0.19) we saw that

Re: [slurm-users] Troubles with cgroups

2023-03-21 Thread Hermann Schwärzler
Warmest regards, Jason On Thu, Mar 16, 2023 at 10:59 AM Hermann Schwärzler wrote: Dear Slurm users, after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each - Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) f

[slurm-users] Troubles with cgroups

2023-03-16 Thread Hermann Schwärzler
Dear Slurm users, after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each - Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) for "friendly user" test operation about 6 weeks ago we were soon facing serious problems with nodes that suddenly become unresponsive (so

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-13 Thread Hermann Schwärzler
nsense, please let me know! Best wishes, Sebastian On 11.02.23 11:13, Hermann Schwärzler wrote: Hi Sebastian, we did a similar thing just recently. We changed our node settings from NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 to NodeName=DEFAULT Boar

Re: [slurm-users] GPUs not available after making use of all threads?

2023-02-11 Thread Hermann Schwärzler
Hi Sebastian, we did a similar thing just recently. We changed our node settings from NodeName=DEFAULT CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 to NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 in order to make use of

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-16 Thread Hermann Schwärzler
use the node to drain. Maybe this helps. Kind regards Sebastian PS: goslmailer looks quite nice with its recommendations! Will definitely look into it. On 15.09.2022 at 18:07, Hermann Schwärzler wrote

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-15 Thread Hermann Schwärzler
Hi Ole, On 9/15/22 5:21 PM, Ole Holm Nielsen wrote: On 15-09-2022 16:08, Hermann Schwärzler wrote: Just out of curiosity: how do you insert the output of seff into the out-file of a job? Use the "smail" tool from the slurm-contribs RPM and set this in slurm.conf: MailProg=/usr
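The slurm.conf setting in question, sketched with the path a default slurm-contribs install would use (an assumption):

    MailProg=/usr/bin/smail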

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-15 Thread Hermann Schwärzler
Hi Loris, we try to achieve the same (I guess) - which is nudging the users in the direction of using scarce resources carefully - by using goslmailer (https://github.com/CLIP-HPC/goslmailer) and a (not yet published - see https://github.com/CLIP-HPC/goslmailer/issues/20) custom connector to

Re: [slurm-users] Swap Configuraton for compute nodes

2022-08-17 Thread Hermann Schwärzler
Hi, e.g. if you are using cgroups (as you do, if I read your other post correctly) these two lines in your cgroup.conf should do the trick: ConstrainSwapSpace=yes AllowedSwapSpace=0 Regards, Hermann PS: BTW we are planning to *not* use this setting as right now we are looking into allowing
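As a complete minimal cgroup.conf sketch (only the two swap-related lines are from the quoted advice):

    ConstrainRAMSpace=yes     # assumption: the RAM constraint is typically enabled as well
    ConstrainSwapSpace=yes    # from the advice above
    AllowedSwapSpace=0        # no swap beyond the job's memory request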

Re: [slurm-users] Epilog script does not execute

2022-07-18 Thread Hermann Schwärzler
Hi Purvesh, which version of Slurm are you using? In which OS environment? The epilog script is run *on every node when a user's job completes*. So: * Have you copied your epilog script to all of your nodes? * Did you look at /tmp/ on nodes where a job ran recently to see if there is any output
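A minimal sketch of such a setup (paths and file contents are examples):

    # slurm.conf, identical on all nodes:
    Epilog=/etc/slurm/epilog.sh

    # /etc/slurm/epilog.sh, executable on every compute node:
    #!/bin/bash
    echo "epilog ran for job $SLURM_JOB_ID on $(hostname)" >> /tmp/slurm_epilog.log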

Re: [slurm-users] Slurm notifications, a more comprehensive solution - goslmailer

2022-05-17 Thread Hermann Schwärzler
Hi Petar, thanks for letting us know! We will definitely look into this and will get back to you on GitHub when technical questions/problems arise. Just one quick question: we are neither using Telegram nor MS-Teams here, but Matrix. In case we would like to deliver messages through that:

Re: [slurm-users] container on slurm cluster

2022-05-17 Thread Hermann Schwärzler
Hi GHui, fyi: I am not a podman-expert so my questions might be stupid. :-) From what you told us so far you are running the podman-command as non-root but you are root inside the container, right? What is the output of "podman info | grep root" in your case? How are you submitting a job

Re: [slurm-users] container on slurm cluster

2022-05-16 Thread Hermann Schwärzler
Hi GHui, I have a few questions regarding your mail: * What kind of container are you using? * How exactly do you switch to a different user inside the container? Regards, Hermann On 5/16/22 7:53 AM, GHui wrote: I found a serious problem. If I run a container as a common user, e.g. tom. In

Re: [slurm-users] srun and --cpus-per-task

2022-03-25 Thread Hermann Schwärzler
Hi Bjørn-Helge, hi everyone, ok, I see. I also just re-read the documentation to find this in the description of the "CPUs" option: "This can be useful when you want to schedule only the cores on a hyper-threaded node. If CPUs is omitted, its default will be set equal to the product of

Re: [slurm-users] srun and --cpus-per-task

2022-03-24 Thread Hermann Schwärzler
Hi Durai, I see the same thing as you on our test-cluster that has ThreadsPerCore=2 configured in slurm.conf The double-foo goes away with this: srun --cpus-per-task=1 --hint=nomultithread echo foo Having multithreading enabled leads to imho surprising behaviour of Slurm. My impression is

Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-24 Thread Hermann Schwärzler
Hi everybody, for forcing a run of your config management as Tina suggested you might just add an ExecStartPre= line to your slurmd.service file? This is somewhat unrelated to your problem but we are very successfully using ExecStartPre=-/usr/bin/nvidia-smi -L in our slurmd.service file
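A systemd drop-in sketch combining both ideas (the config-management script is hypothetical):

    # /etc/systemd/system/slurmd.service.d/override.conf
    [Service]
    # leading "-": slurmd still starts even if the command fails
    ExecStartPre=-/usr/bin/nvidia-smi -L
    #ExecStartPre=/usr/local/bin/run-config-management.sh   # hypothetical hook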

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Hermann Schwärzler
Dear Nousheen, I guess there is something missing in your installation - probably your slurm.conf? Do you have logging enabled for slurmctld? If yes, what do you see in that log? Or what do you get if you run slurmctld manually like this: /usr/local/sbin/slurmctld -D Regards, Hermann On

Re: [slurm-users] memory limits:: why job is not killed but oom-killer steps up?

2022-01-13 Thread Hermann Schwärzler
Hi Adrian, ConstrainRAMSpace=yes has the effect that when the memory the job requested is exhausted, the processes of the job will start paging/swapping. If you want to stop jobs that use more memory (RSS to be precise) than they requested, you have to add this to your cgroup.conf:
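The quoted advice breaks off here; judging from the cgroup.conf options under discussion, the addition is presumably along these lines (an assumption, not a verified quote):

    ConstrainSwapSpace=yes   # assumption: disallow swapping so the OOM killer steps in
    AllowedSwapSpace=0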

Re: [slurm-users] work with sensitive data

2021-12-15 Thread Hermann Schwärzler
Hi Michał, hi everyone, we are having similar issues looming on the horizon (sensitive medical and human genetic data). :-) We are currently looking into telling our users to use EncFS (https://en.wikipedia.org/wiki/EncFS) for this. As it is a filesystem in user-space, unprivileged users can
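Basic EncFS usage as an unprivileged user might look like this (paths are examples):

    # create/mount an encrypted directory; data in ~/encrypted stays encrypted on disk
    encfs ~/encrypted ~/cleartext
    # ... work with the plain-text view in ~/cleartext ...
    fusermount -u ~/cleartext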

Re: [slurm-users] How to enforce memory contrains?

2021-10-05 Thread Hermann Schwärzler
Hi Rodrigo, a possible solution is setting VSizeFactor=100 in slurm.conf. With this setting, programs that try to allocate more memory than requested in the job's settings will fail. Be aware that this puts a limit on *virtual* memory, not on RSS. This might or might not be what you want