[slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler
Hello, Our cluster is very rarely fully utilized; often only a handful of jobs are running. This has the effect that the first couple of nodes get used a whole lot more frequently than the ones nearer the end of the list. This is primarily a problem because of the SSDs in the nodes. They

Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler
On 02.07.2020 20:28, Luis Huang wrote: You can look into the CR_LLN feature. It works fairly well in our environment and jobs are distributed evenly. SelectTypeParameters=CR_Core_Memory,CR_LLN From how I understand it, CR_LLN will schedule jobs to the least used node. But if there's nearly n
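The CR_LLN behavior discussed above amounts to a small slurm.conf change; a minimal sketch (partition and node names are hypothetical):

```
# slurm.conf: schedule jobs onto the least-loaded node first
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_LLN

# Alternatively, least-loaded-node placement can be limited to one partition:
PartitionName=batch Nodes=node[001-032] Default=YES LLN=YES
```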

Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Timo Rothenpieler
My slurm logrotate file looks like this: /var/log/slurm/*.log { weekly compress missingok nocopytruncate nocreate nodelaycompress nomail notifempty noolddir rotate 5 sharedscripts size=5M create 640 slurm slurm postrotate systemctl
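Laid out as a normal logrotate(8) file, the flattened snippet above reads as follows. The postrotate body is an assumption, since the snippet is cut off right after "systemctl" (Slurm's daemons reopen their log files on SIGUSR2, but verify the exact command against your site's setup):

```
/var/log/slurm/*.log {
    weekly
    compress
    missingok
    nocopytruncate
    nocreate
    nodelaycompress
    nomail
    notifempty
    noolddir
    rotate 5
    sharedscripts
    size=5M
    create 640 slurm slurm
    postrotate
        # assumed completion: tell the daemons to reopen their logs
        systemctl kill --signal=SIGUSR2 slurmctld slurmd 2>/dev/null || true
    endscript
}
```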

[slurm-users] sbatch output logs get truncated

2021-01-28 Thread Timo Rothenpieler
This has started happening after upgrading slurm from 20.02 to the latest 20.11. It seems like something exits too early, before slurm, or whatever else is writing that file, has a chance to flush the final output buffer to disk. For example, take this very simple batch script, which gets submitted

Re: [slurm-users] What is an easy way to prevent users from running programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler
On 24.04.2021 04:37, Cristóbal Navarro wrote: Hi Community, I have a set of users still not so familiar with slurm, and yesterday they bypassed srun/sbatch and just ran their CPU program directly on the head/login node thinking it would still run on the compute node. I am aware that I will nee

Re: [slurm-users] What is an easy way to prevent users from running programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler
You shouldn't need this script and pam_exec. You can set those limits directly in the systemd config to match every user. On 20.05.2021 16:28, Bas van der Vlies wrote: same here we use the systemd user slice in our pam stack: ``` # Setup for local and ldap logins session required   pam_systemd.
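As a sketch of what "setting those limits directly in the systemd config" could look like: a drop-in applied to every user slice on the login node. The values are examples, not recommendations, and the `user-.slice.d` template drop-in requires a reasonably recent systemd:

```
# /etc/systemd/system/user-.slice.d/99-limits.conf
[Slice]
# Example caps per logged-in user on the login node
CPUQuota=50%
MemoryMax=8G
TasksMax=256
```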

Re: [slurm-users] Slurm version 21.08 is now available

2021-08-27 Thread Timo Rothenpieler
I'm immediately running into an issue when updating our Gentoo packages: > checking for netloc installation... > configure: error: unable to locate netloc installation That happens even though --without-netloc was specified when configuring. Looking at the following patch: https://github.com/S

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Timo Rothenpieler
Are you using LDAP for your users? This sounds exactly like what I was seeing on our cluster when nsswitch.conf was not properly set up. In my case, I was missing a line like > initgroups: files [SUCCESS=continue] ldap Just adding ldap to group: was not enough, and only got the primary group

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Timo Rothenpieler
Make sure you properly configured nsswitch.conf. Most commonly this kind of issue indicates that you forgot to define initgroups correctly. It should look something like this: ... group: files [SUCCESS=merge] systemd [SUCCESS=merge] ldap ... initgroups: files [SUCCESS=continue] ldap ...
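Written out as actual config lines, the nsswitch.conf fragment above is:

```
# /etc/nsswitch.conf (relevant lines)
group:      files [SUCCESS=merge] systemd [SUCCESS=merge] ldap
initgroups: files [SUCCESS=continue] ldap
```

The `[SUCCESS=continue]` action is the important part: it makes glibc keep querying ldap for supplementary groups even after a match in files, instead of stopping at the first source.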

Re: [slurm-users] container on slurm cluster

2022-05-17 Thread Timo Rothenpieler
On 17.05.2022 15:58, Brian Andrus wrote: You are starting to understand a major issue with most containers. I suggest you check out Singularity, which was built from the ground up to address most issues. And it can run other container types (eg: docker). Brian Andrus Side-Note to this, sing

Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-29 Thread Timo Rothenpieler
Ideally, the systemd service would specify the User/Group already, and then also specify RuntimeDirectory=slurmrestd. systemd then pre-creates a slurmrestd directory in /run for the service to put its runtime files (like sockets) into, avoiding any permission issues. Having service files in top leve
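A sketch of such a unit override, assuming a dedicated slurmrestd user and group exist (the account names are assumptions):

```
# /etc/systemd/system/slurmrestd.service.d/override.conf
[Service]
User=slurmrestd
Group=slurmrestd
# systemd creates /run/slurmrestd, owned by User/Group, before startup
RuntimeDirectory=slurmrestd
RuntimeDirectoryMode=0750
```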

[slurm-users] Get Job Array information in Epilog script

2023-03-17 Thread Timo Rothenpieler
Hello! I'm currently facing a bit of an issue regarding cleanup after a job completed. I've added the following bit of shell script to our cluster's Epilog script: for d in "${SLURM_JOB_ID}" "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"; do WO
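The cleanup loop above can be sketched as follows. The scratch root and the guard logic are assumptions, not the actual cluster layout; the point is that plain jobs only set SLURM_JOB_ID, while array tasks additionally get SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID, so the array-style names must be guarded to avoid bogus directories like "4242_":

```shell
#!/bin/sh
# Emit the scratch directory names a finished job may have used.
scratch_candidates() {
    echo "${SLURM_JOB_ID}"
    if [ -n "${SLURM_ARRAY_JOB_ID:-}" ] && [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
        echo "${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
    fi
}

# Simulated environment of an array task's Epilog (in a real array task,
# SLURM_JOB_ID is the per-task job id, distinct from SLURM_ARRAY_JOB_ID):
SLURM_JOB_ID=4243
SLURM_ARRAY_JOB_ID=4242
SLURM_ARRAY_TASK_ID=7

for d in $(scratch_candidates); do
    rm -rf "/tmp/job-scratch/${d}"   # hypothetical scratch root
done
```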

Re: [slurm-users] Get Job Array information in Epilog script

2023-03-17 Thread Timo Rothenpieler
multiple jobs. Since all those will be unique per job? On Fri, 17 Mar 2023 at 11:17, Timo Rothenpieler wrote: Hello! I'm currently facing a bit of an issue regarding cleanup after a job completed. I've added the following bit of shell script to our cluster's Epilog scrip

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler
On 19/07/2023 11:47, Jan Andersen wrote: I'm trying to build slurm with nvml support, but configure doesn't find it: root@zorn:~/slurm-23.02.3# ./configure --with-nvml ... checking for hwloc installation... /usr checking for nvml.h... no checking for nvmlInit in -lnvidia-ml... yes configure: err

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler
On 19/07/2023 15:04, Jan Andersen wrote: Hmm, OK - but that is the only nvml.h I can find, as shown by the find command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and ran it successfully; do I need to install something else besides? A google search for 'CUDA SDK' leads directly

[slurm-users] Re: Slurm 23.11 - Unknown system variable 'wsrep_on'

2024-04-03 Thread Timo Rothenpieler via slurm-users
On 02.04.2024 22:15, Russell Jones via slurm-users wrote: Hi all, I am working on upgrading a Slurm cluster from 20 -> 23. I was successfully able to upgrade to 22, however now that I am trying to go from 22 to 23, starting slurmdbd results in the following error being logged: error: mysql_

[slurm-users] Re: Issue with starting slurmctld

2024-06-14 Thread Timo Rothenpieler via slurm-users
On 14.06.2024 17:51, Rafał Lalik via slurm-users wrote: Hello, I have encountered issues with running slurmctld. From logs, I see these errors: [2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster laura [2024-06-14T17:37:57.587] error: plugin_load_from_file: dlopen(/usr/li

[slurm-users] Re: how to safely rename a slurm user's name

2024-06-20 Thread Timo Rothenpieler via slurm-users
On 20/06/2024 10:57, hermes via slurm-users wrote: Hello, I’d like to ask if there is any safe method to rename an existing slurm user to a new username with the same uid? As for linux itself, it’s quite common to have 2 users share the same uid. So if we already have 2 system users, for exampl

[slurm-users] Re: Slurmctld Problems

2024-06-25 Thread Timo Rothenpieler via slurm-users
On 25/06/2024 12:20, stth via slurm-users wrote: Jun 25 10:06:39 server slurmctld[63738]: slurmctld: fatal: Can not recover last_conf_lite, incompatible version, (9472 not between 9728 and 10240), start with '-i' to ignore this. Warning: using -i will lose the data that can't be recovered. Se

[slurm-users] Re: Slurmctld Problems

2024-06-25 Thread Timo Rothenpieler via slurm-users
On 25.06.2024 17:54, stth via slurm-users wrote: Hi Timo, Thanks, The old data wasn’t important so I did that. I changed the line as follows in the /usr/lib/systemd/system/slurmctld.service : ExecStart=/usr/sbin/slurmctld --systemd -i $SLURMCTLD_OPTIONS You should be able to immediately remo