[slurm-users] Re: Slurmctld Problems

2024-06-25 Thread Timo Rothenpieler via slurm-users

On 25.06.2024 17:54, stth via slurm-users wrote:

Hi Timo,

Thanks, the old data wasn’t important, so I did that. I changed the line 
as follows in /usr/lib/systemd/system/slurmctld.service:

ExecStart=/usr/sbin/slurmctld --systemd -i $SLURMCTLD_OPTIONS


You should be able to immediately remove it again.
I'd probably have just launched slurmctld manually via the CLI with -i once.
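
A rough sketch of that one-off (paths, unit and user names are assumed, 
adjust to your install):

# stop the unit, run the daemon once in the foreground with -i to drop the
# unrecoverable state, then restore the unmodified unit and start it again
systemctl stop slurmctld
sudo -u slurm /usr/sbin/slurmctld -D -i    # Ctrl+C once it's up and settled
systemctl daemon-reload                    # after reverting any ExecStart edit
systemctl start slurmctld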

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurmctld Problems

2024-06-25 Thread Timo Rothenpieler via slurm-users

On 25/06/2024 12:20, stth via slurm-users wrote:
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: fatal: Can not 
recover last_conf_lite, incompatible version, (9472 not between 9728 and 
10240), start with '-i' to ignore this. Warning: using -i will lose the 
data that can't be recovered.


Seems like it's not the first time, but the first time in a long while.
If there is no important data in that old db, just do what the error 
says as a one-off.


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: how to safely rename a slurm user's name

2024-06-20 Thread Timo Rothenpieler via slurm-users

On 20/06/2024 10:57, hermes via slurm-users wrote:

Hello,

I’d like to ask if there is any safe method to rename an existing slurm 
user to a new username with the same uid?


As for linux itself, it’s quite common to have 2 users share the same uid.

So if we already have 2 system users, for example Bob (uid=1500) and 
HPCBob (uid=1500), and HPCBob is the existing slurm user, how can we 
rename it to Bob without losing the user’s historical job records or 
associations? Do we have to manually edit the slurm database?


Slurm differentiates between users and accounts.
Just add the new user to the existing slurm account, and delete or leave 
the old one.


I don't think there is any way to rename a slurm account, though.
It does not look like it from the sacctmgr docs, at least.
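
A minimal sacctmgr sketch, assuming the account is named "research" (the 
account and user names here are placeholders):

# add the new username to the same account the old user belongs to
sacctmgr add user Bob Account=research
# optionally drop the old user association afterwards
sacctmgr delete user HPCBob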

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Issue with starting slurmctld

2024-06-14 Thread Timo Rothenpieler via slurm-users

On 14.06.2024 17:51, Rafał Lalik via slurm-users wrote:

Hello,

I have encountered issues with running slurmctld.

From the logs, I see these errors:

[2024-06-14T17:37:57.587] slurmctld version 24.05.0 started on cluster 
laura
[2024-06-14T17:37:57.587] error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/jobacct_gather_cgroup.so): 
/usr/lib64/slurm/jobacct_gather_cgroup.so: undefined symbol: xcpuinfo_init
[2024-06-14T17:37:57.587] error: Couldn't load specified plugin name for 
jobacct_gather/cgroup: Dlopen of plugin file failed
[2024-06-14T17:37:57.587] error: cannot create jobacct_gather context 
for jobacct_gather/cgroup

[2024-06-14T17:37:57.587] fatal: failed to initialize jobacct_gather plugin
[2024-06-14T17:39:07.741] Not running as root. Can't drop supplementary 
groups



After setting

#JobAcctGatherType=

the problem changed to:

[2024-06-14T17:39:07.742] slurmctld version 24.05.0 started on cluster 
laura
[2024-06-14T17:39:07.742] error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/prep_script.so): 
/usr/lib64/slurm/prep_script.so: undefined symbol: send_slurmd_conf_lite
[2024-06-14T17:39:07.742] error: Couldn't load specified plugin name for 
prep/script: Dlopen of plugin file failed
[2024-06-14T17:39:07.742] error: prep_g_init: cannot create prep context 
for prep/script

[2024-06-14T17:39:07.742] fatal: failed to initialize prep plugin


I also tried that with git-master:

[2024-06-14T17:48:21.691] Not running as root. Can't drop supplementary 
groups
[2024-06-14T17:48:21.691] error: Job accounting information gathered, 
but not stored
[2024-06-14T17:48:21.692] slurmctld version 24.11.0-0rc1 started on 
cluster laura
[2024-06-14T17:48:21.692] error: plugin_load_from_file: 
dlopen(/usr/lib64/slurm/jobacct_gather_cgroup.so): 
/usr/lib64/slurm/jobacct_gather_cgroup.so: undefined symbol: xcpuinfo_init
[2024-06-14T17:48:21.692] error: Couldn't load specified plugin name for 
jobacct_gather/cgroup: Dlopen of plugin file failed
[2024-06-14T17:48:21.692] error: cannot create jobacct_gather context 
for jobacct_gather/cgroup

[2024-06-14T17:48:21.692] fatal: failed to initialize jobacct_gather plugin


Any idea what may be wrong?


Recent compiler-hardening efforts broke Slurm's way of loading plugins.
As a workaround, link slurm with -Wl,-z,lazy.
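
For a source build, something along these lines should do it (a sketch, 
not an official recipe; adjust paths and configure options to your setup):

LDFLAGS="-Wl,-z,lazy" ./configure --prefix=/usr --sysconfdir=/etc/slurm
make && make install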

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm 23.11 - Unknown system variable 'wsrep_on'

2024-04-03 Thread Timo Rothenpieler via slurm-users

On 02.04.2024 22:15, Russell Jones via slurm-users wrote:

Hi all,

I am working on upgrading a Slurm cluster from 20 -> 23. I was 
successfully able to upgrade to 22, however now that I am trying to go 
from 22 to 23, starting slurmdbd results in the following error being 
logged:


error: mysql_query failed: 1193 Unknown system variable 'wsrep_on'


I get that error in my log on every startup, and it's benign.
That variable only exists on a Galera Cluster, so seeing it on a plain 
mariadb instance is expected and harmless.




When trying to start slurmctld, I get:

[2024-04-02T15:09:52.439] Couldn't find tres gres/gpumem in the 
database, creating.
[2024-04-02T15:09:52.439] Couldn't find tres gres/gpuutil in the 
database, creating.
[2024-04-02T15:09:52.440] fatal: Problem adding tres to the database, 
can't continue until database is able to make new tres



Any ideas what could be causing these errors? Is MariaDB 5.5 still 
officially supported?


Check the permissions of your database user; it has to be able to create 
and alter tables.
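
A sketch of the usual grant, assuming the default database name 
slurm_acct_db and a slurm@localhost database user (names may differ on 
your setup):

mysql -u root -p -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost'; FLUSH PRIVILEGES;"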


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler

On 19/07/2023 15:04, Jan Andersen wrote:
Hmm, OK - but that is the only nvml.h I can find, as shown by the find 
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and 
ran it successfully; do I need to install something else besides? A 
google search for 'CUDA SDK' leads directly to NVIDIA's page: 
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html




Yes, I'm pretty sure it's part of the CUDA SDK.

And be careful with running the .run installers from Nvidia.
They bypass the package manager and can badly clash with system 
packages, making recovery complicated.

Always prefer system packages for the drivers and SDKs.



Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler

On 19/07/2023 11:47, Jan Andersen wrote:

I'm trying to build slurm with nvml support, but configure doesn't find it:

root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h


It's not looking for the hwloc header, but for the nvidia one.
If you have your CUDA SDK installed in, for example, /opt/cuda, you have 
to point it there: --with-nvml=/opt/cuda



root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but the 
script is a bit eye-watering; how should I do this?







Re: [slurm-users] Get Job Array information in Epilog script

2023-03-17 Thread Timo Rothenpieler

On 17/03/2023 13:11, William Brown wrote:

We create the temporary directories using SLURM_JOB_ID, and that works
fine with Job Arrays so far as I can see.   Don't you have a problem
if a user has multiple jobs on the same node?

William


Our users just have /work/$username; anything below that, the respective 
job script creates on its own.

So there's various different schemes that appear in /work.

Recently some users have started submitting smaller jobs, of which 
multiple run on the same node.
So their /work dir gets littered with tons of no-longer-used per-job 
subdirs.
Since they've grown to rely on the Epilog script cleaning up /work when 
their last job on the node finishes, that's never been a problem.
But now we have run out of storage on /work multiple times, since some 
users have so many jobs that a node is never fully vacant of their jobs, 
and so it never gets cleaned up.


The subdirs pretty much always use one of the three styles from the script:

"${SLURM_JOB_ID}", "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" or 
"${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}".


I don't see how that'd cause problems with multiple jobs, since all of 
those will be unique per job?



On Fri, 17 Mar 2023 at 11:17, Timo Rothenpieler
 wrote:


Hello!

I'm currently facing a bit of an issue regarding cleanup after a job
completed.

I've added the following bit of shell script to our cluster's Epilog script:


for d in "${SLURM_JOB_ID}" "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" 
"${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"; do
 WORKDIR="/work/${SLURM_JOB_USER}/${d}"
 if [ -e "${WORKDIR}" ]; then
 rm -rf "${WORKDIR}"
 fi
done


However, it did not end up working to clean up working directories of
Array-Jobs.

After some investigation, I found the reason in the documentation:

  > SLURM_ARRAY_JOB_ID/SLURM_ARRAY_TASK_ID: [...]
  > Available in PrologSlurmctld, SrunProlog, TaskProlog,
EpilogSlurmctld, SrunEpilog and TaskEpilog.

So, now I wonder... how am I supposed to get that information in the
Epilog script? The whole job is part of an array, so how do I get the
information at a job level?

The "obvious alternative" based on that documentation would be to put
that bit of code into a TaskEpilog script. But my understanding of that
is that the script would run after each one of potentially multiple
srun-launched tasks in the same job, and would then clean up the
work-dir while the job would still use it?

I only want to do that bit of cleanup when the job is terminating.







[slurm-users] Get Job Array information in Epilog script

2023-03-17 Thread Timo Rothenpieler

Hello!

I'm currently facing a bit of an issue regarding cleanup after a job 
completed.


I've added the following bit of shell script to our cluster's Epilog script:


for d in "${SLURM_JOB_ID}" "${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}" 
"${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"; do
WORKDIR="/work/${SLURM_JOB_USER}/${d}"
if [ -e "${WORKDIR}" ]; then
rm -rf "${WORKDIR}"
fi
done


However, it did not end up working to clean up working directories of 
Array-Jobs.


After some investigation, I found the reason in the documentation:

> SLURM_ARRAY_JOB_ID/SLURM_ARRAY_TASK_ID: [...]
> Available in PrologSlurmctld, SrunProlog, TaskProlog, 
EpilogSlurmctld, SrunEpilog and TaskEpilog.


So, now I wonder... how am I supposed to get that information in the 
Epilog script? The whole job is part of an array, so how do I get the 
information at a job level?


The "obvious alternative" based on that documentation would be to put 
that bit of code into a TaskEpilog script. But my understanding of that 
is that the script would run after each one of potentially multiple 
srun-launched tasks in the same job, and would then clean up the 
work-dir while the job would still use it?


I only want to do that bit of cleanup when the job is terminating.
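
One untested idea: query the job record from the Epilog itself. This is 
only a sketch, under the assumptions that ArrayJobId=/ArrayTaskId= show 
up in "scontrol show job" output for array jobs, and ignoring that the 
Prolog/Epilog guide discourages calling Slurm commands from these scripts:

ARRAY_JOB_ID=$(scontrol show job "${SLURM_JOB_ID}" | tr ' ' '\n' | awk -F= '/^ArrayJobId=/ {print $2}')
ARRAY_TASK_ID=$(scontrol show job "${SLURM_JOB_ID}" | tr ' ' '\n' | awk -F= '/^ArrayTaskId=/ {print $2}')

If both come back non-empty, "${ARRAY_JOB_ID}_${ARRAY_TASK_ID}" could then 
be cleaned up the same way as the plain job-id directory.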



Re: [slurm-users] slurmrestd service broken by 22.05.07 update

2022-12-29 Thread Timo Rothenpieler
Ideally, the systemd service would specify the User/Group already, and 
then also specify RuntimeDirectory=slurmrestd.
It then pre-creates a slurmrestd directory in /run for the service to 
put its runtime files (like sockets) into, avoiding any permission issues.


Having services put their runtime files (like sockets) directly in 
top-level dirs like /run or /var/lib is bound to cause issues like this.
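
A rough sketch of such a drop-in (the user/group name and socket path are 
assumptions, not documented defaults):

# /etc/systemd/system/slurmrestd.service.d/override.conf
[Service]
User=slurmrestd
Group=slurmrestd
# systemd pre-creates /run/slurmrestd owned by the service user
RuntimeDirectory=slurmrestd
ExecStart=
ExecStart=/usr/sbin/slurmrestd unix:/run/slurmrestd/slurmrestd.socket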


On 29.12.2022 16:53, Chris Stackpole wrote:

Thanks Brian!

I also discovered that I can edit the service file to remove the unix 
socket. That doesn't seem to impact the things I'm working with anyway. 
But it still seems strange to me that this design requires editing the 
service file. It seems like this should also be a configurable item, 
like the user information, at the very least. But again, I've not found 
any official documentation on how the devs expect us to configure this.


Thanks!

On 12/29/22 09:46, Brian Andrus wrote:
I dug up my old stuff for getting it started and see that I just 
disabled the unix socket completely. I was never able to get it to 
work for the reasons you are seeing, so I enabled it in listening 
mode. There are comments in the service file about it, but to do so, I 
changed the 'ExecStart' line in the systemd service file to be:


ExecStart=/usr/sbin/slurmrestd $SLURMRESTD_OPTIONS

Then I created /etc/default/slurmrestd and added:

    SLURM_JWT=daemon
    SLURMRESTD_LISTEN=0.0.0.0:8081
    SLURMRESTD_DEBUG=4
    SLURMRESTD_OPTIONS="-f /etc/slurm/slurm.conf"

You can change those as needed. This made it listen on port 8081 only 
(no socket and not 6820)


I was then able to just use curl on port 8081 to test things.

Hope that helps.

Brian Andrus

On 12/29/2022 6:49 AM, Chris Stackpole wrote:

Greetings,

Thanks for responding!

On 12/28/22 20:35, Brian Andrus wrote:
I suspect if you delete /var/lib/slurmrestd.socket and then start 
slurmrestd, it will create it as the user you need it to be.


Or just change the owner of it to the slurmrestd owner.



No go on that, because /var/lib requires root to create 
/var/lib/slurmrestd.socket, which is what I meant by "has to write 
into a root-only directory to create the unix socket".

Here, I'll show what happens with me.
Spun up a virtual machine with nothing changed on a fresh compile of 
22.05.07.


# rm -rf /var/lib/slurmrestd.socket
# systemctl start slurmrestd
# systemctl status slurmrestd

Active: failed (Result: exit-code) since Thu 2022-12-29 08:39:45 CST; 
54s ago



# journalctl -xe

Dec 29 08:39:45 testslurmvm.cluster slurmrestd[114317]: fatal: 
_create_socket: [unix:/var/lib/slurmrestd.socket] Unable to bind UNIX 
socket: Permission denied
Dec 29 08:39:45 testslurmvm.cluster systemd[1]: slurmrestd.service: 
Main process exited, code=exited, status=1/FAILURE


Now what about giving ownership to the user?

# touch /var/lib/slurmrestd.socket
# systemctl start slurmrestd
# systemctl status slurmrestd

Active: failed (Result: exit-code) since Thu 2022-12-29 08:45:37 CST; 
1min 2s ago


# journalctl -xe

Dec 29 08:45:37 testslurmvm.cluster slurmrestd[114402]: error: Error 
unlink(/var/lib/slurmrestd.socket): Permission denied
Dec 29 08:45:37 testslurmvm.cluster slurmrestd[114402]: fatal: 
_create_socket: [unix:/var/lib/slurmrestd.socket] Unable to bind UNIX 
socket: Address already in use


Again, it doesn't have permissions to modify those files nor create 
files inside that directory.


On 12/28/22 20:35, Brian Andrus wrote:
> I have been running slurmrestd as a separate user for some time.

Under 22.05.07? Because that's what broke things for me. And I think 
that it's this change:


| -- slurmrestd - switch users earlier on startup to avoid sockets being
| made as root.

I'm not saying it's a bad change either - but I don't see any 
documentation on the proper way to handle it, and editing the service 
file doesn't feel like the right approach.


Thanks!







Re: [slurm-users] container on slurm cluster

2022-05-17 Thread Timo Rothenpieler

On 17.05.2022 15:58, Brian Andrus wrote:

You are starting to understand a major issue with most containers.

I suggest you check out Singularity, which was built from the ground up 
to address most issues. And it can run other container types (eg: docker).


Brian Andrus


Side note to this: Singularity is now called Apptainer, but is otherwise 
the same thing.

Documentation and Google results are wildly split between the two.



Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-31 Thread Timo Rothenpieler

Make sure you properly configured nsswitch.conf.
Most commonly this kind of issue indicates that you forgot to define 
initgroups correctly.


It should look something like this:

...
group:  files [SUCCESS=merge] systemd [SUCCESS=merge] ldap
...
initgroups: files [SUCCESS=continue] ldap
...
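
A quick way to verify the change afterwards (replace the username; this 
just asks glibc, which goes through the initgroups database):

getent initgroups someuser    # should now list the secondary gids as well
id someuser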


On 28.01.2022 06:56, Ratnasamy, Fritz wrote:

Hi,

I have a similar issue as described on the following link 
(https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk) 

A machine had some existing local permissions. We have added it as a 
compute node to our cluster via Slurm. When running an srun interactive 
session on that server, it would seem that the LDAP groups shadow the 
local groups.
johndoe@ecolonnelli:~ $ groups
Faculty_Collab ecolonnelli_access #Those are LDAP groups
johndoe@ecolonnelli:~ $ groups johndoe
johndoe : Faculty_Collab projectsbrasil core rais rfb polconnfirms 
johndoe vpce rfb_all backup_johndoe ecolonnelli_access


The issue is that now the user cannot access folders that have his 
local group permissions (such as projectsbrasil, rais, rfb, core, etc.) 
when he requests an interactive session on that compute node.

Is there any solution to that issue?
Best,





Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Timo Rothenpieler

Are you using LDAP for your users?
This sounds exactly like what I was seeing on our cluster when 
nsswitch.conf was not properly set up.


In my case, I was missing a line like

> initgroups: files [SUCCESS=continue] ldap

Just adding ldap to group: was not enough, and only got the primary 
group to work, exactly like in your case.



On 21/09/2021 09:11, Amjad Syed wrote:

Hello all

We have users who have defined secondary unix group ids on our login nodes.

vas20xhu@login01 ~]$ groups

BIO_pg BIO_AFMAKAY_LAB_USERS



But when we run interactive and go to a compute node, the user does not 
have the secondary group BIO_AFMAKAY_LAB_USERS



vas20xhu@c0077 ~]$ groups

BIO_pg


This is our interactive script


alias interactive='srun -n1 -p interactive -J interactive 
--time=12:00:00 --mem-per-cpu=4G --pty bash --login'


/usr/bin/srun



When we ssh directly into the node without using the interactive script, 
there are no issues with groups.



Anything we are missing in that interactive script?


Amjad






Re: [slurm-users] Slurm version 21.08 is now available

2021-08-27 Thread Timo Rothenpieler

I'm immediately running into an issue when updating our Gentoo packages:

> checking for netloc installation...
> configure: error: unable to locate netloc installation

That happens even though --without-netloc was specified when configuring.

Looking at the following patch:

https://github.com/SchedMD/slurm/commit/d7c089ec63c2c7608845b9b5f1881e3ca11a092b#diff-147f7e1072e8a7d65c0cd55f7cdec7835d531e5cf26a5e43e6ad04625694436bL23


I don't think that change removing the x is correct, because now that 
condition can never be true, and the script will always continue as if 
--with-netloc was given.
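
For illustration of the shell idiom only (the variable name and exact 
test here are guesses, not the actual code from that commit): the usual 
autoconf pattern prefixes both sides of the comparison with "x", and 
dropping it from only one side changes the logic.

# with the prefix: true exactly when $with_netloc is empty/unset, so detection is skipped
if test "x$with_netloc" = "x"; then :; fi
# without it: $with_netloc is never literally the string "x", so this branch never runs
if test "$with_netloc" = "x"; then :; fi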



On 26/08/2021 22:40, Tim Wickberg wrote:
After 9 months of development and testing we are pleased to announce the 
availability of Slurm version 21.08!


Slurm 21.08 includes a number of new features including:

- A new "AccountingStoreFlags=job_script" option to store the job 
scripts directly in SlurmDBD.


- Added "sacct -o SubmitLine" format option to get the submit line of a 
job/step.


- Changes to the node state management so that nodes are marked as 
PLANNED instead of IDLE if the scheduler is still accumulating resources 
while waiting to launch a job on them.


- RS256 token support in auth/jwt.

- Overhaul of the cgroup subsystems to simplify operation, mitigate a 
number of inherent race conditions, and prepare for future cgroup v2 
support.


- Further improvements to cloud node power state management.

- A new child process of the Slurm controller called 'slurmscriptd' 
responsible for executing PrologSlurmctld and EpilogSlurmctld scripts, 
which significantly reduces performance issues associated with enabling 
those options.


- A new burst_buffer/lua plugin allowing for site-specific asynchronous 
job data management.


- Fixes to the job_container/tmpfs plugin to allow the slurmd process to 
be restarted while the job is running without issue.


- Added json/yaml output to sacct, squeue, and sinfo commands.

- Added a new node_features/helpers plugin to provide a generic way to 
change settings on a compute node across a reboot.


- Added support for automatically detecting and broadcasting shared 
libraries for an executable launched with 'srun --bcast'.


- Added initial OCI container execution support with a new --container 
option to sbatch and srun.


- Improved job step launch throughput.

- Improved "configless" support by allowing multiple control servers to 
be specified through the slurmd --conf-server option, and send 
additional configuration files at startup including cli_filter.lua.


Please see the RELEASE_NOTES distributed alongside the source for 
further details.


Thank you to all customers, partners, and community members who 
contributed to this release.


As with past releases, the documentation available at 
https://slurm.schedmd.com has been updated to the 21.08 release. Past 
versions are available in the archive. This release also marks the end 
of support for the 20.02 release. The 20.11 release will remain 
supported up until the 22.05 release next May, but will not see as 
frequent updates, and bug-fixes will be targeted for the 21.08 
maintenance releases going forward.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim





Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler

You shouldn't need this script and pam_exec.
You can set those limits directly in the systemd config to match every user.

On 20.05.2021 16:28, Bas van der Vlies wrote:

same here we use the systemd user slice in out pam stack:
```
# Setup for local and ldap  logins
session required   pam_systemd.so
session required   pam_exec.so seteuid type=open_session 
/etc/security/limits.sh

```

limits.sh:
```
#!/bin/sh -e

PAM_UID=$(getent passwd "${PAM_USER}" | cut -d: -f3)

if [ "${PAM_UID}" -ge 1000 ]; then
     /bin/systemctl set-property "user-${PAM_UID}.slice" CPUQuota=400% 
CPUAccounting=true MemoryLimit=16G MemoryAccounting=true

fi
```

and also kill processes that use too much time and exclude some processes:
  * 
https://github.com/basvandervlies/cf_surfsara_lib/blob/master/doc/services/sara_user_consume_resources.md 





Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-05-20 Thread Timo Rothenpieler

On 24.04.2021 04:37, Cristóbal Navarro wrote:

Hi Community,
I have a set of users still not so familiar with slurm, and yesterday 
they bypassed srun/sbatch and just ran their CPU program directly on the 
head/login node thinking it would still run on the compute node. I am 
aware that I will need to teach them some basic usage, but in the 
meantime, how have you solved this type of user behavior problem? Is 
there a preferred way to restrict the master/login resources, or 
actions, for regular users?


many thanks in advance
--
Cristóbal A. Navarro


I just put a drop-in config file for systemd into
/etc/systemd/system/user-.slice.d/user-limits.conf


[Slice]
CPUQuota=800%
MemoryHigh=48G
MemoryMax=56G
MemorySwapMax=0


Accompanied by another drop-in that resets all those limits for root.
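
That second drop-in could look roughly like this (a sketch; root is uid 0, 
so the unit-specific directory takes precedence over user-.slice.d):

# /etc/systemd/system/user-0.slice.d/user-limits.conf
[Slice]
CPUQuota=
MemoryHigh=infinity
MemoryMax=infinity
MemorySwapMax=infinity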

This enforces that no single user can use up all CPUs (limited to 8 
Hyperthreads) and RAM, and can't cause the system to swap.


Other than that, we leave it to the users' due diligence not to trash up 
the login nodes, which has worked fine so far.
They occasionally compile stuff on the login nodes in preparation for 
runs, so I don't want to limit them too much.




[slurm-users] sbatch output logs get truncated

2021-01-28 Thread Timo Rothenpieler

This started happening after upgrading slurm from 20.02 to the latest 20.11.
It seems like something exits too early, before slurm, or whatever else 
is writing that file, has a chance to flush the final output buffer to disk.


For example, take this very simple batch script, which gets submitted 
via sbatch:



#!/bin/bash
#SBATCH --job-name=test
#SBATCH --ntasks=1
#SBATCH --exclusive
set -e

echo A
echo B
sleep 5
echo C


The resulting slurm-$jobid.out file then contains only:

> A
> B

The final echo never gets written to the output file.

A lot of users print a final result status at the end, which then never 
hits the logs. So this is a major problem for them.


The scripts run to completion just fine; it's only the log that is 
missing the end.
For example, touching some file after the "echo C" will create that file 
as expected.


The behaviour is also not at all consistent. Sometimes the output log is 
written as expected, with no recognizable pattern. That seems to be the 
exception though; the majority of the time it's truncated.


This was never an issue before the recent slurm update.



Re: [slurm-users] Slurmctld and log file

2020-09-08 Thread Timo Rothenpieler

My slurm logrotate file looks like this:


/var/log/slurm/*.log {
weekly
compress
missingok
nocopytruncate
nocreate
nodelaycompress
nomail
notifempty
noolddir
rotate 5
sharedscripts
size=5M
create 640 slurm slurm
postrotate
systemctl reload slurmd > /dev/null 2>&1 || true
systemctl reload slurmdbd > /dev/null 2>&1 || true
systemctl reload slurmctld > /dev/null 2>&1 || true
endscript
}


The reload section is probably the most important part yours is missing.


On 08.09.2020 11:41, Gestió Servidors wrote:

Hello,

I don’t know why, but my SLURM server (which is running fine) has its 
slurmdctl.log file with size 0 bytes... so... where is it writing logs? 
It seems the log file has had 0 bytes since the logrotate run during 
today’s early morning. My logrotate SLURM conf is this:


[root@server logrotate.d]# cat slurm

/var/log/slurmdctl.log

/var/log/slurmdbd.log

{

rotate 7

notifempty

missingok

create

weekly

}

Now, I have run “scontrol reconfigure” and, voilà, the file 
/var/log/slurmdctl.log has appeared... but it doesn’t show log info from 
between the logrotate execution and the scontrol execution, so I have 
lost log info...


Is a logrotate problem or is a SLURM one?

Thanks.







Re: [slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler

On 02.07.2020 20:28, Luis Huang wrote:
You can look into the CR_LLN feature. It works fairly well in our 
environment and jobs are distributed evenly.


SelectTypeParameters=CR_Core_Memory,CR_LLN


From how I understand it, CR_LLN will schedule jobs onto the least-used 
node. But if there are nearly no jobs running, it will still only use the 
first few nodes all the time, and unless enough jobs come up to fill all 
nodes it will never touch node 20+.


My current idea for a workaround is writing a cron-script that 
periodically collects the current amount of data written on the nodes, 
and assigns a Weight to the nodes according to that.
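
A rough sketch of that (entirely untested; the SSH access to the nodes, 
the NVMe device path and the smartctl "Data Units Written" field are 
assumptions that depend on the hardware):

#!/bin/bash
# scale each node's scheduling Weight by how much its SSD has written;
# nodes with more wear get a higher weight and are picked later
for node in $(sinfo -h -N -o '%N' | sort -u); do
    written=$(ssh "$node" smartctl -A /dev/nvme0 | awk '/Data Units Written/ {gsub(/,/, "", $4); print $4}')
    [ -n "$written" ] || continue
    scontrol update NodeName="$node" Weight=$(( written / 1000000 + 1 ))
done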






[slurm-users] Evenly use all nodes

2020-07-02 Thread Timo Rothenpieler

Hello,

Our cluster is very rarely fully utilized; often only a handful of jobs 
are running.
This has the effect that the first couple of nodes get used a whole lot 
more frequently than the ones nearer the end of the list.


This is primarily a problem because of the SSDs in the nodes. They 
already show a significant difference in their wear level between the 
first couple and all the remaining nodes.


Is there some way to nicely assign jobs to nodes such that all nodes get 
roughly equal amounts of jobs/work?
I looked through possible options to the SelectType, but nothing looks 
like it does anything like that.



