[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-16 Thread Jeffrey T Frey via slurm-users
I can confirm that on a freshly-installed RockyLinux 9.4 system the dbus-devel 
package was not installed by default, and the Development Tools group does not 
include it:


# dnf repoquery --groupmember dbus-devel
Last metadata expiration check: 2:04:16 ago on Tue 16 Jul 2024 12:02:50 PM EDT.
dbus-devel-1:1.12.20-8.el9.i686
dbus-devel-1:1.12.20-8.el9.x86_64
  @platform-devel


# dnf group list 
Last metadata expiration check: 2:03:23 ago on Tue 16 Jul 2024 12:02:50 PM EDT.
Available Environment Groups:
   Minimal Install
   Workstation
   Custom Operating System
   Virtualization Host
Installed Environment Groups:
   Server with GUI
   Server
Installed Groups:
   Legacy UNIX Compatibility
   Console Internet Tools
   Container Management
   Development Tools
   Headless Management
   RPM Development Tools
   System Tools
Available Groups:
   .NET Development
   Graphical Administration Tools
   Network Servers
   Scientific Support
   Security Tools
   Smart Card Support


So the package was _not_ part of any of the groups that were installed, and 
"Platform Development" doesn't even appear in the group list in the first place.




> On Jul 16, 2024, at 13:50, Ole Holm Nielsen via slurm-users wrote:
> 
> On 16-07-2024 16:20, William V via slurm-users wrote:
>> How can I propose modifications to the wiki?
>> For example, for RHEL9, it is missing 'dnf install dbus-devel' for compiling 
>> with "cgroup v2".
> 
> On my RockyLinux 9.4 system there was no requirement for the dbus-devel RPM 
> package (it isn't installed) when I built the Slurm RPMs.  How did you 
> experience this requirement?
> 
> /Ole
> 
> 


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jeffrey T Frey via slurm-users
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit is a frontend to the kernel's rlimit facility (setrlimit/getrlimit), 
whose limits are per-process restrictions (not per-user).

The fs.file-max setting is the kernel's system-wide limit on how many file 
handles can be open in aggregate.  You'd have to edit that with sysctl:


$ sysctl fs.file-max
fs.file-max = 26161449


Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether an alternative 
limit has been set versus the kernel default.
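

If you decide to raise it, a persistent override is just a sysctl drop-in; a 
minimal sketch (pick a limit appropriate for your nodes):


# echo 'fs.file-max = 131072' > /etc/sysctl.d/90-file-max.conf
# sysctl -p /etc/sysctl.d/90-file-max.conf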




> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)

Naturally, since the ulimit is per-process, equating the core count with the 
multiplier isn't valid.  It also assumes Slurm isn't set up to oversubscribe 
CPU resources :-)
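

For a quick read on how close a node actually is to the aggregate limit, the 
kernel exposes the current usage alongside the maximum; the three fields of 
file-nr are allocated file handles, allocated-but-unused handles, and the 
fs.file-max value:


$ cat /proc/sys/fs/file-nr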



>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?

If the 1024 is a soft limit, users may be raising it to arbitrary values 
themselves, especially since 1024 is somewhat low for the more naively-written 
data science Python code I see on our systems.  If Slurm is configured to 
propagate submission-shell ulimits to the runtime environment and you allow 
submission from a variety of nodes/systems, you could be seeing a myriad of 
limits reconstituted on the compute nodes despite the /etc/security/limits.conf 
settings.
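

If that propagation turns out to be the culprit, it is controlled by the 
PropagateResourceLimits / PropagateResourceLimitsExcept parameters in 
slurm.conf.  A sketch (verify against your own configuration) that keeps the 
submission shell's open-files limit from overriding the node's defaults:


PropagateResourceLimitsExcept=NOFILE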


The main question needing an answer is _what_ process(es) are opening all the 
files on the systems that are faltering.  It's very likely user jobs opening 
all of them; I was just hoping to also rule out any bug in munged.  Since 
you're upgrading munged, you'll now get the errno associated with the backlog 
and can confirm EMFILE vs. ENFILE vs. ENOMEM.
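

A rough way to find the heavy openers (run as root; just a sketch) is to count 
open descriptors per process:


# for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) ${p##*/} $(cat $p/comm 2>/dev/null)"; done | sort -rn | head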
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94


The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the 
additional strerror() output, you're definitely running an older version, 
correct?


If you go on one of the affected nodes and do an `lsof -p <munged pid>` I'm 
betting you'll find a long list of open file descriptors; that would explain 
the "Too many open files" situation _and_ indicate that this is something other 
than external memory pressure or open-file limits on the process.
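

Even just counting the descriptors is enough to rule munged in or out, e.g. 
(assuming a single munged process on the node):


# ls /proc/$(pidof munged)/fd | wc -l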




> On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users wrote:
> 
> We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
> 8.9.  We've had a number of incidents where the Munge log-file 
> /var/log/munge/munged.log suddenly fills up the root file system, after a 
> while to 100% (tens of GBs), and the node eventually comes to a grinding 
> halt!  Wiping munged.log and restarting the node works around the issue.
> 
> I've tried to track down the symptoms and this is what I found:
> 
> 1. In munged.log there are infinitely many lines filling up the disk:
> 
>   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
> processing backlog
> 
> 2. The slurmd is not getting any responses from munged, even though we run
>   "munged --num-threads 10".  The slurmd.log displays errors like:
> 
>   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
> --num-threads=10
>   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to 
> "/var/run/munge/munge.socket.2": Resource temporarily unavailable
>   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: 
> RESPONSE_ACCT_GATHER_UPDATE has authentication error
> 
> 3. The /var/log/messages displays the errors from slurmd as well as
>   NetworkManager saying "Too many open files in system".
>   The telltale syslog entry seems to be:
> 
>   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
> 
>   where the limit is confirmed in /proc/sys/fs/file-max.
> 
> We have never before seen any such errors from Munge.  The error may perhaps 
> be triggered by certain user codes (possibly star-ccm+) that might be opening 
> a lot more files on the 96-core nodes than on nodes with a lower core count.
> 
> My workaround has been to edit the line in /etc/sysctl.conf:
> 
> fs.file-max = 131072
> 
> and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
> since!
> 
> The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
> version in https://github.com/dun/munge/releases/tag/munge-0.5.16
> I can't figure out if 0.5.16 has a fix for the issue seen here?
> 
> Questions: Have other sites seen the present Munge issue as well?  Are there 
> any good recommendations for setting the fs.file-max parameter on Slurm 
> compute nodes?
> 
> Thanks for sharing your insights,
> Ole
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Jeffrey T Frey via slurm-users
The native job_container/tmpfs would certainly have access to the job record, 
so modification to it (or a forked variant) would be possible.  A SPANK plugin 
should be able to fetch the full job record [1] and is then able to inspect the 
"gres" list (as a C string), which means I could modify UD's auto_tmpdir 
accordingly.  Having a compiled plugin executing xfs_quota to effect the 
commands illustrated wouldn't be a great idea -- luckily Linux XFS has an API.  
Seemingly not the simplest one, but xfsprogs is a working example.




[1] https://gitlab.hpc.cineca.it/dcesari1/slurm-msrsafe



> On Feb 7, 2024, at 05:25, Tim Schneider via slurm-users wrote:
> 
> Hey Jeffrey,
> thanks for this suggestion! This is probably the way to go if one can find a 
> way to access GRES in the prolog. I read somewhere that people were calling 
> scontrol to get this information, but this seems a bit unclean. Anyway, if I 
> find some time I will try it out.
> Best,
> Tim
> On 2/6/24 16:30, Jeffrey T Frey wrote:
>> Most of my ideas have revolved around creating file systems on-the-fly as 
>> part of the job prolog and destroying them in the epilog.  The issue with 
>> that mechanism is that formatting a file system (e.g. mkfs.) can be 
>> time-consuming.  E.g. formatting your local scratch SSD as an LVM PV+VG and 
>> allocating per-job volumes, you'd still need to run a e.g. mkfs.xfs and 
>> mount the new file system. 
>> 
>> 
>> ZFS file system creation is much quicker (basically combines the LVM + mkfs 
>> steps above) but I don't know of any clusters using ZFS to manage local file 
>> systems on the compute nodes :-)
>> 
>> 
>> One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:
>> 
>> 
>> [root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
>> [root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 
>> 2147483647' /tmp-alloc
>> Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
>> Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with 
>> recursion depth infinite (-1).
>> [root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
>> [root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
>> [root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
>> dd: error writing ‘zeroes’: No space left on device
>> 205+0 records in
>> 204+0 records out
>> 1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s
>> 
>>:
>> 
>> [root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
>> [root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc
>> 
>> 
>> Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF) we 
>> have an easy on-demand project id to use on the file system.  Slurm tmpfs 
>> plugins have to do a mkdir to create the per-job directory, adding two 
>> xfs_quota commands (which run in more or less O(1) time) won't extend the 
>> prolog by much. Likewise, Slurm tmpfs plugins have to scrub the directory at 
>> job cleanup, so adding another xfs_quota command will not do much to change 
>> their epilog execution times.  The main question is "where does the tmpfs 
>> plugin find the quota limit for the job?"
>> 
>> 
>> 
>> 
>> 
>>> On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users wrote:
>>> 
>>> Hi,
>>> 
>>> In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure 
>>> that each user can use /tmp and it gets cleaned up after them. Currently, 
>>> we are mapping /tmp into the nodes RAM, which means that the cgroups make 
>>> sure that users can only use a certain amount of storage inside /tmp.
>>> 
>>> Now we would like to use of the node's local SSD instead of its RAM to hold 
>>> the files in /tmp. I have seen people define local storage as GRES, but I 
>>> am wondering how to make sure that users do not exceed the storage space 
>>> they requested in a job. Does anyone have an idea how to configure local 
>>> storage as a proper tracked resource?
>>> 
>>> Thanks a lot in advance!
>>> 
>>> Best,
>>> 
>>> Tim
>>> 
>>> 
>> 
> 


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Restricting local disk storage of jobs

2024-02-06 Thread Jeffrey T Frey via slurm-users
Most of my ideas have revolved around creating file systems on-the-fly as part 
of the job prolog and destroying them in the epilog.  The issue with that 
mechanism is that formatting a file system (e.g. with mkfs) can be 
time-consuming.  E.g. formatting your local scratch SSD as an LVM PV+VG and 
allocating per-job volumes, you'd still need to run e.g. mkfs.xfs and mount 
the new file system.
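

Roughly, per job, the LVM route would look something like this (a sketch only, 
assuming a volume group named "scratch" and a size taken from the job request):


[root@r00n00 /]# lvcreate -n job-$SLURM_JOB_ID -L 100G scratch
[root@r00n00 /]# mkfs.xfs -q /dev/scratch/job-$SLURM_JOB_ID
[root@r00n00 /]# mkdir -p /tmp-alloc/slurm-$SLURM_JOB_ID
[root@r00n00 /]# mount /dev/scratch/job-$SLURM_JOB_ID /tmp-alloc/slurm-$SLURM_JOB_ID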


ZFS file system creation is much quicker (basically combines the LVM + mkfs 
steps above) but I don't know of any clusters using ZFS to manage local file 
systems on the compute nodes :-)
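

With ZFS the whole sequence collapses to a single, fast command, e.g. (again 
just a sketch, assuming a pool named "scratch"):


[root@r00n00 /]# zfs create -o quota=100G -o mountpoint=/tmp-alloc/slurm-$SLURM_JOB_ID scratch/slurm-$SLURM_JOB_ID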


One could leverage XFS project quotas.  E.g. for Slurm job 2147483647:


[root@r00n00 /]# mkdir /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'project -s -p /tmp-alloc/slurm-2147483647 
2147483647' /tmp-alloc
Setting up project 2147483647 (path /tmp-alloc/slurm-2147483647)...
Processed 1 (/etc/projects and cmdline) paths for project 2147483647 with 
recursion depth infinite (-1).
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=1g 2147483647' /tmp-alloc
[root@r00n00 /]# cd /tmp-alloc/slurm-2147483647
[root@r00n00 slurm-2147483647]# dd if=/dev/zero of=zeroes bs=5M count=1000
dd: error writing ‘zeroes’: No space left on device
205+0 records in
204+0 records out
1073741824 bytes (1.1 GB) copied, 2.92232 s, 367 MB/s

   :

[root@r00n00 /]# rm -rf /tmp-alloc/slurm-2147483647
[root@r00n00 /]# xfs_quota -x -c 'limit -p bhard=0 2147483647' /tmp-alloc


Since Slurm jobids max out at 0x03FFFFFF (and 2147483647 = 0x7FFFFFFF) we have 
an easy on-demand project id to use on the file system.  Slurm tmpfs plugins 
have to do a mkdir to create the per-job directory, adding two xfs_quota 
commands (which run in more or less O(1) time) won't extend the prolog by much. 
Likewise, Slurm tmpfs plugins have to scrub the directory at job cleanup, so 
adding another xfs_quota command will not do much to change their epilog 
execution times.  The main question is "where does the tmpfs plugin find the 
quota limit for the job?"
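

Wired into a prolog/epilog pair it would look roughly like the following 
sketch, where the hypothetical $JOB_TMP_LIMIT stands in for however the plugin 
learns the requested size (exactly the open question above):


Prolog:

mkdir /tmp-alloc/slurm-$SLURM_JOB_ID
xfs_quota -x -c "project -s -p /tmp-alloc/slurm-$SLURM_JOB_ID $SLURM_JOB_ID" /tmp-alloc
xfs_quota -x -c "limit -p bhard=${JOB_TMP_LIMIT:-1g} $SLURM_JOB_ID" /tmp-alloc

Epilog:

rm -rf /tmp-alloc/slurm-$SLURM_JOB_ID
xfs_quota -x -c "limit -p bhard=0 $SLURM_JOB_ID" /tmp-alloc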





> On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users wrote:
> 
> Hi,
> 
> In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure 
> that each user can use /tmp and it gets cleaned up after them. Currently, we 
> are mapping /tmp into the node's RAM, which means that the cgroups make sure 
> that users can only use a certain amount of storage inside /tmp.
> 
> Now we would like to use the node's local SSD instead of its RAM to hold 
> the files in /tmp. I have seen people define local storage as GRES, but I am 
> wondering how to make sure that users do not exceed the storage space they 
> requested in a job. Does anyone have an idea how to configure local storage 
> as a proper tracked resource?
> 
> Thanks a lot in advance!
> 
> Best,
> 
> Tim
> 
> 


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com