[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
It turns out that the Slurm job limits are *not* controlled by the normal 
/etc/security/limits.conf configuration.  Any service running under 
Systemd (such as slurmd) has limits defined by Systemd, see [1] and [2].


The limits of processes started by slurmd are defined by LimitXXX in 
/usr/lib/systemd/system/slurmd.service, and current Slurm versions have 
LimitNOFILE=131072.
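
A quick way to check which open-file limit a running slurmd really has is the "Max open files" row of /proc/<pid>/limits.  The following is only a sketch: the helper name nofile_limits is made up here, and the sample row assumes that LimitNOFILE=131072 sets both the soft and the hard limit to that value.

```shell
# Hypothetical helper: print the soft and hard "Max open files" values
# from text in the format of /proc/<pid>/limits, read on stdin.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }'
}

# Example with a sample row (values assumed, matching LimitNOFILE=131072):
printf '%s\n' \
  'Limit                     Soft Limit           Hard Limit           Units' \
  'Max open files            131072               131072               files' \
  | nofile_limits

# On a compute node one would feed it the real file, for example:
#   nofile_limits < /proc/$(pidof slurmd)/limits
```

The unit's configured value should also be visible with "systemctl show slurmd --property=LimitNOFILE".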


I guess that LimitNOFILE is the limit applied to every Slurm job, and that 
jobs presumably ought to crash if opening more than LimitNOFILE files?


If this is correct, I think the kernel's fs.file-max ought to be set to 
131072 times the maximum possible number of Slurm jobs per node, plus a 
safety margin for the OS.  Depending on the Slurm configuration, that means 
131072 times the number of CPUs plus some extra margin.  For example, a 
96-core node (96 job slots plus a margin of 4) might have fs.file-max set 
to 100*131072 = 13107200.
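
The arithmetic of this heuristic can be written down as a small shell function.  This is only a sketch of the proposal above, not a tested recommendation: the function name and the idea of expressing the OS margin as extra "job slots" are assumptions made for illustration.

```shell
# Hypothetical helper: suggested fs.file-max =
#   per-job open-file limit * (max jobs per node + margin in job slots)
suggest_file_max() {
    limit_nofile=$1   # e.g. LimitNOFILE from slurmd.service
    max_jobs=$2       # assumed upper bound on concurrent jobs per node
    margin_slots=$3   # extra slots reserved for the OS (an assumption)
    echo $(( limit_nofile * (max_jobs + margin_slots) ))
}

suggest_file_max 131072 96 4   # 131072 * 100, prints 13107200
```

If CPUs can be oversubscribed, max_jobs would have to be larger than the core count.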


Does this make sense?

Best regards,
Ole

[1] "How to set limits for services in RHEL and systemd", 
https://access.redhat.com/solutions/1257953
[2] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#slurmd-systemd-limits



[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-18 Thread Ole Holm Nielsen via slurm-users
I looked at some of our busy 96-core nodes where users are currently 
running the STAR-CCM+ CFD software.


One job runs on 4 96-core nodes.  I'm amazed that each STAR-CCM+ process 
has almost 1000 open files, for example:


$ lsof -p 440938 | wc -l
950

and that on this node the user has almost 95000 open files:

$ lsof -u  | wc -l
94606
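
Note that lsof output also contains a header line and entries that are not file descriptors (cwd, txt, memory-mapped files), so these counts overstate what "ulimit -n" actually limits.  A rough sketch that counts only descriptors via /proc/<pid>/fd follows; the helper name count_fds and its optional second argument are made up for illustration.

```shell
# Hypothetical helper: count the open file descriptors of a process
# by listing /proc/<pid>/fd (one entry per descriptor).
# $1 = PID, $2 = optional alternative to /proc (only useful for testing).
count_fds() {
    fd_dir="${2:-/proc}/$1/fd"
    # Arithmetic expansion strips any padding that wc may add.
    echo $(( $(ls -1 "$fd_dir" | wc -l) ))
}

# Usage on a node (reading another user's process requires root):
#   count_fds 440938
```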

So it's no wonder that 65536 open files would have been exhausted, and 
that my current limit is just barely sufficient:


$ sysctl fs.file-max
fs.file-max = 131072
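
The kernel's own view of how close a node is to fs.file-max is in /proc/sys/fs/file-nr, which holds three numbers: allocated file handles, unused handles, and the maximum.  A small sketch for turning that into a percentage (the function name is hypothetical, and the lsof count from above is used only as a rough stand-in for the allocated value):

```shell
# Hypothetical helper: given the three fields of /proc/sys/fs/file-nr
# (allocated handles, unused handles, maximum), print usage in percent.
file_nr_usage_pct() {
    allocated=$1
    maximum=$3
    echo $(( allocated * 100 / maximum ))
}

# Live system:  file_nr_usage_pct $(cat /proc/sys/fs/file-nr)
# With numbers from this message:
file_nr_usage_pct 94606 0 131072   # prints 72
```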

As an experiment I lowered the max number of files on a node:

$ sysctl fs.file-max=32768

and immediately the syslog displayed error messages:

Apr 18 10:54:11 e033 kernel: VFS: file-max limit 32768 reached

Munged (version 0.5.16) logged a lot of errors:

2024-04-18 10:54:33 +0200 Info:  Failed to accept connection: Too many open files in system
2024-04-18 10:55:34 +0200 Info:  Failed to accept connection: Too many open files in system
2024-04-18 10:56:35 +0200 Info:  Failed to accept connection: Too many open files in system

2024-04-18 10:57:22 +0200 Info:  Encode retry #1 for client UID=0 GID=0
2024-04-18 10:57:22 +0200 Info:  Failed to send message: Broken pipe
(many lines deleted)

Slurmd also logged some errors:

[2024-04-18T10:57:22.070] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected missing socket error



The node became completely non-responsive until I restored fs.file-max=131072.

Conclusions:

1. Munge should be upgraded to 0.5.15 or later to avoid the munged.log 
filling up the disk.  I summarize this in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service


2. We still need some heuristics for determining sufficient values for the 
kernel's fs.file-max limit.  I don't understand whether the kernel itself 
might set good default values, which we have noticed on some servers and 
login nodes.


As Jeffrey points out, there are both soft and hard user limits on the 
number of files, and this is what I see for a normal user:


$ ulimit -Sn   # Soft limit
1024
$ ulimit -Hn   # Hard limit
262144

Maybe the heuristic could be to multiply "ulimit -Hn" by the CPU core 
count (if we believe that users will only run 1 process per core).  An 
extra safety margin would need to be added on top.  Or maybe we need 
something a lot higher?


Question: Would there be any negative side effect of setting fs.file-max 
to a very large number (10s of millions)?


Interestingly, the (possibly outdated) Large Cluster Administration Guide 
at https://slurm.schedmd.com/big_sys.html recommends a ridiculously low 
number:



/proc/sys/fs/file-max: The maximum number of concurrently open files. We 
recommend a limit of at least 32,832.


Thanks for sharing your insights,
Ole



[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-17 Thread Bjørn-Helge Mevik via slurm-users
Jeffrey T Frey via slurm-users  writes:

>> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
>> is per user.
>
> The ulimit is a frontend to rusage limits, which are per-process restrictions 
> (not per-user).

You are right; I sit corrected. :)

(Except for number of procs and number of pending signals, according to
"man setrlimit".)

Then 1024 might not be so low for ulimit -n after all.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jason Simms via slurm-users
As a related point, for this reason I mount /var/log separately from /. Ask
me how I learned that lesson...

Jason

On Tue, Apr 16, 2024 at 8:43 AM Jeffrey T Frey via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
>
>
> The ulimit is a frontend to rusage limits, which are per-process
> restrictions (not per-user).
>
> The fs.file-max is the kernel's limit on how many file descriptors can be
> open in aggregate.  You'd have to edit that with sysctl:
>
>
> $ sysctl fs.file-max
> fs.file-max = 26161449
>
>
>
> Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative
> limit versus the default.
>
>
>
>
> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)
>
>
> Naturally, since the ulimit is per-process the equating of core count with
> the multiplier isn't valid.  It also assumes Slurm isn't setup to
> oversubscribe CPU resources :-)
>
>
>
> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?
>
>
> If the 1024 is a soft limit, you may have users who are raising it to
> arbitrary values themselves, for example.  Especially as 1024 is somewhat
> low for the more naively-written data science Python code I see on our
> systems.  If Slurm is configured to propagate submission shell ulimits to
> the runtime environment and you allow submission from a variety of
> nodes/systems you could be seeing myriad limits reconstituted on the
> compute node despite the /etc/security/limits.conf settings.
>
>
> The main question needing an answer is _what_ process(es) are opening all
> the files on your systems that are faltering.  It's very likely to be user
> jobs' opening all of them, I was just hoping to also rule out any bug in
> munged.  Since you're upgrading munged, you'll now get the errno associated
> with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
>


-- 
*Jason L. Simms, Ph.D., M.P.H.*
Instructor, Department of Languages & Literary Studies
Lafayette College
Pardee Hall | One Pardee Dr, 4th Fl | Easton, PA 18042
Office: Pardee 405



[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Jeffrey T Frey via slurm-users
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit is a frontend to rusage limits, which are per-process restrictions 
(not per-user).

The fs.file-max is the kernel's limit on how many file descriptors can be open 
in aggregate.  You'd have to edit that with sysctl:


$ sysctl fs.file-max
fs.file-max = 26161449


Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative 
limit versus the default.




> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)

Naturally, since the ulimit is per-process, equating the core count with the 
multiplier isn't valid.  It also assumes Slurm isn't set up to oversubscribe CPU 
resources :-)



>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?

If the 1024 is a soft limit, you may have users who are raising it to arbitrary 
values themselves, for example.  Especially as 1024 is somewhat low for the 
more naively-written data science Python code I see on our systems.  If Slurm 
is configured to propagate submission shell ulimits to the runtime environment 
and you allow submission from a variety of nodes/systems you could be seeing 
myriad limits reconstituted on the compute node despite the 
/etc/security/limits.conf settings.


The main question needing an answer is _what_ process(es) are opening all the 
files on your systems that are faltering.  It's very likely to be user jobs 
opening all of them; I was just hoping to also rule out any bug in munged.  
Since you're upgrading munged, you'll now get the errno associated with the 
backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
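
To make the distinction concrete: EMFILE means the process hit its own descriptor limit (ulimit -n), ENFILE means the system-wide table (fs.file-max) is full, and ENOMEM is a memory shortage.  The sketch below maps the strerror() text in a log line to the errno name.  It assumes the usual glibc message strings, and the function name is hypothetical.

```shell
# Hypothetical helper: map the strerror() text in a log line to an errno name.
# The more specific ENFILE pattern must be tested before the EMFILE one.
classify_open_error() {
    case "$1" in
        *"Too many open files in system"*) echo ENFILE ;;  # system-wide (fs.file-max)
        *"Too many open files"*)           echo EMFILE ;;  # per-process (ulimit -n)
        *"Cannot allocate memory"*)        echo ENOMEM ;;
        *)                                 echo unknown ;;
    esac
}

# A munged.log line of the kind seen in this thread:
classify_open_error "Info:  Failed to accept connection: Too many open files in system"
# prints ENFILE
```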


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen  writes:

> Hi Bjørn-Helge,
>
> That sounds interesting, but which limit might affect the kernel's
> fs.file-max?  For example, a user already has a narrow limit:
>
> ulimit -n
> 1024

AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
is per user.

Now that I think of it, fs.file-max of 65536 seems *very* low.  On our
CentOS-7-based clusters, we have in the order of tens of millions, and
on our Rocky 9 based clusters, we have 9223372036854775807(!)

Also a per-user limit of 1024 seems low to me; I think we have in the
order of 200K files per user on most clusters.

But if you have ulimit -n == 1024, then no user should be able to hit
the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
96 users each trying to open 1024 files would do it, though.)

> whereas the permitted number of user processes is a lot higher:
>
> ulimit -u
> 3092846

I guess any process will have a few open files, which I believe count
against the ulimit -n for each user (and fs.file-max).

> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?

There is a vague thing in my head saying that I've looked for this
before, and found that the default value depended on the size of the
RAM of the machine.  But the vague thing might of course be lying to
me. :)
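
For what it's worth, the RAM dependence can be sketched as follows.  This assumes the kernel's boot-time default for fs.file-max is roughly one file handle per 10 KiB of RAM (based on a reading of files_maxfiles_init() in fs/file_table.c; the kernel also subtracts a memory reserve, so real values come out somewhat lower).  An explicit setting, for example in /etc/sysctl.conf, overrides this default.  The function name is made up for illustration.

```shell
# Hypothetical helper: approximate boot-time default of fs.file-max,
# assuming about one file handle per 10 KiB of RAM.
# $1 = total memory in KiB (the MemTotal value from /proc/meminfo).
approx_default_file_max() {
    echo $(( $1 / 10 ))
}

# Live system:
#   approx_default_file_max "$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)"
# Example for a node with 256 GiB of RAM (268435456 KiB):
approx_default_file_max 268435456   # prints 26843545
```

That is the same order of magnitude as the 26161449 quoted elsewhere in this thread.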

-- 
Bjørn-Helge





[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users

Hi Bjørn-Helge,

On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote:
> Ole Holm Nielsen via slurm-users  writes:
>
>> Therefore I believe that the root cause of the present issue is user
>> applications opening a lot of files on our 96-core nodes, and we need
>> to increase fs.file-max.
>
> You could also set a limit per user, for instance in
> /etc/security/limits.d/.  Then users would be blocked from opening
> unreasonably many files.  One could use this to find which applications
> are responsible, and try to get them fixed.


That sounds interesting, but which limit might affect the kernel's 
fs.file-max?  For example, a user already has a narrow limit:


ulimit -n
1024

whereas the permitted number of user processes is a lot higher:

ulimit -u
3092846

I'm not sure how the number 3092846 got set, since it's not defined in 
/etc/security/limits.conf.  The "ulimit -u" varies quite a bit among our 
compute nodes, so which dynamic service might affect the limits?


Perhaps there is a recommendation for defining nproc in 
/etc/security/limits.conf on compute nodes?


Thanks,
Ole



[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Bjørn-Helge Mevik via slurm-users
Ole Holm Nielsen via slurm-users  writes:

> Therefore I believe that the root cause of the present issue is user
> applications opening a lot of files on our 96-core nodes, and we need
> to increase fs.file-max.

You could also set a limit per user, for instance in
/etc/security/limits.d/.  Then users would be blocked from opening
unreasonably many files.  One could use this to find which applications
are responsible, and try to get them fixed.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo






[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users

Hi Jeffrey,

Thanks a lot for the information:

On 4/15/24 15:40, Jeffrey T Frey wrote:
> https://github.com/dun/munge/issues/94


I hadn't seen issue #94 before, and it seems to be relevant to our 
problem.  It's probably a good idea to upgrade munge beyond what's 
supplied by EL8/EL9.  We can build the latest 0.5.16 RPMs by:


wget https://github.com/dun/munge/releases/download/munge-0.5.16/munge-0.5.16.tar.xz
rpmbuild -ta munge-0.5.16.tar.xz

I've updated my Slurm Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service 
accordingly now.



> The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the 
> additional strerror() output you're definitely running an older version, 
> correct?


Correct, we run munge 0.5.13 as supplied by EL8 (RockyLinux 8.9).


> If you go on one of the affected nodes and do an `lsof -p ` I'm betting 
> you'll find a long list of open file descriptors; that would explain the 
> "Too many open files" situation _and_ indicate that this is something other 
> than external memory pressure or open file limits on the process.


Actually, munged is normally working without too many open files as seen 
by "lsof -p `pidof munged`" over the entire partition, where the munged 
open file count is only 29.  I currently don't have any broken nodes with 
a full file system that I can examine.


Therefore I believe that the root cause of the present issue is user 
applications opening a lot of files on our 96-core nodes, and we need to 
increase fs.file-max.  And upgrade munge as well to avoid the log file 
growing without bounds.


I'd still like to know if anyone has good recommendations for setting the 
fs.file-max parameter on Slurm compute nodes?


Thanks,
Ole


On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
 wrote:

We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
8.9.  We've had a number of incidents where the Munge log-file 
/var/log/munge/munged.log suddenly fills up the root file system, after a while 
to 100% (tens of GBs), and the node eventually comes to a grinding halt!  
Wiping munged.log and restarting the node works around the issue.

I've tried to track down the symptoms and this is what I found:

1. In munged.log there are infinitely many lines filling up the disk:

   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while processing backlog

2. The slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error

3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.

We have never before seen any such errors from Munge.  The error may perhaps be 
triggered by certain user codes (possibly star-ccm+) that might be opening a 
lot more files on the 96-core nodes than on nodes with a lower core count.

My workaround has been to edit the line in /etc/sysctl.conf:

fs.file-max = 131072

and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
since!

The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer version 
in https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't figure out if 0.5.16 has a fix for the issue seen here?

Questions: Have other sites seen the present Munge issue as well?  Are there 
any good recommendations for setting the fs.file-max parameter on Slurm compute 
nodes?

Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark





--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620



[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94


The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the 
additional strerror() output you're definitely running an older version, 
correct?


If you go on one of the affected nodes and do an `lsof -p ` I'm 
betting you'll find a long list of open file descriptors — that would explain 
the "Too many open files" situation _and_ indicate that this is something other 
than external memory pressure or open file limits on the process.




> On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
>  wrote:
> 
> We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
> 8.9.  We've had a number of incidents where the Munge log-file 
> /var/log/munge/munged.log suddenly fills up the root file system, after a 
> while to 100% (tens of GBs), and the node eventually comes to a grinding 
> halt!  Wiping munged.log and restarting the node works around the issue.
> 
> I've tried to track down the symptoms and this is what I found:
> 
> 1. In munged.log there are infinitely many lines filling up the disk:
> 
>   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while processing backlog
> 
> 2. The slurmd is not getting any responses from munged, even though we run
>   "munged --num-threads 10".  The slurmd.log displays errors like:
> 
>   [2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
>   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
>   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
> 
> 3. The /var/log/messages displays the errors from slurmd as well as
>   NetworkManager saying "Too many open files in system".
>   The telltale syslog entry seems to be:
> 
>   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
> 
>   where the limit is confirmed in /proc/sys/fs/file-max.
> 
> We have never before seen any such errors from Munge.  The error may perhaps 
> be triggered by certain user codes (possibly star-ccm+) that might be opening 
> a lot more files on the 96-core nodes than on nodes with a lower core count.
> 
> My workaround has been to edit the line in /etc/sysctl.conf:
> 
> fs.file-max = 131072
> 
> and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
> since!
> 
> The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
> version in https://github.com/dun/munge/releases/tag/munge-0.5.16
> I can't figure out if 0.5.16 has a fix for the issue seen here?
> 
> Questions: Have other sites seen the present Munge issue as well?  Are there 
> any good recommendations for setting the fs.file-max parameter on Slurm 
> compute nodes?
> 
> Thanks for sharing your insights,
> Ole
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 

