We have some new AMD EPYC compute nodes with 96 cores/node running Rocky Linux 8.9. We've had a number of incidents where the Munge log file /var/log/munge/munged.log suddenly fills up the root file system to 100% (tens of GB), and the node eventually grinds to a halt! Wiping munged.log and restarting the node works around the issue.

I've tried to track down the symptoms and this is what I found:

1. In munged.log an endless stream of lines fills up the disk:

2024-04-11 09:59:29 +0200 Info: Suspended new connections while processing backlog

2. The slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

[2024-04-12T02:05:45.001] error: If munged is up, restart with --num-threads=10
[2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
[2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error

3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.
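
   For reference, the kernel-wide file-table usage can be checked against
   the limit via the standard /proc interface (a quick manual check,
   nothing Slurm- or Munge-specific):

      # allocated file handles, free handles, and the maximum:
      cat /proc/sys/fs/file-nr
      # the configured limit itself (same value as fs.file-max):
      cat /proc/sys/fs/file-max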

We have never before seen any such errors from Munge. The error may be triggered by certain user codes (possibly star-ccm+) that open a lot more files on the 96-core nodes than on nodes with a lower core count.
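
A quick-and-dirty way to see which processes are actually holding the file
descriptors (just a rough shell sketch using /proc, not a polished tool) is:

   # Top 10 processes by number of open file descriptors (run as root):
   for p in /proc/[0-9]*; do
       n=$(ls "$p/fd" 2>/dev/null | wc -l)
       printf '%s %s\n' "$n" "$(cat "$p/comm" 2>/dev/null)"
   done | sort -rn | head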

My workaround has been to set the following line in /etc/sysctl.conf:

fs.file-max = 131072

and reload the settings with "sysctl -p". We haven't seen any of the Munge errors since!
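
An alternative that may be easier to roll out across many nodes is a drop-in
file under /etc/sysctl.d/ instead of editing /etc/sysctl.conf; the file name
below is just an example:

   # /etc/sysctl.d/90-file-max.conf
   fs.file-max = 131072

followed by "sysctl --system" to reload all sysctl configuration files.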

The version of Munge in Rocky Linux 8.9 is 0.5.13, but there is a newer release at https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't figure out whether 0.5.16 contains a fix for the issue seen here.

Questions: Have other sites seen the present Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes?

Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark
