> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.

The ulimit is a frontend to the kernel's resource limits (setrlimit(2) rlimits), 
which are per-process restrictions (not per-user).
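
For example, something like this shows the pair in effect for a given process 
(exact values will differ per system):

$ ulimit -Sn                   # soft limit for this shell and its children
$ ulimit -Hn                   # hard ceiling the soft limit may be raised to
$ prlimit --nofile --pid $$    # same pair, via prlimit(1)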

The fs.file-max is the kernel's node-wide limit on how many file handles can be 
open in aggregate.  You'd have to change that with sysctl:


$ sysctl fs.file-max
fs.file-max = 26161449


Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether an alternative 
limit has been set versus the kernel default.
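
You can also compare that ceiling against what's actually allocated node-wide, 
and grep for overrides, roughly:

$ cat /proc/sys/fs/file-nr     # allocated handles, free handles, fs.file-max
$ grep -r file-max /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null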




> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)

Naturally, since the ulimit is per-process, using the core count as the 
multiplier isn't valid; a single job can spawn many processes, each with its 
own limit.  It also assumes Slurm isn't set up to oversubscribe CPU 
resources :-)
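
One way to see what per-process values are actually in effect across a node, 
rather than assuming 1024 everywhere, is to pull them straight out of /proc, 
something like (as root, to cover every user's processes):

$ grep -h 'Max open files' /proc/[0-9]*/limits | sort | uniq -c | sort -rn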



>> I'm not sure how the number 3092846 got set, since it's not defined in
>> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
>> our compute nodes, so which dynamic service might affect the limits?

If the 1024 is a soft limit, you may have users who are raising it to arbitrary 
values themselves, especially as 1024 is somewhat low for the more 
naively-written data science Python code I see on our systems.  If Slurm is 
configured to propagate submission shell ulimits to the runtime environment and 
you allow submission from a variety of nodes/systems, you could be seeing 
myriad limits reconstituted on the compute node despite the 
/etc/security/limits.conf settings.
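
If propagation turns out to be in play, one knob worth a look (just a sketch, 
assuming you'd rather the compute nodes' own limits win for open files) is the 
PropagateResourceLimits family in slurm.conf:

# slurm.conf: propagate the submit shell's limits except RLIMIT_NOFILE
PropagateResourceLimitsExcept=NOFILE

$ srun -N1 bash -c 'ulimit -Sn; ulimit -Hn'   # check what a job actually gets

Bear in mind users can still override per-job with --propagate, so the srun 
check is worth running from a couple of different submit hosts.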


The main question needing an answer is _what_ process(es) are opening all the 
files on your systems that are faltering.  It's very likely to be user jobs 
opening all of them; I was just hoping to also rule out any bug in munged.  
Since you're upgrading munged, you'll now get the errno associated with the 
backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM.
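
In the meantime, a crude way to see who's holding the descriptors on an 
affected node is to count fds per process out of /proc, e.g. (as root):

$ for d in /proc/[0-9]*; do
>   n=$(ls "$d/fd" 2>/dev/null | wc -l)
>   printf '%7d  %s  %s\n' "$n" "${d#/proc/}" \
>     "$(tr '\0' ' ' < "$d/cmdline" 2>/dev/null | cut -c1-60)"
> done | sort -rn | head -20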