[slurm-users] Job Invalid Account

2024-04-18 Thread Joe Teumer via slurm-users
We installed Slurm 23.11.5 and we are receiving "JobId=n has invalid
account" for every sbatch job.
We are not using the Slurm accounting/user database; we are using uniform
UIDs and GIDs across the cluster.

The jobs run and complete; can these invalid account errors be ignored or
silenced?
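
For reference, one way to confirm how accounting is configured (a quick
sketch; the values shown are only what I would expect on a cluster without
the accounting database):

$ scontrol show config | grep -Ei 'AccountingStorageEnforce|AccountingStorageType'
AccountingStorageEnforce = none
AccountingStorageType    = accounting_storage/none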

Job Submission Environment:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)

Slurm Worker Node:
id joteumer
uid=938401109(joteumer) gid=938400513(SPG) groups=938400513(SPG),27(sudo)

slurmctld log:
[2024-04-18T09:46:40.000] sched: JobId=18 has invalid account

scontrol show job 18
JobId=18 JobName=simplejob.sh
   UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=(null)

I submitted another sbatch job and updated it to include an account:
scontrol update jobid=19 Account=joteumer

[2024-04-18T09:56:05.126] _slurm_rpc_submit_batch_job: JobId=19 InitPrio=1 usec=485
[2024-04-18T09:56:06.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.000] debug:  set_job_failed_assoc_qos_ptr: Filling in assoc for JobId=19 Assoc=0
[2024-04-18T09:56:17.000] sched: JobId=19 has invalid account
[2024-04-18T09:56:17.588] debug:  set_job_failed_assoc_qos_ptr: Filling in assoc for JobId=19 Assoc=0
[2024-04-18T09:56:27.505] _slurm_rpc_update_job: complete JobId=19 uid=0 usec=110
[2024-04-18T09:56:28.000] sched: JobId=19 has invalid account

scontrol show job 19
JobId=19 JobName=simplejob.sh
   UserId=joteumer(938401109) GroupId=SPG(938400513) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=(null)


 JOBID PARTITION      NAME     USER   STATE  TIME  TIME_LIMIT  NODES  NODELIST(REASON)
    19       SPG simplejob joteumer PENDING  0:00    18:00:00      1  (InvalidAccount)

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-18 Thread Ole Holm Nielsen via slurm-users
I looked at some of our busy 96-core nodes where users are currently 
running the STAR-CCM+ CFD software.


One job runs on four 96-core nodes.  I'm amazed that each STAR-CCM+ process
has almost 1000 files open, for example:


$ lsof -p 440938 | wc -l
950

and that on this node the user has almost 95000 open files:

$ lsof -u  | wc -l
94606
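
As a cross-check (lsof also lists memory-mapped files and libraries, so it
can overcount), summing the per-process fd counts from /proc gives the
kernel's view; <username> is just a placeholder here:

$ # sum of open fds across all of the user's processes (run as root)
$ for pid in $(pgrep -u <username>); do ls /proc/$pid/fd 2>/dev/null | wc -l; done | awk '{s+=$1} END {print s}'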

So it's no wonder that 65536 open files would have been exhausted, and 
that my current limit is just barely sufficient:


$ sysctl fs.file-max
fs.file-max = 131072
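
The kernel also reports the current usage against this limit in
/proc/sys/fs/file-nr (allocated handles, unused handles, maximum); the
numbers below are only illustrative:

$ cat /proc/sys/fs/file-nr
97120   0       131072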

As an experiment I lowered the max number of files on a node:

$ sysctl fs.file-max=32768

and immediately the syslog displayed error messages:

Apr 18 10:54:11 e033 kernel: VFS: file-max limit 32768 reached

Munged (version 0.5.16) logged a lot of errors:

2024-04-18 10:54:33 +0200 Info:  Failed to accept connection: Too many open files in system
2024-04-18 10:55:34 +0200 Info:  Failed to accept connection: Too many open files in system
2024-04-18 10:56:35 +0200 Info:  Failed to accept connection: Too many open files in system

2024-04-18 10:57:22 +0200 Info:  Encode retry #1 for client UID=0 GID=0
2024-04-18 10:57:22 +0200 Info:  Failed to send message: Broken pipe
(many lines deleted)

Slurmd also logged some errors:

[2024-04-18T10:57:22.070] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected missing socket error



The node became completely non-responsive until I restored fs.file-max=131072.

Conclusions:

1. Munge should be upgraded to 0.5.15 or later to avoid munged.log 
filling up the disk.  I have summarized this on the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service


2. We still need some heuristics for determining sufficient values for the 
kernel's fs.file-max limit.  I don't understand whether the kernel itself 
might set good default values, as we have noticed on some servers and 
login nodes.
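
Whatever value we settle on would presumably go into a sysctl.d drop-in so
it survives reboots; a sketch (the value is only an example):

$ echo 'fs.file-max = 26000000' > /etc/sysctl.d/90-file-max.conf
$ sysctl --system    # reload all sysctl configuration files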


As Jeffrey points out, there are both soft and hard user limits on the 
number of files, and this is what I see for a normal user:


$ ulimit -Sn   # Soft limit
1024
$ ulimit -Hn   # Hard limit
262144
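
To check which limits a job actually ends up with on a compute node (a
quick sketch, assuming plain srun is allowed here):

$ srun -n1 bash -c 'ulimit -Sn; ulimit -Hn'   # soft and hard open-file limits inside the job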

Maybe the heuristic could be to multiply "ulimit -Hn" by the CPU core 
count (if we believe that users will run only 1 process per core).  An 
extra safety margin would need to be added on top.  Or maybe we need 
something a lot higher?
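
As a back-of-the-envelope example on one of the 96-core nodes above
(96 cores times the 262144 hard limit), that heuristic lands in the same
ballpark as the 26-million fs.file-max Jeffrey shows below:

$ echo $(( $(nproc) * $(ulimit -Hn) ))
25165824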


Question: Would there be any negative side effect of setting fs.file-max 
to a very large number (10s of millions)?


Interestingly, the (possibly outdated) Large Cluster Administration Guide 
at https://slurm.schedmd.com/big_sys.html recommends a ridiculously low 
number:



/proc/sys/fs/file-max: The maximum number of concurrently open files. We 
recommend a limit of at least 32,832.


Thanks for sharing your insights,
Ole


On 4/16/24 14:40, Jeffrey T Frey via slurm-users wrote:

> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.


The ulimit is a frontend to rlimit resource limits, which are per-process 
restrictions (not per-user).


The fs.file-max is the kernel's limit on how many file descriptors can be 
open in aggregate.  You'd have to edit that with sysctl:



$ sysctl fs.file-max
fs.file-max = 26161449



Check in e.g. /etc/sysctl.conf or /etc/sysctl.d if you have an alternative 
limit versus the default.
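
For example (the -s just silences errors for paths that don't exist):

$ grep -rs file-max /etc/sysctl.conf /etc/sysctl.d/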






> But if you have ulimit -n == 1024, then no user should be able to hit
> the fs.file-max limit, even if it is 65536.  (Technically, 96 jobs from
> 96 users each trying to open 1024 files would do it, though.)


Naturally, since the ulimit is per-process, equating the multiplier with the 
core count isn't valid.  It also assumes Slurm isn't set up to 
oversubscribe CPU resources :-)





> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf.  The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?


If the 1024 is a soft limit, you may have users who are raising it to 
arbitrary values themselves, for example.  Especially as 1024 is somewhat 
low for the more naively-written data science Python code I see on our 
systems.  If Slurm is configured to propagate submission shell ulimits to 
the runtime environment and you allow submission from a variety of 
nodes/systems, you could be seeing myriad limits reconstituted on the 
compute node despite the /etc/security/limits.conf settings.
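
Whether that propagation happens is controlled by PropagateResourceLimits
(and PropagateResourceLimitsExcept) in slurm.conf; a quick way to check
(output shown is just the usual default):

$ scontrol show config | grep -i PropagateResourceLimits
PropagateResourceLimits = ALL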



The main question needing an answer is _what_ process(es) are opening all 
the files on your systems that are faltering.  It's very likely to be user 
jobs opening all of them; I was just hoping to