[slurm-users] Fwd: sreport cluster UserUtilizationByaccount Used result versus sreport job SizesByAccount or sacct: inconsistencies

2024-04-15 Thread KK via slurm-users
---------- Forwarded message ---------
From: KK 
Date: Mon, Apr 15, 2024, 13:25
Subject: sreport cluster UserUtilizationByaccount Used result versus
sreport job SizesByAccount or sacct: inconsistencies
To: 


I wish to ascertain the CPU core hours utilized by users dj1 and dj. I have
tested with sreport cluster UserUtilizationByAccount, sreport job
SizesByAccount, and sacct. It appears that sreport cluster
UserUtilizationByAccount displays the total core hours used by the entire
account, rather than the individual user's CPU time. Here are the specifics:

Users dj and dj1 are both under the account mehpc.

Between 2024-04-12 and 2024-04-15, dj1 used approximately 10 minutes of core
time, while dj used about 4 minutes. However, "sreport Cluster
UserUtilizationByAccount user=dj1 start=2024-04-12 end=2024-04-15" shows
14 minutes of usage. Similarly, "sreport Cluster UserUtilizationByAccount
user=dj start=2024-04-12 end=2024-04-15" shows about 14 minutes.
Using "sreport job SizesByAccount Users=dj1 start=2024-04-12
end=2024-04-15" or sacct -u dj1 -S 2024-04-12 -E 2024-04-15 -o
"jobid,partition,account,user,alloccpus,cputimeraw,state,workdir%60" -X
|awk 'BEGIN{total=0}{total+=$6}END{print total}' yields the accurate
values, which are around 10 minutes for dj1. Here are the details:

[root@ood-master ~]# sacctmgr list assoc format=cluster,user,account,qos
   Cluster       User    Account        QOS
---------- ---------- ---------- ----------
     mehpc                  root     normal
     mehpc       root       root     normal
     mehpc                 mehpc     normal
     mehpc         dj      mehpc     normal
     mehpc        dj1      mehpc     normal


[root@ood-master ~]# sacct -X -u dj1 -S 2024-04-12 -E 2024-04-15 -o
jobid,ncpus,elapsedraw,cputimeraw
JobID             NCPUS ElapsedRaw CPUTimeRAW
------------ ---------- ---------- ----------
4                     1         60         60
5                     2        120        240
6                     1         61         61
8                     2        120        240
9                     0          0          0

[root@ood-master ~]# sacct -X -u dj -S 2024-04-12 -E 2024-04-15 -o
jobid,ncpus,elapsedraw,cputimeraw
JobID             NCPUS ElapsedRaw CPUTimeRAW
------------ ---------- ---------- ----------
7                     2        120        240
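
For reference, summing the CPUTimeRAW columns above reproduces the per-user
figures (a quick shell-arithmetic cross-check of the values shown):

    echo $(( (60 + 240 + 61 + 240 + 0) / 60 ))   # dj1: 601 s, ~10 CPU-minutes
    echo $(( 240 / 60 ))                         # dj:  240 s,  ~4 CPU-minutes

Together that is 841 s, about 14 CPU-minutes, which matches the figure that
UserUtilizationByAccount reports for each user individually.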


[root@ood-master ~]# sreport job SizesByAccount Users=dj1 start=2024-04-12
end=2024-04-15

Job Sizes 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Time reported in Minutes

  Cluster   Account  0-49 CPUs  50-249 CPUs  250-499 CPUs  500-999 CPUs  >= 1000 CPUs  % of cluster
--------- --------- ---------- ------------ ------------- ------------- ------------- -------------
    mehpc      root         10            0             0             0             0       100.00%


[root@ood-master ~]# sreport job SizesByAccount Users=dj start=2024-04-12
end=2024-04-15

Job Sizes 2024-04-12T00:00:00 - 2024-04-14T23:59:59 (259200 secs)
Time reported in Minutes

  Cluster   Account  0-49 CPUs  50-249 CPUs  250-499 CPUs  500-999 CPUs  >= 1000 CPUs  % of cluster
--------- --------- ---------- ------------ ------------- ------------- ------------- -------------
    mehpc      root          4            0             0             0             0       100.00%


[root@ood-master ~]# sreport Cluster UserUtilizationByAccount user=dj1
start=2024-04-12 end=2024-04-15

Cluster/User/Account Utilization 2024-04-12T00:00:00 - 2024-04-14T23:59:59
(259200 secs)
Usage reported in CPU Minutes

  Cluster     Login     Proper Name         Account       Used     Energy
--------- --------- --------------- --------------- ---------- ----------
    mehpc       dj1             dj1           mehpc         14          0



[root@ood-master ~]# sreport Cluster UserUtilizationByAccount user=dj
start=2024-04-12 end=2024-04-15

Cluster/User/Account Utilization 2024-04-12T00:00:00 - 2024-04-14T23:59:59
(259200 secs)
Usage reported in CPU Minutes

  Cluster     Login     Proper Name         Account       Used     Energy
--------- --------- --------------- --------------- ---------- ----------
    mehpc        dj              dj           mehpc         14          0


[slurm-users] Re: Slurm.conf and workers

2024-04-15 Thread Brian Andrus via slurm-users

Xaver,

If you look at your slurmctld log, you will likely see messages
about each node's slurm.conf not being the same as the one on the master.


So, yes, it can work temporarily, but unless some very specific
settings are in place, issues will arise. From the state you are in now, you
will want to sync the config across all nodes and then run 'scontrol
reconfigure'.
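
For example (a minimal sketch, assuming password-less ssh to the nodes and
the default /etc/slurm/slurm.conf path; the hostnames are placeholders):

    # push the master's slurm.conf to every compute node
    for h in node01 node02 node03; do
        scp /etc/slurm/slurm.conf "$h":/etc/slurm/slurm.conf
    done

    # then have all daemons re-read the configuration
    scontrol reconfigure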


You may want to look into configless mode if you can set DNS entries and
your config is basically monolithic or all parts are in /etc/slurm/.
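
Roughly, configless looks like this (going by the Slurm configless docs;
the host name and port below are placeholders):

    # on the controller, in slurm.conf:
    SlurmctldParameters=enable_configless

    # on each compute node, point slurmd at the controller instead of a local file:
    slurmd --conf-server slurmctld-host:6817

    # or publish a DNS SRV record so slurmd can locate the controller itself:
    # _slurmctld._tcp 3600 IN SRV 10 0 6817 slurmctld-host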


Brian Andrus

On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-users wrote:

Dear slurm-user list,

as far as I understood it, the slurm.conf needs to be present on the
master and on the workers at the default slurm.conf location (if no other
path is set via SLURM_CONF). However, I noticed that when adding a
partition only in the master's slurm.conf, all workers were able to
"correctly" show the added partition when calling sinfo on them.

Is the stored slurm.conf on every instance just a fallback for when
connection is down or what is the purpose? The documentation only says:
"This file should be consistent across all nodes in the cluster."

Best regards,
Xaver




--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-15 Thread Jeffrey T Frey via slurm-users
https://github.com/dun/munge/issues/94


The NEWS file claims this was fixed in 0.5.15.  Since your log doesn't show the
additional strerror() output, you're definitely running an older version,
correct?


If you go on one of the affected nodes and do an `lsof -p `, I'm
betting you'll find a long list of open file descriptors — that would explain
the "Too many open files" situation _and_ indicate that this is something other
than external memory pressure or open file limits on the process.
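
If lsof isn't installed on the node, the raw count is just as easy to get
from /proc (assuming munged is still running so pidof can find it):

    # number of file descriptors currently held by munged
    lsof -p $(pidof munged) | wc -l

    # same count without lsof
    ls /proc/$(pidof munged)/fd | wc -l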




> On Apr 15, 2024, at 08:14, Ole Holm Nielsen via slurm-users 
>  wrote:
> 
> We have some new AMD EPYC compute nodes with 96 cores/node running RockyLinux 
> 8.9.  We've had a number of incidents where the Munge log-file 
> /var/log/munge/munged.log suddenly fills up the root file system, after a 
> while to 100% (tens of GBs), and the node eventually comes to a grinding 
> halt!  Wiping munged.log and restarting the node works around the issue.
> 
> I've tried to track down the symptoms and this is what I found:
> 
> 1. In munged.log there are infinitely many lines filling up the disk:
> 
>   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
> processing backlog
> 
> 2. The slurmd is not getting any responses from munged, even though we run
>   "munged --num-threads 10".  The slurmd.log displays errors like:
> 
>   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
> --num-threads=10
>   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to connect to 
> "/var/run/munge/munge.socket.2": Resource temporarily unavailable
>   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: auth_g_create: 
> RESPONSE_ACCT_GATHER_UPDATE has authentication error
> 
> 3. The /var/log/messages displays the errors from slurmd as well as
>   NetworkManager saying "Too many open files in system".
>   The telltale syslog entry seems to be:
> 
>   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
> 
>   where the limit is confirmed in /proc/sys/fs/file-max.
> 
> We have never before seen any such errors from Munge.  The error may perhaps 
> be triggered by certain user codes (possibly star-ccm+) that might be opening 
> a lot more files on the 96-core nodes than on nodes with a lower core count.
> 
> My workaround has been to edit the line in /etc/sysctl.conf:
> 
> fs.file-max = 131072
> 
> and update settings by "sysctl -p".  We haven't seen any of the Munge errors 
> since!
> 
> The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
> version in https://github.com/dun/munge/releases/tag/munge-0.5.16
> I can't figure out if 0.5.16 has a fix for the issue seen here?
> 
> Questions: Have other sites seen the present Munge issue as well?  Are there 
> any good recommendations for setting the fs.file-max parameter on Slurm 
> compute nodes?
> 
> Thanks for sharing your insights,
> Ole
> 
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Munge log-file fills up the file system to 100%

2024-04-15 Thread Ole Holm Nielsen via slurm-users
We have some new AMD EPYC compute nodes with 96 cores/node running 
RockyLinux 8.9.  We've had a number of incidents where the Munge log-file 
/var/log/munge/munged.log suddenly fills up the root file system, after a 
while to 100% (tens of GBs), and the node eventually comes to a grinding 
halt!  Wiping munged.log and restarting the node works around the issue.


I've tried to track down the symptoms and this is what I found:

1. In munged.log there are infinitely many lines filling up the disk:

   2024-04-11 09:59:29 +0200 Info:  Suspended new connections while 
processing backlog


2. The slurmd is not getting any responses from munged, even though we run
   "munged --num-threads 10".  The slurmd.log displays errors like:

   [2024-04-12T02:05:45.001] error: If munged is up, restart with 
--num-threads=10
   [2024-04-12T02:05:45.001] error: Munge encode failed: Failed to 
connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
   [2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg: 
auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error


3. The /var/log/messages displays the errors from slurmd as well as
   NetworkManager saying "Too many open files in system".
   The telltale syslog entry seems to be:

   Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached

   where the limit is confirmed in /proc/sys/fs/file-max.

We have never before seen any such errors from Munge.  The error may 
perhaps be triggered by certain user codes (possibly star-ccm+) that might 
be opening a lot more files on the 96-core nodes than on nodes with a 
lower core count.


My workaround has been to edit the line in /etc/sysctl.conf:

fs.file-max = 131072

and update settings by "sysctl -p".  We haven't seen any of the Munge 
errors since!
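
For anyone wanting to watch this, the kernel's current file-handle usage can
be compared against the limit with plain procfs/sysctl reads (nothing
Slurm-specific):

    # allocated handles, unused handles, and the current maximum
    cat /proc/sys/fs/file-nr

    # the limit on its own
    sysctl fs.file-max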


The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer 
version in https://github.com/dun/munge/releases/tag/munge-0.5.16

I can't figure out if 0.5.16 has a fix for the issue seen here?

Questions: Have other sites seen the present Munge issue as well?  Are 
there any good recommendations for setting the fs.file-max parameter on 
Slurm compute nodes?


Thanks for sharing your insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Interfaces of topology/tree and Topology Awareness

2024-04-15 Thread Nico Derl via slurm-users
I know this isn't a developer forum, but I don't really know where else to ask.
I've had no luck with Stack Overflow. Is there no input on this?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm.conf and workers

2024-04-15 Thread Xaver Stiensmeier via slurm-users

Dear slurm-user list,

as far as I understood it, the slurm.conf needs to be present on the
master and on the workers at the default slurm.conf location (if no other
path is set via SLURM_CONF). However, I noticed that when adding a
partition only in the master's slurm.conf, all workers were able to
"correctly" show the added partition when calling sinfo on them.

Is the stored slurm.conf on every instance just a fallback for when
connection is down or what is the purpose? The documentation only says:
"This file should be consistent across all nodes in the cluster."

Best regards,
Xaver


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com