Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-06 Thread Rémi Palancher
On Thursday, January 6, 2022 at 22:39, David Henkemeyer wrote:

> All,
>
> When my team used PBS, we had several nodes that had a TON of CPUs, so many, 
> in fact, that we ended up setting np to a smaller value, in order to not 
> starve the system of memory.
>
> What is the best way to do this with Slurm? I tried modifying # of CPUs in 
> the slurm.conf file, but I noticed that Slurm enforces that "CPUs" is equal 
> to Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore. This left me 
> with having to "fool" Slurm into thinking there were either fewer 
> ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a 
> less than ideal solution, it seems to me. At least, it left me feeling like 
> there has to be a better way.

I'm not sure you can lie to Slurm about the real number of CPUs on the nodes.

If you want to prevent Slurm from allocating more than n CPUs on these nodes,
where n is below their total number of CPUs, I guess one solution is to use
MaxCPUsPerNode=n at the partition level.

You can also mask "system" CPUs with CpuSpecList at node level.

The latter is better if you need fine-grained control over the exact list of
reserved CPUs regarding the NUMA topology.
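
As a rough illustration, a hypothetical slurm.conf excerpt combining both
options (node/partition names, CPU counts and reserved CPU IDs are made up,
adapt them to your hardware and TaskPlugin configuration):

  # Partition-level cap: never allocate more than 64 CPUs per node in this partition
  PartitionName=big Nodes=node[01-04] MaxCPUsPerNode=64 State=UP
  # Node-level reservation of CPU IDs 0-3 for the system, hidden from jobs
  NodeName=node[01-04] Sockets=2 CoresPerSocket=64 ThreadsPerCore=1 CpuSpecList=0,1,2,3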

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io





Re: [slurm-users] Scheduler does not reserve resources

2022-01-17 Thread Rémi Palancher
Hi Jérémy,

On Wednesday, January 12, 2022 at 16:59, Jérémy Lapierre wrote:

> Hi To all slurm users,
>
> We have the following issue: jobs with highest priority are pending
> forever with "Resources" reason. More specifically, the jobs pending
> forever ask for 2 full nodes but all other jobs from other users
> (running or pending) need only a 1/4 of a node, then pending jobs asking
> for 1/4 of a node always get allocated and the jobs asking for 2 nodes
> are pending forever even though the priority is higher than the ones
> asking for less resources. I hope I'm clear enough, if not please look
> at page 17 on https://slurm.schedmd.com/SUG14/sched_tutorial.pdf, in our
> situation an infinite number of jobs will fit before what is job4 in the
> scheme p. 17 and thus job4 will never be launched.

Backfilling doesn't delay the scheduled start time of higher-priority jobs,
but this only works if those jobs actually have a scheduled start time.

Did you check the start time of the jobs pending with the Resources reason?
E.g. with `scontrol show job <jobid> | grep StartTime`.

Sometimes Slurm is unable to determine the start time of a pending job. One
typical reason is the absence of a time limit on the running jobs.

In this case, Slurm cannot determine when the running jobs will end, when the
next highest-priority job can start, and ultimately whether lower-priority
jobs would actually delay higher-priority jobs.
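
If that is the case, a common fix is to make sure every job gets a time limit,
for instance with defaults at the partition level in slurm.conf (names and
values below are purely illustrative):

  PartitionName=compute Nodes=node[001-100] DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP

With a DefaultTime/MaxTime in place, the backfill scheduler can compute an
expected start time for the large pending jobs and reserve resources for them.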

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] big increase of MaxStepCount?

2022-01-19 Thread Rémi Palancher
--- Original Message ---

On Wednesday, January 12, 2022 at 18:45, John R Anderson wrote:

> hello, a user has requested that we set MaxStepCount to "unlimited" or 
> 16 million to accommodate some of their desired workflows. I searched around 
> for details about this parameter & don't see a lot, and I reviewed 
> https://bugs.schedmd.com/show_bug.cgi?id=5722
>
> any thoughts on this? can this successfully be applied to a partition or 
> individual nodes only? i wonder about log files exploding or worse...

I think one bottleneck here could be accounting and SlurmDBD, if you are using 
it. One step is one record in the step table of the SQL database. If you end up 
with hundreds of millions of records in the SQL table, you might experience 
weird issues with e.g. archives or sreport. Mind that Slurm major version 
upgrades may come with database schema changes, and the schema migration could 
take a long time (like several hours) with tables of this order of magnitude.
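
If you do raise MaxStepCount, you may also want to purge old step records
regularly to keep the table size under control, for instance in slurmdbd.conf
(retention values are just an example):

  PurgeStepAfter=2months
  PurgeJobAfter=12months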

Considering the total number of steps, I suspect this user may also generate a 
high throughput of steps. At some point, slurmctld might need some specific 
tuning to handle it gracefully [1].

[1] https://slurm.schedmd.com/high_throughput.html

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2022-01-28 Thread Rémi Palancher
On Friday, January 28, 2022 at 06:56, Ratnasamy, Fritz wrote:

> Hi,
>
> I have a similar issue as described on the following link 
> (https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk). A machine had some 
> existing local permissions. We have added it as a compute node to our cluster 
> via Slurm. When running an srun interactive session on that server, it would 
> seem that the LDAP groups shadow the local groups.
>
> johndoe@ecolonnelli:~ $ groups
>
> Faculty_Collab ecolonnelli_access #Those are LDAP groups
>
> johndoe@ecolonnelli:~ $ groups johndoe
>
> johndoe : Faculty_Collab projectsbrasil core rais rfb polconnfirms johndoe 
> vpce rfb_all backup_johndoe ecolonnelli_access

The difference between the first and second commands could be the UID used for 
the resolution. The first command calls the getgroups() syscall using the UID 
of the shell. The second command resolves johndoe's UID through the nsswitch 
stack, then looks up the groups of that UID.

Do you have johndoe declared in both the local /etc/passwd and the LDAP 
directory with different UIDs?

Do `id` and `id johndoe` return the same UID?
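
For instance, to compare the local entry with the one resolved through the NSS
stack (commands for illustration):

  $ id
  $ id johndoe
  $ getent passwd johndoe
  $ grep '^johndoe:' /etc/passwd

If the last two commands report different UIDs, that would explain the
difference in groups.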

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] how to allocate high priority to low cpu and memory jobs

2022-01-28 Thread Rémi Palancher
--- Original Message ---

On Tuesday, January 25, 2022 at 22:22,  wrote:

> Dear all,
>
> how can I reverse the priority, so that jobs with high cpu and memory
>
> have a low priority?
>
> The Priority/Multifactor plugin it is possible to calculate high
>
> priority for high cpu and memory jobs.
>
> With PriorityFavorSmall, jobs with a lower cpu number have a high
>
> priority, but this only works for cpu, not memory.

Well, there are several options available for this use case, and the best 
choice mostly depends on your current configuration.

In addition to Michael's proposal with partitions, you could also set up a QOS 
for low-memory jobs, with a high priority and a MaxTRESPerJob limit.
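
As a rough sketch (the QOS name, priority and limit values are made up):

  $ sacctmgr add qos lowmem
  $ sacctmgr modify qos lowmem set Priority=1000 MaxTRESPerJob=mem=16G

Users would then submit their small jobs with --qos=lowmem, provided the QOS
is in their association's QOS list and PriorityWeightQOS is non-zero in
slurm.conf.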

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] Limiting srun to a specific partition

2022-02-14 Thread Rémi Palancher
Hi Peter,

On Monday, February 14, 2022 at 18:37, Peter Schmidt wrote:

> slurm newbie here, converting from pbspro. In pbspro there is the capability 
> of limiting interactive jobs (i.e srun) to a specific queue (i.e partition).

Note that in Slurm, srun and interactive jobs are not the same thing. The srun 
command creates job steps (interactive or not), optionally creating a job 
allocation beforehand if one does not already exist.

You can run interactive jobs with salloc and even attach your PTY to a running 
batch job to interact with it. Conversely, batch jobs can create steps with 
the srun command.

I don't know of any native Slurm feature to restrict interactive jobs (to a 
specific partition or otherwise). However, using the job_submit/lua plugin and 
a custom Lua script, you might be able to accomplish what you want. It has 
been discussed here:

https://bugs.schedmd.com/show_bug.cgi?id=3094
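
As a very rough, untested sketch of what such a rule could look like in
job_submit.lua (the partition name is made up; the idea is that jobs submitted
by salloc/srun without a batch script have an empty script field):

  function slurm_job_submit(job_desc, part_list, submit_uid)
      -- force interactive jobs (no batch script) into a dedicated partition
      if job_desc.script == nil or job_desc.script == '' then
          job_desc.partition = "interactive"
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end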

Best,
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] Slurm database field for SystemCPU, UserCPU, TotalCPU

2022-03-18 Thread Rémi Palancher
Hi Simon,

On Saturday, March 12, 2022 at 00:26, Simon Gao wrote:

> HI,
>
> We export SLURM database job_table and step_table to CSV files for data 
> analysis.
>
> Which table fields store CPU data for SystemCPU, UserCPU, TotalCPU?

I think they are the user_[u]sec and sys_[u]sec fields from the cluster's 
step_table. The total is computed as the sum of these fields, as you can see 
here:

https://github.com/SchedMD/slurm/blob/fd6fef3e14a0c6d1484230744289749c0e4b19d0/src/plugins/accounting_storage/mysql/as_mysql_jobacct_process.c#L1063
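
For illustration, a hypothetical query against this table (the table name is
prefixed with your cluster name):

  SELECT user_sec, user_usec, sys_sec, sys_usec
  FROM mycluster_step_table LIMIT 10;

The TotalCPU reported by sacct then corresponds to the sum of the user and
system times.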

Best,
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] why sacct display wrong username while the UID is right?

2022-03-18 Thread Rémi Palancher
Hi,

On Sunday, March 13, 2022 at 04:59,  wrote:

> Hi all:
>
> […]
>
> So is there any guess about why only sacct display the wrong username?

I guess sacct reports the username as found in the cluster_assoc_table of the 
SlurmDBD database, linked to the cluster_job_table through the id_assoc field. 
The username in the output might not go through NSS resolution at all.

Did the UID of phywht change over time? That would explain why the jobs are 
associated with this user in the SlurmDBD database.
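
To check, you could compare the UID and username reported by sacct with the
current NSS resolution, for instance (the job id is a placeholder):

  $ sacct -j <jobid> -o JobID,User,UID
  $ id phywht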

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




[slurm-users] New future and roadmap for Slurm-web

2023-05-08 Thread Rémi Palancher
Hi Slurm community,

Slurm-web is an open source web interface for the Slurm workload manager: 
http://rackslab.github.io/slurm-web/

The project was born in 2015 (*). It was originally funded by EDF [2] (huge 
thanks to them!) and reached a nice and unique feature set with the 2.x 
versions. Unfortunately, over the last few years the software has suffered 
from reduced maintenance and investment.

Today, Slurm-web is being adopted by Rackslab [3], a small company focused on 
the development of open source solutions for HPC operations, which becomes its 
new official maintainer. An ambitious new roadmap has been defined with a 
long-term vision for the project, starting with version 3.0 coming later this 
year.

In addition to the existing Slurm-web feature set, the following new features are 
planned:

- Near real-time updates of the dashboard
- Accounting reports and visualization of past jobs
- Built-in metrics about jobs and scheduling
- Job submission and inspection
- Vastly improved Gantt view
- GPGPU support
- QOS, associations and reservations management
- Native RPM/deb packages and containers for easy deployment on most Linux 
distributions

The software architecture will be reworked with modern, established 
technologies; it will notably be based on the reference slurmrestd REST API. 
The source code will remain free, published under GPLv3, in line with 
Rackslab's commitment to the free software community.

Our goal is clearly to build the reference open source web interface for all 
users of Slurm-based HPC clusters.

More details about the roadmap have been published in the project discussions 
on GitHub: https://github.com/rackslab/slurm-web/discussions/235

You are more than welcome to discuss it there, ask questions and leave 
comments!

Best regards,

(*) The original announcement can still be found in the archives of this 
mailing-list! [1]
[1] https://groups.google.com/g/slurm-users/c/LiD2Pa8r22A/m/fDHWm5GomJsJ
[2] https://www.edf.fr/en
[3] https://rackslab.io
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] Configuring slurm.conf and using subpartitions

2023-10-04 Thread Rémi Palancher
On Wednesday, October 4, 2023 at 06:03, Kratz, Zach wrote:

> We use an interactive node that will randomly select from our list of 
> computing nodes to complete the job. We would like to find a way to select 
> from our list of old nodes first, before using the newer ones. We tried using 
> weight and assigned each of the old nodes a lower weight than the new nodes, 
> but in testing the new nodes were still assigned, even if the old nodes were 
> available.

Unless it is confidential, can you share the Node and Partition configuration 
lines you have tested unsuccessfully?

> Is there any way to configure this in the line that configures the 
> interactive node in slurm.conf, for example: 
> 
> PartitionName=interactive-cpu   Nodes=node[1-17] weight =10 node[18-24] 
> weight=50

Mind that Weight is a *Node* parameter, to be defined on the Node 
configuration lines [1], not on the Partition line.
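
For instance, something along these lines (a sketch only, keep your other Node
parameters in place):

  NodeName=node[1-17]  ... Weight=10
  NodeName=node[18-24] ... Weight=50
  PartitionName=interactive-cpu Nodes=node[1-24] State=UP

Nodes with the lowest Weight are allocated first, so the old nodes should be
preferred as long as they are free.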

Another less optimal option is to define a default partition with the old 
nodes and an overlapping partition including the new nodes, which users would 
need to specify explicitly at job submission to access the new nodes.

[1] https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] Response to Rémi Palancher about Configuring slurm.conf and using subpartitions

2023-10-05 Thread Rémi Palancher
--- Original Message ---
On Wednesday, October 4, 2023 at 17:39, Kratz, Zach wrote:


> Thank you for your response,
> 
> Just to clarify,
> We do specify the node weight in the node setting lines, I was just wondering 
> if there was a way to be more detailed in our weight assignments.
> 
> Here is our configuration right now:
>  
> … 
>
> Notice the weights are set under compute nodes, and under interactive 
> sessions is where it selects from Nodes=node[1-24] to choose what node will 
> complete the interactive job. 

I don't see anything wrong with your configuration and, to be honest, I can't 
figure out what would prevent Weight from operating as expected in this case. 
I was a bit dubious about the Priority parameter on the partition because it 
is not documented (as far as I could find), but it seems it sets both 
PriorityJobFactor and PriorityTier [2], so it shouldn't interfere.

Maybe you could try the man page suggestion for the Weight option [1]?

> If you absolutely want to minimize the number of higher weight nodes 
> allocated to a job (at a cost of higher scheduling overhead), give each node 
> a distinct Weight value and they will be added to the pool of nodes being 
> considered for scheduling individually. 

[1] 
https://github.com/SchedMD/slurm/blob/10b6d5122b77eae417546d5263757d0ed1b2fd31/src/common/read_config.c#L1667
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_Weight
--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io





Re: [slurm-users] auth_munge.so: Incompatible Slurm plugin version (21.08.8)

2023-10-05 Thread Rémi Palancher
Hello Julien,

On Wednesday, October 4, 2023 at 19:04, Julien Rey wrote:

> Hello,
> 
> I did an upgrade of Slurm this week (20.11 to 21.08.8) and while
> everything seems to be working with srun and sbatch commands, here is
> what I get when I try to launch jobs from drmaa library:
> 
> … 
> 
> I don't know if this is a slurm or a drmaa bug. So any advice would be
> welcome.

Slurm daemons, binaries and libraries check at load time that the version of 
the plugins matches their own. The plugin version is bumped on every major 
version of Slurm (e.g. 21.08), hence plugins compiled with 21.08 cannot be 
loaded by programs linked against libslurm from Slurm 20.11.

I suspect that in this case DRMAA was compiled and linked against libslurm 
from Slurm 20.11 and is trying (and failing) to load the newer plugins 
provided with Slurm 21.08.

Did you try to recompile your DRMAA layer against the Slurm 21.08.8 headers 
and library?
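
For instance, you could check which libslurm the DRMAA library is actually
linked against (the path is hypothetical):

  $ ldd /usr/local/lib/libdrmaa.so | grep libslurm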

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] Slurm account coordinator

2023-10-12 Thread Rémi Palancher
Hi Russell,

On Wednesday, October 11, 2023 at 22:54, Steven Hood wrote:

> Russell-
> 
> Thanks for this, How do I assign a user to this level?
> 
> sacctmgr modify user set default=coordinator where default=something

You can set user john as coordinator of account scientists with:

$ sacctmgr add coordinator account=scientists names=john

You can remove john as coordinator of this account with:

$ sacctmgr delete coordinator account=scientists names=john

You can visualize the list of coordinators for all accounts with:

$ sacctmgr show accounts WithCoord

And you can visualize the list of accounts users are coordinating with:

$ sacctmgr show users WithCoord

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io




Re: [slurm-users] Configure a user as "admin" only in his/her account

2023-10-18 Thread Rémi Palancher
Hello,

On Wednesday, October 18, 2023 at 12:29, Gestió Servidors wrote:


> Hello,
> 
> I would like to know if it is possible to configure a user as “admin” only for 
> his/her account. For example, in my accounting tree I have an account called 
> “students” with users “student-1”, “student-2” and so on. In this account, 
> there is a user called “teacher” that must have privileges to cancel a job 
> of any of the “students” users. I have read that I can update the 
> “AdminLevel” attribute of a user, but then, this user could cancel jobs of 
> ANY user, right? Or only of the users of his/her parent account?

Right, operators and administrators have permissions on all users.

What you need is the coordinator role:
https://slurm.schedmd.com/user_permissions.html#coord

You can set teacher as coordinator of students account:

# sacctmgr add coordinator account=students names=teacher

Then teacher will have the ability to cancel students' jobs, among other 
things (e.g. set limits on the students' associations). They won't have any 
special privileges on other accounts.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Rémi Palancher
Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should).  This happens only on EL8
> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
> Infiniband/OPA network work without problems.
> 
> Question: Does anyone know how to reliably delay the start of the slurmd
> Systemd service until the Infiniband/OPA network is fully up?
> 
> …

FWIW, after struggling for a while with systemd dependencies to wait for the 
availability of networks and shared filesystems, we ended up writing, for a 
customer, a patch for Slurm that delays slurmd registration (and the start of 
jobs) until NHC passes:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1] 
for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal and 
should be used with caution, but it has been working for years for this 
customer.

[1] 
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528
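
If you prefer to keep trying on the systemd side, one crude workaround is a
drop-in override that blocks slurmd startup until the fabric interface is up,
something along these lines (the interface name and timeout are assumptions):

  # /etc/systemd/system/slurmd.service.d/wait-fabric.conf
  [Service]
  TimeoutStartSec=300
  ExecStartPre=/bin/bash -c 'until ip link show ib0 | grep -q "state UP"; do sleep 2; done'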

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/




Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-11-01 Thread Rémi Palancher
Hello Gérard,

> On 30/10/2023 15:46, Gérard Henry (AMU) wrote:
>> Hello all,
>> …
>> when it fails, sacct gives the following information:
>> JobID         JobName   Elapsed   NCPUS  TotalCPU   CPUTime   ReqMem  MaxRSS     MaxDiskRead  MaxDiskWrite  State       ExitCode
>> ------------  --------  --------  -----  ---------  --------  ------  ---------  -----------  ------------  ----------  --------
>> 8500578       analyse5  00:03:04  60     02:57:58   03:04:00  9M                                            OUT_OF_ME+  0:125
>> 8500578.bat+  batch     00:03:04  16     46:34.302  00:49:04          21465736K  0.23M        0.01M         OUT_OF_ME+  0:125
>> 8500578.0     orted     00:03:05  44     02:11:24   02:15:40          40952K     0.42M        0.03M         COMPLETED   0:0
>>
>> I don't understand why MaxRSS=21G leads to "out of memory" with 16 CPUs
>> and 1500M per CPU (24G in total).

Due to the job accounting sampling interval, tasks whose memory consumption 
increases quickly might not be accurately reported by `sacct`. The default 
JobAcctGatherFrequency is 30 seconds, so your batch step may have reached 
its limit within the 30-second window following the 21 GB measurement.

You can probably retrieve the exact memory consumption from the nodes' 
kernel logs at the time the tasks were killed.
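
For instance (to be run on the compute node that executed the job):

  $ dmesg -T | grep -i 'out of memory'
  $ journalctl -k | grep -i oom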

On 30/10/2023 at 15:53, Gérard Henry wrote:
 > if i try to request just nodes and memory, for instance:
 > #SBATCH -N 2
 > #SBATCH --mem=0
 > to request all memory on a node, and 2 nodes seem sufficient for a
 > program that consumes 100GB, I got this error:
 > sbatch: error: CPU count per node can not be satisfied
 > sbatch: error: Batch job submission failed: Requested node configuration
 > is not available

Do you have MaxMemPerCPU set on the cluster or on the partition? If that value 
is lower than the requested memory per CPU, Slurm increases the CPU count of 
the job to satisfy the memory request, which can then exceed the available 
CPUs and fail the submission with this CPU count error.
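
You can check this with, for instance:

  $ scontrol show config | grep MaxMemPerCPU
  $ scontrol show partition | grep -i maxmem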

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/




Re: [slurm-users] GraceTime is not working, But there is log.

2023-11-08 Thread Rémi Palancher
On 08/11/2023 at 02:28, 김형진 wrote:
> Hello ~
> 
> …
> 
> However, as soon as the base QoS job is created, the large QoS job is 
> immediately canceled without any waiting time.
> 
> 
> But in the slurmctld log, there is a grace time log.
> 
> [2023-11-02T11:37:36.589] debug:  setting 3600 sec preemption grace time 
> for JobId=153 to reclaim resources for JobId=154
> 
> 
> Could you help me understand what might be going wrong?

Note that, by default, Slurm sends the SIGTERM signal to slurmstepd's 
immediate children (which might be gpu_burn in your case) at _the beginning_ 
of the GraceTime, to notify them of the approaching termination.

If the processes react to SIGTERM by terminating, which is generally the case, 
you may get the impression that GraceTime is not honored.

To benefit from the GraceTime, your program must either trap SIGTERM with a 
signal handler, or you must enable the send_user_signal flag in 
PreemptParameters and submit your job with --signal to request another signal.
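
A rough, untested sketch of a batch script trapping the signal (everything
here is illustrative; with send_user_signal enabled, --signal requests the
signal delivered at preemption time, and B: restricts it to the batch shell):

  #!/bin/bash
  #SBATCH --qos=large
  #SBATCH --signal=B:USR1@60

  checkpoint() {
      echo "Preemption notice received, saving state..."
      # write your checkpoint here, you have up to GraceTime seconds
  }
  trap checkpoint USR1

  ./gpu_burn 3600 &    # run the payload in the background so bash can run the trap
  pid=$!
  wait $pid            # returns early when a trapped signal arrives...
  wait $pid            # ...so wait again for the payload to actually finish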

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/




Re: [slurm-users] Graphing job metrics

2017-11-14 Thread Rémi Palancher

Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:

> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it.  I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
>
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups.  The resolution for the
> metrics is defined by the retention interval that you specify in graphite.  In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.


FWIW, we wrote a collectd [1] plugin at EDF some time ago that does basically 
the same thing, i.e. it explores the cgroups to get CPU/memory metrics out of 
jobs' processes. The code is here:


  https://github.com/collectd/collectd/pull/1198

You then gain all of collectd's flexibility in terms of metrics processing 
and backends (Graphite, RRD, InfluxDB, and so on).


We also wrote a tiny web interface to visualize the metrics. One can 
find out more by searching 'jobmetrics' in the following slides:


  https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to steal the thread. Please forgive 
me if you take it the wrong way.


Best,
Rémi

[1] https://collectd.org/



[slurm-users] Announcing Slurm-web v3.0.0, open source web dashboard for Slurm

2024-05-13 Thread Rémi Palancher via slurm-users
Hello Slurm users,

Some of you may find interest in the new major version of Slurm-web v3.0.0, an 
open source web dashboard for Slurm: https://slurm-web.com

Slurm-web provides a reactive & responsive web interface to track jobs with 
intuitive insights and advanced visualizations to monitor the status of HPC 
supercomputers in your organization. The software is released under GPLv3 [1].

This new version is based on official Slurm REST API slurmrestd and adopts 
modern web technologies to provide many features:

- Instant jobs filtering and sorting
- Live jobs status update
- Advanced visualization of node status with racking topology
- Intuitive visualization of QOS and advanced reservations
- Multi-clusters support
- LDAP authentication
- Advanced RBAC permissions management
- Transparent caching

For the next releases, a roadmap has been published with many feature ideas [2].

Quick start guide to install: 
http://docs.rackslab.io/slurm-web/install/quickstart.html

RPM and deb packages are published for easy installation and upgrade on the 
most popular Linux distributions.

I hope you will like it!

[1] https://github.com/rackslab/Slurm-web
[2] https://slurm-web.com/roadmap/

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com