Re: [slurm-users] How to limit # of execution slots for a given node
Le jeudi 6 janvier 2022 à 22:39, David Henkemeyer a écrit :

> All,
>
> When my team used PBS, we had several nodes that had a TON of CPUs, so many, in fact, that we ended up setting np to a smaller value, in order to not starve the system of memory.
>
> What is the best way to do this with Slurm? I tried modifying # of CPUs in the slurm.conf file, but I noticed that Slurm enforces that "CPUs" is equal to Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore. This left me with having to "fool" Slurm into thinking there were either fewer ThreadsPerCore, fewer CoresPerSocket, or fewer SocketsPerBoard. This is a less than ideal solution, it seems to me. At least, it left me feeling like there has to be a better way.

I'm not sure you can lie to Slurm about the real number of CPUs on the nodes.

If you want to prevent Slurm from allocating more than n CPUs below the total number of CPUs of these nodes, I guess one solution is to use MaxCPUsPerNode=n at the partition level.

You can also mask "system" CPUs with CpuSpecList at the node level. The latter is better if you need fine-grained control over the exact list of reserved CPUs with regard to NUMA topology or whatever.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
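To illustrate the two options above, a hedged slurm.conf sketch — node names, partition name, CPU counts and CPU IDs are all made up for the example:

```ini
# Option 1: cap allocations on the partition, regardless of node size
PartitionName=fat Nodes=bignode[01-04] MaxCPUsPerNode=64 State=UP

# Option 2: reserve specific CPU IDs for the system on a given node,
# e.g. to keep selected cores (one per NUMA domain here) out of Slurm's hands
NodeName=bignode01 CPUs=128 CpuSpecList=0,32,64,96 RealMemory=512000
```

With CpuSpecList, jobs are simply never scheduled on the listed CPUs, which gives you the fine-grained control mentioned above.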
Re: [slurm-users] Scheduler does not reserve resources
Hi Jérémy,

Le mercredi 12 janvier 2022 à 16:59, Jérémy Lapierre a écrit :

> Hi To all slurm users,
>
> We have the following issue: jobs with highest priority are pending forever with "Resources" reason. More specifically, the jobs pending forever ask for 2 full nodes but all other jobs from other users (running or pending) need only a 1/4 of a node, then pending jobs asking for 1/4 of a node always get allocated and the jobs asking for 2 nodes are pending forever even though the priority is higher than the ones asking for less resources. I hope I'm clear enough, if not please look at page 17 on https://slurm.schedmd.com/SUG14/sched_tutorial.pdf, in our situation an infinite number of jobs will fit before what is job4 in the scheme p. 17 and thus job4 will never be launched.

Backfilling doesn't delay the scheduled start time of higher priority jobs, but at least they must have a scheduled start time. Did you check the start time of your jobs pending with the Resources reason? E.g. with `scontrol show job | grep StartTime`.

Sometimes Slurm is unable to define the start time of a pending job. One typical reason is the absence of a time limit on the running jobs. In this case, Slurm is unable to determine when the running jobs will be over, hence when the next highest priority job can start, and ultimately whether lower priority jobs actually delay higher priority jobs.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] big increase of MaxStepCount?
‐‐‐ Original Message ‐‐‐
Le mercredi 12 janvier 2022 à 18:45, John R Anderson a écrit :

> hello, a user has requested that we set MaxStepCount to "unlimited" or 16million to accommodate some of their desired workflows. i searched around for details about this parameter & don't see a lot, and i reviewed https://bugs.schedmd.com/show_bug.cgi?id=5722
>
> any thoughts on this? can this successfully be applied to a partition or individual nodes only? i wonder about log files exploding or worse...

I think one bottleneck here could be accounting and SlurmDBD, if you are using it. One step is one record in the step table of the SQL database. If you end up with hundreds of millions of records in the SQL table, you might experience weird issues with e.g. archives or sreport.

Mind that Slurm major version upgrades may come with database schema changes, and the migration can take a long time (like several hours) with this order of magnitude of records.

Considering the total number of steps, I suspect this user may also generate a high throughput of steps. At some point, slurmctld might need some specific tuning to handle it gracefully [1].

[1] https://slurm.schedmd.com/high_throughput.html

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
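As a rough way to gauge the current volume before raising the limit, one could count step records directly in the accounting database — a sketch, assuming a cluster named `mycluster` (SlurmDBD names the table after the cluster):

```sql
-- Number of accounted step records for cluster "mycluster"
SELECT COUNT(*) FROM mycluster_step_table;
```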
Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command
Le vendredi 28 janvier 2022 à 06:56, Ratnasamy, Fritz a écrit :

> Hi,
>
> I have a similar issue as described on the following link (https://groups.google.com/g/slurm-users/c/6SnwFV-S_Nk). A machine had some existing local permissions. We have added it as a compute node to our cluster via Slurm. When running an srun interactive session on that server, it would seem that the LDAP groups shadow the local groups.
>
> johndoe@ecolonnelli:~ $ groups
> Faculty_Collab ecolonnelli_access  # Those are LDAP groups
>
> johndoe@ecolonnelli:~ $ groups johndoe
> johndoe : Faculty_Collab projectsbrasil core rais rfb polconnfirms johndoe vpce rfb_all backup_johndoe ecolonnelli_access

The difference between the first and the second command could be the UID used for the resolution. The first command calls the getgroups() syscall using the UID of the shell. The second command resolves johndoe's UID through the nsswitch stack, then looks up the groups of this UID.

Do you have johndoe declared in both the local /etc/passwd and the LDAP directory with different UIDs? Do `id` and `id johndoe` return the same UID?

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] how to allocate high priority to low cpu and memory jobs
--- Original Message ---
Le mardi 25 janvier 2022 à 22:22, a écrit :

> Dear all,
>
> how can I reverse the priority, so that jobs with high cpu and memory have a low priority?
>
> The Priority/Multifactor plugin it is possible to calculate high priority for high cpu and memory jobs.
>
> With PriorityFavorSmall, jobs with a lower cpu number have a high priority, but this only works for cpu, not memory.

Well, there are several options available for this use case, and the best choice mostly depends on your current configuration. In addition to Michael's proposal with partitions, you could also set up a QOS for low-memory jobs, with a high priority and a MaxTRESPerJob limit.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] Limiting srun to a specific partition
Hi Peter,

Le lundi 14 février 2022 à 18:37, Peter Schmidt a écrit :

> slurm newbie here, converting from pbspro. In pbspro there is the capability of limiting interactive jobs (i.e srun) to a specific queue (i.e partition).

Note that in Slurm, srun and interactive jobs are not the same thing. The srun command is for creating steps within jobs (interactive or not), optionally creating a job allocation beforehand if it does not exist. You can run interactive jobs with salloc, and even attach your PTY to a running batch job to interact with it. On the other hand, batch jobs can create steps using the srun command.

I don't know of any native Slurm feature to restrict interactive jobs (to a specific partition or otherwise). However, using the job_submit/lua plugin and a custom Lua script, you might be able to accomplish what you want. It has been discussed here:

https://bugs.schedmd.com/show_bug.cgi?id=3094

Best,

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
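A minimal job_submit.lua sketch of that idea — not taken from the bug report; the "interactive" partition name is hypothetical, and the script relies on interactive allocations having no batch script attached:

```lua
-- job_submit.lua: force jobs submitted without a batch script
-- (i.e. interactive salloc/srun allocations) into one partition.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.script == nil or job_desc.script == '' then
        job_desc.partition = "interactive"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

It is enabled with JobSubmitPlugins=lua in slurm.conf; best tested on a non-production cluster first.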
Re: [slurm-users] Slurm database field for SystemCPU, UserCPU, TotalCPU
Hi Simon,

Le samedi 12 mars 2022 à 00:26, Simon Gao a écrit :

> Hi,
>
> We export SLURM database job_table and step_table to CSV files for data analysis.
>
> Which table fields store CPU data for SystemCPU, UserCPU, TotalCPU?

I think they are the user_[u]sec and sys_[u]sec fields from the cluster's step_table. TotalCPU is not stored: it is computed as the sum of these fields, as you can see here:

https://github.com/SchedMD/slurm/blob/fd6fef3e14a0c6d1484230744289749c0e4b19d0/src/plugins/accounting_storage/mysql/as_mysql_jobacct_process.c#L1063

Best,

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
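To make the computation explicit, a small sketch in Python — the field names mirror the step_table columns mentioned above, but the function itself is illustrative, not part of Slurm:

```python
def total_cpu_seconds(user_sec, user_usec, sys_sec, sys_usec):
    """Recompute TotalCPU from step_table fields: the sum of user and
    system CPU times, each stored as seconds + microseconds."""
    total_usec = (user_sec + sys_sec) * 1_000_000 + user_usec + sys_usec
    return total_usec / 1_000_000

# 120.5s user + 30.25s system => 150.75s TotalCPU
print(total_cpu_seconds(120, 500000, 30, 250000))  # 150.75
```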
Re: [slurm-users] why sacct display wrong username while the UID is right?
Hi,

Le dimanche 13 mars 2022 à 04:59, a écrit :

> Hi all:
>
> […]
>
> So is there any guess about why only sacct display the wrong username?

I guess sacct reports the username as found in the cluster_assoc_table of the SlurmDBD database, linked to the cluster_job_table through the id_assoc field. There might be no NSS resolution involved in this output.

Did the UID of phywht change over time? That would explain why the jobs are associated with this user in the SlurmDBD database.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
[slurm-users] New future and roadmap for Slurm-web
Hi Slurm community,

Slurm-web is an open source web interface for the Slurm workload manager:

http://rackslab.github.io/slurm-web/

The project was born in 2015 (*), it was originally funded by EDF [2] (huge thanks to them!) and it reached a nice and unique feature set with versions 2.x. Unfortunately, the software has suffered in recent years from lowered maintenance and investment.

Today, Slurm-web is being endorsed by Rackslab [3], a small company focused on the development of open source solutions for HPC operations, which becomes its new official maintainer. An ambitious new roadmap has been defined with a long-term vision for this project, starting with version 3.0 coming later this year. In addition to the existing Slurm-web feature set, the following new features are planned:

- Near real-time updates of the dashboard
- Accounting reports and visualization of past jobs
- Built-in metrics about jobs and scheduling
- Job submission and inspection
- Vastly improved Gantt view
- GPGPU support
- QOS, associations and reservations management
- Native RPM/deb packages and containers for easy deployment on most Linux distributions

The software architecture will be reworked with modern, established technologies; it will notably be based on the reference slurmrestd REST API. The source code will remain free, published under GPLv3, in conformity with Rackslab's commitment to the free software community. Our goal is clearly to build the reference open source web interface for all users of Slurm-based HPC clusters.

More details about the roadmap have been published in the project discussions on GitHub:

https://github.com/rackslab/slurm-web/discussions/235

You are more than welcome to discuss it there, ask questions and give comments!

Best regards,

(*) The original announcement [1] can still be found in the archives of this mailing list!
[1] https://groups.google.com/g/slurm-users/c/LiD2Pa8r22A/m/fDHWm5GomJsJ
[2] https://www.edf.fr/en
[3] https://rackslab.io

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] Configuring slurm.conf and using subpartitions
Le mercredi 4 octobre 2023 à 06:03, Kratz, Zach a écrit :

> We use an interactive node that will randomly select from our list of computing nodes to complete the job. We would like to find a way to select from our list of old nodes first, before using the newer ones. We tried using weight and assigned each of the old nodes a lower weight than the new nodes, but in testing the new nodes were still assigned, even if the old nodes were available.

Unless confidential, can you show the Node and Partition configuration lines you have tested unsuccessfully?

> Is there any way to configure this in the line that configures the interactive node in slurm.conf, for example:
>
> PartitionName=interactive-cpu Nodes=node[1-17] weight=10 node[18-24] weight=50

Mind that Weight is a *Node* parameter, to be defined on Node configuration lines [1], not on the Partition line.

Another, less optimal option is to define a default partition with the old nodes and another overlapping partition including the new nodes, which users would need to specify explicitly on job submission to access the new nodes.

[1] https://slurm.schedmd.com/slurm.conf.html#SECTION_NODE-CONFIGURATION

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
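For reference, a sketch of where Weight belongs — node names taken from the question, weights illustrative:

```ini
# Lower Weight means the node is considered first: prefer old nodes
NodeName=node[1-17]  Weight=10
NodeName=node[18-24] Weight=50
# The partition only lists the nodes; Weight is not a valid parameter here
PartitionName=interactive-cpu Nodes=node[1-24] State=UP
```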
Re: [slurm-users] Response to Rémi Palancher about Configuring slurm.conf and using subpartitions
--- Original Message ---
Le mercredi 4 octobre 2023 à 17:39, Kratz, Zach a écrit :

> Thank you for your response,
>
> Just to clarify, we do specify the node weight in the node setting lines. I was just wondering if there was a way to be more detailed in our weight assignments.
>
> Here is our configuration right now:
>
> …
>
> Notice the weights are set under compute nodes, and under interactive sessions is where it selects from Nodes=node[1-24] to choose what node will complete the interactive job.

I don't see anything wrong with your configuration and, to be honest, I can't figure out what would prevent Weight from operating as expected in this case. I was a bit dubious about the Priority on the partition because it is not documented (as far as I could find), but it seems it sets both PriorityJobFactor and PriorityTier [1], so it shouldn't interfere.

Maybe you could try the manpage's proposal for the Weight option [2]?

> If you absolutely want to minimize the number of higher weight nodes allocated to a job (at a cost of higher scheduling overhead), give each node a distinct Weight value and they will be added to the pool of nodes being considered for scheduling individually.

[1] https://github.com/SchedMD/slurm/blob/10b6d5122b77eae417546d5263757d0ed1b2fd31/src/common/read_config.c#L1667
[2] https://slurm.schedmd.com/slurm.conf.html#OPT_Weight

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] auth_munge.so: Incompatible Slurm plugin version (21.08.8)
Hello Julien,

Le mercredi 4 octobre 2023 à 19:04, Julien Rey a écrit :

> Hello,
>
> I did an upgrade of Slurm this week (20.11 to 21.08.8) and while everything seems to be working with srun and sbatch commands, here is what I get when I try to launch jobs from the drmaa library:
>
> …
>
> I don't know if this is a slurm or a drmaa bug. So any advice would be welcome.

Slurm daemons, binaries and libraries check that the version of the plugins matches their own version at load time. The version of the plugins is bumped on every major version of Slurm (e.g. 21.08), hence plugins compiled with 21.08 cannot be loaded by programs linked against libslurm from Slurm 20.11.

I suspect that in this case DRMAA is compiled and linked against libslurm from Slurm 20.11 and is trying (and failing) to load the newer plugins provided with Slurm 21.08. Did you try to recompile your DRMAA layer against the Slurm 21.08.8 headers and library?

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] Slurm account coordinator
Hi Russell,

Le mercredi 11 octobre 2023 à 22:54, Steven Hood a écrit :

> Russell-
>
> Thanks for this. How do I assign a user to this level?
>
> sacctmgr modify user set default=coordinator where default=something

You can set user john as coordinator of account scientists with:

$ sacctmgr add coordinator account=scientists names=john

You can remove john as coordinator of this account with:

$ sacctmgr delete coordinator account=scientists names=john

You can visualize the list of coordinators for all accounts with:

$ sacctmgr show accounts WithCoord

And you can visualize the list of accounts users are coordinating with:

$ sacctmgr show users WithCoord

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] Configure a user as "admin" only in his/her account
Hello,

Le mercredi 18 octobre 2023 à 12:29, Gestió Servidors a écrit :

> Hello,
>
> I would like to know if it is possible to configure a user as "admin" only for his/her account. For example, in my accounting tree I have an account called "students" with users "student-1", "student-2" and so on. In this account, there is a user called "teacher" that must have privileges to cancel a job of any of the "students" users. I have read that I can update the "AdminLevel" attribute of a user, but then, this user could cancel jobs of ANY users, right? Or only of the users of his/her parent account?

Right, operators and administrators have permissions on all users. What you need is the coordinator role:

https://slurm.schedmd.com/user_permissions.html#coord

You can set teacher as coordinator of the students account:

# sacctmgr add coordinator account=students names=teacher

Then teacher will have the ability to cancel students' jobs, among other things (e.g. set limits on students' associations). He won't have any special privilege on other accounts.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ole,

Le 30/10/2023 à 13:50, Ole Holm Nielsen a écrit :

> I'm fighting this strange scenario where slurmd is started before the Infiniband/OPA network is fully up. The Node Health Check (NHC) executed by slurmd then fails the node (as it should). This happens only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with Infiniband/OPA network work without problems.
>
> Question: Does anyone know how to reliably delay the start of the slurmd Systemd service until the Infiniband/OPA network is fully up?
>
> …

FWIW, after a while struggling with systemd dependencies to wait for the availability of networks and shared filesystems, we ended up writing, for a customer, a patch for Slurm that delays slurmd registration (and job starts) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted [1] for reasons I did not fully explore. This approach is far from your original idea, it is clearly not ideal and should be taken with caution, but it has worked for years for this customer.

[1] https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory
Hello Gérard,

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:

> Hello all,
> …
> when it fails, sacct gives the following information:
>
>        JobID   JobName   Elapsed  NCPUS   TotalCPU   CPUTime  ReqMem     MaxRSS  MaxDiskRead  MaxDiskWrite       State  ExitCode
> 8500578       analyse5  00:03:04     60   02:57:58  03:04:00      9M                                       OUT_OF_ME+     0:125
> 8500578.bat+     batch  00:03:04     16  46:34.302  00:49:04         21465736K        0.23M         0.01M  OUT_OF_ME+     0:125
> 8500578.0        orted  00:03:05     44   02:11:24  02:15:40            40952K        0.42M         0.03M   COMPLETED       0:0
>
> i don't understand why MaxRSS=21M leads to "out of memory" with 16 cpus and 1500M per cpu (24M)

Due to job accounting sampling intervals, tasks whose memory consumption increases quickly might not be properly reported by `sacct`. The default JobAcctGatherFrequency is 30 seconds, so your batch step may have reached its limit in the 30-second time frame following the 21GB measurement. You can probably retrieve the exact memory consumption in the node's kernel logs from when the tasks were killed.

Le 30/10/2023 à 15:53, Gérard Henry a écrit :

> if i try to request just nodes and memory, for instance:
> #SBATCH -N 2
> #SBATCH --mem=0
> to request all memory on a node, and 2 nodes seem sufficient for a program that consumes 100GB, i got this error:
> sbatch: error: CPU count per node can not be satisfied
> sbatch: error: Batch job submission failed: Requested node configuration is not available

Do you have a MaxMemPerCPU set on the cluster or on the partition? If this value is too low, it could make the job fail on the CPU count limit.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
Re: [slurm-users] GraceTime is not working, But there is log.
Le 08/11/2023 à 02:28, 김형진 a écrit :

> Hello ~
>
> …
>
> However, as soon as the base QoS job is created, the large QoS job is immediately canceled without any waiting time.
>
> But in the slurmctld log, there is a grace time log.
>
> [2023-11-02T11:37:36.589] debug: setting 3600 sec preemption grace time for JobId=153 to reclaim resources for JobId=154
>
> Could you help me understand what might be going wrong?

Note that Slurm sends the SIGTERM signal by default to slurmstepd's immediate children (which might be gpu_burn in your case) at _the beginning_ of the GraceTime, to notify them of approaching termination. If the processes react to SIGTERM by terminating, which is generally the case, you may have the impression that GraceTime is not honored.

To benefit from the GraceTime, your program must either trap SIGTERM with a signal handler, or you must enable the send_user_signal PreemptParameters flag and submit your job with --signal and another signal.

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
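A sketch of both options in a batch script — the workload name, signal choice and timing are illustrative, and the --signal line only matters if send_user_signal is enabled in PreemptParameters:

```shell
#!/bin/bash
#SBATCH --time=01:00:00
# Optional: with send_user_signal enabled, receive SIGUSR1 instead of SIGTERM
#SBATCH --signal=B:SIGUSR1@60

cleanup() {
    echo "Preemption notice received, saving state before GraceTime expires..."
    # checkpoint/cleanup here, then exit on your own terms
    exit 0
}
# Trap SIGTERM (default notification at the start of GraceTime) and SIGUSR1
trap cleanup SIGTERM SIGUSR1

srun ./my_program &   # hypothetical workload
wait
```

Without the trap, the default SIGTERM disposition terminates the process immediately, which is exactly the behavior described above.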
Re: [slurm-users] Graphing job metrics
Hi there,

Le 13/11/2017 à 18:18, Nicholas McCollum a écrit :

> Now that there is a slurm-users mailing list, I thought I would share something with the community that I have been working on to see if anyone else is interested in it.
>
> I have a lot of students on my cluster and I really wanted a way to show my users how efficient their jobs are, or let them know that they are wasting resources.
>
> I created a few scripts that leverage Graphite and whisper databases (RRD like) to gather metrics from Slurm jobs running in cgroups. The resolution for the metrics is defined by the retention interval that you specify in graphite. In my case I can store 1 minute metrics for CPU usage and Memory usage for the entire lifetime of a job.

FWIW, at EDF we wrote a collectd [1] plugin some time ago that does basically the same thing, i.e. exploring the cgroups to get CPU/memory metrics out of jobs' processes. The code is here:

https://github.com/collectd/collectd/pull/1198

You then gain all of collectd's flexibility in terms of metrics processing and backends (Graphite, RRD, InfluxDB, and so on).

We also wrote a tiny web interface to visualize the metrics. One can find out more by searching for 'jobmetrics' in the following slides:

https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to steal the thread. Please forgive me if you take it the wrong way.

Best,
Rémi

[1] https://collectd.org/
[slurm-users] Announcing Slurm-web v3.0.0, open source web dashboard for Slurm
Hello Slurm users,

Some of you may find interest in the new major version of Slurm-web, v3.0.0, an open source web dashboard for Slurm:

https://slurm-web.com

Slurm-web provides a reactive & responsive web interface to track jobs with intuitive insights and advanced visualizations to monitor the status of HPC supercomputers in your organization. The software is released under GPLv3 [1].

This new version is based on the official Slurm REST API (slurmrestd) and adopts modern web technologies to provide many features:

- Instant jobs filtering and sorting
- Live jobs status update
- Advanced visualization of node status with racking topology
- Intuitive visualization of QOS and advanced reservations
- Multi-cluster support
- LDAP authentication
- Advanced RBAC permissions management
- Transparent caching

For the next releases, a roadmap is published with many feature ideas [2].

Quick start guide to install:

http://docs.rackslab.io/slurm-web/install/quickstart.html

RPM and deb packages are published for easy installation and upgrade on most popular Linux distributions.

I hope you will like it!

[1] https://github.com/rackslab/Slurm-web
[2] https://slurm-web.com/roadmap/

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com