[slurm-users] Re: Upgrade node while jobs running
G'day Sid,

On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:
> I've been waiting for nodes to become idle before upgrading them,
> however some jobs take a long time. If I try to remove all the packages
> I assume that kills the slurmstepd program and with it the job.

Are you looking to do a Slurm upgrade, an OS upgrade, or both?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
> I have 3500+ GPU cores available. You mean each GPU job requires at
> least one CPU? Can't we run a job with just a GPU without any CPUs?

No, Slurm has to launch the batch script on compute node cores, and it then has the job of launching the user's application that will run something on the node that will access the GPU(s). Even with srun directly from a login node there are still processes that have to run on the compute node, and those need at least a core (and some may need more, depending on the application).

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:
> Also, server must be newer than client.

This is the major issue for the OP - the version rule is:

    slurmdbd >= slurmctld >= slurmd and clients

and no more than the permitted skew in versions. Plus, of course, you have to deal with config file compatibility issues between versions.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
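The ordering rule above can be checked mechanically before an upgrade; a small sketch (the version numbers in the usage are illustrative, not from any real cluster, and `sort -V` is the GNU coreutils version sort):

```shell
# Check the Slurm upgrade ordering rule: slurmdbd >= slurmctld >= slurmd.
# Slurm releases are numbered YY.MM, which version-sort handles directly.

ver_ge() {
  # succeeds if version $1 >= version $2
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n1)" = "$1" ]
}

check_order() {
  # usage: check_order <slurmdbd-ver> <slurmctld-ver> <slurmd-ver>
  ver_ge "$1" "$2" && ver_ge "$2" "$3"
}
```

For example, `check_order 23.02 22.05 22.05` succeeds (a newer slurmdbd is fine), while `check_order 22.05 23.02 23.02` fails, because slurmctld must never be newer than slurmdbd.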
[slurm-users] Re: Building Slurm debian package vs building from source
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:
> A simple example is when you have nodes with and without GPUs. You can
> build slurmd packages without for those nodes and with for the ones
> that have them.

FWIW we have both GPU and non-GPU nodes, but we use the same RPMs we build on both (they all boot the same SLES15 OS image, though).

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Location of Slurm source packages?
Hi Jeff!

On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:
> I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
> packages. I now want to install pyxis but it says I need the Slurm
> sources. In Ubuntu 22.04, is there a package that has the source code?
> How do I download the sources I need from GitHub?

You shouldn't need GitHub; this should give you what you are after (especially the "Download slurm-wlm" section at the end):

https://packages.ubuntu.com/source/jammy/slurm-wlm

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
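On the Ubuntu box itself the same source package can be fetched with apt, provided the deb-src lines are enabled; a sketch using the package name from the URL above:

```shell
# Fetch and unpack the slurm-wlm source package into the current
# directory (needs deb-src entries in /etc/apt/sources.list and the
# dpkg-dev package installed).
sudo apt-get update
apt-get source slurm-wlm

# Optionally pull in everything needed to build it:
sudo apt-get build-dep slurm-wlm
```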
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:
> Fixed with: [...]
>
> Thanks and sorry for the noise as I really missed this detail :)

So glad it helped! Best of luck with this work.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:
> Any clues about "elf_aarch64" and "aarch64elf" mismatch?

As I mentioned, I think this is coming from the FreeBSD patching that's being done to the upstream Slurm sources; specifically, it looks like elf_aarch64 is being injected here:

/usr/bin/sed -i.bak -e 's|"/proc|"/compat/linux/proc|g' \
    -e 's|(/proc)|(/compat/linux/proc)|g' \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmstepd/req.c

/usr/bin/find \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/api \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/plugins/openapi \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sacctmgr \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sackd \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scontrol \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrontab \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrun \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmctld \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmd \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/squeue \
    -name Makefile.in | /usr/bin/xargs /usr/bin/sed -i.bak -e 's|-r -o|-r -m elf_aarch64 -o|'

So I guess that will need to be fixed to match what FreeBSD supports. I don't think this is a Slurm issue from what I see there.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
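If the FreeBSD linker wants a different emulation name, one sketch of a fix is to rewrite the name the ports patch injects. The replacement name `aarch64elf` here is an assumption - the right value is whatever `ld -V` on that system lists:

```shell
# Rewrite the injected '-m elf_aarch64' to the emulation name this ld
# actually supports. 'aarch64elf' is an assumption - check 'ld -V'.
fix_emulation() {
  sed -e 's|-r -m elf_aarch64 -o|-r -m aarch64elf -o|'
}
```

Applied to the generated Makefiles (for example via the same `find ... | xargs` pipeline the port already uses) this would leave the link command with an emulation the BSD linker recognises.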
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:
> Any clues?
>
> ld: error: unknown emulation: elf_aarch64

All I can think is that your ld doesn't like elf_aarch64; from the log you're posting it looks like that's being injected by the FreeBSD ports system. Looking at the man page for ld on Linux it says:

    -m emulation
        Emulate the emulation linker. You can list the available
        emulations with the --verbose or -V options.

So I'd guess you'd need to look at what that version of ld supports and then update the ports system to match. Good luck!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
> In our case, that node has been removed from the cluster and cannot be
> added back right now (it is being used for some other work). What can
> we do in such a case?

Mark the node as "DOWN" in Slurm; this is what we do when we get jobs caught in this state (and there's nothing else on the node, for our shared nodes).

Best of luck!
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
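For reference, a sketch of the commands involved (node name and reason are placeholders):

```shell
# Mark the node down so slurmctld stops waiting on the stuck
# "completing" job. Node name and reason are placeholders.
scontrol update NodeName=node001 State=DOWN Reason="stuck completing, node removed"

# Later, when the node is back in service:
scontrol update NodeName=node001 State=RESUME
```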
[slurm-users] Re: Is SWAP memory mandatory for SLURM
On 3/3/24 23:04, John Joseph via slurm-users wrote:
> Is SWAP a mandatory requirement?

All our compute nodes are diskless, so no swap on them.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo
Hi Robert,

On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
> We switched over from using systemctl for tmp.mount and changed to
> zram, e.g.:
>
> modprobe zram
> echo 20GB > /sys/block/zram0/disksize
> mkfs.xfs /dev/zram0
> mount -o discard /dev/zram0 /tmp
> [...]
> [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied

Where do you set the permissions on /tmp? What do you set them to?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
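The likely gotcha: a freshly made XFS filesystem mounts root-owned with mode 0755, so the usual world-writable sticky-bit /tmp permissions have to be restored after every mount. A sketch (mode 1777 is the standard /tmp convention; the directory is parameterised so the logic is easy to test outside /tmp):

```shell
# After mounting the zram device on /tmp, restore the standard
# world-writable sticky-bit permissions (mode 1777) that things like
# XAUTHORITY file creation depend on.
fix_tmp_perms() {
  chmod 1777 "$1"
}
```

In the sequence quoted above this would run right after the `mount`, as `fix_tmp_perms /tmp`.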
Re: [slurm-users] sacct --name --status filtering
On 1/10/24 19:39, Drucker, Daniel wrote:
> What am I misunderstanding about how sacct filtering works here? I
> would have expected the second command to show the exact same results
> as the first.

You need to specify --end now for this to work as expected. From the man page:

    WITHOUT --jobs AND WITH --state specified:
        --starttime defaults to Now.
        --endtime defaults to --starttime and to Now if --starttime is
        not specified.

Eg:

> sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name bash
JobID           JobName      State    Elapsed
------------ ---------- ---------- ----------
570741             bash  COMPLETED   00:00:02
570742             bash  COMPLETED   00:00:02
570743             bash     FAILED   00:00:01
570744             bash     FAILED   00:00:01
570745             bash     FAILED   00:00:01
570746             bash  COMPLETED   00:00:02
570747             bash  COMPLETED   00:00:02
570748             bash  COMPLETED   00:00:02

> sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name bash --state COMPLETED
JobID           JobName      State    Elapsed
------------ ---------- ---------- ----------

> sacct --starttime $(date -d "7 days ago" +"%Y-%m-%d") -X --format JobID,JobName,State,Elapsed --name bash --state COMPLETED --end now
JobID           JobName      State    Elapsed
------------ ---------- ---------- ----------
570741             bash  COMPLETED   00:00:02
570742             bash  COMPLETED   00:00:02
570746             bash  COMPLETED   00:00:02
570747             bash  COMPLETED   00:00:02
570748             bash  COMPLETED   00:00:02

Hope this helps!
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] parastation (mpi)
On 11/24/23 06:16, Heckes, Frank wrote:
> My colleagues are using this toolchain on the Jülich cluster
> (especially Juwels). My question is whether these eb files can be
> shared? I would be interested especially in the ones using NVHPC as
> core module.

If Jülich developed that toolchain then I think you'd need to ask them whether they are agreeable to sharing them.

> Does anyone know whether ParaStation MPI is still an active project,
> because the GitHub doesn't show so many recent changes?

There are a number of different repos under that umbrella, and whilst psmpi does look active it seems the psmgmt one has had more commits recently. So it does look active to me.

https://github.com/ParaStation/

All the best.
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] SLURM , maximum scalable instance is which one
On 10/29/23 03:13, John Joseph wrote:
> I'd like to know what is the largest scaled-up instance of Slurm so
> far.

Cori (which we retired mid-year) had ~12,000 compute nodes, in case that helps.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
On 10/24/23 12:39, Tim Schneider wrote:
> Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME",
> the node goes into "mix@" state (not drain), but no new jobs get
> scheduled until the node reboots. Essentially I get draining behavior,
> even though the node's state is not "drain". Note that this behavior is
> caused by "nextstate=RESUME"; if I leave that away, jobs get scheduled
> as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our nodes are managed) we would build new images with the latest software stack, test them on a separate test system and then, once happy, bring them over to the production system and do an

    scontrol reboot ASAP nextstate=resume reason=... $NODES

to ensure that from that point onwards no new jobs would start in the old software configuration, only the new one.

Also, slurmctld would know that these nodes are due to come back "ResumeTimeout" seconds after the reboot is issued, and so could plan for them as part of scheduling large jobs, rather than thinking there was no way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)
On 10/16/23 08:22, Groner, Rob wrote:
> It is my understanding that it is a different issue than pmix.

That's my understanding too. The PMIx issue wasn't in Slurm, it was in the PMIx code that Slurm was linked to. This CVE is for Slurm itself.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Fairshare: Penalising unused memory rather than used memory?
On 10/11/23 07:27, Cristian Huza wrote:
> I recall there was a built-in tool named seff (slurm efficiency), not
> sure if it is still maintained.

"seff" is in the Slurm sources in the contribs/seff directory; if you're building RPMs from them then it's in the "slurm-contribs" RPM.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Site factor plugin example?
On 10/13/23 10:10, Angel de Vicente wrote:
> But, in any case, I would still be interested in a site factor plugin
> example, because I might revisit this in the future.

I don't know if you saw, but there is a skeleton example in the Slurm sources:

    src/plugins/site_factor/none

Not sure if that helps?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Unconfigured GPUs being allocated
On 7/14/23 1:10 pm, Wilson, Steven M wrote:
> It's not so much whether a job may or may not access the GPU but rather
> which GPU(s) is(are) included in $CUDA_VISIBLE_DEVICES. That is what
> controls what our CUDA jobs can see and therefore use (within any
> cgroups constraints, of course). In my case, Slurm is sometimes setting
> $CUDA_VISIBLE_DEVICES to a GPU that is not in the Slurm configuration
> because it is intended only for driving the display and not GPU
> computations.

Sorry I didn't see this before! Yeah, that does sound different, I wouldn't expect that. :-(

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] slurmdbd database usage
On 8/2/23 2:30 pm, Sandor wrote:
> I am looking to track accounting and job data. Slurm requires the use
> of MySQL or MariaDB. Has anyone created the needed tables within
> PostgreSQL then had slurmdbd write to it? Any problems?

From memory (and confirmed by git), support for PostgreSQL was removed from Slurm way back in 2013, before the 14.03 release (the first one using dates as version numbers).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Unconfigured GPUs being allocated
On 7/14/23 10:20 am, Wilson, Steven M wrote:
> I upgraded Slurm to 23.02.3 but I'm still running into the same
> problem. Unconfigured GPUs (those absent from gres.conf and slurm.conf)
> are still being made available to jobs so we end up with compute jobs
> being run on GPUs which should only be used

I think this is expected - it's not that Slurm is making them available, it's that it's unaware of them and so doesn't control them in the way it does for the GPUs it does know about. So you get the default behaviour (any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a way to prevent that access via cgroups games or similar.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
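Short of cgroup-level enforcement, one lighter-weight sketch is a TaskProlog helper that hides the display GPU from CUDA applications. Lines a TaskProlog prints in the form `export NAME=value` are applied to the task's environment; the display GPU index (0) is an assumption, and of course this only affects programs that honour CUDA_VISIBLE_DEVICES - it is not the cgroup enforcement mentioned above:

```shell
# Sketch of a TaskProlog fragment: remove a display-only GPU from
# CUDA_VISIBLE_DEVICES so CUDA jobs never select it.
# The display GPU index (0) is an assumption for illustration.
DISPLAY_GPU=0

strip_gpu() {
  # remove index $2 from the comma-separated list $1
  echo "$1" | tr ',' '\n' | grep -v -x "$2" | paste -sd, -
}

if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
  # A TaskProlog's "export NAME=value" stdout lines update the task env.
  echo "export CUDA_VISIBLE_DEVICES=$(strip_gpu "$CUDA_VISIBLE_DEVICES" "$DISPLAY_GPU")"
fi
```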
Re: [slurm-users] Trying to update from slurm 19.05 to slurm 23.02 but I can't figure out how to allow users to reboot nodes...
On 6/6/23 1:33 pm, Heinz, Michael wrote:
> I've gone through the man pages for slurm.conf but I can't find
> anything about how to define who the admins are? Is there still a way
> to do this with slurm or has the ability been removed?

Looks like that was disabled over 3 years ago.

commit dd111a52bf23d79efcfe9d5688e15cbc768bb22b
Author: Brian Christiansen
Date:   Fri Jan 31 14:24:40 2020 -0700

    Disable sbatch, salloc, srun --reboot for non-admins

    Bug 7767

That bug is private, it seems.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Temporary Stop User Submission
On 5/25/23 4:16 pm, Markuske, William wrote:
> I have a badly behaving user that I need to speak with and want to
> temporarily disable their ability to submit jobs. I know I can change
> their account settings to stop them. Is there another way to set a
> block on a specific username that I can lift later without removing
> the user/account associations?

There are many ways to do this; our way is to set their QOS to one called "batchdisable", and that QOS has "MaxJobsPerUser=0" set on it. One of the benefits of that is it's easy to see everyone who's been blocked from submitting jobs.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
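A sketch of that setup with sacctmgr (the QOS and user names are placeholders, and the exact option spelling is worth checking against your sacctmgr version):

```shell
# One-time setup: a QOS under which no jobs can run.
sacctmgr add qos batchdisable set MaxJobsPerUser=0

# Block the user; their user/account associations stay intact.
sacctmgr modify user where name=baduser set qos=batchdisable

# List everyone currently blocked:
sacctmgr show assoc where qos=batchdisable format=user,account,qos

# Lift the block later by restoring their normal QOS list:
sacctmgr modify user where name=baduser set qos=normal
```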
Re: [slurm-users] Usage gathering for GPUs
On 5/24/23 11:39 am, Fulton, Ben wrote:
> Hi,

Hi Ben,

> The release notes for 23.02 say "Added usage gathering for gpu/nvml
> (Nvidia) and gpu/rsmi (AMD) plugins". How would I go about enabling
> this?

I can only comment on the NVIDIA side (as those are the GPUs we have), but for that you need Slurm built with NVML support and running with "AutoDetect=NVML" in gres.conf, and then that information is stored in slurmdbd as part of the TRES usage data.

For example, to grab a job step for a test code I ran the other day:

csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
gres/gpumem=493120K
gres/gpuutil=76

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] [EXTERNAL] Re: Question about PMIX ERROR messages being emitted by some child of srun process
On 5/23/23 10:33 am, Pritchard Jr., Howard wrote:
> Thanks Christopher,

No worries!

> This doesn't seem to be related to Open MPI at all except that for our
> 5.0.0 and newer one has to use PMix to talk to the job launcher. I
> built MPICH 4.1 on Perlmutter using the --with-pmix option and see a
> similar message from srun --mpi=pmix

That's right, these messages are coming from PMIx code rather than MPI.

> I too noticed that if I set PMIX_DEBUG=1 the chatter from srun stops.

Yeah, it looks like setting PMIX_DEBUG to anything (I tried "hello") stops these messages from being emitted. Slurm RPMs with that patch will go on to Perlmutter in the Thursday maintenance.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Question about PMIX ERROR messages being emitted by some child of srun process
Hi Tommi, Howard,

On 5/22/23 12:16 am, Tommi Tervo wrote:
> 23.02.2 contains a PMIx permission regression; it may be worth checking
> whether that's the case?

I confirmed I could replicate the UNPACK-INADEQUATE-SPACE messages Howard is seeing on a test system, so I tried that patch on that same system, without any change. :-(

Looking at the PMIx code base the messages appear to come from that code (the triggers are in src/mca/bfrops/), and I saw I could set PMIX_DEBUG=verbose to get more info on the problem, but when I set that these messages go away entirely. :-/

Very odd.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] From an initial installation cannot start slurmctld with a slurmdbd running
Hi Lawrence,

On 5/17/23 3:26 pm, Sorrillo, Lawrence wrote:
> Here is the error I get:
>
> slurmctld: fatal: Can not recover assoc_usage state, incompatible version, got 9728 need >= 8704 <= 9216
>
> The Slurm version is: 20.11.9

That error appears when slurmctld is loading usage data from an on-disk cache of data (the "assoc_usage" file) - the function that throws that error is called here:

    /* Now load the usage from a flat file since it isn't kept in the database */
    load_assoc_usage();

It's telling you that the data file was written with a version of Slurm ahead of where it's at. With my little cheat sheet:

> ./slurmver
SLURM_23_02_PROTOCOL_VERSION = 9984
SLURM_22_05_PROTOCOL_VERSION = 9728
SLURM_21_08_PROTOCOL_VERSION = 9472
SLURM_20_11_PROTOCOL_VERSION = 9216
SLURM_20_02_PROTOCOL_VERSION = 8960
SLURM_19_05_PROTOCOL_VERSION = 8704
SLURM_18_08_PROTOCOL_VERSION = 8448

That tells us the data file was written by Slurm 22.05.x, so my guess is that version was tested and the "assoc_usage" file that's being read here wasn't cleaned up afterwards.

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
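The numbers in the cheat sheet above step by 256 per release - the release number effectively lives in the high byte of the protocol version. A small sketch that decodes one (the byte split is just arithmetic on the values in the table above):

```shell
# Decode a Slurm protocol version into its high and low bytes.
# Per the cheat sheet: 9728 = 38*256 (22.05), 9216 = 36*256 (20.11),
# each release bumping the high byte by one.
proto_major() { echo $(( $1 / 256 )); }
proto_minor() { echo $(( $1 % 256 )); }
```

So the "got 9728 need >= 8704 <= 9216" in the fatal message reads as: the file is high-byte 38 (22.05), but this slurmctld only accepts high bytes 34 (19.05) through 36 (20.11).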
Re: [slurm-users] PreemptExemptTime
On 3/7/23 6:46 am, Groner, Rob wrote:
> Our global settings are PreemptMode=SUSPEND,GANG and
> PreemptType=preempt/partition_prio. We have a high priority partition
> that nothing should ever preempt, and an open partition that is always
> preemptable. In between is a burst partition. It can be preempted if
> the high priority partition needs the resources. That's the partition
> we'd like to guarantee a 1 hour run time on. Looking at the sacctmgr
> man page, it gives this info on QOS

Just a quick comment: here you're talking about both partitions and QOS's with respect to preemption; I think for this you need to pick just one of those options and only use those configs. For instance, we just use QOS's for preemption, and our exempt time works in that case.

Hope this helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
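A sketch of the QOS-only variant (QOS names are placeholders; PreemptExemptTime is the knob that guarantees the one-hour window):

```shell
# slurm.conf side: preempt on QOS rather than partition priority.
#   PreemptType=preempt/qos
#   PreemptMode=SUSPEND,GANG

# QOS side: jobs under the "burst" QOS get a guaranteed hour before
# they become preemptable, and only the "high" QOS may preempt them.
sacctmgr modify qos burst set PreemptExemptTime=01:00:00
sacctmgr modify qos high set Preempt=burst
```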
Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results
On 2/10/23 11:06 am, Analabha Roy wrote:
> I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP
> in my cluster.

If you're looking to try checkpointing MPI applications, you may want to experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin for DMTCP here:

https://github.com/mpickpt/mana

We (NERSC) are collaborating with the developers and it is installed on Cori (our older Cray system) for people to experiment with. The documentation for it may be useful to others who'd like to try it out - it's got a nice description of how it works too, which even I as a non-programmer can understand:

https://docs.nersc.gov/development/checkpoint-restart/mana/

Pay special attention to the caveats in our docs though! I've not used it myself, though I'm peripherally involved to give advice on system-related issues.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm - UnkillableStepProgram
On 1/19/23 5:01 am, Stefan Staeglich wrote:
> Hi,

Hiya,

> I'm wondering where the UnkillableStepProgram is actually executed.
> According to Mike it has to be available on every one of the compute
> nodes. This makes sense only if it is executed there.

That's right, it's only executed on compute nodes.

> But the slurm.conf man page of 21.08.x states:
>
> UnkillableStepProgram
>     Must be executable by user SlurmUser. The file must be accessible
>     by the primary and backup control machines.
>
> So I would expect it's executed on the controller node.

That's strange; my slurm.conf man page from a system still running 21.08 says:

    UNKILLABLE STEP PROGRAM SCRIPT
        This program can be used to take special actions to clean up the
        unkillable processes and/or notify system administrators. The
        program will be run as SlurmdUser (usually "root") on the
        compute node where UnkillableStepTimeout was triggered.

Ah, I see, there's a later "FILE AND DIRECTORY PERMISSIONS" part which has the text that you've found - that part's wrong! :-)

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
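For a sense of what such a program might do on the compute node, here is a minimal sketch. The log path is a placeholder, and the assumption that slurmd exports SLURM_JOB_ID to this script should be verified on your version:

```shell
#!/bin/sh
# Sketch of an UnkillableStepProgram: record which job got stuck and
# which processes are in uninterruptible sleep (state D), the usual
# culprits. LOGFILE is a placeholder path.
LOGFILE=${LOGFILE:-/var/log/slurm/unkillable.log}

unkillable_report() {
  {
    echo "$(date -Is) host=$(hostname) job=${SLURM_JOB_ID:-unknown}: unkillable step"
    ps -eo pid,stat,comm | awk '$2 ~ /D/'
  } >> "$LOGFILE"
}

# Only act when slurmd actually invokes us for a job.
if [ -n "$SLURM_JOB_ID" ]; then
  unkillable_report
fi
```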
Re: [slurm-users] Interactive jobs using "srun --pty bash" and MPI
On 11/2/22 4:45 pm, Juergen Salk wrote:
> However, instead of using `srun --pty bash´ for launching interactive
> jobs, it is now recommended to use `salloc´ and have
> `LaunchParameters=use_interactive_step´ set in slurm.conf.

+1 on that, this is what we've been using since it landed.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Prolog and job_submit
On 10/31/22 5:46 am, Davide DelVento wrote:
> Thanks for helping me find workarounds.

No worries!

>> My only other thought is that you might be able to use node features &
>> job constraints to communicate this without the user realising.
>
> I am not sure I understand this approach.

I was just trying to think of things that could get into the Prolog that runs as root that you could use as a signal to it. Job constraints seemed the most reasonable choice.

> Are you saying that if the job_submit.lua can't directly add an
> environmental variable that the prolog can see, it can add the
> constraint which will become an environmental variable that the prolog
> can see?

That's correct - the difference being that Slurm, not the user, is in control of its presence and the possible values it can have (as it's constrained by what you've chosen for the name of the node feature).

> Would that work if that feature is available on all nodes?

Yes, that should work just fine, I believe.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?
On 8/3/22 8:37 am, Phil Chiu wrote:
> Therefore my problem is this - "Reboot all nodes, permitting N nodes to
> be rebooting simultaneously."

I think currently the only way to do that would be to have a script that does:

* issue the `scontrol reboot ASAP nextstate=resume [...]` for 3 nodes
* wait for 1 to come back to being online
* issue an `scontrol reboot` for another node
* wait for 1 more to come back
* lather, rinse, repeat

This does assume you've got your nodes configured to come back cleanly on a reboot with slurmd up and no manual intervention required (which is what we do).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
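The steps above can be sketched as a small shell loop. The sinfo state name used here to count nodes with a pending or active reboot is an assumption - check what your sinfo reports while a reboot is outstanding:

```shell
# Sketch: reboot every node given as an argument, with at most $1
# rebooting at any one time. Assumes nodes return to service on their
# own (slurmd starts at boot), and that "sinfo -t reboot" lists nodes
# with a pending/active reboot - verify that state name on your version.
rolling_reboot() {
  max_down=$1; shift
  for node in "$@"; do
    while [ "$(sinfo -h -N -t reboot -o '%N' | wc -l)" -ge "$max_down" ]; do
      sleep 60   # wait for a node to come back before rebooting another
    done
    scontrol reboot ASAP nextstate=resume reason=maintenance "$node"
  done
}
```

Called as, say, `rolling_reboot 3 nid0000{1..8}`, it queues reboots while never letting more than three nodes be down at once.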
Re: [slurm-users] Rate-limiting sbatch and srun
On 7/18/22 3:45 pm, gphipps wrote:
> Every so often one of our users accidentally writes a "fork-bomb" that
> submits thousands of sbatch and srun requests per second. It is a giant
> DDOS attack on our scheduler. Is there a way of rate limiting these
> requests before they reach the daemon?

Yes there is, you can use the Slurm cli_filter to do this:

https://slurm.schedmd.com/cli_filter_plugins.html

If you use the lua plugin you can write what you need in that; though of course it would need careful thought, as you would need somewhere to store state on the node (writeable by users), a way of counting the frequency of the RPCs, and a way of introducing increasing delays (up to a point) if it's out of control and then decaying that delay time down when the RPCs from that user cease/decrease.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] How do you make --export=NONE the default behavior for our cluster?
On 6/3/22 11:39 am, Ransom, Geoffrey M. wrote:
> Adding "--export=NONE" to the job avoids the problem, but I'm not
> seeing a way to change this default behavior for the whole cluster.

There's an SBATCH_EXPORT environment variable that you could set for users to force that (at $JOB-1 we used to do that).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
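One common way to set that cluster-wide is a profile.d snippet on the login nodes; a sketch (the file path is just a convention):

```shell
# /etc/profile.d/slurm-export.sh (the path is a convention, not required)
# Make sbatch behave as if --export=NONE were given by default; users
# can still override it per-job on the command line.
export SBATCH_EXPORT=NONE
```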
Re: [slurm-users] Rolling upgrade of compute nodes
On 5/29/22 3:09 pm, byron wrote:
> This is the first time I've done an upgrade of slurm and I had been
> hoping to do a rolling upgrade, as opposed to waiting for all the jobs
> to finish on all the compute nodes and then switching across, but I
> don't see how I can do it with this setup. Does anyone have any
> experience of this?

We do rolling upgrades with:

    scontrol reboot ASAP nextstate=resume reason="some-useful-reason" [list-of-nodes]

But you do need to have RebootProgram defined and an appropriate ResumeTimeout set to allow enough time for your node to reboot (and of course your system must be configured to boot into a production-ready state when rebooted, including starting up slurmd).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] upgrading slurm to 20.11
On 5/17/22 12:00 pm, Paul Edmon wrote:
> Database upgrades can also take a while if your database is large.
> Definitely recommend backing up prior to upgrade, as well as running
> slurmdbd -Dv and not the systemd daemon: if the upgrade takes a long
> time, systemd will kill it preemptively due to unresponsiveness, which
> will create all sorts of problems.

+lots to this - it's our SOP when doing upgrades, as it takes hours to do so on a busy system that's been around for a while. We also take routine backups, and then when we're looking to do an upgrade I'll use one of those on a test system to see how it goes.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
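A sketch of that pre-upgrade routine (the database name slurm_acct_db is the usual default and the dump path is a placeholder):

```shell
# Stop the old slurmdbd, snapshot the accounting database, then run the
# new slurmdbd in the foreground so systemd can't kill it while the
# (possibly hours-long) schema conversion runs.
systemctl stop slurmdbd
mysqldump --single-transaction slurm_acct_db > /root/slurm_acct_db.sql

# Foreground with verbose logging; the conversion happens on startup.
slurmdbd -D -vvv
```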
Re: [slurm-users] SLURM: reconfig
On 5/5/22 7:08 am, Mark Dixon wrote:
> I'm confused how this is supposed to be achieved in a configless
> setting, as slurmctld isn't running to distribute the updated files to
> slurmd.

That's exactly what happens with configless mode: slurmd's retrieve their config from the slurmctld, and will grab it again on an "scontrol reconfigure". There's no reason to stop slurmctld for this.

So your slurm.conf should only exist on the slurmctld node - this is how we operate on our latest system.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
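A minimal configless sketch (the controller hostname is a placeholder):

```shell
# slurm.conf, on the slurmctld node only:
#   SlurmctldParameters=enable_configless

# On each compute node, point slurmd at the controller instead of
# shipping it a slurm.conf ("ctld-host" is a placeholder; DNS SRV
# records can replace the flag):
slurmd --conf-server ctld-host

# After editing slurm.conf on the controller, push it everywhere with:
scontrol reconfigure
```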
Re: [slurm-users] SLURM: reconfig
On 5/5/22 5:17 am, Steven Varga wrote:
> Thank you for the quick reply! I know I am pushing my luck here: is it
> possible to modify slurm (src/common/[read_conf.c, node_conf.c],
> src/slurmctld/[read_config.c, ...]) such that the state can be
> maintained dynamically? Or is it cheaper to write a job manager with
> fewer features but supporting dynamic nodes from the ground up?

I said "currently" because it looks like you will be in luck with the next release (though it sounds like it needs a little config). From https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES:

    -- Allow nodes to be dynamically added and removed from the system.
       Configure MaxNodeCount to accommodate nodes created with dynamic
       node registrations (slurmd -Z --conf="") and scontrol.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] SLURM: reconfig
On 5/4/22 7:26 pm, Steven Varga wrote:
> I am wondering what is the best way to handle node changes, such as the
> addition and removal of nodes in SLURM. The excerpts below suggest a
> full restart; can someone confirm this?

You are correct, you need to restart the slurmctld and slurmd daemons at present. See:

https://slurm.schedmd.com/faq.html#add_nodes

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] sbatch - accept jobs above limits
On 2/8/22 11:41 pm, Alexander Block wrote:
> I'm just discussing a familiar case with SchedMD right now (ticket
> 13309). But it seems that it is not possible with Slurm to submit jobs
> that request features/configuration that are not available at the
> moment of submission.

Does --hold not allow that for you?

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] sbatch - accept jobs above limits
On 2/8/22 2:26 pm, z1...@arcor.de wrote:
> These jobs should be accepted, if a suitable node will be active soon.
> For example, these jobs could be in PartitionConfig.

From memory, if you submit jobs with the `--hold` option then you should find they are successfully accepted - I've used that in the past (and just checked that it still works with 20.11.8, assuming nobody has snuck a node with 2TB of RAM in whilst I wasn't looking).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
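A quick sketch of that workflow (script name and job ID are placeholders):

```shell
# Submit in the held state: the request is validated and queued even
# though no matching node exists yet.
sbatch --hold job.sh

# Once the hardware is up and configured, release the job:
scontrol release 12345
```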
Re: [slurm-users] Stopping new jobs but letting old ones end
On 1/31/22 9:25 pm, Brian Andrus wrote:
> touch /etc/nologin
>
> That will prevent new logins.

It's also useful that if you put a message in /etc/nologin then users who are trying to log in will get that message before being denied.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
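For example (the message wording is just a sketch; the path is parameterised here only so the same line is easy to exercise outside /etc):

```shell
# Any text placed in the nologin file is shown to users at login before
# they are denied. NOLOGIN defaults to the real path.
NOLOGIN=${NOLOGIN:-/etc/nologin}

set_nologin() {
  echo "Logins disabled for maintenance, back at 14:00 UTC." > "$NOLOGIN"
}
```

Removing the file (`rm /etc/nologin`) re-enables logins afterwards.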
Re: [slurm-users] Stopping new jobs but letting old ones end
On 1/31/22 9:00 pm, Christopher Samuel wrote: That would basically be the way Thinking further on this a better way would be to mark your partitions down, as it's likely you've got fewer partitions than compute nodes. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Stopping new jobs but letting old ones end
On 1/31/22 4:41 pm, Sid Young wrote: I need to replace a faulty DIMM chip in our login node so I need to stop new jobs being kicked off while letting the old ones end. I thought I would just set all nodes to drain to stop new jobs from being kicked off... That would basically be the way, but is there any reason why compute jobs shouldn't start whilst the login node is down? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Questions about scontrol reconfigure / reconfig
On 1/16/22 7:41 pm, Nicolas Greneche wrote: I add a new compute node in config file so, Nodename becomes : When adding a node you need to restart slurmctld and all the slurmd's as they (currently) can only rebuild their internal structures for this at that time. This is meant to be addressed in a future major Slurm release (can't remember which one sorry). https://slurm.schedmd.com/faq.html#add_nodes All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Error " slurm_receive_msg_and_forward: Zero Bytes were transmitted or received"
On 12/1/21 5:51 am, Gestió Servidors wrote: I can’t synchronize before with “ntpdate” because when I run “ntpdate -s my_NTP_server”, I only received message “ntpdate: no server suitable for synchronization found”… Yeah, you'll need to make sure your NTP infrastructure is working first. There is useful information (including NTP background info) here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ntp_using_ntpd and for chronyd (rather than ntpd): https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/ch-configuring_ntp_using_the_chrony_suite All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] random allocation of resources
On 12/1/21 3:27 pm, Brian Andrus wrote: If you truly want something like this, you could have a wrapper script look at available nodes, pick a random one and set the job to use that node. Alternatively you could have a cron job that adjusted nodes' `Weight` periodically to change which ones Slurm will prefer to use over time (everything else being equal Slurm picks nodes with the lowest weight). Hope this helps! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
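The weight-shuffling idea could be driven from cron with something like this sketch (node names and the weight range are placeholders): each run gives every node a fresh random weight, so the lowest-weight — i.e. preferred — node changes over time.

```shell
# Run periodically from cron to randomise node preference.
for node in node01 node02 node03; do
    scontrol update NodeName="$node" Weight=$((RANDOM % 100 + 1))
done
```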
Re: [slurm-users] Job Preemption Time
On 11/22/21 8:28 pm, Jeherul Islam wrote: Is there any way to configure slurm, that the High Priority job waits for a certain amount of time(say 24 hours), before it preempts the other job? Not quite, but you can set PreemptExemptTime which says how long a job must have run for before it can be considered eligible for preemption. In other words if that's set to 1 hour and there's a low priority job that was submitted 55 minutes ago and a new high priority job comes along it won't be able to preempt it for another 5 minutes. You can set it on a QOS for instance so that different QOS's can have different minimum times. https://slurm.schedmd.com/qos.html#preemption All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
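A sketch of setting that minimum run time, both globally and per QOS (the one-hour value and the QOS name are placeholders):

```shell
# slurm.conf fragment: no job may be preempted in its first hour.
PreemptExemptTime=01:00:00

# Or per QOS, via sacctmgr, so different QOS's get different floors:
#   sacctmgr modify qos normal set PreemptExemptTime=01:00:00
```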
Re: [slurm-users] Can't use cgroups on debian 11 : unable to get parameter 'tasks' for '/sys/fs/cgroup/cpuset/'
On 11/16/21 8:04 am, Arthur Toussaint wrote: I've seen people having those kind of problems, but no one seem to be able to solve it and keep the cgroups Debian Bullseye switched to cgroups v2 by default which Slurm doesn't support yet, you'll need to switch back to the v1 cgroups. The release notes have info on how to do this here: https://www.debian.org/releases/stable/amd64/release-notes/ch-information.en.html#openstack-cgroups Short version is it says you need to add this to the kernel boot params: systemd.unified_cgroup_hierarchy=false systemd.legacy_systemd_cgroup_controller=false The name of that second one looks a bit misleading, it's described in the systemd man page as: https://manpages.debian.org/testing/systemd/systemd.1.en.html > Takes effect if the full unified cgroup hierarchy is not used (see previous option). When specified without an argument or with a true argument, disables the use of "hybrid" cgroup hierarchy (i.e. a cgroups-v2 tree used for systemd, and legacy cgroup hierarchy[10], a.k.a. cgroups-v1, for other controllers), and forces a full "legacy" mode. When specified with a false argument, enables the use of "hybrid" hierarchy. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
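On Debian those kernel parameters would normally go via grub — a sketch of the /etc/default/grub change (the existing "quiet" is a placeholder for whatever is already on the line); run update-grub and reboot afterwards:

```shell
# /etc/default/grub fragment: append the two cgroup-v1 parameters.
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=false systemd.legacy_systemd_cgroup_controller=false"
```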
Re: [slurm-users] Unable to start slurmd service
On 11/16/21 7:07 am, Jaep Emmanuel wrote: > root@ecpsc10:~# scontrol show node ecpsc10 [...] >State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A [...] Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04] This is why the node isn't considered available, as others have already noted you will need to resume the node. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] draining nodes due to failed killing of task?
On 8/7/21 11:47 pm, Adrian Sevcenco wrote: yes, the jobs that are running have a part of file saving if they are killed, saving which depending of the target can get stuck ... i have to think for a way to take a processes snapshot when this happens .. Slurm does let you request a signal a certain amount of time before the job is due to end, you could make your job use that to do the checkpoint in advance of the end of the job so you don't hit this problem. Look at the --signal option in "man sbatch". Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
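A minimal sketch of the --signal approach: ask Slurm for SIGUSR1 (to the batch shell, five minutes before the limit) and trap it to save state before the job is killed. The checkpoint action here — appending to a log file — is a placeholder for real state-saving, and the self-sent signal at the end only simulates what Slurm would send in a real job.

```shell
#!/bin/bash
#SBATCH --signal=B:USR1@300   # SIGUSR1 to the batch shell 300s before walltime

checkpoint() {
    echo "caught USR1, checkpointing" >> checkpoint.log  # placeholder action
}
trap checkpoint USR1

kill -USR1 $$   # Slurm sends this in a real job; self-sent here for the sketch
```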
Re: [slurm-users] Users Logout when job die or complete
Hi Andrea, On 7/9/21 3:50 am, Andrea Carotti wrote: ProctrackType=proctrack/pgid I suspect this is the cause of your problems, my bet is that it is incorrectly identifying the users login processes as being part of the job and thinking it needs to tidy them up in addition to any processes left over from the job. It also seems to be more for BSD systems than Linux. At the very least you'd want: ProctrackType=proctrack/linuxproc Though I'd strongly suggest looking at cgroups for this, see: https://slurm.schedmd.com/slurm.conf.html#OPT_ProctrackType and: https://slurm.schedmd.com/cgroups.html Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
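A sketch of the cgroup-based setup the links above describe (a fragment, not a complete config):

```shell
# slurm.conf fragment: track and confine job processes with cgroups.
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf fragment: actually constrain what jobs can use.
ConstrainCores=yes
ConstrainRAMSpace=yes
```

With proctrack/cgroup only processes inside the job's cgroup are cleaned up at job end, so login sessions are left alone.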
Re: [slurm-users] How to avoid a feature?
On 7/1/21 7:08 am, Brian Andrus wrote: I have a partition where one of the nodes has a node-locked license. That license is not used by everyone that uses the partition. This might be a case for using a reservation on that node with the MaxStartDelay flag to set the maximum amount of time (in minutes) that jobs that need to run in the reservation are willing to wait for a job on the node to clean up and exit. The candidate jobs need to use the --signal flag with the R option to specify how many seconds of warning they would need to clean up before being preempted. If the amount of time they say they need is less than the MaxStartDelay then they are candidates to run on those nodes _outside_ of the reservation, and when the actual work comes along they will get told to get out of the way and, if they fail to, they'll get killed. I presume people have to request a license in Slurm to get sent to that node so you could automatically add that reservation to jobs that request the license. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Exposing only requested CPUs to a job on a given node.
On 7/1/21 3:26 pm, Sid Young wrote: I have exactly the same issue with a user who needs the reported cores to reflect the requested cores. If you find a solution that works please share. :) The number of CPUs in the system vs the number of CPUs you can access are very different things. You can use the "nproc" command to find the number of CPUs you can access. From a software side of things this is why libraries like "hwloc" exist, so you can determine what is accessible in a portable way. https://www.open-mpi.org/projects/hwloc/ It lives on the Open-MPI website, but it doesn't use Open-MPI (Open-MPI uses it). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
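The nproc point is easy to demonstrate: it honours the task's CPU affinity mask, so inside a job it reports the cores Slurm granted, while /proc/cpuinfo still lists every core in the machine.

```shell
# Inside a cpuset/affinity-constrained job these two numbers differ.
visible=$(nproc)
total=$(grep -c '^processor' /proc/cpuinfo)
echo "this task may use $visible of $total CPUs"
```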
Re: [slurm-users] Specify a gpu ID
On 6/4/21 11:04 am, Ahmad Khalifa wrote: Because there are failing GPUs that I'm trying to avoid. Could you remove them from your gres.conf and adjust slurm.conf to match? If you're using cgroups enforcement for devices (ConstrainDevices=yes in cgroup.conf) then that should render them inaccessible to jobs. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
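A sketch of what that looks like (node name, device paths, core lists, and GPU count are all placeholders): list only the healthy GPUs in gres.conf, match the count in slurm.conf, and let cgroup device constraint block everything else.

```shell
# gres.conf fragment - only the working devices:
NodeName=node01 Name=gpu File=/dev/nvidia0 Cores=0-15
NodeName=node01 Name=gpu File=/dev/nvidia1 Cores=16-31

# slurm.conf fragment - Gres count matches gres.conf:
NodeName=node01 Gres=gpu:2

# cgroup.conf fragment - jobs only see devices they were allocated:
ConstrainDevices=yes
```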
Re: [slurm-users] DMTCP or MANA with Slurm?
On 5/27/21 12:26 pm, Prentice Bisbal wrote: Given the lack of traffic on the mailing list and lack of releases, I'm beginning to think that both of these projects are all but abandoned. They're definitely actively working on it - I've given them a heads up on this to let them know how it's being perceived. Thanks for mentioning it! All the best! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Drain node from TaskProlog / TaskEpilog
On 5/24/21 3:02 am, Mark Dixon wrote: Does anyone have advice on automatically draining a node in this situation, please? We do some health checks via a node epilog set with the "Epilog" setting, including queueing node reboots with "scontrol reboot". All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] inconsistent CUDA_VISIBLE_DEVICES with srun vs sbatch
On 5/19/21 1:41 pm, Tim Carlson wrote: but I still don't understand how with "shared=exclusive" srun gives one result and sbatch gives another. I can't either, but I can reproduce it with Slurm 20.11.7. :-/ -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] nodes going to down* and getting stuck in that state
On 5/19/21 9:15 pm, Herc Silverstein wrote: Does anyone have an idea of what might be going on? To add to the other suggestions, I would say that checking the slurmctld and slurmd logs to see what it is saying is wrong is a good place to start. Best of luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Determining Cluster Usage Rate
On 5/14/21 1:45 am, Diego Zuccato wrote:

Usage reported in Percentage of Total
Cluster  TRES Name  Allocated   Down  PLND Dow    Idle  Reserved  Reported
-------  ---------  ---------  -----  --------  ------  --------  --------
oph      cpu           81.93%  0.00%     0.00%  15.85%     2.22%   100.00%
oph      mem           80.60%  0.00%     0.00%  19.40%     0.00%   100.00%

The "Reserved" column is the one you're interested in, it's indicating that for the 13th some jobs were waiting for CPUs, not memory. You can look at a longer reporting period by specifying a start date, something like: sreport -t percent -T cpu,mem cluster utilization start=2021-01-01 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Determining Cluster Usage Rate
On 5/14/21 1:45 am, Diego Zuccato wrote: It just doesn't recognize 'ALL'. It works if I specify the resources. That's odd, what does this say? sreport --version All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Determining Cluster Usage Rate
On 5/13/21 3:08 pm, Sid Young wrote: Hi All, Hiya, Is there a way to define an effective "usage rate" of a HPC Cluster using the data captured in the slurm database. Primarily I want to see if it can be helpful in presenting to the business a case for buying more hardware for the HPC :) I have a memory that it's possible to use "sreport" to show you what amount of time jobs were waiting for what TRES - in other words whether they were waiting for CPUs, memory, GPUs, etc (or some combination). Ah here you go.. sreport -t percent -T ALL cluster utilization That breaks things down by all the trackable resources on your system. Hope that helps! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)
Hi Robert, On 4/16/21 12:39 pm, Robert Peck wrote: Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs because ONE (or a few) node(s) fails? You will also probably want this for your srun: --kill-on-bad-exit=0 What does the scontrol command below show? scontrol show config | fgrep KillOnBadExit From the manual page: -K, --kill-on-bad-exit[=0|1] Controls whether or not to terminate a step if any task exits with a non-zero exit code. If this option is not specified, the default action will be based upon the Slurm configuration parameter of KillOnBadExit. If this option is specified, it will take precedence over KillOnBadExit. An option argument of zero will not terminate the job. A non-zero argument or no argument will terminate the job. Note: This option takes precedence over the -W, --wait option to terminate the job immediately if a task exits with a non-zero exit code. Since this option's argument is optional, for proper parsing the single letter option must be followed immediately with the value and not include a space between them. For example "-K1" and not "-K 1". Best of luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] PartitionName default
On 4/7/21 11:48 am, Administração de Sistemas do Centro de Bioinformática wrote: Unfortunately, I still don't know how to use any other value to PartitionName. We've got about 20 different partitions on our large Cray system, with a variety of names (our submit filter system directs jobs to the right location based on what the user requests and has access to): cat /etc/slurm/slurm.conf | awk '/^PartitionName/ {print $1}' PartitionName=system PartitionName=system_shared PartitionName=debug_hsw PartitionName=debug_knl PartitionName=jupyter PartitionName=regular_hsw PartitionName=regular_knl PartitionName=regularx_hsw PartitionName=regularx_knl PartitionName=resv PartitionName=resv_shared PartitionName=benchmark PartitionName=realtime_shared PartitionName=realtime PartitionName=shared PartitionName=interactive PartitionName=genepool PartitionName=genepool_shared PartitionName=genepool_resv PartitionName=genepool_resv_shared I've not had issues with naming partitions in the past, though I can imagine `default` could cause confusion as there is a `default=yes` setting you can put on the one partition you want as the default choice. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Rate Limiting of RPC calls
On 2/9/21 5:08 pm, Paul Edmon wrote: 1. Being on the latest release: A lot of work has gone into improving RPC throughput, if you aren't running the latest 20.11 release I highly recommend upgrading. 20.02 also was pretty good at this. We've not gone to 20.11 on production systems yet, but I can vouch for 20.02 being far better than previous versions for scheduling performance. We also use the cli_filter lua plugin to write our own RPC limiting mechanism using a local directory for per-user files. The big advantage of this is that it does the rate limiting client side and so they don't get sent to the slurmctld in the first place. Yes, it is theoretically possible for users to discover and work around this, but the intent here is to catch accidental/naive use rather than anything malicious. Also getting users to use `sacct` rather than `squeue` to check what state a job is in can help a lot too, it reduces the load on slurmctld. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] only 1 job running
On 1/27/21 9:28 pm, Chandler wrote: Hi list, we have a new cluster setup with Bright cluster manager. Looking into a support contract there, but trying to get community support in the mean time. I'm sure things were working when the cluster was delivered, but I provisioned an additional node and now the scheduler isn't quite working right. Did you restart the slurm daemons when you added the new node? Some internal data structures (bitmaps) are built based on the number of nodes and they need to be rebuilt with a restart in this situation. https://slurm.schedmd.com/faq.html#add_nodes All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?
On 1/26/21 12:10 pm, Ole Holm Nielsen wrote: What I don't understand is, is it actually *required* to make the NVIDIA libraries available to Slurm? I didn't do that, and I'm not aware of any problems with our GPU nodes so far. Of course, our GPU nodes have the libraries installed and the /dev/nvidia? devices are present. You only need it if you want to use NVML autodetection of GPUs, we don't have any nvidia software in the OS image we use to build our vast array of RPMs and they work just fine on our GPU nodes. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Defining an empty partition
On 12/18/20 4:45 am, Tina Friedrich wrote: Yeah, I had that problem as well (trying to set up a partition that didn't have any nodes - they're not here yet). You can define nodes in Slurm that don't exist yet with State=FUTURE, that means slurmctld basically ignores them until you change that state setting (either with scontrol or updating your config). I've used that before, and in fact added some nodes in that state yesterday on one of our test HPCs. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
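A sketch of defining not-yet-delivered hardware that way (names and hardware values are placeholders): slurmctld accepts the config but ignores the nodes until the state is changed.

```shell
# slurm.conf fragment: nodes that don't exist yet.
NodeName=newnode[01-08] CPUs=64 RealMemory=256000 State=FUTURE
PartitionName=upcoming Nodes=newnode[01-08] State=UP
```

When the hardware arrives, `scontrol update NodeName=newnode01 State=RESUME` (or a config update) brings them into service.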
Re: [slurm-users] Scripts run slower in slurm?
On 12/14/20 11:20 pm, Alpha Experiment wrote: It is called using the following submission script:

#!/bin/bash
#SBATCH --partition=full
#SBATCH --job-name="Large"
source testenv1/bin/activate
python3 multithread_example.py

You're not asking for a number of cores, so you'll likely only be getting a single core to use here. You'll likely need something like: #SBATCH -c 64 for it to get access to more cores. Also in your config I noticed: NodeName=localhost I'd suggest you use the actual name for your compute nodes, I don't think that's going to work out too well with more than 1 node. :-) All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Trouble installing slurm-20.02.4-1.amzn2.x86_64 libnvidia-ml.so.1
Hi Drew, On 12/4/20 11:32 am, Mullen, Drew wrote: Error: Package: slurm-20.02.4-1.amzn2.x86_64 (/slurm-20.02.4-1.amzn2.x86_64) Requires: libnvidia-ml.so.1()(64bit) That looks like it's fixed in 20.02.5 (the current release is 20.02.6): -- commit 1be5492c274e170451ed18763e7eeea826f57cb7 Author: Tim McMullan Date: Tue Aug 11 11:32:26 2020 -0400 slurm.spec - don't depend on libnvidia-ml to allow manual cuda installs Bug 9525 -- Hope this helps! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] update_node / reason set to: slurm.conf / state set to DRAINED
Hi Kevin, On 11/4/20 6:00 pm, Kevin Buckley wrote: In looking at the SlurmCtlD log we see pairs of lines as follows update_node: node nid00245 reason set to: slurm.conf update_node: node nid00245 state set to DRAINED I'd go looking in your healthcheck scripts, I took a quick look at the source last night and couldn't see anything that looked related, and it's not a message I remember seeing before. Also take a look in the slurmd logs on the node for that time, to see if there's anything that correlates there. Good luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm Upgrade
Hi Navin, On 11/4/20 10:14 pm, navin srivastava wrote: I have already built a new server slurm 20.2 with the latest DB. my question is, shall i do a mysqldump into this server from existing server running with version slurm version 17.11.8 This won't work - you must upgrade your 17.11 database to 19.05.x first, then you can upgrade from 19.05.x to 20.02. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Nodes not returning from DRAINING
On 10/28/20 6:27 am, Diego Zuccato wrote: Strangely the core file seems corrupted (maybe because it's from a 4-nodes job and they all try to write to the same file?): You can set a pattern for core file names to prevent that, usually the PID is in the name, but you can put the hostname in there too. https://man7.org/linux/man-pages/man5/core.5.html See the section: "Naming of core dump files" All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
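A quick sketch: the current pattern is readable by anyone, and adding %h (hostname) and %p (PID) keeps ranks on different nodes from clobbering one file. Setting it needs root.

```shell
# Show the current core file naming pattern:
cat /proc/sys/kernel/core_pattern

# To change it (as root), e.g.:
#   sysctl -w kernel.core_pattern='core.%h.%p'
```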
Re: [slurm-users] pam_slurm_adopt always claims no active jobs even when they do
Hi Paul, On 10/23/20 10:13 am, Paul Raines wrote: Any clues as to why pam_slurm_adopt thinks there is no job? Do you have PrologFlags=Contain in your slurm.conf? Contain At job allocation time, use the ProcTrack plugin to create a job container on all allocated compute nodes. This container may be used for user processes not launched under Slurm control, for example pam_slurm_adopt may place processes launched through a direct user login into this container. If using pam_slurm_adopt, then ProcTrackType must be set to either proctrack/cgroup or proctrack/cray_aries. Setting the Contain implicitly sets the Alloc flag. Hope that helps! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it
On 10/21/20 6:32 pm, Kevin Buckley wrote: If you install SLES 15 SP1 from the Q2 ISOs so that you have Munge but not the Slurm 18 that comes on the media, and then try to "rpmbuild -ta" against a vanilla Slurm 20.02.5 tarball, you should get the error I did. Ah, yes, that looks like it was a packaging bug fixed in subsequent updates! # for i in libmunge2-0.5.*.rpm; do echo $i; rpm --provides -qp $i | fgrep munge-libs; done libmunge2-0.5.13-4.3.1.x86_64.rpm libmunge2-0.5.13-4.6.1.x86_64.rpm munge-libs = 0.5.13 libmunge2-0.5.14-4.9.1.x86_64.rpm munge-libs = 0.5.14 Well spotted - so applying the SLES updates before trying to build Slurm should fix that. The reason I got confused is the systems I've got access to already have those updates in place. I suspect that also explains "Bug 6752 - Missing munge-libs dependency on SLES 15" that HPE opened and a separate bug of ours was marked as a duplicate of. Thanks! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] [External] Limit usage outside reservation
On 10/22/20 12:20 pm, Burian, John wrote: This doesn' t help you now, but Slurm 20.11 is expected to have "magnetic reservations," which are reservations that will adopt jobs that don't specify a reservation but otherwise meet the restrictions of the reservation: Magnetic reservations are in 20.02 already. https://slurm.schedmd.com/reservations.html All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it
On 10/20/20 12:49 am, Kevin Buckley wrote: only have, as listed before, Munge 0.5.13. I guess the question is (going back to your initial post): > error: Failed build dependencies: >munge-libs is needed by slurm-20.02.5-1.x86_64 Had you installed libmunge2 before trying this build? rpmbuild can't install it for you if you've not already got it in place. It should work once installed - assuming yours also shows: # fgrep PRETTY /etc/os-release PRETTY_NAME="SUSE Linux Enterprise Server 15 SP1" # rpm -q libmunge2 --provides | tail -n1 munge-libs = 0.5.14 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh, On 10/8/20 4:18 pm, Sajesh Singh wrote: Thank you for the tip. That works as expected. No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware! Best of luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] CUDA environment variable not being set
On 10/8/20 3:48 pm, Sajesh Singh wrote: Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38

All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh, On 10/8/20 11:57 am, Sajesh Singh wrote: debug: common_gres_set_env: unable to set env vars, no device files configured I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Current status of checkpointing
On 8/14/20 6:17 am, Stefan Staeglich wrote: what's the current status of the checkpointing support in SLURM? There isn't any these days, there used to be support for BLCR but that's been dropped as BLCR is no more. I know from talking with SchedMD they are of the opinion that any current checkpoint/resume code (such as DMTCP [1]) should be supported via the users batch script and not in Slurm itself. All the best, Chris [1] - https://github.com/dmtcp/dmtcp -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
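A hedged sketch of driving DMTCP from the batch script, since that's where SchedMD expects it to live (the application path and checkpoint interval are assumptions, not anything Slurm provides): restart from the script DMTCP generates if a checkpoint exists, otherwise start fresh under the checkpointer.

```shell
#!/bin/bash
#SBATCH --time=12:00:00

# dmtcp_restart_script.sh is generated by DMTCP at each checkpoint.
if [ -x ./dmtcp_restart_script.sh ]; then
    ./dmtcp_restart_script.sh
else
    dmtcp_launch --interval 3600 ./my_app   # checkpoint hourly
fi
```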
Re: [slurm-users] Reservation vs. Draining for Maintenance?
On 8/6/20 10:13 am, Jason Simms wrote: Later this month, I will have to bring down, patch, and reboot all nodes in our cluster for maintenance. The two options available to set nodes into a maintenance mode seem to be either: 1) creating a system-wide reservation, or 2) setting all nodes into a DRAIN state. We use both. :-) So for cases where we need to do a system wide outage for some reason we will put reservations on in advance to ensure the system is drained for the maintenance. But for rolling upgrades we will build a new image, set nodes to use it and then do something like: scontrol reboot ASAP nextstate=resume reason="Rolling upgrade" [nodes] That will allow running jobs to complete, drain all the nodes and when idle they'll reboot into the new image and resume themselves once they're back up and slurmd has started and checked in. We use the same mechanism when we need to reboot nodes for other maintenance activities, say when huge pages are too fragmented and the only way to reclaim them is to reboot the node (these checks happen in the node epilog). We paid for enhancements to Slurm 18.08 to ensure that slurmctld took these nodes states into account when scheduling jobs so that large jobs (as in requiring most of the nodes in the system) do not lose their scheduling window when a node has to be rebooted for this reason. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] cgroup limits not created for jobs
On 7/26/20 12:21 pm, Paul Raines wrote: Thank you so much. This also explains my GPU CUDA_VISIBLE_DEVICES missing problem in my previous post. I've missed that, but yes, that would do it. As a new SLURM admin, I am a bit suprised at this default behavior. Seems like a way for users to game the system by never running srun. This is because by default salloc only requests a job allocation, it expects you to use srun to run an application on a compute node. But yes, it is non-obvious (as evidenced by the number of "sinteractive" and other scripts out there that folks have written not realising about the SallocDefaultCommand config option - I wrote one back in 2013!). The only limit I suppose that is being really enforced at that point is walltime? Well the user isn't on the compute node so there's nothing really else to enforce. I guess I need to research srun and SallocDefaultCommand more, but is there some way to set some kind of separate walltime limit on a job for the time a salloc has to run srun? It is not clear if one can make a SallocDefaultCommand that does "srun ..." that really covers all possibilities. An srun inside of a salloc (just like an sbatch) should not be able to exceed the time limit for the job allocation. If it helps this is the SallocDefaultCommand we use for our GPU nodes: srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 -G 0 --gpus-per-task=0 --gpus-per-node=0 --gpus-per-socket=0 --pty --preserve-env --mpi=none -m block $SHELL We have to give all those possible permutations to not use various GPU GRES as otherwise this srun will consume them if the salloc asked for it and then when the user tries to "srun" their application across the nodes it will block as there won't be any available on this first node. Of course the fact that because of this the user can't see the GPUs without the srun can confuse some people, but it's unavoidable for this use case. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Restart Job after sudden reboot of the node
On 7/24/20 12:28 pm, Saikat Roy wrote: If SLURM restarts automatically, is there any way to stop it? If you would rather Slurm not start scheduling jobs when it is restarted then you can set your partitions to have `State=DOWN` in slurm.conf. That way should the node running slurmctld reboot then it won't start scheduling jobs until you tell it to. For compute nodes I believe Slurm should detect any node that reboots and mark it "DOWN" with the reason set to "Node unexpectedly rebooted". All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
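A sketch of that partition setting (names are placeholders): a partition that is DOWN still accepts submissions but starts nothing, so a controller restart won't resume scheduling until an admin brings it back up.

```shell
# slurm.conf fragment:
PartitionName=batch Nodes=node[01-99] State=DOWN

# Later, when you're ready to start work again:
#   scontrol update PartitionName=batch State=UP
```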
Re: [slurm-users] [EXT] Jobs Immediately Fail for Certain Users
On 7/7/20 5:57 pm, Jason Simms wrote: Failed to look up user weissp: No such process That looks like the user isn't known to the node. What do these say: id weissp getent passwd weissp Which version of Slurm is this? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Are SLURM_JOB_USER and SLURM_JOB_UID always constant and available
On 5/20/20 7:23 pm, Kevin Buckley wrote: Are they set as part of the job payload creation, and so would ignore and node local lookup, or set as the job gets allocated to the various nodes it will run on? Looking at git, it's a bit of both: src/slurmd/slurmd/req.c: setenvf(, "SLURM_JOB_UID", "%u", job_env->uid); [...] setenvf(, "SLURM_JOB_USER", "%s", job_env->user_name); so the variables get set on the slurmd side (as you'd expect) but from data that is sent along with the job. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] additional jobs killed by scancel.
On 5/11/20 9:52 am, Alastair Neil wrote: [2020-05-10T00:26:05.202] [533900.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9 This caught my eye, Googling for it found a single instance, from 2019 on the list again about jobs on a node mysteriously dying. The resolution was (courtesy of Uwe Seher): # The system is an opensuse leap 15 installation and slurm # comes from the repository. By default a slurm.epilog.clean # script is installed which kills everything that belongs to # the user when a job is finished including other jobs, # ssh-sessions and so on. I do not know if other distributions # do the same or if the script is broken, but removing it # solved the problem. Hope that helps! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Do not upgrade mysql to 5.7.30!
On 5/7/20 6:08 AM, Riebs, Andy wrote: Alternatively, you could switch to MariaDB; I've been using that for years. Debian switched to only having MariaDB in 2017 with the release of Debian 9 (Stretch), so as a derivative distro I'm surprised that Ubuntu still packages MySQL. I'd second Andy's suggestion. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Munge decode failing on new node
On 4/22/20 12:56 PM, dean.w.schu...@gmail.com wrote: There is a third user account on all machines in the cluster that is the user account for using the cluster. That account has uid 1000 on all four worker nodes, but on the controller it is 1001. So that is probably why you're seeing the question marks. You need to have identical UIDs everywhere for this to work. I would strongly suggest using something like LDAP to ensure that your users have identical representation everywhere. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS
Hi Robert, On 4/8/20 7:08 AM, Robert Kudyba wrote: and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration That's the key phrase - when whoever compiled Slurm ran ./configure *before* compilation it was on a system without the nvidia libraries and headers present, so Slurm could not compile that support in. You'll need to redo the build on a system with the nvidia libraries and headers in order for this to work. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS
On 4/7/20 2:48 PM, Robert Kudyba wrote: How can I get this to work by loading the correct Bright module? You can't - you will need to recompile Slurm. The error says: Apr 07 16:52:33 node001 slurmd[299181]: fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured. So when Slurm was built the libraries you are telling it to use now were not detected and so the configure script disabled that functionality as it would not otherwise have been able to compile. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
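A hedged sketch of the rebuild check implied above: before re-running Slurm's configure, confirm the NVIDIA management library and headers are actually present on the build host (the paths below are typical CUDA install locations, not taken from the original posts):

```shell
#!/bin/sh
# If configure cannot find libnvidia-ml / nvml.h, NVML autodetection is
# silently disabled and slurmd later fails with the "fatal" above.
if ldconfig -p 2>/dev/null | grep -q libnvidia-ml || [ -e /usr/local/cuda/include/nvml.h ]; then
    NVML_STATUS=present
    echo "NVML development files look present -- configure should pick them up"
else
    NVML_STATUS=missing
    echo "NVML development files not found -- install the CUDA toolkit / driver dev packages first"
fi
# Then rebuild from source so configure detects NVML, e.g.:
#   ./configure --with-nvml=/usr/local/cuda && make && make install
```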
Re: [slurm-users] Accounting Information from slurmdbd does not reach slurmctld
On 3/19/20 4:05 AM, Pascal Klink wrote: However, there was no real answer given as to why this happened. So we thought that maybe this time someone may have an idea. To me it sounds like either your slurmctld is not correctly registering with slurmdbd, or if it has then slurmdbd cannot connect back to slurmctld. What does this say? sacctmgr show clusters All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] slurmd -C showing incorrect core count
On 3/12/20 9:37 PM, Kirill 'kkm' Katsnelson wrote: Aaah, that's a cool find! I never really looked inside my nodes for more than a year since I debugged all my stuff so it "just works". They are conjured out of nothing and dissolve back into nothing after 10 minutes of inactivity. But good to know! In the cloud, changing the amount of RAM and the number and even type of CPUs is all too easy. Also on some architectures doing that discovery can take time, so having it cached can be useful (slurmd will just read it once on startup). For us that's on a ramdisk filesystem (as Cray XC nodes have no local disk) so it vanishes every time the node reboots. My bet is that Mike's nodes have persistent storage and have an old copy of this file, hence the weird discrepancy he's seeing. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Block interactive shell sessions
On 3/5/20 9:22 AM, Luis Huang wrote: We would like to block certain nodes from accepting interactive jobs. Is this possible on slurm? My suggestion would be to make a partition for interactive jobs that only contains the nodes that you want to run them and then use the submit filter to direct jobs without a batch script set to that partition only (and prevent people modifying the partition for those jobs once submitted). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
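A sketch of that setup; the partition names, node ranges, and time limit are illustrative, and the Lua fragment is only the core idea of a job_submit plugin, not a complete one:

```
# slurm.conf fragment (names are examples)
PartitionName=interactive Nodes=node[01-02] MaxTime=8:00:00 State=UP
PartitionName=batch       Nodes=node[03-10] Default=YES State=UP
JobSubmitPlugins=lua
```

```lua
-- job_submit.lua fragment: a job submitted without a batch script
-- (i.e. srun/salloc) is forced onto the interactive partition.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.script == nil or job_desc.script == '' then
        job_desc.partition = "interactive"
    end
    return slurm.SUCCESS
end
```

A matching check in slurm_job_modify() stops users moving such jobs to another partition after submission.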
Re: [slurm-users] Slurm 19.05 X11-forwarding
On 2/28/20 8:56 PM, Pär Lundö wrote: I thought that I could run the srun-command with X11-forwarding called from an sbatch-jobarray-script and get the X11-forwarding to my display. No, I believe X11 forwarding can only work when you run "srun --x11" directly on a login node, not from inside a batch script. (You should not need to be logged into a compute node either) See: https://slurm.schedmd.com/faq.html#x11 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes
On 2/27/20 11:23 AM, Robert Kudyba wrote: OK so does SLURM support MPS and if so what version? Would we need to enable cons_tres and use, e.g., --mem-per-gpu? Slurm 19.05 (and later) supports MPS - here's the docs from the most recent release of 19.05: https://slurm.schedmd.com/archive/slurm-19.05.5/gres.html It does require the use of cons_tres for MPS. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
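A hedged configuration sketch of MPS under cons_tres, following the gres.html docs linked above (node name, device path, and the MPS count of 100 are example values):

```
# slurm.conf fragment
SelectType=select/cons_tres
GresTypes=gpu,mps
NodeName=node001 Gres=gpu:1,mps:100

# gres.conf on node001
Name=gpu File=/dev/nvidia0
Name=mps Count=100
```

Jobs then request a share of the GPU with something like `--gres=mps:50`, i.e. half of the 100 MPS units defined for the card.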
Re: [slurm-users] Slurm version 20.02.0 is now available
On 2/25/20 11:41 AM, Dean Schulze wrote: I'm very interested in the "configless" setup for slurm. Is the setup for configless documented somewhere? Looks like the website has already been updated for the 20.02 documentation, and it looks like it's here: https://slurm.schedmd.com/configless_slurm.html All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
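For the archives, a minimal sketch of the configless setup from that page (the controller hostname is a placeholder):

```
# slurm.conf fragment on the slurmctld host
SlurmctldParameters=enable_configless
```

Compute nodes then run slurmd without a local slurm.conf, pointing it at the controller instead, e.g. `slurmd --conf-server ctld-host:6817`; slurmd fetches the config files from slurmctld at startup.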
Re: [slurm-users] How to use Autodetect=nvml in gres.conf
Hi Dean, On 2/7/20 8:03 AM, dean.w.schu...@gmail.com wrote: I just checked the .deb package that I build from source and there is nothing in it that has nv or cuda in its name. Are you sure that slurm distributes nvidia binaries? SchedMD only distributes sources, it's up to distros how they package it. I suspect you'll need to build it yourself if you want NVML support, I doubt many distros will want to be distributing builds linked against non-free nvidia libraries. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] sbatch sending the working directory from the controller to the node
On 1/21/20 11:27 AM, Dean Schulze wrote: The sbatch docs say nothing about why the node gets the pwd from the controller. Why would slurm send a directory to a node that may not exist on the node and expect it to use it? That's a pretty standard expectation from a cluster, that the filesystem you are working in on the node you are submitting from is the same as the one that's on the compute nodes. Otherwise there's a lot of messy staging of files you'll need to do. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA