[slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
I'm working through how to use the new dynamic node features in order to take down a particular node, reconfigure it (using NVIDIA MIG to change the number of GPU instances available), and give it back to slurm. I'm at the point where I can take a node out of slurm's control from the master nod
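
A minimal sketch of the two halves of that cycle, with placeholder node names and GRES counts (not taken from the thread):

    # on the controller: drop the dynamic node out of service
    scontrol delete nodename=t-gc-1201
    # reconfigure MIG on the node, then re-register it as a dynamic node
    # (slurm.conf needs room for dynamic nodes, e.g. a sufficient MaxNodeCount)
    slurmd -Z --conf "Gres=gpu:7 Feature=gc"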

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
pt outside slurm itself, on the head node. You can use ssh/pdsh to connect to a node and execute things there while it is out of the mix. Brian Andrus On 9/23/2022 7:09 AM, Groner, Rob wrote: I'm working through how to use the new dynamic node features in order to take down a particular

Re: [slurm-users] slurmd and dynamic nodes

2022-09-23 Thread Groner, Rob
make you feel slurmd cannot run as a service on a dynamic node. As long as you added the options to the systemd defaults file for it, you should be fine (usually /etc/defaults/slurmd) Brian On 9/23/2022 7:40 AM, Groner, Rob wrote: Ya, we're still working out the mechanism for taking the nod
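
For reference, the defaults file usually looks like the sketch below; the exact path depends on how slurmd was packaged, so check systemctl cat slurmd for the EnvironmentFile it actually reads:

    # /etc/sysconfig/slurmd (RHEL-family) or /etc/default/slurmd (Debian-family)
    SLURMD_OPTIONS="-Z"

The -Z flag is what makes the node register dynamically; quoting a multi-word --conf string through an environment variable can be fiddly, so keep that part simple if possible.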

[slurm-users] Questions about dynamic nodes

2022-09-27 Thread Groner, Rob
I have 2 nodes that offer a "gc" feature. Node t-gc-1202 is "normal", and node t-gc-1201 is dynamic. I can successfully remove t-gc-1201 and bring it back dynamically. Once I bring it back, that node appears JUST LIKE the "normal" node in the sinfo output, as seen here: [rug262@testsch (RC)

Re: [slurm-users] Questions about dynamic nodes

2022-09-28 Thread Groner, Rob
I tried a simpler test, removing the features altogether so it was just another node offering 48 CPUs. I then started jobs asking for 24 CPUs a bunch of times. The jobs started on every node EXCEPT t-gc-1201, and jobs went pending for resources until the "normal" nodes could return. So at th

Re: [slurm-users] Questions about dynamic nodes

2022-09-28 Thread Groner, Rob
I ended up getting some help, and in the process, I noticed (for the first time) that the topology plugin was listed in the slurm.conf file. I remembered that the dynamic nodes docs mentioned that dynamic nodes are not compatible with the topology plugin. I had previously removed the nodes fro

[slurm-users] How to hold a job until a feature is available?

2022-09-29 Thread Groner, Rob
I'm trying to setup a system where, when a job from a certain account is submitted, if no nodes are available that have a specific feature, then the job will be paused/held/pending and a node will be dynamically created with that feature. I can now dynamically bring up the node with the feature
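
One way to approximate "pend until the feature exists" from outside the scheduler (a sketch only; the job script, feature name, and polling interval are invented):

    jobid=$(sbatch --hold --constraint=gc --parsable job.sh)
    until sinfo -h -N -o "%f" | grep -qw gc; do sleep 60; done
    scontrol release "$jobid"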

Re: [slurm-users] How to hold a job until a feature is available?

2022-09-30 Thread Groner, Rob
Thanks. I tried that, and it seems like it may be exactly what I was looking for. Rob

Re: [slurm-users] slurm_update error: Invalid node state specified

2022-10-11 Thread Groner, Rob
Have you checked the logs for slurmd and slurmctld? I seem to recall that the "invalid" state for a node meant that there was some discrepancy between what the node says or thinks it has (slurmd -C) and what the slurm.conf says it has. While there is that discrepancy and the node is invalid, y
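
A quick way to see the kind of mismatch being described (node name is a placeholder):

    # on the compute node: what the hardware actually reports
    slurmd -C
    # on the controller: what slurm.conf / the controller believes
    scontrol show node node01 | egrep 'CPUTot|RealMemory|Sockets|ThreadsPerCore'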

[slurm-users] Can sinfo/scontrol be called from job_submit.lua?

2022-10-11 Thread Groner, Rob
I am testing a method where, when a job gets submitted asking for specific features, then, if those features don't exist, I'll do something. The job_submit.lua plugin has worked to determine when a job is submitted asking for the specific features. I'm at the point of checking if those feature

Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?

2022-10-12 Thread Groner, Rob
y doing such for a fairly limited amount of information which presumably does not change frequently, perhaps it would be better to have a cron job periodically output the desired information to a file, and have the job_submit.lua read the information from the file? On Tue, Oct 11, 2022 at 5:17 P
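
A sketch of that cron-plus-file approach (path and schedule are arbitrary):

    # controller crontab: refresh the node/feature list every 5 minutes for job_submit.lua to read
    */5 * * * * /usr/bin/sinfo -h -N -o "%N %f" > /run/slurm/node_features.txt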

Re: [slurm-users] gres/gpu count reported lower than configured

2022-10-21 Thread Groner, Rob
I've encountered that many times, and for me, it was always related to AutoDetect and the nvidia-ml library. Does your slurmd log contain a line like "debug: skipping GRES for NodeName=t-gc-1202 AutoDetect=nvml"? I see that you didn't specifically set AutoDetect to nvml in gres.conf, but may
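
The two gres.conf styles involved look roughly like this (the GPU type and device paths are examples only):

    # rely on NVML autodetection (requires slurmd built against nvml.h / libnvidia-ml)
    AutoDetect=nvml
    # ...or describe the devices explicitly and skip autodetection
    NodeName=t-gc-1202 Name=gpu Type=a100 File=/dev/nvidia[0-1]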

[slurm-users] Test Suite problems related to requesting tasks

2022-10-24 Thread Groner, Rob
I'm really pleased to find the test suite included with slurm, and after some initial difficulty, I am now able to run the unit tests and expect tests. The expect tests generally seem to fail whenever the test involves tasks. Anything asking for more than 1 task per node is failing. [202

Re: [slurm-users] Test Suite problems related to requesting tasks

2022-10-25 Thread Groner, Rob
A very helpful reply, thank you! For your "special testing config", do you just mean the slurm.conf/gres.conf/*.conf files? So when you want to test a new version of slurm, you replace the conf files and then restart all of the daemons? Rob

Re: [slurm-users] NVML not found when Slurm was configured.

2022-11-11 Thread Groner, Rob
Hi Mike, I can't tell whether or not you're compiling slurm on your own. You will have to if you want this functionality. On RedHat 8, I had to install cuda-nvml-devel-11-7, so find the equivalent for that in Ubuntu. Basically, whatever package includes nvml.h and libnvidia-ml.so. Then, mo
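
Roughly, the rebuild steps look like this on the RHEL 8 side (the CUDA install prefix is an assumption; use whatever path holds nvml.h on your system):

    dnf install cuda-nvml-devel-11-7             # provides nvml.h and the NVML stub library
    ./configure --with-nvml=/usr/local/cuda-11.7 && make && make install
    grep -i nvml config.log                      # confirm configure actually found nvml.h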

Re: [slurm-users] NVML not found when Slurm was configured.

2022-11-11 Thread Groner, Rob
users on behalf of Michael Lewis Reply-To: Slurm User Community List Date: Friday, November 11, 2022 at 10:01 AM To: Slurm User Community List Subject: Re: [slurm-users] NVML not found when Slurm was configured. Thanks Rob! No I just grabbed it through apt. I’ll try that now. Mike Fr

[slurm-users] NVIDIA MIG question

2022-11-15 Thread Groner, Rob
We have successfully used the nvidia-smi tool to take the 2 A100s in a node and split them into multiple GPU devices. In one case, we split the 2 GPUs into 7 MIG devices each, so 14 in that node total, and in the other case, we split the 2 GPUs into 2 MIG devices each, so 4 total in the node.
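
For context, the split is done with nvidia-smi's MIG commands along these lines (profile IDs vary by GPU model and driver, so treat them as placeholders):

    nvidia-smi -i 0 -mig 1                              # enable MIG mode on GPU 0
    nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C    # carve seven 1g slices and create compute instances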

Re: [slurm-users] NVIDIA MIG question

2022-11-16 Thread Groner, Rob
a single job could use all 14 instances. The result you observed suggests that MIG is a feature of the driver, i.e. lspci shows one device but nvidia-smi shows 7 devices. I haven't played around with this myself in slurm but would be interested to know the answers. Laurence On 15/11/2022

Re: [slurm-users] NVIDIA MIG question

2022-11-17 Thread Groner, Rob
emory or cpu), or some other limit (in the account, partition, or qos) On our setup we're limiting jobs to 1 gpu per job (via partition qos), however we can use up all the MIGs with single gpu jobs. On Wed, 16 Nov 2022 at 23:48, Groner, Rob mailto:rug...@psu.edu>> wrote: That does h
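
The "1 gpu per job via partition QOS" limit mentioned there can be expressed roughly as follows (QOS and partition names are invented):

    sacctmgr -i add qos migpart
    sacctmgr -i modify qos migpart set MaxTRESPerJob=gres/gpu=1
    # slurm.conf: attach the QOS to the partition
    PartitionName=mig Nodes=<nodelist> QOS=migpart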

Re: [slurm-users] NVIDIA MIG question

2022-11-17 Thread Groner, Rob
just 1 gpu, without them going to pending (until all gpus are used up). Rob From: slurm-users on behalf of Groner, Rob Sent: Thursday, November 17, 2022 10:08 AM To: Slurm User Community List Subject: Re: [slurm-users] NVIDIA MIG question No, I can't s

[slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Groner, Rob
We currently have a test cluster and a production cluster, all on the same network. We try things on the test cluster, and then we gather those changes and make a change to the production cluster. We're doing that through two different repos, but we'd like to have a single repo to make the tra

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-17 Thread Groner, Rob
time (after the maintenance reservation) to ensure that MPI runs correctly. On Wed, Jan 4, 2023 at 12:26 PM Groner, Rob mailto:rug...@psu.edu>> wrote: We currently have a test cluster and a production cluster, all on the same network. We try things on the test cluster, and then we gath

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-18 Thread Groner, Rob
ll running happily. If it succeeds, it takes control back and you can then restart the secondary with the new (known good) config. Brian Andrus On 1/17/2023 12:36 PM, Groner, Rob wrote: So, you have two equal sized clusters, one for test and one for production? Our test cluster is a small

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-18 Thread Groner, Rob
Generating the *.conf files from parseable/testable sources is an interesting idea. You mention nodes.conf and partitions.conf. I can't find any documentation on those. Are you just creating those files and then including them in slurm.conf? Rob From: slurm-
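
If that is indeed the approach, it is just slurm.conf's Include directive (paths are illustrative):

    # slurm.conf
    Include /etc/slurm/nodes.conf
    Include /etc/slurm/partitions.conf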

[slurm-users] Using oversubscribe to hammer a node

2023-01-19 Thread Groner, Rob
I'm trying to set up a specific partition where users can fight with the OS for dominance. The OverSubscribe property sounds like what I want, as it says "More than one job can execute simultaneously on the same compute resource." That's exactly what I want. I've set up a node with 48 CPUs and o
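
A minimal partition definition for that kind of experiment might look like this (the FORCE:4 factor and node name are arbitrary):

    PartitionName=hammer Nodes=node01 OverSubscribe=FORCE:4 MaxTime=INFINITE State=UP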

Re: [slurm-users] Using oversubscribe to hammer a node

2023-01-20 Thread Groner, Rob
m User Community List Subject: Re: [slurm-users] Using oversubscribe to hammer a node Hi Rob, "Groner, Rob" writes: > I'm trying to setup a specific partition where users can fight with the OS > for dominance, The oversubscribe property sounds like what I want, as it says

[slurm-users] slurm and singularity

2023-02-07 Thread Groner, Rob
I'm trying to set up the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that job will execute "singularity exec script_to_run.sh". Also, they could execute: $: salloc and would end up on a node per their paramet
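
The usual low-tech version of the sbatch half is a thin wrapper submitted instead of the user's script (image path and resources are placeholders); it doesn't make a bare "sbatch script_to_run.sh" transparent, which is what is being asked for here, but it shows the exec call involved:

    #!/bin/bash
    #SBATCH --ntasks=1
    singularity exec /path/to/image.sif ./script_to_run.sh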

Re: [slurm-users] slurm and singularity

2023-02-07 Thread Groner, Rob
r_run.sh Then cluster_run.sh would call sbatch along with the appropriate commands. Brian Andrus On 2/7/2023 9:31 AM, Groner, Rob wrote: I'm trying to setup the capability where a user can execute: $: sbatch script_to_run.sh and the end result is that a job is created on a node, and that

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Groner, Rob
I tried that, and it says the nodes have been allocated, but it never comes to an apptainer prompt. I then tried doing them in separate steps. Doing salloc works, I get a prompt on the node that was allocated. I can then run "singularity shell " and get the apptainer prompt. If I prefix that

Re: [slurm-users] slurm and singularity

2023-02-08 Thread Groner, Rob
ularity shell /opt/shared/singularity/prebuilt/postgresql/13.2.simg salloc: Granted job allocation 3953723 salloc: Waiting for resource configuration salloc: Nodes r1n00 are ready for job Singularity> On Feb 8, 2023, at 09:47 , Groner, Rob mailto:rug...@psu.edu>> wrote: I tried th

Re: [slurm-users] Lua plugin job_desc fields

2023-02-08 Thread Groner, Rob
No, there's no other official documentation of that. The official docs also say to go to that source file and see the fields there. It's what I do also. Rob From: slurm-users on behalf of Chrysovalantis Paschoulas Sent: Wednesday, February 8, 2023 11:46 AM To

[slurm-users] Unit Testing job_submit.lua

2023-02-17 Thread Groner, Rob
I'm trying to set up some testing of our job_submit.lua plugin so I can verify that changes I make to it don't break anything. I looked into luaunit for testing, and that seems like it would do what I need: let me set the value of inputs, call the slurm_job_submit() function with them, and the

[slurm-users] PreemptExemptTime

2023-03-07 Thread Groner, Rob
I found a thread about this topic that's a year old and at the time seemed to give no hope; I'm just wondering if the situation has changed. My testing so far isn't encouraging. The thread (here: https://groups.google.com/g/slurm-users/c/yhnSVBoohik) talks about wanting to give lower pr
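
For context, the per-QOS knob discussed in that thread is set with sacctmgr (QOS name and the one-hour value are just examples):

    sacctmgr modify qos normal set PreemptExemptTime=01:00:00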

Re: [slurm-users] PreemptExemptTime

2023-03-10 Thread Groner, Rob
mpt exempt time (unless that comes from the global setting). Thanks. Rob From: slurm-users on behalf of Christopher Samuel Sent: Tuesday, March 7, 2023 3:40 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] PreemptExemptTime On 3/7/23 6:46

Re: [slurm-users] PreemptExemptTime

2023-03-10 Thread Groner, Rob
h 7, 2023 3:40 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] PreemptExemptTime On 3/7/23 6:46 am, Groner, Rob wrote: > Over global settings are PreemptMode=SUSPEND,GANG and > PreemptType=preempt/partition_prio. We have a high priority partition > that nothing should e

Re: [slurm-users] PreemptExemptTime

2023-03-10 Thread Groner, Rob
nce it has run through its preempexempttime. It never gets preempted. Thanks. Rob From: slurm-users on behalf of Christopher Samuel Sent: Tuesday, March 7, 2023 3:40 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] PreemptExemptTime On 3/7/

[slurm-users] Oversubscribing even though it's set to No on both partitions

2023-03-24 Thread Groner, Rob
I'm trying to puzzle out using QOS-based preemption instead of partition-based so we can have the juicy prize of PreemptExemptTime. But in the process, I've encountered something that puzzles ME. I have 2 partitions that, for the purposes of testing, are identical except for the QOS they have

[slurm-users] On the ability of coordinators

2023-05-17 Thread Groner, Rob
I was asked to see if coordinators could do anything in this scenario:
  * Within the account that they coordinated, User A submitted 1000s of jobs and left for the day.
  * Within the same account, User B wanted to run a few jobs really quickly. Once submitted, his jobs were of course behi
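
Assuming coordinators really can hold and release jobs belonging to other users in their account (the default coordinator privileges as I understand them; verify on your site), the manual workaround would be roughly:

    squeue -u userA -t PENDING -h -o %i | xargs -r -n1 scontrol hold
    # ...let User B's jobs run, then...
    squeue -u userA -t PENDING -h -o %i | xargs -r -n1 scontrol release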

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Groner, Rob
hen they will run next. That will work if you are able to wait for some jobs to finish and you can 'skip the line' for the priority jobs. If you need to preempt running jobs, that would take a bit more effort to set up, but is an alternative. Brian Andrus On 5/17/2023 6:40 A

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Groner, Rob
its of what the default permissions of a coordinator can do. Of course, that still may not work if there are other accounts/partitions/users with higher priority jobs than User B. Specifically if those jobs can use the same resources A's jobs are running on. Brian Andrus On 5/17/2023 10

Re: [slurm-users] On the ability of coordinators

2023-05-17 Thread Groner, Rob
ahead of any of the heavy user’s pending jobs automatically? From: slurm-users on behalf of "Groner, Rob" Reply-To: Slurm User Community List Date: Wednesday, May 17, 2023 at 1:09 PM To: "slurm-users@lists.schedmd.com" Subject: Re: [slurm-users] On the ability of co

Re: [slurm-users] hi-priority partition and preemption

2023-05-24 Thread Groner, Rob
What you are describing is definitely doable. We have our system set up similarly. All nodes are in both the "open" partition and the "prio" partition, but a job submitted to the "prio" partition will preempt the open jobs. I don't see anything clearly wrong with your slurm.conf settings. Ours are ver
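
The kind of slurm.conf arrangement being described boils down to something like this (names and tiers are illustrative, not anyone's literal config):

    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    PartitionName=open Nodes=ALL PriorityTier=1   PreemptMode=REQUEUE
    PartitionName=prio Nodes=ALL PriorityTier=100 PreemptMode=OFF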

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
A quick test to see if it's a configuration error is to set config_overrides in your slurm.conf and see if the node then responds to scontrol update. From: slurm-users on behalf of Brian Andrus Sent: Thursday, May 25, 2023 10:54 AM To: slurm-users@lists.schedmd
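
Concretely (node name is an example):

    # slurm.conf: trust the configured node description instead of what slurmd reports
    SlurmdParameters=config_overrides
    # then try clearing the drain again
    scontrol update nodename=node01 state=resume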

Re: [slurm-users] GRES and GPUs

2023-07-17 Thread Groner, Rob
That would certainly do it. If you look at the slurmctld log when it comes up, it will say that it's marking that node as invalid because it has fewer (0) gres resources than you say it should have. That's because slurmd on that node will come up and say "What gres resources??" For testing pur

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Groner, Rob
45.027] mcs: MCSParameters = (null). ondemand set. >> [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed >> usec=5898 >> [2023-07-18T14:59:45.952] >> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max

Re: [slurm-users] MIG-Slice: Unavailable GRES

2023-07-19 Thread Groner, Rob
At some point when we were experimenting with MIG, I was entirely frustrated in getting it to work until I finally removed the autodetect from gres.conf and explicitly listed the devices instead. THEN it worked. I think you can find the list of device files using nvidia

Re: [slurm-users] MaxMemPerCPU not enforced?

2023-07-24 Thread Groner, Rob
I'm not sure I can help with the rest, but the EnforcePartLimits setting will only reject a job at submission time that exceeds partition limits, not overall cluster limits. I don't see anything, offhand, in the interactive partition definition that is exceeded by your request for 4 GB/CPU. R

[slurm-users] Partition not allowing subaccount use

2023-07-24 Thread Groner, Rob
I've set up a partition THING with AllowAccounts=stuff. I then use sacctmgr to create the stuff account and a mystuff account whose parent is stuff. My understanding is that this would make mystuff a subaccount of stuff. The description for specifying AllowAccounts in a partition definition in
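
For reference, that account layout is created along these lines (using the names from the message):

    sacctmgr -i add account stuff
    sacctmgr -i add account mystuff parent=stuff
    # slurm.conf
    PartitionName=THING Nodes=<nodelist> AllowAccounts=stuff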

Re: [slurm-users] Partition not allowing subaccount use

2023-07-25 Thread Groner, Rob
s supported since version 23.02 On 24/07/2023 23:26, Groner, Rob wrote: > I've setup a partition THING with AllowAccounts=stuff. I then use > sacctmgr to create the stuff account and a mystuff account whose parent > is stuff. My understanding is that this would make mystuff a subacco

Re: [slurm-users] Automatically converting jobs to regular priority when high-priority is exhausted

2023-08-03 Thread Groner, Rob
I didn't see this thread before, so maybe this has already been suggested... When submitting jobs with sbatch, you could specify a list of partitions to use, and slurm will send the jobs to the partition with the earliest start/highest priority first, and if that gets "full" then it will send th
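
Concretely (partition names are placeholders):

    sbatch --partition=burst,open job.sh    # runs in whichever listed partition can start it first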

Re: [slurm-users] Nodes stay drained no matter what I do

2023-08-24 Thread Groner, Rob
Ya, I agree about the invalid argument not being much help. In times past when I encountered issues like that, I typically tried:
  * Restart slurmd on the compute node. Watch its log to see what it complains about. Usually it's about memory.
  * Set the configuration of the node to whatev

Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Groner, Rob
There's a builtin slurm command, I can't remember what it is and google is failing me, that will take a compacted list of nodenames and return their full names, and I'm PRETTY sure it will do the opposite as well (what you're asking for). It's probably sinfo or scontrol, maybe an sutil, if tha
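
Presumably the command being half-remembered is scontrol's hostlist handling:

    scontrol show hostnames node[01-03]            # expand a compact hostlist to one name per line
    scontrol show hostlist node01,node02,node03    # compress a list back to node[01-03]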

Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Groner, Rob
Yes! Thanks. I'll try to remember it for next time.

[slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-28 Thread Groner, Rob
There are 14 steps to upgrading slurm listed on their website, including shutting down and backing up the database. So far we've only updated slurm during a downtime, and it's been a major version change, so we've taken all the steps indicated. We now want to upgrade from 23.02.4 to 23.02.5. O

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Groner, Rob

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Groner, Rob
nd slurmdbd can all run on different versions so long as the slurmdbd > slurmctld > slurmd. So if you want to do a live upgrade you can do it. However, out of paranoia, we generally stop everything. The entire process takes about an hour start to finish, with the longest part being the pausin
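
A sketch of that ordering for a micro upgrade, assuming an RPM-based install (package and service names may differ on your site):

    # on the slurmdbd host
    systemctl stop slurmdbd  && dnf upgrade 'slurm*' && systemctl start slurmdbd
    # on the slurmctld host
    systemctl stop slurmctld && dnf upgrade 'slurm*' && systemctl start slurmctld
    # on each compute node
    systemctl stop slurmd    && dnf upgrade 'slurm*' && systemctl start slurmd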

[slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
On our system, for some partitions, we guarantee that a job can run at least an hour before being preempted by a higher priority job. We use the QOS preempt exempt time for this, and it appears to be working. But of course, I want to TEST that it works. So on a test system, I start a lower pr

Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
han a "hallway comment", that it sounds like a good thing which I would test with a simulator, if I had one. I've been intrigued by (but really not looked much into) https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob mailto:rug.

Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
elp you when you are looking into this. Sent from my iPhone On Sep 29, 2023, at 16:10, Groner, Rob wrote:  I'm not looking for a one-time answer. We run these tests anytime we change anything related to slurm: version, configuration, etc. We certainly run the test after the syst

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-16 Thread Groner, Rob
It is my understanding that it is a different issue than pmix. So to be fully protected, you would need to build the latest/fixed pmix and rebuild slurm using that (or just keep pmix disabled), AND have this latest version of slurm with their fix for their own vulnerability. Rob _

Re: [slurm-users] Correct way to do logrotation

2023-10-17 Thread Groner, Rob
Thanks for doing that, as I did not see this original message, and I also am having to look at configuring our log for rotation. We once accidentally turned on debug5 and didn't notice until other things started failing because the drive was full...from that ONE file. I did find this conversat
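
For what it's worth, a typical logrotate stanza sends SIGUSR2 so the daemons reopen their log files (paths follow a common RPM layout; match them to your LogFile settings):

    /var/log/slurm/slurmctld.log /var/log/slurm/slurmd.log {
        weekly
        rotate 8
        compress
        missingok
        notifempty
        postrotate
            pkill -USR2 -x slurmctld || true
            pkill -USR2 -x slurmd || true
        endscript
    }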

Re: [slurm-users] Autodetect of nvml is not working in gres.conf

2023-11-30 Thread Groner, Rob
Did you have --with-nvml as part of your configuration? Go back to your config.log and verify that it ever said it found nvml.h. If not, then you'll need to make sure you have the right nvidia/cuda packages installed on the host you're building slurm on, and you might have to specify --with-nv

Re: [slurm-users] Time spent in PENDING/Priority

2023-12-07 Thread Groner, Rob
Ya, I'm kinda looking at exactly this right now as well. For us, I know we're under-utilizing our hardware currently, but I still want to know if the number of pending jobs is growing because that would probably point to something going wrong somewhere. It's a good metric to have. We are goi
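
A crude way to sample that trend over time (feed it into whatever you graph with; "Priority" is the reason squeue prints for priority-blocked jobs):

    squeue -h -t PENDING -o "%r" | grep -c Priority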

[slurm-users] Re: AllowAccounts partition setting

2024-03-12 Thread Groner, Rob via slurm-users
Marko, We are running 23.02.6 and have a partition with a specific account set in AllowAccounts. We test that only that account can use that partition, and it works. I'll note that you'll need to set EnforcePartLimits=ALL in slurm.conf for it to work, and if you use the job_submit filter, mak

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Groner, Rob via slurm-users
FYI, I submitted a bug about this in March because the "compatible" line in the docs was confusing to me as well. The change coming to the docs removes that altogether and simply says that setting it to OFF "disables job preemption and gang scheduling". Much clearer. And we do it the same way

[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
It gets them from the slurm.conf file. So wherever you are executing srun/sbatch/etc, it should have access to the slurm config files. From: Alan Stange via slurm-users Sent: Monday, May 20, 2024 2:55 PM To: slurm-users@lists.schedmd.com Subject: [slurm-users] R

[slurm-users] Re: Running slurm on alternate ports

2024-05-20 Thread Groner, Rob via slurm-users
Since you mentioned "an alternate configuration file", look at the bottom of the sbatch online docs. It describes a SLURM_CONF env var you can set that points to the config files. Rob ____ From: Groner, Rob via slurm-users Sent: Monday, May 20, 2024
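
Concretely (the path is an example):

    export SLURM_CONF=/etc/slurm-test/slurm.conf
    sbatch job.sh    # this submission now uses whatever ports that slurm.conf defines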

[slurm-users] Re: Help with slurmdbd and Slurmctld

2024-06-17 Thread Groner, Rob via slurm-users
When you updated your operating system, you likely updated the version of slurm you were using too (assuming slurm had been installed from system repos instead of built source code). Slurm only supports db and state files that are within 2 major versions older than itself. The fix is to uninst