Re: [slurm-users] Single Node cluster. How to manage oversubscribing
On Wed, 1 Mar 2023 at 07:51, Doug Meyer wrote: > Hi, > > I forgot one thing you didn't mention. When you change the node > descriptors and partitions you have to also restart slurmctld. scontrol > reconfigure works for the nodes but the main daemon has to be told to > reread the config. Until you restart the daemon it will be referencing the > config from the last time it started. > > Yeah, I restart all three daemons and run scontrol reconfigure after the changes are done. I think I've solved the problem. SLURM was double-counting the CPUs, so I set FORCE:2 to FORCE:1 to oversubscribe <https://github.com/hariseldon99/buparamshavak/commit/59fadd33e485ea8ab3b8d9a6c57381bfa9b89d72> and now it's working. I launched 4 infinite loop MPI test jobs with 32 cores and it started running two of them and queued the other two, as required. Thanks for all the help. It was awesome. AR > Doug > > On Sun, Feb 26, 2023 at 10:25 PM Analabha Roy > wrote: > >> Hey, >> >> >> Thanks for sticking with this. >> >> On Sun, 26 Feb 2023 at 23:43, Doug Meyer wrote: >> >>> Hi, >>> >>> Suggest removing "boards=1", The docs say to include it but in previous >>> discussions with schedmd we were advised to remove it. >>> >>> >> I just did. Then ran scontrol reconfigure. >> >> >> >>> When you are running execute "scontrol show node " and look at >>> the lines ConfigTres and AllocTres. The former is what the maitre d >>> believes is available, the latter what has been allocated. >>> >>> Then "scontrol show job " looking down at the "NumNodes" like >>> which will show you what the job requested. >>> >>> I suspect there is a syntax error in the submit. >>> >>> >> Okay. Now this is strange. >> >> First, I launched this job twice <https://pastebin.com/s21yXFH2> >> This should take up 20 + 20 = 40 cores, because of the >> >> >>1. #SBATCH -n 20 # Number of tasks >>2. 
#SBATCH --cpus-per-task=1 >> >> >> >> running scontrol show job on both jobids yields >> >>-NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >>-NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >> >> Then, running scontrol on the node yields: >> >> >>- scontrol show node $HOSTNAME >>- CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1 >>- AllocTRES=cpu=40 >> >> >> So far so good. Both show 40 cores allocated. >> >> >> >> However, if I now add another job with 60 cores >> <https://pastebin.com/C0uW0Aut>,this happens: >> >> scontrol on the node: >> >> CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1 >>AllocTRES=cpu=60 >> >> >> squeue >> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) >>413 CPU normaladmin R 21:22 1 >> shavak-DIT400TR-55L >>414 CPU normaladmin R 19:53 1 >> shavak-DIT400TR-55L >>417 CPU elevatedadmin R 1:31 1 >> shavak-DIT400TR-55L >> >> scontrol on the jobids: >> >> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep >> NumCPUs >>NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep >> NumCPUs >>NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep >> NumCPUs >>NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:* >> >> So there are 100 CPUs running, according to this, but 60 according to >> scontrol on the node?? >> >> The submission scripts are on pastebin: >> >> https://pastebin.com/s21yXFH2 >> https://pastebin.com/C0uW0Aut >> >> >> AR >> >> >> >> >> >> >>> Doug >>> >>> >>> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy >>> wrote: >>> >>>> Hi Doug, >>>> >>>> Again, many thanks for your detailed response. >>>> Based on my understanding of your previous note, I did the following: >>>> >>>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2 >>>> CoresPerSocket=16 ThreadsPerCore=2 >>>> >>>> and the partitions with overs
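The result reported in this message (four 32-core jobs on the 64-CPU node, two running and two pending once FORCE:2 became FORCE:1) matches a simple capacity estimate. What follows is only a rough sketch of that arithmetic. It assumes the N in OverSubscribe=FORCE:N is the number of jobs allowed to share one CPU, and it ignores memory, GRES, QOS limits and core layout. `max_concurrent_jobs` is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/bash
# Idealized estimate only: how many equal-sized jobs can run at once when
# each CPU may be shared by at most `share` jobs (the N in FORCE:N).
# max_concurrent_jobs is a made-up helper name, not part of Slurm.
max_concurrent_jobs() {
    local cpus=$1 cpus_per_job=$2 share=$3
    echo $(( cpus * share / cpus_per_job ))
}

max_concurrent_jobs 64 32 1   # FORCE:1 -> prints 2 (the other two jobs pend)
max_concurrent_jobs 64 32 2   # FORCE:2 -> prints 4
```

Under the same estimate, FORCE:2 leaves room for 128 job-CPUs on this node, which is consistent with the 20 + 20 + 60 = 100 job-CPUs that were running earlier in the thread.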
Re: [slurm-users] Single Node cluster. How to manage oversubscribing
Hey,

Thanks for sticking with this.

On Sun, 26 Feb 2023 at 23:43, Doug Meyer wrote:

> Hi,
>
> Suggest removing "boards=1", The docs say to include it but in previous
> discussions with schedmd we were advised to remove it.

I just did. Then ran scontrol reconfigure.

> When you are running execute "scontrol show node " and look at
> the lines ConfigTres and AllocTres. The former is what the maitre d
> believes is available, the latter what has been allocated.
>
> Then "scontrol show job " looking down at the "NumNodes" like which
> will show you what the job requested.
>
> I suspect there is a syntax error in the submit.

Okay. Now this is strange.

First, I launched this job twice <https://pastebin.com/s21yXFH2>
This should take up 20 + 20 = 40 cores, because of the

    #SBATCH -n 20 # Number of tasks
    #SBATCH --cpus-per-task=1

Running scontrol show job on both jobids yields

    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Then, running scontrol on the node yields:

    $ scontrol show node $HOSTNAME
    CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
    AllocTRES=cpu=40

So far so good. Both show 40 cores allocated.

However, if I now add another job with 60 cores <https://pastebin.com/C0uW0Aut>, this happens:

scontrol on the node:

    CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
    AllocTRES=cpu=60

squeue:

    JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
      413       CPU   normal admin  R 21:22     1 shavak-DIT400TR-55L
      414       CPU   normal admin  R 19:53     1 shavak-DIT400TR-55L
      417       CPU elevated admin  R  1:31     1 shavak-DIT400TR-55L

scontrol on the jobids:

    admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep NumCPUs
    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep NumCPUs
    NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
    admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep NumCPUs
    NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

So there are 100 CPUs running, according to this, but 60 according to scontrol on the node??

The submission scripts are on pastebin:

https://pastebin.com/s21yXFH2
https://pastebin.com/C0uW0Aut

AR

> Doug
>
> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy wrote:
>
>> Hi Doug,
>>
>> Again, many thanks for your detailed response.
>> Based on my understanding of your previous note, I did the following:
>>
>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2
>>
>> and the partitions with oversubscribe=force:2
>>
>> then I put further restrictions with the default qos
>> to MaxTRESPerNode:cpu=32, MaxJobsPU=MaxSubmit=2
>>
>> That way, no single user can request more than 2 X 32 cores legally.
>>
>> I launched two jobs, sbatch -n 32 each as one user. They started running
>> immediately, taking up all 64 cores.
>>
>> Then I logged in as another user and launched the same job with sbatch -n
>> 2. To my dismay, it started to run!
>>
>> Shouldn't slurm have figured out that all 64 cores were occupied and
>> queued the -n 2 job to pending?
>> >> AR >> >> >> On Sun, 26 Feb 2023 at 02:18, Doug Meyer wrote: >> >>> Hi, >>> >>> You got me, I didn't know that " oversubscribe=FORCE:2" is an option. >>> I'll need to explore that. >>> >>> I missed the question about srun. srun is the preferred I believe. I >>> am not associated with drafting the submit scripts but can ask my peer. >>> You do need to stipulate the number of cores you want. Your "sbatch -n 1" >>> should be changed to the number of MPI ranks you desire. >>> >>> As good as slurm is, many come to assume it does far more than it does. >>> I explain slurm as a maître d' in a very exclusive restaurant, aware of >>> every table and the resources they afford. When a reservation is placed, a >>> job submitted, a review of the request versus the resources matches the >>> pending guest/job against the resources and when the other diners/jobs are >>> expected to finish. If a guest requests resources that are not available >>> in the restaurant, the reservation is denied. If a guest arrives and does >>> not need all the resources, the place settings requested but unused are >>> left in reserva
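The 100-versus-60 mismatch in this message comes from adding up NumCPUs over the jobs by hand and comparing it with the node's AllocTRES. A small sketch of that tally follows, using the values posted above as sample input. The helper names `sum_job_cpus` and `node_alloc_cpus` are made up for illustration. The likely reading, given the FORCE:2 setting in use at the time, is that two jobs could be given the same CPU, so the per-job sum can exceed what the node reports; that interpretation is an assumption, not something the thread confirms.

```shell
#!/bin/bash
# Sum the NumCPUs= fields found in `scontrol show job` output read from stdin.
# sum_job_cpus and node_alloc_cpus are made-up helper names.
sum_job_cpus() {
    grep -o 'NumCPUs=[0-9]*' | cut -d= -f2 | awk '{ total += $1 } END { print total + 0 }'
}

# Extract the cpu count from the AllocTRES line of `scontrol show node` output.
node_alloc_cpus() {
    sed -n 's/.*AllocTRES=cpu=\([0-9]*\).*/\1/p'
}

# Sample data copied from the message above (jobs 413, 414 and 417).
jobs_sample='NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*'
node_sample='AllocTRES=cpu=60'

printf '%s\n' "$jobs_sample" | sum_job_cpus      # prints 100
printf '%s\n' "$node_sample" | node_alloc_cpus   # prints 60
```

On a live system the same functions could be fed from `scontrol show job` and `scontrol show node $HOSTNAME` instead of the sample strings.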
Re: [slurm-users] Single Node cluster. How to manage oversubscribing
Hi Doug, Again, many thanks for your detailed response. Based on my understanding of your previous note, I did the following: I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 and the partitions with oversubscribe=force:2 then I put further restrictions with the default qos to MaxTRESPerNode:cpu=32, MaxJobsPU=MaxSubmit=2 That way, no single user can request more than 2 X 32 cores legally. I launched two jobs, sbatch -n 32 each as one user. They started running immediately, taking up all 64 cores. Then I logged in as another user and launched the same job with sbatch -n 2. To my dismay, it started to run! Shouldn't slurm have figured out that all 64 cores were occupied and queued the -n 2 job to pending? AR On Sun, 26 Feb 2023 at 02:18, Doug Meyer wrote: > Hi, > > You got me, I didn't know that " oversubscribe=FORCE:2" is an option. > I'll need to explore that. > > I missed the question about srun. srun is the preferred I believe. I am > not associated with drafting the submit scripts but can ask my peer. You > do need to stipulate the number of cores you want. Your "sbatch -n 1" > should be changed to the number of MPI ranks you desire. > > As good as slurm is, many come to assume it does far more than it does. I > explain slurm as a maître d' in a very exclusive restaurant, aware of every > table and the resources they afford. When a reservation is placed, a job > submitted, a review of the request versus the resources matches the > pending guest/job against the resources and when the other diners/jobs are > expected to finish. If a guest requests resources that are not available > in the restaurant, the reservation is denied. If a guest arrives and does > not need all the resources, the place settings requested but unused are > left in reservation until the job finishes. Slurm manages requests against > an inventory. Without enforcement, a job that requests 1 core but uses 12 > will run. 
If your 64 core system accepts 64 single core reservations, > slurm believing 64 cores are needed, 64 jobs wll start. and then the wait > staff (the OS) is left to deal with 768 tasks running on 64 cores. It > becomes a sad comedy as the system will probably run out of RAM triggering > OOM killer or just run horribly slow. Never assume slurm is going to > prevent bad actors once they begin running unless you have configured it to > do so. > > We run a very lax environment. We set a standard of 6 GB per job unless > the sbatch declares otherwise and a max runtime default. Without an > estimated runtime to work with the backfill scheduler is crippled. In an > environment mixing single thread and MPI jobs of various sizes it is > critical the jobs are honest in their requirements providing slurm the > information needed to correctly assign resources. > > Doug > > On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy > wrote: > >> Hi, >> >> Thanks for your considered response. Couple of questions linger... >> >> On Sat, 25 Feb 2023 at 21:46, Doug Meyer wrote: >> >>> Hi, >>> >>> Declaring cores=64 will absolutely work but if you start running MPI >>> you'll want a more detailed config description. The easy way to read it is >>> "128=2 sockets * 32 corespersocket * 2 threads per core". >>> >>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 >>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100 >>> >>> But if you just want to work with logical cores the "cpus=128" will work. >>> >>> If you go with the more detailed description then you need to declare >>> oversubscription (hyperthreading) in the partition declaration. >>> >> >> >> Yeah, I'll try that. >> >> >>> By default slurm will not let two different jobs share the logical cores >>> comprising a physical core. For example if Sue has an Array of 1-1000 her >>> array tasks could each take a logical core on a physical core. But if >>> Jamal is also running they would not be able to share the physical core. 
>>> (as I understand it). >>> >>> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2 >>> MaxTime=Infinite State=Up AllowAccounts=cowboys >>> >>> >>> In the sbatch/srun the user needs to add a declaration >>> "oversubscribe=yes" telling slurm the job can run on both logical cores >>> available. >>> >> >> How about setting oversubscribe=FORCE:2? That way, users need not add a >> setting in their scripts. >> >> >> >> >>> In the days on Knight's Landing e
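The node descriptions quoted in this message follow the rule Doug spells out: logical CPUs = sockets x cores per socket x threads per core. A tiny sketch for checking a NodeName line before editing slurm.conf is given here; `logical_cpus` is a hypothetical helper, and the numbers are the ones quoted in the thread.

```shell
#!/bin/bash
# CPUs should equal Sockets * CoresPerSocket * ThreadsPerCore.
# logical_cpus is a made-up helper name, not a Slurm command.
logical_cpus() {
    local sockets=$1 cores_per_socket=$2 threads_per_core=$3
    echo $(( sockets * cores_per_socket * threads_per_core ))
}

logical_cpus 2 16 2   # the node in this thread          -> prints 64
logical_cpus 2 32 2   # Doug's hpc[306-308] example line -> prints 128
```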
Re: [slurm-users] Single Node cluster. How to manage oversubscribing
Hi, Thanks for your considered response. Couple of questions linger... On Sat, 25 Feb 2023 at 21:46, Doug Meyer wrote: > Hi, > > Declaring cores=64 will absolutely work but if you start running MPI > you'll want a more detailed config description. The easy way to read it is > "128=2 sockets * 32 corespersocket * 2 threads per core". > > NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 > ThreadsPerCore=2 RealMemory=512000 TmpDisk=100 > > But if you just want to work with logical cores the "cpus=128" will work. > > If you go with the more detailed description then you need to declare > oversubscription (hyperthreading) in the partition declaration. > Yeah, I'll try that. > By default slurm will not let two different jobs share the logical cores > comprising a physical core. For example if Sue has an Array of 1-1000 her > array tasks could each take a logical core on a physical core. But if > Jamal is also running they would not be able to share the physical core. > (as I understand it). > > PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2 > MaxTime=Infinite State=Up AllowAccounts=cowboys > > > In the sbatch/srun the user needs to add a declaration "oversubscribe=yes" > telling slurm the job can run on both logical cores available. > How about setting oversubscribe=FORCE:2? That way, users need not add a setting in their scripts. > In the days on Knight's Landing each core could handle four logical cores > but I don't believe there are any current AMD or Intel processors > supporting more then two logical cores (hyperthreads per core). The > conversation about hyperthreads is difficult as the Intel terminology is > logical cores for hyperthreading and cores for physical cores but the > tendency is to call the logical cores threads or hyperthreaded cores. This > can be very confusing for consumers of the resources. 
> > > In any case, if you create an array job of 1-100 sleep jobs, my simplest > logical test job, then you can use scontrol show node to see the > nodes resource configuration as well as consumption. squeue -w > -i 10 will iteratate every ten seconds to show you the node chomping > through the job. > > > Hope this helps. Once you are comfortable I would urge you to use the > NodeName/Partition descriptor format above and encourage your users to > declare oversubscription in their jobs. It is a little more work up front > but far easier than correcting scripts later. > > > Doug > > > > > > On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy > wrote: > >> Howdy, and thanks for the warm welcome, >> >> On Fri, 24 Feb 2023 at 07:31, Doug Meyer wrote: >> >>> Hi, >>> >>> Did you configure your node definition with the outputs of slurmd -C? >>> Ignore boards. Don't know if it is still true but several years ago >>> declaring boards made things difficult. >>> >>> >> $ slurmd -C >> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 >> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 >> UpTime=0-00:47:51 >> $ grep NodeName /etc/slurm-llnl/slurm.conf >> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1 >> >> There is a difference. I, too, discarded the Boards and sockets in >> slurmd.conf . Is that the problem? >> >> >> >> >> >> >> >>> Also, if you have hyperthreaded AMD or Intel processors your partition >>> declaration should be overscribe:2 >>> >>> >> Yes I do, It's actually 16 X 2 cores with hyperthreading, but the BIOS is >> set to show them as 64 cores. >> >> >> >> >>> Start with a very simple job with a script containing sleep 100 or >>> something else without any runtime issues. >>> >>> >> I ran this MPI hello world thing >> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c>with >> this sbatch script. 
>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch> >> Should be the same thing as your suggestion, basically. >> Should I switch to 'srun' in the batch file? >> >> AR >> >> >>> When I started with slurm I built the sbatch one small step at a time. >>> Nodes, cores. memory, partition, mail, etc >>> >>> It sounds like your config is very close but your problem may be in the >>> submit script. >>> >>> Best of luck and welcome to slurm. It is very powerful with a hu
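Doug's suggested smoke test in this message, an array of 1-100 sleep jobs watched with squeue and scontrol, could look roughly like the sketch below. The file name, job name and time limit are illustrative assumptions; only the array range and the sleep come from the thread.

```shell
#!/bin/bash
# Write a throwaway array job: 100 tasks, each sleeping 100 seconds on one CPU.
# The file name, job name and time limit are illustrative assumptions.
cat > sleep_array.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=sleep-test
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
sleep 100
EOF

# On a machine with Slurm installed, submit and watch (not run here):
#   sbatch sleep_array.sbatch
#   squeue -w "$HOSTNAME" -i 10
#   scontrol show node "$HOSTNAME"
```

Because every array task asks for exactly one CPU, the AllocTRES value on the node should never climb past the CPU count if the partition is not oversubscribing.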
Re: [slurm-users] Single Node cluster. How to manage oversubscribing
Howdy, and thanks for the warm welcome,

On Fri, 24 Feb 2023 at 07:31, Doug Meyer wrote:

> Hi,
>
> Did you configure your node definition with the outputs of slurmd -C?
> Ignore boards. Don't know if it is still true but several years ago
> declaring boards made things difficult.

    $ slurmd -C
    NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
    UpTime=0-00:47:51
    $ grep NodeName /etc/slurm-llnl/slurm.conf
    NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1

There is a difference. I, too, discarded the Boards and sockets in slurm.conf. Is that the problem?

> Also, if you have hyperthreaded AMD or Intel processors your partition
> declaration should be overscribe:2

Yes I do. It's actually 16 X 2 cores with hyperthreading, but the BIOS is set to show them as 64 cores.

> Start with a very simple job with a script containing sleep 100 or
> something else without any runtime issues.

I ran this MPI hello world thing <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c> with this sbatch script <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>.
Should be the same thing as your suggestion, basically.
Should I switch to 'srun' in the batch file?

AR

> When I started with slurm I built the sbatch one small step at a time.
> Nodes, cores. memory, partition, mail, etc
>
> It sounds like your config is very close but your problem may be in the
> submit script.
>
> Best of luck and welcome to slurm. It is very powerful with a huge
> community.
> > Doug > > > > On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy > wrote: > >> Hi folks, >> >> I have a single-node "cluster" running Ubuntu 20.04 LTS with the >> distribution packages for slurm (slurm-wlm 19.05.5) >> Slurm only ran one job in the node at a time with the default >> configuration, leaving all other jobs pending. >> This happened even if that one job only requested like a few cores (the >> node has 64 cores, and slurm.conf is configged accordingly). >> >> in slurm conf, SelectType is set to select/cons_res, and >> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. Path to file >> is referenced below. >> >> So I set OverSubscribe=FORCE in the partition config and restarted the >> daemons. >> >> Multiple jobs are now run concurrently, but when Slurm is oversubscribed, >> it is *truly* *oversubscribed*. That is to say, it runs so many jobs >> that there are more processes running than cores/threads. >> How should I config slurm so that it runs multiple jobs at once per node, >> but ensures that it doesn't run more processes than there are cores? Is >> there some TRES magic for this that I can't seem to figure out? 
>> >> My slurm.conf is here on github: >> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf >> The only gres I've set is for the GPU: >> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf >> >> Thanks for your attention, >> Regards, >> AR >> -- >> Analabha Roy >> Assistant Professor >> Department of Physics >> <http://www.buruniv.ac.in/academics/department/physics> >> The University of Burdwan <http://www.buruniv.ac.in/> >> Golapbag Campus, Barddhaman 713104 >> West Bengal, India >> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, >> hariseldo...@gmail.com >> Webpage: http://www.ph.utexas.edu/~daneel/ >> > -- Analabha Roy Assistant Professor Department of Physics <http://www.buruniv.ac.in/academics/department/physics> The University of Burdwan <http://www.buruniv.ac.in/> Golapbag Campus, Barddhaman 713104 West Bengal, India Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com Webpage: http://www.ph.utexas.edu/~daneel/
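The gap between the `slurmd -C` output and the NodeName line in slurm.conf, discussed in this message, can be closed mechanically. Below is a minimal sketch that takes the first line of `slurmd -C`, drops the Boards field as Doug advises, and appends the GPU gres from the existing config. Renaming SocketsPerBoard to Sockets is an assumption modelled on the format of Doug's example line, and `node_line_from_slurmd` is a hypothetical helper name.

```shell
#!/bin/bash
# Turn the first line of `slurmd -C` into a slurm.conf NodeName line without
# the Boards field. Renaming SocketsPerBoard to Sockets follows the format of
# Doug's example; node_line_from_slurmd is a made-up helper name.
node_line_from_slurmd() {
    sed -e 's/ Boards=[0-9]*//' -e 's/SocketsPerBoard=/Sockets=/'
}

# Sample input copied from the slurmd -C output in this message.
sample='NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311'
echo "$(printf '%s\n' "$sample" | node_line_from_slurmd) Gres=gpu:1"

# On the node itself (not run here):
#   slurmd -C | head -n 1 | node_line_from_slurmd
```

The printed line would replace the existing NodeName entry, followed by a restart of the daemons as described later in the thread.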
[slurm-users] Single Node cluster. How to manage oversubscribing
Hi folks,

I have a single-node "cluster" running Ubuntu 20.04 LTS with the distribution packages for slurm (slurm-wlm 19.05.5).
Slurm only ran one job in the node at a time with the default configuration, leaving all other jobs pending.
This happened even if that one job only requested a few cores (the node has 64 cores, and slurm.conf is configured accordingly).

In slurm.conf, SelectType is set to select/cons_res, and SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. Path to file is referenced below.

So I set OverSubscribe=FORCE in the partition config and restarted the daemons.

Multiple jobs are now run concurrently, but when Slurm is oversubscribed, it is *truly* *oversubscribed*. That is to say, it runs so many jobs that there are more processes running than cores/threads.
How should I config slurm so that it runs multiple jobs at once per node, but ensures that it doesn't run more processes than there are cores? Is there some TRES magic for this that I can't seem to figure out?

My slurm.conf is here on github:
https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
The only gres I've set is for the GPU:
https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf

Thanks for your attention,
Regards,
AR
--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/
Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results
Hi,

Thanks for the advice. I already tried out mana, but at present it only works with mpich, not openmpi, which is what I've set up via Ubuntu.

AR

On Sun, 19 Feb 2023, 02:10 Christopher Samuel wrote:

> On 2/10/23 11:06 am, Analabha Roy wrote:
>
> > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> > my cluster.
>
> If you're looking to try checkpointing MPI applications you may want to
> experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
> for DMTCP here: https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed on
> Cori (our older Cray system) for people to experiment with. The
> documentation for it may be useful to others who'd like to try it out -
> it's got a nice description of how it works too which even I as a
> non-programmer can understand.
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give advice
> on system related issues.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results
Hi,

On Mon, 13 Feb 2023, 13:04 Diego Zuccato wrote:

> Hi.
>
> I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
> jobs.

That's what I also suspected.

> Assuming you are interested in temporarily shutting down slurmctld
> node for maintenance.

Temporarily and daily.

> If the jobs are still queued ( == not yet running) what do you need to
> save? The queue order is dynamically adjusted by slurmctld based on the
> selected factors, there's nothing special to save.
> For the running jobs, OTOH, you have multiple solutions:
> 1) drain the cluster: safest but often impractical
> 2) checkpoint: seems fragile, expecially if jobs span multiple nodes

I just have one node, but the bigger problem with checkpointing is that GPUs don't seem to be supported.

> 3) have a second slurmd node (a small VM is sufficient) that takes over
> the cluster management when the master node is down (be *sure* the state
> dir is shared and quite fast!)

I've just got that one "node" for compute and login and storage and everything. It's a Tyrone server with 64 cores and a couplea raided hdds. Just wanna run some DFT/QM/MM simulations for myself and departmental colleagues, and do some exact diagonalization problems.

> 4) just hope you'll be able to recover the slurmctld node before a job
> completes *and* the timeouts expire

I booted into gparted live and beefed up the swap space to 200 gigs (the ram is 93 G). I've set up a mandatory (through qos settings) Slurm reservation that kills all running jobs in the normal qos after 8:30 pm every day, and a cron job that starts at 8:35 pm, drains the partitions, suspends all jobs running on elevated qos privileges, then hibernates the whole sumbich to swap. Another script runs whenever the fella comes outta hibernation, resets the slurm partitions and resumes the suspended jobs. It's an ugly jugaad, I know.
I guess it's tough noogies for the normal qos people if their jobs ran past the reservation or were not properly checkpointed before a blackout, but I don't see any other alternative. My department refuses to let me run my thingie 24/7, and power outages occur frequently round here. I'm concerned about implementing a failsafe in case this Rube Goldberg like setup takes a hard left. Was thinking about a systemd service that kills all running jobs, then simply runs "scontrol shutdown" to preserve the state of queued jobs and then resumes a regular system shutdown. In that case, automatic checkpointing of the jobs with dmtcp/mana would be cool, and I was encouraged when chatgpt claimed that slurm supported this. But the recent docs don't corroborate this claim,so I guess it got deprecated or something... > While 4 is relatively risky (you could end up with runaway jobs that > you'll have to fix afterwards), it does not directly impact users: their > jobs will run and complete/fail regardless of slurmctld state. At most > the users won't receive a completion mail and they will be billed less > than expected. > > Diego > > Il 10/02/2023 20:06, Analabha Roy ha scritto: > > Hi, > > > > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in > > my cluster. On a whim, I logged into ChatGPT and asked the AI about it. > > It told me things that I couldn't find in the current version of the > > SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce > > the > > contents of my chat session in my GitHub repository for peer review and > > commentary by you fine folks. > > > > https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md > > <https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md> > > > > I apologize for the poor formatting. I did this in a hurry, and my > > knowledge of markdown is rudimentary. > > > > Please do comment on the veracity and reliability of the AI's response. 
> > > > AR > > > > -- > > Analabha Roy > > Assistant Professor > > Department of Physics > > <http://www.buruniv.ac.in/academics/department/physics> > > The University of Burdwan <http://www.buruniv.ac.in/> > > Golapbag Campus, Barddhaman 713104 > > West Bengal, India > > Emails: dan...@utexas.edu <mailto:dan...@utexas.edu>, > > a...@phys.buruniv.ac.in <mailto:a...@phys.buruniv.ac.in>, > > hariseldo...@gmail.com <mailto:hariseldo...@gmail.com> > > Webpage: http://www.ph.utexas.edu/~daneel/ > > <http://www.ph.utexas.edu/~daneel/> > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > >
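The evening routine described in this message (drain the partition, suspend the jobs that are allowed to survive, hibernate, then undo it all on wake-up) can be sketched as a pair of shell functions that only print the commands they would run, so the plan can be inspected before anything is executed. This is a guess at the shape of such a script, not the author's actual one: `hibernate_plan` and `resume_plan` are made-up names, and the partition name and job ID are placeholders borrowed from squeue output elsewhere in this archive.

```shell
#!/bin/bash
# Print (not execute) the commands for the evening hibernate step:
# drain the partition, suspend the given jobs, hibernate the machine.
# hibernate_plan and resume_plan are made-up helper names; the partition
# name and job ID used below are placeholders.
hibernate_plan() {
    local partition=$1 jobid
    shift
    echo "scontrol update PartitionName=$partition State=DRAIN"
    for jobid in "$@"; do
        echo "scontrol suspend $jobid"
    done
    echo "systemctl hibernate"
}

# Counterpart for the wake-up script: reopen the partition, resume the jobs.
resume_plan() {
    local partition=$1 jobid
    shift
    echo "scontrol update PartitionName=$partition State=UP"
    for jobid in "$@"; do
        echo "scontrol resume $jobid"
    done
}

hibernate_plan CPU 417
resume_plan CPU 417
```

On a live system the job IDs could come from something like `squeue -h -t RUNNING -o %i`, and the printed lines could be piped to `sh` once they look right. Keeping the plan as text also makes it easy to log what was suspended, so that the wake-up script resumes exactly those jobs.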
[slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results
Hi,

I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in my cluster. On a whim, I logged into ChatGPT and asked the AI about it. It told me things that I couldn't find in the current version of the SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce the contents of my chat session in my GitHub repository for peer review and commentary by you fine folks.

https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md

I apologize for the poor formatting. I did this in a hurry, and my knowledge of markdown is rudimentary.

Please do comment on the veracity and reliability of the AI's response.

AR
--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/
Re: [slurm-users] [External] Hibernating a whole cluster
Howdy, On Tue, 7 Feb 2023 at 20:18, Sean Mc Grath wrote: > Hi Analabha, > > Yes, unfortunately for your needs, I expect a time limited reservation > along my suggestion would not accept jobs that would be scheduled to end > outside of the reservations availability times. I'd suggest looking at > check-pointing in this case, e.g. with DMTCP: Distributed MultiThreaded > Checkpointing, http://dmtcp.sourceforge.net/. That could allow jobs to > have their state saved and then re-loaded when they are started again. > > Checkpointing sounds intriguing. Many thanks for the suggestion. A bit of googling turned up this cluster page <https://docs.nersc.gov/development/checkpoint-restart/dmtcp/>where they've set it up to work with slurm. However, I also noticed this presentation <https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf>hosted on the slurm website that indicates that DMTCP doesn't work with containers, and the other checkpointing tools that do support containers don't support MPI. I also took a gander at CRIU <https://criu.org/Main_Page>, but this paper <https://www.ijecs.in/index.php/ijecs/article/download/4122/3855/8058> indicates that it too, has similar limitations, and BLCR seems to have died <https://hpc.rz.rptu.de/documentation/checkpoint_blcr.html>. Unless some or all of this information is dated or obsolete, these drawbacks would be deal-breakers, since most of us have been spoiled by containerization, and MPI is, of course, bread and butter for all. I'd be mighty grateful for any other insights regarding my predicament. In the meantime, I'm going to give the ugly hack of launching scontrol suspend-resume scripts a whirl. AR > Best > > Sean > > --- > Sean McGrath > Senior Systems Administrator, IT Services > > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Tuesday 7 February 2023 12:14 > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster > > Hi Sean, > > Thanks for your awesome suggestion! 
I'm going through the reservation docs > now. At first glance, it seems like a daily reservation would turn down > jobs that are too big for the reservation. It'd be nice if > slurm could suspend (in the manner of 'scontrol suspend') jobs during > reserved downtime and resume them after. That way, folks can submit large > jobs without having to worry about the downtimes. Perhaps the FLEX option > in reservations can accomplish this somehow? > > > I suppose that I can do it using a shell script iterator and a cron job, > but that seems like an ugly hack. I was hoping if there is a way to config > this in slurm itself? > > AR > > On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath wrote: > > Hi Analabha, > > Could you do something like create a daily reservation for 8 hours that > starts at 9am, or whatever times work for you like the following untested > command: > > scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 > flags=daily ReservationName=daily > > Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY > > Some more possible helpful documentation at > https://slurm.schedmd.com/reservations.html, search for "daily". > > My idea being that jobs can only run in that reservation, (that would have > to be configured separately, not sure how from the top of my head), which > is only active during the times you want the node to be working. So the > cronjob that hibernates/shuts it down will do so when there are no jobs > running. At least in theory. > > Hope that helps. > > Sean > > --- > Sean McGrath > Senior Systems Administrator, IT Services > > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Tuesday 7 February 2023 10:05 > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster > > Hi, > > Thanks. I had read the Slurm Power Saving Guide before. I believe the > configs enable slurmctld to check other nodes for idleness and > suspend/resume them. 
Slurmctld must run on a separate, always-on server for > this to work, right? > > My issue might be a little different. I literally have only one node that > runs everything: slurmctld, slurmd, slurmdbd, everything. > > This node must be set to "sudo systemctl hibernate"after business hours, > regardless of whether jobs are queued or running. The next business day, it > can be switched on manually. > > systemctl hibernate is supposed to save the entire run state of the sole > node to swap and poweroff. When powered on again, it should restore > everything to its previous running state. > > When the job queue
Re: [slurm-users] [External] Hibernating a whole cluster
On Tue, 7 Feb 2023, 18:12 Diego Zuccato, wrote: > RAM used by a suspended job is not released. At most it can be swapped > out (if enough swap is available). > There should be enough swap available. I have 93 gigs of Ram and as big a swap partition. I can top it off with swap files if needed. > > Il 07/02/2023 13:14, Analabha Roy ha scritto: > > Hi Sean, > > > > Thanks for your awesome suggestion! I'm going through the reservation > > docs now. At first glance, it seems like a daily reservation would turn > > down jobs that are too big for the reservation. It'd be nice if > > slurm could suspend (in the manner of 'scontrol suspend') jobs during > > reserved downtime and resume them after. That way, folks can submit > > large jobs without having to worry about the downtimes. Perhaps the FLEX > > option in reservations can accomplish this somehow? > > > > > > I suppose that I can do it using a shell script iterator and a cron job, > > but that seems like an ugly hack. I was hoping if there is a way to > > config this in slurm itself? > > > > AR > > > > On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath > <mailto:smcg...@tcd.ie>> wrote: > > > > Hi Analabha, > > > > Could you do something like create a daily reservation for 8 hours > > that starts at 9am, or whatever times work for you like the > > following untested command: > > > > scontrol create reservation starttime=09:00:00 duration=8:00:00 > > nodecnt=1 flags=daily ReservationName=daily > > > > Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY > > <https://slurm.schedmd.com/scontrol.html#OPT_DAILY> > > > > Some more possible helpful documentation at > > https://slurm.schedmd.com/reservations.html > > <https://slurm.schedmd.com/reservations.html>, search for "daily". > > > > My idea being that jobs can only run in that reservation, (that > > would have to be configured separately, not sure how from the top of > > my head), which is only active during the times you want the node to > > be working. 
So the cronjob that hibernates/shuts it down will do so > > when there are no jobs running. At least in theory. > > > > Hope that helps. > > > > Sean > > > > --- > > Sean McGrath > > Senior Systems Administrator, IT Services > > > > > > > *From:* slurm-users > <mailto:slurm-users-boun...@lists.schedmd.com>> on behalf of > > Analabha Roy mailto:hariseldo...@gmail.com > >> > > *Sent:* Tuesday 7 February 2023 10:05 > > *To:* Slurm User Community List > <mailto:slurm-users@lists.schedmd.com>> > > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster > > Hi, > > > > Thanks. I had read the Slurm Power Saving Guide before. I believe > > the configs enable slurmctld to check other nodes for idleness and > > suspend/resume them. Slurmctld must run on a separate, always-on > > server for this to work, right? > > > > My issue might be a little different. I literally have only one node > > that runs everything: slurmctld, slurmd, slurmdbd, everything. > > > > This node must be set to "sudo systemctl hibernate"after business > > hours, regardless of whether jobs are queued or running. The next > > business day, it can be switched on manually. > > > > systemctl hibernate is supposed to save the entire run state of the > > sole node to swap and poweroff. When powered on again, it should > > restore everything to its previous running state. > > > > When the job queue is empty, this works well. I'm not sure how well > > this hibernate/resume will work with running jobs and would > > appreciate any suggestions or insights. > > > > AR > > > > > > On Tue, 7 Feb 2023 at 01:39, Florian Zillner > <mailto:fzill...@lenovo.com>> wrote: > > > > Hi, > > > > follow this guide: https://slurm.schedmd.com/power_save.html > > <https://slurm.schedmd.com/power_save.html> > > > > Create poweroff / poweron scripts and configure slurm to do the > > poweroff after X minutes. Works well for us. 
Make sure to set an > > appropriate time (ResumeTimeout) to allow the node to come back > > to service. > >
Re: [slurm-users] [External] Hibernating a whole cluster
Hi Sean, Thanks for your awesome suggestion! I'm going through the reservation docs now. At first glance, it seems like a daily reservation would turn down jobs that are too big for the reservation. It'd be nice if slurm could suspend (in the manner of 'scontrol suspend') jobs during reserved downtime and resume them after. That way, folks can submit large jobs without having to worry about the downtimes. Perhaps the FLEX option in reservations can accomplish this somehow? I suppose that I can do it using a shell script iterator and a cron job, but that seems like an ugly hack. I was hoping if there is a way to config this in slurm itself? AR On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath wrote: > Hi Analabha, > > Could you do something like create a daily reservation for 8 hours that > starts at 9am, or whatever times work for you like the following untested > command: > > scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1 > flags=daily ReservationName=daily > > Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY > > Some more possible helpful documentation at > https://slurm.schedmd.com/reservations.html, search for "daily". > > My idea being that jobs can only run in that reservation, (that would have > to be configured separately, not sure how from the top of my head), which > is only active during the times you want the node to be working. So the > cronjob that hibernates/shuts it down will do so when there are no jobs > running. At least in theory. > > Hope that helps. > > Sean > > --- > Sean McGrath > Senior Systems Administrator, IT Services > > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Tuesday 7 February 2023 10:05 > *To:* Slurm User Community List > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster > > Hi, > > Thanks. I had read the Slurm Power Saving Guide before. I believe the > configs enable slurmctld to check other nodes for idleness and > suspend/resume them. 
Slurmctld must run on a separate, always-on server for > this to work, right? > > My issue might be a little different. I literally have only one node that > runs everything: slurmctld, slurmd, slurmdbd, everything. > > This node must be set to "sudo systemctl hibernate"after business hours, > regardless of whether jobs are queued or running. The next business day, it > can be switched on manually. > > systemctl hibernate is supposed to save the entire run state of the sole > node to swap and poweroff. When powered on again, it should restore > everything to its previous running state. > > When the job queue is empty, this works well. I'm not sure how well this > hibernate/resume will work with running jobs and would appreciate any > suggestions or insights. > > AR > > > On Tue, 7 Feb 2023 at 01:39, Florian Zillner wrote: > > Hi, > > follow this guide: https://slurm.schedmd.com/power_save.html > > Create poweroff / poweron scripts and configure slurm to do the poweroff > after X minutes. Works well for us. Make sure to set an appropriate time > (ResumeTimeout) to allow the node to come back to service. > Note that we did not achieve good power saving with suspending the nodes, > powering them off and on saves way more power. The downside is it takes ~ 5 > mins to resume (= power on) the nodes when needed. > > Cheers, > Florian > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Monday, 6 February 2023 18:21 > *To:* slurm-users@lists.schedmd.com > *Subject:* [External] [slurm-users] Hibernating a whole cluster > > Hi, > > I've just finished setup of a single node "cluster" with slurm on ubuntu > 20.04. Infrastructural limitations prevent me from running it 24/7, and > it's only powered on during business hours. > > > Currently, I have a cron job running that hibernates that sole node before > closing time. > > The hibernation is done with standard systemd, and hibernates to the swap > partition. > > I have not run any lengthy slurm jobs on it yet. 
Before I do, can I get > some thoughts on a couple of things? > > If it hibernated when slurm still had jobs running/queued, would they > resume properly when the machine powers back on? > > Note that my swap space is bigger than my RAM. > > Is it necessary to perhaps setup a pre-hibernate script for systemd to > iterate scontrol to suspend all the jobs before hibernating and resume them > post-resume? > > What about the wall times? I'm uessing that slurm will count the downtime > as elapsed for each job. Is there a way to config this, or is the only > alternative a post-hibernate
Re: [slurm-users] [External] Hibernating a whole cluster
Hi, Thanks. I had read the Slurm Power Saving Guide before. I believe the configs enable slurmctld to check other nodes for idleness and suspend/resume them. Slurmctld must run on a separate, always-on server for this to work, right? My issue might be a little different. I literally have only one node that runs everything: slurmctld, slurmd, slurmdbd, everything. This node must be set to "sudo systemctl hibernate" after business hours, regardless of whether jobs are queued or running. The next business day, it can be switched on manually. systemctl hibernate is supposed to save the entire run state of the sole node to swap and power off. When powered on again, it should restore everything to its previous running state. When the job queue is empty, this works well. I'm not sure how well this hibernate/resume will work with running jobs and would appreciate any suggestions or insights. AR On Tue, 7 Feb 2023 at 01:39, Florian Zillner wrote: > Hi, > > follow this guide: https://slurm.schedmd.com/power_save.html > > Create poweroff / poweron scripts and configure slurm to do the poweroff > after X minutes. Works well for us. Make sure to set an appropriate time > (ResumeTimeout) to allow the node to come back to service. > Note that we did not achieve good power saving with suspending the nodes, > powering them off and on saves way more power. The downside is it takes ~ 5 > mins to resume (= power on) the nodes when needed. > > Cheers, > Florian > ------ > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Monday, 6 February 2023 18:21 > *To:* slurm-users@lists.schedmd.com > *Subject:* [External] [slurm-users] Hibernating a whole cluster > > Hi, > > I've just finished setup of a single node "cluster" with slurm on ubuntu > 20.04. Infrastructural limitations prevent me from running it 24/7, and > it's only powered on during business hours. > > > Currently, I have a cron job running that hibernates that sole node before > closing time. 
> > The hibernation is done with standard systemd, and hibernates to the swap > partition. > > I have not run any lengthy slurm jobs on it yet. Before I do, can I get > some thoughts on a couple of things? > > If it hibernated when slurm still had jobs running/queued, would they > resume properly when the machine powers back on? > > Note that my swap space is bigger than my RAM. > > Is it necessary to perhaps setup a pre-hibernate script for systemd to > iterate scontrol to suspend all the jobs before hibernating and resume them > post-resume? > > What about the wall times? I'm uessing that slurm will count the downtime > as elapsed for each job. Is there a way to config this, or is the only > alternative a post-hibernate script that iteratively updates the wall times > of the running jobs using scontrol again? > > Thanks for your attention. > Regards > AR > -- Analabha Roy Assistant Professor Department of Physics <http://www.buruniv.ac.in/academics/department/physics> The University of Burdwan <http://www.buruniv.ac.in/> Golapbag Campus, Barddhaman 713104 West Bengal, India Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com Webpage: http://www.ph.utexas.edu/~daneel/
[slurm-users] Hibernating a whole cluster
Hi,

I've just finished setting up a single-node "cluster" with Slurm on Ubuntu 20.04. Infrastructural limitations prevent me from running it 24/7, and it's only powered on during business hours.

Currently, I have a cron job that hibernates that sole node before closing time. The hibernation is done with standard systemd and hibernates to the swap partition.

I have not run any lengthy Slurm jobs on it yet. Before I do, can I get some thoughts on a couple of things?

If it hibernates while Slurm still has jobs running or queued, will they resume properly when the machine powers back on? Note that my swap space is bigger than my RAM.

Is it necessary to set up a pre-hibernate script for systemd that iterates over the jobs with scontrol to suspend them all before hibernating, and resumes them afterwards?

What about the wall times? I'm guessing that Slurm will count the downtime as elapsed for each job. Is there a way to configure this, or is the only alternative a post-hibernate script that iteratively updates the wall times of the running jobs using scontrol again?

Thanks for your attention.

Regards
AR
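For what it's worth, the pre-/post-hibernate idea in this message could be sketched as a systemd sleep hook. This is a minimal, untested sketch and not a confirmed recipe. The file name slurm-jobs and the helper functions jobs_in_state and apply_to_jobs are made up for illustration. It assumes the file is installed executable under /usr/lib/systemd/system-sleep/, where systemd-sleep calls it with "pre" or "post" as the first argument and the sleep action as the second, and that it runs as root so that scontrol is allowed to suspend jobs.

```shell
#!/bin/sh
# Hypothetical hook: /usr/lib/systemd/system-sleep/slurm-jobs
# systemd-sleep passes $1 = pre|post and $2 = the sleep action (e.g. hibernate).

# Print the IDs of all jobs in the given state, one per line.
jobs_in_state() {
    squeue --noheader --states="$1" --format=%A
}

# Run "scontrol <action> <jobid>" for every job currently in <state>.
apply_to_jobs() {
    action=$1
    state=$2
    for jobid in $(jobs_in_state "$state"); do
        scontrol "$action" "$jobid"
    done
}

# Suspend running jobs before hibernating, resume suspended jobs afterwards.
# Any other argument combination (or none) does nothing.
case "${1:-}/${2:-}" in
    pre/hibernate)  apply_to_jobs suspend RUNNING ;;
    post/hibernate) apply_to_jobs resume SUSPENDED ;;
esac
```

Note that the post step resumes every suspended job, including any that were suspended on purpose before the hibernate; recording the IDs suspended in the pre step would avoid that. Whether running jobs (MPI ones in particular) actually survive the hibernate still has to be tested on the machine itself.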
Re: [slurm-users] Enforce gpu usage limits (with GRES?)
Hi,

Thanks, your advice worked. I used sacctmgr to create a QOS called 'nogpu' and set MaxTRES=gres/gpu=0, then attached it to the CPU partition in slurm.conf as

PartitionName=CPU Nodes=ALL Default=Yes QOS=nogpu MaxTime=INFINITE State=UP

And it works! Trying to run GPU jobs in the CPU partition now fails. QOSes are nice!

The only thing is that the nogpu QOS has a priority of 0. Should it be higher? https://pastebin.com/VVsQAz6P

AR

On Fri, 3 Feb 2023 at 13:37, Markus Kötter wrote:
> Hi,
>
> limits ain't easy.
>
> https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmLimits.html#precedence
>
> I think there are multiple options, starting with not having GPU
> resources in the CPU partition.
>
> Or creating a QOS for the partition with
> MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0
> and attaching it to the CPU partition.
>
> And the configuration will require some values as well,
>
> # slurm.conf
> AccountingStorageEnforce=associations,limits,qos,safe
> AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:K80
>
> # cgroup.conf
> ConstrainDevices=yes
>
> most likely some others I miss.
>
> MfG
> --
> Markus Kötter, +49 681 870832434
> 30159 Hannover, Lange Laube 6
> Helmholtz Center for Information Security

--
Analabha Roy
Assistant Professor
Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/
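The QOS setup described in this message can be sketched roughly as follows. This is a minimal sketch reconstructed from the description, not the exact commands that were run. The function name setup_nogpu_qos is made up for illustration, and it assumes that accounting (slurmdbd) is running, that the caller has Slurm administrator rights, and that slurm.conf enforces limits as in the quoted AccountingStorageEnforce line.

```shell
# Create a QOS that allows zero GPUs per job and print it for inspection.
# $1 = QOS name (defaults to "nogpu", the name used in this thread).
setup_nogpu_qos() {
    qos=${1:-nogpu}
    # --immediate commits without the interactive confirmation prompt.
    sacctmgr --immediate add qos "$qos"
    sacctmgr --immediate modify qos "$qos" set MaxTRES=gres/gpu=0
    # Show the result, including the priority asked about above.
    sacctmgr show qos "$qos" format=Name,Priority,MaxTRES
}
```

Calling setup_nogpu_qos and then adding QOS=nogpu to the CPU partition line in slurm.conf (followed by a restart of slurmctld) should reproduce the behaviour reported here, where a job asking for --gres=gpu:1 in the CPU partition no longer runs.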
Re: [slurm-users] [ext] Enforce gpu usage limits (with GRES?)
Hi, Thanks for the reply. Yes, your advice helped! Much obliged. Not only was cgroups config necessary, but the option ConstrainDevices=yes in cgroup.conf was necessary to enforce the gpu gres. Now, not adding a gres parameter to srun causes gpu jobs to fail. An improvement! Although, I still can't keep out gpu jobs from the "CPU" partition. Is there a way to link a partition to a GRES or something? Alternatively, can I define two nodenames in slurm.conf that point to the same physical node, but only one of them has the gpu GRES? That way, I can link the GPU partition to the gres-configged nodename only. Thanks in advance, AR *PS*: If the slurm devs are reading this, may I suggest that perhaps it would be a good idea to add a reference to cgroups in the gres documentation page? On Thu, 2 Feb 2023 at 16:52, Holtgrewe, Manuel < manuel.holtgr...@bih-charite.de> wrote: > Hi, > > > if by "share the GPU" you mean exclusive allocation to a single job then, > I believe, you are missing cgroup configuration for isolating access to the > GPU. > > > Below the relevant parts (I believe) of our configuration. > > > There also is a way of time- and space-slice GPUs but I guess you should > get things setup without slicing. > > > I hope this helps. 
> > > Manuel > > > ==> /etc/slurm/cgroup.conf <== > # https://bugs.schedmd.com/show_bug.cgi?id=3701 > CgroupMountpoint="/sys/fs/cgroup" > CgroupAutomount=yes > AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf" > > ==> /etc/slurm/cgroup_allowed_devices_file.conf <== > /dev/null > /dev/urandom > /dev/zero > /dev/sda* > /dev/cpu/*/* > /dev/pts/* > /dev/nvidia* > > ==> /etc/slurm/slurm.conf <== > > ProctrackType=proctrack/cgroup > > # Memory is enforced via cgroups, so we should not do this here by [*] > # > # /etc/slurm/cgroup.conf: ConstrainRAMSpace=yes > # > # [*] https://bugs.schedmd.com/show_bug.cgi?id=5262 > JobAcctGatherParams=NoOverMemoryKill > > TaskPlugin=task/cgroup > > JobAcctGatherType=jobacct_gather/cgroup > > > -- > Dr. Manuel Holtgrewe, Dipl.-Inform. > Bioinformatician > Core Unit Bioinformatics – CUBI > Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in > the Helmholtz Association / Charité – Universitätsmedizin Berlin > > Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin > Postal Address: Chariteplatz 1, 10117 Berlin > > E-Mail: manuel.holtgr...@bihealth.de > Phone: +49 30 450 543 607 > Fax: +49 30 450 7 543 901 > Web: cubi.bihealth.org www.bihealth.org www.mdc-berlin.de > www.charite.de > -- > *From:* slurm-users on behalf of > Analabha Roy > *Sent:* Wednesday, February 1, 2023 6:12:40 PM > *To:* slurm-users@lists.schedmd.com > *Subject:* [ext] [slurm-users] Enforce gpu usage limits (with GRES?) > > Hi, > > I'm new to slurm, so I apologize in advance if my question seems basic. > > I just purchased a single node 'cluster' consisting of one 64-core cpu and > an nvidia rtx5k gpu (Turing architecture, I think). The vendor supplied it > with ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the > config to suit the needs of my department. 
> > I'm trying to bone up on GRES scheduling by reading this manual page > <https://slurm.schedmd.com/gres.html>, but am confused about some things. > > My slurm.conf file has the following lines put in it by the vendor: > > ### > # COMPUTE NODES > GresTypes=gpu > NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 > ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1 > #PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP > > PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE State=UP > > PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE State=UP > # > > So they created two partitions that are essentially identical. Secondly, > they put just the following line in gres.conf: > > ### > NodeName=shavak-DIT400TR-55L Name=gpuFile=/dev/nvidia0 > ### > > That's all. However, this configuration does not appear to constrain > anyone in any manner. As a regular user, I can still use srun or sbatch to > start GPU jobs from the "CPU partition," and nvidia-smi says that a simple > cupy <https://cupy.dev/> script that multiplies matrices and starts as an > sbatch job in the CPU partition can access the gpu just fine. Note that the > environment variable "CUDA_VISIBLE_DEVICES" does not appear to be set in > any job step. I tested this by starting an interactive srun shell i
[slurm-users] Enforce gpu usage limits (with GRES?)
Hi,

I'm new to Slurm, so I apologize in advance if my question seems basic.

I just purchased a single-node 'cluster' consisting of one 64-core CPU and an NVIDIA rtx5k GPU (Turing architecture, I think). The vendor supplied it with Ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the config to suit the needs of my department.

I'm trying to bone up on GRES scheduling by reading this manual page <https://slurm.schedmd.com/gres.html>, but am confused about some things.

My slurm.conf file has the following lines put in it by the vendor:

###
# COMPUTE NODES
GresTypes=gpu
NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE State=UP

PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE State=UP
#

So they created two partitions that are essentially identical. Secondly, they put just the following line in gres.conf:

###
NodeName=shavak-DIT400TR-55L Name=gpu File=/dev/nvidia0
###

That's all. However, this configuration does not appear to constrain anyone in any manner. As a regular user, I can still use srun or sbatch to start GPU jobs from the "CPU partition," and nvidia-smi says that a simple cupy <https://cupy.dev/> script that multiplies matrices and starts as an sbatch job in the CPU partition can access the GPU just fine. Note that the environment variable "CUDA_VISIBLE_DEVICES" does not appear to be set in any job step. I tested this by starting an interactive srun shell in both the CPU and GPU partitions and running "echo $CUDA_VISIBLE_DEVICES", and got bupkis for both.

What I need to do is constrain jobs to using chunks of GPU cores/RAM so that multiple jobs can share the GPU. As I understand from the gres manpage, simply adding "AutoDetect=nvml" (NVML should be installed with the NVIDIA HPC SDK, right? I installed it with apt-get...)
in gres.conf should allow Slurm to detect the GPU's internal specifications automatically. Is that all, or do I need to config an mps GRES as well? Will that succeed in jailing out the GPU from jobs that don't mention any gres parameters (perhaps by setting CUDA_VISIBLE_DEVICES), or is there any additional config for that? Do I really need that extra "GPU" partition that the vendor put in for any of this, or is there a way to bind GRES resources to a particular partition in such a way that simply launching jobs in that partition will be enough? Thanks for your attention. Regards AR -- Analabha Roy Assistant Professor Department of Physics <http://www.buruniv.ac.in/academics/department/physics> The University of Burdwan <http://www.buruniv.ac.in/> Golapbag Campus, Barddhaman 713104 West Bengal, India Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com Webpage: http://www.ph.utexas.edu/~daneel/
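The manual check described in this message (looking at CUDA_VISIBLE_DEVICES inside a job step in each partition) can be scripted roughly like this. It is a minimal sketch; the function name show_cuda_devices is made up for illustration, and the partition names are the ones from the slurm.conf quoted in the message.

```shell
# Print what CUDA_VISIBLE_DEVICES looks like inside a job step.
# $1 = partition name; any further arguments are passed on to srun
# (for example --gres=gpu:1).
show_cuda_devices() {
    partition=$1
    shift
    srun --partition="$partition" "$@" \
        sh -c 'echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"'
}
```

For example, run "show_cuda_devices CPU" and then "show_cuda_devices GPU --gres=gpu:1". Once GRES handling works, one would expect only the second call to report a device index, but that is an expectation, not something verified here.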