Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-03-02 Thread Analabha Roy
On Wed, 1 Mar 2023 at 07:51, Doug Meyer  wrote:

> Hi,
>
> I forgot one thing you didn't mention.  When you change the node
> descriptors and partitions you have to also restart slurmctld.  scontrol
> reconfigure works for the nodes but the main daemon has to be told to
> reread the config.  Until you restart the daemon it will be referencing the
> config from the last time it started.
>
>

Yeah, I restart all three daemons and run scontrol reconfigure after the
changes are done.

I think I've solved the problem. SLURM was double-counting the CPUs, so I
changed the OverSubscribe setting from FORCE:2 to FORCE:1
<https://github.com/hariseldon99/buparamshavak/commit/59fadd33e485ea8ab3b8d9a6c57381bfa9b89d72>
and now it's working. I launched four 32-core infinite-loop MPI test jobs;
Slurm started running two of them and queued the other two, as
required.
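
For reference, the relevant partition line now looks roughly like this (a
sketch from memory, not a verbatim copy of my slurm.conf):

PartitionName=CPU Nodes=shavak-DIT400TR-55L Default=YES OverSubscribe=FORCE:1 MaxTime=INFINITE State=UP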


Thanks for all the help. It was awesome.

AR





> Doug
>
> On Sun, Feb 26, 2023 at 10:25 PM Analabha Roy 
> wrote:
>
>> Hey,
>>
>>
>> Thanks for sticking with this.
>>
>> On Sun, 26 Feb 2023 at 23:43, Doug Meyer  wrote:
>>
>>> Hi,
>>>
>>> Suggest removing "Boards=1".  The docs say to include it, but in previous
>>> discussions with SchedMD we were advised to remove it.
>>>
>>>
>> I just did. Then ran scontrol reconfigure.
>>
>>
>>
>>> When your jobs are running, execute "scontrol show node " and look at
>>> the lines CfgTRES and AllocTRES.  The former is what the maître d'
>>> believes is available, the latter what has been allocated.
>>>
>>> Then "scontrol show job " looking down at the "NumNodes" line,
>>> which will show you what the job requested.
>>>
>>> I suspect there is a syntax error in the submit.
>>>
>>>
>> Okay. Now this is strange.
>>
>> First, I launched this job twice <https://pastebin.com/s21yXFH2>
>> This should take up 20 + 20 = 40 cores, because of the
>>
>>
>>1. #SBATCH -n 20  # Number of tasks
>>2. #SBATCH --cpus-per-task=1
>>
>>
>>
>> running scontrol show job on both jobids yields
>>
>>-NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>-NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>> Then, running scontrol on the node yields:
>>
>>
>>- scontrol show node $HOSTNAME
>>- CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
>>- AllocTRES=cpu=40
>>
>>
>> So far so good. Both show 40 cores allocated.
>>
>>
>>
>> However, if I now add another job with 60 cores
>> <https://pastebin.com/C0uW0Aut>, this happens:
>>
>> scontrol on the node:
>>
>> CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
>>AllocTRES=cpu=60
>>
>>
>> squeue
>>  JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
>>413   CPU   normaladmin  R  21:22  1
>> shavak-DIT400TR-55L
>>414   CPU   normaladmin  R  19:53  1
>> shavak-DIT400TR-55L
>>417   CPU elevatedadmin  R   1:31  1
>> shavak-DIT400TR-55L
>>
>> scontrol on the jobids:
>>
>> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep
>> NumCPUs
>>NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep
>> NumCPUs
>>NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>> admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep
>> NumCPUs
>>NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>
>> So there are 100 CPUs running, according to this, but 60 according to
>> scontrol on the node??
>>
>> The submission scripts are on pastebin:
>>
>> https://pastebin.com/s21yXFH2
>> https://pastebin.com/C0uW0Aut
>>
>>
>> AR
>>
>>
>>
>>
>>
>>
>>> Doug
>>>
>>>
>>> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy 
>>> wrote:
>>>
>>>> Hi Doug,
>>>>
>>>> Again, many thanks for your detailed response.
>>>> Based on my understanding of your previous note, I did the following:
>>>>
>>>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
>>>> CoresPerSocket=16 ThreadsPerCore=2
>>>>
>>>> and the partitions with overs

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Analabha Roy
Hey,


Thanks for sticking with this.

On Sun, 26 Feb 2023 at 23:43, Doug Meyer  wrote:

> Hi,
>
> Suggest removing "Boards=1".  The docs say to include it, but in previous
> discussions with SchedMD we were advised to remove it.
>
>
I just did. Then ran scontrol reconfigure.



> When your jobs are running, execute "scontrol show node " and look at
> the lines CfgTRES and AllocTRES.  The former is what the maître d'
> believes is available, the latter what has been allocated.
>
> Then "scontrol show job " looking down at the "NumNodes" line, which
> will show you what the job requested.
>
> I suspect there is a syntax error in the submit.
>
>
Okay. Now this is strange.

First, I launched this job twice <https://pastebin.com/s21yXFH2>
This should take up 20 + 20 = 40 cores, because of the


   1. #SBATCH -n 20  # Number of tasks
   2. #SBATCH --cpus-per-task=1
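
(For readers who can't open the pastebin links below: in outline, each submit
script looks something like this sketch; the binary name is just a placeholder
for the infinite-loop MPI test program.)

#!/bin/bash
#SBATCH -n 20               # Number of tasks
#SBATCH --cpus-per-task=1
#SBATCH -p CPU              # assumed: the CPU partition from slurm.conf
mpirun ./mpi_count          # placeholder binary name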



running scontrol show job on both jobids yields

   -NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   -NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

Then, running scontrol on the node yields:


   - scontrol show node $HOSTNAME
   - CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   - AllocTRES=cpu=40


So far so good. Both show 40 cores allocated.



However, if I now add another job with 60 cores
<https://pastebin.com/C0uW0Aut>, this happens:

scontrol on the node:

CfgTRES=cpu=64,mem=95311M,billing=64,gres/gpu=1
   AllocTRES=cpu=60


squeue
 JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
   413   CPU   normaladmin  R  21:22  1
shavak-DIT400TR-55L
   414   CPU   normaladmin  R  19:53  1
shavak-DIT400TR-55L
   417   CPU elevatedadmin  R   1:31  1
shavak-DIT400TR-55L

scontrol on the jobids:

admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 413|grep NumCPUs
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 414|grep NumCPUs
   NumNodes=1 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
admin@shavak-DIT400TR-55L:~/mpi_runs_inf$ scontrol show job 417|grep NumCPUs
   NumNodes=1 NumCPUs=60 NumTasks=60 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

So there are 100 CPUs running, according to this, but 60 according to
scontrol on the node??

The submission scripts are on pastebin:

https://pastebin.com/s21yXFH2
https://pastebin.com/C0uW0Aut


AR






> Doug
>
>
> On Sun, Feb 26, 2023 at 2:43 AM Analabha Roy 
> wrote:
>
>> Hi Doug,
>>
>> Again, many thanks for your detailed response.
>> Based on my understanding of your previous note, I did the following:
>>
>> I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2
>>
>> and the partitions with oversubscribe=force:2
>>
>> then I put further restrictions with the default qos
>> to MaxTRESPerNode:cpu=32, MaxJobsPU=MaxSubmit=2
>>
>> That way, no single user can request more than 2 X 32 cores legally.
>>
>> I launched two jobs, sbatch -n 32 each as one user. They started running
>> immediately, taking up all 64 cores.
>>
>> Then I logged in as another user and launched the same job with sbatch -n
>> 2. To my dismay, it started to run!
>>
>> Shouldn't slurm have figured out that all 64 cores were occupied and
>> queued the -n 2 job to pending?
>>
>> AR
>>
>>
>> On Sun, 26 Feb 2023 at 02:18, Doug Meyer  wrote:
>>
>>> Hi,
>>>
>>> You got me, I didn't know that " oversubscribe=FORCE:2" is an option.
>>> I'll need to explore that.
>>>
>>> I missed the question about srun.  srun is preferred, I believe.  I
>>> am not associated with drafting the submit scripts but can ask my peer.
>>> You do need to stipulate the number of cores you want.  Your "sbatch -n 1"
>>> should be changed to the number of MPI ranks you desire.
>>>
>>> As good as slurm is, many come to assume it does far more than it does.
>>> I explain slurm as a maître d' in a very exclusive restaurant, aware of
>>> every table and the resources they afford.  When a reservation is placed, a
>>> job submitted, a review of the request versus the resources matches the
>>> pending  guest/job against the resources and when the other diners/jobs are
>>> expected to finish.  If a guest requests resources that are not available
>>> in the restaurant, the reservation is denied.  If a guest arrives and does
>>> not need all the resources, the place settings requested but unused are
>>> left in reserva

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Analabha Roy
Hi Doug,

Again, many thanks for your detailed response.
Based on my understanding of your previous note, I did the following:

I set the nodename with CPUs=64 Boards=1 SocketsPerBoard=2
CoresPerSocket=16 ThreadsPerCore=2

and the partitions with oversubscribe=force:2

then I put further restrictions on the default QOS:
MaxTRESPerNode=cpu=32 and MaxJobsPU=MaxSubmit=2

That way, no single user can request more than 2 X 32 cores legally.
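
Roughly, those QOS limits were set with sacctmgr along these lines (a sketch,
assuming the default QOS is literally named 'normal', not the exact commands):

sacctmgr modify qos normal set MaxTRESPerNode=cpu=32 MaxJobsPerUser=2 MaxSubmitJobsPerUser=2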

I launched two jobs (sbatch -n 32 each) as one user. They started running
immediately, taking up all 64 cores.

Then I logged in as another user and launched the same job with sbatch -n
2. To my dismay, it started to run!

Shouldn't slurm have figured out that all 64 cores were occupied and left
the -n 2 job pending?

AR


On Sun, 26 Feb 2023 at 02:18, Doug Meyer  wrote:

> Hi,
>
> You got me, I didn't know that " oversubscribe=FORCE:2" is an option.
> I'll need to explore that.
>
> I missed the question about srun.  srun is preferred, I believe.  I am
> not associated with drafting the submit scripts but can ask my peer.  You
> do need to stipulate the number of cores you want.  Your "sbatch -n 1"
> should be changed to the number of MPI ranks you desire.
>
> As good as slurm is, many come to assume it does far more than it does.  I
> explain slurm as a maître d' in a very exclusive restaurant, aware of every
> table and the resources they afford.  When a reservation is placed, a job
> submitted, a review of the request versus the resources matches the
> pending  guest/job against the resources and when the other diners/jobs are
> expected to finish.  If a guest requests resources that are not available
> in the restaurant, the reservation is denied.  If a guest arrives and does
> not need all the resources, the place settings requested but unused are
> left in reservation until the job finishes.  Slurm manages requests against
> an inventory.  Without enforcement, a job that requests 1 core but uses 12
> will run.  If your 64 core system accepts 64 single core reservations,
> slurm believing 64 cores are needed, 64 jobs will start, and then the wait
> staff (the OS) is left to deal with 768 tasks running on 64 cores.  It
> becomes a sad comedy as the system will probably run out of RAM triggering
> OOM killer or just run horribly slow.  Never assume slurm is going to
> prevent bad actors once they begin running unless you have configured it to
> do so.
>
> We run a very lax environment.  We set a standard of 6 GB per job unless
> the sbatch declares otherwise and a max runtime default.  Without an
> estimated runtime to work with the backfill scheduler is crippled.  In an
> environment mixing single thread and MPI jobs of various sizes it is
> critical the jobs are honest in their requirements providing slurm the
> information needed to correctly assign resources.
>
> Doug
>
> On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy 
> wrote:
>
>> Hi,
>>
>> Thanks for your considered response. Couple of questions linger...
>>
>> On Sat, 25 Feb 2023 at 21:46, Doug Meyer  wrote:
>>
>>> Hi,
>>>
>>> Declaring cores=64 will absolutely work but if you start running MPI
>>> you'll want a more detailed config description.  The easy way to read it is
>>> "128=2 sockets * 32 corespersocket * 2 threads per core".
>>>
>>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>>
>>> But if you just want to work with logical cores the "cpus=128" will work.
>>>
>>> If you go with the more detailed description then you need to declare
>>> oversubscription (hyperthreading) in the partition declaration.
>>>
>>
>>
>> Yeah, I'll try that.
>>
>>
>>> By default slurm will not let two different jobs share the logical cores
>>> comprising a physical core.  For example if Sue has an Array of 1-1000 her
>>> array tasks could each take a logical core on a physical core.  But if
>>> Jamal is also running they would not be able to share the physical core.
>>> (as I understand it).
>>>
>>> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
>>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>>
>>>
>>> In the sbatch/srun the user needs to add a declaration
>>> "oversubscribe=yes" telling slurm the job can run on both logical cores
>>> available.
>>>
>>
>> How about setting oversubscribe=FORCE:2? That way, users need not add a
>> setting in their scripts.
>>
>>
>>
>>
>>> In the days on Knight's Landing e

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Analabha Roy
Hi,

Thanks for your considered response. Couple of questions linger...

On Sat, 25 Feb 2023 at 21:46, Doug Meyer  wrote:

> Hi,
>
> Declaring cores=64 will absolutely work but if you start running MPI
> you'll want a more detailed config description.  The easy way to read it is
> "128=2 sockets * 32 corespersocket * 2 threads per core".
>
> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>
> But if you just want to work with logical cores the "cpus=128" will work.
>
> If you go with the more detailed description then you need to declare
> oversubscription (hyperthreading) in the partition declaration.
>


Yeah, I'll try that.


> By default slurm will not let two different jobs share the logical cores
> comprising a physical core.  For example if Sue has an Array of 1-1000 her
> array tasks could each take a logical core on a physical core.  But if
> Jamal is also running they would not be able to share the physical core.
> (as I understand it).
>
> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
> MaxTime=Infinite State=Up AllowAccounts=cowboys
>
>
> In the sbatch/srun the user needs to add a declaration "oversubscribe=yes"
> telling slurm the job can run on both logical cores available.
>

How about setting oversubscribe=FORCE:2? That way, users need not add a
setting in their scripts.
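
To spell out the two alternatives as I understand them (a sketch): either each
job asks for it,

#SBATCH --oversubscribe

or the partition forces it for every job so users need not ask:

PartitionName=CPU Nodes=ALL OverSubscribe=FORCE:2 MaxTime=INFINITE State=UP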




> In the days of Knight's Landing each core could handle four logical cores,
> but I don't believe there are any current AMD or Intel processors
> supporting more than two logical cores (hyperthreads per core).  The
> conversation about hyperthreads is difficult as the Intel terminology is
> logical cores for hyperthreading and cores for physical cores but the
> tendency is to call the logical cores threads or hyperthreaded cores.  This
> can be very confusing for consumers of the resources.
>
>
> In any case, if you create an array job of 1-100 sleep jobs, my simplest
> logical test job, then you can use scontrol show node  to see the
> node's resource configuration as well as consumption.  squeue -w 
> -i 10 will iterate every ten seconds to show you the node chomping
> through the job.
>
>
> Hope this helps.  Once you are comfortable I would urge you to use the
> NodeName/Partition descriptor format above and encourage your users to
> declare oversubscription in their jobs.  It is a little more work up front
> but far easier than correcting scripts later.
>
>
> Doug
>
>
>
>
>
> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy 
> wrote:
>
>> Howdy, and thanks for the warm welcome,
>>
>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer  wrote:
>>
>>> Hi,
>>>
>>> Did you configure your node definition with the outputs of slurmd -C?
>>> Ignore boards.  Don't know if it is still true but several years ago
>>> declaring boards made things difficult.
>>>
>>>
>> $ slurmd -C
>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>> UpTime=0-00:47:51
>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>
>> There is a difference. I, too, discarded the Boards and sockets in
>> slurm.conf. Is that the problem?
>>
>>
>>
>>
>>
>>
>>
>>> Also, if you have hyperthreaded AMD or Intel processors your partition
>>> declaration should be oversubscribe:2
>>>
>>>
>> Yes, I do. It's actually 16 x 2 cores with hyperthreading, but the BIOS is
>> set to show them as 64 cores.
>>
>>
>>
>>
>>> Start with a very simple job with a script containing sleep 100 or
>>> something else without any runtime issues.
>>>
>>>
>> I ran this MPI hello world thing
>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c> with
>> this sbatch script.
>> <https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
>> Should be the same thing as your suggestion, basically.
>> Should I switch to 'srun' in the batch file?
>>
>> AR
>>
>>
>>> When I started with slurm I built the sbatch one small step at a time.
>>> Nodes, cores, memory, partition, mail, etc.
>>>
>>> It sounds like your config is very close but your problem may be in the
>>> submit script.
>>>
>>> Best of luck and welcome to slurm. It is very powerful with a hu

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Analabha Roy
Howdy, and thanks for the warm welcome,

On Fri, 24 Feb 2023 at 07:31, Doug Meyer  wrote:

> Hi,
>
> Did you configure your node definition with the outputs of slurmd -C?
> Ignore boards.  Don't know if it is still true but several years ago
> declaring boards made things difficult.
>
>
$ slurmd -C
NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
UpTime=0-00:47:51
$ grep NodeName /etc/slurm-llnl/slurm.conf
NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1

There is a difference. I, too, discarded the Boards and sockets in
slurm.conf. Is that the problem?
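
If it is, I suppose the fix is a NodeName line that mirrors slurmd -C, i.e.
something roughly like this (a sketch, keeping the Gres from my current line):

NodeName=shavak-DIT400TR-55L CPUs=64 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311 Gres=gpu:1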







> Also, if you have hyperthreaded AMD or Intel processors your partition
> declaration should be oversubscribe:2
>
>
Yes, I do. It's actually 16 x 2 cores with hyperthreading, but the BIOS is
set to show them as 64 cores.




> Start with a very simple job with a script containing sleep 100 or
> something else without any runtime issues.
>
>
I ran this MPI hello world thing
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count.c> with
this sbatch script.
<https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/usr/local/share/examples/mpi_runs_inf/mpi_count_normal.sbatch>
Should be the same thing as your suggestion, basically.
Should I switch to 'srun' in the batch file?

AR


> When I started with slurm I built the sbatch one small step at a time.
> Nodes, cores, memory, partition, mail, etc.
>
> It sounds like your config is very close but your problem may be in the
> submit script.
>
> Best of luck and welcome to slurm. It is very powerful with a huge
> community.
>
> Doug
>
>
>
> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy 
> wrote:
>
>> Hi folks,
>>
>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>> distribution packages for slurm (slurm-wlm 19.05.5)
>> Slurm only ran one job on the node at a time with the default
>> configuration, leaving all other jobs pending.
>> This happened even if that one job only requested a few cores (the
>> node has 64 cores, and slurm.conf is configured accordingly).
>>
>> In slurm.conf, SelectType is set to select/cons_res, and
>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path to
>> the file is referenced below.
>>
>> So I set OverSubscribe=FORCE in the partition config and restarted the
>> daemons.
>>
>> Multiple jobs are now run concurrently, but when Slurm is oversubscribed,
>> it is *truly* *oversubscribed*. That is to say, it runs so many jobs
>> that there are more processes running than cores/threads.
>> How should I config slurm so that it runs multiple jobs at once per node,
>> but ensures that it doesn't run more processes than there are cores? Is
>> there some TRES magic for this that I can't seem to figure out?
>>
>> My slurm.conf is here on github:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
>> The only gres I've set is for the GPU:
>> https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf
>>
>> Thanks for your attention,
>> Regards,
>> AR
>> --
>> Analabha Roy
>> Assistant Professor
>> Department of Physics
>> <http://www.buruniv.ac.in/academics/department/physics>
>> The University of Burdwan <http://www.buruniv.ac.in/>
>> Golapbag Campus, Barddhaman 713104
>> West Bengal, India
>> Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in,
>> hariseldo...@gmail.com
>> Webpage: http://www.ph.utexas.edu/~daneel/
>>
>

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/


[slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Analabha Roy
Hi folks,

I have a single-node "cluster" running Ubuntu 20.04 LTS with the
distribution packages for slurm (slurm-wlm 19.05.5)
Slurm only ran one job on the node at a time with the default
configuration, leaving all other jobs pending.
This happened even if that one job only requested a few cores (the
node has 64 cores, and slurm.conf is configured accordingly).

In slurm.conf, SelectType is set to select/cons_res, and
SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. The path to
the file is referenced below.

So I set OverSubscribe=FORCE in the partition config and restarted the
daemons.

Multiple jobs are now run concurrently, but when Slurm is oversubscribed,
it is *truly* *oversubscribed*. That is to say, it runs so many jobs that
there are more processes running than cores/threads.
How should I config slurm so that it runs multiple jobs at once per node,
but ensures that it doesn't run more processes than there are cores? Is
there some TRES magic for this that I can't seem to figure out?
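
For quick reference, the relevant pieces of that slurm.conf are roughly (a
sketch; the full file is linked below):

SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
PartitionName=CPU Nodes=ALL Default=YES OverSubscribe=FORCE MaxTime=INFINITE State=UP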

My slurm.conf is here on github:
https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/slurm.conf
The only gres I've set is for the GPU:
https://github.com/hariseldon99/buparamshavak/blob/main/shavak_root/etc/slurm-llnl/gres.conf

Thanks for your attention,
Regards,
AR
-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/


Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

2023-02-19 Thread Analabha Roy
Hi,

Thanks for the advice. I already tried out mana, but at present it only
works with mpich, not openmpi, which is what I've set up via Ubuntu.


AR


On Sun, 19 Feb 2023, 02:10 Christopher Samuel,  wrote:

> On 2/10/23 11:06 am, Analabha Roy wrote:
>
> > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> > my cluster.
>
> If you're looking to try checkpointing MPI applications you may want to
> experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
> for DMTCP here: https://github.com/mpickpt/mana
>
> We (NERSC) are collaborating with the developers and it is installed on
> Cori (our older Cray system) for people to experiment with. The
> documentation for it may be useful to others who'd like to try it out -
> it's got a nice description of how it works too which even I as a
> non-programmer can understand.
> https://docs.nersc.gov/development/checkpoint-restart/mana/
>
> Pay special attention to the caveats in our docs though!
>
> I've not used it myself, though I'm peripherally involved to give advice
> on system related issues.
>
> All the best,
> Chris
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>
>


Re: [slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

2023-02-18 Thread Analabha Roy
Hi,

On Mon, 13 Feb 2023, 13:04 Diego Zuccato,  wrote:

> Hi.
>
> I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
> jobs.


That's what I also suspected.



> Assuming you are interested in temporarily shutting down slurmctld
> node for maintenance.
>


Temporarily and daily.





>
>
>
>
>
>
> If the jobs are still queued ( == not yet running) what do you need to
> save? The queue order is dynamically adjusted by slurmctld based on the
> selected factors, there's nothing special to save.
> For the running jobs, OTOH, you have multiple solutions:
> 1) drain the cluster: safest but often impractical
> 2) checkpoint: seems fragile, especially if jobs span multiple nodes
>

I just have one node, but the bigger problem with checkpointing is that
GPUs don't seem to be supported.


3) have a second slurmd node (a small VM is sufficient) that takes over
> the cluster management when the master node is down (be *sure* the state
> dir is shared and quite fast!)
>

I've just got that one "node" for compute and login and storage and
everything.

It's a Tyrone server with 64 cores and a couple of RAIDed HDDs. I just want to
run some DFT/QM/MM simulations for myself and departmental colleagues, and
do some exact diagonalization problems.




4) just hope you'll be able to recover the slurmctld node before a job
> completes *and* the timeouts expire
>


I booted into gparted live and beefed up the swap space to 200 GB (the
RAM is 93 GB). I've set up a mandatory (through qos settings) Slurm
reservation that kills all running jobs in the normal qos after 8:30 pm
every day, and a cron job that starts at 8:35 pm, drains the partitions,
suspends all jobs running with elevated qos privileges, then hibernates the
whole machine to swap. Another script runs whenever it comes out of
hibernation, resets the slurm partitions and resumes the suspended jobs.
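
In outline, the evening script does something like this (a sketch of the idea,
not the actual script; it assumes the elevated QOS is literally named
'elevated'):

scontrol update PartitionName=CPU State=DRAIN        # likewise for the other partitions
squeue -h --qos=elevated -o %i | xargs -r -n1 scontrol suspend
systemctl hibernate

and the wake-up script undoes it:

scontrol update PartitionName=CPU State=UP
squeue -h -t SUSPENDED -o %i | xargs -r -n1 scontrol resume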

It's an ugly jugaad, I know.

I guess it's tough noogies for the normal qos people if their jobs ran past
the reservation or were not properly checkpointed before a blackout, but I
don't see any other alternative.

 My department refuses to let me run my thingie 24/7, and power outages
occur frequently round here.

I'm concerned about implementing a failsafe in case this Rube Goldberg-like
setup takes a hard left.

I was thinking about a systemd service that kills all running jobs, then
simply runs "scontrol shutdown" to preserve the state of queued jobs and
then resumes a regular system shutdown. In that case, automatic
checkpointing of the jobs with dmtcp/mana would be cool, and I was
encouraged when chatgpt claimed that slurm supported this. But the recent
docs don't corroborate this claim, so I guess it got deprecated or
something...










> While 4 is relatively risky (you could end up with runaway jobs that
> you'll have to fix afterwards), it does not directly impact users: their
> jobs will run and complete/fail regardless of slurmctld state. At most
> the users won't receive a completion mail and they will be billed less
> than expected.
>
> Diego
>
> Il 10/02/2023 20:06, Analabha Roy ha scritto:
> > Hi,
> >
> > I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
> > my cluster. On a whim, I logged into ChatGPT and asked the AI about it.
> > It told me things that I couldn't find in the current version of the
> > SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce
> > the
> > contents of my chat session in my GitHub repository for peer review and
> > commentary by you fine folks.
> >
> > https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md
> > <https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md>
> >
> > I apologize for the poor formatting. I did this in a hurry, and my
> > knowledge of markdown is rudimentary.
> >
> > Please do comment on the veracity and reliability of the AI's response.
> >
> > AR
> >
> > --
> > Analabha Roy
> > Assistant Professor
> > Department of Physics
> > <http://www.buruniv.ac.in/academics/department/physics>
> > The University of Burdwan <http://www.buruniv.ac.in/>
> > Golapbag Campus, Barddhaman 713104
> > West Bengal, India
> > Emails: dan...@utexas.edu <mailto:dan...@utexas.edu>,
> > a...@phys.buruniv.ac.in <mailto:a...@phys.buruniv.ac.in>,
> > hariseldo...@gmail.com <mailto:hariseldo...@gmail.com>
> > Webpage: http://www.ph.utexas.edu/~daneel/
> > <http://www.ph.utexas.edu/~daneel/>
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
>


[slurm-users] I just had a "conversation" with ChatGPT about working DMTCP, OpenMPI and SLURM. Here are the results

2023-02-10 Thread Analabha Roy
Hi,

I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in my
cluster. On a whim, I logged into ChatGPT and asked the AI about it.
It told me things that I couldn't find in the current version of the SLURM
docs (I looked). Since ChatGPT is not always reliable, I reproduce the
contents of my chat session in my GitHub repository for peer review and
commentary by you fine folks.

https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md

I apologize for the poor formatting. I did this in a hurry, and my
knowledge of markdown is rudimentary.

Please do comment on the veracity and reliability of the AI's response.

AR

-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/


Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Howdy,

On Tue, 7 Feb 2023 at 20:18, Sean Mc Grath  wrote:

> Hi Analabha,
>
> Yes, unfortunately for your needs, I expect a time limited reservation
> along my suggestion would not accept jobs that would be scheduled to end
> outside of the reservations availability times. I'd suggest looking at
> check-pointing in this case, e.g. with DMTCP: Distributed MultiThreaded
> Checkpointing, http://dmtcp.sourceforge.net/. That could allow jobs to
> have their state saved and then re-loaded when they are started again.
>
>
Checkpointing sounds intriguing. Many thanks for the suggestion.

A bit of googling turned up this cluster page
<https://docs.nersc.gov/development/checkpoint-restart/dmtcp/> where they've
set it up to work with slurm. However, I also noticed this presentation
<https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf> hosted on the slurm website
that indicates that DMTCP doesn't work with containers, and the other
checkpointing tools that do support containers don't support MPI.
I also took a gander at CRIU <https://criu.org/Main_Page>, but this paper
<https://www.ijecs.in/index.php/ijecs/article/download/4122/3855/8058>
indicates that it, too, has similar limitations, and BLCR seems to have died
<https://hpc.rz.rptu.de/documentation/checkpoint_blcr.html>.


Unless some or all of this information is dated or obsolete, these
drawbacks would be deal-breakers, since most of us have been spoiled by
containerization, and MPI is, of course, bread and butter for all.

I'd be mighty grateful for any other insights regarding my predicament. In
the meantime, I'm going to give the ugly hack of launching scontrol
suspend-resume scripts a whirl.


AR




> Best
>
> Sean
>
> ---
> Sean McGrath
> Senior Systems Administrator, IT Services
>
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Tuesday 7 February 2023 12:14
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
>
> Hi Sean,
>
> Thanks for your awesome suggestion! I'm going through the reservation docs
> now. At first glance, it seems like a daily reservation would turn down
> jobs that are too big for the reservation. It'd be nice if
> slurm could suspend (in the manner of 'scontrol suspend') jobs during
> reserved downtime and resume them after. That way, folks can submit large
> jobs without having to worry about the downtimes. Perhaps the FLEX option
> in reservations can accomplish this somehow?
>
>
> I suppose that I can do it using a shell script iterator and a cron job,
but that seems like an ugly hack. I was hoping there is a way to config
this in slurm itself?
>
> AR
>
> On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath  wrote:
>
> Hi Analabha,
>
> Could you do something like create a daily reservation for 8 hours that
> starts at 9am, or whatever times work for you like the following untested
> command:
>
> scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1
> flags=daily ReservationName=daily
>
> Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
>
> Some more possible helpful documentation at
> https://slurm.schedmd.com/reservations.html, search for "daily".
>
> My idea being that jobs can only run in that reservation, (that would have
> to be configured separately, not sure how from the top of my head), which
> is only active during the times you want the node to be working. So the
> cronjob that hibernates/shuts it down will do so when there are no jobs
> running. At least in theory.
>
> Hope that helps.
>
> Sean
>
> ---
> Sean McGrath
> Senior Systems Administrator, IT Services
>
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Tuesday 7 February 2023 10:05
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
>
> Hi,
>
> Thanks. I had read the Slurm Power Saving Guide before. I believe the
> configs enable slurmctld to check other nodes for idleness and
> suspend/resume them. Slurmctld must run on a separate, always-on server for
> this to work, right?
>
> My issue might be a little different. I literally have only one node that
> runs everything: slurmctld, slurmd, slurmdbd, everything.
>
> This node must be set to "sudo systemctl hibernate"after business hours,
> regardless of whether jobs are queued or running. The next business day, it
> can be switched on manually.
>
> systemctl hibernate is supposed to save the entire run state of the sole
> node to swap and poweroff. When powered on again, it should restore
> everything to its previous running state.
>
> When the job queue

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
On Tue, 7 Feb 2023, 18:12 Diego Zuccato,  wrote:

> RAM used by a suspended job is not released. At most it can be swapped
> out (if enough swap is available).
>


There should be enough swap available. I have 93 GB of RAM and an equally
large swap partition. I can top it off with swap files if needed.




>
> Il 07/02/2023 13:14, Analabha Roy ha scritto:
> > Hi Sean,
> >
> > Thanks for your awesome suggestion! I'm going through the reservation
> > docs now. At first glance, it seems like a daily reservation would turn
> > down jobs that are too big for the reservation. It'd be nice if
> > slurm could suspend (in the manner of 'scontrol suspend') jobs during
> > reserved downtime and resume them after. That way, folks can submit
> > large jobs without having to worry about the downtimes. Perhaps the FLEX
> > option in reservations can accomplish this somehow?
> >
> >
> > I suppose that I can do it using a shell script iterator and a cron job,
> > but that seems like an ugly hack. I was hoping there is a way to
> > config this in slurm itself?
> >
> > AR
> >
> > On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath  > <mailto:smcg...@tcd.ie>> wrote:
> >
> > Hi Analabha,
> >
> > Could you do something like create a daily reservation for 8 hours
> > that starts at 9am, or whatever times work for you like the
> > following untested command:
> >
> > scontrol create reservation starttime=09:00:00 duration=8:00:00
> > nodecnt=1 flags=daily ReservationName=daily
> >
> > Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
> > <https://slurm.schedmd.com/scontrol.html#OPT_DAILY>
> >
> > Some more possible helpful documentation at
> > https://slurm.schedmd.com/reservations.html
> > <https://slurm.schedmd.com/reservations.html>, search for "daily".
> >
> > My idea being that jobs can only run in that reservation, (that
> > would have to be configured separately, not sure how from the top of
> > my head), which is only active during the times you want the node to
> > be working. So the cronjob that hibernates/shuts it down will do so
> > when there are no jobs running. At least in theory.
> >
> > Hope that helps.
> >
> > Sean
> >
> > ---
> > Sean McGrath
> > Senior Systems Administrator, IT Services
> >
> >
>  
> > *From:* slurm-users  > <mailto:slurm-users-boun...@lists.schedmd.com>> on behalf of
> > Analabha Roy mailto:hariseldo...@gmail.com
> >>
> > *Sent:* Tuesday 7 February 2023 10:05
> > *To:* Slurm User Community List  > <mailto:slurm-users@lists.schedmd.com>>
> > *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
> > Hi,
> >
> > Thanks. I had read the Slurm Power Saving Guide before. I believe
> > the configs enable slurmctld to check other nodes for idleness and
> > suspend/resume them. Slurmctld must run on a separate, always-on
> > server for this to work, right?
> >
> > My issue might be a little different. I literally have only one node
> > that runs everything: slurmctld, slurmd, slurmdbd, everything.
> >
> > This node must be set to "sudo systemctl hibernate"after business
> > hours, regardless of whether jobs are queued or running. The next
> > business day, it can be switched on manually.
> >
> > systemctl hibernate is supposed to save the entire run state of the
> > sole node to swap and poweroff. When powered on again, it should
> > restore everything to its previous running state.
> >
> > When the job queue is empty, this works well. I'm not sure how well
> > this hibernate/resume will work with running jobs and would
> > appreciate any suggestions or insights.
> >
> > AR
> >
> >
> > On Tue, 7 Feb 2023 at 01:39, Florian Zillner  > <mailto:fzill...@lenovo.com>> wrote:
> >
> > Hi,
> >
> > follow this guide: https://slurm.schedmd.com/power_save.html
> > <https://slurm.schedmd.com/power_save.html>
> >
> > Create poweroff / poweron scripts and configure slurm to do the
> > poweroff after X minutes. Works well for us. Make sure to set an
> > appropriate time (ResumeTimeout) to allow the node to come back
> > to service.
> > 

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Hi Sean,

Thanks for your awesome suggestion! I'm going through the reservation docs
now. At first glance, it seems like a daily reservation would turn down
jobs that are too big for the reservation. It'd be nice if
slurm could suspend (in the manner of 'scontrol suspend') jobs during
reserved downtime and resume them after. That way, folks can submit large
jobs without having to worry about the downtimes. Perhaps the FLEX option
in reservations can accomplish this somehow?
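
(I.e., presumably some variant of your command, something like
"scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1
flags=daily,flex ReservationName=daily", with jobs then submitted with
"sbatch --reservation=daily". I haven't tested this.)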


I suppose that I can do it using a shell script iterator and a cron job,
but that seems like an ugly hack. I was hoping there is a way to config
this in slurm itself?

AR

On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath  wrote:

> Hi Analabha,
>
> Could you do something like create a daily reservation for 8 hours that
> starts at 9am, or whatever times work for you like the following untested
> command:
>
> scontrol create reservation starttime=09:00:00 duration=8:00:00 nodecnt=1
> flags=daily ReservationName=daily
>
> Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
>
> Some more possible helpful documentation at
> https://slurm.schedmd.com/reservations.html, search for "daily".
>
> My idea being that jobs can only run in that reservation, (that would have
> to be configured separately, not sure how from the top of my head), which
> is only active during the times you want the node to be working. So the
> cronjob that hibernates/shuts it down will do so when there are no jobs
> running. At least in theory.
>
> Hope that helps.
>
> Sean
>
> ---
> Sean McGrath
> Senior Systems Administrator, IT Services
>
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Tuesday 7 February 2023 10:05
> *To:* Slurm User Community List 
> *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
>
> Hi,
>
> Thanks. I had read the Slurm Power Saving Guide before. I believe the
> configs enable slurmctld to check other nodes for idleness and
> suspend/resume them. Slurmctld must run on a separate, always-on server for
> this to work, right?
>
> My issue might be a little different. I literally have only one node that
> runs everything: slurmctld, slurmd, slurmdbd, everything.
>
> This node must be set to "sudo systemctl hibernate"after business hours,
> regardless of whether jobs are queued or running. The next business day, it
> can be switched on manually.
>
> systemctl hibernate is supposed to save the entire run state of the sole
> node to swap and poweroff. When powered on again, it should restore
> everything to its previous running state.
>
> When the job queue is empty, this works well. I'm not sure how well this
> hibernate/resume will work with running jobs and would appreciate any
> suggestions or insights.
>
> AR
>
>
> On Tue, 7 Feb 2023 at 01:39, Florian Zillner  wrote:
>
> Hi,
>
> follow this guide: https://slurm.schedmd.com/power_save.html
>
> Create poweroff / poweron scripts and configure slurm to do the poweroff
> after X minutes. Works well for us. Make sure to set an appropriate time
> (ResumeTimeout) to allow the node to come back to service.
> Note that we did not achieve good power saving with suspending the nodes,
> powering them off and on saves way more power. The downside is it takes ~ 5
> mins to resume (= power on) the nodes when needed.
>
> Cheers,
> Florian
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users@lists.schedmd.com 
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
>
> Hi,
>
> I've just finished  setup of a single node "cluster" with slurm on ubuntu
> 20.04. Infrastructural limitations  prevent me from running it 24/7, and
> it's only powered on during business hours.
>
>
> Currently, I have a cron job running that hibernates that sole node before
> closing time.
>
> The hibernation is done with standard systemd, and hibernates to the swap
> partition.
>
>  I have not run any lengthy slurm jobs on it yet. Before I do, can I get
> some thoughts on a couple of things?
>
> If it hibernated when slurm still had jobs running/queued, would they
> resume properly when the machine powers back on?
>
> Note that my swap space is bigger than my  RAM.
>
> Is it necessary to perhaps setup a pre-hibernate script for systemd to
> iterate scontrol to suspend all the jobs before hibernating and resume them
> post-resume?
>
> What about the wall times? I'm guessing that slurm will count the downtime
> as elapsed for each job. Is there a way to config this, or is the only
> alternative a post-hibernate

Re: [slurm-users] [External] Hibernating a whole cluster

2023-02-07 Thread Analabha Roy
Hi,

Thanks. I had read the Slurm Power Saving Guide before. I believe the
configs enable slurmctld to check other nodes for idleness and
suspend/resume them. Slurmctld must run on a separate, always-on server for
this to work, right?

My issue might be a little different. I literally have only one node that
runs everything: slurmctld, slurmd, slurmdbd, everything.

This node must be set to "sudo systemctl hibernate"after business hours,
regardless of whether jobs are queued or running. The next business day, it
can be switched on manually.

systemctl hibernate is supposed to save the entire run state of the sole
node to swap and poweroff. When powered on again, it should restore
everything to its previous running state.

When the job queue is empty, this works well. I'm not sure how well this
hibernate/resume will work with running jobs and would appreciate any
suggestions or insights.

AR


On Tue, 7 Feb 2023 at 01:39, Florian Zillner  wrote:

> Hi,
>
> follow this guide: https://slurm.schedmd.com/power_save.html
>
> Create poweroff / poweron scripts and configure slurm to do the poweroff
> after X minutes. Works well for us. Make sure to set an appropriate time
> (ResumeTimeout) to allow the node to come back to service.
> Note that we did not achieve good power saving with suspending the nodes,
> powering them off and on saves way more power. The downside is it takes ~ 5
> mins to resume (= power on) the nodes when needed.
>
> Cheers,
> Florian
> ------
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users@lists.schedmd.com 
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
>
> Hi,
>
> I've just finished  setup of a single node "cluster" with slurm on ubuntu
> 20.04. Infrastructural limitations  prevent me from running it 24/7, and
> it's only powered on during business hours.
>
>
> Currently, I have a cron job running that hibernates that sole node before
> closing time.
>
> The hibernation is done with standard systemd, and hibernates to the swap
> partition.
>
>  I have not run any lengthy slurm jobs on it yet. Before I do, can I get
> some thoughts on a couple of things?
>
> If it hibernated when slurm still had jobs running/queued, would they
> resume properly when the machine powers back on?
>
> Note that my swap space is bigger than my  RAM.
>
> Is it necessary to perhaps setup a pre-hibernate script for systemd to
> iterate scontrol to suspend all the jobs before hibernating and resume them
> post-resume?
>
> What about the wall times? I'm guessing that slurm will count the downtime
> as elapsed for each job. Is there a way to config this, or is the only
> alternative a post-hibernate script that iteratively updates the wall times
> of the running jobs using scontrol again?
>
> Thanks for your attention.
> Regards
> AR
>


-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/


[slurm-users] Hibernating a whole cluster

2023-02-06 Thread Analabha Roy
Hi,

I've just finished setting up a single-node "cluster" with slurm on ubuntu
20.04. Infrastructural limitations prevent me from running it 24/7, and
it's only powered on during business hours.


Currently, I have a cron job running that hibernates that sole node before
closing time.

The hibernation is done with standard systemd, and hibernates to the swap
partition.

 I have not run any lengthy slurm jobs on it yet. Before I do, can I get
some thoughts on a couple of things?

If it hibernated when slurm still had jobs running/queued, would they
resume properly when the machine powers back on?

Note that my swap space is bigger than my  RAM.

Is it necessary to perhaps set up a pre-hibernate script for systemd to
iterate scontrol to suspend all the jobs before hibernating and resume them
post-resume?
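
Something along these lines is what I have in mind (an untested sketch using
systemd's system-sleep hook directory):

#!/bin/bash
# /lib/systemd/system-sleep/slurm-jobs  (untested sketch)
case "$1" in
  pre)  squeue -h -o %i | xargs -r -n1 scontrol suspend ;;               # before hibernating
  post) squeue -h -t SUSPENDED -o %i | xargs -r -n1 scontrol resume ;;   # after waking up
esac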

What about the wall times? I'm guessing that slurm will count the downtime
as elapsed for each job. Is there a way to config this, or is the only
alternative a post-hibernate script that iteratively updates the wall times
of the running jobs using scontrol again?

Thanks for your attention.
Regards
AR


Re: [slurm-users] Enforce gpu usage limits (with GRES?)

2023-02-04 Thread Analabha Roy
Hi,

Thanks, your advice worked. I used sacctmgr to create a QOS called 'nogpu'
and set MaxTRES=gres/gpu=0, then attached it to the cpu partition in
slurm.conf as

PartitionName=CPU Nodes=ALL Default=Yes QOS=nogpu MaxTime=INFINITE  State=UP
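
(For the record, the QOS itself was created with sacctmgr, roughly:
"sacctmgr add qos nogpu" followed by
"sacctmgr modify qos nogpu set MaxTRES=gres/gpu=0".)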

And it works! Trying to run gpu jobs in the cpu partition now fails. Qos'es
are nice!

Only thing is that the nogpu qos has a priority of 0. Should it be higher?

https://pastebin.com/VVsQAz6P

AR

On Fri, 3 Feb 2023 at 13:37, Markus Kötter  wrote:

> Hi,
>
>
> limits ain't easy.
>
> >
> https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmLimits.html#precedence
>
>
> I think there are multiple options, starting with not having GPU
> resources in the CPU partition.
>
> Or creating a QOS for the partition with
> MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0
> and attaching it to the CPU partition.
>
> And the configuration will require some values as well,
>
> # slurm.conf
> AccountingStorageEnforce=associations,limits,qos,safe
> AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:K80
>
> # cgroups.conf
> ConstrainDevices=yes
>
> most likely some others I miss.
>
>
> MfG
> --
> Markus Kötter, +49 681 870832434
> 30159 Hannover, Lange Laube 6
> Helmholtz Center for Information Security
>


-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/


Re: [slurm-users] [ext] Enforce gpu usage limits (with GRES?)

2023-02-02 Thread Analabha Roy
Hi,

Thanks for the reply. Yes, your advice helped! Much obliged. Not only was
cgroups config necessary, but the option

ConstrainDevices=yes

in cgroup.conf was necessary to enforce the gpu gres. Now, not adding a
gres parameter to srun causes gpu jobs to fail. An improvement!

However, I still can't keep gpu jobs out of the "CPU" partition. Is
there a way to link a partition to a GRES or something?

Alternatively, can I define two nodenames in slurm.conf that point to the
same physical node, but only one of them has the gpu GRES? That way, I can
link the GPU partition to the gres-configged nodename only.

Thanks in advance,
AR

*PS*: If the slurm devs are reading this, may I suggest that perhaps it
would be a good idea to add a reference to cgroups in the gres
documentation page?








On Thu, 2 Feb 2023 at 16:52, Holtgrewe, Manuel <
manuel.holtgr...@bih-charite.de> wrote:

> Hi,
>
>
> if by "share the GPU" you mean exclusive allocation to a single job then,
> I believe, you are missing cgroup configuration for isolating access to the
> GPU.
>
>
> Below the relevant parts (I believe) of our configuration.
>
>
> There is also a way to time- and space-slice GPUs, but I guess you should
> get things set up without slicing.
>
>
> I hope this helps.
>
>
> Manuel
>
>
> ==> /etc/slurm/cgroup.conf <==
> # https://bugs.schedmd.com/show_bug.cgi?id=3701
> CgroupMountpoint="/sys/fs/cgroup"
> CgroupAutomount=yes
> AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
>
> ==> /etc/slurm/cgroup_allowed_devices_file.conf <==
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
> /dev/nvidia*
>
> ==> /etc/slurm/slurm.conf <==
>
> ProctrackType=proctrack/cgroup
>
> # Memory is enforced via cgroups, so we should not do this here by [*]
> #
> # /etc/slurm/cgroup.conf: ConstrainRAMSpace=yes
> #
> # [*] https://bugs.schedmd.com/show_bug.cgi?id=5262
> JobAcctGatherParams=NoOverMemoryKill
>
> TaskPlugin=task/cgroup
>
> JobAcctGatherType=jobacct_gather/cgroup
>
>
> --
> Dr. Manuel Holtgrewe, Dipl.-Inform.
> Bioinformatician
> Core Unit Bioinformatics – CUBI
> Berlin Institute of Health / Max Delbrück Center for Molecular Medicine in
> the Helmholtz Association / Charité – Universitätsmedizin Berlin
>
> Visiting Address: Invalidenstr. 80, 3rd Floor, Room 03 028, 10117 Berlin
> Postal Address: Chariteplatz 1, 10117 Berlin
>
> E-Mail: manuel.holtgr...@bihealth.de
> Phone: +49 30 450 543 607
> Fax: +49 30 450 7 543 901
> Web: cubi.bihealth.org  www.bihealth.org  www.mdc-berlin.de
> www.charite.de
> --
> *From:* slurm-users  on behalf of
> Analabha Roy 
> *Sent:* Wednesday, February 1, 2023 6:12:40 PM
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [ext] [slurm-users] Enforce gpu usage limits (with GRES?)
>
> Hi,
>
> I'm new to slurm, so I apologize in advance if my question seems basic.
>
> I just purchased a single node 'cluster' consisting of one 64-core cpu and
> an nvidia rtx5k gpu (Turing architecture, I think). The vendor supplied it
> with ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the
> config to suit the needs of my department.
>
> I'm trying to bone up on GRES scheduling by reading this manual page
> <https://slurm.schedmd.com/gres.html>, but am confused about some things.
>
> My slurm.conf file has the following lines put in it by the vendor:
>
> ###
> # COMPUTE NODES
> GresTypes=gpu
> NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32
> ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
> #PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE  State=UP
>
> PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE  State=UP
> #
>
> So they created two partitions that are essentially identical. Secondly,
> they put just the following line in gres.conf:
>
> ###
> NodeName=shavak-DIT400TR-55L  Name=gpu File=/dev/nvidia0
> ###
>
> That's all. However, this configuration does not appear to constrain
> anyone in any manner. As a regular user, I can still use srun or sbatch to
> start GPU jobs from the "CPU partition," and nvidia-smi says that a simple
> cupy <https://cupy.dev/> script that multiplies matrices and starts as an
> sbatch job in the CPU partition can access the gpu just fine. Note that the
> environment variable "CUDA_VISIBLE_DEVICES" does not appear to be set in
> any job step. I tested this by starting an interactive srun shell i

[slurm-users] Enforce gpu usage limits (with GRES?)

2023-02-01 Thread Analabha Roy
Hi,

I'm new to slurm, so I apologize in advance if my question seems basic.

I just purchased a single node 'cluster' consisting of one 64-core cpu and
an nvidia rtx5k gpu (Turing architecture, I think). The vendor supplied it
with ubuntu 20.04 and slurm-wlm 19.05.5. Now I'm trying to adjust the
config to suit the needs of my department.

I'm trying to bone up on GRES scheduling by reading this manual page
<https://slurm.schedmd.com/gres.html>, but am confused about some things.

My slurm.conf file has the following lines put in it by the vendor:

###
# COMPUTE NODES
GresTypes=gpu
NodeName=shavak-DIT400TR-55L CPUs=64 SocketsPerBoard=2 CoresPerSocket=32
ThreadsPerCore=1 RealMemory=95311 Gres=gpu:1
#PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

PartitionName=CPU Nodes=ALL Default=Yes MaxTime=INFINITE  State=UP

PartitionName=GPU Nodes=ALL Default=NO MaxTime=INFINITE  State=UP
#

So they created two partitions that are essentially identical. Secondly,
they put just the following line in gres.conf:

###
NodeName=shavak-DIT400TR-55L  Name=gpu File=/dev/nvidia0
###

That's all. However, this configuration does not appear to constrain anyone
in any manner. As a regular user, I can still use srun or sbatch to start
GPU jobs from the "CPU partition," and nvidia-smi says that a simple cupy
<https://cupy.dev/> script that multiplies matrices and starts as an sbatch
job in the CPU partition can access the gpu just fine. Note that the
environment variable "CUDA_VISIBLE_DEVICES" does not appear to be set in
any job step. I tested this by starting an interactive srun shell in both
CPU and GPU partitions and running "echo $CUDA_VISIBLE_DEVICES" and got
bupkis for both.


What I need to do is constrain jobs to using chunks of GPU Cores/RAM so
that multiple jobs can share the GPU.

As I understand from the gres manpage, simply adding "AutoDetect=nvml"
(NVML should be installed with the NVIDIA HPC SDK, right? I installed it
with apt-get...) in gres.conf should allow Slurm to detect the GPU's
internal specifications automatically. Is that all, or do I need to config
an mps GRES as well? Will that succeed in jailing out the GPU from jobs
that don't mention any gres parameters (perhaps by setting
CUDA_VISIBLE_DEVICES), or is there any additional config for that? Do I
really need that extra "GPU" partition that the vendor put in for any of
this, or is there a way to bind GRES resources to a particular partition in
such a way that simply launching jobs in that partition will be enough?
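
In other words, would a gres.conf along these lines be enough (a sketch of what
I have in mind; the mps line and its count are just a guess on my part)?

AutoDetect=nvml
NodeName=shavak-DIT400TR-55L Name=gpu File=/dev/nvidia0
# optionally, for sharing the card between jobs via MPS:
# NodeName=shavak-DIT400TR-55L Name=mps Count=100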

Thanks for your attention.
Regards
AR













-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in, hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/