Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
Hi,

You got me, I didn't know that " oversubscribe=FORCE:2" is an option.  I'll
need to explore that.

I missed the question about srun.  srun is the preferred launcher, I
believe.  I am not involved in drafting the submit scripts, but I can ask my
peer.  You do need to stipulate the number of tasks you want: your "sbatch
-n 1" should be changed to the number of MPI ranks you desire.
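
A minimal sbatch sketch along those lines (untested; the program name
./hello_mpi and the resource numbers below are just placeholders) might look
like:

#!/bin/bash
#SBATCH --job-name=hello_mpi
#SBATCH --ntasks=8            # one task per MPI rank
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

# srun launches one copy of the program per allocated task
srun ./hello_mpi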

As good as slurm is, many come to assume it does far more than it does.  I
explain slurm as a maître d' in a very exclusive restaurant, aware of every
table and the resources they afford.  When a reservation is placed (a job
submitted), the request is reviewed against the resources: the pending
guest/job is matched against what the restaurant has and against when the
other diners/jobs are expected to finish.  If a guest requests resources
that are not available in the restaurant, the reservation is denied.  If a
guest arrives and does not need all the resources, the place settings
requested but unused are left in reservation until the job finishes.  Slurm
manages requests against an inventory.  Without enforcement, a job that
requests 1 core but uses 12 will run.  If your 64-core system accepts 64
such single-core reservations, slurm believes only 64 cores are needed and
64 jobs will start, and then the wait staff (the OS) is left to deal with
64 x 12 = 768 tasks running on 64 cores.  It becomes a sad comedy as the
system will probably run out of RAM, triggering the OOM killer, or just run
horribly slowly.  Never assume slurm is going to prevent bad actors once
they begin running unless you have configured it to do so.
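
The enforcement side, if you want it, mostly lives in the cgroup settings.
A sketch of the relevant pieces (illustrative, not our exact config) would
be something like:

# slurm.conf: confine tasks to the resources they were allocated
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf: keep jobs on their allocated cores and within requested RAM
ConstrainCores=yes
ConstrainRAMSpace=yes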

We run a very lax environment.  We set a standard of 6 GB per job unless
the sbatch declares otherwise, plus a default maximum runtime.  Without an
estimated runtime to work with, the backfill scheduler is crippled.  In an
environment mixing single-thread and MPI jobs of various sizes it is
critical that jobs are honest about their requirements, giving slurm the
information it needs to assign resources correctly.
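
Those site defaults can be expressed directly on the partition line.  A
sketch (the partition name, node list, and values here are only
illustrative):

# 6 GB default memory per CPU, a default walltime, and a hard ceiling
PartitionName=batch Nodes=hpc[306-308] Default=YES DefMemPerCPU=6144 DefaultTime=04:00:00 MaxTime=7-00:00:00 State=UP
# Enforcing the memory default also means selecting memory as a resource,
# e.g. SelectTypeParameters=CR_Core_Memory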

Doug

On Sat, Feb 25, 2023 at 12:04 PM Analabha Roy 
wrote:

> Hi,
>
> Thanks for your considered response. Couple of questions linger...
>
> On Sat, 25 Feb 2023 at 21:46, Doug Meyer  wrote:
>
>> Hi,
>>
>> Declaring cores=64 will absolutely work but if you start running MPI
>> you'll want a more detailed config description.  The easy way to read it is
>> "128=2 sockets * 32 corespersocket * 2 threads per core".
>>
>> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
>> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>>
>> But if you just want to work with logical cores the "cpus=128" will work.
>>
>> If you go with the more detailed description then you need to declare
>> oversubscription (hyperthreading) in the partition declaration.
>>
>
>
> Yeah, I'll try that.
>
>
>> By default slurm will not let two different jobs share the logical cores
>> comprising a physical core.  For example if Sue has an Array of 1-1000 her
>> array tasks could each take a logical core on a physical core.  But if
>> Jamal is also running they would not be able to share the physical core.
>> (as I understand it).
>>
>> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
>> MaxTime=Infinite State=Up AllowAccounts=cowboys
>>
>>
>> In the sbatch/srun the user needs to add a declaration
>> "oversubscribe=yes" telling slurm the job can run on both logical cores
>> available.
>>
>
> How about setting oversubscribe=FORCE:2? That way, users need not add a
> setting in their scripts.
>
>
>
>
>> In the days of Knight's Landing each core could handle four logical cores
>> but I don't believe there are any current AMD or Intel processors
>> supporting more than two logical cores (hyperthreads per core).  The
>> conversation about hyperthreads is difficult as the Intel terminology is
>> logical cores for hyperthreading and cores for physical cores but the
>> tendency is to call the logical cores threads or hyperthreaded cores.  This
>> can be very confusing for consumers of the resources.
>>
>>
>> In any case, if you create an array job of 1-100 sleep jobs, my simplest
>> logical test job, then you can use scontrol show node  to see the
>> node's resource configuration as well as consumption.  squeue -w 
>> -i 10 will iterate every ten seconds to show you the node chomping
>> through the job.
>>
>>
>> Hope this helps.  Once you are comfortable I would urge you to use the
>> NodeName/Partition descriptor format above and encourage your users to
>> declare oversubscription in their jobs.  It is a little more work up front
>> but far easier than correcting scripts later.
>>
>>
>> Doug
>>
>>
>>
>>
>>
>> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy 
>> wrote:
>>
>>> Howdy, and thanks for the warm welcome,
>>>
>>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer  wrote:
>>>
 Hi,

 Did you configure your node definition with the outputs of slurmd -C?
 Ignore boards.  Don't know if it is still true but several years ago
 declaring boards made things difficult.


>>> $ 

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Analabha Roy
Hi,

Thanks for your considered response. Couple of questions linger...

On Sat, 25 Feb 2023 at 21:46, Doug Meyer  wrote:

> Hi,
>
> Declaring cores=64 will absolutely work but if you start running MPI
> you'll want a more detailed config description.  The easy way to read it is
> "128=2 sockets * 32 corespersocket * 2 threads per core".
>
> NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32
> ThreadsPerCore=2 RealMemory=512000 TmpDisk=100
>
> But if you just want to work with logical cores the "cpus=128" will work.
>
> If you go with the more detailed description then you need to declare
> oversubscription (hyperthreading) in the partition declaration.
>


Yeah, I'll try that.


> By default slurm will not let two different jobs share the logical cores
> comprising a physical core.  For example if Sue has an Array of 1-1000 her
> array tasks could each take a logical core on a physical core.  But if
> Jamal is also running they would not be able to share the physical core.
> (as I understand it).
>
> PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
> MaxTime=Infinite State=Up AllowAccounts=cowboys
>
>
> In the sbatch/srun the user needs to add a declaration "oversubscribe=yes"
> telling slurm the job can run on both logical cores available.
>

How about setting oversubscribe=FORCE:2? That way, users need not add a
setting in their scripts.
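
If it helps, my understanding is that the FORCE variant goes on the
partition line, something like the sketch below (partition name and limits
are placeholders; FORCE makes the resources shareable without the job asking
for it, and the :2 caps how many jobs can share):

PartitionName=debug Nodes=shavak-DIT400TR-55L Default=YES OverSubscribe=FORCE:2 MaxTime=INFINITE State=UP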




> In the days of Knight's Landing each core could handle four logical cores
> but I don't believe there are any current AMD or Intel processors
> supporting more than two logical cores (hyperthreads per core).  The
> conversation about hyperthreads is difficult as the Intel terminology is
> logical cores for hyperthreading and cores for physical cores but the
> tendency is to call the logical cores threads or hyperthreaded cores.  This
> can be very confusing for consumers of the resources.
>
>
> In any case, if you create an array job of 1-100 sleep jobs, my simplest
> logical test job, then you can use scontrol show node  to see the
> node's resource configuration as well as consumption.  squeue -w 
> -i 10 will iterate every ten seconds to show you the node chomping
> through the job.
>
>
> Hope this helps.  Once you are comfortable I would urge you to use the
> NodeName/Partition descriptor format above and encourage your users to
> declare oversubscription in their jobs.  It is a little more work up front
> but far easier than correcting scripts later.
>
>
> Doug
>
>
>
>
>
> On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy 
> wrote:
>
>> Howdy, and thanks for the warm welcome,
>>
>> On Fri, 24 Feb 2023 at 07:31, Doug Meyer  wrote:
>>
>>> Hi,
>>>
>>> Did you configure your node definition with the outputs of slurmd -C?
>>> Ignore boards.  Don't know if it is still true but several years ago
>>> declaring boards made things difficult.
>>>
>>>
>> $ slurmd -C
>> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
>> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
>> UpTime=0-00:47:51
>> $ grep NodeName /etc/slurm-llnl/slurm.conf
>> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>>
>> There is a difference.  I, too, discarded the Boards and sockets in
>> slurm.conf.  Is that the problem?
>>
>>
>>
>>
>>
>>
>>
>>> Also, if you have hyperthreaded AMD or Intel processors, your partition
>>> declaration should be oversubscribe:2
>>>
>>>
>> Yes, I do.  It's actually 16 x 2 cores with hyperthreading, but the BIOS is
>> set to show them as 64 cores.
>>
>>
>>
>>
>>> Start with a very simple job with a script containing sleep 100 or
>>> something else without any runtime issues.
>>>
>>>
>> I ran this MPI hello world thing with this sbatch script.
>> Should be the same thing as your suggestion, basically.
>> Should I switch to 'srun' in the batch file?
>>
>> AR
>>
>>
>>> When I started with slurm I built the sbatch one small step at a time:
>>> nodes, cores, memory, partition, mail, etc.
>>>
>>> It sounds like your config is very close but your problem may be in the
>>> submit script.
>>>
>>> Best of luck and welcome to slurm. It is very powerful with a huge
>>> community.
>>>
>>> Doug
>>>
>>>
>>>
>>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy 
>>> wrote:
>>>
 Hi folks,

 I have a single-node "cluster" running Ubuntu 20.04 LTS with the
 distribution packages for slurm (slurm-wlm 19.05.5).
 Slurm only ran one job on the node at a time with the default
 configuration, leaving all other jobs pending.
 This happened even if that one job only requested a few cores (the
 node has 64 cores, and slurm.conf is configured accordingly).

 in slurm conf, SelectType is set to select/cons_res, and
 SelectTypeParameters to CR_Core. NodeName 

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-25 Thread Chris Samuel

On 23/2/23 2:55 am, David Laehnemann wrote:


And consequently, would using `scontrol` thus be the better default
option (as opposed to `sacct`) for repeated job status checks by a
workflow management system?


Many others have commented on this, but use of scontrol in this way is
really, really bad because of the impact it has on slurmctld.  Responding to
the RPC (IIRC) requires taking read locks on internal data structures, and
on a large, busy system (like ours, which recently rolled slurm job IDs back
over to 1 after ~6 years of operation and runs at over 90% occupancy most of
the time) this can really damage scheduling performance.


We've had numerous occasions where we've had to track down users abusing 
scontrol in this way and redirect them to use sacct instead.
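
For that kind of polling, a single sacct call per check (or one call for a
batch of job IDs) keeps the RPC load down.  A sketch, with a made-up job ID:

# -X limits output to the allocation itself, -n drops the header
sacct -j 12345 -X -n -o JobID,State,ExitCode

Several job IDs can be passed comma-separated in one call, which is far
cheaper than looping over scontrol show job.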


We already use the cli filter abilities in Slurm to impose a form of 
rate limiting on RPCs from other commands, but unfortunately scontrol is 
not covered by that.


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
Hi,

Declaring cores=64 will absolutely work but if you start running MPI you'll
want a more detailed config description.  The easy way to read it is "128=2
sockets * 32 corespersocket * 2 threads per core".

NodeName=hpc[306-308] CPUs=128 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2
RealMemory=512000 TmpDisk=100

But if you just want to work with logical cores the "cpus=128" will work.
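
That simpler form would look something like this (same illustrative
hostnames and sizes as the line above):

NodeName=hpc[306-308] CPUs=128 RealMemory=512000 TmpDisk=100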

If you go with the more detailed description then you need to declare
oversubscription (hyperthreading) in the partition declaration.  By default
slurm will not let two different jobs share the logical cores comprising a
physical core.  For example if Sue has an Array of 1-1000 her array tasks
could each take a logical core on a physical core.  But if Jamal is also
running they would not be able to share the physical core. (as I understand
it).

PartitionName=a Nodes= [301-308] Default=No OverSubscribe=YES:2
MaxTime=Infinite State=Up AllowAccounts=cowboys


In the sbatch/srun the user needs to add a declaration "oversubscribe=yes"
telling slurm the job can run on both logical cores available.  In the days
of Knight's Landing each core could handle four logical cores but I don't
believe there are any current AMD or Intel processors supporting more than
two logical cores (hyperthreads per core).  The conversation about
hyperthreads is difficult as the Intel terminology is logical cores for
hyperthreading and cores for physical cores but the tendency is to call the
logical cores threads or hyperthreaded cores.  This can be very confusing
for consumers of the resources.
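
On the job side the flag is spelled --oversubscribe; a minimal sketch of
what a user would add (the program name is a placeholder):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --oversubscribe      # allow this job's resources to be shared
srun ./my_program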


In any case, if you create an array job of 1-100 sleep jobs, my simplest
logical test job, then you can use scontrol show node  to see the
node's resource configuration as well as consumption.  squeue -w 
-i 10 will iterate every ten seconds to show you the node chomping
through the job.
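
A throwaway version of that test could be as simple as (array size and
names are placeholders):

#!/bin/bash
#SBATCH --job-name=sleeptest
#SBATCH --array=1-100
#SBATCH --ntasks=1
sleep 100

Submit it with sbatch, then watch the node chew through it with
scontrol show node <nodename> and squeue -w <nodename> -i 10.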


Hope this helps.  Once you are comfortable I would urge you to use the
NodeName/Partition descriptor format above and encourage your users to
declare oversubscription in their jobs.  It is a little more work up front
but far easier than correcting scripts later.


Doug





On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy  wrote:

> Howdy, and thanks for the warm welcome,
>
> On Fri, 24 Feb 2023 at 07:31, Doug Meyer  wrote:
>
>> Hi,
>>
>> Did you configure your node definition with the outputs of slurmd -C?
>> Ignore boards.  Don't know if it is still true but several years ago
>> declaring boards made things difficult.
>>
>>
> $ slurmd -C
> NodeName=shavak-DIT400TR-55L CPUs=64 Boards=1 SocketsPerBoard=2
> CoresPerSocket=16 ThreadsPerCore=2 RealMemory=95311
> UpTime=0-00:47:51
> $ grep NodeName /etc/slurm-llnl/slurm.conf
> NodeName=shavak-DIT400TR-55L CPUs=64 RealMemory=95311 Gres=gpu:1
>
> There is a difference.  I, too, discarded the Boards and sockets in
> slurm.conf.  Is that the problem?
>
>
>
>
>
>
>
>> Also, if you have hyperthreaded AMD or Intel processors, your partition
>> declaration should be oversubscribe:2
>>
>>
> Yes, I do.  It's actually 16 x 2 cores with hyperthreading, but the BIOS is
> set to show them as 64 cores.
>
>
>
>
>> Start with a very simple job with a script containing sleep 100 or
>> something else without any runtime issues.
>>
>>
> I ran this MPI hello world thing with this sbatch script.
> Should be the same thing as your suggestion, basically.
> Should I switch to 'srun' in the batch file?
>
> AR
>
>
>> When I started with slurm I built the sbatch one small step at a time:
>> nodes, cores, memory, partition, mail, etc.
>>
>> It sounds like your config is very close but your problem may be in the
>> submit script.
>>
>> Best of luck and welcome to slurm. It is very powerful with a huge
>> community.
>>
>> Doug
>>
>>
>>
>> On Thu, Feb 23, 2023 at 6:58 AM Analabha Roy 
>> wrote:
>>
>>> Hi folks,
>>>
>>> I have a single-node "cluster" running Ubuntu 20.04 LTS with the
>>> distribution packages for slurm (slurm-wlm 19.05.5).
>>> Slurm only ran one job on the node at a time with the default
>>> configuration, leaving all other jobs pending.
>>> This happened even if that one job only requested a few cores (the
>>> node has 64 cores, and slurm.conf is configured accordingly).
>>>
>>> in slurm conf, SelectType is set to select/cons_res, and
>>> SelectTypeParameters to CR_Core. NodeName is set with CPUs=64. Path to file
>>> is referenced below.
>>>
>>> So I set OverSubscribe=FORCE in the partition config and restarted the
>>> daemons.
>>>
>>> Multiple jobs are now run concurrently, but when Slurm is
>>> oversubscribed, it is *truly* *oversubscribed*. That is to say, it runs
>>> so many jobs that there are more processes running than cores/threads.
>>> How should I configure slurm so that it runs multiple