[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Cutts, Tim via slurm-users
As I see it, you have two options for managing those dependencies:


  1.  You use SLURM's native job dependencies, but this requires you to create 
a build script specifically for SLURM.
  2.  You use make to submit the jobs, taking advantage of the -j flag to 
run lots of tasks at once; you just use a job starter prefix to prefix the 
tasks you want run under SLURM with srun.

The first approach will get the jobs run soonest.  The second approach is a bit 
of a hack, and it means that the dependent jobs don’t get submitted until the 
previous jobs have finished, which isn’t ideal, but it does work, and it meets 
your requirement of having a single build process that works both with and 
without SLURM:


JOBSTARTER=srun -c 1 -t 00:01:00
SLEEP=60

all: jobC.out

clean:
	rm -f job[ABC].out

jobA.out:
	$(JOBSTARTER) sh -c "sleep $(SLEEP); echo done > $@"

jobB.out:
	$(JOBSTARTER) sh -c "sleep $(SLEEP); echo done > $@"

jobC.out: jobA.out jobB.out
	$(JOBSTARTER) sh -c "echo done > $@"

When you want to run it interactively, you set JOBSTARTER to be empty (e.g. 
make -j JOBSTARTER=); otherwise you use some suitable srun command to run the 
tasks under SLURM, and the above makefile does this:


$ make -j
srun -c 1 -t 00:01:00 sh -c "sleep 60; echo done > jobA.out"
srun -c 1 -t 00:01:00 sh -c "sleep 60; echo done > jobB.out"
srun: job 13324201 queued and waiting for resources
srun: job 13324202 queued and waiting for resources
srun: job 13324201 has been allocated resources
srun: job 13324202 has been allocated resources
srun -c 1 -t 00:01:00 sh -c "echo done > jobC.out"
srun: job 13324220 queued and waiting for resources
srun: job 13324220 has been allocated resources
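For reference, the first approach would chain the same three jobs using SLURM's 
native dependencies. A minimal sketch, assuming hypothetical batch scripts 
jobA.sh, jobB.sh and jobC.sh (this requires a running cluster, so it is 
illustrative only):

```shell
#!/bin/sh
# --parsable makes sbatch print only the job ID, so it can be captured.
A=$(sbatch --parsable jobA.sh)
B=$(sbatch --parsable jobB.sh)
# jobC starts only after both A and B have completed successfully (afterok
# takes a colon-separated list of job IDs).
sbatch --dependency=afterok:${A}:${B} jobC.sh
```

The downside, as noted above, is that this build script only works where 
sbatch exists, whereas the makefile works with or without SLURM.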

Regards,

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca

Find out more about R IT Data, Analytics & AI and how we can support you by 
visiting our Service Catalogue.


From: Duane Ellis via slurm-users 
Date: Sunday, 9 June 2024 at 15:50
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Software builds using slurm
I have been lurking here for a while hoping to see some examples that would 
help, but have not for several months.

We have a Slurm system set up for Xilinx FPGA builds (HDL); I want to use this 
for software builds too.

What I seem to see is that Slurm talks about CPUs, GPUs, memory etc. I am 
looking for a “run my makefile (or shell script) on any available node” 
capability.

In our case we have 3 top-level jobs: A, B and C.
These can all run in parallel and are independent (i.e. bootloader, Linux 
kernel, and the Linux root file system via Buildroot).

Job A (boot) is actually about 7 small builds that are independent

I am looking for a means to fork n jobs (i.e. jobs A, B and C above) across the 
cluster and wait for/collect the stdout and exit status of those n jobs.
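Independent of Slurm, that fork-and-collect pattern is plain shell job control. 
A minimal sketch with placeholder build commands (the real builds would replace 
the echo commands, and JOBSTARTER could be set to an srun prefix on the cluster):

```shell
#!/bin/sh
# JOBSTARTER is empty for a laptop build; on the cluster it could be "srun -c 1".
JOBSTARTER=""
# Launch the three independent top-level jobs, each with its own log file.
$JOBSTARTER sh -c 'echo bootloader built' > jobA.log 2>&1 & pidA=$!
$JOBSTARTER sh -c 'echo kernel built'     > jobB.log 2>&1 & pidB=$!
$JOBSTARTER sh -c 'echo rootfs built'     > jobC.log 2>&1 & pidC=$!
# wait <pid> returns that child's exit status, giving pass/fail per job.
wait $pidA; rcA=$?
wait $pidB; rcB=$?
wait $pidC; rcC=$?
echo "exit statuses: A=$rcA B=$rcB C=$rcC"
```

Each sub-build (the 7-8 boot pieces, the ~50 Buildroot packages) can be forked 
and collected the same way inside its parent job.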

Job A would then fork and build 7 to 8 sub-jobs.
When they are done it would assemble the result into what Xilinx calls boot.bin.

Job B is a Linux kernel build

Job C is Buildroot, so there are several (n=50) smaller builds, i.e. bash, 
busybox, and other tools like Python for the target; again, each of these can 
be executed in parallel.

I really cannot re-architect my build to be a Slurm-only build, because it 
also needs to be able to run without Slurm, i.e. build everything on my laptop 
without Slurm present.

In that case the jobs would run serially and take an hour or so; the hope is 
that by parallelizing the software build jobs our overall cycle time will 
improve.

It would also be nice if the slurm cluster would adapt to the available nodes 
automatically

Our hope is that we can run our lab PCs as dual-boot: they normally boot 
Windows, but we can dual-boot them into Linux so they become compile nodes and 
auto-join the cluster, and the cluster sees them as going offline when somebody 
reboots a machine back to Windows.


Sent from my iPhone

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


AstraZeneca UK Limited is a company incorporated in England and Wales with 
registered number:03674842 and its registered office at 1 Francis Crick Avenue, 
Cambridge Biomedical Campus, Cambridge, CB2 0AA.

This e-mail and its attachments are intended for the above named recipient only 
and may contain confidential and privileged information. If they have come to 
you in error, you must not copy or show them to anyone; instead, please reply 
to this e-mail, highlighting the error to the sender and then immediately 
delete the message. For information about how AstraZeneca UK Limited and its 
affiliates may process information, personal data and monitor communications, 
please see our privacy notice at 
www.astrazeneca.com



[slurm-users] Re: memory high water mark reporting

2024-05-22 Thread Cutts, Tim via slurm-users
Users can, of course, always just wrap the job itself in time(1) to record the 
maximum memory usage. A bit of a naïve approach, but it does work. I agree the 
polling of current usage is not very satisfactory.
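For instance, GNU time's -v flag reports "Maximum resident set size", which 
comes from the same kernel peak-RSS accounting that /proc exposes as VmHWM. A 
quick sketch on Linux (reading the shell's own high-water mark, since 
/usr/bin/time may not be installed everywhere):

```shell
# VmHWM is the kernel's peak-RSS ("high water mark") counter for a process.
# A job wrapper that prints this at exit records peak memory without polling.
grep VmHWM /proc/self/status
```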

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: greent10--- via slurm-users 
Date: Monday, 20 May 2024 at 12:10
To: Emyr James , Davide DelVento 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: memory high water mark reporting
Hi,

We have had similar questions from users regarding how best to find the peak 
memory usage of a job, since they may run a job and get a not-very-useful value 
in sacct fields such as MaxRSS, because Slurm didn't happen to poll at the 
moment of maximum memory usage.

With cgroup v1, from looking online, memory.max_usage_in_bytes takes caches 
into account, so it can vary with how much I/O is done, whilst total_rss in 
memory.stat looks more useful. Maybe cgroup v2's memory.peak is clearer?
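A sketch of reading the peak from whichever hierarchy is mounted (assuming 
Linux; a real epilog would point CG at the job's own cgroup directory rather 
than the root used here as a stand-in):

```shell
CG=/sys/fs/cgroup    # stand-in; would be the job's cgroup path in an epilog
if [ -r "$CG/memory.peak" ]; then
    # cgroup v2: peak memory usage in bytes
    PEAK="$(cat "$CG/memory.peak") bytes (cgroup v2, memory.peak)"
elif [ -r "$CG/memory/memory.max_usage_in_bytes" ]; then
    # cgroup v1: peak in bytes, but includes page cache
    PEAK="$(cat "$CG/memory/memory.max_usage_in_bytes") bytes (cgroup v1, includes cache)"
else
    PEAK="unavailable"
fi
echo "peak: $PEAK"
```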

It's not clear in the documentation how a user should use the sacct values to 
infer the actual usage of jobs and correct their behaviour in future 
submissions.

I would be keen to see improvements in high-water-mark reporting.  I noticed 
that the jobacctgather plugin documentation was deleted back in Slurm 21.08 – 
a SPANK plugin does possibly look like the way to go.  It also seems to be a 
common problem across technologies, e.g. 
https://github.com/google/cadvisor/issues/3286

Tom

From: Emyr James via slurm-users 
Date: Monday, 20 May 2024 at 10:50
To: Davide DelVento , Emyr James 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: memory high water mark reporting

Looking here :

https://slurm.schedmd.com/spank.html#SECTION_SPANK-PLUGINS

It looks like it's possible to hook something in at the right place using the 
slurm_spank_task_exit or slurm_spank_exit plugins. Does anyone have any 
experience or examples of doing this ? Is there any more documentation 
available on this functionality ?

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Emyr James via slurm-users 
Sent: 17 May 2024 01:15
To: Davide DelVento 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: memory high water mark reporting

Hi,

I have got a very simple LD_PRELOAD library that can do this. Maybe I should 
see if I can force slurmstepd to run with that LD_PRELOAD and then see if that 
does it.

Ultimately I am trying to get all the useful accounting metrics into a 
ClickHouse database. If the LD_PRELOAD on slurmstepd works, I can expand it to 
insert the relevant row into the ClickHouse DB in the C code of the preload 
library.

But still... this seems like a very basic thing to do, and I am very surprised 
that it seems so difficult to do with the standard accounting recording out of 
the box.

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation


From: Davide DelVento 
Sent: 17 May 2024 01:02
To: Emyr James 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] memory high water mark reporting

Not exactly the answer to your question (which I don't know), but if you can 
prefix whatever is executed with https://github.com/NCAR/peak_memusage
(which also uses getrusage) or a variant, you will be able to do that.

On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hi,

We are trying out slurm having been running grid engine for a long while.
In grid engine, the cgroups peak memory and max_rss are generated at the end of 
a job and recorded. It logs the information from the cgroup hierarchy as well 
as doing a getrusage call right at the end on the parent pid of the whole job 
"container" before cleaning up.
With slurm it seems that the only way memory is recorded is by the acct gather 
polling. I am trying to add something in an epilog script to get the 
memory.peak but It looks like the cgroup hierarchy has been destroyed by the 
time the epilog is run.
Where in the code is the cgroup hierarchy cleared up ? Is there no way to add 
something in so that the accounting is updated during the job cleanup process 
so that peak memory usage can be accurately logged ?

I can reduce the polling interval from 30s to 5s but don't know if this causes 
a lot 

[slurm-users] Re: scrontab question

2024-05-08 Thread Cutts, Tim via slurm-users
Someone may have said this already, but you know that you can replace 
0,5,10,15,20,25,30,35,40,45,50,55 with */5?

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Bjørn-Helge Mevik via slurm-users 
Date: Wednesday, 8 May 2024 at 07:38
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Re: scrontab question
Sandor via slurm-users  writes:

> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question

If you have an unsolvable problem, you don't have a problem, you have a
fact of life. :)

> Within scrontab editor I have the following example from the slurm
> documentation:
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * *
> /directory/subdirectory/crontest.sh

- The command (/directory/...) should be on the same line as the time
spec (0,5,...) - but that was perhaps just the email formatting.

- Check for any UTF-8 characters that look like ordinary ASCII, for
instance a "non-breaking space".  I tend to just pipe the text through
"od -a".

--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo





[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Cutts, Tim via slurm-users
We have Weka filesystems on one of our clusters and saw this; we discovered we 
had slightly misconfigured the Weka client, with the result that Weka's and 
Slurm's cgroups were fighting with each other, and this seemed to be the cause. 
Fixing the Weka cgroups config improved matters for us; I haven't heard anyone 
complain about it since.

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Paul Edmon via slurm-users 
Date: Wednesday, 10 April 2024 at 14:46
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Jobs of a user are stuck in Completing stage for a 
long time and cannot cancel them
Usually to clear jobs like this you have to reboot the node they are on.
That will then force the scheduler to clear them.
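An alternative sketch that avoids a full reboot (assuming a hypothetical node 
name node01): marking the node DOWN tells slurmctld to discard the stuck job 
records on it, after which the node can be returned to service.

```shell
# Set the node DOWN so slurmctld purges its jobs, then bring it back.
scontrol update nodename=node01 state=down reason="stuck in completing"
scontrol update nodename=node01 state=resume
```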

-Paul Edmon-

On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:
> We are running a slurm cluster with version `slurm 22.05.8`. One of our users 
> has reported that their jobs have been stuck at the completion stage for a 
> long time. Referring to Slurm Workload Manager - Slurm Troubleshooting Guide 
> we found that indeed the batchhost for the job was removed from the cluster, 
> perhaps without draining it first.
>
> How do we cancel/delete the jobs ?
>
> * We tried scancel on the batch and individual job ids from both the user and 
> from SlurmUser
>






[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Cutts, Tim via slurm-users
Agree with that.  Plus, of course, even if the jobs run a bit slower by not 
having all their cores on a single node, they will be scheduled sooner, so the 
overall turnaround time for the user will be better, and ultimately that's what 
they care about.  I've always been of the view, for any scheduler, that the 
less you try to constrain it the better.  It really depends on what you're 
trying to optimise for, but generally speaking I try to optimise for maximum 
utilisation and throughput, unless I have a specific business case that needs 
to prioritise particular workloads, and then I'll compromise on throughput to 
get the urgent workload through sooner.

Tim

From: Loris Bennett via slurm-users 
Sent: 09 April 2024 06:51
To: slurm-users@lists.schedmd.com 
Cc: Gerhard Strangar 
Subject: [slurm-users] Re: Avoiding fragmentation

Hi Gerhard,

Gerhard Strangar via slurm-users  writes:

> Hi,
>
> I'm trying to figure out how to deal with a mix of few- and many-cpu
> jobs. By that I mean most jobs use 128 cpus, but sometimes there are
> jobs with only 16. As soon as that job with only 16 is running, the
> scheduler splits the next 128 cpu jobs into 96+16 each, instead of
> assigning a full 128 cpu node to them. Is there a way for the
> administrator to achieve preferring full nodes?
> The existence of pack_serial_at_end makes me believe there is not,
> because that basically is what I needed, apart from my serial jobs using
> 16 cpus instead of 1.
>
> Gerhard

This may well not be relevant for your case, but we actively discourage
the use of full nodes for the following reasons:

  - When the cluster is full, which is most of the time, MPI jobs in
general will start much faster if they don't specify the number of
nodes and certainly don't request full nodes.  The overhead due to
the jobs being scattered across nodes is often much lower than the
additional waiting time incurred by requesting whole nodes.

  - When all the cores of a node are requested, all the memory of the
node becomes unavailable to other jobs, regardless of how much
memory is requested or indeed how much is actually used.  This holds
up jobs with low CPU but high memory requirements and thus reduces
the total throughput of the system.

These factors are important for us because we have a large number of
single core jobs and almost all the users, whether doing MPI or not,
significantly overestimate the memory requirements of their jobs.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin






[slurm-users] Re: SLURM in K8s, any advice?

2024-03-13 Thread Cutts, Tim via slurm-users
I really struggle to see the point of k8s for large computational workloads.  
It adds a lot of complexity, and I don’t see what benefit it brings.

If you really want to run containerised workloads as batch jobs on AWS, for 
example, then it’s a great deal simpler to do so using AWS Batch and ECS rather 
than doing all that stuff with Kubernetes.

Creating a Batch queue and job definition in CDK can be done in a couple of 
dozen lines of code.  See the example I wrote a year or so ago, recently 
updated now that AWS Batch has fully supported L2 constructs in CDK:  
https://github.com/tcutts/cdk-batch-python/tree/main which has a few more bells 
and whistles, like triggering batch job submissions as files arrive in an S3 
bucket, and closing the queue to jobs automatically if a budget threshold is 
exceeded, but it’s still only about 200 lines of code.

I really don’t understand what k8s would add to that sort of architecture.  In 
fact, when AWS added support for EKS to AWS Batch, I asked the internal team 
what the point of that was, and it was basically just “some customers insisted 
on it”.  No-one could actually articulate for me what tangible benefit there 
was to it.

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Sylvain MARET via slurm-users 
Date: Wednesday, 13 March 2024 at 10:29
To: Nicolas Greneche , 
slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: SLURM in K8s, any advice?
Hello,

I haven't played with slurm in k8s but I did attend this talk :
https://fosdem.org/2024/schedule/event/fosdem-2024-2590-kubernetes-and-hpc-bare-metal-bros/


Which shows that at least someone was able to do so, and maybe it would be
worth talking to her about it. I wanted to ask her for the code to reproduce
her experiment but I haven't had the time yet.

Regards,
Sylvain Maret

On 13/03/2024 11:04, Nicolas Greneche via slurm-users wrote:
> CAUTION : External Sender. Please do not click on links or open
> attachments from senders you do not trust.
>
>
> Hi Alan,
>
> Your topic is indeed my PhD thesis (defended late November). It consists
> in building autoscaling HPC infrastructure in the cloud (from a compute
> node provisioning point of view). In this work I show that Kubernetes'
> default controllers are not well designed for autoscaling containerized
> HPC clusters [1], and I wrote a super basic K8s controller for OAR [2]
> (another scheduler, developed at INRIA). This controller deserves a
> rewrite; it's only a proof of concept ^^.
>
> My guess is that you have nesting issues with cgroup/v2 inside your
> containerized compute nodes (those that run slurmd)? If that's the case,
> maybe you can use Kata Containers [3] instead of CRI-O as the container
> engine [4] (I did this in 2021). The main asset of Kata Containers is that
> it can use KVM.
>
> The consequence is that your CPU is a "real" vCPU and not a quota
> enforced by cgroups. I noticed that with Kata, my slurmd containers had
> a cpuinfo / nproc that reflected the limits enforced in my K8s
> manifest. I didn't go deep with Kata because I focused on the "controller"
> aspect of things. But using KVM may hide the nesting of cgroups from
> slurmd?
>
> I hope this help !
>
> Kind regards,
>
> [1] A methodology to scale containerized HPC infrastructures in the
> Cloud. Nicolas Greneche, Christophe Cérin and Tarek Menouer, at
> Euro-Par 2022
>
> [2] Autoscaling of Containerized HPC Clusters in the Cloud Nicolas
> Greneche, Christophe Cerin at SuperCompCloud: 6th Workshop on
> Interoperability of Supercomputing and Cloud Technologies (Held in
> conjunction with SC'22)
>
> [3] https://katacontainers.io
>
> [4]
> https://github.com/kata-containers/documentation/blob/master/how-to/run-kata-with-k8s.md
>
>
> Le 13/03/2024 à 09:06, LEAVY Alan via slurm-users a écrit :
>> I’m a little late to this party but would love to establish contact with
>> others using slurm in Kubernetes.
>>
>> I recently joined a research institute in Vienna (IIASA) and I’m getting
>> to grips with slurm and Kubernetes (my previous role was data
>> engineering / fintech). My current setup sounds like what Urban
>> described in this thread, back in Nov 22. It has some rough edges
>> though.
>>
>> Right now, I’m trying to upgrade to slurm-23.11.4 in Ubuntu 23.10
>> containers. I’m having trouble with the cgroup/v2 plugin.
>>
>> Are you still using slurm on K8s Urban? How did your installation work
>> out Hans?
>> Would either of you be willing to share your experiences?
>>
>> Regards,
>>
>>  Alan.
>>
>>
>>
>
> --
> Nicolas Greneche
> USPN / DSI
> Support à la recherche / RSSI Suppléant
> 

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Cutts, Tim via slurm-users
It depends on a number of factors.

How do your workloads behave?  Do they do a lot of fork()?  I’ve had cases in 
the past where users submitted scripts which initially used quite a lot of 
memory and then used fork() or system() to execute subprocesses.  This of 
course means that temporarily (between the fork() and the exec() system calls) 
the job uses twice as much virtual memory, although this does not become real 
because the pages are copy-on-write.  Something similar happens if the code 
performs mmap() on large files.

Whether this has an impact on your need for swap space depends on your sysctl 
settings for vm.overcommit_memory and vm.overcommit_ratio.

If you set vm.overcommit_memory to 2, then the OOM killer will never hit you 
(because malloc() will fail rather than allocate virtual memory that isn’t 
available), but cases like the above will tend to fail memory allocations 
unnecessarily, especially if you don’t have any swap allocated.

If you set vm.overcommit_memory to 0 or 1, then you need less swap allocated 
(possibly even zero) but you run the risk of running out of memory and the OOM 
killer blowing things up left right and centre.

If you provide swap, it only causes a performance impact if the node actually 
runs out of physical memory and actively starts swapping.

So bottom line is I think it depends on what you want the failure mode to be.


  1.  If you want everything to always run in a very deterministic way at full 
speed, with failures at the precise moment the memory is exhausted, but with a 
risk that jobs fail if they're relying on overcommit (e.g. through 
fork()/exec()), then vm.overcommit_memory=2 and no swap.
  2.  If you want high-throughput single-threaded stuff to run more smoothly 
(think: horrible genomics Perl and Python scripts, etc.), then 
overcommit_memory=0 and add some swap.  You'll probably get higher throughput, 
but things may blow up slightly unpredictably from time to time when nodes run 
out of memory.
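The two knobs above can be inspected (and, as root, set) via sysctl or /proc; 
a small sketch:

```shell
# 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
cat /proc/sys/vm/overcommit_memory
# Only meaningful in mode 2: commit limit = swap + overcommit_ratio% of RAM
cat /proc/sys/vm/overcommit_ratio
# To opt into strict accounting (root required, so commented out here):
# sysctl -w vm.overcommit_memory=2
```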

I now call on someone who understands cgroups properly to explain how this 
changes when cgroups are in play, because I’m not sure I understand that!

Tim


--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: John Joseph via slurm-users 
Date: Monday, 4 March 2024 at 07:06
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Is SWAP memory mandatory for SLURM
Dear All,
Good morning.
I have a 4-node Slurm instance up and running.
I would like to know: if I disable SWAP memory, will it affect Slurm's 
performance? Is SWAP a mandatory requirement? Each of my nodes has plenty of 
RAM; if my physical RAM is ample, is there any need for SWAP?
Thanks,
Joseph John






[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Cutts, Tim via slurm-users
HAProxy, for on-prem things.  In the cloud I just use their load balancers 
rather than implement my own.

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Dan Healy via slurm-users 
Date: Wednesday, 28 February 2024 at 20:56
To: Brian Andrus 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: [ext] Re: canonical way to run longer shell/bash 
interactive job (instead of srun inside of screen/tmux at front-end)?
Are most of us using HAProxy or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Magnus,

That is a feature of the load balancer. Most of them have that these days.

Brian Andrus

On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
> On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
>> for us, we put a load balancer in front of the login nodes with
>> session
>> affinity enabled. This makes them land on the same backend node each
>> time.
> Hi Brian,
> that sounds interesting - how did you implement session affinity?
> cheers
> magnus
>
>



--
Thanks,

Daniel Healy





[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Cutts, Tim via slurm-users
My view is that it depends entirely on the workload, and the systems with which 
your compute needs to interact.  A few things I’ve experienced before.


  1.  Modern Ethernet networks have pretty good latency these days, so MPI 
codes can run over them.  Whether IB is worth the money is a cost/benefit 
calculation for the codes you want to run.  The Ethernet network we put in at 
Sanger in 2016 or so had, as we measured it, similar latency in practice to FDR 
InfiniBand, if I remember correctly.  So it wasn't as good as state-of-the-art 
IB at the time, but not bad.  Certainly good enough for our purposes, and we 
gained a lot of flexibility through software-defined networking, which is 
important if you have workloads that require better security boundaries than 
just one big shared network.
  2.  If your workload is predominantly single node, embarrassingly parallel, 
you might do better to go with ethernet and invest the saved money in more 
compute nodes.
  3.  If you only have ethernet, your cluster will be simpler, and require less 
specialised expertise to run
  4.  If your parallel filesystem is Lustre, IB seems to be the more well-worn 
path than ethernet.  We encountered a few Lustre bugs early on because of that.
  5.  On the other hand, if you need to talk to Weka, ethernet is the well-worn 
path.  Weka’s IB implementation requires the dedication of some cores on every 
client node, so you lose some compute capacity, which you don’t need to do if 
you’re using ethernet.

So, as any lawyer would say “it depends”.  Most of my career has been in 
genomics, where IB definitely wasn’t necessary.  Now that I’m in pharma, 
there’s more MPI code, so there’s more of a case for it.

Ultimately, I think you need to run the real benchmarks with real code, and as 
Jason says, work out whether the additional complexity and cost of the IB 
network is worth it for your particular workload.  I don’t think the mantra 
“It’s HPC so it has to be Infiniband” is a given.

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Jason Simms via slurm-users 
Date: Monday, 26 February 2024 at 01:13
To: Dan Healy 
Cc: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Re: Question about IB and Ethernet networks
Hello Daniel,

In my experience, if you have a high-speed interconnect such as IB, you would 
do IPoIB. You would likely still have a "regular" Ethernet connection for 
management purposes, and yes that means both an IB switch and an Ethernet 
switch, but that switch doesn't have to be anything special. Any "real" traffic 
is routed over IB, everything is mounted via IB, etc. That's how the last two 
clusters I've worked with have been configured, and the next one will be the 
same (but will use Omnipath rather than IB). We likewise use BeeGFS.

These next comments are perhaps more likely to encounter differences of 
opinion, but I would say that sufficiently fast Ethernet is often "good enough" 
for most workloads (e.g., MPI). I'd wager that for all but the most demanding 
of workloads, it's entirely acceptable. You'll also save a bit of money, of 
course. HOWEVER, I do think there is, shall we say, an expectation from many 
researchers that any cluster worth its salt will have some kind of fast 
interconnect, even if at the scale of most on-prem work, you might be 
hard-pressed in real-world conditions to notice much of a difference. If you're 
running jobs that take weeks and hundreds of nodes, the time (and other) 
savings may add up, but if we're talking the difference between a job running 
on 5 nodes taking 48 hours vs. slightly less, is it worth it? Your mileage may 
vary, as they say...

Warmest regards,
Jason

On Sun, Feb 25, 2024 at 3:13 PM Dan Healy via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hi Fellow Slurm Users,

This question is not slurm-specific, but it might develop into that.

My question relates to understanding how typical HPCs are designed in terms of 
networking. To start, is it typical for there to be both high-speed Ethernet 
and InfiniBand networks (meaning separate switches and NICs)? I know you can 
easily set up IP over IB, but is IB usually reserved entirely for MPI messages? 
I’m tempted to spec all new HPCs with only a high-speed (200Gbps) IB network, 
and use IPoIB for all Slurm comms with compute nodes. I plan on using BeeGFS 
for the file system with RDMA.

Just looking for some feedback, please. Is this OK? Is there a better way? If 
yes, please share why it’s better.

Thanks,

Daniel Healy

--
slurm-users mailing list -- 
slurm-users@lists.schedmd.com
To unsubscribe send an email to 
slurm-users-le...@lists.schedmd.com


--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information 

[slurm-users] Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Cutts, Tim via slurm-users
Hi, I apologise if I’ve failed to find this in the documentation (and am happy 
to be told to RTFM), but a recent issue for one of my users resulted in a 
question I couldn’t answer.

LSF has a feature called a Pre-Exec, where a script executes to check whether a 
node is ready to run a task.  So you can run arbitrary checks and send the job 
back to the queue if they fail.

For example, if I have some automounted filesystems and I want to be able to 
check for automount failures, in an LSF world I can do:

  bsub -E "test -f /nfs/someplace/file_I_know_exists" my_job.sh

What’s the equivalent in SLURM?
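
(For what it’s worth, the closest analogue I’m aware of is Slurm’s Prolog 
mechanism: slurmd runs the Prolog script before each job, and a non-zero exit 
drains the node and requeues the job.  A minimal sketch, where the script path 
and the use of a marker file are assumptions rather than anything from the 
Slurm docs:)

```shell
#!/bin/sh
# Sketch of a Slurm Prolog script.  Wire it up in slurm.conf with e.g.
#   Prolog=/etc/slurm/prolog.d/check_mounts.sh   (path is an assumption)
# A non-zero exit from the Prolog drains the node and requeues the job,
# which approximates LSF's "bsub -E", though it runs per node for every
# job rather than being chosen per job at submission time.

check_mounts() {    # succeed only if every marker file passed in exists
    for marker in "$@"; do
        if ! test -f "$marker"; then
            echo "automount check failed for $marker on $(hostname)" >&2
            return 1
        fi
    done
    return 0
}

# In production the script would end with something like:
#   check_mounts /nfs/someplace/file_I_know_exists || exit 1
check_mounts "$0"   # harmless self-check against this script file itself
```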

Thanks,

Tim




AstraZeneca UK Limited is a company incorporated in England and Wales with 
registered number:03674842 and its registered office at 1 Francis Crick Avenue, 
Cambridge Biomedical Campus, Cambridge, CB2 0AA.

This e-mail and its attachments are intended for the above named recipient only 
and may contain confidential and privileged information. If they have come to 
you in error, you must not copy or show them to anyone; instead, please reply 
to this e-mail, highlighting the error to the sender and then immediately 
delete the message. For information about how AstraZeneca UK Limited and its 
affiliates may process information, personal data and monitor communications, 
please see our privacy notice at 
www.astrazeneca.com



Re: [slurm-users] slurm.conf

2024-01-18 Thread Cutts, Tim
Can you not also do this with a single configuration file, by configuring 
multiple clusters which the user can choose with the -M option?  I suppose it 
depends on the use case: if you want to be able to choose a dev cluster over 
the production one, to test new config options, then the environment variable 
approach makes sense.  If this is actually multiple clusters that the users are 
using in production, then the -M approach might work better?
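
From the user side that might look like this (the cluster names are 
assumptions, and multi-cluster operation requires both clusters to be 
registered in a shared slurmdbd):

```shell
# Assumed cluster names "dev" and "prod", both known to a shared slurmdbd.
sbatch -M dev  jobscript.sh     # submit to the dev cluster
sbatch -M prod jobscript.sh     # submit to production
squeue -M dev,prod              # one view across both clusters
```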

Tim



On 18/01/2024, 12:07, "slurm-users"  
wrote:
LEROY Christine 208562 <christine.ler...@cea.fr> writes:

> Is there an env variable in SLURM to tell where the slurm.conf is?
> We would like to have on the same client node, 2 type of possible submissions 
> to address 2 different cluster.

According to man sbatch:

   SLURM_CONF    The location of the Slurm configuration file.
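
So a per-cluster override could look like this (the config paths are 
assumptions; SLURM_CONF is read by the client commands, so it can be set per 
invocation):

```shell
# Hypothetical per-cluster config paths on the shared client node:
SLURM_CONF=/etc/slurm/clusterA/slurm.conf sbatch jobA.sh
SLURM_CONF=/etc/slurm/clusterB/slurm.conf sbatch jobB.sh

# ...or set it once for the shell session:
export SLURM_CONF=/etc/slurm/clusterB/slurm.conf
squeue    # now talks to cluster B's slurmctld
```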

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo






Re: [slurm-users] RPC rate limiting for different users

2023-11-28 Thread Cutts, Tim
Thanks for your swift response.

Ah yes, I had a look at the source code, and the only exception is root.  It 
doesn’t look like it would be too difficult to add an exceptions table which 
admins could configure, though.

We’ve enabled rate limiting, but it’s causing issues for some applications run 
by certain service account users.

It also seems that slurmrestd doesn’t cope well with it at all; we’ve been 
seeing lots of errors from it since enabling rate limiting.

Tim


From: slurm-users  on behalf of Ole Holm 
Nielsen 
Date: Tuesday, 28 November 2023 at 11:44
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] RPC rate limiting for different users
On 11/28/23 11:59, Cutts, Tim wrote:
> Is the new rate limiting feature always global for all users, or is there
> an option, which I’ve missed, to have different settings for different
> users?  For example, to allow a higher rate from web services which submit
> jobs on behalf of a large number of users?

The rate limiting is global for all users.  You can only play with the
various rl_* parameters described in the slurm.conf manual page to
increase bucket size etc. for everyone.
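
For reference, that tuning lives in SlurmctldParameters; a fragment along 
these lines (the values here are illustrative only, not recommendations):

```
# slurm.conf fragment: token-bucket RPC rate limiting, global for all users.
# Parameters are documented under SlurmctldParameters in man slurm.conf;
# the numbers below are made up for illustration.
SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1,rl_table_size=8192,rl_log_freq=10
```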

/Ole




[slurm-users] RPC rate limiting for different users

2023-11-28 Thread Cutts, Tim
Is the new rate limiting feature always global for all users, or is there an 
option, which I’ve missed, to have different settings for different users?  For 
example, to allow a higher rate from web services which submit jobs on behalf 
of a large number of users?

Tim

