[slurm-users] Re: Jobs not getting scheduled, no priority calculation, but still in queue?

2024-10-07 Thread Cutts, Tim via slurm-users
Hi Tim, On 10/7/24 11:13, Cutts, Tim via slurm-users wrote: > Something odd is going on on our cluster. User has

[slurm-users] Jobs not getting scheduled, no priority calculation, but still in queue?

2024-10-07 Thread Cutts, Tim via slurm-users
Something odd is going on on our cluster. User has a lot of pending jobs in a job array (a few thousand).

    squeue -u kmnx005 -r -t PD | head -5
    JOBID        PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
    3045324_875  core       run_scp_  kmnx005  PD  0:00  1
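For anyone hitting the same thing, the first two checks I'd suggest (job ID and user taken from the example above):

    # what reason does Slurm give for the pending jobs?
    squeue -u kmnx005 -r -t PD -O JobID,Reason | head -5
    # has slurmctld calculated a priority for the array at all?
    sprio -j 3045324 -l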

[slurm-users] Re: Configuration for nodes with different TmpFs locations and TmpDisk sizes

2024-09-05 Thread Cutts, Tim via slurm-users
I’ve always had local storage mounted in the same place, in /tmp. In LSF clusters, I just let LSF’s lim get on with autodetecting how big /tmp was and setting the tmp resource automatically. I presume SLURM can do the same thing, but I’ve never checked. Tim
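For reference, the relevant settings as I understand them: TmpFS is a global slurm.conf option (default /tmp), and TmpDisk is declared per node, in megabytes. A minimal sketch, with made-up node names and sizes:

    # slurm.conf
    TmpFS=/scratch
    NodeName=node[001-100] TmpDisk=409600   # ~400 GB of local scratch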

[slurm-users] Re: Upgrade node while jobs running

2024-08-02 Thread Cutts, Tim via slurm-users
Generally speaking, as a best practice I’d perform such things with no jobs running, but some upgrades you can allow without it. Upgrading a package, even one which is currently in use by a running job, does not necessarily kill the job. For example, upgrading a shared library won’t kill existing
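For riskier upgrades, a sketch of the cautious route (node name illustrative):

    # stop new jobs landing on the node; running jobs finish normally
    scontrol update NodeName=node042 State=DRAIN Reason="package upgrade"
    # ...do the upgrade once the node shows drained and idle, then:
    scontrol update NodeName=node042 State=RESUME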

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-02 Thread Cutts, Tim via slurm-users
You can’t have both exclusive access to a node and sharing; that makes no sense. You see this on AWS as well – you can select either sharing a physical machine or not. There is no “don’t share if possible, and share otherwise”. Unless you configure SLURM to overcommit CPUs, by definition if you
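A minimal sketch of the whole-node case, assuming 48-core nodes and a hypothetical threaded binary:

    #!/bin/bash
    #SBATCH --exclusive          # whole node; nothing else scheduled on it
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=48   # adjust to your node size
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_threaded_program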

[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Cutts, Tim via slurm-users
It sounds to me as though your systemd units are starting in the wrong order, or don’t have appropriate dependencies set in them? Tim
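If that is the problem, a sketch of a systemd drop-in for slurmd; the unit name is an assumption (it varies by distribution and driver packaging):

    # /etc/systemd/system/slurmd.service.d/10-nvidia.conf
    [Unit]
    After=nvidia-persistenced.service
    Wants=nvidia-persistenced.service

followed by systemctl daemon-reload and a restart of slurmd.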

[slurm-users] SLURM noob administrator question

2024-07-11 Thread Cutts, Tim via slurm-users
Still learning about SLURM, so please forgive me if I ask a naïve question. I like to use Anders Halager’s gnodes command to visualise the state of our nodes. I’ve noticed lately that we fairly often see things like this (apologies for line wrap): +- core - 46 cores & 186GB --

[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Cutts, Tim via slurm-users
You have two options for managing those dependencies, as I see it (a sketch of the second follows below):
1. You use SLURM’s native job dependencies, but this requires you to create a build script for SLURM.
2. You use make to submit the jobs, and take advantage of the -j flag to make it run lots of tasks at once; just use a job
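A sketch of the second option, using srun as the launcher inside make; compiler, flags and resource requests are illustrative (and recipe lines must be tab-indented):

    # Makefile
    SRUN = srun --ntasks=1 --cpus-per-task=1 --mem=2G
    %.o: %.c
            $(SRUN) gcc -c $< -o $@

    # then let make fan the compile jobs out through Slurm:
    make -j 32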

[slurm-users] Re: memory high water mark reporting

2024-05-22 Thread Cutts, Tim via slurm-users
Users can, of course, always just wrap the job itself in time(1) to record the maximum memory usage. Bit of a naïve approach, but it does work. I agree the polling of current usage is not very satisfactory. Tim
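Concretely, that's GNU time (the /usr/bin binary, not the shell built-in) with -v; output trimmed and the figure illustrative:

    /usr/bin/time -v ./my_job
    ...
    Maximum resident set size (kbytes): 1048576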

[slurm-users] Re: scrontab question

2024-05-08 Thread Cutts, Tim via slurm-users
Someone may have said this already, but you know that you can replace 0,5,10,15,20,25,30,35,40,45,50,55 with */5? Tim
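That is, in scrontab:

    # every five minutes, same as listing all twelve values
    */5 * * * * /path/to/script.sh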

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Cutts, Tim via slurm-users
We have Weka filesystems on one of our clusters and saw this; we discovered we had slightly misconfigured the Weka client, so that Weka’s and SLURM’s cgroups were fighting with each other, which seemed to be the cause. Fixing the Weka cgroups config improved the problem, for us

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Cutts, Tim via slurm-users
Agree with that. Plus, of course, even if the jobs run a bit slower by not having all the cores on a single node, they will be scheduled sooner, so the overall turnaround time for the user will be better, and ultimately that's what they care about. I've always been of the view, for any scheduler

[slurm-users] Re: SLURM in K8s, any advice?

2024-03-13 Thread Cutts, Tim via slurm-users
I really struggle to see the point of k8s for large computational workloads. It adds a lot of complexity, and I don’t see what benefit it brings. If you really want to run containerised workloads as batch jobs on AWS, for example, then it’s a great deal simpler to do so using AWS Batch and ECS

[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Cutts, Tim via slurm-users
It depends on a number of factors. How do your workloads behave? Do they do a lot of fork()? I’ve had cases in the past where users submitted scripts which initially used quite a lot of memory and then used fork() or system() to execute subprocesses. This of course means that temporarily (between the fork() and the exec())
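For reference, the kernel knobs that govern this, worth checking before deciding about swap:

    sysctl vm.overcommit_memory   # 0 = heuristic overcommit (the usual default)
    sysctl vm.overcommit_ratio    # only consulted when overcommit_memory=2
    # with strict accounting (overcommit_memory=2) and no swap, a large
    # process calling fork()/system() can get ENOMEM even though the child
    # would exec() almost immediately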

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Cutts, Tim via slurm-users
HAProxy, for on-prem things. In the cloud I just use their load balancers rather than implement my own. Tim
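For completeness, the sort of HAProxy config I mean for balancing SSH across login nodes; names and addresses are made up:

    # haproxy.cfg -- TCP mode, since SSH isn't HTTP
    frontend ssh_in
        bind *:2222
        mode tcp
        default_backend login_nodes
    backend login_nodes
        mode tcp
        balance leastconn
        server login1 10.0.0.11:22 check
        server login2 10.0.0.12:22 check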

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Cutts, Tim via slurm-users
My view is that it depends entirely on the workload, and the systems with which your compute needs to interact. A few things I’ve experienced before:
1. Modern Ethernet networks have pretty good latency these days, and so MPI codes can run over them. Whether IB is worth the money is a cost

[slurm-users] Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Cutts, Tim via slurm-users
Hi, I apologise if I’ve failed to find this in the documentation (and am happy to be told to RTFM), but a recent issue for one of my users resulted in a question I couldn’t answer. LSF has a feature called a Pre-Exec, where a script executes to check whether a node is ready to run a task. So, you
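The closest Slurm analogue I'm aware of is the Prolog: if the script exits non-zero, Slurm drains the node and requeues the job, which is roughly the Pre-Exec behaviour. A minimal sketch; the path and the check are made up:

    # slurm.conf
    Prolog=/etc/slurm/prolog.sh

    # /etc/slurm/prolog.sh
    #!/bin/bash
    # refuse the job (and drain the node) if shared storage isn't mounted
    mountpoint -q /shared || exit 1
    exit 0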

Re: [slurm-users] slurm.conf

2024-01-18 Thread Cutts, Tim
Can you not also do this with a single configuration file, by configuring multiple clusters which the user can choose with the -M option? I suppose it depends on the use case; if you want to be able to choose a dev cluster over the production one, to test new config options, then the environment
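For example (cluster names illustrative; -M needs the clusters registered in slurmdbd):

    sbatch -M dev myjob.sh        # submit to the dev cluster
    squeue -M dev,production      # query both at once
    # or the environment-variable route, pointing the tools at another config:
    export SLURM_CONF=/etc/slurm-dev/slurm.conf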

Re: [slurm-users] RPC rate limiting for different users

2023-11-28 Thread Cutts, Tim
On 11/28/23 11:59, Cutts, Tim wrote: > Is the new rate limiting feature always global for all users, or is there an option, which I’ve missed, to have different settings for different users

[slurm-users] RPC rate limiting for different users

2023-11-28 Thread Cutts, Tim
Is the new rate limiting feature always global for all users, or is there an option, which I’ve missed, to have different settings for different users? For example, to allow a higher rate from web services which submit jobs on behalf of a large number of users? Tim
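For context, this is the token-bucket limiter enabled through SlurmctldParameters, e.g. (values illustrative):

    # slurm.conf
    SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10,rl_refill_period=1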