[slurm-users] Re: SLUG'24 presentation slides?

2024-09-27 Thread Kilian Cavalotti via slurm-users
On Mon, Sep 23, 2024 at 10:49 AM Kilian Cavalotti via slurm-users <slurm-users@lists.schedmd.com> wrote: >> Hi SchedMD, >> >> I'm sure they will eventually, but do you know when the

[slurm-users] SLUG'24 presentation slides?

2024-09-23 Thread Kilian Cavalotti via slurm-users
Hi SchedMD, I'm sure they will eventually, but do you know when the slides of the SLUG'24 presentation will be available online at https://slurm.schedmd.com/publications.html, like previous editions'? Thanks! -- Kilian

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-16 Thread Kilian Cavalotti
Those CVEs are indeed for different software (one for PMIx, one for Slurm), even though they're ultimately for the same kind of underlying problem (chown() being used instead of lchown(), which could lead to privileged files being taken over). The Slurm patches include more fixes related to permissions
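To illustrate the chown()/lchown() distinction mentioned above, here is a minimal Python sketch. It is not the Slurm or PMIx code. The file names and the helper names `chown_no_follow` and `is_symlink` are hypothetical, and the demo re-applies the caller's own uid/gid so that it runs without root.

```python
import os
import stat
import tempfile


def chown_no_follow(path, uid, gid):
    """Change ownership of `path` itself, never of a symlink's target."""
    # os.lchown() acts on the link; os.chown() would follow it to the target.
    os.lchown(path, uid, gid)


def is_symlink(path):
    """Return True if `path` is a symlink (lstat does not follow links)."""
    return stat.S_ISLNK(os.lstat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "privileged_file")  # stands in for a sensitive file
    link = os.path.join(tmp, "user_controlled")    # stands in for a planted symlink
    open(target, "w").close()
    os.symlink(target, link)

    # Re-apply our own uid/gid: only the link is touched, not the target.
    chown_no_follow(link, os.getuid(), os.getgid())

    link_is_symlink = is_symlink(link)
    target_is_symlink = is_symlink(target)
```

A privileged process that calls chown() on a path a user can replace with a symlink ends up changing the owner of whatever the link points to; operating on the link itself (or refusing symlinks outright) avoids that.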

Re: [slurm-users] END Mail notifications not being sent?

2023-07-11 Thread Kilian Cavalotti
And to close the loop on this, the "smail" fix will be in 23.02.4 when it's released https://bugs.schedmd.com/show_bug.cgi?id=17123 Cheers, -- Kilian On Mon, Jul 3, 2023 at 9:30 AM Angel de Vicente wrote: > > Hello, > > Angel de Vicente writes: > > > Any idea what could be going on or how to de

Re: [slurm-users] Enforcing GPU-CPU ratios

2023-03-14 Thread Kilian Cavalotti
On Tue, Jun 23, 2020 at 7:37 AM Bas van der Vlies wrote: > > Which version of slurm do you use? as slurm 19.05: > * DefCpuPerGPU Sorry for necroposting and undigging this old thread, but the DefCpuPerGpu configuration option is actually just a default, which will happily get overridden by job s

Re: [slurm-users] Unable to delete account

2023-03-06 Thread Kilian Cavalotti
Hi Simon, On Mon, Mar 6, 2023 at 1:34 PM Simon Gao wrote: > We are experiencing an issue with deleting any Slurm account. > > When running a command like: sacctmgr delete account , > following errors are returned and the command failed. > > # sacctmgr delete account > Database is busy or waitin

Re: [slurm-users] Number of allocated cores/threads ..

2022-12-12 Thread Kilian Cavalotti
Hi Sefa, `scontrol -d show job ` should give you that information: # scontrol -d show job 2781284 | grep Nodes= NumNodes=10 NumCPUs=256 NumTasks=128 CPUs/Task=2 ReqB:S:C:T=0:0:*:* Nodes=sh03-01n29 CPU_IDs=4-6,12-19,22-23,25 Mem=71680 GRES= Nodes=sh03-01n[38,40] CPU_IDs=0-31 Mem=1638
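As a rough sketch of how the per-node `CPU_IDs=` field shown above could be turned into a core count, the following Python expands the ID ranges and counts them. The function names are hypothetical, and the parsing assumes the `key=value` layout seen in the sample line; it is not an official Slurm interface.

```python
def expand_cpu_ids(spec):
    """Expand a CPU_IDs string such as '4-6,12-19,25' into a list of ints."""
    ids = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))
        else:
            ids.append(int(part))
    return ids


def cpus_on_nodes_line(line):
    """Return (Nodes value, number of CPU IDs) for one 'Nodes=...' line."""
    fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
    return fields["Nodes"], len(expand_cpu_ids(fields["CPU_IDs"]))


# Line taken from the scontrol output quoted in the message above.
line = "Nodes=sh03-01n29 CPU_IDs=4-6,12-19,22-23,25 Mem=71680 GRES="
nodes, ncpus = cpus_on_nodes_line(line)
```

When the `Nodes=` value is a bracketed range such as `sh03-01n[38,40]`, the count applies to each node in that range.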

Re: [slurm-users] Preemption for licenses

2022-12-09 Thread Kilian Cavalotti
Hi Allan, On Fri, Dec 9, 2022 at 3:20 PM Carter, Allan wrote: > If a job is pending only because it needs a license and all are being used, > can it preempt jobs in a lower priority partition that are using the license? > Or does preemption only work for compute resources. I've tried to configu

Re: [slurm-users] srun --mem issue

2022-12-08 Thread Kilian Cavalotti
Hi Loris, On Thu, Dec 8, 2022 at 12:59 AM Loris Bennett wrote: > However, I do have a chronic problem with users requesting too much > memory. My approach has been to try to get people to use 'seff' to see > what resources their jobs in fact need. In addition each month we > generate a graphical

Re: [slurm-users] Problems building RPMs

2022-07-21 Thread Kilian Cavalotti
Hi Phil, Link-time optimization (LTO) has been enabled by default in RHEL9: https://fedoraproject.org/wiki/LTOByDefault https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html-single/developing_c_and_cpp_applications_in_rhel_9/index#ref_link-time-optimization_using-libraries

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Kilian Cavalotti
On Wed, Jun 2, 2021 at 10:13 PM Ahmad Khalifa wrote: > How to send a job to a particular gpu card using its ID (0,1,2...etc)? Well, you can't, because: 1. GPU ids are something of a relative concept: https://bugs.schedmd.com/show_bug.cgi?id=10933 2. requesting specific GPUs is not supported: ht

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Kilian Cavalotti
On Tue, May 11, 2021 at 5:55 AM Renfro, Michael wrote: > > XDMoD [1] is useful for this, but it’s not a simple script. It does have some > user-accessible APIs if you want some report automation. I’m using that to > create a lightning-talk-style slide at [2]. > > [1] https://open.xdmod.org/ > [2

Re: [slurm-users] NVML autodetect "Failed to get supported memory frequencies" error

2021-03-05 Thread Kilian Cavalotti
Hi Joshua, On Thu, Mar 4, 2021 at 8:38 PM Joshua Baker-LePain wrote: > slurmd: error: _nvml_get_mem_freqs: Failed to get supported memory > frequencies > slurmd: error: for the GPU : Not Supported > slurmd: 4 GPU system device(s) detected > slurmd: WARNING: The following autodetected GPUs a

Re: [slurm-users] cpu core exclusion?

2021-01-20 Thread Kilian Cavalotti
On Wed, Jan 20, 2021 at 12:56 PM Brian Andrus wrote: > We would need more information. > At a minimum, what client is it? As this is not a slurm issue, you would need > to dig into what is causing that behavior with your storage system. And if the question is how to make sure Slurm won't allocat

Re: [slurm-users] Compiling Slurm with nvml support

2020-09-24 Thread Kilian Cavalotti
Hi Jason, We're taking the approach proposed in https://bugs.schedmd.com/show_bug.cgi?id=7919: same RPM everywhere, but without the dependencies that you don't want installed globally (like NVML, PMIx...). Of course you need to satisfy those dependencies some other way on the nodes that require th

Re: [slurm-users] ProEpiLogInterfacePlugin -> PerilogueInterfacePlugin (E.A. Schneider @ CMU'76?)

2020-02-21 Thread Kilian Cavalotti
On Fri, Feb 21, 2020 at 12:38 AM Benjamin Redling wrote: > If there isn't already a better name, I suggest > "PerilogueInterfacePlugin", because of the following possible historical > IT-roots: > > As "prologue" comes from the Greek "προ", meaning "before", and as > "epilogue" comes from the Greek

Re: [slurm-users] Maxjobs to accrue age priority points

2019-12-12 Thread Kilian Cavalotti
Hi Chris, On Thu, Dec 12, 2019 at 10:47 AM Christopher Benjamin Coffey wrote: > I believe I heard recently that you could limit the number of users jobs that > accrue age priority points. Yet, I cannot find this option in the man pages. > Anyone have an idea? Thank you! It's the *JobsAccrue*

Re: [slurm-users] Submission without Scheduling

2019-12-02 Thread Kilian Cavalotti
Hi Lev, On Mon, Dec 2, 2019 at 2:31 PM Lev Lafayette wrote: > Do others have a special arrangement for managing jobs during outages, apart > from "no arrangements, no jobs". Slurm supports reservations, which can typically be used to make sure no job runs during a scheduled downtime (but can sti

Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-13 Thread Kilian Cavalotti
Hi Jürgen, I would take a look at the various *KmemSpace options in cgroup.conf, they can certainly help with this. Cheers, -- Kilian On Thu, Jun 13, 2019 at 2:41 PM Juergen Salk wrote: > > Dear all, > > I'm just starting to get used to Slurm and play around with it in a small test > environm

Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

2019-06-06 Thread Kilian Cavalotti
On Thu, Jun 6, 2019 at 11:16 AM Christopher Samuel wrote: > Sounds like a good reason to file a bug. Levi did already. Everybody can vote at https://bugs.schedmd.com/show_bug.cgi?id=7191 :) Cheers, -- Kilian

Re: [slurm-users] Slurm Fairshare / Multifactor Priority

2019-05-29 Thread Kilian Cavalotti
Hi Paul, I'm wondering about this part in your SchedulerParameters: ### default_queue_depth should be some multiple of the partition_job_depth, ### ideally number_of_partitions * partition_job_depth, but typically the main ### loop exits prematurely if you go over about 400. A partition_job_depth

Re: [slurm-users] spart: A user-oriented partition info command for slurm

2019-05-03 Thread Kilian Cavalotti
Hi Ahmet, Very useful tool for us, we've adopted it! https://news.sherlock.stanford.edu/posts/a-better-view-at-sherlock-s-resources Thank you very much for writing it. Cheers, -- Kilian On Wed, Mar 27, 2019, 02:53 mercan wrote: > Hi; > > Except sjstat script, Slurm does not contain a comman

Re: [slurm-users] How does cgroups limit user access to GPUs?

2019-04-10 Thread Kilian Cavalotti
Hi Randy! > We have a slurm cluster with a number of nodes, some of which have more than > one GPU. Users select how many or which GPUs they want with srun's "--gres" > option. Nothing fancy here, and in general this works as expected. But > starting a few days ago we've had problems on one

Re: [slurm-users] Topology configuration questions:

2019-01-18 Thread Kilian Cavalotti
On Fri, Jan 18, 2019 at 6:31 AM Prentice Bisbal wrote: > > Note that if you care about node weights (eg. NodeName=whatever001 > > Weight=2, etc. in slurm.conf), using the topology function will disable it. > > I believe I was promised a warning about that in the future in a > > conversation wit

Re: [slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault

2018-11-13 Thread Kilian Cavalotti
Hi Bill, On Tue, Nov 13, 2018 at 5:35 PM Bill Broadley wrote: > (gdb) bt > #0 _step_dealloc_lps (step_ptr=0x555787af0f70) at step_mgr.c:2092 > #1 post_job_step (step_ptr=step_ptr@entry=0x555787af0f70) at step_mgr.c:4720 > #2 0x55578571d1d8 in _post_job_step (step_ptr=0x555787af0f70) at >

Re: [slurm-users] Slurm strigger configuration

2018-09-19 Thread Kilian Cavalotti
On Wed, Sep 19, 2018 at 9:21 AM Christopher Benjamin Coffey wrote: > The only thing that I've gotten working so far is this: > sudo -u slurm bash -c "strigger --set -D -n cn15 -p > /common/adm/slurm/triggers/nodestatus" > > So, that will run the nodestatus script which emails when the node cn15 g

Re: [slurm-users] select/cons_res - found bug when allocating job with --cpus-per-task (-c) option on slurm 17.11.9 (fix included).

2018-09-05 Thread Kilian Cavalotti
Hi Didier, On Wed, Sep 5, 2018 at 7:39 AM Didier GAZEN wrote: > What do you think? I'd recommend opening a bug at https://bugs.schedmd.com to report your findings, if you haven't done that already. This is the best way to get attention of the developers and get this fixed. Cheers, -- Kilian

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-08-22 Thread Kilian Cavalotti
Hi Christian, On Wed, Aug 22, 2018 at 7:27 AM, Christian Peter wrote: > we observed a strange behavior of pam_slurm_adopt regarding the involved > cgroups: > > when we start a shell as a new Slurm job using "srun", the process has > freezer, cpuset and memory cgroups setup as e.g. > "/slurm/uid_5

Re: [slurm-users] Determine usage for a QOS?

2018-08-20 Thread Kilian Cavalotti
Hi Chris, On Sun, Aug 19, 2018 at 6:26 PM, Christopher Samuel wrote: > We are using QOS's for projects which have been granted a fixed set of > time for higher priority work which works nicely, but have just been > asked the obvious question "how much time do we have left?". I _think_ that "scon

Re: [slurm-users] How do you orchestrate SLURM operations, what tools do you use?

2018-08-15 Thread Kilian Cavalotti
On Wed, Aug 15, 2018 at 11:57 AM, Michael Jennings wrote: > We [...] are planning to investigate clush [...] in the near future. I can only encourage you to do so, as ClusterShell comes with nice Slurm bindings out of the box that allow you, among other things, to execute commands on all the nodes:

Re: [slurm-users] How do you orchestrate SLURM operations, what tools do you use?

2018-08-15 Thread Kilian Cavalotti
On Wed, Aug 15, 2018 at 7:01 AM, Paul Edmon wrote: > So we use NHC for our automatic node closer. For reopening we have a series > of scripts that we use but they are all ad hoc and not formalized. Same > with closing off subsets of nodes we just have a bunch of bash scripts that > we have rolle

Re: [slurm-users] DefMemPerCPU is reset to 1 after upgrade

2018-07-10 Thread Kilian Cavalotti
On Tue, Jul 10, 2018 at 10:34 AM, Taras Shapovalov wrote: > I noticed the commit that can be related to this: > > https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e Yes. See also this bug: https://bugs.schedmd.com/show_bug.cgi?id=5240 This commit will be reverted in

Re: [slurm-users] Finding submitted job script

2018-07-10 Thread Kilian Cavalotti
On Tue, Jul 10, 2018 at 10:05 AM, Jessica Nettelblad wrote: > In the master branch, scontrol write batch_script also has the option to > write the job script to STDOUT instead of a file. This is what we use in the > prolog when we gather information for later (possible) troubleshooting. So I > sup

Re: [slurm-users] Alocating a subset cores to each job

2018-06-12 Thread Kilian Cavalotti
Hi Nadav, On Tue, Jun 12, 2018 at 8:18 AM, Nadav Toledo wrote: > How can one send a few jobs running in parallel with different cpus > allocation on the same node? According to https://slurm.schedmd.com/srun.html#OPT_cpu-bind, you may want to use "srun --exclusive": By default, a job step h

Re: [slurm-users] Understanding gres binding

2018-05-10 Thread Kilian Cavalotti
Hi Paul, I'd first suggest upgrading to 17.11.6; I think the first couple of 17.11.x releases had some issues with GRES binding. Then, I believe you also need to request all of your cores to be allocated on the same socket, if that's what you want. Something like --ntasks-per-socket=16. Her

Re: [slurm-users] "allocated+" status

2018-04-16 Thread Kilian Cavalotti
Hi Andy, On Mon, Apr 16, 2018 at 8:43 AM, Andy Riebs wrote: > I hadn't realized that jobs can be scheduled to run on a node that is still > in "completing" state from an earlier job. We occasionally use epilog > scripts that can take 30 seconds or longer, and we really don't want the > next job t

Re: [slurm-users] Checking allocated GRES? SLURM 16.05.x

2018-02-05 Thread Kilian Cavalotti
Hi Ryan, On Mon, Feb 5, 2018 at 8:06 AM, Ryan Novosielski wrote: > We currently use SLURM 16.05.10 and one of our staff asked how they > can check for allocated GPUs, as you might check allocated CPUs by > doing scontrol show node. I could have sworn that you can see both, > but I see that only C

Re: [slurm-users] Slurm SPANK GPU Compute Mode plugin

2018-01-23 Thread Kilian Cavalotti
Hi Miguel, On Tue, Jan 23, 2018 at 4:41 AM, Miguel Gila wrote: > Hi Kilian, a question on this: which version of Slurm/Lua are you running > this against?? Slurm 17.11.x and Lua 5.1 > I don’t seem able to generate the RPM on 17.02.9/Lua 5.2 ; it throws similar > errors to what I had seen earlie

[slurm-users] Slurm SPANK GPU Compute Mode plugin

2018-01-22 Thread Kilian Cavalotti
Hi all, We (Stanford Research Computing Center) developed a SPANK plugin which allows users to choose the GPU compute mode [1] for their jobs. [1] http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-modes This came from the need to give our users some control on the way GPUs

Re: [slurm-users] Get list of nodes and their status, one node per line, no duplicates

2017-11-08 Thread Kilian Cavalotti
Hi Jeff, Quite close: $ sinfo --Format=nodehost,statelong Cheers, -- Kilian
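For scripting on top of that command, a small Python sketch could turn its output into a node-to-state mapping. The sample text and node names below are hypothetical, and the parser assumes one `host state` pair per line, as would be produced by adding sinfo's `--noheader` option.

```python
def parse_sinfo_nodes(text):
    """Map each hostname to its state, keeping the first occurrence only."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        host, state = fields
        nodes.setdefault(host, state)
    return nodes


# Hypothetical output of: sinfo --noheader --Format=nodehost,statelong
sample = """\
node001             idle
node002             allocated
node002             allocated
node003             drained
"""
nodes = parse_sinfo_nodes(sample)
```

Keeping only the first occurrence of each hostname is a defensive choice in case a node shows up on more than one line.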