[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users
Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu partition. On Tue, 9 Jul 2024 2:39pm, Timony, Mick wrote: Hi Paul, There could be multiple reasons why the job isn't running, from ...
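For anyone hitting the same symptom, a minimal sketch of how to spot and relax such a limit (the partition name comes from this thread; the replacement value is illustrative):

    # show the per-CPU memory cap on the pubgpu partition
    scontrol show partition pubgpu | grep -i maxmem

    # to remove the cap, set MaxMemPerCPU=0 (unlimited) on the partition
    # line in slurm.conf, then re-read the configuration:
    scontrol reconfigure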

[slurm-users] Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Paul Raines via slurm-users

[slurm-users] Re: Reserving resources for use by non-slurm stuff

2024-04-17 Thread Paul Raines via slurm-users
... cause weird behavior with a lot of system tools. So far the root/daemon processes work fine in the 20GB limit, though, so that MemoryHigh=20480M is one and done. Then reboot.
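A minimal sketch of the systemd side of this approach, assuming the goal is to cap non-Slurm system daemons at roughly 20 GB so the rest of the node's memory stays free for jobs (MemoryHigh requires the unified cgroup v2 hierarchy):

    # cap the system daemons; writes a persistent drop-in and applies it immediately
    systemctl set-property system.slice MemoryHigh=20480M

    # equivalent drop-in file:
    #   /etc/systemd/system/system.slice.d/override.conf
    #   [Slice]
    #   MemoryHigh=20480M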

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Raines via slurm-users
... counts with sacctmgr -i add user "$u" account=$acct fairshare=parent. On Wed, 27 Mar 2024 9:22am, Long, Daniel S. via slurm-users wrote: Hi, I'm trying to set up multifactor priority on our clu...
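The command in this reply generalizes to a small loop; a minimal sketch, with the account and user names purely illustrative:

    # Give every user in an account fairshare=parent so they draw on the
    # account's share rather than competing through per-user shares.
    acct=labx
    for u in alice bob carol; do
        sacctmgr -i add user "$u" account=$acct fairshare=parent
    done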

[slurm-users] Re: Lua script

2024-03-06 Thread Paul Raines via slurm-users
Alternatively, consider setting EnforcePartLimits=ALL in slurm.conf.
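For reference, a one-line slurm.conf sketch of that setting; with ALL, a job submitted to several partitions is rejected at submission time if it violates the limits of any of them (ANY would only require one partition's limits to be satisfied):

    # slurm.conf
    EnforcePartLimits=ALL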

[slurm-users] Re: Slurm billback and sreport

2024-03-05 Thread Paul Raines via slurm-users
Will using the option "End=now" with sreport not exclude the still-pending array jobs while including data for the completed ones? On Mon, 4 Mar 2024 5:18pm, Chip Seraphine via slurm-users wrote: ...
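A minimal sketch of the kind of sreport invocation being discussed (report type and start date are illustrative); End=now closes the reporting window at the current time:

    sreport cluster AccountUtilizationByUser \
        Start=2024-02-01 End=now -t Hours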

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Raines via slurm-users
...tname mlsc-login.nmr.mgh.harvard.edu mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST SLURM_JOB_NODELIST=rtx-02 Seems you MUST use srun. On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote: salloc is the currently recom...
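A minimal sketch of the behavior described above: salloc by itself leaves your shell on the login node (unless LaunchParameters=use_interactive_step is configured in recent Slurm releases), even though the allocation's node list is set, so you need srun to actually land on the compute node:

    salloc -n1 -c4 --mem=8G -t 1:00:00
    hostname                      # still the login node
    echo $SLURM_JOB_NODELIST      # the allocated node, e.g. rtx-02
    srun --pty /bin/bash          # now you get a shell on the allocated node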

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Raines via slurm-users
... with safe requeueing. On Wed, 14 Feb 2024 9:32am, Paul Edmon via slurm-users wrote: You probably want the Prolog option ...
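A minimal sketch of a Prolog script used as an LSF-style pre-exec check (path and the check itself are hypothetical); if the prolog exits non-zero, Slurm drains the node and requeues the job rather than running it there:

    #!/bin/bash
    # /etc/slurm/prolog.sh, enabled with Prolog=/etc/slurm/prolog.sh in slurm.conf
    # Refuse to start jobs on a node whose scratch filesystem is not mounted.
    if ! mountpoint -q /scratch; then
        echo "prolog: /scratch not mounted on $(hostname)" >&2
        exit 1
    fi
    exit 0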

[slurm-users] scheme for protected GPU jobs from preemption

2024-02-06 Thread Paul Raines via slurm-users
...plicated cron job that tries to do it all "outside" of SLURM issuing scontrol commands.

[slurm-users] Re: after upgrade to 23.11.1 nodes stuck in completion state

2024-02-01 Thread Paul Raines via slurm-users
...ree caused by the bug in the gpu_nvml.c code. So it is not truly clear where the underlying issue really is, but it seems most likely a bug in the older version of NVML I had installed. Ideally, though, SLURM would have better handling of slurmstepd processes crashing.

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
...s the problem has gone away. Going to need to have real users test with real jobs. On Tue, 30 Jan 2024 9:01am, Paul Raines wrote: I built 23.02.7 and tried that and had the same problems. BTW, I am using the s...

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
I wonder if the NVML library at the time of build is the key (though, like I said, I tried rebuilding with NVIDIA 470 and that still had the issue). On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote: We had sim...

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
...ted epilog for jobid 3679888 [2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0

[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
BUT there are jobs running on the box that SLURM thinks are done.

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
Turns out on that new node I was running hwloc in a cgroup restricted to cores 0-13, so that was causing the issue. In an unrestricted cgroup shell, "hwloc-ls --only pu" works properly and gives me the correct SLURM mapping.

Re: [slurm-users] CPUSpecList confusion

2022-12-15 Thread Paul Raines
...is command does work on all my other boxes, so I do think using hwloc-ls is the "best" answer for getting the mapping on most hardware out there. On Thu, 15 Dec 2022 1:24am, Marcus Wagner wrote: Hi Paul, as Slurm uses hwloc, I was look...
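A minimal sketch of the mapping procedure this thread converged on; the example IDs are illustrative, and the point is that Slurm's abstract CPU IDs follow hwloc's logical (L#) numbering rather than the OS/apicid numbering:

    # run outside any core-restricted cgroup, or the listing will be incomplete
    hwloc-ls --only pu
    # sample line:  PU L#14 (P#7)
    #   L#14 = logical ID used by Slurm (CpuSpecList, CPU_IDs)
    #   P#7  = OS/physical ID seen in /sys and Cpus_allowed_list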

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
... 14 Dec 2022 9:42am, Paul Raines wrote: Yes, I see that on some of my other machines too. So apicid is definitely not what SLURM is using, but somehow just lines up that way on this one machine I have. I think the issue is that cgroups counts all the cores on the first socket starting at 0 ...

Re: [slurm-users] CPUSpecList confusion

2022-12-14 Thread Paul Raines
# scontrol -d show job 1967214 | grep CPU_ID Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES= # cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus 16-23 I am totally lost now. Seems totally random. SLURM devs? Any insight?

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote: In the slurm.conf manual they state the CpuSpecList IDs are "abstract", and in the CPU management docs they reinforce the notion that the abstract ...

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Hmm. Actually looks like confusion between CPU IDs on the system and what SLURM thinks the IDs are: # scontrol -d show job 8 ... Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES= ... # cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective 7-10,39-42

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
Oh, but that does explain the CfgTRES=cpu=14. With the CpuSpecList below and SlurmdOffSpec I do get CfgTRES=cpu=50, so that makes sense. The issue remains that though the number of CPUs in CpuSpecList is taken into account, the exact IDs seem to be ignored.

Re: [slurm-users] CPUSpecList confusion

2022-12-13 Thread Paul Raines
... --mem=25G --time=10:00:00 --cpus-per-task=8 --pty /bin/bash $ grep -i ^cpu /proc/self/status Cpus_allowed: 0780,0780 Cpus_allowed_list: 7-10,39-42 On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote: Hi Paul, Nodename=foobar ...

[slurm-users] CPUSpecList confusion

2022-12-09 Thread Paul Raines

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-21 Thread Paul Raines
Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was about 16 seconds. So I moved the bzip2 to its own separate line so the tables are only locked for the ~16 seconds. On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote: ...
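A minimal sketch of the change described (database name and backup path are illustrative): compressing inside the dump pipeline keeps the table locks held for the whole bzip2 run, while dumping first and compressing afterwards limits the lock window to the dump itself.

    # before: locks held for the full dump+compress time (5+ minutes)
    mysqldump slurm_acct_db | bzip2 > /backup/slurm_acct_db.sql.bz2

    # after: locks held only for the ~16 s dump
    mysqldump slurm_acct_db > /backup/slurm_acct_db.sql
    bzip2 /backup/slurm_acct_db.sql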

Re: [slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-20 Thread Paul Raines
... to rework it. On Mon, 19 Sep 2022 9:29am, Reed Dier wrote: I'm not sure if this might be helpful, but my logrotate.d for slurm looks a bit different; namely, instead of a systemctl reload, I am sending a specific SIGUSR2 signal, which ...
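A minimal sketch of a logrotate stanza along the lines Reed describes (log path and rotation policy are illustrative); SIGUSR2 makes the Slurm daemons reopen their log files without a reload or restart:

    /var/log/slurm/slurmctld.log {
        weekly
        rotate 8
        compress
        missingok
        nocreate
        postrotate
            pkill -USR2 --exact slurmctld
        endscript
    }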

[slurm-users] job_time_limit: inactivity time limit reached ...

2022-09-19 Thread Paul Raines
...ob with srun/salloc and not a job that has been running for days. Is it InactiveLimit that leads to the "inactivity time limit reached" message? Anyway, I have changed InactiveLimit to 600 to see if that helps.

Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-08 Thread Paul Raines
Sorry, should have stated that before. I am running Slurm 20.11.3 on CentOS 8 Stream that I compiled myself back in June 2021. I will try to arrange an upgrade in the next few weeks. On Fri, 8 Apr 2022 4:02am, Bjørn-Helge Mevik wrote: Paul ...

Re: [slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-07 Thread Paul Raines
[0]:~$ cat /sys/fs/cgroup/memory/slurm/uid_5829/job_1134068/memory.limit_in_bytes 8589934592 On Wed, 6 Apr 2022 3:30pm, Paul Raines wrote: I have a user who submitted an interactive srun job using: srun --mem-per-gpu 64 --gpus 1 --nodes 1 From sacct for this job we see

[slurm-users] Strange memory limit behavior with --mem-per-gpu

2022-04-06 Thread Paul Raines

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-03 Thread Paul Raines
On Thu, 3 Feb 2022 1:30am, Stephan Roth wrote: On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-01 Thread Paul Raines
...it, or are there instances where it would be ignored there? On Tue, 1 Feb 2022 3:09am, EPF (Esben Peter Friis) wrote: The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX. There is a way to force that, though, using ...

[slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-30 Thread Paul Raines
... (truncated nvidia-smi output showing python processes using 10849MiB on the GPUs) ... How can I make SLURM not use GPU 2 and 4?

Re: [slurm-users] Calculate the GPU usages

2021-09-01 Thread Paul Raines
... mlsc ***jl1103 *** gres/gpu 390. In slurm.conf for the partition all these jobs ran on I have TRESBillingWeights="CPU=1.24,Mem=0.02G,Gres/gpu=3.0", if that affects the sreport number somehow -- but then I would expect sreport's number to simply be 3x the sacct n...

Re: [slurm-users] Restrict user not use node without GPU

2021-08-17 Thread Paul Raines
Then set JobSubmitPlugins=lua in slurm.conf. I cannot find any documentation about what really should be in tres_per_job and tres_per_node, as I would expect the CPU and memory requests in there, but it is still "nil" even when those are given. For our cluster I have only seen it non...
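A minimal job_submit.lua sketch in the spirit of this thread (partition name and policy are illustrative); note that tres_per_node / tres_per_job are only populated when a GRES such as a GPU is requested, which is consistent with the nil values mentioned above:

    -- /etc/slurm/job_submit.lua, enabled with JobSubmitPlugins=lua
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.partition == "gpu"
           and job_desc.tres_per_node == nil
           and job_desc.tres_per_job == nil then
            slurm.log_user("jobs in the gpu partition must request a GPU")
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end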

Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Paul Raines
... but it is way less (11 minutes instead of nearly 9 hours) # /usr/bin/sstat -p -a --job=357305 --format=JobID,AveCPU JobID|AveCPU| 357305.extern|213503982334-14:25:51| 357305.batch|11:33.000| Any idea why this is? Also, what is that crazy number for AveCPU on 357305.extern?

Re: [slurm-users] [EXT] rejecting jobs that exceed QOS limits

2021-05-29 Thread Paul Raines
Ah, should have found that. Thanks. On Sat, 29 May 2021 12:08am, Sean Crosby wrote: Hi Paul, Try sacctmgr modify qos gputest set flags=DenyOnLimit. Sean

[slurm-users] rejecting jobs that exceed QOS limits

2021-05-28 Thread Paul Raines
I want to dedicate one of our GPU servers for testing, where users are only allowed to run 1 job at a time using 1 GPU and 8 cores of the server. So I put one server in a partition on its own and set up a QOS for it as follows: sacctmgr add qos gputest sacctmgr modify qos gputest set ...
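Putting this thread together with the DenyOnLimit follow-up, a minimal sketch of the full setup (limit values and node/partition names are illustrative):

    # QOS limiting each user to 1 running job, 8 cores and 1 GPU
    sacctmgr add qos gputest
    sacctmgr modify qos gputest set MaxJobsPerUser=1 \
        MaxTRESPerUser=cpu=8,gres/gpu=1
    # reject over-limit submissions instead of leaving them pending
    sacctmgr modify qos gputest set flags=DenyOnLimit

    # slurm.conf: dedicate one node to the partition and attach the QOS
    #   PartitionName=gputest Nodes=gputest01 QOS=gputest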

Re: [slurm-users] unable to create directory '/run/user/14325/dconf': Permission denied. dconf will not work properly.

2021-03-17 Thread Paul Raines
This is most likely because your XDG* environment variables are being copied into the job environment. We do the following in our taskprolog script: echo "unset XDG_RUNTIME_DIR" echo "unset XDG_SESSION_ID" echo "unset XDG_DATA_DIRS"

Re: [slurm-users] slurm bank and sreport tres minute usage problem

2021-03-12 Thread Paul Raines
... 1 201124.0 2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38 cpu=4,gres/gpu=1,mem=512G,node=1 So the first job used all 24 hours of that day, the 2nd just 3 seconds (so ignore it), and the third about 9 hours and 5 minutes: CPU minutes = 24*60*3 + (9*60+5)*4 = 6500; GPU minutes = 24*60*2 + (9*60+5)*1 = 3425.

[slurm-users] cgroup clean up after "Kill task failed"

2021-02-16 Thread Paul Raines
...'s in /sys/fs/cgroup without rebooting?

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Raines
This also probably requires you to have ProctrackType=proctrack/cgroup, TaskPlugin=task/affinity,task/cgroup, and GresTypes=gpu like I do. On Tue, 26 Jan 2021 3:40pm, Ole Holm Nielsen wrote: Thanks Paul! On 26-01-2021 21:11, Paul Raines wrote: You...

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-26 Thread Paul Raines
...=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0 This is fine with me as I want SLURM to ignore GPU affinity on these nodes, but it is curious. On Mon, 25 Jan 2021 10:07am, Paul Raines wrote: I tried submitting jobs ...

Re: [slurm-users] Building Slurm RPMs with NVIDIA GPU support?

2021-01-26 Thread Paul Raines
...fault RPM SPEC is needed. I just run rpmbuild --tb slurm-20.11.3.tar.bz2. You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the GPU node.

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-25 Thread Paul Raines
I tried submitting jobs with --gres-flags=disable-binding but this has not made any difference. Jobs asking for GPUs are still only being run if a core defined in gres.conf for the GPU is free. Basically it seems the option is ignored.

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-24 Thread Paul Raines
enforcment" as it is more important that a job run with a GPU on its non-affinity socket than just wait and not run at all? Thanks -- Paul Raines (http://help.nmr.mgh.harvard.edu) On Sat, 23 Jan 2021 3:19pm, Chris Samuel wrote: On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrot

Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-23 Thread Paul Raines
Power= TresPerJob=gpu:1 MailUser=mu40 MailType=FAIL I don't see anything obvious here. Is it maybe the 7 day thing? If I submit my jobs for 7 days to the rtx6000 partition though I don't see the problem. On Thu, 21 Jan 2021 5:47pm, Williams ...

[slurm-users] Job not running with Resource Reason even though resources appear to be available

2021-01-21 Thread Paul Raines
...duce_completing_frag,max_rpc_cnt=16 DependencyParameters=kill_invalid_depend So any idea why job 38687 is not being run on the rtx-06 node?

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-29 Thread Paul Raines
...ite } for pid=403840 comm="sshd" name="rtx-05_811.4294967295" dev="md122" ino=2228938 scontext=system_u:system_r:sshd_t:s0-s0:c0.c1023 tcontext=system_u:object_r:var_t:s0 tclass=sock_file permissive=1

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-26 Thread Paul Raines
: Cancelled pending job step with signal 2 srun: error: Unable to create step for job 808: Job/step already completing or completed But it just hung forever till I did a ^C. Thanks. On Sat, 24 Oct 2020 3:43am, Juergen Salk wrote: Hi Paul, maybe ...

Re: [slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-26 Thread Paul Raines
...]: fatal: Access denied for user raines by PAM account configuration [preauth] On Fri, 23 Oct 2020 11:12pm, Wensheng Deng wrote: Append 'log_level=debug5' to the pam_slurm_adopt line in system-auth, restart sshd, try a new job and ssh session ...

[slurm-users] pam_slurm_adopt always claims now active jobs even when they do

2020-10-23 Thread Paul Raines
I am running Slurm 20.02.3 on CentOS 7 systems. I have pam_slurm_adopt set up in /etc/pam.d/system-auth, and slurm.conf has PrologFlags=Contain,X11. I have also masked systemd-logind. But pam_slurm_adopt always denies login with "Access denied by pam_slurm_adopt: you have no active jobs on this ...
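For context, a minimal sketch of the configuration this thread assumes (file layout and module placement vary by distribution; the log_level option comes from the debugging suggestion quoted above). pam_slurm_adopt only admits SSH sessions it can adopt into an existing job's extern cgroup, which is why PrologFlags=Contain is needed:

    # /etc/pam.d/system-auth, account section (placement illustrative)
    account    required     pam_slurm_adopt.so log_level=debug5

    # slurm.conf
    #   PrologFlags=Contain,X11    # creates the extern step that sessions are adopted into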

Re: [slurm-users] Billing issue

2020-08-06 Thread Paul Raines
Bas, does that mean you are setting PriorityFlags=MAX_TRES? Also, does anyone understand this from the slurm.conf docs: The weighted amount of a resource can be adjusted by adding a suffix of K,M,G,T or P after the billing weight. For example, a memory weight of "mem=.25" on a job ...

Re: [slurm-users] cgroup limits not created for jobs

2020-07-26 Thread Paul Raines
On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote: On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote: But when I run a job, on the node it runs on I can find no evidence in cgroups of any limits being set. Example job: mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 ...

[slurm-users] cgroup limits not created for jobs

2020-07-24 Thread Paul Raines
... /freezer/tasks /sys/fs/cgroup/systemd/user.slice/user-5829.slice/session-80624.scope/tasks

Re: [slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
SLURM_SUBMIT_HOST=mlscgpu1 SLURM_JOB_PARTITION=batch SLURM_JOB_NUM_NODES=1 SLURM_MEM_PER_NODE=1024 mlscgpu1[0]:~$ But still no CUDA_VISIBLE_DEVICES is being set On Thu, 23 Jul 2020 10:32am, Paul Raines wrote: I have two systems in my cluster with GPUs. Their setup in slurm.conf is GresTypes=gpu NodeName

[slurm-users] GPU configuration not working

2020-07-23 Thread Paul Raines
I have two systems in my cluster with GPUs. Their setup in slurm.conf is GresTypes=gpu NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557 NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1
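One common companion to these slurm.conf lines is a per-node gres.conf describing the GPU device files; a minimal sketch, with counts matching the node definitions above, device paths being the standard NVIDIA defaults, and AutoDetect only usable if Slurm was built against NVML:

    # /etc/slurm/gres.conf on mlscgpu1 (10 GPUs)
    Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-9]
    # on mlscgpu2 (5 GPUs):
    #   Name=gpu Type=quadro_rtx_6000 File=/dev/nvidia[0-4]
    # or, if Slurm was built with NVML support:
    #   AutoDetect=nvml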