Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu
partition.
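For reference, a partition-level limit like that would come from a slurm.conf line along these lines (a sketch; the node list and other options here are hypothetical, only the MaxMemPerCPU value is from above):

```
# slurm.conf (hypothetical node list; MaxMemPerCPU is in MB)
PartitionName=pubgpu Nodes=gpu[01-04] MaxMemPerCPU=16384 State=UP
```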
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 9 Jul 2024 2:39pm, Timony, Mick wrote:
External Email - Use Caution
Hi Paul,
There could be multiple reasons why the job isn't running, from
---
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street, Charlestown, MA 02129 USA
The information in this e-mail is intended only for the person to whom it is
addressed.
cause weird behavior with a lot of system tools. So far the
root/daemon processes work fine within the 20GB limit, so
MemoryHigh=20480M is one and done.
Then reboot.
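A sketch of where such a limit could live, assuming it is applied to the system daemons via a systemd drop-in (the file path is hypothetical; only the MemoryHigh value is from above):

```
# /etc/systemd/system/system.slice.d/memhigh.conf (hypothetical path)
[Slice]
MemoryHigh=20480M
```

After creating the drop-in, `systemctl daemon-reload` (or the reboot mentioned above) makes it take effect.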
counts with
sacctmgr -i add user "$u" account=$acct fairshare=parent
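Applied across many users, the sacctmgr command above can be scripted. A minimal sketch, which only echoes the commands for review (the account and user names are hypothetical; drop the leading echo to actually apply them):

```shell
#!/bin/sh
# Preview sacctmgr commands for a hypothetical account/user list.
# Remove the leading "echo" to run them for real.
acct=mylab
for u in alice bob; do
  echo sacctmgr -i add user "$u" account=$acct fairshare=parent
done
```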
On Wed, 27 Mar 2024 9:22am, Long, Daniel S. via slurm-users wrote:
Hi,
I’m trying to set up multifactor priority on our cluster
Alternatively, consider setting EnforcePartLimits=ALL in slurm.conf
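That is, something along these lines in slurm.conf (with ALL, a job that violates the limits of any partition it requests is rejected at submit time rather than left pending):

```
# slurm.conf
EnforcePartLimits=ALL
```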
Will using option "End=now" with sreport not exclude the still
pending array jobs while including data for the completed ones?
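As a sketch of the invocation in question (the report type, account name, and Start date are assumptions, not from the thread; the command is only echoed here for review):

```shell
#!/bin/sh
# Echo (not run) a bounded sreport invocation; End=now caps the
# reporting window so still-pending array tasks contribute no usage.
# Account name and Start date are hypothetical.
echo sreport cluster AccountUtilizationByUser account=mylab \
     Start=2024-02-01 End=now -t Hours
```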
On Mon, 4 Mar 2024 5:18pm, Chip Seraphine via slurm-users wrote:
hostname
mlsc-login.nmr.mgh.harvard.edu
mlsc-login[0]:~$ printenv | grep SLURM_JOB_NODELIST
SLURM_JOB_NODELIST=rtx-02
Seems you MUST use srun
On Wed, 28 Feb 2024 10:25am, Paul Edmon via slurm-users wrote:
salloc is the currently recommended
with safe requeuing.
On Wed, 14 Feb 2024 9:32am, Paul Edmon via slurm-users wrote:
You probably want the Prolog option:
https://secure-web.cisco.com/1gA_zj13OnVqs4BaLrstiwdHEvx0
complicated cron job that
tries to do it all "outside" of SLURM issuing scontrol commands.
ree caused by the bug in the gpu_nvml.c code
So it is not truly clear where the underlying issue really lies, but it
seems most likely to be a bug in the older version of NVML I had installed.
Ideally though SLURM would have better handling of the slurmstepd
processes crashing.
s the problem has gone away. Going to need to have real users
test with real jobs.
On Tue, 30 Jan 2024 9:01am, Paul Raines wrote:
I built 23.02.7 and tried that and had the same problems.
BTW, I am using the s
I wonder if the
NVML library at the time of build is the key (though like I said I tried
rebuilding with NVIDIA 470 and that still had the issue)
On Tue, 30 Jan 2024 3:36am, Fokke Dijkstra wrote:
We had sim
ted epilog for jobid 3679888
[2024-01-28T17:33:58.774] debug: JobId=3679888: sent epilog complete msg: rc = 0
Please note that this e-mail is not secure (encrypted).
BUT
there are jobs running on the box that SLURM thinks are done
Turns out on that new node I was running hwloc in a cgroup restricted
to cores 0-13 so that was causing the issue.
In an unrestricted cgroup shell, "hwloc-ls --only pu" works properly
and gives me the correct SLURM mapping.
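A quick way to check whether the current shell itself is core-restricted before trusting hwloc-ls output (Linux-only; this reads the kernel's view of the current process rather than anything Slurm-specific):

```shell
#!/bin/sh
# Show which CPUs the current process may run on; a restricted
# cgroup or affinity mask shows up here as a narrowed list.
grep Cpus_allowed_list /proc/self/status
```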
is command does work on all my other boxes so I do think using hwloc-ls
is the "best" answer for getting the mapping on most hardware out there.
On Thu, 15 Dec 2022 1:24am, Marcus Wagner wrote:
Hi Paul,
as Slurm uses hwloc, I was look
On Wed, 14 Dec 2022 9:42am, Paul Raines wrote:
Yes, I see that on some of my other machines too. So apicid is definitely
not what SLURM is using but somehow just lines up that way on this one
machine I have.
I think the issue is that cgroups counts all the cores on the first
socket starting at 0
# scontrol -d show job 1967214 | grep CPU_ID
Nodes=r17 CPU_IDs=8-11,20-23 Mem=51200 GRES=
# cat /sys/fs/cgroup/cpuset/slurm/uid_5164679/job_1967214/cpuset.cpus
16-23
I am totally lost now. Seems totally random. SLURM devs? Any insight?
On Tue, 13 Dec 2022 9:52am, Sean Maxwell wrote:
In the slurm.conf manual they state the CpuSpecList ids are "abstract", and
in the CPU management docs they enforce the notion that the abstract
Hmm. Actually looks like confusion between CPU IDs on the system
and what SLURM thinks the IDs are
# scontrol -d show job 8
...
Nodes=foobar CPU_IDs=14-21 Mem=25600 GRES=
...
# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_8/cpuset.cpus.effective
7-10,39-42
Oh but that does explain the CfgTRES=cpu=14. With the CpuSpecList
below and SlurmdOffSpec I do get CfgTRES=cpu=50 so that makes sense.
The issue remains that though the number of cpus in CpuSpecList
is taken into account, the exact IDs seem to be ignored.
--mem=25G \
--time=10:00:00 --cpus-per-task=8 --pty /bin/bash
$ grep -i ^cpu /proc/self/status
Cpus_allowed: 0780,0780
Cpus_allowed_list: 7-10,39-42
On Mon, 12 Dec 2022 10:21am, Sean Maxwell wrote:
Hi Paul,
Nodename=foobar
Almost all the 5 min+ time was in the bzip2. The mysqldump by itself was
about 16 seconds. So I moved the bzip2 to its own separate line so
the tables are only locked for the ~16 seconds
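The shape of that fix, sketched generically (printf stands in for the real mysqldump invocation, and the path is hypothetical; the point is that the dump and the compression run as separate steps, not in one pipeline):

```shell
#!/bin/sh
# Step 1: write the dump to disk quickly (table locks held only here).
printf 'dump data\n' > /tmp/acct_dump.sql
# Step 2: compress afterwards, with the locks already released.
bzip2 -f /tmp/acct_dump.sql
```

In a pipeline like `mysqldump ... | bzip2`, the dump can only write as fast as bzip2 consumes, so the locks are held for the whole compression time.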
On Wed, 21 Sep 2022 3:49am, Ole Holm Nielsen wrote
to rework it.
On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
I’m not sure if this might be helpful, but my logrotate.d for slurm looks a bit
different; namely, instead of a systemctl reload, I am sending a specific
SIGUSR2 signal, which
ob with srun/salloc and not a job that has been running
for days. Is it InactiveLimit that leads to the "inactivity time limit
reached" message?
Anyway, I have changed InactiveLimit=600 to see if that helps.
Sorry, should have stated that before. I am running Slurm 20.11.3
on CentOS 8 Stream that I compiled myself back in June 2021.
I will try to arrange an upgrade in the next few weeks.
On Fri, 8 Apr 2022 4:02am, Bjørn-Helge Mevik wrote:
Paul
[0]:~$ cat /sys/fs/cgroup/memory/slurm/uid_5829/job_1134068/memory.limit_in_bytes
8589934592
On Wed, 6 Apr 2022 3:30pm, Paul Raines wrote:
I have a user who submitted an interactive srun job using:
srun --mem-per-gpu 64 --gpus 1 --nodes 1
From sacct for this job we see
On Thu, 3 Feb 2022 1:30am, Stephan Roth wrote:
On 02.02.22 18:32, Michael Di Domenico wrote:
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth
wrote:
The problem is to identify the cards physically from the information we
have, like what's reported with nvidia-smi or available in
it or
are there instances where it would be ignored there?
On Tue, 1 Feb 2022 3:09am, EPF (Esben Peter Friis) wrote:
The numbering seen from nvidia-smi is not necessarily the same as the order of
/dev/nvidiaXX.
There is a way to force that, though, using
[truncated nvidia-smi process table: python processes using ~10849 MiB each, including one on GPU 4]
How can I make SLURM not use GPU 2 and 4?
--
mlsc ***jl1103 *** gres/gpu 390
In slurm.conf for the partition all these jobs ran on I have
TRESBillingWeights="CPU=1.24,Mem=0.02G,Gres/gpu=3.0" if that affects
the sreport number somehow -- but then I would expect sreport's number
to simply be 3x the sacct n
Then set JobSubmitPlugins=lua in slurm.conf
I cannot find any documentation about what really should be in
tres_per_job and tres_per_node as I would expect the cpu and memory
requests in there but it is still "nil" even when those are given.
For our cluster I have only seen it non
but it is way less (11 minutes
instead of nearly 9 hours)
# /usr/bin/sstat -p -a --job=357305 --format=JobID,AveCPU
JobID|AveCPU|
357305.extern|213503982334-14:25:51|
357305.batch|11:33.000|
Any idea why this is? Also, what is that crazy number for
AveCPU on 357305.extern?
Ah, should have found that. Thanks.
On Sat, 29 May 2021 12:08am, Sean Crosby wrote:
Hi Paul,
Try
sacctmgr modify qos gputest set flags=DenyOnLimit
Sean
From: slurm-users on behalf of Paul Raines
Sent: Saturday, 29 May 2021 12:48
To: slurm-users
I want to dedicate one of our GPU servers for testing where
users are only allowed to run 1 job at a time using 1 GPU and
8 cores of the server. So I put one server in a partition on its
own and setup a QOS for it as follows:
sacctmgr add qos gputest
sacctmgr modify qos gputest set
This is most likely because your XDG* environment variables are being copied
into the job environment. We do the following in our taskprolog script
echo "unset XDG_RUNTIME_DIR"
echo "unset XDG_SESSION_ID"
echo "unset XDG_DATA_DIRS"
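For completeness, such a script gets wired in via slurm.conf (the script path here is hypothetical; a TaskProlog's stdout lines of the form "unset VAR" or "export VAR=val" are applied to the job's environment):

```
# slurm.conf (hypothetical script path)
TaskProlog=/etc/slurm/taskprolog.sh
```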
1
201124.0 2021-03-09T14:55:29 2021-03-10T08:13:07 17:17:38
cpu=4,gres/gpu=1,mem=512G,node=1
So the first job used all 24 hours of that day, the 2nd just 3 seconds
(so ignore it) and the third about 9 hours and 5 minutes
CPU = 24*60*3+(9*60+5)*4 = 6500
GPU = 24*60*2+(9*60+5)*1 = 3425
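The totals above can be re-checked mechanically with shell arithmetic:

```shell
#!/bin/sh
# Re-check the per-day TRES-minute totals from the thread:
# first job: full day (24h) at 3 CPUs / 2 GPUs,
# third job: 9h05m at 4 CPUs / 1 GPU.
cpu=$(( 24*60*3 + (9*60+5)*4 ))
gpu=$(( 24*60*2 + (9*60+5)*1 ))
echo "CPU minutes: $cpu"   # 6500
echo "GPU minutes: $gpu"   # 3425
```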
's in /sys/fs/cgroup
without rebooting?
This also probably requires you to have
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
GresTypes=gpu
like I do
On Tue, 26 Jan 2021 3:40pm, Ole Holm Nielsen wrote:
Thanks Paul!
On 26-01-2021 21:11, Paul Raines wrote:
You
=7696487 File=/dev/nvidia0 Cores=0-31 CoreCnt=32 Links=-1,0,0,0,0,0,2,0,0,0
This is fine with me as I want SLURM to ignore GPU affinity on these nodes
but it is curious.
On Mon, 25 Jan 2021 10:07am, Paul Raines wrote:
I tried submitting jobs
fault
RPM SPEC is needed. I just run
rpmbuild --tb slurm-20.11.3.tar.bz2
You can run 'rpm -qlp slurm-20.11.3-1.el8.x86_64.rpm | grep nvml' and see
that /usr/lib64/slurm/gpu_nvml.so only exists on the one built on the
GPU node.
I tried submitting jobs with --gres-flags=disable-binding but
this has not made any difference. Jobs asking for GPUs are still only
being run if a core defined in gres.conf for the GPU is free.
Basically it seems the option is ignored.
enforcement" as it is
more important that a job run with a GPU on its non-affinity socket
than just wait and not run at all?
Thanks
On Sat, 23 Jan 2021 3:19pm, Chris Samuel wrote:
On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:
Power=
TresPerJob=gpu:1
MailUser=mu40 MailType=FAIL
I don't see anything obvious here. Is it maybe the 7 day thing? If
I submit my jobs for 7 days to the rtx6000 partition though I don't
see the problem.
On Thu, 21 Jan 2021 5:47pm, Williams
duce_completing_frag,\
max_rpc_cnt=16
DependencyParameters=kill_invalid_depend
So any idea why job 38687 is not being run on the rtx-06 node?
ite } for
pid=403840 comm="sshd" name="rtx-05_811.4294967295" dev="md122" ino=2228938
scontext=system_u:system_r:sshd_t:s0-s0:c0.c1023
tcontext=system_u:object_r:var_t:s0 tclass=sock_file permissive=1
: Cancelled pending job step with signal 2
srun: error: Unable to create step for job 808: Job/step already completing or
completed
But it just hung forever till I did a ^C
Thanks
On Sat, 24 Oct 2020 3:43am, Juergen Salk wrote:
Hi Paul,
maybe
]: fatal: Access denied for user raines by
PAM account configuration [preauth]
On Fri, 23 Oct 2020 11:12pm, Wensheng Deng wrote:
Append ‘log_level=debug5’ to the pam_slurm_adopt line in system-auth,
restart sshd, try a new job and ssh session
I am running Slurm 20.02.3 on CentOS 7 systems. I have pam_slurm_adopt
setup in /etc/pam.d/system-auth and slurm.conf has PrologFlags=Contain,X11
I also have masked systemd-logind
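A sketch of the relevant PAM line (the control flag and placement are assumptions; sites vary between "required" and "sufficient", and the log_level argument mentioned below can be appended here):

```
# /etc/pam.d/system-auth (sketch; control flag varies by site)
account    required     pam_slurm_adopt.so
```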
But pam_slurm_adopt always denies login with "Access denied by
pam_slurm_adopt: you have no active jobs on this
Bas
Does that mean you are setting PriorityFlags=MAX_TRES ?
Also does anyone understand this from the slurm.conf docs:
The weighted amount of a resource can be adjusted by adding a suffix of
K,M,G,T or P after the billing weight. For example, a memory weight of
"mem=.25" on a job
On Sat, 25 Jul 2020 2:00am, Chris Samuel wrote:
On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote:
But when I run a job, on the node where it runs I can find no
evidence in cgroups of any limits being set
Example job:
mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1
/freezer/tasks
/sys/fs/cgroup/systemd/user.slice/user-5829.slice/session-80624.scope/tasks
SLURM_SUBMIT_HOST=mlscgpu1
SLURM_JOB_PARTITION=batch
SLURM_JOB_NUM_NODES=1
SLURM_MEM_PER_NODE=1024
mlscgpu1[0]:~$
But still no CUDA_VISIBLE_DEVICES is being set
On Thu, 23 Jul 2020 10:32am, Paul Raines wrote:
I have two systems in my cluster with GPUs. Their setup in slurm.conf is
GresTypes=gpu
NodeName
I have two systems in my cluster with GPUs. Their setup in slurm.conf is
GresTypes=gpu
NodeName=mlscgpu1 Gres=gpu:quadro_rtx_6000:10 CPUs=64 Boards=1
SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1546557
NodeName=mlscgpu2 Gres=gpu:quadro_rtx_6000:5 CPUs=64 Boards=1