tory to the
amount requested (so it couldn't be exceeded) and then used the private tmp
spank plugin to map that into what the job saw as /tmp, /var/tmp and /dev/shm.
The epilog then cleaned up after the job.
Worked nicely!
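For anyone wanting to do something similar, a minimal sketch of the prolog
side is below. It's untested, the paths are made up, and the size variable
is hypothetical (we drove ours from the job's request); a loopback image is
just one way to get a hard cap on usage:

#!/bin/bash
# Prolog sketch: build a fixed-size filesystem for the job's private tmp.
JOBTMP="/local/slurm-${SLURM_JOB_ID}"
SIZE_MB="${REQUESTED_TMP_MB:-1024}"   # hypothetical: size the job asked for
mkdir -p "${JOBTMP}"
# A fixed-size loopback image means the job can't write more than requested.
dd if=/dev/zero of="${JOBTMP}.img" bs=1M count="${SIZE_MB}" status=none
mkfs.ext4 -q "${JOBTMP}.img"
mount -o loop "${JOBTMP}.img" "${JOBTMP}"
chown "${SLURM_JOB_USER}" "${JOBTMP}"
# The private tmp spank plugin then maps this over /tmp, /var/tmp and
# /dev/shm for the job, and the epilog unmounts and removes the image.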
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
nd set "EnforceUsageThreshold" on the QOS whose limit you don't want
users to be able to exceed.
https://slurm.schedmd.com/sacctmgr.html#lbAW
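Something along these lines should do it (the QOS name and threshold value
here are just placeholders):

sacctmgr modify qos where name=normal set UsageThreshold=0.8 \
    Flags=EnforceUsageThreshold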
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
, we didn't change our config for GPUs and (so far) things seem OK.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Hi Kevin!
On 13/8/19 7:25 pm, Kevin Buckley wrote:
Then again, perhaps the bug seen there has been fixed in some
other way for 19.05.2?
From what I can see with "git log -Saries -p" it appears not yet.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
file.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
a job specifically requests it via the --switches
option to sbatch, which says how many switches the job should span.
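For example (the count and the optional maximum wait time are illustrative
only):

sbatch --switches=1@02:00:00 myjob.sh

That asks for all the job's nodes to sit on a single switch, waiting up to
2 hours for such a placement before running anyway.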
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
u
can. There's an awful lot of fixes you are missing out on otherwise.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
trol reconfigure?
BTW that check was introduced in 2003 by Moe :-)
https://github.com/SchedMD/slurm/commit/1c7ee080a48aa6338d3fc5480523017d4287dc08
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Tuesday, 23 July 2019 7:47:33 PM PDT Weiguang Chen wrote:
> I just reinstalled hdf5, but the error still exists.
Are you going to use HDF5 actively? If not tell configure not to use it by
adding the --with-hdf5=no flag to your configure line.
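i.e. something like this (the install prefix is just an example):

./configure --prefix=/opt/slurm --with-hdf5=no
make -j8 && make install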
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Monday, 24 June 2019 10:47:46 PM PDT Valerio Bellizzomi wrote:
> slurmctld: error: High latency for 1000 calls to gettimeofday(): 2072
> microseconds
Are you running in a VM ?
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
:27.515] error: slurmdbd: agent queue filling (20140),
RESTART SLURMDBD NOW
Have you tried doing what it told you to?
You may want to look at the performance of your MySQL server to see if
it's failing to keep up with what slurmdbd is asking it to do.
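A quick first look on the database host could be (assuming you can run the
mysql client there):

mysql -e 'SHOW FULL PROCESSLIST;'
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"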
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
there to see what's using it.
Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 11/7/19 11:04 pm, Pär Lundö wrote:
It works fine running on a single node (with ”-N1” instead of ”-N2”), but
it is aborted or stopped when running on two nodes.
What is the error you get?
Does the same srun command but with "hostname" instead of Python work?
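i.e. does something like this succeed?

srun -N2 -n2 hostname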
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
. So I'd look to check that side
of things is OK.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 3/7/19 8:49 am, David Baker wrote:
Does the above make sense or is it too complicated?
[looks at our 14 partitions and 112 QOS's]
Nope, that seems pretty simple. We do much the same here.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
/bash to stay on
the login node.
Same on our test 19.05.0 system.
Which version of Slurm are you on?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
GPFS for this) as both
slurmctld's need to see the same state directory all the time.
We also run slurmdbd in failover mode talking to the same MySQL/MariaDB
instance (but with a backup in case that fails).
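For reference, the relevant slurm.conf bits look something like this
(hostnames and path are made up; SlurmctldHost is the 18.08+ syntax, older
releases used ControlMachine/BackupController instead):

SlurmctldHost=ctl1
SlurmctldHost=ctl2
StateSaveLocation=/gpfs/slurm/statesave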
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
hether that be an ssh session, an
xterm or using screen or tmux to multiplex terminals on a single session.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
What does "scontrol show node $NODE" say where $NODE is the name of a node
that isn't being listed despite you expecting it to be?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ind a way to trace mpirun - I think it's just a shell script so
running it with "bash -x mpirun {etc}" would probably do it.
That said you're probably better off just using srun anyway.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
rmctld to see if that helps.
Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
to do that; the existing
ones only use QOS or partition.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
nning job
> or cancels it with cntrl. When this happens we can have many many nodes
> stuck in CG. Slurm 17.02.6. Thanks!
Are you using cgroups to control/constrain jobs?
17.02 is very old; now that 19.05 is out, only it and 18.08 are getting updates.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
to trial things
like cgroups you'll want a VM at least.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
tween DOS & Linux line ending conventions?
See this for more on that last point: https://kb.iu.edu/d/acux
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
up a little cluster in a set of VM's makes life a lot
easier for you as you'll be able to control the whole environment.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
groups at
all then it may just work to run the daemons by hand yourself. You would
need to make sure you specify your username as both the "SlurmUser" and
"SlurmdUser" in slurm.conf as well.
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
that Slurm is calling to do
this work to see if anything is hiding there.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
forwarding is reworked in 19.05 so it may be worth testing
that out to see whether that improves things in this area.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
; with Intel on what appears to be an undocumented regression and all I
> got after several back and forths was that PMI-2 is not supported in
> Intel MPI 2019.
Is that because they've moved to PMIx exclusively now?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
rocks run host compute-0-1 /bin/bash -x
/state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
Also why aren't you using the Slurm commands to run things?
Does this "rocks" command use them under the covers?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 27/4/19 10:07 pm, J.R. W wrote:
Using slurm 15.08.7
Is that a typo for 18.08.7 ?
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
it
has in the first #! line does not exist.
What does this command say on that node?
file /state/partition1/ans190/v190/Framework/bin/Linux64/runwb2
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
in the first one of the pack jobs
instead?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
g generally coincides with the processing unit logical number (PU L#)
seen in lstopo output.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
I application,
and also that the MPI stack you are using does not know about Slurm and so
doesn't know to start itself correctly when you run with mpirun.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
their tables properly updated (and as you say, either
apply your patch or migrate their MySQL server to a box running a more
recent version of MySQL - it doesn't have to be on the same system
running slurmdbd).
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ing, it's not applicable to 19.05 (which is coming up
quickly now and so they'll be busy trying to get ready for that).
"Release dates in calendar are closer than they appear"
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
md.com/show_bug.cgi?id=4966
but it looks like it's languished since I left Australia.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
f MariaDB or
MySQL is strongly encouraged to prevent this problem.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
to ask for that.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Wednesday, 27 March 2019 11:33:30 PM PDT Mahmood Naderan wrote:
> Still only one node is running the processes
What does "srun --version" say?
Do you get any errors in your output file from the second pack job?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 27/3/19 7:56 pm, Kevin Buckley wrote:
Does the SchedMD website contain "back issues" of SLURM User
Group Meeting info
Yup, somewhat non-intuitively as publications:
https://slurm.schedmd.com/publications.html
Goes all the way back to something at SC08!
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
with different numbers of nodes allocated.
Does anyone have any idea why?
You would need to share the output of "scontrol show nodes" to get an
idea of what resources Slurm thinks each node has.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 27/3/19 1:00 pm, Anne M. Hammond wrote:
NodeName=fl[01-04] CPUs=24 RealMemory=4 Sockets=2 CoresPerSocket=6
ThreadsPerCore=2 State=UNKNOWN
This will give you 12 tasks per node, each task with 2 thread units
(2 sockets x 6 cores per socket = 12 cores; ThreadsPerCore=2 makes the 24 CPUs).
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
considered in one partition to when it's
being considered in a different partition.
I don't think you can do that though, I'm afraid, José; I think the
weight is only attached to the node, and the partition doesn't influence it.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 21/3/19 7:39 pm, Will Dennis wrote:
Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix
this?
Did you remember to distribute gres.conf as well to the nodes?
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
etero_steps" option in your
scheduler parameters, but even then I don't believe it's working properly
there.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
https://slurm.schedmd.com/mpi_guide.html#pmix
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ugh if it wasn't then I'd expect a different error, other than resources.
I think to understand better it'd be necessary to see what "scontrol show job"
for a job stuck in that state looks like.
If it helps we're running Slurm on Cray and make heavy use of QOS's.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
of cores
that an association has in the hierarchy either at or above that level
that this would exceed.
You'll probably need to go poking around with sacctmgr to see what that
limit might be.
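For instance, something like this shows the limits through the hierarchy:

sacctmgr show associations tree format=Account,User,GrpTRES,MaxTRES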
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Monday, 25 February 2019 2:55:44 AM PST Patrice Peterson wrote:
> Filed a bug: https://bugs.schedmd.com/show_bug.cgi?id=6573
Looks like Danny fixed it in git.
https://github.com/SchedMD/slurm/commit/b1c78d9934ef461df637c57c001eb165a6b1fcc3
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
FAILED
2019-02-27T22:35:23 2019-02-27T22:36:38 00:01:15 COMPLETED
The "COMPLETED" part is the extern step we have as we use pam_slurm_adopt.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
servation is being created for the larger job,
what do these say?
sprio -l
squeue --start
scontrol show job ${LARGE_JOBID}
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
f you use Open-MPI instead of Intel MPI?
I'm not sure whether Intel MPI can cope with heterogeneous jobs or not (it
doesn't seem to be documented anywhere what will, or will not, work with it).
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
am getting out of context somehow and have access to all
> resources.
Yes, check the documentation and review your PAM configuration. As I
mentioned it sounds like you've got things in the wrong order there.
https://slurm.schedmd.com/pam_slurm_adopt.html#PAM_CONFIG
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
, to give a sane interface and
default logical layout. Slurm uses a similar system that results in something
that looks very similar, so to Slurm CPU 0 is socket 1, core 1, thread 1 and
CPU 2 is socket 1, core 1, thread 2, etc...
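You can compare the two views on a node with something like this (lstopo
comes with hwloc):

slurmd -C
lstopo-no-graphics --no-io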
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
e spec is:
CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2
Hope this helps!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
d can interfere with things.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
might need a recent version of Open-MPI for instance.
https://slurm.schedmd.com/heterogeneous_jobs.html
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
U jobs were not. The submit
filter did all the policing of that.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ptures that information in the granularity you want currently.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Wednesday, 13 February 2019 4:48:05 AM PST Marcus Wagner wrote:
> #SBATCH --ntasks-per-node=48
I wouldn't mind betting that if you set that to 24 it will work, and each
thread will be assigned a single core with the 2 thread units on it.
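In other words something like this (a sketch, assuming you want each task
to own a whole core including both of its hardware threads):

#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2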
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ke that a requirement at submission time. It's also
exposed inside the job as ${SLURM_JOB_ACCOUNT}.
https://slurm.schedmd.com/sacctmgr.html
https://slurm.schedmd.com/accounting.html
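For instance (the account name here is hypothetical):

#!/bin/bash
#SBATCH --account=myproject
echo "This job is being charged to: ${SLURM_JOB_ACCOUNT}"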
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ook like?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
they inform slurmctld of their hostname and IP address.
https://slurm.schedmd.com/elastic_computing.html
Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
: Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
You need to configure munge on this node and tell slurmdbd to use it via
the AuthInfo directive in your configuration file.
https://slurm.schedmd.com/slurmdbd.conf.html
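For example, in slurmdbd.conf (the socket path below is the common default,
check where your munge install puts it):

AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2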
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
is
not initialized
So this looks like it's trying to use PMI1.
What do the following say?
srun --mpi=list
scontrol show config | fgrep -i mpidefault
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
chance either? That could allow
users to escape their cgroup settings as it can set up its own.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
s firewall related and sometimes this is because
slurmctld tells slurmdbd about an IP address that isn't reachable rather
than one that is.
Hope this helps!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
on exclusive nodes.
That's correct - because parts of Docker (currently) run as root they
can modify cgroups at will and apparently do. This is why things like
Shifter, CharlieCloud and Singularity exist to let this happen on HPC
systems more safely.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
to be
terminated when the job does end.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
arguments from the script for it to
know what resources you are asking for.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
-smi isn't working in
Docker because of a lack of device files; the problem is that it's
seeing all 4 GPUs and thus is no longer being controlled by the device
cgroup that Slurm is creating.
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
the session to the cgroup for the job and then will
clean up that session when the job ends.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
I've used before
so I cannot vouch for how it works.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
, CharlieCloud and Singularity are used instead.
I believe Docker are working on a "rootless" mode that might get around
this; no idea where that's at, though.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
On 1/1/19 5:25 pm, 허웅 wrote:
what's the problem?
Are you using cgroups to constrain access to GPUs?
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
isn't part of Slurm, so you'll need
to contact the people who've created it to see how it works and why it's
not doing what you expect.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
you a shell on the same node as you ran it on,
with a job allocation that you can access by srun.
You can read more about interactive shells here:
https://slurm.schedmd.com/faq.html#prompt
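For example (the size and time limit are just illustrative):

salloc -N1 -t 30
srun hostname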
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
by the various daemons when they
are started up.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On Sat, December 15, 2018 5:59 am, Christopher Benjamin Coffey wrote:
> Hi Guys,
Hi Chris,
> It appears that slurm currently doesn't support mysql 8.0. After upgrading
> from 5.7 to 8.0 slurm commands that hit the db result in:
>
> sacct: error: slurmdbd: "Unknown error 1064"
That's correct,
related.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
then that should be
possible using an overlapping partition restricted to them with LLN enabled.
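A sketch of such a partition definition (all the names are made up):

PartitionName=special Nodes=node[01-08] AllowGroups=specialusers LLN=YES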
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
ll finish.
So ours is (amongst others): bf_window=23040,bf_resolution=600
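i.e. in slurm.conf:

# 23040 minutes = 16 days lookahead, at 10 minute resolution
SchedulerParameters=bf_window=23040,bf_resolution=600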
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
it so unfortunately there's no
reasoning given for this.
commit e3140b7f8d96ced9dc85089caa65dd7c6be396fd
Author: Tim Wickberg
Date: Wed Sep 20 12:09:34 2017 -0600
Add new x11_util.c file to src/common.
Utility functions for new x11 forwarding implementation.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
-
1795583wrap FAILED 141:0
Hope that helps!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
would you mind providing access to your prolog and epilog scripts?
Attached!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
#!/bin/bash
if [ "${SLURM_RESTART_COUNT}" == "" ]; then
SLURM_RESTART_COUNT=0
fi
JOBSCRATC
nodes of the cluster.
I think it's good to hear from sites where this is the case because we can
easily get stuck in our own little bubbles until something comes and trips us
up like that.
All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
On Tuesday, 20 November 2018 11:42:49 PM AEDT Baker D. J. wrote:
> We are running Slurm 18.08.0 on our cluster and I am concerned that Slurm
> appears to be using backfill scheduling excessively.
What are your SchedulerParameters ?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
th CentOS 7.5. Haven't gone to 7.6
yet.
One thing I just realised I'd not mentioned is that for this to work the user
needs to be able to SSH from the compute node back into the login node without
being prompted for any reason.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
1 support with SSH host-based authentication
including from compute nodes back into the login node (that's important)!
Also you need to have configured your /etc/ssh/ssh_known_hosts files so the
ssh client doesn't prompt to confirm host keys.
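One way to populate that file (hostlist expression and node names assumed):

for h in login1 $(scontrol show hostnames 'node[001-016]'); do
    ssh-keyscan "$h"
done >> /etc/ssh/ssh_known_hosts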
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
t libssh2-devel"
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
That also looks like an error you should look into fixing first.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
*).
It's worked well for us at Swinburne (17.11.x and now 18.08.x) running with
sssd and enumeration disabled. Not a vast number of users though!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
that mean everything is ok?
> I wonder why the second command fails?
Check your slurmd logs on the compute node. What errors are there?
> >Another thing is we had to set:
> > * X11Parameters=local_xauthority
>
> Where? sshd config file?
No, that's in slurm.conf.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
RSA keys.
Extra info here: https://slurm.schedmd.com/faq.html#x11
You can (apparently) still use the external plugin if you build Slurm without
its internal X11 support.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC