On Tuesday, 6 November 2018 1:02:02 AM AEDT kamil wrote:
> Any idea what these mean and how to handle it?
No, but we've just upgraded and see the same. I've opened a bug:
https://bugs.schedmd.com/show_bug.cgi?id=6016
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
ate resources: Invalid feature specification
elease..
All the best,
Chris
s.
Best of luck!
Chris
some chance?
If so you might want to check if it works with it disabled first..
All the best,
Chris
s" say for you?
Remember, just because you can run squeue on the DB server and talk to the
control daemon doesn't mean that the slurmctld has told the slurmdbd to use
the same working IP address that squeue gets via slurm.conf.
All the best,
Chris
393), being killed
>
> Is this a limit that's dictated by cgroup.conf
It's not cgroups; cgroup limits are enforced by the kernel. This is
Slurm itself monitoring the job, deciding it has used too much memory,
and killing it.
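For reference, the knobs involved might look like this in slurm.conf (a hedged sketch based on 17.11/18.08-era options; check the slurm.conf man page for your version):

```
# Slurm-side enforcement: the accounting gather plugin polls job memory
# use and slurmd kills jobs that exceed their request.
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
# Kernel-side enforcement instead uses the task/cgroup plugin:
# TaskPlugin=task/cgroup  (with ConstrainRAMSpace=yes in cgroup.conf)
```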
All the best,
Chris
On Wednesday, 7 November 2018 3:46:01 PM AEDT Brian Andrus wrote:
> Ah. I was getting ahead of myself. I used 'limits' and I have no limits
> configured, only associations. Changed it to just associations and all is
> good.
Excellent! Well spotted..
be disruptive though,
should it? We just flip a symlink and the users see the new binaries,
libraries, etc immediately, we can then restart daemons as and when we
need to (in the right order of course, slurmdbd, slurmctld and then
slurmd's).
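A minimal sketch of that flip (the /tmp paths are illustrative, not our real layout):

```shell
# Install the new release alongside the old one, then atomically flip
# the symlink that PATHs and service units point at.
mkdir -p /tmp/apps/slurm-17.11.12 /tmp/apps/slurm-18.08.4 /tmp/apps/slurm
ln -sfn /tmp/apps/slurm-17.11.12 /tmp/apps/slurm/latest  # current version
ln -sfn /tmp/apps/slurm-18.08.4 /tmp/apps/slurm/latest   # flip to new one
readlink /tmp/apps/slurm/latest                          # -> the new tree
# then restart slurmdbd, slurmctld and the slurmd's, in that order
```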
All the best,
Chris
> needed by pmi2.
This is what we have working (with 17.11.x):
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/ram
/dev/random
/dev/hfi*
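That list looks like the contents of a cgroup allowed-devices file; if so (my assumption), cgroup.conf would reference it along these lines (path illustrative):

```
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
```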
r example:
We don't see those here, but defunct (zombie) processes don't really exist;
they're just caching the exit status until their parent gets around to
wait()ing for them and then they can be reaped.
All the best,
Chris
ced it by then.
However, we've seen it now too.
We'll try the disable/mask trick for `systemd-logind` too.
cheers!
Chris
NFS latencies
are the problem here.
All the best,
Chris
duling decisions on them.
I'm out of ideas sorry!
All the best,
Chris
m at the beginning but holes then open up as jobs finish.
So hopefully you'll have a nice mix of job sizes that will fit those holes.
All the best,
Chris
g it?
My understanding (having never tried federation) is that each cluster runs
its own slurmctld and slurmd daemons, but they must all share the same slurmdbd.
ata.
cheers!
Chris
Hope this helps!
All the best,
Chris
ng correctly.
That's... odd. I've never seen that.
Worth trying by hand on a clean install running slurmdbd like this:
slurmdbd -Dvvv
to see if there's anything obvious showing up in the debug logs to indicate
some problems.
ere.
that systemd is killing slurmdbd for
some reason.
What happens if you run slurmdbd by hand as root? Like this:
slurmdbd -D -
That should run it in the foreground and output debug info to the screen.
by default in
17.11.x (and I'm not even sure it works if you enable it there) and
seems to be enabled by default in 18.08.x.
To check, see the _enable_pack_steps() function in src/srun/srun.c.
All the best,
Chris (currently away in the UK)
VLSCI we would get reporting questions
about usage (even after systems had been decommissioned) that we needed to go
back to get data out of Slurm for. Luckily we had some beefy Percona MySQL
servers in a cluster!
All the best,
Chris
e, then I can do slurmctld
(with partitions marked down, just in case). Once those are done I can
restart slurmd's around the cluster.
On Tuesday, 25 September 2018 1:54:19 PM AEST Kevin Buckley wrote:
> Is there a way to rename a Reservation ?
I've never come across a way to do that, I've just had to delete and recreate.
Sorry Kevin!
this mode it's
just sending processes SIGSTOP and then launching the incoming job, so you
should really have enough swap for the previous job to be swapped out to, in
order to free up RAM for the incoming job.
restriction.
I guess it's possible the next level caches might get a work out, but then
unless you're restricting OS daemon processes to cores that are not used by
Slurm then you're probably still going to get some amount of cache pollution
anyway.
All the best!
Chris
e mode, or do you mean that the code
inside the job uses all the cores on the node instead of what was requested?
The latter is often the case for badly behaved codes and that's why using
cgroups to contain applications is so important.
All the best,
Chris
ps://slurm.schedmd.com/cgroups.html
https://slurm.schedmd.com/pam_slurm_adopt.html
Best of luck!
Chris
ooking for the absence of a batch script. Have a look at this bug:
https://bugs.schedmd.com/show_bug.cgi?id=3094
All the best,
Chris
lob/master/karaage/datastores/slurm.py
Hope that helps!
All the best,
Chris
On Thursday, 13 September 2018 4:24:41 AM AEST Ariel Balter wrote:
> How do I set email preferences for this group?
https://lists.schedmd.com/cgi-bin/mailman/options/slurm-users
based & independently created), so when people
are added, modified or deleted it runs sacctmgr to keep everything in step.
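As a dry-run sketch of one such sync step (the function, names and the echo wrapper are all mine, purely illustrative):

```shell
# Emit, rather than execute, the sacctmgr command that would add a user
# to an account; drop the leading 'echo' once the output looks right.
sync_user() {
    user="$1"; account="$2"
    echo sacctmgr -i add user name="$user" account="$account"
}
sync_user alice physics   # prints: sacctmgr -i add user name=alice account=physics
```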
post-build.
Correct - autoconf will detect hwloc if the headers & library are
present there at compile time. It links against it so it *must*
be there when you are compiling in order to use it.
All the best,
Chris
r version of Slurm - works happily on 17.11.x.
All the best,
Chris
es=.
The gres.conf we use on our HPC cluster uses Cores= quite happily.
Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35
All the best!
Chris
s package in RHEL/CentOS and cgroup-tools in Debian.
All the best,
Chris
eat catch Gennaro!
Ah, just noticed you're the Debian package maintainer for Slurm. :-)
All the best,
Chris
ing until it hits its time limit (unless, as you
say, you manually kill that step yourself).
All the best,
Chris
ve you from users setting CUDA_VISIBLE_DEVICES
themselves and accessing GPUs they are not meant to, you really really do
need to use cgroups to stop that happening.
All the best,
Chris
ou set CUDA_VISIBLE_DEVICES to be, as processes
will only be able to access what they requested.
Hope that helps!
Chris
load plugin
> /root/sl/sl2/lib/slurm/crypto_munge.so
To me that looks like you managed to compile Slurm against a
version of Munge installed under root's home directory.
This is unlikely to be what you want.
If you build Slurm as a non-root user then it won't find that.
All the best,
Chris
bly a bug in how they packaged it!
Best of luck,
Chris
here (which I've just started using for a
side radio-astronomy project at the observatory I volunteer at):
https://www.brendanlong.com/systemd-user-services-are-amazing.html
Hope this helps!
All the best,
Chris
On Tuesday, 21 August 2018 6:17:59 PM AEST Chris Samuel wrote:
> My apologies - I've just tested here (with Slurm 17.11.7) and you are indeed
> correct, they only appear when launched with sbatch and salloc and not when
> you launch jobs directly with srun!
I think the confusion i
acks the configuration of the "memory" cgroup. (see output below)
I don't see that on our CentOS 7.5 system, which distro are you using?
ck for both.
Hope this helps!
All the best,
Chris
on of
> the various counters and limits the controller keeps in memory.
Awesome, thanks Kilian!
$ scontrol show assoc_mgr QOS=astac_oz045 | fgrep UsageRaw=
UsageRaw=18641632.00
Looking promising...
nuking your shell's environment perhaps?
17.02.11 is the last released version of 17.02.x and all previous versions
have been pulled from the SchedMD website due to CVE-2018-10995.
cheers,
Chris
Hey folks,
I'm going to be unsubscribing from slurm-users for a while as I'll be
travelling to the US & UK for a number of weeks & I don't want to drown in
email.
I'll be back...
ts, not what is currently in use.
So the sum of the requested memory of all jobs running in that association
doesn't leave enough permitted resources free to allow this job to begin.
4-35,37-45,52-53,65-66,72-86]
It does result in a job being allocated which will never appear in your
accounting though, so you'll need to be prepared for that.
All the best,
Chris
n/bash
exec srun $* --pty -u ${SHELL} -l
That's it..
Hope that helps!
Chris
iles.
Hope that helps!
Chris
ling them fixed that.
Best of luck,
Chris
On Thursday, 10 May 2018 1:02:36 AM AEST Eric F. Alemany wrote:
> All seem good for now
Great news!
::
john46
Hope that helps,
Chris
upport contract I would be opening a bug with SchedMD now.
cheers!
Chris
thub.com/dun/munge/wiki/Installation-Guide
Good luck!
Chris
art munged as well? That's what's reading the key, not Slurm.
Munge is just an external service that Slurm talks to.
cheers,
Chris
ads, but from the point of view of what you
*request* CPUs are just boards*sockets*cores.
Confusing!
All the best,
Chris
t doesn't care about that.
node=rocks7 state=resume
cheers,
Chris
On Sunday, 6 May 2018 2:58:26 PM AEST Chris Samuel wrote:
> Very very interesting - both slurmd and lscpu report 32 cores, but with
> differing interpretations of the layout. Meanwhile the AMD
> website says these are 16 core CPUs, which means both Slurm and lscpu ar
ler which is why it's important to know about for memory locality).
What's the hardware you're running this on?
Also can you refresh my memory please, what do each of these say?
lscpu
slurmd -C
lstopo
(don't worry if that last one isn't there)
All the best,
Chris
his helps..
All the best,
Chris
e 17.11.0 (as I
know it works for us with 17.11.5) or a kernel bug (or missing device
cgroups).
Sorry I can't be more helpful!
All the best,
Chris
fixed in 17.11.6 which will be out soon. I can't tell if
you're hitting the same bug we hit, but I'd suggest re-testing when it
appears.
Good luck!
Chris
wnload that version any more from SchedMD
because of the CVE.
I'd suggest upgrading if you can.
All the best,
Chris
n you can schedule 16 tasks per node and each
task can use 2 threads.
What does "slurmd -C" say on that node?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
l says: --sysconfdir=/etc/slurm-llnl
ATTERN}
will do git's grep with no need to have a git repository. Plus it paginates,
etc, for you. Also pretty fast. :-)
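If memory serves, the flag in question is --no-index, which lets git grep work on any directory tree:

```shell
# Search a plain directory with git's grep; no repository required.
mkdir -p /tmp/greptest && echo 'hello slurm' > /tmp/greptest/notes.txt
cd /tmp/greptest && git grep --no-index -n 'slurm'   # notes.txt:1:hello slurm
```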
https://slurm.schedmd.com/wckey.html
Good luck,
Chris
tld and slurmd everywhere and see
where that gets you. If it's still drained but the hardware config looks good
in Slurm then you can do "scontrol update node=rocks7 state=resume" to tell
Slurm to try using it again.
All the best,
Chris
Hi Mahmood,
Not quite what I meant sorry.
What does this say?
scontrol show config | fgrep -i rocks7
cheers,
Chris
On Sunday, 29 April 2018 4:11:39 PM AEST Mahmood Naderan wrote:
> So, I don't know why only 1 core included
What do you have in your slurm.conf for rocks7?
hardware resources to meet
what you've told it.
What does "slurmd -C" say on rocks7 ?
On Saturday, 28 April 2018 7:58:08 PM AEST Mahmood Naderan wrote:
> I see that the state of the frontend is Drained. Is that the default
> state?
Probably not. What does "sinfo --list-reasons" say?
txt) but I think that is just a
mechanism to store information about completed jobs. Slurmdbd also stores
information about users, accounts and associations and so I suspect you'll
need that to be able to get that information.
Also note account = bank account, user = username.
Hope that he
StorageEnforce to anything that requires
associations then jobs will quite happily start without being part of one
(from memory).
All the best,
Chris
On Wednesday, 25 April 2018 3:47:17 PM AEST Chris Samuel wrote:
> I'll open a bug just in case..
https://bugs.schedmd.com/show_bug.cgi?id=5097
her, but I wonder if there
have been accidental redefinitions (for instance, the one in slurm_persist_conn.c
didn't appear until 2016, whilst the one in
slurm_protocol_socket_implementation.c
was set to that value (1GB) back in 2013).
I'll open a bug just in case..
cheers,
Chris
A job can be submitted to many partitions (modulo local policy) but once it
starts it is only in one partition, that might be what you are thinking of
here.
All the best,
Chris
s on that same node.
All the best,
Chris
isunderstood what you were trying to achieve. I assumed you
wanted a homogeneous configuration for the partition. Yes, if you are happy
for the asymmetry then you can do that.
's really something very wrong in your setup I'm afraid.
I've not seen an impact like that just from running Slurm.
ll cores.
All you need to do is add "MaxCPUsPerNode=20" to that to limit the number of
cores that the partition can use.
We do this for our non-GPU job partition to reserve some cores for the GPU job
partition.
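A hypothetical slurm.conf fragment along those lines (partition names, node range and core counts invented for illustration):

```
# 24-core nodes: cap CPU jobs at 20 cores, leaving 4 for the GPU partition
PartitionName=cpu Nodes=node[01-10] MaxCPUsPerNode=20 Default=YES
PartitionName=gpu Nodes=node[01-10] MaxCPUsPerNode=4
```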
All the best,
Chris
ore) format options to you so I had to guess a little.
On Tuesday, 17 April 2018 5:08:09 PM AEST Mahmood Naderan wrote:
> So, UsePAM has not been set. So, slurm shouldn't limit anything. Is
> that correct? however, I see that slurm limits the virtual memory size
What does this say?
scontrol show config | fgrep VSizeFactor
ob_comp_mysql.so) and slurm-slurmdbd
(accounting_storage_mysql) packages.
The example configuration files have been moved to slurm-example-configs.
we can just restart services as
we wish to pick up the right one (/apps is a shared read-only filesystem across
all the cluster nodes).
Our config directory is /apps/slurm/etc so again only one place to modify
things for all nodes to see the change.
All the best,
Chris
rvationId,Start,TotalTime
best of luck!
Chris
r us (yet - it seems to
work pretty well so far on our systems) but worth keeping in mind for the
future. Thanks for sharing!
eason it doesn't at the moment is that you're telling it not to tell you.
intensive to see differences.
It also sounds like you've got a problem with either your Slurm job or Slurm
itself from the error you posted.
and communications instead. Something like NAMD
or a synthetic benchmark like HPL.
All the best,
Chris
.
All the best,
Chris
under the covers for you when you do "useradd".
[...]
> I think something is wrong. Any idea?
What does this say?
scontrol show config | fgrep AccountingStorageEnforce
All the best,
Chris