t?
Thanks,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
herwise).
Has anybody seen this before (and knows how to fix this)?
Thanks,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
On 09/30/2015 01:19 PM, Robbert Eggermont wrote:
I recently upgraded from slurm-14.11 to 15.08(.1). I remember sstat
working for 14.11, but now it just says "sstat: error: no steps running
for job x" for any job I try.
Correction: the above is only true for jobs that don't
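For reference: sstat only reports on steps that are currently running, and the batch step of a job usually has to be named explicitly; a minimal example (the jobid is a placeholder):
% sstat --format=JobID,AveCPU,MaxRSS -j 12345.batch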
build environment?
Best,
Robbert
On 10/05/2015 12:14 PM, James Oguya wrote:
I can build rpm packages for slurm-15.08.1, but I can't install
slurm-slurmdbd.x86_64 due to missing libmysqlclient_r.so.16 object file.
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
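One workaround for the missing libmysqlclient_r.so.16 (package names vary per distribution, so treat them as assumptions) is to rebuild the RPMs on a host that has the local MySQL/MariaDB client development files installed, so slurm-slurmdbd links against a library that actually exists there:
% yum install mariadb-devel   # or mysql-devel, depending on the distribution
% rpmbuild -ta slurm-15.08.1.tar.bz2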
Hello,
Some modifications to the slurm.conf require me to restart the slurmd
daemons on all nodes. Is there a way to do this without losing any
running jobs (and not having to drain the cluster)?
Thanks,
Robbert
--
Robbert Eggermont Intelligent Systems
Indeed, the jobs were not terminated by the restart of slurmd; the restart was just required to get slurmctld and slurmd communicating again, so that the terminations requested by slurmctld could be executed immediately.
Robbert
--
Robbert Eggermont Intelligent Systems
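To summarize the thread: slurmd can be restarted underneath running jobs, because it recovers its state from disk; only the restart method below (pdsh, node names) is an assumption about the local setup:
% scontrol reconfigure                             # enough for many slurm.conf changes
% pdsh -w node[01-16] 'systemctl restart slurmd'   # running jobs survive a full restart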
remote process.
Are there special options we need to use for this? (Is there some kind
of keep-alive necessary?) Any other thoughts on this?
Best,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
i?id=1641) describes
slurmreport, a set of configurable scripts for daily or monthly job accounting
reporting. Perhaps these might help when you upgrade.
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
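Plain sreport can already produce simple periodic summaries if the slurmreport scripts are too much; for example (the dates are placeholders):
% sreport cluster utilization Start=2015-09-01 End=2015-10-01
% sreport user topusage Start=2015-09-01 End=2015-10-01 TopCount=10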
= SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS,CALCULATE_RUNNING
Should I change anything else in the configuration if I want to use
CALCULATE_RUNNING?
Best,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
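For what it's worth, the one knob directly related to CALCULATE_RUNNING that I know of is PriorityCalcPeriod, which controls how often priorities (now including those of running jobs) are recalculated; a minimal slurm.conf sketch, with example values:
PriorityType=priority/multifactor
PriorityFlags=SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS,CALCULATE_RUNNING
PriorityCalcPeriod=5   # minutes between recalculations; 5 is the default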
"cgroup", MS_NOSUID|MS_NODEV|MS_NOEXEC,
"cpuacct") = -1 EBUSY (Device or resource busy)
...and it might be related to this existing mount courtesy
of systemd in /proc/mounts:
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup
rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0
Anyone else run into this?
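A quick way to confirm the clash is to look for the systemd-managed combined hierarchy, and then point Slurm at the existing mountpoint instead of letting slurmd mount its own (a sketch; whether this fits your setup is an assumption):
% grep cpuacct /proc/mounts
# cgroup.conf:
CgroupMountpoint=/sys/fs/cgroup
CgroupAutomount=no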
info on the batch jobs; interactive jobs report just fine. For the batch
ones I can only see the elapsed time, not the memory, cores, etc.
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
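In case it helps: memory and CPU numbers for batch jobs only get recorded when a job accounting gather plugin is enabled; a minimal slurm.conf sketch (assuming the plain linux plugin):
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30   # sampling interval in seconds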
Is it expected behaviour that a failed job launch is handled as a
duplicate jobid? If so, can anybody elaborate on this, and do I need to
do anything (besides resuming the node)?
Or is this a bug? (Caused by the timing of the requeue?)
Best,
Robbert
--
Robbert Eggermont
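For completeness, resuming the drained node is a one-liner (the node name is a placeholder):
% scontrol update NodeName=node01 State=RESUME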
lost. For jobs that get started, the only clue is that Slurm immediately
reports the job as failed, but no output file is created.
All in all it works well.
Regards,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
Hi Matthieu
Hi Robbert, what we do to solve this problem is to add a section to the
slurmctld prolog that checks that the user associated with the job about
to start has a valid credential in the auksd daemon; otherwise we update
the job with a comment indicating that no kerberos token is available.
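A rough sketch of such a check (the exact auks invocation is an assumption; check auks(1) for the flags your version supports):
#!/bin/sh
# PrologSlurmctld sketch: flag jobs whose owner has no valid auks credential.
# SLURM_JOB_ID and SLURM_JOB_UID are set in the slurmctld prolog environment;
# the "auks -g -u" form below is illustrative, not verified.
if ! auks -g -u "$SLURM_JOB_UID" >/dev/null 2>&1; then
    scontrol update JobId="$SLURM_JOB_ID" Comment="no kerberos token available"
fi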
Only if OverSubscribe=FORCE is set on the partition would the argument
"--exclusive" make sense, to prevent the default sharing of nodes.
With "--exclusive", all resources of the node would be billed to the
exclusive job automatically, right?
Correct.
Best,
Robbert
--
Robbert Eggermont Intelligent Systems
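As a concrete sketch (partition and node names are placeholders):
# slurm.conf: allow up to 4 jobs to share each resource on this partition
PartitionName=shared Nodes=node[01-16] OverSubscribe=FORCE:4
# a job can still opt out of the sharing, and is then billed for the whole node:
% sbatch --exclusive job.sh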
hem?
Are there any other likely causes that we've missed?
Best,
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
On 06-12-16 10:49, David van Leeuwen wrote:
"gres/gpu count too low (0 < 1)"
Last time I saw this I had to restart the slurmd on that node (a simple
scontrol reconfigure was not enough).
I guess this message indicates a discrepancy between the number of GPU
resources detected by slurmd at startup and the count configured for the node.
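The sequence that worked here was roughly (the node name is a placeholder):
% ssh node01 systemctl restart slurmd
% scontrol show node node01 | grep -i gres   # verify the reported Gres count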
least 12 hours at night.
There was no need to reconfigure partitions, so nice and simple.
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
Hello,
In our Slurm setup (now 17.02.4) I've noticed several times now that
backfilled jobs push back the start time of the highest priority job.
I'm not sure if this is due to a configuration error or a scheduler
error, and since I'm having a hard time diagnosing what's happening, I
was hoping for some pointers here.
unfortunately it didn't change anything for this problem.
Robbert
2017-06-16 1:16 GMT+02:00 Robbert Eggermont:
Hello,
In our Slurm setup (now 17.02.4) I've noticed several times now that
backfilled jobs push back the start time of the highest priority job.
I'm not sure if this is due to a configuration error or a scheduler error.
the future. There's a patch to fix this, but it isn't in the 17.02
tarball. Take a look at
https://github.com/SchedMD/slurm/commit/3f7e10f868145a505b1dad6a69b040a167eaa541
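If you build from source, one way to pick the fix up before the next release is to apply that commit on top of the 17.02 branch, e.g.:
% git clone -b slurm-17.02 https://github.com/SchedMD/slurm.git
% cd slurm
% git cherry-pick 3f7e10f868145a505b1dad6a69b040a167eaa541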
-
Gary Skouson
-----Original Message-----
From: Robbert Eggermont [mailto:r.eggerm...@tudelft.nl]
Sent: Thur
Is there a way to make the nodes go into State=FUTURE automatically?
Or do we simply remove the node definitions until the nodes are ready?
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
ed?)
Are there any "best practices" for preparing to add new nodes?
Robbert
--
Robbert Eggermont Intelligent Systems
r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science
+31 15 27 83234 Delft University of Technology
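A minimal sketch of pre-defining undelivered nodes (names and sizes are placeholders; whether scontrol alone can later flip the state depends on the Slurm version):
# slurm.conf: nodes that are ordered but not racked yet
NodeName=node[17-24] CPUs=32 RealMemory=128000 State=FUTURE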
On 14-08-17 07:50, Lachlan Musicman wrote:
We have TaskPlugin=task/cgroup, and when testing I noticed that the number
of threads/cpus being allocated was rounded up to the nearest even number.
I presume this was due to cgroups marking a core as a cpu, rather than a
thread as a cpu.
Sounds like you're using whole-core allocation (e.g. SelectTypeParameters=CR_Core) on nodes with two hardware threads per core, so CPUs are handed out in whole cores.
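A configuration that would produce exactly this rounding (a sketch, assuming hyperthreaded nodes):
# slurm.conf: with CR_Core the allocation unit is a physical core, so on
# ThreadsPerCore=2 hardware an odd CPU request is rounded up to an even count
SelectType=select/cons_res
SelectTypeParameters=CR_Core
NodeName=node01 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2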
Given:
% salloc -n 4 -c 2 --gres=gpu:1
% srun env | grep CUDA # a single srun
# Currently always produces
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
man salloc:
--gres
... The specified resources will be allocated to the job on each node.
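So with --gres=gpu:1 every step in the allocation sees the same single GPU. One way (a sketch) to give concurrent steps distinct GPUs is to request more for the job and let each step take its own share:
% salloc -n 4 -c 2 --gres=gpu:4
% srun -n 1 --gres=gpu:1 env | grep CUDA   # each such step gets one of the job's GPUs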