[slurm-dev] Re: node in down state from "unexpected reboot"

2015-07-31 Thread Robbert Eggermont
t? Thanks, Robbert -- Robbert Eggermont Intelligent Systems r.eggerm...@tudelft.nl Electr.Eng., Mathematics & Comp.Science +31 15 27 83234 Delft University of Technology

[slurm-dev] sstat: error: no steps running for job

2015-09-30 Thread Robbert Eggermont
herwise). Has anybody seen this before (and knows how to fix this)? Thanks, Robbert

[slurm-dev] Re: sstat: error: no steps running for job

2015-09-30 Thread Robbert Eggermont
On 09/30/2015 01:19 PM, Robbert Eggermont wrote: I recently upgraded from slurm-14.11 to 15.08(.1). I remember sstat working for 14.11, but now it just says "sstat: error: no steps running for job x" for any job I try. Correction: the above is only true for jobs that don't

[slurm-dev] Re: Unable to install slurm-slurmdbd-15.08.1 on CentOS 7

2015-10-05 Thread Robbert Eggermont
ild environment? Best, Robbert On 10/05/2015 12:14 PM, James Oguya wrote: I can build rpm packages for slurm-15.08.1, but I can't install slurm-slurmdbd.x86_64 due to missing libmysqlclient_r.so.16 object file.

[slurm-dev] Slurmd restart without loosing jobs?

2015-10-12 Thread Robbert Eggermont
Hello, Some modifications to the slurm.conf require me to restart the slurmd daemons on all nodes. Is there a way to do this without losing any running jobs (and not having to drain the cluster)? Thanks, Robbert

[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Robbert Eggermont
Indeed, the jobs were not terminated by the restart of the slurmd; that was just required to get slurmctld and slurmd to resume communicating and immediately execute the terminations requested by slurmctld. Robbert

[slurm-dev] srun interactive sessions hang

2015-10-15 Thread Robbert Eggermont
emote process. Are there special options we need to use for this? (Is there some kind of keep-alive necessary?) Any other thoughts on this? Best, Robbert

[slurm-dev] RE: Generating utilisation report for accounts without users?

2015-11-23 Thread Robbert Eggermont
i?id=1641) describes slurmreport, a set of configurable scripts for daily or monthly job accounting reporting. Perhaps these might help when you upgrade.

[slurm-dev] PriorityFlags CALCULATE_RUNNING crashes slurmctld?

2015-11-25 Thread Robbert Eggermont
= SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS,CALCULATE_RUNNING Should I change anything else in the configuration if I want to use CALCULATE_RUNNING? Best, Robbert

[slurm-dev] Re: slurmd can't mount cpuacct cgroup namespace on RHEL 7.2 ?

2015-12-21 Thread Robbert Eggermont
"cgroup", MS_NOSUID|MS_NODEV|MS_NOEXEC, "cpuacct") = -1 EBUSY (Device or resource busy) ...and it might be related to this existing mount courtesy of systemd in /proc/mounts: cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0 Anyone el

[slurm-dev] Re: Sacct not reporting info on batch jobs

2016-02-27 Thread Robbert Eggermont
fo on the batch jobs, interactive jobs report just fine. For the batch ones I can only see the elapsed time, not the memory, cores etc.

[slurm-dev] Duplicate jobid, launch failed?

2016-03-16 Thread Robbert Eggermont
t expected behaviour that a failed job launch is handled as a duplicate jobid? If so, can anybody elaborate on this and do I need to do anything (besides resuming the node)? Or is this a bug? (Caused by the timing of the requeue?) Best, Robbert

[slurm-dev] Re: NFSv4

2016-05-25 Thread Robbert Eggermont
lost. For jobs that get started the only clue is that Slurm immediately reports the job as failed but no output file is created. All in all it works well. Regards, Robbert

[slurm-dev] Re: NFSv4

2016-06-03 Thread Robbert Eggermont
Hi Matthieu Hi Robbert, what we do to solve this problem is adding a section in the Slurmctld prolog that checks that the user associated with the job to start has a valid credential in the auksd daemon, otherwise we update the job with a comment indicating that no kerberos token is available and

[slurm-dev] Re: resource usage, TRES and --exclusive option

2016-09-01 Thread Robbert Eggermont
nly if OverSubscribe=FORCE is set on the partition the argument "--exclusive" would make sense to prevent the default sharing of nodes. With "--exclusive" all resources of the node would be billed to the exclusive job automatically, right? Correct. Best, Robbert

[slurm-dev] All nodes reboot automagically?

2016-10-17 Thread Robbert Eggermont
hem? Are there any other likely causes that we've missed? Best, Robbert

[slurm-dev] Re: Reason=gres/gpu count too low

2016-12-06 Thread Robbert Eggermont
On 06-12-16 10:49, David van Leeuwen wrote: "gres/gpu count too low (0 < 1)" Last time I saw this I had to restart the slurmd on that node (a simple scontrol reconfigure was not enough). I guess this message indicates a discrepancy between the number of GPU resources detected by slurmd at

[slurm-dev] Re: Daytime Interactive jobs

2017-01-30 Thread Robbert Eggermont
least 12 hours at night. There was no need to reconfigure partitions, so nice and simple. Robbert

[slurm-dev] Re: Unable to allocate Gres by type

2017-02-06 Thread Robbert Eggermont
then delete it (including any attachments) from your system.

[slurm-dev] Scheduling weirdness

2017-06-15 Thread Robbert Eggermont
Hello, In our Slurm setup (now 17.02.4) I've noticed several times now that backfilled jobs push back the start time of the highest priority job. I'm not sure if this is due to a configuration error or a scheduler error, and since I'm having a hard time diagnosing what's happening, I was hop

[slurm-dev] Re: Scheduling weirdness

2017-06-16 Thread Robbert Eggermont
unfortunately it didn't change anything for this problem. Robbert 2017-06-16 1:16 GMT+02:00 Robbert Eggermont : Hello, In our Slurm setup (now 17.02.4) I've noticed several times now that backfilled jobs push back the start time of the highest priority job. I'm not sure if thi

[slurm-dev] RE: Scheduling weirdness

2017-06-21 Thread Robbert Eggermont
e future. There's a patch to fix this, but it isn't in the 17.02 tarball. Take a look at https://github.com/SchedMD/slurm/commit/3f7e10f868145a505b1dad6a69b040a167eaa541 - Gary Skouson -----Original Message- From: Robbert Eggermont [mailto:r.eggerm...@tudelft.nl] Sent: Thur

[slurm-dev] How to set 'future' node state?

2017-07-14 Thread Robbert Eggermont
make the nodes go into State=FUTURE automatically? Or do we simply remove the node definitions until the nodes are ready? Robbert

[slurm-dev] Re: How to set 'future' node state?

2017-07-17 Thread Robbert Eggermont
ed?) Are there any "best practices" for preparing to add new nodes? Robbert

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread Robbert Eggermont
On 14-08-17 07:50, Lachlan Musicman wrote: We have TaskPlugin=task/cgroup and when testing I noticed that the # of threads/cpus being allocated was rounded up to the nearest even. I presume this was due to cgroups marking a core as a cpu, rather than a thread as a cpu. Sounds like you're usi

[slurm-dev] Re: salloc/srun advice for 1 gpu/task but job makes use of all available gpus

2017-09-05 Thread Robbert Eggermont
Given: % salloc -n 4 -c 2 --gres=gpu:1 % srun env | grep CUDA   # a single srun # Currently always produces CUDA_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0 man salloc: --gres ... The specified resources will be allocated to the job on each