Subject: [slurm-dev] Re: Backup controller not responding to requests
What is the output of
scontrol show config | grep SlurmctldTimeout
?
2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>:
> Yes, if I do scontrol takeover, it successfully goes to the backup.
>
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Backup controller not responding to requests
Does it work if you use "scontrol takeover" to shut down the primary controller
and switch immediately to the backup controller?
2017-01-30 19:41 GMT+01:00 Andrus, Brian C
of slurm.
Paddy
On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote:
> All,
>
> I have configured a backup slurmctld system and it appears to work at first,
> but not in practice.
> In particular, when I start it, it says it is running in background mode:
All,
I have configured a backup slurmctld system and it appears to work at first,
but not in practice.
In particular, when I start it, it says it is running in background mode:
[2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming
[2017-01-25T14:23:37.650] slurmctld running in background mode
The way we did that was to put the nodes in their own partition which is only
accessible by that group.
PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] AllowGroups=beards
DefaultTime=01:00:00 MaxTime=INFINITE State=UP
So here is a partition "beardq" which is only available to folks in the
"beards" group.
I take that back. It was indeed the issue. User name is clwalton1...
Doh!
Thanks for pointing me in the right direction.
Brian Andrus
ITACS/Research Computing
Naval Postgraduate School
Monterey, California
voice: 831-656-6238
-----Original Message-----
From: Andrus, Brian Contractor
[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs
running
Hi,
Is the user defined on all the compute nodes? Does it have the same UID on all
the hosts?
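One way to check that is to compare the UID the user resolves to on every node. A minimal sketch, assuming a fan-out tool like `clush` or `pdsh` collects "host:uid" pairs; the collected output is simulated here so the comparison logic itself runs anywhere (host names and UIDs are made up):

```shell
# In practice something like `clush -w compute-[1-4] id -u clwalton1`
# (fan-out tool assumed) would produce these "host:uid" pairs.
collected="compute-1:1001
compute-2:1001
compute-3:1002"
# Count distinct UIDs; more than one means the nodes disagree.
distinct=$(printf '%s\n' "$collected" | cut -d: -f2 | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
  echo "UID consistent"
else
  echo "UID mismatch across nodes"
fi
```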
Regards,
Carlos
On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor
<bdand...@nps.edu>
see instructions in our Wiki
https://wiki.fysik.dtu.dk/niflheim/SLURM
/Ole
On 11/25/2016 05:04 PM, Andrus, Brian Contractor wrote:
> All,
>
> I have been having an issue where if I try to run the slurm daemon
> under systemd, it hangs for some time and then errors out
All,
Don't quite get this:
# squeue | head
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      751071_17703   primary PARAMEIG clwalton CG 3-00:00:19      1 compute-3-87
751071_[36752-6220   primary PARAMEIG clwalton PD       0:00      1 (Resources)
All,
I have been having an issue where if I try to run the slurm daemon under
systemd, it hangs for some time and then errors out with:
systemd[1]: Starting LSB: slurm daemon management...
systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start.
systemd[1]: slurm.service:
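One frequent cause of this symptom (an assumption here — the thread is cut off before any resolution) is that the unit file's PIDFile does not match SlurmctldPidFile in slurm.conf, or that the unit starts before munge and the network are up. A minimal native unit sketch, assuming the default PID path:

```
[Unit]
Description=Slurm controller daemon
After=network.target munge.service

[Service]
Type=forking
ExecStart=/usr/sbin/slurmctld
# Must match SlurmctldPidFile in slurm.conf, or systemd times out
# waiting for a PID file that never appears at this path.
PIDFile=/var/run/slurmctld.pid

[Install]
WantedBy=multi-user.target
```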
nodes
Hi Brian,
Looks like your default memory allocation for jobs is 258307 MB, which is just
how much memory you have on the node. Try to request less memory with --mem.
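A minimal sketch of such a request (the size is an assumption; any value well under the node's 258307 MB leaves room for other jobs):

```shell
#!/bin/sh
# Hypothetical batch script: ask for 4 GB explicitly instead of
# inheriting a default that equals the whole node's memory.
#SBATCH --ntasks=1
#SBATCH --mem=4G
msg="requested --mem=4G"
echo "$msg"
```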
Best wishes,
Marius
On 16 Aug 2016 at 01:44, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
Ok, I am still having trouble here and am not sure where to look.
Slurm is configured with:
SelectType = select/cons_res
SelectTypeParameters = CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
I have a node which has 64 cores:
NodeName=compute-2-1 Arch=x86_64
2016 at 11:06 AM, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
All,
I am trying to figure out the bits required to allow users to use part of a
node and not block others from using remaining resources.
It looks like the “OverSubscribe” option
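With select/cons_res the scheduler already hands out individual cores and memory, so two jobs can share a node even without OverSubscribe; OverSubscribe only controls whether the *same* cores may be oversubscribed. A hypothetical slurm.conf sketch (node names reused from elsewhere in the thread):

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# Jobs on beardq get only the cores/memory they request; the rest of
# each node stays schedulable for other jobs even with OverSubscribe=NO.
PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] OverSubscribe=NO State=UP
```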
All,
I am trying to find a way to see what resources are used/remaining on a per
node basis. In particular memory and sockets/cpus/cores/threads
Not seeing anything in the sinfo or scontrol man pages that show specifically
that..
Any insight is appreciated.
Brian Andrus
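"scontrol show node" does carry per-node allocation counters: CPUAlloc/CPUTot and, in recent releases, AllocMem alongside RealMemory. A sketch that parses one captured line of `scontrol -o show node` output — the node values here are assumptions — so the awk itself runs anywhere:

```shell
# One line captured from `scontrol -o show node` (values hypothetical).
sample='NodeName=compute-2-1 CPUAlloc=32 CPUTot=64 RealMemory=258307 AllocMem=129000'
# Split each KEY=VALUE field and print used/total CPUs and memory.
summary=$(echo "$sample" | awk '{
  for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
  printf "%s: %d/%d CPUs, %d/%d MB", v["NodeName"], v["CPUAlloc"], v["CPUTot"], v["AllocMem"], v["RealMemory"]
}')
echo "$summary"
```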
index number,
though there are some other creative things you can do with them.
Ryan
On 01/27/2016 06:47 PM, Andrus, Brian Contractor wrote:
I ended up just doing ‘scancel’ on all the jobs and resubmitting them.
I seem to be making progress.
Now I am having trouble figuring out the --distribution opt
tmap.
John DeSantis
2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor
<bdand...@nps.edu>:
John,
Thanks. That seemed to help; a job started on a node that had a job on it once
the job that had been on it (‘using’ all the memory) completed.
But now all
All,
I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.
Now I have one array job that is running across several nodes, but only using
some of the node resources. I would like to have slurm start sharing the nodes
so some of the
per node is scheduled.
HTH,
John DeSantis
2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor
<bdand...@nps.edu>:
All,
I am in the process of transitioning from Torque to Slurm.
So far it is doing very well, especially handling arrays.
Now I have one array job that is running across several nodes, but only using
some of the node resources.
Corporation<http://www.decisionsciencescorp.com/>
On Wed, Jan 20, 2016 at 6:49 PM, Andrus, Brian Contractor
<bdand...@nps.edu> wrote:
All,
Is there a way to change the maximum simultaneous running tasks of an array job
that is currently running?
For example:
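The question above is cut off, but the cap on simultaneous array tasks can be set at submit time with the %N suffix, and newer Slurm releases can also change it on a running array via scontrol (check your version's man page). A hypothetical sketch:

```shell
#!/bin/sh
# Hypothetical array script: the %5 suffix caps the 1-100 array
# at 5 tasks running simultaneously.
#SBATCH --array=1-100%5
# For an already-running array, newer releases accept something like:
#   scontrol update JobId=<jobid> ArrayTaskThrottle=5
spec="1-100%5"
throttle="${spec#*%}"   # the throttle part of the --array spec
echo "throttle: $throttle"
```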
20:37, Andrus, Brian Contractor wrote:
> I am testing our slurm to replace our torque/moab setup here.
>
> The issue I have is to try and put all our node names in the NodeName
> and PartitionName entries.
> In our cluster, we name our nodes compute-- That seems to
> be problem enough
All,
I am testing our slurm to replace our torque/moab setup here.
The issue I have is to try and put all our node names in the NodeName and
PartitionName entries.
In our cluster, we name our nodes compute--
That seems to be problem enough with the abilities to use ranges in slurm, but
it is
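A two-number naming scheme can still use Slurm hostlist ranges if each rack gets its own NodeName line, with the bracket expanding the trailing index. A hypothetical slurm.conf sketch — hardware values and rack numbers are placeholders:

```
# One NodeName line per rack; the bracket expands the node index.
NodeName=compute-2-[1-16] CPUs=64 RealMemory=258307 State=UNKNOWN
NodeName=compute-8-[1-20] CPUs=64 RealMemory=258307 State=UNKNOWN
# Partitions accept comma-separated hostlists.
PartitionName=primary Nodes=compute-2-[1-16],compute-8-[1-20] Default=YES State=UP
```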