[slurm-dev] Re: Backup controller not responding to requests

2017-01-31 Thread Andrus, Brian Contractor
Subject: [slurm-dev] Re: Backup controller not responding to requests What is the output of scontrol show config | grep SlurmctldTimeout ? 2017-01-31 6:57 GMT+01:00 Andrus, Brian Contractor <bdand...@nps.edu>: > Yes, if I do scontrol takeover, it successfully goes to the backup. >
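A quick way to compare the relevant failover settings on both controllers (the extra grep terms are an assumption about what matters here, not something asked for in the thread):
  scontrol show config | grep -E 'SlurmctldTimeout|BackupController|BackupAddr'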

[slurm-dev] Re: Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] Re: Backup controller not responding to requests Does it work if you use "scontrol takeover" to shut down the primary controller and switch immediately to the backup controller? 2017-01-30 19:41 GMT+01:00 Andrus, Brian C
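For reference, a minimal failover test along these lines (run anywhere scontrol works):
  scontrol ping        # reports which of the primary/backup slurmctld daemons is UP
  scontrol takeover    # tells the backup to assume control immediately
  scontrol ping        # confirm the backup is now the one responding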

[slurm-dev] Re: Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
of slurm. Paddy On Mon, Jan 30, 2017 at 08:21:59AM -0800, Andrus, Brian Contractor wrote: > All, > > I have configured a backup slurmctld system and it appears to work at first, > but not in practice. > In particular, when I start it, it says it is running in background m

[slurm-dev] Backup controller not responding to requests

2017-01-30 Thread Andrus, Brian Contractor
All, I have configured a backup slurmctld system and it appears to work at first, but not in practice. In particular, when I start it, it says it is running in background mode: [2017-01-25T14:23:37.648] slurmctld version 16.05.6 started on cluster hamming [2017-01-25T14:23:37.650] slurmctld
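A minimal slurm.conf sketch of a primary/backup pair, using the 16.05-era parameter names (hostnames, timeout, and paths are illustrative only):
  ControlMachine=master1
  BackupController=master2
  SlurmctldTimeout=120                    # seconds before the backup assumes control
  StateSaveLocation=/shared/slurm/state   # must be read/write for both controllers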

[slurm-dev] RE: Restrict access for a user group to certain nodes

2016-12-01 Thread Andrus, Brian Contractor
The way we did that was to put the nodes in their own partition which is only accessible by that group. PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] AllowGroups=beards DefaultTime=01:00:00 MaxTime=INFINITE State=UP So here is a partition "beardq" which is only available to folks in the
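For reference, the same idea as a slurm.conf line plus the reload step (names are the ones quoted in the thread):
  PartitionName=beardq Nodes=compute-8-[1,5,9,13,17] AllowGroups=beards DefaultTime=01:00:00 MaxTime=INFINITE State=UP
  scontrol reconfigure   # push the edited slurm.conf to the running daemons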

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Andrus, Brian Contractor
I take that back. It was indeed the issue. User name is clwalton1... Doh! Thanks for pointing me in the right direction. Brian Andrus ITACS/Research Computing Naval Postgraduate School Monterey, California voice: 831-656-6238 -Original Message- From: Andrus, Brian Contractor

[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running

2016-11-28 Thread Andrus, Brian Contractor
[slurm-dev] Re: squeue returns "invalid user" for a user that has jobs running Hi, Is the user defined on all the compute nodes? Does it have the same UID on all the hosts? Regards, Carlos On Mon, Nov 28, 2016 at 6:54 PM, Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@
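A quick sanity check along these lines (user and node names are just examples from the thread):
  getent passwd clwalton1                      # on the controller
  ssh compute-3-87 getent passwd clwalton1     # on a compute node; UID/GID should match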

[slurm-dev] Re: PIDfile on CentOS7 and compute nodes

2016-11-28 Thread Andrus, Brian Contractor
see instructions in our Wiki https://wiki.fysik.dtu.dk/niflheim/SLURM /Ole On 11/25/2016 05:04 PM, Andrus, Brian Contractor wrote: > All, > > I have been having an issue where if I try to run the slurm daemon > under systemd, it hangs for some time and then errors out

[slurm-dev] squeue returns "invalid user" for a user that has jobs running

2016-11-25 Thread Andrus, Brian Contractor
All, Don't quite get this: # squeue|head JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 751071_17703 primary PARAMEIG clwalton CG 3-00:00:19 1 compute-3-87 751071_[36752-6220 primary PARAMEIG clwalton PD 0:00 1 (Resources)

[slurm-dev] PIDfile on CentOS7 and compute nodes

2016-11-25 Thread Andrus, Brian Contractor
All, I have been having an issue where if I try to run the slurm daemon under systemd, it hangs for some time and then errors out with: systemd[1]: Starting LSB: slurm daemon management... systemd[1]: PID file /var/run/slurmctld.pid not readable (yet?) after start. systemd[1]: slurm.service:
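One thing worth checking (an assumption, not something confirmed in the thread) is whether the pid file path the unit/init script waits for matches what slurm.conf actually writes; the config file path below may differ on your install:
  grep -i pidfile /etc/slurm/slurm.conf
  # e.g. SlurmctldPidFile=/var/run/slurmctld/slurmctld.pid
  # if that differs from the PIDFile systemd expects, align the two, then:
  systemctl daemon-reload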

[slurm-dev] Re: Fully utilizing nodes

2016-08-16 Thread Andrus, Brian Contractor
nodes Hi Brian, Looks like your default memory allocation for jobs is 258307 MB, which is just how much memory you have on the node. Try to request less memory with --mem. Best wishes, Marius 16. aug. 2016 kl. 01.44 skrev Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@n
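For example, a batch script that requests a fraction of the node's memory so other jobs can still land there (values and executable are placeholders):
  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --mem=8G      # per-node memory request instead of the ~258 GB default
  srun ./my_program     # placeholder executable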

[slurm-dev] Re: Fully utilizing nodes

2016-08-15 Thread Andrus, Brian Contractor
Ok, I am still having trouble here and am not sure where to look. Slurm is configured with: SelectType = select/cons_res SelectTypeParameters= CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE I have a node which has 64 cores: NodeName=compute-2-1 Arch=x86_64
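A setting often paired with that configuration (an assumption here, not something stated in the thread) is a per-CPU default memory allocation, so a job that omits --mem does not claim the whole node:
  SelectType=select/cons_res
  SelectTypeParameters=CR_CORE_MEMORY,CR_ONE_TASK_PER_CORE
  DefMemPerCPU=4000     # MB; illustrative value, size it to the hardware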

[slurm-dev] Re: Fully utilizing nodes

2016-08-09 Thread Andrus, Brian Contractor
016 at 11:06 AM, Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@nps.edu>> wrote: All, I am trying to figure out the bits required to allow users to use part of a node and not block others from using remaining resources. It looks like the “OverSubscribe” option

[slurm-dev] List resources used/available

2016-02-04 Thread Andrus, Brian Contractor
All, I am trying to find a way to see what resources are used/remaining on a per-node basis, in particular memory and sockets/cpus/cores/threads. I am not seeing anything in the sinfo or scontrol man pages that shows that specifically. Any insight is appreciated. Brian Andrus
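Two commands that come close, for reference (node name is an example; AllocMem is only meaningful when memory is a consumable resource):
  scontrol show node compute-2-1 | grep -E 'CPUAlloc|CPUTot|RealMemory|AllocMem'
  sinfo -N -o '%N %C %m %e'   # per node: name, CPUs A/I/O/T, total memory, free memory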

[slurm-dev] Re: distribution for array jobs

2016-01-28 Thread Andrus, Brian Contractor
index number, though there are some other creative things you can do with them. Ryan On 01/27/2016 06:47 PM, Andrus, Brian Contractor wrote: I ended up just doing ‘scancel’ on all the jobs and resubmitting them. I seem to be making progress. Now I am having trouble figuring out the –distribution opt

[slurm-dev] distribution for array jobs

2016-01-27 Thread Andrus, Brian Contractor
tmap. John DeSantis 2016-01-26 20:05 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@nps.edu>>: John, Thanks. That seemed to help; a job started on a node that had a job on it once the job that had been on it (‘using’ all the memory) completed. But now all

[slurm-dev] Update job and partition for shared jobs

2016-01-26 Thread Andrus, Brian Contractor
All, I am in the process of transitioning from Torque to Slurm. So far it is doing very well, especially handling arrays. Now I have one array job that is running across several nodes, but only using some of the node resources. I would like to have slurm start sharing the nodes so some of the
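One way to flip sharing on an existing partition without a restart (partition name is an example; the keyword is Shared in older releases and OverSubscribe from 16.05 on):
  scontrol update PartitionName=primary Shared=YES          # pre-16.05
  scontrol update PartitionName=primary OverSubscribe=YES   # 16.05 and later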

[slurm-dev] Re: Update job and partition for shared jobs

2016-01-26 Thread Andrus, Brian Contractor
per node is scheduled. HTH, John DeSantis 2016-01-26 15:20 GMT-05:00 Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@nps.edu>>: All, I am in the process of transitioning from Torque to Slurm. So far it is doing very well, especially handling arrays. Now I have one array job t

[slurm-dev] Re: Adjust an array job's maximum simultaneous running tasks

2016-01-21 Thread Andrus, Brian Contractor
Corporation<http://www.decisionsciencescorp.com/> On Wed, Jan 20, 2016 at 6:49 PM, Andrus, Brian Contractor <bdand...@nps.edu<mailto:bdand...@nps.edu>> wrote: All, Is there a way to change the maximum simultaneous running tasks of an array job that is currently running? For exa
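For what it's worth, one way to adjust this on a running array, assuming a Slurm release that exposes the throttle field to scontrol (job id and limit are examples):
  scontrol update JobId=751071 ArrayTaskThrottle=20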

[slurm-dev] Re: NodeName and PartitionName format in slurm.conf

2016-01-20 Thread Andrus, Brian Contractor
20:37 schrieb Andrus, Brian Contractor: > I am testing our slurm to replace our torque/moab setup here. > > The issue I have is to try and put all our node names in the NodeName > and PartitionName entries. > In our cluster, we name our nodes compute-- That seems to > be problem e

[slurm-dev] NodeName and PartitionName format in slurm.conf

2016-01-19 Thread Andrus, Brian Contractor
All, I am testing our slurm to replace our torque/moab setup here. The issue I have is trying to put all our node names in the NodeName and PartitionName entries. In our cluster, we name our nodes compute-- That seems to be problem enough with the ability to use ranges in slurm, but it is
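One workaround when node names carry two numeric fields (the values below are illustrative) is to expand the first field and keep a range only in the last:
  NodeName=compute-1-[1-32] CPUs=64 RealMemory=258307 State=UNKNOWN
  NodeName=compute-2-[1-32] CPUs=64 RealMemory=258307 State=UNKNOWN
  PartitionName=primary Nodes=compute-1-[1-32],compute-2-[1-32] Default=YES State=UP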