[slurm-dev] MaxMemPerCPU seems not working for me...

2017-02-01 Thread Julien Collas
Hi, it seems that my MaxMemPerCpu is not working as I would have expected (increase the CPU count if mem or mem-per-cpu exceeds that limit). Here is my partition definition:

$ scontrol show part short
PartitionName=short AllowGroups=ALL DenyAccounts=data AllowQos=ALL AllocNodes=ALL Default=YES QoS=N/ ...
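For reference, a minimal sketch of the behavior being described (node names and limits are hypothetical, not from the original post; note that MaxMemPerCPU is only enforced when memory is a consumable resource, e.g. SelectTypeParameters=CR_Core_Memory):

    # slurm.conf (excerpt) -- memory must be tracked for the limit to apply
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory
    PartitionName=short Nodes=node[01-10] Default=YES MaxMemPerCPU=2000 State=UP

    # With MaxMemPerCPU=2000, a request for 8000 MB per CPU should be
    # rewritten by slurmctld to 4 CPUs at 2000 MB each:
    $ sbatch --mem-per-cpu=8000 --cpus-per-task=1 --wrap "sleep 60"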

[slurm-dev] Problems with suspend/resume

2017-02-01 Thread Claudio
Hi all; I set up a Slurm-based cluster to study scheduling algorithms. This cluster has 10 nodes and 2 CPUs per node. I compiled Slurm with the "--enable-multiple-slurmd" option and I configured it to act as a 3600-node cluster with 16 CPUs per node. Since I am only submitting sleep jobs, this co ...
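For context, a minimal sketch of such an emulated cluster (node names, port numbers, and ranges below are assumptions for illustration):

    $ ./configure --enable-multiple-slurmd && make && make install

    # slurm.conf (excerpt): 3600 virtual nodes, each slurmd instance
    # listening on its own port on the hosting machine
    NodeName=vn[0001-3600] NodeHostname=localhost CPUs=16 Port=17001-20600
    PartitionName=sim Nodes=vn[0001-3600] Default=YES State=UP

    # each emulated daemon is started with an explicit node name
    $ slurmd -N vn0001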

[slurm-dev] Re: Setting a partition QOS, etc

2017-02-01 Thread Thomas M. Payerle
If you decide to go with the single-partition model, you can use the "Weight" parameter in slurm.conf to cause the standard nodes to be preferentially used over the high-mem and GPU nodes. So jobs only end up on high-mem or GPU nodes if they requested a lot of memory or a GPU, or if the cluster is ver ...
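As an illustration only (node names and values are hypothetical), Slurm allocates nodes with the lowest Weight first:

    # slurm.conf (excerpt): standard nodes fill up before high-mem/GPU nodes
    NodeName=std[01-20]   CPUs=16 RealMemory=64000            Weight=1
    NodeName=himem[01-02] CPUs=16 RealMemory=512000           Weight=10
    NodeName=gpu[01-04]   CPUs=16 RealMemory=64000 Gres=gpu:2 Weight=10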

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread E V
Ah, success. It was gres-related. I verified the slurm.conf files are the same, but I never verified the gres.conf. It looks like our production gres.conf had been copied to the backup controller, which had the same gres names but different hosts associated with them. Fixing that and restarting slurmd ...
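One way to avoid this kind of drift (a sketch; host and device names are hypothetical) is to keep a single gres.conf with explicit NodeName lines, so the same file is valid on every controller and compute node:

    # gres.conf (excerpt): per-host device mappings in one shared file
    NodeName=gpu01 Name=gpu File=/dev/nvidia[0-1]
    NodeName=gpu02 Name=gpu File=/dev/nvidia[0-3]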

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread E V
Yes, head node & backup head sync to the same NTP server. Verifying by hand, they seem to be within 1 sec of each other. Here's the node info it finds as it starts up in slurmd.log:

[2017-01-31T15:31:59.711] CPUs=24 Boards=1 Sockets=2 Cores=6 Threads=2 Memory=48388 TmpDisk=508671 Uptime=1147426 CP ...
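A quick way to compare what slurmd detects against what slurm.conf declares is slurmd's -C flag, which prints the node's actual hardware in slurm.conf syntax (the output below is illustrative, not from this thread):

    $ slurmd -C
    NodeName=node01 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=48388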

[slurm-dev] Re: Setting a partition QOS, etc

2017-02-01 Thread Loris Bennett
Hi David,

Baker D.J. writes:

> Hello,
>
> This is hopefully a very simple set of questions for someone. I'm evaluating
> slurm with a view to replacing our existing torque/moab system, and I've been
> reading about defining partitions and QoSs. I like the idea of being able to
> use a QoS to ...

[slurm-dev] Re: Node switching to DRAIN for unknown reason, trouble shooting ideas?

2017-02-01 Thread Paddy Doyle
Similar to Lachlan's suggestions: check that the slurm.conf is the same on all nodes, and in particular that the number of CPUs and cores is correct. Have you tried removing the Gres parameters? Perhaps it's looking for devices it can't find.

Paddy

On Tue, Jan 31, 2017 at 02:08:51PM -0800, Lac ...
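A simple consistency check along those lines (clush is an assumption here; any parallel shell or a plain ssh loop does the same job):

    # checksums should match on the controller, backup, and every compute node
    $ clush -w head,backup,node[01-10] md5sum /etc/slurm/slurm.conf /etc/slurm/gres.conf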

[slurm-dev] Setting a partition QOS, etc

2017-02-01 Thread Baker D . J .
Hello,

This is hopefully a very simple set of questions for someone. I'm evaluating slurm with a view to replacing our existing torque/moab system, and I've been reading about defining partitions and QoSs. I like the idea of being able to use a QoS to throttle user activity -- for example to se ...
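For readers following along, a throttling QOS is typically created with sacctmgr and then attached to a partition (the names and limits below are hypothetical, and per-user limits require accounting via slurmdbd):

    # create a QOS and cap per-user usage
    $ sacctmgr add qos throttled
    $ sacctmgr modify qos throttled set MaxTRESPerUser=cpu=128 MaxJobsPerUser=50

    # attach it as the partition QOS in slurm.conf
    PartitionName=batch Nodes=node[01-20] QOS=throttled State=UP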