Merlin, thanks for the insight.
I set:
#SBATCH --mem=1G
That is all I needed to get it to share.
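For anyone landing on this thread later, the whole job script only needs that one extra directive; a minimal sketch (the job name and command are placeholders, not from the thread):

```shell
#!/bin/bash
#SBATCH --job-name=share-test   # placeholder name
#SBATCH --ntasks=1
#SBATCH --mem=1G                # explicit memory request so a second job can fit on the node
srun hostname
```

Without an explicit `--mem`, the job inherits the partition default, which is what was filling the node.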
How can I make the default allocation use only the memory attached to
the socket a job runs on, with the default memory request set to that
amount (64 GB in my case)?
I think I've done it (sort of) with:
PartitionName=normal Nodes=d0[1,2] Default=YES OverSubscribe=FORCE:2 SelectTypeParameters=CR_Socket_Memory QoS=part_shared MaxCPUsPerNode=28 DefMemPerCPU=4590 MaxMemPerCPU=4590 MaxTime=48:00:00 State=UP
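As a quick sanity check on the 4590 MB figure (my assumption: the nodes report a RealMemory slightly under 128 GiB, so the per-CPU default is simply that total divided by the 28 cores):

```shell
# With 28 usable cores per node, DefMemPerCPU=4590 caps a
# full-node allocation at 4590 MB * 28 cores.
total_mb=$((4590 * 28))
echo "${total_mb} MB"    # 128520 MB, just under 128 GiB (131072 MiB)
```

Half of that (one 14-core socket) comes to 64260 MB, i.e. roughly the 64 GB per socket described above.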
On 03/22/2017 12:04 PM, Merlin Hartley wrote:
Hi Cyrus
I think you should specify the memory requirement in your sbatch
script - the default is to allocate all of a node's memory, thus
‘filling’ it even with a single-CPU job.
#SBATCH --mem 1G
Hope this helps!
Merlin
--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom
On 22 Mar 2017, at 16:20, Cyrus Proctor <cproc...@tacc.utexas.edu> wrote:
Hi all,
Any thoughts at all on this would be most helpful. I'm not sure where
to go from here to get overcommitted nodes working properly.
Thank you,
Cyrus
On 03/17/2017 11:39 AM, Cyrus Proctor wrote:
Hello,
I currently have a small cluster for testing. Each compute node
contains two sockets with 14 cores per socket and 128 GB of RAM in
total. I would like to set up Slurm so that two jobs can share one
compute node simultaneously, effectively giving each job one socket
(with binding) and half the total memory.
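On the job side, the request that matches this layout would look something like the sketch below (the binding flag and the 64G figure are my assumptions based on the hardware described above, not taken from the attached files):

```shell
#!/bin/bash
# Sketch: ask for one 14-core socket's worth of CPUs and half the
# node's memory, so two such jobs can share a node.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14   # one socket on a 2 x 14-core node
#SBATCH --mem=64G            # half of 128 GB
srun --cpu-bind=sockets ./my_app   # my_app is a placeholder binary
```

For this to pack two jobs per node, the select plugin also has to account at socket granularity (e.g. the CR_Socket_Memory setting discussed later in the thread).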
I've tried several iterations of settings, to no avail. Whatever I
try, only one job is allowed to run per node (the second is held with
a "Resources" pending reason). I am running Slurm 17.02.1-2, and I am
attaching my slurm.conf and cgroup.conf files. System information:
# uname -r
3.10.0-514.10.2.el7.x86_64
# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)
I am also attaching logs for slurmd (slurmd.d01.log) and slurmctld
(slurmctld.log) as I submit three jobs (batch.slurm) in rapid
succession. With two compute nodes available, I would hope that all
three start together. Instead, two begin and one waits until a node
becomes idle to start.
There is likely extra "crud" in the config files left over from prior
failed attempts. I'm happy to take things out or reconfigure as
necessary, but I'm not sure what the right combination of settings is
to get this to work. I'm hoping that's where you all can help.
Thanks,
Cyrus