Merlin, thanks for the insight.

I set:
#SBATCH --mem=1G

That is all I needed to get it to share.
How can I make the default allocation use only the memory attached to the socket the job is running on, i.e. set the default memory per job to that value (64 GB in my case)?

I think I've done it (sort of) with:
PartitionName=normal Nodes=d0[1,2] Default=YES OverSubscribe=FORCE:2 SelectTypeParameters=CR_Socket_Memory QoS=part_shared MaxCPUsPerNode=28 DefMemPerCPU=4590 MaxMemPerCPU=4590 MaxTime=48:00:00 State=UP
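
(If my arithmetic is right, 14 cores × 4590 MB ≈ 64 GB, so a job bound to one socket's 14 cores is limited to roughly half the node's memory by default, and 28 cores × 4590 MB roughly matches the node's usable RealMemory.)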

On 03/22/2017 12:04 PM, Merlin Hartley wrote:
Hi Cyrus

I think you should specify the memory requirements in your sbatch script - by default a job is allocated all of a node's memory, thus ‘filling’ the node even with a 1-CPU job.
#SBATCH --mem 1G
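
Something along these lines, for example (the program at the end is just a placeholder):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G               # request only what the job actually needs
srun ./my_program              # placeholder for your real workload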

Hope this helps!


Merlin
--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

On 22 Mar 2017, at 16:20, Cyrus Proctor <cproc...@tacc.utexas.edu> wrote:


Hi all,

Any thoughts at all on this would be most helpful. I'm not sure where to go from here to get overcommitted nodes working properly.

Thank you,
Cyrus

On 03/17/2017 11:39 AM, Cyrus Proctor wrote:
Hello,

I currently have a small cluster for testing. Each compute node has 2 sockets with 14 cores per socket (28 cores total) and 128 GB of RAM. I would like to set up Slurm so that two jobs can share one compute node simultaneously, effectively giving each job one socket (with binding) and half the total memory.
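
For reference, my understanding of the relevant pieces is roughly the following (a sketch rather than the attached slurm.conf verbatim; the RealMemory value here is approximate):

SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
NodeName=d0[1,2] Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 RealMemory=128000
PartitionName=normal Nodes=d0[1,2] Default=YES OverSubscribe=FORCE:2 MaxTime=48:00:00 State=UP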

I've tried several iterations of settings, to no avail. Whatever I try, I am still only able to run one job per node (the second job stays pending with a "Resources" reason). I am running Slurm 17.02.1-2, and I am attaching my slurm.conf and cgroup.conf files. System information:
# uname -r
3.10.0-514.10.2.el7.x86_64
# cat /etc/centos-release
CentOS Linux release 7.3.1611 (Core)

I am also attaching logs for slurmd (slurmd.d01.log) and slurmctld (slurmctld.log) as I submit three jobs (batch.slurm) in rapid succession. With two compute nodes available, I would hope that all three start together. Instead, two begin and one waits until a node becomes idle to start.

There is likely some extra "crud" in the config files left over from prior failed attempts. I'm happy to remove or reconfigure as necessary, but I'm not sure what the right combination of settings is to make this work. I'm hoping that's where you all can help.

Thanks,
Cyrus

