When running Slurm 20.02 with select/cons_tres and DefMemPerGPU defined, a job that requests GPUs without specifying --mem prevents any other job from running on the same node. I can see that the correct amount of memory is being allocated for the job based on the GPUs requested, but no other jobs will run on that node. If a value for --mem is given explicitly, other jobs will share the node. Is this the expected behavior? I understand that a job which does not request memory is assumed to need the whole node, but here the job is requesting GPUs and a default memory is set with DefMemPerGPU, and that default does not seem to be taken into account. Let me know if there is a reason for this behavior, or if there is another way to set a default job memory.
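For example, the difference between the two cases looks roughly like this (a sketch, not my exact scripts; the script name, GPU count, and memory value are placeholders):

# GPUs requested, no --mem: only DefMemPerGPU is applied, and the node
# will not accept any other job
sbatch -p p100 --gres=gpu:2 job.sh

# Same request with an explicit --mem: other jobs then share the node
sbatch -p p100 --gres=gpu:2 --mem=250000 job.sh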

Config:
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK
PartitionName=p100 Nodes=ucs480 OverSubscribe=FORCE:4 DefCpuPerGPU=20 DefMemPerGPU=125000 Default=YES MaxTime=INFINITE State=UP
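The state below came from two submissions along these lines (again a sketch; test-s5.sh and test-s6.sh are placeholders matching the job names shown, and gpu:2 assumes the node exposes four GPUs, consistent with DefCpuPerGPU=20 on an 80-core node):

sbatch -p p100 --gres=gpu:2 test-s5.sh
sbatch -p p100 --gres=gpu:2 test-s6.sh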

Node and job state after two jobs are submitted, each requesting half of the GPUs (no --mem specified):

   CfgTRES=cpu=80,mem=500000M,billing=80
   AllocTRES=cpu=40,mem=250000M

Job state:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
872      p100  test-s6 wayne.he PD       0:00      1 (Resources) 
871      p100  test-s5 wayne.he  R       0:03      1 ucs480
