Re: [slurm-users] RAM "overbooking"

2020-05-27 Thread Brian Andrus

Heh. That is the ongoing "user education" problem.

You could change the amount of RAM requested using a job_submit lua
script, but that could bite those who are accurate with their requests.
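
A minimal sketch of such a job_submit.lua, assuming a hypothetical 64 GB per-node cap; the pn_min_memory field and the slurm.* names follow the example script shipped with recent Slurm releases, so double-check them against 16.05 before relying on this:

    -- job_submit.lua (sketch): cap oversized --mem requests at submit time.
    -- The 64 GB cap is a made-up number; pick whatever fits the node.
    local MAX_MEM_MB = 64 * 1024

    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- pn_min_memory carries the per-node memory request in MB and is
        -- NO_VAL64 when the user did not ask for memory explicitly.
        -- Caveat: --mem-per-cpu requests encode a flag in the high bits of
        -- this field, so a production script must detect and skip (or
        -- convert) those separately.
        if job_desc.pn_min_memory ~= nil and
           job_desc.pn_min_memory ~= slurm.NO_VAL64 and
           job_desc.pn_min_memory > MAX_MEM_MB then
            slurm.log_info("job_submit: capping memory request of uid %d at %d MB",
                           submit_uid, MAX_MEM_MB)
            job_desc.pn_min_memory = MAX_MEM_MB
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end

Enable it with JobSubmitPlugins=lua in slurm.conf and drop the script next to slurm.conf as job_submit.lua.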


Or set a max RAM limit on the partition.
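
In slurm.conf that can be done with the MaxMemPerNode (or MaxMemPerCPU) partition parameter; the 65536 MB value below is only a placeholder:

    PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30 MaxMemPerNode=65536

Since memory is already a consumable resource on this cluster (SelectTypeParameters=CR_Core_Memory), that caps the memory any single job can request per node in the partition.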

Brian Andrus

On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:

Hello all

We have a simple single-node Slurm installation with the following hardware
configuration:

NodeName=node01 Arch=x86_64 CoresPerSocket=1
CPUAlloc=102 CPUErr=0 CPUTot=160 CPULoad=67.09
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=biotec01 NodeHostName=biotec01 Version=16.05
OS=Linux RealMemory=120 AllocMem=1093632 FreeMem=36066 Sockets=160 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2020-04-19T17:22:31 SlurmdStartTime=2020-04-20T13:54:34
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Slurm version is 16.05 (we are about to upgrade to Debian 10 and Slurm 18.08
from the repo).

Everything is working as expected, but we have the following
"problem":

Users submit their jobs with sbatch but usually reserve far more RAM than the
job needs, causing other jobs to sit queued waiting for RAM even when the
actual RAM usage is very low.

Is there a recommended solution for this problem? Is there a way to tell
Slurm to start a job, "overbooking" RAM by, say, 20%?

Thanks for any recommendation.

slurm.conf:

ControlMachine=node01
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=cluster
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
DebugFlags=NO_CONF_HASH
NodeName=biotec01 CPUs=160 RealMemory=120 State=UNKNOWN
PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30
PartitionName=long Nodes=node01 MaxTime=30-00:00:00 State=UP Priority=20
PartitionName=test Nodes=node01 MaxTime=1 State=UP MaxCPUsPerNode=3 Priority=30






[slurm-users] RAM "overbooking"

2020-05-27 Thread Marcelo Z. Silva


Hello all

We have a simple single-node Slurm installation with the following hardware
configuration:

NodeName=node01 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=102 CPUErr=0 CPUTot=160 CPULoad=67.09
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=biotec01 NodeHostName=biotec01 Version=16.05
   OS=Linux RealMemory=120 AllocMem=1093632 FreeMem=36066 Sockets=160 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2020-04-19T17:22:31 SlurmdStartTime=2020-04-20T13:54:34
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Slurm version is 16.05 (we are about to upgrade to Debian 10 and Slurm 18.08
from the repo).

Everything is working as expected, but we have the following
"problem":

Users submit their jobs with sbatch but usually reserve far more RAM than the
job needs, causing other jobs to sit queued waiting for RAM even when the
actual RAM usage is very low.

Is there a recommended solution for this problem? Is there a way to tell
Slurm to start a job, "overbooking" RAM by, say, 20%?

Thanks for any recommendation.

slurm.conf:

ControlMachine=node01
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=cluster
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
DebugFlags=NO_CONF_HASH
NodeName=biotec01 CPUs=160 RealMemory=120 State=UNKNOWN
PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30
PartitionName=long Nodes=node01 MaxTime=30-00:00:00 State=UP Priority=20
PartitionName=test Nodes=node01 MaxTime=1 State=UP MaxCPUsPerNode=3 Priority=30




[slurm-users] epilog can't get the full OUTPUT ENVIRONMENT VARIABLES from sbatch

2020-05-27 Thread Fred Liu
Hi,

I used to get the full set of OUTPUT ENVIRONMENT VARIABLES from sbatch.
But now I only get part of them, like below:
SLURM_NODELIST SLURM_JOBID SLURM_SCRIPT_CONTEXT SLURM_UID SLURM_CLUSTER_NAME 
SLURM_JOB_USER SLURM_JOB_ID SLURM_CONF SLURM_JOB_GID SLURM_JOB_UID 
SLURMD_NODENAME

I can't get SLURM_SUBMIT_HOST any more.

Any ideas?

Thanks.

Fred