Heh. That is the ongoing "user education" problem.

You could change the amount of RAM requested using a job_submit Lua script, but that could bite those who are accurate with their requests.
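Roughly something like this (a minimal sketch only; the job_desc field names vary between Slurm releases, the 20% figure is just for illustration, and you enable the script with JobSubmitPlugins=lua in slurm.conf, with job_submit.lua next to slurm.conf -- check the job_submit/lua docs for your version):

-- job_submit.lua: minimal sketch that trims per-node memory requests by 20%.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- pn_min_memory is the requested memory per node in MB (--mem).
    -- Jobs submitted with --mem-per-cpu are encoded with a flag bit in this
    -- field; the size check below is a crude way to skip those in this sketch.
    local mem = job_desc.pn_min_memory
    if mem ~= nil and mem ~= slurm.NO_VAL64 and mem > 0 and mem < 2^40 then
        local new_mem = math.floor(mem * 0.8)
        slurm.log_info("job_submit: reducing memory request from %d to %d MB",
                       mem, new_mem)
        job_desc.pn_min_memory = new_mem
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end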

Or set a max RAM limit on the partition.
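For example (illustrative numbers only; DefMemPerCPU and MaxMemPerNode are valid per-partition options in slurm.conf):

# Illustrative: cap any single job at 256 GB per node on the short partition,
# and give jobs that do not request memory a 4 GB-per-CPU default
PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30 DefMemPerCPU=4096 MaxMemPerNode=262144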

Brian Andrus

On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:
Hello all

We have a simple single-node Slurm installation with the following hardware
configuration:

NodeName=node01 Arch=x86_64 CoresPerSocket=1
    CPUAlloc=102 CPUErr=0 CPUTot=160 CPULoad=67.09
    AvailableFeatures=(null)
    ActiveFeatures=(null)
    Gres=(null)
    NodeAddr=biotec01 NodeHostName=biotec01 Version=16.05
    OS=Linux RealMemory=1200000 AllocMem=1093632 FreeMem=36066 Sockets=160 Boards=1
    State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    BootTime=2020-04-19T17:22:31 SlurmdStartTime=2020-04-20T13:54:34
    CapWatts=n/a
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Slurm version is 16.05 (we are about to upgrade to Debian 10 and Slurm 18.08
from the repo).

Everything is working as expected, but we have the following "problem":

Users submit their jobs with sbatch but usually reserve far too much RAM,
causing other jobs to queue waiting for memory even when the actual RAM usage
is very low.

Is there a recommended solution for this problem? Is there a way to tell
Slurm to start a job, "overbooking" some RAM by, say, 20%?

Thanks for any recommendation.

slurm.conf:

ControlMachine=node01
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=cluster
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
DebugFlags=NO_CONF_HASH
NodeName=biotec01 CPUs=160 RealMemory=1200000 State=UNKNOWN
PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30
PartitionName=long Nodes=node01 MaxTime=30-00:00:00 State=UP Priority=20
PartitionName=test Nodes=node01 MaxTime=1 State=UP MaxCPUsPerNode=3 Priority=30


