Re: [slurm-users] RAM "overbooking"
Heh. That is the ongoing "user education" problem. You could change the amount of RAM requested using a job_submit lua script, but that could bite those who are accurate with their requests. Or set a maximum RAM for the partition.

Brian Andrus

On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:
> Hello all,
>
> We have a simple single-node Slurm installation with the following hardware configuration:
>
> NodeName=node01 Arch=x86_64 CoresPerSocket=1 CPUAlloc=102 CPUErr=0
> CPUTot=160 CPULoad=67.09 AvailableFeatures=(null) ActiveFeatures=(null)
> Gres=(null) NodeAddr=biotec01 NodeHostName=biotec01 Version=16.05
> OS=Linux RealMemory=120 AllocMem=1093632 FreeMem=36066 Sockets=160
> Boards=1 State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A BootTime=2020-04-19T17:22:31
> SlurmdStartTime=2020-04-20T13:54:34 CapWatts=n/a CurrentWatts=0
> LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0
> ExtSensorsTemp=n/s
>
> Slurm version is 16.05 (we are about to upgrade to Debian 10 and Slurm 18.08 from the repo).
>
> Everything is working as expected, but we have the following "problem": users submit their jobs with sbatch but usually reserve far more RAM than the job needs, leaving other jobs queued waiting for RAM even when actual RAM usage is very low.
>
> Is there a recommended solution for this problem? Is there a way to tell Slurm to start a job, "overbooking" some RAM by, say, 20%?
>
> Thanks for any recommendation.
> slurm.conf:
>
> ControlMachine=node01
> MpiDefault=none
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> AccountingStorageType=accounting_storage/slurmdbd
> ClusterName=cluster
> JobAcctGatherType=jobacct_gather/linux
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> DebugFlags=NO_CONF_HASH
> NodeName=biotec01 CPUs=160 RealMemory=120 State=UNKNOWN
> PartitionName=short Nodes=node01 Default=YES MaxTime=24:00:00 State=UP Priority=30
> PartitionName=long Nodes=node01 MaxTime=30-00:00:00 State=UP Priority=20
> PartitionName=test Nodes=node01 MaxTime=1 State=UP MaxCPUsPerNode=3 Priority=30
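For reference, a job_submit.lua along the lines Brian describes might look like the sketch below. The 96 GB cap is an arbitrary illustrative value, and the field name pn_min_memory should be checked against the job_submit/lua documentation for your Slurm release, since the exposed job_desc fields have changed over versions:

```lua
-- job_submit.lua sketch (illustrative, not production code).
-- Clamps any per-node memory request above a site-chosen cap.
-- ASSUMPTIONS: the cap value is made up; verify the job_desc field
-- names against the job_submit/lua docs for your Slurm release.

local MAX_MEM_MB = 96 * 1024  -- hypothetical per-node cap (96 GB)

function slurm_job_submit(job_desc, part_list, submit_uid)
    local mem = job_desc.pn_min_memory
    if mem ~= nil and mem > MAX_MEM_MB then
        slurm.log_info("job_submit: capping memory request of uid %u to %u MB",
                       submit_uid, MAX_MEM_MB)
        job_desc.pn_min_memory = MAX_MEM_MB
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

The simpler alternative Brian mentions, a partition-wide cap, is just MaxMemPerNode= (or MaxMemPerCPU=) on the relevant PartitionName line in slurm.conf; jobs asking for more are rejected at submission rather than silently clamped.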
[slurm-users] epilog can't get the full OUTPUT ENVIRONMENT VARIABLES from sbatch
Hi,

I used to get the full set of OUTPUT ENVIRONMENT VARIABLES from sbatch, but now I only get part of them, like below:

SLURM_NODELIST
SLURM_JOBID
SLURM_SCRIPT_CONTEXT
SLURM_UID
SLURM_CLUSTER_NAME
SLURM_JOB_USER
SLURM_JOB_ID
SLURM_CONF
SLURM_JOB_GID
SLURM_JOB_UID
SLURMD_NODENAME

I can't get SLURM_SUBMIT_HOST any more. Any ideas?

Thanks,
Fred
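One way to see exactly what the epilog receives is a throwaway debugging epilog that dumps its SLURM_* environment to a file (a sketch; the /tmp log path is an arbitrary choice). Note that prolog/epilog scripts are documented to receive only a limited subset of variables, not the full sbatch output environment, so a variable missing there may be expected behavior rather than a regression:

```shell
#!/bin/sh
# Debugging epilog sketch: record every SLURM_* variable this script
# actually receives, one file per job, for comparison with the docs.
# The /tmp path is an assumption -- use any directory writable by slurmd.
LOG="/tmp/epilog-env.${SLURM_JOB_ID:-unknown}"
env | grep '^SLURM' | sort > "$LOG"
```

Comparing that dump between the old and new cluster configurations should show whether SLURM_SUBMIT_HOST disappeared with a Slurm upgrade or a config change.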