Hello,

From the man page:

       MaxMemPerCPU
              ...
              NOTE: If a job specifies a memory per CPU limit that
              exceeds this system limit, that job’s count of CPUs per
              task will automatically be increased. This may result in
              the job failing due to CPU count limits.
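
My reading of that note: with MaxMemPerCPU=200 on the partition (the
test setup quoted below), a plain --mem=600 request should be bumped to
ceil(600 / 200) = 3 CPUs, since fewer CPUs could not cover 600 MB.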


Here is what I get with version *15.08*:

# srun --version
slurm 15.08.13
# srun --mem 600 sleep 5 && scontrol show job
JobId=15 JobName=sleep
   UserId=root(0) GroupId=root(0)
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:05 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2017-02-14T14:39:54 EligibleTime=2017-02-14T14:39:54
   StartTime=2017-02-14T14:39:54 EndTime=2017-02-14T14:39:59
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=short AllocNode:Sid=dhcpvm4-174:5130
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=dhcpvm4-191
   BatchHost=dhcpvm4-191
   NumNodes=1 *NumCPUs=3* CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3,*mem=600*,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=3 MinMemoryNode=600M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=sleep
   WorkDir=/root
   Power= SICP=0
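
With the MaxMemPerCPU=200 partition quoted below, that matches the
documented behaviour: ceil(600 / 200) = 3 CPUs for the 600 MB request,
hence NumCPUs=3 and mem=600 in TRES. The same check, filtered down to
just the relevant fields:

# srun --mem 600 sleep 5 && scontrol show job | grep -E 'NumCPUs|TRES'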


Julien

2017-02-14 16:14 GMT+01:00 E V <eliven...@gmail.com>:

>
> Are you sure it worked as you expected? I always think of CPUs & RAM as
> independent things that need to be requested independently.
>
> On Tue, Feb 14, 2017 at 10:07 AM, Julien Collas <jul.col...@gmail.com>
> wrote:
> > Hello,
> >
> > I ran some tests in a simple environment, and this functionality seems
> > to work fine up to and including 15.08.13. With versions 16.05.6,
> > 16.05.9, and 17.02.0-0rc1, I do not see what I would expect to see.
> >
> > # scontrol show part
> > PartitionName=short
> >    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> >    AllocNodes=ALL Default=YES QoS=N/A
> >    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
> >    MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
> >    Nodes=dhcpvm4-191
> >    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
> >    OverTimeLimit=NONE PreemptMode=OFF
> >    State=UP TotalCPUs=8 TotalNodes=1 SelectTypeParameters=NONE
> >    DefMemPerCPU=200 MaxMemPerCPU=200
> >
> > # srun --mem 600 sleep 5 && scontrol show job
> > JobId=25 JobName=sleep
> >    UserId=root(0) GroupId=root(0) MCS_label=N/A
> >    Priority=4294901754 Nice=0 Account=(null) QOS=(null)
> >    JobState=COMPLETED Reason=None Dependency=(null)
> >    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
> >    RunTime=00:00:05 TimeLimit=UNLIMITED TimeMin=N/A
> >    SubmitTime=2017-02-14T15:38:41 EligibleTime=2017-02-14T15:38:41
> >    StartTime=2017-02-14T15:38:41 EndTime=2017-02-14T15:38:46 Deadline=N/A
> >    PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >    Partition=short AllocNode:Sid=dhcpvm4-174:5130
> >    ReqNodeList=(null) ExcNodeList=(null)
> >    NodeList=dhcpvm4-191
> >    BatchHost=dhcpvm4-191
> >    NumNodes=1 NumCPUs=1 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
> >    TRES=cpu=1,mem=600M,node=1
> >    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >    MinCPUsNode=1 MinMemoryNode=600M MinTmpDiskNode=0
> >    Features=(null) DelayBoot=00:00:00
> >    Gres=(null) Reservation=(null)
> >    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >    Command=sleep
> >    WorkDir=/root
> >    Power=
> >
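> > Note that NumCPUs stays at 1 here even though mem=600M is three times
> > the partition's MaxMemPerCPU=200, so the automatic CPU increase
> > described in the man page no longer kicks in.
> >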
> > That doesn't help much with my problem, but ...
> >
> > Best regards,
> >
> > Julien
> >
> > 2017-02-02 8:53 GMT+01:00 Julien Collas <jul.col...@gmail.com>:
> >>
> >> Hi,
> >>
> >> It seems that MaxMemPerCPU is not working as I would have expected
> >> (i.e., increasing the CPU count when --mem or --mem-per-cpu exceeds
> >> that limit).
> >>
> >> Here is my partition definition
> >>
> >> $ scontrol show part short
> >> PartitionName=short
> >>    AllowGroups=ALL DenyAccounts=data AllowQos=ALL
> >>    AllocNodes=ALL Default=YES QoS=N/A
> >>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
> >>    MaxNodes=UNLIMITED MaxTime=00:30:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
> >>    Nodes=srv0029[73-80,87-95,98-99]
> >>    PriorityJobFactor=1000 PriorityTier=1000 RootOnly=NO ReqResv=NO OverSubscribe=NO PreemptMode=OFF
> >>    State=UP TotalCPUs=560 TotalNodes=28 SelectTypeParameters=NONE
> >>    DefMemPerCPU=19000 MaxMemPerCPU=19000
> >>
> >>
> >> $ srun --partition=short --mem=40000 sleep 10
> >> $ srun --partition=short --mem-per-cpu=40000 sleep 10
> >> $ sacct
> >>        JobID      User    Account    JobName   Priority   NTasks  AllocCPUS  ReqCPUS     MaxRSS  MaxVMSize     ReqMem      State
> >> ------------ --------- ---------- ---------- ---------- -------- ---------- -------- ---------- ---------- ---------- ----------
> >>     19522383   jcollas      admin      sleep        994        1          1        1        92K    203980K    40000Mn  COMPLETED
> >>     19522384   jcollas      admin      sleep        994        1          1        1        92K    203980K    40000Mc  COMPLETED
> >>
> >> For these 2 jobs, I would have expected AllocCPUS to be 3.
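> >> (40000 MB requested against MaxMemPerCPU=19000 gives 40000 / 19000 ≈ 2.1,
> >> which should round up to 3 CPUs.)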
> >>
> >> $ scontrol show conf
> >> ...
> >> DefMemPerNode               = UNLIMITED
> >> MaxMemPerNode               = UNLIMITED
> >> MemLimitEnforce             = Yes
> >> SelectTypeParameters        = CR_CPU_MEMORY
> >> ...
> >> AccountingStorageBackupHost = (null)
> >> AccountingStorageEnforce    = associations,limits
> >> AccountingStorageHost       = stor089
> >> AccountingStorageLoc        = N/A
> >> AccountingStoragePort       = 6819
> >> AccountingStorageTRES       = cpu,mem,energy,node
> >> AccountingStorageType       = accounting_storage/slurmdbd
> >> AccountingStorageUser       = N/A
> >> AccountingStoreJobComment   = Yes
> >> AcctGatherEnergyType        = acct_gather_energy/none
> >> AcctGatherFilesystemType    = acct_gather_filesystem/none
> >> AcctGatherInfinibandType    = acct_gather_infiniband/none
> >> AcctGatherNodeFreq          = 0 sec
> >> AcctGatherProfileType       = acct_gather_profile/none
> >> JobAcctGatherFrequency      = 10
> >> JobAcctGatherType           = jobacct_gather/cgroup
> >> JobAcctGatherParams         = (null)
> >> ...
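> >>
> >> (For reference: the cluster-wide limits above come from scontrol show
> >> config; the per-partition DefMemPerCPU/MaxMemPerCPU values can be
> >> double-checked with scontrol show partition.)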
> >>
> >> We are currently running version 16.05.6.
> >>
> >> Is there something I am missing?
> >>
> >>
> >> Regards,
> >>
> >> Julien
> >
> >
>
