On 09/04/2012 04:22 PM, Miguel Méndez wrote:
> Hi Lennart,
>
> I have some questions for you so I can help you:
>
> Have you tried to set DebugFlags=Priority in slurm.conf to get some more
> info about priorities on slurmctld.log?
>
> Are your priorities being recalculated every "PriorityCalcPeriod" (in
> slurm.conf as well, default is 5 min)? If not, do you have Accounting
> enabled?

Hi Miguel,

And thanks for trying to help me!

Yes, I have configured

   PriorityCalcPeriod=5

in the slurm.conf file.

I do not understand your question about whether I have Accounting enabled.
I have no such configuration variable in my slurm.conf file. I run


I have now tried your suggestion to set DebugFlags=Priority,
so now I can rewrite my question in a new way.

In slurm.conf, I have configured
PriorityMaxAge=14-0
PriorityWeightAge=20160

The plan behind this configuration is to start with an age
value of zero and get approximately one priority point added
for each minute that the job has been waiting, up to a
maximum of 20160.
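
The intent described above can be written down as a small calculation. This is only a sketch of the intended arithmetic, not SLURM's actual code: the function name weighted_age is made up for illustration, and it assumes the age factor is simply the waiting time divided by PriorityMaxAge, capped at 1.0.

```python
MAX_AGE_SECONDS = 14 * 24 * 3600   # PriorityMaxAge=14-0 (14 days)
WEIGHT_AGE = 20160                 # PriorityWeightAge; 14 days is 20160 minutes


def weighted_age(wait_seconds: float) -> float:
    """Hypothetical helper: intended age contribution for a pending job."""
    # Assumed formula: factor grows linearly with waiting time, capped at 1.0.
    factor = min(max(wait_seconds, 0.0) / MAX_AGE_SECONDS, 1.0)
    return factor * WEIGHT_AGE
```

Because 14 days is exactly 20160 minutes, this gives about one point per minute waited, and the maximum of 20160 should only be reached after the full 14 days.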

This has worked for a long time and usually still does.
But sometimes it goes seriously wrong, with a new job starting
at an age value of 20160 instead.

This can be seen with the sprio command and also with Priority
debugging on:

[2012-09-05T10:43:37] Weighted Age priority is 1.000000 * 20160 = 20160.00
[2012-09-05T10:43:37] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
[2012-09-05T10:43:37] Weighted JobSize priority is 0.001616 * 104 = 0.17
[2012-09-05T10:43:37] Weighted Partition priority is 0.000000 * 0 = 0.00
[2012-09-05T10:43:37] Weighted QOS priority is 0.000000 * 400000 = 0.00
[2012-09-05T10:43:37] Job 2182878 priority: 20160.00 + 100000.00 + 0.17 + 0.00 + 0.00 - 0 = 120160.17

The job was submitted 2012-09-05T10:42:22, so it should have a weighted
age priority of zero or one, but for some unknown reason it got the
maximum value instead.

Here is a job that behaves the normal way, as expected:
[2012-09-05T10:44:17] Weighted Age priority is 0.000000 * 20160 = 0.00
[2012-09-05T10:44:17] Weighted Fairshare priority is 6.000000 * 10000 = 60000.00
[2012-09-05T10:44:17] Weighted JobSize priority is 0.002874 * 104 = 0.30
[2012-09-05T10:44:17] Weighted Partition priority is 0.000000 * 0 = 0.00
[2012-09-05T10:44:17] Weighted QOS priority is 0.000000 * 400000 = 0.00
[2012-09-05T10:44:17] Job 2182879 priority: 0.00 + 60000.00 + 0.30 + 0.00 + 0.00 - 0 = 60000.30

This job was submitted 2012-09-05T10:44:17, so the weighted age
priority is zero, as expected.

Here is an example for a job that has waited for some time:
[2012-09-05T00:07:31] Weighted Age priority is 0.004721 * 20160 = 95.17
[2012-09-05T00:07:31] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
[2012-09-05T00:07:31] Weighted JobSize priority is 0.002874 * 104 = 0.30
[2012-09-05T00:07:31] Weighted Partition priority is 0.000000 * 0 = 0.00
[2012-09-05T00:07:31] Weighted QOS priority is 0.300000 * 400000 = 120000.00
[2012-09-05T00:07:31] Job 2178648 priority: 95.17 + 100000.00 + 0.30 + 0.00 + 120000.00 - 0 = 220095.47

Submit time was 2012-09-04T22:32:08, so the Weighted Age
priority works as intended in this case.
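
To make the comparison between the three jobs explicit, the following sketch recomputes the expected age points from the submit and log timestamps quoted above. It assumes that age is counted from the submit time and grows linearly up to PriorityMaxAge, which is a simplification and not necessarily what SLURM does internally, so small deviations from the logged values are to be expected. The helper name expected_age_points is hypothetical.

```python
from datetime import datetime

MAX_AGE_SECONDS = 14 * 24 * 3600   # PriorityMaxAge=14-0
WEIGHT_AGE = 20160                 # PriorityWeightAge
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def expected_age_points(submitted: str, logged: str) -> float:
    """Hypothetical helper: age points expected at the time of the log line."""
    waited = (datetime.strptime(logged, TIME_FORMAT)
              - datetime.strptime(submitted, TIME_FORMAT)).total_seconds()
    # Assumed formula: linear in waiting time, capped at the full weight.
    return min(max(waited, 0.0) / MAX_AGE_SECONDS, 1.0) * WEIGHT_AGE


# (job id, submit time, log time, weighted age value taken from the log)
jobs = [
    (2182878, "2012-09-05T10:42:22", "2012-09-05T10:43:37", 20160.00),
    (2182879, "2012-09-05T10:44:17", "2012-09-05T10:44:17", 0.00),
    (2178648, "2012-09-04T22:32:08", "2012-09-05T00:07:31", 95.17),
]

for job_id, submitted, logged, logged_points in jobs:
    expected = expected_age_points(submitted, logged)
    print(f"job {job_id}: expected about {expected:.2f}, logged {logged_points:.2f}")
```

Under these assumptions only job 2182878 is far off: it had waited 75 seconds when the log line was written, so roughly one point would be expected rather than 20160.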

This is version 2.4.1 of SLURM. (If someone thinks that the Fairshare
priorities are strange, do not worry. They are intended to be this
way, but that is another story.)

Full slurm.conf configuration is at the bottom of this e-mail,
with line numbers added.

Cheers,
-- Lennart Karlsson
     UPPMAX, Uppsala University, Sweden
     http://www.uppmax.uu.se

==============================================
      1  ControlMachine=kalkyl2
      2  AuthType=auth/munge
      3  CacheGroups=0
      4  CryptoType=crypto/munge
      5  EnforcePartLimits=YES
      6  Epilog=/etc/slurm/slurm.epilog
      7  JobCredentialPrivateKey=/etc/slurm/slurm.key
      8  JobCredentialPublicCertificate=/etc/slurm/slurm.cert
      9  JobRequeue=0
     10  MaxJobCount=1000000
     11  MpiDefault=none
     12  Proctracktype=proctrack/cgroup
     13  Prolog=/etc/slurm/slurm.prolog
     14  PropagateResourceLimits=RSS
     15  ReturnToService=0
     16  SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env --mpi=none -Q $SHELL"
     17  SchedulerParameters=default_queue_depth=5000,bf_window=10080,max_job_bf=5000,bf_interval=120
     18  SlurmctldPidFile=/var/run/slurmctld.pid
     19  SlurmctldPort=6817
     20  SlurmdPidFile=/var/run/slurmd.pid
     21  SlurmdPort=6818
     22  SlurmdSpoolDir=/var/spool/slurmd
     23  SlurmUser=slurm
     24  StateSaveLocation=/usr/local/slurm-state
     25  SwitchType=switch/none
     26  TaskPlugin=task/cgroup
     27  TaskProlog=/etc/slurm/slurm.taskprolog
     28  TopologyPlugin=topology/tree
     29  TmpFs=/scratch
     30  TrackWCKey=yes
     31  TreeWidth=20
     32  UsePAM=1
     33  HealthCheckInterval=1800
     34  HealthCheckProgram=/etc/slurm/slurm.healthcheck
     35  InactiveLimit=0
     36  KillWait=600
     37  MessageTimeout=60
     38  ResvOverRun=UNLIMITED
     39  MinJobAge=43200
     40  SlurmctldTimeout=300
     41  SlurmdTimeout=1200
     42  Waittime=0
     43  FastSchedule=1
     44  MaxMemPerCPU=3072
     45  SchedulerType=sched/backfill
     46  SchedulerPort=7321
     47  SelectType=select/cons_res
     48  SelectTypeParameters=CR_Core_Memory
     49  PriorityType=priority/multifactor
     50  PriorityDecayHalfLife=0
     51  PriorityCalcPeriod=5
     52  PriorityUsageResetPeriod=MONTHLY
     53  PriorityFavorSmall=NO
     54  PriorityMaxAge=14-0
     55  PriorityWeightAge=20160
     56  PriorityWeightFairshare=10000
     57  PriorityWeightJobSize=104
     58  PriorityWeightPartition=0
     59  PriorityWeightQOS=400000
     60  AccountingStorageEnforce=associations,limits,qos
     61  AccountingStorageHost=kalkyl2
     62  AccountingStoragePort=7031
     63  AccountingStorageType=accounting_storage/slurmdbd
     64  ClusterName=kalkyl
     65  DebugFlags=NO_CONF_HASH,Priority
     66  JobCompLoc=/etc/slurm/slurm_jobcomp_logger
     67  JobCompType=jobcomp/script
     68  JobAcctGatherFrequency=30
     69  JobAcctGatherType=jobacct_gather/linux
     70  SlurmctldDebug=3
     71  SlurmctldLogFile=/var/log/slurm/slurmctld.log
     72  SlurmdDebug=3
     73  SlurmdLogFile=/var/log/slurm/slurmd.log
     74  NodeName=DEFAULT Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=100000
     75
     76  NodeName=q[1-16]    RealMemory=72000 Feature=fat,mem72GB,ibsw1   Weight=3
     77  NodeName=q[17-32]   RealMemory=48000 Feature=fat,mem48GB,ibsw1   Weight=2
     78  NodeName=q[33-64]   RealMemory=24000 Feature=thin,mem24GB,ibsw2  Weight=1
     79  NodeName=q[65-96]   RealMemory=24000 Feature=thin,mem24GB,ibsw3  Weight=1
     80  NodeName=q[97-108]  RealMemory=24000 Feature=thin,mem24GB,ibsw4  Weight=1
     81  NodeName=q[109-140] RealMemory=24000 Feature=thin,mem24GB,ibsw5  Weight=1
     82  NodeName=q[141-172] RealMemory=24000 Feature=thin,mem24GB,ibsw6  Weight=1
     83  NodeName=q[173-204] RealMemory=24000 Feature=thin,mem24GB,ibsw7  Weight=1
     84  NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8  Weight=1
     85
     86  NodeName=q[217-232] RealMemory=24000 Feature=thin,mem24GB,ibsw4  Weight=1
     87
     88  NodeName=q[233-252] RealMemory=24000 Feature=thin,mem24GB,ibsw8  Weight=1
     89  NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9  Weight=1
     90  NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10 Weight=1
     91  NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11 Weight=1
     92
     93  PartitionName=all Nodes=q[1-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=DOWN
     94  PartitionName=core Nodes=q[45-348] Default=YES Shared=NO MaxTime=14400 MaxNodes=1 State=UP
     95  PartitionName=node Nodes=q[1-32,45-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=UP
     96  PartitionName=devel Nodes=q[33-44] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=60 MaxNodes=4 State=UP
