Sorry, that link points to the latest version of the code. It is slightly
different from the one I sent you.

On Wed, Sep 5, 2012 at 1:37 PM, Miguel Méndez <[email protected]> wrote:

>  Hi Lennart,
>
> Don't worry about the Accounting question; I did not know what you now tell me:
>
>
>> This has worked for a long time and usually still does.
>>
>> But sometimes it goes seriously wrong, with a new job starting
>> at an age value of 20160 instead.
>>
>
> So if it usually works, all I can think of right now is this part of the
> multifactor plugin's code (for 2.4.1):
>
> if (weight_age) {
>     uint32_t diff;
>
>     if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS)
>         diff = start_time - job_ptr->details->submit_time;
>     else
>         diff = start_time - job_ptr->details->begin_time;
>
>     if (job_ptr->details->begin_time) {
>         if (diff < max_age) {
>             job_ptr->prio_factors->priority_age =
>                 (double)diff / (double)max_age;
>         } else
>             job_ptr->prio_factors->priority_age = 1.0;
>     } else if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS) {
>         if (diff < max_age) {
>             job_ptr->prio_factors->priority_age =
>                 (double)diff / (double)max_age;
>         } else
>             job_ptr->prio_factors->priority_age = 1.0;
>     }
> }
>
> ( You can find it here:
> https://github.com/chaos/slurm/blob/master/src/plugins/priority/multifactor/priority_multifactor.c#L458
>  )
>
> You may be hitting one of those "job_ptr->prio_factors->priority_age =
> 1.0;" assignments for some reason. Maybe, as you suggest, there is a problem
> with one of those times ("begin_time", perhaps). I would check the times
> you are getting in the job details.
>
> Regards,
>
> Miguel
>
>
>
> On Wed, Sep 5, 2012 at 12:07 PM, Lennart Karlsson <
> [email protected]> wrote:
>
>>
>> On 09/04/2012 04:22 PM, Miguel Méndez wrote:
>> > Hi Lennart,
>> >
>> > I have some questions for you so I can help you:
>> >
>> > Have you tried to set DebugFlags=Priority in slurm.conf to get some more
>> > info about priorities on slurmctld.log?
>> >
>> > Are your priorities being recalculated every "PriorityCalcPeriod" (in
>> > slurm.conf as well, default is 5 min)? If not, do you have Accounting
>> > enabled?
>>
>> Hi Miguel,
>>
>> And thanks for trying to help me!
>>
>> Yes, I have configured
>>
>>    PriorityCalcPeriod=5
>>
>> in the slurm.conf file.
>>
>> I do not understand your question about whether I have Accounting enabled.
>> I have no such configuration variable in my slurm.conf file. I run
>>
>>
>> I have now tried your suggestion to set DebugFlags=Priority,
>> so now I can rewrite my question in a new way.
>>
>> In slurm.conf, I have configured
>> PriorityMaxAge=14-0
>> PriorityWeightAge=20160
>>
>> The plan behind this configuration is to start with an age
>> value of zero and get approximately one priority point added
>> for each minute that the job has been waiting, up to a
>> maximum of 20160.
>>
>> This has worked for a long time and usually still does.
>> But sometimes it goes seriously wrong, with a new job starting
>> at an age value of 20160 instead.
>>
>> This can be seen with the sprio command and also with Priority
>> debugging on:
>>
>> [2012-09-05T10:43:37] Weighted Age priority is 1.000000 * 20160 = 20160.00
>> [2012-09-05T10:43:37] Weighted Fairshare priority is 10.000000 * 10000 =
>> 100000.00
>> [2012-09-05T10:43:37] Weighted JobSize priority is 0.001616 * 104 = 0.17
>> [2012-09-05T10:43:37] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T10:43:37] Weighted QOS priority is 0.000000 * 400000 = 0.00
>> [2012-09-05T10:43:37] Job 2182878 priority: 20160.00 + 100000.00 + 0.17 +
>> 0.00 + 0.00 - 0 = 120160.17
>>
>> The job was submitted 2012-09-05T10:42:22, so it should have a weighted
>> age priority of zero or one, but for some unknown reason it got the
>> maximum value instead.
>>
>> Here is a job that behaves the normal way, as expected:
>> [2012-09-05T10:44:17] Weighted Age priority is 0.000000 * 20160 = 0.00
>> [2012-09-05T10:44:17] Weighted Fairshare priority is 6.000000 * 10000 =
>> 60000.00
>> [2012-09-05T10:44:17] Weighted JobSize priority is 0.002874 * 104 = 0.30
>> [2012-09-05T10:44:17] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T10:44:17] Weighted QOS priority is 0.000000 * 400000 = 0.00
>> [2012-09-05T10:44:17] Job 2182879 priority: 0.00 + 60000.00 + 0.30 + 0.00
>> + 0.00 - 0 = 60000.30
>>
>> This job was submitted 2012-09-05T10:44:17, so the weighted age
>> priority is zero, as expected.
>>
>> Here is an example of a job that has waited for some time:
>> [2012-09-05T00:07:31] Weighted Age priority is 0.004721 * 20160 = 95.17
>> [2012-09-05T00:07:31] Weighted Fairshare priority is 10.000000 * 10000 =
>> 100000.00
>> [2012-09-05T00:07:31] Weighted JobSize priority is 0.002874 * 104 = 0.30
>> [2012-09-05T00:07:31] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T00:07:31] Weighted QOS priority is 0.300000 * 400000 =
>> 120000.00
>> [2012-09-05T00:07:31] Job 2178648 priority: 95.17 + 100000.00 + 0.30 +
>> 0.00 + 120000.00 - 0 = 220095.47
>>
>> Submit time was 2012-09-04T22:32:08, so the Weighted Age
>> priority works as intended in this case.
>>
>> This is version 2.4.1 of SLURM. (If someone thinks that the Fairshare
>> priorities are strange, do not worry. They are intended to be this
>> way, but that is another story.)
>>
>> Full slurm.conf configuration is at the bottom of this e-mail,
>> with line numbers added.
>>
>> Cheers,
>> -- Lennart Karlsson
>>      UPPMAX, Uppsala University, Sweden
>>      http://www.uppmax.uu.se
>>
>> ==============================================
>>       1  ControlMachine=kalkyl2
>>       2  AuthType=auth/munge
>>       3  CacheGroups=0
>>       4  CryptoType=crypto/munge
>>       5  EnforcePartLimits=YES
>>       6  Epilog=/etc/slurm/slurm.epilog
>>       7  JobCredentialPrivateKey=/etc/slurm/slurm.key
>>       8  JobCredentialPublicCertificate=/etc/slurm/slurm.cert
>>       9  JobRequeue=0
>>      10  MaxJobCount=1000000
>>      11  MpiDefault=none
>>      12  Proctracktype=proctrack/cgroup
>>      13  Prolog=/etc/slurm/slurm.prolog
>>      14  PropagateResourceLimits=RSS
>>      15  ReturnToService=0
>>      16  SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env
>> --mpi=none -Q $SHELL"
>>      17  SchedulerParameters=default_queue_depth=5000,bf_window=10080,max_job_bf=5000,bf_interval=120
>>      18  SlurmctldPidFile=/var/run/slurmctld.pid
>>      19  SlurmctldPort=6817
>>      20  SlurmdPidFile=/var/run/slurmd.pid
>>      21  SlurmdPort=6818
>>      22  SlurmdSpoolDir=/var/spool/slurmd
>>      23  SlurmUser=slurm
>>      24  StateSaveLocation=/usr/local/slurm-state
>>      25  SwitchType=switch/none
>>      26  TaskPlugin=task/cgroup
>>      27  TaskProlog=/etc/slurm/slurm.taskprolog
>>      28  TopologyPlugin=topology/tree
>>      29  TmpFs=/scratch
>>      30  TrackWCKey=yes
>>      31  TreeWidth=20
>>      32  UsePAM=1
>>      33  HealthCheckInterval=1800
>>      34  HealthCheckProgram=/etc/slurm/slurm.healthcheck
>>      35  InactiveLimit=0
>>      36  KillWait=600
>>      37  MessageTimeout=60
>>      38  ResvOverRun=UNLIMITED
>>      39  MinJobAge=43200
>>      40  SlurmctldTimeout=300
>>      41  SlurmdTimeout=1200
>>      42  Waittime=0
>>      43  FastSchedule=1
>>      44  MaxMemPerCPU=3072
>>      45  SchedulerType=sched/backfill
>>      46  SchedulerPort=7321
>>      47  SelectType=select/cons_res
>>      48  SelectTypeParameters=CR_Core_Memory
>>      49  PriorityType=priority/multifactor
>>      50  PriorityDecayHalfLife=0
>>      51  PriorityCalcPeriod=5
>>      52  PriorityUsageResetPeriod=MONTHLY
>>      53  PriorityFavorSmall=NO
>>      54  PriorityMaxAge=14-0
>>      55  PriorityWeightAge=20160
>>      56  PriorityWeightFairshare=10000
>>      57  PriorityWeightJobSize=104
>>      58  PriorityWeightPartition=0
>>      59  PriorityWeightQOS=400000
>>      60  AccountingStorageEnforce=associations,limits,qos
>>      61  AccountingStorageHost=kalkyl2
>>      62  AccountingStoragePort=7031
>>      63  AccountingStorageType=accounting_storage/slurmdbd
>>      64  ClusterName=kalkyl
>>      65  DebugFlags=NO_CONF_HASH,Priority
>>      66  JobCompLoc=/etc/slurm/slurm_jobcomp_logger
>>      67  JobCompType=jobcomp/script
>>      68  JobAcctGatherFrequency=30
>>      69  JobAcctGatherType=jobacct_gather/linux
>>      70  SlurmctldDebug=3
>>      71  SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>      72  SlurmdDebug=3
>>      73  SlurmdLogFile=/var/log/slurm/slurmd.log
>>      74  NodeName=DEFAULT Sockets=2 CoresPerSocket=4 ThreadsPerCore=1
>> State=UNKNOWN TmpDisk=100000
>>      75
>>      76  NodeName=q[1-16]    RealMemory=72000 Feature=fat,mem72GB,ibsw1
>> Weight=3
>>      77  NodeName=q[17-32]   RealMemory=48000 Feature=fat,mem48GB,ibsw1
>> Weight=2
>>      78  NodeName=q[33-64]   RealMemory=24000 Feature=thin,mem24GB,ibsw2
>>  Weight=1
>>      79  NodeName=q[65-96]   RealMemory=24000 Feature=thin,mem24GB,ibsw3
>>  Weight=1
>>      80  NodeName=q[97-108]  RealMemory=24000 Feature=thin,mem24GB,ibsw4
>>  Weight=1
>>      81  NodeName=q[109-140] RealMemory=24000 Feature=thin,mem24GB,ibsw5
>>  Weight=1
>>      82  NodeName=q[141-172] RealMemory=24000 Feature=thin,mem24GB,ibsw6
>>  Weight=1
>>      83  NodeName=q[173-204] RealMemory=24000 Feature=thin,mem24GB,ibsw7
>>  Weight=1
>>      84  NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8
>>  Weight=1
>>      85
>>      86  NodeName=q[217-232] RealMemory=24000 Feature=thin,mem24GB,ibsw4
>>  Weight=1
>>      87
>>      88  NodeName=q[233-252] RealMemory=24000 Feature=thin,mem24GB,ibsw8
>>  Weight=1
>>      89  NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9
>>  Weight=1
>>      90  NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10
>> Weight=1
>>      91  NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11
>> Weight=1
>>      92
>>      93  PartitionName=all Nodes=q[1-348] Shared=EXCLUSIVE
>> DefaultTime=00:00:01 MaxTime=14400 State=DOWN
>>      94  PartitionName=core Nodes=q[45-348] Default=YES Shared=NO
>> MaxTime=14400 MaxNodes=1 State=UP
>>      95  PartitionName=node Nodes=q[1-32,45-348] Shared=EXCLUSIVE
>> DefaultTime=00:00:01 MaxTime=14400 State=UP
>>      96  PartitionName=devel Nodes=q[33-44] Shared=EXCLUSIVE
>> DefaultTime=00:00:01 MaxTime=60 MaxNodes=4 State=UP
>>
>
>
