Sorry, that link points to the latest version of the code, which is a bit different from the 2.4.1 snippet I sent you.
On Wed, Sep 5, 2012 at 1:37 PM, Miguel Méndez <[email protected]> wrote:
> Hi Lennart,
>
> Don't worry about the Accounting thing, I didn't know this you now say:
>
>
>> This has worked for a long time and usually still does.
>>
>> But sometimes it goes seriously wrong, with a new job starting
>> at a age value of 20160 instead.
>>
>
> So if it usually works, all I can think of right now is about this part of
> multifactor's code (for 2.4.1):
>
> if (weight_age) {
>     uint32_t diff;
>     if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS)
>         diff = start_time - job_ptr->details->submit_time;
>     else
>         diff = start_time - job_ptr->details->begin_time;
>     if (job_ptr->details->begin_time) {
>         if (diff < max_age) {
>             job_ptr->prio_factors->priority_age =
>                 (double)diff / (double)max_age;
>         } else
>             job_ptr->prio_factors->priority_age = 1.0;
>     } else if (flags & PRIORITY_FLAGS_ACCRUE_ALWAYS) {
>         if (diff < max_age) {
>             job_ptr->prio_factors->priority_age =
>                 (double)diff / (double)max_age;
>         } else
>             job_ptr->prio_factors->priority_age = 1.0;
>     }
> }
>
> ( You can find it here:
> https://github.com/chaos/slurm/blob/master/src/plugins/priority/multifactor/priority_multifactor.c#L458
> )
>
> You may be getting to any of those "job_ptr->prio_factors->priority_age =
> 1.0;" for some reason. Maybe, as you suggest, there is a problem with some
> of those times ("begin_time" maybe). I would check the times you're getting
> in job details.
>
> Regards,
>
> Miguel
>
>
>
> On Wed, Sep 5, 2012 at 12:07 PM, Lennart Karlsson <
> [email protected]> wrote:
>
>>
>> On 09/04/2012 04:22 PM, Miguel Méndez wrote:
>> > Hi Lennart,
>> >
>> > I have some questions for you so I can help you:
>> >
>> > Have you tried to set DebugFlags=Priority in slurm.conf to get some more
>> > info about priorities on slurmctld.log?
>> >
>> > Are your priorities being recalculated every "PriorityCalcPeriod" (in
>> > slurm.conf as well, default is 5 min)? If not, do you have Accounting
>> > enabled?
>>
>> Hi Miguel,
>>
>> And thanks for trying to help me!
>>
>> Yes, I have configured
>>
>> PriorityCalcPeriod=5
>>
>> in the slurm.conf file.
>>
>> I do not understand your question about if I have Accounting enabled.
>> I have no such configuration variable in my slurm.conf file. I run
>>
>>
>> I have now tried your suggestion to set DebugFlags=Priority,
>> so now I can rewrite my question in a new way.
>>
>> In slurm.conf, I have configured
>> PriorityMaxAge=14-0
>> PriorityWeightAge=20160
>>
>> The plan behind this configuration is to start with an age
>> value of zero and get approximately one priority point added
>> for each minute that the job has been waiting, up to a
>> maximum of 20160.
>>
>> This has worked for a long time and usually still does.
>> But sometimes it goes seriously wrong, with a new job starting
>> at a age value of 20160 instead.
>>
>> This can be seen with the sprio command and also with Priority
>> debugging on:
>>
>> [2012-09-05T10:43:37] Weighted Age priority is 1.000000 * 20160 = 20160.00
>> [2012-09-05T10:43:37] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
>> [2012-09-05T10:43:37] Weighted JobSize priority is 0.001616 * 104 = 0.17
>> [2012-09-05T10:43:37] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T10:43:37] Weighted QOS priority is 0.000000 * 400000 = 0.00
>> [2012-09-05T10:43:37] Job 2182878 priority: 20160.00 + 100000.00 + 0.17 + 0.00 + 0.00 - 0 = 120160.17
>>
>> The job was submitted 2012-09-05T10:42:22, so it should have a weighted
>> age priority of zero or one, but it got for some unknown reason the
>> maximum value instead.
>>
>> Here are a job that behaves the normal way, as expected:
>> [2012-09-05T10:44:17] Weighted Age priority is 0.000000 * 20160 = 0.00
>> [2012-09-05T10:44:17] Weighted Fairshare priority is 6.000000 * 10000 = 60000.00
>> [2012-09-05T10:44:17] Weighted JobSize priority is 0.002874 * 104 = 0.30
>> [2012-09-05T10:44:17] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T10:44:17] Weighted QOS priority is 0.000000 * 400000 = 0.00
>> [2012-09-05T10:44:17] Job 2182879 priority: 0.00 + 60000.00 + 0.30 + 0.00 + 0.00 - 0 = 60000.30
>>
>> This job was submitted 2012-09-05T10:44:17, so the weighted age
>> priority is zero, as expected.
>>
>> Here is an example for a job that has waited for some time:
>> [2012-09-05T00:07:31] Weighted Age priority is 0.004721 * 20160 = 95.17
>> [2012-09-05T00:07:31] Weighted Fairshare priority is 10.000000 * 10000 = 100000.00
>> [2012-09-05T00:07:31] Weighted JobSize priority is 0.002874 * 104 = 0.30
>> [2012-09-05T00:07:31] Weighted Partition priority is 0.000000 * 0 = 0.00
>> [2012-09-05T00:07:31] Weighted QOS priority is 0.300000 * 400000 = 120000.00
>> [2012-09-05T00:07:31] Job 2178648 priority: 95.17 + 100000.00 + 0.30 + 0.00 + 120000.00 - 0 = 220095.47
>>
>> Submit time was 2012-09-04T22:32:08, so the Weighted Age
>> priority works as intended in this case.
>>
>> This is version 2.4.1 of SLURM. (If someone thinks that the Fairshare
>> priorities are strange, do not worry. They are intended to be in this
>> way, but that is another story.)
>>
>> Full slurm.conf configuration is at the bottom of this e-mail,
>> with line numbers added.
>>
>> Cheers,
>> -- Lennart Karlsson
>> UPPMAX, Uppsala University, Sweden
>> http://www.uppmax.uu.se
>>
>> ==============================================
>> 1 ControlMachine=kalkyl2
>> 2 AuthType=auth/munge
>> 3 CacheGroups=0
>> 4 CryptoType=crypto/munge
>> 5 EnforcePartLimits=YES
>> 6 Epilog=/etc/slurm/slurm.epilog
>> 7 JobCredentialPrivateKey=/etc/slurm/slurm.key
>> 8 JobCredentialPublicCertificate=/etc/slurm/slurm.cert
>> 9 JobRequeue=0
>> 10 MaxJobCount=1000000
>> 11 MpiDefault=none
>> 12 Proctracktype=proctrack/cgroup
>> 13 Prolog=/etc/slurm/slurm.prolog
>> 14 PropagateResourceLimits=RSS
>> 15 ReturnToService=0
>> 16 SallocDefaultCommand="/usr/bin/srun -n1 -N1 --pty --preserve-env --mpi=none -Q $SHELL"
>> 17 SchedulerParameters=default_queue_depth=5000,bf_window=10080,max_job_bf=5000,bf_interval=120
>> 18 SlurmctldPidFile=/var/run/slurmctld.pid
>> 19 SlurmctldPort=6817
>> 20 SlurmdPidFile=/var/run/slurmd.pid
>> 21 SlurmdPort=6818
>> 22 SlurmdSpoolDir=/var/spool/slurmd
>> 23 SlurmUser=slurm
>> 24 StateSaveLocation=/usr/local/slurm-state
>> 25 SwitchType=switch/none
>> 26 TaskPlugin=task/cgroup
>> 27 TaskProlog=/etc/slurm/slurm.taskprolog
>> 28 TopologyPlugin=topology/tree
>> 29 TmpFs=/scratch
>> 30 TrackWCKey=yes
>> 31 TreeWidth=20
>> 32 UsePAM=1
>> 33 HealthCheckInterval=1800
>> 34 HealthCheckProgram=/etc/slurm/slurm.healthcheck
>> 35 InactiveLimit=0
>> 36 KillWait=600
>> 37 MessageTimeout=60
>> 38 ResvOverRun=UNLIMITED
>> 39 MinJobAge=43200
>> 40 SlurmctldTimeout=300
>> 41 SlurmdTimeout=1200
>> 42 Waittime=0
>> 43 FastSchedule=1
>> 44 MaxMemPerCPU=3072
>> 45 SchedulerType=sched/backfill
>> 46 SchedulerPort=7321
>> 47 SelectType=select/cons_res
>> 48 SelectTypeParameters=CR_Core_Memory
>> 49 PriorityType=priority/multifactor
>> 50 PriorityDecayHalfLife=0
>> 51 PriorityCalcPeriod=5
>> 52 PriorityUsageResetPeriod=MONTHLY
>> 53 PriorityFavorSmall=NO
>> 54 PriorityMaxAge=14-0
>> 55 PriorityWeightAge=20160
>> 56 PriorityWeightFairshare=10000
>> 57 PriorityWeightJobSize=104
>> 58 PriorityWeightPartition=0
>> 59 PriorityWeightQOS=400000
>> 60 AccountingStorageEnforce=associations,limits,qos
>> 61 AccountingStorageHost=kalkyl2
>> 62 AccountingStoragePort=7031
>> 63 AccountingStorageType=accounting_storage/slurmdbd
>> 64 ClusterName=kalkyl
>> 65 DebugFlags=NO_CONF_HASH,Priority
>> 66 JobCompLoc=/etc/slurm/slurm_jobcomp_logger
>> 67 JobCompType=jobcomp/script
>> 68 JobAcctGatherFrequency=30
>> 69 JobAcctGatherType=jobacct_gather/linux
>> 70 SlurmctldDebug=3
>> 71 SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> 72 SlurmdDebug=3
>> 73 SlurmdLogFile=/var/log/slurm/slurmd.log
>> 74 NodeName=DEFAULT Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=100000
>> 75
>> 76 NodeName=q[1-16] RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
>> 77 NodeName=q[17-32] RealMemory=48000 Feature=fat,mem48GB,ibsw1 Weight=2
>> 78 NodeName=q[33-64] RealMemory=24000 Feature=thin,mem24GB,ibsw2 Weight=1
>> 79 NodeName=q[65-96] RealMemory=24000 Feature=thin,mem24GB,ibsw3 Weight=1
>> 80 NodeName=q[97-108] RealMemory=24000 Feature=thin,mem24GB,ibsw4 Weight=1
>> 81 NodeName=q[109-140] RealMemory=24000 Feature=thin,mem24GB,ibsw5 Weight=1
>> 82 NodeName=q[141-172] RealMemory=24000 Feature=thin,mem24GB,ibsw6 Weight=1
>> 83 NodeName=q[173-204] RealMemory=24000 Feature=thin,mem24GB,ibsw7 Weight=1
>> 84 NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8 Weight=1
>> 85
>> 86 NodeName=q[217-232] RealMemory=24000 Feature=thin,mem24GB,ibsw4 Weight=1
>> 87
>> 88 NodeName=q[233-252] RealMemory=24000 Feature=thin,mem24GB,ibsw8 Weight=1
>> 89 NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9 Weight=1
>> 90 NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10 Weight=1
>> 91 NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11 Weight=1
>> 92
>> 93 PartitionName=all Nodes=q[1-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=DOWN
>> 94 PartitionName=core Nodes=q[45-348] Default=YES Shared=NO MaxTime=14400 MaxNodes=1 State=UP
>> 95 PartitionName=node Nodes=q[1-32,45-348] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=14400 State=UP
>> 96 PartitionName=devel Nodes=q[33-44] Shared=EXCLUSIVE DefaultTime=00:00:01 MaxTime=60 MaxNodes=4 State=UP
