Hi,

On 28.02.2016 at 17:03, Ben Daniel Pere wrote:

> I'm looking into several cases where jobs don't enter our queues even though 
> the load is lower than the threshold and I noticed there's a different 
> calculation there I can't figure..
> 
> Turning on logging, I see the following on qstat -j on a job that should 
> enter but isn't:
> queue instance "[email protected]" dropped because it is overloaded: 
> np_load_avg=1.306875 (= 0.965536 + 0.50 * 38.230000 with nproc=56) >= 1.30

Each job starting on a machine contributes 1 to the adjustment, and this 
contribution decays over time to 0, in your case over 7:30 minutes. The 38.23 
is the sum of these adjustments for all jobs that started within the last 
7:30, each job carrying its own individual contribution. If no job had started 
on a machine in the last 7:30, it would read 0.50 * 0.000000. The sum is 
multiplied by 0.50 (your job_load_adjustments value for np_load_avg) and 
divided by 56 (nproc) before being added to 0.965536.
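The arithmetic above can be sketched as follows. This is illustrative only, 
not Grid Engine source; the function names and the linear decay shape are 
assumptions:

```python
# Sketch of the scheduler's adjusted load from the qstat -j message above.
# Names and the linear decay shape are assumptions, not taken from the
# Grid Engine source.

def adjustment_sum(job_ages_sec, decay_sec=450):
    """Each recently started job contributes 1, decaying to 0 over
    the load_adjustment_decay_time (0:7:30 = 450 seconds)."""
    return sum(max(0.0, 1.0 - age / decay_sec) for age in job_ages_sec)

def adjusted_np_load_avg(raw_np_load, adj_sum, weight=0.50, nproc=56):
    """Adjusted np_load_avg = raw load + weight * adjustment sum / nproc."""
    return raw_np_load + weight * adj_sum / nproc

# Reproducing the logged value for n38:
val = adjusted_np_load_avg(0.965536, 38.23)
print(round(val, 6))  # 1.306875, which is >= the 1.30 threshold
```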


> load_formula is load_avg-num_proc and load_adjustments are 0.5:
> 
> $ qconf -ssconf
> algorithm                         default
> schedule_interval                 00:00:01
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50,load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      load_avg-num_proc

What was the reason to implement it this way? With a fully loaded machine, 
subtracting num_proc would yield zero, which doesn't reflect the actual use of 
the machine.
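A small worked comparison may make the objection concrete. The hosts and 
values below are hypothetical, purely for illustration:

```python
# Comparing the user's load_formula (load_avg - num_proc) with the
# default np_load_avg (load_avg / num_proc) on two hypothetical hosts,
# both running at 50% utilization.
hosts = [
    ("big",   56, 28.0),   # name, num_proc, load_avg
    ("small",  8,  4.0),
]
for name, num_proc, load_avg in hosts:
    custom = load_avg - num_proc     # big: -28.0, small: -4.0
    np     = load_avg / num_proc     # both: 0.5
    print(f"{name}: load_avg-num_proc={custom}, np_load_avg={np}")

# load_avg - num_proc ranks the big host as far "emptier" even though
# both are equally busy in relative terms, and a fully loaded host of
# any size reads 0, hiding its actual utilization.
```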


> given the n38 example, I see the average load is 0.965536 but I have 
> absolutely ZERO idea where that 38.23 comes from.. num_proc is 56, load_avg 
> is less than that, where does 38.23 comes from?
> 
> Also, I should note all our jobs take 1 full cpu and they start doing it 
> after about 10 seconds of starting, what should I set the decay time to? we 
> took 7:30 minutes as a default we found somewhere..

When your jobs use the granted CPU almost instantly and you don't 
deliberately overload machines, then you need neither any 
job_load_adjustments nor any alarm threshold in the queue definition.

- job_load_adjustments compensates for the fact that a job doesn't use its 
granted resources instantly, which is not the case for your jobs.

- The alarm threshold in the queue definition covers the case where you 
deliberately oversubscribe a machine because your parallel jobs don't scale 
well. E.g., you run 72 slots on a 64-core machine and expect an average load 
of 1 from a certain mix of parallel and serial jobs; running 72 serial jobs 
instead would lead to a noticeable oversubscription, and the alarm threshold 
would catch this and prevent further jobs from being started on this machine. 
If you have only serial jobs, or parallel jobs that scale well, this doesn't 
need to be set.

(BTW: we use the alarm threshold only to put a machine into alarm state, 
stopping further dispatching of jobs to it, in case the free space in /tmp 
falls below 1 GB.)
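For reference, such a /tmp guard could look roughly like this in a queue 
definition; a hypothetical sketch, assuming a custom tmp_free complex fed by 
a load sensor (it is not a stock Grid Engine complex):

```
# Hypothetical fragment of a queue definition (qconf -mq all.q).
# Crossing the threshold puts the queue instance into alarm state,
# so no further jobs are dispatched to this host.
load_thresholds    tmp_free=1G
```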

-- Reuti




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
