> On 28.02.2016 at 21:51, Ben Daniel Pere <[email protected]> wrote:
> 
> Each job starting on a machine will contribute 1 to the adjustment which will 
> decay over time to 0, in your case in 7:30 minutes. The 38.23 is the sum of 
> all these adjustments of all jobs starting in the last 7:30 while each job 
> will have its own individual contribution to this sum. If no job started in 
> the last 7:30 on a machine it should read 0.50 * 0.000000. This value is then 
> divided by 56 before being added to 0.965536.
> 
> I actually noticed the 38.23 and the decay time while I was writing this 
> email and started to read about it - still, what made me send the question 
> is that I don't see where the load_formula kicks in here - the minus 
> num_proc part seems to be completely ignored, so I'm probably missing 
> something - what is it?

It's the other way round. The load used in the load_formula is already 
adjusted. You adjust individual values, not the result of any computation 
already made with them.

The computed load_formula will then be used to sort the machines.
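
To make the arithmetic above concrete, here is a minimal Python sketch - assuming the per-job adjustment decays linearly over the 7:30 window, and using the 0.50 adjustment, the 38.23 sum, the 56 processors and the 0.965536 load quoted in this thread:

```python
def decayed_contribution(adjustment, elapsed_s, decay_s=450.0):
    """One job's remaining contribution, assuming a linear decay
    from `adjustment` to 0 over decay_s seconds (7:30 = 450 s)."""
    return adjustment * max(0.0, 1.0 - elapsed_s / decay_s)

# A job that started just now still contributes its full 0.50;
# halfway through the window only 0.25 is left; after 7:30, nothing.
print(decayed_contribution(0.5, 0))    # 0.5
print(decayed_contribution(0.5, 225))  # 0.25
print(decayed_contribution(0.5, 450))  # 0.0

# The adjusted load: raw load plus the summed adjustments of all
# recently started jobs, divided by the processor count.
raw_load = 0.965536        # load reported by the host
adjustment_sum = 38.23     # sum of all decayed per-job adjustments
num_procs = 56

adjusted = raw_load + adjustment_sum / num_procs
print(round(adjusted, 6))  # 1.648215
```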


> > load_formula is load_avg-num_proc and load_adjustments are 0.5:
> What was the reason to implement it this way? Having a fully loaded machine 
> and subtracting num_proc would read zero - which doesn't reflect the actual 
> use of the machine.
> 
> no one remembers.. talked with the people who configured it - they have 
> absolutely no idea :) "probably copy pasted from somewhere online" <-- real 
> quote.
>  
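
For reference, a sched_conf sketch (edited via qconf -msconf) matching the settings quoted above - the attribute names are real scheduler parameters, but the exact values shown are just the ones mentioned in this thread:

```
load_formula                  load_avg-num_proc
job_load_adjustments          np_load_avg=0.50
load_adjustment_decay_time    0:7:30
```
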
> - The job_load_adjustments setting handles the fact that a job isn't using 
> the granted resources instantly, which is not happening in your case.
> 
> I would also assume it's good for "starting engines" - since the load_avg is 
> the 5-minute load, submitting a huge array after some idle time will make 
> all jobs see almost zero load on the machine. I wouldn't mind bombing the 
> machine because we only have 1 slot per core, so I'm not really worried 
> about killing the CPU, but I can see the logic in it even in cases of always 
> intensive jobs.
>  
> - alarm_threshold in the queue definition takes care of the case where you 
> want to oversubscribe a machine intentionally, as your parallel job doesn't 
> scale well 
>  
> we basically have 2 kinds of queues - a workhorse queue "all.q" which has 1 
> slot per core and an interactive queue which also has 1 slot per core but 
> gets a better priority. we set the load_thresholds to 1.3 to allow 30% 
> oversubscription to ensure interactive jobs can always run.. we never ever 
> put our nodes in alarm mode; we use Zabbix to monitor the machines' health 
> and we automatically take a node out of the cluster (by disabling all of its 
> queues) in cases of "mess" (disk failures, out of space, mounting issues, 
> stuff like that).
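
The disable/re-enable step described above can be done with qmod; a sketch, with a hypothetical host name node042 (running jobs keep running, only new scheduling to the host stops):

```
# Disable every queue instance on the unhealthy host
qmod -d '*@node042'

# Re-enable once monitoring reports the host healthy again
qmod -e '*@node042'
```
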

Are these interactive jobs generating load, or are they used only to allow 
users to peek at a machine?

-- Reuti


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
