Re: [gridengine users] Understanding load_formula and load calculations for queue overloads..

Reuti Tue, 01 Mar 2016 05:30:00 -0800

> On 29.02.2016 at 23:27, Ben Daniel Pere <ben.p...@gmail.com> wrote:
> 
> It's the other way round. The load used in the load_formula is already 
> adjusted. You adjust individual values, not the result of any computation 
> already made with them.
> 
> The computed load_formula will then be used to sort the machines.
> 
> Oh, the load formula is just for machine priority? Then I do see the sense in 
> normalizing this load by the number of cores (otherwise we'll kill machines 
> with 24 cores while machines with 56 cores are barely doing anything)


With the current setting of the load_formula you should observe the opposite. 
An empty 56-core machine shows up as -56, which is a lower value than the -24 
of an empty 24-core machine. The 56-core machines are therefore filled up to 
32 cores first. At that point all machines have the same load_formula value 
of -24, and the 24-core machines are then also used for further scheduling.
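
To make the fill order concrete, here is a minimal sketch in Python. The actual 
load_formula is not quoted in this message, so the sketch assumes a value of 
"used slots minus core count", which reproduces the -56 and -24 figures above. 
The function `dispatch` and the host names `node24` and `node56` are made up 
for illustration. The real scheduler works on reported load values, not on 
this simplified counter.

```python
def dispatch(cores, jobs):
    """Place single-slot jobs one at a time on the host with the lowest value.

    cores maps a host name to its core count. The formula used here
    (used slots minus core count) is an assumption for illustration only.
    """
    used = {host: 0 for host in cores}
    for _ in range(jobs):
        # Only hosts that still have a free slot can take a job.
        free = [h for h in cores if used[h] < cores[h]]
        if not free:
            break
        # Lowest formula value wins; ties are broken by host name.
        host = min(free, key=lambda h: (used[h] - cores[h], h))
        used[host] += 1
    return used


cores = {"node24": 24, "node56": 56}
print(dispatch(cores, 32))  # all 32 jobs land on the 56-core host
print(dispatch(cores, 34))  # from here on both hosts take turns
```

After 32 jobs both hosts sit at -24, so the 24-core host only starts to 
receive work from the 33rd job onwards.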

Are these quad-CPU machines without hyperthreading to gain a total of 56 cores?

-- Reuti


> - and I suppose that's exactly what the default "np_load_avg" does.. awesome!
> 
> > we basically have 2 kinds of queue - a workhorse queue "all.q" which has 1 
> > slot per core and an interactive queue which also has 1 slot per core but 
> > gets a better priority. we set the load_thresholds to 1.3 to allow 30% 
> > oversubscription to ensure interactive jobs can always run.. we never ever 
> > put our nodes in alarm mode; we use zabbix to monitor each machine's health 
> > and automatically take a machine out of the cluster (by disabling all of 
> > its queues) in case of a "mess" (disk failures, out of space, mounting 
> > issues, stuff like that).
> 
> Are these interactive jobs generating load, or is the queue used only to 
> allow users to peek at a machine?
>  
> Yes, they're generating load, but there aren't many of them and they are 
> usually very short (seconds to minute-ish). Absolutely all our tasks are 
> single-threaded and take 100% CPU. We work super hard to relieve other 
> bottlenecks (filesystem, databases, etc.); that doesn't always work perfectly, 
> but for most of our tasks CPU is our only limit.
> Our cluster is 50 execution hosts, each with 128-256GB RAM and 24-56 cores, 
> and we have some "support" hardware like an fhgfs cluster for information not 
> on local disks, mysql servers, etc. We intend to double the size of the 
> cluster this year, and we're preparing by making our use of "shared" 
> resources (database, fhgfs-storage) more efficient and by looking at our sge 
> configuration and trying to figure out what we're doing wrong =) The most 
> common complaint in our halls is that the cluster isn't responsive enough, so 
> we've created a cluster task force that tries to tackle some of these issues. 
> I'm a software engineer but I'm helping with fhgfs and sge configuration as 
> well, so you're probably going to hear a lot from me soon ;) 
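
As a side note on the 1.3 threshold quoted above: np_load_avg is the load 
average divided by the number of processors, so the same threshold means a 
different raw load average on a 24-core and on a 56-core host. The sketch 
below assumes load_thresholds is set as np_load_avg=1.3 and that the 
threshold counts as reached when the value is greater than or equal to it. 
The function names are made up for illustration.

```python
def np_load_avg(load_avg, num_proc):
    """Load average normalized by the number of processors."""
    return load_avg / num_proc


def over_threshold(load_avg, num_proc, threshold=1.3):
    # Assumption: the threshold is reached when the value is >= threshold.
    return np_load_avg(load_avg, num_proc) >= threshold


for cores in (24, 56):
    # Raw load average at which a host of this size reaches the threshold.
    print(cores, "cores:", round(cores * 1.3, 1))
```

With single-threaded jobs that each use a full core, this leaves room for 
about 7 extra processes on a 24-core host and about 16 on a 56-core host 
before the threshold is reached.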


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
