>
> It's the other way round. The load used in the load_formula is already
> adjusted. You adjust individual values, not the result of any computation
> already made with them.
>
> The computed load_formula will then be used to sort the machines.
>
Oh, so the load formula is just for machine priority? Then I do see the sense in normalizing the load by the number of cores (otherwise we'd kill machines with 24 cores while machines with 56 cores are barely doing anything) - and I suppose that's exactly what the default "np_load_avg" does.. awesome!

> we basically have 2 kinds of queue - a workhorse queue "all.q" which has
> 1 slot per core and an interactive queue which also has 1 slot per core but
> gets a better priority. we set the load_thresholds to 1.3 to allow 30%
> oversubscription to ensure interactive jobs can always run.. we never ever
> put our nodes in alarm mode, we use zabbix to monitor machine's health and
> we automatically take it out of the cluster (by disabling all of its
> queues) in cases of "mess" (disk failures, out of space, mounting issues,
> stuff like that).
>
> Are these interactive jobs generating load, or are they only used to allow
> users to peek on a machine?
>

yes they're generating load, but there aren't many of them and they are usually very short (seconds to a minute or so). absolutely all our tasks are single-threaded and take 100% cpu.. we work super hard to relieve other bottlenecks (filesystem, databases, etc) - it doesn't always work perfectly, but for most of our tasks cpu is the only limit.
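
To check my own understanding of the normalization, here is a rough sketch in Python. This is not SGE code: the host names and load numbers are made up, `rank_hosts` is a hypothetical helper, and treating the 1.3 threshold as a simple "skip this host" cutoff is a simplification of what the scheduler really does. It only assumes that np_load_avg is the load average divided by the number of cores.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str        # made-up host name, for illustration only
    cores: int
    load_avg: float  # raw load average as reported by the host

    @property
    def np_load_avg(self) -> float:
        # Load normalized by the number of cores.
        return self.load_avg / self.cores


def rank_hosts(hosts, threshold=1.3):
    """Drop hosts at or above the threshold, then sort least-loaded first."""
    eligible = [h for h in hosts if h.np_load_avg < threshold]
    return sorted(eligible, key=lambda h: h.np_load_avg)


hosts = [
    Host("node24", cores=24, load_avg=20.0),  # 20/24 ~ 0.83
    Host("node56", cores=56, load_avg=30.0),  # 30/56 ~ 0.54
    Host("busy24", cores=24, load_avg=33.0),  # 33/24 ~ 1.38, over 1.3
]

ranked = rank_hosts(hosts)
by_raw_load = sorted(hosts, key=lambda h: h.load_avg)

print("normalized order:", [h.name for h in ranked])
print("raw load order:  ", [h.name for h in by_raw_load])
```

With raw load the 24-core box would be picked first even though it is proportionally busier; with the normalized value the 56-core box wins, which is the behaviour we want.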
Our cluster is 50 execution hosts, each with 128-256GB RAM and 24-56 cores, and we have some "support" hardware like an fhgfs cluster for data that isn't on local disks, mysql servers, etc. We intend to double the size of the cluster this year, and we're preparing by making our use of the "shared" resources (database, fhgfs storage) more efficient and by looking at our sge configuration and trying to figure out what we're doing wrong =)

The most common complaint in our halls is that the cluster isn't responsive enough, so we've created a cluster task force to tackle some of the issues - I'm a software engineer but I'm helping with the fhgfs and sge configuration as well, so you're probably going to hear a lot from me soon ;)
