On 28.02.2016 at 21:51, Ben Daniel Pere <[email protected]> wrote:

>> Each job starting on a machine will contribute 1 to the adjustment, which
>> will decay over time to 0, in your case in 7:30 minutes. The 38.23 is the
>> sum of all these adjustments of all jobs started in the last 7:30, while
>> each job has its own individual contribution to this sum. If no job
>> started in the last 7:30 on a machine, it should read 0.50 * 0.000000.
>> This value is then divided by 56 before being added to 0.965536.
>
> I actually realized the 38.23 while I was writing this email and noticed
> the decay time and started to read about it - still, what made me send the
> question was the fact that I don't see where the load_formula kicks into
> play here - the minus num_proc seems to be completely ignored here, so I'm
> probably missing something - what is it?
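To make the quoted arithmetic concrete, here is a minimal Python sketch of the adjustment, assuming the linear decay that sge_conf(5) describes for load_adjustment_decay_time; the constants (7:30 decay window, 0.50 adjustment, 56 processors, 0.965536 base load) are just the numbers from this thread, not anything read from a real cluster:

```python
# Minimal sketch of the load-adjustment arithmetic described above.
# Assumes linear decay; all constants are taken from this thread.
DECAY_TIME = 450.0   # load_adjustment_decay_time 0:7:30, in seconds
ADJUSTMENT = 0.50    # job_load_adjustments value for np_load_avg
NUM_PROC = 56        # processors on the host

def decayed_contribution(age_seconds):
    """A job started age_seconds ago contributes a value that decays
    linearly from 1 (just started) to 0 (decay window expired)."""
    return max(0.0, 1.0 - age_seconds / DECAY_TIME)

def adjusted_np_load(np_load_avg, job_start_ages):
    """np_load_avg plus the summed per-job contributions, scaled by
    the adjustment factor and normalized by the processor count."""
    total = sum(decayed_contribution(a) for a in job_start_ages)
    return np_load_avg + ADJUSTMENT * total / NUM_PROC

# With no jobs started in the last 7:30, the term is 0.50 * 0.000000
# and the reported load stays at the plain np_load_avg:
# adjusted_np_load(0.965536, []) == 0.965536
```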
It's the other way round: the load used in the load_formula is already
adjusted. You adjust the individual values, not the result of any
computation already made with them. The computed load_formula is then used
to sort the machines.

>>> load_formula is load_avg-num_proc and load_adjustments are 0.5:
>>
>> What was the reason to implement it this way? Having a fully loaded
>> machine and subtracting num_proc would read zero, which doesn't reflect
>> the actual use of the machine.
>
> No one remembers.. I talked with the people who configured it - they have
> absolutely no idea :) "probably copy pasted from somewhere online" <--
> real quote.
>
>> - A job_load_adjustments handles the fact that a job isn't using the
>> granted resources instantly, which is not happening in your case.
>
> I would also assume it's good for "starting engines" - since the load_avg
> is the 5-minute load, submitting a huge array after some idle time will
> make all jobs see almost zero load on the machine. I wouldn't mind
> bombing the machine, because we only have 1 slot per core, so I'm not
> really worried about killing the CPU, but I can see the logic in it even
> in cases of always-intensive jobs.
>
>> - alarm_threshold in the queue definition takes care of the case where
>> you want to oversubscribe a machine by intention because your parallel
>> job doesn't scale well.
>
> We basically have 2 kinds of queues - a workhorse queue "all.q" which has
> 1 slot per core, and an interactive queue which also has 1 slot per core
> but gets a better priority. We set the load_thresholds to 1.3 to allow
> 30% oversubscription to ensure interactive jobs can always run.. We never
> ever put our nodes in alarm mode; we use zabbix to monitor machine health
> and we automatically take a node out of the cluster (by disabling all of
> its queues) in cases of "mess" (disk failures, out of space, mounting
> issues, stuff like that).

Are these interactive jobs generating load, or are they used only to allow
users to peek at a machine?
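For illustration, the ordering step could be sketched like this. The host
names and load values are made up; the only point being demonstrated is
that the value fed into load_formula = load_avg - num_proc is the
already-adjusted load, and that the formula's result is what ranks the
machines:

```python
# Hedged sketch: ranking hosts by load_formula = load_avg - num_proc.
# Hosts and numbers are hypothetical; the load value is assumed to be
# the already-adjusted one, as explained above.
hosts = [
    {"name": "node01", "adjusted_load": 38.9, "num_proc": 56},
    {"name": "node02", "adjusted_load": 3.1, "num_proc": 56},
]

def load_formula(host):
    # The subtraction is applied to a load that already includes the
    # decayed per-job adjustments, not the other way round.
    return host["adjusted_load"] - host["num_proc"]

# Lower formula value = considered less loaded = preferred for dispatch.
ranked = sorted(hosts, key=load_formula)
```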
--
Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
