Jesse is right that it is a "real" half-life, as in radioactive decay formulas. So after each half-life interval, the impact of a recorded amount of resource consumption attributed to a job (and thus to a user/project leaf node in the share tree) will have been cut in half. If you feel that the share tree policy "forgets" too quickly, then simply increase the half-life. There are also the compensation factor and usage scaling factors you can play with to adjust the policy to behave the way you want.

While the policy's inherent algorithm is totally deterministic, it is definitely challenging to follow what's going on. You'd do this for debugging reasons, but otherwise it is about as helpful as calculating the pertinent laws of physics when steering a car around a corner. There are way too many variables, which are constantly changing. So my advice would be the same as for driving a car: try to point in the right direction and make adjustments as you see fit.

As for relating the accounting info to what's happening in the share-tree policy: that's next to impossible ... which is why we've chosen to augment the UniSight accounting/reporting technology in our proprietary Univa Grid Engine version to include reporting on share tree history. Sorry for the commercial note here, but it's the only solution I can point you to in this regard.

Cheers,

Fritz

Jesse Becker wrote:
On Wed, May 08, 2013 at 02:02:04PM -0700, Brian McNally wrote:
Thanks Reuti, you're awesome!

I thought the halftime just dictated the length of time usage took before it was half its original value. It seems to me that this is not the same as how long the scheduler keeps usage information for jobs. Although, at some point, say after 4-5 half-life cycles, the decayed usage is very small and doesn't have much of an impact.

I seem to recall hearing that "5 halflives" is how long radioactive
stuff has to decay before it's "safe."  Don't quote me on that though.
:)

Poking around a bit in the sgeee.c file (from SoGE version 8.1.1, which
is what I have handy ATM), it looks like that, even though halftime is
specified in hours, the calculations are actually done in minutes.

It also looks like a real exponential decay is used, instead of a linear
decrease (as in some of the load calculations).  I think that the actual
decay rates come from the following (sge_support.c):

/*--------------------------------------------------------------------
 * calculate_decay_constant - calculates decay rate and constant based
 * on the decay half life and usage interval. The halftime argument
 * is in minutes.
 *--------------------------------------------------------------------*/

void
calculate_decay_constant( double halftime,
                          double *decay_rate,
                          double *decay_constant )
{
   if (halftime < 0) {
      *decay_rate = 1.0;
      *decay_constant = 0;
   } else if (halftime == 0) {
      *decay_rate = 0;
      *decay_constant = 1.0;
   } else {
      *decay_rate = - log(0.5) / (halftime * 60);
      *decay_constant = 1 - (*decay_rate * sge_usage_interval);
   }
   return;
}

This is especially interesting since it implies that negative halftimes
are acceptable.  Sure enough, setting a negative value zeros out
historical usage:
https://blogs.oracle.com/sgrell/entry/a_couple_lines_on_halftime


So yes, you'd need to keep your accounting files around for some number
of halftimes.  At 5 halflives, you're at 1/32nd of the original
weighting, or about 3%.




--
Brian McNally

On 05/08/2013 01:46 PM, Reuti wrote:
Hi,

On 08.05.2013 at 22:30, Brian McNally wrote:

qacct reports usage from a file, but GE has its own internal database for tracking jobs and usage.

You mean for the share tree policy? Yes.


Is this correct? If so, what controls the length of time GE keeps job data for?

The "halftime" setting in the scheduler configuration (`man sched_conf`).


It seems that using qacct to display overall usage per user (-o), for example, might be a little misleading if the actual accounting information is stored internally. Users might draw conclusions about their usage and how that'll impact their job priorities based on potentially incorrect data.

Unfortunately this is correct. You can even remove the accounting file or rotate it, which might lead to yet another output. It would be hard to mimic the internal computation. Maybe setting "report_pjob_tickets" to true could give them a hint as to the position of their jobs in the pending list (usually it's switched off for performance reasons).

-- Reuti



Thanks,

--
Brian McNally
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


--

Fritz Ferstl | CTO and Business Development, EMEA
Univa Corporation | The Data Center Optimization Company
E-Mail: [email protected] | Phone: +49.9471.200.195 | Mobile: +49.170.819.7390

Where Grid Engine lives


