On 2012-08-24 03:05, Lipari, Don wrote:
Janne,

> As one of those who designed the fair-share component of the multi-factor
> priority plugin, I'm open to your suggestions below.  I would recommend you
> create it as a separate multi-factor plugin so that we could have the
> opportunity to switch back and forth and examine the differences in behavior.
> If it winds up delivering more fairness across all cases, then I'm sure we
> would abandon the current implementation someday.

A separate plugin sounds like a good approach, yes. I'll look into that. For the time being, in case anyone is interested, the core algorithm change can be tested with the attached patch (so far it's only compile-tested).

> I'm not sure how best to handle the sshare utility with regard to the fields it
> displays.  It might make sense to create a new sshare that displays the
> pertinent components of the new algorithm.
>
> Would you mind citing the reference(s) that motivated the formula you presented
> below?

There are a lot of papers on fair-share scheduling, but most of them are about scheduling network packets, and while one can get some ideas from them, the terminology is often somewhat different.

One paper on fair-share CPU scheduling that is reasonably easy to understand is

J. Kay and P. Lauder, "A Fair Share Scheduler", Communications of the ACM, 31(1):44-55, 1988,
http://sydney.edu.au/engineering/it/~judy/Research_fair/89_share.pdf

and an addendum:

http://sydney.edu.au/engineering/it/~piers/papers/Share/share2.html


The particular algorithm was (re-)invented by yours truly; later I found something quite close to it in

www.cs.umu.se/~elmroth/papers/fsgrid.pdf


MOAB also uses a similar algorithm, except that the tree hierarchy is fixed, and the scale factor for each level of the tree is configurable.

www.eecs.harvard.edu/~chaki/bib/papers/jackson01maui.pdf


> While I understand the shortcomings in the scenario of the A, B, and C
> accounts you describe, coming up with an elegant design that supports a
> multi-level hierarchy is not a trivial problem, particularly when one of the
> requirements is to create something that users will understand intuitively.

Indeed! FWIW, one way in which the proposed algorithm can fail is when several accounts have used much more than their fair share: users with little usage in those accounts can then clobber the parent's fair-share contribution. This might not be a huge problem in practice, though.
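To make that failure mode concrete, here is a toy sketch (helper name and numbers are mine, purely illustrative) of the one-level case of the proposed formula with k = 100: when two top-level accounts are both badly overserved, their raw factors can end up less than 1/k apart, at which point a fresh user under the worse account outranks users under the better one.

```c
#include <math.h>

/* Toy illustration of the failure mode: combine an account's raw
 * fair-share factor with a user's, one level deep (d = 1), using the
 * proposed formula (acct * k + user) / k. */
static double hier_fs(double acct_fs, double user_fs, double k)
{
	return (acct_fs * k + user_fs) / k;
}
```

With account A at raw factor 0.001 and B at 0.009 (both badly overserved, but A more so), a brand-new user in A (raw 1.0) gets (0.1 + 1.0)/100 = 0.011, while a moderate user in B (raw 0.05) gets (0.9 + 0.05)/100 = 0.0095, so the A user wins even though A is the more overserved account.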

An algorithm which I think is quite robust, and which does not suffer from the deep-tree problems of the algorithm I suggested, is the one used by SGE (or whatever it's called nowadays). There you start with a number of tickets at the root of the tree and distribute them to the child nodes in proportion to their fair share, so at the end of this phase the tickets have been distributed down to the queued jobs. Then you give the job with the highest number of tickets the priority 1.0, and the other jobs priorities proportional to their ticket count versus that of the first job. This gives a fairly robust algorithm which assigns priorities hierarchically like we want, but at the cost of adding state to each queued job (the number of tickets it has).
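A minimal sketch of the two phases of that ticket scheme (my own illustrative helper names, not SGE code): a proportional split of tickets at each node of the tree, then a final normalization so the top job gets priority 1.0.

```c
#include <stddef.h>

/* Phase 1: split 'tickets' among a node's children in proportion to
 * their fair-share weights (here just static shares; real SGE also
 * folds in usage).  Applied recursively, the tickets trickle down
 * from the root to the queued jobs at the leaves. */
static void distribute_tickets(double tickets, const double *shares,
			       double *out, size_t nchildren)
{
	double total = 0.0;
	size_t i;

	for (i = 0; i < nchildren; i++)
		total += shares[i];
	for (i = 0; i < nchildren; i++)
		out[i] = tickets * shares[i] / total;
}

/* Phase 2: give the job with the most tickets priority 1.0 and scale
 * the other jobs proportionally. */
static void normalize_priorities(const double *tickets, double *prio,
				 size_t njobs)
{
	double max = 0.0;
	size_t i;

	for (i = 0; i < njobs; i++)
		if (tickets[i] > max)
			max = tickets[i];
	for (i = 0; i < njobs; i++)
		prio[i] = (max > 0.0) ? tickets[i] / max : 0.0;
}
```

The only per-job state this needs is the ticket count, which is exactly the cost mentioned above.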

Google finds a number of papers on things like "lottery scheduling" or "stride scheduling" that are at least somewhat relevant to the ticket algorithm, e.g.

www.waldspurger.org/carl/papers/phd-mit-tr667.pdf

I'll try to find time to see whether such an algorithm can be incorporated into Slurm, and whether it would be better than the one I originally suggested.


> Don

-----Original Message-----
From: Janne Blomqvist [mailto:[email protected]]
Sent: Thursday, August 23, 2012 12:05 AM
To: slurm-dev
Subject: [slurm-dev] Making hierarchical fair share hierarchical


Hello,

we are seeing an issue with the hierarchical fair share algorithm which
we'd like to get fixed.  We think our suggested improvements would
benefit others as well, but we'd like some input on how, or indeed
whether, to proceed.

For some background, we have a cluster shared between several
departments, and we have configured the fair-share such that each
department account has a share corresponding to its financial
contribution (so far it's simple, 3 departments with equal shares each).
Thus it's politically important for us that the fair-share priority
reflects the usage factor of each department as well as between users
belonging to an account. And yes, we're certainly aware that fair share
cannot guarantee that departments get their fair share, but it should at
least push in that direction.

Thus, we'd like the fair-share factor of a user to reflect the account
hierarchy, such that users belonging to underserved accounts always have
a higher fair-share priority than users belonging to an account which
has used more than its fair share.

The problem we're seeing is that the current hierarchical fair-share
algorithm does not manage to do this in our case. Our issue seems to be
partly due to the departments having somewhat different usage patterns.
The departments have roughly equal numbers of users, but in two of the
departments (let's call them A and B) most of the users use the cluster
very little, with a few heavy users. The third department, OTOH, has a
much larger number of active users. The end result is that the heavy
users of the A and B departments have very low fair-share priorities,
and department C manages to use most of the resources since its users
have higher fair-share priorities, despite the usage at the department
level being far from the fair-share target.

After reviewing literature and thinking about the issue, we suggest to
calculate the hierarchical fair-share as follows (in latex notation):

fs = ( \sum_{i=0}^d fs_{raw}(i)*k^{(d-i)} ) / k^d

(for a rendered version, see e.g.

http://texify.com/?$fs = ( \sum_{i=0}^d fs_{raw}(i)*k^{(d-i)} ) / k^d$

)

where fs_raw is the non-hierarchical fair-share factor (2^(-U/S)), d is
the depth of the account/user in the tree hierarchy (i=0 is the root of
the tree), and k is a scaling factor. k should be chosen large enough
to distinguish between nodes at the same depth of the tree, yet small
enough that the priority has enough digits for reasonably deep trees to
work as intended. That is, as long as two accounts at the same level of
the tree differ in priority by more than 1/k, the users under those
accounts will not be intermingled with each other. OTOH, as the
priority ultimately is a uint32_t, even with a large fair-share
priority weight there are at most only about 9 digits. We suggest k=100
as a starting point, which would allow a hierarchy about 4 levels deep
(5 including the root) before the priorities at the bottom of the tree
start to get filtered out.

So, how to proceed? We think others would benefit from this as well, and
of course we have the selfish desire to have this upstream so that we
don't have to carry around our own (not yet finished) patch
indefinitely, or come up with some other way to solve our problem. Is
there a need to keep the existing hierarchical fair-share algorithm
around, or can we replace it with our version? If we can replace it,
then since our approach doesn't use the concept of "effective usage",
that concept should be removed from the code and tool output (e.g.
sshare). As this issue is important to us, we're prepared to do the
dirty work, but we'd like some feedback first before doing anything.



--
Janne Blomqvist


diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 1e45002..866580f 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -400,10 +400,12 @@ static double _get_fairshare_priority( struct job_record *job_ptr)
 {
 	slurmdb_association_rec_t *job_assoc =
 		(slurmdb_association_rec_t *)job_ptr->assoc_ptr;
-	slurmdb_association_rec_t *fs_assoc = NULL;
+	slurmdb_association_rec_t *fs_assoc = NULL, *fs;
 	double priority_fs = 0.0;
 	assoc_mgr_lock_t locks = { READ_LOCK, NO_LOCK,
 				   NO_LOCK, NO_LOCK, NO_LOCK };
+	int d;
+	static const double k = 100;
 
 	if (!calc_fairshare)
 		return 0;
@@ -429,15 +431,16 @@ static double _get_fairshare_priority( struct job_record *job_ptr)
 		priority_p_set_assoc_usage(fs_assoc);
 
 	/* Priority is 0 -> 1 */
-	priority_fs = priority_p_calc_fs_factor(
-		fs_assoc->usage->usage_efctv,
-		(long double)fs_assoc->usage->shares_norm);
+	for (d = 0, fs = fs_assoc; fs != assoc_mgr_root_assoc; 
+	     d++, fs = fs->usage->parent_assoc_ptr)
+		priority_fs += priority_p_calc_fs_factor(
+			fs->usage->usage_norm, fs->usage->shares_norm)
+			* pow(k, d);
+	priority_fs /= pow(k, d);
 	if (priority_debug) {
 		info("Fairshare priority of job %u for user %s in acct"
-		     " %s is 2**(-%Lf/%f) = %f",
-		     job_ptr->job_id, job_assoc->user, job_assoc->acct,
-		     fs_assoc->usage->usage_efctv,
-		     fs_assoc->usage->shares_norm, priority_fs);
+		     " %s is %f", job_ptr->job_id, job_assoc->user, 
+		     job_assoc->acct, priority_fs);
 	}
 
 	assoc_mgr_unlock(&locks);
