[slurm-dev] Re: fairshare

2014-07-15 Thread Ryan Cox


Bill,

I may be wrong (corrections welcomed), but I'm pretty sure you'll have 
to use a database query.  My understanding is that the decayed usage is 
stored as a single usage_raw value per association 
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1119). 
There is no history of any kind.


You would have to do a fairly complex query to get an accurate 
representation or write some code to recreate the way Slurm does it.  If 
you look at _apply_decay() and _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c, you can see all 
that happens.  Basically, once per decay-thread iteration, each 
association's usage_raw is decayed and each running job's cputime for 
that period is calculated and added.  This can happen many, many times 
over the length of a job.  If a job terminates before reaching its 
time limit, the remaining allocated cputime is added all at once 
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1036).
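
For reference, a rough Python sketch (not Slurm's actual code) of how one 
might approximate an association's decayed usage_raw from completed-job 
records, e.g. End and CPUTimeRAW fields pulled from sacct.  It charges each 
job's full CPU time at its end time and decays it from then until now, so it 
only approximates what the decay thread computes incrementally:

import time

HALF_LIFE = 7 * 24 * 3600   # e.g. PriorityDecayHalfLife=7-0, in seconds

def decayed_usage(jobs, now=None, half_life=HALF_LIFE):
    # jobs: iterable of (end_epoch, cputime_seconds) tuples, for example
    # parsed from `sacct --format=End,CPUTimeRAW -P` output.
    now = now or time.time()
    total = 0.0
    for end, cputime in jobs:
        age = max(0.0, now - end)
        total += cputime * 0.5 ** (age / half_life)  # exponential half-life decay
    return total

now = time.time()
jobs = [(now - 3 * 86400, 100000),    # 3-day-old job, 100k CPU-seconds
        (now - 14 * 86400, 100000)]   # 14-day-old job of the same size
print(round(decayed_usage(jobs, now)))  # the older job is weighted at about a quarter

This ignores running jobs and the end-of-job catch-up described above, so 
treat it as a rough cross-check rather than a reproduction of usage_raw.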


Those are some of the issues that you may run into while creating a 
database tool for this.


I could be mistaken on some of the details but that is my understanding 
of the code (we looked recently for an unrelated reason).


Ryan

On 07/14/2014 02:15 PM, Bill Wichser wrote:


Is there any way to get a better view of fairshare than the "sshare" 
command?


Under PBS, there was the diagnose -f command, which showed the 
per-time-period breakdown that went into calculating this value.  What was 
nice about this was that I could point a group to this command, or cut and 
paste its output, to show that yes, you have been using 20% over the last 
30 days even though you haven't run anything in the last three days.


It's a much more difficult question to answer now.  I have no tool 
that shows the value, and its decay, over time.  So I'm wondering if 
anyone has a method to demonstrate that, yes, this fairshare value is 
correct and here is why.  Or do I just need to figure out a database 
query to cull this information?


Thanks,
Bill


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: fairshare usage

2013-01-22 Thread Eckert, Phil
Have you looked at sshare?

Phil Eckert
LLNL

From: Mario Kadastik <mario.kadas...@cern.ch>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, January 22, 2013 11:17 AM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] fairshare usage

Hi,

Is there some decent way to get the current state of the multifactor fairshare? Something 
akin to Maui's diagnose -f output that shows groups (accounts in Slurm) and 
users with their fairshare target as well as their historic usage over the past 
N days. This would seriously help in understanding how the fairshare is computed 
based on the actual usage statistics and current cluster state.

For example we have all user fairshares set as parent and for the accounts:

   Account      Share
---------- ----------
      root          1
      grid          1
  grid-ops          1
  hepusers        100
 kbfiusers          1

Now let's assume one of the users in hepusers spends the past N days computing 
with the full cluster, and then another user submits a number of jobs. It would 
be logical to assume that, as there is no distinction between the users in an 
account, the newcomer's priority would be higher, since (s)he hasn't had any 
allocated time.

[root@slurm-1 slurm]# sreport cluster accountutilizationbyuser start=2013-01-08

Cluster/Account/User Utilization 2013-01-08T00:00:00 - 2013-01-21T23:59:59 (1209600 secs)
Time reported in CPU Minutes

  Cluster    Account      Login      Proper Name       Used
--------- ---------- ---------- ---------------- ----------
t2estonia       root                                7801048
t2estonia       grid                                      0
t2estonia       grid     cms134  mapped user fo+          0
t2estonia       grid  sgmcms000  mapped user fo+          0
t2estonia   hepusers                                7801048
t2estonia   hepusers     andres  Andres Tiko           85048
t2estonia   hepusers      mario  Mario Kadastik      7716000

So according to this, Mario (me) has computed a huge amount of time in 
comparison to andres. However, if I look at the priorities from sprio -nl, I see 
this:

[root@slurm-1 slurm]# sprio -nl|head -3
  JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
  53498    mario     0.3497  0.2404977  0.4897101  0.9919238      1.000      0.000
  53499    mario     0.3497  0.2404977  0.4897101  0.9919238      1.000      0.000
[root@slurm-1 slurm]# sprio -nl|grep andres|head -1
  53835   andres     0.3497  0.2396412  0.4897101  0.9919238      1.000      0.000

So in fact the fairshare factor is equivalent for both users, even though one 
has been getting a lot of the resource while the other has not.

Or do I misunderstand the fairshare=parent part?  I also tried setting all users' shares 
to 1 and have no clue how long it will take for sprio to recompute this, but 
right now it's showing the same priorities.

That's one of the reasons why I'd like to be able to see how the actual usage 
and decay over time affect the factor so that I can better understand the 
algorithm and tune the weights.
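
For what it's worth, a minimal Python sketch of the classic multifactor 
fair-share formula, F = 2 ** (-effective_usage / (norm_shares * damping)).  The 
numbers are made up, and the =parent behaviour described in the comment is my 
reading of the documentation rather than something verified on this cluster:

def fairshare_factor(norm_shares, effective_usage, damping=1.0):
    # Classic (pre-Fair Tree) priority/multifactor fair-share factor.
    if norm_shares <= 0:
        return 0.0
    return 2.0 ** (-effective_usage / (norm_shares * damping))

# With Fairshare=parent, a user association is evaluated with its parent
# account's shares and usage, so siblings come out with the same factor
# regardless of who actually ran the jobs (hypothetical values below).
account_norm_shares = 0.485
account_effective_usage = 0.5
for user in ("mario", "andres"):
    print(user, fairshare_factor(account_norm_shares, account_effective_usage))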

Thanks,

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it"
 -- Richard P. Feynman




[slurm-dev] Re: fairshare usage

2013-01-22 Thread Moe Jette

Look at the sshare and sprio tools.

Quoting Mario Kadastik :

> Hi,
>
> is there some decent way to get multifactor fairshare current state?  
> Something akin to maui's diagnose -f output that shows groups  
> (accounts for slurm) and users with their fairshare target as well  
> as their historic usage over the past N days. This would seriously  
> help understand how the fairshare is computed based on the actual  
> usage statistics and current cluster state.
>
> For example we have all user fairshares set as parent and for the accounts:
>
>Account Share
> -- -
>   root 1
>   grid 1
>   grid-ops 1
>   hepusers   100
>  kbfiusers 1
>
> now let's assume one of the users in hepusers spends the past N days  
> computing with the full cluster and then another user submits a  
> number of jobs it would be logical to assume that as there is no  
> distinction between the  users in an account the newcomers priority  
> would be higher as (s)he hasn't had any allocated time.
>
> [root@slurm-1 slurm]# sreport cluster accountutilizationbyuser  
> start=2013-01-08
> 
> Cluster/Account/User Utilization 2013-01-08T00:00:00 -  
> 2013-01-21T23:59:59 (1209600 secs)
> Time reported in CPU Minutes
> 
>   Cluster Account Login Proper Name   Used
> - --- - --- --
> t2estoniaroot  7801048
> t2estoniagrid0
> t2estoniagridcms134 mapped user fo+  0
> t2estoniagrid sgmcms000 mapped user fo+  0
> t2estoniahepusers  7801048
> t2estoniahepusersandres Andres Tiko  85048
> t2estoniahepusers mario  Mario Kadastik7716000
>
> so according to this Mario (me) has computed a huge amount of time  
> in comparison to andres. However if I look at the priorities from  
> sinfo -nl I see this:
>
> [root@slurm-1 slurm]# sprio -nl|head -3
>   JOBID USER PRIORITY   AGEFAIRSHARE  JOBSIZEPARTITION  QOS
>   53498mario 0.3497 0.2404977  0.4897101  0.9919238   
> 1.000  0.000
>   53499mario 0.3497 0.2404977  0.4897101  0.9919238   
> 1.000  0.000
> [root@slurm-1 slurm]# sprio -nl|grep andres|head -1
>   53835   andres 0.3497 0.2396412  0.4897101  0.9919238   
> 1.000  0.000
>
> so in fact the fairshare factor is equivalent for both users no  
> matter that one has been getting a lot of the resource while the  
> other has not.
>
> or do I misunderstand the =parent part?  I tried also setting all  
> users shares to 1 and have no clue how long it will take for sprio  
> to recompute this, but right now it's showing the same priorities.
>
> That's one of the reasons why I'd like to be able to see how the  
> actual usage and decay over time affect the factor so that I can  
> better understand the algorithm and tune the weights.
>
> Thanks,
>
> Mario Kadastik, PhD
> Researcher
>
> ---
>   "Physics is like sex, sure it may have practical reasons, but  
> that's not why we do it"
>  -- Richard P. Feynman
>
>



[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Alan V. Cowles


We think we may be onto something. Looking in sacct at the jobs 
submitted by the users, we found that many users share the same 
uidnumber in the Slurm database. It seems to correlate with the size of 
the user's uid number in our LDAP directory... users whose uid number 
is greater than 65535 get truncated to that number... users with uid 
numbers below that keep their correct uidnumbers (user2 in the sample 
output below).





[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
 |grep user2|head
user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   00:00:48
  0:0  COMPLETED
user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   00:00:48
  0:0  COMPLETED
user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   00:00:48
  0:0  COMPLETED
user2  27545 30618  grep node01-1 2013-07-09T11:57:12   00:00:48
  0:0  COMPLETED
user2  27545 30619bc node01-1 2013-07-09T11:58:08   00:00:48
  0:0  CANCELLED
user2  27545 30620du node01-1 2013-07-09T11:58:19   00:00:48
  0:0  COMPLETED
user2  27545 30621wc node01-1 2013-07-09T11:58:43   00:00:48
  0:0  COMPLETED
user2  27545 30622  zcat node01-1 2013-07-09T11:58:54   00:00:48
  0:0  COMPLETED
user2  27545 30623  zcat node01-1 2013-07-09T12:12:56   00:00:48
  0:0  COMPLETED
user2  27545 30624  zcat node01-1 2013-07-09T12:26:37   00:00:48
  0:0  CANCELLED
[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
 |grep user1|head
user1  65535 83   impute2_w+ node01-1 2013-04-17T09:29:47   00:00:48
  0:0 FAILED
user1  65535 84   impute2_w+ node01-1 2013-04-17T09:30:17   00:00:48
  0:0 FAILED
user1  65535 85   impute2_w+ node01-1 2013-04-17T09:30:40   00:00:48
  0:0 FAILED
user1  65535 86   impute2_w+ node01-1 2013-04-17T09:40:45   00:00:48
  0:0 FAILED
user1  65535 87 date node01-1 2013-04-17T09:42:36   00:00:48
  0:0  COMPLETED
user1  65535 88 hostname node01-1 2013-04-17T09:42:37   00:00:48
  0:0  COMPLETED
user1  65535 89   impute2_w+ node01-1 2013-04-17T09:48:50   00:00:48
  0:0 FAILED
user1  65535 90   impute2_w+ node01-1 2013-04-17T09:48:56   00:00:48
  0:0 FAILED
user1  65535 91   impute2_w+ node01-1 2013-04-17T09:49:56   00:00:48
  0:0 FAILED
user1  65535 92   impute2_w+ node01-1 2013-04-17T09:50:06   00:00:48
  0:0 FAILED
[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
 |grep user3|head
user3  65535 5 script.sh node09-1 2013-04-09T15:55:07   00:00:48
  0:0 FAILED
user3  65535 6 script.sh node09-1 2013-04-09T15:55:13INVALID
  0:0  COMPLETED
user3  65535 8  bash node09-1 2013-04-09T15:57:34   00:00:48
  0:0  COMPLETED
user3  65535 7  bash node09-1 2013-04-09T15:57:21   00:00:48
  0:0  COMPLETED
user3  65535 23script.sh node09-1 2013-04-09T16:10:02   00:00:48
  0:0  COMPLETED
user3  65535 27script.sh node09-+ 2013-04-09T16:18:33   00:00:48
  0:0  CANCELLED
user3  65535 28script.sh node01-+ 2013-04-09T16:18:55   00:00:48
  0:0  CANCELLED
user3  65535 30script.sh node01-+ 2013-04-09T16:34:12   00:00:48
  0:0  CANCELLED
user3  65535 31script.sh node01-+ 2013-04-09T16:34:17   00:00:48
  0:0  CANCELLED
user3  65535 32script.sh node01-+ 2013-04-09T16:34:21   00:00:48
  0:0  CANCELLED

We are thinking perhaps this could lead to our major issues with the 
system and priority factoring.
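
If that uidnumber column really is an unsigned 16-bit type, a ceiling of 
exactly 65535 (2**16 - 1) is what you would expect, since MySQL clamps 
out-of-range values in non-strict mode.  A tiny illustration with made-up UIDs:

UINT16_MAX = 2 ** 16 - 1   # 65535

for uid in (27545, 65535, 70123, 123456):   # hypothetical LDAP UIDs
    stored = min(uid, UINT16_MAX)           # what a clamping insert would keep
    print("ldap uid %6d -> stored as %d" % (uid, stored))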


AC

On 08/23/2013 07:56 AM, Alan V. Cowles wrote:

Hey guys,

So in the past we had 3 prioritization factors in effect: partition, 
age, and fairshare, and they were working wonderfully. Currently 
partition has no effect for us, as it's all one large shared partition, 
so everyone gets the same value there. That leaves everything balanced by 
age and fairshare. In the past age and fairshare worked splendidly, 
and as I understand it we have it set to refresh counters every 2 
weeks... so basically everyone had a blank slate this past weekend. 
Our current issue is as follows...


A problematic user has submitted 70k job

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Alan V. Cowles


Sorry to spam the list, but we wanted to keep updates in flux.

We managed to find the issue in the MySQL database we are using for job 
accounting: the column for that value was set to smallint(5), so larger 
values were being truncated. After some SQL magic we now have 
the appropriate uids showing up. A new monkey wrench: some test jobs 
submitted by user3 below get their fairshare value of 5000 as expected, 
just not user2's... we just cleared his jobs from the queue and submitted 
another 100 jobs for testing, and none of them got a fairshare value...


In his entire history of using our cluster he hasn't submitted over 5000 
jobs, in fact:


[root@slurm-master ~]# sacct -c 
--format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
user2 | wc -l

2573

So we can't figure out why he's being overlooked.

AC


On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
We think we may be onto something, in sacct we were looking at the 
jobs submitted by the users, and found that many users share the same 
uidnumber in the slurm database. It seems to correlate with the size 
of the user's uid number in our ldap directory... users who's uid 
number are greater than 65535 get trunked to that number... users with 
uid numbers below that keep their correct uidnumbers (user2 in the 
sample output below)





[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state 
|grep user2|head
user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
00:00:48  0:0  COMPLETED
user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   
00:00:48  0:0  COMPLETED
user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   
00:00:48  0:0  COMPLETED
user2  27545 30618  grep node01-1 2013-07-09T11:57:12   
00:00:48  0:0  COMPLETED
user2  27545 30619bc node01-1 2013-07-09T11:58:08   
00:00:48  0:0  CANCELLED
user2  27545 30620du node01-1 2013-07-09T11:58:19   
00:00:48  0:0  COMPLETED
user2  27545 30621wc node01-1 2013-07-09T11:58:43   
00:00:48  0:0  COMPLETED
user2  27545 30622  zcat node01-1 2013-07-09T11:58:54   
00:00:48  0:0  COMPLETED
user2  27545 30623  zcat node01-1 2013-07-09T12:12:56   
00:00:48  0:0  COMPLETED
user2  27545 30624  zcat node01-1 2013-07-09T12:26:37   
00:00:48  0:0  CANCELLED
[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state 
|grep user1|head
user1  65535 83   impute2_w+ node01-1 2013-04-17T09:29:47   
00:00:48  0:0 FAILED
user1  65535 84   impute2_w+ node01-1 2013-04-17T09:30:17   
00:00:48  0:0 FAILED
user1  65535 85   impute2_w+ node01-1 2013-04-17T09:30:40   
00:00:48  0:0 FAILED
user1  65535 86   impute2_w+ node01-1 2013-04-17T09:40:45   
00:00:48  0:0 FAILED
user1  65535 87 date node01-1 2013-04-17T09:42:36   
00:00:48  0:0  COMPLETED
user1  65535 88 hostname node01-1 2013-04-17T09:42:37   
00:00:48  0:0  COMPLETED
user1  65535 89   impute2_w+ node01-1 2013-04-17T09:48:50   
00:00:48  0:0 FAILED
user1  65535 90   impute2_w+ node01-1 2013-04-17T09:48:56   
00:00:48  0:0 FAILED
user1  65535 91   impute2_w+ node01-1 2013-04-17T09:49:56   
00:00:48  0:0 FAILED
user1  65535 92   impute2_w+ node01-1 2013-04-17T09:50:06   
00:00:48  0:0 FAILED
[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state 
|grep user3|head
user3  65535 5 script.sh node09-1 2013-04-09T15:55:07   
00:00:48  0:0 FAILED
user3  65535 6 script.sh node09-1 2013-04-09T15:55:13
INVALID  0:0  COMPLETED
user3  65535 8  bash node09-1 2013-04-09T15:57:34   
00:00:48  0:0  COMPLETED
user3  65535 7  bash node09-1 2013-04-09T15:57:21   
00:00:48  0:0  COMPLETED
user3  65535 23script.sh node09-1 2013-04-09T16:10:02   
00:00:48  0:0  COMPLETED
user3  65535 27script.sh node09-+ 2013-04-09T16:18:33   
00:00:48  0:0  CANCELLED
user3  65535 28script.sh node01-+ 2013-04-09T16:18:55   
00:00:48  0:0  CANCELLED
user3  65535 30script.sh node01-+ 2013-04-09T16:34:12   
00:00:48  0:0  CANCELLED
user3  65535 31script.sh node01-+ 2013-04-09T16:34:17   
00:00:4

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Alan V. Cowles


Final update for the day: we have found where priority is being 
overlooked, we just don't know what is causing it...


[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user1
(null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)
(null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED
1 (Priority)


Compared to:

[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user2
account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)



We have tried to create new users and new accounts this afternoon and 
all of them show (null) as their account when we break out the 
formatting rules on sacct.


sacctmgr add account accountname
sacctmgr add user username defaultaccount accountname

We even have one case where all users under an account are working fine 
except a user we added yesterday... so at some point in the past (the logs 
aren't helping us thus far) the ability to actually sync up a user and 
an account for accounting purposes has left us. Also, I have failed to 
mention to this point that we are still running Slurm 2.5.4; my 
apologies for that.


AC


On 08/23/2013 11:22 AM, Alan V. Cowles wrote:

Sorry to spam the list, but we wanted to keep updates in flux.

We managed to find the issue in our mysqldb we are using for job 
accounting which had the column value set to smallint (5) for that 
value, so it was rounding things off, some SQL magic and we now have 
appropriate uid's showing up. A new monkey wrench, some test jobs 
submitted by user3 below get their fairshare value of 5000 as 
expected, just not user2... we just cleared his jobs from the queue, 
and submitted another 100 jobs for testing and none of them got a 
fairshare value...


In his entire history of using our cluster he hasn't submitted over 
5000 jobs, in fact:


[root@slurm-master ~]# sacct -c 
--format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | 
grep user2 | wc -l

2573

So we can't figure out why he's being overlooked.

AC


On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
We think we may be onto something, in sacct we were looking at the 
jobs submitted by the users, and found that many users share the same 
uidnumber in the slurm database. It seems to correlate with the size 
of the user's uid number in our ldap directory... users who's uid 
number are greater than 65535 get trunked to that number... users 
with uid numbers below that keep their correct uidnumbers (user2 in 
the sample output below)





[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state 
|grep user2|head
user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
00:00:48  0:0 COMPLETED
user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   
00:00:48  0:0 COMPLETED
user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   
00:00:48  0:0 COMPLETED
user2  27545 30618  grep node01-1 2013-07-09T11:57:12   
00:00:48  0:0 COMPLETED
user2  27545 30619bc node01-1 2013-07-09T11:58:08   
00:00:48  0:0 CANCELLED
user2  27545 30620du node01-1 2013-07-09T11:58:19   
00:00:48  0:0 COMPLETED
user2  27545 30621wc node01-1 2013-07-09T11:58:43   
00:00:48  0:0 COMPLETED
user2  27545 30622  zcat node01-1 2013-07-09T11:58:54   
00:00:48

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Ralph Castain

Perhaps it is a copy/paste error - but those two tables are identical

On Aug 23, 2013, at 12:14 PM, Alan V. Cowles  wrote:

> 
> Final update for the day, we have found what is causing priority to be 
> overlooked we just don't know what is causing it...
> 
> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
> %.9l %.6D %R" |grep user1
> (null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> (null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
> (Priority)
> 
> Compared to:
> 
> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
> %.9l %.6D %R" |grep user2
> account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
> (Priority)
> 
> 
> We have tried to create new users and new accounts this afternoon and all of 
> them show (null) as their account when we break out the formatting rules on 
> sacct.
> 
> sacctmgr add account accountname
> sacctmgr add user username defaultaccount accountname
> 
> We have even one case where all users under and account are working fine 
> except a user we added yesterday... so at some point in the past (logs aren't 
> helping us thus far) the ability to actually sync up a user and an account 
> for accounting purposes has left us. Also I have failed to mention to this 
> point that we are still running Slurm 2.5.4, my apologies for that.
> 
> AC
> 
> 
> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
>> Sorry to spam the list, but we wanted to keep updates in flux.
>> 
>> We managed to find the issue in our mysqldb we are using for job accounting 
>> which had the column value set to smallint (5) for that value, so it was 
>> rounding things off, some SQL magic and we now have appropriate uid's 
>> showing up. A new monkey wrench, some test jobs submitted by user3 below get 
>> their fairshare value of 5000 as expected, just not user2... we just cleared 
>> his jobs from the queue, and submitted another 100 jobs for testing and none 
>> of them got a fairshare value...
>> 
>> In his entire history of using our cluster he hasn't submitted over 5000 
>> jobs, in fact:
>> 
>> [root@slurm-master ~]# sacct -c 
>> --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
>> user2 | wc -l
>> 2573
>> 
>> So we can't figure out why he's being overlooked.
>> 
>> AC
>> 
>> 
>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
>>> We think we may be onto something, in sacct we were looking at the jobs 
>>> submitted by the users, and found that many users share the same uidnumber 
>>> in the slurm database. It seems to correlate with the size of the user's 
>>> uid number in our ldap directory... users who's uid number are greater than 
>>> 65535 get trunked to that number... users with uid numbers below that keep 
>>> their correct uidnumbers (user2 in the sample output below)
>>> 
>>> 
>>> 
>>> 
>>> [root@slurm-master ~]# sacct -c 
>>> --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
>>>  |grep user2|head
>>> user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
>>> 00:00:48  0:0 COMPLETED
>>> user2  27545 30571   bwa node01-1 2013-07-08T15:18:00   
>>> 00:00:48  0:0 COMPLETED
>>> user2  27545 30573   bwa node01-1 2013-07-09T09:40:59   
>>> 00:00:48  0:0 COMPLETED
>>> user2  27545 30618  grep node01-1 2013-07-09T11:57:12

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Ralph Castain

Ah, never mind - I see the difference now. Was looking for some info to be 
different


On Aug 23, 2013, at 12:17 PM, Ralph Castain  wrote:

> Perhaps it is a copy/paste error - but those two tables are identical
> 
> On Aug 23, 2013, at 12:14 PM, Alan V. Cowles  wrote:
> 
>> 
>> Final update for the day, we have found what is causing priority to be 
>> overlooked we just don't know what is causing it...
>> 
>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
>> %.9l %.6D %R" |grep user1
>> (null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> (null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> 
>> Compared to:
>> 
>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
>> %.9l %.6D %R" |grep user2
>> account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>> (Priority)
>> 
>> 
>> We have tried to create new users and new accounts this afternoon and all of 
>> them show (null) as their account when we break out the formatting rules on 
>> sacct.
>> 
>> sacctmgr add account accountname
>> sacctmgr add user username defaultaccount accountname
>> 
>> We have even one case where all users under and account are working fine 
>> except a user we added yesterday... so at some point in the past (logs 
>> aren't helping us thus far) the ability to actually sync up a user and an 
>> account for accounting purposes has left us. Also I have failed to mention 
>> to this point that we are still running Slurm 2.5.4, my apologies for that.
>> 
>> AC
>> 
>> 
>> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
>>> Sorry to spam the list, but we wanted to keep updates in flux.
>>> 
>>> We managed to find the issue in our mysqldb we are using for job accounting 
>>> which had the column value set to smallint (5) for that value, so it was 
>>> rounding things off, some SQL magic and we now have appropriate uid's 
>>> showing up. A new monkey wrench, some test jobs submitted by user3 below 
>>> get their fairshare value of 5000 as expected, just not user2... we just 
>>> cleared his jobs from the queue, and submitted another 100 jobs for testing 
>>> and none of them got a fairshare value...
>>> 
>>> In his entire history of using our cluster he hasn't submitted over 5000 
>>> jobs, in fact:
>>> 
>>> [root@slurm-master ~]# sacct -c 
>>> --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
>>> user2 | wc -l
>>> 2573
>>> 
>>> So we can't figure out why he's being overlooked.
>>> 
>>> AC
>>> 
>>> 
>>> On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
 We think we may be onto something, in sacct we were looking at the jobs 
 submitted by the users, and found that many users share the same uidnumber 
 in the slurm database. It seems to correlate with the size of the user's 
 uid number in our ldap directory... users who's uid number are greater 
 than 65535 get trunked to that number... users with uid numbers below that 
 keep their correct uidnumbers (user2 in the sample output below)
 
 
 
 
 [root@slurm-master ~]# sacct -c 
 --format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state
  |grep user2|head
 user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
 00:00:48  0:0 COMPLETED
 user2  27545 305

[slurm-dev] Re: fairshare incrementing

2013-08-23 Thread Alan V. Cowles
Yes, it's all us just running and killing test jobs as the various users now.

Ralph Castain  wrote:

>
>Ah, never mind - I see the difference now. Was looking for some info to be 
>different
>
>
>On Aug 23, 2013, at 12:17 PM, Ralph Castain  wrote:
>
>> Perhaps it is a copy/paste error - but those two tables are identical
>> 
>> On Aug 23, 2013, at 12:14 PM, Alan V. Cowles  wrote:
>> 
>>> 
>>> Final update for the day, we have found what is causing priority to be 
>>> overlooked we just don't know what is causing it...
>>> 
>>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
>>> %.9l %.6D %R" |grep user1
>>> (null)  181378lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181379lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181380lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181381lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181382lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181383lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181384lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181385lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181386lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> (null)  181387lowmem testbatc user1  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> 
>>> Compared to:
>>> 
>>> [root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T %.10M 
>>> %.9l %.6D %R" |grep user2
>>> account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED1 
>>> (Priority)
>>> 
>>> 
>>> We have tried to create new users and new accounts this afternoon and all 
>>> of them show (null) as their account when we break out the formatting rules 
>>> on sacct.
>>> 
>>> sacctmgr add account accountname
>>> sacctmgr add user username defaultaccount accountname
>>> 
>>> We have even one case where all users under and account are working fine 
>>> except a user we added yesterday... so at some point in the past (logs 
>>> aren't helping us thus far) the ability to actually sync up a user and an 
>>> account for accounting purposes has left us. Also I have failed to mention 
>>> to this point that we are still running Slurm 2.5.4, my apologies for that.
>>> 
>>> AC
>>> 
>>> 
>>> On 08/23/2013 11:22 AM, Alan V. Cowles wrote:
 Sorry to spam the list, but we wanted to keep updates in flux.
 
 We managed to find the issue in our mysqldb we are using for job 
 accounting which had the column value set to smallint (5) for that value, 
 so it was rounding things off, some SQL magic and we now have appropriate 
 uid's showing up. A new monkey wrench, some test jobs submitted by user3 
 below get their fairshare value of 5000 as expected, just not user2... we 
 just cleared his jobs from the queue, and submitted another 100 jobs for 
 testing and none of them got a fairshare value...
 
 In his entire history of using our cluster he hasn't submitted over 5000 
 jobs, in fact:
 
 [root@slurm-master ~]# sacct -c 
 --format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | grep 
 user2 | wc -l
 2573
 
 So we can't figure out why he's being overlooked.
 
 AC
 
 
 On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
> We think we may be onto something, in sacct we were looking at the jobs 
> submitted by the users, and found that many users share the same 
> uidnumber in the slurm database. It seems to correlate with the size of 
> the user's uid number in our ldap directory... users who's uid number are 
> greater than 65535 get trunked to that number... users with uid numbers 
> below that keep their correct uidnumbers (user2 in the sample output 
> below)
> 
> 
> 
> 
> [root@slurm-master ~]# sacct -c 
> --format=User

[slurm-dev] Re: fairshare incrementing

2013-08-27 Thread Alan V. Cowles


Hey guys,

We're still hung up on our priority scheduling here, but I had a thought 
while reading some other mailings to the list this morning. The syntax that 
others are using for user creation is not as basic as ours, and has 
other variables in place such as "fairshare=parent". Are these things that 
we need to specify when creating an account, or are they defaults? We're 
wondering if this is why our newer users aren't showing up with the 
correct accounts in squeue.


It still bugs us that accounts show up correctly in sacctmgr, just not 
in squeue for the purpose of enforcing priority. Could this be a bug 
that was corrected in later releases?


AC

On 08/23/2013 03:13 PM, Alan V. Cowles wrote:
Final update for the day, we have found what is causing priority to be 
overlooked we just don't know what is causing it...


[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user1
(null)  181378lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181379lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181380lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181381lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181382lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181383lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181384lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181385lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181386lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181387lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)


Compared to:

[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user2
account  181378lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181379lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181380lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181381lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181382lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181383lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181384lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181385lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181386lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)
account  181387lowmem testbatc user2  PENDING   0:00 UNLIMITED 
   1 (Priority)



We have tried to create new users and new accounts this afternoon and 
all of them show (null) as their account when we break out the 
formatting rules on sacct.


sacctmgr add account accountname
sacctmgr add user username defaultaccount accountname

We have even one case where all users under and account are working 
fine except a user we added yesterday... so at some point in the past 
(logs aren't helping us thus far) the ability to actually sync up a 
user and an account for accounting purposes has left us. Also I have 
failed to mention to this point that we are still running Slurm 2.5.4, 
my apologies for that.


AC


On 08/23/2013 11:22 AM, Alan V. Cowles wrote:

Sorry to spam the list, but we wanted to keep updates in flux.

We managed to find the issue in our mysqldb we are using for job 
accounting which had the column value set to smallint (5) for that 
value, so it was rounding things off, some SQL magic and we now have 
appropriate uid's showing up. A new monkey wrench, some test jobs 
submitted by user3 below get their fairshare value of 5000 as 
expected, just not user2... we just cleared his jobs from the queue, 
and submitted another 100 jobs for testing and none of them got a 
fairshare value...


In his entire history of using our cluster he hasn't submitted over 
5000 jobs, in fact:


[root@slurm-master ~]# sacct -c 
--format=user,jobid,jobname,start,elapsed,state,exitcode -u user2 | 
grep user2 | wc -l

2573

So we can't figure out why he's being overlooked.

AC


On 08/23/2013 10:31 AM, Alan V. Cowles wrote:
We think we may be onto something, in sacct we were looking at the 
jobs submitted by the users, and found that many users share the 
same uidnumber in the slurm database. It seems to correlate with the 
size of the user's uid number in our ldap directory... users who's 
uid number are greater than 65535 get trunked to that number... 
users with uid numbers below that keep their correct uidnumbers 
(user2 in the sample output below)





[root@slurm-master ~]# sacct -c 
--format=User,uid,JobID,JobName,NodeList,Start,Elapsed,ExitCode,DerivedExitCode,state 
|grep user2|head
user2  27545 30548   bwa node01-1 2013-07-08T13:04:25   
00:00:4

[slurm-dev] Re: fairshare incrementing

2013-08-28 Thread Alan V. Cowles


Hey guys,

We had a eureka moment, and have discovered the cause of the problem and 
how to get it back working. Now we need to prevent it from occurring 
again in the future.


Looking through the MySQL database led us to dead ends, as did every 
sacctmgr command we ran for the first couple of days. Finally we dug 
around in the transactions log and saw that there was a restart of 
slurmctld immediately after the last correctly functioning account was 
created. We actually wondered whether you had to bounce the daemon in order 
to make priority work for users.


Trying to find more info that we could cull from the command line, I also 
attempted to run sview on our master node and found it couldn't 
launch... it was then I discovered that if I ran scontrol ping, we got 
the following response:


Slurmctld(primary/backup) at slurm-master/slurm-backup are DOWN/UP
*
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*

Curious, we decided to restart it on the master node, and immediately 
entered a brief panic while it took a bit to re-import all of the jobs 
currently in our queue; but after a few minutes, we ended up in 
this status:


Slurmctld(primary/backup) at slurm-master/slurm-backup are UP/DOWN
*
** RESTORE SLURMCTLD DAEMON TO SERVICE **
*


So from what we can tell, on or about June 17, the slurmctld on our 
master host failed and the backup host took over running it. 
Meanwhile slurmdbd continued to run unfettered on our master node.


From what we can figure, the slurmctld on the backup node could not 
communicate properly with the slurmdbd on the master since it was a remote 
host. Once we restored slurmctld to running on the master node, with 
slurmdbd on the same host, fairshare priority began to work for all of 
our problematic users.


Now we want to find a way to make sure this doesn't happen again: 
perhaps allow slurmdbd to also run on the backup node, or have the 
backup node be able to make a remote call to the dbd host?


In slurmdbd.conf on the master node we have the following:

# slurmDBD info
DbdAddr=localhost
DbdHost=localhost

Would it help to put the IP address of the host itself here and start slurmdbd on 
the backup node as well?


On this page for ubuntu: 
http://manpages.ubuntu.com/manpages/jaunty/man5/slurmdbd.conf.5.html


We found a reference to a slurmdbdaddr value that should be placed in 
slurm.conf and possibly tell the slurmctld where the slurmdbd is 
running. Though most references to it seem to be ancient (slurmdbd 1.3?), 
and perhaps this is no longer needed in Slurm 2.5.4+.


Any thoughts on config modifications we could make?

Thanks in advance.

AC



On 08/27/2013 11:24 AM, Alan V. Cowles wrote:

Hey guys,

We're still hung on our priority scheduling here, but I had a thought 
reading some other mailings to the list this morning. The syntax that 
others are using with user creation is not as basic as ours, and has 
other variables in place suchas "fairshare=parent" are these things 
that we need to specify when creating an account or are they defaults, 
wondering if this is why our newer users aren't showing up with the 
correct accounts in squeue.


It still bugs us that accounts show up correctly in sacctmgr, just not 
in squeue for the point of enforcing priority. Could this be a bug 
corrected in later releases?


AC

On 08/23/2013 03:13 PM, Alan V. Cowles wrote:
Final update for the day, we have found what is causing priority to 
be overlooked we just don't know what is causing it...


[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user1
(null)  181378lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181379lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181380lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181381lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181382lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181383lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181384lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181385lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181386lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)
(null)  181387lowmem testbatc user1  PENDING   0:00 
UNLIMITED1 (Priority)


Compared to:

[root@cluster-login ~]# squeue  --format="%a %.7i %.9P %.8j %.8u %.8T 
%.10M %.9l %.6D %R" |grep user2
account  181378lowmem testbatc user2  PENDING   0:00 
UNLIMITED1 (Priority)
account  181379lowmem testbatc user2  PENDING   0:00 
UNLIMITED1 (Priority)
account  181380lowmem testbatc user2  PENDING   0:00 
UNLIMITED1 (Priority)
acco

[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Lipari, Don


> -Original Message-
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Wednesday, January 21, 2015 5:20 AM
> To: slurm-dev
> Subject: [slurm-dev] fairshare allocations
> 
> 
> The algorithm I use is fairtree under 14.11 but I believe that my
> question relates to any method.
> 
> As a University, we have many investments into a given cluster.  At the
> most simplistic level, lets assume there are but two two allocations.
> The method I have been using is to assign a value, as a percentage of
> ownership, to the various ACCOUNTs such that when summed across all
> accounts, they add to 100.
> 
> So chemistry might have a fairshare value of 20 as they contributed 20%
> of the funding.  Physics has a value of 10.  And so forth, with many
> having a fairshare value of 1 since no money was contributed.
> 
> In the past, I simply assigned either a fairshare value of parent to the
> users or assigned them a value of 1.
> 
> So lets take a user, call him Bill, who has a fairshare value of 1 under
> the account=chem.  It appears to me that this 1 share is actually a 1
> share of the total and not a 1 share of what the account=chem owns.  Am
> I reading this correctly here?

A share of 1 for Bill is a share of the total shares assigned to users
(or accounts) under the chem account.  Chem can have 1000 users, each with
1 share, but chem users' combined usage of the system will be throttled
to 20% based on job priorities calculated by the fair-share factor.

That works both ways:  if only one user from chem is submitting jobs, that
user can receive 20% of the resources of the cluster, even though they have
only one share of chem.

The most common practice is to assign a share of 1 to every user in an
account.  You can assign greater share values to users who are entitled
to more than their peers.
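
A small worked example of that arithmetic (made-up numbers, classic 
multifactor view): chem holds 20% of the top-level shares and every one of its 
1000 users holds 1 share within chem.

chem_account_share = 0.20                                 # chem's slice of the cluster
user_shares = {"user%d" % i: 1 for i in range(1, 1001)}   # 1000 users, 1 share each

total = sum(user_shares.values())
within_chem = user_shares["user1"] / total                # 0.001 of chem
overall = chem_account_share * within_chem                # 0.0002 of the whole cluster
print(within_chem, overall)

# If only user1 is submitting jobs, nothing stops that user from consuming
# the full 20% chem is entitled to; the 1-share value only matters when
# siblings inside chem compete with each other.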

Don Lipari

> 
> Thanks,
> Bill


[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Bill Wichser




On 01/21/2015 11:07 AM, Lipari, Don wrote:




-Original Message-
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Wednesday, January 21, 2015 5:20 AM
To: slurm-dev
Subject: [slurm-dev] fairshare allocations


The algorithm I use is fairtree under 14.11 but I believe that my
question relates to any method.

As a University, we have many investments into a given cluster.  At the
most simplistic level, lets assume there are but two two allocations.
The method I have been using is to assign a value, as a percentage of
ownership, to the various ACCOUNTs such that when summed across all
accounts, they add to 100.

So chemistry might have a fairshare value of 20 as they contributed 20%
of the funding.  Physics has a value of 10.  And so forth, with many
having a fairshare value of 1 since no money was contributed.

In the past, I simply assigned either a fairshare value of parent to the
users or assigned them a value of 1.

So lets take a user, call him Bill, who has a fairshare value of 1 under
the account=chem.  It appears to me that this 1 share is actually a 1
share of the total and not a 1 share of what the account=chem owns.  Am
I reading this correctly here?


A share of 1 for Bill is a share of the total shares assigned to users
(or accounts) under the chem account.  Chem can have 1000 users, each with
1 share, but chem users' combined usage of the system will be throttled
to 20% based on job priorities calculated by the fair-share factor.

That works both ways:  if only one user from chem is submitting jobs, that
user can receive 20% of the resources of the cluster, even though they have
only one share of chem.

The most common practice is to assign a share of 1 to every user in an
account.  You can assign greater share values to users who are entitled
to more than their peers.

Don Lipari



Thanks,
Bill


So that was my expectation.  But let's look at this account, truncated, 
with a user with a fairshare of 25 (using sshare -a -l -A ee -p):


Account|User|Raw Shares|Norm Shares|Raw Usage|Norm Usage|Effectv Usage|FairShare|Level FS|GrpCPUMins|CPURunMins|

ee||261|0.218227|189272064|0.047197|0.047197||4.623757||50912|
   ee|user1|1|0.009091|24151307|0.006022|0.127601|0.771261|0.071245||24605|
 ee|user2|1|0.009091|652289|0.000163|0.003446|0.780059|2.637872||0|
 ee|user3|25|0.227273|15684228|0.003911|0.082866|0.781525|2.742652||0|
...






So ee as an account gets fairshare=261 and gets a 0.218227 normalized 
share count.


A user underneath gets the expected 0.009091 normalized shares, since 
there are a lot of fairshare=1 users there.  user3 gets basically 
25x this value because user3's fairshare=25.


Yet that user's normalized shares value is actually MORE than the normalized shares 
for the account as a whole.  What should I make of this?


Bill


[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Lipari, Don
> -Original Message-
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Wednesday, January 21, 2015 8:23 AM
> To: slurm-dev
> Subject: [slurm-dev] RE: fairshare allocations
> 
> 
> 
> 
> On 01/21/2015 11:07 AM, Lipari, Don wrote:
> >
> >
> >> -Original Message-
> >> From: Bill Wichser [mailto:b...@princeton.edu]
> >> Sent: Wednesday, January 21, 2015 5:20 AM
> >> To: slurm-dev
> >> Subject: [slurm-dev] fairshare allocations
> >>
> >>
> >> The algorithm I use is fairtree under 14.11 but I believe that my
> >> question relates to any method.
> >>
> >> As a University, we have many investments into a given cluster.  At the
> >> most simplistic level, lets assume there are but two two allocations.
> >> The method I have been using is to assign a value, as a percentage of
> >> ownership, to the various ACCOUNTs such that when summed across all
> >> accounts, they add to 100.
> >>
> >> So chemistry might have a fairshare value of 20 as they contributed 20%
> >> of the funding.  Physics has a value of 10.  And so forth, with many
> >> having a fairshare value of 1 since no money was contributed.
> >>
> >> In the past, I simply assigned either a fairshare value of parent to
> the
> >> users or assigned them a value of 1.
> >>
> >> So lets take a user, call him Bill, who has a fairshare value of 1
> under
> >> the account=chem.  It appears to me that this 1 share is actually a 1
> >> share of the total and not a 1 share of what the account=chem owns.  Am
> >> I reading this correctly here?
> >
> > A share of 1 for Bill is a share of the total shares assigned to users
> > (or accounts) under the chem account.  Chem can have 1000 users, each
> with
> > 1 share, but chem users' combined usage of the system will be throttled
> > to 20% based on job priorities calculated by the fair-share factor.
> >
> > That works both ways:  if only one user from chem is submitting jobs,
> that
> > user can receive 20% of the resources of the cluster, even though they
> have
> > only one share of chem.
> >
> > The most common practice is to assign a share of 1 to every user in an
> > account.  You can assign greater share values to users who are entitled
> > to more than their peers.
> >
> > Don Lipari
> >
> >>
> >> Thanks,
> >> Bill
> 
> So that was my expectation.  But lets look at this account, truncated,
> with a user with a fairshare of 20 (using sshare -a -l -A ee -p)
> 
> Account|User|Raw Shares|Norm Shares|Raw Usage|Norm Usage|Effectv
> Usage|FairShare|Level FS|GrpCPUMins|CPURunMins|
> ee||261|0.218227|189272064|0.047197|0.047197||4.623757||50912|
> 
> ee|user1|1|0.009091|24151307|0.006022|0.127601|0.771261|0.071245||24605|
>   ee|user2|1|0.009091|652289|0.000163|0.003446|0.780059|2.637872||0|
>   ee|user3|25|0.227273|15684228|0.003911|0.082866|0.781525|2.742652||0|
> ...
> 
> 
> 
> 
> 
> 
> So ee as an account gets fairshare=261 and gets a 0.218227 normalized
> share count.
> 
> A user underneath gets the expected 0.009091 normalized shares since
> there are a lot of fairshare=1 users there.  The user3 gets basically
> 25x this value as the fairshare for user3=25
> 
> Yet the normalized shares is actually MORE than the normalized shares
> for the account as a whole.  What should I make of this?

That looks like a bug.  I don't see that behavior on our systems running slurm 
14.03.11.
Don

> Bill


[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Ryan Cox



On 01/21/2015 09:23 AM, Bill Wichser wrote:


A user underneath gets the expected 0.009091 normalized shares since 
there are a lot of fairshare=1 users there.  The user3 gets basically 
25x this value as the fairshare for user3=25


Yet the normalized shares is actually MORE than the normalized shares 
for the account as a whole.  What should I make of this?




This is actually by design in Fair Tree and is different from other 
algorithms.  The manpage for sshare covers this under "FAIR_TREE 
MODIFICATIONS".  The manpage states that Norm Shares is "The shares 
assigned to the user or account normalized to the total number of 
assigned shares within the level."  Basically, Norm Shares is the 
association's raw shares value divided by the sum of it and its sibling 
associations' assigned raw shares values.  For example, if an account 
has 10 users, each having 1 assigned raw share, the Norm Shares value 
will be .1 for each of those users under Fair Tree.


Fair Tree only uses Norm Shares and Effective Usage (the other sshare 
field that's modified) when comparing sibling associations. Our Slurm UG 
presentation slides also mention this on pages 35 and 76 
(http://slurm.schedmd.com/SUG14/fair_tree.pdf).
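
Plugging Bill's numbers into that definition reproduces the sshare output.  A 
quick sketch; the 110 total is inferred from 1/110 = 0.009091 in his listing, 
so it is an assumption about the rest of the ee account rather than data taken 
from it:

raw_shares = {"user1": 1, "user2": 1, "user3": 25}
other_one_share_users = 110 - sum(raw_shares.values())   # assumed remaining fairshare=1 users

level_total = sum(raw_shares.values()) + other_one_share_users   # 110
for user, raw in raw_shares.items():
    print(user, round(raw / level_total, 6))   # 0.009091, 0.009091, 0.227273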


Ryan


[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Bill Wichser


Okay, I get it now.  All the shares under a given ACCOUNT add up to 1.0. 
 The divvy, and why some are higher than the actual ACCOUNT's number, 
is merely the effect from this allocation amongst the users underneath.


So if the ACCOUNT gets 20% (0.20 or nearly so), then all the users 
underneath, when summed, have a value of 1.0.  A user who gets a bigger cut 
by being assigned a fairshare value >1 may end up with a value exceeding 
the ACCOUNT's value, but that is not to be confused with a fairshare 
exceeding that of its parent.  It only means that, of the parent's 
fairshare, that user gets this percentage of the cut.


Got it!

Thanks much,
Bill

On 1/21/2015 12:57 PM, Ryan Cox wrote:



On 01/21/2015 09:23 AM, Bill Wichser wrote:


A user underneath gets the expected 0.009091 normalized shares since
there are a lot of fairshare=1 users there.  The user3 gets basically
25x this value as the fairshare for user3=25

Yet the normalized shares is actually MORE than the normalized shares
for the account as a whole.  What should I make of this?



This is actually by design in Fair Tree and is different from other
algorithms.  The manpage for sshare covers this under "FAIR_TREE
MODIFICATIONS".The manpage states that Norm Shares is "The shares
assigned to the user or account normalized to the total number of
assigned shares within the level."  Basically, the Norm Shares is the
association's raw shares value divided by the sum of it and its sibling
associations' assigned raw shares values.  For example, if an account
has 10 users, each having 1 assigned raw share, the Norm Shares value
will be .1 for each of those users under Fair Tree.

Fair Tree only uses Norm Shares and Effective Usage (the other sshare
field that's modified) when comparing sibling associations. Our Slurm UG
presentation slides also mention this on pages 35 and 76
(http://slurm.schedmd.com/SUG14/fair_tree.pdf).

Ryan


[slurm-dev] Re: Fairshare points return rate

2014-07-11 Thread jette



The way I understood it was that the full share of points would
return in 7 days.


It's a _HalfLife_
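
In other words, with PriorityDecayHalfLife=7-0 the accumulated usage is halved 
every 7 days; it never resets to zero after one period.  A quick illustration 
with a made-up starting usage:

half_life_days = 7
usage = 8131420   # made-up raw usage, similar in size to abc9 below

for days in (0, 7, 14, 21, 28):
    remaining = usage * 0.5 ** (days / half_life_days)
    print("after %2d days: %12.0f" % (days, remaining))  # full, 1/2, 1/4, 1/8, 1/16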

Quoting Christopher B Coffey :


Hello,

I think I have either a configuration problem, or I am not understanding
something correctly.  I have the following set for fairshare in slurm.conf:

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0


Yet a user's fairshare points, shown by sshare, do not return at the rate
that I had envisioned.  The points are returning extremely slowly (many
weeks).  The way I understood it was that the full share of points would
return in 7 days.  In my example below, a user would have full fairshare
points when the number was 0.25.

 Account      User  Raw Shares  Norm Shares  Raw Usage  Effectv Usage  FairShare
--------  --------  ----------  -----------  ---------  -------------  ---------
root                                   1.00  116705341           1.00       0.50
 root         root           1         0.50          0           0.00       1.00
 normal                      1         0.50  116705341           1.00       0.25
  normal      abc1           1     0.006329          0       0.012658       0.25
  normal      abc2           1     0.006329          0       0.012658       0.25
  normal      abc3           1     0.006329          0       0.012658       0.25
  normal      abc4           1     0.006329   27021063       0.241260       0.00
  normal      abc5           1     0.006329      43576       0.013027   0.240107
  normal      abc6           1     0.006329          0       0.012658       0.25
  normal      abc7           1     0.006329    1045805       0.021506   0.094868
  normal      abc8           1     0.006329          3       0.012658       0.24
  normal      abc9           1     0.006329    8131420       0.081451   0.000134



Also, I’m not sure I have the shares correctly configured now that I think
about it.  In general fairshare priority has been working, but the points
returning issue has vexed me, thanks guys!

Chris




[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Lipari, Don
Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend 
of cpu and memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder 
time justifying a user's sshare report because the usage numbers would no 
longer track jobs' historical cpu usage.  Your response to a user who asked you 
to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari

> -Original Message-
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Friday, July 25, 2014 6:14 AM
> To: slurm-dev
> Subject: [slurm-dev] fairshare - memory resource allocation
> 
> 
> I'd like to revisit this...
> 
> 
> After struggling with memory allocations in some flavor of PBS for over
> 20 years, it was certainly a wonderful thing to have cgroup support
> right out of the box with Slurm.  No longer do we have a shared node's
> jobs eating all the memory and killing everything running there.  But we
> have found that there is a cost to this and that is a failure to
> adequately feed back this information to the fairshare mechanism.
> 
> In looking at running jobs over the past 4 months, we found a spot where
> we could reduce the DefMemPerCPU allocation in slurm.conf to a value
> about 1G less than the actual G/core available.  This meant that we had
> to notify the users close to this max value so that they could adjust
> their scripts. We also notified users that if this value was too high
> that they'd do best to reduce that limit to exactly what they require.
> This has proven much less successful.
> 
> So our default is 3G/core with an actual node having 4G/core available.
>   This allows some bigger memory jobs and some smaller memory jobs to
> make use of the node as there are available cores but not enough memory
> for the default case.
> 
> Now that is good. It allows higher utilization of nodes, all the while
> protecting the memory of each other's processes.  But the problem of
> fairshare comes about pretty quickly when there are jobs requiring say
> half the node's memory.  This is mostly serial jobs requesting a single
> core.  So this leaves about 11 cores with only about 2G/core left.
> Worse, when it comes to fairshare calculations it appears that these
> jobs are only using a single core when in fact they are using half a
> node.  You can see where this is causing issues.
> 
> Fairshare has a number of other issues as well, which I will send under
> a different email.
> 
> Now maybe this is just a matter of constant monitoring of user jobs and
> proactively going after those users having small memory per core
> requirements.  We have attempted this in the past and have found that
> the first job they run which crashes due to insufficient memory results
> in all scripts being increased and so the process is never ending.
> 
> Another solution is to simply trust the users and just keep reminding
> them about allocations.  They are usually a smart bunch and are quite
> creative when it comes to getting jobs to run!  So maybe I am concerned
> over nothing at all and things will just work out.
> 
> Bill


[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Ryan Cox


Bill and Don,

We have wondered about this ourselves.  I just came up with this idea 
and haven't thought it through completely, but option two seems like the 
easiest.  For example, you could modify lines like 
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 
to have a MAX() of a few different types.


I seem to recall seeing this on the list or in a bug report somewhere 
already, but you could have different charge rates for memory or GPUs 
compared to a CPU, maybe on a per partition basis. You could give each 
of them a charge rate like:

PartitionName=p1  ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 ..

So the line I referenced would be something like the following (except 
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU, 
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);


In this case, each CPU is 1.0 but each GB of RAM is 0.5.  Assuming no 
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting 
usage is 1.0.  But if they use 4 GB of RAM and 1 CPU, it is 2.0 just 
like they had been using 2 CPUs.  Essentially you define every 2 GB of 
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with 
"cpu equivalents".


It might be harder to explain to users but I don't think it would be too 
bad.


Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:

Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and 
memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder time 
justifying a user's sshare report because the usage numbers would no longer track jobs' 
historical cpu usage.  Your response to a user who asked you to justify their sshare usage 
report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari


-Original Message-
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there.  But we
have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot where
we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that if this value was too high
that they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring say
half the node's memory.  This is mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where this is causing issues.

Fairshare has a number of other issues as well, which I will send under
a different email.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Bill Wichser


Thank you Ryan.  Not sure how we will proceed here.

Bill

On 7/25/2014 12:30 PM, Ryan Cox wrote:


Bill and Don,

We have wondered about this ourselves.  I just came up with this idea 
and haven't thought it through completely, but option two seems like 
the easiest.  For example, you could modify lines like 
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 
to have a MAX() of a few different types.


I seem to recall seeing this on the list or in a bug report somewhere 
already, but you could have different charge rates for memory or GPUs 
compared to a CPU, maybe on a per partition basis. You could give each 
of them a charge rate like:
PartitionName=p1  ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 
..


So the line I referenced would be something like the following (except 
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU, 
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);


In this case, each CPU is 1.0 but each GB of RAM is 0.5.  Assuming no 
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting 
usage is 1.0.  But if they use 4 GB of RAM and 1 CPU, it is 2.0 just 
like they had been using 2 CPUs.  Essentially you define every 2 GB of 
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with 
"cpu equivalents".


It might be harder to explain to users but I don't think it would be 
too bad.


Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:

Bill,

As I understand the dilemma you presented, you want to maximize the 
utilization of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge 
users into requesting only the amount of memory they will need for 
their jobs.  The nudge would be in the form of decreased fair-share 
priority for users' jobs that request only one core but lots of memory.


I don't know of a way for Slurm to do this as it exists.  I can only 
offer alternatives that have their pros and cons.


One alternative would be to add memory usage support to the 
multifactor priority plugin.  This would be a substantial undertaking 
as it touches code not just in multifactor/priority_multifactor.c but 
also in structures that are defined in common/assoc_mgr.h as well as 
sshare itself.


A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a 
configurable blend of cpu and memory usage.  These changes could be 
more localized to the multifactor/priority_multifactor.c module.  
However you would have a harder time justifying a user's sshare 
report because the usage numbers would no longer track jobs' 
historical cpu usage.  Your response to a user who asked you to 
justify their sshare usage report would be, "trust me, it's right".


A third alternative (as I'm sure you know) is to give up on perfectly 
packed nodes and make every 4G of memory requested cost 1 cpu of 
allocation.


Perhaps there are other options, but those are the ones that 
immediately come to mind.


Don Lipari


-Original Message-
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there.  
But we

have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot 
where

we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that if this value was too high
that they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring say
half the node's memory.  This is mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where th

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-27 Thread Blomqvist Janne

Hi,

As a variation on the second option you propose, take a look at the concept of 
Dominant Resource Fairness [1], which is an algorithm for achieving 
multi-resource (e.g. cpu's, memory, disk/net BW, ...) fairness. By using 
"dominant share"-secs instead of cpu-secs in the current accounting code the 
changes would similarly be limited in scope.

[1] http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
https://www.usenix.org/legacy/events/nsdi11/tech/slides/ghodsi.pdf

--
Janne Blomqvist


From: Lipari, Don [lipa...@llnl.gov]
Sent: Friday, July 25, 2014 19:04
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend 
of cpu and memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder 
time justifying a user's sshare report because the usage numbers would no 
longer track jobs' historical cpu usage.  Your response to a user who asked you 
to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari

> -Original Message-
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Friday, July 25, 2014 6:14 AM
> To: slurm-dev
> Subject: [slurm-dev] fairshare - memory resource allocation
>
>
> I'd like to revisit this...
>
>
> After struggling with memory allocations in some flavor of PBS for over
> 20 years, it was certainly a wonderful thing to have cgroup support
> right out of the box with Slurm.  No longer do we have a shared node's
> jobs eating all the memory and killing everything running there.  But we
> have found that there is a cost to this and that is a failure to
> adequately feed back this information to the fairshare mechanism.
>
> In looking at running jobs over the past 4 months, we found a spot where
> we could reduce the DefMemPerCPU allocation in slurm.conf to a value
> about 1G less than the actual G/core available.  This meant that we had
> to notify the users close to this max value so that they could adjust
> their scripts. We also notified users that if this value was too high
> that they'd do best to reduce that limit to exactly what they require.
> This has proven much less successful.
>
> So our default is 3G/core with an actual node having 4G/core available.
>   This allows some bigger memory jobs and some smaller memory jobs to
> make use of the node as there are available cores but not enough memory
> for the default case.
>
> Now that is good. It allows higher utilization of nodes, all the while
> protecting the memory of each other's processes.  But the problem of
> fairshare comes about pretty quickly when there are jobs requiring say
> half the node's memory.  This is mostly serial jobs requesting a single
> core.  So this leaves about 11 cores with only about 2G/core left.
> Worse, when it comes to fairshare calculations it appears that these
> jobs are only using a single core when in fact they are using half a
> node.  You can see where this is causing issues.
>
> Fairshare has a number of other issues as well, which I will send under
> a different email.
>
> Now maybe this is just a matter of constant monitoring of user jobs and
> proactively going after those users having small memory per core
> requirements.  We have attempted this in the past and have found that
> the first job they run which crashes due to insufficient memory results
> in all scripts being increased and so the process is never ending.
>
> Another solution is to simply trust the users and just keep reminding
> them about allocations.  They are usually a smart bunch and are quite
> creative when it comes to getting jobs to run!  So maybe I am concerned
> over nothing at all and things will just work out.
>
> Bill

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-29 Thread Ryan Cox
I'm interested in hearing opinions on this, if any.  Basically, I think 
there is an easy solution to the problem of a user using few CPUs but a 
lot of memory and that not being reflected well in the CPU-centric usage 
stats.


Below is my proposal.  There are likely some other good approaches out 
there too (Don and Janne presented some) so feel free to tell me that 
you don't like this idea :)



Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU 
equivalents" * time) instead of just (CPUs * time).  The "CPU 
equivalent" would be a MAX() of CPUs, memory, nodes, GPUs, energy over 
that time period, or whatever multiplied by a corresponding charge rate 
that an admin can configure on a per partition basis.


I wrote a simple proof of concept patch to demonstrate this (see "Proof 
of Concept" below for details).



Longer version

The CPU equivalent would be used in place of total_cpus for calculating 
usage_raw.  I propose that the default charge rate be 1.0 for each CPU 
in a job and 0.0 for everything else.  This is the current behavior so 
there are no behavior changes if you choose not to define a different 
charge rate.  The reason I think this should be done on a partition 
basis is because different partitions may have nodes with different 
memory/core ratios, etc. so one partition may have 2 GB/core and another 
may have 8 GB/core nodes and you may want to charge differently on each.


If you define the charge rate for each CPU to be 1.0 and the charge rate 
per GB of memory to be 0.5, that is saying that 2 GB of memory will be 
equivalent to the charge rate for 1 CPU.  4 GB of memory would be 
equivalent to 2 CPUs (4 GB * 0.5/GB).  Since it is a MAX() of all the 
available (resource * charge_rate) combinations, the largest value is 
chosen.  If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the 
user gets charged for using all the RAM.  If a user uses 16 CPUs and 1 
MB, the user gets charged for 16 CPUs.



Downsides

The problem that is not completely solved is if a user uses 1 CPU but 
3/4 of the memory on a node.  Then they only get billed for 3/4 of the 
node but might make it unusable for others who need a whole or half 
node.  I'm not sure of a great way to solve that besides modifying the 
request in a job submit plugin or requiring exclusive node access.


One other complication is for resources that include a counter rather 
than a static allocation value, such as network bandwidth or energy.  
This is a problem because the current approach is to immediately begin 
decaying the cputime (aka usage) as it accumulates.  This means you 
would have to keep a delta value for each resource with a counter, 
meaning you track that 5 GB have been transmitted since the last decay 
thread iteration then only add that 5 GB.  This could get messy when 
comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb) 
each iteration since the bandwidth may never reach a high enough value 
to matter between iterations but might when considered as an entire job.


I don't think this proposal would be too bad for something like energy.  
You could define a charge rate per joule (or kilojoule or whatever) that 
would equal the node's minimum power divided by core count.  Then you 
look at the delta of that time period.  If they were allocated all cores 
and used minimum power, they get charged 1.0 * core count.  If they were 
allocated all cores and used maximum power, they effectively get charged 
for the difference in the node's max energy and min energy times the 
energy charge rate.  This calculation, as with others, would occur once 
per decay thread iteration.
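
For a counter-type resource, a minimal sketch of that per-iteration delta 
charge (the names here are hypothetical, not existing Slurm code):

/* Sketch only: charge a counter-type resource (e.g. energy or bytes
 * transmitted) by the delta accumulated since the last decay thread
 * iteration. */
double counter_charge(double counter_now, double *counter_last_seen,
                      double charge_per_unit)
{
    double delta = counter_now - *counter_last_seen;
    *counter_last_seen = counter_now;
    return delta * charge_per_unit;
}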



User Education

The reason I like this approach is that it is incredibly simple to 
implement and I don't think it takes much effort to explain to users.  
It would be easy to add other resources you want to charge for (it would 
require a code addition, though it would be pretty simple if the data is 
available in the right structs).  It doesn't require any RPC changes.  
sshare, etc only need manpage clarifications to say that the usage data 
is "CPU equivalents".  No new fields are required.


As for user education, you just need to explain the concept of "CPU 
equivalents", something that can be easily done in the documentation.  
The slurm.conf partition lines would be relatively easy to read too.  If 
you don't need to change the behavior, no slurm.conf changes or 
explanations to users are required.



Proof of Concept

I did a really quick proof of concept (attached) based on the master 
branch.  It is very simple to charge for most things as long as the data 
is there in the existing structs.  One caveat for the test patch is that 
I didn't see a float handler in the config parser so I skipped over that 
for the test.  Instead, each config parameter in slurm.conf should be 
set to (desired_value * 1000) for now.  Proper float handling can be 
added if this is the route people want to take.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-30 Thread Blomqvist Janne

Hi,

if I understand it correctly, this is actually very close to Dominant Resource 
Fairness (DRF) which I mentioned previously, with the difference that in DRF 
the charge rates are determined automatically from the available resources (in 
a partition) rather than being specified explicitly by the administrator. So 
for an example, say we have a partition with 100 cores and 400 GB memory. Now 
for a job requesting (10CPU's, 20 GB) the domination calculation proceeds as 
follows:

1) Calculate the "domination vector" by dividing each element in the request 
vector (here, CPU & MEM) with the available resources. That is (10/100, 20/400) 
= (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it "dominates" the 
others, hence the name of the algorithm) as the one to use in fairshare 
calculations, accounting etc. In this case, the CPU element (0.1). 

Now for another job request, (1CPU, 20 GB) the domination vector is (0.01, 
0.05) and the MAX element is the memory element (0.05), so in this case the 
memory part of the request dominates.
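
A rough sketch of that domination calculation (illustrative only, not 
Slurm code; the function name is made up):

/* Sketch only: DRF "dominant share" of a (CPU, memory) request against
 * the partition totals. */
double dominant_share(double req_cpus, double req_mem_gb,
                      double part_cpus, double part_mem_gb)
{
    double cpu_share = req_cpus / part_cpus;
    double mem_share = req_mem_gb / part_mem_gb;
    return cpu_share > mem_share ? cpu_share : mem_share;
}
/* (10 CPUs, 20 GB) on (100 CPUs, 400 GB) -> MAX(0.10, 0.05) = 0.10,
 * CPU dominates; (1 CPU, 20 GB) -> MAX(0.01, 0.05) = 0.05, memory
 * dominates. */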

In your patch you have used "cpu-sec equivalents" rather than "dominant share 
secs", but that's just a difference of a scaling factor. From a backwards 
compatibility and user education point of view cpu-sec equivalents seem like a 
better choice to me, actually.

So while your patch is more flexible than DRF in that it allows arbitrary charge 
rates to be specified, I'm not sure it makes sense to specify rates different 
from the DRF ones? Or if one does specify different rates, it might end up 
breaking some of the fairness properties that are described in the DRF paper 
and opens up the algorithm for gaming?

--
Janne Blomqvist


From: Ryan Cox [ryan_...@byu.edu]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

I'm interested in hearing opinions on this, if any.  Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory and that not being reflected well in the CPU-centric usage
stats.

Below is my proposal.  There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time).  The "CPU
equivalent" would be a MAX() of CPUs, memory, nodes, GPUs, energy over
that time period, or whatever multiplied by a corresponding charge rate
that an admin can configure on a per partition basis.

I wrote a simple proof of concept patch to demonstrate this (see "Proof
of Concept" below for details).


Longer version

The CPU equivalent would be used in place of total_cpus for calculating
usage_raw.  I propose that the default charge rate be 1.0 for each CPU
in a job and 0.0 for everything else.  This is the current behavior so
there are no behavior changes if you choose not to define a different
charge rate.  The reason I think this should be done on a partition
basis is because different partitions may have nodes with different
memory/core ratios, etc. so one partition may have 2 GB/core and another
may have 8 GB/core nodes and you may want to charge differently on each.

If you define the charge rate for each CPU to be 1.0 and the charge rate
per GB of memory to be 0.5, that is saying that 2 GB of memory will be
equivalent to the charge rate for 1 CPU.  4 GB of memory would be
equivalent to 2 CPUs (4 GB * 0.5/GB).  Since it is a MAX() of all the
available (resource * charge_rate) combinations, the largest value is
chosen.  If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the
user gets charged for using all the RAM.  If a user uses 16 CPUs and 1
MB, the user gets charged for 16 CPUs.


Downsides

The problem that is not completely solved is if a user uses 1 CPU but
3/4 of the memory on a node.  Then they only get billed for 3/4 of the
node but might make it unusable for others who need a whole or half
node.  I'm not sure of a great way to solve that besides modifying the
request in a job submit plugin or requiring exclusive node access.

One other complication is for resources that include a counter rather
than a static allocation value, such as network bandwidth or energy.
This is a problem because the current approach is to immediately begin
decaying the cputime (aka usage) as it accumulates.  This means you
would have to keep a delta value for each resource with a counter,
meaning you track that 5 GB have been transmitted since the last decay
thread iteration then only add that 5 GB.  This could get messy when
comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb)
each iteration since the bandwidth may never reach a high enough value
to matter between iterations but might when c

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Bjørn-Helge Mevik

Just a short note about terminology.  I believe "processor equivalents"
(PE) is a much used term for this.  It is at least what Maui and Moab
uses, if I recall correctly.  The "resource*time" would then be PE seconds
(or hours, or whatever).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Janne,

I appreciate the feedback.  I agree that it makes the most sense to 
specify rates like DRF most of the time.  However, there are some use 
cases that I'm aware of and others that are probably out there that 
would make a DRF imitation difficult or less desirable if it's the only 
option.


We happen to have one partition that has mixed memory amounts per node, 
32 GB and 64 GB.  Besides the memory differences (long story), the nodes 
are homogeneous and each have 16 cores.  I'm not sure I would like the 
DRF approach for this particular scenario.  In this case we would like 
to set the charge rate to be .5/GB, or 1 core == 2 GB RAM.  If someone 
needs 64 GB per node, they are contending for a more limited resource 
and we would be happy to double the charge rate for the 64 GB nodes.  If 
they need all 64 GB, they would end up being charged for 32 
CPU/processor equivalents instead of 16.  With DRF that wouldn't be 
possible if I understand correctly.


One other feature that could be interesting is to have a "baseline" 
standard for a CPU charge on a per-partition basis.  Let's say that you 
have three partitions:  old_hardware, new_hardware, and 
super_cooled_overclocked_awesomeness.  You could set the per CPU charges 
to be 0.8, 1.0, and 20.0.  That would reflect that a cpu-hour on one 
partition doesn't result in the same amount of computation as in another 
partition.  You could accomplish the same thing automatically by using a 
QOS (and maybe some other parameter I'm not aware of) and maybe a job 
submit plugin but this would make it easier.  I don't know that we would 
do this in our setup but it would be possible.


It would be possible to add a config parameter that is something like 
Mem=DRF that would auto-configure it to match.  The one question I have 
about that approach is what to do about partitions with non-homogeneous 
nodes.  Does it make sense to sum the total cores and memory, etc or 
should it default to a charge rate that is the min() of the node 
configurations?  Of course, partitions with mixed node types could be 
difficult to support no matter what method is used for picking charge rates.


So yes, having a DRF-like auto-configuration could be nice and we might 
even use it for most of our partitions.  I don't think I'll attempt it 
for the initial implementation but we'll see.


Thanks,
Ryan

On 07/30/2014 03:31 PM, Blomqvist Janne wrote:

Hi,

if I understand it correctly, this is actually very close to Dominant Resource 
Fairness (DRF) which I mentioned previously, with the difference that in DRF 
the charge rates are determined automatically from the available resources (in 
a partition) rather than being specified explicitly by the administrator. So 
for an example, say we have a partition with 100 cores and 400 GB memory. Now 
for a job requesting (10CPU's, 20 GB) the domination calculation proceeds as 
follows:

1) Calculate the "domination vector" by dividing each element in the request vector 
(here, CPU & MEM) with the available resources. That is (10/100, 20/400) = (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it "dominates" the 
others, hence the name of the algorithm) as the one to use in fairshare calculations, 
accounting etc. In this case, the CPU element (0.1).

Now for another job request, (1CPU, 20 GB) the domination vector is (0.01, 
0.05) and the MAX element is the memory element (0.05), so in this case the 
memory part of the request dominates.

In your patch you have used "cpu-sec equivalents" rather than "dominant share 
secs", but that's just a difference of a scaling factor. From a backwards compatibility and 
user education point of view cpu-sec equivalents seem like a better choice to me, actually.

So while your patch is more flexible than DRF in that it allows arbitrary charge 
rates to be specified, I'm not sure it makes sense to specify rates different 
from the DRF ones? Or if one does specify different rates, it might end up 
breaking some of the fairness properties that are described in the DRF paper 
and opens up the algorithm for gaming?

--
Janne Blomqvist

________________
From: Ryan Cox [ryan_...@byu.edu]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

I'm interested in hearing opinions on this, if any.  Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory and that not being reflected well in the CPU-centric usage
stats.

Below is my proposal.  There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time).  Th

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Thanks.  I can certainly call it that.  My understanding is that this 
would be a slightly different implementation from Moab/Maui, but I don't 
know those as well so I could be wrong.  Either way, the concept is 
similar enough that a more recognizable term might be good.


Does anyone else have thoughts on this?  I called it "CPU equivalents" 
because the calculation in the code is currently ("total_cpus" * time) 
so I stuck with CPUs.  Slurm seems to use lots of terms somewhat 
interchangeably so I couldn't really decide.  I don't really have an 
opinion on the name so I'll just accept what others decide.


Ryan

On 07/31/2014 02:28 AM, Bjørn-Helge Mevik wrote:

Just a short note about terminology.  I believe "processor equivalents"
(PE) is a much used term for this.  It is at least what Maui and Moab
uses, if I recall correctly.  The "resource*time" would then be PE seconds
(or hours, or whatever).



[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


All,

There has been more conversation on 
http://bugs.schedmd.com/show_bug.cgi?id=858.  It might be good to post 
future comments there so we have just one central location for 
everything.  No worries if you'd rather reply on the list.


Once a solution is ready I'll post something to the list so everyone is 
aware.


Ryan


[slurm-dev] RE: fairshare - memory resource allocation

2014-08-20 Thread Ulf Markwardt

Hi all,
this is a very interesting approach.
I hope we find a chance to discuss it in Lugano.
Ulf


--
___
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640  WWW:  http://www.tu-dresden.de/zih





[slurm-dev] Re: Fairshare factor not raising

2017-06-06 Thread Simon Kreuzer

Hello,

I am still experiencing the same issue, which is described below. I 
would be glad to get any hint on where to look for a solution. Do you 
think it might help to recreate the database from scratch?


Kind Regards and thank you all in advance!


On 05/30/2017 02:02 PM, Simon Kreuzer wrote:


Hello,

After a crash of the machine on which slurmctld and slurmdbd are 
running, fairshare factors as seen in sshare are only decreasing for 
people submitting jobs. Even though the corresponding users didn't 
submit jobs for days, the fairshare factor stays zero. It seems like 
PriorityDecayHalfLife=7-0 does not have any effect. I restarted both 
the slurmctld and slurmdbd daemons and also tried resetting using 
PriorityUsageResetPeriod=NOW. It didn't change anything.


Has anyone a hint where this issue might come from? It used to work 
before.



The priority settings in slurm.conf are the following:

FastSchedule=1
SchedulerType=sched/backfill
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=14-0
PriorityFavorSmall=YES
PriorityFlags=SMALL_RELATIVE_TO_TIME
PriorityWeightAge=5000
PriorityWeightFairshare=10
PriorityWeightJobSize=1000
PriorityWeightPartition=1
PriorityWeightQOS=0
PriorityUsageResetPeriod=NONE

Thanks in advance!







[slurm-dev] RE: Fairshare=parent on an account: What should it do?

2014-06-11 Thread Brown George Andrew

Hi Ryan,

We currently have a similar setup which may achieve what you want.

We have parent accounts which are assigned a fairshare based upon the amount of 
funding they provide. In this setup we do not use fairshare=parent; rather, the 
individual user accounts have their group as the parent. Both the parent group 
and the individual user accounts have a fairshare value.

For example;

group1 40
 |
 |---> user_foo account 10
 |      |
 |       ---> user_foo 1
 |
  ---> user_bar account 10
        |
         ---> user_bar 1

group2 30
 |
  ---> user account 10
        |
         ---> user 1

group3 30
 |
  ---> user account 10
        |
         ---> user 1

The user accounts under group1 will have a higher priority from their fairshare 
factor than users from groups 2 or 3. As users are using their own individual 
child accounts, there is also a fairshare in effect within the group; this 
allows a user who has not run very much to be given higher priority than a 
heavier user in the same group, which spares them long queuing times.

I have copied and pasted some sacctmgr commands with a number of fields removed, 
in case the above ASCII diagram suffers a formatting error.
sacctmgr show account group1 withassoc

Account  Descr   Par Name  Share
group1   group1  root      40

sacctmgr show account foo withassoc

Account  Descr       Par Name  Share
foo      group1_foo  group1    10
foo      group1_foo  foo       1

As far as account automation goes, all our users exist within LDAP, so we use a 
cron job to automatically poll LDAP and add users if they are not already present 
in Slurm. You may be able to give coordinator rights to the grad student, but 
that may be more control than you want to give.

Kind regards,
George

From: Ryan Cox [ryan_...@byu.edu]
Sent: 11 June 2014 00:20
To: slurm-dev
Subject: [slurm-dev] Fairshare=parent on an account: What should it do?

We're trying to figure out what the intended behavior of
Fairshare=parent is when set on an account
(http://bugs.schedmd.com/show_bug.cgi?id=864). We know what the actual
behavior is but we're wondering if anyone actually likes the current
behavior. There could be some use case out there that we don't know about.

For example, you can end up with a scenario like the following:
                acctProf
               /    |    \
              /     |     \
  acctTA(parent)  uD(5)  uE(5)
     /   |   \
    /    |    \
 uA(5) uB(5) uC(5)


The number in parentheses is the Fairshare value according to sacctmgr. We
incorrectly thought that Fairshare=parent would essentially collapse the
tree so that uA - uE would all be on the same level. Thus, all five
users would each get 5 / 25 shares.

What actually happens is you get the following shares at the user level:
shares (uA) = 5 / 15 = .333
shares (uB) = 5 / 15 = .333
shares (uC) = 5 / 15 = .333
shares (uD) = 5 / 10 = .5
shares (uE) = 5 / 10 = .5

That's pretty far off from each other, but not as far as it would be if
one account had two users and the other had forty. Assuming this
demonstration value of 5 shares, that would be:
user_in_small_account = 5 / (2*5) = .5
user_in_large_account = 5 / (40*5) = .025
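
For reference, a minimal sketch of the level-shares arithmetic above (my 
own illustration, not the Slurm implementation):

#include <stdio.h>

/* Sketch only: replicates the share arithmetic above.  uA-uC split the
 * 15 raw shares at their level; uD and uE split the 10 at theirs
 * (acctTA, set to Fairshare=parent, contributes none of its own). */
int main(void)
{
    double under_acctTA   = 5.0 / (3 * 5.0);  /* uA, uB, uC -> 0.333 */
    double under_acctProf = 5.0 / (2 * 5.0);  /* uD, uE     -> 0.5   */
    printf("uA-uC: %.3f  uD-uE: %.3f\n", under_acctTA, under_acctProf);
    return 0;
}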

Is that actually useful to someone?

We want to use subaccounts below a faculty account to hold, for example,
a grad student or postdoc who teaches a class. It would be nice for the
grad student to have administrative control over the subaccount since he
actually knows the students but not have it affect priority calculations.

Ryan

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University