[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Lipari, Don
Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend 
of cpu and memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder 
time justifying a user's sshare report because the usage numbers would no 
longer track jobs' historical cpu usage.  Your response to a user who asked you 
to justify their sshare usage report would be, "trust me, it's right".
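
As a rough sketch of that blend (illustrative C only, not Slurm source; w_cpu 
and w_mem are assumed admin-chosen weights, and run_delta is the number of 
seconds being accounted in a decay pass):

#include <stdint.h>

/* Hypothetical blended usage increment: weight CPUs and memory together
 * instead of charging for CPUs alone. */
static double blended_usage(uint32_t total_cpus, uint64_t mem_mb,
                            double run_delta, double w_cpu, double w_mem)
{
        double mem_gb = (double) mem_mb / 1024.0;
        return run_delta * (w_cpu * (double) total_cpus + w_mem * mem_gb);
}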

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.
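
If memory serves, the stock MaxMemPerCPU limit already behaves roughly this 
way: when a job asks for more memory per CPU than the limit, Slurm raises the 
job's CPU count instead.  A sketch of the 4G-per-CPU case (values only 
illustrative):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
MaxMemPerCPU=4096    # 4 GB per CPU; larger per-CPU requests cost extra CPUs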

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari

> -----Original Message-----
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Friday, July 25, 2014 6:14 AM
> To: slurm-dev
> Subject: [slurm-dev] fairshare - memory resource allocation
> 
> 
> I'd like to revisit this...
> 
> 
> After struggling with memory allocations in some flavor of PBS for over
> 20 years, it was certainly a wonderful thing to have cgroup support
> right out of the box with Slurm.  No longer do we have a shared node's
> jobs eating all the memory and killing everything running there.  But we
> have found that there is a cost to this and that is a failure to
> adequately feed back this information to the fairshare mechanism.
> 
> In looking at running jobs over the past 4 months, we found a spot where
> we could reduce the DefMemPerCPU allocation in slurm.conf to a value
> about 1G less than the actual G/core available.  This meant that we had
> to notify the users close to this max value so that they could adjust
> their scripts. We also notified users that, if this value was too high,
> they'd do best to reduce that limit to exactly what they require.
> This has proven much less successful.
> 
> So our default is 3G/core with an actual node having 4G/core available.
> This allows some bigger memory jobs and some smaller memory jobs to
> make use of the node as there are available cores but not enough memory
> for the default case.
> 
> Now that is good. It allows higher utilization of nodes, all the while
> protecting the memory of each other's processes.  But the problem of
> fairshare comes about pretty quickly when there are jobs requiring say
> half the node's memory.  This is mostly serial jobs requesting a single
> core.  So this leaves about 11 cores with only about 2G/core left.
> Worse, when it comes to fairshare calculations it appears that these
> jobs are only using a single core when in fact they are using half a
> node.  You can see where this is causing issues.
> 
> Fairshare has a number of other issues as well, which I will send under
> a different email.
> 
> Now maybe this is just a matter of constant monitoring of user jobs and
> proactively going after those users having small memory per core
> requirements.  We have attempted this in the past and have found that
> the first job they run which crashes due to insufficient memory results
> in all scripts being increased and so the process is never ending.
> 
> Another solution is to simply trust the users and just keep reminding
> them about allocations.  They are usually a smart bunch and are quite
> creative when it comes to getting jobs to run!  So maybe I am concerned
> over nothing at all and things will just work out.
> 
> Bill


[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Ryan Cox


Bill and Don,

We have wondered about this ourselves.  I just came up with this idea 
and haven't thought it through completely, but option two seems like the 
easiest.  For example, you could modify lines like 
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 
to have a MAX() of a few different types.


I seem to recall seeing this on the list or in a bug report somewhere 
already, but you could have different charge rates for memory or GPUs 
compared to a CPU, maybe on a per partition basis. You could give each 
of them a charge rate like:

PartitionName=p1  ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 ..

So the line I referenced would be something like the following (except 
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU, 
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);


In this case, each CPU is 1.0 but each GB of RAM is 0.5.  Assuming no 
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting 
usage is 1.0.  But if they use 4 GB of RAM and 1 CPU, it is 2.0 just 
like they had been using 2 CPUs.  Essentially you define every 2 GB of 
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with 
"cpu equivalents".

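Fleshing that out a bit (a sketch only -- not real Slurm code; the 
charge_per_* arguments stand in for the hypothetical ChargePerCPU / 
ChargePerGB / ChargePerGPU settings above):

#include <stdint.h>

/* "CPU equivalents" for a job: the MAX() of each resource times its
 * per-partition charge rate.  The caller would multiply this by the
 * decayed run time, as in the priority_multifactor line referenced. */
static double cpu_equivalents(uint32_t cpus, double mem_gb, uint32_t gpus,
                              double charge_per_cpu, double charge_per_gb,
                              double charge_per_gpu)
{
        double usage = cpus * charge_per_cpu;
        if (mem_gb * charge_per_gb > usage)
                usage = mem_gb * charge_per_gb;
        if (gpus * charge_per_gpu > usage)
                usage = gpus * charge_per_gpu;
        return usage;
}

/* e.g. real_decay = run_decay * cpu_equivalents(...); */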

It might be harder to explain to users but I don't think it would be too 
bad.


Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:

Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and 
memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder time 
justifying a user's sshare report because the usage numbers would no longer track jobs' 
historical cpu usage.  Your response to a user who asked you to justify their sshare usage 
report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari


-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there.  But we
have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot where
we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that, if this value was too high,
they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring say
half the node's memory.  This is mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where this is causing issues.

Fairshare has a number of other issues as well, which I will send under
a different email.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-25 Thread Bill Wichser


Thank you Ryan.  Not sure how we will proceed here.

Bill

On 7/25/2014 12:30 PM, Ryan Cox wrote:


Bill and Don,

We have wondered about this ourselves.  I just came up with this idea 
and haven't thought it through completely, but option two seems like 
the easiest.  For example, you could modify lines like 
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 
to have a MAX() of a few different types.


I seem to recall seeing this on the list or in a bug report somewhere 
already, but you could have different charge rates for memory or GPUs 
compared to a CPU, maybe on a per partition basis. You could give each 
of them a charge rate like:
PartitionName=p1  ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 
..


So the line I referenced would be something like the following (except 
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU, 
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);


In this case, each CPU is 1.0 but each GB of RAM is 0.5.  Assuming no 
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting 
usage is 1.0.  But if they use 4 GB of RAM and 1 CPU, it is 2.0 just 
like they had been using 2 CPUs.  Essentially you define every 2 GB of 
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with 
"cpu equivalents".


It might be harder to explain to users but I don't think it would be 
too bad.


Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:

Bill,

As I understand the dilemma you presented, you want to maximize the 
utilization of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge 
users into requesting only the amount of memory they will need for 
their jobs.  The nudge would be in the form of decreased fair-share 
priority for users' jobs that request only one core but lots of memory.


I don't know of a way for Slurm to do this as it exists.  I can only 
offer alternatives that have their pros and cons.


One alternative would be to add memory usage support to the 
multifactor priority plugin.  This would be a substantial undertaking 
as it touches code not just in multifactor/priority_multifactor.c but 
also in structures that are defined in common/assoc_mgr.h as well as 
sshare itself.


A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a 
configurable blend of cpu and memory usage.  These changes could be 
more localized to the multifactor/priority_multifactor.c module.  
However you would have a harder time justifying a user's sshare 
report because the usage numbers would no longer track jobs' 
historical cpu usage.  Your response to a user who asked you to 
justify their sshare usage report would be, "trust me, it's right".


A third alternative (as I'm sure you know) is to give up on perfectly 
packed nodes and make every 4G of memory requested cost 1 cpu of 
allocation.


Perhaps there are other options, but those are the ones that 
immediately come to mind.


Don Lipari


-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there.  
But we

have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot 
where

we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that, if this value was too high,
they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring say
half the node's memory.  This is mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where this is causing issues.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-27 Thread Blomqvist Janne

Hi,

As a variation on the second option you propose, take a look at the concept of 
Dominant Resource Fairness [1], which is an algorithm for achieving 
multi-resource (e.g. cpu's, memory, disk/net BW, ...) fairness. By using 
"dominant share"-secs instead of cpu-secs in the current accounting code the 
changes would similarly be limited in scope.

[1] http://static.usenix.org/event/nsdi11/tech/full_papers/Ghodsi.pdf
https://www.usenix.org/legacy/events/nsdi11/tech/slides/ghodsi.pdf

--
Janne Blomqvist


From: Lipari, Don [lipa...@llnl.gov]
Sent: Friday, July 25, 2014 19:04
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

Bill,

As I understand the dilemma you presented, you want to maximize the utilization 
of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge users into 
requesting only the amount of memory they will need for their jobs.  The nudge 
would be in the form of decreased fair-share priority for users' jobs that 
request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists.  I can only offer 
alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor 
priority plugin.  This would be a substantial undertaking as it touches code 
not just in multifactor/priority_multifactor.c but also in structures that are 
defined in common/assoc_mgr.h as well as sshare itself.

A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a configurable blend 
of cpu and memory usage.  These changes could be more localized to the 
multifactor/priority_multifactor.c module.  However you would have a harder 
time justifying a user's sshare report because the usage numbers would no 
longer track jobs' historical cpu usage.  Your response to a user who asked you 
to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed 
nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come 
to mind.

Don Lipari

> -----Original Message-----
> From: Bill Wichser [mailto:b...@princeton.edu]
> Sent: Friday, July 25, 2014 6:14 AM
> To: slurm-dev
> Subject: [slurm-dev] fairshare - memory resource allocation
>
>
> I'd like to revisit this...
>
>
> After struggling with memory allocations in some flavor of PBS for over
> 20 years, it was certainly a wonderful thing to have cgroup support
> right out of the box with Slurm.  No longer do we have a shared node's
> jobs eating all the memory and killing everything running there.  But we
> have found that there is a cost to this and that is a failure to
> adequately feed back this information to the fairshare mechanism.
>
> In looking at running jobs over the past 4 months, we found a spot where
> we could reduce the DefMemPerCPU allocation in slurm.conf to a value
> about 1G less than the actual G/core available.  This meant that we had
> to notify the users close to this max value so that they could adjust
> their scripts. We also notified users that, if this value was too high,
> they'd do best to reduce that limit to exactly what they require.
> This has proven much less successful.
>
> So our default is 3G/core with an actual node having 4G/core available.
> This allows some bigger memory jobs and some smaller memory jobs to
> make use of the node as there are available cores but not enough memory
> for the default case.
>
> Now that is good. It allows higher utilization of nodes, all the while
> protecting the memory of each other's processes.  But the problem of
> fairshare comes about pretty quickly when there are jobs requiring say
> half the node's memory.  This is mostly serial jobs requesting a single
> core.  So this leaves about 11 cores with only about 2G/core left.
> Worse, when it comes to fairshare calculations it appears that these
> jobs are only using a single core when in fact they are using half a
> node.  You can see where this is causing issues.
>
> Fairshare has a number of other issues as well, which I will send under
> a different email.
>
> Now maybe this is just a matter of constant monitoring of user jobs and
> proactively going after those users having small memory per core
> requirements.  We have attempted this in the past and have found that
> the first job they run which crashes due to insufficient memory results
> in all scripts being increased and so the process is never ending.
>
> Another solution is to simply trust the users and just keep reminding
> them about allocations.  They are usually a smart bunch and are quite
> creative when it comes to getting jobs to run!  So maybe I am concerned
> over nothing at all and things will just work out.
>
> Bill

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-29 Thread Ryan Cox
I'm interested in hearing opinions on this, if any.  Basically, I think 
there is an easy solution to the problem of a user using few CPUs but a 
lot of memory and that not being reflected well in the CPU-centric usage 
stats.


Below is my proposal.  There are likely some other good approaches out 
there too (Don and Janne presented some) so feel free to tell me that 
you don't like this idea :)



Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU 
equivalents" * time) instead of just (CPUs * time).  The "CPU 
equivalent" would be a MAX() of CPUs, memory, nodes, GPUs, energy over 
that time period, or whatever multiplied by a corresponding charge rate 
that an admin can configure on a per partition basis.


I wrote a simple proof of concept patch to demonstrate this (see "Proof 
of Concept" below for details).



Longer version

The CPU equivalent would be used in place of total_cpus for calculating 
usage_raw.  I propose that the default charge rate be 1.0 for each CPU 
in a job and 0.0 for everything else.  This is the current behavior so 
there are no behavior changes if you choose not to define a different 
charge rate.  The reason I think this should be done on a partition 
basis is because different partitions may have nodes with different 
memory/core ratios, etc. so one partition may have 2 GB/core and another 
may have 8 GB/core nodes and you may want to charge differently on each.
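
For example (still using the proposed, not-yet-existing ChargePerCPU and 
ChargePerGB parameters), the two partitions could be set so that a node's 
full memory charges the same as its full core count:

PartitionName=two_gb_per_core   ChargePerCPU=1.0 ChargePerGB=0.5   ..
PartitionName=eight_gb_per_core ChargePerCPU=1.0 ChargePerGB=0.125 ..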


If you define the charge rate for each CPU to be 1.0 and the charge rate 
per GB of memory to be 0.5, that is saying that 2 GB of memory will be 
equivalent to the charge rate for 1 CPU.  4 GB of memory would be 
equivalent to 2 CPUs (4 GB * 0.5/GB).  Since it is a MAX() of all the 
available (resource * charge_rate) combinations, the largest value is 
chosen.  If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the 
user gets charged for using all the RAM.  If a user uses 16 CPUs and 1 
MB, the user gets charged for 16 CPUs.



Downsides

The problem that is not completely solved is if a user uses 1 CPU but 
3/4 of the memory on a node.  Then they only get billed for 3/4 of the 
node but might make it unusable for others who need a whole or half 
node.  I'm not sure of a great way to solve that besides modifying the 
request in a job submit plugin or requiring exclusive node access.


One other complication is for resources that include a counter rather 
than a static allocation value, such as network bandwidth or energy.  
This is a problem because the current approach is to immediately begin 
decaying the cputime (aka usage) as it accumulates.  This means you 
would have to keep a delta value for each resource with a counter, 
meaning you track that 5 GB have been transmitted since the last decay 
thread iteration then only add that 5 GB.  This could get messy when 
comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb) 
each iteration since the bandwidth may never reach a high enough value 
to matter between iterations but might when considered as an entire job.
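
A sketch of that delta bookkeeping (invented names, not Slurm code; 
last_counter would have to be stored per job between decay iterations):

#include <stdint.h>

/* Charge only the counter growth since the previous decay pass, e.g.
 * bytes transmitted or joules consumed. */
static double counter_charge(uint64_t current, uint64_t *last_counter,
                             double charge_per_unit)
{
        uint64_t delta = current - *last_counter;
        *last_counter = current;        /* remember for the next pass */
        return (double) delta * charge_per_unit;
}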


I don't think this proposal would be too bad for something like energy.  
You could define a charge rate per joule (or kilojoule or whatever) that 
would equal the node's minimum power divided by core count.  Then you 
look at the delta of that time period.  If they were allocated all cores 
and used minimum power, they get charged 1.0 * core count.  If they were 
allocated all cores and used maximum power, they effectively get charged 
for the difference in the node's max energy and min energy times the 
energy charge rate.  This calculation, as with others, would occur once 
per decay thread iteration.



User Education

The reason I like this approach is that it is incredibly simple to 
implement and I don't think it takes much effort to explain to users.  
It would be easy to add other resources you want to charge for (it would 
require a code addition, though it would be pretty simple if the data is 
available in the right structs).  It doesn't require any RPC changes.  
sshare, etc only need manpage clarifications to say that the usage data 
is "CPU equivalents".  No new fields are required.


As for user education, you just need to explain the concept of "CPU 
equivalents", something that can be easily done in the documentation.  
The slurm.conf partition lines would be relatively easy to read too.  If 
you don't need to change the behavior, no slurm.conf changes or 
explanations to users are required.



Proof of Concept

I did a really quick proof of concept (attached) based on the master 
branch.  It is very simple to charge for most things as long as the data 
is there in the existing structs.  One caveat for the test patch is that 
I didn't see a float handler in the config parser so I skipped over that 
for the test.  Instead, each config parameter in slurm.conf should be 
set to (desired_value * 1000) for now.  Proper float handling can be 
added if this is the route people want to take.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-30 Thread Blomqvist Janne

Hi,

if I understand it correctly, this is actually very close to Dominant Resource 
Fairness (DRF) which I mentioned previously, with the difference that in DRF 
the charge rates are determined automatically from the available resources (in 
a partition) rather than being specified explicitly by the administrator. So 
for an example, say we have a partition with 100 cores and 400 GB memory. Now 
for a job requesting (10 CPUs, 20 GB) the domination calculation proceeds as 
follows:

1) Calculate the "domination vector" by dividing each element in the request 
vector (here, CPU & MEM) with the available resources. That is (10/100, 20/400) 
= (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it "dominates" the 
others, hence the name of the algorithm) as the one to use in fairshare 
calculations, accounting etc. In this case, the CPU element (0.1). 

Now for another job request, (1 CPU, 20 GB) the domination vector is (0.01, 
0.05) and the MAX element is the memory element (0.05), so in this case the 
memory part of the request dominates.
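
In code form, the dominant share for a CPU+memory request is just (sketch 
only, names invented for illustration):

/* DRF dominant share: each requested amount divided by the partition's
 * capacity, then take the maximum. */
static double dominant_share(double req_cpus, double req_mem_gb,
                             double cap_cpus, double cap_mem_gb)
{
        double cpu_share = req_cpus / cap_cpus;
        double mem_share = req_mem_gb / cap_mem_gb;
        return (cpu_share > mem_share) ? cpu_share : mem_share;
}

/* dominant_share(10, 20, 100, 400) == 0.10  -> CPU dominates
 * dominant_share( 1, 20, 100, 400) == 0.05  -> memory dominates */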

In your patch you have used "cpu-sec equivalents" rather than "dominant share 
secs", but that's just a difference of a scaling factor. From a backwards 
compatibility and user education point of view cpu-sec equivalents seem like a 
better choice to me, actually.

So while your patch is more flexible than DRF in that it allows arbitrary charge 
rates to be specified, I'm not sure it makes sense to specify rates different 
from the DRF ones? Or if one does specify different rates, it might end up 
breaking some of the fairness properties that are described in the DRF paper 
and opens up the algorithm for gaming?

--
Janne Blomqvist


From: Ryan Cox [ryan_...@byu.edu]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

I'm interested in hearing opinions on this, if any.  Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory and that not being reflected well in the CPU-centric usage
stats.

Below is my proposal.  There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time).  The "CPU
equivalent" would be a MAX() of CPUs, memory, nodes, GPUs, energy over
that time period, or whatever multiplied by a corresponding charge rate
that an admin can configure on a per partition basis.

I wrote a simple proof of concept patch to demonstrate this (see "Proof
of Concept" below for details).


Longer version

The CPU equivalent would be used in place of total_cpus for calculating
usage_raw.  I propose that the default charge rate be 1.0 for each CPU
in a job and 0.0 for everything else.  This is the current behavior so
there are no behavior changes if you choose not to define a different
charge rate.  The reason I think this should be done on a partition
basis is because different partitions may have nodes with different
memory/core ratios, etc. so one partition may have 2 GB/core and another
may have 8 GB/core nodes and you may want to charge differently on each.

If you define the charge rate for each CPU to be 1.0 and the charge rate
per GB of memory to be 0.5, that is saying that 2 GB of memory will be
equivalent to the charge rate for 1 CPU.  4 GB of memory would be
equivalent to 2 CPUs (4 GB * 0.5/GB).  Since it is a MAX() of all the
available (resource * charge_rate) combinations, the largest value is
chosen.  If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the
user gets charged for using all the RAM.  If a user uses 16 CPUs and 1
MB, the user gets charged for 16 CPUs.


Downsides

The problem that is not completely solved is if a user uses 1 CPU but
3/4 of the memory on a node.  Then they only get billed for 3/4 of the
node but might make it unusable for others who need a whole or half
node.  I'm not sure of a great way to solve that besides modifying the
request in a job submit plugin or requiring exclusive node access.

One other complication is for resources that include a counter rather
than a static allocation value, such as network bandwidth or energy.
This is a problem because the current approach is to immediately begin
decaying the cputime (aka usage) as it accumulates.  This means you
would have to keep a delta value for each resource with a counter,
meaning you track that 5 GB have been transmitted since the last decay
thread iteration then only add that 5 GB.  This could get messy when
comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb)
each iteration since the bandwidth may never reach a high enough value
to matter between iterations but might when considered as an entire job.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Bjørn-Helge Mevik

Just a short note about terminology.  I believe "processor equivalents"
(PE) is a much used term for this.  It is at least what Maui and Moab
use, if I recall correctly.  The "resource*time" would then be PE seconds
(or hours, or whatever).

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Janne,

I appreciate the feedback.  I agree that it makes the most sense to 
specify rates like DRF most of the time.  However, there are some use 
cases that I'm aware of and others that are probably out there that 
would make a DRF imitation difficult or less desirable if it's the only 
option.


We happen to have one partition that has mixed memory amounts per node, 
32 GB and 64 GB.  Besides the memory differences (long story), the nodes 
are homogeneous and each have 16 cores.  I'm not sure I would like the 
DRF approach for this particular scenario.  In this case we would like 
to set the charge rate to be .5/GB, or 1 core == 2 GB RAM.  If someone 
needs 64 GB per node, they are contending for a more limited resource 
and we would be happy to double the charge rate for the 64 GB nodes.  If 
they need all 64 GB, they would end up being charged for 32 
CPU/processor equivalents instead of 16.  With DRF that wouldn't be 
possible if I understand correctly.


One other feature that could be interesting is to have a "baseline" 
standard for a CPU charge on a per-partition basis.  Let's say that you 
have three partitions:  old_hardware, new_hardware, and 
super_cooled_overclocked_awesomeness.  You could set the per CPU charges 
to be 0.8, 1.0, and 20.0.  That would reflect that a cpu-hour on one 
partition doesn't result in the same amount of computation as in another 
partition.  You could accomplish the same thing automatically by using a 
QOS (and maybe some other parameter I'm not aware of) and maybe a job 
submit plugin but this would make it easier.  I don't know that we would 
do this in our setup but it would be possible.
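
Sketched with the same proposed (non-existent today) ChargePerCPU parameter:

PartitionName=old_hardware ChargePerCPU=0.8 ..
PartitionName=new_hardware ChargePerCPU=1.0 ..
PartitionName=super_cooled_overclocked_awesomeness ChargePerCPU=20.0 ..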


It would be possible to add a config parameter that is something like 
Mem=DRF that would auto-configure it to match.  The one question I have 
about that approach is what to do about partitions with non-homogeneous 
nodes.  Does it make sense to sum the total cores and memory, etc or 
should it default to a charge rate that is the min() of the node 
configurations?  Of course, partitions with mixed node types could be 
difficult to support no matter what method is used for picking charge rates.


So yes, having a DRF-like auto-configuration could be nice and we might 
even use it for most of our partitions.  I don't think I'll attempt it 
for the initial implementation but we'll see.


Thanks,
Ryan

On 07/30/2014 03:31 PM, Blomqvist Janne wrote:

Hi,

if I understand it correctly, this is actually very close to Dominant Resource 
Fairness (DRF) which I mentioned previously, with the difference that in DRF 
the charge rates are determined automatically from the available resources (in 
a partition) rather than being specified explicitly by the administrator. So 
for an example, say we have a partition with 100 cores and 400 GB memory. Now 
for a job requesting (10 CPUs, 20 GB) the domination calculation proceeds as 
follows:

1) Calculate the "domination vector" by dividing each element in the request vector 
(here, CPU & MEM) with the available resources. That is (10/100, 20/400) = (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it "dominates" the 
others, hence the name of the algorithm) as the one to use in fairshare calculations, 
accounting etc. In this case, the CPU element (0.1).

Now for another job request, (1 CPU, 20 GB) the domination vector is (0.01, 
0.05) and the MAX element is the memory element (0.05), so in this case the 
memory part of the request dominates.

In your patch you have used "cpu-sec equivalents" rather than "dominant share 
secs", but that's just a difference of a scaling factor. From a backwards compatibility and 
user education point of view cpu-sec equivalents seem like a better choice to me, actually.

So while your patch is more flexible than DRF in that it allows arbitrary charge 
rates to be specified, I'm not sure it makes sense to specify rates different 
from the DRF ones? Or if one does specify different rates, it might end up 
breaking some of the fairness properties that are described in the DRF paper 
and opens up the algorithm for gaming?

--
Janne Blomqvist

____________________
From: Ryan Cox [ryan_...@byu.edu]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

I'm interested in hearing opinions on this, if any.  Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory and that not being reflected well in the CPU-centric usage
stats.

Below is my proposal.  There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU
equivalents" * time) instead of just (CPUs * time).  The "CPU
equivalent" would be a MAX() of CPUs, memory, nodes, GPUs, energy over
that time period, or whatever multiplied by a corresponding charge rate
that an admin can configure on a per partition basis.

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Thanks.  I can certainly call it that.  My understanding is that this 
would be a slightly different implementation from Moab/Maui, but I don't 
know those as well so I could be wrong.  Either way, the concept is 
similar enough that a more recognizable term might be good.


Does anyone else have thoughts on this?  I called it "CPU equivalents" 
because the calculation in the code is currently ("total_cpus" * time) 
so I stuck with CPUs.  Slurm seems to use lots of terms somewhat 
interchangeably so I couldn't really decide.  I don't really have an 
opinion on the name so I'll just accept what others decide.


Ryan

On 07/31/2014 02:28 AM, Bjørn-Helge Mevik wrote:

Just a short note about terminology.  I believe "processor equivalents"
(PE) is a much used term for this.  It is at least what Maui and Moab
use, if I recall correctly.  The "resource*time" would then be PE seconds
(or hours, or whatever).



[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


All,

There has been more conversation on 
http://bugs.schedmd.com/show_bug.cgi?id=858.  It might be good to post 
future comments there so we have just one central location for 
everything.  No worries if you'd rather reply on the list.


Once a solution is ready I'll post something to the list so everyone is 
aware.


Ryan


[slurm-dev] RE: fairshare - memory resource allocation

2014-08-20 Thread Ulf Markwardt

Hi all,
this is a very interesting approach.
I hope we find a chance to discuss it in Lugano.
Ulf


--
___
Dr. Ulf Markwardt

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany

Phone: (+49) 351/463-33640  WWW:  http://www.tu-dresden.de/zih


