[slurm-dev] Re: Thoughts on GrpCPURunMins as primary constraint?

2017-07-24 Thread Ryan Cox


Corey,

We almost exclusively use GrpCPURunMins as well as 3 or 7 day walltime 
limits depending on the partition.  For my (somewhat rambling) thoughts 
on the matter, see 
http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html. 
It generally works pretty well.


We also have https://marylou.byu.edu/simulation/grpcpurunmins.php to 
simulate various settings, though it needs some improvement such as a 
realistic maximum.


sshare -l (TRESRunMins) should have the live stats you're looking for.
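If you want to derive the number yourself from squeue, as you describe, 
summing cores * remaining minutes over the running jobs gives the amount 
of the GrpCPURunMins "area" still in use.  For example (illustrative 
numbers, not from this thread), a 16-core job with 10 days of walltime 
remaining still holds 16 * 14400 = 230,400 CPU-minutes of the limit.  
Something like

squeue -h -t R -A <account> -o '%C %L'

prints the allocated CPU count and remaining time for each running job 
in an account, which is enough to compute that sum.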

Ryan

On 07/24/2017 02:39 PM, Corey Keasling wrote:


Hi Slurm-Dev,

I'm currently designing and testing what will ultimately be a small 
Slurm cluster of about 60 heterogeneous nodes (five different 
generations of hardware).  Our user-base is also diverse, with need 
for fast turnover of small, sequential jobs and for long-duration 
parallel codes (e.g., 16 cores for several months).


In the past we limited users by how many cores they could allocate at 
any one time.  This has the drawback that no distinction is made 
between, say, 128 cores for 2 hours and 128 cores for 2 months.  We 
want users to be able to run on a large portion of the cluster when it 
is available while ensuring that they cannot take advantage of an idle 
period to start jobs which will monopolize it for weeks.


Limiting by GrpCPURunMins seems like a good answer.  I think of it as 
allocating computational area (i.e., cores*minutes) and not just width 
(cores).  I'd love to know if anyone has any experience or thoughts on 
imposing limits in this way.  Also, is anyone aware of a simple way to 
calculate remaining "area"?  I can use squeue or sacct to ultimately 
derive how much of a limit is in use by looking at remaining wall-time 
and core count, but if there's something built in - or pre-existing - 
it would be nice to know.


It's worth noting that the cluster is divided into several partitions, 
with most nodes existing in several.  This is partially political (to 
give groups increased priority on nodes they helped pay for) and 
partially practical (to ensure users end up on the slow nodes only when 
they explicitly request them, rather than having their jobs dumped onto 
ancient Opterons).  Also, each user gets their own Account, so the QoS 
Grp limits apply to each human separately.  Accounts would also have 
absolute core limits.


Thank you for your thoughts!

Corey



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Job Submit Lua Plugin

2017-06-27 Thread Ryan Cox

Nathan and Darby,

For you and anyone else using Lua, see 
https://bugs.schedmd.com/show_bug.cgi?id=3815 with regards to --mem vs 
--mem-per-cpu starting in 17.02.


Ryan

On 06/27/2017 02:30 PM, Nathan Vance wrote:

Darby,

The "job_submit.lua: initialized" line in slurm.conf was indeed the 
issue. When compiling slurm I only got the "yes lua" line without the 
flags, but that seems to be just a difference in OS's.


Now that I have debugging feedback I should be good to go!

Thanks,
Nathan

On 27 June 2017 at 16:13, Vicker, Darby (JSC-EG311) 
<darby.vicke...@nasa.gov> wrote:


We recently started using a lua job submit plugin as well.  You
have to have the lua-devel package installed when you compile
Slurm.  It looks like you do (though we use RHEL, where the package
name is lua-devel), but confirm that you see something like these in
config.log:

configure:24784: result: yes lua

pkg_cv_lua_LIBS='-llua -lm -ldl '

lua_CFLAGS='  -DLUA_COMPAT_ALL'

lua_LIBS='-llua -lm -ldl  '

Do you have this in your slurm.conf?

JobSubmitPlugins=lua

I'm guessing not, given that you don't see anything in the logs. Before
I got all the errors worked out, I would see errors like these in
the slurmctld log:

error: Couldn't find the specified plugin name for job_submit/lua
looking at all files

error: cannot find job_submit plugin for job_submit/lua

error: cannot create job_submit context for job_submit/lua

failed to initialize job_submit plugin

After getting everything working, you should see this:

job_submit.lua: initialized

As well as any other slurm.log_info messages you put in your lua
script.
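
For anyone starting from scratch, a minimal /etc/slurm/job_submit.lua
looks roughly like the sketch below (the routing rule and the "short"
partition name are only illustrative; the function names and return
values are what the plugin expects):

-- minimal job_submit.lua sketch; the partition name and the rule are
-- illustrative only
function slurm_job_submit(job_desc, part_list, submit_uid)
    slurm.log_info("slurm_job_submit: job from uid " .. tostring(submit_uid))
    -- example rule: if the user did not pick a partition, route the job
    -- to a hypothetical "short" partition
    if job_desc.partition == nil then
        job_desc.partition = "short"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

slurm.log_info("job_submit.lua: initialized")
return slurm.SUCCESS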

From: Nathan Vance <naterva...@gmail.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Tuesday, June 27, 2017 at 12:15 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Job Submit Lua Plugin

Hello all!

I've been working on getting off the ground with Lua plugins. The
goal is to implement Torque's routing queues for SLURM, but so far
I have been unable to get SLURM to even call my plugin.

What I have tried:

1) Copied contrib/lua/job_submit.lua to /etc/slurm/ (the same
directory as slurm.conf)

2) Restarted slurmctld and verified that no functionality was broken

3) Added slurm.log_info("I got here") to several points in the
script. After restarting slurmctld and submitting a job, grep "I
got here" -R /var/log found no results.

4) In case there was a problem with the log file, I added
os.execute("touch /home/myUser/slurm_job_submitted") to the top of
the slurm_job_submit method. Restarting slurmctld and submitting a
job still produced no evidence that my plugin was called.

5) In case there were permission issues, I made job_submit.lua
executable. Nothing. Even grep "job_submit" -R /var/log (in case
there was an error calling the script) comes up dry.

Relevant information:

OS: Ubuntu 16.04

Lua: lua5.2 and liblua5.2-dev (I can use Lua interactively)

SLURM version: 17.02.5, compiled from source (after installing
Lua) using ./configure --prefix=/usr --sysconfdir=/etc/slurm

    Any guidance to get me up and running would be greatly appreciated!

Thanks,

Nathan




--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Ryan Cox
> From: Wensheng Deng <w...@nyu.edu>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
>
>
> On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> 5GB job from scratch to node local disk, declared 5 GB memory
> for the job, and saw error message as below although the file
> was copied okay:
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
> srun: error: [nodenameXXX]: task 0: Out Of Memory
>
> srun: Terminating job step 41.0
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
>
> From the cgroup document
>https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> <https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt>
> Features:
> - accounting anonymous pages, file caches, swap caches usage and
> limiting them.
>
> It seems that cgroup charges "RSS + file caches" memory to user
> processes like 'cp'; in our case it is charged to the user's jobs.
> Swap is off in this case.  The file cache can be small or very big,
> and in my opinion it should not be charged to users' batch jobs.
> How do other sites circumvent this issue? The Slurm version is
> 16.05.4.
>
> Thank you and Best Regards.
>
>
>
>

Could you set AllowedRAMSpace/AllowedSwapSpace in
/etc/slurm/cgroup.conf to some big number?  That way the job memory
limit becomes the cgroup soft limit, and the cgroup hard limit (the
point at which the kernel will OOM-kill the job) becomes
"job_memory_limit * AllowedRAMSpace", i.e. some large value.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 <tel:%2B358503841576> || janne.blomqv...@aalto.fi
<mailto:janne.blomqv...@aalto.fi>



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox


If you're interested in the programmatic method I mentioned to increase 
limits for file transfers, 
https://github.com/BYUHPC/uft/tree/master/cputime_controls might be 
worth looking at.  It works well for us, though a user will occasionally 
start using a new file transfer program that you might want to centrally 
install and whitelist.


We used to use LVS for load balancing and it worked pretty well.  We 
finally scrapped it in favor of DNS round robin since it gets expensive 
to have a load balancer that's capable of moving that much bandwidth.  
We have a script that can drop some of the login nodes from the DNS 
round robin based on CPU and memory usage (with sanity checks to not 
drop all of them at the same time, of course :) ). There may be a better 
way of doing this but it has worked so far.


Ryan

On 02/09/2017 11:15 AM, Nicholas McCollum wrote:

While this isn't a SLURM issue, it's something we all face.  Since my
system serves primarily students, it's something I face a lot.

I second the use of ulimits, although this can kill off long-running
file transfers.  What you can do to help out users is set a low soft
limit and a somewhat larger hard limit.  Encourage users who want to
do a file transfer to raise their limit (they won't be able to go
over the hard limit).
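
In /etc/security/limits.conf terms that might look something like this
(the cpu item is in minutes; the values here are only an example):

*    soft    cpu    30     # processes get SIGXCPU after 30 CPU-minutes
*    hard    cpu    720    # ceiling users can raise themselves up to

A user doing a long transfer can then raise the soft limit with
ulimit -t (which takes seconds), up to the hard limit.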

A method that I am testing is to run each login node as a KVM virtual
machine and then limit the amount of CPU that the virtual machine can
use.  Each login VM will be identical except for the MAC and the IP
address, and IP tables on the VM host push incoming connections out to
the VM that responds first.  The idea is that a loaded-down VM would be
slower to respond, so a user is provided with a login node that doesn't
have any users on it.

I'm sure someone has already blazed this trail before, but this is how
I am going about it.




--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox

I should probably add some example output:

Someone we need to talk to:
   Node    | Memory (GB) |     CPUs
Hostname     Alloc  Max  Cur   Alloc   Used  Eff%
 m8-10-5      19.5    0    0       1   0.00     0
*m8-10-2      19.5  2.3  2.2       1   0.99    99
 m8-10-3      19.5    0    0       1   0.00     0
 m8-10-4      19.5    0    0       1   0.00     0

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job


Much better:
   Node    | Memory (GB) |     CPUs
Hostname     Alloc   Max    Cur   Alloc   Used  Eff%
 m9-48-2     112.0  21.1   19.3      16  15.97    99
 m9-48-3      98.0  18.5   16.8      14  13.98    99
 m9-16-3     112.0  20.9   19.2      16  15.97    99
 m9-44-1     112.0  21.0   19.2      16  15.97    99
 m9-43-3     119.0  22.3   20.4      17  16.97    99
 m9-44-2     112.0  21.2   19.3      16  15.98    99
 m9-14-4     112.0  21.0   19.2      16  15.97    99
 m9-46-4     119.0  22.5   20.5      17  16.97    99
*m9-10-2      91.0  32.0   15.8      13  12.81    98
 m9-43-1     119.0  22.3   20.4      17  16.97    99
 m9-16-1     126.0  23.9   21.6      18  17.97    99
 m9-47-4     119.0  22.4   20.5      17  16.97    99
 m9-43-4     119.0  22.4   20.5      17  16.97    99
 m9-48-1      84.0  15.7   14.4      12  11.98    99
 m9-42-4     119.0  22.2   20.3      17  16.97    99
 m9-43-2     119.0  22.2   20.4      17  16.97    99

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job

Ryan

On 09/19/2016 11:13 AM, Ryan Cox wrote:
We use this script that we cobbled together: 
https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It 
assumes that you're using cgroups.  It uses ssh to connect to each 
node so it's not very scalable but it works well enough for us.


Ryan

On 09/18/2016 06:42 PM, Igor Yakushin wrote:

Hi All,

I'd like to be able to see, for a given jobid, how many resources a 
job is using on each node it is running on at this moment. Is there 
a way to do it?


So far it looks like I have to script it: get the list of the 
involved nodes using, for example, squeue or qstat, ssh to each node 
and find all the user processes (not 100% guaranteed that they would 
be from the job I am interested in: is there a way to find UNIX pids 
corresponding to Slurm jobid?).


Another question: is there python API to slurm? I found pyslurm but 
so far it would not build with my version of Slurm.


Thank you,
Igor





--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

2016-09-19 Thread Ryan Cox
We use this script that we cobbled together: 
https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes 
that you're using cgroups.  It uses ssh to connect to each node so it's 
not very scalable but it works well enough for us.
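
If you just need a quick look at one job on one node, the per-job memory
cgroup can also be read directly; with task/cgroup the path is along the
lines of (the exact layout depends on CgroupMountpoint and the Slurm
version):

ssh node01 cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.usage_in_bytes

and running "scontrol listpids <jobid>" on the node should answer the
question below about mapping a jobid to Unix PIDs.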


Ryan

On 09/18/2016 06:42 PM, Igor Yakushin wrote:

Hi All,

I'd like to be able to see, for a given jobid, how many resources a 
job is using on each node it is running on at this moment. Is there a 
way to do it?


So far it looks like I have to script it: get the list of the involved 
nodes using, for example, squeue or qstat, ssh to each node and find 
all the user processes (not 100% guaranteed that they would be from 
the job I am interested in: is there a way to find UNIX pids 
corresponding to Slurm jobid?).


Another question: is there python API to slurm? I found pyslurm but so 
far it would not build with my version of Slurm.


Thank you,
Igor





[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox
The --reservation option is for sbatch, salloc, et al.  It tells Slurm 
that the job should run in the specified reservation.
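
For the case in the original post that would be something like:

salloc -p defq -t 10 --reservation=root_13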


On 04/15/2016 11:37 AM, Glen MacLachlan wrote:

Thanks for your feedback. Taking the nodes out of maintenance still 
leaves them in the reserved state "resv", and they are still unable to 
run jobs even though I believe I've given the correct exception as 
shown in the original post.



@Ryan: Yeah, I did specify the reservation, Reservation=root_13. The 
-- before reservation is syntactically incorrect too. In fact, if you 
don't specify which reservation is getting updated the scontrol 
command won't work.




Best,
Glen

==
Glen MacLachlan, PhD
HPC Specialist for Physical Sciences &
Professorial Lecturer, Data Sciences

Office of Technology Services
The George Washington University
725 21st Street
Washington, DC 20052
Suite 211, Corcoran Hall

==




On Fri, Apr 15, 2016 at 1:07 PM, Ryan Cox <ryan_...@byu.edu> wrote:


Did you try this: --reservation=root_13


On 04/15/2016 08:10 AM, Glen MacLachlan wrote:

Dear all,

Wrapping up a maintenance period and I want to run some test jobs
before I release the reservation and allow regular user jobs to
start running. I've modified the reservation to allow jobs from
my account:

$ scontrol show res
ReservationName=root_13 StartTime=2016-04-12T09:00:00
EndTime=2016-04-15T20:00:00 Duration=3-11:00:00
   Nodes=ALL NodeCnt=220 CoreCnt=3328 Features=(null)
PartitionName=(null) Flags=MAINT,SPEC_NODES
 TRES=cpu=3328
 Users=bindatype Accounts=(null) Licenses=(null) State=ACTIVE
BurstBuffer=(null) Watts=n/a


but when I try to allocate a set of nodes I keep seeing the
following:

$ salloc -p defq -t 10
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 1692921
salloc: job 1692921 queued and waiting for resources


Note that all the nodes are currently in the maint state. Am I
missing something here or is this a problem with scontrol update?







--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: scontrol update not allowing jobs

2016-04-15 Thread Ryan Cox

Did you try this:  --reservation=root_13

On 04/15/2016 08:10 AM, Glen MacLachlan wrote:

Dear all,

Wrapping up a maintenance period and I want to run some test jobs 
before I release the reservation and allow regular user jobs to start 
running. I've modified the reservation to allow jobs from my account:


$ scontrol show res
ReservationName=root_13 StartTime=2016-04-12T09:00:00
EndTime=2016-04-15T20:00:00 Duration=3-11:00:00
 Nodes=ALL NodeCnt=220 CoreCnt=3328 Features=(null)
PartitionName=(null) Flags=MAINT,SPEC_NODES
 TRES=cpu=3328
 Users=bindatype Accounts=(null) Licenses=(null) State=ACTIVE
BurstBuffer=(null) Watts=n/a


but when I try to allocate a set of nodes I keep seeing the following:

$ salloc -p defq -t 10
salloc: Required node not available (down, drained or reserved)
salloc: Pending job allocation 1692921
salloc: job 1692921 queued and waiting for resources


Note that all the nodes are currently in the maint state. Am I missing 
something here or is this a problem with scontrol update?






[slurm-dev] Re: AssocGrp*Limits being considered for scheduling

2016-02-23 Thread Ryan Cox


Coincidentally, I asked about that yesterday in a bug report: 
http://bugs.schedmd.com/show_bug.cgi?id=2465. The short answer is to use 
SchedulerParameters=assoc_limit_continue, which was introduced in 
15.08.8.  It only works if the Reason for the job is something like 
Assoc*Limit.


Ryan

On 02/23/2016 10:58 AM, Lucas Gabriel Vuotto wrote:


Hello,

we want to know if there is a "built-in" solution for the situation we 
have:


We have an special account A in sacctmgr which gives some users more 
cpu minutes to use monthly. Also, we use the multifactor priority 
plugin to decide which jobs start first. Right now, there are some 
jobs from account A that can't start because the extra resources were 
consumed, so until march, 1st they won't start. Still, there are other 
jobs enqueued that have less priority than the ones from account A, so 
they're not starting because the scheduler still consider the jobs 
from account A to be able to schedule, assigning them a StartTime from 
today.


Basically, what we want to know is if there is some option/plugin to 
either:


  1. delay the StartTime of jobs that can't start because of 
AssocGrp*Limits

  2. turn the priority to 0 for those jobs until the next month
  3. any other idea that has the desired effect (run, this month, the 
jobs that can actually run this month)


Ideally, we want to know if there is some solution from Slurm itself 
rather than running cron jobs every 10 minutes to do option 1 manually, 
which is the only idea we have right now (better ideas are welcome, 
though).


Cheers & thanks!


-- lv.


[slurm-dev] Re: distribution for array jobs

2016-01-28 Thread Ryan Cox
g to get more than one job to run on a node?

Thanks in advance,

Brian Andrus



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Slurmd restart without loosing jobs?

2015-10-13 Thread Ryan Cox


That particular problem is now fixed: 
http://bugs.schedmd.com/show_bug.cgi?id=587


Ryan

On 10/13/2015 03:26 AM, Bjørn-Helge Mevik wrote:

Restarting the slurmd daemons and/or the slurmctld daemon should in
general not kill jobs.

But if you change things in slurm.conf such that the format of the slurm
state files changes, then restarting slurmctld might result in all jobs
being killed.  We did this once a couple of years ago when we activated
checkpointing.  When slurmctld started, the checkpointing plugin
expected some extra data in the job states, which obviously wasn't
there, and slurmctld decided the data was invalid and killed all jobs.
(I don't know if this is still a problem.)



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Batch job submission failed: Invalid account or account/partition combination specified

2015-09-08 Thread Ryan Cox


We have seen similar issues on 14.11.8 but haven't bothered to diagnose 
or report it.  I think I've seen it twice so far out of dozens of new users.


Ryan

On 09/07/2015 09:16 AM, Loris Bennett wrote:

Hi,

This problem occurs with 14.11.8.

A user I set up today got the following error when submitting a
job:

Batch job submission failed: Invalid account or account/partition combination 
specified

Using

sacctmgr show user  withassoc

I can't see any difference between the user with the problem and
another user associated with the same account who can submit.

In the slurmcltd log I have

[2015-09-07T17:02:00.790] _job_create: invalid account or partition for user 
123456, account '(null)', and partition 'main'
[2015-09-07T17:02:00.790] _slurm_rpc_submit_batch_job: Invalid account or 
account/partition combination specified

Access to the partition 'main' is allowed to all.

Restarting slurmctld fixed the problem.

Is this a known issue?

Cheers,

Loris


[slurm-dev] Re: Changing /dev file permissions for particular user

2015-06-24 Thread Ryan Cox


Be sure to test it first before trying anything else: 
https://stackoverflow.com/questions/18661976/reading-dev-cpu-msr-from-userspace-operation-not-permitted. 
We ran into this issue once when we had a trusted person and we 
couldn't easily grant him access to the MSRs.  We couldn't find a good 
solution.  You could add the caps to a copy of the rdmsr binary and make 
that file only usable by your trusted user...


Assuming you have an old enough kernel, I would just add the user to the 
group that MSR files are owned by (and change settings so the relevant 
/dev files are owned by a different group than root).


Ryan

On 06/24/2015 03:08 PM, Marcin Stolarek wrote:

Hey!

I've got one user I trust and know that he isn't going to do anything 
malicious; he needs direct access to a file in /dev (/dev/cpu/*/msr in 
particular).


Has anybody looked into how to do such a thing in Slurm? We are thinking 
about doing it in the prolog and changing it back in the epilog, after 
checking that the node is exclusive to user X.  Do you know if the file 
permissions can be changed in the user's namespace, or how else to 
achieve this using Slurm on Linux?


cheers,
marcin


[slurm-dev] Re: concurrent job limit

2015-06-11 Thread Ryan Cox

Job arrays can kind of be used for that:

From http://slurm.schedmd.com/job_array.html:
A maximum number of simultaneously running tasks from the job array may 
be specified using a % separator. For example --array=0-15%4 will 
limit the number of simultaneously running tasks from this job array to 4.


Ryan

On 06/11/2015 08:12 AM, Martin, Eric wrote:
Is there a way for users to self limit the number of jobs that they 
concurrently run?


Eric Martin
Center for Genome Sciences  Systems Biology
Washington University School of Medicine
 Forest Park Avenue
St. Louis, MO 63108







--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
   inf   0
secant  100.003315   00.00   
 0.00   inf   0
   physics  parent0.0044470.396478   
 0.396478   0
hepx 7000.23201944470.396478   
 0.396478  0.585199   0
 hepx  test-hepx 10.01282144470.396478   
 1.00   0.226415   0.012821   0

   stat parent0.00   00.00  0.00 0
carroll 100.003315   00.00 
 0.00   inf 0


=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu

On Thu, Jun 4, 2015 at 11:51 AM, Ryan Cox <ryan_...@byu.edu> wrote:


Trey,

In http://slurm.schedmd.com/fair_tree.html#fairshare, take a look
at the definition for S.  Basically, the normalized shares only
matter between sibling associations and will sum to 1.0.  If an
association has no siblings, the value is 1.0.  If each of the four
siblings in an account has the same Raw Shares value (as defined
in sacctmgr), the normalized shares value for each is 0.25.  The
reason is that the Level Fairshare calculations are only done within
an account, comparing siblings to each other.  Note that Norm Usage
is still presented in sshare but not used in the calculations.

The sshare manpage has a section about the Fair Tree modifications
to existing columns:
http://slurm.schedmd.com/sshare.html#SECTION_FAIR_TREE%20MODIFICATIONS

Ryan


On 06/03/2015 02:47 PM, Trey Dockendorf wrote:

My site is currently on 14.03.10 and we are evaluating and
testing 14.11.7 as well as moving from
PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME to using
PriorityFlags=FAIR_TREE,SMALL_RELATIVE_TO_TIME.

Our account hierarchy is very deep and is intended to represent
the org structure of departments and research organizations that
are using our cluster [1].  We were able to make the normalized
share ratio match up so all non-stakeholders were equal
(0.000323) and all stakeholders had the correct ratio based on
their contributions to the cluster.  The Shares value assigned
represents CPUs funded.  All the CPUs no longer belonging to
stakeholders were given to the mgmt group so that the Shares
given to the top level (tamu) had a meaningful value when divided
up amongst all the accounts.

While testing FAIR_TREE I noticed the normalized shares were
drastically different [2].  In particular the current
stakeholders (idhcm and hepx) both ended up with 1.0.  I'm
guessing this is due to having no sibling accounts.

The docs for FAIR_TREE only describe the formula used to
calculate the Level FairShare. Does the method for calculating
normalized shares change for FAIR_TREE? Is the hierarchy we are
using not a good fit for FAIR_TREE?  The description and benefits
of FAIR_TREE appeal to our use case, so modifying our hierarchy
is within the realm of things I'm willing to change.

Any advice on migrating into FAIR_TREE is more than welcome. 
Right now I've been running sleep jobs under different UIDs to

simulate usage to try and work out how we may need to adjust
things for a migration to FAIR_TREE.

I used the attached spreadsheet to work out the share values we
are using with 14.03.10.

Thanks,
- Trey

[1]:
   Account   User Raw Shares Norm Shares   Raw Usage Effectv
Usage  FairShare
 -- -- ---
--- - --
root  1.00   114089982  1.00 0.870551
 root  root  10.000323   0  
   0.00   1.00
 grid10.0003233688  
   0.32   0.986174
  cms   100.0002693688  
   0.27   0.986155
suragrid   1  0.27   0  
   0.00 1.00
 tamu 30960.999354   114086294  
   0.68   0.870477
agriculture   20  0.0066712697  
   0.24 0.999507
   aglife10.003336  
 2697  0.12   0.999507
 genomics  1  0.003336   0  
   0.00 1.00
engineering   10  0.003336   0  
   0.00 1.00

   pete10.003336   0
 0.00   1.00
  general   100.003336

[slurm-dev] Re: FAIR_TREE in SLURM 14.11

2015-06-04 Thread Ryan Cox
[2]:
 Account            User  Raw Shares  Norm Shares  Raw Usage  Effectv Usage  FairShare
------------------ ------ ----------- ------------ ---------- -------------- ----------
root                                      0.00          53229       1.00
 root               root            1     0.000323         0        0.00       1.00
 grid                               1     0.000323         0        0.00
  cms                              10     0.909091         0        0.00
  suragrid                          1     0.090909         0        0.00
 tamu                            3096     0.999354     53229        1.00
  agriculture                      20     0.006676         0        0.00
   aglife                           1     0.50             0        0.00
   genomics                         1     0.50             0        0.00
  engineering                      10     0.003338         0        0.00
   pete                             1     1.00             0        0.00
  general                          10     0.003338      6326        0.118860
  geo                              10     0.003338         0        0.00
   atmo                             1     1.00             0        0.00
  liberalarts                     128     0.042724     13122        0.246522
   idhmc                            1     1.00         13122        1.00
  mgmt                           2058     0.686916     20984        0.394237
  science                         760     0.253672     12795        0.240382
   acad                            10     0.013158         0        0.00
   chem                            10     0.013158         0        0.00
   iamcs                           10     0.013158         0        0.00
   math-dept                       20     0.026316         0        0.00
    math                           10     0.50             0        0.00
    secant                         10     0.50             0        0.00
   physics                        700     0.921053     12795        1.00
    hepx                            1     1.00         12795        1.00
   stat                            10     0.013158         0        0.00
    carroll                         1     1.00             0        0.00

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: GPU node allocation policy

2015-04-07 Thread Ryan Cox


You can do something like this: JobSubmitPlugins=all_partitions,lua.  
Have a special empty partition, as you suggest.  Use the submit plugin 
to detect if the empty partition is in there.  If it is in the job's 
list of partitions, you know that the user didn't specify a particular 
partition.  If it is not in the list, you know that the user requested a 
particular partition (or set of partitions).  You can then do all sorts 
of fun logic.
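
A rough sketch of that detection inside job_submit.lua (the "catchall"
marker partition name is made up here):

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- if job_submit/all_partitions put the empty marker partition in the
    -- list, the user did not pass -p themselves
    local user_specified = true
    for p in (job_desc.partition or ""):gmatch("[^,]+") do
        if p == "catchall" then
            user_specified = false
        end
    end
    if not user_specified then
        -- the full list came from all_partitions; prune or reorder it here
        slurm.log_info("uid " .. tostring(submit_uid) ..
                       " did not request a partition")
    end
    return slurm.SUCCESS
end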


Does all the GPU code in question need only one CPU core?  Some of our 
users have code that can use multiple CPUs and multiple GPUs 
simultaneously (LAMMPS? NAMD?  I'd have to check...).  It might be 
limiting to restrict users to a certain amount of cores.  If you're 
scheduling memory, it's also important to make sure that there is some 
memory available for the GPU jobs.


What we do is use QOSes to control access to our GPU partition with 
AllowQos.  We use a job submit plugin to place jobs with the appropriate 
GRES into the gpu QOS, which is allowed into that partition.  We also 
allow jobs in a preemptable QOS into the partition, with the gpu QOS 
able to preempt jobs in the preemptable QOS.  We could also do a shorter 
walltime QOS or something with a lower priority but haven't done so; GPU 
jobs could get on there quickly even if all-CPU jobs are on there.  You 
could also have the job submit plugin add the gpu partition to a job's 
list of partitions if the job meets certain criteria even when it isn't 
requesting GPUs (a short walltime or something else).  Just some thoughts.


Ryan

On 04/07/2015 07:47 AM, Aaron Knister wrote:

Ah, I was wondering about that. You could try this:

Rename standard partition to cpu1
Create a partition called standard with no nodes
Use the lua submit plugin to rewrite the partition list from standard to 
cpu1,cpufromgpunode

I *think* that will work. I'm not sure about the empty partition piece and 
whether that will deny your submission before the submit filter  kicks in but 
my gut says no.

Sent from my iPhone


On Apr 7, 2015, at 9:18 AM, Schmidtmann, Carl carl.schmidtm...@rochester.edu 
wrote:

That only works if ALL the nodes have GPUs. We have 200+ nodes and 30 of them 
have GPUs. So we have to create three partitions - standard, gpu and  
cpufromgpunode. People in the standard partition can’t use the cpus on the gpu 
nodes. People that submit to the cpufromgpunode can’t use the cpus in the 
standard partition. We would like to see a way to specify 
MaxCPUsPerJobOnThisNode so the standard partition can use 24 cores on nodes 
without a GPU and less on nodes with a GPU. Or a way to specify 
ReserveCPUForGPU on the node or some such thing. I assume this is difficult 
because people have asked for it but it hasn’t been implemented.

Carl

Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester






On Apr 7, 2015, at 4:51 AM, Aaron Knister aaron.knis...@gmail.com wrote:

Would MaxCPUsPerNode set at the partition level help?

Here's the snippet from the man page:

MaxCPUsPerNode
Maximum number of CPUs on any node available to all jobs from this partition. This can be especially useful to schedule 
GPUs. For example a node can be associated with two Slurm partitions (e.g. cpu and gpu) and the 
partition/queue cpu could be limited to only a subset of the node's CPUs, insuring that one or more CPUs 
would be available to jobs in the gpu partition/queue.

Sent from my iPhone


On Apr 6, 2015, at 11:25 PM, Novosielski, Ryan novos...@ca.rutgers.edu wrote:

I imagine part of the reason is to keep people from running CPU jobs that 
would take more than 20 cores on the GPU machine, as others do not have GPUs. 
I'd be interested in knowing strategies here too.

 *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS  |-*O*-
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novos...@rutgers.edu- 973/972.0922 (2x0922)
||  \\  Sciences | OIRT/High Perf  Res Comp - MSB C630, Newark
`'


On Apr 6, 2015, at 20:17, Ryan Cox ryan_...@byu.edu wrote:


Chris,

Just have GPU users request the numbers of CPU cores that they need and
don't lie to Slurm about the number of cores.  If a GPU user needs 4
cores and 4 GPUs, have them request that.  That leaves 20 cores for
others to use.

Ryan


On 04/06/2015 03:43 PM, Christopher B Coffey wrote:
Hello,

I’m curious how you handle the allocation of GPU’s and cores on GPU
systems in your cluster.  My new GPU system is 24 core, with 2 Tesla K80’s
(4 gpus total).  We allocate cores/mem by:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory


What I’m thinking of doing is lying to Slurm about the true cores, and
specifying CPUs=20, along with Gres=gpu:tesla:4.  Is this a reasonable
solution in order to ensure there is a core reserved for each gpu in the
system?  My thought is to allocate the 20 cores on the system to non-GPU
type work instead of leaving them idle.

Thanks

[slurm-dev] Re: GPU node allocation policy

2015-04-06 Thread Ryan Cox


Chris,

Just have GPU users request the numbers of CPU cores that they need and 
don't lie to Slurm about the number of cores.  If a GPU user needs 4 
cores and 4 GPUs, have them request that.  That leaves 20 cores for 
others to use.
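
In other words, such a job would be submitted with something along the
lines of (the script name is a placeholder):

sbatch -N 1 -n 4 --gres=gpu:tesla:4 job.sh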


Ryan

On 04/06/2015 03:43 PM, Christopher B Coffey wrote:

Hello,

I’m curious how you handle the allocation of GPU’s and cores on GPU
systems in your cluster.  My new GPU system is 24 core, with 2 Tesla K80’s
(4 gpus total).  We allocate cores/mem by:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory


What I’m thinking of doing is lying to Slurm about the true cores, and
specifying CPUs=20, along with Gres=gpu:tesla:4.  Is this a reasonable
solution in order to ensure there is a core reserved for each gpu in the
system?  My thought is to allocate the 20 cores on the system to non-GPU
type work instead of leaving them idle.

Thanks!

Chris




[slurm-dev] RE: fairshare allocations

2015-01-21 Thread Ryan Cox



On 01/21/2015 09:23 AM, Bill Wichser wrote:


A user underneath gets the expected 0.009091 normalized shares since 
there are a lot of fairshare=1 users there.  User3 gets basically 
25x this value, since the fairshare for user3 is 25.


Yet the normalized shares is actually MORE than the normalized shares 
for the account as a whole.  What should I make of this?




This is actually by design in Fair Tree and is different from other 
algorithms.  The manpage for sshare covers this under FAIR_TREE 
MODIFICATIONS.  The manpage states that Norm Shares is "The shares 
assigned to the user or account normalized to the total number of 
assigned shares within the level."  Basically, the Norm Shares is the 
association's raw shares value divided by the sum of it and its sibling 
associations' assigned raw shares values.  For example, if an account 
has 10 users, each having 1 assigned raw share, the Norm Shares value 
will be .1 for each of those users under Fair Tree.


Fair Tree only uses Norm Shares and Effective Usage (the other sshare 
field that's modified) when comparing sibling associations. Our Slurm UG 
presentation slides also mention this on pages 35 and 76 
(http://slurm.schedmd.com/SUG14/fair_tree.pdf).


Ryan


[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox
 to this RAW usage.


Roshan



From: Ryan Cox <ryan_...@byu.edu>
Sent: 25 November 2014 17:43
To: slurm-dev
Subject: [slurm-dev] Re: [ sshare ] RAW Usage

Raw usage is a long double and the time added by jobs can be off by a 
few seconds.  You can take a look at _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c to see exactly 
what happens.


Ryan

On 11/25/2014 10:34 AM, Roshan Mathew wrote:

Hello SLURM users,

http://slurm.schedmd.com/sshare.html

*Raw Usage*

The number of cpu-seconds of all the jobs that charged the account
by the user. This number will decay over time when
PriorityDecayHalfLife is defined.

I am getting different Raw Usage values for the same job every
time it is executed. The job I am using is a CPU stress test for 1
minute.

It would be very useful to understand the formula for how this RAW
Usage is calculated when we are using the plugin
PriorityType=priority/multifactor.

Snip of my slurm.conf file:-

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# apply no decay
PriorityDecayHalfLife=0


PriorityCalcPeriod=1
PriorityUsageResetPeriod=MONTHLY

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=0
PriorityWeightFairshare=100
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=0 # don't use the qos factor

Thanks!





--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-26 Thread Ryan Cox

 # Activate the Multi-factor Job Priority Plugin with decay
 PriorityType=priority/multifactor

 # apply no decay
 PriorityDecayHalfLife=0
 PriorityCalcPeriod=1
 # reset usage after 1 month
 PriorityUsageResetPeriod=MONTHLY

 # The larger the job, the greater its job size priority.
 PriorityFavorSmall=NO

 # The job's age factor reaches 1.0 after waiting in the
 # queue for 2 weeks.
 PriorityMaxAge=14-0

 # This next group determines the weighting of each of the
 # components of the Multi-factor Job Priority Plugin.
 # The default value for each of the following is 1.
 PriorityWeightAge=0
 PriorityWeightFairshare=100
 PriorityWeightJobSize=0
 PriorityWeightPartition=0
 PriorityWeightQOS=0 # don't use the qos factor


 *Questions*

 1. Given that I have set PriorityDecayHalfLife=0, i.e. no decay 
applied at any stage, shouldn't both jobs have the same Raw Usage 
reported by sshare?


 2. Also, shouldn't CPUTimeRAW in sacct be the same as Raw Usage in sshare?


 From: Skouson, Gary B gary.skou...@pnnl.gov
 Sent: 25 November 2014 21:09
 To: slurm-dev
 Subject: [slurm-dev] Re: [ sshare ] RAW Usage

 I believe that the info share data is kept by slurmctld in memory.  
As far as I could tell from the code, it should be checkpointing the 
info to the assoc_usage file wherever slurm is saving state 
information.  I couldn’t find any docs on that, you’d have to check 
the code for more information.


 However, if you just want to see what was used, you can get the raw 
usage using sacct.  For example, for a given job, you can do something 
like:


 sacct -X -a -j 1182128  --format 
Jobid,jobname,partition,account,alloccpus,state,exitcode,cputimeraw


 -
 Gary Skouson


 From: Roshan Mathew [mailto:r.t.mat...@bath.ac.uk]
 Sent: Tuesday, November 25, 2014 9:51 AM
 To: slurm-dev
 Subject: [slurm-dev] Re: [ sshare ] RAW Usage

 Thanks Ryan,

 Is this value stored anywhere in the SLURM accounting DB? I could 
not find any value for the JOB that corresponds to this RAW usage.


 Roshan
 From: Ryan Cox ryan_...@byu.edu
 Sent: 25 November 2014 17:43
 To: slurm-dev
 Subject: [slurm-dev] Re: [ sshare ] RAW Usage

 Raw usage is a long double and the time added by jobs can be off by 
a few seconds.  You can take a look at _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c to see exactly 
what happens.


 Ryan

 On 11/25/2014 10:34 AM, Roshan Mathew wrote:
 Hello SLURM users,

 http://slurm.schedmd.com/sshare.html
 Raw Usage
 The number of cpu-seconds of all the jobs that charged the account 
by the user. This number will decay over time when 
PriorityDecayHalfLife is defined.
 I am getting different Raw Usage values for the same job every time 
it is executed. The job I am using is a CPU stress test for 1 minute.


 It would be very useful to understand the formula for how this RAW 
Usage is calculated when we are using the plugin 
PriorityType=priority/multifactor.


 Snip of my slurm.conf file:-

 # Activate the Multi-factor Job Priority Plugin with decay
 PriorityType=priority/multifactor

 # apply no decay
 PriorityDecayHalfLife=0

 PriorityCalcPeriod=1
 PriorityUsageResetPeriod=MONTHLY

 # The larger the job, the greater its job size priority.
 PriorityFavorSmall=NO

 # The job's age factor reaches 1.0 after waiting in the
 # queue for 2 weeks.
 PriorityMaxAge=14-0

 # This next group determines the weighting of each of the
 # components of the Multi-factor Job Priority Plugin.
 # The default value for each of the following is 1.
 PriorityWeightAge=0
 PriorityWeightFairshare=100
 PriorityWeightJobSize=0
 PriorityWeightPartition=0
 PriorityWeightQOS=0 # don't use the qos factor

 Thanks!





--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: [ sshare ] RAW Usage

2014-11-25 Thread Ryan Cox
Raw usage is a long double and the time added by jobs can be off by a 
few seconds.  You can take a look at _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c to see exactly 
what happens.


Ryan

On 11/25/2014 10:34 AM, Roshan Mathew wrote:

Hello SLURM users,

http://slurm.schedmd.com/sshare.html
*Raw Usage*
The number of cpu-seconds of all the jobs that charged the account by 
the user. This number will decay over time when PriorityDecayHalfLife 
is defined.
I am getting different Raw Usage values for the same job every time 
it is executed. The job I am using is a CPU stress test for 1 minute.


It would be very useful to understand the formula for how this RAW 
Usage is calculated when we are using the plugin 
PriorityType=priority/multifactor.


Snip of my slurm.conf file:-

# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

# apply no decay
PriorityDecayHalfLife=0

PriorityCalcPeriod=1
PriorityUsageResetPeriod=MONTHLY

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
PriorityMaxAge=14-0

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=0
PriorityWeightFairshare=100
PriorityWeightJobSize=0
PriorityWeightPartition=0
PriorityWeightQOS=0 # don't use the qos factor

Thanks!




[slurm-dev] Re: How many accounts can SLURM support?

2014-11-19 Thread Ryan Cox

Dave,

I have done testing on 5-6 year old hardware with 100,000 users randomly 
distributed in 10,000 accounts with semi-random depths with most being 
between 1-4 levels from root but some much deeper than that, plus 
100,000 jobs pending.  slurmctld startup time was really long but, after 
getting started, fairshare and decay iterations in all fairshare 
algorithms took 50-150 milliseconds depending on how you measure it.  
Those calculations run no more frequently than once per minute and can 
be configured to run less frequently.


You shouldn't have any problems.
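
For what it's worth, creating that kind of per-user sub-account is only
a couple of sacctmgr calls each, along the lines of (names taken from
your example; just a sketch):

sacctmgr add account query_type_a_dlipowitz parent=type_a
sacctmgr add user dlipowitz account=query_type_a_dlipowitz

so the whole structure is easy to script even at a few thousand accounts.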

Ryan

On 11/18/2014 12:30 PM, David Lipowitz wrote:

Does anyone have a sense of how far SLURM scales regarding accounts 
and sub-accounts?


In our batch environment, all jobs need to run under the same service 
account for a number of reasons (which I won't go into here).  Since 
our scheduler knows which end user is actually submitting the job, 
we'd like to handle prioritization by creating sub-accounts for each 
user under each of the leaf accounts depicted below:


root
 |
 +- query
 ||
 |+- type_a
 ||
 |+- type_b
 ||
 |+- type_c
 ||
 |+- type_d
 |
 +- process


So I'd have five accounts, one for each type of query and another for 
the process account:


query_type_a_dlipowitz
query_type_b_dlipowitz
query_type_c_dlipowitz
query_type_d_dlipowitz

process_dlipowitz


And each other user would have five analogous accounts.

Given that we have 600 users, can SLURM handle 3000 sub-accounts like 
this?  If we doubled in size, could SLURM handle 6000?


Thanks for any insight you might be able to offer.


Cheers,
Dave




[slurm-dev] Re: Non static partition definition

2014-10-30 Thread Ryan Cox

George,

Wouldn't a QOS with GrpNodes=10 accomplish that?
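
Roughly (a sketch; the names are made up):

sacctmgr add qos institute_b
sacctmgr modify qos institute_b set GrpNodes=10
sacctmgr modify user where account=institute_b set qos=institute_b defaultqos=institute_b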

Ryan

On 10/30/2014 11:47 AM, Brown George Andrew wrote:

Hi,

I would like to have a partition of N nodes without statically 
defining which nodes should belong to a partition and I'm trying to 
work out the best way to achieve this.


Currently I have partitions which span across all the nodes in my 
cluster with differing settings, but I would like some of these to 
only occupy a subset of the cluster. I could say define partition A 
which can use all nodes but partition B may only access nodes 01-10. 
But I would like avoid partition B being reduced in size in the event 
of maintenance or hardware failure.


I'm thinking the way to do this would be via a plugin. I would keep 
all partitions spanning all nodes in the cluster but upon submission 
check how many nodes are in use on the requested partition. If there 
were say already 10 nodes in use in partition B the job should be 
queued. However things then get a bit more complex as to when slurm 
should de-queue and then run the job.


Is there a native method to do this in slurm? Essentially I would like 
something like the MaxNodes option that exists for partitions today 
but have it limit the total number of nodes used by jobs submitted to 
that partition rather than just a limit per job.


Many thanks,
George




[slurm-dev] Re: Understanding Fairshare and effect on background/backfill type partitions

2014-10-27 Thread Ryan Cox

Trey,

I'm not sure why your jobs aren't starting.  Someone else will have to 
answer that question.


You can model an organizational hierarchy a lot better in 14.11 due to 
changes in Fairshare=parent for accounts.  If you only want fairshare to 
matter at the research group and user levels but want to maintain an 
account structure that reflects your organization, set everything above 
the research group to be Fairshare=parent.  It makes it so that those 
accounts disappear for fairshare calculation purposes (but not limits, 
accounting, etc).


As for fairshare, precision loss can be a real issue and I'm guessing 
that you're affected.  I won't rehash our Slurm UG presentation here, 
but we spent some time discussing precision loss issues.  What 
normalized shares values do you see?  Try plugging that into 
2^(-EffectvUsage / SharesNorm) to see how small the number is.  That 
number then has to be multiplied by PriorityWeightFairshare, which I see 
you sized properly.
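
To make the scale concrete: with a Norm Shares of, say, 0.0003 and an
Effectv Usage of 0.003 (ten times the share), 2^(-0.003/0.0003) = 2^-10,
which is about 0.001; multiplied by PriorityWeightFairshare=4000 that is
roughly 4 priority points, essentially noise next to your age and job
size weights.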


I would suggest looking at the Fair Tree fairshare algorithm once 14.11 
is released.  In case you want more information: 
http://slurm.schedmd.com/SUG14/fair_tree.pdf and 
https://fsl.byu.edu/documentation/slurm/fair_tree.php.  The slides in 
the first link also discuss Fairshare=parent in slides 82-91.


Ryan

Disclaimer:  I have some personal interest in both of the suggestions 
since we developed them.


On 10/24/2014 10:49 AM, Trey Dockendorf wrote:

In our setup we use a background partition that can be preempted but 
has access to the entire cluster.  The idea is that when stakeholder 
partitions are not fully utilized, users can be opportunistic in 
making use of the cluster when the system is not 100% utilized.


Recently I submitted a batch of jobs , ~60, to our background 
partition.  All nodes were idle but half my jobs ended up pending with 
reason of Priority.  I checked sshare and my FairShare value was 
at 0.00.  Would my Fairshare dropping to 0 cause my jobs to be 
queued when resources were IDLE and no other jobs were queued in that 
partition besides my own?


I'm also wondering what method is used to come up with sane Fairshare 
values.  We have a (likely unnecessarily) complex account structure in 
slurmdbd that mimics the organizational structure of the departments / 
colleges / research groups using the cluster.  I'd be interested in how 
other groups have configured fairshare and the multifactor priority.


For completeness, here are relevant config items I'm working with:

AccountingStorageEnforce=limits,qos
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio
PriorityCalcPeriod=5
PriorityDecayHalfLife=7-0
PriorityFavorSmall=YES
PriorityFlags=SMALL_RELATIVE_TO_TIME
PriorityMaxAge=7-0
PriorityType=priority/multifactor
PriorityUsageResetPeriod=NONE
PriorityWeightAge=2000# 20%
PriorityWeightFairshare=4000  # 40%
PriorityWeightJobSize=3000# 30%
PriorityWeightPartition=0 # 0%
PriorityWeightQOS=1000# 10%
SchedulerParameters=assume_swap # An option for in-house patch
SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK

Example of a stakeholder partition and background:

PartitionName=hepx Nodes=c0[101-116,120-132,227,416,530-532,933-936] 
Priority=100 AllowQOS=hepx  MaxNodes=1 MaxTime=120:00:00 State=UP
PartitionName=background Priority=10 AllowQOS=background MaxNodes=1 
MaxTime=96:00:00 State=UP


Thanks,
- Trey

=

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treyd...@tamu.edu
Jabber: treyd...@tamu.edu




[slurm-dev] RE: EXTERNAL: Re: question on multifactor priority plugin - fairshare basics

2014-10-16 Thread Ryan Cox

Ed,

Your math looks correct.  In 14.11 you can achieve what you want by 
setting Fairshare=parent on your dev account with sacctmgr. 
Fairshare=parent on accounts (only defined on users prior to 14.11) 
makes it so that accounts effectively disappear for fairshare 
calculations but still exist for limits and organizational purposes.  
Children are effectively reparented to their account's parent (root in 
your case) for fairshare.


Ryan


On 10/14/2014 08:06 PM, Blosch, Edwin L wrote:


Thanks for the reply Ryan,

Yes, I’m using the basic fairshare.  I am trying to use fairshare 
across a flat listing of users only, with a placeholder parent account 
called ‘dev’, but for now, it has no siblings.  All users are under 
‘dev’.


I think the way it is calculated, in my configuration, the largest 
fairshare I will ever see is 0.5.


F = 2**(-Ue/S), where in my case S = 1000 / 16000 (1000 per user, 16 
users who each have 1000),


and I have Ue = S for a user who has never submitted a job yet, because 
Ue = 0 (Uactual) + (1.0 - 0.0)*1000/16000 (1.0 is parent usage, which 
is always 1.0 in my case because dev is the only parent account for 
any user).
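
(Working that through: Ue = 1000/16000 = 0.0625 = S, so F =
2**(-0.0625/0.0625) = 2**(-1) = 0.5 for a user who has not run anything,
which is why 0.5 is the ceiling in this configuration.)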


I was expecting/hoping/wishing the values would be between 0.0 and 
1.0, but I can work with 0.5 as the max value.  It just means that I 
need to double the PriorityWeightFairshare factor in order to achieve 
the intended relative weighting between Fairshare, QOS, Partitions, 
JobSize, Age.


Ed

From: Ryan Cox [mailto:ryan_...@byu.edu]
Sent: Tuesday, October 14, 2014 6:00 PM
To: slurm-dev
Subject: EXTERNAL: [slurm-dev] Re: question on multifactor priority 
plugin - fairshare basics


I assume you are using the default fairshare algorithm since you 
didn't specify otherwise.  F=2**(-U/S) where U is Effectv Usage (often 
displayed in documentation as UE) and S is Norm Shares.  See 
http://slurm.schedmd.com/priority_multifactor.html under the heading 
The SLURM Fair-Share Formula.


Basically, Effectv Usage needs to be less than Norm Shares for 
Fairshare to be greater than 0.5.


Ryan

On 10/14/2014 04:27 PM, Blosch, Edwin L wrote:

I must be misunderstanding a basic concept here.

What conditions would have to exist to cause a Fairshare value
greater than 0.5?

[bloscel@maruhpc5 ~]$ sshare -a

     Account      User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
-------------- --------- ----------- ------------ ----------- -------------- ----------
root                                     1.00        11376527       1.00       0.50
 root          root               0      0.00               0       0.00       0.00
 cfd                              1      1.00        11376527       1.00       0.50
  cfd          bendeee         1000      0.076923           0       0.076923   0.50
  cfd          bloscel         1000      0.076923      712296       0.134718   0.297027

more users under same group




--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: question on multifactor priority plugin - fairshare basics

2014-10-14 Thread Ryan Cox
I assume you are using the default fairshare algorithm since you didn't 
specify otherwise.  F=2**(-U/S) where U is Effectv Usage (often 
displayed in documentation as UE) and S is Norm Shares.  See 
http://slurm.schedmd.com/priority_multifactor.html under the heading 
The SLURM Fair-Share Formula.


Basically, Effectv Usage needs to be less than Norm Shares for Fairshare 
to be greater than 0.5.
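
As a concrete check against the sshare output below: for bloscel,
F = 2**(-0.134718/0.076923) = 2**(-1.7513), which is about 0.297 and
matches the 0.297027 shown; for bendeee, usage equals shares, giving
2**(-1) = 0.5.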


Ryan

On 10/14/2014 04:27 PM, Blosch, Edwin L wrote:


I must be misunderstanding a basic concept here.

What conditions would have to exist to cause a Fairshare value greater 
than 0.5?


[bloscel@maruhpc5 ~]$ sshare -a

     Account      User  Raw Shares  Norm Shares   Raw Usage  Effectv Usage  FairShare
-------------- --------- ----------- ------------ ----------- -------------- ----------
root                                     1.00        11376527       1.00       0.50
 root          root               0      0.00               0       0.00       0.00
 cfd                              1      1.00        11376527       1.00       0.50
  cfd          bendeee         1000      0.076923           0       0.076923   0.50
  cfd          bloscel         1000      0.076923      712296       0.134718   0.297027


more users under same group





[slurm-dev] Re: Authentication and invoking slurm commands from web app

2014-10-02 Thread Ryan Cox



-- 
Morris Moe Jette

CTO, SchedMD LLC




--

José Román Bilbao Castro
Ingeniero Consultor
+34 901009188
jrbc...@idiria.com
http://www.idiria.com






--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Submitting to multiple partitions with job_submit plugin (Was: Implementing fair-share policy using BLCR)

2014-09-29 Thread Ryan Cox



On 09/23/2014 11:27 AM, Trey Dockendorf wrote:

Has anyone used the Lua job_submit plugin on a cluster that also allows multiple 
partitions?  I'm not even sure what the partition value would be in the Lua code 
when a job is submitted with --partition=general,background, for example.


We do.  We use the all_partitions plugin together with our own Lua job_submit 
plugin.  In the Lua script, we remove partitions that the job shouldn't have 
access to for whatever reason.  Reasons include: the job didn't request enough 
memory to need a big-memory node, the job didn't request a GPU and the 
partition is GPU-only, etc.  The partition string is comma-separated, so you 
can split it into an array, as in the sketch below.
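
A minimal sketch of that idea (an editorial illustration, not BYU's actual 
script; the blocked table is a hypothetical stand-in for whatever site 
policy you apply):

-- partitions this job should not land in (hypothetical; a real script would
-- compute this from the job's memory request, GRES request, and so on)
local blocked = { bigmem = true, gpu = true }

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.partition ~= nil then
      local keep = {}
      -- the partition field is a comma-separated string; split and filter it
      for part in string.gmatch(job_desc.partition, "[^,]+") do
         if not blocked[part] then
            table.insert(keep, part)
         end
      end
      job_desc.partition = table.concat(keep, ",")
   end
   return slurm.SUCCESS
end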



Ryan


[slurm-dev] Re: Dynamic partitions on Linux cluster

2014-08-14 Thread Ryan Cox


I would also recommend a QOS if you absolutely can't use fairshare. Set up 
a QOS per institute with a GrpNodes limit that reflects the correct ratio and 
only allow institute members to use their QOS (make it their default too).


Alternatively you can also do one account per institute and set GrpNodes 
there, though that is less flexible than a QOS.


Ryan

On 08/14/2014 07:48 AM, Paul Edmon wrote:


We have a bit of a similar situation here.  A possible solution that 
may work for you is QoS.  The QoS's behave like a synthetic 
partition.  That way you can have a single partition but multiple 
QoS's which can flex around down nodes.


From the experimentation I have done with them this may be a good 
solution for you.


-Paul Edmon-

On 08/14/2014 09:25 AM, Uwe Sauter wrote:

I would totally agree with you, but the university administration has to
justify the first institute's share (because it was paid for with
federal money), while the other institute paid for itself and can do
with its part what it wants.

This is the reason for the current inflexible mapping between partitions
and nodes. To get away from that for better availability, I'm looking for
a way to have a dynamic mapping that just enforces the ratio between the
institutes while flexibly allocating nodes from the whole pool.

I know it's a waste of resources, but I am bound to this decision...

Regards,

Uwe


Am 14.08.2014 um 14:59 schrieb Bill Barth:

Yes, yes it does. I don't mean to be harsh, but doing it their way is a
potentially huge waste of resources. Why not get each institute to 
agree
to share the whole machine in proportion to what they paid? Each 
institute
gets an allocation of time (through accounting) and a fairshare 
fraction

in the ratio of their contribution, but is allowed to use the whole
machine. If both institutes have periods of down time, then the machine
will be less likely to sit idle and more work will get done.

I'll get off my soapbox now.

Best,
Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435 |   Fax:   (512) 475-9445







On 8/14/14, 7:48 AM, Uwe Sauter uwe.sauter...@gmail.com wrote:


Hi Bill,

if I understand the concept of fairshare correctly, this could 
result in

a situation where one institute uses all resources.

Because of this, fairshare is out of the question, as I have to enforce
the ratio between the institutes - I cannot allow usage that would
result in one institute using more than what they paid for. If an
institute doesn't use its resources, those nodes have to sit idle (or power
down).


You could compare my situation with running two clusters that use the
same base infrastructure. What I want to do is enable users of both
institutes to use both clusters - but, at any point in time, each
institute may use at most the number of nodes that belongs to its cluster.


Regards,

Uwe


Am 14.08.2014 um 14:34 schrieb Bill Barth:
Why not make one partition and use fairshare to balance the usage 
over

time? That way both institutes can run large jobs that span the whole
machine when others are not using it.

Bill.
--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435 |   Fax:   (512) 475-9445







On 8/14/14, 4:11 AM, Uwe Sauter uwe.sauter...@gmail.com wrote:


Hi all,

I got a question about a configuration detail: dynamic partitions

Situation:
I operate a Linux cluster of currently 54 nodes for a cooperation of two
different institutes at the university. To reflect the ratio of the money
those institutes invested, I configured SLURM with two partitions, one for
each institute. Those partitions have different numbers of nodes assigned
to them in a fixed way, e.g.

PartitionName=InstA Nodes=n[01-20]
PartitionName=InstB Nodes=n[21-54]

To improve availability in case nodes break (and perhaps save some
power) I'd like to configure SLURM in a way that jobs can be 
assigned

nodes from the whole pool, respecting the number of nodes each
institute
bought.


Research so far:
There is a partition configuration option called MaxNodes, but the
man pages state that it restricts the maximum number of nodes PER JOB.
It is probably possible to get something similar working using limit
enforcement through accounting, but I haven't understood that part of
SLURM 100% yet.
BlueGene systems seem to have a similar capability, but that is for IBM
systems only.


Question:
Is it possible to configure SLURM so that both partitions could utilize
all nodes but respect a maximum number of nodes that may be in use at the
same time? Something like:

PartitionName=InstA Nodes=n[01-54] MaxPartNodes=20
PartitionName=InstB Nodes=n[01-54] MaxPartNodes=34

So is there a way to achieve this using the config file? Do I have to use
accounting to enforce the limits? Or is there another way that I don't
see?


Best regards,

Uwe Sauter


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Janne,

I appreciate the feedback.  I agree that it makes the most sense to 
specify rates like DRF most of the time.  However, there are some use 
cases that I'm aware of and others that are probably out there that 
would make a DRF imitation difficult or less desirable if it's the only 
option.


We happen to have one partition that has mixed memory amounts per node, 
32 GB and 64 GB.  Besides the memory differences (long story), the nodes 
are homogeneous and each have 16 cores.  I'm not sure I would like the 
DRF approach for this particular scenario.  In this case we would like 
to set the charge rate to be .5/GB, or 1 core == 2 GB RAM.  If someone 
needs 64 GB per node, they are contending for a more limited resource 
and we would be happy to double the charge rate for the 64 GB nodes.  If 
they need all 64 GB, they would end up being charged for 32 
CPU/processor equivalents instead of 16.  With DRF that wouldn't be 
possible if I understand correctly.
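
As a quick editorial sketch of that charge-rate arithmetic in Lua (an 
illustration only, assuming ChargePerCPU=1.0 and ChargePerGB=0.5, i.e. 
1 core == 2 GB):

-- processor equivalents = MAX(CPUs * ChargePerCPU, memory_GB * ChargePerGB)
local function cpu_equivalents(cpus, mem_gb)
   return math.max(cpus * 1.0, mem_gb * 0.5)
end

print(cpu_equivalents(16, 32))  -- 16: a full 32 GB node charges like its 16 cores
print(cpu_equivalents(16, 64))  -- 32: a full 64 GB node charges like 32 cores
print(cpu_equivalents(1, 4))    -- 2:  1 core with 4 GB charges like 2 cores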


One other feature that could be interesting is to have a baseline 
standard for a CPU charge on a per-partition basis.  Let's say that you 
have three partitions:  old_hardware, new_hardware, and 
super_cooled_overclocked_awesomeness.  You could set the per CPU charges 
to be 0.8, 1.0, and 20.0.  That would reflect that a cpu-hour on one 
partition doesn't result in the same amount of computation as in another 
partition.  You could accomplish the same thing automatically by using a 
QOS (and maybe some other parameter I'm not aware of) and maybe a job 
submit plugin but this would make it easier.  I don't know that we would 
do this in our setup but it would be possible.


It would be possible to add a config parameter that is something like 
Mem=DRF that would auto-configure it to match.  The one question I have 
about that approach is what to do about partitions with non-homogeneous 
nodes.  Does it make sense to sum the total cores and memory, etc or 
should it default to a charge rate that is the min() of the node 
configurations?  Of course, partitions with mixed node types could be 
difficult to support no matter what method is used for picking charge rates.


So yes, having a DRF-like auto-configuration could be nice and we might 
even use it for most of our partitions.  I don't think I'll attempt it 
for the initial implementation but we'll see.


Thanks,
Ryan

On 07/30/2014 03:31 PM, Blomqvist Janne wrote:

Hi,

if I understand it correctly, this is actually very close to Dominant Resource 
Fairness (DRF) which I mentioned previously, with the difference that in DRF 
the charge rates are determined automatically from the available resources (in 
a partition) rather than being specified explicitly by the administrator. So 
for an example, say we have a partition with 100 cores and 400 GB memory. Now 
for a job requesting (10CPU's, 20 GB) the domination calculation proceeds as 
follows:

1) Calculate the domination vector by dividing each element in the request vector 
(here, CPU and MEM) by the available resources. That is (10/100, 20/400) = (0.1, 0.05).

2) The MAX element in the domination vector is chosen (it dominates the 
others, hence the name of the algorithm) as the one to use in fairshare calculations, 
accounting etc. In this case, the CPU element (0.1).

Now for another job request, (1CPU, 20 GB) the domination vector is (0.01, 
0.05) and the MAX element is the memory element (0.05), so in this case the 
memory part of the request dominates.
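
For reference, the same domination calculation as a small Lua sketch (an 
editorial illustration of the example above, not Slurm code):

-- dominant share = MAX over resources of (requested / available in the partition)
local function dominant_share(req_cpus, req_mem_gb, part_cpus, part_mem_gb)
   return math.max(req_cpus / part_cpus, req_mem_gb / part_mem_gb)
end

print(dominant_share(10, 20, 100, 400))  -- 0.1:  the CPU request dominates
print(dominant_share( 1, 20, 100, 400))  -- 0.05: the memory request dominates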

In your patch you have used cpu-sec equivalents rather than dominant share 
secs, but that's just a difference of a scaling factor. From a backwards compatibility and 
user education point of view cpu-sec equivalents seem like a better choice to me, actually.

So while you patch is more flexible than DRF in that it allows arbitrary charge 
rates to be specified, I'm not sure it makes sense to specify rates different 
from the DRF ones? Or if one does specify different rates, it might end up 
breaking some of the fairness properties that are described in the DRF paper 
and opens up the algorithm for gaming?

--
Janne Blomqvist


From: Ryan Cox [ryan_...@byu.edu]
Sent: Tuesday, July 29, 2014 18:47
To: slurm-dev
Subject: [slurm-dev] RE: fairshare - memory resource allocation

I'm interested in hearing opinions on this, if any.  Basically, I think
there is an easy solution to the problem of a user using few CPUs but a
lot of memory and that not being reflected well in the CPU-centric usage
stats.

Below is my proposal.  There are likely some other good approaches out
there too (Don and Janne presented some) so feel free to tell me that
you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be (CPU
equivalents * time) instead of just (CPUs * time).  The CPU
equivalent would be a MAX() of CPUs, memory, nodes, GPUs, energy over
that time period, or whatever multiplied by a corresponding charge rate

[slurm-dev] RE: fairshare - memory resource allocation

2014-07-31 Thread Ryan Cox


Thanks.  I can certainly call it that.  My understanding is that this 
would be a slightly different implementation from Moab/Maui, but I don't 
know those as well so I could be wrong.  Either way, the concept is 
similar enough that a more recognizable term might be good.


Does anyone else have thoughts on this?  I called it CPU equivalents 
because the calculation in the code is currently (total_cpus * time) 
so I stuck with CPUs.  Slurm seems to use lots of terms somewhat 
interchangeably so I couldn't really decide.  I don't really have an 
opinion on the name so I'll just accept what others decide.


Ryan

On 07/31/2014 02:28 AM, Bjørn-Helge Mevik wrote:

Just a short note about terminology.  I believe processor equivalents
(PE) is a much used term for this.  It is at least what Maui and Moab
uses, if I recall correctly.  The resource*time would then be PE seconds
(or hours, or whatever).



[slurm-dev] RE: fairshare - memory resource allocation

2014-07-29 Thread Ryan Cox
.  The patch currently 
implements charging for CPUs, memory (GB), and nodes.


Note:  I saw a similar idea in a bug report from the University of 
Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858.


Ryan

On 07/25/2014 10:31 AM, Ryan Cox wrote:


Bill and Don,

We have wondered about this ourselves.  I just came up with this idea 
and haven't thought it through completely, but option two seems like 
the easiest.  For example, you could modify lines like 
https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 
to have a MAX() of a few different types.


I seem to recall seeing this on the list or in a bug report somewhere 
already, but you could have different charge rates for memory or GPUs 
compared to a CPU, maybe on a per partition basis. You could give each 
of them a charge rate like:
PartitionName=p1  ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 
..


So the line I referenced would be something like the following (except 
using real code and real struct members, etc):
real_decay = run_decay * MAX(CPUs*ChargePerCPU, 
TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);


In this case, each CPU is 1.0 but each GB of RAM is 0.5.  Assuming no 
GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting 
usage is 1.0.  But if they use 4 GB of RAM and 1 CPU, it is 2.0 just 
like they had been using 2 CPUs.  Essentially you define every 2 GB of 
RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with 
cpu equivalents.


It might be harder to explain to users but I don't think it would be 
too bad.


Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:

Bill,

As I understand the dilemma you presented, you want to maximize the 
utilization of node resources when running with Slurm configured for 
SelectType=select/cons_res.  To do this, you would like to nudge 
users into requesting only the amount of memory they will need for 
their jobs.  The nudge would be in the form of decreased fair-share 
priority for users' jobs that request only one core but lots of memory.


I don't know of a way for Slurm to do this as it exists.  I can only 
offer alternatives that have their pros and cons.


One alternative would be to add memory usage support to the 
multifactor priority plugin.  This would be a substantial undertaking 
as it touches code not just in multifactor/priority_multifactor.c but 
also in structures that are defined in common/assoc_mgr.h as well as 
sshare itself.


A second less invasive option would be to redefine the 
multifactor/priority_multifactor.c's raw_usage to make it a 
configurable blend of cpu and memory usage.  These changes could be 
more localized to the multifactor/priority_multifactor.c module.  
However you would have a harder time justifying a user's sshare 
report because the usage numbers would no longer track jobs' 
historical cpu usage.  Your response to a user who asked you to 
justify their sshare usage report would be: "trust me, it's right."


A third alternative (as I'm sure you know) is to give up on perfectly 
packed nodes and make every 4G of memory requested cost 1 cpu of 
allocation.


Perhaps there are other options, but those are the ones that 
immediately come to mind.


Don Lipari


-Original Message-
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there.  
But we

have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot 
where

we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that if this value was too high
that they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core, with an actual node having 4G/core available.
This allows a mix of bigger-memory and smaller-memory jobs to make use of
a node when cores are still available but there is not enough memory left
for the default request.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem with
fairshare comes about pretty quickly when there are jobs requiring, say,
half the node's memory.  These are mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes

[slurm-dev] Re: fairshare

2014-07-15 Thread Ryan Cox


Bill,

I may be wrong (corrections welcomed), but I'm pretty sure you'll have 
to use a database query.  My understanding is that the decayed usage is 
stored as a single usage_raw value per association 
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1119). 
There is no history of any kind.


You would have to do a fairly complex query to get an accurate 
representation or write some code to recreate the way Slurm does it.  If 
you look at _apply_decay() and _apply_new_usage() in 
src/plugins/priority/multifactor/priority_multifactor.c, you can see all 
that happens.  Basically, once per decay thread iteration each 
association's usage_raw and the job's cputime for that time period is 
calculated and decayed accordingly.  This can happen many, many times 
over the length of a job.  If a job terminates before reaching its 
timelimit, the remaining allocated cputime is immediately added all at 
the same time 
(https://github.com/SchedMD/slurm/blob/f8025c1484838ecbe3e690fa565452d990123361/src/plugins/priority/multifactor/priority_multifactor.c#L1036).
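
Conceptually, the decay loop works something like this Lua sketch (an 
editorial illustration only, not Slurm's actual code; the half-life and 
interval values are placeholders for PriorityDecayHalfLife and the decay 
thread period):

local half_life = 7 * 24 * 3600   -- placeholder: PriorityDecayHalfLife in seconds
local interval  = 5 * 60          -- placeholder: decay thread period in seconds
local decay     = 0.5 ^ (interval / half_life)

local usage_raw = 0
-- run once per decay iteration for an association with running jobs
local function apply_iteration(cpus_allocated)
   usage_raw = usage_raw * decay + cpus_allocated * interval
end

Because only the single, already-decayed usage_raw number survives each 
iteration, there is no per-period history left to report afterwards.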


Those are some of the issues that you may run into while creating a 
database tool for this.


I could be mistaken on some of the details but that is my understanding 
of the code (we looked recently for an unrelated reason).


Ryan

On 07/14/2014 02:15 PM, Bill Wichser wrote:


Is there any way to get a better view of fairshare than the sshare 
command?


Under PBS, there was the diagnose -f command, which showed the 
per-time-period breakdown that went into calculating this value.  What was 
nice about this was that I could point a group to this command, or cut and 
paste its output, to show that you have been using 20% over the last 30 days 
even though you haven't run anything in the last three days.


It's a much more difficult question when I'm asked now.  I have no tool 
which shows the value, and its decay, over time.  So I'm wondering if 
anyone has a method to demonstrate that, yes, this fairshare value is 
correct and here is why.  Or do I just need to figure out a database 
query to cull this information?


Thanks,
Bill


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: installing slurm on CentOS 5.10

2014-06-24 Thread Ryan Cox

Steve,

Our script generator was rewritten recently and released on Github: 
https://github.com/BYUHPC/BYUJobScriptGenerator. You might want to try 
that out and tailor it for your needs, though we have no problem with 
people linking to our site directly if you don't want to host your own 
version.


Ryan

On 06/24/2014 08:38 AM, Love, Steve W. wrote:

Hello,
I'm trying to build a version of SLURM on a VM for the purpose of 
testing.  The VM is running CentOS 5.10 and has 4 processors.  Our HPC 
users will be faced with the task of changing their submission scripts 
from a cluster running SGE to one where they'll be using SLURM.
I'd like to use the installation of SLURM so that our users can test 
simple scripts with:

https://marylou.byu.edu/documentation/slurm/script-generator

I've been following some notes from the Clustervision user portal 
which suggest performing the following:

use yum to install the numactl libraries
build hwloc, which I did with: ./configure --prefix=/usr/local/hwloc/1.8.1
build munge, which I did with: ./configure --prefix=/usr 
--sysconfdir=/etc --localstatedir=/var && make && make install
build slurm, which I did with: ./configure 
--prefix=/usr/local/slurm/14.03.3-2/ --enable-multiple-slurmd 
--with-hwloc=/usr/local/hwloc/1.8.1/ --enable-pam

When I try to start a slurm daemon it complains about not having any 
configuration files ... which I can never find.

I've since gone with:
./configure --prefix=/usr/local/slurm/14.03.3-2/ 
--enable-multiple-slurmd --with-hwloc=/usr/local/hwloc/1.8.1/ 
--enable-pam --sysconfdir=/usr/local/slurm/14.03.3-2/

But that too failed to produce any config files.
Any ideas as to what I'm doing wrong here?
Thanks,
Steve Love.
British Geological Survey
Edinburgh



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] LEVEL_BASED prioritization method

2014-06-20 Thread Ryan Cox


Levi Morrison and I have developed a new Slurm prioritization method 
that we call LEVEL_BASED.  It prioritizes users such that users in an 
under-served account will always have a higher fair share factor than 
users in an over-served account.


It works very well for us, though I understand that many sites have 
different needs.  If you're interested, check out the documentation at 
https://fsl.byu.edu/documentation/slurm/level_based.php or try it out at 
https://github.com/BYUHPC/slurm in the level_based branch.


If you want to read about some of the problems we ran into with existing 
algorithms (as they apply to our use case), see 
http://tech.ryancox.net/2014/06/problems-with-slurm-prioritization.html.


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Fairshare=parent on an account: What should it do?

2014-06-10 Thread Ryan Cox


We're trying to figure out what the intended behavior of 
Fairshare=parent is when set on an account 
(http://bugs.schedmd.com/show_bug.cgi?id=864).  We know what the actual 
behavior is but we're wondering if anyone actually likes the current 
behavior.  There could be some use case out there that we don't know about.


For example, you can end up with a scenario like the following:
                acctProf
               /   |    \
  acctTA(parent) uD(5) uE(5)
     /   |   \
 uA(5) uB(5) uC(5)


The number in parentheses is the Fairshare value according to sacctmgr.  We 
incorrectly thought that Fairshare=parent would essentially collapse the 
tree so that uA - uE would all be on the same level.  Thus, all five 
users would each get 5 / 25 shares.


What actually happens is you get the following shares at the user level:
shares (uA) = 5 / 15 = .333
shares (uB) = 5 / 15 = .333
shares (uC) = 5 / 15 = .333
shares (uD) = 5 / 10 = .5
shares (uE) = 5 / 10 = .5

That's pretty far off from each other, but not as far as it would be if 
one account had two users and the other had forty.  Assuming this 
demonstration value of 5 shares, that would be:

user_in_small_account = 5 / (2*5) = .5
user_in_large_account = 5 / (40*5) = .025

Is that actually useful to someone?

We want to use subaccounts below a faculty account to hold, for example, 
a grad student or postdoc who teaches a class.  It would be nice for the 
grad student to have administrative control over the subaccount since he 
actually knows the students but not have it affect priority calculations.


Ryan

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: How to spread jobs among nodes?

2014-05-08 Thread Ryan Cox
Rather than maximizing fragmentation globally, you probably want to do it on a 
per-job basis.  If you want one core per node, use sbatch -N $numnodes -n 
$numnodes.  Anything else would require the -m flag; I haven't played 
with it recently, but I think you would want -m cyclic.


Ryan

On 05/08/2014 11:49 AM, Atom Powers wrote:

How to spread jobs among nodes?

It appears that my Slurm cluster is scheduling jobs to load up nodes 
as much as possible before putting jobs on other nodes. I understand 
the reasons for doing this, however I foresee my users wanting to 
spread jobs out among as many nodes as possible for various reasons, 
some of which are even valid.


How would I configure the scheduler to distribute jobs in something 
like a round-robin fashion to many nodes instead of loading jobs onto 
just a few nodes?


I currently have:
'SchedulerType' = 'sched/builtin',
'SelectTypeParameters'  = 'CR_Core_Memory',
'SelectType'= 'select/cons_res',

--
Perfection is just a word I use occasionally with mustard.
--Atom Powers--


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Need Help Understanding Cgroup Swappiness

2014-04-21 Thread Ryan Cox
Note that the output of your job was printed successfully, then the 
slurmstepd output occurred.  At job/step exit time, the Slurm code 
simply reads the memory.failcnt and memory.memsw.failcnt files in 
the relevant cgroup (explanation: 
https://www.kernel.org/doc/Documentation/cgroups/memory.txt).


Your job's cgroup has memory.failcnt > 0, meaning some of the job was 
swapped out but not killed.  The output is different for 
memory.memsw.failcnt > 0 because that means that a process was killed.
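
A rough editorial sketch of that check in Lua (illustrative paths only; 
this is not the actual slurmstepd code):

local function read_count(path)
   local f = io.open(path, "r")
   if not f then return 0 end
   local n = tonumber(f:read("*l")) or 0
   f:close()
   return n
end

-- illustrative path matching the cgroup shown in the slurmd log below
local step = "/sys/fs/cgroup/memory/slurm/uid_500/job_1503/step_1/"
if read_count(step .. "memory.memsw.failcnt") > 0 then
   print("a process exceeded the memory+swap limit and was killed")
elseif read_count(step .. "memory.failcnt") > 0 then
   print("Exceeded step memory limit at some point. Step may have been partially swapped out to disk.")
end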


Ryan

On 04/21/2014 01:48 PM, Guglielmi Matteo wrote:

Installed memory per node:

RAM  32 GB
SWAP 10 GB

### slurm.conf ###

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectTypeParameters=CR_Core_Memory

NodeName=... RealMemory=29000



### cgroup.conf 

AllowedRAMSpace=100
AllowedSwapSpace=30.0
ConstrainRAMSpace=YES
ConstrainSwapSpace=YES
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30



This program just eats up the requested amount of memory:

### memoryHog.c ###

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SZ (1<<12)

int main(int argc, char **argv) {
    int i;
    int gb = atoi(argv[1]); // memory to consume in GB

    /* allocate the requested amount in PAGE_SZ-sized chunks, touching
       the first byte of each chunk so the pages are actually used */
    for (i = 0; i < ((unsigned long)gb<<30)/PAGE_SZ; ++i) {
        void *m = malloc(PAGE_SZ);
        if (!m)
            break;
        memset(m, 0, 1);
    }
    printf("allocated %lu MB\n", ((unsigned long)i*PAGE_SZ)>>20);
    sleep(10);
    return 0;
}



### TESTING ###

$ salloc --mem-per-cpu=9000
salloc: Granted job allocation 1503

$ srun memoryHog.x 8
allocated 8192 MB

$ srun memoryHog.x 9
allocated 9050 MB
slurmstepd: Exceeded step memory limit at some point. Step may have 
been partially swapped out to disk.


### LOGS: /var/log/slurm/slurmd.log ###

[2014-04-15T18:58:26.212] [1503.0] task/cgroup: 
/slurm/uid_500/job_1503: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:58:26.212] [1503.0] task/cgroup: 
/slurm/uid_500/job_1503/step_0: alloc=9000MB mem.limit=9000MB 
memsw.limit=11700MB

[2014-04-15T18:58:39.961] [1503.0] done with job
..
..
[2014-04-15T18:58:45.916] [1503.1] task/cgroup: 
/slurm/uid_500/job_1503: alloc=9000MB mem.limit=9000MB memsw.limit=11700MB
[2014-04-15T18:58:45.916] [1503.1] task/cgroup: 
/slurm/uid_500/job_1503/step_1: alloc=9000MB mem.limit=9000MB 
memsw.limit=11700MB
[2014-04-15T18:59:01.087] [1503.1] Exceeded step memory limit at some 
point. Step may have been partially swapped out to disk.

[2014-04-15T18:59:01.120] [1503.1] done with job



Since slurm sets memsw.limit=11700MB I was expecting
the cgroup feature to start swapping out the exceeding
50 MB or so... they would actually fit in the swap area
and the job should not be killed...

What am I missing here?

Should the code itself be aware of the given mem.limit=9000MB?


Thanks for any explanation.

MG


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: SLURM as a load balancer for interactive use

2014-03-25 Thread Ryan Cox


This isn't exactly what you're looking for but I'll chime in anyway with 
how we do things.  We decided to buy a few slightly beefier interactive 
nodes and set up cgroups, /tmp and /dev/shm namespaces (/tmp and 
/dev/shm are per-user), cputime limits, /tmp quotas, etc to sanely 
oversubscribe resources.  This ended up being cheaper than other options 
and it has worked really well.  We currently use LVS to load balance 
between interactive nodes but may switch to something else at some point.


We allow users to edit files, compile code, transfer files around, etc. 
and also test their code for a little while.  Anything beyond that 
requires submitting a job.  We limit users to 1/4 of the RAM on the node 
and only 60 CPU-minutes per process via ulimit.  The cpu cgroup (not 
cpuset) is used to set a soft limit of just a few cores for each user 
but allows them to burst to 100% of the cores when there is no 
contention on the node.


It would take at least four users all using the maximum amount of RAM 
plus some extra use before the node crashes.  The memory per node ratio 
and other settings could easily be changed if necessary.


In practice, these settings have made it so that no user has crashed an 
interactive node since everything was deployed.  Obviously I didn't 
answer the original question about SLURM but this is an alternative 
approach that has worked well for us.  If you're interested in the code 
we used to set everything up, it is available at: 
https://github.com/BYUHPC/uft


Ryan

On 03/24/2014 01:32 PM, Olli-Pekka Lehto wrote:

I can foresee the screen issue as well. One could fairly simply add a check 
when the user logs in to see if the user already has a node assigned to them 
and force the session to use that node. It could perhaps even prompt whether 
they want to access that session or get a new one.

The immediate issue we encountered when testing with screen, however, is the 
fact that when you detach and exit the interactive session SLURM faithfully 
cleans all the processes. In most cases this would be preferred but in this 
case I want the screen session (and the associated interactive job) to persist. 
Any ideas how to do this?

In our case there is no real time limit on the current interactive use nodes so 
setting runtime as unlimited is probably the way to go, at least initially. Of 
course one needs to have a sufficiently large oversubscription factor of slots 
in this case.

Olli-Pekka

On Mar 24, 2014, at 3:48 PM, Schmidtmann, Carl 
carl.schmidtm...@rochester.edu wrote:


We considered this option as well but the problem we saw with it is what 
happens when a user tries to use screen? Many of our users login, start screen, 
do some work and then disconnect. Whenever they reconnect they can pick up from 
where they left off. If you are allocated to a compute node based on loads, you 
likely won't be on the same node where your last session was. This is 
inconvenient for the users but then also leaves screen sessions open, at least 
until the time limit expires, on compute nodes.

The other issue is the time limit. Do you make it 1 hour, 4 hours, 8 hours? How 
long does a user get to be logged in? If the time limit expires, what happens 
to the open editor session? Can this be recovered on a different compute node?

We are still looking for a good way to balance users on login nodes. Right now we are 
working on a method of redirecting ssh logins based on user IDs which feels extremely 
hacky as well.

Carl
--
Carl Schmidtmann
Center for Integrated Research Computing
University of Rochester

On Mar 24, 2014, at 5:44 AM, Olli-Pekka Lehto wrote:


Dear devs,

We are testing a concept where we are dynamically allocating a portion of our 
compute nodes with oversubscribed interactive nodes for low-intensity use. To 
make the use as simple as possible, we are testing redirecting user login 
sessions directly to these nodes via SLURM.

Basically the shell initialization on the actual login node contains a SLURM srun command 
to spawn an interactive session and the user gets transparently dropped into 
a shell session on a compute node.

This would offer more flexibility than physically setting up a set of login 
nodes. Furthermore, SLURM should be able make better decisions on where to 
assign each incoming session based on resource usage than a more naive 
round-robin load balancer. This way also all interactive use can be tracked 
with SLURM's accounting.

Based on simple initial testing this seems to work but it's still a bit hacky.

My question is has anyone been doing similar things and what are your 
experiences? Are there some caveats that we should be aware of?

Best regards,
Olli-Pekka
--
Olli-Pekka Lehto
Development Manager
Computing Platforms
CSC - IT Center for Science Ltd.
E-Mail: olli-pekka.le...@csc.fi
Tel: +358 50 381 8604
skype: oplehto // twitter: ople


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Job being canceled due to time limits

2013-09-05 Thread Ryan Cox

-t and --time are synonymous, and you're using both: --time=38380 and -t 4.  
The -t 4 means four minutes, which matches your job being killed a few minutes in.

Ryan

On 09/05/2013 12:38 PM, Matthew Russell wrote:

Job being canceled due to time limits
Hi,

I can't figure out why my job is being canceled due to time limits. 
 My queue has an infinite time limit, and my batch file requests 
several hours, yet the job is always canceled within a few minutes.


gm1@dena:GEM-MACH_1.5.1_dev$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq up infinite  4   idle dena[1-4]
*headnode up infinite  1   idle dena*
matt up infinite  2   idle dena[1-2]

My batch file:
#!/home/gm1/ECssm/multi/bin/s.sge_dummy_shell
#SBATCH -D /home/gm1
#SBATCH --export=NONE
#SBATCH -o /home/gm1/listings/dena/gm338_21388_M.21737.out.o
#SBATCH -e /home/gm1/listings/dena/gm338_21388_M.21737.out.e
#SBATCH -J gm338_21388_M.30296
*#SBATCH --time=38380*
#SBATCH --partition=headnode
#SBATCH
#SBATCH -c 1
#SBATCH -t 4
#SBATCH
#



The error log:
gm1@dena:GEM-MACH_1.5.1_dev$ cat ~/listings/dena/gm338_21388_M.21737.out.e
slurmd[dena]: *** JOB 1683 CANCELLED AT 2013-09-05T14:24:27 DUE TO 
TIME LIMIT ***



Is there somewhere else where a time limit can be imposed?  The time 
limit is being imposed about 5 minutes into the job.


Thanks


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: job steps not properly identified for jobs using step_batch cgroups

2013-08-12 Thread Ryan Cox


Moe,

In what way is it experimental?  Is it possibly unstable or just not 
feature-complete?


We're writing a script to independently gather statistics for our own 
database and would like to use the cpuacct cgroup, thus the interest in 
the jobacct_gather/cgroup plugin.


Ryan

On 08/09/2013 10:07 AM, Moe Jette wrote:


I misspoke. The JobAcctGatherType=jobacct_gather/cgroup plugin is 
experimental and not ready for use. Your configuration should work.


Quoting Moe Jette je...@schedmd.com:

Your explanation seems likely. You probably want to change your 
configuration to:

JobAcctGatherType=jobacct_gather/cgroup

Quoting Andy Wettstein wettst...@uchicago.edu:



I understand this problem more fully now.

Certain jobs that our users run fork processes in a way that the parent
PID gets set to 1. The _get_offspring_data function in
jobacct_gather/linux ignores these when adding up memory usage.

It seems like if proctrack/cgroup is enabled, the jobacct_gather/linux
plugin should rely on the cgroup.procs file to identify the pids 
instead

of trying to figure things out based on parent PID. Is something like
that reasonable?

Andy

On Tue, Jul 30, 2013 at 10:59:56AM -0700, Andy Wettstein wrote:


Hi,

I have the following set:

ProctrackType   = proctrack/cgroup
TaskPlugin  = task/cgroup
JobAcctGatherType   = jobacct_gather/linux

This is on slurm 2.5.7.

When I use sstat on all running jobs, there are a large number of jobs
that say they have no steps running (for example: sstat: error: 
couldn't

get steps for job 4783548).

This seems to be the case for all steps that use the step_batch 
cgroup.

If the step gets created in something like step_0, everything seems to
be reported ok. In both instances, the PIDs are actually listed in the
right cgroup.procs file.

I noticed this because there were several jobs that should have been
killed due to memory limits, but were not. The jobacct_gather plugin
doesn't know about the processes in the step_batch cgroup so it 
doesn't

count the memory usage.


Andy




--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104


--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104










--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: cgroups usage

2013-08-06 Thread Ryan Cox


We made the mistake of setting TaskAffinity=yes, though I'm not sure why 
we did that.  There seems to be a bug where the first node has 
cgroup/cpuset and task affinity set correctly, but subsequent nodes set 
task affinity for *all* tasks to be CPU 0.  We hadn't gotten around to 
reporting it yet but it's worth checking out.


Ryan

On 08/05/2013 05:52 PM, Kevin Abbey wrote:

Hi ,

I started using cgroups to control memory usage last week.  One user
reported that his application takes 4 times longer to complete.  I read
elsewhere that cgroup memory control can reduce performance.  Is that
much of a slowdown realistic?

Is there a more efficient method to control memory usage on nodes which
are shared?


Thank you for any advice,

Kevin



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Job submit plugin to improve backfill

2013-06-28 Thread Ryan Cox

An alternative, which we use, is to choose very low defaults for people:
PartitionName=Default DefaultTime=30:00 #plus other options 
DefMemPerCPU=512

The disadvantage to this approach is that it doesn't give an obvious 
error message at submit time.  However, it's not hard to figure out what 
happened when they hit the time limit or the error output says they went 
over their memory limit.


Ryan

On 06/28/2013 08:29 AM, Daniel M. Weeks wrote:

At CCNI, we use backfill scheduling on all our systems. However, we have
found that users typically do not specify a time limit for their job so
the scheduler assumes the maximum from QoS/user limits/partition
limits/etc. This really hurts backfilling since the scheduler remains
ignorant of short jobs.

Attached is a small patch I wrote containing a job submit plugin and a
new error message. The plugin rejects a job submission when it is
missing a time limit and will provide the user with a clear and distinct
error.

I've just re-tested and the patch applies and builds cleanly on the
slurm-2.5, slurm-2.6, and master branches.

Please let me know if you find this useful, run across problems, or have
suggestions/improvements. Thanks.
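
For anyone who would rather stay in Lua than build the C plugin, here is a 
rough sketch of the same idea (an editorial illustration only, not the 
attached patch; it assumes slurm.NO_VAL, slurm.ERROR, and slurm.log_user 
are exposed to job_submit.lua as in the example script shipped with Slurm):

function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.time_limit == slurm.NO_VAL then
      slurm.log_user("Please specify a time limit (--time) so the scheduler can backfill your job")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end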



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University



[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox

Paul,

We were discussing this yesterday because of a user who did not limit the 
number of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like 
the best approach for both us and you.

Ryan

On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this, that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-

-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Job Groups

2013-06-19 Thread Ryan Cox

Not that I'm aware of.  I don't know of a way to give users control over 
a QOS like you can do with account coordinators for accounts.

Ryan

On 06/19/2013 10:55 AM, Paul Edmon wrote:
 Thanks for the input.  Can GrpJobs be modified from the user side?

 -Paul Edmon-


 On 06/19/2013 12:15 PM, Ryan Cox wrote:
 Paul,

 We were discussing this yesterday due to a user not limiting the amount
 of jobs hammering our storage.  A QOS with a GrpJobs limit sounds like
 the best approach for both us and you.

 Ryan

 On 06/19/2013 09:36 AM, Paul Edmon wrote:
 I have a group here that wants to submit a ton of jobs to the queue, but
 want to restrict how many they have running at any given time so that
 they don't torch their fileserver.  They were using bgmod -L in LSF to
 do this, but they were wondering if there was a similar way in SLURM to
 do so.  I know you can do this via the accounting interface but it would
 be good if I didn't have to apply it as a blanket to all their jobs and
 if they could manage it themselves.

 If nothing exists in SLURM to do this, that's fine.  One can always
 engineer around it.  I figured I would ping the dev list first before
 putting a nail in it.  From my look at the documentation I don't see
 any way to do this other than what I stated above.

 -Paul Edmon-

-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: untracked processes

2013-02-21 Thread Ryan Cox

This may not be exactly what you're looking for but it could be a start.

We're looking at modifying ssh_config and sshd_config to 
propagate SLURM_JOB_ID for jobs that use ssh to spawn processes (credit 
to our sysadmin Lloyd Brown for that one).
like a script in /etc/profile.d to add the process to the correct cgroup 
if it's launched via ssh and has $SLURM_JOB_ID set. We're not using 
cgroups yet (still have some CentOS 5) so I don't have exact 
implementation details at this point.  Then the cgroups should work for 
resource control and, I assume, accounting if using the correct plugin.

This may not catch 100% of everything, but we would probably have 
something look for all user processes that are not part of a cgroup and 
add them to the user cgroup.  I don't think accounting could work in 
that case, but that would help catch and control rogue processes that 
aren't accounted for under SLURM.  Epilog or a cron could clean up all 
of a user's processes after they don't have jobs on the node anymore.

I don't know if SLURM has something like Torque's tm_adopt, but that 
could work in lieu of cgroups for accounting if you don't happen to use 
cgroups.  tm_adopt allowed you to add a random process to be accounted 
for under Torque, even if it wasn't launched under Torque.  We used to 
have a wrapper script for ssh that did just that when we used Torque and 
Moab.

Ryan

P.S. We've only been using SLURM for a few weeks so you might want to 
double-check the accuracy and viability of my statements :)


On 02/21/2013 12:57 PM, Moe Jette wrote:
 Slurm only tracks the processes that its daemons launch (most MPI
 implementations can launch their tasks using slurm). Anything launched
 outside of Slurm can be killed as part of a job prolog, but accounting
 and job step management are outside of Slurm's control.

 Quoting Michael Colonno mcolo...@stanford.edu:

  SLURM gurus ~

  I'm trying to configure a commercial MPI code to run through SLURM.
 I can launch this code through either srun or sbatch without any
 issues (the good) but the processes manage to run completely
 disconnected from SLURM's notice (the bad). i.e. the job is running
 just fine but SLURM thinks it's completed and hence does not report
 anything running. I'm guessing this is due to the fact that this
 tool runs a pre-processing-type executable and then launches
 sub-processes to solve (MPI on a local system) without connecting
 the process IDs(?) In any event, I'm guessing I'm not the first
 person to run into this. Is there a recommended solution to
 configure SLURM to track codes like this?

  Thanks,
  ~Mike C.



-- 
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University