[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-14 Thread Ryan Cox via slurm-users
We did something like this in the past, but from C.  However, modifying 
the features was painful if the user used any interesting syntax.


What we are doing now is using --extra for that purpose.  The nodes boot 
up with SLURMD_OPTIONS="--extra {\\\"os\\\":\\\"rhel9\\\"}" or similar.  
Users can request --extra=os=rhel9 or whatever if they want to submit 
across OS versions for some weird reason.
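For illustration, roughly what the two sides look like (the sysconfig path is an assumption; use whatever mechanism feeds slurmd its options on your nodes):

# Node side, e.g. in /etc/sysconfig/slurmd, written by a boot script:
SLURMD_OPTIONS="--extra {\"os\":\"rhel9\"}"

# User side, to target one OS explicitly:
sbatch --extra=os=rhel9 job.sh

# Note: scheduling on --extra needs SchedulerParameters=extra_constraints in
# recent Slurm versions, if I recall correctly.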


Handling defaults is problematic because there is no way to set a 
default --extra for people.  We had something working that set an 
environment variable on the nodes, let sbatch et al. pass it along, and 
then read it from the job submit plugin, which would set --extra 
accordingly.  The problem is that salloc and srun behave differently and 
you can't access the environment there.


Instead, we are now looking up the alloc_node in the plugin and reading 
its `extra` directly.  Here's what the relevant parts look like:

static void _set_extra_from_alloc_node(job_desc_msg_t *job_desc)
{
    node_record_t *node_ptr = find_node_record(job_desc->alloc_node);
    char *default_str = "os=rhel7";

    if (node_ptr == NULL) {
        job_desc->extra = xstrdup(default_str);
        info("WARNING: _set_extra_from_alloc_node: node %s not found. Setting job to default '%s'",
             job_desc->alloc_node, default_str);
    } else {
        if (!xstrcmp(node_ptr->extra, "{\"os\":\"rhel7\"}")) {
            job_desc->extra = xstrdup("os=rhel7");
        } else if (!xstrcmp(node_ptr->extra, "{\"os\":\"rhel9\"}")) {
            job_desc->extra = xstrdup("os=rhel9");
        } else {
            job_desc->extra = xstrdup(default_str);
            info("WARNING: _set_extra_from_alloc_node: node %s returned extra of '%s' which did not match known values. Setting job to default '%s'",
                 job_desc->alloc_node, node_ptr->extra, default_str);
        }
    }
}

...

    if (!job_desc->extra) {
        _set_extra_from_alloc_node(job_desc);
    }

I don't know if you can do it in lua.  The easiest way to do this would 
be if there was an environment variable for a default --extra, but there 
isn't currently.  I've been meaning to ask SchedMD about that but 
haven't done so yet.
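If job_desc.extra is exposed to the lua plugin in your version -- an assumption I have not verified -- a minimal sketch of the same default logic might look like:

-- Sketch only: assumes job_desc.extra is readable/writable from job_submit/lua.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.extra == nil or job_desc.extra == '' then
        job_desc.extra = "os=rhel9"   -- hypothetical site default
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end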


By the way, the nice thing about --extra is that there's no juggling of 
features in config files.  Whatever OS it boots up in, that's what ends 
up in the extra field.  We have a script that populates the relevant 
file before Slurm boots.
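As a rough illustration only (this is not our actual script, and the sysconfig path and escaping are assumptions):

#!/bin/bash
# Derive an OS tag from /etc/os-release and hand it to slurmd at boot.
. /etc/os-release
osname="${ID}${VERSION_ID%%.*}"    # e.g. rhel9
cat > /etc/sysconfig/slurmd <<EOF
SLURMD_OPTIONS="--extra {\"os\":\"${osname}\"}"
EOF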


On 6/14/24 12:33, Laura Hild via slurm-users wrote:

I wrote a job_submit.lua also.  It would append "&centos79" to the feature string unless the 
features already contained "el9", or, if the feature string was empty, set it to "centos79" without 
the ampersand.  I didn't hear from any users doing anything fancy enough with their feature string for the 
ampersand to cause a problem.
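A rough reconstruction of that approach (a sketch; the appended feature name is assumed to be "centos79", and fancier feature expressions are not handled):

-- Sketch of the approach described above.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local feat = job_desc.features
    if feat == nil or feat == '' then
        job_desc.features = "centos79"            -- empty: plain default, no ampersand
    elseif not string.find(feat, "el9", 1, true) then
        job_desc.features = feat .. "&centos79"   -- otherwise AND the default in
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end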







[slurm-users] Re: memory high water mark reporting

2024-05-20 Thread Ryan Cox via slurm-users
Emyr James via slurm-users wrote:


Hi,

We are trying out slurm having been running grid engine for a long
while.

In grid engine, the cgroups peak memory and max_rss are generated
at the end of a job and recorded. It logs the information from the
cgroup hierarchy as well as doing a getrusage call right at the
end on the parent pid of the whole job "container" before cleaning up.

With slurm it seems that the only way memory is recorded is by the
acct gather polling. I am trying to add something in an epilog
script to get the memory.peak, but it looks like the cgroup
hierarchy has been destroyed by the time the epilog is run.

Where in the code is the cgroup hierarchy cleaned up? Is there no
way to add something in so that the accounting is updated during
the job cleanup process and peak memory usage can be accurately
logged?

I can reduce the polling interval from 30s to 5s, but I don't know
if this causes a lot of overhead, and in any case this does not
seem like a sensible way to get values that should just be
determined right at the end by an event rather than by polling.
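For reference, that interval is the accounting sample frequency in slurm.conf; a minimal sketch (the 5-second value is only the example discussed here):

# slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=5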

Many thanks,

Emyr






--
Ryan Cox
Director
Office of Research Computing
Brigham Young University



[slurm-users] Re: Trouble Running Slurm C Extension Plugin

2024-04-09 Thread Ryan Cox via slurm-users

Glen,

I don't think I see it in your message, but are you pointing to the 
plugin in slurm.conf with JobSubmitPlugins=?  I assume you are but it's 
worth checking.
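For completeness, that is a one-line setting; the name is whatever follows "job_submit/" in the plugin_type string of the reproducer below:

# slurm.conf
JobSubmitPlugins=disallow_salloc

Restarting slurmctld (rather than relying on scontrol reconfigure) is the safe way to make sure a newly added plugin actually gets loaded.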


Ryan

On 4/9/24 10:19, Glen MacLachlan via slurm-users wrote:

Hi,

We have a plugin in Lua that mostly does what we want but there are 
features available in the C extension that are not available to lua. 
For that reason, we are attempting to convert to C using the guidance 
found here: 
https://slurm.schedmd.com/job_submit_plugins.html#building. We arrived 
here because the lua plugins don't seem to stretch enough to cover the 
use case we were looking at, i.e., branching off of the value of 
alloc_id or, for that matter, get_sid().


The goal is to disallow interactive allocations (i.e., salloc) on 
specific partitions while allowing it on others. However, we've 
run into an issue with our C plugin right out of the gate and I've 
included a minimal reproducer as an example which is basically a 
"Hello World" type of test (job_submit_disallow_salloc.c, see attached).


*Expectation*
What we expect to happen is a sort of hello-world result with a 
message being written to /tmp/min_repo.log, but that does not occur. 
It seems that the plugin does not get run at all when jobs are 
submitted. Jobs still run as expected but the plugin seems to be ignored.


*Steps*
We compile:

gcc -fPIC -DHAVE_CONFIG_H -I /modules/source/slurm-23.02.4 -g -O2 \
    -pthread -fno-gcse -Werror -Wall -g -O0 -fno-strict-aliasing \
    -MT job_submit_disallow_salloc.lo -MD -MP \
    -MF .deps/job_submit_disallow_salloc.Tpo \
    -c job_submit_disallow_salloc.c -o .libs/job_submit_disallow_salloc.o

mv .deps/job_submit_disallow_salloc.Tpo .deps/job_submit_disallow_salloc.Plo

and link:

gcc -shared -fPIC -DPIC .libs/job_submit_disallow_salloc.o -O2 -pthread -O0 \
    -pthread -Wl,-soname -Wl,job_submit_disallow_salloc.so \
    -o job_submit_disallow_salloc.so




Check links after copying to /usr/lib64/slurm:
ldd /usr/lib64/slurm/job_submit_disallow_salloc.so
linux-vdso.so.1 (0x7ffe467aa000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x7f1c02095000)
libc.so.6 => /lib64/libc.so.6 (0x7f1c01cd)
/lib64/ld-linux-x86-64.so.2 (0x7f1c024b7000)



Can someone point out what we are doing incorrectly or how we might 
troubleshoot this issue?


Kindest regards,
Glen



*Reproducer*
The minimal reproducer is basically a "hello world" for C extensions 
which I've pasted below (I've also attached it for convenience):


#include <inttypes.h>
#include <stdio.h>
#include <slurm/slurm_errno.h>
#include "src/slurmctld/slurmctld.h"

const char plugin_name[] = "Min Reproducer";
const char plugin_type[] = "job_submit/disallow_salloc";
const uint32_t plugin_version = SLURM_VERSION_NUMBER;

extern int job_submit(job_desc_msg_t *job_desc, uint32_t submit_uid,
                      char **err_msg)
{
        FILE *fp;
        fp = fopen("/tmp/min_repo.log", "w");
        fprintf(fp,"Hello!");

        fclose(fp);
        return SLURM_SUCCESS;
}

int job_modify(job_desc_msg_t *job_desc, job_record_t *job_ptr,
               uint32_t submit_uid, char **err_msg)
{
        return SLURM_SUCCESS;
}




--
Ryan Cox
Director
Office of Research Computing
Brigham Young University



Re: [slurm-users] Multifactor fair-share with single account

2024-01-04 Thread Ryan Cox



On 1/4/24 02:41, Kamil Wilczek wrote:



On 4.01.2024 at 07:56, Loris Bennett wrote:

Hi Kamil,

Kamil Wilczek  writes:


Dear All,

I have a question regarding the fair-share factor of the multifactor
priority algorithm. My current understanding is that fair-share
makes sure that different *accounts* get a fair share of the
computational power.

But what if my organisation structure is flat and I have only one
account where all my users reside? Does the fair-share algorithm work
in this situation -- does it take into account the users (associations)
from this single account and try to assign a fair-share factor to each
user? Or does each user from this account get the same fair-share
factor at each iteration?

And what if I have, say, 3 accounts, but I do not want to calculate
fair-share between accounts, but between all associations from all
3 accounts? In other words, is there a fair-share factor for
users/associations instead of accounts?

Kind regards


We have a similar situation.  We do in fact have an account for each
research group and the groups are associated with institutes and
departments, but we use FairShare=parent so that all users are given the
same number of shares and thus treated equally by the fair-share
mechanism.



Hi Loris,

but is the "FairShare=parent" still works for the Fair Tree, which is
the default algorithm since 19.05? I can find this option only for the
Classic Fair Share.

And I'm trying to differentiate between users, so that they are not
treated equally by the algorithm. Heavy users should have a lower
factor.

I think I could create an account for each user, but is that a common
practice and not an overkill?

I'm also trying to understand the Fair Tree, because there is a section
when it says that users can have different factors if their common
ancestor accounts have different factors. But what if they have only one
single common ancestor? Would then association/users still be sorted by
the fair-factor?

Kind regards,


Kamil,

fairshare=parent works for all algorithms, last I knew.  It definitely 
works for Fair Tree.  See https://slurm.schedmd.com/SUG14/fair_tree.pdf 
starting at page 80 for examples of usage.  On page 84, the blue 
associations all have fairshare=parent set, so Fair Tree calculates 
fairshare as if the tree looks like the one on page 85.  The example on 
pages 86-91 describes what's going on.


Yes, Fair Tree considers sibling users/accounts even underneath another 
account.  The algorithm is recursive and calculates fairshare amongst 
sibling users/accounts within each account. "sshare -l" will show you 
these calculations under "Level FS" within an account (keeping in mind 
that fairshare=parent will affect what looks like an account to Fair 
Tree).  The presentation I linked to has a lot more details.  
https://slurm.schedmd.com/fair_tree.html is more succinct.
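For example (the account name is hypothetical):

sshare -l -a            # whole tree, with the Level FS column
sshare -l -A physics    # just one account's subtree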


Another way to think about this is that you are only looking at one 
subtree's children at a time.  When one association wins, any of its 
children also win compared to any other associations.  The algorithm 
recurses into that association (if there are children) to then determine 
which of those children win compared to others.  So whether you have a 
flat structure or not, it compares siblings within an account.  If you 
have 100 users directly parented by root (or 100 users underneath 
accounts with fairshare=parent all the way up to root), those 100 users 
will compete for fairshare like you want.


Ryan



Re: [slurm-users] Transport from SLC to Provo?

2023-08-14 Thread Ryan Cox

I have used https://expressshuttleutah.com/ in the past.

Ryan

On 8/14/23 06:12, Styrk, Daryl wrote:

I would try Uber or Lyft. You should check with your hotel; they may offer a 
shuttle.

Daryl

On 8/14/23, 4:58 AM, "slurm-users on behalf of Bjørn-Helge Mevik" <b.h.me...@usit.uio.no> wrote:


Dear all,


I'm going to SLUG in Provo in September. My flight lands in Salt Lake
City Airport (SLC) at 7 pm on Sunday 10. I was planning to go by bus or
train from SLC to Provo, but apparently both bus and train have stopped
running by that time on Sundays.


Does anyone know about any alternative way to get to Provo on a Sunday
night?




--
Ryan Cox
Director
Office of Research Computing
Brigham Young University




Re: [slurm-users] sbatch - accept jobs above limits

2022-02-09 Thread Ryan Cox

Mike,

You could potentially add a non-existent node (or nodes) to the 
configuration that has a million cores, petabytes of RAM, and all the 
features in the world.  Then it "exists" in Slurm.  I don't know if 
FUTURE would work, but if you can tolerate having a DOWN node in sinfo, 
that could work.
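A minimal sketch of that idea (node name and sizes are invented; DOWN is used since it is unclear whether FUTURE satisfies the submit-time check):

# slurm.conf -- placeholder node so oversized requests are queued instead of rejected
NodeName=phantom01 CPUs=1024 RealMemory=16000000 Feature=bigmem State=DOWN
PartitionName=batch Nodes=node[001-100],phantom01 Default=YES State=UP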


Ryan

On 2/8/22 3:26 PM, z1...@arcor.de wrote:

Dear all,

sbatch jobs are immediately rejected if no suitable node is available in
the configuration.


sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration
is not available

These jobs should be accepted if a suitable node will become active soon.
For example, these jobs could be in PartitionConfig.

Is that configurable?


Many thanks,

Mike



--
Ryan Cox
Director
Office of Research Computing
Brigham Young University




Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Ryan Cox

David,

There are several possible answers depending on what you hope to 
accomplish.  What exactly is the issue that you're trying to solve? Do 
you mean that you have users who need, say, 8 GB of RAM per core but you 
only have 4 GB of RAM per core on the system and you want a way to 
account for that?  Or is it something else?


Ryan

On 1/6/22 14:39, David Henkemeyer wrote:

All,

When my team used PBS, we had several nodes that had a TON of CPUs, so 
many, in fact, that we ended up setting np to a smaller value, in 
order to not starve the system of memory.


What is the best way to do this with Slurm?  I tried modifying # of 
CPUs in the slurm.conf file, but I noticed that Slurm enforces that 
"CPUs" is equal to Boards * SocketsPerBoard * CoresPerSocket * 
ThreadsPerCore.  This left me with having to "fool" Slurm into 
thinking there were either fewer ThreadsPerCore, fewer CoresPerSocket, 
or fewer SocketsPerBoard.  This is a less than ideal solution, it 
seems to me.  At least, it left me feeling like there has to be a 
better way.
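For concreteness, the kind of under-reporting described above is just a node definition like this (hypothetical node whose hardware really has two threads per core):

# slurm.conf
NodeName=fat01 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 RealMemory=256000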


Thanks!
David








Re: [slurm-users] job_container.conf:: how to adopt a autofs base mount point

2021-12-02 Thread Ryan Cox

Adrian,

We haven't played with that yet, but we have been using pam_namespace.so 
for many years for the same purpose.  We do this at startup:

mkdir -pm 000 /tmp/userns
mkdir -pm 000 /dev/shm/userns
mount --make-shared /
mount --bind /tmp /tmp
mount --make-private /tmp
mount --bind /dev/shm /dev/shm
mount --make-private /dev/shm

This solved our automounting problems.  There might have been some 
updates to the "right" way to do things since we started doing it this 
way, but this still works for us.  Basically, we make *everything* 
shared first then make /tmp and /dev/shm private.


https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt 
has more info on the options.
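For anyone reproducing this, those bind mounts pair with pam_namespace's polyinstantiation config; a rough sketch (the instance directories match the mkdir commands above, but the method and exclusion columns are assumptions, not our exact file):

# /etc/security/namespace.conf
/tmp      /tmp/userns/      user  root
/dev/shm  /dev/shm/userns/  user  root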


Ryan

On 12/2/21 15:58, Adrian Sevcenco wrote:


Hi! I have an annoying problem with namespaces and the shared attribute
of an autofs mountpoint.

There is a directory named /cvmfs where autofs mounts various
directories, depending on what the job requests. These directories
(called repositories) do not need to be predefined; regardless of the
settings, a job can request the mounting of any repository present on
the defined upstream (stratum-1) servers.

My problem is that, no matter when I apply --make-rshared to /cvmfs
before the actual job, autofs will reset this attribute when it mounts
something within it.

Is there a way to tell slurmstepd to somehow adopt and keep this
mountpoint, no matter what is mounted within it?

Thank you!
Adrian



--
Ryan Cox
Director
Office of Research Computing
Brigham Young University




Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Ryan Cox

Brian,

Would a reservation on that node work?  I think you could even do a 
combination of MAGNETIC and features in the reservation itself if you 
wanted to minimize hassle, though that probably doesn't add much beyond 
just requiring that the reservation name be specified by people who want 
to use it.
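A hedged sketch of that (node name and user list are hypothetical):

# Reserve the node with the node-locked license for the users who actually need it.
# MAGNETIC means matching jobs from those users can land in the reservation
# without having to specify --reservation.
scontrol create reservation ReservationName=licnode Nodes=node042 \
    Users=alice,bob StartTime=now Duration=infinite Flags=MAGNETIC

Everyone else's jobs will then avoid that node because it is reserved.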


Ryan

On 7/1/21 8:08 AM, Brian Andrus wrote:

All,

I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
They are cloud nodes, so weights do not work (there is an open bug 
about that).


I need to have jobs 'avoid' that node by default. I am thinking I can 
use a feature constraint, but that seems to only apply to those that 
want the feature. Since we have so many other users, it isn't feasible 
to have them modify their scripts, so having it avoid by default would 
work.


Any ideas how to do that? Submit LUA perhaps?

Brian Andrus




--
Ryan Cox
Director
Office of Research Computing
Brigham Young University




Re: [slurm-users] Exposing only requested CPUs to a job on a given node.

2021-05-14 Thread Ryan Cox
You can check with something like this inside of a job:  cat 
/sys/fs/cgroup/cpuset/slurm/uid_$UID/job_$SLURM_JOB_ID/cpuset.cpus. That 
lists which cpus you have access to.


On 5/14/21 4:40 PM, Renfro, Michael wrote:


Untested, but prior experience with cgroups indicates that if things 
are working correctly, even if your code tries to run as many 
processes as you have cores, those processes will be confined to the 
cores you reserve.


Try a more compute-intensive worker function that will take some 
seconds or minutes to complete, and watch the reserved node with 'top' 
or a similar program. If for example, the job reserved only 1 core and 
tried to run 20 processes, you'd see 20 processes in 'top', each at 5% 
CPU time.


To make the code a bit more polite, you can import the os module and 
create a new variable from the SLURM_CPUS_ON_NODE environment variable 
to guide Python into starting the correct number of processes:


    cpus_reserved = int(os.environ['SLURM_CPUS_ON_NODE'])
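A minimal, self-contained sketch of that suggestion (assumes it runs inside a Slurm allocation, where SLURM_CPUS_ON_NODE is set):

import multiprocessing
import os

def worker(j):
    return j ** 2

if __name__ == '__main__':
    # Size the pool from the Slurm allocation rather than multiprocessing.cpu_count(),
    # which reports every core on the node regardless of what the job reserved.
    cpus_reserved = int(os.environ.get('SLURM_CPUS_ON_NODE', '1'))
    with multiprocessing.Pool(processes=cpus_reserved) as pool:
        print(pool.map(worker, range(20)))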

*From: *slurm-users on behalf of Rodrigo Santibáñez
*Date: *Friday, May 14, 2021 at 5:17 PM
*To: *Slurm User Community List
*Subject: *Re: [slurm-users] Exposing only requested CPUs to a job on a given node.






Hi you all,

I'm replying so I get notified of the answers to this question. I have a 
user whose Python script used almost all CPUs, even though the job was 
configured to use only 6 CPUs per task. I reviewed the code, and it 
doesn't have an explicit call to multiprocessing or similar, so the user 
is unaware of this behavior (and so was I).


Running slurm 20.02.6

Best!

On Fri, May 14, 2021 at 1:37 PM Luis R. Torres <lrtor...@gmail.com> wrote:


Hi Folks,

We are currently running on SLURM 20.11.6 with cgroups constraints
for memory and CPU/Core.  Can the scheduler expose only the
requested number of CPU/Core resources to a job?  We have some
users that employ Python scripts with the multiprocessing module,
and the scripts apparently use all of the CPU/Cores in a node,
despite using options to constrain a task to just a given number
of CPUs.  We would like several multiprocessing jobs to run
simultaneously on the nodes without stepping on each other.

The sample script I use for testing is below; I'm looking for
something similar to what can be done with the GPU Gres
configuration where only the number of GPUs requested are exposed
to the job requesting them.

#!/usr/bin/env python3

import multiprocessing

def worker():
    print("Worker on CPU #%s" % multiprocessing.current_process().name)
    result = 0
    for j in range(20):
        result += j**2
    print("Result on CPU {} is {}".format(multiprocessing.current_process().name, result))
    return

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    jobs = []
    print("This host exposed {} CPUs".format(multiprocessing.cpu_count()))
    for i in range(multiprocessing.cpu_count()):
        p = multiprocessing.Process(target=worker, name=i).start()

Thanks,

-- 


--------
Luis R. Torres



--
Ryan Cox
Director
Office of Research Computing
Brigham Young University



Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
  sray          gfink               1    0.05              0   0.00       0.00          2.0343e+05   0.045455
  sray          ahantau             1    0.05             31   0.51       0.000102      491.737549   0.036364
  sray          hmiller             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          ttinker             1    0.05              0   0.00       0.00          1.4798e+13   0.063636
  sray          wcooper             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          xtsao               1    0.05          41734   0.068296   0.135083        0.370143   0.027273
  sray          xping               1    0.05              0   0.00       0.00          1.9833e+24   0.090909




--
Ryan Cox
Director
Office of Research Computing
Brigham Young University



Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox

From https://slurm.schedmd.com/fair_tree.html:
The basic idea is to set rank equal to the count of user associations, then start at root:

*   Calculate Level Fairshare for the subtree's children
*   Sort children of the subtree
*   Visit the children in descending order:
    -   If user, assign a final fairshare factor similar to (rank-- / user_assoc_count)
    -   If account, descend into the account
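As a worked example with made-up numbers: if a cluster has 100 user associations, the best-ranked user ends up with roughly 100/100 = 1.00, the next with 99/100 = 0.99, and so on down the ranking; users whose Level FS is tied end up with the same final value, which is why the many idle "inf" users in the sshare output quoted below all share the same FairShare.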


On 12/2/20 10:34 AM, Erik Bryer wrote:
I'm not talking about the Level Fair Share. That's easy to compute. 
I'm talking about Fair Share -- what sshare prints out on the 
rightmost side.


*From:* slurm-users on behalf of Ryan Cox
*Sent:* Wednesday, December 2, 2020 10:31 AM
*To:* Slurm User Community List; Micheal Krombopulous
*Subject:* Re: [slurm-users] FairShare
It's really similar to a binary search tree.  Within each account, it 
is Shares / Usage to calculate the Level FS.  See 
https://slurm.schedmd.com/SUG14/fair_tree.pdf for more details, 
starting at page 34 or so.  It even has an "animation".


Ryan

On 12/2/20 10:22 AM, Micheal Krombopulous wrote:
I've read the manual and I re-read the other link. What they boil 
down to is Fair Share is calculated based on a recondite "rooted 
plane tree", which I do not have the background in discrete math to 
understand.


I'm hoping someone can explain it so my little kernel can understand.

*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Micheal Krombopulous <michealkrombopul...@outlook.com>
*Sent:* Wednesday, December 2, 2020 9:32 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] FairShare
Can someone tell me how to calculate fairshare (under fairtree)? I 
can't figure it out. I would have thought it would be the same score 
for all users in an account. E.g., here is one of my accounts:


Account         User        RawShares  NormShares   RawUsage  NormUsage  EffectvUsage     LevelFS  FairShare
--------------  ----------  ---------  ----------  ---------  ---------  ------------  ----------  ---------
root                                     0.00         611349                  1.00
 root           root                1    0.076923          0   0.00       0.00                 inf   1.00
 sray                               1    0.076923      30921   0.505582   0.505582        0.152147
  sray          phedge              1    0.05              0   0.00       0.00                 inf   0.181818
  sray          raab                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          benequist           1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bosch               1    0.05              0   0.00       0.00                 inf   0.181818
  sray          rjenkins            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          esmith              1    0.05              0   0.00       0.00          1.7226e+07   0.054545
  sray          gheinz              1    0.05              0   0.00       0.00          1.9074e+14   0.072727
  sray          jfitz               1    0.05              0   0.00       0.00          8.0640e+20   0.081818
  sray          ajoel               1    0.05          42449   0.069465   0.137396        0.363913   0.018182
  sray          jmay                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          aferrier            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bdehaven            1    0.05         225002   0.367771   0.727420        0.068736   0.009091
  sray          msmythe             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          gfink               1    0.05              0   0.00       0.00          2.0343e+05   0.045455
  sray          ahantau             1    0.05             31   0.51       0.000102      491.737549   0.036364
  sray          hmiller             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          ttinker             1    0.05              0   0.00       0.00          1.4798e+13   0.063636
  sray          wcooper             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          xtsao               1    0.05          41734   0.068296   0.135083        0.370143   0.027273
  sray          xping               1    0.05              0   0.00       0.00          1.9833e+24   0.090909









Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox
It's really similar to a binary search tree.  Within each account, it is 
Shares / Usage to calculate the Level FS.  See 
https://slurm.schedmd.com/SUG14/fair_tree.pdf for more details, starting 
at page 34 or so.  It even has an "animation".


Ryan

On 12/2/20 10:22 AM, Micheal Krombopulous wrote:
I've read the manual and I re-read the other link. What they boil down 
to is Fair Share is calculated based on a recondite "rooted plane 
tree", which I do not have the background in discrete math to understand.


I'm hoping someone can explain it so my little kernel can understand.

*From:* slurm-users on behalf of Micheal Krombopulous
*Sent:* Wednesday, December 2, 2020 9:32 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] FairShare
Can someone tell me how to calculate fairshare (under fairtree)? I 
can't figure it out. I would have thought it would be the same score 
for all users in an account. E.g., here is one of my accounts:


Account         User        RawShares  NormShares   RawUsage  NormUsage  EffectvUsage     LevelFS  FairShare
--------------  ----------  ---------  ----------  ---------  ---------  ------------  ----------  ---------
root                                     0.00         611349                  1.00
 root           root                1    0.076923          0   0.00       0.00                 inf   1.00
 sray                               1    0.076923      30921   0.505582   0.505582        0.152147
  sray          phedge              1    0.05              0   0.00       0.00                 inf   0.181818
  sray          raab                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          benequist           1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bosch               1    0.05              0   0.00       0.00                 inf   0.181818
  sray          rjenkins            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          esmith              1    0.05              0   0.00       0.00          1.7226e+07   0.054545
  sray          gheinz              1    0.05              0   0.00       0.00          1.9074e+14   0.072727
  sray          jfitz               1    0.05              0   0.00       0.00          8.0640e+20   0.081818
  sray          ajoel               1    0.05          42449   0.069465   0.137396        0.363913   0.018182
  sray          jmay                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          aferrier            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bdehaven            1    0.05         225002   0.367771   0.727420        0.068736   0.009091
  sray          msmythe             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          gfink               1    0.05              0   0.00       0.00          2.0343e+05   0.045455
  sray          ahantau             1    0.05             31   0.51       0.000102      491.737549   0.036364
  sray          hmiller             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          ttinker             1    0.05              0   0.00       0.00          1.4798e+13   0.063636
  sray          wcooper             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          xtsao               1    0.05          41734   0.068296   0.135083        0.370143   0.027273
  sray          xping               1    0.05              0   0.00       0.00          1.9833e+24   0.090909







Re: [slurm-users] FairShare

2020-12-02 Thread Ryan Cox

Micheal,

Details are at https://slurm.schedmd.com/fair_tree.html.  If they have the same shares 
and usage as each other, they will have the same fair share value.  One 
thing to keep in mind is that sshare rounds or truncates the values, so 
0.00 does not necessarily mean that a value is actually 0. 
https://slurm.schedmd.com/SUG14/fair_tree.pdf has more details, starting 
at page 34 or so.


Ryan

On 12/2/20 9:32 AM, Micheal Krombopulous wrote:

Can someone tell me how to calculate fairshare (under fairtree)? I can't figure 
it out. I would have thought it would be the same score for all users in an 
account. E.g., here is one of my accounts:

Account         User        RawShares  NormShares   RawUsage  NormUsage  EffectvUsage     LevelFS  FairShare
--------------  ----------  ---------  ----------  ---------  ---------  ------------  ----------  ---------
root                                     0.00         611349                  1.00
 root           root                1    0.076923          0   0.00       0.00                 inf   1.00
 sray                               1    0.076923      30921   0.505582   0.505582        0.152147
  sray          phedge              1    0.05              0   0.00       0.00                 inf   0.181818
  sray          raab                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          benequist           1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bosch               1    0.05              0   0.00       0.00                 inf   0.181818
  sray          rjenkins            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          esmith              1    0.05              0   0.00       0.00          1.7226e+07   0.054545
  sray          gheinz              1    0.05              0   0.00       0.00          1.9074e+14   0.072727
  sray          jfitz               1    0.05              0   0.00       0.00          8.0640e+20   0.081818
  sray          ajoel               1    0.05          42449   0.069465   0.137396        0.363913   0.018182
  sray          jmay                1    0.05              0   0.00       0.00                 inf   0.181818
  sray          aferrier            1    0.05              0   0.00       0.00                 inf   0.181818
  sray          bdehaven            1    0.05         225002   0.367771   0.727420        0.068736   0.009091
  sray          msmythe             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          gfink               1    0.05              0   0.00       0.00          2.0343e+05   0.045455
  sray          ahantau             1    0.05             31   0.51       0.000102      491.737549   0.036364
  sray          hmiller             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          ttinker             1    0.05              0   0.00       0.00          1.4798e+13   0.063636
  sray          wcooper             1    0.05              0   0.00       0.00                 inf   0.181818
  sray          xtsao               1    0.05          41734   0.068296   0.135083        0.370143   0.027273
  sray          xping               1    0.05              0   0.00       0.00          1.9833e+24   0.090909







Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Ryan Cox

Ivan,

Are you having I/O slowness? That is the most common cause for us. If 
it's not that, you'll want to look through all the reasons that it takes 
a long time for a process to actually die after a SIGKILL because one of 
those is the likely cause. Typically it's because the process is waiting 
for an I/O syscall to return. Sometimes swap death is the culprit, but 
usually not at the scale that you stated. Maybe you could try 
reproducing the issue manually or putting something in the epilog to see 
the state of the processes in the job's cgroup.


Ryan

On 7/22/20 10:24 AM, Ivan Kovanda wrote:


Dear slurm community,

Currently running slurm version 18.08.4

We have been experiencing an issue causing any nodes a slurm job was 
submitted to to "drain".


From what I've seen, it appears that there is a problem with how slurm 
is cleaning up the job with the SIGKILL process.


I've found this slurm article 
(https://slurm.schedmd.com/troubleshoot.html#completing) , which has a 
section titled "Jobs and nodes are stuck in COMPLETING state", where 
it recommends increasing the "UnkillableStepTimeout" in the slurm.conf 
, but all that has done is prolong the time it takes for the job to 
timeout.


The default time for the "UnkillableStepTimeout" is 60 seconds.
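(For reference, the setting is a single slurm.conf line; the value below is only an example of raising it from that default:)

UnkillableStepTimeout=180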

After the job completes, it stays in the CG (completing) status for 
the 60 seconds, then the nodes the job was submitted to go to drain 
status.


On the headnode running slurmctld, I am seeing this in the log - 
/var/log/slurmctld:




[2020-07-21T22:40:03.000] update_node: node node001 reason set to: 
Kill task failed


[2020-07-21T22:40:03.001] update_node: node node001 state set to DRAINING

On the compute node, I am seeing this in the log - /var/log/slurmd



[2020-07-21T22:38:33.110] [1485.batch] done with job

[2020-07-21T22:38:33.110] [1485.extern] Sent signal 18 to 1485.4294967295

[2020-07-21T22:38:33.111] [1485.extern] Sent signal 15 to 1485.4294967295

[2020-07-21T22:39:02.820] [1485.extern] Sent SIGKILL signal to 
1485.4294967295


[2020-07-21T22:40:03.000] [1485.extern] error: *** EXTERN STEP FOR 
1485 STEPD TERMINATED ON node001 AT 2020-07-21T22:40:02 DUE TO JOB NOT 
ENDING WITH SIGNALS ***


I've tried restarting the SLURMD daemon on the compute nodes, and even 
completely rebooting a few compute nodes (node001, node002).


From what I've seen, we're experiencing this on all nodes in the cluster.

I've yet to restart the headnode because there are still active jobs 
on the system so I don't want to interrupt those.


Thank you for your time,

Ivan





Re: [slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC

2019-12-12 Thread Ryan Cox
Be careful with this approach.  You also need the same munge key 
installed everywhere.  If the developers have root on their own system, 
they can submit jobs and run Slurm commands as any user.


ssh sounds significantly safer.  A quick and easy way to make sure that 
users don't abuse the system is to set limits using pam_limits.so, 
usually in /etc/security/limits.conf.  A cputime limit of one minute 
should prevent users from running their work there.  If I'm reading it 
right, it sounds like you do want jobs running on that system but do not 
want people launching work over ssh.  In that case, you would need to 
make sure that pam_limits.so is enabled for ssh but not Slurm.
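A sketch of that limit (hedged: enable pam_limits.so for sshd only, not in the PAM stack Slurm uses, and add exceptions for admin accounts as needed):

# /etc/security/limits.conf -- the cpu item is in minutes
*        hard    cpu     1
root     -       cpu     unlimited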


Ryan

On 12/12/19 2:01 AM, Nguyen Dai Quy wrote:
On Thu, Dec 12, 2019 at 5:53 AM Ryan Novosielski wrote:


Sure; they’ll need to have the appropriate part of SLURM installed
and the config file. This is similar to having just one login node
per user. Typically login nodes don’t run either daemon.


Hi,
It's interesting ! Do you have any link/tutorial for this kind of setup?
Thanks,




On Dec 11, 2019, at 22:41, Victor (Weikai) Xie <xiewei...@gmail.com> wrote:


Hi,

We are trying to set up a tiny Slurm cluster to manage shared
access to the GPU server in our team. Both slurmctld and slurmd
are going to run on this GPU server. But here is the problem: on
the one hand, we don't want to give developers ssh access to that
box, because otherwise they might bypass the Slurm job queue and
launch jobs directly on the box. On the other hand, if developers
don't have ssh access to the box, how can they run the 'sbatch'
command to submit jobs?

Does Slurm provide an option to allow developers submit jobs
right from their own PCs?

Regards,

Victor (Weikai)  Xie






Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ryan Cox

Ole,

UsePAM has to do with how jobs are launched when controlled by Slurm.  
Basically, it sends jobs launched under Slurm through the PAM stack.  
UsePAM is not required by pam_slurm_adopt because it is *sshd* and not 
*slurmd or slurmstepd* that is involved with pam_slurm_adopt.  That's 
what I believe Tim was referring to (I just skimmed the bug report so 
maybe I missed something).


In this case the recommendation to use UsePAM=1 still applies since you 
want PAM to affect the behavior of jobs launched through Slurm.
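Tying that back to the original question, the suppression itself is then just a limits entry that pam_limits.so applies to Slurm-launched tasks (a sketch, assuming pam_limits.so is present in pam.d/slurm files like the ones quoted below):

# /etc/security/limits.conf
*        hard    core    0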


Ryan

On 03/21/2018 07:16 AM, Ole Holm Nielsen wrote:

On 03/21/2018 02:03 PM, Bill Barth wrote:
I don’t think we had to do anything special since we have UsePAM = 1 
in our slurm.conf. I didn’t do the install personally, but our 
pam.d/slurm* files are written by us and installed by our 
configuration management system. Not sure which one UsePAM looks for, 
but here are ours:


The UsePAM = 1 in slurm.conf may be deprecated, see Tim Wickberg's 
comments on pam_slurm_adopt in 
https://bugs.schedmd.com/show_bug.cgi?id=4098.  Or perhaps UsePAM may 
still be used in the way you describe?



c501-101[skx](41)# cat /etc/pam.d/slurm
auth   required   pam_localuser.so
auth   required   pam_shells.so
account    required   pam_unix.so
account    required   pam_access.so
session    required   pam_unix.so
session    required   pam_limits.so
-session   optional   pam_systemd.so
c501-101[skx](42)# cat /etc/pam.d/slurm.pam
auth   required   pam_localuser.so
auth   required   pam_shells.so
account    required   pam_unix.so
account    required   pam_access.so
session    required   pam_unix.so
session    required   pam_limits.so
-session   optional   pam_systemd.so

There might be better forms of these, but they’re working for us. I 
guess this counts now as being documented in a public place!


Obviously, UsePAM and the /etc/pam.d/slurm rules ought to be 
documented clearly somewhere, but I'm not aware of any good description.


/Ole



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University




Re: [slurm-users] Disable Account Limits Per Partition?

2018-02-22 Thread Ryan Cox

John,

I'm not sure about accounts but you can do that with users:
sacctmgr add user someuser823 account=myaccount7 partition=somepartition

You can set different limits for particular partitions (grpsubmit example):
sacctmgr modify user someuser823 partition=somepartition set grpsubmit=1
or
sacctmgr modify user someuser823 partition=somepartition set 
grpsubmit=-1 #unset limit


Limits are per *association* where an association is the combination of 
cluster, account, user, and partition.  Partition is optional.


If you want to figure something out for accounts or for everyone, you 
probably want to look at using a QOS to override the limits and control 
access to the partitions.  Maybe even a partition QOS.


Ryan

On 02/21/2018 02:13 PM, Roberts, John E. wrote:

Hi,

I'm not sure of the best way to solve this and I don't see any obvious things I 
can set in the configuration. Please let me know if I'm missing something.

I have several partitions in Slurm (16.05). I also have many accounts with 
users tied to them and all of the accounts have a CPU hour limit of some 
number. Accounting Enforcement is set to 'safe' so users can't submit jobs 
unless they have time.

The problem I'm running into is that I also have a few partitions that shouldn't charge for time. I 
have this set on them: TRESBillingWeights="CPU=0.0". This works in that at the end of the 
job they don't get charged for the hours, however, they can't start the job unless they have the 
time initially. I'd like for them to be able to submit to these partitions with any account even if 
the account has run out of hours. I'd also like to avoid creating "special" accounts with 
a ton of hours just to get them running as well. This doesn't work well with how we grant hours and 
track things, but certainly will do this if it's my only option.

Any advice?
--
Thanks.
John



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University