[slurm-users] Single user consuming all resources of the cluster

2018-02-06 Thread Matteo F
Hello there.

I've just set up a small Slurm cluster for our on-premise computation needs
(nothing too exotic, just a bunch of R scripts).

The system "works" in the sense that users are able to submit jobs, but I
have an issue with resource management: a single user can consume all the
resources of the cluster.

I will attach data from my live system so you can look at the config files
and my troubleshooting attempts, but here is a simplified practical
example:

Suppose I have 2 nodes with 10G of RAM each. User1 submits 4 jobs, each
one requiring 5G, and fills the cluster.
Then User2 comes along and submits another job, which gets queued until one
of User1's jobs completes (which may take days). This is not good.
I've tried to limit the number of running jobs using QOS ->
MaxJobsPerAccount, but this wouldn't stop a user from just filling up the
cluster with fewer (but bigger) jobs.

How can I avoid that?

Here is a link to my config files: https://pastebin.com/iwAnBMpY

Thanks a lot.
Matteo


Re: [slurm-users] Single user consuming all resources of the cluster

2018-02-06 Thread Christopher Samuel

On 06/02/18 21:40, Matteo F wrote:

I've tried to limit the number of running jobs using QOS ->
MaxJobsPerAccount, but this wouldn't stop a user from just filling up the
cluster with fewer (but bigger) jobs.


You probably want to look at what you can do with the slurmdbd database
and associations. Things like GrpTRES:

https://slurm.schedmd.com/sacctmgr.html

# GrpTRES=
# Maximum number of TRES running jobs are able to be allocated in
# aggregate for this association and all associations which are children
# of this association. To clear a previously set value use the modify
# command with a new value of -1 for each TRES id.
#
#  NOTE: This limit only applies fully when using the Select Consumable
# Resource plugin.
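
For example, something along these lines might work (a sketch only: the
account name and TRES values are made up, and the exact syntax should be
checked against the sacctmgr man page):

```
# Cap the 'research' account (and all of its child associations) at
# 20 CPUs and 20G of memory allocated across all running jobs.
sacctmgr modify account name=research set GrpTRES=cpu=20,mem=20G
```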

Best of luck,
Chris



[slurm-users] Is QOS always inherited explicitly?

2018-02-06 Thread Loris Bennett
Hi,

[I didn't get an answer to this when I tacked it onto the end of another
question (which I also didn't get an answer to :-/), so I'm starting a
new thread.]

The documentation for 'sacctmgr' says

  Note: the QOS that can be used at a given account in the hierarchy are
  inherited by the children of that account.

However, if I do the following:

  $ sacctmgr modify account name=root set qos+=medium,short

the result is

  Modified account associations...
C = soroban  A = root
C = soroban  A = anemometry   U = alice
C = soroban  A = anemometry   U = bob
C = soroban  A = barometry    U = carol
C = soroban  A = barometry    U = dave
C = soroban  A = calorimetry  U = ethel
...

To me this looks as if the QOS are in fact being explicitly added to
each association (rather than being just implicitly inherited).  In this
case, will a new association added within this hierarchy automatically
be associated with the QOS available to the other associations?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Single user consuming all resources of the cluster

2018-02-06 Thread Bill Barth
Chris probably gives the Slurm-iest way to do this, but we use a Spank plugin
that counts the jobs a user has in the queue (running and waiting) and sets a
hard cap on how many they can have. This should probably be scaled to the size
of the system and the partition they are submitting to, but on Stampede 2 (4200
KNL nodes and 1736 SKX nodes) we set this to about 50 across all queues, which
has been our magic number across numerous schedulers over the years, on
systems ranging from hundreds of nodes to Stampede 1 with 6400. Some users get
more by request, and most don't even bump up against the limits. We've started
to look at using TRES on our test system, but we haven't gotten there yet. Our
use of the DB is minimal, and our process to get every user into it when their
TACC account is created is not 100% automated yet (we use the job completion
plugin to create a flat file with job records, which our local accounting system
consumes to decrement allocation balances, if you care to know).
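
The counting itself can be sketched from the command line (this is not our
actual plugin; the username and cap below are placeholders):

```
# Count the jobs a user currently has in the queue (running + pending);
# a submit-time check could refuse new jobs above the cap.
USER_NAME=alice    # placeholder
MAX_JOBS=50        # the "magic number" mentioned above
count=$(squeue -h -u "$USER_NAME" | wc -l)
if [ "$count" -ge "$MAX_JOBS" ]; then
    echo "Job limit reached for $USER_NAME ($count >= $MAX_JOBS)" >&2
fi
```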

Best,
Bill. 

-- 
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
 
 






[slurm-users] spank plugin parameter max length ?

2018-02-06 Thread Tueur Volvo
Hello, I have written a SPANK plugin and I have a problem.
My plugin adds a new parameter, --hbm.

This srun command works:
srun --hbm="tototututititatatetetyt" hostname

but if I add one more character, my Slurm job "freezes" and stays in the R
state:
srun --hbm="tototututititatatetetyty" hostname

So maybe Slurm limits the parameter length? The maximum seems to be 23
characters. Why?

In my code, slurmd calls slurm_spank_init, _script_opt_process and
slurm_spank_task_init, but when the error occurs slurmd does not call
slurm_spank_task_init.

Here is my code:

/* Note: the #include lines were stripped from the original post;
 * these are the usual headers for a minimal SPANK plugin. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <slurm/spank.h>


SPANK_PLUGIN(hbm, 1);

static int _script_opt_process (int val,
                                const char *optarg,
                                int remote);

struct spank_option spank_options[] =
{
    { "hbm",                 /* option name */
      "[hbm]",               /* argument placeholder shown in usage */
      "hbm parameter",       /* usage text */
      1,                     /* has_arg: option takes an argument */
      0,                     /* val passed to the callback */
      (spank_opt_cb_f) _script_opt_process
    },

    SPANK_OPTIONS_TABLE_END
};


int slurm_spank_init (spank_t sp, int ac, char **av)
{
    return (0);
}

int slurm_spank_task_init (spank_t sp, int ac, char **av)
{
    return 0;
}

static int _script_opt_process (int val, const char *optarg, int remote)
{
    return (0);
}


[slurm-users] LAST TASK ID

2018-02-06 Thread david martin

Hi,

I'm running a batch array script and would like to execute a command
after the last task:


#SBATCH --array 1-10%10:1

sh myscript.R inputdir/file.${SLURM_ARRAY_TASK_ID}

# Would like to run a command after the last task

For example, when I was using SGE there was something like this:

if ($SGE_TASK_ID == $SGE_TASK_LAST) then
    # do last-task stuff here
endif


Can I do that with Slurm?




Re: [slurm-users] LAST TASK ID

2018-02-06 Thread Michael Gutteridge
Hi

The environment variable SLURM_ARRAY_TASK_MAX might be used for this as
well, e.g.:

if [ $SLURM_ARRAY_TASK_ID -eq $SLURM_ARRAY_TASK_MAX ]
then
   # last task
fi

Though I'd caution that if you need this to run after all the jobs in the
array are _complete_, you should use a job dependency.  Not sure how your
scheduling is set up, but in our setup there's no guarantee that the earlier
tasks will finish before the last one.
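
A dependency-based version might look like this (the script names are
placeholders; --parsable makes sbatch print just the job ID):

```
# Submit the array, then submit a follow-up job that starts only after
# every task in the array has completed successfully.
array_id=$(sbatch --parsable --array=1-10%10 array_job.sh)
sbatch --dependency=afterok:"$array_id" postprocess.sh
```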

HTH

Michael





Re: [slurm-users] Problem with nodes appear as DOWN (Not responding) slurm 17.02.9

2018-02-06 Thread Marcin Stolarek
Check the ReturnToService parameter in slurm.conf.
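
For reference, the relevant slurm.conf line might look like this (the value
shown is an assumption; check the slurm.conf man page for the exact
semantics of each setting):

```
# slurm.conf
# ReturnToService=2 allows a DOWN node to return to service automatically
# once slurmd registers with a valid configuration.
ReturnToService=2
```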

On Mon, 5 Feb 2018 at 20:30, Guy -  wrote:

> Hi,
> I've compiled and installed Slurm on Ubuntu. It works great, but if I take
> a node down by stopping and starting slurmd, it keeps appearing as DOWN
> (Not responding).
> The only fix is restarting slurmctld with the -c flag; afterwards all nodes
> are back up!
> Note that running slurmd - and slurmctld - does not show any
> indicative signs of errors.
>
> Thanks!
>
> Guy
>


[slurm-users] Slurm version 17.11.3 available

2018-02-06 Thread Tim Wickberg

We are pleased to announce the availability of Slurm version 17.11.3.

This includes 44 fixes made since 17.11.2 was released last month,
including one for an issue that could result in stray processes when a job
is canceled during a long-running prolog script.


Slurm can be downloaded from https://www.schedmd.com/downloads.php

- Tim


* Changes in Slurm 17.11.3
==
 -- Send SIG_UME correctly to a step.
 -- Sort sreport's reservation report by cluster, time_start, resv_name instead
of cluster, resv_name, time_start.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating
the node reboot is cancelled while the reboot is in progress.
 -- Scheduling fix for changing node features without any NodeFeatures plugins.
 -- Improve logic when summarizing job arrays mail notifications.
 -- Add scontrol -F/--future option to display nodes in FUTURE state.
 -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
 -- When a job array is preempting make it so tasks in the array don't wait
to preempt other possible jobs.
 -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
in slurmstepd.
 -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld
reconfigured.
 -- node_feature/knl_cray - Fix memory leak that can occur during normal
operation.
 -- Fix srun environment variables for --prolog script.
 -- Fix job array dependency with "aftercorr" option when some task arrays in
the first job fail. This fix lets all task array elements that can run
proceed rather than stopping all subsequent task array elements.
 -- Fix potential deadlock in the slurmctld when using list_for_each.
 -- Fix for possible memory corruption in srun when running heterogeneous job
steps.
 -- Fix output file containing "%t" (task ID) for heterogeneous job step to
be based upon global task ID rather than task ID for that component of the
heterogeneous job step.
 -- MYSQL - Fix potential abort when attempting to make an account a parent of
itself.
 -- Fix potentially uninitialized variable in slurmctld.
 -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
find jobs that ran on specific nodes.
 -- Reject --acctg-freq at submit if invalid.
 -- Added info string on sh5util when deleting an empty file.
 -- Correct dragonfly topology support when job allocation specifies desired
switch count.
 -- Fix minor memory leak on an sbcast error path.
 -- Fix issues when starting the backup slurmdbd.
 -- Revert uid check when requesting a jobid from a pid.
 -- task/cgroup - add support to detect OOM_KILL cgroup events.
 -- Fix whole node allocation cpu counts when --hint=nomultithread.
 -- Allow execution of task prolog/epilog when uid has access
rights by a secondary group id.
 -- Validate command existence on the srun *[pro|epi]log options
if LaunchParameter test_exec is set.
 -- Fix potential memory leak if clean starting and the TRES didn't change
from when last started.
 -- Fix for association MaxWall enforcement when none is given at submission.
 -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.
 -- burst_buffer/cray: Attempts by job to create persistent burst buffer when
one already exists owned by a different user will be logged and the job
held.
 -- CRAY - Remove race in the core_spec where we add the slurmstepd to the
job container where if the step was canceled would also cancel the stepd
erroneously.
 -- Make sure the slurmstepd blocks signals like SIGTERM correctly.
 -- SPANK - When slurm_spank_init_post_opt() fails return error correctly.
 -- When revoking a sibling job in the federation we want to send a start
message before purging the job record to get the uid of the revoked job.
 -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss
was the only correct capitalization; UsePSS or usepss were silently
ignored.
 -- Prevent pthread_atfork handlers from being added unnecessarily after
'scontrol reconfigure', which can eventually lead to a crash if too
many handlers have been registered.
 -- Better debug messages when MaxSubmitJobs is hit.
 -- Docs - update squeue man page to describe all possible job states.
 -- Preserve node features when slurmctld daemons reconfigured including active
and available KNL features.
 -- Prevent orphaned step_extern steps when a job is cancelled while the
prolog is still running.




Re: [slurm-users] Single user consuming all resources of the cluster

2018-02-06 Thread Matteo F
Thanks Bill, I really appreciate the time you spent giving this detailed
answer.
I will have a look at the plugin system, as integration with our
accounting system would be a nice feature.

@Chris thanks, I've had a look at GrpTRES but I'll probably go with the
Spank route.

Best,
Matteo
