Re: [slurm-users] Longer queuing times for larger jobs

2020-02-12 Thread Chris Samuel

On 5/2/20 1:44 pm, Antony Cleave wrote:

Hi, from what you are describing it sounds like smaller jobs are backfilling in 
front of the large jobs and stopping them from starting.


We use a feature that SchedMD implemented for us called 
"bf_min_prio_reserve", which lets you set a priority threshold below 
which Slurm won't make a forward reservation for a job (so such a job can 
only start if it can run right now without delaying other jobs).


https://slurm.schedmd.com/slurm.conf.html#OPT_bf_min_prio_reserve

So if you can arrange your local priority setup so that large jobs are 
above that threshold and smaller jobs are below it (or whatever suits 
your use case), then these large jobs should get a reliable start time 
without smaller jobs pushing them back.
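
For illustration, a minimal slurm.conf sketch of how that could look (the 
threshold and weight values are invented and would need to match your local 
priority scheme):

  # Jobs whose priority falls below this threshold never get a backfill
  # reservation; they can only start if they can run immediately.
  SchedulerParameters=bf_min_prio_reserve=1000000
  # Weight job size heavily so that large jobs land above the threshold.
  PriorityWeightJobSize=2000000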


There's some useful background from the bug where this was implemented:

https://bugs.schedmd.com/show_bug.cgi?id=2565

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Longer queuing times for larger jobs

2020-02-12 Thread Loris Bennett
Loris Bennett  writes:

> Hello David,
>
> David Baker  writes:
>
>> Hello,
>>
>> I've taken a very good look at our cluster, however, as yet, I have not
>> made any significant changes. The one change that I did make was to
>> increase the "jobsizeweight". That's now our dominant parameter and it
>> does ensure that our largest jobs (> 20 nodes) are making it to the
>> top of the sprio listing, which is what we want to see.
>>
>> These large jobs aren't making any progress despite the priority
>> lift. I additionally decreased the nice value of the job that sparked
>> this discussion. That is (looking at sprio), there is a 32-node job
>> with a very high priority...
>>
>>   JOBID PARTITION USER      PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
>>  280919 batch     mep1c10    1275481   40      59827   415655          0    0   -40
>>
>> That job has been sitting in the queue for well over a week and it is
>> disconcerting that we never see nodes becoming idle in order to
>> service these large jobs. Nodes do become idle and then get scooped up by
>> jobs started by backfill. Looking at the slurmctld logs, I see that the
>> vast majority of jobs are being started via backfill -- including, for
>> example, a 24-node job. I see very few jobs allocated by the
>> scheduler. That is, messages like "sched: Allocate JobId=6915" are few
>> and far between, and I never see any of the large jobs being allocated
>> in the batch queue.
>>
>> Surely, this is not correct, however does anyone have any advice on
>> what to check, please?
>
> Have you looked at what 'sprio' says?  I usually want to see the list
> sorted by priority and so call it like this:
>
>   sprio -l -S "%Y"

This should be

  sprio -l -S "Y"

[snip (242 lines)]

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] Slurm version 20.02.0rc1 is now available

2020-02-12 Thread Tim Wickberg
We are pleased to announce the availability of Slurm release candidate 
20.02.0rc1.


This is the first release candidate for the upcoming 20.02 release 
series, and marks the finalization of the RPC and state file formats.


This rc1 also includes the first version of the Slurm REST API, as 
implemented in the new slurmrestd command / daemon. The slurmrestd 
command acts as a REST proxy to the libslurm internal API, and can be 
used alongside the new auth/jwt authentication mechanism to integrate 
Slurm into external systems.
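
As a rough sketch of the intended usage (the host, port, API version segment 
in the URL path, and the way the token is generated are assumptions and may 
differ in the final 20.02 release):

  # Generate a JWT for the current user (requires auth/jwt to be configured):
  export SLURM_JWT=$(scontrol token | cut -d= -f2)
  # Query the REST daemon; headers are those described in the JWT docs:
  curl -s -H "X-SLURM-USER-NAME: $USER" -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
   http://rest-host:6820/slurm/v0.0.35/jobs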


A high-level overview of some of the new features and other changes in 
the 20.02 release was presented at SLUG'19, and is archived here:

https://slurm.schedmd.com/publications.html

The Release Notes also include a summary of the major changes:
https://slurm.schedmd.com/archive/slurm-master/news.html

If any issues are identified with this new release candidate, please 
report them through https://bugs.schedmd.com against the 20.02.x version 
and we will address them before the first production 20.02.0 release is 
made.


A preview of the updated documentation can be found at 
https://slurm.schedmd.com/archive/slurm-master/ . Once 20.02 is 
released, the main documentation page at https://slurm.schedmd.com will 
be switched over to this newer content.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support



[slurm-users] Advice on using GrpTRESRunMin=cpu=

2020-02-12 Thread David Baker
Hello,

Before implementing "GrpTRESRunMin=cpu=limit" on our production cluster I'm 
doing some tests on the development cluster. I've only got a handful of compute 
nodes to play with, and so I have set the limit sensibly low. That is, I've set 
the limit to 576,000 CPU-minutes, which is equivalent to 400 CPU-days. In other 
words, I can potentially submit the following job...

1 x 2 nodes x 80 cpus/node x 2.5 days = 400 CPU-days

I submitted a set of jobs, each requesting 2 nodes with 80 cpus/node for 2.5 days. 
The first job is running and the rest are in the queue -- what I see makes sense...

  JOBID PARTITION  NAME   USER  ST   TIME  NODES  NODELIST(REASON)
    677 debug      myjob  djb1  PD   0:00      2  (AssocGrpCPURunMinutesLimit)
    678 debug      myjob  djb1  PD   0:00      2  (AssocGrpCPURunMinutesLimit)
    679 debug      myjob  djb1  PD   0:00      2  (AssocGrpCPURunMinutesLimit)
    676 debug      myjob  djb1   R  12:52      2  navy[54-55]

On the other hand, I expected these queued jobs not to accrue priority, yet they 
do appear to be doing so (see the sprio output below). I'm working with Slurm 
v19.05.2. Have I missed something vital in the config? We hoped that the queued 
jobs would not accrue priority; we haven't, for example, set "accrue always". 
Have I got that wrong? Could someone please advise us?

Best regards,
David

[root@navy51 slurm]# sprio
  JOBID PARTITION  PRIORITY  SITE   AGE  FAIRSHARE  JOBSIZE  QOS
    677 debug       5551643    10  1644         45      500    0
    678 debug       5551643    10  1644         45      500    0
    679 debug       5551642    10  1643         45      500    0
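
For reference, a sketch of the commands and settings under discussion (the 
account name is hypothetical, and whether "accrue always", i.e. 
PriorityFlags=ACCRUE_ALWAYS, is the knob that governs accrual for limit-held 
jobs is exactly the open question above):

  # Apply the running CPU-minute cap to an association (576,000 min = 400 CPU-days):
  sacctmgr modify account name=hpc-test set GrpTRESRunMins=cpu=576000

  # slurm.conf flag referred to above as "accrue always" (not set here):
  #PriorityFlags=ACCRUE_ALWAYS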


Re: [slurm-users] Using "Nodes" on script - file ????

2020-02-12 Thread Renfro, Michael
Hey, Matthias. I’m having to translate a bit, so if I get a meaning wrong, 
please correct me.

You should be able to set the minimum and maximum number of nodes used for jobs 
on a per-partition basis, or to set a default for all partitions. My most 
commonly used partition has:

  PartitionName=batch MinNodes=1 MaxNodes=40 …

and each job runs on one node by default, without anyone having to specify a 
node count.

If your users are running purely OpenMP jobs, with no MPI at all, there’s no 
reason for them to request more than one node per job, as you probably already 
know, and you could potentially set MaxNodes=1 for one or more partitions. If 
they’re using MPI, they’ll typically need the ability to use more than one node.

You could also use maximum job times, QoS settings, or trackable resource 
(TRES) limits on a per-user, per-account, or per-partition basis to keep users 
from consuming all your resources for an extended period of time.
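
For example, a sketch of those two approaches (partition name, QOS name, and 
numbers are purely illustrative):

  # slurm.conf: cap an OpenMP-only partition at one node and two days
  PartitionName=openmp MinNodes=1 MaxNodes=1 MaxTime=2-00:00:00 …

  # sacctmgr: cap how many CPUs a single user can hold under a QOS
  sacctmgr modify qos normal set MaxTRESPerUser=cpu=160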

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Feb 12, 2020, at 6:27 AM, Matthias Krawutschke 
>  wrote:
> 
> Hello everyone,
> I have a specific question regarding these options:
>
> #SBATCH --nodes=2 or srun -N…
>
> Some users of the HPC set this value very high and allocate compute nodes
> even though they do not actually need them.
>
> My question is now the following:
> Is it really necessary to put this value in the job script or on the
> command line, or can it be left out?
> In which cases is it necessary to limit it? Is that possible with OpenMPI?
>  
> Best regards….
>  
>  
>  
> Matthias Krawutschke, Dipl. Inf.
>  
> Universität Potsdam
> ZIM - Zentrum für Informationstechnologie und Medienmanagement
> Team High-Performance-Computing on Cluster - Environment
>  
> Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
> Tel: +49 331 977-, Fax: +49 331 977-1750
>  
> Internet: https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html



[slurm-users] Using "Nodes" on script - file ????

2020-02-12 Thread Matthias Krawutschke
Hello everyone,

I have a specific question regarding these options:

#SBATCH --nodes=2 or srun -N…

Some users of the HPC set this value very high and allocate compute nodes
even though they do not actually need them.

My question is now the following:

Is it really necessary to put this value in the job script or on the
command line, or can it be left out?

In which cases is it necessary to limit it? Is that possible with OpenMPI?
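
As an illustration of that question: for a pure OpenMP job the node count can 
simply be left out, e.g. (program name and CPU count are made up):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8     # a single node is implied; no --nodes needed
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  ./my_openmp_program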

 

Best regards….

 

 

 

Matthias Krawutschke, Dipl. Inf.

 

Universität Potsdam
ZIM - Zentrum für Informationstechnologie und Medienmanagement
Team High-Performance-Computing on Cluster - Environment

 

Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
Tel: +49 331 977-, Fax: +49 331 977-1750

 

Internet:  
https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html

 

 



[slurm-users] Increasing the OpenFile under SLURM ....

2020-02-12 Thread Matthias Krawutschke
Hello everyone,

I have a specific question about raising the limit on the maximum number of
open files under Linux.

This concerns the soft limit on open files, not the hard limit (see: ulimit -Sn).

Some users of the HPC open a large number of files during processing in their
applications. These applications abort with the error message "too many open
files".

Setting the parameter in /etc/security/limits.conf has no effect here, since no
login is performed for jobs started via Slurm.

Has anyone already dealt with this subject, and can perhaps help me here?
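
A minimal sketch of the two knobs usually involved, assuming systemd-managed 
slurmd (paths and numbers are illustrative, not taken from this site):

  # On the compute nodes: raise slurmd's own limit with a systemd drop-in,
  # e.g. /etc/systemd/system/slurmd.service.d/nofile.conf, then run
  # "systemctl daemon-reload && systemctl restart slurmd":
  [Service]
  LimitNOFILE=131072

  # In slurm.conf: by default Slurm propagates the submitting shell's limits
  # to the job, so either raise the limit where jobs are submitted or stop
  # propagating NOFILE so jobs inherit the slurmd value instead:
  PropagateResourceLimitsExcept=NOFILE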

 

Best regards….

 

 

 

Matthias Krawutschke, Dipl. Inf.

 

Universität Potsdam
ZIM - Zentrum für Informationstechnologie und Medienmanagement
Team High-Performance-Computing on Cluster - Environment

 

Campus Am Neuen Palais: Am Neuen Palais 10 | 14469 Potsdam
Tel: +49 331 977-, Fax: +49 331 977-1750

 

Internet:  
https://www.uni-potsdam.de/de/zim/angebote-loesungen/hpc.html

 

 



Re: [slurm-users] Node appears to have a different slurm.conf than the slurmctld; update_node: node reason set to: Kill task failed

2020-02-12 Thread Taras Shapovalov
Hey Robert,

Ask Bright support, they will help you to figure out what is going on there.

Best regards,
Taras

On Tue, Feb 11, 2020 at 8:26 PM Robert Kudyba  wrote:

> This is still happening. Nodes are being drained after a kill task failed.
> Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
>
> [2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill
> task failed
> [2020-02-11T12:21:26.006] update_node: node node001 state set to DRAINING
> [2020-02-11T12:21:26.006] got (nil)
> [2020-02-11T12:21:26.015] error: slurmd error running JobId=1514 on
> node(s)=node001: Kill task failed
> [2020-02-11T12:21:26.015] _job_complete: JobID=1514 State=0x1 NodeCnt=1
> WEXITSTATUS 1
> [2020-02-11T12:21:26.015] email msg to sli...@fordham.edu: SLURM
> Job_id=1514 Name=run.sh Failed, Run time 00:02:21, NODE_FAIL, ExitCode 0
> [2020-02-11T12:21:26.016] _job_complete: requeue JobID=1514 State=0x8000
> NodeCnt=1 per user/system request
> [2020-02-11T12:21:26.016] _job_complete: JobID=1514 State=0x8000 NodeCnt=1
> done
> [2020-02-11T12:21:26.057] Requeuing JobID=1514 State=0x0 NodeCnt=0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x8003 NodeCnt=1
> done
> [2020-02-11T12:21:52.111] _job_complete: JobID=1512 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:52.112] _job_complete: JobID=1512 State=0x8003 NodeCnt=1
> done
> [2020-02-11T12:21:52.214] sched: Allocate JobID=1516 NodeList=node002
> #CPUs=1 Partition=defq
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x1 NodeCnt=1
> WEXITSTATUS 0
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x8003 NodeCnt=1
> done
>
> On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba 
> wrote:
>
>> Usually means you updated the slurm.conf but have not done "scontrol
>>> reconfigure" yet.
>>>
>> Well it turns out it was something else related to a Bright Computing
>> setting. In case anyone finds this thread in the future:
>>
>> [ourcluster->category[gpucategory]->roles]% use slurmclient
>> [ourcluster->category[gpucategory]->roles[slurmclient]]% show
>> ...
>> RealMemory 196489092
>> ...
>> [ciscluster->category[gpucategory]->roles[slurmclient]]%
>>
>> Values are specified in MB and this line is saying that our node has
>> 196TB of RAM.
>>
>> I set this using cmsh:
>>
>> # cmsh
>> % category
>> % use gpucategory
>> % roles
>> % use slurmclient
>> % set realmemory 191846
>> % commit
>>
>> The value in /etc/slurm/slurm.conf was conflicting with this, especially
>> when restarting slurmctld.
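
As a quick cross-check (a sketch; the exact output fields vary by version), the 
value Slurm expects for RealMemory can be read off the node itself:

  # On the compute node: print the detected hardware configuration,
  # including RealMemory in MB, which slurm.conf / cmsh should match.
  slurmd -C
  # e.g. NodeName=node001 CPUs=... RealMemory=191846 ...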
>>
>> On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>>>
>>> We are using Bright Cluster 8.1 and have just upgraded to slurm-17.11.12.
>>>
>>> We're getting the below errors when I restart the slurmctld service. The
>>> file appears to be the same on the head node and compute nodes:
>>> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> [root@ourcluster ~]# ls -l  /cm/shared/apps/slurm/var/etc/slurm.conf
>>> /etc/slurm/slurm.conf
>>>
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> lrwxrwxrwx 1 root root   40 Nov 30  2018 /etc/slurm/slurm.conf ->
>>> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> So what else could be causing this?
>>> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
>>> [2020-02-10T10:31:12.009] error: Node node001 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make  sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size
>>> (191846 < 196489092)
>>> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration
>>> node=node001: Invalid argument
>>> [2020-02-10T10:31:12.011] error: Node node002 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size
>>> (191840 < 196489092)
>>> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration
>>> node=node002: Invalid argument
>>> [2020-02-10T10:31:12.047] error: Node node003 appears to have a
>>> different slurm.conf than the slurmctld.  This could cause issues with
>>> communication and functionality.  Please review both files and make sure
>>> they are the same.  If this is expected ignore, and set
>>> DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size
>>> (191840 < 196489092)
>>> [2020-02-10T10:31: