Re: [slurm-users] Replacement for FastSchedule since 19.05.3

2019-11-05 Thread Chris Samuel

On 5/11/19 6:36 am, Taras Shapovalov wrote:

Since Slurm 19.05.3 we get an error message that FastSchedule is 
deprecated, but I cannot find in the documentation what the 
alternative to FastSchedule=0 is. Do you know how we can do that 
without using the option since 19.05.3?


There isn't an alternative for FastSchedule=0 from what I can see; it 
seems that it doesn't work properly with cons_tres (which will be 
replacing cons_res) and so is destined for the scrap heap.
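
If I recall correctly, the replacement being lined up for FastSchedule=2 
is the new SlurmdParameters=config_overrides option, with no equivalent 
planned for FastSchedule=0. A sketch, untested here:

  # slurm.conf, 20.02 and later -- roughly what FastSchedule=2 did
  SlurmdParameters=config_overrides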


See slide 10 of Tim's presentation from this year's Slurm Users Group 
meeting:


https://slurm.schedmd.com/SLUG19/Slurm_20.02_and_Beyond.pdf

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] job priority keeping resources from being used?

2019-11-05 Thread c b
On Sun, Nov 3, 2019 at 7:18 AM Juergen Salk  wrote:

>
> Hi,
>
> maybe I missed it, but what does squeue say in the reason field for
> your pending jobs that you expect to slip in?
>
>
the reason on these jobs is just "Priority".
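
(For reference, a quick way to see that column for every pending job -- a
sketch, the field widths are arbitrary:

  squeue --state=PENDING --format="%.10i %.9P %.8u %.20r"

where %r is the reason field.)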




> Is your partition maybe configured for exclusive node access, e.g. by
> setting `OverSubscribe=EXCLUSIVE´?
>
>
We don't have that setting on, and I believe we are not otherwise
configured for exclusive node access. When my small jobs that each require
one core are running, we get as many jobs running simultaneously on each
machine as there are cores.
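
(For completeness, the partition setting can be double-checked with
something like:

  scontrol show partition | grep -i OverSubscribe

which should show OverSubscribe=NO if exclusive access is off.)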

thanks


> Best regards
> Jürgen
>
> --
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
>
>
> * c b  [191101 14:44]:
> > I see - yes, to clarify, we are specifying memory for each of these jobs,
> > and there is enough memory on the nodes for both types of jobs to be
> > running simultaneously.
> >
> > On Fri, Nov 1, 2019 at 1:59 PM Brian Andrus  wrote:
> >
> > > I ask if you are specifying it, because if not, slurm will assume a job
> > > will use all the memory available.
> > >
> > > So without specifying, your big job gets allocated 100% of the memory so
> > > nothing could be sent to the node. Same if you don't specify for the little
> > > jobs. It would want 100%, but if anything is running there, 100% is not
> > > available as far as slurm is concerned.
> > >
> > > Brian
> > > On 11/1/2019 10:52 AM, c b wrote:
> > >
> > > yes, there is enough memory for each of these jobs, and there is enough
> > > memory to run the high resource and low resource jobs at the same time.
> > >
> > > On Fri, Nov 1, 2019 at 1:37 PM Brian Andrus wrote:
> > >
> > >> Are you specifying memory for each of the jobs?
> > >>
> > >> Can't run a small job if there isn't enough memory available for it.
> > >>
> > >> Brian Andrus
> > >> On 11/1/2019 7:42 AM, c b wrote:
> > >>
> > >> I have:
> > >> SelectType=select/cons_res
> > >> SelectTypeParameters=CR_CPU_Memory
> > >>
> > >> On Fri, Nov 1, 2019 at 10:39 AM Mark Hahn  wrote:
> > >>
> > >>> > In theory, these small jobs could slip in and run alongside the large jobs,
> > >>>
> > >>> what are your SelectType and SelectTypeParameters settings?
> > >>> ExclusiveUser=YES on partitions?
> > >>>
> > >>> regards, mark hahn.
> > >>>
> > >>>
>
>
>
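
A minimal sketch of the submission pattern Brian describes above -- the
script name is made up:

  sbatch --ntasks=1 --mem=2G small_job.sh

With SelectTypeParameters=CR_CPU_Memory, both the CPU and the memory
request have to fit on the node, so a small job that declares its memory
explicitly can be packed next to a large allocation instead of being
assumed to need the whole node.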


Re: [slurm-users] Help with preemption based on licenses

2019-11-05 Thread Mark Hahn

The limiting factor in our cluster is licenses and I want to have high and low 
priority jobs where submitting a high priority job will preempt (suspend) a low 
priority job if all the licenses are already in use.


But what are you expecting to happen? That Slurm will somehow release
the license used by the suspended job, and then somehow reacquire the 
license when it is resumed? I've never heard of that kind of thing
even being offered by license managers, let alone that level of intimate
integration between schedulers and license managers.

At most, a scheduler may provide a callout to query the number of 
free licenses, and consider a job eligible to start if its declared 
usage fits (gres in Slurm terms, I think).
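
A minimal sketch of that counting approach in Slurm -- these are
locally-managed license counts, not a live query of the license server,
and the names are made up:

  # slurm.conf
  Licenses=fluent:10

  # submission
  sbatch -L fluent:2 job.sh

Slurm just decrements its own counter here; it never talks to the license
manager itself, and as far as I know a suspended job keeps its share.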


regards, mark hahn
--
operator may differ from spokesperson.  h...@mcmaster.ca



[slurm-users] Replacement for FastSchedule since 19.05.3

2019-11-05 Thread Taras Shapovalov
Hey guys,

Since Slurm 19.05.3 we get an error message that FastSchedule is
deprecated, but I cannot find in the documentation what the alternative to
FastSchedule=0 is. Do you know how we can do that without using the option
since 19.05.3?

Best regards,

Taras


Re: [slurm-users] Help with preemption based on licenses

2019-11-05 Thread Oytun Peksel
Hi,
Apparently the original email got lost here, so here it is. If anyone has any 
idea how to do this, please comment.


On Thursday, June 20, 2019 at 3:20:41 AM UTC+2, Eric Wittmayer wrote:
Hi Slurm experts,
  I'm new to SLURM and could really use some help getting preemption working.

The limiting factor in our cluster is licenses and I want to have high and low 
priority jobs where submitting a high priority job will preempt (suspend) a low 
priority job if all the licenses are already in use.
Is this possible with SLURM currently?
If so can someone provide example configuration settings?

If it isn't currently possible, could this be a feature included in the current 
cons_tres work that is going on?
I've read through a bunch of the documentation and tried to do my due diligence 
but haven't found a definitive answer.

Thanks,
Eric W



From: slurm-users On Behalf Of Oytun Peksel
Sent: den 5 november 2019 08:33
To: Slurm User Community List 
Subject: Re: [slurm-users] Help with preemption based on licenses

Hi Eric,

Have you been able to find a solution to your problem? We are facing the
same issue right now.

BR
Oytun Peksel






Re: [slurm-users] Running job using our serial queue

2019-11-05 Thread David Baker
Hello,

Thank you for your replies. I double-checked that the "task/" prefix in, for 
example, TaskPlugin=task/affinity is optional. In this respect it is good to 
know that we have the correct cgroups setup. So in theory users should only 
disturb themselves; in reality, however, we find that there is often a knock-on 
effect on other users' jobs. For example, users have complained that their jobs 
sometimes stall. I can only guess that something odd is going on at the kernel 
level.

One additional thing that I need to ask is... Should we have hwloc installed on 
our compute nodes? Does that help? Whenever I check which processes are not 
being constrained by cgroups, I only ever find a small group of system processes.
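
(Concretely, the check is along these lines, assuming cgroup v1 and Slurm's
usual hierarchy -- paths may differ on other setups:

  cat /sys/fs/cgroup/cpuset/slurm/uid_<uid>/job_<jobid>/cpuset.cpus
  cat /sys/fs/cgroup/cpuset/slurm/uid_<uid>/job_<jobid>/cgroup.procs

i.e. which cores the job is confined to and which processes sit inside that
cgroup.)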

Best regards,
David





From: slurm-users on behalf of Marcus Wagner
Sent: 05 November 2019 07:47
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Running job using our serial queue

Hi David,

The way you are doing it is the same way we do it.

When the Matlab job asks for one CPU, it only gets one CPU this way. That means 
that all its processes are bound to this one CPU, so (theoretically) the user 
is just disturbing himself if he uses more.

But with Matlab especially, there is more to do. It does not suffice to add 
'-singleCompThread' to the command line: Matlab is not the only tool that tries 
to use all the cores it finds on the node.
The same is valid for CPLEX and Gurobi, both often used from Matlab. So even if 
the user sets '-singleCompThread' for Matlab, that does not mean at all that the 
job is only using one CPU.
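
A sketch of the kind of belt-and-braces job script we point users to -- the
variables are the standard OpenMP and MKL ones, though how much each tool
honours them varies:

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  export OMP_NUM_THREADS=1   # OpenMP-based libraries
  export MKL_NUM_THREADS=1   # Intel MKL, as used inside Matlab
  matlab -singleCompThread -nodisplay -r "my_script; exit"

Even then the cgroup is the real safety net: whatever extra threads a tool
spawns, they all land on the one allocated CPU.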


Best
Marcus

On 11/4/19 4:14 PM, David Baker wrote:
Hello,

We decided to route all jobs requesting from 1 to 20 cores to our serial queue. 
Furthermore, the nodes controlled by the serial queue are shared by multiple 
users. We did this to try to reduce the level of fragmentation across the 
cluster -- our default "batch" queue provides exclusive access to compute nodes.

It looks like the downside of the serial queue is that jobs from different 
users can interact quite badly. To some extent this is an education issue -- 
for example, Matlab users need to be told to add the "-singleCompThread" option 
to their command line. On the other hand, I wonder if our cgroups setup is 
optimal for the serial queue. Our cgroup.conf contains...

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The relevant cgroup configuration in the slurm.conf is...
ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup

Could someone please advise us on the required/recommended cgroup setup for the 
above scenario? For example, should we really set "TaskAffinity=yes"? I assume 
the interaction between jobs (sometimes jobs can get stalled) is due to context 
switching at the kernel level; apart from educating users, how can we minimise 
that switching on the serial nodes?

Best regards,
David



--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de