Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-20 Thread Chris Samuel

On 20/1/23 3:51 am, Stefan Staeglich wrote:

> But someone who is actually using an UnkillableStepProgram stated the opposite
> (that it's executed on the controller nodes). Are you aware of any change
> between Slurm releases? Maybe one of the two parts is just a leftover. Are you
> using an UnkillableStepProgram?

Yes, we've been using it for years on 7 different systems in my time here.

It runs on the compute nodes and collects troubleshooting info for us 
when a job fails to die in an allowed time.
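
For illustration only, here is a minimal sketch of the kind of information such a program might collect. It is not the script described above. The function name find_uninterruptible and its proc_root parameter are made up for this example. It assumes a Linux compute node and does not rely on any job identifiers from Slurm, since what is passed to the program may vary between releases.

```python
import os


def find_uninterruptible(proc_root="/proc"):
    """Return sorted (pid, command) pairs for processes in state D.

    State D (uninterruptible sleep) is a common reason a process
    ignores SIGKILL, for example while blocked on hung I/O.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                stat = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        # Layout is "pid (comm) state ..."; comm may itself contain
        # spaces or parentheses, so split on the last ")".
        left, right = stat.find("("), stat.rfind(")")
        if left == -1 or right == -1:
            continue
        fields = stat[right + 1:].split()
        if fields and fields[0] == "D":
            stuck.append((int(entry), stat[left + 1:right]))
    return sorted(stuck)


if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, command in find_uninterruptible():
        print(f"unkillable candidate: pid={pid} command={command}")
```

A real program would probably also record the node name and kernel messages and send the result somewhere the administrators will see it.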


--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] Using oversubscribe to hammer a node

2023-01-20 Thread Groner, Rob
Don't worry, I'm well past the "is this a sensible thing".  Let's just call it 
an experiment.

I have oversubscribe=FORCE:4 set on the partition, and nothing set on the 
sbatch command itself.  And with that setting, I can execute a job that 
requires all of the node's cores 4x and it will put all of those jobs on that 
node.  When I execute a 5th job, it goes pending for resources.  But in the 
meantime, only one of the jobs is running at any given time, the rest are 
suspended.  That's not what I expected from "More than one job can execute 
simultaneously on the same compute resources."  I don't consider jobs to be 
executing simultaneously if they're suspended.

Rob


From: slurm-users on behalf of Loris Bennett
Sent: Friday, January 20, 2023 1:48 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Using oversubscribe to hammer a node

Hi Rob,

"Groner, Rob"  writes:

> I'm trying to set up a specific partition where users can fight with the OS
> for dominance.  The oversubscribe property sounds like what I want, as it says
> "More than one job can execute simultaneously on the same compute resource."
> That's exactly what I want.  I've set up a node with 48 CPUs and
> oversubscribe set to force:4.  I then execute a job that requests 48 CPUs,
> and that starts running.  I execute another job asking for 48 cores, and it
> gets assigned to the node...but it is not running, it's suspended.  I can
> execute 2 more jobs, and they'll all go on the node (so, 4x) but 3 will be
> suspended at any time.  I see the time slicing going on, but that isn't what
> I thought it would be...I thought all 4 tasks per CPU would be running at the
> same time.  Basically, I want the CPU/OS to work out the sharing of
> resources.  Otherwise, if one of the tasks that is running is just sitting
> there doing nothing, it's going to do that for its 30 seconds while other
> tasks are suspended, right?

Is --oversubscribe set for the jobs?

> What I want to see is 4x the node's CPUs in tasks all running at the same 
> time, not time slicing, just for jobs using this partition.  Is that a thing?

It might be a thing.  I'm not sure it is a very sensible thing.  Time
slicing and context switching are still going to take place, with each
process getting a quarter of a core on average.  It is not clear that
you will actually increase throughput this way.  I would probably first
turn on hyperthreading to deal with jobs which have intermittent
CPU-usage.
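
The "quarter of a core" figure, and the idle-job case raised in the quoted question, can be worked through with a small idealized model. This is only a sketch: the function names are made up for this example, and it ignores context-switch overhead, memory pressure and hyperthreading.

```python
def per_task_share(cores, jobs, tasks_per_job):
    """Average fraction of a core per task when all tasks are runnable at once."""
    return cores / (jobs * tasks_per_job)


def time_sliced_utilization(busy_jobs, total_jobs):
    """Fraction of wall time spent on useful work when jobs take turns in
    equal time slices and an idle job still consumes its whole slice."""
    return busy_jobs / total_jobs


def os_share_per_busy_task(cores, busy_jobs, tasks_per_job):
    """Fraction of a core per busy task when the OS scheduler hands the
    cores only to tasks that actually want to run."""
    return min(1.0, cores / (busy_jobs * tasks_per_job))


# Four 48-task jobs on a 48-core node.
print(per_task_share(48, 4, 48))           # 0.25
# One of the four jobs sits idle.
print(time_sliced_utilization(3, 4))       # 0.75
print(os_share_per_busy_task(48, 3, 48))   # one third of a core each
```

In this model the two approaches give the same average share when every job is busy. They differ only when some jobs are idle: with fixed time slices a quarter of the wall time is wasted, whereas with OS-level sharing the three busy jobs split the whole node.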

Still, since Slurm offers the possibility of oversubscription, I assume
there must be a use-case.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-20 Thread Stefan Staeglich
Hi Chris,

thank you. I overlooked this part.

But someone who is actually using an UnkillableStepProgram stated the opposite 
(that it's executed on the controller nodes). Are you aware of any change 
between Slurm releases? Maybe one of the two parts is just a leftover. Are you 
using an UnkillableStepProgram?

Thank you :)

Best,
Stefan

On Friday, 20 January 2023, 05:59:19 CET, Christopher Samuel wrote:
> On 1/19/23 5:01 am, Stefan Staeglich wrote:
> > Hi,
> 
> Hiya,
> 
> > I'm wondering where the UnkillableStepProgram is actually executed.
> > According to Mike it has to be available on every one of the compute nodes.
> > This makes sense only if it is executed there.
> 
> That's right, it's only executed on compute nodes.
> 
> > But the man page slurm.conf of 21.08.x states:
> > 
> > UnkillableStepProgram
> >        Must be executable by user SlurmUser.  The file must be
> >        accessible by the primary and backup control machines.
> > 
> > So I would expect it's executed on the controller node.
> 
> That's strange, my slurm.conf man page from a system still running 21.08
> says:
> 
> UNKILLABLE STEP PROGRAM SCRIPT
> This program can be used to take special actions to clean up
> the unkillable processes and/or notify system administrators.
> The program will be run as SlurmdUser (usually "root") on
> the compute node where UnkillableStepTimeout was triggered.
> 
> Ah, I see, there's a later "FILE AND DIRECTORY PERMISSIONS" part which
> has the text that you've found - that part's wrong! :-)
> 
> All the best,
> Chris


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223

