[slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Loris Bennett
Hi,

Has anyone already come up with a good way to identify non-MPI jobs which
request multiple cores but don't restrict themselves to a single node,
leaving cores idle on all but the first node?

I can see that this is potentially not easy, since an MPI job might still
have phases where only one core is actually being used.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Davide DelVento
At my previous job, cron jobs ran on every node measuring possibly idle
cores; the results were averaged over the duration of the job and
reported (the day after) via email to the user support team.
I believe they stopped doing so when compute became (relatively) cheap
while memory and I/O became the expensive resources.

I know this does not help you much, but perhaps it is something to think about.

On Thu, Sep 29, 2022 at 1:29 AM Loris Bennett
 wrote:
>
> Hi,
>
> Has anyone already come up with a good way to identify non-MPI jobs which
> request multiple cores but don't restrict themselves to a single node,
> leaving cores idle on all but the first node?
>
> I can see that this is potentially not easy, since an MPI job might still
> have phases where only one core is actually being used.
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Ole Holm Nielsen

Hi Loris,

On 9/29/22 09:26, Loris Bennett wrote:

Has anyone already come up with a good way to identify non-MPI jobs which
request multiple cores but don't restrict themselves to a single node,
leaving cores idle on all but the first node?

I can see that this is potentially not easy, since an MPI job might still
have phases where only one core is actually being used.


Just an idea: The "pestat -F" tool[1] will tell you if any nodes have an 
"unexpected" CPU load.  If you see the same JobID running on multiple nodes 
with a too low CPU load, that might point to a job such as you describe.
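Editorial sketch of that grouping step, hedged: assuming each flagged pestat line can be reduced to whitespace-separated fields with a single JobID in the last column (an assumption — check your pestat version's actual column layout, and note that a node may run several jobs), a small awk filter can surface JobIDs flagged on more than one node:

```shell
# Sketch, not a tested pipeline: assumes each flagged pestat line ends with
# one JobID in its last whitespace-separated field.  Adapt to the real
# "pestat -F" output format before relying on it.
flag_multinode_jobs() {
    awk '{ count[$NF]++ }
         END { for (job in count)
                   if (count[job] > 1)
                       printf "job %s flagged on %d nodes\n", job, count[job] }'
}
# Hypothetical usage: pestat -F | tail -n +2 | flag_multinode_jobs
```

A job appearing on several low-load nodes at once is exactly the pattern Loris describes, so counting JobID occurrences over the flagged lines is the whole trick.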


/Ole

[1] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Loris Bennett
Hi Davide,

That is an interesting idea.  We already do some averaging, but over the
whole of the past month.  For each user we use the output of seff to
generate two scatterplots: CPU-efficiency vs. CPU-hours and
memory-efficiency vs. GB-hours.  See

  https://www.fu-berlin.de/en/sites/high-performance-computing/Dokumentation/Statistik

However, I am mainly interested in being able to cancel some of the inefficient
jobs before they have run for too long.

Cheers,

Loris

 Davide DelVento  writes:

> At my previous job there were cron jobs running everywhere measuring
> possibly idle cores which were eventually averaged out for the
> duration of the job, and reported (the day after) via email to the
> user support team.
> I believe they stopped doing so when compute became (relatively) cheap
> at the expense of memory and I/O becoming expensive.
>
> I know, it does not help you much, but perhaps something to think about
>
> On Thu, Sep 29, 2022 at 1:29 AM Loris Bennett
>  wrote:
>>
>> Hi,
>>
>> Has anyone already come up with a good way to identify non-MPI jobs which
>> request multiple cores but don't restrict themselves to a single node,
>> leaving cores idle on all but the first node?
>>
>> I can see that this is potentially not easy, since an MPI job might still
>> have phases where only one core is actually being used.
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> Dr. Loris Bennett (Herr/Mr)
>> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Loris Bennett
Hi Ole,

Ole Holm Nielsen  writes:

> Hi Loris,
>
> On 9/29/22 09:26, Loris Bennett wrote:
>> Has anyone already come up with a good way to identify non-MPI jobs which
>> request multiple cores but don't restrict themselves to a single node,
>> leaving cores idle on all but the first node?
>> I can see that this is potentially not easy, since an MPI job might still
>> have phases where only one core is actually being used.
>
> Just an idea: The "pestat -F" tool[1] will tell you if any nodes have an
> "unexpected" CPU load.  If you see the same JobID running on multiple nodes 
> with a too low CPU load, that might point to a job such as you describe.
>
> /Ole
>
> [1] https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I do already use 'pestat -F' although this flags over 100 of our 170
nodes, so it results in a bit of information overload.  I guess it would
be nice if the sensitivity of the flagging could be tweaked on the
command line, so that only the worst nodes are shown.

I also use some wrappers around 'sueff' from

  https://github.com/ubccr/stubl

to generate part of an ASCII dashboard (an dasciiboard?), which looks
like

  Username  Mem_Request  Max_Mem_Use  CPU_Efficiency  Number_of_CPUs_In_Use
  alpha     42000M       0.03Gn       48.80%          (0.98 of 2)
  beta      10500M       11.01Gn      99.55%          (3.98 of 4)
  gamma     8000M        8.39Gn       99.64%          (63.77 of 64)
  ...
  chi       varied       3.96Gn       83.65%          (248.44 of 297)
  phi       1800M        1.01Gn       98.79%          (248.95 of 252)
  omega     16G          4.61Gn       99.69%          (127.60 of 128)

  == Above data from: Thu 29 Sep 15:26:29 CEST 2022 ==

and just loops every 30 seconds.  This is what I use to spot users with
badly configured jobs.

However, I'd really like to be able to identify non-MPI jobs on multiple
nodes automatically.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Can you check slurm for a job that requests multiple nodes but doesn't have 
mpirun (or srun, or mpiexec) running on its head node?
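A hedged sketch of this check: in practice the process listing would come from something like "ssh $headnode ps -o comm= -A" for each running multi-node job (the squeue format specifiers and the launcher list below are assumptions to adapt locally); the test itself is kept as a pure function so it is easy to verify on canned input.

```shell
# Sketch: decide whether a head node's process listing contains a recognised
# multi-node launcher.  The launcher list is an assumption -- extend it if
# your users start remote work another way (e.g. Charm++'s charmrun).
has_launcher() {
    # $1: newline-separated command names from the job's head node
    printf '%s\n' "$1" | grep -qxE 'mpirun|mpiexec|srun'
}
# Hypothetical usage, one running job per line from squeue
# (%i = JobID, %D = node count, %B = batch/head node):
#   squeue -h -t R -o '%i %D %B' | while read -r jobid nnodes headnode; do
#       [ "$nnodes" -gt 1 ] || continue
#       has_launcher "$(ssh "$headnode" ps -o comm= -A)" \
#           || echo "job $jobid: no launcher on $headnode"
#   done
```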


Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Ward Poelmans

Hi Loris,

On 29/09/2022 09:26, Loris Bennett wrote:


I can see that this is potentially not easy, since an MPI job might have
still have phases where only one core is actually being used.


Slurm will create the needed cgroups on all the nodes that are part of the job 
when the job starts. So with a cron job you could check whether there are any 
job cgroups on a node with no processes in them?
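As a rough illustration of that cron-job check: the path below assumes a cgroup/v1 freezer hierarchy under /sys/fs/cgroup/freezer/slurm with a uid_*/job_* layout — cgroup/v2 lays things out differently, so treat this as a sketch to adjust per site.

```shell
# Sketch: print Slurm job cgroups on this node whose cgroup.procs file is
# empty, i.e. job cgroups containing no processes.  The uid_*/job_* layout
# matches a typical cgroup/v1 freezer hierarchy and is an assumption.
find_empty_job_cgroups() {
    cgroot=${1:-/sys/fs/cgroup/freezer/slurm}
    for procs in "$cgroot"/uid_*/job_*/cgroup.procs; do
        [ -e "$procs" ] || continue          # glob matched nothing
        [ -s "$procs" ] || dirname "$procs"  # empty file -> no processes
    done
}
```

Run from cron on each node, any directory it prints is a job allocation holding cores there without running anything in them.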

Ward




Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Steffen Grunewald
On Thu, 2022-09-29 at 14:03:58 +, Bernstein, Noam CIV USN NRL (6393) 
Washington DC (USA) wrote:
> Can you check slurm for a job that requests multiple nodes but doesn't have 
> mpirun (or srun, or mpiexec) running on its head node?

Hi Noam,

I'm wondering why one would want to know that - given that there are
approaches to multi-node operation beyond MPI (Charm++ comes to mind)?

Best,
 Steffen

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~



Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 29, 2022, at 10:34 AM, Steffen Grunewald 
mailto:steffen.grunew...@aei.mpg.de>> wrote:

Hi Noam,

I'm wondering why one would want to know that - given that there are
approaches to multi-node operation beyond MPI (Charm++ comes to mind)?

The thread title requested a way of detecting non-MPI jobs running on multiple 
nodes.  I assumed that the requester knows, maybe based on their users' 
software, that there are no legitimate ways for them to run on multiple nodes 
without MPI. Actually, we have users that run embarrassingly parallel jobs 
which just ssh to the other nodes and gather files, so clearly it can be done 
in a useful way with very low-tech approaches, but that's an oddball (and just 
plain old) software package.


[slurm-users] How to hold a job until a feature is available?

2022-09-29 Thread Groner, Rob
I'm trying to set up a system where, when a job from a certain account is 
submitted, if no nodes are available that have a specific feature, then the job 
will be paused/held/pending and a node will be dynamically created with that 
feature.

I can now dynamically bring up the node with the feature, and it shows in the 
sinfo output as having the feature.  But I can't yet figure out how to 
intercept the job submission request and put it on hold so that I can bring up 
the node.

If I don't do anything, then the job just instantly fails because there are no 
nodes with that feature.

Could I maybe create a "dummy" node that has the feature, but no resources?  So 
the job would be set to pending for resources and would stay that way until I 
brought up a new node with the feature and with resources.

I've tried using slurm_job_submit.lua: I detected the requested feature and 
then tried to set the job to hold... but it still errored out because of 
"invalid feature specification".

Thanks for the help.

Rob



[slurm-users] Slurm version 22.05.4 is now available

2022-09-29 Thread Tim Wickberg

We are pleased to announce the availability of Slurm version 22.05.4.

This includes fixes to two potential crashes in the backfill scheduler, 
alongside a number of other moderate severity issues.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 22.05.4
==
 -- Fix return code from salloc when the job is revoked prior to executing user
command.
 -- Fix minor memory leak when dealing with gres with multiple files.
 -- Fix printing for no_consume gres in scontrol show job.
 -- sinfo - Fix truncation of very large values when outputting memory.
 -- Fix multi-node step launch failure when nodes in the controller aren't in
natural order. This can happen with inconsistent node naming (such as
node15 and node052) or with dynamic nodes which can register in any order.
 -- job_container/tmpfs - Prevent reading the plugin config multiple times per
step.
 -- Fix wrong attempt of gres binding for gres w/out cores defined.
 -- Fix build to work with '--without-shared-libslurm' configure flag.
 -- Fix power_save mode when repeatedly configuring too fast.
 -- Fix sacct -I option.
 -- Prevent jobs from being scheduled on future nodes.
 -- Fix memory leak in slurmd happening on reconfigure when CPUSpecList used.
 -- Fix sacctmgr show event [min|max]cpus.
 -- Fix regression in 22.05.0rc1 where a prolog or epilog that redirected stdout
to a file could get erroneously killed, resulting in job launch failure
(for the prolog) and the node being drained.
 -- cgroup/v1 - Make a static variable to remove potential redundant checking
for if the system has swap or not.
 -- cgroup/v1 - Add check for swap when running OOM check after task
termination.
 -- job_submit/lua - add --prefer support
 -- cgroup/v1 - fix issue where sibling steps could incorrectly be accounted as
OOM when step memory limit was the same as the job allocation. Detect OOM
events via memory.oom_control oom_kill when exposed by the kernel instead of
subscribing notifications with eventfd.
 -- Fix accounting of oom_kill events in cgroup/v2 and task/cgroup.
 -- Fix segfault when slurmd reports less than configured gres with links after
a slurmctld restart.
 -- Fix TRES counts after node is deleted using scontrol.
 -- sched/backfill - properly handle multi-reservation HetJobs.
 -- sched/backfill - don't try to start HetJobs after system state change.
 -- openapi/v0.0.38 - add submission of job->prefer value.
 -- slurmdbd - become SlurmUser at the same point in logic as slurmctld to match
plugins initialization behavior. This avoids a fatal error when starting
slurmdbd as root and root cannot start the auth or accounting_storage
plugins (for example, if root cannot read the jwt key).
 -- Fix memory leak when attempting to update a job's features with invalid
features.
 -- Fix occasional slurmctld crash or hang in backfill due to invalid pointers.
 -- Fix segfault on Cray machines if cgroup cpuset is used in cgroup/v1.




Re: [slurm-users] How to hold a job until a feature is available?

2022-09-29 Thread mercan
Why not use a specific partition (queue) instead of the specific feature? A 
queue is an object for waiting on resources, so it is ready-made for this 
purpose. When the required resources become available, the jobs will start.



Regards;


Ahmet M.




29.09.2022 22:27 tarihinde Groner, Rob yazdı:
I'm trying to set up a system where, when a job from a certain account 
is submitted, if no nodes are available that have a specific feature, 
then the job will be paused/held/pending and a node will be 
dynamically created with that feature.


I can now dynamically bring up the node with the feature, and it shows 
in the sinfo output as having the feature.  But I can't yet figure out 
how to intercept the job submission request and put it on hold so that 
I can bring up the node.


If I don't do anything, then the job just instantly fails because 
there are no nodes with that feature.


Could I maybe create a "dummy" node that has the feature, but no 
resources?  So the job would be set to pending for resources and would 
stay that way until I brought up a new node with the feature and with 
resources.


I've tried using slurm_job_submit.lua: I detected the requested 
feature and then tried to set the job to hold... but it still errored 
out because of "invalid feature specification".


Thanks for the help.

Rob





Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Loris Bennett
"Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)"
 writes:

>  On Sep 29, 2022, at 10:34 AM, Steffen Grunewald 
>  wrote:
>
>  Hi Noam,
>
>  I'm wondering why one would want to know that - given that there are
>  approaches to multi-node operation beyond MPI (Charm++ comes to mind)?
>
> The thread title requested a way of detecting non-MPI jobs running on 
> multiple nodes.  I assumed that the requester knows, maybe based on their 
> users' software, that there are no legitimate ways for them to run on 
> multiple nodes without MPI.
> Actually, we have users that run embarrassingly parallel jobs which just ssh 
> to the other nodes and gather files, so clearly it can be done in a useful 
> way with very low-tech approaches, but that's an oddball (and just plain 
> old) software package.

There may indeed be legitimate ways for non-MPI jobs to be running on
multiple nodes, but that's a bit of an edge case.  However, such cases
would be fine, as long as the resources requested are being used
efficiently.  Thus, Ward's suggestion about checking for cgroups seems
the most general solution.  Having said that, it would also be useful to
then check the head node for 'mpirun' or similar.

Cheers,

Loris