[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-08 Thread Henderson, Brent via slurm-users
Thanks for the suggestion, Ole - I'll see if I can get that into the mix to try 
over the next few days.

I can report that the 23.02.7 tree had the same issues, so going backwards on the 
Slurm bits did not have any impact.

Brent


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Davide DelVento via slurm-users
👍
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Container Jobs "hanging"

2024-05-08 Thread Sean Kane via slurm-users
Hello. I am new to this list and Slurm overall. I have a lot of experience in 
computer operations, including Kubernetes, but I am currently exploring Slurm 
in some depth.

I have set up a small cluster and, in general, have gotten things working. But when 
I try to run a container job, the command executes and then the job appears to 
hang, as if the container were still running.

So, running the following works, but it never returns to the prompt unless I 
use [Control-C].

$ srun --container /shared_fs/shared/oci_images/alpine uptime
 19:21:47 up 20:43,  0 users,  load average: 0.03, 0.25, 0.15

I'm unsure if something is misconfigured or if I'm misunderstanding how this 
should work, but any help and/or pointers would be greatly appreciated.
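
In case it helps narrow things down, here is roughly how I have been poking at it 
(just my own sanity checks; the extra -vv only turns up srun's client-side logging):

$ srun -vv --container /shared_fs/shared/oci_images/alpine uptime
$ squeue --me    # from a second shell, to see whether the step is still listed as running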

Thanks!
Sean

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Groner, Rob via slurm-users
FYI, I submitted a bug about this in March because the "compatible" line in the 
docs was confusing to me as well.  The change coming to the docs removes that 
altogether and simply says that setting it to OFF "disables job preemption and 
gang scheduling".  Much clearer.

And we do it the same way as Davide.

Rob


From: Davide DelVento via slurm-users 
Sent: Thursday, May 2, 2024 11:28 AM
To: Jason Simms 
Cc: Slurm User Community List 
Subject: [slurm-users] Re: Partition Preemption Configuration Question

Hi Jason,

I wanted exactly the same and was confused exactly like you. For a while it did 
not work, regardless of what I tried, but eventually (with some help) I figured 
it out.

What I have set up, and which is working fine, is this globally:

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

and then each individual partition definition has either PreemptMode=off or 
PreemptMode=cancel.

It took me a while to make it work. The problem in my case was that I did not 
include the REQUEUE line because (as I am describing) I did not want requeue, but 
without that line Slurm preemption simply would not work. Since it is overridden 
in each partition, it behaves as if it were not there, but it must be there. Very 
simple once you know it.
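
In other words, roughly this in slurm.conf (just a sketch; the partition names, 
node lists and PriorityTier values are placeholders, not our real config):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=high Nodes=... PriorityTier=2 PreemptMode=off
PartitionName=low  Nodes=... PriorityTier=1 PreemptMode=cancel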

Hope this helps

On Thu, May 2, 2024 at 9:16 AM Jason Simms via slurm-users 
<slurm-users@lists.schedmd.com> wrote:
Hello all,

The Slurm docs have me a bit confused... I want to enable job preemption on 
certain partitions but not others. I *presume* I would set 
PreemptType=preempt/partition_prio globally, but then, on the partitions where I 
don't want jobs to be preempted, I would set PreemptMode=off within the 
configuration for that specific partition.

The documentation, however, says that setting PreemptMode=off at the partition 
level "is only compatible with PreemptType=preempt/none at a global level", yet 
then immediately says that a "common use case for this parameter is to set it on 
a partition to disable preemption for that partition", which indicates preemption 
would still be allowed for other partitions.

If PreemptType is set to preempt/none globally, and I *cannot* set that as an 
option for a given partition (at least, the documentation doesn't indicate that 
is a valid parameter for a partition), wouldn't preemption be disabled globally 
anyway? The wording seems odd to me and almost contradictory.

Is it possible to have PreemptType=preempt/partition_prio set globally, yet 
also disable it on specific partitions with PreemptMode=off? Is PreemptType 
actually a valid configuration option for specific partitions?

Thanks for any guidance.

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: scrontab question

2024-05-08 Thread Cutts, Tim via slurm-users
Someone may have said this already, but did you know that you can replace 
0,5,10,15,20,25,30,35,40,45,50,55 with */5?
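
That is, the entry from the documentation example could be written as (same 
schedule, just a shorter spec):

*/5 * * * * /directory/subdirectory/crontest.sh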

Tim

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca



From: Bjørn-Helge Mevik via slurm-users 
Date: Wednesday, 8 May 2024 at 07:38
To: slurm-us...@schedmd.com 
Subject: [slurm-users] Re: scrontab question
Sandor via slurm-users  writes:

> I am working out the details of scrontab. My initial testing is giving me
> an unsolvable question

If you have an unsolvable problem, you don't have a problem, you have a
fact of life. :)

> Within scrontab editor I have the following example from the slurm
> documentation:
>
> 0,5,10,15,20,25,30,35,40,45,50,55 * * * *
> /directory/subdirectory/crontest.sh

- The command (/directory/...) should be on the same line as the time spec 
(0,5,...) - but that was perhaps just the email formatting.

- Check for any UTF-8 characters that look like ordinary ASCII, for instance a 
non-breaking space. I tend to just pipe the text through "od -a".

--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo



-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm With Podman - No child processes error

2024-05-08 Thread ARNULD via slurm-users
I have integrated Podman with Slurm as per the docs
(https://slurm.schedmd.com/containers.html#podman-scrun). When I do a test run,
"podman run hello-world" works fine, but other commands fail:


$ podman run alpine hostname
executable file `/usr/bin/hostname` not found in $PATH: No such file or
directory
srun: error: slurm1: task 0: Exited with exit code 1
-
$ podman run alpine printenv SLURM_JOB_ID
executable file `/usr/bin/printenv` not found in $PATH: No such file or
directory
srun: error: slurm1: task 0: Exited with exit code 1
scrun: error: run_command_waitpid_timeout: waitpid(67537): No child
processes
---
$ podman run alpine uptime
 11:31:28 up  5:32,  0 users,  load average: 0.00, 0.00, 0.00
scrun: error: run_command_waitpid_timeout: waitpid(68160): No child
processes
--

I built a small image from python:alpine3.19 which just prints "hello
world" and numbers from 1 to 10. Here is a run:

$ podman run -it --rm hello-python
$ podman run -it --rm hello-python
Hello, world!
Numbers from 1 to 10: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


No error with my image. I also tested Podman on another machine without Slurm: 
with its default runtime, "podman run alpine hostname" prints the hostname fine. 
So it seems to be something to do with the integration with Slurm.
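
One thing I can check on my side (a sketch; this only mounts the image's 
filesystem rather than starting a container, so it should not go through scrun at 
all, and the two paths are just the ones from the error messages):

$ podman unshare sh -c \
    'm=$(podman image mount alpine); ls -l "$m/bin/hostname" "$m/usr/bin/hostname"; podman image unmount alpine'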

What can I do to diagnose the problem?

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com