Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8epel -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: Monday, October 30, 2023 1:56 PM

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
Hi Jens, Thanks for your feedback: On 30-10-2023 15:52, Jens Elkner wrote: Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it. It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones,

Re: [slurm-users] Sinfo options not working in SLURM 23.11

2023-10-30 Thread Davide DelVento
> > I am working on SLURM 23.11 version. > ??? Latest version is slurm-23.02.6 which one are you referring to? https://github.com/SchedMD/slurm/tags >

Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU
If I try to request just nodes and memory, for instance: #SBATCH -N 2 #SBATCH --mem=0 to request all memory on a node, and 2 nodes seem sufficient for a program that consumes 100GB, I get this error: sbatch: error: CPU count per node can not be satisfied sbatch: error: Batch job submission

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jens Elkner
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote: Hi Max & friends, ... > Thanks so much for your fast response with a solution! I didn't know that > NetworkManager (falsely) claims that the network is online as soon as the > first interface comes up :-( IIRC it is documented in

[slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU
Hello all, I can't configure the slurm script correctly. My program needs 100GB of memory; it's the only criterion. But the job always fails with an out-of-memory error. Here's the cluster configuration I'm using: SelectType=select/cons_res SelectTypeParameters=CR_Core_Memory partition:

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
Hi Max, Thanks so much for your fast response with a solution! I didn't know that NetworkManager (falsely) claims that the network is online as soon as the first interface comes up :-( Your solution of a wait-for-interfaces Systemd service makes a lot of sense, and I'm going to try it out.

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Max Rutkowski
Hi, we're not using Omni-Path but also had issues with Infiniband taking too long and slurmd failing to start due to that. Our solution was to implement a little wait-for-interface systemd service which delays the network.target until the ib interface has come up. Our discovery was that
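The wait-for-interface idea described in this message could be sketched roughly as follows. This is not Max's actual service, which is not shown in the archive snippet; it is a minimal illustration that polls the kernel's operstate file for an interface until it reads "up" or a timeout expires. The function name wait_for_iface, the SYSFS_NET override and the interface name ib0 are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): block until the given
# network interface reports operstate "up", or give up after a timeout.
# SYSFS_NET is an illustrative override so the function can be tried
# against a fake directory; it defaults to the real /sys/class/net.
wait_for_iface() {
    iface=$1
    timeout=${2:-60}                      # seconds to wait, default 60
    sysfs=${SYSFS_NET:-/sys/class/net}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # operstate is missing until the driver has registered the device,
        # so read errors are treated the same as "not up yet".
        state=$(cat "$sysfs/$iface/operstate" 2>/dev/null)
        if [ "$state" = "up" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1                              # interface never came up in time
}
```

A oneshot systemd unit ordered before the network targets could then call something like `wait_for_iface ib0 120 || exit 1`, with ib0 and the 120-second limit adjusted to the site's hardware.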

[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
I'm fighting this strange scenario where slurmd is started before the Infiniband/OPA network is fully up. The Node Health Check (NHC) executed by slurmd then fails the node (as it should). This happens only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with Infiniband/OPA

Re: [slurm-users] Sinfo options not working in SLURM 23.11

2023-10-30 Thread Loris Bennett
Hello Deepak, Deepak J writes: > Hello , > > > > I am working on SLURM 23.11 version. > > sinfo option commands are not working properly (-a , --all , -o , -m etc) > > > > e.g : sinfo is giving me below > > 45637@inv456748703$sinfo >