Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as
systemd-networkd, i.e. the package systemd-networkd.x86_64 253.4-1.el8
from the epel repo.
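
On such a system the installation would look roughly like this (package name as reported above; the exact dnf invocation is an assumption for illustration):

```
sudo dnf install -y epel-release        # enable the EPEL repository
sudo dnf install -y systemd-networkd    # provides systemd-networkd-wait-online
sudo systemctl enable systemd-networkd.service systemd-networkd-wait-online.service
```

Note that switching a node from NetworkManager to systemd-networkd is a larger change than just installing the package; the interface configuration has to be migrated as well.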


-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?



Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
> Actually there is no need for such a script since
> /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.

/Ole




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in 
RHEL 8 and clones, AFAICT.


/Ole




Re: [slurm-users] Sinfo options not working in SLURM 23.11

2023-10-30 Thread Davide DelVento
> I am working on SLURM 23.11 version.

???

The latest version is slurm-23.02.6; which one are you referring to?
https://github.com/SchedMD/slurm/tags

Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU

If I try to request just nodes and memory, for instance:
#SBATCH -N 2
#SBATCH --mem=0
to request all the memory on a node, and 2 nodes seem sufficient for a
program that consumes 100GB, I get this error:

sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration
is not available


thanks
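
For context, the totals implied by the directives in the script quoted below can be checked with a few lines of arithmetic. This is a sketch using the numbers from the post, not anything Slurm itself computes:

```python
# Sanity check of the memory actually requested by the original script
# (-N 5, --ntasks=60, --mem-per-cpu=1500M); numbers are from the post.
mem_per_cpu_mb = 1500
ntasks = 60
nodes = 5

total_mb = ntasks * mem_per_cpu_mb                # memory limit over the whole job
per_node_mb = (ntasks // nodes) * mem_per_cpu_mb  # with 12 tasks per node

print(total_mb)     # 90000 MB, i.e. about 88 GiB in total
print(per_node_mb)  # 18000 MB per node
```

So even spread perfectly evenly, the job's total memory limit (~88 GiB) is below the ~100 GB the program needs, which is consistent with the OUT_OF_MEMORY state; raising --mem-per-cpu (subject to the partition's MaxMemPerCPU) or requesting memory per node with --mem would lift the cap.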

On 30/10/2023 15:46, Gérard Henry (AMU) wrote:

Hello all,


I can't configure the slurm script correctly. My program needs 100GB of 
memory, it's the only criteria. But the job always fails with an out of 
memory.

Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition:
DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=3002M,node=158
for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640

my script contains:
#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse

when it fails, sacct gives the following information:

JobID         JobName   Elapsed   NCPUS   TotalCPU    CPUTime  ReqMem      MaxRSS  MaxDiskRead  MaxDiskWrite  State       ExitCode
------------  --------  --------  ------  ---------  --------  ------  ----------  -----------  ------------  ----------  --------
8500578       analyse5  00:03:04      60   02:57:58  03:04:00  9M                                             OUT_OF_ME+  0:125
8500578.bat+  batch     00:03:04      16  46:34.302  00:49:04           21465736K        0.23M         0.01M  OUT_OF_ME+  0:125
8500578.0     orted     00:03:05      44   02:11:24  02:15:40              40952K        0.42M         0.03M  COMPLETED   0:0


I don't understand why MaxRSS=21465736K (about 21 GB) leads to "out of
memory" with 16 CPUs and 1500M per CPU (24 GB)


if anybody can help?

thanks in advance



--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Site : https://fresnel.fr/
To help protect the environment, please print this email only if
necessary.




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jens Elkner
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution!  I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(

IIRC it is documented in the man page.
  
> Your solution of a wait-for-interfaces Systemd service makes a lot of sense,
> and I'm going to try it out.

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.

I.e. 'ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable'
or something like that should handle it. E.g. on my laptop the complete
/etc/systemd/system/systemd-networkd-wait-online.service looks like
this:
---schnipp---
[Unit]
Description=Wait for Network to be Configured
Documentation=man:systemd-networkd-wait-online.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target shutdown.target

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:routable -i 
wlan0:routable --any
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
---schnapp---
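
On a networkd-managed node, the same idea can be pointed at the Infiniband interface with a drop-in instead of editing the unit itself; the drop-in path and the ib0 interface name here are assumptions for illustration:

```
# /etc/systemd/system/systemd-networkd-wait-online.service.d/ib0.conf (sketch)
[Service]
# The empty ExecStart= clears the unit's original command; for oneshot
# units this is how a drop-in replaces, rather than appends to, ExecStart.
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable
```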
 
Have fun,
jel.
> Best regards,
> Ole
> 
> On 10/30/23 14:30, Max Rutkowski wrote:
> > Hi,
> > 
> > we're not using Omni-Path but also had issues with Infiniband taking too
> > long and slurmd failing to start due to that.
> > 
> > Our solution was to implement a little wait-for-interface systemd
> > service which delays the network.target until the ib interface has come
> > up.
> > 
> > Our discovery was that the network-online.target is triggered by the
> > NetworkManager as soon as the first interface is connected.
> > 
> > I've put the solution we use on my GitHub:
> > https://github.com/maxlxl/network.target_wait-for-interfaces
> > 
> > You may need to do small adjustments, but it's pretty straightforward
> > in general.
> > 
> > Kind regards
> > Max
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark,
> Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
> E-mail: ole.h.niel...@fysik.dtu.dk
> Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
> Mobile: (+45) 5180 1620
> > 
> > On 30.10.23 13:50, Ole Holm Nielsen wrote:
> > > I'm fighting this strange scenario where slurmd is started before
> > > the Infiniband/OPA network is fully up.  The Node Health Check (NHC)
> > > executed by slurmd then fails the node (as it should).  This happens
> > > only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9
> > > nodes with Infiniband/OPA network work without problems.
> > > 
> > > Question: Does anyone know how to reliably delay the start of the
> > > slurmd Systemd service until the Infiniband/OPA network is fully up?
> > > 
> > > Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not
> > > Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers
> > > since the CornelisNetworks drivers are not available for RHEL 8.8.
> > > 
> > > The details:
> > > 
> > > The slurmd service is started by the service file
> > > /usr/lib/systemd/system/slurmd.service after the
> > > "network-online.target" has been reached.
> > > 
> > > It seems that NetworkManager reports "network-online.target" BEFORE
> > > the Infiniband/OPA device ib0 is actually up, and this seems to be
> > > the cause of our problems!
> > > 
> > > Here are some important sequences of events from the syslog showing
> > > that the network goes online before the Infiniband/OPA network
> > > (hfi1_0 adapter) is up:
> > > 
> > > Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check
> > > failed: rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib: 
> > > No IB port is ACTIVE (LinkUp 100 Gb/sec).
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0:
> > > set_link_state: current GOING_UP, new INIT (LINKUP)
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical
> > > state changed to PHYS_LINKUP (0x5), phy 0x50
> > > 
> > > I tried to delay the NetworkManager "network-online.target" by
> > > setting a wait on the ib0 device and reboot, but that seems to be
> > > ignored:
> > > 
> > > $ nmcli -p connection modify "System ib0"
> > > connection.connection.wait-device-timeout 20
> > > 
> > > I'm hoping that other sites using Omni-Path have seen this and maybe
> > > can share a fix or workaround?
> > > 
> > > Of course we could remove the Infiniband check in Node Health Check
> > > (NHC), but that would not really be acceptable during operations.

[slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-10-30 Thread AMU

Hello all,


I can't configure the slurm script correctly. My program needs 100GB of 
memory, it's the only criteria. But the job always fails with an out of 
memory.

Here's the cluster configuration I'm using:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

partition:
DefMemPerCPU=5770 MaxMemPerCPU=5778
TRES=cpu=5056,mem=3002M,node=158
for each node: CPUAlloc=32 RealMemory=19 AllocMem=184640

my script contains:
#SBATCH -N 5
#SBATCH --ntasks=60
#SBATCH --mem-per-cpu=1500M
#SBATCH --cpus-per-task=1
...
mpirun ../zsimpletest_analyse

when it fails, sacct gives the following information:

JobID         JobName   Elapsed   NCPUS   TotalCPU    CPUTime  ReqMem      MaxRSS  MaxDiskRead  MaxDiskWrite  State       ExitCode
------------  --------  --------  ------  ---------  --------  ------  ----------  -----------  ------------  ----------  --------
8500578       analyse5  00:03:04      60   02:57:58  03:04:00  9M                                             OUT_OF_ME+  0:125
8500578.bat+  batch     00:03:04      16  46:34.302  00:49:04           21465736K        0.23M         0.01M  OUT_OF_ME+  0:125
8500578.0     orted     00:03:05      44   02:11:24  02:15:40              40952K        0.42M         0.03M  COMPLETED   0:0


I don't understand why MaxRSS=21465736K (about 21 GB) leads to "out of
memory" with 16 CPUs and 1500M per CPU (24 GB)


if anybody can help?

thanks in advance

--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Site : https://fresnel.fr/
To help protect the environment, please print this email only if
necessary.




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Max,

Thanks so much for your fast response with a solution!  I didn't know that 
NetworkManager (falsely) claims that the network is online as soon as the 
first interface comes up :-(


Your solution of a wait-for-interfaces Systemd service makes a lot of 
sense, and I'm going to try it out.


Best regards,
Ole

On 10/30/23 14:30, Max Rutkowski wrote:

Hi,

we're not using Omni-Path but also had issues with Infiniband taking too 
long and slurmd failing to start due to that.


Our solution was to implement a little wait-for-interface systemd service 
which delays the network.target until the ib interface has come up.


Our discovery was that the network-online.target is triggered by the 
NetworkManager as soon as the first interface is connected.


I've put the solution we use on my GitHub: 
https://github.com/maxlxl/network.target_wait-for-interfaces


You may need to do small adjustments, but it's pretty straightforward
in general.

Kind regards
Max

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620

On 30.10.23 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) 
executed by slurmd then fails the node (as it should).  This happens 
only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes 
with Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the slurmd 
Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since 
the CornelisNetworks drivers are not available for RHEL 8.8.



The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the "network-online.target" 
has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE the 
Infiniband/OPA device ib0 is actually up, and this seems to be the cause 
of our problems!


Here are some important sequences of events from the syslog showing that 
the network goes online before the Infiniband/OPA network (hfi1_0 
adapter) is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib:  No IB port 
is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: set_link_state: 
current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting a 
wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20


I'm hoping that other sites using Omni-Path have seen this and maybe can 
share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check 
(NHC), but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole


--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___

Helmholtz-Zentrum Potsdam
*Deutsches GeoForschungsZentrum GFZ*
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Max Rutkowski

Hi,

we're not using Omni-Path but also had issues with Infiniband taking too 
long and slurmd failing to start due to that.


Our solution was to implement a little wait-for-interface systemd 
service which delays the network.target until the ib interface has come up.


Our discovery was that the network-online.target is triggered by the 
NetworkManager as soon as the first interface is connected.


I've put the solution we use on my GitHub: 
https://github.com/maxlxl/network.target_wait-for-interfaces


You may need to do small adjustments, but it's pretty straightforward
in general.
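
The linked repository has the actual implementation; a minimal sketch of the idea (unit name, interface name, and the polling command are assumptions, not the repo's exact code) looks like:

```
# /etc/systemd/system/wait-for-ib0.service (sketch)
[Unit]
Description=Wait for ib0 before declaring the network online
After=NetworkManager.service
Before=network-online.target

[Service]
Type=oneshot
# Poll until the ib0 link reports UP (timeout handling omitted for brevity)
ExecStart=/bin/sh -c 'until ip link show ib0 | grep -q "state UP"; do sleep 1; done'
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
```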



Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) 
executed by slurmd then fails the node (as it should).  This happens 
only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes 
with Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the 
slurmd Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since 
the CornelisNetworks drivers are not available for RHEL 8.8.


The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the 
"network-online.target" has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE 
the Infiniband/OPA device ib0 is actually up, and this seems to be the 
cause of our problems!


Here are some important sequences of events from the syslog showing 
that the network goes online before the Infiniband/OPA network (hfi1_0 
adapter) is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib:  No IB 
port is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 
set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting 
a wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20


I'm hoping that other sites using Omni-Path have seen this and maybe 
can share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check 
(NHC), but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole


--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___

Helmholtz-Zentrum Potsdam
*Deutsches GeoForschungsZentrum GFZ*
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam



[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed 
by slurmd then fails the node (as it should).  This happens only on EL8 
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with 
Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the slurmd 
Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since the 
CornelisNetworks drivers are not available for RHEL 8.8.


The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the "network-online.target" 
has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE the 
Infiniband/OPA device ib0 is actually up, and this seems to be the cause 
of our problems!


Here are some important sequences of events from the syslog showing that 
the network goes online before the Infiniband/OPA network (hfi1_0 adapter) 
is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_ib:  No IB port 
is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: set_link_state: 
current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting a 
wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20
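
An alternative knob (an assumption on my part, not something tested in this thread) is to override what NetworkManager-wait-online itself waits for, e.g. requiring NetworkManager to report full startup with a timeout:

```
# /etc/systemd/system/NetworkManager-wait-online.service.d/override.conf (sketch)
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=60
```

nm-online -s waits until NetworkManager reports startup complete, i.e. all devices it is configured to wait for have activated, rather than returning as soon as the first interface is connected.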


I'm hoping that other sites using Omni-Path have seen this and maybe can 
share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check (NHC), 
but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark



Re: [slurm-users] Sinfo options not working in SLURM 23.11

2023-10-30 Thread Loris Bennett
Hello Deepak,

Deepak J  writes:

> Hello ,
>
>  
>
> I am working on SLURM 23.11 version.
>
> sinfo  option commands are not working properly  (-a , --all , -o , -m etc)
>
>  
>
> e.g.: sinfo is giving me the below
>
> 45637@inv456748703$ sinfo
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> FPGA*        up   infinite      1   idle FPGA01
>
> also sinfo --help gives the same result
>
> 45637@inv456748703$ sinfo ---help
>
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> FPGA*        up   infinite      1   idle FPGA01
>
> Any pointers will help.

Why do you think that the output above is wrong?

Cheers,

Loris

> Regards,
>
> DJ
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin