Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-13 Thread Ole Holm Nielsen

Hi Max and Ward,

I've made a variation of your scripts which waits for at least one InfiniBand
port to come up before starting services such as slurmd or NFS mounts.


I prefer Max's Systemd service, which comes before the Systemd
network-online.target.  And I like Ward's script, which checks the
Infiniband status in /sys/class/infiniband/ instead of relying on
NetworkManager being installed.


At our site there are different types of compute nodes with different 
types of NICs:


1. Mellanox Infiniband.
2. Cornelis Omni-Path behaving just like Infiniband.
3. Intel X722 Ethernet NICs presenting a "fake" iRDMA Infiniband.
4. Plain Ethernet only.

I've written some modified scripts which are available in
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/InfiniBand
and which have been tested on the 4 types of NICs listed above.

Case 3 is particularly troublesome, as reported earlier, because it's an
Ethernet port that presents an iRDMA InfiniBand interface.  My
waitforib.sh script skips NICs whose link_layer type is not equal to
InfiniBand.
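
For illustration, such a link_layer filter amounts to roughly the following
(a simplified sketch; the actual waitforib.sh in the repository may differ):

for port in /sys/class/infiniband/*/ports/*; do
    # Skip iRDMA/RoCE ports whose link_layer is Ethernet rather than InfiniBand
    if [[ "$(cat "$port/link_layer" 2>/dev/null)" != "InfiniBand" ]]; then
        continue
    fi
    if grep -q ACTIVE "$port/state"; then
        logger "InfiniBand port $port is ACTIVE"
        exit 0
    fi
done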


Comments and suggestions would be most welcome.

Best regards,
Ole


Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Max Rutkowski

Hi Ward,

On 10.11.2023 at 19:45, Ward Poelmans wrote:

I think Max's approach is the better one. We only do it for slurmd, while
his is completely general for everything that waits on the network. The
downside is probably that if you have issues with your IB network, this
will make it worse ;)
That's why we have a limit in there which allows the boot to complete even
without the network coming up, in case we need to log in and check the
server. The script only delays the boot until a timeout is reached. And yes,
we used a more general approach since our issue was actually the network not
coming up fast enough for our NFS mounts, which are also used by slurmd at
our site.


Ward




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ward Poelmans

Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:

On 11/5/23 21:32, Ward Poelmans wrote:

Yes, it's very similar. I've put our systemd unit file also online on 
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11


This might disturb the logic in waitforib.sh, or at least cause some confusion?


I had never heard of these cards. But if they behave like infiniband cards, is 
there also an .../ports/1/state file present in /sys with the state? In that 
case it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only 
look at devices starting with mlx. I have no clue how much diversity is out 
there, we only have Mellanox cards (or rebrands of those).
 

IMHO, this seems quite confusing.


Yes, I agree.
 

Regarding the slurmd service:
 

An alternative to this extra service would be like Max's service file 
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service
 which has:
Before=network-online.target

What do you think of these considerations?


I think Max's approach is the better one. We only do it for slurmd, while his
is completely general for everything that waits on the network. The downside is
probably that if you have issues with your IB network, this will make it worse ;)

Ward




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ole Holm Nielsen

Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on 
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11


This looks really good!  However, I was testing the waitforib.sh script on 
a SuperMicro server WITHOUT Infiniband and only a dual-port Ethernet NIC 
(Intel Corporation Ethernet Connection X722 for 10GBASE-T).


The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that 
the Ethernet ports are also Infiniband ports:


# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> 
../../devices/pci:5d/:5d:02.0/:5e:00.0/:5f:03.0/:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> 
../../devices/pci:5d/:5d:02.0/:5e:00.0/:5f:03.0/:60:00.1/infiniband/irdma1


This might disturb the logic in waitforib.sh, or at least cause some 
confusion?


One advantage of Max's script using NetworkManager is that nmcli isn't 
fooled by the fake irdma Infiniband device:


# nmcli connection show
NAME  UUID  TYPE  DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --

I found a discussion of the mysterious irdma device in
https://github.com/prometheus/node_exporter/issues/2769
with this explanation:


The irdma module is Intel's replacement for the legacy i40iw module, which was the 
iWARP driver for the Intel X722. The irdma module is a complete rewrite, which 
landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP 
& RoCE).


The Infiniband commands also work on the fake device, claiming that it 
runs 100 Gbit/s:


# ibstatus
Infiniband device 'irdma0' port 1 status:
        default gid:     3cec:ef38:d960:::::
        base lid:        0x1
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet

Infiniband device 'irdma1' port 1 status:
        default gid:     3cec:ef38:d961:::::
        base lid:        0x1
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            100 Gb/sec (4X EDR)
        link_layer:      Ethernet

IMHO, this seems quite confusing.

Regarding the slurmd service:


And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service


An alternative to this extra service would be like Max's service file 
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service 
which has:

Before=network-online.target
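
For context, a unit hooked in that way is typically a oneshot service that is
both ordered before and wanted by network-online.target, roughly like this
(a sketch modelled on the usual wait-online pattern; Max's actual file at the
link above may differ, and the script path is an assumption):

[Unit]
Description=Wait for the InfiniBand interface to come up
# Ordering relative to NetworkManager is an assumption for this sketch
After=NetworkManager.service
Before=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/wait-for-interfaces.sh
RemainAfterExit=yes

[Install]
WantedBy=network-online.target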

What do you think of these considerations?

Best regards,
Ole



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-05 Thread Ward Poelmans

Hi Ole,

Yes, it's very similar. I've put our systemd unit file also online on 
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf

[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service


So far this has worked flawlessly.
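
For readers who don't want to follow the link: such a waitforib.service is
essentially a oneshot unit roughly along these lines (a sketch, not
necessarily identical to the gist; the script path is an assumption):

[Unit]
Description=Wait for an active InfiniBand port

[Service]
Type=oneshot
# Assumed install location of the wait script shown further down in this thread
ExecStart=/usr/local/sbin/waitforib.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target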


Ward





Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-02 Thread Ole Holm Nielsen

Hi Ward,

Thanks a lot for the feedback!  The method of probing 
/sys/class/infiniband/*/ports/*/state is also used in the NHC script 
lbnl_hw.nhc and has the advantage of not depending on the nmcli command 
from the NetworkManager package.


Can I ask you how you implement your script as a service in the Systemd 
booting process, perhaps similar to Max's solution in 
https://github.com/maxlxl/network.target_wait-for-interfaces ?


Thanks,
Ole





Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ward Poelmans

Hi,

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE

if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE "$port"; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1


Ward




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen

Hi Rémi,

Thanks for the feedback!  The patch revert[1] explains SchedMD's reason:


The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but can manually ping/access the node end up confused. That reason
should only be set if the node is truly not responding, but not if the
HealthCheckProgram execution failed or returned a non-zero exit code. For
that case, the program itself would take the appropriate actions, such
as draining the node and setting an appropriate Reason.


We speculate that there may be an issue with slurmd starting up at boot time
and starting new jobs while NHC is still running in a separate thread,
possibly failing the node AFTER a job has started!  NHC might fail, for
example, if an Infiniband/OPA network or NVIDIA GPUs have not yet started up
completely.


I still need to verify whether this observation is correct and 
reproducible.  Does anyone have evidence that jobs start before NHC is 
complete when slurmd starts up?


IMHO, slurmd ought to start up without delay at boot time, then execute 
the NHC and wait for it to complete.  Only after NHC has succeeded without 
errors should slurmd begin accepting new jobs.
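
One way to get a similar effect today, without patching Slurm, might be to
delay the slurmd start itself until NHC passes, via an ExecStartPre= drop-in
(an untested sketch; the drop-in name, NHC path and timeout are assumptions):

# /etc/systemd/system/slurmd.service.d/nhc.conf (illustrative sketch)
[Service]
# Run the health check before slurmd starts; a non-zero exit keeps slurmd down
ExecStartPre=/usr/sbin/nhc
# Allow extra time for slow hardware (IB/OPA, GPUs) to come up
TimeoutStartSec=300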


We should configure NHC to make site-specific hardware and network checks, 
for example for Infiniband/OPA network or NVIDIA GPUs.


Best regards,
Ole

On 11/1/23 09:44, Rémi Palancher wrote:

Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:

I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
by slurmd then fails the node (as it should).  This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd
Systemd service until the Infiniband/OPA network is fully up?

…


FWIW, after a while struggling with systemd dependencies to wait for
availability of networks and shared filesystems, we ended up with a
customer writing a patch in Slurm to delay slurmd registration (and jobs
start) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1]
for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal
and should be taken with caution, but it has worked for years for this customer.

[1]
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528



--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Rémi Palancher
Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up.  The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should).  This happens only on EL8
> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
> Infiniband/OPA network work without problems.
> 
> Question: Does anyone know how to reliably delay the start of the slurmd
> Systemd service until the Infiniband/OPA network is fully up?
> 
> …

FWIW, after a while struggling with systemd dependencies to wait for 
availability of networks and shared filesystems, we ended up with a 
customer writing a patch in Slurm to delay slurmd registration (and jobs 
start) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged in Slurm and then reverted[1] 
for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal
and should be taken with caution, but it has worked for years for this customer.

[1] 
https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

-- 
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Jens Elkner
On Tue, Oct 31, 2023 at 10:59:56AM +0100, Ole Holm Nielsen wrote:
Hi Ole,
  
TL;DR: systemd-networkd stuff below, only.

> On 10/30/23 20:15, Jeffrey R. Lang wrote:
> > The service is available in RHEL 8 via the EPEL package repository as
> > systemd-networkd, i.e. systemd-networkd.x86_64  253.4-1.el8  epel
> 
> Thanks for the info.  We can install the systemd-networkd RPM from the EPEL
> repo as you suggest.

Strange that it is not installed by default. We use Ubuntu only; the
first LTS which includes it is Xenial (16.04), released in April 2016.
Anyway, we have never installed any NetworkManager stuff (too inflexible,
unreliable, buggy - last evaluated ~5 years ago and ditched forever), even
before 16.04, and on desktops I ditched it as well (IMHO just overhead).

> I tried to understand the properties of systemd-networkd before implementing
> it in our compute nodes.  While there are lots of networkd man-pages, it's
> harder to find an overview of the actual properties of networkd.  This is
> what I found:

Basically you just need a *.netdev and a *.network file in
/etc/systemd/network/ for each interface.  Optionally symlink /etc/resolv.conf
to /run/systemd/resolve/resolv.conf.  If you want to rename your
interface[s] (e.g. we use ${hostname}${ifidx}) and the parameter
'net.ifnames=0' gets passed to the kernel, you can use a *.link file to
accomplish this. That's it. See example 1 below.

Some distros have obscure bloatware to manage them (e.g. Ubuntu installs
'netplan.io' by default, i.e. another layer of indirection), but we ditch
those packages immediately and manage them "manually" as needed.
 
> * Comparing systemd-networkd and NetworkManager:
> https://fedoracloud.readthedocs.io/en/latest/networkd.html

Pretty good - shows all you probably need. Actually within containers we
have just /etc/systemd/network/40-${hostname}0.network, because the
lxc.net.* config already describes what *.link and *.netdev would do.
See example 2.
  
...
> While networkd seems to be really nifty, I hesitate to replace

Does/can do all we need w/o a lot of overhead.

> NetworkManager by networkd on our EL8 and EL9 systems because this is an
> unsupported and only lightly tested setup,

We have used it for ~5 years on all machines, ~7 years on most of them;
multihomed, containers, simple and complex setups (i.e. a lot of NICs, VLANs)
w/o any problems ...

> and it may require additional
> work to keep our systems up-to-date in the future.

I doubt that. The /etc/systemd/network/*.{link,netdev,network} interface
seems to be pretty stable. I haven't seen/noticed any stuff that got
removed so far.

> It seems to me that Max Rutkowski's solution in
> https://github.com/maxlxl/network.target_wait-for-interfaces is less
> intrusive than converting to systemd-networkd.

Depends on your setup/environment. But I guess sooner or later you will need
to get in touch with it anyway. So here are some examples:

Example 1:
--
# /etc/systemd/network/10-mb0.link
# we rename usually eth0, the 1st NIC on the motherboard to mb0 using
# its PCI Address to identify it
[Match]
Path=pci-:00:19.0

[Link]
Name=mb0 
MACAddressPolicy=persistent


# /etc/systemd/network/25-phys-2-vlans+vnics.network
[Match]
Name=mb0

[Link]
ARP=false

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=true
MACVLAN=node1_0
#VLAN=vlan2
#VLAN=vlan3


# /etc/systemd/network/40-node1_0.netdev
[NetDev]
Name=node1_0
Kind=macvlan
# Optional: we use fix mac addr on vnics
MACAddress=00:01:02:03:04:00

[MACVLAN]
Mode=bridge


# /etc/systemd/network/40-node1_0.network
[Match]
Name=node1_0

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=no
Address=10.11.12.13/24
Gateway=10.11.12.200
# stuff which gets copied to /run/systemd/resolve/resolv.conf, when ready
Domains=my.do.main an.other.do.main
DNS=10.11.12.100 10.11.12.101

 
Example 2 (LXC):

# /zones/n00-00/config
...
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.flags = up
lxc.net.0.link = mb0
lxc.net.0.name = n00-00_0
lxc.net.0.hwaddr = 00:01:02:03:04:01
...


# /zones/n00-00/rootfs/etc/systemd/network/40-n00-00_0.network
[Match]
Name=n00-00_0

[Network]
LLMNR=false
LLDP=no
LinkLocalAddressing=no
IPv6AcceptRouterAdvertisements=no
Address=10.12.11.0/16
Gateway=10.12.11.2
Domains=gpu.do.main


Have fun,
jel.

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Ole Holm Nielsen

Hi Jeffrey,

On 10/30/23 20:15, Jeffrey R. Lang wrote:

The service is available in RHEL 8 via the EPEL package repository as
systemd-networkd, i.e. systemd-networkd.x86_64  253.4-1.el8  epel


Thanks for the info.  We can install the systemd-networkd RPM from the 
EPEL repo as you suggest.


I tried to understand the properties of systemd-networkd before 
implementing it in our compute nodes.  While there are lots of networkd 
man-pages, it's harder to find an overview of the actual properties of 
networkd.  This is what I found:


* Networkd is a service included in recent versions of Systemd.  It seems 
to be an alternative to NetworkManager.


* Red Hat has stated that systemd-networkd is NOT going to be implemented 
in RHEL 8 or 9.


* Comparing systemd-networkd and NetworkManager: 
https://fedoracloud.readthedocs.io/en/latest/networkd.html


* Networkd is described in the Wikipedia article 
https://en.wikipedia.org/wiki/Systemd


While networkd seems to be really nifty, I hesitate to replace 
NetworkManager by networkd on our EL8 and EL9 systems because this is an 
unsupported and only lightly tested setup, and it may require additional 
work to keep our systems up-to-date in the future.


It seems to me that Max Rutkowski's solution in 
https://github.com/maxlxl/network.target_wait-for-interfaces is less 
intrusive than converting to systemd-networkd.


Best regards,
Ole



-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?



Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jeffrey R. Lang
The service is available in RHEL 8 via the EPEL package repository as
systemd-networkd, i.e. systemd-networkd.x86_64  253.4-1.el8  epel


-Original Message-
From: slurm-users  On Behalf Of Ole Holm 
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until 
Infiniband/OPA network is fully up?



Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
> Actually there is no need for such a script since
> /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones, AFAICT.

/Ole




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.


It seems that systemd-networkd exists in Fedora FC38 Linux, but not in 
RHEL 8 and clones, AFAICT.


/Ole




Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jens Elkner
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution!  I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(

IIRC it is documented in the man page.
  
> Your solution of a wait-for-interfaces Systemd service makes a lot of sense,
> and I'm going to try it out.

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.

I.e. 'ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable'
or something like that should handle it. E.g. on my laptop the complete
/etc/systemd/system/systemd-networkd-wait-online.service looks like
this:
---schnipp---
[Unit]
Description=Wait for Network to be Configured
Documentation=man:systemd-networkd-wait-online.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target shutdown.target

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:routable -i wlan0:routable --any
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
---schnapp---
 
Have fun,
jel.

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen

Hi Max,

Thanks so much for your fast response with a solution!  I didn't know that 
NetworkManager (falsely) claims that the network is online as soon as the 
first interface comes up :-(


Your solution of a wait-for-interfaces Systemd service makes a lot of 
sense, and I'm going to try it out.


Best regards,
Ole





Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Max Rutkowski

Hi,

we're not using Omni-Path but also had issues with Infiniband taking too 
long and slurmd failing to start due to that.


Our solution was to implement a little wait-for-interface systemd 
service which delays the network.target until the ib interface has come up.


Our discovery was that the network-online.target is triggered by the 
NetworkManager as soon as the first interface is connected.


I've put the solution we use on my GitHub: 
https://github.com/maxlxl/network.target_wait-for-interfaces
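
In essence the helper script just polls NetworkManager until the interface is
connected, roughly like this (a simplified sketch with a hypothetical interface
name and timeout; the real script in the repository handles more than this):

# Wait up to 300 seconds for the ib0 device to reach the "connected" state
for (( i = 0; i < 300; i++ )); do
    if nmcli -t -f DEVICE,STATE device status | grep -q '^ib0:connected$'; then
        logger "ib0 is connected"
        exit 0
    fi
    sleep 1
done
# Give up after the timeout so the boot can still complete
logger "Timed out waiting for ib0"
exit 0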


You may need to make small adjustments, but it's pretty straightforward
in general.



Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the 
Infiniband/OPA network is fully up.  The Node Health Check (NHC) 
executed by slurmd then fails the node (as it should).  This happens 
only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes 
with Infiniband/OPA network work without problems.


Question: Does anyone know how to reliably delay the start of the 
slurmd Systemd service until the Infiniband/OPA network is fully up?


Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not 
Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers since 
the CornelisNetworks drivers are not available for RHEL 8.8.


The details:

The slurmd service is started by the service file 
/usr/lib/systemd/system/slurmd.service after the 
"network-online.target" has been reached.


It seems that NetworkManager reports "network-online.target" BEFORE 
the Infiniband/OPA device ib0 is actually up, and this seems to be the 
cause of our problems!


Here are some important sequences of events from the syslog showing 
that the network goes online before the Infiniband/OPA network (hfi1_0 
adapter) is up:


Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: 
rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib:  No IB 
port is ACTIVE (LinkUp 100 Gb/sec).

(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 
set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical state 
changed to PHYS_LINKUP (0x5), phy 0x50


I tried to delay the NetworkManager "network-online.target" by setting 
a wait on the ib0 device and reboot, but that seems to be ignored:


$ nmcli -p connection modify "System ib0" 
connection.connection.wait-device-timeout 20


I'm hoping that other sites using Omni-Path have seen this and maybe 
can share a fix or workaround?


Of course we could remove the Infiniband check in Node Health Check 
(NHC), but that would not really be acceptable during operations.


Thanks for sharing any insights,
Ole


--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___

Helmholtz-Zentrum Potsdam
*Deutsches GeoForschungsZentrum GFZ*
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam
