Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Max and Ward,

I've made a variation of your scripts which waits for at least one InfiniBand port to come up before starting services such as slurmd or NFS mounts. I prefer Max's systemd service, which runs before the systemd network-online.target, and I like Ward's script, which checks the InfiniBand status in /sys/class/infiniband/ instead of relying on NetworkManager being installed.

At our site there are different types of compute nodes with different types of NICs:

1. Mellanox InfiniBand.
2. Cornelis Omni-Path, behaving just like InfiniBand.
3. Intel X722 Ethernet NICs presenting a "fake" iRDMA InfiniBand device.
4. Plain Ethernet only.

I've written some modified scripts which are available in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/InfiniBand and which have been tested on the 4 types of NICs listed above. Case 3 is particularly troublesome, as reported earlier, because it's an Ethernet port which presents an iRDMA InfiniBand interface. My waitforib.sh script skips NICs whose link_layer type is not equal to InfiniBand (sketched below).

Comments and suggestions would be most welcome.

Best regards,
Ole

On 11/10/23 19:45, Ward Poelmans wrote:

Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This might disturb the logic in waitforib.sh, or at least cause some confusion?

I had never heard of these cards. But if they behave like infiniband cards, is there also an .../ports/1/state file present in /sys with the state? In that case it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only look at devices starting with mlx. I have no clue how much diversity is out there, we only have Mellanox cards (or rebrands of those).

IMHO, this seems quite confusing.

Yes, I agree.

Regarding the slurmd service: An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:

Before=network-online.target

What do you think of these considerations?

I think Max's approach is the better one. We only do it for slurmd while his is completely general for everything that waits on the network. The downside is probably that if you have issues with your IB network, this will make it worse ;)

Ward
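For reference, the link_layer filter could look roughly like this in bash (an illustrative sketch only; the authoritative waitforib.sh is in the GitHub repo above):

# Sketch: only consider RDMA ports whose link_layer is "InfiniBand",
# so that iRDMA ports (which report link_layer "Ethernet") are ignored.
for port in /sys/class/infiniband/*/ports/*; do
    [[ -r "$port/link_layer" ]] || continue
    grep -q "^InfiniBand" "$port/link_layer" || continue   # skip iRDMA/Ethernet ports
    if grep -q ACTIVE "$port/state"; then
        logger "InfiniBand port $port is ACTIVE"
        exit 0
    fi
done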
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ward,

On 10.11.2023 at 19:45, Ward Poelmans wrote:

Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This might disturb the logic in waitforib.sh, or at least cause some confusion?

I had never heard of these cards. But if they behave like infiniband cards, is there also an .../ports/1/state file present in /sys with the state? In that case it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only look at devices starting with mlx. I have no clue how much diversity is out there, we only have Mellanox cards (or rebrands of those).

IMHO, this seems quite confusing.

Yes, I agree.

Regarding the slurmd service: An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:

Before=network-online.target

What do you think of these considerations?

I think Max's approach is the better one. We only do it for slurmd while his is completely general for everything that waits on the network. The downside is probably that if you have issues with your IB network, this will make it worse ;)

That's why we do have a timeout in there which allows the boot to complete even if the network never comes up, in case we need to log in and check the server. The script only delays things until the timeout is reached.

And yes, we used a more general approach since our issue actually was the network not coming up fast enough for our NFS mounts, which are also used by slurmd at our site.

Ward
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ole,

On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This might disturb the logic in waitforib.sh, or at least cause some confusion?

I had never heard of these cards. But if they behave like infiniband cards, is there also an .../ports/1/state file present in /sys with the state? In that case it should work just as well.

We could also change the glob '/sys/class/infiniband/*/ports/*/state' to only look at devices starting with mlx. I have no clue how much diversity is out there, we only have Mellanox cards (or rebrands of those).

IMHO, this seems quite confusing.

Yes, I agree.

Regarding the slurmd service: An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:

Before=network-online.target

What do you think of these considerations?

I think Max's approach is the better one. We only do it for slurmd while his is completely general for everything that waits on the network. The downside is probably that if you have issues with your IB network, this will make it worse ;)

Ward
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ward,

On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

This looks really good! However, I was testing the waitforib.sh script on a SuperMicro server WITHOUT InfiniBand and only a dual-port Ethernet NIC (Intel Corporation Ethernet Connection X722 for 10GBASE-T). The EL8 drivers in kernel 4.18.0-477.27.2.el8_8.x86_64 seem to think that the Ethernet ports are also InfiniBand ports:

# ls -l /sys/class/infiniband
total 0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma0 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.0/infiniband/irdma0
lrwxrwxrwx 1 root root 0 Nov 10 14:31 irdma1 -> ../../devices/pci0000:5d/0000:5d:02.0/0000:5e:00.0/0000:5f:03.0/0000:60:00.1/infiniband/irdma1

This might disturb the logic in waitforib.sh, or at least cause some confusion?

One advantage of Max's script using NetworkManager is that nmcli isn't fooled by the fake irdma InfiniBand device:

# nmcli connection show
NAME  UUID                                  TYPE      DEVICE
eno1  cb0937f8-1902-48f7-8139-37cf0c4077b2  ethernet  eno1
eno2  98130354-9215-412e-ab26-032c76c2dbe4  ethernet  --

I found a discussion of the mysterious irdma device in https://github.com/prometheus/node_exporter/issues/2769 with this explanation:

The irdma module is Intel's replacement for the legacy i40iw module, which was the iWARP driver for the Intel X722. The irdma module is a complete rewrite, which landed in mainline kernel 5.14, and which also now supports the Intel E810 (iWARP & RoCE).

The InfiniBand commands also work on the fake device, claiming that it runs 100 Gbit/s:

# ibstatus
Infiniband device 'irdma0' port 1 status:
        default gid:   3cec:ef38:d960:::::
        base lid:      0x1
        sm lid:        0x0
        state:         4: ACTIVE
        phys state:    5: LinkUp
        rate:          100 Gb/sec (4X EDR)
        link_layer:    Ethernet

Infiniband device 'irdma1' port 1 status:
        default gid:   3cec:ef38:d961:::::
        base lid:      0x1
        sm lid:        0x0
        state:         1: DOWN
        phys state:    3: Disabled
        rate:          100 Gb/sec (4X EDR)
        link_layer:    Ethernet

IMHO, this seems quite confusing.

Regarding the slurmd service:

And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf
[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service

An alternative to this extra service would be like Max's service file https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service which has:

Before=network-online.target

What do you think of these considerations?

Best regards,
Ole

On 2/11/2023 09:28, Ole Holm Nielsen wrote:

Hi Ward,

Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.

Can I ask you how you implement your script as a service in the systemd booting process, perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?

Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE $port; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1
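Ward's actual unit file is in the gist linked above; as a rough sketch only (the install path, targets, and unit layout here are my assumptions, not the gist's contents), a oneshot unit wrapping the script could look like:

# waitforib.service (sketch)
[Unit]
Description=Wait for an active InfiniBand/OPA port
# run after basic networking so the IB driver has had a chance to load
After=network.target

[Service]
Type=oneshot
# hypothetical install path for the script above
ExecStart=/usr/local/sbin/waitforib.sh
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

With the slurmd drop-in declaring After=waitforib.service and Wants=waitforib.service, slurmd then only starts once the script has exited.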
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ole,

Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11

And we add it as a dependency for slurmd:

$ cat /etc/systemd/system/slurmd.service.d/wait.conf
[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
LimitMEMLOCK=infinity

[Unit]
After=waitforib.service
Requires=munge.service
Wants=waitforib.service

So far this has worked flawlessly.

Ward

On 2/11/2023 09:28, Ole Holm Nielsen wrote:

Hi Ward,

Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.

Can I ask you how you implement your script as a service in the systemd booting process, perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?

Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE $port; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1
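Assuming the drop-in shown above, activating it would be along these lines (standard systemctl commands; the file path is the one Ward shows):

# install the drop-in, then make systemd re-read the unit files
mkdir -p /etc/systemd/system/slurmd.service.d
cp wait.conf /etc/systemd/system/slurmd.service.d/wait.conf
systemctl daemon-reload
systemctl restart slurmd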
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ward,

Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package.

Can I ask you how you implement your script as a service in the systemd booting process, perhaps similar to Max's solution in https://github.com/maxlxl/network.target_wait-for-interfaces ?

Thanks,
Ole

On 11/1/23 20:09, Ward Poelmans wrote:

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE $port; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi,

We have a slightly different script to do the same. It only relies on /sys:

# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi

ports=$(ls /sys/class/infiniband/*/ports/*/state)

for (( count = 0; count < 300; count++ ))
do
    for port in ${ports}; do
        if grep -qc ACTIVE $port; then
            logger "Infiniband online at $port"
            exit 0
        fi
    done
    sleep 1
done

logger "Failed to find an active infiniband interface"
exit 1

Ward
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Rémi,

Thanks for the feedback! The patch revert[1] explains SchedMD's reason:

The reasoning is that sysadmins who see nodes with Reason "Not Responding" but can manually ping/access the node end up confused. That reason should only be set if the node is truly not responding, but not if the HealthCheckProgram execution failed or returned a non-zero exit code. For that case, the program itself would take the appropriate actions, such as draining the node and setting an appropriate Reason.

We suspect there may be an issue with slurmd starting up at boot time and starting new jobs while NHC is running in a separate thread, so that NHC possibly fails the node AFTER a job has started! NHC might fail, for example, if the Infiniband/OPA network or NVIDIA GPUs have not yet started up completely. I still need to verify whether this observation is correct and reproducible. Does anyone have evidence that jobs start before NHC is complete when slurmd starts up?

IMHO, slurmd ought to start up without delay at boot time, then execute the NHC and wait for it to complete. Only after NHC has succeeded without errors should slurmd begin accepting new jobs. We should configure NHC to make site-specific hardware and network checks, for example for the Infiniband/OPA network or NVIDIA GPUs; a sketch of such a check follows below this message.

Best regards,
Ole

On 11/1/23 09:44, Rémi Palancher wrote:

Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
I'm fighting this strange scenario where slurmd is started before the Infiniband/OPA network is fully up. The Node Health Check (NHC) executed by slurmd then fails the node (as it should). This happens only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd Systemd service until the Infiniband/OPA network is fully up?

…

FWIW, after a while struggling with systemd dependencies to wait for availability of networks and shared filesystems, we ended up with a customer writing a patch for Slurm to delay slurmd registration (and job starts) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged into Slurm and then reverted[1] for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal and should be taken with caution, but it has worked for years for this customer.

[1] https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
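For reference, a site-specific IB check in NHC is a one-line rule in its config; a sketch under these assumptions: the 100 Gb/sec rate argument matches the OPA fabric and the check_hw_ib error message quoted later in this thread, and the slurm.conf lines show the standard way slurmd is told to run NHC (exact paths may differ per site):

# /etc/nhc/nhc.conf (sketch): require an ACTIVE 100 Gb/sec IB/OPA port on all nodes
 * || check_hw_ib 100

# slurm.conf (sketch): have slurmd run NHC periodically
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300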
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Ole,

On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should). This happens only on EL8
> Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
> Infiniband/OPA network work without problems.
>
> Question: Does anyone know how to reliably delay the start of the slurmd
> Systemd service until the Infiniband/OPA network is fully up?
>
> …

FWIW, after a while struggling with systemd dependencies to wait for availability of networks and shared filesystems, we ended up with a customer writing a patch for Slurm to delay slurmd registration (and job starts) until NHC is OK:

https://github.com/scibian/slurm-wlm/blob/scibian/buster/debian/patches/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

For the record, this patch was once merged into Slurm and then reverted[1] for reasons I did not fully explore.

This approach is far from your original idea; it is clearly not ideal and should be taken with caution, but it has worked for years for this customer.

[1] https://github.com/SchedMD/slurm/commit/b31fa177c1ca26dcd2d5cd952e692ef87d95b528

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io/
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
On Tue, Oct 31, 2023 at 10:59:56AM +0100, Ole Holm Nielsen wrote:

Hi Ole,

TL;DR: systemd-networkd stuff below, only.

> On 10/30/23 20:15, Jeffrey R. Lang wrote:
> > The service is available in RHEL 8 via the EPEL package repository as
> > systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 (epel)
>
> Thanks for the info. We can install the systemd-networkd RPM from the EPEL
> repo as you suggest.

Strange that it is not installed by default. We use Ubuntu only; the first LTS which includes it is Xenial (16.04), released in April 2016. Anyway, we have never installed any NetworkManager stuff (too inflexible, unreliable, buggy; last evaluated ~5 years ago and ditched forever), and even before 16.04 I ditched it on desktops as well (IMHO just overhead).

> I tried to understand the properties of systemd-networkd before implementing
> it in our compute nodes. While there are lots of networkd man-pages, it's
> harder to find an overview of the actual properties of networkd. This is
> what I found:

Basically you just need a *.netdev and a *.network file in /etc/systemd/network/ for each interface. Optionally symlink /etc/resolv.conf to /run/systemd/resolve/resolv.conf. If you want to rename your interface[s] (e.g. we use ${hostname}${ifidx}), and the parameter 'net.ifnames=0' gets passed to the kernel, you can use a *.link file to accomplish this. That's it. See example 1 below.

Some distros have obscure bloatware to manage them (e.g. Ubuntu installs 'netplan.io', aka another layer of indirection, by default), but we ditch those packages immediately and manage the files "manually" as needed.

> * Comparing systemd-networkd and NetworkManager:
> https://fedoracloud.readthedocs.io/en/latest/networkd.html

Pretty good - shows all you probably need. Actually, within containers we have just /etc/systemd/network/40-${hostname}0.network, because the lxc.net.* config already describes what *.link and *.netdev would do. See example 2.

...

> While networkd seems to be really nifty, I hesitate to replace

Does/can do all we need w/o a lot of overhead.

> NetworkManager by networkd on our EL8 and EL9 systems because this is an
> unsupported and only lightly tested setup,

We have used it for ~5 years on all machines, ~7 years on most of our machines; multihomed, containers, simple and complex setups (i.e. a lot of NICs, VLANs), w/o any problems ...

> and it may require additional
> work to keep our systems up-to-date in the future.

I doubt that. The /etc/systemd/network/*.{link,netdev,network} interface seems to be pretty stable. Haven't seen/noticed anything that got removed so far.

> It seems to me that Max Rutkowski's solution in
> https://github.com/maxlxl/network.target_wait-for-interfaces is less
> intrusive than converting to systemd-networkd.

Depends on your setup/environment. But I guess sooner or later you need to get in touch with it anyway.

So here are some examples:

Example 1:
----------
# /etc/systemd/network/10-mb0.link
# we usually rename eth0, the 1st NIC on the motherboard, to mb0,
# using its PCI address to identify it
[Match]
Path=pci-0000:00:19.0

[Link]
Name=mb0
MACAddressPolicy=persistent

# /etc/systemd/network/25-phys-2-vlans+vnics.network
[Match]
Name=mb0

[Link]
ARP=false

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=true
MACVLAN=node1_0
#VLAN=vlan2
#VLAN=vlan3

# /etc/systemd/network/40-node1_0.netdev
[NetDev]
Name=node1_0
Kind=macvlan
# Optional: we use a fixed MAC addr on vnics
MACAddress=00:01:02:03:04:00

[MACVLAN]
Mode=bridge

# /etc/systemd/network/40-node1_0.network
[Match]
Name=node1_0

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=no
Address=10.11.12.13/24
Gateway=10.11.12.200
# stuff which gets copied to /run/systemd/resolve/resolv.conf, when ready
Domains=my.do.main an.other.do.main
DNS=10.11.12.100 10.11.12.101

Example 2 (LXC):
----------------
# /zones/n00-00/config
...
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.flags = up
lxc.net.0.link = mb0
lxc.net.0.name = n00-00_0
lxc.net.0.hwaddr = 00:01:02:03:04:01
...

# /zones/n00-00/rootfs/etc/systemd/network/40-n00-00_0.network
[Match]
Name=n00-00_0

[Network]
LLMNR=false
LLDP=no
LinkLocalAddressing=no
IPv6AcceptRouterAdvertisements=no
Address=10.12.11.0/16
Gateway=10.12.11.2
Domains=gpu.do.main

Have fun,
jel.

> Best regards,
> Ole
>
> > -----Original Message-----
> > From: slurm-users On Behalf Of Ole Holm Nielsen
> > Sent: Monday, October 30, 2023 1:56 PM
> > To: slurm-users@lists.schedmd.com
> > Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
> >
> > Hi Jens,
> >
> > Thanks for your feedback:
> >
> > On 30-10-2023 15:52, Jens Elkner wrote:
> > > Actually there is no need for such a script since
> >
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Jeffrey,

On 10/30/23 20:15, Jeffrey R. Lang wrote:
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 (epel)

Thanks for the info. We can install the systemd-networkd RPM from the EPEL repo as you suggest.

I tried to understand the properties of systemd-networkd before implementing it in our compute nodes. While there are lots of networkd man-pages, it's harder to find an overview of the actual properties of networkd. This is what I found:

* Networkd is a service included in recent versions of systemd. It seems to be an alternative to NetworkManager.
* Red Hat has stated that systemd-networkd is NOT going to be implemented in RHEL 8 or 9.
* Comparing systemd-networkd and NetworkManager: https://fedoracloud.readthedocs.io/en/latest/networkd.html
* Networkd is described in the Wikipedia article https://en.wikipedia.org/wiki/Systemd

While networkd seems to be really nifty, I hesitate to replace NetworkManager by networkd on our EL8 and EL9 systems because this is an unsupported and only lightly tested setup, and it may require additional work to keep our systems up-to-date in the future.

It seems to me that Max Rutkowski's solution in https://github.com/maxlxl/network.target_wait-for-interfaces is less intrusive than converting to systemd-networkd.

Best regards,
Ole

-----Original Message-----
From: slurm-users On Behalf Of Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones, AFAICT.
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
The service is available in RHEL 8 via the EPEL package repository as systemd-networkd, i.e. systemd-networkd.x86_64 253.4-1.el8 (epel)

-----Original Message-----
From: slurm-users On Behalf Of Ole Holm Nielsen
Sent: Monday, October 30, 2023 1:56 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
> Actually there is no need for such a script since
> /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones, AFAICT.

/Ole
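If one wanted to try this, installation on EL8 would presumably be along these lines (assuming the package comes from EPEL as stated above; note that actually switching away from NetworkManager would first require writing /etc/systemd/network/*.network files):

# sketch: enable EPEL and install networkd; do not disable NetworkManager
# until the *.network configuration for all interfaces is in place
dnf install epel-release
dnf install systemd-networkd
systemctl enable --now systemd-networkd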
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Jens,

Thanks for your feedback:

On 30-10-2023 15:52, Jens Elkner wrote:
Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it.

It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones, AFAICT.

/Ole
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:

Hi Max & friends,

...

> Thanks so much for your fast response with a solution! I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(

IIRC it is documented in the man page.

> Your solution of a wait-for-interfaces Systemd service makes a lot of sense,
> and I'm going to try it out.

Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it. I.e. 'Exec=/lib/systemd/systemd-networkd-wait-online -i ib0:routable' or something like that should handle it.

E.g. on my laptop the complete /etc/systemd/system/systemd-networkd-wait-online.service looks like this:

---schnipp---
[Unit]
Description=Wait for Network to be Configured
Documentation=man:systemd-networkd-wait-online.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target shutdown.target

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:routable -i wlan0:routable --any
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
---schnapp---

Have fun,
jel.

> Best regards,
> Ole
>
> On 10/30/23 14:30, Max Rutkowski wrote:
> > Hi,
> >
> > we're not using Omni-Path but also had issues with Infiniband taking too
> > long and slurmd failing to start due to that.
> >
> > Our solution was to implement a little wait-for-interface systemd
> > service which delays the network.target until the ib interface has come up.
> >
> > Our discovery was that the network-online.target is triggered by the
> > NetworkManager as soon as the first interface is connected.
> >
> > I've put the solution we use on my GitHub:
> > https://github.com/maxlxl/network.target_wait-for-interfaces
> >
> > You may need to do small adjustments, but it's pretty straight forward in general.
> >
> > Kind regards
> > Max
> >
> > On 30.10.23 13:50, Ole Holm Nielsen wrote:
> > > I'm fighting this strange scenario where slurmd is started before
> > > the Infiniband/OPA network is fully up. The Node Health Check (NHC)
> > > executed by slurmd then fails the node (as it should). This happens
> > > only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9
> > > nodes with Infiniband/OPA network work without problems.
> > >
> > > Question: Does anyone know how to reliably delay the start of the
> > > slurmd Systemd service until the Infiniband/OPA network is fully up?
> > >
> > > Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not
> > > Mellanox IB. On AlmaLinux 8.8 we use the in-distro OPA drivers
> > > since the CornelisNetworks drivers are not available for RHEL 8.8.
> > >
> > > The details:
> > >
> > > The slurmd service is started by the service file
> > > /usr/lib/systemd/system/slurmd.service after the
> > > "network-online.target" has been reached.
> > >
> > > It seems that NetworkManager reports "network-online.target" BEFORE
> > > the Infiniband/OPA device ib0 is actually up, and this seems to be
> > > the cause of our problems!
> > >
> > > Here are some important sequences of events from the syslog showing
> > > that the network goes online before the Infiniband/OPA network
> > > (hfi1_0 adapter) is up:
> > >
> > > Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check
> > > failed: rc:1 output:ERROR: nhc: Health check failed: check_hw_ib:
> > > No IB port is ACTIVE (LinkUp 100 Gb/sec).
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
> > > Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: set_link_state:
> > > current GOING_UP, new INIT (LINKUP)
> > > Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical
> > > state changed to PHYS_LINKUP (0x5), phy 0x50
> > >
> > > I tried to delay the NetworkManager "network-online.target" by
> > > setting a wait on the ib0 device and reboot, but that seems to be ignored:
> > >
> > > $ nmcli -p connection modify "System ib0"
> > > connection.connection.wait-device-timeout 20
> > >
> > > I'm hoping that other sites using Omni-Path have seen this and maybe
> > > can share a fix or workaround?
> > >
> > > Of course we could remove the Infiniband check
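On a node where networkd is in use, one could narrow this to the IB interface with a drop-in rather than a full copy of the unit (a sketch; the interface name ib0 is an assumption):

# systemctl edit systemd-networkd-wait-online.service
[Service]
# clear the inherited ExecStart of this oneshot unit, then wait for ib0 only
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable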
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi Max,

Thanks so much for your fast response with a solution! I didn't know that NetworkManager (falsely) claims that the network is online as soon as the first interface comes up :-(

Your solution of a wait-for-interfaces systemd service makes a lot of sense, and I'm going to try it out.

Best regards,
Ole

On 10/30/23 14:30, Max Rutkowski wrote:

Hi,

we're not using Omni-Path but also had issues with InfiniBand taking too long and slurmd failing to start due to that.

Our solution was to implement a little wait-for-interface systemd service which delays the network.target until the ib interface has come up.

Our discovery was that the network-online.target is triggered by NetworkManager as soon as the first interface is connected.

I've put the solution we use on my GitHub: https://github.com/maxlxl/network.target_wait-for-interfaces

You may need to do small adjustments, but it's pretty straight forward in general.

Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:

I'm fighting this strange scenario where slurmd is started before the Infiniband/OPA network is fully up. The Node Health Check (NHC) executed by slurmd then fails the node (as it should). This happens only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd Systemd service until the Infiniband/OPA network is fully up?

Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not Mellanox IB. On AlmaLinux 8.8 we use the in-distro OPA drivers since the CornelisNetworks drivers are not available for RHEL 8.8.

The details:

The slurmd service is started by the service file /usr/lib/systemd/system/slurmd.service after the "network-online.target" has been reached.

It seems that NetworkManager reports "network-online.target" BEFORE the Infiniband/OPA device ib0 is actually up, and this seems to be the cause of our problems!

Here are some important sequences of events from the syslog showing that the network goes online before the Infiniband/OPA network (hfi1_0 adapter) is up:

Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical state changed to PHYS_LINKUP (0x5), phy 0x50

I tried to delay the NetworkManager "network-online.target" by setting a wait on the ib0 device and reboot, but that seems to be ignored:

$ nmcli -p connection modify "System ib0" connection.connection.wait-device-timeout 20

I'm hoping that other sites using Omni-Path have seen this and maybe can share a fix or workaround?

Of course we could remove the Infiniband check in Node Health Check (NHC), but that would not really be acceptable during operations.

Thanks for sharing any insights,
Ole

--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___
Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam
Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?
Hi,

we're not using Omni-Path but also had issues with InfiniBand taking too long and slurmd failing to start due to that.

Our solution was to implement a little wait-for-interface systemd service which delays the network.target until the ib interface has come up.

Our discovery was that the network-online.target is triggered by NetworkManager as soon as the first interface is connected.

I've put the solution we use on my GitHub: https://github.com/maxlxl/network.target_wait-for-interfaces

You may need to do small adjustments, but it's pretty straight forward in general.

Kind regards
Max

On 30.10.23 13:50, Ole Holm Nielsen wrote:

I'm fighting this strange scenario where slurmd is started before the Infiniband/OPA network is fully up. The Node Health Check (NHC) executed by slurmd then fails the node (as it should). This happens only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with Infiniband/OPA network work without problems.

Question: Does anyone know how to reliably delay the start of the slurmd Systemd service until the Infiniband/OPA network is fully up?

Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not Mellanox IB. On AlmaLinux 8.8 we use the in-distro OPA drivers since the CornelisNetworks drivers are not available for RHEL 8.8.

The details:

The slurmd service is started by the service file /usr/lib/systemd/system/slurmd.service after the "network-online.target" has been reached.

It seems that NetworkManager reports "network-online.target" BEFORE the Infiniband/OPA device ib0 is actually up, and this seems to be the cause of our problems!

Here are some important sequences of events from the syslog showing that the network goes online before the Infiniband/OPA network (hfi1_0 adapter) is up:

Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
(lines deleted)
Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: check_hw_ib: No IB port is ACTIVE (LinkUp 100 Gb/sec).
(lines deleted)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: 8051: Link up
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: set_link_state: current GOING_UP, new INIT (LINKUP)
Oct 30 13:01:41 d064 kernel: hfi1 0000:4b:00.0: hfi1_0: physical state changed to PHYS_LINKUP (0x5), phy 0x50

I tried to delay the NetworkManager "network-online.target" by setting a wait on the ib0 device and reboot, but that seems to be ignored:

$ nmcli -p connection modify "System ib0" connection.connection.wait-device-timeout 20

I'm hoping that other sites using Omni-Path have seen this and maybe can share a fix or workaround?

Of course we could remove the Infiniband check in Node Health Check (NHC), but that would not really be acceptable during operations.

Thanks for sharing any insights,
Ole

--
Max Rutkowski
IT-Services und IT-Betrieb
Tel.: +49 (0)331/6264-2341
E-Mail: max.rutkow...@gfz-potsdam.de
___
Helmholtz-Zentrum Potsdam
Deutsches GeoForschungsZentrum GFZ
Stiftung des öff. Rechts Land Brandenburg
Telegrafenberg, 14473 Potsdam
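For reference, the overall shape of this approach, per the repo name and the Before=network-online.target line quoted earlier in the thread (the unit and script below are an illustrative sketch, not the repo's actual contents; the interface name ib0 and all paths are assumptions):

# wait-for-interfaces.service (sketch)
[Unit]
Description=Wait for the ib0 interface before the network is considered online
# hook in before network-online.target so dependent services (slurmd, NFS mounts) wait
Before=network-online.target
After=NetworkManager.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/wait-for-interfaces.sh
RemainAfterExit=yes

[Install]
WantedBy=network-online.target

# /usr/local/sbin/wait-for-interfaces.sh (sketch)
#!/bin/bash
# Poll NetworkManager until ib0 reports "connected"; give up after 300 s
# so that a broken IB fabric does not block booting entirely.
for (( i = 0; i < 300; i++ )); do
    if nmcli -g GENERAL.STATE device show ib0 | grep -q connected; then
        exit 0
    fi
    sleep 1
done
logger "wait-for-interfaces: ib0 did not come up within 300 s"
exit 1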