Hi Max and Ward,
I've made a variation of your scripts which wait for at least 1 Infiniband
port to come up before starting services such as slurmd or NFS mounts.
I prefer Max's Systemd service which comes before the Systemd
network-online.target. And I like Ward's script which checks the
Hi Ward,
On 10.11.2023 at 19:45, Ward Poelmans wrote:
Hi Ole,
On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online
on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This
Hi Ole,
On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This might disturb the logic in waitforib.sh, or at least cause
Hi Ward,
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've put our systemd unit file also online on
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This looks really good! However, I was testing the waitforib.sh script on
a SuperMicro server WITHOUT
Hi Ole,
Yes, it's very similar. I've put our systemd unit file also online on
https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
And we add it as a dependency for slurmd:
$ cat /etc/systemd/system/slurmd.service.d/wait.conf
[Service]
Environment="CUDA_DEVICE_ORDER=PCI_BUS_ID"
Hi Ward,
Thanks a lot for the feedback! The method of probing
/sys/class/infiniband/*/ports/*/state is also used in the NHC script
lbnl_hw.nhc and has the advantage of not depending on the nmcli command
from the NetworkManager package.
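
The /sys probing described above can be sketched roughly as follows. This is a minimal illustration only, not Ward's waitforib.sh or the NHC check: the function names count_active_ports and wait_for_ib, the sysfs-root argument, and the 60-second default timeout are all assumptions made for the example.

```shell
# Sketch only: count_active_ports and wait_for_ib are hypothetical helper
# names, and the default timeout of 60 seconds is an assumption.

# Print the number of ports whose sysfs state file contains "ACTIVE".
# $1 = sysfs root, normally /sys/class/infiniband
count_active_ports() {
    local root="$1" state_file active=0
    for state_file in "$root"/*/ports/*/state; do
        # An unmatched glob stays literal, so skip anything unreadable.
        [ -r "$state_file" ] || continue
        # State files contain text such as "4: ACTIVE" or "1: DOWN".
        if grep -q 'ACTIVE' "$state_file"; then
            active=$((active + 1))
        fi
    done
    echo "$active"
}

# Poll once per second until at least one port is ACTIVE.
# $1 = sysfs root, $2 = timeout in seconds (assumed default: 60)
# Returns 0 on success or when no InfiniBand directory exists, 1 on timeout.
wait_for_ib() {
    local root="$1" timeout="${2:-60}" waited=0
    # No InfiniBand directory at all: nothing to wait for.
    [ -d "$root" ] || return 0
    while true; do
        if [ "$(count_active_ports "$root")" -gt 0 ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

A caller might run something like `wait_for_ib /sys/class/infiniband 120 || logger "No ACTIVE InfiniBand port"` before slurmd starts. Taking the sysfs root as an argument keeps the check free of nmcli and makes it easy to try against a fake directory tree.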
Can I ask you how you implement your script as a
Hi,
We have a slightly different script to do the same. It only relies on /sys:
# Search for infiniband devices and wait until
# at least one reports that it is ACTIVE
if [[ ! -d /sys/class/infiniband ]]
then
    logger "No infiniband found"
    exit 0
fi
ports=$(ls
Hi Rémi,
Thanks for the feedback! The patch revert[1] explains SchedMD's reason:
The reasoning is that sysadmins who see nodes with Reason "Not Responding"
but they can manually ping/access the node end up confused. That reason
should only be set if the node is truly not responding, but not
Hi Ole,
On 30/10/2023 at 13:50, Ole Holm Nielsen wrote:
> I'm fighting this strange scenario where slurmd is started before the
> Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
> by slurmd then fails the node (as it should). This happens only on EL8
> Linux
On Tue, Oct 31, 2023 at 10:59:56AM +0100, Ole Holm Nielsen wrote:
Hi Ole,
TL;DR: only systemd-networkd stuff below.
> On 10/30/23 20:15, Jeffrey R. Lang wrote:
> > The service is available in RHEL 8 via the EPEL package repository as
> > system-networkd, i.e. systemd-networkd.x86_64
Hi Jeffrey,
On 10/30/23 20:15, Jeffrey R. Lang wrote:
The service is available in RHEL 8 via the EPEL package repository as
system-networkd, i.e. systemd-networkd.x86_64
253.4-1.el8epel
Thanks for the info. We can install the systemd-networkd
The service is available in RHEL 8 via the EPEL package repository as
system-networkd, i.e. systemd-networkd.x86_64
253.4-1.el8epel
-Original Message-
From: slurm-users On Behalf Of Ole Holm
Nielsen
Sent: Monday, October 30, 2023 1:56 PM
Hi Jens,
Thanks for your feedback:
On 30-10-2023 15:52, Jens Elkner wrote:
Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.
It seems that systemd-networkd exists in Fedora FC38 Linux, but not in
RHEL 8 and clones,
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution! I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(
IIRC it is documented in
Hi Max,
Thanks so much for your fast response with a solution! I didn't know that
NetworkManager (falsely) claims that the network is online as soon as the
first interface comes up :-(
Your solution of a wait-for-interfaces Systemd service makes a lot of
sense, and I'm going to try it out.
Hi,
we're not using Omni-Path but also had issues with Infiniband taking too
long and slurmd failing to start due to that.
Our solution was to implement a little wait-for-interface systemd
service which delays the network.target until the ib interface has come up.
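
A wait-for-interface helper of this kind could look roughly like the sketch below. It is not Max's actual service: the function names iface_is_up and wait_for_iface, the interface name ib0, and the default timeout are assumptions for illustration. It polls the interface's operstate file under /sys/class/net.

```shell
# Sketch only: iface_is_up and wait_for_iface are hypothetical names.

# Succeed if the interface's operstate file reads "up".
# $1 = sysfs net root (normally /sys/class/net), $2 = interface name
iface_is_up() {
    [ -r "$1/$2/operstate" ] && [ "$(cat "$1/$2/operstate")" = "up" ]
}

# Poll once per second until the interface is up or the timeout expires.
# $1 = sysfs net root, $2 = interface, $3 = timeout in seconds (assumed default: 60)
wait_for_iface() {
    local root="$1" iface="$2" timeout="${3:-60}" waited=0
    while ! iface_is_up "$root" "$iface"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A oneshot unit ordered before the relevant network target could call, for example, `wait_for_iface /sys/class/net ib0 120`, where ib0 is an assumed interface name. A non-zero exit status then signals that the interface never came up within the timeout.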
Our discovery was that
I'm fighting this strange scenario where slurmd is started before the
Infiniband/OPA network is fully up. The Node Health Check (NHC) executed
by slurmd then fails the node (as it should). This happens only on EL8
Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9 nodes with
Infiniband/OPA
17 matches