Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Jens Elkner
On Tue, Sep 26, 2023 at 03:04:34PM +0200, Ole Holm Nielsen wrote:
> On 9/26/23 14:50, Groner, Rob wrote:
> > There's a builtin slurm command, I can't remember what it is and google
> > is failing me, that will take a compacted list of nodenames and return
> > their full names, and I'm PRETTY sure it will do the opposite as well
> > (what you're asking for).
> > 
> > It's probably sinfo or scontrol... maybe an sutil if that exists.
> 
> The command would be:
> 
> scontrol show hostname awn-0[01-32,46-77,95-99]

Wondering why [...] is allowed as a suffix only. Historical reasons?
Or: why is a trailing string like foo_[000-999]_bar, or even a character
sequence like foo_[a..z]_bar, not allowed?
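
As an aside, the core of that expansion is easy to sketch in plain shell.
This is only a rough approximation of what `scontrol show hostnames` does
for a single bracket expression, not Slurm's actual implementation (the
reverse direction, compressing a name list, is `scontrol show hostlist`):

```shell
# Rough sketch: expand a Slurm-style hostlist such as
# awn-0[01-32,46-77,95-99]; handles only one bracket expression.
expand_hostlist() {
  spec=$1
  prefix=${spec%%\[*}                  # text before the '['
  ranges=${spec#*\[}; ranges=${ranges%\]}
  for r in $(printf '%s' "$ranges" | tr ',' ' '); do
    lo=${r%%-*}; hi=${r##*-}           # a single number expands to itself
    seq -w "$lo" "$hi" | while read -r i; do
      printf '%s%s\n' "$prefix" "$i"
    done
  done
}

expand_hostlist 'awn-0[01-32,46-77,95-99]' | head -n 3
# awn-001
# awn-002
# awn-003
```

For the example above this yields 69 names (awn-001 ... awn-032,
awn-046 ... awn-077, awn-095 ... awn-099), matching what scontrol prints.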
 
Thanx a lot,
jel.

> /Ole
> 
> > --
> > *From:* slurm-users  on behalf of
> > Felix 
> > *Sent:* Tuesday, September 26, 2023 7:22 AM
> > *To:* Slurm User Community List 
> > *Subject:* [slurm-users] question about configuration in slurm.conf
> > hello
> > 
> > I have at my site the following work nodes
> > 
> > awn001 ... awn099
> > 
> > and then it continues awn100 ... awn199
> > 
> > How can I configure this line
> > 
> > PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] Default=YES
> > MaxTime=INFINITE State=UP
> > 
> > so that it can contain the nodes from 001 to 199
> > 
> > can I write:
> > 
> > PartitionName=debug Nodes=awn-0[01-32,46-77,95-99] awn-1[00-99]
> > Default=YES MaxTime=INFINITE State=UP
> > 
> > is this correct?
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 52768



Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Jens Elkner
On Mon, Oct 30, 2023 at 03:11:32PM +0100, Ole Holm Nielsen wrote:
Hi Max & friends,
...
> Thanks so much for your fast response with a solution!  I didn't know that
> NetworkManager (falsely) claims that the network is online as soon as the
> first interface comes up :-(

IIRC it is documented in the man page.
  
> Your solution of a wait-for-interfaces Systemd service makes a lot of sense,
> and I'm going to try it out.

Actually there is no need for such a script since
/lib/systemd/systemd-networkd-wait-online should be able to handle it.

I.e. 'ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable'
or something like that should handle it. E.g. on my laptop the complete
/etc/systemd/system/systemd-networkd-wait-online.service looks like
this:
---schnipp---
[Unit]
Description=Wait for Network to be Configured
Documentation=man:systemd-networkd-wait-online.service(8)
DefaultDependencies=no
Conflicts=shutdown.target
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target shutdown.target

[Service]
Type=oneshot
ExecStart=/lib/systemd/systemd-networkd-wait-online -i eth0:routable -i wlan0:routable --any
RemainAfterExit=yes

[Install]
WantedBy=network-online.target
---schnapp---
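
Adapted to the Infiniband case discussed in this thread, a minimal drop-in
override could look like the sketch below (assuming the interface is named
ib0 and the node runs systemd-networkd; on EL systems the binary may live
under /usr/lib/systemd/ instead):

```ini
# /etc/systemd/system/systemd-networkd-wait-online.service.d/ib0.conf
# Drop-in: wait until ib0 is routable before network-online.target is
# reached. The empty ExecStart= first clears the unit's original command.
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online -i ib0:routable
```

After creating the file, 'systemctl daemon-reload' picks it up; slurmd,
which is ordered after network-online.target, then starts only once ib0
is routable.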
 
Have fun,
jel.
> Best regards,
> Ole
> 
> On 10/30/23 14:30, Max Rutkowski wrote:
> > Hi,
> > 
> > we're not using Omni-Path but also had issues with Infiniband taking too
> > long and slurmd failing to start due to that.
> > 
> > Our solution was to implement a little wait-for-interface systemd
> > service which delays the network.target until the ib interface has come
> > up.
> > 
> > Our discovery was that the network-online.target is triggered by the
> > NetworkManager as soon as the first interface is connected.
> > 
> > I've put the solution we use on my GitHub:
> > https://github.com/maxlxl/network.target_wait-for-interfaces
> > 
> > You may need to do small adjustments, but it's pretty straight forward in
> > general.
> > 
> > 
> > Kind regards
> > Max
> > 
> > On 30.10.23 13:50, Ole Holm Nielsen wrote:
> > > I'm fighting this strange scenario where slurmd is started before
> > > the Infiniband/OPA network is fully up.  The Node Health Check (NHC)
> > > executed by slurmd then fails the node (as it should).  This happens
> > > only on EL8 Linux (AlmaLinux 8.8) nodes, whereas our CentOS 7.9
> > > nodes with Infiniband/OPA network work without problems.
> > > 
> > > Question: Does anyone know how to reliably delay the start of the
> > > slurmd Systemd service until the Infiniband/OPA network is fully up?
> > > 
> > > Note: Our Infiniband/OPA network fabric is Omni-Path 100 Gbit/s, not
> > > Mellanox IB.  On AlmaLinux 8.8 we use the in-distro OPA drivers
> > > since the CornelisNetworks drivers are not available for RHEL 8.8.
> -- 
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark,
> Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
> E-mail: ole.h.niel...@fysik.dtu.dk
> Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
> Mobile: (+45) 5180 1620
> > > 
> > > The details:
> > > 
> > > The slurmd service is started by the service file
> > > /usr/lib/systemd/system/slurmd.service after the
> > > "network-online.target" has been reached.
> > > 
> > > It seems that NetworkManager reports "network-online.target" BEFORE
> > > the Infiniband/OPA device ib0 is actually up, and this seems to be
> > > the cause of our problems!
> > > 
> > > Here are some important sequences of events from the syslog showing
> > > that the network goes online before the Infiniband/OPA network
> > > (hfi1_0 adapter) is up:
> > > 
> > > Oct 30 13:01:40 d064 systemd[1]: Reached target Network is Online.
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 slurmd[2333]: slurmd: error: health_check
> > > failed: rc:1 output:ERROR:  nhc:  Health check failed: check_hw_ib: 
> > > No IB port is ACTIVE (LinkUp 100 Gb/sec).
> > > (lines deleted)
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: 8051: Link up
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0:
> > > set_link_state: current GOING_UP, new INIT (LINKUP)
> > > Oct 30 13:01:41 d064 kernel: hfi1 :4b:00.0: hfi1_0: physical
> > > state changed to PHYS_LINKUP (0x5), phy 0x50
> > > 
> > > I tried to delay the NetworkManager "network-online.target" by
> > > setting a wait on the ib0 device and reboot, but that seems to be
> > > ignored:
> > > 
> > > $ nmcli -p connection modify "System ib0"
> > > connection.connection.wait-device-timeout 20
> > > 
> > > I'm hoping that other sites using Omni-Path have seen this and maybe
> > > can share a fix or workaround?
> > > 
> > > Of course we could remove the Infiniband check i

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Jens Elkner
On Tue, Oct 31, 2023 at 10:59:56AM +0100, Ole Holm Nielsen wrote:
Hi Ole,
  
TL;DR: below, systemd-networkd stuff only.

> On 10/30/23 20:15, Jeffrey R. Lang wrote:
> > The service is available in RHEL 8 via the EPEL package repository as 
> > system-networkd, i.e. systemd-networkd.x86_64   
> > 253.4-1.el8epel
> 
> Thanks for the info.  We can install the systemd-networkd RPM from the EPEL
> repo as you suggest.

Strange that it is not installed by default. We use Ubuntu only; the
first LTS which includes it is Xenial (16.04), released in April 2016.
Anyway, we have never installed any NetworkManager stuff (too inflexible,
unreliable, buggy - last evaluated ~5 years ago and ditched forever).
Even before 16.04, and on desktops as well, I ditched it (IMHO just
overhead).

> I tried to understand the properties of systemd-networkd before implementing
> it in our compute nodes.  While there are lots of networkd man-pages, it's
> harder to find an overview of the actual properties of networkd.  This is
> what I found:

Basically, for each interface you just need a *.netdev and a *.network
file in /etc/systemd/network/.  Optionally symlink /etc/resolv.conf to
/run/systemd/resolve/resolv.conf.  If you want to rename your
interface[s] (e.g. we use ${hostname}${ifidx}) and the parameter
'net.ifnames=0' gets passed to the kernel, you can use a *.link file to
accomplish this. That's it. See example 1 below.

Some distros have obscure bloatware to manage them (e.g. Ubuntu installs
'netplan.io' by default, aka another layer of indirection), but we ditch
those packages immediately and manage them "manually" as needed.
 
> * Comparing systemd-networkd and NetworkManager:
> https://fedoracloud.readthedocs.io/en/latest/networkd.html

Pretty good - shows all you probably need. Actually, within containers we
have just /etc/systemd/network/40-${hostname}0.network, because the
lxc.net.* config already describes what *.link and *.netdev would do.
See example 2.
  
...
> While networkd seems to be really nifty, I hesitate to replace

Does/can do all we need w/o a lot of overhead.

> NetworkManager by networkd on our EL8 and EL9 systems because this is an
> unsupported and only lightly tested setup,

We have used it for ~5 years on all machines, ~7 years on most of them;
multihomed, containers, simple and complex setups (i.e. a lot of NICs,
VLANs) w/o any problems ...

> and it may require additional
> work to keep our systems up-to-date in the future.

I doubt that. The /etc/systemd/network/*.{link,netdev,network} interface
seems to be pretty stable. Haven't seen/noticed anything that got
removed so far.

> It seems to me that Max Rutkowski's solution in
> https://github.com/maxlxl/network.target_wait-for-interfaces is less
> intrusive than converting to systemd-networkd.

Depends on your setup/environment. But I guess sooner or later you will
need to get in touch with it anyway. So here are some examples:

Example 1:
--
# /etc/systemd/network/10-mb0.link
# we rename usually eth0, the 1st NIC on the motherboard to mb0 using
# its PCI Address to identify it
[Match]
Path=pci-:00:19.0

[Link]
Name=mb0 
MACAddressPolicy=persistent


# /etc/systemd/network/25-phys-2-vlans+vnics.network
[Match]
Name=mb0

[Link]
ARP=false

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=true
MACVLAN=node1_0
#VLAN=vlan2
#VLAN=vlan3


# /etc/systemd/network/40-node1_0.netdev
[NetDev]
Name=node1_0
Kind=macvlan
# Optional: we use fix mac addr on vnics
MACAddress=00:01:02:03:04:00

[MACVLAN]
Mode=bridge


# /etc/systemd/network/40-node1_0.network
[Match]
Name=node1_0

[Network]
LinkLocalAddressing=no
LLMNR=false
IPv6AcceptRA=no
LLDP=no
Address=10.11.12.13/24
Gateway=10.11.12.200
# stuff which gets copied to /run/systemd/resolve/resolv.conf, when ready
Domains=my.do.main an.other.do.main
DNS=10.11.12.100 10.11.12.101

 
Example 2 (LXC):

# /zones/n00-00/config
...
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.flags = up
lxc.net.0.link = mb0
lxc.net.0.name = n00-00_0
lxc.net.0.hwaddr = 00:01:02:03:04:01
...


# /zones/n00-00/rootfs/etc/systemd/network/40-n00-00_0.network
[Match]
Name=n00-00_0

[Network]
LLMNR=false
LLDP=no
LinkLocalAddressing=no
IPv6AcceptRouterAdvertisements=no
Address=10.12.11.0/16
Gateway=10.12.11.2
Domains=gpu.do.main


Have fun,
jel.
> Best regards,
> Ole
> 
> 
> > -Original Message-
> > From: slurm-users  On Behalf Of Ole 
> > Holm Nielsen
> > Sent: Monday, October 30, 2023 1:56 PM
> > To: slurm-users@lists.schedmd.com
> > Subject: Re: [slurm-users] How to delay the start of slurmd until 
> > Infiniband/OPA network is fully up?
> > 

Re: [slurm-users] SlurmcltdHost confusion

2023-12-14 Thread Jens Elkner
On Wed, Dec 13, 2023 at 08:16:39PM +, Jackson, Gary L. wrote:
Hi Gary,

> The SlurmctldHost value is set like the following in my slurm.conf:
> 
> SlurmctldHost=host0,host1
> 
> That seems to be legal according to the documentation. However, I get error 
> messages like the following:
> 
> $ srun id
> 
> srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
> srun: error: slurm_set_addr: Unable to resolve "host0,host1"
> srun: error: Unable to establish control machine address
> srun: error: Unable to allocate resources: Address already in use
...
> What’s going on?

Not sure, but I've seen such errors when using a node name which was not
"registered" via NodeName or discovered otherwise. A code lookup at the
time revealed that the message is IMHO misleading: slurm does __not__
make a DNS lookup - it simply greps its internal list of known nodes
and, if nothing is found, emits such messages.

Other option: use a separate SlurmctldHost=... line for each host, to
rule out a format error. Not sure whether it supports ranges, too
(like SlurmctldHost=host[0-1]).
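
I.e. something along these lines in slurm.conf (a sketch, reusing the
hostnames from your question; the first entry is the primary controller,
later entries are backups):

```ini
# slurm.conf - one SlurmctldHost entry per line
SlurmctldHost=host0
SlurmctldHost=host1
```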

Last but not least, 'Address already in use': checking whether another
instance or something else is already listening on the related port
shouldn't hurt ...

Have fun,
jel.
-- 
Otto-von-Guericke University http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany Tel: +49 391 67 52768