On 22/08/2023 00:27, Ryan Novosielski wrote:
I still have the guide from that system, and I saved some of the routing
scripts and what not. But really, it wasn’t much more complicated than
Ethernet routing.
The routing nodes, I guess obviously, had both Omnipath and Infiniband
interfaces. Compute knifes themselves I believe used a supervisord
script, if I’m remembering that name right, to try to balance out which
routing nide ione would use as a gateway. There were two as it was
configured when I got to it, but a larger number was possible.
Having done it in a limited fashion previously I would recommend that
you have two routing nodes and use keepalived on at least the Ethernet
side with VRRP to try and maintain some redundancy.
Otherwise you get in a situation where you are entirely dependent on a
single node which you can't reboot without a GPFS shutdown. Cyber
security makes that an untenable position these days.
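For illustration, a minimal sketch of the sort of keepalived VRRP stanza
I mean; the interface name, router id and address below are made up and
would need changing, and the second routing node would carry the same
stanza as BACKUP with a lower priority:

    vrrp_instance GW_ETH {
        state MASTER              # BACKUP on the second routing node
        interface eth0            # Ethernet-side interface (example name)
        virtual_router_id 51
        priority 150              # use a lower priority on the backup node
        advert_int 1
        virtual_ipaddress {
            10.10.0.254/24        # gateway VIP the clients route through (example)
        }
    }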
In our situation our DSS-G nodes were both Ethernet and Infiniband
connected, and we had a bunch of nodes that were using Infiniband for
the data traffic and Ethernet at 1Gbps for the management interface.
Everything else was on 10Gbps or better Ethernet. We therefore needed
the Ethernet-only connected nodes to be able to talk to the data
interfaces of the Infiniband-connected nodes.
Due to the way routing works on Linux, when the Infiniband nodes
attempted to connect to the Ethernet-only connected nodes the traffic
went via the 1Gbps Ethernet interface.
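You can see which path the kernel will pick for a given destination with
ip route get; the addresses and interface name here are invented examples,
and the output is only roughly what you would see:

    ip route get 10.20.0.15
    # e.g. "10.20.0.15 via 10.10.0.1 dev eno1 src 10.10.0.23"
    #      i.e. replies leave via the 1Gbps management interface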
So after a while, and issues with a single gateway machine, we switched
to making it redundant. Basically the Ethernet-only connected nodes had
a custom route to reach the Infiniband network, the DSS-G nodes were
doing the forwarding, and keepalived running VRRP moved the IP address
around on the Ethernet side so there was redundancy in the gateway. The
amount of traffic transiting the gateway was actually tiny because all
the filesystem data was coming from the DSS-G nodes that were Infiniband
connected :-)
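Roughly what that looks like, sketched with made-up addresses: the IPoIB
data network here is 10.30.0.0/24 and 10.10.0.254 is the VRRP address
from the stanza above.

    # On each DSS-G node: allow forwarding between the Ethernet and
    # IPoIB interfaces
    sysctl -w net.ipv4.ip_forward=1

    # On each Ethernet-only node: route the IPoIB data network via
    # whichever DSS-G node currently holds the VRRP address
    ip route add 10.30.0.0/24 via 10.10.0.254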
I have no idea if you can do the equivalent of VRRP on Infiniband and
Omnipath.
In the end the Infiniband nodes (a bunch of C6220s used to support
undergraduate/MSc projects and classes) had to be upgraded to 10Gbps
Ethernet as Red Hat dropped support for the Intel Truescale Infiniband
adapters in RHEL8. We don't let the students run multinode jobs anyway,
so the loss of the Infiniband was not an issue. Though with the enforced
move away from RHEL we will get it back.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG