Hi Ward,
On 10.11.2023 at 19:45, Ward Poelmans wrote:
Hi Ole,
On 10/11/2023 15:04, Ole Holm Nielsen wrote:
On 11/5/23 21:32, Ward Poelmans wrote:
Yes, it's very similar. I've also put our systemd unit file online
at https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11
This might disturb the logic in waitforib.sh, or at least cause some
confusion?
I had never heard of these cards. But if they behave like InfiniBand
cards, is there also a .../ports/1/state file present in /sys with
the state? In that case it should work just as well.
We could also change the glob '/sys/class/infiniband/*/ports/*/state'
to only look at devices starting with mlx. I have no clue how much
diversity is out there; we only have Mellanox cards (or rebrands of
those).
IMHO, this seems quite confusing.
Yes, I agree.
Regarding the slurmd service:
An alternative to this extra service would be like Max's service file
https://github.com/maxlxl/network.target_wait-for-interfaces/blob/main/wait-for-interfaces.service
which has:
Before=network-online.target
What do you think of these considerations?
I think Max's approach is the better one. We only do it for slurmd,
while his is completely general for everything that waits on the network.
The downside is probably that if you have issues with your IB network,
this will make them worse ;)
That's why we have a limit in there, which allows the boot to
complete even if the network does not come up, in case we need to log in
and check the server. The script only delays the boot until a timeout
is reached. And yes, we used a more general approach since our issue
actually was the network not coming up fast enough for our NFS mounts,
which are also used by slurmd at our site.
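To illustrate the idea of a bounded wait, here is a rough sketch. It is
not the actual script from either repository. The function name
wait_for_ib and its two arguments are made up for this example, and it
assumes that an active port's state file contains a line such as
"4: ACTIVE". Passing the glob as an argument also makes it easy to
restrict the check to devices starting with mlx, as suggested above.

```shell
# Sketch only: wait until at least one port matched by a sysfs glob
# reports ACTIVE, but give up after a timeout so the boot can continue.
# wait_for_ib and its arguments are hypothetical, not from waitforib.sh.
#   $1: glob for the state files, e.g. '/sys/class/infiniband/mlx*/ports/*/state'
#   $2: maximum number of seconds to wait
wait_for_ib() {
    local pattern="$1" timeout="$2" waited=0 f
    while :; do
        # $pattern is left unquoted on purpose so the shell expands the glob.
        for f in $pattern; do
            # Assumption: an active port reports a line like "4: ACTIVE".
            if [ -r "$f" ] && grep -qw 'ACTIVE' "$f"; then
                return 0
            fi
        done
        # Stop waiting once the limit is reached; the caller decides what to do.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

A wrapper would call this with the glob and a timeout (the value is a
site-specific choice), log a warning if it returns non-zero, and exit
successfully either way, so that a broken IB network delays the boot
but never blocks it.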
Ward