Re: [lustre-discuss] Disabling multi-rail dynamic discovery
On Sep 14, 2021, at 11:17, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss wrote:
>
> Ah yes, I see what the lnet unit file is doing. OK, I think this is all
> straightened out and working great now. We have a fairly extensive init script
> (the lustre3 script in previous posts) that does various checks in addition
> to loading modules and mounting/unmounting the filesystems. But at its core,
> the start is now doing this:
>
>     /usr/bin/systemctl start lnet >& /dev/null
>     modprobe lustre

Strictly speaking, the mount command itself should automatically trigger "lustre" module loading, so the "modprobe lustre" is redundant.

> The stop portion does:
>
>     /usr/bin/systemctl stop lnet >& /dev/null
>     /usr/sbin/lustre_rmmod

In 2.15 the lustre_rmmod script will automatically run "lnetctl lnet unconfigure", and conversely lnet.service will run "lustre_rmmod" in the right places (assuming the filesystem was previously unmounted), so only one or the other will be needed. Running both isn't harmful, just a bit redundant.

Cheers, Andreas

> The final conf files I'm using are:
>
> lnet.conf:
>
>     net:
>         - net type: o2ib1
>           local NI(s):
>             - interfaces:
>                   0: ib0
>     global:
>         discovery: 0
>
> /etc/modprobe.d/lustre.conf:
>
>     options ko2iblnd map_on_demand=32
>
> Using the lnet systemd unit file properly loads the configuration and shows
> discovery=0 (without any of the lnet stuff in the modprobe conf file). We could
> properly enable the lnet unit file and add a dependency to make sure our
> init script runs after the lnet service, but it's a little easier to just run
> the systemctl commands in our init script.
>
> I would be interested if others have a cleaner way to do all the mounting, etc.
> in a more native systemd manner. It probably just involves making a simple
> unit file to run a script. Probably six of one, half a dozen of the other, but
> if anyone has experience with the pros and cons, please let me know.
>
> Thanks a ton for the help on this.
Much appreciated.

>> From: "Horn, Chris"
>> Date: Tuesday, September 14, 2021 at 9:40 AM
>> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", Riccardo Veraldi,
>> "lustre-discuss@lists.lustre.org"
>> Subject: [EXTERNAL] Re: Re: [lustre-discuss] Disabling multi-rail dynamic discovery
>>
>> When you start LNet via 'modprobe lnet; lctl net up', that doesn't load the
>> configuration from /etc/lnet.conf. It is going to configure LNet based only
>> on kernel module parameters. Since you removed the 'options lnet networks'
>> line from your modprobe.conf file, it is going to use the default
>> configuration, which is @tcp on the first ethernet interface with IPv4
>> configured that it finds.
>>
>> To load /etc/lnet.conf you can use "systemctl start lnet.service" (or
>> equivalent), or if you want to do it manually:
>>
>>     modprobe lnet
>>     lnetctl lnet configure
>>     lnetctl lnet import < /etc/lnet.conf
>>
>> Also, I would try this for your lnet.conf:
>>
>>     net:
>>         - net type: o2ib
>>           local NI(s):
>>             - interfaces:
>>                   0: ib0
>>     global:
>>         discovery: 0
>>
>> Chris Horn
>>
>> From: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
>> Date: Tuesday, September 14, 2021 at 10:17 AM
>> To: Horn, Chris, Riccardo Veraldi, lustre-discuss@lists.lustre.org
>> Subject: Re: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery
>>
>> So I'm a little confused.
>>
>> When I take the "options lnet networks=o2ib1(ib0)" line out of the modprobe
>> conf file and instead put that info in the lnet.conf file, things don't work
>> properly.
>>
>>     [root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
>>     options ko2iblnd map_on_demand=32
>>     [root@r1i1n18 lnet]# cat /etc/lnet.conf
>>     ip2nets:
>>      - net-spec: o2ib1
>>        interfaces:
>>           0: ib0
>>     global:
>>         discovery: 0
>>     [root@r1i1n18 lnet]# modprobe lnet
>>     [root@r1i1n18 lnet]# lctl network up
>>     LNET configured
>>     [root@r1i1n18 lnet]# service lustre3 start
>>     Mounting /ephemeral...
>>     mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed: No such file or directory
>>     Is the M
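For the question above about handling the mounts in a more native systemd manner, one sketch is a small oneshot unit ordered after lnet.service. The unit name and script path below are hypothetical (the thread only says the site uses a "lustre3" init script), so adjust them to the local layout:

```ini
# /etc/systemd/system/lustre-mounts.service -- hypothetical unit name/path
[Unit]
Description=Mount Lustre filesystems once LNet is configured
Requires=lnet.service
After=lnet.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder paths: point these at the existing lustre3 start/stop logic.
ExecStart=/usr/local/sbin/lustre3 start
ExecStop=/usr/local/sbin/lustre3 stop

[Install]
WantedBy=multi-user.target
```

Enabling this with "systemctl enable lustre-mounts" gives the same start ordering as calling systemctl from the init script, and since systemd stops units in reverse order, the mounts would be taken down before lnet.service on shutdown.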
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
Ah yes, I see what the lnet unit file is doing. OK, I think this is all
straightened out and working great now. We have a fairly extensive init script
(the lustre3 script in previous posts) that does various checks in addition to
loading modules and mounting/unmounting the filesystems. But at its core, the
start is now doing this:

    /usr/bin/systemctl start lnet >& /dev/null
    modprobe lustre

The stop portion does:

    /usr/bin/systemctl stop lnet >& /dev/null
    /usr/sbin/lustre_rmmod

The final conf files I'm using are:

lnet.conf:

    net:
        - net type: o2ib1
          local NI(s):
            - interfaces:
                  0: ib0
    global:
        discovery: 0

/etc/modprobe.d/lustre.conf:

    options ko2iblnd map_on_demand=32

Using the lnet systemd unit file properly loads the configuration and shows
discovery=0 (without any of the lnet stuff in the modprobe conf file). We could
properly enable the lnet unit file and add a dependency to make sure our init
script runs after the lnet service, but it's a little easier to just run the
systemctl commands in our init script.

I would be interested if others have a cleaner way to do all the mounting, etc.
in a more native systemd manner. It probably just involves making a simple unit
file to run a script. Probably six of one, half a dozen of the other, but if
anyone has experience with the pros and cons, please let me know.

Thanks a ton for the help on this. Much appreciated.

From: "Horn, Chris"
Date: Tuesday, September 14, 2021 at 9:40 AM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", Riccardo Veraldi, "lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: Re: [lustre-discuss] Disabling multi-rail dynamic discovery

When you start LNet via 'modprobe lnet; lctl net up', that doesn't load the
configuration from /etc/lnet.conf. It is going to configure LNet based only on
kernel module parameters. Since you removed the 'options lnet networks' line
from your modprobe.conf file, it is going to use the default configuration,
which is @tcp on the first ethernet interface with IPv4 configured that it
finds.

To load /etc/lnet.conf you can use "systemctl start lnet.service" (or
equivalent), or if you want to do it manually:

    modprobe lnet
    lnetctl lnet configure
    lnetctl lnet import < /etc/lnet.conf

Also, I would try this for your lnet.conf:

    net:
        - net type: o2ib
          local NI(s):
            - interfaces:
                  0: ib0
    global:
        discovery: 0

Chris Horn

From: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
Date: Tuesday, September 14, 2021 at 10:17 AM
To: Horn, Chris, Riccardo Veraldi, lustre-discuss@lists.lustre.org
Subject: Re: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

So I'm a little confused.

When I take the "options lnet networks=o2ib1(ib0)" line out of the modprobe
conf file and instead put that info in the lnet.conf file, things don't work
properly.

    [root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
    options ko2iblnd map_on_demand=32
    [root@r1i1n18 lnet]# cat /etc/lnet.conf
    ip2nets:
     - net-spec: o2ib1
       interfaces:
          0: ib0
    global:
        discovery: 0
    [root@r1i1n18 lnet]# modprobe lnet
    [root@r1i1n18 lnet]# lctl network up
    LNET configured
    [root@r1i1n18 lnet]# service lustre3 start
    Mounting /ephemeral...
    mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
    FAILED.
    Mounting /nobackup...
    mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work at /nobackup failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
    FAILED.
    [root@r1i1n18 lnet]#

The logs when this happens:

    Sep 14 09:53:38 r1i1n18 kernel: LNet: Added LNI 10.159.0.39@tcp [8/256/0/180]
    Sep 14 09:53:38 r1i1n18 kernel: LNet: Accept secure, port 988
    Sep 14 09:53:54 r1i1n18 kernel: Lustre: Lustre: Build Version: 2.12.6
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_config.c:559:class_setup()) setup MGC10.150.100.30@o2ib1 failed (-2)
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:202:lustre_start_simple()) MGC10.150.100.30@o2ib1 setup error -2
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-2)

Note the @tcp above – it looks like without the modprobe conf file, the lnet
module isn't getting set up properly. When this happens, I'm not able to shut
down lnet or unload the kernel modules to try again. The only way I'
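A side note on debugging a config like the one above: an offline check that the YAML actually sets discovery to 0 can rule out indentation mistakes before "lnetctl lnet import" ever sees it. A minimal sketch (the file path is arbitrary, and the YAML is the o2ib1 example from this thread):

```shell
#!/bin/sh
# Write a candidate lnet.conf and verify the global discovery flag is 0
# before importing it with "lnetctl lnet import < file".
conf=/tmp/lnet.conf.candidate
cat > "$conf" <<'EOF'
net:
    - net type: o2ib1
      local NI(s):
        - interfaces:
              0: ib0
global:
    discovery: 0
EOF

if grep -A1 '^global:' "$conf" | grep -q 'discovery: *0'; then
    echo "OK: discovery disabled in $conf"
else
    echo "WARNING: discovery not disabled in $conf" >&2
fi
```

This only checks the file, of course; it says nothing about whether the running LNet actually loaded it, which is what "lnetctl global show" reports.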
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
When you start LNet via 'modprobe lnet; lctl net up', that doesn't load the
configuration from /etc/lnet.conf. It is going to configure LNet based only on
kernel module parameters. Since you removed the 'options lnet networks' line
from your modprobe.conf file, it is going to use the default configuration,
which is @tcp on the first ethernet interface with IPv4 configured that it
finds.

To load /etc/lnet.conf you can use "systemctl start lnet.service" (or
equivalent), or if you want to do it manually:

    modprobe lnet
    lnetctl lnet configure
    lnetctl lnet import < /etc/lnet.conf

Also, I would try this for your lnet.conf:

    net:
        - net type: o2ib
          local NI(s):
            - interfaces:
                  0: ib0
    global:
        discovery: 0

Chris Horn

From: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
Date: Tuesday, September 14, 2021 at 10:17 AM
To: Horn, Chris, Riccardo Veraldi, lustre-discuss@lists.lustre.org
Subject: Re: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

So I'm a little confused.

When I take the "options lnet networks=o2ib1(ib0)" line out of the modprobe
conf file and instead put that info in the lnet.conf file, things don't work
properly.

    [root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
    options ko2iblnd map_on_demand=32
    [root@r1i1n18 lnet]# cat /etc/lnet.conf
    ip2nets:
     - net-spec: o2ib1
       interfaces:
          0: ib0
    global:
        discovery: 0
    [root@r1i1n18 lnet]# modprobe lnet
    [root@r1i1n18 lnet]# lctl network up
    LNET configured
    [root@r1i1n18 lnet]# service lustre3 start
    Mounting /ephemeral...
    mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
    FAILED.
    Mounting /nobackup...
    mount.lustre: mount 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work at /nobackup failed: No such file or directory
    Is the MGS specification correct?
    Is the filesystem name correct?
    If upgrading, is the copied client log valid? (see upgrade docs)
    FAILED.
    [root@r1i1n18 lnet]#

The logs when this happens:

    Sep 14 09:53:38 r1i1n18 kernel: LNet: Added LNI 10.159.0.39@tcp [8/256/0/180]
    Sep 14 09:53:38 r1i1n18 kernel: LNet: Accept secure, port 988
    Sep 14 09:53:54 r1i1n18 kernel: Lustre: Lustre: Build Version: 2.12.6
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_config.c:559:class_setup()) setup MGC10.150.100.30@o2ib1 failed (-2)
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:202:lustre_start_simple()) MGC10.150.100.30@o2ib1 setup error -2
    Sep 14 09:53:55 r1i1n18 kernel: LustreError: 34174:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-2)

Note the @tcp above – it looks like without the modprobe conf file, the lnet
module isn't getting set up properly. When this happens, I'm not able to shut
down lnet or unload the kernel modules to try again. The only way I've been
able to recover from this is to reboot the node. If I add the "options lnet"
stuff back to the modprobe conf file, everything works as expected. Do I not
have enough info in lnet.conf, or are both just required?

Chris, adding lnet_peer_discovery_disabled=1 to my lnet options does indeed
seem to work. Thanks!

Darby

From: "Horn, Chris"
Date: Monday, September 13, 2021 at 4:59 PM
To: Riccardo Veraldi, "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", "lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I'm not sure why lnetctl import wouldn't correctly set discovery. Might be a
bug. You can try setting the kernel module parameter to disable discovery:

    options lnet lnet_peer_discovery_disabled=1

This obviously requires LNet to be reloaded.

I would not recommend toggling discovery via the CLI, as there are some bugs
with correctly dealing with the fallout of that (peers going from MR-enabled
to MR-disabled).

Chris Horn

From: lustre-discuss on behalf of Riccardo Veraldi
Date: Monday, September 13, 2021 at 5:25 PM
To: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.], lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I suppose you removed the /etc/modprobe.d/lustre.conf completely. I only have
the lnet service enabled at startup, I do not start any lustre3 service, but I
am running lustre 2.12.0, sorry, not 2.14, so something might be different.
Did you start over with a clean configuration? Did you reboot your system to
make sure it picks up the new config? At least for me, sometimes the lnet
module does not unload correctly. Also I have to me
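Chris's module-parameter route can be staged as a modprobe.d fragment. A sketch, writing to a scratch path so it is harmless to run as-is (on a real node the file would live under /etc/modprobe.d/ and take effect only after a full LNet reload: unmount, lustre_rmmod, modprobe lnet):

```shell
#!/bin/sh
# Stage the LNet option that disables peer discovery, then verify it.
# /tmp is used here only for illustration; the real destination would be
# something like /etc/modprobe.d/lnet.conf.
f=/tmp/lnet-options.conf
printf 'options lnet lnet_peer_discovery_disabled=1\n' > "$f"
grep -q 'lnet_peer_discovery_disabled=1' "$f" && echo "option staged in $f"
```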
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
I'm not sure why lnetctl import wouldn't correctly set discovery. Might be a
bug. You can try setting the kernel module parameter to disable discovery:

    options lnet lnet_peer_discovery_disabled=1

This obviously requires LNet to be reloaded.

I would not recommend toggling discovery via the CLI, as there are some bugs
with correctly dealing with the fallout of that (peers going from MR-enabled
to MR-disabled).

Chris Horn

From: lustre-discuss on behalf of Riccardo Veraldi
Date: Monday, September 13, 2021 at 5:25 PM
To: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.], lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I suppose you removed the /etc/modprobe.d/lustre.conf completely. I only have
the lnet service enabled at startup, I do not start any lustre3 service, but I
am running lustre 2.12.0, sorry, not 2.14, so something might be different.
Did you start over with a clean configuration? Did you reboot your system to
make sure it picks up the new config? At least for me, sometimes the lnet
module does not unload correctly. Also, I have to mention that in my setup I
disabled discovery also on the OSSes, not only client side.

Generally it is not advisable to disable Multi-Rail unless you have backward
compatibility issues with older lustre peers. But disabling discovery will
also disable Multi-Rail. You can try with "lnetctl set discovery 0" as you
already did, then you do

    lnetctl -b export > /etc/lnet.conf

check that discovery is set to 0 in the file and, if not, edit it and set it
to 0. Reboot and see if things change. If you did not define any tcp interface
in lnet.conf, you should not see any tcp peers.

On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] wrote:

Thanks Rick. I removed my lnet modprobe options and adapted my lnet.conf file
to:

    # cat /etc/lnet.conf
    ip2nets:
     - net-spec: o2ib1
       interfaces:
          0: ib0
    global:
        discovery: 0
    #

Now "lnetctl export" doesn't have any reference to NIDs on the other networks,
so that's good. However, I'm still seeing some values that concern me:

    # lnetctl export | grep -e Multi -e discover | sort -u
        discovery: 1
          Multi-Rail: True
    #

Any idea why discovery is still 1 if I'm setting it to 0 in the lnet.conf
file? I'm a little concerned that with Multi-Rail still True and discovery on,
the client could still find its way back to the TCP route.

From: Riccardo Veraldi
Date: Monday, September 13, 2021 at 3:16 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", "lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I would use the configuration in /etc/lnet.conf and would no longer use the
older-style configuration in /etc/modprobe.d/lustre.conf. For example, in my
/etc/lnet.conf configuration I have:

    ip2nets:
     - net-spec: o2ib
       interfaces:
          0: ib0
     - net-spec: tcp
       interfaces:
          0: enp24s0f0
    global:
        discovery: 0

as I disabled the auto discovery. Regarding ko2iblnd you can just use
/etc/modprobe.d/ko2iblnd.conf. Mine looks like this:

    options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

Hope it helps.

Rick

On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss wrote:

Hello,

I would like to know how to turn off auto discovery of peers on a client. This
seems like it should be straightforward, but we can't get it to work. Please
fill me in on what I'm missing.

We recently upgraded our servers to 2.14. Our servers are multi-homed (1 tcp
network and 2 separate IB networks) but we want them to be single rail. On one
of our clusters we are still using the 2.12.6 client and it uses one of the IB
networks for lustre. The modprobe file from one of the client nodes:

    # cat /etc/modprobe.d/lustre.conf
    options lnet networks=o2ib1(ib0)
    options ko2iblnd map_on_demand=32
    #

The client does have a route to the TCP network. This is intended to allow
jobs on the compute nodes to access license servers, not for any serious I/O.
We recently discovered that, due to some instability in the IB fabric, the
client was trying to fail over to tcp:

    # dmesg | grep Lustre
    [  250.205912] Lustre: Lustre: Build Version: 2.12.6
    [  255.886086] Lustre: Mounted scratch-client
    [  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0] req@98deb9358480 x1709911947878336/t0(0) o9->hpfs-fsl-OST0001-osc-9880cfb8@192.52.98.33@tcp:28/4
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
I suppose you removed the /etc/modprobe.d/lustre.conf completely. I only have
the lnet service enabled at startup, I do not start any lustre3 service, but I
am running lustre 2.12.0, sorry, not 2.14, so something might be different.
Did you start over with a clean configuration? Did you reboot your system to
make sure it picks up the new config? At least for me, sometimes the lnet
module does not unload correctly. Also, I have to mention that in my setup I
disabled discovery also on the OSSes, not only client side.

Generally it is not advisable to disable Multi-Rail unless you have backward
compatibility issues with older lustre peers. But disabling discovery will
also disable Multi-Rail. You can try with "lnetctl set discovery 0" as you
already did, then you do

    lnetctl -b export > /etc/lnet.conf

check that discovery is set to 0 in the file and, if not, edit it and set it
to 0. Reboot and see if things change. If you did not define any tcp interface
in lnet.conf, you should not see any tcp peers.

On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] wrote:

Thanks Rick. I removed my lnet modprobe options and adapted my lnet.conf file
to:

    # cat /etc/lnet.conf
    ip2nets:
     - net-spec: o2ib1
       interfaces:
          0: ib0
    global:
        discovery: 0
    #

Now "lnetctl export" doesn't have any reference to NIDs on the other networks,
so that's good. However, I'm still seeing some values that concern me:

    # lnetctl export | grep -e Multi -e discover | sort -u
        discovery: 1
          Multi-Rail: True
    #

Any idea why discovery is still 1 if I'm setting it to 0 in the lnet.conf
file? I'm a little concerned that with Multi-Rail still True and discovery on,
the client could still find its way back to the TCP route.

From: Riccardo Veraldi
Date: Monday, September 13, 2021 at 3:16 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", "lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I would use the configuration in /etc/lnet.conf and would no longer use the
older-style configuration in /etc/modprobe.d/lustre.conf. For example, in my
/etc/lnet.conf configuration I have:

    ip2nets:
     - net-spec: o2ib
       interfaces:
          0: ib0
     - net-spec: tcp
       interfaces:
          0: enp24s0f0
    global:
        discovery: 0

as I disabled the auto discovery. Regarding ko2iblnd you can just use
/etc/modprobe.d/ko2iblnd.conf. Mine looks like this:

    options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

Hope it helps.

Rick

On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss wrote:

Hello,

I would like to know how to turn off auto discovery of peers on a client. This
seems like it should be straightforward, but we can't get it to work. Please
fill me in on what I'm missing.

We recently upgraded our servers to 2.14. Our servers are multi-homed (1 tcp
network and 2 separate IB networks) but we want them to be single rail. On one
of our clusters we are still using the 2.12.6 client and it uses one of the IB
networks for lustre. The modprobe file from one of the client nodes:

    # cat /etc/modprobe.d/lustre.conf
    options lnet networks=o2ib1(ib0)
    options ko2iblnd map_on_demand=32
    #

The client does have a route to the TCP network. This is intended to allow
jobs on the compute nodes to access license servers, not for any serious I/O.
We recently discovered that, due to some instability in the IB fabric, the
client was trying to fail over to tcp:

    # dmesg | grep Lustre
    [  250.205912] Lustre: Lustre: Build Version: 2.12.6
    [  255.886086] Lustre: Mounted scratch-client
    [  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0] req@98deb9358480 x1709911947878336/t0(0) o9->hpfs-fsl-OST0001-osc-9880cfb8@192.52.98.33@tcp:28/4 lens 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ rc 0/-1
    [  739.832744] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699591/real 0] req@98deb935da00 x1709911947883520/t0(0) o400->scratch-MDT-mdc-98b0f1fc0800@192.52.98.31@tcp:12/10 lens 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ rc 0/-1
    [  739.832755] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
    [  739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
Thanks Rick. I removed my lnet modprobe options and adapted my lnet.conf file
to:

    # cat /etc/lnet.conf
    ip2nets:
     - net-spec: o2ib1
       interfaces:
          0: ib0
    global:
        discovery: 0
    #

Now "lnetctl export" doesn't have any reference to NIDs on the other networks,
so that's good. However, I'm still seeing some values that concern me:

    # lnetctl export | grep -e Multi -e discover | sort -u
        discovery: 1
          Multi-Rail: True
    #

Any idea why discovery is still 1 if I'm setting it to 0 in the lnet.conf
file? I'm a little concerned that with Multi-Rail still True and discovery on,
the client could still find its way back to the TCP route.

From: Riccardo Veraldi
Date: Monday, September 13, 2021 at 3:16 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]", "lustre-discuss@lists.lustre.org"
Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic discovery

I would use the configuration in /etc/lnet.conf and would no longer use the
older-style configuration in /etc/modprobe.d/lustre.conf. For example, in my
/etc/lnet.conf configuration I have:

    ip2nets:
     - net-spec: o2ib
       interfaces:
          0: ib0
     - net-spec: tcp
       interfaces:
          0: enp24s0f0
    global:
        discovery: 0

as I disabled the auto discovery. Regarding ko2iblnd you can just use
/etc/modprobe.d/ko2iblnd.conf. Mine looks like this:

    options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

Hope it helps.

Rick

On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss wrote:

Hello,

I would like to know how to turn off auto discovery of peers on a client. This
seems like it should be straightforward, but we can't get it to work. Please
fill me in on what I'm missing.

We recently upgraded our servers to 2.14. Our servers are multi-homed (1 tcp
network and 2 separate IB networks) but we want them to be single rail. On one
of our clusters we are still using the 2.12.6 client and it uses one of the IB
networks for lustre. The modprobe file from one of the client nodes:

    # cat /etc/modprobe.d/lustre.conf
    options lnet networks=o2ib1(ib0)
    options ko2iblnd map_on_demand=32
    #

The client does have a route to the TCP network. This is intended to allow
jobs on the compute nodes to access license servers, not for any serious I/O.
We recently discovered that, due to some instability in the IB fabric, the
client was trying to fail over to tcp:

    # dmesg | grep Lustre
    [  250.205912] Lustre: Lustre: Build Version: 2.12.6
    [  255.886086] Lustre: Mounted scratch-client
    [  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0] req@98deb9358480 x1709911947878336/t0(0) o9->hpfs-fsl-OST0001-osc-9880cfb8@192.52.98.33@tcp:28/4 lens 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ rc 0/-1
    [  739.832744] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699591/real 0] req@98deb935da00 x1709911947883520/t0(0) o400->scratch-MDT-mdc-98b0f1fc0800@192.52.98.31@tcp:12/10 lens 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ rc 0/-1
    [  739.832755] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
    [  739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using this service will fail
    [  739.832769] Lustre: hpfs-fsl-MDT-mdc-9880cfb8: Connection to hpfs-fsl-MDT (at 192.52.98.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
    [ 1090.978619] LustreError: 167-0: scratch-MDT-mdc-98b0f1fc0800: This client was evicted by scratch-MDT; in progress operations using this service will fail.

I'm pretty sure this is due to the auto discovery. Again, from a client:

    # lnetctl export | grep -e Multi -e discover | sort -u
        discovery: 0
          Multi-Rail: True
    #

We want to restrict lustre to only the IB NID but it's not clear exactly how
to do that. Here is one attempt:

    [root@r1i1n18 lnet]# service lustre3 stop
    Shutting down lustre mounts
    Lustre modules successfully unloaded
    [root@r1i1n18 lnet]# lsmod | grep lnet
    [root@r1i1n18 lnet]# cat /etc/lnet.conf
    global:
        discovery: 0
    [root@r1i1n18 lnet]# service lustre3 start
    Mounting /ephemeral... done.
    Mounting /nobackup... done.
    [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
        discovery: 1
          Multi-Rail: True
    [root@r1i1n18 lnet]#

And a similar attempt (same lnet.conf file), but trying to turn off the
discovery before d
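The "lnetctl export | grep" pipeline used above works just as well against a saved dump when comparing before/after states across a reload. A sketch with a canned sample file (the YAML below is illustrative, not taken from a live node):

```shell
#!/bin/sh
# Summarize the discovery and Multi-Rail flags from a saved
# "lnetctl -b export > dump" file instead of querying a live node.
dump=/tmp/lnet-export.sample
cat > "$dump" <<'EOF'
peer:
    - primary nid: 10.150.100.30@o2ib1
      Multi-Rail: True
global:
    discovery: 1
EOF
grep -e Multi -e discover "$dump" | sort -u
```

Keeping a dump from before and after each configuration change makes it easy to show, in a report like this one, exactly when discovery flipped back to 1.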
Re: [lustre-discuss] Disabling multi-rail dynamic discovery
I would use the configuration in /etc/lnet.conf and no longer use the older-style configuration in /etc/modprobe.d/lustre.conf. For example, in my /etc/lnet.conf I have:

ip2nets:
  - net-spec: o2ib
    interfaces:
      0: ib0
  - net-spec: tcp
    interfaces:
      0: enp24s0f0
global:
  discovery: 0

as I have disabled auto discovery. Regarding ko2iblnd, you can just use /etc/modprobe.d/ko2iblnd.conf. Mine looks like this:

options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

Hope it helps.

Rick

On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss wrote:

Hello,

I would like to know how to turn off auto discovery of peers on a client. This seems like it should be straightforward, but we can't get it to work. Please fill me in on what I'm missing.

We recently upgraded our servers to 2.14. Our servers are multi-homed (1 tcp network and 2 separate IB networks) but we want them to be single rail. On one of our clusters we are still using the 2.12.6 client and it uses one of the IB networks for lustre. The modprobe file from one of the client nodes:

# cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib1(ib0)
options ko2iblnd map_on_demand=32
#

The client does have a route to the TCP network. This is intended to allow jobs on the compute nodes to access license servers, not for any serious I/O.
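(For comparison with Rick's suggestion above: a single-rail /etc/lnet.conf equivalent of this client's modprobe setup might look like the sketch below. The net-spec o2ib1 and the ib0 interface come from the modprobe line; leaving out a tcp stanza entirely is an assumption, on the theory that a net that is never configured can never be used for failover.)

```yaml
# Hypothetical /etc/lnet.conf for a single-rail o2ib1 client:
# only the IB network is declared, and peer discovery is disabled.
ip2nets:
  - net-spec: o2ib1
    interfaces:
      0: ib0
global:
  discovery: 0
```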
We recently discovered that due to some instability in the IB fabric, the client was trying to fail over to tcp:

# dmesg | grep Lustre
[  250.205912] Lustre: Lustre: Build Version: 2.12.6
[  255.886086] Lustre: Mounted scratch-client
[  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0] req@98deb9358480 x1709911947878336/t0(0) o9->hpfs-fsl-OST0001-osc-9880cfb8@192.52.98.33@tcp:28/4 lens 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ rc 0/-1
[  739.832744] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1630699591/real 0] req@98deb935da00 x1709911947883520/t0(0) o400->scratch-MDT-mdc-98b0f1fc0800@192.52.98.31@tcp:12/10 lens 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ rc 0/-1
[  739.832755] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[  739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection to MGS (at 192.52.98.30@tcp) was lost; in progress operations using this service will fail
[  739.832769] Lustre: hpfs-fsl-MDT-mdc-9880cfb8: Connection to hpfs-fsl-MDT (at 192.52.98.30@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 1090.978619] LustreError: 167-0: scratch-MDT-mdc-98b0f1fc0800: This client was evicted by scratch-MDT; in progress operations using this service will fail.

I'm pretty sure this is due to the auto discovery. Again, from a client:

# lnetctl export | grep -e Multi -e discover | sort -u
discovery: 0
Multi-Rail: True
#

We want to restrict lustre to only the IB NID, but it's not clear exactly how to do that. Here is one attempt:

[root@r1i1n18 lnet]# service lustre3 stop
Shutting down lustre mounts
Lustre modules successfully unloaded
[root@r1i1n18 lnet]# lsmod | grep lnet
[root@r1i1n18 lnet]# cat /etc/lnet.conf
global:
    discovery: 0
[root@r1i1n18 lnet]# service lustre3 start
Mounting /ephemeral... done.
Mounting /nobackup... done.
[root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
discovery: 1
Multi-Rail: True
[root@r1i1n18 lnet]#

And a similar attempt (same lnet.conf file), but trying to turn off the discovery before doing the mounts:

[root@r1i1n18 lnet]# service lustre3 stop
Shutting down lustre mounts
Lustre modules successfully unloaded
[root@r1i1n18 lnet]# modprobe lnet
[root@r1i1n18 lnet]# lnetctl set discovery 0
[root@r1i1n18 lnet]# service lustre3 start
Mounting /ephemeral... done.
Mounting /nobackup... done.
[root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
discovery: 0
Multi-Rail: True
[root@r1i1n18 lnet]#

If someone can point me in the right direction, I'd appreciate it.

Thanks,
Darby

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
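As an aside, the `lnetctl export | grep -e Multi -e discover` check used above can be made less fragile by matching the `discovery` key explicitly. A minimal sketch in Python; the sample text is illustrative, not output captured from this cluster:

```python
import re

def discovery_enabled(export_text):
    """Return True if the 'discovery' value in `lnetctl export`
    output is 1 (enabled), False if it is 0 (disabled)."""
    m = re.search(r"^\s*discovery:\s*(\d+)\s*$", export_text, re.MULTILINE)
    if m is None:
        raise ValueError("no 'discovery' key in lnetctl export output")
    return m.group(1) == "1"

# Illustrative fragment of the relevant part of `lnetctl export` output:
sample = "global:\n    discovery: 0\n"
print(discovery_enabled(sample))  # prints: False
```

On a live node the text would come from running `lnetctl export` (e.g. via subprocess) instead of the hard-coded sample.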