On Sep 14, 2021, at 11:17, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
>
> Ah yes, I see what the lnet unit file is doing. OK, I think this is all
> straightened out and working great now. We have a fairly extensive init script
> (the lustre3 script in previous posts) that does various checks in addition
> to loading modules and mounting/unmounting the filesystems. But at its core,
> the start is now doing this:
>
> /usr/bin/systemctl start lnet >& /dev/null
> modprobe lustre
> <mount lustre FS's>

Strictly speaking, the mount command itself should automatically trigger loading of the "lustre" module, so the "modprobe lustre" is redundant.
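[Given Andreas's note, the start sequence could be reduced to something like the sketch below. The mount source and target are taken from the transcripts later in this thread; adjust for your site.]

```shell
# Load LNet configuration via the packaged unit file.
/usr/bin/systemctl start lnet

# mount.lustre autoloads the lustre module, so no explicit
# "modprobe lustre" is needed before mounting.
mount -t lustre 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work /ephemeral
```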
> The stop portion does:
>
> <umount lustre FS's>
> /usr/bin/systemctl stop lnet >& /dev/null
> /usr/sbin/lustre_rmmod

In 2.15 the lustre_rmmod script will automatically run "lnetctl lnet unconfigure", and conversely lnet.service will run "lustre_rmmod" in the right places (assuming the filesystem was previously unmounted), so only one or the other is needed. Running both isn't harmful, just a bit redundant.

Cheers, Andreas

> The final conf files I'm using are:
>
> lnet.conf:
>
> net:
>     - net type: o2ib1
>       local NI(s):
>         - interfaces:
>               0: ib0
> global:
>     discovery: 0
>
> /etc/modprobe.d/lustre.conf:
>
> options ko2iblnd map_on_demand=32
>
> Using the lnet systemd unit file properly loads the configuration and shows
> discovery=0 (without any of the lnet stuff in the modprobe conf file). We
> could properly enable the lnet unit file and add a dependency to make sure
> our init script runs after the lnet service, but it's a little easier to just
> run the systemctl commands in our init script.
>
> I would be interested if others have a cleaner way to do all the mounting,
> etc. in a more native systemd manner. It probably just involves making a
> simple unit file to run a script. Probably six of one, half a dozen of the
> other, but if anyone has experience with the pros and cons, please let me
> know.
>
> Thanks a ton for the help on this. Much appreciated.
>
>> From: "Horn, Chris" <chris.h...@hpe.com>
>> Date: Tuesday, September 14, 2021 at 9:40 AM
>> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]"
>> <darby.vicke...@nasa.gov>, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>,
>> "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>> Subject: [EXTERNAL] Re: Re: [lustre-discuss] Disabling multi-rail dynamic
>> discovery
>>
>> When you start LNet via ‘modprobe lnet; lctl net up’, that doesn’t load the
>> configuration from /etc/lnet.conf. It is going to configure LNet based only
>> on kernel module parameters.
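[For the "more native systemd manner" Darby asks about above, one option is a small oneshot unit that orders the existing site script after lnet.service. This is a sketch only; the unit name and script path are hypothetical, with the script standing in for the lustre3 init script from the thread.]

```ini
# /etc/systemd/system/lustre-client.service -- hypothetical example
[Unit]
Description=Mount Lustre filesystems
Requires=lnet.service
After=lnet.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/sbin/lustre3 start
ExecStop=/usr/local/sbin/lustre3 stop

[Install]
WantedBy=multi-user.target
```

[With something like this enabled, the ordering dependency replaces calling systemctl from inside the script; functionally it is, as Darby says, six of one, half a dozen of the other.]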
>> Since you removed the ‘options lnet networks’ line from your modprobe conf
>> file, it is going to use the default configuration, which is @tcp on the
>> first ethernet interface with IPv4 configured that it finds.
>>
>> To load /etc/lnet.conf you can use systemctl start lnet.service (or
>> equivalent), or if you want to do it manually:
>>
>> modprobe lnet
>> lnetctl lnet configure
>> lnetctl lnet import < /etc/lnet.conf
>>
>> Also, I would try this for your lnet.conf:
>>
>> net:
>>     - net type: o2ib
>>       local NI(s):
>>         - interfaces:
>>               0: ib0
>> global:
>>     discovery: 0
>>
>> Chris Horn
>>
>> From: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
>> <darby.vicke...@nasa.gov>
>> Date: Tuesday, September 14, 2021 at 10:17 AM
>> To: Horn, Chris <chris.h...@hpe.com>, Riccardo Veraldi
>> <riccardo.vera...@cnaf.infn.it>, lustre-discuss@lists.lustre.org
>> <lustre-discuss@lists.lustre.org>
>> Subject: Re: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic
>> discovery
>>
>> So I'm a little confused.
>>
>> When I take the "options lnet networks=o2ib1(ib0)" line out of the modprobe
>> conf file and instead put that info in the lnet.conf file, things don't
>> work properly.
>>
>> [root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
>> options ko2iblnd map_on_demand=32
>> [root@r1i1n18 lnet]# cat /etc/lnet.conf
>> ip2nets:
>>   - net-spec: o2ib1
>>     interfaces:
>>         0: ib0
>> global:
>>     discovery: 0
>> [root@r1i1n18 lnet]# modprobe lnet
>> [root@r1i1n18 lnet]# lctl network up
>> LNET configured
>> [root@r1i1n18 lnet]# service lustre3 start
>> Mounting /ephemeral... mount.lustre: mount
>> 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed:
>> No such file or directory
>> Is the MGS specification correct?
>> Is the filesystem name correct?
>> If upgrading, is the copied client log valid? (see upgrade docs)
>> FAILED.
>> Mounting /nobackup...
>> mount.lustre: mount
>> 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work at /nobackup failed:
>> No such file or directory
>> Is the MGS specification correct?
>> Is the filesystem name correct?
>> If upgrading, is the copied client log valid? (see upgrade docs)
>> FAILED.
>> [root@r1i1n18 lnet]#
>>
>> The logs when this happens:
>>
>> Sep 14 09:53:38 r1i1n18 kernel: LNet: Added LNI 10.159.0.39@tcp [8/256/0/180]
>> Sep 14 09:53:38 r1i1n18 kernel: LNet: Accept secure, port 988
>> Sep 14 09:53:54 r1i1n18 kernel: Lustre: Lustre: Build Version: 2.12.6
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError:
>> 34174:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError:
>> 34174:0:(obd_config.c:559:class_setup()) setup MGC10.150.100.30@o2ib1 failed
>> (-2)
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError:
>> 34174:0:(obd_mount.c:202:lustre_start_simple()) MGC10.150.100.30@o2ib1 setup
>> error -2
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError:
>> 34174:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-2)
>>
>> Note the @tcp above – it looks like without the modprobe conf file, the lnet
>> module isn't getting set up properly. When this happens, I'm not able to
>> shut down lnet or unload the kernel modules to try again. The only way I've
>> been able to recover from this is to reboot the node. If I add the "options
>> lnet" line back to the modprobe conf file, everything works as expected.
>> Do I not have enough info in lnet.conf, or are both just required?
>>
>> Chris, adding lnet_peer_discovery_disabled=1 to my lnet options does indeed
>> seem to work. Thanks!
>>
>> Darby
>>
>> From: "Horn, Chris" <chris.h...@hpe.com>
>> Date: Monday, September 13, 2021 at 4:59 PM
>> To: Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>, "Vicker, Darby J.
>> (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicke...@nasa.gov>,
>> "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>> Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic
>> discovery
>>
>> I’m not sure why lnetctl import wouldn’t correctly set discovery. Might be
>> a bug. You can try setting the kernel module parameter to disable discovery:
>>
>> options lnet lnet_peer_discovery_disabled=1
>>
>> This obviously requires LNet to be reloaded.
>>
>> I would not recommend toggling discovery via the CLI, as there are some bugs
>> in correctly dealing with the fallout of that (peers going from MR-enabled
>> to MR-disabled).
>>
>> Chris Horn
>>
>> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of
>> Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
>> Date: Monday, September 13, 2021 at 5:25 PM
>> To: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
>> <darby.vicke...@nasa.gov>, lustre-discuss@lists.lustre.org
>> <lustre-discuss@lists.lustre.org>
>> Subject: Re: [lustre-discuss] Disabling multi-rail dynamic discovery
>>
>> I suppose you removed /etc/modprobe.d/lustre.conf completely.
>> I only have the lnet service enabled at startup; I do not start any lustre3
>> service. But I am running lustre 2.12.0, sorry, not 2.14, so something might
>> be different.
>> Did you start over with a clean configuration?
>> Did you reboot your system to make sure it picks up the new config? At least
>> for me, sometimes the lnet module does not unload correctly.
>> I should also mention that in my setup I disabled discovery on the OSSes as
>> well, not only on the client side.
>> Generally it is not advisable to disable Multi-rail unless you have backward
>> compatibility issues with older lustre peers.
>> But disabling discovery will also disable Multi-rail.
>> You can try with
>>
>> lnetctl set discovery 0
>>
>> as you already did, then you do
>>
>> lnetctl -b export > /etc/lnet.conf
>>
>> Check that discovery is set to 0 in the file, and if not, edit it and set it
>> to 0. Reboot and see if things change.
>> In any case, if you did not define any tcp interface in lnet.conf, you
>> should not see any tcp peers.
>>
>> On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
>> wrote:
>>> Thanks Rick. I removed my lnet modprobe options and adapted my lnet.conf
>>> file to:
>>>
>>> # cat /etc/lnet.conf
>>> ip2nets:
>>>   - net-spec: o2ib1
>>>     interfaces:
>>>         0: ib0
>>> global:
>>>     discovery: 0
>>> #
>>>
>>> Now "lnetctl export" doesn't have any reference to NIDs on the other
>>> networks, so that's good. However, I'm still seeing some values that
>>> concern me:
>>>
>>> # lnetctl export | grep -e Multi -e discover | sort -u
>>>       discovery: 1
>>>       Multi-Rail: True
>>> #
>>>
>>> Any idea why discovery is still 1 if I'm setting it to 0 in the lnet.conf
>>> file? I'm a little concerned that with Multi-Rail still True and discovery
>>> on, the client could still find its way back to the TCP route.
>>>
>>> From: Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
>>> Date: Monday, September 13, 2021 at 3:16 PM
>>> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]"
>>> <darby.vicke...@nasa.gov>, "lustre-discuss@lists.lustre.org"
>>> <lustre-discuss@lists.lustre.org>
>>> Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic
>>> discovery
>>>
>>> I would use the configuration in /etc/lnet.conf and would no longer use
>>> the older-style configuration in /etc/modprobe.d/lustre.conf.
>>> For example, in my /etc/lnet.conf I have:
>>>
>>> ip2nets:
>>>   - net-spec: o2ib
>>>     interfaces:
>>>         0: ib0
>>>   - net-spec: tcp
>>>     interfaces:
>>>         0: enp24s0f0
>>> global:
>>>     discovery: 0
>>>
>>> as I disabled the auto discovery.
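[For reference, Riccardo's two-network layout above can also be written in the "net:" syntax that Chris suggests elsewhere in this thread, and which is what ended up working in Darby's final lnet.conf. A sketch, reusing the interface names from Riccardo's example:]

```yaml
# /etc/lnet.conf -- sketch of the same config in "net:" syntax
net:
    - net type: o2ib
      local NI(s):
        - interfaces:
              0: ib0
    - net type: tcp
      local NI(s):
        - interfaces:
              0: enp24s0f0
global:
    discovery: 0
```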
>>> Regarding ko2iblnd, you can just use /etc/modprobe.d/ko2iblnd.conf.
>>> Mine looks like this:
>>>
>>> options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048
>>> map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1
>>> conns_per_peer=4
>>>
>>> Hope it helps.
>>> Rick
>>>
>>> On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]
>>> via lustre-discuss wrote:
>>>> Hello,
>>>>
>>>> I would like to know how to turn off auto discovery of peers on a client.
>>>> This seems like it should be straightforward, but we can't get it to
>>>> work. Please fill me in on what I'm missing.
>>>>
>>>> We recently upgraded our servers to 2.14. Our servers are multi-homed
>>>> (1 tcp network and 2 separate IB networks), but we want them to be
>>>> single rail. On one of our clusters we are still using the 2.12.6
>>>> client, and it uses one of the IB networks for lustre. The modprobe file
>>>> from one of the client nodes:
>>>>
>>>> # cat /etc/modprobe.d/lustre.conf
>>>> options lnet networks=o2ib1(ib0)
>>>> options ko2iblnd map_on_demand=32
>>>> #
>>>>
>>>> The client does have a route to the TCP network. This is intended to
>>>> allow jobs on the compute nodes to access license servers, not for any
>>>> serious I/O.
>>>> We recently discovered that, due to some instability in the IB fabric,
>>>> the client was trying to fail over to tcp:
>>>>
>>>> # dmesg | grep Lustre
>>>> [  250.205912] Lustre: Lustre: Build Version: 2.12.6
>>>> [  255.886086] Lustre: Mounted scratch-client
>>>> [  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request())
>>>> @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0]
>>>> req@ffff98deb9358480 x1709911947878336/t0(0)
>>>> o9->hpfs-fsl-OST0001-osc-ffff9880cfb80000@192.52.98.33@tcp:28/4 lens
>>>> 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>>>> [  739.832744] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request())
>>>> @@@ Request sent has timed out for sent delay: [sent 1630699591/real 0]
>>>> req@ffff98deb935da00 x1709911947883520/t0(0)
>>>> o400->scratch-MDT0000-mdc-ffff98b0f1fc0800@192.52.98.31@tcp:12/10 lens
>>>> 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>>>> [  739.832755] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request())
>>>> Skipped 5 previous similar messages
>>>> [  739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection to
>>>> MGS (at 192.52.98.30@tcp) was lost; in progress operations using this
>>>> service will fail
>>>> [  739.832769] Lustre: hpfs-fsl-MDT0000-mdc-ffff9880cfb80000: Connection
>>>> to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress
>>>> operations using this service will wait for recovery to complete
>>>> [ 1090.978619] LustreError: 167-0: scratch-MDT0000-mdc-ffff98b0f1fc0800:
>>>> This client was evicted by scratch-MDT0000; in progress operations using
>>>> this service will fail.
>>>>
>>>> I'm pretty sure this is due to the auto discovery. Again, from a client:
>>>>
>>>> # lnetctl export | grep -e Multi -e discover | sort -u
>>>>       discovery: 0
>>>>       Multi-Rail: True
>>>> #
>>>>
>>>> We want to restrict lustre to only the IB NID, but it's not clear exactly
>>>> how to do that.
>>>> Here is one attempt:
>>>>
>>>> [root@r1i1n18 lnet]# service lustre3 stop
>>>> Shutting down lustre mounts
>>>> Lustre modules successfully unloaded
>>>> [root@r1i1n18 lnet]# lsmod | grep lnet
>>>> [root@r1i1n18 lnet]# cat /etc/lnet.conf
>>>> global:
>>>>     discovery: 0
>>>> [root@r1i1n18 lnet]# service lustre3 start
>>>> Mounting /ephemeral... done.
>>>> Mounting /nobackup... done.
>>>> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>>>>       discovery: 1
>>>>       Multi-Rail: True
>>>> [root@r1i1n18 lnet]#
>>>>
>>>> And a similar attempt (same lnet.conf file), but trying to turn off
>>>> discovery before doing the mounts:
>>>>
>>>> [root@r1i1n18 lnet]# service lustre3 stop
>>>> Shutting down lustre mounts
>>>> Lustre modules successfully unloaded
>>>> [root@r1i1n18 lnet]# modprobe lnet
>>>> [root@r1i1n18 lnet]# lnetctl set discovery 0
>>>> [root@r1i1n18 lnet]# service lustre3 start
>>>> Mounting /ephemeral... done.
>>>> Mounting /nobackup... done.
>>>> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>>>>       discovery: 0
>>>>       Multi-Rail: True
>>>> [root@r1i1n18 lnet]#
>>>>
>>>> If someone can point me in the right direction, I'd appreciate it.
>>>>
>>>> Thanks,
>>>> Darby

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org