On Sep 14, 2021, at 11:17, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, 
Inc.] via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:
> 
> Ah yes, I see what the lnet unit file is doing.  OK, I think this is all 
> straightened out and working great now.  We have a fairly extensive init script 
> (the lustre3 script in previous posts) that does various checks in addition 
> to loading modules and mounting/unmounting the filesystems.  But at its core, 
> the start is now doing this:
>  
>    /usr/bin/systemctl start lnet  >& /dev/null
>    modprobe lustre
>     <mount lustre FS's>
 
Strictly speaking, the mount command itself should automatically trigger 
"lustre" module loading, so the "modprobe lustre" is redundant.

> The stop portion does:
>           
>     <umount lustre FS's>
>     /usr/bin/systemctl stop lnet  >& /dev/null
>     /usr/sbin/lustre_rmmod
 
In 2.15 the lustre_rmmod script will automatically run "lnetctl lnet 
unconfigure", and conversely lnet.service will run "lustre_rmmod" in the right 
places (assuming the filesystem was previously unmounted), so only one or the 
other will be needed.  Running both isn't harmful, just a bit redundant.
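
A minimal sketch of that start/stop ordering, in plain POSIX shell.  The 
function names, the run() wrapper and the DRY_RUN variable are made up for 
illustration and are not part of Lustre; the devices and mount points are 
copied from the mount errors quoted further down.  With DRY_RUN=1 (the default 
here) the commands are only printed, not executed:

```shell
#!/bin/sh
# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@" < /dev/null
    fi
}

# "device mountpoint" pairs, one per line (values taken from this thread).
lustre_mounts() {
    cat <<'EOF'
10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work /ephemeral
10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work /nobackup
EOF
}

lustre_start() {
    # lnet.service is what loads /etc/lnet.conf.
    run systemctl start lnet || return 1
    # No "modprobe lustre": the mount itself triggers the module load.
    lustre_mounts | while read -r dev mnt; do
        run mount -t lustre "$dev" "$mnt" || exit 1
    done
}

lustre_stop() {
    lustre_mounts | while read -r dev mnt; do
        run umount "$mnt"
    done
    # Per the note above, on 2.15 either one of these two should be enough.
    run systemctl stop lnet
    run lustre_rmmod
}
```

Calling lustre_start as-is prints the three commands in order; setting 
DRY_RUN=0 would actually run them (as root).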

Cheers, Andreas

>  
> The final conf files I'm using are:
>  
> lnet.conf:
>  
> net:
>     - net type: o2ib1
>       local NI(s):
>         - interfaces:
>               0: ib0
> global:
>     discovery: 0
>  
>  
>  
> /etc/modprobe.d/lustre.conf:
>  
> options ko2iblnd map_on_demand=32
>  
>  
>  
> Using the lnet systemd unit file properly loads the configuration and shows 
> discovery=0 (without any of the lnet stuff in the modprobe conf file).  We could 
> properly enable the lnet unit file and make a dependency to make sure our 
> init script runs after the lnet service, but it's a little easier to just run 
> the systemctl commands in our init script. 
>  
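
If it helps to check this from the init script, here is a small sketch that 
reads "lnetctl export" output on stdin and fails unless discovery is 0.  Both 
function names are hypothetical, and it assumes the "discovery: N" line looks 
the way it does in the output quoted in this thread:

```shell
#!/bin/sh
# Print the value of the first "discovery:" key found in YAML on stdin.
discovery_value() {
    awk '$1 == "discovery:" { print $2; exit }'
}

# Succeed only if discovery is reported as 0.
check_discovery_off() {
    val="$(discovery_value)"
    if [ "$val" = "0" ]; then
        echo "discovery is off"
    else
        echo "discovery is '${val:-unset}', expected 0" >&2
        return 1
    fi
}
```

Intended usage would be "lnetctl export | check_discovery_off" after the lnet 
service has started and before the filesystems are mounted.
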
> I would be interested if others have a cleaner way to do all mounting, etc. 
> in a more native systemd manner.  It probably just involves making a simple 
> unit file to run a script.  Probably six of one, half dozen of the other but 
> if anyone has experience with the pros and cons, please let me know. 
>  
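
On the "simple unit file to run a script" idea, one possible shape is sketched 
below.  This is an untested assumption, not something shipped with Lustre: the 
unit name lustre-mounts.service, the function name and the /etc/init.d/lustre3 
path are placeholders.  The function only writes the file into the directory 
you give it, so nothing under /etc is touched unless you point it there:

```shell
#!/bin/sh
# Write a hypothetical oneshot unit that wraps the site init script.
# $1 is the target directory (defaults to the current directory).
write_lustre_unit() {
    dir="${1:-.}"
    cat > "$dir/lustre-mounts.service" <<'EOF'
[Unit]
Description=Mount Lustre filesystems via the site init script
# Pull in lnet.service and order this unit after it.
Requires=lnet.service
After=lnet.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Assumed location of the "lustre3" script mentioned in this thread.
ExecStart=/etc/init.d/lustre3 start
ExecStop=/etc/init.d/lustre3 stop

[Install]
WantedBy=multi-user.target
EOF
}
```

With Requires= and After= in place, the "systemctl start lnet" call inside the 
script becomes redundant, though it should be harmless to leave it in.
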
> Thanks a ton for the help on this.  Much appreciated. 
>  
>  
>  
>> From: "Horn, Chris" <chris.h...@hpe.com>
>> Date: Tuesday, September 14, 2021 at 9:40 AM
>> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
>> <darby.vicke...@nasa.gov>, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>, 
>> "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>> Subject: [EXTERNAL] Re: Re: [lustre-discuss] Disabling multi-rail dynamic 
>> discovery
>>  
>> When you start LNet via ‘modprobe lnet; lctl net up’, that doesn’t load the 
>> configuration from /etc/lnet.conf. It configures LNet based only 
>> on kernel module parameters. Since you removed the ‘options lnet networks’ 
>> line from your modprobe.conf file, it uses the default configuration, 
>> which is @tcp on the first ethernet interface it finds with IPv4 
>> configured.
>> 
>> To load /etc/lnet.conf you can use systemctl start lnet.service (or 
>> equivalent), or if you want to do it manually:
>> 
>> modprobe lnet
>> lnetctl lnet configure
>> lnetctl lnet import < /etc/lnet.conf
>> 
>> Also, I would try this for your lnet.conf
>> 
>> net:
>>     - net type: o2ib
>>       local NI(s):
>>         - interfaces:
>>               0: ib0
>> global:
>>     discovery: 0
>> 
>> Chris Horn
>>  
>> From: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
>> <darby.vicke...@nasa.gov>
>> Date: Tuesday, September 14, 2021 at 10:17 AM
>> To: Horn, Chris <chris.h...@hpe.com>, Riccardo Veraldi 
>> <riccardo.vera...@cnaf.infn.it>, lustre-discuss@lists.lustre.org 
>> <lustre-discuss@lists.lustre.org>
>> Subject: Re: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic 
>> discovery
>> 
>> So I'm a little confused.  
>>  
>> When I take the "options lnet networks=o2ib1(ib0)"  line out of the modprobe 
>> conf file and instead put that info in the lnet.conf file, things don't work 
>> properly. 
>>  
>> [root@r1i1n18 lnet]# cat /etc/modprobe.d/lustre.conf
>> options ko2iblnd map_on_demand=32
>> [root@r1i1n18 lnet]# cat /etc/lnet.conf
>> ip2nets:
>> - net-spec: o2ib1
>>    interfaces:
>>       0: ib0
>> global:
>>     discovery: 0
>> [root@r1i1n18 lnet]# modprobe lnet
>> [root@r1i1n18 lnet]# lctl network up
>> LNET configured
>> [root@r1i1n18 lnet]# service lustre3 start
>> Mounting /ephemeral... mount.lustre: mount 
>> 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/scratch/work at /ephemeral failed: 
>> No such file or directory
>> Is the MGS specification correct?
>> Is the filesystem name correct?
>> If upgrading, is the copied client log valid? (see upgrade docs)
>> FAILED.
>> Mounting /nobackup... mount.lustre: mount 
>> 10.150.100.30@o2ib1:10.150.100.31@o2ib1:/hpfs-fsl/work at /nobackup failed: 
>> No such file or directory
>> Is the MGS specification correct?
>> Is the filesystem name correct?
>> If upgrading, is the copied client log valid? (see upgrade docs)
>> FAILED.
>> [root@r1i1n18 lnet]#
>>  
>>  
>> The logs when this happens:
>>  
>> Sep 14 09:53:38 r1i1n18 kernel: LNet: Added LNI 10.159.0.39@tcp [8/256/0/180]
>> Sep 14 09:53:38 r1i1n18 kernel: Lnet: Accept secure, port 988
>> Sep 14 09:53:54 r1i1n18 kernel: Lustre: Lustre: Build Version: 2.12.6
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError: 
>> 34174:0:(ldlm_lib.c:494:client_obd_setup()) can't add initial connection
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError: 
>> 34174:0:(obd_config.c:559:class_setup()) setup MGC10.150.100.30@o2ib1 failed 
>> (-2)
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError: 
>> 34174:0:(obd_mount.c:202:lustre_start_simple()) MGC10.150.100.30@o2ib1 setup 
>> error -2
>> Sep 14 09:53:55 r1i1n18 kernel: LustreError: 
>> 34174:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-2)
>>  
>> Note the @tcp above – it looks like without the modprobe conf file, the lnet 
>> module isn't getting set up properly.  When this happens, I'm not able to 
>> shut down lnet or unload the kernel modules to try again.  The only way I've 
>> been able to recover from this is to reboot the node.  If I add the "options 
>> lnet" stuff back to the modprobe conf file, everything works as expected.  
>> Do I not have enough info in lnet.conf or are both just required? 
>>  
>> Chris, adding lnet_peer_discovery_disabled=1 to my lnet options does indeed 
>> seem to work.  Thanks! 
>>  
>> Darby
>>  
>>  
>> From: "Horn, Chris" <chris.h...@hpe.com>
>> Date: Monday, September 13, 2021 at 4:59 PM
>> To: Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>, "Vicker, Darby J. 
>> (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicke...@nasa.gov>, 
>> "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>> Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic 
>> discovery
>>  
>> I’m not sure why lnetctl import wouldn’t correctly set discovery. Might be a 
>> bug. You can try setting the kernel module parameter to disable discovery:
>> 
>> options lnet lnet_peer_discovery_disabled=1
>> 
>> This obviously requires LNet to be reloaded.
>> 
>> I would not recommend toggling discovery via the CLI as there are some bugs 
>> with correctly dealing with the fallout of that (peers going from MR enabled 
>> to MR disabled).
>> 
>> Chris Horn
>>  
>> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
>> Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
>> Date: Monday, September 13, 2021 at 5:25 PM
>> To: Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
>> <darby.vicke...@nasa.gov>, lustre-discuss@lists.lustre.org 
>> <lustre-discuss@lists.lustre.org>
>> Subject: Re: [lustre-discuss] Disabling multi-rail dynamic discovery
>> 
>> I suppose you removed /etc/modprobe.d/lustre.conf completely.
>> I only have the lnet service enabled at startup; I do not start any lustre3 
>> service. But I am running lustre 2.12.0 (sorry, not 2.14),
>> so something might be different.
>> Did you start over with a clean configuration?
>> Did you reboot your system to make sure it picks up the new config? At 
>> least for me, sometimes the lnet module does not unload correctly.
>> Also I have to mention in my setup I did disable discovery also on the OSSes 
>> not only client side.
>> Generally it is not advisable to disable Multi-rail unless you have backward 
>> compatibility issues with older lustre peers.
>> But disabling discovery will also disable Multi-rail.
>> You can try 
>> lnetctl set discovery 0
>> as you already did,
>> then run
>> lnetctl -b export > /etc/lnet.conf
>> Check that discovery is set to 0 in the file; if not, edit it and set it to 0.
>> Reboot and see if things change.
>> In any case, if you did not define any tcp interface in lnet.conf, you should not 
>> see any tcp peers.
>>  
>> On 9/13/21 2:59 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
>> wrote:
>>> Thanks Rick.  I removed my lnet modprobe options and adapted my lnet.conf 
>>> file to:
>>>  
>>> # cat /etc/lnet.conf
>>> ip2nets:
>>> - net-spec: o2ib1
>>>    interfaces:
>>>       0: ib0
>>> global:
>>>     discovery: 0
>>> #
>>>  
>>>  
>>> Now "lnetctl export" doesn't have any reference to NIDs on the other 
>>> networks, so that's good.  However, I'm still seeing some values that 
>>> concern me:
>>>  
>>>  
>>> # lnetctl export | grep -e Multi -e discover | sort -u
>>>     discovery: 1
>>>       Multi-Rail: True
>>> #
>>>  
>>> Any idea why discovery is still 1 if I'm setting it to 0 in the 
>>> lnet.conf file?  I'm a little concerned that with Multi-Rail still True and 
>>> discovery on, the client could still find its way back to the TCP route.  
>>>  
>>>  
>>> From: Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
>>> Date: Monday, September 13, 2021 at 3:16 PM
>>> To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
>>> <darby.vicke...@nasa.gov>, "lustre-discuss@lists.lustre.org" 
>>> <lustre-discuss@lists.lustre.org>
>>> Subject: [EXTERNAL] Re: [lustre-discuss] Disabling multi-rail dynamic 
>>> discovery
>>>  
>>> I would put the configuration in /etc/lnet.conf and would no longer use the 
>>> older-style configuration in
>>> /etc/modprobe.d/lustre.conf.
>>> For example, in my /etc/lnet.conf configuration I have: 
>>> ip2nets:
>>>  - net-spec: o2ib
>>>    interfaces:
>>>       0: ib0
>>>  - net-spec: tcp
>>>    interfaces:
>>>       0: enp24s0f0
>>> global:
>>>     discovery: 0
>>> As I disabled the auto discovery.
>>> Regarding ko2ib you can just use /etc/modprobe.d/ko2iblnd.conf
>>> Mine looks like this:
>>> options ko2iblnd peer_credits=128 peer_credits_hiw=64 credits=1024 ntx=2048 
>>> map_on_demand=256 fmr_pool_size=2048 fmr_flush_trigger=512 fmr_cache=1 
>>> conns_per_peer=4
>>> Hope it helps.
>>> Rick
>>>  
>>> On 9/13/21 1:53 PM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] 
>>> via lustre-discuss wrote:
>>>> Hello,
>>>>  
>>>> I would like to know how to turn off auto discovery of peers on a client.  
>>>> This seems like it should be straightforward but we can't get it to work. 
>>>> Please fill me in on what I'm missing. 
>>>>  
>>>> We recently upgraded our servers to 2.14.  Our servers are multi-homed (1 
>>>> tcp network and 2 separate IB networks) but we want them to be single 
>>>> rail.  On one of our clusters we are still using the 2.12.6 client and it 
>>>> uses one of the IB networks for lustre.  The modprobe file from one of the 
>>>> client nodes:
>>>>  
>>>>  
>>>> # cat /etc/modprobe.d/lustre.conf
>>>> options lnet networks=o2ib1(ib0)
>>>> options ko2iblnd map_on_demand=32
>>>> #
>>>>  
>>>>  
>>>> The client does have a route to the TCP network.  This is intended to 
>>>> allow jobs on the compute nodes to access license servers, not for any 
>>>> serious I/O.  We recently discovered that due to some instability in the 
>>>> IB fabric, the client was trying to fail over to tcp:
>>>>  
>>>>  
>>>> # dmesg | grep Lustre
>>>> [  250.205912] Lustre: Lustre: Build Version: 2.12.6
>>>> [  255.886086] Lustre: Mounted scratch-client
>>>> [  287.247547] Lustre: 3472:0:(client.c:2146:ptlrpc_expire_one_request()) 
>>>> @@@ Request sent has timed out for sent delay: [sent 1630699139/real 0]  
>>>> req@ffff98deb9358480 x1709911947878336/t0(0) 
>>>> o9->hpfs-fsl-OST0001-osc-ffff9880cfb80000@192.52.98.33@tcp:28/4 lens 
>>>> 224/224 e 0 to 1 dl 1630699145 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>>>> [  739.832744] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) 
>>>> @@@ Request sent has timed out for sent delay: [sent 1630699591/real 0]  
>>>> req@ffff98deb935da00 x1709911947883520/t0(0) 
>>>> o400->scratch-MDT0000-mdc-ffff98b0f1fc0800@192.52.98.31@tcp:12/10 lens 
>>>> 224/224 e 0 to 1 dl 1630699598 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
>>>> [  739.832755] Lustre: 3526:0:(client.c:2146:ptlrpc_expire_one_request()) 
>>>> Skipped 5 previous similar messages
>>>> [  739.832762] LustreError: 166-1: MGC10.150.100.30@o2ib1: Connection to 
>>>> MGS (at 192.52.98.30@tcp) was lost; in progress operations using this 
>>>> service will fail
>>>> [  739.832769] Lustre: hpfs-fsl-MDT0000-mdc-ffff9880cfb80000: Connection 
>>>> to hpfs-fsl-MDT0000 (at 192.52.98.30@tcp) was lost; in progress operations 
>>>> using this service will wait for recovery to complete
>>>> [ 1090.978619] LustreError: 167-0: scratch-MDT0000-mdc-ffff98b0f1fc0800: 
>>>> This client was evicted by scratch-MDT0000; in progress operations using 
>>>> this service will fail.
>>>>  
>>>>  
>>>> I'm pretty sure this is due to the auto discovery.  Again, from a client:
>>>>  
>>>>  
>>>> # lnetctl export | grep -e Multi -e discover | sort -u
>>>>     discovery: 0
>>>>       Multi-Rail: True
>>>> # 
>>>>  
>>>>  
>>>> We want to restrict lustre to only the IB NID but it's not clear exactly 
>>>> how to do that. 
>>>>  
>>>> Here is one attempt:
>>>> 
>>>> 
>>>> [root@r1i1n18 lnet]# service lustre3 stop
>>>> Shutting down lustre mounts
>>>> Lustre modules successfully unloaded
>>>> [root@r1i1n18 lnet]# lsmod | grep lnet
>>>> [root@r1i1n18 lnet]# cat /etc/lnet.conf
>>>> global:
>>>>     discovery: 0
>>>> [root@r1i1n18 lnet]# service lustre3 start
>>>> Mounting /ephemeral... done.
>>>> Mounting /nobackup... done.
>>>> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>>>>     discovery: 1
>>>>       Multi-Rail: True
>>>> [root@r1i1n18 lnet]#
>>>>  
>>>>  
>>>> And a similar attempt (same lnet.conf file), but trying to turn off the 
>>>> discovery before doing the mounts:
>>>>  
>>>>  
>>>> [root@r1i1n18 lnet]# service lustre3 stop
>>>> Shutting down lustre mounts 
>>>> Lustre modules successfully unloaded
>>>> [root@r1i1n18 lnet]# modprobe lnet
>>>> [root@r1i1n18 lnet]# lnetctl set discovery 0
>>>> [root@r1i1n18 lnet]# service lustre3 start
>>>> Mounting /ephemeral... done.
>>>> Mounting /nobackup... done.
>>>> [root@r1i1n18 lnet]# lnetctl export | grep -e Multi -e discover | sort -u
>>>>     discovery: 0
>>>>       Multi-Rail: True
>>>> [root@r1i1n18 lnet]# 
>>>>  
>>>> If someone can point me in the right direction, I'd appreciate it. 
>>>>  
>>>> Thanks,
>>>> Darby

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






