Hi Riccardo,

I would check whether the OSTs on this OSS have been registered with the
correct NIDs (o2ib1) on the MGS:

$ lctl --device MGS llog_print <fsname>-client

Then look for the NIDs in the setup/add_conn records for the OSTs in question.
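
To make those records easier to spot, the output can be filtered. This is only a minimal sketch: filter_conn_records is a hypothetical helper name, not part of Lustre, and it assumes the relevant lines contain the words "setup" or "add_conn", as in the record names mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Lustre): read llog_print output on stdin
# and keep only the lines that mention setup or add_conn.
filter_conn_records() {
    grep -E 'setup|add_conn'
}

# Intended use on the MGS node; <fsname> stays a placeholder for the real
# file system name:
#   lctl --device MGS llog_print <fsname>-client | filter_conn_records
```

If the registration is correct, the remaining lines for the OSTs on this OSS should presumably reference the o2ib1 NID (172.21.164.116@o2ib1) rather than the tcp1 one.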

Best,

Stephane



> On Sep 28, 2021, at 9:52 AM, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> 
> wrote:
> 
> Hello.
> 
> I have a Lustre setup where the MDS (172.21.156.112) is on tcp1 while the
> OSSes are on o2ib1.
> 
> I am using Lustre 2.12.7 on RHEL 7.9.
> 
> All the clients can see the MDS correctly as a tcp1 peer:
> 
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
> 
> 
> This is by design because the MDS has no IB interface. So the MDS to OSSes 
> traffic and MDS to Clients traffic is on tcp1, while clients to OSSes traffic 
> is meant to be on o2ib1.
> 
> I have 1 MDS (tcp1), 12 OSSes (tcp1, o2ib1), and 20 clients
> (tcp1, o2ib1).
> 
> Everything works except for one of the OSSes (172.21.164.116@o2ib1,
> 172.21.156.102@tcp1).
> 
> Even though it is configured the same as all the others, traffic to it only
> goes over tcp1 and not o2ib1.
> 
> Even if I force the peer settings to use o2ib1, they are ignored and the tcp1
> peer is added anyway.
> 
> This is lnet.conf on the MDS:
> 
> ip2nets:
>  - net-spec: o2ib1
>    interfaces:
>       0: ib0
>  - net-spec: tcp1
>    interfaces:
>       0: eno1
> global:
>     discovery: 0
> 
> 
> 
> This is lnet.conf on the OSSes:
> 
> ip2nets:
>  - net-spec: o2ib1
>    interfaces:
>       0: ib0
>  - net-spec: tcp1
>    interfaces:
>       0: enp1s0f0
> global:
>     discovery: 0
> 
> 
> 
> I also tried this on the Lustre client side:
> 
> peer:
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
> 
> to force the peer onto o2ib1.
> 
> This is ignored, and the peer is still added with its tcp1 LNet NID:
> 
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
> 
> All of the hosts involved have discovery set to 0.
> 
> Nevertheless, the peer entry for that specific OSS uses tcp1 and not
> o2ib1.
> 
> This is disruptive because traffic to that specific OSS goes over tcp1, which
> is of course slower than IB.
> 
> I had to deactivate the OSTs on that specific OSS.
> 
> How can I fix this issue?
> 
> Here is the complete peer list from the Lustre client side. As you can see,
> that specific OSS is included as a tcp1 peer.
> 
> Even if I run "lnetctl peer del --nid 172.21.156.102@tcp1 --prim_nid
> 172.21.156.102@tcp1", the entry is re-added automatically after a while.
> 
> lnetctl peer show
> peer:
>     - primary nid: 172.21.156.112@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.112@tcp1
>           state: NA
>     - primary nid: 172.21.164.111@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.111@o2ib1
>           state: NA
>     - primary nid: 172.21.164.117@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.117@o2ib1
>           state: NA
>     - primary nid: 172.21.164.112@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.112@o2ib1
>           state: NA
>     - primary nid: 172.21.164.119@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.119@o2ib1
>           state: NA
>     - primary nid: 172.21.164.114@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.114@o2ib1
>           state: NA
>     - primary nid: 172.21.164.120@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.120@o2ib1
>           state: NA
>     - primary nid: 172.21.156.102@tcp1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.156.102@tcp1
>           state: NA
>     - primary nid: 172.21.164.116@o2ib1
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.164.116@o2ib1
>           state: NA
>     - primary nid: 172.21.164.110@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.110@o2ib1
>           state: NA
>     - primary nid: 172.21.164.115@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.115@o2ib1
>           state: NA
>     - primary nid: 172.21.164.118@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.118@o2ib1
>           state: NA
>     - primary nid: 172.21.164.113@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.113@o2ib1
>           state: NA
>     - primary nid: 172.21.164.121@o2ib1
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.164.121@o2ib1
>           state: NA
> 
> 
> Thanks for looking at this.
> 
> Rick
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
