Take a look at this: https://jira.whamcloud.com/browse/LU-11840 Let me know if this is the same issue you're seeing.
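For reference, the reachability rule described in the quoted reply (a node can only use a peer NID that sits on a network it has configured locally) can be sketched in Python. This is only an illustration, not LNet's actual selection code: `normalize_net` and `split_reachable` are hypothetical helper names, and the NIDs are copied from the MDS LNet configuration and the "lnetctl peer show" output quoted below. It assumes a bare network name such as `tcp` means network number 0 (`tcp0`).

```python
import re


def normalize_net(net):
    """Treat a bare network type such as 'tcp' as network number 0 ('tcp0')."""
    return net if re.search(r"\d+$", net) else net + "0"


def split_reachable(local_nets, peer_nids):
    """Split peer NIDs into those on a locally configured network and the rest.

    Hypothetical helper for illustration only; it mirrors the rule from the
    thread, not the real LNet implementation.
    """
    local = {normalize_net(n) for n in local_nets}
    reachable, unreachable = [], []
    for nid in peer_nids:
        # A NID has the form <address>@<network>, e.g. 172.21.48.250@tcp
        _addr, _, net = nid.rpartition("@")
        if normalize_net(net) in local:
            reachable.append(nid)
        else:
            unreachable.append(nid)
    return reachable, unreachable


# Networks configured on the MDS, taken from the quoted configuration.
mds_nets = ["lo", "tcp"]

# Peer NIDs discovered on the 2.12.0 client, taken from the quoted peer show.
peer_nids = [
    "172.21.48.250@tcp",
    "172.21.52.88@o2ib",
    "172.21.48.250@tcp1",
    "172.21.48.250@tcp2",
]

reachable, unreachable = split_reachable(mds_nets, peer_nids)
print("reachable from the MDS:", reachable)
print("not reachable from the MDS:", unreachable)
```

Under these assumptions only the @tcp NID ends up in the reachable list, which matches the expectation that the MDS should talk to this peer over tcp only.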
On Tue, 5 Mar 2019 at 14:04, Amir Shehata <amir.shehata.whamcl...@gmail.com> wrote:

> Hi Riccardo,
>
> It's not LNet Health. It's Dynamic Discovery. What's happening is that
> 2.12 is discovering all the interfaces on the peer. That's why you see all
> the interfaces in the peer show.
>
> Multi-Rail doesn't enable o2ib. It just sees it. If the node doing the
> discovery has only tcp, then it should never try to connect over o2ib.
>
> Are you able to do a "lnetctl ping 172.21.48.250@tcp" from the MDS
> multiple times? Do you see the ping failing intermittently?
>
> What should happen is that when the MDS (running 2.12) tries to talk to
> the peer you have identified, it'll discover its interfaces. But then it
> should realize that it can only reach the peer on the tcp network, since
> that's the only network configured on the MDS.
>
> It might help if you configure only LNet on the MDS and the peer
> and run a simple test:
>
> lctl set_param debug=+"net neterror"
> lnetctl ping <>
> lctl dk >log
>
> If you can share the debug output, it'll help to pinpoint the problem.
>
> thanks
> amir
>
> On Tue, 5 Mar 2019 at 12:30, Riccardo Veraldi <
> riccardo.vera...@cnaf.infn.it> wrote:
>
>> I think I figured out the problem.
>> My problem is related to the LNet Network Health feature:
>> https://jira.whamcloud.com/browse/LU-9120
>> The Lustre MDS and the Lustre client, both running 2.12.0,
>> negotiate a Multi-Rail peer connection, while this does not happen with
>> the other clients (2.10.5). So what happens is that both IB and tcp are
>> being used during transfers.
>> tcp is only for connecting to the MDS and IB only for connecting to the
>> OSS; even so, Multi-Rail is enabled by default between the MDS, OSS and
>> client. This messes up the situation. The MDS has only one TCP interface
>> and cannot communicate over IB, but in "lnetctl peer show" a NID @o2ib
>> shows up and it should not. At this point the MDS tries to connect to
>> the client using IB and it will never work because there is no IB on the
>> MDS.
>> MDS LNet configuration:
>>
>> net:
>>     - net type: lo
>>       local NI(s):
>>         - nid: 0@lo
>>           status: up
>>     - net type: tcp
>>       local NI(s):
>>         - nid: 172.21.49.233@tcp
>>           status: up
>>           interfaces:
>>               0: eth0
>>
>> but if I look at lnetctl peer show I see
>>
>>     - primary nid: 172.21.52.88@o2ib
>>       Multi-Rail: True
>>       peer ni:
>>         - nid: 172.21.48.250@tcp
>>           state: NA
>>         - nid: 172.21.52.88@o2ib
>>           state: NA
>>         - nid: 172.21.48.250@tcp1
>>           state: NA
>>         - nid: 172.21.48.250@tcp2
>>           state: NA
>>
>> There should be no o2ib NID, but Multi-Rail for some reason enables it.
>> I do not have problems with the other clients (non 2.12.0).
>>
>> How can I disable Multi-Rail on 2.12.0?
>>
>> thank you
>>
>>
>> On 3/5/19 12:14 PM, Patrick Farrell wrote:
>> > Riccardo,
>> >
>> > Since 2.12 is still a relatively new maintenance release, it would be
>> > helpful if you could open an LU and provide more detail there, such as
>> > what clients were doing, if you were using any new features (like DoM or
>> > FLR), and full dmesg from the clients and servers involved in these
>> > evictions.
>> >
>> > - Patrick
>> >
>> > On 3/5/19, 11:50 AM, "lustre-discuss on behalf of Riccardo Veraldi" <
>> > lustre-discuss-boun...@lists.lustre.org on behalf of
>> > riccardo.vera...@cnaf.infn.it> wrote:
>> >
>> > Hello,
>> >
>> > I have quite a big issue on my Lustre 2.12.0 MDS/MDT.
>> >
>> > Clients moving data to the OSS run into a locking problem I have
>> > never seen before.
>> >
>> > The clients are mostly 2.10.5 except for one which is 2.12.0, but
>> > regardless of the client version the problem is still there.
>> >
>> > So these are the errors I see on the MDS/MDT. When this happens
>> > everything just hangs. If I reboot the MDS everything is back to
>> > normal, but it has already happened 2 times in 3 days and it is
>> > disruptive.
>> >
>> > Any hints?
>> >
>> > Is it feasible to downgrade from 2.12.0 to 2.10.6?
>> >
>> > thanks
>> >
>> > Mar 5 11:10:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551813033/real 1551813033] req@ffff9fdcbecd0300 x1626845000210688/t0(0) o104->ana15-MDT0000@172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 1551813044 ref 1 fl Rpc:eX/0/ffffffff rc 0/-1
>> > Mar 5 11:10:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 50552576 previous similar messages
>> > Mar 5 11:13:03 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.87@o2ib) failed to reply to blocking AST (req@ffff9fdcbecd0300 x1626845000210688 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc: 4/0,0 mode: PR/PR res: [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621e63 expref: 8 pid: 7898 timeout: 333081 lvb_type: 0
>> > Mar 5 11:13:03 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking callback time out: rc -110
>> > Mar 5 11:13:03 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.21.52.87@o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6873c0/0x9824623d2148ef38 lrc: 3/0,0 mode: PR/PR res: [0x2000013a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621e63 expref: 9 pid: 7898 timeout: 0 lvb_type: 0
>> > Mar 5 11:13:04 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection restored to 59c5a826-f4e9-0dd0-8d4f-08c204f25941 (at 172.21.52.87@o2ib)
>> > Mar 5 11:15:34 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.142@o2ib) failed to reply to blocking AST (req@ffff9fde2d393600 x1626845000213776 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc: 4/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.142@o2ib remote: 0xbb35541ea6663082 expref: 9 pid: 7898 timeout: 333232 lvb_type: 0
>> > Mar 5 11:15:34 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.142@o2ib was evicted due to a lock blocking callback time out: rc -110
>> > Mar 5 11:15:34 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 151s: evicting client at 172.21.52.142@o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde9b6858c0/0x9824623d2148efee lrc: 3/0,0 mode: PR/PR res: [0x2000013ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.142@o2ib remote: 0xbb35541ea6663082 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
>> > Mar 5 11:15:34 psmdsana1501 kernel: Lustre: ana15-MDT0000: Connection restored to 9d49a115-646b-c006-fd85-000a4b90019a (at 172.21.52.142@o2ib)
>> > Mar 5 11:20:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1551813633/real 1551813633] req@ffff9fdcc2a95100 x1626845000222624/t0(0) o104->ana15-MDT0000@172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 1551813644 ref 1 fl Rpc:eX/2/ffffffff rc 0/-1
>> > Mar 5 11:20:33 psmdsana1501 kernel: Lustre: 7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 23570550 previous similar messages
>> > Mar 5 11:22:46 psmdsana1501 kernel: LustreError: 7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 172.21.52.87@o2ib) failed to reply to blocking AST (req@ffff9fdcc2a95100 x1626845000222624 status 0 rc -110), evict it ns: mdt-ana15-MDT0000_UUID lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc: 4/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621eb7 expref: 9 pid: 7898 timeout: 333665 lvb_type: 0
>> > Mar 5 11:22:46 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking callback time out: rc -110
>> > Mar 5 11:22:46 psmdsana1501 kernel: LustreError: 5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 150s: evicting client at 172.21.52.87@o2ib ns: mdt-ana15-MDT0000_UUID lock: ffff9fde86ffdf80/0x9824623d2148f23a lrc: 3/0,0 mode: PR/PR res: [0x2000013ae:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621eb7 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
>> >
>> > _______________________________________________
>> > lustre-discuss mailing list
>> > lustre-discuss@lists.lustre.org
>> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
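A rough way to see which clients are being evicted most often is to tally the "138-a" eviction messages per NID. The Python sketch below is only an illustration under stated assumptions, not a Lustre tool: `count_evictions` is a hypothetical helper, and the sample text is just the three eviction messages from the MDS log quoted above, unwrapped onto single lines.

```python
import re
from collections import Counter

# The three "138-a" eviction messages from the quoted MDS log, one per line.
SAMPLE_LOG = """\
Mar 5 11:13:03 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking callback time out: rc -110
Mar 5 11:15:34 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.142@o2ib was evicted due to a lock blocking callback time out: rc -110
Mar 5 11:22:46 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT0000: A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking callback time out: rc -110
"""

# Matches the client NID in an eviction message of the form seen above.
EVICTED = re.compile(r"A client on nid (\S+) was evicted")


def count_evictions(log_text):
    """Return a Counter mapping each evicted client NID to its eviction count.

    Hypothetical helper for illustration; it only understands the message
    wording that appears in this thread.
    """
    return Counter(match.group(1) for match in EVICTED.finditer(log_text))


evictions = count_evictions(SAMPLE_LOG)
for nid, count in evictions.most_common():
    print(nid, count)
```

The same function could be fed the full syslog text from the MDS to see whether the evictions are concentrated on a few clients or spread across all of them.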