Dear all,

(TL;DR at the bottom)
I have the following situation (rough topology; the original diagram did not survive, so here it is as a list):

  Lustre servers @ o2ib20
        |
  o2ib20 fabric, 10.148.0.0/16
        |
   LNET routers
        |
  o2ib43 fabric, 10.225.0.0/16
        |
        +-- Lustre servers @ o2ib43
        +-- Virtualization host with VM 1, VM 2, VM 3 (all @ o2ib43)

Virtualization host:
 * Proxmox 6.2, up-to-date
 ** Debian 10.5 based
 ** Ubuntu based kernel 5.4.44-2-pve
 * ConnectX-3 (MCX354A-FCBT)
 ** 15 VFs configured
 ** SR-IOV
 * OFED provided by distribution

Virtual machines and LNET routers:
 * CentOS 7.8 based
 * OFED provided by CentOS
 * Lustre 2.12.5
 * Kernel 3.10.0-1127.18.2.el7

Lustre @ o2ib20 is a Sonexion appliance based on CentOS 7.2 and Lustre version 2.11.0.300_cray_43_gd35e657_dirty. Lustre @ o2ib43 is a CentOS 7.6 based setup with kernel 3.10.0-957.1.3.el7_lustre and Lustre version lustre-2.10.7.1nec-1.el7.x86_64.

The issue I currently see is that once more than one VM is running on the virtualization host, access to the Lustre file system behind the LNET routers gets stuck.
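For reference, the routing setup on the VMs looks roughly like the following. This is a sketch in lnetctl syntax, not a copy of my actual configuration; `<router-NID>` and the interface name `ib0` are placeholders:

```shell
# Sketch of the LNET route setup on a VM (placeholder NIDs/interfaces).
# Routes to the o2ib20 fabric go via an LNET router on o2ib43.

# Statically, this could live in /etc/modprobe.d/lustre.conf:
#   options lnet networks="o2ib43(ib0)" routes="o2ib20 <router-NID>@o2ib43"

# Equivalently, at runtime with lnetctl:
lnetctl lnet configure
lnetctl net add --net o2ib43 --if ib0
lnetctl route add --net o2ib20 --gateway <router-NID>@o2ib43
lnetctl route show --verbose    # the route should be reported as "up"
```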
The errors I can see on the VM are e.g.:

[ 1297.470192] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1297.472058] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1297.473909] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316593/real 1597316593] req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316688 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[ 1297.479055] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1299.470205] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1299.472403] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1299.474395] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316595/real 1597316595] req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316690 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[ 1299.479830] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1299.496826] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection restored to 10.148.240.33@o2ib20 (at 10.148.240.33@o2ib20)
[ 1301.470102] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1301.472096] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1301.474135] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316597/real 1597316597] req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316692 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[ 1301.479772] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1301.483576] LNetError: 2486:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.148.240.33@o2ib20 from <?>

Access to the Lustre file system that is on the same IB fabric is still possible, so I suspect this is somehow related to LNET routing. If I run LNET selftests as explained at http://wiki.lustre.org/LNET_Selftest between one of the LNET routers and the VMs, I can see that RPCs get dropped. Access from a client running on native hardware is possible for both file systems.

Does anyone have a comparable setup? What kind of logs are needed to debug this? I'll gladly provide any info…

TL;DR
 * native access is possible in the same IB fabric as well as when being routed between different fabrics
 * if only one VM is running, then access to both file systems is possible, too
 * if more VMs are running on the same virtualization host, then access is only possible to the file system attached to the same fabric as the VMs
 * access to the routed file system gets stuck

Any help is appreciated.

Thanks,

	Uwe Sauter
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
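P.S. For completeness, the selftest runs mentioned above follow the wiki recipe roughly like this; the NIDs are placeholders for my actual router and VM addresses, not copied from a working session:

```shell
# LNET selftest between an LNET router and a VM (sketch, placeholder NIDs).
export LST_SESSION=$$
lst new_session rw
lst add_group routers <router-NID>@o2ib43   # one of the LNET routers
lst add_group vms <vm-NID>@o2ib43           # one of the VMs
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from vms --to routers brw write check=simple size=1M
lst run bulk_rw
lst stat vms routers                        # dropped RPCs show up in these counters
lst stop bulk_rw
lst end_session
```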