So I'm still experimenting with my 2.10.6 clients mounting from a 2.8 server. I've found some more information that might narrow down the issue.
To recap: When a client is rebooted, or after the IB modules are reloaded, any Lustre operations take a very long time to connect the first time. lctl ping hangs and times out for 30-60 seconds. Once it makes a successful connection, subsequent connections to the same server are fine. So mounting the Lustre filesystem takes a long time as it has to time out to each MDS and each OSS before finally succeeding.

What's new: If I do an IPoIB ping of the server I'm trying to reach first, the lctl ping succeeds immediately. So if I ping all of the MDSes and OSSes, the filesystem will mount immediately.

Does this sound familiar to anyone?

Thanks,
Kevin

On Thu, Jan 10, 2019 at 4:23 PM Kevin M. Hildebrand <ke...@umd.edu> wrote:

> I've got a RHEL6 Lustre installation where the servers are running 2.8.0,
> that I'd prefer not to upgrade.
> We've been running 2.8.0 on RHEL6 clients as well and everything's been
> working fine. However, I just updated the Linux release on the RHEL6
> clients to 6.10, and Lustre 2.8.0 will no longer compile on the latest
> kernel. I've built and installed 2.10.6 on these clients, and the kernel
> modules load fine, but on first contact with any lustre server, I get a
> bunch of timeouts before I can get a valid connection. The Lustre network
> in this case is Infiniband, using Mellanox OFED on the clients.
> 'lctl ping' hangs for a few seconds and returns 'failed to ping
> 192.168.64.70@o2ib1: Input/output error'. An IPoIB ping of the server IP
> address works fine.
> At the same time I get a message in syslog that says 'LNet:
> 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> 192.168.64.70@o2ib1: 4296292 seconds'
> Nothing shows up in the logs on the server side.
>
> If I repeat the 'lctl ping' a few times, after 30-60 seconds or so, 'lctl
> ping' succeeds.
> This happens for each of my Lustre servers, and once I get a successful
> ping back, it seems to be fully functional up until the next reboot, or
> until the Infiniband modules are reloaded.
>
> If I try to mount the filesystem without doing the pings, I'll get
> timeouts contacting the MDS for the same 30-60 seconds, and then once the
> MDSes are reachable, I get timeouts to the OSSes for a while, until they
> become reachable, and once they're all talking, all seems to be fine.
>
> Any ideas on what could be wrong?
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland
>
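
For anyone who wants to try the same workaround (IPoIB ping each server first, then lctl ping it, then mount), here is a rough sketch of how it could be scripted. This is only an illustration, not something from the thread: the helper names nid_to_ip and warm_up are made up, and it assumes the address part of each o2ib NID is the server's IPoIB address, as it is for 192.168.64.70@o2ib1 above. The ping flags are the usual Linux iputils ones (-c count, -W timeout in seconds).

```python
import subprocess


def nid_to_ip(nid):
    """Return the address part of an LNet NID such as '192.168.64.70@o2ib1'.

    Assumption: for o2ib NIDs this address is the server's IPoIB address.
    """
    addr, sep, net = nid.partition("@")
    if not sep or not addr or not net:
        raise ValueError(f"not a NID of the form address@network: {nid!r}")
    return addr


def warm_up(nids, run=subprocess.run):
    """IPoIB-ping each server once, then 'lctl ping' it.

    Returns the list of NIDs whose 'lctl ping' still failed.
    The 'run' parameter exists only so the commands can be stubbed out.
    """
    failed = []
    for nid in nids:
        ip = nid_to_ip(nid)
        # Plain IP ping over IPoIB first; its result is ignored on purpose,
        # since the point is only to generate traffic before LNet tries.
        run(["ping", "-c", "1", "-W", "2", ip], check=False)
        # Now the LNet-level ping, which is what mounting depends on.
        result = run(["lctl", "ping", nid], check=False)
        if result.returncode != 0:
            failed.append(nid)
    return failed
```

Calling warm_up(["192.168.64.70@o2ib1"]) with the NIDs of all MDSes and OSSes before the mount would mirror what I did by hand; an empty return list means every lctl ping succeeded. It only papers over the symptom and says nothing about the underlying cause.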
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org