Yeah, I thought about that. Both the client and servers are using the
defaults for ko2iblnd:

tunables:
    peer_timeout: 180
    peer_credits: 8
    peer_buffer_credits: 0
    credits: 256
lnd tunables:
    peercredits_hiw: 4
    map_on_demand: 0
    concurrent_sends: 8
    fmr_pool_size: 512
    fmr_flush_trigger: 384
    fmr_cache: 1
    ntx: 512
    conns_per_peer: 1

Thanks,
Kevin

On Fri, Jan 11, 2019 at 5:17 PM Mohr Jr, Richard Frank (Rick Mohr)
<rm...@utk.edu> wrote:

> Is it possible you have some incompatible ko2iblnd module parameters
> between the 2.8 servers and the 2.10 clients? If there was something
> causing LNet issues, that could possibly explain some of the symptoms
> you are seeing.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
> > On Jan 10, 2019, at 4:23 PM, Kevin M. Hildebrand <ke...@umd.edu> wrote:
> >
> > I've got a RHEL6 Lustre installation where the servers are running
> > 2.8.0, that I'd prefer not to upgrade.
> > We've been running 2.8.0 on RHEL6 clients as well and everything's
> > been working fine. However, I just updated the Linux release on the
> > RHEL6 clients to 6.10, and Lustre 2.8.0 will no longer compile on the
> > latest kernel. I've built and installed 2.10.6 on these clients, and
> > the kernel modules load fine, but on first contact with any lustre
> > server, I get a bunch of timeouts before I can get a valid
> > connection. The Lustre network in this case is Infiniband, using
> > Mellanox OFED on the clients.
> > 'lctl ping' hangs for a few seconds and returns 'failed to ping
> > 192.168.64.70@o2ib1: Input/output error'. An IPoIB ping of the server
> > IP address works fine.
> > At the same time I get a message in syslog that says 'LNet:
> > 8778:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for
> > 192.168.64.70@o2ib1: 4296292 seconds'
> > Nothing shows up in the logs on the server side.
> >
> > If I repeat the 'lctl ping' a few times, after 30-60 seconds or so,
> > 'lctl ping' succeeds.
> > This happens for each of my Lustre servers, and once I get a
> > successful ping back, it seems to be fully functional up until the
> > next reboot, or until the Infiniband modules are reloaded.
> >
> > If I try to mount the filesystem without doing the pings, I'll get
> > timeouts contacting the MDS for the same 30-60 seconds, and then once
> > the MDSes are reachable, I get timeouts to the OSSes for a while,
> > until they become reachable, and once they're all talking, all seems
> > to be fine.
> >
> > Any ideas on what could be wrong?
> >
> > Thanks,
> > Kevin
> >
> > --
> > Kevin Hildebrand
> > University of Maryland
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
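
For anyone wanting to double-check the "same defaults on both sides" comparison rather than eyeballing it, here is a minimal sketch (not from the thread) that parses two tunables listings in the "key: value" layout pasted above and reports any differences. The function names and the OTHER_TUNABLES sample are hypothetical, and the sketch assumes you have already saved each host's listing as text.

```python
def parse_tunables(text):
    """Parse 'key: value' lines into a dict of strings.

    Section headers such as 'tunables:' have no value and are skipped.
    """
    values = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value:
            values[key.strip()] = value
    return values


def diff_tunables(client, server):
    """Return {name: (client_value, server_value)} for every mismatch."""
    names = sorted(set(client) | set(server))
    return {
        name: (client.get(name), server.get(name))
        for name in names
        if client.get(name) != server.get(name)
    }


# Values as pasted in the message above.
CLIENT_TUNABLES = """
tunables:
    peer_timeout: 180
    peer_credits: 8
    peer_buffer_credits: 0
    credits: 256
lnd tunables:
    peercredits_hiw: 4
    map_on_demand: 0
    concurrent_sends: 8
"""

# Hypothetical second listing, altered only to show what a mismatch looks like.
OTHER_TUNABLES = CLIENT_TUNABLES.replace("peer_credits: 8", "peer_credits: 16")

if __name__ == "__main__":
    mismatches = diff_tunables(
        parse_tunables(CLIENT_TUNABLES), parse_tunables(OTHER_TUNABLES)
    )
    for name, (client_value, other_value) in mismatches.items():
        print(f"{name}: client={client_value} other={other_value}")
```

An empty result means the two listings agree on every parameter that appears in either one; a parameter present on only one side shows up with None for the other.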