Hi Kevin,

Just wild-guessing here. Have you tried playing with the 
live_router_check_interval, dead_router_check_interval and router_ping_timeout 
LNet parameters?
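
As a rough sketch only (the values below are illustrative, not tuned 
recommendations, and I haven't tested them against your setup), these can be 
added to the existing "options lnet" line on the nodes that hold the routes, 
for example on the client:

options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2" live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=50

All three values are in seconds. Module options are only read at load time, 
so the lustre modules would need to be reloaded for a change to take effect.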

HTH,
Alejandro

From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf 
Of Kevin M. Hildebrand
Sent: Monday, October 30, 2017 1:47 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre routing help needed

Hello, I'm trying to set up some new Lustre routers between a set of 
InfiniBand-connected Lustre servers and a few hosts connected to an external 
100G Ethernet network. The problem I'm having is that the routers work just 
fine for a minute or two, and then they're marked as 'down' and all traffic 
stops. If I unload/reload the lustre modules on the router, it'll work again 
for a short time and then stop again. The router shows errors like:
[236528.801275] LNetError: 54389:0:(lib-move.c:2120:lnet_parse_get()) 
10.10.104.2@tcp2: Unable to send REPLY for GET from 
12345-10.10.104.201@tcp2: -113
My Lustre router has a Mellanox ConnectX-3 interface connecting to the Lustre 
servers, and a Mellanox ConnectX-5 100G interface connecting to a 100G switch 
to which my test client is connected. On the InfiniBand side, I've got lnet 
configured as o2ib1, and on the Ethernet side, as tcp2.

Clients and servers are all running Lustre 2.8.  The Lustre router at the 
moment is running Lustre 2.10.1, because of software dependencies to support 
the 100G card.

I've verified that I have stable network connectivity on both the IB and 
Ethernet sides.

At the moment, I have very simple lnet configurations, using the built-in 
defaults. lnet.conf on the server:
options lnet ip2nets="o2ib1(ib0) 192.168.[64-95].*; tcp1 10.103.[128-159].*" 
routes="tcp0 192.168.64.[78-79]@o2ib1; tcp2 192.168.64.[78-79]@o2ib1"

On the lustre router:
options lnet networks="o2ib1(ib0),tcp2(p1p1.104)" "forwarding=enabled"

And on the client:
options lnet networks="tcp2(p4p1.104)" routes="o2ib1 10.10.104.[2-3]@tcp2"

All of the hosts (client, server, router) have the following in ko2iblnd.conf:

alias ko2iblnd-opa ko2iblnd
options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4

install ko2iblnd /usr/sbin/ko2iblnd-probe


Does anyone see anything I've missed, or have any thoughts on where I should 
look next?

Thanks,
Kevin

--
Kevin Hildebrand
University of Maryland, College Park
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
