So apparently the issue is indeed with the combination of using a Lustre
2.10.1 router with 2.8 servers and clients. Downgrading the router to 2.9
seems to have solved the problem.
(I can't run 2.8 on the router, because I'm running MOFED 4.1 for the
Mellanox ConnectX-5, and I can't get 2.8 to bui
The 2.10 release added support for multi-rail LNet, which may be causing
problems here. I would suggest installing an older LNet version on your
routers to match your clients and servers.
You may need to build your own RPMs for your new kernel, but can use
--disable-server for configure t
Thanks, I completely missed that. Indeed the ko2iblnd parameters were
different between the servers and the router. I've updated the parameters
on the router to match those on the server, and things haven't gotten any
better. (The problem appears to be on the Ethernet side anyway, so you've
prob
> On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand wrote:
>
> All of the hosts (client, server, router) have the following in ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 concurrent_sends=256 ntx=2048 map_on_demand=32
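
Since mismatched ko2iblnd parameters between the servers and the router came up in this thread, here is a rough sketch of how one might compare the options line from two hosts. The server line is the one quoted above; the router line and the helper names (parse_options, diff_options) are made up for illustration and are not part of any Lustre tool.

```python
def parse_options(line, module="ko2iblnd-opa"):
    """Parse a modprobe-style 'options <module> key=value ...' line into a dict.

    Returns an empty dict if the line is not an options line for `module`.
    """
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "options" or tokens[1] != module:
        return {}
    return dict(tok.split("=", 1) for tok in tokens[2:] if "=" in tok)


def diff_options(a, b):
    """Return {param: (value_in_a, value_in_b)} for every parameter that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Options line as quoted in the thread.
server = parse_options(
    "options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 "
    "concurrent_sends=256 ntx=2048 map_on_demand=32"
)

# Hypothetical router line, invented purely to show what a mismatch looks like.
router = parse_options(
    "options ko2iblnd-opa peer_credits=8 credits=1024 "
    "concurrent_sends=256 ntx=2048 map_on_demand=32"
)

if __name__ == "__main__":
    for param, (srv, rtr) in diff_options(server, router).items():
        print(f"{param}: server={srv} router={rtr}")
```

An empty result from diff_options means the two lines carry the same parameters. Values are compared as strings, so 128 and 0128 would be reported as different.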
I received a reply from Alejandro suggesting that I check
live_router_check_interval, dead_router_check_interval and
router_ping_timeout.
I had those set to the defaults, which I assume are 60, 60, and 50 seconds
respectively. I did just try setting those values explicitly, and I'm not
seeing any
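
For anyone who wants to confirm what the router checker is actually using rather than assuming the defaults, a minimal sketch along these lines reads the three values back from the loaded lnet module. It assumes the parameters are exposed read-only under /sys/module/lnet/parameters, which depends on the Lustre version and on the module being loaded; the helper name read_lnet_params is made up for this example.

```python
from pathlib import Path

ROUTER_PARAMS = (
    "live_router_check_interval",
    "dead_router_check_interval",
    "router_ping_timeout",
)


def read_lnet_params(base="/sys/module/lnet/parameters", names=ROUTER_PARAMS):
    """Return {name: int} for each parameter; None if it cannot be read.

    `base` is an assumption about where the kernel exposes the lnet module
    parameters; pass another directory if your system differs.
    """
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Module not loaded, parameter not exported, or not an integer.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_lnet_params().items():
        print(f"{name} = {value}")
```

Running it on the router and on a client makes it easy to see whether both ends agree on the check intervals and the ping timeout.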
: Monday, October 30, 2017 1:47 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre routing help needed
Hello, I'm trying to set up some new Lustre routers between a set of
Infiniband connected Lustre servers and a few hosts connected to an
external 100G Ethernet network. The problem I'm having is that the
routers work just fine for a minute or two, and then shortly thereafter
they're marked as 'do