The 2.10 release added support for multi-rail LNet, which may be causing 
problems here. I would suggest installing an older LNet version on your 
routers to match your clients and servers.

You may need to build your own RPMs for your new kernel, but you can pass 
--disable-server to configure to simplify things.

Cheers, Andreas

On Oct 31, 2017, at 04:45, Kevin M. Hildebrand <[email protected]> wrote:

Thanks, I completely missed that.  Indeed the ko2iblnd parameters were 
different between the servers and the router.  I've updated the parameters on 
the router to match those on the server, and things haven't gotten any better.  
(The problem appears to be on the Ethernet side anyway, so you've probably 
helped me fix a problem I didn't know I had...)
I don't see much discussion about configuring LNet parameters for Ethernet 
networks; I assume that's using ksocklnd.  On that side, it appears that all of 
the ksocklnd parameters match between the router and clients.  Interestingly, 
peer_timeout is 180, which is almost exactly when my client gets marked down on 
the router.

Server (and now router) ko2iblnd parameters:
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1

Client and router ksocklnd:
peer_timeout 180
peer_credits 8
keepalive 30
sock_timeout 50
credits 256
rx_buffer_size 0
tx_buffer_size 0
keepalive_idle 30
round_robin 1
sock_timeout 50
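
For what it's worth, the ko2iblnd values listed above can be checked against the "options ko2iblnd-opa" line quoted further down with a few lines of Python. This is only a rough sketch: the helper names parse_listing and parse_options are made up for illustration, and the two inputs are simply the text pasted in this thread.

```python
def parse_listing(text):
    """Parse 'name value' lines, as pasted in this message, into a dict."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            params[parts[0]] = parts[1]
    return params


def parse_options(line):
    """Parse the 'name=value' pairs of a modprobe 'options' line into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


# Values reported as in effect on the server (and now the router).
running = parse_listing("""
peer_credits 8
peer_credits_hiw 4
credits 256
concurrent_sends 8
ntx 512
map_on_demand 0
fmr_pool_size 512
fmr_flush_trigger 384
fmr_cache 1
""")

# The options line from ko2iblnd.conf, which targets the ko2iblnd-opa alias.
configured = parse_options(
    "options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 "
    "concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 "
    "fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4"
)

# Names whose running value differs from (or is missing in) the options line.
mismatches = [
    name
    for name in sorted(set(running) | set(configured))
    if running.get(name) != configured.get(name)
]

for name in mismatches:
    print(f"{name}: running={running.get(name)} configured={configured.get(name)}")
```

Nearly every value differs, which fits the explanation that the options line is not being applied on these hosts.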

Thanks,
Kevin


On Mon, Oct 30, 2017 at 4:16 PM, Mohr Jr, Richard Frank (Rick Mohr) 
<[email protected]> wrote:

> On Oct 30, 2017, at 8:47 AM, Kevin M. Hildebrand <[email protected]> wrote:
>
> All of the hosts (client, server, router) have the following in ko2iblnd.conf:
>
> alias ko2iblnd-opa ko2iblnd
> options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048 
> fmr_flush_trigger=512 fmr_cache=1 conns_per_peer=4
>
> install ko2iblnd /usr/sbin/ko2iblnd-probe

Those parameters will only get applied to Omni-Path interfaces (which you don’t 
have), so everything you have should just be running with default parameters.  
Since your LNet routers have a different version of Lustre than your 
servers/clients, it is possible that the default values for the ko2iblnd 
parameters differ between the two versions.  You can always check this 
by looking at the values in the files under /sys/module/ko2iblnd/parameters.  
It might be worthwhile to compare those values on the LNet routers to the 
values on the servers to see whether a difference could affect 
the behavior.
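
As a rough illustration of that comparison, here is a small Python sketch that reads every file under a module's parameters directory into a dictionary and reports the entries that differ between two such dictionaries. The function names read_module_params and diff_params are made up for this example, and it assumes each parameter file holds a single short text value. You would run the read step on a router and on a server, then compare the results.

```python
import os


def read_module_params(module, root="/sys/module"):
    """Return {parameter_name: value} for a loaded kernel module."""
    param_dir = os.path.join(root, module, "parameters")
    params = {}
    for name in sorted(os.listdir(param_dir)):
        path = os.path.join(param_dir, name)
        try:
            with open(path) as fh:
                params[name] = fh.read().strip()
        except OSError:
            # Some parameters may not be readable by every user; skip them.
            continue
    return params


def diff_params(a, b):
    """Return {name: (value_in_a, value_in_b)} for every differing parameter."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

For example, read_module_params("ko2iblnd") on each host gives the values to compare, and the same call with "ksocklnd" would cover the Ethernet side. A parameter present on only one host shows up with None for the other.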

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org