Hello. Here I am again, trying to get Multi-Rail to work.
I configured Multi-Rail on both the OSS and the client side. I have one OSS, one MDS and one client, all running RHEL 7.4 and Lustre 2.10.1:

* psdrp-tst-mds10 MDS
* drp-tst-oss10 OSS (172.21.52.86@o2ib 172.21.52.118@o2ib)
* drp-tst-lu10 Lustre client (172.21.52.124@o2ib 172.21.52.125@o2ib)

Without Multi-Rail everything works fine. What I am trying to do is aggregate two IB interfaces to get more performance. However, when I mount the Lustre partition from the Lustre client, I get these errors and the partition does not mount:

Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.914832] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib rejected: consumer defined fatal error
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.917290] Lustre: Mounted drplu-client
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.920832] Lustre: 31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507591430/real 1507591430] req@ffff8807f56a0300 x1580812428378832/t0(0) o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens 520/544 e 0 to 1 dl 1507591435 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct 9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError: 673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:57 drp-tst-lu10 kernel: [248184.645463] LustreError: 674:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:58 drp-tst-lu10 kernel: [248186.117364] LustreError: 678:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5
Oct 9 16:23:58 drp-tst-lu10 kernel: [248186.117411] LustreError: 678:0:(llite_lib.c:1748:ll_statfs_internal()) Skipped 1 previous similar message
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912554] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib rejected: consumer defined fatal error
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912610] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) Skipped 3 previous similar messages
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.918903] Lustre: 31785:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1507591455/real 1507591455] req@ffff88075d2ee700 x1580812428378960/t0(0) o8->drplu-OST0001-osc-ffff88084738d800@172.21.52.86@o2ib:28/4 lens 520/544 e 0 to 1 dl 1507591465 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Oct 9 16:23:52 drp-tst-lu10 kernel: [248179.936156] LustreError: 673:0:(llite_lib.c:1748:ll_statfs_internal()) obd_statfs fails: rc = -5

fstab entry:

172.21.42.213@tcp:/drplu /drplu lustre noauto,lazystatfs,flock, 0 0

I can see the peers in the LNet status:

[root@drp-tst-oss10:~]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.124@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.52.125@o2ib          1    NA    -1   128   128   128   128   128 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0

[root@drp-tst-lu10:etc]# cat /proc/sys/lnet/peers
nid                      refs state  last   max   rtr   min    tx   min queue
172.21.52.118@o2ib          1    NA    -1   128   128   128   128   127 0
172.21.52.86@o2ib           1    NA    -1   128   128   128   128   102 0
172.21.42.213@tcp           1    NA    -1     8     8     8     8     6 0

Here is my LNet configuration with Multi-Rail on the OSS side:

[root@drp-tst-oss10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.86@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
        - nid: 172.21.52.118@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: 1
          CPT: "[0,1]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.211@tcp
          status: up
          interfaces:
              0: enp1s0f0
          statistics:
              send_count: 198
              recv_count: 198
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0,1]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 198
          recv_count: 198
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.124@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.124@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.125@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 128
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0

Here is the LNet configuration on the client side:

[root@drp-tst-lu10:veraldi]# lnetctl export
net:
    - net type: lo
      local NI(s):
        - nid: 0@lo
          status: up
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 0
              peer_credits: 0
              peer_buffer_credits: 0
              credits: 0
          lnd tunables:
          tcp bonding: 0
          dev cpt: 0
          CPT: "[0]"
    - net type: o2ib
      local NI(s):
        - nid: 172.21.52.124@o2ib
          status: up
          interfaces:
              0: ib0
          statistics:
              send_count: 403742
              recv_count: 807391
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
        - nid: 172.21.52.125@o2ib
          status: up
          interfaces:
              0: ib1
          statistics:
              send_count: 0
              recv_count: 0
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 128
              peer_buffer_credits: 0
              credits: 1024
          lnd tunables:
              peercredits_hiw: 64
              map_on_demand: 32
              concurrent_sends: 256
              fmr_pool_size: 2048
              fmr_flush_trigger: 512
              fmr_cache: 1
              ntx: 2048
              conns_per_peer: 4
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
    - net type: tcp
      local NI(s):
        - nid: 172.21.42.195@tcp
          status: up
          interfaces:
              0: enp7s0f0
          statistics:
              send_count: 99
              recv_count: 99
              drop_count: 0
          tunables:
              peer_timeout: 180
              peer_credits: 8
              peer_buffer_credits: 0
              credits: 256
          lnd tunables:
          tcp bonding: 0
          dev cpt: -1
          CPT: "[0]"
peer:
    - primary nid: 172.21.42.213@tcp
      Multi-Rail: True
      peer ni:
        - nid: 172.21.42.213@tcp
          state: NA
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: 6
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: 8
          send_count: 99
          recv_count: 99
          drop_count: 0
          refcount: 1
    - primary nid: 172.21.52.86@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 172.21.52.86@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 102
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 403742
          recv_count: 807391
          drop_count: 0
          refcount: 1
        - nid: 172.21.52.118@o2ib
          state: NA
          max_ni_tx_credits: 128
          available_tx_credits: 128
          min_tx_credits: 127
          tx_q_num_of_buf: 0
          available_rtr_credits: 128
          min_rtr_credits: 128
          send_count: 0
          recv_count: 0
          drop_count: 0
          refcount: 1
numa:
    range: 0

Anyway, Lustre does not work. This is really weird; it should. Any hints?

Thank you,
Rick
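To see at a glance which peer NID is refusing connections, the o2iblnd rejection lines in the client log can be tallied per NID. The following is only a minimal sketch: the names `REJECT_RE`, `count_rejections` and `SAMPLE` are hypothetical helpers, not part of any Lustre tooling, and the pattern assumes the `kiblnd_rejected()` message layout seen in the log excerpt from this post.

```python
import re
from collections import Counter

# Assumed message layout, taken from the log excerpt in this post:
#   ...kiblnd_rejected()) <nid> rejected: <reason>
# The "Skipped N previous similar messages" lines carry no NID and are ignored.
REJECT_RE = re.compile(
    r"kiblnd_rejected\(\)\)\s+(?P<nid>\S+@\S+)\s+rejected:\s+(?P<reason>.+)$"
)


def count_rejections(log_text):
    """Return a Counter mapping (nid, reason) to the number of matching lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REJECT_RE.search(line)
        if match:
            key = (match.group("nid"), match.group("reason").strip())
            counts[key] += 1
    return counts


# Four lines copied from the client log quoted in this post.
SAMPLE = """\
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.914832] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib rejected: consumer defined fatal error
Oct 9 16:23:50 drp-tst-lu10 kernel: [248177.917290] Lustre: Mounted drplu-client
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912554] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) 172.21.52.118@o2ib rejected: consumer defined fatal error
Oct 9 16:24:15 drp-tst-lu10 kernel: [248202.912610] LNetError: 1895:0:(o2iblnd_cb.c:2726:kiblnd_rejected()) Skipped 3 previous similar messages
"""

if __name__ == "__main__":
    for (nid, reason), n in count_rejections(SAMPLE).items():
        print(f"{nid}: {n} x {reason}")
```

In the excerpt above, every rejection names 172.21.52.118@o2ib, the NID on the OSS's ib1 interface. The tally only narrows down where to look; it does not by itself explain why that interface rejects the connection.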
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org