I figured out that the problem was caused by a messed-up MGS partition on my MDS. Thanks.
On 4/19/18 7:18 PM, Riccardo Veraldi wrote:
> Hello,
> On my OSSes and on my clients the LNet configuration is loaded at
> boot time from lnet.conf.
> I define local interfaces and peers.
> What happens is that when the Lustre filesystems are mounted by the
> clients, the LNet configuration is modified on both the client and the
> OSS side: tcp peers are added at the end of the LNet configuration, and
> as a consequence all traffic starts to go through TCP and not InfiniBand.
> I am using RHEL 7.4 and Lustre 2.10.3. My configuration is a bit uncommon
> because I use kernel 4.4 on the servers while all the clients run the
> stock RHEL 7.4 kernel.
>
> The LNet YAML configuration follows, before and after the client mounts
> the Lustre partitions.
>
> It seems that peer auto-discovery is overriding ib and using just tcp.
> Is there a way to stop peer auto-discovery, or a way to tell LNet that
> ib has precedence over tcp?
>
> LNet as configured at boot:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84@o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 96252389
>               recv_count: 61558248
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116@o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 96253070
>               recv_count: 61558217
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207@tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 380697
>               recv_count: 380352
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159@tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 0
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 380697
>           recv_count: 380352
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 28134533
>           recv_count: 8553649
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 97
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13505518
>           recv_count: 6106498
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -751
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 17672565
>           recv_count: 13195155
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -369
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 13934795
>           recv_count: 11409629
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -458
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 12257935
>           recv_count: 11907534
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: -417
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 10748675
>           recv_count: 10384163
>           drop_count: 0
>           refcount: 1
>
> When the clients then mount the Lustre partitions, LNet is modified:
>
> net:
>     - net type: lo
>       local NI(s):
>         - nid: 0@lo
>           status: up
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 0
>               peer_credits: 0
>               peer_buffer_credits: 0
>               credits: 0
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
>     - net type: o2ib
>       local NI(s):
>         - nid: 172.21.52.84@o2ib
>           status: up
>           interfaces:
>               0: ib0
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>         - nid: 172.21.52.116@o2ib
>           status: up
>           interfaces:
>               0: ib1
>           statistics:
>               send_count: 0
>               recv_count: 0
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 128
>               peer_buffer_credits: 0
>               credits: 1024
>           lnd tunables:
>               peercredits_hiw: 64
>               map_on_demand: 32
>               concurrent_sends: 256
>               fmr_pool_size: 2048
>               fmr_flush_trigger: 512
>               fmr_cache: 1
>               ntx: 2048
>               conns_per_peer: 4
>           tcp bonding: 0
>           dev cpt: 1
>           CPT: "[0,1]"
>     - net type: tcp
>       local NI(s):
>         - nid: 172.21.42.207@tcp
>           status: up
>           interfaces:
>               0: enp1s0f0
>           statistics:
>               send_count: 646
>               recv_count: 646
>               drop_count: 0
>           tunables:
>               peer_timeout: 180
>               peer_credits: 8
>               peer_buffer_credits: 0
>               credits: 256
>           lnd tunables:
>           tcp bonding: 0
>           dev cpt: 0
>           CPT: "[0,1]"
> peer:
>     - primary nid: 172.21.42.159@tcp
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.42.159@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 6
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 268
>           recv_count: 268
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.126@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.126@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.127@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.127@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.128@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.128@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.129@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.129@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.130@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.130@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.52.131@o2ib
>       Multi-Rail: True
>       peer ni:
>         - nid: 172.21.52.131@o2ib
>           state: NA
>           max_ni_tx_credits: 128
>           available_tx_credits: 128
>           min_tx_credits: 128
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 128
>           min_rtr_credits: 128
>           send_count: 0
>           recv_count: 0
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.224@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.224@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 101
>           recv_count: 101
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.221@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.221@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.202@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.202@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.223@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.223@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 197
>           recv_count: 197
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.222@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.222@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
>     - primary nid: 172.21.42.201@tcp
>       Multi-Rail: False
>       peer ni:
>         - nid: 172.21.42.201@tcp
>           state: NA
>           max_ni_tx_credits: 8
>           available_tx_credits: 8
>           min_tx_credits: 7
>           tx_q_num_of_buf: 0
>           available_rtr_credits: 8
>           min_rtr_credits: 8
>           send_count: 20
>           recv_count: 20
>           drop_count: 0
>           refcount: 1
> numa:
>     range: 0
>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
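
For anyone comparing dumps like the two quoted above, here is a minimal sketch (not part of the original thread) of how the peers that appear after mounting could be listed. It assumes both dumps are available as plain text and relies only on the `primary nid:` entries visible in them; the names `primary_peers` and `added_peers` are hypothetical helpers, not part of any Lustre tool.

```python
import re

# Matches entries such as "primary nid: 172.21.42.224@tcp", whether or not
# the text still carries "> " quote markers from the mail.
PRIMARY_NID = re.compile(r"primary nid:\s*(\S+)@(\w+)")


def primary_peers(dump_text):
    """Return the set of (address, net) pairs listed as primary NIDs."""
    return set(PRIMARY_NID.findall(dump_text))


def added_peers(before_text, after_text):
    """Return peers present in the second dump but not in the first, sorted."""
    return sorted(primary_peers(after_text) - primary_peers(before_text))


if __name__ == "__main__":
    # Tiny stand-ins for the two dumps; real input would be the full text.
    before = "- primary nid: 172.21.42.159@tcp\n- primary nid: 172.21.52.126@o2ib\n"
    after = before + "- primary nid: 172.21.42.224@tcp\n"
    for address, net in added_peers(before, after):
        print(f"added after mount: {address}@{net}")
```

Run against the two dumps in this thread, a comparison of this kind would single out the six extra tcp peers that show `Multi-Rail: False`.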
