Hi everybody, I tried the 3D-torus cluster topology support and hit a bug that produces an error message like the one below:
[srvmpisnb02][[9011,1],3][ompi/mca/btl/openib/connect/btl_openib_connect_sl.c:239:get_pathrecord_info] error posting receive on QP [0x4f] errno says: Success [0]

The cause of this bug is a receive queue overflow on the UD QP associated with the handle cache->qp. The attached file is my proposed fix, based on the Open MPI 1.8 branch.

I also have a question about 3D-torus topology support for UD QPs. For example, the UDCM connection manager uses UD transport. Are any changes required to query the service level for a UD QP? Maybe we need to add a call to btl_openib_connect_get_pathrecord_sl(…) before ibv_create_ah(), like below:

    ah_attr.is_global = 0;
    ah_attr.dlid = remote_lid;
    ah_attr.sl = btl_openib_connect_get_pathrecord_sl(…);
    ah_attr.src_path_bits = mca_btl_openib_component.ib_src_path_bits;
    ah_attr.port_num = openib_btl->ib_port_num;
    ah = ibv_create_ah(openib_btl->ib_pd, &ah_attr);

Regards,
Alexey Ryzhikh
btl_openib_connect_sl.c.diff
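
To illustrate the receive queue overflow described above, here is a minimal standalone sketch. It is not the contents of the attached patch and makes no libibverbs calls. It only models a receive queue of fixed depth that refuses a post when it is full. All names (recv_queue, rq_init, rq_post, rq_complete, rq_post_many, RQ_DEPTH) are hypothetical; in the real code the depth would correspond to the max_recv_wr the UD QP was created with.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical depth; stands in for the max_recv_wr of a real QP. */
#define RQ_DEPTH 4

/* Hypothetical model of a receive queue: it only counts outstanding buffers. */
struct recv_queue {
    size_t depth;        /* capacity, fixed at creation */
    size_t outstanding;  /* receives posted but not yet completed */
};

static void rq_init(struct recv_queue *rq, size_t depth)
{
    rq->depth = depth;
    rq->outstanding = 0;
}

/* Post one receive. Returns 0 on success or ENOMEM if the queue is full.
 * The error code is returned directly; errno is never written. */
static int rq_post(struct recv_queue *rq)
{
    assert(rq->outstanding <= rq->depth);
    if (rq->outstanding == rq->depth) {
        return ENOMEM;
    }
    rq->outstanding++;
    return 0;
}

/* Consume one completion, freeing a slot.
 * Returns 0, or EINVAL if nothing is outstanding. */
static int rq_complete(struct recv_queue *rq)
{
    if (rq->outstanding == 0) {
        return EINVAL;
    }
    rq->outstanding--;
    return 0;
}

/* Post n receives, consuming one completion whenever the queue is full.
 * Requires depth > 0. Returns how many posts had to wait for a free slot. */
static size_t rq_post_many(struct recv_queue *rq, size_t n)
{
    size_t drained = 0;
    assert(rq->depth > 0);
    for (size_t i = 0; i < n; i++) {
        if (rq_post(rq) == ENOMEM) {
            int rc;
            rq_complete(rq);      /* free one slot before retrying */
            drained++;
            rc = rq_post(rq);     /* cannot fail: a slot was just freed */
            assert(rc == 0);
            (void)rc;
        }
    }
    return drained;
}
```

With a depth of 2, rq_post_many(&rq, 5) has to free a slot three times. Posting without that check is the overflow case. Note that the model returns its error code instead of setting errno. If the real posting call behaves the same way, printing errno after a failure could show "Success [0]" as in the log above, but that is an assumption I have not checked against the code path in question.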