Hi guys,
I have a problem related to the subject line; the details are below.
Can anybody tell me whether this behavior is a restriction of
Open MPI or something else?
I ran an MPI program and got the following error related to
ibv_create_ah:
[sho@host0 ~]$ /opt/openmpi1103_debug/bin/mpirun -host host0,host1 -npernode 1 -np 2 ./sample
PROC(0): senddata = 10
libibverbs: ibv_create_ah failed to query port.
[host1:4395] *** An error occurred in MPI_Send
[host1:4395] *** reported by process [139776618004481,0]
[host1:4395] *** on communicator MPI_COMM_WORLD
[host1:4395] *** MPI_ERR_OTHER: known error not in list
[host1:4395] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host1:4395] *** and potentially your MPI job)
host0 has a ConnectX-3 HCA with 2 ports, and the cable is connected to port 2.
host1 has a ConnectX-4 HCA with 1 port, and the cable is connected to port 1.
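The function udcm_endpoint_init_data seems to pass the remote port number to
ibv_create_ah.
I added a printf to output remote_msg->mm_port_num and found that it printed 1
on host0 and 2 on host1. In other words, each host uses the port number of its
peer. host1 has only one port, so port 2 does not exist there, which would
explain why the port query fails on host1.
Is this correct? I think the local port number should be passed to
ibv_create_ah.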
static int udcm_endpoint_init_data (mca_btl_base_endpoint_t *lcl_ep)
    : :
    ah_attr.dlid = lcl_ep->rem_info.rem_lid;
    ah_attr.port_num = remote_msg->mm_port_num;  /* <****** It's a remote port. */
    ah_attr.sl = mca_btl_openib_component.ib_service_level;
    ah_attr.src_path_bits = lcl_ep->endpoint_btl->src_path_bits;
    udep->ah = ibv_create_ah (lcl_ep->endpoint_btl->device->ib_pd, &ah_attr);
When I modified the above code to specify the local port directly, the sample
program ran correctly on host0 and host1.
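
The idea behind that change can be sketched as below. This is only an
illustration under stated assumptions, not the actual patch and not Open MPI
code: every type and function prefixed with sketch_ is a hypothetical stand-in
(so that the example compiles without libibverbs), and it assumes the local
BTL module knows its own port number. The point it shows is that the
destination LID comes from the peer, while the port number is the sender's
own port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few address handle attributes used above. */
struct sketch_ah_attr {
    uint16_t dlid;           /* destination LID (belongs to the peer) */
    uint8_t  sl;             /* service level */
    uint8_t  src_path_bits;  /* source path bits (local) */
    uint8_t  port_num;       /* physical port on the LOCAL HCA */
};

/* Hypothetical stand-in for what the local side knows about itself. */
struct sketch_local_btl {
    uint8_t port_num;        /* assumed: the port this module is bound to */
    uint8_t src_path_bits;
};

/* Hypothetical stand-in for what the peer sent in its connection message. */
struct sketch_remote_msg {
    uint16_t lid;
    uint8_t  port_num;       /* the peer's port; meaningless on the local HCA */
};

/* Fill the attributes: addressing from the peer, port from the local side. */
static void sketch_fill_ah_attr(struct sketch_ah_attr *attr,
                                const struct sketch_local_btl *local,
                                const struct sketch_remote_msg *remote,
                                uint8_t service_level)
{
    assert(attr != NULL && local != NULL && remote != NULL);

    attr->dlid          = remote->lid;
    attr->port_num      = local->port_num;   /* not remote->port_num */
    attr->sl            = service_level;
    attr->src_path_bits = local->src_path_bits;
}
```

With the setup described above, host1 would have a local port of 1 and a peer
port of 2, and the sketch would pick port 1.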
With best regards,
Takashi Sato