On Wed, 2 Jun 2010, Jeff Squyres wrote:
> Don't you mean return NULL? This function is supposed to return a (struct
> ibv_cq *).
Oops. My bad. Yes, it should return NULL. And it seems that if I make
ibv_create_cq always return NULL, the scenario described by George works
smoothly: returned OMPI_ERROR => bitmask cleared => connectivity problem
=> stop or tcp fallback. The problem is more complicated than I thought.
But it helped me understand why I'm crashing: in my case, create_cq fails
on only a subset of processes. The others work fine, so they request a qp
creation, and my process, which has fallen back to tcp, starts
creating a qp ... and crashes.
If you replace:
    return NULL;
by:
    if (atoi(getenv("OMPI_COMM_WORLD_RANK")) == 26)
        return NULL;
(yes, that's ugly, but it's just to debug the problem) and run on -say- 32
processes, you should be able to reproduce the bug. Well, unless I'm
mistaken again.
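One caveat with the one-liner above: atoi(getenv(...)) dereferences NULL when the variable is not set, for example if the binary is started outside mpirun. A slightly more defensive sketch of the same debug hack could look like the following. The helper name rank_matches is hypothetical and not part of Open MPI; only standard C library calls are used.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical debug helper (not an Open MPI function): returns 1 if the
 * integer stored in the environment variable env_name equals target_rank,
 * and 0 otherwise, including when the variable is unset or malformed. */
static int rank_matches(const char *env_name, long target_rank)
{
    const char *value = getenv(env_name);
    char *end = NULL;
    long rank;

    if (value == NULL || *value == '\0') {
        return 0;   /* variable not set, e.g. not launched by mpirun */
    }

    errno = 0;
    rank = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0') {
        return 0;   /* value is not a clean decimal integer */
    }

    return rank == target_rank;
}
```

In the stubbed ibv_create_cq, the test would then read: if (rank_matches("OMPI_COMM_WORLD_RANK", 26)) return NULL;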
The crash stack should look like this:
#0 0x0000003d0d605a30 in ibv_cmd_create_qp () from /usr/lib64/libibverbs.so.1
#1 0x00007f28b44e049b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
#2 0x0000003d0d609a42 in ibv_create_qp () from /usr/lib64/libibverbs.so.1
#3 0x00007f28b6be6e6e in qp_create_one () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#4 0x00007f28b6be78a4 in oob_module_start_connect () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#5 0x00007f28b6be7fbb in rml_recv_cb () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#6 0x00007f28b8c56868 in orte_rml_recv_msg_callback () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_rml_oob.so
#7 0x00007f28b8a4cf96 in mca_oob_tcp_msg_recv_complete () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#8 0x00007f28b8a4e2c2 in mca_oob_tcp_peer_recv_handler () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#9 0x00007f28b9496898 in opal_event_base_loop () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#10 0x00007f28b948ace9 in opal_progress () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#11 0x00007f28b9951ed5 in ompi_request_default_wait_all () from
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libmpi.so.0
This new finding may change everything. Of course, stopping at the bml
level still "solves" the problem, but maybe we can fix this more properly
within the openib BTL. Unless this is a general
out-of-band-connection-protocol issue.
Sylvain