Hello,
while testing on an unusual network setup, I came across a possible
bug in the oob tcp module.
Setup: two nodes, vm0 and vm2. Both have IPv4 addresses that can't
reach each other, and both also have IPv6 addresses.
The call "mpirun -host vm0,vm2 /bin/hostname" never finishes, because
the node vm2 does not connect back to vm0:
- mca_oob_tcp_peer_try_connect is called, and first attempts to connect
to the IPv4 address of vm0. At this point,
peer->peer_state == MCA_OOB_TCP_CONNECTING
- it creates an IPv4 socket (mca_oob_tcp_create_socket())
- connecting fails (network unreachable)
- next try: the IPv6 address
- mca_oob_tcp_create_socket calls mca_oob_tcp_peer_shutdown, because
the address family of the existing socket does not match.
mca_oob_tcp_peer_shutdown sets peer->peer_state =
MCA_OOB_TCP_CLOSED
- mca_oob_tcp_peer_try_connect successfully connects to the IPv6
address
Despite the successful connection, the peer state is now wrong
(MCA_OOB_TCP_CLOSED instead of MCA_OOB_TCP_CONNECTING) and
consequently, mca_oob_tcp_peer_send_handler bails out with "invalid
connection state".
I fixed the problem by setting the peer_state to MCA_OOB_TCP_CONNECTING
after creating the socket, which works for me. I'm not sure if this is
always correct, though.
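
To make the sequence above easier to follow, here is a minimal standalone
model of the state handling in C. It is only a sketch and not the actual
Open MPI code: the types, constants and function names (peer_t,
create_socket, try_connect, send_handler_accepts, FAMILY_V4/FAMILY_V6) are
hypothetical stand-ins, and the assumption that the send handler accepts
only the CONNECTING and CONNECTED states is mine. The apply_fix flag
switches the proposed change on and off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the peer states named in the report. */
typedef enum { PEER_CLOSED, PEER_CONNECTING, PEER_CONNECTED } peer_state_t;

/* Hypothetical address families; FAMILY_NONE means "no socket yet". */
typedef enum { FAMILY_NONE, FAMILY_V4, FAMILY_V6 } family_t;

typedef struct {
    peer_state_t state;
    family_t     sd_family;  /* family of the currently open socket */
} peer_t;

/* Models the shutdown step: drop the socket and mark the peer closed. */
static void peer_shutdown(peer_t *peer)
{
    peer->sd_family = FAMILY_NONE;
    peer->state = PEER_CLOSED;
}

/* Models socket creation: if a socket of another family already exists,
 * the peer is shut down first, which resets the state to CLOSED.
 * With apply_fix set, the state is put back to CONNECTING afterwards. */
static void create_socket(peer_t *peer, family_t family, bool apply_fix)
{
    assert(peer != NULL);
    if (peer->sd_family != FAMILY_NONE && peer->sd_family != family) {
        peer_shutdown(peer);
    }
    peer->sd_family = family;
    if (apply_fix) {
        peer->state = PEER_CONNECTING;
    }
}

/* Models the connect loop for the reported setup: the IPv4 attempt fails
 * (network unreachable), the IPv6 attempt succeeds. Returns the peer
 * state that the send handler would see afterwards. */
static peer_state_t try_connect(bool apply_fix)
{
    peer_t peer = { PEER_CONNECTING, FAMILY_NONE };
    const family_t families[] = { FAMILY_V4, FAMILY_V6 };

    for (size_t i = 0; i < sizeof(families) / sizeof(families[0]); i++) {
        create_socket(&peer, families[i], apply_fix);
        bool connected = (families[i] == FAMILY_V6); /* only IPv6 works */
        if (connected) {
            break;
        }
    }
    return peer.state;
}

/* Assumption: the send handler rejects anything but these two states. */
static bool send_handler_accepts(peer_state_t state)
{
    return state == PEER_CONNECTING || state == PEER_CONNECTED;
}
```

In this model, try_connect(false) ends in PEER_CLOSED, which the send
handler rejects, while try_connect(true) ends in PEER_CONNECTING. Whether
resetting the state at that point is safe on every code path of the real
module is exactly the open question.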
Thomas
--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany