Hello,

While testing on an unusual network setup, I came across a possible
bug in the oob tcp module.
Setup: two nodes, vm0 and vm2. Both have IPv4 addresses that cannot
reach each other, plus IPv6 addresses that do work.
The call "mpirun -host vm0,vm2 /bin/hostname" never finishes, because
the node vm2 does not connect back to vm0:

- mca_oob_tcp_peer_try_connect is called, and first attempts to connect
  to the IPv4 address of vm0. peer->peer_state == MCA_OOB_TCP_CONNECTING
- it creates an IPv4 socket (mca_oob_tcp_create_socket())
- connecting fails (network unreachable)
- next try: the IPv6 address
- mca_oob_tcp_create_socket calls mca_oob_tcp_peer_shutdown, because
  the address family of the existing socket does not match.
  mca_oob_tcp_peer_shutdown sets peer->peer_state =
  MCA_OOB_TCP_CLOSED
- mca_oob_tcp_peer_try_connect successfully connects to the IPv6
  address

Despite the successful connection, we have a wrong peer state and
consequently, mca_oob_tcp_peer_send_handler bails out with "invalid
connection state".

I fixed the problem by setting the peer_state to MCA_OOB_TCP_CONNECTING
after creating the socket, which works for me.  I'm not sure if this is
always correct, though.
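
To illustrate the sequence, here is a stripped-down model of it. This is
not the actual Open MPI code: the types and names (peer_t, create_socket,
peer_shutdown, simulate_try_connect) and the hard-coded address table are
made up for this sketch, and it only assumes the behaviour described
above, namely that creating a socket of a different address family shuts
the peer down first. With apply_fix set, the state is reset to
"connecting" after the socket is created, which is the change I made.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the MCA_OOB_TCP_* peer states. */
typedef enum { PEER_CLOSED, PEER_CONNECTING, PEER_CONNECTED } peer_state_t;
typedef enum { FAMILY_NONE, FAMILY_INET, FAMILY_INET6 } family_t;

typedef struct {
    peer_state_t state;
    family_t     sock_family;   /* FAMILY_NONE: no socket open */
} peer_t;

/* Models mca_oob_tcp_peer_shutdown: drops the socket and marks
 * the peer as closed. */
static void peer_shutdown(peer_t *peer)
{
    peer->sock_family = FAMILY_NONE;
    peer->state = PEER_CLOSED;
}

/* Models mca_oob_tcp_create_socket: if a socket of another address
 * family already exists, the peer is shut down before the new
 * socket is created. */
static void create_socket(peer_t *peer, family_t family)
{
    if (peer->sock_family != FAMILY_NONE && peer->sock_family != family) {
        peer_shutdown(peer);
    }
    peer->sock_family = family;
}

/* Models the address loop in mca_oob_tcp_peer_try_connect for the
 * setup above: IPv4 is tried first and fails, IPv6 succeeds.
 * Returns the peer state left behind once a connect has succeeded. */
peer_state_t simulate_try_connect(int apply_fix)
{
    const family_t families[]  = { FAMILY_INET, FAMILY_INET6 };
    const int      reachable[] = { 0, 1 };  /* assumed: only IPv6 works */
    peer_t peer = { PEER_CONNECTING, FAMILY_NONE };

    for (size_t i = 0; i < sizeof(families) / sizeof(families[0]); i++) {
        create_socket(&peer, families[i]);
        if (apply_fix) {
            /* The workaround: restore the state after socket creation. */
            peer.state = PEER_CONNECTING;
        }
        if (reachable[i]) {
            break;  /* connect succeeded on this address */
        }
    }
    return peer.state;
}
```

In this model, simulate_try_connect(0) leaves the peer in PEER_CLOSED
even though the IPv6 connect succeeded, while simulate_try_connect(1)
leaves it in PEER_CONNECTING. Whether resetting the state at that point
is safe in every code path of the real module is exactly what I am
unsure about.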

Thomas

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany