Hello,

while testing on an unusual network setup, I came across a possible bug in the oob tcp module.

Setup: two nodes, vm0 and vm2, each with an IPv4 and an IPv6 address. The IPv4 addresses cannot connect to each other.

The call "mpirun -host vm0,vm2 /bin/hostname" never finishes, because the node vm2 does not connect back to vm0:
- mca_oob_tcp_peer_try_connect is called and first attempts to connect to the IPv4 address of vm0. At this point peer->peer_state == MCA_OOB_TCP_CONNECTING.
- It creates an IPv4 socket (mca_oob_tcp_create_socket()).
- Connecting fails (network unreachable).
- Next try: the IPv6 address.
- mca_oob_tcp_create_socket calls mca_oob_tcp_peer_shutdown, because the address family of the existing socket does not match. mca_oob_tcp_peer_shutdown sets peer->peer_state = MCA_OOB_TCP_CLOSED.
- mca_oob_tcp_peer_try_connect successfully connects to the IPv6 address.

Despite the successful connection, the peer is left in the wrong state, and consequently mca_oob_tcp_peer_send_handler bails out with "invalid connection state".

I fixed the problem by setting peer_state back to MCA_OOB_TCP_CONNECTING after creating the socket, which works for me. I'm not sure if this is always correct, though.

Thomas

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
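
P.S. To make the sequence above easier to follow, here is a small stand-alone model of it. This is only a sketch, not the Open MPI source: the types peer_t and addr_t, the functions peer_shutdown, create_socket and try_connect, and the restore_state flag are all made up for illustration, and connect() is simulated by a "reachable" flag. It only shows how the shutdown inside socket creation can leave the state at CLOSED even though the second address connects, and what the workaround changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; none of these are the real Open MPI definitions. */
typedef enum { PEER_CLOSED, PEER_CONNECTING } peer_state_t;
typedef enum { FAMILY_NONE, FAMILY_V4, FAMILY_V6 } family_t;

typedef struct {
    peer_state_t state;
    family_t     sock_family;  /* family of the currently open socket */
} peer_t;

typedef struct {
    family_t family;
    bool     reachable;        /* simulated outcome of connect() */
} addr_t;

/* Models the shutdown path: drops the socket and resets the state. */
static void peer_shutdown(peer_t *peer)
{
    peer->sock_family = FAMILY_NONE;
    peer->state = PEER_CLOSED;
}

/* Models socket creation: a socket of a different family is shut down first. */
static void create_socket(peer_t *peer, family_t family)
{
    if (peer->sock_family != FAMILY_NONE && peer->sock_family != family) {
        peer_shutdown(peer);   /* side effect: state becomes PEER_CLOSED */
    }
    peer->sock_family = family;
}

/*
 * Tries each address in order and returns the peer state at the moment
 * the first connect succeeds, i.e. the state a later send handler would see.
 * If restore_state is true, the state is set back to PEER_CONNECTING after
 * every socket creation (the workaround described in this mail).
 */
static peer_state_t try_connect(peer_t *peer, const addr_t *addrs, size_t n,
                                bool restore_state)
{
    assert(addrs != NULL && n > 0);
    peer->state = PEER_CONNECTING;
    for (size_t i = 0; i < n; i++) {
        create_socket(peer, addrs[i].family);
        if (restore_state) {
            peer->state = PEER_CONNECTING;
        }
        if (addrs[i].reachable) {
            return peer->state;
        }
    }
    peer_shutdown(peer);       /* no address worked */
    return peer->state;
}
```

Calling try_connect with an unreachable IPv4 address followed by a reachable IPv6 address and restore_state = false gives PEER_CLOSED in this model, which corresponds to the "invalid connection state" case. With restore_state = true it gives PEER_CONNECTING. Whether resetting the state at that point is safe in the real module is exactly what I am unsure about.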