[OMPI devel] Startup failure on mixed IPv4/IPv6 environment (oob tcp bug?)

2007-08-05 Thread dispanser
Hello,

While testing on an unusual network setup, I came across a possible
bug in the oob tcp module.
Setup: two nodes, vm0 and vm2. Their IPv4 addresses cannot reach each
other, but both nodes also have IPv6 addresses.
The call "mpirun -host vm0,vm2 /bin/hostname" never finishes, because
vm2 never completes its connection back to vm0:

- mca_oob_tcp_peer_try_connect is called, and first attempts to connect
  to the IPv4 address of vm0. peer->peer_state == MCA_OOB_TCP_CONNECTING
- it creates an IPv4 socket (mca_oob_tcp_create_socket())
- connecting fails (network unreachable)
- next try: the IPv6 address
- mca_oob_tcp_create_socket calls mca_oob_tcp_peer_shutdown, because
  the address family of the existing socket does not match.
  mca_oob_tcp_peer_shutdown sets peer->peer_state =
  MCA_OOB_TCP_CLOSED
- mca_oob_tcp_peer_try_connect successfully connects to the IPv6
  address

Despite the successful connection, the peer is left in the wrong state
(MCA_OOB_TCP_CLOSED) and consequently, mca_oob_tcp_peer_send_handler
bails out with "invalid connection state".
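
The sequence above can be sketched as a small, self-contained model. This
is not the Open MPI code: the model_* functions, model_peer_t, and the
FAMILY_* constants are hypothetical stand-ins, and only the two state
names are taken from the report. It only illustrates how the shutdown on
an address-family mismatch could leave the peer marked as closed even
though the last connect succeeded.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only. The two state names come from the report; every
 * other name here is a hypothetical stand-in for the real code. */
typedef enum { MCA_OOB_TCP_CLOSED, MCA_OOB_TCP_CONNECTING } peer_state_t;
typedef enum { FAMILY_NONE, FAMILY_INET, FAMILY_INET6 } family_t;

typedef struct {
    peer_state_t peer_state;
    family_t     sock_family;  /* family of the currently open socket */
} model_peer_t;

/* Models mca_oob_tcp_peer_shutdown: drop the socket, mark peer closed. */
void model_peer_shutdown(model_peer_t *peer)
{
    peer->sock_family = FAMILY_NONE;
    peer->peer_state  = MCA_OOB_TCP_CLOSED;
}

/* Models mca_oob_tcp_create_socket: an existing socket of a different
 * address family triggers a shutdown before the new socket is made. */
void model_create_socket(model_peer_t *peer, family_t wanted)
{
    if (peer->sock_family != FAMILY_NONE && peer->sock_family != wanted) {
        model_peer_shutdown(peer);
    }
    peer->sock_family = wanted;
}

/* Models mca_oob_tcp_peer_try_connect: walk the address list in order;
 * reachable[i] != 0 means the connect to address i succeeds.
 * Returns the peer state at the moment the loop ends. */
peer_state_t model_try_connect(model_peer_t *peer,
                               const family_t *families,
                               const int *reachable, size_t n)
{
    assert(peer != NULL);
    peer->peer_state = MCA_OOB_TCP_CONNECTING;
    for (size_t i = 0; i < n; i++) {
        model_create_socket(peer, families[i]);
        if (reachable[i]) {
            break;  /* connected, but peer_state may have been reset */
        }
    }
    return peer->peer_state;
}
```

Calling model_try_connect with the families { FAMILY_INET, FAMILY_INET6 }
and reachable { 0, 1 } ends with an IPv6 socket but with the state
MCA_OOB_TCP_CLOSED, which matches the behaviour described above.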

I fixed the problem by setting the peer_state to MCA_OOB_TCP_CONNECTING
after creating the socket, which works for me.  I'm not sure if this is
always correct, though.

Thomas

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany



Re: [OMPI devel] [RFC] Upgrade to newer libtool 2.1 snapshot

2007-08-05 Thread Ralf Wildenhues
Hi Jeff,

* Jeff Squyres wrote on Fri, Aug 03, 2007 at 10:33:51PM CEST:
> WHAT: Upgrade to a newer Libtool 2.1 nightly snapshot (we are  
> currently using 1.2362 2007/01/23) for making OMPI tarballs.
> 
> WHY: https://svn.open-mpi.org/trac/ompi/ticket/982 is fixed by newer  
> Libtool snapshots (e.g., 1.2444 2007/04/10 is what I have installed  
> at Cisco).

Is it?  If so, then I would like to know why (config.log outputs for
both would be nice).  Could have been an Autoconf update instead.
Asking because I don't think the bug was consciously fixed in Libtool;
only a test was added to expose the issue.  I'll put it on my list of
things to look at.

> Plus, it's a newer version, so it's better, right?  ;-)

FWIW, a patch applied today fixes a regression introduced on 2007-05-08
and reported by Brian.

Cheers,
Ralf