Re: [OMPI devel] Change in btl/tcp

2008-04-21 Thread Adrian Knoth
On Mon, Apr 21, 2008 at 09:04:28AM -0400, Josh Hursey wrote:
> Adrian,
Hi!
> Has there been any progress on this bug? If you still cannot reproduce
> it, if you send either Tim Prins or I a debugging patch we can run
> with it. Or we can try to arrange access to one of our machines for you.

Re: [OMPI devel] Change in btl/tcp

2008-04-21 Thread Josh Hursey
Adrian, Has there been any progress on this bug? If you still cannot reproduce it, if you send either Tim Prins or I a debugging patch we can run with it. Or we can try to arrange access to one of our machines for you. This bug is making it difficult for us to continue working off of the

Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Tim Prins
To echo what Josh said, there are no special compile flags being used. If you send me a patch with debug output, I'd be happy to run it for you. Both odin and sif are fairly normal Linux-based clusters, with Ethernet and openib IP networks. The Ethernet network has both IPv4 and IPv6, and the op

Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Adrian Knoth
On Fri, Apr 18, 2008 at 01:00:40PM -0400, Josh Hursey wrote:
> The trick is to force Open MPI to use only tcp,self and nothing else.
> Did you try adding this (-mca btl tcp,self) to the runtime parameter
> set?
Sure. Even with 64 processes, I cannot trigger this behaviour. Neither on Linux no

Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Josh Hursey
I'm seeing this problem as well even running just 4 processes on a single node (though not as frequently as with higher process counts). The trick is to force Open MPI to use only tcp,self and nothing else. Did you try adding this (-mca btl tcp,self) to the runtime parameter set? -- Josh
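The suggestion above amounts to adding `-mca btl tcp,self` to the `mpirun` command line. Since the failure is described as intermittent, a small harness that builds that command and counts failing runs may help when trying to reproduce it. This is only a sketch: the binary name `./ring_test`, the run count, and the check for "readv failed" in stderr are assumptions for illustration, not something taken from this thread.

```python
import subprocess


def build_mpirun_cmd(program, nprocs=4, btls=("tcp", "self")):
    """Return an mpirun argv that restricts Open MPI to the given BTLs."""
    return ["mpirun", "-np", str(nprocs), "-mca", "btl", ",".join(btls), program]


def count_failures(cmd, runs=20):
    """Run cmd repeatedly and count runs that look like failures.

    A run counts as failed if it exits non-zero or prints "readv failed"
    on stderr (an assumed marker; adjust to the error actually seen).
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0 or "readv failed" in result.stderr:
            failures += 1
    return failures


if __name__ == "__main__":
    # "./ring_test" is a hypothetical MPI test binary.
    cmd = build_mpirun_cmd("./ring_test")
    print(" ".join(cmd))
```

Calling `count_failures(cmd)` on a machine with Open MPI installed would then give a rough failure rate for 4 processes on a single node; raising `nprocs` should make the problem show up more often, if the report above holds.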

Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Adrian Knoth
On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:
> Hi Adrian,
Hi!
> After this change, I am getting a lot of errors of the form:
> [sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
> mca_btl_tcp_frag_recv: readv failed: Connection reset by
> peer (104)
>
> See for instanc

Re: [OMPI devel] Change in btl/tcp

2008-04-18 Thread Tim Prins
Hi Adrian,

After this change, I am getting a lot of errors of the form:

[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

I have found this espe
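When there are "a lot of errors of the form" quoted above, it can help to split each line into fields and tally them per host or per process. The following is a rough sketch that parses exactly the line shape shown in this message. The field names `id_major`, `id_minor`, and `proc` are my own labels for the bracketed numbers, not names confirmed anywhere in the thread, and other Open MPI messages may well use a different layout.

```python
import re

# Matches lines shaped like the one quoted in this message:
# [host][[N,M],P][file.c:LINE:function] message text (ERRNO)
LINE_RE = re.compile(
    r"\[(?P<host>[^\]]+)\]"
    r"\[\[(?P<id_major>\d+),(?P<id_minor>\d+)\],(?P<proc>\d+)\]"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]"
    r"\s+(?P<message>.*?)\s+\((?P<errno>\d+)\)\s*$"
)

INT_FIELDS = ("id_major", "id_minor", "proc", "line", "errno")


def parse_btl_error(text):
    """Return a dict of fields for one error line, or None if it does not match."""
    match = LINE_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = (
        "[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] "
        "mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)"
    )
    print(parse_btl_error(sample))
```

Feeding a whole job log through `parse_btl_error` and grouping by `host` or `proc` would show whether the resets cluster on particular nodes or processes.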