I'm seeing this problem as well, even when running just 4 processes on a single node (though not as frequently as with higher process counts). The trick is to force Open MPI to use only tcp,self and nothing else. Did you try adding this (-mca btl tcp,self) to the runtime parameter set?
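For example, something like the following (assuming the hello_c binary from the examples directory; adjust the path and process count to your setup):

```shell
# Restrict Open MPI to the TCP and self (loopback) BTLs only,
# so no other interconnect component gets selected:
mpirun -np 4 -mca btl tcp,self ./hello_c
```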

-- Josh

On Apr 18, 2008, at 12:56 PM, Adrian Knoth wrote:

On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:

Hi Adrian,

Hi!

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by
peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

That's weird. I've tried hello_c.c on about ten machines with different
network configurations; none of them showed any problems at all.

Do you have a very special setup? And if need be, would it be possible
to debug on your machine?


Of all the MTT sites, this error only occurs on Odin and Sif. What's so
special about these clusters?

I have found this especially easy to reproduce if I run 16 processes all
with just the tcp and self btls on the same machine, running the
'hello_c' program in the examples directory.

Unfortunately, I can't reproduce it that way. If this is related to the
change, then it would mean that mca_btl_tcp_proc_accept() returns false,
either after the large loop or in mca_btl_tcp_endpoint_accept().

Do you have the cycles to add some BTL_VERBOSE-lines to see where things
go wrong? Or even to step through with the debugger?

If you'd like me to do it, I could provide you with my ssh key.


Cheerio


--
mail: a...@thur.de      http://adi.thur.de      PGP/GPG: key via keyserver

Dying is only half as bad if you smoke KIM.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
