I'm seeing this problem as well, even when running just 4 processes on a single node (though not as frequently as with higher process counts). The trick is to force Open MPI to use only tcp,self and nothing else. Did you try adding this (-mca btl tcp,self) to the runtime parameter set?
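For example, something like the following (assuming the hello_c binary from the examples directory; adjust the path and process count to your setup):

```shell
# Restrict Open MPI to the TCP and self (loopback) BTLs only,
# so no other interconnect component gets selected:
mpirun -np 4 -mca btl tcp,self ./hello_c
```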

-- Josh

On Apr 18, 2008, at 12:56 PM, Adrian Knoth wrote:

On Fri, Apr 18, 2008 at 08:04:17AM -0400, Tim Prins wrote:

Hi Adrian,

Hi!

After this change, I am getting a lot of errors of the form:
[sif2][[12854,1],9][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed: Connection reset by
peer (104)

See for instance: http://www.open-mpi.org/mtt/index.php?do_redir=615

That's weird. I've tried hello_c.c on about ten machines with different
network configurations; none of them showed any problems at all.

Do you have a very special setup? And if need be, would it be possible
to debug on your machine?


Of all the MTT sites, this error only occurs on Odin and Sif. What's so
special about these clusters?

I have found this especially easy to reproduce if I run 16 processes all
with just the tcp and self btls on the same machine, running the
'hello_c' program in the examples directory.

Unfortunately, I can't reproduce it that way. If this is related to the
change, then it would mean that mca_btl_tcp_proc_accept() returns false,
either after the large loop or in mca_btl_tcp_endpoint_accept().

Do you have the cycles to add some BTL_VERBOSE-lines to see where things
go wrong? Or even to step through with the debugger?

If you'd like me to do it, I could provide you with my ssh key.


Cheerio


--
mail: a...@thur.de      http://adi.thur.de      PGP/GPG: key via keyserver

Dying is only half as bad if you smoke KIM.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
