Hi Adrian,

Thanks for that info. The OS is Linux. I was able to get rid of the "connection reset" (104) errors by increasing btl_tcp_endpoint_cache. That leaves the "no route to host" (113) problem.

Interestingly, I sometimes (sometimes not) get the same error on daemon startup with ssh when experimenting with very large jobs:

ssh: connect to host blade45 port 22: No route to host
[blade1:05832] ERROR: A daemon on node blade45 failed to start as expected.
[blade1:05832] ERROR: There may be more information available from
[blade1:05832] ERROR: the remote shell (see above).
[blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file ../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187

I can understand this arising from an ssh bottleneck, with a timeout. So, a question to the OMPI folks: could the "no route to host" (113) error in btl_tcp_endpoint.c:572 also result from a timeout?

Thanks,

Todd

On 4/3/07 5:44 AM, "Adrian Knoth" <a...@drcomp.erfurt.thur.de> wrote:

> On Mon, Apr 02, 2007 at 07:15:41PM -0400, Heywood, Todd wrote:
>
> Hi,
>
>> [blade90][0,1,223][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>
> errno is OS specific, so it's important to know which OS you're using.
>
> You can always convert these error numbers to normal strings with perl:
>
> adi@drcomp:~$ perl -e 'die$!=113'
> No route to host at -e line 1.
>
> (read: 113 is "No route to host" under Linux. If you're not using Linux,
> your 113 probably means something else)
>
> If it's really "No route to host", check your routing setup.
>
> adi@drcomp:~$ perl -e 'die$!=104'
> Connection reset by peer at -e line 1.
>
> This usually happens when a remote process dies, perhaps due to
> segfaults.
>
> HTH
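
For anyone without perl handy, the errno lookup quoted above can probably be done with Python's standard library as well. This is only a minimal sketch: the helper name describe_errno is made up for illustration, and, as Adrian notes, the mapping is OS specific, so 104 and 113 only mean "connection reset" and "no route to host" on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform.

    describe_errno is a hypothetical helper, not part of Open MPI.
    """
    # errno.errorcode maps numeric values to names such as "EHOSTUNREACH".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message string for the value.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 and 113 are the values from the logs in this thread; what they
    # mean depends on the OS this script runs on.
    for code in (104, 113):
        name, message = describe_errno(code)
        print(f"{code}: {name} ({message})")
```

Run on one of the compute nodes rather than a workstation, so the numbers are decoded by the same OS that produced them.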