Hi Adrian,

Thanks for that info. The OS is Linux. I was able to get rid of the
"connection reset" (104) errors by increasing btl_tcp_endpoint_cache. That
leaves the "no route to host" (113) problem.

Interestingly, I intermittently get the same error on daemon startup over
ssh when experimenting with very large jobs:

ssh: connect to host blade45 port 22: No route to host
[blade1:05832] ERROR: A daemon on node blade45 failed to start as expected.
[blade1:05832] ERROR: There may be more information available from
[blade1:05832] ERROR: the remote shell (see above).
[blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187

I can see how this could arise from an ssh bottleneck that ends in a
timeout. So, a question to the OMPI folks: could the "no route to host"
(113) error in btl_tcp_endpoint.c:572 also result from a timeout?
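
For anyone following along without perl handy, here is a minimal Python
sketch of the same errno lookup Adrian showed. The helper name describe()
is made up for illustration, and the names and messages depend on the OS,
so treat the result as valid only for the machine it runs on. On Linux I
would expect 104 and 113 to come back as ECONNRESET and EHOSTUNREACH.

```python
import errno
import os


def describe(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'EHOSTUNREACH';
    # values unknown to this platform fall back to a placeholder.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for the code.
    return name, os.strerror(code)


if __name__ == "__main__":
    # The two codes discussed in this thread (Linux numbering assumed).
    for code in (104, 113):
        name, message = describe(code)
        print(f"{code}: {name}: {message}")
```

Seeing the symbolic name next to the number makes it easier to tell
whether the failure is reported as an unreachable host or as something
else, such as a timeout, which has its own errno value.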

Thanks,

Todd




On 4/3/07 5:44 AM, "Adrian Knoth" <a...@drcomp.erfurt.thur.de> wrote:

> On Mon, Apr 02, 2007 at 07:15:41PM -0400, Heywood, Todd wrote:
> 
> Hi,
> 
>> [blade90][0,1,223][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
> 
> errno is OS specific, so it's important to know which OS you're using.
> 
> You can always convert these error numbers to normal strings with perl:
> 
> adi@drcomp:~$ perl -e 'die$!=113'
> No route to host at -e line 1.
> 
> (read: 113 is "No route to host" under Linux. If you're not using Linux,
>  your 113 probably means something else)
> 
> If it's really "No route to host", check your routing setup.
> 
> 
> adi@drcomp:~$ perl -e 'die$!=104'
> Connection reset by peer at -e line 1.
> 
> 
> This usually happens when a remote process dies, perhaps due to
> segfaults.
> 
> 
> HTH
