Hi All,

For a parallel MPI job, we intermittently get the following
messages:

[n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
of attempts to create TCP connection has been exceeded.  Can not
communicate with peer
[n047:25850] [[36630,0],1] ORTE_ERROR_LOG: Unreachable in file
../../../../../openmpi-1.6.5/orte/mca/grpcomm/bad/grpcomm_bad_module.c at
line 412
[n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
of attempts to create TCP connection has been exceeded.  Can not
communicate with peer

These messages appear in the middle of a running job. We use OpenMPI 1.6.5
and OFED 2.4 on CentOS 6.

-- 
Grigory Shamov
HPC Analyst,
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625


