Hi All,

For a parallel MPI job, we sometimes (but not always) get the following messages:

[n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
[n047:25850] [[36630,0],1] ORTE_ERROR_LOG: Unreachable in file ../../../../../openmpi-1.6.5/orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 412
[n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer

These messages appear in the middle of a running job, not at startup. We use OpenMPI 1.6.5 and OFED 2.4 on CentOS 6.

--
Grigory Shamov
HPC Analyst, Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building, (204) 474-9625