It looks as though something is trying to talk to a socket opened by one of 
your daemons. My guess is that a prior job left a daemon alive, and that 
daemon is still talking on the same socket.

Are you by chance using static ports for the job? Did you run another job just 
before this one that might have left a daemon somewhere?
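
If you are using static ports, one quick sanity check is to see whether a port 
is already held by some other process on a node before launching. The following 
is only a rough sketch in Python, not anything Open MPI provides: the function 
name port_is_free and the port number 10000 are made up for illustration, so 
substitute whatever port your job is actually configured to use.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    A failed bind means some other process (for example a daemon left
    over from an earlier job) is probably still holding the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    PORT = 10000  # hypothetical static port, replace with your own
    if port_is_free(PORT):
        print(f"port {PORT} looks free on this node")
    else:
        print(f"port {PORT} is already in use on this node")
```

This only tests the node it runs on, so it would have to be run on each node 
of interest. A port that looks busy does not prove a stale daemon is the cause, 
but it is a hint worth following up with a process listing.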


On Dec 15, 2010, at 1:05 AM, Gilbert Grosdidier wrote:

> Hello,
> 
> Running OpenMPI 1.4.3 on an SGI Altix cluster with 4096 cores, I got
> this error message right at startup:
> mca_oob_tcp_peer_recv_connect_ack: received unexpected process identifier 
> [[13816,0],209]
> 
> and the whole job then spins indefinitely, without 
> crashing or aborting.
> 
> What could be the culprit, please?
> Is there a workaround?
> Which parameter should be tuned?
> 
> Thanks in advance for any help,    Best,    G.
> 
> 

