On Jan 17, 2013, at 2:25 AM, Jure Pečar <pega...@nerv.eu.org> wrote:

> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain <r...@open-mpi.org> wrote:
> 
>> This one means that a backend node lost its connection to mpirun. We use a 
>> TCP socket between the daemon on a node and mpirun to launch the processes 
>> and to detect if/when that node fails for some reason.
> 
> Hm. And what would be the reasons for this? Too much load on node where 
> mpirun is run?

No, the error means the connection was completely lost, i.e., the socket was 
closed. Do I understand correctly that the job runs for a while and then dies, 
so there are processes executing on the node that reports the lost connection?

Or is this happening on startup of the larger job, or during a call to 
MPI_Comm_spawn?
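
To illustrate the difference between a loaded peer and a closed socket, here 
is a minimal Python sketch. It is not Open MPI's code, and it assumes nothing 
about how the daemons are implemented. The names connection_lost, mpirun_end, 
and daemon_end are hypothetical, and a local socket pair stands in for the 
TCP link between mpirun and a daemon.

```python
import socket


def connection_lost(sock, timeout=1.0):
    """Return True if the peer closed or reset the connection.

    A recv() that returns zero bytes means the other end closed the socket.
    A timeout only means the peer is quiet (for example, busy), not gone.
    """
    sock.settimeout(timeout)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except socket.timeout:
        return False
    except ConnectionError:
        return True


# Hypothetical stand-ins for the two ends of the mpirun <-> daemon link.
mpirun_end, daemon_end = socket.socketpair()

busy = connection_lost(daemon_end, timeout=0.2)  # peer silent but still open
mpirun_end.close()
lost = connection_lost(daemon_end, timeout=0.2)  # peer has closed the socket
daemon_end.close()

print(busy, lost)
```

A slow or overloaded peer only shows up as a timeout here, whereas a closed 
socket is reported immediately, which is why heavy load by itself would not 
be expected to produce this kind of error.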


> 
> -- 
> 
> Jure Pečar
> http://jure.pecar.org
> 

