On Jan 17, 2013, at 2:25 AM, Jure Pečar <pega...@nerv.eu.org> wrote:
> On Wed, 16 Jan 2013 07:46:41 -0800
> Ralph Castain <r...@open-mpi.org> wrote:
>
>> This one means that a backend node lost its connection to mpirun. We use a
>> TCP socket between the daemon on a node and mpirun to launch the processes
>> and to detect if/when that node fails for some reason.
>
> Hm. And what would be the reasons for this? Too much load on node where
> mpirun is run?

No, the error means the connection was completely lost - i.e., the socket was closed.

Do I understand correctly that the job runs for a while and then dies? So there are processes executing on the node that reports a lost connection? Or is this happening on startup of the larger job, or during a call to MPI_Comm_spawn?

> --
>
> Jure Pečar
> http://jure.pecar.org
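
To illustrate the difference between a peer that is merely slow (for example, under heavy load) and one whose socket was actually closed, here is a minimal sketch in Python. It is not Open MPI's code: `peer_state` is a hypothetical helper, and `socket.socketpair()` only stands in for the TCP connection between mpirun and a daemon. The point is that a busy peer shows up as a read timeout, while a closed socket shows up as an end-of-file (an empty read) or a connection error.

```python
import socket


def peer_state(sock, timeout=0.2):
    """Classify the peer on `sock` as 'alive', 'silent', or 'closed'.

    Hypothetical helper for illustration only; it consumes up to one
    byte from the socket if data is available.
    """
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # Nothing arrived in time: the peer may just be slow or busy.
        return "silent"
    except ConnectionError:
        # The connection was reset or aborted.
        return "closed"
    # An empty read means the peer closed its end of the connection.
    return "alive" if data else "closed"


# Stand-in for the mpirun <-> daemon connection (assumption, not real TCP).
mpirun_end, daemon_end = socket.socketpair()

print(peer_state(mpirun_end))   # silent: peer is connected but sent nothing
daemon_end.sendall(b"x")        # pretend this is a heartbeat byte
print(peer_state(mpirun_end))   # alive: data arrived
daemon_end.close()              # the daemon side goes away
print(peer_state(mpirun_end))   # closed: the read returns end-of-file
mpirun_end.close()
```

In this sketch a loaded node would only ever produce "silent"; "closed" requires the other end of the socket to have gone away, which is the situation the error message describes.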