On Jan 16, 2013, at 7:41 AM, Jure Pečar <pega...@nerv.eu.org> wrote:

> 
> Hello,
> 
> I have a large Fortran code that processes weather forecast data. It runs 
> fine on a smaller dataset, but on a larger dataset I get some errors I've 
> never seen before:
> 
> [node061:05144] [[55141,0],11]->[[55141,0],0] mca_oob_tcp_msg_send_handler: 
> writev failed: Bad file descriptor (9) [sd = 9]
> [node061:05144] [[55141,0],11] routed:binomial: Connection to lifeline 
> [[55141,0],0] lost

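This one means that the daemon on a backend node lost its connection to 
mpirun. We use a TCP socket between the daemon on each node and mpirun to 
launch the processes and to detect if/when that node fails for some reason. 
This is the runtime's out-of-band (OOB) channel, which is separate from the 
BTLs you select with -mca btl, so it uses TCP regardless of that setting.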
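
As a rough illustration of the idea only (this is not Open MPI's actual code), 
here is a minimal Python sketch of detecting a lost "lifeline" on a socket. 
The function name lifeline_alive and the use of socket.socketpair() as a 
stand-in for the real TCP connection between the daemon and mpirun are 
assumptions made for the example.

```python
import socket


def lifeline_alive(conn: socket.socket, timeout: float = 0.2) -> bool:
    """Return False once the peer has closed its end of the connection.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    conn.settimeout(timeout)
    try:
        # MSG_PEEK looks at pending data without consuming it.
        data = conn.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return True   # nothing to read yet, but the connection is still open
    except OSError:
        return False  # socket error (e.g. reset): treat the lifeline as lost
    return data != b""  # an empty read means the peer closed the connection


# Simulate both ends with a connected socket pair (stand-in for TCP).
launcher_end, daemon_end = socket.socketpair()
before = lifeline_alive(daemon_end)   # launcher still connected
launcher_end.close()                  # launcher goes away
after = lifeline_alive(daemon_end)    # daemon now sees the closed connection
daemon_end.close()
print(before, after)
```

The point is simply that the side holding the socket learns about the failure 
from the connection itself (an empty read or a socket error), independent of 
whatever transport carries the MPI traffic.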


> 
> and
> 
> node084:7.0.Non-fatal temporary exhaustion of send tid dma descriptors
> (elapsed=43.788s, source LID=0x49/context=11, count=1) (err=0)
> 
> I'm using QLogic software version 7.1.0.0.58 (ofed 1.5.4.1, open-mpi 1.4.3).
> 
> I'm starting this program with mpirun -mca btl openib,sm,self so I don't 
> really understand what TCP has to do with the first error message.
> 
> Also, I traced the second error message to the PSM code, but it appears even 
> if I add -mca mtl ^psm to my mpirun arguments. Why?
> 
> Any help appreciated.
> 
> 
> -- 
> 
> Jure Pečar
> http://jure.pecar.org
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

