On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:

> I'm using version 1.4.3, and I forgot to mention that I made a change in
> orterun.c at line 792:
> 
>     if (ORTE_JOB_STATE_TERMINATED != exit_state) {
>         exit(0); /* patch */
> 

I don't see how that change can keep your job running - we should still have 
terminated it. All this does is suppress the error reporting.

Regardless, openib will cause the process to fail under the described 
circumstances, which should cause OMPI to terminate all running procs. I'm not 
sure what you are doing with tcp, but it could be that there are alternative 
paths available - e.g., you have multiple NICs and remove one cable, but the 
other paths remain viable.

> Regards
> 
> 
> 
> > What version of OMPI are you using? The job should terminate in either case 
> > - what did you do to keep it running after node failure with tcp? 
> >
> > On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> >> Hi, 
> >> I want to know if anybody is having problems with fault-tolerant jobs
> >> using InfiniBand. When I run my job with tcp and anything happens to one
> >> node, my job keeps running, but if I switch the job to InfiniBand and
> >> anything happens to the InfiniBand link (e.g., cable problems), my job fails.
> >> 
> >> Does anybody know if something different needs to be done when
> >> using openib instead of tcp?
> >> 
> >> Below is an example of the message I'm receiving from MPI.
> >> 
> >> Regards, 
> >> Guilherme 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
