Re: [OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Pavel Shamis (Pasha)
You may try the ibdiagnet tool: http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/).
Pasha.
Prentice Bisbal wrote: Several jobs on my cluster just died with the error below. Are there any IB/Open MPI diagnostics I should use to diagnose, should I

[OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Prentice Bisbal
Several jobs on my cluster just died with the error below. Are there any IB/Open MPI diagnostics I should run to diagnose this, should I just reboot the nodes, or should I have the user who submitted these jobs just increase the retry count/timeout parameters?
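
For context on the timeout parameter mentioned above: as far as I recall, the Open MPI help text for this error gives the local ACK timeout as 4.096 microseconds * 2^btl_openib_ib_timeout. The sketch below only evaluates that formula for a few exponents, so you can see how much a given setting changes the wait. The function name and the sample exponents are made up for illustration, and the formula should be checked against the help text printed by your own Open MPI version.

```python
def ib_ack_timeout_seconds(exponent: int) -> float:
    """Return the InfiniBand local ACK timeout in seconds.

    Assumes the formula 4.096 microseconds * 2**exponent, where the
    exponent is the value given to btl_openib_ib_timeout.
    """
    # The timeout field is assumed to be 5 bits wide; 0 is excluded here
    # because it is treated as a special case rather than a finite timeout.
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be between 1 and 31")
    return 4.096e-6 * (2 ** exponent)


if __name__ == "__main__":
    # Sample exponents chosen only for illustration.
    for exponent in (10, 14, 20):
        seconds = ib_ack_timeout_seconds(exponent)
        print(f"btl_openib_ib_timeout={exponent}: {seconds:.6f} s per attempt")
```

If the numbers suggest a larger value, it would be passed on the command line along the lines of `mpirun --mca btl_openib_ib_timeout 20 ...`. Raising the timeout only hides the symptom if a cable, port, or switch is actually failing, so it is worth running the fabric diagnostics first.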