Sounds like the orteds aren't reporting back to mpirun after launch. The 
MPI_proctable observation just means that the procs didn't launch in the cases 
where it is absent, which matches what you already observed.

Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted 
report back to mpirun after it launches. If not, then it is likely that 
something is blocking it.
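For example, something like this (the executable name is just a placeholder for 
your hello world binary):

    mpirun -mca plm_base_verbose 5 ./mpi_hello

That should print launch-phase debug output from which you can tell whether 
every orted checks in with mpirun.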

You could also try updating to 1.6.3 or 1.6.4 in case there is some race 
condition in 1.6.1, though we haven't heard of one to date.


On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:

> On our cluster we are noticing intermittent job launch failures when using 
> OpenMPI. We are currently running OpenMPI-1.6.1, integrated with 
> Torque-4.1.3. It fails even for a simple MPI hello world application. The 
> issue is that orted gets launched on all the nodes, but a number of nodes 
> don't launch the actual MPI application. No errors are reported when the job 
> gets killed because the walltime expires. Enabling --debug-daemons doesn't 
> show any errors either. The only difference is that successful runs have 
> MPI_proctable listed, while for failures it is absent. Any help in debugging 
> this issue is greatly appreciated.
> 
> -- 
> Bharath
> 

