On our cluster we are noticing intermittent job launch failures when
using OpenMPI. We are currently running OpenMPI 1.6.1, integrated with
Torque 4.1.3. It fails even for a simple MPI hello-world application
(a minimal version of the test case is sketched below). The symptom is
that orted gets launched on all the nodes, but on a subset of nodes the
actual MPI application is never started. No errors are reported when
the job gets killed after the walltime expires, and enabling
--debug-daemons doesn't show any errors either. The only difference we
can see is that successful runs have an MPI_proctable listed, while for
the failures it is absent. Any help in debugging this issue would be
greatly appreciated.
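For reference, the hello-world test is essentially the standard MPI
skeleton below (a minimal sketch; the file name hello.c and the process
count in the launch command are illustrative, not our exact setup):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char host[MPI_MAX_PROCESSOR_NAME];
        int rank, size, len;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);
        /* On the failing nodes this line is never reached: orted is
           running there, but the application process never starts. */
        printf("Hello from rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }

We build and launch it along these lines, with mpirun picking up the
node list from Torque via the tm integration:

    mpicc hello.c -o hello
    mpirun --debug-daemons -np 16 ./hello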
--
Bharath