Hi - we've been using Open MPI for a while, but only for the last few months
with Torque/Maui.  Intermittently (roughly 1 in 10 jobs), we get MPI jobs that
fail with the error:

[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194

The failure is not reproducible - resubmitting the same job almost
always works the second time around.  The failing line appears to be
associated with looking for the Torque/Maui-generated node file,
but when I do something like
  echo $PBS_NODEFILE
  cat $PBS_NODEFILE
the file appears to be present and correct.
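
Since the failure is intermittent, a one-off cat may simply miss the bad
moment.  A minimal sketch of a check that could go at the top of the job
script, before mpirun, to record whether the node file is readable and
non-empty on every run.  The function name check_nodefile and the retry
count are hypothetical and not part of Torque or Open MPI, and this only
tests the file from the shell's point of view, which may not match what
mpirun sees:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given node file is readable and
# non-empty, retrying up to $2 times (default 5) with a 1 s pause.
check_nodefile() {
    file=$1
    tries=${2:-5}
    i=1
    while [ "$i" -le "$tries" ]; do
        if [ -r "$file" ] && [ -s "$file" ]; then
            echo "nodefile usable on attempt $i: $file" >&2
            return 0
        fi
        echo "nodefile NOT usable on attempt $i: $file" >&2
        i=$((i + 1))
        # Pause only if another attempt is coming.
        if [ "$i" -le "$tries" ]; then
            sleep 1
        fi
    done
    return 1
}

# Intended use inside a Torque job script (left commented out here):
#   check_nodefile "$PBS_NODEFILE" || echo "node file problem on $(hostname)" >&2
```

If the check logs a failure on the same runs where mpirun reports "File
open failure", that would point at the node file not being in place yet
when the job starts; if it always succeeds, the problem is more likely
elsewhere.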

We're running Open MPI 1.6.4, configured with:
./configure \
        --prefix=${DEST} \
        --with-tm=/usr/local/torque \
        --enable-mpirun-prefix-by-default \
        --with-openib=/usr \
        --with-openib-libdir=/usr/lib64

Has anyone seen anything like this before, or does anyone have ideas about
what might be happening?  The PBS node file that Open MPI appears to be
looking for is on a local filesystem (e.g.
PBS_NODEFILE=/var/spool/torque/aux//4600.tin).

                                                                        thanks,
                                                                        Noam



Noam Bernstein
Center for Computational Materials Science
NRL Code 6390
noam.bernst...@nrl.navy.mil



