I don't know what version of OMPI you're working with, so I can't pinpoint the exact line in question. However, it is most likely an error caused by not finding the PBS nodefile.
We look in the environment for PBS_NODEFILE to find the directory where the nodefile should live, and then look in that directory for a file named after our Torque-assigned jobid. The open failure indicates that the file either isn't there or isn't readable by us.

If you are on a network file system, it's possible that Torque is creating the file on your server, but the compute node just isn't seeing it fast enough. You might look at potential NFS setup switches to speed up the sync. (I've appended a small diagnostic sketch below the quoted message that you could run at the top of a job script to check whether the file is visible at launch time.)

On Aug 26, 2014, at 4:30 PM, Andrej Prsa <aprs...@gmail.com> wrote:

> Hi all,
> 
> I asked this question on the torque mailing list, and I found several
> similar issues on the web, but no definitive solutions. When we run our
> MPI programs via torque/maui, at random times, in ~50-70% of all cases,
> the job will fail with the following error message:
> 
> [node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file
> ras_tm_module.c at line 142
> [node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file
> ras_tm_module.c at line 82
> [node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file
> base/ras_base_allocate.c at line 149
> [node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file
> base/plm_base_launch_support.c at line 99
> [node1:51074] [[36074,0],0] ORTE_ERROR_LOG: File open failure in file
> plm_tm_module.c at line 194
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> 
> I compiled hwloc 1.9 with --with-libpci, torque with --enable-cpuset and
> openmpi with --with-tm, so I thought (from the docs) that this should
> make torque and openmpi communicate seamlessly. Resubmitting the exact
> same job will run the next time or the time after that. Adding sleep to
> work around any race conditions did not help.
> 
> Any ideas?
> 
> Thanks,
> Andrej
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15724.php
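In case it helps pin this down, here is the diagnostic sketch I mentioned above. It is not the actual ras_tm_module.c code, just a standalone reproduction of the check as described: it assumes the nodefile sits in the same directory as the path in PBS_NODEFILE and is named after PBS_JOBID, which may not match every Torque configuration, so treat it as a diagnostic aid rather than a faithful copy of what ORTE does.

/* Hypothetical diagnostic only -- not the actual ras_tm_module.c logic.
 * Looks in the directory of PBS_NODEFILE for a file named after the
 * Torque-assigned jobid (PBS_JOBID) and reports errno if it cannot be
 * opened. */
#include <errno.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *nodefile = getenv("PBS_NODEFILE");
    const char *jobid    = getenv("PBS_JOBID");
    if (nodefile == NULL || jobid == NULL) {
        fprintf(stderr, "PBS_NODEFILE and/or PBS_JOBID not set\n");
        return 1;
    }

    /* dirname() may modify its argument, so work on a copy. */
    char *copy = strdup(nodefile);
    char path[4096];
    snprintf(path, sizeof(path), "%s/%s", dirname(copy), jobid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        /* An NFS sync delay would typically show up here as ENOENT. */
        fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
        free(copy);
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), fp) != NULL) {
        printf("allocated slot: %s", line);   /* one hostname per slot */
    }

    fclose(fp);
    free(copy);
    return 0;
}

If the runs that fail report ENOENT here while a resubmission of the same job succeeds, that would support the NFS sync theory rather than a problem in your Torque/OMPI build.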