On 17.12.2012, at 16:42, Blosch, Edwin L wrote:

> Ralph,
>
> Unfortunately I didn't see the ssh output. The output I got was pretty much
> as before.
>
> You know, the fact that the error message is not prefixed with a host name
> makes me think it could be happening on the host where the job is placed by
> PBS. If there is something wrong in the user environment prior to mpirun,
> that is not an OpenMPI problem. And yet, in one of the jobs that failed, I
> have also printed out the results of 'ldd' on the mpirun executable just
> prior to executing the command, and all the shared libraries were resolved:

You checked the mpirun, but not the orted, which is the binary that is missing Intel's "libimf.so". Is the Intel libimf.so from the redistributable archive present on all nodes?
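A rough way to check that (only a sketch: it assumes passwordless ssh to the nodes, which the job already relies on, and takes the machinefile path and install prefix from the mpirun command quoted below; the machinefile name will differ per job):

  # run ldd on the orted itself, on every node of the failing job
  for node in $(sort -u /var/spool/PBS/aux/20804.maruhpc4-mgt); do
    echo "== $node =="
    ssh $node 'ldd /release/cfd/openmpi-intel/bin/orted | grep libimf'
  done

Any node answering "libimf.so => not found" would be the culprit.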
-- Reuti

>
> ldd /release/cfd/openmpi-intel/bin/mpirun
>         linux-vdso.so.1 =>  (0x00007fffbbb39000)
>         libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x00002abdf75d2000)
>         libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x00002abdf7887000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
>         libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
>         libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
>         libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
>         libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x00002abdf8b42000)
>         libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x00002abdf8ed7000)
>         libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x00002abdf90ed000)
>         /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)
>
> Hence my initial assumption that the shared-library problem was happening
> with one of the child processes on a remote node.
>
> So at this point I have more questions than answers. I still don't know
> whether this message comes from the main mpirun process or one of the child
> processes, although it seems it should not be the main process, given the
> output of ldd above.
>
> Any more suggestions are welcome, of course.
>
> Thanks
>
>
> /release/cfd/openmpi-intel/bin/mpirun --machinefile
> /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x
> MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached
> /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 -ro
> /tmp/fv420804.maruhpc4-mgt/restart.5000
>
> [c6n38:16219] mca:base:select:( plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [c6n38:16219] mca:base:select:( plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries:
> libimf.so: cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 16227) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
> c6n39 - daemon did not report back when launched
> c6n40 - daemon did not report back when launched
> c6n41 - daemon did not report back when launched
> c6n42 - daemon did not report back when launched
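The ssh warnings immediately followed by the orted error above also suggest the failure happens on a remote node while ssh is starting the orted, not on the node where mpirun itself runs. What counts for the orted is therefore the environment a non-interactive ssh shell provides on that node; -x LD_LIBRARY_PATH is, as far as I know, only applied to the application processes once the orted is already up. A rough comparison for one of the affected nodes (c6n39 is taken from the output above, and bash is assumed as the remote shell):

  # value a login shell on the node would get
  ssh c6n39 'bash -lc "echo \$LD_LIBRARY_PATH"'
  # value a non-interactive remote command (and hence the orted launch) gets
  ssh c6n39 'echo $LD_LIBRARY_PATH'

If the second one lacks the Intel directory and libimf.so is not in a default linker path on that node, the orted will fail exactly as shown, even though ldd on mpirun looks fine locally.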
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, December 14, 2012 2:25 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while launching jobs
>
> Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that
> will show the ssh command being used to start each orted.
>
> On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:
>
> I am having a weird problem launching cases with OpenMPI 1.4.3. It is most
> likely a problem with a particular node of our cluster, as the jobs will run
> fine on some submissions but not on others. It seems to depend on the node
> list. I am just having trouble diagnosing which node it is and what the
> nature of its problem is.
>
> One or perhaps more of the orted processes are indicating they cannot find
> an Intel math library. The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries:
> libimf.so: cannot open shared object file: No such file or directory
>
> I've checked the environment just before launching mpirun, and
> LD_LIBRARY_PATH includes the necessary component to point to where the Intel
> shared libraries are located. Furthermore, my mpirun command line says to
> export the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x
> MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v',
> '-cycles', '10000', '-ri', 'restart.1', '-ro',
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
>
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.
> OpenMPI is built explicitly --without-torque and should be using ssh to
> launch the orted.
>
> What options can I add to get more debugging of problems launching orted?
>
> Thanks,
>
> Ed
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users