On 17.12.2012 at 16:42, Blosch, Edwin L wrote:

> Ralph,
> Unfortunately I didn’t see the ssh output.  The output I got was pretty much 
> as before.
> You know, the fact that the error message is not prefixed with a host name 
> makes me think it could be happening on the host where the job is placed by 
> PBS. If there is something wrong in the user environment prior to mpirun, 
> that is not an OpenMPI problem. And yet, in one of the jobs that failed, I 
> have also printed out the results of ‘ldd’ on the mpirun executable just prior 
> to executing the command, and all the shared libraries were resolved:

You checked mpirun, but not the orted, which is missing a "" from 
Intel. The Intel from the redistributable archive is present on all 

-- Reuti
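
The check suggested above (running ldd on orted rather than on mpirun, and on every node rather than only the launch host) could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function names missing_libs and check_orted are hypothetical, and it assumes passwordless ssh to each node and that the non-interactive ssh environment is close to the one orted is started in.

```shell
#!/bin/sh
# Hypothetical helper: read ldd-style output on stdin and keep only the
# lines for libraries the dynamic loader could not resolve.
missing_libs() {
    grep 'not found' || true
}

# Hypothetical helper: run ldd on the given binary on every unique host
# listed in a machinefile and report hosts with unresolved libraries.
#   $1 = machinefile (one host name per line, duplicates allowed)
#   $2 = path to the binary to inspect (e.g. orted)
check_orted() {
    machinefile=$1
    binary=$2
    sort -u "$machinefile" | while read -r host; do
        [ -n "$host" ] || continue
        # ssh -n stops ssh from swallowing the host list on stdin
        missing=$(ssh -n "$host" ldd "$binary" 2>&1 | missing_libs)
        if [ -n "$missing" ]; then
            printf '%s:\n%s\n' "$host" "$missing"
        fi
    done
}

# Example invocation, using the paths quoted in this thread:
# check_orted /var/spool/PBS/aux/20804.maruhpc4-mgt \
#     /release/cfd/openmpi-intel/bin/orted
```

Any host printed by check_orted is a candidate for the node where orted fails to load; hosts with a complete library set produce no output.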

> ldd /release/cfd/openmpi-intel/bin/mpirun
> =>  (0x00007fffbbb39000)
> => /release/cfd/openmpi-intel/lib/ 
> (0x00002abdf75d2000)
> => /release/cfd/openmpi-intel/lib/ 
> (0x00002abdf7887000)
> => /lib64/ (0x00002abdf7b39000)
> => /lib64/ (0x00002abdf7d3d000)
> => /lib64/ (0x00002abdf7f56000)
> => /lib64/ (0x00002abdf8159000)
> => /lib64/ (0x00002abdf83af000)
> => /lib64/ (0x00002abdf85c7000)
> => /lib64/ (0x00002abdf87e4000)
> => /appserv/intel/Compiler/11.1/072/lib/intel64/ 
> (0x00002abdf8b42000)
> => /appserv/intel/Compiler/11.1/072/lib/intel64/ 
> (0x00002abdf8ed7000)
> => 
> /appserv/intel/Compiler/11.1/072/lib/intel64/ 
> (0x00002abdf90ed000)
>         /lib64/ (0x00002abdf73b1000)
> Hence my initial assumption that the shared-library problem was happening 
> with one of the child processes on a remote node.
> So at this point I have more questions than answers.  I still don’t know if 
> this message comes from the main mpirun process or one of the child 
> processes, although it seems that it should not be the main process because 
> of the output of ldd above.
> Any more suggestions are welcomed of course.
> Thanks
> /release/cfd/openmpi-intel/bin/mpirun --machinefile 
> /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x 
> MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached 
> /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 -ro 
> /tmp/fv420804.maruhpc4-mgt/restart.5000
> [c6n38:16219] mca:base:select:(  plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:(  plm) Query of component [rsh] set priority 
> to 10
> [c6n38:16219] mca:base:select:(  plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
> cannot open shared object file: No such file or directory
> --------------------------------------------------------------------------
> A daemon (pid 16227) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> There may be more information reported by the environment (see above).
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M
> --------------------------------------------------------------------------
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         c6n39 - daemon did not report back when launched
>         c6n40 - daemon did not report back when launched
>         c6n41 - daemon did not report back when launched
>         c6n42 - daemon did not report back when launched
> From: [] On 
> Behalf Of Ralph Castain
> Sent: Friday, December 14, 2012 2:25 PM
> To: Open MPI Users
> Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while 
> launching jobs
> Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that 
> will show the ssh command being used to start each orted.
> On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <> 
> wrote:
> I am having a weird problem launching cases with OpenMPI 1.4.3.  It is most 
> likely a problem with a particular node of our cluster, as the jobs will run 
> fine on some submissions, but not other submissions.  It seems to depend on 
> the node list.  I just am having trouble diagnosing which node, and what is 
> the nature of the problem it has.
> One or perhaps more of the orted are indicating they cannot find an Intel 
> Math library.  The error is:
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
> cannot open shared object file: No such file or directory
> I’ve checked the environment just before launching mpirun, and 
> LD_LIBRARY_PATH includes the necessary component to point to where the Intel 
> shared libraries are located.  Furthermore, my mpirun command line says to 
> export the LD_LIBRARY_PATH variable:
> Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile 
> /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x 
> MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', 
> '-cycles', '10000', '-ri', 'restart.1', '-ro', 
> '/tmp/fv420761.maruhpc4-mgt/restart.1']
> My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.  
> OpenMPI is built explicitly --without-torque and should be using ssh to 
> launch the orted.
> What options can I add to get more debugging of problems launching orted?
> Thanks,
> Ed
> _______________________________________________
> users mailing list