mpiexec doesn't use pbsdsh (we use the TM API directly), but the effect is the 
same. It's been so long since I ran on a Torque machine, though, that I 
honestly don't remember how to set LD_LIBRARY_PATH on the backend nodes.

Do you have a sys admin there whom you could ask? Or you could ping the Torque 
list about it - pretty standard issue.
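
If memory serves, the usual fix is to export the submission environment with 
qsub's -V flag (or selected variables with -v), so the MOM passes 
LD_LIBRARY_PATH through to whatever it spawns. A minimal sketch, untested 
from here:

#!/bin/bash
#PBS -V
#PBS -l nodes=2:ppn=12
# -V exports the full submission environment to the job, so LD_LIBRARY_PATH
# as set at qsub time should reach the tasks the MOM launches.
mpiexec hostname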


On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:

> Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
> 
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
> PATH=/bin:/usr/bin
> LANG=C
> PBS_O_HOME=/home/admins/rsvancara
> PBS_O_LANG=en_US.UTF-8
> PBS_O_LOGNAME=rsvancara
> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_MAIL=/var/spool/mail/rsvancara
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=mgt1.wsuhpc.edu
> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
> PBS_O_QUEUE=batch
> PBS_O_HOST=login1
> HOME=/home/admins/rsvancara
> PBS_JOBNAME=STDIN
> PBS_JOBID=1672.mgt1.wsuhpc.edu
> PBS_QUEUE=batch
> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
> PBS_NODENUM=0
> PBS_TASKNUM=146
> PBS_MOMPORT=15003
> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
> PBS_VERSION=TORQUE-2.4.7
> PBS_VNODENUM=0
> PBS_ENVIRONMENT=PBS_BATCH
> ENVIRONMENT=BATCH
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
> PATH=/bin:/usr/bin
> LANG=C
> PBS_O_HOME=/home/admins/rsvancara
> PBS_O_LANG=en_US.UTF-8
> PBS_O_LOGNAME=rsvancara
> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
> PBS_O_MAIL=/var/spool/mail/rsvancara
> PBS_O_SHELL=/bin/bash
> PBS_SERVER=mgt1.wsuhpc.edu
> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
> PBS_O_QUEUE=batch
> PBS_O_HOST=login1
> HOME=/home/admins/rsvancara
> PBS_JOBNAME=STDIN
> PBS_JOBID=1672.mgt1.wsuhpc.edu
> PBS_QUEUE=batch
> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
> PBS_NODENUM=1
> PBS_TASKNUM=147
> PBS_MOMPORT=15003
> PBS_VERSION=TORQUE-2.4.7
> PBS_VNODENUM=12
> PBS_ENVIRONMENT=PBS_BATCH
> ENVIRONMENT=BATCH
> 
> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
> appear to be available.  I attempted to run mpiexec like this and it fails:
> 
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
> loading shared libraries: libimf.so: cannot open shared object file:
> No such file or directory
> pbsdsh: task 12 exit status 127
> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
> loading shared libraries: libimf.so: cannot open shared object file:
> No such file or directory
> pbsdsh: task 0 exit status 127
> 
> If this is how the Open MPI processes are being launched, then it is no
> wonder they are failing, and the LD_LIBRARY_PATH error message is
> indeed somewhat accurate.
> 
> So the next question is: how do I ensure that this information is
> available to pbsdsh?
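> 
> In the meantime, one workaround I may try is wrapping the remote command in
> a shell that sets LD_LIBRARY_PATH explicitly, since pbsdsh spawns tasks with
> a stripped-down environment (paths are from my printenv output above; untested):
> 
> /usr/local/bin/pbsdsh -h node163 /bin/bash -c 'export LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64; /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname'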
> 
> Thanks,
> 
> Randall
> 
> 
> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara <rsvanc...@gmail.com> wrote:
>> Ok, these are good things to check.  I am going to follow through with
>> this in the next hour after our GPFS upgrade.  Thanks!!!
>> 
>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
>>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>> 
>>>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>>>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>>>> pbsrsh or something like that...?
>>> 
>>> pbsdsh
>>> If TM is working, pbsdsh should work fine.
>>> 
>>> Torque+OpenMPI has been working just fine for us.
>>> Do you have libtorque on all your compute hosts?  You should see it opened
>>> on all hosts if it is working.
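>>> 
>>> A quick way to check (paths are guesses based on your install; the TM
>>> support lives in Open MPI's plm_tm plugin, so ldd on that shows whether
>>> libtorque resolves):
>>> 
>>> ldconfig -p | grep libtorque
>>> ldd /home/software/mpi/intel/openmpi-1.4.3/lib/openmpi/mca_plm_tm.so | grep torque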
>>> 
>>>> 
>>>> Try that and make sure it works.  Open MPI should be using the same API as 
>>>> that command under the covers.
>>>> 
>>>> I also have a dim recollection that the TM API support library(ies?) may 
>>>> not be installed by default.  You may have to ensure that they're 
>>>> available on all nodes...?
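>>>> 
>>>> Something like this inside an interactive job should exercise the same TM
>>>> code path that Open MPI uses (pbsdsh location is a guess; adjust for your
>>>> install):
>>>> 
>>>> qsub -I -l nodes=2:ppn=1
>>>> /usr/local/bin/pbsdsh hostname   # should print each allocated node's name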
>>>> 
>>>> 
>>>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>>> 
>>>>> I am not sure if there is any extra configuration necessary for Torque
>>>>> to forward the environment.  I have included the output of printenv
>>>>> from an interactive qsub session.  I am really at a loss here because I
>>>>> have never had this much difficulty making Torque run with Open MPI.  It
>>>>> has mostly been a good experience.
>>>>> 
>>>>> Permissions of /tmp
>>>>> 
>>>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>>> 
>>>>> mpiexec hostname on a single node:
>>>>> 
>>>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>>> 
>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> node100
>>>>> 
>>>>> mpiexec hostname on two nodes:
>>>>> 
>>>>> [rsvancara@node100 ~]$ mpiexec hostname
>>>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>>>> status = 17002
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>>>> launch so we are aborting.
>>>>> 
>>>>> There may be more information reported by the environment (see above).
>>>>> 
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>> below. Additional manual cleanup may be required - please refer to
>>>>> the "orte-clean" tool for assistance.
>>>>> --------------------------------------------------------------------------
>>>>>      node99 - daemon did not report back when launched
>>>>> 
>>>>> 
>>>>> mpiexec on one node with one CPU:
>>>>> 
>>>>> [rsvancara@node164 ~]$ mpiexec printenv
>>>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>>>> MODULE_VERSION_STACK=3.2.8
>>>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>>>> HOSTNAME=node164
>>>>> PBS_VERSION=TORQUE-2.4.7
>>>>> TERM=xterm
>>>>> SHELL=/bin/bash
>>>>> HISTSIZE=1000
>>>>> PBS_JOBNAME=STDIN
>>>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>>>> PBS_TASKNUM=1
>>>>> USER=rsvancara
>>>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>>>>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>>>>> PBS_O_HOME=/home/admins/rsvancara
>>>>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>>>>> PBS_MOMPORT=15003
>>>>> PBS_O_QUEUE=batch
>>>>> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
>>>>> MODULE_VERSION=3.2.8
>>>>> MAIL=/var/spool/mail/rsvancara
>>>>> PBS_O_LOGNAME=rsvancara
>>>>> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>> PBS_O_LANG=en_US.UTF-8
>>>>> PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
>>>>> INPUTRC=/etc/inputrc
>>>>> PWD=/home/admins/rsvancara
>>>>> _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
>>>>> PBS_NODENUM=0
>>>>> LANG=C
>>>>> MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
>>>>> LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
>>>>> PBS_O_SHELL=/bin/bash
>>>>> PBS_SERVER=mgt1.wsuhpc.edu
>>>>> PBS_JOBID=1670.mgt1.wsuhpc.edu
>>>>> SHLVL=1
>>>>> HOME=/home/admins/rsvancara
>>>>> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
>>>>> PBS_O_HOST=login1
>>>>> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
>>>>> PBS_VNODENUM=0
>>>>> LOGNAME=rsvancara
>>>>> PBS_QUEUE=batch
>>>>> MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
>>>>> LESSOPEN=|/usr/bin/lesspipe.sh %s
>>>>> PBS_O_MAIL=/var/spool/mail/rsvancara
>>>>> G_BROKEN_FILENAMES=1
>>>>> PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
>>>>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>>>>> module=() {  eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd 
>>>>> bash $*`
>>>>> }
>>>>> _=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
>>>>> OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>> OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>>>>> OMPI_MCA_mpi_yield_when_idle=0
>>>>> OMPI_MCA_orte_app_num=0
>>>>> OMPI_UNIVERSE_SIZE=1
>>>>> OMPI_MCA_ess=env
>>>>> OMPI_MCA_orte_ess_num_procs=1
>>>>> OMPI_COMM_WORLD_SIZE=1
>>>>> OMPI_COMM_WORLD_LOCAL_SIZE=1
>>>>> OMPI_MCA_orte_ess_jobid=3236233217
>>>>> OMPI_MCA_orte_ess_vpid=0
>>>>> OMPI_COMM_WORLD_RANK=0
>>>>> OMPI_COMM_WORLD_LOCAL_RANK=0
>>>>> OPAL_OUTPUT_STDERR_FD=19
>>>>> 
>>>>> mpiexec with -mca plm rsh:
>>>>> 
>>>>> [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base
>>>>> /fastscratch/admins/tmp hostname
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node164
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> node163
>>>>> 
>>>>> 
>>>>> On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Can you run anything under TM? Try running "hostname" directly from 
>>>>>> Torque to see if anything works at all.
>>>>>> 
>>>>>> The error message is telling you that the Torque daemon on the remote 
>>>>>> node reported a failure when trying to launch the OMPI daemon. Could be 
>>>>>> that Torque isn't set up to forward environments, so the OMPI daemon isn't 
>>>>>> finding required libs. You could directly run "printenv" to see how your 
>>>>>> remote environment is being set up.
>>>>>> 
>>>>>> Could be that the tmp dir lacks correct permissions for a user to create 
>>>>>> the required directories. The OMPI daemon tries to create a session 
>>>>>> directory in the tmp dir, so failure to do so would indeed cause the 
>>>>>> launch to fail. You can specify the tmp dir with a cmd line option to 
>>>>>> mpirun. See "mpirun -h" for info.
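>>>>>> 
>>>>>> Concretely, something along these lines (the tmp path is just an example;
>>>>>> check "mpirun -h" for the exact parameter name on your version):
>>>>>> 
>>>>>> pbsdsh hostname                        # does anything launch under TM?
>>>>>> pbsdsh printenv | grep LD_LIBRARY      # is the lib path forwarded?
>>>>>> mpirun -mca orte_tmpdir_base /fastscratch/tmp hostname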
>>>>>> 
>>>>>> 
>>>>>> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>>>>>> 
>>>>>>> I have a question about using Open MPI and Torque on stateless nodes.
>>>>>>> I have compiled Open MPI 1.4.3 with --with-tm=/usr/local
>>>>>>> --without-slurm, using Intel compiler version 11.1.075.
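>>>>>>> 
>>>>>>> (For the record, the configure line was along these lines; the compiler
>>>>>>> variables are from memory rather than the build log:
>>>>>>> 
>>>>>>> ./configure --prefix=/home/software/mpi/intel/openmpi-1.4.3 \
>>>>>>>     --with-tm=/usr/local --without-slurm \
>>>>>>>     CC=icc CXX=icpc F77=ifort FC=ifort)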
>>>>>>> 
>>>>>>> When I run a simple "hello world" mpi program, I am receiving the
>>>>>>> following error.
>>>>>>> 
>>>>>>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>>>>>>> status = 17002
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting 
>>>>>>> to
>>>>>>> launch so we are aborting.
>>>>>>> 
>>>>>>> There may be more information reported by the environment (see above).
>>>>>>> 
>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have 
>>>>>>> the
>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>>>>> that caused that situation.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>> the "orte-clean" tool for assistance.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>        node163 - daemon did not report back when launched
>>>>>>>        node159 - daemon did not report back when launched
>>>>>>>        node158 - daemon did not report back when launched
>>>>>>>        node157 - daemon did not report back when launched
>>>>>>>        node156 - daemon did not report back when launched
>>>>>>>        node155 - daemon did not report back when launched
>>>>>>>        node154 - daemon did not report back when launched
>>>>>>>        node152 - daemon did not report back when launched
>>>>>>>        node151 - daemon did not report back when launched
>>>>>>>        node150 - daemon did not report back when launched
>>>>>>>        node149 - daemon did not report back when launched
>>>>>>> 
>>>>>>> 
>>>>>>> But if I include:
>>>>>>> 
>>>>>>> -mca plm rsh
>>>>>>> 
>>>>>>> The job runs just fine.
>>>>>>> 
>>>>>>> I am not sure what the problem is with Torque or Open MPI that prevents
>>>>>>> the processes from launching on remote nodes.  I have posted to the
>>>>>>> Torque list, and someone suggested that temporary directory space may be
>>>>>>> causing issues.  I have 100MB allocated to /tmp.
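>>>>>>> 
>>>>>>> To rule that out I plan to check, on a compute node (100MB may be tight
>>>>>>> if Open MPI session directories accumulate):
>>>>>>> 
>>>>>>> df -h /tmp     # free space for the session directories
>>>>>>> ls -ld /tmp    # should be drwxrwxrwt (world-writable with sticky bit)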
>>>>>>> 
>>>>>>> Any ideas as to why I am having this problem would be appreciated.
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Randall Svancara
>>>>>>> http://knowyourlinux.com/
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Randall Svancara
>>>>> http://knowyourlinux.com/
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> --
>> Randall Svancara
>> http://knowyourlinux.com/
>> 
> 
> 
> 
> -- 
> Randall Svancara
> http://knowyourlinux.com/
> 

