I added that temp directory in, but it does not seem to make a
difference either way.  It was just to illustrate that I was trying to
specify the temp directory in another place.  I was under the
impression that running mpiexec in a torque/qsub interactive session
would be similar to running under torque without an interactive
session.  Either way, I receive the same error whether I put the
command into a script or run it interactively.  All the examples I
provided were run in a torque/qsub interactive session.
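
For reference, this is roughly what both attempts looked like (same
tmp dir as in the rsh example below; the script is only a sketch of
what I submitted, not a verbatim copy):

Interactive, inside "qsub -I -lnodes=2:ppn=12":

[rsvancara@node100 ~]$ mpiexec -mca orte_tmpdir_base /fastscratch/admins/tmp hostname

Batch, submitted with qsub:

#!/bin/bash
#PBS -l nodes=2:ppn=12
# sketch only -- the actual script differs slightly
# load the same compiler/MPI modules as the interactive session
module load intel/11.1.075 openmpi/1.4.3_intel
cd $PBS_O_WORKDIR
mpiexec -mca orte_tmpdir_base /fastscratch/admins/tmp hostname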

I agree, it must be something in the environment.  The problem has
proven fairly elusive so far and is contributing to premature
baldness.
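
If it helps narrow it down, I can also exercise the TM interface
directly with pbsdsh (which should be available since it ships with
Torque) from inside the same interactive session, something like:

[rsvancara@node100 ~]$ pbsdsh /bin/hostname
[rsvancara@node100 ~]$ pbsdsh /usr/bin/printenv

(full paths, since pbsdsh does not run the command through my login
shell).  That should show whether the remote MOM can spawn anything at
all under TM and what environment it hands to the spawned process.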

On Mon, Mar 21, 2011 at 11:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Mar 21, 2011, at 11:53 AM, Randall Svancara wrote:
>
>> I am not sure if there is any extra configuration necessary for torque
>> to forward the environment.  I have included the output of printenv
>> for an interactive qsub session.  I am really at a loss here because I
>> never had this much difficulty making torque run with openmpi.  It has
>> been mostly a good experience.
>
> We are not seeing this problem from other Torque users, so it appears to be
> something in your local setup.
>
> Note that running mpiexec on a single node doesn't invoke Torque at all - 
> mpiexec just fork/execs the app processes directly. Torque is only invoked 
> when running on multiple nodes.
>
> One thing stands out immediately. When you used rsh, you specified the tmp 
> dir:
>
>> -mca orte_tmpdir_base /fastscratch/admins/tmp
>
> Yet you didn't do so when running under Torque. Was there a reason?
>
>
>>
>> Permissions of /tmp
>>
>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>
>> mpiexec hostname single node:
>>
>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>
>> [rsvancara@node100 ~]$ mpiexec hostname
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>> node100
>>
>> mpiexec hostname two nodes:
>>
>> [rsvancara@node100 ~]$ mpiexec hostname
>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>> status = 17002
>> --------------------------------------------------------------------------
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>       node99 - daemon did not report back when launched
>>
>>
>> MPIexec on one node with one cpu:
>>
>> [rsvancara@node164 ~]$ mpiexec printenv
>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>> MODULE_VERSION_STACK=3.2.8
>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>> HOSTNAME=node164
>> PBS_VERSION=TORQUE-2.4.7
>> TERM=xterm
>> SHELL=/bin/bash
>> HISTSIZE=1000
>> PBS_JOBNAME=STDIN
>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>> PBS_O_WORKDIR=/home/admins/rsvancara
>> PBS_TASKNUM=1
>> USER=rsvancara
>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t:/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib:/home/software/intel/Compiler/11.1/075/lib
>> LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
>> PBS_O_HOME=/home/admins/rsvancara
>> CPATH=/home/software/intel/Compiler/11.1/075/ipp/em64t/include:/home/software/intel/Compiler/11.1/075/mkl/include:/home/software/intel/Compiler/11.1/075/tbb/include
>> PBS_MOMPORT=15003
>> PBS_O_QUEUE=batch
>> NLSPATH=/home/software/intel/Compiler/11.1/075/lib/intel64/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/ipp/em64t/lib/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/mkl/lib/em64t/locale/%l_%t/%N:/home/software/intel/Compiler/11.1/075/idb/intel64/locale/%l_%t/%N
>> MODULE_VERSION=3.2.8
>> MAIL=/var/spool/mail/rsvancara
>> PBS_O_LOGNAME=rsvancara
>> PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_LANG=en_US.UTF-8
>> PBS_JOBCOOKIE=D52DE562B685A462849C1136D6B581F9
>> INPUTRC=/etc/inputrc
>> PWD=/home/admins/rsvancara
>> _LMFILES_=/home/software/Modules/3.2.8/modulefiles/modules:/home/software/Modules/3.2.8/modulefiles/null:/home/software/modulefiles/intel/11.1.075:/home/software/modulefiles/openmpi/1.4.3_intel
>> PBS_NODENUM=0
>> LANG=C
>> MODULEPATH=/home/software/Modules/versions:/home/software/Modules/$MODULE_VERSION/modulefiles:/home/software/modulefiles
>> LOADEDMODULES=modules:null:intel/11.1.075:openmpi/1.4.3_intel
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_JOBID=1670.mgt1.wsuhpc.edu
>> SHLVL=1
>> HOME=/home/admins/rsvancara
>> INTEL_LICENSES=/home/software/intel/Compiler/11.1/075/licenses:/opt/intel/licenses
>> PBS_O_HOST=login1
>> DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
>> PBS_VNODENUM=0
>> LOGNAME=rsvancara
>> PBS_QUEUE=batch
>> MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
>> LESSOPEN=|/usr/bin/lesspipe.sh %s
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> G_BROKEN_FILENAMES=1
>> PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> module=() {  eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd bash 
>> $*`
>> }
>> _=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
>> OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>> OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
>> OMPI_MCA_mpi_yield_when_idle=0
>> OMPI_MCA_orte_app_num=0
>> OMPI_UNIVERSE_SIZE=1
>> OMPI_MCA_ess=env
>> OMPI_MCA_orte_ess_num_procs=1
>> OMPI_COMM_WORLD_SIZE=1
>> OMPI_COMM_WORLD_LOCAL_SIZE=1
>> OMPI_MCA_orte_ess_jobid=3236233217
>> OMPI_MCA_orte_ess_vpid=0
>> OMPI_COMM_WORLD_RANK=0
>> OMPI_COMM_WORLD_LOCAL_RANK=0
>> OPAL_OUTPUT_STDERR_FD=19
>>
>> MPIExec with -mca plm rsh:
>>
>> [rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base
>> /fastscratch/admins/tmp hostname
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node164
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>> node163
>>
>>
>> On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Can you run anything under TM? Try running "hostname" directly from Torque 
>>> to see if anything works at all.
>>>
>>> The error message is telling you that the Torque daemon on the remote node
>>> reported a failure when trying to launch the OMPI daemon. It could be that
>>> Torque isn't set up to forward environments, so the OMPI daemon isn't finding
>>> the required libs. You could run "printenv" directly to see how your remote
>>> environment is being set up.
>>>
>>> It could also be that the tmp dir lacks the correct permissions for a user to
>>> create the required directories. The OMPI daemon tries to create a session
>>> directory in the tmp dir, so failing to do so would indeed cause the launch
>>> to fail. You can specify the tmp dir with a command-line option to mpirun;
>>> see "mpirun -h" for info.
>>>
>>>
>>> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>>>
>>>> I have a question about using OpenMPI and Torque on stateless nodes.
>>>> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
>>>> --without-slurm using intel compiler version 11.1.075.
>>>>
>>>> When I run a simple "hello world" mpi program, I am receiving the
>>>> following error.
>>>>
>>>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>>>> status = 17002
>>>> --------------------------------------------------------------------------
>>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>>> launch so we are aborting.
>>>>
>>>> There may be more information reported by the environment (see above).
>>>>
>>>> This may be because the daemon was unable to find all the needed shared
>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>> location of the shared libraries on the remote nodes and this will
>>>> automatically be forwarded to the remote nodes.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpiexec noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>>         node163 - daemon did not report back when launched
>>>>         node159 - daemon did not report back when launched
>>>>         node158 - daemon did not report back when launched
>>>>         node157 - daemon did not report back when launched
>>>>         node156 - daemon did not report back when launched
>>>>         node155 - daemon did not report back when launched
>>>>         node154 - daemon did not report back when launched
>>>>         node152 - daemon did not report back when launched
>>>>         node151 - daemon did not report back when launched
>>>>         node150 - daemon did not report back when launched
>>>>         node149 - daemon did not report back when launched
>>>>
>>>>
>>>> But if I include:
>>>>
>>>> -mca plm rsh
>>>>
>>>> The job runs just fine.
>>>>
>>>> I am not sure what the problem is with torque or openmpi that prevents
>>>> the process from launching on remote nodes.  I have posted to the
>>>> torque list, and someone suggested that limited temporary directory
>>>> space could be causing the issue.  I have 100MB allocated to /tmp.
>>>>
>>>> Any ideas as to why I am having this problem would be appreciated.
>>>>
>>>>
>>>> --
>>>> Randall Svancara
>>>> http://knowyourlinux.com/
>>>
>>>
>>>
>>
>>
>>
>> --
>> Randall Svancara
>> http://knowyourlinux.com/
>>
>
>
>



-- 
Randall Svancara
http://knowyourlinux.com/
