Re: [OMPI users] ssh between nodes

2012-02-29 Thread Randall Svancara
Depends on which launcher you are using.  My understanding is that you can
use torque to launch the MPI processes on remote nodes, but you must
compile this support into OpenMPI.  Please, someone correct me if I am
wrong.
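To make that concrete, a build-time sketch (the paths below are examples, not taken from this thread):

```shell
# Sketch: compile TM (Torque) support into Open MPI, then verify it.
# The --with-tm prefix is an example; point it at your Torque install.
configure_cmd='./configure --with-tm=/opt/torque --prefix=/opt/openmpi'
echo "$configure_cmd"
# After 'make all install', the build should list a 'tm' process-launch
# component, e.g.:
#   ompi_info | grep -i "plm: tm"
```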

For most clusters I work with and manage, we use passwordless keys.  The
reason is that some MPI implementations, like those provided by many
vendors, do not supply the requisite functionality to integrate with
Torque; Intel's OpenMPI tools and Comsol's bundled MPI implementation
are two examples.
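For the passwordless-key route, a minimal setup sketch, assuming a home directory shared across all nodes (the key filename is illustrative):

```shell
# Passwordless SSH between nodes that share an NFS-mounted home directory.
# Because $HOME is identical on every node, authorizing the key once
# authorizes it cluster-wide.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
# Generate a key with no passphrase (demo filename to avoid clobbering).
ssh-keygen -q -t rsa -N "" -f "$HOME/.ssh/id_rsa_cluster"
cat "$HOME/.ssh/id_rsa_cluster.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```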

So really, it boils down to your needs.

Thanks

Randall

On Wed, Feb 29, 2012 at 1:09 PM, Denver Smith <denver.sm...@usu.edu> wrote:

>  Hello,
>
>  On my cluster running moab and torque, I cannot ssh without a password
> between compute nodes. I can however request multiple node jobs fine. I was
> wondering if passwordless ssh keys need to be set up between compute nodes
> in order for mpi applications to run correctly.
>
>  Thanks
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Randall Svancara
Know Your Linux? <http://www.knowyourlinux.com/>


Re: [OMPI users] EXTERNAL: Re: Can you set the gid of the processes created by mpirun?

2011-09-14 Thread Randall Svancara
> children) to the gid of mpirun. No cmd line option or anything is required -
> it will just always do it.
> >>
> >> Would you mind giving it a try?
> >>
> >> Please let me know if/how it works.
> >>
>



-- 
Randall Svancara
http://knowyourlinux.com/


Re: [OMPI users] Building openmpi with PGI 11.4: won't find torque??

2011-05-02 Thread Randall Svancara
I believe I was having this same issue until I switched to the 2.5.x
series of Torque.  I will have to check my notes, which are at work,
but it is something to try at least, even though it seems like this
should not matter.

Maybe try setting LD_LIBRARY_PATH, or adding /opt/torque/lib to it.  I am
not familiar with the PGI compilers, but I have had decent luck with
the Intel compilers.
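A sketch of that suggestion (the /opt/torque/lib64 path comes from Jim's pbs-config output below; whether this satisfies the PGI link test is untested):

```shell
# Make the Torque libraries visible to configure's link tests and to the
# runtime loader before rebuilding Open MPI.
export LDFLAGS="-L/opt/torque/lib64"
export LD_LIBRARY_PATH="/opt/torque/lib64:${LD_LIBRARY_PATH:-}"
# Then re-run the build, e.g.:
#   ./configure --with-tm=/opt/torque ...
```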

Best of luck!

Randall

On Mon, May 2, 2011 at 5:22 PM, Jim Kusznir <jkusz...@gmail.com> wrote:
> Hi all:
>
> I'm trying to build openmpi 1.4.3 against PGI 11.4 on my Rocks 5.1
> system.  My "tried and true" build command for OpenMPI is:
>
> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb --define
> 'install_in_opt 1' --define 'install_modulefile 1' --define
> 'modules_rpm_name environment-modules' --define 'build_all_in_one_rpm
> 0'  --define 'configure_options --with-tm=/opt/torque' --define '_name
> openmpi-pgi2011' --define 'use_default_rpm_opt_flags 0'
> openmpi-1.4.3.spec
>
> This is what I've used to build openmpi 1.4.3 for gcc and against PGI
> 8.x (our last version of PGI installed).  This time, its not working,
> though, and with what I consider to be a very strange failure point:
>
> --- MCA component plm:tm (m4 configuration macro)
> checking for MCA component plm:tm compile mode... dso
> checking --with-tm value... sanity check ok (/opt/torque)
> checking for pbs-config... /opt/torque/bin/pbs-config
> checking tm.h usability... yes
> checking tm.h presence... yes
> checking for tm.h... yes
> checking for tm_finalize... no
> checking tm.h usability... yes
> checking tm.h presence... yes
> checking for tm.h... yes
> looking for library in lib
> checking for tm_init in -lpbs... no
> looking for library in lib64
> checking for tm_init in -lpbs... no
> looking for library in lib
> checking for tm_init in -ltorque... no
> looking for library in lib64
> checking for tm_init in -ltorque... no
> configure: error: TM support requested but not found.  Aborting
> error: Bad exit status from /var/tmp/rpm-tmp.7564 (%build)
>
>
> However, /opt/torque/ is present.  /opt/torque/bin/pbs-config returns:
> [root@aeolus modulefiles]# /opt/torque/bin/pbs-config --prefix
> /opt/torque
> [root@aeolus modulefiles]# /opt/torque/bin/pbs-config --package
> pbs
> [root@aeolus modulefiles]# /opt/torque/bin/pbs-config --version
> 2.3.0
> [root@aeolus modulefiles]# /opt/torque/bin/pbs-config --libs
> -L/opt/torque/lib64 -ltorque -Wl,--rpath -Wl,/opt/torque/lib64
>
> and /opt/torque/lib64 does have:
> [root@aeolus modulefiles]# ls /opt/torque/lib64
> libtorque.a  libtorque.la  libtorque.so  libtorque.so.2  libtorque.so.2.0.0
>
> so I'm a bit dumbfounded as to why configure doesn't "find" torque
> support...Any suggestions?
>
> --Jim
>



-- 
Randall Svancara
http://knowyourlinux.com/



Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Yeah, the system admin is me, lol... and this is a new system that I
am frantically trying to work all the bugs out of.  Torque and MPI are my
last hurdles to overcome, but I have already been through some faulty
InfiniBand equipment, bad memory, and bad drives... which is to be
expected on a new cluster.


I wish there were some kind of TM test tool; that would be really nice
for testing.
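In lieu of a dedicated tool, pbsdsh itself makes a reasonable TM smoke test, since it launches through the same TM API that Open MPI's tm component uses. A sketch (the fallback branch is only there so this degrades gracefully outside a job):

```shell
# Minimal TM smoke test: launch 'hostname' on every allocated node via
# the TM API, the same path Open MPI's tm launcher takes.
tm_smoke_test() {
    if ! command -v pbsdsh >/dev/null 2>&1; then
        echo "pbsdsh not found: run this inside a Torque job (e.g. qsub -I)"
        return 0
    fi
    pbsdsh hostname
}
tm_smoke_test
```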

I will ping the Torque list again.  Originally they forwarded me to
the openmpi list.

On Mon, Mar 21, 2011 at 12:29 PM, Ralph Castain <r...@open-mpi.org> wrote:
> mpiexec doesn't use pbsdsh (we use a TM API), but the effect is the same. 
> Been so long since I ran on a Torque machine, though, that I honestly don't 
> remember how to set the LD_LIBRARY_PATH on the backend.
>
> Do you have a sys admin there whom you could ask? Or you could ping the 
> Torque list about it - pretty standard issue.
>
>
> On Mar 21, 2011, at 1:19 PM, Randall Svancara wrote:
>
>> Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
>> (qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=0
>> PBS_TASKNUM=146
>> PBS_MOMPORT=15003
>> PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=0
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
>> PATH=/bin:/usr/bin
>> LANG=C
>> PBS_O_HOME=/home/admins/rsvancara
>> PBS_O_LANG=en_US.UTF-8
>> PBS_O_LOGNAME=rsvancara
>> PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
>> PBS_O_MAIL=/var/spool/mail/rsvancara
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=mgt1.wsuhpc.edu
>> PBS_O_WORKDIR=/home/admins/rsvancara/TEST
>> PBS_O_QUEUE=batch
>> PBS_O_HOST=login1
>> HOME=/home/admins/rsvancara
>> PBS_JOBNAME=STDIN
>> PBS_JOBID=1672.mgt1.wsuhpc.edu
>> PBS_QUEUE=batch
>> PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
>> PBS_NODENUM=1
>> PBS_TASKNUM=147
>> PBS_MOMPORT=15003
>> PBS_VERSION=TORQUE-2.4.7
>> PBS_VNODENUM=12
>> PBS_ENVIRONMENT=PBS_BATCH
>> ENVIRONMENT=BATCH
>>
>> So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
>> appear to be available.  I attempted to run mpiexec like this and it fails:
>>
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 12 exit status 127
>> [rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
>> /home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
>> loading shared libraries: libimf.so: cannot open shared object file:
>> No such file or directory
>> pbsdsh: task 0 exit status 127
>>
>> If this is how the openmpi processes are being launched, then it is no
>> wonder they are failing and the LD_LIBRARY_PATH error message is
>> indeed somewhat accurate.
>>
>> So the next question is how do I ensure that this information is
>> available to pbsdsh?
>>
>> Thanks,
>>
>> Randall
>>
>>
>> On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara <rsvanc...@gmail.com> 
>> wrote:
>>> Ok, these are good things to check.  I am going to follow through with
>>> this in the next hour after our GPFS upgrade.  Thanks!!!
>>>
>>> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
>>>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wr

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Hi.  The pbsdsh tool is great.  I ran an interactive qsub session
(qsub -I -lnodes=2:ppn=12) and then ran the pbsdsh tool like this:

[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164 printenv
PATH=/bin:/usr/bin
LANG=C
PBS_O_HOME=/home/admins/rsvancara
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=rsvancara
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_MAIL=/var/spool/mail/rsvancara
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_O_WORKDIR=/home/admins/rsvancara/TEST
PBS_O_QUEUE=batch
PBS_O_HOST=login1
HOME=/home/admins/rsvancara
PBS_JOBNAME=STDIN
PBS_JOBID=1672.mgt1.wsuhpc.edu
PBS_QUEUE=batch
PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
PBS_NODENUM=0
PBS_TASKNUM=146
PBS_MOMPORT=15003
PBS_NODEFILE=/var/spool/torque/aux//1672.mgt1.wsuhpc.edu
PBS_VERSION=TORQUE-2.4.7
PBS_VNODENUM=0
PBS_ENVIRONMENT=PBS_BATCH
ENVIRONMENT=BATCH
[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163 printenv
PATH=/bin:/usr/bin
LANG=C
PBS_O_HOME=/home/admins/rsvancara
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=rsvancara
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
PBS_O_MAIL=/var/spool/mail/rsvancara
PBS_O_SHELL=/bin/bash
PBS_SERVER=mgt1.wsuhpc.edu
PBS_O_WORKDIR=/home/admins/rsvancara/TEST
PBS_O_QUEUE=batch
PBS_O_HOST=login1
HOME=/home/admins/rsvancara
PBS_JOBNAME=STDIN
PBS_JOBID=1672.mgt1.wsuhpc.edu
PBS_QUEUE=batch
PBS_JOBCOOKIE=50E4985E63684BA781EE9294F21EE25E
PBS_NODENUM=1
PBS_TASKNUM=147
PBS_MOMPORT=15003
PBS_VERSION=TORQUE-2.4.7
PBS_VNODENUM=12
PBS_ENVIRONMENT=PBS_BATCH
ENVIRONMENT=BATCH

So one thing that strikes me as bad is that LD_LIBRARY_PATH does not
appear to be available.  I attempted to run mpiexec like this and it fails:

[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node163
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
loading shared libraries: libimf.so: cannot open shared object file:
No such file or directory
pbsdsh: task 12 exit status 127
[rsvancara@node164 ~]$ /usr/local/bin/pbsdsh  -h node164
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec hostname
/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec: error while
loading shared libraries: libimf.so: cannot open shared object file:
No such file or directory
pbsdsh: task 0 exit status 127

If this is how the openmpi processes are being launched, then it is no
wonder they are failing and the LD_LIBRARY_PATH error message is
indeed somewhat accurate.

So the next question is how do I ensure that this information is
available to pbsdsh?
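One answer, hedged: TM-spawned tasks get a minimal environment and do not source shell rc files, so the usual fixes make the needed libraries resolvable on every node rather than relying on forwarding. A sketch, reusing paths from the printenv output above as examples:

```shell
# Option 1: have mpirun set its own prefix on the remote side (this helps
# the Open MPI paths, though not third-party runtimes like libimf.so):
#   mpirun --prefix /home/software/mpi/intel/openmpi-1.4.3 hostname
#
# Option 2: register the Intel runtime with the dynamic loader on every
# compute node, so even a minimal environment can resolve libimf.so:
echo "/home/software/intel/Compiler/11.1/075/lib/intel64" > intel-rt.conf
# (as root: install intel-rt.conf into /etc/ld.so.conf.d/ and run ldconfig)
```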

Thanks,

Randall


On Mon, Mar 21, 2011 at 11:24 AM, Randall Svancara <rsvanc...@gmail.com> wrote:
> Ok, these are good things to check.  I am going to follow through with
> this in the next hour after our GPFS upgrade.  Thanks!!!
>
> On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
>> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>>
>>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>>> pbsrsh or something like that...?
>>
>> pbsdsh
>> If TM is working pbsdsh should work fine.
>>
>> Torque+OpenMPI has been working just fine for us.
>> Do you have libtorque on all your compute hosts?  You should see it open on 
>> all hosts if it works.
>>
>>>
>>> Try that and make sure it works.  Open MPI should be using the same API as 
>>> that command under the covers.
>>>
>>> I also have a dim recollection that the TM API support library(ies?) may 
>>> not be installed by default.  You may have to ensure that they're available 
>>> on all nodes...?
>>>
>>>
>>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>>
>>>> I am not sure if there is any extra configuration necessary for torque
>>>> to forward the environment.  I have included the output of printenv
>>>> for an interactive qsub session.  I am really at a loss here because I
>>>> never had this much difficulty making torque run with openmpi.  It has
>>>> been mostly a good experience.
>>>>
>>>> Permissions of /tmp
>>>>
>>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>>
>>>> mpiexec hostname single node:
>>>>
>>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>>> qsub: job 1667.mgt1.wsuhpc.edu re

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Ok, these are good things to check.  I am going to follow through with
this in the next hour after our GPFS upgrade.  Thanks!!!

On Mon, Mar 21, 2011 at 11:14 AM, Brock Palen <bro...@umich.edu> wrote:
> On Mar 21, 2011, at 1:59 PM, Jeff Squyres wrote:
>
>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>> pbsrsh or something like that...?
>
> pbsdsh
> If TM is working pbsdsh should work fine.
>
> Torque+OpenMPI has been working just fine for us.
> Do you have libtorque on all your compute hosts?  You should see it open on 
> all hosts if it works.
>
>>
>> Try that and make sure it works.  Open MPI should be using the same API as 
>> that command under the covers.
>>
>> I also have a dim recollection that the TM API support library(ies?) may not 
>> be installed by default.  You may have to ensure that they're available on 
>> all nodes...?
>>
>>
>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>
>>> I am not sure if there is any extra configuration necessary for torque
>>> to forward the environment.  I have included the output of printenv
>>> for an interactive qsub session.  I am really at a loss here because I
>>> never had this much difficulty making torque run with openmpi.  It has
>>> been mostly a good experience.
>>>
>>> Permissions of /tmp
>>>
>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>
>>> mpiexec hostname single node:
>>>
>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>>
>>> mpiexec hostname two nodes:
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>> status = 17002
>>> --
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --
>>> --
>>> mpiexec noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --
>>> --
>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --
>>>      node99 - daemon did not report back when launched
>>>
>>>
>>> MPIexec on one node with one cpu:
>>>
>>> [rsvancara@node164 ~]$ mpiexec printenv
>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>> MODULE_VERSION_STACK=3.2.8
>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>> HOSTNAME=node164
>>> PBS_VERSION=TORQUE-2.4.7
>>> TERM=xterm
>>> SHELL=/bin/bash
>>> HISTSIZE=1000
>>> PBS_JOBNAME=STDIN
>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>> PBS_TASKNUM=1
>>> USER=rsvancara
>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/sof

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
Ok, Let me give this a try.  Thanks for all your helpful suggestions.

On Mon, Mar 21, 2011 at 11:10 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Mar 21, 2011, at 11:59 AM, Jeff Squyres wrote:
>
>> I no longer run Torque on my cluster, so my Torqueology is pretty rusty -- 
>> but I think there's a Torque command to launch on remote nodes.  tmrsh or 
>> pbsrsh or something like that...?
>
> pbsrsh, IIRC
>
> So run pbsrsh  printenv to see the environment on a remote node. 
> Etc.
>
>>
>> Try that and make sure it works.  Open MPI should be using the same API as 
>> that command under the covers.
>>
>> I also have a dim recollection that the TM API support library(ies?) may not 
>> be installed by default.  You may have to ensure that they're available on 
>> all nodes...?
>
> This is true - usually not installed by default, and need to be available on 
> all nodes since Torque starts mpiexec on a backend node.
>
>>
>>
>> On Mar 21, 2011, at 1:53 PM, Randall Svancara wrote:
>>
>>> I am not sure if there is any extra configuration necessary for torque
>>> to forward the environment.  I have included the output of printenv
>>> for an interactive qsub session.  I am really at a loss here because I
>>> never had this much difficulty making torque run with openmpi.  It has
>>> been mostly a good experience.
>>>
>>> Permissions of /tmp
>>>
>>> drwxrwxrwt   4 root root   140 Mar 20 08:57 tmp
>>>
>>> mpiexec hostname single node:
>>>
>>> [rsvancara@login1 ~]$ qsub -I -lnodes=1:ppn=12
>>> qsub: waiting for job 1667.mgt1.wsuhpc.edu to start
>>> qsub: job 1667.mgt1.wsuhpc.edu ready
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>> node100
>>>
>>> mpiexec hostname two nodes:
>>>
>>> [rsvancara@node100 ~]$ mpiexec hostname
>>> [node100:09342] plm:tm: failed to poll for a spawned daemon, return
>>> status = 17002
>>> --
>>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --
>>> --
>>> mpiexec noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --
>>> --
>>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>>> below. Additional manual cleanup may be required - please refer to
>>> the "orte-clean" tool for assistance.
>>> --
>>>      node99 - daemon did not report back when launched
>>>
>>>
>>> MPIexec on one node with one cpu:
>>>
>>> [rsvancara@node164 ~]$ mpiexec printenv
>>> OMPI_MCA_orte_precondition_transports=5fbd0d3c8e4195f1-80f964226d1575ea
>>> MODULE_VERSION_STACK=3.2.8
>>> MANPATH=/home/software/mpi/intel/openmpi-1.4.3/share/man:/home/software/intel/Compiler/11.1/075/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/man/en_US:/home/software/intel/Compiler/11.1/075/mkl/../man/en_US:/home/software/Modules/3.2.8/share/man:/usr/share/man
>>> HOSTNAME=node164
>>> PBS_VERSION=TORQUE-2.4.7
>>> TERM=xterm
>>> SHELL=/bin/bash
>>> HISTSIZE=1000
>>> PBS_JOBNAME=STDIN
>>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>>> PBS_O_WORKDIR=/home/admins/rsvancara
>>> PBS_TASKNUM=1
>>> USER=rsvancara
>>> LD_LIBRARY_PATH=/home/software/mpi/intel/openmpi-1.4.3/lib:/home/software/intel/Compiler/11.1/075/lib/intel64:/home/software/intel/Compiler/11.1/075/ipp/em64t/sharedlib:/home/software/intel/Compiler/

Re: [OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
OST=login1
DYLD_LIBRARY_PATH=/home/software/intel/Compiler/11.1/075/tbb/intel64/cc4.1.0_libc2.4_kernel2.6.16.21/lib
PBS_VNODENUM=0
LOGNAME=rsvancara
PBS_QUEUE=batch
MODULESHOME=/home/software/mpi/intel/openmpi-1.4.3
LESSOPEN=|/usr/bin/lesspipe.sh %s
PBS_O_MAIL=/var/spool/mail/rsvancara
G_BROKEN_FILENAMES=1
PBS_NODEFILE=/var/spool/torque/aux//1670.mgt1.wsuhpc.edu
PBS_O_PATH=/home/software/mpi/intel/openmpi-1.4.3/bin:/home/software/intel/Compiler/11.1/075/bin/intel64:/home/software/Modules/3.2.8/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin
module=() {  eval `/home/software/Modules/$MODULE_VERSION/bin/modulecmd bash $*`
}
_=/home/software/mpi/intel/openmpi-1.4.3/bin/mpiexec
OMPI_MCA_orte_local_daemon_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
OMPI_MCA_orte_hnp_uri=3236233216.0;tcp://172.20.102.82:33559;tcp://172.40.102.82:33559
OMPI_MCA_mpi_yield_when_idle=0
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=1
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_num_procs=1
OMPI_COMM_WORLD_SIZE=1
OMPI_COMM_WORLD_LOCAL_SIZE=1
OMPI_MCA_orte_ess_jobid=3236233217
OMPI_MCA_orte_ess_vpid=0
OMPI_COMM_WORLD_RANK=0
OMPI_COMM_WORLD_LOCAL_RANK=0
OPAL_OUTPUT_STDERR_FD=19

MPIExec with -mca plm rsh:

[rsvancara@node164 ~]$ mpiexec -mca plm rsh -mca orte_tmpdir_base
/fastscratch/admins/tmp hostname
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node164
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163
node163


On Mon, Mar 21, 2011 at 9:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Can you run anything under TM? Try running "hostname" directly from Torque to 
> see if anything works at all.
>
> The error message is telling you that the Torque daemon on the remote node 
> reported a failure when trying to launch the OMPI daemon. Could be that 
> Torque isn't setup to forward environments so the OMPI daemon isn't finding 
> required libs. You could directly run "printenv" to see how your remote 
> environ is being setup.
>
> Could be that the tmp dir lacks correct permissions for a user to create the 
> required directories. The OMPI daemon tries to create a session directory in 
> the tmp dir, so failure to do so would indeed cause the launch to fail. You 
> can specify the tmp dir with a cmd line option to mpirun. See "mpirun -h" for 
> info.
>
>
> On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:
>
>> I have a question about using OpenMPI and Torque on stateless nodes.
>> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
>> --without-slurm using intel compiler version 11.1.075.
>>
>> When I run a simple "hello world" mpi program, I am receiving the
>> following error.
>>
>> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
>> status = 17002
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> --
>> mpiexec noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --
>> --
>> mpiexec was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --
>>         node163 - daemon did not report back when launched
>>         node159 - daemon did not report back when launched
>>         node158 - daemon did not report back when launched
>>         node157 - daemon did not report back when launched
>>         node156 - daemon did not report back when launched
>>         node155 - daemon did not report back when launched
>>         node154 - daemon did not report back when launched
>>         node152 - daemon did not report back when launched
>>         node151 - daemon did not report back when launched
>>         node150 - daemon did not report back when launched
>>  

[OMPI users] OpenMPI and Torque

2011-03-21 Thread Randall Svancara
I have a question about using OpenMPI and Torque on stateless nodes.
I have compiled openmpi 1.4.3 with --with-tm=/usr/local
--without-slurm using intel compiler version 11.1.075.

When I run a simple "hello world" mpi program, I am receiving the
following error.

[node164:11193] plm:tm: failed to poll for a spawned daemon, return
status = 17002
 --
 A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
 launch so we are aborting.

 There may be more information reported by the environment (see above).

 This may be because the daemon was unable to find all the needed shared
 libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
 location of the shared libraries on the remote nodes and this will
 automatically be forwarded to the remote nodes.
 --
 --
 mpiexec noticed that the job aborted, but has no info as to the process
 that caused that situation.
 --
 --
 mpiexec was unable to cleanly terminate the daemons on the nodes shown
 below. Additional manual cleanup may be required - please refer to
 the "orte-clean" tool for assistance.
 --
 node163 - daemon did not report back when launched
 node159 - daemon did not report back when launched
 node158 - daemon did not report back when launched
 node157 - daemon did not report back when launched
 node156 - daemon did not report back when launched
 node155 - daemon did not report back when launched
 node154 - daemon did not report back when launched
 node152 - daemon did not report back when launched
 node151 - daemon did not report back when launched
 node150 - daemon did not report back when launched
 node149 - daemon did not report back when launched


But if I include:

-mca plm rsh

The job runs just fine.

I am not sure what the problem is with Torque or OpenMPI that prevents
the process from launching on remote nodes.  I have posted to the
Torque list, and someone suggested that temporary directory space may be
causing issues.  I have 100MB allocated to /tmp.

Any ideas as to why I am having this problem would be appreciated.
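For debugging launcher selection, two standard Open MPI knobs are worth trying (a sketch; the fallback message is only for running outside the cluster environment):

```shell
# List the process-launch (plm) components compiled into this build;
# 'tm' must appear for Torque-native launching to work at all.
check_launchers() {
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info | grep -i "plm"
    else
        echo "ompi_info not found: load the Open MPI environment first"
    fi
}
check_launchers
# For a live trace of why the tm daemons die, raise launcher verbosity:
#   mpiexec -mca plm_base_verbose 10 hostname
```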


-- 
Randall Svancara
http://knowyourlinux.com/


[OMPI users] Error with an application, miscalculate message sizes

2011-03-09 Thread Randall Svancara
I am experiencing an error with an application called mpiBLAST.  I am
trying to understand more about what this error represents in terms of
this application:

mpiblast_writer.cpp Streamliner::CalculateMessageSizes - miscalculate
message sizes

Is this a problem with OpenMPI, InfiniBand, or something I have done
when compiling the application?

Version: openmpi 1.4.3
Compiler: Intel 11.0.1.75
Infiniband: Mellanox QDR using Mellanox OFED 1.5.1

Thanks,