[OMPI users] Automatic checkpoint/restart in OpenMPI

2009-04-20 Thread ESTEBAN MENESES ROJAS
   Hello.
   Is there any way to automatically checkpoint/restart an application in 
OpenMPI? That is, checkpointing the application without using the 
ompi-checkpoint command, perhaps via a function call in the application's 
code itself. The same question applies to restarting after a failure.
   On a related note, what is the default behavior of an OpenMPI application 
after one process fails? Does the runtime shut down the whole application?
   Thanks.



Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Gus Correa

Hi Amjad, list

HPL has some quirks to install, as I just found out.
It can be done, though.
I had used a precompiled version of HPL on my Rocks cluster before,
but that version is no longer being distributed, unfortunately.

Go to the HPL "setup" directory,
and run the script "make_generic".
This will give you a Make.<arch> template file named Make.UNKNOWN.
You can rename this file "Make.whatever_arch_you_want",
copy it to the HPL top directory,
and edit it,
adjusting the important variable definitions to your system.

For instance, where it says:
CC   = mpicc
replace by:
CC   = /full/path/to/OpenMPI/bin/mpicc
and so on for ARCH, TOPdir, etc.
Only some 4-6 variables need to be changed.
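
As a rough sketch, the whole sequence might look like this (the arch name
and the paths are placeholders for your own setup):

  cd setup
  sh make_generic                      # produces Make.UNKNOWN
  cp Make.UNKNOWN ../Make.my_cluster   # pick any arch name you like
  cd ..
  # edit Make.my_cluster: ARCH, TOPdir, CC, the BLAS library path, etc.
  #   ARCH   = my_cluster
  #   TOPdir = /path/to/hpl
  #   CC     = /full/path/to/OpenMPI/bin/mpicc
  make arch=my_cluster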

These threads show two examples:

http://marc.info/?l=npaci-rocks-discussion&m=123264688212088&w=2
http://marc.info/?l=npaci-rocks-discussion&m=123163114922058&w=2

You will also need a BLAS (Basic Linear Algebra Subprograms) library.
You may have one already on your computer.
Do "locate libblas" and "locate libgoto" to search for it.

If you don't have BLAS, you can download the Goto BLAS library
and install it, which is what I did:

http://www.tacc.utexas.edu/resources/software/

The Goto BLAS is probably the fastest version of BLAS.
However, you can try also the more traditional BLAS from Netlib:

http://www.netlib.org/blas/

I found it easier to work with gcc and gfortran (i.e. both BLAS
and OpenMPI compiled with gcc and gfortran) than to use the PGI or Intel
compilers.  However, I didn't try hard with PGI and Intel.

Read the HPL TUNING file to learn how to change/adjust
the HPL.dat parameters.
The PxQ value gives you the number of processes for mpiexec.
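For example (a sketch only, with made-up values): if the process-grid
section of HPL.dat reads

  1            # of process grids (P x Q)
  2            Ps
  4            Qs

then P x Q = 8, and you would launch something like

  mpiexec -np 8 ./xhpl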

***

The goal of benchmarking is to measure performance under heavy use
(on a parallel computer using MPI, in the HPL case).
However, beyond the performance measurements themselves,
benchmark programs in general don't produce additional results.
For instance, HPL does LU factorization of matrices and solves
linear systems with an efficient parallel algorithm.
This by itself is great, and is one reason why it is the
Top500 benchmark:
http://en.wikipedia.org/wiki/TOP500 and 
http://www.top500.org/project/linpack .


However, within HPL the LU decomposition and the
linear system solution are not applied to any particular
concrete problem.
Only the time it takes to run each part of HPL really matters.
The matrices are made up of random numbers (if I remember right),
are totally synthetic, and don't mean anything physical.
Of course LU factorization has tons of applications, but the goal
of HPL is not to explore applications, it is just to measure performance
during the number crunching linear algebra operations using MPI.

HPL will make the case that your cluster is working,
and you can tell your professors that it works with
a performance that you can measure, some X Gflops (see the xhpl output).

However, if you want also to show to your professors
that your cluster can be used for applications,
you may want to run a real world MPI program, say,
in a research area of your college, be it computational chemistry,
weather forecast, electrical engineering, structural engineering,
fluid mechanics, genome research, seismology, etc.
Depending on which area it is,
you may find free MPI programs on the Internet.

My two cents,
Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:

Let me describe what I want to do.

I had taken Linux clustering as my final year engineering project as I am 
really interested in networking.


To tell the truth, our college does not have any professor with knowledge 
of clustering.


The aim of our project was just to make a cluster, which we did. Now we 
have to show and explain our project to the professors. So I want 
something to show them how the cluster works... some program or 
benchmarking s/w.


Hope you got the problem.
And thanks again, we really appreciate your patience.




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI 1.2 rank?

2009-04-20 Thread Ralph Castain

Yes, it should - at least, for the more common environments (e.g., ssh).

On Apr 20, 2009, at 4:25 PM, Ross Boylan wrote:


On Mon, 2009-04-20 at 16:22 -0600, Ralph Castain wrote:

Afraid there really isn't anything in the 1.2.x series - we set
several MPI-specific envars beginning with 1.3.0, but not in the older
releases.

The problem with using something like OMPI_MCA_ns_nds_vpid is that we
are free to change/eliminate it at any time - in fact, you won't find
that envar in the 1.3.x series at all.

Will it work in the 1.2 series?
Ross



On Apr 20, 2009, at 3:53 PM, Ross Boylan wrote:


How do I determine my rank in a shell script under OpenMPI 1.2?
The only thing I've found that looks promising is the environment
variable OMPI_MCA_ns_nds_vpid, and earlier discussion on this list
said
that was for "internal use only".

I'm on Debian Lenny, which was just released with openmpi 1.2.7~rc2-2.

Thanks.
Ross

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] OpenMPI 1.2 rank?

2009-04-20 Thread Ross Boylan
On Mon, 2009-04-20 at 16:22 -0600, Ralph Castain wrote:
> Afraid there really isn't anything in the 1.2.x series - we set  
> several MPI-specific envars beginning with 1.3.0, but not in the older  
> releases.
> 
> The problem with using something like OMPI_MCA_ns_nds_vpid is that we  
> are free to change/eliminate it at any time - in fact, you won't find  
> that envar in the 1.3.x series at all.
Will it work in the 1.2 series?
Ross
> 
> 
> On Apr 20, 2009, at 3:53 PM, Ross Boylan wrote:
> 
> > How do I determine my rank in a shell script under OpenMPI 1.2?
> > The only thing I've found that looks promising is the environment
> > variable OMPI_MCA_ns_nds_vpid, and earlier discussion on this list  
> > said
> > that was for "internal use only".
> >
> > I'm on Debian Lenny, which was just released with openmpi 1.2.7~rc2-2.
> >
> > Thanks.
> > Ross
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] OpenMPI 1.2 rank?

2009-04-20 Thread Ralph Castain
Afraid there really isn't anything in the 1.2.x series - we set  
several MPI-specific envars beginning with 1.3.0, but not in the older  
releases.


The problem with using something like OMPI_MCA_ns_nds_vpid is that we  
are free to change/eliminate it at any time - in fact, you won't find  
that envar in the 1.3.x series at all.
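
If you absolutely need something in 1.2, a shell sketch along these lines is
the best I can suggest (unsupported - it leans on that internal envar, with a
fallback to the documented variable that the 1.3 series sets):

  #!/bin/sh
  # prefer the documented 1.3+ variable, fall back to the 1.2 internal one
  rank=${OMPI_COMM_WORLD_RANK:-$OMPI_MCA_ns_nds_vpid}
  echo "I am rank $rank"

and then launch the script with, e.g., mpirun -np 4 ./rank_test.sh.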



On Apr 20, 2009, at 3:53 PM, Ross Boylan wrote:


How do I determine my rank in a shell script under OpenMPI 1.2?
The only thing I've found that looks promising is the environment
variable OMPI_MCA_ns_nds_vpid, and earlier discussion on this list said
that was for "internal use only".

I'm on Debian Lenny, which was just released with openmpi 1.2.7~rc2-2.

Thanks.
Ross

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] OpenMPI 1.2 rank?

2009-04-20 Thread Ross Boylan
How do I determine my rank in a shell script under OpenMPI 1.2?
The only thing I've found that looks promising is the environment
variable OMPI_MCA_ns_nds_vpid, and earlier discussion on this list said
that was for "internal use only".

I'm on Debian Lenny, which was just released with openmpi 1.2.7~rc2-2.

Thanks.
Ross



Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Ralph Castain
I'm afraid this is a more extensive rewrite than I had hoped - the revisions
are most unlikely to make it for 1.3.2. Looks like it will be 1.3.3 at the
earliest.

Ralph

On Mon, Apr 20, 2009 at 7:50 AM, Lenny Verkhovsky <
lenny.verkhov...@gmail.com> wrote:

> Me too, sorry, it definitely seems like a bug - somewhere in the code there
> is probably an undefined variable.
> I just never tested this code with such a "bizarre" command line :)
>
> Lenny.
>
> On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot wrote:
>
>> Thanks,
>>
>> I am not in a hurry but it would be nice if I could benefit from this
>> feature in the next release.
>> Regards
>>
>> Geoffroy
>>
>>
>>
>> 2009/4/20 
>>
>>> Send users mailing list submissions to
>>>us...@open-mpi.org
>>>
>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> or, via email, send a message with subject or body 'help' to
>>>users-requ...@open-mpi.org
>>>
>>> You can reach the person managing the list at
>>>users-ow...@open-mpi.org
>>>
>>> When replying, please edit your Subject line so it is more specific
>>> than "Re: Contents of users digest..."
>>>
>>>
>>> Today's Topics:
>>>
>>>   1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>>>
>>>
>>> --
>>>
>>> Message: 1
>>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>>> From: Ralph Castain 
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users 
>>> Message-ID: <6378a8c1-1763-4a1c-abca-c6fcc3605...@open-mpi.org>
>>>
>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>DelSp="yes"
>>>
>>> Honestly haven't had time to look at it yet - hopefully in the next
>>> couple of days...
>>>
>>> Sorry for delay
>>>
>>>
>>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>>
>>> > Do you have any news about this bug.
>>> > Thanks
>>> >
>>> > Geoffroy
>>> >
>>> >
>>> > Message: 1
>>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>>> > From: Ralph Castain 
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > To: Open MPI Users 
>>> > Message-ID: 
>>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>> >DelSp="yes"
>>> >
>>> > Ah now, I didn't say it -worked-, did I? :-)
>>> >
>>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>>> > Lenny doesn't get to it first), but it won't be until later in the
>>> > week.
>>> >
>>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>> >
>>> > > I agree with you Ralph , and that 's what I expect from openmpi but
>>> > > my second example shows that it's not working
>>> > >
>>> > > cat hostfile.0
>>> > >r011n002 slots=4
>>> > >r011n003 slots=4
>>> > >
>>> > >  cat rankfile.0
>>> > > rank 0=r011n002 slot=0
>>> > > rank 1=r011n003 slot=1
>>> > >
>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>>> > > hostname
>>> > > ### CRASHED
>>> > >
>>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> --
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>> > > file
>>> > > > > rmaps_rank_file.c at line 404
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>> > > file
>>> > > > > base/rmaps_base_map_job.c at line 87
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>> > > file
>>> > > > > base/plm_base_launch_support.c at line 77
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>>> > > file
>>> > > > > plm_rsh_module.c at line 985
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> --
>>> > > > > A daemon (pid unknown) died unexpectedly on signal 1  while
>>> > > > attempting to
>>> > > > > launch so we are aborting.
>>> > > > >
>>> > > > > There may be more information reported by the environment (see
>>> > > > above).
>>> > > > >
>>> > > > > This may be because the daemon was unable to find all the needed
>>> > > > shared
>>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>>> > to
>>> > > > have the
>>> > > > > location of the shared libraries on the remote nodes and this
>>> > will
>>> > > > > automatically be forwarded to the remote nodes.
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> --
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> --
>>> > > > > orterun noticed that the job aborted, but has no info as to the
>>> > > > process
>>> > > > > that caused that situation.
>>> > > > >
>>> > > >
>>> > >
>>> >
>>> --
>>> > > > > orterun: clean termination accomplished
>>> > >
>>> > >
>>> > >
>>> > > M

Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Prakash Velayutham

Hi Ankush,

You can get some example MPI programs from 
http://www.pdc.kth.se/training/Tutor/MPI/Templates/index-frame.html .

You can compare the performance of these in an MPI (single processor,  
multiple processors) setting and a non-MPI (serial) setting to show how  
it can help their research.
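
For instance, a crude first comparison could be as simple as (the program 
name is just a placeholder):

  time mpirun -np 1 ./my_mpi_program    # "serial" baseline
  time mpirun -np 4 ./my_mpi_program    # parallel run across the cluster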


Hope that helps,
Prakash

On Apr 20, 2009, at 12:34 PM, Ankush Kaul wrote:


Let me describe what I want to do.

I had taken Linux clustering as my final year engineering project as  
I am really interested in networking.


To tell the truth, our college does not have any professor with  
knowledge of clustering.


The aim of our project was just to make a cluster, which we did. Now  
we have to show and explain our project to the professors. So I want  
something to show them how the cluster works... some program or  
benchmarking s/w.


Hope you got the problem.
And thanks again, we really appreciate your patience.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Eugene Loh

Ankush Kaul wrote:


Let me describe what I want to do.

I had taken Linux clustering as my final year engineering project as I 
am really interested in networking.


To tell the truth, our college does not have any professor with 
knowledge of clustering.


The aim of our project was just to make a cluster, which we did. Now 
we have to show and explain our project to the professors. So I want 
something to show them how the cluster works... some program or 
benchmarking s/w.


When you download Open MPI software, I guess there is an examples 
subdirectory.  Maybe the example codes there would suffice to illustrate 
message passing to someone who is not familiar with it.
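
For instance, something along these lines should work from the top of the 
Open MPI source tree (a sketch; the exact example file names can differ 
between releases):

  cd examples
  mpicc hello_c.c -o hello_c            # compile with the Open MPI wrapper
  mpirun -np 4 --hostfile my_hosts ./hello_c
  mpicc ring_c.c -o ring_c              # a small message-passing demo
  mpirun -np 4 --hostfile my_hosts ./ring_c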


HPL (and the HPC Challenge test suite, which includes HPL) may involve a 
little bit of wrestling, but you're probably best off wading through 
their README files, including following their advice about what to do 
if you encounter problems.


Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Ankush Kaul
Let me describe what I want to do.

I had taken Linux clustering as my final year engineering project as I am
really interested in networking.

To tell the truth, our college does not have any professor with knowledge of
clustering.

The aim of our project was just to make a cluster, which we did. Now we have
to show and explain our project to the professors. So I want something to
show them how the cluster works... some program or benchmarking s/w.

Hope you got the problem.
And thanks again, we really appreciate your patience.


Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Gus Correa

Hi Ankush

Ankush Kaul wrote:

Thanks a lot, I am implementing the passwordless cluster.

I am also trying different benchmarking software and got fed up with all the 
problems in all the software I tried. I will list a few:


*1) VampirTrace*

 I extracted the tar in /vt and then followed these steps



I never used it.


*$ ./configure --prefix=/vti*
 [...lots of output...]
*$ make all install*

After this the FAQ on open-mpi.org asks to 'Simply replace the compiler 
wrappers to activate VampirTrace' but does not tell how I should replace 
the compiler wrappers.


I try to run *mpicc-vt -c hello.c -o hello*

but it gives an error:
bash: mpicc-vt: command not found



It is not on your path.
Use the full path name, which should be /vti/bin/...  or similar
if you did install it in /vti.

Remember, "locate" is your friend!


*2) HPL*

For this I didn't understand the installation steps.


I extracted the tar in /hpl

Then it asks to 'create a file Make.<arch> in the top-level 
directory'; I created a file Make.i386.


Are you talking about the Netlib HPL-2.0?
http://netlib.org/benchmark/hpl/

Are your computers i386 (32-bit) or x86_64/em64t (64-bit)?
"uname -a" will tell.

Anyway, read their INSTALL file for where to find
template Make.<arch> files!


Then it says 'This file essentially contains the compilers
and libraries with their paths to be used'; how do I put that?

After that it asks to run the command *make arch=i386*
but it gives this error:
make[3]: Entering directory `/hpl'
make -f Make.top startup_dir arch=i386
make[4]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
include/i386
make[4]: include/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
lib
make[4]: lib: Command not found
make[4]: [startup_dir] Error 127 (ignored)
lib/i386
make[4]: lib/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
bin
make[4]: bin: Command not found
make[4]: [startup_dir] Error 127 (ignored)
bin/i386
make[4]: bin/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
make[4]: Leaving directory `/hpl'
make -f Make.top startup_src arch=i386
make[4]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
make -f Make.top leaf le=src/auxil   arch=i386
make[5]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
(  src/auxil ;  i386 )
/bin/sh: src/auxil: is a directory

Then it enters a shell prompt.

Please help, is there a simpler benchmarking software?
I don't want to give up at this point :(


I may have sent you the Intel MPI Benchmark link already.
Google will find it for you.

I wouldn't spend too much time benchmarking
on standard Ethernet TCP/IP.
Did you try your own programs?

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-






___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Ankush Kaul
Thanks a lot, I am implementing the passwordless cluster.

I am also trying different benchmarking software and got fed up with all the
problems in all the software I tried. I will list a few:

*1) VampirTrace*

 I extracted the tar in /vt and then followed these steps

*$ ./configure --prefix=/vti*
 [...lots of output...]
*$ make all install*

After this the FAQ on open-mpi.org asks to '*Simply replace the compiler
wrappers to activate VampirTrace*' but does not tell how I should replace
the compiler wrappers.

I try to run *mpicc-vt -c hello.c -o hello*
but it gives an error:
*bash: mpicc-vt: command not found*


*2) HPL*
For this I didn't understand the installation steps.

I extracted the tar in /hpl

Then it asks to '*create a file Make.<arch> in the top-level directory*'; I
created a file Make.i386.
Then it says '*This file essentially contains the compilers
and libraries with their paths to be used*'; how do I put that?

After that it asks to run the command *make arch=i386*
but it gives this error:
make[3]: Entering directory `/hpl'
make -f Make.top startup_dir arch=i386
make[4]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
include/i386
make[4]: include/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
lib
make[4]: lib: Command not found
make[4]: [startup_dir] Error 127 (ignored)
lib/i386
make[4]: lib/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
bin
make[4]: bin: Command not found
make[4]: [startup_dir] Error 127 (ignored)
bin/i386
make[4]: bin/i386: Command not found
make[4]: [startup_dir] Error 127 (ignored)
make[4]: Leaving directory `/hpl'
make -f Make.top startup_src arch=i386
make[4]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
make -f Make.top leaf le=src/auxil   arch=i386
make[5]: Entering directory `/hpl'
Make.top:161: warning: overriding commands for target `clean_arch_all'
Make.i386:84: warning: ignoring old commands for target `clean_arch_all'
(  src/auxil ;  i386 )
/bin/sh: src/auxil: is a directory

Then it enters a shell prompt.

Please help, is there a simpler benchmarking software?
I don't want to give up at this point :(


Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Gus Correa

Hi Ankush

Please read the FAQ I sent you in the previous message.
That is the answer to your repeated question.
OpenMPI (and all MPIs that I know of) requires passwordless connections.
Your program fails because you didn't setup that.

If it worked with a single compute node,
that was most likely fortuitous,
not by design.
What you see on the screen are the ssh password messages
from your two compute nodes,
but OpenMPI (or any MPI)
won't wait for your typing passwords.
Imagine if you were running your program on 1000 nodes ...,
and, say, running the program 1000 times ...
would you really like to type all those one million passwords?
The design must be scalable.

Here is one recipe for passwordless ssh on clusters:

http://agenda.clustermonkey.net/index.php/Passwordless_SSH_Logins
http://agenda.clustermonkey.net/index.php/Passwordless_SSH_(and_RSH)_Logins

Read it carefully,
the comments about MPI(ch) 1.2 and PVM are somewhat out of date,
however, the ssh recipe is fine, detailed, and clear.
Note also the nuanced difference for NFS mounted home directories
versus separate home directories on each node.

Pay a visit to OpenSSH site also, for more information:
http://www.openssh.com/
http://en.wikipedia.org/wiki/OpenSSH
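
The gist of those recipes, as a minimal sketch (assuming DSA keys and an
NFS-mounted home directory shared by the nodes; see the HOWTOs above for
the other cases):

  # on the head node, as the user that will run mpirun
  ssh-keygen -t dsa                    # accept the default file, empty passphrase
  cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys
  ssh 192.168.45.65 hostname           # should now log in without a password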

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:

Let me explain in detail,

when we had only 2 nodes, 1 master (192.168.67.18) + 1 compute node 
(192.168.45.65)

my openmpi-default-hostfile looked like:
192.168.67.18 slots=2
192.168.45.65 slots=2

after this, on running the command *mpirun /work/Pi* on the master node we got

# root@192.168.45.65's password:

after entering the password the program ran on both nodes.

Now after connecting a second compute node, and editing the hostfile:

192.168.67.18 slots=2
192.168.45.65 slots=2
192.168.67.241 slots=2

and then running the command *mpirun /work/Pi* on the master node we got

# root@192.168.45.65's password: 
root@192.168.67.241's password:

which does not accept the password.

Although we are trying to implement the passwordless cluster, I would like 
to know why this problem is occurring.



On Sat, Apr 18, 2009 at 3:40 AM, Gus Correa wrote:


Ankush

You need to setup passwordless connections with ssh to the node you just
added.  You (or somebody else) probably did this already on the
first compute node, otherwise the MPI programs wouldn't run
across the network.

See the very last sentence on this FAQ:

http://www.open-mpi.org/faq/?category=running#run-prereqs

And try this recipe (if you use RSA keys instead of DSA, replace all
"dsa" by "rsa"):


http://www.sshkeychain.org/mirrors/SSH-with-Keys-HOWTO/SSH-with-Keys-HOWTO-4.html#ss4.3


I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Ankush Kaul wrote:

Thank you, I am reading up on the tools you suggested.

I am facing another problem: my cluster is working fine with 2
hosts (1 master + 1 compute node) but when I tried to add another
node (1 master + 2 compute nodes) it is not working. It works fine
when I give the command mpirun -host  /work/Pi

but when I try to run
mpirun  /work/Pi it gives the following error:

root@192.168.45.65's password: root@192.168.67.241's password:

Permission denied, please try again. 

root@192.168.45.65's password:

Permission denied, please try again.

root@192.168.45.65's password:

Permission denied (publickey,gssapi-with-mic,password).

Permission denied, please try again.

root@192.168.67.241's password: [ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG:
Timeout in file base/pls_base_orted_cmds.c at line 275

[ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG: Timeout in file
pls_rsh_module.c at line 1166

[ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG: Timeout in file
errmgr_hnp.c at line 90

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Lenny Verkhovsky
Me too, sorry, it definitely seems like a bug - somewhere in the code there is
probably an undefined variable.
I just never tested this code with such a "bizarre" command line :)

Lenny.

On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot wrote:

> Thanks,
>
> I am not in a hurry but it would be nice if I could benefit from this
> feature in the next release.
> Regards
>
> Geoffroy
>
>
>
> 2009/4/20 
>
>> Send users mailing list submissions to
>>us...@open-mpi.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>>users-requ...@open-mpi.org
>>
>> You can reach the person managing the list at
>>users-ow...@open-mpi.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>>
>>
>> Today's Topics:
>>
>>   1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>>
>>
>> --
>>
>> Message: 1
>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>> Message-ID: <6378a8c1-1763-4a1c-abca-c6fcc3605...@open-mpi.org>
>>
>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>DelSp="yes"
>>
>> Honestly haven't had time to look at it yet - hopefully in the next
>> couple of days...
>>
>> Sorry for delay
>>
>>
>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>
>> > Do you have any news about this bug.
>> > Thanks
>> >
>> > Geoffroy
>> >
>> >
>> > Message: 1
>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>> > From: Ralph Castain 
>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > To: Open MPI Users 
>> > Message-ID: 
>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> >DelSp="yes"
>> >
>> > Ah now, I didn't say it -worked-, did I? :-)
>> >
>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>> > Lenny doesn't get to it first), but it won't be until later in the
>> > week.
>> >
>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>> >
>> > > I agree with you Ralph , and that 's what I expect from openmpi but
>> > > my second example shows that it's not working
>> > >
>> > > cat hostfile.0
>> > >r011n002 slots=4
>> > >r011n003 slots=4
>> > >
>> > >  cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > > rank 1=r011n003 slot=1
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>> > > hostname
>> > > ### CRASHED
>> > >
>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > rmaps_rank_file.c at line 404
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/rmaps_base_map_job.c at line 87
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > base/plm_base_launch_support.c at line 77
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> > > file
>> > > > > plm_rsh_module.c at line 985
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > A daemon (pid unknown) died unexpectedly on signal 1  while
>> > > > attempting to
>> > > > > launch so we are aborting.
>> > > > >
>> > > > > There may be more information reported by the environment (see
>> > > > above).
>> > > > >
>> > > > > This may be because the daemon was unable to find all the needed
>> > > > shared
>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> > to
>> > > > have the
>> > > > > location of the shared libraries on the remote nodes and this
>> > will
>> > > > > automatically be forwarded to the remote nodes.
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > orterun noticed that the job aborted, but has no info as to the
>> > > > process
>> > > > > that caused that situation.
>> > > > >
>> > > >
>> > >
>> >
>> --
>> > > > > orterun: clean termination accomplished
>> > >
>> > >
>> > >
>> > > Message: 4
>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>> > > From: Ralph Castain 
>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > To: Open MPI Users 
>> > > Message-ID: 
>> > > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> > >DelSp="yes"
>> > >
>> > > The rankfile cuts across the entire job - it isn't applied on an
>> > > app_context basis. So the ranks in your rankf

Re: [OMPI users] Problem with running openMPI program

2009-04-20 Thread Gus Correa

Hi Ankush

Ankush Kaul wrote:

Also, how can I find out where my MPI libraries and include directories are?


If you configured OpenMPI with --prefix=/some/dir they are
in /some/dir/lib and /some/dir/include,
whereas the executables (mpicc, mpiexec, etc) are in /some/dir/bin.
Otherwise OpenMPI defaults to /usr/local.

However, the preferred way to compile OpenMPI programs is to use the
OpenMPI wrappers (e.g. mpicc), and in this case you don't need to
specify the lib and include directories at all.

If you have many MPI flavors in your computers, use full path names
to avoid confusion (or carefully set the OpenMPI bin path ahead of any 
other).
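
For example (a sketch; --showme is the Open MPI wrapper option that prints
the underlying compile line, which also reveals the include and lib
directories the wrapper uses):

  /some/dir/bin/mpicc my_program.c -o my_program   # wrapper adds -I/-L/-l for you
  /some/dir/bin/mpicc --showme                     # print the full underlying command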


The Linux command "locate" helps find things (e.g. "locate mpi.h").
You may need to update the location database
with "updatedb" before using it.

I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-




On Sat, Apr 18, 2009 at 2:29 PM, Ankush Kaul wrote:


Let me explain in detail,

when we had only 2 nodes, 1 master (192.168.67.18) + 1 compute node
(192.168.45.65)
my openmpi-default-hostfile looked like:
192.168.67.18 slots=2
192.168.45.65 slots=2

after this, on running the command *mpirun /work/Pi* on the master node
we got

# root@192.168.45.65's password:

after entering the password the program ran on both nodes.

Now after connecting a second compute node, and editing the hostfile:

192.168.67.18 slots=2
192.168.45.65 slots=2
192.168.67.241 slots=2

and then running the command *mpirun /work/Pi* on the master node we got

# root@192.168.45.65's password:
root@192.168.67.241's password:

which does not accept the password.

Although we are trying to implement the passwordless cluster, I would
like to know why this problem is occurring.


On Sat, Apr 18, 2009 at 3:40 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Ankush

You need to setup passwordless connections with ssh to the node
you just
added.  You (or somebody else) probably did this already on the
first compute node, otherwise the MPI programs wouldn't run
across the network.

See the very last sentence on this FAQ:

http://www.open-mpi.org/faq/?category=running#run-prereqs

And try this recipe (if you use RSA keys instead of DSA, replace
all "dsa" by "rsa"):


http://www.sshkeychain.org/mirrors/SSH-with-Keys-HOWTO/SSH-with-Keys-HOWTO-4.html#ss4.3


I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Ankush Kaul wrote:

Thank you, I am reading up on the tools you suggested.

I am facing another problem: my cluster is working fine with
2 hosts (1 master + 1 compute node) but when I tried to add
another node (1 master + 2 compute nodes) it is not working. It
works fine when I give the command mpirun -host 
/work/Pi

but when I try to run
mpirun  /work/Pi it gives the following error:

root@192.168.45.65's password: root@192.168.67.241's password:

Permission denied, please try again. 

root@192.168.45.65's password:

Permission denied, please try again.

root@192.168.45.65's password:

Permission denied (publickey,gssapi-with-mic,password).

Permission denied, please try again.

root@192.168.67.241's password: [ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG:
Timeout in file base/pls_base_orted_cmds.c at line 275

[ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG: Timeout in
file pls_rsh_module.c at line 1166

[ccomp1.cluster:03503] [0,0,0] ORTE_ERROR_LOG: Timeout in
file errmgr_hnp.c at line 90

[ccomp1.cluster:03503]

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Geoffroy Pignot
Thanks,

I am not in a hurry but it would be nice if I could benefit from this
feature in the next release.
Regards

Geoffroy



2009/4/20 

> Send users mailing list submissions to
>us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>   1. Re: 1.3.1 -rf rankfile behaviour ?? (Ralph Castain)
>
>
> --
>
> Message: 1
> Date: Mon, 20 Apr 2009 05:59:52 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
> Message-ID: <6378a8c1-1763-4a1c-abca-c6fcc3605...@open-mpi.org>
> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>DelSp="yes"
>
> Honestly haven't had time to look at it yet - hopefully in the next
> couple of days...
>
> Sorry for delay
>
>
> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>
> > Do you have any news about this bug.
> > Thanks
> >
> > Geoffroy
> >
> >
> > Message: 1
> > Date: Tue, 14 Apr 2009 07:57:44 -0600
> > From: Ralph Castain 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> > Message-ID: 
> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> >DelSp="yes"
> >
> > Ah now, I didn't say it -worked-, did I? :-)
> >
> > Clearly a bug exists in the program. I'll try to take a look at it (if
> > Lenny doesn't get to it first), but it won't be until later in the
> > week.
> >
> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
> >
> > > I agree with you Ralph , and that 's what I expect from openmpi but
> > > my second example shows that it's not working
> > >
> > > cat hostfile.0
> > >r011n002 slots=4
> > >r011n003 slots=4
> > >
> > >  cat rankfile.0
> > > rank 0=r011n002 slot=0
> > > rank 1=r011n003 slot=1
> > >
> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> > > hostname
> > > ### CRASHED
> > >
> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > > >
> > > >
> > >
> >
> --
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > rmaps_rank_file.c at line 404
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > base/rmaps_base_map_job.c at line 87
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > base/plm_base_launch_support.c at line 77
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > plm_rsh_module.c at line 985
> > > > >
> > > >
> > >
> >
> --
> > > > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > > > attempting to
> > > > > launch so we are aborting.
> > > > >
> > > > > There may be more information reported by the environment (see
> > > > above).
> > > > >
> > > > > This may be because the daemon was unable to find all the needed
> > > > shared
> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
> > to
> > > > have the
> > > > > location of the shared libraries on the remote nodes and this
> > will
> > > > > automatically be forwarded to the remote nodes.
> > > > >
> > > >
> > >
> >
> --
> > > > >
> > > >
> > >
> >
> --
> > > > > orterun noticed that the job aborted, but has no info as to the
> > > > process
> > > > > that caused that situation.
> > > > >
> > > >
> > >
> >
> --
> > > > > orterun: clean termination accomplished
> > >
> > >
> > >
> > > Message: 4
> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
> > > From: Ralph Castain 
> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > > To: Open MPI Users 
> > > Message-ID: 
> > > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> > >DelSp="yes"
> > >
> > > The rankfile cuts across the entire job - it isn't applied on an
> > > app_context basis. So the ranks in your rankfile must correspond to
> > > the eventual rank of each process in the cmd line.
> > >
> > > Unfortunately, that means you have to count ranks. In your case, you
> > > only have four, so that makes life easier. Your rankfile would look
> > > something like this:
> > >
> > > rank 0=r001n001 slot=0
> > > rank 1=r001n002 slot=1
> > > rank 2=r001n001 slot=1
> > > rank 3=r001n002 slot=2
> > >
> > > HT

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Ralph Castain
Honestly haven't had time to look at it yet - hopefully in the next  
couple of days...


Sorry for delay


On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:


Do you have any news about this bug.
Thanks

Geoffroy


Message: 1
Date: Tue, 14 Apr 2009 07:57:44 -0600
From: Ralph Castain 
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users 
Message-ID: 
Content-Type: text/plain; charset="us-ascii"; Format="flowed";
   DelSp="yes"

Ah now, I didn't say it -worked-, did I? :-)

Clearly a bug exists in the program. I'll try to take a look at it (if
Lenny doesn't get to it first), but it won't be until later in the  
week.


On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:

> I agree with you Ralph , and that 's what I expect from openmpi but
> my second example shows that it's not working
>
> cat hostfile.0
>r011n002 slots=4
>r011n003 slots=4
>
>  cat rankfile.0
> rank 0=r011n002 slot=0
> rank 1=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> hostname
> ### CRASHED
>
> > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > >
> >
>  
--

> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > rmaps_rank_file.c at line 404
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > base/rmaps_base_map_job.c at line 87
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > base/plm_base_launch_support.c at line 77
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > plm_rsh_module.c at line 985
> > >
> >
>  
--

> > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > attempting to
> > > launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see
> > above).
> > >
> > > This may be because the daemon was unable to find all the needed
> > shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH  
to

> > have the
> > > location of the shared libraries on the remote nodes and this  
will

> > > automatically be forwarded to the remote nodes.
> > >
> >
>  
--

> > >
> >
>  
--

> > > orterun noticed that the job aborted, but has no info as to the
> > process
> > > that caused that situation.
> > >
> >
>  
--

> > > orterun: clean termination accomplished
>
>
>
> Message: 4
> Date: Tue, 14 Apr 2009 06:55:58 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>DelSp="yes"
>
> The rankfile cuts across the entire job - it isn't applied on an
> app_context basis. So the ranks in your rankfile must correspond to
> the eventual rank of each process in the cmd line.
>
> Unfortunately, that means you have to count ranks. In your case, you
> only have four, so that makes life easier. Your rankfile would look
> something like this:
>
> rank 0=r001n001 slot=0
> rank 1=r001n002 slot=1
> rank 2=r001n001 slot=1
> rank 3=r001n002 slot=2
>
> HTH
> Ralph
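
Putting the rankfile above together with the multi-exe command quoted
further down, the intent would be roughly this (an untested sketch, and as
discussed earlier in the thread it currently trips a bug):

  mpirun -rf rankfile.all \
      -n 1 -host r001n001 master.x options1 : \
      -n 1 -host r001n002 master.x options2 : \
      -n 1 -host r001n001 slave.x  options3 : \
      -n 1 -host r001n002 slave.x  options4

with rankfile.all containing the four "rank N=..." lines shown above.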
>
> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>
> > Hi,
> >
> > I agree that my examples are not very clear. What I want to do  
is to
> > launch a multiexes application (masters-slaves) and benefit from  
the

> > processor affinity.
> > Could you show me how to convert this command , using -rf option
> > (whatever the affinity is)
> >
> > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host  
r001n002

> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
> > host r001n002 slave.x options4
> >
> > Thanks for your help
> >
> > Geoffroy
> >
> >
> >
> >
> >
> > Message: 2
> > Date: Sun, 12 Apr 2009 18:26:35 +0300
> > From: Lenny Verkhovsky 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> > Message-ID:
> ><453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com 
>

> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > Hi,
> >
> > The first "crash" is OK, since your rankfile has ranks 0 and 1
> > defined,
> > while n=1, which means only rank 0 is present and can be  
allocated.

> >
> > NP must be >= the largest rank in rankfile.
> >
> > What exactly are you trying to do ?
> >
> > I tried to recreate your seqv but all I got was
> >
> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
> > hostfile.0
> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> > [witch19:30798] mca: base: component_find: paffinity
> > "mca_paffinity_linux"
> > uses an MCA interface that is not recognized (component MCA
> v1.0.0 !=

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Geoffroy Pignot
Do you have any news about this bug.
Thanks

Geoffroy


>
> Message: 1
> Date: Tue, 14 Apr 2009 07:57:44 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>DelSp="yes"
>
> Ah now, I didn't say it -worked-, did I? :-)
>
> Clearly a bug exists in the program. I'll try to take a look at it (if
> Lenny doesn't get to it first), but it won't be until later in the week.
>
> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>
> > I agree with you Ralph , and that 's what I expect from openmpi but
> > my second example shows that it's not working
> >
> > cat hostfile.0
> >r011n002 slots=4
> >r011n003 slots=4
> >
> >  cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> > hostname
> > ### CRASHED
> >
> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > >
> > >
> >
> --
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > rmaps_rank_file.c at line 404
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > base/rmaps_base_map_job.c at line 87
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > base/plm_base_launch_support.c at line 77
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > plm_rsh_module.c at line 985
> > > >
> > >
> >
> --
> > > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > > attempting to
> > > > launch so we are aborting.
> > > >
> > > > There may be more information reported by the environment (see
> > > above).
> > > >
> > > > This may be because the daemon was unable to find all the needed
> > > shared
> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
> > > have the
> > > > location of the shared libraries on the remote nodes and this will
> > > > automatically be forwarded to the remote nodes.
> > > >
> > >
> >
> --
> > > >
> > >
> >
> --
> > > > orterun noticed that the job aborted, but has no info as to the
> > > process
> > > > that caused that situation.
> > > >
> > >
> >
> --
> > > > orterun: clean termination accomplished
> >
> >
> >
> > Message: 4
> > Date: Tue, 14 Apr 2009 06:55:58 -0600
> > From: Ralph Castain 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> > Message-ID: 
> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
> >DelSp="yes"
> >
> > The rankfile cuts across the entire job - it isn't applied on an
> > app_context basis. So the ranks in your rankfile must correspond to
> > the eventual rank of each process in the cmd line.
> >
> > Unfortunately, that means you have to count ranks. In your case, you
> > only have four, so that makes life easier. Your rankfile would look
> > something like this:
> >
> > rank 0=r001n001 slot=0
> > rank 1=r001n002 slot=1
> > rank 2=r001n001 slot=1
> > rank 3=r001n002 slot=2
> >
> > HTH
> > Ralph
> >
> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> >
> > > Hi,
> > >
> > > I agree that my examples are not very clear. What I want to do is to
> > > launch a multiexes application (masters-slaves) and benefit from the
> > > processor affinity.
> > > Could you show me how to convert this command , using -rf option
> > > (whatever the affinity is)
> > >
> > > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002
> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
> > > host r001n002 slave.x options4
> > >
> > > Thanks for your help
> > >
> > > Geoffroy
> > >
> > >
> > >
> > >
> > >
> > > Message: 2
> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
> > > From: Lenny Verkhovsky 
> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > > To: Open MPI Users 
> > > Message-ID:
> > ><453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
> > > Content-Type: text/plain; charset="iso-8859-1"
> > >
> > > Hi,
> > >
> > > The first "crash" is OK, since your rankfile has ranks 0 and 1
> > > defined,
> > > while n=1, which means only rank 0 is present and can be allocated.
> > >
> > > NP must be >= the largest rank in rankfile.
> > >
> > > What exactly are you trying to do ?
> > >
> > > I tried to recreate your seqv but all I got was
> > >
> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
> > > hostfile.0
> > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> > > [witch19:30798] mca: base: component_find: p