Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Jeff Squyres
On Jun 18, 2012, at 11:45 AM, Harald Servat wrote:

>> 2. The two machines need to be able to open TCP connections to each other on 
>> random ports.
> 
> That will be harder. Do both machines need to open TCP connections to
> random ports, or just one?


Both.

To be specific: there are two layers that open TCP sockets to each other.  The 
run-time system (i.e., mpirun and its friends) opens control channels between 
nodes.  There *is* a predictable pattern to which nodes open TCP sockets to 
which other nodes, but you shouldn't count on it (because we change it over 
time).

Then the MPI layer opens TCP sockets for MPI messaging.  The pattern of who 
opens TCP sockets to whom depends on the app, because OMPI opens sockets upon 
the first send (and that may be racy, depending on your application).

So it's best not to assume any particular pattern and to just allow TCP 
connections on random ports from any machine that will be involved in the 
computation.

BTW, there have been a few discussions here in the past about how to configure 
iptables properly to allow this.  No one has quite gotten it right; our advice 
has always just been to disable iptables.  However, if you come up with a 
configuration solution that allows it to work properly -- and I'm *sure* that 
such a configuration exists; I'm just betting that no one with the proper 
willpower / experience has set their mind to figuring it out -- please let us 
know what it is so that we can add it to the FAQ.
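
One possible starting point for such a configuration is sketched below. It 
has not been tested with Open MPI and is not a known-good answer: it simply 
follows the advice above by accepting all inbound TCP from each peer machine. 
The helper name print_peer_rules is hypothetical, and the host names M1 and M2 
are the ones used in this thread. The function only prints the rules so they 
can be reviewed before anything is applied.

```shell
#!/bin/sh
# Hypothetical helper: print (do not apply) one iptables rule per peer.
# Each rule accepts all inbound TCP traffic from that peer, on any port.
print_peer_rules() {
    for peer in "$@"; do
        printf 'iptables -A INPUT -p tcp -s %s -j ACCEPT\n' "$peer"
    done
}

# Example with the two host names used in this thread.
print_peer_rules M1 M2
```

Because -A appends to the INPUT chain, a rule printed here has no effect if 
an earlier rule in the chain already rejects or drops the traffic; in that 
case the rule would have to be inserted ahead of it instead. Each machine 
needs a rule for the other one.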

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Harald Servat
On Mon, 18 Jun 2012 at 11:39 -0400, Jeff Squyres wrote:
> On Jun 18, 2012, at 11:12 AM, Harald Servat wrote:
> 
> > Thank you, Jeff. With the following command it now starts, but it gets
> > blocked before running. Could this be a firewall problem? Do both M1
> > and M2 need to be able to log into the other machine through ssh?
> 
> I'm not sure what you mean by "blocked" -- do you mean that it hangs and does 
> nothing after seeming to start?

Yes, that's it.

> 
> If so, then yes, you need at least the two following things to be true:
> 
> 1. You need to be able to ssh between your machines without manually 
> entering a password or passphrase.

Uhmmm... I'm trying to solve that by opening port 22.

> 2. The two machines need to be able to open TCP connections to each other on 
> random ports.
> 

That will be harder. Do both machines need to open TCP connections to
random ports, or just one?

Thank you.


WARNING / LEGAL TEXT: This message is intended only for the use of the
individual or entity to which it is addressed and may contain
information which is privileged, confidential, proprietary, or exempt
from disclosure under applicable law. If you are not the intended
recipient or the person responsible for delivering the message to the
intended recipient, you are strictly prohibited from disclosing,
distributing, copying, or in any way using this message. If you have
received this communication in error, please notify the sender and
destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer


Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Jeff Squyres
On Jun 18, 2012, at 11:12 AM, Harald Servat wrote:

> Thank you, Jeff. With the following command it now starts, but it gets
> blocked before running. Could this be a firewall problem? Do both M1
> and M2 need to be able to log into the other machine through ssh?

I'm not sure what you mean by "blocked" -- do you mean that it hangs and does 
nothing after seeming to start?

If so, then yes, you need at least the two following things to be true:

1. You need to be able to ssh between your machines without manually 
entering a password or passphrase.

2. The two machines need to be able to open TCP connections to each other on 
random ports.
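
Requirement 1 can be checked before involving mpirun at all. The sketch below 
is one way to do that, under some assumptions: the function names are 
hypothetical, and the SSH variable exists only so the check can be dry-run 
with a substitute command (ssh itself does not read it). BatchMode=yes makes 
ssh fail instead of prompting, so a prompt-free login is the only way the 
check can succeed.

```shell
#!/bin/sh
# Try a no-op remote command without allowing any interactive prompt.
# SSH may be overridden (e.g. SSH=true) to dry-run; default is plain ssh.
check_passwordless_ssh() {
    ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Print a one-line verdict for the given host.
report_ssh() {
    if check_passwordless_ssh "$1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: needs key setup"
    fi
}
```

Running "report_ssh M2" on M1 (and "report_ssh M1" on M2) shows whether each 
direction works. If a host needs key setup, the usual route is to create a key 
with ssh-keygen and install it on the other machine with ssh-copy-id.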





Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Jeff Squyres
On Jun 18, 2012, at 10:45 AM, Harald Servat wrote:

> # $HOME/aplic/openmpi/1.6/bin/mpirun -np 1 -host
> localhost ./init_barrier_fini : -x
> LD_LIBRARY_PATH=/home/Computational/harald/aplic/openmpi/1.6/lib
> -prefix /home/Computational/harald/aplic/openmpi/1.6/ -x
> PATH=/home/Computational/harald/aplic/openmpi/1.6/bin -np 1 -host
> M2 /home/Computational/harald/tests/mpi/multi-machine/init_barrier_fini

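Try without using the absolute pathname to mpirun -- it reacts differently if 
you specify the absolute pathname vs. just "mpirun": with an absolute pathname, 
it behaves as if you had also passed --prefix with that installation directory.

Also, if you set up your .bashrc files correctly, then you don't need the -x 
LD_LIBRARY_PATH... clause.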





Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Jeff Squyres
You might also want to set up your shell startup files on each machine to 
reflect the proper PATH and LD_LIBRARY_PATH.  E.g., if you have a different 
.bashrc on each machine, just have it set PATH and LD_LIBRARY_PATH properly *for 
that machine*.

To be clear: it's usually easiest to install OMPI to the same prefix on every 
machine, but there's no technical requirement from OMPI to do so.
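
As an illustration of the per-machine setup described above, the lines below 
are a minimal sketch of what the .bashrc on M2 might contain. The prefix is 
the M2 path quoted elsewhere in this thread; the variable name OMPI_PREFIX is 
just a convenience introduced here. On M1 the same lines would apply with the 
prefix /home/harald/aplic/openmpi/1.6, as inferred from the orted error 
message in the original report.

```shell
# Sketch for ~/.bashrc on M2 (paths taken from this thread).
OMPI_PREFIX=/home/Computational/harald/aplic/openmpi/1.6

# Prepend the Open MPI directories, keeping whatever was already set.
export PATH="$OMPI_PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

These lines have to take effect for non-interactive shells as well, because 
the remote daemon is started through ssh without an interactive login. Many 
default .bashrc files return early when the shell is not interactive, so the 
lines belong above any such check.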


On Jun 18, 2012, at 10:00 AM, Ralph Castain wrote:

> Try adding "-x LD_LIBRARY_PATH=" to your mpirun cmd line






Re: [OMPI users] Executions in two different machines

2012-06-18 Thread Ralph Castain
Try adding "-x LD_LIBRARY_PATH=" to your mpirun cmd line


On Jun 18, 2012, at 7:11 AM, Harald Servat wrote:

> Hello list,
> 
>  I'd like to use OpenMPI to execute an MPI application on two different
> machines.
> 
>  Up to now, I've configured and installed OpenMPI 1.6 on my two systems
> (each in a different directory). When I execute binaries within either
> system, the application works well. However, when I try to execute across
> the two systems, it does not work; in fact, it complains that it cannot
> find "orted". This is the command I try to run and its output:
> 
> #  $HOME/aplic/openmpi/1.6/bin/mpirun -display-map --machinefile hosts
> -np 2 /bin/date
> 
>  ========================   JOB MAP   ========================
> 
>  Data for node: M1  Num procs: 1
>      Process OMPI jobid: [6021,1] Process rank: 0
> 
>  Data for node: M2  Num procs: 1
>      Process OMPI jobid: [6021,1] Process rank: 1
> 
>  =============================================================
> bash: /home/harald/aplic/openmpi/1.6/bin/orted: No such file or directory
> ------------------------------------------------------------------------
> A daemon (pid 19598) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> ------------------------------------------------------------------------
> ------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> ------------------------------------------------------------------------
> 
>  My guess is that the spawned process cannot find orted on M2 because the
> installation prefixes of M1 and M2 differ. Is my guess correct? As I
> cannot change the prefix of the two installations, how can I tell mpirun
> to look for orted in a different place? After looking at the
> documentation, I've tried --prefix and --launch-agent without
> success.
> 
> Thank you very much in advance.
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users