Re: [OMPI users] Executions in two different machines
On Jun 18, 2012, at 11:45 AM, Harald Servat wrote:

>> 2. The two machines need to be able to open TCP connections to each other on
>> random ports.
>
> That will be harder. Do both machines need to open TCP connections to
> random ports, or just one?

Both.

To be specific: there are two layers that open TCP sockets to each other.

The run-time system (i.e., mpirun and its friends) opens control channels between nodes. There *is* a predictable pattern to which nodes open TCP sockets to which other nodes, but you shouldn't count on it (because we change it over time).

Then the MPI layer opens TCP sockets for MPI messaging. The pattern of who opens TCP sockets to whom depends on the app, because OMPI opens sockets upon the first send (and that may be racy, depending on your application).

So it's best not to assume anything and just allow random TCP sockets from any machines that will be involved in the computation.

BTW, there have been a few discussions here in the past about how to configure iptables properly to allow this. No one has quite gotten it right; our advice has always just been to disable iptables. However, if you come up with a configuration solution that allows it to work properly -- and I'm *sure* that such a configuration exists; I'm just betting that no one with the proper willpower / experience has set their mind to figuring it out -- please let us know what it is so that we can add it to the FAQ.

Thanks!

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
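One way to see whether a firewall is in the way is to probe a port from each machine towards the other. The following is only a rough sketch, not something from the Open MPI documentation: the function name `can_connect` and port 50000 are made up for illustration, and it assumes bash (for its /dev/tcp redirection) and the coreutils `timeout` command are available.

```shell
# can_connect HOST PORT
# Succeeds (exit status 0) if a TCP connection to HOST:PORT can be opened
# within 3 seconds; fails otherwise. Uses bash's built-in /dev/tcp
# redirection, so no extra tools are needed on the probing side.
can_connect() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical usage, with some listener of your choice already running
# on M2 on port 50000:
#   can_connect M2 50000 && echo "M1 -> M2 port 50000 is reachable"
```

A single successful probe only shows that one port is open in one direction. Since the sockets described above are opened on random ports and in both directions, the probe is more useful for confirming that something is blocked than for proving that everything is allowed.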
Re: [OMPI users] Executions in two different machines
On Mon, 18 Jun 2012 at 11:39 -0400, Jeff Squyres wrote:

> On Jun 18, 2012, at 11:12 AM, Harald Servat wrote:
>
> > Thank you, Jeff. With the following command it now starts, but it gets
> > blocked before the application runs. Could this be a firewall problem? Do
> > both M1 and M2 need to be able to log into the other machine through ssh?
>
> I'm not sure what you mean by "blocked" -- do you mean that it hangs and does
> nothing after seeming to start?

Yes, that's it.

> If so, then yes, you need at least the following two things to be true:
>
> 1. You need to be able to ssh between your machines without manually
> entering a password or passphrase.

Uhmmm... I'm trying to solve that by opening port 22.

> 2. The two machines need to be able to open TCP connections to each other on
> random ports.

That will be harder. Do both machines need to open TCP connections to random ports, or just one?

Thank you.

WARNING / LEGAL TEXT: This message is intended only for the use of the individual or entity to which it is addressed and may contain information which is privileged, confidential, proprietary, or exempt from disclosure under applicable law. If you are not the intended recipient or the person responsible for delivering the message to the intended recipient, you are strictly prohibited from disclosing, distributing, copying, or in any way using this message. If you have received this communication in error, please notify the sender and destroy and delete any copies you may have received.

http://www.bsc.es/disclaimer
Re: [OMPI users] Executions in two different machines
On Jun 18, 2012, at 11:12 AM, Harald Servat wrote:

> Thank you, Jeff. With the following command it now starts, but it gets
> blocked before the application runs. Could this be a firewall problem? Do
> both M1 and M2 need to be able to log into the other machine through ssh?

I'm not sure what you mean by "blocked" -- do you mean that it hangs and does nothing after seeming to start?

If so, then yes, you need at least the following two things to be true:

1. You need to be able to ssh between your machines without manually entering a password or passphrase.

2. The two machines need to be able to open TCP connections to each other on random ports.
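The first requirement can be checked without running mpirun at all. A minimal sketch follows, assuming OpenSSH, whose `BatchMode=yes` option makes ssh fail rather than prompt for a password or passphrase. The function name `check_ssh` and the `SSH_CMD` override are hypothetical and exist only for this example.

```shell
# check_ssh HOST
# Succeeds only if ssh can log in to HOST and run a trivial command without
# any interactive prompt. BatchMode=yes disables password/passphrase
# prompting; ConnectTimeout=5 keeps a filtered port from hanging the check.
# SSH_CMD is a hypothetical override hook (defaults to plain "ssh").
check_ssh() {
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Hypothetical usage on M1 (and the mirror image on M2):
#   check_ssh M2 && echo "M2 accepts non-interactive logins"
```

If this fails while a plain `ssh M2` works after you type a password, then port 22 is open but key-based login (or an ssh agent) is still missing.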
Re: [OMPI users] Executions in two different machines
On Mon, 18 Jun 2012 at 10:56 -0400, Jeff Squyres wrote:

> On Jun 18, 2012, at 10:45 AM, Harald Servat wrote:
>
> > # $HOME/aplic/openmpi/1.6/bin/mpirun -np 1 -host
> > localhost ./init_barrier_fini : -x
> > LD_LIBRARY_PATH=/home/Computational/harald/aplic/openmpi/1.6/lib
> > -prefix /home/Computational/harald/aplic/openmpi/1.6/ -x
> > PATH=/home/Computational/harald/aplic/openmpi/1.6/bin -np 1 -host
> > M2 /home/Computational/harald/tests/mpi/multi-machine/init_barrier_fini
>
> Try without using the absolute pathname to mpirun -- it reacts differently if
> you specify the absolute pathname vs. just "mpirun".
>
> Also, if you set up your .bashrc files correctly, then you don't need the -x
> LD_LIBRARY_PATH... clause.

Thank you, Jeff. With the following command it now starts, but it gets blocked before the application runs. Could this be a firewall problem? Do both M1 and M2 need to be able to log into the other machine through ssh?

Thank you!

# mpirun -v -display-map -np 1 -host localhost ./init_barrier_fini : -np 1 -host M2 /home/Computational/harald/tests/mpi/multi-machine/init_barrier_fini

JOB MAP

Data for node: M1  Num procs: 1
  Process OMPI jobid: [89,1] Process rank: 0

Data for node: M2  Num procs: 1
  Process OMPI jobid: [89,1] Process rank: 1

=
Re: [OMPI users] Executions in two different machines
On Jun 18, 2012, at 10:45 AM, Harald Servat wrote:

> # $HOME/aplic/openmpi/1.6/bin/mpirun -np 1 -host
> localhost ./init_barrier_fini : -x
> LD_LIBRARY_PATH=/home/Computational/harald/aplic/openmpi/1.6/lib
> -prefix /home/Computational/harald/aplic/openmpi/1.6/ -x
> PATH=/home/Computational/harald/aplic/openmpi/1.6/bin -np 1 -host
> M2 /home/Computational/harald/tests/mpi/multi-machine/init_barrier_fini

Try without using the absolute pathname to mpirun -- it reacts differently if you specify the absolute pathname vs. just "mpirun".

Also, if you set up your .bashrc files correctly, then you don't need the -x LD_LIBRARY_PATH... clause.
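As background to the "reacts differently" remark: the mpirun man page describes invoking mpirun by absolute pathname as equivalent to passing --prefix with that pathname minus its last subdirectory. That would explain why the remote side looked for orted under M1's directory. The helper below is only an illustration of that rule, not Open MPI code, and `implied_prefix` is a made-up name.

```shell
# implied_prefix /abs/path/to/bin/mpirun
# Prints the installation prefix implied by an absolute mpirun pathname:
# the pathname with the trailing "bin/mpirun" stripped off.
implied_prefix() {
    dirname "$(dirname "$1")"
}

# M1's mpirun, as expanded from $HOME in the command quoted above:
implied_prefix /home/harald/aplic/openmpi/1.6/bin/mpirun
# prints /home/harald/aplic/openmpi/1.6
```

That printed directory is the one that appears in the "orted" error message, and it exists on M1 but not on M2.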
Re: [OMPI users] Executions in two different machines
Thank you for your answers. I've tried that, but it doesn't seem to work. The latest command I've issued is

# $HOME/aplic/openmpi/1.6/bin/mpirun -np 1 -host localhost ./init_barrier_fini : -x LD_LIBRARY_PATH=/home/Computational/harald/aplic/openmpi/1.6/lib -prefix /home/Computational/harald/aplic/openmpi/1.6/ -x PATH=/home/Computational/harald/aplic/openmpi/1.6/bin -np 1 -host M2 /home/Computational/harald/tests/mpi/multi-machine/init_barrier_fini

but I get the same error. Please notice the message

bash: /home/harald/aplic/openmpi/1.6/bin/orted: El fitxer o directori no existeix

which means ("No such file or directory" in Catalan) that it cannot find orted. That orted path exists on localhost but not on the other machine.

Additionally, I've tried the following command, but it gets blocked...

# mpirun -display-map -np 1 -host localhost /bin/date : -np 1 -host M2 /bin/date

JOB MAP

Data for node: dell  Num procs: 1
  Process OMPI jobid: [880,1] Process rank: 0

Data for node: knights1.bsc.es  Num procs: 1
  Process OMPI jobid: [880,1] Process rank: 1

=

Any ideas?

Thank you.

On Mon, 18 Jun 2012 at 10:04 -0400, Jeff Squyres wrote:

> You might also want to set up your shell startup files on each machine to
> reflect the proper PATH and LD_LIBRARY_PATH. E.g., if you have a different
> .bashrc on each machine, just have it set PATH and LD_LIBRARY_PATH properly
> *for that machine*.
>
> To be clear: it's usually easiest to install OMPI to the same prefix on every
> machine, but there's no technical requirement from OMPI to do so.
>
> On Jun 18, 2012, at 10:00 AM, Ralph Castain wrote:
>
> > Try adding "-x LD_LIBRARY_PATH=" to your mpirun cmd line
Re: [OMPI users] Executions in two different machines
You might also want to set up your shell startup files on each machine to reflect the proper PATH and LD_LIBRARY_PATH. E.g., if you have a different .bashrc on each machine, just have it set PATH and LD_LIBRARY_PATH properly *for that machine*.

To be clear: it's usually easiest to install OMPI to the same prefix on every machine, but there's no technical requirement from OMPI to do so.

On Jun 18, 2012, at 10:00 AM, Ralph Castain wrote:

> Try adding "-x LD_LIBRARY_PATH=" to your mpirun cmd line
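A per-machine .bashrc fragment along these lines is one way to do this. It is a sketch under assumptions: `OMPI_PREFIX` is a variable name invented for the example (Open MPI does not read it), and the path shown is M1's from this thread. On M2 the same fragment would carry M2's own installation directory instead.

```shell
# Hypothetical helper variable: edit this on each machine so that it points
# at the directory where Open MPI is installed *on that machine*.
OMPI_PREFIX="$HOME/aplic/openmpi/1.6"

# Prepend the bin directory unless it is already listed, so that nested
# shells do not keep lengthening PATH.
case ":$PATH:" in
    *":$OMPI_PREFIX/bin:"*) ;;
    *) PATH="$OMPI_PREFIX/bin:$PATH" ;;
esac

# Same for the library directory; LD_LIBRARY_PATH may be unset to begin with.
case ":${LD_LIBRARY_PATH:-}:" in
    *":$OMPI_PREFIX/lib:"*) ;;
    *) LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac

export PATH LD_LIBRARY_PATH
```

One thing to watch for: some distributions ship a .bashrc that returns early for non-interactive shells. The shell that ssh starts to launch orted is non-interactive, so these lines would need to sit above any such early exit to take effect.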
Re: [OMPI users] Executions in two different machines
Try adding "-x LD_LIBRARY_PATH=" to your mpirun cmd line
[OMPI users] Executions in two different machines
Hello list,

I'd like to use OpenMPI to execute an MPI application on two different machines.

Up to now, I've configured and installed OpenMPI 1.6 on my two systems (each in a different directory). When I execute binaries within either system, the application works well. However, when I try to execute across the two systems, it does not work; in fact, it complains that it cannot find "orted". This is the command I try to run and its output:

# $HOME/aplic/openmpi/1.6/bin/mpirun -display-map --machinefile hosts -np 2 /bin/date

JOB MAP

Data for node: M1  Num procs: 1
  Process OMPI jobid: [6021,1] Process rank: 0

Data for node: M2  Num procs: 1
  Process OMPI jobid: [6021,1] Process rank: 1

=
bash: /home/harald/aplic/openmpi/1.6/bin/orted: El fitxer o directori no existeix
--
A daemon (pid 19598) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

My guess is that the spawned process cannot find orted on M2 because the installation prefixes of M1 and M2 differ. Is my guess correct? As I cannot change the prefix of the two installations, how can I tell mpirun to look for orted in a different place? After looking at the documentation, I've tried --prefix and --launch-agent without success.

Thank you very much in advance.
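A note on the "status 127" in the output above: 127 is the exit status bash itself returns when the command it was asked to run cannot be found. That fits the guess that the remote shell never managed to start orted at all. The sketch below reproduces this locally. The path is deliberately nonexistent and `run_missing` is a name made up for the example.

```shell
# Hypothetical path, chosen only because it should not exist anywhere.
missing=/nonexistent/prefix/bin/orted

# run_missing CMD
# Runs CMD in a child bash (roughly what the remote login shell does with
# the orted command line) and prints the exit status that bash reports.
run_missing() {
    bash -c "$1" 2>/dev/null
    echo $?
}

run_missing "$missing"
# prints 127
```

A daemon that was found but failed to load its shared libraries would also commonly report 127, which is presumably why mpirun's help text mentions LD_LIBRARY_PATH. Here, though, the bash message names the orted path itself.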