Hmmm… I don't know of anyone trying this with OpenStack before, so this may be 
uncharted territory. I assume you configured OMPI with --enable-debug? If not, 
please do so.
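
For reference, a debug rebuild might look something like this (the install 
prefix is just an example; adjust to your setup):

```shell
# Rebuild with debug support enabled -- the prefix below is a placeholder.
# --enable-orterun-prefix-by-default helps avoid PATH/LD_LIBRARY_PATH problems
# on the remote nodes, which is one of the failure modes in the error message.
./configure --prefix=$HOME/ompi-debug \
            --enable-debug \
            --enable-orterun-prefix-by-default
make -j4 && make install
```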

Then, add "--mca oob_base_verbose 100 --mca state_base_verbose 10" to your 
command line.
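
Independent of OMPI, it may also be worth ruling out a plain TCP block between 
the VMs: OpenStack security groups typically default-deny inbound traffic even 
when iptables inside the guests is disabled. Here is a minimal standalone probe 
(host and port are placeholders; orted's call-back ports are ephemeral unless 
pinned, e.g. via oob_tcp_static_ipv4_ports, if your version supports it):

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False

# Example (placeholder host/port): check sshd first, since you know ssh works,
# then try a port in whatever range you pin the OOB to.
# tcp_reachable("10.104.5.40", 22)
```

If the sshd port answers but higher ports do not, the security group is the 
prime suspect.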

> On Mar 28, 2015, at 11:38 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:
> 
> More precisely, I create the VMs using the OpenStack web interface. I have 
> been assigned some resources by the administrator. I created the VM instances, 
> each with 2 VCPUs, using the OpenStack dashboard, so I do not know whether the 
> VMs are assigned to the same or different physical nodes.
> 
> FYI: testing with the following command on fehg_node_0 gives me this output:
> 
> > mpirun --mca plm_base_verbose 20 --host fehg_node_1 hostname
> 
> [fehg-node-0:02057] mca: base: components_open: Looking for plm components
> [fehg-node-0:02057] mca: base: components_open: opening plm components
> [fehg-node-0:02057] mca: base: components_open: found loaded component rsh
> [fehg-node-0:02057] mca: base: components_open: component rsh has no register 
> function
> [fehg-node-0:02057] mca: base: components_open: component rsh open function 
> successful
> [fehg-node-0:02057] mca: base: components_open: found loaded component slurm
> [fehg-node-0:02057] mca: base: components_open: component slurm has no 
> register function
> [fehg-node-0:02057] mca: base: components_open: component slurm open function 
> successful
> [fehg-node-0:02057] mca:base:select: Auto-selecting plm components
> [fehg-node-0:02057] mca:base:select:(  plm) Querying component [rsh]
> [fehg-node-0:02057] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [fehg-node-0:02057] mca:base:select:(  plm) Querying component [slurm]
> [fehg-node-0:02057] mca:base:select:(  plm) Skipping component [slurm]. Query 
> failed to return a module
> [fehg-node-0:02057] mca:base:select:(  plm) Selected component [rsh]
> [fehg-node-0:02057] mca: base: close: component slurm closed
> [fehg-node-0:02057] mca: base: close: unloading component slurm
> [fehg-node-7:02660] mca: base: components_open: Looking for plm components
> [fehg-node-7:02660] mca: base: components_open: opening plm components
> [fehg-node-7:02660] mca: base: components_open: found loaded component rsh
> [fehg-node-7:02660] mca: base: components_open: component rsh has no register 
> function
> [fehg-node-7:02660] mca: base: components_open: component rsh open function 
> successful
> [fehg-node-7:02660] mca:base:select: Auto-selecting plm components
> [fehg-node-7:02660] mca:base:select:(  plm) Querying component [rsh]
> [fehg-node-7:02660] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [fehg-node-7:02660] mca:base:select:(  plm) Selected component [rsh]
> 
> and it freezes here. 
> 
> 
> Regards,
> Karos
> 
> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
> [r...@open-mpi.org]
> Sent: 28 March 2015 18:23
> To: Open MPI Users
> Subject: Re: [OMPI users] Connection problem on Linux cluster
> 
> Just to be clear: do you have two physical nodes? Or just one physical node 
> and you are running two VMs on it?
> 
>> On Mar 28, 2015, at 10:51 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:
>> 
>> I have a floating IP for accessing nodes from outside the cluster, plus 
>> internal IP addresses. I tried to run the jobs with both of them (both IP 
>> addresses), but it makes no difference.
>> I have just installed Open MPI 1.6.5 to see how this version works. In that 
>> case I get nothing and have to press Ctrl+C; no output or error is shown.
>> 
>> 
>> From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
>> [r...@open-mpi.org]
>> Sent: 28 March 2015 17:03
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Connection problem on Linux cluster
>> 
>> You mentioned running this in a VM - is that IP address correct for getting 
>> across the VMs?
>> 
>> 
>>> On Mar 28, 2015, at 8:38 AM, LOTFIFAR F. <foad.lotfi...@durham.ac.uk> wrote:
>>> 
>>> Hi,
>>> 
>>> I am wondering how I can solve this problem.
>>> System Spec:
>>> 1- Linux cluster with two nodes (master and slave) running Ubuntu 12.04 LTS 
>>> 32-bit.
>>> 2- openmpi 1.8.4
>>> 
>>> I do a simple test running on fehg_node_0:
>>> > mpirun -host fehg_node_0,fehg_node_1 hello_world -mca oob_base_verbose 20
>>> 
>>> and I get the following error:
>>> 
>>> A process or daemon was unable to complete a TCP connection
>>> to another process:
>>>   Local host:    fehg-node-0
>>>   Remote host:   10.104.5.40
>>> This is usually caused by a firewall on the remote host. Please
>>> check that any firewall (e.g., iptables) has been disabled and
>>> try again.
>>> ------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> ORTE was unable to reliably start one or more daemons.
>>> This usually is caused by:
>>> 
>>> * not finding the required libraries and/or binaries on
>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> 
>>> * lack of authority to execute on one or more specified nodes.
>>>   Please verify your allocation and authorities.
>>> 
>>> * the inability to write startup files into /tmp 
>>> (--tmpdir/orte_tmpdir_base).
>>>   Please check with your sys admin to determine the correct location to use.
>>> 
>>> *  compilation of the orted with dynamic libraries when static are required
>>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>>   one of the contrib/platform definitions for your system type.
>>> 
>>> * an inability to create a connection back to mpirun due to a
>>>   lack of common network interfaces and/or no route found between
>>>   them. Please check network connectivity (including firewalls
>>>   and network routing requirements).
>>> 
>>> Verbose:
>>> 1- I have full access to the VMs on the cluster and set up everything myself.
>>> 2- Firewall and iptables are disabled on both nodes.
>>> 3- Nodes can ssh to each other with no problem.
>>> 4- Non-interactive bash calls work fine, i.e. when I run "ssh othernode env 
>>> | grep PATH" from both nodes, both PATH and LD_LIBRARY_PATH are set correctly.
>>> 5- I have checked the posts; a similar problem was reported for Solaris, but 
>>> I could not find a clue about mine.
>>> 6- Running with --enable-orterun-prefix-by-default does not change anything.
>>> 7- I see orted running on the other node when I check processes, but nothing 
>>> happens after that and the error appears.
>>> 
>>> Regards,
>>> Karos
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/03/26555.php
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/03/26557.php
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/03/26559.php
