Re: [OMPI users] Fwd: problem for multiple clusters using mpirun

2014-03-25 Thread Hamid Saeed
Hello,

Is it possible to change the port number for the MPI communication?

I can see that my program uses port 4 for the MPI communication.

[karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on
port 4
[karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.106.3.252 failed: Connection refused (111)

In my case, ports 1 to 1024 are reserved.
MPI tries to use one of the reserved ports and reports a "connection
refused" error.

I would be grateful for any suggestions.


Regards.





On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed  wrote:

> Hello Jeff,
>
> Thanks for your cooperation.
>
> --mca btl_tcp_if_include br0
>
> worked out of the box.
>
> The problem was from the network administrator. The machines on the
> network side were halting the mpi...
>
> so cleaning and killing every thing worked.
>
> :)
>
> regards.
>
>
> On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> There is no "self" IP interface in the Linux kernel.
>>
>> Try using btl_tcp_if_include and list just the interface(s) that you want
>> to use.  From your prior email, I'm *guessing* it's just br2 (i.e., the
>> 10.x address inside your cluster).
>>
>> Also, it looks like you didn't setup your SSH keys properly for logging
>> in to remote notes automatically.
>>
>>
>>
>> On Mar 24, 2014, at 10:56 AM, Hamid Saeed  wrote:
>>
>> > Hello,
>> >
>> > I added the "self" e.g
>> >
>> > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca
>> btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
>> >
>> > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
>> >
>> --
>> >
>> > ERROR::
>> >
>> > At least one pair of MPI processes are unable to reach each other for
>> > MPI communications.  This means that no Open MPI device has indicated
>> > that it can be used to communicate between these processes.  This is
>> > an error; Open MPI requires that all MPI processes be able to reach
>> > each other.  This error can sometimes be the result of forgetting to
>> > specify the "self" BTL.
>> >
>> >   Process 1 ([[15751,1],7]) is on host: wirth
>> >   Process 2 ([[15751,1],0]) is on host: karp
>> >   BTLs attempted: self sm
>> >
>> > Your MPI job is now going to abort; sorry.
>> >
>> --
>> >
>> --
>> > MPI_INIT has failed because at least one MPI process is unreachable
>> > from another.  This *usually* means that an underlying communication
>> > plugin -- such as a BTL or an MTL -- has either not loaded or not
>> > allowed itself to be used.  Your MPI job will now abort.
>> >
>> > You may wish to try to narrow down the problem;
>> >
>> >  * Check the output of ompi_info to see which BTL/MTL plugins are
>> >available.
>> >  * Run your application with MPI_THREAD_SINGLE.
>> >  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>> >if using MTL-based communications) to see exactly which
>> >communication plugins were considered and/or discarded.
>> >
>> --
>> > [wirth:40329] *** An error occurred in MPI_Init
>> > [wirth:40329] *** on a NULL communicator
>> > [wirth:40329] *** Unknown error
>> > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> >
>> --
>> > An MPI process is aborting at a time when it cannot guarantee that all
>> > of its peer processes in the job will be killed properly.  You should
>> > double check that everything has shut down cleanly.
>> >
>> >   Reason: Before MPI_INIT completed
>> >   Local host: wirth
>> >   PID:40329
>> >
>> --
>> >
>> --
>> > mpirun has exited due to process rank 7 with PID 40329 on
>> > node wirth exiting improperly. There are two reasons this could occur:
>> >
>> > 1. this process did not call "init" before exiting, but others in
>> > the job did. This can cause a job to hang indefinitely while it waits
>> > for all processes to call "init". By rule, if one process calls "init",
>> > then ALL processes must call "init" prior to termination.
>> >
>> > 2. this process called "init", but exited without calling "finalize".
>> > By rule, all processes that call "init" MUST call "finalize" prior to
>> > exiting or it will be considered an "abnormal termination"
>> >
>> > This may have caused other processes in the application to be
>> > terminated by signals sent by mpirun (as reported here).
>> >
>> --
>> > [karp:29513] 1 more process ha

Re: [OMPI users] Fwd: problem for multiple clusters using mpirun

2014-03-25 Thread Reuti
Hi,

Am 25.03.2014 um 08:34 schrieb Hamid Saeed:

> Is it possible to change the port number for the MPI communication?
> 
> I can see that my program uses port 4 for the MPI communication.
> 
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on 
> port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 134.106.3.252 failed: Connection refused (111)
> 
> In my case the ports from 1 to 1024 are reserved. 
> MPI tries to use one of the reserve ports and prompts the connection refused 
> error.
> 
> I will be very glade for the kind suggestions.

There are parameters to set the range of ports used, but using anything up
to 1024 should not be the default:

http://www.open-mpi.org/community/lists/users/2011/11/17732.php
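
For example, a minimal sketch of pinning the TCP BTL to an unprivileged window
from the mpirun command line (btl_tcp_port_min_v4 appears later in this thread;
btl_tcp_port_range_v4 is assumed here as its companion range parameter, and the
10000/100 values are only an illustration):

# ask the TCP BTL to bind only within 10000-10099
mpirun -np 8 --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ./a.out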

Were any of these set by accident somewhere in your environment beforehand?
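
A quick way to check, as a sketch of the usual places Open MPI picks up MCA
settings (the ompi_info --path query is assumed here to locate the system-wide
file):

# anything exported as OMPI_MCA_* overrides the built-in defaults
env | grep OMPI_MCA_
# per-user and system-wide parameter files, if present
cat ~/.openmpi/mca-params.conf
cat "$(ompi_info --path sysconfdir --parsable | cut -d: -f3-)"/openmpi-mca-params.conf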

-- Reuti


> Regards.
> 
> 
> 
> 
> 
> On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed  wrote:
> Hello Jeff,
> 
> Thanks for your cooperation.
> 
> --mca btl_tcp_if_include br0 
> 
> worked out of the box.
> 
> The problem was from the network administrator. The machines on the network 
> side were halting the mpi...
> 
> so cleaning and killing every thing worked.
> 
> :)
> 
> regards. 
> 
> 
> On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres)  
> wrote:
> There is no "self" IP interface in the Linux kernel.
> 
> Try using btl_tcp_if_include and list just the interface(s) that you want to 
> use.  From your prior email, I'm *guessing* it's just br2 (i.e., the 10.x 
> address inside your cluster).
> 
> Also, it looks like you didn't setup your SSH keys properly for logging in to 
> remote notes automatically.
> 
> 
> 
> On Mar 24, 2014, at 10:56 AM, Hamid Saeed  wrote:
> 
> > Hello,
> >
> > I added the "self" e.g
> >
> > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca 
> > btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
> >
> > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > --
> >
> > ERROR::
> >
> > At least one pair of MPI processes are unable to reach each other for
> > MPI communications.  This means that no Open MPI device has indicated
> > that it can be used to communicate between these processes.  This is
> > an error; Open MPI requires that all MPI processes be able to reach
> > each other.  This error can sometimes be the result of forgetting to
> > specify the "self" BTL.
> >
> >   Process 1 ([[15751,1],7]) is on host: wirth
> >   Process 2 ([[15751,1],0]) is on host: karp
> >   BTLs attempted: self sm
> >
> > Your MPI job is now going to abort; sorry.
> > --
> > --
> > MPI_INIT has failed because at least one MPI process is unreachable
> > from another.  This *usually* means that an underlying communication
> > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > allowed itself to be used.  Your MPI job will now abort.
> >
> > You may wish to try to narrow down the problem;
> >
> >  * Check the output of ompi_info to see which BTL/MTL plugins are
> >available.
> >  * Run your application with MPI_THREAD_SINGLE.
> >  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> >if using MTL-based communications) to see exactly which
> >communication plugins were considered and/or discarded.
> > --
> > [wirth:40329] *** An error occurred in MPI_Init
> > [wirth:40329] *** on a NULL communicator
> > [wirth:40329] *** Unknown error
> > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > --
> > An MPI process is aborting at a time when it cannot guarantee that all
> > of its peer processes in the job will be killed properly.  You should
> > double check that everything has shut down cleanly.
> >
> >   Reason: Before MPI_INIT completed
> >   Local host: wirth
> >   PID:40329
> > --
> > --
> > mpirun has exited due to process rank 7 with PID 40329 on
> > node wirth exiting improperly. There are two reasons this could occur:
> >
> > 1. this process did not call "init" before exiting, but others in
> > the job did. This can cause a job to hang indefinitely while it waits
> > for all processes to call "init". By rule, if one process calls "init",
> > then ALL processes must call "init" prior to termination.
> >
> > 2. this process called "init", but exited without calling "finalize".
> > By rule, all processes that call "init" MUST call "finalize" prior to
> > exiting or it will be considered an "abnormal termination"
> >
> 

Re: [OMPI users] Fwd: problem for multiple clusters using mpirun

2014-03-25 Thread Hamid Saeed
Hello,
I am not sure what approach the MPI communication follows, but when I use
--mca btl_base_verbose 30

I observe the mentioned port.

[karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on
port 4
[karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
connect() to 134.106.3.252 failed: Connection refused (111)


The information at
http://www.open-mpi.org/community/lists/users/2011/11/17732.php
is not enough; could you kindly explain?

How can I restrict MPI communication to use ports starting from 1025,
or to use a port somewhere around 59822?

Regards.



On Tue, Mar 25, 2014 at 9:15 AM, Reuti  wrote:

> Hi,
>
> Am 25.03.2014 um 08:34 schrieb Hamid Saeed:
>
> > Is it possible to change the port number for the MPI communication?
> >
> > I can see that my program uses port 4 for the MPI communication.
> >
> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252
> on port 4
> >
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 134.106.3.252 failed: Connection refused (111)
> >
> > In my case the ports from 1 to 1024 are reserved.
> > MPI tries to use one of the reserve ports and prompts the connection
> refused error.
> >
> > I will be very glade for the kind suggestions.
>
> There are certain parameters to set the range of used ports, but using any
> up to 1024 should not be the default:
>
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
>
> Are any of these set by accident beforehand by your environment?
>
> -- Reuti
>
>
> > Regards.
> >
> >
> >
> >
> >
> > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed 
> wrote:
> > Hello Jeff,
> >
> > Thanks for your cooperation.
> >
> > --mca btl_tcp_if_include br0
> >
> > worked out of the box.
> >
> > The problem was from the network administrator. The machines on the
> network side were halting the mpi...
> >
> > so cleaning and killing every thing worked.
> >
> > :)
> >
> > regards.
> >
> >
> > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > There is no "self" IP interface in the Linux kernel.
> >
> > Try using btl_tcp_if_include and list just the interface(s) that you
> want to use.  From your prior email, I'm *guessing* it's just br2 (i.e.,
> the 10.x address inside your cluster).
> >
> > Also, it looks like you didn't setup your SSH keys properly for logging
> in to remote notes automatically.
> >
> >
> >
> > On Mar 24, 2014, at 10:56 AM, Hamid Saeed 
> wrote:
> >
> > > Hello,
> > >
> > > I added the "self" e.g
> > >
> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib
> --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth
> ./scatterv
> > >
> > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > >
> --
> > >
> > > ERROR::
> > >
> > > At least one pair of MPI processes are unable to reach each other for
> > > MPI communications.  This means that no Open MPI device has indicated
> > > that it can be used to communicate between these processes.  This is
> > > an error; Open MPI requires that all MPI processes be able to reach
> > > each other.  This error can sometimes be the result of forgetting to
> > > specify the "self" BTL.
> > >
> > >   Process 1 ([[15751,1],7]) is on host: wirth
> > >   Process 2 ([[15751,1],0]) is on host: karp
> > >   BTLs attempted: self sm
> > >
> > > Your MPI job is now going to abort; sorry.
> > >
> --
> > >
> --
> > > MPI_INIT has failed because at least one MPI process is unreachable
> > > from another.  This *usually* means that an underlying communication
> > > plugin -- such as a BTL or an MTL -- has either not loaded or not
> > > allowed itself to be used.  Your MPI job will now abort.
> > >
> > > You may wish to try to narrow down the problem;
> > >
> > >  * Check the output of ompi_info to see which BTL/MTL plugins are
> > >available.
> > >  * Run your application with MPI_THREAD_SINGLE.
> > >  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
> > >if using MTL-based communications) to see exactly which
> > >communication plugins were considered and/or discarded.
> > >
> --
> > > [wirth:40329] *** An error occurred in MPI_Init
> > > [wirth:40329] *** on a NULL communicator
> > > [wirth:40329] *** Unknown error
> > > [wirth:40329] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> > >
> --
> > > An MPI process is aborting at a time when it cannot guarantee that all
> > > of its peer processes in the job will be killed properly.  You should
> > > double check that everything has shut down cleanly.
> > >
> > >   Reason:

Re: [OMPI users] Fwd: problem for multiple clusters using mpirun

2014-03-25 Thread Hamid Saeed
Hello,
Thanks, I figured out the exact problem in my case.
Now I am using the following command line;
it directs the MPI communication ports to start from 1...

mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include
br0 --mca btl_tcp_port_min_v4 1 ./a.out

and everything works again.
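
For reference, a variant of the same line that keeps the port search above the
reserved 1-1024 range instead of starting at 1 (just a sketch with the same
parameter; 1025 is the first unreserved port mentioned earlier in this thread):

mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 --mca btl_tcp_port_min_v4 1025 ./a.out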

Thanks.

Best regards.




On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed wrote:

> Hello,
> I am not sure what approach does the MPI communication follow but when i
> use
> --mca btl_base_verbose 30
>
> I observe the mentioned port.
>
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on
> port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> connect() to 134.106.3.252 failed: Connection refused (111)
>
>
> the information on the
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> is not enough could you kindly explain..
>
> How can restrict MPI communication to use the ports starting from 1025.
> or use the port some what like
> 59822...
>
> Regards.
>
>
>
> On Tue, Mar 25, 2014 at 9:15 AM, Reuti  wrote:
>
>> Hi,
>>
>> Am 25.03.2014 um 08:34 schrieb Hamid Saeed:
>>
>> > Is it possible to change the port number for the MPI communication?
>> >
>> > I can see that my program uses port 4 for the MPI communication.
>> >
>> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252
>> on port 4
>> >
>> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 134.106.3.252 failed: Connection refused (111)
>> >
>> > In my case the ports from 1 to 1024 are reserved.
>> > MPI tries to use one of the reserve ports and prompts the connection
>> refused error.
>> >
>> > I will be very glade for the kind suggestions.
>>
>> There are certain parameters to set the range of used ports, but using
>> any up to 1024 should not be the default:
>>
>> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
>>
>> Are any of these set by accident beforehand by your environment?
>>
>> -- Reuti
>>
>>
>> > Regards.
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed 
>> wrote:
>> > Hello Jeff,
>> >
>> > Thanks for your cooperation.
>> >
>> > --mca btl_tcp_if_include br0
>> >
>> > worked out of the box.
>> >
>> > The problem was from the network administrator. The machines on the
>> network side were halting the mpi...
>> >
>> > so cleaning and killing every thing worked.
>> >
>> > :)
>> >
>> > regards.
>> >
>> >
>> > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>> > There is no "self" IP interface in the Linux kernel.
>> >
>> > Try using btl_tcp_if_include and list just the interface(s) that you
>> want to use.  From your prior email, I'm *guessing* it's just br2 (i.e.,
>> the 10.x address inside your cluster).
>> >
>> > Also, it looks like you didn't setup your SSH keys properly for logging
>> in to remote notes automatically.
>> >
>> >
>> >
>> > On Mar 24, 2014, at 10:56 AM, Hamid Saeed 
>> wrote:
>> >
>> > > Hello,
>> > >
>> > > I added the "self" e.g
>> > >
>> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib
>> --mca btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth
>> ./scatterv
>> > >
>> > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
>> > >
>> --
>> > >
>> > > ERROR::
>> > >
>> > > At least one pair of MPI processes are unable to reach each other for
>> > > MPI communications.  This means that no Open MPI device has indicated
>> > > that it can be used to communicate between these processes.  This is
>> > > an error; Open MPI requires that all MPI processes be able to reach
>> > > each other.  This error can sometimes be the result of forgetting to
>> > > specify the "self" BTL.
>> > >
>> > >   Process 1 ([[15751,1],7]) is on host: wirth
>> > >   Process 2 ([[15751,1],0]) is on host: karp
>> > >   BTLs attempted: self sm
>> > >
>> > > Your MPI job is now going to abort; sorry.
>> > >
>> --
>> > >
>> --
>> > > MPI_INIT has failed because at least one MPI process is unreachable
>> > > from another.  This *usually* means that an underlying communication
>> > > plugin -- such as a BTL or an MTL -- has either not loaded or not
>> > > allowed itself to be used.  Your MPI job will now abort.
>> > >
>> > > You may wish to try to narrow down the problem;
>> > >
>> > >  * Check the output of ompi_info to see which BTL/MTL plugins are
>> > >available.
>> > >  * Run your application with MPI_THREAD_SINGLE.
>> > >  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>> > >if using MTL-based communications) to see exactly which
>> > >communication plugins were considered and/or discarded.
>> > >
>> --

Re: [OMPI users] problem for multiple clusters using mpirun

2014-03-25 Thread Jeff Squyres (jsquyres)
This is very odd -- the default value for btl_tcp_port_min_v4 is 1024.  So 
unless you have overridden this value, you should not be getting a port less 
than 1024.  You can run this to see:

ompi_info --level 9 --param  btl tcp --parsable | grep port_min_v4

Mine says this in a default 1.7.5 installation:

mca:btl:tcp:param:btl_tcp_port_min_v4:value:1024
mca:btl:tcp:param:btl_tcp_port_min_v4:source:default
mca:btl:tcp:param:btl_tcp_port_min_v4:status:writeable
mca:btl:tcp:param:btl_tcp_port_min_v4:level:2
mca:btl:tcp:param:btl_tcp_port_min_v4:help:The minimum port where the TCP BTL 
will try to bind (default 1024)
mca:btl:tcp:param:btl_tcp_port_min_v4:deprecated:no
mca:btl:tcp:param:btl_tcp_port_min_v4:type:int
mca:btl:tcp:param:btl_tcp_port_min_v4:disabled:false



On Mar 25, 2014, at 5:36 AM, Hamid Saeed  wrote:

> Hello,
> Thanks i figured out what was the exact problem in my case.
> Now i am using the following execution line.
> it is directing the mpi comm port to start from 1...
> 
> mpiexec -n 2 --host karp,wirth --mca btl ^openib --mca btl_tcp_if_include br0 
> --mca btl_tcp_port_min_v4 1 ./a.out
> 
> and every thing works again.
> 
> Thanks.
> 
> Best regards.
> 
> 
> 
> 
> On Tue, Mar 25, 2014 at 10:23 AM, Hamid Saeed  wrote:
> Hello,
> I am not sure what approach does the MPI communication follow but when i  
> use
> --mca btl_base_verbose 30
> 
> I observe the mentioned port.
> 
> [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on 
> port 4
> [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 134.106.3.252 failed: Connection refused (111)
> 
> 
> the information on the 
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> is not enough could you kindly explain..
> 
> How can restrict MPI communication to use the ports starting from 1025.
> or use the port some what like
> 59822...
> 
> Regards.
> 
> 
> 
> On Tue, Mar 25, 2014 at 9:15 AM, Reuti  wrote:
> Hi,
> 
> Am 25.03.2014 um 08:34 schrieb Hamid Saeed:
> 
> > Is it possible to change the port number for the MPI communication?
> >
> > I can see that my program uses port 4 for the MPI communication.
> >
> > [karp:23756] btl: tcp: attempting to connect() to address 134.106.3.252 on 
> > port 4
> > [karp][[4612,1],0][btl_tcp_endpoint.c:655:mca_btl_tcp_endpoint_complete_connect]
> >  connect() to 134.106.3.252 failed: Connection refused (111)
> >
> > In my case the ports from 1 to 1024 are reserved.
> > MPI tries to use one of the reserve ports and prompts the connection 
> > refused error.
> >
> > I will be very glade for the kind suggestions.
> 
> There are certain parameters to set the range of used ports, but using any up 
> to 1024 should not be the default:
> 
> http://www.open-mpi.org/community/lists/users/2011/11/17732.php
> 
> Are any of these set by accident beforehand by your environment?
> 
> -- Reuti
> 
> 
> > Regards.
> >
> >
> >
> >
> >
> > On Mon, Mar 24, 2014 at 5:32 PM, Hamid Saeed  wrote:
> > Hello Jeff,
> >
> > Thanks for your cooperation.
> >
> > --mca btl_tcp_if_include br0
> >
> > worked out of the box.
> >
> > The problem was from the network administrator. The machines on the network 
> > side were halting the mpi...
> >
> > so cleaning and killing every thing worked.
> >
> > :)
> >
> > regards.
> >
> >
> > On Mon, Mar 24, 2014 at 4:34 PM, Jeff Squyres (jsquyres) 
> >  wrote:
> > There is no "self" IP interface in the Linux kernel.
> >
> > Try using btl_tcp_if_include and list just the interface(s) that you want 
> > to use.  From your prior email, I'm *guessing* it's just br2 (i.e., the 
> > 10.x address inside your cluster).
> >
> > Also, it looks like you didn't setup your SSH keys properly for logging in 
> > to remote notes automatically.
> >
> >
> >
> > On Mar 24, 2014, at 10:56 AM, Hamid Saeed  wrote:
> >
> > > Hello,
> > >
> > > I added the "self" e.g
> > >
> > > hsaeed@karp:~/Task4_mpi/scatterv$ mpirun -np 8 --mca btl ^openib --mca 
> > > btl_tcp_if_exclude sm,self,lo,br0,br1,ib0,br2 --host karp,wirth ./scatterv
> > >
> > > Enter passphrase for key '/home/hsaeed/.ssh/id_rsa':
> > > --
> > >
> > > ERROR::
> > >
> > > At least one pair of MPI processes are unable to reach each other for
> > > MPI communications.  This means that no Open MPI device has indicated
> > > that it can be used to communicate between these processes.  This is
> > > an error; Open MPI requires that all MPI processes be able to reach
> > > each other.  This error can sometimes be the result of forgetting to
> > > specify the "self" BTL.
> > >
> > >   Process 1 ([[15751,1],7]) is on host: wirth
> > >   Process 2 ([[15751,1],0]) is on host: karp
> > >   BTLs attempted: self sm
> > >
> > > Your MPI job is now going to abort; sorry.
> > > --
> > > -

Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Dave Love
Edgar Gabriel  writes:

> I am still looking into the PVFS2 with ROMIO problem with the 1.6
> series, where (as I mentioned yesterday) the problem I am having right
> now is that the data is wrong. Not sure what causes it, but since I have
> teach this afternoon again, it might be friday until I can digg into that.

Was there any progress with this?  Otherwise, what version of PVFS2 is
known to work with OMPI 1.6?  Thanks.


Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Edgar Gabriel
Yes, the patch has been submitted to the 1.6 branch for review; I'm not sure
what its precise status is. The problems found are more or less
independent of the PVFS2 version.

Thanks
Edgar

On 3/25/2014 7:32 AM, Dave Love wrote:
> Edgar Gabriel  writes:
> 
>> I am still looking into the PVFS2 with ROMIO problem with the 1.6
>> series, where (as I mentioned yesterday) the problem I am having right
>> now is that the data is wrong. Not sure what causes it, but since I have
>> teach this afternoon again, it might be friday until I can digg into that.
> 
> Was there any progress with this?  Otherwise, what version of PVFS2 is
> known to work with OMPI 1.6?  Thanks.
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335





Re: [OMPI users] coll_ml_priority in openmpi-1.7.5

2014-03-25 Thread Jeff Squyres (jsquyres)
Yes, Nathan has a few coll ml fixes queued up for 1.8.
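
For anyone who wants to check or pin this locally, a small sketch using the
coll_ml_priority parameter discussed below (0 disables coll/ml entirely,
including its setup phase):

# show the current priority and where it came from
ompi_info --level 9 --param coll ml --parsable | grep priority
# force coll/ml off for a single run
mpirun -np 8 --mca coll_ml_priority 0 ./a.out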

On Mar 24, 2014, at 10:11 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> I ran our application using the final version of openmpi-1.7.5 again
> with coll_ml_priority = 90.
> 
> Then, coll/ml was actually activated and I got these error messages
> as shown below:
> [manage][[11217,1],0][coll_ml_lmngr.c:265:mca_coll_ml_lmngr_alloc] COLL-ML
> List manager is empty.
> [manage][[11217,1],0][coll_ml_allocation.c:47:mca_coll_ml_allocate_block]
> COLL-ML lmngr failed.
> [manage][[11217,1],0][coll_ml_module.c:532:ml_module_memory_initialization]
> COLL-ML mca_coll_ml_allocate_block exited with error.
> 
> Unfortunately coll/ml seems to still have some problems ...
> 
> And, it also means coll/ml was not activated on my test run with
> coll_ml_priority = 27. So, the slowdown was due to the expensive
> connectivity computation as you pointed out, I guess.
> 
> Tetsuya
> 
>> On Mar 20, 2014, at 5:56 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> Hi Ralph, congratulations on releasing new openmpi-1.7.5.
>>> 
>>> By the way, opnempi-1.7.5rc3 has been slowing down our application
>>> with smaller size of testing data, where the time consuming part
>>> of our application is so called sparse solver. It's negligible
>>> with medium or large size data - more practical one, so I have
>>> been defering this problem.
>>> 
>>> However, this slowdown disappears in the final version of
>>> openmpi-1.7.5. After some investigations, I found coll_ml caused
>>> this slowdown. The final version seems to set coll_ml_priority as zero
>>> again.
>>> 
>>> Could you explain briefly about the advantage of coll_ml? In what kind
>>> of situation it's effective and so on ...
>> 
>> I'm not really the one to speak about coll/ml as I wasn't involved in it
> - Nathan would be the one to ask. It is supposed to be significantly faster
> for most collectives, but I imagine it would
>> depend on the precise collective being used and the size of the data. We
> did find and fix a number of problems right at the end (which is why we
> dropped the priority until we can better test/debug
>> it), and so we might have hit something that was causing your slow down.
>> 
>> 
>>> 
>>> In addition, I'm not sure why coll_my is activated in openmpi-1.7.5rc3,
>>> although its priority is lower than tuned as described in the message
>>> of changeset 30790:
>>> We are initially setting the priority lower than
>>> tuned until this has had some time to soak in the trunk.
>> 
>> Were you actually seeing coll/ml being used? It shouldn't have been.
> However, coll/ml was getting called during the collective initialization
> phase so it could set itself up, even if it wasn't being
>> used. One part of its setup is a somewhat expensive connectivity
> computation - one of our last-minute cleanups was removal of a static 1MB
> array in that procedure. Changing the priority to 0
>> completely disables the coll/ml component, thus removing it from even the
> initialization phase. My guess is that you were seeing a measurable "hit"
> by that procedure on your small data tests, which
>> probably ran fairly quickly - and not seeing it on the other tests
> because the setup time was swamped by the computation time.
>> 
>> 
>>> 
>>> Tetsuya
>>> 
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Jeff Squyres (jsquyres)
Sorry -- we've been focusing on 1.7.5 and the impending 1.8 release; I probably 
won't be able to look at the v1.6 version in the next 2 weeks or so.


On Mar 25, 2014, at 9:09 AM, Edgar Gabriel  wrote:

> yes, the patch has been submitted to the 1.6 branch for review, not sure
> what the precise status of it is. The problems found are more or less
> independent of the PVFS2 version.
> 
> Thanks
> Edga
> 
> On 3/25/2014 7:32 AM, Dave Love wrote:
>> Edgar Gabriel  writes:
>> 
>>> I am still looking into the PVFS2 with ROMIO problem with the 1.6
>>> series, where (as I mentioned yesterday) the problem I am having right
>>> now is that the data is wrong. Not sure what causes it, but since I have
>>> teach this afternoon again, it might be friday until I can digg into that.
>> 
>> Was there any progress with this?  Otherwise, what version of PVFS2 is
>> known to work with OMPI 1.6?  Thanks.
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
> 
> -- 
> Edgar Gabriel
> Associate Professor
> Parallel Software Technologies Lab  http://pstl.cs.uh.edu
> Department of Computer Science  University of Houston
> Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
> Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Rob Latham



On 03/25/2014 07:32 AM, Dave Love wrote:

Edgar Gabriel  writes:


I am still looking into the PVFS2 with ROMIO problem with the 1.6
series, where (as I mentioned yesterday) the problem I am having right
now is that the data is wrong. Not sure what causes it, but since I have
teach this afternoon again, it might be friday until I can digg into that.


Was there any progress with this?  Otherwise, what version of PVFS2 is
known to work with OMPI 1.6?  Thanks.


Edgar, should I pick this up for MPICH, or was this fix specific to
OpenMPI?


==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] Help building/installing a working Open MPI 1.7.4 on OS X 10.9.2 with Free PGI Fortran

2014-03-25 Thread Jeff Squyres (jsquyres)
Got your output -- thanks.  I'm pretty sure this is pointing to a Libtool bug.

Here's the interesting part -- it looks like Libtool simply isn't issuing the 
command to create the library (!).  Check out this (annotated) output from 
"make V=1" on a Linux/gfortran box:


Making all in src
make[1]: Entering directory `/home/jsquyres/git/pgi-autotool-bug/src'

# Compile the fortran_foo.f90 file
/bin/sh ../libtool  --tag=FC   --mode=compile gfortran  -g -O2 -c -o 
fortran_foo.lo fortran_foo.f90
libtool: compile:  gfortran -g -O2 -c fortran_foo.f90  -fPIC -o 
.libs/fortran_foo.o

# Compile the fortran_bar.f90 file
/bin/sh ../libtool  --tag=FC   --mode=compile gfortran  -g -O2 -c -o 
fortran_bar.lo fortran_bar.f90
libtool: compile:  gfortran -g -O2 -c fortran_bar.f90  -fPIC -o 
.libs/fortran_bar.o

# Link the two into the libfortran_stuff.so library
/bin/sh ../libtool  --tag=FC   --mode=link gfortran  -g -O2   -o 
libfortran_stuff.la -rpath /usr/local/lib fortran_foo.lo fortran_bar.lo  
libtool: link: gfortran -shared  -fPIC  .libs/fortran_foo.o .libs/fortran_bar.o 
   -O2   -Wl,-soname -Wl,libfortran_stuff.so.0 -o 
.libs/libfortran_stuff.so.0.0.0

# Make some handy sym links
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.so.0" && ln -s 
"libfortran_stuff.so.0.0.0" "libfortran_stuff.so.0")
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.so" && ln -s 
"libfortran_stuff.so.0.0.0" "libfortran_stuff.so")
libtool: link: ( cd ".libs" && rm -f "libfortran_stuff.la" && ln -s 
"../libfortran_stuff.la" "libfortran_stuff.la" )
-

Compare this to your "make V=1" output:

-
Making install in src

# Compile the fortran_foo.f90 file
/bin/sh ../libtool  --tag=FC   --mode=compile pgfortran  -m64 -c -o 
fortran_foo.lo fortran_foo.f90
libtool: compile:  pgfortran -m64 -c fortran_foo.f90  -o .libs/fortran_foo.o

# Compile the fortran_bar.f90 file
/bin/sh ../libtool  --tag=FC   --mode=compile pgfortran  -m64 -c -o 
fortran_bar.lo fortran_bar.f90
libtool: compile:  pgfortran -m64 -c fortran_bar.f90  -o .libs/fortran_bar.o

# Link the two into the libfortran_stuff.so library
/bin/sh ../libtool  --tag=FC   --mode=link pgfortran  -m64  -m64 -o 
libfortran_stuff.la -rpath /Users/fortran/AutomakeBug/autobug14/lib 
fortran_foo.lo fortran_bar.lo  
*** NOTICE THAT THERE'S NO COMMAND HERE TO MAKE THE LIBRARY!

# Make some handy sym links
libtool: link: (cd ".libs" && rm -f "libfortran_stuff.dylib" && ln -s 
"libfortran_stuff.0.dylib" "libfortran_stuff.dylib")
libtool: link: ( cd ".libs" && rm -f "libfortran_stuff.la" && ln -s 
"../libfortran_stuff.la" "libfortran_stuff.la" )
-

Time to send this bug report upstream.
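
If anyone wants to poke at it locally in the meantime, one sketch (assuming GNU
Libtool's --debug option, which just turns on shell tracing) is to re-run the
failing link step by hand and watch where it bails out:

/bin/sh ../libtool --debug --tag=FC --mode=link pgfortran -m64 -o libfortran_stuff.la -rpath /Users/fortran/AutomakeBug/autobug14/lib fortran_foo.lo fortran_bar.lo

The trace should show which branch decides not to emit the actual
shared-library link command.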



On Mar 24, 2014, at 7:27 PM, Matt Thompson  wrote:

> Jeff,
> 
> I ran these commands:
> 
> $ make clean
> $ make distclean
> 
> (wanted to be extra sure!)
> 
> $ ./configure CC=gcc CXX=g++ F77=pgfortran FC=pgfortran CFLAGS='-m64' 
> CXXFLAGS='-m64' LDFLAGS='-m64' FCFLAGS='-m64' FFLAGS='-m64' 
> --prefix=/Users/fortran/AutomakeBug/autobug14 | & tee configure.log
> $ make V=1 install |& tee makeV1install.log
> 
> So find attached the config.log, configure.log, and makeV1install.log which 
> should have all the info you asked about.
> 
> Matt
> 
> PS: I just tried configure/make/make install with Open MPI 1.7.5, but the 
> same error occurs as expected. Hope springs eternal, you know?
> 
> 
> On Mon, Mar 24, 2014 at 6:48 PM, Jeff Squyres (jsquyres)  
> wrote:
> On Mar 24, 2014, at 6:34 PM, Matt Thompson  wrote:
> 
> > Sorry for the late reply. The answer is: No, 1.14.1 has not fixed the 
> > problem (and indeed, that's what my Mac is running):
> >
> > (28) $ make install | & tee makeinstall.log
> > Making install in src
> >  ../config/install-sh -c -d '/Users/fortran/AutomakeBug/autobug14/lib'
> >  /bin/sh ../libtool   --mode=install /usr/bin/install -c   
> > libfortran_stuff.la '/Users/fortran/AutomakeBug/autobug14/lib'
> > libtool: install: /usr/bin/install -c .libs/libfortran_stuff.0.dylib 
> > /Users/fortran/AutomakeBug/autobug14/lib/libfortran_stuff.0.dylib
> > install: .libs/libfortran_stuff.0.dylib: No such file or directory
> > make[2]: *** [install-libLTLIBRARIES] Error 71
> > make[1]: *** [install-am] Error 2
> > make: *** [install-recursive] Error 1
> >
> > This is the output from either the am12 or am14 test. If you have any 
> > options you'd like me to try with this, let me know. (For example, is there 
> > a way to make autotools *more* verbose? I've always tried to make it less 
> > so!)
> 
> Ok.  With the am14 tarball, please run:
> 
> make clean
> 
> And then run this:
> 
> make V=1 install
> 
> And then send the following:
> 
> - configure stdout
> - config.log file
> - stdout/stderr from "make V=1 install"
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman

Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Edgar Gabriel
Not sure, honestly. Basically, as suggested earlier in this email chain,
I had to disable the PVFS2_IreadContig and PVFS2_IwriteContig routines
in ad_pvfs2.c to make the tests pass. Otherwise the tests ran but
produced wrong data. However, I did not have the time to figure out what
actually goes wrong under the hood.

Edgar

On 3/25/2014 9:21 AM, Rob Latham wrote:
> 
> 
> On 03/25/2014 07:32 AM, Dave Love wrote:
>> Edgar Gabriel  writes:
>>
>>> I am still looking into the PVFS2 with ROMIO problem with the 1.6
>>> series, where (as I mentioned yesterday) the problem I am having right
>>> now is that the data is wrong. Not sure what causes it, but since I have
>>> teach this afternoon again, it might be friday until I can digg into
>>> that.
>>
>> Was there any progress with this?  Otherwise, what version of PVFS2 is
>> known to work with OMPI 1.6?  Thanks.
> 
> Edgar, should I pick this up for MPICH, or was this fix specific to
> OpenMPI ?
> 
> ==rob
> 

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335





Re: [OMPI users] OpenMPI-ROMIO-OrangeFS

2014-03-25 Thread Dave Love
Edgar Gabriel  writes:

> yes, the patch has been submitted to the 1.6 branch for review, not sure
> what the precise status of it is. The problems found are more or less
> independent of the PVFS2 version.

Thanks; I should have looked in the tracker.


[OMPI users] busy waiting and oversubscriptions

2014-03-25 Thread Ross Boylan
Even when "idle", MPI processes use all the CPU.  I thought I remembered
someone saying that they would be low priority, and so not pose much of
an obstacle to other uses of the CPU.

At any rate, my question is whether, if I have processes that spend most
of their time waiting to receive a message, I can run more of them than
I have physical cores without much slowdown?

E.g.  With 8 cores and 8 processes doing real work, can I add a couple
extra processes that mostly wait?

Does it make any difference if there's hyperthreading with, e.g., 16
virtual CPUs based on 8 physical ones?  In general I try to limit to the
number of physical cores.
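
(For what it's worth, the knob that usually comes up here is Open MPI's
yield-when-idle mode -- a sketch, not a promise of idle CPUs, since the
processes still poll, they just call yield between polls:

mpirun -np 10 --mca mpi_yield_when_idle 1 ./a.out

Open MPI is also supposed to drop into this degraded/yielding mode on its own
when it detects more processes than allocated slots.)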

Thanks.
Ross Boylan