Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-27 Thread Gabriele Fatigati
Wow! Great and useful explanation.
Thanks, Jeff.

2009/1/23 Jeff Squyres :
> FWIW, OMPI v1.3 is much better about registered memory usage than the 1.2
> series.  We introduced some new things, to include being able to specify
> exactly what receive queues you want.  See:
>
> ...gaaah!  It's not on our FAQ yet.  :-(
>
> The main idea is that there is a new MCA parameter for the openib BTL:
> btl_openib_receive_queues.  It takes a colon-delimited string listing one or
> more receive queues of specific sizes and characteristics.  For now, all
> processes in the job *must* use the same string.  You can specify three
> kinds of receive queues:
>
> - P: per-peer queues
> - S: shared receive queues
> - X: XRC queues (with OFED 1.4 and later with specific Mellanox hardware)
>
> Here's a copy-n-paste of our help file describing the format of each:
>
> Per-peer receive queues require between 1 and 5 parameters:
>
>  1. Buffer size in bytes (mandatory)
>  2. Number of buffers (optional; defaults to 8)
>  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>  4. Credit window size (optional; defaults to (low_watermark / 2))
>  5. Number of buffers reserved for credit messages (optional;
> defaults to (num_buffers*2-1)/credit_window)
>
>  Example: P,128,256,128,16
>  - 128 byte buffers
>  - 256 buffers to receive incoming MPI messages
>  - When the number of available buffers reaches 128, re-post 128 more
>buffers to reach a total of 256
>  - If the number of available credits reaches 16, send an explicit
>credit message to the sender
>  - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
>reserved for explicit credit messages
>
> Shared receive queues can take between 1 and 4 parameters:
>
>  1. Buffer size in bytes (mandatory)
>  2. Number of buffers (optional; defaults to 16)
>  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>  4. Maximum number of outstanding sends a sender can have (optional;
> defaults to (low_watermark / 4))
>
>  Example: S,1024,256,128,32
>  - 1024 byte buffers
>  - 256 buffers to receive incoming MPI messages
>  - When the number of available buffers reaches 128, re-post 128 more
>buffers to reach a total of 256
>  - A sender will not send to a peer unless it has less than 32
>outstanding sends to that peer.
>
> IIRC, "X" takes the same parameters as "S"...?  Note that if you use *any*
> XRC queues, then *all* of your queues must be XRC.
>
> OMPI defaults to a btl_receive_queues value that may be specific to your
> hardware.  For example, connectx defaults to the following value:
>
> shell$ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma
> delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> Hope that helps!
>
>
>
>
> On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:
>
>> Hi Gabriele,
>> it might be that your message size is too large for available memory per
>> node.
>> I had a problem with IMB when I was not able to run to completion Alltoall
>> on N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is
>> quite a lot but when you do the maths:
>> 2* 4 MB * 128 procs * 8 procs/node = 8 GB/node plus you need to double
>> because of buffering. I was told by Mellanox (our cards are ConnectX cards)
>> that they introduced XRC in OFED 1.3 in addition to Share Receive Queue
>> which should reduce memory foot print but I have not tested this yet.
>> HTH,
>> Igor
>> 2009/1/23 Gabriele Fatigati 
>> Hi Igor,
>> My message size is 4096kb and i have 4 procs per core.
>> There isn't any difference using different algorithms..
>>
>> 2009/1/23 Igor Kozin :
>> > what is your message size and the number of cores per node?
>> > is there any difference using different algorithms?
>> >
>> > 2009/1/23 Gabriele Fatigati 
>> >>
>> >> Hi Jeff,
>> >> i would like to understand why, if i run over 512 procs or more, my
>> >> code stops over mpi collective, also with little send buffer. All
>> >> processors are locked into call, doing nothing. But, if i add
>> >> MPI_Barrier  after MPI collective, it works! I run over Infiniband
>> >> net.
>> >>
>> >> I know many people with this strange problem, i think there is a
>> >> strange interaction between Infiniband and OpenMPI that causes it.
>> >>
>> >>
>> >>
>> >> 2009/1/23 Jeff Squyres :
>> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >> >
>> >> >> I've noted that OpenMPI has an asynchronous behaviour in the
>> >> >> collective
>> >> >> calls.
>> >> >> The processors, doesn't wait that other procs arrives in the call.
>> >> >
>> >> > T

Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-26 Thread Jeff Squyres
Actually, I found out that the help message I pasted lies a little:  
the "number of buffers" parameter for both PP and SRQ types is  
mandatory, not optional.



On Jan 23, 2009, at 2:59 PM, Jeff Squyres wrote:

Here's a copy-n-paste of our help file describing the format of each:

Per-peer receive queues require between 1 and 5 parameters:

 1. Buffer size in bytes (mandatory)
 2. Number of buffers (optional; defaults to 8)
 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
 4. Credit window size (optional; defaults to (low_watermark / 2))
 5. Number of buffers reserved for credit messages (optional;
defaults to (num_buffers*2-1)/credit_window)

 Example: P,128,256,128,16
 - 128 byte buffers
 - 256 buffers to receive incoming MPI messages
 - When the number of available buffers reaches 128, re-post 128 more
   buffers to reach a total of 256
 - If the number of available credits reaches 16, send an explicit
   credit message to the sender
 - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
   reserved for explicit credit messages

Shared receive queues can take between 1 and 4 parameters:

 1. Buffer size in bytes (mandatory)
 2. Number of buffers (optional; defaults to 16)
 3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
 4. Maximum number of outstanding sends a sender can have (optional;
    defaults to (low_watermark / 4))

 Example: S,1024,256,128,32
 - 1024 byte buffers
 - 256 buffers to receive incoming MPI messages
 - When the number of available buffers reaches 128, re-post 128 more
   buffers to reach a total of 256
 - A sender will not send to a peer unless it has less than 32
   outstanding sends to that peer.

IIRC, "X" takes the same parameters as "S"...?  Note that if you use *any*
XRC queues, then *all* of your queues must be XRC.


OMPI defaults to a btl_receive_queues value that may be specific to  
your hardware.  For example, connectx defaults to the following value:


shell$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
mca:btl:openib:param:btl_openib_receive_queues:status:writable
mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4

mca:btl:openib:param:btl_openib_receive_queues:deprecated:no

Hope that helps!




On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:


Hi Gabriele,
it might be that your message size is too large for available  
memory per node.
I had a problem with IMB when I was not able to run to completion  
Alltoall on N=128, ppn=8 on our cluster with 16 GB per node. You'd  
think 16 GB is quite a lot but when you do the maths:
2* 4 MB * 128 procs * 8 procs/node = 8 GB/node plus you need to  
double because of buffering. I was told by Mellanox (our cards are  
ConnectX cards) that they introduced XRC in OFED 1.3 in addition to  
Share Receive Queue which should reduce memory foot print but I  
have not tested this yet.

HTH,
Igor
2009/1/23 Gabriele Fatigati 
Hi Igor,
My message size is 4096kb and i have 4 procs per core.
There isn't any difference using different algorithms..

2009/1/23 Igor Kozin :
> what is your message size and the number of cores per node?
> is there any difference using different algorithms?
>
> 2009/1/23 Gabriele Fatigati 
>>
>> Hi Jeff,
>> i would like to understand why, if i run over 512 procs or more, my
>> code stops over mpi collective, also with little send buffer. All
>> processors are locked into call, doing nothing. But, if i add
>> MPI_Barrier  after MPI collective, it works! I run over Infiniband
>> net.
>>
>> I know many people with this strange problem, i think there is a
>> strange interaction between Infiniband and OpenMPI that causes it.
>>
>>
>>
>> 2009/1/23 Jeff Squyres :
>> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >
>> >> I've noted that OpenMPI has an asynchronous behaviour in the collective
>> >> calls.
>> >> The processors, doesn't wait that other procs arrives in the call.
>> >
>> > That is correct.
>> >
>> >> This behaviour sometimes can cause some problems with a lot of
>> >> processors in the jobs.
>> >
>> > Can you describe what exactly you mean?  The MPI spec specifically allows
>> > this behavior; OMPI made specific design choices and optimizations to
>> > support this behavior.  FWIW, I'd be pretty surprised if any optimized MPI
>> > implementation defaults to fully synchronous collective operations.
>> >
>> >> Is there an OpenMPI parameter to lock all process in the collective
>> >> call until is finished? Otherwise  i have to insert many MPI_Barrier
>> >> in my code and it is very tedious and strange..
>> >
>> > As you have noted, MPI_Barrier is the *only* collective operation that
>> > M

Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Jeff Squyres
FWIW, OMPI v1.3 is much better about registered memory usage than the
1.2 series.  We introduced some new things, to include being able to  
specify exactly what receive queues you want.  See:


...gaaah!  It's not on our FAQ yet.  :-(

The main idea is that there is a new MCA parameter for the openib BTL:  
btl_openib_receive_queues.  It takes a colon-delimited string listing  
one or more receive queues of specific sizes and characteristics.  For  
now, all processes in the job *must* use the same string.  You can  
specify three kinds of receive queues:


- P: per-peer queues
- S: shared receive queues
- X: XRC queues (with OFED 1.4 and later with specific Mellanox  
hardware)


Here's a copy-n-paste of our help file describing the format of each:

Per-peer receive queues require between 1 and 5 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (optional; defaults to 8)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Credit window size (optional; defaults to (low_watermark / 2))
  5. Number of buffers reserved for credit messages (optional;
 defaults to (num_buffers*2-1)/credit_window)

  Example: P,128,256,128,16
  - 128 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
  - If the number of available credits reaches 16, send an explicit
credit message to the sender
  - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
reserved for explicit credit messages

Shared receive queues can take between 1 and 4 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (optional; defaults to 16)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Maximum number of outstanding sends a sender can have (optional;
     defaults to (low_watermark / 4))

  Example: S,1024,256,128,32
  - 1024 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
  - A sender will not send to a peer unless it has less than 32
outstanding sends to that peer.

IIRC, "X" takes the same parameters as "S"...?  Note that if you use *any*
XRC queues, then *all* of your queues must be XRC.


OMPI defaults to a btl_receive_queues value that may be specific to  
your hardware.  For example, connectx defaults to the following value:


shell$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
mca:btl:openib:param:btl_openib_receive_queues:status:writable
mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4

mca:btl:openib:param:btl_openib_receive_queues:deprecated:no

Hope that helps!
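
As a minimal sketch of how to experiment with this parameter (assuming the
usual Open MPI behaviour of reading MCA parameters either from the mpirun
command line, e.g. "mpirun --mca btl_openib_receive_queues ...", or from
OMPI_MCA_-prefixed environment variables), the program below sets the ConnectX
default string quoted above before MPI_Init.  The queue values are only an
example and should be tuned for your hardware.

/* Sketch: select a custom set of openib receive queues via the
 * OMPI_MCA_ environment-variable mechanism.  The value must be in
 * place before MPI_Init so the openib BTL sees it at startup. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    setenv("OMPI_MCA_btl_openib_receive_queues",
           "P,128,256,192,128:S,2048,256,128,32:"
           "S,12288,256,128,32:S,65536,256,128,32", 1);

    MPI_Init(&argc, &argv);
    /* ... application code using collectives ... */
    MPI_Finalize();
    return 0;
}

Remember that all processes in the job must end up with the same string, so in
practice passing it on the mpirun command line is the simplest route.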




On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:


Hi Gabriele,
it might be that your message size is too large for available memory  
per node.
I had a problem with IMB when I was not able to run to completion  
Alltoall on N=128, ppn=8 on our cluster with 16 GB per node. You'd  
think 16 GB is quite a lot but when you do the maths:
2* 4 MB * 128 procs * 8 procs/node = 8 GB/node plus you need to  
double because of buffering. I was told by Mellanox (our cards are  
ConnectX cards) that they introduced XRC in OFED 1.3 in addition to  
Share Receive Queue which should reduce memory foot print but I have  
not tested this yet.

HTH,
Igor
2009/1/23 Gabriele Fatigati 
Hi Igor,
My message size is 4096kb and i have 4 procs per core.
There isn't any difference using different algorithms..

2009/1/23 Igor Kozin :
> what is your message size and the number of cores per node?
> is there any difference using different algorithms?
>
> 2009/1/23 Gabriele Fatigati 
>>
>> Hi Jeff,
>> i would like to understand why, if i run over 512 procs or more, my
>> code stops over mpi collective, also with little send buffer. All
>> processors are locked into call, doing nothing. But, if i add
>> MPI_Barrier  after MPI collective, it works! I run over Infiniband
>> net.
>>
>> I know many people with this strange problem, i think there is a
>> strange interaction between Infiniband and OpenMPI that causes it.
>>
>>
>>
>> 2009/1/23 Jeff Squyres :
>> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >
>> >> I've noted that OpenMPI has an asynchronous behaviour in the collective
>> >> calls.
>> >> The processors, doesn't wait that other procs arrives in the call.
>> >
>> > That is correct.
>> >
>> >> This behaviour sometimes can cause some problems with a lot of
>> >> processors in the jobs.
>> >
>> > Can you describe what exactly you mean?  The MPI spec specifically allows
>> > this behavior; OMPI made specific design

Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread George Bosilca


On Jan 23, 2009, at 11:24 , Eugene Loh wrote:


Jeff Squyres wrote:

As you have noted, MPI_Barrier is the *only* collective operation
that MPI guarantees to have any synchronization properties (and
it's a fairly weak guarantee at that; no process will exit the
barrier until every process has entered the barrier -- but there's
no guarantee that all processes leave the barrier at the same time).


Actually, many collectives have that property due to data-causality  
conditions.  E.g., MPI_Allreduce cannot exit from any process until  
every process has finished.


MPI_Allreduce is a bad example. Depending on the algorithm, this
collective can finish on some nodes way before the others (an allreduce
might be implemented as a reduce followed by a broadcast). However, one
thing will _ALWAYS_ be true: all processes have reached the
MPI_Allreduce call, because they have all provided their data.


  george.




As Jeff mentions, however, exit times can be "ragged" (and  
unfortunately often are).
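
To see how ragged the exits can be, here is a small sketch (hypothetical
measurement code, not from this thread): every rank times how long it spends
inside one MPI_Allreduce after a common barrier, and rank 0 prints the spread
between the fastest and the slowest rank.

/* Observe ragged collective exit times: barrier once, then time the
 * Allreduce on every rank and report the min/max durations. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double in = 1.0, out = 0.0;

    MPI_Barrier(MPI_COMM_WORLD);           /* common starting point */
    double t0 = MPI_Wtime();
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;          /* time spent in the call */

    double dt_min, dt_max;
    MPI_Reduce(&dt, &dt_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("time in MPI_Allreduce: fastest %.6f s, slowest %.6f s\n",
               dt_min, dt_max);

    MPI_Finalize();
    return 0;
}

(Even the initial barrier only guarantees that no rank starts before all ranks
have arrived, so the numbers are indicative rather than exact.)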





Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Eugene Loh

Jeff Squyres wrote:

As you have noted, MPI_Barrier is the *only* collective operation that
MPI guarantees to have any synchronization properties (and it's a
fairly weak guarantee at that; no process will exit the barrier until
every process has entered the barrier -- but there's no guarantee
that all processes leave the barrier at the same time).


Actually, many collectives have that property due to data-causality 
conditions.  E.g., MPI_Allreduce cannot exit from any process until 
every process has finished.


As Jeff mentions, however, exit times can be "ragged" (and unfortunately 
often are).


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Igor Kozin
Hi Gabriele,
it might be that your message size is too large for available memory per
node.
I had a problem with IMB when I was not able to run to completion Alltoall
on N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is
quite a lot but when you do the maths:
2 * 4 MB * 128 procs * 8 procs/node = 8 GB/node, plus you need to double
because of buffering. I was told by Mellanox (our cards are ConnectX cards)
that they introduced XRC in OFED 1.3, in addition to the Shared Receive Queue,
which should reduce the memory footprint, but I have not tested this yet.
HTH,
Igor
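
To make the arithmetic above explicit, here is a small back-of-the-envelope
sketch (the numbers are the ones from this example: 4 MB messages, 128 ranks,
8 ranks per node, and a factor of two for buffering):

/* Rough per-node memory estimate for a large Alltoall, matching the
 * 2 * 4 MB * 128 procs * 8 procs/node calculation above. */
#include <stdio.h>

int main(void)
{
    const double msg_mb        = 4.0;   /* message size per peer, in MB  */
    const int    nprocs        = 128;   /* total MPI ranks               */
    const int    ppn           = 8;     /* ranks per node                */
    const double buffering_fac = 2.0;   /* extra factor for buffering    */

    /* send + receive buffers: 2 * msg * peers * ranks-per-node */
    double user_gb = 2.0 * msg_mb * nprocs * ppn / 1024.0;

    printf("user buffers per node: %.1f GB\n", user_gb);
    printf("with %.0fx buffering  : %.1f GB\n",
           buffering_fac, buffering_fac * user_gb);
    return 0;
}

This prints 8.0 GB and 16.0 GB, which is why a 16 GB node runs out of memory
once the library's own registered buffers are added on top.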
2009/1/23 Gabriele Fatigati 

> Hi Igor,
> My message size is 4096kb and i have 4 procs per core.
> There isn't any difference using different algorithms..
>
> 2009/1/23 Igor Kozin :
>  > what is your message size and the number of cores per node?
> > is there any difference using different algorithms?
> >
> > 2009/1/23 Gabriele Fatigati 
> >>
> >> Hi Jeff,
> >> i would like to understand why, if i run over 512 procs or more, my
> >> code stops over mpi collective, also with little send buffer. All
> >> processors are locked into call, doing nothing. But, if i add
> >> MPI_Barrier  after MPI collective, it works! I run over Infiniband
> >> net.
> >>
> >> I know many people with this strange problem, i think there is a
> >> strange interaction between Infiniband and OpenMPI that causes it.
> >>
> >>
> >>
> >> 2009/1/23 Jeff Squyres :
> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
> >> >
> >> >> I've noted that OpenMPI has an asynchronous behaviour in the
> collective
> >> >> calls.
> >> >> The processors, doesn't wait that other procs arrives in the call.
> >> >
> >> > That is correct.
> >> >
> >> >> This behaviour sometimes can cause some problems with a lot of
> >> >> processors in the jobs.
> >> >
> >> > Can you describe what exactly you mean?  The MPI spec specifically
> >> > allows
> >> > this behavior; OMPI made specific design choices and optimizations to
> >> > support this behavior.  FWIW, I'd be pretty surprised if any optimized
> >> > MPI
> >> > implementation defaults to fully synchronous collective operations.
> >> >
> >> >> Is there an OpenMPI parameter to lock all process in the collective
> >> >> call until is finished? Otherwise  i have to insert many MPI_Barrier
> >> >> in my code and it is very tedious and strange..
> >> >
> >> > As you have notes, MPI_Barrier is the *only* collective operation that
> >> > MPI
> >> > guarantees to have any synchronization properties (and it's a fairly
> >> > weak
> >> > guarantee at that; no process will exit the barrier until every
> process
> >> > has
> >> > entered the barrier -- but there's no guarantee that all processes
> leave
> >> > the
> >> > barrier at the same time).
> >> >
> >> > Why do you need your processes to exit collective operations at the
> same
> >> > time?
> >> >
> >> > --
> >> > Jeff Squyres
> >> > Cisco Systems
> >> >
> >> > ___
> >> > users mailing list
> >> > us...@open-mpi.org
> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Ing. Gabriele Fatigati
> >>
> >> Parallel programmer
> >>
> >> CINECA Systems & Tecnologies Department
> >>
> >> Supercomputing Group
> >>
> >> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> >>
> >> www.cineca.itTel:   +39 051 6171722
> >>
> >> g.fatigati [AT] cineca.it
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Gabriele Fatigati
Thanks Jeff,
I'll try this flag.

Regards.

2009/1/23 Jeff Squyres :
> This is with the 1.2 series, right?
>
> Have you tried using what is described here:
>
>
>  http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> I don't know if you can try OMPI v1.3 or not, but the issue described in
> the above FAQ item is fixed properly in the OMPI v1.3 series (i.e., that MCA
> parameter is unnecessary because we fixed it a different way).
>
> FWIW, if adding an MPI_Barrier is the difference between hanging and not
> hanging, it sounds like an Open MPI bug.  You should never need to add an
> MPI_Barrier to make an MPI program correct.
>
>
>
> On Jan 23, 2009, at 8:09 AM, Gabriele Fatigati wrote:
>
>> Hi Igor,
>> My message size is 4096kb and i have 4 procs per core.
>> There isn't any difference using different algorithms..
>>
>> 2009/1/23 Igor Kozin :
>>>
>>> what is your message size and the number of cores per node?
>>> is there any difference using different algorithms?
>>>
>>> 2009/1/23 Gabriele Fatigati 

 Hi Jeff,
 i would like to understand why, if i run over 512 procs or more, my
 code stops over mpi collective, also with little send buffer. All
 processors are locked into call, doing nothing. But, if i add
 MPI_Barrier  after MPI collective, it works! I run over Infiniband
 net.

 I know many people with this strange problem, i think there is a
 strange interaction between Infiniband and OpenMPI that causes it.



 2009/1/23 Jeff Squyres :
>
> On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>
>> I've noted that OpenMPI has an asynchronous behaviour in the
>> collective
>> calls.
>> The processors, doesn't wait that other procs arrives in the call.
>
> That is correct.
>
>> This behaviour sometimes can cause some problems with a lot of
>> processors in the jobs.
>
> Can you describe what exactly you mean?  The MPI spec specifically
> allows
> this behavior; OMPI made specific design choices and optimizations to
> support this behavior.  FWIW, I'd be pretty surprised if any optimized
> MPI
> implementation defaults to fully synchronous collective operations.
>
>> Is there an OpenMPI parameter to lock all process in the collective
>> call until is finished? Otherwise  i have to insert many MPI_Barrier
>> in my code and it is very tedious and strange..
>
> As you have notes, MPI_Barrier is the *only* collective operation that
> MPI
> guarantees to have any synchronization properties (and it's a fairly
> weak
> guarantee at that; no process will exit the barrier until every process
> has
> entered the barrier -- but there's no guarantee that all processes
> leave
> the
> barrier at the same time).
>
> Why do you need your processes to exit collective operations at the
> same
> time?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>



 --
 Ing. Gabriele Fatigati

 Parallel programmer

 CINECA Systems & Tecnologies Department

 Supercomputing Group

 Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

 www.cineca.itTel:   +39 051 6171722

 g.fatigati [AT] cineca.it
 ___
 users mailing list
 us...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.itTel:   +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>



-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Jeff Squyres

This is with the 1.2 series, right?

Have you tried using what is described here:

http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion

I don't know if you can try OMPI v1.3 or not, but the issue described  
in the above FAQ item is fixed properly in the OMPI v1.3 series
(i.e., that MCA parameter is unnecessary because we fixed it a  
different way).


FWIW, if adding an MPI_Barrier is the difference between hanging and  
not hanging, it sounds like an Open MPI bug.  You should never need to  
add an MPI_Barrier to make an MPI program correct.




On Jan 23, 2009, at 8:09 AM, Gabriele Fatigati wrote:


Hi Igor,
My message size is 4096kb and i have 4 procs per core.
There isn't any difference using different algorithms..

2009/1/23 Igor Kozin :

what is your message size and the number of cores per node?
is there any difference using different algorithms?

2009/1/23 Gabriele Fatigati 


Hi Jeff,
i would like to understand why, if i run over 512 procs or more, my
code stops over mpi collective, also with little send buffer. All
processors are locked into call, doing nothing. But, if i add
MPI_Barrier  after MPI collective, it works! I run over Infiniband
net.

I know many people with this strange problem, i think there is a
strange interaction between Infiniband and OpenMPI that causes it.



2009/1/23 Jeff Squyres :

On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:

I've noted that OpenMPI has an asynchronous behaviour in the collective
calls.
The processors, doesn't wait that other procs arrives in the call.


That is correct.


This behaviour sometimes can cause some problems with a lot of
processors in the jobs.


Can you describe what exactly you mean?  The MPI spec specifically allows
this behavior; OMPI made specific design choices and optimizations to
support this behavior.  FWIW, I'd be pretty surprised if any optimized MPI
implementation defaults to fully synchronous collective operations.

Is there an OpenMPI parameter to lock all process in the collective
call until is finished? Otherwise  i have to insert many MPI_Barrier
in my code and it is very tedious and strange..


As you have noted, MPI_Barrier is the *only* collective operation that
MPI guarantees to have any synchronization properties (and it's a fairly
weak guarantee at that; no process will exit the barrier until every
process has entered the barrier -- but there's no guarantee that all
processes leave the barrier at the same time).

Why do you need your processes to exit collective operations at the same
time?

--
Jeff Squyres
Cisco Systems







--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it








--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.itTel:   +39 051 6171722

g.fatigati [AT] cineca.it



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Gabriele Fatigati
Hi Igor,
My message size is 4096 KB and I have 4 procs per core.
There isn't any difference using different algorithms.

2009/1/23 Igor Kozin :
> what is your message size and the number of cores per node?
> is there any difference using different algorithms?
>
> 2009/1/23 Gabriele Fatigati 
>>
>> Hi Jeff,
>> i would like to understand why, if i run over 512 procs or more, my
>> code stops over mpi collective, also with little send buffer. All
>> processors are locked into call, doing nothing. But, if i add
>> MPI_Barrier  after MPI collective, it works! I run over Infiniband
>> net.
>>
>> I know many people with this strange problem, i think there is a
>> strange interaction between Infiniband and OpenMPI that causes it.
>>
>>
>>
>> 2009/1/23 Jeff Squyres :
>> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >
>> >> I've noted that OpenMPI has an asynchronous behaviour in the collective
>> >> calls.
>> >> The processors, doesn't wait that other procs arrives in the call.
>> >
>> > That is correct.
>> >
>> >> This behaviour sometimes can cause some problems with a lot of
>> >> processors in the jobs.
>> >
>> > Can you describe what exactly you mean?  The MPI spec specifically
>> > allows
>> > this behavior; OMPI made specific design choices and optimizations to
>> > support this behavior.  FWIW, I'd be pretty surprised if any optimized
>> > MPI
>> > implementation defaults to fully synchronous collective operations.
>> >
>> >> Is there an OpenMPI parameter to lock all process in the collective
>> >> call until is finished? Otherwise  i have to insert many MPI_Barrier
>> >> in my code and it is very tedious and strange..
>> >
>> > As you have notes, MPI_Barrier is the *only* collective operation that
>> > MPI
>> > guarantees to have any synchronization properties (and it's a fairly
>> > weak
>> > guarantee at that; no process will exit the barrier until every process
>> > has
>> > entered the barrier -- but there's no guarantee that all processes leave
>> > the
>> > barrier at the same time).
>> >
>> > Why do you need your processes to exit collective operations at the same
>> > time?
>> >
>> > --
>> > Jeff Squyres
>> > Cisco Systems
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> >
>>
>>
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.itTel:   +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Igor Kozin
what is your message size and the number of cores per node?
is there any difference using different algorithms?

2009/1/23 Gabriele Fatigati 

> Hi Jeff,
> i would like to understand why, if i run over 512 procs or more, my
> code stops over mpi collective, also with little send buffer. All
> processors are locked into call, doing nothing. But, if i add
> MPI_Barrier  after MPI collective, it works! I run over Infiniband
> net.
>
> I know many people with this strange problem, i think there is a
> strange interaction between Infiniband and OpenMPI that causes it.
>
>
>
> 2009/1/23 Jeff Squyres :
>  > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
> >
> >> I've noted that OpenMPI has an asynchronous behaviour in the collective
> >> calls.
> >> The processors, doesn't wait that other procs arrives in the call.
> >
> > That is correct.
> >
> >> This behaviour sometimes can cause some problems with a lot of
> >> processors in the jobs.
> >
> > Can you describe what exactly you mean?  The MPI spec specifically allows
> > this behavior; OMPI made specific design choices and optimizations to
> > support this behavior.  FWIW, I'd be pretty surprised if any optimized
> MPI
> > implementation defaults to fully synchronous collective operations.
> >
> >> Is there an OpenMPI parameter to lock all process in the collective
> >> call until is finished? Otherwise  i have to insert many MPI_Barrier
> >> in my code and it is very tedious and strange..
> >
> > As you have notes, MPI_Barrier is the *only* collective operation that
> MPI
> > guarantees to have any synchronization properties (and it's a fairly weak
> > guarantee at that; no process will exit the barrier until every process
> has
> > entered the barrier -- but there's no guarantee that all processes leave
> the
> > barrier at the same time).
> >
> > Why do you need your processes to exit collective operations at the same
> > time?
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
>
>
> --
> Ing. Gabriele Fatigati
>
> Parallel programmer
>
> CINECA Systems & Tecnologies Department
>
> Supercomputing Group
>
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>
> www.cineca.itTel:   +39 051 6171722
>
> g.fatigati [AT] cineca.it
> ___
>  users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Gabriele Fatigati
Hi Jeff,
I would like to understand why, if I run on 512 procs or more, my
code stops in an MPI collective, even with a small send buffer. All
processors are locked in the call, doing nothing. But if I add an
MPI_Barrier after the MPI collective, it works! I run over an
Infiniband network.

I know many people with this strange problem; I think there is a
strange interaction between Infiniband and OpenMPI that causes it.



2009/1/23 Jeff Squyres :
> On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>
>> I've noted that OpenMPI has an asynchronous behaviour in the collective
>> calls.
>> The processors, doesn't wait that other procs arrives in the call.
>
> That is correct.
>
>> This behaviour sometimes can cause some problems with a lot of
>> processors in the jobs.
>
> Can you describe what exactly you mean?  The MPI spec specifically allows
> this behavior; OMPI made specific design choices and optimizations to
> support this behavior.  FWIW, I'd be pretty surprised if any optimized MPI
> implementation defaults to fully synchronous collective operations.
>
>> Is there an OpenMPI parameter to lock all process in the collective
>> call until is finished? Otherwise  i have to insert many MPI_Barrier
>> in my code and it is very tedious and strange..
>
> As you have notes, MPI_Barrier is the *only* collective operation that MPI
> guarantees to have any synchronization properties (and it's a fairly weak
> guarantee at that; no process will exit the barrier until every process has
> entered the barrier -- but there's no guarantee that all processes leave the
> barrier at the same time).
>
> Why do you need your processes to exit collective operations at the same
> time?
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>



-- 
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it    Tel: +39 051 6171722

g.fatigati [AT] cineca.it


Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Ashley Pittman
On Fri, 2009-01-23 at 06:51 -0500, Jeff Squyres wrote:
> > This behaviour sometimes can cause some problems with a lot of
> > processors in the jobs.

> Can you describe what exactly you mean?  The MPI spec specifically  
> allows this behavior; OMPI made specific design choices and  
> optimizations to support this behavior.  FWIW, I'd be pretty surprised  
> if any optimized MPI implementation defaults to fully synchronous  
> collective operations.

As Jeff says, the spec encourages the kind of behaviour you describe.  I
have, however, seen this cause problems in applications before, and it's
not uncommon for added barriers to improve the performance of an
application.  You might find that it's better to add a barrier after
every N collectives rather than after every single collective.
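
A minimal sketch of that idea (the loop structure and the interval are made
up for illustration; N needs tuning per application):

/* Re-synchronize with a barrier every N collectives so that fast ranks
 * cannot run arbitrarily far ahead of slow ones. */
#include <mpi.h>

#define BARRIER_INTERVAL 100   /* "N": tune for your application */

void iterate(double *local, double *global, int count, int iters)
{
    for (int i = 0; i < iters; ++i) {
        MPI_Allreduce(local, global, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        if ((i + 1) % BARRIER_INTERVAL == 0)
            MPI_Barrier(MPI_COMM_WORLD);
    }
}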

> > Is there an OpenMPI parameter to lock all process in the collective
> > call until is finished? Otherwise  i have to insert many MPI_Barrier
> > in my code and it is very tedious and strange..
> 
> As you have notes, MPI_Barrier is the *only* collective operation that  
> MPI guarantees to have any synchronization properties

AllGather, AllReduce and AlltoAll also have an implicit barrier by
virtue of the dataflow required: all processes need input from all other
processes before they can return.

Ashley Pittman.



Re: [OMPI users] Asynchronous behaviour of MPI Collectives

2009-01-23 Thread Jeff Squyres

On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:

I've noted that OpenMPI has an asynchronous behaviour in the
collective calls.
The processors, doesn't wait that other procs arrives in the call.


That is correct.


This behaviour sometimes can cause some problems with a lot of
processors in the jobs.


Can you describe what exactly you mean?  The MPI spec specifically  
allows this behavior; OMPI made specific design choices and  
optimizations to support this behavior.  FWIW, I'd be pretty surprised  
if any optimized MPI implementation defaults to fully synchronous  
collective operations.



Is there an OpenMPI parameter to lock all process in the collective
call until is finished? Otherwise  i have to insert many MPI_Barrier
in my code and it is very tedious and strange..


As you have noted, MPI_Barrier is the *only* collective operation that
MPI guarantees to have any synchronization properties (and it's a  
fairly weak guarantee at that; no process will exit the barrier until  
every process has entered the barrier -- but there's no guarantee that  
all processes leave the barrier at the same time).


Why do you need your processes to exit collective operations at the  
same time?


--
Jeff Squyres
Cisco Systems