Wow! Great and useful explanation. Thanks Jeff.

2009/1/23 Jeff Squyres <jsquy...@cisco.com>:
> FWIW, OMPI v1.3 is much better about registered memory usage than the 1.2
> series. We introduced some new things, including the ability to specify
> exactly which receive queues you want. See:
>
> ...gaaah! It's not on our FAQ yet. :-(
>
> The main idea is that there is a new MCA parameter for the openib BTL:
> btl_openib_receive_queues. It takes a colon-delimited string listing one or
> more receive queues of specific sizes and characteristics. For now, all
> processes in the job *must* use the same string. You can specify three
> kinds of receive queues:
>
> - P: per-peer queues
> - S: shared receive queues
> - X: XRC queues (with OFED 1.4 and later, on specific Mellanox hardware)
>
> Here's a copy-n-paste of our help file describing the format of each:
>
> Per-peer receive queues require between 1 and 5 parameters:
>
>   1. Buffer size in bytes (mandatory)
>   2. Number of buffers (optional; defaults to 8)
>   3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>   4. Credit window size (optional; defaults to (low_watermark / 2))
>   5. Number of buffers reserved for credit messages (optional;
>      defaults to (num_buffers * 2 - 1) / credit_window)
>
> Example: P,128,256,128,16
>   - 128 byte buffers
>   - 256 buffers to receive incoming MPI messages
>   - When the number of available buffers reaches 128, re-post 128 more
>     buffers to reach a total of 256
>   - If the number of available credits reaches 16, send an explicit
>     credit message to the sender
>   - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
>     reserved for explicit credit messages
>
> Shared receive queues can take between 1 and 4 parameters:
>
>   1. Buffer size in bytes (mandatory)
>   2. Number of buffers (optional; defaults to 16)
>   3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>   4. Maximum number of outstanding sends a sender can have (optional;
>      defaults to (low_watermark / 4))
>
> Example: S,1024,256,128,32
>   - 1024 byte buffers
>   - 256 buffers to receive incoming MPI messages
>   - When the number of available buffers reaches 128, re-post 128 more
>     buffers to reach a total of 256
>   - A sender will not send to a peer unless it has fewer than 32
>     outstanding sends to that peer
>
> IIRC, "X" takes the same parameters as "S"...? Note that if you use *any*
> XRC queues, then *all* of your queues must be XRC.
>
> OMPI defaults to a btl_openib_receive_queues value that may be specific
> to your hardware. For example, ConnectX defaults to the following value:
>
> shell$ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> Hope that helps!
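For reference, a queue string like the ones above is normally passed at run
time with mpirun's --mca flag. A minimal sketch, reusing the example queue
specifications from Jeff's message (the process count and the application
name ./my_app are placeholders, not taken from the thread):

    shell$ mpirun -np 64 \
           --mca btl_openib_receive_queues P,128,256,128,16:S,65536,256,128,32 \
           ./my_app

The same value can also be set through the OMPI_MCA_btl_openib_receive_queues
environment variable or an MCA parameter file; remember that all processes in
the job must use the same string.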
> On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:
>
>> Hi Gabriele,
>> it might be that your message size is too large for the available memory
>> per node.
>> I had a problem with IMB when I was not able to run Alltoall to completion
>> on N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is
>> quite a lot, but when you do the maths:
>> 2 * 4 MB * 128 procs * 8 procs/node = 8 GB/node, plus you need to double
>> that because of buffering. I was told by Mellanox (our cards are ConnectX
>> cards) that they introduced XRC in OFED 1.3, in addition to the Shared
>> Receive Queue, which should reduce the memory footprint, but I have not
>> tested this yet.
>> HTH,
>> Igor
>>
>> 2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
>>
>> Hi Igor,
>> my message size is 4096kb and I have 4 procs per core.
>> There isn't any difference using different algorithms..
>>
>> 2009/1/23 Igor Kozin <i.n.ko...@googlemail.com>:
>> > What is your message size and the number of cores per node?
>> > Is there any difference using different algorithms?
>> >
>> > 2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
>> >>
>> >> Hi Jeff,
>> >> I would like to understand why, if I run over 512 procs or more, my
>> >> code stops at an MPI collective, even with a small send buffer. All
>> >> processors are locked in the call, doing nothing. But if I add an
>> >> MPI_Barrier after the MPI collective, it works! I run over an
>> >> Infiniband net.
>> >>
>> >> I know many people with this strange problem; I think there is a
>> >> strange interaction between Infiniband and OpenMPI that causes it.
>> >>
>> >> 2009/1/23 Jeff Squyres <jsquy...@cisco.com>:
>> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >> >
>> >> >> I've noted that OpenMPI has an asynchronous behaviour in the
>> >> >> collective calls.
>> >> >> The processors don't wait for the other procs to arrive at the call.
>> >> >
>> >> > That is correct.
>> >> >
>> >> >> This behaviour can sometimes cause problems with a lot of
>> >> >> processors in the job.
>> >> >
>> >> > Can you describe what exactly you mean? The MPI spec specifically
>> >> > allows this behavior; OMPI made specific design choices and
>> >> > optimizations to support this behavior. FWIW, I'd be pretty
>> >> > surprised if any optimized MPI implementation defaults to fully
>> >> > synchronous collective operations.
>> >> >
>> >> >> Is there an OpenMPI parameter to lock all processes in the
>> >> >> collective call until it is finished? Otherwise I have to insert
>> >> >> many MPI_Barriers in my code, and that is very tedious and strange..
>> >> >
>> >> > As you have noted, MPI_Barrier is the *only* collective operation
>> >> > that MPI guarantees to have any synchronization properties (and it's
>> >> > a fairly weak guarantee at that: no process will exit the barrier
>> >> > until every process has entered the barrier -- but there's no
>> >> > guarantee that all processes leave the barrier at the same time).
>> >> >
>> >> > Why do you need your processes to exit collective operations at the
>> >> > same time?
>> >> >
>> >> > --
>> >> > Jeff Squyres
>> >> > Cisco Systems
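As an illustration of the workaround Gabriele describes above (forcing
synchronization by following a collective with an explicit barrier), a
minimal C sketch is shown below. The choice of MPI_Bcast and the buffer size
are arbitrary, not taken from the thread, and per Jeff's caveat the barrier
only guarantees that every rank has entered it before any rank leaves it.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double buf[1024] = {0};

        MPI_Init(&argc, &argv);

        /* A collective such as MPI_Bcast may return on some ranks before
           other ranks have even entered it; the MPI standard allows this. */
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Explicit barrier: no rank continues past this point until every
           rank has reached it (ranks may still leave at different times). */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Whether this helps presumably depends on why the collective stalls in the
first place; the barrier mainly keeps fast ranks from racing far ahead of
slow ones.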
>
> --
> Jeff Squyres
> Cisco Systems
--
Ing. Gabriele Fatigati

Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it    Tel: +39 051 6171722
g.fatigati [AT] cineca.it