FWIW, OMPI v1.3 is much better about registered memory usage than the 1.2 series. We introduced some new things, including the ability to specify exactly which receive queues you want. See:

...gaaah!  It's not on our FAQ yet.  :-(

The main idea is that there is a new MCA parameter for the openib BTL: btl_openib_receive_queues. It takes a colon-delimited string listing one or more receive queues of specific sizes and characteristics. For now, all processes in the job *must* use the same string. You can specify three kinds of receive queues:

- P: per-peer queues
- S: shared receive queues
- X: XRC queues (with OFED 1.4 and later with specific Mellanox hardware)
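
For example, to select just a single per-peer queue, you set the MCA parameter on the mpirun command line; something like this (off the top of my head -- the app name, -np value, and sizes are just placeholders):

shell$ mpirun --mca btl_openib_receive_queues P,65536,256,128,32 -np 64 ./my_app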

Here's a copy-n-paste of our help file describing the format of each:

Per-peer receive queues require between 1 and 5 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (optional; defaults to 8)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Credit window size (optional; defaults to (low_watermark / 2))
  5. Number of buffers reserved for credit messages (optional;
     defaults to (num_buffers*2-1)/credit_window)

  Example: P,128,256,128,16
  - 128 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - If the number of available credits reaches 16, send an explicit
    credit message to the sender
  - By default, ((256 * 2) - 1) / 16 = 31 buffers are reserved for
    explicit credit messages
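
So, for instance, a minimal per-peer spec that only gives the mandatory buffer size picks up all of the defaults above (again, just a sketch; the app name and -np value are placeholders):

shell$ mpirun --mca btl_openib_receive_queues P,128 -np 16 ./my_app

With the defaults, that works out to 8 buffers, a low watermark of 4, a credit window of 2, and ((8 * 2) - 1) / 2 = 7 buffers reserved for credit messages.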

Shared receive queues can take between 1 and 4 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (optional; defaults to 16)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Maximum number of outstanding sends a sender can have (optional;
     defaults to (low_watermark / 4))

  Example: S,1024,256,128,32
  - 1024 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - A sender will not send to a peer unless it has fewer than 32
    outstanding sends to that peer
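
You can mix queue types by chaining specs with colons, e.g., a small per-peer queue for short messages plus a big shared queue for long ones (these two specs are lifted from the ConnectX default shown below; the rest of the command line is a placeholder):

shell$ mpirun --mca btl_openib_receive_queues P,128,256,192,128:S,65536,256,128,32 -np 128 ./my_app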

IIRC, "X" takes the same parameters as "S"...?  Note that if you use *any* XRC queues, then *all* of your queues must be XRC.
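
So, assuming "X" really does take the same parameters as "S", an all-XRC string would look something like this (unverified sketch; sizes and the rest of the command line are placeholders):

shell$ mpirun --mca btl_openib_receive_queues X,2048,256,128:X,65536,256,128 -np 128 ./my_app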

OMPI defaults to a btl_openib_receive_queues value that may be specific to your hardware. For example, ConnectX defaults to the following value:

shell$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
mca:btl:openib:param:btl_openib_receive_queues:status:writable
mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
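
You can override that default either on the mpirun command line (as in the sketches above) or with the usual OMPI_MCA_ environment variable prefix, e.g. (value copied from the ConnectX default above; the app name is a placeholder):

shell$ export OMPI_MCA_btl_openib_receive_queues="P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32"
shell$ mpirun -np 128 ./my_app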

Hope that helps!




On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:

Hi Gabriele,
it might be that your message size is too large for available memory per node. I had a problem with IMB when I was not able to run to completion Alltoall on N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is quite a lot but when you do the maths: 2* 4 MB * 128 procs * 8 procs/node = 8 GB/node plus you need to double because of buffering. I was told by Mellanox (our cards are ConnectX cards) that they introduced XRC in OFED 1.3 in addition to Share Receive Queue which should reduce memory foot print but I have not tested this yet.
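
To sanity-check that arithmetic in the shell (the figures are the ones above; the trailing factor of 2 is the buffering double):

shell$ echo "$(( 2 * 4 * 128 * 8 )) MB/node, $(( 2 * 4 * 128 * 8 * 2 / 1024 )) GB/node with buffering"
8192 MB/node, 16 GB/node with buffering
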
HTH,
Igor
2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
Hi Igor,
My message size is 4096 KB and I have 4 procs per core.
There isn't any difference using different algorithms.

2009/1/23 Igor Kozin <i.n.ko...@googlemail.com>:
> What is your message size and the number of cores per node?
> Is there any difference using different algorithms?
>
> 2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
>>
>> Hi Jeff,
>> I would like to understand why, if I run over 512 procs or more, my
>> code stops in an MPI collective, even with a small send buffer. All
>> processes are locked in the call, doing nothing. But if I add an
>> MPI_Barrier after the MPI collective, it works! I run over an
>> InfiniBand network.
>>
>> I know many people with this strange problem; I think there is a
>> strange interaction between InfiniBand and Open MPI that causes it.
>>
>>
>>
>> 2009/1/23 Jeff Squyres <jsquy...@cisco.com>:
>> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >
>> >> I've noted that Open MPI has asynchronous behaviour in collective
>> >> calls.
>> >> The processes don't wait for the other procs to arrive in the call.
>> >
>> > That is correct.
>> >
>> >> This behaviour can sometimes cause problems with a lot of
>> >> processes in the job.
>> >
>> > Can you describe what exactly you mean? The MPI spec specifically
>> > allows this behavior; OMPI made specific design choices and
>> > optimizations to support this behavior. FWIW, I'd be pretty surprised
>> > if any optimized MPI implementation defaults to fully synchronous
>> > collective operations.
>> >
>> >> Is there an Open MPI parameter to lock all processes in the collective
>> >> call until it is finished? Otherwise I have to insert many MPI_Barriers
>> >> in my code, and that is very tedious and strange...
>> >
>> > As you have noted, MPI_Barrier is the *only* collective operation
>> > that MPI guarantees to have any synchronization properties (and it's
>> > a fairly weak guarantee at that; no process will exit the barrier
>> > until every process has entered the barrier -- but there's no
>> > guarantee that all processes leave the barrier at the same time).
>> >
>> > Why do you need your processes to exit collective operations at the
>> > same time?
>> >
>> > --
>> > Jeff Squyres
>> > Cisco Systems
>> >
>> >
>>
>>
>>



--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it                    Tel:   +39 051 6171722

g.fatigati [AT] cineca.it


--
Jeff Squyres
Cisco Systems
