Wow! Great and useful explanation. Thanks Jeff.

2009/1/23 Jeff Squyres <jsquy...@cisco.com>:
> FWIW, OMPI v1.3 is much better about registered memory usage than the 1.2
> series. We introduced some new things, including the ability to specify
> exactly which receive queues you want. See:
>
> ...gaaah! It's not on our FAQ yet. :-(
>
> The main idea is that there is a new MCA parameter for the openib BTL:
> btl_openib_receive_queues. It takes a colon-delimited string listing one or
> more receive queues of specific sizes and characteristics. For now, all
> processes in the job *must* use the same string. You can specify three
> kinds of receive queues:
>
> - P: per-peer queues
> - S: shared receive queues
> - X: XRC queues (with OFED 1.4 and later, on specific Mellanox hardware)
>
> Here's a copy-n-paste of our help file describing the format of each:
>
> Per-peer receive queues require between 1 and 5 parameters:
>
>   1. Buffer size in bytes (mandatory)
>   2. Number of buffers (optional; defaults to 8)
>   3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>   4. Credit window size (optional; defaults to (low_watermark / 2))
>   5. Number of buffers reserved for credit messages (optional;
>      defaults to (num_buffers * 2 - 1) / credit_window)
>
> Example: P,128,256,128,16
>   - 128 byte buffers
>   - 256 buffers to receive incoming MPI messages
>   - When the number of available buffers reaches 128, re-post 128 more
>     buffers to reach a total of 256
>   - If the number of available credits reaches 16, send an explicit
>     credit message to the sender
>   - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
>     reserved for explicit credit messages
>
> Shared receive queues can take between 1 and 4 parameters:
>
>   1. Buffer size in bytes (mandatory)
>   2. Number of buffers (optional; defaults to 16)
>   3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
>   4. Maximum number of outstanding sends a sender can have (optional;
>      defaults to (low_watermark / 4))
>
> Example: S,1024,256,128,32
>   - 1024 byte buffers
>   - 256 buffers to receive incoming MPI messages
>   - When the number of available buffers reaches 128, re-post 128 more
>     buffers to reach a total of 256
>   - A sender will not send to a peer unless it has fewer than 32
>     outstanding sends to that peer
>
> IIRC, "X" takes the same parameters as "S"...? Note that if you use *any*
> XRC queues, then *all* of your queues must be XRC.
>
> OMPI defaults to a btl_openib_receive_queues value that may be specific
> to your hardware. For example, ConnectX defaults to the following value:
>
> shell$ ompi_info --param btl openib --parsable | grep receive_queues
> mca:btl:openib:param:btl_openib_receive_queues:value:P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> mca:btl:openib:param:btl_openib_receive_queues:data_source:default value
> mca:btl:openib:param:btl_openib_receive_queues:status:writable
> mca:btl:openib:param:btl_openib_receive_queues:help:Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
> mca:btl:openib:param:btl_openib_receive_queues:deprecated:no
>
> Hope that helps!
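For reference, a queue string like the ones above is normally passed at run
time with mpirun's --mca flag. A minimal sketch, reusing the example queue
specifications from Jeff's message (the process count and the application
name ./my_app are placeholders, not taken from the thread):

    shell$ mpirun -np 64 \
           --mca btl_openib_receive_queues P,128,256,128,16:S,65536,256,128,32 \
           ./my_app

The same value can also be set through the OMPI_MCA_btl_openib_receive_queues
environment variable or an MCA parameter file; remember that all processes in
the job must use the same string.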
> On Jan 23, 2009, at 9:27 AM, Igor Kozin wrote:
>
>> Hi Gabriele,
>> it might be that your message size is too large for the available memory
>> per node.
>> I had a problem with IMB when I was not able to run Alltoall to completion
>> on N=128, ppn=8 on our cluster with 16 GB per node. You'd think 16 GB is
>> quite a lot, but when you do the maths:
>> 2 * 4 MB * 128 procs * 8 procs/node = 8 GB/node, plus you need to double
>> that because of buffering. I was told by Mellanox (our cards are ConnectX
>> cards) that they introduced XRC in OFED 1.3, in addition to the Shared
>> Receive Queue, which should reduce the memory footprint, but I have not
>> tested this yet.
>> HTH,
>> Igor
>>
>> 2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
>>
>> Hi Igor,
>> my message size is 4096kb and I have 4 procs per core.
>> There isn't any difference using different algorithms..
>>
>> 2009/1/23 Igor Kozin <i.n.ko...@googlemail.com>:
>> > What is your message size and the number of cores per node?
>> > Is there any difference using different algorithms?
>> >
>> > 2009/1/23 Gabriele Fatigati <g.fatig...@cineca.it>
>> >>
>> >> Hi Jeff,
>> >> I would like to understand why, if I run over 512 procs or more, my
>> >> code stops at an MPI collective, even with a small send buffer. All
>> >> processors are locked in the call, doing nothing. But if I add an
>> >> MPI_Barrier after the MPI collective, it works! I run over an
>> >> Infiniband net.
>> >>
>> >> I know many people with this strange problem; I think there is a
>> >> strange interaction between Infiniband and OpenMPI that causes it.
>> >>
>> >> 2009/1/23 Jeff Squyres <jsquy...@cisco.com>:
>> >> > On Jan 23, 2009, at 6:32 AM, Gabriele Fatigati wrote:
>> >> >
>> >> >> I've noted that OpenMPI has an asynchronous behaviour in the
>> >> >> collective calls.
>> >> >> The processors don't wait for the other procs to arrive at the call.
>> >> >
>> >> > That is correct.
>> >> >
>> >> >> This behaviour can sometimes cause problems with a lot of
>> >> >> processors in the job.
>> >> >
>> >> > Can you describe what exactly you mean? The MPI spec specifically
>> >> > allows this behavior; OMPI made specific design choices and
>> >> > optimizations to support this behavior. FWIW, I'd be pretty
>> >> > surprised if any optimized MPI implementation defaults to fully
>> >> > synchronous collective operations.
>> >> >
>> >> >> Is there an OpenMPI parameter to lock all processes in the
>> >> >> collective call until it is finished? Otherwise I have to insert
>> >> >> many MPI_Barriers in my code, and that is very tedious and strange..
>> >> >
>> >> > As you have noted, MPI_Barrier is the *only* collective operation
>> >> > that MPI guarantees to have any synchronization properties (and it's
>> >> > a fairly weak guarantee at that: no process will exit the barrier
>> >> > until every process has entered the barrier -- but there's no
>> >> > guarantee that all processes leave the barrier at the same time).
>> >> >
>> >> > Why do you need your processes to exit collective operations at the
>> >> > same time?
>> >> >
>> >> > --
>> >> > Jeff Squyres
>> >> > Cisco Systems
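As an illustration of the workaround Gabriele describes above (forcing
synchronization by following a collective with an explicit barrier), a
minimal C sketch is shown below. The choice of MPI_Bcast and the buffer size
are arbitrary, not taken from the thread, and per Jeff's caveat the barrier
only guarantees that every rank has entered it before any rank leaves it.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double buf[1024] = {0};

        MPI_Init(&argc, &argv);

        /* A collective such as MPI_Bcast may return on some ranks before
           other ranks have even entered it; the MPI standard allows this. */
        MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Explicit barrier: no rank continues past this point until every
           rank has reached it (ranks may still leave at different times). */
        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }

Whether this helps presumably depends on why the collective stalls in the
first place; the barrier mainly keeps fast ranks from racing far ahead of
slow ones.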
>
> --
> Jeff Squyres
> Cisco Systems
--
Ing. Gabriele Fatigati

Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it    Tel: +39 051 6171722
g.fatigati [AT] cineca.it