Hello,
On Tue, 2010-11-23 at 18:03 -0500, George Bosilca wrote:
> If you know the max size of the receives I would take a different approach.
> Post few persistent receives, and manage them in a circular buffer. Instead
> of doing an MPI_Iprobe, use MPI_Test on the current head of your circular
> buffer. Once you use the data related to the receive, just do an MPI_Start on
> your request.
>
I implemented your approach, and I must say IT IS FASTER !
My ring has 128 bins. I guess that qualifies as a 'few'.
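In case it is useful to someone, here is the pattern in a nutshell (a condensed sketch of what the attached MessagesHandler does; RING_SIZE, EAGER_LIMIT, postReceives and pollOneMessage are illustrative names, not Ray's real identifiers):

#include <mpi.h>
#include <stdint.h>

// illustrative constants; Ray uses 128 slots and its shared-memory eager limit
static const int RING_SIZE=128;
static const int EAGER_LIMIT=4096;

static MPI_Request ring[RING_SIZE];
static uint64_t buffers[RING_SIZE][EAGER_LIMIT/sizeof(uint64_t)];
static int head=0;

// post the persistent receives once, at startup
void postReceives(){
	for(int i=0;i<RING_SIZE;i++){
		MPI_Recv_init(buffers[i],EAGER_LIMIT/sizeof(uint64_t),MPI_UNSIGNED_LONG_LONG,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&ring[i]);
		MPI_Start(&ring[i]);
	}
}

// called in the main loop instead of MPI_Iprobe/MPI_Recv
void pollOneMessage(){
	int flag;
	MPI_Status status;
	MPI_Test(&ring[head],&flag,&status); // only test the head of the ring
	if(flag){
		// consume buffers[head] here (copy it out before re-arming the slot)
		MPI_Start(&ring[head]);          // re-post the persistent receive
		head=(head+1)%RING_SIZE;         // advance the head
	}
}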
Here are my tests:
* Open-MPI 1.4.3
* Infiniband QDR/full bisection topology
* 32 MPI ranks (Intel(R) Xeon(R) CPU X5560 @ 2.80GHz)
* g++ (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)
* Ray 1.0.0-RC1
* colosse http://www.top500.org/system/10195
with MPI_Iprobe/MPI_Recv (old, r4023)
[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi-r4023.sh
#!/bin/bash
#$ -N iprobe
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/r4023/code/Ray \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
-o Ecoli-THEONE
Beginning of computation: 1 seconds
Distribution of sequence reads: 7 minutes, 48 seconds
Distribution of vertices: 1 minutes, 36 seconds
Calculation of coverage distribution: 0 seconds
Distribution of edges: 2 minutes, 19 seconds
Indexing of sequence reads: 5 seconds
Computation of seeds: 3 minutes, 40 seconds
Computation of library sizes: 1 minutes, 37 seconds
Extension of seeds: 4 minutes, 41 seconds
Computation of fusions: 1 minutes, 16 seconds
Collection of fusions: 0 seconds
Completion of the assembly: 23 minutes, 3 seconds
with MPI_Recv_init/MPI_Start (new, HEAD)
[sboisver12@colosse2 SRA001125]$ qsub qsub-openmpi-r4023.sh
Your job 1031990 ("iprobe") has been submitted
[sboisver12@colosse2 SRA001125]$ cat qsub-openmpi.sh
#!/bin/bash
#$ -N persistent
#$ -P nne-790-aa
#$ -l h_rt=0:40:00
#$ -pe mpi 32
#$ -M sebastien.boisvert.3@
#$ -m bea
module load compilers/gcc/4.4.2 mpi/openmpi/1.4.3_gcc
/software/MPI/openmpi-1.4.3_gcc/bin/mpirun /home/sboisver12/Ray/trunk/code/Ray \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001665_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001665_2.fastq.gz \
-p /home/sboisver12/nne-790-aa/SRA001125/SRR001666_1.fastq.gz /home/sboisver12/nne-790-aa/SRA001125/SRR001666_2.fastq.gz \
-o Ecoli-THEONE
Beginning of computation: 1 seconds
Distribution of sequence reads: 7 minutes, 22 seconds
Distribution of vertices: 1 minutes, 28 seconds
Calculation of coverage distribution: 1 seconds
Distribution of edges: 2 minutes, 14 seconds
Indexing of sequence reads: 5 seconds
Computation of seeds: 2 minutes, 41 seconds
Computation of library sizes: 1 minutes, 14 seconds
Extension of seeds: 3 minutes, 47 seconds
Computation of fusions: 1 minutes, 0 seconds
Collection of fusions: 1 seconds
Completion of the assembly: 19 minutes, 54 seconds
So:
The MPI_Iprobe approach: 23 minutes, 3 seconds
The persistent approach proposed by George Bosilca: 19 minutes, 54 seconds
That is a reduction of roughly 14% in total wall-clock time (1383 seconds down to 1194 seconds).
> This approach will minimize the unexpected messages, and drain the
> connections faster. Moreover, at the end it is very easy to MPI_Cancel all
> the receives not yet matched.
I see.
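With the illustrative names from the sketch above, that cleanup is just a short loop (the attached freeLeftovers() does exactly this, plus freeing the buffers):

// at shutdown, cancel and free every persistent receive that never matched
void cancelReceives(){
	for(int i=0;i<RING_SIZE;i++){
		MPI_Cancel(&ring[i]);
		MPI_Request_free(&ring[i]);
	}
}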
> george.
Thank you !
p.s.: I have learned a lot about MPI since my first post here!
>
> On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:
>
> > On Tuesday, November 23, 2010 at 17:38 -0500, George Bosilca wrote:
> >> The eager size reported by ompi_info includes the Open MPI internal
> >> headers. They are anywhere between 20 and 64 bytes long (potentially more
> >> for some particular networks), so what Eugene suggested was a safe
> >> boundary.
> >
> > I see.
> >
> >>
> >> Moreover, eager send can improve performance if and only if the matching
> >> receives are already posted on the peer. If not, the data will become
> >> unexpected, and there will be one additional memcpy.
> >
> > So it won't improve performance in my application (Ray,
> > http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> > incoming messages, which means any receive (MPI_Recv) is never posted
> > before any send (MPI_Isend).
> >
> > Thanks, this thread is very informative for me !
> >
> >>
> >> george.
> >>
> >> On Nov 23, 2010, at 17:29 , Sébastien Boisvert wrote:
> >>
> >>> On Tuesday, November 23, 2010 at 16:07 -0500, Eugene Loh wrote:
> >>>> Sébastien Boisvert wrote:
> >>>>
> >>>>> Now I can describe the cases.
> >>>>>
> >>>>>
> >>>> The test cases can all be explained by the test requiring eager messages
> >>>> (something that test4096.cpp does not require).
> >>>>
> >>>>> Case 1: 30 MPI ranks, message size is 4096 bytes
> >>>>>
> >>>>> File: mpirun-np-30-Program-4096.txt
> >>>>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >>>>>
> >>>>>
> >>>> 4096 is rendezvous. For eager, try 4000 or lower.
> >>>
> >>> According to ompi_info, the threshold is 4096, not 4000, right ?
> >>>
> >>> (Open-MPI 1.4.3)
> >>> [sboisver12@colosse1 ~]$ ompi_info -a|less
> >>> MCA btl: parameter "btl_sm_eager_limit" (current value:
> >>> "4096", data source: default value)
> >>> Maximum size (in bytes) of "short" messages
> >>> (must be >= 1).
> >>>
> >>>
> >>> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
> >>> that is, a sender attempts to write its entire message to shared buffers
> >>> without waiting for a receiver to be ready. Above this size, a sender
> >>> will only write the first part of a message, then wait for the receiver
> >>> to acknowledge its ready before continuing. Eager sends can improve
> >>> performance by decoupling senders from receivers."
> >>>
> >>>
> >>>
> >>> source:
> >>> http://www.open-mpi.org/faq/?category=sm#more-sm
> >>>
> >>>
> >>> It should say "Below this size or equal to this size" instead of "Below
> >>> this size" as ompi_info says. ;)
> >>>
> >>>
> >>>
> >>>
> >>> As Mr. George Bosilca put it:
> >>>
> >>> "__should__ is not correct, __might__ is a better verb to describe the
> >>> most "common" behavior for small messages. The problem comes from the
> >>> fact that in each communicator the FIFO ordering is required by the MPI
> >>> standard. As soon as there is any congestion, MPI_Send will block even
> >>> for small messages (and this is independent of the underlying network)
> >>> until all the pending packets have been delivered."
> >>>
> >>> source:
> >>> http://www.open-mpi.org/community/lists/devel/2010/11/8696.php
> >>>
> >>>
> >>>
> >>>>
> >>>>> Case 2: 30 MPI ranks, message size is 1 byte
> >>>>>
> >>>>> File: mpirun-np-30-Program-1.txt.gz
> >>>>> Outcome: It runs just fine.
> >>>>>
> >>>>>
> >>>> 1 byte is eager.
> >>>
> >>> I agree.
> >>>
> >>>>
> >>>>> Case 3: 2 MPI ranks, message size is 4096 bytes
> >>>>>
> >>>>> File: mpirun-np-2-Program-4096.txt
> >>>>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >>>>>
> >>>>>
> >>>> Same as Case 1.
> >>>>
> >>>>> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
> >>>>> disabled
> >>>>>
> >>>>> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> >>>>> Outcome: It runs just fine.
> >>>>>
> >>>>>
> >>>> Eager limit for TCP is 65536 (perhaps less some overhead). So, these
> >>>> messages are eager.
> >>>
> >>> I agree.
> >>>
> >>>>
> >>>>
> >
> > --
> > M. Sébastien Boisvert
> > PhD student in physiology-endocrinology at Université Laval
> > Fellow of the Canadian Institutes of Health Research
> > Team of Professor Jacques Corbeil
> >
> > Centre de recherche en infectiologie de l'Université Laval
> > Room R-61B
> > 2705, boulevard Laurier
> > Québec, Québec
> > Canada G1V 4G2
> > Phone: 418 525 4444 46342
> >
> > Email: [email protected]
> > Web: http://boisvert.info
> >
> > "Innovation comes only from an assault on the unknown" -Sydney Brenner
"Innovation comes only from an assault on the unknown" -Sydney Brenner
/*
Ray
Copyright (C) 2010 Sébastien Boisvert
http://DeNovoAssembler.SourceForge.Net/
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You have received a copy of the GNU General Public License
along with this program (COPYING).
see <http://www.gnu.org/licenses/>
*/
#include<MessagesHandler.h>
#include<common_functions.h>
#include<assert.h>
/*
 * send all messages queued in the outbox
 */
void MessagesHandler::sendMessages(StaticVector*outbox,int source){
	for(int i=0;i<(int)outbox->size();i++){
		Message*aMessage=((*outbox)[i]);
		#ifdef ASSERT
		int destination=aMessage->getDestination();
		assert(destination>=0);
		#endif

		MPI_Request request;
		// Non-blocking standard send (MPI_Isend); the request is freed right below
		// without waiting for completion.
		// (The synchronous variant, MPI_Issend, would only complete once the
		// matching receive is posted.)
		#ifdef ASSERT
		assert(!(aMessage->getBuffer()==NULL && aMessage->getCount()>0));
		#endif

		#ifndef ASSERT
		MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		#else
		int value=MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		assert(value==MPI_SUCCESS);
		#endif

		MPI_Request_free(&request);
		#ifdef ASSERT
		assert(request==MPI_REQUEST_NULL);
		#endif
	}
	outbox->clear();
}
/*
 * receiveMessages is implemented as recommended by Mr. George Bosilca from
 * the University of Tennessee (via the Open-MPI mailing list)
 *
 * From: George Bosilca <bosilca@…>
 * Reply-to: Open MPI Developers <devel@…>
 * To: Open MPI Developers <devel@…>
 * Subject: Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
 * Date: 2010-11-23 18:03:04
 *
 * "If you know the max size of the receives I would take a different approach.
 * Post few persistent receives, and manage them in a circular buffer.
 * Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your circular buffer.
 * Once you use the data related to the receive, just do an MPI_Start on your request.
 * This approach will minimize the unexpected messages, and drain the connections faster.
 * Moreover, at the end it is very easy to MPI_Cancel all the receives not yet matched."
 *
 * george.
 */
void MessagesHandler::receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination){
	int flag;
	MPI_Status status;
	// test only the head of the ring of persistent receives
	MPI_Test(m_ring+m_head,&flag,&status);
	if(flag){
		// get the actual length of the message;
		// it is not necessarily the count posted with MPI_Recv_init,
		// which is only an upper bound on what can be received
		int tag=status.MPI_TAG;
		int source=status.MPI_SOURCE;
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		u64*filledBuffer=(u64*)m_buffers+m_head*MPI_BTL_SM_EAGER_LIMIT/sizeof(u64);
		// copy it in a safe buffer
		u64*incoming=(u64*)inboxAllocator->allocate(length*sizeof(u64));
		for(int i=0;i<length;i++){
			incoming[i]=filledBuffer[i];
		}
		// the persistent request can start again
		MPI_Start(m_ring+m_head);
		// add the message in the inbox
		Message aMessage(incoming,length,MPI_UNSIGNED_LONG_LONG,source,tag,source);
		inbox->push_back(aMessage);
		m_receivedMessages[source]++;
		// advance the head of the circular buffer
		m_head++;
		if(m_head==m_ringSize){
			m_head=0;
		}
	}
}
void MessagesHandler::showStats(){
	cout<<"Rank "<<m_rank;
	for(int i=0;i<m_size;i++){
		cout<<" "<<m_receivedMessages[i];
	}
	cout<<endl;
}

void MessagesHandler::addCount(int rank,u64 count){
	m_allReceivedMessages[rank*m_size+m_allCounts[rank]]=count;
	m_allCounts[rank]++;
}
bool MessagesHandler::isFinished(int rank){
	return m_allCounts[rank]==m_size;
}

bool MessagesHandler::isFinished(){
	for(int i=0;i<m_size;i++){
		if(!isFinished(i)){
			return false;
		}
	}
	// update the counts for root, because it was updated.
	for(int i=0;i<m_size;i++){
		m_allCounts[MASTER_RANK*m_size+i]=m_receivedMessages[i];
	}
	return true;
}
void MessagesHandler::writeStats(const char*file){
	FILE*f=fopen(file,"w+");
	for(int i=0;i<m_size;i++){
		fprintf(f,"\t%i",i);
	}
	fprintf(f,"\n");
	for(int i=0;i<m_size;i++){
		fprintf(f,"%i",i);
		for(int j=0;j<m_size;j++){
			fprintf(f,"\t%lu",m_allReceivedMessages[i*m_size+j]);
		}
		fprintf(f,"\n");
	}
	fclose(f);
}
void MessagesHandler::constructor(int rank,int size){
	m_rank=rank;
	m_size=size;
	m_receivedMessages=(u64*)__Malloc(sizeof(u64)*m_size);
	if(rank==MASTER_RANK){
		m_allReceivedMessages=(u64*)__Malloc(sizeof(u64)*m_size*m_size);
		m_allCounts=(int*)__Malloc(sizeof(int)*m_size);
	}
	for(int i=0;i<m_size;i++){
		m_receivedMessages[i]=0;
		if(rank==MASTER_RANK){
			m_allCounts[i]=0;
		}
	}

	// the ring contains 128 slots, each with its own buffer and persistent receive
	m_ringSize=128;
	m_ring=(MPI_Request*)__Malloc(sizeof(MPI_Request)*m_ringSize);
	m_buffers=(char*)__Malloc(MPI_BTL_SM_EAGER_LIMIT*m_ringSize);
	m_head=0;

	// post the persistent receives and start them
	for(int i=0;i<m_ringSize;i++){
		void*buffer=m_buffers+i*MPI_BTL_SM_EAGER_LIMIT;
		MPI_Recv_init(buffer,MPI_BTL_SM_EAGER_LIMIT/sizeof(VERTEX_TYPE),MPI_UNSIGNED_LONG_LONG,
			MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,m_ring+i);
		MPI_Start(m_ring+i);
	}
}
u64*MessagesHandler::getReceivedMessages(){
	return m_receivedMessages;
}

void MessagesHandler::freeLeftovers(){
	// cancel and free the persistent receives that have not matched yet
	for(int i=0;i<m_ringSize;i++){
		MPI_Cancel(m_ring+i);
		MPI_Request_free(m_ring+i);
	}
	__Free(m_ring);
	__Free(m_buffers);
}
/*
Ray
Copyright (C) 2010 Sébastien Boisvert
http://DeNovoAssembler.SourceForge.Net/
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You have received a copy of the GNU General Public License
along with this program (COPYING).
see <http://www.gnu.org/licenses/>
*/
#ifndef _MessagesHandler
#define _MessagesHandler
#include<vector>
#include<MyAllocator.h>
#include<Message.h>
#include<common_functions.h>
#include<RingAllocator.h>
#include<StaticVector.h>
#include<PendingRequest.h>
using namespace std;
class MessagesHandler{
	int m_ringSize;
	int m_head;
	MPI_Request*m_ring;
	char*m_buffers;
	u64*m_receivedMessages;
	int m_rank;
	int m_size;
	u64*m_allReceivedMessages;
	int*m_allCounts;
public:
	void constructor(int rank,int size);
	void showStats();
	void sendMessages(StaticVector*outbox,int source);
	void receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination);
	u64*getReceivedMessages();
	void addCount(int rank,u64 count);
	void writeStats(const char*file);
	bool isFinished();
	bool isFinished(int rank);
	void freeLeftovers();
};
#endif