Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-24 Thread Sébastien Boisvert
> > -- 
> > M. Sébastien Boisvert
> > Étudiant au doctorat en physiologie-endocrinologie à l'Université Laval
> > Boursier des Instituts de recherche en santé du Canada
> > Équipe du Professeur Jacques Corbeil
> > 
> > Centre de recherche en infectiologie de l'Université Laval
> > Local R-61B
> > 2705, boulevard Laurier
> > Québec, Québec
> > Canada G1V 4G2
> > Téléphone: 418 525  46342
> > 
> > Courriel: s...@boisvert.info
> > Web: http://boisvert.info
> > 
> > "Innovation comes only from an assault on the unknown" -Sydney Brenner
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel




/*
 	Ray
Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program (COPYING).
	If not, see <http://www.gnu.org/licenses/>

*/

// the '<...>' part of each include was stripped by the archive; these are the
// headers this file clearly needs
#include<MessagesHandler.h>
#include<mpi.h>
#include<assert.h>


/*
 * send messages,
 */
void MessagesHandler::sendMessages(StaticVector*outbox,int source){
	for(int i=0;i<(int)outbox->size();i++){
		Message*aMessage=((*outbox)[i]);
		#ifdef ASSERT
		int destination=aMessage->getDestination();
		assert(destination>=0);
		#endif

		MPI_Request request;
		//  MPI_Isend: standard nonblocking send.
		//  (MPI_Issend would be synchronous nonblocking: its Wait/Test completes only when the matching receive is posted.)
		#ifdef ASSERT
		assert(!(aMessage->getBuffer()==NULL && aMessage->getCount()>0));
		#endif
		#ifndef ASSERT
		MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		#else
		int value=MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		assert(value==MPI_SUCCESS);
		#endif

		MPI_Request_free(&request);

		#ifdef ASSERT
		assert(request==MPI_REQUEST_NULL);
		#endif
	}

	outbox->clear();
}



/*	
 * receiveMessages is implemented as recommended by George Bosilca of
the University of Tennessee (via the Open MPI developers mailing list):

From: George Bosilca
Reply-to: Open MPI Developers
To: Open MPI Developers
Subject: Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
List-Post: devel@lists.open-mpi.org
Date: 2010-11-23 18:03:04

If you know the max size of the receives I would take a different approach. 
Post few persistent receives, and manage them in a circular buffer. 
Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your circular buffer. 
Once you use the data related to the receive, just do an MPI_Start on your request.
This approach will minimize the unexpected messages, and drain the connections faster. 
Moreover, at the end it is very easy to MPI_Cancel all the receives not yet matched.

george. 
 */

void MessagesHandler::receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination){
	int flag;
	MPI_Status status;
	MPI_Test(m_ring+m_head,&flag,&status);

	if(flag){
		// get the length of the message
		// it is not necessarily the same as the count posted with MPI_Recv_init;
		// that one was an upper bound
		int tag=status.MPI_TAG;
		int source=status.MPI_SOURCE;
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		u64*filledBuffer=(u64*)m_buffers+m_head*MPI_BTL_SM_EAGER_LIMIT/sizeof(u64);

		// copy it in a safe buffer
		u64*incoming=(u64*)inboxAllocator->allocate(length*sizeof(u64));
		// (reconstructed: the archive stripped the text between '<' and '>')
		for(int i=0;i<length;i++)
			incoming[i]=filledBuffer[i];

		// re-arm the persistent request now that the data is copied out
		MPI_Start(m_ring+m_head);

		Message aMessage(incoming,length,MPI_UNSIGNED_LONG_LONG,destination,tag,source);
		inbox->push_back(aMessage);
		m_receivedMessages[source]++;

		// increment the head
		m_head++;
		if(m_head==m_ringSize){
			m_head=0;
		}
	}
}

void MessagesHandler::showStats(){
	// (body truncated by the archive; it printed the per-source received-message counts)
	for(int i=0;i<m_size;i++)
		cout<<"Rank "<<m_rank<<" received "<<m_receivedMessages[i]<<" messages from rank "<<i<<endl;
}


/*
 	Ray
Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3 of the License.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program (COPYING).
	If not, see <http://www.gnu.org/licenses/>

*/

#ifndef _MessagesHandler
#define _MessagesHandler

// the '<...>' part of each include was stripped by the archive; these are
// the headers the declarations below need
#include<mpi.h>
#include<StaticVector.h>
#include<RingAllocator.h>
#include<Message.h>
#include<common_functions.h> // defines u64
#include<iostream>
#include<fstream>
using namespace std;


class MessagesHandler{
	int m_ringSize;
	int m_head;
	MPI_Request*m_ring;
	char*m_buffers;

	u64*m_receivedMessages;
	int m_rank;
	int m_size;

	u64*m_allReceivedMessages;
	int*m_allCounts;

public:
	void constructor(int rank,int size);
	void showStats();
	void sendMessages(StaticVector*outbox,int source);
	void receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination);
	u64*getReceivedMessages();
	void addCount(int rank,u64 count);
	void writeStats(const char*file);
	bool isFinished();
	bool isFinished(int rank);
	void freeLeftovers();
};

#endif


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-24 Thread Christopher Samuel

On 24/11/10 16:32, Sébastien Boisvert wrote:

> Yes, Ray version 0.1.0 and below are not fully-compliant
> with MPI 2.2.
> 
> I will release Ray 1.0.0 as soon as my regression tests
> are done. That should be tomorrow.

Wonderful, thank you! :-)

- -- 
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computational Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-24 Thread Sébastien Boisvert
Yes, Ray version 0.1.0 and below are not fully-compliant with MPI 2.2.

I will release Ray 1.0.0 as soon as my regression tests are done. That
should be tomorrow.




Le mercredi 24 novembre 2010 à 00:01 -0500, Christopher Samuel a écrit :
> 
> On 24/11/10 09:17, Sébastien Boisvert wrote:
> 
> > As Mr. George Bosilca underlined, since the same test case works for
> > small messages, the problem is about congestion of the FIFOs which leads
> > to resource locking, and as you wrote, deadlock.
> 
> Hmm, we've had a report from someone trying to use Ray on
> our BG/P that they've seen it lock up - is it likely to be
> the same issue ?
> 
> cheers,
> Chris
> - -- 
>  Christopher Samuel - Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computational Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.unimelb.edu.au/
> 




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-24 Thread Christopher Samuel

On 24/11/10 09:17, Sébastien Boisvert wrote:

> As Mr. George Bosilca underlined, since the same test case works for
> small messages, the problem is about congestion of the FIFOs which leads
> to resource locking, and as you wrote, deadlock.

Hmm, we've had a report from someone trying to use Ray on
our BG/P that they've seen it lock up - is it likely to be
the same issue ?

cheers,
Chris
- -- 
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computational Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
Le mardi 23 novembre 2010 à 20:21 -0500, Jeff Squyres (jsquyres) a
écrit :

> Beware that MPI-request-free on active buffers is valid but evil. You CANNOT 
> be sure when the buffer is available for reuse. 


Yes, but as I said, in my program an MPI rank never floods other MPI
ranks.
(I like to think they respect each other haha)

Therefore the evilness is no more -- it is cast away into oblivion.

If I understand correctly, a call to MPI_Request_free does not affect
the void* buffer associated with the request in any way; it just frees
the MPI_Request itself.
For statuses, I use MPI_STATUS_IGNORE, except with my MPI_Iprobe
obviously !

So, in a way, MPI_REQUEST_IGNORE would be cool, but it does not exist.


For buffer availability:

For MPI_Recv and MPI_Isend, buffers are allocated with a
"RingAllocator" (a single malloc at the start of execution).
But it is mostly unnecessary, as most of the time there is only one active send.


Here is an example from my code (14567 lines, yet MPI_Isend and
MPI_Recv each appear only once).
p.s. it is GPLed !



These bits extract a k-mer (a string of k symbols) from a DNA (the code
of life) sequence and send it to the appropriate MPI rank.


void VerticesExtractor::process(...){
if(!m_ready){
return;
}
...
if(isValidDNA(memory)){
VERTEX_TYPE a=wordId(memory);
int rankToFlush=0;
if(*m_reverseComplementVertex==false){
rankToFlush=vertexRank(a,size);

m_disData->m_messagesStock.addAt(rankToFlush,a);
}else{
VERTEX_TYPE
b=complementVertex(a,m_wordSize,m_colorSpaceMode);
rankToFlush=vertexRank(b,size);

m_disData->m_messagesStock.addAt(rankToFlush,b);
}


if(m_disData->m_messagesStock.flush(rankToFlush,1,TAG_VERTICES_DATA,m_outboxAllocator,m_outbox,rank,false)){
m_ready=false;
}

}
...
}


So, if the "toilet" is flushed, the rank sets its slot m_ready to
false.





The following bits select the message handler:

O(1) message handler selection !



void MessageProcessor::processMessage(Message*message){
int tag=message->getTag();
FNMETHOD f=m_methods[tag];
(this->*f)(message);
}





Obviously, it calls something like this:
(note that a reply is sent)



void MessageProcessor::call_TAG_VERTICES_DATA(Message*message){
void*buffer=message->getBuffer();
int count=message->getCount();
VERTEX_TYPE*incoming=(VERTEX_TYPE*)buffer;
int length=count;
// (reconstructed: the archive stripped the text between '<' and '>';
//  tmp's exact type is a guess)
for(int i=0;i<length;i++){
VERTEX_TYPE l=incoming[i];
if((*m_last_value)!=(int)m_subgraph->size() and
(int)m_subgraph->size()%10==0){
(*m_last_value)=m_subgraph->size();
cout<<"Rank "<<rank<<" has "<<m_subgraph->size()<<" vertices"<<endl;
}
SplayNode<VERTEX_TYPE,Vertex>*tmp=m_subgraph->insert(l);
#ifdef ASSERT
assert(tmp!=NULL);
#endif
if(m_subgraph->inserted()){
tmp->getValue()->constructor();
}

tmp->getValue()->setCoverage(tmp->getValue()->getCoverage()+1);
#ifdef ASSERT
assert(tmp->getValue()->getCoverage()>0);
#endif
}
Message
aMessage(NULL,0,MPI_UNSIGNED_LONG_LONG,message->getSource(),TAG_VERTICES_DATA_REPLY,rank);
m_outbox->push_back(aMessage);
}




These bits process the reply:
(all my message handlers are named call_<TAG_NAME>)



void MessageProcessor::call_TAG_VERTICES_DATA_REPLY(Message*message){
m_verticesExtractor->setReadiness();
}


And, finally, here it goes:

void VerticesExtractor::setReadiness(){
m_ready=true;
}



So, you can see that there is no problem with my use of MPI_Isend
followed by MPI_Request_free.


Thanks !


> 
> There was a sentence or paragraph added to MPI 2.2 describing exactly this 
> case. 
> 
> Sent from my PDA. No type good. 
> 
> On Nov 23, 2010, at 5:36 PM, Sébastien Boisvert 
>  wrote:
> 
> > Le mardi 23 novembre 2010 à 17:28 -0500, George Bosilca a écrit :
> >> Sebastien,
> >> 
> >> Using MPI_Isend doesn't guarantee asynchronous progress. As you might be 
> >> aware, the non-blocking communications are guaranteed to progress only 
> >> when the application is in the MPI library. Currently very few MPI 
> >> implementations progress asynchronously (and unfortunately Open MPI is not 
> >> one of them).
> >> 
> > 
> > Regardless, I just need the non-blocking behavior.
> > I call MPI_Request_free just after MPI_Isend, and I use a ring allocator
> > to allocate message buffers.
> > 
> > Message recipients just reply with another message to the source, using
> > a NULL buffer.
> > 
> > The sender waits for the reply before sending the next message.
> > 
> > And it works for assembling bacterial genomes on many MPI ranks:
> > 
> > ...
> > Rank 0: 162 contigs/4576725 nucleotides
> > 
> > Rank 0 reports the elapsed time, Tue Nov 23 0

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
Thank you !

Your support is outstanding !

Le mardi 23 novembre 2010 à 22:25 -0500, Eugene Loh a écrit :
> Jeff Squyres (jsquyres) wrote:
> 
> >Ya, it sounds like we should fix this eager limit help text so that others 
> >aren't misled. We did say "attempt", but that's probably a bit too subtle. 
> >
> >Eugene - iirc: this is in the btl base (or some other central location) 
> >because it's shared between all btls. 
> >  
> >
> The cited text was from the OMPI FAQ ("Tuning" / "sm" section, item 6).  
> I made the change in r1309.
> 
> In ompi/mca/btl/base/btl_base_mca.c, I added the phrase "including 
> header" to both
> 
> "rndv_eager_limit"
> "Size (in bytes, including header) of \"phase 1\" fragment sent for all 
> large messages (must be >= 0 and <= eager_limit)"
> module->btl_rndv_eager_limit
> 
> and
> 
> "eager_limit"
> "Maximum size (in bytes, including header) of \"short\" messages (must 
> be >= 1)."
> module->btl_eager_limit
> 
> but I left
> 
> "max_send_size"
> "Maximum size (in bytes) of a single \"phase 2\" fragment of a long 
> message when using the pipeline protocol (must be >= 1)"
> module->btl_max_send_size
> 
> alone (for some combination of lukewarm reasons).  Changes are in r24085.




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

Jeff Squyres (jsquyres) wrote:

Ya, it sounds like we should fix this eager limit help text so that others aren't misled. We did say "attempt", but that's probably a bit too subtle. 

Eugene - iirc: this is in the btl base (or some other central location) because it's shared between all btls. 
 

The cited text was from the OMPI FAQ ("Tuning" / "sm" section, item 6).  
I made the change in r1309.


In ompi/mca/btl/base/btl_base_mca.c, I added the phrase "including 
header" to both


"rndv_eager_limit"
"Size (in bytes, including header) of \"phase 1\" fragment sent for all 
large messages (must be >= 0 and <= eager_limit)"

module->btl_rndv_eager_limit

and

"eager_limit"
"Maximum size (in bytes, including header) of \"short\" messages (must 
be >= 1)."

module->btl_eager_limit

but I left

"max_send_size"
"Maximum size (in bytes) of a single \"phase 2\" fragment of a long 
message when using the pipeline protocol (must be >= 1)"

module->btl_max_send_size

alone (for some combination of lukewarm reasons).  Changes are in r24085.


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Beware that MPI-request-free on active buffers is valid but evil. You CANNOT be 
sure when the buffer is available for reuse. 

There was a sentence or paragraph added to MPI 2.2 describing exactly this 
case. 

Sent from my PDA. No type good. 

On Nov 23, 2010, at 5:36 PM, Sébastien Boisvert 
 wrote:

> Le mardi 23 novembre 2010 à 17:28 -0500, George Bosilca a écrit :
>> Sebastien,
>> 
>> Using MPI_Isend doesn't guarantee asynchronous progress. As you might be 
>> aware, the non-blocking communications are guaranteed to progress only when 
>> the application is in the MPI library. Currently very few MPI 
>> implementations progress asynchronously (and unfortunately Open MPI is not 
>> one of them).
>> 
> 
> Regardless, I just need the non-blocking behavior.
> I call MPI_Request_free just after MPI_Isend, and I use a ring allocator
> to allocate message buffers.
> 
> Message recipients just reply with another message to the source, using
> a NULL buffer.
> 
> The sender waits for the reply before sending the next message.
> 
> And it works for assembling bacterial genomes on many MPI ranks:
> 
> ...
> Rank 0: 162 contigs/4576725 nucleotides
> 
> Rank 0 reports the elapsed time, Tue Nov 23 01:35:48 2010
> ---> Step: Collection of fusions
>  Elapsed time: 0 seconds
>  Since beginning: 17 minutes, 33 seconds
> 
> Elapsed time for each step, Tue Nov 23 01:35:48 2010
> 
> Beginning of computation: 1 seconds
> Distribution of sequence reads: 7 minutes, 49 seconds
> Distribution of vertices: 19 seconds
> Calculation of coverage distribution: 1 seconds
> Distribution of edges: 29 seconds
> Indexing of sequence reads: 1 seconds
> Computation of seeds: 2 minutes, 33 seconds
> Computation of library sizes: 1 minutes, 47 seconds
> Extension of seeds: 3 minutes, 34 seconds
> Computation of fusions: 59 seconds
> Collection of fusions: 0 seconds
> Completion of the assembly: 17 minutes, 33 seconds
> 
> Rank 0 wrote Ecoli-THEONE.CoverageDistribution.txt
> Rank 0 wrote Ecoli-THEONE.fasta
> Rank 0 wrote Ecoli-THEONE.ReceivedMessages.txt
> Rank 0 wrote Ecoli-THEONE.Library0.txt
> Rank 0 wrote Ecoli-THEONE.Library1.txt
> 
> Au revoir !
> 
> 
>>  george.
>> 
>> On Nov 23, 2010, at 17:17 , Sébastien Boisvert wrote:
>> 
>>> I now use MPI_Isend, so the problem is no more.
>> 
>> 
> 



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Ya, it sounds like we should fix this eager limit help text so that others 
aren't misled. We did say "attempt", but that's probably a bit too subtle. 

Eugene - iirc: this is in the btl base (or some other central location) because 
it's shared between all btls. 

Sent from my PDA. No type good. 

On Nov 23, 2010, at 5:54 PM, "Eugene Loh"  wrote:

> George Bosilca wrote:
> 
>> Moreover, eager send can improve performance if and only if the matching 
>> receives are already posted on the peer. If not, the data will become 
>> unexpected, and there will be one additional memcpy.
>> 
> I don't think the first sentence is strictly true.  There is a cost 
> associated with eager messages, but whether there is an overall improvement 
> or not depends on lots of factors.



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
Whoa ! Thank, I will try that.


Le mardi 23 novembre 2010 à 18:03 -0500, George Bosilca a écrit :
> If you know the max size of the receives I would take a different approach. 

"max size" is the maximum buffer size required, right ?
in my case, it is 4096.

> Post few persistent receives, and manage them in a circular buffer. 
> Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your 
> circular buffer. Once you use the data related to the receive, just do an 
> MPI_Start on your request.
> 
> This approach will minimize the unexpected messages, and drain the 
> connections faster. Moreover, at the end it is very easy to MPI_Cancel all 
> the receives not yet matched.

Looks very interesting, indeed !
https://computing.llnl.gov/tutorials/mpi_performance/#Persistent

Wow, that is really an in-depth suggestion that I will surely try !

Thank you, your answers are very appreciated !

> 
>   george.
> 
> On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:
> 
> > Le mardi 23 novembre 2010 à 17:38 -0500, George Bosilca a écrit :
> >> The eager size reported by ompi_info includes the Open MPI internal 
> >> headers. They are anywhere between 20 and 64 bytes long (potentially more 
> >> for some particular networks), so what Eugene suggested was a safe 
> >> boundary.
> > 
> > I see.
> > 
> >> 
> >> Moreover, eager send can improve performance if and only if the matching 
> >> receives are already posted on the peer. If not, the data will become 
> >> unexpected, and there will be one additional memcpy.
> > 
> > So it won't improve performance in my application (Ray,
> > http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> > incoming messages, which means any receive (MPI_Recv) is never posted
> > before any send (MPI_Isend).
> > 
> > Thanks, this thread is very informative for me !
> > 
> >> 
> >>  george.
> >> 
> >> On Nov 23, 2010, at 17:29 , Sébastien Boisvert wrote:
> >> 
> >>> Le mardi 23 novembre 2010 à 16:07 -0500, Eugene Loh a écrit :
>  Sébastien Boisvert wrote:
>  
> > Now I can describe the cases.
> > 
> > 
>  The test cases can all be explained by the test requiring eager messages 
>  (something that test4096.cpp does not require).
>  
> > Case 1: 30 MPI ranks, message size is 4096 bytes
> > 
> > File: mpirun-np-30-Program-4096.txt
> > Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> > 
> > 
>  4096 is rendezvous.  For eager, try 4000 or lower.
> >>> 
> >>> According to ompi_info, the threshold is 4096, not 4000, right ?
> >>> 
> >>> (Open-MPI 1.4.3)
> >>> [sboisver12@colosse1 ~]$ ompi_info -a|less
> >>>MCA btl: parameter "btl_sm_eager_limit" (current value:
> >>> "4096", data source: default value)
> >>> Maximum size (in bytes) of "short" messages
> >>> (must be >= 1).
> >>> 
> >>> 
> >>> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
> >>> that is, a sender attempts to write its entire message to shared buffers
> >>> without waiting for a receiver to be ready. Above this size, a sender
> >>> will only write the first part of a message, then wait for the receiver
> >>> to acknowledge its ready before continuing. Eager sends can improve
> >>> performance by decoupling senders from receivers."
> >>> 
> >>> 
> >>> 
> >>> source:
> >>> http://www.open-mpi.org/faq/?category=sm#more-sm
> >>> 
> >>> 
> >>> It should say "Below this size or equal to this size" instead of "Below
> >>> this size" as ompi_info says. ;)
> >>> 
> >>> 
> >>> 
> >>> 
> >>> As Mr. George Bosilca put it:
> >>> 
> >>> "__should__ is not correct, __might__ is a better verb to describe the
> >>> most "common" behavior for small messages. The problem comes from the
> >>> fact that in each communicator the FIFO ordering is required by the MPI
> >>> standard. As soon as there is any congestion, MPI_Send will block even
> >>> for small messages (and this independent on the underlying network)
> >>> until all the pending packets have been delivered."
> >>> 
> >>> source:
> >>> http://www.open-mpi.org/community/lists/devel/2010/11/8696.php
> >>> 
> >>> 
> >>> 
>  
> > Case 2: 30 MPI ranks, message size is 1 byte
> > 
> > File: mpirun-np-30-Program-1.txt.gz
> > Outcome: It runs just fine.
> > 
> > 
>  1 byte is eager.
> >>> 
> >>> I agree.
> >>> 
>  
> > Case 3: 2 MPI ranks, message size is 4096 bytes
> > 
> > File: mpirun-np-2-Program-4096.txt
> > Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> > 
> > 
>  Same as Case 1.
>  
> > Case 4: 30 MPI ranks, message size if 4096 bytes, shared memory is
> > disabled
> > 
> > File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> > Outcome: It runs just fine.
> > 
> > 
>  Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
>  messages are eager.
> >>> 
> >>> I agree.
> >>> 
>  

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread George Bosilca
If you know the max size of the receives I would take a different approach. 
Post few persistent receives, and manage them in a circular buffer. Instead of 
doing an MPI_Iprobe, use MPI_Test on the current head of your circular buffer. 
Once you use the data related to the receive, just do an MPI_Start on your 
request.

This approach will minimize the unexpected messages, and drain the connections 
faster. Moreover, at the end it is very easy to MPI_Cancel all the receives not 
yet matched.

  george.

On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:

> Le mardi 23 novembre 2010 à 17:38 -0500, George Bosilca a écrit :
>> The eager size reported by ompi_info includes the Open MPI internal headers. 
>> They are anywhere between 20 and 64 bytes long (potentially more for some 
>> particular networks), so what Eugene suggested was a safe boundary.
> 
> I see.
> 
>> 
>> Moreover, eager send can improve performance if and only if the matching 
>> receives are already posted on the peer. If not, the data will become 
>> unexpected, and there will be one additional memcpy.
> 
> So it won't improve performance in my application (Ray,
> http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> incoming messages, which means any receive (MPI_Recv) is never posted
> before any send (MPI_Isend).
> 
> Thanks, this thread is very informative for me !
> 
>> 
>>  george.
>> 
>> On Nov 23, 2010, at 17:29 , Sébastien Boisvert wrote:
>> 
>>> Le mardi 23 novembre 2010 à 16:07 -0500, Eugene Loh a écrit :
 Sébastien Boisvert wrote:
 
> Now I can describe the cases.
> 
> 
 The test cases can all be explained by the test requiring eager messages 
 (something that test4096.cpp does not require).
 
> Case 1: 30 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-30-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
 4096 is rendezvous.  For eager, try 4000 or lower.
>>> 
>>> According to ompi_info, the threshold is 4096, not 4000, right ?
>>> 
>>> (Open-MPI 1.4.3)
>>> [sboisver12@colosse1 ~]$ ompi_info -a|less
>>>MCA btl: parameter "btl_sm_eager_limit" (current value:
>>> "4096", data source: default value)
>>> Maximum size (in bytes) of "short" messages
>>> (must be >= 1).
>>> 
>>> 
>>> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
>>> that is, a sender attempts to write its entire message to shared buffers
>>> without waiting for a receiver to be ready. Above this size, a sender
>>> will only write the first part of a message, then wait for the receiver
>>> to acknowledge its ready before continuing. Eager sends can improve
>>> performance by decoupling senders from receivers."
>>> 
>>> 
>>> 
>>> source:
>>> http://www.open-mpi.org/faq/?category=sm#more-sm
>>> 
>>> 
>>> It should say "Below this size or equal to this size" instead of "Below
>>> this size" as ompi_info says. ;)
>>> 
>>> 
>>> 
>>> 
>>> As Mr. George Bosilca put it:
>>> 
>>> "__should__ is not correct, __might__ is a better verb to describe the
>>> most "common" behavior for small messages. The problem comes from the
>>> fact that in each communicator the FIFO ordering is required by the MPI
>>> standard. As soon as there is any congestion, MPI_Send will block even
>>> for small messages (and this independent on the underlying network)
> >>> until all the pending packets have been delivered."
>>> 
>>> source:
>>> http://www.open-mpi.org/community/lists/devel/2010/11/8696.php
>>> 
>>> 
>>> 
 
> Case 2: 30 MPI ranks, message size is 1 byte
> 
> File: mpirun-np-30-Program-1.txt.gz
> Outcome: It runs just fine.
> 
> 
 1 byte is eager.
>>> 
>>> I agree.
>>> 
 
> Case 3: 2 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-2-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
 Same as Case 1.
 
> Case 4: 30 MPI ranks, message size if 4096 bytes, shared memory is
> disabled
> 
> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> Outcome: It runs just fine.
> 
> 
 Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
 messages are eager.
>>> 
>>> I agree.
>>> 
 
 

Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

George Bosilca wrote:


Moreover, eager send can improve performance if and only if the matching 
receives are already posted on the peer. If not, the data will become 
unexpected, and there will be one additional memcpy.

I don't think the first sentence is strictly true.  There is a cost 
associated with eager messages, but whether there is an overall 
improvement or not depends on lots of factors.


Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh




Sébastien Boisvert wrote:

  Le mardi 23 novembre 2010 à 16:07 -0500, Eugene Loh a écrit :
  
  
Sébastien Boisvert wrote:


  Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
  

4096 is rendezvous.  For eager, try 4000 or lower.

  
  According to ompi_info, the threshold is 4096, not 4000, right ?
  

Right.

  "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
that is, a sender attempts to write its entire message to shared buffers
without waiting for a receiver to be ready. Above this size, a sender
will only write the first part of a message, then wait for the receiver
to acknowledge its ready before continuing. Eager sends can improve
performance by decoupling senders from receivers."

source:
http://www.open-mpi.org/faq/?category=sm#more-sm

It should say "Below this size or equal to this size" instead of "Below
this size" as ompi_info says. ;)
  

Well, I guess it should say:

If message data plus header information fits within this limit, the
message is sent "eagerly"...

I guess I'll fix it.  (I suspect I wrote it in the first place.  Sigh.)




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
On Tuesday, November 23, 2010 at 17:38 -0500, George Bosilca wrote:
> The eager size reported by ompi_info includes the Open MPI internal headers. 
> They are anywhere between 20 and 64 bytes long (potentially more for some 
> particular networks), so what Eugene suggested was a safe boundary.

I see.

> 
> Moreover, eager send can improve performance if and only if the matching 
> receives are already posted on the peer. If not, the data will become 
> unexpected, and there will be one additional memcpy.

So it won't improve performance in my application (Ray,
http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
incoming messages, which means any receive (MPI_Recv) is never posted
before any send (MPI_Isend).

Thanks, this thread is very informative for me !

> 
>   george.

-- 
M. Sébastien Boisvert
Étudiant au doctorat en physiologie-endocrinologie à l'Université Laval
Boursier des Instituts de recherche en santé du Canada
Équipe du Professeur Jacques Corbeil

Centre de recherche en infectiologie de l'Université Laval
Local R-61B
2705, boulevard Laurier
Québec, Québec
Canada G1V 4G2
Téléphone: 418 525  46342

Courriel: s...@boisvert.info
Web: http://boisvert.info

"Innovation comes only from an assault on the unknown" -Sydney Brenner



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread George Bosilca
The eager size reported by ompi_info includes the Open MPI internal headers. 
They are anywhere between 20 and 64 bytes long (potentially more for some 
particular networks), so what Eugene suggested was a safe boundary.

Moreover, eager send can improve performance if and only if the matching 
receives are already posted on the peer. If not, the data will become 
unexpected, and there will be one additional memcpy.

  george.





Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
On Tuesday, November 23, 2010 at 17:28 -0500, George Bosilca wrote:
> Sebastien,
> 
> Using MPI_Isend doesn't guarantee asynchronous progress. As you might be 
> aware, the non-blocking communications are guaranteed to progress only when 
> the application is in the MPI library. Currently very few MPI implementations 
> progress asynchronously (and unfortunately Open MPI is not one of them).
> 

Regardless, I just need the non-blocking behavior.
I call MPI_Request_free just after MPI_Isend, and I use a ring allocator
to allocate message buffers.

Message recipients just reply with another message to the source, using
a NULL buffer.

The sender waits for the reply before sending the next message.

And it works for assembling bacterial genomes on many MPI ranks:

...
Rank 0: 162 contigs/4576725 nucleotides

Rank 0 reports the elapsed time, Tue Nov 23 01:35:48 2010
 ---> Step: Collection of fusions
  Elapsed time: 0 seconds
  Since beginning: 17 minutes, 33 seconds

Elapsed time for each step, Tue Nov 23 01:35:48 2010

 Beginning of computation: 1 seconds
 Distribution of sequence reads: 7 minutes, 49 seconds
 Distribution of vertices: 19 seconds
 Calculation of coverage distribution: 1 seconds
 Distribution of edges: 29 seconds
 Indexing of sequence reads: 1 seconds
 Computation of seeds: 2 minutes, 33 seconds
 Computation of library sizes: 1 minutes, 47 seconds
 Extension of seeds: 3 minutes, 34 seconds
 Computation of fusions: 59 seconds
 Collection of fusions: 0 seconds
 Completion of the assembly: 17 minutes, 33 seconds

Rank 0 wrote Ecoli-THEONE.CoverageDistribution.txt
Rank 0 wrote Ecoli-THEONE.fasta
Rank 0 wrote Ecoli-THEONE.ReceivedMessages.txt
Rank 0 wrote Ecoli-THEONE.Library0.txt
Rank 0 wrote Ecoli-THEONE.Library1.txt

Au revoir !


>   george.
> 
> On Nov 23, 2010, at 17:17 , Sébastien Boisvert wrote:
> 
> > I now use MPI_Isend, so the problem is no more.
> 




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
On Tuesday, November 23, 2010 at 16:07 -0500, Eugene Loh wrote:
> Sébastien Boisvert wrote:
> 
> >Now I can describe the cases.
> >  
> >
> The test cases can all be explained by the test requiring eager messages 
> (something that test4096.cpp does not require).
> 
> >Case 1: 30 MPI ranks, message size is 4096 bytes
> >
> >File: mpirun-np-30-Program-4096.txt
> >Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >  
> >
> 4096 is rendezvous.  For eager, try 4000 or lower.

According to ompi_info, the threshold is 4096, not 4000, right ?

(Open-MPI 1.4.3)
[sboisver12@colosse1 ~]$ ompi_info -a|less
 MCA btl: parameter "btl_sm_eager_limit" (current value:
"4096", data source: default value)
  Maximum size (in bytes) of "short" messages
(must be >= 1).


"btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
that is, a sender attempts to write its entire message to shared buffers
without waiting for a receiver to be ready. Above this size, a sender
will only write the first part of a message, then wait for the receiver
to acknowledge its ready before continuing. Eager sends can improve
performance by decoupling senders from receivers."



source:
http://www.open-mpi.org/faq/?category=sm#more-sm


It should say "Below this size or equal to this size" instead of "Below
this size" as ompi_info says. ;)




As Mr. George Bosilca put it:

"__should__ is not correct, __might__ is a better verb to describe the
most "common" behavior for small messages. The problem comes from the
fact that in each communicator the FIFO ordering is required by the MPI
standard. As soon as there is any congestion, MPI_Send will block even
for small messages (and this is independent of the underlying network)
until all the pending packets have been delivered."

source:
http://www.open-mpi.org/community/lists/devel/2010/11/8696.php



> 
> >Case 2: 30 MPI ranks, message size is 1 byte
> >
> >File: mpirun-np-30-Program-1.txt.gz
> >Outcome: It runs just fine.
> >  
> >
> 1 byte is eager.

I agree.

> 
> >Case 3: 2 MPI ranks, message size is 4096 bytes
> >
> >File: mpirun-np-2-Program-4096.txt
> >Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> >  
> >
> Same as Case 1.
> 
> >Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
> >disabled
> >
> >File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> >Outcome: It runs just fine.
> >  
> >
> Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
> messages are eager.

I agree.






Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread George Bosilca
Sebastien,

Using MPI_Isend doesn't guarantee asynchronous progress. As you might be aware, 
the non-blocking communications are guaranteed to progress only when the 
application is in the MPI library. Currently very few MPI implementations 
progress asynchronously (and unfortunately Open MPI is not one of them).

  george.

On Nov 23, 2010, at 17:17 , Sébastien Boisvert wrote:

> I now use MPI_Isend, so the problem is no more.




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread George Bosilca
No message is eager if there is congestion. 64K is eager for TCP only if the 
kernel buffer has enough room to hold the 64k. For SM it only works if there 
are ready buffers. In fact, eager is an optimization of the MPI library, not
something users should rely on, nor base their application's behavior on.

In the MPI 2.2 standard there is a specific paragraph that advises users not to do this.

  george.


On Nov 23, 2010, at 16:07 , Eugene Loh wrote:

> Sébastien Boisvert wrote:
> 
>> Now I can describe the cases.
>> 
> The test cases can all be explained by the test requiring eager messages 
> (something that test4096.cpp does not require).
> 
>> Case 1: 30 MPI ranks, message size is 4096 bytes
>> 
>> File: mpirun-np-30-Program-4096.txt
>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>> 
> 4096 is rendezvous.  For eager, try 4000 or lower.
> 
>> Case 2: 30 MPI ranks, message size is 1 byte
>> 
>> File: mpirun-np-30-Program-1.txt.gz
>> Outcome: It runs just fine.
>> 
> 1 byte is eager.
> 
>> Case 3: 2 MPI ranks, message size is 4096 bytes
>> 
>> File: mpirun-np-2-Program-4096.txt
>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>> 
> Same as Case 1.
> 
>> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
>> disabled
>> 
>> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
>> Outcome: It runs just fine.
>> 
> Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
> messages are eager.
> 




Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Sébastien Boisvert
On Tuesday, November 23, 2010 at 15:17 -0500, Jeff Squyres (jsquyres) wrote:
> Sorry for the delay in replying - many of us were at SC last week. 

Nothing to be sorry for !

> 
> Admittedly, I'm looking at your code on a PDA, so I might be missing some 
> things. But I have 2 q's:

You got it all right, I assure you !

> 
> 1. Your send routine doesn't seem to protect against sending to yourself. Correct?
> 

Correct. (my error !)



The code is not compliant with MPI 2.2 -- I realized that afterward.

see http://www.open-mpi.org/community/lists/devel/2010/11/8689.php



Also, Mr. George Bosilca pointed that out too.

see http://www.open-mpi.org/community/lists/devel/2010/11/8696.php


> 2. You're not using nonblocking sends, which, if I understand your code right,
> can lead to deadlock. Right?  E.g., proc A sends to proc B and blocks until B
> receives. But B is blocking, waiting for its send completion, etc.

Right.

As Mr. George Bosilca underlined, since the same test case works for
small messages, the problem is about congestion of the FIFOs which leads
to resource locking, and as you wrote, deadlock.

http://www.open-mpi.org/community/lists/devel/2010/11/8696.php


I now use MPI_Isend, so the problem is no more.

> 
> I think with your random destinations (which may even be yourself, in which 
> case the blocking send will never complete because you didn't prepost a 
> nonblocking receive) and blocking sends, you can end up with deadlock.


Yes, you are right !

> 
> Sent from my PDA. No type good. 


Sent from my Ubuntu. Typing is good. ;)



Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

Sébastien Boisvert wrote:


Now I can describe the cases.
 

The test cases can all be explained by the test requiring eager messages 
(something that test4096.cpp does not require).



Case 1: 30 MPI ranks, message size is 4096 bytes

File: mpirun-np-30-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
 


4096 is rendezvous.  For eager, try 4000 or lower.


Case 2: 30 MPI ranks, message size is 1 byte

File: mpirun-np-30-Program-1.txt.gz
Outcome: It runs just fine.
 


1 byte is eager.


Case 3: 2 MPI ranks, message size is 4096 bytes

File: mpirun-np-2-Program-4096.txt
Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
 


Same as Case 1.


Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
disabled

File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
Outcome: It runs just fine.
 

Eager limit for TCP is 65536 (perhaps less some overhead).  So, these 
messages are eager.





Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Eugene Loh

To add to Jeff's comments:

Sébastien Boisvert wrote:


The reason is that I am developing an MPI-based software, and I use
Open-MPI as it is the only implementation I am aware of that sends
messages eagerly (powerful feature, that is).
 

As wonderful as OMPI is, I am fairly sure other MPI implementations also 
support eager message passing.  That is, there is a capability for a 
sender to hand message data over to the MPI implementation, freeing the 
user send buffer and allowing an MPI_Send() call to complete, without 
the message reaching the receiver or the receiver being ready.



Each byte transfer layer has its default limit to send eagerly a
message. With shared memory (sm), the value is 4096 bytes. At least it
is according to ompi_info.
 

Yes.  I think that 4096 bytes can be a little tricky... it may include 
some header information.  So, the amount of user data that could be sent 
would be a little bit less... e.g., 4,000 bytes or so.



To verify this limit, I implemented a very simple test. The source code
is test4096.cpp, which basically just sends a single message of 4096
bytes from one rank to another (rank 1 to 0).
 

I don't think the test says much at all.  It has one process post an 
MPI_Send and another post an MPI_Recv.  Such a test should complete 
under a very wide range of conditions.


Here is perhaps a better test:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
 int me, np;
 char buf[N];

 MPI_Init(&argc,&argv);
 MPI_Comm_rank(MPI_COMM_WORLD,&me);
 MPI_Comm_size(MPI_COMM_WORLD,&np);
 MPI_Send(buf,N,MPI_BYTE,1-me,343,MPI_COMM_WORLD);
 MPI_Recv(buf,N,MPI_BYTE,1-me,343,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
 printf("%d of %d done\n", me, np);
 MPI_Finalize();

 return 0;
}

Compile with the preprocessor symbol N defined to, say, 64.  Run for 
--np 2.  Each process will try to send.  The code will complete for 
short, eager messages.  If the messages are long, nothing is sent 
eagerly and both processes stay hung in their sends.  Bump N up slowly.  
For N=4096, the code hangs.  For N slightly less -- say, 4000 -- it runs.





Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang

2010-11-23 Thread Jeff Squyres (jsquyres)
Sorry for the delay in replying - many of us were at SC last week. 

Admittedly, I'm looking at your code on a PDA, so I might be missing some 
things. But I have 2 q's:

1. Your send routine doesn't seem to protect against sending to yourself. Correct?

2. You're not using nonblocking sends, which, if I understand your code right,
can lead to deadlock. Right?  E.g., proc A sends to proc B and blocks until B
receives. But B is blocking, waiting for its send completion, etc.

I think with your random destinations (which may even be yourself, in which 
case the blocking send will never complete because you didn't prepost a 
nonblocking receive) and blocking sends, you can end up with deadlock.

Sent from my PDA. No type good. 

On Nov 16, 2010, at 5:21 PM, Sébastien Boisvert 
 wrote:

> Dear awesome community,
> 
> 
> Over the last months, I closely followed the evolution of bug 2043,
> entitled 'sm BTL hang with GCC 4.4.x'.
> 
> https://svn.open-mpi.org/trac/ompi/ticket/2043
> 
> The reason is that I am developing an MPI-based software, and I use
> Open-MPI as it is the only implementation I am aware of that sends
> messages eagerly (powerful feature, that is).
> 
> http://denovoassembler.sourceforge.net/
> 
> I believe that this very pesky bug remains in Open-MPI 1.4.3, and
> enclosed to this communication are scientific proofs of my claim, or at
> least I think they are ;).
> 
> 
> Each byte transfer layer has its default limit to send eagerly a
> message. With shared memory (sm), the value is 4096 bytes. At least it
> is according to ompi_info.
> 
> 
> To verify this limit, I implemented a very simple test. The source code
> is test4096.cpp, which basically just sends a single message of 4096
> bytes from one rank to another (rank 1 to 0).
> 
> The test was conclusive: the limit is 4096 bytes (see
> mpirun-np-2-Simple.txt).
> 
> 
> 
> Then, I implemented a simple program (103 lines) that makes Open-MPI
> 1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each
> rank sends a message to a randomly-selected destination. A rank polls for
> new messages with MPI_Iprobe. Each rank prints the current time each
> second for 30 seconds. Using this simple code, I ran 4 test cases,
> each with a different outcome (use the Makefile if you want to reproduce
> the bug).
> 
> Before I describe these cases, I will describe the testing hardware. 
> 
> I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz). 
> The computer has 128 GB of physical memory (see
> cat-proc-meminfo.txt.gz).
> It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see
> dmesg.txt.gz & uname.txt).
> Default kernel parameters are utilized at runtime (see
> sudo-sysctl-a.txt.gz).
> 
> The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see
> g++--version.txt).
> 
> 
> I compiled Open-MPI 1.4.3 myself (see config.out.gz, make.out.gz,
> make-install.out.gz).
> Finally, I use Open-MPI 1.4.3 with defaults (see ompi_info.txt.gz).
> 
> 
> 
> 
> Now I can describe the cases.
> 
> 
> Case 1: 30 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-30-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
> 
> 
> Case 2: 30 MPI ranks, message size is 1 byte
> 
> File: mpirun-np-30-Program-1.txt.gz
> Outcome: It runs just fine.
> 
> 
> 
> 
> Case 3: 2 MPI ranks, message size is 4096 bytes
> 
> File: mpirun-np-2-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
> 
> 
> 
> 
> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is
> disabled
> 
> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> Outcome: It runs just fine.
> 
> 
> 
> 
> 
> A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.
> 
> 
> 
> 
> Thank you !
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel