Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
/*
 	Ray
	Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, version 3 of the License.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You have received a copy of the GNU General Public License
	along with this program (COPYING).
*/

/* the #include targets below were stripped by the list archiver */
#include
#include
#include

/*
 * send messages,
 */
void MessagesHandler::sendMessages(StaticVector*outbox,int source){
	for(int i=0;i<(int)outbox->size();i++){
		Message*aMessage=((*outbox)[i]);
		#ifdef ASSERT
		int destination=aMessage->getDestination();
		assert(destination>=0);
		#endif

		MPI_Request request;
		// MPI_Issend
		//     Synchronous nonblocking. Note that a Wait/Test will complete
		//     only when the matching receive is posted
		#ifdef ASSERT
		assert(!(aMessage->getBuffer()==NULL && aMessage->getCount()>0));
		#endif

		#ifndef ASSERT
		MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		#else
		int value=MPI_Isend(aMessage->getBuffer(),aMessage->getCount(),aMessage->getMPIDatatype(),aMessage->getDestination(),aMessage->getTag(),MPI_COMM_WORLD,&request);
		assert(value==MPI_SUCCESS);
		#endif

		MPI_Request_free(&request);
		#ifdef ASSERT
		assert(request==MPI_REQUEST_NULL);
		#endif
	}
	outbox->clear();
}

/*
 * receiveMessages is implemented as recommended by Mr. George Bosilca of the
 * University of Tennessee (via the Open MPI mailing list):
 *
 * From: George Bosilca
 * Reply-to: Open MPI Developers
 * To: Open MPI Developers
 * Subject: Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
 * Date: 2010-11-23 18:03:04
 *
 * If you know the max size of the receives I would take a different
 * approach. Post few persistent receives, and manage them in a circular
 * buffer. Instead of doing an MPI_Iprobe, use MPI_Test on the current head
 * of your circular buffer. Once you use the data related to the receive,
 * just do an MPI_Start on your request.
 *
 * This approach will minimize the unexpected messages, and drain the
 * connections faster. Moreover, at the end it is very easy to MPI_Cancel
 * all the receives not yet matched.
 *
 * george.
*/
void MessagesHandler::receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination){
	int flag;
	MPI_Status status;
	MPI_Test(m_ring+m_head,&flag,&status);
	if(flag){
		// get the length of the message
		// it is not necessarily the same as the one posted with MPI_Recv_init
		// that one was a lower bound
		int tag=status.MPI_TAG;
		int source=status.MPI_SOURCE;
		int length;
		MPI_Get_count(&status,MPI_UNSIGNED_LONG_LONG,&length);
		u64*filledBuffer=(u64*)m_buffers+m_head*MPI_BTL_SM_EAGER_LIMIT/sizeof(u64);

		// copy it in a safe buffer
		u64*incoming=(u64*)inboxAllocator->allocate(length*sizeof(u64));
		for(int i=0;i<length;i++)
			incoming[i]=filledBuffer[i];

		// (some lines were stripped by the list archiver here: a Message is
		// built from the copied buffer, then:)
		inbox->push_back(aMessage);
		m_receivedMessages[source]++;

		// increment the head
		m_head++;
		if(m_head==m_ringSize){
			m_head=0;
		}
	}
}

void MessagesHandler::showStats(){
	cout<<"Rank "<< // (the rest of this function was truncated in the archive)

/*
 	Ray
	Copyright (C) 2010  Sébastien Boisvert

	http://DeNovoAssembler.SourceForge.Net/

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, version 3 of the License.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You have received a copy of the GNU General Public License
	along with this program (COPYING).
	see <http://www.gnu.org/licenses/>
*/

#ifndef _MessagesHandler
#define _MessagesHandler

/* the #include targets below were stripped by the list archiver */
#include
#include
#include
#include
#include
#include
#include

using namespace std;

class MessagesHandler{
	int m_ringSize;
	int m_head;
	MPI_Request*m_ring;
	char*m_buffers;
	u64*m_receivedMessages;
	int m_rank;
	int m_size;
	u64*m_allReceivedMessages;
	int*m_allCounts;
public:
	void constructor(int rank,int size);
	void showStats();
	void sendMessages(StaticVector*outbox,int source);
	void receiveMessages(StaticVector*inbox,RingAllocator*inboxAllocator,int destination);
	u64*getReceivedMessages();
	void addCount(int rank,u64 count);
	void writeStats(const char*file);
	bool isFinished();
	bool isFinished(int rank);
	void freeLeftovers();
};

#endif
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On 24/11/10 16:32, Sébastien Boisvert wrote:

> Yes, Ray version 0.1.0 and below are not fully compliant
> with MPI 2.2.
>
> I will release Ray 1.0.0 as soon as my regression tests
> are done. That should be tomorrow.

Wonderful, thank you! :-)

-- 
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Yes, Ray version 0.1.0 and below are not fully compliant with MPI 2.2.

I will release Ray 1.0.0 as soon as my regression tests are done. That
should be tomorrow.

On Wednesday, November 24, 2010, at 00:01 -0500, Christopher Samuel wrote:

> On 24/11/10 09:17, Sébastien Boisvert wrote:
>
> > As Mr. George Bosilca underlined, since the same test case works for
> > small messages, the problem is about congestion of the FIFOs, which
> > leads to resource locking and, as you wrote, deadlock.
>
> Hmm, we've had a report from someone trying to use Ray on
> our BG/P that they've seen it lock up - is it likely to be
> the same issue?
>
> cheers,
> Chris
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
M. Sébastien Boisvert
Doctoral student in physiology-endocrinology at Université Laval
Fellow of the Canadian Institutes of Health Research
Team of Professor Jacques Corbeil

Centre de recherche en infectiologie de l'Université Laval
Room R-61B
2705, boulevard Laurier
Québec, Québec
Canada G1V 4G2
Phone: 418 525 46342

Email: s...@boisvert.info
Web: http://boisvert.info

"Innovation comes only from an assault on the unknown" -Sydney Brenner
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On 24/11/10 09:17, Sébastien Boisvert wrote:

> As Mr. George Bosilca underlined, since the same test case works for
> small messages, the problem is about congestion of the FIFOs, which leads
> to resource locking and, as you wrote, deadlock.

Hmm, we've had a report from someone trying to use Ray on our BG/P that
they've seen it lock up - is it likely to be the same issue?

cheers,
Chris
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On Tuesday, November 23, 2010, at 20:21 -0500, Jeff Squyres (jsquyres) wrote:

> Beware that MPI-request-free on active buffers is valid but evil. You
> CANNOT be sure when the buffer is available for reuse.

Yes, but as I said, in my program an MPI rank never floods other MPI ranks.
(I like to think they respect each other, haha.) Therefore the evilness is
gone; it is cast away into oblivion.

If I understand correctly, a call to MPI_Request_free does not affect in
any way the void* buffer associated with the request; it just frees the
memory of the MPI_Request.

For statuses, I use MPI_STATUS_IGNORE, except with my MPI_Iprobe,
obviously! So, in a way, MPI_REQUEST_IGNORE would be cool, but it does not
exist.

For buffer availability: for MPI_Recv and MPI_Isend, buffers are allocated
with a "RingAllocator" (one malloc at the start of execution). But it
hardly matters, as most of the time there is only one active send.

Here is an example of my code (14567 lines in total, and yet MPI_Isend and
MPI_Recv each appear only once).

p.s. it is GPLed!

These bits extract a k-mer (a string of k symbols) from a DNA (the code of
life) sequence and send it to the right MPI rank:

void VerticesExtractor::process(...){
	if(!m_ready){
		return;
	}
	...
	if(isValidDNA(memory)){
		VERTEX_TYPE a=wordId(memory);
		int rankToFlush=0;
		if(*m_reverseComplementVertex==false){
			rankToFlush=vertexRank(a,size);
			m_disData->m_messagesStock.addAt(rankToFlush,a);
		}else{
			VERTEX_TYPE b=complementVertex(a,m_wordSize,m_colorSpaceMode);
			rankToFlush=vertexRank(b,size);
			m_disData->m_messagesStock.addAt(rankToFlush,b);
		}
		if(m_disData->m_messagesStock.flush(rankToFlush,1,TAG_VERTICES_DATA,m_outboxAllocator,m_outbox,rank,false)){
			m_ready=false;
		}
	}
	...
}

So, if the "toilet" is flushed, the rank sets its slot called m_ready to
false.

The following bits select the message handler. O(1) message handler
selection!
void MessageProcessor::processMessage(Message*message){
	int tag=message->getTag();
	FNMETHOD f=m_methods[tag];
	(this->*f)(message);
}

Obviously, it calls something like this (note that a reply is sent):

void MessageProcessor::call_TAG_VERTICES_DATA(Message*message){
	void*buffer=message->getBuffer();
	int count=message->getCount();
	VERTEX_TYPE*incoming=(VERTEX_TYPE*)buffer;
	int length=count;
	for(int i=0;i<length;i++){
		// (parts of this loop were stripped by the list archiver; the
		// surviving fragments show a periodic progress report and the
		// insertion into the subgraph:)
		//     ...m_subgraph->size() and (int)m_subgraph->size()%10==0){
		//         (*m_last_value)=m_subgraph->size();
		//         cout<<"Rank "...
		//     ...tmp=m_subgraph->insert(l);
		#ifdef ASSERT
		assert(tmp!=NULL);
		#endif
		if(m_subgraph->inserted()){
			tmp->getValue()->constructor();
		}
		tmp->getValue()->setCoverage(tmp->getValue()->getCoverage()+1);
		#ifdef ASSERT
		assert(tmp->getValue()->getCoverage()>0);
		#endif
	}
	Message aMessage(NULL,0,MPI_UNSIGNED_LONG_LONG,message->getSource(),TAG_VERTICES_DATA_REPLY,rank);
	m_outbox->push_back(aMessage);
}

These bits process the reply (all my message handlers are called call_):

void MessageProcessor::call_TAG_VERTICES_DATA_REPLY(Message*message){
	m_verticesExtractor->setReadiness();
}

And, finally, here it goes:

void VerticesExtractor::setReadiness(){
	m_ready=true;
}

So, you can see that there is no problem with my use of MPI_Isend followed
by MPI_Request_free.

Thanks!

> There was a sentence or paragraph added to MPI 2.2 describing exactly
> this case.
>
> Sent from my PDA. No type good.
>
> On Nov 23, 2010, at 5:36 PM, Sébastien Boisvert wrote:
>
> > On Tuesday, November 23, 2010, at 17:28 -0500, George Bosilca wrote:
> >> Sebastien,
> >>
> >> Using MPI_Isend doesn't guarantee asynchronous progress. As you might
> >> be aware, the non-blocking communications are guaranteed to progress
> >> only when the application is in the MPI library. Currently very few
> >> MPI implementations progress asynchronously (and unfortunately Open
> >> MPI is not one of them).
> >
> > Regardless, I just need the non-blocking behavior.
> > I call MPI_Request_free just after MPI_Isend, and I use a ring
> > allocator to allocate message buffers.
> >
> > Message recipients just reply with another message to the source,
> > using a NULL buffer.
> >
> > The sender waits for the reply before sending the next message.
> >
> > And it works for assembling bacterial genomes on many MPI ranks:
> >
> > ...
> > Rank 0: 162 contigs/4576725 nucleotides
> >
> > Rank 0 reports the elapsed time, Tue Nov 23 0
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Thank you! Your support is outstanding!

On Tuesday, November 23, 2010, at 22:25 -0500, Eugene Loh wrote:

> Jeff Squyres (jsquyres) wrote:
>
> > Ya, it sounds like we should fix this eager limit help text so that
> > others aren't misled. We did say "attempt", but that's probably a bit
> > too subtle.
> >
> > Eugene - iirc: this is in the btl base (or some other central
> > location) because it's shared between all btls.
>
> The cited text was from the OMPI FAQ ("Tuning" / "sm" section, item 6).
> I made the change in r1309.
>
> In ompi/mca/btl/base/btl_base_mca.c, I added the phrase "including
> header" to both
>
> "rndv_eager_limit"
> "Size (in bytes, including header) of \"phase 1\" fragment sent for all
> large messages (must be >= 0 and <= eager_limit)"
> module->btl_rndv_eager_limit
>
> and
>
> "eager_limit"
> "Maximum size (in bytes, including header) of \"short\" messages (must
> be >= 1)."
> module->btl_eager_limit
>
> but I left
>
> "max_send_size"
> "Maximum size (in bytes) of a single \"phase 2\" fragment of a long
> message when using the pipeline protocol (must be >= 1)"
> module->btl_max_send_size
>
> alone (for some combination of lukewarm reasons). Changes are in r24085.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Jeff Squyres (jsquyres) wrote:

> Ya, it sounds like we should fix this eager limit help text so that
> others aren't misled. We did say "attempt", but that's probably a bit too
> subtle.
>
> Eugene - iirc: this is in the btl base (or some other central location)
> because it's shared between all btls.

The cited text was from the OMPI FAQ ("Tuning" / "sm" section, item 6). I
made the change in r1309.

In ompi/mca/btl/base/btl_base_mca.c, I added the phrase "including header"
to both

"rndv_eager_limit"
"Size (in bytes, including header) of \"phase 1\" fragment sent for all
large messages (must be >= 0 and <= eager_limit)"
module->btl_rndv_eager_limit

and

"eager_limit"
"Maximum size (in bytes, including header) of \"short\" messages (must be
>= 1)."
module->btl_eager_limit

but I left

"max_send_size"
"Maximum size (in bytes) of a single \"phase 2\" fragment of a long
message when using the pipeline protocol (must be >= 1)"
module->btl_max_send_size

alone (for some combination of lukewarm reasons). Changes are in r24085.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Beware that MPI-request-free on active buffers is valid but evil. You
CANNOT be sure when the buffer is available for reuse.

There was a sentence or paragraph added to MPI 2.2 describing exactly this
case.

Sent from my PDA. No type good.

On Nov 23, 2010, at 5:36 PM, Sébastien Boisvert wrote:

> On Tuesday, November 23, 2010, at 17:28 -0500, George Bosilca wrote:
>> Sebastien,
>>
>> Using MPI_Isend doesn't guarantee asynchronous progress. As you might
>> be aware, the non-blocking communications are guaranteed to progress
>> only when the application is in the MPI library. Currently very few MPI
>> implementations progress asynchronously (and unfortunately Open MPI is
>> not one of them).
>
> Regardless, I just need the non-blocking behavior.
> I call MPI_Request_free just after MPI_Isend, and I use a ring allocator
> to allocate message buffers.
>
> Message recipients just reply with another message to the source, using
> a NULL buffer.
>
> The sender waits for the reply before sending the next message.
>
> And it works for assembling bacterial genomes on many MPI ranks:
>
> ...
> Rank 0: 162 contigs/4576725 nucleotides
>
> Rank 0 reports the elapsed time, Tue Nov 23 01:35:48 2010
> ---> Step: Collection of fusions
>      Elapsed time: 0 seconds
>      Since beginning: 17 minutes, 33 seconds
>
> Elapsed time for each step, Tue Nov 23 01:35:48 2010
>
> Beginning of computation: 1 seconds
> Distribution of sequence reads: 7 minutes, 49 seconds
> Distribution of vertices: 19 seconds
> Calculation of coverage distribution: 1 seconds
> Distribution of edges: 29 seconds
> Indexing of sequence reads: 1 seconds
> Computation of seeds: 2 minutes, 33 seconds
> Computation of library sizes: 1 minutes, 47 seconds
> Extension of seeds: 3 minutes, 34 seconds
> Computation of fusions: 59 seconds
> Collection of fusions: 0 seconds
> Completion of the assembly: 17 minutes, 33 seconds
>
> Rank 0 wrote Ecoli-THEONE.CoverageDistribution.txt
> Rank 0 wrote Ecoli-THEONE.fasta
> Rank 0 wrote Ecoli-THEONE.ReceivedMessages.txt
> Rank 0 wrote Ecoli-THEONE.Library0.txt
> Rank 0 wrote Ecoli-THEONE.Library1.txt
>
> Au revoir !
>
>> george.
>>
>> On Nov 23, 2010, at 17:17 , Sébastien Boisvert wrote:
>>
>>> I now use MPI_Isend, so the problem is no more.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Ya, it sounds like we should fix this eager limit help text so that others
aren't misled. We did say "attempt", but that's probably a bit too subtle.

Eugene - iirc: this is in the btl base (or some other central location)
because it's shared between all btls.

Sent from my PDA. No type good.

On Nov 23, 2010, at 5:54 PM, "Eugene Loh" wrote:

> George Bosilca wrote:
>
>> Moreover, eager send can improve performance if and only if the
>> matching receives are already posted on the peer. If not, the data will
>> become unexpected, and there will be one additional memcpy.
>
> I don't think the first sentence is strictly true. There is a cost
> associated with eager messages, but whether there is an overall
> improvement or not depends on lots of factors.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Whoa! Thanks, I will try that.

On Tuesday, November 23, 2010, at 18:03 -0500, George Bosilca wrote:

> If you know the max size of the receives I would take a different
> approach.

"max size" is the maximum buffer size required, right? In my case, it is
4096.

> Post few persistent receives, and manage them in a circular buffer.
> Instead of doing an MPI_Iprobe, use MPI_Test on the current head of your
> circular buffer. Once you use the data related to the receive, just do
> an MPI_Start on your request.
>
> This approach will minimize the unexpected messages, and drain the
> connections faster. Moreover, at the end it is very easy to MPI_Cancel
> all the receives not yet matched.

Looks very interesting, indeed!

https://computing.llnl.gov/tutorials/mpi_performance/#Persistent

Wow, that is really an in-depth suggestion that I will surely try! Thank
you, your answers are very appreciated!

> george.
>
> On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:
>
>> On Tuesday, November 23, 2010, at 17:38 -0500, George Bosilca wrote:
>>> The eager size reported by ompi_info includes the Open MPI internal
>>> headers. They are anywhere between 20 and 64 bytes long (potentially
>>> more for some particular networks), so what Eugene suggested was a
>>> safe boundary.
>>
>> I see.
>>
>>> Moreover, eager send can improve performance if and only if the
>>> matching receives are already posted on the peer. If not, the data
>>> will become unexpected, and there will be one additional memcpy.
>>
>> So it won't improve performance in my application (Ray,
>> http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
>> incoming messages, which means any receive (MPI_Recv) is never posted
>> before any send (MPI_Isend).
>>
>> Thanks, this thread is very informative for me!
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
If you know the max size of the receives I would take a different approach.
Post few persistent receives, and manage them in a circular buffer. Instead
of doing an MPI_Iprobe, use MPI_Test on the current head of your circular
buffer. Once you use the data related to the receive, just do an MPI_Start
on your request.

This approach will minimize the unexpected messages, and drain the
connections faster. Moreover, at the end it is very easy to MPI_Cancel all
the receives not yet matched.

george.

On Nov 23, 2010, at 17:43 , Sébastien Boisvert wrote:

> On Tuesday, November 23, 2010, at 17:38 -0500, George Bosilca wrote:
>> The eager size reported by ompi_info includes the Open MPI internal
>> headers. They are anywhere between 20 and 64 bytes long (potentially
>> more for some particular networks), so what Eugene suggested was a safe
>> boundary.
>
> I see.
>
>> Moreover, eager send can improve performance if and only if the
>> matching receives are already posted on the peer. If not, the data will
>> become unexpected, and there will be one additional memcpy.
>
> So it won't improve performance in my application (Ray,
> http://denovoassembler.sf.net) because I use MPI_Iprobe to check for
> incoming messages, which means any receive (MPI_Recv) is never posted
> before any send (MPI_Isend).
>
> Thanks, this thread is very informative for me!
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
George Bosilca wrote:

> Moreover, eager send can improve performance if and only if the matching
> receives are already posted on the peer. If not, the data will become
> unexpected, and there will be one additional memcpy.

I don't think the first sentence is strictly true. There is a cost
associated with eager messages, but whether there is an overall improvement
or not depends on lots of factors.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Sébastien Boisvert wrote:

> On Tuesday, November 23, 2010, at 16:07 -0500, Eugene Loh wrote:
>> Sébastien Boisvert wrote:
>>> Case 1: 30 MPI ranks, message size is 4096 bytes
>>>
>>> File: mpirun-np-30-Program-4096.txt
>>> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>>
>> 4096 is rendezvous. For eager, try 4000 or lower.
>
> According to ompi_info, the threshold is 4096, not 4000, right?

Right.

> "btl_sm_eager_limit: Below this size, messages are sent "eagerly" --
> that is, a sender attempts to write its entire message to shared buffers
> without waiting for a receiver to be ready. Above this size, a sender
> will only write the first part of a message, then wait for the receiver
> to acknowledge it's ready before continuing. Eager sends can improve
> performance by decoupling senders from receivers."
>
> source:
> http://www.open-mpi.org/faq/?category=sm#more-sm
>
> It should say "Below this size or equal to this size" instead of "Below
> this size", as ompi_info says. ;)

Well, I guess it should say:

    If message data plus header information fits within this limit, the
    message is sent "eagerly"...

I guess I'll fix it. (I suspect I wrote it in the first place. Sigh.)
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On Tuesday, November 23, 2010 at 17:38 -0500, George Bosilca wrote:
> The eager size reported by ompi_info includes the Open MPI internal headers. They are anywhere between 20 and 64 bytes long (potentially more for some particular networks), so what Eugene suggested was a safe boundary.

I see.

> Moreover, eager send can improve performance if and only if the matching receives are already posted on the peer. If not, the data will become unexpected, and there will be one additional memcpy.

So it won't improve performance in my application (Ray, http://denovoassembler.sf.net), because I use MPI_Iprobe to check for incoming messages, which means a receive (MPI_Recv) is never posted before a send (MPI_Isend).

Thanks, this thread is very informative for me!

> george.
>
> On Nov 23, 2010, at 17:29, Sébastien Boisvert wrote:
> > [earlier messages in the thread, quoted in full elsewhere in this archive, snipped here]

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
M. Sébastien Boisvert
PhD student in physiology-endocrinology at Université Laval
Fellow of the Canadian Institutes of Health Research
Professor Jacques Corbeil's team

Centre de recherche en infectiologie de l'Université Laval
Local R-61B
2705, boulevard Laurier
Québec, Québec
Canada G1V 4G2
Telephone: 418 525 46342

Email: s...@boisvert.info
Web: http://boisvert.info

"Innovation comes only from an assault on the unknown" -Sydney Brenner
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
The eager size reported by ompi_info includes the Open MPI internal headers. They are anywhere between 20 and 64 bytes long (potentially more for some particular networks), so what Eugene suggested was a safe boundary.

Moreover, eager send can improve performance if and only if the matching receives are already posted on the peer. If not, the data will become unexpected, and there will be one additional memcpy.

george.

On Nov 23, 2010, at 17:29, Sébastien Boisvert wrote:
> [earlier messages in the thread, quoted in full elsewhere in this archive, snipped here]
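George's point about header overhead can be made concrete with a little arithmetic. This is an illustrative sketch, not Open MPI code; the exact header size is an assumption taken from the 20-64 byte range he mentions:

```c
/* Largest user payload that still fits in an eager fragment,
 * given the BTL eager limit and the internal header size.
 * The header size passed in is an assumption based on the
 * 20-64 byte range mentioned in this thread. */
static int max_eager_payload(int eager_limit, int header_bytes) {
    return eager_limit - header_bytes;
}
```

With btl_sm_eager_limit at its default of 4096 and a worst-case 64-byte header, max_eager_payload(4096, 64) is 4032, which is why Eugene's suggestion of 4000 bytes or lower is a safe boundary.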
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On Tuesday, November 23, 2010 at 17:28 -0500, George Bosilca wrote:
> Sebastien,
>
> Using MPI_Isend doesn't guarantee asynchronous progress. As you might be aware, the non-blocking communications are guaranteed to progress only when the application is in the MPI library. Currently very few MPI implementations progress asynchronously (and unfortunately Open MPI is not one of them).

Regardless, I just need the non-blocking behavior. I call MPI_Request_free just after MPI_Isend, and I use a ring allocator to allocate message buffers. Message recipients just reply with another message to the source, using a NULL buffer. The sender waits for the reply before sending the next message.

And it works for assembling bacterial genomes on many MPI ranks:

...
Rank 0: 162 contigs/4576725 nucleotides

Rank 0 reports the elapsed time, Tue Nov 23 01:35:48 2010 ---> Step: Collection of fusions
 Elapsed time: 0 seconds
 Since beginning: 17 minutes, 33 seconds

Elapsed time for each step, Tue Nov 23 01:35:48 2010
 Beginning of computation: 1 seconds
 Distribution of sequence reads: 7 minutes, 49 seconds
 Distribution of vertices: 19 seconds
 Calculation of coverage distribution: 1 seconds
 Distribution of edges: 29 seconds
 Indexing of sequence reads: 1 seconds
 Computation of seeds: 2 minutes, 33 seconds
 Computation of library sizes: 1 minutes, 47 seconds
 Extension of seeds: 3 minutes, 34 seconds
 Computation of fusions: 59 seconds
 Collection of fusions: 0 seconds
 Completion of the assembly: 17 minutes, 33 seconds

Rank 0 wrote Ecoli-THEONE.CoverageDistribution.txt
Rank 0 wrote Ecoli-THEONE.fasta
Rank 0 wrote Ecoli-THEONE.ReceivedMessages.txt
Rank 0 wrote Ecoli-THEONE.Library0.txt
Rank 0 wrote Ecoli-THEONE.Library1.txt

Au revoir !

> george.
>
> On Nov 23, 2010, at 17:17, Sébastien Boisvert wrote:
> > I now use MPI_Isend, so the problem is no more.

"Innovation comes only from an assault on the unknown" -Sydney Brenner
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On Tuesday, November 23, 2010 at 16:07 -0500, Eugene Loh wrote:
> Sébastien Boisvert wrote:
> > Now I can describe the cases.
>
> The test cases can all be explained by the test requiring eager messages (something that test4096.cpp does not require).
>
> > Case 1: 30 MPI ranks, message size is 4096 bytes
> >
> > File: mpirun-np-30-Program-4096.txt
> > Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>
> 4096 is rendezvous. For eager, try 4000 or lower.

According to ompi_info, the threshold is 4096, not 4000, right?

(Open-MPI 1.4.3)
[sboisver12@colosse1 ~]$ ompi_info -a|less
 MCA btl: parameter "btl_sm_eager_limit" (current value: "4096", data source: default value)
 Maximum size (in bytes) of "short" messages (must be >= 1).

"btl_sm_eager_limit: Below this size, messages are sent "eagerly" -- that is, a sender attempts to write its entire message to shared buffers without waiting for a receiver to be ready. Above this size, a sender will only write the first part of a message, then wait for the receiver to acknowledge it is ready before continuing. Eager sends can improve performance by decoupling senders from receivers."

Source: http://www.open-mpi.org/faq/?category=sm#more-sm

It should say "Below this size or equal to this size" instead of "Below this size", as ompi_info says. ;)

As Mr. George Bosilca put it:

"__should__ is not correct, __might__ is a better verb to describe the most "common" behavior for small messages. The problem comes from the fact that in each communicator the FIFO ordering is required by the MPI standard. As soon as there is any congestion, MPI_Send will block even for small messages (and this is independent of the underlying network) until all the pending packets have been delivered."

Source: http://www.open-mpi.org/community/lists/devel/2010/11/8696.php

> > Case 2: 30 MPI ranks, message size is 1 byte
> >
> > File: mpirun-np-30-Program-1.txt.gz
> > Outcome: It runs just fine.
>
> 1 byte is eager.

I agree.

> > Case 3: 2 MPI ranks, message size is 4096 bytes
> >
> > File: mpirun-np-2-Program-4096.txt
> > Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>
> Same as Case 1.
>
> > Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is disabled
> >
> > File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> > Outcome: It runs just fine.
>
> Eager limit for TCP is 65536 (perhaps less some overhead). So, these messages are eager.

I agree.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Sebastien,

Using MPI_Isend doesn't guarantee asynchronous progress. As you might be aware, the non-blocking communications are guaranteed to progress only when the application is in the MPI library. Currently very few MPI implementations progress asynchronously (and unfortunately Open MPI is not one of them).

george.

On Nov 23, 2010, at 17:17, Sébastien Boisvert wrote:
> I now use MPI_Isend, so the problem is no more.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
No message is eager if there is congestion. 64K is eager for TCP only if the kernel buffer has enough room to hold the 64K. For SM it only works if there are ready buffers.

In fact, eager is an optimization of the MPI library, not something users should be aware of or base their application on. MPI 2.2 contains a specific paragraph that advises users not to do so.

george.

On Nov 23, 2010, at 16:07, Eugene Loh wrote:
> [Eugene's case-by-case analysis, quoted in full elsewhere in this archive, snipped here]
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
On Tuesday, November 23, 2010 at 15:17 -0500, Jeff Squyres (jsquyres) wrote:
> Sorry for the delay in replying - many of us were at SC last week.

Nothing to be sorry for!

> Admittedly, I'm looking at your code on a PDA, so I might be missing some things. But I have 2 q's:

You got it all right, I assure you!

> 1. your send routine doesn't seem to protect from sending to yourself. Correct?

Correct. (My error!) The code is not compliant with MPI 2.2 -- I realized that afterward.
See http://www.open-mpi.org/community/lists/devel/2010/11/8689.php

Also, Mr. George Bosilca pointed that out too.
See http://www.open-mpi.org/community/lists/devel/2010/11/8696.php

> 2. you're not using nonblocking sends, which, if I understand your code right, can lead to deadlock. Right? Eg proc A sends to proc b and blocks until b receives. But b is blocking waiting for its send completion, etc.

Right. As Mr. George Bosilca underlined, since the same test case works for small messages, the problem is about congestion of the FIFOs, which leads to resource locking and, as you wrote, deadlock.
http://www.open-mpi.org/community/lists/devel/2010/11/8696.php

I now use MPI_Isend, so the problem is no more.

> I think with your random destinations (which may even be yourself, in which case the blocking send will never complete because you didn't prepost a nonblocking receive) and blocking sends, you can end up with deadlock.

Yes, you are right!

> Sent from my PDA. No type good.

Sent from my Ubuntu. Typing is good. ;)

> On Nov 16, 2010, at 5:21 PM, Sébastien Boisvert wrote:
> > [the original bug report, quoted in full in Jeff's message later in this archive, snipped here]
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Sébastien Boisvert wrote:
> Now I can describe the cases.

The test cases can all be explained by the test requiring eager messages (something that test4096.cpp does not require).

> Case 1: 30 MPI ranks, message size is 4096 bytes
>
> File: mpirun-np-30-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.

4096 is rendezvous. For eager, try 4000 or lower.

> Case 2: 30 MPI ranks, message size is 1 byte
>
> File: mpirun-np-30-Program-1.txt.gz
> Outcome: It runs just fine.

1 byte is eager.

> Case 3: 2 MPI ranks, message size is 4096 bytes
>
> File: mpirun-np-2-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.

Same as Case 1.

> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is disabled
>
> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> Outcome: It runs just fine.

Eager limit for TCP is 65536 (perhaps less some overhead). So, these messages are eager.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
To add to Jeff's comments:

Sébastien Boisvert wrote:
> The reason is that I am developing an MPI-based software, and I use Open-MPI as it is the only implementation I am aware of that sends messages eagerly (powerful feature, that is).

As wonderful as OMPI is, I am fairly sure other MPI implementations also support eager message passing. That is, there is a capability for a sender to hand message data over to the MPI implementation, freeing the user send buffer and allowing an MPI_Send() call to complete, without the message reaching the receiver or the receiver being ready.

> Each byte transfer layer has its default limit to send eagerly a message. With shared memory (sm), the value is 4096 bytes. At least it is according to ompi_info.

Yes. I think that 4096 bytes can be a little tricky... it may include some header information. So, the amount of user data that could be sent would be a little bit less... e.g., 4,000 bytes or so.

> To verify this limit, I implemented a very simple test. The source code is test4096.cpp, which basically just sends a single message of 4096 bytes from one rank to another (rank 1 to 0).

I don't think the test says much at all. It has one process post an MPI_Send and another post an MPI_Recv. Such a test should complete under a very wide range of conditions. Here is perhaps a better test:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int me, np;
    char buf[N];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Send(buf, N, MPI_BYTE, 1 - me, 343, MPI_COMM_WORLD);
    MPI_Recv(buf, N, MPI_BYTE, 1 - me, 343, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("%d of %d done\n", me, np);
    MPI_Finalize();
    return 0;
}

Compile with the preprocessor symbol N defined to, say, 64. Run with --np 2. Each process will try to send. The code will complete for short, eager messages. If the messages are long, nothing is sent eagerly and both processes stay hung in their sends. Bump N up slowly. For N=4096, the code hangs. For N slightly less -- say, 4000 -- it runs.
Re: [OMPI devel] Simple program (103 lines) makes Open-1.4.3 hang
Sorry for the delay in replying - many of us were at SC last week.

Admittedly, I'm looking at your code on a PDA, so I might be missing some things. But I have 2 q's:

1. your send routine doesn't seem to protect from sending to yourself. Correct?

2. you're not using nonblocking sends, which, if I understand your code right, can lead to deadlock. Right? Eg proc A sends to proc b and blocks until b receives. But b is blocking waiting for its send completion, etc.

I think with your random destinations (which may even be yourself, in which case the blocking send will never complete because you didn't prepost a nonblocking receive) and blocking sends, you can end up with deadlock.

Sent from my PDA. No type good.

On Nov 16, 2010, at 5:21 PM, Sébastien Boisvert wrote:

> Dear awesome community,
>
> Over the last months, I closely followed the evolution of bug 2043, entitled 'sm BTL hang with GCC 4.4.x'.
>
> https://svn.open-mpi.org/trac/ompi/ticket/2043
>
> The reason is that I am developing an MPI-based software, and I use Open-MPI as it is the only implementation I am aware of that sends messages eagerly (powerful feature, that is).
>
> http://denovoassembler.sourceforge.net/
>
> I believe that this very pesky bug remains in Open-MPI 1.4.3, and enclosed to this communication are scientific proofs of my claim, or at least I think they are ;).
>
> Each byte transfer layer has its default limit to send a message eagerly. With shared memory (sm), the value is 4096 bytes. At least it is according to ompi_info.
>
> To verify this limit, I implemented a very simple test. The source code is test4096.cpp, which basically just sends a single message of 4096 bytes from one rank to another (rank 1 to 0).
>
> The test was conclusive: the limit is 4096 bytes (see mpirun-np-2-Simple.txt).
>
> Then, I implemented a simple program (103 lines) that makes Open-MPI 1.4.3 hang. The code is in make-it-hang.cpp. At each iteration, each rank sends a message to a randomly-selected destination. A rank polls for new messages with MPI_Iprobe. Each rank prints the current time each second during 30 seconds. Using this simple code, I ran 4 test cases, each with a different outcome (use the Makefile if you want to reproduce the bug).
>
> Before I describe these cases, I will describe the testing hardware.
>
> I use a computer with 32 x86_64 cores (see cat-proc-cpuinfo.txt.gz).
> The computer has 128 GB of physical memory (see cat-proc-meminfo.txt.gz).
> It runs Fedora Core 11 with Linux 2.6.30.10-105.2.23.fc11.x86_64 (see dmesg.txt.gz & uname.txt).
> Default kernel parameters are utilized at runtime (see sudo-sysctl-a.txt.gz).
>
> The C++ compiler is g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2) (see g++--version.txt).
>
> I compiled Open-MPI 1.4.3 myself (see config.out.gz, make.out.gz, make-install.out.gz).
> Finally, I use Open-MPI 1.4.3 with defaults (see ompi_info.txt.gz).
>
> Now I can describe the cases.
>
> Case 1: 30 MPI ranks, message size is 4096 bytes
>
> File: mpirun-np-30-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>
> Case 2: 30 MPI ranks, message size is 1 byte
>
> File: mpirun-np-30-Program-1.txt.gz
> Outcome: It runs just fine.
>
> Case 3: 2 MPI ranks, message size is 4096 bytes
>
> File: mpirun-np-2-Program-4096.txt
> Outcome: It hangs -- I killed the poor thing after 30 seconds or so.
>
> Case 4: 30 MPI ranks, message size is 4096 bytes, shared memory is disabled
>
> File: mpirun-mca-btl-^sm-np-30-Program-4096.txt.gz
> Outcome: It runs just fine.
>
> A backtrace of the processes in Case 1 is in gdb-bt.txt.gz.
>
> Thank you!