> ________________________________________
> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] on behalf of
> Jeff Squyres [jsquy...@cisco.com]
> Sent: September 28, 2011 11:18
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RE : RE : Implementation of MPI_Iprobe
>
> On Sep 28, 2011, at 10:04 AM, George Bosilca wrote:
>
>>> Why not use pre-posted non-blocking receives and MPI_WAIT_ANY?
>>
>> That's not very scalable either… Might work for 256 processes, but that's
>> about it.
>
> Just get a machine with oodles of RAM and you'll be fine.
>
> ;-)
Hello,

Each of my 256 cores has 3 GB of memory, so my computation has 768 GB of distributed memory. Memory is not a problem at all.

I only see the problem of starvation for the slave mode RAY_SLAVE_MODE_EXTENSION in Ray, and when there is starvation the memory usage is only ~1.6 GB per core.

Today I implemented some profiling in my code to check where the granularity is too large in processData(), which calls call_RAY_SLAVE_MODE_EXTENSION(). I consider anything above or equal to 128 microseconds to be too long for my computation.

This is what I found so far:

[1,3]<stdout>:Warning, SlaveMode= RAY_SLAVE_MODE_EXTENSION GranularityInMicroseconds= 16106
[1,3]<stdout>:Number of calls in the stack: 20
[1,3]<stdout>:0 1317227196433984 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 47
[1,3]<stdout>:1 1317227196433985 microseconds +1 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 72
[1,3]<stdout>:2 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 144
[1,3]<stdout>:3 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 221
[1,3]<stdout>:4 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 351
[1,3]<stdout>:5 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 389
[1,3]<stdout>:6 1317227196433986 microseconds +1 from previous (0.01%) in doChoice inside code/assembler/SeedExtender.cpp at line 441
[1,3]<stdout>:7 1317227196433986 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 775
[1,3]<stdout>:8 1317227196433987 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 934
[1,3]<stdout>:9 1317227196433988 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 960
[1,3]<stdout>:10 1317227196442360 microseconds +8372 from previous (51.98%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 989
[1,3]<stdout>:11 1317227196442651 microseconds +291 from previous (1.81%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 993
[1,3]<stdout>:12 1317227196442654 microseconds +3 from previous (0.02%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 1002
[1,3]<stdout>:13 1317227196442655 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 72
[1,3]<stdout>:14 1317227196442656 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 76
[1,3]<stdout>:15 1317227196447138 microseconds +4482 from previous (27.83%) in resetStructures inside code/assembler/ExtensionData.cpp at line 80
[1,3]<stdout>:16 1317227196450084 microseconds +2946 from previous (18.29%) in doChoice inside code/assembler/SeedExtender.cpp at line 883
[1,3]<stdout>:17 1317227196450087 microseconds +3 from previous (0.02%) in doChoice inside code/assembler/SeedExtender.cpp at line 886
[1,3]<stdout>:18 1317227196450087 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 888
[1,3]<stdout>:19 1317227196450089 microseconds +2 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 229
[1,3]<stdout>:End of stack
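For what it is worth, the output above comes from simple timestamped checkpoints. Below is a minimal sketch of that kind of profiler, not the actual code in Ray: the class name, method names and the macro are made up, and it assumes gettimeofday() is available.

#include <sys/time.h>
#include <stdint.h>
#include <cstdio>
#include <vector>

struct Checkpoint{
	uint64_t microseconds;
	const char*function;
	const char*file;
	int line;
};

class GranularityProfiler{
	std::vector<Checkpoint> m_stack;
public:
	/* record one checkpoint with the current time in microseconds */
	void record(const char*function,const char*file,int line){
		struct timeval tv;
		gettimeofday(&tv,NULL);
		uint64_t now=(uint64_t)tv.tv_sec*1000000+tv.tv_usec;
		Checkpoint c={now,function,file,line};
		m_stack.push_back(c);
	}
	/* print the stack only if the whole call took too long, then reset it */
	void printStackIfTooSlow(const char*slaveModeName,uint64_t threshold){
		if(m_stack.size()<2){ m_stack.clear(); return; }
		uint64_t total=m_stack.back().microseconds-m_stack.front().microseconds;
		if(total<threshold){ m_stack.clear(); return; }
		printf("Warning, SlaveMode= %s GranularityInMicroseconds= %llu\n",
			slaveModeName,(unsigned long long)total);
		printf("Number of calls in the stack: %d\n",(int)m_stack.size());
		for(int i=0;i<(int)m_stack.size();i++){
			uint64_t delta=(i==0)?0:m_stack[i].microseconds-m_stack[i-1].microseconds;
			printf("%d %llu microseconds +%llu from previous (%.2f%%) in %s inside %s at line %d\n",
				i,(unsigned long long)m_stack[i].microseconds,
				(unsigned long long)delta,100.0*delta/total,
				m_stack[i].function,m_stack[i].file,m_stack[i].line);
		}
		printf("End of stack\n");
		m_stack.clear();
	}
};

/* checkpoints are then dropped at interesting places with something like:
   #define COLLECT_PROFILING_INFORMATION(p) (p).record(__func__,__FILE__,__LINE__) */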
So the problem is definitely not with Open-MPI, but doing a round-robin MPI_Iprobe (rotating the source given to MPI_Iprobe at each call) still helps a lot when the granularity exceeds 128 microseconds. I do think that George's patch (with my minor modification) would provide an MPI_Iprobe that is fair for all drained messages (the round-robin thing). But even the patch does not change anything for my problem with MPI_ANY_SOURCE.

> I actually was thinking of his specific 256-process case.  I agree that it
> doesn't scale arbitrarily.

I think it could scale arbitrarily with Open-MPI ;) (and with any MPI implementation respecting MPI 2.x, for that matter). I just need to get the granularity below 128 microseconds for all the calls in RAY_SLAVE_MODE_EXTENSION (which is Machine::call_RAY_SLAVE_MODE_EXTENSION() in my code).

> Another approach would potentially be to break your 256 processes up into N
> sub-communicators of M each (where N * M = 256, obviously), and doing a
> non-blocking receive with ANY_SOURCE and then a WAIT_ANY on all of those.

I am not sure that would work in my code, as my architecture is like this:

while(running){
	receiveMessages(); // blazing fast, receives 0 or 1 message, never more, never less; other messages will wait for the next iteration!
	processMessages(); // consumes the one message received, if any; also very fast because it is done with an array mapping tags to function pointers
	processData();     // should be fast, but apparently call_RAY_SLAVE_MODE_EXTENSION is slowish sometimes...
	sendMessages();    // fast, sends at most 17 messages; in most cases it is either 0 or 1 message
}

If I *understand* what you said correctly, doing a WAIT_ANY inside Ray's receiveMessages() would hang and/or significantly lower the speed of the loop, which is not desirable. I like to have my loop at ~200000 iterations per 100 milliseconds. This yields a very responsive system: everyone responds within 128 microseconds with my round-robin thing. The response time is 10 microseconds on guillimin.clumeq.ca and 100 (it used to be 250) on colosse.clumeq.ca if I use MPI_ANY_SOURCE (as reported on the list, see http://www.open-mpi.org/community/lists/users/2011/09/17321.php ), but things get complicated in RAY_SLAVE_MODE_EXTENSION because of the coarse granularity there.
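For reference, the round-robin probing I keep mentioning looks roughly like this. It is a simplified sketch, not the actual code in Ray: receiveOneMessage, rotatingSource and the byte-based buffer are made-up names, and Ray manages its own inbox buffers.

#include <mpi.h>
#include <stdint.h>

static int rotatingSource=0; /* hypothetical: which rank to probe on the next call */

/* Drains at most one message per call; returns 1 if a message was received. */
int receiveOneMessage(int numberOfRanks,uint8_t*buffer,int*count,int*source,int*tag){
	int flag=0;
	MPI_Status status;

	/* Probe one specific source instead of MPI_ANY_SOURCE, then rotate the
	   source so that every rank gets probed in turn across calls. */
	MPI_Iprobe(rotatingSource,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	rotatingSource=(rotatingSource+1)%numberOfRanks;

	if(!flag)
		return 0; /* nothing available from that source on this iteration */

	/* Receive exactly the message that was probed (same source and tag). */
	MPI_Get_count(&status,MPI_BYTE,count);
	*source=status.MPI_SOURCE;
	*tag=status.MPI_TAG;
	MPI_Recv(buffer,*count,MPI_BYTE,*source,*tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

	return 1;
}

The point is simply that each call probes a single, specific source and then advances it, so no single source can monopolize the drain even when many ranks have messages pending.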
> The code gets a bit more complex, but it hypothetically extends your
> scalability.
>
> Or better yet, have your job mimic this idea -- a tree-based gathering
> system.  Have not just 1 master, but N sub-masters.  Individual compute
> processes report up to their sub-master, and the sub-master does whatever
> combinatorial work it can before reporting it to the ultimate master, etc.

Ray does have a MASTER_RANK, which is 0, but all the ranks, including 0, are slave ranks too.

In processData():

/** process data by calling the current slave and master methods */
void Machine::processData(){
	MachineMethod masterMethod=m_master_methods[m_master_mode];
	(this->*masterMethod)();

	MachineMethod slaveMethod=m_slave_methods[m_slave_mode];
	(this->*slaveMethod)();
}

Obviously, m_master_mode is always RAY_MASTER_MODE_DO_NOTHING for any rank that is not MASTER_RANK, which is quite simple to implement:

void Machine::call_RAY_MASTER_MODE_DO_NOTHING(){}

So, although I understand that the tree-based gathering system you describe would act as some sort of virtual network (like routing packets on the Internet), I don't think it would be helpful, because the computation granularity in call_RAY_SLAVE_MODE_EXTENSION() is above 128 microseconds anyway (I discovered that today, my bad).

> It depends on your code and how much delegation is possible, how much data
> you're transferring over the network, how much fairness you want to
> guarantee, etc.  My point is that there are a bunch of different options you
> can pursue outside of the "everyone sends to 1 master" model.

My communication model is more distributed than "everyone sends to 1 master". My model is "everyone sends to everyone in a respectful way". By "respectful way", I mean that rank A waits for rank B's reply to its first message before sending anything else to rank B. Because of that:

- Open-MPI buffers are happy,
- memory usage is happy, and
- byte transfer layers are not saturated at all, and thus are happy too.

Destinations are mostly random because of my hash-based domain decomposition of genomic/biological data.

I will thus improve my granularity, but I would nonetheless agree that George's patch be merged into Open-MPI's trunk, as fairness is always desirable in networking algorithms.

Thanks a lot!

Sébastien Boisvert
PhD student
http://boisvert.info

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>