> ________________________________________
> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] on behalf of
> Jeff Squyres [jsquy...@cisco.com]
> Sent: September 28, 2011 11:18
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RE : RE : Implementation of MPI_Iprobe
>
> On Sep 28, 2011, at 10:04 AM, George Bosilca wrote:
>
>>> Why not use pre-posted non-blocking receives and MPI_WAIT_ANY?
>>
>> That's not very scalable either… Might work for 256 processes, but that's
>> about it.
>
> Just get a machine with oodles of RAM and you'll be fine.
>
> ;-)
Hello,

Each of my 256 cores has 3 GB of memory, so my computation has 768 GB of distributed memory. Memory is not a problem at all.

I only see the problem of starvation for the slave mode RAY_SLAVE_MODE_EXTENSION in Ray, and when there is starvation the memory usage is only ~1.6 GB per core.

Today I implemented some profiling in my code to check where the granularity is too large in processData(), which calls call_RAY_SLAVE_MODE_EXTENSION(). I consider anything above or equal to 128 microseconds to be too long for my computation.

This is what I found so far:

[1,3]<stdout>:Warning, SlaveMode= RAY_SLAVE_MODE_EXTENSION GranularityInMicroseconds= 16106
[1,3]<stdout>:Number of calls in the stack: 20
[1,3]<stdout>:0 1317227196433984 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 47
[1,3]<stdout>:1 1317227196433985 microseconds +1 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 72
[1,3]<stdout>:2 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 144
[1,3]<stdout>:3 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 221
[1,3]<stdout>:4 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 351
[1,3]<stdout>:5 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 389
[1,3]<stdout>:6 1317227196433986 microseconds +1 from previous (0.01%) in doChoice inside code/assembler/SeedExtender.cpp at line 441
[1,3]<stdout>:7 1317227196433986 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 775
[1,3]<stdout>:8 1317227196433987 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 934
[1,3]<stdout>:9 1317227196433988 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 960
[1,3]<stdout>:10 1317227196442360 microseconds +8372 from previous (51.98%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 989
[1,3]<stdout>:11 1317227196442651 microseconds +291 from previous (1.81%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 993
[1,3]<stdout>:12 1317227196442654 microseconds +3 from previous (0.02%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 1002
[1,3]<stdout>:13 1317227196442655 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 72
[1,3]<stdout>:14 1317227196442656 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 76
[1,3]<stdout>:15 1317227196447138 microseconds +4482 from previous (27.83%) in resetStructures inside code/assembler/ExtensionData.cpp at line 80
[1,3]<stdout>:16 1317227196450084 microseconds +2946 from previous (18.29%) in doChoice inside code/assembler/SeedExtender.cpp at line 883
[1,3]<stdout>:17 1317227196450087 microseconds +3 from previous (0.02%) in doChoice inside code/assembler/SeedExtender.cpp at line 886
[1,3]<stdout>:18 1317227196450087 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 888
[1,3]<stdout>:19 1317227196450089 microseconds +2 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 229
[1,3]<stdout>:End of stack
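For what it is worth, the output above comes from simple timestamped checkpoints. Below is a minimal sketch of that kind of profiler, not the actual code in Ray: the class name, method names and the macro are made up, and it assumes gettimeofday() is available.

#include <sys/time.h>
#include <stdint.h>
#include <cstdio>
#include <vector>

struct Checkpoint{
	uint64_t microseconds;
	const char*function;
	const char*file;
	int line;
};

class GranularityProfiler{
	std::vector<Checkpoint> m_stack;
public:
	/* record one checkpoint with the current time in microseconds */
	void record(const char*function,const char*file,int line){
		struct timeval tv;
		gettimeofday(&tv,NULL);
		uint64_t now=(uint64_t)tv.tv_sec*1000000+tv.tv_usec;
		Checkpoint c={now,function,file,line};
		m_stack.push_back(c);
	}
	/* print the stack only if the whole call took too long, then reset it */
	void printStackIfTooSlow(const char*slaveModeName,uint64_t threshold){
		if(m_stack.size()<2){ m_stack.clear(); return; }
		uint64_t total=m_stack.back().microseconds-m_stack.front().microseconds;
		if(total<threshold){ m_stack.clear(); return; }
		printf("Warning, SlaveMode= %s GranularityInMicroseconds= %llu\n",
			slaveModeName,(unsigned long long)total);
		printf("Number of calls in the stack: %d\n",(int)m_stack.size());
		for(int i=0;i<(int)m_stack.size();i++){
			uint64_t delta=(i==0)?0:m_stack[i].microseconds-m_stack[i-1].microseconds;
			printf("%d %llu microseconds +%llu from previous (%.2f%%) in %s inside %s at line %d\n",
				i,(unsigned long long)m_stack[i].microseconds,
				(unsigned long long)delta,100.0*delta/total,
				m_stack[i].function,m_stack[i].file,m_stack[i].line);
		}
		printf("End of stack\n");
		m_stack.clear();
	}
};

/* checkpoints are then dropped at interesting places with something like:
   #define COLLECT_PROFILING_INFORMATION(p) (p).record(__func__,__FILE__,__LINE__) */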
So the problem is definitely not with Open-MPI, but doing a round-robin MPI_Iprobe (rotating the source given to MPI_Iprobe at each call) still helps a lot when the granularity exceeds 128 microseconds. I do think that George's patch (with my minor modification) would provide an MPI_Iprobe that is fair for all drained messages (the round-robin thing). But even the patch does not change anything for my problem with MPI_ANY_SOURCE.

> I actually was thinking of his specific 256-process case.  I agree that it
> doesn't scale arbitrarily.

I think it could scale arbitrarily with Open-MPI ;) (and with any MPI implementation respecting MPI 2.x, for that matter). I just need to get the granularity below 128 microseconds for all the calls in RAY_SLAVE_MODE_EXTENSION (which is Machine::call_RAY_SLAVE_MODE_EXTENSION() in my code).

> Another approach would potentially be to break your 256 processes up into N
> sub-communicators of M each (where N * M = 256, obviously), and doing a
> non-blocking receive with ANY_SOURCE and then a WAIT_ANY on all of those.

I am not sure that would work in my code, as my architecture is like this:

while(running){
	receiveMessages(); // blazing fast, receives 0 or 1 message, never more, never less; other messages will wait for the next iteration!
	processMessages(); // consumes the one message received, if any; also very fast because it is done with an array mapping tags to function pointers
	processData();     // should be fast, but apparently call_RAY_SLAVE_MODE_EXTENSION is slowish sometimes...
	sendMessages();    // fast, sends at most 17 messages; in most cases it is either 0 or 1 message
}

If I *understand* what you said correctly, doing a WAIT_ANY inside Ray's receiveMessages() would hang and/or significantly lower the speed of the loop, which is not desirable. I like to have my loop at ~200000 iterations per 100 milliseconds. This yields a very responsive system: everyone responds within 128 microseconds with my round-robin thing. The response time is 10 microseconds on guillimin.clumeq.ca and 100 (it used to be 250) on colosse.clumeq.ca if I use MPI_ANY_SOURCE (as reported on the list, see http://www.open-mpi.org/community/lists/users/2011/09/17321.php ), but things get complicated in RAY_SLAVE_MODE_EXTENSION because of the coarse granularity there.
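For reference, the round-robin probing I keep mentioning looks roughly like this. It is a simplified sketch, not the actual code in Ray: receiveOneMessage, rotatingSource and the byte-based buffer are made-up names, and Ray manages its own inbox buffers.

#include <mpi.h>
#include <stdint.h>

static int rotatingSource=0; /* hypothetical: which rank to probe on the next call */

/* Drains at most one message per call; returns 1 if a message was received. */
int receiveOneMessage(int numberOfRanks,uint8_t*buffer,int*count,int*source,int*tag){
	int flag=0;
	MPI_Status status;

	/* Probe one specific source instead of MPI_ANY_SOURCE, then rotate the
	   source so that every rank gets probed in turn across calls. */
	MPI_Iprobe(rotatingSource,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,&status);
	rotatingSource=(rotatingSource+1)%numberOfRanks;

	if(!flag)
		return 0; /* nothing available from that source on this iteration */

	/* Receive exactly the message that was probed (same source and tag). */
	MPI_Get_count(&status,MPI_BYTE,count);
	*source=status.MPI_SOURCE;
	*tag=status.MPI_TAG;
	MPI_Recv(buffer,*count,MPI_BYTE,*source,*tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);

	return 1;
}

The point is simply that each call probes a single, specific source and then advances it, so no single source can monopolize the drain even when many ranks have messages pending.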
> The code gets a bit more complex, but it hypothetically extends your
> scalability.
>
> Or better yet, have your job mimic this idea -- a tree-based gathering
> system.  Have not just 1 master, but N sub-masters.  Individual compute
> processes report up to their sub-master, and the sub-master does whatever
> combinatorial work it can before reporting it to the ultimate master, etc.

Ray does have a MASTER_RANK, which is 0, but all the ranks, including 0, are slave ranks too.

In processData():

/** process data by calling the current slave and master methods */
void Machine::processData(){
	MachineMethod masterMethod=m_master_methods[m_master_mode];
	(this->*masterMethod)();

	MachineMethod slaveMethod=m_slave_methods[m_slave_mode];
	(this->*slaveMethod)();
}

Obviously, m_master_mode is always RAY_MASTER_MODE_DO_NOTHING for any rank that is not MASTER_RANK, which is quite simple to implement:

void Machine::call_RAY_MASTER_MODE_DO_NOTHING(){}

So, although I understand that the tree-based gathering system you describe would act as some sort of virtual network (like routing packets on the Internet), I don't think it would be helpful, because the computation granularity in call_RAY_SLAVE_MODE_EXTENSION() is above 128 microseconds anyway (I discovered that today, my bad).

> It depends on your code and how much delegation is possible, how much data
> you're transferring over the network, how much fairness you want to
> guarantee, etc.  My point is that there are a bunch of different options you
> can pursue outside of the "everyone sends to 1 master" model.

My communication model is more distributed than "everyone sends to 1 master". My model is "everyone sends to everyone in a respectful way". By "respectful way", I mean that rank A waits for rank B's reply to its first message before sending anything else to rank B. Because of that:

- Open-MPI buffers are happy,
- memory usage is happy, and
- byte transfer layers are not saturated at all, and thus are happy too.

Destinations are mostly random because of my hash-based domain decomposition of genomic/biological data.

I will thus improve my granularity, but I would nonetheless agree that George's patch be merged into Open-MPI's trunk, as fairness is always desirable in networking algorithms.

Thanks a lot!

Sébastien Boisvert
PhD student
http://boisvert.info

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>