> ________________________________________
> De : [email protected] [[email protected]] de la part de
> Jeff Squyres [[email protected]]
> Date d'envoi : 28 septembre 2011 11:18
> À : Open MPI Developers
> Objet : Re: [OMPI devel] RE : RE : Implementation of MPI_Iprobe
>
> On Sep 28, 2011, at 10:04 AM, George Bosilca wrote:
>
>>> Why not use pre-posted non-blocking receives and MPI_WAIT_ANY?
>>
>> That's not very scalable either… Might work for 256 processes, but that's
>> about it.
>
> Just get a machine with oodles of RAM and you'll be fine.
>
> ;-)
Hello,
Each of my 256 cores has 3 GB of memory, so my computation has 768 GB of
distributed memory in total. Memory is not a problem at all.
The only problem I see is starvation in the slave mode RAY_SLAVE_MODE_EXTENSION
in Ray, and when starvation occurs the memory usage is only ~1.6 GB per core.
Today, I implemented some profiling in my code to check where the granularity
is too large in processData(), which calls call_RAY_SLAVE_MODE_EXTENSION().
I consider anything above or equal to 128 microseconds to be too long for my
computation.
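The profiling itself is nothing fancy. Here is a reduced sketch of the idea
(the names tick(), reportIfTooSlow() and Checkpoint are illustrative, not the
exact Ray code): each checkpoint records a timestamp plus a label with the
function, file and line, and when the whole call takes at least the threshold
the stack of deltas is printed.

#include <sys/time.h>
#include <stdint.h>
#include <cstdio>
#include <string>
#include <vector>

struct Checkpoint{
    uint64_t microseconds;
    std::string label; // function, file and line of the checkpoint
};

static std::vector<Checkpoint> checkpoints;

static uint64_t getMicroseconds(){
    timeval t;
    gettimeofday(&t,NULL);
    return uint64_t(t.tv_sec)*1000000+t.tv_usec;
}

// record one checkpoint; called at interesting places in the slave mode
static void tick(const char*label){
    Checkpoint c;
    c.microseconds=getMicroseconds();
    c.label=label;
    checkpoints.push_back(c);
}

// print the stack of deltas if the whole call exceeded the threshold
static void reportIfTooSlow(uint64_t thresholdInMicroseconds){
    if(checkpoints.size()>=2){
        uint64_t total=checkpoints.back().microseconds-checkpoints.front().microseconds;
        if(total>=thresholdInMicroseconds){
            printf("Warning, GranularityInMicroseconds= %lu\n",(unsigned long)total);
            for(size_t i=1;i<checkpoints.size();i++){
                uint64_t delta=checkpoints[i].microseconds-checkpoints[i-1].microseconds;
                printf("%lu +%lu from previous (%.2f%%) in %s\n",(unsigned long)i,
                    (unsigned long)delta,100.0*delta/total,checkpoints[i].label.c_str());
            }
        }
    }
    checkpoints.clear();
}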
This is what I found so far:
[1,3]<stdout>:Warning, SlaveMode= RAY_SLAVE_MODE_EXTENSION GranularityInMicroseconds= 16106
[1,3]<stdout>:Number of calls in the stack: 20
[1,3]<stdout>:0 1317227196433984 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 47
[1,3]<stdout>:1 1317227196433985 microseconds +1 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 72
[1,3]<stdout>:2 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 144
[1,3]<stdout>:3 1317227196433985 microseconds +0 from previous (0.00%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 221
[1,3]<stdout>:4 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 351
[1,3]<stdout>:5 1317227196433985 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 389
[1,3]<stdout>:6 1317227196433986 microseconds +1 from previous (0.01%) in doChoice inside code/assembler/SeedExtender.cpp at line 441
[1,3]<stdout>:7 1317227196433986 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 775
[1,3]<stdout>:8 1317227196433987 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 934
[1,3]<stdout>:9 1317227196433988 microseconds +1 from previous (0.01%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 960
[1,3]<stdout>:10 1317227196442360 microseconds +8372 from previous (51.98%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 989
[1,3]<stdout>:11 1317227196442651 microseconds +291 from previous (1.81%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 993
[1,3]<stdout>:12 1317227196442654 microseconds +3 from previous (0.02%) in storeExtensionAndGetNextOne inside code/assembler/SeedExtender.cpp at line 1002
[1,3]<stdout>:13 1317227196442655 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 72
[1,3]<stdout>:14 1317227196442656 microseconds +1 from previous (0.01%) in resetStructures inside code/assembler/ExtensionData.cpp at line 76
[1,3]<stdout>:15 1317227196447138 microseconds +4482 from previous (27.83%) in resetStructures inside code/assembler/ExtensionData.cpp at line 80
[1,3]<stdout>:16 1317227196450084 microseconds +2946 from previous (18.29%) in doChoice inside code/assembler/SeedExtender.cpp at line 883
[1,3]<stdout>:17 1317227196450087 microseconds +3 from previous (0.02%) in doChoice inside code/assembler/SeedExtender.cpp at line 886
[1,3]<stdout>:18 1317227196450087 microseconds +0 from previous (0.00%) in doChoice inside code/assembler/SeedExtender.cpp at line 888
[1,3]<stdout>:19 1317227196450089 microseconds +2 from previous (0.01%) in extendSeeds inside code/assembler/SeedExtender.cpp at line 229
[1,3]<stdout>:End of stack
So the problem is definitely not with Open MPI. Doing a round-robin MPI_Iprobe
(rotating the source given to MPI_Iprobe at each call) still helps a lot when
the granularity exceeds 128 microseconds.
I do think that George's patch (with my minor modification) would provide an
MPI_Iprobe that is fair to all drained messages (the round-robin thing), but
even the patch does not change anything for my problem with MPI_ANY_SOURCE.
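To be concrete, the round-robin thing is roughly this (a minimal sketch; the
function name and the bookkeeping are mine, not the exact Ray code):

#include <mpi.h>

// Probe one candidate source per call and advance the source for the next
// call, so that no sender can starve the others the way MPI_ANY_SOURCE can.
bool roundRobinProbe(int communicatorSize,int*rotatingSource,MPI_Status*status){
    int flag=0;
    MPI_Iprobe(*rotatingSource,MPI_ANY_TAG,MPI_COMM_WORLD,&flag,status);
    *rotatingSource=(*rotatingSource+1)%communicatorSize; // rotate for fairness
    return flag!=0;
}

// receiveMessages() would then do a matching MPI_Recv using status->MPI_SOURCE
// and status->MPI_TAG when the probe succeeds, and stop after one message.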
>
> I actually was thinking of his specific 256-process case. I agree that it
> doesn't scale arbitrarily.
>
I think it could scale arbitrarily with Open MPI ;) (and with any MPI
implementation respecting MPI 2.x, for that matter).
I just need to get my granularity below 128 microseconds for all the calls in
RAY_SLAVE_MODE_EXTENSION (which is Machine::call_RAY_SLAVE_MODE_EXTENSION() in
my code).
> Another approach would potentially be to break your 256 processes up into N
> sub-communicators of M each (where N * M = 256, obviously), and doing a
> non-blocking receive with ANY_SOURCE and then a WAIT_ANY on all of those.
>
I am not sure that would work in my code, as my architecture is like this:
while(running){
    receiveMessages();  // blazing fast, receives 0 or 1 message, never more;
                        // other messages will wait for the next iteration!
    processMessages();  // consumes the one received message, if any; also very fast
                        // because it is done with an array mapping tags to function pointers
    processData();      // should be fast, but apparently
                        // call_RAY_SLAVE_MODE_EXTENSION is slowish sometimes...
    sendMessages();     // fast, sends at most 17 messages; in most cases it is 0 or 1
}
If I *understand* what you said correctly, doing a WAIT_ANY inside Ray's
receiveMessages() would hang and/or significantly lower the speed of the loop,
which is not desirable.
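To make sure I read the suggestion correctly, this is roughly what I understand
it to be (a rough sketch; the buffer size, the tag and the helper name are
assumptions on my part; I am sketching the per-source version, and the
sub-communicator version would replace the per-source receives with one
ANY_SOURCE receive per sub-communicator):

#include <mpi.h>
#include <vector>

const int MAXIMUM_BYTES=4000; // assumed maximum message size

// Pre-post one MPI_ANY_TAG receive per possible source, then wait for any of
// them to complete; MPI_Testany would be the non-blocking variant.
void drainWithWaitany(int communicatorSize){
    std::vector<MPI_Request> requests(communicatorSize);
    std::vector<std::vector<char> > buffers(communicatorSize,
        std::vector<char>(MAXIMUM_BYTES));

    // this is the part that costs one pre-posted buffer per rank
    for(int source=0;source<communicatorSize;source++)
        MPI_Irecv(&buffers[source][0],MAXIMUM_BYTES,MPI_BYTE,source,
            MPI_ANY_TAG,MPI_COMM_WORLD,&requests[source]);

    int index=0;
    MPI_Status status;
    // blocks until one of the pre-posted receives completes, which is exactly
    // what I cannot afford inside receiveMessages()
    MPI_Waitany(communicatorSize,&requests[0],&index,&status);

    // ... handle buffers[index] using status.MPI_TAG, then re-post requests[index] ...
}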
I like to keep my loop at ~200000 iterations per 100 milliseconds (about 0.5
microseconds per iteration). This yields a very responsive system -- everyone
responds within 128 microseconds with my round-robin thing.
The response time is 10 microseconds on guillimin.clumeq.ca and 100 (it used to
be 250) on colosse.clumeq.ca if I use MPI_ANY_SOURCE
(as reported on the list, see
http://www.open-mpi.org/community/lists/users/2011/09/17321.php ),
but things get complicated in RAY_SLAVE_MODE_EXTENSION because of the
granularity problem there.
> The code gets a bit more complex, but it hypothetically extends your
> scalability.
>
> Or better yet, have your job mimic this idea -- a tree-based gathering
> system. Have not just 1 master, but N sub-masters. Individual compute
> processes report up to their sub-master, and the sub-master does whatever
> combinatorial work it can before reporting it to the ultimate master, etc.
Ray does have a MASTER_RANK, which is 0. But all the ranks, including 0, are
slave ranks too.
In processData():
/** process data by calling the current master and slave methods */
void Machine::processData(){
    MachineMethod masterMethod=m_master_methods[m_master_mode];
    (this->*masterMethod)();
    MachineMethod slaveMethod=m_slave_methods[m_slave_mode];
    (this->*slaveMethod)();
}
Obviously, m_master_mode is always RAY_MASTER_MODE_DO_NOTHING for any rank that
is not MASTER_RANK, which is quite simple to implement:
void Machine::call_RAY_MASTER_MODE_DO_NOTHING(){}
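For context, the wiring behind those tables looks roughly like this (a reduced
sketch; the mode enums and the constructor are simplified, not the real Ray
declarations):

class Machine;
typedef void(Machine::*MachineMethod)();

enum{RAY_MASTER_MODE_DO_NOTHING=0,RAY_MASTER_MODE_COUNT}; // simplified
enum{RAY_SLAVE_MODE_EXTENSION=0,RAY_SLAVE_MODE_COUNT};    // simplified

class Machine{
    MachineMethod m_master_methods[RAY_MASTER_MODE_COUNT];
    MachineMethod m_slave_methods[RAY_SLAVE_MODE_COUNT];
    int m_master_mode;
    int m_slave_mode;
public:
    Machine(){
        // each mode is mapped to a method once; processData() just
        // dereferences the two tables at every iteration of the main loop
        m_master_methods[RAY_MASTER_MODE_DO_NOTHING]=&Machine::call_RAY_MASTER_MODE_DO_NOTHING;
        m_slave_methods[RAY_SLAVE_MODE_EXTENSION]=&Machine::call_RAY_SLAVE_MODE_EXTENSION;
        m_master_mode=RAY_MASTER_MODE_DO_NOTHING;
        m_slave_mode=RAY_SLAVE_MODE_EXTENSION;
    }
    void call_RAY_MASTER_MODE_DO_NOTHING(){}
    void call_RAY_SLAVE_MODE_EXTENSION(){/* the slow part being profiled */}
    void processData(); // as shown above
};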
So, although I understand that the tree-based gathering system you describe
would act as some sort of virtual network (like routing packets on the
Internet), I don't think it would help here, because the computation
granularity in call_RAY_SLAVE_MODE_EXTENSION() is above 128 microseconds anyway
(I only discovered that today, my bad).
>
> It depends on your code and how much delegation is possible, how much data
> you're transferring over the network, how much fairness you want to
> guarantee, etc. My point is that there are a bunch of
> different options you can pursue outside of the "everyone sends to 1 master"
> model.
>
My communication model is more distributed than "everyone sends to 1 master":
it is "everyone sends to everyone in a respectful way".
By "respectful way", I mean that rank A waits for rank B's reply to its message
before sending anything else to rank B.
Because of that:
- Open MPI's buffers are happy,
- memory usage is happy, and
- the byte transfer layers are not saturated at all, and thus are happy too.
Destinations are mostly random because of my hash-based domain decomposition of
the genomic/biological data.
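In code, that rule is just a per-destination flag (a compressed sketch; the
class and method names are illustrative, not the actual Ray code):

#include <vector>

class MessagePermits{
    std::vector<bool> m_outstanding; // one flag per destination rank
public:
    MessagePermits(int communicatorSize):m_outstanding(communicatorSize,false){}

    // rank A may send to destination B only if no request to B is in flight
    bool canSendTo(int destination)const{return !m_outstanding[destination];}
    void markSent(int destination){m_outstanding[destination]=true;}
    // called when the reply from that destination has been received
    void markReplyReceived(int destination){m_outstanding[destination]=false;}
};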
I will thus improve my granularity, but I would nonetheless like to see
George's patch merged into Open MPI's trunk, as fairness is always desirable in
networking algorithms.
Thanks a lot !
Sébastien Boisvert
PhD student
http://boisvert.info
> --
> Jeff Squyres
> [email protected]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/