Hello,
We did more tests concerning the latency using 512 MPI ranks
on our super-computer (64 machines * 8 cores per machine).
By default in Ray, any rank can communicate directly with any other.
Thus we have a complete graph with 512 vertices and 130816 edges (512*511/2)
where vertices are ranks and edges are communication links.
When a rank sends to itself, the route length is 0 edges; otherwise, the
route length is 1 edge. However, 130816 is a lot of edges.
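For reference, the 130816 figure is just the complete-graph edge count
n*(n-1)/2; a minimal check in Python:

```python
# Edges in a complete graph on n vertices: one communication link
# per unordered pair of distinct ranks.
def complete_graph_edges(n):
    return n * (n - 1) // 2

print(complete_graph_edges(512))  # 130816
```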
With this complete graph, the average latency when requesting a reply
for a 4000-byte message with Ray on our super-computer is
386 microseconds (standard deviation: 9).
Recently, Jeff Squyres highlighted that using such a communication pattern
is not recommended and "there are a bunch of different options you can
pursue."
See http://www.open-mpi.org/community/lists/devel/2011/09/9773.php
By pursuing different options, we reduced the latency to 158 microseconds
(standard deviation: 15). This is a drop of 59%.
To do so, we added a transparent message router in Ray.
First, a random graph is created with n vertices and n*log2(n)/2 edges
selected randomly from the n*(n-1)/2 possible edges. The idea is that,
on average, each rank has a degree of log2(n) instead of n-1.
With 512 ranks, this random graph has 2304 edges (512*9/2),
down from 130816 edges.
Note that this is not a 9-regular graph (not all vertices have a degree
of 9), but the average degree is 9.
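A minimal sketch of this construction (the function name and the use of
Python's random.sample are mine; Ray's actual implementation may differ):

```python
import math
import random

def random_sparse_graph(n, seed=1):
    # All n*(n-1)/2 possible edges of the complete graph on n ranks.
    all_edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Keep n*log2(n)/2 of them, chosen uniformly at random, so the
    # average degree is log2(n) instead of n-1.
    k = n * int(math.log2(n)) // 2
    return random.Random(seed).sample(all_edges, k)

edges = random_sparse_graph(512)
print(len(edges))            # 2304, down from 130816
print(2 * len(edges) / 512)  # average degree: 9.0
```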
Then, shortest routes are computed with Dijkstra's algorithm, modified
to choose the least saturated route when more than one route has the
same length.
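A sketch of such a tie-breaking Dijkstra. The saturation measure is an
assumption on my part: here the caller supplies a per-edge load table,
and ties in route length are broken by the load of the arrival edge;
Ray's actual criterion may differ.

```python
import heapq
from collections import defaultdict

def shortest_routes(n, edges, src, load):
    """Single-source Dijkstra. All edges have length 1; when two routes
    to a vertex are equally long, prefer the one whose arrival edge is
    less saturated. `load` maps an edge (a, b), a < b, to its current
    saturation (a hypothetical measure, maintained by the caller)."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    # vertex -> (route length, saturation of the edge used to reach it)
    dist = {src: (0, 0)}
    parent = {src: None}
    heap = [(0, 0, src)]
    while heap:
        d, sat, u = heapq.heappop(heap)
        if (d, sat) > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            e = (min(u, v), max(u, v))
            cand = (d + 1, load.get(e, 0))
            if v not in dist or cand < dist[v]:
                dist[v] = cand
                parent[v] = u
                heapq.heappush(heap, (cand[0], cand[1], v))
    return dist, parent

# Ring on 4 ranks: route 0 -> 2 has length 2, route 0 -> 3 has length 1.
dist, _ = shortest_routes(4, [(0, 1), (1, 2), (2, 3), (0, 3)], 0, {})
print(dist[2][0], dist[3][0])  # 2 1
```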
The route lengths are quite small.
Frequencies (route length, count, percentage):
0     512  0.195312%   # a rank sends a message to itself
1    4608  1.75781%    # message to a directly connected rank
2   37644  14.36%
3  152972  58.3542%    # most of them
4   65710  25.0664%
5     698  0.266266%
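These counts cover all 512*512 ordered (source, destination) pairs; a
quick check of the total and the average route length:

```python
# Route-length frequencies from the 512-rank experiment above.
freq = {0: 512, 1: 4608, 2: 37644, 3: 152972, 4: 65710, 5: 698}

total = sum(freq.values())
print(total)  # 262144 == 512 * 512 ordered pairs

avg = sum(length * count for length, count in freq.items()) / total
print(round(avg, 2))  # about 3.07 edges per route
```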
So my question is:
Does that indicate where the real problem is on our super-computer?
Thanks a lot!
Also, would transparent message routing be easy to implement
directly in Open MPI as a component?
Sébastien Boisvert
http://boisvert.info
On 26/09/11 08:46 AM, Yevgeny Kliteynik wrote:
On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote:
On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
What happens if you run 2 ibv_rc_pingpong's on each node? Or N
ibv_rc_pingpongs?
With 11 ibv_rc_pingpong's
http://pastebin.com/85sPcA47
Code to do that => https://gist.github.com/1233173
Latencies are around 20 microseconds.
This seems to imply that the network is to blame for the higher latency...?
Interesting... I'm getting the same latency with ibv_rc_pingpong.
I get 8.5 usec for a single ping-pong.
BTW, I've just checked this with performance guys - ibv_rc_pingpong
is not used for performance measurement but only as IB network
sanity check, therefore it was never meant to give optimal performance.
Use ib_write_lat instead.
-- YK
Please run 'ibclearcounters' to reset fabric counters, then
ibdiagnet to make sure that the fabric is clean.
If you have 4x QDR cluster, run ibdiagnet as follows:
ibdiagnet --ls 10 --lw 4x
Check that you don't have any errors/warnings.
Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
Just replace the command in the script and the rest would be fine.
If the fabric is clean, you're supposed to get typical
latency of ~1.4 usec.
-- YK
I.e., if you run the same pattern with MPI processes and get 20us latency, that
would tend to imply that the network itself is not performing well with that IO
pattern.
My job seems to do well so far with ofud!
[sboisver12@colosse2 ray]$ qstat
job-ID prior name user state submit/start at queue
slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med@r104-n58
256
I would still be suspicious -- ofud is not well tested, and it can definitely
hang if there are network drops.