Hello,
We did more tests concerning the latency using 512 MPI ranks
on our super-computer (64 machines * 8 cores per machine).
By default in Ray, any rank can communicate directly with any other.
Thus we have a complete graph with 512 vertices and 130816 edges (512*511/2)
where vertices are ranks and edges are communication links.
When a rank sends to itself, the route length is 0 edges; otherwise, the
route length is 1 edge. However, 130816 is a lot of edges.
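For reference, the 130816 figure is just the complete-graph edge count
n*(n-1)/2; a minimal check in Python:

```python
# Edges in a complete graph on n vertices: one communication link
# per unordered pair of distinct ranks.
def complete_graph_edges(n):
    return n * (n - 1) // 2

print(complete_graph_edges(512))  # 130816
```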
With this complete graph, the average latency when requesting a reply
for a 4000-byte message with Ray on our super-computer is
386 microseconds (standard deviation: 9).
Recently, Jeff Squyres highlighted that using such a communication pattern
is not recommended and "there are a bunch of different options you can
pursue."
See http://www.open-mpi.org/community/lists/devel/2011/09/9773.php
By pursuing different options, we reduced the latency to 158 microseconds
(standard deviation: 15). This is a drop of 59%.
To do so, we added a transparent message router in Ray.
First, a random graph is created with n vertices and n*log2(n)/2 edges
selected randomly from the n*(n-1)/2 possible edges. The idea is that,
on average, each rank has a degree of log2(n) instead of n-1.
With 512 ranks, this random graph has 2304 edges (512*9/2),
down from 130816 edges.
Note that this is not a 9-regular graph (not all vertices have a degree
of 9), but the average degree is 9.
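A minimal sketch of this construction (the function name and the use of
Python's random.sample are mine; Ray's actual implementation may differ):

```python
import math
import random

def random_sparse_graph(n, seed=1):
    # All n*(n-1)/2 possible edges of the complete graph on n ranks.
    all_edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Keep n*log2(n)/2 of them, chosen uniformly at random, so the
    # average degree is log2(n) instead of n-1.
    k = n * int(math.log2(n)) // 2
    return random.Random(seed).sample(all_edges, k)

edges = random_sparse_graph(512)
print(len(edges))            # 2304, down from 130816
print(2 * len(edges) / 512)  # average degree: 9.0
```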
Then, shortest routes are computed with Dijkstra's algorithm, modified
to choose the least saturated route when more than one route has the
same length.
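A sketch of such a tie-breaking Dijkstra. The saturation measure is an
assumption on my part: here the caller supplies a per-edge load table,
and ties in route length are broken by the load of the arrival edge;
Ray's actual criterion may differ.

```python
import heapq
from collections import defaultdict

def shortest_routes(n, edges, src, load):
    """Single-source Dijkstra. All edges have length 1; when two routes
    to a vertex are equally long, prefer the one whose arrival edge is
    less saturated. `load` maps an edge (a, b), a < b, to its current
    saturation (a hypothetical measure, maintained by the caller)."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    # vertex -> (route length, saturation of the edge used to reach it)
    dist = {src: (0, 0)}
    parent = {src: None}
    heap = [(0, 0, src)]
    while heap:
        d, sat, u = heapq.heappop(heap)
        if (d, sat) > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            e = (min(u, v), max(u, v))
            cand = (d + 1, load.get(e, 0))
            if v not in dist or cand < dist[v]:
                dist[v] = cand
                parent[v] = u
                heapq.heappush(heap, (cand[0], cand[1], v))
    return dist, parent

# Ring on 4 ranks: route 0 -> 2 has length 2, route 0 -> 3 has length 1.
dist, _ = shortest_routes(4, [(0, 1), (1, 2), (2, 3), (0, 3)], 0, {})
print(dist[2][0], dist[3][0])  # 2 1
```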
The route lengths are quite small.
Frequencies (route length, count, percentage):
0     512  0.195312%   # a rank sends a message to itself
1    4608  1.75781%    # message to a directly connected rank
2   37644  14.36%
3  152972  58.3542%    # most of them
4   65710  25.0664%
5     698  0.266266%
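These counts cover all 512*512 ordered (source, destination) pairs; a
quick check of the total and the average route length:

```python
# Route-length frequencies from the 512-rank experiment above.
freq = {0: 512, 1: 4608, 2: 37644, 3: 152972, 4: 65710, 5: 698}

total = sum(freq.values())
print(total)  # 262144 == 512 * 512 ordered pairs

avg = sum(length * count for length, count in freq.items()) / total
print(round(avg, 2))  # about 3.07 edges per route
```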
So my question is:
Does that indicate where the real problem is on our super-computer?
Thanks a lot!
Also, would transparent message routing be easy to implement
directly in Open MPI as a component?
Sébastien Boisvert
http://boisvert.info
On 26/09/11 08:46 AM, Yevgeny Kliteynik wrote:
On 26-Sep-11 11:27 AM, Yevgeny Kliteynik wrote:
On 22-Sep-11 12:09 AM, Jeff Squyres wrote:
On Sep 21, 2011, at 4:24 PM, Sébastien Boisvert wrote:
What happens if you run 2 ibv_rc_pingpong's on each node? Or N
ibv_rc_pingpongs?
With 11 ibv_rc_pingpong's
http://pastebin.com/85sPcA47
Code to do that => https://gist.github.com/1233173
Latencies are around 20 microseconds.
This seems to imply that the network is to blame for the higher latency...?
Interesting... I'm getting the same latency with ibv_rc_pingpong.
I get 8.5 usec for a single ping-pong.
BTW, I've just checked this with performance guys - ibv_rc_pingpong
is not used for performance measurement but only as IB network
sanity check, therefore it was never meant to give optimal performance.
Use ib_write_lat instead.
-- YK
Please run 'ibclearcounters' to reset fabric counters, then
ibdiagnet to make sure that the fabric is clean.
If you have 4x QDR cluster, run ibdiagnet as follows:
ibdiagnet --ls 10 --lw 4x
Check that you don't have any errors/warnings.
Then please run your script with ib_write_lat instead of ibv_rc_pingpong.
Just replace the command in the script and the rest would be fine.
If the fabric is clean, you're supposed to get typical
latency of ~1.4 usec.
-- YK
I.e., if you run the same pattern with MPI processes and get 20us latency, that
would tend to imply that the network itself is not performing well with that IO
pattern.
My job seems to do well so far with ofud!
[sboisver12@colosse2 ray]$ qstat
job-ID prior name user state submit/start at queue
slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3047460 0.55384 fish-Assem sboisver12 r 09/21/2011 15:02:25 med@r104-n58
256
I would still be suspicious -- ofud is not well tested, and it can definitely
hang if there are network drops.