Hi Yiftah,

Yiftah Shahar wrote:
Al, Yevgeny,

I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.
Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI's configuration (blindly accepting slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?
I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file to tell it how ranks should be assigned to nodes for
MPI.  I assume that's what you're talking about?

Al
The upcoming Open MPI 1.3 will have such rank-placement capabilities
(placing a rank on a specific node and a specific CPU), and we will also
have some settings that decide how to communicate with the different HCAs
in a multi-HCA node (we have also had these capabilities in VLT-MPI for
more than 2 years now, but it is going into its EOL stage...).
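
For reference, this kind of rank placement is expressed in Open MPI's
rankfile syntax along these lines (the hostnames and slot numbers below
are made up for illustration):

```
rank 0=node0 slot=0
rank 1=node0 slot=1
rank 2=node1 slot=0
rank 3=node1 slot=1
```

The file is handed to mpirun via its rankfile option.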

I think that more important than rank placement is the communication
pattern (i.e. some ranks communicate a lot and some do not send a single
message), and this is far more complicated to handle.

Both are important.
In routing we are dealing with congestion when there is some
communication that involves many nodes. However, the communication
is usually not random - it has a pattern, and this pattern is
affected by ranks.

In some patterns (such as "shift") all the nodes send something
at every pattern stage; in others (such as "recursive doubling")
some nodes send all the time, and others rarely.
In addition to that, there are optimizations that reduce IB
communication even more by having MPI processes on the same host
communicate over shared memory, with a single "representative"
process doing the IB communication for all of them. I think that
MVAPICH1 and Open MPI do this.
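
As a toy illustration (my own sketch, not anything from the patches): counting
transmissions per rank on 8 ranks shows the difference between the two kinds of
patterns mentioned above.

```python
# Toy sketch (illustration only): per-rank send counts for two
# collective patterns on N = 8 ranks.
N = 8

# "Shift" pattern: at stage s, every rank r sends to (r + s) % N,
# so every rank transmits at every stage.
shift_sends = {r: 0 for r in range(N)}
for stage in range(1, N):
    for r in range(N):
        shift_sends[r] += 1

# Binomial-tree broadcast ("recursive doubling" style): only ranks
# that already hold the data send, so rank 0 sends at every stage
# while the ranks reached last never send at all.
bcast_sends = {r: 0 for r in range(N)}
have_data = {0}
stride = 1
while stride < N:
    for r in sorted(have_data):
        peer = r + stride
        if peer < N and peer not in have_data:
            bcast_sends[r] += 1
            have_data.add(peer)
    stride *= 2

print(shift_sends)  # every rank sends N - 1 = 7 times
print(bcast_sends)  # {0: 3, 1: 2, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
```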

However, no matter how optimized the pattern is, in the end it has
to transmit something on the wire, so if OpenSM doesn't produce
balanced routing, you might get a single congested wire that delays
each and every stage of the MPI communication pattern.

Theoretically, the best result could be achieved if OpenSM
and MPI worked together - OpenSM would produce some kind
of list describing the topological order of the nodes,
and MPI would somehow use this info when assigning ranks.

-- Yevgeny

Yiftah


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general-[EMAIL PROTECTED] On Behalf Of Al Chu
Sent: Monday, June 16, 2008 23:09
To: [EMAIL PROTECTED]
Cc: OpenIB
Subject: Re: [ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
Hey Yevgeny,

On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
Hi Al,

Al Chu wrote:
Hey Sasha,

This is a conceptually simple option I've developed for updn routing.
Currently in updn routing, nodes/guids are routed on switches in a
seemingly-random order, which I believe is due to internal data
structure organization (i.e. cl_qmap_apply_func is called on
port_guid_tbl) as well as how the fabric is scanned (it is logically
scanned from a port perspective, but it may not be logical from a node
perspective).  I had a hypothesis that this was leading to increased
contention in the network for MPI.

For example, suppose we have 12 uplinks from a leaf switch to a spine
switch.  If we want to send data from this leaf switch to node[13-24],
the uplinks we will send on are pretty random. It's because:

A) node[13-24] are individually routed at seemingly-random points based
on when they are called by cl_qmap_apply_func().
B) the ports chosen for routing are based on least-used port usage.
C) least-used port usage is based on whatever was routed earlier on.

So I developed this patch series, which supports an option called
"guid_routing_order_file" which allows the user to input a file with a
list of port_guids indicating the order in which guids are routed
instead (naturally, those guids not listed are routed last).
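
For concreteness, usage would look roughly like this (the path and GUIDs
below are placeholders, not from a real fabric):

```
# in the opensm configuration:
#   guid_routing_order_file /etc/opensm/guid_order.txt

# /etc/opensm/guid_order.txt -- one port_guid per line,
# node0 first, nodeN last; unlisted guids are routed last
0x0002c90300000001
0x0002c90300000002
0x0002c90300000003
```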
Great idea!
Thanks.

I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.
Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI's configuration (blindly accepting slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?
I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file to tell it how ranks should be assigned to nodes for
MPI.  I assume that's what you're talking about?

Al

Another idea: OpenSM can create such a file (a list - it doesn't have to
be an actual file) automatically, just by checking topologically-adjacent
leaf switches and their HCAs.
Definitely a good idea.  This patch set was just a "step one" kind of
thing.

I list the port guids of the nodes of the cluster from node0 to nodeN,
one per line in the file.  By listing the nodes in this order, I believe
we could get less contention in the network.  In the example above,
sending to node[13-24] should use all of the 12 uplinks, b/c the ports
will be equally used b/c nodes[1-12] were routed beforehand in order.
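
The uplink-balancing argument can be sketched with a toy model (mine, not
OpenSM code) of the least-used-port rule described in (B)/(C) above:

```python
import random

# Toy model of least-used-port selection: each destination LID is
# assigned the currently least-loaded of 12 uplinks, where the load
# is whatever earlier routing decisions already put on each port.
def route(dest_lids, uplinks=12, preload=None):
    load = list(preload) if preload else [0] * uplinks
    chosen = {}
    for lid in dest_lids:
        port = min(range(uplinks), key=lambda i: load[i])
        chosen[lid] = port
        load[port] += 1
    return chosen

# Routed in node order on even counters, node[13-24] spread across
# all 12 uplinks -- one destination per wire.
ordered = route(range(13, 25))
print(len(set(ordered.values())))  # 12

# With uneven counters left behind by earlier, randomly ordered
# routing, several destinations often pile onto the same uplink.
skewed = route(range(13, 25), preload=[random.randint(0, 3) for _ in range(12)])
print(len(set(skewed.values())))  # frequently fewer than 12
```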

The results from some tests are pretty impressive when I do this.  With
LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678
MB/s when I use guid_routing_order.
Can you compare this to the fat-tree routing?  Conceptually, fat-tree
is doing the same - it routes LIDs on nodes in a topological order, so
it would be interesting to see the comparison.
Actually I already did :-).  w/ LMC=0.

updn default - 391.374 MB/s
updn w/ guid_routing_order - 573.678 MB/s
ftree - 579.603 MB/s

I later discovered that one of the internal ports of the cluster I'm
testing on was broken (sLB of a 288 port), and think that is the cause
of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
designed) seemed to be able to work around it properly, while updn (as
currently implemented) couldn't.

When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
to do better on some tests than ftree.  One example (I think these
numbers are in microseconds; lower is better):

Alltoall 16K packets
ftree - 415490.6919
updn normal (LMC=0) - 495460.5526
updn w/ ordered routing (LMC=0) - 416562.7417
updn w/ ordered routing (LMC=1) - 453153.7289
 - this ^^^ result is quite odd.  Not sure why.
updn w/ ordered routing (LMC=2) - 3660132.1530

We are regularly debating what will be better overall at the end of the
day.

Also, fat-tree produces the guid order file automatically, but nobody
has used it yet as an input to produce an MPI rank file.
I didn't know about this option.  How do you do this (just skimmed the
manpage, didn't see anything)?  I know about the --cn_guid_file.  But
since that file doesn't have to be ordered, that's why I created a
different option (rather than have the cn_guid_file for both ftree and
updn).

Al

-- Yevgeny

A variety of other positive performance increases were found when doing
other tests, other MPIs, and other LMCs, if anyone is interested.

BTW, I developed this patch series before your preserve-base-lid patch
series.  It will 100% conflict with the preserve-base-lid patch series.
I will fix this patch series once the preserve-base-lids patch series is
committed to git.  I'm just looking for comments right now.

Al

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

