Hi Yiftah,

Yiftah Shahar wrote:
Al, Yevgeny,

I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.
Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI's configuration (blindly accepting slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?
I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file to tell it how ranks should be assigned to nodes for
MPI.  I assume that's what you're talking about?

Al
The upcoming Open MPI 1.3 will have such rank-placement capabilities
(placing a rank on a specific node and a specific CPU), and we will also
have some settings that decide how to communicate with the different HCAs
in a multi-HCA node (we have also had these capabilities in VLT-MPI for
more than 2 years now, but it is going into its EOL stage...).
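
For reference, this kind of rank placement is expressed in Open MPI's
rankfile syntax along these lines (the hostnames and slot numbers below
are made up for illustration):

```
rank 0=node0 slot=0
rank 1=node0 slot=1
rank 2=node1 slot=0
rank 3=node1 slot=1
```

The file is handed to mpirun via its rankfile option.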

I think that more important than rank placement is the communication
pattern (i.e. some ranks communicate a lot and some do not send a single
message), and this is far more complicated to handle.

Both are important.
In routing we are dealing with congestion when there is some
communication that involves many nodes. However, the communication
is usually not random - it has a pattern, and this pattern is
affected by ranks.

In some patterns (such as "shift") all the nodes send something
at every pattern stage; in others (such as "recursive doubling")
some nodes send all the time, and others rarely.
In addition to that, there are optimizations that reduce IB
communication even more by having MPI processes on the same host
communicate over shared memory, with a single "representative"
process doing the IB communication for all of them. I think that
MVAPICH1 and Open MPI do this.
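
As a toy illustration (my own sketch, not anything from the patches): counting
transmissions per rank on 8 ranks shows the difference between the two kinds of
patterns mentioned above.

```python
# Toy sketch (illustration only): per-rank send counts for two
# collective patterns on N = 8 ranks.
N = 8

# "Shift" pattern: at stage s, every rank r sends to (r + s) % N,
# so every rank transmits at every stage.
shift_sends = {r: 0 for r in range(N)}
for stage in range(1, N):
    for r in range(N):
        shift_sends[r] += 1

# Binomial-tree broadcast ("recursive doubling" style): only ranks
# that already hold the data send, so rank 0 sends at every stage
# while the ranks reached last never send at all.
bcast_sends = {r: 0 for r in range(N)}
have_data = {0}
stride = 1
while stride < N:
    for r in sorted(have_data):
        peer = r + stride
        if peer < N and peer not in have_data:
            bcast_sends[r] += 1
            have_data.add(peer)
    stride *= 2

print(shift_sends)  # every rank sends N - 1 = 7 times
print(bcast_sends)  # {0: 3, 1: 2, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 0}
```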

However, no matter how optimized the pattern is, in the end it has
to transmit something on the wire, so if OpenSM doesn't produce
balanced routing, you might get a single congested wire that delays
each and every stage of the MPI communication pattern.

Theoretically, the best result could be achieved if OpenSM
and MPI worked together - OpenSM would produce some kind
of list describing the topological order of the nodes,
and MPI would somehow use this info when assigning ranks.

-- Yevgeny

Yiftah


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:general-[EMAIL PROTECTED] On Behalf Of Al Chu
Sent: Monday, June 16, 2008 23:09
To: [EMAIL PROTECTED]
Cc: OpenIB
Subject: Re: [ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order" option for updn routing

On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
Hey Yevgeny,

On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
Hi Al,

Al Chu wrote:
Hey Sasha,

This is a conceptually simple option I've developed for updn routing.
Currently in updn routing, nodes/guids are routed on switches in a
seemingly-random order, which I believe is due to internal data
structure organization (i.e. cl_qmap_apply_func is called on
port_guid_tbl) as well as how the fabric is scanned (it is logically
scanned from a port perspective, but it may not be logical from a node
perspective).  I had a hypothesis that this was leading to increased
contention in the network for MPI.

For example, suppose we have 12 uplinks from a leaf switch to a spine
switch.  If we want to send data from this leaf switch to node[13-24],
the uplinks we will send on are pretty random. It's because:

A) node[13-24] are individually routed at seemingly-random points based
on when they are called by cl_qmap_apply_func().
B) the ports chosen for routing are based on least-used port usage.
C) least-used port usage is based on whatever was routed earlier on.

So I developed this patch series, which supports an option called
"guid_routing_order_file" which allows the user to input a file with a
list of port_guids indicating the order in which guids are routed
instead (naturally, those guids not listed are routed last).
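
For concreteness, usage would look roughly like this (the path and GUIDs
below are placeholders, not from a real fabric):

```
# in the opensm configuration:
#   guid_routing_order_file /etc/opensm/guid_order.txt

# /etc/opensm/guid_order.txt -- one port_guid per line,
# node0 first, nodeN last; unlisted guids are routed last
0x0002c90300000001
0x0002c90300000002
0x0002c90300000003
```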
Great idea!
Thanks.

I understand that this guid_routing_order_file is synchronized with
an MPI rank file, right? If not, then synchronizing them might give
even better results.
Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI's configuration (blindly accepting slurm MPI rankings).
Is there an underlying file that MPI libs use for ranking knowledge?
I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file to tell it how ranks should be assigned to nodes for
MPI.  I assume that's what you're talking about?

Al

Another idea: OpenSM can create such a file (a list - it doesn't have to
be an actual file) automatically, just by checking topologically-adjacent
leaf switches and their HCAs.
Definitely a good idea.  This patch set was just a "step one" kind of
thing.

I list the port guids of the nodes of the cluster from node0 to nodeN,
one per line in the file.  By listing the nodes in this order, I believe
we could get less contention in the network.  In the example above,
sending to node[13-24] should use all of the 12 uplinks, b/c the ports
will be equally used b/c nodes[1-12] were routed beforehand in order.
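
The uplink-balancing argument can be sketched with a toy model (mine, not
OpenSM code) of the least-used-port rule described in (B)/(C) above:

```python
import random

# Toy model of least-used-port selection: each destination LID is
# assigned the currently least-loaded of 12 uplinks, where the load
# is whatever earlier routing decisions already put on each port.
def route(dest_lids, uplinks=12, preload=None):
    load = list(preload) if preload else [0] * uplinks
    chosen = {}
    for lid in dest_lids:
        port = min(range(uplinks), key=lambda i: load[i])
        chosen[lid] = port
        load[port] += 1
    return chosen

# Routed in node order on even counters, node[13-24] spread across
# all 12 uplinks -- one destination per wire.
ordered = route(range(13, 25))
print(len(set(ordered.values())))  # 12

# With uneven counters left behind by earlier, randomly ordered
# routing, several destinations often pile onto the same uplink.
skewed = route(range(13, 25), preload=[random.randint(0, 3) for _ in range(12)])
print(len(set(skewed.values())))  # frequently fewer than 12
```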

The results from some tests are pretty impressive when I do this.  With
LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678
MB/s when I use guid_routing_order.
Can you compare this to the fat-tree routing?  Conceptually, fat-tree
is doing the same - it routes LIDs on nodes in a topological order, so
it would be interesting to see the comparison.
Actually I already did :-).  w/ LMC=0.

updn default - 391.374 MB/s
updn w/ guid_routing_order - 573.678 MB/s
ftree - 579.603 MB/s

I later discovered that one of the internal ports of the cluster I'm
testing on was broken (sLB of a 288 port), and think that is the cause
of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
designed) seemed to be able to work around it properly, while updn (as
currently implemented) couldn't.

When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
to do better on some tests than ftree.  One example (I think these
numbers are in microseconds; lower is better):

Alltoall 16K packets
ftree - 415490.6919
updn normal (LMC=0) - 495460.5526
updn w/ ordered routing (LMC=0) - 416562.7417
updn w/ ordered routing (LMC=1) - 453153.7289
 - this ^^^ result is quite odd.  Not sure why.
updn w/ ordered routing (LMC=2) - 3660132.1530

We are regularly debating what will be better overall at the end of the
day.

Also, fat-tree produces the guid order file automatically, but nobody
has used it yet as an input to produce an MPI rank file.
I didn't know about this option.  How do you do this (just skimmed the
manpage, didn't see anything)?  I know about the --cn_guid_file.  But
since that file doesn't have to be ordered, that's why I created a
different option (rather than have the cn_guid_file for both ftree and
updn).

Al

-- Yevgeny

A variety of other positive performance increases were found when doing
other tests, other MPIs, and other LMCs, if anyone is interested.

BTW, I developed this patch series before your preserve-base-lid patch
series.  It will 100% conflict with the preserve-base-lid patch series.
I will fix this patch series once the preserve-base-lids patch series is
committed to git.  I'm just looking for comments right now.

Al

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

