On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
> Hey Yevgeny,
>
> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> > Hi Al,
> >
> > Al Chu wrote:
> > > Hey Sasha,
> > >
> > > This is a conceptually simple option I've developed for updn
> > > routing.
> > >
> > > Currently in updn routing, nodes/guids are routed on switches in a
> > > seemingly random order, which I believe is due to internal data
> > > structure organization (i.e. cl_qmap_apply_func is called on
> > > port_guid_tbl) as well as how the fabric is scanned (it is
> > > logically scanned from a port perspective, but it may not be
> > > logical from a node perspective).  I had a hypothesis that this
> > > was leading to increased contention in the network for MPI.
> > >
> > > For example, suppose we have 12 uplinks from a leaf switch to a
> > > spine switch.  If we want to send data from this leaf switch to
> > > node[13-24], the uplinks we will send on are pretty random,
> > > because:
> > >
> > > A) node[13-24] are individually routed at seemingly random points,
> > >    based on when they are called by cl_qmap_apply_func().
> > >
> > > B) the ports chosen for routing are based on least-used port
> > >    usage.
> > >
> > > C) least-used port usage depends on whatever was routed earlier.
> > >
> > > So I developed this patch series, which adds an option called
> > > "guid_routing_order_file" that lets the user input a file with a
> > > list of port_guids indicating the order in which guids are routed
> > > instead (naturally, guids not listed are routed last).
> >
> > Great idea!
>
> Thanks.
>
> > I understand that this guid_routing_order_file is synchronized with
> > an MPI rank file, right?  If not, then synchronizing them might
> > give even better results.
>
> Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
> responsible for MPI ranks, so I order the guids in my file according
> to how slurm is configured for choosing MPI ranks.
> I will admit to being a novice to MPI's configuration (blindly
> accepting slurm MPI rankings).  Is there an underlying file that MPI
> libs use for ranking knowledge?
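As a side note, the uplink-spreading argument in A/B/C above can be sketched with a toy model.  This is just illustrative Python, not OpenSM code; the "least-used port, lowest index wins ties" rule here is a simplification of what updn's port selection actually does:

```python
# Toy model (not OpenSM code): a leaf switch with 12 uplinks, where
# each destination is assigned the currently least-used uplink, in the
# order the destinations happen to be routed.
import random

NUM_UPLINKS = 12

def assign_uplinks(dest_order):
    """Assign each destination the least-used uplink, in the given order."""
    usage = [0] * NUM_UPLINKS
    assignment = {}
    for dest in dest_order:
        port = usage.index(min(usage))  # least-used port; lowest index wins ties
        usage[port] += 1
        assignment[dest] = port
    return assignment

nodes = [f"node{i}" for i in range(1, 25)]  # node1..node24

# Ordered routing: node[1-12] fill ports 0..11 first, so node[13-24]
# then land exactly one per uplink.
ordered = assign_uplinks(nodes)
print(sorted(ordered[f"node{i}"] for i in range(13, 25)))
# -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# Seemingly-random routing order: total port usage still balances out,
# but node[13-24] may share some uplinks while leaving others unused.
shuffled = nodes[:]
random.shuffle(shuffled)
randomized = assign_uplinks(shuffled)
print(sorted(randomized[f"node{i}"] for i in range(13, 25)))
```

With the ordered guid list, sends from the leaf switch to node[13-24] use all 12 uplinks; with a shuffled routing order, some uplinks typically end up carrying two of node[13-24] while others carry none, even though overall port usage is identical.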
I spoke to one of our MPI guys.  I wasn't aware that in some MPIs you
can input a file to tell the library how ranks should be assigned to
nodes.  I assume that's what you're talking about?

Al

> > Another idea: OpenSM can create such a file (a list -- it doesn't
> > have to be an actual file) automatically, just by checking
> > topologically adjacent leaf switches and their HCAs.
>
> Definitely a good idea.  This patch set was just a "step one" kind
> of thing.
>
> > > I list the port guids of the nodes of the cluster from node0 to
> > > nodeN, one per line in the file.  By listing the nodes in this
> > > order, I believe we could get less contention in the network.  In
> > > the example above, sending to node[13-24] should use all of the
> > > 12 uplinks, b/c the ports will be equally used, b/c nodes[1-12]
> > > were routed beforehand in order.
> > >
> > > The results from some tests are pretty impressive when I do this.
> > > LMC=0 average bandwidth in mpiGraph goes from 391.374 MB/s to
> > > 573.678 MB/s when I use guid_routing_order.
> >
> > Can you compare this to the fat-tree routing?  Conceptually,
> > fat-tree does the same thing -- it routes LIDs on nodes in a
> > topological order -- so it would be interesting to see the
> > comparison.
>
> Actually I already did :-), w/ LMC=0:
>
> updn default               - 391.374 MB/s
> updn w/ guid_routing_order - 573.678 MB/s
> ftree                      - 579.603 MB/s
>
> I later discovered that one of the internal ports of the cluster I'm
> testing on was broken (the sLB of a 288-port switch), and I think
> that is the cause of some of the slowdown w/ updn w/
> guid_routing_order.  So ftree (as designed) seemed to be able to
> work around it properly, while updn (as currently implemented)
> couldn't.
>
> When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were
> able to do better on some tests than ftree.  One example (I think
> these numbers are in microseconds.
> Lower is better):
>
> Alltoall 16K packets
> ftree                           -  415490.6919
> updn normal (LMC=0)             -  495460.5526
> updn w/ ordered routing (LMC=0) -  416562.7417
> updn w/ ordered routing (LMC=1) -  453153.7289
>   - this ^^^ result is quite odd.  Not sure why.
> updn w/ ordered routing (LMC=2) - 3660132.1530
>
> We are regularly debating what will be better overall at the end of
> the day.
>
> > Also, fat-tree produces the guid order file automatically, but
> > nobody has used it yet as an input to produce an MPI rank file.
>
> I didn't know about this option.  How do you do this (just skimmed
> the manpage, didn't see anything)?  I know about the --cn_guid_file.
> But since that file doesn't have to be ordered, that's why I created
> a different option (rather than have the cn_guid_file serve for both
> ftree and updn).
>
> Al
>
> > -- Yevgeny
>
> > > A variety of other positive performance increases were found
> > > when doing other tests, other MPIs, and other LMCs, if anyone is
> > > interested.
> > >
> > > BTW, I developed this patch series before your preserve-base-lid
> > > patch series.  It will 100% conflict with the preserve-base-lid
> > > patch series.  I will fix this patch series once the
> > > preserve-base-lid patch series is committed to git.  I'm just
> > > looking for comments right now.
> > >
> > > Al

-- 
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
