Hey Yevgeny,

On Tue, 2008-06-17 at 13:59 +0300, Yevgeny Kliteynik wrote:
> Yevgeny Kliteynik wrote:
> > Hi Al,
> >
> > Al Chu wrote:
> >> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
> >>> Hey Yevgeny,
> >>>
> >>> On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> >>>> Hi Al,
> >>>>
> >>>> Al Chu wrote:
> >>>>> Hey Sasha,
> >>>>>
> >>>>> This is a conceptually simple option I've developed for updn routing.
> >>>>>
> >>>>> Currently in updn routing, nodes/guids are routed on switches in a
> >>>>> seemingly-random order, which I believe is due to internal data
> >>>>> structure organization (i.e. cl_qmap_apply_func is called on
> >>>>> port_guid_tbl) as well as how the fabric is scanned (it is logically
> >>>>> scanned from a port perspective, but it may not be logical from a
> >>>>> node perspective). I had a hypothesis that this was leading to
> >>>>> increased contention in the network for MPI.
> >>>>>
> >>>>> For example, suppose we have 12 uplinks from a leaf switch to a spine
> >>>>> switch. If we want to send data from this leaf switch to node[13-24],
> >>>>> the uplinks we will send on are pretty random. It's because:
> >>>>>
> >>>>> A) node[13-24] are individually routed at seemingly-random points
> >>>>> based on when they are called by cl_qmap_apply_func().
> >>>>>
> >>>>> B) the ports chosen for routing are based on least used port usage.
> >>>>>
> >>>>> C) least used port usage is based on whatever was routed earlier on.
> >>>>>
> >>>>> So I developed this patch series, which supports an option called
> >>>>> "guid_routing_order_file" which allows the user to input a file with
> >>>>> a list of port_guids which will indicate the order in which guids are
> >>>>> routed instead (naturally, those guids not listed are routed last).
> >>>> Great idea!
> >>> Thanks.
> >>>
> >>>> I understand that this guid_routing_order_file is synchronized with
> >>>> an MPI rank file, right? If not, then synchronizing them might give
> >>>> even better results.
> >>> Not quite sure what you mean by an MPI rank file. At LLNL, slurm is
> >>> responsible for MPI ranks, so I order the guids in my file according
> >>> to how slurm is configured for choosing MPI ranks. I will admit to
> >>> being a novice to MPI's configuration (blindly accepting slurm MPI
> >>> rankings). Is there an underlying file that MPI libs use for ranking
> >>> knowledge?
> >>
> >> I spoke to one of our MPI guys. I wasn't aware that in some MPIs you
> >> can input a file to tell it how ranks should be assigned to nodes for
> >> MPI. I assume that's what you're talking about?
> >
> > Yes, that is what I was talking about.
> > There is a host file, where you list all the hosts that MPI should use,
> > and in some MPIs there is also a way to specify the order of MPI ranks
> > that would be assigned to processes (I'm not an MPI expert, so I'm not
> > sure about the terminology that I use).
> > I know that MVAPICH is using the host order when assigning ranks, so
> > the order of the cluster nodes listed in host file is important.
> > Not sure about OpenMPI.
> >
> >>>> Another idea: OpenSM can create such file (list, doesn't have to be
> >>>> actual file) automatically, just by checking topologically-adjacent
> >>>> leaf switches and their HCAs.
> >>> Definitely a good idea. This patch set was just a "step one" kind of
> >>> thing.
> >>>
> >>>>> I list the port guids of the nodes of the cluster from node0 to
> >>>>> nodeN, one per line in the file. By listing the nodes in this order,
> >>>>> I believe we could get less contention in the network. In the example
> >>>>> above, sending to node[13-24] should use all of the 12 uplinks, b/c
> >>>>> the ports will be equally used b/c nodes[1-12] were routed beforehand
> >>>>> in order.
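As a purely illustrative sketch of the proposed option (the GUID values
below are invented), a guid_routing_order_file as described above would
simply list one port GUID per line, node0's GUID on the first line, node1's
on the second, and so on; OpenSM would route the listed GUIDs in that order
and, as noted above, route any unlisted GUIDs last:

    0x0002c9020020e001
    0x0002c9020020e005
    0x0002c9020020e009
    0x0002c9020020e00d
    0x0002c9020020e011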
> >>>>> The results from some tests are pretty impressive when I do this.
> >>>>> LMC=0 average bandwidth in mpiGraph goes from 391.374 MB/s to
> >>>>> 573.678 MB/s when I use guid_routing_order.
> >>>> Can you compare this to the fat-tree routing? Conceptually, fat-tree
> >>>> is doing the same - it routes LIDs on nodes in a topological order, so
> >>>> it would be interesting to see the comparison.
> >>> Actually I already did :-). w/ LMC=0.
> >>>
> >>> updn default - 391.374 MB/s
> >>> updn w/ guid_routing_order - 573.678 MB/s
> >>> ftree - 579.603 MB/s
> >>>
> >>> I later discovered that one of the internal ports of the cluster I'm
> >>> testing on was broken (sLB of a 288 port), and think that is the cause
> >>> of some of the slowdown w/ updn w/ guid_routing_order. So ftree (as
> >>> designed) seemed to be able to work around it properly, while updn (as
> >>> currently implemented) couldn't.
> >>>
> >>> When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
> >>> to do better on some tests than ftree. One example (I think these
> >>> numbers are in microseconds. Lower is better):
> >>>
> >>> Alltoall 16K packets
> >>> ftree - 415490.6919
> >>> updn normal (LMC=0) - 495460.5526
> >>> updn w/ ordered routing (LMC=0) - 416562.7417
> >>> updn w/ ordered routing (LMC=1) - 453153.7289
> >>> - this ^^^ result is quite odd. Not sure why.
> >>> updn w/ ordered routing (LMC=2) - 3660132.1530
> >>>
> >>> We are regularly debating what will be better overall at the end of
> >>> the day.
> >>>
> >>>> Also, fat-tree produces the guid order file automatically, but nobody
> >>>> used it yet as an input to produce MPI rank file.
> >>> I didn't know about this option. How do you do this (just skimmed the
> >>> manpage, didn't see anything)?
> >
> > Right, it's missing there. I'll add this info.
>
> Nope, it's there:
>
> "The algorithm also dumps compute node ordering file
> (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log
> resides. This ordering file provides the CN order that may be used to
> create efficient communication pattern, that will match the routing
> tables."

Thanks. I guess I just missed it. The manpage is getting big :-)

Al
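To illustrate the idea of feeding that fat-tree ordering dump into an MPI
host file (MVAPICH assigns ranks in host-file order, as mentioned above),
here is a minimal sketch in Python. The quoted text below says the dump
holds an ordered list of HCA LIDs and host names; the exact per-line layout,
the "HCA-<n>" suffix matched here, and the path are assumptions that may
need adjusting for a given OpenSM version -- this is not part of the patch
series.

    #!/usr/bin/env python
    # Sketch: derive an ordered MPI host file from OpenSM's fat-tree CA
    # ordering dump, so ranks are assigned in the same order the routing
    # tables were built for. The dump line format is assumed, not taken
    # from the thread -- adjust HOST_RE for your OpenSM version.
    import re

    DUMP = "/var/log/opensm-ftree-ca-order.dump"
    HOST_RE = re.compile(r'([\w.-]+)\s+HCA-\d+')  # assumed "hostname HCA-1" field

    def ordered_hosts(path):
        hosts = []
        with open(path) as f:
            for line in f:
                m = HOST_RE.search(line)
                if m and m.group(1) not in hosts:
                    hosts.append(m.group(1))  # keep dump order, drop duplicates
        return hosts

    if __name__ == "__main__":
        # e.g.  python ca-order-to-hosts.py > hostfile
        for host in ordered_hosts(DUMP):
            print(host)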
> -- Yevgeny
>
> > The file is /var/log/opensm-ftree-ca-order.dump.
> > Small correction though - the file contains ordered list of HCA LIDs
> > and their host names. It's not a problem to change it to have guids
> > as well, but MPI doesn't need guids anyway.
> > Note that the optimal order might be different depending on the current
> > topology state and the location of the management node that runs OpenSM.
> >
> >>> I know about the --cn_guid_file. But
> >>> since that file doesn't have to be ordered, that's why I created a
> >>> different option (rather than have the cn_guid_file for both ftree and
> >>> updn).
> >
> > Right, the cn file doesn't have to be ordered - ftree will order it
> > by itself. The ordering is by topology-adjacent leaf switches.
> >
> > -- Yevgeny
> >
> >>>
> >>> Al
> >>>
> >>>> -- Yevgeny
> >>>>
> >>>>> A variety of other positive performance
> >>>>> increases were found when doing other tests, other MPIs, and other
> >>>>> LMCs if anyone is interested.
> >>>>>
> >>>>> BTW, I developed this patch series before your preserve-base-lid
> >>>>> patch series. It will 100% conflict with the preserve-base-lid patch
> >>>>> series. I will fix this patch series once the preserve-base-lids
> >>>>> patch series is committed to git. I'm just looking for comments
> >>>>> right now.
> >>>>>
> >>>>> Al
> >>>>>
> >
> > _______________________________________________
> > general mailing list
> > [email protected]
> > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >
> > To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
>

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
