Hey Yevgeny,

On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> Hi Al,
>
> Al Chu wrote:
> > Hey Sasha,
> >
> > This is a conceptually simple option I've developed for updn routing.
> >
> > Currently in updn routing, nodes/guids are routed on switches in a
> > seemingly-random order, which I believe is due to internal data
> > structure organization (i.e. cl_qmap_apply_func is called on
> > port_guid_tbl) as well as how the fabric is scanned (it is logically
> > scanned from a port perspective, but it may not be logical from a node
> > perspective).  I had a hypothesis that this was leading to increased
> > contention in the network for MPI.
> >
> > For example, suppose we have 12 uplinks from a leaf switch to a spine
> > switch.  If we want to send data from this leaf switch to node[13-24],
> > the up links we will send on are pretty random.  It's because:
> >
> > A) node[13-24] are individually routed at seemingly-random points based
> > on when they are called by cl_qmap_apply_func().
> >
> > B) the ports chosen for routing are based on least used port usage.
> >
> > C) least used port usage is based on whatever was routed earlier on.
> >
> > So I developed this patch series, which supports an option called
> > "guid_routing_order_file" which allows the user to input a file with a
> > list of port_guids which will indicate the order in which guids are
> > routed instead (naturally, those guids not listed are routed last).
>
> Great idea!

Thanks.

> I understand that this guid_routing_order_file is synchronized with
> an MPI rank file, right? If not, then synchronizing them might give
> even better results.

Not quite sure what you mean by an MPI rank file.  At LLNL, slurm is
responsible for MPI ranks, so I order the guids in my file according to
how slurm is configured for choosing MPI ranks.  I will admit to being a
novice at MPI configuration (I blindly accept slurm's MPI rankings).  Is
there an underlying file that MPI libs use for ranking knowledge?

> Another idea: OpenSM can create such file (list, doesn't have to be
> actual file) automatically, just by checking topologically-adjacent
> leaf switches and their HCAs.

Definitely a good idea.  This patch set was just a "step one" kind of
thing.

> >
> > I list the port guids of the nodes of the cluster from node0 to nodeN, one
> > per line in the file.  By listing the nodes in this order, I believe we
> > could get less contention in the network.  In the example above, sending
> > to node[13-24] should use all of the 12 uplinks, b/c the ports will be
> > equally used b/c nodes[1-12] were routed beforehand in order.
> >
> > The results from some tests are pretty impressive when I do this.  LMC=0
> > average bandwidth in mpiGraph goes from 391.374 MB/s to 573.678 MB/s
> > when I use guid_routing_order.
>
> Can you compare this to the fat-tree routing?  Conceptually, fat-tree
> is doing the same - it routes LIDs on nodes in a topological order, so
> it would be interesting to see the comparison.

Actually I already did :-).  w/ LMC=0:

updn default                - 391.374 MB/s
updn w/ guid_routing_order  - 573.678 MB/s
ftree                       - 579.603 MB/s

I later discovered that one of the internal ports of the cluster I'm
testing on was broken (sLB of a 288 port), and I think that is the cause
of some of the slowdown w/ updn w/ guid_routing_order.  So ftree (as
designed) seemed to be able to work around it properly, while updn (as
currently implemented) couldn't.

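To make the uplink argument from the original post concrete, here is a toy sketch in Python.  It is not OpenSM code: it assumes a single leaf switch with 12 uplinks, assumes "least used" ties go to the lowest-numbered port, and uses a hypothetical route() helper.  It only shows why routing node1..node24 in order spreads node[13-24] over all 12 uplinks, while a shuffled order generally does not.

```python
import random

NUM_UPLINKS = 12  # uplinks from the leaf switch to the spine, as in the example


def route(order, num_uplinks=NUM_UPLINKS):
    """Assign each destination to the least-used uplink, in the given order.

    Ties go to the lowest-numbered uplink (an assumption of this toy model).
    Returns a dict mapping node number -> uplink index.
    """
    usage = [0] * num_uplinks
    assignment = {}
    for node in order:
        port = min(range(num_uplinks), key=lambda p: usage[p])
        usage[port] += 1
        assignment[node] = port
    return assignment


def distinct_uplinks(assignment, nodes):
    """Count how many different uplinks the given nodes were routed over."""
    return len({assignment[n] for n in nodes})


nodes = list(range(1, 25))   # node1 .. node24
target = range(13, 25)       # node[13-24]

# Routing in node order: node[1-12] fill every uplink once, then node[13-24] do.
ordered = route(nodes)

# Routing in an arbitrary order, standing in for the seemingly-random default.
shuffled_order = nodes[:]
random.Random(0).shuffle(shuffled_order)
shuffled = route(shuffled_order)

print("ordered: ", distinct_uplinks(ordered, target), "uplinks used for node[13-24]")
print("shuffled:", distinct_uplinks(shuffled, target), "uplinks used for node[13-24]")
```

In this model a node routed at position i lands on uplink i mod 12, so any two nodes of node[13-24] whose positions collide mod 12 end up sharing an uplink.
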
When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
to do better on some tests than ftree.  One example (I think these
numbers are in microseconds.  Lower is better):

Alltoall 16K packets
ftree                            - 415490.6919
updn normal (LMC=0)              - 495460.5526
updn w/ ordered routing (LMC=0)  - 416562.7417
updn w/ ordered routing (LMC=1)  - 453153.7289
- this ^^^ result is quite odd.  Not sure why.
updn w/ ordered routing (LMC=2)  - 3660132.1530

We are regularly debating what will be better overall at the end of the
day.

> Also, fat-tree produces the guid order file automatically, but nobody
> used it yet as an input to produce MPI rank file.

I didn't know about this option.  How do you do this (just skimmed the
manpage, didn't see anything)?  I know about the --cn_guid_file.  But
since that file doesn't have to be ordered, that's why I created a
different option (rather than have the cn_guid_file for both ftree and
updn).

Al

> -- Yevgeny
>
> > A variety of other positive performance
> > increases were found when doing other tests, other MPIs, and other LMCs
> > if anyone is interested.
> >
> > BTW, I developed this patch series before your preserve-base-lid patch
> > series.  It will 100% conflict with the preserve-base-lid patch series.
> > I will fix this patch series once the preserve-base-lids patch series is
> > committed to git.  I'm just looking for comments right now.
> >
> > Al
> >
>
-- 
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
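P.S. For anyone who wants to try the ordering, here is a rough sketch of how such a file could be produced: port guids listed from node0 to nodeN, one per line.  The hostnames, the GUID values, and the helper names are all made up for illustration, and writing the guids as 0x-prefixed hex is my assumption rather than something the patch requires.

```python
import re


def node_index(name):
    """Extract the trailing number from a hostname like 'node12'."""
    match = re.search(r"(\d+)$", name)
    if match is None:
        raise ValueError(f"no trailing index in hostname: {name!r}")
    return int(match.group(1))


def guid_order_lines(port_guids):
    """Return port GUIDs ordered node0..nodeN, one string per output line.

    port_guids maps hostname -> port GUID (int).  Sorting is numeric on the
    hostname suffix, so node2 comes before node10.
    """
    ordered_names = sorted(port_guids, key=node_index)
    return [f"0x{port_guids[name]:016x}" for name in ordered_names]


# Made-up GUIDs for illustration only; real ones come from the fabric.
example = {
    "node10": 0x0002C9020020000A,
    "node2": 0x0002C90200200002,
    "node0": 0x0002C90200200000,
}

lines = guid_order_lines(example)
print("\n".join(lines))
```

The printed lines would be saved to a file and handed to the guid_routing_order_file option; any guids left out are routed last, as described above.
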
