Al, Yevgeny,

> > > I understand that this guid_routing_order_file is synchronized with
> > > an MPI rank file, right? If not, then synchronizing them might give
> > > even better results.
> >
> > Not quite sure what you mean by an MPI rank file. At LLNL, slurm is
> > responsible for MPI ranks, so I order the guids in my file according to
> > how slurm is configured for choosing MPI ranks. I will admit to being a
> > novice to MPI's configuration (blindly accepting slurm MPI rankings).
> > Is there an underlying file that MPI libs use for ranking knowledge?
>
> I spoke to one of our MPI guys. I wasn't aware that in some MPIs you
> can input a file to tell it how ranks should be assigned to nodes for
> MPI. I assume that's what you're talking about?
>
> Al

The upcoming Open MPI 1.3 will have such rank-placement capabilities: a rank can be placed on a specific node and a specific CPU. We will also have settings that decide how to communicate with the different HCAs in a multi-HCA node (we have also had these capabilities in VLT-MPI for more than 2 years now, but it is going into its EOL stage...).
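For reference, the rank-placement file mentioned above looks roughly like the rankfile syntax planned for the Open MPI 1.3 series; the hostnames and slot numbers here are illustrative, not from any real cluster:

```
rank 0=node1 slot=0
rank 1=node1 slot=1
rank 2=node2 slot=0
rank 3=node2 slot=1
```

Such a file is, if I recall correctly, passed with something like `mpirun -np 4 --rankfile myrankfile ...`; check the mpirun manpage of the release you have for the exact option name.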
I think that even more important than rank placement is the communication pattern (i.e., some ranks communicate a lot while others never send a single message), and that is far more complicated to handle.

Yiftah

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:general-
> [EMAIL PROTECTED] On Behalf Of Al Chu
> Sent: Monday, June 16, 2008 23:09
> To: [EMAIL PROTECTED]
> Cc: OpenIB
> Subject: Re: [ofa-general] [OPENSM PATCH 0/5]: New "guid-routing-order"
> option for updn routing
>
> On Mon, 2008-06-16 at 10:21 -0700, Al Chu wrote:
> > Hey Yevgeny,
> >
> > On Sun, 2008-06-15 at 11:17 +0300, Yevgeny Kliteynik wrote:
> > > Hi Al,
> > >
> > > Al Chu wrote:
> > > > Hey Sasha,
> > > >
> > > > This is a conceptually simple option I've developed for updn routing.
> > > >
> > > > Currently in updn routing, nodes/guids are routed on switches in a
> > > > seemingly random order, which I believe is due to internal data
> > > > structure organization (i.e. cl_qmap_apply_func is called on
> > > > port_guid_tbl) as well as how the fabric is scanned (it is logically
> > > > scanned from a port perspective, but it may not be logical from a
> > > > node perspective). I had a hypothesis that this was leading to
> > > > increased contention in the network for MPI.
> > > >
> > > > For example, suppose we have 12 uplinks from a leaf switch to a
> > > > spine switch. If we want to send data from this leaf switch to
> > > > node[13-24], the uplinks we will send on are pretty random. That's
> > > > because:
> > > >
> > > > A) node[13-24] are individually routed at seemingly random points
> > > > based on when they are called by cl_qmap_apply_func().
> > > >
> > > > B) the ports chosen for routing are based on least-used port usage.
> > > >
> > > > C) least-used port usage is based on whatever was routed earlier on.
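The interaction of A)-C) above can be sketched with a toy model of least-used-port balancing (a deliberate simplification for illustration, not OpenSM's actual code): when nodes are routed in order, node[1-12] fill all 12 uplinks first, so node[13-24] each land on a distinct uplink; when the routing order is random, several members of node[13-24] can end up sharing an uplink.

```python
import random

def route(order, num_uplinks=12):
    """Assign each node's route to the currently least-used uplink,
    mimicking the least-used-port balancing described above."""
    usage = [0] * num_uplinks
    chosen = {}
    for node in order:
        port = usage.index(min(usage))  # least-used uplink wins
        usage[port] += 1
        chosen[node] = port
    return chosen

nodes = list(range(1, 25))  # node1..node24 behind one leaf switch

# Ordered routing: node1..node12 fill all uplinks first, so node13..node24
# land on all 12 distinct uplinks -- no two of them share a link.
ordered = route(nodes)
print(len({ordered[n] for n in range(13, 25)}))  # 12

# Random routing order: members of node[13-24] can end up sharing uplinks,
# leaving some uplinks unused for that destination set.
random.seed(0)
shuffled = nodes[:]
random.shuffle(shuffled)
rand = route(shuffled)
print(len({rand[n] for n in range(13, 25)}))  # typically fewer than 12
```

The total usage per uplink is identical in both cases; only the distribution of a given destination set across uplinks changes, which is exactly where the contention comes from.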
> > > >
> > > > So I developed this patch series, which supports an option called
> > > > "guid_routing_order_file" which allows the user to input a file with
> > > > a list of port_guids which will indicate the order in which guids are
> > > > routed instead (naturally, those guids not listed are routed last).
> > >
> > > Great idea!
> >
> > Thanks.
> >
> > > I understand that this guid_routing_order_file is synchronized with
> > > an MPI rank file, right? If not, then synchronizing them might give
> > > even better results.
> >
> > Not quite sure what you mean by an MPI rank file. At LLNL, slurm is
> > responsible for MPI ranks, so I order the guids in my file according to
> > how slurm is configured for choosing MPI ranks. I will admit to being a
> > novice to MPI's configuration (blindly accepting slurm MPI rankings).
> > Is there an underlying file that MPI libs use for ranking knowledge?
>
> I spoke to one of our MPI guys. I wasn't aware that in some MPIs you
> can input a file to tell it how ranks should be assigned to nodes for
> MPI. I assume that's what you're talking about?
>
> Al
>
> > > Another idea: OpenSM can create such a file (a list, it doesn't have
> > > to be an actual file) automatically, just by checking topologically
> > > adjacent leaf switches and their HCAs.
> >
> > Definitely a good idea. This patch set was just a "step one" kind of
> > thing.
> >
> > > > I list the port guids of the nodes of the cluster from node0 to
> > > > nodeN, one per line in the file. By listing the nodes in this order,
> > > > I believe we could get less contention in the network. In the example
> > > > above, sending to node[13-24] should use all of the 12 uplinks, b/c
> > > > the ports will be equally used b/c nodes[1-12] were routed beforehand
> > > > in order.
> > > >
> > > > The results from some tests are pretty impressive when I do this.
> > > > With LMC=0, average bandwidth in mpiGraph goes from 391.374 MB/s to
> > > > 573.678 MB/s when I use guid_routing_order.
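As described in the quoted text, the ordering file is just port GUIDs, one per line, from node0 to nodeN. A sketch of such a file (the GUID values here are made up) might be:

```
0x0002c90200212345
0x0002c90200212349
0x0002c9020021234d
0x0002c90200212351
```

The file is then pointed to via the new `guid_routing_order_file` option that this patch series introduces; the exact invocation depends on how the series is finally merged, so consult the resulting opensm manpage.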
> > >
> > > Can you compare this to the fat-tree routing? Conceptually, fat-tree
> > > is doing the same - it routes LIDs on nodes in a topological order, so
> > > it would be interesting to see the comparison.
> >
> > Actually, I already did :-). With LMC=0:
> >
> > updn default - 391.374 MB/s
> > updn w/ guid_routing_order - 573.678 MB/s
> > ftree - 579.603 MB/s
> >
> > I later discovered that one of the internal ports of the cluster I'm
> > testing on was broken (sLB of a 288 port), and I think that is the cause
> > of some of the slowdown w/ updn w/ guid_routing_order. So ftree (as
> > designed) seemed to be able to work around it properly, while updn (as
> > currently implemented) couldn't.
> >
> > When we turn on LMC > 0, MPI libraries that are LMC > 0 aware were able
> > to do better on some tests than ftree. One example (I think these
> > numbers are in microseconds; lower is better):
> >
> > Alltoall 16K packets
> > ftree - 415490.6919
> > updn normal (LMC=0) - 495460.5526
> > updn w/ ordered routing (LMC=0) - 416562.7417
> > updn w/ ordered routing (LMC=1) - 453153.7289
> > - this ^^^ result is quite odd. Not sure why.
> > updn w/ ordered routing (LMC=2) - 3660132.1530
> >
> > We are regularly debating what will be better overall at the end of the
> > day.
> >
> > > Also, fat-tree produces the guid order file automatically, but nobody
> > > has used it yet as an input to produce an MPI rank file.
> >
> > I didn't know about this option. How do you do this (I just skimmed the
> > manpage, didn't see anything)? I know about the --cn_guid_file. But
> > since that file doesn't have to be ordered, that's why I created a
> > different option (rather than have the cn_guid_file for both ftree and
> > updn).
> >
> > Al
> >
> > > -- Yevgeny
> > >
> > > > A variety of other positive performance
> > > > increases were found when doing other tests, other MPIs, and other
> > > > LMCs, if anyone is interested.
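For readers following the LMC comparison above: LMC (LID Mask Control) gives each port 2^LMC consecutive LIDs, and an "LMC-aware" MPI can spread traffic over the multiple paths those extra LIDs map to. A minimal illustration (the base LID value is arbitrary):

```python
# LMC (LID Mask Control) assigns each port 2**LMC consecutive LIDs,
# one per potential path to that port.
def lid_range(base_lid, lmc):
    """Return the LIDs assigned to a port with the given base LID and LMC."""
    return list(range(base_lid, base_lid + 2 ** lmc))

print(lid_range(0x10, 0))  # [16]             -- one path per destination
print(lid_range(0x10, 1))  # [16, 17]         -- two alternative paths
print(lid_range(0x10, 2))  # [16, 17, 18, 19] -- four alternative paths
```

This is why the LMC=1 and LMC=2 rows in the Alltoall table exercise different path sets than LMC=0, even with the same guid routing order.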
> > > >
> > > > BTW, I developed this patch series before your preserve-base-lid
> > > > patch series. It will 100% conflict with the preserve-base-lid patch
> > > > series. I will fix this patch series once the preserve-base-lid patch
> > > > series is committed to git. I'm just looking for comments right now.
> > > >
> > > > Al
>
> --
> Albert Chu
> [EMAIL PROTECTED]
> 925-422-5311
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>
> _______________________________________________
> general mailing list
> [email protected]
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
