Hey Yevgeny,

Yes, I tried that and it didn't have much of an effect.  Ever since
Sasha put in his routing sorted by switch load
(sort_ports_by_switch_load() in osm_ucast_mgr.c), guid_routing_order
isn't really necessary (as long as most of the cluster is up).
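
For reference, the idea there is roughly the following (an illustration
only, with made-up types; it is not the actual
sort_ports_by_switch_load() code): candidate output ports get ordered by
how many paths already go through the switch on the far side of the
link, so each new lid is routed via the least-loaded neighbor first.

/* Toy version of routing sorted by switch load. */
#include <stdio.h>
#include <stdlib.h>

struct candidate {
        int port_num;    /* local port number on this switch */
        unsigned paths;  /* paths already routed through the remote switch */
};

static int cmp_by_load(const void *a, const void *b)
{
        const struct candidate *ca = a, *cb = b;

        if (ca->paths != cb->paths)
                return ca->paths < cb->paths ? -1 : 1;
        return ca->port_num - cb->port_num;  /* tie-break by port number */
}

int main(void)
{
        struct candidate ports[] = { { 1, 7 }, { 2, 3 }, { 3, 5 } };
        size_t n = sizeof(ports) / sizeof(ports[0]);

        qsort(ports, n, sizeof(ports[0]), cmp_by_load);
        /* least-loaded first: port 2 (3 paths), port 3 (5), port 1 (7) */
        printf("route the next lid via port %d\n", ports[0].port_num);
        return 0;
}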

Al

On Tue, 2010-10-12 at 00:59 -0700, Yevgeny Kliteynik wrote:
> Hi Al,
> 
> This looks really great!
> One question: have you tried benchmarking the BW with up/down
> routing using the guid_routing_order_file option w/o your new
> features?
> 
> -- YK
> 
> On 08-Oct-10 7:40 PM, Albert Chu wrote:
> > Hey Sasha,
> > 
> > We recently got a new cluster and I've been experimenting with some
> > routing changes to improve the average bandwidth of the cluster.  They
> > are attached as patches with description of the routing goals below.
> > 
> > We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
> > measure min, peak, and average send/recv bandwidth across the cluster.
> > What we found with the original updn routing was an average of around
> > 420 MB/s send bandwidth and 508 MB/s recv bandwidth.  The following two
> > patches were able to get the average send bandwidth up to 1045 MB/s and
> > recv bandwidth up to 1228 MB/s.
> > 
> > I'm sure this is only round 1 of the patches and I'm looking for
> > comments.  Many areas could be cleaned up w/ some rearchitecture or
> > struct changes, but I simply went with the most non-invasive
> > implementation first.  I'm also open to name changes on the options.
> > 
> > BTW, b/c of the old management tree on the git server, the following
> > patches were developed on an internal LLNL tree.  I'll rebase after the
> > up2date tree is on the openfabrics server.
> > 
> > 1) Port Shifting
> > 
> > This is similar to what was done with some of the LMC > 0 code.
> > Congestion would occur due to "alignment" of routes w/ common traffic
> > patterns.  However, we found that it was also necessary for LMC=0 and
> > only for the used ports.  For example, let's say there are 4 ports
> > (called A, B, C, D) and we are routing lids 1-9 through them.  Suppose
> > only routing through A, B, and C will reach lids 1-9.
> > 
> > The LFT would normally be:
> > 
> > A: 1 4 7
> > B: 2 5 8
> > C: 3 6 9
> > D:
> > 
> > The Port Shifting would make this:
> > 
> > A: 1 6 8
> > B: 2 4 9
> > C: 3 5 7
> > D:
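> > 
> > To make the shift concrete, here is a tiny stand-alone sketch that
> > reproduces the two tables above.  It is one way to produce the shifted
> > assignment (the real patch works on the opensm LFT/port structures, so
> > the names here are only illustrative and the actual computation may
> > differ):
> > 
> > /* Toy model of port shifting: round-robin lids over the used ports,
> >  * but shift the starting port by one for each full pass so routes
> >  * don't "align" across switches. */
> > #include <stdio.h>
> > 
> > int main(void)
> > {
> >         const char used_ports[] = { 'A', 'B', 'C' };  /* D is unused here */
> >         const int num_used = sizeof(used_ports) / sizeof(used_ports[0]);
> > 
> >         for (int lid = 1; lid <= 9; lid++) {
> >                 int normal  = (lid - 1) % num_used;          /* A,B,C,A,B,C,... */
> >                 int pass    = (lid - 1) / num_used;          /* full passes so far */
> >                 int shifted = ((lid - 1) + pass) % num_used; /* start shifts by one per pass */
> > 
> >                 printf("lid %d: normal -> %c, shifted -> %c\n",
> >                        lid, used_ports[normal], used_ports[shifted]);
> >         }
> >         return 0;
> > }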
> > 
> > This option by itself improved the mpiGraph average send/recv bandwidth
> > from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
> > 
> > 2) Remote Guid Sorting
> > 
> > Most core/spine switches we've seen have had line boards connected to
> > spine boards in a consistent pattern.  However, we recently got some
> > Qlogic switches that connect from line/leaf boards to spine boards in a
> > (to the casual observer) random pattern.  I'm sure there was a good
> > electrical/board reason for this design, but it does hurt routing b/c
> > some of the opensm routing algorithms assume a consistent connection
> > pattern.  Here's an output from iblinkinfo as an example.
> > 
> > Switch 0x00066a00ec0029b8 ibcore1 L123:
> >     180    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      254   19[  ] "ibsw55" ( )
> >     180    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      253   19[  ] "ibsw56" ( )
> >     180    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      258   19[  ] "ibsw57" ( )
> >     180    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      257   19[  ] "ibsw58" ( )
> >     180    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      256   19[  ] "ibsw59" ( )
> >     180    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      255   19[  ] "ibsw60" ( )
> >     180    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      261   19[  ] "ibsw61" ( )
> >     180    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      262   19[  ] "ibsw62" ( )
> >     180    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      260   19[  ] "ibsw63" ( )
> >     180   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      259   19[  ] "ibsw64" ( )
> >     180   11[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      284   19[  ] "ibsw65" ( )
> >     180   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      285   19[  ] "ibsw66" ( )
> >     180   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     2227   19[  ] "ibsw67" ( )
> >     180   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      283   19[  ] "ibsw68" ( )
> >     180   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      267   19[  ] "ibsw69" ( )
> >     180   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      270   19[  ] "ibsw70" ( )
> >     180   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      269   19[  ] "ibsw71" ( )
> >     180   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      268   19[  ] "ibsw72" ( )
> >     180   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      222   17[  ] "ibcore1 S117B" ( )
> >     180   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      209   19[  ] "ibcore1 S211B" ( )
> >     180   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      218   21[  ] "ibcore1 S117A" ( )
> >     180   22[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      192   23[  ] "ibcore1 S215B" ( )
> >     180   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       85   15[  ] "ibcore1 S209A" ( )
> >     180   24[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      182   13[  ] "ibcore1 S215A" ( )
> >     180   25[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      200   11[  ] "ibcore1 S115B" ( )
> >     180   26[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      129   25[  ] "ibcore1 S209B" ( )
> >     180   27[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      213   27[  ] "ibcore1 S115A" ( )
> >     180   28[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      197   29[  ] "ibcore1 S213B" ( )
> >     180   29[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      178   28[  ] "ibcore1 S111A" ( )
> >     180   30[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      215    7[  ] "ibcore1 S213A" ( )
> >     180   31[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      207    5[  ] "ibcore1 S113B" ( )
> >     180   32[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      212    6[  ] "ibcore1 S211A" ( )
> >     180   33[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      154   33[  ] "ibcore1 S113A" ( )
> >     180   34[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      194   35[  ] "ibcore1 S217B" ( )
> >     180   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      191    3[  ] "ibcore1 S111B" ( )
> >     180   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      219    1[  ] "ibcore1 S217A" ( )
> > 
> > This is a line board that connects up to spine boards (ibcore1 S*
> > switches) and down to leaf/edge switches (ibsw*).  As you can see, the
> > line board connects to the ports on the spine switches in a seemingly
> > random fashion.
> > 
> > The "remote_guid_sorting" option will slightly tweak routing so that
> > instead of finding a port to route through by searching ports 1 to N. It
> > will (effectively) sort the ports based on remote connected node guid,
> > then pick a port searching from lowest guid to highest guid. That way
> > the routing calculations across each line/leaf board and spine switch
> > will be consistent.
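> > 
> > To illustrate, here's a hedged sketch of that sort with made-up types
> > and GUIDs (the patch itself operates on the opensm switch/port
> > structures, so this is only the idea, not the implementation):
> > 
> > /* Order candidate ports by the node GUID on the far end of the link
> >  * instead of scanning ports 1..N in physical order, so every line
> >  * board sees the spine switches in the same, consistent order. */
> > #include <inttypes.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > 
> > struct cand_port {
> >         int port_num;          /* local port number */
> >         uint64_t remote_guid;  /* node GUID on the remote end of the link */
> > };
> > 
> > static int cmp_remote_guid(const void *a, const void *b)
> > {
> >         const struct cand_port *pa = a, *pb = b;
> > 
> >         if (pa->remote_guid < pb->remote_guid)
> >                 return -1;
> >         if (pa->remote_guid > pb->remote_guid)
> >                 return 1;
> >         return pa->port_num - pb->port_num;  /* stable tie-break */
> > }
> > 
> > int main(void)
> > {
> >         /* made-up GUIDs; physical port order differs from GUID order */
> >         struct cand_port ports[] = {
> >                 { 19, 0x1000000000000222ULL },
> >                 { 20, 0x1000000000000209ULL },
> >                 { 21, 0x1000000000000218ULL },
> >         };
> >         size_t n = sizeof(ports) / sizeof(ports[0]);
> > 
> >         qsort(ports, n, sizeof(ports[0]), cmp_remote_guid);
> >         for (size_t i = 0; i < n; i++)
> >                 printf("search order %zu: port %d (guid 0x%" PRIx64 ")\n",
> >                        i, ports[i].port_num, ports[i].remote_guid);
> >         return 0;
> > }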
> > 
> > This patch (on top of the port_shifting one above) improved the mpiGraph
> > average send/recv bandwidth from 991 MB/s and 1172 MB/s to 1045 MB/s and
> > 1228 MB/s.
> > 
> > Al
> > 
> 
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
