Hi Alex,
As discussed in a private thread, here are the patches again with some
tweaks. Most notably, the remote_guid_sorting option is now independent
of port_shifting, so users may enable either, neither, or both options
at their discretion.
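
To illustrate what "independent" means here, below is a rough standalone
sketch, not the patch code itself. struct candidate and select_port()
are names made up for this example, and the flow is only an assumption
about how the two flags can gate separate steps: sorting reorders the
candidate ports, shifting only changes where the scan starts.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a per-port routing candidate. */
struct candidate {
	uint8_t port_num;          /* local port number */
	uint64_t remote_node_guid; /* GUID of the node at the far end */
	uint32_t path_count;       /* routes already assigned to this port */
};

static int candidate_guid_cmp(const void *x, const void *y)
{
	const struct candidate *a = x;
	const struct candidate *b = y;

	if (a->remote_node_guid < b->remote_node_guid)
		return -1;
	if (a->remote_node_guid > b->remote_node_guid)
		return 1;
	return 0;
}

/* Returns the chosen local port number; reorders c[] when sorting is on. */
uint8_t select_port(struct candidate *c, size_t n,
		    int guid_sorting, int port_shifting)
{
	uint32_t total = 0, least = UINT32_MAX;
	size_t i, start = 0;
	uint8_t best = 0;

	assert(n > 0);

	/* Step 1 (optional): order candidates by remote node GUID. */
	if (guid_sorting)
		qsort(c, n, sizeof(*c), candidate_guid_cmp);

	/* Step 2 (optional): rotate the starting point of the scan. */
	for (i = 0; i < n; i++)
		total += c[i].path_count;
	if (port_shifting)
		start = total / n;

	/* Step 3: first least-loaded port in scan order wins. */
	for (i = 0; i < n; i++) {
		size_t idx = (start + i) % n;

		if (c[idx].path_count < least) {
			least = c[idx].path_count;
			best = c[idx].port_num;
		}
	}
	return best;
}
```

With four equally loaded candidates (ports 1-4, one path each, remote
GUIDs 0x30, 0x40, 0x20, 0x10), the sketch picks port 1 with both flags
off, port 2 with shifting only, port 4 with sorting only, and port 3
with both, so each flag changes the result on its own.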
Al
On Thu, 2011-02-10 at 17:33 -0800, Albert Chu wrote:
> [This is a repost from Oct 2010 with rebased patches]
>
> We recently got a new cluster and I've been experimenting with some
> routing changes to improve the average bandwidth of the cluster. They
> are attached as patches with description of the routing goals below.
>
> We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
> measure min, peak, and average send/recv bandwidth across the cluster.
> What we found with the original updn routing was an average of around
> 420 MB/s send bandwidth and 508 MB/s recv bandwidth. The following two
> patches were able to get the average send bandwidth up to 1045 MB/s and
> recv bandwidth up to 1228 MB/s.
>
> I'm sure this is only round 1 of the patches and I'm looking for
> comments. Many areas could be cleaned up w/ some rearchitecture, but I
> elected to implement the least invasive approach first. I'm also open
> to name changes on the options.
>
> 1) Port Shifting
>
> This is similar to what was done with some of the LMC > 0 code.
> Congestion would occur due to "alignment" of routes w/ common traffic
> patterns. However, we found that it was also necessary for LMC=0 and
> only for used ports. For example, let's say there are 4 ports (called
> A, B, C, D) and we are routing lids 1-9 through them. Suppose only
> routing through A, B, and C will reach lids 1-9.
>
> The LFT would normally be:
>
> A: 1 4 7
> B: 2 5 8
> C: 3 6 9
> D:
>
> The Port Shifting option would make this:
>
> A: 1 6 8
> B: 2 4 9
> C: 3 5 7
> D:
>
> This option by itself improved the mpiGraph average send/recv bandwidth
> from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
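
To make the shifting concrete, here is a small standalone sketch of the
index math with everything LMC-related stripped out. It is not the patch
code: shifted_pick() and route_lids() are names made up for this
illustration, and it assumes the per-port path counts start at zero and
only count the lids routed here (in OpenSM they come from the switch's
port profile). The scan offset is computed the same way as idx in the
patch: (total paths / usable ports + i) % usable ports.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the least-loaded port, starting the scan at a rotating offset.
 * counts[i] is the number of routes already assigned to usable port i. */
size_t shifted_pick(const unsigned *counts, size_t nports)
{
	unsigned total = 0, least = ~0u;
	size_t i, best = 0;

	assert(nports > 0);
	for (i = 0; i < nports; i++)
		total += counts[i];

	for (i = 0; i < nports; i++) {
		/* The start of the scan advances as more paths are assigned. */
		size_t idx = (total / nports + i) % nports;

		if (counts[idx] < least) {
			least = counts[idx];
			best = idx;
		}
	}
	return best;
}

/* Route lids 1..nlids; lft[lid] receives the index of the chosen port.
 * lft must have room for nlids + 1 entries (index 0 is unused). */
void route_lids(unsigned *counts, size_t nports, size_t *lft, size_t nlids)
{
	size_t lid;

	for (lid = 1; lid <= nlids; lid++) {
		lft[lid] = shifted_pick(counts, nports);
		counts[lft[lid]]++;
	}
}
```

With three usable ports (A=0, B=1, C=2) and lids 1-9, this reproduces
the shifted table above: A gets 1 6 8, B gets 2 4 9, C gets 3 5 7. Port
D never enters the array because it cannot reach the lids, which is the
"only for used ports" part.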
>
> 2) Remote Guid Sorting
>
> Most core/spine switches we've seen thus far have had line boards
> connected to spine boards in a consistent pattern. However, we recently
> got some Qlogic switches that connect from line/leaf boards to spine
> boards in a (to the casual observer) random pattern. I'm sure there was
> a good electrical/board reason for this design, but it does hurt routing
> b/c updn doesn't account for this. Here's an output from iblinkinfo as
> an example.
>
> Switch 0x00066a00ec0029b8 ibcore1 L123:
> 180 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 254 19[ ]
> "ibsw55" ( )
> 180 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 253 19[ ]
> "ibsw56" ( )
> 180 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 258 19[ ]
> "ibsw57" ( )
> 180 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 257 19[ ]
> "ibsw58" ( )
> 180 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 256 19[ ]
> "ibsw59" ( )
> 180 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 255 19[ ]
> "ibsw60" ( )
> 180 7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 261 19[ ]
> "ibsw61" ( )
> 180 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 262 19[ ]
> "ibsw62" ( )
> 180 9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 260 19[ ]
> "ibsw63" ( )
> 180 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 259 19[ ]
> "ibsw64" ( )
> 180 11[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 284 19[ ]
> "ibsw65" ( )
> 180 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 285 19[ ]
> "ibsw66" ( )
> 180 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2227 19[ ]
> "ibsw67" ( )
> 180 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 283 19[ ]
> "ibsw68" ( )
> 180 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 267 19[ ]
> "ibsw69" ( )
> 180 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 270 19[ ]
> "ibsw70" ( )
> 180 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 269 19[ ]
> "ibsw71" ( )
> 180 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 268 19[ ]
> "ibsw72" ( )
> 180 19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 222 17[ ]
> "ibcore1 S117B" ( )
> 180 20[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 209 19[ ]
> "ibcore1 S211B" ( )
> 180 21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 218 21[ ]
> "ibcore1 S117A" ( )
> 180 22[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 192 23[ ]
> "ibcore1 S215B" ( )
> 180 23[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 85 15[ ]
> "ibcore1 S209A" ( )
> 180 24[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 182 13[ ]
> "ibcore1 S215A" ( )
> 180 25[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 200 11[ ]
> "ibcore1 S115B" ( )
> 180 26[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 129 25[ ]
> "ibcore1 S209B" ( )
> 180 27[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 213 27[ ]
> "ibcore1 S115A" ( )
> 180 28[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 197 29[ ]
> "ibcore1 S213B" ( )
> 180 29[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 178 28[ ]
> "ibcore1 S111A" ( )
> 180 30[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 215 7[ ]
> "ibcore1 S213A" ( )
> 180 31[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 207 5[ ]
> "ibcore1 S113B" ( )
> 180 32[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 212 6[ ]
> "ibcore1 S211A" ( )
> 180 33[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 154 33[ ]
> "ibcore1 S113A" ( )
> 180 34[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 194 35[ ]
> "ibcore1 S217B" ( )
> 180 35[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 191 3[ ]
> "ibcore1 S111B" ( )
> 180 36[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 219 1[ ]
> "ibcore1 S217A" ( )
>
> This is a line board that connects up to spine boards (ibcore1 S*
> switches) and down to leaf/edge switches (ibsw*). As you can see, the
> line board connects to the ports on the edge switches in a consistent
> fashion (always port 19), but connects to the spine switches in a (to
> the casual observer) random fashion (port 17, 19, 21, 23, 15, ...).
>
> The "remote_guid_sorting" option will slightly tweak routing so that,
> instead of finding a port to route through by searching ports 1 to N,
> it will (effectively) sort the ports based on remote connected node
> guid, then pick a port searching from lowest guid to highest guid. That
> way the routing calculations across each line/leaf board and spine
> switch will be consistent.
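
As a rough illustration of the idea (not the patch itself), the sketch
below sorts a board's candidate ports by the GUID of the node at the far
end and then takes the first least-loaded port in that order. struct
port_entry, pick_by_remote_guid() and the GUID values in the note after
the code are hypothetical; the comparator follows the same shape as
port_path_guid_cmp in the remote guid sorting patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified view of one candidate port on a board. */
struct port_entry {
	uint8_t port_num;          /* local port number */
	uint64_t remote_node_guid; /* GUID of the node at the far end */
	uint32_t path_count;       /* routes already assigned to this port */
};

/* Order entries by ascending remote node GUID. */
static int port_entry_guid_cmp(const void *x, const void *y)
{
	const struct port_entry *a = x;
	const struct port_entry *b = y;

	if (a->remote_node_guid < b->remote_node_guid)
		return -1;
	if (a->remote_node_guid > b->remote_node_guid)
		return 1;
	return 0;
}

/* Sort candidates by remote GUID (in place), then return the first
 * least-loaded entry, scanning from lowest GUID to highest. */
const struct port_entry *pick_by_remote_guid(struct port_entry *ports,
					     size_t n)
{
	size_t i, best = 0;

	assert(n > 0);
	qsort(ports, n, sizeof(*ports), port_entry_guid_cmp);
	for (i = 1; i < n; i++)
		if (ports[i].path_count < ports[best].path_count)
			best = i;
	return &ports[best];
}
```

For example, if one board reaches spine GUIDs 0x30, 0x10, 0x20 on ports
19, 20, 21 and another board reaches 0x20, 0x30, 0x10 on the same ports,
both boards pick the spine with GUID 0x10 first (port 20 on the first
board, port 21 on the second), even though the port numbers differ.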
>
> This patch (on top of the port_shifting one above) improved the mpiGraph
> average send/recv bandwidth from 991 MB/s & 1172 MB/s to 1045 MB/s and
> 1228 MB/s.
>
> Al
>
>
> email message attachment
> > -------- Forwarded Message --------
> > From: Albert L.Chu <ch...@llnl.gov>
> > Subject: [PATCH] Support port shifting
> > Date: Mon, 7 Feb 2011 16:52:41 -0800
> >
> > Signed-off-by: Albert L. Chu <ch...@llnl.gov>
> > ---
> > include/opensm/osm_subnet.h | 4 ++
> > include/opensm/osm_switch.h | 6 ++-
> > man/opensm.8.in | 8 ++++
> > opensm/main.c | 8 ++++
> > opensm/osm_dump.c | 2 +-
> > opensm/osm_subnet.c | 7 +++
> > opensm/osm_switch.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
> > opensm/osm_ucast_mgr.c | 3 +-
> > 8 files changed, 132 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
> > index 42ae416..59f877e 100644
> > --- a/include/opensm/osm_subnet.h
> > +++ b/include/opensm/osm_subnet.h
> > @@ -199,6 +199,7 @@ typedef struct osm_subn_opt {
> > char *root_guid_file;
> > char *cn_guid_file;
> > char *io_guid_file;
> > + boolean_t port_shifting;
> > uint16_t max_reverse_hops;
> > char *ids_guid_file;
> > char *guid_routing_order_file;
> > @@ -418,6 +419,9 @@ typedef struct osm_subn_opt {
> > * Name of the file that contains list of I/O node guids that
> > * will be used by fat-tree routing (provided by User)
> > *
> > +* port_shifting
> > +* This option will turn on port_shifting in routing.
> > +*
> > * ids_guid_file
> > * Name of the file that contains list of ids which should be
> > * used by Up/Down algorithm instead of node GUIDs
> > diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h
> > index f407dd9..8eae119 100644
> > --- a/include/opensm/osm_switch.h
> > +++ b/include/opensm/osm_switch.h
> > @@ -919,7 +919,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > IN unsigned start_from,
> > IN boolean_t ignore_existing,
> > IN boolean_t routing_for_lmc,
> > - IN boolean_t dor);
> > + IN boolean_t dor,
> > + IN boolean_t port_shifting);
> > /*
> > * PARAMETERS
> > * p_sw
> > @@ -955,6 +956,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > * dor
> > * [in] If TRUE, Dimension Order Routing will be done.
> > *
> > +* port_shifting
> > +* [in] If TRUE, port_shifting will be done.
> > +*
> > * RETURN VALUE
> > * Returns the recommended port on which to route this LID.
> > *
> > diff --git a/man/opensm.8.in b/man/opensm.8.in
> > index cd3a24f..db48d52 100644
> > --- a/man/opensm.8.in
> > +++ b/man/opensm.8.in
> > @@ -25,6 +25,7 @@ opensm \- InfiniBand subnet manager and administration
> > (SM/SA)
> > [\-a | \-\-root_guid_file <path to file>]
> > [\-u | \-\-cn_guid_file <path to file>]
> > [\-G | \-\-io_guid_file <path to file>]
> > +[\-\-port\-shifting]
> > [\-H | \-\-max_reverse_hops <max reverse hops allowed>]
> > [\-X | \-\-guid_routing_order_file <path to file>]
> > [\-m | \-\-ids_guid_file <path to file>]
> > @@ -208,6 +209,13 @@ to the guids provided in the given file (one to a
> > line).
> > I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches
> > the wrong way around to improve connectivity.
> > .TP
> > +\fB\-\-port\-shifting\fR
> > +This option enables a feature called \fBport shifting\fR. In some
> > +fabrics, particularly cluster environments, routes commonly align and
> > +congest with other routes due to algorithmically unchanging traffic
> > +patterns. This routing option will "shift" routing around in an
> > +attempt to alleviate this problem.
> > +.TP
> > \fB\-H\fR, \fB\-\-max_reverse_hops\fR <file name>
> > Set the maximum number of reverse hops an I/O node is allowed
> > to make. A reverse hop is the use of a switch the wrong way around.
> > diff --git a/opensm/main.c b/opensm/main.c
> > index 756fe6f..abb32ec 100644
> > --- a/opensm/main.c
> > +++ b/opensm/main.c
> > @@ -223,6 +223,9 @@ static void show_usage(void)
> > printf("--io_guid_file, -G <path to file>\n"
> > " Set the I/O nodes for the Fat-Tree routing
> > algorithm\n"
> > " to the guids provided in the given file (one to a
> > line)\n\n");
> > + printf("--port-shifting\n"
> > + " Attempt to shift port routes around to remove
> > alignment problems\n"
> > + " in routing tables\n\n");
> > printf("--max_reverse_hops, -H <hop_count>\n"
> > " Set the max number of hops the wrong way around\n"
> > " an I/O node is allowed to do (connectivity for I/O
> > nodes on top swithces)\n\n");
> > @@ -601,6 +604,7 @@ int main(int argc, char *argv[])
> > {"root_guid_file", 1, NULL, 'a'},
> > {"cn_guid_file", 1, NULL, 'u'},
> > {"io_guid_file", 1, NULL, 'G'},
> > + {"port-shifting", 0, NULL, 11},
> > {"max_reverse_hops", 1, NULL, 'H'},
> > {"ids_guid_file", 1, NULL, 'm'},
> > {"guid_routing_order_file", 1, NULL, 'X'},
> > @@ -937,6 +941,10 @@ int main(int argc, char *argv[])
> > opt.io_guid_file = optarg;
> > printf(" I/O Node Guid File: %s\n", opt.io_guid_file);
> > break;
> > + case 11:
> > + opt.port_shifting = TRUE;
> > + printf(" Port Shifting is on\n");
> > + break;
> > case 'H':
> > opt.max_reverse_hops = atoi(optarg);
> > printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops);
> > diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c
> > index 535a03f..a1ff168 100644
> > --- a/opensm/osm_dump.c
> > +++ b/opensm/osm_dump.c
> > @@ -221,7 +221,7 @@ static void dump_ucast_routes(cl_map_item_t * item,
> > FILE * file, void *cxt)
> > /* No LMC Optimization */
> > best_port = osm_switch_recommend_path(p_sw, p_port,
> > lid_ho, 1, TRUE,
> > - FALSE, dor);
> > + FALSE, dor,
> > FALSE);
> > fprintf(file, "No %u hop path possible via port %u!",
> > best_hops, best_port);
> > }
> > diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c
> > index 228418f..c62192c 100644
> > --- a/opensm/osm_subnet.c
> > +++ b/opensm/osm_subnet.c
> > @@ -347,6 +347,7 @@ static const opt_rec_t opt_tbl[] = {
> > { "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL,
> > 0 },
> > { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 },
> > { "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 },
> > + { "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL,
> > 1 },
> > { "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16,
> > NULL, 0 },
> > { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0
> > },
> > { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file),
> > opts_parse_charp, NULL, 0 },
> > @@ -740,6 +741,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt)
> > p_opt->root_guid_file = NULL;
> > p_opt->cn_guid_file = NULL;
> > p_opt->io_guid_file = NULL;
> > + p_opt->port_shifting = FALSE;
> > p_opt->max_reverse_hops = 0;
> > p_opt->ids_guid_file = NULL;
> > p_opt->guid_routing_order_file = NULL;
> > @@ -1440,6 +1442,11 @@ int osm_subn_output_conf(FILE *out, IN
> > osm_subn_opt_t * p_opts)
> > p_opts->lash_start_vl);
> >
> > fprintf(out,
> > + "# Port Shifting (use FALSE if unsure)\n"
> > + "port_shifting %s\n\n",
> > + p_opts->port_shifting ? "TRUE" : "FALSE");
> > +
> > + fprintf(out,
> > "# SA database file name\nsa_db_file %s\n\n",
> > p_opts->sa_db_file ? p_opts->sa_db_file : null_str);
> >
> > diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c
> > index 9785a9d..f24d9ea 100644
> > --- a/opensm/osm_switch.c
> > +++ b/opensm/osm_switch.c
> > @@ -51,6 +51,14 @@
> > #include <iba/ib_types.h>
> > #include <opensm/osm_switch.h>
> >
> > +struct switch_port_path {
> > + uint8_t port_num;
> > + uint32_t path_count;
> > + int found_sys_guid;
> > + int found_node_guid;
> > + uint32_t forwarded_to;
> > +};
> > +
> > cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho,
> > IN uint8_t port_num, IN uint8_t num_hops)
> > {
> > @@ -217,7 +225,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > IN unsigned start_from,
> > IN boolean_t ignore_existing,
> > IN boolean_t routing_for_lmc,
> > - IN boolean_t dor)
> > + IN boolean_t dor,
> > + IN boolean_t port_shifting)
> > {
> > /*
> > We support an enhanced LMC aware routing mode:
> > @@ -259,6 +268,11 @@ uint8_t osm_switch_recommend_path(IN const
> > osm_switch_t * p_sw,
> > osm_node_t *p_rem_node_first = NULL;
> > struct osm_remote_node *p_remote_guid = NULL;
> > struct osm_remote_node null_remote_node = {NULL, 0, 0};
> > + struct switch_port_path port_paths[IB_NODE_NUM_PORTS_MAX];
> > + unsigned int port_paths_total_paths = 0;
> > + unsigned int port_paths_count = 0;
> > + int found_sys_guid;
> > + int found_node_guid;
> >
> > CL_ASSERT(lid_ho > 0);
> >
> > @@ -369,6 +383,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > check_count =
> > osm_port_prof_path_count_get(&p_sw->p_prof[port_num]);
> >
> > +
> > if (dor) {
> > /* Get the Remote Node */
> > p_rem_physp = osm_physp_get_remote(p_physp);
> > @@ -412,7 +427,10 @@ uint8_t osm_switch_recommend_path(IN const
> > osm_switch_t * p_sw,
> > best_port_other_sys = port_num;
> > least_forwarded_to = 0;
> > }
> > + found_sys_guid = 0;
> > } else { /* same sys found - try node */
> > +
> > +
> > /* Else is the node guid already used ? */
> > p_remote_guid =
> > switch_find_node_guid_count(p_sw,
> >
> > p_port->priv,
> > @@ -427,9 +445,27 @@ uint8_t osm_switch_recommend_path(IN const
> > osm_switch_t * p_sw,
> > }
> > /* else prior sys and node guid already used */
> >
> > + if (!p_remote_guid)
> > + found_node_guid = 0;
> > + else
> > + found_node_guid = 1;
> > + found_sys_guid = 1;
> > } /* same sys found */
> > }
> >
> > + port_paths[port_paths_count].port_num = port_num;
> > + port_paths[port_paths_count].path_count = check_count;
> > + if (routing_for_lmc) {
> > + port_paths[port_paths_count].found_sys_guid =
> > found_sys_guid;
> > + port_paths[port_paths_count].found_node_guid =
> > found_node_guid;
> > + }
> > + if (routing_for_lmc && p_remote_guid)
> > + port_paths[port_paths_count].forwarded_to =
> > p_remote_guid->forwarded_to;
> > + else
> > + port_paths[port_paths_count].forwarded_to = 0;
> > + port_paths_total_paths += check_count;
> > + port_paths_count++;
> > +
> > /* routing for LMC mode */
> > /*
> > the count is min but also lower then the max subscribed
> > @@ -454,6 +490,66 @@ uint8_t osm_switch_recommend_path(IN const
> > osm_switch_t * p_sw,
> > if (port_found == FALSE)
> > return OSM_NO_PATH;
> >
> > + if (port_shifting && port_paths_count) {
> > + /* In the port_paths[] array, we now have all the ports that we
> > + * can route out of. Using some shifting math below, possibly
> > + * select a different one so that lids won't align in LFTs
> > + *
> > + * If lmc > 0, we need to loop through these ports to find the
> > + * least_forwarded_to port, best_port_other_sys, and
> > + * best_port_other_node just like before but through the
> > different
> > + * ordering.
> > + */
> > +
> > + least_paths = 0xFFFFFFFF;
> > + least_paths_other_sys = 0xFFFFFFFF;
> > + least_paths_other_nodes = 0xFFFFFFFF;
> > + least_forwarded_to = 0xFFFFFFFF;
> > + best_port = 0;
> > + best_port_other_sys = 0;
> > + best_port_other_node = 0;
> > +
> > + for (i = 0; i < port_paths_count; i++) {
> > + unsigned int idx;
> > +
> > + idx = (port_paths_total_paths/port_paths_count + i) %
> > port_paths_count;
> > +
> > + if (routing_for_lmc) {
> > + if (!port_paths[idx].found_sys_guid
> > + && port_paths[idx].path_count <
> > least_paths_other_sys) {
> > + least_paths_other_sys =
> > port_paths[idx].path_count;
> > + best_port_other_sys =
> > port_paths[idx].port_num;
> > + least_forwarded_to = 0;
> > + }
> > + else if (!port_paths[idx].found_node_guid
> > + && port_paths[idx].path_count <
> > least_paths_other_nodes) {
> > + least_paths_other_nodes =
> > port_paths[idx].path_count;
> > + best_port_other_node =
> > port_paths[idx].port_num;
> > + least_forwarded_to = 0;
> > + }
> > + }
> > +
> > + if (port_paths[idx].path_count < least_paths) {
> > + best_port = port_paths[idx].port_num;
> > + least_paths = port_paths[idx].path_count;
> > + if (routing_for_lmc
> > + && (port_paths[idx].found_sys_guid
> > + || port_paths[idx].found_node_guid)
> > + && port_paths[idx].forwarded_to <
> > least_forwarded_to)
> > + least_forwarded_to =
> > port_paths[idx].forwarded_to;
> > + }
> > + else if (routing_for_lmc
> > + && (port_paths[idx].found_sys_guid
> > + || port_paths[idx].found_node_guid)
> > + && port_paths[idx].path_count == least_paths
> > + && port_paths[idx].forwarded_to <
> > least_forwarded_to) {
> > + least_forwarded_to =
> > port_paths[idx].forwarded_to;
> > + best_port = port_paths[idx].port_num;
> > + }
> > +
> > + }
> > + }
> > +
> > /*
> > if we are in enhanced routing mode and the best port is not
> > the local port 0
> > diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
> > index 4019589..d32eb60 100644
> > --- a/opensm/osm_ucast_mgr.c
> > +++ b/opensm/osm_ucast_mgr.c
> > @@ -255,7 +255,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t *
> > p_mgr,
> > port = osm_switch_recommend_path(p_sw, p_port, lid_ho, start_from,
> > p_mgr->p_subn->ignore_existing_lfts,
> > p_mgr->p_subn->opt.lmc,
> > - p_mgr->is_dor);
> > + p_mgr->is_dor,
> > + p_mgr->p_subn->opt.port_shifting);
> >
> > if (port == OSM_NO_PATH) {
> > /* do not try to overwrite the ppro of non existing port ... */
> email message attachment
> > -------- Forwarded Message --------
> > From: Albert L.Chu <ch...@llnl.gov>
> > Subject: [PATCH] Support remote guid sorting
> > Date: Mon, 7 Feb 2011 16:53:39 -0800
> >
> > Signed-off-by: Albert L. Chu <ch...@llnl.gov>
> > ---
> > include/opensm/osm_subnet.h | 4 ++++
> > include/opensm/osm_switch.h | 6 +++++-
> > man/opensm.8.in | 6 ++++++
> > opensm/main.c | 8 ++++++++
> > opensm/osm_dump.c | 3 ++-
> > opensm/osm_subnet.c | 7 +++++++
> > opensm/osm_switch.c | 26 +++++++++++++++++++++++++-
> > opensm/osm_ucast_mgr.c | 3 ++-
> > 8 files changed, 59 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
> > index 59f877e..589e96c 100644
> > --- a/include/opensm/osm_subnet.h
> > +++ b/include/opensm/osm_subnet.h
> > @@ -200,6 +200,7 @@ typedef struct osm_subn_opt {
> > char *cn_guid_file;
> > char *io_guid_file;
> > boolean_t port_shifting;
> > + boolean_t remote_guid_sorting;
> > uint16_t max_reverse_hops;
> > char *ids_guid_file;
> > char *guid_routing_order_file;
> > @@ -422,6 +423,9 @@ typedef struct osm_subn_opt {
> > * port_shifting
> > * This option will turn on port_shifting in routing.
> > *
> > +* remote_guid_sorting
> > +* This option will turn on remote_guid_sorting in routing.
> > +*
> > * ids_guid_file
> > * Name of the file that contains list of ids which should be
> > * used by Up/Down algorithm instead of node GUIDs
> > diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h
> > index 8eae119..aef45cb 100644
> > --- a/include/opensm/osm_switch.h
> > +++ b/include/opensm/osm_switch.h
> > @@ -920,7 +920,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > IN boolean_t ignore_existing,
> > IN boolean_t routing_for_lmc,
> > IN boolean_t dor,
> > - IN boolean_t port_shifting);
> > + IN boolean_t port_shifting,
> > + IN boolean_t remote_guid_sorting);
> > /*
> > * PARAMETERS
> > * p_sw
> > @@ -959,6 +960,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > * port_shifting
> > * [in] If TRUE, port_shifting will be done.
> > *
> > +* remote_guid_sorting
> > +* [in] If TRUE, remote_guid_sorting will be done.
> > +*
> > * RETURN VALUE
> > * Returns the recommended port on which to route this LID.
> > *
> > diff --git a/man/opensm.8.in b/man/opensm.8.in
> > index db48d52..decaee7 100644
> > --- a/man/opensm.8.in
> > +++ b/man/opensm.8.in
> > @@ -216,6 +216,12 @@ congest with other routes due to algorithmically
> > unchanging traffic
> > patterns. This routing option will "shift" routing around in an
> > attempt to alleviate this problem.
> > .TP
> > +\fB\-\-remote\-guid\-sorting\fR
> > +This option enables a feature called \fBremote guid sorting\fR. In some
> > +fabrics, switches may be cabled in an inconsistent fashion. This option
> > +may alleviate those issues by sorting remote guids before routing,
> > +making remote destinations appear to be ordered consistently.
> > +.TP
> > \fB\-H\fR, \fB\-\-max_reverse_hops\fR <file name>
> > Set the maximum number of reverse hops an I/O node is allowed
> > to make. A reverse hop is the use of a switch the wrong way around.
> > diff --git a/opensm/main.c b/opensm/main.c
> > index abb32ec..91ae940 100644
> > --- a/opensm/main.c
> > +++ b/opensm/main.c
> > @@ -226,6 +226,9 @@ static void show_usage(void)
> > printf("--port-shifting\n"
> > " Attempt to shift port routes around to remove
> > alignment problems\n"
> > " in routing tables\n\n");
> > + printf("--remote-guid-sorting\n"
> > + " Sort ports by remote node guid before routing to
> > alleviate\n"
> > + " problems with inconsistent cabling across a
> > fabric\n\n");
> > printf("--max_reverse_hops, -H <hop_count>\n"
> > " Set the max number of hops the wrong way around\n"
> > " an I/O node is allowed to do (connectivity for I/O
> > nodes on top swithces)\n\n");
> > @@ -605,6 +608,7 @@ int main(int argc, char *argv[])
> > {"cn_guid_file", 1, NULL, 'u'},
> > {"io_guid_file", 1, NULL, 'G'},
> > {"port-shifting", 0, NULL, 11},
> > + {"remote-guid-sorting", 0, NULL, 13},
> > {"max_reverse_hops", 1, NULL, 'H'},
> > {"ids_guid_file", 1, NULL, 'm'},
> > {"guid_routing_order_file", 1, NULL, 'X'},
> > @@ -945,6 +949,10 @@ int main(int argc, char *argv[])
> > opt.port_shifting = TRUE;
> > printf(" Port Shifting is on\n");
> > break;
> > + case 13:
> > + opt.remote_guid_sorting = TRUE;
> > + printf(" Remote Guid Sorting is on\n");
> > + break;
> > case 'H':
> > opt.max_reverse_hops = atoi(optarg);
> > printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops);
> > diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c
> > index a1ff168..bfe63c3 100644
> > --- a/opensm/osm_dump.c
> > +++ b/opensm/osm_dump.c
> > @@ -221,7 +221,8 @@ static void dump_ucast_routes(cl_map_item_t * item,
> > FILE * file, void *cxt)
> > /* No LMC Optimization */
> > best_port = osm_switch_recommend_path(p_sw, p_port,
> > lid_ho, 1, TRUE,
> > - FALSE, dor,
> > FALSE);
> > + FALSE, dor, FALSE,
> > + FALSE);
> > fprintf(file, "No %u hop path possible via port %u!",
> > best_hops, best_port);
> > }
> > diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c
> > index c62192c..b2b219f 100644
> > --- a/opensm/osm_subnet.c
> > +++ b/opensm/osm_subnet.c
> > @@ -348,6 +348,7 @@ static const opt_rec_t opt_tbl[] = {
> > { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 },
> > { "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 },
> > { "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL,
> > 1 },
> > + { "remote_guid_sorting", OPT_OFFSET(remote_guid_sorting),
> > opts_parse_boolean, NULL, 1 },
> > { "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16,
> > NULL, 0 },
> > { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0
> > },
> > { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file),
> > opts_parse_charp, NULL, 0 },
> > @@ -742,6 +743,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt)
> > p_opt->cn_guid_file = NULL;
> > p_opt->io_guid_file = NULL;
> > p_opt->port_shifting = FALSE;
> > + p_opt->remote_guid_sorting = FALSE;
> > p_opt->max_reverse_hops = 0;
> > p_opt->ids_guid_file = NULL;
> > p_opt->guid_routing_order_file = NULL;
> > @@ -1447,6 +1449,11 @@ int osm_subn_output_conf(FILE *out, IN
> > osm_subn_opt_t * p_opts)
> > p_opts->port_shifting ? "TRUE" : "FALSE");
> >
> > fprintf(out,
> > + "# Remote Guid Sorting (use FALSE if unsure)\n"
> > + "remote_guid_sorting %s\n\n",
> > + p_opts->remote_guid_sorting ? "TRUE" : "FALSE");
> > +
> > + fprintf(out,
> > "# SA database file name\nsa_db_file %s\n\n",
> > p_opts->sa_db_file ? p_opts->sa_db_file : null_str);
> >
> > diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c
> > index f24d9ea..0aa0137 100644
> > --- a/opensm/osm_switch.c
> > +++ b/opensm/osm_switch.c
> > @@ -57,6 +57,7 @@ struct switch_port_path {
> > int found_sys_guid;
> > int found_node_guid;
> > uint32_t forwarded_to;
> > + uint64_t remote_node_guid;
> > };
> >
> > cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho,
> > @@ -169,6 +170,19 @@ boolean_t osm_switch_get_lft_block(IN const
> > osm_switch_t * p_sw,
> > return TRUE;
> > }
> >
> > +static int
> > +port_path_guid_cmp(IN const void *x, IN const void *y)
> > +{
> > + struct switch_port_path *a = (struct switch_port_path *)x;
> > + struct switch_port_path *b = (struct switch_port_path *)y;
> > +
> > + if (a->remote_node_guid < b->remote_node_guid)
> > + return -1;
> > + if (a->remote_node_guid > b->remote_node_guid)
> > + return 1;
> > + return 0;
> > +}
> > +
> > static struct osm_remote_node *
> > switch_find_guid_common(IN const osm_switch_t * p_sw,
> > IN struct osm_remote_guids_count *r,
> > @@ -226,7 +240,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > IN boolean_t ignore_existing,
> > IN boolean_t routing_for_lmc,
> > IN boolean_t dor,
> > - IN boolean_t port_shifting)
> > + IN boolean_t port_shifting,
> > + IN boolean_t remote_guid_sorting)
> > {
> > /*
> > We support an enhanced LMC aware routing mode:
> > @@ -428,6 +443,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > least_forwarded_to = 0;
> > }
> > found_sys_guid = 0;
> > + found_node_guid = 0;
> > } else { /* same sys found - try node */
> >
> >
> > @@ -463,6 +479,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t
> > * p_sw,
> > port_paths[port_paths_count].forwarded_to =
> > p_remote_guid->forwarded_to;
> > else
> > port_paths[port_paths_count].forwarded_to = 0;
> > + p_rem_physp = osm_physp_get_remote(p_physp);
> > + p_rem_node = osm_physp_get_node_ptr(p_rem_physp);
> > + port_paths[port_paths_count].remote_node_guid =
> > p_rem_node->node_info.node_guid;
> > port_paths_total_paths += check_count;
> > port_paths_count++;
> >
> > @@ -490,6 +509,11 @@ uint8_t osm_switch_recommend_path(IN const
> > osm_switch_t * p_sw,
> > if (port_found == FALSE)
> > return OSM_NO_PATH;
> >
> > + if (remote_guid_sorting && port_paths_count) {
> > + qsort(port_paths, port_paths_count, sizeof(struct
> > switch_port_path),
> > + port_path_guid_cmp);
> > + }
> > +
> > if (port_shifting && port_paths_count) {
> > /* In the port_paths[] array, we now have all the ports that we
> > * can route out of. Using some shifting math below, possibly
> > diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
> > index d32eb60..a8982df 100644
> > --- a/opensm/osm_ucast_mgr.c
> > +++ b/opensm/osm_ucast_mgr.c
> > @@ -256,7 +256,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t *
> > p_mgr,
> > p_mgr->p_subn->ignore_existing_lfts,
> > p_mgr->p_subn->opt.lmc,
> > p_mgr->is_dor,
> > - p_mgr->p_subn->opt.port_shifting);
> > + p_mgr->p_subn->opt.port_shifting,
> > +
> > p_mgr->p_subn->opt.remote_guid_sorting);
> >
> > if (port == OSM_NO_PATH) {
> > /* do not try to overwrite the ppro of non existing port ... */
--
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
--- Begin Message ---
Signed-off-by: Albert L. Chu <ch...@llnl.gov>
---
include/opensm/osm_subnet.h | 4 ++
include/opensm/osm_switch.h | 6 ++-
man/opensm.8.in | 8 ++++
opensm/main.c | 8 ++++
opensm/osm_dump.c | 2 +-
opensm/osm_subnet.c | 7 +++
opensm/osm_switch.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
opensm/osm_ucast_mgr.c | 3 +-
8 files changed, 132 insertions(+), 4 deletions(-)
diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
index 42ae416..59f877e 100644
--- a/include/opensm/osm_subnet.h
+++ b/include/opensm/osm_subnet.h
@@ -199,6 +199,7 @@ typedef struct osm_subn_opt {
char *root_guid_file;
char *cn_guid_file;
char *io_guid_file;
+ boolean_t port_shifting;
uint16_t max_reverse_hops;
char *ids_guid_file;
char *guid_routing_order_file;
@@ -418,6 +419,9 @@ typedef struct osm_subn_opt {
* Name of the file that contains list of I/O node guids that
* will be used by fat-tree routing (provided by User)
*
+* port_shifting
+* This option will turn on port_shifting in routing.
+*
* ids_guid_file
* Name of the file that contains list of ids which should be
* used by Up/Down algorithm instead of node GUIDs
diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h
index f407dd9..8eae119 100644
--- a/include/opensm/osm_switch.h
+++ b/include/opensm/osm_switch.h
@@ -919,7 +919,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t *
p_sw,
IN unsigned start_from,
IN boolean_t ignore_existing,
IN boolean_t routing_for_lmc,
- IN boolean_t dor);
+ IN boolean_t dor,
+ IN boolean_t port_shifting);
/*
* PARAMETERS
* p_sw
@@ -955,6 +956,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t *
p_sw,
* dor
* [in] If TRUE, Dimension Order Routing will be done.
*
+* port_shifting
+* [in] If TRUE, port_shifting will be done.
+*
* RETURN VALUE
* Returns the recommended port on which to route this LID.
*
diff --git a/man/opensm.8.in b/man/opensm.8.in
index c026f3a..f5b4fb9 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -25,6 +25,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
[\-a | \-\-root_guid_file <path to file>]
[\-u | \-\-cn_guid_file <path to file>]
[\-G | \-\-io_guid_file <path to file>]
+[\-\-port\-shifting]
[\-H | \-\-max_reverse_hops <max reverse hops allowed>]
[\-X | \-\-guid_routing_order_file <path to file>]
[\-m | \-\-ids_guid_file <path to file>]
@@ -208,6 +209,13 @@ to the guids provided in the given file (one to a line).
I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches
the wrong way around to improve connectivity.
.TP
+\fB\-\-port\-shifting\fR
+This option enables a feature called \fBport shifting\fR. In some
+fabrics, particularly cluster environments, routes commonly align and
+congest with other routes due to algorithmically unchanging traffic
+patterns. This routing option will "shift" routing around in an
+attempt to alleviate this problem.
+.TP
\fB\-H\fR, \fB\-\-max_reverse_hops\fR <file name>
Set the maximum number of reverse hops an I/O node is allowed
to make. A reverse hop is the use of a switch the wrong way around.
diff --git a/opensm/main.c b/opensm/main.c
index 5be36b6..5d5bbe1 100644
--- a/opensm/main.c
+++ b/opensm/main.c
@@ -223,6 +223,9 @@ static void show_usage(void)
printf("--io_guid_file, -G <path to file>\n"
" Set the I/O nodes for the Fat-Tree routing algorithm\n"
" to the guids provided in the given file (one to a line)\n\n");
+ printf("--port-shifting\n"
+ " Attempt to shift port routes around to remove alignment problems\n"
+ " in routing tables\n\n");
printf("--max_reverse_hops, -H <hop_count>\n"
" Set the max number of hops the wrong way around\n"
" an I/O node is allowed to do (connectivity for I/O nodes on top swithces)\n\n");
@@ -601,6 +604,7 @@ int main(int argc, char *argv[])
{"root_guid_file", 1, NULL, 'a'},
{"cn_guid_file", 1, NULL, 'u'},
{"io_guid_file", 1, NULL, 'G'},
+ {"port-shifting", 0, NULL, 11},
{"max_reverse_hops", 1, NULL, 'H'},
{"ids_guid_file", 1, NULL, 'm'},
{"guid_routing_order_file", 1, NULL, 'X'},
@@ -943,6 +947,10 @@ int main(int argc, char *argv[])
opt.io_guid_file = optarg;
printf(" I/O Node Guid File: %s\n", opt.io_guid_file);
break;
+ case 11:
+ opt.port_shifting = TRUE;
+ printf(" Port Shifting is on\n");
+ break;
case 'H':
opt.max_reverse_hops = atoi(optarg);
printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops);
diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c
index 535a03f..a1ff168 100644
--- a/opensm/osm_dump.c
+++ b/opensm/osm_dump.c
@@ -221,7 +221,7 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt)
/* No LMC Optimization */
best_port = osm_switch_recommend_path(p_sw, p_port,
lid_ho, 1, TRUE,
- FALSE, dor);
+ FALSE, dor, FALSE);
fprintf(file, "No %u hop path possible via port %u!",
best_hops, best_port);
}
diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c
index 228418f..c62192c 100644
--- a/opensm/osm_subnet.c
+++ b/opensm/osm_subnet.c
@@ -347,6 +347,7 @@ static const opt_rec_t opt_tbl[] = {
{ "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL, 0 },
{ "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 },
{ "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 },
+ { "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL, 1 },
{ "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 },
{ "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 },
{ "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 },
@@ -740,6 +741,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt)
p_opt->root_guid_file = NULL;
p_opt->cn_guid_file = NULL;
p_opt->io_guid_file = NULL;
+ p_opt->port_shifting = FALSE;
p_opt->max_reverse_hops = 0;
p_opt->ids_guid_file = NULL;
p_opt->guid_routing_order_file = NULL;
@@ -1440,6 +1442,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts)
p_opts->lash_start_vl);
fprintf(out,
+ "# Port Shifting (use FALSE if unsure)\n"
+ "port_shifting %s\n\n",
+ p_opts->port_shifting ? "TRUE" : "FALSE");
+
+ fprintf(out,
"# SA database file name\nsa_db_file %s\n\n",
p_opts->sa_db_file ? p_opts->sa_db_file : null_str);
diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c
index 9785a9d..f24d9ea 100644
--- a/opensm/osm_switch.c
+++ b/opensm/osm_switch.c
@@ -51,6 +51,14 @@
#include <iba/ib_types.h>
#include <opensm/osm_switch.h>
+struct switch_port_path {
+ uint8_t port_num;
+ uint32_t path_count;
+ int found_sys_guid;
+ int found_node_guid;
+ uint32_t forwarded_to;
+};
+
cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho,
IN uint8_t port_num, IN uint8_t num_hops)
{
@@ -217,7 +225,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
IN unsigned start_from,
IN boolean_t ignore_existing,
IN boolean_t routing_for_lmc,
- IN boolean_t dor)
+ IN boolean_t dor,
+ IN boolean_t port_shifting)
{
/*
We support an enhanced LMC aware routing mode:
@@ -259,6 +268,11 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
osm_node_t *p_rem_node_first = NULL;
struct osm_remote_node *p_remote_guid = NULL;
struct osm_remote_node null_remote_node = {NULL, 0, 0};
+ struct switch_port_path port_paths[IB_NODE_NUM_PORTS_MAX];
+ unsigned int port_paths_total_paths = 0;
+ unsigned int port_paths_count = 0;
+ int found_sys_guid;
+ int found_node_guid;
CL_ASSERT(lid_ho > 0);
@@ -369,6 +383,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
check_count =
osm_port_prof_path_count_get(&p_sw->p_prof[port_num]);
+
if (dor) {
/* Get the Remote Node */
p_rem_physp = osm_physp_get_remote(p_physp);
@@ -412,7 +427,10 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
best_port_other_sys = port_num;
least_forwarded_to = 0;
}
+ found_sys_guid = 0;
} else { /* same sys found - try node */
+
+
/* Else is the node guid already used ? */
p_remote_guid =
switch_find_node_guid_count(p_sw,
p_port->priv,
@@ -427,9 +445,27 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
}
/* else prior sys and node guid already used */
+ if (!p_remote_guid)
+ found_node_guid = 0;
+ else
+ found_node_guid = 1;
+ found_sys_guid = 1;
} /* same sys found */
}
+ port_paths[port_paths_count].port_num = port_num;
+ port_paths[port_paths_count].path_count = check_count;
+ if (routing_for_lmc) {
+ port_paths[port_paths_count].found_sys_guid = found_sys_guid;
+ port_paths[port_paths_count].found_node_guid = found_node_guid;
+ }
+ if (routing_for_lmc && p_remote_guid)
+ port_paths[port_paths_count].forwarded_to = p_remote_guid->forwarded_to;
+ else
+ port_paths[port_paths_count].forwarded_to = 0;
+ port_paths_total_paths += check_count;
+ port_paths_count++;
+
/* routing for LMC mode */
/*
the count is min but also lower then the max subscribed
@@ -454,6 +490,66 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
if (port_found == FALSE)
return OSM_NO_PATH;
+ if (port_shifting && port_paths_count) {
+ /* In the port_paths[] array, we now have all the ports that we
+ * can route out of. Using some shifting math below, possibly
+ * select a different one so that lids won't align in LFTs
+ *
+ * If lmc > 0, we need to loop through these ports to find the
+ * least_forwarded_to port, best_port_other_sys, and
+ * best_port_other_node just like before but through the different
+ * ordering.
+ */
+
+ least_paths = 0xFFFFFFFF;
+ least_paths_other_sys = 0xFFFFFFFF;
+ least_paths_other_nodes = 0xFFFFFFFF;
+ least_forwarded_to = 0xFFFFFFFF;
+ best_port = 0;
+ best_port_other_sys = 0;
+ best_port_other_node = 0;
+
+ for (i = 0; i < port_paths_count; i++) {
+ unsigned int idx;
+
+ idx = (port_paths_total_paths/port_paths_count + i) % port_paths_count;
+
+ if (routing_for_lmc) {
+ if (!port_paths[idx].found_sys_guid
+ && port_paths[idx].path_count < least_paths_other_sys) {
+ least_paths_other_sys = port_paths[idx].path_count;
+ best_port_other_sys = port_paths[idx].port_num;
+ least_forwarded_to = 0;
+ }
+ else if (!port_paths[idx].found_node_guid
+ && port_paths[idx].path_count < least_paths_other_nodes) {
+ least_paths_other_nodes = port_paths[idx].path_count;
+ best_port_other_node = port_paths[idx].port_num;
+ least_forwarded_to = 0;
+ }
+ }
+
+ if (port_paths[idx].path_count < least_paths) {
+ best_port = port_paths[idx].port_num;
+ least_paths = port_paths[idx].path_count;
+ if (routing_for_lmc
+ && (port_paths[idx].found_sys_guid
+ || port_paths[idx].found_node_guid)
+ && port_paths[idx].forwarded_to < least_forwarded_to)
+ least_forwarded_to = port_paths[idx].forwarded_to;
+ }
+ else if (routing_for_lmc
+ && (port_paths[idx].found_sys_guid
+ || port_paths[idx].found_node_guid)
+ && port_paths[idx].path_count == least_paths
+ && port_paths[idx].forwarded_to < least_forwarded_to) {
+ least_forwarded_to = port_paths[idx].forwarded_to;
+ best_port = port_paths[idx].port_num;
+ }
+
+ }
+ }
+
/*
if we are in enhanced routing mode and the best port is not
the local port 0
diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 4019589..d32eb60 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -255,7 +255,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr,
port = osm_switch_recommend_path(p_sw, p_port, lid_ho, start_from,
p_mgr->p_subn->ignore_existing_lfts,
p_mgr->p_subn->opt.lmc,
- p_mgr->is_dor);
+ p_mgr->is_dor,
+ p_mgr->p_subn->opt.port_shifting);
if (port == OSM_NO_PATH) {
/* do not try to overwrite the ppro of non existing port ... */
--
1.5.4.5
--- End Message ---
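
The port-shifting selection loop in the patch above can be illustrated
outside of OpenSM. The following is a minimal sketch, not the OpenSM code
itself: struct port_load, pick_shifted() and route_lids() are hypothetical
names. It assumes LMC=0, that every candidate port is at the same hop
distance, and that a port's path count grows by one each time a lid is
assigned to it. Under those assumptions it starts the least-loaded scan at
offset total_paths / port_count, as the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the patch's struct switch_port_path. */
struct port_load {
	uint8_t port_num;
	uint32_t path_count;
};

/* Return the least-loaded port, scanning from a shifted start offset.
 * Ties go to the first port seen, so the offset decides which port wins. */
uint8_t pick_shifted(const struct port_load *ports, unsigned int count)
{
	uint32_t total = 0, least = UINT32_MAX;
	uint8_t best = 0;
	unsigned int i, start;

	assert(count > 0);
	for (i = 0; i < count; i++)
		total += ports[i].path_count;
	start = total / count;	/* same shift term as in the patch */

	for (i = 0; i < count; i++) {
		unsigned int idx = (start + i) % count;

		if (ports[idx].path_count < least) {
			least = ports[idx].path_count;
			best = ports[idx].port_num;
		}
	}
	return best;
}

/* Assign lids 1..nlids to ports; out[lid - 1] receives the chosen port.
 * Assumption: one path is charged to the chosen port per lid. */
void route_lids(struct port_load *ports, unsigned int count,
		unsigned int nlids, uint8_t *out)
{
	unsigned int lid, i;

	for (lid = 1; lid <= nlids; lid++) {
		uint8_t chosen = pick_shifted(ports, count);

		out[lid - 1] = chosen;
		for (i = 0; i < count; i++)
			if (ports[i].port_num == chosen)
				ports[i].path_count++;
	}
}
```

Routing lids 1-9 over three ports numbered 1, 2 and 3 with this sketch
puts lids 1, 6, 8 on port 1, lids 2, 4, 9 on port 2, and lids 3, 5, 7 on
port 3, which matches the shifted LFT example in the patch description.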
--- Begin Message ---
Signed-off-by: Albert L. Chu <ch...@llnl.gov>
---
include/opensm/osm_subnet.h | 4 ++++
include/opensm/osm_switch.h | 6 +++++-
man/opensm.8.in | 6 ++++++
opensm/main.c | 8 ++++++++
opensm/osm_dump.c | 3 ++-
opensm/osm_subnet.c | 7 +++++++
opensm/osm_switch.c | 42 +++++++++++++++++++++++++++++++++++++-----
opensm/osm_ucast_mgr.c | 3 ++-
8 files changed, 71 insertions(+), 8 deletions(-)
diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
index 59f877e..589e96c 100644
--- a/include/opensm/osm_subnet.h
+++ b/include/opensm/osm_subnet.h
@@ -200,6 +200,7 @@ typedef struct osm_subn_opt {
char *cn_guid_file;
char *io_guid_file;
boolean_t port_shifting;
+ boolean_t remote_guid_sorting;
uint16_t max_reverse_hops;
char *ids_guid_file;
char *guid_routing_order_file;
@@ -422,6 +423,9 @@ typedef struct osm_subn_opt {
* port_shifting
* This option will turn on port_shifting in routing.
*
+* remote_guid_sorting
+* This option will turn on remote_guid_sorting in routing.
+*
* ids_guid_file
* Name of the file that contains list of ids which should be
* used by Up/Down algorithm instead of node GUIDs
diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h
index 8eae119..aef45cb 100644
--- a/include/opensm/osm_switch.h
+++ b/include/opensm/osm_switch.h
@@ -920,7 +920,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
IN boolean_t ignore_existing,
IN boolean_t routing_for_lmc,
IN boolean_t dor,
- IN boolean_t port_shifting);
+ IN boolean_t port_shifting,
+ IN boolean_t remote_guid_sorting);
/*
* PARAMETERS
* p_sw
@@ -959,6 +960,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
* port_shifting
* [in] If TRUE, port_shifting will be done.
*
+* remote_guid_sorting
+* [in] If TRUE, remote_guid_sorting will be done.
+*
* RETURN VALUE
* Returns the recommended port on which to route this LID.
*
diff --git a/man/opensm.8.in b/man/opensm.8.in
index f5b4fb9..a642820 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -216,6 +216,12 @@ congest with other routes due to algorithmically unchanging traffic
patterns. This routing option will "shift" routing around in an
attempt to alleviate this problem.
.TP
+\fB\-\-remote\-guid\-sorting\fR
+This option enables a feature called \fBremote guid sorting\fR. In some
+fabrics, switches may be cabled in an inconsistent fashion. This option
+may alleviate those issues by sorting remote guids before routing,
+making remote destinations appear to be ordered consistently.
+.TP
\fB\-H\fR, \fB\-\-max_reverse_hops\fR <file name>
Set the maximum number of reverse hops an I/O node is allowed
to make. A reverse hop is the use of a switch the wrong way around.
diff --git a/opensm/main.c b/opensm/main.c
index 5d5bbe1..e2e7355 100644
--- a/opensm/main.c
+++ b/opensm/main.c
@@ -226,6 +226,9 @@ static void show_usage(void)
printf("--port-shifting\n"
" Attempt to shift port routes around to remove alignment problems\n"
" in routing tables\n\n");
+ printf("--remote-guid-sorting\n"
+ " Sort ports by remote port guid before routing to alleviate\n"
+ " problems with inconsistent cabling across a fabric\n\n");
printf("--max_reverse_hops, -H <hop_count>\n"
" Set the max number of hops the wrong way around\n"
" an I/O node is allowed to do (connectivity for I/O nodes on top swithces)\n\n");
@@ -605,6 +608,7 @@ int main(int argc, char *argv[])
{"cn_guid_file", 1, NULL, 'u'},
{"io_guid_file", 1, NULL, 'G'},
{"port-shifting", 0, NULL, 11},
+ {"remote-guid-sorting", 0, NULL, 13},
{"max_reverse_hops", 1, NULL, 'H'},
{"ids_guid_file", 1, NULL, 'm'},
{"guid_routing_order_file", 1, NULL, 'X'},
@@ -951,6 +955,10 @@ int main(int argc, char *argv[])
opt.port_shifting = TRUE;
printf(" Port Shifting is on\n");
break;
+ case 13:
+ opt.remote_guid_sorting = TRUE;
+ printf(" Remote Guid Sorting is on\n");
+ break;
case 'H':
opt.max_reverse_hops = atoi(optarg);
printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops);
diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c
index a1ff168..bfe63c3 100644
--- a/opensm/osm_dump.c
+++ b/opensm/osm_dump.c
@@ -221,7 +221,8 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt)
/* No LMC Optimization */
best_port = osm_switch_recommend_path(p_sw, p_port,
lid_ho, 1, TRUE,
- FALSE, dor, FALSE);
+ FALSE, dor, FALSE,
+ FALSE);
fprintf(file, "No %u hop path possible via port %u!",
best_hops, best_port);
}
diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c
index c62192c..b2b219f 100644
--- a/opensm/osm_subnet.c
+++ b/opensm/osm_subnet.c
@@ -348,6 +348,7 @@ static const opt_rec_t opt_tbl[] = {
{ "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 },
{ "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 },
{ "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL, 1 },
+ { "remote_guid_sorting", OPT_OFFSET(remote_guid_sorting), opts_parse_boolean, NULL, 1 },
{ "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 },
{ "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 },
{ "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 },
@@ -742,6 +743,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt)
p_opt->cn_guid_file = NULL;
p_opt->io_guid_file = NULL;
p_opt->port_shifting = FALSE;
+ p_opt->remote_guid_sorting = FALSE;
p_opt->max_reverse_hops = 0;
p_opt->ids_guid_file = NULL;
p_opt->guid_routing_order_file = NULL;
@@ -1447,6 +1449,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts)
p_opts->port_shifting ? "TRUE" : "FALSE");
fprintf(out,
+ "# Remote Guid Sorting (use FALSE if unsure)\n"
+ "remote_guid_sorting %s\n\n",
+ p_opts->remote_guid_sorting ? "TRUE" : "FALSE");
+
+ fprintf(out,
"# SA database file name\nsa_db_file %s\n\n",
p_opts->sa_db_file ? p_opts->sa_db_file : null_str);
diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c
index f24d9ea..2584563 100644
--- a/opensm/osm_switch.c
+++ b/opensm/osm_switch.c
@@ -57,6 +57,7 @@ struct switch_port_path {
int found_sys_guid;
int found_node_guid;
uint32_t forwarded_to;
+ uint64_t remote_node_guid;
};
cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho,
@@ -169,6 +170,19 @@ boolean_t osm_switch_get_lft_block(IN const osm_switch_t * p_sw,
return TRUE;
}
+static int
+port_path_guid_cmp(IN const void *x, IN const void *y)
+{
+ struct switch_port_path *a = (struct switch_port_path *)x;
+ struct switch_port_path *b = (struct switch_port_path *)y;
+
+ if (a->remote_node_guid < b->remote_node_guid)
+ return -1;
+ if (a->remote_node_guid > b->remote_node_guid)
+ return 1;
+ return 0;
+}
+
static struct osm_remote_node *
switch_find_guid_common(IN const osm_switch_t * p_sw,
IN struct osm_remote_guids_count *r,
@@ -226,7 +240,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
IN boolean_t ignore_existing,
IN boolean_t routing_for_lmc,
IN boolean_t dor,
- IN boolean_t port_shifting)
+ IN boolean_t port_shifting,
+ IN boolean_t remote_guid_sorting)
{
/*
We support an enhanced LMC aware routing mode:
@@ -428,6 +443,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
least_forwarded_to = 0;
}
found_sys_guid = 0;
+ found_node_guid = 0;
} else { /* same sys found - try node */
@@ -463,6 +479,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
port_paths[port_paths_count].forwarded_to = p_remote_guid->forwarded_to;
else
port_paths[port_paths_count].forwarded_to = 0;
+ p_rem_physp = osm_physp_get_remote(p_physp);
+ p_rem_node = osm_physp_get_node_ptr(p_rem_physp);
+ port_paths[port_paths_count].remote_node_guid = p_rem_node->node_info.node_guid;
port_paths_total_paths += check_count;
port_paths_count++;
@@ -490,10 +509,15 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
if (port_found == FALSE)
return OSM_NO_PATH;
- if (port_shifting && port_paths_count) {
+ if ((port_shifting
+ || remote_guid_sorting)
+ && port_paths_count) {
/* In the port_paths[] array, we now have all the ports that we
- * can route out of. Using some shifting math below, possibly
- * select a different one so that lids won't align in LFTs
+ * can route out of. If port_shifting is set, using some shifting
+ * math below, possibly select a different one so that lids won't
+ * align in LFTs. If it is not set, iterate through the array
+ * normally. New ports will be selected by virtue of a sort
+ * done prior to port selection.
*
* If lmc > 0, we need to loop through these ports to find the
* least_forwarded_to port, best_port_other_sys, and
@@ -508,11 +532,19 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw,
best_port = 0;
best_port_other_sys = 0;
best_port_other_node = 0;
+
+ if (remote_guid_sorting) {
+ qsort(port_paths, port_paths_count, sizeof(struct switch_port_path),
+ port_path_guid_cmp);
+ }
for (i = 0; i < port_paths_count; i++) {
unsigned int idx;
- idx = (port_paths_total_paths/port_paths_count + i) % port_paths_count;
+ if (port_shifting)
+ idx = (port_paths_total_paths/port_paths_count + i) % port_paths_count;
+ else
+ idx = i;
if (routing_for_lmc) {
if (!port_paths[idx].found_sys_guid
diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index d32eb60..a8982df 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -256,7 +256,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr,
p_mgr->p_subn->ignore_existing_lfts,
p_mgr->p_subn->opt.lmc,
p_mgr->is_dor,
- p_mgr->p_subn->opt.port_shifting);
+ p_mgr->p_subn->opt.port_shifting,
+ p_mgr->p_subn->opt.remote_guid_sorting);
if (port == OSM_NO_PATH) {
/* do not try to overwrite the ppro of non existing port ... */
--
1.5.4.5
--- End Message ---
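
The remote guid sorting step in the patch above amounts to a qsort() of
the candidate ports keyed on the GUID of the node at the far end of each
link. The following is a minimal sketch under stated assumptions, not the
OpenSM code: struct port_entry, port_entry_guid_cmp() and
sort_by_remote_guid() are hypothetical names, and the struct is reduced to
the two fields needed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for the patch's struct switch_port_path. */
struct port_entry {
	uint8_t port_num;
	uint64_t remote_node_guid;
};

/* qsort() comparator: ascending by remote node GUID.
 * Explicit comparisons are used instead of subtraction because the
 * difference of two 64-bit GUIDs does not fit in the int return value. */
int port_entry_guid_cmp(const void *x, const void *y)
{
	const struct port_entry *a = x;
	const struct port_entry *b = y;

	if (a->remote_node_guid < b->remote_node_guid)
		return -1;
	if (a->remote_node_guid > b->remote_node_guid)
		return 1;
	return 0;
}

/* Reorder candidate ports so that iteration order follows the remote
 * GUIDs rather than the local port numbers (i.e. the cabling). */
void sort_by_remote_guid(struct port_entry *ports, size_t count)
{
	assert(ports != NULL || count == 0);
	if (count > 1)
		qsort(ports, count, sizeof(*ports), port_entry_guid_cmp);
}
```

With ports 1, 2, 3 cabled to remote GUIDs 0x30, 0x10, 0x20, the sorted
iteration order becomes port 2, port 3, port 1. Note that qsort() is not
guaranteed to be stable, so the relative order of ports sharing the same
remote GUID is unspecified in this sketch.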