[PATCH] OpenSM: LFT update breaks if IB_SMP_DATA_SIZE changes
This is only a precautionary patch for a theoretical bug that would arise
if someone redefined IB_SMP_DATA_SIZE to a value != 64.
ucast_mgr_pipeline_fwd_tbl() calculates the maximum number of blocks to
update using a hard-coded 64, while set_lft_block() uses IB_SMP_DATA_SIZE.
If IB_SMP_DATA_SIZE != 64, switches would receive too few or too many
blocks.

Signed-off-by: Jens Domke <jens.do...@tu-dresden.de>
---
 opensm/osm_ucast_mgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 7ccaa77..893a70b 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -1036,7 +1036,7 @@ static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr)
 {
 	cl_qmap_t *tbl;
 	cl_map_item_t *item;
-	unsigned i, max_block = p_mgr->max_lid / 64 + 1;
+	unsigned i, max_block = p_mgr->max_lid / IB_SMP_DATA_SIZE + 1;

 	tbl = &p_mgr->p_subn->sw_guid_tbl;
 	for (i = 0; i < max_block; i++)
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
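For illustration only, the arithmetic behind this patch can be sketched as follows. The helper lft_block_count() and its block_size parameter are hypothetical names and not part of OpenSM; the sketch merely assumes that each forwarding table block carries block_size one-byte entries.

```c
#include <stdint.h>

/* Hypothetical helper (not an OpenSM function): number of LFT blocks
 * needed to cover LIDs 0..max_lid when each block holds block_size
 * one-byte entries. */
static unsigned lft_block_count(uint16_t max_lid, unsigned block_size)
{
	return max_lid / block_size + 1;
}
```

With max_lid = 64 a block size of 64 gives 2 blocks, while a block size of 32 gives 3. A loop that still divided by 64 would therefore stop one block short if the sender used 32-byte blocks.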
[PATCH 1/1] OpenSM: command line option ignore-guids broken
This patch changes the documentation (--help and man page) from
--ignore-guids to --ignore_guids, so that it matches the implementation.

Signed-off-by: Jens Domke <jens.do...@tu-dresden.de>
---
 doc/current-routing.txt | 2 +-
 man/opensm.8.in         | 6 +++---
 opensm/main.c           | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/current-routing.txt b/doc/current-routing.txt
index d23ae0d..acfeb56 100644
--- a/doc/current-routing.txt
+++ b/doc/current-routing.txt
@@ -127,7 +127,7 @@ subscription is also equalized with the ability to override based on
 port GUID. The latter is supplied by:
 -i equalize-ignore-guids-file
---ignore-guids equalize-ignore-guids-file
+--ignore_guids equalize-ignore-guids-file
 This option provides the means to define a set of ports (by guids)
 that will be ignored by the link load equalization algorithm.
diff --git a/man/opensm.8.in b/man/opensm.8.in
index c1092cc..8ea127d 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -37,7 +37,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 [\-\-maxsmps number]
 [\-\-console [off | local | socket | loopback]]
 [\-\-console-port port]
-[\-i(gnore-guids) equalize-ignore-guids-file]
+[\-i | \-\-ignore_guids equalize-ignore-guids-file]
 [\-w | \-\-hop_weights_file path to file]
 [\-O | \-\-port_search_ordering_file path to file]
 [\-O | \-\-dimn_ports_file path to file] (DEPRECATED)
@@ -298,7 +298,7 @@ Specify an alternate telnet port for the socket console (default 1).
 Note that this option only appears if OpenSM was built with
 --enable-console-socket.
 .TP
-\fB\-i\fR, \fB\-\-ignore-guids\fR equalize-ignore-guids-file
+\fB\-i\fR, \fB\-\-ignore_guids\fR equalize-ignore-guids-file
 This option provides the means to define a set of ports (by node guid
 and port number) that will be ignored by the link load
 equalization algorithm.
@@ -987,7 +987,7 @@ port GUID.
 The latter is supplied by:
 -i equalize-ignore-guids-file
 .br
-\-\-ignore-guids equalize-ignore-guids-file
+\-\-ignore_guids equalize-ignore-guids-file
 This option provides the means to define a set of ports (by guid) that will be
 ignored by the link load equalization algorithm. Note that only endports (CA,
diff --git a/opensm/main.c b/opensm/main.c
index 6551a37..8419e68 100644
--- a/opensm/main.c
+++ b/opensm/main.c
@@ -289,7 +289,7 @@ static void show_usage(void)
 	       "Specify an alternate telnet port for the console (default %d).\n\n",
 	       OSM_DEFAULT_CONSOLE_PORT);
 #endif
-	printf("--ignore-guids, -i equalize-ignore-guids-file\n"
+	printf("--ignore_guids, -i equalize-ignore-guids-file\n"
 	       "This option provides the means to define a set of ports\n"
 	       "(by guid) that will be ignored by the link load\n"
 	       "equalization algorithm.\n\n");
--
1.9.1
[PATCH 1/1] OpenSM: osm_ucast_dfsssp.c - prevent double free error
An error in the routing execution can cause a second free() call on
sw_list, which results in a 'double free' error.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 5eaff3d..ec69df0 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -2382,6 +2382,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 	/* the intermediate array lived long enough */
 	free(sw_list);
+	sw_list = NULL;
 	/* same is true for the compute node and I/O guid map */
 	destroy_guid_map(cn_tbl);
 	cn_nodes_provided = FALSE;
--
1.7.1
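The idea of the one-line fix can be shown in isolation. The sketch below is illustrative only; free_and_clear() is a hypothetical helper and not an OpenSM function. It relies on the C standard guarantee that free(NULL) does nothing.

```c
#include <stdlib.h>

/* Illustrative only: release *pp and clear it, so that a later cleanup
 * path which frees the same variable again becomes a harmless
 * free(NULL) instead of a double free. */
static void free_and_clear(void **pp)
{
	free(*pp);
	*pp = NULL;
}
```

Calling free_and_clear() twice on the same variable is safe, which is the situation the patch protects against when an error path runs the cleanup a second time.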
Re: [PATCH 5/5] opensm: Resend LFTs/VLArb/SL2VL MADs in case of error
Dear Alex,

the memset call in sl2vl_update_table causes segmentation faults if
force_update=1, since p_tbl won't get anything assigned and remains NULL.
Please find a possible fix attached.

Regards,
Jens

On 03.02.14 20:05, Alex Netes wrote:
> There are several MADs that we only SET during the sweep (and never GET).
> Zero the stored block, so in case the MAD will end up with error, we will
> resend it during the next sweep.
>
> Signed-off-by: Alex Netes <ale...@mellanox.com>
> ---
>  opensm/osm_qos.c       |   13 +
>  opensm/osm_ucast_mgr.c |    7 +++
>  2 files changed, 20 insertions(+), 0 deletions(-)
>
> diff --git a/opensm/osm_qos.c b/opensm/osm_qos.c
> index a301803..473e3c8 100644
> --- a/opensm/osm_qos.c
> +++ b/opensm/osm_qos.c
> @@ -183,6 +183,13 @@ static ib_api_status_t vlarb_update_table_block(osm_sm_t * sm,
>  	if (!p_mad)
>  		return IB_INSUFFICIENT_MEMORY;
>
> +	/*
> +	 * Zero the stored VL Arbitration block, so in case the MAD will
> +	 * end up with error, we will resend it in the next sweep.
> +	 */
> +	memset(&p->vl_arb[block_num], 0,
> +	       block_length * sizeof(block.vl_entry[0]));
> +
>  	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
>
>  	return IB_SUCCESS;
> @@ -272,6 +279,12 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
>  	if (!p_mad)
>  		return IB_INSUFFICIENT_MEMORY;
>
> +	/*
> +	 * Zero the stored SL2VL block, so in case the MAD will
> +	 * end up with error, we will resend it in the next sweep.
> +	 */
> +	memset(p_tbl, 0, sizeof(tbl));
> +
>  	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
>  	return IB_SUCCESS;
>  }
> diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
> index 8194307..c8a7360 100644
> --- a/opensm/osm_ucast_mgr.c
> +++ b/opensm/osm_ucast_mgr.c
> @@ -1002,6 +1002,13 @@ static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr,
>  		    IB_SMP_DATA_SIZE))
>  		return 0;
>
> +	/*
> +	 * Zero the stored LFT block, so in case the MAD will end up
> +	 * with error, we will resend it in the next sweep.
> +	 */
> +	memset(p_sw->lft + block_id_ho * IB_SMP_DATA_SIZE, OSM_NO_PATH,
> +	       IB_SMP_DATA_SIZE);
> +
>  	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
>  		"Writing FT block %u to switch 0x%" PRIx64 "\n", block_id_ho,
>  		cl_ntoh64(context.lft_context.node_guid));

From 3cbe8f10c4ab7d83c5898b67e42d9e99be355c05 Mon Sep 17 00:00:00 2001
From: Jens Domke <domke.j...@m.titech.ac.jp>
Date: Tue, 4 Feb 2014 14:47:44 +0900
Subject: [PATCH 1/1] osm_qos.c: fix potential segmentation fault

If force_update=1, then p_tbl remains NULL and therefore memset crashes.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_qos.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_qos.c b/opensm/osm_qos.c
index 473e3c8..76f0ff6 100644
--- a/opensm/osm_qos.c
+++ b/opensm/osm_qos.c
@@ -252,7 +252,7 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 					  const ib_slvl_table_t * sl2vl_table,
 					  cl_qlist_t *mad_list)
 {
-	ib_slvl_table_t tbl, *p_tbl;
+	ib_slvl_table_t tbl, *p_tbl = NULL;
 	unsigned vl_mask;
 	uint8_t vl1, vl2;
 	int i;
@@ -283,7 +283,8 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 	 * Zero the stored SL2VL block, so in case the MAD will
 	 * end up with error, we will resend it in the next sweep.
 	 */
-	memset(p_tbl, 0, sizeof(tbl));
+	if (p_tbl)
+		memset(p_tbl, 0, sizeof(tbl));

 	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
 	return IB_SUCCESS;
--
1.7.1
[PATCH 2/5] OpenSM: dfsssp - send multicast forwarding tables to switches
Issue: the root switch of the mcast spanning tree was ignored. When a port
of the root switch is part of the mcast group, the switch won't be
processed and none of its ports will be part of the resulting mcast
forwarding table.

Fix: remove the test for used_link==NULL, because all switches in adj_list
should have a used_link set by the prior dijkstra step (except the root
switch) => the test is not needed and the root switch will be included in
the mcast update.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 9c34795..219f8bb 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1607,13 +1607,11 @@ static int update_mcft(osm_sm_t * p_sm, vertex_t * adj_list,
 			"(%s) for MLID 0x%X\n", cl_ntoh64(adj_list[i].guid),
 			p_sw->p_node->print_desc, mlid_ho);

-		/* if a) no route goes thru this switch or
-		   b) the switch does not support mcast or
-		   c) no ports of this switch are part or the mcast group
+		/* if a) the switch does not support mcast or
+		   b) no ports of this switch are part or the mcast group
 		   then cycle
 		 */
-		if (!(adj_list[i].used_link) ||
-		    osm_switch_supports_mcast(p_sw) == FALSE ||
+		if (osm_switch_supports_mcast(p_sw) == FALSE ||
 		    (p_sw->num_of_mcm == 0 && !(p_sw->is_mc_member)))
 			continue;

--
1.7.1
[PATCH 1/5] OpenSM: dfsssp - send multicast forwarding tables to switches
Issue: dfsssp calculates mcast forwarding tables but doesn't distribute
them to the switches, because is_mc_member/num_of_mcm for each switch was
reset to 0 in osm_mcast_mgr.c. dfsssp relies on this data to figure out
which switch is involved in the mcast group.

Fix: recalculate is_mc_member/num_of_mcm, similar to the code of
create_mgrp_switch_map(...) in osm_mcast_mgr.c, right before the
update_mcft function and reset them to 0 afterwards.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   43 +++
 1 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ef7de59..9c34795 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1544,6 +1544,43 @@ static int update_lft(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
 	return 0;
 }

+/* the function updates the multicast group membership information
+   similar to create_mgrp_switch_map (osm_mcast_mgr.c)
+   => with it we can identify if a switch needs to be processed
+   or not in update_mcft
+*/
+static void update_mgrp_membership(cl_qlist_t * port_list)
+{
+	osm_mcast_work_obj_t *wobj = NULL;
+	osm_port_t *port = NULL;
+	osm_switch_t *sw = NULL;
+	cl_list_item_t *i = NULL;
+
+	for (i = cl_qlist_head(port_list); i != cl_qlist_end(port_list);
+	     i = cl_qlist_next(i)) {
+		wobj = cl_item_obj(i, wobj, list_item);
+		port = wobj->p_port;
+		if (port->p_node->sw) {
+			sw = port->p_node->sw;
+			sw->is_mc_member = 1;
+		} else {
+			sw = port->p_physp->p_remote_physp->p_node->sw;
+			sw->num_of_mcm++;
+		}
+	}
+}
+
+/* reset is_mc_member and num_of_mcm for future computations */
+static void reset_mgrp_membership(vertex_t * adj_list, uint32_t adj_list_size)
+{
+	uint32_t i = 0;
+
+	for (i = 1; i < adj_list_size; i++) {
+		adj_list[i].sw->is_mc_member = 0;
+		adj_list[i].sw->num_of_mcm = 0;
+	}
+}
+
 /* update the multicast forwarding tables of all switches with the informations
    from the previous dijsktra step for the current mlid
 */
@@ -2386,6 +2423,11 @@ static ib_api_status_t dfsssp_do_mcast_routing(void * context,
 		goto Exit;
 	}

+	/* set mcast group membership again for update_mcft
+	   (unfortunately: osm_mcast_mgr_find_root_switch resets it)
+	 */
+	update_mgrp_membership(&mcastgrp_port_list);
+
 	/* update the mcast forwarding tables of the switches */
 	err = update_mcft(sm, adj_list, adj_list_size, mbox->mlid,
 			  &mcastgrp_port_map, root_sw);
@@ -2398,6 +2440,7 @@ static ib_api_status_t dfsssp_do_mcast_routing(void * context,
 	}

 Exit:
+	reset_mgrp_membership(adj_list, adj_list_size);
 	osm_mcast_drop_port_list(&mcastgrp_port_list);
 	OSM_LOG_EXIT(sm->p_log);
 	return status;
--
1.7.1
[PATCH 1/1] OpenSM: dfsssp - add missing and change existing return values
a) This patch sets the 'err' variable correctly in
   dfsssp_remove_deadlocks() for the case that the error occurs within
   the function itself and not within a subroutine.
b) The functions dfsssp_build_graph() and dfsssp_do_dijkstra_routing()
   now return -1 instead of 1 to indicate an error, to comply with the
   implementation of ucast_mgr_route() in osm_ucast_mgr.c.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   27 ++-
 1 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index a08e44b..321fffd 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1158,7 +1158,7 @@ static int dfsssp_build_graph(void *context)
 	if (!adj_list) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD02: cannot allocate memory for adj_list\n");
-		return 1;
+		goto ERROR;
 	}
 	for (i = 0; i < adj_list_size; i++)
 		set_default_vertex(&adj_list[i]);
@@ -1198,7 +1198,7 @@ static int dfsssp_build_graph(void *context)
 				OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 					"ERR AD03: cannot allocate memory for a link\n");
 				dfsssp_context_destroy(context);
-				return 1;
+				goto ERROR;
 			}
 			head = link;
 			head->next = NULL;
@@ -1243,7 +1243,7 @@ static int dfsssp_build_graph(void *context)
 						head = head->next;
 						free(link);
 					}
-					return 1;
+					goto ERROR;
 				}
 				link = link->next;
 				set_default_link(link);
@@ -1277,6 +1277,9 @@ static int dfsssp_build_graph(void *context)
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return 0;
+
+ERROR:
+	return -1;
 }

 static void print_routes(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
@@ -1891,6 +1894,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 			if (!weakest_link) {
 				OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 					"ERR AD27: something went wrong in get_weakest_link_in_cycle(...)\n");
+				err = 1;
 				goto ERROR;
 			}

@@ -2012,6 +2016,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 	if (!split_count) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
+		err = 1;
 		goto ERROR;
 	}
 	/* initial state: paths for VLs won't be separated */
@@ -2060,6 +2065,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD25: Not enough VL available (avail=%d, needed=%d); Stop dfsssp routing!\n",
 			vl_avail, vl_needed);
+		err = 1;
 		goto ERROR;
 	}
 	/* else { no balancing } */
@@ -2161,7 +2167,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 	if (!sw_list) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD29: cannot allocate memory for sw_list in dfsssp_do_dijkstra_routing\n");
-		return 1;
+		goto ERROR;
 	}
 	memset(sw_list, 0, sw_list_size * sizeof(vertex_t *));

@@ -2197,7 +2203,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 				OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 					"ERR AD31: corrupted sw_list array in dfsssp_do_dijkstra_routing\n");
 				free(sw_list);
-				return 1;
+				goto ERROR;
 			}
 		}

@@ -2240,7 +2246,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 			err =
 			    dijkstra(p_mgr, adj_list, adj_list_size, port, lid);
 			if (err)
-				return err;
+				goto ERROR;
 			if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG))
 				print_routes(p_mgr, adj_list, adj_list_size,
 					     port);
@@ -2249,7 +2255,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 			err =
 			    update_lft(p_mgr, adj_list, adj_list_size, port, lid);
 			if (err)
-				return err;
+				goto ERROR
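The single-exit error convention used by this patch can be sketched in a self-contained form. This is only an illustration of the pattern: build_two_buffers() and its fail_second flag are hypothetical and do not exist in OpenSM; the return values follow the patch description (0 on success, -1 on error).

```c
#include <stdlib.h>

/* Hypothetical routine: every failure jumps to one label, which releases
 * whatever was allocated so far and reports the error as -1. */
static int build_two_buffers(size_t n, int fail_second)
{
	int *a = NULL, *b = NULL;

	a = calloc(n, sizeof(*a));
	if (!a)
		goto ERROR;

	/* fail_second simulates an allocation failure for the example */
	b = fail_second ? NULL : calloc(n, sizeof(*b));
	if (!b)
		goto ERROR;

	free(b);
	free(a);
	return 0;

ERROR:
	free(b);	/* free(NULL) is a no-op */
	free(a);
	return -1;
}
```

Routing all failures through one label keeps the return value consistent, so a caller only has to distinguish zero from non-zero.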
[PATCH 2/2] OpenSM: DFSSSP - workaround for better VL balancing
Currently, DFSSSP maps the src/dest paths statically to certain VLs.
Especially for deadlock-free topologies this can result in unfair
balancing: some VLs within one link might be overused, which results in
lower bandwidth for some src/dest pairs.

The fix changes the VL assignment in two ways: first, we balance the
number of paths per VL; second, we randomly assign the VL as long as
this doesn't violate deadlock-freedom.

1) The balancing splits the paths across the available free VLs, so that
   the maximal number of paths per VL is minimized. We save the number
   of VLs for each deadlock-free channel dependency graph (CDG).
   E.g. for 8 VLs, paths per CDG: {14,5,1}
        => balanced VLs: {{3,3,3,3,2},{3,2},1}
   we have 5 VLs to choose from for CDG(0), two for CDG(1) and
   one for CDG(2).

2) get_dfsssp_sl(...) will use the information of (1) to randomly assign
   the VL for one src/dest pair within the possible number of VLs.
   E.g. for a src/dest pair of CDG(0) we have 5 VLs to choose from,
   therefore VL := baseVL + rand()%5

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |  131 +++--
 1 files changed, 90 insertions(+), 41 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 98c3f7c..7aecc24 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -133,6 +133,7 @@ typedef struct dfsssp_context {
 	vertex_t *adj_list;
 	uint32_t adj_list_size;
 	vltable_t *srcdest2vl_table;
+	uint8_t *vl_split_count;
 } dfsssp_context_t;

 /**************** set initial values for structs **********************/
@@ -1722,8 +1723,9 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 	cl_map_item_t *item1 = NULL, *item2 = NULL;
 	osm_port_t *src_port = NULL, *dest_port = NULL;

-	uint32_t i = 0, err = 0;
-	uint8_t test_vl = 0, vl_avail = 0, vl_needed = 1;
+	uint32_t i = 0, j = 0, err = 0;
+	uint8_t vl = 0, test_vl = 0, vl_avail = 0, vl_needed = 1;
+	double most_avg_paths = 0.0;
 	cdg_node_t **cdg = NULL, *start_here = NULL, *cycle = NULL;
 	cdg_link_t *weakest_link = NULL;
 	uint32_t srcdest = 0;
@@ -2004,43 +2006,56 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 	OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE,
 		"Balancing the paths on the available Virtual Lanes\n");

-	/* balancing virtual lanes, but avoid additional cycle check -> balancing suboptimal;
+	/* optimal balancing virtual lanes, under condition: no additional cycle checks;
 	   sl/vl != 0 might be assigned to loopback packets (i.e. slid/dlid on the
 	   same port for lmc>0), but thats no problem, see IBAS 10.2.2.3
 	 */
-	if (vl_needed == 1) {
-		from = 0;
-		count = paths_per_vl[0] / vl_avail;
-		for (to = 1; to < vl_avail; to++) {
-			vltable_change_vl(srcdest2vl_table, from, to, count);
-			paths_per_vl[from] -= count;
-			paths_per_vl[to] += count;
-		}
-	} else if (vl_needed < vl_avail) {
-		split_count = (uint8_t *) malloc(vl_needed * sizeof(uint8_t));
-		if (!split_count) {
-			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
-				"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
-		} else {
-			memset(split_count, 0, vl_needed * sizeof(uint8_t));
-			for (i = vl_needed; i < vl_avail; i++)
-				split_count[(i - vl_needed) % vl_needed]++;
-
-			to = vl_needed;
-			for (from = 0; from < vl_needed; from++) {
-				count =
-				    paths_per_vl[from] / (split_count[from] +
-							  1);
-				for (i = 0; i < split_count[from]; i++) {
-					vltable_change_vl(srcdest2vl_table,
-							  from, to, count);
-					paths_per_vl[from] -= count;
-					paths_per_vl[to] += count;
-					to++;
+	split_count = (uint8_t *) calloc(vl_avail, sizeof(uint8_t));
+	if (!split_count) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
+			"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
+		goto ERROR;
+	}
+	/* initial state: paths for VLs won't be separated */
+	for (i = 0; i < ((vl_needed < vl_avail) ? vl_needed : vl_avail); i++)
+		split_count[i] = 1;
+	dfsssp_ctx->vl_split_count = split_count;
+	/* balancing is necessary if we have empty VLs */
+	if (vl_needed < vl_avail
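The numbers in the example above ({14,5,1} paths on 8 VLs giving 5, 2 and 1 VLs per CDG) can be reproduced with a small greedy split. The sketch below is an assumption about one way to arrive at these numbers and is not the code of the patch; split_vls() and pick_vl() are hypothetical names, and treating baseVL as the sum of the VL counts of all preceding CDGs is likewise an assumption. It requires vl_needed <= vl_avail.

```c
#include <stdint.h>

/* Illustrative greedy split: start with one VL per CDG, then hand each
 * remaining VL to the CDG whose paths-per-VL average is currently the
 * highest. */
static void split_vls(const uint32_t *paths_per_cdg, uint8_t vl_needed,
		      uint8_t vl_avail, uint8_t *split_count)
{
	uint8_t i, vl, busiest;
	double avg, most_avg_paths;

	for (i = 0; i < vl_needed; i++)
		split_count[i] = 1;

	for (vl = vl_needed; vl < vl_avail; vl++) {
		busiest = 0;
		most_avg_paths = 0.0;
		for (i = 0; i < vl_needed; i++) {
			avg = (double)paths_per_cdg[i] / split_count[i];
			if (avg > most_avg_paths) {
				most_avg_paths = avg;
				busiest = i;
			}
		}
		split_count[busiest]++;
	}
}

/* Pick a VL for one path of the given CDG: base VL of that CDG (assumed
 * to be the sum of the counts before it) plus a caller-supplied random
 * number reduced modulo the CDG's VL count. */
static uint8_t pick_vl(const uint8_t *split_count, uint8_t cdg, unsigned rnd)
{
	uint8_t base = 0, i;

	for (i = 0; i < cdg; i++)
		base += split_count[i];
	return (uint8_t)(base + rnd % split_count[cdg]);
}
```

For paths {14,5,1} and 8 VLs the split ends at {5,2,1}, so a path of CDG(0) may land on VL 0..4, a path of CDG(1) on VL 5..6, and all paths of CDG(2) stay on VL 7. Passing rand() as rnd corresponds to the formula VL := baseVL + rand()%5 for CDG(0).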
[PATCH 1/1] OpenSM: dfsssp - add support for multicast
Recent tests on a large system revealed a problem with loops in the
multicast routing. Using DFSSSP together with the default mcast routing
algorithm of OpenSM can produce loops in the fabric.

This patch adds the mcast_build_stree function to the DFSSSP routing
algorithm, so that DFSSSP is able to calculate the correct mcast
forwarding tables for the subnet. It performs almost the same steps as
the default mcast routing, except that it uses the Dijkstra algorithm to
generate the spanning tree instead of using the hop count information
given by the unicast routing.

General overview of the algorithm in pseudo-code:
1) identify the ports which are part of the multicast group
2) find the 'best' switch (depending on the hop count) for the mcast
   group, which can be used as the root of the spanning tree
3) perform a dijkstra step with the root switch as starting point to
   generate a spanning tree to all other switches in the subnet
4) build the mcast forwarding tables for relevant switches:
   4.1) select a switch which has mcast member ports connected to it
   4.2) set the downstream ports for the mcast member ports in the mcft
   4.3) traverse towards the root of the spanning tree and set
        up-/downstream ports on this path for all involved switches
   4.4) goto 4.1 until all switches have been processed

The same mcast algorithm will be used for SSSP, because SSSP has the
potential to produce loops in the mcast forwarding table as well.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 include/opensm/osm_mcast_mgr.h |   72 +++
 opensm/Makefile.am             |    1 +
 opensm/osm_mcast_mgr.c         |   35
 opensm/osm_ucast_dfsssp.c      |  194
 4 files changed, 283 insertions(+), 19 deletions(-)
 create mode 100644 include/opensm/osm_mcast_mgr.h

diff --git a/include/opensm/osm_mcast_mgr.h b/include/opensm/osm_mcast_mgr.h
new file mode 100644
index 0000000..291a478
--- /dev/null
+++ b/include/opensm/osm_mcast_mgr.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009-2011 ZIH, TU Dresden, Federal Republic of Germany. All rights reserved.
+ * Copyright (C) 2012-2013 Tokyo Institute of Technology. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * Declaration of osm_mcast_work_obj_t.
+ * Provide access to a mcast function which searches the root swicth for
+ * a spanning tree.
+ */
+
+#ifndef _OSM_MCAST_MGR_H_
+#define _OSM_MCAST_MGR_H_
+
+#ifdef __cplusplus
+# define BEGIN_C_DECLS extern "C" {
+# define END_C_DECLS }
+#else /* !__cplusplus */
+# define BEGIN_C_DECLS
+# define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+typedef struct osm_mcast_work_obj {
+	cl_list_item_t list_item;
+	osm_port_t *p_port;
+	cl_map_item_t map_item;
+} osm_mcast_work_obj_t;
+
+int osm_mcast_make_port_list_and_map(cl_qlist_t * list, cl_qmap_t * map,
+				     osm_mgrp_box_t * mbox);
+
+void osm_mcast_drop_port_list(cl_qlist_t * list);
+
+osm_switch_t * osm_mcast_mgr_find_root_switch(osm_sm_t * sm, cl_qlist_t * list);
+
+END_C_DECLS
+#endif /* _OSM_MCAST_MGR_H_ */
diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index 7fd6bc6..20318cc 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -116,6 +116,7 @@ opensminclude_HEADERS
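Steps 4.1 to 4.4 of the overview in the commit message can be modelled on a plain parent array. The sketch below is a simplified illustration, not the patch's update_mcft(): it only decides which switches end up in the multicast tree and leaves out the per-port masks. The names mark_mcast_switches(), parent, has_member and in_tree are hypothetical.

```c
#include <stdint.h>

#define NO_PARENT (-1)

/* parent[] is the spanning tree from the Dijkstra step (the root has
 * NO_PARENT); has_member[s] is non-zero if switch s has mcast member
 * ports attached. On return, in_tree[s] is 1 for every switch that
 * needs a forwarding table entry for the group. */
static void mark_mcast_switches(const int *parent, const uint8_t *has_member,
				int num_switches, uint8_t *in_tree)
{
	int s, cur;

	for (s = 0; s < num_switches; s++)
		in_tree[s] = 0;

	for (s = 0; s < num_switches; s++) {
		if (!has_member[s])
			continue;
		/* climb towards the root; stop early once we reach a
		 * switch that an earlier walk has already marked */
		for (cur = s; cur != NO_PARENT && !in_tree[cur];
		     cur = parent[cur])
			in_tree[cur] = 1;
	}
}
```

Because every walk follows parent pointers of one tree, the marked switches always form a subtree rooted at the root switch, which is why this construction cannot introduce forwarding loops.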
[PATCH 01/10] DFSSSP: fix a memory leak in dfsssp_build_graph
If the graph could not be built correctly and DFSSSP returned an error,
then not all allocated memory was freed.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ffc317f..ff525ea 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1093,6 +1093,9 @@ static int dfsssp_build_graph(void *context)
 	for (i = 0; i < adj_list_size; i++)
 		set_default_vertex(&adj_list[i]);

+	dfsssp_ctx->adj_list = adj_list;
+	dfsssp_ctx->adj_list_size = adj_list_size;
+
 	/* count the total number of Hca / LIDs (for lmc>0) in the fabric */
 	for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl);
 	     item = cl_qmap_next(item)) {
@@ -1190,9 +1193,6 @@ static int dfsssp_build_graph(void *context)
 	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG))
 		dfsssp_print_graph(p_mgr, adj_list, adj_list_size);

-	dfsssp_ctx->adj_list = adj_list;
-	dfsssp_ctx->adj_list_size = adj_list_size;
-
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return 0;
 }
--
1.7.1
[PATCH 08/10] OpenSM: dfsssp - avoid unnecessary nested loop in vltable_print for OSM_LOG_INFO
For OSM_LOG_INFO the debug function vltable_print was called, which
iterates in a nested loop over all LIDs but only prints output for
OSM_LOG_DEBUG; therefore we move vltable_print into a separate if clause.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index b82d8c8..32bc8f1 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1940,10 +1940,13 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 		goto ERROR;
 	}
 	/* else { no balancing } */
-	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_INFO)) {
+
+	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG)) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 			"Virtual Lanes per src/dest combination after balancing:\n");
 		vltable_print(p_mgr, srcdest2vl_table);
+	}
+	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_INFO)) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
 			"Paths per VL (after balancing):\n");
 		for (i = 0; i < vl_avail; i++)
--
1.7.1
[PATCH 04/10] opensm/osm_ucast_dfsssp.c : fix dereference before null check
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Dereferencing dfsssp_ctx before a null check.

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index af1b062..f88382b 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -2069,14 +2069,16 @@ static uint8_t get_dfsssp_sl(void *context, uint8_t hint_for_default_sl,
 			     const ib_net16_t slid, const ib_net16_t dlid)
 {
 	dfsssp_context_t *dfsssp_ctx = (dfsssp_context_t *) context;
-	osm_ucast_mgr_t *p_mgr = (osm_ucast_mgr_t *) dfsssp_ctx->p_mgr;
 	osm_port_t *src_port, *dest_port;
 	vltable_t *srcdest2vl_table = NULL;
+	osm_ucast_mgr_t *p_mgr = NULL;
 	int32_t res = 0;

 	if (dfsssp_ctx
-	    && dfsssp_ctx->routing_type == OSM_ROUTING_ENGINE_TYPE_DFSSSP)
+	    && dfsssp_ctx->routing_type == OSM_ROUTING_ENGINE_TYPE_DFSSSP) {
+		p_mgr = (osm_ucast_mgr_t *) dfsssp_ctx->p_mgr;
 		srcdest2vl_table = (vltable_t *) (dfsssp_ctx->srcdest2vl_table);
+	}
 	else
 		return hint_for_default_sl;

--
1.7.1
[PATCH 10/10] OpenSM: dfsssp - moved paths from one to another VL might be counted multiple times
The counter for paths that have been moved to a different VL was
incorrect; the counter should not include paths moved in a previous step.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index c8a1007..a53e783 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1813,8 +1813,14 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 					    (uint8_t)
 					    vltable_get_vl(srcdest2vl_table,
 							   cl_hton16(slid),
-							   cl_hton16(dlid)))
+							   cl_hton16(dlid))) {
+						/* this path has been moved
+						   before -> don't count
+						 */
+						paths_per_vl[test_vl]++;
+						paths_per_vl[test_vl + 1]--;
 						continue;
+					}

 					src_port =
 					    osm_get_port_by_lid(p_mgr->p_subn,
--
1.7.1
[PATCH 03/10] opensm/osm_ucast_dfsssp.c : fix dereference null return value
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Dereferencing a null pointer remote_node.

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 013bad4..af1b062 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -815,7 +815,7 @@ static int update_channel_dep_graph(cdg_node_t ** cdg_root,
 		    osm_node_get_remote_node(local_node, local_port,
 					     &remote_port);
 		/* if remote_node is a Hca, then the last channel from switch to Hca would be a sink in the cdg -> skip */
-		if (!remote_node->sw)
+		if (!remote_node || !remote_node->sw)
 			break;
 		remote_lid = cl_ntoh16(osm_node_get_base_lid(remote_node, 0));

@@ -961,7 +961,7 @@ static int remove_path_from_cdg(cdg_node_t ** cdg_root, osm_port_t * src_port,
 		    osm_node_get_remote_node(local_node, local_port,
 					     &remote_port);
 		/* if remote_node is a Hca, then the last channel from switch to Hca would be a sink in the cdg -> skip */
-		if (!remote_node->sw)
+		if (!remote_node || !remote_node->sw)
 			break;
 		remote_lid = cl_ntoh16(osm_node_get_base_lid(remote_node, 0));

--
1.7.1
[PATCH 02/10] opensm/osm_ucast_dfsssp.c : Fix resource leak
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Variable head going out of scope leaks the storage it points to.

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |    5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ff525ea..013bad4 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1161,6 +1161,11 @@ static int dfsssp_build_graph(void *context)
 				OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 					"ERR AD08: cannot allocate memory for a link\n");
 				dfsssp_context_destroy(context);
+				while (head) {
+					link = head;
+					head = head->next;
+					free(link);
+				}
 				return 1;
 			}
 			link = link->next;
--
1.7.1
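The cleanup loop added by this patch is the usual way to release a singly linked list. A self-contained sketch is given below; struct list_node is a simplified stand-in for the link type used in the patch (only the next pointer is modelled), and free_list() with its return value is a hypothetical helper added so the example can be checked.

```c
#include <stdlib.h>

/* Simplified stand-in for the patch's link type. */
struct list_node {
	struct list_node *next;
};

/* Free every element of a singly linked list and return how many
 * elements were released. */
static unsigned free_list(struct list_node *head)
{
	unsigned freed = 0;
	struct list_node *node;

	while (head) {
		node = head;
		head = head->next;	/* read next before freeing the node */
		free(node);
		freed++;
	}
	return freed;
}
```

The important detail is the order inside the loop: the next pointer is read before the current element is freed, so no freed memory is ever dereferenced.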
[PATCH 09/10] OpenSM: DFSSSP does not find LIDs due to wrong byte order (v2)
Problem: the argument list for external calls of path_sl(...) has been changed at some point in the past; the path_sl(...) arguments for slid/dlid are now in network byte order, while the internal storage of lids is in host byte order; this mismatch results in a return value of 'hint_for_default_sl' from DFSSSP's get_dfsssp_sl function for every request

Fix: lids will be stored in network byte order, so that a conversion is not necessary and DFSSSP returns the correct SL for that request

This is version 2 of the original patch, because I forgot to change some internal calls. Please use this patch instead of the patch from December 17.

Signed-off-by: Jens Domke domke.j...@m.titech.ac.jp
---
 opensm/osm_ucast_dfsssp.c |   38 +++++++++++++++++++++-----------------
 1 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 32bc8f1..c8a1007 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -339,7 +339,7 @@ static void heap_free(binary_heap_t * heap)
 /* compare function of two lids for stdlib qsort */
 static int cmp_lids(const void *l1, const void *l2)
 {
-	uint16_t lid1 = *((uint16_t *) l1), lid2 = *((uint16_t *) l2);
+	ib_net16_t lid1 = *((ib_net16_t *) l1), lid2 = *((ib_net16_t *) l2);
 
 	if (lid1 < lid2)
 		return -1;
@@ -352,19 +352,19 @@ static int cmp_lids(const void *l1, const void *l2)
 /* use stdlib to sort the lid array */
 static inline void vltable_sort_lids(vltable_t * vltable)
 {
-	qsort(vltable->lids, vltable->num_lids, sizeof(uint16_t), cmp_lids);
+	qsort(vltable->lids, vltable->num_lids, sizeof(ib_net16_t), cmp_lids);
 }
 
 /* use stdlib to get index of key in lid array;
    return -1 if lid isn't found in lids array
  */
-static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
+static inline int64_t vltable_get_lidindex(ib_net16_t * key, vltable_t * vltable)
 {
-	uint16_t *found_lid = NULL;
+	ib_net16_t *found_lid = NULL;
 
 	found_lid =
-	    (uint16_t *) bsearch(key, vltable->lids, vltable->num_lids,
-sizeof(uint16_t), cmp_lids);
+	    (ib_net16_t *) bsearch(key, vltable->lids, vltable->num_lids,
+				   sizeof(ib_net16_t), cmp_lids);
 	if (found_lid)
 		return found_lid - vltable->lids;
 	else
@@ -374,7 +374,7 @@ static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
 /* get virtual lane from src lid X dest lid kombination;
    return -1 for invalid lids
  */
-static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
+static int32_t vltable_get_vl(vltable_t * vltable, ib_net16_t slid, ib_net16_t dlid)
 {
 	int64_t ind1 = vltable_get_lidindex(&slid, vltable);
 	int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -387,8 +387,8 @@ static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
 }
 
 /* set a virtual lane in the matrix */
-static inline void vltable_insert(vltable_t * vltable, uint16_t slid,
-				  uint16_t dlid, uint8_t vl)
+static inline void vltable_insert(vltable_t * vltable, ib_net16_t slid,
+				  ib_net16_t dlid, uint8_t vl)
 {
 	int64_t ind1 = vltable_get_lidindex(&slid, vltable);
 	int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -436,8 +436,8 @@ static void vltable_print(osm_ucast_mgr_t * p_mgr, vltable_t * vltable)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"route from src_lid=%" PRIu16
 				" to dest_lid=%" PRIu16 " on vl=%" PRIu8
-				"\n", vltable->lids[ind1],
-				vltable->lids[ind2],
+				"\n", cl_ntoh16(vltable->lids[ind1]),
+				cl_ntoh16(vltable->lids[ind2]),
 				vltable->vls[ind1 + ind2 * vltable->num_lids]);
 		}
 
@@ -464,7 +464,7 @@ static int vltable_alloc(vltable_t ** vltable, uint64_t size)
 	if (!(*vltable))
 		goto ERROR;
 	(*vltable)->num_lids = size;
-	(*vltable)->lids = (uint16_t *) malloc(size * sizeof(uint16_t));
+	(*vltable)->lids = (ib_net16_t *) malloc(size * sizeof(ib_net16_t));
 	if (!((*vltable)->lids))
 		goto ERROR;
 	(*vltable)->vls = (uint8_t *) malloc(size * size * sizeof(uint8_t));
@@ -1704,7 +1704,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 			osm_port_get_lid_range_ho(dest_port, &min_lid_ho,
 						  &max_lid_ho);
 			for (dlid = min_lid_ho; dlid <= max_lid_ho; dlid++, i++)
-				srcdest2vl_table->lids[i] = dlid
[PATCH 05/10] OpenSM: dfsssp ignores differences in the lmc value
dfsssp used one port representative to obtain the lmc value for all ports; but lmc can vary, e.g. SW base port 0 vs. CA port

Signed-off-by: Jens Domke domke.j...@m.titech.ac.jp
---
 opensm/osm_ucast_dfsssp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index f88382b..3e9bc31 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1145,7 +1145,7 @@ static int dfsssp_build_graph(void *context)
 				continue;
 			/* if there is a Hca connected - count and cycle */
 			if (!remote_node->sw) {
-				lmc = osm_port_get_lmc(p_port);
+				lmc = osm_node_get_lmc(remote_node, (uint32_t)remote_port);
 				adj_list[i].num_hca += (1 << lmc);
 				continue;
 			}
--
1.7.1
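The `(1 << lmc)` term in this hunk counts how many LIDs a port answers to, so using the LMC of the wrong port changes the count. A small sketch of the arithmetic, assuming a made-up `port_info_t` type and `count_lids()` helper (neither exists in OpenSM):

```c
#include <stdint.h>

/* Hypothetical port descriptor: only the LMC matters here. */
typedef struct {
	uint8_t lmc;	/* LID mask control: the port answers to 2^lmc LIDs */
} port_info_t;

/* Count the LIDs behind a set of ports, reading each port's own LMC
 * instead of assuming one representative value for all of them. */
static uint32_t count_lids(const port_info_t *ports, uint32_t num_ports)
{
	uint32_t i, total = 0;

	for (i = 0; i < num_ports; i++)
		total += 1u << ports[i].lmc;
	return total;
}
```

With LMC values 0, 2 and 1 the three ports contribute 1, 4 and 2 LIDs; taking the first port's LMC for all of them would give 3 instead of 7.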
[PATCH 07/10] OpenSM: dfsssp - change the port traversal for sssp
There are some really rare cases where the sssp part of dfsssp routing produces a suboptimal path assignment, so that some links are oversubscribed and others are undersubscribed with paths; this results in bad balancing for Hca-Hca traffic. The last patch (adding SP0 support) made the situation even worse. This patch returns the focus to Hca-Hca traffic balancing.

Previously, the ports for the dijkstra loop were chosen 'randomly', i.e. in the order of p_subn->port_guid_tbl. Now we process all ports (Hca) of one switch before we proceed with the next switch. In addition, the switches are sorted in descending order with respect to the number of attached Hca.

Signed-off-by: Jens Domke domke.j...@m.titech.ac.jp
---
 opensm/osm_ucast_dfsssp.c |  136 +++--
 1 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 31e4140..b82d8c8 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1013,6 +1013,72 @@ ERROR:
 /**
  **/
 
+/** helper functions to generate an ordered list of ports ***
+ (functions copied from osm_ucast_mgr.c and modified)
+ **/
+static void add_sw_endports_to_order_list(osm_switch_t * sw,
+					  osm_ucast_mgr_t * m)
+{
+	osm_port_t *port;
+	osm_physp_t *p;
+	int i;
+
+	for (i = 1; i < sw->num_ports; i++) {
+		p = osm_node_get_physp_ptr(sw->p_node, i);
+		if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) {
+			port = osm_get_port_by_guid(m->p_subn,
+						    p->p_remote_physp->
+						    port_guid);
+			if (!port)
+				continue;
+			cl_qlist_insert_tail(&m->port_order_list,
+&port->list_item);
+		}
+	}
+}
+
+static void add_guid_to_order_list(uint64_t guid, osm_ucast_mgr_t * m)
+{
+	osm_port_t *port = osm_get_port_by_guid(m->p_subn, cl_hton64(guid));
+
+	if (!port) {
+OSM_LOG(m->p_log, OSM_LOG_DEBUG,
+"port guid not found: 0x%016" PRIx64 "\n", guid);
+	}
+
+	cl_qlist_insert_tail(&m->port_order_list, &port->list_item);
+}
+
+/* compare function of #Hca attached to a switch for stdlib qsort */
+static int cmp_num_hca(const void * l1, const void * l2)
+{
+	vertex_t *sw1 = *((vertex_t **) l1);
+	vertex_t *sw2 = *((vertex_t **) l2);
+	uint32_t num_hca1 = 0, num_hca2 = 0;
+
+	if (sw1)
+		num_hca1 = sw1->num_hca;
+	if (sw2)
+		num_hca2 = sw2->num_hca;
+
+	if (num_hca1 > num_hca2)
+		return -1;
+	else if (num_hca1 < num_hca2)
+		return 1;
+	else
+		return 0;
+}
+
+/* use stdlib to sort the switch array depending on num_hca */
+static inline void sw_list_sort_by_num_hca(vertex_t ** sw_list,
+					   uint32_t sw_list_size)
+{
+	qsort(sw_list, sw_list_size, sizeof(vertex_t *), cmp_num_hca);
+}
+
+/**
+ **/
+
 static void dfsssp_print_graph(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
 			       uint32_t size)
 {
@@ -1172,7 +1238,7 @@ static int dfsssp_build_graph(void *context)
 						link = head;
 						head = head->next;
 						free(link);
-}
+					}
 					return 1;
 				}
 				link = link->next;
@@ -1919,7 +1985,12 @@ static int dfsssp_do_dijkstra_routing(void *context)
 	vertex_t *adj_list = (vertex_t *) dfsssp_ctx->adj_list;
 	uint32_t adj_list_size = dfsssp_ctx->adj_list_size;
 
-	cl_qmap_t *port_tbl = &p_mgr->p_subn->port_guid_tbl;	/* 1 managment port per switch + 1 or 2 ports for each Hca */
+	vertex_t **sw_list = NULL;
+	uint32_t sw_list_size = 0;
+	uint64_t guid = 0;
+	cl_qlist_t *qlist = NULL;
+	cl_list_item_t *qlist_item = NULL;
+
 	cl_qmap_t *sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
 	cl_map_item_t *item = NULL;
 	osm_switch_t *sw = NULL;
@@ -1949,12 +2020,64 @@ static int dfsssp_do_dijkstra_routing(void *context)
 		}
 	}
 
+	/* we need an intermediate array of pointers to switches in adj_list
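The ordering described in the commit message (switches with the most attached Hca first) can be tried outside of OpenSM with plain qsort(). This is a sketch under assumptions: `sw_vertex_t` is a hypothetical, reduced stand-in for the real vertex_t, and the comparison operators are inferred from the word "descending" in the description.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced vertex: the real vertex_t has many more fields. */
typedef struct {
	uint32_t num_hca;	/* number of Hca attached to this switch */
} sw_vertex_t;

/* qsort() comparator over an array of pointers; NULL entries count as 0 Hca.
 * Returning -1 for the larger value yields descending order. */
static int cmp_num_hca_desc(const void *l1, const void *l2)
{
	const sw_vertex_t *sw1 = *(const sw_vertex_t * const *)l1;
	const sw_vertex_t *sw2 = *(const sw_vertex_t * const *)l2;
	uint32_t n1 = sw1 ? sw1->num_hca : 0;
	uint32_t n2 = sw2 ? sw2->num_hca : 0;

	if (n1 > n2)
		return -1;
	if (n1 < n2)
		return 1;
	return 0;
}

/* Sort the pointer array in place, switches with the most Hca first. */
static void sort_by_num_hca(sw_vertex_t **sw_list, size_t n)
{
	qsort(sw_list, n, sizeof(*sw_list), cmp_num_hca_desc);
}
```

Sorting pointers rather than the vertices themselves leaves the underlying adjacency list untouched, which matches the "intermediate array of pointers" mentioned in the patch.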
Re: umad_send with service level higher than 0 does not work
Hello Hal, On Dec 17, 2012, at 9:04 PM, Hal Rosenstock wrote: Hi, On 12/17/2012 1:16 AM, Jens Domke wrote: Hello Hal, I have checked the smpquery and saquery command today. The smpquery SL2VL and PI commands for the opensm port work fine, and I get the expected results: == # SL2VL table: Lid 19 # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15| ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7| == # Port info: Lid 19 port 0 Mkey:not displayed GidPrefix:...0xfe80 Lid:.19 SMLid:...19 CapMask:.0x251086a IsSM IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsSystemImageGUIDsupported IsCommunicatonManagementSupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported DiagCode:0x MkeyLeasePeriod:.0 LocalPort:...1 LinkWidthEnabled:1X or 4X LinkWidthSupported:..1X or 4X LinkWidthActive:.4X LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps LinkState:...Active PhysLinkState:...LinkUp LinkDownDefState:Polling ProtectBits:.0 LMC:.0 LinkSpeedActive:.5.0 Gbps LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps NeighborMTU:.2048 SMSL:0 VLCap:...VL0-7 InitType:0x00 VLHighLimit:.0 VLArbHighCap:8 VLArbLowCap:.8 InitReply:...0x00 MtuCap:..2048 VLStallCount:0 HoqLife:.31 OperVLs:.VL0-7 PartEnforceInb:..0 PartEnforceOutb:.0 FilterRawInb:0 FilterRawOutb:...0 MkeyViolations:..0 PkeyViolations:..0 QkeyViolations:..0 GuidCap:.32 ClientReregister:0 McastPkeyTrapSuppressionEnabled:.0 SubnetTimeout:...18 RespTimeVal:.16 LocalPhysErr:8 OverrunErr:..8 MaxCreditHint:...0 RoundTrip:...0 CapabilityMask2:.0x LinkSpeedExtActive:..No Extended Speed LinkSpeedExtSupported:...0 LinkSpeedExtEnabled:.0 == The problem are the saquery commands on other nodes. In most cases the executions fails, and the node shows the same behaviour like the OpenSM node, when it trys to send on SL0. The PathRequest paket does not arrive at the node with the running OpenSM (checked with ibdumb). 
At some point of the execution the saquery binary hangs, the kernel log indicates errors and the only option is to reboot. This is the output I see for the saquery: == saquery -P --src-to-dst 4:8 ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out Query SA failed: Connection timed out == (In really rar cases I get the PathRequest back and see the dump, but the saquery binary stalls afterwards, too.) I did some debugging with gdb again, and stepped thru the saquery code. When I change the SL to 0 in the addr vector of the MAD right before umad_send is called, then everthing works. So, the saquery on the compute nodes shows the same behaviour as the opensm with respect to the SL value for umad_send. At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in the config file of opensm. Sadly, this configuration results in the same crashes of the saquery commands. For the runs with MinHop I used also a different SL2VL mapping, just to be sure, that there is no problem with VL0 and every SL travels on VL=0: == # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15| ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| == Non QoS routing algorithms still need -Q otherwise the full range of QoS is not available
Re: umad_send with service level higher than 0 does not work
Hello Hal, On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 3:32 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 1:24 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote: Hi again, On 12/14/2012 10:17 AM, Jens Domke wrote: Hello Hal, thank you for the fast response. I will try to clarify some points. d) OpenMPI runs are executed with --mca btl_openib_ib_path_record_service_level 1 I'm not familiar with what DFSSSP does to figure out SLs exactly but there should be no need to set this. The proper SL for querying the SA for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP (and other QoS based routing algorithms), it calculates that and the SM pushes this into each port. That should be used. It's possible that SL1 is not a valid SL for port - SA querying using DFSSSP. The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords. It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request. For the request port - SA every 0=SL=7 was used in the test, and the SA received the requests. e) kernel 2.6.32-220.13.1.el6.x86_64 As far as I understand the whole system: 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM 2. the SA receives the request on QP1 There is the SL in the query itself. This should be the SMSL that the SM set for that port. Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified. In fact OpenMPI sets everthing to 0 except for slid and dlid. 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path This is a (potentially) different SL (for MPI-MPI port communication) than the one the query used and is the one returned inside the PathRecord attribute/data. 
Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm. With DFSSSP are all SLs same from source port to get to any destination ? No, not necessarily. In general DFSSSP does not enforce SL(LID1-LID2) == SL(LID2-LID1) or SL(LID1-LID2) == SL(LID1-LID3). If SL(LID1-LID2) != SL(LID2-LID1), that's not a reversible path. True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path. So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL. I just read the IB Specs and it says, that SL specified in the received packet is used as the SL in the response packet for MAD packets. So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet. OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, So CompMask in the query has the SL bit on and SL is set to 0 inside the SubAdmGet of PatchRecord ? No, the CompMask didn't had the SL bit and the SL was set to 0. I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c The SA just treats the SL in the PathRequest as a I would like to use this SL in case the SL bit is set. But the routing engine can overwrite the requested SL before the reply is send. Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b. Sadly, the reply send by the SA does not leave the node (for SL_b0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process. and sends the packet on SL_b (PortInfo.SMSL). Good. The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for the response. If SL_b is not 0, then the packet can't reach the OMPI process. Right? 
Depends. It may be that both SLs work but maybe not. If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet. What do you think? Yes, it might be better to wildcard the SL in the query. The only scenario that would fail with the query you are making if there's no SL 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query. If that's the case, SA should return MAD status 0xc (status code 3 - ERR_NO_RECORDS). But the response doesn't make it back to the requester OMPI node so it's not even getting that far. Yes, exactly. So, do you have an idea why the response hands in the SA node? I have
Re: umad_send with service level higher than 0 does not work
Hi, On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote: Hi, On 12/16/2012 7:03 AM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 3:32 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 1:24 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote: Hi again, On 12/14/2012 10:17 AM, Jens Domke wrote: Hello Hal, thank you for the fast response. I will try to clarify some points. d) OpenMPI runs are executed with --mca btl_openib_ib_path_record_service_level 1 I'm not familiar with what DFSSSP does to figure out SLs exactly but there should be no need to set this. The proper SL for querying the SA for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP (and other QoS based routing algorithms), it calculates that and the SM pushes this into each port. That should be used. It's possible that SL1 is not a valid SL for port - SA querying using DFSSSP. The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords. It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request. For the request port - SA every 0=SL=7 was used in the test, and the SA received the requests. e) kernel 2.6.32-220.13.1.el6.x86_64 As far as I understand the whole system: 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM 2. the SA receives the request on QP1 There is the SL in the query itself. This should be the SMSL that the SM set for that port. Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified. In fact OpenMPI sets everthing to 0 except for slid and dlid. 3. 
SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path This is a (potentially) different SL (for MPI-MPI port communication) than the one the query used and is the one returned inside the PathRecord attribute/data. Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm. With DFSSSP are all SLs same from source port to get to any destination ? No, not necessarily. In general DFSSSP does not enforce SL(LID1-LID2) == SL(LID2-LID1) or SL(LID1-LID2) == SL(LID1-LID3). If SL(LID1-LID2) != SL(LID2-LID1), that's not a reversible path. True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path. So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL. I just read the IB Specs and it says, that SL specified in the received packet is used as the SL in the response packet for MAD packets. So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet. OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, So CompMask in the query has the SL bit on and SL is set to 0 inside the SubAdmGet of PatchRecord ? No, the CompMask didn't had the SL bit and the SL was set to 0. That means the SL in the request is wildcarded so the SA/SM fills in a valid one in the response. Ok. I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c The SA just treats the SL in the PathRequest as a I would like to use this SL in case the SL bit is set. But the routing engine can overwrite the requested SL before the reply is send. Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b. 
Sadly, the reply send by the SA does not leave the node (for SL_b0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process. Are you sure the response doesn't leave the SA node or it's not received at the requester (OMPI node) ? No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end. and sends the packet on SL_b (PortInfo.SMSL). Good. The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for the response. If SL_b is not 0, then the packet can't reach the OMPI process. Right? Depends. It may be that both SLs work but maybe not. If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send
Re: umad_send with service level higher than 0 does not work
On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote: On 12/16/2012 8:39 AM, Jens Domke wrote: Hi, On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote: Hi, On 12/16/2012 7:03 AM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 3:32 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote: Hi, On 12/14/2012 1:24 PM, Jens Domke wrote: Hello Hal, On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote: Hi again, On 12/14/2012 10:17 AM, Jens Domke wrote: Hello Hal, thank you for the fast response. I will try to clarify some points. d) OpenMPI runs are executed with --mca btl_openib_ib_path_record_service_level 1 I'm not familiar with what DFSSSP does to figure out SLs exactly but there should be no need to set this. The proper SL for querying the SA for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP (and other QoS based routing algorithms), it calculates that and the SM pushes this into each port. That should be used. It's possible that SL1 is not a valid SL for port - SA querying using DFSSSP. The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords. It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request. For the request port - SA every 0=SL=7 was used in the test, and the SA received the requests. e) kernel 2.6.32-220.13.1.el6.x86_64 As far as I understand the whole system: 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM 2. the SA receives the request on QP1 There is the SL in the query itself. This should be the SMSL that the SM set for that port. Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified. In fact OpenMPI sets everthing to 0 except for slid and dlid. 3. 
SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path This is a (potentially) different SL (for MPI-MPI port communication) than the one the query used and is the one returned inside the PathRecord attribute/data. Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm. With DFSSSP are all SLs same from source port to get to any destination ? No, not necessarily. In general DFSSSP does not enforce SL(LID1-LID2) == SL(LID2-LID1) or SL(LID1-LID2) == SL(LID1-LID3). If SL(LID1-LID2) != SL(LID2-LID1), that's not a reversible path. True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path. So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL. I just read the IB Specs and it says, that SL specified in the received packet is used as the SL in the response packet for MAD packets. So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet. OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, So CompMask in the query has the SL bit on and SL is set to 0 inside the SubAdmGet of PatchRecord ? No, the CompMask didn't had the SL bit and the SL was set to 0. That means the SL in the request is wildcarded so the SA/SM fills in a valid one in the response. Ok. I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c The SA just treats the SL in the PathRequest as a I would like to use this SL in case the SL bit is set. But the routing engine can overwrite the requested SL before the reply is send. Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b. 
Sadly, the reply send by the SA does not leave the node (for SL_b0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process. Are you sure the response doesn't leave the SA node or it's not received at the requester (OMPI node) ? No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end. and sends the packet on SL_b (PortInfo.SMSL). Good. The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for the response. If SL_b is not 0, then the packet can't reach the OMPI process. Right? Depends. It may be that both SLs work but maybe not. If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP
[PATCH 1/1] OpenSM: DFSSSP does not find LIDs due to wrong byte order
Problem: path_sl(...) arguments for slid/dlid are in network byte order, while the internal storage of lids is in host byte order; this mismatch results in a return value of 'hint_for_default_sl' from DFSSSP's get_dfsssp_sl function for every request

Fix: lids will be stored in network byte order, so that a conversion is not necessary and DFSSSP returns the correct SL for that request

Signed-off-by: Jens Domke domke.j...@m.titech.ac.jp
---
 opensm/osm_ucast_dfsssp.c |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ffc317f..903966c 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -339,7 +339,7 @@ static void heap_free(binary_heap_t * heap)
 /* compare function of two lids for stdlib qsort */
 static int cmp_lids(const void *l1, const void *l2)
 {
-	uint16_t lid1 = *((uint16_t *) l1), lid2 = *((uint16_t *) l2);
+	ib_net16_t lid1 = *((ib_net16_t *) l1), lid2 = *((ib_net16_t *) l2);
 
 	if (lid1 < lid2)
 		return -1;
@@ -352,19 +352,19 @@ static int cmp_lids(const void *l1, const void *l2)
 /* use stdlib to sort the lid array */
 static inline void vltable_sort_lids(vltable_t * vltable)
 {
-	qsort(vltable->lids, vltable->num_lids, sizeof(uint16_t), cmp_lids);
+	qsort(vltable->lids, vltable->num_lids, sizeof(ib_net16_t), cmp_lids);
 }
 
 /* use stdlib to get index of key in lid array;
    return -1 if lid isn't found in lids array
  */
-static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
+static inline int64_t vltable_get_lidindex(ib_net16_t * key, vltable_t * vltable)
 {
-	uint16_t *found_lid = NULL;
+	ib_net16_t *found_lid = NULL;
 
 	found_lid =
-	    (uint16_t *) bsearch(key, vltable->lids, vltable->num_lids,
-sizeof(uint16_t), cmp_lids);
+	    (ib_net16_t *) bsearch(key, vltable->lids, vltable->num_lids,
+				   sizeof(ib_net16_t), cmp_lids);
 	if (found_lid)
 		return found_lid - vltable->lids;
 	else
@@ -374,7 +374,7 @@ static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
 /* get virtual lane from src lid X dest lid kombination;
    return -1 for invalid lids
  */
-static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
+static int32_t vltable_get_vl(vltable_t * vltable, ib_net16_t slid, ib_net16_t dlid)
 {
 	int64_t ind1 = vltable_get_lidindex(&slid, vltable);
 	int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -387,8 +387,8 @@ static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
 }
 
 /* set a virtual lane in the matrix */
-static inline void vltable_insert(vltable_t * vltable, uint16_t slid,
-				  uint16_t dlid, uint8_t vl)
+static inline void vltable_insert(vltable_t * vltable, ib_net16_t slid,
+				  ib_net16_t dlid, uint8_t vl)
 {
 	int64_t ind1 = vltable_get_lidindex(&slid, vltable);
 	int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -436,8 +436,8 @@ static void vltable_print(osm_ucast_mgr_t * p_mgr, vltable_t * vltable)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 				"route from src_lid=%" PRIu16
 				" to dest_lid=%" PRIu16 " on vl=%" PRIu8
-				"\n", vltable->lids[ind1],
-				vltable->lids[ind2],
+				"\n", cl_ntoh16(vltable->lids[ind1]),
+				cl_ntoh16(vltable->lids[ind2]),
 				vltable->vls[ind1 + ind2 * vltable->num_lids]);
 		}
 
@@ -464,7 +464,7 @@ static int vltable_alloc(vltable_t ** vltable, uint64_t size)
 	if (!(*vltable))
 		goto ERROR;
 	(*vltable)->num_lids = size;
-	(*vltable)->lids = (uint16_t *) malloc(size * sizeof(uint16_t));
+	(*vltable)->lids = (ib_net16_t *) malloc(size * sizeof(ib_net16_t));
 	if (!((*vltable)->lids))
 		goto ERROR;
 	(*vltable)->vls = (uint8_t *) malloc(size * size * sizeof(uint8_t));
@@ -1645,7 +1645,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 			osm_port_get_lid_range_ho(dest_port, &min_lid_ho,
 						  &max_lid_ho);
 			for (dlid = min_lid_ho; dlid <= max_lid_ho; dlid++, i++)
-				srcdest2vl_table->lids[i] = dlid;
+				srcdest2vl_table->lids[i] = cl_hton16(dlid);
 		}
 	}
 	/* sort lids */
--
1.7.1
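To illustrate why storing the lids in network byte order fixes the lookup, here is a standalone sketch of the sort-then-bsearch scheme. It assumes POSIX htons()/ntohs() as stand-ins for cl_hton16()/cl_ntoh16(); the function names `cmp_net_lids()`, `lid_table_fill()` and `lid_table_index()` are invented for this example.

```c
#include <arpa/inet.h>	/* htons(), ntohs(): stand-ins for cl_hton16()/cl_ntoh16() */
#include <stdint.h>
#include <stdlib.h>

/* Compare two LIDs stored in network byte order. On little-endian hosts the
 * resulting order is not numeric, but it is consistent, which is all that
 * qsort() and bsearch() need as long as both use the same comparator. */
static int cmp_net_lids(const void *l1, const void *l2)
{
	uint16_t lid1 = *(const uint16_t *)l1, lid2 = *(const uint16_t *)l2;

	if (lid1 < lid2)
		return -1;
	if (lid1 > lid2)
		return 1;
	return 0;
}

/* Convert host-order LIDs to network byte order, store and sort them. */
static void lid_table_fill(uint16_t *table, const uint16_t *lids_ho, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		table[i] = htons(lids_ho[i]);
	qsort(table, n, sizeof(*table), cmp_net_lids);
}

/* Look up a key given in network byte order; returns its index or -1. */
static long lid_table_index(const uint16_t *table, size_t n, uint16_t key_no)
{
	const uint16_t *found =
	    bsearch(&key_no, table, n, sizeof(*table), cmp_net_lids);

	return found ? (long)(found - table) : -1;
}
```

A caller that already holds network-byte-order LIDs, as path_sl(...) does according to the description, can pass them to the lookup unchanged; only printing needs a conversion back with ntohs().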
Re: umad_send with service level higher than 0 does not work
Hello Hal, I have checked the smpquery and saquery command today. The smpquery SL2VL and PI commands for the opensm port work fine, and I get the expected results: == # SL2VL table: Lid 19 # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15| ports: in 0, out 0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7| == # Port info: Lid 19 port 0 Mkey:not displayed GidPrefix:...0xfe80 Lid:.19 SMLid:...19 CapMask:.0x251086a IsSM IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsSystemImageGUIDsupported IsCommunicatonManagementSupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported DiagCode:0x MkeyLeasePeriod:.0 LocalPort:...1 LinkWidthEnabled:1X or 4X LinkWidthSupported:..1X or 4X LinkWidthActive:.4X LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps LinkState:...Active PhysLinkState:...LinkUp LinkDownDefState:Polling ProtectBits:.0 LMC:.0 LinkSpeedActive:.5.0 Gbps LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps NeighborMTU:.2048 SMSL:0 VLCap:...VL0-7 InitType:0x00 VLHighLimit:.0 VLArbHighCap:8 VLArbLowCap:.8 InitReply:...0x00 MtuCap:..2048 VLStallCount:0 HoqLife:.31 OperVLs:.VL0-7 PartEnforceInb:..0 PartEnforceOutb:.0 FilterRawInb:0 FilterRawOutb:...0 MkeyViolations:..0 PkeyViolations:..0 QkeyViolations:..0 GuidCap:.32 ClientReregister:0 McastPkeyTrapSuppressionEnabled:.0 SubnetTimeout:...18 RespTimeVal:.16 LocalPhysErr:8 OverrunErr:..8 MaxCreditHint:...0 RoundTrip:...0 CapabilityMask2:.0x LinkSpeedExtActive:..No Extended Speed LinkSpeedExtSupported:...0 LinkSpeedExtEnabled:.0 == The problem are the saquery commands on other nodes. In most cases the executions fails, and the node shows the same behaviour like the OpenSM node, when it trys to send on SL0. The PathRequest paket does not arrive at the node with the running OpenSM (checked with ibdumb). At some point of the execution the saquery binary hangs, the kernel log indicates errors and the only option is to reboot. 
This is the output I see for the saquery:

==
saquery -P --src-to-dst 4:8
ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
Query SA failed: Connection timed out
==

(In really rare cases I get the PathRequest back and see the dump, but the saquery binary stalls afterwards, too.)

I did some debugging with gdb again, and stepped through the saquery code. When I change the SL to 0 in the addr vector of the MAD right before umad_send is called, everything works. So, the saquery on the compute nodes shows the same behaviour as the opensm with respect to the SL value for umad_send.

In the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in the config file of opensm. Sadly, this configuration results in the same crashes of the saquery commands. For the runs with MinHop I also used a different SL2VL mapping, just to be sure that there is no problem with VL>0, so that every SL travels on VL=0:

==
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
==

Regards,
Jens

On Dec 16, 2012, at 11:59 PM, Jens Domke wrote: On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote: On 12/16/2012 8:39 AM, Jens Domke wrote: Hi, On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote: Hi, On 12/16/2012 7:03 AM, Jens Domke wrote: Hello Hal, On Dec 15, 2012
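The two SL2VL tables quoted in this message can be read as plain lookup arrays indexed by SL. The following is only a minimal sketch of that lookup: the table contents are copied from the smpquery output above, while the names `sl2vl_lid19`, `sl2vl_all_vl0` and `sl_to_vl` are made up for illustration and are not part of OpenSM or infiniband-diags.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SL 16 /* InfiniBand defines 16 service levels, SL0..SL15 */

/* SL2VL table reported by smpquery for Lid 19 (ports: in 0, out 0). */
static const uint8_t sl2vl_lid19[NUM_SL] = {
    0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
};

/* Table used for the MinHop runs: every SL is mapped to VL0. */
static const uint8_t sl2vl_all_vl0[NUM_SL] = { 0 };

/* Hypothetical helper: return the VL for a given SL, or -1 if the SL is
 * out of range. */
static int sl_to_vl(const uint8_t table[NUM_SL], unsigned sl)
{
    if (sl >= NUM_SL)
        return -1;
    return table[sl];
}
```

With the first table, `sl_to_vl(sl2vl_lid19, 6)` gives VL 6 and SL 9 wraps around to VL 1; with the second table every SL ends up on VL 0, which is why that mapping rules out a problem specific to VL>0.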
umad_send with service level higher than 0 does not work
information, or if I can test something to give you more insight.

Thank you in advance,
Jens

Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j...@m.titech.ac.jp
Re: umad_send with service level higher than 0 does not work
stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0. A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.

So SL 0 works between all nodes and SA for querying/responses. Wonder if that's how SMSL is set by DFSSSP.

No, the SMSL set by DFSSSP is different from 0; I have checked this. In our case (OpenSM running on a compute node), it sets the same SL that is used for MPI-MPI traffic, to ensure deadlock freedom.

Regards,
Jens
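The workaround described above (sending with SL 0 instead of the SL taken from the incoming request) can be sketched as follows. This is a simplified model, not OpenSM code: `struct demo_mad_addr` is a hypothetical stand-in that carries only the fields discussed in the thread, and `demo_fill_reply_addr` is an invented helper that mimics the argument order of umad_set_addr_net(umad, dlid, dqp, sl, qkey) without calling libibumad.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the MAD address vector. */
struct demo_mad_addr {
    uint32_t qpn;  /* destination QP number */
    uint32_t qkey; /* Q_Key */
    uint16_t lid;  /* destination LID */
    uint8_t sl;    /* service level used on the wire */
};

/* Fill the reply address. When force_sl0 is set, the SL of the request is
 * ignored and SL 0 is used, which is the workaround from the thread. */
static void demo_fill_reply_addr(struct demo_mad_addr *addr, uint16_t dlid,
                                 uint32_t dqp, uint8_t request_sl,
                                 uint32_t qkey, bool force_sl0)
{
    addr->lid = dlid;
    addr->qpn = dqp;
    addr->qkey = qkey;
    addr->sl = force_sl0 ? 0 : request_sl;
}
```

Note that forcing SL 0 only hides the symptom: the reply then no longer travels on the SL that DFSSSP assigned, so the deadlock-freedom argument mentioned above does not cover it.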
Re: umad_send with service level higher than 0 does not work
Hello Hal,

On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:

Hi again,

On 12/14/2012 10:17 AM, Jens Domke wrote:

Hello Hal, thank you for the fast response. I will try to clarify some points.

d) OpenMPI runs are executed with --mca btl_openib_ib_path_record_service_level 1

I'm not familiar with what DFSSSP does to figure out SLs exactly but there should be no need to set this. The proper SL for querying the SA for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP (and other QoS based routing algorithms), it calculates that and the SM pushes this into each port. That should be used. It's possible that SL1 is not a valid SL for port -> SA querying using DFSSSP.

The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords. It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request. For the request port -> SA every 0<=SL<=7 was used in the test, and the SA received the requests.

e) kernel 2.6.32-220.13.1.el6.x86_64

As far as I understand the whole system:

1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM

2. the SA receives the request on QP1

There is the SL in the query itself. This should be the SMSL that the SM set for that port.

Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified. In fact OpenMPI sets everything to 0 except for slid and dlid.

3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path

This is a (potentially) different SL (for MPI-MPI port communication) than the one the query used and is the one returned inside the PathRecord attribute/data.

Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.

With DFSSSP are all SLs same from source port to get to any destination ?

No, not necessarily.
In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).

4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c

By the response reversibility rule, I think this is returned on the SL of the original query but haven't verified this in the code base yet.

Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.

I double checked and indeed the SA response does use the SL that the incoming request was received on.

The osm_vendor_send() function builds the MAD packet with the following attributes:

    /* GS classes */
    umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
                      p_mad_addr->addr_type.gsi.remote_qp,
                      p_mad_addr->addr_type.gsi.service_level,
                      IB_QP1_WELL_KNOWN_Q_KEY);

So, the SL is the same as the one which was used by the OMPI process. The Q_Key matches the Q_Key on the OMPI process, and remote_qp and dest_lid are correct, too. Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).

By not working, what do you mean ? Do you mean it's not received at the requester with no message in the OpenSM log or not received at the OpenSM or something else ? It could be due to the wrong SL being used in the original request (forcing it to SL 1). That could cause it not to be received at the SM or the response not to make it back to the requester from the SA if the SL used is not reversible.

By not working I mean that the MPI process does not receive any response from the SA. I get messages from the MPI process like the following:

    [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries

The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back. And I think I saw some messages in the log about …1 outstanding MAD….
If I look into the MAD before it is sent, then it looks like this:

    Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3) at src/umad.c:791
    791 if (umaddebug > 1)
    (gdb) p *mad
    $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0,
      addr = {qpn = 1325427712, qkey = 384, lid = 4096, sl = 6 '\006',
        path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000',
        hop_limit = 0 '\000', traffic_class = 0 '\000',
        gid = '\000' <repeats 15 times>, flow_label = 0, pkey_index = 0,
        reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}

Is this the PathRecord query on the OpenMPI side or the response on the OpenSM side ? SL is 6 rather than 1 here.

This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …). SL=6
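The odd-looking values for qpn, qkey and lid in the gdb dump above are easier to read once one takes into account that, as far as I know, these address fields are kept in network (big-endian) byte order, while gdb on an x86_64 host prints them as little-endian integers. The sketch below just swaps the bytes of the printed values; the helper names are made up for illustration, and the swap is written out by hand so that it does not depend on the byte order of the machine running the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (illustrative helper). */
static uint32_t demo_bswap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) |
           ((v & 0xFF000000u) >> 24);
}

/* Reverse the byte order of a 16-bit value (illustrative helper). */
static uint16_t demo_bswap16(uint16_t v)
{
    return (uint16_t)(((v & 0x00FFu) << 8) | ((v & 0xFF00u) >> 8));
}
```

Applied to the dump: `demo_bswap32(384)` is 0x80010000, which is the well-known QP1 Q_Key; `demo_bswap16(4096)` is 16, so the reply is addressed to LID 16; and `demo_bswap32(1325427712)` is 0x006C004F, the destination QP number. Only the sl field (6) is a single byte and needs no conversion.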
Re: umad_send with service level higher than 0 does not work
Hello Hal,

On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:

Hi,

On 12/14/2012 1:24 PM, Jens Domke wrote:

Hello Hal,

On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:

Hi again,

On 12/14/2012 10:17 AM, Jens Domke wrote:

Hello Hal, thank you for the fast response. I will try to clarify some points.

d) OpenMPI runs are executed with --mca btl_openib_ib_path_record_service_level 1

I'm not familiar with what DFSSSP does to figure out SLs exactly but there should be no need to set this. The proper SL for querying the SA for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP (and other QoS based routing algorithms), it calculates that and the SM pushes this into each port. That should be used. It's possible that SL1 is not a valid SL for port -> SA querying using DFSSSP.

The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords. It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request. For the request port -> SA every 0<=SL<=7 was used in the test, and the SA received the requests.

e) kernel 2.6.32-220.13.1.el6.x86_64

As far as I understand the whole system:

1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM

2. the SA receives the request on QP1

There is the SL in the query itself. This should be the SMSL that the SM set for that port.

Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified. In fact OpenMPI sets everything to 0 except for slid and dlid.

3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path

This is a (potentially) different SL (for MPI-MPI port communication) than the one the query used and is the one returned inside the PathRecord attribute/data.

Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
With DFSSSP are all SLs same from source port to get to any destination ?

No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).

If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.

True. But I don't think that the SA asks the DFSSSP routing about the SL for the reversible path. So, the SA could use any SL which is a valid SL, even if DFSSSP would recommend another SL.

I just read the IB Specs and it says that the SL specified in the received packet is used as the SL in the response packet for MAD packets. So, it's most likely that there is a mismatch between the way OMPI sets up the PathRequest and the way the SA builds the response packet. OMPI always specifies SL=0 (let's say SL_a) inside of the PathRequest packet, and sends the packet on SL_b (PortInfo.SMSL). The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response. If SL_b is not 0, then the packet can't reach the OMPI process. Right?

If I analyse this correctly, then there are two bugs. One is in OMPI: it does not specify the SL within the PathRequest in an appropriate way (which would be an SL suggested by DFSSSP for the reversible path). The second bug is that the SA uses the SL on which the PathRequest packet was sent, and not the SL specified within the packet. What do you think?

I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level. Maybe with this change the packets of the SA will reach the OMPI process on an SL>0.

4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c

By the response reversibility rule, I think this is returned on the SL of the original query but haven't verified this in the code base yet.

Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
I double checked and indeed the SA response does use the SL that the incoming request was received on.

The osm_vendor_send() function builds the MAD packet with the following attributes:

    /* GS classes */
    umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
                      p_mad_addr->addr_type.gsi.remote_qp,
                      p_mad_addr->addr_type.gsi.service_level,
                      IB_QP1_WELL_KNOWN_Q_KEY);

So, the SL is the same as the one which was used by the OMPI process. The Q_Key matches the Q_Key on the OMPI process, and remote_qp and dest_lid are correct, too. Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).

By not working, what do you mean ? Do you mean it's not received at the requester with no message in the OpenSM log or not received at the OpenSM or something else ? It could be due to the wrong SL being used in the original request (forcing it to SL 1
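The SL_a / SL_b distinction discussed in this message can be made concrete with a small model. This is only a sketch under the assumptions stated in the thread (OMPI writes SL_a = 0 into the PathRecord query but transmits on SL_b = PortInfo.SMSL, and the SA answers on the SL the request arrived on); the struct and function names are hypothetical and do not exist in OpenSM or Open MPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an incoming SA request. */
struct demo_sa_request {
    uint8_t transport_sl;  /* SL_b: SL the packet was sent on (PortInfo.SMSL) */
    uint8_t pathrecord_sl; /* SL_a: SL field written inside the PathRecord query */
};

/* Per the behaviour described above, the response reuses the SL on which
 * the request was received, not the SL inside the query payload. */
static uint8_t demo_response_sl(const struct demo_sa_request *req)
{
    return req->transport_sl;
}

/* True when the SL in the payload disagrees with the SL on the wire. */
static bool demo_sl_mismatch(const struct demo_sa_request *req)
{
    return req->transport_sl != req->pathrecord_sl;
}

/* A path between two LIDs is reversible, in the sense used in the thread,
 * when both directions are assigned the same SL. */
static bool demo_path_is_reversible(uint8_t sl_forward, uint8_t sl_backward)
{
    return sl_forward == sl_backward;
}
```

For the case from the gdb session, a request with `transport_sl = 6` and `pathrecord_sl = 0` yields a response on SL 6 and is flagged as a mismatch, which matches the observation that the reply leaves the SA on an SL other than 0.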