[PATCH] OpenSM: LFT update breaks if IB_SMP_DATA_SIZE changes

2015-10-13 Thread Jens Domke
This is only a precautionary patch for a theoretical
bug which would arise if someone redefines IB_SMP_DATA_SIZE
to a value != 64.

ucast_mgr_pipeline_fwd_tbl() calculates the max. number of
blocks to update using 64 explicitly, while set_lft_block()
uses IB_SMP_DATA_SIZE.
If IB_SMP_DATA_SIZE != 64 then switches would receive too few
or too many blocks.

Signed-off-by: Jens Domke <jens.do...@tu-dresden.de>
---
 opensm/osm_ucast_mgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
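
Note (not part of the commit): a minimal sketch of the mismatch described
above. The helper lft_block_count() and its block_size parameter are
hypothetical and only stand in for the expression changed by this patch; the
point is that two call sites which assume different block sizes disagree on
how many blocks cover the same max_lid.

```c
/* Hypothetical helper mirroring "max_lid / BLOCK + 1" from the patch.
   block_size plays the role of IB_SMP_DATA_SIZE (entries per LFT block). */
static unsigned lft_block_count(unsigned max_lid, unsigned block_size)
{
	return max_lid / block_size + 1;
}
```

With a block size of 64 and max_lid = 1000 this yields 16 blocks; a code path
that assumed 32 entries per block would expect 32 blocks for the same table.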

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 7ccaa77..893a70b 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -1036,7 +1036,7 @@ static void ucast_mgr_pipeline_fwd_tbl(osm_ucast_mgr_t * p_mgr)
 {
 	cl_qmap_t *tbl;
 	cl_map_item_t *item;
-	unsigned i, max_block = p_mgr->max_lid / 64 + 1;
+	unsigned i, max_block = p_mgr->max_lid / IB_SMP_DATA_SIZE + 1;
 
 	tbl = &p_mgr->p_subn->sw_guid_tbl;
 	for (i = 0; i < max_block; i++)
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] OpenSM: command line option ignore-guids broken

2015-03-27 Thread Jens Domke
This patch changes the documentation (--help and man page) from
--ignore-guids to --ignore_guids, so that it matches the implementation.

Signed-off-by: Jens Domke <jens.do...@tu-dresden.de>
---
 doc/current-routing.txt | 2 +-
 man/opensm.8.in | 6 +++---
 opensm/main.c   | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/current-routing.txt b/doc/current-routing.txt
index d23ae0d..acfeb56 100644
--- a/doc/current-routing.txt
+++ b/doc/current-routing.txt
@@ -127,7 +127,7 @@ subscription is also equalized with the ability to override based on
 port GUID. The latter is supplied by:
 
 -i equalize-ignore-guids-file
--ignore-guids equalize-ignore-guids-file
+--ignore_guids equalize-ignore-guids-file
   This option provides the means to define a set of ports
   (by guids) that will be ignored by the link load
   equalization algorithm.
diff --git a/man/opensm.8.in b/man/opensm.8.in
index c1092cc..8ea127d 100644
--- a/man/opensm.8.in
+++ b/man/opensm.8.in
@@ -37,7 +37,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA)
 [\-\-maxsmps number]
 [\-\-console [off | local | socket | loopback]]
 [\-\-console-port port]
-[\-i(gnore-guids) equalize-ignore-guids-file]
+[\-i | \-\-ignore_guids equalize-ignore-guids-file]
 [\-w | \-\-hop_weights_file path to file]
 [\-O | \-\-port_search_ordering_file path to file]
 [\-O | \-\-dimn_ports_file path to file] (DEPRECATED)
@@ -298,7 +298,7 @@ Specify an alternate telnet port for the socket console (default 1).
 Note that this option only appears if OpenSM was built with
 --enable-console-socket.
 .TP
-\fB\-i\fR, \fB\-\-ignore-guids\fR equalize-ignore-guids-file
+\fB\-i\fR, \fB\-\-ignore_guids\fR equalize-ignore-guids-file
 This option provides the means to define a set of ports
 (by node guid and port number) that will be ignored by the link load
 equalization algorithm.
@@ -987,7 +987,7 @@ port GUID. The latter is supplied by:
 
 -i equalize-ignore-guids-file
 .br
-\-\-ignore-guids equalize-ignore-guids-file
+\-\-ignore_guids equalize-ignore-guids-file
   This option provides the means to define a set of ports
   (by guid) that will be ignored by the link load
   equalization algorithm. Note that only endports (CA,
diff --git a/opensm/main.c b/opensm/main.c
index 6551a37..8419e68 100644
--- a/opensm/main.c
+++ b/opensm/main.c
@@ -289,7 +289,7 @@ static void show_usage(void)
 	       "Specify an alternate telnet port for the console (default %d).\n\n",
 	       OSM_DEFAULT_CONSOLE_PORT);
 #endif
-	printf("--ignore-guids, -i equalize-ignore-guids-file\n"
+	printf("--ignore_guids, -i equalize-ignore-guids-file\n"
 	       "This option provides the means to define a set of ports\n"
 	       "(by guid) that will be ignored by the link load\n"
 	       "equalization algorithm.\n\n");
-- 
1.9.1



[PATCH 1/1] OpenSM: osm_ucast_dfsssp.c - prevent double free error

2014-02-04 Thread Jens Domke
an error in the routing execution can cause a second
free() call on sw_list, which results in a 'double free' error

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
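
Note (not part of the commit): a minimal sketch of why clearing the pointer
helps. The helper release_list() is hypothetical; it relies only on the C
guarantee that free(NULL) is a no-op, so a second cleanup pass over an already
released pointer becomes harmless.

```c
#include <stdlib.h>

/* Hypothetical helper: release *p_list once and clear the caller's pointer.
   Returns 1 if memory was released, 0 if there was nothing left to free. */
static int release_list(int **p_list)
{
	if (!*p_list)
		return 0;
	free(*p_list);
	*p_list = NULL;	/* a later free(*p_list) is now a no-op */
	return 1;
}
```

Calling release_list() twice on the same pointer frees the memory only once,
which is the effect the added "sw_list = NULL;" line aims for on the error path.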

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 5eaff3d..ec69df0 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -2382,6 +2382,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 
/* the intermediate array lived long enough */
free(sw_list);
+   sw_list = NULL;
/* same is true for the compute node and I/O guid map */
destroy_guid_map(cn_tbl);
cn_nodes_provided = FALSE;
-- 
1.7.1



Re: [PATCH 5/5] opensm: Resend LFTs/VLArb/SL2VL MADs in case of error

2014-02-03 Thread Jens Domke

Dear Alex,

the memset call in sl2vl_update_table causes segmentation faults if 
force_update=1, since p_tbl won't get anything assigned and remains NULL.

Please, find a possible fix attached.

Regards,
Jens

On 03.02.14 20:05, Alex Netes wrote:

There are several MADs that we only SET during the sweep (and never
GET).
Zero the stored block, so in case the MAD will end up with error,
we will resend it during the next sweep.

Signed-off-by: Alex Netes <ale...@mellanox.com>
---
  opensm/osm_qos.c   |   13 +
  opensm/osm_ucast_mgr.c |7 +++
  2 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_qos.c b/opensm/osm_qos.c
index a301803..473e3c8 100644
--- a/opensm/osm_qos.c
+++ b/opensm/osm_qos.c
@@ -183,6 +183,13 @@ static ib_api_status_t vlarb_update_table_block(osm_sm_t * sm,
 	if (!p_mad)
 		return IB_INSUFFICIENT_MEMORY;
 
+	/*
+	 * Zero the stored VL Arbitration block, so in case the MAD will
+	 * end up with error, we will resend it in the next sweep.
+	 */
+	memset(&p->vl_arb[block_num], 0,
+	       block_length * sizeof(block.vl_entry[0]));
+
 	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
 
 	return IB_SUCCESS;
@@ -272,6 +279,12 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 	if (!p_mad)
 		return IB_INSUFFICIENT_MEMORY;
 
+	/*
+	 * Zero the stored SL2VL block, so in case the MAD will
+	 * end up with error, we will resend it in the next sweep.
+	 */
+	memset(p_tbl, 0, sizeof(tbl));
+
 	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
 	return IB_SUCCESS;
 }
diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 8194307..c8a7360 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -1002,6 +1002,13 @@ static int set_lft_block(IN osm_switch_t *p_sw, IN osm_ucast_mgr_t *p_mgr,
 		    IB_SMP_DATA_SIZE))
 		return 0;
 
+	/*
+	 * Zero the stored LFT block, so in case the MAD will end up
+	 * with error, we will resend it in the next sweep.
+	 */
+	memset(p_sw->lft + block_id_ho * IB_SMP_DATA_SIZE, OSM_NO_PATH,
+	       IB_SMP_DATA_SIZE);
+
 	OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 		"Writing FT block %u to switch 0x%" PRIx64 "\n", block_id_ho,
 		cl_ntoh64(context.lft_context.node_guid));

From 3cbe8f10c4ab7d83c5898b67e42d9e99be355c05 Mon Sep 17 00:00:00 2001
From: Jens Domke <domke.j...@m.titech.ac.jp>
Date: Tue, 4 Feb 2014 14:47:44 +0900
Subject: [PATCH 1/1] osm_qos.c: fix potential segmentation fault

If force_update=1, then p_tbl remains NULL and therefore the memset
call crashes.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_qos.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_qos.c b/opensm/osm_qos.c
index 473e3c8..76f0ff6 100644
--- a/opensm/osm_qos.c
+++ b/opensm/osm_qos.c
@@ -252,7 +252,7 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 					  const ib_slvl_table_t * sl2vl_table,
 					  cl_qlist_t *mad_list)
 {
-	ib_slvl_table_t tbl, *p_tbl;
+	ib_slvl_table_t tbl, *p_tbl = NULL;
 	unsigned vl_mask;
 	uint8_t vl1, vl2;
 	int i;
@@ -283,7 +283,8 @@ static ib_api_status_t sl2vl_update_table(osm_sm_t * sm, osm_physp_t * p,
 	 * Zero the stored SL2VL block, so in case the MAD will
 	 * end up with error, we will resend it in the next sweep.
 	 */
-	memset(p_tbl, 0, sizeof(tbl));
+	if (p_tbl)
+		memset(p_tbl, 0, sizeof(tbl));
 
 	cl_qlist_insert_tail(mad_list, &p_mad->list_item);
 	return IB_SUCCESS;
-- 
1.7.1



[PATCH 2/5] OpenSM: dfsssp - send multicast forwarding tables to switches

2013-10-03 Thread Jens Domke
Issue: the root switch of the mcast spanning tree was ignored.
When a port of the root switch is part of the mcast group, the
switch won't be processed and none of its ports will be part of
the resulting mcast forwarding table.

Fix: remove the test for used_link==NULL, because all switches in
adj_list should have a used_link set by the prior dijkstra step
(except the root switch) => the test is not needed and the root
switch will be included in the mcast update.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |8 +++-
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 9c34795..219f8bb 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1607,13 +1607,11 @@ static int update_mcft(osm_sm_t * p_sm, vertex_t * adj_list,
 			"(%s) for MLID 0x%X\n", cl_ntoh64(adj_list[i].guid),
 			p_sw->p_node->print_desc, mlid_ho);
 
-		/* if a) no route goes thru this switch  or
-		      b) the switch does not support mcast  or
-		      c) no ports of this switch are part or the mcast group
+		/* if a) the switch does not support mcast  or
+		      b) no ports of this switch are part or the mcast group
 		   then cycle
 		 */
-		if (!(adj_list[i].used_link) ||
-		    osm_switch_supports_mcast(p_sw) == FALSE ||
+		if (osm_switch_supports_mcast(p_sw) == FALSE ||
 		    (p_sw->num_of_mcm == 0 && !(p_sw->is_mc_member)))
 			continue;
 
-- 
1.7.1



[PATCH 1/5] OpenSM: dfsssp - send multicast forwarding tables to switches

2013-10-03 Thread Jens Domke
Issue: dfsssp calculates mcast forwarding tables but doesn't
distribute them to the switches, because is_mc_member/num_of_mcm
for each switch was reset to 0 in osm_mcast_mgr.c.
dfsssp relies on this data to figure out which switch is involved
in the mcast group.

Fix: recalculate is_mc_member/num_of_mcm similar to the code
of create_mgrp_switch_map(...) in osm_mcast_mgr.c right before
the update_mcft function and reset to 0 afterwards.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   43 +++
 1 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ef7de59..9c34795 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1544,6 +1544,43 @@ static int update_lft(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
 	return 0;
 }
 
+/* the function updates the multicast group membership information
+   similar to create_mgrp_switch_map (osm_mcast_mgr.c)
+   => with it we can identify if a switch needs to be processed
+   or not in update_mcft
+*/
+static void update_mgrp_membership(cl_qlist_t * port_list)
+{
+	osm_mcast_work_obj_t *wobj = NULL;
+	osm_port_t *port = NULL;
+	osm_switch_t *sw = NULL;
+	cl_list_item_t *i = NULL;
+
+	for (i = cl_qlist_head(port_list); i != cl_qlist_end(port_list);
+	     i = cl_qlist_next(i)) {
+		wobj = cl_item_obj(i, wobj, list_item);
+		port = wobj->p_port;
+		if (port->p_node->sw) {
+			sw = port->p_node->sw;
+			sw->is_mc_member = 1;
+		} else {
+			sw = port->p_physp->p_remote_physp->p_node->sw;
+			sw->num_of_mcm++;
+		}
+	}
+}
+
+/* reset is_mc_member and num_of_mcm for future computations */
+static void reset_mgrp_membership(vertex_t * adj_list, uint32_t adj_list_size)
+{
+	uint32_t i = 0;
+
+	for (i = 1; i < adj_list_size; i++) {
+		adj_list[i].sw->is_mc_member = 0;
+		adj_list[i].sw->num_of_mcm = 0;
+	}
+}
+
 /* update the multicast forwarding tables of all switches with the informations
    from the previous dijsktra step for the current mlid
  */
@@ -2386,6 +2423,11 @@ static ib_api_status_t dfsssp_do_mcast_routing(void * context,
 		goto Exit;
 	}
 
+	/* set mcast group membership again for update_mcft
+	   (unfortunately: osm_mcast_mgr_find_root_switch resets it)
+	 */
+	update_mgrp_membership(&mcastgrp_port_list);
+
 	/* update the mcast forwarding tables of the switches */
 	err = update_mcft(sm, adj_list, adj_list_size, mbox->mlid,
 			  &mcastgrp_port_map, root_sw);
@@ -2398,6 +2440,7 @@ static ib_api_status_t dfsssp_do_mcast_routing(void * context,
 	}
 
 Exit:
+	reset_mgrp_membership(adj_list, adj_list_size);
 	osm_mcast_drop_port_list(&mcastgrp_port_list);
 	OSM_LOG_EXIT(sm->p_log);
 	return status;
-- 
1.7.1



[PATCH 1/1] OpenSM: dfsssp - add missing and change existing return values

2013-07-31 Thread Jens Domke
a) this patch sets the 'err' variable correctly for the function
   dfsssp_remove_deadlocks() for the case that the error occurs within
   the function and not within a subroutine
b) the functions dfsssp_build_graph() and dfsssp_do_dijkstra_routing()
   now return -1 instead of 1 to indicate an error, to be in compliance
   with the implementation of ucast_mgr_route() in osm_ucast_mgr.c

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   27 ++-
 1 files changed, 18 insertions(+), 9 deletions(-)
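
Note (not part of the commit): a minimal sketch of the single-exit error
convention this patch moves to, i.e. every failure jumps to one ERROR label
that returns -1. The function build_table() and its arguments are
hypothetical and not taken from osm_ucast_dfsssp.c.

```c
#include <stdlib.h>

/* Hypothetical example: 0 on success, -1 on any error, with all failure
   paths funneled through one label. */
static int build_table(size_t n, int **out)
{
	int *tbl = NULL;

	if (n == 0)
		goto ERROR;
	tbl = calloc(n, sizeof(*tbl));
	if (!tbl)
		goto ERROR;

	*out = tbl;
	return 0;

ERROR:
	free(tbl);	/* tbl is NULL on both error paths here; free(NULL) is a no-op */
	*out = NULL;
	return -1;
}
```

A caller can then treat any non-zero result as a failure without having to
distinguish between 1 and -1 per function.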

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index a08e44b..321fffd 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1158,7 +1158,7 @@ static int dfsssp_build_graph(void *context)
 	if (!adj_list) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD02: cannot allocate memory for adj_list\n");
-		return 1;
+		goto ERROR;
 	}
 	for (i = 0; i < adj_list_size; i++)
 		set_default_vertex(&adj_list[i]);
@@ -1198,7 +1198,7 @@ static int dfsssp_build_graph(void *context)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 				"ERR AD03: cannot allocate memory for a link\n");
 			dfsssp_context_destroy(context);
-			return 1;
+			goto ERROR;
 		}
 		head = link;
 		head->next = NULL;
@@ -1243,7 +1243,7 @@ static int dfsssp_build_graph(void *context)
 				head = head->next;
 				free(link);
 			}
-			return 1;
+			goto ERROR;
 		}
 		link = link->next;
 		set_default_link(link);
@@ -1277,6 +1277,9 @@ static int dfsssp_build_graph(void *context)
 
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return 0;
+
+ERROR:
+	return -1;
 }
 
 static void print_routes(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
@@ -1891,6 +1894,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 		if (!weakest_link) {
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 				"ERR AD27: something went wrong in get_weakest_link_in_cycle(...)\n");
+			err = 1;
 			goto ERROR;
 		}
 
@@ -2012,6 +2016,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 	if (!split_count) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
+		err = 1;
 		goto ERROR;
 	}
 	/* initial state: paths for VLs won't be separated */
@@ -2060,6 +2065,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD25: Not enough VL available (avail=%d, needed=%d); Stop dfsssp routing!\n",
 			vl_avail, vl_needed);
+		err = 1;
 		goto ERROR;
 	}
 	/* else { no balancing } */
@@ -2161,7 +2167,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 	if (!sw_list) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 			"ERR AD29: cannot allocate memory for sw_list in dfsssp_do_dijkstra_routing\n");
-		return 1;
+		goto ERROR;
 	}
 	memset(sw_list, 0, sw_list_size * sizeof(vertex_t *));
 
@@ -2197,7 +2203,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
 				"ERR AD31: corrupted sw_list array in dfsssp_do_dijkstra_routing\n");
 			free(sw_list);
-			return 1;
+			goto ERROR;
 		}
 	}
 
@@ -2240,7 +2246,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 		err =
 		    dijkstra(p_mgr, adj_list, adj_list_size, port, lid);
 		if (err)
-			return err;
+			goto ERROR;
 		if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG))
 			print_routes(p_mgr, adj_list, adj_list_size,
 				     port);
@@ -2249,7 +2255,7 @@ static int dfsssp_do_dijkstra_routing(void *context)
 		err =
 		    update_lft(p_mgr, adj_list, adj_list_size, port, lid);
 		if (err)
-			return err;
+			goto ERROR

[PATCH 2/2] OpenSM: DFSSSP - workaround for better VL balancing

2013-05-29 Thread Jens Domke
Currently, DFSSSP maps the src/dest paths statically to certain VLs.
Especially for deadlock-free topologies this can result in an
unfair balancing. Some VLs within one link might be overused,
which results in lower bandwidth for some src/dest pairs.

The fix changes the VL assignment in two ways: first we balance the
number of paths per VL; and second we randomly assign the VL
as long as this doesn't violate the deadlock-freedom.

1) The balancing splits the paths across available free VLs, so that
the maximal number of paths per VL is minimized. We save the number
of VLs for each deadlock-free channel dependency graph. E.g. for
8 VLs, paths per CDG: {14,5,1} => balanced VLs: {{3,3,3,3,2},{3,2},1}
we have 5 VLs to choose from for CDG(0), two for CDG(1) and
one for CDG(2).

2) get_dfsssp_sl(...) will use the information of (1) to randomly
assign the VL for one src/dest pair within the possible number of
VLs. E.g. for a src/dest pair of CDG(0) we have 5 VLs to choose from,
therefore VL := baseVL + rand()%5

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |  131 +++--
 1 files changed, 90 insertions(+), 41 deletions(-)
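
Note (not part of the commit): a minimal sketch of the balancing idea from
the description, under the assumption that each spare VL is handed to the CDG
which currently has the highest average number of paths per VL. The functions
balance_vls() and pick_vl() are hypothetical names; the patch itself works on
split_count inside dfsssp_remove_deadlocks() and may differ in detail.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: distribute vl_avail lanes over vl_needed CDGs.
   paths[i] is the number of paths in CDG(i); split_count[i] receives the
   number of VLs assigned to CDG(i). Returns -1 if there are too few VLs. */
static int balance_vls(const unsigned *paths, uint8_t vl_needed,
		       uint8_t vl_avail, uint8_t *split_count)
{
	uint8_t i, vl, best;
	double avg, most_avg_paths;

	if (vl_needed == 0 || vl_needed > vl_avail)
		return -1;
	/* initial state: one VL per CDG */
	for (i = 0; i < vl_needed; i++)
		split_count[i] = 1;
	/* give every spare VL to the CDG with the most paths per VL */
	for (vl = vl_needed; vl < vl_avail; vl++) {
		best = 0;
		most_avg_paths = -1.0;
		for (i = 0; i < vl_needed; i++) {
			avg = (double)paths[i] / split_count[i];
			if (avg > most_avg_paths) {
				most_avg_paths = avg;
				best = i;
			}
		}
		split_count[best]++;
	}
	return 0;
}

/* VL := baseVL + rand() % split, as in item (2) of the description */
static uint8_t pick_vl(uint8_t base_vl, uint8_t split)
{
	return (uint8_t)(base_vl + rand() % split);
}
```

For the example in the description (8 VLs, paths per CDG {14,5,1}) this
sketch yields split counts {5,2,1}, matching the 5/2/1 VLs mentioned above.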

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 98c3f7c..7aecc24 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -133,6 +133,7 @@ typedef struct dfsssp_context {
vertex_t *adj_list;
uint32_t adj_list_size;
vltable_t *srcdest2vl_table;
+   uint8_t *vl_split_count;
 } dfsssp_context_t;
 
 / set initial values for structs **
@@ -1722,8 +1723,9 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
cl_map_item_t *item1 = NULL, *item2 = NULL;
osm_port_t *src_port = NULL, *dest_port = NULL;
 
-   uint32_t i = 0, err = 0;
-   uint8_t test_vl = 0, vl_avail = 0, vl_needed = 1;
+   uint32_t i = 0, j = 0, err = 0;
+   uint8_t vl = 0, test_vl = 0, vl_avail = 0, vl_needed = 1;
+   double most_avg_paths = 0.0;
cdg_node_t **cdg = NULL, *start_here = NULL, *cycle = NULL;
cdg_link_t *weakest_link = NULL;
uint32_t srcdest = 0;
@@ -2004,43 +2006,56 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 	OSM_LOG(p_mgr->p_log, OSM_LOG_VERBOSE,
 		"Balancing the paths on the available Virtual Lanes\n");
 
-	/* balancing virtual lanes, but avoid additional cycle check - balancing suboptimal;
+	/* optimal balancing virtual lanes, under condition: no additional cycle checks;
 	   sl/vl != 0 might be assigned to loopback packets (i.e. slid/dlid on the
 	   same port for lmc>0), but thats no problem, see IBAS 10.2.2.3
 	 */
-	if (vl_needed == 1) {
-		from = 0;
-		count = paths_per_vl[0] / vl_avail;
-		for (to = 1; to < vl_avail; to++) {
-			vltable_change_vl(srcdest2vl_table, from, to, count);
-			paths_per_vl[from] -= count;
-			paths_per_vl[to] += count;
-		}
-	} else if (vl_needed < vl_avail) {
-		split_count = (uint8_t *) malloc(vl_needed * sizeof(uint8_t));
-		if (!split_count) {
-			OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
-				"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
-		} else {
-			memset(split_count, 0, vl_needed * sizeof(uint8_t));
-			for (i = vl_needed; i < vl_avail; i++)
-				split_count[(i - vl_needed) % vl_needed]++;
-
-			to = vl_needed;
-			for (from = 0; from < vl_needed; from++) {
-				count =
-				    paths_per_vl[from] / (split_count[from] +
-							  1);
-				for (i = 0; i < split_count[from]; i++) {
-					vltable_change_vl(srcdest2vl_table,
-							  from, to, count);
-					paths_per_vl[from] -= count;
-					paths_per_vl[to] += count;
-					to++;
+	split_count = (uint8_t *) calloc(vl_avail, sizeof(uint8_t));
+	if (!split_count) {
+		OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
+			"ERR AD24: cannot allocate memory for split_count, skip balancing\n");
+		goto ERROR;
+	}
+	/* initial state: paths for VLs won't be separated */
+	for (i = 0; i < ((vl_needed < vl_avail) ? vl_needed : vl_avail); i++)
+		split_count[i] = 1;
+	dfsssp_ctx->vl_split_count = split_count;
+	/* balancing is necessary if we have empty VLs */
+	if (vl_needed < vl_avail
[PATCH 1/1] OpenSM: dfsssp - add support for multicast

2013-05-02 Thread Jens Domke
Recent tests on a large system revealed a problem with loops in the multicast 
routing.
Using DFSSSP together with the default mcast routing algorithm of OpenSM can
produce loops in the fabric.

This patch adds the mcast_build_stree function to the DFSSSP routing algorithm,
so that DFSSSP is able to calculate the correct mcast forwarding tables for the
subnet.

It almost does the same steps as the default mcast routing, except that it
uses the Dijkstra algorithm to generate the spanning tree instead of using the
hop count information given by the unicast routing.

General overview of the algorithm in pseudo-code:
1) identify the ports, which are part of the multicast group
2) find the 'best' switch (depending on the hop count) for the mcast group,
   which can be used as a root of the spanning tree
3) perform a dijkstra step with the root switch as starting point
   to generate a spanning tree to all other switches in the subnet
4) build the mcast forwarding tables for relevant switches:
   4.1) select a switch which has mcast member ports connected to it
   4.2) set the downstream ports for the mcast member ports in the mcft
   4.3) traverse towards the root of the spanning tree and set up-/downstream
ports on this path for all involved switches
   4.4) goto 4.1 until all switches have been processed

The same mcast algorithm will be used for SSSP, because SSSP has the
potential to produce loops in the mcast forwarding table as well.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 include/opensm/osm_mcast_mgr.h |   72 +++
 opensm/Makefile.am |1 +
 opensm/osm_mcast_mgr.c |   35 
 opensm/osm_ucast_dfsssp.c  |  194 
 4 files changed, 283 insertions(+), 19 deletions(-)
 create mode 100644 include/opensm/osm_mcast_mgr.h
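
Note (not part of the commit): a minimal sketch of steps 3 and 4 of the
overview above on a toy fabric. Everything here is hypothetical (N_SW, the
adjacency matrix, unit link weights, the function names); it does not use
OpenSM types and only illustrates building a loop-free tree with Dijkstra from
the root and then walking from member switches towards the root.

```c
#include <limits.h>

#define N_SW 5	/* hypothetical fabric with five switches */

/* Step 3: Dijkstra with unit link weights, starting at the root switch.
   parent[v] becomes the upstream neighbor of v, or -1 for the root and
   for unreachable switches. */
static void build_spanning_tree(int adj[N_SW][N_SW], int root,
				int parent[N_SW])
{
	int dist[N_SW], done[N_SW];
	int i, j, u;

	for (i = 0; i < N_SW; i++) {
		dist[i] = INT_MAX;
		done[i] = 0;
		parent[i] = -1;
	}
	dist[root] = 0;

	for (i = 0; i < N_SW; i++) {
		/* pick the closest switch that is not finished yet */
		u = -1;
		for (j = 0; j < N_SW; j++)
			if (!done[j] && dist[j] != INT_MAX &&
			    (u < 0 || dist[j] < dist[u]))
				u = j;
		if (u < 0)
			break;
		done[u] = 1;
		/* relax all links of u */
		for (j = 0; j < N_SW; j++)
			if (adj[u][j] && !done[j] && dist[u] + 1 < dist[j]) {
				dist[j] = dist[u] + 1;
				parent[j] = u;
			}
	}
}

/* Step 4: starting at each switch with member ports, walk towards the
   root and mark every switch on the way as part of the mcast tree. */
static void mark_mcast_switches(const int parent[N_SW],
				const int has_member[N_SW],
				int in_tree[N_SW])
{
	int i, sw;

	for (i = 0; i < N_SW; i++)
		in_tree[i] = 0;
	for (i = 0; i < N_SW; i++) {
		if (!has_member[i])
			continue;
		for (sw = i; sw >= 0 && !in_tree[sw]; sw = parent[sw])
			in_tree[sw] = 1;
	}
}
```

Because every switch has exactly one parent in the tree, the marked switches
cannot form a loop, which is the property the description asks for.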

diff --git a/include/opensm/osm_mcast_mgr.h b/include/opensm/osm_mcast_mgr.h
new file mode 100644
index 000..291a478
--- /dev/null
+++ b/include/opensm/osm_mcast_mgr.h
@@ -0,0 +1,72 @@
+/*
+ * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2009 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2009-2011 ZIH, TU Dresden, Federal Republic of Germany. All rights reserved.
+ * Copyright (C) 2012-2013 Tokyo Institute of Technology. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * Declaration of osm_mcast_work_obj_t.
+ * Provide access to a mcast function which searches the root swicth for
+ * a spanning tree.
+ */
+
+#ifndef _OSM_MCAST_MGR_H_
+#define _OSM_MCAST_MGR_H_
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else  /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+typedef struct osm_mcast_work_obj {
+   cl_list_item_t list_item;
+   osm_port_t *p_port;
+   cl_map_item_t map_item;
+} osm_mcast_work_obj_t;
+
+int osm_mcast_make_port_list_and_map(cl_qlist_t * list, cl_qmap_t * map,
+osm_mgrp_box_t * mbox);
+
+void osm_mcast_drop_port_list(cl_qlist_t * list);
+
+osm_switch_t * osm_mcast_mgr_find_root_switch(osm_sm_t * sm, cl_qlist_t * list);
+
+END_C_DECLS
+#endif /* _OSM_MCAST_MGR_H_ */
diff --git a/opensm/Makefile.am b/opensm/Makefile.am
index 7fd6bc6..20318cc 100644
--- a/opensm/Makefile.am
+++ b/opensm/Makefile.am
@@ -116,6 +116,7 @@ opensminclude_HEADERS

[PATCH 01/10] DFSSSP: fix a memory leak in dfsssp_build_graph

2013-01-22 Thread Jens Domke
If the graph could not be built correctly and DFSSSP returns an error, then
not all allocated memory was freed.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ffc317f..ff525ea 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1093,6 +1093,9 @@ static int dfsssp_build_graph(void *context)
 	for (i = 0; i < adj_list_size; i++)
 		set_default_vertex(&adj_list[i]);
 
+	dfsssp_ctx->adj_list = adj_list;
+	dfsssp_ctx->adj_list_size = adj_list_size;
+
 	/* count the total number of Hca / LIDs (for lmc>0) in the fabric */
 	for (item = cl_qmap_head(port_tbl); item != cl_qmap_end(port_tbl);
 	     item = cl_qmap_next(item)) {
@@ -1190,9 +1193,6 @@ static int dfsssp_build_graph(void *context)
 	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG))
 		dfsssp_print_graph(p_mgr, adj_list, adj_list_size);
 
-	dfsssp_ctx->adj_list = adj_list;
-	dfsssp_ctx->adj_list_size = adj_list_size;
-
 	OSM_LOG_EXIT(p_mgr->p_log);
 	return 0;
 }
-- 
1.7.1



[PATCH 08/10] OpenSM: dfsssp - avoid unnecessary nested loop in vltable_print for OSM_LOG_INFO

2013-01-22 Thread Jens Domke
For OSM_LOG_INFO the debug function vltable_print was called, which
iterates in a nested loop over all LIDs but only prints output for
OSM_LOG_DEBUG; therefore we move vltable_print into a separate if clause.

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index b82d8c8..32bc8f1 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1940,10 +1940,13 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 		goto ERROR;
 	}
 	/* else { no balancing } */
-	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_INFO)) {
+
+	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_DEBUG)) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
 			"Virtual Lanes per src/dest combination after balancing:\n");
 		vltable_print(p_mgr, srcdest2vl_table);
+	}
+	if (OSM_LOG_IS_ACTIVE_V2(p_mgr->p_log, OSM_LOG_INFO)) {
 		OSM_LOG(p_mgr->p_log, OSM_LOG_INFO,
 			"Paths per VL (after balancing):\n");
 		for (i = 0; i < vl_avail; i++)
-- 
1.7.1



[PATCH 04/10] opensm/osm_ucast_dfsssp.c : fix dereference before null check

2013-01-22 Thread Jens Domke
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Dereferencing dfsssp_ctx before a null check.

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index af1b062..f88382b 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -2069,14 +2069,16 @@ static uint8_t get_dfsssp_sl(void *context, uint8_t hint_for_default_sl,
 			     const ib_net16_t slid, const ib_net16_t dlid)
 {
 	dfsssp_context_t *dfsssp_ctx = (dfsssp_context_t *) context;
-	osm_ucast_mgr_t *p_mgr = (osm_ucast_mgr_t *) dfsssp_ctx->p_mgr;
 	osm_port_t *src_port, *dest_port;
 	vltable_t *srcdest2vl_table = NULL;
+	osm_ucast_mgr_t *p_mgr = NULL;
 	int32_t res = 0;
 
 	if (dfsssp_ctx
-	    && dfsssp_ctx->routing_type == OSM_ROUTING_ENGINE_TYPE_DFSSSP)
+	    && dfsssp_ctx->routing_type == OSM_ROUTING_ENGINE_TYPE_DFSSSP) {
+		p_mgr = (osm_ucast_mgr_t *) dfsssp_ctx->p_mgr;
 		srcdest2vl_table = (vltable_t *) (dfsssp_ctx->srcdest2vl_table);
+	}
 	else
 		return hint_for_default_sl;
 
-- 
1.7.1



[PATCH 10/10] OpenSM: dfsssp - moved paths from one to another VL might be counted multiple times

2013-01-22 Thread Jens Domke
the counter for paths, which have been moved to a different VL,
was incorrect; the counter should not include paths moved in a
previous step

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index c8a1007..a53e783 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1813,8 +1813,14 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
 				    (uint8_t)
 				    vltable_get_vl(srcdest2vl_table,
 						   cl_hton16(slid),
-						   cl_hton16(dlid)))
+						   cl_hton16(dlid))) {
+					/* this path has been moved
+					   before - don't count
+					 */
+					paths_per_vl[test_vl]++;
+					paths_per_vl[test_vl + 1]--;
 					continue;
+				}
 
 				src_port =
 				    osm_get_port_by_lid(p_mgr->p_subn,
-- 
1.7.1



[PATCH 03/10] opensm/osm_ucast_dfsssp.c : fix dereference null return value

2013-01-22 Thread Jens Domke
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Dereferencing a null pointer remote_node

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)
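
Note (not part of the commit): a minimal sketch of the guard added by this
patch. The type node_t and the helper is_switch() are hypothetical stand-ins;
the sketch relies only on the C rule that || and && evaluate left to right
and stop early, so the pointer is never dereferenced when it is NULL.

```c
#include <stddef.h>

/* Hypothetical stand-in for a node that may or may not be a switch */
typedef struct node {
	void *sw;	/* non-NULL only for switches */
} node_t;

/* Returns 1 only if the node exists and is a switch; remote_node->sw is
   evaluated only when remote_node is non-NULL (short-circuit &&). */
static int is_switch(const node_t *remote_node)
{
	return remote_node && remote_node->sw;
}
```

The patch uses the negated form, "!remote_node || !remote_node->sw", which is
equivalent to "!is_switch(remote_node)" in this sketch.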

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 013bad4..af1b062 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -815,7 +815,7 @@ static int update_channel_dep_graph(cdg_node_t ** cdg_root,
osm_node_get_remote_node(local_node, local_port,
 &remote_port);
/* if remote_node is a Hca, then the last channel from switch to Hca would be a sink in the cdg - skip */
-   if (!remote_node->sw)
+   if (!remote_node || !remote_node->sw)
break;
remote_lid = cl_ntoh16(osm_node_get_base_lid(remote_node, 0));
 
@@ -961,7 +961,7 @@ static int remove_path_from_cdg(cdg_node_t ** cdg_root, osm_port_t * src_port,
osm_node_get_remote_node(local_node, local_port,
 &remote_port);
/* if remote_node is a Hca, then the last channel from switch to Hca would be a sink in the cdg - skip */
-   if (!remote_node->sw)
+   if (!remote_node || !remote_node->sw)
break;
remote_lid = cl_ntoh16(osm_node_get_base_lid(remote_node, 0));
 
-- 
1.7.1



[PATCH 02/10] opensm/osm_ucast_dfsssp.c : Fix resource leak

2013-01-22 Thread Jens Domke
From: Dan Ben Yosef <da...@dev.mellanox.co.il>

Variable head going out of scope leaks the storage it points to.
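
As an illustration of the leak described here, the following is a minimal sketch of releasing a partially built singly linked list before an early return. The type `link_node_t` and the function `free_link_list` are hypothetical names for this example and are not taken from osm_ucast_dfsssp.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* hypothetical, reduced list node; the real link type carries more fields */
typedef struct link_node {
	struct link_node *next;
} link_node_t;

/* free every node reachable from head and return how many were freed */
static size_t free_link_list(link_node_t *head)
{
	size_t freed = 0;

	while (head) {
		link_node_t *cur = head;

		head = head->next;	/* advance before cur is released */
		free(cur);
		freed++;
	}
	return freed;
}
```

Calling such a helper on every error path that returns while `head` is still owned by the function avoids losing the already allocated nodes.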

Signed-off-by: Dan Ben Yosef <da...@dev.mellanox.co.il>
Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ff525ea..013bad4 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1161,6 +1161,11 @@ static int dfsssp_build_graph(void *context)
OSM_LOG(p_mgr->p_log, OSM_LOG_ERROR,
"ERR AD08: cannot allocate memory for a link\n");
dfsssp_context_destroy(context);
+   while (head) {
+   link = head;
+   head = head->next;
+   free(link);
+}
return 1;
}
link = link->next;
-- 
1.7.1



[PATCH 09/10] OpenSM: DFSSSP does not find LIDs due to wrong byte order (v2)

2013-01-22 Thread Jens Domke
Problem:
argument list for external calls of path_sl(...) has been changed at some 
point in the past;
path_sl(...) arguments for slid/dlid are now in network byte order;
internal storage of lids is host byte order;
this mismatch results in a return value of 'hint_for_default_sl' of DFSSSP's 
get_dfsssp_sl function for every request

Fix:
lids will be stored in network byte order, so that a conversion is not 
necessary and
DFSSSP returns the correct SL for that request

This is version 2 of the original patch, because I forgot to change some 
internal calls.
Please use this patch, instead of the patch from December 17.
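
The mismatch described above can be reproduced outside of OpenSM. The following is a minimal sketch, assuming POSIX `htons()` as a stand-in for `cl_hton16()`; the helpers `fill_table_net` and `find_lid` are hypothetical names for this example and are not part of osm_ucast_dfsssp.c.

```c
#include <arpa/inet.h>	/* htons(), used here as a stand-in for cl_hton16() */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>	/* qsort(), bsearch() */

/* compare two 16-bit values exactly as they are stored */
static int cmp_raw16(const void *l1, const void *l2)
{
	uint16_t a = *(const uint16_t *)l1, b = *(const uint16_t *)l2;

	return (a > b) - (a < b);
}

/* store the LIDs first..first+n-1 in network byte order and sort the table */
static void fill_table_net(uint16_t *tbl, size_t n, uint16_t first)
{
	size_t i;

	for (i = 0; i < n; i++)
		tbl[i] = htons((uint16_t)(first + i));
	qsort(tbl, n, sizeof(*tbl), cmp_raw16);
}

/* return the index of key in the sorted table, or -1 if it is not stored */
static long find_lid(const uint16_t *tbl, size_t n, uint16_t key)
{
	const uint16_t *hit = bsearch(&key, tbl, n, sizeof(*tbl), cmp_raw16);

	return hit ? (long)(hit - tbl) : -1;
}
```

With a table built by `fill_table_net(tbl, 4, 1)`, a lookup with `htons(3)` succeeds, while a lookup with the host-order value `3` fails on a little-endian machine. The lookup only works when the stored values and the key use the same byte order, which is why the patch stores the LIDs in the order in which path_sl(...) receives them.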

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   38 +-
 1 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 32bc8f1..c8a1007 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -339,7 +339,7 @@ static void heap_free(binary_heap_t * heap)
 /* compare function of two lids for stdlib qsort */
 static int cmp_lids(const void *l1, const void *l2)
 {
-   uint16_t lid1 = *((uint16_t *) l1), lid2 = *((uint16_t *) l2);
+   ib_net16_t lid1 = *((ib_net16_t *) l1), lid2 = *((ib_net16_t *) l2);
 
if (lid1 < lid2)
return -1;
@@ -352,19 +352,19 @@ static int cmp_lids(const void *l1, const void *l2)
 /* use stdlib to sort the lid array */
 static inline void vltable_sort_lids(vltable_t * vltable)
 {
-   qsort(vltable->lids, vltable->num_lids, sizeof(uint16_t), cmp_lids);
+   qsort(vltable->lids, vltable->num_lids, sizeof(ib_net16_t), cmp_lids);
 }
 
 /* use stdlib to get index of key in lid array;
return -1 if lid isn't found in lids array
 */
-static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
+static inline int64_t vltable_get_lidindex(ib_net16_t * key, vltable_t * vltable)
 {
-   uint16_t *found_lid = NULL;
+   ib_net16_t *found_lid = NULL;
 
found_lid =
-   (uint16_t *) bsearch(key, vltable->lids, vltable->num_lids,
-sizeof(uint16_t), cmp_lids);
+   (ib_net16_t *) bsearch(key, vltable->lids, vltable->num_lids,
+  sizeof(ib_net16_t), cmp_lids);
if (found_lid)
return found_lid - vltable->lids;
else
@@ -374,7 +374,7 @@ static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
 /* get virtual lane from src lid X dest lid kombination;
return -1 for invalid lids
 */
-static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
+static int32_t vltable_get_vl(vltable_t * vltable, ib_net16_t slid, ib_net16_t dlid)
 {
int64_t ind1 = vltable_get_lidindex(&slid, vltable);
int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -387,8 +387,8 @@ static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
 }
 
 /* set a virtual lane in the matrix */
-static inline void vltable_insert(vltable_t * vltable, uint16_t slid,
- uint16_t dlid, uint8_t vl)
+static inline void vltable_insert(vltable_t * vltable, ib_net16_t slid,
+ ib_net16_t dlid, uint8_t vl)
 {
int64_t ind1 = vltable_get_lidindex(&slid, vltable);
int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -436,8 +436,8 @@ static void vltable_print(osm_ucast_mgr_t * p_mgr, vltable_t * vltable)
OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
"   route from src_lid=%" PRIu16
" to dest_lid=%" PRIu16 " on vl=%" PRIu8
-   "\n", vltable->lids[ind1],
-   vltable->lids[ind2],
+   "\n", cl_ntoh16(vltable->lids[ind1]),
+   cl_ntoh16(vltable->lids[ind2]),
vltable->vls[ind1 +
 ind2 * vltable->num_lids]);
}
@@ -464,7 +464,7 @@ static int vltable_alloc(vltable_t ** vltable, uint64_t size)
if (!(*vltable))
goto ERROR;
(*vltable)->num_lids = size;
-   (*vltable)->lids = (uint16_t *) malloc(size * sizeof(uint16_t));
+   (*vltable)->lids = (ib_net16_t *) malloc(size * sizeof(ib_net16_t));
if (!((*vltable)->lids))
goto ERROR;
(*vltable)->vls = (uint8_t *) malloc(size * size * sizeof(uint8_t));
@@ -1704,7 +1704,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
osm_port_get_lid_range_ho(dest_port, &min_lid_ho,
  &max_lid_ho);
for (dlid = min_lid_ho; dlid <= max_lid_ho; dlid++, i++)
-   srcdest2vl_table->lids[i] = dlid

[PATCH 05/10] OpenSM: dfsssp ignores differences in the lmc value

2013-01-22 Thread Jens Domke
dfsssp used one port representative to obtain the lmc value for all ports;
but lmc can vary, e.g. SW base port 0 vs. CA port
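
A port with LMC value lmc is assigned 2^lmc LIDs, so using one representative value for all ports miscounts as soon as the values differ. The following is a minimal sketch of the difference; the function names are hypothetical and the code is not taken from osm_ucast_dfsssp.c.

```c
#include <stddef.h>
#include <stdint.h>

/* number of LIDs assigned to a port with the given LMC: 2^lmc */
static uint32_t lids_per_port(uint8_t lmc)
{
	return (uint32_t)1 << lmc;
}

/* count LIDs with one representative lmc applied to every port */
static uint32_t count_with_representative(size_t num_ports, uint8_t lmc)
{
	return (uint32_t)num_ports * lids_per_port(lmc);
}

/* count LIDs with the lmc value of each individual port */
static uint32_t count_per_port(const uint8_t *lmc, size_t num_ports)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < num_ports; i++)
		total += lids_per_port(lmc[i]);
	return total;
}
```

For three ports with lmc values 0, 2 and 1, the per-port count is 1 + 4 + 2 = 7, whereas taking the first port as representative yields 3.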

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index f88382b..3e9bc31 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1145,7 +1145,7 @@ static int dfsssp_build_graph(void *context)
continue;
/* if there is a Hca connected - count and cycle */
if (!remote_node->sw) {
-   lmc = osm_port_get_lmc(p_port);
+   lmc = osm_node_get_lmc(remote_node, (uint32_t)remote_port);
adj_list[i].num_hca += (1 << lmc);
continue;
}
-- 
1.7.1



[PATCH 07/10] OpenSM: dfsssp - change the port traversal for sssp

2013-01-22 Thread Jens Domke
There are some really rare cases where the sssp part of dfsssp routing
produces a suboptimal path assignment, so that some links are
oversubscribed and others are undersubscribed with paths;
- this results in bad balancing for Hca-Hca traffic.
The last patch (adding SP0 support) made the situation even worse.

This patch returns the focus to Hca-Hca traffic balancing.
Previously, the ports for the Dijkstra loop were chosen 'randomly',
i.e. obtained in the order of p_subn->port_guid_tbl.
Now we process all ports (Hca) of one switch first, before we proceed
with the next switch. In addition, we sort the switches
in descending order with respect to the
number of attached Hca.
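
The ordering described above can be sketched with stdlib qsort over an array of pointers. This is a simplified illustration: `sw_vertex_t` is a hypothetical, reduced stand-in for the vertex type used by dfsssp, and `sort_switches_by_num_hca` is a name chosen for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>	/* qsort() */

/* hypothetical, reduced stand-in for a switch vertex */
typedef struct sw_vertex {
	uint64_t guid;
	uint32_t num_hca;	/* number of Hca attached to this switch */
} sw_vertex_t;

/* qsort comparator: switches with more Hca sort first; NULL counts as 0 */
static int cmp_num_hca_desc(const void *l1, const void *l2)
{
	const sw_vertex_t *sw1 = *(const sw_vertex_t * const *)l1;
	const sw_vertex_t *sw2 = *(const sw_vertex_t * const *)l2;
	uint32_t n1 = sw1 ? sw1->num_hca : 0;
	uint32_t n2 = sw2 ? sw2->num_hca : 0;

	if (n1 > n2)
		return -1;
	if (n1 < n2)
		return 1;
	return 0;
}

/* sort an array of switch pointers in descending order of num_hca */
static void sort_switches_by_num_hca(sw_vertex_t **sw_list, size_t n)
{
	qsort(sw_list, n, sizeof(*sw_list), cmp_num_hca_desc);
}
```

Returning -1 for the larger count is what turns the default ascending qsort order into a descending one. Note that qsort is not stable, so switches with equal counts may end up in any relative order.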

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |  136 +++--
 1 files changed, 131 insertions(+), 5 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index 31e4140..b82d8c8 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -1013,6 +1013,72 @@ ERROR:
 /**
  **/
 
+/ helper functions to generate an ordered list of ports ***
+  (functions copied from osm_ucast_mgr.c and modified) 
+ **/
+static void add_sw_endports_to_order_list(osm_switch_t * sw,
+ osm_ucast_mgr_t * m)
+{
+   osm_port_t *port;
+   osm_physp_t *p;
+   int i;
+
+   for (i = 1; i < sw->num_ports; i++) {
+   p = osm_node_get_physp_ptr(sw->p_node, i);
+   if (p && p->p_remote_physp && !p->p_remote_physp->p_node->sw) {
+   port = osm_get_port_by_guid(m->p_subn,
+   p->p_remote_physp->
+   port_guid);
+   if (!port)
+   continue;
+   cl_qlist_insert_tail(&m->port_order_list,
+&port->list_item);
+   }
+   }
+}
+
+static void add_guid_to_order_list(uint64_t guid, osm_ucast_mgr_t * m)
+{
+   osm_port_t *port = osm_get_port_by_guid(m->p_subn, cl_hton64(guid));
+
+   if (!port) {
+OSM_LOG(m->p_log, OSM_LOG_DEBUG,
+"port guid not found: 0x%016" PRIx64 "\n", guid);
+   }
+
+   cl_qlist_insert_tail(&m->port_order_list, &port->list_item);
+}
+
+/* compare function of #Hca attached to a switch for stdlib qsort */
+static int cmp_num_hca(const void * l1, const void * l2)
+{
+   vertex_t *sw1 = *((vertex_t **) l1);
+   vertex_t *sw2 = *((vertex_t **) l2);
+   uint32_t num_hca1 = 0, num_hca2 = 0;
+
+   if (sw1)
+   num_hca1 = sw1->num_hca;
+   if (sw2)
+   num_hca2 = sw2->num_hca;
+
+   if (num_hca1 > num_hca2)
+   return -1;
+   else if (num_hca1 < num_hca2)
+   return 1;
+   else
+   return 0;
+}
+
+/* use stdlib to sort the switch array depending on num_hca */
+static inline void sw_list_sort_by_num_hca(vertex_t ** sw_list,
+  uint32_t sw_list_size)
+{
+   qsort(sw_list, sw_list_size, sizeof(vertex_t *), cmp_num_hca);
+}
+
+/**
+ **/
+
 static void dfsssp_print_graph(osm_ucast_mgr_t * p_mgr, vertex_t * adj_list,
   uint32_t size)
 {
@@ -1172,7 +1238,7 @@ static int dfsssp_build_graph(void *context)
link = head;
head = head->next;
free(link);
-}
+   }
return 1;
}
link = link->next;
@@ -1919,7 +1985,12 @@ static int dfsssp_do_dijkstra_routing(void *context)
vertex_t *adj_list = (vertex_t *) dfsssp_ctx->adj_list;
uint32_t adj_list_size = dfsssp_ctx->adj_list_size;
 
-   cl_qmap_t *port_tbl = &p_mgr->p_subn->port_guid_tbl; /* 1 managment port per switch + 1 or 2 ports for each Hca */
+   vertex_t **sw_list = NULL;
+   uint32_t sw_list_size = 0;
+   uint64_t guid = 0;
+   cl_qlist_t *qlist = NULL;
+   cl_list_item_t *qlist_item = NULL;
+
cl_qmap_t *sw_tbl = &p_mgr->p_subn->sw_guid_tbl;
cl_map_item_t *item = NULL;
osm_switch_t *sw = NULL;
@@ -1949,12 +2020,64 @@ static int dfsssp_do_dijkstra_routing(void *context)
}
}
 
+   /* we need an intermediate array of pointers to switches in adj_list

Re: umad_send with service level higher than 0 does not work

2012-12-17 Thread Jens Domke
Hello Hal,

On Dec 17, 2012, at 9:04 PM, Hal Rosenstock wrote:

 Hi,
 
 On 12/17/2012 1:16 AM, Jens Domke wrote:
 Hello Hal,
 
 I have checked the smpquery and saquery command today.
 
 The smpquery SL2VL and PI commands for the opensm port work fine, and I get 
 the expected results:
 ==
 # SL2VL table: Lid 19
 # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
 ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
 ==
 # Port info: Lid 19 port 0
 Mkey:not displayed
 GidPrefix:...0xfe80
 Lid:.19
 SMLid:...19
 CapMask:.0x251086a
IsSM
IsTrapSupported
IsAutomaticMigrationSupported
IsSLMappingSupported
IsSystemImageGUIDsupported
IsCommunicatonManagementSupported
IsVendorClassSupported
IsCapabilityMaskNoticeSupported
IsClientRegistrationSupported
 DiagCode:0x
 MkeyLeasePeriod:.0
 LocalPort:...1
 LinkWidthEnabled:1X or 4X
 LinkWidthSupported:..1X or 4X
 LinkWidthActive:.4X
 LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps
 LinkState:...Active
 PhysLinkState:...LinkUp
 LinkDownDefState:Polling
 ProtectBits:.0
 LMC:.0
 LinkSpeedActive:.5.0 Gbps
 LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps
 NeighborMTU:.2048
 SMSL:0
 VLCap:...VL0-7
 InitType:0x00
 VLHighLimit:.0
 VLArbHighCap:8
 VLArbLowCap:.8
 InitReply:...0x00
 MtuCap:..2048
 VLStallCount:0
 HoqLife:.31
 OperVLs:.VL0-7
 PartEnforceInb:..0
 PartEnforceOutb:.0
 FilterRawInb:0
 FilterRawOutb:...0
 MkeyViolations:..0
 PkeyViolations:..0
 QkeyViolations:..0
 GuidCap:.32
 ClientReregister:0
 McastPkeyTrapSuppressionEnabled:.0
 SubnetTimeout:...18
 RespTimeVal:.16
 LocalPhysErr:8
 OverrunErr:..8
 MaxCreditHint:...0
 RoundTrip:...0
 CapabilityMask2:.0x
 LinkSpeedExtActive:..No Extended Speed
 LinkSpeedExtSupported:...0
 LinkSpeedExtEnabled:.0
 ==
 
 
 The problem is the saquery commands on other nodes.
 In most cases the execution fails, and the node shows the same behaviour 
 as the OpenSM node when it tries to send on SL>0. The PathRequest packet 
 does not arrive at the node with the running OpenSM (checked with ibdump). 
 At some point of the execution the saquery binary hangs, the kernel log 
 indicates errors and the only option is to reboot. 
 This is the output I see for the saquery:
 ==
 saquery -P --src-to-dst 4:8
 ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
 
 Query SA failed: Connection timed out
 ==
 (In really rare cases I get the PathRequest back and see the dump, but the 
 saquery binary stalls afterwards, too.)
 
 
 I did some debugging with gdb again, and stepped through the saquery code.
 When I change the SL to 0 in the addr vector of the MAD right before 
 umad_send is called, then everything works.
 So, the saquery on the compute nodes shows the same behaviour as the opensm 
 with respect to the SL value for umad_send.
 
 
 At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in 
 the config file of opensm.
 Sadly, this configuration results in the same crashes of the saquery 
 commands.
 For the runs with MinHop I used also a different SL2VL mapping, just to be 
 sure, that there is no problem with VL0 and every SL travels on VL=0:
 ==
 # SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
 ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
 ==
 
 Non QoS routing algorithms still need -Q otherwise the full range of QoS
 is not available

Re: umad_send with service level higher than 0 does not work

2012-12-16 Thread Jens Domke
Hello Hal,

On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:

 Hi,
 
 On 12/14/2012 3:32 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/14/2012 1:24 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
 
 Hi again,
 
 On 12/14/2012 10:17 AM, Jens Domke wrote:
 Hello Hal,
 
 thank you for the fast response. I will try to clarify some points.
 
 d) OpenMPI runs are executed with --mca 
 btl_openib_ib_path_record_service_level 1
 
 I'm not familiar with what DFSSSP does to figure out SLs exactly but
 there should be no need to set this. The proper SL for querying the SA
 for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
 (and other QoS based routing algorithms), it calculates that and the SM
 pushes this into each port. That should be used. It's possible that SL1
 is not a valid SL for port - SA querying using DFSSSP.
 The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
 specify the SL for querying the PathRecords.
 It just enables the functionality. And the ompi processes use the 
 PortInfo.SMSL to send the request.
 For the request port - SA every 0<=SL<=7 was used in the test, and 
 the SA received the requests.  
 
 e) kernel 2.6.32-220.13.1.el6.x86_64
 
 As far as I understand the whole system:
 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) 
 to the OpenSM
 2. the SA receives the request on QP1
 
 There is the SL in the query itself. This should be the SMSL that the SM
 set for that port.
 Hmm, there you might have a point. I think I saw that the query itself 
 had SL=0 specified.
 In fact OpenMPI sets everthing to 0 except for slid and dlid.
 
 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) 
 about a special service level for the slid/dlid path
 
 This is a (potentially) different SL (for MPI-MPI port communication)
 than the one the query used and is the one returned inside the
 PathRecord attribute/data.
 Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
 running on a port which is also used for MPI comm.
 
 With DFSSSP are all SLs same from source port to get to any destination ?
 No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == 
 SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
 
 If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
 True. But I don't think that the SA asks the DFSSSP routing about the SL for 
 the reversible path.
 So, the SA could use any SL which is a valid SL, even if the DFSSSP would 
 recommend another SL.
 
 I just read the IB Specs and it says, that SL specified in the received 
 packet is used as the SL in the response packet for MAD packets.
 So, its most likely, that there is a mismatch in the way how OMPI does the 
 setup of the PathRequest and the way how the SA does build the respond 
 packet.
 OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
 
 So CompMask in the query has the SL bit on and SL is set to 0 inside the
 SubAdmGet of PatchRecord ?

No, the CompMask didn't have the SL bit set, and the SL was set to 0.
I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only 
reference I found was in osm_sa_path_record.c
The SA just treats the SL in the PathRequest as an "I would like to use this SL" 
hint in case the SL bit is set.
But the routing engine can overwrite the requested SL before the reply is sent.
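
The behaviour described in the paragraph above can be summarised as a simplified sketch. This is not the code from osm_sa_path_record.c; the bit position `EXAMPLE_COMPMASK_SL`, the sentinel `EXAMPLE_NO_SL` and the function `choose_reply_sl` are hypothetical and only model the order of precedence.

```c
#include <stdint.h>

/* hypothetical bit position, not the real value of IB_PR_COMPMASK_SL */
#define EXAMPLE_COMPMASK_SL ((uint64_t)1 << 0)
/* hypothetical marker: the routing engine has no SL for this path */
#define EXAMPLE_NO_SL (-1)

/*
 * Pick the SL that ends up in the PathRecord reply:
 * start with a default, take the requested SL only if the SL bit is set in
 * the component mask, and let the routing engine overwrite the result.
 */
static uint8_t choose_reply_sl(uint64_t comp_mask, uint8_t requested_sl,
			       uint8_t default_sl, int routing_sl)
{
	uint8_t sl = default_sl;

	if (comp_mask & EXAMPLE_COMPMASK_SL)
		sl = requested_sl;	/* the requester stated a preference */
	if (routing_sl != EXAMPLE_NO_SL)
		sl = (uint8_t)routing_sl;	/* the routing engine decides last */
	return sl;
}
```

Without the SL bit the requested value is ignored (wildcarded), which matches the observation that a request with SL=0 and no SL bit still receives a routing-specific SL in the reply.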

Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the 
CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
Sadly, the reply sent by the SA does not leave the node (for SL_b>0). Only if I 
change the SL to 0 in the MAD right before umad_send is called by the SA, the 
packet is able to leave the node and reach the OMPI process.

 
 and sends the packet on SL_b (PortInfo.SMSL).
 
 Good.
 
 The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for the 
 response.
 If SL_b is not 0, then the packet can't reach the OMPI process. Right?
 
 Depends. It may be that both SLs work but maybe not.
 
 If I analyse this correctly, then there are two bugs. One is in OMPI, that 
 it does not specify the SL within the PathRequest in a appropriate way 
 (which would be a SL suggested by DFSSSP for the reversible path). And the 
 second bug is that the SA uses the SL, on which the PathRequest packet was 
 send, and not the SL specified within the packet.
 What do you think?
 
 Yes, it might be better to wildcard the SL in the query. The only
 scenario that would fail with the query you are making if there's no SL
 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
 If that's the case, SA should return MAD status 0xc (status code 3 -
 ERR_NO_RECORDS). But the response doesn't make it back to the requester
 OMPI node so it's not even getting that far.

Yes, exactly. So, do you have an idea why the response hangs in the SA node?
I have

Re: umad_send with service level higher than 0 does not work

2012-12-16 Thread Jens Domke
Hi,

On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:

 Hi,
 
 On 12/16/2012 7:03 AM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/14/2012 3:32 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/14/2012 1:24 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
 
 Hi again,
 
 On 12/14/2012 10:17 AM, Jens Domke wrote:
 Hello Hal,
 
 thank you for the fast response. I will try to clarify some points.
 
 d) OpenMPI runs are executed with --mca 
 btl_openib_ib_path_record_service_level 1
 
 I'm not familiar with what DFSSSP does to figure out SLs exactly but
 there should be no need to set this. The proper SL for querying the SA
 for PathRecords, etc. is always in PortInfo.SMSL. In the case of 
 DFSSSP
 (and other QoS based routing algorithms), it calculates that and the 
 SM
 pushes this into each port. That should be used. It's possible that 
 SL1
 is not a valid SL for port - SA querying using DFSSSP.
 The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
 specify the SL for querying the PathRecords.
 It just enables the functionality. And the ompi processes use the 
 PortInfo.SMSL to send the request.
 For the request port - SA every 0<=SL<=7 was used in the test, and 
 the SA received the requests.  
 
 e) kernel 2.6.32-220.13.1.el6.x86_64
 
 As far as I understand the whole system:
 1. the OMPI processes are sending MAD requests 
 (SubnAdmGet:PathRecord) to the OpenSM
 2. the SA receives the request on QP1
 
 There is the SL in the query itself. This should be the SMSL that the 
 SM
 set for that port.
 Hmm, there you might have a point. I think I saw that the query itself 
 had SL=0 specified.
 In fact OpenMPI sets everthing to 0 except for slid and dlid.
 
 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) 
 about a special service level for the slid/dlid path
 
 This is a (potentially) different SL (for MPI-MPI port 
 communication)
 than the one the query used and is the one returned inside the
 PathRecord attribute/data.
 Yes, it can be different, but DFSSSP sets the same SL, because the SM 
 is running on a port which is also used for MPI comm.
 
 With DFSSSP are all SLs same from source port to get to any destination 
 ?
 No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) 
 == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
 
 If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
 True. But I don't think that the SA asks the DFSSSP routing about the SL 
 for the reversible path.
 So, the SA could use any SL which is a valid SL, even if the DFSSSP would 
 recommend another SL.
 
 I just read the IB Specs and it says, that SL specified in the received 
 packet is used as the SL in the response packet for MAD packets.
 So, its most likely, that there is a mismatch in the way how OMPI does the 
 setup of the PathRequest and the way how the SA does build the respond 
 packet.
 OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest 
 packet, 
 
 So CompMask in the query has the SL bit on and SL is set to 0 inside the
 SubAdmGet of PatchRecord ?
 
 No, the CompMask didn't have the SL bit set, and the SL was set to 0.
 
 That means the SL in the request is wildcarded so the SA/SM fills in a
 valid one in the response.
Ok.
 
 I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only 
 reference I found was in osm_sa_path_record.c
 The SA just treats the SL in the PathRequest as an "I would like to use this 
 SL" hint in case the SL bit is set.
 But the routing engine can overwrite the requested SL before the reply is 
 sent.
 
 Nevertheless, I have changed the code of OMPI so that it sets the SL bit in 
 the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == 
 SL_b.
 Sadly, the reply sent by the SA does not leave the node (for SL_b>0). Only 
 if I change the SL to 0 in the MAD right before umad_send is called by the 
 SA, the packet is able to leave the node and reach the OMPI process.
 
 Are you sure the response doesn't leave the SA node or it's not received
 at the requester (OMPI node) ?
No, I'm not sure. Is there any possibility to check that? As far as I know, 
ibdump does not show MAD packets which leave a port; it only shows the packets 
when they are received on the other end.
 
 
 
 and sends the packet on SL_b (PortInfo.SMSL).
 
 Good.
 
 The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for 
 the response.
 If SL_b is not 0, then the packet can't reach the OMPI process. Right?
 
 Depends. It may be that both SLs work but maybe not.
 
 If I analyse this correctly, then there are two bugs. One is in OMPI, that 
 it does not specify the SL within the PathRequest in a appropriate way 
 (which would be a SL suggested by DFSSSP for the reversible path). And the 
 second bug is that the SA uses the SL, on which the PathRequest packet was 
 send

Re: umad_send with service level higher than 0 does not work

2012-12-16 Thread Jens Domke

On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:

 On 12/16/2012 8:39 AM, Jens Domke wrote:
 Hi,
 
 On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/16/2012 7:03 AM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/14/2012 3:32 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/14/2012 1:24 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
 
 Hi again,
 
 On 12/14/2012 10:17 AM, Jens Domke wrote:
 Hello Hal,
 
 thank you for the fast response. I will try to clarify some points.
 
 d) OpenMPI runs are executed with --mca 
 btl_openib_ib_path_record_service_level 1
 
 I'm not familiar with what DFSSSP does to figure out SLs exactly but
 there should be no need to set this. The proper SL for querying the 
 SA
 for PathRecords, etc. is always in PortInfo.SMSL. In the case of 
 DFSSSP
 (and other QoS based routing algorithms), it calculates that and 
 the SM
 pushes this into each port. That should be used. It's possible that 
 SL1
 is not a valid SL for port - SA querying using DFSSSP.
 The OpenMPI parameter btl_openib_ib_path_record_service_level does 
 not specify the SL for querying the PathRecords.
 It just enables the functionality. And the ompi processes use the 
 PortInfo.SMSL to send the request.
 For the request port - SA every 0<=SL<=7 was used in the test, 
 and the SA received the requests.  
 
 e) kernel 2.6.32-220.13.1.el6.x86_64
 
 As far as I understand the whole system:
 1. the OMPI processes are sending MAD requests 
 (SubnAdmGet:PathRecord) to the OpenSM
 2. the SA receives the request on QP1
 
 There is the SL in the query itself. This should be the SMSL that 
 the SM
 set for that port.
 Hmm, there you might have a point. I think I saw that the query 
 itself had SL=0 specified.
 In fact OpenMPI sets everthing to 0 except for slid and dlid.
 
 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) 
 about a special service level for the slid/dlid path
 
 This is a (potentially) different SL (for MPI-MPI port 
 communication)
 than the one the query used and is the one returned inside the
 PathRecord attribute/data.
 Yes, it can be different, but DFSSSP sets the same SL, because the 
 SM is running on a port which is also used for MPI comm.
 
 With DFSSSP are all SLs same from source port to get to any 
 destination ?
 No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) 
 == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
 
 If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
 True. But I don't think that the SA asks the DFSSSP routing about the SL 
 for the reversible path.
 So, the SA could use any SL which is a valid SL, even if the DFSSSP 
 would recommend another SL.
 
 I just read the IB Specs and it says, that SL specified in the received 
 packet is used as the SL in the response packet for MAD packets.
 So, its most likely, that there is a mismatch in the way how OMPI does 
 the setup of the PathRequest and the way how the SA does build the 
 respond packet.
 OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest 
 packet, 
 
 So CompMask in the query has the SL bit on and SL is set to 0 inside the
 SubAdmGet of PatchRecord ?
 
 No, the CompMask didn't have the SL bit set, and the SL was set to 0.
 
 That means the SL in the request is wildcarded so the SA/SM fills in a
 valid one in the response.
 Ok.
 
 I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only 
 reference I found was in osm_sa_path_record.c
 The SA just treats the SL in the PathRequest as an "I would like to use 
 this SL" hint in case the SL bit is set.
 But the routing engine can overwrite the requested SL before the reply is 
 sent.
 
 Nevertheless, I have changed the code of OMPI so that it sets the SL bit 
 in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a 
 == SL_b.
 Sadly, the reply sent by the SA does not leave the node (for SL_b>0). Only 
 if I change the SL to 0 in the MAD right before umad_send is called by the 
 SA, the packet is able to leave the node and reach the OMPI process.
 
 Are you sure the response doesn't leave the SA node or it's not received
 at the requester (OMPI node) ?
 No, I'm not sure. Is there any possibility to check that? As far as I know, 
 ibdump does not show MAD packets which leave a port; it only shows the packets 
 when they are received on the other end.
 
 
 
 and sends the packet on SL_b (PortInfo.SMSL).
 
 Good.
 
 The SA uses p_mad_addr-addr_type.gsi.service_level, which is SL_b, for 
 the response.
 If SL_b is not 0, then the packet can't reach the OMPI process. Right?
 
 Depends. It may be that both SLs work but maybe not.
 
 If I analyse this correctly, then there are two bugs. One is in OMPI, 
 that it does not specify the SL within the PathRequest in a appropriate 
 way (which would be a SL suggested by DFSSSP

[PATCH 1/1] OpenSM: DFSSSP does not find LIDs due to wrong byte order

2012-12-16 Thread Jens Domke
Problem:
path_sl(...) arguments for slid/dlid are in network byte order; internal 
storage of lids is host byte order; this mismatch results in a return value of 
'hint_for_default_sl' of DFSSSP's get_dfsssp_sl function for every request

Fix:
lids will be stored in network byte order, so that a conversion is not 
necessary and DFSSSP returns the correct SL for that request

Signed-off-by: Jens Domke <domke.j...@m.titech.ac.jp>
---
 opensm/osm_ucast_dfsssp.c |   26 +-
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/opensm/osm_ucast_dfsssp.c b/opensm/osm_ucast_dfsssp.c
index ffc317f..903966c 100644
--- a/opensm/osm_ucast_dfsssp.c
+++ b/opensm/osm_ucast_dfsssp.c
@@ -339,7 +339,7 @@ static void heap_free(binary_heap_t * heap)
 /* compare function of two lids for stdlib qsort */
 static int cmp_lids(const void *l1, const void *l2)
 {
-   uint16_t lid1 = *((uint16_t *) l1), lid2 = *((uint16_t *) l2);
+   ib_net16_t lid1 = *((ib_net16_t *) l1), lid2 = *((ib_net16_t *) l2);
 
if (lid1 < lid2)
return -1;
@@ -352,19 +352,19 @@ static int cmp_lids(const void *l1, const void *l2)
 /* use stdlib to sort the lid array */
 static inline void vltable_sort_lids(vltable_t * vltable)
 {
-   qsort(vltable->lids, vltable->num_lids, sizeof(uint16_t), cmp_lids);
+   qsort(vltable->lids, vltable->num_lids, sizeof(ib_net16_t), cmp_lids);
 }
 
 /* use stdlib to get index of key in lid array;
return -1 if lid isn't found in lids array
 */
-static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
+static inline int64_t vltable_get_lidindex(ib_net16_t * key, vltable_t * vltable)
 {
-   uint16_t *found_lid = NULL;
+   ib_net16_t *found_lid = NULL;
 
found_lid =
-   (uint16_t *) bsearch(key, vltable->lids, vltable->num_lids,
-sizeof(uint16_t), cmp_lids);
+   (ib_net16_t *) bsearch(key, vltable->lids, vltable->num_lids,
+  sizeof(ib_net16_t), cmp_lids);
if (found_lid)
return found_lid - vltable->lids;
else
@@ -374,7 +374,7 @@ static inline int64_t vltable_get_lidindex(uint16_t * key, vltable_t * vltable)
 /* get virtual lane from src lid X dest lid kombination;
return -1 for invalid lids
 */
-static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
+static int32_t vltable_get_vl(vltable_t * vltable, ib_net16_t slid, ib_net16_t dlid)
 {
int64_t ind1 = vltable_get_lidindex(&slid, vltable);
int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -387,8 +387,8 @@ static int32_t vltable_get_vl(vltable_t * vltable, uint16_t slid, uint16_t dlid)
 }
 
 /* set a virtual lane in the matrix */
-static inline void vltable_insert(vltable_t * vltable, uint16_t slid,
- uint16_t dlid, uint8_t vl)
+static inline void vltable_insert(vltable_t * vltable, ib_net16_t slid,
+ ib_net16_t dlid, uint8_t vl)
 {
int64_t ind1 = vltable_get_lidindex(&slid, vltable);
int64_t ind2 = vltable_get_lidindex(&dlid, vltable);
@@ -436,8 +436,8 @@ static void vltable_print(osm_ucast_mgr_t * p_mgr, vltable_t * vltable)
OSM_LOG(p_mgr->p_log, OSM_LOG_DEBUG,
"   route from src_lid=%" PRIu16
" to dest_lid=%" PRIu16 " on vl=%" PRIu8
-   "\n", vltable->lids[ind1],
-   vltable->lids[ind2],
+   "\n", cl_ntoh16(vltable->lids[ind1]),
+   cl_ntoh16(vltable->lids[ind2]),
vltable->vls[ind1 +
 ind2 * vltable->num_lids]);
}
@@ -464,7 +464,7 @@ static int vltable_alloc(vltable_t ** vltable, uint64_t size)
if (!(*vltable))
goto ERROR;
(*vltable)->num_lids = size;
-   (*vltable)->lids = (uint16_t *) malloc(size * sizeof(uint16_t));
+   (*vltable)->lids = (ib_net16_t *) malloc(size * sizeof(ib_net16_t));
if (!((*vltable)->lids))
goto ERROR;
(*vltable)->vls = (uint8_t *) malloc(size * size * sizeof(uint8_t));
@@ -1645,7 +1645,7 @@ static int dfsssp_remove_deadlocks(dfsssp_context_t * dfsssp_ctx)
osm_port_get_lid_range_ho(dest_port, &min_lid_ho,
  &max_lid_ho);
for (dlid = min_lid_ho; dlid <= max_lid_ho; dlid++, i++)
-   srcdest2vl_table->lids[i] = dlid;
+   srcdest2vl_table->lids[i] = cl_hton16(dlid);
}
}
/* sort lids */
-- 
1.7.1
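
The idea behind the patch above can be sketched as follows. This is only an
illustration under stated assumptions: htons()/ntohs() from <arpa/inet.h>
stand in for complib's cl_hton16()/cl_ntoh16(), and the function names are
hypothetical, not taken from osm_ucast_dfsssp.c. The LIDs are stored in
network byte order, and because sorting and searching use the same comparator
on the stored values, bsearch() stays consistent even though the resulting
order is not numeric on a little-endian host.

```c
#include <arpa/inet.h>	/* htons/ntohs, assumed stand-ins for cl_hton16/cl_ntoh16 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compare two LIDs by their stored (network order) 16-bit values. */
static int cmp_net_lids(const void *l1, const void *l2)
{
	uint16_t a = *(const uint16_t *)l1, b = *(const uint16_t *)l2;

	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}

/* Convert host-order LIDs to network order in place, then sort them. */
static void fill_and_sort(uint16_t *lids, size_t n)
{
	size_t i;

	assert(lids != NULL);
	for (i = 0; i < n; i++)
		lids[i] = htons(lids[i]);
	qsort(lids, n, sizeof(uint16_t), cmp_net_lids);
}

/* Return the index of a network-order key, or -1 if it is not in the array. */
static int64_t find_lid_index(const uint16_t *lids, size_t n, uint16_t key_net)
{
	const uint16_t *found =
	    bsearch(&key_net, lids, n, sizeof(uint16_t), cmp_net_lids);

	return found ? (int64_t)(found - lids) : -1;
}
```

A caller has to pass the key in network order as well, e.g.
find_lid_index(lids, n, htons(256)); a host-order key may silently miss.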


Re: umad_send with service level higher than 0 does not work

2012-12-16 Thread Jens Domke
Hello Hal,

I have checked the smpquery and saquery command today.

The smpquery SL2VL and PI commands for the opensm port work fine, and I get the 
expected results:
==
# SL2VL table: Lid 19
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
==
# Port info: Lid 19 port 0
Mkey:not displayed
GidPrefix:...0xfe80
Lid:.19
SMLid:...19
CapMask:.0x251086a
IsSM
IsTrapSupported
IsAutomaticMigrationSupported
IsSLMappingSupported
IsSystemImageGUIDsupported
IsCommunicatonManagementSupported
IsVendorClassSupported
IsCapabilityMaskNoticeSupported
IsClientRegistrationSupported
DiagCode:0x
MkeyLeasePeriod:.0
LocalPort:...1
LinkWidthEnabled:1X or 4X
LinkWidthSupported:..1X or 4X
LinkWidthActive:.4X
LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps
LinkState:...Active
PhysLinkState:...LinkUp
LinkDownDefState:Polling
ProtectBits:.0
LMC:.0
LinkSpeedActive:.5.0 Gbps
LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps
NeighborMTU:.2048
SMSL:0
VLCap:...VL0-7
InitType:0x00
VLHighLimit:.0
VLArbHighCap:8
VLArbLowCap:.8
InitReply:...0x00
MtuCap:..2048
VLStallCount:0
HoqLife:.31
OperVLs:.VL0-7
PartEnforceInb:..0
PartEnforceOutb:.0
FilterRawInb:0
FilterRawOutb:...0
MkeyViolations:..0
PkeyViolations:..0
QkeyViolations:..0
GuidCap:.32
ClientReregister:0
McastPkeyTrapSuppressionEnabled:.0
SubnetTimeout:...18
RespTimeVal:.16
LocalPhysErr:8
OverrunErr:..8
MaxCreditHint:...0
RoundTrip:...0
CapabilityMask2:.0x
LinkSpeedExtActive:..No Extended Speed
LinkSpeedExtSupported:...0
LinkSpeedExtEnabled:.0
==


The problem is the saquery commands on other nodes.
In most cases the execution fails, and the node shows the same behaviour as 
the OpenSM node when it tries to send on SL>0. The PathRequest packet does not 
arrive at the node with the running OpenSM (checked with ibdump). At some point 
during the execution the saquery binary hangs, the kernel log indicates errors 
and the only option is to reboot. 
This is the output I see for the saquery:
==
saquery -P --src-to-dst 4:8
ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out

Query SA failed: Connection timed out
==
(In really rare cases I get the PathRequest back and see the dump, but the 
saquery binary stalls afterwards, too.)


I did some debugging with gdb again, and stepped through the saquery code.
When I change the SL to 0 in the addr vector of the MAD right before umad_send 
is called, then everything works.
So, the saquery on the compute nodes shows the same behaviour as the opensm 
with respect to the SL value for umad_send.


Finally, I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in 
the config file of opensm.
Sadly, this configuration results in the same crashes of the saquery commands.
For the runs with MinHop I also used a different SL2VL mapping, just to be 
sure that there is no problem with VL>0, so every SL travels on VL=0:
==
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
==
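
To make the two SL2VL tables quoted in this mail concrete, here is a small
lookup sketch. The array contents are copied from the two dumps; the names
sl2vl_lid19, sl2vl_all_vl0 and sl_to_vl are hypothetical and only serve the
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLS 16

/* Table reported for Lid 19 (in 0, out 0): SL n maps to VL n mod 8. */
static const uint8_t sl2vl_lid19[NUM_SLS] = {
	0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
};

/* Table used for the MinHop runs: every SL travels on VL 0. */
static const uint8_t sl2vl_all_vl0[NUM_SLS] = { 0 };

/* Look up the virtual lane a given service level is mapped to. */
static uint8_t sl_to_vl(const uint8_t table[NUM_SLS], uint8_t sl)
{
	assert(sl < NUM_SLS);
	return table[sl];
}
```

With the first table SL 9 ends up on VL 1, with the second table on VL 0.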


Regards,
Jens


On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:

 
 On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
 
 On 12/16/2012 8:39 AM, Jens Domke wrote:
 Hi,
 
 On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
 
 Hi,
 
 On 12/16/2012 7:03 AM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012

umad_send with service level higher than 0 does not work

2012-12-14 Thread Jens Domke
 information, or if I can test something to 
give you more insight.

Thank you in advance,
Jens


Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j...@m.titech.ac.jp




Re: umad_send with service level higher than 0 does not work

2012-12-14 Thread Jens Domke
 stuck, and have no idea if there is an error in the kernel 
 driver, the HCA firmware or something completely different. Or if umad_send 
 basically does not support SL>0.
 A workaround for the moment is to set the SL in the umad_set_addr_net(...) 
 call to 0.
 
 So SL 0 works between all nodes and SA for querying/responses. Wonder if
 that's how SMSL is set by DFSSSP.
No, the SMSL set by DFSSSP is different from 0, I have checked this. In our 
case (OpenSM running on a compute node), it sets the same SL, which is used for 
MPI-MPI traffic, to ensure deadlock freedom.

Regards
Jens


Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j...@m.titech.ac.jp




Re: umad_send with service level higher than 0 does not work

2012-12-14 Thread Jens Domke
Hello Hal,

On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:

 Hi again,
 
 On 12/14/2012 10:17 AM, Jens Domke wrote:
 Hello Hal,
 
 thank you for the fast response. I will try to clarify some points.
 
 d) OpenMPI runs are executed with --mca 
 btl_openib_ib_path_record_service_level 1
 
 I'm not familiar with what DFSSSP does to figure out SLs exactly but
 there should be no need to set this. The proper SL for querying the SA
 for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
 (and other QoS based routing algorithms), it calculates that and the SM
 pushes this into each port. That should be used. It's possible that SL1
 is not a valid SL for port - SA querying using DFSSSP.
 The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
 specify the SL for querying the PathRecords.
 It just enables the functionality. And the ompi processes use the 
 PortInfo.SMSL to send the request.
 For the request port - SA every 0<=SL<=7 was used in the test, and the SA 
 received the requests.  
 
 e) kernel 2.6.32-220.13.1.el6.x86_64
 
 As far as I understand the whole system:
 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to 
 the OpenSM
 2. the SA receives the request on QP1
 
 There is the SL in the query itself. This should be the SMSL that the SM
 set for that port.
 Hmm, there you might have a point. I think I saw that the query itself had 
 SL=0 specified.
 In fact OpenMPI sets everything to 0 except for slid and dlid.
 
 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a 
 special service level for the slid/dlid path
 
 This is a (potentially) different SL (for MPI-MPI port communication)
 than the one the query used and is the one returned inside the
 PathRecord attribute/data.
 Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
 running on a port which is also used for MPI comm.
 
 With DFSSSP are all SLs same from source port to get to any destination ?
No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == 
SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
 
 
 4. SA sends the PathRecord back to the OMPI process via umad_send in 
 libvendor/osm_vendor_ibumad.c
 
 By the response reversibility rule, I think this is returned on the SL
 of the original query but haven't verified this in the code base yet.
 Ok, I was not aware of that rule. But if this is true, then the SA should 
 also be able to send via SL>0.
 
 I double checked and indeed the SA response does use the SL that the
 incoming request was received on.
 
 
 The osm_vendor_send() function builds the MAD packet with the following 
 attributes:
   /* GS classes */
   umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
 p_mad_addr->addr_type.gsi.remote_qp,
 p_mad_addr->addr_type.gsi.service_level,
 IB_QP1_WELL_KNOWN_Q_KEY);
 So, the SL is the same as the one which was used by the OMPI process. 
 The Q_Key matches the Q_Key on the OMPI process, and remote_qp and 
 dest_lid are correct, too.
 Afterwards umad_send(…) is used to send the reply with the PathRecord, and 
 this send does not work (except for SL=0).
 
 By not working, what do you mean ? Do you mean it's not received at the
 requester with no message in the OpenSM log or not received at the
 OpenSM or something else ? It could be due to the wrong SL being used in
 the original request (forcing it to SL 1). That could cause it not to be
 received at the SM or the response not to make it back to the requester
 from the SA if the SL used is not reversible.
 By not working I mean, that the MPI process does not receive any response 
 from the SA.
 I get messages from the MPI process like the following:
 [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info]
  No response from SA after 20 retries
 The log of OpenSM shows that the SA received the PathRequest query, dumps 
 the query into the log, and sends the reply back.
 And I think I saw some messages in the log about …1 outstanding MAD….
 
 If I look into the MAD before it is sent, then it looks like this:
 Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, 
 timeout_ms=0, retries=3)
   at src/umad.c:791
 791 if (umaddebug > 1)
 (gdb) p *mad
 $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, 
 addr = {qpn = 1325427712, qkey = 384, 
   lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', 
 gid_index = 0 '\000', 
   hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 
 times>, flow_label = 0, 
   pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 
 0x7fffe8012530 "\002"}
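
The qpn, qkey and lid values in the dump above look odd because the address
fields are kept in network byte order while gdb prints them as host integers.
The following sketch decodes them, assuming the dump was taken on a
little-endian host (the thread mentions an x86_64 kernel). The helper names
are hypothetical and not part of libibumad.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8) | ((v & 0xff000000u) >> 24);
}

/* Reverse the byte order of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Address fields as plain numbers after undoing the byte order. */
struct decoded_addr {
	uint32_t qpn;
	uint32_t qkey;
	uint16_t lid;
};

/* Decode fields that gdb printed on a little-endian host. */
static struct decoded_addr decode_le_dump(uint32_t qpn, uint32_t qkey,
					  uint16_t lid)
{
	struct decoded_addr out;

	out.qpn = swap32(qpn);
	out.qkey = swap32(qkey);
	out.lid = swap16(lid);
	/* A queue pair number is only 24 bits wide. */
	assert((out.qpn & 0xff000000u) == 0);
	return out;
}
```

Applied to the dump, decode_le_dump(1325427712, 384, 4096) yields lid 16,
qkey 0x80010000 (the well-known Q_Key of QP1) and qpn 0x6c004f.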
 
 Is this the PathRecord query on the OpenMPI side or the response on the
 OpenSM side ? SL is 6 rather than 1 here.
 This is the response on the OpenSM side (inside the umad_send function, 
 right before it is written to the device with write(fd, …)).
 SL=6

Re: umad_send with service level higher than 0 does not work

2012-12-14 Thread Jens Domke
Hello Hal,

On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:

 Hi,
 
 On 12/14/2012 1:24 PM, Jens Domke wrote:
 Hello Hal,
 
 On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
 
 Hi again,
 
 On 12/14/2012 10:17 AM, Jens Domke wrote:
 Hello Hal,
 
 thank you for the fast response. I will try to clarify some points.
 
 d) OpenMPI runs are executed with --mca 
 btl_openib_ib_path_record_service_level 1
 
 I'm not familiar with what DFSSSP does to figure out SLs exactly but
 there should be no need to set this. The proper SL for querying the SA
 for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
 (and other QoS based routing algorithms), it calculates that and the SM
 pushes this into each port. That should be used. It's possible that SL1
 is not a valid SL for port - SA querying using DFSSSP.
 The OpenMPI parameter btl_openib_ib_path_record_service_level does not 
 specify the SL for querying the PathRecords.
 It just enables the functionality. And the ompi processes use the 
 PortInfo.SMSL to send the request.
 For the request port - SA every 0<=SL<=7 was used in the test, and the 
 SA received the requests.  
 
 e) kernel 2.6.32-220.13.1.el6.x86_64
 
 As far as I understand the whole system:
 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) 
 to the OpenSM
 2. the SA receives the request on QP1
 
 There is the SL in the query itself. This should be the SMSL that the SM
 set for that port.
 Hmm, there you might have a point. I think I saw that the query itself had 
 SL=0 specified.
 In fact OpenMPI sets everything to 0 except for slid and dlid.
 
 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about 
 a special service level for the slid/dlid path
 
 This is a (potentially) different SL (for MPI-MPI port communication)
 than the one the query used and is the one returned inside the
 PathRecord attribute/data.
 Yes, it can be different, but DFSSSP sets the same SL, because the SM is 
 running on a port which is also used for MPI comm.
 
 With DFSSSP are all SLs same from source port to get to any destination ?
 No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == 
 SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
 
 If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
True. But I don't think that the SA asks the DFSSSP routing about the SL for 
the reversible path.
So, the SA could use any SL which is a valid SL, even if DFSSSP would 
recommend another SL.

I just read the IB spec, and it says that the SL specified in the received 
packet is used as the SL in the response packet for MAD packets.
So it's most likely that there is a mismatch between the way OMPI sets up 
the PathRequest and the way the SA builds the response packet.
OMPI always specifies SL=0 (let's say SL_a) inside of the PathRequest packet, 
and sends the packet on SL_b (PortInfo.SMSL).
The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the 
response.
If SL_b is not 0, then the packet can't reach the OMPI process. Right?

If I analyse this correctly, then there are two bugs. The first is in OMPI: 
it does not specify the SL within the PathRequest in an appropriate way (which 
would be an SL suggested by DFSSSP for the reversible path). The second bug 
is that the SA uses the SL on which the PathRequest packet was sent, and not 
the SL specified within the packet.
What do you think?
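
The suspected mismatch can be modelled in a few lines. This is only a sketch
of the reasoning in this mail; struct path_request and both helper functions
are hypothetical and do not correspond to OpenSM or Open MPI code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of an incoming PathRecord request. */
struct path_request {
	uint8_t transport_sl;	/* SL the MAD actually travelled on (SL_b) */
	uint8_t payload_sl;	/* SL field inside the PathRecord query (SL_a) */
};

/* Reversibility rule as discussed: reply on the SL the request arrived on. */
static uint8_t response_sl(const struct path_request *req)
{
	assert(req != NULL);
	return req->transport_sl;
}

/* Nonzero when SL_a and SL_b differ, i.e. the mismatch suspected here. */
static int sl_mismatch(const struct path_request *req)
{
	assert(req != NULL);
	return req->transport_sl != req->payload_sl;
}
```

For a request sent on SL 6 with SL 0 in the payload, response_sl() gives 6
and sl_mismatch() reports a difference.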

I can try to change the PathRequest of OMPI tomorrow, so that it matches 
addr_type.gsi.service_level.
Maybe with this change the packets of the SA will reach the OMPI process on 
an SL>0.
 
 
 
 4. SA sends the PathRecord back to the OMPI process via umad_send in 
 libvendor/osm_vendor_ibumad.c
 
 By the response reversibility rule, I think this is returned on the SL
 of the original query but haven't verified this in the code base yet.
 Ok, I was not aware of that rule. But if this is true, then the SA should 
 also be able to send via SL>0.
 
 I double checked and indeed the SA response does use the SL that the
 incoming request was received on.
 
 
 The osm_vendor_send() function builds the MAD packet with the following 
 attributes:
  /* GS classes */
  umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
p_mad_addr->addr_type.gsi.remote_qp,
p_mad_addr->addr_type.gsi.service_level,
IB_QP1_WELL_KNOWN_Q_KEY);
 So, the SL is the same as the one which was used by the OMPI process. 
 The Q_Key matches the Q_Key on the OMPI process, and remote_qp and 
 dest_lid are correct, too.
 Afterwards umad_send(…) is used to send the reply with the PathRecord, 
 and this send does not work (except for SL=0).
 
 By not working, what do you mean ? Do you mean it's not received at the
 requester with no message in the OpenSM log or not received at the
 OpenSM or something else ? It could be due to the wrong SL being used in
 the original request (forcing it to SL 1