Re: [opensm] routing segfault + LMC 0 routing bug?

2011-04-17 Thread Alex Netes
On 18:23 Tue 22 Mar , Albert Chu wrote:
 Hey Jim, Alex,
 
 Just hit a segfault on the main tree.  It appears patch 
 
 commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
 Author: Jim Schutt jasc...@sandia.gov
 Date:   Fri Sep 3 10:43:12 2010 -0600
 
 opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
 osm_port_t:priv.
 
 segfaults opensm on one of our systems w/ updn routing and lmc  0
 (would likely segfault dor, minhop, and maybe others too).  Our system
 has older switches that do not support enhanced port zero, thus do not
 support LMC  0.  (I imagine setting lmc_esp0 to FALSE, results in the
 same behavior.)  Subsequently even if you set LMC  0 in your opensm
 config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
 HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
 to NULL and some will not.  Because of assumptions in osm_switch.c about
 priv != NULL when lmc  0, we hit a segfault.  The issue didn't exist
 before b/c we allocated p_port-priv non-NULL no matter what.
 
 The attached patch fixes the problem w/ updn.  I haven't looked through
 all of the 2Qos code thoroughly to figure out the consequences of this
 change, so I'm just considering this a starting point for discussion.
 

Applied, thanks.

-- Alex
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [opensm] routing segfault + LMC 0 routing bug?

2011-03-23 Thread Hal Rosenstock
On Tue, Mar 22, 2011 at 9:23 PM, Albert Chu ch...@llnl.gov wrote:
 Hey Jim, Alex,

 Just hit a segfault on the main tree.  It appears patch

 commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
 Author: Jim Schutt jasc...@sandia.gov
 Date:   Fri Sep 3 10:43:12 2010 -0600

 opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
 osm_port_t:priv.

 segfaults opensm on one of our systems w/ updn routing and lmc  0
 (would likely segfault dor, minhop, and maybe others too).  Our system
 has older switches that do not support enhanced port zero, thus do not
 support LMC  0.  (I imagine setting lmc_esp0 to FALSE, results in the
 same behavior.)  Subsequently even if you set LMC  0 in your opensm
 config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
 HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
 to NULL and some will not.  Because of assumptions in osm_switch.c about
 priv != NULL when lmc  0, we hit a segfault.  The issue didn't exist
 before b/c we allocated p_port-priv non-NULL no matter what.

 The attached patch fixes the problem w/ updn.  I haven't looked through
 all of the 2Qos code thoroughly to figure out the consequences of this
 change, so I'm just considering this a starting point for discussion.

 In addition, with the possibility that SP0 ports will be LMC = 0, this
 code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.

 lids_per_port = 1  p_mgr-p_subn-opt.lmc;
 for (i = 0; i  lids_per_port; i++) {
     cl_qlist_t *list = p_mgr-port_order_list;
     cl_list_item_t *item;
     for (item = cl_qlist_head(list); item != cl_qlist_end(list);
          item = cl_qlist_next(item)) {
          osm_port_t *port = cl_item_obj(item, port, list_item);
          ucast_mgr_process_port(p_mgr, p_sw, port, i);
     }
 }

 It iterates over all ports with the configured LMC, not the LMC of the
 port?

Yes, base SP0 is always LMC 0 and either LMC of port or perhaps 0 when
base SP0 and configured LMC otherwise (assuming it's an endport) could
be used for such loops. There used to be cases that used the latter
approach. I'm not sure which is more appropriate now.

-- Hal

 I haven't thought about this too deeply or investigated deeply,
 so consider this another starting point for discussion.

 Al

 --
 Albert Chu
 ch...@llnl.gov
 Computer Scientist
 High Performance Systems Division
 Lawrence Livermore National Laboratory

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [opensm] routing segfault + LMC 0 routing bug?

2011-03-23 Thread Jim Schutt

Hi Al,

Albert Chu wrote:

Hey Jim, Alex,

Just hit a segfault on the main tree.  It appears patch 


commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
Author: Jim Schutt jasc...@sandia.gov
Date:   Fri Sep 3 10:43:12 2010 -0600

opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
osm_port_t:priv.

segfaults opensm on one of our systems w/ updn routing and lmc  0
(would likely segfault dor, minhop, and maybe others too).  Our system
has older switches that do not support enhanced port zero, thus do not
support LMC  0.  (I imagine setting lmc_esp0 to FALSE, results in the
same behavior.)  Subsequently even if you set LMC  0 in your opensm
config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
to NULL and some will not.  Because of assumptions in osm_switch.c about
priv != NULL when lmc  0, we hit a segfault.  The issue didn't exist
before b/c we allocated p_port-priv non-NULL no matter what.


OK, I think I see.  But this segfault can only occur in
the case where LMC is configured  0, right?

The issue is in osm_switch_recommend_path() when
routing_for_lmc is true, but p_port-priv is NULL, right?



The attached patch fixes the problem w/ updn.  I haven't looked through
all of the 2Qos code thoroughly to figure out the consequences of this
change, so I'm just considering this a starting point for discussion.


Torus-2QoS's use of port-priv is unique because it persists
between routing sweeps.  So if another routing engine runs
after torus-2QoS and uses port-priv without having ensured
that it set it itself, there will be trouble.  9ddcf3419ea
was fixing such an issue.

I can find only two calls of osm_switch_recommend_path(),
and both seem to be to do the right thing, so I think
your patch is OK.



In addition, with the possibility that SP0 ports will be LMC = 0, this
code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.

lids_per_port = 1  p_mgr-p_subn-opt.lmc;
for (i = 0; i  lids_per_port; i++) {
 cl_qlist_t *list = p_mgr-port_order_list;
 cl_list_item_t *item;
 for (item = cl_qlist_head(list); item != cl_qlist_end(list);
  item = cl_qlist_next(item)) {
  osm_port_t *port = cl_item_obj(item, port, list_item);
  ucast_mgr_process_port(p_mgr, p_sw, port, i);
 }
}

It iterates over all ports with the configured LMC, not the LMC of the
port?  I haven't thought about this too deeply or investigated deeply,
so consider this another starting point for discussion.


Hmm, looks like ucast_mgr_process_port() DTRT, though;
it ignores lids that aren't in the range configured on
the port?



Al





Subject:
[PATCH] fix segfault corner case w/ updn routing and LMC  0
From:
Albert L.Chu ch...@llnl.gov
Date:
Tue, 22 Mar 2011 17:36:16 -0700


Signed-off-by: Albert L. Chu ch...@llnl.gov


Reviewed-by: Jim Schutt jasc...@sandia.gov

-- Jim


---
 opensm/osm_ucast_mgr.c |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 4019589..211d6e0 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -318,10 +318,6 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
 item = cl_qmap_next(item)) {
port = (osm_port_t *) item;
lmc = ib_port_info_get_lmc(port-p_physp-port_info);
-   if (!lmc) {
-   port-priv = NULL;
-   continue;
-   }
r = malloc(sizeof(*r) + sizeof(r-guids[0]) * (1  lmc));
if (!r) {
OSM_LOG(mgr-p_log, OSM_LOG_ERROR, ERR 3A09: 


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [opensm] routing segfault + LMC 0 routing bug?

2011-03-23 Thread Albert Chu
Hey Jim,

On Wed, 2011-03-23 at 09:01 -0700, Jim Schutt wrote:
 Hi Al,
 
 Albert Chu wrote:
  Hey Jim, Alex,
  
  Just hit a segfault on the main tree.  It appears patch 
  
  commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
  Author: Jim Schutt jasc...@sandia.gov
  Date:   Fri Sep 3 10:43:12 2010 -0600
  
  opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
  osm_port_t:priv.
  
  segfaults opensm on one of our systems w/ updn routing and lmc  0
  (would likely segfault dor, minhop, and maybe others too).  Our system
  has older switches that do not support enhanced port zero, thus do not
  support LMC  0.  (I imagine setting lmc_esp0 to FALSE, results in the
  same behavior.)  Subsequently even if you set LMC  0 in your opensm
  config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
  HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
  to NULL and some will not.  Because of assumptions in osm_switch.c about
  priv != NULL when lmc  0, we hit a segfault.  The issue didn't exist
  before b/c we allocated p_port-priv non-NULL no matter what.
 
 OK, I think I see.  But this segfault can only occur in
 the case where LMC is configured  0, right?

 The issue is in osm_switch_recommend_path() when
 routing_for_lmc is true, but p_port-priv is NULL, right?

Yup.

  
  The attached patch fixes the problem w/ updn.  I haven't looked through
  all of the 2Qos code thoroughly to figure out the consequences of this
  change, so I'm just considering this a starting point for discussion.
 
 Torus-2QoS's use of port-priv is unique because it persists
 between routing sweeps.  So if another routing engine runs
 after torus-2QoS and uses port-priv without having ensured
 that it set it itself, there will be trouble.  9ddcf3419ea
 was fixing such an issue.
 
 I can find only two calls of osm_switch_recommend_path(),
 and both seem to be to do the right thing, so I think
 your patch is OK.

Sounds good.  When reading over your comments about the 2Qos patches
that affected this area, I wasn't quite sure how you were dealing with
the p_port-priv, so I was unsure how my patch would affect things.

  
  In addition, with the possibility that SP0 ports will be LMC = 0, this
  code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.
  
  lids_per_port = 1  p_mgr-p_subn-opt.lmc;
  for (i = 0; i  lids_per_port; i++) {
   cl_qlist_t *list = p_mgr-port_order_list;
   cl_list_item_t *item;
   for (item = cl_qlist_head(list); item != cl_qlist_end(list);
item = cl_qlist_next(item)) {
osm_port_t *port = cl_item_obj(item, port, list_item);
ucast_mgr_process_port(p_mgr, p_sw, port, i);
   }
  }
  
  It iterates over all ports with the configured LMC, not the LMC of the
  port?  I haven't thought about this too deeply or investigated deeply,
  so consider this another starting point for discussion.
 
 Hmm, looks like ucast_mgr_process_port() DTRT, though;
 it ignores lids that aren't in the range configured on
 the port?

Ahh, I think you're right.  It does appear to do the right thing there.
I don't think it's a problem afterall.

Al

  
  Al
  
  
  
  
  
  Subject:
  [PATCH] fix segfault corner case w/ updn routing and LMC  0
  From:
  Albert L.Chu ch...@llnl.gov
  Date:
  Tue, 22 Mar 2011 17:36:16 -0700
  
  
  Signed-off-by: Albert L. Chu ch...@llnl.gov
 
 Reviewed-by: Jim Schutt jasc...@sandia.gov
 
 -- Jim
 
  ---
   opensm/osm_ucast_mgr.c |4 
   1 files changed, 0 insertions(+), 4 deletions(-)
  
  diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
  index 4019589..211d6e0 100644
  --- a/opensm/osm_ucast_mgr.c
  +++ b/opensm/osm_ucast_mgr.c
  @@ -318,10 +318,6 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
   item = cl_qmap_next(item)) {
  port = (osm_port_t *) item;
  lmc = ib_port_info_get_lmc(port-p_physp-port_info);
  -   if (!lmc) {
  -   port-priv = NULL;
  -   continue;
  -   }
  r = malloc(sizeof(*r) + sizeof(r-guids[0]) * (1  lmc));
  if (!r) {
  OSM_LOG(mgr-p_log, OSM_LOG_ERROR, ERR 3A09: 
 
-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [opensm] routing segfault + LMC 0 routing bug?

2011-03-23 Thread Jim Schutt

Albert Chu wrote:

Hey Jim,

On Wed, 2011-03-23 at 09:01 -0700, Jim Schutt wrote:

Hi Al,




Torus-2QoS's use of port-priv is unique because it persists
between routing sweeps.  So if another routing engine runs
after torus-2QoS and uses port-priv without having ensured
that it set it itself, there will be trouble.  9ddcf3419ea
was fixing such an issue.

I can find only two calls of osm_switch_recommend_path(),
and both seem to be to do the right thing, so I think
your patch is OK.


Sounds good.  When reading over your comments about the 2Qos patches
that affected this area, I wasn't quite sure how you were dealing with
the p_port-priv, so I was unsure how my patch would affect things.



There's some comments at the top of osm_torus.c,
in the definitions of struct endpoint and struct t_switch,
that describe the rules for how torus-2QoS code can
safely use -priv.

They may shed some extra light...

-- Jim

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[opensm] routing segfault + LMC 0 routing bug?

2011-03-22 Thread Albert Chu
Hey Jim, Alex,

Just hit a segfault on the main tree.  It appears patch 

commit 9ddcf3419eade13bdc0a54f93930c49fe67efd63
Author: Jim Schutt jasc...@sandia.gov
Date:   Fri Sep 3 10:43:12 2010 -0600

opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of
osm_port_t:priv.

segfaults opensm on one of our systems w/ updn routing and lmc  0
(would likely segfault dor, minhop, and maybe others too).  Our system
has older switches that do not support enhanced port zero, thus do not
support LMC  0.  (I imagine setting lmc_esp0 to FALSE, results in the
same behavior.)  Subsequently even if you set LMC  0 in your opensm
config file, there can be ports with LMC = 0 and LMC != 0 (e.g. from
HCAs). Subsequently in alloc_ports_priv(), some ports will have priv set
to NULL and some will not.  Because of assumptions in osm_switch.c about
priv != NULL when lmc  0, we hit a segfault.  The issue didn't exist
before b/c we allocated p_port-priv non-NULL no matter what.

The attached patch fixes the problem w/ updn.  I haven't looked through
all of the 2Qos code thoroughly to figure out the consequences of this
change, so I'm just considering this a starting point for discussion.

In addition, with the possibility that SP0 ports will be LMC = 0, this
code in osm_ucast_mgr.c ucast_mgr_process_tbl() does not look good.

lids_per_port = 1  p_mgr-p_subn-opt.lmc;
for (i = 0; i  lids_per_port; i++) {
 cl_qlist_t *list = p_mgr-port_order_list;
 cl_list_item_t *item;
 for (item = cl_qlist_head(list); item != cl_qlist_end(list);
  item = cl_qlist_next(item)) {
  osm_port_t *port = cl_item_obj(item, port, list_item);
  ucast_mgr_process_port(p_mgr, p_sw, port, i);
 }
}

It iterates over all ports with the configured LMC, not the LMC of the
port?  I haven't thought about this too deeply or investigated deeply,
so consider this another starting point for discussion.

Al

-- 
Albert Chu
ch...@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
---BeginMessage---

Signed-off-by: Albert L. Chu ch...@llnl.gov
---
 opensm/osm_ucast_mgr.c |4 
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 4019589..211d6e0 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -318,10 +318,6 @@ static void alloc_ports_priv(osm_ucast_mgr_t * mgr)
 item = cl_qmap_next(item)) {
port = (osm_port_t *) item;
lmc = ib_port_info_get_lmc(port-p_physp-port_info);
-   if (!lmc) {
-   port-priv = NULL;
-   continue;
-   }
r = malloc(sizeof(*r) + sizeof(r-guids[0]) * (1  lmc));
if (!r) {
OSM_LOG(mgr-p_log, OSM_LOG_ERROR, ERR 3A09: 
-- 
1.5.4.5

---End Message---