RE: rdma_create_qp() and max_send_wr

2011-04-22 Thread Yann Droneaud
Hi,

On Thursday, 21 April 2011 at 11:53 -0700, c...@asomi.com wrote:
> An ENOMEM return does not mean that the subsystem *just* failed to
> allocate system memory.
>
> The memory that could not be allocated could be device memory.

I'm also having some difficulties with system memory allocation.

In my test, a user is allowed to lock 4 MBytes of memory, but not all of
this memory is available to ibv_reg_mr(), since ibv_create_cq() and
ibv_create_qp()/rdma_create_qp() also lock memory for the CQ and the QP
respectively. The question is: how much memory is needed for the CQ and
QP queues?

In my case, the maximum message size is 4 MBytes - 20 KBytes, for CQ and
QP (half duplex) queues of length 1.

Using a message size of 128 bytes or less hits the QP WR limit of 16351
entries.

When using messages of 256 bytes, I'm only able to register 2609152
bytes, and the CQ and QP (half duplex) queues are then 10192 entries
long. So they seem to require about 1585152 bytes. Taking into account a
fixed amount of reserved memory of 20 KBytes, this gives about 154 bytes
per (CQ + QP (half duplex)) entry.

When doing the same math with sizes 512 and 1024, the size of a (CQ + QP
(half duplex)) entry goes down.

msg  512 memory 3395584 length 6632
msg 1024 memory 3788800 length 3700
msg 2048 memory 3985408 length 1946

Note that the memory used for the messages is allocated as one big
aligned chunk and registered as a whole, then sliced to be posted in WRs.

But the memory required for the CQ and QP elements (and others) is also
subject to alignment to the page size.
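
The per-entry estimate in this message can be reproduced with a short
calculation. This is only a sketch of the arithmetic: it assumes the
4 MBytes limit is exactly 4194304 bytes and the fixed reserve is exactly
20480 bytes, it ignores page alignment, and the function name is made up
for illustration.

```python
# Rough per-entry cost of (CQ + half-duplex QP), from the figures in this mail.
# Assumptions: limit is 4 MiB, fixed reserve is 20 KiB, page rounding is ignored.
LOCK_LIMIT = 4 * 1024 * 1024
FIXED_RESERVE = 20 * 1024


def bytes_per_entry(registered_bytes, queue_length):
    """Locked bytes not available to ibv_reg_mr(), divided by the queue length."""
    queue_bytes = LOCK_LIMIT - registered_bytes - FIXED_RESERVE
    return queue_bytes / queue_length


# (message size, bytes registered, queue length) as reported in this mail
measurements = [
    (256, 2609152, 10192),
    (512, 3395584, 6632),
    (1024, 3788800, 3700),
    (2048, 3985408, 1946),
]

for msg_size, registered, length in measurements:
    print(f"msg {msg_size:4d}: {bytes_per_entry(registered, length):6.1f} bytes/entry")
```

With these inputs the 256-byte case comes out near 154 bytes per entry,
and the estimate shrinks as the message size grows.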

At least I know that CQ / QP overhead is not going to hurt users if
they are given modern memory limits, let's say 1 GByte ;)

Regards.

-- 
Yann Droneaud
OPTEYA



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: rdma_create_qp() and max_send_wr

2011-04-22 Thread Yann Droneaud
And I forgot to mention:

On Friday, 22 April 2011 at 12:20 +0200, Yann Droneaud wrote:
> I'm also having some difficulties with system memory allocation.

In this failure case, strace shows the last write() syscall returning
ENOMEM.

Regards.

-- 
Yann Droneaud
OPTEYA




opensm: file routing engine

2011-04-22 Thread Paul Monday (Parallel Scientific)
I've been toying with the file routing engine implementation for some
work I'm doing, but I'm finding very little documentation on it.  I only
have one switch to experiment with at the moment, so it is not obvious
how some of the information in the generated lid / lfts files extends to
a multiple-switch environment.  Perhaps there is a document around, since
I'm an RTFM type of person?


At any rate, here's what I've gathered with 4. being the big question.

1. The easiest way to get started with the file routing engine is to 
generate the lid / lfts using a different routing engine.  I went ahead 
and did the following:  opensm -D 0x40 -R ftree
2. Once run, copy the /var/log/opensm-lfts.dump and 
/var/log/opensm-lid-matrix.dump files elsewhere for use

3. I've tried to generalize the file contents below
4. Modify the opensm-lid-matrix.dump file to implement or tweak the 
routing algorithm over the physical network?

5. Run opensm -R file -M new-lid-matrix.dump -U new-lfts.dump

I have one other strange question ... is it possible to carve a single 
physical switch into two logical switches (put a cable between ports 
16/17 and modify the routing tables ... this seems like it wouldn't work 
as the Unicast LID / Switch: guid rows in the respective files below 
serve as keys so the single switch would be identified twice).


The file formats seem to be:

opensm-lfts.dump (later becomes -U [file])
- Contains all discovered ports (powered on), their function (Switch vs. 
Channel Adapter), their LID and some extra information.  This is 
essentially the physical network (if all machines are powered on) ... 
the format is:

Unicast lids [0-x] of switch Lid LID# guid GUID ('switch description'):
LID 0x SwitchPort ZZZ # Channel Adapter | Switch portguid GUID: 'Description'


I assume this file grows with all of the Channel Adapters and switches.  
Given a switch-switch connection a row would look like

0x0019 005 # Switch portguid 0x003 'MF3:switch-my:MTS3600/U1'

You could essentially use this file to map the entire physical network, 
you would end up with a graph ... but no information for how to traverse 
it efficiently, does that sound right?


opensm-lid-matrix.dump
- Looks like it contains the hop information ... but it's a bit more 
cryptic since I have only one switch :(  It should contain a list of all 
switches, the LID for the switch and then hop information.  The hop 
information is what I'm a bit puzzled about here, as well as what port 
guid information is tacked on.  The format of the file is:

Switch: guid 0xx
LID 0x 00 ff ff hops for all ports # portguid 0x000

I know ... it's a detailed question but I figured I would write enough 
so someone else wouldn't have to reverse engineer using the file routing 
engine if this is basically right.


Paul Monday
Parallel Scientific, LLC



[PATCH 1/2] RDMA/cxgb4: Reset wait condition atomically.

2011-04-22 Thread Steve Wise
The driver was never really waiting for RDMA_WR/FINI completions because
the condition variable used to determine if the completion happened was
never reset, and this condition variable is reused for both connection
setup and teardown.  This causes various driver crashes under heavy
loads due to releasing resources too early.

The fix is to use atomic bits to correctly reset the condition immediately
after the completion is detected.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb4/cm.c   |   30 +++---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   26 +++---
 2 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index d235810..d7ee70f 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1198,9 +1198,7 @@ static int pass_open_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 	}
 	PDBG("%s ep %p status %d error %d\n", __func__, ep,
 	     rpl->status, status2errno(rpl->status));
-	ep->com.wr_wait.ret = status2errno(rpl->status);
-	ep->com.wr_wait.done = 1;
-	wake_up(&ep->com.wr_wait.wait);
+	c4iw_wake_up(&ep->com.wr_wait, status2errno(rpl->status));
 
 	return 0;
 }
@@ -1234,9 +1232,7 @@ static int close_listsrv_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 	struct c4iw_listen_ep *ep = lookup_stid(t, stid);
 
 	PDBG("%s ep %p\n", __func__, ep);
-	ep->com.wr_wait.ret = status2errno(rpl->status);
-	ep->com.wr_wait.done = 1;
-	wake_up(&ep->com.wr_wait.wait);
+	c4iw_wake_up(&ep->com.wr_wait, status2errno(rpl->status));
 	return 0;
 }
 
@@ -1492,17 +1488,13 @@ static int peer_close(struct c4iw_dev *dev, struct sk_buff *skb)
 		 * in rdma connection migration (see c4iw_accept_cr()).
 		 */
 		__state_set(&ep->com, CLOSING);
-		ep->com.wr_wait.done = 1;
-		ep->com.wr_wait.ret = -ECONNRESET;
 		PDBG("waking up ep %p tid %u\n", ep, ep->hwtid);
-		wake_up(&ep->com.wr_wait.wait);
+		c4iw_wake_up(&ep->com.wr_wait, -ECONNRESET);
 		break;
 	case MPA_REP_SENT:
 		__state_set(&ep->com, CLOSING);
-		ep->com.wr_wait.done = 1;
-		ep->com.wr_wait.ret = -ECONNRESET;
 		PDBG("waking up ep %p tid %u\n", ep, ep->hwtid);
-		wake_up(&ep->com.wr_wait.wait);
+		c4iw_wake_up(&ep->com.wr_wait, -ECONNRESET);
 		break;
 	case FPDU_MODE:
 		start_ep_timer(ep);
@@ -1579,9 +1571,7 @@ static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 	/*
 	 * Wake up any threads in rdma_init() or rdma_fini().
 	 */
-	ep->com.wr_wait.done = 1;
-	ep->com.wr_wait.ret = -ECONNRESET;
-	wake_up(&ep->com.wr_wait.wait);
+	c4iw_wake_up(&ep->com.wr_wait, -ECONNRESET);
 
 	mutex_lock(&ep->com.mutex);
 	switch (ep->com.state) {
@@ -2294,14 +2284,8 @@ static int fw6_msg(struct c4iw_dev *dev, struct sk_buff *skb)
 		ret = (int)((be64_to_cpu(rpl->data[0]) >> 8) & 0xff);
 		wr_waitp = (struct c4iw_wr_wait *)(__force unsigned long) rpl->data[1];
 		PDBG("%s wr_waitp %p ret %u\n", __func__, wr_waitp, ret);
-		if (wr_waitp) {
-			if (ret)
-				wr_waitp->ret = -ret;
-			else
-				wr_waitp->ret = 0;
-			wr_waitp->done = 1;
-			wake_up(&wr_waitp->wait);
-		}
+		if (wr_waitp)
+			c4iw_wake_up(wr_waitp, ret ? -ret : 0);
 		kfree_skb(skb);
 		break;
 	case 2:
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 8e16eb2..3dcfe82 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -131,42 +131,54 @@ static inline int c4iw_num_stags(struct c4iw_rdev *rdev)
 
 #define C4IW_WR_TO (10*HZ)
 
+enum {
+	REPLY_READY = 0,
+};
+
 struct c4iw_wr_wait {
 	wait_queue_head_t wait;
-	int done;
+	unsigned long status;
 	int ret;
 };
 
 static inline void c4iw_init_wr_wait(struct c4iw_wr_wait *wr_waitp)
 {
 	wr_waitp->ret = 0;
-	wr_waitp->done = 0;
+	wr_waitp->status = 0;
 	init_waitqueue_head(&wr_waitp->wait);
 }
 
+static inline void c4iw_wake_up(struct c4iw_wr_wait *wr_waitp, int ret)
+{
+	wr_waitp->ret = ret;
+	set_bit(REPLY_READY, &wr_waitp->status);
+	wake_up(&wr_waitp->wait);
+}
+
 static inline int c4iw_wait_for_reply(struct c4iw_rdev *rdev,
 				 struct c4iw_wr_wait *wr_waitp,
 				 u32 hwtid, u32 qpid,
 				 const char *func)
 {
 	unsigned to = C4IW_WR_TO;
-	do {
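
The pattern described in the commit message of patch 1/2 can be
illustrated outside the kernel. The following is a user-space analogy in
Python, not the driver code: `WrWait`, `wake_up` and `wait_for_reply` are
hypothetical names, and a `threading.Condition` stands in for the kernel
wait queue and the atomic bit operations. It shows why a wait object that
is reused for setup and teardown must have its ready flag cleared when
the completion is consumed.

```python
import errno
import threading


class WrWait:
    """Reusable wait object; loosely mirrors the role of struct c4iw_wr_wait."""

    def __init__(self):
        self._cond = threading.Condition()
        self._ready = False   # plays the role of the REPLY_READY bit
        self.ret = 0

    def wake_up(self, ret):
        # Publish the result and mark the reply as ready in one critical section.
        with self._cond:
            self.ret = ret
            self._ready = True
            self._cond.notify_all()

    def wait_for_reply(self, timeout):
        with self._cond:
            if not self._cond.wait_for(lambda: self._ready, timeout):
                return -errno.ETIMEDOUT
            # Consume the completion: clear the flag before releasing the lock,
            # so the next wait on this object blocks until a new wake_up().
            self._ready = False
            return self.ret


w = WrWait()
w.wake_up(0)
first = w.wait_for_reply(0.05)    # a completion was already posted
second = w.wait_for_reply(0.05)   # the flag was cleared, so this one times out
print(first, second)
```

Without the reset in `wait_for_reply`, the second call would return
immediately with the stale result, which is the failure mode the commit
message describes.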

[PATCH 2/2] RDMA/cxgb4: EEH errors can hang the driver.

2011-04-22 Thread Steve Wise
A few more EEH fixes:

c4iw_wait_for_reply(): detect fatal EEH condition on timeout and return
an error.

The iw_cxgb4 driver was only calling ib_deregister_device()
on an EEH event followed by an ib_register_device() when the
device was reinitialized.  However, the rdma core doesn't allow
multiple iterations of register/deregister by the provider. See
drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs()
where the kobject ref is held until the device is deallocated in
ib_dealloc_device(). Calling deregister adds this kobj reference,
and then a subsequent register call will generate a WARN_ON() from the
kobject subsystem because the kobject is being initialized but is
already initialized with the ref held.

So the provider must deregister and dealloc when resetting for an EEH
event, then alloc/register to re-initialize.  To do this, we cannot
use the device ptr as our ULD handle since it will change with each
reallocation.  This commit adds a uld context struct which is used as the
ULD handle, and then contains the device pointer and other state needed.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb4/device.c   |  111 ++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |6 +-
 drivers/infiniband/hw/cxgb4/provider.c |2 -
 3 files changed, 66 insertions(+), 53 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 8e70953..40a13cc 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -44,7 +44,7 @@ MODULE_DESCRIPTION("Chelsio T4 RDMA Driver");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(DRV_VERSION);
 
-static LIST_HEAD(dev_list);
+static LIST_HEAD(uld_ctx_list);
 static DEFINE_MUTEX(dev_mutex);
 
 static struct dentry *c4iw_debugfs_root;
@@ -370,18 +370,23 @@ static void c4iw_rdev_close(struct c4iw_rdev *rdev)
 	c4iw_destroy_resource(&rdev->resource);
 }
 
-static void c4iw_remove(struct c4iw_dev *dev)
+struct uld_ctx {
+	struct list_head entry;
+	struct cxgb4_lld_info lldi;
+	struct c4iw_dev *dev;
+};
+
+static void c4iw_remove(struct uld_ctx *ctx)
 {
-	PDBG("%s c4iw_dev %p\n", __func__,  dev);
-	list_del(&dev->entry);
-	if (dev->registered)
-		c4iw_unregister_device(dev);
-	c4iw_rdev_close(&dev->rdev);
-	idr_destroy(&dev->cqidr);
-	idr_destroy(&dev->qpidr);
-	idr_destroy(&dev->mmidr);
-	iounmap(dev->rdev.oc_mw_kva);
-	ib_dealloc_device(&dev->ibdev);
+	PDBG("%s c4iw_dev %p\n", __func__,  ctx->dev);
+	c4iw_unregister_device(ctx->dev);
+	c4iw_rdev_close(&ctx->dev->rdev);
+	idr_destroy(&ctx->dev->cqidr);
+	idr_destroy(&ctx->dev->qpidr);
+	idr_destroy(&ctx->dev->mmidr);
+	iounmap(ctx->dev->rdev.oc_mw_kva);
+	ib_dealloc_device(&ctx->dev->ibdev);
+	ctx->dev = NULL;
 }
 
 static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
@@ -402,13 +407,11 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 	devp->rdev.oc_mw_kva = ioremap_wc(devp->rdev.oc_mw_pa,
 					  devp->rdev.lldi.vr->ocq.size);
 
-	printk(KERN_INFO MOD "ocq memory: "
+	PDBG(KERN_INFO MOD "ocq memory: "
 	       "hw_start 0x%x size %u mw_pa 0x%lx mw_kva %p\n",
 	       devp->rdev.lldi.vr->ocq.start, devp->rdev.lldi.vr->ocq.size,
 	       devp->rdev.oc_mw_pa, devp->rdev.oc_mw_kva);
 
-	mutex_lock(&dev_mutex);
-
 	ret = c4iw_rdev_open(&devp->rdev);
 	if (ret) {
 		mutex_unlock(&dev_mutex);
@@ -421,8 +424,6 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 	idr_init(&devp->qpidr);
 	idr_init(&devp->mmidr);
 	spin_lock_init(&devp->lock);
-	list_add_tail(&devp->entry, &dev_list);
-	mutex_unlock(&dev_mutex);
 
 	if (c4iw_debugfs_root) {
 		devp->debugfs_root = debugfs_create_dir(
@@ -435,7 +436,7 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 
 static void *c4iw_uld_add(const struct cxgb4_lld_info *infop)
 {
-	struct c4iw_dev *dev;
+	struct uld_ctx *ctx;
 	static int vers_printed;
 	int i;
 
@@ -443,25 +444,33 @@ static void *c4iw_uld_add(const struct cxgb4_lld_info *infop)
 		printk(KERN_INFO MOD "Chelsio T4 RDMA Driver - version %s\n",
 		       DRV_VERSION);
 
-	dev = c4iw_alloc(infop);
-	if (IS_ERR(dev))
+	ctx = kzalloc(sizeof *ctx, GFP_KERNEL);
+	if (!ctx) {
+		ctx = ERR_PTR(-ENOMEM);
 		goto out;
+	}
+	ctx->lldi = *infop;
 
 	PDBG("%s found device %s nchan %u nrxq %u ntxq %u nports %u\n",
-	     __func__, pci_name(dev->rdev.lldi.pdev),
-	     dev->rdev.lldi.nchan, dev->rdev.lldi.nrxq,
-	     dev->rdev.lldi.ntxq, dev->rdev.lldi.nports);
+	     __func__, pci_name(ctx->lldi.pdev),
+	     ctx->lldi.nchan, ctx->lldi.nrxq,
+  

opensm: switch incorrectly reports IB_PORT_CAP_HAS_MCAST_FDB_TOP ?

2011-04-22 Thread Jim Schutt

Hi,

I've been testing the current opensm development head
(commit 83b67527d16 from git://git.openfabrics.org/~alexnetes/opensm),
and I've been getting some messages that are new since version 3.3.7:

Apr 22 12:08:09 646534 [411CD940] 0x01 - log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x4802
Initial path: 0,1,1,4 Return path: 0,20,1,7

I get one of these messages for each switch in my fabric, on every
heavy sweep.

It appears these are caused by my switches incorrectly reporting
the capability IB_PORT_CAP_HAS_MCAST_FDB_TOP; i.e. this patch stops
the messages:

diff --git a/opensm/osm_mcast_mgr.c b/opensm/osm_mcast_mgr.c
index ea52bfe..63d2968 100644
--- a/opensm/osm_mcast_mgr.c
+++ b/opensm/osm_mcast_mgr.c
@@ -1041,7 +1041,7 @@ static void mcast_mgr_set_mfttop(IN osm_sm_t * sm, IN osm_switch_t * p_sw)
 	p_path = osm_physp_get_dr_path_ptr(p_physp);
 	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
 
-	if (p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
+	if (0 && p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
 		/*
 		   Set the top of the multicast forwarding table.
 		 */

IB_PORT_CAP_HAS_MCAST_FDB_TOP is bit 30 of the port capability mask,
which in at least IBA v1.2.1 was a reserved bit but apparently is
not anymore.

Should I file a bug report with my switch vendor about setting
a port capability bit for a capability they don't support, or
is there something else going on that I haven't figured out yet?

FWIW I think my switches have a base SP0; maybe it's got something
to do with that?

Thanks -- Jim



Re: opensm: file routing engine

2011-04-22 Thread Weiny, Ira K.

On Apr 22, 2011, at 7:41 AM, Paul Monday (Parallel Scientific) wrote:

> I've been toying with the file routing engine implementation for some
> work I'm doing, but I'm finding very little documentation on it.  I only
> have one switch to experiment with at the moment as well so some of the
> information in the lid / lfts files that are generated are not obvious
> for how they expand to a multiple switch environment.  Perhaps there is
> a document around since I'm a RTFM type of person?
>
> At any rate, here's what I've gathered with 4. being the big question.
>
> 1. The easiest way to get started with the file routing engine is to
> generate the lid / lfts using a different routing engine.  I went ahead
> and did the following:  opensm -D 0x40 -R ftree
> 2. Once run, copy the /var/log/opensm-lfts.dump and
> /var/log/opensm-lid-matrix.dump files elsewhere for use
> 3. I've tried to generalize the file contents below
> 4. Modify the opensm-lid-matrix.dump file to implement or tweak the
> routing algorithm over the physical network?
> 5. Run opensm -R file -M new-lid-matrix.dump -U new-lfts.dump

I think this is the general method, yes.

> I have one other strange question ... is it possible to carve a single
> physical switch into two logical switches (put a cable between ports
> 16/17 and modify the routing tables ... this seems like it wouldn't work
> as the Unicast LID / Switch: guid rows in the respective files below
> serve as keys so the single switch would be identified twice).

Not that I am aware of.  When you say you have a single switch, I assume
you mean a switch based on a single switch ASIC?  Like a 24 or 36 port
pizza box switch.

> The file formats seem to be:
>
> opensm-lfts.dump (later becomes -U [file])
> - Contains all discovered ports (powered on), their function (Switch vs.
> Channel Adapter), their LID and some extra information.  This is
> essentially the physical network (if all machines are powered on) ...
> the format is:
> Unicast lids [0-x] of switch Lid LID# guid GUID ('switch description'):
> LID 0x SwitchPort ZZZ # Channel Adapter | Switch portguid GUID: 'Description'
>
> I assume this file grows with all of the Channel Adapters and switches.
> Given a switch-switch connection a row would look like
> 0x0019 005 # Switch portguid 0x003 'MF3:switch-my:MTS3600/U1'

Yes, this file grows with more nodes in the system.  But the line above is
not a connection but rather a linear forwarding table entry.  In general,
this is saying that for the given lid 0x0019, route out port 5 of that
switch (the switch given by the "Unicast lids [..." line).  The information
after '#' is more information about the node with lid=0x0019.  This is
_not_ the other end of the link on port 5.
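
To make the distinction concrete, here is a small Python sketch that reads
one entry line of the kind quoted above as a forwarding rule. It assumes
only what is visible in this thread (a hex LID, then a decimal egress port,
then a '#' comment describing the destination node); `parse_lft_entry` is a
hypothetical helper, not part of opensm, and the real dump format may carry
more detail.

```python
from typing import NamedTuple


class LftEntry(NamedTuple):
    lid: int          # destination LID being routed
    out_port: int     # switch port the packet leaves through
    dest_info: str    # text after '#': describes the node that owns the LID


def parse_lft_entry(line: str) -> LftEntry:
    """Parse '0x0019 005 # ...' into (lid, out_port, dest_info)."""
    rule, _, comment = line.partition("#")
    lid_text, port_text = rule.split()
    return LftEntry(int(lid_text, 16), int(port_text, 10), comment.strip())


entry = parse_lft_entry("0x0019 005 # Switch portguid 0x003 'MF3:switch-my:MTS3600/U1'")
print(f"LID {entry.lid:#06x} -> port {entry.out_port}; destination: {entry.dest_info}")
```

The comment field names the node that owns the LID, so it says nothing
about which device is cabled to the egress port.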

The topology of the physical connections is shown in opensm-subnet.lst.

 
> You could essentially use this file to map the entire physical network,
> you would end up with a graph ... but no information for how to traverse
> it efficiently, does that sound right?

No, this is not mapping the physical network.  It is a dump of the port
forwarding which was programmed into each switch by opensm.

Changing this file is what allows you to change the routing and then feed it
back into opensm.

> opensm-lid-matrix.dump
> - Looks like it contains the hop information ... but it's a bit more
> cryptic since I have only one switch :(  It should contain a list of all
> switches, the LID for the switch and then hop information.  The hop
> information is what I'm a bit puzzled about here, as well as what port
> guid information is tacked on.  The format of the file is:
> Switch: guid 0xx
> LID 0x 00 ff ff hops for all ports # portguid 0x000

That is the switch to switch hop count information. Probably not of much use 
with only 1 switch.

Ira

 
> I know ... it's a detailed question but I figured I would write enough
> so someone else wouldn't have to reverse engineer using the file routing
> engine if this is basically right.
>
> Paul Monday
> Parallel Scientific, LLC
 



Re: opensm: switch incorrectly reports IB_PORT_CAP_HAS_MCAST_FDB_TOP ?

2011-04-22 Thread Weiny, Ira K.
On Apr 22, 2011, at 11:19 AM, Jim Schutt wrote:

> Hi,
>
> I've been testing the current opensm development head
> (commit 83b67527d16 from git://git.openfabrics.org/~alexnetes/opensm),
> and I've been getting some messages that are new since version 3.3.7:
>
> Apr 22 12:08:09 646534 [411CD940] 0x01 - log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
> SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x4802
> Initial path: 0,1,1,4 Return path: 0,20,1,7
>
> I get one of these messages for each switch in my fabric, on every
> heavy sweep.
>
> It appears these are caused by my switches incorrectly reporting
> the capability IB_PORT_CAP_HAS_MCAST_FDB_TOP; i.e. this patch stops
> the messages:
>
> diff --git a/opensm/osm_mcast_mgr.c b/opensm/osm_mcast_mgr.c
> index ea52bfe..63d2968 100644
> --- a/opensm/osm_mcast_mgr.c
> +++ b/opensm/osm_mcast_mgr.c
> @@ -1041,7 +1041,7 @@ static void mcast_mgr_set_mfttop(IN osm_sm_t * sm, IN osm_switch_t * p_sw)
>  	p_path = osm_physp_get_dr_path_ptr(p_physp);
>  	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
>
> -	if (p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
> +	if (0 && p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
>  		/*
>  		   Set the top of the multicast forwarding table.
>  		 */
>
> IB_PORT_CAP_HAS_MCAST_FDB_TOP is bit 30 of the port capability mask,
> which in at least IBA v1.2.1 was a reserved bit but apparently is
> not anymore.

Yes, these have been published as errata to the 1.2.1 specification.

smpquery portinfo <lid>

should show you if it is reporting that field.  Also, what does

smpquery switchinfo <lid>

say?

Ira

 
> Should I file a bug report with my switch vendor about setting
> a port capability bit for a capability they don't support, or
> is there something else going on that I haven't figured out yet?
>
> FWIW I think my switches have a base SP0; maybe it's got something
> to do with that?
>
> Thanks -- Jim
 



Re: opensm: switch incorrectly reports IB_PORT_CAP_HAS_MCAST_FDB_TOP ?

2011-04-22 Thread Hal Rosenstock
Hi Jim,

On 4/22/2011 2:19 PM, Jim Schutt wrote:
> Hi,
>
> I've been testing the current opensm development head
> (commit 83b67527d16 from git://git.openfabrics.org/~alexnetes/opensm),
> and I've been getting some messages that are new since version 3.3.7:
>
> Apr 22 12:08:09 646534 [411CD940] 0x01 - log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
> SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x4802
> Initial path: 0,1,1,4 Return path: 0,20,1,7
>
> I get one of these messages for each switch in my fabric, on every
> heavy sweep.
>
> It appears these are caused by my switches incorrectly reporting
> the capability IB_PORT_CAP_HAS_MCAST_FDB_TOP; i.e. this patch stops
> the messages:
>
> diff --git a/opensm/osm_mcast_mgr.c b/opensm/osm_mcast_mgr.c
> index ea52bfe..63d2968 100644
> --- a/opensm/osm_mcast_mgr.c
> +++ b/opensm/osm_mcast_mgr.c
> @@ -1041,7 +1041,7 @@ static void mcast_mgr_set_mfttop(IN osm_sm_t * sm, IN osm_switch_t * p_sw)
>  	p_path = osm_physp_get_dr_path_ptr(p_physp);
>  	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
>
> -	if (p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
> +	if (0 && p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
>  		/*
>  		   Set the top of the multicast forwarding table.
>  		 */
>
> IB_PORT_CAP_HAS_MCAST_FDB_TOP is bit 30 of the port capability mask,
> which in at least IBA v1.2.1 was a reserved bit but apparently is
> not anymore.

Yes, this is in IBTA MgtWG public errata beyond IBA 1.2.1.

> Should I file a bug report with my switch vendor about setting
> a port capability bit for a capability they don't support, or
> is there something else going on that I haven't figured out yet?

I will have a patch shortly which can turn this off even if it is
advertised by the switch (not sure what default should be).

You might also want to contact your switch vendor about fixing this.

> FWIW I think my switches have a base SP0; maybe it's got something
> to do with that?

No; either base or enhanced SP0 can support this; it's orthogonal to that.

-- Hal

> Thanks -- Jim
 
 



Re: opensm: switch incorrectly reports IB_PORT_CAP_HAS_MCAST_FDB_TOP ?

2011-04-22 Thread Jim Schutt

Weiny, Ira K. wrote:

> On Apr 22, 2011, at 11:19 AM, Jim Schutt wrote:
>
>> Hi,
>>
>> I've been testing the current opensm development head
>> (commit 83b67527d16 from git://git.openfabrics.org/~alexnetes/opensm),
>> and I've been getting some messages that are new since version 3.3.7:
>>
>> Apr 22 12:08:09 646534 [411CD940] 0x01 - log_rcv_cb_error: ERR 3111: Received MAD with error status = 0x1C
>> SubnGetResp(SwitchInfo), attr_mod 0x0, TID 0x4802
>> Initial path: 0,1,1,4 Return path: 0,20,1,7
>>
>> I get one of these messages for each switch in my fabric, on every
>> heavy sweep.
>>
>> It appears these are caused by my switches incorrectly reporting
>> the capability IB_PORT_CAP_HAS_MCAST_FDB_TOP; i.e. this patch stops
>> the messages:
>>
>> diff --git a/opensm/osm_mcast_mgr.c b/opensm/osm_mcast_mgr.c
>> index ea52bfe..63d2968 100644
>> --- a/opensm/osm_mcast_mgr.c
>> +++ b/opensm/osm_mcast_mgr.c
>> @@ -1041,7 +1041,7 @@ static void mcast_mgr_set_mfttop(IN osm_sm_t * sm, IN osm_switch_t * p_sw)
>>  	p_path = osm_physp_get_dr_path_ptr(p_physp);
>>  	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
>>
>> -	if (p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
>> +	if (0 && p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
>>  		/*
>>  		   Set the top of the multicast forwarding table.
>>  		 */
>>
>> IB_PORT_CAP_HAS_MCAST_FDB_TOP is bit 30 of the port capability mask,
>> which in at least IBA v1.2.1 was a reserved bit but apparently is
>> not anymore.
>
> Yes, these have been published as errata to the 1.2.1 specification.
>
> smpquery portinfo <lid>
>
> should show you if it is reporting that field.  Also, what does
>
> smpquery switchinfo <lid>
>
> say?


# smpquery --version
smpquery BUILD VERSION: 1.5.8_f0526f4 Build date: Apr 22 2011 12:36:58

# smpquery -G switchinfo 0x21283a87200040
# Switch info: Lid 3
LinearFdbCap: 49152
RandomFdbCap: 0
McastFdbCap: 4096
LinearFdbTop: 105
DefPort: 0
DefMcastPrimPort: 255
DefMcastNotPrimPort: 255
LifeTime: 18
StateChange: 0
OptSLtoVLMapping: 1
LidsPerPort: 0
PartEnforceCap: 32
InboundPartEnf: 1
OutboundPartEnf: 1
FilterRawInbound: 1
FilterRawOutbound: 1
EnhancedPort0: 0
MulticastFDBTop: 0x

# smpquery portinfo 3
# Port info: Lid 3 port 0
Mkey: 0x
GidPrefix: 0xfe80
Lid: 3
SMLid: 48
CapMask: 0x42500848
        IsTrapSupported
        IsSLMappingSupported
        IsSystemImageGUIDsupported
        IsVendorClassSupported
        IsCapabilityMaskNoticeSupported
        IsClientRegistrationSupported
        IsMulticastFDBTopSupported
DiagCode: 0x
MkeyLeasePeriod: 0
LocalPort: 20
LinkWidthEnabled: 1X or 4X
LinkWidthSupported: 1X or 4X
LinkWidthActive: 4X
LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState: Active
PhysLinkState: LinkUp
LinkDownDefState: Polling
ProtectBits: 0
LMC: 0
LinkSpeedActive: 10.0 Gbps
LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps or 10.0 Gbps
NeighborMTU: 4096
SMSL: 0
VLCap: VL0-3
InitType: 0x00
VLHighLimit: 0
VLArbHighCap: 0
VLArbLowCap: 0
InitReply: 0x00
MtuCap: 4096
VLStallCount: 0
HoqLife: 0
OperVLs: VL0-3
PartEnforceInb: 0
PartEnforceOutb: 0
FilterRawInb: 0
FilterRawOutb: 0
MkeyViolations: 0
PkeyViolations: 0
QkeyViolations: 0
GuidCap: 1
ClientReregister: 0
McastPkeyTrapSuppressionEnabled: 0
SubnetTimeout: 18
RespTimeVal: 19
LocalPhysErr: 0
OverrunErr: 0
MaxCreditHint: 0
RoundTrip: 0
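
The CapMask value in the portinfo output can be checked against the bit
discussed earlier in this thread. A minimal Python sketch, assuming the
value is the host-order 32-bit mask exactly as smpquery printed it and
that bit 0 is the least significant bit; the constant name below is made
up for illustration.

```python
# Bit 30 of the PortInfo capability mask, per the discussion in this thread.
MCAST_FDB_TOP_BIT = 30  # hypothetical name, for illustration only


def set_bits(mask: int) -> list:
    """Return the positions of all bits set in a 32-bit mask (LSB = bit 0)."""
    return [bit for bit in range(32) if mask & (1 << bit)]


cap_mask = 0x42500848  # CapMask reported for Lid 3 port 0
bits = set_bits(cap_mask)
print("bits set:", bits)
print("MulticastFDBTop advertised:", MCAST_FDB_TOP_BIT in bits)
```

Seven bits are set, matching the seven capability names smpquery lists
under CapMask, and bit 30 is among them.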

-- Jim



> Ira
>
>> Should I file a bug report with my switch vendor about setting
>> a port capability bit for a capability they don't support, or
>> is there something else going on that I haven't figured out yet?


Re: opensm: file routing engine

2011-04-22 Thread Paul Monday (Parallel Scientific)

Thank you, your detail is greatly appreciated :)


I have one other strange question ... is it possible to carve a single
physical switch into two logical switches (put a cable between ports
16/17 and modify the routing tables ... this seems like it wouldn't work
as the Unicast LID / Switch: guid rows in the respective files below
serve as keys so the single switch would be identified twice).

Not that I am aware of.  When you say you have a single switch I assume you mean a switch 
based on a single switch ASIC?  Like a 24 or 36 port pizza box switch.
Yes, a 36 port Mellanox pizza box with a single crossbar ... based on 
how I read these files, it looks like they key off a single GUID that 
identifies the switch ... which would probably make the subnet manager 
unhappy if I arbitrarily tried to mock it up being two switches somehow

The file formats seem to be:

opensm-lfts.dump (later becomes -U [file])
- Contains all discovered ports (powered on), their function (Switch vs.
Channel Adapter), their LID and some extra information.  This is
essentially the physical network (if all machines are powered on) ...
the format is:
Unicast lids [0-x] of switch Lid LID# guidGUID  ('switch description'):
LID 0x  SwitchPort ZZZ  #Channel Adapter | Switch  portguid
GUID: 'Descirption'

I assume this file grows with all of the Channel Adapters and switches.
Given a switch-switch connection a row would look like
0x0019 005 # Switch portguid 0x003 'MF3:switch-my:MTS3600/U1'
Yes this file grows with more nodes in the system.  But the line above is not a connection but 
rather a linear forwarding table entry.  In general, this is saying that for the given lid 
0x0019 route out port 5 of that switch (the switch given by the Unicast lids 
[... line.  The information after '#' is more information about the node with lid=0x0019. 
This is _not_ the other end of the link on port 5.
Ahhh, I see ... so this table could get quite large ... if I have 1,000 
nodes in a subnet, each with a LID assigned, this table would become 
quite large as each LID would be listed for each switch if I have my 
forwarding thoughts in my head ... maybe I need to wander around and 
steal another switch from someone ;-)

The topology of the physical connections are shown in opensm-subnet.lst.
Ahhh, but the opensm-subnet.lst is not handed to the file routing 
algorithm ... this must be derived at runtime each run I'm guessing 
and then dumped to /var/log.  Very helpful!  Thank you for the pointer.

You could essentially use this file to map the entire physical network,
you would end up with a graph ... but no information for how to traverse
it efficiently, does that sound right?

No, this is not mapping the physical network.  It is a dump of the port 
forwarding which was programmed into each switch by opensm.

Changing this file is what allows you to change the routing and then feed it 
back into opensm.


opensm-lid-matrix.dump
- Looks like it contains the hop information ... but it's a bit more
cryptic since I have only one switch :(  It should contain a list of all
switches, the LID for the switch and then hop information.  The hop
information is what I'm a bit puzzled about here, as well as what port
guid information is tacked on.  The format of the file is:
Switch: guid 0xx
LID 0x  00 ff ffhops for all ports  # portguid 0x000

That is the switch to switch hop count information. Probably not of much use 
with only 1 switch.
Ugh ... I need another switch or .dump files from someone ... I haven't 
found any stray .dump files out on the network, but then, Google knows 
all and someone must have posted a couple somewhere to play with.


Thank you so much again Ira, I wasn't too far off and mostly it seems 
I'm off in places that having only a single switch wouldn't let me see.  
The semantic correction of opensm-lfts.dump was critical.


Cheers, have a wonderful weekend.

Paul Monday
Parallel Scientific, LLC



[PATCH] opensm: Provide option to disable use of MulticastFDBTop even if advertised

2011-04-22 Thread Hal Rosenstock

Default is on. The option is a workaround for non-compliant devices:
the feature is advertised, but the SMA rejects sets of SwitchInfo that
actually set MFTTop.

Signed-off-by: Hal Rosenstock h...@mellanox.com
---
diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
index a9499dd..4bab8ee 100644
--- a/include/opensm/osm_subnet.h
+++ b/include/opensm/osm_subnet.h
@@ -171,6 +171,7 @@ typedef struct osm_subn_opt {
 	uint8_t leaf_head_of_queue_lifetime;
 	uint8_t local_phy_errors_threshold;
 	uint8_t overrun_errors_threshold;
+	boolean_t use_mfttop;
 	uint32_t sminfo_polling_timeout;
 	uint32_t polling_retry_number;
 	uint32_t max_msg_fifo_timeout;
diff --git a/opensm/osm_mcast_mgr.c b/opensm/osm_mcast_mgr.c
index ea52bfe..e33c716 100644
--- a/opensm/osm_mcast_mgr.c
+++ b/opensm/osm_mcast_mgr.c
@@ -1041,7 +1041,8 @@ static void mcast_mgr_set_mfttop(IN osm_sm_t * sm, IN osm_switch_t * p_sw)
 	p_path = osm_physp_get_dr_path_ptr(p_physp);
 	p_tbl = osm_switch_get_mcast_tbl_ptr(p_sw);
 
-	if (p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
+	if (sm->p_subn->opt.use_mfttop &&
+	    p_physp->port_info.capability_mask & IB_PORT_CAP_HAS_MCAST_FDB_TOP) {
 		/*
 		   Set the top of the multicast forwarding table.
 		 */
diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c
index 84ac6ed..e4ea841 100644
--- a/opensm/osm_subnet.c
+++ b/opensm/osm_subnet.c
@@ -322,6 +322,7 @@ static const opt_rec_t opt_tbl[] = {
 	{ "leaf_head_of_queue_lifetime", OPT_OFFSET(leaf_head_of_queue_lifetime), opts_parse_uint8, NULL, 1 },
 	{ "local_phy_errors_threshold", OPT_OFFSET(local_phy_errors_threshold), opts_parse_uint8, NULL, 1 },
 	{ "overrun_errors_threshold", OPT_OFFSET(overrun_errors_threshold), opts_parse_uint8, NULL, 1 },
+	{ "use_mfttop", OPT_OFFSET(use_mfttop), opts_parse_boolean, NULL, 1},
 	{ "sminfo_polling_timeout", OPT_OFFSET(sminfo_polling_timeout), opts_parse_uint32, opts_setup_sminfo_polling_timeout, 1 },
 	{ "polling_retry_number", OPT_OFFSET(polling_retry_number), opts_parse_uint32, NULL, 1 },
 	{ "force_heavy_sweep", OPT_OFFSET(force_heavy_sweep), opts_parse_boolean, NULL, 1 },
@@ -703,6 +704,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt)
 	    OSM_DEFAULT_LEAF_HEAD_OF_QUEUE_LIFE;
 	p_opt->local_phy_errors_threshold = OSM_DEFAULT_ERROR_THRESHOLD;
 	p_opt->overrun_errors_threshold = OSM_DEFAULT_ERROR_THRESHOLD;
+	p_opt->use_mfttop = TRUE;
 	p_opt->sminfo_polling_timeout =
 	    OSM_SM_DEFAULT_POLLING_TIMEOUT_MILLISECS;
 	p_opt->polling_retry_number = OSM_SM_DEFAULT_POLLING_RETRY_NUMBER;
@@ -1313,7 +1315,9 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts)
 		"# Threshold of local phy errors for sending Trap 129\n"
 		"local_phy_errors_threshold 0x%02x\n\n"
 		"# Threshold of credit overrun errors for sending Trap 130\n"
-		"overrun_errors_threshold 0x%02x\n\n",
+		"overrun_errors_threshold 0x%02x\n\n"
+		"# Use SwitchInfo:MulticastFDBTop if advertised in PortInfo:CapabilityMask\n"
+		"use_mfttop %s\n\n",
 		cl_ntoh64(p_opts->guid),
 		cl_ntoh64(p_opts->m_key),
 		cl_ntoh16(p_opts->m_key_lease_period),
@@ -1332,7 +1336,8 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts)
 		p_opts->force_link_speed,
 		p_opts->subnet_timeout,
 		p_opts->local_phy_errors_threshold,
-		p_opts->overrun_errors_threshold);
+		p_opts->overrun_errors_threshold,
+		p_opts->use_mfttop ? "TRUE" : "FALSE");
 
 	fprintf(out,
 		"#\n# PARTITIONING OPTIONS\n#\n"
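
The gating condition added in mcast_mgr_set_mfttop() can be illustrated with a small self-contained sketch. It is not opensm code: subn_opt_sketch, should_set_mfttop and CAP_HAS_MCAST_FDB_TOP_EXAMPLE are hypothetical stand-ins, the bit value is chosen only for illustration, and the byte-order handling of the real capability mask is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stand-in for IB_PORT_CAP_HAS_MCAST_FDB_TOP. The real
 * constant comes from the opensm headers; this value is an assumption.
 */
#define CAP_HAS_MCAST_FDB_TOP_EXAMPLE 0x40000000u

/* Hypothetical reduction of osm_subn_opt_t to the one field used here. */
struct subn_opt_sketch {
	bool use_mfttop;	/* TRUE by default in the patch */
};

/*
 * MFTTop is only set when the option is enabled AND the port advertises
 * the capability. In C, '&' binds tighter than '&&', so the patch's
 * unparenthesized "opt && mask & BIT" parses as "opt && (mask & BIT)".
 */
static bool should_set_mfttop(const struct subn_opt_sketch *opt,
			      uint32_t capability_mask)
{
	assert(opt != NULL);
	return opt->use_mfttop &&
	       (capability_mask & CAP_HAS_MCAST_FDB_TOP_EXAMPLE) != 0;
}
```

With the patch applied, a line such as "use_mfttop FALSE" in the options file should disable the feature, since the option is parsed by opts_parse_boolean and written back by osm_subn_output_conf() in that form.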