Re: [opensm] RFC: new routing options

2010-10-12 Thread Yevgeny Kliteynik
Hi Al,

This looks really great!
One question: have you tried benchmarking the BW with up/down
routing using the guid_routing_order_file option w/o your new
features?

-- YK

On 08-Oct-10 7:40 PM, Albert Chu wrote:
 Hey Sasha,
 
 We recently got a new cluster and I've been experimenting with some
 routing changes to improve the average bandwidth of the cluster.  They
 are attached as patches with description of the routing goals below.
 
 We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
 measure min, peak, and average send/recv bandwidth across the cluster.
 What we found with the original updn routing was an average of around
 420 MB/s send bandwidth and 508 MB/s recv bandwidth.  The following two
 patches were able to get the average send bandwidth up to 1045 MB/s and
 recv bandwidth up to 1228 MB/s.
 
 I'm sure this is only round 1 of the patches and I'm looking for
 comments.  Many areas could be cleaned up w/ some rearchitecture or
 struct changes, but I went with the most non-invasive implementation
 first.  I'm also open to name changes on the options.
 
 BTW, b/c of the old management tree on the git server, the following
 patches were developed on an internal LLNL tree.  I'll rebase after the
 up2date tree is on the openfabrics server.
 
 1) Port Shifting
 
 This is similar to what was done with some of the LMC > 0 code.
 Congestion would occur due to alignment of routes w/ common traffic
 patterns.  However, we found that it was also necessary for LMC=0 and
 only for used ports.  For example, let's say there are 4 ports (called A,
 B, C, D) and we are routing lids 1-9 through them.  Suppose only routing
 through A, B, and C will reach lids 1-9.
 
 The LFT would normally be:
 
 A: 1 4 7
 B: 2 5 8
 C: 3 6 9
 D:
 
 The Port Shifting would make this:
 
 A: 1 6 8
 B: 2 4 9
 C: 3 5 7
 D:
 
 This option by itself improved the mpiGraph average send/recv bandwidth
 from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.
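 
 One way to read the example above is as a per-round rotation of the starting
 port.  A minimal stand-alone sketch (not the actual opensm patch, just an
 illustration of the idea) that reproduces the shifted table:
 
 #include <stdio.h>
 
 int main(void)
 {
     const char ports[] = {'A', 'B', 'C'};   /* only the used ports */
     const int num_ports = 3;
 
     for (int lid = 1; lid <= 9; lid++) {
         int round = (lid - 1) / num_ports;                        /* pass number */
         int port  = ((lid - 1) % num_ports + round) % num_ports;  /* shifted pick */
         printf("lid %d -> port %c\n", lid, ports[port]);
     }
     return 0;
 }
 
 This prints lid 1-3 on A/B/C, lid 4-6 on B/C/A, and lid 7-9 on C/A/B,
 matching the shifted LFT shown above.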
 
 2) Remote Guid Sorting
 
 Most core/spine switches we've seen have had line boards connected to
 spine boards in a consistent pattern.  However, we recently got some
 Qlogic switches that connect from line/leaf boards to spine boards in a
 (to the casual observer) random pattern.  I'm sure there was a good
 electrical/board reason for this design, but it does hurt routing b/c
 some of the opensm routing algorithms assume a consistent connection
 pattern.  Here's an output from iblinkinfo as an example.
 
 Switch 0x00066a00ec0029b8 ibcore1 L123:
  180    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  254   19[  ] "ibsw55" ( )
  180    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  253   19[  ] "ibsw56" ( )
  180    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  258   19[  ] "ibsw57" ( )
  180    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  257   19[  ] "ibsw58" ( )
  180    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  256   19[  ] "ibsw59" ( )
  180    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  255   19[  ] "ibsw60" ( )
  180    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  261   19[  ] "ibsw61" ( )
  180    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  262   19[  ] "ibsw62" ( )
  180    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  260   19[  ] "ibsw63" ( )
  180   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  259   19[  ] "ibsw64" ( )
  180   11[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  284   19[  ] "ibsw65" ( )
  180   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  285   19[  ] "ibsw66" ( )
  180   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==> 2227   19[  ] "ibsw67" ( )
  180   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  283   19[  ] "ibsw68" ( )
  180   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  267   19[  ] "ibsw69" ( )
  180   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  270   19[  ] "ibsw70" ( )
  180   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  269   19[  ] "ibsw71" ( )
  180   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  268   19[  ] "ibsw72" ( )
  180   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  222   17[  ] "ibcore1 S117B" ( )
  180   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  209   19[  ] "ibcore1 S211B" ( )
  180   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  218   21[  ] "ibcore1 S117A" ( )
  180   22[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  192   23[  ] "ibcore1 S215B" ( )
  180   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>   85   15[  ] "ibcore1 S209A" ( )
  180   24[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  182   13[  ] "ibcore1 S215A" ( )
  180   25[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  200   11[  ] "ibcore1 S115B" ( )
  180   26[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>  129   25[  ] "ibcore1 S209B" ( )
  180   27[  ] ==( 4X 10.0 Gbps 

Re: [PATCH] mlx4: Limit num of fast reg WRs

2010-10-12 Thread Eli Cohen
On Tue, Oct 12, 2010 at 12:13:26AM +0200, Or Gerlitz wrote:
 Guys, can you clarify if the hardware limitation is 511 entries or if it's
 (PAGE_SIZE / sizeof(pointer)) - 1, which is 4096 / 8 - 1 = 511 but can
 change if the page size gets bigger or smaller?


The limit is 511 entries.
After I posted this patch, I was told that there is yet another
constraint on the page list: The buffer containing the list must not
cross a page boundary. So I was thinking what is the best way to deal
with this. One way is to always allocate a whole page and map it using
dma_map_page(page, DMA_TO_DEVICE), something like this (not a complete
patch, just the idea).

diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 83e3cc7..e9b2c8a 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -237,18 +237,23 @@ struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct ib_device
     if (!mfrpl->ibfrpl.page_list)
         goto err_free;
 
-    mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->pdev->dev,
-                                                 size, &mfrpl->map,
-                                                 GFP_KERNEL);
+    mfrpl->mapped_page_list = (__be64 *)__get_free_page(GFP_KERNEL);
     if (!mfrpl->mapped_page_list)
         goto err_free;
 
-    WARN_ON(mfrpl->map & 0x3f);
+    mfrpl->map = dma_map_single(ibdev->dma_device, mfrpl->mapped_page_list,
+                                PAGE_SIZE, DMA_TO_DEVICE);
+    if (dma_mapping_error(ibdev->dma_device, mfrpl->map))
+        goto err_page;
+
 
     return &mfrpl->ibfrpl;
 
+err_page:
+    free_page((unsigned long) mfrpl->mapped_page_list);
+
 err_free:
-    kfree(mfrpl->ibfrpl.page_list);
     kfree(mfrpl);
     return ERR_PTR(-ENOMEM);
 }


[patch v3] infiniband: uverbs: handle large number of entries

2010-10-12 Thread Dan Carpenter
In the original code there was a potential integer overflow if you
passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
buffers than intended, leading to memory corruption.
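
As an illustration of the failure mode (not part of the patch below): when the
element count and size are multiplied into a 32-bit value, the product can
wrap, and kmalloc() then returns a buffer smaller than what is later written.
kcalloc() is the usual guard, since it fails instead of wrapping:

#include <linux/slab.h>
#include <rdma/ib_verbs.h>

static struct ib_wc *alloc_wc_array(u32 ne)
{
	/* kmalloc(ne * sizeof(struct ib_wc), ...) can allocate far less than
	 * the caller later writes into if the multiplication wraps;
	 * kcalloc(ne, size, ...) detects the overflow and returns NULL. */
	return kcalloc(ne, sizeof(struct ib_wc), GFP_KERNEL);
}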

There was also an information leak.

Documentation/infiniband/user_verbs.txt suggests this function is meant
for unprivileged access.

Jason Gunthorpe suggested that I should modify it to pass the data to
the user bit by bit and avoid the kmalloc() entirely.

CC: sta...@kernel.org
Signed-off-by: Dan Carpenter erro...@gmail.com
---
Please, please, check this.  I think I've done it right, but I don't
have the hardware and cannot test it.

It's strange to me that we return in_len on success.

struct ib_uverbs_poll_cq_resp is used by userspace libraries right?
Otherwise I could delete it.

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6fcfbeb..b0788b6 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -891,68 +891,89 @@ out:
     return ret ? ret : in_len;
 }
 
+static int copy_header_to_user(void __user *dest, u32 count)
+{
+    u32 header[2];  /* the second u32 is reserved */
+
+    memset(header, 0, sizeof(header));
+    if (copy_to_user(dest, header, sizeof(header)))
+        return -EFAULT;
+    return 0;
+}
+
+static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
+{
+    struct ib_uverbs_wc tmp;
+
+    memset(&tmp, 0, sizeof(tmp));
+
+    tmp.wr_id          = wc->wr_id;
+    tmp.status         = wc->status;
+    tmp.opcode         = wc->opcode;
+    tmp.vendor_err     = wc->vendor_err;
+    tmp.byte_len       = wc->byte_len;
+    tmp.ex.imm_data    = (__u32 __force) wc->ex.imm_data;
+    tmp.qp_num         = wc->qp->qp_num;
+    tmp.src_qp         = wc->src_qp;
+    tmp.wc_flags       = wc->wc_flags;
+    tmp.pkey_index     = wc->pkey_index;
+    tmp.slid           = wc->slid;
+    tmp.sl             = wc->sl;
+    tmp.dlid_path_bits = wc->dlid_path_bits;
+    tmp.port_num       = wc->port_num;
+
+    if (copy_to_user(dest, &tmp, sizeof(tmp)))
+        return -EFAULT;
+    return 0;
+}
+
 ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
                           const char __user *buf, int in_len,
                           int out_len)
 {
     struct ib_uverbs_poll_cq       cmd;
-    struct ib_uverbs_poll_cq_resp *resp;
+    u8 __user                     *header_ptr;
+    u8 __user                     *data_ptr;
     struct ib_cq                  *cq;
-    struct ib_wc                  *wc;
-    int                            ret = 0;
+    struct ib_wc                   wc;
+    u32                            count = 0;
+    int                            ret;
     int                            i;
-    int                            rsize;
 
     if (copy_from_user(&cmd, buf, sizeof cmd))
         return -EFAULT;
 
-    wc = kmalloc(cmd.ne * sizeof *wc, GFP_KERNEL);
-    if (!wc)
-        return -ENOMEM;
-
-    rsize = sizeof *resp + cmd.ne * sizeof(struct ib_uverbs_wc);
-    resp = kmalloc(rsize, GFP_KERNEL);
-    if (!resp) {
-        ret = -ENOMEM;
-        goto out_wc;
-    }
-
     cq = idr_read_cq(cmd.cq_handle, file->ucontext, 0);
-    if (!cq) {
-        ret = -EINVAL;
-        goto out;
-    }
+    if (!cq)
+        return -EINVAL;
 
-    resp->count = ib_poll_cq(cq, cmd.ne, wc);
+    /* we copy a struct ib_uverbs_poll_cq_resp to user space */
+    header_ptr = (void __user *)(unsigned long)cmd.response;
+    data_ptr = header_ptr + sizeof(u32) * 2;
 
-    put_cq_read(cq);
+    for (i = 0; i < cmd.ne; i++) {
+        ret = ib_poll_cq(cq, 1, &wc);
+        if (ret < 0)
+            goto out_put;
+        if (!ret)
+            break;
 
-    for (i = 0; i < resp->count; i++) {
-        resp->wc[i].wr_id          = wc[i].wr_id;
-        resp->wc[i].status         = wc[i].status;
-        resp->wc[i].opcode         = wc[i].opcode;
-        resp->wc[i].vendor_err     = wc[i].vendor_err;
-        resp->wc[i].byte_len       = wc[i].byte_len;
-        resp->wc[i].ex.imm_data    = (__u32 __force) wc[i].ex.imm_data;
-        resp->wc[i].qp_num         = wc[i].qp->qp_num;
-        resp->wc[i].src_qp         = wc[i].src_qp;
-        resp->wc[i].wc_flags       = wc[i].wc_flags;
-        resp->wc[i].pkey_index     = wc[i].pkey_index;
-        resp->wc[i].slid           = wc[i].slid;
-        resp->wc[i].sl             = wc[i].sl;
-        resp->wc[i].dlid_path_bits = wc[i].dlid_path_bits;
-        resp->wc[i].port_num       = wc[i].port_num;
+        ret = copy_wc_to_user(data_ptr, &wc);
+        if (ret)
+            goto out_put;
+        data_ptr += 

Trying to link with DAT 2.0 function

2010-10-12 Thread Young, Eric R.
My motivation for using dat_cno_fd_create() is that I am able to
register a file descriptor with a reactor (all events go through a
reactor which has multiple I/O, including I/O which is not at all tied to
uDAPL). An application is able to work on other tasks while waiting for
the reactor to call back on the file descriptor when an event is
available. To achieve the same behavior with dat_cno_wait(), I would
have to spawn off another thread which blocks on dat_cno_wait(), then
notify the reactor (to queue up a reactor event) when the dat_cno_wait()
is unblocked, lock critical sections of code, etc. If the
dat_cno_fd_create() function were available to me, it would seem to be a
cleaner way to achieve this functionality.
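
For comparison, a rough sketch of the thread-based workaround described above,
assuming dat_cno_wait() as specified in DAT 2.0 and a reactor that can watch an
ordinary file descriptor (a pipe stands in for it here; the header path and
helper names are assumptions):

#include <dat2/udat.h>   /* DAT 2.0 uDAPL headers (assumed install path) */
#include <pthread.h>
#include <unistd.h>

static int notify_pipe[2];          /* [0] is handed to the reactor to poll */

static void *cno_waiter(void *arg)
{
    DAT_CNO_HANDLE cno = (DAT_CNO_HANDLE)arg;
    DAT_EVD_HANDLE evd;

    for (;;) {
        /* Block until some EVD attached to the CNO has an event queued */
        if (dat_cno_wait(cno, DAT_TIMEOUT_INFINITE, &evd) == DAT_SUCCESS)
            (void)write(notify_pipe[1], "x", 1);   /* wake the reactor */
    }
    return NULL;
}

static int start_cno_bridge(DAT_CNO_HANDLE cno, pthread_t *tid)
{
    if (pipe(notify_pipe))
        return -1;
    /* The reactor now watches notify_pipe[0] exactly as it would have
     * watched the fd that dat_cno_fd_create() would return. */
    return pthread_create(tid, NULL, cno_waiter, (void *)cno);
}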

-----Original Message-----
From: Davis, Arlin R [mailto:arlin.r.da...@intel.com] 
Sent: Monday, October 11, 2010 5:34 PM
To: Young, Eric R.; linux-rdma@vger.kernel.org
Subject: EXTERNAL:RE: Trying to link with DAT 2.0 function

 Do you have a roadmap available? Is this planned to be implemented in
 the near future?

There are no plans. I really don't know how this call even made
it into the specification given that DAT is supposed to be O/S agnostic.

In any case, can you use dat_cno_wait() on top of the EVDs
as a means to support/trigger multiple event streams? What is driving
your choice to use dat_cno_fd_create()? Maybe we can come up
with an alternative with the existing API.

Arlin







Re: [opensm] RFC: new routing options

2010-10-12 Thread Albert Chu
Hey Yevgeny,

Yes, I tried that and it didn't have much of an effect.  Ever since
Sasha put in his routing sorted by switch load
(sort_ports_by_switch_load() in osm_ucast_mgr.c), guid_routing_order
isn't really necessary (as long as most of the cluster is up).

Al

On Tue, 2010-10-12 at 00:59 -0700, Yevgeny Kliteynik wrote:
 Hi Al,
 
 This looks really great!
 One question: have you tried benchmarking the BW with up/down
 routing using the guid_routing_order_file option w/o your new
 features?
 
 -- YK
 

Work completions generated after a queue pair has made the transition to an error state

2010-10-12 Thread Bart Van Assche
Hello,

Has anyone already tried to process the work completions generated by
a HCA after the state of a queue pair has been changed to IB_QPS_ERR ?
With the hardware/firmware/driver combination I have tested I have
observed the following:
* Multiple completions with the same wr_id and nonzero (error) status
were received by the application, while all work requests queued with
the flag IB_SEND_SIGNALED had a unique wr_id.
* Completions were received with non-zero (error) status and a wr_id /
opcode combination that was never queued by the application.
Note: some work requests were queued with and some without the flag
IB_SEND_SIGNALED. I'm not sure however whether that has anything to do
with the observed behavior.

This behavior is easy to reproduce. If I interpret the InfiniBand
Architecture Specification correctly, this behavior is non-compliant.

Has anyone been looking into this before?

Bart.


Re: Work completions generated after a queue pair has made the transition to an error state

2010-10-12 Thread Or Gerlitz
Bart Van Assche bvanass...@acm.org wrote:
 Has anyone been looking into this before ?

nope, never ever, what hca is that?

Or.


Re: Work completions generated after a queue pair has made the transition to an error state

2010-10-12 Thread Bart Van Assche
On Tue, Oct 12, 2010 at 8:50 PM, Ralph Campbell
ralph.campb...@qlogic.com wrote:
 On Tue, 2010-10-12 at 11:38 -0700, Bart Van Assche wrote:
 Hello,

 Has anyone already tried to process the work completions generated by
 a HCA after the state of a queue pair has been changed to IB_QPS_ERR ?
 With the hardware/firmware/driver combination I have tested I have
 observed the following:
 * Multiple completions with the same wr_id and nonzero (error) status
 were received by the application, while all work requests queued with
 the flag IB_SEND_SIGNALED had a unique wr_id.
 * Completions with non-zero (error) status and a wr_id / opcode
 combination were received that were never queued by the application.
 Note: some work requests were queued with and some without the flag
 IB_SEND_SIGNALED. I'm not sure however whether that has anything to do
 with the observed behavior.

 This behavior is easy to reproduce. If I interpret the InfiniBand
 Architecture Specification correctly, this behavior is non-compliant.

 Has anyone been looking into this before ?

 I haven't seen it. It isn't supposed to happen.

 What hardware and software are you using and how do you
 reproduce it?

Hello Ralph and Or,

The way I reproduce that behavior is by modifying the state of a queue
pair into IB_QPS_ERR while RDMA is ongoing. The application, which is
multithreaded, performs RDMA by calling ib_post_recv() and
ib_post_send() (opcodes IB_WR_SEND, IB_WR_RDMA_READ and
IB_WR_RDMA_WRITE). This has been observed with the mlx4 driver, a
ConnectX HCA and firmware version 2.7.0.
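
For reference, the state change itself is just a one-attribute ib_modify_qp()
call; a minimal kernel-verbs sketch of that reproduction step (assuming "qp"
is a connected QP owned by the caller):

#include <rdma/ib_verbs.h>

/* Force the QP into the error state while WRs are still outstanding;
 * subsequent CQ polling should then return flush/error completions. */
static int force_qp_error(struct ib_qp *qp)
{
    struct ib_qp_attr attr = {
        .qp_state = IB_QPS_ERR,
    };

    return ib_modify_qp(qp, &attr, IB_QP_STATE);
}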

Bart.


[RFC 0/2] IB/umad: Export mad snooping to userspace

2010-10-12 Thread Hefty, Sean
The kernel mad interface allows a client to view all
sent and received MADs.  This has proven to be a useful
debugging technique when paired with the external kernel
module, madeye.  However, madeye was never intended to
be submitted upstream.

A couple of alternatives have been proposed for making
this functionality available in the upstream kernel,
using trace events or exporting the snooping interface
to user space.  This patch series takes the latter approach.

In addition to snooping MADs simply for debugging purposes,
applications can be constructed to examine and act on
MAD traffic.  For example, a daemon could snoop SA queries
and CM messages as part of providing a path record caching
service.  It could cache snooped path records and use CM
timeouts as an indication that cached data may be stale.

Because such services may become crucial to support large
clusters, the desire is to add mad snooping capabilities
to the stack directly, rather than using a debug interface.

These patches compile, but have not been tested.  If this
approach is acceptable, I will modify libibumad to work
with the proposed changes.  I will also create a userspace
version of madeye as a new ib-diag.  Finally, the IB ACM
will eventually be updated to monitor CM response timeouts.

Signed-off-by: Sean Hefty sean.he...@intel.com


[RFC 1/2] IB/mad: Simplify snooping interface

2010-10-12 Thread Hefty, Sean
In preparation for exporting the kernel mad snooping capability
to user space, remove all code originally inserted as place holders
and simplify the mad snooping interface.

For performance reasons, we want to filter which mads are reported
to clients of the snooping interface at the lowest level, but we
also don't want to perform complex filtering at that level.
As a trade-off, we allow filtering based on mgmt_class, attr_id,
and mad request status.

The reasoning behind these choices is to allow a user to filter
traffic to a specific service (the SA or CM), for a well known
purpose (path record queries or multicast joins), or to view only
operations that have failed.  Filtering based on mgmt_class and
attr_id was used by the external madeye debug module, so we
have some precedent that filtering at that level is usable.
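
As a hedged illustration of the intended usage (not taken from the patch; the
field names come from the code below, while the type and endianness of attr_id
in the final API are assumptions):

#include <linux/string.h>
#include <rdma/ib_mad.h>
#include <rdma/ib_sa.h>

/* Watch only SA PathRecord traffic; a zero field means "don't care". */
static void setup_sa_pathrec_filter(struct ib_mad_snoop_reg_req *reg)
{
    memset(reg, 0, sizeof(*reg));
    reg->mgmt_class = IB_MGMT_CLASS_SUBN_ADM;             /* the SA */
    reg->attr_id    = cpu_to_be16(IB_SA_ATTR_PATH_REC);   /* PathRecord queries */
    reg->errors     = 0;             /* report all matches, not just failures */
}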

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/mad.c  |   86 ++--
 drivers/infiniband/core/mad_priv.h |2 -
 include/rdma/ib_mad.h  |   51 ++---
 3 files changed, 68 insertions(+), 71 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef1304f..b90f7f0 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -381,22 +381,6 @@ error1:
 }
 EXPORT_SYMBOL(ib_register_mad_agent);
 
-static inline int is_snooping_sends(int mad_snoop_flags)
-{
-    return (mad_snoop_flags &
-            (/*IB_MAD_SNOOP_POSTED_SENDS |
-             IB_MAD_SNOOP_RMPP_SENDS |*/
-             IB_MAD_SNOOP_SEND_COMPLETIONS /*|
-             IB_MAD_SNOOP_RMPP_SEND_COMPLETIONS*/));
-}
-
-static inline int is_snooping_recvs(int mad_snoop_flags)
-{
-    return (mad_snoop_flags &
-            (IB_MAD_SNOOP_RECVS /*|
-             IB_MAD_SNOOP_RMPP_RECVS*/));
-}
-
 static int register_snoop_agent(struct ib_mad_qp_info *qp_info,
                                 struct ib_mad_snoop_private *mad_snoop_priv)
 {
@@ -434,8 +418,8 @@ out:
 struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
                                            u8 port_num,
                                            enum ib_qp_type qp_type,
-                                           int mad_snoop_flags,
-                                           ib_mad_snoop_handler snoop_handler,
+                                           struct ib_mad_snoop_reg_req *snoop_reg_req,
+                                           ib_mad_send_handler send_handler,
                                            ib_mad_recv_handler recv_handler,
                                            void *context)
 {
@@ -444,12 +428,6 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
     struct ib_mad_snoop_private *mad_snoop_priv;
     int qpn;
 
-    /* Validate parameters */
-    if ((is_snooping_sends(mad_snoop_flags) && !snoop_handler) ||
-        (is_snooping_recvs(mad_snoop_flags) && !recv_handler)) {
-        ret = ERR_PTR(-EINVAL);
-        goto error1;
-    }
     qpn = get_spl_qp_index(qp_type);
     if (qpn == -1) {
         ret = ERR_PTR(-EINVAL);
@@ -471,11 +449,11 @@ struct ib_mad_agent *ib_register_mad_snoop(struct ib_device *device,
     mad_snoop_priv->qp_info = &port_priv->qp_info[qpn];
     mad_snoop_priv->agent.device = device;
     mad_snoop_priv->agent.recv_handler = recv_handler;
-    mad_snoop_priv->agent.snoop_handler = snoop_handler;
+    mad_snoop_priv->agent.send_handler = send_handler;
+    mad_snoop_priv->reg_req = *snoop_reg_req;
     mad_snoop_priv->agent.context = context;
     mad_snoop_priv->agent.qp = port_priv->qp_info[qpn].qp;
     mad_snoop_priv->agent.port_num = port_num;
-    mad_snoop_priv->mad_snoop_flags = mad_snoop_flags;
     init_completion(&mad_snoop_priv->comp);
     mad_snoop_priv->snoop_index = register_snoop_agent(
                                         &port_priv->qp_info[qpn],
@@ -592,10 +570,35 @@ static void dequeue_mad(struct ib_mad_list_head *mad_list)
     spin_unlock_irqrestore(&mad_queue->lock, flags);
 }
 
+static int snoop_check_filter(struct ib_mad_snoop_private *mad_snoop_priv,
+                              struct ib_mad_hdr *mad_hdr, enum ib_wc_status status)
+{
+    struct ib_mad_snoop_reg_req *reg = &mad_snoop_priv->reg_req;
+
+    if (reg->errors && !mad_hdr->status &&
+        (status == IB_WC_SUCCESS || status == IB_WC_WR_FLUSH_ERR))
+        return 0;
+
+    if (reg->mgmt_class) {
+        if (reg->mgmt_class != mad_hdr->mgmt_class)
+            return 0;
+
+        if (reg->attr_id && reg->attr_id != mad_hdr->attr_id)
+            return 0;
+
+        if (reg->mgmt_class_version &&
+            reg->mgmt_class_version != mad_hdr->class_version)
+            return 0;
+
+        if (is_vendor_class(reg->mgmt_class) && is_vendor_oui(reg->oui) &&
+            

[RFC 2/2] IB/umad: Export mad snooping capability to userspace

2010-10-12 Thread Hefty, Sean
Export the mad snooping capability to user space clients
through the existing umad interface.  This will allow
users to capture MAD data for debugging, plus it allows
for services to act on MAD traffic that occurs.  For example,
a daemon could snoop SA queries and CM messages as part of
providing a path record caching service.  (It could cache
snooped path records, record the average time needed for the
SA to respond to queries, use CM timeouts as an indication
that cached data may be stale, etc.)

Because such services may become crucial to support large
clusters, mad snooping capabilities are not limited to
a debugging interface.

Backwards compatibility is maintained by using the upper bit
of the QPN to indicate if a user is registering to send/receive
MADs or only wishes to snoop traffic.
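
Purely as an illustration of that convention (the flag value is not visible in
the truncated patch excerpt below, so the constant and the helper are
assumptions, as is the libibumad definition of struct ib_user_mad_reg_req):

#include <infiniband/umad.h>    /* struct ib_user_mad_reg_req (libibumad) */

#define UMAD_QPN_SNOOP  0x80    /* assumed: the high bit of the 8-bit qpn field */

static void request_snoop_only(struct ib_user_mad_reg_req *req, int use_gsi)
{
    /* qpn stays 0 (SMI) or 1 (GSI); the high bit asks for snoop-only mode */
    req->qpn = (use_gsi ? 1 : 0) | UMAD_QPN_SNOOP;
}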

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 drivers/infiniband/core/user_mad.c |  134 ++--
 include/rdma/ib_user_mad.h |   33 -
 2 files changed, 143 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 5fa8569..e666038 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -252,6 +252,80 @@ err1:
     ib_free_recv_mad(mad_recv_wc);
 }
 
+static void snoop_send_handler(struct ib_mad_agent *agent,
+                               struct ib_mad_send_wc *send_wc)
+{
+    struct ib_umad_file *file = agent->context;
+    struct ib_umad_packet *packet;
+    struct ib_mad_send_buf *msg = send_wc->send_buf;
+    struct ib_rmpp_mad *rmpp_mad;
+    int data_len;
+    u32 seg_num;
+
+    data_len = msg->seg_count ? msg->seg_size : msg->data_len;
+    packet = kzalloc(sizeof *packet + msg->hdr_len + data_len, GFP_KERNEL);
+    if (!packet)
+        return;
+
+    packet->length = msg->hdr_len + data_len;
+    packet->mad.hdr.status = send_wc->status;
+    packet->mad.hdr.timeout_ms = msg->timeout_ms;
+    packet->mad.hdr.retries = msg->retries;
+    packet->mad.hdr.length = hdr_size(file) + packet->length;
+
+    if (msg->seg_count) {
+        rmpp_mad = msg->mad;
+        seg_num = be32_to_cpu(rmpp_mad->rmpp_hdr.seg_num);
+        memcpy(packet->mad.data, msg->mad, msg->hdr_len);
+        memcpy(((u8 *) packet->mad.data) + msg->hdr_len,
+               ib_get_rmpp_segment(msg, seg_num), data_len);
+    } else {
+        memcpy(packet->mad.data, msg->mad, packet->length);
+    }
+
+    if (queue_packet(file, agent, packet))
+        kfree(packet);
+}
+
+static void snoop_recv_handler(struct ib_mad_agent *agent,
+                               struct ib_mad_recv_wc *mad_recv_wc)
+{
+    struct ib_umad_file *file = agent->context;
+    struct ib_umad_packet *packet;
+    struct ib_mad_recv_buf *recv_buf = &mad_recv_wc->recv_buf;
+
+    packet = kzalloc(sizeof *packet + sizeof *recv_buf->mad, GFP_KERNEL);
+    if (!packet)
+        return;
+
+    packet->length = sizeof *recv_buf->mad;
+    packet->mad.hdr.length = hdr_size(file) + packet->length;
+    packet->mad.hdr.qpn = cpu_to_be32(mad_recv_wc->wc->src_qp);
+    packet->mad.hdr.lid = cpu_to_be16(mad_recv_wc->wc->slid);
+    packet->mad.hdr.sl = mad_recv_wc->wc->sl;
+    packet->mad.hdr.path_bits = mad_recv_wc->wc->dlid_path_bits;
+    packet->mad.hdr.pkey_index = mad_recv_wc->wc->pkey_index;
+    packet->mad.hdr.grh_present = !!(mad_recv_wc->wc->wc_flags & IB_WC_GRH);
+    if (packet->mad.hdr.grh_present) {
+        struct ib_ah_attr ah_attr;
+
+        ib_init_ah_from_wc(agent->device, agent->port_num,
+                           mad_recv_wc->wc, mad_recv_wc->recv_buf.grh,
+                           &ah_attr);
+
+        packet->mad.hdr.gid_index = ah_attr.grh.sgid_index;
+        packet->mad.hdr.hop_limit = ah_attr.grh.hop_limit;
+        packet->mad.hdr.traffic_class = ah_attr.grh.traffic_class;
+        memcpy(packet->mad.hdr.gid, ah_attr.grh.dgid, 16);
+        packet->mad.hdr.flow_label = cpu_to_be32(ah_attr.grh.flow_label);
+    }
+
+    memcpy(packet->mad.data, recv_buf->mad, packet->length);
+
+    if (queue_packet(file, agent, packet))
+        kfree(packet);
+}
+
 static ssize_t copy_recv_mad(struct ib_umad_file *file, char __user *buf,
                              struct ib_umad_packet *packet, size_t count)
 {
@@ -603,8 +677,9 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg,
 {
     struct ib_user_mad_reg_req ureq;
     struct ib_mad_reg_req req;
+    struct ib_mad_snoop_reg_req snoop_req;
     struct ib_mad_agent *agent = NULL;
-    int agent_id;
+    int agent_id, snoop;
     int ret;
 
     mutex_lock(&file->port->file_mutex);
@@ -620,6 +695,8 @@ static int ib_umad_reg_agent(struct ib_umad_file *file, void __user *arg,
         goto out;
     }
 
+    snoop = ureq.qpn 

Re: Work completions generated after a queue pair has made the transition to an error state

2010-10-12 Thread Eli Cohen
On Tue, Oct 12, 2010 at 08:58:59PM +0200, Bart Van Assche wrote:
 On Tue, Oct 12, 2010 at 8:50 PM, Ralph Campbell
 ralph.campb...@qlogic.com wrote:
  On Tue, 2010-10-12 at 11:38 -0700, Bart Van Assche wrote:
  Hello,
 
  Has anyone already tried to process the work completions generated by
  a HCA after the state of a queue pair has been changed to IB_QPS_ERR ?
  With the hardware/firmware/driver combination I have tested I have
  observed the following:
  * Multiple completions with the same wr_id and nonzero (error) status
  were received by the application, while all work requests queued with
  the flag IB_SEND_SIGNALED had a unique wr_id.
I assume your QP is configured for selective signalling, right? This
means that for successful processing of the work request there will
not be any completion. But for an unsuccessful WR, the hardware should
generate a completion. For these cases it is worth having a
meaningful wr_id.
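
A minimal kernel-verbs sketch of the selective-signalling case being described
(assumptions: the QP was created with sq_sig_type = IB_SIGNAL_REQ_WR and "sge"
already describes a registered buffer):

#include <rdma/ib_verbs.h>

static int post_unsignalled_send(struct ib_qp *qp, struct ib_sge *sge)
{
    struct ib_send_wr *bad_wr;
    struct ib_send_wr wr = {
        .wr_id      = 0x1234,     /* arbitrary cookie */
        .opcode     = IB_WR_SEND,
        .sg_list    = sge,
        .num_sge    = 1,
        .send_flags = 0,          /* IB_SEND_SIGNALED deliberately not set */
    };

    /* Normally no CQE is generated for this WR; on failure the HCA still
     * returns an error completion carrying this wr_id. */
    return ib_post_send(qp, &wr, &bad_wr);
}
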
  * Completions with non-zero (error) status and a wr_id / opcode
  combination were received that were never queued by the application.
In case of error the opcode of the completed operation is not provided.
I am not sure why.

  Note: some work requests were queued with and some without the flag
  IB_SEND_SIGNALED. I'm not sure however whether that has anything to do
  with the observed behavior.
If you have WRs for which you did not set IB_SEND_SIGNALED, they are
not considered completed before a completion entry is pushed to the CQ
that corresponds to that send queue. I am not sure if it means that all
the WRs in the send queue should be completed with error.
 
  This behavior is easy to reproduce. If I interpret the InfiniBand
  Architecture Specification correctly, this behavior is non-compliant.
 
  Has anyone been looking into this before ?
 
  I haven't seen it. It isn't supposed to happen.
 
  What hardware and software are you using and how do you
  reproduce it?
 
 Hello Ralph and Or,
 
 The way I reproduce that behavior is by modifying the state of a queue
 pair into IB_QPS_ERR while RDMA is ongoing. The application, which is
 multithreaded, performs RDMA by calling ib_post_recv() and
 ib_post_send() (opcodes IB_WR_SEND, IB_WR_RDMA_READ and
 IB_WR_RDMA_WRITE). This has been observed with the mlx4 driver, a
 ConnectX HCA and firmware version 2.7.0.
 
 Bart.


[PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37

2010-10-12 Thread Tom Tucker
Hi Bruce,

These fixes are ready for 2.6.37. They fix two bugs in the server-side
NFSRDMA transport.

Thanks,
Tom
---

Tom Tucker (2):
  svcrdma: Cleanup DMA unmapping in error paths.
  svcrdma: Change DMA mapping logic to avoid the page_address kernel API


 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   19 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   82 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   41 +++
 3 files changed, 92 insertions(+), 50 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us


[PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API

2010-10-12 Thread Tom Tucker
There was logic in the send path that assumed that a page containing data
to send to the client has a KVA. This is not always the case and can result
in data corruption when page_address returns zero and we end up DMA mapping
zero.

This patch changes the bus mapping logic to avoid page_address() where
necessary and converts all calls from ib_dma_map_single to ib_dma_map_page
in order to keep the map/unmap calls symmetric.

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   18 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   80 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   18 +++
 3 files changed, 78 insertions(+), 38 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0194de8..926bdb4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -263,9 +263,9 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
     frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
     for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
         frmr->page_list->page_list[page_no] =
-            ib_dma_map_single(xprt->sc_cm_id->device,
-                              page_address(rqstp->rq_arg.pages[page_no]),
-                              PAGE_SIZE, DMA_FROM_DEVICE);
+            ib_dma_map_page(xprt->sc_cm_id->device,
+                            rqstp->rq_arg.pages[page_no], 0,
+                            PAGE_SIZE, DMA_FROM_DEVICE);
         if (ib_dma_mapping_error(xprt->sc_cm_id->device,
                                  frmr->page_list->page_list[page_no]))
             goto fatal_err;
@@ -309,17 +309,21 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
                              int count)
 {
     int i;
+    unsigned long off;
 
     ctxt->count = count;
     ctxt->direction = DMA_FROM_DEVICE;
     for (i = 0; i < count; i++) {
         ctxt->sge[i].length = 0; /* in case map fails */
         if (!frmr) {
+            BUG_ON(0 == virt_to_page(vec[i].iov_base));
+            off = (unsigned long)vec[i].iov_base & ~PAGE_MASK;
             ctxt->sge[i].addr =
-                ib_dma_map_single(xprt->sc_cm_id->device,
-                                  vec[i].iov_base,
-                                  vec[i].iov_len,
-                                  DMA_FROM_DEVICE);
+                ib_dma_map_page(xprt->sc_cm_id->device,
+                                virt_to_page(vec[i].iov_base),
+                                off,
+                                vec[i].iov_len,
+                                DMA_FROM_DEVICE);
             if (ib_dma_mapping_error(xprt->sc_cm_id->device,
                                      ctxt->sge[i].addr))
                 return -EINVAL;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b15e1eb..d4f5e0e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -70,8 +70,8 @@
  * on extra page for the RPCRMDA header.
  */
 static int fast_reg_xdr(struct svcxprt_rdma *xprt,
-                        struct xdr_buf *xdr,
-                        struct svc_rdma_req_map *vec)
+                        struct xdr_buf *xdr,
+                        struct svc_rdma_req_map *vec)
 {
     int sge_no;
     u32 sge_bytes;
@@ -96,21 +96,25 @@ static int fast_reg_xdr(struct svcxprt_rdma *xprt,
     vec->count = 2;
     sge_no++;
 
-    /* Build the FRMR */
+    /* Map the XDR head */
     frmr->kva = frva;
     frmr->direction = DMA_TO_DEVICE;
     frmr->access_flags = 0;
     frmr->map_len = PAGE_SIZE;
     frmr->page_list_len = 1;
+    page_off = (unsigned long)xdr->head[0].iov_base & ~PAGE_MASK;
     frmr->page_list->page_list[page_no] =
-        ib_dma_map_single(xprt->sc_cm_id->device,
-                          (void *)xdr->head[0].iov_base,
-                          PAGE_SIZE, DMA_TO_DEVICE);
+        ib_dma_map_page(xprt->sc_cm_id->device,
+                        virt_to_page(xdr->head[0].iov_base),
+                        page_off,
+                        PAGE_SIZE - page_off,
+                        DMA_TO_DEVICE);
     if (ib_dma_mapping_error(xprt->sc_cm_id->device,
                              frmr->page_list->page_list[page_no]))
         goto fatal_err;
     atomic_inc(&xprt->sc_dma_used);
 
+    /* Map the XDR page list */
     page_off = xdr->page_base;
     page_bytes = xdr->page_len + page_off;
     if (!page_bytes)
@@ -128,9 +132,9 @@ static int fast_reg_xdr(struct svcxprt_rdma 

[PATCH 2/2] svcrdma: Cleanup DMA unmapping in error paths.

2010-10-12 Thread Tom Tucker
There are several error paths in the code that do not unmap DMA. This
patch adds calls to svc_rdma_unmap_dma to free these DMA contexts.

Signed-off-by: Tom Tucker t...@opengridcomputing.com
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |1 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   29 ++---
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 926bdb4..df67211 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -495,6 +495,7 @@ next_sge:
         printk(KERN_ERR "svcrdma: Error %d posting RDMA_READ\n",
                err);
         set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+        svc_rdma_unmap_dma(ctxt);
         svc_rdma_put_context(ctxt, 0);
         goto out;
     }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index d4f5e0e..249a835 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -367,6 +367,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
         goto err;
     return 0;
  err:
+    svc_rdma_unmap_dma(ctxt);
+    svc_rdma_put_frmr(xprt, vec->frmr);
     svc_rdma_put_context(ctxt, 0);
     /* Fatal error, close transport */
     return -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 23f90c3..d22a44d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -511,9 +511,9 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
         ctxt->sge[sge_no].addr = pa;
         ctxt->sge[sge_no].length = PAGE_SIZE;
         ctxt->sge[sge_no].lkey = xprt->sc_dma_lkey;
+        ctxt->count = sge_no + 1;
         buflen += PAGE_SIZE;
     }
-    ctxt->count = sge_no;
     recv_wr.next = NULL;
     recv_wr.sg_list = &ctxt->sge[0];
     recv_wr.num_sge = ctxt->count;
@@ -529,6 +529,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
     return ret;
 
  err_put_ctxt:
+    svc_rdma_unmap_dma(ctxt);
     svc_rdma_put_context(ctxt, 1);
     return -ENOMEM;
 }
@@ -1306,7 +1307,6 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
                          enum rpcrdma_errcode err)
 {
     struct ib_send_wr err_wr;
-    struct ib_sge sge;
     struct page *p;
     struct svc_rdma_op_ctxt *ctxt;
     u32 *va;
@@ -1319,26 +1319,27 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
     /* XDR encode error */
     length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
 
+    ctxt = svc_rdma_get_context(xprt);
+    ctxt->direction = DMA_FROM_DEVICE;
+    ctxt->count = 1;
+    ctxt->pages[0] = p;
+
     /* Prepare SGE for local address */
-    sge.addr = ib_dma_map_page(xprt->sc_cm_id->device,
-                               p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
-    if (ib_dma_mapping_error(xprt->sc_cm_id->device, sge.addr)) {
+    ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
+                                        p, 0, length, DMA_FROM_DEVICE);
+    if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
         put_page(p);
         return;
     }
     atomic_inc(&xprt->sc_dma_used);
-    sge.lkey = xprt->sc_dma_lkey;
-    sge.length = length;
-
-    ctxt = svc_rdma_get_context(xprt);
-    ctxt->count = 1;
-    ctxt->pages[0] = p;
+    ctxt->sge[0].lkey = xprt->sc_dma_lkey;
+    ctxt->sge[0].length = length;
 
     /* Prepare SEND WR */
     memset(&err_wr, 0, sizeof err_wr);
     ctxt->wr_op = IB_WR_SEND;
     err_wr.wr_id = (unsigned long)ctxt;
-    err_wr.sg_list = &sge;
+    err_wr.sg_list = ctxt->sge;
     err_wr.num_sge = 1;
     err_wr.opcode = IB_WR_SEND;
     err_wr.send_flags = IB_SEND_SIGNALED;
@@ -1348,9 +1349,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
     if (ret) {
         dprintk("svcrdma: Error %d posting send for protocol error\n",
                 ret);
-        ib_dma_unmap_page(xprt->sc_cm_id->device,
-                          sge.addr, PAGE_SIZE,
-                          DMA_FROM_DEVICE);
+        svc_rdma_unmap_dma(ctxt);
         svc_rdma_put_context(ctxt, 1);
     }
 }



Re: [RFC 0/2] IB/umad: Export mad snooping to userspace

2010-10-12 Thread Jason Gunthorpe
On Tue, Oct 12, 2010 at 12:10:37PM -0700, Hefty, Sean wrote:
 The kernel mad interface allows a client to view all
 sent and received MADs.  This has proven to be a useful
 debugging technique when paired with the external kernel
 module, madeye.  However, madeye was never intended to
 be submitted upstream.
 
 A couple of alternatives have been proposed for making
 this functionality available in the upstream kernel,
 using trace events or exporting the snooping interface
 to user space.  This patch series takes the latter approach.

TBH, I think this would be much better off integrating with the
existing paths tcpdump/etc use rather than yet again something new
and unique. I think everyone who has to actually use the IB stuff in
real life would be ecstatic if wireshark just worked...

Yes, I realize that is a bit awkward.. But maybe it is time we had a
netdev for the raw IB device?

Jason


Re: [PATCH] mlx4: Limit num of fast reg WRs

2010-10-12 Thread Roland Dreier
  After I posted this patch, I was told that there is yet another
  constraint on the page list: The buffer containing the list must not
  cross a page boundary. So I was thinking what is the best way to deal
  with this. One way is to always allocate a whole page and map it using
  dma_map_page(page, DMA_TO_DEVICE), something like this (not a complete
  patch, just the idea).

Is there any chance of the dma_alloc_coherent() in the current code
allocating memory that crosses a page boundary?

 - R.


RE: [RFC 0/2] IB/umad: Export mad snooping to userspace

2010-10-12 Thread Hefty, Sean
 TBH, I think this would be much better off integrating with the
 existing paths tcpdump/setc uses rather than yet again something new

This ties in with the existing MAD interface, which isn't going away anytime 
soon, if ever.



Re: [patch v3] infiniband: uverbs: handle large number of entries

2010-10-12 Thread Jason Gunthorpe
On Tue, Oct 12, 2010 at 01:31:17PM +0200, Dan Carpenter wrote:
 In the original code there was a potential integer overflow if you
 passed in a large cmd.ne.  The calls to kmalloc() would allocate smaller
 buffers than intended, leading to memory corruption.

Keep in mind these are probably performance sensitive APIs. I was
imagining batching a small number and then one copy_to_user? No idea what
the various performance trade-offs are..

 Please, please, check this.  I've think I've done it right, but I don't
 have the hardware and can not test it.

Nor do I... I actually don't know what hardware uses this path? The
Mellanox cards use a user-space only version.
 
Maybe an iwarp card? I kinda recall some recent messages concerning
memory allocations in these paths for iwarp. I wonder if removing the
allocation is such a big win the larger number of copy_to_user calls
does not matter?

 It's strange to me that we return in_len on success.

Agree..

 +static int copy_header_to_user(void __user *dest, u32 count)
 +{
 + u32 header[2];  /* the second u32 is reserved */
 +
 + memset(header, 0, sizeof(header));

Don't you need header[0] = count?

Maybe:
  u32 header[2] = {count};

And let the compiler zero the other word optimally. Also, I'm not sure it
matters here, since you are zeroing user memory that isn't currently used..

 +static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
 +{
 + struct ib_uverbs_wc tmp;
 +
 + memset(tmp, 0, sizeof(tmp));

I'd really like to see that memset go away for performance. Again
maybe use named initializers and let the compiler zero the
uninitialized members (does it zero padding, I wonder?). Or pre-zero this
memory outside the loop..
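
A sketch of the named-initializer variant being suggested (illustrative only;
the field names come from the patch, and note that zeroing of padding bytes is
not guaranteed by C, which matters when the struct is copied to user space):

static int copy_wc_to_user(void __user *dest, struct ib_wc *wc)
{
    /* members not listed here are zero-initialized; padding may not be */
    struct ib_uverbs_wc tmp = {
        .wr_id          = wc->wr_id,
        .status         = wc->status,
        .opcode         = wc->opcode,
        .vendor_err     = wc->vendor_err,
        .byte_len       = wc->byte_len,
        .ex.imm_data    = (__u32 __force) wc->ex.imm_data,
        .qp_num         = wc->qp->qp_num,
        .src_qp         = wc->src_qp,
        .wc_flags       = wc->wc_flags,
        .pkey_index     = wc->pkey_index,
        .slid           = wc->slid,
        .sl             = wc->sl,
        .dlid_path_bits = wc->dlid_path_bits,
        .port_num       = wc->port_num,
    };

    if (copy_to_user(dest, &tmp, sizeof(tmp)))
        return -EFAULT;
    return 0;
}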

Jason


Re: [RFC 0/2] IB/umad: Export mad snooping to userspace

2010-10-12 Thread Jason Gunthorpe
On Tue, Oct 12, 2010 at 01:54:54PM -0700, Hefty, Sean wrote:
  TBH, I think this would be much better off integrating with the
  existing paths tcpdump/setc uses rather than yet again something new
 
 This ties in with the existing MAD interface, which isn't going away
 anytime soon, if ever.

I didn't say the MAD interface was going away, I said it was not the
interface everything else in the kernel uses for packet capture.

Jason


Re: [PATCH] Make multicast and path record queue flexible.

2010-10-12 Thread Jason Gunthorpe
On Tue, Oct 12, 2010 at 06:29:53PM +0200, Alekseys Senin wrote:
 On Tue, 2010-10-05 at 14:12 -0500, Christoph Lameter wrote:
 
 On Tue, 5 Oct 2010, Jason Gunthorpe wrote:
 
  On Tue, Oct 05, 2010 at 06:07:37PM +0200, Aleksey Senin wrote:
   When using a slow SM, allow more packets to be buffered before the answer
   comes back. This patch is based on an idea of Christoph Lameter.
  
   http://lists.openfabrics.org/pipermail/general/2009-June/059853.html
 
  IMHO, I think it is better to send multicasts to the broadcast MLID 
 than to
  queue them.. More like ethernet that way.
 
 I agree. We had similar ideas. However, the kernel does send igmp
 reports to the MC address, not to 224.0.0.2. We would have to redirect at
 the IB layer until multicast via MLID becomes functional. We cannot tell
 when that will be the case.
 
 But what if it will not be available for some reason? How long
 should we wait?  Do we need implement another queue/counter/timeout?

If you follow the scheme I outlined - where traffic to an MGID that
doesn't yet have an MLID is routed to the broadcast MLID - then you do it
until you get an MLID, with periodic retries/refreshes of the SA
operation.

This is similar to how ethernet works, and is generally
harmless. Better to have a working, but suboptimal network, than one
that is busted.

Jason


Re: [PATCH] mlx4: Limit num of fast reg WRs

2010-10-12 Thread Eli Cohen
On Tue, Oct 12, 2010 at 01:37:37PM -0700, Roland Dreier wrote:
 
 Is there any chance of the dma_alloc_coherent() in the current code
 allocating memory that crosses a page boundary?
 

You mean that the allocation is aligned at least to its size? I could
not find any commitment to this anywhere.


RE: [RFC 0/2] IB/umad: Export mad snooping to userspace

2010-10-12 Thread Hefty, Sean
   TBH, I think this would be much better off integrating with the
   existing paths tcpdump/setc uses rather than yet again something new
 
  This ties in with the existing MAD interface, which isn't going away
  anytime soon, if ever.
 
 I didn't say the MAD interface was going away, I said it was not the
 interface everything else in the kernel uses for packet capture.

My focus is tying this functionality in with the existing IB stack.  The MAD, 
verbs, and HCA drivers do not use net_device, sk_buff, or anything in netdev, 
and I don't have the time or inclination to try to add it.  We have an 
interface that allows registration to receive MADs; this provides a simple 
extension of that interface.

I'm mainly interested in capturing MAD data, not all packets.  We don't have 
access to any of the headers, or even an easy way to know the destination for 
sent MADs.

- Sean


Opensm crash with OFED 1.5

2010-10-12 Thread Suresh Shelvapille

Folks:

I have a multi-processor machine running Fedora Core 12. I have installed OFED
1.5. Everything seems to come up OK; I can look at ibstat and it shows the
Mellanox card stats, etc.

As soon as I start opensm, I get the following kernel oops and the machine
locks up.

Any ideas?

Thanks,
Suri

--

Oct 12 17:19:38 localhost OpenSM[2617]: OpenSM 3.3.5

Oct 12 17:19:38 localhost OpenSM[2617]: Entering DISCOVERING state

Oct 12 17:20:20 localhost kernel: ib0: ib_query_gid() failed

Oct 12 17:20:30 localhost kernel: ib0: ib_query_port failed

Oct 12 17:20:52 localhost kernel: BUG: soft lockup - CPU#15 stuck for 61s! 
[opensm:2637]

Oct 12 17:20:52 localhost kernel: Modules linked in: fuse sunrpc ip6t_REJECT 
nf_conntrack_ipv6 ip6table_filter
ip6_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm 
iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6
ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_mthca 
ib_mad ib_core dm_multipath uinput mlx4_core
igb i2c_i801 joydev dca i2c_core iTCO_wdt iTCO_vendor_support mpt2sas 
scsi_transport_sas [last unloaded: microcode]

Oct 12 17:20:52 localhost kernel: CPU 15:

Oct 12 17:20:52 localhost kernel: Modules linked in: fuse sunrpc ip6t_REJECT 
nf_conntrack_ipv6 ip6table_filter
ip6_tables cpufreq_ondemand acpi_cpufreq freq_table rdma_ucm ib_sdp rdma_cm 
iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6
ib_uverbs ib_umad iw_nes libcrc32c iw_cxgb3 cxgb3 mlx4_en mlx4_ib ib_mthca 
ib_mad ib_core dm_multipath uinput mlx4_core
igb i2c_i801 joydev dca i2c_core iTCO_wdt iTCO_vendor_support mpt2sas 
scsi_transport_sas [last unloaded: microcode]

Oct 12 17:20:52 localhost kernel: Pid: 2637, comm: opensm Not tainted 
2.6.31.5-127.fc12.x86_64 #1 X8DTH-i/6/iF/6F

Oct 12 17:20:52 localhost kernel: RIP: 0010:[81203558]  
[81203558] __bitmap_empty+0x0/0x64

Oct 12 17:20:52 localhost kernel: RSP: 0018:880c174bbd90  EFLAGS: 0246

Oct 12 17:20:52 localhost kernel: RAX:  RBX: 880c174bbdd8 
RCX: 0001

Oct 12 17:20:52 localhost kernel: RDX: 818ba920 RSI: 0100 
RDI: 818ba918

Oct 12 17:20:52 localhost kernel: RBP: 8101286e R08:  
R09: 0004

Oct 12 17:20:52 localhost kernel: R10: 0004 R11: 0206 
R12: 880c174bbdd8

Oct 12 17:20:52 localhost kernel: R13: 8101286e R14: 810dc920 
R15: 880c174bbcf8

Oct 12 17:20:52 localhost kernel: FS:  7ff2d02e7710() 
GS:c90001e0() knlGS:

Oct 12 17:20:52 localhost kernel: CS:  0010 DS:  ES:  CR0: 
80050033

Oct 12 17:20:52 localhost kernel: CR2: 0041f0c0 CR3: 000c19074000 
CR4: 06e0

Oct 12 17:20:52 localhost kernel: DR0:  DR1:  
DR2: 

Oct 12 17:20:52 localhost kernel: DR3:  DR6: 0ff0 
DR7: 0400

Oct 12 17:20:52 localhost kernel: Call Trace:

Oct 12 17:20:52 localhost kernel: [810383f2] ? 
native_flush_tlb_others+0xc3/0xf2

Oct 12 17:20:52 localhost kernel: [8103859d] ? flush_tlb_mm+0x6f/0x76

Oct 12 17:20:52 localhost kernel: [810debbc] ? 
mprotect_fixup+0x480/0x611

Oct 12 17:20:52 localhost kernel: [810da81d] ? free_pgtables+0xa9/0xcc

Oct 12 17:20:52 localhost kernel: [810f185d] ? 
virt_to_head_page+0xe/0x2f

Oct 12 17:20:52 localhost kernel: [810deee9] ? 
sys_mprotect+0x19c/0x227

Oct 12 17:20:52 localhost kernel: [81011cf2] ? 
system_call_fastpath+0x16/0x1b



linux-next: manual merge of the bkl-llseek tree with the infiniband tree

2010-10-12 Thread Stephen Rothwell
Hi Arnd,

Today's linux-next merge of the bkl-llseek tree got a conflict in
drivers/infiniband/hw/cxgb4/device.c between commit
8bbac892fb75d20fa274ca026e24faf00afbf9dd ("RDMA/cxgb4: Add default_llseek
to debugfs files") from the infiniband tree and commit
9711569d06e7df5f02a943fc4138fb152526e719 ("llseek: automatically
add .llseek fop") from the bkl-llseek tree.

Not really a conflict.  The infiniband tree patch is a superset of the
part of the bkl-llseek commit that affects this file.
-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

