Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()

2015-04-21 Thread Tom Tucker

On 4/21/15 2:39 AM, Michael Wang wrote:


On 04/20/2015 05:51 PM, Tom Tucker wrote:
[snip]

int ib_query_gid(struct ib_device *device,
 u8 port_num, int index, union ib_gid *gid);


iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.

Sean suggested adding this helper paired with cap_ib_cm(); maybe there are
some considerations on maintainability?

I also prefer this way since it makes the code more readable ;-)

It's more consistent, but not necessarily more readable -- if by readability we 
mean understanding.

If the reader knows how the transports work, then the reader would be confused 
by the addition of a check that is always true. For the reader that doesn't 
know, the addition of the check implies that the support is optional, which it 
is not.

The purpose is to make sure folks understand what we really want to check
when they review the code :-) and to prepare for the further reform, which may
not rely on the technology type any more; for example, the device could tell the
core layer directly which management support it requires with a bitmask :-)
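
(For illustration only, a hedged sketch of what such a bitmask-based scheme
could look like; the flag names and the per-port field below are hypothetical,
not part of this series:)

/* Hypothetical per-port management capability flags, set once by the
 * driver at registration time instead of being derived from the
 * link-layer/transport type by the core.
 */
enum rdma_mgmt_cap {
	RDMA_MGMT_CAP_IB_CM = 1 << 0,
	RDMA_MGMT_CAP_IW_CM = 1 << 1,
	RDMA_MGMT_CAP_IB_SA = 1 << 2,
};

static inline int cap_iw_cm(struct ib_device *device, u8 port_num)
{
	/* port_mgmt_caps[] is an assumed field, for illustration only */
	return device->port_mgmt_caps[port_num] & RDMA_MGMT_CAP_IW_CM;
}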

Hi Michael,

Thanks for the reply, but my premise was just wrong...I need to review the 
whole patch, not just a snippet.


Thanks,
Tom

Regards,
Michael Wang


Tom


Regards,
Michael Wang



Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()

2015-04-20 Thread Tom Tucker

On 4/20/15 11:19 AM, Jason Gunthorpe wrote:

On Mon, Apr 20, 2015 at 10:51:58AM -0500, Tom Tucker wrote:

On 4/20/15 10:16 AM, Michael Wang wrote:

On 04/20/2015 04:00 PM, Steve Wise wrote:

On 4/20/2015 3:40 AM, Michael Wang wrote:

[snip]

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6805e3e..e4999f6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1818,6 +1818,21 @@ static inline int cap_ib_cm(struct ib_device *device, u8 
port_num)
   return rdma_ib_or_iboe(device, port_num);
   }
   +/**
+ * cap_iw_cm - Check if the port of the device has the IWARP
+ * Communication Manager capability.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device does not support the IWARP
+ * Communication Manager.
+ */
+static inline int cap_iw_cm(struct ib_device *device, u8 port_num)
+{
+return rdma_tech_iwarp(device, port_num);
+}
+
   int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.

Sean suggested adding this helper paired with cap_ib_cm(); maybe there are
some considerations on maintainability?

I also prefer this way since it makes the code more readable ;-)

It's more consistent, but not necessarily more readable -- if by
readability we mean understanding.

If the reader knows how the transports work, then the reader would
be confused by the addition of a check that is always true. For the
reader that doesn't know, the addition of the check implies that the
support is optional, which it is not.

No, it says this code is concerned with the unique parts of iWARP
related to CM, not the other unique parts of iWARP. The check isn't
always true, it is just always true on iWARP devices.

That became the problem with the old way of just saying 'is iWARP'
(and others). There are too many differences, and the why became lost in
many places.

There are now too many standards, and several do not have public docs,
to keep relying on a mess of 'is standard' tests.
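
(For illustration, a hedged sketch of the kind of call site these helpers
document; the dispatch function below is hypothetical:)

/* Hypothetical connection setup: each branch states *which* CM support
 * it needs, rather than *what* the underlying technology is.
 */
static int example_listen(struct ib_device *device, u8 port_num)
{
	if (cap_ib_cm(device, port_num))
		return 0;	/* take the IB CM (ib_cm) listen path */

	if (cap_iw_cm(device, port_num))
		return 0;	/* take the iWARP CM (iw_cm) listen path */

	return -EPROTONOSUPPORT;
}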


You're right, Jason; this gets called with the device handle, so it's only
true for iWARP.



Jason


Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()

2015-04-20 Thread Tom Tucker

On 4/20/15 10:16 AM, Michael Wang wrote:

On 04/20/2015 04:00 PM, Steve Wise wrote:

On 4/20/2015 3:40 AM, Michael Wang wrote:

[snip]

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6805e3e..e4999f6 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1818,6 +1818,21 @@ static inline int cap_ib_cm(struct ib_device *device, u8 
port_num)
   return rdma_ib_or_iboe(device, port_num);
   }
   +/**
+ * cap_iw_cm - Check if the port of the device has the IWARP
+ * Communication Manager capability.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device does not support the IWARP
+ * Communication Manager.
+ */
+static inline int cap_iw_cm(struct ib_device *device, u8 port_num)
+{
+return rdma_tech_iwarp(device, port_num);
+}
+
   int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);
   

iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.

Sean suggested adding this helper paired with cap_ib_cm(); maybe there are
some considerations on maintainability?

I also prefer this way since it makes the code more readable ;-)


It's more consistent, but not necessarily more readable -- if by 
readability we mean understanding.


If the reader knows how the transports work, then the reader would be 
confused by the addition of a check that is always true. For the reader 
that doesn't know, the addition of the check implies that the support is 
optional, which it is not.


Tom


Regards,
Michael Wang






Re: [PATCH v4 27/27] IB/Verbs: Cleanup rdma_node_get_transport()

2015-04-16 Thread Tom Tucker

On 4/16/15 8:45 AM, Michael Wang wrote:


On 04/16/2015 03:42 PM, Hal Rosenstock wrote:

On 4/16/2015 9:41 AM, Michael Wang wrote:


On 04/16/2015 03:36 PM, Hal Rosenstock wrote:
[snip]

-EXPORT_SYMBOL(rdma_node_get_transport);
-
  enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 
port_num)
  {
if (device->get_link_layer)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 262bf44..f9ef479 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -84,9 +84,6 @@ enum rdma_transport_type {
RDMA_TRANSPORT_IBOE,
  };
  
-__attribute_const__ enum rdma_transport_type

-rdma_node_get_transport(enum rdma_node_type node_type);
-
  enum rdma_link_layer {
IB_LINK_LAYER_UNSPECIFIED,

Is IB_LINK_LAYER_UNSPECIFIED still possible ?

Actually it's impossible in the kernel in the first place: all those who implemented
the callback won't return UNSPECIFIED, and the others all have a correct transport type
(otherwise BUG()) and won't result in UNSPECIFIED :-)

Should it be removed from this enum somewhere in this patch series
(perhaps early on) ?
I don't think it's ever been 'possible.' Its purpose is to catch
initialization errors where the transport fails to initialize its
transport type. So for example,


provider = calloc(1, sizeof *provider)

If 0 is a valid link layer type, then you wouldn't catch these kinds of 
errors.
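
(A minimal illustration of that point, with hypothetical names:)

enum example_link_layer {
	EXAMPLE_LINK_LAYER_UNSPECIFIED, /* 0: what calloc()'d memory reads as */
	EXAMPLE_LINK_LAYER_INFINIBAND,
	EXAMPLE_LINK_LAYER_ETHERNET,
};

struct example_provider {
	enum example_link_layer link_layer;
};

/* provider = calloc(1, sizeof *provider) leaves link_layer == 0, i.e.
 * UNSPECIFIED, so a provider that forgot to set it can be detected.
 */
static int example_link_layer_is_set(const struct example_provider *p)
{
	return p->link_layer != EXAMPLE_LINK_LAYER_UNSPECIFIED;
}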


Tom

It is still directly used by helpers like ib_modify_qp_is_ok() as an indicator;
it may be better to reform those parts in another following patch :-)

Regards,
Michael Wang


-- Hal


Regards,
Michael Wang


IB_LINK_LAYER_INFINIBAND,




Re: [PATCH] scsi: fnic: use kernel's '%pM' format option to print MAC

2015-03-19 Thread Tom Tucker

Hi Andy,

On 3/19/15 12:54 PM, Andy Shevchenko wrote:

On Tue, 2014-04-29 at 17:45 +0300, Andy Shevchenko wrote:

Instead of supplying each byte through the stack, let's use the %pM specifier.

Anyone to comment or apply this patch?


Signed-off-by: Andy Shevchenko andriy.shevche...@linux.intel.com
Cc: Tom Tucker t...@opengridcomputing.com
Cc: Steve Wise sw...@opengridcomputing.com
Cc: linux-rdma@vger.kernel.org
---
  drivers/scsi/fnic/vnic_dev.c | 10 ++
  1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/fnic/vnic_dev.c b/drivers/scsi/fnic/vnic_dev.c
index 9795d6f..ba69d61 100644
--- a/drivers/scsi/fnic/vnic_dev.c
+++ b/drivers/scsi/fnic/vnic_dev.c
@@ -499,10 +499,7 @@ void vnic_dev_add_addr(struct vnic_dev *vdev, u8 *addr)
  
  	err = vnic_dev_cmd(vdev, CMD_ADDR_ADD, a0, a1, wait);

if (err)
-   printk(KERN_ERR
-   "Can't add addr [%02x:%02x:%02x:%02x:%02x:%02x], %d\n",
-   addr[0], addr[1], addr[2], addr[3], addr[4], addr[5],
-   err);
+   pr_err("Can't add addr [%pM], %d\n", addr, err);


This looks completely reasonable to me.

Tom

  }
  
  void vnic_dev_del_addr(struct vnic_dev *vdev, u8 *addr)

@@ -517,10 +514,7 @@ void vnic_dev_del_addr(struct vnic_dev *vdev, u8 *addr)
  
  	err = vnic_dev_cmd(vdev, CMD_ADDR_DEL, a0, a1, wait);

if (err)
-   printk(KERN_ERR
-   "Can't del addr [%02x:%02x:%02x:%02x:%02x:%02x], %d\n",
-   addr[0], addr[1], addr[2], addr[3], addr[4], addr[5],
-   err);
+   pr_err("Can't del addr [%pM], %d\n", addr, err);
  }
  
  int vnic_dev_notify_set(struct vnic_dev *vdev, u16 intr)
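
(For reference, a minimal sketch of the two forms; %pM is the kernel's printk
extension for a 6-byte MAC address, and the buffer below is just an example:)

u8 mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };

/* old: push every byte onto the stack */
pr_err("addr [%02x:%02x:%02x:%02x:%02x:%02x]\n",
       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

/* new: pass a pointer to the 6-byte address */
pr_err("addr [%pM]\n", mac);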






Re: NFS over RDMA crashing

2014-03-12 Thread Tom Tucker

Hi Trond,

I think this patch is still 'off-by-one'. We'll take a look at this today.

Thanks,
Tom

On 3/12/14 9:05 AM, Trond Myklebust wrote:

On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:


On Sat, 08 Mar 2014 14:13:44 -0600
Steve Wise sw...@opengridcomputing.com wrote:


On 3/8/2014 1:20 PM, Steve Wise wrote:

I removed your change and started debugging the original crash that
happens on top-of-tree.  Seems like rq_next_page is screwed up.  It
should always be >= rq_respages, yes?  I added a BUG_ON() to assert
this in rdma_read_xdr() and we hit the BUG_ON(). Look:

crash> svc_rqst.rq_next_page 0x8800b84e6000
rq_next_page = 0x8800b84e6228
crash> svc_rqst.rq_respages 0x8800b84e6000
rq_respages = 0x8800b84e62a8

Any ideas Bruce/Tom?


Guys, the patch below seems to fix the problem.  Dunno if it is
correct though.  What do you think?

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..6d62411 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   sge_no++;
   }
   rqstp->rq_respages = &rqstp->rq_pages[sge_no];
+   rqstp->rq_next_page = rqstp->rq_respages;

   /* We should never run out of SGE because the limit is defined to
* support the max allowed RPC data length
@@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct
svcxprt_rdma *xprt,

   /* rq_respages points one past arg pages */
   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+   rqstp->rq_next_page = rqstp->rq_respages;

   /* Create the reply and chunk maps */
   offset = 0;



While this patch avoids the crashing, it apparently isn't correct...I'm
getting IO errors reading files over the mount. :)


I hit the same oops and tested your patch and it seems to have fixed
that particular panic, but I still see a bunch of other mem corruption
oopses even with it. I'll look more closely at that when I get some
time.

FWIW, I can easily reproduce that by simply doing something like:

   $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1

I'm not sure why you're not seeing any panics with your patch in place.
Perhaps it's due to hw differences between our test rigs.

The EIO problem that you're seeing is likely the same client bug that
Chuck recently fixed in this patch:

   [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA

AIUI, Trond is merging that set for 3.15, so I'd make sure your client
has those patches when testing.


Nothing is in my queue yet.

_
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com



Re: Proposal for simplifying NFS/RDMA client memory registration

2014-03-01 Thread Tom Tucker

Hi Chuck,

I have a patch for the server side that simplifies the memory registration 
and fixes a bug where the server ignores the FRMR hardware limits. This 
bug is actually upstream now.


I have been sitting on it because it's a big patch and will require a lot 
of testing/review to get it upstream. This is just an FYI in case there is
someone on your team who has the bandwidth to take this work and finish it up.


Thanks,
Tom

On 2/28/14 8:59 PM, Chuck Lever wrote:

Hi Wendy-

On Feb 28, 2014, at 5:26 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:


On Fri, Feb 28, 2014 at 2:20 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Fri, Feb 28, 2014 at 1:41 PM, Tom Talpey t...@talpey.com wrote:

On 2/26/2014 8:44 AM, Chuck Lever wrote:

Hi-

Shirley Ma and I are reviving work on the NFS/RDMA client code base in
the Linux kernel.  So far we've built and run functional tests to determine
what is working and what is broken.

[snip]



ALLPHYSICAL - Usually fast, but not safe as it exposes client memory.
All HCAs support this mode.


Not safe is an understatement. It exposes all of client physical
memory to the peer, for both read and write. A simple pointer error
on the server will silently corrupt the client. This mode was
intended only for testing, and in experimental deployments.

(sorry, resend .. previous reply bounced back due to gmail html format)

Please keep ALLPHYSICAL for now  - as our embedded system needs it.

This is just the client side.  Confirming that you still need support for the 
ALLPHYSICAL memory registration mode in the NFS/RDMA client.

Do you have plans to move to a mode that is less risky?  If not, can we depend 
on you to perform regular testing with ALLPHYSICAL as we update the client 
code?  Do you have any bug fixes you’d like to merge upstream?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: MLX4 Cq Question

2013-05-20 Thread Tom Tucker

Hi Guys,

One other quick one. I've received conflicting claims on the validity of 
the wc.opcode when wc.status != 0 for mlx4 hardware.


My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe 
owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an 
error and therefore, the only way to recover what the opcode was is 
through the wr_id you used when submitting the WR.


Is my reading of the code correct?

Thanks,
Tom

On 5/20/13 9:53 AM, Jack Morgenstein wrote:

On Saturday 18 May 2013 00:37, Roland Dreier wrote:

On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com wrote:

I'm looking at the Linux MLX4 net driver and found something that confuses me 
mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the 
mlx4_ib_completion function does not take any kind of lock when looking up the 
SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition 
if I go look at the code paths where cq are removed from this tree, they are 
protected by spin_lock_irq. So I am baffled at this point as to what the 
locking strategy is and how this is supposed to work. I'm sure I'm missing 
something and would greatly appreciate it if someone would explain this.

This is a bit tricky.  If you look at

void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
{
 struct mlx4_priv *priv = mlx4_priv(dev);
 struct mlx4_cq_table *cq_table = &priv->cq_table;
 int err;

 err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
 if (err)
 mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n",
err, cq->cqn);

 synchronize_irq(priv->eq_table.eq[cq->vector].irq);

 spin_lock_irq(&cq_table->lock);
 radix_tree_delete(&cq_table->tree, cq->cqn);
 spin_unlock_irq(&cq_table->lock);

 if (atomic_dec_and_test(&cq->refcount))
 complete(&cq->free);
 wait_for_completion(&cq->free);

 mlx4_cq_free_icm(dev, cq->cqn);
}

you see that when freeing a CQ, we first do the HW2SW_CQ firmware
command; once this command completes, no more events will be generated
for that CQ.  Then we do synchronize_irq for the CQ's interrupt
vector.  Once that completes, no more completion handlers will be
running for the CQ, so we can safely delete the CQ from the radix tree
(relying on the radix tree's safety of deleting one entry while
possibly looking up other entries, so no lock is needed).  We also use
the lock to synchronize against the CQ event function, which as you
noted does take the lock too.

Basic idea is that we're tricky and careful so we can make the fast
path (completion interrupt handling) lock-free, but then use locks and
whatever else needed in the slow path (CQ async event handling, CQ
destroy).

  - R.

===

Roland, unfortunately we have seen that we need some locking on the
cq completion handler (there is a stack trace which resulted from this
lack of proper locking).
In our current driver, we are using the patch below (which uses RCU locking
instead of spinlocks).  I can prepare a proper patch for the upstream kernel.

===
net/mlx4_core: Fix racy flow in the driver CQ completion handler

The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
the radix tree which is used to manage the table of CQs, nor does it increase
the reference count of the CQ before invoking the user provided callback
(and decrease it afterwards).

This is racy and can cause use-after-free, null pointer dereference, etc, which
result in kernel crashes.

To fix this, we must do the following in mlx4_cq_completion:
- increase the ref count on the cq before invoking the user callback, and
   decrement it after the callback.
- Place a lock around the radix tree lookup/ref-count-increase

Using an irq spinlock will not fix this issue. The problem is that under VPI,
the ETH interface uses multiple msix irq's, which can result in one cq 
completion
event interrupting another in-progress cq completion event. A deadlock results
when the handler for the first cq completion grabs the spinlock, and is
interrupted by the second completion before it has a chance to release the 
spinlock.
The handler for the second completion will deadlock waiting for the spinlock
to be released.

The proper fix is to use the RCU mechanism for locking radix-tree accesses in
the cq completion event handler (The radix-tree implementation uses the RCU
mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the
updater, will do the job).

Note that the same issue exists in mlx4_cq_event() (the cq async event
handler), which also takes the same lock on the radix tree. Here, we replace the
spinlock with an rcu_read_lock().
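
(A minimal sketch of that approach for the completion handler, assuming the
existing cq_table radix tree and the CQ refcount/free fields; the CQ alloc/free
paths would need matching RCU-aware updates. This is an illustration, not the
final upstream patch:)

static void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
{
	struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
	struct mlx4_cq *cq;

	rcu_read_lock();
	cq = radix_tree_lookup(&cq_table->tree, cqn & (dev->caps.num_cqs - 1));
	if (cq)
		atomic_inc(&cq->refcount);	/* pin the CQ across the callback */
	rcu_read_unlock();

	if (!cq) {
		mlx4_warn(dev, "Completion event for bogus CQ %08x\n", cqn);
		return;
	}

	++cq->arm_sn;

	cq->comp(cq);

	if (atomic_dec_and_test(&cq->refcount))
		complete(&cq->free);
}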

This patch was motivated by the following report from the field:

[...] box panic'ed when trying to find a completion queue. There is
no corruption but there is a possible race which could

Re: MLX4 Cq Question

2013-05-20 Thread Tom Tucker

On 5/20/13 2:58 PM, Hefty, Sean wrote:

My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
error and therefore, the only way to recover what the opcode was is
through the wr_id you used when submitting the WR.

Is my reading of the code correct?

I believe this is true wrt the IB spec.

Thanks, this was my recollection as well.

Tom
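
(For reference, a hedged sketch of the usual pattern: carry per-WR context
behind wr_id so the operation can still be identified when the completion is
in error; the structure and names below are illustrative, not from any driver:)

/* Illustrative per-work-request context, pointed to by wr_id. */
struct my_wr_ctx {
	int	posted_opcode;	/* what we actually posted (send/read/write) */
	void	*buf;		/* resource to release on completion */
};

static void drain_cq(struct ib_cq *cq)
{
	struct ib_wc wc;

	while (ib_poll_cq(cq, 1, &wc) > 0) {
		struct my_wr_ctx *ctx =
			(struct my_wr_ctx *)(unsigned long)wc.wr_id;

		if (wc.status != IB_WC_SUCCESS) {
			/* don't trust wc.opcode here; use our own record */
			pr_err("WR failed: posted op %d, status %d\n",
			       ctx->posted_opcode, wc.status);
			continue;
		}
		/* wc.opcode is valid on successful completions */
	}
}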



MLX4 Cq Question

2013-05-17 Thread Tom Tucker

Hi Roland,

I'm looking at the Linux MLX4 net driver and found something that confuses 
me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, 
the mlx4_ib_completion function does not take any kind of lock when 
looking up the SW CQ in the radix tree, however, the mlx4_cq_event 
function does. In addition if I go look at the code paths where cq are 
removed from this tree, they are protected by spin_lock_irq. So I am 
baffled at this point as to what the locking strategy is and how this is 
supposed to work. I'm sure I'm missing something and would greatly 
appreciate it if someone would explain this.


Thanks,
Tom



Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Tucker

On 4/30/13 9:38 AM, Yan Burman wrote:



-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]
Sent: Tuesday, April 30, 2013 17:20
To: Yan Burman
Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-
r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On 4/30/2013 1:09 AM, Yan Burman wrote:

I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
...
  ib_send_bw with intel iommu enabled did get up to 4.5GB/sec

BTW, you may want to verify that these are the same GB. Many benchmarks
say KB/MB/GB when they really mean KiB/MiB/GiB.

At GB/GiB, the difference is about 7.5%, very close to the difference between
4.1 and 4.5.

Just a thought.

The question is not why there is 400MBps difference between ib_send_bw and 
NFSoRDMA.
The question is why with IOMMU ib_send_bw got to the same bandwidth as without 
it while NFSoRDMA got half.
NFSRDMA is constantly registering and unregistering memory when you use 
FRMR mode. By contrast IPoIB has a descriptor ring that is set up once 
and re-used. I suspect this is the difference maker. Have you tried 
running the server in ALL_PHYSICAL mode, i.e. where it uses a DMA_MR for 
all of memory?


Tom

From some googling, it seems that when IOMMU is enabled, dma mapping functions 
get a lot more expensive.
Perhaps that is the reason for the performance drop.

Yan




Re: NFS over RDMA benchmark

2013-04-29 Thread Tom Tucker

On 4/29/13 7:16 AM, Yan Burman wrote:



-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Monday, April 29, 2013 08:35
To: J. Bruce Fields
Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:


On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

...

[snip]


 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner

That's the inode i_mutex.


 14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).


  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave


And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --
norandommap --group_reporting --exitall --buffered=0


I tried block sizes from 4-512K.
4K does not give 2.2GB bandwidth - optimal bandwidth is achieved around 
128-256K block size


* inode's i_mutex:
If increasing process/file count didn't help, maybe increase iodepth
(say 512 ?) could offset the i_mutex overhead a little bit ?


I tried with different iodepth parameters, but found no improvement above 
iodepth 128.


* xpt_mutex:
(no idea)

* iova_rbtree_lock
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better but maybe sequential
IO (instead of randread) could help ? Bigger block size (instead of 4K) can
help ?



I think the biggest issue is that max_payload for TCP is 2MB but only 
256k for RDMA.



I am trying to simulate real load (more or less), that is the reason I use 
randread. Anyhow, read does not result in better performance.
It's probably because backing storage is tmpfs...

Yan



Re: NFS over RDMA benchmark

2013-04-29 Thread Tom Tucker

On 4/29/13 8:05 AM, Tom Tucker wrote:

On 4/29/13 7:16 AM, Yan Burman wrote:



-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Monday, April 29, 2013 08:35
To: J. Bruce Fields
Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields 
bfie...@fieldses.org wrote:



On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). running over IPoIB-CM, I get 
200-980MB/sec.

...

[snip]


 36.18%  nfsd [kernel.kallsyms]   [k] mutex_spin_on_owner

That's the inode i_mutex.


 14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

  9.63%  nfsd [kernel.kallsyms]   [k] 
_raw_spin_lock_irqsave



And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --
ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 
--randrepeat=1 --

norandommap --group_reporting --exitall --buffered=0


I tried block sizes from 4-512K.
4K does not give 2.2GB bandwidth - optimal bandwidth is achieved 
around 128-256K block size



* inode's i_mutex:
If increasing process/file count didn't help, maybe increase iodepth
(say 512 ?) could offset the i_mutex overhead a little bit ?

I tried with different iodepth parameters, but found no improvement 
above iodepth 128.



* xpt_mutex:
(no idea)

* iova_rbtree_lock
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better but maybe 
sequential
IO (instead of randread) could help ? Bigger block size (instead 
of 4K) can

help ?



I think the biggest issue is that max_payload for TCP is 2MB but only 
256k for RDMA.


Sorry, I mean 1MB...



I am trying to simulate real load (more or less), that is the reason 
I use randread. Anyhow, read does not result in better performance.

It's probably because backing storage is tmpfs...

Yan



Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Tucker

On 4/25/13 3:04 PM, Tom Talpey wrote:

On 4/25/2013 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.


Hi Tom !

It was a shot in the dark :) .. as our test bed has not been set up
yet. However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this


The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
for is called rdma_slot_table_entries.


moment as /proc entry is easy to add. More questions on the server
though (see below) ...



2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.



Could you give more hints on which part of the path is CPU limited ?


Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server
spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
has some ideas on the srv rdma code, but it could also be in the sunrpc
or infiniband driver layers, can't really tell without the call stacks.


The Mellanox driver uses red-black trees extensively for resource 
management, e.g. QP ID, CQ ID, etc... When completions come in from the 
HW, these are used to find the associated software data structures I 
believe. It is certainly possible that these trees get hot on lookup when 
we're pushing a lot of data. I'm surprised, however, to see 
rb_insert_color there because I'm not aware of anywhere that resources
are being inserted into and/or removed from a red-black tree in the data path.


They are also used by IPoIB and the IB CM, however, connections should not 
be coming and going unless we've got other problems. IPoIB is only used by 
the IB transport for connection set up and my impression is that this 
trace is for the IB transport.


I don't believe that red-black trees are used by either the client or 
server transports directly. Note that the rb_lock in the client is for 
buffers; not, as the name might imply, a red-black tree.


I think the key here is to discover what lock is being waited on. Are we 
certain that it's a lock on a red-black tree and if so, which one?


Tom



Is there a known Linux-based filesystem that is reasonbly tuned for
NFS-RDMA ? Any specific filesystem features would work well with
NFS-RDMA ? I'm wondering when disk+FS are added into the
configuration, how much advantages would NFS-RDMA get when compared
with a plain TCP/IP, say IPOIB on CM , transport ?


NFS-RDMA is not really filesystem dependent, but certainly there are
considerations for filesystems to support NFS, and of course the goal in
general is performance. NFS-RDMA is a network transport, applicable to
both client and server. Filesystem choice is a server consideration.

I don't have a simple answer to your question about how much better
NFS-RDMA is over other transports. Architecturally, a lot. In practice,
there are many, many variables. Have you seen RFC5532, that I cowrote
with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532


Re: NFS over RDMA crashing

2013-02-07 Thread Tom Tucker

On 2/6/13 3:28 PM, Steve Wise wrote:

On 2/6/2013 4:24 PM, J. Bruce Fields wrote:

On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:

When killing mount command that got stuck:
---

BUG: unable to handle kernel paging request at 880324dc7ff8
IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
PGD 1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
Oops: 0003 [#1] PREEMPT SMP
Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm iw_cm
ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
target_core_file target_core_pscsi target_core_mod configfs 8021q
bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel
kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3 jbd
sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod
CPU 6
Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro
X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[a05f3dfb] [a05f3dfb]
rdma_read_xdr+0x8bb/0xd40 [svcrdma]
RSP: 0018:880324c3dbf8  EFLAGS: 00010297
RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
R13: 0003 R14: 0001 R15: 0010
FS:  () GS:88063fc0() 
knlGS:

CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process nfsd (pid: 4744, threadinfo 880324c3c000, task 
88033055)

Stack:
  880324c3dc78 880324c3dcd8 0282 880631cec000
  880324dd8000 88062ed33040 000124c3dc48 880324dd8000
  88062ed33058 880630ce2b90 8806299e8000 0003
Call Trace:
  [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
  [81086540] ? try_to_wake_up+0x2f0/0x2f0
  [a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
  [a0571e5d] nfsd+0xad/0x130 [nfsd]
  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
  [81071df6] kthread+0xd6/0xe0
  [81071d20] ? __init_kthread_worker+0x70/0x70
  [814b462c] ret_from_fork+0x7c/0xb0
  [81071d20] ? __init_kthread_worker+0x70/0x70
Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00
48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00
RIP  [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
  RSP 880324c3dbf8
CR2: 880324dc7ff8
---[ end trace 06d0384754e9609a ]---


It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e
nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
is responsible for the crash (it seems to be crashing in
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527)
It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.

When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I
was no longer getting the server crashes,
so the reset of my tests were done using that point (it is somewhere
in the middle of 3.7.0-rc2).

OK, so this part's clearly my fault--I'll work on a patch, but the
rdma's use of the ->rq_pages array is pretty confusing.


Maybe Tom can shed some light?


Yes, the RDMA transport has two confusing tweaks on rq_pages. Most
transports (UDP/TCP) use the rq_pages allocated by SVC. For RDMA,
however, the RQ already contains pre-allocated memory that will contain
inbound NFS requests from the client. Instead of copying this data from
the pre-registered receive buffer into the buffer in rq_pages, I just
replace the page in rq_pages with the one that already contains the data.
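
(A minimal sketch of that page swap, with illustrative names; the real logic
lives in rdma_build_arg_xdr():)

/* Illustration only: hand the pre-posted RDMA receive page straight to
 * the svc_rqst instead of copying its contents into rq_pages[i].
 */
static void example_swap_page(struct svc_rqst *rqstp, int i,
			      struct page *recv_page)
{
	put_page(rqstp->rq_pages[i]);	/* drop the page SVC allocated */
	rqstp->rq_pages[i] = recv_page;	/* rqstp now owns the receive page */
}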


The second somewhat strange thing is that the NFS request contains an 
NFSRDMA header. This is just like TCP (i.e. the 4B length), however, the 
difference is that (unlike TCP) this header is needed for the response 
because it maps out where in the client the response data will be written.


Tom




[PATCH 0/2] RPCRDMA Fixes

2011-02-09 Thread Tom Tucker
This pair of patches fixes a problem with the marshalling of XDR into
RPCRDMA messages and an issue with FRMR mapping in the presence of transport
errors. The problems were discovered together as part of looking into the
ENOSPC problems seen by spe...@shiftmail.com.

The fixes, however, are independent and do not rely on each other. I have tested
them independently and together on 64b with both InfiniBand and iWARP. They have
been compile tested on 32b.

---

Tom Tucker (2):
  RPCRDMA: Fix FRMR registration/invalidate handling.
  RPCRDMA: Fix to XDR page base interpretation in marshalling logic.


 net/sunrpc/xprtrdma/rpc_rdma.c  |   86 +++
 net/sunrpc/xprtrdma/verbs.c |   52 
 net/sunrpc/xprtrdma/xprt_rdma.h |1 
 3 files changed, 87 insertions(+), 52 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us


[PATCH 1/2] RPCRDMA: Fix to XDR page base interpretation in marshalling logic.

2011-02-09 Thread Tom Tucker
The RPCRDMA marshalling logic assumed that xdr->page_base was an
offset into the first page of xdr->page_list. It is in fact an
offset into the xdr->page_list itself, that is, it selects the
first page in the page_list and the offset into that page.

The symptom depended in part on the rpc_memreg_strategy, if it was
FRMR, or some other one-shot mapping mode, the connection would get
torn down on a base and bounds error. When the badly marshalled RPC
was retransmitted it would reconnect, get the error, and tear down the
connection again in a loop forever. This resulted in a hung-mount. For
the other modes, it would result in silent data corruption. This bug is
most easily reproduced by writing more data than the filesystem
has space for.

This fix corrects the page_base assumption and otherwise simplifies
the iov mapping logic.
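
(A small worked example of the corrected interpretation, assuming
PAGE_SIZE == 4096:)

/* page_base = 5000 selects pages[5000 >> PAGE_SHIFT] == pages[1] with an
 * in-page offset of (5000 & ~PAGE_MASK) == 904; the old code would have
 * used pages[0] at offset 5000 and run past the end of the page.
 */
ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
page_base = xdrbuf->page_base & ~PAGE_MASK;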

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/rpc_rdma.c |   86 
 1 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2ac3f6e..554d081 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -87,6 +87,8 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg, int nsegs)
 {
int len, n = 0, p;
+   int page_base;
+   struct page **ppages;
 
if (pos == 0 && xdrbuf->head[0].iov_len) {
seg[n].mr_page = NULL;
@@ -95,34 +97,32 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int 
pos,
++n;
}
 
-   if (xdrbuf->page_len && (xdrbuf->pages[0] != NULL)) {
-   if (n == nsegs)
-   return 0;
-   seg[n].mr_page = xdrbuf->pages[0];
-   seg[n].mr_offset = (void *)(unsigned long) xdrbuf->page_base;
-   seg[n].mr_len = min_t(u32,
-   PAGE_SIZE - xdrbuf->page_base, xdrbuf->page_len);
-   len = xdrbuf->page_len - seg[n].mr_len;
+   len = xdrbuf->page_len;
+   ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
+   page_base = xdrbuf->page_base & ~PAGE_MASK;
+   p = 0;
+   while (len && n < nsegs) {
+   seg[n].mr_page = ppages[p];
+   seg[n].mr_offset = (void *)(unsigned long) page_base;
+   seg[n].mr_len = min_t(u32, PAGE_SIZE - page_base, len);
+   BUG_ON(seg[n].mr_len > PAGE_SIZE);
+   len -= seg[n].mr_len;
++n;
-   p = 1;
-   while (len > 0) {
-   if (n == nsegs)
-   return 0;
-   seg[n].mr_page = xdrbuf->pages[p];
-   seg[n].mr_offset = NULL;
-   seg[n].mr_len = min_t(u32, PAGE_SIZE, len);
-   len -= seg[n].mr_len;
-   ++n;
-   ++p;
-   }
+   ++p;
+   page_base = 0;  /* page offset only applies to first page */
}
 
+   /* Message overflows the seg array */
+   if (len && n == nsegs)
+   return 0;
+
if (xdrbuf->tail[0].iov_len) {
/* the rpcrdma protocol allows us to omit any trailing
 * xdr pad bytes, saving the server an RDMA operation. */
if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
return n;
if (n == nsegs)
+   /* Tail remains, but we're out of segments */
return 0;
seg[n].mr_page = NULL;
seg[n].mr_offset = xdrbuf->tail[0].iov_base;
@@ -296,6 +296,8 @@ rpcrdma_inline_pullup(struct rpc_rqst *rqst, int pad)
int copy_len;
unsigned char *srcp, *destp;
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
+   int page_base;
+   struct page **ppages;
 
destp = rqst->rq_svec[0].iov_base;
curlen = rqst->rq_svec[0].iov_len;
@@ -324,28 +326,25 @@ rpcrdma_inline_pullup(struct rpc_rqst *rqst, int pad)
__func__, destp + copy_len, curlen);
rqst->rq_svec[0].iov_len += curlen;
}
-
-   r_xprt->rx_stats.pullup_copy_count += copy_len;
-   npages = PAGE_ALIGN(rqst->rq_snd_buf.page_base+copy_len) >> PAGE_SHIFT;
+
+   page_base = rqst->rq_snd_buf.page_base;
+   ppages = rqst->rq_snd_buf.pages + (page_base >> PAGE_SHIFT);
+   page_base &= ~PAGE_MASK;
+   npages = PAGE_ALIGN(page_base+copy_len) >> PAGE_SHIFT;
for (i = 0; copy_len && i < npages; i++) {
-   if (i == 0)
-   curlen = PAGE_SIZE - rqst->rq_snd_buf.page_base;
-   else
-   curlen = PAGE_SIZE;
+   curlen = PAGE_SIZE - page_base;
if (curlen > copy_len)
curlen = copy_len;
dprintk(RPC:   %s: page %d destp 0x%p

[PATCH 2/2] RPCRDMA: Fix FRMR registration/invalidate handling.

2011-02-09 Thread Tom Tucker
When the rpc_memreg_strategy is 5, FRMRs are used to map RPC data.
This mode uses an FRMR to map the RPC data, then invalidates
(i.e. unregisters) the data in xprt_rdma_free. These FRMRs are used
across connections on the same mount, i.e. if the connection goes
away on an idle timeout and reconnects later, the FRMRs are not
destroyed and recreated.

This creates a problem for transport errors because the WR that
invalidates an FRMR may be flushed (i.e. fail), leaving the
FRMR valid. When the FRMR is later used to map an RPC it will fail,
tearing down the transport and starting over. Over time, more and
more of the FRMR pool end up in the wrong state resulting in
seemingly random disconnects.

This fix keeps track of the FRMR state explicitly by setting its
state based on the successful completion of a reg/inv WR. If the FRMR
is ever used and found to be in the wrong state, an invalidate WR
is prepended, re-syncing the FRMR state and avoiding the connection loss.

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/verbs.c |   52 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |1 +
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5f4c7b3..570f08d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -144,6 +144,7 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void 
*context)
 static inline
 void rpcrdma_event_process(struct ib_wc *wc)
 {
+   struct rpcrdma_mw *frmr;
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long) wc->wr_id;
 
@@ -154,15 +155,23 @@ void rpcrdma_event_process(struct ib_wc *wc)
return;
 
if (IB_WC_SUCCESS != wc->status) {
-   dprintk("RPC:   %s: %s WC status %X, connection lost\n",
-   __func__, (wc->opcode & IB_WC_RECV) ? "recv" : "send",
-   wc->status);
+   dprintk("RPC:   %s: WC opcode %d status %X, connection lost\n",
+   __func__, wc->opcode, wc->status);
rep->rr_len = ~0U;
-   rpcrdma_schedule_tasklet(rep);
+   if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
+   rpcrdma_schedule_tasklet(rep);
return;
}
 
switch (wc->opcode) {
+   case IB_WC_FAST_REG_MR:
+   frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   frmr->r.frmr.state = FRMR_IS_VALID;
+   break;
+   case IB_WC_LOCAL_INV:
+   frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   frmr->r.frmr.state = FRMR_IS_INVALID;
+   break;
case IB_WC_RECV:
rep->rr_len = wc->byte_len;
ib_dma_sync_single_for_cpu(
@@ -1450,6 +1459,11 @@ rpcrdma_map_one(struct rpcrdma_ia *ia, struct 
rpcrdma_mr_seg *seg, int writing)
seg->mr_dma = ib_dma_map_single(ia->ri_id->device,
seg->mr_offset,
seg->mr_dmalen, seg->mr_dir);
+   if (ib_dma_mapping_error(ia->ri_id->device, seg->mr_dma)) {
+   dprintk("RPC:   %s: mr_dma %llx mr_offset %p mr_dma_len %zu\n",
+   __func__,
+   seg->mr_dma, seg->mr_offset, seg->mr_dmalen);
+   }
 }
 
 static void
@@ -1469,7 +1483,8 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_mr_seg *seg1 = seg;
-   struct ib_send_wr frmr_wr, *bad_wr;
+   struct ib_send_wr invalidate_wr, frmr_wr, *bad_wr, *post_wr;
+
u8 key;
int len, pageoff;
int i, rc;
@@ -1484,6 +1499,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
rpcrdma_map_one(ia, seg, writing);
seg1->mr_chunk.rl_mw->r.frmr.fr_pgl->page_list[i] = seg->mr_dma;
len += seg->mr_len;
+   BUG_ON(seg->mr_len > PAGE_SIZE);
++seg;
++i;
/* Check for holes */
@@ -1494,26 +1510,45 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg 
*seg,
dprintk("RPC:   %s: Using frmr %p to map %d segments\n",
__func__, seg1->mr_chunk.rl_mw, i);
 
+   if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.state == FRMR_IS_VALID)) {
+   dprintk("RPC:   %s: frmr %x left valid, posting invalidate.\n",
+   __func__,
+   seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey);
+   /* Invalidate before using. */
+   memset(&invalidate_wr, 0, sizeof invalidate_wr);
+   invalidate_wr.wr_id = (unsigned long)(void *)seg1->mr_chunk.rl_mw;
+   invalidate_wr.next = &frmr_wr;
+   invalidate_wr.opcode = IB_WR_LOCAL_INV;
+   invalidate_wr.send_flags = IB_SEND_SIGNALED

Re: NFS-RDMA hangs: connection closed (-103)

2010-12-09 Thread Tom Tucker

On 12/8/10 9:10 AM, Spelic wrote:
Tom, have you reproduced the RDMA hangs - connection closes bug or 
the sparse file at server side upon NFS hitting ENOSPC ?


Because for the latter people have already given exhaustive 
explanation: see this other thread at 
http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ 



While the former bug is still open and very interesting for us.

I'm working on the 'former' bug. The bug that I think you've run into
concerns how RDMA transport errors are handled and how RPCs are retried in
the event of an error. With hard mounts (which I suspect you have),
the RPC will be retried forever. In this bug, the transport never
'recovers' after the error and therefore the RPC never succeeds and the
mount is effectively hung.


There were bugs fixed in this area between 34 and top-of-tree, which is why
you now see the less catastrophic, but still broken, behavior.


Unfortunately I can only support this part-time, but I'll keep you 
updated on the progress.


Thanks for finding this and helping to debug,
Tom


Thanks for your help
S.


On 12/07/2010 05:12 PM, Tom Tucker wrote:

Status update...

I have reproduced the bug a number of different ways. It seems to be 
most easily reproduced by simply writing more data than the 
filesystem has space for. I can do this reliably with any FS. I think 
the XFS bug may have tickled this bug somehow.


Tom

On 12/2/10 1:09 PM, Spelic wrote:

Hello all
please be aware that the file oversize bug is reproducible also 
without infiniband, with just nfs over ethernet over xfs over 
ramdisk (but it doesn't hang, so it's a different bug than the one I 
posted here at the RDMA mailing list)
I have posted another thread regarding the file oversize bug, 
which you can read in the LVM, XFS, and LKML mailing lists, please 
have a look
http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ 

Especially my second post, replying myself at +30 minutes, explains 
that it's reproducible also with ethernet.


Thank you

On 12/02/2010 07:37 PM, Roland Dreier wrote:
Adding Dave Chinner to the cc list, since he's both an XFS guru as 
well

as being very familiar with NFS and RDMA...

Dave, if you read below, it seems there is some strange behavior
exporting XFS with NFS/RDMA.

  - R.

  On 12/02/2010 12:59 AM, Tom Tucker wrote:
   Spelic,
 
   I have seen this problem before, but have not been able to 
reliably
   reproduce it. When I saw the problem, there were no transport 
errors
   and it appeared as if the I/O had actually completed, but that 
the

   waiter was not being awoken. I was not able to reliably reproduce
   the problem and was not able to determine if the problem was a
   latent bug in NFS in general or a bug in the RDMA transport in
   particular.
 
   I will try your setup here, but I don't have a system like 
yours so

   I'll have to settle for a smaller ramdisk, however, I have a few
   questions:
 
   - Does the FS matter? For example, can you use ext[2-4] on the
   ramdisk and not still reproduce
   - As I mentioned earlier NFS v3 vs. NFS v4
   - RAMDISK size, i.e. 2G vs. 14G
 
   Thanks,
   Tom

  Hello Tom, thanks for replying

  - The FS matters to some extent: as I wrote, with ext4 it's not
  possible to reproduce the bug in this way, so immediately and
  reliably, however ext4 also will hang eventually if you work on 
it for

  hours so I had to switch to IPoIB for our real work; reread my
  previous post.

  - NFS3 not tried yet. Never tried to do RDMA on NFS3... do you 
have a

  pointer on instructions?


  - RAMDISK size: I am testing it.

  Ok I confirm with 1.5GB ramdisk it's reproducible.
  boot option ramdisk_size=1572864
  (1.5*1024**2=1572864.0)
  confirm: blockdev --getsize64 /dev/ram0 == 1610612736

  now at server side mkfs and mount with defaults:
  mkfs.xfs /dev/ram0
  mount /dev/ram0 /mnt/ram
  (this is a simplification over my previous email, and it's 
needed with

  a smaller ramdisk or mkfs.xfs will refuse to work. The bug is still
  reproducible like this)


  DOH! another bug:
  It's strange how at the end of the test
  ls -lh /mnt/ram
  at server side will show a zerofile larger than 1.5GB at the end of
  the procedure, sometimes it's 3GB, sometimes it's 2.3GB... but it's
  larger than the ramdisk size.

  # ll -h /mnt/ram
  total 1.5G
  drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
  drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
  -rw-r--r-- 1 root root 2.3G 2010-12-02 12:59 zerofile
  # df -h
  FilesystemSize  Used Avail Use% Mounted on
  /dev/sda1 294G  4.1G  275G   2% /
  devtmpfs  7.9G  184K  7.9G   1% /dev
  none  7.9G 0  7.9G   0% /dev/shm
  none  7.9G  100K  7.9G   1% /var/run
  none  7.9G 0  7.9G   0% /var/lock
  none  7.9G 0  7.9G   0% /lib/init/rw
  /dev/ram0 1.5G  1.5G   20K 100% /mnt/ram

  # dd

Re: NFS-RDMA hangs: connection closed (-103)

2010-12-07 Thread Tom Tucker

Status update...

I have reproduced the bug a number of different ways. It seems to be most 
easily reproduced by simply writing more data than the filesystem has 
space for. I can do this reliably with any FS. I think the XFS bug may 
have tickled this bug somehow.


Tom

On 12/2/10 1:09 PM, Spelic wrote:

Hello all
please be aware that the file oversize bug is reproducible also 
without infiniband, with just nfs over ethernet over xfs over ramdisk 
(but it doesn't hang, so it's a different bug than the one I posted here 
at the RDMA mailing list)
I have posted another thread regarding the file oversize bug, which 
you can read in the LVM, XFS, and LKML mailing lists, please have a look
http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ 

Especially my second post, replying myself at +30 minutes, explains that 
it's reproducible also with ethernet.


Thank you

On 12/02/2010 07:37 PM, Roland Dreier wrote:

Adding Dave Chinner to the cc list, since he's both an XFS guru as well
as being very familiar with NFS and RDMA...

Dave, if you read below, it seems there is some strange behavior
exporting XFS with NFS/RDMA.

  - R.

  On 12/02/2010 12:59 AM, Tom Tucker wrote:
   Spelic,
 
   I have seen this problem before, but have not been able to reliably
   reproduce it. When I saw the problem, there were no transport errors
   and it appeared as if the I/O had actually completed, but that the
   waiter was not being awoken. I was not able to reliably reproduce
   the problem and was not able to determine if the problem was a
   latent bug in NFS in general or a bug in the RDMA transport in
   particular.
 
   I will try your setup here, but I don't have a system like yours so
   I'll have to settle for a smaller ramdisk, however, I have a few
   questions:
 
   - Does the FS matter? For example, can you use ext[2-4] on the
   ramdisk and not still reproduce
   - As I mentioned earlier NFS v3 vs. NFS v4
   - RAMDISK size, i.e. 2G vs. 14G
 
   Thanks,
   Tom

  Hello Tom, thanks for replying

  - The FS matters to some extent: as I wrote, with ext4 it's not
  possible to reproduce the bug in this way, so immediately and
  reliably, however ext4 also will hang eventually if you work on it for
  hours so I had to switch to IPoIB for our real work; reread my
  previous post.

  - NFS3 not tried yet. Never tried to do RDMA on NFS3... do you have a
  pointer on instructions?


  - RAMDISK size: I am testing it.

  Ok I confirm with 1.5GB ramdisk it's reproducible.
  boot option ramdisk_size=1572864
  (1.5*1024**2=1572864.0)
  confirm: blockdev --getsize64 /dev/ram0 == 1610612736

  now at server side mkfs and mount with defaults:
  mkfs.xfs /dev/ram0
  mount /dev/ram0 /mnt/ram
  (this is a simplification over my previous email, and it's needed with
  a smaller ramdisk or mkfs.xfs will refuse to work. The bug is still
  reproducible like this)


  DOH! another bug:
  It's strange how at the end of the test
  ls -lh /mnt/ram
  at server side will show a zerofile larger than 1.5GB at the end of
  the procedure, sometimes it's 3GB, sometimes it's 2.3GB... but it's
  larger than the ramdisk size.

  # ll -h /mnt/ram
  total 1.5G
  drwxr-xr-x 2 root root   21 2010-12-02 12:54 ./
  drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
  -rw-r--r-- 1 root root 2.3G 2010-12-02 12:59 zerofile
  # df -h
  FilesystemSize  Used Avail Use% Mounted on
  /dev/sda1 294G  4.1G  275G   2% /
  devtmpfs  7.9G  184K  7.9G   1% /dev
  none  7.9G 0  7.9G   0% /dev/shm
  none  7.9G  100K  7.9G   1% /var/run
  none  7.9G 0  7.9G   0% /var/lock
  none  7.9G 0  7.9G   0% /lib/init/rw
  /dev/ram0 1.5G  1.5G   20K 100% /mnt/ram

  # dd if=/mnt/ram/zerofile | wc -c
  4791480+0 records in
  4791480+0 records out
  2453237760
  2453237760 bytes (2.5 GB) copied, 8.41821 s, 291 MB/s

  It seems there is also an XFS bug here...

  This might help triggering the bug however please note than ext4
  (nfs-rdma over it) also hanged on us and it was real work on HDD disks
  and they were not full... after switching to IPoIB it didn't hang
  anymore.

  On IPoIB the size problem also shows up: final file is 2.3GB instead
  of  1.5GB, however nothing hangs:

  # echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo
  syncing now ; time sync ; echo finished
  begin
  dd: writing `/mnt/nfsram/zerofile': Input/output error
  2497+0 records in
  2496+0 records out
  2617245696 bytes (2.6 GB) copied, 10.4 s, 252 MB/s
  syncing now

  real0m0.057s
  user0m0.000s
  sys 0m0.000s
  finished

  I think I noticed the same problem with a 14GB ramdisk, the file ended
  up to be about 15GB, but at that time I thought I made some
  computation mistakes. Now with a smaller ramdisk it's more obvious.

  Sooner or later someone should notify the XFS developers of the
  size bug.

  However

[RFC PATCH 0/2] IB/uverbs: Add support for registering mmapped memory

2010-12-02 Thread Tom Tucker
This patch series adds the ability for a user-mode program to register
mmapped memory.  The capability was developed to support the sharing of
device memory, for example PCI-E static/flash ram devices, on the network
with RDMA.  It is also useful for sharing kernel resident data with distributed
system monitoring applications (e.g. vmstats) at zero overhead to the
monitored host.

---

Tom Tucker (2):
  IB/uverbs: Add support for user registration of mmap memory
  IB/uverbs: Add memory type to ib_umem structure


 drivers/infiniband/core/umem.c |  272 +---
 include/rdma/ib_umem.h |6 +
 2 files changed, 259 insertions(+), 19 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory

2010-12-02 Thread Tom Tucker
Personally, I think the biggest issue is that the PFN-to-DMA-address
mapping logic is not portable.


On 12/2/10 1:35 PM, Ralph Campbell wrote:

I understand the need for something like this patch since
the GPU folks would also like to mmap memory (although the
memory is marked as vma->vm_flags & VM_IO).

It seems to me that duplicating the page walking code is
the wrong approach and exporting a new interface from
mm/memory.c is more appropriate.

Perhaps, but that's kernel proper (not a module) and has its own issues. 
For example, it represents an exported kernel interface and therefore a 
kernel compatibility commitment going forward. I suggest that a new kernel 
interface be a separate effort that this code could utilize going forward.



Also, the quick check to find_vma() is essentially duplicated
if get_user_pages() is called


You need to know the type before you know how to handle it. Unless you 
want to tear up get_user_pages, I think this double lookup on a 
non-performance path is a non-issue.



and it doesn't handle the case
when the region spans multiple vma regions with different flags.


Actually, it specifically does not allow that and I'm not sure that is 
something you would want to support.



Maybe we can modify get_user_pages to have a new flag which
allows VM_PFNMAP segments to be accessed as IB memory regions.
The problem is that VM_PFNMAP means there is no corresponding
struct page to handle reference counting. What happens if the
device that exports the VM_PFNMAP memory is hot removed?


Bus Address Error.


Can the device reference count be incremented to prevent that?



I don't think that would go in this code, it would go in the driver that 
gave the user the address in the first place.



On Thu, 2010-12-02 at 11:02 -0800, Tom Tucker wrote:

Added support to the ib_umem_get helper function for handling
mmaped memory.

Signed-off-by: Tom Tuckert...@ogc.us
---

  drivers/infiniband/core/umem.c |  272 +---
  1 files changed, 253 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 415e186..357ca5e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -52,30 +52,24 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
int i;

 	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
-		ib_dma_unmap_sg(dev, chunk->page_list,
-				chunk->nents, DMA_BIDIRECTIONAL);
-		for (i = 0; i < chunk->nents; ++i) {
-			struct page *page = sg_page(&chunk->page_list[i]);
-
-			if (umem->writable && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
-		}
+		if (umem->type == IB_UMEM_MEM_MAP) {
+			ib_dma_unmap_sg(dev, chunk->page_list,
+					chunk->nents, DMA_BIDIRECTIONAL);
+			for (i = 0; i < chunk->nents; ++i) {
+				struct page *page =
+					sg_page(&chunk->page_list[i]);
+
+				if (umem->writable && dirty)
+					set_page_dirty_lock(page);
+				put_page(page);
+			}
+		}
 		kfree(chunk);
 	}
 }

-/**
- * ib_umem_get - Pin and DMA map userspace memory.
- * @context: userspace context to pin memory for
- * @addr: userspace virtual address to start at
- * @size: length of region to pin
- * @access: IB_ACCESS_xxx flags for memory being pinned
- * @dmasync: flush in-flight DMA when the memory region is written
- */
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-   size_t size, int access, int dmasync)
+static struct ib_umem *__umem_get(struct ib_ucontext *context,
+ unsigned long addr, size_t size,
+ int access, int dmasync)
  {
struct ib_umem *umem;
struct page **page_list;
@@ -100,6 +94,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
if (!umem)
return ERR_PTR(-ENOMEM);

+	umem->type      = IB_UMEM_MEM_MAP;
 	umem->context   = context;
 	umem->length    = size;
 	umem->offset    = addr & ~PAGE_MASK;
@@ -215,6 +210,245 @@ out:

return ret  0 ? ERR_PTR(ret) : umem;
  }
+
+/*
+ * Return the PFN for the specified address in the vma. This only
+ * works for a vma that is VM_PFNMAP.
+ */
+static unsigned long __follow_io_pfn(struct vm_area_struct *vma,
+unsigned long address, int write)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+   spinlock_t *ptl;
+   unsigned long pfn;
+	struct mm_struct *mm = vma->vm_mm;
+
+   pgd

Re: NFS-RDMA hangs: connection closed (-103)

2010-12-01 Thread Tom Tucker

Hi Spelic,

Can you reproduce this with an nfsv3 mount?

On 12/1/10 5:13 PM, Spelic wrote:

Hello all

First of all: I have tried to send this message to the list at least 3 
times but it doesn't seem to get through (and I'm given no error back).
It was very long with 2 attachments... is it because of that? What are 
the limits of this ML?

This time I will shorten it a bit and remove the attachments.

Here is my problem:
I am trying to use NFS over RDMA. It doesn't work: hangs very soon.
I tried kernel 2.6.32 from ubuntu 10.04, and then I tried the most 
recent upstream 2.6.37-rc4 compiled from source. They behave basically 
the same regarding the NFS mount itself, only difference is that 2.6.32 
will hang the complete operating system when nfs hangs, while 2.6.37-rc4 
(after nfs hangs) will only hang processes which launch sync or list nfs 
directories. Anyway the mount is hanged forever; does not resolve by 
itself.

IPoIB nfs mounts appear to work flawlessly, the problem is with RDMA only.

Hardware: (identical client and server machines)
07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx 
HCA] (rev 20)

Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at d880 (64-bit, non-prefetchable) [size=1M]
Memory at d800 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 2
Capabilities: [48] Vital Product Data ?
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/5 Enable-

Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
Capabilities: [60] Express Endpoint, MSI 00
Kernel driver in use: ib_mthca
Kernel modules: ib_mthca

Mainboard = Supermicro X7DWT with embedded infiniband.

This is my test:
on server I make a big 14GB ramdisk (exact boot option: 
ramdisk_size=14680064), format xfs and mount like this:

mkfs.xfs -f -l size=128m -d agcount=16 /dev/ram0
mount -o nobarrier,inode64,logbufs=8,logbsize=256k /dev/ram0 
/mnt/ram/

On the client I mount like this (fstab):
10.100.0.220:/   /mnt/nfsram   nfs4
_netdev,auto,defaults,rdma,port=20049  0  0


Then on the client I perform
echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo 
syncing now ; sync ; echo finished


It hangs as soon as it reaches the end of the 14GB of space, but never 
writes syncing now. It seems like the disk full message triggers the 
hangup reliably on NFS over RDMA over XFS over ramdisk; other 
combinations are not so reliable for triggering the bug (e.g. ext4).


However please note that this is not an XFS problem in itself: we had 
another hangup on an ext4 filesystem on NFS on RDMA on real disks for 
real work after a few hours (and it hadn't hit the disk full 
situation); this technique with XFS on ramdisk is just more reliably 
reproducible.


Note that the hangup does not happen on NFS over IPoIB (no RDMA) over 
XFS over ramdisk. It's really an RDMA-only bug.
On the other machine (2.6.32) that was doing real work on real disks I 
am now mounting over IPoIB without RDMA and in fact that one is still 
running reliably.


The dd process hangs like this: (/proc/pid/stack)
[810f8f75] sync_page+0x45/0x60
[810f9143] wait_on_page_bit+0x73/0x80
[810f9590] filemap_fdatawait_range+0x110/0x1a0
[810f9720] filemap_write_and_wait_range+0x70/0x80
[811766ba] vfs_fsync_range+0x5a/0xa0
[8117676c] vfs_fsync+0x1c/0x20
[a02bda1d] nfs_file_write+0xdd/0x1f0 [nfs]
[8114d4fa] do_sync_write+0xda/0x120
[8114d808] vfs_write+0xc8/0x190
[8114e061] sys_write+0x51/0x90
[8100c042] system_call_fastpath+0x16/0x1b
[] 0x

The dd process is not killable with -9 . Stays alive and hanged.

In the dmesg (client) you can see this line immediately, as soon as 
transfer stops (iostat -n 1) and dd hangs up:

 [ 3072.884988] rpcrdma: connection to 10.100.0.220:20049 closed (-103)

after a while you can see this in dmesg
[ 3242.890030] INFO: task dd:2140 blocked for more than 120 seconds.
[ 3242.890132] echo 0  /proc/sys/kernel/hung_task_timeout_secs 
disables this message.
[ 3242.890239] ddD 88040a8f0398 0  2140   2113 
0x
[ 3242.890243]  88040891fb38 0082 88040891fa98 
88040891fa98
[ 3242.890248]  000139c0 88040a8f 88040a8f0398 
88040891ffd8
[ 3242.890251]  88040a8f03a0 000139c0 88040891e010 
000139c0

[ 3242.890255] Call Trace:
[ 3242.890264]  [81035509] ? default_spin_lock_flags+0x9/0x10
[ 3242.890269]  [810f8f30] ? sync_page+0x0/0x60
[ 3242.890273]  [8157b824] io_schedule+0x44/0x60
[ 3242.890276]  [810f8f75] sync_page+0x45/0x60
[ 3242.890279]  [8157c0bf] __wait_on_bit+0x5f/0x90
[ 3242.890281]  

Re: NFS-RDMA hangs: connection closed (-103)

2010-12-01 Thread Tom Tucker

Spelic,

I have seen this problem before, but have not been able to reliably 
reproduce it. When I saw the problem, there were no transport errors and 
it appeared as if the I/O had actually completed, but that the waiter was 
not being awoken. I was not able to reliably reproduce the problem and was 
not able to determine if the problem was a latent bug in NFS in general or 
a bug in the RDMA transport in particular.


I will try your setup here, but I don't have a system like yours so I'll 
have to settle for a smaller ramdisk, however, I have a few questions:


- Does the FS matter? For example, can you use ext[2-4] on the ramdisk and 
not still reproduce

- As I mentioned earlier NFS v3 vs. NFS v4
- RAMDISK size, i.e. 2G vs. 14G

Thanks,
Tom

On 12/1/10 5:13 PM, Spelic wrote:

Hello all

First of all: I have tried to send this message to the list at least 3 
times but it doesn't seem to get through (and I'm given no error back).
It was very long with 2 attachments... is it because of that? What are 
the limits of this ML?

This time I will shorten it a bit and remove the attachments.

Here is my problem:
I am trying to use NFS over RDMA. It doesn't work: hangs very soon.
I tried kernel 2.6.32 from ubuntu 10.04, and then I tried the most 
recent upstream 2.6.37-rc4 compiled from source. They behave basically 
the same regarding the NFS mount itself, only difference is that 2.6.32 
will hang the complete operating system when nfs hangs, while 2.6.37-rc4 
(after nfs hangs) will only hang processes which launch sync or list nfs 
directories. Anyway the mount is hanged forever; does not resolve by 
itself.

IPoIB nfs mounts appear to work flawlessly, the problem is with RDMA only.

Hardware: (identical client and server machines)
07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx 
HCA] (rev 20)

Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at d880 (64-bit, non-prefetchable) [size=1M]
Memory at d800 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 2
Capabilities: [48] Vital Product Data ?
Capabilities: [90] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/5 Enable-

Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32
Capabilities: [60] Express Endpoint, MSI 00
Kernel driver in use: ib_mthca
Kernel modules: ib_mthca

Mainboard = Supermicro X7DWT with embedded infiniband.

This is my test:
on server I make a big 14GB ramdisk (exact boot option: 
ramdisk_size=14680064), format xfs and mount like this:

mkfs.xfs -f -l size=128m -d agcount=16 /dev/ram0
mount -o nobarrier,inode64,logbufs=8,logbsize=256k /dev/ram0 
/mnt/ram/

On the client I mount like this (fstab):
10.100.0.220:/   /mnt/nfsram   nfs4
_netdev,auto,defaults,rdma,port=20049  0  0


Then on the client I perform
echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo 
syncing now ; sync ; echo finished


It hangs as soon as it reaches the end of the 14GB of space, but never 
writes syncing now. It seems like the disk full message triggers the 
hangup reliably on NFS over RDMA over XFS over ramdisk; other 
combinations are not so reliable for triggering the bug (e.g. ext4).


However please note that this is not an XFS problem in itself: we had 
another hangup on an ext4 filesystem on NFS on RDMA on real disks for 
real work after a few hours (and it hadn't hit the disk full 
situation); this technique with XFS on ramdisk is just more reliably 
reproducible.


Note that the hangup does not happen on NFS over IPoIB (no RDMA) over 
XFS over ramdisk. It's really an RDMA-only bug.
On the other machine (2.6.32) that was doing real work on real disks I 
am now mounting over IPoIB without RDMA and in fact that one is still 
running reliably.


The dd process hangs like this: (/proc/pid/stack)
[810f8f75] sync_page+0x45/0x60
[810f9143] wait_on_page_bit+0x73/0x80
[810f9590] filemap_fdatawait_range+0x110/0x1a0
[810f9720] filemap_write_and_wait_range+0x70/0x80
[811766ba] vfs_fsync_range+0x5a/0xa0
[8117676c] vfs_fsync+0x1c/0x20
[a02bda1d] nfs_file_write+0xdd/0x1f0 [nfs]
[8114d4fa] do_sync_write+0xda/0x120
[8114d808] vfs_write+0xc8/0x190
[8114e061] sys_write+0x51/0x90
[8100c042] system_call_fastpath+0x16/0x1b
[] 0x

The dd process is not killable with -9 . Stays alive and hanged.

In the dmesg (client) you can see this line immediately, as soon as 
transfer stops (iostat -n 1) and dd hangs up:

 [ 3072.884988] rpcrdma: connection to 10.100.0.220:20049 closed (-103)

after a while you can see this in dmesg
[ 3242.890030] INFO: task dd:2140 blocked for more than 120 seconds.
[ 3242.890132] echo 0  /proc/sys/kernel/hung_task_timeout_secs 

Re: Problem Pinning Physical Memory

2010-11-30 Thread Tom Tucker

On 11/30/10 9:24 AM, Alan Cook wrote:

Tom Tuckert...@...  writes:

Yes. I removed the new verb and followed Jason's recommendation of adding
this support to the core reg_mr support. I used the type bits in the vma
struct to determine the type of memory being registered and just did the
right thing.

I'll repost in the next day or two.

Tom


Tom,

Couple of questions:

I noticed that OFED 1.5.3 was released last week.  Are the changes you speak of
part of that release?

No.

If not, is there an alternate branch/project that I should
be looking at or into to for the mentioned changes?

The patch will be against the top-of-tree Linux kernel.

Also, I am inferring that the changes allowing the registering of physical
memory will only happen if my application is running in kernel space.


Actually, no.


  Is this
correct? or will I be able to register the physical memory from user space now
as well?
What I implemented was support for mmap'd memory. In practical terms for 
your application you would write a driver that supported the mmap file op. 
The driver's mmap routine would ioremap the pci memory of interest and 
stuff it in the provided vma. The user-mode app then ibv_reg_mr the 
address and length returned by mmap.
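
Roughly like this (an untested sketch; the device structure, its fields and 
the BAR bookkeeping are made up for illustration, only io_remap_pfn_range() 
and pgprot_noncached() are the standard kernel API):

/* Hypothetical driver-side mmap: hand a PCI BAR (or any other PFN range)
 * to user space. io_remap_pfn_range() marks the vma VM_IO | VM_PFNMAP,
 * which is what the registration code keys on. */
static int my_iomem_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_iomem_dev *dev = filp->private_data;	/* illustrative */
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > dev->bar_len)
		return -EINVAL;

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return io_remap_pfn_range(vma, vma->vm_start,
				  dev->bar_phys >> PAGE_SHIFT,
				  size, vma->vm_page_prot);
}

User space never sees the physical address; it simply registers whatever 
mmap() returned.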


Make sense?
Tom


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: Problem Pinning Physical Memory

2010-11-29 Thread Tom Tucker

On 11/29/10 11:10 AM, Steve Wise wrote:



On 11/24/2010 11:42 AM, Jason Gunthorpe wrote:


The last time this came up I said that the kernel side of ibv_reg_mr
should do the right thing for all types of memory that are mmap'd into
a process and I still think that is true. RDMA to device memory could
be very useful and with things like GEM managing the allocation of
device (video) memory to userspace, so it can be done safely.

Jason


Tom posted changes to support this a while back.  Tom, do you have an 
updated patch series for this support?


Yes. I removed the new verb and followed Jason's recommendation of adding 
this support to the core reg_mr support. I used the type bits in the vma 
struct to determine the type of memory being registered and just did the 
right thing.


I'll repost in the next day or two.

Tom


Steve.


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API

2010-11-17 Thread Tom Tucker

On 11/16/10 1:39 PM, Or Gerlitz wrote:

  Tom Tuckert...@ogc.us  wrote:


This patch changes the bus mapping logic to avoid page_address() where necessary

Hi Tom,

Does where necessary mean that the invocations of page_address
which remain in the code after this patch is applied are safe and
no kmap call is needed?


That's the premise. Please let me know if something looks suspicious.

Thanks,
Tom


Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37

2010-10-12 Thread Tom Tucker
Hi Bruce,

These fixes are ready for 2.6.37. They fix two bugs in the server-side
NFSRDMA transport.

Thanks,
Tom
---

Tom Tucker (2):
  svcrdma: Cleanup DMA unmapping in error paths.
  svcrdma: Change DMA mapping logic to avoid the page_address kernel API


 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   19 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   82 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   41 +++
 3 files changed, 92 insertions(+), 50 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API

2010-10-12 Thread Tom Tucker
There was logic in the send path that assumed that a page containing data
to send to the client has a KVA. This is not always the case and can result
in data corruption when page_address returns zero and we end up DMA mapping
zero.

This patch changes the bus mapping logic to avoid page_address() where
necessary and converts all calls from ib_dma_map_single to ib_dma_map_page
in order to keep the map/unmap calls symmetric.
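
The failure mode and the fix, side by side (a sketch reusing names from the
hunks below, not additional code in the patch):

	/* Old pattern: on a page with no kernel mapping, page_address()
	 * returns NULL and we silently DMA-map address 0. */
	void *kva = page_address(rqstp->rq_arg.pages[page_no]); /* may be NULL */
	frmr->page_list->page_list[page_no] =
		ib_dma_map_single(xprt->sc_cm_id->device,
				  kva, PAGE_SIZE, DMA_FROM_DEVICE);

	/* New pattern: map the struct page plus an offset, no KVA needed. */
	frmr->page_list->page_list[page_no] =
		ib_dma_map_page(xprt->sc_cm_id->device,
				rqstp->rq_arg.pages[page_no], 0,
				PAGE_SIZE, DMA_FROM_DEVICE);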

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   18 ---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   80 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   18 +++
 3 files changed, 78 insertions(+), 38 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0194de8..926bdb4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -263,9 +263,9 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
 	frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
 	for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
 		frmr->page_list->page_list[page_no] =
-			ib_dma_map_single(xprt->sc_cm_id->device,
-					  page_address(rqstp->rq_arg.pages[page_no]),
-					  PAGE_SIZE, DMA_FROM_DEVICE);
+			ib_dma_map_page(xprt->sc_cm_id->device,
+					rqstp->rq_arg.pages[page_no], 0,
+					PAGE_SIZE, DMA_FROM_DEVICE);
 		if (ib_dma_mapping_error(xprt->sc_cm_id->device,
 					 frmr->page_list->page_list[page_no]))
 			goto fatal_err;
@@ -309,17 +309,21 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
 int count)
 {
 	int i;
+	unsigned long off;
 
 	ctxt->count = count;
 	ctxt->direction = DMA_FROM_DEVICE;
 	for (i = 0; i < count; i++) {
 		ctxt->sge[i].length = 0; /* in case map fails */
 		if (!frmr) {
+			BUG_ON(0 == virt_to_page(vec[i].iov_base));
+			off = (unsigned long)vec[i].iov_base & ~PAGE_MASK;
 			ctxt->sge[i].addr =
-				ib_dma_map_single(xprt->sc_cm_id->device,
-						  vec[i].iov_base,
-						  vec[i].iov_len,
-						  DMA_FROM_DEVICE);
+				ib_dma_map_page(xprt->sc_cm_id->device,
+						virt_to_page(vec[i].iov_base),
+						off,
+						vec[i].iov_len,
+						DMA_FROM_DEVICE);
 			if (ib_dma_mapping_error(xprt->sc_cm_id->device,
 						 ctxt->sge[i].addr))
 				return -EINVAL;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b15e1eb..d4f5e0e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -70,8 +70,8 @@
  * on extra page for the RPCRMDA header.
  */
 static int fast_reg_xdr(struct svcxprt_rdma *xprt,
-struct xdr_buf *xdr,
-struct svc_rdma_req_map *vec)
+   struct xdr_buf *xdr,
+   struct svc_rdma_req_map *vec)
 {
int sge_no;
u32 sge_bytes;
@@ -96,21 +96,25 @@ static int fast_reg_xdr(struct svcxprt_rdma *xprt,
vec-count = 2;
sge_no++;
 
-   /* Build the FRMR */
+   /* Map the XDR head */
frmr-kva = frva;
frmr-direction = DMA_TO_DEVICE;
frmr-access_flags = 0;
frmr-map_len = PAGE_SIZE;
frmr-page_list_len = 1;
+   page_off = (unsigned long)xdr-head[0].iov_base  ~PAGE_MASK;
frmr-page_list-page_list[page_no] =
-   ib_dma_map_single(xprt-sc_cm_id-device,
- (void *)xdr-head[0].iov_base,
- PAGE_SIZE, DMA_TO_DEVICE);
+   ib_dma_map_page(xprt-sc_cm_id-device,
+   virt_to_page(xdr-head[0].iov_base),
+   page_off,
+   PAGE_SIZE - page_off,
+   DMA_TO_DEVICE);
if (ib_dma_mapping_error(xprt-sc_cm_id-device,
 frmr-page_list-page_list[page_no]))
goto fatal_err;
atomic_inc(xprt-sc_dma_used);
 
+   /* Map the XDR page list */
page_off = xdr-page_base;
page_bytes = xdr-page_len + page_off;
if (!page_bytes)
@@ -128,9 +132,9 @@ static int fast_reg_xdr(struct svcxprt_rdma

[PATCH 2/2] svcrdma: Cleanup DMA unmapping in error paths.

2010-10-12 Thread Tom Tucker
There are several error paths in the code that do not unmap DMA. This
patch adds calls to svc_rdma_unmap_dma to free these DMA contexts.

Signed-off-by: Tom Tucker t...@opengridcomputing.com
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |1 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   29 ++---
 3 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 926bdb4..df67211 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -495,6 +495,7 @@ next_sge:
printk(KERN_ERR svcrdma: Error %d posting RDMA_READ\n,
   err);
set_bit(XPT_CLOSE, xprt-sc_xprt.xpt_flags);
+   svc_rdma_unmap_dma(ctxt);
svc_rdma_put_context(ctxt, 0);
goto out;
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index d4f5e0e..249a835 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -367,6 +367,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
svc_rqst *rqstp,
goto err;
return 0;
  err:
+   svc_rdma_unmap_dma(ctxt);
+   svc_rdma_put_frmr(xprt, vec-frmr);
svc_rdma_put_context(ctxt, 0);
/* Fatal error, close transport */
return -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 23f90c3..d22a44d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -511,9 +511,9 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
ctxt-sge[sge_no].addr = pa;
ctxt-sge[sge_no].length = PAGE_SIZE;
ctxt-sge[sge_no].lkey = xprt-sc_dma_lkey;
+   ctxt-count = sge_no + 1;
buflen += PAGE_SIZE;
}
-   ctxt-count = sge_no;
recv_wr.next = NULL;
recv_wr.sg_list = ctxt-sge[0];
recv_wr.num_sge = ctxt-count;
@@ -529,6 +529,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
return ret;
 
  err_put_ctxt:
+   svc_rdma_unmap_dma(ctxt);
svc_rdma_put_context(ctxt, 1);
return -ENOMEM;
 }
@@ -1306,7 +1307,6 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, 
struct rpcrdma_msg *rmsgp,
 enum rpcrdma_errcode err)
 {
struct ib_send_wr err_wr;
-   struct ib_sge sge;
struct page *p;
struct svc_rdma_op_ctxt *ctxt;
u32 *va;
@@ -1319,26 +1319,27 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, 
struct rpcrdma_msg *rmsgp,
/* XDR encode error */
length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
 
+   ctxt = svc_rdma_get_context(xprt);
+   ctxt-direction = DMA_FROM_DEVICE;
+   ctxt-count = 1;
+   ctxt-pages[0] = p;
+
/* Prepare SGE for local address */
-   sge.addr = ib_dma_map_page(xprt-sc_cm_id-device,
-  p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
-   if (ib_dma_mapping_error(xprt-sc_cm_id-device, sge.addr)) {
+   ctxt-sge[0].addr = ib_dma_map_page(xprt-sc_cm_id-device,
+   p, 0, length, DMA_FROM_DEVICE);
+   if (ib_dma_mapping_error(xprt-sc_cm_id-device, ctxt-sge[0].addr)) {
put_page(p);
return;
}
atomic_inc(xprt-sc_dma_used);
-   sge.lkey = xprt-sc_dma_lkey;
-   sge.length = length;
-
-   ctxt = svc_rdma_get_context(xprt);
-   ctxt-count = 1;
-   ctxt-pages[0] = p;
+   ctxt-sge[0].lkey = xprt-sc_dma_lkey;
+   ctxt-sge[0].length = length;
 
/* Prepare SEND WR */
memset(err_wr, 0, sizeof err_wr);
ctxt-wr_op = IB_WR_SEND;
err_wr.wr_id = (unsigned long)ctxt;
-   err_wr.sg_list = sge;
+   err_wr.sg_list = ctxt-sge;
err_wr.num_sge = 1;
err_wr.opcode = IB_WR_SEND;
err_wr.send_flags = IB_SEND_SIGNALED;
@@ -1348,9 +1349,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, 
struct rpcrdma_msg *rmsgp,
if (ret) {
dprintk(svcrdma: Error %d posting send for protocol error\n,
ret);
-   ib_dma_unmap_page(xprt-sc_cm_id-device,
- sge.addr, PAGE_SIZE,
- DMA_FROM_DEVICE);
+   svc_rdma_unmap_dma(ctxt);
svc_rdma_put_context(ctxt, 1);
}
 }

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] libmthca: Add support for the reg_io_mr verb.

2010-08-02 Thread Tom Tucker

Tziporet Koren wrote:

Hi Tom,
What is the purpose of this?
Is there a reason you did it only for mthca and not mlx4?

Tziporet
  

Hi Tziporet,

I just picked mthca arbitrarily to demonstrate how to do it. If people 
like the verb, then I'll do it for all the devices, but I didn't want to 
do all that work when there are likely to be changes.


But the point is that this is certainly not mthca-only functionality.

Thanks,
Tom

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/4] ibverbs: new verbs for registering I/O memory

2010-07-29 Thread Tom Tucker
The following patches add verbs for registering I/O memory from user-space.

This capability allows device memory to be registered. More specifically,
any VM_PFNMAP vma can be registered.

The mmap service is used to obtain the address of this memory and provide it
to user space. This is where any security policy would be implemented.

The ib_iomem_get service requires that any address provided by the service
be in a VMA owned by the process. This precludes providing 'random' addresses
to the service to acquire access to arbitrary memory locations.
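
A rough user-space usage sketch under the proposed verbs (the device node,
length, and access flags are illustrative only; the driver behind it is
whatever exports the memory via mmap):

/* needs <fcntl.h>, <sys/mman.h>, and <infiniband/verbs.h> */
/* Map IO memory exported by a (hypothetical) driver, then register it
 * with the proposed ibv_reg_io_mr() verb. */
static struct ibv_mr *register_io_buffer(struct ibv_pd *pd, size_t len)
{
	int fd = open("/dev/my_iomem_dev", O_RDWR);
	void *buf;

	if (fd < 0)
		return NULL;
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
		return NULL;

	return ibv_reg_io_mr(pd, buf, len,
			     IBV_ACCESS_LOCAL_WRITE |
			     IBV_ACCESS_REMOTE_READ);
}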
---

Tom Tucker (4):
  mthca: Add support for reg_io_mr and unreg_io_mr
  uverbs_cmd: Add uverbs command definitions for reg_io_mr
  uverbs: Add common ib_iomem_get service
  ibverbs: Add new provider verb for I/O memory registration


 drivers/infiniband/core/umem.c   |  248 +-
 drivers/infiniband/core/uverbs.h |2 
 drivers/infiniband/core/uverbs_cmd.c |  140 +++
 drivers/infiniband/core/uverbs_main.c|2 
 drivers/infiniband/hw/mthca/mthca_provider.c |  111 
 include/rdma/ib_umem.h   |   14 +
 include/rdma/ib_user_verbs.h |   24 ++-
 include/rdma/ib_verbs.h  |5 +
 8 files changed, 534 insertions(+), 12 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 1/4] ibverbs: Add new provider verb for I/O memory registration

2010-07-29 Thread Tom Tucker
From: Tom Tucker t...@opengridcomputing.com

Add a function pointer for the provider's reg_io_mr
method.

Signed-off-by: Tom Tucker t...@ogc.us
---

 include/rdma/ib_verbs.h |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f3e8f3c..5034ac9 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1096,6 +1096,11 @@ struct ib_device {
  u64 virt_addr,
  int mr_access_flags,
  struct ib_udata *udata);
+   struct ib_mr * (*reg_io_mr)(struct ib_pd *pd,
+   u64 start, u64 length,
+   u64 virt_addr,
+   int mr_access_flags,
+   struct ib_udata *udata);
int(*query_mr)(struct ib_mr *mr,
   struct ib_mr_attr *mr_attr);
int(*dereg_mr)(struct ib_mr *mr);

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 3/3] libibverbs: Add reg/unreg I/O memory verbs

2010-07-29 Thread Tom Tucker
From: Tom Tucker t...@opengridcomputing.com

Add the ibv_reg_io_mr and ibv_dereg_io_mr verbs.

Signed-off-by: Tom Tucker t...@ogc.us
---

 include/infiniband/driver.h |6 ++
 include/infiniband/verbs.h  |   14 ++
 src/verbs.c |   35 +++
 3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 9a81416..37c0ed1 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -82,6 +82,12 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t 
length,
   size_t cmd_size,
   struct ibv_reg_mr_resp *resp, size_t resp_size);
 int ibv_cmd_dereg_mr(struct ibv_mr *mr);
+int ibv_cmd_reg_io_mr(struct ibv_pd *pd, void *addr, size_t length,
+ uint64_t hca_va, int access,
+ struct ibv_mr *mr, struct ibv_reg_io_mr *cmd,
+ size_t cmd_size,
+ struct ibv_reg_io_mr_resp *resp, size_t resp_size);
+int ibv_cmd_dereg_io_mr(struct ibv_mr *mr);
 int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
  struct ibv_comp_channel *channel,
  int comp_vector, struct ibv_cq *cq,
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 0f1cb2e..a0d969a 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -640,6 +640,9 @@ struct ibv_context_ops {
size_t length,
int access);
int (*dereg_mr)(struct ibv_mr *mr);
+struct ibv_mr * (*reg_io_mr)(struct ibv_pd *pd, void *addr, 
size_t length,
+int access);
+int (*dereg_io_mr)(struct ibv_mr *mr);
struct ibv_mw * (*alloc_mw)(struct ibv_pd *pd, enum ibv_mw_type 
type);
int (*bind_mw)(struct ibv_qp *qp, struct ibv_mw *mw,
   struct ibv_mw_bind *mw_bind);
@@ -801,6 +804,17 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
 int ibv_dereg_mr(struct ibv_mr *mr);
 
 /**
+ * ibv_reg_io_mr - Register a physical memory region
+ */
+struct ibv_mr *ibv_reg_io_mr(struct ibv_pd *pd, void *addr,
+ size_t length, int access);
+
+/**
+ * ibv_dereg_io_mr - Deregister a physical memory region
+ */
+int ibv_dereg_io_mr(struct ibv_mr *mr);
+
+/**
  * ibv_create_comp_channel - Create a completion event channel
  */
 struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);
diff --git a/src/verbs.c b/src/verbs.c
index ba3c0a4..7d215c1 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -189,6 +189,41 @@ int __ibv_dereg_mr(struct ibv_mr *mr)
 }
 default_symver(__ibv_dereg_mr, ibv_dereg_mr);
 
+struct ibv_mr *__ibv_reg_io_mr(struct ibv_pd *pd, void *addr,
+   size_t length, int access)
+{
+struct ibv_mr *mr;
+
+if (ibv_dontfork_range(addr, length))
+return NULL;
+
+mr = pd-context-ops.reg_io_mr(pd, addr, length, access);
+if (mr) {
+mr-context = pd-context;
+mr-pd  = pd;
+mr-addr= addr;
+mr-length  = length;
+} else
+ibv_dofork_range(addr, length);
+
+return mr;
+}
+default_symver(__ibv_reg_io_mr, ibv_reg_io_mr);
+
+int __ibv_dereg_io_mr(struct ibv_mr *mr)
+{
+int ret;
+void *addr  = mr-addr;
+size_t length   = mr-length;
+
+ret = mr-context-ops.dereg_io_mr(mr);
+if (!ret)
+ibv_dofork_range(addr, length);
+
+return ret;
+}
+default_symver(__ibv_dereg_io_mr, ibv_dereg_io_mr);
+
 static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context 
*context)
 {
struct ibv_abi_compat_v2 *t = context-abi_compat;

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] libmthca: Add support for I/O memory registration verbs

2010-07-29 Thread Tom Tucker
This patchset adds support for the new I/O memory registration verbs to
libmthca.

---

Tom Tucker (1):
  libmthca: Add support for the reg_io_mr verb.


 src/mthca-abi.h |4 
 src/mthca.c |2 ++
 src/mthca.h |4 
 src/verbs.c |   50 ++
 4 files changed, 60 insertions(+), 0 deletions(-)

-- 
Signed-off-by: Tom Tucker t...@ogc.us
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] libmthca: Add support for the reg_io_mr verb.

2010-07-29 Thread Tom Tucker
From: Tom Tucker t...@opengridcomputing.com

Added support for the ibv_reg_io_mr and ibv_unreg_io_mr
verbs to the mthca library.

Signed-off-by: Tom Tucker t...@ogc.us
---

 src/mthca-abi.h |4 
 src/mthca.c |2 ++
 src/mthca.h |4 
 src/verbs.c |   50 ++
 4 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/src/mthca-abi.h b/src/mthca-abi.h
index 4fbd98b..c0145d6 100644
--- a/src/mthca-abi.h
+++ b/src/mthca-abi.h
@@ -61,6 +61,10 @@ struct mthca_reg_mr {
__u32   reserved;
 };
 
+struct mthca_reg_io_mr {
+   struct ibv_reg_io_mribv_cmd;
+};
+
 struct mthca_create_cq {
struct ibv_create_cqibv_cmd;
__u32   lkey;
diff --git a/src/mthca.c b/src/mthca.c
index e33bf7f..8892504 100644
--- a/src/mthca.c
+++ b/src/mthca.c
@@ -113,6 +113,8 @@ static struct ibv_context_ops mthca_ctx_ops = {
.dealloc_pd= mthca_free_pd,
.reg_mr= mthca_reg_mr,
.dereg_mr  = mthca_dereg_mr,
+   .reg_io_mr = mthca_reg_io_mr,
+   .dereg_io_mr   = mthca_dereg_io_mr,
.create_cq = mthca_create_cq,
.poll_cq   = mthca_poll_cq,
.resize_cq = mthca_resize_cq,
diff --git a/src/mthca.h b/src/mthca.h
index bd1e7a2..92a8649 100644
--- a/src/mthca.h
+++ b/src/mthca.h
@@ -312,6 +312,10 @@ struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr,
size_t length, int access);
 int mthca_dereg_mr(struct ibv_mr *mr);
 
+struct ibv_mr *mthca_reg_io_mr(struct ibv_pd *pd, void *addr,
+  size_t length, enum ibv_access_flags access);
+int mthca_dereg_io_mr(struct ibv_mr *mr);
+
 struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe,
   struct ibv_comp_channel *channel,
   int comp_vector);
diff --git a/src/verbs.c b/src/verbs.c
index b6782c9..3580ad2 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -174,6 +174,56 @@ int mthca_dereg_mr(struct ibv_mr *mr)
return 0;
 }
 
+
+static struct ibv_mr *__mthca_reg_io_mr(struct ibv_pd *pd, void *addr,
+   size_t length, uint64_t hca_va,
+   enum ibv_access_flags access)
+{
+   struct ibv_mr *mr;
+   struct mthca_reg_io_mr cmd;
+   int ret;
+
+   mr = malloc(sizeof *mr);
+   if (!mr)
+   return NULL;
+
+#ifdef IBV_CMD_REG_MR_HAS_RESP_PARAMS
+   {
+   struct ibv_reg_io_mr_resp resp;
+
+   ret = ibv_cmd_reg_io_mr(pd, addr, length, hca_va, access, mr,
+   cmd.ibv_cmd, sizeof cmd, resp, sizeof 
resp);
+   }
+#else
+   ret = ibv_cmd_reg_io_mr(pd, addr, length, hca_va, access, mr,
+   cmd.ibv_cmd, sizeof cmd);
+#endif
+   if (ret) {
+   free(mr);
+   return NULL;
+   }
+
+   return mr;
+}
+
+struct ibv_mr *mthca_reg_io_mr(struct ibv_pd *pd, void *addr,
+  size_t length, enum ibv_access_flags access)
+{
+   return __mthca_reg_io_mr(pd, addr, length, (uintptr_t) addr, access);
+}
+
+int mthca_dereg_io_mr(struct ibv_mr *mr)
+{
+   int ret;
+
+   ret = ibv_cmd_dereg_mr(mr);
+   if (ret)
+   return ret;
+
+   free(mr);
+   return 0;
+}
+
 static int align_cq_size(int cqe)
 {
int nent;

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service

2010-07-29 Thread Tom Tucker

On 7/29/10 1:22 PM, Ralph Campbell wrote:

On Thu, 2010-07-29 at 09:25 -0700, Tom Tucker wrote:
   

From: Tom Tuckert...@opengridcomputing.com

Add an ib_iomem_get service that converts a vma to an array of
physical addresses. This makes it easier for each device driver to
add support for the reg_io_mr provider method.

Signed-off-by: Tom Tuckert...@ogc.us
---

  drivers/infiniband/core/umem.c |  248 ++--
  include/rdma/ib_umem.h |   14 ++
  2 files changed, 251 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 415e186..f103956 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
 

...
   

@@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem)
return n;
  }
  EXPORT_SYMBOL(ib_umem_page_count);
+/*
+ * Return the PFN for the specified address in the vma. This only
+ * works for a vma that is VM_PFNMAP.
+ */
+static unsigned long follow_io_pfn(struct vm_area_struct *vma,
+  unsigned long address, int write)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep, pte;
+   spinlock_t *ptl;
+   unsigned long pfn;
+   struct mm_struct *mm = vma-vm_mm;
+
+	BUG_ON(0 == (vma->vm_flags & VM_PFNMAP));
 

Why use BUG_ON?
WARN_ON is more appropriate but
if (!(vma->vm_flags & VM_PFNMAP))
return 0;
seems better.
In fact, move it outside the inner do loop in ib_get_io_pfn().

   
It's paranoia from the debug phase. It's already in the 'outer loop'. I 
should just delete it I think.

+   pgd = pgd_offset(mm, address);
+   if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+   return 0;
+
+   pud = pud_offset(pgd, address);
+   if (pud_none(*pud))
+   return 0;
+   if (unlikely(pud_bad(*pud)))
+   return 0;
+
+   pmd = pmd_offset(pud, address);
+   if (pmd_none(*pmd))
+   return 0;
+   if (unlikely(pmd_bad(*pmd)))
+   return 0;
+
+	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+   pte = *ptep;
+   if (!pte_present(pte))
+   goto bad;
+	if (write && !pte_write(pte))
+   goto bad;
+
+   pfn = pte_pfn(pte);
+   pte_unmap_unlock(ptep, ptl);
+   return pfn;
+ bad:
+   pte_unmap_unlock(ptep, ptl);
+   return 0;
+}
+
+int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
+ unsigned long start, int len, int write, int force,
+ unsigned long *pfn_list, struct vm_area_struct **vmas)
+{
+   unsigned long pfn;
+   int i;
+	if (len <= 0)
+   return 0;
+
+   i = 0;
+   do {
+   struct vm_area_struct *vma;
+
+   vma = find_vma(mm, start);
+		if (0 == (vma->vm_flags & VM_PFNMAP))
+			return -EINVAL;
 

Style nit: I would use ! instead of 0 ==

   


ok.


+		if (0 == (vma->vm_flags & VM_IO))
+   return -EFAULT;
+
+   if (is_vm_hugetlb_page(vma))
+   return -EFAULT;
+
+   do {
+   cond_resched();
+   pfn = follow_io_pfn(vma, start, write);
+   if (!pfn)
+   return -EFAULT;
+   if (pfn_list)
+   pfn_list[i] = pfn;
+   if (vmas)
+   vmas[i] = vma;
+   i++;
+   start += PAGE_SIZE;
+   len--;
+		} while (len && start < vma->vm_end);
+   } while (len);
+   return i;
+}
+
+/**
+ * ib_iomem_get - DMA map a userspace map of IO memory.
+ * @context: userspace context to map memory for
+ * @addr: userspace virtual address to start at
+ * @size: length of region to map
+ * @access: IB_ACCESS_xxx flags for memory being mapped
+ * @dmasync: flush in-flight DMA when the memory region is written
+ */
+struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr,
+size_t size, int access, int dmasync)
+{
+   struct ib_umem *umem;
+   unsigned long *pfn_list;
+   struct ib_umem_chunk *chunk;
+   unsigned long locked;
+   unsigned long lock_limit;
+   unsigned long cur_base;
+   unsigned long npages;
+   int ret;
+   int off;
+   int i;
+   DEFINE_DMA_ATTRS(attrs);
+
+   if (dmasync)
+		dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
+
+   if (!can_do_mlock())
+   return ERR_PTR(-EPERM);
+
+   umem = kmalloc(sizeof *umem, GFP_KERNEL);
+   if (!umem)
+   return ERR_PTR(-ENOMEM);
+
+	umem->type      = IB_UMEM_IO_MAP;
+	umem->context   = context;
+	umem->length    = size;
+	umem->offset    = addr & ~PAGE_MASK;
+	umem->page_size

Re: [RFC PATCH 3/3] libibverbs: Add reg/unreg I/O memory verbs

2010-07-29 Thread Tom Tucker

On 7/29/10 3:07 PM, Ralph Campbell wrote:

How does an application know when to call ibv_reg_io_mr()
instead of ibv_reg_mr()? It isn't going to know that some
address returned by mmap() is going to have the VM_PFNMAP
flag set.
   

Please see my response to Jason.

How does an application know that the HCA supports
ibv_reg_io_mr() or not? (see below)
I think returning ENOTSUP or something would be good.

   
There are bits in the devcaps that indicate whether these verbs are 
supported. It should, however, return -ENOTSUPP if they are called without 
support. I copied ibv_reg_mr's behavior, which is inappropriate in this 
regard.
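
Something along these lines would do it (a minimal sketch assuming the ops
layout proposed in this series; whether the guard lives in a wrapper or in
__ibv_reg_io_mr itself, the point is that a missing provider hook must fail
cleanly instead of dereferencing NULL):

/* Refuse cleanly when the provider library does not implement the new
 * hook, e.g. an older libxxxverbs. */
static struct ibv_mr *reg_io_mr_checked(struct ibv_pd *pd, void *addr,
					size_t length, int access)
{
	if (!pd->context->ops.reg_io_mr) {
		errno = EOPNOTSUPP;
		return NULL;
	}
	return ibv_reg_io_mr(pd, addr, length, access);
}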

On Thu, 2010-07-29 at 09:32 -0700, Tom Tucker wrote:
   

From: Tom Tuckert...@opengridcomputing.com

Add the ibv_reg_io_mr and ibv_dereg_io_mr verbs.

Signed-off-by: Tom Tuckert...@ogc.us
---

  include/infiniband/driver.h |6 ++
  include/infiniband/verbs.h  |   14 ++
  src/verbs.c |   35 +++
  3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 9a81416..37c0ed1 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -82,6 +82,12 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t 
length,
   size_t cmd_size,
   struct ibv_reg_mr_resp *resp, size_t resp_size);
  int ibv_cmd_dereg_mr(struct ibv_mr *mr);
+int ibv_cmd_reg_io_mr(struct ibv_pd *pd, void *addr, size_t length,
+ uint64_t hca_va, int access,
+ struct ibv_mr *mr, struct ibv_reg_io_mr *cmd,
+ size_t cmd_size,
+ struct ibv_reg_io_mr_resp *resp, size_t resp_size);
+int ibv_cmd_dereg_io_mr(struct ibv_mr *mr);
  int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
  struct ibv_comp_channel *channel,
  int comp_vector, struct ibv_cq *cq,
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 0f1cb2e..a0d969a 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -640,6 +640,9 @@ struct ibv_context_ops {
size_t length,
int access);
int (*dereg_mr)(struct ibv_mr *mr);
+struct ibv_mr * (*reg_io_mr)(struct ibv_pd *pd, void *addr, 
size_t length,
+int access);
+int (*dereg_io_mr)(struct ibv_mr *mr);
struct ibv_mw * (*alloc_mw)(struct ibv_pd *pd, enum ibv_mw_type 
type);
int (*bind_mw)(struct ibv_qp *qp, struct ibv_mw *mw,
   struct ibv_mw_bind *mw_bind);
 

Doesn't adding these in the middle of the struct break the
libibverbs to libxxxverbs.so binary interface?
Shouldn't they be added at the end of the struct?
I'm not sure how the versioning works between libibverbs and
device plugins. Don't we need to protect against libibverbs
being upgraded but the libxxxverbs.so being older?

   

I would think it's broken regardless.


@@ -801,6 +804,17 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr,
  int ibv_dereg_mr(struct ibv_mr *mr);

  /**
+ * ibv_reg_io_mr - Register a physical memory region
+ */
+struct ibv_mr *ibv_reg_io_mr(struct ibv_pd *pd, void *addr,
+ size_t length, int access);
+
+/**
+ * ibv_dereg_io_mr - Deregister a physical memory region
+ */
+int ibv_dereg_io_mr(struct ibv_mr *mr);
+
+/**
   * ibv_create_comp_channel - Create a completion event channel
   */
  struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context);
diff --git a/src/verbs.c b/src/verbs.c
index ba3c0a4..7d215c1 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -189,6 +189,41 @@ int __ibv_dereg_mr(struct ibv_mr *mr)
  }
  default_symver(__ibv_dereg_mr, ibv_dereg_mr);

+struct ibv_mr *__ibv_reg_io_mr(struct ibv_pd *pd, void *addr,
+   size_t length, int access)
+{
+struct ibv_mr *mr;
+
+if (ibv_dontfork_range(addr, length))
+return NULL;
+
+mr = pd-context-ops.reg_io_mr(pd, addr, length, access);
 

Won't reg_io_mr pointer be NULL for other HCAs?
What happens if the device doesn't yet implement this function?

   


Without a check, SEGV. See above.

+if (mr) {
+mr-context = pd-context;
+mr-pd  = pd;
+mr-addr= addr;
+mr-length  = length;
+} else
+ibv_dofork_range(addr, length);
+
+return mr;
+}
+default_symver(__ibv_reg_io_mr, ibv_reg_io_mr);
+
+int __ibv_dereg_io_mr(struct ibv_mr *mr)
+{
+int ret;
+void *addr  = mr-addr;
+size_t length   = mr-length;
+
+ret = mr-context-ops.dereg_io_mr(mr);
+if (!ret)
+ibv_dofork_range(addr

Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service

2010-07-29 Thread Tom Tucker

On 7/29/10 3:41 PM, Jason Gunthorpe wrote:

On Thu, Jul 29, 2010 at 03:29:37PM -0500, Tom Tucker wrote:

   

Also, I'd like to see a strong defence of this new user space API
particularly:
   1) Why can't this be done with the existing ibv_reg_mr, like huge
  pages are.
   
   

The ibv_reg_mr API assumes that the memory being registered was
allocated in user mode and is part of the current->mm VMA. It uses
get_user_pages which will scoff and jeer at kernel memory.
 

I'm confused? What is the vaddr input then? How does userspace get
that value? Isn't it created by mmap or the like?
   

Yes.

Ie for the PCI-E example you gave I assume the flow is that userspace
mmaps devices/pci:00/:00:XX.X/resourceX to get the IO memory
and then passes that through to ibv_reg_mr?

   


Not exactly. It would mmap the device that manages the adapter hosting 
the memory.



IMHO, ibv_reg_mr API should accept any valid vaddr available to the
process and if it bombs for certain kinds of vaddrs then it is just a
bug..

   

Perhaps.

   2) How is it possible for userspace to know when it should use
  ibv_reg_mr vs ibv_reg_io_mr?
   
   

By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver,
I know that the memory I am mapping is a kernel buffer.
 

Yah, but what if the next version of your vmstat driver changes the
kind of memory it returns?

   


It's a general service for a class of memory, not an enabler for a 
particular application's peculiarities.



On first glance, this seems like a hugely bad API to me :)
   
   

Well hopefully now that it's purpose is revealed you will change your
view and we can collaboratively make it better :-)
 

I don't object to the idea, just to the notion that user space is
supposed to somehow know that one vaddr is different from another
vaddr and call the right API - seems impossible to use correctly to
me.

What would you have to do to implement this using scheme using
ibv_reg_mr as the entry point?

   
The kernel service on the other side of the ibv_reg_mr verb could divine the 
necessary information by searching the VMAs owned by current and looking 
at vm_flags to decide what type of memory it was.
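
Roughly (an untested sketch, assuming the vma flags alone are enough to pick
the path and reusing the signatures from this series):

/* Let the single reg_mr path pick the pinning strategy from the vma flags
 * instead of exposing a second user-visible verb. */
static struct ib_umem *reg_mr_dispatch(struct ib_ucontext *context,
				       unsigned long addr, size_t size,
				       int access, int dmasync)
{
	struct vm_area_struct *vma;
	int io_mapped;

	down_read(&current->mm->mmap_sem);
	vma = find_vma(current->mm, addr);
	io_mapped = vma && (vma->vm_flags & VM_PFNMAP);
	up_read(&current->mm->mmap_sem);

	return io_mapped ?
		ib_iomem_get(context, addr, size, access, dmasync) :
		ib_umem_get(context, addr, size, access, dmasync);
}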




Jason
   


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service

2010-07-29 Thread Tom Tucker

On 7/29/10 11:25 AM, Tom Tucker wrote:

From: Tom Tuckert...@opengridcomputing.com

Add an ib_iomem_get service that converts a vma to an array of
physical addresses. This makes it easier for each device driver to
add support for the reg_io_mr provider method.

Signed-off-by: Tom Tuckert...@ogc.us
---

  drivers/infiniband/core/umem.c |  248 ++--
  include/rdma/ib_umem.h |   14 ++
  2 files changed, 251 insertions(+), 11 deletions(-)

[...snip...]
   



+		/* The pfn_list we built is a set of Page
+		 * Frame Numbers (PFN) whose physical address
+		 * is PFN << PAGE_SHIFT. The SG DMA mapping
+		 * services expect page addresses, not PFNs,
+		 * therefore we have to do the dma mapping
+		 * ourselves here. */
+		for (i = 0; i < chunk->nents; ++i) {
+			sg_set_page(&chunk->page_list[i], 0,
+				    PAGE_SIZE, 0);
+			chunk->page_list[i].dma_address =
+				(pfn_list[i] << PAGE_SHIFT);
   


This is not architecture independent. Does anyone have any thoughts on 
how this ought to be done?



+			chunk->page_list[i].dma_length = PAGE_SIZE;
+		}
+		chunk->nmap = chunk->nents;
+		ret -= chunk->nents;
+		off += chunk->nents;
+		list_add_tail(&chunk->list, &umem->chunk_list);
+   }
+
+   ret = 0;
+   }
+
   

[...snip...]

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service

2010-07-29 Thread Tom Tucker

On 7/29/10 5:57 PM, Jason Gunthorpe wrote:

You would need to modify ib_umem_get() to check for the VM_PFNMAP
flag and build the struct ib_umem similar to the proposed
ib_iomem_get(). However, the page reference counting/sharing issue
would need to be solved. I think there are kernel level callbacks
for this that could be used.
 

But in this case the pages are already mmaped into a user process,
there must be some mechanism to ensure they don't get pulled away?!

   

This is not virtual memory. It's real memory.


Though, I guess, what happens if you hot un-plug the PCI-E card that
has a process mmaping its memory?!

   
Exactly. The memory would have to be physically detached for it to get 
'pulled away'



What happens if you RDMA READ from PCI-E address space that does not
have any device responding?

   


bus error.


Jason
   


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6

2010-04-05 Thread Tom Tucker

J. Bruce Fields wrote:

On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote:
  

On 04/03/2010 09:27 AM, Tom Tucker wrote:


RPC6 requires that it be possible to create endpoints that listen
exclusively for IPv4 or IPv6 connection requests. This is not currently
supported by the RDMA API.

Signed-off-by: Tom Tuckert...@opengridcomputing.com
Tested-by: Steve Wisesw...@opengridcomputing.com
  

Reviewed-by: Chuck Lever chuck.le...@oracle.com



Thanks to all.  I take it the problem began with 37498292a NFSD: Create
PF_INET6 listener in write_ports?

  


Yes.

Tom


--b.

  

---

net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 -
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 3fa5751..4e6bbf9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -678,7 +678,10 @@ static struct svc_xprt *svc_rdma_create(struct
svc_serv *serv,
int ret;

dprintk(svcrdma: Creating RDMA socket\n);
-
+ if (sa-sa_family != AF_INET) {
+ dprintk(svcrdma: Address family %d is not supported.\n, sa-sa_family);
+ return ERR_PTR(-EAFNOSUPPORT);
+ }
cma_xprt = rdma_create_xprt(serv, 1);
if (!cma_xprt)
return ERR_PTR(-ENOMEM);

  

--
chuck[dot]lever[at]oracle[dot]com


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  




Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6

2010-04-05 Thread Tom Tucker

J. Bruce Fields wrote:

On Mon, Apr 05, 2010 at 12:16:18PM -0400, J. Bruce Fields wrote:
  

On Mon, Apr 05, 2010 at 10:50:16AM -0500, Tom Tucker wrote:


J. Bruce Fields wrote:
  

On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote:
  


On 04/03/2010 09:27 AM, Tom Tucker wrote:

  

RPC6 requires that it be possible to create endpoints that listen
exclusively for IPv4 or IPv6 connection requests. This is not currently
supported by the RDMA API.

Signed-off-by: Tom Tuckert...@opengridcomputing.com
Tested-by: Steve Wisesw...@opengridcomputing.com
  


Reviewed-by: Chuck Lever chuck.le...@oracle.com

  

Thanks to all.  I take it the problem began with 37498292a NFSD: Create
PF_INET6 listener in write_ports?

  


Yes.
  

Thanks.  I'll pass along

git://linux-nfs.org/~bfields/linux.git for-2.6.34

soon.



And: sorry we didn't catch this when it happened.  I have some of the
equipment I'd need to do basic regression tests, but haven't set it up.

I hope I get to it at some point  For now I depend on others to
catch even basic rdma regressions--let me know if there's some way I
could make your testing easier.

  


We were focused on older kernels..and probably should have caught it 
quicker. No worries. Thanks,


Tom


--b.
  


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message

2010-04-02 Thread Tom Tucker

Roland Dreier wrote:

   The write_ports code will fail both the INET4 and INET6 transport
   creation if
   the transport returns an error when PF_INET6 is specified. Some transports
   that do not support INET6 return an error other than EAFNOSUPPORT.
  
  That's the real bug.  Any reason the RDMA RPC transport can't return

  EAFNOSUPPORT in this case?

I think Tom's changelog is misleading.
Yes, it should read "A transport may fail for some reason other than 
EAFNOSUPPORT."



  The problem is that the RDMA
transport actually does support IPv6, but it doesn't support the
IPV6ONLY option yet.  So if NFS/RDMA binds to a port for IPv4, then the
IPv6 bind fails because of the port collision.

  


Should we fail INET4 if INET6 fails under any circumstances?


Implementing the IPV6ONLY option for RDMA binding is probably not
feasible for 2.6.34, so the best band-aid for now seems to be Tom's
patch.

 - R.
  


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: nfsrdma broken on 2.6.34-rc1?

2010-04-01 Thread Tom Tucker

Sean Hefty wrote:

Sean, will you add this to the rdma_cm?



Not immediately because I lack the time to do it.

It would be really nice to share the kernel's port space code and remove the
port code in the rdma_cm.

  


LOL. Yes...yes it would. There is of course a Dragon to be slain. Roland?


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message

2010-04-01 Thread Tom Tucker

If this looks right to everyone, I'll post this to linux-nfs.

Tom

nfsd: Make INET6 transport creation failure an informational message

The write_ports code will fail both the INET4 and INET6 transport creation if
the transport returns an error when PF_INET6 is specified. Some transports
that do not support INET6 return an error other than EAFNOSUPPORT. We should
allow communication on INET4 even if INET6 is not yet supported or fails
for some reason.

Signed-off-by: Tom Tucker t...@opengridcomputing.com
---

fs/nfsd/nfsctl.c |6 --
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 0f0e77f..019a89e 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1008,8 +1008,10 @@ static ssize_t __write_ports_addxprt(char *buf)

err = svc_create_xprt(nfsd_serv, transport,
PF_INET6, port, SVC_SOCK_ANONYMOUS);
-	if (err < 0 && err != -EAFNOSUPPORT)
-		goto out_close;
+	if (err < 0)
+		dprintk("nfsd: Error creating PF_INET6 listener for transport '%s'\n",
+			transport);
+
return 0;
out_close:
xprt = svc_find_xprt(nfsd_serv, transport, PF_INET, port);



[PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message

2010-04-01 Thread Tom Tucker

Hi Bruce/Chuck,

RDMA Transports are currently broken in 2.6.34 because they don't have a 
V4ONLY setsockopt. So what happens is that when write_ports attempts to 
create the PF_INET6 transport it fails because the port is already in 
use. There is discussion on linux-rdma about how to fix this, but in the 
interim and perhaps indefinitely, I propose the following:


Tom

nfsd: Make INET6 transport creation failure an informational message

The write_ports code will fail both the INET4 and INET6 transport creation if
the transport returns an error when PF_INET6 is specified. Some transports
that do not support INET6 return an error other than EAFNOSUPPORT. We should
allow communication on INET4 even if INET6 is not yet supported or fails
for some reason.

Signed-off-by: Tom Tucker t...@opengridcomputing.com
---

fs/nfsd/nfsctl.c |6 --
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 0f0e77f..934b624 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1008,8 +1008,10 @@ static ssize_t __write_ports_addxprt(char *buf)

err = svc_create_xprt(nfsd_serv, transport,
PF_INET6, port, SVC_SOCK_ANONYMOUS);
-	if (err < 0 && err != -EAFNOSUPPORT)
-		goto out_close;
+	if (err < 0)
+		printk(KERN_INFO "nfsd: Error creating PF_INET6 listener "
+		       "for transport '%s'\n", transport);
+
return 0;
out_close:
xprt = svc_find_xprt(nfsd_serv, transport, PF_INET, port);



Re: rnfs: rq_respages pointer is bad

2010-03-11 Thread Tom Tucker

David J. Wilder wrote:

Tom

I have been chasing an rnfs related Oops in svc_process().  I have found
the source of the Oops but I am not sure of my fix.  I am seeing the
problem on ppc64, kernel 2.6.32, I have not tried other arch yet.

The source of the problem is in rdma_read_complete(), I am finding that
rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
page list.  This results in a NULL reference in svc_process() when
passing rq_respages[0] to page_address().

In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of
the page list then indexing by page_no, however rq_arg.pages is not
pointing to the start of the list so rq_respages ends up pointing to:

rqstp->rq_pages[(head->count+1) + head->hdr_count]

In my case, it ends up pointing one past the end of the list.

Here is the change I made.

static int rdma_read_complete(struct svc_rqst *rqstp,
			      struct svc_rdma_op_ctxt *head)
{
	int page_no;
	int ret;

	BUG_ON(!head);

	/* Copy RPC pages */
	for (page_no = 0; page_no < head->count; page_no++) {
		put_page(rqstp->rq_pages[page_no]);
		rqstp->rq_pages[page_no] = head->pages[page_no];
	}
	/* Point rq_arg.pages past header */
	rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
	rqstp->rq_arg.page_len = head->arg.page_len;
	rqstp->rq_arg.page_base = head->arg.page_base;

	/* rq_respages starts after the last arg page */
-	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+	rqstp->rq_respages = &rqstp->rq_pages[page_no];
  


This might be clearer as:

   rqstp->rq_respages = &rqstp->rq_pages[head->count];


.
.
.

The change works for me, but I am not sure it is safe to assume that
rqstp->rq_pages[head->count] will always point to the last arg page.

Dave.
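
For what it's worth, here is a compact restatement of the index arithmetic
behind the two expressions above (the helper names are hypothetical, just for
illustration; this is not kernel code):

	/*
	 * rq_arg.pages == &rq_pages[hdr_count], and page_no == count after
	 * the copy loop, so each expression lands at the rq_pages[] index
	 * computed below.
	 */
	static unsigned int respages_index_old(unsigned int hdr_count,
					       unsigned int count)
	{
		return hdr_count + count;	/* &rq_arg.pages[page_no] */
	}

	static unsigned int respages_index_new(unsigned int count)
	{
		return count;			/* &rq_pages[page_no] */
	}

The old expression is hdr_count entries further along, which is how
rq_respages ends up walking off the end of rq_pages[].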





Re: rnfs: rq_respages pointer is bad

2010-03-11 Thread Tom Tucker

Roland Dreier wrote:

Someone please make sure that a final patch with a full description gets
sent to the NFS guys for merging.  Tom, are you going to handle this?
  

Yes, and I have several more in queue.

Tom



Re: [ewg] nfsrdma fails to write big file,

2010-03-01 Thread Tom Tucker

Roland:

I'll put together a patch based on 5 with a comment that indicates why I 
think 5 is the number. Since Vu has verified this behaviorally as well, 
I'm comfortable that our understanding of the code is sound. I'm on the 
road right now, so it won't be until tomorrow though.


Thanks,
Tom


Vu Pham wrote:
  

-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Saturday, February 27, 2010 8:23 PM
To: Vu Pham
Cc: Roland Dreier; linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Roland Dreier wrote:


  +	/*
  +	 * Add room for frmr register and invalidate WRs
  +	 * Requests sometimes have two chunks, each chunk
  +	 * requires to have different frmr. The safest
  +	 * WRs required are max_send_wr * 6; however, we
  +	 * get send completions and poll fast enough, it
  +	 * is pretty safe to have max_send_wr * 4.
  +	 */
  +	ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue
overflow; if you're counting on events occurring in a particular order
or completions being handled fast enough, then your design is going to
fail in some high load situations, which I don't think you want.

Vu,

Would you please try the following:

- Set the multiplier to 5
- Set the number of buffer credits small as follows: echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries
- Rerun your test and see if you can reproduce the problem?

I did the above and was unable to reproduce, but I would like to see if
you can to convince ourselves that 5 is the right number.





Tom,

I did the above and can not reproduce either.

I think 5 is the right number; however, we should optimize it later.

-vu
  




Re: rnfs: rq_respages pointer is bad

2010-03-01 Thread Tom Tucker

Hi David:

That looks like a bug to me and it looks like what you propose is the 
correct fix. My only reservation is that if you are correct then how did 
this work at all without data corruption for large writes on x86_64?


I'm on the road right now, so I can't dig too deep until Wednesday, but 
at this point your analysis looks correct to me.


Tom


David J. Wilder wrote:

Tom

I have been chasing an rnfs related Oops in svc_process().  I have found
the source of the Oops but I am not sure of my fix.  I am seeing the
problem on ppc64, kernel 2.6.32, I have not tried other arch yet.

The source of the problem is in rdma_read_complete(), I am finding that
rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
page list.  This results in a NULL reference in svc_process() when
passing rq_respages[0] to page_address().

In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of
the page list then indexing by page_no, however rq_arg.pages is not
pointing to the start of the list so rq_respages ends up pointing to:

rqstp->rq_pages[(head->count+1) + head->hdr_count]

In my case, it ends up pointing one past the end of the list.

Here is the change I made.

static int rdma_read_complete(struct svc_rqst *rqstp,
			      struct svc_rdma_op_ctxt *head)
{
	int page_no;
	int ret;

	BUG_ON(!head);

	/* Copy RPC pages */
	for (page_no = 0; page_no < head->count; page_no++) {
		put_page(rqstp->rq_pages[page_no]);
		rqstp->rq_pages[page_no] = head->pages[page_no];
	}
	/* Point rq_arg.pages past header */
	rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
	rqstp->rq_arg.page_len = head->arg.page_len;
	rqstp->rq_arg.page_base = head->arg.page_base;

	/* rq_respages starts after the last arg page */
-	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+	rqstp->rq_respages = &rqstp->rq_pages[page_no];
.
.
.

The change works for me, but I am not sure it is safe to assume that
rqstp->rq_pages[head->count] will always point to the last arg page.

Dave.
  




Re: [ewg] nfsrdma fails to write big file,

2010-02-27 Thread Tom Tucker

Roland Dreier wrote:
  +	/*
  +	 * Add room for frmr register and invalidate WRs
  +	 * Requests sometimes have two chunks, each chunk
  +	 * requires to have different frmr. The safest
  +	 * WRs required are max_send_wr * 6; however, we
  +	 * get send completions and poll fast enough, it
  +	 * is pretty safe to have max_send_wr * 4.
  +	 */
  +	ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue
overflow; if you're counting on events occurring in a particular order
or completions being handled fast enough, then your design is going to
fail in some high load situations, which I don't think you want.

  


Vu,

Would you please try the following:

- Set the multiplier to 5
- Set the number of buffer credits small as follows: echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries

- Rerun your test and see if you can reproduce the problem?

I did the above and was unable to reproduce, but I would like to see if 
you can to convince ourselves that 5 is the right number.


Thanks,
Tom


 - R.
  




Re: [ewg] nfsrdma fails to write big file,

2010-02-24 Thread Tom Tucker

Vu Pham wrote:

Tom,

Did you make any change to have bonnie++, dd of a 10G file and vdbench
concurrently run & finish?

  


No I did not but my disk subsystem is pretty slow, so it might be that I 
just don't have fast enough storage.



I keep hitting the WQE overflow error below.
I saw that most of the requests have two chunks (a 32K chunk and a
some-bytes chunk), and each chunk requires frmr + invalidate WRs;
however, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and
then for the frmr case you do
ep->rep_attr.cap.max_send_wr *= 3, which is not enough. Moreover, you
also set ep->rep_cqinit = max_send_wr/2 for the send completion signal,
which causes the WQE overflow to happen faster.

  




After applying the following patch, I have had vdbench, dd, and a copy of
the 10g_file running overnight.

-vu


--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c   2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c        2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs
+		 * Requests sometimes have two chunks, each chunk
+		 * requires to have different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:


  
Erf. This is client code. I'll take a look at this and see if I can 
understand what Talpey was up to.


Tom
  






  

-Original Message-
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I can not reproduce the problem
2. I also see different error on client

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name
'nobody'
does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send
returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC:   rpcrdma_event_process:
send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to
13.20.1.9:20049 closed (-103)

-vu



-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
  

Setup:
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2
QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.

Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M
count=1*, operation fail, connection get drop, client cannot
re-establish connection to server.
After rebooting only the client, I can mount again.

It happens with both solaris and linux nfsrdma servers.

For linux client/server, I run memreg=5 (FRMR), I don't see problem with
memreg=6 (global dma key)

Awesome. This is the key I think.

Thanks for the info Vu,
Tom

On Solaris server snv 130, we see problem decoding write request of 32K.
The client send two read chunks (32K & 16-byte), the server fail to do
rdma read on the 16-byte chunk (cqe.status = 10 ie.
IB_WC_REM_ACCCESS_ERROR); therefore, server terminate the connection. We
don't see this problem on nfs version 3 on Solaris. Solaris server run
normal memory registration mode.

On linux client, I see cqe.status = 12 ie. IB_WC_RETRY_EXC_ERR

I added these notes in bug #1919 (bugs.openfabrics.org) to track the
issue.

thanks,
-vu

Re: [ewg] nfsrdma fails to write big file,

2010-02-24 Thread Tom Tucker

Vu,

Are you changing any of the default settings? For example rsize/wsize, 
etc... I'd like to reproduce this problem if I can.


Thanks,

Tom

Vu Pham wrote:

Tom,

Did you make any change to have bonnie++, dd of a 10G file and vdbench
concurrently run & finish?

I keep hitting the WQE overflow error below.
I saw that most of the requests have two chunks (a 32K chunk and a
some-bytes chunk), and each chunk requires frmr + invalidate WRs;
however, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and
then for the frmr case you do
ep->rep_attr.cap.max_send_wr *= 3, which is not enough. Moreover, you
also set ep->rep_cqinit = max_send_wr/2 for the send completion signal,
which causes the WQE overflow to happen faster.

After applying the following patch, I have had vdbench, dd, and a copy of
the 10g_file running overnight.

-vu


--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c   2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c        2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs
+		 * Requests sometimes have two chunks, each chunk
+		 * requires to have different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:





  

-Original Message-
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I can not reproduce the problem
2. I also see different error on client

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name
'nobody'
does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send
returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC:   rpcrdma_event_process:
send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to
13.20.1.9:20049 closed (-103)

-vu



-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
  

Setup:
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2
QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.

Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M
count=1*, operation fail, connection get drop, client cannot
re-establish connection to server.
After rebooting only the client, I can mount again.

It happens with both solaris and linux nfsrdma servers.

For linux client/server, I run memreg=5 (FRMR), I don't see problem with
memreg=6 (global dma key)

Awesome. This is the key I think.

Thanks for the info Vu,
Tom

On Solaris server snv 130, we see problem decoding write request of 32K.
The client send two read chunks (32K & 16-byte), the server fail to do
rdma read on the 16-byte chunk (cqe.status = 10 ie.
IB_WC_REM_ACCCESS_ERROR); therefore, server terminate the connection. We
don't see this problem on nfs version 3 on Solaris. Solaris server run
normal memory registration mode.

On linux client, I see cqe.status = 12 ie. IB_WC_RETRY_EXC_ERR

I added these notes in bug #1919 (bugs.openfabrics.org) to track the
issue.

thanks,
-vu

Re: [ewg] nfsrdma fails to write big file,

2010-02-24 Thread Tom Tucker

Vu,

Based on the mapping code, it looks to me like the worst case is 
RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. 
However, I think in practice, due to the way that iov are built, the 
actual max is 5 (frmr for head + pagelist plus invalidates for same plus 
one for the send itself). Why did you think the max was 6?
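
A quick sketch of that accounting (send_wrs_per_rpc and nchunks are
illustrative names for this note, not identifiers from xprtrdma):

	/*
	 * Per chunk: one fast-register WR plus one invalidate WR,
	 * plus one send WR for the RPC itself.
	 */
	static unsigned int send_wrs_per_rpc(unsigned int nchunks)
	{
		return 2 * nchunks + 1;
	}

With the two chunks seen in practice (head + pagelist) that gives 5 send WRs
per request; with RPCRDMA_MAX_SEGS chunks it gives the
RPCRDMA_MAX_SEGS * 2 + 1 worst case mentioned above.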


Thanks,
Tom

Tom Tucker wrote:

Vu,

Are you changing any of the default settings? For example rsize/wsize, 
etc... I'd like to reproduce this problem if I can.


Thanks,

Tom

Vu Pham wrote:
  

Tom,

Did you make any change to have bonnie++, dd of a 10G file and vdbench
concurrently run & finish?

I keep hitting the WQE overflow error below.
I saw that most of the requests have two chunks (a 32K chunk and a
some-bytes chunk), and each chunk requires frmr + invalidate WRs;
however, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and
then for the frmr case you do
ep->rep_attr.cap.max_send_wr *= 3, which is not enough. Moreover, you
also set ep->rep_cqinit = max_send_wr/2 for the send completion signal,
which causes the WQE overflow to happen faster.

After applying the following patch, I have had vdbench, dd, and a copy of
the 10g_file running overnight.

-vu


--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c   2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c        2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs
+		 * Requests sometimes have two chunks, each chunk
+		 * requires to have different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:





  


-Original Message-
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I can not reproduce the problem
2. I also see different error on client

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name
'nobody'
does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send
returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC:   rpcrdma_event_process:
send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to
13.20.1.9:20049 closed (-103)

-vu


  

-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
  


Setup:
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2
QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.

Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M
count=1*, operation fail, connection get drop, client cannot
re-establish connection to server.
After rebooting only the client, I can mount again.

It happens with both solaris and linux nfsrdma servers.

For linux client/server, I run memreg=5 (FRMR), I don't see problem with
memreg=6 (global dma key)

Awesome. This is the key I think.

Thanks for the info Vu,
Tom

On Solaris server snv 130, we see problem decoding write request of 32K.
The client send two read chunks (32K & 16-byte), the server fail to do
rdma read on the 16-byte chunk (cqe.status = 10 ie

Re: [ewg] nfsrdma fails to write big file,

2010-02-24 Thread Tom Tucker

Vu,

I ran the number of slots down to 8 (echo 8 > rdma_slot_table_entries) 
and I can reproduce the issue now. I'm going to try setting the 
allocation multiple to 5 and see if I can't prove to myself and Roland 
that we've accurately computed the correct factor.


I think overall a better solution might be a different credit system, 
however, I think that's a much more substantial change than we can 
tackle at this point.
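
As a back-of-the-envelope illustration of why the send queue has to be sized
beyond the raw credit count under the current scheme (the names below are
illustrative, not xprtrdma code): completions for unsignaled sends are only
reaped when a later signaled WR completes, so WRs from up to one signaling
interval can still occupy the queue on top of the per-request budget.

	static unsigned int sq_depth_needed(unsigned int max_requests,
					    unsigned int wrs_per_request,
					    unsigned int signal_interval)
	{
		/* budgeted WRs plus WRs not yet reaped by a signaled completion */
		return max_requests * wrs_per_request + signal_interval;
	}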


Tom


Tom Tucker wrote:

Vu,

Based on the mapping code, it looks to me like the worst case is 
RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. 
However, I think in practice, due to the way that iov are built, the 
actual max is 5 (frmr for head + pagelist plus invalidates for same plus 
one for the send itself). Why did you think the max was 6?


Thanks,
Tom

Tom Tucker wrote:
  

Vu,

Are you changing any of the default settings? For example rsize/wsize, 
etc... I'd like to reproduce this problem if I can.


Thanks,

Tom

Vu Pham wrote:
  


Tom,

Did you make any change to have bonnie++, dd of a 10G file and vdbench
concurrently run & finish?

I keep hitting the WQE overflow error below.
I saw that most of the requests have two chunks (a 32K chunk and a
some-bytes chunk), and each chunk requires frmr + invalidate WRs;
however, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and
then for the frmr case you do
ep->rep_attr.cap.max_send_wr *= 3, which is not enough. Moreover, you
also set ep->rep_cqinit = max_send_wr/2 for the send completion signal,
which causes the WQE overflow to happen faster.

After applying the following patch, I have had vdbench, dd, and a copy of
the 10g_file running overnight.

-vu


--- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c   2010-02-24 10:41:22.0 -0800
+++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c        2010-02-24 10:03:18.0 -0800
@@ -649,8 +654,15 @@
 	ep->rep_attr.cap.max_send_wr = cdata->max_requests;
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_FRMR:
-		/* Add room for frmr register and invalidate WRs */
-		ep->rep_attr.cap.max_send_wr *= 3;
+		/*
+		 * Add room for frmr register and invalidate WRs
+		 * Requests sometimes have two chunks, each chunk
+		 * requires to have different frmr. The safest
+		 * WRs required are max_send_wr * 6; however, we
+		 * get send completions and poll fast enough, it
+		 * is pretty safe to have max_send_wr * 4.
+		 */
+		ep->rep_attr.cap.max_send_wr *= 4;
 		if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr)
 			return -EINVAL;
 		break;
@@ -682,7 +694,8 @@
 		ep->rep_attr.cap.max_recv_sge);
 
 	/* set trigger for requesting send completion */
-	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /*  - 1*/;
+	ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4;
+
 	switch (ia->ri_memreg_strategy) {
 	case RPCRDMA_MEMWINDOWS_ASYNC:
 	case RPCRDMA_MEMWINDOWS:





  

  

-Original Message-
From: ewg-boun...@lists.openfabrics.org [mailto:ewg-
boun...@lists.openfabrics.org] On Behalf Of Vu Pham
Sent: Monday, February 22, 2010 12:23 PM
To: Tom Tucker
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Tom,

Some more info on the problem:
1. Running with memreg=4 (FMR) I can not reproduce the problem
2. I also see different error on client

Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name
'nobody'
does not map into domain 'localdomain'
Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow
Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send
returned -12 cq_init 48 cq_count 32
Feb 22 12:17:00 mellanox-2 kernel: RPC:   rpcrdma_event_process:
send WC status 5, vend_err F5
Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to
13.20.1.9:20049 closed (-103)

-vu


  


-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Monday, February 22, 2010 10:49 AM
To: Vu Pham
Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Vu Pham wrote:
  

  

Setup:
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2
QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.

Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M
count=1*, operation fail, connection get drop, client cannot
re-establish connection to server.
After rebooting only the client, I can mount again.

It happens with both solaris and linux nfsrdma servers.

For linux client/server

Re: [ewg] nfsrdma fails to write big file,

2010-02-22 Thread Tom Tucker

Vu Pham wrote:
Setup: 
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2

QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server svn 130, ConnectX QDR HCA.


Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M
count=1*, operation fail, connection get drop, client cannot
re-establish connection to server.
After rebooting only the client, I can mount again.

It happens with both solaris and linux nfsrdma servers.

For linux client/server, I run memreg=5 (FRMR), I don't see problem with
memreg=6 (global dma key)

  


Awesome. This is the key I think.

Thanks for the info Vu,
Tom



On Solaris server snv 130, we see problem decoding write request of 32K.
The client send two read chunks (32K & 16-byte), the server fail to do
rdma read on the 16-byte chunk (cqe.status = 10 ie.
IB_WC_REM_ACCCESS_ERROR); therefore, server terminate the connection. We
don't see this problem on nfs version 3 on Solaris. Solaris server run
normal memory registration mode.

On linux client, I see cqe.status = 12 ie. IB_WC_RETRY_EXC_ERR

I added these notes in bug #1919 (bugs.openfabrics.org) to track the
issue.

thanks,
-vu
  




Re: [ewg] MLX4 Strangeness

2010-02-17 Thread Tom Tucker

Hi Tziporet:

Here is a trace with the data for WR failing with status 12. The vendor 
error is 129.


Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 
 status 12 opcode 0 vendor_err 129 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 
81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:167 wr_id 
81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002a13ec00 ex  src_qp  wc_flags, 0 pkey_index


Any thoughts?
Tom

Tom Tucker wrote:

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
 

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
have dual ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two 
systems

running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER 
system

fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they 
did not

error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   

Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.



Please ignore this. This log skips the failing WR (:-\). I need to do 
another trace.




Does the issue occurs only on the ConnectX cards (mlx4) or also on 
the InfiniHost cards (mthca)


Tziporet

  










Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
  

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
have dual ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did not
error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   


Tom
What is the vendor syndrome error when you get a completion with error?

  

Hang on... compiling
Does the issue occurs only on the ConnectX cards (mlx4) or also on the 
InfiniHost cards (mthca)


  


Only the MLX4 cards.


Tziporet

  




Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
  

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
have dual ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did not
error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   


Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.

Does the issue occurs only on the ConnectX cards (mlx4) or also on the 
InfiniHost cards (mthca)


Tziporet

  




Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
 

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
have dual ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did 
not

error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   

Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.



Please ignore this. This log skips the failing WR (:-\). I need to do 
another trace.




Does the issue occurs only on the ConnectX cards (mlx4) or also on 
the InfiniHost cards (mthca)


Tziporet

  







Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker


More info...

Rebooting the client and trying to reconnect to a server that has not been
rebooted fails in the same way.


It must be an issue with the server. I see no completions on the server 
or any indication that an RDMA_SEND was incoming. Is there some way to 
dump adapter state or otherwise see if there was traffic on the wire?
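
One low-tech way to see whether anything made it onto the wire is to snapshot
the standard IB port counters in sysfs before and after the failing send (a
sketch only; it assumes mlx4_0 port 1, so adjust the paths for the device and
port actually in use):

	#include <stdio.h>

	static unsigned long long read_counter(const char *path)
	{
		unsigned long long v = 0;
		FILE *f = fopen(path, "r");

		if (f) {
			if (fscanf(f, "%llu", &v) != 1)
				v = 0;
			fclose(f);
		}
		return v;
	}

	int main(void)
	{
		const char *base = "/sys/class/infiniband/mlx4_0/ports/1/counters/";
		const char *names[] = { "port_xmit_data", "port_rcv_data",
					"port_xmit_packets", "port_rcv_packets" };
		char path[256];
		unsigned int i;

		for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
			snprintf(path, sizeof(path), "%s%s", base, names[i]);
			printf("%s %llu\n", names[i], read_counter(path));
		}
		return 0;
	}

Running this on both ends around the rping/NFS attempt should show whether
the client's xmit counters move while the server's receive counters do not,
i.e. whether the RDMA_SEND is at least leaving the client.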


Tom


Tom Tucker wrote:

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
 

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
have dual ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two 
systems

running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER 
system

fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they 
did not

error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   

Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.



Please ignore this. This log skips the failing WR (:-\). I need to do 
another trace.




Does the issue occurs only on the ConnectX cards (mlx4) or also on 
the InfiniHost cards (mthca)


Tziporet

  







