Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()
On 4/21/15 2:39 AM, Michael Wang wrote:
> On 04/20/2015 05:51 PM, Tom Tucker wrote:
>> [snip]
>>>>> int ib_query_gid(struct ib_device *device, u8 port_num,
>>>>>                  int index, union ib_gid *gid);
>>>> iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.
>>> Sean suggested to add this helper paired with cap_ib_cm(); maybe there are some considerations on maintainability? I also prefer this way, to make the code more readable ;-)
>> It's more consistent, but not necessarily more readable -- if by readability we mean understanding. If the reader knows how the transports work, then the reader would be confused by the addition of a check that is always true. For the reader that doesn't know, the addition of the check implies that the support is optional, which it is not.
> The purpose is to make sure folks understand what we really want to check when they review the code :-) and to prepare for further reform which may not rely on technology type any more; for example, the device could tell the core layer directly what management it requires, with a bitmask :-)

Hi Michael,

Thanks for the reply, but my premise was just wrong... I need to review the whole patch, not just a snippet.

Thanks,
Tom

> Regards,
> Michael Wang
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()
On 4/20/15 11:19 AM, Jason Gunthorpe wrote:
> On Mon, Apr 20, 2015 at 10:51:58AM -0500, Tom Tucker wrote:
>> On 4/20/15 10:16 AM, Michael Wang wrote:
>>> On 04/20/2015 04:00 PM, Steve Wise wrote:
>>>> On 4/20/2015 3:40 AM, Michael Wang wrote:
>>>>> [snip]
>>>>>
>>>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>>>> index 6805e3e..e4999f6 100644
>>>>> +++ b/include/rdma/ib_verbs.h
>>>>> @@ -1818,6 +1818,21 @@ static inline int cap_ib_cm(struct ib_device *device, u8 port_num)
>>>>>  	return rdma_ib_or_iboe(device, port_num);
>>>>>  }
>>>>>
>>>>> +/**
>>>>> + * cap_iw_cm - Check if the port of device has the capability IWARP
>>>>> + * Communication Manager.
>>>>> + *
>>>>> + * @device: Device to be checked
>>>>> + * @port_num: Port number of the device
>>>>> + *
>>>>> + * Return 0 when port of the device don't support IWARP
>>>>> + * Communication Manager.
>>>>> + */
>>>>> +static inline int cap_iw_cm(struct ib_device *device, u8 port_num)
>>>>> +{
>>>>> +	return rdma_tech_iwarp(device, port_num);
>>>>> +}
>>>>> +
>>>>>  int ib_query_gid(struct ib_device *device, u8 port_num,
>>>>>  		 int index, union ib_gid *gid);
>>>> iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.
>>> Sean suggested to add this helper paired with cap_ib_cm(); maybe there are some considerations on maintainability? I also prefer this way, to make the code more readable ;-)
>> It's more consistent, but not necessarily more readable -- if by readability we mean understanding. If the reader knows how the transports work, then the reader would be confused by the addition of a check that is always true. For the reader that doesn't know, the addition of the check implies that the support is optional, which it is not.
> No, it says this code is concerned with the unique parts of iWarp related to CM, not the other unique parts of iWarp. The check isn't always true, it is just always true on iWarp devices. That became the problem with the old way of just saying 'is iWarp' (and others). There are too many differences; the why became lost in many places.
>
> There are now too many standards, and several do not have public docs, to keep relying on a mess of 'is standard' tests.

You're right Jason, this gets called with the device handle so it's only true for iWARP.

> Jason
Re: [PATCH v5 19/27] IB/Verbs: Use management helper cap_iw_cm()
On 4/20/15 10:16 AM, Michael Wang wrote:
> On 04/20/2015 04:00 PM, Steve Wise wrote:
>> On 4/20/2015 3:40 AM, Michael Wang wrote:
>>> [snip]
>>>
>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>> index 6805e3e..e4999f6 100644
>>> --- a/include/rdma/ib_verbs.h
>>> +++ b/include/rdma/ib_verbs.h
>>> @@ -1818,6 +1818,21 @@ static inline int cap_ib_cm(struct ib_device *device, u8 port_num)
>>>  	return rdma_ib_or_iboe(device, port_num);
>>>  }
>>>
>>> +/**
>>> + * cap_iw_cm - Check if the port of device has the capability IWARP
>>> + * Communication Manager.
>>> + *
>>> + * @device: Device to be checked
>>> + * @port_num: Port number of the device
>>> + *
>>> + * Return 0 when port of the device don't support IWARP
>>> + * Communication Manager.
>>> + */
>>> +static inline int cap_iw_cm(struct ib_device *device, u8 port_num)
>>> +{
>>> +	return rdma_tech_iwarp(device, port_num);
>>> +}
>>> +
>>>  int ib_query_gid(struct ib_device *device, u8 port_num,
>>>  		 int index, union ib_gid *gid);
>> iWARP devices _must_ support the IWCM so cap_iw_cm() is not really useful.
> Sean suggested to add this helper paired with cap_ib_cm(); maybe there are some considerations on maintainability? I also prefer this way, to make the code more readable ;-)

It's more consistent, but not necessarily more readable -- if by readability we mean understanding. If the reader knows how the transports work, then the reader would be confused by the addition of a check that is always true. For the reader that doesn't know, the addition of the check implies that the support is optional, which it is not.

Tom

> Regards,
> Michael Wang
Re: [PATCH v4 27/27] IB/Verbs: Cleanup rdma_node_get_transport()
On 4/16/15 8:45 AM, Michael Wang wrote:
> On 04/16/2015 03:42 PM, Hal Rosenstock wrote:
>> On 4/16/2015 9:41 AM, Michael Wang wrote:
>>> On 04/16/2015 03:36 PM, Hal Rosenstock wrote:
>>>> [snip]
>>>>
>>>> -EXPORT_SYMBOL(rdma_node_get_transport);
>>>> -
>>>>  enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_num)
>>>>  {
>>>>  	if (device->get_link_layer)
>>>>
>>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>>> index 262bf44..f9ef479 100644
>>>> --- a/include/rdma/ib_verbs.h
>>>> +++ b/include/rdma/ib_verbs.h
>>>> @@ -84,9 +84,6 @@ enum rdma_transport_type {
>>>>  	RDMA_TRANSPORT_IBOE,
>>>>  };
>>>>
>>>> -__attribute_const__ enum rdma_transport_type
>>>> -rdma_node_get_transport(enum rdma_node_type node_type);
>>>> -
>>>>  enum rdma_link_layer {
>>>>  	IB_LINK_LAYER_UNSPECIFIED,
>>>>  	IB_LINK_LAYER_INFINIBAND,
>>> Is IB_LINK_LAYER_UNSPECIFIED still possible?
>> Actually it's impossible in the kernel in the first place; all those who implemented the callback won't return UNSPECIFIED, and the others all have the correct transport type (otherwise BUG()) and won't result in UNSPECIFIED :-)
> Should it be removed from this enum somewhere in this patch series (perhaps early on)?

I don't think it's ever been 'possible.' Its purpose is to catch initialization errors where the transport fails to initialize its transport type. So for example:

	provider = calloc(1, sizeof *provider)

If 0 is a valid link layer type, then you wouldn't catch these kinds of errors.

Tom

> It was still directly used by helpers like ib_modify_qp_is_ok() as an indicator; it may be better to reform those parts in another, following patch :-)
>
> Regards,
> Michael Wang

>> -- Hal
Re: [PATCH] scsi: fnic: use kernel's '%pM' format option to print MAC
Hi Andy,

On 3/19/15 12:54 PM, Andy Shevchenko wrote:
> On Tue, 2014-04-29 at 17:45 +0300, Andy Shevchenko wrote:
>> Instead of supplying each byte through the stack, let's use the %pM specifier.
> Anyone to comment or apply this patch?
>> Signed-off-by: Andy Shevchenko andriy.shevche...@linux.intel.com
>> Cc: Tom Tucker t...@opengridcomputing.com
>> Cc: Steve Wise sw...@opengridcomputing.com
>> Cc: linux-rdma@vger.kernel.org
>> ---
>>  drivers/scsi/fnic/vnic_dev.c | 10 ++--------
>>  1 file changed, 2 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/scsi/fnic/vnic_dev.c b/drivers/scsi/fnic/vnic_dev.c
>> index 9795d6f..ba69d61 100644
>> --- a/drivers/scsi/fnic/vnic_dev.c
>> +++ b/drivers/scsi/fnic/vnic_dev.c
>> @@ -499,10 +499,7 @@ void vnic_dev_add_addr(struct vnic_dev *vdev, u8 *addr)
>>  	err = vnic_dev_cmd(vdev, CMD_ADDR_ADD, &a0, &a1, wait);
>>  	if (err)
>> -		printk(KERN_ERR
>> -			"Can't add addr [%02x:%02x:%02x:%02x:%02x:%02x], %d\n",
>> -			addr[0], addr[1], addr[2], addr[3], addr[4], addr[5],
>> -			err);
>> +		pr_err("Can't add addr [%pM], %d\n", addr, err);

This looks completely reasonable to me.

Tom

>>  }
>>
>>  void vnic_dev_del_addr(struct vnic_dev *vdev, u8 *addr)
>> @@ -517,10 +514,7 @@ void vnic_dev_del_addr(struct vnic_dev *vdev, u8 *addr)
>>  	err = vnic_dev_cmd(vdev, CMD_ADDR_DEL, &a0, &a1, wait);
>>  	if (err)
>> -		printk(KERN_ERR
>> -			"Can't del addr [%02x:%02x:%02x:%02x:%02x:%02x], %d\n",
>> -			addr[0], addr[1], addr[2], addr[3], addr[4], addr[5],
>> -			err);
>> +		pr_err("Can't del addr [%pM], %d\n", addr, err);
>>  }
>>
>>  int vnic_dev_notify_set(struct vnic_dev *vdev, u16 intr)
Re: NFS over RDMA crashing
Hi Trond,

I think this patch is still 'off-by-one'. We'll take a look at this today.

Thanks,
Tom

On 3/12/14 9:05 AM, Trond Myklebust wrote:
> On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:
>> On Sat, 08 Mar 2014 14:13:44 -0600 Steve Wise sw...@opengridcomputing.com wrote:
>>> On 3/8/2014 1:20 PM, Steve Wise wrote:
>>>> I removed your change and started debugging the original crash that happens on top-of-tree. Seems like rq_next_pages is screwed up. It should always be >= rq_respages, yes? I added a BUG_ON() to assert this in rdma_read_xdr() and we hit the BUG_ON(). Look:
>>>>
>>>> crash> svc_rqst.rq_next_page 0x8800b84e6000
>>>>   rq_next_page = 0x8800b84e6228
>>>> crash> svc_rqst.rq_respages 0x8800b84e6000
>>>>   rq_respages = 0x8800b84e62a8
>>>>
>>>> Any ideas Bruce/Tom?
>>> Guys, the patch below seems to fix the problem. Dunno if it is correct though. What do you think?
>>>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> index 0ce7552..6d62411 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>>> @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>>>  		sge_no++;
>>>  	}
>>>  	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
>>> +	rqstp->rq_next_page = rqstp->rq_respages;
>>>
>>>  	/* We should never run out of SGE because the limit is defined to
>>>  	 * support the max allowed RPC data length
>>> @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
>>>
>>>  	/* rq_respages points one past arg pages */
>>>  	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
>>> +	rqstp->rq_next_page = rqstp->rq_respages;
>>>
>>>  	/* Create the reply and chunk maps */
>>>  	offset = 0;
>>>
>>> While this patch avoids the crashing, it apparently isn't correct... I'm getting IO errors reading files over the mount. :)
>> I hit the same oops and tested your patch and it seems to have fixed that particular panic, but I still see a bunch of other mem corruption oopses even with it. I'll look more closely at that when I get some time.
>> FWIW, I can easily reproduce that by simply doing something like:
>>
>>   $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1
>>
>> I'm not sure why you're not seeing any panics with your patch in place. Perhaps it's due to hw differences between our test rigs. The EIO problem that you're seeing is likely the same client bug that Chuck recently fixed in this patch:
>>
>>   [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA
>>
>> AIUI, Trond is merging that set for 3.15, so I'd make sure your client has those patches when testing.
> Nothing is in my queue yet.
>
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.mykleb...@primarydata.com
Re: Proposal for simplifying NFS/RDMA client memory registration
Hi Chuck,

I have a patch for the server side that simplifies memory registration and fixes a bug where the server ignores the FRMR hardware limits. This bug is actually upstream now. I have been sitting on it because it's a big patch and will require a lot of testing/review to get it upstream. This is just an FYI in case there is someone on your team who has the bandwidth to take this work and finish it up.

Thanks,
Tom

On 2/28/14 8:59 PM, Chuck Lever wrote:
> Hi Wendy-
>
> On Feb 28, 2014, at 5:26 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
>> On Fri, Feb 28, 2014 at 2:20 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
>>> On Fri, Feb 28, 2014 at 1:41 PM, Tom Talpey t...@talpey.com wrote:
>>>> On 2/26/2014 8:44 AM, Chuck Lever wrote:
>>>>> Hi-
>>>>>
>>>>> Shirley Ma and I are reviving work on the NFS/RDMA client code base in the Linux kernel. So far we've built and run functional tests to determine what is working and what is broken. [snip]
>>>>>
>>>>> ALLPHYSICAL - Usually fast, but not safe as it exposes client memory. All HCAs support this mode.
>>>> "Not safe" is an understatement. It exposes all of client physical memory to the peer, for both read and write. A simple pointer error on the server will silently corrupt the client. This mode was intended only for testing, and in experimental deployments.
>>> (sorry, resend .. previous reply bounced back due to gmail html format)
>>
>> Please keep ALLPHYSICAL for now - as our embedded system needs it.
> This is just the client side. Confirming that you still need support for the ALLPHYSICAL memory registration mode in the NFS/RDMA client. Do you have plans to move to a mode that is less risky? If not, can we depend on you to perform regular testing with ALLPHYSICAL as we update the client code? Do you have any bug fixes you'd like to merge upstream?
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Re: MLX4 Cq Question
Hi Guys,

One other quick one. I've received conflicting claims on the validity of wc.opcode when wc.status != 0 for mlx4 hardware. My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware CQE owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an error and therefore the only way to recover what the opcode was is through the wr_id you used when submitting the WR. Is my reading of the code correct?

Thanks,
Tom

On 5/20/13 9:53 AM, Jack Morgenstein wrote:
> On Saturday 18 May 2013 00:37, Roland Dreier wrote:
>> On Fri, May 17, 2013 at 12:25 PM, Tom Tucker t...@opengridcomputing.com wrote:
>>> I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this.
>> This is a bit tricky. If you look at
>>
>> void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
>> {
>> 	struct mlx4_priv *priv = mlx4_priv(dev);
>> 	struct mlx4_cq_table *cq_table = &priv->cq_table;
>> 	int err;
>>
>> 	err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
>> 	if (err)
>> 		mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n", err, cq->cqn);
>>
>> 	synchronize_irq(priv->eq_table.eq[cq->vector].irq);
>>
>> 	spin_lock_irq(&cq_table->lock);
>> 	radix_tree_delete(&cq_table->tree, cq->cqn);
>> 	spin_unlock_irq(&cq_table->lock);
>>
>> 	if (atomic_dec_and_test(&cq->refcount))
>> 		complete(&cq->free);
>> 	wait_for_completion(&cq->free);
>>
>> 	mlx4_cq_free_icm(dev, cq->cqn);
>> }
>>
>> you see that when freeing a CQ, we first do the HW2SW_CQ firmware command; once this command completes, no more events will be generated for that CQ.
>>
>> Then we do synchronize_irq for the CQ's interrupt vector. Once that completes, no more completion handlers will be running for the CQ, so we can safely delete the CQ from the radix tree (relying on the radix tree's safety of deleting one entry while possibly looking up other entries, so no lock is needed). We also use the lock to synchronize against the CQ event function, which as you noted does take the lock too.
>>
>> Basic idea is that we're tricky and careful so we can make the fast path (completion interrupt handling) lock-free, but then use locks and whatever else needed in the slow path (CQ async event handling, CQ destroy).
>>
>>  - R.
> ===
> Roland, unfortunately we have seen that we need some locking in the cq completion handler (there is a stack trace which resulted from this lack of proper locking). In our current driver, we are using the patch below (which uses RCU locking instead of spinlocks). I can prepare a proper patch for the upstream kernel.
> ===
> net/mlx4_core: Fix racy flow in the driver CQ completion handler
>
> The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock the radix tree which is used to manage the table of CQs, nor does it increase the reference count of the CQ before invoking the user provided callback (and decrease it afterwards). This is racy and can cause use-after-free, null pointer dereference, etc, which result in kernel crashes.
>
> To fix this, we must do the following in mlx4_cq_completion:
> - increase the ref count on the cq before invoking the user callback, and decrement it after the callback.
> - Place a lock around the radix tree lookup/ref-count-increase
>
> Using an irq spinlock will not fix this issue. The problem is that under VPI, the ETH interface uses multiple msix irq's, which can result in one cq completion event interrupting another in-progress cq completion event. A deadlock results when the handler for the first cq completion grabs the spinlock, and is interrupted by the second completion before it has a chance to release the spinlock. The handler for the second completion will deadlock waiting for the spinlock to be released.
>
> The proper fix is to use the RCU mechanism for locking radix-tree accesses in the cq completion event handler (the radix-tree implementation uses the RCU mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the updater, will do the job). Note that the same issue exists in mlx4_cq_event() (the cq async event handler), which also takes the same lock on the radix tree. Here, we replace the spinlock with an rcu_read_lock().
>
> This patch was motivated by the following report from the field:
> [...] box panic'ed when trying to find a completion queue. There is no corruption but there is a possible race which could
Re: MLX4 Cq Question
On 5/20/13 2:58 PM, Hefty, Sean wrote:
>> My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an error and therefore, the only way to recover what the opcode was is through the wr_id you used when submitting the WR. Is my reading of the code correct?
> I believe this is true wrt the IB spec.

Thanks, this was my recollection as well.

Tom
MLX4 Cq Question
Hi Roland,

I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular, in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree; however, the mlx4_cq_event function does. In addition, if I go look at the code paths where CQs are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this.

Thanks,
Tom
Re: NFS over RDMA benchmark
On 4/30/13 9:38 AM, Yan Burman wrote:
>> -----Original Message-----
>> From: Tom Talpey [mailto:t...@talpey.com]
>> Sent: Tuesday, April 30, 2013 17:20
>> To: Yan Burman
>> Cc: J. Bruce Fields; Wendy Cheng; Atchley, Scott; Tom Tucker; linux-r...@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
>> Subject: Re: NFS over RDMA benchmark
>>
>> On 4/30/2013 1:09 AM, Yan Burman wrote:
>>> I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
>>> ...
>>> ib_send_bw with intel iommu enabled did get up to 4.5GB/sec
>> BTW, you may want to verify that these are the same GB. Many benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB. At GB/GiB, the difference is about 7.5%, very close to the difference between 4.1 and 4.5. Just a thought.
> The question is not why there is a 400MBps difference between ib_send_bw and NFSoRDMA. The question is why with IOMMU ib_send_bw got to the same bandwidth as without it while NFSoRDMA got half.

NFSRDMA is constantly registering and unregistering memory when you use FRMR mode. By contrast, IPoIB has a descriptor ring that is set up once and re-used. I suspect this is the difference maker. Have you tried running the server in ALL_PHYSICAL mode, i.e. where it uses a DMA_MR for all of memory?

Tom

> From some googling, it seems that when IOMMU is enabled, dma mapping functions get a lot more expensive. Perhaps that is the reason for the performance drop.
>
> Yan
Re: NFS over RDMA benchmark
On 4/29/13 7:16 AM, Yan Burman wrote:
>> -----Original Message-----
>> From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
>> Sent: Monday, April 29, 2013 08:35
>> To: J. Bruce Fields
>> Cc: Yan Burman; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz
>> Subject: Re: NFS over RDMA benchmark
>>
>> On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:
>>>> On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
>>>>> When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
>>> ...
>>> [snip]
>>>
>>>   36.18% nfsd [kernel.kallsyms] [k] mutex_spin_on_owner
>>> That's the inode i_mutex.
>>>
>>>   14.70% -- svc_send
>>> That's the xpt_mutex (ensuring rpc replies aren't interleaved).
>>>
>>>   9.63% nfsd [kernel.kallsyms] [k] _raw_spin_lock_irqsave
>>> And that (and __free_iova below) looks like iova_rbtree_lock.
>>
>> Let's revisit your command:
>> FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --buffered=0
> I tried block sizes from 4-512K. 4K does not give 2.2GB bandwidth - optimal bandwidth is achieved around 128-256K block size.
>> * inode's i_mutex: If increasing process/file count didn't help, maybe increasing iodepth (say 512?) could offset the i_mutex overhead a little bit?
> I tried with different iodepth parameters, but found no improvement above iodepth 128.
>> * xpt_mutex: (no idea)
>> * iova_rbtree_lock: DMA mapping fragmentation? I have not studied whether NFS-RDMA routines such as svc_rdma_sendto() could do better, but maybe sequential IO (instead of randread) could help? Bigger block size (instead of 4K) can help?

I think the biggest issue is that max_payload for TCP is 2MB but only 256k for RDMA.
> I am trying to simulate real load (more or less); that is the reason I use randread. Anyhow, read does not result in better performance. It's probably because backing storage is tmpfs...
>
> Yan
Re: NFS over RDMA benchmark
On 4/29/13 8:05 AM, Tom Tucker wrote:
> On 4/29/13 7:16 AM, Yan Burman wrote:
>> [snip]
> I think the biggest issue is that max_payload for TCP is 2MB but only 256k for RDMA.

Sorry, I meant 1MB...

>> I am trying to simulate real load (more or less); that is the reason I use randread. Anyhow, read does not result in better performance. It's probably because backing storage is tmpfs...
>>
>> Yan
Re: NFS over RDMA benchmark
On 4/25/13 3:04 PM, Tom Talpey wrote:
> On 4/25/2013 1:18 PM, Wendy Cheng wrote:
>> On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:
>>>> On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
>>>> So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1 tar ball)... Here is a random thought (not related to the rb tree comment). The inflight packet count seems to be controlled by xprt_rdma_slot_table_entries that is currently hard-coded as RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help with the bandwidth number if we pump it up, say 64 instead? Not sure whether FMR pool size needs to get adjusted accordingly though.
>>> 1) The client slot count is not hard-coded, it can easily be changed by writing a value to /proc and initiating a new mount. But I doubt that increasing the slot table will improve performance much, unless this is a small-random-read, and spindle-limited workload.
>> Hi Tom! It was a shot in the dark :) .. as our test bed has not been set up yet. However, since I'll be working on (very) slow clients, increasing this buffer is still interesting (to me). I don't see where it is controlled by a /proc value (?) - but that is not a concern at this moment, as a /proc entry is easy to add.
> The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking for is called rdma_slot_table_entries.
>> More questions on the server though (see below)...
>>> 2) The observation appears to be that the bandwidth is server CPU limited. Increasing the load offered by the client probably won't move the needle, until that's addressed.
>> Could you give more hints on which part of the path is CPU limited?
> Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker has some ideas on the srv rdma code, but it could also be in the sunrpc or infiniband driver layers, can't really tell without the call stacks.
The Mellanox driver uses red-black trees extensively for resource management, e.g. QP ID, CQ ID, etc. When completions come in from the HW, these are used to find the associated software data structures, I believe. It is certainly possible that these trees get hot on lookup when we're pushing a lot of data. I'm surprised, however, to see rb_insert_color there, because I'm not aware of anywhere that resources are being inserted into and/or removed from a red-black tree in the data path.

They are also used by IPoIB and the IB CM; however, connections should not be coming and going unless we've got other problems. IPoIB is only used by the IB transport for connection set up, and my impression is that this trace is for the IB transport.

I don't believe that red-black trees are used by either the client or server transports directly. Note that the rb_lock in the client is for buffers; not, as the name might imply, a red-black tree. I think the key here is to discover what lock is being waited on. Are we certain that it's a lock on a red-black tree, and if so, which one?

Tom

>> Is there a known Linux-based filesystem that is reasonably tuned for NFS-RDMA? Any specific filesystem features that would work well with NFS-RDMA? I'm wondering, when disk+FS are added into the configuration, how much advantage would NFS-RDMA get when compared with a plain TCP/IP transport, say IPoIB on CM?
> NFS-RDMA is not really filesystem dependent, but certainly there are considerations for filesystems to support NFS, and of course the goal in general is performance. NFS-RDMA is a network transport, applicable to both client and server. Filesystem choice is a server consideration. I don't have a simple answer to your question about how much better NFS-RDMA is over other transports. Architecturally, a lot. In practice, there are many, many variables. Have you seen RFC5532, which I cowrote with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532
Re: NFS over RDMA crashing
On 2/6/13 3:28 PM, Steve Wise wrote: On 2/6/2013 4:24 PM, J. Bruce Fields wrote: On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote: When killing mount command that got stuck: --- BUG: unable to handle kernel paging request at 880324dc7ff8 IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] PGD 1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161 Oops: 0003 [#1] PREEMPT SMP Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm iw_cm ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock target_core_file target_core_pscsi target_core_mod configfs 8021q bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3 jbd sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod CPU 6 Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro X8DTH-i/6/iF/6F/X8DTH RIP: 0010:[a05f3dfb] [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] RSP: 0018:880324c3dbf8 EFLAGS: 00010297 RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428 RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618 RBP: 880324c3dd78 R08: 60f9c860 R09: 0001 R10: 880324dd8000 R11: 0001 R12: 8806299dcb10 R13: 0003 R14: 0001 R15: 0010 FS: () GS:88063fc0() knlGS: CS: 0010 DS: ES: CR0: 8005003b CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process nfsd (pid: 4744, threadinfo 880324c3c000, task 88033055) Stack: 880324c3dc78 880324c3dcd8 0282 880631cec000 880324dd8000 88062ed33040 000124c3dc48 880324dd8000 88062ed33058 880630ce2b90 8806299e8000 0003 Call Trace: [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma] [81086540] ? 
try_to_wake_up+0x2f0/0x2f0 [a045963f] svc_recv+0x3ef/0x4b0 [sunrpc] [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd] [a0571e5d] nfsd+0xad/0x130 [nfsd] [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd] [81071df6] kthread+0xd6/0xe0 [81071d20] ? __init_kthread_worker+0x70/0x70 [814b462c] ret_from_fork+0x7c/0xb0 [81071d20] ? __init_kthread_worker+0x70/0x70 Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00 00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00 48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00 RIP [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] RSP 880324c3dbf8 CR2: 880324dc7ff8 ---[ end trace 06d0384754e9609a ]--- It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e ("nfsd4: cleanup: replace rq_resused count by rq_next_page pointer") is responsible for the crash (it seems to be crashing in net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527) It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet. When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I was no longer getting the server crashes, so the rest of my tests were done using that point (it is somewhere in the middle of 3.7.0-rc2). OK, so this part's clearly my fault--I'll work on a patch, but the rdma's use of the ->rq_pages array is pretty confusing. Maybe Tom can shed some light? Yes, the RDMA transport has two confusing tweaks on rq_pages. Most transports (UDP/TCP) use the rq_pages allocated by SVC. For RDMA, however, the RQ already contains pre-allocated memory that will contain inbound NFS requests from the client. Instead of copying this data from the pre-registered receive buffer into the buffer in rq_pages, I just replace the page in rq_pages with the one that already contains the data. The second somewhat strange thing is that the NFS request contains an NFSRDMA header. This is just like TCP (i.e. 
the 4B length), however, the difference is that (unlike TCP) this header is needed for the response because it maps out where in the client the response data will be written. Tom
[PATCH 0/2] RPCRDMA Fixes
This pair of patches fixes a problem with the marshalling of XDR into RPCRDMA messages and an issue with FRMR mapping in the presence of transport errors. The problems were discovered together as part of looking into the ENOSPC problems seen by spe...@shiftmail.com. The fixes, however, are independent and do not rely on each other. I have tested them independently and together on 64b with both Infiniband and iWARP. They have been compile tested on 32b.

---

Tom Tucker (2):
      RPCRDMA: Fix FRMR registration/invalidate handling.
      RPCRDMA: Fix to XDR page base interpretation in marshalling logic.

 net/sunrpc/xprtrdma/rpc_rdma.c  |   86 +++++++++++++++++----------------
 net/sunrpc/xprtrdma/verbs.c     |   52 ++++++++++++++++----
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
 3 files changed, 87 insertions(+), 52 deletions(-)

--
Signed-off-by: Tom Tucker t...@ogc.us
[PATCH 1/2] RPCRDMA: Fix to XDR page base interpretation in marshalling logic.
The RPCRDMA marshalling logic assumed that xdr->page_base was an offset into the first page of xdr->page_list. It is in fact an offset into the xdr->page_list itself, that is, it selects the first page in the page_list and the offset into that page. The symptom depended in part on the rpc_memreg_strategy: if it was FRMR, or some other one-shot mapping mode, the connection would get torn down on a base and bounds error. When the badly marshalled RPC was retransmitted it would reconnect, get the error, and tear down the connection again in a loop forever. This resulted in a hung mount. For the other modes, it would result in silent data corruption. This bug is most easily reproduced by writing more data than the filesystem has space for. This fix corrects the page_base assumption and otherwise simplifies the iov mapping logic.

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/rpc_rdma.c |   86 ++++++++++++++++++++++----------------
 1 files changed, 42 insertions(+), 44 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2ac3f6e..554d081 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -87,6 +87,8 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
 	enum rpcrdma_chunktype type, struct rpcrdma_mr_seg *seg, int nsegs)
 {
 	int len, n = 0, p;
+	int page_base;
+	struct page **ppages;
 
 	if (pos == 0 && xdrbuf->head[0].iov_len) {
 		seg[n].mr_page = NULL;
@@ -95,34 +97,32 @@ rpcrdma_convert_iovs(struct xdr_buf *xdrbuf, unsigned int pos,
 		++n;
 	}
 
-	if (xdrbuf->page_len && (xdrbuf->pages[0] != NULL)) {
-		if (n == nsegs)
-			return 0;
-		seg[n].mr_page = xdrbuf->pages[0];
-		seg[n].mr_offset = (void *)(unsigned long) xdrbuf->page_base;
-		seg[n].mr_len = min_t(u32,
-			PAGE_SIZE - xdrbuf->page_base, xdrbuf->page_len);
-		len = xdrbuf->page_len - seg[n].mr_len;
+	len = xdrbuf->page_len;
+	ppages = xdrbuf->pages + (xdrbuf->page_base >> PAGE_SHIFT);
+	page_base = xdrbuf->page_base & ~PAGE_MASK;
+	p = 0;
+	while (len && n < nsegs) {
+		seg[n].mr_page = ppages[p];
+		seg[n].mr_offset = (void *)(unsigned long) page_base;
+		seg[n].mr_len = min_t(u32, PAGE_SIZE - page_base, len);
+		BUG_ON(seg[n].mr_len > PAGE_SIZE);
+		len -= seg[n].mr_len;
 		++n;
-		p = 1;
-		while (len > 0) {
-			if (n == nsegs)
-				return 0;
-			seg[n].mr_page = xdrbuf->pages[p];
-			seg[n].mr_offset = NULL;
-			seg[n].mr_len = min_t(u32, PAGE_SIZE, len);
-			len -= seg[n].mr_len;
-			++n;
-			++p;
-		}
+		++p;
+		page_base = 0;	/* page offset only applies to first page */
 	}
 
+	/* Message overflows the seg array */
+	if (len && n == nsegs)
+		return 0;
+
 	if (xdrbuf->tail[0].iov_len) {
 		/* the rpcrdma protocol allows us to omit any trailing
 		 * xdr pad bytes, saving the server an RDMA operation. */
 		if (xdrbuf->tail[0].iov_len < 4 && xprt_rdma_pad_optimize)
 			return n;
 		if (n == nsegs)
+			/* Tail remains, but we're out of segments */
 			return 0;
 		seg[n].mr_page = NULL;
 		seg[n].mr_offset = xdrbuf->tail[0].iov_base;
@@ -296,6 +296,8 @@ rpcrdma_inline_pullup(struct rpc_rqst *rqst, int pad)
 	int copy_len;
 	unsigned char *srcp, *destp;
 	struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(rqst->rq_xprt);
+	int page_base;
+	struct page **ppages;
 
 	destp = rqst->rq_svec[0].iov_base;
 	curlen = rqst->rq_svec[0].iov_len;
@@ -324,28 +326,25 @@ rpcrdma_inline_pullup(struct rpc_rqst *rqst, int pad)
 			__func__, destp + copy_len, curlen);
 		rqst->rq_svec[0].iov_len += curlen;
 	}
-
 	r_xprt->rx_stats.pullup_copy_count += copy_len;
-	npages = PAGE_ALIGN(rqst->rq_snd_buf.page_base+copy_len) >> PAGE_SHIFT;
+
+	page_base = rqst->rq_snd_buf.page_base;
+	ppages = rqst->rq_snd_buf.pages + (page_base >> PAGE_SHIFT);
+	page_base &= ~PAGE_MASK;
+	npages = PAGE_ALIGN(page_base+copy_len) >> PAGE_SHIFT;
 	for (i = 0; copy_len && i < npages; i++) {
-		if (i == 0)
-			curlen = PAGE_SIZE - rqst->rq_snd_buf.page_base;
-		else
-			curlen = PAGE_SIZE;
+		curlen = PAGE_SIZE - page_base;
 		if (curlen > copy_len)
 			curlen = copy_len;
 		dprintk("RPC:       %s: page %d destp 0x%p
[PATCH 2/2] RPCRDMA: Fix FRMR registration/invalidate handling.
When the rpc_memreg_strategy is 5, FRMRs are used to map RPC data. This mode uses an FRMR to map the RPC data, then invalidates (i.e. unregisters) the data in xprt_rdma_free. These FRMRs are used across connections on the same mount, i.e. if the connection goes away on an idle timeout and reconnects later, the FRMRs are not destroyed and recreated. This creates a problem for transport errors because the WR that invalidates an FRMR may be flushed (i.e. fail) leaving the FRMR valid. When the FRMR is later used to map an RPC it will fail, tearing down the transport and starting over. Over time, more and more of the FRMR pool end up in the wrong state, resulting in seemingly random disconnects. This fix keeps track of the FRMR state explicitly by setting its state based on the successful completion of a reg/inv WR. If the FRMR is ever used and found to be in the wrong state, an invalidate WR is prepended, re-syncing the FRMR state and avoiding the connection loss.

Signed-off-by: Tom Tucker t...@ogc.us
---

 net/sunrpc/xprtrdma/verbs.c     |   52 +++++++++++++++++++++++++++-----
 net/sunrpc/xprtrdma/xprt_rdma.h |    1 +
 2 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5f4c7b3..570f08d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -144,6 +144,7 @@ rpcrdma_cq_async_error_upcall(struct ib_event *event, void *context)
 static inline
 void rpcrdma_event_process(struct ib_wc *wc)
 {
+	struct rpcrdma_mw *frmr;
 	struct rpcrdma_rep *rep =
 			(struct rpcrdma_rep *)(unsigned long) wc->wr_id;
 
@@ -154,15 +155,23 @@ void rpcrdma_event_process(struct ib_wc *wc)
 		return;
 
 	if (IB_WC_SUCCESS != wc->status) {
-		dprintk("RPC:       %s: %s WC status %X, connection lost\n",
-			__func__, (wc->opcode & IB_WC_RECV) ? "recv" : "send",
-			wc->status);
+		dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
+			__func__, wc->opcode, wc->status);
 		rep->rr_len = ~0U;
-		rpcrdma_schedule_tasklet(rep);
+		if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
+			rpcrdma_schedule_tasklet(rep);
 		return;
 	}
 
 	switch (wc->opcode) {
+	case IB_WC_FAST_REG_MR:
+		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+		frmr->r.frmr.state = FRMR_IS_VALID;
+		break;
+	case IB_WC_LOCAL_INV:
+		frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+		frmr->r.frmr.state = FRMR_IS_INVALID;
+		break;
 	case IB_WC_RECV:
 		rep->rr_len = wc->byte_len;
 		ib_dma_sync_single_for_cpu(
@@ -1450,6 +1459,11 @@ rpcrdma_map_one(struct rpcrdma_ia *ia, struct rpcrdma_mr_seg *seg, int writing)
 		seg->mr_dma = ib_dma_map_single(ia->ri_id->device,
 				seg->mr_offset,
 				seg->mr_dmalen, seg->mr_dir);
+	if (ib_dma_mapping_error(ia->ri_id->device, seg->mr_dma)) {
+		dprintk("RPC:       %s: mr_dma %llx mr_offset %p mr_dma_len %zu\n",
+			__func__,
+			seg->mr_dma, seg->mr_offset, seg->mr_dmalen);
+	}
 }
 
 static void
@@ -1469,7 +1483,8 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
 			struct rpcrdma_xprt *r_xprt)
 {
 	struct rpcrdma_mr_seg *seg1 = seg;
-	struct ib_send_wr frmr_wr, *bad_wr;
+	struct ib_send_wr invalidate_wr, frmr_wr, *bad_wr, *post_wr;
+	u8 key;
 	int len, pageoff;
 	int i, rc;
 
@@ -1484,6 +1499,7 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
 		rpcrdma_map_one(ia, seg, writing);
 		seg1->mr_chunk.rl_mw->r.frmr.fr_pgl->page_list[i] = seg->mr_dma;
 		len += seg->mr_len;
+		BUG_ON(seg->mr_len > PAGE_SIZE);
 		++seg;
 		++i;
 		/* Check for holes */
@@ -1494,26 +1510,45 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
 	dprintk("RPC:       %s: Using frmr %p to map %d segments\n",
 		__func__, seg1->mr_chunk.rl_mw, i);
 
+	if (unlikely(seg1->mr_chunk.rl_mw->r.frmr.state == FRMR_IS_VALID)) {
+		dprintk("RPC:       %s: frmr %x left valid, posting invalidate.\n",
+			__func__,
+			seg1->mr_chunk.rl_mw->r.frmr.fr_mr->rkey);
+		/* Invalidate before using. */
+		memset(&invalidate_wr, 0, sizeof invalidate_wr);
+		invalidate_wr.wr_id = (unsigned long)(void *)seg1->mr_chunk.rl_mw;
+		invalidate_wr.next = &frmr_wr;
+		invalidate_wr.opcode = IB_WR_LOCAL_INV;
+		invalidate_wr.send_flags = IB_SEND_SIGNALED
Re: NFS-RDMA hangs: connection closed (-103)
On 12/8/10 9:10 AM, Spelic wrote: Tom, have you reproduced the RDMA hangs - connection closes bug, or the sparse file at server side upon NFS hitting ENOSPC? Because for the latter people have already given an exhaustive explanation: see this other thread at http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ While the former bug is still open and very interesting for us. I'm working on the 'former' bug. The bug that I think you've run into is with how RDMA transport errors are handled and how RPCs are retried in the event of an error. With hard mounts (which I'm suspecting you have), the RPC will be retried forever. In this bug, the transport never 'recovers' after the error and therefore the RPC never succeeds and the mount is effectively hung. There were bugs fixed in this area between .34 and top of tree, which is why you now see the less catastrophic, but still broken, behavior. Unfortunately I can only support this part-time, but I'll keep you updated on the progress. Thanks for finding this and helping to debug, Tom Thanks for your help S. On 12/07/2010 05:12 PM, Tom Tucker wrote: Status update... I have reproduced the bug a number of different ways. It seems to be most easily reproduced by simply writing more data than the filesystem has space for. I can do this reliably with any FS. I think the XFS bug may have tickled this bug somehow. 
Tom On 12/2/10 1:09 PM, Spelic wrote: Hello all please be aware that the file oversize bug is reproducible also without infiniband, with just nfs over ethernet over xfs over ramdisk (but it doesn't hang, so it's a different bug than the one I posted here at the RDMA mailing list) I have posted another thread regarding the file oversize bug, which you can read in the LVM, XFS, and LKML mailing lists, please have a look http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ Especially my second post, replying myself at +30 minutes, explains that it's reproducible also with ethernet. Thank you On 12/02/2010 07:37 PM, Roland Dreier wrote: Adding Dave Chinner to the cc list, since he's both an XFS guru as well as being very familiar with NFS and RDMA... Dave, if you read below, it seems there is some strange behavior exporting XFS with NFS/RDMA. - R. On 12/02/2010 12:59 AM, Tom Tucker wrote: Spelic, I have seen this problem before, but have not been able to reliably reproduce it. When I saw the problem, there were no transport errors and it appeared as if the I/O had actually completed, but that the waiter was not being awoken. I was not able to reliably reproduce the problem and was not able to determine if the problem was a latent bug in NFS in general or a bug in the RDMA transport in particular. I will try your setup here, but I don't have a system like yours so I'll have to settle for a smaller ramdisk, however, I have a few questions: - Does the FS matter? For example, can you use ext[2-4] on the ramdisk and not still reproduce - As I mentioned earlier NFS v3 vs. NFS v4 - RAMDISK size, i.e. 2G vs. 14G Thanks, Tom Hello Tom, thanks for replying - The FS matters to some extent: as I wrote, with ext4 it's not possible to reproduce the bug in this way, so immediately and reliably, however ext4 also will hang eventually if you work on it for hours so I had to switch to IPoIB for our real work; reread my previous post. 
- NFS3 not tried yet. Never tried to do RDMA on NFS3... do you have a pointer on instructions? - RAMDISK size: I am testing it. Ok I confirm with 1.5GB ramdisk it's reproducible. boot option ramdisk_size=1572864 (1.5*1024**2=1572864.0) confirm: blockdev --getsize64 /dev/ram0 == 1610612736 now at server side mkfs and mount with defaults: mkfs.xfs /dev/ram0 mount /dev/ram0 /mnt/ram (this is a simplification over my previous email, and it's needed with a smaller ramdisk or mkfs.xfs will refuse to work. The bug is still reproducible like this) DOH! another bug: It's strange how at the end of the test ls -lh /mnt/ram at server side will show a zerofile larger than 1.5GB at the end of the procedure, sometimes it's 3GB, sometimes it's 2.3GB... but it's larger than the ramdisk size. # ll -h /mnt/ram total 1.5G drwxr-xr-x 2 root root 21 2010-12-02 12:54 ./ drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../ -rw-r--r-- 1 root root 2.3G 2010-12-02 12:59 zerofile # df -h FilesystemSize Used Avail Use% Mounted on /dev/sda1 294G 4.1G 275G 2% / devtmpfs 7.9G 184K 7.9G 1% /dev none 7.9G 0 7.9G 0% /dev/shm none 7.9G 100K 7.9G 1% /var/run none 7.9G 0 7.9G 0% /var/lock none 7.9G 0 7.9G 0% /lib/init/rw /dev/ram0 1.5G 1.5G 20K 100% /mnt/ram # dd
Re: NFS-RDMA hangs: connection closed (-103)
Status update... I have reproduced the bug a number of different ways. It seems to be most easily reproduced by simply writing more data than the filesystem has space for. I can do this reliably with any FS. I think the XFS bug may have tickled this bug somehow. Tom On 12/2/10 1:09 PM, Spelic wrote: Hello all please be aware that the file oversize bug is reproducible also without infiniband, with just nfs over ethernet over xfs over ramdisk (but it doesn't hang, so it's a different bug than the one I posted here at the RDMA mailing list) I have posted another thread regarding the file oversize bug, which you can read in the LVM, XFS, and LKML mailing lists, please have a look http://fossplanet.com/f13/%5Blinux-lvm%5D-bugs-mkfs-xfs-device-mapper-xfs-dev-ram-81653/ Especially my second post, replying myself at +30 minutes, explains that it's reproducible also with ethernet. Thank you On 12/02/2010 07:37 PM, Roland Dreier wrote: Adding Dave Chinner to the cc list, since he's both an XFS guru as well as being very familiar with NFS and RDMA... Dave, if you read below, it seems there is some strange behavior exporting XFS with NFS/RDMA. - R. On 12/02/2010 12:59 AM, Tom Tucker wrote: Spelic, I have seen this problem before, but have not been able to reliably reproduce it. When I saw the problem, there were no transport errors and it appeared as if the I/O had actually completed, but that the waiter was not being awoken. I was not able to reliably reproduce the problem and was not able to determine if the problem was a latent bug in NFS in general or a bug in the RDMA transport in particular. I will try your setup here, but I don't have a system like yours so I'll have to settle for a smaller ramdisk, however, I have a few questions: - Does the FS matter? For example, can you use ext[2-4] on the ramdisk and not still reproduce - As I mentioned earlier NFS v3 vs. NFS v4 - RAMDISK size, i.e. 2G vs. 
14G Thanks, Tom Hello Tom, thanks for replying - The FS matters to some extent: as I wrote, with ext4 it's not possible to reproduce the bug in this way, so immediately and reliably, however ext4 also will hang eventually if you work on it for hours so I had to switch to IPoIB for our real work; reread my previous post. - NFS3 not tried yet. Never tried to do RDMA on NFS3... do you have a pointer on instructions? - RAMDISK size: I am testing it. Ok I confirm with 1.5GB ramdisk it's reproducible. boot option ramdisk_size=1572864 (1.5*1024**2=1572864.0) confirm: blockdev --getsize64 /dev/ram0 == 1610612736 now at server side mkfs and mount with defaults: mkfs.xfs /dev/ram0 mount /dev/ram0 /mnt/ram (this is a simplification over my previous email, and it's needed with a smaller ramdisk or mkfs.xfs will refuse to work. The bug is still reproducible like this) DOH! another bug: It's strange how at the end of the test ls -lh /mnt/ram at server side will show a zerofile larger than 1.5GB at the end of the procedure, sometimes it's 3GB, sometimes it's 2.3GB... but it's larger than the ramdisk size. # ll -h /mnt/ram total 1.5G drwxr-xr-x 2 root root 21 2010-12-02 12:54 ./ drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../ -rw-r--r-- 1 root root 2.3G 2010-12-02 12:59 zerofile # df -h FilesystemSize Used Avail Use% Mounted on /dev/sda1 294G 4.1G 275G 2% / devtmpfs 7.9G 184K 7.9G 1% /dev none 7.9G 0 7.9G 0% /dev/shm none 7.9G 100K 7.9G 1% /var/run none 7.9G 0 7.9G 0% /var/lock none 7.9G 0 7.9G 0% /lib/init/rw /dev/ram0 1.5G 1.5G 20K 100% /mnt/ram # dd if=/mnt/ram/zerofile | wc -c 4791480+0 records in 4791480+0 records out 2453237760 2453237760 bytes (2.5 GB) copied, 8.41821 s, 291 MB/s It seems there is also an XFS bug here... This might help triggering the bug however please note than ext4 (nfs-rdma over it) also hanged on us and it was real work on HDD disks and they were not full... after switching to IPoIB it didn't hang anymore. 
On IPoIB the size problem also shows up: final file is 2.3GB instead of 1.5GB, however nothing hangs: # echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo syncing now ; time sync ; echo finished begin dd: writing `/mnt/nfsram/zerofile': Input/output error 2497+0 records in 2496+0 records out 2617245696 bytes (2.6 GB) copied, 10.4 s, 252 MB/s syncing now real0m0.057s user0m0.000s sys 0m0.000s finished I think I noticed the same problem with a 14GB ramdisk, the file ended up to be about 15GB, but at that time I thought I made some computation mistakes. Now with a smaller ramdisk it's more obvious. Earlier or later someone should notify the XFS developers of the size bug. However
[RFC PATCH 0/2] IB/uverbs: Add support for registering mmapped memory
This patch series adds the ability for a user-mode program to register mmapped memory. The capability was developed to support the sharing of device memory, for example PCI-E static/flash ram devices, on the network with RDMA. It is also useful for sharing kernel resident data with distributed system monitoring applications (e.g. vmstats) at zero overhead to the monitored host.

---

Tom Tucker (2):
      IB/uverbs: Add support for user registration of mmap memory
      IB/uverbs: Add memory type to ib_umem structure

 drivers/infiniband/core/umem.c |  272 +++++++++++++++++++++++++++++----
 include/rdma/ib_umem.h         |    6 +
 2 files changed, 259 insertions(+), 19 deletions(-)

--
Signed-off-by: Tom Tucker t...@ogc.us
Re: [RFC PATCH 2/2] IB/uverbs: Add support for user registration of mmap memory
Personally I think the biggest issue is that I don't think the pfn-to-dma-address mapping logic is portable. On 12/2/10 1:35 PM, Ralph Campbell wrote: I understand the need for something like this patch since the GPU folks would also like to mmap memory (although the memory is marked as vma->vm_flags & VM_IO). It seems to me that duplicating the page walking code is the wrong approach and exporting a new interface from mm/memory.c is more appropriate. Perhaps, but that's kernel proper (not a module) and has its own issues. For example, it represents an exported kernel interface and therefore a kernel compatibility commitment going forward. I suggest that a new kernel interface is a separate effort that this code could utilize going forward. Also, the quick check to find_vma() is essentially duplicated if get_user_pages() is called You need to know the type before you know how to handle it. Unless you want to tear up get_user_pages, I think this non-performance-path double lookup is a non-issue. and it doesn't handle the case when the region spans multiple vma regions with different flags. Actually, it specifically does not allow that, and I'm not sure that is something you would want to support. Maybe we can modify get_user_pages to have a new flag which allows VM_PFNMAP segments to be accessed as IB memory regions. The problem is that VM_PFNMAP means there is no corresponding struct page to handle reference counting. What happens if the device that exports the VM_PFNMAP memory is hot removed? Bus Address Error. Can the device reference count be incremented to prevent that? I don't think that would go in this code, it would go in the driver that gave the user the address in the first place. On Thu, 2010-12-02 at 11:02 -0800, Tom Tucker wrote: Added support to the ib_umem_get helper function for handling mmapped memory. 
Signed-off-by: Tom Tucker t...@ogc.us
---

 drivers/infiniband/core/umem.c |  272 +++++++++++++++++++++++++++++----
 1 files changed, 253 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 415e186..357ca5e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -52,30 +52,24 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	int i;
 
 	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
-		ib_dma_unmap_sg(dev, chunk->page_list,
-				chunk->nents, DMA_BIDIRECTIONAL);
-		for (i = 0; i < chunk->nents; ++i) {
-			struct page *page = sg_page(&chunk->page_list[i]);
-
-			if (umem->writable && dirty)
-				set_page_dirty_lock(page);
-			put_page(page);
-		}
+		if (umem->type == IB_UMEM_MEM_MAP) {
+			ib_dma_unmap_sg(dev, chunk->page_list,
+					chunk->nents, DMA_BIDIRECTIONAL);
+			for (i = 0; i < chunk->nents; ++i) {
+				struct page *page = sg_page(&chunk->page_list[i]);
+				if (umem->writable && dirty)
+					set_page_dirty_lock(page);
+				put_page(page);
+			}
+		}
 		kfree(chunk);
 	}
 }
 
-/**
- * ib_umem_get - Pin and DMA map userspace memory.
- * @context: userspace context to pin memory for
- * @addr: userspace virtual address to start at
- * @size: length of region to pin
- * @access: IB_ACCESS_xxx flags for memory being pinned
- * @dmasync: flush in-flight DMA when the memory region is written
- */
-struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
-			    size_t size, int access, int dmasync)
+static struct ib_umem *__umem_get(struct ib_ucontext *context,
+				  unsigned long addr, size_t size,
+				  int access, int dmasync)
 {
 	struct ib_umem *umem;
 	struct page **page_list;
@@ -100,6 +94,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 	if (!umem)
 		return ERR_PTR(-ENOMEM);
 
+	umem->type    = IB_UMEM_MEM_MAP;
 	umem->context = context;
 	umem->length  = size;
 	umem->offset  = addr & ~PAGE_MASK;
@@ -215,6 +210,245 @@ out:
 	return ret < 0 ? ERR_PTR(ret) : umem;
 }
 
+/*
+ * Return the PFN for the specified address in the vma. This only
+ * works for a vma that is VM_PFNMAP.
+ */
+static unsigned long __follow_io_pfn(struct vm_area_struct *vma,
+				     unsigned long address, int write)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	unsigned long pfn;
+	struct mm_struct *mm = vma->vm_mm;
+
+	pgd
Re: NFS-RDMA hangs: connection closed (-103)
Hi Spelic, Can you reproduce this with an nfsv3 mount? On 12/1/10 5:13 PM, Spelic wrote: Hello all First of all: I have tried to send this message to the list at least 3 times but it doesn't seem to get through (and I'm given no error back). It was very long with 2 attachments... is it because of that? What are the limits of this ML? This time I will shorten it a bit and remove the attachments. Here is my problem: I am trying to use NFS over RDMA. It doesn't work: it hangs very soon. I tried kernel 2.6.32 from Ubuntu 10.04, and then I tried the most recent upstream 2.6.37-rc4 compiled from source. They behave basically the same regarding the NFS mount itself; the only difference is that 2.6.32 will hang the complete operating system when nfs hangs, while 2.6.37-rc4 (after nfs hangs) will only hang processes which launch sync or list nfs directories. Anyway the mount is hanged forever; it does not resolve by itself. IPoIB nfs mounts appear to work flawlessly; the problem is with RDMA only. Hardware: (identical client and server machines) 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] Flags: bus master, fast devsel, latency 0, IRQ 30 Memory at d880 (64-bit, non-prefetchable) [size=1M] Memory at d800 (64-bit, prefetchable) [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data ? Capabilities: [90] Message Signalled Interrupts: Mask- 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32 Capabilities: [60] Express Endpoint, MSI 00 Kernel driver in use: ib_mthca Kernel modules: ib_mthca Mainboard = Supermicro X7DWT with embedded infiniband. 
This is my test: on server I make a big 14GB ramdisk (exact boot option: ramdisk_size=14680064), format xfs and mount like this: mkfs.xfs -f -l size=128m -d agcount=16 /dev/ram0 mount -o nobarrier,inode64,logbufs=8,logbsize=256k /dev/ram0 /mnt/ram/ On the client I mount like this (fstab): 10.100.0.220:/ /mnt/nfsram nfs4 _netdev,auto,defaults,rdma,port=20049 0 0 Then on the client I perform echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo syncing now ; sync ; echo finished It hangs as soon as it reaches the end of the 14GB of space, but never writes syncing now. It seems like the disk full message triggers the hangup reliably on NFS over RDMA over XFS over ramdisk; other combinations are not so reliable for triggering the bug (e.g. ext4). However please note that this is not an XFS problem in itself: we had another hangup on an ext4 filesystem on NFS on RDMA on real disks for real work after a few hours (and it hadn't hit the disk full situation); this technique with XFS on ramdisk is just more reliably reproducible. Note that the hangup does not happen on NFS over IPoIB (no RDMA) over XFS over ramdisk. It's really an RDMA-only bug. On the other machine (2.6.32) that was doing real work on real disks I am now mounting over IPoIB without RDMA and in fact that one is still running reliably. The dd process hangs like this: (/proc/pid/stack) [810f8f75] sync_page+0x45/0x60 [810f9143] wait_on_page_bit+0x73/0x80 [810f9590] filemap_fdatawait_range+0x110/0x1a0 [810f9720] filemap_write_and_wait_range+0x70/0x80 [811766ba] vfs_fsync_range+0x5a/0xa0 [8117676c] vfs_fsync+0x1c/0x20 [a02bda1d] nfs_file_write+0xdd/0x1f0 [nfs] [8114d4fa] do_sync_write+0xda/0x120 [8114d808] vfs_write+0xc8/0x190 [8114e061] sys_write+0x51/0x90 [8100c042] system_call_fastpath+0x16/0x1b [] 0x The dd process is not killable with -9 . Stays alive and hanged. 
In the dmesg (client) you can see this line immediately, as soon as transfer stops (iostat -n 1) and dd hangs up: [ 3072.884988] rpcrdma: connection to 10.100.0.220:20049 closed (-103) after a while you can see this in dmesg [ 3242.890030] INFO: task dd:2140 blocked for more than 120 seconds. [ 3242.890132] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 3242.890239] ddD 88040a8f0398 0 2140 2113 0x [ 3242.890243] 88040891fb38 0082 88040891fa98 88040891fa98 [ 3242.890248] 000139c0 88040a8f 88040a8f0398 88040891ffd8 [ 3242.890251] 88040a8f03a0 000139c0 88040891e010 000139c0 [ 3242.890255] Call Trace: [ 3242.890264] [81035509] ? default_spin_lock_flags+0x9/0x10 [ 3242.890269] [810f8f30] ? sync_page+0x0/0x60 [ 3242.890273] [8157b824] io_schedule+0x44/0x60 [ 3242.890276] [810f8f75] sync_page+0x45/0x60 [ 3242.890279] [8157c0bf] __wait_on_bit+0x5f/0x90 [ 3242.890281]
Re: NFS-RDMA hangs: connection closed (-103)
Spelic, I have seen this problem before, but have not been able to reliably reproduce it. When I saw it, there were no transport errors and it appeared as if the I/O had actually completed, but the waiter was not being awoken. I could not determine whether it was a latent bug in NFS in general or a bug in the RDMA transport in particular. I will try your setup here, but I don't have a system like yours, so I'll have to settle for a smaller ramdisk. However, I have a few questions: - Does the FS matter? For example, can you use ext[2-4] on the ramdisk and still reproduce? - As I mentioned earlier, NFS v3 vs. NFS v4? - RAMDISK size, i.e. 2G vs. 14G? Thanks, Tom On 12/1/10 5:13 PM, Spelic wrote: Hello all, First of all: I have tried to send this message to the list at least 3 times, but it doesn't seem to get through (and I'm given no error back). It was very long with 2 attachments... is it because of that? What are the limits of this ML? This time I will shorten it a bit and remove the attachments. Here is my problem: I am trying to use NFS over RDMA. It doesn't work: it hangs very soon. I tried kernel 2.6.32 from Ubuntu 10.04, and then the most recent upstream 2.6.37-rc4 compiled from source. They behave basically the same regarding the NFS mount itself; the only difference is that 2.6.32 will hang the complete operating system when NFS hangs, while 2.6.37-rc4 (after NFS hangs) will only hang processes which launch sync or list NFS directories. Either way the mount is hung forever; it does not resolve by itself. IPoIB NFS mounts appear to work flawlessly; the problem is with RDMA only. 
Hardware: (identical client and server machines) 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20) Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] Flags: bus master, fast devsel, latency 0, IRQ 30 Memory at d880 (64-bit, non-prefetchable) [size=1M] Memory at d800 (64-bit, prefetchable) [size=8M] Capabilities: [40] Power Management version 2 Capabilities: [48] Vital Product Data ? Capabilities: [90] Message Signalled Interrupts: Mask- 64bit+ Queue=0/5 Enable- Capabilities: [84] MSI-X: Enable+ Mask- TabSize=32 Capabilities: [60] Express Endpoint, MSI 00 Kernel driver in use: ib_mthca Kernel modules: ib_mthca Mainboard = Supermicro X7DWT with embedded infiniband. This is my test: on server I make a big 14GB ramdisk (exact boot option: ramdisk_size=14680064), format xfs and mount like this: mkfs.xfs -f -l size=128m -d agcount=16 /dev/ram0 mount -o nobarrier,inode64,logbufs=8,logbsize=256k /dev/ram0 /mnt/ram/ On the client I mount like this (fstab): 10.100.0.220:/ /mnt/nfsram nfs4 _netdev,auto,defaults,rdma,port=20049 0 0 Then on the client I perform echo begin; dd if=/dev/zero of=/mnt/nfsram/zerofile bs=1M ; echo syncing now ; sync ; echo finished It hangs as soon as it reaches the end of the 14GB of space, but never writes syncing now. It seems like the disk full message triggers the hangup reliably on NFS over RDMA over XFS over ramdisk; other combinations are not so reliable for triggering the bug (e.g. ext4). However please note that this is not an XFS problem in itself: we had another hangup on an ext4 filesystem on NFS on RDMA on real disks for real work after a few hours (and it hadn't hit the disk full situation); this technique with XFS on ramdisk is just more reliably reproducible. Note that the hangup does not happen on NFS over IPoIB (no RDMA) over XFS over ramdisk. It's really an RDMA-only bug. 
On the other machine (2.6.32) that was doing real work on real disks I am now mounting over IPoIB without RDMA and in fact that one is still running reliably. The dd process hangs like this: (/proc/pid/stack) [810f8f75] sync_page+0x45/0x60 [810f9143] wait_on_page_bit+0x73/0x80 [810f9590] filemap_fdatawait_range+0x110/0x1a0 [810f9720] filemap_write_and_wait_range+0x70/0x80 [811766ba] vfs_fsync_range+0x5a/0xa0 [8117676c] vfs_fsync+0x1c/0x20 [a02bda1d] nfs_file_write+0xdd/0x1f0 [nfs] [8114d4fa] do_sync_write+0xda/0x120 [8114d808] vfs_write+0xc8/0x190 [8114e061] sys_write+0x51/0x90 [8100c042] system_call_fastpath+0x16/0x1b [] 0x The dd process is not killable with -9 . Stays alive and hanged. In the dmesg (client) you can see this line immediately, as soon as transfer stops (iostat -n 1) and dd hangs up: [ 3072.884988] rpcrdma: connection to 10.100.0.220:20049 closed (-103) after a while you can see this in dmesg [ 3242.890030] INFO: task dd:2140 blocked for more than 120 seconds. [ 3242.890132] echo 0 /proc/sys/kernel/hung_task_timeout_secs
Re: Problem Pinning Physical Memory
On 11/30/10 9:24 AM, Alan Cook wrote: Tom Tucker t...@... writes: Yes. I removed the new verb and followed Jason's recommendation of adding this support to the core reg_mr support. I used the type bits in the vma struct to determine the type of memory being registered and just did the right thing. I'll repost in the next day or two. Tom Tom, Couple of questions: I noticed that OFED 1.5.3 was released last week. Are the changes you speak of part of that release? No. If not, is there an alternate branch/project that I should be looking at for the mentioned changes? The patch will be against the top-of-tree Linux kernel. Also, I am inferring that the changes allowing the registering of physical memory will only happen if my application is running in kernel space. Actually, no. Is this correct, or will I be able to register the physical memory from user space now as well? What I implemented was support for mmap'd memory. In practical terms, for your application you would write a driver that supports the mmap file op. The driver's mmap routine would ioremap the PCI memory of interest and stuff it into the provided vma. The user-mode app then calls ibv_reg_mr on the address and length returned by mmap. Make sense? Tom -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem Pinning Physical Memory
On 11/29/10 11:10 AM, Steve Wise wrote: On 11/24/2010 11:42 AM, Jason Gunthorpe wrote: The last time this came up I said that the kernel side of ibv_reg_mr should do the right thing for all types of memory that are mmap'd into a process, and I still think that is true. RDMA to device memory could be very useful, and with things like GEM managing the allocation of device (video) memory to userspace, it can be done safely. Jason Tom posted changes to support this a while back. Tom, do you have an updated patch series for this support? Yes. I removed the new verb and followed Jason's recommendation of adding this support to the core reg_mr support. I used the type bits in the vma struct to determine the type of memory being registered and just did the right thing. I'll repost in the next day or two. Tom Steve.
Re: [PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API
On 11/16/10 1:39 PM, Or Gerlitz wrote: Tom Tucker t...@ogc.us wrote: This patch changes the bus mapping logic to avoid page_address() where necessary Hi Tom, Does "where necessary" mean that the invocations of page_address() which remain in the code after this patch is applied are safe, and that no kmap call is needed? That's the premise. Please let me know if something looks suspicious. Thanks, Tom Or.
[PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37
Hi Bruce, These fixes are ready for 2.6.37. They fix two bugs in the server-side NFSRDMA transport. Thanks, Tom --- Tom Tucker (2): svcrdma: Cleanup DMA unmapping in error paths. svcrdma: Change DMA mapping logic to avoid the page_address kernel API net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 19 --- net/sunrpc/xprtrdma/svc_rdma_sendto.c | 82 ++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 41 +++ 3 files changed, 92 insertions(+), 50 deletions(-) -- Signed-off-by: Tom Tucker t...@ogc.us
[PATCH 1/2] svcrdma: Change DMA mapping logic to avoid the page_address kernel API
There was logic in the send path that assumed that a page containing data to send to the client has a KVA. This is not always the case and can result in data corruption when page_address returns zero and we end up DMA mapping zero. This patch changes the bus mapping logic to avoid page_address() where necessary and converts all calls from ib_dma_map_single to ib_dma_map_page in order to keep the map/unmap calls symmetric. Signed-off-by: Tom Tucker t...@ogc.us --- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 18 --- net/sunrpc/xprtrdma/svc_rdma_sendto.c| 80 ++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 18 +++ 3 files changed, 78 insertions(+), 38 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 0194de8..926bdb4 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -263,9 +263,9 @@ static int fast_reg_read_chunks(struct svcxprt_rdma *xprt, frmr-page_list_len = PAGE_ALIGN(byte_count) PAGE_SHIFT; for (page_no = 0; page_no frmr-page_list_len; page_no++) { frmr-page_list-page_list[page_no] = - ib_dma_map_single(xprt-sc_cm_id-device, - page_address(rqstp-rq_arg.pages[page_no]), - PAGE_SIZE, DMA_FROM_DEVICE); + ib_dma_map_page(xprt-sc_cm_id-device, + rqstp-rq_arg.pages[page_no], 0, + PAGE_SIZE, DMA_FROM_DEVICE); if (ib_dma_mapping_error(xprt-sc_cm_id-device, frmr-page_list-page_list[page_no])) goto fatal_err; @@ -309,17 +309,21 @@ static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt, int count) { int i; + unsigned long off; ctxt-count = count; ctxt-direction = DMA_FROM_DEVICE; for (i = 0; i count; i++) { ctxt-sge[i].length = 0; /* in case map fails */ if (!frmr) { + BUG_ON(0 == virt_to_page(vec[i].iov_base)); + off = (unsigned long)vec[i].iov_base ~PAGE_MASK; ctxt-sge[i].addr = - ib_dma_map_single(xprt-sc_cm_id-device, - vec[i].iov_base, - vec[i].iov_len, - DMA_FROM_DEVICE); + ib_dma_map_page(xprt-sc_cm_id-device, + virt_to_page(vec[i].iov_base), + off, + 
vec[i].iov_len, + DMA_FROM_DEVICE); if (ib_dma_mapping_error(xprt-sc_cm_id-device, ctxt-sge[i].addr)) return -EINVAL; diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c index b15e1eb..d4f5e0e 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c @@ -70,8 +70,8 @@ * on extra page for the RPCRMDA header. */ static int fast_reg_xdr(struct svcxprt_rdma *xprt, -struct xdr_buf *xdr, -struct svc_rdma_req_map *vec) + struct xdr_buf *xdr, + struct svc_rdma_req_map *vec) { int sge_no; u32 sge_bytes; @@ -96,21 +96,25 @@ static int fast_reg_xdr(struct svcxprt_rdma *xprt, vec-count = 2; sge_no++; - /* Build the FRMR */ + /* Map the XDR head */ frmr-kva = frva; frmr-direction = DMA_TO_DEVICE; frmr-access_flags = 0; frmr-map_len = PAGE_SIZE; frmr-page_list_len = 1; + page_off = (unsigned long)xdr-head[0].iov_base ~PAGE_MASK; frmr-page_list-page_list[page_no] = - ib_dma_map_single(xprt-sc_cm_id-device, - (void *)xdr-head[0].iov_base, - PAGE_SIZE, DMA_TO_DEVICE); + ib_dma_map_page(xprt-sc_cm_id-device, + virt_to_page(xdr-head[0].iov_base), + page_off, + PAGE_SIZE - page_off, + DMA_TO_DEVICE); if (ib_dma_mapping_error(xprt-sc_cm_id-device, frmr-page_list-page_list[page_no])) goto fatal_err; atomic_inc(xprt-sc_dma_used); + /* Map the XDR page list */ page_off = xdr-page_base; page_bytes = xdr-page_len + page_off; if (!page_bytes) @@ -128,9 +132,9 @@ static int fast_reg_xdr(struct svcxprt_rdma
[PATCH 2/2] svcrdma: Cleanup DMA unmapping in error paths.
There are several error paths in the code that do not unmap DMA. This patch adds calls to svc_rdma_unmap_dma to free these DMA contexts. Signed-off-by: Tom Tucker t...@opengridcomputing.com --- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |1 + net/sunrpc/xprtrdma/svc_rdma_sendto.c|2 ++ net/sunrpc/xprtrdma/svc_rdma_transport.c | 29 ++--- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 926bdb4..df67211 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -495,6 +495,7 @@ next_sge: printk(KERN_ERR svcrdma: Error %d posting RDMA_READ\n, err); set_bit(XPT_CLOSE, xprt-sc_xprt.xpt_flags); + svc_rdma_unmap_dma(ctxt); svc_rdma_put_context(ctxt, 0); goto out; } diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c index d4f5e0e..249a835 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c @@ -367,6 +367,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, goto err; return 0; err: + svc_rdma_unmap_dma(ctxt); + svc_rdma_put_frmr(xprt, vec-frmr); svc_rdma_put_context(ctxt, 0); /* Fatal error, close transport */ return -EIO; diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 23f90c3..d22a44d 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -511,9 +511,9 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt) ctxt-sge[sge_no].addr = pa; ctxt-sge[sge_no].length = PAGE_SIZE; ctxt-sge[sge_no].lkey = xprt-sc_dma_lkey; + ctxt-count = sge_no + 1; buflen += PAGE_SIZE; } - ctxt-count = sge_no; recv_wr.next = NULL; recv_wr.sg_list = ctxt-sge[0]; recv_wr.num_sge = ctxt-count; @@ -529,6 +529,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt) return ret; err_put_ctxt: + svc_rdma_unmap_dma(ctxt); svc_rdma_put_context(ctxt, 1); return 
-ENOMEM; } @@ -1306,7 +1307,6 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp, enum rpcrdma_errcode err) { struct ib_send_wr err_wr; - struct ib_sge sge; struct page *p; struct svc_rdma_op_ctxt *ctxt; u32 *va; @@ -1319,26 +1319,27 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp, /* XDR encode error */ length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va); + ctxt = svc_rdma_get_context(xprt); + ctxt-direction = DMA_FROM_DEVICE; + ctxt-count = 1; + ctxt-pages[0] = p; + /* Prepare SGE for local address */ - sge.addr = ib_dma_map_page(xprt-sc_cm_id-device, - p, 0, PAGE_SIZE, DMA_FROM_DEVICE); - if (ib_dma_mapping_error(xprt-sc_cm_id-device, sge.addr)) { + ctxt-sge[0].addr = ib_dma_map_page(xprt-sc_cm_id-device, + p, 0, length, DMA_FROM_DEVICE); + if (ib_dma_mapping_error(xprt-sc_cm_id-device, ctxt-sge[0].addr)) { put_page(p); return; } atomic_inc(xprt-sc_dma_used); - sge.lkey = xprt-sc_dma_lkey; - sge.length = length; - - ctxt = svc_rdma_get_context(xprt); - ctxt-count = 1; - ctxt-pages[0] = p; + ctxt-sge[0].lkey = xprt-sc_dma_lkey; + ctxt-sge[0].length = length; /* Prepare SEND WR */ memset(err_wr, 0, sizeof err_wr); ctxt-wr_op = IB_WR_SEND; err_wr.wr_id = (unsigned long)ctxt; - err_wr.sg_list = sge; + err_wr.sg_list = ctxt-sge; err_wr.num_sge = 1; err_wr.opcode = IB_WR_SEND; err_wr.send_flags = IB_SEND_SIGNALED; @@ -1348,9 +1349,7 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp, if (ret) { dprintk(svcrdma: Error %d posting send for protocol error\n, ret); - ib_dma_unmap_page(xprt-sc_cm_id-device, - sge.addr, PAGE_SIZE, - DMA_FROM_DEVICE); + svc_rdma_unmap_dma(ctxt); svc_rdma_put_context(ctxt, 1); } } -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] libmthca: Add support for the reg_io_mr verb.
Tziporet Koren wrote: Hi Tom, What is the purpose of this? Is there a reason you did it only for mthca and not mlx4? Tziporet Hi Tziporet, I just picked mthca arbitrarily to demonstrate how to do it. If people like the verb, then I'll do it for all the devices, but I didn't want to do all that work when there are likely to be changes. But the point is that this is certainly not mthca-only functionality. Thanks, Tom
[RFC PATCH 0/4] ibverbs: new verbs for registering I/O memory
The following patches add verbs for registering I/O memory from user-space. This capability allows device memory to be registered. More specifically, any VM_PFNMAP vma can be registered. The mmap service is used to obtain the address of this memory and provide it to user space. This is where any security policy would be implemented. The ib_iomem_get service requires that any address provided by the service be in a VMA owned by the process. This precludes providing 'random' addresses to the service to acquire access to arbitrary memory locations. --- Tom Tucker (4): mthca: Add support for reg_io_mr and unreg_io_mr uverbs_cmd: Add uverbs command definitions for reg_io_mr uverbs: Add common ib_iomem_get service ibverbs: Add new provider verb for I/O memory registration drivers/infiniband/core/umem.c | 248 +- drivers/infiniband/core/uverbs.h |2 drivers/infiniband/core/uverbs_cmd.c | 140 +++ drivers/infiniband/core/uverbs_main.c|2 drivers/infiniband/hw/mthca/mthca_provider.c | 111 include/rdma/ib_umem.h | 14 + include/rdma/ib_user_verbs.h | 24 ++- include/rdma/ib_verbs.h |5 + 8 files changed, 534 insertions(+), 12 deletions(-) -- Signed-off-by: Tom Tucker t...@ogc.us
[RFC PATCH 1/4] ibverbs: Add new provider verb for I/O memory registration
From: Tom Tucker t...@opengridcomputing.com Add a function pointer for the provider's reg_io_mr method. Signed-off-by: Tom Tucker t...@ogc.us --- include/rdma/ib_verbs.h |5 + 1 files changed, 5 insertions(+), 0 deletions(-) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index f3e8f3c..5034ac9 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1096,6 +1096,11 @@ struct ib_device { u64 virt_addr, int mr_access_flags, struct ib_udata *udata); + struct ib_mr * (*reg_io_mr)(struct ib_pd *pd, + u64 start, u64 length, + u64 virt_addr, + int mr_access_flags, + struct ib_udata *udata); int(*query_mr)(struct ib_mr *mr, struct ib_mr_attr *mr_attr); int(*dereg_mr)(struct ib_mr *mr); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 3/3] libibverbs: Add reg/unreg I/O memory verbs
From: Tom Tucker t...@opengridcomputing.com Add the ibv_reg_io_mr and ibv_dereg_io_mr verbs. Signed-off-by: Tom Tucker t...@ogc.us --- include/infiniband/driver.h |6 ++ include/infiniband/verbs.h | 14 ++ src/verbs.c | 35 +++ 3 files changed, 55 insertions(+), 0 deletions(-) diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h index 9a81416..37c0ed1 100644 --- a/include/infiniband/driver.h +++ b/include/infiniband/driver.h @@ -82,6 +82,12 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, size_t cmd_size, struct ibv_reg_mr_resp *resp, size_t resp_size); int ibv_cmd_dereg_mr(struct ibv_mr *mr); +int ibv_cmd_reg_io_mr(struct ibv_pd *pd, void *addr, size_t length, + uint64_t hca_va, int access, + struct ibv_mr *mr, struct ibv_reg_io_mr *cmd, + size_t cmd_size, + struct ibv_reg_io_mr_resp *resp, size_t resp_size); +int ibv_cmd_dereg_io_mr(struct ibv_mr *mr); int ibv_cmd_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector, struct ibv_cq *cq, diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index 0f1cb2e..a0d969a 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -640,6 +640,9 @@ struct ibv_context_ops { size_t length, int access); int (*dereg_mr)(struct ibv_mr *mr); +struct ibv_mr * (*reg_io_mr)(struct ibv_pd *pd, void *addr, size_t length, +int access); +int (*dereg_io_mr)(struct ibv_mr *mr); struct ibv_mw * (*alloc_mw)(struct ibv_pd *pd, enum ibv_mw_type type); int (*bind_mw)(struct ibv_qp *qp, struct ibv_mw *mw, struct ibv_mw_bind *mw_bind); @@ -801,6 +804,17 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, int ibv_dereg_mr(struct ibv_mr *mr); /** + * ibv_reg_io_mr - Register a physical memory region + */ +struct ibv_mr *ibv_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, int access); + +/** + * ibv_dereg_io_mr - Deregister a physical memory region + */ +int ibv_dereg_io_mr(struct ibv_mr *mr); + +/** * 
ibv_create_comp_channel - Create a completion event channel */ struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context); diff --git a/src/verbs.c b/src/verbs.c index ba3c0a4..7d215c1 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -189,6 +189,41 @@ int __ibv_dereg_mr(struct ibv_mr *mr) } default_symver(__ibv_dereg_mr, ibv_dereg_mr); +struct ibv_mr *__ibv_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, int access) +{ +struct ibv_mr *mr; + +if (ibv_dontfork_range(addr, length)) +return NULL; + +mr = pd-context-ops.reg_io_mr(pd, addr, length, access); +if (mr) { +mr-context = pd-context; +mr-pd = pd; +mr-addr= addr; +mr-length = length; +} else +ibv_dofork_range(addr, length); + +return mr; +} +default_symver(__ibv_reg_io_mr, ibv_reg_io_mr); + +int __ibv_dereg_io_mr(struct ibv_mr *mr) +{ +int ret; +void *addr = mr-addr; +size_t length = mr-length; + +ret = mr-context-ops.dereg_io_mr(mr); +if (!ret) +ibv_dofork_range(addr, length); + +return ret; +} +default_symver(__ibv_dereg_io_mr, ibv_dereg_io_mr); + static struct ibv_comp_channel *ibv_create_comp_channel_v2(struct ibv_context *context) { struct ibv_abi_compat_v2 *t = context-abi_compat; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] libmthca: Add support for I/O memory registration verbs
This patchset adds support for the new I/O memory registration verbs to libmthca. --- Tom Tucker (1): libmthca: Add support for the reg_io_mr verb. src/mthca-abi.h |4 src/mthca.c |2 ++ src/mthca.h |4 src/verbs.c | 50 ++ 4 files changed, 60 insertions(+), 0 deletions(-) -- Signed-off-by: Tom Tucker t...@ogc.us
[RFC PATCH] libmthca: Add support for the reg_io_mr verb.
From: Tom Tucker t...@opengridcomputing.com Added support for the ibv_reg_io_mr and ibv_unreg_io_mr verbs to the mthca ilbrary. Signed-off-by: Tom Tucker t...@ogc.us --- src/mthca-abi.h |4 src/mthca.c |2 ++ src/mthca.h |4 src/verbs.c | 50 ++ 4 files changed, 60 insertions(+), 0 deletions(-) diff --git a/src/mthca-abi.h b/src/mthca-abi.h index 4fbd98b..c0145d6 100644 --- a/src/mthca-abi.h +++ b/src/mthca-abi.h @@ -61,6 +61,10 @@ struct mthca_reg_mr { __u32 reserved; }; +struct mthca_reg_io_mr { + struct ibv_reg_io_mribv_cmd; +}; + struct mthca_create_cq { struct ibv_create_cqibv_cmd; __u32 lkey; diff --git a/src/mthca.c b/src/mthca.c index e33bf7f..8892504 100644 --- a/src/mthca.c +++ b/src/mthca.c @@ -113,6 +113,8 @@ static struct ibv_context_ops mthca_ctx_ops = { .dealloc_pd= mthca_free_pd, .reg_mr= mthca_reg_mr, .dereg_mr = mthca_dereg_mr, + .reg_io_mr = mthca_reg_io_mr, + .dereg_io_mr = mthca_dereg_io_mr, .create_cq = mthca_create_cq, .poll_cq = mthca_poll_cq, .resize_cq = mthca_resize_cq, diff --git a/src/mthca.h b/src/mthca.h index bd1e7a2..92a8649 100644 --- a/src/mthca.h +++ b/src/mthca.h @@ -312,6 +312,10 @@ struct ibv_mr *mthca_reg_mr(struct ibv_pd *pd, void *addr, size_t length, int access); int mthca_dereg_mr(struct ibv_mr *mr); +struct ibv_mr *mthca_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access); +int mthca_dereg_io_mr(struct ibv_mr *mr); + struct ibv_cq *mthca_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); diff --git a/src/verbs.c b/src/verbs.c index b6782c9..3580ad2 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -174,6 +174,56 @@ int mthca_dereg_mr(struct ibv_mr *mr) return 0; } + +static struct ibv_mr *__mthca_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, uint64_t hca_va, + enum ibv_access_flags access) +{ + struct ibv_mr *mr; + struct mthca_reg_io_mr cmd; + int ret; + + mr = malloc(sizeof *mr); + if (!mr) + return NULL; + +#ifdef 
IBV_CMD_REG_MR_HAS_RESP_PARAMS + { + struct ibv_reg_io_mr_resp resp; + + ret = ibv_cmd_reg_io_mr(pd, addr, length, hca_va, access, mr, + cmd.ibv_cmd, sizeof cmd, resp, sizeof resp); + } +#else + ret = ibv_cmd_reg_io_mr(pd, addr, length, hca_va, access, mr, + cmd.ibv_cmd, sizeof cmd); +#endif + if (ret) { + free(mr); + return NULL; + } + + return mr; +} + +struct ibv_mr *mthca_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, enum ibv_access_flags access) +{ + return __mthca_reg_io_mr(pd, addr, length, (uintptr_t) addr, access); +} + +int mthca_dereg_io_mr(struct ibv_mr *mr) +{ + int ret; + + ret = ibv_cmd_dereg_mr(mr); + if (ret) + return ret; + + free(mr); + return 0; +} + static int align_cq_size(int cqe) { int nent; -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
On 7/29/10 1:22 PM, Ralph Campbell wrote: On Thu, 2010-07-29 at 09:25 -0700, Tom Tucker wrote: From: Tom Tuckert...@opengridcomputing.com Add an ib_iomem_get service that converts a vma to an array of physical addresses. This makes it easier for each device driver to add support for the reg_io_mr provider method. Signed-off-by: Tom Tuckert...@ogc.us --- drivers/infiniband/core/umem.c | 248 ++-- include/rdma/ib_umem.h | 14 ++ 2 files changed, 251 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 415e186..f103956 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c ... @@ -292,3 +295,226 @@ int ib_umem_page_count(struct ib_umem *umem) return n; } EXPORT_SYMBOL(ib_umem_page_count); +/* + * Return the PFN for the specified address in the vma. This only + * works for a vma that is VM_PFNMAP. + */ +static unsigned long follow_io_pfn(struct vm_area_struct *vma, + unsigned long address, int write) +{ + pgd_t *pgd; + pud_t *pud; + pmd_t *pmd; + pte_t *ptep, pte; + spinlock_t *ptl; + unsigned long pfn; + struct mm_struct *mm = vma-vm_mm; + + BUG_ON(0 == (vma-vm_flags VM_PFNMAP)); Why use BUG_ON? WARN_ON is more appropriate but if (!(vma-vm_flags VM_PFNMAP)) return 0; seems better. In fact, move it outside the inner do loop in ib_get_io_pfn(). It's paranoia from the debug phase. It's already in the 'outer loop'. I should just delete it I think. 
+ pgd = pgd_offset(mm, address); + if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd))) + return 0; + + pud = pud_offset(pgd, address); + if (pud_none(*pud)) + return 0; + if (unlikely(pud_bad(*pud))) + return 0; + + pmd = pmd_offset(pud, address); + if (pmd_none(*pmd)) + return 0; + if (unlikely(pmd_bad(*pmd))) + return 0; + + ptep = pte_offset_map_lock(mm, pmd, address,ptl); + pte = *ptep; + if (!pte_present(pte)) + goto bad; + if (write !pte_write(pte)) + goto bad; + + pfn = pte_pfn(pte); + pte_unmap_unlock(ptep, ptl); + return pfn; + bad: + pte_unmap_unlock(ptep, ptl); + return 0; +} + +int ib_get_io_pfn(struct task_struct *tsk, struct mm_struct *mm, + unsigned long start, int len, int write, int force, + unsigned long *pfn_list, struct vm_area_struct **vmas) +{ + unsigned long pfn; + int i; + if (len= 0) + return 0; + + i = 0; + do { + struct vm_area_struct *vma; + + vma = find_vma(mm, start); + if (0 == (vma-vm_flags VM_PFNMAP)) + return -EINVAL; Style nit: I would use ! instead of 0 == ok. + if (0 == (vma-vm_flags VM_IO)) + return -EFAULT; + + if (is_vm_hugetlb_page(vma)) + return -EFAULT; + + do { + cond_resched(); + pfn = follow_io_pfn(vma, start, write); + if (!pfn) + return -EFAULT; + if (pfn_list) + pfn_list[i] = pfn; + if (vmas) + vmas[i] = vma; + i++; + start += PAGE_SIZE; + len--; + } while (len start vma-vm_end); + } while (len); + return i; +} + +/** + * ib_iomem_get - DMA map a userspace map of IO memory. 
+ * @context: userspace context to map memory for + * @addr: userspace virtual address to start at + * @size: length of region to map + * @access: IB_ACCESS_xxx flags for memory being mapped + * @dmasync: flush in-flight DMA when the memory region is written + */ +struct ib_umem *ib_iomem_get(struct ib_ucontext *context, unsigned long addr, +size_t size, int access, int dmasync) +{ + struct ib_umem *umem; + unsigned long *pfn_list; + struct ib_umem_chunk *chunk; + unsigned long locked; + unsigned long lock_limit; + unsigned long cur_base; + unsigned long npages; + int ret; + int off; + int i; + DEFINE_DMA_ATTRS(attrs); + + if (dmasync) + dma_set_attr(DMA_ATTR_WRITE_BARRIER,attrs); + + if (!can_do_mlock()) + return ERR_PTR(-EPERM); + + umem = kmalloc(sizeof *umem, GFP_KERNEL); + if (!umem) + return ERR_PTR(-ENOMEM); + + umem-type = IB_UMEM_IO_MAP; + umem-context = context; + umem-length= size; + umem-offset= addr ~PAGE_MASK; + umem-page_size
Re: [RFC PATCH 3/3] libibverbs: Add reg/unreg I/O memory verbs
On 7/29/10 3:07 PM, Ralph Campbell wrote: How does an application know when to call ibv_reg_io_mr() instead of ibv_reg_mr()? It isn't going to know that some address returned by mmap() is going to have the VM_PFNMAP flag set. Please see my response to Jason. How does an application know that the HCA supports ibv_reg_io_mr() or not? (see below) I think returning ENOTSUP or something would be good. There are bits in the devcaps that indicate if these verbs are supported. It should however return -ENOTSUPP if they are called without support. I copied ibv_reg_mr's which is inappropriate in this regard. On Thu, 2010-07-29 at 09:32 -0700, Tom Tucker wrote: From: Tom Tuckert...@opengridcomputing.com Add the ibv_reg_io_mr and ibv_dereg_io_mr verbs. Signed-off-by: Tom Tuckert...@ogc.us --- include/infiniband/driver.h |6 ++ include/infiniband/verbs.h | 14 ++ src/verbs.c | 35 +++ 3 files changed, 55 insertions(+), 0 deletions(-) diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h index 9a81416..37c0ed1 100644 --- a/include/infiniband/driver.h +++ b/include/infiniband/driver.h @@ -82,6 +82,12 @@ int ibv_cmd_reg_mr(struct ibv_pd *pd, void *addr, size_t length, size_t cmd_size, struct ibv_reg_mr_resp *resp, size_t resp_size); int ibv_cmd_dereg_mr(struct ibv_mr *mr); +int ibv_cmd_reg_io_mr(struct ibv_pd *pd, void *addr, size_t length, + uint64_t hca_va, int access, + struct ibv_mr *mr, struct ibv_reg_io_mr *cmd, + size_t cmd_size, + struct ibv_reg_io_mr_resp *resp, size_t resp_size); +int ibv_cmd_dereg_io_mr(struct ibv_mr *mr); int ibv_cmd_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector, struct ibv_cq *cq, diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index 0f1cb2e..a0d969a 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -640,6 +640,9 @@ struct ibv_context_ops { size_t length, int access); int (*dereg_mr)(struct ibv_mr *mr); +struct ibv_mr * (*reg_io_mr)(struct 
ibv_pd *pd, void *addr, size_t length, +int access); +int (*dereg_io_mr)(struct ibv_mr *mr); struct ibv_mw * (*alloc_mw)(struct ibv_pd *pd, enum ibv_mw_type type); int (*bind_mw)(struct ibv_qp *qp, struct ibv_mw *mw, struct ibv_mw_bind *mw_bind); Doesn't adding these in the middle of the struct break the libibverbs to libxxxverbs.so binary interface? Shouldn't they be added at the end of the struct? I'm not sure how the versioning works between libibverbs and device plugins. Don't we need to protect against libibverbs being upgraded but the libxxxverbs.so being older? I would think it's broken regardless. @@ -801,6 +804,17 @@ struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, int ibv_dereg_mr(struct ibv_mr *mr); /** + * ibv_reg_io_mr - Register a physical memory region + */ +struct ibv_mr *ibv_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, int access); + +/** + * ibv_dereg_io_mr - Deregister a physical memory region + */ +int ibv_dereg_io_mr(struct ibv_mr *mr); + +/** * ibv_create_comp_channel - Create a completion event channel */ struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context); diff --git a/src/verbs.c b/src/verbs.c index ba3c0a4..7d215c1 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -189,6 +189,41 @@ int __ibv_dereg_mr(struct ibv_mr *mr) } default_symver(__ibv_dereg_mr, ibv_dereg_mr); +struct ibv_mr *__ibv_reg_io_mr(struct ibv_pd *pd, void *addr, + size_t length, int access) +{ +struct ibv_mr *mr; + +if (ibv_dontfork_range(addr, length)) +return NULL; + +mr = pd->context->ops.reg_io_mr(pd, addr, length, access); Won't reg_io_mr pointer be NULL for other HCAs? What happens if the device doesn't yet implement this function? Without a check, SEGV. See above. 
+if (mr) { +mr->context = pd->context; +mr->pd = pd; +mr->addr = addr; +mr->length = length; +} else +ibv_dofork_range(addr, length); + +return mr; +} +default_symver(__ibv_reg_io_mr, ibv_reg_io_mr); + +int __ibv_dereg_io_mr(struct ibv_mr *mr) +{ +int ret; +void *addr = mr->addr; +size_t length = mr->length; + +ret = mr->context->ops.dereg_io_mr(mr); +if (!ret) +ibv_dofork_range(addr
Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
On 7/29/10 3:41 PM, Jason Gunthorpe wrote: On Thu, Jul 29, 2010 at 03:29:37PM -0500, Tom Tucker wrote: Also, I'd like to see a strong defence of this new user space API, particularly: 1) Why can't this be done with the existing ibv_reg_mr, like huge pages are. The ibv_reg_mr API assumes that the memory being registered was allocated in user mode and is part of the current->mm VMA. It uses get_user_pages which will scoff and jeer at kernel memory. I'm confused? What is the vaddr input then? How does userspace get that value? Isn't it created by mmap or the like? Yes. I.e. for the PCI-E example you gave I assume the flow is that userspace mmaps devices/pci:00/:00:XX.X/resourceX to get the IO memory and then passes that through to ibv_reg_mr? Not exactly. It would mmap the device that manages the adapter hosting the memory. IMHO, the ibv_reg_mr API should accept any valid vaddr available to the process, and if it bombs for certain kinds of vaddrs then it is just a bug. Perhaps. 2) How is it possible for userspace to know when it should use ibv_reg_mr vs ibv_reg_io_mr? By virtue of the device that it is mmap'ing. If I mmap my_vmstat_driver, I know that the memory I am mapping is a kernel buffer. Yah, but what if the next version of your vmstat driver changes the kind of memory it returns? It's a general service for a class of memory, not an enabler for a particular application's peculiarities. On first glance, this seems like a hugely bad API to me :) Well hopefully now that its purpose is revealed you will change your view and we can collaboratively make it better :-) I don't object to the idea, just to the notion that user space is supposed to somehow know that one vaddr is different from another vaddr and call the right API - seems impossible to use correctly to me. What would you have to do to implement this scheme using ibv_reg_mr as the entry point? 
The kernel service on the other side of ibv_reg_mr verb could divine the necessary information by searching all vma owned by current and looking at vma_flags to decide what type it was. Jason -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
On 7/29/10 11:25 AM, Tom Tucker wrote: From: Tom Tucker t...@opengridcomputing.com Add an ib_iomem_get service that converts a vma to an array of physical addresses. This makes it easier for each device driver to add support for the reg_io_mr provider method. Signed-off-by: Tom Tucker t...@ogc.us --- drivers/infiniband/core/umem.c | 248 ++-- include/rdma/ib_umem.h | 14 ++ 2 files changed, 251 insertions(+), 11 deletions(-) [...snip...] + /* The pfn_list we built is a set of Page +* Frame Numbers (PFN) whose physical address +* is PFN << PAGE_SHIFT. The SG DMA mapping +* services expect page addresses, not PFN, +* therefore, we have to do the dma mapping +* ourselves here. */ + for (i = 0; i < chunk->nents; ++i) { + sg_set_page(&chunk->page_list[i], 0, + PAGE_SIZE, 0); + chunk->page_list[i].dma_address = + (pfn_list[i] << PAGE_SHIFT); This is not architecture independent. Does anyone have any thoughts on how this ought to be done? + chunk->page_list[i].dma_length = PAGE_SIZE; + } + chunk->nmap = chunk->nents; + ret -= chunk->nents; + off += chunk->nents; + list_add_tail(&chunk->list, &umem->chunk_list); + } + + ret = 0; + } + [...snip...] 
Re: [Suspected SPAM] Re: [RFC PATCH 2/4] uverbs: Add common ib_iomem_get service
On 7/29/10 5:57 PM, Jason Gunthorpe wrote: You would need to modify ib_umem_get() to check for the VM_PFNMAP flag and build the struct ib_umem similar to the proposed ib_iomem_get(). However, the page reference counting/sharing issue would need to be solved. I think there are kernel level callbacks for this that could be used. But in this case the pages are already mmaped into a user process, there must be some mechanism to ensure they don't get pulled away?! This is not virtual memory. It's real memory. Though, I guess, what happens if you hot un-plug the PCI-E card that has a process mmaping its memory?! Exactly. The memory would have to be physically detached for it to get 'pulled away' What happens if you RDMA READ from PCI-E address space that does not have any device responding? bus error. Jason 
Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6
J. Bruce Fields wrote: On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote: On 04/03/2010 09:27 AM, Tom Tucker wrote: RPC6 requires that it be possible to create endpoints that listen exclusively for IPv4 or IPv6 connection requests. This is not currently supported by the RDMA API. Signed-off-by: Tom Tucker t...@opengridcomputing.com Tested-by: Steve Wise sw...@opengridcomputing.com Reviewed-by: Chuck Lever chuck.le...@oracle.com Thanks to all. I take it the problem began with 37498292a "NFSD: Create PF_INET6 listener in write_ports"? Yes. Tom --b. --- net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index 3fa5751..4e6bbf9 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -678,7 +678,10 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv, int ret; dprintk("svcrdma: Creating RDMA socket\n"); - + if (sa->sa_family != AF_INET) { + dprintk("svcrdma: Address family %d is not supported.\n", sa->sa_family); + return ERR_PTR(-EAFNOSUPPORT); + } cma_xprt = rdma_create_xprt(serv, 1); if (!cma_xprt) return ERR_PTR(-ENOMEM); -- chuck[dot]lever[at]oracle[dot]com 
Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6
J. Bruce Fields wrote: On Mon, Apr 05, 2010 at 12:16:18PM -0400, J. Bruce Fields wrote: On Mon, Apr 05, 2010 at 10:50:16AM -0500, Tom Tucker wrote: J. Bruce Fields wrote: On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote: On 04/03/2010 09:27 AM, Tom Tucker wrote: RPC6 requires that it be possible to create endpoints that listen exclusively for IPv4 or IPv6 connection requests. This is not currently supported by the RDMA API. Signed-off-by: Tom Tucker t...@opengridcomputing.com Tested-by: Steve Wise sw...@opengridcomputing.com Reviewed-by: Chuck Lever chuck.le...@oracle.com Thanks to all. I take it the problem began with 37498292a "NFSD: Create PF_INET6 listener in write_ports"? Yes. Thanks. I'll pass along git://linux-nfs.org/~bfields/linux.git for-2.6.34 soon. And: sorry we didn't catch this when it happened. I have some of the equipment I'd need to do basic regression tests, but haven't set it up. I hope I get to it at some point. For now I depend on others to catch even basic rdma regressions--let me know if there's some way I could make your testing easier. We were focused on older kernels... and probably should have caught it quicker. No worries. Thanks, Tom --b. 
Re: [PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message
Roland Dreier wrote: The write_ports code will fail both the INET4 and INET6 transport creation if the transport returns an error when PF_INET6 is specified. Some transports that do not support INET6 return an error other than EAFNOSUPPORT. That's the real bug. Any reason the RDMA RPC transport can't return EAFNOSUPPORT in this case? I think Tom's changelog is misleading. Yes, it should read "A transport may fail for some reason other than EAFNOSUPPORT." The problem is that the RDMA transport actually does support IPv6, but it doesn't support the IPV6ONLY option yet. So if NFS/RDMA binds to a port for IPv4, then the IPv6 bind fails because of the port collision. Should we fail INET4 if INET6 fails under any circumstances? Implementing the IPV6ONLY option for RDMA binding is probably not feasible for 2.6.34, so the best band-aid for now seems to be Tom's patch. - R. 
Re: nfsrdma broken on 2.6.34-rc1?
Sean Hefty wrote: Sean, will you add this to the rdma_cm? Not immediately because I lack the time to do it. It would be really nice to share the kernel's port space code and remove the port code in the rdma_cm. LOL. Yes...yes it would. There is of course a Dragon to be slain. Roland? 
[PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message
If this looks right to everyone, I'll post this to linux-nfs. Tom nfsd: Make INET6 transport creation failure an informational message The write_ports code will fail both the INET4 and INET6 transport creation if the transport returns an error when PF_INET6 is specified. Some transports that do not support INET6 return an error other than EAFNOSUPPORT. We should allow communication on INET4 even if INET6 is not yet supported or fails for some reason. Signed-off-by: Tom Tucker t...@opengridcomputing.com --- fs/nfsd/nfsctl.c | 6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 0f0e77f..019a89e 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -1008,8 +1008,10 @@ static ssize_t __write_ports_addxprt(char *buf) err = svc_create_xprt(nfsd_serv, transport, PF_INET6, port, SVC_SOCK_ANONYMOUS); - if (err < 0 && err != -EAFNOSUPPORT) - goto out_close; + if (err < 0) + dprintk("nfsd: Error creating PF_INET6 listener for transport '%s'\n", + transport); + return 0; out_close: xprt = svc_find_xprt(nfsd_serv, transport, PF_INET, port); 
[PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message
Hi Bruce/Chuck, RDMA Transports are currently broken in 2.6.34 because they don't have a V4ONLY setsockopt. So what happens is that when write_ports attempts to create the PF_INET6 transport it fails because the port is already in use. There is discussion on linux-rdma about how to fix this, but in the interim and perhaps indefinitely, I propose the following: Tom nfsd: Make INET6 transport creation failure an informational message The write_ports code will fail both the INET4 and INET6 transport creation if the transport returns an error when PF_INET6 is specified. Some transports that do not support INET6 return an error other than EAFNOSUPPORT. We should allow communication on INET4 even if INET6 is not yet supported or fails for some reason. Signed-off-by: Tom Tucker t...@opengridcomputing.com --- fs/nfsd/nfsctl.c | 6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c index 0f0e77f..934b624 100644 --- a/fs/nfsd/nfsctl.c +++ b/fs/nfsd/nfsctl.c @@ -1008,8 +1008,10 @@ static ssize_t __write_ports_addxprt(char *buf) err = svc_create_xprt(nfsd_serv, transport, PF_INET6, port, SVC_SOCK_ANONYMOUS); - if (err < 0 && err != -EAFNOSUPPORT) - goto out_close; + if (err < 0) + printk(KERN_INFO "nfsd: Error creating PF_INET6 listener " + "for transport '%s'\n", transport); + return 0; out_close: xprt = svc_find_xprt(nfsd_serv, transport, PF_INET, port); 
Re: rnfs: rq_respages pointer is bad
David J. Wilder wrote: Tom, I have been chasing an rnfs related Oops in svc_process(). I have found the source of the Oops but I am not sure of my fix. I am seeing the problem on ppc64, kernel 2.6.32, I have not tried other arch yet. The source of the problem is in rdma_read_complete(), I am finding that rqstp->rq_respages is set to point past the end of the rqstp->rq_pages page list. This results in a NULL reference in svc_process() when passing rq_respages[0] to page_address(). In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of the page list then indexing by page_no, however rq_arg.pages is not pointing to the start of the list so rq_respages ends up pointing to: rqstp->rq_pages[(head->count + 1) + head->hdr_count] In my case, it ends up pointing past the end of the list by one. Here is the change I made. static int rdma_read_complete(struct svc_rqst *rqstp, struct svc_rdma_op_ctxt *head) { int page_no; int ret; BUG_ON(!head); /* Copy RPC pages */ for (page_no = 0; page_no < head->count; page_no++) { put_page(rqstp->rq_pages[page_no]); rqstp->rq_pages[page_no] = head->pages[page_no]; } /* Point rq_arg.pages past header */ rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count]; rqstp->rq_arg.page_len = head->arg.page_len; rqstp->rq_arg.page_base = head->arg.page_base; /* rq_respages starts after the last arg page */ - rqstp->rq_respages = &rqstp->rq_arg.pages[page_no]; + rqstp->rq_respages = &rqstp->rq_pages[page_no]; This might be clearer as: rqstp->rq_respages = &rqstp->rq_pages[head->count]; . . . The change works for me, but I am not sure it is safe to assume that &rqstp->rq_pages[head->count] will always point to the last arg page. Dave. 
Re: rnfs: rq_respages pointer is bad
Roland Dreier wrote: Someone please make sure that a final patch with a full description gets sent to the NFS guys for merging. Tom, are you going to handle this? Yes, and I have several more in queue. Tom 
Re: [ewg] nfsrdma fails to write big file,
Roland: I'll put together a patch based on 5 with a comment that indicates why I think 5 is the number. Since Vu has verified this behaviorally as well, I'm comfortable that our understanding of the code is sound. I'm on the road right now, so it won't be until tomorrow though. Thanks, Tom Vu Pham wrote: -Original Message- From: Tom Tucker [mailto:t...@opengridcomputing.com] Sent: Saturday, February 27, 2010 8:23 PM To: Vu Pham Cc: Roland Dreier; linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Roland Dreier wrote: + /* + * Add room for frmr register and invalidate WRs + * Requests sometimes have two chunks, each chunk + * requires to have different frmr. The safest + * WRs required are max_send_wr * 6; however, we + * get send completions and poll fast enough, it + * is pretty safe to have max_send_wr * 4. + */ + ep->rep_attr.cap.max_send_wr *= 4; Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high load situations, which I don't think you want. Vu, Would you please try the following: - Set the multiplier to 5 - Set the number of buffer credits small as follows: echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries - Rerun your test and see if you can reproduce the problem? I did the above and was unable to reproduce, but I would like to see if you can to convince ourselves that 5 is the right number. Tom, I did the above and can not reproduce either. I think 5 is the right number; however, we should optimize it later. -vu 
Re: rnfs: rq_respages pointer is bad
Hi David: That looks like a bug to me and it looks like what you propose is the correct fix. My only reservation is that if you are correct, then how did this work at all without data corruption for large writes on x86_64? I'm on the road right now, so I can't dig too deep until Wednesday, but at this point your analysis looks correct to me. Tom David J. Wilder wrote: Tom, I have been chasing an rnfs related Oops in svc_process(). I have found the source of the Oops but I am not sure of my fix. I am seeing the problem on ppc64, kernel 2.6.32, I have not tried other arch yet. The source of the problem is in rdma_read_complete(), I am finding that rqstp->rq_respages is set to point past the end of the rqstp->rq_pages page list. This results in a NULL reference in svc_process() when passing rq_respages[0] to page_address(). In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of the page list then indexing by page_no, however rq_arg.pages is not pointing to the start of the list so rq_respages ends up pointing to: rqstp->rq_pages[(head->count + 1) + head->hdr_count] In my case, it ends up pointing past the end of the list by one. Here is the change I made. static int rdma_read_complete(struct svc_rqst *rqstp, struct svc_rdma_op_ctxt *head) { int page_no; int ret; BUG_ON(!head); /* Copy RPC pages */ for (page_no = 0; page_no < head->count; page_no++) { put_page(rqstp->rq_pages[page_no]); rqstp->rq_pages[page_no] = head->pages[page_no]; } /* Point rq_arg.pages past header */ rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count]; rqstp->rq_arg.page_len = head->arg.page_len; rqstp->rq_arg.page_base = head->arg.page_base; /* rq_respages starts after the last arg page */ - rqstp->rq_respages = &rqstp->rq_arg.pages[page_no]; + rqstp->rq_respages = &rqstp->rq_pages[page_no]; . . . The change works for me, but I am not sure it is safe to assume that &rqstp->rq_pages[head->count] will always point to the last arg page. Dave. 
Re: [ewg] nfsrdma fails to write big file,
Roland Dreier wrote: + /* + * Add room for frmr register and invalidate WRs + * Requests sometimes have two chunks, each chunk + * requires to have different frmr. The safest + * WRs required are max_send_wr * 6; however, we + * get send completions and poll fast enough, it + * is pretty safe to have max_send_wr * 4. + */ + ep->rep_attr.cap.max_send_wr *= 4; Seems like a bad design if there is a possibility of work queue overflow; if you're counting on events occurring in a particular order or completions being handled fast enough, then your design is going to fail in some high load situations, which I don't think you want. Vu, Would you please try the following: - Set the multiplier to 5 - Set the number of buffer credits small as follows: echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries - Rerun your test and see if you can reproduce the problem? I did the above and was unable to reproduce, but I would like to see if you can to convince ourselves that 5 is the right number. Thanks, Tom - R. 
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote: Tom, Did you make any change to have bonnie++, dd of a 10G file and vdbench concurrently run finish? No I did not, but my disk subsystem is pretty slow, so it might be that I just don't have fast enough storage. I keep hitting the WQE overflow error below. I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk), each chunk requires an frmr + invalidate wrs; However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then for the frmr case you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the wqe overflow to happen faster. After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight -vu --- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:41:22.0 -0800 +++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:03:18.0 -0800 @@ -649,8 +654,15 @@ ep->rep_attr.cap.max_send_wr = cdata->max_requests; switch (ia->ri_memreg_strategy) { case RPCRDMA_FRMR: - /* Add room for frmr register and invalidate WRs */ - ep->rep_attr.cap.max_send_wr *= 3; + /* + * Add room for frmr register and invalidate WRs + * Requests sometimes have two chunks, each chunk + * requires to have different frmr. The safest + * WRs required are max_send_wr * 6; however, we + * get send completions and poll fast enough, it + * is pretty safe to have max_send_wr * 4. + */ + ep->rep_attr.cap.max_send_wr *= 4; if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr) return -EINVAL; break; @@ -682,7 +694,8 @@ ep->rep_attr.cap.max_recv_sge); /* set trigger for requesting send completion */ - ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/; + ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4; + switch (ia->ri_memreg_strategy) { case RPCRDMA_MEMWINDOWS_ASYNC: case RPCRDMA_MEMWINDOWS: Erf. This is client code. I'll take a look at this and see if I can understand what Talpey was up to. 
Tom -Original Message- From: ewg-boun...@lists.openfabrics.org [mailto:ewg- boun...@lists.openfabrics.org] On Behalf Of Vu Pham Sent: Monday, February 22, 2010 12:23 PM To: Tom Tucker Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Tom, Some more info on the problem: 1. Running with memreg=4 (FMR) I can not reproduce the problem 2. I also see different error on client Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain' Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32 Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5 Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103) -vu -Original Message- From: Tom Tucker [mailto:t...@opengridcomputing.com] Sent: Monday, February 22, 2010 10:49 AM To: Vu Pham Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Vu Pham wrote: Setup: 1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2. 2. Solaris nfsrdma server svn 130, ConnectX QDR HCA. Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M count=1*, operation fail, connection get drop, client cannot re-establish connection to server. After rebooting only the client, I can mount again. It happens with both solaris and linux nfsrdma servers. For linux client/server, I run memreg=5 (FRMR), I don't see problem with memreg=6 (global dma key) Awesome. This is the key I think. Thanks for the info Vu, Tom On Solaris server snv 130, we see problem decoding write request of 32K. 
The client sends two read chunks (a 32K chunk and a 16-byte chunk); the server fails to do the rdma read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERROR); therefore, the server terminates the connection. We don't see this problem with nfs version 3 on Solaris. The Solaris server runs normal memory registration mode. On the linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR. I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue. thanks, -vu
Re: [ewg] nfsrdma fails to write big file,
Vu, Are you changing any of the default settings? For example rsize/wsize, etc... I'd like to reproduce this problem if I can. Thanks, Tom Vu Pham wrote: Tom, Did you make any change to have bonnie++, dd of a 10G file and vdbench concurrently run finish? I keep hitting the WQE overflow error below. I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk), each chunk requires an frmr + invalidate wrs; However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then for the frmr case you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the wqe overflow to happen faster. After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight -vu --- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:41:22.0 -0800 +++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:03:18.0 -0800 @@ -649,8 +654,15 @@ ep->rep_attr.cap.max_send_wr = cdata->max_requests; switch (ia->ri_memreg_strategy) { case RPCRDMA_FRMR: - /* Add room for frmr register and invalidate WRs */ - ep->rep_attr.cap.max_send_wr *= 3; + /* + * Add room for frmr register and invalidate WRs + * Requests sometimes have two chunks, each chunk + * requires to have different frmr. The safest + * WRs required are max_send_wr * 6; however, we + * get send completions and poll fast enough, it + * is pretty safe to have max_send_wr * 4. + */ + ep->rep_attr.cap.max_send_wr *= 4; if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr) return -EINVAL; break; @@ -682,7 +694,8 @@ ep->rep_attr.cap.max_recv_sge); /* set trigger for requesting send completion */ - ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/; + ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4; + switch (ia->ri_memreg_strategy) { case RPCRDMA_MEMWINDOWS_ASYNC: case RPCRDMA_MEMWINDOWS: -Original Message- From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham Sent: Monday, February 22, 2010 12:23 PM To: Tom Tucker Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Tom, Some more info on the problem: 1. Running with memreg=4 (FMR) I can not reproduce the problem 2. I also see different error on client Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain' Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32 Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5 Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103) -vu -Original Message- From: Tom Tucker [mailto:t...@opengridcomputing.com] Sent: Monday, February 22, 2010 10:49 AM To: Vu Pham Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Vu Pham wrote: Setup: 1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2. 2. Solaris nfsrdma server svn 130, ConnectX QDR HCA. 
Running vdbench on 10g file or *dd if=/dev/zero of=10g_file bs=1M count=1*, operation fail, connection get drop, client cannot re-establish connection to server. After rebooting only the client, I can mount again. It happens with both solaris and linux nfsrdma servers. For linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key) Awesome. This is the key I think. Thanks for the info Vu, Tom On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (a 32K chunk and a 16-byte chunk); the server fails to do the rdma read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERROR); therefore, the server terminates the connection. We don't see this problem with nfs version 3 on Solaris. The Solaris server runs normal memory registration mode. On the linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR. I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue. thanks, -vu
Re: [ewg] nfsrdma fails to write big file,
Vu, Based on the mapping code, it looks to me like the worst case is RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. However, I think in practice, due to the way that iov are built, the actual max is 5 (frmr for head + pagelist plus invalidates for same plus one for the send itself). Why did you think the max was 6? Thanks, Tom Tom Tucker wrote: Vu, Are you changing any of the default settings? For example rsize/wsize, etc... I'd like to reproduce this problem if I can. Thanks, Tom Vu Pham wrote: Tom, Did you make any change to have bonnie++, dd of a 10G file and vdbench concurrently run finish? I keep hitting the WQE overflow error below. I saw that most of the requests have two chunks (a 32K chunk and a some-bytes chunk), each chunk requires an frmr + invalidate wrs; However, you set ep->rep_attr.cap.max_send_wr = cdata->max_requests and then for the frmr case you do ep->rep_attr.cap.max_send_wr *= 3; which is not enough. Moreover, you also set ep->rep_cqinit = max_send_wr/2 for the send completion signal, which causes the wqe overflow to happen faster. After applying the following patch, I have vdbench, dd, and a copy of the 10g_file running overnight -vu --- ofa_kernel-1.5.1.orig/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:41:22.0 -0800 +++ ofa_kernel-1.5.1/net/sunrpc/xprtrdma/verbs.c 2010-02-24 10:03:18.0 -0800 @@ -649,8 +654,15 @@ ep->rep_attr.cap.max_send_wr = cdata->max_requests; switch (ia->ri_memreg_strategy) { case RPCRDMA_FRMR: - /* Add room for frmr register and invalidate WRs */ - ep->rep_attr.cap.max_send_wr *= 3; + /* + * Add room for frmr register and invalidate WRs + * Requests sometimes have two chunks, each chunk + * requires to have different frmr. The safest + * WRs required are max_send_wr * 6; however, we + * get send completions and poll fast enough, it + * is pretty safe to have max_send_wr * 4. + */ + ep->rep_attr.cap.max_send_wr *= 4; if (ep->rep_attr.cap.max_send_wr > devattr.max_qp_wr) return -EINVAL; break; @@ -682,7 +694,8 @@ ep->rep_attr.cap.max_recv_sge); /* set trigger for requesting send completion */ - ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 /* - 1*/; + ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/4; + switch (ia->ri_memreg_strategy) { case RPCRDMA_MEMWINDOWS_ASYNC: case RPCRDMA_MEMWINDOWS: -Original Message- From: ewg-boun...@lists.openfabrics.org [mailto:ewg-boun...@lists.openfabrics.org] On Behalf Of Vu Pham Sent: Monday, February 22, 2010 12:23 PM To: Tom Tucker Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Tom, Some more info on the problem: 1. Running with memreg=4 (FMR) I can not reproduce the problem 2. I also see different error on client Feb 22 12:16:55 mellanox-2 rpc.idmapd[5786]: nss_getpwnam: name 'nobody' does not map into domain 'localdomain' Feb 22 12:16:55 mellanox-2 kernel: QP 0x70004b: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: QP 0x6c004a: WQE overflow Feb 22 12:16:55 mellanox-2 kernel: RPC: rpcrdma_ep_post: ib_post_send returned -12 cq_init 48 cq_count 32 Feb 22 12:17:00 mellanox-2 kernel: RPC: rpcrdma_event_process: send WC status 5, vend_err F5 Feb 22 12:17:00 mellanox-2 kernel: rpcrdma: connection to 13.20.1.9:20049 closed (-103) -vu -Original Message- From: Tom Tucker [mailto:t...@opengridcomputing.com] Sent: Monday, February 22, 2010 10:49 AM To: Vu Pham Cc: linux-rdma@vger.kernel.org; Mahesh Siddheshwar; e...@lists.openfabrics.org Subject: Re: [ewg] nfsrdma fails to write big file, Vu Pham wrote: Setup: 1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2. 2. Solaris nfsrdma server svn 130, ConnectX QDR HCA. 
Re: [ewg] nfsrdma fails to write big file,
Vu,

I ran the number of slots down to 8 (echo 8 > rdma_slot_table_entries) and I can reproduce the issue now. I'm going to try setting the allocation multiple to 5 and see if I can't prove to myself and Roland that we've accurately computed the correct factor. I think overall a better solution might be a different credit system; however, that's a much more substantial change than we can tackle at this point.

Tom

Tom Tucker wrote:
Vu,
Based on the mapping code, it looks to me like the worst case is RPCRDMA_MAX_SEGS * 2 + 1 as the multiplier. However, I think in practice, due to the way the iovs are built, the actual max is 5 (an FRMR for the head + page list, plus invalidates for the same, plus one for the send itself). Why did you think the max was 6?
[snip]
Re: [ewg] nfsrdma fails to write big file,
Vu Pham wrote:
Setup:
1. linux nfsrdma client/server with OFED-1.5.1-20100217-0600, ConnectX2 QDR HCAs fw 2.7.8-6, RHEL 5.2.
2. Solaris nfsrdma server snv 130, ConnectX QDR HCA.

Running vdbench on a 10g file, or *dd if=/dev/zero of=10g_file bs=1M count=1*, the operation fails, the connection gets dropped, and the client cannot re-establish the connection to the server. After rebooting only the client, I can mount again. It happens with both the Solaris and linux nfsrdma servers. For the linux client/server, I run memreg=5 (FRMR); I don't see the problem with memreg=6 (global dma key).

Awesome. This is the key, I think. Thanks for the info Vu,
Tom

On the Solaris server snv 130, we see a problem decoding a write request of 32K. The client sends two read chunks (32K + 16-byte); the server fails to do the rdma read on the 16-byte chunk (cqe.status = 10, i.e. IB_WC_REM_ACCESS_ERR); therefore, the server terminates the connection. We don't see this problem with nfs version 3 on Solaris. The Solaris server runs normal memory registration mode. On the linux client, I see cqe.status = 12, i.e. IB_WC_RETRY_EXC_ERR.

I added these notes in bug #1919 (bugs.openfabrics.org) to track the issue.

thanks,
-vu
___
ewg mailing list
e...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ewg] MLX4 Strangeness
Hi Tziporet:

Here is a trace with the data for the WR failing with status 12. The vendor error is 129.

Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id status 12 opcode 0 vendor_err 129 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:154 wr_id 81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index
Feb 17 12:27:33 vic10 kernel: rpcrdma_event_process:167 wr_id 81002878d800 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002a13ec00 ex src_qp wc_flags, 0 pkey_index

Any thoughts?

Tom

Tom Tucker wrote:
Tom Tucker wrote:
Tziporet Koren wrote:
On 2/15/2010 10:24 PM, Tom Tucker wrote:
Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved, and each has a dual-ported MTHCA DDR adapter and an MLX4 adapter. The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.

I have confirmed:
- that arp completed successfully and the neighbor entries are populated on both the client and server
- that the QPs are in the RTS state on both the client and server
- that there are RECV WRs posted to the RQ on the server and they did not error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WRs posted to the QP on the client
- that the client side SEND WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e. rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.

client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?

Tom

What is the vendor syndrome error when you get a completion with error?

Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
Repeat forever.

So the vendor err is 244.

Please ignore this. This log skips the failing WR (:-\). I need to do another trace.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:
On 2/15/2010 10:24 PM, Tom Tucker wrote:
Hello,
I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1.
[snip]
Does anyone have any ideas on how I might debug this?
Tom

What is the vendor syndrome error when you get a completion with error?

Hang on... compiling

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Only the MLX4 cards.

Tziporet
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:
On 2/15/2010 10:24 PM, Tom Tucker wrote:
Hello,
I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1.
[snip]
Does anyone have any ideas on how I might debug this?
Tom

What is the vendor syndrome error when you get a completion with error?

Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
Repeat forever.

So the vendor err is 244.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
Re: [ewg] MLX4 Strangeness
Tom Tucker wrote:
Tziporet Koren wrote:
On 2/15/2010 10:24 PM, Tom Tucker wrote:
Hello,
I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1.
[snip]

What is the vendor syndrome error when you get a completion with error?

[log snipped]

So the vendor err is 244.

Please ignore this. This log skips the failing WR (:-\). I need to do another trace.

Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Tziporet
Re: [ewg] MLX4 Strangeness
More info...

Rebooting the client and trying to reconnect to a server that has not been rebooted fails in the same way, so it must be an issue with the server. I see no completions on the server, nor any indication that an RDMA_SEND was incoming. Is there some way to dump adapter state or otherwise see if there was traffic on the wire?

Tom

Tom Tucker wrote:
Tom Tucker wrote:
Tziporet Koren wrote:
On 2/15/2010 10:24 PM, Tom Tucker wrote:
Hello,
I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1.
[snip]