Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
On Fri, Aug 15, 2014 at 3:08 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote: [snip]
Doug Ledford (8): IPoIB: Consolidate rtnl_lock tasks in workqueue; IPoIB: Make the carrier_on_task race aware; IPoIB: fix MCAST_FLAG_BUSY usage; IPoIB: fix mcast_dev_flush/mcast_restart_task race; IPoIB: change init sequence ordering; IPoIB: Use dedicated workqueues per interface; IPoIB: Make ipoib_mcast_stop_thread flush the workqueue; IPoIB: No longer use flush as a parameter
IPoIB was recently added as a technology preview for Intel Xeon Phi (currently a PCIe card), which runs an embedded Linux (named MPSS) with the InfiniBand software stack supported via emulation drivers. One early piece of feedback from users with large cluster nodes concerns IPoIB's power consumption. The root cause of the reported issue has more to do with how MPSS handles its DMA buffers (vs. how the Linux IB stack works), so submitting that fix upstream is not planned at this moment (unless folks are interested in the changes). However, since this patch set happens to sit at the heart of the reported power issue, we would like to take a closer look to avoid the MPSS code base deviating too much from future upstream kernels. Questions, comments, and/or acks will follow sometime next week.
I've reviewed the patch set - the first half of the patches looks good. Patches #5, #6, #7, and #8 are fine if we go for one WQ per device - I'll let others make the final call.
On our system (OFED 1.5.4 based), similar deadlocks were also observed while the power management issues were being worked on. Constrained by other issues specific to our platform, I took advantage of the single IPoIB workqueue by queuing the if-up(s) and/or if-down(s) to the workqueue if one was already in progress. That serialized the logic by default. However, I would not mind the one-WQ-per-device approach and will redo the changes when this patch set is picked up by the mainline kernel.
-- Wendy
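For readers unfamiliar with the serialization trick described above, here is a minimal, hypothetical sketch of the idea (the ipoib_admin_work structure and all function names are invented for illustration; the actual MPSS change has not been published): funnel every administrative up/down request through the single IPoIB workqueue so the requests execute strictly one at a time.

    /* Hypothetical sketch only - names are not from any real patch. */
    struct ipoib_admin_work {
            struct work_struct work;
            struct net_device *dev;
            bool up;                /* true = if-up request, false = if-down */
    };

    static void ipoib_admin_task(struct work_struct *work)
    {
            struct ipoib_admin_work *aw =
                    container_of(work, struct ipoib_admin_work, work);

            if (aw->up) {
                    /* bring the interface up here */
            } else {
                    /* tear the interface down here */
            }
            kfree(aw);
    }

    static int ipoib_queue_admin_change(struct net_device *dev, bool up)
    {
            struct ipoib_admin_work *aw = kzalloc(sizeof(*aw), GFP_KERNEL);

            if (!aw)
                    return -ENOMEM;
            INIT_WORK(&aw->work, ipoib_admin_task);
            aw->dev = dev;
            aw->up = up;
            /* a single workqueue means requests run one at a time, in order */
            queue_work(ipoib_workqueue, &aw->work);
            return 0;
    }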
Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
Our mcast_dev_flush routine and our mcast_restart_task can race against each other. In particular, they both hold the priv->lock while manipulating the rbtree, while removing mcast entries from the multicast_list, and while adding entries to the remove_list, but they also both drop their locks prior to doing the actual removes. The mcast_dev_flush routine is run entirely under the rtnl lock and so has at least some locking. The actual race condition is like this:

Thread 1                                 Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
    call mcast_restart_task to add new entries
                                         ifconfig ib0 down
                                           mcast_dev_flush
                                             mcast_leave(mcast A)
    mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as the member in core/multicast.c is ref counted, we run into an unbalanced refcount issue. To avoid stomping on each other's removes, take the rtnl lock specifically when we are deleting the entries from the remove list.
Isn't test_and_clear_bit() atomic, so it is unlikely that ib_sa_free_multicast() can run multiple times?
Oops ... what if the structure itself gets freed? My bad!
However, isn't the remove_list a local list on the caller's stack? And since the original list-entry move (to the remove_list) is protected by the spinlock (priv->lock), isn't it unlikely that ib_sa_free_multicast() can operate on the same entry? The patch itself is harmless though ... but adding the rtnl_lock is really not ideal.
-- Wendy
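To make the point about the remove_list concrete, here is a minimal sketch of the pattern under discussion (simplified for illustration; this is not the actual ipoib_multicast.c code): each caller builds its own remove_list on its stack, and entries are moved onto it while priv->lock is held, so two threads should never end up owning the same mcast entry.

    /* Simplified illustration of the flush pattern - not the real code. */
    static void example_flush(struct ipoib_dev_priv *priv)
    {
            LIST_HEAD(remove_list);          /* private to this caller */
            struct ipoib_mcast *mcast, *tmcast;
            unsigned long flags;

            spin_lock_irqsave(&priv->lock, flags);
            list_for_each_entry_safe(mcast, tmcast, &priv->multicast_list, list)
                    list_move_tail(&mcast->list, &remove_list);  /* unlinked under the lock */
            spin_unlock_irqrestore(&priv->lock, flags);

            /* the leave/free runs outside the spinlock, but on the private list */
            list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
                    ipoib_mcast_leave(priv->dev, mcast);
                    ipoib_mcast_free(mcast);
            }
    }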
Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote: Our mcast_dev_flush routine and our mcast_restart_task can race against each other. In particular, they both hold the priv-lock while manipulating the rbtree and while removing mcast entries from the multicast_list and while adding entries to the remove_list, but they also both drop their locks prior to doing the actual removes. The mcast_dev_flush routine is run entirely under the rtnl lock and so has at least some locking. The actual race condition is like this: Thread 1Thread 2 ifconfig ib0 up start multicast join for broadcast multicast join completes for broadcast start to add more multicast joins call mcast_restart_task to add new entries ifconfig ib0 down mcast_dev_flush mcast_leave(mcast A) mcast_leave(mcast A) As mcast_leave calls ib_sa_multicast_leave, and as member in core/multicast.c is ref counted, we run into an unbalanced refcount issue. To avoid stomping on each others removes, take the rtnl lock specifically when we are deleting the entries from the remove list. Isn't test_and_clear_bit() atomic so it is unlikely that ib_sa_free_multicast() can run multiple times ? 638 static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast) 639 { 640 struct ipoib_dev_priv *priv = netdev_priv(dev); 641 int ret = 0; 642 643 if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags)) 644 ib_sa_free_multicast(mcast-mc); 645 646 if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, mcast-flags)) { -- Wendy Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++ 1 file changed, 32 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index f5e8da530d9..19e3fe75ebf 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev) spin_unlock_irqrestore(priv-lock, flags); - /* seperate between the wait to the leave*/ + /* +* make sure the in-flight joins have finished before we attempt +* to leave +*/ list_for_each_entry_safe(mcast, tmcast, remove_list, list) if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags)) wait_for_completion(mcast-done); @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work) netif_addr_unlock(dev); local_irq_restore(flags); - /* We have to cancel outside of the spinlock */ + /* +* make sure the in-flight joins have finished before we attempt +* to leave +*/ + list_for_each_entry_safe(mcast, tmcast, remove_list, list) + if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags)) + wait_for_completion(mcast-done); + + /* +* We have to cancel outside of the spinlock, but we have to +* take the rtnl lock or else we race with the removal of +* entries from the remove list in mcast_dev_flush as part +* of ipoib_stop() which will call mcast_stop_thread with +* flush == 1 while holding the rtnl lock, and the +* flush_workqueue won't complete until this restart_mcast_task +* completes. So do like the carrier on task and attempt to +* take the rtnl lock, but if we can't before the ADMIN_UP flag +* goes away, then just return and know that the remove list will +* get flushed later by mcast_dev_flush. 
+*/ + while (!rtnl_trylock()) { + if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) + return; + else + msleep(20); + } list_for_each_entry_safe(mcast, tmcast, remove_list, list) { ipoib_mcast_leave(mcast-dev, mcast); ipoib_mcast_free(mcast); } - - if (test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) - ipoib_mcast_start_thread(dev); + ipoib_mcast_start_thread(dev); + rtnl_unlock(); } #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford dledf...@redhat.com wrote: On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete) ? Yes. They are both on work queues and ib_sa_join_multicast simply fires off another workqueue task. The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU). Then it is a race to see who finishes first. Ok, thanks for the explanation. I also googled and found the original patch where the IPOIB_MCAST_JOIN_STARTED was added. This patch now makes sense. Acked-by: Wendy Cheng wendy.ch...@intel.com On the other hand, I'm still puzzled why ib_sa_join_multicast() can't be a blocking call (i.e. wait until callback is executed) - why would IPOIB pay the price to work around these nasty issues ? But I guess that is off-topic too much .. BTW, thanks for the work. Our users will be doing if-up-down a lot for power management, patches like these help ! -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote:
So that's why in this patch we 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast->mc to the proper return value before we process the join completion callback, 2) always clear mcast->mc if there is any error since we can't call ib_sa_multicast_leave, 3) always complete the mcast in case we are waiting on it, and 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event.
We don't have IPOIB_MCAST_JOIN_STARTED (and the done completion struct) in our code base (MPSS) yet ... I'm *not* NAK-ing this patch, but I find it hard to understand the ramifications. It has nothing to do with this patch - actually the patch itself looks pretty OK (by eye). The original IPoIB mcast flow, particularly its abnormal error path, confuses me. Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete)? The mcast->done completion struct looks dangerous as well. I'll let other capable people make the final call(s).
-- Wendy
Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote: We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_ADMIN_UP has been cleared, in which case we don't need to do anything anyway and we just return. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index a0a42859f12..7e9cd39b5ef 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) carrier_on_task); struct ib_port_attr attr; - /* -* Take rtnl_lock to avoid racing with ipoib_stop() and -* turning the carrier back on while a device is being -* removed. -*/ if (ib_query_port(priv-ca, priv-port, attr) || attr.state != IB_PORT_ACTIVE) { ipoib_dbg(priv, Keeping carrier off until IB port is active\n); return; } - rtnl_lock(); + /* +* Take rtnl_lock to avoid racing with ipoib_stop() and +* turning the carrier back on while a device is being +* removed. However, ipoib_stop() will attempt to flush +* the workqueue while holding the rtnl lock, so loop +* on trylock until either we get the lock or we see +* FLAG_ADMIN_UP go away as that signals that we are bailing +* and can safely ignore the carrier on work +*/ + while (!rtnl_trylock()) { + if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) + return; + else + msleep(20); + } I always think rtnl lock is too big for this purpose... and that 20 ms is not ideal either. Could we have a new IPOIB private mutex used by ipoib_stop() and this section of code ? So something like: ipoib_stop() {. mutex_lock(something_new); clear_bit(IPOIB_FLAG_ADMIN_UP, priv-flags); ... mutex_unlock(something_new); return 0; } Then the loop would become: // this while-loop will be very short - since we either get the mutex quickly or return quickly. while (!mutex_trylock(something_new)) { if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) return; } if (!ipoib_cm_admin_enabled(priv-dev)) dev_set_mtu(priv-dev, min(priv-mcast_mtu, priv-admin_mtu)); netif_carrier_on(priv-dev); -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
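Wendy's pseudocode above, cleaned up into a compilable sketch (the mutex name "something_new" and the example_ prefixes are placeholders; a real implementation would most likely embed the mutex in ipoib_dev_priv rather than make it global):

    /* Illustrative only - mirrors the proposal above, not a tested patch. */
    static DEFINE_MUTEX(something_new);

    static int example_ipoib_stop(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            mutex_lock(&something_new);
            clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
            /* ... rest of the normal ipoib_stop() shutdown path ... */
            mutex_unlock(&something_new);
            return 0;
    }

    static void example_carrier_on(struct ipoib_dev_priv *priv)
    {
            /* short wait: either we get the mutex quickly or we bail out */
            while (!mutex_trylock(&something_new)) {
                    if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
                            return;
                    cpu_relax();
            }

            if (!ipoib_cm_admin_enabled(priv->dev))
                    dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
            netif_carrier_on(priv->dev);
            mutex_unlock(&something_new);
    }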
Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
Locking of multicast joins/leaves in the IPoIB layer has been problematic for a while. There have been recent changes to try and make things better, including these changes:
bea1e22 IPoIB: Fix use-after-free of multicast object
a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
Unfortunately, the following test still fails (miserably) on a plain upstream kernel:

pass=0
ifdown ib0
while true; do
    ifconfig ib0 up
    ifconfig ib0 down
    echo Pass $pass
    let pass++
done

This usually fails within 10 to 20 passes, although I did have a lucky run make it to 300 or so. If you happen to have a P_Key child interface, it fails even quicker. [snip]
Doug Ledford (8): IPoIB: Consolidate rtnl_lock tasks in workqueue; IPoIB: Make the carrier_on_task race aware; IPoIB: fix MCAST_FLAG_BUSY usage; IPoIB: fix mcast_dev_flush/mcast_restart_task race; IPoIB: change init sequence ordering; IPoIB: Use dedicated workqueues per interface; IPoIB: Make ipoib_mcast_stop_thread flush the workqueue; IPoIB: No longer use flush as a parameter
IPoIB was recently added as a technology preview for Intel Xeon Phi (currently a PCIe card), which runs an embedded Linux (named MPSS) with the InfiniBand software stack supported via emulation drivers. One early piece of feedback from users with large cluster nodes concerns IPoIB's power consumption. The root cause of the reported issue has more to do with how MPSS handles its DMA buffers (vs. how the Linux IB stack works), so submitting that fix upstream is not planned at this moment (unless folks are interested in the changes). However, since this patch set happens to sit at the heart of the reported power issue, we would like to take a closer look to avoid the MPSS code base deviating too much from future upstream kernels. Questions, comments, and/or acks will follow sometime next week.
-- Wendy
Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote: Setting the mtu can safely be moved to the carrier_on_task, which keeps us from needing to take the rtnl lock in the join_finish section. Looks good ! Acked-by: Wendy Cheng wendy.ch...@intel.com Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index d4e005720d0..a0a42859f12 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, spin_unlock_irq(priv-lock); priv-tx_wr.wr.ud.remote_qkey = priv-qkey; set_qkey = 1; - - if (!ipoib_cm_admin_enabled(dev)) { - rtnl_lock(); - dev_set_mtu(dev, min(priv-mcast_mtu, priv-admin_mtu)); - rtnl_unlock(); - } } if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, mcast-flags)) { @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) } rtnl_lock(); + if (!ipoib_cm_admin_enabled(priv-dev)) + dev_set_mtu(priv-dev, min(priv-mcast_mtu, priv-admin_mtu)); netif_carrier_on(priv-dev); rtnl_unlock(); } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: nfs-rdma performance
On Thu, Jun 12, 2014 at 12:54 PM, Mark Lehrer leh...@gmail.com wrote:
Awesome work on nfs-rdma in the later kernels! I had been having panic problems for a while and now things appear to be quite reliable. Now that things are more reliable, I would like to help work on speed issues. On this same hardware with SMB Direct and the standard storage review 8k 70/30 test, I get combined read/write performance of around 2.5GB/sec. With nfs-rdma it is pushing about 850MB/sec. This is simply an unacceptable difference. I'm using the standard settings -- connected mode, 65520 byte MTU, nfs-server-side async, lots of nfsd's, and nfsver=3 with large buffers. Does anyone have any tuning suggestions and/or places to start looking for bottlenecks?
There is a tunable called xprt_rdma_slot_table_entries ... increasing that seemed to help a lot for me last year. Be aware that this tunable is enclosed inside #ifdef RPC_DEBUG, so you might need to tweak the source and rebuild the kmod.
-- Wendy
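For context on the #ifdef RPC_DEBUG remark above: the xprtrdma tunables only get wired up under /proc/sys/sunrpc when the module is built with RPC_DEBUG, roughly along these lines (a simplified sketch, not the verbatim upstream table):

    #ifdef RPC_DEBUG
    static struct ctl_table xr_tunables_table[] = {
            {
                    .procname     = "rdma_slot_table_entries",
                    .data         = &xprt_rdma_slot_table_entries,
                    .maxlen       = sizeof(unsigned int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec,
            },
            { }
    };
    /* registered under /proc/sys/sunrpc at module init (e.g. via
     * register_sysctl_table()); without RPC_DEBUG none of this exists,
     * hence the "tweak the source and rebuild" advice above. */
    #endif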
Re: help with IB_WC_MW_BIND_ERR
On Tue, May 20, 2014 at 11:55 AM, Chuck Lever chuck.le...@oracle.com wrote:
Hi- What does it mean when a LOCAL_INV work request fails with an IB_WC_MW_BIND_ERR completion?
Mapping an IB error code has been a great pain (at least for me) unless you have access to the HCA firmware. In this case, I think it implies a memory protection error (registration issue); in the cxgb4 driver, for example, it is associated with invalidating a shared MR or invalidating a memory window bound to a QP (drivers/infiniband/hw/cxgb4/cq.c, around line 654):

case T4_ERR_INVALIDATE_SHARED_MR:
case T4_ERR_INVALIDATE_MR_WITH_MW_BOUND:
        wc->status = IB_WC_MW_BIND_ERR;
        break;

You'll probably need to mention the HCA name so the firmware people, if they are reading this, can pinpoint the exact cause.
-- Wendy
Re: Proposal for simplifying NFS/RDMA client memory registration
On Mon, Mar 3, 2014 at 11:54 AM, faibish, sorin faibish_so...@emc.com wrote:
On Mar 3, 2014, at 7:09 PM, Christoph Hellwig h...@infradead.org wrote:
On Mon, Mar 03, 2014 at 12:02:33PM -0500, Chuck Lever wrote: All HCAs in 3.13 (and rxe) can support either MTHCA_FMR or FRMR or both. Wendy's HCA supports only ALLPHYSICAL.
Is Wendy planning to submit her HCA driver ASAP? If not, there's no reason to keep ALLPHYSICAL either.
I second Christoph. Legacy is good as long as there are users of Linux with the legacy server. I would say that the only reason to keep it is if the Linux server will support it. We apply the same to the Lustre client in the kernel. ./Sorin
Does it make sense to deprecate then remove the registration modes in the first list? Yes.
After discussing this with my manager, we'll let it go for now ... we will re-submit the full patch set in the future when we finalize the plan.
Thanks, Wendy
Re: Proposal for simplifying NFS/RDMA client memory registration
On Fri, Feb 28, 2014 at 2:20 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
On Fri, Feb 28, 2014 at 1:41 PM, Tom Talpey t...@talpey.com wrote:
On 2/26/2014 8:44 AM, Chuck Lever wrote: Hi- Shirley Ma and I are reviving work on the NFS/RDMA client code base in the Linux kernel. So far we've built and run functional tests to determine what is working and what is broken. [snip] ALLPHYSICAL - Usually fast, but not safe as it exposes client memory. All HCAs support this mode.
"Not safe" is an understatement. It exposes all of the client's physical memory to the peer, for both read and write. A simple pointer error on the server will silently corrupt the client. This mode was intended only for testing, and in experimental deployments.
(sorry, resend ... the previous reply bounced back due to gmail html format) Please keep ALLPHYSICAL for now - our embedded system needs it.
Thanks, Wendy
Re: AW: IPoIB GRO
I looked at the TSO code earlier this year. IIRC, if TSO is on, the upper layer (e.g. IP) just sends the super-packet down (to IPoIB) without segmentation (for send); if it is off, it does the segmentation (to match the MTU size) before calling the device's send. For GRO, I would imagine it needs some sort of segmentation sequence to know how to pull the segments back together on the receive end. Looks to me like segmentation offload (TSO) and receive offload (GRO) are mutually exclusive? Check out dev_gro_receive() (line numbers based on the 2.6.32 RHEL kernel):

2980
2981         if (skb_is_gso(skb) || skb_has_frags(skb))
2982                 goto normal;

See how it bails out when TSO (skb_is_gso()) is on? So it looks like an IPoIB bug that ipoib_ib_handle_rx_wc() does an unconditional napi_gro_receive() regardless of adapter capability (and TSO setting). Just a guess!
-- Wendy
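A hedged sketch of the conditional being hinted at (hypothetical; whether this is the right fix is exactly the open question here):

    /* Illustration only: gate GRO on the device's advertised features. */
    static void example_rx_deliver(struct ipoib_dev_priv *priv,
                                   struct sk_buff *skb)
    {
            if (priv->dev->features & NETIF_F_GRO)
                    napi_gro_receive(&priv->napi, skb);
            else
                    netif_receive_skb(skb);
    }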
Re: ACK behaviour difference LRO/GRO
On Mon, Oct 28, 2013 at 12:34 PM, Markus Stockhausen stockhau...@collogia.de wrote: Hello, about two month we had some problems with IPoIB transfer speeds . See more http://marc.info/?l=linux-rdmam=137823326109158w=2 After some quite hard test iterations the problem seems to come from the IPoIB switch from LRO to GRO between kernels 2.6.37 and 2.6.38. I built a test setup with a 2.6.38 kernel and additionaly compiled a 2.6.37 ib_ipoib module against it. This way I can run a direct comparison between the old and new module. The major difference between the two version is inside the ipoib_ib_handle_rx_wc() function: 2.6.37: lro_receive_skb(priv-lro.lro_mgr, skb, NULL); 2.6.38: napi_gro_receive(priv-napi, skb); As in the last post we use ConnectX cards in datagram mode with a 2044 MTU. We read a file sequentially from a NFS server into /dev/null. We just want to get the wire speed neglecting hard drives. The hardware is slightly newer so we get different transfer speeds but the overall effect should be evident. The server uses a 3.5 kernel and is not changed during the tests. With 2.6.37 IPoIB module on the client side and LRO enabled the speed is 950 MByte/sec. On the NFS server side a tcpdump trace reads like: 19:51:51.432630 IP 10.10.30.251.nfs 10.10.30.1.781: Flags [P.], seq 1008434065:1008497161, ack 617432, win 688, options [nop,nop,TS val 133047292 ecr 429568], length 63096 19:51:51.432672 IP 10.10.30.1.781 10.10.30.251.nfs: Flags [.], ack 1008241041, win 24576, options [nop,nop,TS val 429568 ecr 133047292], length 0 19:51:51.432677 IP 10.10.30.251.nfs 10.10.30.1.781: Flags [.], seq 1008497161:1008560905, ack 617432, win 688, options [nop,nop,TS val 133047292 ecr 429568], length 63744 19:51:51.432725 IP 10.10.30.1.781 10.10.30.251.nfs: Flags [.], ack 1008304585, win 24576, options [nop,nop,TS val 429568 ecr 133047292], length 0 19:51:51.432729 IP 10.10.30.251.nfs 10.10.30.1.781: Flags [.], seq 1008560905:1008624649, ack 617432, win 688, options [nop,nop,TS val 133047292 ecr 429568], length 63744 With some slight differences here and there the client sends only 1 ack for about 60k of transferred data. With 2.6.38 module and onwards (GRO enabled) the speed drops down to 380 MByte/sec and a different transfer pattern. 19:58:14.631430 IP 10.10.30.251.nfs 10.10.30.1.ircs: Flags [.], seq 722492293:722502253, ack 442312, win 537, options [nop,nop,TS val 133143092 ecr 467889], length 9960 19:58:14.631460 IP 10.10.30.1.ircs 10.10.30.251.nfs: Flags [.], ack 722478181, win 24562, options [nop,nop,TS val 467889 ecr 133143092], length 0 19:58:14.631485 IP 10.10.30.1.ircs 10.10.30.251.nfs: Flags [.], ack 722478181, win 24562, options [nop,nop,TS val 467889 ecr 133143092,nop,nop,sack 1 {722480117:722482333}], length 0 19:58:14.631510 IP 10.10.30.1.ircs 10.10.30.251.nfs: Flags [.], ack 722488197, win 24562, options [nop,nop,TS val 467889 ecr 133143092], length 0 19:58:14.631534 IP 10.10.30.1.ircs 10.10.30.251.nfs: Flags [.], ack 722494229, win 24562, options [nop,nop,TS val 467889 ecr 133143092], length 0 It seems as if the NFS client acknowledges every 2K packet separately. I thought that it may come from missing coalescing parameters and tried a ethtool -C ib0 rx-usecs 5 on both machines but without success. I'm quite lost now maybe someone can give a tip if I'm missing something. Nice work! Look like napi_gro_receive() does not do the work it is supposed to do ?! 
My (embedded NFS client) system was on a 2.6.38 kernel, but we use the ipoib kmod from OFED 1.5.4.1 - so we're still on the lro_receive_skb() path, which does not have this issue. I'll try it out later this week to see what is going on. The Mellanox folks or Roland may have more to say.
-- Wendy
Re: Strange NFS client ACK behaviour
On Mon, Sep 9, 2013 at 11:51 PM, Markus Stockhausen stockhau...@collogia.de wrote:
From: Wendy Cheng [s.wendy.ch...@gmail.com] Sent: Monday, September 9, 2013 22:03 To: Markus Stockhausen Cc: linux-rdma@vger.kernel.org Subject: Re: Strange NFS client ACK behaviour
On Sun, Sep 8, 2013 at 11:24 AM, Markus Stockhausen stockhau...@collogia.de wrote: we observed a performance drop in our IPoIB NFS backup infrastructure since we switched to machines with newer kernels.
Not sure how your backup infrastructure works, but the symptoms seem to match this discussion: http://www.spinics.net/lists/linux-nfs/msg38980.html If you know how to recompile the nfs kmod, Trond's patch is worth a try. Or open an Ubuntu support ticket and let them build you a test kmod. -- Wendy
Thanks for pointing in that direction. From my understanding this patch goes into the NFS client side. I built a patched module for my Fedora 19 client (3.10 kernel). Nevertheless the behaviour is still the same. If I understand the patch correctly, it is about forked children that access a page of an mmapped file round robin, where the kernel issues tons of write requests to the file. My case is only about ACK transmissions for a single writer. Markus
So you have to go back to the drawing board :(. Have you tried to profile it? http://oprofile.sourceforge.net/about/
-- Wendy
Re: Strange NFS client ACK behaviour
On Sun, Sep 8, 2013 at 11:24 AM, Markus Stockhausen stockhau...@collogia.de wrote: we observed a performance drop in our IPoIB NFS backup infrastructure since we switched to machines with newer kernels. Not sure how your backup infrastructure works but the symptoms seem to match with this discussion: http://www.spinics.net/lists/linux-nfs/msg38980.html If you know how to recompile nfs kmod, Trond's patch does worth a try. Or open an Ubuntu support ticket, let them build you a test kmod. -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Strange NFS client ACK behaviour
CC linux-nfs .. maybe this is obvious to someone there ... Two comments inlined below. On Tue, Sep 3, 2013 at 11:28 AM, Markus Stockhausen stockhau...@collogia.de wrote: Hello, we observed a performance drop in our IPoIB NFS backup infrastructure since we switched to machines with newer kernels. As I do not know where to start I hope someone on this list can give me hint where to dig for more details. In case of no other reply, I would start w/ a socket program (or a network performance measuring tool) on the interface that does similar logic as dd you described below; that is, send a 256K message in a fixed number of loops (so total transfer size somewhere close to your file size) between client and server, followed by comparing the interrupt counters (cat /proc/interrtups) on both kernels. If the interrupt count differs as you described, the problem is most likely with the IB driver, not NFS layer. To make a long story short. We use ConnectX cards with the standard kernel drivers on version 2.6.32 (Ubuntu 10.04), 3.5 (Ubuntu 12.04) and 3.10 (Fedora 19). The very simple and not scientific test consists of mounting a NFS share using IPoIB UD network interfaces at MTU of 2044. Afterwards read a large file on the client side with dd if=file of=/dev/null bs=256K. During the transfer we run a tcpdump on the ibX interface on the NFS server side. No special settings for kernel parameters until now. I don't know much about ConnectX. Not sure what IPoIB UD means ? Datagram vs. CM or TCP vs. UDP ? When doing the test with a 2.6.32 kernel based client we see the following packet sequence. More or less a lot of transferd blocks from the NFS server to the client with sometimes an ACK package from the client to the server: 16:16:45.050930 IP server.nfs cli_2_6_32.896: Flags [.], seq 8909853:8913837, ack 1154149, win 604, options [nop,nop,TS val 1640401415 ecr 3881919089], length 3984 16:16:45.050936 IP server.nfs cli_2_6_32.896: Flags [.], seq 8913837:8917821, ack 1154149, win 604, options [nop,nop,TS val 1640401415 ecr 3881919089], length 3984 ... 8 more ... 16:16:45.050976 IP cli_2_6_32.896 server.nfs: Flags [.], ack 8909853, win 24574, options [nop,nop,TS val 3881919089 ecr 1640401415], length 0 ... After switchng to a client with a newer kernel (3.5 or 3.10) the sequence all of a sudden gives just the opposite behaviour. One should note that this is the same server as in the test above. The server sends bigger packets (I guess TSO is doing the rest of the work). After each packet the client sends several ACK packages back. 16:15:21.038782 IP server.nfs cli_3_5_0.928: Flags [.], seq 9612429:9652269, ack 372776, win 5815, options [nop,nop,TS val 1640380412 ecr 560111379], length 39840 16:15:21.038806 IP cli_3_5_0.928 server.nfs: Flags [.], ack 9542205, win 16384, options [nop,nop,TS val 560111379 ecr 1640380412], length 0 16:15:21.038812 IP cli_3_5_0.928 server.nfs: Flags [.], ack 9546077, win 16384, options [nop,nop,TS val 560111379 ecr 1640380412], length 0 ... 6-8 more ... The visible side effects of this changed processing include: - NIC interrupts on the NFS servers raise by a factor of 8. - Transfer speed lowers by 50% (400-200 MB/sec) Best regards. Markus -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
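A minimal sketch of the socket test suggested above (assumptions: plain TCP to the server's ibX address on port 5001, 256 KB writes, with any simple receiver such as "nc -l 5001 > /dev/null" on the other side). Run it against both kernels and compare /proc/interrupts afterwards:

    /* send_256k.c - tiny TCP sender for comparing interrupt counts */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define MSG_SIZE (256 * 1024)
    #define LOOPS    40960                  /* ~10 GB total */

    int main(int argc, char **argv)
    {
            struct sockaddr_in srv = { .sin_family = AF_INET,
                                       .sin_port   = htons(5001) };
            char *buf = malloc(MSG_SIZE);
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            long i;

            if (argc < 2 || !buf || fd < 0)
                    return 1;
            inet_pton(AF_INET, argv[1], &srv.sin_addr);  /* server's ibX address */
            if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
                    return 1;

            memset(buf, 0xab, MSG_SIZE);
            for (i = 0; i < LOOPS; i++)
                    if (write(fd, buf, MSG_SIZE) < 0)    /* short writes ignored for brevity */
                            break;

            close(fd);
            free(buf);
            return 0;
    }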
Re: Helps to Decode rpc_debug Output
On Mon, Aug 26, 2013 at 6:22 AM, Tom Talpey t...@talpey.com wrote: On 8/21/2013 11:55 AM, Wendy Cheng wrote: On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote: On 8/14/2013 8:14 PM, Wendy Cheng wrote: Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far but I do manage to get nfs mount working. Simple file operations (such as ls, file read/write, scp, etc) seem to work as well. One thing I'm still scratching my head is that ... by looking at the raw IOPS, I don't see dramatic difference between NFS-RDMA vs. NFS over IPOIB (TCP). Sounds like your bottleneck lies in some other component. What's the storage, for example? RDMA won't do a thing to improve a slow disk. Or, what kind of IOPS rate are you seeing? If these systems aren't generating enough load to push a CPU limit, then shifting the protocol on the same link might not yield much. There is no kernel profiling tool with this uOS (yet) so it is hard to identify the bottleneck. Looking from the surface, the slow down seems to be from SUNRPC's Van Jacobson congestion control (xprt_reserve_xprt_cong()) where it either creates a race condition for the transmissions (write/commit) to miss their wake-up(s); or the algorithm itself is not a right choice for this client system that consists of many (244 on my system) slower cores (CPU). Solid state drives are used on the RHEL server. -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Helps to Decode rpc_debug Output
On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:
On 8/14/2013 8:14 PM, Wendy Cheng wrote: Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on a 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. The NFS server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far, but I did manage to get the nfs mount working. Simple file operations (such as ls, file read/write, scp, etc.) seem to work as well.
Yay ... got this up, amazingly, on a uOS that does not have much of the conventional kernel debug facilities. The hang was caused by auto disconnect, triggered by xprt->timer. The task was carried out by xprt_init_autodisconnect(). It silently disconnects the xprt without a sensible warning. The uOS runs on small-core (slower) hardware. Instead of a hard number, this timeout value needs to be at least a proc tunable. Will check newer kernels to see whether it's been improved and/or draft a patch later.
One thing I'm still scratching my head over ... looking at the raw IOPS, I don't see a dramatic difference between NFS-RDMA vs. NFS over IPoIB (TCP). However, the total run time differs greatly. NFS over RDMA seems to take a much longer time to finish (vs. NFS over IPoIB). Not sure why that is ... maybe the constant connect/disconnect triggered by reestablish_timeout? The connection re-establish is known to be expensive on this uOS. Why do we need two sets of timeouts, where 1. xprt->timer disconnects (without reconnect), and 2. reestablish_timeout constantly disconnects/reconnects?
-- Wendy
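To illustrate the "proc tunable" idea mentioned above (purely a sketch; the parameter name is invented and nothing like this existed upstream at the time):

    /* Hypothetical module parameter for the idle autodisconnect interval. */
    static unsigned int xprt_idle_timeout_secs = 300;
    module_param(xprt_idle_timeout_secs, uint, 0644);
    MODULE_PARM_DESC(xprt_idle_timeout_secs,
                     "Seconds of transport inactivity before autodisconnect");

    /*
     * The transport would then arm its timer from the tunable instead of a
     * hard-coded constant, e.g.:
     *     mod_timer(&xprt->timer, jiffies + xprt_idle_timeout_secs * HZ);
     */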
Re: Helps to Decode rpc_debug Output
On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:
On 8/14/2013 8:14 PM, Wendy Cheng wrote: Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on a 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. The NFS server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far, but I did manage to get the nfs mount working. Simple file operations (such as ls, file read/write, scp, etc.) seem to work as well. [snip]
Why did you replace the Linux IB stack with OFED? Did you also take the NFS/RDMA from that package, and if so are you sure that it all is working properly? Doesn't 2.6.38 already have all this?
The other part of the cluster runs OFED 1.5.4 on top of RHEL 6.3 - it was a product decision. Ditto for the 2.6.38-based uOS. The OFED 1.5.4 based NFS/RDMA (i.e. xprtrdma) does not run on both platforms. It took a while to understand the setup. I believe the issues with the RHEL boxes have been fixed - at least iozone runs through (as client and server) without trouble. Now the issue is with this 2.6.38 uOS (as client) that talks to RHEL 6.3 (as server). I don't know much about NFS V4, so the focus is on V3.
-- Wendy
Helps to Decode rpc_debug Output
The IO on top of a NFS mounted directory was hanging so I forced a (client side) rpc_debug output from the proc entry. 6[ 4311.590317] 1676 0001-11 8801e0bde400 8801d18b1248 0 81420d40 nfsv3 WRITE a:call_status q:xprt_resend ... (similar lines) ... 6[ 4311.590435] 1682 0001-11 8801e0bde400 8801d18b0e10 0 81420d40 nfsv3 WRITE a:call_connect_status q:xprt_sending ... (similar lines) ... Could someone give me an educational statement on what the above two lines mean ? . More specifically, what call_connect_status does and what xprt_sending means ? Is there any way to force (i.e. hacking the code to get) a re-connect (i.e. invoke connect from rpc_xprt_ops) ? Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far but I do manage to get nfs mount working. Simple file operations (such as ls, file read/write, scp, etc) seem to work as well. While trying to run iozone to see whether the performance gain can be justified for the development efforts, the program runs until it reaches 2MB file size - at that point, RDMA CM sends out TIMEWAIT_EXIT event, the xprt is disconnected, and all IOs on that share hang. IPOIB still works though. Not sure what would be the best way to debug this. Thanks, Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Mon, Apr 29, 2013 at 10:09 PM, Yan Burman y...@mellanox.com wrote: I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue). This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code? That's very exciting ! The sad part is that IOMMU has to be turned off. I think ib_send_bw uses a single buffer so the DMA mapping search overhead is not an issue. -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec. ... [snip] 36.18% nfsd [kernel.kallsyms] [k] mutex_spin_on_owner That's the inode i_mutex. 14.70%-- svc_send That's the xpt_mutex (ensuring rpc replies aren't interleaved). 9.63% nfsd [kernel.kallsyms] [k] _raw_spin_lock_irqsave And that (and __free_iova below) looks like iova_rbtree_lock. Let's revisit your command: FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128 --ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255 --loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall --buffered=0 * inode's i_mutex: If increasing process/file count didn't help, maybe increase iodepth (say 512 ?) could offset the i_mutex overhead a little bit ? * xpt_mutex: (no idea) * iova_rbtree_lock DMA mapping fragmentation ? I have not studied whether NFS-RDMA routines such as svc_rdma_sendto() could do better but maybe sequential IO (instead of randread) could help ? Bigger block size (instead of 4K) can help ? -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: So I did a quick read of the sunrpc/xprtrdma source (based on the OFA 1.5.4.1 tarball) ... Here is a random thought (not related to the rb tree comment). The inflight packet count seems to be controlled by xprt_rdma_slot_table_entries, which is currently hard-coded as RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help with the bandwidth number if we pump it up, say to 64 instead? Not sure whether the FMR pool size needs to get adjusted accordingly though.
1) The client slot count is not hard-coded, it can easily be changed by writing a value to /proc and initiating a new mount. But I doubt that increasing the slot table will improve performance much, unless this is a small-random-read, and spindle-limited workload.
Hi Tom! It was a shot in the dark :) ... as our test bed has not been set up yet. However, since I'll be working on (very) slow clients, increasing this buffer is still interesting (to me). I don't see where it is controlled by a /proc value (?) - but that is not a concern at this moment, as a /proc entry is easy to add. More questions on the server though (see below) ...
2) The observation appears to be that the bandwidth is server CPU limited. Increasing the load offered by the client probably won't move the needle, until that's addressed.
Could you give more hints on which part of the path is CPU limited? Is there a known Linux-based filesystem that is reasonably tuned for NFS-RDMA? Are there any specific filesystem features that would work well with NFS-RDMA? I'm wondering, when disk+FS are added into the configuration, how much advantage NFS-RDMA would have compared with a plain TCP/IP transport, say IPoIB in CM mode?
-- Wendy
Re: NFS over RDMA benchmark
On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com wrote: The Mellanox driver uses red-black trees extensively for resource management, e.g. QP ID, CQ ID, etc... When completions come in from the HW, these are used to find the associated software data structures I believe. It is certainly possible that these trees get hot on lookup when we're pushing a lot of data. I'm surprised, however, to see rb_insert_color there because I'm not aware of any where that resources are being inserted into and/or removed from a red-black tree in the data path. I think they (rb calls) are from base kernel, not from any NFS and/or IB module (e.g. RPC, MLX, etc). See the right column ? it says /root/vmlinux. Just a guess - I don't know much about this perf command. -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Thu, Apr 25, 2013 at 2:58 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com wrote: The Mellanox driver uses red-black trees extensively for resource management, e.g. QP ID, CQ ID, etc... When completions come in from the HW, these are used to find the associated software data structures I believe. It is certainly possible that these trees get hot on lookup when we're pushing a lot of data. I'm surprised, however, to see rb_insert_color there because I'm not aware of any where that resources are being inserted into and/or removed from a red-black tree in the data path. I think they (rb calls) are from base kernel, not from any NFS and/or IB module (e.g. RPC, MLX, etc). See the right column ? it says /root/vmlinux. Just a guess - I don't know much about this perf command. Oops .. take my words back ! I confused Linux's RB tree w/ BSD's. BSD's is a set of macros inside a header file while Linux's implementation is a base kernel library. So every KMOD is a suspect here :) -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote: On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote: On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote: Perf top for the CPU with high tasklet count gives: samples pcnt RIPfunction DSO ___ _ ___ ___ 2787.00 24.1% 81062a00 mutex_spin_on_owner /root/vmlinux I guess that means lots of contention on some mutex? If only we knew which one perf should also be able to collect stack statistics, I forget how. Googling around I think we want: perf record -a --call-graph (give it a chance to collect some samples, then ^C) perf report --call-graph --stdio I have not looked at NFS RDMA (and 3.x kernel) source yet. But see that rb_prev up in the #7 spot ? Do we have Red Black tree somewhere in the paths ? Trees like that requires extensive lockings. -- Wendy . 978.00 8.4% 810297f0 clflush_cache_range /root/vmlinux 445.00 3.8% 812ea440 __domain_mapping /root/vmlinux 441.00 3.8% 00018c30 svc_recv /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko 344.00 3.0% 813a1bc0 _raw_spin_lock_bh /root/vmlinux 333.00 2.9% 813a19e0 _raw_spin_lock_irqsave /root/vmlinux 288.00 2.5% 813a07d0 __schedule /root/vmlinux 249.00 2.1% 811a87e0 rb_prev /root/vmlinux 242.00 2.1% 813a19b0 _raw_spin_lock /root/vmlinux 184.00 1.6% 2e90 svc_rdma_sendto /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 177.00 1.5% 810ac820 get_page_from_freelist /root/vmlinux 174.00 1.5% 812e6da0 alloc_iova /root/vmlinux 165.00 1.4% 810b1390 put_page /root/vmlinux 148.00 1.3% 00014760 sunrpc_cache_lookup /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko 128.00 1.1% 00017f20 svc_xprt_enqueue /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko 126.00 1.1% 8139f820 __mutex_lock_slowpath /root/vmlinux 108.00 0.9% 811a81d0 rb_insert_color /root/vmlinux 107.00 0.9% 4690 svc_rdma_recvfrom /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 102.00 0.9% 2640 send_reply /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 99.00 0.9% 810e6490 kmem_cache_alloc /root/vmlinux 96.00 0.8% 810e5840 __slab_alloc /root/vmlinux 91.00 0.8% 6d30 mlx4_ib_post_send /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko 88.00 0.8% 0dd0 svc_rdma_get_context /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 86.00 0.7% 813a1a10 _raw_spin_lock_irq /root/vmlinux 86.00 0.7% 1530 svc_rdma_send /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 85.00 0.7% 81060a80 prepare_creds /root/vmlinux 83.00 0.7% 810a5790 find_get_pages_contig /root/vmlinux 79.00 0.7% 810e4620 __slab_free /root/vmlinux 79.00 0.7% 813a1a40 _raw_spin_unlock_irqrestore /root/vmlinux 77.00 0.7% 81065610 finish_task_switch /root/vmlinux 76.00 0.7% 812e9270 pfn_to_dma_pte /root/vmlinux 75.00 0.6% 810976d0 __call_rcu /root/vmlinux 73.00 0.6% 811a2fa0 _atomic_dec_and_lock /root/vmlinux 73.00 0.6% 02e0 svc_rdma_has_wspace /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko 67.00 0.6% 813a1a70 _raw_read_lock /root/vmlinux 65.00 0.6% f590 svcauth_unix_set_client /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko 63.00 0.5% 000180e0 svc_reserve /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko 60.00 0.5% 64d0 stamp_send_wqe
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote: On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote: On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote: Perf top for the CPU with high tasklet count gives: samples pcnt RIPfunction DSO ___ _ ___ ___ 2787.00 24.1% 81062a00 mutex_spin_on_owner /root/vmlinux I guess that means lots of contention on some mutex? If only we knew which one perf should also be able to collect stack statistics, I forget how. Googling around I think we want: perf record -a --call-graph (give it a chance to collect some samples, then ^C) perf report --call-graph --stdio I have not looked at NFS RDMA (and 3.x kernel) source yet. But see that rb_prev up in the #7 spot ? Do we have Red Black tree somewhere in the paths ? Trees like that requires extensive lockings. So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1 tar ball) ... Here is a random thought (not related to the rb tree comment). The inflight packet count seems to be controlled by xprt_rdma_slot_table_entries that is currently hard-coded as RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help with the bandwidth number if we pump it up, say 64 instead ? Not sure whether FMR pool size needs to get adjusted accordingly though. In short, if anyone has benchmark setup handy, bumping up the slot table size as the following might be interesting: --- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h 2013-03-21 09:19:36.233006570 -0700 +++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h 2013-04-24 10:52:20.934781304 -0700 @@ -59,7 +59,7 @@ * a single chunk type per message is supported currently. */ #define RPCRDMA_MIN_SLOT_TABLE (2U) -#define RPCRDMA_DEF_SLOT_TABLE (32U) +#define RPCRDMA_DEF_SLOT_TABLE (64U) #define RPCRDMA_MAX_SLOT_TABLE (256U) #define RPCRDMA_DEF_INLINE (1024) /* default inline max */ -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote: What do you suggest for benchmarking NFS?
I believe SPECsfs has been widely used by NFS (server) vendors to position their product lines. Its workload was based on a real-life NFS deployment. I think it is geared more toward an office type of workload (large client/user count with smaller file sizes, e.g. software development with build, compile, etc.). BTW, we're experimenting with a similar project and would be interested to know your findings.
-- Wendy
Re: NFS over RDMA benchmark
On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler spencer.shep...@gmail.com wrote: Note that SPEC SFS does not support RDMA.
IIRC, the benchmark comes with source code - I'm wondering if anyone has modified it to run on RDMA? Or are there any real users who can share their experience?
-- Wendy
From: Wendy Cheng Sent: 4/18/2013 9:16 AM To: Yan Burman Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark
On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote: What do you suggest for benchmarking NFS?
I believe SPECsfs has been widely used by NFS (server) vendors to position their product lines. Its workload was based on a real-life NFS deployment. I think it is geared more toward an office type of workload (large client/user count with smaller file sizes, e.g. software development with build, compile, etc.). BTW, we're experimenting with a similar project and would be interested to know your findings.
-- Wendy
Re: NFS over RDMA benchmark
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote: Hi. I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec. Remember there are always gaps between wire speed (that ib_send_bw measures) and real world applications. That being said, does your server use default export (sync) option ? Export the share with async option can bring you closer to wire speed. However, the practice (async) is generally not recommended in a real production system - as it can cause data integrity issues, e.g. you have more chances to lose data when the boxes crash. -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NFS over RDMA benchmark
On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote: On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote: Hi. I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec. Yan, Are you trying to optimize single client performance or server performance with multiple clients? Remember there are always gaps between wire speed (that ib_send_bw measures) and real world applications. That being said, does your server use default export (sync) option ? Export the share with async option can bring you closer to wire speed. However, the practice (async) is generally not recommended in a real production system - as it can cause data integrity issues, e.g. you have more chances to lose data when the boxes crash. -- Wendy Wendy, It has a been a few years since I looked at RPCRDMA, but I seem to remember that RPCs were limited to 32KB which means that you have to pipeline them to get linerate. In addition to requiring pipelining, the argument from the authors was that the goal was to maximize server performance and not single client performance. Scott That (client count) brings up a good point ... FIO is really not a good benchmark for NFS. Does anyone have SPECsfs numbers on NFS over RDMA to share ? -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IPOIB-CM MTU
We're working on a RHEL-based system (2.6.32-279.el6.x86_64) that has slower CPUs (vs. bigger Xeon boxes). The IB stack is on top of OFA 1.5.4.1 and an mlx adapter (mlx4_0). It is expected that enabling CM will boost IPoIB bandwidth (measured by Netpipe) due to the larger MTU size (65520). Unfortunately, that does not happen. It does, however, show a 2x bandwidth gain (CM vs. datagram) on Xeon servers.

While looking around, we noticed that the MTU reported by the ifconfig command correctly shows the ipoib_cm_max_mtu number, but the socket buffer sent down to ipoib_cm_send() never exceeds 2048 bytes on both the Xeon and the new HW platform. Seeing that TSO (driver/firmware segmentation) is off in CM mode ... intuitively, TCP/IP on Xeon would do better with segmentation, while the (segmentation) overhead (and other platform-specific issues) could weigh down the bandwidth on the subject HW.

If the guess is right, the question here is how to make the upper layer, i.e. IP, send down fragments bigger than 2048 bytes. Any comment and/or help? Are there any other knobs that I should turn with CM mode (enabled by echoing connected into /sys/class/net/ib0/mode)? Thanks, Wendy
Re: IPOIB Implementation
On Thu, Oct 25, 2012 at 1:26 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: Could folks pass along some pointers on the subject implementation (for educational purposes)? Thanks, Wendy

BTW, I'm starting with RFC 4391 (and the source code in drivers/infiniband/ulp/ipoib). Are there any other good documents on the subject? Thanks, Wendy
user space FMR support
To make a long story short, FMR = fast memory registration ... I had been using a set of FMR patches that exported the existing kernel FMR support to user space (with the mlx4 driver) for about a year. A mistake in re-installing the development cluster this morning wiped out the private OFED package that contained the changes. I could probably repatch everything in a couple of days, but it is a very annoying process.

I can't be the only person who needs FMR in user space (?). Does anyone have a handy set of patches that they can make public (or better, push upstream against a newer set of kernels)? Thanks, Wendy
Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries
On Fri, Oct 14, 2011 at 2:22 PM, Doug Ledford dledf...@redhat.com wrote:

- Original Message - On Wed, Oct 12, 2011 at 9:32 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: The OFED package itself does include XRC support. The issue here (my guess) is that its build script needs to understand the running system's kernel version to decide what should be pulled (from the source). Linux 3.1 could be too new for the OFED build script to make a correct decision. Nevertheless, mix-matching OFED modules/libraries is a *bad* idea.

No. The same userspace build should work with all kernel versions.

Wendy is referring to something other than what you are thinking. The same libibverbs user space build should work on all kernels going back a long way, except that when you are talking about OFED, their libibverbs is hard coded to assume XRC support and to fail if it isn't present, so an OFED libibverbs won't really work without the OFED kernel module stack as well. The script Wendy referred to is the script that checks the running kernel's version in order to determine which backport patches need to be applied to the ofa_kernel source tree in order to build the OFED kernel modules for your running kernel. Without that ofa_kernel build, the OFED libibverbs will indeed fail to run on the running kernel. And that script hasn't been updated to support 3.x kernels, last I checked, so she's right: the script itself doesn't recognize the running kernel version, so the ofa_kernel modules don't get built, so OFED libibverbs won't work anyway.

So she's absolutely right: unless you want to start ripping hard-coded assumptions about the existence of XRC support out of things like OFED's libibverbs, then out of qperf and a number of their other packages, you have to pair the OFED kernel modules and user space packages; they cannot be separated.

Yes, that (above) is exactly what I was referring to. The conversations in this thread remind me of the tire-swing cartoon that has been passed around for years: http://bibiananunes.com/user-requirements-the-tire-swing-cartoon -- Wendy
Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries
On Wed, Oct 12, 2011 at 3:41 AM, Bart Van Assche bvanass...@acm.org wrote: On Tue, Oct 11, 2011 at 7:39 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Tue, Oct 11, 2011 at 09:02:41AM -0500, Christoph Lameter wrote: Has XRC support not been merged? How can I build the OFED libraries against Linux 3.1? I'd really like to get rid of the OFED kernel tree nightmare.

You have to use upstream libraries with upstream kernels. Be warned that the OFED libraries of the same SONAME are not ABI compatible with the upstream libraries.

Why is the OFED libibverbs library binary-incompatible with the non-OFED libibverbs library? Why hasn't XRC support been implemented in the OFED libibverbs library such that applications built against the upstream libibverbs headers also work with the latest OFED version of that library?

I'm relatively new to OFED but happened to bump into a similar build issue two weeks ago. The OFED package itself does include XRC support. The issue here (my guess) is that its build script needs to understand the running system's kernel version to decide what should be pulled (from the source). Linux 3.1 could be too new for the OFED build script to make a correct decision. Nevertheless, mix-matching OFED modules/libraries is a *bad* idea. It is difficult to love the OFED build :) but it seems to work OK (so far, for me). Plus, I don't have a better proposal myself anyway. -- Wendy
Re: how to debug (mlx4) CQ overrun
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: There are not really any tools, but this is usually straightforward to look at from your app.

Many thanks for the response. It helped (to confirm that our CQ handling logic was OK). The issue turned out to be build related. After doing a clean rebuild of the OFED IB modules with the modified header files, the problem went away. The (header file) change was a result of exporting kernel FMR (fast memory registration) to user space for an experimental project. Again, thank you for the write-up. It is much appreciated. -- Wendy
how to debug (mlx4) CQ overrun
I have a test program that does RDMA read/write as follows:

node A: the server listens for and handles connection requests, and sets up a piece of memory initialized to 0

node B: two processes - parent and child

child:
1. set up a new channel with the server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
2. RDMA sequential write (8192 bytes at a time) to the server memory
4. sync with the parent

parent:
1. set up the new channel with the server, including a CQ with 1024 entries (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
3. RDMA sequential read (8192 bytes at a time) from the same piece of memory on the server - check the buffer contents; if the memory content is still zero, re-read
4. sync with the child

The parent hangs (but the child finishes its writes) after the following pops up in /var/log/messages:

mlx4_core :06:00.0: CQ overrun on CQN 87

I have my own counters that restrict the reads (and writes) to 512 max. Both write and read are blocking (i.e. the CQ is polled after each read/write). I suspect I do not have the CQ poll logic correct. The question here is ... is there any diag tool available to check on the internal counters (and/or states) of the ibverbs library and/or kernel drivers (to help RDMA applications debug)? In my case, it hangs around block 14546 (i.e. after 14546*8192 bytes). Thanks, Wendy
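Not an answer to the diag-tool question, but since the suspicion above is the CQ poll logic, here is a minimal sketch of the kind of blocking reap loop such a test program typically needs when the CQ is created with a completion channel, as in the ibv_create_cq() call quoted above. It assumes the CQ was armed with ibv_req_notify_cq() before the first work request was posted; the function and variable names are illustrative and not taken from the actual test code.

/* Minimal sketch: block for one CQ event, re-arm, and drain the CQ.
 * Error handling is abbreviated. */
#include <stdio.h>
#include <infiniband/verbs.h>

static int wait_one_completion(struct ibv_comp_channel *channel,
                               struct ibv_cq *cq)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;
    int n;

    /* Block until the (already armed) CQ raises a completion event. */
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
        return -1;
    ibv_ack_cq_events(ev_cq, 1);   /* events must be acked, or ibv_destroy_cq() hangs */

    /* Re-arm before draining so the next completion raises a new event. */
    if (ibv_req_notify_cq(cq, 0))
        return -1;

    /* Drain everything currently in the CQ; leaving signaled completions
     * unreaped while continuing to post work is what overruns a CQ. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            return -1;
        if (n == 1 && wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "completion error: %s\n",
                    ibv_wc_status_str(wc.status));
            return -1;
        }
    } while (n > 0);

    return 0;
}

Posting more signaled work requests than completions actually reaped is the usual way the "CQ overrun" message in the log comes about, so a loop along these lines after each read/write is the behavior the test program seems to intend.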
Re: kernel space rdma client, reg_mr issue
On Thu, Jun 30, 2011 at 7:19 AM, Benoit Hudzia benoit.hud...@gmail.com wrote: Hi, We are working to create an RDMA client within the kernel which will connect to a working RDMA userspace server. We are using the userspace code for testing purposes; in the final version all the communication will be done within kernel space. [snip] Basically the question boils down to: how do we allocate and register buffers for RDMA communication from inside a kernel module?

I happen to have a similar setup ... The original intent was to have a kernel-mode RDMA application that took kernel data and sent it over to the peer node's memory for temporary storage. It had to be possible to read the data back later. As it didn't matter whether the temporary storage was in kernel or user address space, I re-used my colleague's existing user-mode program (run as a user-space daemon on the peer node). This allowed the focus to stay on the new kernel application development (run on the primary node). After the code was up and running, I saw no reason to change the setup and it has been running fine since.

The code runs on RHEL 5.5 with OFED-1.5.2 using a Mellanox card. The user-mode daemon is in a forever-receiving loop that follows the standard RDMA user-mode programming logic. The kernel code invokes the ib_xxx set of APIs (vs. the user-mode ibv_xxx ones). Kernel memory registration is done with APIs such as kzalloc(), ib_get_dma_mr(), ib_dma_map_single(), ib_dma_map_page(), etc. Check out the driver code in the drivers/infiniband/ulp/iser directory; it has sample logic to register kernel memory. -- Wendy
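As a rough illustration of the registration steps named above - not code from any shipped driver, and all structure and function names below are made up for the example - a kernel-side buffer setup might look roughly like this, assuming a protection domain has already been allocated with ib_alloc_pd():

/* Sketch: allocate a kernel buffer, get a DMA MR on the PD, and map the
 * buffer so it can be referenced from an ib_sge / RDMA work request.
 * Error handling and teardown are abbreviated. */
#include <linux/slab.h>
#include <linux/err.h>
#include <linux/dma-mapping.h>
#include <rdma/ib_verbs.h>

struct my_rdma_buf {
    void   *va;        /* kernel virtual address          */
    u64     dma;       /* DMA address seen by the HCA     */
    size_t  len;
    struct ib_mr *mr;  /* DMA MR covering kernel memory   */
};

static int my_rdma_buf_setup(struct ib_device *ibdev, struct ib_pd *pd,
                             struct my_rdma_buf *buf, size_t len)
{
    buf->len = len;
    buf->va = kzalloc(len, GFP_KERNEL);
    if (!buf->va)
        return -ENOMEM;

    /* One DMA MR per PD is enough for kernel memory. */
    buf->mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                                IB_ACCESS_REMOTE_READ |
                                IB_ACCESS_REMOTE_WRITE);
    if (IS_ERR(buf->mr)) {
        kfree(buf->va);
        return PTR_ERR(buf->mr);
    }

    /* Map the buffer for DMA; the returned address goes into ib_sge.addr. */
    buf->dma = ib_dma_map_single(ibdev, buf->va, len, DMA_BIDIRECTIONAL);
    if (ib_dma_mapping_error(ibdev, buf->dma)) {
        ib_dereg_mr(buf->mr);
        kfree(buf->va);
        return -EIO;
    }
    return 0;
}

The resulting buf->dma and buf->mr->lkey are what go into the ib_sge used by a work request, and buf->mr->rkey is what the remote side needs if it is to RDMA-read or -write the buffer directly, much as the iser code referenced above does.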
How to build a kernel module with OFED kernel
A newbie question (please cc me as I'm not on the linux-rdma list yet).

The system was on RHEL 5.4. I used a source tar-ball to build RHEL 5.5 (2.6.18-194.el5). The kernel build was placed at /usr/src/linux. Rebooted to pick up the new kernel ... worked fine. Then:

1. On the new 2.6.18-194.el5 system, install OFED-1.5.2 ... succeeded.
2. Reboot to pick up the new ofa kernel ... succeeded (checked with modinfo).
3. Run a user-mode RDMA application ... succeeded.
4. The new ofa kernel modules are placed (by the OFED scripts) in the /lib/modules/2.6.18-194.el5/updates/ directory.
5. Build a kernel module (xx.ko) using the attached Makefile.

Now ... trying to load (or run after force-loading) xx.ko (on top of the ofa kernel kmods) fails, as the build apparently picks up the IB kmods from the /lib/modules/2.6.18-194.el5/drivers/infiniband directory instead of the /lib/modules/2.6.18-194.el5/updates directory, together with the wrong header files from /lib/modules/2.6.18-194.el5/source/include, where source points to the /usr/src/linux directory.

Can anyone help me with the correct procedure? I do understand the primary usage of OFED RDMA is for user-mode applications, but I need to have a kernel-mode driver on top of OFED RDMA for some experimental work. Thanks, Wendy

== Make File ==
EXTRA_CFLAGS := -I/usr/src/linux/drivers/xx/include
EXTRA_CFLAGS += -DXX_KMOD_DEF

obj-m := xx_kmod.o
xx_kmod-y := main/xx_main.o main/xx_init.o \
             libxxverbs/xx_device.o libxxverbs/xx_cm.o \
             libxxverbs/xx_ar.o libxxverbs/xx_mr.o \
             libxxverbs/xx_cq.o libxxverbs/xx_sq.o
xx_kmod-y += util/xx_perf.o
xx_kmod-y += brd/xx_brd.o

kmod:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

install:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules_install

run:
	/sbin/depmod -a; echo 5 > /proc/sys/kernel/panic; modprobe --force-modversion xx_kmod
=== End attachment ===