Re: [PATCH 0/8] IPoIB: Fix multiple race conditions

2014-09-03 Thread Wendy Cheng
On Fri, Aug 15, 2014 at 3:08 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
[snip]

 Doug Ledford (8):
   IPoIB: Consolidate rtnl_lock tasks in workqueue
   IPoIB: Make the carrier_on_task race aware
   IPoIB: fix MCAST_FLAG_BUSY usage
   IPoIB: fix mcast_dev_flush/mcast_restart_task race
   IPoIB: change init sequence ordering
   IPoIB: Use dedicated workqueues per interface
   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
   IPoIB: No longer use flush as a parameter


 IPOIB is recently added as a technology preview for Intel Xeon Phi
 (currently a PCIe card) that runs embedded Linux (named MPSS) with
 Infiniband software stacks supported via emulation drivers. One early
 feedback from users with large cluster nodes is IPOIB's power
 consumption. The root cause of the reported issue is more to do with
 how MPSS handles its DMA buffers (vs. how Linux IB stacks work) - so
 submitting the fix to upstream is not planned at this moment (unless
 folks are interested in the changes).

 However, since this patch set happens to be in the heart of the
 reported power issue, we would like to take a closer look to avoid
 MPSS code base deviating too much from future upstream kernel(s).
 Question, comment, and/or ack will follow sometime next week.


I've reviewed the patch set - the first half of the patches looks good.
Patches #5, #6, #7 and #8 are fine if we go with one WQ per device - I
will let others make the final call.

On our system (OFED 1.5.4 based), similar deadlocks were also observed
while the power management issues were being worked on. Constrained by
other issues specific to our platform, I took advantage of the single
IPoIB workqueue and queued the if-up and/or if-down requests onto that
workqueue if one was already in progress, which serialized the logic by
default. However, I would not mind the one-WQ-per-device approach and
will redo the changes when this patch set is picked up by the mainline
kernel.

-- Wendy


Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race

2014-08-30 Thread Wendy Cheng
On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 Our mcast_dev_flush routine and our mcast_restart_task can race against
 each other.  In particular, they both hold the priv->lock while
 manipulating the rbtree and while removing mcast entries from the
 multicast_list and while adding entries to the remove_list, but they
 also both drop their locks prior to doing the actual removes.  The
 mcast_dev_flush routine is run entirely under the rtnl lock and so has
 at least some locking.  The actual race condition is like this:

 Thread 1                                  Thread 2
 ifconfig ib0 up
   start multicast join for broadcast
   multicast join completes for broadcast
   start to add more multicast joins
     call mcast_restart_task to add new entries
                                            ifconfig ib0 down
                                              mcast_dev_flush
                                                mcast_leave(mcast A)
     mcast_leave(mcast A)

 As mcast_leave calls ib_sa_multicast_leave, and as member in
 core/multicast.c is ref counted, we run into an unbalanced refcount
 issue.  To avoid stomping on each others removes, take the rtnl lock
 specifically when we are deleting the entries from the remove list.

 Isn't test_and_clear_bit() atomic so it is unlikely that
 ib_sa_free_multicast() can run multiple times  ?

Oops .. what if the structure itself gets freed ? My bad !

However, isn't the remove_list a local list on the caller's stack ?
And since moving the original list entries (onto remove_list) is
protected by the spin lock (priv->lock), isn't it unlikely that
ib_sa_free_multicast() can operate on the same entry ?

The patch itself is harmless though .. but adding the rtnl_lock is
really not ideal.
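
For reference, this is roughly the pattern I have in mind - a minimal
sketch based on my reading of the upstream ipoib_mcast_dev_flush(),
not a copy of the actual code:

    /*
     * Each caller builds its own remove_list on its stack while
     * holding priv->lock, then walks that list only after dropping
     * the lock.
     */
    static void example_flush(struct ipoib_dev_priv *priv)
    {
            struct ipoib_mcast *mcast, *tmcast;
            LIST_HEAD(remove_list);         /* local to this caller */
            unsigned long flags;

            spin_lock_irqsave(&priv->lock, flags);
            /* detach entries from the shared list under the lock */
            list_for_each_entry_safe(mcast, tmcast,
                                     &priv->multicast_list, list)
                    list_move_tail(&mcast->list, &remove_list);
            spin_unlock_irqrestore(&priv->lock, flags);

            /* the actual leave/free happens outside the lock */
            list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
                    ipoib_mcast_leave(priv->dev, mcast);
                    ipoib_mcast_free(mcast);
            }
    }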

-- Wendy


Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race

2014-08-29 Thread Wendy Cheng
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 Our mcast_dev_flush routine and our mcast_restart_task can race against
 each other.  In particular, they both hold the priv->lock while
 manipulating the rbtree and while removing mcast entries from the
 multicast_list and while adding entries to the remove_list, but they
 also both drop their locks prior to doing the actual removes.  The
 mcast_dev_flush routine is run entirely under the rtnl lock and so has
 at least some locking.  The actual race condition is like this:

 Thread 1                                  Thread 2
 ifconfig ib0 up
   start multicast join for broadcast
   multicast join completes for broadcast
   start to add more multicast joins
     call mcast_restart_task to add new entries
                                            ifconfig ib0 down
                                              mcast_dev_flush
                                                mcast_leave(mcast A)
     mcast_leave(mcast A)

 As mcast_leave calls ib_sa_multicast_leave, and as member in
 core/multicast.c is ref counted, we run into an unbalanced refcount
 issue.  To avoid stomping on each others removes, take the rtnl lock
 specifically when we are deleting the entries from the remove list.

Isn't test_and_clear_bit() atomic, so it is unlikely that
ib_sa_free_multicast() can run multiple times ?

static int ipoib_mcast_leave(struct net_device *dev,
                             struct ipoib_mcast *mcast)
{
        struct ipoib_dev_priv *priv = netdev_priv(dev);
        int ret = 0;

        if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
                ib_sa_free_multicast(mcast->mc);

        if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {


-- Wendy


 Signed-off-by: Doug Ledford dledf...@redhat.com
 ---
  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++++++++++++-----
  1 file changed, 32 insertions(+), 5 deletions(-)

 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 index f5e8da530d9..19e3fe75ebf 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)

 	spin_unlock_irqrestore(&priv->lock, flags);

 -   /* seperate between the wait to the leave*/
 +   /*
 +* make sure the in-flight joins have finished before we attempt
 +* to leave
 +*/
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 netif_addr_unlock(dev);
 local_irq_restore(flags);

 -   /* We have to cancel outside of the spinlock */
 +   /*
 +* make sure the in-flight joins have finished before we attempt
 +* to leave
 +*/
 +	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 +		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 +			wait_for_completion(&mcast->done);
 +
 +   /*
 +* We have to cancel outside of the spinlock, but we have to
 +* take the rtnl lock or else we race with the removal of
 +* entries from the remove list in mcast_dev_flush as part
 +* of ipoib_stop() which will call mcast_stop_thread with
 +* flush == 1 while holding the rtnl lock, and the
 +* flush_workqueue won't complete until this restart_mcast_task
 +* completes.  So do like the carrier on task and attempt to
 +* take the rtnl lock, but if we can't before the ADMIN_UP flag
 +* goes away, then just return and know that the remove list will
 +* get flushed later by mcast_dev_flush.
 +*/
 +	while (!rtnl_trylock()) {
 +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 +			return;
 +		else
 +			msleep(20);
 +	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 ipoib_mcast_free(mcast);
 }
 -
 -	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 -   ipoib_mcast_start_thread(dev);
 +   ipoib_mcast_start_thread(dev);
 +   rtnl_unlock();
  }

  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 --
 1.9.3



Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-26 Thread Wendy Cheng
On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford dledf...@redhat.com wrote:

 On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 Is it really possible for ib_sa_join_multicast() to
 return *after* its callback (ipoib_mcast_sendonly_join_complete and
 ipoib_mcast_join_complete) ?

 Yes.  They are both on work queues and ib_sa_join_multicast simply fires off 
 another workqueue task.  The scheduler is free to start that task instantly 
 if the workqueue isn't busy, and it often does (although not necessarily on 
 the same CPU).  Then it is a race to see who finishes first.


Ok, thanks for the explanation. I also googled and found the original
patch where the IPOIB_MCAST_JOIN_STARTED was added. This patch now
makes sense.
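
In other words (my own illustration of the window, not code from the
patch - arguments elided):

    /*
     * The join-complete callback runs from a workqueue and can fire
     * before this assignment completes, so it may observe a stale
     * mcast->mc.
     */
    mcast->mc = ib_sa_join_multicast(/* ... args elided ... */
                                     ipoib_mcast_join_complete, mcast);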

Acked-by: Wendy Cheng wendy.ch...@intel.com

On the other hand, I'm still puzzled why ib_sa_join_multicast() can't
be a blocking call (i.e. wait until the callback has executed) - why
should IPoIB pay the price of working around these nasty issues ? But
I guess that is too far off-topic ..

BTW, thanks for the work. Our users will be doing if-up/if-down a lot
for power management, so patches like these help !

-- Wendy


Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-25 Thread Wendy Cheng
On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote:

 So that's why in this patch we

 1) take a mutex to force ib_sa_join_multicast to return and us to set 
  mcast->mc to the proper return value before we process the join completion 
 callback
  2) always clear mcast->mc if there is any error since we can't call 
 ib_sa_multicast_leave
 3) always complete the mcast in case we are waiting on it
 4) only if our status is ENETRESET set our return to 0 so the ib core code 
 knows we acknowledged the event


We don't have IPOIB_MCAST_JOIN_STARTED (and the done completion
struct) in our code base (MPSS) yet ... I'm *not* NAK-ing this patch,
but I find it hard to understand the ramifications. That has nothing
to do with this patch - the patch itself actually looks pretty OK (by
eyeballing it).

The original IPoIB mcast flow, particularly its abnormal error path,
confuses me. Is it really possible for ib_sa_join_multicast() to
return *after* its callback (ipoib_mcast_sendonly_join_complete and
ipoib_mcast_join_complete) ? The mcast->done completion struct looks
dangerous as well.

I'll let other, more capable people make the final call(s).

-- Wendy


Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware

2014-08-18 Thread Wendy Cheng
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 We blindly assume that we can just take the rtnl lock and that will
 prevent races with downing this interface.  Unfortunately, that's not
 the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
 in an attempt to clear out all remaining instances of ipoib_join_task.
 But, since this task is put on the same workqueue as the join task, the
 flush_workqueue waits on this thread too.  But this thread is deadlocked
 on the rtnl lock.  The better thing here is to use trylock and loop on
 that until we either get the lock or we see that FLAG_ADMIN_UP has
 been cleared, in which case we don't need to do anything anyway and we
 just return.

 Signed-off-by: Doug Ledford dledf...@redhat.com
 ---
  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
  1 file changed, 15 insertions(+), 6 deletions(-)

 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 index a0a42859f12..7e9cd39b5ef 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
carrier_on_task);
 struct ib_port_attr attr;

 -   /*
 -* Take rtnl_lock to avoid racing with ipoib_stop() and
 -* turning the carrier back on while a device is being
 -* removed.
 -*/
 	if (ib_query_port(priv->ca, priv->port, &attr) ||
 	    attr.state != IB_PORT_ACTIVE) {
 		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
 		return;
 	}

 -   rtnl_lock();
 +   /*
 +* Take rtnl_lock to avoid racing with ipoib_stop() and
 +* turning the carrier back on while a device is being
 +* removed.  However, ipoib_stop() will attempt to flush
 +* the workqueue while holding the rtnl lock, so loop
 +* on trylock until either we get the lock or we see
 +* FLAG_ADMIN_UP go away as that signals that we are bailing
 +* and can safely ignore the carrier on work
 +*/
 +	while (!rtnl_trylock()) {
 +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 +			return;
 +		else
 +			msleep(20);
 +	}


I have always thought the rtnl lock is too big for this purpose... and
that 20 ms is not ideal either. Could we have a new IPoIB-private mutex
used by ipoib_stop() and this section of code ? Something like:

ipoib_stop()
{
        ...
        mutex_lock(&something_new);
        clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
        ...
        mutex_unlock(&something_new);
        return 0;
}

Then the loop would become:

/* this while-loop will be very short - we either get the mutex
 * quickly or return quickly */
while (!mutex_trylock(&something_new)) {
        if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
                return;
}


 	if (!ipoib_cm_admin_enabled(priv->dev))
 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);


Re: [PATCH 0/8] IPoIB: Fix multiple race conditions

2014-08-15 Thread Wendy Cheng
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 Locking of multicast joins/leaves in the IPoIB layer have been problematic
 for a while.  There have been recent changes to try and make things better,
 including these changes:

 bea1e22 IPoIB: Fix use-after-free of multicast object
 a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects

 Unfortunately, the following test still fails (miserably) on a plain
 upstream kernel:

 pass=0
 ifdown ib0
 while true; do
 ifconfig ib0 up
 ifconfig ib0 down
 echo Pass $pass
 let pass++
 done

 This usually fails within 10 to 20 passes, although I did have a lucky
 run make it to 300 or so.  If you happen to have a P_Key child interface,
 it fails even quicker.

[snip]

 Doug Ledford (8):
   IPoIB: Consolidate rtnl_lock tasks in workqueue
   IPoIB: Make the carrier_on_task race aware
   IPoIB: fix MCAST_FLAG_BUSY usage
   IPoIB: fix mcast_dev_flush/mcast_restart_task race
   IPoIB: change init sequence ordering
   IPoIB: Use dedicated workqueues per interface
   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
   IPoIB: No longer use flush as a parameter


IPoIB was recently added as a technology preview for Intel Xeon Phi
(currently a PCIe card), which runs embedded Linux (named MPSS) with
the InfiniBand software stacks supported via emulation drivers. One
piece of early feedback from users with large clusters concerns
IPoIB's power consumption. The root cause of the reported issue has
more to do with how MPSS handles its DMA buffers (vs. how the Linux IB
stacks work) - so submitting the fix upstream is not planned at this
moment (unless folks are interested in the changes).

However, since this patch set happens to sit at the heart of the
reported power issue, we would like to take a closer look, to avoid
the MPSS code base deviating too much from future upstream kernel(s).
Questions, comments, and/or acks will follow sometime next week.

-- Wendy


Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue

2014-08-15 Thread Wendy Cheng
On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 Setting the mtu can safely be moved to the carrier_on_task, which keeps
 us from needing to take the rtnl lock in the join_finish section.


Looks good !

Acked-by: Wendy Cheng wendy.ch...@intel.com

 Signed-off-by: Doug Ledford dledf...@redhat.com
 ---
  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
  1 file changed, 2 insertions(+), 6 deletions(-)

 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 index d4e005720d0..a0a42859f12 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
 -
 -		if (!ipoib_cm_admin_enabled(dev)) {
 -			rtnl_lock();
 -			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
 -			rtnl_unlock();
 -		}
 	}

 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
 @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 }

 	rtnl_lock();
 +	if (!ipoib_cm_admin_enabled(priv->dev))
 +		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
  }
 --
 1.9.3



Re: nfs-rdma performance

2014-06-12 Thread Wendy Cheng
On Thu, Jun 12, 2014 at 12:54 PM, Mark Lehrer leh...@gmail.com wrote:

 Awesome work on nfs-rdma in the later kernels!  I had been having
 panic problems for awhile and now things appear to be quite reliable.

 Now that things are more reliable, I would like to help work on speed
 issues.  On this same hardware with SMB Direct and the standard
  storage review 8k 70/30 test, I get combined read & write performance
 of around 2.5GB/sec.  With nfs-rdma it is pushing about 850MB/sec.
 This is simply an unacceptable difference.

 I'm using the standard settings -- connected mode, 65520 byte MTU,
 nfs-server-side async, lots of nfsd's, and nfsver=3 with large
 buffers.  Does anyone have any tuning suggestions and/or places to
 start looking for bottlenecks?


There is a tunable called xprt_rdma_slot_table_entries .. increasing
it seemed to help a lot for me last year. Be aware that this tunable
is enclosed inside #ifdef RPC_DEBUG, so you might need to tweak the
source and rebuild the kmod.
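
For reference, this is roughly how the tunable is exposed in
net/sunrpc/xprtrdma/transport.c (paraphrased from memory of the OFED
1.5.x sources, so names/details may differ slightly); with RPC_DEBUG
enabled it shows up as /proc/sys/sunrpc/rdma_slot_table_entries, and a
new value only takes effect for mounts created after it is written:

    #ifdef RPC_DEBUG
    /* sysctl bounds for the slot table size */
    static unsigned int min_slot_table_size = RPCRDMA_MIN_SLOT_TABLE;
    static unsigned int max_slot_table_size = RPCRDMA_MAX_SLOT_TABLE;

    static ctl_table xr_tunables_table[] = {
            {
                    .procname     = "rdma_slot_table_entries",
                    .data         = &xprt_rdma_slot_table_entries,
                    .maxlen       = sizeof(unsigned int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax,
                    .extra1       = &min_slot_table_size,
                    .extra2       = &max_slot_table_size
            },
            { },
    };
    #endif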


-- Wendy


Re: help with IB_WC_MW_BIND_ERR

2014-05-20 Thread Wendy Cheng
On Tue, May 20, 2014 at 11:55 AM, Chuck Lever chuck.le...@oracle.com wrote:
 Hi-

 What does it mean when a LOCAL_INV work request fails with a
 IB_WC_MW_BIND_ERR completion?


Mapping an IB error code back to a cause has been a great pain (at
least for me) unless you have access to the HCA firmware. In this
case, I think it implies a memory protection error (registration
issues). For example, in the cxgb4 driver it is associated with
invalidating a shared MR, or invalidating an MR with a bound memory
window (on a QP):

	case T4_ERR_INVALIDATE_SHARED_MR:
	case T4_ERR_INVALIDATE_MR_WITH_MW_BOUND:
		wc->status = IB_WC_MW_BIND_ERR;
		break;

(drivers/infiniband/hw/cxgb4/cq.c, around line 654)

You'll probably need to mention the HCA name so the firmware people,
if they are reading this, can pinpoint the exact cause.

-- Wendy


Re: Proposal for simplifying NFS/RDMA client memory registration

2014-03-03 Thread Wendy Cheng
On Mon, Mar 3, 2014 at 11:54 AM, faibish, sorin faibish_so...@emc.com wrote:

  On Mar 3, 2014, at 7:09 PM, Christoph Hellwig h...@infradead.org wrote:
 
  On Mon, Mar 03, 2014 at 12:02:33PM -0500, Chuck Lever wrote:
  All HCAs in 3.13 (and rxe) can support either MTHCA_FMR or FRMR or both. 
   Wendy's HCA supports only ALLPHYSICAL.
 
  Is Wendy planning to submit her HCA driver ASAP?  If not there's not
  reason to keep ALLPHYSICAL either.
 I second Christoph. Legacy is good as long as there are users of Linux with 
 the legacy server. I would say that the only reason to keep it is if Linux 
 server will support it. Same we apply to Lustre client in kernel.

 ./Sorin

 
  Does it make sense to deprecate then remove the registration modes in the 
  first list?
 
  Yes.
 


After discussing this with my manager, we'll let it go for now ...
will re-submit the full patch set in the future when we finalize the
plan.

Thanks,
Wendy


Re: Proposal for simplifying NFS/RDMA client memory registration

2014-02-28 Thread Wendy Cheng
On Fri, Feb 28, 2014 at 2:20 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Fri, Feb 28, 2014 at 1:41 PM, Tom Talpey t...@talpey.com wrote:

 On 2/26/2014 8:44 AM, Chuck Lever wrote:

 Hi-

 Shirley Ma and I are reviving work on the NFS/RDMA client code base in
 the Linux kernel.  So far we've built and run functional tests to determine
 what is working and what is broken.

 [snip]



 ALLPHYSICAL - Usually fast, but not safe as it exposes client memory.
 All HCAs support this mode.


 Not safe is an understatement. It exposes all of client physical
 memory to the peer, for both read and write. A simple pointer error
 on the server will silently corrupt the client. This mode was
 intended only for testing, and in experimental deployments.

(sorry, resend .. previous reply bounced back due to gmail html format)

Please keep ALLPHYSICAL for now - our embedded system needs it.

Thanks,
Wendy


Re: AW: IPoIB GRO

2013-11-04 Thread Wendy Cheng
I looked at the TSO code earlier this year. IIRC, if TSO is on, the
upper layer (e.g. IP) just sends the super-packet down (to IPoIB)
without segmenting it (on the send side); if it is off, the upper
layer does the segmentation (to match the MTU size) before calling the
device's send routine. For GRO, I would imagine the receive side needs
some sort of segmentation information to know how to pull the segments
back together. It looks to me like segmentation offload (TSO) and
receive offload (GRO) are mutually exclusive ? Check out
dev_gro_receive() (line numbers based on the 2.6.32 RHEL kernel):

   2980
   2981         if (skb_is_gso(skb) || skb_has_frags(skb))
   2982                 goto normal;


See how it bails out when TSO (skb_is_gso()) is on ? So it looks like
an IPoIB bug that ipoib_ib_handle_rx_wc() does an unconditional
napi_gro_receive() regardless of the adapter capability (and TSO
setting).

Just a guess !
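
If that guess is right, the kind of check I have in mind is roughly
this (hypothetical and untested, against ipoib_ib_handle_rx_wc()):

    /* hand the skb to GRO only when the device advertises it,
     * otherwise fall back to the plain receive path */
    if (dev->features & NETIF_F_GRO)
            napi_gro_receive(&priv->napi, skb);
    else
            netif_receive_skb(skb);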

-- Wendy


Re: ACK behaviour difference LRO/GRO

2013-10-29 Thread Wendy Cheng
On Mon, Oct 28, 2013 at 12:34 PM, Markus Stockhausen
stockhau...@collogia.de wrote:
 Hello,

 about two month we had some problems with IPoIB transfer speeds .
 See more http://marc.info/?l=linux-rdmam=137823326109158w=2
 After some quite hard test iterations the problem seems to come from the
 IPoIB switch from LRO to GRO between kernels 2.6.37 and 2.6.38.

 I built a test setup with a 2.6.38 kernel and additionaly compiled a 2.6.37
 ib_ipoib module against it. This way I can run a direct comparison
 between the old and new module. The major difference between the
 two version is inside the ipoib_ib_handle_rx_wc() function:

  2.6.37: lro_receive_skb(&priv->lro.lro_mgr, skb, NULL);
  2.6.38: napi_gro_receive(&priv->napi, skb);

 As in the last post we use ConnectX cards in datagram mode with a
 2044 MTU.  We read a file sequentially from a NFS server into /dev/null.
 We just want to get the wire speed neglecting hard drives. The
 hardware is slightly newer so we get different transfer speeds but
 the overall effect should be evident. The server uses a 3.5 kernel and
 is not changed during the tests.

 With 2.6.37 IPoIB module on the client side and LRO enabled the
 speed is 950 MByte/sec. On the NFS server side a tcpdump trace
 reads like:

 19:51:51.432630 IP 10.10.30.251.nfs > 10.10.30.1.781:
   Flags [P.], seq 1008434065:1008497161, ack 617432,
   win 688, options [nop,nop,TS val 133047292 ecr 429568],
   length 63096
 19:51:51.432672 IP 10.10.30.1.781 > 10.10.30.251.nfs:
   Flags [.], ack 1008241041, win 24576, options
   [nop,nop,TS val 429568 ecr 133047292], length 0
 19:51:51.432677 IP 10.10.30.251.nfs > 10.10.30.1.781:
   Flags [.], seq 1008497161:1008560905, ack 617432,
   win 688, options [nop,nop,TS val 133047292 ecr 429568],
   length 63744
 19:51:51.432725 IP 10.10.30.1.781 > 10.10.30.251.nfs:
   Flags [.], ack 1008304585, win 24576, options
   [nop,nop,TS val 429568 ecr 133047292], length 0
 19:51:51.432729 IP 10.10.30.251.nfs > 10.10.30.1.781:
   Flags [.], seq 1008560905:1008624649, ack 617432,
   win 688, options [nop,nop,TS val 133047292 ecr 429568],
 length 63744

 With some slight differences here and there the client sends only
 1 ack for about 60k of transferred data. With 2.6.38 module and
 onwards (GRO enabled) the speed drops down to 380 MByte/sec
 and a different transfer pattern.

 19:58:14.631430 IP 10.10.30.251.nfs > 10.10.30.1.ircs:
   Flags [.], seq 722492293:722502253, ack 442312, win 537,
   options [nop,nop,TS val 133143092 ecr 467889], length 9960
 19:58:14.631460 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
   Flags [.], ack 722478181, win 24562, options
   [nop,nop,TS val 467889 ecr 133143092], length 0
 19:58:14.631485 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
   Flags [.], ack 722478181, win 24562, options
   [nop,nop,TS val 467889 ecr 133143092,nop,nop,sack 1
   {722480117:722482333}], length 0
 19:58:14.631510 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
   Flags [.], ack 722488197, win 24562, options [nop,nop,TS
   val 467889 ecr 133143092], length 0
 19:58:14.631534 IP 10.10.30.1.ircs > 10.10.30.251.nfs:
   Flags [.], ack 722494229, win 24562, options
   [nop,nop,TS val 467889 ecr 133143092], length 0

 It seems as if the NFS client acknowledges every 2K packet
 separately. I thought that it may come from missing
 coalescing parameters and tried a  ethtool -C ib0 rx-usecs 5
 on both machines but without success.

 I'm quite lost now maybe someone can give a tip if I'm
 missing something.


Nice work! It looks like napi_gro_receive() does not do the work it is
supposed to do ?! My (embedded NFS client) system was on a 2.6.38
kernel, but we use the ipoib kmod from OFED 1.5.4.1 - so we're still
on the lro_receive_skb() path, which does not have this issue.

I'll try it out later this week to see what is going on. Mellanox
folks or Roland may have more to say.

-- Wendy


Re: Strange NFS client ACK behaviour

2013-09-10 Thread Wendy Cheng
On Mon, Sep 9, 2013 at 11:51 PM, Markus Stockhausen
stockhau...@collogia.de wrote:
  From: Wendy Cheng [s.wendy.ch...@gmail.com]
  Sent: Monday, September 9, 2013 22:03
  To: Markus Stockhausen
  Cc: linux-rdma@vger.kernel.org
  Subject: Re: Strange NFS client ACK behaviour

 On Sun, Sep 8, 2013 at 11:24 AM, Markus Stockhausen
 stockhau...@collogia.de wrote:

   we observed a performance drop in our IPoIB NFS backup
   infrastructure since we switched to machines with newer
   kernels.
 

 Not sure how your backup infrastructure works but the symptoms seem to
 match with this discussion:
 http://www.spinics.net/lists/linux-nfs/msg38980.html

 If you know how to recompile nfs kmod, Trond's patch does worth a try.
 Or open an Ubuntu support ticket, let them build you a test kmod.

 -- Wendy

 Thanks for pointing into that direction. From my understanding this
 patch goes into the NFS client side. I built a patched module for my
 Fedora 19 client (3.10 kernel). Nevertheless the behaviour ist still
 the same.  If I get the patch right it is about forked childs that
 access a page of a mmapped file round robin and the kernel issues
 tons of write requests to the file.

 My case is only about ACK transmissions for a single writer.

 Markus


So you have to go back to the drawing board :(. Have you tried to profile it ?
http://oprofile.sourceforge.net/about/

-- Wendy


Re: Strange NFS client ACK behaviour

2013-09-09 Thread Wendy Cheng
On Sun, Sep 8, 2013 at 11:24 AM, Markus Stockhausen
stockhau...@collogia.de wrote:

  we observed a performance drop in our IPoIB NFS backup
  infrastructure since we switched to machines with newer
  kernels.


Not sure how your backup infrastructure works, but the symptoms seem
to match this discussion:
http://www.spinics.net/lists/linux-nfs/msg38980.html

If you know how to recompile the nfs kmod, Trond's patch is worth a
try. Or open an Ubuntu support ticket and let them build you a test
kmod.

-- Wendy


Re: Strange NFS client ACK behaviour

2013-09-05 Thread Wendy Cheng
CC linux-nfs .. maybe this is obvious to someone there ... Two
comments inlined below.

On Tue, Sep 3, 2013 at 11:28 AM, Markus Stockhausen
stockhau...@collogia.de wrote:
 Hello,

 we observed a performance drop in our IPoIB NFS backup
 infrastructure since we switched to machines with newer
 kernels. As I do not know where to start I hope someone
 on this list can give me hint where to dig for more details.

In case there is no other reply, I would start w/ a socket program (or
a network performance measuring tool) on that interface doing similar
logic to the dd you describe below; that is, send a 256K message in a
fixed number of loops (so the total transfer size is somewhere close
to your file size) between client and server, then compare the
interrupt counters (cat /proc/interrupts) on both kernels. If the
interrupt count differs as you described, the problem is most likely
in the IB driver, not the NFS layer.

 To make a long story short. We use ConnectX cards with the
 standard kernel drivers on version 2.6.32 (Ubuntu 10.04), 3.5
 (Ubuntu 12.04) and 3.10 (Fedora 19). The very simple and not
 scientific test consists of mounting a NFS share using IPoIB UD
 network interfaces at MTU of 2044. Afterwards read a large file
 on the client side with dd if=file of=/dev/null bs=256K.
 During the transfer we run a tcpdump on the ibX interface on
 the NFS server side. No special settings for kernel parameters
 until now.

I don't know much about ConnectX. Not sure what "IPoIB UD" means here -
datagram mode vs. connected mode (CM), or TCP vs. UDP ?


 When doing the test with a 2.6.32 kernel based client we see the
 following packet sequence. More or less a lot of transferd blocks
 from the NFS server to the client with sometimes an ACK package
 from the client to the server:

 16:16:45.050930 IP server.nfs > cli_2_6_32.896:
   Flags [.], seq 8909853:8913837, ack 1154149,
   win 604, options [nop,nop,TS val 1640401415
   ecr 3881919089], length 3984
 16:16:45.050936 IP server.nfs > cli_2_6_32.896:
   Flags [.], seq 8913837:8917821, ack 1154149,
   win 604, options [nop,nop,TS val 1640401415
   ecr 3881919089], length 3984

 ... 8 more ...

 16:16:45.050976 IP cli_2_6_32.896 > server.nfs:
   Flags [.], ack 8909853, win 24574, options
   [nop,nop,TS val 3881919089 ecr 1640401415],
   length 0
 ...

 After switchng to a client with a newer kernel (3.5 or 3.10) the
 sequence all of a sudden gives just the opposite behaviour.
 One should note that this is the same server as in the test
 above. The server sends bigger packets (I guess TSO is doing
 the rest of the work). After each packet the client sends
 several ACK packages back.

 16:15:21.038782 IP server.nfs > cli_3_5_0.928:
   Flags [.], seq 9612429:9652269, ack 372776,
   win 5815, options [nop,nop,TS val 1640380412
   ecr 560111379], length 39840
 16:15:21.038806 IP cli_3_5_0.928 > server.nfs:
   Flags [.], ack 9542205, win 16384, options
   [nop,nop,TS val 560111379 ecr 1640380412],
   length 0
 16:15:21.038812 IP cli_3_5_0.928 > server.nfs:
   Flags [.], ack 9546077, win 16384, options
   [nop,nop,TS val 560111379 ecr 1640380412],
 length 0

 ... 6-8 more ...

 The visible side effects of this changed processing include:
 - NIC interrupts on the NFS servers raise by a factor of 8.
 - Transfer speed lowers by 50% (400-200 MB/sec)

 Best regards.

 Markus


Re: Helps to Decode rpc_debug Output

2013-08-26 Thread Wendy Cheng
On Mon, Aug 26, 2013 at 6:22 AM, Tom Talpey t...@talpey.com wrote:
 On 8/21/2013 11:55 AM, Wendy Cheng wrote:

 On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:

 On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:

 On 8/14/2013 8:14 PM, Wendy Cheng wrote:


 Longer version of the question:
 I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
 server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
 Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so
 far but I do manage to get nfs mount working. Simple file operations
 (such as ls, file read/write, scp, etc) seem to work as well.


 One thing I'm still scratching my head is that ... by looking at the
 raw IOPS, I don't see dramatic difference between NFS-RDMA vs. NFS
 over IPOIB (TCP).


 Sounds like your bottleneck lies in some other component. What's the
 storage, for example? RDMA won't do a thing to improve a slow disk.
 Or, what kind of IOPS rate are you seeing? If these systems aren't
 generating enough load to push a CPU limit, then shifting the protocol
 on the same link might not yield much.

There is no kernel profiling tool for this uOS (yet), so it is hard to
identify the bottleneck. On the surface, the slowdown seems to come
from SUNRPC's Van Jacobson congestion control
(xprt_reserve_xprt_cong()): either it creates a race condition that
lets the transmissions (write/commit) miss their wake-up(s), or the
algorithm itself is not the right choice for a client system that
consists of many (244 on my system) slower cores (CPUs).

Solid state drives are used on the RHEL server.

-- Wendy


Re: Helps to Decode rpc_debug Output

2013-08-21 Thread Wendy Cheng
On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:
 On 8/14/2013 8:14 PM, Wendy Cheng wrote:

 Longer version of the question:
 I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
 server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
 Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so
 far but I do manage to get nfs mount working. Simple file operations
 (such as ls, file read/write, scp, etc) seem to work as well.


Yay ... got this working .. amazingly, on a uOS that does not have
much of the conventional kernel debug facilities.

The hang was caused by the auto-disconnect, triggered by xprt->timer.
The task is carried out by xprt_init_autodisconnect(), which silently
disconnects the xprt w/out any sensible warning. The uOS runs on
small-core (slower) hardware. Instead of a hard-coded number, this
timeout value needs to be at least a proc tunable. I will check newer
kernels to see whether it has been improved, and/or draft a patch
later.
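
The kind of knob I have in mind would be roughly this (a hypothetical
sketch, not an actual patch; the parameter name and default are made
up - the hard-coded value currently ends up in xprt->idle_timeout when
the transport is set up):

    /* let slow clients raise the idle-disconnect timeout */
    static unsigned int xprt_rdma_idle_timeout_secs = 300;
    module_param(xprt_rdma_idle_timeout_secs, uint, 0644);
    MODULE_PARM_DESC(xprt_rdma_idle_timeout_secs,
                     "Seconds of inactivity before auto-disconnect");

    /* ... and where the transport is created: */
    xprt->idle_timeout = xprt_rdma_idle_timeout_secs * HZ;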

One thing I'm still scratching my head over is that ... looking at the
raw IOPS, I don't see a dramatic difference between NFS-RDMA and NFS
over IPoIB (TCP). However, the total run time differs greatly: NFS
over RDMA seems to take much longer to finish (vs. NFS over IPoIB).
Not sure why that is ... maybe the constant connect/disconnect
triggered by reestablish_timeout ? Connection re-establishment is
known to be expensive on this uOS. Why do we need two sets of
timeouts, where
1. xprt->timer disconnects (w/out reconnecting) ?
2. reestablish_timeout constantly disconnects/re-connects ?

-- Wendy


Re: Helps to Decode rpc_debug Output

2013-08-15 Thread Wendy Cheng
On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:
 On 8/14/2013 8:14 PM, Wendy Cheng wrote:

 Longer version of the question:
 I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
 server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
 Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so
 far but I do manage to get nfs mount working. Simple file operations
 (such as ls, file read/write, scp, etc) seem to work as well.

[snip]

 Why did you replace the Linux IB stack with OFED? Did you also take the
 NFS/RDMA from that package, and if so are you sure that it all is
 is working properly? Doesn't 2.6.38 already have all this?


The other part of the cluster runs OFED 1.5.4 on top of RHEL 6.3 - it
was a product decision. Ditto for the 2.6.38-based uOS.

The OFED 1.5.4 based NFS/RDMA (i.e. xprtrdma) did not run on either
platform as-is. It took a while to understand the setup. I believe the
issues with the RHEL boxes have been fixed - at least iozone runs thru
(as client and server) w/out trouble. Now the issue is with this
2.6.38 uOS (as client) talking to RHEL 6.3 (as server). I don't know
much about NFS v4, so the focus is on v3.

-- Wendy


Helps to Decode rpc_debug Output

2013-08-14 Thread Wendy Cheng
The IO on top of a NFS mounted directory was hanging so I forced a
(client side) rpc_debug output from the proc entry.

<6>[ 4311.590317]  1676 0001    -11 8801e0bde400 8801d18b1248        0 81420d40 nfsv3 WRITE a:call_status q:xprt_resend
... (similar lines) ...
<6>[ 4311.590435]  1682 0001    -11 8801e0bde400 8801d18b0e10        0 81420d40 nfsv3 WRITE a:call_connect_status q:xprt_sending
... (similar lines) ...

Could someone give me an educational statement on what the above two
lines mean ? More specifically, what does call_connect_status do, and
what does xprt_sending mean ? Is there any way to force (i.e. hack the
code to get) a re-connect (i.e. invoke "connect" from rpc_xprt_ops) ?

Longer version of the question:
I'm trying to enable NFS-RDMA on an embedded system (based on a 2.6.38
kernel) as a client. The IB stacks are taken from OFED 1.5.4. The NFS
server is a RHEL 6.3 Xeon box. The connection uses the Mellanox mlx4
driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many
issues so far, but I did manage to get the nfs mount working. Simple
file operations (such as ls, file read/write, scp, etc.) seem to work
as well.
While trying to run iozone to see whether the performance gain can
justify the development effort, the program runs until it reaches a
2MB file size - at that point, RDMA CM sends out a TIMEWAIT_EXIT
event, the xprt is disconnected, and all IOs on that share hang. IPoIB
still works though. Not sure what would be the best way to debug this.

Thanks,
Wendy


Re: NFS over RDMA benchmark

2013-04-30 Thread Wendy Cheng
On Mon, Apr 29, 2013 at 10:09 PM, Yan Burman y...@mellanox.com wrote:

 I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also 
 way higher now).
 For some reason when I had intel IOMMU enabled, the performance dropped 
 significantly.
 I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
 Now I will take care of the issue that I am running only at 40Gbit/s instead 
 of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable 
 issue).

 This is still strange, since ib_send_bw with intel iommu enabled did get up 
 to 4.5GB/sec, so why did intel iommu affect only nfs code?


That's very exciting ! The sad part is that the IOMMU has to be turned
off.

I think ib_send_bw uses a single buffer, so the DMA mapping search
overhead is not an issue there.

-- Wendy


Re: NFS over RDMA benchmark

2013-04-28 Thread Wendy Cheng
On Sun, Apr 28, 2013 at 7:42 AM, J. Bruce Fields bfie...@fieldses.org wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 ...

[snip]

 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner

 That's the inode i_mutex.

 14.70%-- svc_send

 That's the xpt_mutex (ensuring rpc replies aren't interleaved).


  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave


 And that (and __free_iova below) looks like iova_rbtree_lock.



Let's revisit your command:

FIO arguments: --rw=randread --bs=4k --numjobs=2 --iodepth=128
--ioengine=libaio --size=10k --prioclass=1 --prio=0 --cpumask=255
--loops=25 --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1
--norandommap --group_reporting --exitall --buffered=0

* inode's i_mutex:
If increasing the process/file count didn't help, maybe increasing
iodepth (to say 512 ?) could offset the i_mutex overhead a little bit ?

* xpt_mutex:
(no idea)

* iova_rbtree_lock:
DMA mapping fragmentation ? I have not studied whether NFS-RDMA
routines such as svc_rdma_sendto() could do better, but maybe
sequential IO (instead of randread) would help ? A bigger block size
(instead of 4K) might help as well ?

-- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:
 On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:

 So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
 tar ball) ... Here is a random thought (not related to the rb tree
 comment).

 The inflight packet count seems to be controlled by
 xprt_rdma_slot_table_entries that is currently hard-coded as
 RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
 with the bandwidth number if we pump it up, say 64 instead ? Not sure
 whether FMR pool size needs to get adjusted accordingly though.

 1)

 The client slot count is not hard-coded, it can easily be changed by
 writing a value to /proc and initiating a new mount. But I doubt that
 increasing the slot table will improve performance much, unless this is
 a small-random-read, and spindle-limited workload.

Hi Tom !

It was a shot in the dark :) .. as our test bed has not been set up
yet. However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment, as a /proc entry is easy to add. More questions on the server
side though (see below) ...


 2)

 The observation appears to be that the bandwidth is server CPU limited.
 Increasing the load offered by the client probably won't move the needle,
 until that's addressed.


Could you give more hints on which part of the path is CPU limited ?
Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA ? Are there specific filesystem features that would work well
with NFS-RDMA ? I'm wondering, once disk+FS are added into the
configuration, how much advantage NFS-RDMA would have compared with a
plain TCP/IP transport, say IPoIB in CM mode ?

-- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com wrote:
 The Mellanox driver uses red-black trees extensively for resource
 management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
 these are used to find the associated software data structures I believe. It
 is certainly possible that these trees get hot on lookup when we're pushing
 a lot of data. I'm surprised, however, to see rb_insert_color there because
 I'm not aware of any where that resources are being inserted into and/or
 removed from a red-black tree in the data path.


I think they (rb calls) are from base kernel, not from any NFS and/or
IB module (e.g. RPC, MLX, etc). See the right column ?  it says
/root/vmlinux. Just a guess - I don't know much about this perf
command.

 -- Wendy


Re: NFS over RDMA benchmark

2013-04-25 Thread Wendy Cheng
On Thu, Apr 25, 2013 at 2:58 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Thu, Apr 25, 2013 at 2:17 PM, Tom Tucker t...@opengridcomputing.com 
 wrote:
 The Mellanox driver uses red-black trees extensively for resource
 management, e.g. QP ID, CQ ID, etc... When completions come in from the HW,
 these are used to find the associated software data structures I believe. It
 is certainly possible that these trees get hot on lookup when we're pushing
 a lot of data. I'm surprised, however, to see rb_insert_color there because
 I'm not aware of any where that resources are being inserted into and/or
 removed from a red-black tree in the data path.


 I think they (rb calls) are from base kernel, not from any NFS and/or
 IB module (e.g. RPC, MLX, etc). See the right column ?  it says
 /root/vmlinux. Just a guess - I don't know much about this perf
 command.



Oops .. take my words back ! I confused Linux's RB tree w/ BSD's.
BSD's is a set of macros inside a header file while Linux's
implementation is a base kernel library. So every KMOD is a suspect
here :)

-- Wendy


Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
    samples  pcnt RIP              function             DSO
    _______ _____ ________________ ____________________ _____________

    2787.00 24.1% 81062a00 mutex_spin_on_owner  /root/vmlinux

 I guess that means lots of contention on some mutex?  If only we knew
 which one perf should also be able to collect stack statistics, I
 forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


I have not looked at the NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot ? Do we have a red-black tree somewhere
in these paths ? Trees like that require extensive locking.

-- Wendy

    978.00  8.4% 810297f0 clflush_cache_range          /root/vmlinux
    445.00  3.8% 812ea440 __domain_mapping             /root/vmlinux
    441.00  3.8% 00018c30 svc_recv                     /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
    344.00  3.0% 813a1bc0 _raw_spin_lock_bh            /root/vmlinux
    333.00  2.9% 813a19e0 _raw_spin_lock_irqsave       /root/vmlinux
    288.00  2.5% 813a07d0 __schedule                   /root/vmlinux
    249.00  2.1% 811a87e0 rb_prev                      /root/vmlinux
    242.00  2.1% 813a19b0 _raw_spin_lock               /root/vmlinux
    184.00  1.6% 2e90     svc_rdma_sendto              /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
    177.00  1.5% 810ac820 get_page_from_freelist       /root/vmlinux
    174.00  1.5% 812e6da0 alloc_iova                   /root/vmlinux
    165.00  1.4% 810b1390 put_page                     /root/vmlinux
    148.00  1.3% 00014760 sunrpc_cache_lookup          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
    128.00  1.1% 00017f20 svc_xprt_enqueue             /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
    126.00  1.1% 8139f820 __mutex_lock_slowpath        /root/vmlinux
    108.00  0.9% 811a81d0 rb_insert_color              /root/vmlinux
    107.00  0.9% 4690     svc_rdma_recvfrom            /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
    102.00  0.9% 2640     send_reply                   /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
     99.00  0.9% 810e6490 kmem_cache_alloc             /root/vmlinux
     96.00  0.8% 810e5840 __slab_alloc                 /root/vmlinux
     91.00  0.8% 6d30     mlx4_ib_post_send            /lib/modules/3.5.7-dbg/kernel/drivers/infiniband/hw/mlx4/mlx4_ib.ko
     88.00  0.8% 0dd0     svc_rdma_get_context         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
     86.00  0.7% 813a1a10 _raw_spin_lock_irq           /root/vmlinux
     86.00  0.7% 1530     svc_rdma_send                /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
     85.00  0.7% 81060a80 prepare_creds                /root/vmlinux
     83.00  0.7% 810a5790 find_get_pages_contig        /root/vmlinux
     79.00  0.7% 810e4620 __slab_free                  /root/vmlinux
     79.00  0.7% 813a1a40 _raw_spin_unlock_irqrestore  /root/vmlinux
     77.00  0.7% 81065610 finish_task_switch           /root/vmlinux
     76.00  0.7% 812e9270 pfn_to_dma_pte               /root/vmlinux
     75.00  0.6% 810976d0 __call_rcu                   /root/vmlinux
     73.00  0.6% 811a2fa0 _atomic_dec_and_lock         /root/vmlinux
     73.00  0.6% 02e0     svc_rdma_has_wspace          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
     67.00  0.6% 813a1a70 _raw_read_lock               /root/vmlinux
     65.00  0.6% f590     svcauth_unix_set_client      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
     63.00  0.5% 000180e0 svc_reserve                  /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
     60.00  0.5% 64d0     stamp_send_wqe

Re: NFS over RDMA benchmark

2013-04-24 Thread Wendy Cheng
On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
 
  Perf top for the CPU with high tasklet count gives:
 
    samples  pcnt RIP              function             DSO
    _______ _____ ________________ ____________________ _____________

    2787.00 24.1% 81062a00 mutex_spin_on_owner  /root/vmlinux

 I guess that means lots of contention on some mutex?  If only we knew
 which one perf should also be able to collect stack statistics, I
 forget how.

 Googling around  I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio


 I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
 that rb_prev up in the #7 spot ? Do we have Red Black tree somewhere
 in the paths ? Trees like that requires extensive lockings.


So I did a quick read of the sunrpc/xprtrdma source (based on the OFA
1.5.4.1 tar ball) ... Here is a random thought (not related to the rb
tree comment).

The in-flight packet count seems to be controlled by
xprt_rdma_slot_table_entries, which is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help
with the bandwidth numbers if we pump it up to, say, 64 instead ? Not
sure whether the FMR pool size needs to be adjusted accordingly
though.

In short, if anyone has a benchmark setup handy, bumping up the slot
table size as follows might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h	2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h	2013-04-24 10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)

 #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */

-- Wendy


Re: NFS over RDMA benchmark

2013-04-18 Thread Wendy Cheng
On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:


 What do you suggest for benchmarking NFS?


I believe SPECsfs has been widely used by NFS (server) vendors to
position their product lines. Its workload was based on real-life NFS
deployments. I think it leans more toward an office type of workload
(large client/user counts with smaller file sizes, e.g. software
development with build, compile, etc.).

BTW, we're experimenting with a similar project and would be
interested to know your findings.

-- Wendy


Re: NFS over RDMA benchmark

2013-04-18 Thread Wendy Cheng
On Thu, Apr 18, 2013 at 10:50 AM, Spencer Shepler
spencer.shep...@gmail.com wrote:

 Note that SPEC SFS does not support RDMA.


IIRC, the benchmark comes with source code - I'm wondering whether
anyone has modified it to run over RDMA ? Or is there any real user
who can share their experience ?

-- Wendy

 
 From: Wendy Cheng
 Sent: 4/18/2013 9:16 AM
 To: Yan Burman
 Cc: Atchley, Scott; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
 linux-...@vger.kernel.org; Or Gerlitz

 Subject: Re: NFS over RDMA benchmark

 On Thu, Apr 18, 2013 at 5:47 AM, Yan Burman y...@mellanox.com wrote:


 What do you suggest for benchmarking NFS?


 I believe SPECsfs has been widely used by NFS (server) vendors to
 position their product lines. Its workload was based on a real life
 NFS deployment. I think it is more torward office type of workload
 (large client/user count with smaller file sizes e.g. software
 development with build, compile, etc).

 BTW, we're experimenting a similar project and would be interested to
 know your findings.

 -- Wendy
 --
 To unsubscribe from this list: send the line unsubscribe linux-nfs in

 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: NFS over RDMA benchmark

2013-04-17 Thread Wendy Cheng
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block 
 sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

Remember there are always gaps between wire speed (which is what
ib_send_bw measures) and real-world applications.

That being said, does your server use the default export (sync) option?
Exporting the share with the async option can bring you closer to wire
speed. However, async is generally not recommended in a real production
system, as it can cause data integrity issues - e.g. you have a greater
chance of losing data when the boxes crash.

-- Wendy


Re: NFS over RDMA benchmark

2013-04-17 Thread Wendy Cheng
On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov wrote:
 On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I seem to only 
 get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of memory, and 
 Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing storage on the 
 server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same 
 block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

 Yan,

 Are you trying to optimize single client performance or server performance 
 with multiple clients?


 Remember there are always gaps between wire speed (that ib_send_bw
 measures) and real world applications.

 That being said, does your server use default export (sync) option ?
 Export the share with async option can bring you closer to wire
 speed. However, the practice (async) is generally not recommended in a
 real production system - as it can cause data integrity issues, e.g.
 you have more chances to lose data when the boxes crash.

 -- Wendy


 Wendy,

 It has been a few years since I looked at RPCRDMA, but I seem to remember
 that RPCs were limited to 32KB, which means that you have to pipeline them
 to get line rate. In addition to requiring pipelining, the argument from the
 authors was that the goal was to maximize server performance and not single
 client performance.

 Scott


That (client count) brings up a good point ...

FIO is really not a good benchmark for NFS. Does anyone have SPECsfs
numbers for NFS over RDMA to share?

-- Wendy


IPOIB-CM MTU

2013-04-12 Thread Wendy Cheng
We're working on a RHEL-based system (2.6.32-279.el6.x86_64) that has
slower CPUs (vs. bigger Xeon boxes). The IB stack is on top of OFA
1.5.4.1 with an mlx4 adapter (mlx4_0). Enabling CM is expected to boost
IPoIB bandwidth (measured by NetPIPE) due to the larger MTU size
(65520). Unfortunately, that does not happen on this platform. It does,
however, show a 2x bandwidth gain (CM vs. datagram) on the Xeon servers.

While looking around, we noticed that the MTU reported by the ifconfig
command correctly shows the ipoib_cm_max_mtu number, but the socket
buffers sent down to ipoib_cm_send() never exceed 2048 bytes on both
the Xeon and the new HW platform. Since TSO (driver/firmware
segmentation) is off in CM mode ... intuitively, TCP/IP on Xeon would
do better with segmentation, while the (segmentation) overhead (and
other platform-specific issues) could weigh down the bandwidth on the
subject HW.

If the guess is right, the question here is how to make the upper
layer, i.e. IP, send down fragments bigger than 2048 bytes. Any comment
and/or help? Are there any other knobs that I should turn in CM mode
(enabled by echoing connected into /sys/class/net/ib0/mode)?

Thanks,
Wendy


Re: IPOIB Implementation

2012-10-25 Thread Wendy Cheng
On Thu, Oct 25, 2012 at 1:26 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 Could folks pass along some pointers on the subject implementation (for
 education purposes)?

 Thanks,
 Wendy

BTW, I'm starting with RFC 4391 (and the source code in
drivers/infiniband/ulp/ipoib). Is there any other good document on
the subject?

Thanks,
Wendy


user space FMR support

2012-02-13 Thread Wendy Cheng
To make a long story short ... FMR = fast memory registration ...

I had been using a set of FMR patches that exported the existing
kernel FMR support to user space (with the mlx4 driver) for about a
year. A mistake while re-installing the development cluster this
morning wiped out the private OFED package that contained the changes.
I could probably repatch everything in a couple of days, but it is a
very annoying process. I can't be the only person who needs FMR in
user space (?).

Does anyone have a handy set of patches that they can make public (or,
better, push upstream with a newer set of kernels)?

Thanks,
Wendy


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-17 Thread Wendy Cheng
On Fri, Oct 14, 2011 at 2:22 PM, Doug Ledford dledf...@redhat.com wrote:
 - Original Message -
 On Wed, Oct 12, 2011 at 9:32 AM, Wendy Cheng
 s.wendy.ch...@gmail.com wrote:
  The OFED package itself does include XRC support. The issue here
  (my
  guess) is that its build script needs to understand the running
  system's kernel version to decide what should be pulled (from the
  source). Linux 3.1 could be too new for OFED build script to make a
  correct decision. Nevertheless, mix-matching OFED modules/libraries
  is
  a *bad* idea.

 No.  The same userspace build should work with all kernel versions.

 Wendy's referring to something other than what you are thinking.  The
 same libibverbs user space build should work on all kernels going back
 a long way, except when you are talking about OFED their libibverbs
 is hard coded to assume XRC support and fail if it isn't present, so
 an OFED libibverbs won't really work without also the OFED kernel
 module stack.  The script Wendy referred to is the script that checks
 the running kernel's version in order to determine which backport
 patches need applied to the ofa_kernel source tree in order to build
 the OFED kernel modules for your running kernel.  Without that
 ofa_kernel build, the OFED libibverbs will indeed fail to run on
 the running kernel.  And that script hasn't been updated to support
 version 3.x kernels last I checked, so she's right, the script itself
 doesn't recognize the running kernel version, so ofa_kernel modules
 don't get built, so OFED libibverbs won't work anyway.  So, she's
 absolutely right, unless you want to start ripping hard coded
 assumptions about the existence of XRC support out of things like
 OFED's libibverbs, then out of qperf and a number of their other
 various packages, then you have to pair the OFED kernel modules and
 user space packages, they can not be separated.


Yes, that (above) is exactly what I was referring to.

The conversations in this thread remind me of the tire-swing cartoon
that has been passed around for years:
http://bibiananunes.com/user-requirements-the-tire-swing-cartoon


-- Wendy


Re: Building 3.1-rc9 in kernel infiniband support with OFED libraries

2011-10-12 Thread Wendy Cheng
On Wed, Oct 12, 2011 at 3:41 AM, Bart Van Assche bvanass...@acm.org wrote:
 On Tue, Oct 11, 2011 at 7:39 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
 On Tue, Oct 11, 2011 at 09:02:41AM -0500, Christoph Lameter wrote:
 Has XRC support not been merged? How can I build the OFED libraries
 against Linux 3.1? I'd really like to get rid of the OFED kernel tree
 nightmare.

 You have to use upstream libraries with upstream kernels. Be warned
 that the OFED libraries of the same SONAME are not ABI compatible with
 upstream libraries.

 Why is the OFED libibverbs library binary incompatible with the
 non-OFED libibverbs library ? Why hasn't XRC support been implemented
 in the OFED libibverbs library such that applications built against
 the upstream libibverbs headers also work with the latest OFED version
 of that library ?


I'm relatively new to OFED but happened to bump into a similar build
issue two weeks ago.

The OFED package itself does include XRC support. The issue here (my
guess) is that its build script needs to understand the running
system's kernel version to decide what should be pulled (from the
source). Linux 3.1 could be too new for OFED build script to make a
correct decision. Nevertheless, mix-matching OFED modules/libraries is
a *bad* idea.

It is difficult to love the OFED build :) but it seems to work OK (so
far for me). Plus, I don't have a better proposal myself anyway.

-- Wendy


Re: how to debug (mlx4) CQ overrun

2011-10-08 Thread Wendy Cheng
On Fri, Sep 23, 2011 at 2:30 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:

 There are not really any tools, but this is usually straightforward to
 look at from your app.


Many thanks for the response. It helped (to confirm that our CQ
handling logic was OK). The issue turned out to be build-related:
after doing a clean rebuild of the OFED IB modules with the modified
header files, the problem went away. The (header file) change was a
result of exporting kernel FMR (fast memory registration) to user
space for an experimental project.

Again, thank you for the write-up. It is much appreciated.

-- Wendy


how to debug (mlx4) CQ overrun

2011-09-23 Thread Wendy Cheng
I have a test program that does RDMA read/write as follows:

node A: server listens and handles connection requests,
        and sets up a piece of memory initialized to 0
node B: two processes, parent & child

child:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  2. RDMA sequential write (8192 bytes at a time) to the server memory
  4. sync with parent

parent:
  1. set up a new channel with the server, including a CQ with 1024 entries
     (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
  3. RDMA sequential read (8192 bytes at a time) from the same piece of
     server memory
     - check the buffer contents
     - if the memory content is still zero, re-read
  4. sync with child

The parent hangs (but the child finishes its writes) after the
following pops up in /var/log/messages:
 mlx4_core :06:00.0: CQ overrun on CQN 87

I have my own counters that restrict the outstanding reads (and writes)
to 512 max. Both write and read are blocking (i.e. the CQ is polled
after each read/write). I suspect I do not have the CQ poll logic
correct. The question here is: is there any diag tool available to
check the internal counters (and/or states) of the ibverbs library
and/or kernel drivers (to help debug RDMA applications)? In my case, it
hangs around block 14546 (i.e. after 14546*8192 bytes).
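
For reference, here is a minimal sketch of the write-then-poll pattern
described above, using plain libibverbs. This is illustrative only - the
helper name (post_rdma_write_and_wait) and its parameters are made up
for this example and are not taken from the actual test program:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one signaled RDMA WRITE and busy-poll the CQ until its
 * completion is reaped - one completion per posted work request,
 * which is the "blocking" behavior described above. */
static int post_rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                                    struct ibv_mr *mr, void *buf, uint32_t len,
                                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a CQE for this WR */
    };
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Busy-poll until the completion shows up. */
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return -1;
    return 0;
}

If every work request is signaled like this but fewer completions are
actually reaped than are posted, unread CQEs pile up and a 1024-entry
CQ will eventually overrun - which matches the symptom in the log above.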

Thanks,
Wendy


Re: kernel space rdma client, reg_mr issue

2011-06-30 Thread Wendy Cheng
On Thu, Jun 30, 2011 at 7:19 AM, Benoit Hudzia benoit.hud...@gmail.com wrote:
 Hi,

 We are working to create a RDMA client within the kernel which will
 connect to a working RDMA userspace server.
 We are using the userspace code for testing purposes; in the final
 version all the communication will be done within the kernel space.

 [snip]

  basically the question boils down to how to allocate and register
 buffers for RDMA communication from inside a kernel module?


I happen to have a similar setup ... The original intent was to have a
kernel-mode RDMA application that took kernel data and sent it over to
the peer node's memory for temporary storage, where it had to be
readable later. As it didn't matter whether the temporary storage was
in kernel or user address space, I re-used my colleague's existing
user-mode program (run as a user-space daemon on the peer node). This
allowed the focus to be on the new kernel application development (run
on the primary node). After the code was up and running, I saw no
reason to change the setup, and it has been running fine since.

The code runs on RHEL 5.5 with OFED-1.5.2 using Mellanox card.

The user-mode daemon is in a forever receiving loop that follows the
standard RDMA user-mode programming logic.

The kernel code invokes the ib_xxx set of APIs (vs. the user-mode
ibv_xxx ones). The kernel memory allocation and registration are done
with APIs such as kzalloc(), ib_get_dma_mr(), ib_dma_map_single(),
ib_dma_map_page(), etc.

Check out the driver code in the drivers/infiniband/ulp/iser directory.
It has sample logic for registering kernel memory.
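
For what it's worth, here is a minimal sketch of that allocate/map/register
flow, assuming the kernel verbs API of that era (ib_get_dma_mr() etc.);
the function name, the chosen access flags, and the dma_size/buf
parameters are illustrative only and not taken from the actual code:

#include <linux/slab.h>
#include <linux/err.h>
#include <linux/dma-mapping.h>
#include <rdma/ib_verbs.h>

/* Sketch: allocate a kernel buffer and make it usable for RDMA.
 * 'pd' is an already-allocated protection domain on 'device'. */
static int setup_kernel_rdma_buf(struct ib_device *device, struct ib_pd *pd,
                                 size_t dma_size, void **buf_out,
                                 u64 *dma_addr_out, struct ib_mr **mr_out)
{
    void *buf;
    u64 dma_addr;
    struct ib_mr *mr;

    buf = kzalloc(dma_size, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    /* Map the buffer for DMA; the returned address is what goes
     * into the ib_sge for send/recv/RDMA work requests. */
    dma_addr = ib_dma_map_single(device, buf, dma_size, DMA_BIDIRECTIONAL);
    if (ib_dma_mapping_error(device, dma_addr)) {
        kfree(buf);
        return -EIO;
    }

    /* Whole-memory DMA MR (the old-style API); mr->lkey and mr->rkey
     * are used in the work requests, like lkey/rkey from ibv_reg_mr()
     * in user space. */
    mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                           IB_ACCESS_REMOTE_READ |
                           IB_ACCESS_REMOTE_WRITE);
    if (IS_ERR(mr)) {
        ib_dma_unmap_single(device, dma_addr, dma_size, DMA_BIDIRECTIONAL);
        kfree(buf);
        return PTR_ERR(mr);
    }

    *buf_out = buf;
    *dma_addr_out = dma_addr;
    *mr_out = mr;
    return 0;
}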

-- Wendy


How to build a kernel module with OFED kernel

2011-02-18 Thread Wendy Cheng
A newbie question ... (please cc me as I'm not on the linux-rdma list yet)

The system was on RHEL 5.4. I used a source tar-ball to build the RHEL
5.5 kernel (2.6.18-194.el5). The kernel build was placed at
/usr/src/linux. Rebooted to pick up the new kernel ... it worked fine
... Then:

1. On the new 2.6.18-194.el5 system, install OFED-1.5.2 ... succeeded.
2. Reboot to pick up the new ofa kernel ... succeeded (checked with modinfo).
3. Run a user-mode RDMA application ... succeeded.
4. The new ofa kernel modules are placed (by the OFED scripts) in the
/lib/modules/2.6.18-194.el5/updates/ directory.
5. Build a kernel module (xx.ko) using the attached Makefile.

Now ... trying to load (or run after force-loading) xx.ko on top of the
ofa kernel kmod(s) ... It fails, as the build apparently picks up the
IB kmods from the /lib/modules/2.6.18-194.el5/drivers/infiniband
directory instead of the /lib/modules/2.6.18-194.el5/updates directory,
together with the wrong header files from
/lib/modules/2.6.18-194.el5/source/include, where source is the
/usr/src/linux directory.

Can anyone help me with the correct procedure?

I do understand that the primary usage of OFED RDMA is for user-mode
applications ... but I need to have a kernel-mode driver on top of OFED
RDMA for some experimental work.

Thanks,
Wendy

== Make File ==

EXTRA_CFLAGS:= -I/usr/src/linux/drivers/xx/include
EXTRA_CFLAGS+= -DXX_KMOD_DEF

obj-m   := xx_kmod.o

xx_kmod-y   := main/xx_main.o main/xx_init.o \
               libxxverbs/xx_device.o libxxverbs/xx_cm.o \
               libxxverbs/xx_ar.o libxxverbs/xx_mr.o \
               libxxverbs/xx_cq.o libxxverbs/xx_sq.o

xx_kmod-y   += util/xx_perf.o
xx_kmod-y   += brd/xx_brd.o

kmod:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

install:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules_install

run:
	/sbin/depmod -a; echo 5 > /proc/sys/kernel/panic; modprobe --force-modversion xx_kmod

=== End attachment ===