Re: [PATCH 0/8] IPoIB: Fix multiple race conditions

2014-09-03 Thread Or Gerlitz

On 8/13/2014 2:38 AM, Doug Ledford wrote:

Locking of multicast joins/leaves in the IPoIB layer have been problematic
for a while.  There have been recent changes to try and make things better,
including these changes:

bea1e22 IPoIB: Fix use-after-free of multicast object
a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects

Unfortunately, the following test still fails (miserably) on a plain
upstream kernel:

pass=0
ifdown ib0
while true; do
 ifconfig ib0 up
 ifconfig ib0 down
 echo Pass $pass
 let pass++
done

This usually fails within 10 to 20 passes, although I did have a lucky
run make it to 300 or so.  If you happen to have a P_Key child interface,
it fails even quicker.


Hi Doug,

Thanks for looking on that. I wasn't aware we're doing so badly... I 
checked here and the plan is that Erez Shitrit from Mellanox will also 
be providing feedback on the series.


Anyway, just to make sure we're on the same page, people agree that 
picking such series is too heavy for post merge window, right? so we are 
talking on 3.18, agree?


Or.



In tracking down the rtnl deadlock in the multicast code, I ended up
re-designing the flag usage and clean up just the race conditions in
the various tasks.  Even that wasn't enough to resolve the issue entirely
though, as if you ran the test above on multiple interfaces simultaneously,
it could still deadlock.  So in the end I re-did the workqueue usage too
so that we now use a workqueue per device (and that includes P_Key devices
have dedicated workqueues) as well as one global workqueue that does
nothing but flush tasks.  This turns out to be a much more elegant way
of handling the workqueues and in fact enabled me to remove all of the
klunky passing around of flush parameters to tell various functions not
to flush the workqueue if it would deadlock the workqueue we are running
from.

Here is my test setup:

2 InfiniBand physical fabrics: ib0 and ib1
2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
   mlx4_ib1, and mlx4_ib1.8003

In order to test my final patch set, I logged into my test machine on
four different virtual terminals, I used ifdown on all of the above
interfaces to get things in a consistent state, and then I ran the above
loop, one per terminal per interface simultaneously.

It's worth noting here that when you ifconfig a base interface up, it
automatically brings up the child interface too, so the ifconfig mlx4_ib0
up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
is true for the mlx4_ib1 interface and its child.  With my patch set in
place, these loops are currently running without a problem and have passed
15,000 up/down cycles per interface.

Doug Ledford (8):
   IPoIB: Consolidate rtnl_lock tasks in workqueue
   IPoIB: Make the carrier_on_task race aware
   IPoIB: fix MCAST_FLAG_BUSY usage
   IPoIB: fix mcast_dev_flush/mcast_restart_task race
   IPoIB: change init sequence ordering
   IPoIB: Use dedicated workqueues per interface
   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
   IPoIB: No longer use flush as a parameter

  drivers/infiniband/ulp/ipoib/ipoib.h   |  19 +-
  drivers/infiniband/ulp/ipoib/ipoib_cm.c|  18 +-
  drivers/infiniband/ulp/ipoib/ipoib_ib.c|  27 ++-
  drivers/infiniband/ulp/ipoib/ipoib_main.c  |  49 +++--
  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 -
  drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  22 ++-
  6 files changed, 240 insertions(+), 134 deletions(-)



--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/8] IPoIB: Fix multiple race conditions

2014-09-03 Thread Wendy Cheng
On Fri, Aug 15, 2014 at 3:08 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
[snip]

 Doug Ledford (8):
   IPoIB: Consolidate rtnl_lock tasks in workqueue
   IPoIB: Make the carrier_on_task race aware
   IPoIB: fix MCAST_FLAG_BUSY usage
   IPoIB: fix mcast_dev_flush/mcast_restart_task race
   IPoIB: change init sequence ordering
   IPoIB: Use dedicated workqueues per interface
   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
   IPoIB: No longer use flush as a parameter


 IPOIB is recently added as a technology preview for Intel Xeon Phi
 (currently a PCIe card) that runs embedded Linux (named MPSS) with
 Infiniband software stacks supported via emulation drivers. One early
 feedback from users with large cluster nodes is IPOIB's power
 consumption. The root cause of the reported issue is more to do with
 how MPSS handles its DMA buffers (vs. how Linux IB stacks work) - so
 submitting the fix to upstream is not planned at this moment (unless
 folks are interested in the changes).

 However, since this patch set happens to be in the heart of the
 reported power issue, we would like to take a closer look to avoid
 MPSS code base deviating too much from future upstream kernel(s).
 Question, comment, and/or ack will follow sometime next week.


I've reviewed the patch set - the first half of the patches look good.
Patch #5,#6,#7,#8 are fine if we go for one WQ per device - will let
others do the final call.

On our system (OFED 1.5.4 based), similar deadlocks were also observed
while the power management issues were worked on. Restricted by other
issues that were specific to our platform, I took the advantage of
single IPOIB workqueue by queuing the if-up(s) and/or if-down(s) to
the work queue if one already in progress. It serialized the logic by
default. However, I would not mind one WQ per device approach and will
re-make the changes when this patch set is picked up by mainline
kernel.

-- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] ib_umem_release should decrement mm-pinned_vm from ib_umem_get

2014-09-03 Thread Shawn Bohrer
From: Shawn Bohrer sboh...@rgmadvisors.com

In debugging an application that receives -ENOMEM from ib_reg_mr() I
found that ib_umem_get() can fail because the pinned_vm count has
wrapped causing it to always be larger than the lock limit even with
RLIMIT_MEMLOCK set to RLIM_INFINITY.

The wrapping of pinned_vm occurs because the process that calls
ib_reg_mr() will have its mm-pinned_vm count incremented.  Later a
different process with a different mm_struct than the one that allocated
the ib_umem struct ends up releasing it which results in decrementing
the new processes mm-pinned_vm count past zero and wrapping.

I'm not entirely sure what circumstances cause a different process to
release the ib_umem than the one that allocated it but the kernel stack
trace of the freeing process from my situation looks like the following:

Call Trace:
 [814d64b1] dump_stack+0x19/0x1b
 [a0b522a5] ib_umem_release+0x1f5/0x200 [ib_core]
 [a0b90681] mlx4_ib_destroy_qp+0x241/0x440 [mlx4_ib]
 [a0b4d93c] ib_destroy_qp+0x12c/0x170 [ib_core]
 [a0cc7129] ib_uverbs_close+0x259/0x4e0 [ib_uverbs]
 [81141cba] __fput+0xba/0x240
 [81141e4e] fput+0xe/0x10
 [81060894] task_work_run+0xc4/0xe0
 [810029e5] do_notify_resume+0x95/0xa0
 [814e3dd0] int_signal+0x12/0x17

The following patch fixes the issue by storing the pid struct of the
process that calls ib_umem_get() so that ib_umem_release and/or
ib_umem_account() can properly decrement the pinned_vm count of the
correct mm_struct.

Signed-off-by: Shawn Bohrer sboh...@rgmadvisors.com
---
v3 changes:
* Fix resource leak with put_task_struct()
v2 changes:
* Updated to use get_task_pid to avoid keeping a reference to the mm

 drivers/infiniband/core/umem.c |   19 +--
 include/rdma/ib_umem.h |1 +
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..df0c4f6 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -105,6 +105,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem-length= size;
umem-offset= addr  ~PAGE_MASK;
umem-page_size = PAGE_SIZE;
+   umem-pid   = get_task_pid(current, PIDTYPE_PID);
/*
 * We ask for writable memory if any access flags other than
 * remote read are set.  Local write and remote write
@@ -198,6 +199,7 @@ out:
if (ret  0) {
if (need_release)
__ib_umem_release(context-device, umem, 0);
+   put_pid(umem-pid);
kfree(umem);
} else
current-mm-pinned_vm = locked;
@@ -230,15 +232,19 @@ void ib_umem_release(struct ib_umem *umem)
 {
struct ib_ucontext *context = umem-context;
struct mm_struct *mm;
+   struct task_struct *task;
unsigned long diff;
 
__ib_umem_release(umem-context-device, umem, 1);
 
-   mm = get_task_mm(current);
-   if (!mm) {
-   kfree(umem);
-   return;
-   }
+   task = get_pid_task(umem-pid, PIDTYPE_PID);
+   put_pid(umem-pid);
+   if (!task)
+   goto out;
+   mm = get_task_mm(task);
+   put_task_struct(task);
+   if (!mm)
+   goto out;
 
diff = PAGE_ALIGN(umem-length + umem-offset)  PAGE_SHIFT;
 
@@ -262,9 +268,10 @@ void ib_umem_release(struct ib_umem *umem)
} else
down_write(mm-mmap_sem);
 
-   current-mm-pinned_vm -= diff;
+   mm-pinned_vm -= diff;
up_write(mm-mmap_sem);
mmput(mm);
+out:
kfree(umem);
 }
 EXPORT_SYMBOL(ib_umem_release);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1ea0b65..a2bf41e 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -47,6 +47,7 @@ struct ib_umem {
int writable;
int hugetlb;
struct work_struct  work;
+   struct pid *pid;
struct mm_struct   *mm;
unsigned long   diff;
struct sg_table sg_head;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race

2014-09-03 Thread Doug Ledford
On Fri, 2014-08-29 at 14:53 -0700, Wendy Cheng wrote:
 On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
  Our mcast_dev_flush routine and our mcast_restart_task can race against
  each other.  In particular, they both hold the priv-lock while
  manipulating the rbtree and while removing mcast entries from the
  multicast_list and while adding entries to the remove_list, but they
  also both drop their locks prior to doing the actual removes.  The
  mcast_dev_flush routine is run entirely under the rtnl lock and so has
  at least some locking.  The actual race condition is like this:
 
  Thread 1Thread 2
  ifconfig ib0 up
start multicast join for broadcast
multicast join completes for broadcast
start to add more multicast joins
  call mcast_restart_task to add new entries
  ifconfig ib0 down
mcast_dev_flush
  mcast_leave(mcast A)
  mcast_leave(mcast A)
 
  As mcast_leave calls ib_sa_multicast_leave, and as member in
  core/multicast.c is ref counted, we run into an unbalanced refcount
  issue.  To avoid stomping on each others removes, take the rtnl lock
  specifically when we are deleting the entries from the remove list.
 
 Isn't test_and_clear_bit() atomic so it is unlikely that
 ib_sa_free_multicast() can run multiple times  ?
 
 638 static int ipoib_mcast_leave(struct net_device *dev, struct
 ipoib_mcast *mcast)
 639 {
 640 struct ipoib_dev_priv *priv = netdev_priv(dev);
 641 int ret = 0;
 642
 643 if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags))
 644 ib_sa_free_multicast(mcast-mc);
 645
 646 if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED,
 mcast-flags)) {

This is the original code, and you would be right, but it had other
problems.  Even though test_and_clear_bit is atomic, the original usage
of MCAST_FLAG_BUSY was really that the mcast join had started, there was
no guarantee it had completed (we would set the flag, call
ib_sa_multicast_join, and we could then race with our leave where it was
possible to leave prior to the join completing).

My patches change this around now, and the code only checks for
MCAST_FLAG_BUSY as a double check that we aren't leaving an in-process
join.

 
 -- Wendy
 
 
  Signed-off-by: Doug Ledford dledf...@redhat.com
  ---
   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 
  ++
   1 file changed, 32 insertions(+), 5 deletions(-)
 
  diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
  b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
  index f5e8da530d9..19e3fe75ebf 100644
  --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
  +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
  @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
  spin_unlock_irqrestore(priv-lock, flags);
 
  -   /* seperate between the wait to the leave*/
  +   /*
  +* make sure the in-flight joins have finished before we attempt
  +* to leave
  +*/
  list_for_each_entry_safe(mcast, tmcast, remove_list, list)
  if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags))
  wait_for_completion(mcast-done);
  @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct 
  *work)
  netif_addr_unlock(dev);
  local_irq_restore(flags);
 
  -   /* We have to cancel outside of the spinlock */
  +   /*
  +* make sure the in-flight joins have finished before we attempt
  +* to leave
  +*/
  +   list_for_each_entry_safe(mcast, tmcast, remove_list, list)
  +   if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags))
  +   wait_for_completion(mcast-done);
  +
  +   /*
  +* We have to cancel outside of the spinlock, but we have to
  +* take the rtnl lock or else we race with the removal of
  +* entries from the remove list in mcast_dev_flush as part
  +* of ipoib_stop() which will call mcast_stop_thread with
  +* flush == 1 while holding the rtnl lock, and the
  +* flush_workqueue won't complete until this restart_mcast_task
  +* completes.  So do like the carrier on task and attempt to
  +* take the rtnl lock, but if we can't before the ADMIN_UP flag
  +* goes away, then just return and know that the remove list will
  +* get flushed later by mcast_dev_flush.
  +*/
  +   while (!rtnl_trylock()) {
  +   if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags))
  +   return;
  +   else
  +   msleep(20);
  +   }
  list_for_each_entry_safe(mcast, tmcast, remove_list, list) {
  ipoib_mcast_leave(mcast-dev, 

Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race

2014-09-03 Thread Doug Ledford
On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote:
 On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
  On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
  Our mcast_dev_flush routine and our mcast_restart_task can race against
  each other.  In particular, they both hold the priv-lock while
  manipulating the rbtree and while removing mcast entries from the
  multicast_list and while adding entries to the remove_list, but they
  also both drop their locks prior to doing the actual removes.  The
  mcast_dev_flush routine is run entirely under the rtnl lock and so has
  at least some locking.  The actual race condition is like this:
 
  Thread 1Thread 2
  ifconfig ib0 up
start multicast join for broadcast
multicast join completes for broadcast
start to add more multicast joins
  call mcast_restart_task to add new entries
  ifconfig ib0 down
mcast_dev_flush
  mcast_leave(mcast A)
  mcast_leave(mcast A)
 
  As mcast_leave calls ib_sa_multicast_leave, and as member in
  core/multicast.c is ref counted, we run into an unbalanced refcount
  issue.  To avoid stomping on each others removes, take the rtnl lock
  specifically when we are deleting the entries from the remove list.
 
  Isn't test_and_clear_bit() atomic so it is unlikely that
  ib_sa_free_multicast() can run multiple times  ?
 
 Oops .. how about if the structure itself gets freed ? My bad !

Well, just like the last email, the code you are referring to is in the
original code, and had other issues.  After my patches it does not look
like that.

 However, isn't that the remove_list a local list on the caller's stack
 ? .. and  the original list entry moving (to remove_list) is protected
 by the spin lock (priv-lock), it is unlikely that the
 ib_sa_free_multicast() can operate on the same entry ?

Yes, you're right.  I had it in my mind that the remove_list was part of
the ipoib private dev, not local on the stack.  So you are right that we
could probably get away with removing the rtnl lock there (although it
would need to be in a later patch than the one you are reviewing
here...here there would still be a race between the restart task and the
downing of the interface because they all still share the same work
queue, but once we switch to the per device work queues in patch #6,
this can happen in parallel safely with the flush task I think).

 The patch itself is harmless though .. but adding the rntl_lock is
 really not ideal.
 
 -- Wendy
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part


Re: [PATCH 0/8] IPoIB: Fix multiple race conditions

2014-09-03 Thread Doug Ledford
On Wed, 2014-09-03 at 16:52 +0300, Or Gerlitz wrote:
 On 8/13/2014 2:38 AM, Doug Ledford wrote:
  Locking of multicast joins/leaves in the IPoIB layer have been problematic
  for a while.  There have been recent changes to try and make things better,
  including these changes:
 
  bea1e22 IPoIB: Fix use-after-free of multicast object
  a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
 
  Unfortunately, the following test still fails (miserably) on a plain
  upstream kernel:
 
  pass=0
  ifdown ib0
  while true; do
   ifconfig ib0 up
   ifconfig ib0 down
   echo Pass $pass
   let pass++
  done
 
  This usually fails within 10 to 20 passes, although I did have a lucky
  run make it to 300 or so.  If you happen to have a P_Key child interface,
  it fails even quicker.
 
 Hi Doug,
 
 Thanks for looking on that. I wasn't aware we're doing so badly... I 
 checked here and the plan is that Erez Shitrit from Mellanox will also 
 be providing feedback on the series.
 
 Anyway, just to make sure we're on the same page, people agree that 
 picking such series is too heavy for post merge window, right? so we are 
 talking on 3.18, agree?

Given how easy the problem is to reproduce, and how this patch set
positively solves the reproducer, I would have preferred 3.17, and I had
it in to Roland before the merge window closed, but Roland chose to put
it off.

/me boggles.

 Or.
 
 
  In tracking down the rtnl deadlock in the multicast code, I ended up
  re-designing the flag usage and clean up just the race conditions in
  the various tasks.  Even that wasn't enough to resolve the issue entirely
  though, as if you ran the test above on multiple interfaces simultaneously,
  it could still deadlock.  So in the end I re-did the workqueue usage too
  so that we now use a workqueue per device (and that includes P_Key devices
  have dedicated workqueues) as well as one global workqueue that does
  nothing but flush tasks.  This turns out to be a much more elegant way
  of handling the workqueues and in fact enabled me to remove all of the
  klunky passing around of flush parameters to tell various functions not
  to flush the workqueue if it would deadlock the workqueue we are running
  from.
 
  Here is my test setup:
 
  2 InfiniBand physical fabrics: ib0 and ib1
  2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
  4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
 mlx4_ib1, and mlx4_ib1.8003
 
  In order to test my final patch set, I logged into my test machine on
  four different virtual terminals, I used ifdown on all of the above
  interfaces to get things in a consistent state, and then I ran the above
  loop, one per terminal per interface simultaneously.
 
  It's worth noting here that when you ifconfig a base interface up, it
  automatically brings up the child interface too, so the ifconfig mlx4_ib0
  up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
  is true for the mlx4_ib1 interface and its child.  With my patch set in
  place, these loops are currently running without a problem and have passed
  15,000 up/down cycles per interface.
 
  Doug Ledford (8):
 IPoIB: Consolidate rtnl_lock tasks in workqueue
 IPoIB: Make the carrier_on_task race aware
 IPoIB: fix MCAST_FLAG_BUSY usage
 IPoIB: fix mcast_dev_flush/mcast_restart_task race
 IPoIB: change init sequence ordering
 IPoIB: Use dedicated workqueues per interface
 IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
 IPoIB: No longer use flush as a parameter
 
drivers/infiniband/ulp/ipoib/ipoib.h   |  19 +-
drivers/infiniband/ulp/ipoib/ipoib_cm.c|  18 +-
drivers/infiniband/ulp/ipoib/ipoib_ib.c|  27 ++-
drivers/infiniband/ulp/ipoib/ipoib_main.c  |  49 +++--
drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 
  -
drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  22 ++-
6 files changed, 240 insertions(+), 134 deletions(-)
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part