Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
was used.  We didn't always initialize the completion struct before we
set the flag, and we didn't always call complete on the completion
struct from all paths that complete it.  This made it less than totally
effective, and certainly made its use confusing.  And in the flush
function we would use the presence of this flag to signal that we should
wait on the completion struct, but we never cleared this flag, ever.
This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
flag to mean two different things: we have a join in flight, and we have
succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback.  This forces ib_sa_multicast_join
to return and set mcast->mc before we process the callback.  This way,
our callback can safely clear mcast->mc if there is an error on the join
and we will do the right thing as a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no longer
call mcast_sendonly_join directly from mcast_send and instead must add
sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                             Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                     ifconfig ib0 down
                                     wait_for_completion
    callback call completes
                                     wait_for_completion in
                                     mcast_dev_flush completes
                                       mcast->mc is PTR_ERR_OR_NULL
                                       so we skip ib_sa_leave_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                   Thread 2         Thread 3
ib_sa_multicast_join
                                            ifconfig ib0 down
                                            priv->broadcast = NULL
                           join_complete
                                            wait_for_completion
                           mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
                           complete
                           return -EAGAIN (making mcast->mc invalid)
                                            call ib_sa_multicast_leave
                                            on invalid mcast->mc, hang
                                            forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we run
the callback.  This allows us to clear mcast->mc if there is an error
and the join is going to fail.  We do this before we complete the mcast.
In this way, mcast_dev_flush always sees consistent state in regards to
mcast->mc membership at the time that the wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 ++++++++++++++++---------
 2 files changed, 101 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3edce617c31..43840971ad8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
        IPOIB_MCAST_FLAG_FOUND    = 0,  /* used in set_multicast_list */
        IPOIB_MCAST_FLAG_SENDONLY = 1,
-       IPOIB_MCAST_FLAG_BUSY     = 2,  /* joining or already joined */
+       /*
+        * For IPOIB_MCAST_FLAG_BUSY
+        * When set, in flight join and mcast->mc is unreliable
+        * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+        *   haven't started yet
+        * When clear and mcast->mc is valid pointer, join was successful
+        */
+       IPOIB_MCAST_FLAG_BUSY     = 2,
        IPOIB_MCAST_FLAG_ATTACHED = 3,
-       IPOIB_MCAST_JOIN_STARTED  = 4,
 
        MAX_SEND_CQE              = 16,
        IPOIB_CM_COPYBREAK        = 256,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 7e9cd39b5ef..f5e8da530d9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
        struct ipoib_mcast *mcast = multicast->context;
        struct net_device *dev = mcast->dev;
 
+       /*
+        * We have to take the mutex to force mcast_sendonly_join to
+        * return from ib_sa_multicast_join and set mcast->mc to a
+        * valid value.  Otherwise we were racing with ourselves in
+        * that we might fail here, but get a valid return from
+        * ib_sa_multicast_join after we had cleared mcast->mc here,
+        * resulting in mis-matched joins and leaves and a deadlock
+        */
+       mutex_lock(&mcast_mutex);
+
        /* We trap for port events ourselves. */
        if (status == -ENETRESET)
-               return 0;
+               goto out;
 
        if (!status)
                status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
        if (status) {
                if (mcast->logcount++ < 20)
-                       ipoib_dbg_mcast(netdev_priv(dev), "multicast join 
failed for %pI6, status %d\n",
+                       ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
+                                       "join failed for %pI6, status %d\n",
                                        mcast->mcmember.mgid.raw, status);
 
                /* Flush out any queued packets */
@@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status,
                        dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
                }
                netif_tx_unlock_bh(dev);
-
-               /* Clear the busy flag so we try again */
-               status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
-                                           &mcast->flags);
        }
+out:
+       clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+       if (status)
+               mcast->mc = NULL;
+       complete(&mcast->done);
+       if (status == -ENETRESET)
+               status = 0;
+       mutex_unlock(&mcast_mutex);
        return status;
 }
 
@@ -312,12 +327,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
        int ret = 0;
 
        if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-               ipoib_dbg_mcast(priv, "device shutting down, no multicast 
joins\n");
+               ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
+                               "multicast joins\n");
                return -ENODEV;
        }
 
-       if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-               ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
+       if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
+               ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
+                               "sendonly join\n");
                return -EBUSY;
        }
 
@@ -325,6 +342,9 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
        rec.port_gid = priv->local_gid;
        rec.pkey     = cpu_to_be16(priv->pkey);
 
+       mutex_lock(&mcast_mutex);
+       init_completion(&mcast->done);
+       set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
        mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
                                         priv->port, &rec,
                                         IB_SA_MCMEMBER_REC_MGID        |
@@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
        if (IS_ERR(mcast->mc)) {
                ret = PTR_ERR(mcast->mc);
                clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-               ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
-                          ret);
+               complete(&mcast->done);
+               ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
+                          "failed (ret = %d)\n", ret);
        } else {
-               ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting 
join\n",
-                               mcast->mcmember.mgid.raw);
+               ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
+                               "sendonly join\n", mcast->mcmember.mgid.raw);
        }
+       mutex_unlock(&mcast_mutex);
 
        return ret;
 }
@@ -390,22 +412,28 @@ static int ipoib_mcast_join_complete(int status,
        ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
                        mcast->mcmember.mgid.raw, status);
 
+       /*
+        * We have to take the mutex to force mcast_join to
+        * return from ib_sa_multicast_join and set mcast->mc to a
+        * valid value.  Otherwise we were racing with ourselves in
+        * that we might fail here, but get a valid return from
+        * ib_sa_multicast_join after we had cleared mcast->mc here,
+        * resulting in mis-matched joins and leaves and a deadlock
+        */
+       mutex_lock(&mcast_mutex);
+
        /* We trap for port events ourselves. */
-       if (status == -ENETRESET) {
-               status = 0;
+       if (status == -ENETRESET)
                goto out;
-       }
 
        if (!status)
                status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
        if (!status) {
                mcast->backoff = 1;
-               mutex_lock(&mcast_mutex);
                if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
                        queue_delayed_work(ipoib_workqueue,
                                           &priv->mcast_task, 0);
-               mutex_unlock(&mcast_mutex);
 
                /*
                 * Defer carrier on work to ipoib_workqueue to avoid a
@@ -413,37 +441,35 @@ static int ipoib_mcast_join_complete(int status,
                 */
                if (mcast == priv->broadcast)
                        queue_work(ipoib_workqueue, &priv->carrier_on_task);
-
-               status = 0;
-               goto out;
-       }
-
-       if (mcast->logcount++ < 20) {
-               if (status == -ETIMEDOUT || status == -EAGAIN) {
-                       ipoib_dbg_mcast(priv, "multicast join failed for %pI6, 
status %d\n",
-                                       mcast->mcmember.mgid.raw, status);
-               } else {
-                       ipoib_warn(priv, "multicast join failed for %pI6, 
status %d\n",
-                                  mcast->mcmember.mgid.raw, status);
+       } else {
+               if (mcast->logcount++ < 20) {
+                       if (status == -ETIMEDOUT || status == -EAGAIN) {
+                               ipoib_dbg_mcast(priv, "multicast join failed 
for %pI6, status %d\n",
+                                               mcast->mcmember.mgid.raw, 
status);
+                       } else {
+                               ipoib_warn(priv, "multicast join failed for 
%pI6, status %d\n",
+                                          mcast->mcmember.mgid.raw, status);
+                       }
                }
-       }
-
-       mcast->backoff *= 2;
-       if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-               mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-       /* Clear the busy flag so we try again */
-       status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 
-       mutex_lock(&mcast_mutex);
+               mcast->backoff *= 2;
+               if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+                       mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+       }
+out:
        spin_lock_irq(&priv->lock);
-       if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+       clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+       if (status)
+               mcast->mc = NULL;
+       complete(&mcast->done);
+       if (status == -ENETRESET)
+               status = 0;
+       if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
                queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
                                   mcast->backoff * HZ);
        spin_unlock_irq(&priv->lock);
        mutex_unlock(&mcast_mutex);
-out:
-       complete(&mcast->done);
+
        return status;
 }
 
@@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, 
struct ipoib_mcast *mcast,
                rec.hop_limit     = priv->broadcast->mcmember.hop_limit;
        }
 
-       set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+       mutex_lock(&mcast_mutex);
        init_completion(&mcast->done);
-       set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
-
+       set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
        mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
                                         &rec, comp_mask, GFP_KERNEL,
                                         ipoib_mcast_join_complete, mcast);
@@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, 
struct ipoib_mcast *mcast,
                if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
                        mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
-               mutex_lock(&mcast_mutex);
                if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
                        queue_delayed_work(ipoib_workqueue,
                                           &priv->mcast_task,
                                           mcast->backoff * HZ);
-               mutex_unlock(&mcast_mutex);
        }
+       mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -576,7 +600,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
        }
 
        if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-               if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
+               if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
+                   !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
                        ipoib_mcast_join(dev, priv->broadcast, 0);
                return;
        }
@@ -584,23 +609,33 @@ void ipoib_mcast_join_task(struct work_struct *work)
        while (1) {
                struct ipoib_mcast *mcast = NULL;
 
+               /*
+                * Need the mutex so our flags are consistent, need the
+                * priv->lock so we don't race with list removals in either
+                * mcast_dev_flush or mcast_restart_task
+                */
+               mutex_lock(&mcast_mutex);
                spin_lock_irq(&priv->lock);
                list_for_each_entry(mcast, &priv->multicast_list, list) {
-                       if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
-                           && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
-                           && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, 
&mcast->flags)) {
+                       if (IS_ERR_OR_NULL(mcast->mc) &&
+                           !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+                           !test_bit(IPOIB_MCAST_FLAG_ATTACHED, 
&mcast->flags)) {
                                /* Found the next unjoined group */
                                break;
                        }
                }
                spin_unlock_irq(&priv->lock);
+               mutex_unlock(&mcast_mutex);
 
                if (&mcast->list == &priv->multicast_list) {
                        /* All done */
                        break;
                }
 
-               ipoib_mcast_join(dev, mcast, 1);
+               if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+                       ipoib_mcast_sendonly_join(mcast);
+               else
+                       ipoib_mcast_join(dev, mcast, 1);
                return;
        }
 
@@ -646,6 +681,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct 
ipoib_mcast *mcast)
        int ret = 0;
 
        if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+               ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
+
+       if (!IS_ERR_OR_NULL(mcast->mc))
                ib_sa_free_multicast(mcast->mc);
 
        if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
@@ -698,6 +736,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, 
struct sk_buff *skb)
                memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
                __ipoib_mcast_add(dev, mcast);
                list_add_tail(&mcast->list, &priv->multicast_list);
+               if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+                       queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 
0);
        }
 
        if (!mcast->ah) {
@@ -711,8 +751,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, 
struct sk_buff *skb)
                if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
                        ipoib_dbg_mcast(priv, "no address vector, "
                                        "but multicast join already started\n");
-               else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-                       ipoib_mcast_sendonly_join(mcast);
 
                /*
                 * If lookup completes between here and out:, don't
@@ -774,7 +812,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
        /* seperate between the wait to the leave*/
        list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
-               if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
+               if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
                        wait_for_completion(&mcast->done);
 
        list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to