[PATCH v2] IB/ipatch: Use setup_timer and mod_timer

2015-03-03 Thread Vaishali Thakkar
Use timer API functions setup_timer and mod_timer instead
of structure assignments as they are standard way to set
the timer and to update the expire field of an active timer
respectively.

This is done using Coccinelle and semantic patch used for
this is as follows:

// 
@@
expression x,y,z,a,b;
@@

-init_timer (&x);
+setup_timer (&x, y, z);
+mod_timer (&a, b);
-x.function = y;
-x.data = z;
-x.expires = b;
-add_timer(&a);
// 

Signed-off-by: Vaishali Thakkar 
---
Changes since v1:
- Add useful comments which were removed in
  v1

 drivers/infiniband/hw/ipath/ipath_driver.c| 9 +++--
 drivers/infiniband/hw/ipath/ipath_init_chip.c | 8 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.c | 7 ++-
 3 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c 
b/drivers/infiniband/hw/ipath/ipath_driver.c
index bd0caed..7c5968a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -2300,12 +2300,9 @@ void ipath_set_led_override(struct ipath_devdata *dd, 
unsigned int val)
 */
if (atomic_inc_return(&dd->ipath_led_override_timer_active) == 1) {
/* Need to start timer */
-   init_timer(&dd->ipath_led_override_timer);
-   dd->ipath_led_override_timer.function =
-ipath_run_led_override;
-   dd->ipath_led_override_timer.data = (unsigned long) dd;
-   dd->ipath_led_override_timer.expires = jiffies + 1;
-   add_timer(&dd->ipath_led_override_timer);
+   setup_timer(&dd->ipath_led_override_timer,
+ipath_run_led_override, (unsigned long)dd);
+   mod_timer(&dd->ipath_led_override_timer, jiffies + 1);
} else
atomic_dec(&dd->ipath_led_override_timer_active);
 }
diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c 
b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index be2a60e..f59ca1b 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -950,13 +950,11 @@ int ipath_init_chip(struct ipath_devdata *dd, int reinit)
 * set up stats retrieval timer, even if we had errors
 * in last portion of setup
 */
-   init_timer(&dd->ipath_stats_timer);
-   dd->ipath_stats_timer.function = ipath_get_faststats;
-   dd->ipath_stats_timer.data = (unsigned long) dd;
+   setup_timer(&dd->ipath_stats_timer, ipath_get_faststats,
+   (unsigned long)dd);
/* every 5 seconds; */
-   dd->ipath_stats_timer.expires = jiffies + 5 * HZ;
+   mod_timer(&dd->ipath_stats_timer, jiffies + 5 * HZ);
/* takes ~16 seconds to overflow at full IB 4x bandwdith */
-   add_timer(&dd->ipath_stats_timer);
dd->ipath_stats_timer_active = 1;
}
 
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c 
b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 44ea939..41929ea 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1952,11 +1952,8 @@ static int enable_timer(struct ipath_devdata *dd)
 dd->ipath_gpio_mask);
}
 
-   init_timer(&dd->verbs_timer);
-   dd->verbs_timer.function = __verbs_timer;
-   dd->verbs_timer.data = (unsigned long)dd;
-   dd->verbs_timer.expires = jiffies + 1;
-   add_timer(&dd->verbs_timer);
+   setup_timer(&dd->verbs_timer, __verbs_timer, (unsigned long)dd);
+   mod_timer(&dd->verbs_timer, jiffies + 1);
 
return 0;
 }
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v4 14/19] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag

2015-03-03 Thread Weiny, Ira
Doug,

You have given me a lot to think about...  Comments below...

> > > >
> > > > While it is a different type of technology, standard verbs[*]
> > > > remains 100%
> > > compatible.  Unlike other verbs technologies user space software
> > > does not need any knowledge that the underlying device is not IB.
> > > For example, PR (and SA) queries, CM, rdmacm, and verbs calls themselves
> are all 100% IB compatible.
> > >
> > > Even if OPA is 100% standard verbs compatible which it does not
> > > appear to be, that does not make OPA an extra capability of an IBA device.
> >
> > I don't want to make it an extra capability of an IBA device.  I want to 
> > make it
> an extra capability of a "verbs" device.
> 
> And this, friends, is why it's bad to make both a link layer and an user 
> space API
> with the exact same name ;-).  Anyway, I get your point Ira and it makes sense
> to me.  However, I also get Hal's point.  Our track record on this particular
> issue is a bit wonky though.

Thanks for laying this out.  I too understand Hals point.

> 
> First we had InfiniBand.
> 
> Then came iWARP, and we used the transport type to differentiate it from an
> actual InfiniBand device, but left the underlying link layer listed as 
> InfiniBand.
> Then came RoCE, and we listed its transport type as InfiniBand, but changed
> the link layer to Ethernet.  Which left us in the oxymoronic position that 
> even
> though iWARP was over Ethernet, the tools said it was over InfiniBand, while
> RoCE was the only thing that listed Ethernet as the link layer.  We later 
> fixed
> that up with some hacks in tools to keep users from being confused and filing
> bugs.
> 
> Maybe this represents an opportunity to straighten some of this mess out.  If 
> I
> remember correctly, this is the matrix of technologies today:
> 
> TechnologyLinkLayer   Transport
> 
> InfiniBandInfiniBand  InfiniBand Verbs
> iWARP InfiniBand  iWARP Verbs (subset of IBV, with
>   specific connection establishment
>   requirements that don't exist with IBV)
> RoCE  EthernetInfiniBand Verbs (but with different
>   addressing because of the different
>   link layer)
> OPA   ?   InfiniBand Verbs

I think this is _relatively_ accurate.  The one exception is with the various 
IB verbs extensions which have been introduced.  While most are being pushed 
into the spec not all of them are in the spec prior to being in the kernel.  It 
makes keeping up with what "IB Verbs" really is difficult.

Mind you I'm not opposing having IB Verbs be flexible.  But I think we can 
accurately have multiple underlying technologies which support IB Verbs with 
various extensions.

> 
> It makes me wonder if we shouldn't make this matrix more accurate:
> 
> TechnologyLinkLayer   Transport
> 
> InfiniBandInfiniBand  InfiniBand Verbs
> iWARP EthernetiWARP Verbs
> RoCE  EthernetRoCE-v1 or RoCE-v2
> OPA   ?   OPA Verbs
> 
> With this sort of setup, the core ib_mad/ib_umad code would simply check the
> verbs type to see what support it can enable.  For IBV it would be the 
> existing
> support, for OPAV it would be the additional jumbo support.

OPA, to be compatible with IB Verbs, uses the same node types as InfiniBand 
verbs (1 == CA, 2 == Switch).  As such it returns the same Transport type.

> 
> I'm not sure how much we might expect a change like this to break existing
> software though, so maybe staightening this mess out is a non-starter.

I think this is going to break quite a bit.  I have prototyped setting OPA 
devices to "OPA Link Layer" and the perftest tools just fall over.  Any changes 
to the Link layer or the transport types will require a transition period for 
ULPs.

> 
> > > While it is a primary goal of the RDMA stack to have a common verbs
> > > API for various RDMA interconnects, each one is properly represented
> > > to allow it's unique characteristics to be exposed.
> >
> > The difference here is that we have maintained IB Verbs compatibility where
> other RDMA technologies did not.  We have tested many Verbs applications
> (both kernel and user space) and they function _without_ _modification_.
> >
> > Despite this compatibility we are still having this discussion.
> >
> > I can think of no other way to signal the MAD capability to the MAD stack
> which will preserve the verbs compatibility in the same way.
> 
> See above.  Define a new transport type, OPAVerbs, that is a superset of IBV
> and enable jumbo support when OPAV is the transport on the link.

But the transport type is not changing.  Once again we are attempting to be 
completely verbs compatible.  From the MAD stack POV the verbs calls in the 
kernel are not different.

Would it be acceptable if the result of my patch series was:

InfiniBand  InfiniBand  

[ANNOUNCE] libibumad 1.3.10.2 release

2015-03-03 Thread Hal Rosenstock
There is a new 1.3.10.2 release of libibumad.

Tarball is available in:
http://www.openfabrics.org/downloads/management/
(listed in http://www.openfabrics.org/downloads/management/latest.txt)

md5sum:
dd017844c55713bd335aa229e6841d41  libibumad-1.3.10.2.tar.gz

All component versions are from recent master branch. Full list of
changes is below.

Hal Rosenstock (2):
  umad.h: Remove umad_reg_flags from enum declaration
  configure.in: package version update for 1.3.10.2 release
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH libibumad] umad.h: Remove umad_reg_flags from enum declaration

2015-03-03 Thread Weiny, Ira
> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Hal Rosenstock
> Sent: Tuesday, March 03, 2015 12:23 PM
> To: linux-rdma (linux-rdma@vger.kernel.org)
> Cc: Weiny, Ira; ra...@mellanox.com
> Subject: [PATCH libibumad] umad.h: Remove umad_reg_flags from enum
> declaration
> 
> 
> This causes it to be a global variable which causes linking issues when used 
> by
> multiple files linked together. It results in multiple definition of
> `umad_reg_flags'.
> 
> Found-by: Rafi Weiner 
>

Reviewed-by: Ira Weiny 
 
> Signed-off-by: Hal Rosenstock 
> ---

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH libibumad] umad.h: Remove umad_reg_flags from enum declaration

2015-03-03 Thread Hal Rosenstock

This causes it to be a global variable which causes linking
issues when used by multiple files linked together. It results
in multiple definition of `umad_reg_flags'.

Found-by: Rafi Weiner 

Signed-off-by: Hal Rosenstock 
---
diff --git a/include/infiniband/umad.h b/include/infiniband/umad.h
index 6c3e7dc..1db4505 100644
--- a/include/infiniband/umad.h
+++ b/include/infiniband/umad.h
@@ -201,7 +201,7 @@ int umad_unregister(int portid, int agentid);
 
 enum {
UMAD_USER_RMPP = (1 << 0)
-} umad_reg_flags;
+};
 
 struct umad_reg_attr {
uint8_tmgmt_class;
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-03 Thread Erez Shitrit

On 3/2/2015 5:09 PM, Doug Ledford wrote:

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:

On 2/26/2015 6:27 PM, Doug Ledford wrote:

@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv 
*priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.


Can you please elaborate more here, I took a look in the mthca and 
didn't see that.
anyway, what i don't understand is why you need to do that now, the ah 
is already in the dead_ah_list so, in at the most 1 sec will be cleared 
and if the driver goes down your other hunk fixed that.



The bug happened when the driver was removed (via modprobe -r etc.), and
there were ah's in the dead_ah list, that was fixed by you in the
function ipoib_ib_dev_cleanup, the call that you added here is not
relevant to the bug (and IMHO is not needed at all)

I never said that this hunk was part of the original bug I saw before.


So, the the task of cleaning the dead_ah is already there, no need to
recall it, it will be called anyway 1 sec at the most from now.

You can try that, take of that call, no harm or memory leak will happened.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.


I can't see the reason to use the flush not from the stop_ah, meaning
without setting the IPOIB_STOP_REAPER, the flush can send twice the same
work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled

when you call cancel_delayed_work to work that can schedule itself, it
is not help, the work can be at the middle of the run and re-schedule
itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.


ah_flush, then flushes the workqueue to make sure anything that might
result in an ah getting freed is done, then flushes, then schedules a
new delayed flush_ah work 1 second later.  So, it does exactly what a
flush should do: it removes what there is currently to remove, and in
the case of a periodically scheduled garbage collection, schedules a new
periodic flush at the maximum interval.

It is not appropriate to call stop_ah at this point because it will
cancel the delayed work, flush the ahs, then never reschedule the
garbage collection.  If we called stop here, we would have to call start
later.  But that's not really necessary as the flush cancels the
scheduled work and reschedules it for a second later.


}

	if (level >= IPOIB_FLUSH_NORMAL)

@@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
ipoib_mcast_stop_thread(dev, 1);
ipoib_mcast_dev_flush(dev);

+	/*

+* All of our ah references aren't free until after
+* ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+* the neighbor garbage collection is stopped and reaped.
+* That should all be done now, so make a final ah flush.
+*/
+   ipoib_stop_ah(dev, 1);
+
ipoib_transport_dev_cleanup(dev);
}


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.ker

Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins

2015-03-03 Thread Erez Shitrit

On 3/2/2015 5:29 PM, Doug Ledford wrote:

On Sun, 2015-03-01 at 15:58 +0200, Erez Shitrit wrote:

On 2/22/2015 2:27 AM, Doug Ledford wrote:

Allow the ipoib layer to attempt to join all outstanding multicast
groups at once.  The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel.  Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join.  There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling.
If we completed successfully, we are done.  The join task now needs kicked
either by mcast_send or mcast_restart_task or mcast_start_thread, but
should not need started anytime else except when scheduling a backoff
attempt to rejoin.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying.  We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.

Signed-off-by: Doug Ledford 
---
   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 
-
   1 file changed, 82 insertions(+), 168 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 277e7ac7c4d..c670d9c2cda 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast 
*mcast,
return 0;
   }
   
-static int

-ipoib_mcast_sendonly_join_complete(int status,
-  struct ib_sa_multicast *multicast)
-{
-   struct ipoib_mcast *mcast = multicast->context;
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-   /*
-* We have to take the mutex to force mcast_sendonly_join to
-* return from ib_sa_multicast_join and set mcast->mc to a
-* valid value.  Otherwise we were racing with ourselves in
-* that we might fail here, but get a valid return from
-* ib_sa_multicast_join after we had cleared mcast->mc here,
-* resulting in mis-matched joins and leaves and a deadlock
-*/
-   mutex_lock(&mcast_mutex);
-
-   /* We trap for port events ourselves. */
-   if (status == -ENETRESET) {
-   status = 0;
-   goto out;
-   }
-
-   if (!status)
-   status = ipoib_mcast_join_finish(mcast, &multicast->rec);
-
-   if (status) {
-   if (mcast->logcount++ < 20)
-   ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
-   "join failed for %pI6, status %d\n",
-   mcast->mcmember.mgid.raw, status);
-
-   /* Flush out any queued packets */
-   netif_tx_lock_bh(dev);
-   while (!skb_queue_empty(&mcast->pkt_queue)) {
-   ++dev->stats.tx_dropped;
-   dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
-   }
-   netif_tx_unlock_bh(dev);
-   __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
-   } else {
-   mcast->backoff = 1;
-   mcast->delay_until = jiffies;
-   __ipoib_mcast_schedule_join_thread(priv, NULL, 0);
-   }
-out:
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   if (status)
-   mcast->mc = NULL;
-   complete(&mcast->done);
-   mutex_unlock(&mcast_mutex);
-   return status;
-}
-
-static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
-{
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-   struct ib_sa_mcmember_rec rec = {
-#if 0  /* Some SMs don't support send-only yet */
-   .join_state = 4
-#else
-   .join_state = 1
-#endif
-   };
-   int ret = 0;
-
-   if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-   ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
-   "multicast joins\n");
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   complete(&mcast->done);
-   return -ENODEV;
-   }
-
-   rec.mgid = mcast->mcmember.mgid;
-   rec.port_gid = priv->local_gid;
-   rec.pkey = cpu_to_be16

Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage

2015-03-03 Thread Erez Shitrit

On 3/2/2015 5:27 PM, Doug Ledford wrote:

On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote:

On 2/22/2015 2:27 AM, Doug Ledford wrote:

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it.  And when we did
complete it, sometimes we continued to touch the mcast entry after
the completion, opening us up to possible use after free issues.

This made it less than totally effective, and certainly made its use
confusing.  And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, but we never
cleared this flag, ever.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

   1) Remove the MCAST_JOIN_STARTED flag entirely
   2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
   3) Test mcast->mc directly to see if we have completed
  ib_sa_join_multicast (using IS_ERR_OR_NULL)
   4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
  the mcast->done completion struct
   5) Make sure that before calling complete(&mcast->done), we always clear
  the MCAST_FLAG_BUSY bit
   6) Take the mcast_mutex before we call ib_sa_multicast_join and also
  take the mutex in our join callback.  This forces
  ib_sa_multicast_join to return and set mcast->mc before we process
  the callback.  This way, our callback can safely clear mcast->mc
  if there is an error on the join and we will do the right thing as
  a result in mcast_dev_flush.
   7) Because we need the mutex to synchronize mcast->mc, we can no
  longer call mcast_sendonly_join directly from mcast_send and
  instead must add sendonly join processing to the mcast_join_task
   8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
  we have a running task.  We know when we need to reschedule our
  join task thread and don't need a flag to tell us.
   9) Add a helper for rescheduling the join task thread

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
alloc member
call callback
   ifconfig ib0 down
 wait_for_completion
  callback call completes
   wait_for_completion in
 mcast_dev_flush completes
   mcast->mc is PTR_ERR_OR_NULL
   so we skip ib_sa_leave_multicast
  return from callback
return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1   Thread 2 Thread 3
ib_sa_multicast_join
  ifconfig ib0 down
priv->broadcast = NULL
 join_complete
wait_for_completion
   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
   complete
   return -EAGAIN (making mcast->mc invalid)
call ib_sa_multicast_leave
on invalid mcast->mc, hang
forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail.  We do this before we complete
the mcast.  In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford 
---
   drivers/infiniband/ulp/ipoib/ipoib.h   |  11 +-
   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 
-
   2 files changed, 238 insertions(+), 128 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9ef432ae72e..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
   
   	IPOIB_MCAST_FLAG