Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-26 Thread Wendy Cheng
On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford dledf...@redhat.com wrote:

 On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

 Is it really possible for ib_sa_join_multicast() to
 return *after* its callback (ipoib_mcast_sendonly_join_complete and
 ipoib_mcast_join_complete) ?

 Yes.  They are both on work queues and ib_sa_join_multicast simply fires off 
 another workqueue task.  The scheduler is free to start that task instantly 
 if the workqueue isn't busy, and it often does (although not necessarily on 
 the same CPU).  Then it is a race to see who finishes first.


Ok, thanks for the explanation. I also googled and found the original
patch where the IPOIB_MCAST_JOIN_STARTED was added. This patch now
makes sense.

Acked-by: Wendy Cheng wendy.ch...@intel.com

On the other hand, I'm still puzzled why ib_sa_join_multicast() can't
be a blocking call (i.e. wait until callback is executed) - why would
IPOIB pay the price to work around these nasty issues ? But I guess
that is off-topic too much ..

BTW, thanks for the work. Our users will be doing if-up-down a lot for
power management, patches like these help !

-- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-25 Thread Wendy Cheng
On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote:

 So that's why in this patch we

 1) take a mutex to force ib_sa_join_multicast to return and us to set 
 mcast-mc to the proper return value before we process the join completion 
 callback
 2) always clear mcast-mc if there is any error since we can't call 
 ib_sa_multicast_leave
 3) always complete the mcast in case we are waiting on it
 4) only if our status is ENETRESET set our return to 0 so the ib core code 
 knows we acknowledged the event


We don't have IPOIB_MCAST_JOIN_STARTED (and the done completion
struct) in our code base (MPSS) yet ...I'm *not* n-acking this patch
but I find it hard to understand the ramifications. It has nothing to
do with this patch - actually the patch itself looks pretty ok (by
eyes).

The original IPOIB mcast flow, particularly its abnormal error path,
confuses me. Is it really possible for ib_sa_join_multicast() to
return *after* its callback (ipoib_mcast_sendonly_join_complete and
ipoib_mcast_join_complete) ? The mcast-done completion struct looks
dangerous as well.

I'll let other capable people to do the final call(s).

-- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-25 Thread Doug Ledford

 On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 
 On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote:
 
 So that's why in this patch we
 
 1) take a mutex to force ib_sa_join_multicast to return and us to set 
 mcast-mc to the proper return value before we process the join completion 
 callback
 2) always clear mcast-mc if there is any error since we can't call 
 ib_sa_multicast_leave
 3) always complete the mcast in case we are waiting on it
 4) only if our status is ENETRESET set our return to 0 so the ib core code 
 knows we acknowledged the event
 
 confuses me. Is it really possible for ib_sa_join_multicast() to
 return *after* its callback (ipoib_mcast_sendonly_join_complete and
 ipoib_mcast_join_complete) ?

Yes.  They are both on work queues and ib_sa_multicast_join simply fires off 
another workqueue task.  The scheduler is free to start that task instantly if 
the workqueue isn't busy, and it often does (although not necessarily on the 
same CPU).  Then it is a race to see who finishes first.

 The mcast-done completion struct looks
 dangerous as well.

It *was* dangerous ;-).  My patches cleaned it up.  Before my patches there 
were failure to completes, then during an early version of my patch there were 
double completes and failure to clear mcast-mc, then I figured out the race on 
the callback finishing before ib_sa_multicast_join and added the mutex and 
things suddenly started working much better ;-)

 I'll let other capable people to do the final call(s).
 
 -- Wendy
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


Sent from my iPad

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-19 Thread Doug Ledford
Excuse my top posting, I'm at a meeting and not using my normal mail client.

To answer your question, this is a confusing issue that I had to work through 
myself.

When we call ib_sa_multicast_join, it will create a record for us and return 
that.  That record is not valid for us to call ib_sa_multicast_leave until 
after the join has completed and our callback has acknowledged that join 
completion with a 0 return status from our callback.

At the time the record is created, ib_sa_multicast_join will also queue up the 
actual join.  When the join completes, they call our callback (and without my 
locking fixes, we can get the callback before ib_sa_multicast_join returns our 
record entry to us).

If our callback is called with the status already set, then the join has failed 
and will be handled that way when our callback returns.  In this case, the 
callback is informative for us, that's all.  When we trap -ENETRESET and put 
our return to 0, we are acknowledging that the join failed prior to our 
callback and that the join will not be completed and that mcast-mc as we have 
recorded is not valid.

If the status is 0 to start, and we set it to a non-0 value, then that also 
indicates that the join has failed, but in this case the core code must unroll 
the join that was completed up until the callback was done (and it does).  Any 
time this is the case, the member record returned from ib_sa_multicast_join is 
invalid and we can not call ib_sa_multicast_leave on the record or we will hang 
forever waiting for a completion that won't be coming.

Only if we get our callback entered with a 0 status and return with a 0 status 
does the member record that ib_sa_multicast_join returned to us become fully 
valid.  At this point, we must call ib_sa_multicast_leave on the record if we 
want it to go away.

So that's why in this patch we

1) take a mutex to force ib_sa_join_multicast to return and us to set mcast-mc 
to the proper return value before we process the join completion callback
2) always clear mcast-mc if there is any error since we can't call 
ib_sa_multicast_leave
3) always complete the mcast in case we are waiting on it
4) only if our status is ENETRESET set our return to 0 so the ib core code 
knows we acknowledged the event

Sent from my iPad

 On Aug 19, 2014, at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:
 
 On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote:
 Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
 added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
 was used.  We didn't always initialize the completion struct before we
 set the flag, and we didn't always call complete on the completion
 struct from all paths that complete it.  This made it less than totally
 effective, and certainly made its use confusing.  And in the flush
 function we would use the presence of this flag to signal that we should
 wait on the completion struct, but we never cleared this flag, ever.
 This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
 flag to mean two different things: we have a join in flight, and we have
 succeeded in getting an ib_sa_join_multicast.
 
 
 [snip]
 
 diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
 b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 index 7e9cd39b5ef..f5e8da530d9 100644
 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
 @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
struct ipoib_mcast *mcast = multicast-context;
struct net_device *dev = mcast-dev;
 
 +   /*
 +* We have to take the mutex to force mcast_sendonly_join to
 +* return from ib_sa_multicast_join and set mcast-mc to a
 +* valid value.  Otherwise we were racing with ourselves in
 +* that we might fail here, but get a valid return from
 +* ib_sa_multicast_join after we had cleared mcast-mc here,
 +* resulting in mis-matched joins and leaves and a deadlock
 +*/
 +   mutex_lock(mcast_mutex);
 +
/* We trap for port events ourselves. */
if (status == -ENETRESET)
 -   return 0;
 +   goto out;
 
 
 [snip]
 
 
 +out:
 +   clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags);
 +   if (status)
 +   mcast-mc = NULL;
 +   complete(mcast-done);
 +   if (status == -ENETRESET)
 +   status = 0;
 +   mutex_unlock(mcast_mutex);
return status;
 }
 
 
 I have not gone thru the whole patch yet. However, here is something quick ...
 
 Would this return 0 if (status == -ENETRESET) be wrong in the
 original code ?  Say . when it returns to process_group_error(),
 without proper error return code, it'll skip ib_sa_free_multicast().
 Should we worry about it ?
 
 -- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to 

[PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage

2014-08-12 Thread Doug Ledford
Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
was used.  We didn't always initialize the completion struct before we
set the flag, and we didn't always call complete on the completion
struct from all paths that complete it.  This made it less than totally
effective, and certainly made its use confusing.  And in the flush
function we would use the presence of this flag to signal that we should
wait on the completion struct, but we never cleared this flag, ever.
This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
flag to mean two different things: we have a join in flight, and we have
succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast-mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast-done completion struct
5) Make sure that before calling complete(mcast-done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_multicast_join and also
take the mutex in our join callback.  This forces ib_sa_multicast_join
to return and set mcast-mc before we process the callback.  This way,
our callback can safely clear mcast-mc if there is an error on the join
and we will do the right thing as a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast-mc, we can no longer
call mcast_sendonly_join directly from mcast_send and instead must add
sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
 ifconfig ib0 down
 wait_for_completion
callback call completes
 wait_for_completion in
 mcast_dev_flush completes
   mcast-mc is PTR_ERR_OR_NULL
   so we skip ib_sa_leave_multicast
return from callback
  return from ib_sa_join_multicast
set mcast-mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1   Thread 2 Thread 3
ib_sa_multicast_join
ifconfig ib0 down
priv-broadcast = NULL
   join_complete
wait_for_completion
   mcast-mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast-mc
   complete
   return -EAGAIN (making mcast-mc invalid)
call ib_sa_multicast_leave
on invalid mcast-mc, hang
forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast-mc to be valid at the time we run
the callback.  This allows us to clear mcast-mc if there is an error
and the join is going to fail.  We do this before we complete the mcast.
In this way, mcast_dev_flush always sees consistent state in regards to
mcast-mc membership at the time that the wait_for_completion() returns.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 -
 2 files changed, 101 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3edce617c31..43840971ad8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
IPOIB_MCAST_FLAG_FOUND= 0,  /* used in set_multicast_list */
IPOIB_MCAST_FLAG_SENDONLY = 1,
-   IPOIB_MCAST_FLAG_BUSY = 2,  /* joining or already joined */
+   /*
+* For IPOIB_MCAST_FLAG_BUSY
+* When set, in flight join and mcast-mc is unreliable
+* When clear and mcast-mc IS_ERR_OR_NULL, need to restart or
+*   haven't started yet
+* When clear and mcast-mc is valid pointer, join was successful
+