Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford dledf...@redhat.com wrote: On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete) ? Yes. They are both on work queues and ib_sa_join_multicast simply fires off another workqueue task. The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU). Then it is a race to see who finishes first. Ok, thanks for the explanation. I also googled and found the original patch where the IPOIB_MCAST_JOIN_STARTED was added. This patch now makes sense. Acked-by: Wendy Cheng wendy.ch...@intel.com On the other hand, I'm still puzzled why ib_sa_join_multicast() can't be a blocking call (i.e. wait until callback is executed) - why would IPOIB pay the price to work around these nasty issues ? But I guess that is off-topic too much .. BTW, thanks for the work. Our users will be doing if-up-down a lot for power management, patches like these help ! -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote: So that's why in this patch we 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast-mc to the proper return value before we process the join completion callback 2) always clear mcast-mc if there is any error since we can't call ib_sa_multicast_leave 3) always complete the mcast in case we are waiting on it 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event We don't have IPOIB_MCAST_JOIN_STARTED (and the done completion struct) in our code base (MPSS) yet ...I'm *not* n-acking this patch but I find it hard to understand the ramifications. It has nothing to do with this patch - actually the patch itself looks pretty ok (by eyes). The original IPOIB mcast flow, particularly its abnormal error path, confuses me. Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete) ? The mcast-done completion struct looks dangerous as well. I'll let other capable people to do the final call(s). -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
On Aug 25, 2014, at 2:51 PM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford dledf...@redhat.com wrote: So that's why in this patch we 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast-mc to the proper return value before we process the join completion callback 2) always clear mcast-mc if there is any error since we can't call ib_sa_multicast_leave 3) always complete the mcast in case we are waiting on it 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event confuses me. Is it really possible for ib_sa_join_multicast() to return *after* its callback (ipoib_mcast_sendonly_join_complete and ipoib_mcast_join_complete) ? Yes. They are both on work queues and ib_sa_multicast_join simply fires off another workqueue task. The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU). Then it is a race to see who finishes first. The mcast-done completion struct looks dangerous as well. It *was* dangerous ;-). My patches cleaned it up. Before my patches there were failure to completes, then during an early version of my patch there were double completes and failure to clear mcast-mc, then I figured out the race on the callback finishing before ib_sa_multicast_join and added the mutex and things suddenly started working much better ;-) I'll let other capable people to do the final call(s). -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Sent from my iPad -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
Excuse my top posting, I'm at a meeting and not using my normal mail client. To answer your question, this is a confusing issue that I had to work through myself. When we call ib_sa_multicast_join, it will create a record for us and return that. That record is not valid for us to call ib_sa_multicast_leave until after the join has completed and our callback has acknowledged that join completion with a 0 return status from our callback. At the time the record is created, ib_sa_multicast_join will also queue up the actual join. When the join completes, they call our callback (and without my locking fixes, we can get the callback before ib_sa_multicast_join returns our record entry to us). If our callback is called with the status already set, then the join has failed and will be handled that way when our callback returns. In this case, the callback is informative for us, that's all. When we trap -ENETRESET and put our return to 0, we are acknowledging that the join failed prior to our callback and that the join will not be completed and that mcast-mc as we have recorded is not valid. If the status is 0 to start, and we set it to a non-0 value, then that also indicates that the join has failed, but in this case the core code must unroll the join that was completed up until the callback was done (and it does). Any time this is the case, the member record returned from ib_sa_multicast_join is invalid and we can not call ib_sa_multicast_leave on the record or we will hang forever waiting for a completion that won't be coming. Only if we get our callback entered with a 0 status and return with a 0 status does the member record that ib_sa_multicast_join returned to us become fully valid. At this point, we must call ib_sa_multicast_leave on the record if we want it to go away. So that's why in this patch we 1) take a mutex to force ib_sa_join_multicast to return and us to set mcast-mc to the proper return value before we process the join completion callback 2) always clear mcast-mc if there is any error since we can't call ib_sa_multicast_leave 3) always complete the mcast in case we are waiting on it 4) only if our status is ENETRESET set our return to 0 so the ib core code knows we acknowledged the event Sent from my iPad On Aug 19, 2014, at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford dledf...@redhat.com wrote: Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. [snip] diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 7e9cd39b5ef..f5e8da530d9 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status, struct ipoib_mcast *mcast = multicast-context; struct net_device *dev = mcast-dev; + /* +* We have to take the mutex to force mcast_sendonly_join to +* return from ib_sa_multicast_join and set mcast-mc to a +* valid value. Otherwise we were racing with ourselves in +* that we might fail here, but get a valid return from +* ib_sa_multicast_join after we had cleared mcast-mc here, +* resulting in mis-matched joins and leaves and a deadlock +*/ + mutex_lock(mcast_mutex); + /* We trap for port events ourselves. */ if (status == -ENETRESET) - return 0; + goto out; [snip] +out: + clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); + if (status) + mcast-mc = NULL; + complete(mcast-done); + if (status == -ENETRESET) + status = 0; + mutex_unlock(mcast_mutex); return status; } I have not gone thru the whole patch yet. However, here is something quick ... Would this return 0 if (status == -ENETRESET) be wrong in the original code ? Say . when it returns to process_group_error(), without proper error return code, it'll skip ib_sa_free_multicast(). Should we worry about it ? -- Wendy -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to
[PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 1) Remove the MCAST_JOIN_STARTED flag entirely 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight 3) Test on mcast-mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL) 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast-done completion struct 5) Make sure that before calling complete(mcast-done), we always clear the MCAST_FLAG_BUSY bit 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast-mc before we process the callback. This way, our callback can safely clear mcast-mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush. 7) Because we need the mutex to synchronize mcast-mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this: Thread 1 Thread 2 ib_sa_join_multicast (as part of running restart mcast task) alloc member call callback ifconfig ib0 down wait_for_completion callback call completes wait_for_completion in mcast_dev_flush completes mcast-mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast return from callback return from ib_sa_join_multicast set mcast-mc = return from ib_sa_multicast We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this: Thread 1 Thread 2 Thread 3 ib_sa_multicast_join ifconfig ib0 down priv-broadcast = NULL join_complete wait_for_completion mcast-mc is not yet set, so don't clear return from ib_sa_join_multicast and set mcast-mc complete return -EAGAIN (making mcast-mc invalid) call ib_sa_multicast_leave on invalid mcast-mc, hang forever By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast-mc to be valid at the time we run the callback. This allows us to clear mcast-mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast-mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 - 2 files changed, 101 insertions(+), 57 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 3edce617c31..43840971ad8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -98,9 +98,15 @@ enum { IPOIB_MCAST_FLAG_FOUND= 0, /* used in set_multicast_list */ IPOIB_MCAST_FLAG_SENDONLY = 1, - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + /* +* For IPOIB_MCAST_FLAG_BUSY +* When set, in flight join and mcast-mc is unreliable +* When clear and mcast-mc IS_ERR_OR_NULL, need to restart or +* haven't started yet +* When clear and mcast-mc is valid pointer, join was successful +