Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-16 Thread Erez Shitrit

On 3/15/2015 8:42 PM, Doug Ledford wrote:

On Mar 13, 2015, at 1:39 AM, Or Gerlitz gerlitz...@gmail.com wrote:

On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit ere...@dev.mellanox.co.il wrote:

On 3/2/2015 5:09 PM, Doug Ledford wrote:

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:

On 2/26/2015 6:27 PM, Doug Ledford wrote:

@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
ipoib_dev_priv *priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.


Can you please elaborate more here, I took a look in the mthca and didn't
see that.
anyway, what i don't understand is why you need to do that now, the ah is
already in the dead_ah_list so, in at the most 1 sec will be cleared and if
the driver goes down your other hunk fixed that.

Doug, ten days and no response from you... lets finalize the review on
this series so we have it safely done for 4.1 -- on top of it Erez
prepares a set of IPoIB fixes from our internal tree and we want that
for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources 
freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
 switch (ah-type) {
 case MTHCA_AH_ON_HCA:
 mthca_free(dev-av_table.alloc,
(ah-avdma - dev-av_table.ddr_av_base) /
MTHCA_AV_SIZE);
 break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets 
called when we destroy an ah (depending on the card of course).  So, that 
represents one case where freeing the resources in a non-lazy fashion has a 
direct benefit.  And there is no cited drawback to freeing the resources in a 
non-lazy fashion on a net event, so I don’t see what there is to discuss 
further on the issue.

sorry, but i still don't see the connection to the device type.
It will be deleted/freed with the regular flow, like it does in the rest 
of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so 
why here it should be directly after the event?



—
Doug Ledford dledf...@redhat.com
GPG Key ID: 0E572FDD







--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-16 Thread Erez Shitrit

On 3/16/2015 6:06 PM, Doug Ledford wrote:

On Mar 16, 2015, at 8:24 AM, Erez Shitrit ere...@dev.mellanox.co.il wrote:

On 3/15/2015 8:42 PM, Doug Ledford wrote:

Doug, ten days and no response from you... lets finalize the review on
this series so we have it safely done for 4.1 -- on top of it Erez
prepares a set of IPoIB fixes from our internal tree and we want that
for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources 
freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
 switch (ah-type) {
 case MTHCA_AH_ON_HCA:
 mthca_free(dev-av_table.alloc,
(ah-avdma - dev-av_table.ddr_av_base) /
MTHCA_AV_SIZE);
 break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets 
called when we destroy an ah (depending on the card of course).  So, that 
represents one case where freeing the resources in a non-lazy fashion has a 
direct benefit.  And there is no cited drawback to freeing the resources in a 
non-lazy fashion on a net event, so I don’t see what there is to discuss 
further on the issue.

sorry, but i still don't see the connection to the device type.
It will be deleted/freed with the regular flow, like it does in the rest of the 
life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it 
should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is 
when there is a performance benefit to batching.  There is no performance 
benefit to batching here.  And because on certain hardware the action frees 
resources on the card, which are limited, doing non-lazy deletion can be 
beneficial.  Given that there is no downside to doing the deletions in a 
non-lazy fashion, and that there can be an upside depending on hardware, there 
is no reason to stick with the lazy deletions.
OK, i understand your point, not sure why it is not always with the ah 
deletion, anyway it is harmless here.



—
Doug Ledford dledf...@redhat.com
GPG Key ID: 0E572FDD







--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-16 Thread Doug Ledford

 On Mar 16, 2015, at 8:24 AM, Erez Shitrit ere...@dev.mellanox.co.il wrote:
 
 On 3/15/2015 8:42 PM, Doug Ledford wrote:
 
 Doug, ten days and no response from you... lets finalize the review on
 this series so we have it safely done for 4.1 -- on top of it Erez
 prepares a set of IPoIB fixes from our internal tree and we want that
 for 4.1 too. Please address.
 I didn’t have much to say here.  I said that mthca can have card resources 
 freed by this call, which is backed up by this code in mthca_ah.c
 
 int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
 {
switch (ah-type) {
case MTHCA_AH_ON_HCA:
mthca_free(dev-av_table.alloc,
   (ah-avdma - dev-av_table.ddr_av_base) /
   MTHCA_AV_SIZE);
break;
 
 
 I’m not entirely sure how Erez missed that, but it’s there and it’s what 
 gets called when we destroy an ah (depending on the card of course).  So, 
 that represents one case where freeing the resources in a non-lazy fashion 
 has a direct benefit.  And there is no cited drawback to freeing the 
 resources in a non-lazy fashion on a net event, so I don’t see what there is 
 to discuss further on the issue.
 sorry, but i still don't see the connection to the device type.
 It will be deleted/freed with the regular flow, like it does in the rest of 
 the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here 
 it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is 
when there is a performance benefit to batching.  There is no performance 
benefit to batching here.  And because on certain hardware the action frees 
resources on the card, which are limited, doing non-lazy deletion can be 
beneficial.  Given that there is no downside to doing the deletions in a 
non-lazy fashion, and that there can be an upside depending on hardware, there 
is no reason to stick with the lazy deletions.

—
Doug Ledford dledf...@redhat.com
GPG Key ID: 0E572FDD





--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-15 Thread Doug Ledford

 On Mar 13, 2015, at 1:39 AM, Or Gerlitz gerlitz...@gmail.com wrote:
 
 On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit ere...@dev.mellanox.co.il 
 wrote:
 On 3/2/2015 5:09 PM, Doug Ledford wrote:
 
 On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
 
 On 2/26/2015 6:27 PM, Doug Ledford wrote:
 
 @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
 ipoib_dev_priv *priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
 +   ipoib_flush_ah(dev, 0);
 
 Why do you need to call the flush function here?
 
 To remove all of the ah's that were reduced to a 0 refcount by the
 previous two functions prior to restarting operations.  When we remove
 an ah, it calls ib_destroy_ah which calls all the way down into the low
 level driver.  This was to make sure that old, stale data was removed
 all the way down to the card level before we started new queries for
 paths and ahs.
 
 Yes. but it is not needed.
 
 That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
 isn't needed but doesn't hurt either.  For older cards (in particular,
 mthca), the driver actually frees up card resources at the time of the
 call.
 
 
 Can you please elaborate more here, I took a look in the mthca and didn't
 see that.
 anyway, what i don't understand is why you need to do that now, the ah is
 already in the dead_ah_list so, in at the most 1 sec will be cleared and if
 the driver goes down your other hunk fixed that.
 
 Doug, ten days and no response from you... lets finalize the review on
 this series so we have it safely done for 4.1 -- on top of it Erez
 prepares a set of IPoIB fixes from our internal tree and we want that
 for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources 
freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
switch (ah-type) {
case MTHCA_AH_ON_HCA:
mthca_free(dev-av_table.alloc,
   (ah-avdma - dev-av_table.ddr_av_base) /
   MTHCA_AV_SIZE);
break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets 
called when we destroy an ah (depending on the card of course).  So, that 
represents one case where freeing the resources in a non-lazy fashion has a 
direct benefit.  And there is no cited drawback to freeing the resources in a 
non-lazy fashion on a net event, so I don’t see what there is to discuss 
further on the issue.

—
Doug Ledford dledf...@redhat.com
GPG Key ID: 0E572FDD







signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-13 Thread Or Gerlitz
On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit ere...@dev.mellanox.co.il wrote:
 On 3/2/2015 5:09 PM, Doug Ledford wrote:

 On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:

 On 2/26/2015 6:27 PM, Doug Ledford wrote:

 @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
 ipoib_dev_priv *priv,
 if (level == IPOIB_FLUSH_LIGHT) {
 ipoib_mark_paths_invalid(dev);
 ipoib_mcast_dev_flush(dev);
 +   ipoib_flush_ah(dev, 0);

 Why do you need to call the flush function here?

 To remove all of the ah's that were reduced to a 0 refcount by the
 previous two functions prior to restarting operations.  When we remove
 an ah, it calls ib_destroy_ah which calls all the way down into the low
 level driver.  This was to make sure that old, stale data was removed
 all the way down to the card level before we started new queries for
 paths and ahs.

 Yes. but it is not needed.

 That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
 isn't needed but doesn't hurt either.  For older cards (in particular,
 mthca), the driver actually frees up card resources at the time of the
 call.


 Can you please elaborate more here, I took a look in the mthca and didn't
 see that.
 anyway, what i don't understand is why you need to do that now, the ah is
 already in the dead_ah_list so, in at the most 1 sec will be cleared and if
 the driver goes down your other hunk fixed that.

Doug, ten days and no response from you... lets finalize the review on
this series so we have it safely done for 4.1 -- on top of it Erez
prepares a set of IPoIB fixes from our internal tree and we want that
for 4.1 too. Please address.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-03 Thread Erez Shitrit

On 3/2/2015 5:09 PM, Doug Ledford wrote:

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:

On 2/26/2015 6:27 PM, Doug Ledford wrote:

@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv 
*priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.


Can you please elaborate more here, I took a look in the mthca and 
didn't see that.
anyway, what i don't understand is why you need to do that now, the ah 
is already in the dead_ah_list so, in at the most 1 sec will be cleared 
and if the driver goes down your other hunk fixed that.



The bug happened when the driver was removed (via modprobe -r etc.), and
there were ah's in the dead_ah list, that was fixed by you in the
function ipoib_ib_dev_cleanup, the call that you added here is not
relevant to the bug (and IMHO is not needed at all)

I never said that this hunk was part of the original bug I saw before.


So, the the task of cleaning the dead_ah is already there, no need to
recall it, it will be called anyway 1 sec at the most from now.

You can try that, take of that call, no harm or memory leak will happened.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.


I can't see the reason to use the flush not from the stop_ah, meaning
without setting the IPOIB_STOP_REAPER, the flush can send twice the same
work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled

when you call cancel_delayed_work to work that can schedule itself, it
is not help, the work can be at the middle of the run and re-schedule
itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.


ah_flush, then flushes the workqueue to make sure anything that might
result in an ah getting freed is done, then flushes, then schedules a
new delayed flush_ah work 1 second later.  So, it does exactly what a
flush should do: it removes what there is currently to remove, and in
the case of a periodically scheduled garbage collection, schedules a new
periodic flush at the maximum interval.

It is not appropriate to call stop_ah at this point because it will
cancel the delayed work, flush the ahs, then never reschedule the
garbage collection.  If we called stop here, we would have to call start
later.  But that's not really necessary as the flush cancels the
scheduled work and reschedules it for a second later.


}

	if (level = IPOIB_FLUSH_NORMAL)

@@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
ipoib_mcast_stop_thread(dev, 1);
ipoib_mcast_dev_flush(dev);

+	/*

+* All of our ah references aren't free until after
+* ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+* the neighbor garbage collection is stopped and reaped.
+* That should all be done now, so make a final ah flush.
+*/
+   ipoib_stop_ah(dev, 1);
+
ipoib_transport_dev_cleanup(dev);
}


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to 

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-03-02 Thread Doug Ledford
On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
 On 2/26/2015 6:27 PM, Doug Ledford wrote:
 
  @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct 
  ipoib_dev_priv *priv,
if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
  + ipoib_flush_ah(dev, 0);
  Why do you need to call the flush function here?
  To remove all of the ah's that were reduced to a 0 refcount by the
  previous two functions prior to restarting operations.  When we remove
  an ah, it calls ib_destroy_ah which calls all the way down into the low
  level driver.  This was to make sure that old, stale data was removed
  all the way down to the card level before we started new queries for
  paths and ahs.
 
 Yes. but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.

 The bug happened when the driver was removed (via modprobe -r etc.), and 
 there were ah's in the dead_ah list, that was fixed by you in the 
 function ipoib_ib_dev_cleanup, the call that you added here is not 
 relevant to the bug (and IMHO is not needed at all)

I never said that this hunk was part of the original bug I saw before.

 So, the the task of cleaning the dead_ah is already there, no need to 
 recall it, it will be called anyway 1 sec at the most from now.
 
 You can try that, take of that call, no harm or memory leak will happened.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.

  I can't see the reason to use the flush not from the stop_ah, meaning
  without setting the IPOIB_STOP_REAPER, the flush can send twice the same
  work.
  No, it can't.  The ah flush routine does not search through ahs to find
  ones to flush.  When you delete neighbors and mcasts, they release their
  references to ahs.  When the refcount goes to 0, the put routine puts
  the ah on the to-be-deleted ah list.  All the flush does is take that
  list and delete the items.  If you run the flush twice, the first run
  deletes all the items on the to-be-deleted list, the second run sees an
  empty list and does nothing.
 
  As for using flush versus stop: the flush function cancels any delayed
  ah_flush work so that it isn't racing with the normally scheduled
 
 when you call cancel_delayed_work to work that can schedule itself, it 
 is not help, the work can be at the middle of the run and re-schedule 
 itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.

 
  ah_flush, then flushes the workqueue to make sure anything that might
  result in an ah getting freed is done, then flushes, then schedules a
  new delayed flush_ah work 1 second later.  So, it does exactly what a
  flush should do: it removes what there is currently to remove, and in
  the case of a periodically scheduled garbage collection, schedules a new
  periodic flush at the maximum interval.
 
  It is not appropriate to call stop_ah at this point because it will
  cancel the delayed work, flush the ahs, then never reschedule the
  garbage collection.  If we called stop here, we would have to call start
  later.  But that's not really necessary as the flush cancels the
  scheduled work and reschedules it for a second later.
 
}
 
if (level = IPOIB_FLUSH_NORMAL)
  @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
ipoib_mcast_stop_thread(dev, 1);
ipoib_mcast_dev_flush(dev);
 
  + /*
  +  * All of our ah references aren't free until after
  +  * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
  +  * the neighbor garbage collection is stopped and reaped.
  +  * That should all be done now, so make a final ah flush.
  +  */
  + ipoib_stop_ah(dev, 1);
  +
ipoib_transport_dev_cleanup(dev);
 }
 
  --
  To unsubscribe from this list: send the line unsubscribe linux-rdma in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD




signature.asc
Description: This is a digitally signed message part


Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-02-28 Thread Erez Shitrit

On 2/26/2015 6:27 PM, Doug Ledford wrote:

On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:

On 2/22/2015 2:26 AM, Doug Ledford wrote:

Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

Because neighbors and mcast entries can each have a reference on any
given ah, we must make sure to free all of those first before our ah
will actually have a 0 refcount and be able to be reaped.

This factoring is needed in preparation for having per-device work
queues.  The original per-device workqueue code resulted in the following
error message:

ibdev: ib_dealloc_pd failed

That error was tracked down to this issue.  With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated.  This code
puts the explicit flushes in the needed places to avoid that problem.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 
-
   1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..cb02466a0eb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
   round_jiffies_relative(HZ));
   }
   
+static void ipoib_flush_ah(struct net_device *dev, int flush)

+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   cancel_delayed_work(priv-ah_reap_task);
+   if (flush)
+   flush_workqueue(ipoib_workqueue);
+   ipoib_reap_ah(priv-ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev, int flush)
+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   set_bit(IPOIB_STOP_REAPER, priv-flags);
+   ipoib_flush_ah(dev, flush);
+}
+
   static void ipoib_ib_tx_timer_func(unsigned long ctx)
   {
drain_tx_cq((struct net_device *)ctx);
@@ -877,24 +895,7 @@ timeout:
if (ib_modify_qp(priv-qp, qp_attr, IB_QP_STATE))
ipoib_warn(priv, Failed to modify QP to RESET state\n);
   
-	/* Wait for all AHs to be reaped */

-   set_bit(IPOIB_STOP_REAPER, priv-flags);
-   cancel_delayed_work(priv-ah_reap_task);
-   if (flush)
-   flush_workqueue(ipoib_workqueue);
-
-   begin = jiffies;
-
-   while (!list_empty(priv-dead_ahs)) {
-   __ipoib_reap_ah(dev);
-
-   if (time_after(jiffies, begin + HZ)) {
-   ipoib_warn(priv, timing out; will leak address 
handles\n);
-   break;
-   }
-
-   msleep(1);
-   }
+   ipoib_flush_ah(dev, flush);
   
   	ib_req_notify_cq(priv-recv_cq, IB_CQ_NEXT_COMP);
   
@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,

if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.


Yes. but it is not needed.
The bug happened when the driver was removed (via modprobe -r etc.), and 
there were ah's in the dead_ah list, that was fixed by you in the 
function ipoib_ib_dev_cleanup, the call that you added here is not 
relevant to the bug (and IMHO is not needed at all)
So, the the task of cleaning the dead_ah is already there, no need to 
recall it, it will be called anyway 1 sec at the most from now.


You can try that, take of that call, no harm or memory leak will happened.


I can't see the reason to use the flush not from the stop_ah, meaning
without setting the IPOIB_STOP_REAPER, the flush can send twice the same
work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled


when you call cancel_delayed_work to work that can schedule itself, it 

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-02-26 Thread Doug Ledford
On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
 On 2/22/2015 2:26 AM, Doug Ledford wrote:
  Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
  appropriate times to flush out all remaining ah entries before we shut
  the device down.
 
  Because neighbors and mcast entries can each have a reference on any
  given ah, we must make sure to free all of those first before our ah
  will actually have a 0 refcount and be able to be reaped.
 
  This factoring is needed in preparation for having per-device work
  queues.  The original per-device workqueue code resulted in the following
  error message:
 
  ibdev: ib_dealloc_pd failed
 
  That error was tracked down to this issue.  With the changes to which
  workqueues were flushed when, there were no flushes of the per device
  workqueue after the last ah's were freed, resulting in an attempt to
  dealloc the pd with outstanding resources still allocated.  This code
  puts the explicit flushes in the needed places to avoid that problem.
 
  Signed-off-by: Doug Ledford dledf...@redhat.com
  ---
drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 
  -
1 file changed, 28 insertions(+), 18 deletions(-)
 
  diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
  b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  index 72626c34817..cb02466a0eb 100644
  --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
  @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
 round_jiffies_relative(HZ));
}

  +static void ipoib_flush_ah(struct net_device *dev, int flush)
  +{
  +   struct ipoib_dev_priv *priv = netdev_priv(dev);
  +
  +   cancel_delayed_work(priv-ah_reap_task);
  +   if (flush)
  +   flush_workqueue(ipoib_workqueue);
  +   ipoib_reap_ah(priv-ah_reap_task.work);
  +}
  +
  +static void ipoib_stop_ah(struct net_device *dev, int flush)
  +{
  +   struct ipoib_dev_priv *priv = netdev_priv(dev);
  +
  +   set_bit(IPOIB_STOP_REAPER, priv-flags);
  +   ipoib_flush_ah(dev, flush);
  +}
  +
static void ipoib_ib_tx_timer_func(unsigned long ctx)
{
  drain_tx_cq((struct net_device *)ctx);
  @@ -877,24 +895,7 @@ timeout:
  if (ib_modify_qp(priv-qp, qp_attr, IB_QP_STATE))
  ipoib_warn(priv, Failed to modify QP to RESET state\n);

  -   /* Wait for all AHs to be reaped */
  -   set_bit(IPOIB_STOP_REAPER, priv-flags);
  -   cancel_delayed_work(priv-ah_reap_task);
  -   if (flush)
  -   flush_workqueue(ipoib_workqueue);
  -
  -   begin = jiffies;
  -
  -   while (!list_empty(priv-dead_ahs)) {
  -   __ipoib_reap_ah(dev);
  -
  -   if (time_after(jiffies, begin + HZ)) {
  -   ipoib_warn(priv, timing out; will leak address 
  handles\n);
  -   break;
  -   }
  -
  -   msleep(1);
  -   }
  +   ipoib_flush_ah(dev, flush);

  ib_req_notify_cq(priv-recv_cq, IB_CQ_NEXT_COMP);

  @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct 
  ipoib_dev_priv *priv,
  if (level == IPOIB_FLUSH_LIGHT) {
  ipoib_mark_paths_invalid(dev);
  ipoib_mcast_dev_flush(dev);
  +   ipoib_flush_ah(dev, 0);
 
 Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

 I can't see the reason to use the flush not from the stop_ah, meaning 
 without setting the IPOIB_STOP_REAPER, the flush can send twice the same 
 work.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled
ah_flush, then flushes the workqueue to make sure anything that might
result in an ah getting freed is done, then flushes, then schedules a
new delayed flush_ah work 1 second later.  So, it does exactly what a
flush should do: it removes what there is currently to remove, and in
the case of a periodically scheduled garbage collection, schedules a new
periodic flush at the maximum interval.

It is not appropriate to call stop_ah at this point because it will
cancel the delayed work, flush the ahs, then never reschedule the
garbage collection.  If we called stop here, 

Re: [PATCH 1/9] IB/ipoib: factor out ah flushing

2015-02-26 Thread Erez Shitrit

On 2/22/2015 2:26 AM, Doug Ledford wrote:

Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

Because neighbors and mcast entries can each have a reference on any
given ah, we must make sure to free all of those first before our ah
will actually have a 0 refcount and be able to be reaped.

This factoring is needed in preparation for having per-device work
queues.  The original per-device workqueue code resulted in the following
error message:

ibdev: ib_dealloc_pd failed

That error was tracked down to this issue.  With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated.  This code
puts the explicit flushes in the needed places to avoid that problem.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
  drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 -
  1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..cb02466a0eb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
   round_jiffies_relative(HZ));
  }
  
+static void ipoib_flush_ah(struct net_device *dev, int flush)

+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   cancel_delayed_work(priv-ah_reap_task);
+   if (flush)
+   flush_workqueue(ipoib_workqueue);
+   ipoib_reap_ah(priv-ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev, int flush)
+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+   set_bit(IPOIB_STOP_REAPER, priv-flags);
+   ipoib_flush_ah(dev, flush);
+}
+
  static void ipoib_ib_tx_timer_func(unsigned long ctx)
  {
drain_tx_cq((struct net_device *)ctx);
@@ -877,24 +895,7 @@ timeout:
if (ib_modify_qp(priv-qp, qp_attr, IB_QP_STATE))
ipoib_warn(priv, Failed to modify QP to RESET state\n);
  
-	/* Wait for all AHs to be reaped */

-   set_bit(IPOIB_STOP_REAPER, priv-flags);
-   cancel_delayed_work(priv-ah_reap_task);
-   if (flush)
-   flush_workqueue(ipoib_workqueue);
-
-   begin = jiffies;
-
-   while (!list_empty(priv-dead_ahs)) {
-   __ipoib_reap_ah(dev);
-
-   if (time_after(jiffies, begin + HZ)) {
-   ipoib_warn(priv, timing out; will leak address 
handles\n);
-   break;
-   }
-
-   msleep(1);
-   }
+   ipoib_flush_ah(dev, flush);
  
  	ib_req_notify_cq(priv-recv_cq, IB_CQ_NEXT_COMP);
  
@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,

if (level == IPOIB_FLUSH_LIGHT) {
ipoib_mark_paths_invalid(dev);
ipoib_mcast_dev_flush(dev);
+   ipoib_flush_ah(dev, 0);


Why do you need to call the flush function here?
I can't see the reason to use the flush not from the stop_ah, meaning 
without setting the IPOIB_STOP_REAPER, the flush can send twice the same 
work.



}
  
  	if (level = IPOIB_FLUSH_NORMAL)

@@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
ipoib_mcast_stop_thread(dev, 1);
ipoib_mcast_dev_flush(dev);
  
+	/*

+* All of our ah references aren't free until after
+* ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+* the neighbor garbage collection is stopped and reaped.
+* That should all be done now, so make a final ah flush.
+*/
+   ipoib_stop_ah(dev, 1);
+
ipoib_transport_dev_cleanup(dev);
  }
  


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html