Re: Get rid of RCU callbacks in TC filters?

2017-10-23 Thread David Laight
From: Cong Wang
> Sent: 20 October 2017 21:52
> To: Paul E. McKenney
> Cc: Jamal Hadi Salim; Chris Mi; Linux Kernel Network Developers; Daniel 
> Borkmann; Eric Dumazet; David
> Miller; Jiri Pirko
> Subject: Re: Get rid of RCU callbacks in TC filters?
> 
> On Fri, Oct 20, 2017 at 1:31 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> > Actually, I just tried this approach; it makes it harder for the core
> > tc filter code to wait for in-flight callbacks. Currently rcu_barrier()
> > is enough, but with one more schedule_work() added we probably
> > need flush_workqueue() as well... This also means I can't use the
> > global workqueue, so I should add a dedicated workqueue for tc filters.
> >
> > Good news is I seem to make it work without adding much code.
> > Stay tuned. ;)
> 
> Hmm, lockdep already complains about a deadlock when flushing the
> workqueue while holding the RTNL lock, because the callbacks need RTNL
> too... See the rcu_barrier() in tcf_block_put().
> 
> So this approach seems to be a dead end. Is there any way out?
> 
> Probably we have to convert call_rcu() to synchronize_rcu(),
> but with batching of course.

Locking for callbacks is always a PITA.
You also need to sort out how to 'unhook' the callbacks.

If you require that the caller hold the rtnl_lock prior to adding
or removing callbacks, and guarantee that the functions will
be called with the rtnl_lock held, you might be OK.

Otherwise one side can't do its 'expected' locking without
risking a potential deadlock or use-after-free.

The only way I know to avoid the issues when removing the callbacks
is to have the callback function called from its normal scheduling
point with an argument (probably NULL) that indicates it will
never be called again.

David
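A rough sketch of the "final call with a NULL argument" teardown pattern David describes; the structure and names here are hypothetical, not taken from any actual kernel code:

```c
/* Hypothetical sketch of the teardown pattern described above: the
 * callback is invoked one last time from its normal scheduling point
 * with a NULL argument, telling it that it will never run again.
 */
struct hook {
	void (*func)(void *arg);	/* NULL arg == final invocation */
	void *arg;
	bool unhooked;			/* set by the remover, under the lock */
};

static void run_hook(struct hook *h)
{
	if (h->unhooked) {
		h->func(NULL);		/* final call: let it clean up */
		h->func = NULL;
		return;
	}
	h->func(h->arg);		/* normal invocation */
}
```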



Re: Get rid of RCU callbacks in TC filters?

2017-10-20 Thread Cong Wang
On Fri, Oct 20, 2017 at 1:31 PM, Cong Wang  wrote:
> Actually, I just tried this approach; it makes it harder for the core
> tc filter code to wait for in-flight callbacks. Currently rcu_barrier()
> is enough, but with one more schedule_work() added we probably
> need flush_workqueue() as well... This also means I can't use the
> global workqueue, so I should add a dedicated workqueue for tc filters.
>
> Good news is I seem to make it work without adding much code.
> Stay tuned. ;)

Hmm, lockdep already complains about a deadlock when flushing the
workqueue while holding the RTNL lock, because the callbacks need RTNL
too... See the rcu_barrier() in tcf_block_put().

So this approach seems to be a dead end. Is there any way out?

Probably we have to convert call_rcu() to synchronize_rcu(),
but with batching of course.
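The "synchronize_rcu() with batching" idea could look roughly like the following; the function, structure, and list names are illustrative, not actual tc code:

```c
/* Illustrative only: unlink a whole set of filters under RTNL onto a
 * private list, wait for a single grace period, then free the batch.
 * "struct filt" and the list layout are hypothetical.
 */
struct filt {
	struct list_head list;
	/* ... filter state ... */
};

static void flush_filters(struct list_head *filters)
{
	struct filt *f, *tmp;
	LIST_HEAD(local);

	ASSERT_RTNL();
	list_for_each_entry_safe(f, tmp, filters, list) {
		list_del_rcu(&f->list);		/* readers may still hold f */
		list_add(&f->list, &local);	/* private list, no new readers */
	}
	synchronize_rcu();	/* one grace period covers the whole batch */
	list_for_each_entry_safe(f, tmp, &local, list) {
		list_del(&f->list);
		kfree(f);
	}
}
```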


Re: Get rid of RCU callbacks in TC filters?

2017-10-20 Thread Cong Wang
On Fri, Oct 20, 2017 at 9:56 AM, Paul E. McKenney
 wrote:
> On Thu, Oct 19, 2017 at 08:26:01PM -0700, Cong Wang wrote:
>> On Wed, Oct 18, 2017 at 12:35 PM, Paul E. McKenney
>>  wrote:
>> > 5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
>> > The workqueue could then use blocking primitives, for example, acquiring
>> > RTNL.
>>
>> Yeah, this could work too, but we would get one more level of async...
>>
>> filter delete -> call_rcu() -> schedule_work() -> action destroy
>
> True, but on the other hand you get to hold RTNL.

I can get RTNL too by converting call_rcu() to synchronize_rcu().
;)

So this turns into the same question again: do we mind synchronize_rcu()
on slow paths or not?

Actually, I just tried this approach; it makes it harder for the core
tc filter code to wait for in-flight callbacks. Currently rcu_barrier()
is enough, but with one more schedule_work() added we probably
need flush_workqueue() as well... This also means I can't use the
global workqueue, so I should add a dedicated workqueue for tc filters.

Good news is I seem to make it work without adding much code.
Stay tuned. ;)
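The dedicated-workqueue idea above, together with the two-stage wait it implies, could be sketched roughly as follows; the names are assumptions for illustration:

```c
/* Sketch of a dedicated tc workqueue and a two-stage drain:
 * rcu_barrier() waits for in-flight RCU callbacks (which queue work),
 * then flush_workqueue() waits for the queued work itself.  Note the
 * deadlock hazard: this drain must not run under RTNL if the work
 * items themselves take RTNL.
 */
static struct workqueue_struct *tc_filter_wq;

static int __init tc_filter_init(void)
{
	tc_filter_wq = alloc_workqueue("tc_filter_workqueue", 0, 0);
	if (!tc_filter_wq)
		return -ENOMEM;
	return 0;
}

static void tc_filter_drain(void)
{
	rcu_barrier();			/* wait for pending call_rcu() */
	flush_workqueue(tc_filter_wq);	/* wait for scheduled work */
}
```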

>
>> > 6) As with #5, have the RCU callback schedule a workqueue, but aggregate
>> > workqueue scheduling using a timer.  This would reduce the number of
>> > RTNL acquisitions.
>>
>> Ouch, that sounds like yet another level of async:
>>
>> filter delete -> call_rcu() -> schedule_work() -> timer -> flush_work()
>> -> action destroy
>>
>> :-(
>
> Indeed, the price of scalability and performance is often added
> asynchronous action at a distance.  But sometimes you can have
> scalability, performance, -and- synchronous action.  Not sure that this
> is one of those cases, but perhaps someone will come up with some trick
> that we are not yet seeing.
>
> And again, one benefit you get from the added asynchrony is the ability
> to acquire RTNL.  Another is increased batching, allowing the overhead
> of acquiring RTNL to be amortized over a larger number of updates.

Understood; my point is that it might not be worthwhile to optimize a
slow path which already holds the RTNL lock...


>
>> > 7) As with #5, have the RCU callback schedule a workqueue, but have each
>> > iterator accumulate a list of things removed and do call_rcu() on the
>> > list.  This is an alternative way of aggregating to reduce the number
>> > of RTNL acquisitions.
>>
>> Yeah, this seems to work too.
>>
>> > There are many other ways to skin this cat.
>>
>> We still have to pick one. :) Any preference? I want to keep it as simple
>> as possible, otherwise some day even I will not understand it.
>
> I must defer to the people who actually fully understand this code.

I understand that code; I'm just not sure which approach I should
pick.

I will keep you Cc'ed for any further change I make.

Thanks!


Re: Get rid of RCU callbacks in TC filters?

2017-10-20 Thread Paul E. McKenney
On Thu, Oct 19, 2017 at 08:26:01PM -0700, Cong Wang wrote:
> On Wed, Oct 18, 2017 at 12:35 PM, Paul E. McKenney
>  wrote:
> > On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
> >> Hi, all
> >>
>> Recently, the RCU callbacks used in TC filters and TC actions have kept
>> drawing my attention; they have introduced at least 4 race condition bugs:
> >>
> >> 1. A simple one fixed by Daniel:
> >>
> >> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
> >> Author: Daniel Borkmann 
> >> Date:   Wed May 20 17:13:33 2015 +0200
> >>
> >> net: sched: fix call_rcu() race on classifier module unloads
> >>
> >> 2. A very nasty one fixed by me:
> >>
> >> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
> >> Author: Cong Wang 
> >> Date:   Mon Sep 11 16:33:32 2017 -0700
> >>
> >> net_sched: carefully handle tcf_block_put()
> >>
> >> 3. Two more bugs found by Chris:
> >> https://patchwork.ozlabs.org/patch/826696/
> >> https://patchwork.ozlabs.org/patch/826695/
> >>
> >>
>> Usually RCU callbacks are simple; however, for TC filters and actions
>> they are complex, because at least TC actions can be destroyed
>> together with the TC filter in one callback. And RCU callbacks are
>> invoked in BH context; without locking they also run in parallel. All of
>> this contributes to these nasty bugs. It looks like the callbacks
>> bring us more problems than benefits.
>>
>> Therefore, I have been thinking about getting rid of these callbacks,
>> because they are not strictly necessary: the callers of these call_rcu()'s
>> are all on slow paths and hold the RTNL lock, so blocking is permitted in
>> their contexts, and _I think_ it does no harm to use
>> synchronize_rcu() on slow paths; at least I can argue the RTNL lock is
>> already there and is the bottleneck if we really care. :)
> >>
> >> There are 3 solutions here:
> >>
>> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
>> downside is that this could hurt the performance of deleting TC filters,
>> but again that is a slow path compared to the skb classification path. Note,
>> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
>> because many call_rcu()'s are actually in list iterations, so we have to
>> use a local list and call list_del_rcu()+list_add() before
>> synchronize_rcu() (or is there any other API I am not aware of?). If
>> people really hate synchronize_rcu() because of performance, we could
>> also defer the work to a workqueue and callers could keep their
>> current performance.
> >>
> >> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
> >> said in commit 1697c4bb5245 ("net_sched: carefully handle
> >> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
> >> Potentially we need to do a lot of work to make it possible, if not
> >> impossible.
> >>
>> 3) Keep these RCU callbacks and fix all the race conditions, like what
>> Chris tries to do in his patchset. My argument is that we cannot prove
>> we are really race-free even with Chris' patches, and his patches are
>> already large enough.
> >>
> >>
> >> What do you think? Any other ideas?
> >
> > 4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
> > synchronize_rcu() for multiple deletions/iterations.
> 
> This is what I meant by using a local list, perhaps I didn't make it clear.

Ah, got it.

> > 5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
> > The workqueue could then use blocking primitives, for example, acquiring
> > RTNL.
> 
> Yeah, this could work too, but we would get one more level of async...
> 
> filter delete -> call_rcu() -> schedule_work() -> action destroy

True, but on the other hand you get to hold RTNL.

> > 6) As with #5, have the RCU callback schedule a workqueue, but aggregate
> > workqueue scheduling using a timer.  This would reduce the number of
> > RTNL acquisitions.
> 
> Ouch, that sounds like yet another level of async:
> 
> filter delete -> call_rcu() -> schedule_work() -> timer -> flush_work()
> -> action destroy
> 
> :-(

Indeed, the price of scalability and performance is often added
asynchronous action at a distance.  But sometimes you can have
scalability, performance, -and- synchronous action.  Not sure that this
is one of those cases, but perhaps someone will come up with some trick
that we are not yet seeing.

And again, one benefit you get from the added asynchrony is the ability
to acquire RTNL.  Another is increased batching, allowing the overhead
of acquiring RTNL to be amortized over a larger number of updates.

> > 7) As with #5, have the RCU callback schedule a workqueue, but have each
> > iterator accumulate a list of things removed and do call_rcu() on the
> > list.  This is an alternative way of aggregating to reduce the number
> > of RTNL acquisitions.
> 
> Yeah, this seems to work too.
> 
> > There are many other ways to skin this cat.
> 
> We still have to pick one. :) Any preference? I want to keep it as simple
> as possible, otherwise some day even I will not understand it.

Re: Get rid of RCU callbacks in TC filters?

2017-10-19 Thread Cong Wang
On Wed, Oct 18, 2017 at 12:35 PM, Paul E. McKenney
 wrote:
> On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
>> Hi, all
>>
>> Recently, the RCU callbacks used in TC filters and TC actions have kept
>> drawing my attention; they have introduced at least 4 race condition bugs:
>>
>> 1. A simple one fixed by Daniel:
>>
>> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
>> Author: Daniel Borkmann 
>> Date:   Wed May 20 17:13:33 2015 +0200
>>
>> net: sched: fix call_rcu() race on classifier module unloads
>>
>> 2. A very nasty one fixed by me:
>>
>> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
>> Author: Cong Wang 
>> Date:   Mon Sep 11 16:33:32 2017 -0700
>>
>> net_sched: carefully handle tcf_block_put()
>>
>> 3. Two more bugs found by Chris:
>> https://patchwork.ozlabs.org/patch/826696/
>> https://patchwork.ozlabs.org/patch/826695/
>>
>>
>> Usually RCU callbacks are simple; however, for TC filters and actions
>> they are complex, because at least TC actions can be destroyed
>> together with the TC filter in one callback. And RCU callbacks are
>> invoked in BH context; without locking they also run in parallel. All of
>> this contributes to these nasty bugs. It looks like the callbacks
>> bring us more problems than benefits.
>>
>> Therefore, I have been thinking about getting rid of these callbacks,
>> because they are not strictly necessary: the callers of these call_rcu()'s
>> are all on slow paths and hold the RTNL lock, so blocking is permitted in
>> their contexts, and _I think_ it does no harm to use
>> synchronize_rcu() on slow paths; at least I can argue the RTNL lock is
>> already there and is the bottleneck if we really care. :)
>>
>> There are 3 solutions here:
>>
>> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
>> downside is that this could hurt the performance of deleting TC filters,
>> but again that is a slow path compared to the skb classification path. Note,
>> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
>> because many call_rcu()'s are actually in list iterations, so we have to
>> use a local list and call list_del_rcu()+list_add() before
>> synchronize_rcu() (or is there any other API I am not aware of?). If
>> people really hate synchronize_rcu() because of performance, we could
>> also defer the work to a workqueue and callers could keep their
>> current performance.
>>
>> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
>> said in commit 1697c4bb5245 ("net_sched: carefully handle
>> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
>> Potentially we need to do a lot of work to make it possible, if not
>> impossible.
>>
>> 3) Keep these RCU callbacks and fix all the race conditions, like what
>> Chris tries to do in his patchset. My argument is that we cannot prove
>> we are really race-free even with Chris' patches, and his patches are
>> already large enough.
>>
>>
>> What do you think? Any other ideas?
>
> 4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
> synchronize_rcu() for multiple deletions/iterations.

This is what I meant by using a local list, perhaps I didn't make it clear.


>
> 5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
> The workqueue could then use blocking primitives, for example, acquiring
> RTNL.

Yeah, this could work too, but we would get one more level of async...

filter delete -> call_rcu() -> schedule_work() -> action destroy
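That chain could be sketched roughly as follows; the structure and function names are hypothetical:

```c
/* Sketch of option 5: the RCU callback only queues work; the work
 * handler runs in process context, so it may sleep and take RTNL
 * before destroying actions.  Illustrative names only.
 */
struct tcf_filter {
	struct rcu_head rcu;
	struct work_struct work;
};

static void filter_destroy_work(struct work_struct *work)
{
	struct tcf_filter *f = container_of(work, struct tcf_filter, work);

	rtnl_lock();		/* blocking is fine in process context */
	/* ... destroy actions, drop module refs ... */
	rtnl_unlock();
	kfree(f);
}

static void filter_destroy_rcu(struct rcu_head *head)
{
	struct tcf_filter *f = container_of(head, struct tcf_filter, rcu);

	INIT_WORK(&f->work, filter_destroy_work);
	schedule_work(&f->work);	/* safe from BH context */
}
```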



>
> 6) As with #5, have the RCU callback schedule a workqueue, but aggregate
> workqueue scheduling using a timer.  This would reduce the number of
> RTNL acquisitions.

Ouch, that sounds like yet another level of async:

filter delete -> call_rcu() -> schedule_work() -> timer -> flush_work()
-> action destroy

:-(


>
> 7) As with #5, have the RCU callback schedule a workqueue, but have each
> iterator accumulate a list of things removed and do call_rcu() on the
> list.  This is an alternative way of aggregating to reduce the number
> of RTNL acquisitions.


Yeah, this seems to work too.


>
> There are many other ways to skin this cat.
>

We still have to pick one. :) Any preference? I want to keep it as simple
as possible, otherwise some day even I will not understand it.

Thanks for all the ideas!


Re: Get rid of RCU callbacks in TC filters?

2017-10-19 Thread Cong Wang
On Thu, Oct 19, 2017 at 8:34 AM, John Fastabend
 wrote:
>
> My take on this would be to stay with the current RCU callbacks and try
> to simplify the implementation. Falling back to sync operations seems
> like a step backwards to me. I know update/delete of filters is currently
> a pain point for some use cases so getting the RTNL out of the way may
> become a requirement to support those (alternatively maybe batching is
> good enough).

To me it looks very hard to make the tc action destroy code completely
race-free in RCU callbacks; at least it looks harder than getting rid of
the RCU callbacks.


>
> I guess at a high level with Chris' patches actions are now doing reference
> counting correctly. If shared filters also do reference counting similarly
> we should be OK, right? (Yes, I know I may be simplifying too much to be
> meaningful.)

I don't know what you mean by "doing reference counting correctly";
if you mean making the refcounts atomic, as I already explained to Chris,
that is not necessary at all if we remove the RCU callbacks. A refcnt doesn't
have to be atomic if it is always serialized by a lock.
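As a minimal illustration of that point (hypothetical names, not actual tc code):

```c
/* With every refcount operation serialized by RTNL, a plain int is
 * enough; atomic_t is only needed when paths that share no common
 * lock can touch the count concurrently.
 */
struct tc_action_example {
	int refcnt;	/* protected by RTNL, deliberately not atomic_t */
};

static void action_put(struct tc_action_example *a)
{
	ASSERT_RTNL();		/* every caller must hold RTNL */
	if (--a->refcnt == 0)
		kfree(a);
}
```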

>
> Are we aware of any outstanding problem areas?
>

Potentially many: since the tc action destroy code can be called
either with the RTNL lock held (fine) or from an RCU callback without
the RTNL lock (buggy), those two paths race with each other, and the RCU
callbacks race among themselves too.


Re: Get rid of RCU callbacks in TC filters?

2017-10-19 Thread John Fastabend
On 10/18/2017 12:35 PM, Paul E. McKenney wrote:
> On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
>> Hi, all
>>
>> Recently, the RCU callbacks used in TC filters and TC actions have kept
>> drawing my attention; they have introduced at least 4 race condition bugs:
>>
>> 1. A simple one fixed by Daniel:
>>
>> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
>> Author: Daniel Borkmann 
>> Date:   Wed May 20 17:13:33 2015 +0200
>>
>> net: sched: fix call_rcu() race on classifier module unloads
>>
>> 2. A very nasty one fixed by me:
>>
>> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
>> Author: Cong Wang 
>> Date:   Mon Sep 11 16:33:32 2017 -0700
>>
>> net_sched: carefully handle tcf_block_put()
>>
>> 3. Two more bugs found by Chris:
>> https://patchwork.ozlabs.org/patch/826696/
>> https://patchwork.ozlabs.org/patch/826695/
>>
>>
>> Usually RCU callbacks are simple; however, for TC filters and actions
>> they are complex, because at least TC actions can be destroyed
>> together with the TC filter in one callback. And RCU callbacks are
>> invoked in BH context; without locking they also run in parallel. All of
>> this contributes to these nasty bugs. It looks like the callbacks
>> bring us more problems than benefits.
>>
>> Therefore, I have been thinking about getting rid of these callbacks,
>> because they are not strictly necessary: the callers of these call_rcu()'s
>> are all on slow paths and hold the RTNL lock, so blocking is permitted in
>> their contexts, and _I think_ it does no harm to use
>> synchronize_rcu() on slow paths; at least I can argue the RTNL lock is
>> already there and is the bottleneck if we really care. :)
>>
>> There are 3 solutions here:
>>
>> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
>> downside is that this could hurt the performance of deleting TC filters,
>> but again that is a slow path compared to the skb classification path. Note,
>> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
>> because many call_rcu()'s are actually in list iterations, so we have to
>> use a local list and call list_del_rcu()+list_add() before
>> synchronize_rcu() (or is there any other API I am not aware of?). If
>> people really hate synchronize_rcu() because of performance, we could
>> also defer the work to a workqueue and callers could keep their
>> current performance.
>>
>> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
>> said in commit 1697c4bb5245 ("net_sched: carefully handle
>> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
>> Potentially we need to do a lot of work to make it possible, if not
>> impossible.
>>
>> 3) Keep these RCU callbacks and fix all the race conditions, like what
>> Chris tries to do in his patchset. My argument is that we cannot prove
>> we are really race-free even with Chris' patches, and his patches are
>> already large enough.
>>

My take on this would be to stay with the current RCU callbacks and try
to simplify the implementation. Falling back to sync operations seems
like a step backwards to me. I know update/delete of filters is currently
a pain point for some use cases, so getting RTNL out of the way may
become a requirement to support those (alternatively, maybe batching is
good enough).

I guess at a high level with Chris' patches actions are now doing reference
counting correctly. If shared filters also do reference counting similarly
we should be OK, right? (Yes, I know I may be simplifying too much to be
meaningful.)

Are we aware of any outstanding problem areas?

>>
>> What do you think? Any other ideas?
> 
> 4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
> synchronize_rcu() for multiple deletions/iterations.
> 
> 5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
> The workqueue could then use blocking primitives, for example, acquiring
> RTNL.
> 
> 6) As with #5, have the RCU callback schedule a workqueue, but aggregate
> workqueue scheduling using a timer.  This would reduce the number of
> RTNL acquisitions.
> 
> 7) As with #5, have the RCU callback schedule a workqueue, but have each
> iterator accumulate a list of things removed and do call_rcu() on the
> list.  This is an alternative way of aggregating to reduce the number
> of RTNL acquisitions.
> 
> There are many other ways to skin this cat.
> 
>   Thanx, Paul
> 
Batching will probably be needed soon regardless for the hardware offload
folks. 

Thanks,
John


Re: Get rid of RCU callbacks in TC filters?

2017-10-18 Thread Paul E. McKenney
On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
> Hi, all
> 
> Recently, the RCU callbacks used in TC filters and TC actions have kept
> drawing my attention; they have introduced at least 4 race condition bugs:
> 
> 1. A simple one fixed by Daniel:
> 
> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
> Author: Daniel Borkmann 
> Date:   Wed May 20 17:13:33 2015 +0200
> 
> net: sched: fix call_rcu() race on classifier module unloads
> 
> 2. A very nasty one fixed by me:
> 
> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
> Author: Cong Wang 
> Date:   Mon Sep 11 16:33:32 2017 -0700
> 
> net_sched: carefully handle tcf_block_put()
> 
> 3. Two more bugs found by Chris:
> https://patchwork.ozlabs.org/patch/826696/
> https://patchwork.ozlabs.org/patch/826695/
> 
> 
> Usually RCU callbacks are simple; however, for TC filters and actions
> they are complex, because at least TC actions can be destroyed
> together with the TC filter in one callback. And RCU callbacks are
> invoked in BH context; without locking they also run in parallel. All of
> this contributes to these nasty bugs. It looks like the callbacks
> bring us more problems than benefits.
> 
> Therefore, I have been thinking about getting rid of these callbacks,
> because they are not strictly necessary: the callers of these call_rcu()'s
> are all on slow paths and hold the RTNL lock, so blocking is permitted in
> their contexts, and _I think_ it does no harm to use
> synchronize_rcu() on slow paths; at least I can argue the RTNL lock is
> already there and is the bottleneck if we really care. :)
> 
> There are 3 solutions here:
> 
> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
> downside is that this could hurt the performance of deleting TC filters,
> but again that is a slow path compared to the skb classification path. Note,
> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
> because many call_rcu()'s are actually in list iterations, so we have to
> use a local list and call list_del_rcu()+list_add() before
> synchronize_rcu() (or is there any other API I am not aware of?). If
> people really hate synchronize_rcu() because of performance, we could
> also defer the work to a workqueue and callers could keep their
> current performance.
> 
> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
> said in commit 1697c4bb5245 ("net_sched: carefully handle
> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
> Potentially we need to do a lot of work to make it possible, if not
> impossible.
> 
> 3) Keep these RCU callbacks and fix all the race conditions, like what
> Chris tries to do in his patchset. My argument is that we cannot prove
> we are really race-free even with Chris' patches, and his patches are
> already large enough.
> 
> 
> What do you think? Any other ideas?

4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
synchronize_rcu() for multiple deletions/iterations.

5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
The workqueue could then use blocking primitives, for example, acquiring
RTNL.

6) As with #5, have the RCU callback schedule a workqueue, but aggregate
workqueue scheduling using a timer.  This would reduce the number of
RTNL acquisitions.

7) As with #5, have the RCU callback schedule a workqueue, but have each
iterator accumulate a list of things removed and do call_rcu() on the
list.  This is an alternative way of aggregating to reduce the number
of RTNL acquisitions.
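For example, option 7 might look roughly like this; the batch structure and all names are hypothetical, not actual tc code:

```c
/* Sketch of option 7: the deletion iterator collects everything it
 * removes onto one batch, a single call_rcu() then covers the batch,
 * and the RCU callback hands the batch to a workqueue, which destroys
 * it under RTNL.  One RTNL acquisition amortized over many deletions.
 */
struct destroy_batch {
	struct list_head items;		/* entries already unlinked */
	struct rcu_head rcu;
	struct work_struct work;
};

static void batch_work(struct work_struct *work)
{
	struct destroy_batch *b = container_of(work, struct destroy_batch, work);

	rtnl_lock();
	/* ... free every entry on b->items ... */
	rtnl_unlock();
	kfree(b);
}

static void batch_rcu(struct rcu_head *head)
{
	struct destroy_batch *b = container_of(head, struct destroy_batch, rcu);

	INIT_WORK(&b->work, batch_work);
	schedule_work(&b->work);	/* defer to process context */
}
```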

There are many other ways to skin this cat.

Thanx, Paul