On Wed, 22 Jun 2016 12:03:55 +0200 Florian Westphal <f...@strlen.de> wrote:
> Currently classification and enqueue are done in a single step. The
> core acquires the qdisc lock, then calls the ->enqueue() function of
> the qdisc.
>
> It's the job of the qdisc and its attached classifiers to figure out
> what to do next. Typically the enqueue function will call tc_classify()
> to look up a child class, then call ->enqueue of the child qdisc. This
> can repeat a number of times until a leaf qdisc is reached; this leaf
> will do the real enqueue operation (pfifo for example).
>
> While this approach gives qdiscs and the classifier/action subsystem a
> lot of control, it has one major drawback: the root qdisc lock is held
> for a prolonged period of time while we recurse through the qdisc
> hierarchy from root to leaf.
>
> This (unfinished!) hack splits classification and enqueue into two
> steps.
>
> Before enqueueing the packet, and *before* acquiring the root qdisc
> lock, the new qdisc ->classify() function is invoked.
>
> This function -- much like enqueue in the current scheme -- looks up a
> child class and/or determines the next qdisc where the packet needs to
> be handled via the classifier/action subsystem. It then invokes the
> new ->classify() hook of the child, which can repeat until the leaf
> qdisc is reached.
>
> Whenever a classify operation has succeeded, each classify "level"
> stores itself (the qdisc) and a cookie (typically just a pointer to
> the class) in the newly added Qdisc_classify_state struct.
>
> After qdisc_classify() returns, this struct contains the path that the
> skb is supposed to take through the qdisc hierarchy in *reverse*
> order, i.e. starting from the leaf up to the root.
>
> Then the root qdisc lock is taken and the *leaf* (first entry in
> Qdisc_classify_state) ->enqueue function is invoked.
>
> If this succeeds, the kernel will invoke the new ->senqeue function
> for all the remaining entries in Qdisc_classify_state. This function
> is responsible for performing needed qdisc-internal actions, e.g.
> scheduling a class for transmit.
>
> This reduces the amount of work done under the qdisc root lock. A very
> simple test with hfsc + u32 filter, gso off, a 10G link and 20 netperf
> TCP_STREAM instances:
>
> before: 8.2 GB/s
> after:  9.4 GB/s (same as with mq+fq_codel)
>
> Known issues with this patch:
> - only converted a few qdiscs
> - qdisc stats will race since they're changed outside of the root lock
> - need to protect vs. class removal during classify ops
> - mirred needs some work for this (but not worse than the current
>   situation)
> - use-after-free in some places; need to ensure that the skb remains
>   valid iff enqueue returns != DROP
> - senqeue is a horrid name. It is probably better to split qdiscs into
>   leaf and non-leaf ones, where leaf qdiscs implement ->enqueue() and
>   the others a notify_enqueue() (which doesn't return a value).
>
> So far I did not see any fundamental issues with this approach. If you
> see any, I'd be happy to learn about them before I spend more cycles
> on this. The main work to be done is to convert qstats to percpu
> counters and then convert all the qdiscs to this new scheme, and of
> course to extend this to all in-tree schedulers.
>
> If you attend the NF workshop in Amsterdam, feel free to yell at me in
> person there.
> ---
>  include/net/sch_generic.h | 50 +++++++++++++++++++++++
>  net/core/dev.c            | 10 ++++++-
>  net/sched/sch_drr.c       | 38 +++++++++++++++++++++++--
>  net/sched/sch_fq_codel.c  | 71 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/sched/sch_generic.c   | 32 +++++++++++++++++++++
>  net/sched/sch_hfsc.c      | 38 +++++++++++++++++++++++--
>  6 files changed, 232 insertions(+), 7 deletions(-)

The architecture of the Linux packet scheduler makes this too racy.
Classification can be attached to different nodes in the hierarchy, and
the hierarchy could change underneath. It is a good idea, but maybe it
would be better to have a separate classification step before the
scheduler (using BPF) and put the meta-data in the skb.
That way it could be safely lockless and atomic.