On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:
> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> dequeue routines then sets the NOLOCK bit.
> 
> This also removes the logic used to pick the next band to dequeue from
> and instead just checks each alf_queue for packets from top priority
> to lowest. This might need to be a bit more clever but seems to work
> for now.
> 
> Signed-off-by: John Fastabend <john.r.fastab...@intel.com>
> ---
>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------

>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>                             struct sk_buff **to_free)
>  {
> -     return qdisc_drop(skb, qdisc, to_free);
> +     err = skb_array_produce_bh(q, skb);
..
>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
> +             skb = skb_array_consume_bh(q);

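Just to check I'm reading the conversion right: with one skb_array
per band, I'd expect the enqueue/dequeue pair to end up roughly like
the sketch below. This is my reconstruction, not the actual patch;
the priv layout and the band2list() helper are guesses on my side:

struct pfifo_fast_priv {
	struct skb_array q[PFIFO_FAST_BANDS];	/* one ring per band */
};

/* guessed helper: return the ring backing a given band */
static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
					  int band)
{
	return &priv->q[band];
}

static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
			      struct sk_buff **to_free)
{
	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
	int band = prio2band[skb->priority & TC_PRIO_MAX];
	struct skb_array *q = band2list(priv, band);
	int err;

	err = skb_array_produce_bh(q, skb);	/* -ENOSPC when full */
	if (unlikely(err))
		return qdisc_drop(skb, qdisc, to_free);
	return NET_XMIT_SUCCESS;
}

static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
{
	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
	struct sk_buff *skb = NULL;
	int band;

	/* scan bands from highest priority (0) to lowest, as the
	 * commit message describes */
	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++)
		skb = skb_array_consume_bh(band2list(priv, band));

	return skb;
}
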
For this particular qdisc the performance gain should come from the
granularity of the spin_lock, right? Before, we were taking the lock
much earlier; here we still take a lock, but hold it for a very short
time.
        cpus    original pps    lockless pps    diff
        1       1418168         1269450         -148718
        2       1587390         1553408         -33982
        4       1084961         1683639         +598678
        8       989636          1522723         +533087
        12      1014018         1348172         +334154
So the perf loss in the 1-cpu case is mainly due to array vs. list,
since the number of lock acquisitions is the same and there is no
contention? But then why does the shorter lock hold give so much
better throughput in the multi-cpu cases? Was that contention the
main target for the performance improvement?
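
And if I read include/linux/ptr_ring.h right, the locked section on
the produce side now shrinks to roughly the snippet below (paraphrased
from memory, so the exact shape may be off):

/* paraphrase of the ptr_ring produce path backing skb_array: the
 * producer lock covers a single slot store, instead of the qdisc
 * root lock covering the whole enqueue path as before */
static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
{
	int ret;

	spin_lock_bh(&r->producer_lock);	/* short critical section */
	ret = __ptr_ring_produce(r, ptr);	/* one slot store */
	spin_unlock_bh(&r->producer_lock);

	return ret;
}

That would at least be consistent with the table above: little to win
at 1-2 cpus where the lock is barely contended, big wins once the
root lock actually was the bottleneck.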

Looks like mq gets the most benefit, because it's lockless
internally, which makes sense.
In general I think this is the right direction for the tc
infrastructure to move in. I'm just not sure whether it's worth
converting pfifo_fast to skb_array; alf_queue would probably have
been a better alternative.
