On 11/13/2017 06:18 PM, Michael Ma wrote:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>
>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any
>>>>>>> early feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>
>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Currently txq/qdisc selection is based on flow hash, so packets from
>>>>>>>> the same flow will keep their order when they enter the qdisc/txq,
>>>>>>>> which avoids the out-of-order problem.
>>>>>>>>
>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have
>>>>>>>> multiple per-cpu queues for a single TC class and do busy polling
>>>>>>>> from a per-class thread to drain these queues. If we can do this
>>>>>>>> frequently enough, the out-of-order situation in this polling thread
>>>>>>>> should not be that bad.
>>>>>>>>
>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>> enqueued to the same place. Then a per-class thread polls the queues
>>>>>>>> belonging to its class from all the cpus and aggregates them into
>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>> inevitably introduces a potential out-of-order issue.
>>>>>>>>
>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>
>>>>>> In general, there are no meta design discussions in Linux development.
>>>>>> Several developers have tried to do lockless qdisc and similar things
>>>>>> in the past.
>>>>>>
>>>>>> The devil is in the details, show us the code.
>>>>>
>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>> we have a per-cpu per-class queue defined as this:
>>>>>
>>>>> struct bandwidth_group
>>>>> {
>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>     struct skb_list drain;
>>>>> };
>>>>>
>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging to
>>>>> the same class. In the enqueue function, we determine the cpu where the
>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>
>>>>> int cpu;
>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>
>>>>> cpu = get_cpu();
>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>
>>>>> Here we don't check the flow of the packet, so if there is task
>>>>> migration or multiple threads sending packets through the same flow we
>>>>> can theoretically have packets enqueued to different queues and
>>>>> aggregated to the "drain" queue out of order.
>>>>>
>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>> currently; however, if there is already a similar effort ongoing please
>>>>> let me know.
>>>>
>>>> The question I would have is how would this differ from using XPS w/
>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>> like mqprio?
>>>>
>>>> From what I can tell XPS would be able to get you your per-cpu
>>>> functionality; the benefit of it though would be that it would avoid
>>>> out-of-order issues for sockets originating on the local system.
>>>> The only thing I see as an issue right now is that the rate limiting
>>>> with mqprio is assumed to be handled via hardware due to mechanisms
>>>> such as DCB.
>>>
>>> I think one of the key points was in: "do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in the TX path of:
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
>
> The idea is to have the whole thing implemented as one classful qdisc
> and remove the qdisc root lock, since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like an mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue side and just adding it to the dequeue
>> side. The use of the "busy poll" is just to try to simplify things as
>> the agent would then be consolidating traffic from multiple per-cpu
>> queues onto one drain queue.
>
> We're essentially trying to spread the complexity from enqueue to
> different stages such as enqueue/aggregation and rate limiting/dequeue.
> Each stage will have a different degree of parallelism. It should work
> with a multi-queue device since txq selection can stay the same as
> today. However, our concern is that between enqueue and aggregation we
> have a small window which can allow packet OOO, which is a sacrifice
> for better concurrency.
>
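To make the enqueue/aggregation split described above concrete, here is a
minimal kernel-style sketch of the idea using the stock sk_buff_head API in
place of the proposal's skb_list. This is only an illustration under
assumptions: bwg_enqueue, bwg_drain_thread and bwg_init are made-up names and
not code from the actual proposal.

#include <linux/skbuff.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

#define MAX_CPU_COUNT NR_CPUS

/* One instance per traffic class. */
struct bandwidth_group {
        struct sk_buff_head queues[MAX_CPU_COUNT]; /* per-cpu staging queues */
        struct sk_buff_head drain;                 /* aggregated per-class queue */
        struct task_struct *drain_task;
};

/* Enqueue path: append to the queue of the cpu we are running on, so
 * contention is limited to this cpu's enqueuers vs. the drain thread. */
static void bwg_enqueue(struct bandwidth_group *bwg, struct sk_buff *skb)
{
        int cpu = get_cpu();

        skb_queue_tail(&bwg->queues[cpu], skb); /* takes the per-queue lock */
        put_cpu();
}

/* Per-class polling thread: sweep every per-cpu queue and move its packets
 * onto the shared drain queue. The window between an enqueue on one cpu and
 * the next sweep is where reordering can happen if a flow migrates. */
static int bwg_drain_thread(void *data)
{
        struct bandwidth_group *bwg = data;
        struct sk_buff *skb;
        int cpu;

        while (!kthread_should_stop()) {
                for_each_possible_cpu(cpu) {
                        struct sk_buff_head *q = &bwg->queues[cpu];

                        while ((skb = skb_dequeue(q)))
                                skb_queue_tail(&bwg->drain, skb);
                }
                /* Rate limiting / dequeue towards the txq would run against
                 * bwg->drain here. */
                cond_resched();
        }
        return 0;
}

static void bwg_init(struct bandwidth_group *bwg)
{
        int cpu;

        for_each_possible_cpu(cpu)
                skb_queue_head_init(&bwg->queues[cpu]);
        skb_queue_head_init(&bwg->drain);
        bwg->drain_task = kthread_run(bwg_drain_thread, bwg, "bwg_drain");
}

A real implementation would presumably splice whole per-cpu queues onto the
drain list in one operation rather than moving skbs one at a time, and would
run the class's rate limiting against the drain queue as noted in the thread.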
So OOO will happen when the application cpu migrates, presumably? This is
normally prevented with the skb ooo flag, but it looks like you plan to
violate this somehow. I think a design using ptr_rings/skb_arrays with bulk
dequeue and a good concurrent token bucket ring would suffice and also not
introduce OOO packets. But I don't completely understand your design, so I
might be missing something.

>>
>> Structure wise this ends up looking not too different from mqprio; the
>> main difference though would be that this would be a classful qdisc
>> and that the virtual qdiscs we have for the traffic classes would need
>> to be replaced with actual qdiscs for handling the "drain" aspect of
>> this.
>
> Structure wise it's similar to mqprio + a rate limiting qdisc without the
> root lock, replacing txq/flow level parallelism with cpu level
> parallelism. I'm actually not sure about the similarity with the busy
> polling that Eric mentioned since I haven't read the slides yet.

I pushed lockless qdisc patches again today and will repost when net-next
opens. These plus a lockless version of tbf might be what you need. At one
point I had a lockless tbf; I can probably dig that up as well if it's
useful.

https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html

Thanks,
John
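As a rough illustration of the ptr_ring/skb_array + token bucket direction
mentioned above: per-cpu producer rings per class, with a single consumer
doing bulk dequeue and byte-based token accounting. This is only a sketch
under assumptions: class_rings, class_enqueue and class_poll_once are made-up
names, packets are dropped instead of requeued when the ring overflows or the
bucket runs dry, and it assumes a tree that has ptr_ring_consume_batched().

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/math64.h>

#define CLASS_RING_SIZE 512
#define CLASS_BULK      64

struct class_rings {
        struct ptr_ring rings[NR_CPUS]; /* per-cpu producer rings */
        u64 tokens;                     /* bytes currently available */
        u64 rate;                       /* bytes per second */
        u64 burst;                      /* bucket depth in bytes */
        ktime_t last;                   /* last refill timestamp */
};

static int class_rings_init(struct class_rings *c, u64 rate, u64 burst)
{
        int cpu, err;

        for_each_possible_cpu(cpu) {
                err = ptr_ring_init(&c->rings[cpu], CLASS_RING_SIZE, GFP_KERNEL);
                if (err)
                        return err;
        }
        c->tokens = burst;
        c->rate = rate;
        c->burst = burst;
        c->last = ktime_get();
        return 0;
}

/* Producer side: runs on the sending cpu; per-cpu rings keep producer
 * contention local. Drops on ring overflow to keep the sketch short. */
static int class_enqueue(struct class_rings *c, struct sk_buff *skb)
{
        int cpu = get_cpu();
        int err = ptr_ring_produce(&c->rings[cpu], skb);

        put_cpu();
        if (err)
                kfree_skb(skb);
        return err;
}

/* Refill the bucket from elapsed time, capped at the burst size. */
static void class_refill(struct class_rings *c)
{
        ktime_t now = ktime_get();
        u64 delta_ns = ktime_to_ns(ktime_sub(now, c->last));

        c->tokens = min(c->burst,
                        c->tokens + div64_u64(delta_ns * c->rate, NSEC_PER_SEC));
        c->last = now;
}

/* Single consumer: bulk-dequeue from every per-cpu ring and release packets
 * while tokens last. A real qdisc would requeue when the bucket runs dry;
 * here packets are simply dropped. */
static void class_poll_once(struct class_rings *c)
{
        struct sk_buff *batch[CLASS_BULK];
        int cpu, i, n;

        class_refill(c);
        for_each_possible_cpu(cpu) {
                n = ptr_ring_consume_batched(&c->rings[cpu],
                                             (void **)batch, CLASS_BULK);
                for (i = 0; i < n; i++) {
                        struct sk_buff *skb = batch[i];

                        if (c->tokens < skb->len) {
                                kfree_skb(skb);
                                continue;
                        }
                        c->tokens -= skb->len;
                        /* hand skb to the device txq here */
                }
        }
}

Whether reordering is avoided then presumably depends on how packets are
mapped to rings: keying on the flow/txq rather than the running cpu keeps a
flow on a single ring, which is the property the thread is debating.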