On 11/13/2017 06:18 PM, Michael Ma wrote:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>
>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any
>>>>>>> early feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>
>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Currently txq/qdisc selection is based on flow hash, so packets from
>>>>>>>> the same flow will keep their order when they enter the qdisc/txq,
>>>>>>>> which avoids the out-of-order problem.
>>>>>>>>
>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have
>>>>>>>> multiple per-cpu queues for a single TC class and do busy polling
>>>>>>>> from a per-class thread to drain these queues. If we can do this
>>>>>>>> frequently enough, the out-of-order situation in this polling thread
>>>>>>>> should not be that bad.
>>>>>>>>
>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>> enqueued to the same place. Then a per-class thread polls the queues
>>>>>>>> belonging to its class from all the cpus and aggregates them into
>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>> inevitably introduces a potential out-of-order issue.
>>>>>>>>
>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>
>>>>>> In general, there are no meta design discussions in Linux development.
>>>>>> Several developers have tried to do lockless qdisc and similar things
>>>>>> in the past.
>>>>>>
>>>>>> The devil is in the details, show us the code.
>>>>>
>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>> we have a per-cpu per-class queue defined as this:
>>>>>
>>>>> struct bandwidth_group
>>>>> {
>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>     struct skb_list drain;
>>>>> };
>>>>>
>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging to
>>>>> the same class. In the enqueue function, we determine the cpu where the
>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>
>>>>> int cpu;
>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>
>>>>> cpu = get_cpu();
>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>
>>>>> Here we don't check the flow of the packet, so if there is task
>>>>> migration or multiple threads sending packets through the same flow we
>>>>> can theoretically have packets enqueued to different queues and
>>>>> aggregated to the "drain" queue out of order.
>>>>>
>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>> currently; however, if there is already a similar effort ongoing please
>>>>> let me know.
>>>>
>>>> The question I would have is how would this differ from using XPS w/
>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>> like mqprio?
>>>>
>>>> From what I can tell XPS would be able to get you your per-cpu
>>>> functionality; the benefit of it though would be that it would avoid
>>>> out-of-order issues for sockets originating on the local system.
>>>> The only thing I see as an issue right now is that the rate limiting
>>>> with mqprio is assumed to be handled via hardware due to mechanisms
>>>> such as DCB.
>>>
>>> I think one of the key points was in: "do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in the TX path of:
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
>
> The idea is to have the whole thing implemented as one classful qdisc
> and remove the qdisc root lock, since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like an mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue side and just adding it to the dequeue
>> side. The use of the "busy poll" is just to try to simplify things as
>> the agent would then be consolidating traffic from multiple per-cpu
>> queues onto one drain queue.
>
> We're essentially trying to spread the complexity from enqueue to
> different stages such as enqueue/aggregation and rate limiting/dequeue.
> Each stage will have a different degree of parallelism. It should work
> with a multi-queue device since txq selection can stay the same as
> today. However, our concern is that between enqueue and aggregation we
> have a small window which can allow packet OOO, which is a sacrifice
> for better concurrency.
>
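To make the enqueue/aggregation split described above concrete, here is a
minimal kernel-style sketch of the idea using the stock sk_buff_head API in
place of the proposal's skb_list. This is only an illustration under
assumptions: bwg_enqueue, bwg_drain_thread and bwg_init are made-up names and
not code from the actual proposal.

#include <linux/skbuff.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/smp.h>

#define MAX_CPU_COUNT NR_CPUS

/* One instance per traffic class. */
struct bandwidth_group {
        struct sk_buff_head queues[MAX_CPU_COUNT]; /* per-cpu staging queues */
        struct sk_buff_head drain;                 /* aggregated per-class queue */
        struct task_struct *drain_task;
};

/* Enqueue path: append to the queue of the cpu we are running on, so
 * contention is limited to this cpu's enqueuers vs. the drain thread. */
static void bwg_enqueue(struct bandwidth_group *bwg, struct sk_buff *skb)
{
        int cpu = get_cpu();

        skb_queue_tail(&bwg->queues[cpu], skb); /* takes the per-queue lock */
        put_cpu();
}

/* Per-class polling thread: sweep every per-cpu queue and move its packets
 * onto the shared drain queue. The window between an enqueue on one cpu and
 * the next sweep is where reordering can happen if a flow migrates. */
static int bwg_drain_thread(void *data)
{
        struct bandwidth_group *bwg = data;
        struct sk_buff *skb;
        int cpu;

        while (!kthread_should_stop()) {
                for_each_possible_cpu(cpu) {
                        struct sk_buff_head *q = &bwg->queues[cpu];

                        while ((skb = skb_dequeue(q)))
                                skb_queue_tail(&bwg->drain, skb);
                }
                /* Rate limiting / dequeue towards the txq would run against
                 * bwg->drain here. */
                cond_resched();
        }
        return 0;
}

static void bwg_init(struct bandwidth_group *bwg)
{
        int cpu;

        for_each_possible_cpu(cpu)
                skb_queue_head_init(&bwg->queues[cpu]);
        skb_queue_head_init(&bwg->drain);
        bwg->drain_task = kthread_run(bwg_drain_thread, bwg, "bwg_drain");
}

A real implementation would presumably splice whole per-cpu queues onto the
drain list in one operation rather than moving skbs one at a time, and would
run the class's rate limiting against the drain queue as noted in the thread.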
So OOO will happen when the application cpu migrates, presumably? This is
normally prevented with the skb ooo flag, but it looks like you plan to
violate this somehow. I think a design using ptr_rings/skb_arrays with bulk
dequeue and a good concurrent token bucket ring would suffice and also not
introduce OOO packets. But I don't completely understand your design, so I
might be missing something.

>>
>> Structure wise this ends up looking not too different from mqprio; the
>> main difference though would be that this would be a classful qdisc
>> and that the virtual qdiscs we have for the traffic classes would need
>> to be replaced with actual qdiscs for handling the "drain" aspect of
>> this.
>
> Structure wise it's similar to mqprio + a rate limiting qdisc without the
> root lock, replacing txq/flow level parallelism with cpu level
> parallelism. I'm actually not sure about the similarity with the busy
> polling that Eric mentioned since I haven't read the slides yet.

I pushed lockless qdisc patches again today and will repost when net-next
opens. These plus a lockless version of tbf might be what you need. At one
point I had a lockless tbf; I can probably dig that up as well if it's
useful.

https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html

Thanks,
John
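As a rough illustration of the ptr_ring/skb_array + token bucket direction
mentioned above: per-cpu producer rings per class, with a single consumer
doing bulk dequeue and byte-based token accounting. This is only a sketch
under assumptions: class_rings, class_enqueue and class_poll_once are made-up
names, packets are dropped instead of requeued when the ring overflows or the
bucket runs dry, and it assumes a tree that has ptr_ring_consume_batched().

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/ktime.h>
#include <linux/math64.h>

#define CLASS_RING_SIZE 512
#define CLASS_BULK      64

struct class_rings {
        struct ptr_ring rings[NR_CPUS]; /* per-cpu producer rings */
        u64 tokens;                     /* bytes currently available */
        u64 rate;                       /* bytes per second */
        u64 burst;                      /* bucket depth in bytes */
        ktime_t last;                   /* last refill timestamp */
};

static int class_rings_init(struct class_rings *c, u64 rate, u64 burst)
{
        int cpu, err;

        for_each_possible_cpu(cpu) {
                err = ptr_ring_init(&c->rings[cpu], CLASS_RING_SIZE, GFP_KERNEL);
                if (err)
                        return err;
        }
        c->tokens = burst;
        c->rate = rate;
        c->burst = burst;
        c->last = ktime_get();
        return 0;
}

/* Producer side: runs on the sending cpu; per-cpu rings keep producer
 * contention local. Drops on ring overflow to keep the sketch short. */
static int class_enqueue(struct class_rings *c, struct sk_buff *skb)
{
        int cpu = get_cpu();
        int err = ptr_ring_produce(&c->rings[cpu], skb);

        put_cpu();
        if (err)
                kfree_skb(skb);
        return err;
}

/* Refill the bucket from elapsed time, capped at the burst size. */
static void class_refill(struct class_rings *c)
{
        ktime_t now = ktime_get();
        u64 delta_ns = ktime_to_ns(ktime_sub(now, c->last));

        c->tokens = min(c->burst,
                        c->tokens + div64_u64(delta_ns * c->rate, NSEC_PER_SEC));
        c->last = now;
}

/* Single consumer: bulk-dequeue from every per-cpu ring and release packets
 * while tokens last. A real qdisc would requeue when the bucket runs dry;
 * here packets are simply dropped. */
static void class_poll_once(struct class_rings *c)
{
        struct sk_buff *batch[CLASS_BULK];
        int cpu, i, n;

        class_refill(c);
        for_each_possible_cpu(cpu) {
                n = ptr_ring_consume_batched(&c->rings[cpu],
                                             (void **)batch, CLASS_BULK);
                for (i = 0; i < n; i++) {
                        struct sk_buff *skb = batch[i];

                        if (c->tokens < skb->len) {
                                kfree_skb(skb);
                                continue;
                        }
                        c->tokens -= skb->len;
                        /* hand skb to the device txq here */
                }
        }
}

Whether reordering is avoided then presumably depends on how packets are
mapped to rings: keying on the flow/txq rather than the running cpu keeps a
flow on a single ring, which is the property the thread is debating.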