Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Fri, 06 Jul 2007 10:39:15 -0400

> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it.

I would agree with this if there were a reason for it; it's a totally unnecessary complication as far as I can see. These virtual devices are an ethernet with the subnet details exposed to the driver, nothing more. I see zero benefit to having a netdev for each guest or node we can speak to whatsoever. It's a very heavy abstraction to use for something that is so bloody simple.

My demux on ->hard_start_xmit() is _5 DAMN LINES OF CODE_, and you want to replace that with a full netdev because of some minor difficulties in figuring out how to record the queuing state. It's beyond unreasonable.

Netdevs are like salt: if you put too much in your food it tastes awful.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
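Dave's five-line demux is not shown anywhere in the thread, but the idea it implements - an exact match on the destination MAC selects a peer's queue, while broadcast, multicast and unknown destinations fall through to the control node - can be sketched as plain userspace C. Everything below (the table layout, names, and sizes) is invented here for illustration, not the sunvnet code:

```c
#include <assert.h>
#include <string.h>

#define ETH_ALEN  6
#define MAX_PORTS 16

/* Hypothetical per-port table: remote MAC -> TX queue index. */
struct vnet_port {
    unsigned char remote_mac[ETH_ALEN];
    int queue;
};

static struct vnet_port ports[MAX_PORTS];
static int nports;
static const int control_queue = 0; /* control node port handles the rest */

/* The demux: exact remote-MAC match selects a peer queue; broadcast,
 * multicast and unknown unicast all go to the control node's port. */
static int select_queue(const unsigned char *dst_mac)
{
    int i;

    if (dst_mac[0] & 1)            /* group (broadcast/multicast) bit */
        return control_queue;
    for (i = 0; i < nports; i++)
        if (!memcmp(ports[i].remote_mac, dst_mac, ETH_ALEN))
            return ports[i].queue;
    return control_queue;          /* off-box or unknown destination */
}
```

In a real driver this lookup would have to run before (or at) ->hard_start_xmit() so the right per-port queue state can be consulted, which is exactly the queue-selection problem being debated.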
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
> The first thing that crossed my mind was "if you want to select a
> destination port based on a destination MAC you are talking about a
> switch/bridge". You bring up the issue of "a huge number of virtual NICs
> if you wanted arbitrary guests" which is a real one[2].

Hi Jamal,

I'm deeply tempted to agree with you that the answer is multiple virtual NICs (and I've been tempted to abandon lguest's N-way transport scheme), except that it looks like we're going to have multi-queue NICs for other reasons. Otherwise I'd be tempted to say "create/destroy virtual NICs as other guests appear/vanish from the network". No one does this today, but that doesn't make it wrong.

> If i got this right, still not answering the netif_stop question posed:
> the problem you are also trying to resolve now is get rid of N
> netdevices on each guest for a usability reason; i.e have one netdevice,
> move the bridging/switching functionality/tables into the driver;
> replace the ports with queues instead of netdevices. Did i get that
> right?

Yep, well summarized. I guess the question is: should the Intel guys be representing their multi-queue NICs as multiple NICs rather than adding the subqueue concept?

> BTW, one curve that threw me off a little is it seems most of the
> hardware that provides virtualization also provides point-to-point
> connections between different domains; i always thought that they all
> provided a point-to-point to the dom0 equivalent and let the dom0 worry
> about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it. One way we can resolve the listing is with a
> simple tag to the netdev struct: i could say "list netdevices for
> guest 0-10" etc etc.

This would be a useful feature, not only for virtualization. I've seen some boxes with thousands of net devices (mostly ppp, but also some ATM). It would be nice to be able to assign a tag to an arbitrary set of devices. Does the network namespace stuff help with any of this?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:
[..some good stuff deleted here ..]
> Hope that adds something,

It does - thanks. I think i was letting my experience pollute my thinking earlier when Dave posted. The copy-avoidance requirement is clear to me[1]. I had another issue which wasn't clear, but you touched on it, so this breaks the ice for me - be gentle please, my asbestos suit is full of dust these days ;->:

The first thing that crossed my mind was "if you want to select a destination port based on a destination MAC you are talking about a switch/bridge". You bring up the issue of "a huge number of virtual NICs if you wanted arbitrary guests" which is a real one[2].

Let's take the case of a small number of guests; a bridge of course would solve the problem with the copy-avoidance, with the caveat being:
- you now have N bridges and their respective tables for N domains, i.e one on each domain
- N netdevices on each domain as well (of course you could say that is not very different resource-wise from N queues instead).

If i got this right, still not answering the netif_stop question posed: the problem you are also trying to resolve now is to get rid of N netdevices on each guest for a usability reason; i.e have one netdevice, move the bridging/switching functionality/tables into the driver; replace the ports with queues instead of netdevices. Did i get that right?

If the issue is usability of listing 1024 netdevices, i can think of many ways to resolve it. One way we can resolve the listing is with a simple tag to the netdev struct: i could say "list netdevices for guest 0-10" etc etc.

I am having a little problem differentiating conceptually the case of a guest being different from the host/dom0 if you want to migrate the switching/bridging functions into each guest. So even if this doesn't apply to all domains, it does apply to the dom0.
I like netdevices today (as opposed to queues within netdevices): the stack knows them well (I can add IP addresses, i can point routes to them, I can change MAC addresses, i can bring them administratively down/up, I can add qos rules etc etc). I can also tie netdevices to a CPU and therefore scale that way. I see this as viable at least from the host/dom0 perspective if a netdevice represents a guest.

Sorry for the long email - drained some of my morning coffee. Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/uml/old-openvz - their model is to let the host do the routing/switching between guests or the outside of the box. From your description i would add Xen to that behavior. From Dave's posting, i understand that for many good reasons, any time you move between any one domain to another you are copying. So if you use Xen and you want to go from domainX to domainY you go through dom0, which implies copying domainX->dom0 then dom0->domainY. BTW, one curve that threw me off a little is it seems most of the hardware that provides virtualization also provides point-to-point connections between different domains; i always thought that they all provided a point-to-point to the dom0 equivalent and let the dom0 worry about how things get from domainX to domainY.

[2] Unfortunately that means if i wanted 1024 virtual routers/guest domains i have at least 1024 netdevices on each guest connected to the bridge on the guest. I have a freaking problem listing 72 netdevices today on some device i have.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
> On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
> [.. some useful stuff here deleted ..]
>
> > That's why you have to copy into a purpose-built set of memory
> > that is composed of pages that _ONLY_ contain TX packet buffers
> > and nothing else.
> >
> > The cost of going through the switch is too high, and the copies are
> > necessary, so concentrate on allowing me to map the guest ports to the
> > egress queues. Anything else is a waste of discussion time, I've been
> > poring over these issues endlessly for weeks, so if I'm saying doing
> > copies and avoiding the switch is necessary I do in fact mean it. :-)
>
> ok, i get it Dave ;-> Thanks for your patience, that was useful.
> Now that is clear for me, I will go back and look at your original email
> and try to get back on track to what you really asked ;->

To expand on this, there are already "virtual" nic drivers in tree which do the demux based on dst mac and send to the appropriate other guest (iseries_veth.c, and Carsten Otte said the S/390 drivers do too). lguest and DaveM's LDOM make two more.

There is currently no good way to write such a driver. If one recipient is full, you have to drop the packet: if you netif_stop_queue, it means a slow/buggy recipient blocks packets going to other recipients. But dropping packets makes networking suck.

Some hypervisors (eg. Xen) only have a virtual NIC which is point-to-point: this sidesteps the issue, with the risk that you might need a huge number of virtual NICs if you wanted arbitrary guests to talk to each other (Xen doesn't support that, they route/bridge through dom0).

Most hypervisors have a sensible maximum on the number of guests they could talk to, so I'm not too unhappy with a static number of queues. But the dstmac -> queue mapping changes in hypervisor-specific ways, so it really needs to be managed by the driver...

Hope that adds something,
Rusty.
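The dilemma Rusty describes - netif_stop_queue semantics stall every recipient behind one slow guest, while per-port drops lose packets - is easy to model outside the kernel. This is a toy userspace sketch, not any of the drivers mentioned; all names and sizes are made up:

```c
#include <assert.h>
#include <stdbool.h>

#define NPORTS     4
#define RING_SLOTS 8

/* Hypothetical per-port TX ring fill counts, one ring per guest port. */
static int ring_used[NPORTS];

/* Option 1: netif_stop_queue semantics - if *any* recipient's ring is
 * full, the whole device is stopped and no recipient gets traffic. */
static bool can_xmit_stop_queue(void)
{
    for (int i = 0; i < NPORTS; i++)
        if (ring_used[i] >= RING_SLOTS)
            return false;   /* one slow guest stalls everyone */
    return true;
}

/* Option 2: per-port drop - only packets to the full recipient are
 * dropped, treated like "loss on the wire"; other ports keep flowing. */
static bool xmit_per_port(int port)
{
    if (ring_used[port] >= RING_SLOTS)
        return false;       /* drop just this packet */
    ring_used[port]++;
    return true;
}
```

Per-queue flow control (one stoppable queue per port, selected before the generic queueing layer) is the third option the thread is circling toward.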
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]
> That's why you have to copy into a purpose-built set of memory
> that is composed of pages that _ONLY_ contain TX packet buffers
> and nothing else.
>
> The cost of going through the switch is too high, and the copies are
> necessary, so concentrate on allowing me to map the guest ports to the
> egress queues. Anything else is a waste of discussion time, I've been
> poring over these issues endlessly for weeks, so if I'm saying doing
> copies and avoiding the switch is necessary I do in fact mean it. :-)

ok, i get it Dave ;-> Thanks for your patience, that was useful. Now that is clear for me, I will go back and look at your original email and try to get back on track to what you really asked ;->

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Tue, 03 Jul 2007 08:42:33 -0400

> (likely not in the case of hypervisor based virtualization like Xen)
> just have their skbs cloned when crossing domains, is that not the
> case?[1]
> Assuming they copy, the balance that needs to be struck now is
> between:

Sigh, I kind of hoped I wouldn't have to give a lesson in hypervisors and virtualized I/O and all the issues contained within, but if you keep pushing the "avoid the copy" idea I guess I am forced to educate. :-)

First, keep in mind that my Linux guest drivers are talking to Solaris control node servers and switches; I cannot control the API for any of this stuff. And I think that's a good thing in fact.

Exporting memory between nodes is _THE_ problem with virtualized I/O in hypervisor based systems. These things should even be able to work between two guests that simply DO NOT trust each other at all.

With that in mind the hypervisor provides a very small shim layer of interface for exporting memory between two nodes. There is a pseudo-pagetable where you export pages, and a set of interfaces, one of which copies to/from inported memory to/from local memory.

If a guest reboots, crashes, or gets stuck, you have to be able to revoke the memory the remote node has inported. When this happens, if the inporting node comes back to life and tries to touch those pages it takes a fault.

Taking a fault is easy if the nodes go through the hypervisor copy interface; they just get a return value back. If, instead, you try to map in those pages or program them into the IOMMU of the PCI controller, you get faults, and extremely difficult to handle faults at that. If the IOMMU takes the exception on a revoked page, your E1000 card resets when it gets the master abort from the PCI controller.
On the CPU side you have to annotate every single kernel access to this memory mapping of inported pages, just like we have to annotate all userspace accesses with exception tables mapping load and store instructions to fixup code, in order to handle the fault correctly.

Next, you don't trust the other end, as we already stated, so you can't export an object in a page that belongs to other objects. For example, if an SKB's data sits in the same page as the plain-text password the user just typed in, you can't export that page. That's why you have to copy into a purpose-built set of memory that is composed of pages that _ONLY_ contain TX packet buffers and nothing else.

The cost of going through the switch is too high, and the copies are necessary, so concentrate on allowing me to map the guest ports to the egress queues. Anything else is a waste of discussion time, I've been poring over these issues endlessly for weeks, so if I'm saying doing copies and avoiding the switch is necessary I do in fact mean it. :-)
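Dave's "pages that _ONLY_ contain TX packet buffers" argument comes down to copying every outgoing frame into a dedicated pool before exporting it, so an untrusted peer can never see unrelated data. A toy userspace model of that copy step - all sizes and names here are made up, not the LDOM interfaces:

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE  4096
#define BUF_SIZE   2048                 /* one TX buffer */
#define POOL_PAGES 4
#define POOL_BUFS  (POOL_PAGES * PAGE_SIZE / BUF_SIZE)

/* The pool pages hold TX buffers and nothing else, so exporting them
 * to an untrusted peer can never leak unrelated memory contents. */
static unsigned char pool[POOL_PAGES * PAGE_SIZE];
static int next_buf;

/* Copy a packet out of its original (unexportable) location into the
 * dedicated pool; returns a pointer into the exportable region, or
 * NULL when the pool is exhausted and the caller must drop/stop. */
static unsigned char *tx_copy(const void *pkt, size_t len)
{
    unsigned char *buf;

    if (len > BUF_SIZE || next_buf >= POOL_BUFS)
        return NULL;
    buf = pool + (size_t)next_buf++ * BUF_SIZE;
    memcpy(buf, pkt, len);
    return buf;
}
```

A real driver would recycle buffers as the peer consumes them; this sketch only shows why the copy into a segregated region is unavoidable when pages are exported wholesale.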
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:
> It's like twice as fast, since the switch doesn't have to copy
> the packet in, switch it, then the destination guest copies it
> into its address space.
>
> There is approximately one copy for each hop you go over through these
> virtual devices.

Ok - i see what you are getting at, and while it makes more sense to me now, let me continue to be _the_ devil's advocate (sip some espresso before responding or reading): for some reason i always thought that packets going across these things (likely not in the case of hypervisor based virtualization like Xen) just have their skbs cloned when crossing domains, is that not the case?[1]

Assuming they copy, the balance that needs to be struck now is between:
a) copy is expensive
vs
b1) For N guests, N^2 queues in the system vs N queues, and 1 vs N replicated global info.
b2) The architecture challenges to resolve the fact you now have to deal with a mesh (1-1 mapping) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when i had played with what is now openvz, i was happy to get 1024 virtual routers/guests (each running Zebra/OSPF). I could live with a little more wasted memory if the copy is reduced. I think sub-consciously i am questioning #b2. Do you really need that sacrifice just so that you can avoid one extra copy between two guests? If i was running virtual routers or servers i think the majority of traffic (by far) would be between a domain and the outside of the box, not between any two domains within the same box.

cheers,
jamal

[1] But then if this is true, i can think of a simple way to attack the other domains by inserting a kernel module into a domain that reduced the refcount of each received skb to 0. I would be surprised if the openvz type approach hasn't thought this through.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Sat, 30 Jun 2007 10:52:44 -0400

> On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
>
> > Awesome, but let's concentrate on the client since I can actually
> > implement and test anything we come up with :-)
>
> Ok, you need to clear one premise for me then ;->
> You said the model is for the guest/client to have a port to the
> host and one to each guest; i think this is the confusing part for me
> (and may have led to the switch discussion) because i have not seen this
> model used before. What i have seen before is that the host side
> connects the different guests. In such a scenario, on the guest is a
> single port that connects to the host - the host worries (let's forget
> the switch/bridge for a sec) about how to get packets from guestX to
> guestY pending consultation of access control details.
> What is the advantage of direct domain-domain connection? Is it
> scalable?

It's like twice as fast, since the switch doesn't have to copy the packet in, switch it, then the destination guest copies it into its address space.

There is approximately one copy for each hop you go over through these virtual devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
> Awesome, but let's concentrate on the client since I can actually
> implement and test anything we come up with :-)

Ok, you need to clear one premise for me then ;->

You said the model is for the guest/client to have a port to the host and one to each guest; i think this is the confusing part for me (and may have led to the switch discussion) because i have not seen this model used before. What i have seen before is that the host side connects the different guests. In such a scenario, on the guest is a single port that connects to the host - the host worries (let's forget the switch/bridge for a sec) about how to get packets from guestX to guestY pending consultation of access control details. What is the advantage of direct domain-domain connection? Is it scalable?

cheers,
jamal
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> It would be great if we could finally get a working e1000
> multiqueue patch so work in this area can actually be tested.

I'm actively working on this right now. I'm on vacation next week, but hopefully I can get something working before I leave OLS and post it.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
David Miller wrote:
> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html
>
> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device. There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
>
> Each guest gets a unique MAC address. There is a queue per-port
> that can fill up.
>
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well. We can thus only wake up the
> queue when all of the ports have some space.
>
> The ports (and thus the queues) are selected by destination
> MAC address. Each port has a remote MAC address; if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue. If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.
>
> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver. Perhaps a classifier in the driver or similar.

That sounds like the only reasonable possibility if you really do want to use queues. Another possibility would be to not use a queue, make the whole thing unreliable, and treat full rx rings of the guests as "loss on the wire". Not sure if that makes any sense.
I was thinking about adding a way for (multiqueue) drivers to use other default qdiscs than pfifo_fast so they can default to a multiband prio or something else that makes sense for them .. maybe a dev->qdisc_setup hook that is invoked from dev_activate. They would need to be able to add a default classifier for this to have any effect (the grand plan is to get rid of the horrible wme scheduler). Specialized classifiers like your dst MAC classifier and maybe even WME should then probably be built into the driver and not register with the API, so they don't become globally visible.

> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar. With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

It would be great if we could finally get a working e1000 multiqueue patch so work in this area can actually be tested.
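The smp_processor_id() hashing mentioned in the quote above amounts to nothing more than mapping the sending CPU onto a transmit queue index, so that different CPUs usually land on different queues (and, eventually, different per-queue TX locks). A trivial userspace illustration, with an invented queue count:

```c
#include <assert.h>

#define NUM_TX_QUEUES 4   /* hypothetical number of hardware TX queues */

/* Hash the sending CPU onto a TX queue; a trivial modulo suffices when
 * queue counts are small, keeping concurrent senders on separate queues. */
static unsigned int tx_queue_for_cpu(unsigned int cpu)
{
    return cpu % NUM_TX_QUEUES;
}
```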
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> "DM" == David Miller <[EMAIL PROTECTED]> writes:

DM> And some people still use hubs, believe it or not.

Hubs are 100Mbps at most. You could of course make a flooding Gbps switch, but it would be rather silly. If you care about multicast performance, you get a switch with IGMP snooping.

/Benny
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]> Date: Fri, 29 Jun 2007 21:30:53 -0400

> On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> > Maybe for the control node switch, yes, but not for the guest network
> > devices.
>
> And that is precisely what i was talking about - and i am sure that's how
> the discussion with Patrick was.

Awesome, but let's concentrate on the client since I can actually implement and test anything we come up with :-)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> This conversation begins to go into a pointless direction already, as
> I feared it would.
>
> Nobody is going to configure bridges, classification, tc, and all of
> this other crap just for a simple virtualized guest networking device.
>
> It's a confined and well defined case that doesn't need any of that.
> You've got to be fucking kidding me if you think I'm going to go
> through the bridging code and all of that layering instead of my
> hash demux on transmit which is 4 or 5 lines of C code at best.
>
> Such a suggestion is beyond stupid.

Ok, calm down - will you please? If you are soliciting opinions, then you should be expecting all sorts of answers, otherwise why bother posting. If you think you are misunderstood, just clarify. Otherwise you are being totally unreasonable.

> Maybe for the control node switch, yes, but not for the guest network
> devices.

And that is precisely what i was talking about - and i am sure that's how the discussion with Patrick was.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: Ben Greear <[EMAIL PROTECTED]> Date: Fri, 29 Jun 2007 08:33:06 -0700

> Patrick McHardy wrote:
> > Right, but the current bridging code always uses promiscuous mode
> > and it's nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid though by disabling learning (and thus
> > promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.
>
> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

Your neighbour on the switch is being pummeled with multicast traffic, and now you get to see it all too.

Switches don't obviate the cost of promiscuous mode; you keep wanting to discuss this and think it doesn't matter, but it does.

And some people still use hubs, believe it or not.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
This conversation begins to go into a pointless direction already, as I feared it would.

Nobody is going to configure bridges, classification, tc, and all of this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that. You've got to be fucking kidding me if you think I'm going to go through the bridging code and all of that layering instead of my hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Ben Greear wrote:
> > Could someone give a quick example of when I am wrong and promisc mode
> > would allow a NIC to receive a significant number of packets not really
> > destined for it?
>
> In a switched environment it won't have a big effect, I agree. It might
> help avoid receiving unwanted multicast traffic, which could be more
> significant than unicast. Anyways, why be wasteful when it can be
> avoided .. :)

Ok, I had forgotten about multicast, thanks for the reminder!

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Ben Greear wrote:
> Patrick McHardy wrote:
>
>> Right, but the current bridging code always uses promiscuous mode
>> and it's nice to avoid that if possible. Looking at the code, it
>> should be easy to avoid though by disabling learning (and thus
>> promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.

I don't know if that really was Dave's reason to handle it in a driver.

> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

In a switched environment it won't have a big effect, I agree. It might help avoid receiving unwanted multicast traffic, which could be more significant than unicast. Anyways, why be wasteful when it can be avoided .. :)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible. Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

I am curious about why people are so hot to do away with promisc mode. It seems to me that in a modern switched environment, there should only very rarely be unicast packets received on an interface that does not want to receive them.

Could someone give a quick example of when I am wrong and promisc mode would allow a NIC to receive a significant number of packets not really destined for it?

Thanks,
Ben

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
> jamal wrote:
> > On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible.
> Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

Yes, that would do it for static provisioning (I suspect that would work today unless bridging has no knobs to turn off going into promisc). But you could even allow for learning and just have extra filters in tc before bridging disallowing things.

> Have a look at my secondary unicast address patches in case you didn't
> notice them before (there's also a driver example for e1000 on netdev):
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
>
> You still need to do filtering in software, but you can have the NIC
> pre-filter in case it supports it, otherwise it goes to promiscuous mode.

Ok, I will look at them when i get back. Sorry - haven't caught up on netdev.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
>> The difference to a real bridge is that all the
>> addresses are completely known in advance, so it doesn't need
>> promiscuous mode for learning.
>
> You mean the per-virtual MAC addresses are known in advance, right?

Yes.

> This is fine. The bridging or otherwise (like L3 etc) is for
> interconnecting once you have the provisioning done. And you could build
> different "broadcast domains" by having multiple bridges.

Right, but the current bridging code always uses promiscuous mode and it's nice to avoid that if possible. Looking at the code, it should be easy to avoid though by disabling learning (and thus promiscuous mode) and adding unicast filters for all static fdb entries.

> To go off on a slight tangent:
> I think you have to look at the two types of NICs separately
> 1) dumb ones where you may have to use the mcast filters in hardware to
> pretend you have a unicast address per virtual device - those will be
> really hard to simulate using a separate netdevice per MAC address.
> Actually your bigger problem on those is tx MAC address selection
> because that is not built into the hardware. I still think even for
> these types something above the netdevice (bridge, L3 routing, tc action
> redirect etc) will do.

Have a look at my secondary unicast address patches in case you didn't notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC pre-filter in case it supports it, otherwise it goes to promiscuous mode.

> 2) The new NICs being built for virtualization; those allow you to
> explicitly have clean separation of IO where the only thing that is
> shared between virtual devices is the wire and the bus (otherwise
> each has its own registers etc), i.e. the hardware is designed with this
> in mind.
> In such a case, I think a separate netdevice per single MAC
> address - possibly tied to a separate CPU - should work.

Agreed, that could also be useful for non-virtualized use.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
> I'm guessing that that wouldn't allow doing unicast filtering for
> the guests on the real device without hacking the bridge code for
> this special case.

For ingress (I guess you could say for egress as well): we can do it as well today with tc filtering on the host - it is involved, but is part of provisioning for a guest IMO. A substantial number of ethernet switches (ok, not the $5 ones) do filtering at the same level.

> The difference to a real bridge is that all the
> addresses are completely known in advance, so it doesn't need
> promiscuous mode for learning.

You mean the per-virtual MAC addresses are known in advance, right? This is fine. The bridging or otherwise (like L3 etc) is for interconnecting once you have the provisioning done. And you could build different "broadcast domains" by having multiple bridges.

To go off on a slight tangent: I think you have to look at the two types of NICs separately
1) dumb ones where you may have to use the mcast filters in hardware to pretend you have a unicast address per virtual device - those will be really hard to simulate using a separate netdevice per MAC address. Actually your bigger problem on those is tx MAC address selection because that is not built into the hardware. I still think even for these types something above the netdevice (bridge, L3 routing, tc action redirect etc) will do.
2) The new NICs being built for virtualization; those allow you to explicitly have clean separation of IO where the only thing that is shared between virtual devices is the wire and the bus (otherwise each has its own registers etc), i.e. the hardware is designed with this in mind. In such a case, I think a separate netdevice per single MAC address - possibly tied to a separate CPU - should work.
cheers, jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
>
>> Each guest gets a unique MAC address. There is a queue per-port
>> that can fill up.
>>
>> What all the drivers like this do right now is stop the queue if
>> any of the per-port queues fill up, and that's what my sunvnet
>> driver does right now as well. We can only thus wake up the
>> queue when all of the ports have some space.
>
> Is a netdevice really the correct construct for the host side?
> Sounds to me a layer above the netdevice is the way to go. A bridge for
> example, or L3 routing, or even simple tc classify/redirection etc.
> I haven't used what has become openvz these days in many years (or played
> with Eric's approach), but if I recall correctly - it used to have a
> single netdevice per guest on the host. That's close to what a basic
> qemu/UML has today. In such a case it is something above netdevices
> which does the guest selection.

I'm guessing that that wouldn't allow doing unicast filtering for the guests on the real device without hacking the bridge code for this special case. The difference to a real bridge is that all the addresses are completely known in advance, so it doesn't need promiscuous mode for learning.
Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
I've changed the topic for you, friend - otherwise most people won't follow (as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

nice.

> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device. There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
>
> Each guest gets a unique MAC address. There is a queue per-port
> that can fill up.
>
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well. We can only thus wake up the
> queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side? Sounds to me a layer above the netdevice is the way to go. A bridge for example, or L3 routing, or even simple tc classify/redirection etc. I haven't used what has become openvz these days in many years (or played with Eric's approach), but if I recall correctly - it used to have a single netdevice per guest on the host. That's close to what a basic qemu/UML has today. In such a case it is something above netdevices which does the guest selection.

> The ports (and thus the queues) are selected by destination
> MAC address. Each port has a remote MAC address, if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue. If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.)
> we want to use the control node's port and port queue.

Ok, Dave, isn't that what a bridge does? ;-> You'd need filtering to go with it (for example to restrict guest0 from getting certain broadcasts etc) - but we already have that.

> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver. Perhaps a classifier in the driver or similar.
>
> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar. With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

Major surgery will be needed on the tx path if you want to hash the tx queue to a processor id. Our unit construct (today, net-2.6.23) that can be tied to a cpu is a netdevice. OTOH, if you used a netdevice it should work as is. But I am possibly missing something in your comments. What do you have in mind?

> My current work-in-progress sunvnet.c driver is included below so
> we can discuss things concretely with code.
>
> I'm listening. :-)

And you got words above.

cheers, jamal
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Ok everything is checked into net-2.6.23, thanks everyone.

Dave, thank you for your patience and feedback on this whole process. Patrick and everyone else, thank you for your feedback and assistance. I am looking at the virtualization question you posed, but I need sleep since I just remembered I'm on east coast time here at OLS, and it's 4:30am.

Thanks again, -PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: Patrick McHardy <[EMAIL PROTECTED]> Date: Thu, 28 Jun 2007 21:24:37 +0200

> Waskiewicz Jr, Peter P wrote:
> >> [...]
> >> The only reasonable thing it can do is not care about
> >> multiqueue and just dequeue as usual. In fact I think it
> >> should be an error to configure multiqueue on a non-root qdisc.
> >
> > Ack. This is a thought process that trips me up from time to time... I
> > see child qdisc, and think that's the last qdisc to dequeue and send to
> > the device, not the first one to dequeue. So please disregard my
> > comments before; I totally agree with you. Great catch here; I really
> > like the prio_classify() cleanup.
>
> Thanks. This updated patch makes configuring a non-root qdisc for
> multiqueue an error.

Ok everything is checked into net-2.6.23, thanks everyone.

Now I get to pose a problem for everyone, prove to me how useful this new code is by showing me how it can be used to solve a recurring problem in virtualized network drivers, of which I've had to code one up recently; see my most recent blog entry at:

http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

Anyways the gist of the issue is (and this happens for Sun LDOMS networking, lguest, IBM iSeries, etc.) that we have a single virtualized network device. There is a "port" to the control node (which switches packets to the real network for the guest) and one "port" to each of the other guests.

Each guest gets a unique MAC address. There is a queue per-port that can fill up.

What all the drivers like this do right now is stop the queue if any of the per-port queues fill up, and that's what my sunvnet driver does right now as well. We can only thus wake up the queue when all of the ports have some space.

The ports (and thus the queues) are selected by destination MAC address. Each port has a remote MAC address, if there is an exact match with a port's remote MAC we'd use that port and thus that port's queue.
If there is no exact match (some other node on the real network, broadcast, multicast, etc.) we want to use the control node's port and port queue.

So the problem to solve is to make a way for drivers to do the queue selection before the generic queueing layer starts to try and push things to the driver. Perhaps a classifier in the driver or similar.

The solution to this problem generalizes to the other facility we want now, hashing the transmit queue by smp_processor_id() or similar. With that in place we can look at doing the TX locking per-queue too as is hinted at by the comments above the per-queue structure in the current net-2.6.23 tree.

My current work-in-progress sunvnet.c driver is included below so we can discuss things concretely with code.

I'm listening. :-)

sunvnet.h:

#ifndef _SUNVNET_H
#define _SUNVNET_H

#define DESC_NCOOKIES(entry_size)	\
	((entry_size) - sizeof(struct vio_net_desc))

/* length of time before we decide the hardware is borked,
 * and dev->tx_timeout() should be called to fix the problem
 */
#define VNET_TX_TIMEOUT			(5 * HZ)

#define VNET_TX_RING_SIZE		512
#define VNET_TX_WAKEUP_THRESH(dr)	((dr)->pending / 4)

/* VNET packets are sent in buffers with the first 6 bytes skipped
 * so that after the ethernet header the IPv4/IPv6 headers are aligned
 * properly.
 */
#define VNET_PACKET_SKIP		6

struct vnet_tx_entry {
	void			*buf;
	unsigned int		ncookies;
	struct ldc_trans_cookie	cookies[2];
};

struct vnet;
struct vnet_port {
	struct vio_driver_state	vio;

	struct hlist_node	hash;
	u8			raddr[ETH_ALEN];

	struct vnet		*vp;

	struct vnet_tx_entry	tx_bufs[VNET_TX_RING_SIZE];

	struct list_head	list;
};

static inline struct vnet_port *to_vnet_port(struct vio_driver_state *vio)
{
	return container_of(vio, struct vnet_port, vio);
}

#define VNET_PORT_HASH_SIZE	16
#define VNET_PORT_HASH_MASK	(VNET_PORT_HASH_SIZE - 1)

static inline unsigned int vnet_hashfn(u8 *mac)
{
	unsigned int val = mac[4] ^ mac[5];

	return val & (VNET_PORT_HASH_MASK);
}

struct vnet {
	/* Protects port_list and port_hash.  */
	spinlock_t		lock;

	struct net_device	*dev;

	u32			msg_enable;
	struct vio_dev		*vdev;

	struct list_head	port_list;

	struct hlist_head	port_hash[VNET_PORT_HASH_SIZE];
};

#endif /* _SUNVNET_H */

sunvnet.c:

/* sunvnet.c: Sun LDOM Virtual Network Driver.
 *
 * Copyright (C) 2007 David S. Miller <[EMAIL PROTECTED]>
 */

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#include "sunvnet.h"

#define DRV_MODULE_NAME		"sunvnet"
#define PFX DRV_MODULE_NAME	": "
#define DRV_MODULE_VERSIO
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Waskiewicz Jr, Peter P wrote:
> >> [...]
> >> The only reasonable thing it can do is not care about multiqueue and
> >> just dequeue as usual. In fact I think it should be an error to
> >> configure multiqueue on a non-root qdisc.
> >
> > Ack. This is a thought process that trips me up from time to time... I
> > see child qdisc, and think that's the last qdisc to dequeue and send
> > to the device, not the first one to dequeue. So please disregard my
> > comments before; I totally agree with you. Great catch here; I really
> > like the prio_classify() cleanup.
>
> Thanks. This updated patch makes configuring a non-root qdisc
> for multiqueue an error.

The patch looks fine to me. Thanks Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote: >>[...] >>The only reasonable thing it can do is not care about >>multiqueue and just dequeue as usual. In fact I think it >>should be an error to configure multiqueue on a non-root qdisc. > > > Ack. This is a thought process that trips me up from time to time...I > see child qdisc, and think that's the last qdisc to dequeue and send to > the device, not the first one to dequeue. So please disregard my > comments before; I totally agree with you. Great catch here; I really > like the prio_classify() cleanup. Thanks. This updated patch makes configuring a non-root qdisc for multiqueue an error. [SCHED] Qdisc changes and sch_rr added for multiqueue Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware support. sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This was done since sch_prio and sch_rr only differ in their dequeue routine. Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit 8798d6bf4f3ed4f5b162184282adc714ef5b69b1 tree b3ba373a0534b905b34abc520eba6adecdcc3ca5 parent 48e334930c5fbb64a821733a7056e2800057c128 author Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 28 Jun 2007 21:23:44 +0200 include/linux/pkt_sched.h |9 +++ net/sched/Kconfig | 11 net/sched/sch_prio.c | 129 - 3 files changed, 134 insertions(+), 15 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..268c515 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -101,6 +101,15 @@ struct tc_prio_qopt __u8 priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */ }; +enum +{ + TCA_PRIO_UNSPEC, + TCA_PRIO_MQ, + __TCA_PRIO_MAX +}; + +#define TCA_PRIO_MAX(__TCA_PRIO_MAX - 1) + /* TBF section */ struct tc_tbf_qopt diff --git a/net/sched/Kconfig 
b/net/sched/Kconfig index 475df84..f321794 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -111,6 +111,17 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_RR + tristate "Multi Band Round Robin Queuing (RR)" + select NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. + config NET_SCH_RED tristate "Random Early Detection (RED)" ---help--- diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..4045220 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -40,9 +40,11 @@ struct prio_sched_data { int bands; + int curband; /* for round-robin */ struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + int mq; }; @@ -70,14 +72,17 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; - return q->queues[q->prio2band[band&TC_PRIO_MAX]]; + band = q->prio2band[band&TC_PRIO_MAX]; + goto out; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) - return q->queues[q->prio2band[0]]; - + band = q->prio2band[0]; +out: + if (q->mq) + skb_set_queue_mapping(skb, band); return q->queues[band]; } @@ -144,17 +149,58 @@ prio_dequeue(struct Qdisc* sch) struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { - qdisc = q->queues[prio]; - skb = qdisc->dequeue(qdisc); - if (skb) { - sch->q.qlen--; - return skb; + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. + */ + if (!netif_subqueue_stopped(sch->dev, (q->mq ? 
prio : 0))) { + qdisc = q->queues[prio]; + skb = qdisc->dequeue(qdisc); + if (skb) { +sch->q.qlen--; +return skb; + } } } return NULL; } +static struct sk_buff *rr_dequeue(struct Qdisc* sch) +{ + struct sk_buff *skb; + struct prio_sched_data *q = qdisc_priv(sch); + struct Qdisc *qdisc; + int bandcount; + + /* Only take one pass through the queues. If nothing is available, + * return nothing. + */ + for (bandcount = 0; bandcount < q->bands; bandcount++) { + /* Check if the target subqueue is available before + * pulling an skb. This way we avoid excessive requeues + * for slower queues. If the queue is stopped, try the + * next queue. + */ + if (!netif_subqueue_stopped(sch->dev, + (q->mq ? q->curband : 0))) { + qdisc = q->queues[q->curband]; + skb = qdisc->dequeue(qdisc); + if (skb) { +sch->q.qlen--; +q->curband++; +if (q->curband >= q->bands) + q->curband = 0; +return skb; + } + } + q->curband++; + if (q->c
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Absolutely not. First of all, it's perfectly valid to use
> non-multiqueue qdiscs on multiqueue devices. Secondly, it's
> only the root qdisc that has to know about multiqueue since
> that one controls the child qdiscs.
>
> Think about it, it makes absolutely no sense to have the
> child qdisc even know about multiqueue. Changing my example
> to have a multiqueue qdisc as child:
>
> root qdisc: 2 band prio multiqueue
> child qdisc of band 0: 2 band prio multiqueue
>
> When the root qdisc decides to dequeue band 0, it checks
> whether subqueue 0 is active and dequeues the child qdisc. If
> the child qdisc is indeed another multiqueue qdisc as you
> suggest, it might decide to dequeue its own band 1 and check
> that subqueue state. So where should the packet finally end
> up? And what if one of the two subqueues is inactive?
>
> The only reasonable thing it can do is not care about
> multiqueue and just dequeue as usual. In fact I think it
> should be an error to configure multiqueue on a non-root qdisc.

Ack. This is a thought process that trips me up from time to time... I see child qdisc, and think that's the last qdisc to dequeue and send to the device, not the first one to dequeue. So please disregard my comments before; I totally agree with you. Great catch here; I really like the prio_classify() cleanup.

As always, many thanks for your feedback and help Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> PJ Waskiewicz wrote:
>>
>>> +#ifdef CONFIG_NET_SCH_MULTIQUEUE
>>> +	if (q->mq)
>>> +		skb->queue_mapping =
>>> +			q->prio2band[band&TC_PRIO_MAX];
>>> +	else
>>> +		skb->queue_mapping = 0;
>>> +#endif
>>
>> Setting it to zero here is wrong, consider:
>>
>> root qdisc: prio multiqueue
>> child qdisc: prio non-multiqueue
>>
>> The top-level qdisc will set it, the child qdisc will unset it again.
>> When multiqueue is inactive it should not touch it.
>>
>> I'll fix that as well.
>
> But the child can't assume the device is multiqueue if the child is
> non-multiqueue.

The child doesn't have to assume anything.

> This is the same issue with IP forwarding, where if you
> forward through a multiqueue device to a non-mq device, you don't know
> if the destination device is multiqueue.

No it's not. I'm talking about nested qdiscs, which are all on a single device.

> So the last qdisc to actually
> dequeue into a device should have control over what the queue mapping is.

Fully agreed. And that is always the top-level qdisc.

> If a user had a multiqueue qdisc as root, and configured a child qdisc as
> non-mq, that is a configuration error if the underlying device is indeed
> multiqueue IMO.

Absolutely not. First of all, it's perfectly valid to use non-multiqueue qdiscs on multiqueue devices. Secondly, it's only the root qdisc that has to know about multiqueue since that one controls the child qdiscs.

Think about it, it makes absolutely no sense to have the child qdisc even know about multiqueue. Changing my example to have a multiqueue qdisc as child:

root qdisc: 2 band prio multiqueue
child qdisc of band 0: 2 band prio multiqueue

When the root qdisc decides to dequeue band 0, it checks whether subqueue 0 is active and dequeues the child qdisc. If the child qdisc is indeed another multiqueue qdisc as you suggest, it might decide to dequeue its own band 1 and check that subqueue state. So where should the packet finally end up? And what if one of the two subqueues is inactive?

The only reasonable thing it can do is not care about multiqueue and just dequeue as usual. In fact I think it should be an error to configure multiqueue on a non-root qdisc.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> > +	if (q->mq)
> > +		skb->queue_mapping =
> > +			q->prio2band[band&TC_PRIO_MAX];
> > +	else
> > +		skb->queue_mapping = 0;
> > +#endif
>
> Setting it to zero here is wrong, consider:
>
> root qdisc: prio multiqueue
> child qdisc: prio non-multiqueue
>
> The top-level qdisc will set it, the child qdisc will unset it again.
> When multiqueue is inactive it should not touch it.
>
> I'll fix that as well.

But the child can't assume the device is multiqueue if the child is non-multiqueue. This is the same issue with IP forwarding, where if you forward through a multiqueue device to a non-mq device, you don't know if the destination device is multiqueue. So the last qdisc to actually dequeue into a device should have control over what the queue mapping is. If a user had a multiqueue qdisc as root, and configured a child qdisc as non-mq, that is a configuration error if the underlying device is indeed multiqueue IMO.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +#ifdef CONFIG_NET_SCH_MULTIQUEUE
> +	if (q->mq)
> +		skb->queue_mapping =
> +			q->prio2band[band&TC_PRIO_MAX];
> +	else
> +		skb->queue_mapping = 0;
> +#endif

Setting it to zero here is wrong, consider:

root qdisc: prio multiqueue
child qdisc: prio non-multiqueue

The top-level qdisc will set it, the child qdisc will unset it again. When multiqueue is inactive it should not touch it.

I'll fix that as well.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
> Thanks for fixing; however, the current sch_prio doesn't unregister the
> qdisc if register_qdisc() on prio fails, or does that happen implicitly
> because the module will probably unload?

If it failed, there's nothing to unregister. But when you register two qdiscs and the second one fails you have to unregister the first one. Your way works too, but it might fail registering the second one without providing any feedback to the user.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> It's not error handling. You do:
>
> err = register qdisc 1
> if (err)
>     return err;
> err = register qdisc 2
> if (err)
>     unregister qdisc 1
> return err
>
> anyways, I already fixed that and cleaned up prio_classify
> the way I suggested. Will send shortly.

Thanks for fixing; however, the current sch_prio doesn't unregister the qdisc if register_qdisc() on prio fails, or does that happen implicitly because the module will probably unload?

Thanks again Patrick, -PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> PJ Waskiewicz wrote:
>
>> +
>>  static int __init prio_module_init(void)
>>  {
>> -	return register_qdisc(&prio_qdisc_ops);
>> +	int err;
>> +	err = register_qdisc(&prio_qdisc_ops);
>> +	if (!err)
>> +		err = register_qdisc(&rr_qdisc_ops);
>> +	return err;
>>  }
>
> That's still broken. I'll fix this and some minor cleanliness issues
> myself so you don't have to go through another resend.

Here it is, fixed error handling and cleaned up prio_classify. There are still too many ifdefs in there for my taste, and I'm wondering whether the NET_SCH_MULTIQUEUE option should really be NETDEVICES_MULTIQUEUE. That would allow moving the #ifdefs to netif_subqueue_stopped and keep the qdiscs clean. I can send a patch for that on top of your patches.

[SCHED] Qdisc changes and sch_rr added for multiqueue

Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware support. sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This was done since sch_prio and sch_rr only differ in their dequeue routine.
Signed-off-by: Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit ab2c102d05981e983b494e9211e1c987ae0b80ee tree edb9457aeed6eb4ac8adef517b783b3361ba8830 parent 1937d8b868253885584dd49fe7ba804eda6b525f author Peter P Waskiewicz Jr <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 28 Jun 2007 18:47:49 +0200 include/linux/pkt_sched.h |9 +++ net/sched/Kconfig | 23 +++ net/sched/sch_prio.c | 142 - 3 files changed, 159 insertions(+), 15 deletions(-) diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h index d10f353..268c515 100644 --- a/include/linux/pkt_sched.h +++ b/include/linux/pkt_sched.h @@ -101,6 +101,15 @@ struct tc_prio_qopt __u8priomap[TC_PRIO_MAX+1]; /* Map: logical priority -> PRIO band */ }; +enum +{ + TCA_PRIO_UNSPEC, + TCA_PRIO_MQ, + __TCA_PRIO_MAX +}; + +#define TCA_PRIO_MAX(__TCA_PRIO_MAX - 1) + /* TBF section */ struct tc_tbf_qopt diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..65ee9e7 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -111,6 +111,29 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_RR + tristate "Multi Band Round Robin Queuing (RR)" + select NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. + +config NET_SCH_MULTIQUEUE + bool "Multiple hardware queue support" + ---help--- + Say Y here if you want to allow supported qdiscs to assign flows to + multiple hardware queues on an ethernet device. This will + still work on devices with 1 queue. + + Current qdiscs supporting this feature are NET_SCH_PRIO and + NET_SCH_RR. + + Most people will say N here. 
+ config NET_SCH_RED tristate "Random Early Detection (RED)" ---help--- diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index 6d7542c..b35b9c5 100644 --- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -40,9 +40,13 @@ struct prio_sched_data { int bands; + int curband; /* for round-robin */ struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; +#ifdef CONFIG_NET_SCH_MULTIQUEUE + unsigned char mq; +#endif }; @@ -70,14 +74,21 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr) #endif if (TC_H_MAJ(band)) band = 0; - return q->queues[q->prio2band[band&TC_PRIO_MAX]]; + band = q->prio2band[band&TC_PRIO_MAX]; + goto out; } band = res.classid; } band = TC_H_MIN(band) - 1; if (band >= q->bands) - return q->queues[q->prio2band[0]]; - + band = q->prio2band[0]; +out: +#ifdef CONFIG_NET_SCH_MULTIQUEUE + if (q->mq) + skb->queue_mapping = band; + else + skb->queue_mapping = 0; +#endif return q->queues[band]; } @@ -144,17 +155,65 @@ prio_dequeue(struct Qdisc* sch) struct Qdisc *qdisc; for (prio = 0; prio < q->bands; prio++) { - qdisc = q->queues[prio]; - skb = qdisc->dequeue(qdisc); - if (skb) { - sch->q.qlen--; - return skb; +#ifdef CONFIG_NET_SCH_MULTIQUEUE +
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> PJ Waskiewicz wrote:
>>> +
>>>  static int __init prio_module_init(void) {
>>> -	return register_qdisc(&prio_qdisc_ops);
>>> +	int err;
>>> +	err = register_qdisc(&prio_qdisc_ops);
>>> +	if (!err)
>>> +		err = register_qdisc(&rr_qdisc_ops);
>>> +	return err;
>>>  }
>>
>> That's still broken. I'll fix this and some minor cleanness
>> issues myself so you don't have to go through another resend.
>
> Auke and I just looked at register_qdisc() and this code. Maybe we
> haven't had enough coffee yet, but register_qdisc() returns 0 on
> success. So if register_qdisc(&prio_qdisc_ops) succeeds, then
> rr_qdisc_ops gets registered. I'm curious what is broken with this.

It's not error handling. You do:

	err = register qdisc 1
	if (err)
		return err;
	err = register qdisc 2
	if (err)
		unregister qdisc 1
	return err

Anyways, I already fixed that and cleaned up prio_classify the way I suggested. Will send shortly.
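The rollback pattern Patrick describes can be sketched in plain C. The stubs below simulate register_qdisc()/unregister_qdisc() (the real kernel functions take a struct Qdisc_ops *; these names and the failure toggle are hypothetical): the point is that when the second registration fails, the first must be undone before returning the error.

```c
#include <assert.h>

/* Stubs standing in for register_qdisc()/unregister_qdisc().
 * fail_second simulates the second registration failing. */
static int fail_second;
static int registered;

static int register_qdisc_stub(const char *id)
{
	if (fail_second && registered == 1)
		return -1;          /* simulated registration failure */
	registered++;
	return 0;
}

static void unregister_qdisc_stub(const char *id)
{
	registered--;
}

/* The shape Patrick asks for: roll back the prio registration if the
 * rr registration fails, so a failed module init leaves no qdisc behind. */
static int prio_module_init_sketch(void)
{
	int err = register_qdisc_stub("prio");
	if (err)
		return err;
	err = register_qdisc_stub("rr");
	if (err)
		unregister_qdisc_stub("prio");
	return err;
}
```

This is why "register the second only if the first succeeded" is not enough: the original code returns the error but leaves prio registered.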
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +
> >  static int __init prio_module_init(void) {
> > -	return register_qdisc(&prio_qdisc_ops);
> > +	int err;
> > +	err = register_qdisc(&prio_qdisc_ops);
> > +	if (!err)
> > +		err = register_qdisc(&rr_qdisc_ops);
> > +	return err;
> >  }
>
> Thats still broken. I'll fix this and some minor cleanness
> issues myself so you don't have to go through another resend.

Auke and I just looked at register_qdisc() and this code. Maybe we haven't had enough coffee yet, but register_qdisc() returns 0 on success. So if register_qdisc(&prio_qdisc_ops) succeeds, then rr_qdisc_ops gets registered. I'm curious what is broken with this.

Thanks Patrick,
-PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +
>  static int __init prio_module_init(void) {
> -	return register_qdisc(&prio_qdisc_ops);
> +	int err;
> +	err = register_qdisc(&prio_qdisc_ops);
> +	if (!err)
> +		err = register_qdisc(&rr_qdisc_ops);
> +	return err;
>  }

That's still broken. I'll fix this and some minor cleanness issues myself so you don't have to go through another resend.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Thats not necessary. I just though you could add one exit point:
>
> 	...
> out:
> 	skb->queue_mapping = q->mq ? band : 0;
> 	return q->queues[band];
> }
>
> But if that doesn't work don't bother ..

Unfortunately it won't, given how band might be used like this to select the queue:

	return q->queues[q->prio2band[band&TC_PRIO_MAX]];

I'll keep this in mind though, and if it can be done cleanly, I'll submit a patch.

Thanks Patrick,
-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>>> @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
>>>  #endif
>>>  		if (TC_H_MAJ(band))
>>>  			band = 0;
>>> +		if (q->mq)
>>> +			skb->queue_mapping =
>>> +				q->prio2band[band&TC_PRIO_MAX];
>>> +		else
>>> +			skb->queue_mapping = 0;
>>
>> Might look cleaner if you have one central point where queue_mapping
>> is set and the band is returned.
>
> I've taken a stab at this. I can have one return point, but I'll still
> have multiple assignments of skb->queue_mapping due to the different
> branches for which queue to select in the qdisc. I suppose we can do a
> rewrite of prio_classify(), but to me that seems beyond the scope of
> the multiqueue patches themselves. What do you think?

That's not necessary. I just thought you could add one exit point:

	...
out:
	skb->queue_mapping = q->mq ? band : 0;
	return q->queues[band];
}

But if that doesn't work don't bother ..
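Patrick's single-exit-point idea can be sketched with toy structures (hypothetical names, not the kernel code): every classification branch reduces to a band index, then one tail sets queue_mapping and returns the queue, so the mq conditional appears exactly once. This is essentially the shape the committed patch ends up with via goto out.

```c
#include <assert.h>

#define TC_PRIO_MAX 15
#define MAX_BANDS 16

/* Minimal stand-ins for the kernel structures involved. */
struct toy_skb {
	unsigned int queue_mapping;
};

struct toy_prio {
	int mq;                              /* multiqueue enabled? */
	int bands;
	unsigned char prio2band[TC_PRIO_MAX + 1];
	int queues[MAX_BANDS];               /* stands in for struct Qdisc * */
};

/* Every branch computes a band; one exit point sets queue_mapping. */
static int toy_classify(struct toy_prio *q, struct toy_skb *skb, int band)
{
	if (band >= q->bands)
		band = q->prio2band[0];      /* fall back to the default band */
	/* out: the single central point */
	skb->queue_mapping = q->mq ? band : 0;
	return q->queues[band];
}
```

With mq off, queue_mapping is always 0 regardless of the band chosen; with mq on, it tracks the band, which is what maps bands onto hardware queues.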
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc
> > *sch, int *qerr)
> >  #endif
> >  		if (TC_H_MAJ(band))
> >  			band = 0;
> > +		if (q->mq)
> > +			skb->queue_mapping =
> > +				q->prio2band[band&TC_PRIO_MAX];
> > +		else
> > +			skb->queue_mapping = 0;
>
> Might look cleaner if you have one central point where queue_mapping
> is set and the band is returned.

I've taken a stab at this. I can have one return point, but I'll still have multiple assignments of skb->queue_mapping due to the different branches for which queue to select in the qdisc. I suppose we can do a rewrite of prio_classify(), but to me that seems beyond the scope of the multiqueue patches themselves. What do you think?

Thanks,
-PJ
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> PJ Waskiewicz wrote:
> > +	/* If we're multiqueue, make sure the number of incoming bands
> > +	 * matches the number of queues on the device we're associating with.
> > +	 */
> > +	if (tb[TCA_PRIO_MQ - 1])
> > +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> > +
> > +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> > +		return -EINVAL;
>
> A nice thing you could do for the user here is use
> egress_subqueue_count as default when qopt->bands == 0 (and change tc
> prio to accept 0 in case it doesn't).

prio only allows a minimum of 2 bands right now. I see what you're suggesting though; let me think about this. I do like this suggestion.

Thanks,
-PJ Waskiewicz
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> And RTA_PUT_FLAG. Now that I think of it, does it even make sense to
>> have a prio private flag for this instead of a qdisc global one?
>
> There currently aren't any other qdiscs that are natural fits for
> multiqueue that I can see. I can see the benefit though of having this
> as a global flag in the qdisc API; let me check it out, and if it
> makes sense, I can move it.

Yes, that thought occurred to me as well. Keeping it private seems better.
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > enum > > { > > - TCA_PRIO_UNPSEC, > > - TCA_PRIO_TEST, > > > You misunderstood me. You can work on top of my compat > attribute patches, but the example code should not have to go > in to apply your patch. Ok. I'll fix my patches. > > diff --git a/net/sched/Kconfig b/net/sched/Kconfig index > > 475df84..7f14fa6 100644 > > --- a/net/sched/Kconfig > > +++ b/net/sched/Kconfig > > @@ -102,8 +102,16 @@ config NET_SCH_ATM > > To compile this code as a module, choose M here: the > > module will be called sch_atm. > > > > +config NET_SCH_BANDS > > +bool "Multi Band Queueing (PRIO and RR)" > > This options seems useless. Its not used *anywhere* except > for dependencies. I was trying to group the multiqueue qdiscs together with this. But I can see just having the multiqueue option for scheduling will cover this. I'll remove this. > > +config NET_SCH_BANDS_MQ > > + bool "Multiple hardware queue support" > > + depends on NET_SCH_BANDS > > > OK, again: > > Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. > Nothing at all changes for NET_SCH_PRIO itself. Additionally > introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at > all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in > sch_prio.c. > Your current code doesn't even have any ifdefs anymore > though, so this might not be needed at all. > > Additionally you could later introduce E1000_MULTIQUEUE and > have that select NET_SCH_MULTIQUEUE. I'll clean this up. Thanks for the persistance. :) > > diff --git a/net/sched/sch_generic.c > b/net/sched/sch_generic.c index > > 9461e8a..203d5c4 100644 > > --- a/net/sched/sch_generic.c > > +++ b/net/sched/sch_generic.c > > @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct > net_device *dev) > > spin_unlock(&dev->queue_lock); > > > > ret = NETDEV_TX_BUSY; > > - if (!netif_queue_stopped(dev)) > > + if (!netif_queue_stopped(dev) && > > + !netif_subqueue_stopped(dev, skb->queue_mapping)) > > /* churn baby churn .. 
*/ > > ret = dev_hard_start_xmit(skb, dev); > > I'll try again - please move this to patch 2/3. I'm sorry; I misread your original comment about this. I'll move the change (although this disappears with Jamal's and KK's qdisc_restart() cleanup). > > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index > > 40a13e8..8a716f0 100644 > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > @@ -40,9 +40,11 @@ > > struct prio_sched_data > > { > > int bands; > > + int curband; /* for round-robin */ > > struct tcf_proto *filter_list; > > u8 prio2band[TC_PRIO_MAX+1]; > > struct Qdisc *queues[TCQ_PRIO_BANDS]; > > + unsigned char mq; > > }; > > > > > > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc > > *sch, int *qerr) #endif > > if (TC_H_MAJ(band)) > > band = 0; > > + if (q->mq) > > + skb->queue_mapping = > > + > q->prio2band[band&TC_PRIO_MAX]; > > + else > > + skb->queue_mapping = 0; > > > Might look cleaner if you have one central point where > queue_mapping is set and the band is returned. I'll see how easy it'll be to condense this; because the queue being selected in the qdisc can be different based on a few different things, I'm not sure how easy it'll be to assign this in one spot. I'll play around with it and see what I can come up with. > > + /* If we're multiqueue, make sure the number of incoming bands > > +* matches the number of queues on the device we're > associating with. > > +*/ > > + if (tb[TCA_PRIO_MQ - 1]) > > + q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]); > > > If you're using it as a flag, please use RTA_GET_FLAG(), > otherwise RTA_GET_U8. Will do. Thanks. 
> > + if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count)) > > + return -EINVAL; > > > > sch_tree_lock(sch); > > q->bands = qopt->bands; > > @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, > struct sk_buff *skb) > > memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1); > > > > nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt); > > - RTA_PUT_U32(skb, TCA_PRIO_TEST, 321); > > + RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq); > > > And RTA_PUT_FLAG. Now that I think of it, does it even makes > sense to have a prio private flag for this instead of a qdisc > global one? There currently aren't any other qdiscs that are natural fits for multiqueue that I can see. I can see the benefit though of having this as a global flag in the qdisc API; let me check it out, and if it makes sense, I can move it. > > static int __init prio_module_init(void) { > > - return register_qdisc(&prio_qdisc_ops); > > + register_qdisc(&prio_qdisc_ops); > > + register_qdisc(&rr_qdisc_ops); >
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote:
> +	/* If we're multiqueue, make sure the number of incoming bands
> +	 * matches the number of queues on the device we're associating with.
> +	 */
> +	if (tb[TCA_PRIO_MQ - 1])
> +		q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]);
> +
> +	if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count))
> +		return -EINVAL;

A nice thing you could do for the user here is use egress_subqueue_count as a default when qopt->bands == 0 (and change tc prio to accept 0 in case it doesn't).
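The defaulting behaviour Patrick suggests might look like this toy sketch (hypothetical names; -22 stands in for -EINVAL): in multiqueue mode, bands == 0 means "match the hardware", and any other value must equal the device's subqueue count.

```c
#include <assert.h>

/* Toy version of the prio_tune() band validation: mq is the multiqueue
 * flag, requested_bands comes from userspace, subqueue_count models
 * sch->dev->egress_subqueue_count. */
static int toy_tune_bands(int mq, int requested_bands,
			  int subqueue_count, int *bands_out)
{
	int bands = requested_bands;

	if (mq && bands == 0)
		bands = subqueue_count;  /* default: match the hardware */
	if (mq && bands != subqueue_count)
		return -22;              /* -EINVAL: mismatch with HW queues */
	*bands_out = bands;
	return 0;
}
```

Without multiqueue the band count passes through unchecked, preserving today's prio behaviour.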
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: > diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h > index 09808b7..ec3a9a5 100644 > --- a/include/linux/pkt_sched.h > +++ b/include/linux/pkt_sched.h > @@ -103,8 +103,8 @@ struct tc_prio_qopt > > enum > { > - TCA_PRIO_UNPSEC, > - TCA_PRIO_TEST, You misunderstood me. You can work on top of my compat attribute patches, but the example code should not have to go in to apply your patch. > diff --git a/net/sched/Kconfig b/net/sched/Kconfig > index 475df84..7f14fa6 100644 > --- a/net/sched/Kconfig > +++ b/net/sched/Kconfig > @@ -102,8 +102,16 @@ config NET_SCH_ATM > To compile this code as a module, choose M here: the > module will be called sch_atm. > > +config NET_SCH_BANDS > +bool "Multi Band Queueing (PRIO and RR)" This options seems useless. Its not used *anywhere* except for dependencies. > +---help--- > + Say Y here if you want to use n-band multiqueue packet > + schedulers. These include a priority-based scheduler and > +a round-robin scheduler. > + > config NET_SCH_PRIO > tristate "Multi Band Priority Queueing (PRIO)" > + depends on NET_SCH_BANDS And this dependency as well. > ---help--- > Say Y here if you want to use an n-band priority queue packet > scheduler. > @@ -111,6 +119,28 @@ config NET_SCH_PRIO > To compile this code as a module, choose M here: the > module will be called sch_prio. > > +config NET_SCH_RR > + tristate "Multi Band Round Robin Queuing (RR)" > + depends on NET_SCH_BANDS Same here. RR > + select NET_SCH_PRIO > + ---help--- > + Say Y here if you want to use an n-band round robin packet > + scheduler. > + > + The module uses sch_prio for its framework and is aliased as > + sch_rr, so it will load sch_prio, although it is referred > + to using sch_rr. > + > +config NET_SCH_BANDS_MQ > + bool "Multiple hardware queue support" > + depends on NET_SCH_BANDS OK, again: Introduce NET_SCH_RR. NET_SCH_RR selects NET_SCH_PRIO. Nothing at all changes for NET_SCH_PRIO itself. 
Additionally introduce a boolean NET_SCH_MULTIQUEUE. No dependencies at all. Use NET_SCH_MULTIQUEUE to guard the multiqueue code in sch_prio.c. Your current code doesn't even have any ifdefs anymore though, so this might not be needed at all. Additionally you could later introduce E1000_MULTIQUEUE and have that select NET_SCH_MULTIQUEUE. > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c > index 9461e8a..203d5c4 100644 > --- a/net/sched/sch_generic.c > +++ b/net/sched/sch_generic.c > @@ -168,7 +168,8 @@ static inline int qdisc_restart(struct net_device *dev) > spin_unlock(&dev->queue_lock); > > ret = NETDEV_TX_BUSY; > - if (!netif_queue_stopped(dev)) > + if (!netif_queue_stopped(dev) && > + !netif_subqueue_stopped(dev, skb->queue_mapping)) > /* churn baby churn .. */ > ret = dev_hard_start_xmit(skb, dev); I'll try again - please move this to patch 2/3. > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c > index 40a13e8..8a716f0 100644 > --- a/net/sched/sch_prio.c > +++ b/net/sched/sch_prio.c > @@ -40,9 +40,11 @@ > struct prio_sched_data > { > int bands; > + int curband; /* for round-robin */ > struct tcf_proto *filter_list; > u8 prio2band[TC_PRIO_MAX+1]; > struct Qdisc *queues[TCQ_PRIO_BANDS]; > + unsigned char mq; > }; > > > @@ -70,14 +72,28 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int > *qerr) > #endif > if (TC_H_MAJ(band)) > band = 0; > + if (q->mq) > + skb->queue_mapping = > + q->prio2band[band&TC_PRIO_MAX]; > + else > + skb->queue_mapping = 0; Might look cleaner if you have one central point where queue_mapping is set and the band is returned. > + /* If we're multiqueue, make sure the number of incoming bands > + * matches the number of queues on the device we're associating with. > + */ > + if (tb[TCA_PRIO_MQ - 1]) > + q->mq = *(unsigned char *)RTA_DATA(tb[TCA_PRIO_MQ - 1]); If you're using it as a flag, please use RTA_GET_FLAG(), otherwise RTA_GET_U8. 
> + if (q->mq && (qopt->bands != sch->dev->egress_subqueue_count)) > + return -EINVAL; > > sch_tree_lock(sch); > q->bands = qopt->bands; > @@ -280,7 +342,7 @@ static int prio_dump(struct Qdisc *sch, struct sk_buff > *skb) > memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1); > > nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt); > - RTA_PUT_U32(skb, TCA_PRIO_TEST, 321); > + RTA_PUT_U8(skb, TCA_PRIO_MQ, q->mq); And RTA_PUT_FLAG. Now that I think of it, does it even makes sense to have a prio private flag for this ins
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> Patrick McHardy wrote:
> > Waskiewicz Jr, Peter P wrote:
> >
> >> Thought about this more last night and this morning. As far as I can
> >> tell, I still need this. If the qdisc gets loaded with multiqueue
> >> turned on, I can just use the value of band to assign
> >> skb->queue_mapping. But if the qdisc is loaded without multiqueue
> >> support, then I need to assign a value of zero to queue_mapping, or
> >> not assign it at all (it will be zero'd out before the call to
> >> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
> >> conditional in the hotpath checking if the qdisc is multiqueue; I'd
> >> rather have the array to match the bands so I can just do an
> >> assignment.
> >>
> >> What do you think?
> >
> > I very much doubt that it has any measurable impact. You can also add
> > a small inline function
> >
> > void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
>
> OK I didn't really listen obviously :) A compile time option won't
> help. Just remove it and assign it conditionally.

Sounds good. Thanks Patrick.

-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Waskiewicz Jr, Peter P wrote:
>
>> Thought about this more last night and this morning. As far as I can
>> tell, I still need this. If the qdisc gets loaded with multiqueue
>> turned on, I can just use the value of band to assign
>> skb->queue_mapping. But if the qdisc is loaded without multiqueue
>> support, then I need to assign a value of zero to queue_mapping, or
>> not assign it at all (it will be zero'd out before the call to
>> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
>> conditional in the hotpath checking if the qdisc is multiqueue; I'd
>> rather have the array to match the bands so I can just do an
>> assignment.
>>
>> What do you think?
>
> I very much doubt that it has any measurable impact. You can also add
> a small inline function
>
> void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)

OK I didn't really listen obviously :) A compile time option won't help. Just remove it and assign it conditionally.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
> {
> #ifdef CONFIG_NET_SCH_MULTIQUEUE
> 	skb->queue_mapping = queue;
> #else
> 	skb->queue_mapping = 0;
> #endif
> }

Maybe even use it everywhere and guard skb->queue_mapping by an #ifdef; on 32 bit it does enlarge the skb.
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>>> @@ -40,9 +42,13 @@
>>>  struct prio_sched_data
>>>  {
>>>  	int bands;
>>> +#ifdef CONFIG_NET_SCH_RR
>>> +	int curband; /* for round-robin */
>>> +#endif
>>>  	struct tcf_proto *filter_list;
>>>  	u8 prio2band[TC_PRIO_MAX+1];
>>>  	struct Qdisc *queues[TCQ_PRIO_BANDS];
>>> +	u16 band2queue[TC_PRIO_MAX + 1];
>>
>> Why is this still here? Its a 1:1 mapping.
>
> Thought about this more last night and this morning. As far as I can
> tell, I still need this. If the qdisc gets loaded with multiqueue
> turned on, I can just use the value of band to assign
> skb->queue_mapping. But if the qdisc is loaded without multiqueue
> support, then I need to assign a value of zero to queue_mapping, or
> not assign it at all (it will be zero'd out before the call to
> ->enqueue() in dev_queue_xmit()). But I'd rather not have a
> conditional in the hotpath checking if the qdisc is multiqueue; I'd
> rather have the array to match the bands so I can just do an
> assignment.
>
> What do you think?

I very much doubt that it has any measurable impact. You can also add a small inline function:

void skb_set_queue_mapping(struct sk_buff *skb, unsigned int queue)
{
#ifdef CONFIG_NET_SCH_MULTIQUEUE
	skb->queue_mapping = queue;
#else
	skb->queue_mapping = 0;
#endif
}
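A self-contained version of that helper (with local stand-ins for struct sk_buff and the config symbol; not the kernel code) shows why it replaces the per-packet lookup array: with the option decided at compile time, each call compiles to a plain store, so there is no runtime conditional in the hot path.

```c
#include <assert.h>

/* Stand-in for the Kconfig symbol; comment it out to model a
 * non-multiqueue build, where the mapping collapses to 0. */
#define CONFIG_NET_SCH_MULTIQUEUE

/* Minimal stand-in for struct sk_buff. */
struct toy_skb {
	unsigned int queue_mapping;
};

/* Patrick's suggested helper: the #ifdef is resolved by the
 * preprocessor, so the generated code is branch-free either way. */
static inline void skb_set_queue_mapping(struct toy_skb *skb,
					 unsigned int queue)
{
#ifdef CONFIG_NET_SCH_MULTIQUEUE
	skb->queue_mapping = queue;
#else
	skb->queue_mapping = 0;
#endif
}
```

This is also what makes the band2queue array unnecessary: the band value itself is the queue index, and the helper decides at build time whether to honour it.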
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> > @@ -40,9 +42,13 @@
> >  struct prio_sched_data
> >  {
> >  	int bands;
> > +#ifdef CONFIG_NET_SCH_RR
> > +	int curband; /* for round-robin */
> > +#endif
> >  	struct tcf_proto *filter_list;
> >  	u8 prio2band[TC_PRIO_MAX+1];
> >  	struct Qdisc *queues[TCQ_PRIO_BANDS];
> > +	u16 band2queue[TC_PRIO_MAX + 1];
>
> Why is this still here? Its a 1:1 mapping.

Thought about this more last night and this morning. As far as I can tell, I still need this. If the qdisc gets loaded with multiqueue turned on, I can just use the value of band to assign skb->queue_mapping. But if the qdisc is loaded without multiqueue support, then I need to assign a value of zero to queue_mapping, or not assign it at all (it will be zero'd out before the call to ->enqueue() in dev_queue_xmit()). But I'd rather not have a conditional in the hotpath checking if the qdisc is multiqueue; I'd rather have the array to match the bands so I can just do an assignment.

What do you think?

Thanks,
-PJ
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Waskiewicz Jr, Peter P wrote:
>> The dependencies seem to be very confused. SCHED_PRIO does not depend
>> on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ
>> and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single
>> NET_SCH_MULTIQUEUE option seems better than adding one per scheduler
>> though.
>
> I agree with a NET_SCH_MULTIQUEUE option. However, SCH_RR does depend
> on SCH_PRIO being built since it's the same code, doesn't it? Maybe
> I'm not understanding something about the build process. I'll clean
> this up.

The easiest solution is to select SCH_PRIO from SCH_RR. I had something else in mind initially, but that is needlessly complicated.

>> For the tenth time now, the user should enable this at runtime. You
>> can't just break things dependent on config options.
>
> I had this in sch_prio and tc before, and was told to remove it
> because of ABI issues. I can put it back in, but I'm not sure what
> those previous ABI issues were. Was it backwards compatibility that
> you referred to before that was broken?

Your tc changes changed the structure in a way that old tc binaries wouldn't work anymore. This version breaks configurations that use a number of bands not matching the HW queues when the user enables the multiqueue compile time option.

Unfortunately prio does not use nested attributes, so the easiest way is extending struct tc_prio_qopt at the end and checking the attribute size to decide whether it's an old or a new version. A better fix would be to introduce a new qdisc configuration attribute that takes precedence over TCA_OPTIONS and have userspace send both the old non-nested structure and a new nested configuration. That would make sure we never run into this problem again.
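The size-based compatibility check Patrick describes could be sketched like this (toy structures and hypothetical names; the appended field models the suggested tc_prio_qopt extension): old binaries send the short payload, new binaries send the extended one, and the kernel distinguishes them purely by length.

```c
#include <assert.h>
#include <stddef.h>

/* Old wire format, and a hypothetically extended one: the new field is
 * appended at the end so old userspace payloads stay valid. */
struct old_qopt {
	int bands;
	unsigned char priomap[16];
};

struct new_qopt {
	struct old_qopt base;
	unsigned char mq;   /* the appended multiqueue flag */
};

/* Decide old vs. new format from the attribute payload length; an old
 * binary's payload never reaches sizeof(struct new_qopt). */
static int toy_parse_mq(const void *payload, size_t len)
{
	if (len >= sizeof(struct new_qopt))
		return ((const struct new_qopt *)payload)->mq;
	return 0;   /* old binary: multiqueue defaults to off */
}
```

This is the same trick used elsewhere in the rtnetlink ABI when a structure grows: because the field is appended and gated on length, old tc binaries keep working unchanged.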
RE: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> The dependencies seem to be very confused. SCHED_PRIO does > not depend on anything new, SCH_RR also doesn't depend on > anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend > on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems > better than adding one per scheduler though. I agree with a NET_SCH_MULTIQUEUE option. However, SCH_RR does depend on SCH_PRIO being built since it's the same code, doesn't it? Maybe I'm not understanding something about the build process. I'll clean this up. > > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > @@ -9,6 +9,8 @@ > > * Authors:Alexey Kuznetsov, <[EMAIL PROTECTED]> > > * Fixes: 19990609: J Hadi Salim <[EMAIL PROTECTED]>: > > * Init -- EINVAL when opt undefined > > + * Additions: Peter P. Waskiewicz Jr. > <[EMAIL PROTECTED]> > > + * Added round-robin scheduling for selection at load-time > > > > git keeps changelogs, please don't add it here. Roger. > > struct tcf_proto *filter_list; > > u8 prio2band[TC_PRIO_MAX+1]; > > struct Qdisc *queues[TCQ_PRIO_BANDS]; > > + u16 band2queue[TC_PRIO_MAX + 1]; > > > > Why is this still here? Its a 1:1 mapping. I'll fix this. > > @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc > *sch, struct rtattr *opt) > > return -EINVAL; > > } > > > > + /* If we're prio multiqueue or are using round-robin, make > > +* sure the number of incoming bands matches the number of > > +* queues on the device we're associating with. > > +*/ > > +#ifdef CONFIG_NET_SCH_RR > > + if (strcmp("rr", sch->ops->id) == 0) > > + if (qopt->bands != sch->dev->egress_subqueue_count) > > + return -EINVAL; > > +#endif > > + > > +#ifdef CONFIG_NET_SCH_PRIO_MQ > > + if (strcmp("prio", sch->ops->id) == 0) > > + if (qopt->bands != sch->dev->egress_subqueue_count) > > + return -EINVAL; > > +#endif > > > > For the tenth time now, the user should enable this at > runtime. You can't just break things dependant on config options. 
I had this in sch_prio and tc before, and was told to remove it because of ABI issues. I can put it back in, but I'm not sure what those previous ABI issues were. Was it backwards compatibility that you referred to before that was broken? As always, the feedback is very much appreciated. I'll get these fixes in as soon as possible. -PJ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
PJ Waskiewicz wrote: diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 475df84..ca0b352 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -102,8 +102,16 @@ config NET_SCH_ATM To compile this code as a module, choose M here: the module will be called sch_atm. +config NET_SCH_BANDS +bool "Multi Band Queueing (PRIO and RR)" +---help--- + Say Y here if you want to use n-band multiqueue packet + schedulers. These include a priority-based scheduler and + a round-robin scheduler. + config NET_SCH_PRIO tristate "Multi Band Priority Queueing (PRIO)" + depends on NET_SCH_BANDS ---help--- Say Y here if you want to use an n-band priority queue packet scheduler. @@ -111,6 +119,30 @@ config NET_SCH_PRIO To compile this code as a module, choose M here: the module will be called sch_prio. +config NET_SCH_PRIO_MQ + bool "Multiple hardware queue support for PRIO" + depends on NET_SCH_PRIO + ---help--- + Say Y here if you want to allow the PRIO qdisc to assign + flows to multiple hardware queues on an ethernet device. This + will still work on devices with 1 queue. + + Consider this scheduler for devices that do not use + hardware-based scheduling policies. Otherwise, use NET_SCH_RR. + + Most people will say N here. + +config NET_SCH_RR + bool "Multi Band Round Robin Queuing (RR)" + depends on NET_SCH_BANDS && NET_SCH_PRIO + ---help--- + Say Y here if you want to use an n-band round robin packet + scheduler. + + The module uses sch_prio for its framework and is aliased as + sch_rr, so it will load sch_prio, although it is referred + to using sch_rr. The dependencies seem to be very confused. SCHED_PRIO does not depend on anything new, SCH_RR also doesn't depend on anything. SCH_PRIO_MQ and SCH_RR_MQ (which is missing) depend on SCH_PRIO/SCH_RR. A single NET_SCH_MULTIQUEUE option seems better than adding one per scheduler though. 
--- a/net/sched/sch_prio.c +++ b/net/sched/sch_prio.c @@ -9,6 +9,8 @@ * Authors:Alexey Kuznetsov, <[EMAIL PROTECTED]> * Fixes: 19990609: J Hadi Salim <[EMAIL PROTECTED]>: * Init -- EINVAL when opt undefined + * Additions: Peter P. Waskiewicz Jr. <[EMAIL PROTECTED]> + * Added round-robin scheduling for selection at load-time git keeps changelogs, please don't add it here. */ #include @@ -40,9 +42,13 @@ struct prio_sched_data { int bands; +#ifdef CONFIG_NET_SCH_RR + int curband; /* for round-robin */ +#endif struct tcf_proto *filter_list; u8 prio2band[TC_PRIO_MAX+1]; struct Qdisc *queues[TCQ_PRIO_BANDS]; + u16 band2queue[TC_PRIO_MAX + 1]; Why is this still here? Its a 1:1 mapping. @@ -211,6 +265,22 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt) return -EINVAL; } + /* If we're prio multiqueue or are using round-robin, make +* sure the number of incoming bands matches the number of +* queues on the device we're associating with. +*/ +#ifdef CONFIG_NET_SCH_RR + if (strcmp("rr", sch->ops->id) == 0) + if (qopt->bands != sch->dev->egress_subqueue_count) + return -EINVAL; +#endif + +#ifdef CONFIG_NET_SCH_PRIO_MQ + if (strcmp("prio", sch->ops->id) == 0) + if (qopt->bands != sch->dev->egress_subqueue_count) + return -EINVAL; +#endif For the tenth time now, the user should enable this at runtime. You can't just break things dependant on config options. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html