Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]>
Date: Fri, 06 Jul 2007 10:39:15 -0400

> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it.

I would agree with this if there were a reason for it; it's a totally unnecessary complication as far as I can see.

These virtual devices are an ethernet with the subnet details exposed to the driver, nothing more. I see zero benefit to having a netdev for each guest or node we can speak to whatsoever. It's a very heavy abstraction to use for something that is so bloody simple.

My demux on ->hard_start_xmit() is _5 DAMN LINES OF CODE_, and you want to replace that with a full netdev because of some minor difficulty in figuring out how to record the queueing state. It's beyond unreasonable.

Netdevs are like salt: if you put too much in your food it tastes awful.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
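[Editor's note: a minimal userspace sketch of the kind of transmit demux Dave describes. The names (`vnet_port`, `vnet_demux`, `NPORTS`) and the port-0-as-control-node convention are hypothetical illustrations, not the actual sunvnet code; the real driver's table lookup may differ.]

```c
#include <stdint.h>
#include <string.h>

#define NPORTS 16  /* hypothetical port count */

/* Hypothetical port table: each entry holds the remote MAC of one peer. */
struct vnet_port { uint8_t remote_mac[6]; int in_use; };
static struct vnet_port ports[NPORTS];

/* Demux a destination MAC to a port index. Port 0 stands in for the
 * control node, which also absorbs broadcast/multicast and any unicast
 * address we have no exact match for. */
static int vnet_demux(const uint8_t *dmac)
{
    if (dmac[0] & 1)           /* broadcast/multicast group bit */
        return 0;
    for (int i = 1; i < NPORTS; i++)
        if (ports[i].in_use && memcmp(ports[i].remote_mac, dmac, 6) == 0)
            return i;
    return 0;                  /* unknown unicast -> control node */
}
```

The point of contention in the thread is only what sits on top of this lookup: a subqueue index versus a whole netdevice per peer.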
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-07-06 at 10:39 -0400, jamal wrote:
> The first thing that crossed my mind was "if you want to select a
> destination port based on a destination MAC you are talking about a
> switch/bridge". You bring up the issue of "a huge number of virtual NICs
> if you wanted arbitrary guests" which is a real one[2].

Hi Jamal,

I'm deeply tempted to agree with you that the answer is multiple virtual NICs (and I've been tempted to abandon lguest's N-way transport scheme), except that it looks like we're going to have multi-queue NICs for other reasons. Otherwise I'd be tempted to say "create/destroy virtual NICs as other guests appear/vanish from the network". No one does this today, but that doesn't make it wrong.

> If i got this right, still not answering the netif_stop question posed:
> the problem you are also trying to resolve now is get rid of N
> netdevices on each guest for a usability reason; i.e have one netdevice,
> move the bridging/switching functionality/tables into the driver;
> replace the ports with queues instead of netdevices. Did i get that
> right?

Yep, well summarized. I guess the question is: should the Intel guys be representing their multi-queue NICs as multiple NICs rather than adding the subqueue concept?

> BTW, one curve that threw me off a little is it seems most of the
> hardware that provides virtualization also provides point-to-point
> connections between different domains; i always thought that they all
> provided a point-to-point to the dom0 equivalent and let the dom0 worry
> about how things get from domainX to domainY.

Yeah, but that has obvious limitations as people care more about inter-guest I/O: we want direct inter-guest networking...

Cheers,
Rusty.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> If the issue is usability of listing 1024 netdevices, i can think of
> many ways to resolve it. One way we can resolve the listing is with a
> simple tag to the netdev struct i could say "list netdevices for guest
> 0-10" etc etc.

This would be a useful feature, not only for virtualization. I've seen some boxes with thousands of net devices (mostly ppp, but also some ATM). It would be nice to be able to assign a tag to an arbitrary set of devices.

Does the network namespace stuff help with any of this?

--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-06-07 at 17:32 +1000, Rusty Russell wrote:
[..some good stuff deleted here ..]
> Hope that adds something,

It does - thanks. I think i was letting my experience pollute my thinking earlier when Dave posted. The copy-avoidance requirement is clear to me[1]. I had another issue which wasn't clear, but you touched on it, so this breaks the ice for me - be gentle please, my asbestos suit is full of dust these days ;->:

The first thing that crossed my mind was "if you want to select a destination port based on a destination MAC you are talking about a switch/bridge". You bring up the issue of "a huge number of virtual NICs if you wanted arbitrary guests" which is a real one[2].

Lets take the case of a small number of guests; a bridge of course would solve the problem with the copy-avoidance, with the caveats being:
- you now have N bridges and their respective tables for N domains, i.e one on each domain
- N netdevices on each domain as well (of course you could say that is not very different resource-wise from N queues instead).

If i got this right, still not answering the netif_stop question posed: the problem you are also trying to resolve now is to get rid of N netdevices on each guest for a usability reason; i.e have one netdevice, move the bridging/switching functionality/tables into the driver; replace the ports with queues instead of netdevices. Did i get that right?

If the issue is usability of listing 1024 netdevices, i can think of many ways to resolve it. One way we can resolve the listing is with a simple tag to the netdev struct: i could say "list netdevices for guest 0-10" etc etc.

I am having a little problem differentiating conceptually the case of a guest being different from the host/dom0 if you want to migrate the switching/bridging functions into each guest. So even if this doesn't apply to all domains, it does apply to the dom0.

I like netdevices today (as opposed to queues within netdevices): the stack knows them well (I can add IP addresses, i can point routes to them, I can change MAC addresses, i can bring them administratively down/up, I can add qos rules etc etc). I can also tie netdevices to a CPU and therefore scale that way. I see this as viable at least from the host/dom0 perspective if a netdevice represents a guest.

Sorry for the long email - drained some of my morning coffee. Ok, kill me.

cheers,
jamal

[1] My experience is around qemu/uml/old-openvz - their model is to let the host do the routing/switching between guests or to the outside of the box. From your description i would add Xen to that behavior. From Daves posting, i understand that for many good reasons, any time you move between any one domain to another you are copying. So if you use Xen and you want to go from domainX to domainY you go through dom0, which implies copying domainX->dom0 then dom0->domainY. BTW, one curve that threw me off a little is it seems most of the hardware that provides virtualization also provides point-to-point connections between different domains; i always thought that they all provided a point-to-point to the dom0 equivalent and let the dom0 worry about how things get from domainX to domainY.

[2] Unfortunately that means if i wanted 1024 virtual routers/guest domains i have at least 1024 netdevices on each guest connected to the bridge on the guest. I have a freaking problem listing 72 netdevices today on some device i have.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-07-03 at 22:20 -0400, jamal wrote:
> On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
> [.. some useful stuff here deleted ..]
>
> > That's why you have to copy into a purpose-built set of memory
> > that is composed of pages that _ONLY_ contain TX packet buffers
> > and nothing else.
> >
> > The cost of going through the switch is too high, and the copies are
> > necessary, so concentrate on allowing me to map the guest ports to the
> > egress queues. Anything else is a waste of discussion time, I've been
> > poring over these issues endlessly for weeks, so if I'm saying doing
> > copies and avoiding the switch is necessary I do in fact mean it. :-)
>
> ok, i get it Dave ;-> Thanks for your patience, that was useful.
> Now that is clear for me, I will go back and look at your original email
> and try to get back on track to what you really asked ;->

To expand on this, there are already "virtual" nic drivers in tree which do the demux based on dst mac and send to the appropriate other guest (iseries_veth.c, and Carsten Otte said the S/390 drivers do too). lguest and DaveM's LDOM make two more.

There is currently no good way to write such a driver. If one recipient is full, you have to drop the packet: if you netif_stop_queue, it means a slow/buggy recipient blocks packets going to other recipients. But dropping packets makes networking suck.

Some hypervisors (eg. Xen) only have a virtual NIC which is point-to-point: this sidesteps the issue, with the risk that you might need a huge number of virtual NICs if you wanted arbitrary guests to talk to each other (Xen doesn't support that, they route/bridge through dom0).

Most hypervisors have a sensible maximum on the number of guests they could talk to, so I'm not too unhappy with a static number of queues. But the dstmac -> queue mapping changes in hypervisor-specific ways, so it really needs to be managed by the driver...

Hope that adds something,
Rusty.
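[Editor's note: a toy model of the dilemma Rusty describes, assuming hypothetical names (`port_fill`, `PORT_QLEN`) and a fixed port count. It only illustrates the stop-queue policies, not a real driver's locking or ring management.]

```c
#define NPORTS    4
#define PORT_QLEN 8   /* hypothetical per-port ring size */

static int port_fill[NPORTS];   /* packets currently queued per port */

/* Single-queue policy: with one netif queue, the driver must stop ALL
 * transmission as soon as ANY per-port ring is full, or else drop the
 * next packet aimed at the full port. One slow recipient stalls all. */
static int must_stop_queue(void)
{
    for (int i = 0; i < NPORTS; i++)
        if (port_fill[i] >= PORT_QLEN)
            return 1;
    return 0;
}

/* Per-subqueue policy (what a dstmac -> queue mapping enables): only
 * traffic destined for the congested port need wait. */
static int must_stop_subqueue(int port)
{
    return port_fill[port] >= PORT_QLEN;
}
```

With one port full, the first policy stalls every destination while the second stalls only the congested one, which is exactly why the mapping from destination MAC to queue matters.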
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Tue, 2007-03-07 at 14:24 -0700, David Miller wrote:
[.. some useful stuff here deleted ..]
> That's why you have to copy into a purpose-built set of memory
> that is composed of pages that _ONLY_ contain TX packet buffers
> and nothing else.
>
> The cost of going through the switch is too high, and the copies are
> necessary, so concentrate on allowing me to map the guest ports to the
> egress queues. Anything else is a waste of discussion time, I've been
> poring over these issues endlessly for weeks, so if I'm saying doing
> copies and avoiding the switch is necessary I do in fact mean it. :-)

ok, i get it Dave ;-> Thanks for your patience, that was useful. Now that is clear for me, I will go back and look at your original email and try to get back on track to what you really asked ;->

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]>
Date: Tue, 03 Jul 2007 08:42:33 -0400

> (likely not in the case of hypervisor based virtualization like Xen)
> just have their skbs cloned when crossing domains, is that not the
> case?[1]
> Assuming they copy, the balance that needs to be struck now is
> between:

Sigh, I kind of hoped I wouldn't have to give a lesson in hypervisors and virtualized I/O and all the issues contained within, but if you keep pushing the "avoid the copy" idea I guess I am forced to educate. :-)

First, keep in mind that my Linux guest drivers are talking to Solaris control node servers and switches; I cannot control the API for any of this stuff. And I think that's a good thing in fact.

Exporting memory between nodes is _THE_ problem with virtualized I/O in hypervisor based systems. These things should even be able to work between two guests that simply DO NOT trust each other at all.

With that in mind, the hypervisor provides a very small shim layer of interfaces for exporting memory between two nodes. There is a pseudo-pagetable where you export pages, and a set of interfaces, one of which copies to/from imported memory to/from local memory.

If a guest reboots, crashes, or gets stuck, you have to be able to revoke the memory the remote node has imported. When this happens, if the importing node comes back to life and tries to touch those pages it takes a fault.

Taking a fault is easy if the nodes go through the hypervisor copy interface; they just get a return value back. If, instead, you try to map in those pages or program them into the IOMMU of the PCI controller, you get faults, and extremely difficult to handle faults at that. If the IOMMU takes the exception on a revoked page, your E1000 card resets when it gets the master abort from the PCI controller.

On the CPU side you have to annotate every single kernel access to this memory mapping of imported pages, just like we have to annotate all userspace accesses with exception tables mapping load and store instructions to fixup code, in order to handle the fault correctly.

Next, you don't trust the other end, as we already stated, so you can't export an object in a page that belongs to other objects. For example, if an SKB's data sits in the same page as the plain-text password the user just typed in, you can't export that page.

That's why you have to copy into a purpose-built set of memory that is composed of pages that _ONLY_ contain TX packet buffers and nothing else.

The cost of going through the switch is too high, and the copies are necessary, so concentrate on allowing me to map the guest ports to the egress queues. Anything else is a waste of discussion time; I've been poring over these issues endlessly for weeks, so if I'm saying doing copies and avoiding the switch is necessary I do in fact mean it. :-)
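[Editor's note: a minimal userspace sketch of the "purpose-built TX pages" idea: packet data is copied into slots of a buffer that contains nothing but TX payloads, so exporting or revoking it can never expose unrelated kernel data. All names (`tx_page`, `tx_copy`, slot sizes) are hypothetical; in a real driver the `used[]` bookkeeping would live outside the exported page, and the slots would be page-aligned.]

```c
#include <stdint.h>
#include <string.h>

#define BUF_SZ 2048   /* hypothetical per-packet slot size */
#define NSLOTS 2      /* slots per exported region */

/* A region exported to the peer that holds NOTHING but TX buffers. */
struct tx_page {
    uint8_t slot[NSLOTS][BUF_SZ];
    int used[NSLOTS];   /* bookkeeping; outside the export in reality */
};

/* Copy packet data into a free slot; returns the slot index, or -1 if
 * the packet is oversized or the region is full (caller stops the queue). */
static int tx_copy(struct tx_page *pg, const void *data, size_t len)
{
    if (len > BUF_SZ)
        return -1;
    for (int i = 0; i < NSLOTS; i++) {
        if (!pg->used[i]) {
            memcpy(pg->slot[i], data, len);
            pg->used[i] = 1;
            return i;
        }
    }
    return -1;
}
```

The copy looks wasteful, but it is what lets the export be revoked safely: a fault on a revoked page is just an error return from the copy interface, never a stray mapping of someone else's data.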
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Sat, 2007-30-06 at 13:33 -0700, David Miller wrote:
> It's like twice as fast, since the switch doesn't have to copy
> the packet in, switch it, then the destination guest copies it
> into its address space.
>
> There is approximately one copy for each hop you go over through these
> virtual devices.

Ok - i see what you are getting at, and while it makes more sense to me now, let me continue to be _the_ devil's advocate (sip some espresso before responding or reading): for some reason i always thought that packets going across these things (likely not in the case of hypervisor based virtualization like Xen) just have their skbs cloned when crossing domains; is that not the case?[1]

Assuming they copy, the balance that needs to be struck now is between:
a) copy is expensive
vs
b1) For N guests, N^2 queues in the system vs N queues, and 1 vs N replicated global info.
b2) The architecture challenges to resolve the fact you now have to deal with a mesh (1-1 mapping) instead of a star topology between the guests.

I don't think #b1 is such a big deal; in the old days when i had played with what is now openvz, i was happy to get 1024 virtual routers/guests (each running Zebra/OSPF). I could live with a little more wasted memory if the copy is reduced.

I think sub-consciously i am questioning #b2. Do you really need that sacrifice just so that you can avoid one extra copy between two guests? If i was running virtual routers or servers, i think the majority of traffic (by far) would be between a domain and the outside of the box, not between any two domains within the same box.

cheers,
jamal

[1] But then if this is true, i can think of a simple way to attack the other domains by inserting a kernel module into a domain that reduced the refcount of each received skb to 0. I would be surprised if the openvz type approach hasn't thought this through.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]>
Date: Sat, 30 Jun 2007 10:52:44 -0400

> On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
>
> > Awesome, but let's concentrate on the client since I can actually
> > implement and test anything we come up with :-)
>
> Ok, you need to clear one premise for me then ;->
> You said the model is for the guest/client to have a port to the
> host and one to each guest; i think this is the confusing part for me
> (and may have led to the switch discussion) because i have not seen this
> model used before. What i have seen before is that the host side
> connects the different guests. In such a scenario, on the guest is a
> single port that connects to the host - the host worries (lets forget
> the switch/bridge for a sec) about how to get packets from guestX to
> guestY pending consultation of access control details.
> What is the advantage of direct domain-domain connection? Is it
> scalable?

It's like twice as fast, since the switch doesn't have to copy the packet in, switch it, then have the destination guest copy it into its address space.

There is approximately one copy for each hop you go over through these virtual devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 21:35 -0700, David Miller wrote:
> Awesome, but let's concentrate on the client since I can actually
> implement and test anything we come up with :-)

Ok, you need to clear one premise for me then ;->
You said the model is for the guest/client to have a port to the host and one to each guest; i think this is the confusing part for me (and may have led to the switch discussion) because i have not seen this model used before. What i have seen before is that the host side connects the different guests. In such a scenario, on the guest is a single port that connects to the host - the host worries (lets forget the switch/bridge for a sec) about how to get packets from guestX to guestY pending consultation of access control details.

What is the advantage of direct domain-domain connection? Is it scalable?

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
> "DM" == David Miller <[EMAIL PROTECTED]> writes: DM> And some people still use hubs, believe it or not. Hubs are 100Mbps at most. You could of course make a flooding Gbps switch, but it would be rather silly. If you care about multicast performance, you get a switch with IGMP snooping. /Benny - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: jamal <[EMAIL PROTECTED]>
Date: Fri, 29 Jun 2007 21:30:53 -0400

> On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> > Maybe for the control node switch, yes, but not for the guest network
> > devices.
>
> And that is precisely what i was talking about - and i am sure thats how
> the discussion with Patrick was.

Awesome, but let's concentrate on the client since I can actually implement and test anything we come up with :-)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 14:31 -0700, David Miller wrote:
> This conversation begins to go into a pointless direction already, as
> I feared it would.
>
> Nobody is going to configure bridges, classification, tc, and all of
> this other crap just for a simple virtualized guest networking device.
>
> It's a confined and well defined case that doesn't need any of that.
> You've got to be fucking kidding me if you think I'm going to go
> through the bridging code and all of that layering instead of my
> hash demux on transmit which is 4 or 5 lines of C code at best.
>
> Such a suggestion is beyond stupid.

Ok, calm down - will you please? If you are soliciting opinions, then you should be expecting all sorts of answers, otherwise why bother posting. If you think you are misunderstood, just clarify. Otherwise you are being totally unreasonable.

> Maybe for the control node switch, yes, but not for the guest network
> devices.

And that is precisely what i was talking about - and i am sure thats how the discussion with Patrick was.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
From: Ben Greear <[EMAIL PROTECTED]>
Date: Fri, 29 Jun 2007 08:33:06 -0700

> Patrick McHardy wrote:
> > Right, but the current bridging code always uses promiscuous mode
> > and it's nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid though by disabling learning (and thus
> > promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.
>
> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

Your neighbour on the switch is being pummeled with multicast traffic, and now you get to see it all too.

Switches don't obviate the cost of promiscuous mode; you keep wanting to discuss this and think it doesn't matter, but it does.

And some people still use hubs, believe it or not.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
This conversation begins to go into a pointless direction already, as I feared it would.

Nobody is going to configure bridges, classification, tc, and all of this other crap just for a simple virtualized guest networking device.

It's a confined and well defined case that doesn't need any of that. You've got to be fucking kidding me if you think I'm going to go through the bridging code and all of that layering instead of my hash demux on transmit which is 4 or 5 lines of C code at best.

Such a suggestion is beyond stupid.

Maybe for the control node switch, yes, but not for the guest network devices.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Ben Greear wrote:
> > Could someone give a quick example of when I am wrong and promisc mode
> > would allow a NIC to receive a significant number of packets not really
> > destined for it?
>
> In a switched environment it won't have a big effect, I agree. It might
> help avoid receiving unwanted multicast traffic, which could be more
> significant than unicast. Anyways, why be wasteful when it can be
> avoided .. :)

Ok, I had forgotten about multicast, thanks for the reminder!

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc
http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Ben Greear wrote:
> Patrick McHardy wrote:
>
> > Right, but the current bridging code always uses promiscuous mode
> > and it's nice to avoid that if possible. Looking at the code, it
> > should be easy to avoid though by disabling learning (and thus
> > promiscuous mode) and adding unicast filters for all static fdb entries.
>
> I am curious about why people are so hot to do away with promisc mode.
> It seems to me that in a modern switched environment, there should only
> very rarely be unicast packets received on an interface that does not
> want to receive them.

I don't know if that really was Dave's reason to handle it in a driver.

> Could someone give a quick example of when I am wrong and promisc mode
> would allow a NIC to receive a significant number of packets not really
> destined for it?

In a switched environment it won't have a big effect, I agree. It might help avoid receiving unwanted multicast traffic, which could be more significant than unicast. Anyways, why be wasteful when it can be avoided .. :)
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
Patrick McHardy wrote:
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible. Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

I am curious about why people are so hot to do away with promisc mode. It seems to me that in a modern switched environment, there should only very rarely be unicast packets received on an interface that does not want to receive them.

Could someone give a quick example of when I am wrong and promisc mode would allow a NIC to receive a significant number of packets not really destined for it?

Thanks,
Ben

--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc
http://www.candelatech.com
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 15:08 +0200, Patrick McHardy wrote:
> jamal wrote:
> > On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
> Right, but the current bridging code always uses promiscuous mode
> and it's nice to avoid that if possible. Looking at the code, it
> should be easy to avoid though by disabling learning (and thus
> promiscuous mode) and adding unicast filters for all static fdb entries.

Yes, that would do it for static provisioning (I suspect that would work today unless bridging has no knobs to turn off going into promisc). But you could even allow for learning and just have extra filters in tc before bridging disallowing things.

> Have a look at my secondary unicast address patches in case you didn't
> notice them before (there's also a driver example for e1000 on netdev):
>
> http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7
>
> You still need to do filtering in software, but you can have the NIC
> pre-filter in case it supports it, otherwise it goes to promiscuous mode.

Ok, I will look at them when i get back. Sorry - haven't caught up on netdev.

cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote:
>
> > The difference to a real bridge is that all the
> > addresses are completely known in advance, so it doesn't need
> > promiscuous mode for learning.
>
> You mean the per-virtual MAC addresses are known in advance, right?

Yes.

> This is fine. The bridging or otherwise (like L3 etc) is for
> interconnecting once you have the provisioning done. And you could build
> different "broadcast domains" by having multiple bridges.

Right, but the current bridging code always uses promiscuous mode and it's nice to avoid that if possible. Looking at the code, it should be easy to avoid though by disabling learning (and thus promiscuous mode) and adding unicast filters for all static fdb entries.

> To go off on a slight tangent:
> I think you have to look at the two types of NICs separately
> 1) dumb ones where you may have to use the mcast filters in hardware to
> pretend you have a unicast address per virtual device - those will be
> really hard to simulate using a separate netdevice per MAC address.
> Actually your bigger problem on those is tx MAC address selection
> because that is not built into the hardware. I still think even for
> these types something above netdevice (bridge, L3 routing, tc action
> redirect etc) will do.

Have a look at my secondary unicast address patches in case you didn't notice them before (there's also a driver example for e1000 on netdev):

http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.23.git;a=commit;h=306890b54dcbd168cdeea64f1630d2024febb5c7

You still need to do filtering in software, but you can have the NIC pre-filter in case it supports it, otherwise it goes to promiscuous mode.

> 2) The new NICs being built for virtualization; those allow you to
> explicitly have clean separation of IO where the only thing that is
> shared between virtual devices is the wire and the bus (otherwise
> each has its own registers etc) i.e the hardware is designed with this
> in mind. In such a case, i think a separate netdevice per single MAC
> address - possibly tied to a separate CPU should work.

Agreed, that could also be useful for non-virtualized use.
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
On Fri, 2007-29-06 at 13:59 +0200, Patrick McHardy wrote: > I'm guessing that that wouldn't allow to do unicast filtering for > the guests on the real device without hacking the bridge code for > this special case. For ingress (i guess you could say for egress as well): we can do it as well today with tc filtering on the host - it is involved but is part of provisioning for a guest IMO. A substantial amount of ethernet switches (ok, not the $5 ones) do filtering at the same level. > The difference to a real bridge is that the > all addresses are completely known in advance, so it doesn't need > promiscous mode for learning. You mean the per-virtual MAC addresses are known in advance, right? This is fine. The bridging or otherwise (like L3 etc) is for interconnecting once you have the provisioning done. And you could build different "broadcast domains" by having multiple bridges. To go off on a slight tangent: I think you have to look at the two types of NICs separately 1) dumb ones where you may have to use the mcast filters in hardware to pretend you have a unicast address per virtual device - those will be really hard to simulate using a separate netdevice per MAC address. Actually your bigger problem on those is tx MAC address selection because that is not built into the hardware. I still think even for these types something above netdevice (bridge, L3 routing, tc action redirect etc) will do. 2) The new NICs being built for virtualization; those allow you to explicitly have clean separation of IO where the only thing that is shared between virtual devices is the wire and the bus (otherwise each has its own registers etc) i.e the hardware is designed with this in mind. In such a case, i think a separate netdevice per single MAC address - possibly tied to a separate CPU should work. 
cheers,
jamal
Re: Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
jamal wrote:
> On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
>
>> Each guest gets a unique MAC address. There is a queue per-port
>> that can fill up.
>>
>> What all the drivers like this do right now is stop the queue if
>> any of the per-port queues fill up, and that's what my sunvnet
>> driver does right now as well. We can only thus wakeup the
>> queue when all of the ports have some space.
>
> Is a netdevice really the correct construct for the host side?
> Sounds to me a layer above the netdevice is the way to go. A bridge for
> example, or L3 routing, or even simple tc classify/redirection etc.
> I haven't used what has become openvz these days in many years (or
> played with Eric's approach), but if i recall correctly - it used to
> have a single netdevice per guest on the host. That's close to what a
> basic qemu/UML has today. In such a case it is something above
> netdevices which does the guest selection.

I'm guessing that wouldn't allow doing unicast filtering for the guests
on the real device without hacking the bridge code for this special
case. The difference to a real bridge is that all the addresses are
completely known in advance, so it doesn't need promiscuous mode for
learning.
Multiqueue and virtualization WAS(Re: [PATCH 3/3] NET: [SCHED] Qdisc changes and sch_rr added for multiqueue
I've changed the topic for you, friend - otherwise most people won't
follow (as you've said a few times yourself ;->).

On Thu, 2007-28-06 at 21:20 -0700, David Miller wrote:
> Now I get to pose a problem for everyone, prove to me how useful
> this new code is by showing me how it can be used to solve a
> recurring problem in virtualized network drivers of which I've
> had to code one up recently, see my most recent blog entry at:
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/index.html

nice.

> Anyways the gist of the issue is (and this happens for Sun LDOMS
> networking, lguest, IBM iSeries, etc.) that we have a single
> virtualized network device. There is a "port" to the control
> node (which switches packets to the real network for the guest)
> and one "port" to each of the other guests.
>
> Each guest gets a unique MAC address. There is a queue per-port
> that can fill up.
>
> What all the drivers like this do right now is stop the queue if
> any of the per-port queues fill up, and that's what my sunvnet
> driver does right now as well. We can only thus wakeup the
> queue when all of the ports have some space.

Is a netdevice really the correct construct for the host side?
Sounds to me a layer above the netdevice is the way to go. A bridge for
example, or L3 routing, or even simple tc classify/redirection etc.
I haven't used what has become openvz these days in many years (or
played with Eric's approach), but if i recall correctly - it used to
have a single netdevice per guest on the host. That's close to what a
basic qemu/UML has today. In such a case it is something above
netdevices which does the guest selection.

> The ports (and thus the queues) are selected by destination
> MAC address. Each port has a remote MAC address; if there
> is an exact match with a port's remote MAC we'd use that port
> and thus that port's queue. If there is no exact match
> (some other node on the real network, broadcast, multicast,
> etc.) we want to use the control node's port and port queue.

Ok, Dave, isn't that what a bridge does? ;-> You'd need filtering to go
with it (for example to restrict guest0 from getting certain broadcasts
etc) - but we already have that.

> So the problem to solve is to make a way for drivers to do the queue
> selection before the generic queueing layer starts to try and push
> things to the driver. Perhaps a classifier in the driver or similar.
>
> The solution to this problem generalizes to the other facility
> we want now, hashing the transmit queue by smp_processor_id()
> or similar. With that in place we can look at doing the TX locking
> per-queue too as is hinted at by the comments above the per-queue
> structure in the current net-2.6.23 tree.

Major surgery will be needed on the tx path if you want to hash the tx
queue by processor id. Our unit construct (today, net-2.6.23) that can
be tied to a cpu is a netdevice. OTOH, if you used a netdevice it should
work as is. But i am possibly missing something in your comments - what
do you have in mind?

> My current work-in-progress sunvnet.c driver is included below so
> we can discuss things concretely with code.
>
> I'm listening. :-)

And you got words above.

cheers,
jamal