Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On (Fri) Aug 07 2009 [09:14:43], Anthony Liguori wrote: Amit Shah wrote: On (Thu) Aug 06 2009 [18:37:40], Jamie Lokier wrote:

Apart from dbus, hard-coded meanings of small N in /dev/vmchN are asking for trouble. It is bound to break when widely deployed and guest/host configs don't match.

It's no different from the way major/minor numbering works in the Linux kernel: they uniquely identify a device.

Bad example. Quite a lot of modern device drivers use dynamic major/minor numbers because static ones have proven to be such a pain in the butt. That's why we have more sophisticated mechanisms like udev for userspace to make use of.

Let me explain how we came to this numbering: we first had support for 'naming' ports, and the names were obtained by userspace programs via an ioctl. Rusty suggested using a numbering scheme where some ports could exist at predefined locations so that we wouldn't need the naming and the ioctls around it.

We'll definitely need some way to support dynamic vmchannels. Static allocation of ports is just not going to work. If we did a userspace daemon, I'd suggest using some sort of universal identifier that's easy to manage in a distributed fashion, like a reverse FQDN. So for instance, I could have a com.ibm.my-awesome-channel and never have to worry about conflicts.

Hm, we could have something like a class of ports instead of the current minor-number scheme: each port exposes what class it belongs to, like

  -virtioserial unix:/tmp/foo,class=stream
  -virtioserial unix:/tmp/bar,class=console

It also doesn't fit comfortably when you have, say, bob and alice both logged in with desktops on separate VTs. Clashes are inevitable, as third-party apps pick N values for themselves and then get distributed - unless N values can be large (/dev/vmch44324 == kernelstats...).

Hm, so there can be one daemon on the guest handling all clipboard events.
There's some work done already by the fast-user-switch support, and that can be extended to daemons that talk over virtio-serial.

You could have one daemon that manages all vmchannel sessions. It can then expose channels to apps via whatever mechanism is best. It could use unix domain sockets, Sys V IPC, whatever floats your boat. And you can build this daemon today using the existing vmchannel over TCP/IP. You could also make it support serial devices. We could also introduce a custom usb device and use libusb. libusb is portable to Windows and Linux.

There are some other problems with usb too: it's not transparent to users. Any hotplug event could alert users, and that's not desired. It's a system-only thing and should also remain that way.

Amit

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization
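The reverse-FQDN naming suggested above can be sketched as a simple registry. This is an illustrative model, not code from the thread: the `ChannelRegistry` class and the example channel names are hypothetical, but they show why owner-scoped names sidestep the collision problem that small fixed minor numbers have.

```python
# Hypothetical sketch: reverse-FQDN channel names vs. small integer minors.
# A name like "com.ibm.my-awesome-channel" lives in its vendor's own DNS
# namespace, so two independent vendors never need to coordinate, while
# two vendors picking /dev/vmch4 collide silently.

class ChannelRegistry:
    """Maps channel names to endpoints; rejects duplicate registrations."""

    def __init__(self):
        self._channels = {}

    def register(self, name, endpoint):
        if name in self._channels:
            raise ValueError(f"channel {name!r} already registered")
        self._channels[name] = endpoint

    def lookup(self, name):
        return self._channels[name]

registry = ChannelRegistry()
registry.register("com.ibm.my-awesome-channel", "/tmp/ibm.sock")
registry.register("org.qemu.clipboard", "/tmp/clip.sock")
print(registry.lookup("org.qemu.clipboard"))  # /tmp/clip.sock
```

Registering `org.qemu.clipboard` a second time raises `ValueError`, which is exactly the conflict a distributed naming scheme makes rare by construction.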
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Gerd Hoffmann wrote:

There are some other problems with usb too: it's not transparent to users. Any hotplug event could alert users, and that's not desired. It's a system-only thing and should also remain that way.

I think virtio-serial is the better way to handle vmchannel. Unlike usb, virtio is designed to work nicely in a virtual environment. But vmchannel-over-usbserial should be easy too, in case some guests lack virtio backports or something.

I think you're missing my fundamental point. Don't use the kernel as the guest interface. Introduce a userspace daemon that exposes a domain socket. Then we can have a proper protocol that uses reverse FQDNs for identification. We can do the backend over TCP/IP, usb, standard serial, etc.

Regards,
Anthony Liguori
RE: [PATCH][RFC] net/bridge: add basic VEPA support
Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support

On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:

As I understand the macvlan code, it currently doesn't allow two VMs on the same machine to communicate with one another.

There are patches to do that. I think if we add that, there should be a way to choose the behavior between either bridging between the guests or VEPA.

If you implement this direct bridging capability between local VMs for macvlan, then would this not break existing applications that currently use it? It would be quite a significant change to how macvlan works today. I guess, ideally, you would want to have macvlan work in separate modes, e.g. traditional macvlan, bridging, and VEPA.

I could imagine a hairpin mode on the adjacent bridge making this possible, but the macvlan code would need to be updated to filter reflected frames so a source did not receive his own packet.

Right, I missed this point so far. I'll follow up with a patch to do that. Can you maybe point me to the missing patches for macvlan that you have mentioned in other emails, and the one you mention above? E.g. enabling multicast distribution and allowing local bridging etc. I could not find any of those in the archives. Thanks.

I could imagine this being done as well, but to also support selective multicast usage, something similar to the bridge forwarding table would be needed. I think putting VEPA into a new driver would cause you to implement many things the bridge code already supports. Given that we expect the bridge standard to ultimately include VEPA, and the new functions are basic forwarding operations, it seems to make most sense to keep this consistent with the bridge module.

This is the interesting part of the discussion. The bridge and macvlan drivers certainly have an overlap in functionality, and you can argue that you only need one.
Then again, the bridge code is a little crufty and we might not want to add much more to it for functionality that can be implemented in a much simpler way elsewhere. My preferred way would be to use bridge when you really need 802.1d MAC learning, netfilter-bridge and STP, while we put the optimizations for stuff like VMDq, zero-copy and multiqueue guest adapters only into the macvlan code.

I can see this being a possible solution. My concern with putting VEPA into macvlan instead of the bridging code is that there will be more work required to make it usable for other virtualization solutions, as macvtap will only work for KVM-type setups. Basically, VEPA capabilities would rely on someone developing further drivers to connect macvlan to different backend interfaces, e.g. one for KVM (macvtap), one for Xen PV drivers, one for virtio, and whatever else is out there, or will be there in the future. The bridging code is already very generic in that respect, and all virtualization layers can deal with connecting interfaces to a bridge. Our extensions to the bridging code to enable VEPA for the Linux kernel are only very minimal code changes and would make VEPA available to most virtualization solutions today.

Anna
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On 08/10/09 15:02, Anthony Liguori wrote:

I think you're missing my fundamental point. Don't use the kernel as the guest interface. Introduce a userspace daemon that exposes a domain socket. Then we can have a proper protocol that uses reverse FQDNs for identification.

We need nothing but (a) bidirectional byte streams and (b) name tags for them. Do we really want to design a daemon and a protocol for such a simple thing? Especially as requiring a daemon adds a few problems you don't have without one. Access control, for example: for device nodes you can just use standard unix permissions and ACLs. You can easily do stuff like adding the logged-in desktop user to the /dev/vmchannel/org/qemu/clipboard ACL using existing solutions. With a daemon you have to jump through a number of hoops to achieve the same.

Can't we simply have guest apps open /dev/vmchannel/$protocol ?

cheers,
Gerd
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Amit Shah wrote:

Let me explain how we came to this numbering: we first had support for 'naming' ports, and the names were obtained by userspace programs via an ioctl. Rusty suggested using a numbering scheme where some ports could exist at predefined locations so that we wouldn't need the naming and the ioctls around it.

Fortunately, if you implement the naming scheme in userspace you get the best of both worlds ;-)

Hm, so there can be one daemon on the guest handling all clipboard events. There's some work done already by the fast-user-switch support, and that can be extended to daemons that talk over virtio-serial. You could have one daemon that manages all vmchannel sessions. It can then expose channels to apps via whatever mechanism is best. It could use unix domain sockets, Sys V IPC, whatever floats your boat. And you can build this daemon today using the existing vmchannel over TCP/IP. You could also make it support serial devices. We could also introduce a custom usb device and use libusb. libusb is portable to Windows and Linux.

There are some other problems with usb too: it's not transparent to users. Any hotplug event could alert users, and that's not desired.

I don't think this is true in practice. Our goal is not to circumvent an OS's policy decisions either.

It's a system-only thing and should also remain that way.

I don't buy this argument at all. If you exposed a new usb device for which no OS had a kernel driver, and you had a daemon running that watched for insertions of that device, what OS would that not work transparently on?

I think my fundamental argument boils down to two points: 1) we should not require new guest drivers unless we absolutely have to; 2) we should always do things in userspace unless we absolutely have to do things in the kernel. Adding new kernel drivers breaks support for enterprise Linux distros. Adding a userspace daemon does not. Windows device drivers require signing, which is very difficult to do.
There's a huge practical advantage in not requiring guest drivers.

Regards,
Anthony Liguori
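The userspace-daemon approach argued for above can be sketched in a few lines. This is a hypothetical illustration, not an existing interface: the daemon owns whatever host transport exists (TCP/IP, serial, usb-serial) and exposes each named channel to guest applications as a unix domain socket; here the relay to the host side is stubbed out as an echo.

```python
# Minimal sketch of a guest-side vmchannel daemon (hypothetical paths and
# names): one unix domain socket per channel, so applications connect by
# channel name rather than by device minor number. The echo stands in for
# forwarding bytes to the host transport.

import os
import socket
import tempfile
import threading

ready = threading.Event()

def serve_channel(path):
    """Expose one channel as a unix socket and relay one message (echo stub)."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    ready.set()                      # socket is accepting connections
    conn, _ = srv.accept()
    conn.sendall(conn.recv(4096))    # a real daemon would forward to the host
    conn.close()
    srv.close()

sockdir = tempfile.mkdtemp()
path = os.path.join(sockdir, "org.qemu.clipboard")
t = threading.Thread(target=serve_channel, args=(path,))
t.start()
ready.wait()

# An application opens the channel by name, no kernel driver involved.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(path)
cli.sendall(b"copy: hello")
reply = cli.recv(4096)
cli.close()
t.join()
print(reply)  # b'copy: hello'
```

The per-channel socket path doubles as the access-control point: standard unix permissions on the socket file decide which users may open the channel.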
Re: [PATCH][RFC] net/bridge: add basic VEPA support
On Monday 10 August 2009, Fischer, Anna wrote:

Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:

As I understand the macvlan code, it currently doesn't allow two VMs on the same machine to communicate with one another.

There are patches to do that. I think if we add that, there should be a way to choose the behavior between either bridging between the guests or VEPA.

If you implement this direct bridging capability between local VMs for macvlan, then would this not break existing applications that currently use it? It would be quite a significant change to how macvlan works today. I guess, ideally, you would want to have macvlan work in separate modes, e.g. traditional macvlan, bridging, and VEPA.

Right, that's what I meant with my sentence above. I'm not sure if we need to differentiate traditional macvlan and VEPA though. AFAICT, the only difference should be the handling of broadcast and multicast frames returning from the hairpin turn. Since this does not happen with a traditional macvlan, we can always send them to all macvlan ports except the source port.

I could imagine a hairpin mode on the adjacent bridge making this possible, but the macvlan code would need to be updated to filter reflected frames so a source did not receive his own packet.

Right, I missed this point so far. I'll follow up with a patch to do that.

Can you maybe point me to the missing patches for macvlan that you have mentioned in other emails, and the one you mention above? E.g. enabling multicast distribution and allowing local bridging etc. I could not find any of those in the archives. Thanks.

The patch from Eric Biederman to allow macvlan to bridge between its slave ports is at http://kerneltrap.org/mailarchive/linux-netdev/2009/3/9/5125774 . I could not find any patches for the other features (or bugs).

This is the interesting part of the discussion.
The bridge and macvlan drivers certainly have an overlap in functionality, and you can argue that you only need one. Then again, the bridge code is a little crufty and we might not want to add much more to it for functionality that can be implemented in a much simpler way elsewhere. My preferred way would be to use bridge when you really need 802.1d MAC learning, netfilter-bridge and STP, while we put the optimizations for stuff like VMDq, zero-copy and multiqueue guest adapters only into the macvlan code.

I can see this being a possible solution. My concern with putting VEPA into macvlan instead of the bridging code is that there will be more work required to make it usable for other virtualization solutions, as macvtap will only work for KVM-type setups.

Right, I understand.

Basically, VEPA capabilities would rely on someone developing further drivers to connect macvlan to different backend interfaces, e.g. one for KVM (macvtap), one for Xen PV drivers, one for virtio, and whatever else is out there, or will be there in the future. The bridging code is already very generic in that respect, and all virtualization layers can deal with connecting interfaces to a bridge. Our extensions to the bridging code to enable VEPA for the Linux kernel are only very minimal code changes and would make VEPA available to most virtualization solutions today.

I don't object to having VEPA supported in the bridge code at all. I think your patch is simple enough so it won't hurt in the bridge code. If Stephen prefers to do VEPA only in one component, we should probably make it possible for that component to act as a bridge between 1+n existing interfaces as well. You can almost do that with the regular macvlan and the bridge driver, like

        /- macvlan0 - br0 - tap0
 eth0 --+- macvlan1 - br1 - tap1
        \- macvlan2 - br2 - tap2

Here, you can have two guests attached to tap devices (or xen net ...) and the macvlan driver doing the VEPA.
Of course this is not how bridge works -- you would have the same mac addresses on two sides of the bridge. So we could have another macvlan backend (let's call it macvbridge) so you can do this:

        /- macvlan0 - 'qemu -net raw'
 eth0 --+- macvtap0 - 'qemu -net tap,fd=3 3<>/dev/net/macvtap0'
        \- macvbr0 -- tap0 - 'qemu -net tap'

The macvbr driver could be used this way to associate an existing network device with a slave of a macvlan port. Not sure if this has any significant advantage over your bridge patches; it does have the obvious disadvantage that someone needs to implement it first, while your patch is there ;-)

	Arnd
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Sun, 09 Aug 2009 14:19:08 +0300 Or Gerlitz ogerl...@voltaire.com wrote:

Stephen Hemminger wrote: I have a patch that forwards all multicast packets, and another that does proper forwarding. It should have worked that way in the original macvlan; the current behavior is really a bug.

Looking in macvlan_set_multicast_list(), it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to the macvlan devices that want to receive them, and this way the flooding can be removed, correct?

The device can just flood all multicast packets, since the filtering is done on the receive path anyway.
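The two multicast strategies being weighed above can be sketched side by side. This is an illustrative model, not the macvlan driver itself: `flood` replicates a multicast frame to every port but the source (filtering then happens later on each receive path), while `filtered` consults a per-port membership table so non-subscribers never see the frame.

```python
# Hypothetical model of multicast delivery in a macvlan-style mux:
# flooding vs. table-based filtering.

def flood(ports, src):
    """Flooding approach: replicate to all ports except the source."""
    return [p for p in ports if p != src]

def filtered(ports, src, frame, membership):
    """Table-based approach: deliver only to ports subscribed to the group."""
    group = frame["dst"]
    return [p for p in ports
            if p != src and group in membership.get(p, set())]

ports = ["macvlan0", "macvlan1", "macvlan2"]
frame = {"dst": "224.0.0.251"}
membership = {"macvlan1": {"224.0.0.251"}}  # only macvlan1 joined the group

print(flood(ports, "macvlan0"))                       # ['macvlan1', 'macvlan2']
print(filtered(ports, "macvlan0", frame, membership))  # ['macvlan1']
```

The difference is where the useless copies die: flooding pushes them to every receive path, which is simple but wastes work that a forwarding table would avoid.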
Re: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support
On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:

I don't think your scheme works too well, because a broadcast packet coming from other interfaces on br0 would get replicated and sent across the wire to ethB multiple times.

Right, that won't work. So the bridge patch for the hairpin turn is still the best solution.

Btw, how will that interact with the bridge-netfilter (ebtables) setup? Can you apply any filters that work on current bridges also between two VEPA ports while doing the hairpin turn?

	Arnd
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Monday 10 August 2009, Stephen Hemminger wrote:

On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz ogerl...@voltaire.com wrote: Looking in macvlan_set_multicast_list(), it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to the macvlan devices that want to receive them, and this way the flooding can be removed, correct?

The device can just flood all multicast packets, since the filtering is done on the receive path anyway.

But we'd still have to copy the frames to user space (for both macvtap and raw packet sockets) and exit from the guest to inject them into its stack, right? I guess for multicast-heavy workloads, we could save a lot of cycles by throwing the frames away as early as possible. How common are those setups on virtual servers though?

	Arnd
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On 08/10/09 16:20, Anthony Liguori wrote:

Gerd Hoffmann wrote: Do we really want to design a daemon and a protocol for such a simple thing?

Yes, because we also need (c) the ability to write cross-platform software that targets vmchannel. So having a library interface is going to be extremely desirable.

You don't need a daemon for that though.

Also, see the previous discussion about security. How do you sanely delegate /dev/vmchannel/org/qemu/clipboard to the current Xorg user?

pam_console (I think that is the name of the beast). Or is it handled by hal these days? The piece of software which does the very same thing already for sound and other devices.

Especially as requiring a daemon for that adds a few problems you don't have without one. Access control, for example: for device nodes you can just use standard unix permissions and ACLs.

But how do you set those permissions in the first place?

See above. There are other devices which need that too. There are existing solutions for this problem.

You can easily do stuff like adding the logged-in desktop user to the /dev/vmchannel/org/qemu/clipboard ACL using existing solutions. With a daemon you have to jump through a number of hoops to achieve the same. Can't we simply have guest apps open /dev/vmchannel/$protocol ?

/dev interfaces are only simple to kernel developers :-) Besides, why do something in the kernel that can clearly be done in userspace?

Ok, let's rip out the in-kernel ioapic code then. It can (and has been) done in userspace.

It just increases the possibility of kernel bugs.

The daemon increases the possibility of userspace bugs. Seriously: attaching a name tag to virtio-serial devices and having them exposed via sysfs is probably *much* less code than a vmchannel daemon. Also, multiplexing over one device introduces a number of problems you have to take care of on both sides (qemu+daemon) of the connection. For example: when the communication stalls in one protocol, the others should keep on going, of course. With one device per protocol, and thus one virtqueue per protocol, the problem doesn't exist in the first place.

You can have a /var/run/vmchannel/$protocol.sock unix domain socket and it has all the same properties that you describe. It also Just Works with standard tools like socat.

bash: socat: command not found

If we really want vmchannel to be used by application developers, then we really need a libvmchannel.

We need a sane solution developed and merged, not a new idea each week.

cheers,
Gerd
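The multiplexing bookkeeping Gerd objects to can be made concrete with a toy framing scheme. This is a hypothetical sketch, not any protocol from the thread: if all channels share one transport, both ends must prefix every message with a channel id and length and demultiplex on receive, which is exactly the work that one virtqueue per protocol avoids.

```python
# Hypothetical wire framing for channels multiplexed over one byte stream:
# each message carries a 4-byte channel id and a 4-byte payload length.

import struct

HDR = struct.Struct("!II")  # (channel id, payload length), network byte order

def mux(channel_id, payload):
    """Frame one message for the shared transport."""
    return HDR.pack(channel_id, len(payload)) + payload

def demux(stream):
    """Split a received byte stream back into (channel_id, payload) pairs."""
    out = []
    off = 0
    while off < len(stream):
        cid, length = HDR.unpack_from(stream, off)
        off += HDR.size
        out.append((cid, stream[off:off + length]))
        off += length
    return out

wire = mux(1, b"clipboard data") + mux(2, b"guest stats")
print(demux(wire))  # [(1, b'clipboard data'), (2, b'guest stats')]
```

Note what the sketch does not solve: if channel 1's consumer stalls, its frames still occupy the shared stream ahead of channel 2's, so a real mux also needs per-channel buffering and flow control on both sides.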
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On 08/10/09 16:27, Anthony Liguori wrote:

I think my fundamental argument boils down to two points. 1) we should not require new guest drivers unless we absolutely have to

Allowing guest drivers is fine though, I guess?

2) we should always do things in userspace unless we absolutely have to do things in the kernel.

Wrong. There are often good reasons to do stuff in the kernel, even if you can do it in userspace too.

Adding new kernel drivers breaks support for enterprise Linux distros. Adding a userspace daemon does not. Windows device drivers require signing which is very difficult to do. There's a huge practical advantage in not requiring guest drivers.

Ok, so the virtio-serial + usbserial combo should work well then, I think. If you have guest drivers, you'll go the virtio-serial route. If you don't have guest drivers, you can go the usbserial route, either via /dev/ttyUSB or via libusb. We can also have a libvmchannel as an abstraction layer on top of this.

cheers,
Gerd
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
Paul,

I also think that bridge may not be the right place for VEPA, but rather a simpler sw/hw mux, although the VEPA support may reside in multiple places (i.e. also in the bridge).

As Arnd pointed out, Or already added an extension to qemu that allows direct guest virtual NIC mapping to an interface device (vs using tap); this was done specifically to address VEPA, and results in much faster performance and lower cpu overhead (Or and some others are planning additional meaningful performance optimizations).

The interface multiplexing can be achieved using the macvlan driver or using an SR-IOV capable NIC (the preferred option); macvlan may need to be extended to support VEPA multicast handling, which looks like a rather simple task.

It may be counter-intuitive for some, but we expect the (completed) qemu VEPA mode + SR-IOV + certain switches with hairpin (vepa) mode to perform faster than using bridge+tap even for connecting 2 VMs on the same host.

Yaron
Sent from BlackBerry

From: e...@yahoogroups.com
To: 'Stephen Hemminger' ; 'Fischer, Anna'
Cc: bri...@lists.linux-foundation.org ; linux-ker...@vger.kernel.org ; net...@vger.kernel.org ; virtualization@lists.linux-foundation.org ; e...@yahoogroups.com ; da...@davemloft.net ; ka...@trash.net ; adobri...@gmail.com ; 'Arnd Bergmann'
Sent: Fri Aug 07 21:58:00 2009
Subject: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support

After reading more about this, I am not convinced this should be part of the bridge code. The bridge code really consists of two parts: forwarding table and optional spanning tree. Well, the VEPA code short circuits both of these; I can't imagine it working with STP turned on. The only part of bridge code that really gets used by this are the receive packet hooks and the crufty old API. So instead of adding more stuff to existing bridge code, why not have a new driver for just VEPA? You could do it with a simple version of a macvlan-type driver.

Stephen,

Thanks for your comments and questions.
We do believe the bridge code is the right place for this, so I'd like to embellish on that a bit more to help persuade you. Sorry for the long-winded response, but here are some thoughts:

- First and foremost, VEPA is going to be a standard addition to the IEEE 802.1Q specification. The working group agreed at the last meeting to pursue a project to augment the bridge standard with hairpin mode (aka reflective relay) and a remote filtering service (VEPA). See for details: http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-evbPar5C-0709-v01.pdf

- The VEPA functionality was really a pretty small change to the code with low risk and wouldn't seem to warrant an entire new driver or module.

- There are good use cases where VMs will want to have some of their interfaces attached to bridges and others to bridges operating in VEPA mode. In other words, we see simultaneous operation of the bridge code and VEPA occurring, so having as much of the underlying code as common as possible would seem to be beneficial.

- By augmenting the bridge code with VEPA there is a great amount of re-use achieved. It works wherever the bridge code works and doesn't need anything special to support KVM, XEN, and all the hooks, etc...

- The hardware vendors building SR-IOV NICs with embedded switches will be adding VEPA mode, so keeping the bridge module in sync would be consistent with this trend and direction. It will be possible to extend the hardware implementations by cascading a software bridge and/or VEPA, so being in sync with the architecture would make this more consistent.

- The forwarding table is still needed and used on inbound traffic to deliver frames to the correct virtual interfaces and to filter any reflected frames. A new driver would have to basically implement an equivalent forwarding table anyway.
As I understand the current macvlan-type driver, it wouldn't filter multicast frames properly without such a table.

- It seems the hairpin mode would be needed in the bridge module whether VEPA was added to the bridge module or a new driver. Having the associated changes together in the same code could aid in understanding and deployment.

As I understand the macvlan code, it currently doesn't allow two VMs on the same machine to communicate with one another. I could imagine a hairpin mode on the adjacent bridge making this possible, but the macvlan code would need to be updated to filter reflected frames so a source did not receive his own packet. I could imagine this being done as well, but to also support selective multicast usage, something similar to the bridge forwarding table would be needed. I think putting VEPA into a new driver would cause you to implement many things the bridge code already supports. Given that we expect the bridge standard to ultimately include VEPA, and the new functions are basic forwarding operations, it seems to make most sense to keep this consistent with the bridge module.
RE: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support
Subject: Re: [PATCH][RFC] net/bridge: add basic VEPA support

On Friday 07 August 2009, Paul Congdon (UC Davis) wrote:

I don't think your scheme works too well, because a broadcast packet coming from other interfaces on br0 would get replicated and sent across the wire to ethB multiple times.

Right, that won't work. So the bridge patch for the hairpin turn is still the best solution.

Yes, I think that we should separate the discussions between hairpin mode on the adjacent bridge and the VEPA filtering service residing within the end-station. The hairpin feature really has to be implemented in the bridging code.

Btw, how will that interact with the bridge-netfilter (ebtables) setup? Can you apply any filters that work on current bridges also between two VEPA ports while doing the hairpin turn?

The hairpin mode is implemented on the adjacent bridge. The only difference for a hairpin-mode port vs. a normal bridge port is that it can pass frames back out to the same port they came from. All the netfilter hooks are still in place. On the VEPA filtering service side, the only change we have implemented in the bridging code is that in VEPA mode all frames are passed to the uplink on TX. However, frames are still passed through the netfilter hooks before they go out on the wire. On the inbound path, there are no changes to the way frames are processed (except the filtering for the original source port), so netfilter hooks work in the same way as for a normal bridge.

If a frame is reflected back because of a hairpin turn, then of course the incoming port is the VEPA uplink port and not the port that originally sent the frame. So if you are trying to enforce some packet filtering on that inbound path, then you would have to do that based on MAC addresses and not on bridge ports. But I would assume that you would enforce the filtering already before you send out the frame to the adjacent bridge.
Apart from that, if you enable your bridge to behave in VEPA mode, then you would typically do packet filtering etc. on the adjacent bridge and not use the netfilter hook. You can still use both though, if you like.

Anna
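The hairpin forwarding rule described above boils down to one check. This is an illustrative sketch, not the kernel patch: a normal bridge never forwards a frame back out its ingress port, while a hairpin-mode port permits exactly that, which is what lets the adjacent bridge reflect traffic between two VEPA endpoints.

```python
# Hypothetical model of the bridge forwarding decision with hairpin mode.

def may_forward(ingress_port, egress_port, hairpin_ports=frozenset()):
    """A frame may always go to a different port; it may return out its
    own ingress port only if that port is in hairpin mode."""
    if ingress_port != egress_port:
        return True
    return ingress_port in hairpin_ports  # the hairpin turn

# Normal bridging between distinct ports is unaffected.
assert may_forward("eth1", "eth2")
# A plain bridge drops the reflection...
assert not may_forward("eth1", "eth1")
# ...while a hairpin-mode port allows it.
assert may_forward("eth1", "eth1", hairpin_ports={"eth1"})
```

This also shows why the VEPA side then needs source filtering: a reflected frame arrives back on the uplink and must be kept from being delivered to the very interface that originally sent it.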
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
Fischer, Anna anna.fisc...@hp.com writes:

If you do have an SR-IOV NIC that supports VEPA, then I would think that you do not have QEMU or macvtap in the setup any more though. Simply because in that case the VM can directly access the VF on the physical device.

That would be ideal. I'm just trying to understand how this all works, so I'm probably asking a stupid question: would an SR-IOV NIC with VEPA support show up as multiple devices? I.e. would I get e.g. eth0-eth7 for a NIC with support for 8 virtual interfaces? Would they have different MAC addresses?

/Benny
RE: [PATCH][RFC] net/bridge: add basic VEPA support
After reading more about this, I am not convinced this should be part of the bridge code. The bridge code really consists of two parts: forwarding table and optional spanning tree. Well, the VEPA code short circuits both of these; I can't imagine it working with STP turned on. The only part of bridge code that really gets used by this are the receive packet hooks and the crufty old API. So instead of adding more stuff to existing bridge code, why not have a new driver for just VEPA? You could do it with a simple version of a macvlan-type driver.

Stephen,

Thanks for your comments and questions. We do believe the bridge code is the right place for this, so I'd like to embellish on that a bit more to help persuade you. Sorry for the long-winded response, but here are some thoughts:

- First and foremost, VEPA is going to be a standard addition to the IEEE 802.1Q specification. The working group agreed at the last meeting to pursue a project to augment the bridge standard with hairpin mode (aka reflective relay) and a remote filtering service (VEPA). See for details: http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-evbPar5C-0709-v01.pdf

- The VEPA functionality was really a pretty small change to the code with low risk and wouldn't seem to warrant an entire new driver or module.

- There are good use cases where VMs will want to have some of their interfaces attached to bridges and others to bridges operating in VEPA mode. In other words, we see simultaneous operation of the bridge code and VEPA occurring, so having as much of the underlying code as common as possible would seem to be beneficial.

- By augmenting the bridge code with VEPA there is a great amount of re-use achieved. It works wherever the bridge code works and doesn't need anything special to support KVM, XEN, and all the hooks, etc...
- The hardware vendors building SR-IOV NICs with embedded switches will be adding VEPA mode, so keeping the bridge module in sync would be consistent with this trend and direction. It will be possible to extend the hardware implementations by cascading a software bridge and/or VEPA, so being in sync with the architecture would make this more consistent.

- The forwarding table is still needed and used on inbound traffic to deliver frames to the correct virtual interfaces and to filter any reflected frames. A new driver would have to basically implement an equivalent forwarding table anyway. As I understand the current macvlan-type driver, it wouldn't filter multicast frames properly without such a table.

- It seems the hairpin mode would be needed in the bridge module whether VEPA was added to the bridge module or a new driver. Having the associated changes together in the same code could aid in understanding and deployment.

As I understand the macvlan code, it currently doesn't allow two VMs on the same machine to communicate with one another. I could imagine a hairpin mode on the adjacent bridge making this possible, but the macvlan code would need to be updated to filter reflected frames so a source did not receive his own packet. I could imagine this being done as well, but to also support selective multicast usage, something similar to the bridge forwarding table would be needed. I think putting VEPA into a new driver would cause you to implement many things the bridge code already supports. Given that we expect the bridge standard to ultimately include VEPA, and the new functions are basic forwarding operations, it seems to make most sense to keep this consistent with the bridge module.

Paul
RE: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support
Arnd, I don't think your scheme works too well, because broadcast packets coming from other interfaces on br0 would get replicated and sent across the wire to ethB multiple times. Paul

That way you should be able to do something like:

   Host A                                Host B
          /- nalvcam0 -\                  /- macvlan0 - 192.168.1.1
    br0 -|              |- ethA === ethB-|
          \- nalvcam1 -/                  \- macvlan1 - 192.168.1.2

Now assuming that macvlan0 and macvlan1 are in different network namespaces or belong to different KVM guests, these guests would be able to communicate with each other through the bridge on host A, which can set the policy (using ebtables) for this communication and get interface statistics on its nalvcam interfaces. Also, instead of having the br0, host A could assign IP addresses to the two nalvcam interfaces, and use IP forwarding between the guests of host B.
RE: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
Yaron, The interface multiplexing can be achieved using the macvlan driver or using an SR-IOV capable NIC (the preferred option); macvlan may need to be extended to support VEPA multicast handling, but this looks like a rather simple task. Agreed that the hardware solution is preferred, so the macvlan implementation doesn't really matter. If we are talking SR-IOV, then it is direct mapped, regardless of whether there is a VEB or VEPA in the hardware below, so you are bypassing the bridge software code as well. I disagree that adding the multicast handling is simple - while not conceptually hard, it will basically require you to put an address table into the macvlan implementation; and if you have that, then why not just use the one already in the bridge code. If you hook a VEPA up to a non-hairpin-mode external bridge, you get the macvlan capability as well. It also seems to me like the special macvlan interfaces for KVM don't apply to Xen or a non-virtualized environment? Or does more have to be written to make that work? If it is in the bridge code, you get all of this re-use.
Re: [evb] Re: [PATCH][RFC] net/bridge: add basic VEPA support
On Monday 10 August 2009, Fischer, Anna wrote: On the VEPA filtering service side, the only change we have implemented in the bridging code is that in VEPA mode all frames are passed to the uplink on TX. However, frames are still passed through the netfilter hooks before they go out on the wire. On the inbound path, there are no changes to the way frames are processed (except the filtering for the original source port), so netfilter hooks work in the same way as for a normal bridge. Ah, interesting. I did not realize that the hooks were still active, although that obviously makes sense. So that would be another important difference between our implementations. If a frame is reflected back because of a hairpin turn, then of course the incoming port is the VEPA uplink port and not the port that originally sent the frame. So if you are trying to enforce some packet filtering on that inbound path, then you would have to do that based on MAC addresses and not on bridge ports. But I would assume that you would enforce the filtering already before you send out the frame to the adjacent bridge. Apart from that, if you enable your bridge to behave in VEPA mode, then you would typically do packet filtering etc. on the adjacent bridge and not use the netfilter hook. You can still use both, though, if you like. Right, that was my point. The bridge in VEPA mode would likely be configured to be completely ignorant of the data going through it and not do any filtering, and you do all filtering on the adjacent bridge. I just wasn't sure that this is possible with ebtables if the adjacent bridge is a Linux system with the bridge port in hairpin mode. Arnd
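For concreteness, the adjacent-bridge setup discussed above might look roughly like this on a Linux box. This is a hedged sketch, not configuration from the thread: the per-port `hairpin_mode` sysfs attribute is the interface the hairpin patches add, `br0`/`eth0` and the MAC addresses are placeholders, and the MAC-keyed ebtables rule reflects Arnd's point that port-based matching cannot distinguish two guests arriving on the same hairpin port:

```shell
# On the adjacent (filtering) bridge: put the downlink port into hairpin
# (reflective-relay) mode so frames may be sent back out the port they
# arrived on.
echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode

# Filtering between the VEPA's guests then happens here. Since both guests
# enter on the same bridge port, rules must key on MAC addresses:
ebtables -A FORWARD -s 52:54:00:12:34:56 -d 52:54:00:65:43:21 -j DROP
```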
RE: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support On Monday 10 August 2009, Stephen Hemminger wrote: On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz ogerl...@voltaire.com wrote: Looking in macvlan_set_multicast_list() it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to what macvlan devices want to receive them and this way the flooding can be removed, correct? The device can just flood all multicast packets, since the filtering is done on the receive path anyway. Is this handled by one of the additional patches? In the current kernel tree macvlan code it looks as if multicast filtering is only handled by the physical device driver, but not on particular macvlan devices. But we'd still have to copy the frames to user space (for both macvtap and raw packet sockets) and exit from the guest to inject it into its stack, right? I think it would be nice if you can implement what Or describes for macvlan and avoid flooding, and it doesn't sound too hard to do. I guess one advantage for macvlan (over the bridge) is that you can program in all information you have for the ports attached to it, e.g. MAC addresses and multicast addresses. So you could take advantage of that whereas the bridge always floods multicast frames to all ports. How would this work though, if the OS inside the guest wants to register to a particular multicast address? Is this propagated through the backend drivers to the macvlan/macvtap interface? Anna ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Mon, 10 Aug 2009 16:32:01 + Fischer, Anna anna.fisc...@hp.com wrote: Subject: Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support On Monday 10 August 2009, Stephen Hemminger wrote: On Sun, 09 Aug 2009 14:19:08 +0300, Or Gerlitz ogerl...@voltaire.com wrote: Looking in macvlan_set_multicast_list() it acts in a similar manner to macvlan_set_mac_address() in the sense that it calls dev_mc_sync(). I assume what's left is to add macvlan_hash_xxx multicast logic to map/unmap multicast groups to what macvlan devices want to receive them and this way the flooding can be removed, correct? The device can just flood all multicast packets, since the filtering is done on the receive path anyway. Is this handled by one of the additional patches? In the current kernel tree macvlan code it looks as if multicast filtering is only handled by the physical device driver, but not on particular macvlan devices. But we'd still have to copy the frames to user space (for both macvtap and raw packet sockets) and exit from the guest to inject it into its stack, right? I think it would be nice if you can implement what Or describes for macvlan and avoid flooding, and it doesn't sound too hard to do. I guess one advantage for macvlan (over the bridge) is that you can program in all information you have for the ports attached to it, e.g. MAC addresses and multicast addresses. So you could take advantage of that whereas the bridge always floods multicast frames to all ports. How would this work though, if the OS inside the guest wants to register to a particular multicast address? Is this propagated through the backend drivers to the macvlan/macvtap interface? Sure filtering is better, but multicast performance with large number of guests is really a corner case, not the real performance issue. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Gerd Hoffmann wrote: OK, let's rip out the in-kernel ioapic code then. It can be (and has been) done in userspace. The justification is performance, although that's not really true anymore post TPR optimization. But FWIW, I opposed both the in-kernel apic and the in-kernel pit when they were introduced. If nothing else, I'm at least consistent :-) The daemon increases the possibility of userspace bugs. How? Seriously: attaching a name tag to virtio-serial devices and having them exposed via sysfs is probably *much* less code than a vmchannel daemon. I strongly doubt that. If we really want vmchannel to be used by application developers, then we really need a libvmchannel. We need a sane solution developed and merged, and not a new idea each week. There is nothing sane about vmchannel. It's just an attempt to bypass QEMU, which is going to introduce all sorts of complexities wrt migration, guest compatibility, etc. However, as I've mentioned repeatedly, the reason I won't merge virtio-serial is that it duplicates functionality with virtio-console. If the two are converged, I'm happy to merge it. I'm not opposed to having more functionality. I think it's the wrong solution for the use-case, and I always have, but that's independent of my willingness to merge it. Regards, Anthony Liguori
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Anthony Liguori wrote: There is nothing sane about vmchannel. It's just an attempt to bypass QEMU, which is going to introduce all sorts of complexities wrt migration, guest compatibility, etc. However, as I've mentioned repeatedly, the reason I won't merge virtio-serial is that it duplicates functionality with virtio-console. If the two are converged, I'm happy to merge it. I'm not opposed to having more functionality. NB: the userspace interface for these devices should be a tty, not a new character device. If you want to add a new bus type for these devices, and then have an entry in sysfs that has some sort of identification string, that's perfectly acceptable. Also note, though, that this is exactly what usb-serial is today: /sys/bus/usb-serial contains all the usb serial devices, and you can get a vendor id/device id to uniquely identify the device type. Using virtio vs. usb has its advantages, but the userspace interface model should be roughly equivalent. Regards, Anthony Liguori
[PATCH 0/2] vhost: a kernel-level virtio server
This implements vhost: a kernel-level backend for virtio. The main motivation for this work is to reduce virtualization overhead for virtio by removing system calls on the data path, without guest changes. For virtio-net, this removes up to 4 system calls per packet: vm exit for kick, reentry for kick, iothread wakeup for packet, interrupt injection for packet. A more detailed description is attached to the patch itself. The patches are against 2.6.31-rc4. I'd like them to go into linux-next and, down the road, into 2.6.32 if possible. Please comment. Userspace bits using this driver will be posted to k...@vger shortly. Michael S. Tsirkin (2): export cpu_tlbstate to modules vhost_net: a kernel-level virtio server

 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 arch/x86/mm/tlb.c          |    1 +
 drivers/Makefile           |    1 +
 drivers/block/virtio_blk.c |    3 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  462 ++++++
 drivers/vhost/vhost.c      |  663 ++++++++
 drivers/vhost/vhost.h      |  108 +++
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  100 +++
 13 files changed, 1364 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h
[PATCH 2/2] vhost_net: a kernel-level virtio server
What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope: - uses eventfd for signalling - structures can be moved around in memory at any time (good for migration) - supports a memory table and not just an offset (needed for kvm) Common virtio-related code has been put in a separate file, vhost.c, and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: it supplied me with witty comments I wouldn't be able to write myself. What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm. How it works: basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc. Status: this works for me, and I haven't seen any crashes. I have not run any benchmarks yet; compared to userspace, I expect to see improved latency (as I save up to 4 system calls per packet) but not yet bandwidth/CPU (as TSO and interrupt mitigation are not yet supported). Features that I plan to look at in the future: - TSO - interrupt mitigation - zero copy Signed-off-by: Michael S.
Tsirkin m...@redhat.com --- MAINTAINERS| 10 + arch/x86/kvm/Kconfig |1 + drivers/Makefile |1 + drivers/block/virtio_blk.c |3 + drivers/vhost/Kconfig | 11 + drivers/vhost/Makefile |2 + drivers/vhost/net.c| 462 ++ drivers/vhost/vhost.c | 663 drivers/vhost/vhost.h | 108 +++ include/linux/Kbuild |1 + include/linux/miscdevice.h |1 + include/linux/vhost.h | 100 +++ 12 files changed, 1363 insertions(+), 0 deletions(-) create mode 100644 drivers/vhost/Kconfig create mode 100644 drivers/vhost/Makefile create mode 100644 drivers/vhost/net.c create mode 100644 drivers/vhost/vhost.c create mode 100644 drivers/vhost/vhost.h create mode 100644 include/linux/vhost.h diff --git a/MAINTAINERS b/MAINTAINERS index ebc2691..eb0c1da 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -6312,6 +6312,16 @@ S: Maintained F: Documentation/filesystems/vfat.txt F: fs/fat/ +VIRTIO HOST (VHOST) +P: Michael S. Tsirkin +M: m...@redhat.com +L: k...@vger.kernel.org +L: virtualizat...@lists.osdl.org +L: net...@vger.kernel.org +S: Maintained +F: drivers/vhost/ +F: include/linux/vhost.h + VIA RHINE NETWORK DRIVER P: Roger Luethi M: r...@hellgate.ch diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index b84e571..94f44d9 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -64,6 +64,7 @@ config KVM_AMD # OK, it's a little counter-intuitive to do this, but it puts it neatly under # the virtualization menu. 
+source drivers/vhost/Kconfig source drivers/lguest/Kconfig source drivers/virtio/Kconfig diff --git a/drivers/Makefile b/drivers/Makefile index bc4205d..1551ae1 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/ obj-$(CONFIG_PPC_PS3) += ps3/ obj-$(CONFIG_OF) += of/ obj-$(CONFIG_SSB) += ssb/ +obj-$(CONFIG_VHOST_NET)+= vhost/ obj-$(CONFIG_VIRTIO) += virtio/ obj-$(CONFIG_VLYNQ)+= vlynq/ obj-$(CONFIG_STAGING) += staging/ diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c index aa1a3d5..42e61b0 100644 --- a/drivers/block/virtio_blk.c +++ b/drivers/block/virtio_blk.c @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev) err = PTR_ERR(vblk-vq); goto out_free_vblk; } + printk(KERN_ERR vblk-vq = %p\n, vblk-vq); vblk-pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req)); if (!vblk-pool) { @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev) if (!err) blk_queue_logical_block_size(vblk-disk-queue, blk_size); + printk(KERN_ERR virtio_config_val returned %d\n, err); + add_disk(vblk-disk); return 0; diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig new file mode 100644 index 000..d955406 --- /dev/null +++ b/drivers/vhost/Kconfig @@ -0,0 +1,11 @@ +config VHOST_NET + tristate Host kernel accelerator for virtio net + depends on NET EVENTFD + ---help--- + This kernel module can be loaded in host kernel
[PATCH 1/2] export cpu_tlbstate to modules
The vhost net module wants to do copy to/from user from a kernel thread, which needs switch_mm (like what fs/aio has). Export cpu_tlbstate to make this possible. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- arch/x86/mm/tlb.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 821e970..e33a5f0 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -13,6 +13,7 @@ DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { &init_mm, 0, }; +EXPORT_PER_CPU_SYMBOL_GPL(cpu_tlbstate); /* * Smarter SMP flushing macros. -- 1.6.2.5
[PATCH 0/3] qemu-kvm: vhost net support
This adds support for the vhost-net virtio kernel backend. This is an RFC, but it works without issues for me. It still needs to be split up, tested and benchmarked properly, but I'm posting it here in case people want to test drive the kernel bits I posted. Michael S. Tsirkin (3): qemu-kvm: move virtio-pci.o to near pci.o virtio: move features to an inline function qemu-kvm: vhost-net implementation Makefile.hw |2 +- Makefile.target |3 +- hw/vhost_net.c | 181 +++ hw/vhost_net.h | 30 + hw/virtio-balloon.c |2 +- hw/virtio-blk.c |2 +- hw/virtio-console.c |2 +- hw/virtio-net.c | 34 +- hw/virtio-pci.c | 43 +++- hw/virtio.c | 19 -- hw/virtio.h | 38 ++- net.c |8 ++- net.h |1 + qemu-kvm.c |8 -- qemu-kvm.h |9 +++ 15 files changed, 340 insertions(+), 42 deletions(-) create mode 100644 hw/vhost_net.c create mode 100644 hw/vhost_net.h
Re: [evb] RE: [PATCH][RFC] net/bridge: add basic VEPA support
On Monday 10 August 2009, Stephen Hemminger wrote: On Mon, 10 Aug 2009 16:32:01, Fischer, Anna anna.fisc...@hp.com wrote: How would this work though, if the OS inside the guest wants to register to a particular multicast address? Is this propagated through the backend drivers to the macvlan/macvtap interface? Sure, filtering is better, but multicast performance with a large number of guests is really a corner case, not the real performance issue. Well, right now, qemu does not care at all about this; it essentially leaves the tun device in ALLMULTI state. I should check whether macvtap at this stage can receive multicast frames at all, but if it does, it will get them all ;-). If we want to implement this with kvm, we would have to start with the qemu virtio-net implementation, to move the receive filter into the tap device. With tun/tap that will mean less copying to user space; with macvtap (after implementing TUNSETTXFILTER) we get already pretty far, because we no longer need to have the external interface in ALLMULTI state. Once that is in place, we can start thinking about filtering per virtual device. Arnd
[PATCH 1/3] qemu: move virtio-pci.o to near pci.o
virtio-pci depends, and will always depend, on pci.c so it makes sense to keep it in the same makefile, (unlike the rest of virtio files which should eventually be moved out to Makefile.hw). Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Makefile.hw |2 +- Makefile.target |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/Makefile.hw b/Makefile.hw index f7a9507..6857fea 100644 --- a/Makefile.hw +++ b/Makefile.hw @@ -16,7 +16,7 @@ CPPFLAGS += -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE CPPFLAGS+=-I$(SRC_PATH)/fpu obj-y = -obj-y += virtio.o virtio-pci.o +obj-y += virtio.o obj-y += fw_cfg.o obj-y += watchdog.o obj-y += nand.o ecc.o diff --git a/Makefile.target b/Makefile.target index df1f32b..0dd56b1 100644 --- a/Makefile.target +++ b/Makefile.target @@ -538,7 +538,7 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \ gdbstub.o gdbstub-xml.o msix.o ioport.o # virtio has to be here due to weird dependency between PCI and virtio-net. # need to fix this properly -obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o +obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o obj-$(CONFIG_KVM) += kvm.o kvm-all.o LIBS+=-lz -- 1.6.2.5 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 2/3] qemu/virtio: move features to an inline function
devices should have the final say over which virtio features they support. E.g. indirect entries may or may not make sense in the context of virtio-console. Move the common bits from virtio-pci to an inline function and let each device call it. No functional changes. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- hw/virtio-balloon.c |2 +- hw/virtio-blk.c |2 +- hw/virtio-console.c |2 +- hw/virtio-net.c |2 +- hw/virtio-pci.c |3 --- hw/virtio.h | 10 ++ 6 files changed, 14 insertions(+), 7 deletions(-) diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c index 7ca783e..15b50bb 100644 --- a/hw/virtio-balloon.c +++ b/hw/virtio-balloon.c @@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint32_t virtio_balloon_get_features(VirtIODevice *vdev) { -return 0; +return virtio_common_features(); } static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target) diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c index 2beff52..7d33fee 100644 --- a/hw/virtio-blk.c +++ b/hw/virtio-blk.c @@ -364,7 +364,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev) if (strcmp(s-serial_str, 0)) features |= 1 VIRTIO_BLK_F_IDENTIFY; -return features; +return features | virtio_common_features(); } static void virtio_blk_save(QEMUFile *f, void *opaque) diff --git a/hw/virtio-console.c b/hw/virtio-console.c index 663c8b9..ac25499 100644 --- a/hw/virtio-console.c +++ b/hw/virtio-console.c @@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq) static uint32_t virtio_console_get_features(VirtIODevice *vdev) { -return 0; +return virtio_common_features(); } static int vcon_can_read(void *opaque) diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ef9e7ff..2e51a6a 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -154,7 +154,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev) } #endif -return features; +return features | virtio_common_features(); } static uint32_t 
virtio_net_bad_features(VirtIODevice *vdev) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 3b9bfd1..90f51be 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -232,9 +232,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr) switch (addr) { case VIRTIO_PCI_HOST_FEATURES: ret = vdev-get_features(vdev); -ret |= (1 VIRTIO_F_NOTIFY_ON_EMPTY); -ret |= (1 VIRTIO_RING_F_INDIRECT_DESC); -ret |= (1 VIRTIO_F_BAD_FEATURE); break; case VIRTIO_PCI_GUEST_FEATURES: ret = vdev-features; diff --git a/hw/virtio.h b/hw/virtio.h index aa55677..de620a7 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -166,4 +166,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev); VirtIODevice *virtio_console_init(DeviceState *dev); VirtIODevice *virtio_balloon_init(DeviceState *dev); +static inline uint32_t virtio_common_features(void) +{ +uint32_t features = 0; +features |= (1 VIRTIO_F_NOTIFY_ON_EMPTY); +features |= (1 VIRTIO_RING_F_INDIRECT_DESC); +features |= (1 VIRTIO_F_BAD_FEATURE); + +return features; +} + #endif -- 1.6.2.5 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 3/3] qemu-kvm: vhost-net implementation
This adds support for vhost-net virtio kernel backend. To enable (assuming device eth2): 1. enable promisc mode or program guest mac in device eth2 2. disable tso, gso, lro on the card 3. add vhost=eth0 to -net flag 4. run with CAP_NET_ADMIN priviledge (e.g. root) This patch is RFC, but works without issues for me. It still needs to be split up, tested and benchmarked properly, but posting it here in case people want to test drive the kernel bits I posted. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- Makefile.target |3 +- hw/vhost_net.c | 181 +++ hw/vhost_net.h | 30 + hw/virtio-net.c | 32 ++- hw/virtio-pci.c | 40 hw/virtio.c | 19 -- hw/virtio.h | 28 - net.c |8 ++- net.h |1 + qemu-kvm.c |8 --- qemu-kvm.h |9 +++ 11 files changed, 325 insertions(+), 34 deletions(-) create mode 100644 hw/vhost_net.c create mode 100644 hw/vhost_net.h diff --git a/Makefile.target b/Makefile.target index 0dd56b1..5763fa8 100644 --- a/Makefile.target +++ b/Makefile.target @@ -538,7 +538,8 @@ obj-y = vl.o osdep.o monitor.o pci.o loader.o isa_mmio.o machine.o \ gdbstub.o gdbstub-xml.o msix.o ioport.o # virtio has to be here due to weird dependency between PCI and virtio-net. 
# need to fix this properly -obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o +obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-console.o virtio-pci.o \ + vhost_net.o obj-$(CONFIG_KVM) += kvm.o kvm-all.o LIBS+=-lz diff --git a/hw/vhost_net.c b/hw/vhost_net.c new file mode 100644 index 000..7d52de0 --- /dev/null +++ b/hw/vhost_net.c @@ -0,0 +1,181 @@ +#include sys/eventfd.h +#include sys/socket.h +#include linux/kvm.h +#include fcntl.h +#include sys/ioctl.h +#include linux/vhost.h +#include linux/virtio_ring.h +#include netpacket/packet.h +#include net/ethernet.h +#include net/if.h +#include netinet/in.h + +#include stdio.h + +#include qemu-kvm.h + +#include vhost_net.h + +const char *vhost_net_device; + +static int vhost_virtqueue_init(struct vhost_dev *dev, + struct VirtIODevice *vdev, + struct vhost_virtqueue *vq, + struct VirtQueue *q, + unsigned idx) +{ + target_phys_addr_t s, l; + int r; + struct vhost_vring_addr addr = { + .index = idx, + }; + struct vhost_vring_file file = { + .index = idx, + }; + struct vhost_vring_state size = { + .index = idx, + }; + + size.num = q-vring.num; + r = ioctl(dev-control, VHOST_SET_VRING_NUM, size); + if (r) + return -errno; + + file.fd = vq-kick = eventfd(0, 0); + r = ioctl(dev-control, VHOST_SET_VRING_KICK, file); + if (r) + return -errno; + file.fd = vq-call = eventfd(0, 0); + r = ioctl(dev-control, VHOST_SET_VRING_CALL, file); + if (r) + return -errno; + + s = l = sizeof(struct vring_desc) * q-vring.num; + vq-desc = cpu_physical_memory_map(q-vring.desc, l, 0); + if (!vq-desc || l != s) + return -ENOMEM; + addr.user_addr = (u_int64_t)(unsigned long)vq-desc; + r = ioctl(dev-control, VHOST_SET_VRING_DESC, addr); + if (r 0) + return -errno; + s = l = offsetof(struct vring_avail, ring) + + sizeof(u_int64_t) * q-vring.num; + vq-avail = cpu_physical_memory_map(q-vring.avail, l, 0); + if (!vq-avail || l != s) + return -ENOMEM; + addr.user_addr = (u_int64_t)(unsigned long)vq-avail; + r = 
ioctl(dev-control, VHOST_SET_VRING_AVAIL, addr); + if (r 0) + return -errno; + s = l = offsetof(struct vring_used, ring) + + sizeof(struct vring_used_elem) * q-vring.num; + vq-used = cpu_physical_memory_map(q-vring.used, l, 1); + if (!vq-used || l != s) + return -ENOMEM; + addr.user_addr = (u_int64_t)(unsigned long)vq-used; + r = ioctl(dev-control, VHOST_SET_VRING_USED, addr); + if (r 0) + return -errno; + +r = vdev-binding-irqfd(vdev-binding_opaque, q-vector, vq-call); +if (r 0) +return -errno; + +r = vdev-binding-queuefd(vdev-binding_opaque, idx, vq-kick); +if (r 0) +return -errno; + + return 0; +} + +static int vhost_dev_init(struct vhost_dev *hdev, + VirtIODevice *vdev) +{ + int i, r, n = 0; + struct vhost_memory *mem; + hdev-control = open(/dev/vhost-net, O_RDWR); + if (hdev-control 0) + return -errno; + r = ioctl(hdev-control, VHOST_SET_OWNER, NULL); + if (r 0) + return -errno; + for (i = 0; i
[PATCH] qemu/virtio: move features to an inline function
devices should have the final say over which virtio features they support. E.g. indirect entries may or may not make sense in the context of virtio-console. Move the common bits from virtio-pci to an inline function and let each device call it. No functional changes. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- This is part of my vhost work, but IMO the patch makes sense separately as well: the common features are not related to pci at all. hw/virtio-balloon.c |2 +- hw/virtio-blk.c |2 +- hw/virtio-console.c |2 +- hw/virtio-net.c |2 +- hw/virtio-pci.c |3 --- hw/virtio.h | 10 ++ 6 files changed, 14 insertions(+), 7 deletions(-) diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c index 7ca783e..15b50bb 100644 --- a/hw/virtio-balloon.c +++ b/hw/virtio-balloon.c @@ -127,7 +127,7 @@ static void virtio_balloon_set_config(VirtIODevice *vdev, static uint32_t virtio_balloon_get_features(VirtIODevice *vdev) { -return 0; +return virtio_common_features(); } static ram_addr_t virtio_balloon_to_target(void *opaque, ram_addr_t target) diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c index 2beff52..7d33fee 100644 --- a/hw/virtio-blk.c +++ b/hw/virtio-blk.c @@ -364,7 +364,7 @@ static uint32_t virtio_blk_get_features(VirtIODevice *vdev) if (strcmp(s-serial_str, 0)) features |= 1 VIRTIO_BLK_F_IDENTIFY; -return features; +return features | virtio_common_features(); } static void virtio_blk_save(QEMUFile *f, void *opaque) diff --git a/hw/virtio-console.c b/hw/virtio-console.c index 663c8b9..ac25499 100644 --- a/hw/virtio-console.c +++ b/hw/virtio-console.c @@ -53,7 +53,7 @@ static void virtio_console_handle_input(VirtIODevice *vdev, VirtQueue *vq) static uint32_t virtio_console_get_features(VirtIODevice *vdev) { -return 0; +return virtio_common_features(); } static int vcon_can_read(void *opaque) diff --git a/hw/virtio-net.c b/hw/virtio-net.c index ef9e7ff..2e51a6a 100644 --- a/hw/virtio-net.c +++ b/hw/virtio-net.c @@ -154,7 +154,7 @@ static uint32_t 
virtio_net_get_features(VirtIODevice *vdev) } #endif -return features; +return features | virtio_common_features(); } static uint32_t virtio_net_bad_features(VirtIODevice *vdev) diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c index 3b9bfd1..90f51be 100644 --- a/hw/virtio-pci.c +++ b/hw/virtio-pci.c @@ -232,9 +232,6 @@ static uint32_t virtio_ioport_read(VirtIOPCIProxy *proxy, uint32_t addr) switch (addr) { case VIRTIO_PCI_HOST_FEATURES: ret = vdev-get_features(vdev); -ret |= (1 VIRTIO_F_NOTIFY_ON_EMPTY); -ret |= (1 VIRTIO_RING_F_INDIRECT_DESC); -ret |= (1 VIRTIO_F_BAD_FEATURE); break; case VIRTIO_PCI_GUEST_FEATURES: ret = vdev-features; diff --git a/hw/virtio.h b/hw/virtio.h index aa55677..de620a7 100644 --- a/hw/virtio.h +++ b/hw/virtio.h @@ -166,4 +166,14 @@ VirtIODevice *virtio_net_init(DeviceState *dev); VirtIODevice *virtio_console_init(DeviceState *dev); VirtIODevice *virtio_balloon_init(DeviceState *dev); +static inline uint32_t virtio_common_features(void) +{ +uint32_t features = 0; +features |= (1 VIRTIO_F_NOTIFY_ON_EMPTY); +features |= (1 VIRTIO_RING_F_INDIRECT_DESC); +features |= (1 VIRTIO_F_BAD_FEATURE); + +return features; +} + #endif -- 1.6.2.5 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] qemu/virtio: move features to an inline function
Michael S. Tsirkin wrote:
> devices should have the final say over which virtio features they
> support. E.g. indirect entries may or may not make sense in the
> context of virtio-console. Move the common bits from virtio-pci to an
> inline function and let each device call it.

What drove this in vhost?

Normally, the common features are transport features, and the devices
should have absolutely no knowledge of transport features (since
they're transport dependent). IOW, VIRTIO_RING_F_INDIRECT_DESC is
meaningless to virtio-console because virtio-console has no idea what
the ring implementation is that it sits on top of.

Regards,

Anthony Liguori
Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function
On Mon, Aug 10, 2009 at 02:37:20PM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
> > devices should have the final say over which virtio features they
> > support. E.g. indirect entries may or may not make sense in the
> > context of virtio-console. Move the common bits from virtio-pci to
> > an inline function and let each device call it.
>
> What drove this in vhost?

vhost does not support indirect ring entries and notify-on-empty yet.

> Normally, the common features are transport features and the devices
> should have absolutely no knowledge of transport features (since
> they're transport dependent).

Good point. But:
1. Note that with my patch they don't. They call
   virtio_common_features() and that's all.
2. Some features may not make sense for some devices. For example, it
   is quite possible that the indirect ring entries feature improves
   performance on block but hurts on net, as net has a similar merged
   buffers feature. Someone should try benchmarking with it disabled,
   and that becomes possible with my patch.

> IOW, VIRTIO_RING_F_INDIRECT_DESC is meaningless to virtio-console
> because virtio-console has no idea what the ring implementation is
> that it sits on top of.

At some level. The same can be said of virtio-pci: it sits below the
ring implementation. However, it is not *completely* meaningless: some
devices may benefit from indirect entries, others may not, or may
benefit from disabling them.
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009, Michael S. Tsirkin wrote:
> What it is: vhost net is a character device that can be used to
> reduce the number of system calls involved in virtio networking.
> Existing virtio net code is used in the guest without modification.

Very nice, I loved reading it. It's getting rather late in my time
zone, so this only comments on the network driver. I'll go through the
rest tomorrow.

> @@ -293,6 +293,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> 		err = PTR_ERR(vblk->vq);
> 		goto out_free_vblk;
> 	}
> +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
>
> 	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
> 	if (!vblk->pool) {
> @@ -383,6 +384,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> 	if (!err)
> 		blk_queue_logical_block_size(vblk->disk->queue, blk_size);
>
> +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
> +
> 	add_disk(vblk->disk);
> 	return 0;

I guess you meant to remove these before submitting.

> +static void handle_tx_kick(struct work_struct *work);
> +static void handle_rx_kick(struct work_struct *work);
> +static void handle_tx_net(struct work_struct *work);
> +static void handle_rx_net(struct work_struct *work);

[style] I think the code gets more readable if you reorder it so that
you don't need forward declarations for static functions.

> +static long vhost_net_reset_owner(struct vhost_net *n)
> +{
> +	struct socket *sock = NULL;
> +	long r;
> +	mutex_lock(&n->dev.mutex);
> +	r = vhost_dev_check_owner(&n->dev);
> +	if (r)
> +		goto done;
> +	sock = vhost_net_stop(n);
> +	r = vhost_dev_reset_owner(&n->dev);
> +done:
> +	mutex_unlock(&n->dev.mutex);
> +	if (sock)
> +		fput(sock->file);
> +	return r;
> +}

What is the difference between vhost_net_reset_owner(n) and
vhost_net_set_socket(n, -1)?

> +static struct file_operations vhost_net_fops = {
> +	.owner          = THIS_MODULE,
> +	.release        = vhost_net_release,
> +	.unlocked_ioctl = vhost_net_ioctl,
> +	.open           = vhost_net_open,
> +};

This is missing a compat_ioctl pointer. It should simply be

static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
				   unsigned long arg)
{
	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
}

> +/* Bits from fs/aio.c. TODO: export and use from there? */
> +/*
> + * use_mm
> + *	Makes the calling kernel thread take on the specified
> + *	mm context.
> + *	Called by the retry thread execute retries within the
> + *	iocb issuer's mm context, so that copy_from/to_user
> + *	operations work seamlessly for aio.
> + *	(Note: this routine is intended to be called only
> + *	from a kernel thread context)
> + */
> +static void use_mm(struct mm_struct *mm)
> +{
> +	struct mm_struct *active_mm;
> +	struct task_struct *tsk = current;
> +
> +	task_lock(tsk);
> +	active_mm = tsk->active_mm;
> +	atomic_inc(&mm->mm_count);
> +	tsk->mm = mm;
> +	tsk->active_mm = mm;
> +	switch_mm(active_mm, mm, tsk);
> +	task_unlock(tsk);
> +
> +	mmdrop(active_mm);
> +}

Why do you need a kernel thread here? If the data transfer functions
all get called from a guest intercept, shouldn't you already be in the
right mm?

> +static void handle_tx(struct vhost_net *net)
> +{
> +	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
> +	unsigned head, out, in;
> +	struct msghdr msg = {
> +		.msg_name = NULL,
> +		.msg_namelen = 0,
> +		.msg_control = NULL,
> +		.msg_controllen = 0,
> +		.msg_iov = (struct iovec *)vq->iov + 1,
> +		.msg_flags = MSG_DONTWAIT,
> +	};
> +	size_t len;
> +	int err;
> +	struct socket *sock = rcu_dereference(net->sock);
> +	if (!sock || !sock_writeable(sock->sk))
> +		return;
> +
> +	use_mm(net->dev.mm);
> +	mutex_lock(&vq->mutex);
> +	for (;;) {
> +		head = vhost_get_vq_desc(&net->dev, vq, vq->iov, &out, &in);
> +		if (head == vq->num)
> +			break;
> +		if (out <= 1 || in) {
> +			vq_err(vq, "Unexpected descriptor format for TX: "
> +			       "out %d, int %d\n", out, in);
> +			break;
> +		}
> +		/* Sanity check */
> +		if (vq->iov->iov_len != sizeof(struct virtio_net_hdr)) {
> +			vq_err(vq, "Unexpected header len for TX: "
> +			       "%ld expected %zd\n", vq->iov->iov_len,
> +			       sizeof(struct virtio_net_hdr));
> +			break;
> +		}
> +		/* Skip header. TODO: support TSO. */
> +		msg.msg_iovlen = out - 1;
> +		len = iov_length(vq->iov + 1, out - 1);
> +		/* TODO: Check specific error and bomb out unless ENOBUFS? */
> +		err = sock->ops->sendmsg(NULL, sock, &msg, len);
> +		if (err < 0) {
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote:
> On Monday 10 August 2009, Michael S. Tsirkin wrote:
> > What it is: vhost net is a character device that can be used to
> > reduce the number of system calls involved in virtio networking.
> > Existing virtio net code is used in the guest without modification.
>
> Very nice, I loved reading it. It's getting rather late in my time
> zone, so this only comments on the network driver. I'll go through
> the rest tomorrow.
>
> > +	printk(KERN_ERR "vblk->vq = %p\n", vblk->vq);
> [...]
> > +	printk(KERN_ERR "virtio_config_val returned %d\n", err);
>
> I guess you meant to remove these before submitting.

Good catch, thanks!

> > +static void handle_tx_kick(struct work_struct *work);
> > +static void handle_rx_kick(struct work_struct *work);
> > +static void handle_tx_net(struct work_struct *work);
> > +static void handle_rx_net(struct work_struct *work);
>
> [style] I think the code gets more readable if you reorder it so that
> you don't need forward declarations for static functions.

Right.

> > +static long vhost_net_reset_owner(struct vhost_net *n)
> [...]
>
> What is the difference between vhost_net_reset_owner(n) and
> vhost_net_set_socket(n, -1)?

Setting the socket to -1 will only stop the device. Resetting the
owner will let another process take over the device. It also needs to
reset all parameters to make it safe for that other process, so in
particular the device is stopped. I tried explaining this in the
header vhost.h; does the comment there help, or do I need to clarify
it?

> > +static struct file_operations vhost_net_fops = {
> [...]
>
> This is missing a compat_ioctl pointer. It should simply be
>
> static long vhost_net_compat_ioctl(struct file *f, unsigned int ioctl,
> 				   unsigned long arg)
> {
> 	return vhost_net_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
> }

I had the impression that if there's no compat_ioctl, unlocked_ioctl
will get called automatically. No?

> > +/* Bits from fs/aio.c. TODO: export and use from there? */
> > +static void use_mm(struct mm_struct *mm)
> [...]
>
> Why do you need a kernel thread here? If the data transfer functions
> all get called from a guest intercept, shouldn't you already be in
> the right mm?

Several reasons :)
- I get called under lock, so can't block
- eventfd can be passed to another process, and I won't be in guest
  context at all
- this also gets called outside guest context, from socket poll
- the vcpu is blocked while it's doing i/o; it is better to free it
  up, as all the packet copying might take a while

> > +static void handle_tx(struct vhost_net *net)
> [...]
Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function
Michael S. Tsirkin wrote:
> > Normally, the common features are transport features and the
> > devices should have absolutely no knowledge of transport features
> > (since they're transport dependent).
>
> Good point. But:
> 1. Note that with my patch they don't. They call
>    virtio_common_features() and that's all.
> 2. Some features may not make sense for some devices. For example, it
>    is quite possible that the indirect ring entries feature improves
>    performance on block but hurts on net, as net has a similar merged
>    buffers feature. Someone should try benchmarking with it disabled,
>    and that becomes possible with my patch.

I don't necessarily disagree, but I think your patch goes about it the
wrong way. There ought to be a way to layer qdev properties that
achieves this goal, so that when you create a virtio-pci-block device,
you have the ability to turn off indirect sg without virtio-block
having to know what that is.

For your use case, I wonder if you're integrating at the wrong level.
If you implement a ring in-kernel, maybe the thing to do is introduce
more layering in qemu, like we have in the kernel, so that you can
easily add a new ring backend type.

At any rate, see if you can achieve the same goal with qdev
properties. If you could, you should be able to hack something up
easily to disable this for vhost without completely overhauling qemu's
virtio implementation.

Regards,

Anthony Liguori
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Monday 10 August 2009 20:10:44 Michael S. Tsirkin wrote:
> On Mon, Aug 10, 2009 at 09:51:18PM +0200, Arnd Bergmann wrote:
> > What is the difference between vhost_net_reset_owner(n) and
> > vhost_net_set_socket(n, -1)?
>
> Setting the socket to -1 will only stop the device. Resetting the
> owner will let another process take over the device. It also needs to
> reset all parameters to make it safe for that other process, so in
> particular the device is stopped.

Ok.

> I tried explaining this in the header vhost.h; does the comment there
> help, or do I need to clarify it?

No, I just didn't get there yet.

> I had the impression that if there's no compat_ioctl, unlocked_ioctl
> will get called automatically. No?

It will issue a kernel warning but not call unlocked_ioctl, so you
need either a compat_ioctl method or to list the numbers in
fs/compat_ioctl.c, which I try to avoid.

> > Why do you need a kernel thread here? If the data transfer
> > functions all get called from a guest intercept, shouldn't you
> > already be in the right mm?
>
> Several reasons :)
> - I get called under lock, so can't block
> - eventfd can be passed to another process, and I won't be in guest
>   context at all
> - this also gets called outside guest context, from socket poll
> - the vcpu is blocked while it's doing i/o; it is better to free it
>   up, as all the packet copying might take a while

Ok.

> > I guess that this is where one could plug into macvlan directly,
> > using sock_alloc_send_skb/memcpy_fromiovec/dev_queue_xmit directly,
> > instead of filling a msghdr for each, if we want to combine this
> > with the work I did on that.
>
> Quite possibly. Or one can just bind a raw socket to macvlan :)

Right, that works as well, but may get more complicated once we try to
add zero-copy or other optimizations.

	Arnd
Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function
On Mon, Aug 10, 2009 at 03:33:59PM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
> > > Normally, the common features are transport features and the
> > > devices should have absolutely no knowledge of transport features
> > > (since they're transport dependent).
> >
> > Good point. But:
> > 1. Note that with my patch they don't. They call
> >    virtio_common_features() and that's all.
> > 2. Some features may not make sense for some devices. For example,
> >    it is quite possible that the indirect ring entries feature
> >    improves performance on block but hurts on net, as net has a
> >    similar merged buffers feature. Someone should try benchmarking
> >    with it disabled, and that becomes possible with my patch.
>
> I don't necessarily disagree, but I think your patch goes about it
> the wrong way. There ought to be a way to layer qdev properties that
> achieves this goal, so that when you create a virtio-pci-block
> device, you have the ability to turn off indirect sg without
> virtio-block having to know what that is.

I don't understand, sorry. Why do you insist on involving pci here?
The ring layout has nothing to do with pci, does it?

With my patch, virtio-block does not know what indirect sg is. It just
does enable/disable. virtio-net has a device-specific feature that
overlaps with indirect entries, so insisting that devices should just
ignore the transport does not make sense to me.

> For your use-case, I wonder if you're integrating at the wrong level.

Forget about that for now. Let's solve the generic problem.
Re: [PATCH 1/2] export cpu_tlbstate to modules
On 08/10/2009 03:06 PM, Michael S. Tsirkin wrote:
> > Wouldn't it be a *lot* better to move use_mm() from fs/aio.c into
> > common code, and export that instead?
>
> That's easy too. What would a good place for it be?

Somewhere in mm/, presumably. When in doubt, make it a new file...

	-hpa
Re: [Qemu-devel] Re: [PATCH] qemu/virtio: move features to an inline function
Michael S. Tsirkin wrote:
> On Mon, Aug 10, 2009 at 03:33:59PM -0500, Anthony Liguori wrote:
> > There ought to be a way to layer qdev properties that achieves this
> > goal, so that when you create a virtio-pci-block device, you have
> > the ability to turn off indirect sg without virtio-block having to
> > know what that is.
>
> I don't understand, sorry. Why do you insist on involving pci here?
> The ring layout has nothing to do with pci, does it?

What I'm saying is that virtio-blk-pci, which is the qdev
instantiation of virtio-pci + virtio-blk, should be able to have a set
of qdev properties that is composed of a combination of at least two
sets of properties: virtio-blk's qdev properties and virtio-pci's qdev
properties.

Right now, all of the properties are defined in virtio-pci.c, so you
could add a property via DEFINE_PROP_BOOL("indirect-sg", ...) that you
could then use to selectively enable/disable indirect sg on
virtio-blk-pci devices without ever having to involve virtio-blk.c.

Ideally, we wouldn't have the properties centralized in virtio-pci.c.
Rather, it would be nice if virtio-blk.c could have a set of
properties and virtio-pci.c could just add those properties to its own
set of properties. Today, we don't have a concept of a ring
abstraction. If we did, then virtio-ring.c could have its own set of
properties.

N.B. I expect that the in-kernel virtio-net device is going to be a
separate qdev device than virtio-net-pci. It can have an identical
guest interface, but within qemu it should be a separate device. This
is how we handle the in-kernel PIT and it's how we should handle the
in-kernel APIC.

Regards,

Anthony Liguori
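The layering Anthony describes could look roughly like the fragment
below, using qdev's property macros. This is only a sketch of the
direction: the property name "indirect-sg", the field placement, and
the exact macro set are assumptions for illustration, not code from any
QEMU tree of the period:

```c
/* Hypothetical: virtio-pci.c owns the transport-level knob, so
 * virtio-blk.c never learns what "indirect sg" means. virtio-pci
 * would combine this array with the device's own properties when
 * registering virtio-blk-pci. */
static Property virtio_blk_pci_properties[] = {
    /* Transport property: toggles VIRTIO_RING_F_INDIRECT_DESC in the
     * host feature mask advertised by the PCI transport. */
    DEFINE_PROP_BIT("indirect-sg", VirtIOPCIProxy, host_features,
                    VIRTIO_RING_F_INDIRECT_DESC, true),
    DEFINE_PROP_END_OF_LIST(),
};
```

With something along these lines, a command line such as
`-device virtio-blk-pci,indirect-sg=off` would disable indirect
descriptors for one device instance without virtio-blk.c ever being
involved, which is the separation Anthony is arguing for.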
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
On Mon, 10 Aug 2009 07:17:54 pm Gerd Hoffmann wrote:
> On 08/10/09 08:55, Amit Shah wrote:
> > > Bad example. Quite a lot of modern device drivers are using
> > > dynamic major/minor numbers because they have proven to be such a
> > > pain in the butt. That's why we have more sophisticated
> > > mechanisms like udev for userspace to make use of.
> >
> > Let me explain how we came to this numbering: we first had support
> > for 'naming' ports, and the names were obtained by userspace
> > programs via an ioctl. Rusty suggested using a numbering scheme
> > where some ports could exist at predefined locations, so that we
> > wouldn't need the naming and the ioctls around it.
>
> I think the naming is very important.

I disagree. If you can hand out names, you can hand out numbers.
Whether the guest chooses to put that number in sysfs or make it a
minor, I don't care.

Rusty.
Re: [Qemu-devel] Re: virtio-serial: An interface for host-guest communication
Rusty Russell wrote:
> On Mon, 10 Aug 2009 07:17:54 pm Gerd Hoffmann wrote:
> > [...]
> > I think the naming is very important.
>
> I disagree. If you can hand out names, you can hand out numbers.

The problem with handing out names is that there has to be someone to
hand things out. And even if you have a good hander-outer, development
is difficult in a distributed environment because you may have folks
using your code before you've gotten an official hand-out.

A better discovery mechanism is based on something that piggybacks on
another authority. For instance, reverse fully qualified domain names
work well. UUIDs tend to work pretty well too, although they're not
perfect. In general, even just open strings can work out okay, given
that people are responsible in how they name things.

Regards,

Anthony Liguori