Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Thu, Dec 24, 2009 at 11:09:39AM -0600, Anthony Liguori wrote: On 12/23/2009 05:42 PM, Ira W. Snyder wrote:

I've got a single PCI Host (master) with ~20 PCI slots. Physically, it is a backplane in a cPCI chassis, but the form factor is irrelevant. It is regular PCI from a software perspective. Into this backplane, I plug up to 20 PCI Agents (slaves). They are powerpc computers, almost identical to the Freescale MPC8349EMDS board. They're full-featured powerpc computers, with CPU, RAM, etc. They can run standalone.

I want to use the PCI backplane as a data transport. Specifically, I want to transport ethernet over the backplane, so I can have the powerpc boards mount their rootfs via NFS, etc. Everyone knows how to write network daemons. It is a good and very well known way to transport data between systems.

On the PCI bus, the powerpc systems expose 3 PCI BARs. The size is configurable, as is the memory location at which they point. What I cannot do is get notified when a read/write hits the BAR. There is a feature on the board which allows me to generate interrupts in either direction: agent-to-master (PCI INTX) and master-to-agent (via an MMIO register). The PCI vendor ID and device ID are not configurable.

One thing I cannot assume is that the PCI master system is capable of performing DMA. In my system, it is a Pentium3-class x86 machine, which has no DMA engine. However, the PowerPC systems do have DMA engines. In virtio terms, it was suggested to make the powerpc systems the virtio hosts (running the backends) and make the x86 (PCI master) the virtio guest (running virtio-net, etc.).

IMHO, virtio and vbus are both the wrong model for what you're doing. The key reason why is that virtio and vbus are generally designed around the concept that there is shared cache coherent memory from which you can use lock-less ring queues to implement efficient I/O. In your architecture, you do not have cache coherent shared memory.
Instead, you have two systems connected via a PCI backplane with non-coherent shared memory. You probably need to use the shared memory as a bounce buffer and implement a driver on top of that.

I'm not sure what you're suggesting in the paragraph above. I want to use virtio-net as the transport, I do not want to write my own virtual-network driver. Can you please clarify?

virtio-net and vbus are going to be overly painful for you to use because neither end can access arbitrary memory in the other end.

The PCI Agents (powerpc's) can access the lowest 4GB of the PCI Master's memory. Not all at the same time, but I have a 1GB movable window into PCI address space. I hunch Kyle's setup is similar. I've proved that virtio can work via my crossed-wires driver, hooking two virtio-net's together. With a proper in-kernel backend, I think the issues would be gone, and things would work great. Hopefully that explains what I'm trying to do. I'd love someone to help guide me in the right direction here. I want something to fill this need in mainline.

If I were you, I would write a custom network driver. virtio-net is awfully small (just a few hundred lines). I'd use that as a basis but I would not tie into virtio or vbus. The paradigms don't match.

This is exactly what I did first. I proposed it for mainline, and David Miller shot it down, saying: you're creating your own virtualization scheme, use virtio instead. Arnd Bergmann is maintaining a driver out-of-tree for some IBM Cell boards which is very similar, IIRC. In my driver, I used the PCI Agent's PCI BARs to contain ring descriptors. The PCI Agent actually handles all data transfer (via the onboard DMA engine). It works great. I'll gladly post it if you'd like to see it. In my driver, I had to use a 64K MTU to get acceptable performance. I'm not entirely sure how to implement a driver that can handle scatter/gather (fragmented skb's). It clearly isn't that easy to tune a network driver for good performance.
For reference, my crossed-wires virtio drivers achieved excellent performance (10x better than my custom driver) with a 1500 byte MTU. I've been contacted separately by 10+ people also looking for a similar solution. I hunch most of them end up doing what I did: write a quick-and-dirty network driver. I've been working on this for a year, just to give an idea.

The whole architecture of having multiple heterogeneous systems on a common high speed backplane is what IBM refers to as hybrid computing. It's a model that I think will become a lot more common in the future. I think there are typically two types of hybrid models depending on whether the memory sharing is cache coherent or not. If you have coherent shared memory, the problem looks an awful lot like virtualization. If you don't have coherent shared memory, then the shared memory basically becomes a pool to bounce into and out of. Let's
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Wed, Dec 23, 2009 at 12:34:44PM -0500, Gregory Haskins wrote: On 12/23/09 1:15 AM, Kyle Moffett wrote: On Tue, Dec 22, 2009 at 12:36, Gregory Haskins gregory.hask...@gmail.com wrote: On 12/22/09 2:57 AM, Ingo Molnar wrote: * Gregory Haskins gregory.hask...@gmail.com wrote: Actually, these patches have nothing to do with the KVM folks. [...] That claim is curious to me - the AlacrityVM host It's quite simple, really. These drivers support accessing vbus, and vbus is hypervisor agnostic. In fact, vbus isn't necessarily even hypervisor related. It may be used anywhere where a Linux kernel is the io backend, which includes hypervisors like AlacrityVM, but also userspace apps, and interconnected physical systems as well. The vbus-core on the backend, and the drivers on the frontend operate completely independent of the underlying hypervisor. A glue piece called a connector ties them together, and any hypervisor specific details are encapsulated in the connector module. In this case, the connector surfaces to the guest side as a pci-bridge, so even that is not hypervisor specific per se. It will work with any pci-bridge that exposes a compatible ABI, which conceivably could be actual hardware. This is actually something that is of particular interest to me. I have a few prototype boards right now with programmable PCI-E host/device links on them; one of my long-term plans is to finagle vbus into providing multiple virtual devices across that single PCI-E interface. Specifically, I want to be able to provide virtual NIC(s), serial ports and serial consoles, virtual block storage, and possibly other kinds of interfaces. My big problem with existing virtio right now (although I would be happy to be proven wrong) is that it seems to need some sort of out-of-band communication channel for setting up devices, not to mention it seems to need one PCI device per virtual device. Greg, thanks for CC'ing me. Hello Kyle, I've got a similar situation here. 
I've got many PCI agents (devices) plugged into a PCI backplane. I want to use the network to communicate from the agents to the PCI master (host system). At the moment, I'm using a custom driver, heavily based on the PCINet driver posted on the linux-netdev mailing list. David Miller rejected this approach, and suggested I use virtio instead.

My first approach with virtio was to create a crossed-wires driver, which connected two virtio-net drivers together. While this worked, it doesn't support feature negotiation properly, and so it was scrapped. You can find this posted on linux-netdev with the title virtio-over-PCI.

I started writing a virtio-phys layer which creates the appropriate distinction between frontend (guest driver) and backend (kvm, qemu, etc.). This effort has been put on hold for lack of time, and because there is no example code which shows how to create an interface from virtio rings to TUN/TAP. The vhost-net driver is supposed to fill this role, but I haven't seen any test code for that either. The developers haven't been especially helpful answering questions like: how would I use vhost-net with a DMA engine? (You'll quickly find that you must use DMA to transfer data across PCI. AFAIK, CPUs cannot do burst accesses to the PCI bus. I get a 10+ times speedup using DMA.)

The virtio-phys work is mostly lacking a backend for virtio-net. It is still incomplete, but at least devices can be registered, etc. It is available at: http://www.mmarray.org/~iws/virtio-phys/

Another thing you'll notice about virtio-net (and vbus' venet) is that they DO NOT specify endianness. This means that they cannot be used with a big-endian guest and a little-endian host, or vice versa. This means they will not work in certain QEMU setups today.

Another problem with virtio is that you'll need to invent your own bus model. QEMU/KVM has their bus model, lguest uses a different one, and s390 uses yet another, IIRC. At least vbus provides a standardized bus model.
All in all, I've written a lot of virtio code, and it has pretty much all been shot down. It isn't very encouraging.

So I would love to be able to port something like vbus to my nifty PCI hardware and write some backend drivers... then my PCI-E connected systems would dynamically provide a list of highly-efficient virtual devices to each other, with only one 4-lane PCI-E bus.

I've written some IOQ test code, all of which is posted on the alacrityvm-devel mailing list. If we can figure out how to make IOQ use the proper ioread32()/iowrite32() accessors for accessing ioremap()ed PCI BARs, then I can pretty easily write the rest of a vbus-phys connector.

Hi Kyle, We indeed have others that are doing something similar. I have CC'd Ira who may be able to provide you more details. I would also point you at the canonical example for what you would need to write to tie your systems together. It's the null
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Wed, Dec 23, 2009 at 09:09:21AM -0600, Anthony Liguori wrote: On 12/23/2009 12:15 AM, Kyle Moffett wrote:

This is actually something that is of particular interest to me. I have a few prototype boards right now with programmable PCI-E host/device links on them; one of my long-term plans is to finagle vbus into providing multiple virtual devices across that single PCI-E interface. Specifically, I want to be able to provide virtual NIC(s), serial ports and serial consoles, virtual block storage, and possibly other kinds of interfaces. My big problem with existing virtio right now (although I would be happy to be proven wrong) is that it seems to need some sort of out-of-band communication channel for setting up devices, not to mention it seems to need one PCI device per virtual device.

We've been thinking about doing a virtio-over-IP mechanism such that you could remote the entire virtio bus to a separate physical machine. virtio-over-IB is probably more interesting since you can make use of RDMA. virtio-over-PCI-e would work just as well.

I didn't know you were interested in this as well. See my later reply to Kyle for a lot of code that I've written with this in mind.

virtio is a layered architecture. Device enumeration/discovery sits at a lower level than the actual device ABIs. The device ABIs are implemented on top of a bulk data transfer API. The reason for this layering is so that we can reuse PCI as an enumeration/discovery mechanism. This tremendously simplifies porting drivers to other OSes and lets us use PCI hotplug automatically. We get integration into all the fancy userspace hotplug support for free.

But both virtio-lguest and virtio-s390 use in-band enumeration and discovery since they do not have support for PCI on either platform. I'm interested in the same thing, just over PCI. The only PCI agent systems I've used are not capable of manipulating the PCI configuration space in such a way that virtio-pci is usable on them.
This means creating your own enumeration mechanism. Which sucks. See my virtio-phys code (http://www.mmarray.org/~iws/virtio-phys/) for an example of how I did it. It was modeled on lguest. Help is appreciated. Ira -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On Wed, Dec 23, 2009 at 04:58:37PM -0600, Anthony Liguori wrote: On 12/23/2009 01:54 PM, Ira W. Snyder wrote: On Wed, Dec 23, 2009 at 09:09:21AM -0600, Anthony Liguori wrote: I didn't know you were interested in this as well. See my later reply to Kyle for a lot of code that I've written with this in mind. BTW, in the future, please CC me or CC virtualizat...@lists.linux-foundation.org. Or certainly k...@vger. I never looked at the virtio-over-pci patchset although I've heard it referenced before. Will do. I wouldn't think k...@vger would be on-topic. I'm not interested in KVM (though I do use it constantly, it is great). I'm only interested in using virtio as a transport between physical systems. Is it a place where discussing virtio by itself is on-topic? But both virtio-lguest and virtio-s390 use in-band enumeration and discovery since they do not have support for PCI on either platform. I'm interested in the same thing, just over PCI. The only PCI agent systems I've used are not capable of manipulating the PCI configuration space in such a way that virtio-pci is usable on them. virtio-pci is the wrong place to start if you want to use a PCI *device* as the virtio bus. virtio-pci is meant to use the PCI bus as the virtio bus. That's a very important requirement for us because it maintains the relationship of each device looking like a normal PCI device. This means creating your own enumeration mechanism. Which sucks. I don't think it sucks. The idea is that we don't want to unnecessarily reinvent things. Of course, the key feature of virtio is that it makes it possible for you to create your own enumeration mechanism if you're so inclined. See my virtio-phys code (http://www.mmarray.org/~iws/virtio-phys/) for an example of how I did it. It was modeled on lguest. Help is appreciated. If it were me, I'd take a much different approach. I would use a very simple device with a single transmit and receive queue. 
I'd create a standard header, and then implement a command protocol on top of it. You'll be able to support zero copy I/O (although you'll have a fixed number of outstanding requests). You would need a single large ring. But then again, I have no idea what your requirements are. You could probably get far treating the thing as a network device and just doing ATAoE or something like that.

I've got a single PCI Host (master) with ~20 PCI slots. Physically, it is a backplane in a cPCI chassis, but the form factor is irrelevant. It is regular PCI from a software perspective. Into this backplane, I plug up to 20 PCI Agents (slaves). They are powerpc computers, almost identical to the Freescale MPC8349EMDS board. They're full-featured powerpc computers, with CPU, RAM, etc. They can run standalone.

I want to use the PCI backplane as a data transport. Specifically, I want to transport ethernet over the backplane, so I can have the powerpc boards mount their rootfs via NFS, etc. Everyone knows how to write network daemons. It is a good and very well known way to transport data between systems.

On the PCI bus, the powerpc systems expose 3 PCI BARs. The size is configurable, as is the memory location at which they point. What I cannot do is get notified when a read/write hits the BAR. There is a feature on the board which allows me to generate interrupts in either direction: agent-to-master (PCI INTX) and master-to-agent (via an MMIO register). The PCI vendor ID and device ID are not configurable.

One thing I cannot assume is that the PCI master system is capable of performing DMA. In my system, it is a Pentium3-class x86 machine, which has no DMA engine. However, the PowerPC systems do have DMA engines. In virtio terms, it was suggested to make the powerpc systems the virtio hosts (running the backends) and make the x86 (PCI master) the virtio guest (running virtio-net, etc.). I'm not sure what you're suggesting in the paragraph above.
I want to use virtio-net as the transport, I do not want to write my own virtual-network driver. Can you please clarify?

Hopefully that explains what I'm trying to do. I'd love someone to help guide me in the right direction here. I want something to fill this need in mainline. I've been contacted separately by 10+ people also looking for a similar solution. I hunch most of them end up doing what I did: write a quick-and-dirty network driver. I've been working on this for a year, just to give an idea.

PS - should I create a new thread on the two mailing lists mentioned above? I don't want to go too far off-topic in an alacrityvm thread. :)

Ira
Re: [Alacrityvm-devel] [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
On Tue, Oct 06, 2009 at 12:58:06PM -0400, Gregory Haskins wrote: Avi Kivity wrote: On 10/06/2009 03:31 PM, Gregory Haskins wrote:

slots would be one implementation, if you can think of others then you'd add them. I'm more interested in *how* you'd add them more than if we would add them. What I am getting at are the logistics of such a beast.

Add alternative ioctls, or have one ioctl with a 'type' field.

For instance, would I have /dev/slots-vas with ioctls for adding slots, and /dev/foo-vas for adding foos? And each one would instantiate a different vas_struct object with its own vas_struct->ops? Or were you thinking of something different.

I think a single /dev/foo is sufficient, unless some of those address spaces are backed by real devices. If you can't, I think it indicates that the whole thing isn't necessary and we're better off with slots and virtual memory.

I'm not sure if we are talking about the same thing yet, but if we are, there are uses of a generalized interface outside of slots/virtual memory (Ira's physical box being a good example).

I'm not sure Ira's case is not best supported by virtual memory.

Perhaps, but there are surely some cases where the memory is not pageable, but is accessible indirectly through some DMA controller. So if we require it to be pageable we will limit the utility of the interface, though admittedly it will probably cover most cases.

The limitation I have is that memory made available from the host system (PCI card) as PCI BAR1 must not be migrated around in memory. I can only change the address decoding to hit a specific physical address. AFAIK, this means it cannot be userspace memory (since the underlying physical page could change, or it could be in swap), and must be allocated with something like __get_free_pages() or dma_alloc_coherent(). This is how all 83xx powerpc boards work, and I'd bet that the 85xx and 86xx boards work almost exactly the same. I can't say anything about non-powerpc boards.
Ira
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: it supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect the virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have done some light benchmarking (with v4); compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the ping benchmark (where there's no TSO), throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

Acked-by: Arnd Bergmann a...@arndb.de
Signed-off-by: Michael S.
Tsirkin m...@redhat.com
---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  475 ++
 drivers/vhost/vhost.c      |  688
 drivers/vhost/vhost.h      |  122
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  101 +++
 11 files changed, 1413 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@ S: Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/

+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig

diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@
 obj-$(CONFIG_HID)		+= hid/
 obj-$(CONFIG_PPC_PS3)		+= ps3/
 obj-$(CONFIG_OF)		+= of/
 obj-$(CONFIG_SSB)		+= ssb/
+obj-$(CONFIG_VHOST_NET)	+= vhost/
 obj-$(CONFIG_VIRTIO)		+= virtio/
 obj-$(CONFIG_VLYNQ)		+= vlynq/
 obj-$(CONFIG_STAGING)		+= staging/

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
new file mode 100644
index 000..d955406
--- /dev/null
+++ b/drivers/vhost/Kconfig
@@ -0,0 +1,11 @@
+config VHOST_NET
+	tristate "Host kernel accelerator for virtio net"
+	depends on NET && EVENTFD
+	---help---
+	  This kernel module can be loaded in host kernel to accelerate
+	  guest networking with virtio_net. Not to be confused with virtio_net
+	  module itself which needs to be loaded in guest kernel.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called vhost_net.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
new file mode 100644
index 000..72dd020
--- /dev/null
+++ b/drivers/vhost/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_VHOST_NET) += vhost_net.o
+vhost_net-y := vhost.o net.o
diff --git
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Sep 24, 2009 at 10:18:28AM +0300, Avi Kivity wrote: On 09/24/2009 12:15 AM, Gregory Haskins wrote:

There are various aspects to designing high-performance virtual devices, such as providing the shortest paths possible between the physical resources and the consumers. Conversely, we also need to ensure that we meet proper isolation/protection guarantees at the same time. What this means is there are various aspects to any high-performance PV design that need to be placed in-kernel to maximize the performance yet properly isolate the guest. For instance, you are required to have your signal-path (interrupts and hypercalls), your memory-path (gpa translation), and addressing/isolation model in-kernel to maximize performance.

Exactly. That's what vhost puts into the kernel and nothing more.

Actually, no. Generally, _KVM_ puts those things into the kernel, and vhost consumes them. Without KVM (or something equivalent), vhost is incomplete. One of my goals with vbus is to generalize the something equivalent part here.

I don't really see how vhost and vbus are different here. vhost expects signalling to happen through a couple of eventfds and requires someone to supply them and implement kernel support (if needed). vbus requires someone to write a connector to provide the signalling implementation. Neither will work out-of-the-box when implementing virtio-net over falling dominos, for example.

Vbus accomplishes its in-kernel isolation model by providing a container concept, where objects are placed into this container by userspace. The host kernel enforces isolation/protection by using a namespace to identify objects that is only relevant within a specific container's context (namely, a u32 dev-id). The guest addresses the objects by its dev-id, and the kernel ensures that the guest can't access objects outside of its dev-id namespace.

vhost manages to accomplish this without any kernel support.
No, vhost manages to accomplish this because of KVM's kernel support (ioeventfd, etc). Without KVM-like in-kernel support, vhost is merely a kind of tuntap-like clone signalled by eventfds.

Without a vbus-connector-falling-dominos, vbus-venet can't do anything either. Both vhost and vbus need an interface, vhost's is just narrower since it doesn't do configuration or enumeration.

This goes directly to my rebuttal of your claim that vbus places too much in the kernel. I state that, one way or the other, address decode and isolation _must_ be in the kernel for performance. Vbus does this with a devid/container scheme. vhost+virtio-pci+kvm does it with pci+pio+ioeventfd.

vbus doesn't do kvm guest address decoding for the fast path. It's still done by ioeventfd. The guest simply has no access to any vhost resources other than the guest-host doorbell, which is handed to the guest outside vhost (so it's somebody else's problem, in userspace).

You mean _controlled_ by userspace, right? Obviously, the other side of the kernel still needs to be programmed (ioeventfd, etc). Otherwise, vhost would be pointless: e.g. just use vanilla tuntap if you don't need fast in-kernel decoding.

Yes (though for something like level-triggered interrupts we're probably keeping it in userspace, enjoying the benefits of the vhost data path while paying more for signalling).

All that is required is a way to transport a message with a devid attribute as an address (such as DEVCALL(devid)) and the framework provides the rest of the decode+execute function.

vhost avoids that.

No, it doesn't avoid it. It just doesn't specify how it's done, and relies on something else to do it on its behalf.

That someone else can be in userspace, apart from the actual fast path.

Conversely, vbus specifies how it's done, but not how to transport the verb across the wire. That is the role of the vbus-connector abstraction.
So again, vbus does everything in the kernel (since it's so easy and cheap) but expects a vbus-connector. vhost does configuration in userspace (since it's so clunky and fragile) but expects a couple of eventfds. Contrast this to vhost+virtio-pci (called simply vhost from here). It's the wrong name. vhost implements only the data path. Understood, but vhost+virtio-pci is what I am contrasting, and I use vhost for short from that point on because I am too lazy to type the whole name over and over ;) If you #define A A+B+C don't expect intelligent conversation afterwards. It is not immune to requiring in-kernel addressing support either, but rather it just does it differently (and its not as you might expect via qemu). Vhost relies on QEMU to render PCI objects to the guest, which the guest assigns resources (such
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Tue, Sep 22, 2009 at 12:43:36PM +0300, Avi Kivity wrote: On 09/22/2009 12:43 AM, Ira W. Snyder wrote:

Sure, virtio-ira and he is on his own to make a bus-model under that, or virtio-vbus + vbus-ira-connector to use the vbus framework. Either model can work, I agree.

Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and virtio-s390. It isn't especially easy. I can steal lots of code from the lguest bus model, but sometimes it is good to generalize, especially after the fourth implementation or so. I think this is what GHaskins tried to do.

Yes. vbus is more finely layered so there is less code duplication. The virtio layering was more or less dictated by Xen which doesn't have shared memory (it uses grant references instead). As a matter of fact lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that part is duplicated. It's probably possible to add a virtio-shmem.ko library that people who do have shared memory can reuse.

Seems like a nice benefit of vbus.

I've given it some thought, and I think that running vhost-net (or similar) on the ppc boards, with virtio-net on the x86 crate server, will work. The virtio-ring abstraction is almost good enough to work for this situation, but I had to re-invent it to work with my boards. I've exposed a 16K region of memory as PCI BAR1 from my ppc board. Remember that this is the host system. I used each 4K block as a device descriptor which contains: 1) the type of device, config space, etc. for virtio 2) the desc table (virtio memory descriptors, see virtio-ring) 3) the avail table (available entries in the desc table)

Won't access from x86 be slow to this memory (on the other hand, if you change it to main memory, access from ppc will be slow)... really depends on how your system is tuned.

Writes across the bus are fast, reads across the bus are slow. These are just the descriptor tables for memory buffers, not the physical memory buffers themselves.
These only need to be written by the guest (x86), and read by the host (ppc). The host never changes the tables, so we can cache a copy in the guest, for a fast detach_buf() implementation (see virtio-ring, which I'm copying the design from). The only accesses are writes across the PCI bus. There is never a need to do a read (except for slow-path configuration). Parts 2 and 3 are repeated three times, to allow for a maximum of three virtqueues per device. This is good enough for all current drivers. The plan is to switch to multiqueue soon. Will not affect you if your boards are uniprocessor or small smp. Everything I have is UP. I don't need extreme performance, either. 40MB/sec is the minimum I need to reach, though I'd like to have some headroom. For reference, using the CPU to handle data transfers, I get ~2MB/sec transfers. Using the DMA engine, I've hit about 60MB/sec with my crossed-wires virtio-net. I've gotten plenty of email about this from lots of interested developers. There are people who would like this kind of system to just work, while having to write just some glue for their device, just like a network driver. I hunch most people have created some proprietary mess that basically works, and left it at that. So long as you keep the system-dependent features hookable or configurable, it should work. So, here is a desperate cry for help. I'd like to make this work, and I'd really like to see it in mainline. I'm trying to give back to the community from which I've taken plenty. Not sure who you're crying for help to. Once you get this working, post patches. If the patches are reasonably clean and don't impact performance for the main use case, and if you can show the need, I expect they'll be merged. In the spirit of post early and often, I'm making my code available, that's all. I'm asking anyone interested for some review, before I have to re-code this for about the fifth time now. 
I'm trying to avoid Haskins' situation, where he's invented and debugged a lot of new code, and then been told to do it completely differently. Yes, the code I posted is only compile-tested, because quite a lot of code (kernel and userspace) must be working before anything works at all. I hate to design the whole thing, then be told that something fundamental about it is wrong, and have to completely re-write it. Thanks for the comments, Ira -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Wed, Sep 16, 2009 at 11:11:57PM -0400, Gregory Haskins wrote: Avi Kivity wrote: On 09/16/2009 10:22 PM, Gregory Haskins wrote: Avi Kivity wrote: On 09/16/2009 05:10 PM, Gregory Haskins wrote: If kvm can do it, others can. The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package. In the kernel. IMO that's the wrong place for it. 3) in-kernel: You can do something like virtio-net to vhost to potentially meet some of the requirements, but not all. In order to fully meet (3), you would need to do some of that stuff you mentioned in the last reply with muxing device-nr/reg-nr. In addition, we need to have a facility for mapping eventfds and establishing a signaling mechanism (like PIO+qid), etc. KVM does this with IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be invented. irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted. Not per se, but it needs to be interfaced. How do I register that eventfd with the fastpath in Ira's rig? How do I signal the eventfd (x86-to-ppc, and ppc-to-x86)? Sorry to reply so late to this thread, I've been on vacation for the past week. If you'd like to continue in another thread, please start it and CC me. On the PPC, I've got a hardware doorbell register which generates 30 distinguishable interrupts over the PCI bus. I have outbound and inbound registers, which can be used to signal the other side. I assume it isn't too much code to signal an eventfd in an interrupt handler. I haven't gotten to this point in the code yet. To take it to the next level, how do I organize that mechanism so that it works for more than one IO-stream (e.g. address the various queues within ethernet or a different device like the console)? KVM has IOEVENTFD and IRQFD managed with MSI and PIO.
This new rig does not have the luxury of an established IO paradigm. Is vbus the only way to implement a solution? No. But it is _a_ way, and it's one that was specifically designed to solve this very problem (as well as others). (As an aside, note that you generally will want an abstraction on top of irqfd/eventfd like shm-signal or virtqueues to do shared-memory based event mitigation, but I digress. That is a separate topic). To meet performance, this stuff has to be in kernel and there has to be a way to manage it. and management belongs in userspace. vbus does not dictate where the management must be. It's an extensible framework, governed by what you plug into it (ala connectors and devices). For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD and DEVDROP hotswap events into the interrupt stream, because they are simple and we already needed the interrupt stream anyway for fast-path. As another example: venet chose to put ->call(MACQUERY) config-space into its ->call() namespace because it's simple, and we already need ->call()s for fastpath. It therefore exports an attribute to sysfs that allows the management app to set it. I could likewise have designed the connector or device-model differently so as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI userspace) but this seems silly to me when they are so trivial, so I didn't. Since vbus was designed to do exactly that, this is what I would advocate. You could also reinvent these concepts and put your own mux and mapping code in place, in addition to all the other stuff that vbus does. But I am not clear why anyone would want to. Maybe they like their backward compatibility and Windows support. This is really not relevant to this thread, since we are talking about Ira's hardware. But if you must bring this up, then I will reiterate that you just design the connector to interface with QEMU+PCI and you have that too if that was important to you.
But on that topic: Since you could consider KVM a motherboard manufacturer of sorts (it just happens to be virtual hardware), I don't know why KVM seems to consider itself the only motherboard manufacturer in the world that has to make everything look legacy. If a company like ASUS wants to add some cutting edge IO controller/bus, they simply do it. Pretty much every product release may contain a different array of devices, many of which are not backwards compatible with any prior silicon. The guy/gal installing Windows on that system may see a ? in device-manager until they load a driver that supports the new chip, and subsequently it works. It is certainly not a requirement to make said chip somehow work with existing drivers/facilities on bare metal, per se. Why should virtual systems be different? So, yeah, the
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote: On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote: On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: this supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have done some light benchmarking (with v4); compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

Hello Michael, I've started looking at vhost with the intention of using it over PCI to connect physical machines together.
The part that I am struggling with the most is figuring out which parts of the rings are in the host's memory, and which parts are in the guest's memory. All rings are in guest's memory, to match existing virtio code. Ok, this makes sense. vhost assumes that the memory space of the hypervisor userspace process covers the whole of guest memory. Is this necessary? Why? The assumption seems very wrong when you're doing data transport between two physical systems via PCI. I know vhost has not been designed for this specific situation, but it is good to be looking toward other possible uses. And there's a translation table. Ring addresses are userspace addresses, they do not undergo translation. If I understand everything correctly, the rings are all userspace addresses, which means that they can be moved around in physical memory, and get pushed out to swap. Unless they are locked, yes. AFAIK, this is impossible to handle when connecting two physical systems, you'd need the rings available in IO memory (PCI memory), so you can ioreadXX() them instead. To the best of my knowledge, I shouldn't be using copy_to_user() on an __iomem address. Also, having them migrate around in memory would be a bad thing. Also, I'm having trouble figuring out how the packet contents are actually copied from one system to the other. Could you point this out for me? The code in net/packet/af_packet.c does it when vhost calls sendmsg. Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible to make this use a DMA engine instead? I know this was suggested in an earlier thread. Is there somewhere I can find the userspace code (kvm, qemu, lguest, etc.) code needed for interacting with the vhost misc device so I can get a better idea of how userspace is supposed to work? Look in archives for k...@vger.kernel.org. the subject is qemu-kvm: vhost net. (Features negotiation, etc.) That's not yet implemented as there are no features yet. 
I'm working on tap support, which will add a feature bit. Overall, qemu does an ioctl to query supported features, and then acks them with another ioctl. I'm also trying to avoid duplicating functionality available elsewhere. So that to check e.g. TSO support, you'd just look at the underlying hardware device you are binding to. Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in the future? I found that this made an enormous improvement in throughput on my virtio-net-to-virtio-net system. Perhaps it isn't needed with vhost-net. Thanks for replying, Ira
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:

What it is: vhost net is a character device that can be used to reduce the number of system calls involved in virtio networking. Existing virtio net code is used in the guest without modification. There's similarity with vringfd, with some differences and reduced scope:
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- supports a memory table and not just an offset (needed for kvm)

Common virtio related code has been put in a separate file vhost.c and can be made into a separate module if/when more backends appear. I used Rusty's lguest.c as the source for developing this part: this supplied me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system call. No assumptions are made on how the guest performs hypercalls. Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect a virtio frontend (configured by userspace) to a backend. The backend could be a network device, or a tun-like device. In this version I only support a raw socket as a backend, which can be bound to e.g. SR-IOV, or to a macvlan device. The backend is also configured by userspace, including vlan/mac etc.

Status: This works for me, and I haven't seen any crashes. I have done some light benchmarking (with v4); compared to userspace, I see improved latency (as I save up to 4 system calls per packet) but not bandwidth/CPU (as TSO and interrupt mitigation are not supported). For the ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy

Hello Michael, I've started looking at vhost with the intention of using it over PCI to connect physical machines together.
The part that I am struggling with the most is figuring out which parts of the rings are in the host's memory, and which parts are in the guest's memory. If I understand everything correctly, the rings are all userspace addresses, which means that they can be moved around in physical memory, and get pushed out to swap. AFAIK, this is impossible to handle when connecting two physical systems, you'd need the rings available in IO memory (PCI memory), so you can ioreadXX() them instead. To the best of my knowledge, I shouldn't be using copy_to_user() on an __iomem address. Also, having them migrate around in memory would be a bad thing. Also, I'm having trouble figuring out how the packet contents are actually copied from one system to the other. Could you point this out for me? Is there somewhere I can find the userspace code (kvm, qemu, lguest, etc.) code needed for interacting with the vhost misc device so I can get a better idea of how userspace is supposed to work? (Features negotiation, etc.) Thanks, Ira

Acked-by: Arnd Bergmann a...@arndb.de
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 MAINTAINERS                |   10 +
 arch/x86/kvm/Kconfig       |    1 +
 drivers/Makefile           |    1 +
 drivers/vhost/Kconfig      |   11 +
 drivers/vhost/Makefile     |    2 +
 drivers/vhost/net.c        |  475 ++
 drivers/vhost/vhost.c      |  688
 drivers/vhost/vhost.h      |  122
 include/linux/Kbuild       |    1 +
 include/linux/miscdevice.h |    1 +
 include/linux/vhost.h      |  101 +++
 11 files changed, 1413 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vhost/Kconfig
 create mode 100644 drivers/vhost/Makefile
 create mode 100644 drivers/vhost/net.c
 create mode 100644 drivers/vhost/vhost.c
 create mode 100644 drivers/vhost/vhost.h
 create mode 100644 include/linux/vhost.h

diff --git a/MAINTAINERS b/MAINTAINERS
index b1114cf..de4587f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5431,6 +5431,16 @@ S: Maintained
 F: Documentation/filesystems/vfat.txt
 F: fs/fat/
+VIRTIO HOST (VHOST)
+P: Michael S. Tsirkin
+M: m...@redhat.com
+L: kvm@vger.kernel.org
+L: virtualizat...@lists.osdl.org
+L: net...@vger.kernel.org
+S: Maintained
+F: drivers/vhost/
+F: include/linux/vhost.h
+
 VIA RHINE NETWORK DRIVER
 M: Roger Luethi r...@hellgate.ch
 S: Maintained
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index b84e571..94f44d9 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -64,6 +64,7 @@ config KVM_AMD
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
+source drivers/vhost/Kconfig
 source drivers/lguest/Kconfig
 source drivers/virtio/Kconfig
diff --git a/drivers/Makefile b/drivers/Makefile
index bc4205d..1551ae1 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_HID) += hid/
 obj-$(CONFIG_PPC_PS3)
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Wed, Aug 19, 2009 at 08:40:33AM +0300, Avi Kivity wrote: On 08/19/2009 03:38 AM, Ira W. Snyder wrote: On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote: On 08/18/2009 11:59 PM, Ira W. Snyder wrote: On a non shared-memory system (where the guest's RAM is not just a chunk of userspace RAM in the host system), virtio's management model seems to fall apart. Feature negotiation doesn't work as one would expect. In your case, virtio-net on the main board accesses PCI config space registers to perform the feature negotiation; software on your PCI cards needs to trap these config space accesses and respond to them according to virtio ABI. Is this real PCI (physical hardware) or fake PCI (software PCI emulation) that you are describing? Real PCI. The host (x86, PCI master) must use real PCI to actually configure the boards, enable bus mastering, etc. Just like any other PCI device, such as a network card. On the guests (ppc, PCI agents) I cannot add/change PCI functions (the last .[0-9] in the PCI address) nor can I change PCI BAR's once the board has started. I'm pretty sure that would violate the PCI spec, since the PCI master would need to re-scan the bus, and re-assign addresses, which is a task for the BIOS. Yes. Can the boards respond to PCI config space cycles coming from the host, or is the config space implemented in silicon and immutable? (reading on, I see the answer is no). virtio-pci uses the PCI config space to configure the hardware. Yes, the PCI config space is implemented in silicon. I can change a few things (mostly PCI BAR attributes), but not much. (There's no real guest on your setup, right? just a kernel running on an x86 system and other kernels running on the PCI cards?) Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's (PCI agents) also run Linux (booted via U-Boot). They are independent Linux systems, with a physical PCI interconnect. The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n.
Linux's PCI stack does bad things as a PCI agent. It always assumes it is a PCI master. It is possible for me to enable CONFIG_PCI=y on the ppc's by removing the PCI bus from their list of devices provided by OpenFirmware. They can not access PCI via normal methods. PCI drivers cannot work on the ppc's, because Linux assumes it is a PCI master. To the best of my knowledge, I cannot trap configuration space accesses on the PCI agents. I haven't needed that for anything I've done thus far. Well, if you can't do that, you can't use virtio-pci on the host. You'll need another virtio transport (equivalent to fake pci you mentioned above). Ok. Is there something similar that I can study as an example? Should I look at virtio-pci? This does appear to be solved by vbus, though I haven't written a vbus-over-PCI implementation, so I cannot be completely sure. Even if virtio-pci doesn't work out for some reason (though it should), you can write your own virtio transport and implement its config space however you like. This is what I did with virtio-over-PCI. The way virtio-net negotiates features makes this work non-intuitively. I think you tried to take two virtio-nets and make them talk together? That won't work. You need the code from qemu to talk to virtio-net config space, and vhost-net to pump the rings. It *is* possible to make two unmodified virtio-net's talk together. I've done it, and it is exactly what the virtio-over-PCI patch does. Study it and you'll see how I connected the rx/tx queues together. The feature negotiation code also works, but in a very unintuitive manner. I made it work in the virtio-over-PCI patch, but the devices are hardcoded into the driver. It would be quite a bit of work to swap virtio-net and virtio-console, for example. I'm not at all clear on how to get feature negotiation to work on a system like mine. From my study of lguest and kvm (see below) it looks like userspace will need to be involved, via a miscdevice. I don't see why. 
Is the kernel on the PCI cards in full control of all accesses? I'm not sure what you mean by this. Could you be more specific? This is a normal, unmodified vanilla Linux kernel running on the PCI agents. I meant, does board software implement the config space accesses issued from the host, and it seems the answer is no. In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote an algorithm to pair the tx/rx queues together. Since virtio-net pre-fills its rx queues with buffers, I was able to use the DMA engine to copy from the tx queue into the pre-allocated memory in the rx queue. Please find a name other than virtio-over-PCI since it conflicts with virtio-pci. You're tunnelling virtio config cycles (which are usually done on pci config cycles) on a new protocol which
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Wed, Aug 19, 2009 at 06:37:06PM +0300, Avi Kivity wrote: On 08/19/2009 06:28 PM, Ira W. Snyder wrote: Well, if you can't do that, you can't use virtio-pci on the host. You'll need another virtio transport (equivalent to fake pci you mentioned above). Ok. Is there something similar that I can study as an example? Should I look at virtio-pci? There's virtio-lguest, virtio-s390, and virtio-vbus. I think you tried to take two virtio-nets and make them talk together? That won't work. You need the code from qemu to talk to virtio-net config space, and vhost-net to pump the rings. It *is* possible to make two unmodified virtio-net's talk together. I've done it, and it is exactly what the virtio-over-PCI patch does. Study it and you'll see how I connected the rx/tx queues together. Right, crossing the cables works, but feature negotiation is screwed up, and both sides think the data is in their RAM. vhost-net doesn't do negotiation and doesn't assume the data lives in its address space. Yes, that is exactly what I did: crossed the cables (in software). I'll take a closer look at vhost-net now, and make sure I understand how it works. Please find a name other than virtio-over-PCI since it conflicts with virtio-pci. You're tunnelling virtio config cycles (which are usually done on pci config cycles) on a new protocol which is itself tunnelled over PCI shared memory. Sorry about that. Do you have suggestions for a better name? virtio-$yourhardware, or maybe virtio-dma. How about virtio-phys? Arnd and BenH are both looking at PPC systems (similar to mine). Grant Likely is looking at talking to a processor core running on an FPGA, IIRC. Most of the code can be shared, very little should need to be board-specific, I hope. I called it virtio-over-PCI in my previous postings to LKML, so until a new patch is written and posted, I'll keep referring to it by the name used in the past, so people can search for it.
When I post virtio patches, should I CC another mailing list in addition to LKML? virtualizat...@lists.linux-foundation.org is virtio's home. That said, I'm not sure how qemu-system-ppc running on x86 could possibly communicate using virtio-net. This would mean the guest is an emulated big-endian PPC, while the host is a little-endian x86. I haven't actually tested this situation, so perhaps I am wrong. I'm confused now. You don't actually have any guest, do you, so why would you run qemu at all? I do not run qemu. I am just stating a problem with virtio-net that I noticed. This is just so someone more knowledgeable can be aware of the problem. The x86 side only needs to run virtio-net, which is present in RHEL 5.3. You'd only need to run virtio-tunnel or however it's called. All the eventfd magic takes place on the PCI agents. I can upgrade the kernel to anything I want on both the x86 and ppc's. I'd like to avoid changing the x86 (RHEL5) userspace, though. On the ppc's, I have full control over the userspace environment. You don't need any userspace on virtio-net's side. Your ppc boards emulate a virtio-net device, so all you need is the virtio-net module (and virtio bindings). If you choose to emulate, say, an e1000 card all you'd need is the e1000 driver. Thanks for the replies. Ira
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tue, Aug 18, 2009 at 11:46:06AM +0300, Michael S. Tsirkin wrote: On Mon, Aug 17, 2009 at 04:17:09PM -0400, Gregory Haskins wrote: Michael S. Tsirkin wrote: On Mon, Aug 17, 2009 at 10:14:56AM -0400, Gregory Haskins wrote: Case in point: Take an upstream kernel and you can modprobe the vbus-pcibridge in and virtio devices will work over that transport unmodified. See http://lkml.org/lkml/2009/8/6/244 for details. The modprobe you are talking about would need to be done in guest kernel, correct? Yes, and your point is? unmodified (pardon the pseudo pun) modifies virtio, not guest. It means you can take an off-the-shelf kernel with off-the-shelf virtio (ala distro-kernel) and modprobe vbus-pcibridge and get alacrityvm acceleration. Heh, by that logic ksplice does not modify running kernel either :) It is not a design goal of mine to forbid the loading of a new driver, so I am ok with that requirement. OTOH, Michael's patch is purely targeted at improving virtio-net on kvm, and it's likewise constrained by various limitations of that decision (such as its reliance on the PCI model, and the kvm memory scheme). vhost is actually not related to PCI in any way. It simply leaves all setup for userspace to do. And the memory scheme was intentionally separated from kvm so that it can easily support e.g. lguest. I think you have missed my point. I mean that vhost requires a separate bus-model (ala qemu-pci). So? That can be in userspace, and can be anything including vbus. And no, your memory scheme is not separated, at least, not very well. It still assumes memory-regions and copy_to_user(), which is very kvm-esque. I don't think so: works for lguest, kvm, UML and containers Vbus has people using things like userspace containers (no regions), vhost by default works without regions and physical hardware (dma controllers, so no regions or copy_to_user) so your scheme quickly falls apart once you get away from KVM. Someone took a driver and is building hardware for it ...
so what? I think Greg is referring to something like my virtio-over-PCI patch. I'm pretty sure that vhost is completely useless for my situation. I'd like to see vhost work for my use, so I'll try to explain what I'm doing. I've got a system where I have about 20 computers connected via PCI. The PCI master is a normal x86 system, and the PCI agents are PowerPC systems. The PCI agents act just like any other PCI card, except they are running Linux, and have their own RAM and peripherals. I wrote a custom driver which imitated a network interface and a serial port. I tried to push it towards mainline, and DavidM rejected it, with the argument, use virtio, don't add another virtualization layer to the kernel. I think he has a decent argument, so I wrote virtio-over-PCI. Now, there are some things about virtio that don't work over PCI. Mainly, memory is not truly shared. It is extremely slow to access memory that is far away, meaning across the PCI bus. This can be worked around by using a DMA controller to transfer all data, along with an intelligent scheme to perform only writes across the bus. If you're careful, reads are never needed. So, in my system, copy_(to|from)_user() is completely wrong. There is no userspace, only a physical system. In fact, because normal x86 computers do not have DMA controllers, the host system doesn't actually handle any data transfer! I used virtio-net in both the guest and host systems in my example virtio-over-PCI patch, and succeeded in getting them to communicate. However, the lack of any setup interface means that the devices must be hardcoded into both drivers, when the decision could be up to userspace. I think this is a problem that vbus could solve. For my own selfish reasons (I don't want to maintain an out-of-tree driver) I'd like to see *something* useful in mainline Linux. I'm happy to answer questions about my setup, just ask. 
Ira
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tue, Aug 18, 2009 at 07:51:21PM +0300, Avi Kivity wrote: On 08/18/2009 06:53 PM, Ira W. Snyder wrote: So, in my system, copy_(to|from)_user() is completely wrong. There is no userspace, only a physical system. In fact, because normal x86 computers do not have DMA controllers, the host system doesn't actually handle any data transfer! In fact, modern x86s do have dma engines these days (google for Intel I/OAT), and one of our plans for vhost-net is to allow their use for packets above a certain size. So a patch allowing vhost-net to optionally use a dma engine is a good thing. Yes, I'm aware that very modern x86 PCs have general purpose DMA engines, even though I don't have any capable hardware. However, I think it is better to support using any PC (with or without DMA engine, any architecture) as the PCI master, and just handle all the DMA from the PCI agent, which is known to have DMA. I used virtio-net in both the guest and host systems in my example virtio-over-PCI patch, and succeeded in getting them to communicate. However, the lack of any setup interface means that the devices must be hardcoded into both drivers, when the decision could be up to userspace. I think this is a problem that vbus could solve. Exposing a knob to userspace is not an insurmountable problem; vhost-net already allows changing the memory layout, for example. Let me explain the most obvious problem I ran into: setting the MAC addresses used in virtio. On the host (PCI master), I want eth0 (virtio-net) to get a random MAC address. On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC address, aa:bb:cc:dd:ee:ff. The virtio feature negotiation code handles this, by seeing the VIRTIO_NET_F_MAC feature in its configuration space. If BOTH drivers do not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC address. This is because the feature negotiation code only accepts a feature if it is offered by both sides of the connection.
In this case, I must have the guest generate a random MAC address and have the host put aa:bb:cc:dd:ee:ff into the guest's configuration space. This basically means hardcoding the MAC addresses in the Linux drivers, which is a big no-no. What would I expose to userspace to make this situation manageable? Thanks for the response, Ira
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tue, Aug 18, 2009 at 08:47:04PM +0300, Avi Kivity wrote: On 08/18/2009 08:27 PM, Ira W. Snyder wrote: In fact, modern x86s do have dma engines these days (google for Intel I/OAT), and one of our plans for vhost-net is to allow their use for packets above a certain size. So a patch allowing vhost-net to optionally use a dma engine is a good thing. Yes, I'm aware that very modern x86 PCs have general purpose DMA engines, even though I don't have any capable hardware. However, I think it is better to support using any PC (with or without DMA engine, any architecture) as the PCI master, and just handle all the DMA from the PCI agent, which is known to have DMA. Certainly; but if your PCI agent will support the DMA API, then the same vhost code will work with both I/OAT and your specialized hardware. Yes, that's true. My ppc is a Freescale MPC8349EMDS. It has a Linux DMAEngine driver in mainline, which I've used. That's excellent. Exposing a knob to userspace is not an insurmountable problem; vhost-net already allows changing the memory layout, for example. Let me explain the most obvious problem I ran into: setting the MAC addresses used in virtio. On the host (PCI master), I want eth0 (virtio-net) to get a random MAC address. On the guest (PCI agent), I want eth0 (virtio-net) to get a specific MAC address, aa:bb:cc:dd:ee:ff. The virtio feature negotiation code handles this, by seeing the VIRTIO_NET_F_MAC feature in its configuration space. If BOTH drivers do not have VIRTIO_NET_F_MAC set, then NEITHER will use the specified MAC address. This is because the feature negotiation code only accepts a feature if it is offered by both sides of the connection. In this case, I must have the guest generate a random MAC address and have the host put aa:bb:cc:dd:ee:ff into the guest's configuration space. This basically means hardcoding the MAC addresses in the Linux drivers, which is a big no-no. What would I expose to userspace to make this situation manageable?
I think in this case you want one side to be virtio-net (I'm guessing the x86) and the other side vhost-net (the ppc boards with the dma engine). virtio-net on x86 would communicate with userspace on the ppc board to negotiate features and get a mac address, the fast path would be between virtio-net and vhost-net (which would use the dma engine to push and pull data). Ah, that seems backwards, but it should work after vhost-net learns how to use the DMAEngine API. I haven't studied vhost-net very carefully yet. As soon as I saw the copy_(to|from)_user() I stopped reading, because it seemed useless for my case. I'll look again and try to find where vhost-net supports setting MAC addresses and other features. Also, in my case I'd like to boot Linux with my rootfs over NFS. Is vhost-net capable of this? I've had Arnd, BenH, and Grant Likely (and others, privately) contact me about devices they are working with that would benefit from something like virtio-over-PCI. I'd like to see vhost-net be merged with the capability to support my use case. There are plenty of others that would benefit, not just myself. I'm not sure vhost-net is being written with this kind of future use in mind. I'd hate to see it get merged, and then have to change the ABI to support physical-device-to-device usage. It would be better to keep future use in mind now, rather than try and hack it in later. Thanks for the comments. Ira -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tue, Aug 18, 2009 at 09:52:48PM +0300, Avi Kivity wrote: On 08/18/2009 09:27 PM, Ira W. Snyder wrote: I think in this case you want one side to be virtio-net (I'm guessing the x86) and the other side vhost-net (the ppc boards with the dma engine). virtio-net on x86 would communicate with userspace on the ppc board to negotiate features and get a mac address, the fast path would be between virtio-net and vhost-net (which would use the dma engine to push and pull data). Ah, that seems backwards, but it should work after vhost-net learns how to use the DMAEngine API. I haven't studied vhost-net very carefully yet. As soon as I saw the copy_(to|from)_user() I stopped reading, because it seemed useless for my case. I'll look again and try to find where vhost-net supports setting MAC addresses and other features. It doesn't; all it does is pump the rings, leaving everything else to userspace. Ok. On a non shared-memory system (where the guest's RAM is not just a chunk of userspace RAM in the host system), virtio's management model seems to fall apart. Feature negotiation doesn't work as one would expect. This does appear to be solved by vbus, though I haven't written a vbus-over-PCI implementation, so I cannot be completely sure. I'm not at all clear on how to get feature negotiation to work on a system like mine. From my study of lguest and kvm (see below) it looks like userspace will need to be involved, via a miscdevice. Also, in my case I'd like to boot Linux with my rootfs over NFS. Is vhost-net capable of this? It's just another network interface. You'd need an initramfs though to contain the needed userspace. Ok. I'm using an initramfs already, so adding some more userspace to it isn't a problem. I've had Arnd, BenH, and Grant Likely (and others, privately) contact me about devices they are working with that would benefit from something like virtio-over-PCI. I'd like to see vhost-net be merged with the capability to support my use case. 
There are plenty of others that would benefit, not just myself. I'm not sure vhost-net is being written with this kind of future use in mind. I'd hate to see it get merged, and then have to change the ABI to support physical-device-to-device usage. It would be better to keep future use in mind now, rather than try and hack it in later. Please review and comment then. I'm fairly confident there won't be any ABI issues since vhost-net does so little outside pumping the rings. Ok. I thought I should at least express my concerns while we're discussing this, rather than being too late after finding the time to study the driver. Off the top of my head, I would think that transporting userspace addresses in the ring (for copy_(to|from)_user()) vs. physical addresses (for DMAEngine) might be a problem. Pinning userspace pages into memory for DMA is a bit of a pain, though it is possible. There is also the problem of different endianness between host and guest in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h) defines fields in host byte order. Which totally breaks if the guest has a different endianness. This is a virtio-net problem though, and is not transport specific. Note the signalling paths go through eventfd: when vhost-net wants the other side to look at its ring, it tickles an eventfd which is supposed to trigger an interrupt on the other side. Conversely, when another eventfd is signalled, vhost-net will look at the ring and process any data there. You'll need to wire your signalling to those eventfds, either in userspace or in the kernel. Ok. I've never used eventfd before, so that'll take yet more studying. I've browsed over both the kvm and lguest code, and it looks like they each re-invent a mechanism for transporting interrupts between the host and guest, using eventfd. They both do this by implementing a miscdevice, which is basically their management interface. 
See drivers/lguest/lguest_user.c (see write() and LHREQ_EVENTFD) and kvm-kmod-devel-88/x86/kvm_main.c (see kvm_vm_ioctl(), called via kvm_dev_ioctl()) for how they hook up eventfds. I can now imagine how two userspace programs (host and guest) could work together to implement a management interface, including hotplug of devices, etc. Of course, this would basically reinvent the vbus management interface in a specific driver. I think this is partly what Greg is trying to abstract out into generic code. I haven't studied the actual data transport mechanisms in vbus, though I have studied virtio's transport mechanism. I think a generic management interface for virtio might be a good thing to consider, because it seems there are at least two implementations already: kvm and lguest. Thanks for answering my questions. It helps to talk with someone more familiar with the issues than I am. Ira
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Tue, Aug 18, 2009 at 11:57:48PM +0300, Michael S. Tsirkin wrote: On Tue, Aug 18, 2009 at 08:53:29AM -0700, Ira W. Snyder wrote: I think Greg is referring to something like my virtio-over-PCI patch. I'm pretty sure that vhost is completely useless for my situation. I'd like to see vhost work for my use, so I'll try to explain what I'm doing. I've got a system where I have about 20 computers connected via PCI. The PCI master is a normal x86 system, and the PCI agents are PowerPC systems. The PCI agents act just like any other PCI card, except they are running Linux, and have their own RAM and peripherals. I wrote a custom driver which imitated a network interface and a serial port. I tried to push it towards mainline, and DavidM rejected it, with the argument, "use virtio, don't add another virtualization layer to the kernel." I think he has a decent argument, so I wrote virtio-over-PCI. Now, there are some things about virtio that don't work over PCI. Mainly, memory is not truly shared. It is extremely slow to access memory that is far away, meaning across the PCI bus. This can be worked around by using a DMA controller to transfer all data, along with an intelligent scheme to perform only writes across the bus. If you're careful, reads are never needed. So, in my system, copy_(to|from)_user() is completely wrong. There is no userspace, only a physical system. Can guests do DMA to random host memory? Or is there some kind of IOMMU and DMA API involved? If the latter, then note that you'll still need some kind of driver for your device. The question we need to ask ourselves then is whether this driver can reuse bits from vhost. Mostly. All of my systems are 32-bit (both x86 and ppc). From the ppc's point of view (and the DMAEngine's), only the first 1GB of host memory is visible. This limited view is due to address space limitations on the ppc. The view of PCI memory must live somewhere in the ppc address space, along with the ppc's SDRAM, flash, and other peripherals. 
Since this is a 32-bit processor, I only have 4GB of address space to work with. The PCI address space could be up to 4GB in size. If I tried to allow the ppc boards to view all 4GB of PCI address space, then they would have no address space left for their onboard SDRAM, etc. Hopefully that makes sense. I use dma_set_mask(dev, DMA_BIT_MASK(30)) on the host system to ensure that when dma_map_sg() is called, it returns addresses that can be accessed directly by the device. The DMAEngine can access any local (ppc) memory without any restriction. I have used the Linux DMAEngine API (include/linux/dmaengine.h) to handle all data transfer across the PCI bus. The Intel I/OAT (and many others) use the same API. In fact, because normal x86 computers do not have DMA controllers, the host system doesn't actually handle any data transfer! Is it true that PPC has to initiate all DMA then? How do you manage not to do DMA reads then? Yes, the ppc initiates all DMA. It handles all data transfer (both reads and writes) across the PCI bus, for speed reasons. A CPU cannot create burst transactions on the PCI bus. This is the reason that most (all?) network cards (as a familiar example) use DMA to transfer packet contents into RAM. Sorry if I made a confusing statement ("no reads are necessary") earlier. What I meant to say was: if you are very careful, it is not necessary for the CPU to do any reads over the PCI bus to maintain state. Writes are the only necessary CPU-initiated transaction. I implemented this in my virtio-over-PCI patch, copying as much as possible from the virtio vring structure. The descriptors in the rings are only changed by one side of the connection, therefore they can be cached as they are written (via the CPU) across the PCI bus, with the knowledge that both sides will have a consistent view. I'm sorry, this is hard to explain via email. It is much easier in a room with a whiteboard. 
:) I used virtio-net in both the guest and host systems in my example virtio-over-PCI patch, and succeeded in getting them to communicate. However, the lack of any setup interface means that the devices must be hardcoded into both drivers, when the decision could be up to userspace. I think this is a problem that vbus could solve. What you describe (passing setup from host to guest) seems like a feature that guest devices need to support. It seems unlikely that vbus, being a transport layer, can address this. I think I explained this poorly as well. Virtio needs two things to function: 1) a set of descriptor rings (1 or more) 2) a way to kick each ring. With the amount of space available in the ppc's PCI BAR's (which point at a small chunk of SDRAM), I could potentially make ~6 virtqueues + 6 kick interrupts available. Right now, my virtio-over-PCI driver hardcoded the first and second virtqueues to be for virtio-net only, and nothing else. What if the user wanted 2
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Wed, Aug 19, 2009 at 12:26:23AM +0300, Avi Kivity wrote: On 08/18/2009 11:59 PM, Ira W. Snyder wrote: On a non shared-memory system (where the guest's RAM is not just a chunk of userspace RAM in the host system), virtio's management model seems to fall apart. Feature negotiation doesn't work as one would expect. In your case, virtio-net on the main board accesses PCI config space registers to perform the feature negotiation; software on your PCI cards needs to trap these config space accesses and respond to them according to virtio ABI. Is this real PCI (physical hardware) or fake PCI (software PCI emulation) that you are describing? The host (x86, PCI master) must use real PCI to actually configure the boards, enable bus mastering, etc. Just like any other PCI device, such as a network card. On the guests (ppc, PCI agents) I cannot add/change PCI functions (the last .[0-9] in the PCI address) nor can I change PCI BAR's once the board has started. I'm pretty sure that would violate the PCI spec, since the PCI master would need to re-scan the bus, and re-assign addresses, which is a task for the BIOS. (There's no real guest on your setup, right? Just a kernel running on an x86 system and other kernels running on the PCI cards?) Yes, the x86 (PCI master) runs Linux (booted via PXELinux). The ppc's (PCI agents) also run Linux (booted via U-Boot). They are independent Linux systems, with a physical PCI interconnect. The x86 has CONFIG_PCI=y, however the ppc's have CONFIG_PCI=n. Linux's PCI stack does bad things as a PCI agent. It always assumes it is a PCI master. It is possible for me to enable CONFIG_PCI=y on the ppc's by removing the PCI bus from their list of devices provided by OpenFirmware. They cannot access PCI via normal methods. PCI drivers cannot work on the ppc's, because Linux assumes it is a PCI master. To the best of my knowledge, I cannot trap configuration space accesses on the PCI agents. I haven't needed that for anything I've done thus far. 
This does appear to be solved by vbus, though I haven't written a vbus-over-PCI implementation, so I cannot be completely sure. Even if virtio-pci doesn't work out for some reason (though it should), you can write your own virtio transport and implement its config space however you like. This is what I did with virtio-over-PCI. The way virtio-net negotiates features makes this work non-intuitively. I'm not at all clear on how to get feature negotiation to work on a system like mine. From my study of lguest and kvm (see below) it looks like userspace will need to be involved, via a miscdevice. I don't see why. Is the kernel on the PCI cards in full control of all accesses? I'm not sure what you mean by this. Could you be more specific? This is a normal, unmodified vanilla Linux kernel running on the PCI agents. Ok. I thought I should at least express my concerns while we're discussing this, rather than being too late after finding the time to study the driver. Off the top of my head, I would think that transporting userspace addresses in the ring (for copy_(to|from)_user()) vs. physical addresses (for DMAEngine) might be a problem. Pinning userspace pages into memory for DMA is a bit of a pain, though it is possible. Oh, the ring doesn't transport userspace addresses. It transports guest addresses, and it's up to vhost to do something with them. Currently vhost supports two translation modes: 1. virtio address == host virtual address (using copy_to_user) 2. virtio address == offsetted host virtual address (using copy_to_user) The latter mode is used for kvm guests (with multiple offsets, skipping some details). I think you need to add a third mode, virtio address == host physical address (using dma engine). Once you do that, and wire up the signalling, things should work. Ok. In my virtio-over-PCI patch, I hooked two virtio-net's together. I wrote an algorithm to pair the tx/rx queues together. 
Since virtio-net pre-fills its rx queues with buffers, I was able to use the DMA engine to copy from the tx queue into the pre-allocated memory in the rx queue. I have an intuitive idea about how I think vhost-net works in this case. There is also the problem of different endianness between host and guest in virtio-net. The struct virtio_net_hdr (include/linux/virtio_net.h) defines fields in host byte order. Which totally breaks if the guest has a different endianness. This is a virtio-net problem though, and is not transport specific. Yeah. You'll need to add byteswaps. I wonder if Rusty would accept a new feature: VIRTIO_F_NET_LITTLE_ENDIAN, which would allow the virtio-net driver to use LE for all of its multi-byte fields. I don't think the transport should have to care about the endianness. I've browsed over both the kvm and lguest code, and it looks like they each re-invent a mechanism for transporting interrupts between
Re: [Alacrityvm-devel] [PATCH v3 3/6] vbus: add a vbus-proxy bus model for vbus_driver objects
On Wed, Aug 19, 2009 at 01:06:45AM +0300, Avi Kivity wrote: On 08/19/2009 12:26 AM, Avi Kivity wrote: Off the top of my head, I would think that transporting userspace addresses in the ring (for copy_(to|from)_user()) vs. physical addresses (for DMAEngine) might be a problem. Pinning userspace pages into memory for DMA is a bit of a pain, though it is possible. Oh, the ring doesn't transport userspace addresses. It transports guest addresses, and it's up to vhost to do something with them. Currently vhost supports two translation modes: 1. virtio address == host virtual address (using copy_to_user) 2. virtio address == offsetted host virtual address (using copy_to_user) The latter mode is used for kvm guests (with multiple offsets, skipping some details). I think you need to add a third mode, virtio address == host physical address (using dma engine). Once you do that, and wire up the signalling, things should work. You don't in fact need a third mode. You can mmap the x86 address space into your ppc userspace and use the second mode. All you need then is the dma engine glue and byte swapping. Hmm, I'll have to think about that. The ppc is a 32-bit processor, so it has 4GB of address space for everything, including PCI, SDRAM, flash memory, and all other peripherals. This is exactly like 32-bit x86, where you cannot have a PCI card that exposes a 4GB PCI BAR. The system would have no address space left for its own SDRAM. On my x86 computers, I only have 1GB of physical RAM, and so the ppc's have plenty of room in their address spaces to map the entire x86 RAM into their own address space. That is exactly what I do now. Accesses to ppc physical address 0x8000 magically hit x86 physical address 0x0. Ira
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 07:03:22PM +0200, Arnd Bergmann wrote: On Monday 10 August 2009, Michael S. Tsirkin wrote:

+struct workqueue_struct *vhost_workqueue;

[nitpicking] This could be static.

+/* The virtqueue structure describes a queue attached to a device. */
+struct vhost_virtqueue {
+	struct vhost_dev *dev;
+
+	/* The actual ring of buffers. */
+	struct mutex mutex;
+	unsigned int num;
+	struct vring_desc __user *desc;
+	struct vring_avail __user *avail;
+	struct vring_used __user *used;
+	struct file *kick;
+	struct file *call;
+	struct file *error;
+	struct eventfd_ctx *call_ctx;
+	struct eventfd_ctx *error_ctx;
+
+	struct vhost_poll poll;
+
+	/* The routine to call when the Guest pings us, or timeout. */
+	work_func_t handle_kick;
+
+	/* Last available index we saw. */
+	u16 last_avail_idx;
+
+	/* Last index we used. */
+	u16 last_used_idx;
+
+	/* Outstanding buffers */
+	unsigned int inflight;
+
+	/* Is this blocked? */
+	bool blocked;
+
+	struct iovec iov[VHOST_NET_MAX_SG];
+
+} cacheline_aligned;

We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities. I have no comments about the vhost code itself, I haven't reviewed it. It might be interesting to try using a virtio-net in the host kernel to communicate with the virtio-net running in the guest kernel. The lack of a management interface is the biggest problem you will face (setting MAC addresses, negotiating features, etc. doesn't work intuitively). Getting the network interfaces talking is relatively easy. 
Ira
Re: [PATCH 2/2] vhost_net: a kernel-level virtio server
On Wed, Aug 12, 2009 at 08:31:04PM +0300, Michael S. Tsirkin wrote: On Wed, Aug 12, 2009 at 10:19:22AM -0700, Ira W. Snyder wrote: [ snip out code ] We discussed this before, and I still think this could be directly derived from struct virtqueue, in the same way that vring_virtqueue is derived from struct virtqueue. That would make it possible for simple device drivers to use the same driver in both host and guest, similar to how Ira Snyder used virtqueues to make virtio_net run between two hosts running the same code [1]. Ideally, I guess you should be able to even make virtio_net work in the host if you do that, but that could bring other complexities. I have no comments about the vhost code itself, I haven't reviewed it. It might be interesting to try using a virtio-net in the host kernel to communicate with the virtio-net running in the guest kernel. The lack of a management interface is the biggest problem you will face (setting MAC addresses, negotiating features, etc. doesn't work intuitively). That was one of the reasons I decided to move most of the code out to userspace. My kernel driver only handles the datapath; it's much smaller than virtio-net. Getting the network interfaces talking is relatively easy. Ira

Tried this, but:
- guest memory isn't pinned, so copy_to_user to access it; errors need to be handled in a sane way
- used/available roles are reversed
- kick/interrupt roles are reversed

So most of the code then looks like:

	if (host) {
	} else {
	}
	return;

The only common part is walking the descriptor list, but that's like 10 lines of code. At which point it's better to keep host/guest code separate, IMO. Ok, that makes sense. Let me see if I understand the concept of the driver. Here's a picture of what makes sense to me:

	guest system
	---------------------------------
	| userspace applications        |
	---------------------------------
	| kernel network stack          |
	---------------------------------
	| virtio-net                    |
	---------------------------------
	| transport (virtio-ring, etc.) |
	---------------------------------
	               |
	---------------------------------
	| transport (virtio-ring, etc.) |
	---------------------------------
	| some driver (maybe vhost?)    | <-- [1]
	---------------------------------
	| kernel network stack          |
	---------------------------------
	host system

From the host's network stack, packets can be forwarded out to the physical network, or be consumed by a normal userspace application on the host. Just as if this were any other network interface. In my patch, [1] was the virtio-net driver, completely unmodified. So, does this patch accomplish the above diagram? If so, why the copy_to_user(), etc? Maybe I'm confusing this with my system, where the guest is another physical system, separated by the PCI bus. Ira
Re: [PATCH 0/7] AlacrityVM guest drivers
On Thu, Aug 06, 2009 at 10:29:08AM -0600, Gregory Haskins wrote: On 8/6/2009 at 11:40 AM, in message 200908061740.04276.a...@arndb.de, Arnd Bergmann a...@arndb.de wrote: On Thursday 06 August 2009, Gregory Haskins wrote: [ big snip ] 3. The ioq method seems to be the real core of your work that makes venet perform better than virtio-net with its virtqueues. I don't see any reason to doubt that your claim is correct. My conclusion from this would be to add support for ioq to virtio devices, alongside virtqueues, but to leave out the extra bus_type and probing method. While I appreciate the sentiment, I doubt that is actually what's helping here. There are a variety of factors that I poured into venet/vbus that I think contribute to its superior performance. However, the difference in the ring design I do not think is one of them. In fact, in many ways I think Rusty's design might turn out to be faster if put side by side, because he was much more careful with cacheline alignment than I was. Also note that I was careful to not pick one ring vs the other ;) They both should work. IMO, the virtio vring design is very well thought out. I found it relatively easy to port to a host+blade setup, and run virtio-net over a physical PCI bus, connecting two physical CPUs. IMO, we are only looking at the tip of the iceberg when looking at this purely as the difference between virtio-pci vs virtio-vbus, or venet vs virtio-net. Really, the big thing I am working on here is the host-side device model. The idea here was to design a bus model that was conducive to high-performance, software-to-software IO that would work in a variety of environments (that may or may not have PCI). KVM is one such environment, but I also have people looking at building other types of containers, and even physical systems (host+blade kind of setups). The idea is that the connector is modular, and then something like virtio-net or venet just works: in kvm, in the userspace container, on the blade system. 
It provides a management infrastructure that (hopefully) makes sense for these different types of containers, regardless of whether they have PCI, QEMU, etc (e.g. things that are inherent to KVM, but not others). I hope this helps to clarify the project :) I think this is the major benefit of vbus. I've only started studying the vbus code, so I don't have lots to say yet. The overview of the management interface makes it look pretty good. Getting two virtio-net drivers hooked together in my virtio-over-PCI patches was nasty. If you read the thread that followed, you'll see the lack of a management interface as a concern of mine. It was basically decided that it could come later. The configfs interface vbus provides is pretty nice, IMO. Just my two cents, Ira