Re: [PATCH] kvm: fast-path msi injection with irqfd
On 11/19/2010 at 10:54 AM, in message 20101119155427.ga20...@amt.cnet, Marcelo Tosatti mtosa...@redhat.com wrote: On Thu, Nov 18, 2010 at 07:09:08PM +0200, Michael S. Tsirkin wrote: Store the irq routing table pointer in the irqfd object, and use that to inject MSI directly without bouncing out to a kernel thread. While we touch this structure, rearrange the irqfd fields so the fastpath is better packed for better cache utilization. This also adds some comments about locking rules and RCU usage in the code. Some notes on the design: - Use a pointer into the routing table instead of copying an entry, to make it possible to use RCU, thus side-stepping locking complexities. We also save some memory this way. - The old workqueue code is still used for level irqs. I don't think we DTRT with level anyway; however, it seems easier to keep the code around, as it has been thought through and debugged, and fix level later than to rip it out and re-instate it later. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- OK, this seems to work fine for me. Tested with virtio-net in guest, with and without vhost-net. Pls review/apply if appropriate. Acked-by: Marcelo Tosatti mtosa...@redhat.com Acked-by: Gregory Haskins ghask...@novell.com
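For illustration, a minimal sketch of the fast path described above. The names (irqfd->irq_entry, irqfd_inject_fastpath) are assumed for clarity rather than taken from the patch; the real code wires this into the eventfd wakeup callback:

static void
irqfd_inject_fastpath(struct _irqfd *irqfd)
{
	struct kvm_kernel_irq_routing_entry *irq;

	rcu_read_lock();
	irq = rcu_dereference(irqfd->irq_entry);
	if (irq)
		/* MSI: inject directly, no bounce to a kernel thread */
		kvm_set_msi(irq, irqfd->kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1);
	else
		/* level irq, or routing just changed under us: fall back
		 * to the workqueue path, which is known to be correct */
		schedule_work(&irqfd->inject);
	rcu_read_unlock();
}

When userspace updates the routing table, the cached pointers are republished with rcu_assign_pointer() and readers are waited out with synchronize_rcu(); that is what lets the fast path run lockless.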
Re: [PATCH] kvm: fix irqfd assign/deassign race
On 9/19/2010 at 01:02 PM, in message 20100919170231.ga12...@redhat.com, Michael S. Tsirkin m...@redhat.com wrote: I think I see the following (theoretical) race: during irqfd assign, we drop the irqfds lock before we schedule inject work. Therefore, deassign running on another CPU could cause shutdown and flush to run before inject, causing a use-after-free in inject. A simple fix is to schedule inject under the lock. I swear there was some reason why the schedule_work() was done outside of the lock, but I can't for the life of me remember why anymore (it obviously was a failing on my part to _comment_ why, if there was such a reason). So, short of recalling what that reason was, and given that Michael's theory seems rational and legit... Acked-by: Gregory Haskins ghask...@novell.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- If the issue is real, this might be a 2.6.36 and -stable candidate. Comments?

 virt/kvm/eventfd.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 66cf65b..c1f1e3c 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -218,7 +218,6 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	events = file->f_op->poll(file, &irqfd->pt);
 
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
-	spin_unlock_irq(&kvm->irqfds.lock);
 
 	/*
 	 * Check if there was an event already pending on the eventfd
@@ -227,6 +226,8 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	if (events & POLLIN)
 		schedule_work(&irqfd->inject);
 
+	spin_unlock_irq(&kvm->irqfds.lock);
+
 	/*
 	 * do not drop the file until the irqfd is fully initialized, otherwise
 	 * we might race against the POLLHUP
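To make the ordering concrete, a sketch of the interleaving the patch closes (timeline assumed from the description above; deassign-side steps abbreviated):

	CPU0 (assign)				CPU1 (deassign)
	-------------------------------		-------------------------------
	list_add_tail(&irqfd->list, ...);
	spin_unlock_irq(&kvm->irqfds.lock);
						shutdown: flush_work() sees
						  no pending work yet
						kfree(irqfd);
	schedule_work(&irqfd->inject);		/* use-after-free */

With the fix, schedule_work() happens before the unlock, so a concurrent deassign (which must take irqfds.lock to find the irqfd) can only run afterward, and its flush then waits for the queued inject to finish before the irqfd is freed.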
Re: [PATCH] kvm: only allow one gsi per fd
On 1/13/2010 at 11:58 AM, in message 20100113165809.ga13...@redhat.com, Michael S. Tsirkin m...@redhat.com wrote: Looks like repeatedly binding the same fd to multiple gsi's with irqfd can use up a ton of kernel memory for irqfd structures. A simple fix is to allow each fd to only trigger one gsi: triggering a storm of interrupts in the guest is likely useless anyway, and we can do it by binding a single gsi to many interrupts if we really want to. Signed-off-by: Michael S. Tsirkin m...@redhat.com Seems reasonable to me. Acked-by: Gregory Haskins ghask...@novell.com --- This patch is IMO a good candidate for 2.6.33 and 2.6.32.x.

 virt/kvm/eventfd.c | 15 +--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..62e4cd9 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -166,7 +166,7 @@ irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
 static int
 kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 {
-	struct _irqfd *irqfd;
+	struct _irqfd *irqfd, *tmp;
 	struct file *file = NULL;
 	struct eventfd_ctx *eventfd = NULL;
 	int ret;
@@ -203,9 +203,20 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
 	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
 
+	spin_lock_irq(&kvm->irqfds.lock);
+
+	ret = 0;
+	list_for_each_entry(tmp, &kvm->irqfds.items, list) {
+		if (irqfd->eventfd != tmp->eventfd)
+			continue;
+		/* This fd is used for another irq already. */
+		ret = -EBUSY;
+		spin_unlock_irq(&kvm->irqfds.lock);
+		goto fail;
+	}
+
 	events = file->f_op->poll(file, &irqfd->pt);
 
-	spin_lock_irq(&kvm->irqfds.lock);
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/27/09 4:15 AM, Avi Kivity wrote: On 12/23/2009 11:21 PM, Gregory Haskins wrote: That said, you are still incorrect. With what I proposed, the model will run as an in-kernel vbus device, and no longer run in userspace. It would therefore improve virtio-net as I stated, much in the same way vhost-net or venet-tap do today. That can't work. virtio-net has its own ABI on top of virtio; for example, it prepends a header for TSO information. Maybe if you disable all features it becomes compatible with venet, but that cripples it. You are confused. The backend would be virtio-net specific, and would therefore understand the virtio-net ABI. It would support any feature of virtio-net as long as it was implemented and negotiated by both sides of the link. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/27/09 4:33 AM, Avi Kivity wrote: On 12/24/2009 11:36 AM, Gregory Haskins wrote: As a twist on this, the VMware paravirt driver interface is so hardware-like that they're getting hardware vendors to supply cards that implement it. Try that with a pure software approach. Any hardware engineer (myself included) will tell you that, generally speaking, what you can do in hardware you can do in software (think of what QEMU does today, for instance). It's purely a cost/performance tradeoff. I can at least tell you that is true of vbus. Anything on the vbus side would be equally eligible for a hardware implementation, though there is no reason to do this today since we have equivalent functionality in bare metal already. There's a huge difference in the probability of vmware getting cards to their spec, or x86 vendors improving interrupt delivery to guests, compared to vbus being implemented in hardware. That's not relevant, however. I said in the original quote that you snipped that I made it a software design on purpose, and you tried to somehow paint that as a negative because vmware made theirs hardware-like, and you implied it could not be done with my approach with the statement "try that with a pure software approach". And the bottom line is that the statement is incorrect and/or misleading. The only motivation is if you wanted to preserve ABI etc, which is what vmware is presumably after. However, I am not advocating this as necessary at this juncture. Maybe AlacrityVM users don't care about compatibility, but my users do. Again, not relevant to this thread. Making your interface hardware-like buys you nothing in the end, as you ultimately need to load drivers in the guest either way, and any major OS lets you extend both devices and buses with relative ease. The only counter-example would be if you truly were hardware-exact, like e1000 emulation, but we already know that this means it is hardware-centric and not exit-rate aware, and would perform poorly. Otherwise "compatible" is purely a point on the time line (for instance, the moment the virtio-pci ABI shipped), not an architectural description such as "hardware-like".
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/27/09 8:27 AM, Avi Kivity wrote: On 12/27/2009 03:18 PM, Gregory Haskins wrote: On 12/27/09 4:15 AM, Avi Kivity wrote: On 12/23/2009 11:21 PM, Gregory Haskins wrote: That said, you are still incorrect. With what I proposed, the model will run as an in-kernel vbus device, and no longer run in userspace. It would therefore improve virtio-net as I stated, much in the same way vhost-net or venet-tap do today. That can't work. virtio-net has its own ABI on top of virtio; for example, it prepends a header for TSO information. Maybe if you disable all features it becomes compatible with venet, but that cripples it. You are confused. The backend would be virtio-net specific, and would therefore understand the virtio-net ABI. It would support any feature of virtio-net as long as it was implemented and negotiated by both sides of the link. Then we're back to square one. A nice demonstration of vbus flexibility, but no help for virtio. No, where we are is at the point where we demonstrate that your original statement, that I did nothing to improve virtio, was wrong. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/27/09 8:49 AM, Avi Kivity wrote: On 12/27/2009 03:34 PM, Gregory Haskins wrote: On 12/27/09 4:33 AM, Avi Kivity wrote: On 12/24/2009 11:36 AM, Gregory Haskins wrote: As a twist on this, the VMware paravirt driver interface is so hardware-like that they're getting hardware vendors to supply cards that implement it. Try that with a pure software approach. Any hardware engineer (myself included) will tell you that, generally speaking, what you can do in hardware you can do in software (think of what QEMU does today, for instance). It's purely a cost/performance tradeoff. I can at least tell you that is true of vbus. Anything on the vbus side would be equally eligible for a hardware implementation, though there is no reason to do this today since we have equivalent functionality in bare metal already. There's a huge difference in the probability of vmware getting cards to their spec, or x86 vendors improving interrupt delivery to guests, compared to vbus being implemented in hardware. That's not relevant, however. I said in the original quote that you snipped that I made it a software design on purpose, and you tried to somehow paint that as a negative because vmware made theirs hardware-like, and you implied it could not be done with my approach with the statement "try that with a pure software approach". And the bottom line is that the statement is incorrect and/or misleading. It's not incorrect. At the very best it's misleading. VMware stuck to the pci specs, and as a result they can have hardware implement their virtual NIC protocol. For vbus this is much harder (Not really.) to do, since you need a side-channel between different cards to coordinate interrupt delivery. In theory you can do everything if you don't consider practicalities. pci-based designs, such as vmware and virtio-pci, aren't free of this notion either. They simply rely on APIC emulation for the irq-chip, and it just so happens that vbus implements a different irq-chip (more specifically, the connector that we employ between the guest and vbus does). On one hand, you have the advantage of the guest already supporting the irq-chip ABI, and on the other, you have an optimized (e.g. shared-memory based inject/ack) and feature-enhanced ABI (interrupt priority, no IDT constraints, etc). There are pros and cons to either direction, but the vbus project charter is to go for maximum performance and features, so that is acceptable to us. That's a digression, though; I'm not suggesting we'll see virtio hardware or that this is a virtio/pci advantage vs. vbus. It's an anecdote showing that sticking with specs has its advantages. It also has distinct disadvantages. For instance, the PCI spec is gigantic, yet almost none of it is needed to do the job here. When you are talking full-virt, you are left with no choice. With para-virt, you do have a choice, and the vbus-connector for AlacrityVM capitalizes on this. As an example, think about all the work that went into emulating the PCI chipset, the APIC chipset, MSI-X support, irq-routing, etc, when all you needed was a simple event-queue to indicate that an event (e.g. an interrupt) occurred. This is an example connector in vbus: http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/connectors/null.c;h=b6d16cb68b7e49e07528278bc9f5b73e1dac0c2f;hb=HEAD It encapsulates all of hotplug, signal (interrupt) routing, and memory routing for both sides of the link in 584 lines of code.
And that also implicitly brings in device discovery and configuration, since that is covered by the vbus framework. Try doing that with PCI, especially when you are not already under the qemu umbrella, and the standards-based approach suddenly doesn't look very attractive. wrt pci vs vbus, the difference is in the ability to use improvements in interrupt delivery accelerations in virt hardware. Most of which will also apply to the current vbus design as well, since at some point I have to have an underlying IDT mechanism too, btw. If this happens, virtio/pci can immediately take advantage of it, while vbus has to stick with software delivery for backward compatibility, and all that code becomes a useless support burden. The shared-memory path will always be the fastest anyway, so I am not too worried about it. But vbus supports feature negotiation, so we can always phase that out if need be, same as anything else. As an example of what hardware can do when it really sets its mind to it, s390 can IPI from vcpu to vcpu without exiting to the host. Great! I am just not in the practice of waiting for hardware to cover sloppy software. There is a ton of impracticality in doing so, such as the fact that the hardware, even once available, will not be ubiquitous instantly. The only motivation is if you wanted to preserve ABI etc, which is what vmware is presumably after.
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/27/09 8:49 AM, Avi Kivity wrote: On 12/27/2009 03:39 PM, Gregory Haskins wrote: No, where we are is at the point where we demonstrate that your original statement, that I did nothing to improve virtio, was wrong. I stand by it. virtio + your patch does nothing without a ton more work (more or less equivalent to vhost-net). Perhaps, but my work predates vhost-net by months, and that has nothing to do with what we are talking about anyway. Since you snipped your original comment that started the thread, here it is again: On 12/23/09 5:22 AM, Avi Kivity wrote: There was no attempt by Gregory to improve virtio-net. It's not a gray area, nor open to interpretation. That statement was, is, and will always be demonstrably false, so I'm sorry, but you are still wrong. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 3:36 PM, Avi Kivity wrote: On 12/23/2009 06:44 PM, Gregory Haskins wrote: - Are a pure software concept By design. In fact, I would describe it as "software to software optimized", as opposed to trying to shoehorn into something that was designed as a software-to-hardware interface (and therefore has assumptions about the constraints in that environment that are not applicable in software-only). And that's the biggest mistake you can make. Sorry, that is just wrong, or you wouldn't have virtio either. Look at Xen, for instance. They paravirtualized the fork out of everything that moved in order to get x86 virt going. And where are they now? x86_64 syscalls are slow since they have to trap to the hypervisor and (partially) flush the tlb. With npt or ept capable hosts, performance is better for many workloads on fullvirt. And paravirt doesn't support Windows. Their unsung hero Jeremy is still trying to upstream dom0 Xen support. And they get to support it forever. We are only talking about PV-IO here, so it is not apples-to-apples with what Xen is going through. VMware stuck with the hardware defined interfaces. Sure they had to implement binary translation to get there, but as a result, they only have to support one interface, all guests support it, and they can drop it on newer hosts where it doesn't give them anything. Again, you are conflating that with PV-IO. Not relevant here. Afaict, vmware, kvm, xen, etc, all still do PV-IO and likely will for the foreseeable future. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 4:01 PM, Avi Kivity wrote: On 12/23/2009 10:36 PM, Avi Kivity wrote: On 12/23/2009 06:44 PM, Gregory Haskins wrote: - Are a pure software concept By design. In fact, I would describe it as "software to software optimized", as opposed to trying to shoehorn into something that was designed as a software-to-hardware interface (and therefore has assumptions about the constraints in that environment that are not applicable in software-only). And that's the biggest mistake you can make. Look at Xen, for instance. They paravirtualized the fork out of everything that moved in order to get x86 virt going. And where are they now? x86_64 syscalls are slow since they have to trap to the hypervisor and (partially) flush the tlb. With npt or ept capable hosts, performance is better for many workloads on fullvirt. And paravirt doesn't support Windows. Their unsung hero Jeremy is still trying to upstream dom0 Xen support. And they get to support it forever. VMware stuck with the hardware defined interfaces. Sure they had to implement binary translation to get there, but as a result, they only have to support one interface, all guests support it, and they can drop it on newer hosts where it doesn't give them anything. As a twist on this, the VMware paravirt driver interface is so hardware-like that they're getting hardware vendors to supply cards that implement it. Try that with a pure software approach. Any hardware engineer (myself included) will tell you that, generally speaking, what you can do in hardware you can do in software (think of what QEMU does today, for instance). It's purely a cost/performance tradeoff. I can at least tell you that is true of vbus. Anything on the vbus side would be equally eligible for a hardware implementation, though there is no reason to do this today since we have equivalent functionality in bare metal already. The only motivation is if you wanted to preserve ABI etc, which is what vmware is presumably after. However, I am not advocating this as necessary at this juncture. So sorry, your statement is not relevant. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 1:51 AM, Ingo Molnar wrote: * Anthony Liguori anth...@codemonkey.ws wrote: On 12/22/2009 10:01 AM, Bartlomiej Zolnierkiewicz wrote: new e1000 driver is more superior in architecture and do the required work to make the new e1000 driver a full replacement for the old one. Right, like everyone actually does things this way.. I wonder why we have OSS, old Firewire and IDE stacks still around then? And it's always a source of pain, isn't it. Even putting aside the fact that such overlap sucks and is a pain to users (and that 98% of driver and subsystem version transitions are done completely seamlessly to users - the examples that were cited were the odd ones out of 150K commits in the past 4 years, 149K+ of which were seamless), the comparison does not even apply, really. e1000, OSS, old Firewire and IDE are hardware stacks, where hardware is a not fully understood externality, with its inevitable set of compatibility woes. There are often situations where one piece of hardware still works better with the old driver, for some odd (or not so odd) reason. Also, note that the 'new' hw drivers are generally intended and maintained as clear replacements for the old ones, and do so with minimal ABI changes - or preferably with no ABI changes at all. Most driver developers just switch from old to new and the old bits are left around and are phased out. We phased out old OSS recently. That is a very different situation from the AlacrityVM patches, which: - Are a pure software concept By design. In fact, I would describe it as "software to software optimized", as opposed to trying to shoehorn into something that was designed as a software-to-hardware interface (and therefore has assumptions about the constraints in that environment that are not applicable in software-only). and any compatibility mismatch is self-inflicted. ..because the old model is not great for the intended use cases and has issues. I've already covered the reasons why ad nauseam. The patches are in fact breaking the ABI to KVM intentionally (for better or worse). No, at the very worst they are _augmenting_ the ABI, as evident from the fact that AlacrityVM is a superset of the entire KVM ABI. - Gregory claims that the AlacrityVM patches are not a replacement for KVM. I.e. there's no intention to phase out any 'old stuff' There's no reason to phase anything out, except perhaps the virtio-pci transport. This is one more transport, plugging into virtio underneath (just like virtio-pci, virtio-lguest, and virtio-s390). I am not even suggesting that the old transport has to go away, per se. It is the KVM maintainers who insist on it being all or nothing. For me, I do not see the big deal in having one more model option on the qemu cmd-line, but that is just my opinion. If the maintainers were really so adamant that choice is pure evil, I am not sure why we don't see patches for removing everything but one model type in each IO category. But I digress. and it splits the pool of driver developers. ..it is these dumb threads that are splitting driver developers, with ignorant statements, irrelevant numbers, and dubious facts. I actually tried many, many times to ask others to join the effort, and instead _they_ forked off and made vhost-net with a "sorry, not interested in working with you" (and based the design largely on the ideas proposed in my framework, I might add). That's fine, and it's their prerogative. I can easily maintain my project out of tree if upstream is not interested.
But do not turn around and try to blame me for the current state of affairs. i.e. it has all the makings of a stupid, avoidable, permanent fork. The thing is, if AlacrityVM is better, and KVM developers are not willing to fix their stuff, replace KVM with it. It's a bit as if someone found a performance problem with sys_open() and came up with sys_open_v2(), and claimed that he wants to work with the VFS developers while not really doing so, but advances sys_open_v2() all the time. No, it's more like if I suggested sys_open_vbus() to augment sys_open_pci(), sys_open_lguest(), sys_open_s390(), because our fictitious glibc natively supported modular open() implementations. And I tried to work for a very long time convincing the VFS developers that this new transport was a good idea to add because it was optimized for the problem space, made it easy for API callers to reuse the important elements of the design (e.g. built-in tx-mitigation instead of waiting for someone to write it for each device), had new features like the ability to prioritize signals, create/route signal paths arbitrarily, implement raw shared memory for idempotent updates, and didn't require the massive and useless PCI/APIC emulation logic to function like sys_open_pci() (e.g. easier to port around). Ultimately, the VFS developers said I know we let other transports in in the past, but now
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 12:10 PM, Andi Kleen wrote: And it's moot, anyway, as I have already retracted my one outstanding pull request based on Linus' observation. So at this time, I am not advocating _anything_ for upstream inclusion. And I am contemplating _never_ doing so again. It's not worth _this_. That certainly sounds like the wrong reaction. Out of tree drivers are typically a pain to use. Well, to Linus' point, it shouldn't go in until a critical mass of users have expressed desire to see it in tree, which seems reasonable to me. For the admittedly small group that are using it today, modern tools like the opensuse-build-service ease the deployment as a KMP, so that can suffice for now. It's actually what most of the alacrityvm community uses today anyway (as opposed to using a merged tree in the guest). And upstream submission is not always like this! I would think the process would come to a grinding halt if it were ;) Thanks Andi, -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 1:15 AM, Kyle Moffett wrote: On Tue, Dec 22, 2009 at 12:36, Gregory Haskins gregory.hask...@gmail.com wrote: On 12/22/09 2:57 AM, Ingo Molnar wrote: * Gregory Haskins gregory.hask...@gmail.com wrote: Actually, these patches have nothing to do with the KVM folks. [...] That claim is curious to me - the AlacrityVM host It's quite simple, really. These drivers support accessing vbus, and vbus is hypervisor agnostic. In fact, vbus isn't necessarily even hypervisor related. It may be used anywhere a Linux kernel is the io backend, which includes hypervisors like AlacrityVM, but also userspace apps, and interconnected physical systems as well. The vbus-core on the backend, and the drivers on the frontend, operate completely independent of the underlying hypervisor. A glue piece called a "connector" ties them together, and any hypervisor-specific details are encapsulated in the connector module. In this case, the connector surfaces to the guest side as a pci-bridge, so even that is not hypervisor specific per se. It will work with any pci-bridge that exposes a compatible ABI, which conceivably could be actual hardware. This is actually something that is of particular interest to me. I have a few prototype boards right now with programmable PCI-E host/device links on them; one of my long-term plans is to finagle vbus into providing multiple virtual devices across that single PCI-E interface. Specifically, I want to be able to provide virtual NIC(s), serial ports and serial consoles, virtual block storage, and possibly other kinds of interfaces. My big problem with existing virtio right now (although I would be happy to be proven wrong) is that it seems to need some sort of out-of-band communication channel for setting up devices, not to mention it seems to need one PCI device per virtual device. So I would love to be able to port something like vbus to my nifty PCI hardware and write some backend drivers... then my PCI-E connected systems would dynamically provide a list of highly efficient virtual devices to each other, with only one 4-lane PCI-E bus. Hi Kyle, We indeed have others that are doing something similar. I have CC'd Ira, who may be able to provide you more details. I would also point you at the canonical example for what you would need to write to tie your systems together. It's the null connector, which you can find here: http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/connectors/null.c;h=b6d16cb68b7e49e07528278bc9f5b73e1dac0c2f;hb=HEAD Do not hesitate to ask any questions, though you may want to take the conversation to the alacrityvm-devel list so as not to annoy the current CC list any further than I already have ;) Kind Regards, -Greg
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 5:22 AM, Avi Kivity wrote: There was no attempt by Gregory to improve virtio-net. If you truly do not understand why your statement is utterly wrong at this point in the discussion, I feel sorry for you. If you are trying to be purposely disingenuous, you should be ashamed of yourself. In any case, your statement is demonstrably bogus, but you should already know this, given that we talked about it several times. To refresh your memory: http://patchwork.kernel.org/patch/17428/ In case it's not blatantly clear, which I would hope it would be to anyone that understands the problem space: what that patch would do is allow an unmodified virtio-net to bridge to a vbus-based virtio-net backend. (Also note that this predates vhost-net by months (the date in that thread is 4/9/2009) in case you are next going to try to argue that it does nothing over vhost-net.) This would mean that virtio-net would gain most of the benefits I have been advocating (fewer exits, cheaper exits, concurrent execution, etc). So this would very much improve virtio-net indeed, given how poorly the current backend was performing. I tried to convince the team to help me build it out to completion on multiple occasions, but that request was answered with "sorry, we are doing our own thing instead". You can say that you didn't like my approach, since that is a subjective opinion. But to say that I didn't attempt to improve it is flat out wrong, and I do not appreciate it. -Greg
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/23/09 12:52 PM, Peter W. Morreale wrote: On Wed, 2009-12-23 at 13:14 +0100, Andi Kleen wrote: http://www.redhat.com/f/pdf/summit/cwright_11_open_source_virt.pdf See slide 32. This is without vhost-net. Thanks. Do you also have latency numbers? It seems like there's definitely still potential for improvement with messages < 4K. But for the large messages they indeed look rather good. It's unclear what message size the Alacrity numbers used, but I presume it was rather large. No. It was 1500. Please see: http://developer.novell.com/wiki/index.php/AlacrityVM/Results Note: 1500 was the L2 MTU, not necessarily the L3/L4 size, which was probably much larger (though I do not recall what exactly atm). -Greg
Re: [Alacrityvm-devel] [GIT PULL] AlacrityVM guest drivers for 2.6.33
(Sorry for top post...on a mobile) When someone repeatedly makes a claim you believe to be wrong and you correct them, you start to wonder if that person has a less than honorable agenda. In any case, I overreacted. For that, I apologize. That said, you are still incorrect. With what I proposed, the model will run as an in-kernel vbus device, and no longer run in userspace. It would therefore improve virtio-net as I stated, much in the same way vhost-net or venet-tap do today. FYI I am about to log out for the long holiday, so will be unresponsive for a bit. Kind Regards, -Greg On 12/23/09, Avi Kivity a...@redhat.com wrote: On 12/23/2009 08:15 PM, Gregory Haskins wrote: On 12/23/09 5:22 AM, Avi Kivity wrote: There was no attempt by Gregory to improve virtio-net. If you truly do not understand why your statement is utterly wrong at this point in the discussion, I feel sorry for you. If you are trying to be purposely disingenuous, you should be ashamed of yourself. In any case, your statement is demonstrably bogus, but you should already know this, given that we talked about it several times. There's no need to feel sorry for me, thanks. There's no reason for me to be ashamed, either. And there's no need to take the discussion to personal levels. Please keep it technical. To refresh your memory: http://patchwork.kernel.org/patch/17428/ This is not an attempt to improve virtio-net, it's an attempt to push vbus. With this, virtio-net doesn't become any faster, since the greatest bottleneck is not removed; it remains in userspace. If you wanted to improve virtio-net, you would port venet-host to the virtio-net guest/host interface, and port any secret sauce in venet(-guest) to virtio-net. After that we could judge what vbus' contribution to the equation is. In case it's not blatantly clear, which I would hope it would be to anyone that understands the problem space: what that patch would do is allow an unmodified virtio-net to bridge to a vbus-based virtio-net backend. (Also note that this predates vhost-net by months (the date in that thread is 4/9/2009) in case you are next going to try to argue that it does nothing over vhost-net.) Without the backend, it is useless. It demonstrates vbus' flexibility quite well, but does nothing for virtio-net or its users, at least without a lot more work. This would mean that virtio-net would gain most of the benefits I have been advocating (fewer exits, cheaper exits, concurrent execution, etc). So this would very much improve virtio-net indeed, given how poorly the current backend was performing. I tried to convince the team to help me build it out to completion on multiple occasions, but that request was answered with "sorry, we are doing our own thing instead". You can say that you didn't like my approach, since that is a subjective opinion. But to say that I didn't attempt to improve it is flat out wrong, and I do not appreciate it. Cutting down on the rhetoric is more important than cutting down exits at this point in time. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- Sent from my mobile device
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 1:53 PM, Avi Kivity wrote: On 12/22/2009 07:36 PM, Gregory Haskins wrote: Gregory, it would be nice if you worked _much_ harder with the KVM folks before giving up. I think the 5+ months that I politely tried to convince the KVM folks that this was a good idea was pretty generous of my employer. The KVM maintainers have ultimately made it clear they are not interested in directly supporting this concept (which is their prerogative), but are perhaps willing to support the peripheral logic needed to allow it to easily interface with KVM. I can accept that, and thus AlacrityVM was born. Review pointed out locking issues with xinterface which I have not seen addressed. I asked why the irqfd/ioeventfd mechanisms are insufficient, and you did not reply. Yes, I understand. I've been too busy to rework the code for an upstream push. I will certainly address those questions when I make the next attempt, but they weren't relevant to the guest side.
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 1:53 PM, Avi Kivity wrote: I asked why the irqfd/ioeventfd mechanisms are insufficient, and you did not reply. BTW: the ioeventfd issue just fell through the cracks, so sorry about that. Note that I have no specific issue with irqfd ever since the lockless IRQ injection code was added. ioeventfd turned out to be suboptimal for me in the fast path for two reasons: 1) the underlying eventfd is called in atomic context. I had posted patches to Davide to address that limitation, but I believe he rejected them on the grounds that they are only relevant to KVM. 2) it cannot retain the data field passed in the PIO. I wanted to have one vector that could tell me what value was written, and this cannot be expressed in ioeventfd. Based on this, it was a better decision to add an ioevent interface to xinterface. It neatly solves both problems. Kind Regards, -Greg
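A minimal sketch of the distinction, with assumed names (only eventfd_signal() is the real API here):

/* eventfd-based notification: the written value cannot be conveyed.
 * eventfd_signal() just bumps a counter and must be atomic-safe. */
static void
pio_notify_eventfd(struct eventfd_ctx *ctx, u64 written)
{
	eventfd_signal(ctx, 1);		/* 'written' is lost here */
}

/* A data-preserving hook of the general shape an xinterface-style
 * ioevent could take (shape assumed for illustration): the backend
 * sees the actual PIO value, and the callback may run in sleepable
 * (process) context. */
struct ioevent_ops {
	void (*signal)(void *priv, u64 data);	/* may sleep */
};

static void
pio_notify_ioevent(struct ioevent_ops *ops, void *priv, u64 written)
{
	ops->signal(priv, written);		/* value retained */
}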
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 2:32 PM, Gregory Haskins wrote: On 12/22/09 2:25 PM, Avi Kivity wrote: If you're not doing something pretty minor, you're better off waking up a thread (perhaps _sync if you want to keep on the same cpu). With the new user return notifier thingie, that's pretty cheap. We have exploits that take advantage of IO heuristics. When triggered, they do more work in vcpu context than normal, which reduces latency under certain circumstances. But you definitely do _not_ want to do them in-atomic ;) And I almost forgot: dev->call() is an RPC to the backend device. Therefore, it must be synchronous, yet we don't want it locked either. I think that was actually the primary motivation for the change, now that I think about it. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 2:38 PM, Avi Kivity wrote: On 12/22/2009 09:32 PM, Gregory Haskins wrote: xinterface, as it turns out, is a great KVM interface for me and easy to extend, all without conflicting with the changes in upstream. The old way was via the kvm ioctl interface, but that sucked as the ABI was always moving. Where is the problem? ioeventfd still works fine as it is. It means that kvm locking suddenly affects more of the kernel. That's ok. This would only be w.r.t. devices that are bound to the KVM instance anyway, so they had better know what they are doing (and they do). -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 2:43 PM, Avi Kivity wrote: On 12/22/2009 09:41 PM, Gregory Haskins wrote: It means that kvm locking suddenly affects more of the kernel. That's ok. This would only be w.r.t. devices that are bound to the KVM instance anyway, so they had better know what they are doing (and they do). It's okay to the author of that device. It's not okay to the kvm developers who are still evolving the locking and have to handle all devices that use xinterface. Perhaps, but like it or not, if you want to do in-kernel you need to invoke backends. And if you want to invoke backends, limiting it to thread wakeups is, well, limiting. For one, you miss out on that exploit I mentioned earlier, which can help sometimes. Besides, the direction that Marcelo and I left the mmio/pio bus was that it would go lockless eventually, not more lockful ;) Has that changed? I honestly haven't followed what's going on in the io-bus code in a while. -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/22/09 2:39 PM, Davide Libenzi wrote: On Tue, 22 Dec 2009, Gregory Haskins wrote: On 12/22/09 1:53 PM, Avi Kivity wrote: I asked why the irqfd/ioeventfd mechanisms are insufficient, and you did not reply. BTW: the ioeventfd issue just fell through the cracks, so sorry about that. Note that I have no specific issue with irqfd ever since the lockless IRQ injection code was added. ioeventfd turned out to be suboptimal for me in the fast path for two reasons: 1) the underlying eventfd is called in atomic context. I had posted patches to Davide to address that limitation, but I believe he rejected them on the grounds that they are only relevant to KVM. I thought we addressed this already, in the few hundreds of emails we exchanged back then :) We addressed the race conditions, but not the atomic callbacks. I can't remember exactly what you said, but the effect was "no", so I dropped it. ;) This was the thread: http://www.archivum.info/linux-ker...@vger.kernel.org/2009-06/08548/Re:_[KVM-RFC_PATCH_1_2]_eventfd:_add_an_explicit_srcu_based_notifier_interface 2) it cannot retain the data field passed in the PIO. I wanted to have one vector that could tell me what value was written, and this cannot be expressed in ioeventfd. Like [Avi] might have hinted in his reply, couldn't you add data support to the ioeventfd bits in KVM, instead of leaking them into mainline eventfd? Perhaps, or even easier, I could extend xinterface. Which is what I did ;) The problem with the first proposal is that you would no longer actually have an eventfd-based mechanism...so any code using ioeventfd (like Michael Tsirkin's, for instance) would break. Kind Regards, -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 7:12 PM, Anthony Liguori wrote: On 12/21/2009 11:44 AM, Gregory Haskins wrote: Well, surely something like SR-IOV is moving in that direction, no? Not really, but that's a different discussion. Ok, but my general point still stands. At some level, some crafty hardware engineer may invent something that obsoletes the need for, say, PV 802.x drivers because it can hit 40GE line rate at the same performance level of bare metal with some kind of pass-through trick. But I still do not see that as an excuse for sloppy software in the meantime, as there will always be older platforms, older IO cards, or different IO types that are not benefactors of said hw-based optimizations. But let's focus on concrete data. For a given workload, how many exits do you see due to EOI? It's of course highly workload dependent, and I've published these details in the past, I believe. Off the top of my head, I recall that virtio-pci tends to throw about 65k exits per second, vs about 32k/s for venet on a 10GE box, but I don't recall what ratio of those exits are EOI. Was this userspace virtio-pci or was this vhost-net? Both, actually, though userspace is obviously even worse. If it was the former, then were you using MSI-X? MSI-X. If you weren't, there would be an additional (rather heavy) exit per interrupt to clear the ISR, which would certainly account for a large portion of the additional exits. Yep, if you don't use MSI it is significantly worse, as expected. To be perfectly honest, I don't care. I do not discriminate against the exit type...I want to eliminate as many as possible, regardless of the type. That's how you go fast and yet use less CPU. It's important to understand why one mechanism is better than another. Agreed, but note _I_ already understand why. I've certainly spent countless hours/emails trying to get others to understand as well, but it seems most are too busy to actually listen. All I'm looking for is a set of bullet points that say, "vbus does this, vhost-net does that, therefore vbus is better". We would then either say, "oh, that's a good idea, let's change vhost-net to do that", or we would say, "hrm, well, we can't change vhost-net to do that because of some fundamental flaw, let's drop it and adopt vbus". It's really that simple :-) This has all been covered ad nauseam, directly with yourself in many cases. Google is your friend. Here are some tips while you research: do not fall into the trap of vhost-net vs vbus, or venet vs virtio-net, or you miss the point entirely. Recall that venet was originally crafted to demonstrate the virtues of my three performance objectives (kill exits, reduce exit overhead, and run concurrently). Then there is all the stuff we are laying on top, like qos, real-time, advanced fabrics, and easy adoption for various environments (so it doesn't need to be redefined each time). Therefore, if you only look at the limited feature set of virtio-net, you will miss the majority of the points of the framework. virtio tried to capture some of these ideas, but it missed the mark on several levels and was only partially defined. Incidentally, you can still run virtio over vbus if desired, but so far no one has tried to use my transport. They should be relatively rare because obtaining good receive batching is pretty easy. Batching is poor man's throughput (it's easy when you don't care about latency), so we generally avoid it as much as possible. Fair enough.
Considering these are lightweight exits (on the order of 1-2us), APIC EOIs on x86 are MMIO-based, so they are generally much heavier than that. I measure at least 4-5us just for the MMIO exit on my Woodcrest, never mind executing the locking/apic-emulation code. You won't like to hear me say this, but Woodcrests are pretty old and clunky as far as VT goes :-) Fair enough. On a modern Nehalem, I would be surprised if an MMIO exit handled in the kernel was much more than 2us. The hardware is getting very, very fast. The trends here are very important to consider when we're looking at architectures that we potentially are going to support for a long time. The exit you do not take will always be infinitely faster. you need an awfully large amount of interrupts before you get really significant performance impact. You would think NAPI would kick in at this point anyway. Whether NAPI can kick in or not is workload dependent, and it also does not address coincident events. But on that topic, you can think of AlacrityVM's interrupt controller as NAPI for interrupts, because it operates on the same principle. For what it's worth, it also operates on a "NAPI for hypercalls" concept too. The concept of always batching hypercalls has certainly been explored within the context of Xen. I am not talking about batching, which again is a poor man's throughput trick
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/18/09 4:51 PM, Ingo Molnar wrote: * Gregory Haskins gregory.hask...@gmail.com wrote: Hi Linus, Please pull AlacrityVM guest support for 2.6.33 from: git://git.kernel.org/pub/scm/linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git for-linus All of these patches have stewed in linux-next for quite a while now: Gregory Haskins (26): I think it would be fair to point out that these patches have been objected to by the KVM folks quite extensively, Actually, these patches have nothing to do with the KVM folks. You are perhaps confusing this with the hypervisor-side discussion, of which there is indeed much disagreement. To that point, it's certainly fair to point out the controversy on the host side. It ultimately is what forced the creation of the AlacrityVM project, after all. However, it should also be pointed out that this pull request is not KVM specific, nor even KVM related per se. These patches can (and in fact, do) work in other environments that do not use KVM nor even AlacrityVM at all. VBUS, the underlying technology here, is a framework for creating optimized software-based device models, using a Linux kernel as the host, along with the corresponding guest-side driver resources that connect to that backend. AlacrityVM is the application of these technologies using KVM/Linux/Qemu as a base, but that is an implementation detail. For more details, please see the project wiki: http://developer.novell.com/wiki/index.php/AlacrityVM This pull request is for drivers to support running a Linux kernel as a guest in this environment, so it actually doesn't affect KVM in any way. They are just standard Linux drivers, and in fact can load as stand-alone KMPs in any modern vanilla distro. I haven't even pushed the host-side code to linux-next yet, specifically because of that controversy you mention. on multiple technical grounds - as basically this tree forks the KVM driver space for which no valid technical reason could be offered by you in a 100+ mails long discussion. You will have to be more specific on these technical grounds you mention, because I believe I satisfactorily rebutted any issues raised. To say that there is no technical reason is, at best, a matter of opinion. I have in fact listed numerous reasons on a technical, feature, and architectural basis for what differentiates my approach, and provided numbers which highlight their merits. Given that they are all recorded in the archives of said 100+ email thread, as well as numerous others, I won't rehash the entire list here. Instead, I will post a summary of the problem space from the performance perspective, since that seems to be of most interest atm. From my research, the reason why virt in general, and KVM in particular, suffers on the IO performance front is as follows: IOs (traps+interrupts) are more expensive than bare metal, and real hardware is naturally concurrent (your hbas and nics are effectively parallel execution engines, etc). Assuming my observations are correct, in order to squeeze maximum performance from a given guest, you need to do three things: A) eliminate as many IOs as you possibly can, B) reduce the cost of the ones you can't avoid, and C) run your algorithms in parallel to emulate concurrent silicon. So to that front, we move the device models to the kernel (where they are closest to the physical IO devices) and use cheap instructions like PIOs/hypercalls for (B), and exploit spare host-side SMP resources via kthreads for (C).
For (A), part of the problem is that virtio-pci is not designed optimally to address the problem space, and part of it is a limitation of the PCI transport underneath it. For example, PCI is somewhat of a unique bus design in that it wants to map signals to interrupts 1:1. This works fine for real hardware, where interrupts are relatively cheap, but is quite suboptimal on virt, where the window-exits, injection-exits, and MMIO-based EOIs hurt substantially (multiple microseconds per). One core observation is that we don't technically need 1:1 interrupts to signals in order to function properly. Ideally, we will only bother the CPU when work of a higher priority becomes ready. So the alacrityvm connector to vbus uses a model where we deploy a lockless shared-memory queue to inject interrupts. This means that temporal interrupts (of both intra- and inter-device variety) of similar priority can queue without incurring any extra IO. This means fewer exits, fewer EOIs, etc. The end result is that I can demonstrate that, even with a single stream to a single device, I can reduce the exit rate by over 45% and the interrupt rate by 50% when compared to the equivalent virtio-pci ABI. This scales even higher when you add additional devices to the mix. The bottom line is that we use significantly less CPU while producing the highest throughput and lowest latency. In fact, to my knowledge, vbus+venet is still the highest performing 802.x device for KVM, even when turning off its
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 10:43 AM, Avi Kivity wrote: On 12/21/2009 05:34 PM, Gregory Haskins wrote: I think it would be fair to point out that these patches have been objected to by the KVM folks quite extensively, Actually, these patches have nothing to do with the KVM folks. You are perhaps confusing this with the hypervisor-side discussion, of which there is indeed much disagreement. This is true, though these drivers are fairly pointless for virtualization without the host side support. The host side support is available in various forms (git tree, rpm, etc) from our project page. I would encourage any interested parties to check it out: Here is the git tree: http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=summary Here are some RPMs: http://download.opensuse.org/repositories/devel://LLDC://alacrity/openSUSE_11.1/ And the main project site: http://developer.novell.com/wiki/index.php/AlacrityVM I did have a few issues with the guest drivers: - the duplication of effort wrt virtio. These drivers don't cover exactly the same problem space, but nearly so. Virtio itself is more or less compatible with this effort, as we have discussed (see my virtio-vbus transport, for instance). I have issues with some of the design decisions in the virtio device and ring models, but they are minor in comparison to the beef I have with the virtio-pci transport as a whole. - no effort at scalability - all interrupts are taken on one cpu Addressed by the virtual interrupt controller. This will enable us to route shm-signal messages to a core, under guidance from the standard irq-balance facilities. - the patches introduce a new virtual interrupt controller for dubious (IMO) benefits See above. It's not fully plumbed yet, which is perhaps the reason for the confusion as to its merits. Eventually I will trap the affinity calls and pass them to the host, too. Today, it at least lets us see the shm-signal statistics under /proc/interrupts, which is nice and is consistent with other IO mechanisms. From my research, the reason why virt in general, and KVM in particular, suffers on the IO performance front is as follows: IOs (traps+interrupts) are more expensive than bare metal, and real hardware is naturally concurrent (your hbas and nics are effectively parallel execution engines, etc). Assuming my observations are correct, in order to squeeze maximum performance from a given guest, you need to do three things: A) eliminate as many IOs as you possibly can, B) reduce the cost of the ones you can't avoid, and C) run your algorithms in parallel to emulate concurrent silicon. All these are addressed by vhost-net without introducing new drivers. No, B and C definitely are, but A is lacking. And the performance suffers as a result in my testing (vhost-net still throws a ton of exits, as it's limited by virtio-pci, and only adds about 1Gb/s to virtio-u, far behind venet even with things like zero-copy turned off). I will also point out that these performance aspects are only a subset of the discussion, since we are also addressing things like qos/priority, alternate fabric types, etc. I do not expect you to understand and agree where I am going, per se. We can have that discussion when I once again ask you for merge consideration. But if you say they are the same, I will call you on it, because they are demonstrably unique capability sets. Kind Regards, -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 11:37 AM, Anthony Liguori wrote: On 12/21/2009 10:04 AM, Gregory Haskins wrote: No, B and C definitely are, but A is lacking. And the performance suffers as a result in my testing (vhost-net still throws a ton of exits, as it's limited by virtio-pci, and only adds about 1Gb/s to virtio-u, far behind venet even with things like zero-copy turned off). How does virtio-pci limit vhost-net? The only time exits should occur are when the guest notifies the host that something has been placed on the ring. Since vhost-net has no tx mitigation scheme right now, the result may be that it's taking an io exit on every single packet, but this is orthogonal to virtio-pci. Since virtio-pci supports MSI-X, there should be no IO exits on host->guest notification other than EOI in the virtual APIC. The very best you can hope to achieve is 1:1 EOI per signal (though today virtio-pci is even worse than that). As I indicated above, I can eliminate more than 50% of even the EOIs in trivial examples, and even more as we scale up the number of devices or the IO load (or both). This is a lightweight exit today and will likely disappear entirely with newer hardware. By that argument, this is all moot. New hardware will likely obsolete the need for venet or virtio-net anyway. The goal of my work is to provide an easy-to-use framework for maximizing the IO transport _in lieu_ of hardware acceleration. Software will always be leading here, so we don't want to get into a pattern of waiting for new hardware to cover poor software engineering. It's simply not necessary in most cases. A little smart software design, and a framework that allows it to be easily exploited/reused, is the best step forward, IMO. Kind Regards, -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 11:40 AM, Avi Kivity wrote: On 12/21/2009 06:37 PM, Anthony Liguori wrote: Since virtio-pci supports MSI-X, there should be no IO exits on host->guest notification other than EOI in the virtual APIC. This is a lightweight exit today and will likely disappear entirely with newer hardware. I'm working on disappearing EOI exits on older hardware as well. Same idea as the old TPR patching, without most of the magic. While I applaud any engineering effort that results in more optimal execution, if you are talking about what we have discussed in the past, it's not quite in the same league as my proposal. You are talking about the ability to optimize the final EOI if there are no pending interrupts remaining, right? The problem with this approach is that it addresses the wrong side of the curve: that is, it optimizes the code as it's about to go io-idle. You still have to take an extra exit for each injection during the heat of battle, which is when you actually need it most. To that front, what I have done is re-used the lockless shared-memory concept for even interrupt injection. Lockless shared-memory rings have the property that both producer and consumer can simultaneously manipulate the ring. So what we do in alacrityvm is deliver shm-signals (shm-signal == interrupt in vbus) over a ring, so that the host can inject a signal to a running vcpu, and a vcpu can complete an ack/re-inject cycle directly from vcpu context. Therefore, we only need a physical IDT injection when the vcpu transitions from io-idle to io-busy, and we remain completely in parallel guest/host context until we go idle again. That said, your suggestion would play nicely with the above mentioned scheme, so I look forward to seeing it in the tree. Feel free to send me patches for testing. Kind Regards, -Greg
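For the curious, a minimal single-producer/single-consumer sketch of that ring idea (layout and names assumed; the real vbus shm-signal code is more involved, and modern barrier helpers are used here for clarity):

struct shm_signal_ring {
	u32 prod;		/* written by host only  */
	u32 cons;		/* written by guest only */
	u32 entries[256];	/* pending vectors, power-of-2 sized */
};

/* host side: returns true only on the empty->busy edge, i.e. when a
 * physical IDT injection is actually required to wake the vcpu */
static bool ring_post(struct shm_signal_ring *r, u32 vector)
{
	bool was_empty = (smp_load_acquire(&r->cons) == r->prod);

	r->entries[r->prod & 255] = vector;
	smp_store_release(&r->prod, r->prod + 1);
	return was_empty;
}

/* guest side, run from vcpu context: ack and drain without exiting,
 * so host and guest proceed in parallel while the ring stays busy */
static void ring_drain(struct shm_signal_ring *r, void (*handler)(u32))
{
	while (r->cons != smp_load_acquire(&r->prod)) {
		handler(r->entries[r->cons & 255]);
		smp_store_release(&r->cons, r->cons + 1);
	}
}

Because each side owns exactly one index, no lock is needed, and the only exits taken are the edge-triggered kicks from ring_post().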
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 12:05 PM, Avi Kivity wrote: On 12/21/2009 06:56 PM, Gregory Haskins wrote: I'm working on disappearing EOI exits on older hardware as well. Same idea as the old TPR patching, without most of the magic.

While I applaud any engineering effort that results in more optimal execution, if you are talking about what we have discussed in the past, it's not quite in the same league as my proposal.

I don't doubt this for a minute.

You are talking about the ability to optimize the final EOI if there are no pending interrupts remaining, right? The problem with this approach is that it addresses the wrong side of the curve: that is, it optimizes the code as it's about to go io-idle. You still have to take an extra exit for each injection during the heat of battle, which is when you actually need it most.

No, it's completely orthogonal. An interrupt is injected, the handler disables further interrupts and EOIs, then schedules the rest of the handling code. So long as there are packets in the ring, interrupts won't be enabled and hence there won't be any reinjections.

I meant inter-vector next-interrupt injects. For lack of a better term, I called it "reinject", but I realize in retrospect that this is ambiguous.

Different interrupt sources still need different interrupts, but as all of your tests have been single-interface, this can't be the reason for your performance.

Actually I have tested both single and multi-homed setups, but it doesn't matter. Even a single device can benefit, as even single devices may have multiple vector sources that are highly probable to generate coincident events. For instance, consider that even a basic ethernet may have separate vectors for rx and tx-complete. A simple ping is likely to generate both vectors at approximately the same time, given how the host-side resources often work.

Trying to condense multiple vectors into one means it's up to the driver to implement any type of prioritization on its own (or worse, it just suffers from PI). Likewise, implementing them as unique vectors means you are likely to have coincident events for certain workloads. What alacrityvm tries to do is recognize these points and optimize for both cases. It means we still retain framework-managed prioritized callbacks, yet optimize away extraneous IO for coincident signals. IOW: best of both worlds.

Kind Regards, -Greg
Re: [GIT PULL] AlacrityVM guest drivers for 2.6.33
On 12/21/09 12:20 PM, Anthony Liguori wrote: On 12/21/2009 10:46 AM, Gregory Haskins wrote: The very best you can hope to achieve is 1:1 EOI per signal (though today virtio-pci is even worse than that). As I indicated above, I can eliminate more than 50% of even the EOIs in trivial examples, and even more as we scale up the number of devices or the IO load (or both).

If optimizing EOI is the main technical advantage of vbus, then surely we could paravirtualize EOI access and get that benefit in KVM without introducing a whole new infrastructure, no?

No, because I never claimed optimizing EOI was the main/only advantage. The feature set has all been covered in extensive detail on the lists, however, so I will defer you to google for the archives for your reading pleasure.

This is a light weight exit today and will likely disappear entirely with newer hardware.

By that argument, this is all moot. New hardware will likely obsolete the need for venet or virtio-net anyway.

Not at all.

Well, surely something like SR-IOV is moving in that direction, no? But let's focus on concrete data. For a given workload, how many exits do you see due to EOI?

It's of course highly workload dependent, and I've published these details in the past, I believe. Off the top of my head, I recall that virtio-pci tends to throw about 65k exits per second, vs about 32k/s for venet on a 10GE box, but I don't recall what ratio of those exits are EOI. To be perfectly honest, I don't care. I do not discriminate against the exit type... I want to eliminate as many as possible, regardless of the type. That's how you go fast and yet use less CPU.

They should be relatively rare because obtaining good receive batching is pretty easy.

Batching is poor man's throughput (it's easy when you don't care about latency), so we generally avoid it as much as possible.

Considering these are lightweight exits (on the order of 1-2us),

APIC EOIs on x86 are MMIO based, so they are generally much heavier than that. I measure at least 4-5us just for the MMIO exit on my Woodcrest, never mind executing the locking/apic-emulation code.

you need an awfully large amount of interrupts before you get really significant performance impact. You would think NAPI would kick in at this point anyway.

Whether NAPI can kick in or not is workload dependent, and it also does not address coincident events. But on that topic, you can think of AlacrityVM's interrupt controller as "NAPI for interrupts", because it operates on the same principle. For what it's worth, it also operates on a "NAPI for hypercalls" concept too.

Do you have data demonstrating the advantage of EOI mitigation?

I have non-scientifically gathered numbers in my notebook that put it at an average of about a 55%-60% reduction in EOIs for inbound netperf runs, for instance. I don't have time to gather more in the near term, but it's typically in that range for a chatty enough workload, and it goes up as you add devices. I would certainly formally generate those numbers when I make another merge request in the future, but I don't have them now.

Kind Regards, -Greg
[ANNOUNCE] AlacrityVM v0.2 is released
We (the AlacrityVM team) are pleased to announce the availability of the v0.2 release. There are numerous tweaks, fixes, and features that we have added since the v0.1 days. Here are a few of the key highlights.

*) VENET support:
   *) zero-copy transmits (guest memory is paged directly to the physical adapter on the host, where applicable)
   *) SAR offloading (essentially bi-directional GSO with 64k MTU)
   *) pre-mapped descriptors to reduce latency
   *) pre-loading of mmu context to reduce latency
*) shm-signals are now routable (e.g. future MQ support)
*) vbus is optionally managed in its entirety from the rebundled qemu
*) "-net venet-tap" will fully initialize the interface internally... no more scripting!
*) vbus components are automatically reaped when the guest exits, even if the exit is not clean.

We will update the performance numbers ASAP and announce any changes. For further details, please visit our wiki: http://developer.novell.com/wiki/index.php/AlacrityVM

Kind Regards, -Greg
Re: [PATCHv8 0/3] vhost: a kernel-level virtio server
Michael S. Tsirkin wrote: Ok, I think I've addressed all comments so far here. Rusty, I'd like this to go into linux-next, through your tree, and hopefully 2.6.33. What do you think?

I think the benchmark data is a prerequisite for merge consideration, IMO. Do you have anything for us to look at? I think comparisons that show the following are of interest:

throughput (e.g. netperf::TCP_STREAM): guest->host, guest->host->guest, guest->host->remote, host->remote, remote->host->guest
latency (e.g. netperf::UDP_RR): same conditions as throughput
cpu-utilization
others?

Ideally, this should be at least between upstream virtio and vhost. Bonus points if you include venet as well.

Kind regards, -Greg
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: On Wed, Nov 04, 2009 at 09:25:42AM -0800, Paul E. McKenney wrote: (Sorry, but, as always, I could not resist!) Thanx, Paul

Thanks Paul! Jonathan: are you reading this? Another one for your quotes of the week collection :)

I second that.
Re: [PATCHv8 0/3] vhost: a kernel-level virtio server
Michael S. Tsirkin wrote: On Wed, Nov 04, 2009 at 11:02:15AM -0500, Gregory Haskins wrote: Michael S. Tsirkin wrote: Ok, I think I've addressed all comments so far here. Rusty, I'd like this to go into linux-next, through your tree, and hopefully 2.6.33. What do you think?

I think the benchmark data is a prerequisite for merge consideration, IMO.

Shirley Ma was kind enough to send me some measurement results showing how kernel-level acceleration helps speed things up; you can find them here: http://www.linux-kvm.org/page/VhostNet

Thanks for the pointers. I will roll your latest v8 code into our test matrix. What kernel/qemu trees do they apply to?

-Greg
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Gregory Haskins wrote: Eric Dumazet wrote: Michael S. Tsirkin wrote:

+static void handle_tx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	unsigned head, out, in, s;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_iov = vq->iov,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t len, total_len = 0;
+	int err, wmem;
+	size_t hdr_size;
+	struct socket *sock = rcu_dereference(vq->private_data);
+	if (!sock)
+		return;
+
+	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
+	if (wmem >= sock->sk->sk_sndbuf)
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	vhost_no_notify(vq);
+

Using rcu_dereference() and mutex_lock() at the same time seems wrong; I suspect that your use of RCU is not correct.

1) rcu_dereference() should be done inside a rcu_read_lock() section, and we are not allowed to sleep in such a section. (Quoting Documentation/RCU/whatisRCU.txt: "It is illegal to block while in an RCU read-side critical section")

2) mutex_lock() can sleep (ie block)

Michael, I warned you that this needed better documentation ;)

Eric, I think I flagged this once before, but Michael convinced me that it was indeed ok, if perhaps a bit unconventional. I will try to find the thread.

Kind Regards, -Greg

Here it is: http://lkml.org/lkml/2009/8/12/173

Kind Regards, -Greg
Re: [PATCHv7 2/3] mm: export use_mm/unuse_mm to modules
Michael S. Tsirkin wrote: vhost net module wants to do copy to/from user from a kernel thread, which needs use_mm. Export it to modules.

Acked-by: Andrea Arcangeli aarca...@redhat.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com

I need this too:

Acked-by: Gregory Haskins ghask...@novell.com

---
 mm/mmu_context.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index ded9081..0777654 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -5,6 +5,7 @@
 
 #include <linux/mm.h>
 #include <linux/mmu_context.h>
+#include <linux/module.h>
 #include <linux/sched.h>
 
 #include <asm/mmu_context.h>
@@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
 	if (active_mm != mm)
 		mmdrop(active_mm);
 }
+EXPORT_SYMBOL_GPL(use_mm);
 
 /*
  * unuse_mm
@@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
 	enter_lazy_tlb(mm, tsk);
 	task_unlock(tsk);
 }
+EXPORT_SYMBOL_GPL(unuse_mm);
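For context, a simplified sketch of the usage pattern that motivates the export (this mirrors what vhost does, but is not the vhost code itself): a kernel thread has no mm of its own, so it must adopt the target process's mm before copy_from_user()/copy_to_user() can resolve addresses in that process.

/*
 * Simplified sketch of the pattern that motivates the export; the
 * function name is invented for illustration.
 */
#include <linux/mmu_context.h>
#include <linux/uaccess.h>

static int worker_copy(struct mm_struct *mm, void __user *uptr,
		       void *buf, size_t len)
{
	int ret = 0;

	use_mm(mm);		/* adopt the userspace address space */
	if (copy_from_user(buf, uptr, len))
		ret = -EFAULT;
	unuse_mm(mm);		/* drop back to lazy-tlb kernel context */

	return ret;
}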
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Eric Dumazet wrote: Michael S. Tsirkin wrote:

+static void handle_tx(struct vhost_net *net)
+{
+	struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+	unsigned head, out, in, s;
+	struct msghdr msg = {
+		.msg_name = NULL,
+		.msg_namelen = 0,
+		.msg_control = NULL,
+		.msg_controllen = 0,
+		.msg_iov = vq->iov,
+		.msg_flags = MSG_DONTWAIT,
+	};
+	size_t len, total_len = 0;
+	int err, wmem;
+	size_t hdr_size;
+	struct socket *sock = rcu_dereference(vq->private_data);
+	if (!sock)
+		return;
+
+	wmem = atomic_read(&sock->sk->sk_wmem_alloc);
+	if (wmem >= sock->sk->sk_sndbuf)
+		return;
+
+	use_mm(net->dev.mm);
+	mutex_lock(&vq->mutex);
+	vhost_no_notify(vq);
+

Using rcu_dereference() and mutex_lock() at the same time seems wrong; I suspect that your use of RCU is not correct.

1) rcu_dereference() should be done inside a rcu_read_lock() section, and we are not allowed to sleep in such a section. (Quoting Documentation/RCU/whatisRCU.txt: "It is illegal to block while in an RCU read-side critical section")

2) mutex_lock() can sleep (ie block)

Michael, I warned you that this needed better documentation ;)

Eric, I think I flagged this once before, but Michael convinced me that it was indeed ok, if perhaps a bit unconventional. I will try to find the thread.

Kind Regards, -Greg
Re: [PATCHv7 3/3] vhost_net: a kernel-level virtio server
Eric Dumazet wrote: Gregory Haskins wrote: Gregory Haskins wrote: Eric Dumazet wrote: Michael S. Tsirkin wrote: Using rcu_dereference() and mutex_lock() at the same time seems wrong; I suspect that your use of RCU is not correct. 1) rcu_dereference() should be done inside a rcu_read_lock() section, and we are not allowed to sleep in such a section. (Quoting Documentation/RCU/whatisRCU.txt: "It is illegal to block while in an RCU read-side critical section") 2) mutex_lock() can sleep (ie block)

Michael, I warned you that this needed better documentation ;)

Eric, I think I flagged this once before, but Michael convinced me that it was indeed ok, if perhaps a bit unconventional. I will try to find the thread. Kind Regards, -Greg

Here it is: http://lkml.org/lkml/2009/8/12/173

Yes, this doesn't convince me at all, and it could set a precedent for wrong RCU use. People wanting to use RCU do a grep on kernel sources to find how to correctly use RCU. Michael, please use existing locking/barrier mechanisms, and do not pretend to use RCU.

Yes, I would tend to agree with you. In fact, I think I suggested that a normal barrier should be used instead of abusing rcu_dereference(). But as far as his code is concerned, I think it technically works properly, and that was my main point.

Also note that the rcu_dereference()+mutex_lock() usage is not necessarily broken, per se: it could be an srcu-based critical section created by the caller, for instance. It would be perfectly legal to sleep on the mutex if that were the case. To me, the bigger issue is that the rcu_dereference() without any apparent hint of a corresponding read-side critical section (RSCS) is simply confusing to a reviewer. smp_rmb() (or whatever is proper in this case) is probably more appropriate.

Kind Regards, -Greg
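To spell out the distinction being argued here (all names below are illustrative): blocking under a classic rcu_read_lock() is forbidden, while an SRCU read-side critical section explicitly permits it, which is why SRCU fits cases where a mutex must be taken while a dereferenced pointer is still in use.

#include <linux/mutex.h>
#include <linux/rcupdate.h>
#include <linux/srcu.h>

/* Illustrative names only. */
static struct srcu_struct my_srcu;
static struct mutex my_mutex;
static void *my_ptr;

static void srcu_reader(void)
{
	int idx = srcu_read_lock(&my_srcu);
	void *p = rcu_dereference(my_ptr);

	mutex_lock(&my_mutex);	/* legal: SRCU readers may sleep */
	/* ... use p; it stays valid until srcu_read_unlock() ... */
	mutex_unlock(&my_mutex);

	srcu_read_unlock(&my_srcu, idx);
}

static void rcu_reader(void)
{
	rcu_read_lock();
	(void)rcu_dereference(my_ptr);
	/* mutex_lock() here would be ILLEGAL: RCU readers must not block */
	rcu_read_unlock();
}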
Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Avi Kivity wrote: On 10/26/2009 05:38 PM, Gregory Haskins wrote: Instead of a "lockless" attribute, how about a ->set_atomic() method? For msi this can be the same as ->set(); for non-msi it can be a function that schedules the work (which will eventually call ->set()). The benefit is that we make a decision only once, when preparing the routing entry, and install that decision in the routing entry instead of making it again and again later.

Yeah, I like this idea. I think we can also get rid of the custom workqueue if we do this as well, TBD.

So I looked into this. It isn't straightforward, because you need to retain some kind of state across the deferment on a per-request basis (not per-GSI). Today, this state is neatly tracked in the irqfd object itself (e.g. it knows to toggle the GSI).

Yes, and it also contains the work_struct. What if we make the work_struct (and any additional state) part of the set_atomic() argument list? Does it simplify things?

Hmmm, that might not, but we could do a kmalloc(GFP_ATOMIC) for such parameters. Considering this is just a safety net, perhaps this would work fine.

So while generalizing this perhaps makes sense at some point, especially if irqfd-like interfaces get added, it probably doesn't make a ton of sense to expend energy on it ATM. It is basically a generalization of the irqfd deferment code. Let's just wait until we have a user beyond irqfd for now. Sound acceptable?

I'll look at v3, but would really like to disentangle this.

Ok, I will see what I can do. I need at least a v4 to get rid of the dependency on the now-defunct v3:1/3 patch per yesterday's discussion.

Kind Regards, -Greg
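To recap the proposal in code form, here is a hypothetical sketch of the suggested ->set_atomic() method. None of these names exist in the tree; this is a design under discussion. The per-request state problem raised above is visible in where the work_struct has to live.

/*
 * Hypothetical sketch of the proposed ->set_atomic() method; a design
 * under discussion, not code that exists in the tree.
 */
#include <linux/types.h>
#include <linux/workqueue.h>

struct entry_sketch;

typedef int (*set_fn)(struct entry_sketch *e, int level);

struct entry_sketch {
	set_fn set;		/* may sleep */
	set_fn set_atomic;	/* safe from atomic context */
	struct work_struct work;/* deferment state; note this is per-entry,
				 * while irqfd today tracks the equivalent
				 * state per-request */
};

static int deferred_set(struct entry_sketch *e, int level)
{
	/* cannot sleep here: punt the real ->set() to process context */
	schedule_work(&e->work);
	return 0;
}

static void prepare_entry(struct entry_sketch *e, bool is_msi)
{
	/* decide once, when the routing entry is prepared; assumes
	 * e->set was already installed by the caller */
	e->set_atomic = is_msi ? e->set : deferred_set;
}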
Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation
Michael S. Tsirkin wrote: On Tue, Oct 27, 2009 at 02:54:40PM -0400, Gregory Haskins wrote: Michael S. Tsirkin wrote: On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote: IRQFD currently uses a deferred workqueue item to execute the injection operation. It was originally designed this way because kvm_set_irq() required the caller to hold the irq_lock mutex, and the eventfd callback is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the deferment mechanism is no longer technically needed in all cases. Since context switching to the workqueue is a source of interrupt latency, let's switch to a direct method whenever possible. Fortunately for us, the most common use of irqfd (MSI-based GSIs) readily supports lockless injection.

Signed-off-by: Gregory Haskins ghask...@novell.com

This is a useful optimization, I think. Some comments below.

---
 virt/kvm/eventfd.c |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
 	wait_queue_t          wait;
 	struct work_struct    inject;
 	struct work_struct    shutdown;
+	void (*execute)(struct _irqfd *);
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+	schedule_work(&irqfd->inject);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd->execute(irqfd);
 
 	if (flags & POLLHUP) {
 		/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
 	file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+	ret = kvm_irq_check_lockless(kvm, gsi);
+	if (ret < 0)
+		goto fail;
+
+	if (ret)
+		irqfd->execute = irqfd_inject;
+	else
+		irqfd->execute = irqfd_schedule;
+

Can't gsi get converted from lockless to non-lockless after it's checked (by the routing ioctl)?

I think I protect against this in patch 2/3 by ensuring that any vectors that are added have to conform to the same locking rules. The code doesn't support deleting routes, so we really only need to make sure that new routes do not change.

What I refer to is when userspace calls KVM_SET_GSI_ROUTING. I don't see how your patch helps here: can't a GSI formerly used for MSI become unused, and then be reused for non-MSI? If not, it's a problem I think, because I think userspace currently does this sometimes.

I see your point. I was thinking vectors could only be added, not deleted, but I see upon further inspection that is not the case.

The kernel will crash then. How about: each time we get an event from the eventfd, we implement kvm_irqfd_toggle_lockless(), which does a single scan and returns a true/false status (and I really mean toggle; let's not do "set 1" / "set 0" as well) telling us whether interrupts could be delivered in a lockless manner?

I am not sure I like this idea in general, given that I believe I already handle the error case you are concerned with. However, the concept of providing a toggle option so we can avoid scanning the list twice is a good one. That can be done as a new patch series, but it would be a nice addition.

Thanks Michael, -Greg
Re: [KVM PATCH v3 2/3] KVM: export lockless GSI attribute
Michael S. Tsirkin wrote: On Mon, Oct 26, 2009 at 12:22:03PM -0400, Gregory Haskins wrote: Certain GSIs support lockless injection, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI.

Signed-off-by: Gregory Haskins ghask...@novell.com
---
 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1fe135d..01151a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_memory_slot {
 struct kvm_kernel_irq_routing_entry {
 	u32 gsi;
 	u32 type;
+	bool lockless;

So lockless is the same as type == MSI from below?

Yep, today anyway.

If the idea is to make it extensible for the future, let's just add an inline function; we don't need a field for this.

This makes sense.

 	int (*set)(struct kvm_kernel_irq_routing_entry *e,
 		   struct kvm *kvm, int irq_source_id, int level);
 	union {
@@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
 				   unsigned long *deliver_bitmask);
 #endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index db2553f..a7fd487 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }
 
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
+{
+	struct kvm_kernel_irq_routing_entry *e;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+	int ret = -ENOENT;
+	int idx;
+
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+			if (!e->lockless) {
+				/*
+				 * all destinations need to be lockless to
+				 * declare that the GSI as a whole is also
+				 * lockless
+				 */
+				ret = 0;
+				break;
+			}
+
+			ret = 1;
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
+
+	return ret;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
@@ -310,18 +339,22 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
 	int delta;
 	struct kvm_kernel_irq_routing_entry *ei;
 	struct hlist_node *n;
+	bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;
 
 	/*
 	 * Do not allow GSI to be mapped to the same irqchip more than once.
 	 * Allow only one to one mapping between GSI and MSI.
+	 * Do not allow mixed lockless vs locked variants to coexist.

Userspace has no idea which entries are lockless and which are not: this is an implementation detail, so it might not be able to avoid illegal combinations. Since this is called from an ioctl, can the rule be formulated in a way that makes sense for userspace? I'm not sure.

 	 */
 	hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
 		if (ei->type == KVM_IRQ_ROUTING_MSI ||
-		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
+		    ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
+		    ei->lockless != lockless)

So this check seems like it does nothing, because lockless is the same as MSI, and MSI is always 1:1? Intentional?

Yeah, it was more to guard against / document the dependency, in case the 1:1 with MSI ever changes in the future.

Kind Regards, -Greg
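Michael's "just add an inline function" suggestion would look something like the following (a sketch only; this helper does not exist in the tree), deriving the attribute from the entry type instead of storing a redundant field:

/* Sketch of deriving the attribute from the type instead of storing
 * a field; illustrative only. */
static inline bool kvm_irq_entry_is_lockless(
		const struct kvm_kernel_irq_routing_entry *e)
{
	return e->type == KVM_IRQ_ROUTING_MSI;
}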
Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Avi Kivity wrote: On 10/28/2009 03:19 PM, Gregory Haskins wrote: Yes, and it also contains the work_struct. What if we make the work_struct (and any additional state) part of the set_atomic() argument list? Does it simplify things?

Hmmm, that might not, but we could do a kmalloc(GFP_ATOMIC) for such parameters. Considering this is just a safety net, perhaps this would work fine.

Can't you simply pass the same work_struct from irqfd as we use now?

Well, yes, of course, but I am not sure that buys us much in terms of generalizing the code. Unless I am misunderstanding, that would still leave the onus of the init/sync/cleanup on the irqfd code, at which point we might as well just leave it entirely in irqfd anyway. Or am I misunderstanding you?

So while generalizing this perhaps makes sense at some point, especially if irqfd-like interfaces get added, it probably doesn't make a ton of sense to expend energy on it ATM. It is basically a generalization of the irqfd deferment code. Let's just wait until we have a user beyond irqfd for now. Sound acceptable?

I'll look at v3, but would really like to disentangle this.

Ok, I will see what I can do. I need at least a v4 to get rid of the dependency on the now-defunct v3:1/3 patch per yesterday's discussion.

There's another alternative: make the ioapic and pic irq-safe by switching their irq locking to spinlocks and using spin_lock_irqsave(). I've long opposed this since the ioapic loops on all vcpus when injecting some irqs, and this will increase irqs-off times with large guests. But we don't have large guests now, and we need irq-safe injection in three places now:

- irqfd
- pit (we now signal vcpu0 to handle the injection, but this has its problems)
- device assignment

so it may be better to have irq-safe injection, and deal with the loop later (it would be good to have an idea how exactly).

Ok, perhaps I should just hold off on this series for now, then. I can submit the original "assume atomic-safe" approach once the path is fully lockless.

-Greg
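For reference, the irq-safe conversion Avi describes would look roughly as follows. This is a sketch under the assumption of a per-ioapic lock field (at this point in history the locking actually goes through kvm->irq_lock), so treat the structure and function names as hypothetical.

/*
 * Sketch only: converting a sleeping lock to an irq-safe spinlock so
 * injection can be called from atomic (even hardirq) context.  The
 * per-ioapic lock shown here is an assumption for illustration.
 */
#include <linux/spinlock.h>

struct ioapic_sketch {
	spinlock_t lock;	/* was: a sleeping mutex */
	/* ... pins, redirection table, etc. ... */
};

static void ioapic_deliver_atomic(struct ioapic_sketch *ioapic, int pin)
{
	unsigned long flags;

	spin_lock_irqsave(&ioapic->lock, flags);
	/*
	 * Deliver to vcpus here.  The loop over all vcpus is exactly
	 * what makes irqs-off latency a concern for large guests, per
	 * the discussion above.
	 */
	spin_unlock_irqrestore(&ioapic->lock, flags);
}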
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Hi Paul,

Paul E. McKenney wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();
                                              kfree(irq_rt);
                                           }

   irq_rt->entry->set(); /* bad */
}

Because the pointer is accessed outside of the read-side critical section. There are two basic patterns we can use to fix this bug:

1) Switch to sleeping-rcu and encompass the ->set() access within the read-side critical section, OR

2) Add reference counting to the irq_rt structure, and simply acquire the reference from within the RSCS.

This patch implements solution (1).

Looks like a good transformation! A few questions interspersed below.

Thanks for the review. I would have CC'd you, but I figured I had pestered you enough with my RCU reviews in the past and didn't want to annoy you ;) I will be sure to CC you in the future, unless you ask otherwise.

Signed-off-by: Gregory Haskins ghask...@novell.com
---
 include/linux/kvm_host.h |    6 +++-
 virt/kvm/irq_comm.c      |   50 +++++++++++++++++++++++++---------------------
 virt/kvm/kvm_main.c      |    1 +
 3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..1fe135d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,7 +185,10 @@ struct kvm {
 	struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-	struct kvm_irq_routing_table *irq_routing;
+	struct {
+		struct srcu_struct            srcu;

Each structure has its own SRCU domain. This is OK, but just asking if that is the intent. It does look like the SRCU primitives are passed a pointer to the correct structure, and that the return value from srcu_read_lock() gets passed into the matching srcu_read_unlock() like it needs to be, so that is good.

Yeah, it was intentional. Technically the table is per-guest, and thus the locking is too, which is the desired/intentional granularity.

On that note, I tried to denote that kvm->irq_routing.srcu and kvm->irq_routing.table were related to one another, but then went ahead and modified the hunks that touched kvm->irq_ack_notifier_list, too. In retrospect, this was probably a mistake. I should leave the rcu usage outside of ->irq_routing.table alone.

+		struct kvm_irq_routing_table *table;
+	} irq_routing;
 	struct hlist_head mask_notifier_list;
 	struct hlist_head irq_ack_notifier_list;
 #endif

[ . . . ]

@@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	 * IOAPIC.  So set the bit in both. The guest will ignore
 	 * writes to the unused one.
 	 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
 	if (irq < irq_rt->nr_rt_entries)
-		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
-			irq_set[i++] = *e;
-	rcu_read_unlock();
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {

What prevents the above list from changing while we are traversing it? (Yes, presumably whatever was preventing it from changing before this patch, but what?) Mostly kvm->lock is held, but not always. And if kvm->lock were held all the time, there would be no point in using SRCU. ;-)

This is protected by kvm->irq_lock within kvm_set_irq_routing(). Entries are added to a copy of the list, and the top-level table pointer is swapped (via rcu_assign_pointer(), as it should be) while holding the lock. Finally, we synchronize with the RSCS before deleting the old copy. It looks to me like the original author got this part right, so I didn't modify it outside of converting to SRCU.

+			int r;
 
-	while(i--) {
-		int r;
-		r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
-		if (r < 0)
-			continue;
+			r = e->set(e, kvm, irq_source_id, level);
+			if (r < 0)
+				continue;
 
-		ret = r + ((ret < 0) ? 0 : ret);
-	}
+			ret = r + ((ret < 0) ? 0 : ret);
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
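For readers reconstructing the update side from the description above, the pattern is roughly the following. This is a condensed sketch consistent with the patch's data structures, not the literal kvm_set_irq_routing() body.

/*
 * Condensed sketch of the update side described above; not the
 * literal kvm_set_irq_routing() body.
 */
#include <linux/kvm_host.h>
#include <linux/slab.h>
#include <linux/srcu.h>

static void routing_update_sketch(struct kvm *kvm,
				  struct kvm_irq_routing_table *new)
{
	struct kvm_irq_routing_table *old;

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing.table;
	rcu_assign_pointer(kvm->irq_routing.table, new);
	mutex_unlock(&kvm->irq_lock);

	/* wait for all SRCU readers of the old table to drain... */
	synchronize_srcu(&kvm->irq_routing.srcu);

	/* ...only then is it safe to free it */
	kfree(old);
}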
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gleb Natapov wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();
                                              kfree(irq_rt);
                                           }

   irq_rt->entry->set(); /* bad */
}

This is not what happens. irq_rt is never accessed outside the read-side critical section.

Sorry, I was generalizing to keep the comments short. I figured it would be clear what I was actually saying, but realize in retrospect that I was a little ambiguous. Yes, irq_rt is not accessed outside the RSCS. However, the entry pointers stored in irq_rt->map are, and this is equally problematic afaict.

In this particular case we seem to never delete entries at run-time once they are established. Therefore, while perhaps sloppy, it's technically safe to leave them unprotected from this perspective. The issue is more related to shutdown, since a kvm_set_irq() caller could be within the aforementioned race-region and call entry->set() after the guest is gone. Or did I miss something?

Data is copied from irq_rt onto the stack and this copy is accessed outside the critical section.

As mentioned above, I do not believe this really protects us. And even if it did, the copy is just a work-around to avoid sleeping within the standard RCU RSCS, which is what SRCU is designed for. So rather than inventing an awkward two-phased stack-based solution, it's better to reuse the provided tools, IMO.

To flip it around: is there any reason why SRCU would not work here, and thus we were forced to use something like the stack-copy approach?

Kind Regards, -Greg
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gregory Haskins wrote: Gleb Natapov wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();
                                              kfree(irq_rt);
                                           }

   irq_rt->entry->set(); /* bad */
}

This is not what happens. irq_rt is never accessed outside read-side critical section.

Sorry, I was generalizing to keep the comments short. I figured it would be clear what I was actually saying, but realize in retrospect that I was a little ambiguous. Here is a revised problem statement:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   entry_cache = get_entries(irq_rt);
   rcu_read_unlock();

                                           invalidate_entries(irq_rt);

   for_each_entry(entry_cache)
      entry->set(); /* bad */
}

- invalidate_entries() may be any operation that deletes an entry at run-time (doesn't exist today), or as the guest is shutting down.

As far as I can tell, the current code does not protect us from either condition, and my proposed patch protects us from both. Did I miss anything?

HTH, -Greg
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gleb Natapov wrote: On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote: Gleb Natapov wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition: [...]

This is not what happens. irq_rt is never accessed outside read-side critical section.

Sorry, I was generalizing to keep the comments short. I figured it would be clear what I was actually saying, but realize in retrospect that I was a little ambiguous.

"A little" is an understatement :) There is no /* bad */ line in the code!

Sorry, that was my own highlighting, not trying to reflect actual code.

Yes, irq_rt is not accessed outside the RSCS. However, the entry pointers stored in irq_rt->map are, and this is equally problematic afaict.

The pointer is in .text and can't disappear without kvm_set_irq() disappearing too.

No, the entry* pointer is .text _AND_ .data, and is subject to standard synchronization rules like most other objects. Unless I am misreading the code, the entry* pointers point to heap within the irq_rt pointer. Therefore, the kfree(irq_rt) I mention above effectively invalidates the entire set of entry* pointers that you are caching, and is thus an issue.

In this particular case we seem to never delete entries at run-time once they are established. Therefore, while perhaps sloppy, it's technically safe to leave them unprotected from this perspective.

Note: I was wrong in this statement. I forgot that it's not safe at run-time either, since the entry objects are part of irq_rt.

The issue is more related to shutdown, since a kvm_set_irq() caller could be within the aforementioned race-region and call entry->set() after the guest is gone. Or did I miss something?

The caller of kvm_set_irq() should hold a reference to the kvm instance, so it can't disappear while you are inside kvm_set_irq(). RCU protects only kvm->irq_routing, not the kvm structure itself.

Agreed, but this has nothing to do with protecting the entry* pointers.

Data is copied from irq_rt onto the stack and this copy is accessed outside the critical section.

As mentioned above, I do not believe this really protects us. And even

I don't see the proof that it doesn't, so I assume it does.

What would you like to see beyond what I've already provided you? I can show how the entry pointers are allocated as part of the irq_rt, and I can show how the irq_rt (via entry->set) access is racy against invalidation.

if it did, the copy is just a work-around to avoid sleeping within the

It is not a work-around. There were two solutions to the problem: one is to call ->set() outside the rcu critical section,

This is broken afaict without taking additional precautions, such as a reference count on the irq_rt structure, but I mentioned this alternate solution in my header.

another is to use SRCU. I decided to use the first one. This way the code is much simpler,

"Simpler" is debatable, but ok. SRCU is an established pattern available in the upstream kernel, so I don't think it's particularly complicated or controversial to use.

and I remember asking Paul what are the disadvantages of using SRCU, and there was something.

The disadvantages to my knowledge are as follows:

1) rcu_read_lock() is something like 4x faster than srcu_read_lock(), but we are talking about nanoseconds on modern hardware (I think Paul quoted me 10ns vs 45ns on his rig). I don't think either overhead is something to be concerned about in this case.

2) standard RCU supports deferred synchronization (call_rcu()) as well as barriers (synchronize_rcu()), whereas SRCU only supports barriers (synchronize_srcu()). We only use the barrier type in this code path, so that is not an issue.

3) SRCU requires explicit initialization/cleanup, whereas regular RCU does not. This is trivially solved in my patch, since KVM has plenty of init/cleanup hook points.

standard RCU RSCS, which is what SRCU is designed for. So rather than inventing an awkward two-phased stack-based solution, it's better to reuse the provided tools, IMO. To flip it around: is there any reason why SRCU would not work here, and thus we were forced to use something like the stack-copy approach?

If SRCU has no disadvantage comparing to RCU, why not use it always? :)

No one is debating
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gleb Natapov wrote: On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote: Gregory Haskins wrote: Gleb Natapov wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();
                                              kfree(irq_rt);
                                           }

   irq_rt->entry->set(); /* bad */
}

This is not what happens. irq_rt is never accessed outside read-side critical section.

Sorry, I was generalizing to keep the comments short. I figured it would be clear what I was actually saying, but realize in retrospect that I was a little ambiguous. Here is a revised problem statement:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   entry_cache = get_entries(irq_rt);
   rcu_read_unlock();

                                           invalidate_entries(irq_rt);

   for_each_entry(entry_cache)
      entry->set(); /* bad */
}

- invalidate_entries() may be any operation that deletes an entry at run-time (doesn't exist today), or as the guest is shutting down.

As far as I can tell, the current code does not protect us from either condition, and my proposed patch protects us from both. Did I miss anything?

Yes. What happened to irq_rt is completely irrelevant at the point you marked /* bad */.

kfree() happened to irq_rt, and thus to the objects behind the pointers in entry_cache, at the point I marked /* bad */. That certainly isn't /* good */ ;)

-Greg
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Thanks for this, Paul. Some questions and statements below.

Paul E. McKenney wrote: On Tue, Oct 27, 2009 at 04:02:37PM +0200, Gleb Natapov wrote: On Tue, Oct 27, 2009 at 09:39:03AM -0400, Gregory Haskins wrote: [ . . . ] standard RCU RSCS, which is what SRCU is designed for. So rather than inventing an awkward two-phased stack-based solution, it's better to reuse the provided tools, IMO. To flip it around: is there any reason why SRCU would not work here, and thus we were forced to use something like the stack-copy approach?

If SRCU has no disadvantage comparing to RCU, why not use it always? :)

The disadvantages of SRCU compared to RCU include the following:

1. SRCU requires that the return value of srcu_read_lock() be fed into srcu_read_unlock(). This is usually not a problem, but can be painful if there are multiple levels of function call separating the two.

Right, and this is simple/neat w.r.t. its usage in irq_routing, so no problem there.

2. SRCU's grace periods are about 4x slower than those of RCU. And they also don't scale all that well with extremely large numbers of CPUs (but this can be fixed when/if it becomes a real problem).

The irq_routing update path is extremely infrequent, so this should not be an issue.

3. SRCU's read-side primitives are also significantly slower than those of RCU.

Are the 10ns vs 45ns numbers that I mentioned in my last reply in the proper ballpark? How do these compare to an atomic op, say an uncontended spinlock, on modern hardware? The assumption is that srcu_read_lock() should be significantly cheaper than a read-lock(). If it's not, then we might as well use something else, I suppose. But if it weren't, I guess you probably wouldn't have bothered to invent it in the first place ;)

4. SRCU does not have a call_srcu(). One could be provided, but its semantics would be a bit strange due to the need to limit the number of callbacks, given that general blocking is permitted in SRCU read-side critical sections. (And it would take some doing to convince me to supply an SRCU!)

This is not an issue in our design.

5. The current SRCU has no reasonable way to implement read-side priority boosting, as there is no record of which task is read-holding which SRCU.

Given the infrequency of the update path, I do not see this as a problem.

Kind Regards, -Greg
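Regarding disadvantage (3) in the numbered list above, i.e. the explicit initialization/cleanup SRCU needs, the lifecycle is roughly the following. Hooking these into the VM create/destroy paths is an assumption for illustration (the helper names are invented), consistent with the kvm_init_irq_routing() declaration added by the patch.

/*
 * Sketch of the explicit SRCU lifecycle that plain RCU does not need.
 * The placement in create/destroy paths is an assumption for
 * illustration.
 */
#include <linux/kvm_host.h>
#include <linux/srcu.h>

static int irq_routing_init_sketch(struct kvm *kvm)
{
	return init_srcu_struct(&kvm->irq_routing.srcu);
}

static void irq_routing_cleanup_sketch(struct kvm *kvm)
{
	cleanup_srcu_struct(&kvm->irq_routing.srcu);
}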
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gleb Natapov wrote: On Tue, Oct 27, 2009 at 10:50:45AM -0400, Gregory Haskins wrote: Gleb Natapov wrote: On Tue, Oct 27, 2009 at 10:00:15AM -0400, Gregory Haskins wrote: Gregory Haskins wrote: Gleb Natapov wrote: On Mon, Oct 26, 2009 at 12:21:57PM -0400, Gregory Haskins wrote: The current code suffers from the following race condition:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   rcu_read_unlock();

                                           kvm_set_irq_routing() {
                                              mutex_lock();
                                              irq_rt = table;
                                              rcu_assign_pointer();
                                              mutex_unlock();
                                              synchronize_rcu();
                                              kfree(irq_rt);
                                           }

   irq_rt->entry->set(); /* bad */
}

This is not what happens. irq_rt is never accessed outside read-side critical section.

Sorry, I was generalizing to keep the comments short. I figured it would be clear what I was actually saying, but realize in retrospect that I was a little ambiguous. Here is a revised problem statement:

thread-1                                   thread-2
-----------------------------------------------------------
kvm_set_irq() {
   rcu_read_lock()
   irq_rt = rcu_dereference(table);
   entry_cache = get_entries(irq_rt);
   rcu_read_unlock();

                                           invalidate_entries(irq_rt);

   for_each_entry(entry_cache)
      entry->set(); /* bad */
}

- invalidate_entries() may be any operation that deletes an entry at run-time (doesn't exist today), or as the guest is shutting down.

As far as I can tell, the current code does not protect us from either condition, and my proposed patch protects us from both. Did I miss anything?

Yes. What happened to irq_rt is completely irrelevant at the point you marked /* bad */.

kfree() happened to irq_rt, and thus to the objects behind the pointers in entry_cache, at the point I marked /* bad */.

The entire entry is cached, not a pointer to an entry!

kfree()... *light bulb goes off* Ah, I see. I missed the detail that it was a structure copy, not a pointer copy. My bad. You are right, and I am wrong. I retract the 1/3 patch.

Kind Regards, -Greg
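The detail that settles the debate deserves a side-by-side illustration (simplified types, not the kvm code): caching a pointer past the critical section is the bug Gregory described, while copying the entry by value inside the critical section, as the existing code does, is safe because the copy lives on the stack after the kfree(). The ->set callback itself lives in .text and cannot go away.

/*
 * Simplified illustration (not the kvm code) of why the existing
 * pattern is safe.
 */
#include <linux/rcupdate.h>

struct entry {
	int gsi;
	int (*set)(struct entry *e);
};

static struct entry *table;	/* published with rcu_assign_pointer() */

static void safe_copy_pattern(void)
{
	struct entry snapshot;	/* a full struct, not a pointer */

	rcu_read_lock();
	snapshot = *rcu_dereference(table);	/* copy by value */
	rcu_read_unlock();

	/* safe even if the updater kfree()s the table right here */
	snapshot.set(&snapshot);
}

static void buggy_pointer_pattern(void)
{
	struct entry *e;

	rcu_read_lock();
	e = rcu_dereference(table);
	rcu_read_unlock();

	e->set(e);		/* use-after-free if the table was freed */
}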
Re: [KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
Gleb Natapov wrote:

1) rcu_read_lock() is something like 4x faster than srcu_read_lock(), but we are talking about nanoseconds on modern hardware (I think Paul quoted me 10ns vs 45ns on his rig). I don't think either overhead is something to be concerned about in this case.

If we can avoid it, why not? Nanoseconds tend to add up.

BTW: I didn't mean to imply that we should be cavalier in adding overhead. My point was that adding overhead is sometimes _necessary_ to prevent a race above and beyond an RCU pointer acquisition, and 35ns is not _terrible_, IMO. I was suggesting to solve this by switching to SRCU, but an alternative is copying the structure (when permitted, as with immutable objects), which seems to work in this particular case.

It should be noted that the copy has its own unquantified overhead beyond basic RCU as well, so it's not truly free (I'd guess it's <= the cost of the switch to SRCU without copies, though).

IOW: sync hurts, but it's sometimes a necessary evil ;)

Kind Regards, -Greg
Re: [KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation
Michael S. Tsirkin wrote: On Mon, Oct 26, 2009 at 12:22:08PM -0400, Gregory Haskins wrote: IRQFD currently uses a deferred workqueue item to execute the injection operation. It was originally designed this way because kvm_set_irq() required the caller to hold the irq_lock mutex, and the eventfd callback is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the deferment mechanism is no longer technically needed in all cases. Since context switching to the workqueue is a source of interrupt latency, let's switch to a direct method whenever possible. Fortunately for us, the most common use of irqfd (MSI-based GSIs) readily supports lockless injection.

Signed-off-by: Gregory Haskins ghask...@novell.com

This is a useful optimization, I think. Some comments below.

---
 virt/kvm/eventfd.c |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
 	wait_queue_t          wait;
 	struct work_struct    inject;
 	struct work_struct    shutdown;
+	void (*execute)(struct _irqfd *);
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+	schedule_work(&irqfd->inject);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd->execute(irqfd);
 
 	if (flags & POLLHUP) {
 		/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
 	file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+	ret = kvm_irq_check_lockless(kvm, gsi);
+	if (ret < 0)
+		goto fail;
+
+	if (ret)
+		irqfd->execute = irqfd_inject;
+	else
+		irqfd->execute = irqfd_schedule;
+

Can't gsi get converted from lockless to non-lockless after it's checked (by the routing ioctl)?

I think I protect against this in patch 2/3 by ensuring that any vectors that are added have to conform to the same locking rules. The code doesn't support deleting routes, so we really only need to make sure that new routes do not change.

The kernel will crash then. How about: each time we get an event from the eventfd, we implement kvm_irqfd_toggle_lockless(), which does a single scan and returns a true/false status (and I really mean toggle; let's not do "set 1" / "set 0" as well) telling us whether interrupts could be delivered in a lockless manner?

I am not sure I like this idea in general, given that I believe I already handle the error case you are concerned with. However, the concept of providing a toggle option so we can avoid scanning the list twice is a good one. That can be done as a new patch series, but it would be a nice addition.

Thanks Michael, -Greg
Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Avi Kivity wrote: On 10/23/2009 04:38 AM, Gregory Haskins wrote: Certain GSIs support lockless injection, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI.

Instead of a "lockless" attribute, how about a ->set_atomic() method? For msi this can be the same as ->set(); for non-msi it can be a function that schedules the work (which will eventually call ->set()). The benefit is that we make a decision only once, when preparing the routing entry, and install that decision in the routing entry instead of making it again and again later.

Yeah, I like this idea. I think we can also get rid of the custom workqueue if we do this as well, TBD.

+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)

bool kvm_irq_check_lockless(...)

We lose the ability to detect failure (such as -ENOENT) if we do this, but it's moot if we move to the ->set_atomic() model, since this attribute is no longer necessary and this patch can be dropped.

Kind Regards, -Greg
Re: [Alacrityvm-devel] [KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Gregory Haskins wrote: Avi Kivity wrote: On 10/23/2009 04:38 AM, Gregory Haskins wrote: Certain GSIs support lockless injection, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI.

Instead of a "lockless" attribute, how about a ->set_atomic() method? For msi this can be the same as ->set(); for non-msi it can be a function that schedules the work (which will eventually call ->set()). The benefit is that we make a decision only once, when preparing the routing entry, and install that decision in the routing entry instead of making it again and again later.

Yeah, I like this idea. I think we can also get rid of the custom workqueue if we do this as well, TBD.

So I looked into this. It isn't straightforward, because you need to retain some kind of state across the deferment on a per-request basis (not per-GSI). Today, this state is neatly tracked in the irqfd object itself (e.g. it knows to toggle the GSI). So while generalizing this perhaps makes sense at some point, especially if irqfd-like interfaces get added, it probably doesn't make a ton of sense to expend energy on it ATM. It is basically a generalization of the irqfd deferment code. Let's just wait until we have a user beyond irqfd for now. Sound acceptable?

In the meantime, I found a bug in the irq_routing code, so I will submit a v3 with this fix, as well as a few other things I improved in the v2 series.

Kind Regards, -Greg
[KVM PATCH v3 0/3] irqfd enhancements, and irq_routing fixes
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that we have lockless interrupt injection. For more details, please see the patch headers.

These patches pass checkpatch, and are fully tested. Please consider for merging. Patch 1/3 is a fix for an issue that may exist upstream and should be considered for a more timely push upstream. Patches 2/3-3/3 are an enhancement only, so there is no urgency to push to mainline until a suitable merge window presents itself.

Kind Regards, -Greg

[ Change log:

v3:
   *) Added patch 1/3 as a fix for a race condition
   *) Minor cleanup to 2/3 to ensure that all shared vectors conform to a unified locking model

v2:
   *) dropped original cleanup which relied on the user registering MSI-based GSIs, or we may crash at runtime. Instead, we now check at registration whether the GSI supports lockless operation and dynamically adapt to either the original deferred path for lock-based injections, or direct for lockless.

v1:
   *) original release
]

---

Gregory Haskins (3):
      KVM: Directly inject interrupts if they support lockless operation
      KVM: export lockless GSI attribute
      KVM: fix race in irq_routing logic

 include/linux/kvm_host.h |    8 ++++
 virt/kvm/eventfd.c       |   31 ++++++++++++++-----
 virt/kvm/irq_comm.c      |   85 ++++++++++++++++++++++++++++++++++++---------
 virt/kvm/kvm_main.c      |    1 +
 4 files changed, 98 insertions(+), 27 deletions(-)
[KVM PATCH v3 3/3] KVM: Directly inject interrupts if they support lockless operation
IRQFD currently uses a deferred workqueue item to execute the injection operation. It was originally designed this way because kvm_set_irq() required the caller to hold the irq_lock mutex, and the eventfd callback is invoked from within a non-preemptible critical section.

With the advent of lockless injection support for certain GSIs, the deferment mechanism is no longer technically needed in all cases. Since context switching to the workqueue is a source of interrupt latency, let's switch to a direct method whenever possible. Fortunately for us, the most common use of irqfd (MSI-based GSIs) readily supports lockless injection.

Signed-off-by: Gregory Haskins ghask...@novell.com
---
 virt/kvm/eventfd.c |   31 +++++++++++++++++++++++++++----
 1 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 30f70fd..e6cc958 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -51,20 +51,34 @@ struct _irqfd {
 	wait_queue_t          wait;
 	struct work_struct    inject;
 	struct work_struct    shutdown;
+	void (*execute)(struct _irqfd *);
 };
 
 static struct workqueue_struct *irqfd_cleanup_wq;
 
 static void
-irqfd_inject(struct work_struct *work)
+irqfd_inject(struct _irqfd *irqfd)
 {
-	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
 	struct kvm *kvm = irqfd->kvm;
 
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
 	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
 }
 
+static void
+irqfd_deferred_inject(struct work_struct *work)
+{
+	struct _irqfd *irqfd = container_of(work, struct _irqfd, inject);
+
+	irqfd_inject(irqfd);
+}
+
+static void
+irqfd_schedule(struct _irqfd *irqfd)
+{
+	schedule_work(&irqfd->inject);
+}
+
 /*
  * Race-free decouple logic (ordering is critical)
  */
@@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
 
 	if (flags & POLLIN)
 		/* An event has been signaled, inject an interrupt */
-		schedule_work(&irqfd->inject);
+		irqfd->execute(irqfd);
 
 	if (flags & POLLHUP) {
 		/* The eventfd is closing, detach from KVM */
@@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	irqfd->kvm = kvm;
 	irqfd->gsi = gsi;
 	INIT_LIST_HEAD(&irqfd->list);
-	INIT_WORK(&irqfd->inject, irqfd_inject);
+	INIT_WORK(&irqfd->inject, irqfd_deferred_inject);
 	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
 
 	file = eventfd_fget(fd);
@@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 	list_add_tail(&irqfd->list, &kvm->irqfds.items);
 	spin_unlock_irq(&kvm->irqfds.lock);
 
+	ret = kvm_irq_check_lockless(kvm, gsi);
+	if (ret < 0)
+		goto fail;
+
+	if (ret)
+		irqfd->execute = irqfd_inject;
+	else
+		irqfd->execute = irqfd_schedule;
+
 	/*
 	 * Check if there was an event already pending on the eventfd
 	 * before we registered, and trigger it as if we didn't miss it.
[KVM PATCH v3 1/3] KVM: fix race in irq_routing logic
The current code suffers from the following race condition:

        thread-1                                thread-2
        -----------------------------------------------------------

        kvm_set_irq() {
           rcu_read_lock()
           irq_rt = rcu_dereference(table);
           rcu_read_unlock();

                                                kvm_set_irq_routing() {
                                                   mutex_lock();
                                                   irq_rt = table;
                                                   rcu_assign_pointer();
                                                   mutex_unlock();
                                                   synchronize_rcu();

                                                   kfree(irq_rt);

           irq_rt->entry->set(); /* bad */

because the pointer is accessed outside of the read-side critical section. There are two basic patterns we can use to fix this bug:

  1) Switch to sleeping-rcu and encompass the ->set() access within the
     read-side critical section, OR

  2) Add reference counting to the irq_rt structure, and simply acquire
     the reference from within the RSCS.

This patch implements solution (1).

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_host.h |    6 +-
 virt/kvm/irq_comm.c      |   50 +++---
 virt/kvm/kvm_main.c      |    1 +
 3 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd5a616..1fe135d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -185,7 +185,10 @@ struct kvm {
 	struct mutex irq_lock;
 #ifdef CONFIG_HAVE_KVM_IRQCHIP
-	struct kvm_irq_routing_table *irq_routing;
+	struct {
+		struct srcu_struct            srcu;
+		struct kvm_irq_routing_table *table;
+	} irq_routing;
 	struct hlist_head mask_notifier_list;
 	struct hlist_head irq_ack_notifier_list;
 #endif
@@ -541,6 +544,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
 			unsigned nr,
 			unsigned flags);
+void kvm_init_irq_routing(struct kvm *kvm);
 void kvm_free_irq_routing(struct kvm *kvm);

 #else

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 00c68d2..db2553f 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -144,10 +144,11 @@ static int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
  */
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 {
-	struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-	int ret = -1, i = 0;
+	struct kvm_kernel_irq_routing_entry *e;
+	int ret = -1;
 	struct kvm_irq_routing_table *irq_rt;
 	struct hlist_node *n;
+	int idx;

 	trace_kvm_set_irq(irq, level, irq_source_id);

@@ -155,21 +156,19 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	 * IOAPIC.  So set the bit in both. The guest will ignore
 	 * writes to the unused one.
 	 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
 	if (irq < irq_rt->nr_rt_entries)
-		hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
-			irq_set[i++] = *e;
-	rcu_read_unlock();
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+			int r;

-	while(i--) {
-		int r;
-		r = irq_set[i].set(&irq_set[i], kvm, irq_source_id, level);
-		if (r < 0)
-			continue;
+			r = e->set(e, kvm, irq_source_id, level);
+			if (r < 0)
+				continue;

-		ret = r + ((ret < 0) ? 0 : ret);
-	}
+			ret = r + ((ret < 0) ? 0 : ret);
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);

 	return ret;
 }
@@ -179,17 +178,18 @@ void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 	struct kvm_irq_ack_notifier *kian;
 	struct hlist_node *n;
 	int gsi;
+	int idx;

 	trace_kvm_ack_irq(irqchip, pin);

-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	gsi = rcu_dereference(kvm->irq_routing.table)->chip[irqchip][pin];
 	if (gsi != -1)
 		hlist_for_each_entry_rcu(kian, n, &kvm->irq_ack_notifier_list, link)
 			if (kian->gsi == gsi)
 				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
 }

 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -287,11
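The essence of the fix is visible in the diff, but the SRCU pattern is worth isolating. Classic RCU forbids sleeping inside the read-side critical section, which is what forced the old code to copy the entries out before invoking them; sleepable RCU lifts that restriction, so the ->set() callbacks can run under the read-side lock. A minimal sketch of the pattern, detached from the KVM specifics (illustrative names, not the kernel's):

	/* init_srcu_struct(&table_srcu) is assumed to run at setup time */
	static struct srcu_struct table_srcu;
	static struct table *table;

	void reader(void)			/* e.g. kvm_set_irq() */
	{
		struct table *t;
		int idx;

		idx = srcu_read_lock(&table_srcu);
		t = rcu_dereference(table);
		t->set(t);			/* may sleep: legal under SRCU */
		srcu_read_unlock(&table_srcu, idx);
	}

	void updater(struct table *new)		/* e.g. kvm_set_irq_routing() */
	{
		struct table *old = table;

		rcu_assign_pointer(table, new);
		synchronize_srcu(&table_srcu);	/* wait for all SRCU readers */
		kfree(old);			/* now provably unreferenced */
	}

The trade-off is that synchronize_srcu() is heavier than synchronize_rcu() on the update side, which is arguably acceptable here because routing-table updates are rare compared to injections.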
[KVM PATCH v3 2/3] KVM: export lockless GSI attribute
Certain GSIs support lockless injection, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   35 ++-
 2 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1fe135d..01151a6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_memory_slot {
 struct kvm_kernel_irq_routing_entry {
 	u32 gsi;
 	u32 type;
+	bool lockless;
 	int (*set)(struct kvm_kernel_irq_routing_entry *e,
 		   struct kvm *kvm, int irq_source_id, int level);
 	union {
@@ -420,6 +421,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic,
 				   unsigned long *deliver_bitmask);
 #endif
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level);
+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
 				   struct kvm_irq_ack_notifier *kian);

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index db2553f..a7fd487 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -173,6 +173,35 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level)
 	return ret;
 }

+int kvm_irq_check_lockless(struct kvm *kvm, u32 irq)
+{
+	struct kvm_kernel_irq_routing_entry *e;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+	int ret = -ENOENT;
+	int idx;
+
+	idx = srcu_read_lock(&kvm->irq_routing.srcu);
+	irq_rt = rcu_dereference(kvm->irq_routing.table);
+	if (irq < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
+			if (!e->lockless) {
+				/*
+				 * all destinations need to be lockless to
+				 * declare the GSI as a whole lockless
+				 */
+				ret = 0;
+				break;
+			}
+
+			ret = 1;
+		}
+	srcu_read_unlock(&kvm->irq_routing.srcu, idx);
+
+	return ret;
+}
+
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
 	struct kvm_irq_ack_notifier *kian;
@@ -310,18 +339,22 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt,
 	int delta;
 	struct kvm_kernel_irq_routing_entry *ei;
 	struct hlist_node *n;
+	bool lockless = ue->type == KVM_IRQ_ROUTING_MSI;

 	/*
 	 * Do not allow GSI to be mapped to the same irqchip more than once.
 	 * Allow only one to one mapping between GSI and MSI.
+	 * Do not allow mixed lockless vs locked variants to coexist.
 	 */
 	hlist_for_each_entry(ei, n, &rt->map[ue->gsi], link)
 		if (ei->type == KVM_IRQ_ROUTING_MSI ||
-		    ue->u.irqchip.irqchip == ei->irqchip.irqchip)
+		    ue->u.irqchip.irqchip == ei->irqchip.irqchip ||
+		    ei->lockless != lockless)
 			return r;

 	e->gsi = ue->gsi;
 	e->type = ue->type;
+	e->lockless = lockless;
 	switch (ue->type) {
 	case KVM_IRQ_ROUTING_IRQCHIP:
 		delta = 0;
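One detail that is easy to misread in the diff is that kvm_irq_check_lockless() is deliberately tri-state. A hedged summary of the caller contract (this mirrors how patch 3/3 consumes it; the comments are mine):

	ret = kvm_irq_check_lockless(kvm, gsi);
	switch (ret) {
	case -ENOENT:	/* no routing entries for this GSI, or gsi out of range */
		/* fail the registration */
		break;
	case 0:		/* at least one destination needs a lock (ioapic/pic) */
		/* fall back to the deferred workqueue path */
		break;
	case 1:		/* every destination is lockless (MSI) */
		/* safe to inject directly from atomic context */
		break;
	}

Because setup_routing_entry() now refuses to mix lockless and locked entries on one GSI, the per-GSI answer is stable for the lifetime of a routing table.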
Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd
Avi Kivity wrote:
> On 10/21/2009 05:42 PM, Gregory Haskins wrote:
>> I believe Avi, Michael, et al. were in agreement with me on that design
>> choice.  I believe the reason is that there is no good way to do EOI/ACK
>> feedback within the constraints of an eventfd pipe, which would be
>> required for the legacy pin-type interrupts.  Therefore, we won't even
>> bother trying.  High-performance subsystems will use irqfd/msi, and
>> legacy emulation can use the existing injection code (which includes the
>> necessary feedback for ack/eoi).
>
> Right.  But we don't actually prevent anyone using non-msi with irqfd,
> which can trigger the bad lock usage from irq context, with a nice boom
> afterwards.  So we need to either prevent it during registration, or to
> gracefully handle it afterwards.

Yeah, I was thinking about that after I initially responded to Gleb.

I am thinking something along these lines: provide a function that lets you query a GSI for whether it supports LOCKLESS or not.  Then we can do one of two things:

  1) Check for the LOCKLESS attribute at irqfd registration, and fail if
     it is not present (sketched below).

  2) Cache the LOCKLESS attribute in the irqfd structure, and either go
     direct or defer to a workqueue depending on the flag.

Thoughts?

-Greg
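For illustration, option (1) would amount to something like the following near the top of kvm_irqfd_assign(). This is a sketch of the approach that was ultimately not taken, assuming the kvm_irq_check_lockless() query proposed above:

	/* Option (1), sketched: refuse to bind an irqfd to a GSI whose
	 * destinations cannot be injected locklessly. */
	ret = kvm_irq_check_lockless(kvm, gsi);
	if (ret <= 0)
		return ret ? ret : -EINVAL;	/* locked or unrouted GSI */

The discussion that follows explains why option (2) won out instead.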
Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd
Avi Kivity wrote:
> On 10/22/2009 05:14 PM, Gregory Haskins wrote:
>> Yeah, I was thinking about that after I initially responded to Gleb.
>>
>> I am thinking something along these lines: provide a function that lets
>> you query a GSI for whether it supports LOCKLESS or not.  Then we can do
>> one of two things:
>>
>> 1) Check for the LOCKLESS attribute at irqfd registration, fail if not
>>    present
>
> This is the most practical path and leads to the smallest code.
> However, it has the deficiency of exposing internal implementation
> details to userspace.  In theory userspace could use MSI and
> edge-triggered pic/ioapic interrupts equally well; it shouldn't have to
> know that we didn't bother to make the ioapic/pic lock-free.
>
>> 2) Cache the LOCKLESS attribute in the irqfd structure, and either go
>>    direct or defer to a workqueue depending on the flag.
>
> While this leads to larger code, it is more consistent.

Yeah, I think you are right.  Consider these two patches retracted, and I will rewrite the series with this concept in place.

Kind Regards,
-Greg
[KVM PATCH v2 0/2] irqfd enhancements
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that we have lockless interrupt injection. For more details, please see the patch headers.

These patches pass checkpatch, and are fully tested. Please consider for merging. They are an enhancement only, so there is no urgency to push to mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

[ Change log:

   v2:
      *) Dropped the original cleanup, which relied on the user
         registering only MSI-based GSIs and could crash at runtime
         otherwise. Instead, we now check at registration whether the
         GSI supports lockless operation and dynamically adapt to either
         the original deferred path for lock-based injections, or the
         direct path for lockless.

   v1:
      *) original release
]

---

Gregory Haskins (2):
      KVM: Directly inject interrupts if they support lockless operation
      KVM: export lockless GSI attribute

 include/linux/kvm_host.h |    2 ++
 virt/kvm/eventfd.c       |   31 +++
 virt/kvm/irq_comm.c      |   19 +++
 3 files changed, 48 insertions(+), 4 deletions(-)
[KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Certain GSI's support lockless injecton, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI. Signed-off-by: Gregory Haskins ghask...@novell.com --- include/linux/kvm_host.h |2 ++ virt/kvm/irq_comm.c | 19 +++ 2 files changed, 21 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index bd5a616..93393a4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -119,6 +119,7 @@ struct kvm_memory_slot { struct kvm_kernel_irq_routing_entry { u32 gsi; u32 type; + bool lockless; int (*set)(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level); union { @@ -417,6 +418,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic, unsigned long *deliver_bitmask); #endif int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level); +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq); void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin); void kvm_register_irq_ack_notifier(struct kvm *kvm, struct kvm_irq_ack_notifier *kian); diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c index 00c68d2..04f0134 100644 --- a/virt/kvm/irq_comm.c +++ b/virt/kvm/irq_comm.c @@ -174,6 +174,23 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level) return ret; } +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq) +{ + struct kvm_kernel_irq_routing_entry *e; + struct kvm_irq_routing_table *irq_rt; + struct hlist_node *n; + int ret = -ENOENT; + + rcu_read_lock(); + irq_rt = rcu_dereference(kvm-irq_routing); + if (irq irq_rt-nr_rt_entries) + hlist_for_each_entry(e, n, irq_rt-map[irq], link) + ret = e-lockless ? 1 : 0; + rcu_read_unlock(); + + return ret; +} + void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin) { struct kvm_irq_ack_notifier *kian; @@ -314,6 +331,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt, e-gsi = ue-gsi; e-type = ue-type; + e-lockless = false; switch (ue-type) { case KVM_IRQ_ROUTING_IRQCHIP: delta = 0; @@ -342,6 +360,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt, e-msi.address_lo = ue-u.msi.address_lo; e-msi.address_hi = ue-u.msi.address_hi; e-msi.data = ue-u.msi.data; + e-lockless = true; break; default: goto out; -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM PATCH v2 2/2] KVM: Directly inject interrupts if they support lockless operation
IRQFD currently uses a deferred workqueue item to execute the injection operation. It was originally designed this way because kvm_set_irq() required the caller to hold the irq_lock mutex, and the eventfd callback is invoked from within a non-preemptible critical section. With the advent of lockless injection support for certain GSIs, the deferment mechanism is no longer technically needed in all cases. Since context switching to the workqueue is a source of interrupt latency, lets switch to a direct method whenever possible. Fortunately for us, the most common use of irqfd (MSI-based GSIs) readily support lockless injection. Signed-off-by: Gregory Haskins ghask...@novell.com --- virt/kvm/eventfd.c | 31 +++ 1 files changed, 27 insertions(+), 4 deletions(-) diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 30f70fd..e6cc958 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -51,20 +51,34 @@ struct _irqfd { wait_queue_t wait; struct work_structinject; struct work_structshutdown; + void (*execute)(struct _irqfd *); }; static struct workqueue_struct *irqfd_cleanup_wq; static void -irqfd_inject(struct work_struct *work) +irqfd_inject(struct _irqfd *irqfd) { - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject); struct kvm *kvm = irqfd-kvm; kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, 1); kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, 0); } +static void +irqfd_deferred_inject(struct work_struct *work) +{ + struct _irqfd *irqfd = container_of(work, struct _irqfd, inject); + + irqfd_inject(irqfd); +} + +static void +irqfd_schedule(struct _irqfd *irqfd) +{ + schedule_work(irqfd-inject); +} + /* * Race-free decouple logic (ordering is critical) */ @@ -126,7 +140,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLIN) /* An event has been signaled, inject an interrupt */ - schedule_work(irqfd-inject); + irqfd-execute(irqfd); if (flags POLLHUP) { /* The eventfd is closing, detach from KVM */ @@ -179,7 +193,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) irqfd-kvm = kvm; irqfd-gsi = gsi; INIT_LIST_HEAD(irqfd-list); - INIT_WORK(irqfd-inject, irqfd_inject); + INIT_WORK(irqfd-inject, irqfd_deferred_inject); INIT_WORK(irqfd-shutdown, irqfd_shutdown); file = eventfd_fget(fd); @@ -209,6 +223,15 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) list_add_tail(irqfd-list, kvm-irqfds.items); spin_unlock_irq(kvm-irqfds.lock); + ret = kvm_irq_check_lockless(kvm, gsi); + if (ret 0) + goto fail; + + if (ret) + irqfd-execute = irqfd_inject; + else + irqfd-execute = irqfd_schedule; + /* * Check if there was an event already pending on the eventfd * before we registered, and trigger it as if we didn't miss it. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v2 1/2] KVM: export lockless GSI attribute
Gregory Haskins wrote: Certain GSI's support lockless injecton, but we have no way to detect which ones at the GSI level. Knowledge of this attribute will be useful later in the series so that we can optimize irqfd injection paths for cases where we know the code will not sleep. Therefore, we provide an API to query a specific GSI. Signed-off-by: Gregory Haskins ghask...@novell.com --- include/linux/kvm_host.h |2 ++ virt/kvm/irq_comm.c | 19 +++ 2 files changed, 21 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index bd5a616..93393a4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -119,6 +119,7 @@ struct kvm_memory_slot { struct kvm_kernel_irq_routing_entry { u32 gsi; u32 type; + bool lockless; int (*set)(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level); union { @@ -417,6 +418,7 @@ void kvm_get_intr_delivery_bitmask(struct kvm_ioapic *ioapic, unsigned long *deliver_bitmask); #endif int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level); +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq); void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin); void kvm_register_irq_ack_notifier(struct kvm *kvm, struct kvm_irq_ack_notifier *kian); diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c index 00c68d2..04f0134 100644 --- a/virt/kvm/irq_comm.c +++ b/virt/kvm/irq_comm.c @@ -174,6 +174,23 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level) return ret; } +int kvm_irq_check_lockless(struct kvm *kvm, u32 irq) +{ + struct kvm_kernel_irq_routing_entry *e; + struct kvm_irq_routing_table *irq_rt; + struct hlist_node *n; + int ret = -ENOENT; + + rcu_read_lock(); + irq_rt = rcu_dereference(kvm-irq_routing); + if (irq irq_rt-nr_rt_entries) + hlist_for_each_entry(e, n, irq_rt-map[irq], link) + ret = e-lockless ? 1 : 0; Sigh... just noticed this as it hits the list. I should probably break out of the loop here. + rcu_read_unlock(); + + return ret; +} + void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin) { struct kvm_irq_ack_notifier *kian; @@ -314,6 +331,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt, e-gsi = ue-gsi; e-type = ue-type; + e-lockless = false; switch (ue-type) { case KVM_IRQ_ROUTING_IRQCHIP: delta = 0; @@ -342,6 +360,7 @@ static int setup_routing_entry(struct kvm_irq_routing_table *rt, e-msi.address_lo = ue-u.msi.address_lo; e-msi.address_hi = ue-u.msi.address_hi; e-msi.data = ue-u.msi.data; + e-lockless = true; break; default: goto out; -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ signature.asc Description: OpenPGP digital signature
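The self-noted bug above is that the loop keeps overwriting ret for every entry on the GSI, so with multiple destinations the result reflects only the last one visited. The v3 respin (patch 2/3, earlier in this archive) settles on a stricter aggregation; a condensed sketch of that fix:

	ret = -ENOENT;
	hlist_for_each_entry(e, n, &irq_rt->map[irq], link) {
		if (!e->lockless) {
			ret = 0;	/* one locked destination taints the GSI */
			break;
		}
		ret = 1;		/* all destinations seen so far are lockless */
	}

This way the walk terminates on the first locked destination, and a GSI is only reported lockless when every destination qualifies.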
[KVM PATCH 0/2] irqfd enhancements
(Applies to kvm.git/master:11b06403)

The following patches are cleanups/enhancements for IRQFD now that we have lockless interrupt injection. For more details, please see the patch headers.

These patches pass checkpatch, and are fully tested. Please consider for merging. They are an enhancement only, so there is no urgency to push to mainline until a suitable merge window presents itself.

Kind Regards,
-Greg

---

Gregory Haskins (2):
      KVM: Remove unnecessary irqfd-cleanup-wq
      KVM: Directly inject interrupts via irqfd

 virt/kvm/eventfd.c |   45 ++---
 1 files changed, 6 insertions(+), 39 deletions(-)
[KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd
IRQFD currently uses a deferred workqueue item to execute the injection operation. It was originally designed this way because kvm_set_irq() required the caller to hold the irq_lock mutex, and the eventfd callback is invoked from within a non-preemptible critical section. With the advent of lockless injection support in kvm_set_irq, the deferment mechanism is no longer technically needed. Since context switching to the workqueue is a source of interrupt latency, lets switch to a direct method. Signed-off-by: Gregory Haskins ghask...@novell.com --- virt/kvm/eventfd.c | 15 +++ 1 files changed, 3 insertions(+), 12 deletions(-) diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 30f70fd..1a529d4 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -49,16 +49,14 @@ struct _irqfd { poll_tablept; wait_queue_head_t*wqh; wait_queue_t wait; - struct work_structinject; struct work_structshutdown; }; static struct workqueue_struct *irqfd_cleanup_wq; static void -irqfd_inject(struct work_struct *work) +irqfd_inject(struct _irqfd *irqfd) { - struct _irqfd *irqfd = container_of(work, struct _irqfd, inject); struct kvm *kvm = irqfd-kvm; kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd-gsi, 1); @@ -80,12 +78,6 @@ irqfd_shutdown(struct work_struct *work) remove_wait_queue(irqfd-wqh, irqfd-wait); /* -* We know no new events will be scheduled at this point, so block -* until all previously outstanding events have completed -*/ - flush_work(irqfd-inject); - - /* * It is now safe to release the object's resources */ eventfd_ctx_put(irqfd-eventfd); @@ -126,7 +118,7 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLIN) /* An event has been signaled, inject an interrupt */ - schedule_work(irqfd-inject); + irqfd_inject(irqfd); if (flags POLLHUP) { /* The eventfd is closing, detach from KVM */ @@ -179,7 +171,6 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) irqfd-kvm = kvm; irqfd-gsi = gsi; INIT_LIST_HEAD(irqfd-list); - INIT_WORK(irqfd-inject, irqfd_inject); INIT_WORK(irqfd-shutdown, irqfd_shutdown); file = eventfd_fget(fd); @@ -214,7 +205,7 @@ kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) * before we registered, and trigger it as if we didn't miss it. */ if (events POLLIN) - schedule_work(irqfd-inject); + irqfd_inject(irqfd); /* * do not drop the file until the irqfd is fully initialized, otherwise -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
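Note that this version makes the direct call unconditionally. As the replies below and Avi's comment elsewhere in the thread ("a nice boom") point out, nothing prevents userspace from binding an irqfd to an ioapic/pic-routed GSI, whose kvm_set_irq() path still takes a mutex. A sketch of the failure mode that motivates the v2 rework (my annotation, not from the thread):

	/* eventfd wakeup callback: runs with the wait-queue spinlock
	 * held, i.e. atomic context.
	 *
	 *   irqfd_wakeup()
	 *     -> irqfd_inject()
	 *          -> kvm_set_irq()
	 *               -> ioapic/pic route: mutex_lock()
	 *                  (sleeping call in atomic context: BUG)
	 */

The v2/v3 series closes this hole by checking the GSI's lockless attribute at registration time.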
[KVM PATCH 2/2] KVM: Remove unnecessary irqfd-cleanup-wq
We originally created the irqfd-cleanup-wq so that we could safely implement a shutdown that blocked on outstanding injection-requests that may have been in flight. We couldn't reuse something like kevent to shutdown since the injection path was also using kevent and that may have caused a deadlock if we tried. Since the injection path is now no longer utilizing a work-item, it is no longer necessary to maintain a separate cleanup WQ. The standard kevent queues should be sufficient, and thus we can eliminate an extra kthread from the system. Signed-off-by: Gregory Haskins ghask...@novell.com --- virt/kvm/eventfd.c | 30 +++--- 1 files changed, 3 insertions(+), 27 deletions(-) diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c index 1a529d4..fb698f4 100644 --- a/virt/kvm/eventfd.c +++ b/virt/kvm/eventfd.c @@ -52,8 +52,6 @@ struct _irqfd { struct work_structshutdown; }; -static struct workqueue_struct *irqfd_cleanup_wq; - static void irqfd_inject(struct _irqfd *irqfd) { @@ -104,7 +102,7 @@ irqfd_deactivate(struct _irqfd *irqfd) list_del_init(irqfd-list); - queue_work(irqfd_cleanup_wq, irqfd-shutdown); + schedule_work(irqfd-shutdown); } /* @@ -262,7 +260,7 @@ kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi) * so that we guarantee there will not be any more interrupts on this * gsi once this deassign function returns. */ - flush_workqueue(irqfd_cleanup_wq); + flush_scheduled_work(); return 0; } @@ -296,33 +294,11 @@ kvm_irqfd_release(struct kvm *kvm) * Block until we know all outstanding shutdown jobs have completed * since we do not take a kvm* reference. */ - flush_workqueue(irqfd_cleanup_wq); + flush_scheduled_work(); } /* - * create a host-wide workqueue for issuing deferred shutdown requests - * aggregated from all vm* instances. We need our own isolated single-thread - * queue to prevent deadlock against flushing the normal work-queue. - */ -static int __init irqfd_module_init(void) -{ - irqfd_cleanup_wq = create_singlethread_workqueue(kvm-irqfd-cleanup); - if (!irqfd_cleanup_wq) - return -ENOMEM; - - return 0; -} - -static void __exit irqfd_module_exit(void) -{ - destroy_workqueue(irqfd_cleanup_wq); -} - -module_init(irqfd_module_init); -module_exit(irqfd_module_exit); - -/* * * ioeventfd: translate a PIO/MMIO memory write to an eventfd signal. * -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
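Why the private queue existed, and why it is now removable, is worth spelling out; the schematic below paraphrases the changelog above and the pre-1/2 code:

	/* The old hazard, schematically:
	 *
	 *   irqfd_shutdown()                   <- itself a work item
	 *       -> flush_work(&irqfd->inject)  <- waits on a sibling item
	 *
	 * Running that on a shared single-threaded queue could have the
	 * flush wait on itself.  With the inject item gone (patch 1/2),
	 * the shutdown item no longer flushes anything, and irqfd's
	 * remaining flushes (flush_scheduled_work() in deassign/release)
	 * run only from process context, never from within a work item,
	 * so the global schedule_work() queue suffices.
	 */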
Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd
Gleb Natapov wrote:
> On Wed, Oct 21, 2009 at 10:34:53AM -0400, Gregory Haskins wrote:
>> IRQFD currently uses a deferred workqueue item to execute the injection
>> operation.  It was originally designed this way because kvm_set_irq()
>> required the caller to hold the irq_lock mutex, and the eventfd callback
>> is invoked from within a non-preemptible critical section.
>>
>> With the advent of lockless injection support in kvm_set_irq, the
>> deferment mechanism is no longer technically needed.  Since context
>> switching to the workqueue is a source of interrupt latency, let's
>> switch to a direct method.
>
> kvm_set_irq is fully lockless only in the MSI case.  IOAPIC/PIC has
> mutexes.

Right, but irqfd by design only works with MSI (or MSI-like edge triggers) anyway.  Legacy-type injections follow a different path.

In any case, I didn't change the locking (you did ;).  You recently patched the irqfd code to remove the irq_lock, but we still had the deferment mechanism in place to avoid the mutex_lock from within the POLLIN callback.  Since the mutex_lock is no longer acquired in this path, the deferment technique is not needed either; it's only adding overhead for no purpose.  So I am simply cleaning that up to improve interrupt performance.

HTH,
-Greg
Re: [KVM PATCH 1/2] KVM: Directly inject interrupts via irqfd
Gleb Natapov wrote:
> On Wed, Oct 21, 2009 at 11:34:51AM -0400, Gregory Haskins wrote:
>> Gleb Natapov wrote:
>>> On Wed, Oct 21, 2009 at 10:34:53AM -0400, Gregory Haskins wrote:
>>>> IRQFD currently uses a deferred workqueue item to execute the
>>>> injection operation.  It was originally designed this way because
>>>> kvm_set_irq() required the caller to hold the irq_lock mutex, and the
>>>> eventfd callback is invoked from within a non-preemptible critical
>>>> section.  With the advent of lockless injection support in
>>>> kvm_set_irq, the deferment mechanism is no longer technically needed.
>>>> Since context switching to the workqueue is a source of interrupt
>>>> latency, let's switch to a direct method.
>>> kvm_set_irq is fully lockless only in the MSI case.  IOAPIC/PIC has
>>> mutexes.
>> Right, but irqfd by design only works with MSI (or MSI-like edge
>> triggers) anyway.  Legacy-type injections follow a different path.
> Ah, if this is the case and it will stay that way, then the change looks
> OK to me.

I believe Avi, Michael, et al. were in agreement with me on that design choice.  I believe the reason is that there is no good way to do EOI/ACK feedback within the constraints of an eventfd pipe, which would be required for the legacy pin-type interrupts.  Therefore, we won't even bother trying.  High-performance subsystems will use irqfd/msi, and legacy emulation can use the existing injection code (which includes the necessary feedback for ack/eoi).

To that point, perhaps it should be better documented.  If we need a v2, I will add a comment.

Kind Regards,
-Greg
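If a comment does get added, the property worth documenting is that an MSI/edge injection is a complete, self-contained pulse. From the patch, with my annotations:

	/* raise, then immediately drop, the line: the edge IS the event */
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1);
	kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);

A level-triggered pin, by contrast, must stay asserted until the guest acknowledges it, and an eventfd (a bare counter) has no channel to carry that EOI/ACK feedback back to the producer.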
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Avi Kivity wrote:
> On 10/06/2009 09:40 PM, Gregory Haskins wrote:
>> Thinking about this some more over lunch, I think we (Avi and I) might
>> both be wrong (and David is right).  Avi is right that we don't need
>> rmb() or barrier() for the reasons already stated, but I think David is
>> right that we need an smp_mb() to ensure the cpu doesn't do the
>> reordering.  Otherwise a different cpu could invalidate the memory if
>> it reuses the freed memory in the meantime, iiuc.  IOW: it's not a
>> compiler issue but a cpu issue.
>>
>> Or am I still confused?
>
> The sequence of operations is:
>
>     v = p->v;
>     f(); // rmb() ?
>     g(v);
>
> You are worried that the compiler

No.

> or cpu will fetch p->v after f() has executed?

Yes.

> The compiler may not, since it can't tell whether f() might change p->v.

Right, you were correct to say my barrier() suggestion was wrong.

If f() can cause another agent to write to p (by freeing it to a global list, for example), then it is its responsibility to issue the smp_rmb(); otherwise no calculation that took place before f() and accessed p is safe.

IOW: David is right.  You need a cpu-barrier one way or the other.  We can either allow ->release() to imply one (and probably document it that way, like we did for slow-work), or we can be explicit.  I chose to be explicit since it is kind of self-documenting, and there is no need to be worried about performance since the release is slow-path.

OTOH: If you feel strongly about it, we can take it out, knowing that most anything that properly invalidates the memory will likely include an implicit barrier of some kind.

Kind Regards,
-Greg
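Reduced to its skeleton, the code under discussion looks like the following. Whether a barrier is required, and which one, is exactly the open question of this subthread, so the smp_mb() below marks the candidate being debated rather than a settled conclusion:

	static inline void
	_kvm_xinterface_release(struct kref *kref)
	{
		struct kvm_xinterface *intf =
			container_of(kref, struct kvm_xinterface, kref);
		struct module *owner = intf->owner;

		/*
		 * candidate barrier: ensure 'owner' is really loaded
		 * before release() can free 'intf' and let another cpu
		 * reuse the memory
		 */
		smp_mb();

		intf->ops->release(intf);	/* may kfree(intf) */
		module_put(owner);		/* uses the saved copy only */
	}

The structural point everyone agrees on is that owner must be captured into a local before release() runs, since intf may not be dereferenced afterwards.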
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Avi Kivity wrote: On 10/06/2009 01:57 AM, Gregory Haskins wrote: Avi Kivity wrote: On 10/02/2009 10:19 PM, Gregory Haskins wrote: What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time. If this is needed, it should be done as a virt_address_space to which kvm and other modules bind, instead of as something that kvm exports and other modules import. The virt_address_space can be identified by an fd and passed around to kvm and other modules. IIUC, what you are proposing is something similar to generalizing the vbus::memctx object. I had considered doing something like that in the early design phase of vbus, but then decided it would be a hard-sell to the mm crowd, and difficult to generalize. What do you propose as the interface to program the object? Something like the current kvm interfaces, de-warted. It will be a hard sell indeed, for good reasons. I am not convinced generalizing this at this point is the best step forward, since we only have a KVM client. Let me put some more thought into it. So, under my suggestion above, you'd call sys_create_virt_address_space(), populate it, and pass the result to kvm and to foo. This allows the use of virt_address_space without kvm and doesn't require foo to interact with kvm. The problem I see here is that the only way I can think to implement this generally is something that looks very kvm-esque (slots-to-pages kind of translation). Is there a way you can think of that does not involve a kvm.ko originated vtable that is also not kvm centric? slots would be one implementation, if you can think of others then you'd add them. I'm more interested in *how* you'd add them more than if we would add them. What I am getting at are the logistics of such a beast. For instance, would I have /dev/slots-vas with ioctls for adding slots, and /dev/foo-vas for adding foos? And each one would instantiate a different vas_struct object with its own vas_struct-ops? Or were you thinking of something different. If you can't, I think it indicates that the whole thing isn't necessary and we're better off with slots and virtual memory. I'm not sure if we are talking about the same thing yet, but if we are, there are uses of a generalized interface outside of slots/virtual memory (Ira's physical box being a good example). In any case, I think the best approach is what I already proposed. KVM's arrangement of memory is going to tend to be KVM specific, and what better place to implement the interface than close to the kvm.ko core. The only thing missing is dma, which you don't deal with anyway. Afaict I do support dma in the generalized vbus::memctx, though I do not use it on anything related to KVM or xinterface. Can you elaborate on the problem here? Does the SG interface in 4/4 help get us closer to what you envision? +struct kvm_xinterface_ops { +unsigned long (*copy_to)(struct kvm_xinterface *intf, + unsigned long gpa, const void *src, + unsigned long len); +unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst, + unsigned long gpa, unsigned long len); +struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf, + unsigned long gpa, + unsigned long len); How would vmap() work with live migration? vmap represents shmem regions, and is a per-guest-instance resource. 
So my plan there is that the new and old guest instance would each have the vmap region instated at the same GPA location (assumption: gpas are stable across migration), and any state relevant data local to the shmem (like ring head/tail position) is conveyed in the serialized stream for the device model. You'd have to copy the entire range since you don't know what the guest might put there. I guess it's acceptable for small areas. ? The vmap is presumably part of an ABI between guest and host, so the host should always know what structure is present within the region, and what is relevant from within that structure to migrate once that state is frozen. These regions (for vbus, anyway) equate to things like virtqueue metadata, and presumably the same problem exists for virtio-net in userspace as it does here, since that is another form of a vmap. So whatever solution works for virtio-net migrating its virtqueues in userspace should be close to what will work here. The primary difference is the location of the serializer. + +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Gregory Haskins wrote: Avi Kivity wrote: On 10/06/2009 01:57 AM, Gregory Haskins wrote: Avi Kivity wrote: On 10/02/2009 10:19 PM, Gregory Haskins wrote: + +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of(kref, struct kvm_xinterface, kref); + +owner = intf-owner; +rmb(); Why rmb? the intf-ops-release() line may invalidate the intf pointer, so we want to ensure that the read completes before the release() is called. TBH: I'm not 100% its needed, but I was being conservative. rmb()s are only needed if an external agent can issue writes, otherwise you'd need one after every statement. I was following lessons learned here: http://lkml.org/lkml/2009/7/7/175 Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing David Howells in case he has more insight. BTW: In case it is not clear, the rationale as I understand it is we worry about the case where one cpu reorders the read to be after the -release(), and another cpu might grab the memory that was kfree()'d within the -release() and scribble something else on it before the read completes. I know rmb() typically needs to be paired with wmb() to be correct, so you are probably right to say that the rmb() itself is not appropriate. This problem in general makes my head hurt, which is why I said I am not 100% sure of what is required. As David mentions, perhaps smp_mb() is more appropriate for this application. I also speculate barrier() may be all that we need. Kind Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Avi Kivity wrote: On 10/06/2009 03:31 PM, Gregory Haskins wrote: slots would be one implementation, if you can think of others then you'd add them. I'm more interested in *how* you'd add them more than if we would add them. What I am getting at are the logistics of such a beast. Add alternative ioctls, or have one ioctl with a 'type' field. For instance, would I have /dev/slots-vas with ioctls for adding slots, and /dev/foo-vas for adding foos? And each one would instantiate a different vas_struct object with its own vas_struct-ops? Or were you thinking of something different. I think a single /dev/foo is sufficient, unless some of those address spaces are backed by real devices. If you can't, I think it indicates that the whole thing isn't necessary and we're better off with slots and virtual memory. I'm not sure if we are talking about the same thing yet, but if we are, there are uses of a generalized interface outside of slots/virtual memory (Ira's physical box being a good example). I'm not sure Ira's case is not best supported by virtual memory. Perhaps, but there are surely some cases where the memory is not pageable, but is accessible indirectly through some DMA controller. So if we require it to be pagable we will limit the utility of the interface, though admittedly it will probably cover most cases. In any case, I think the best approach is what I already proposed. KVM's arrangement of memory is going to tend to be KVM specific, and what better place to implement the interface than close to the kvm.ko core. The only thing missing is dma, which you don't deal with anyway. Afaict I do support dma in the generalized vbus::memctx, though I do not use it on anything related to KVM or xinterface. Can you elaborate on the problem here? Does the SG interface in 4/4 help get us closer to what you envision? There's no dma method in there. Mapping to physical addresses is equivalent to get_user_pages(), so it doesn't add anything over virtual memory slots. I am not following you at all. What kind of interface do we need for DMA? Since the KVM guest is pagable, the support for DMA would come from building a scatterlist (patch 4/4) and passing the resulting pages out using standard sg mechanisms, like sg_dma_address(). This is what I do today to support zero-copy in AlacrityVM. What more do we need? You'd have to copy the entire range since you don't know what the guest might put there. I guess it's acceptable for small areas. ? The vmap is presumably part of an ABI between guest and host, so the host should always know what structure is present within the region, and what is relevant from within that structure to migrate once that state is frozen. I was thinking of the case where the page is shared with some other (guest-private) data. But dirtying that data would be tracked by kvm, so it isn't a problem. Ok. These regions (for vbus, anyway) equate to things like virtqueue metadata, and presumably the same problem exists for virtio-net in userspace as it does here, since that is another form of a vmap. So whatever solution works for virtio-net migrating its virtqueues in userspace should be close to what will work here. The primary difference is the location of the serializer. Actually, virtio doesn't serialize this data, instead it marks the page dirty and lets normal RAM migration move it. Ok, so its effectively serializing the entire region indirectly. That works too. rmb()s are only needed if an external agent can issue writes, otherwise you'd need one after every statement. 
I was following lessons learned here: http://lkml.org/lkml/2009/7/7/175 Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing David Howells in case he has more insight. I'm not sure I understand the reference. Callers and callees don't need memory barriers since they're guaranteed program order. You only need memory barriers when you have an external agent (a device, or another cpu). What external agent can touch things during -release()? We already have a different subthread started on this, so I'll pick this up there. A simple per-vcpu cache (in struct kvm_vcpu) is likely to give better results. per-vcpu will not work well here, unfortunately, since this is an external interface mechanism. The callers will generally be from a kthread or some other non-vcpu related context. Even if we could figure out a vcpu to use as a basis, we would require some kind of heavier-weight synchronization which would not be as desirable. Therefore, I opted to go per-cpu and use the presumably lighterweight get_cpu/put_cpu() instead. This just assumes a low context switch rate. It primarily assumes a low _migration_ rate, since you do not typically have two contexts on the same cpu pounding
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Avi Kivity wrote: On 10/06/2009 04:22 PM, Gregory Haskins wrote: + +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of(kref, struct kvm_xinterface, kref); + +owner = intf-owner; +rmb(); Why rmb? the intf-ops-release() line may invalidate the intf pointer, so we want to ensure that the read completes before the release() is called. TBH: I'm not 100% its needed, but I was being conservative. rmb()s are only needed if an external agent can issue writes, otherwise you'd need one after every statement. I was following lessons learned here: http://lkml.org/lkml/2009/7/7/175 Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing David Howells in case he has more insight. BTW: In case it is not clear, the rationale as I understand it is we worry about the case where one cpu reorders the read to be after the -release(), and another cpu might grab the memory that was kfree()'d within the -release() and scribble something else on it before the read completes. I know rmb() typically needs to be paired with wmb() to be correct, so you are probably right to say that the rmb() itself is not appropriate. This problem in general makes my head hurt, which is why I said I am not 100% sure of what is required. As David mentions, perhaps smp_mb() is more appropriate for this application. I also speculate barrier() may be all that we need. barrier() is the operation for a compiler barrier. But it's unneeded here - unless the compiler can prove that -release(intf) will not modify intf-owner it is not allowed to move the access afterwards. An indirect function call is generally a barrier() since the compiler can't assume memory has not been modified. You're logic seems reasonable to me. I will defer to David, since he brought up the issue with the similar logic originally. Kind Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Gregory Haskins wrote: Avi Kivity wrote: On 10/06/2009 04:22 PM, Gregory Haskins wrote: + +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of(kref, struct kvm_xinterface, kref); + +owner = intf-owner; +rmb(); Why rmb? the intf-ops-release() line may invalidate the intf pointer, so we want to ensure that the read completes before the release() is called. TBH: I'm not 100% its needed, but I was being conservative. rmb()s are only needed if an external agent can issue writes, otherwise you'd need one after every statement. I was following lessons learned here: http://lkml.org/lkml/2009/7/7/175 Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing David Howells in case he has more insight. BTW: In case it is not clear, the rationale as I understand it is we worry about the case where one cpu reorders the read to be after the -release(), and another cpu might grab the memory that was kfree()'d within the -release() and scribble something else on it before the read completes. I know rmb() typically needs to be paired with wmb() to be correct, so you are probably right to say that the rmb() itself is not appropriate. This problem in general makes my head hurt, which is why I said I am not 100% sure of what is required. As David mentions, perhaps smp_mb() is more appropriate for this application. I also speculate barrier() may be all that we need. barrier() is the operation for a compiler barrier. But it's unneeded here - unless the compiler can prove that -release(intf) will not modify intf-owner it is not allowed to move the access afterwards. An indirect function call is generally a barrier() since the compiler can't assume memory has not been modified. You're logic gak. or your logic even. seems reasonable to me. I will defer to David, since he brought up the issue with the similar logic originally. Kind Regards, -Greg signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Gregory Haskins wrote: Gregory Haskins wrote: Avi Kivity wrote: On 10/06/2009 04:22 PM, Gregory Haskins wrote: + +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of(kref, struct kvm_xinterface, kref); + +owner = intf-owner; +rmb(); Why rmb? the intf-ops-release() line may invalidate the intf pointer, so we want to ensure that the read completes before the release() is called. TBH: I'm not 100% its needed, but I was being conservative. rmb()s are only needed if an external agent can issue writes, otherwise you'd need one after every statement. I was following lessons learned here: http://lkml.org/lkml/2009/7/7/175 Perhaps mb() or barrier() are more appropriate than rmb()? I'm CC'ing David Howells in case he has more insight. BTW: In case it is not clear, the rationale as I understand it is we worry about the case where one cpu reorders the read to be after the -release(), and another cpu might grab the memory that was kfree()'d within the -release() and scribble something else on it before the read completes. I know rmb() typically needs to be paired with wmb() to be correct, so you are probably right to say that the rmb() itself is not appropriate. This problem in general makes my head hurt, which is why I said I am not 100% sure of what is required. As David mentions, perhaps smp_mb() is more appropriate for this application. I also speculate barrier() may be all that we need. barrier() is the operation for a compiler barrier. But it's unneeded here - unless the compiler can prove that -release(intf) will not modify intf-owner it is not allowed to move the access afterwards. An indirect function call is generally a barrier() since the compiler can't assume memory has not been modified. You're logic gak. or your logic even. seems reasonable to me. I will defer to David, since he brought up the issue with the similar logic originally. Kind Regards, -Greg Thinking about this some more over lunch, I think we (Avi and I) might both be wrong (and David is right). Avi is right that we don't need rmb() or barrier() for the reasons already stated, but I think David is right that we need an smp_mb() to ensure the cpu doesn't do the reordering. Otherwise a different cpu could invalidate the memory if it reuses the freed memory in the meantime, iiuc. IOW: its not a compiler issue but a cpu issue. Or am I still confused? -Greg signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Hi Marcelo! Marcelo Tosatti wrote: On Fri, Oct 02, 2009 at 04:19:27PM -0400, Gregory Haskins wrote: What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time. Why: There are various subsystems that would like to interact with a KVM guest which are ideally suited to exist outside the domain of the kvm.ko core logic. For instance, external pci-passthrough, virtual-bus, and virtio-net modules are currently under development. In order for these modules to successfully interact with the guest, they need, at the very least, various interfaces for signaling IO events, pointer translation, and possibly memory mapping. The signaling case is covered by the recent introduction of the irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the other cases. Note that today we only expose pointer-translation related functions, but more could be added at a future date as needs arise. Example usage: QEMU instantiates a guest, and an external module foo that desires the ability to interface with the guest (say via open(/dev/foo)). QEMU may then pass the kvmfd to foo via an ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd). Upon receipt, the foo module can issue kvm_xinterface_bind(kvmfd) to acquire the proper context. Internally, the struct kvm* and associated struct module* will remain pinned at least until the foo module calls kvm_xinterface_put(). --- /dev/null +++ b/virt/kvm/xinterface.c @@ -0,0 +1,409 @@ +/* + * KVM module interface - Allows external modules to interface with a guest + * + * Copyright 2009 Novell. All Rights Reserved. + * + * Author: + * Gregory Haskins ghask...@novell.com + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +#include linux/mm.h +#include linux/vmalloc.h +#include linux/highmem.h +#include linux/module.h +#include linux/mmu_context.h +#include linux/kvm_host.h +#include linux/kvm_xinterface.h + +struct _xinterface { +struct kvm *kvm; +struct task_struct *task; +struct mm_struct *mm; +struct kvm_xinterface intf; +struct kvm_memory_slot *slotcache[NR_CPUS]; +}; + +struct _xvmap { +struct kvm_memory_slot*memslot; +unsigned long npages; +struct kvm_xvmap vmap; +}; + +static struct _xinterface * +to_intf(struct kvm_xinterface *intf) +{ +return container_of(intf, struct _xinterface, intf); +} + +#define _gfn_to_hva(gfn, memslot) \ +(memslot-userspace_addr + (gfn - memslot-base_gfn) * PAGE_SIZE) + +/* + * gpa_to_hva() - translate a guest-physical to host-virtual using + * a per-cpu cache of the memslot. + * + * The gfn_to_memslot() call is relatively expensive, and the gpa access + * patterns exhibit a high degree of locality. 
Therefore, lets cache + * the last slot used on a per-cpu basis to optimize the lookup + * + * assumes slots_lock held for read + */ +static unsigned long +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa) +{ +int cpu = get_cpu(); +unsigned long gfn = gpa PAGE_SHIFT; +struct kvm_memory_slot *memslot = _intf-slotcache[cpu]; +unsigned long addr= 0; + +if (!memslot +|| gfn memslot-base_gfn +|| gfn = memslot-base_gfn + memslot-npages) { + +memslot = gfn_to_memslot(_intf-kvm, gfn); +if (!memslot) +goto out; + +_intf-slotcache[cpu] = memslot; +} + +addr = _gfn_to_hva(gfn, memslot) + offset_in_page(gpa); + +out: +put_cpu(); + +return addr; Please optimize gfn_to_memslot() instead, so everybody benefits. It shows very often on profiles. Yeah, its not a bad idea. The reason why I did it here is because the requirements for sync (kvm-vcpu) vs async (xinterface) access is slightly different. Sync is probably optimal with per-vcpu caching, whereas async is optimal with per-cpu. That said, we could probably build the entire algorithm to be per-cpu as a compromise and still gain benefits
Re: [PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
Avi Kivity wrote: On 10/02/2009 10:19 PM, Gregory Haskins wrote: What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time. If this is needed, it should be done as a virt_address_space to which kvm and other modules bind, instead of as something that kvm exports and other modules import. The virt_address_space can be identified by an fd and passed around to kvm and other modules. IIUC, what you are proposing is something similar to generalizing the vbus::memctx object. I had considered doing something like that in the early design phase of vbus, but then decided it would be a hard-sell to the mm crowd, and difficult to generalize. What do you propose as the interface to program the object? Why: There are various subsystems that would like to interact with a KVM guest which are ideally suited to exist outside the domain of the kvm.ko core logic. For instance, external pci-passthrough, virtual-bus, and virtio-net modules are currently under development. In order for these modules to successfully interact with the guest, they need, at the very least, various interfaces for signaling IO events, pointer translation, and possibly memory mapping. The signaling case is covered by the recent introduction of the irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the other cases. Note that today we only expose pointer-translation related functions, but more could be added at a future date as needs arise. Example usage: QEMU instantiates a guest, and an external module foo that desires the ability to interface with the guest (say via open(/dev/foo)). QEMU may then pass the kvmfd to foo via an ioctl, such as: ioctl(foofd, FOO_SET_VMID,kvmfd). Upon receipt, the foo module can issue kvm_xinterface_bind(kvmfd) to acquire the proper context. Internally, the struct kvm* and associated struct module* will remain pinned at least until the foo module calls kvm_xinterface_put(). So, under my suggestion above, you'd call sys_create_virt_address_space(), populate it, and pass the result to kvm and to foo. This allows the use of virt_address_space without kvm and doesn't require foo to interact with kvm. The problem I see here is that the only way I can think to implement this generally is something that looks very kvm-esque (slots-to-pages kind of translation). Is there a way you can think of that does not involve a kvm.ko originated vtable that is also not kvm centric? +struct kvm_xinterface_ops { +unsigned long (*copy_to)(struct kvm_xinterface *intf, + unsigned long gpa, const void *src, + unsigned long len); +unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst, + unsigned long gpa, unsigned long len); +struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf, + unsigned long gpa, + unsigned long len); How would vmap() work with live migration? vmap represents shmem regions, and is a per-guest-instance resource. So my plan there is that the new and old guest instance would each have the vmap region instated at the same GPA location (assumption: gpas are stable across migration), and any state relevant data local to the shmem (like ring head/tail position) is conveyed in the serialized stream for the device model. 
+ +static inline void +_kvm_xinterface_release(struct kref *kref) +{ +struct kvm_xinterface *intf; +struct module *owner; + +intf = container_of(kref, struct kvm_xinterface, kref); + +owner = intf-owner; +rmb(); Why rmb? the intf-ops-release() line may invalidate the intf pointer, so we want to ensure that the read completes before the release() is called. TBH: I'm not 100% its needed, but I was being conservative. + +intf-ops-release(intf); +module_put(owner); +} + + +/* + * gpa_to_hva() - translate a guest-physical to host-virtual using + * a per-cpu cache of the memslot. + * + * The gfn_to_memslot() call is relatively expensive, and the gpa access + * patterns exhibit a high degree of locality. Therefore, lets cache + * the last slot used on a per-cpu basis to optimize the lookup + * + * assumes slots_lock held for read + */ +static unsigned long +gpa_to_hva(struct _xinterface *_intf, unsigned long gpa) +{ +int cpu = get_cpu(); +unsigned long gfn = gpa PAGE_SHIFT; +struct kvm_memory_slot *memslot = _intf-slotcache[cpu]; +unsigned long addr= 0; + +if (!memslot +|| gfn memslot-base_gfn +|| gfn= memslot-base_gfn + memslot-npages) { + +memslot
Re: [PATCH v2 4/4] KVM: add scatterlist support to xinterface
Avi Kivity wrote: On 10/02/2009 10:19 PM, Gregory Haskins wrote:

This allows a scatter-gather approach to IO, which will be useful for building high performance interfaces, like zero-copy and low-latency copy (avoiding multiple calls to copy_to/from). The interface is based on the existing scatterlist infrastructure. The caller is expected to pass in a scatterlist with its dma field populated with valid GPAs. The xinterface will then populate each entry by translating the GPA to a page*. The caller signifies completion by simply performing a put_page() on each page returned in the list.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_xinterface.h |    4 ++
 virt/kvm/xinterface.c          |   72
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 684b6f8..eefb575 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -9,6 +9,7 @@
 #include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/file.h>
+#include <linux/scatterlist.h>

 struct kvm_xinterface;
 struct kvm_xvmap;
@@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
                                   u64 addr,
                                   unsigned long len,
                                   unsigned long flags);
+        unsigned long (*sgmap)(struct kvm_xinterface *intf,
+                               struct scatterlist *sgl, int nents,
+                               unsigned long flags);
         void (*release)(struct kvm_xinterface *);
 };

diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index c356835..16729f6 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -467,6 +467,77 @@ fail:
 }

+static unsigned long
+xinterface_sgmap(struct kvm_xinterface *intf,
+                 struct scatterlist *sgl, int nents,
+                 unsigned long flags)
+{
+        struct _xinterface     *_intf   = to_intf(intf);
+        struct task_struct     *p       = _intf->task;
+        struct mm_struct       *mm      = _intf->mm;
+        struct kvm             *kvm     = _intf->kvm;
+        struct kvm_memory_slot *memslot = NULL;
+        bool                    kthread = !current->mm;
+        int                     ret;
+        struct scatterlist     *sg;
+        int                     i;
+
+        down_read(&kvm->slots_lock);
+
+        if (kthread)
+                use_mm(_intf->mm);
+
+        for_each_sg(sgl, sg, nents, i) {
+                unsigned long gpa    = sg_dma_address(sg);
+                unsigned long len    = sg_dma_len(sg);
+                unsigned long gfn    = gpa >> PAGE_SHIFT;
+                off_t         offset = offset_in_page(gpa);
+                unsigned long hva;
+                struct page  *pg;
+
+                /* ensure that we do not have more than one page per entry */
+                if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
+                        ret = -EINVAL;
+                        break;
+                }
+
+                /* check for a memslot-cache miss */
+                if (!memslot
+                    || gfn < memslot->base_gfn
+                    || gfn >= memslot->base_gfn + memslot->npages) {
+                        memslot = gfn_to_memslot(kvm, gfn);
+                        if (!memslot) {
+                                ret = -EFAULT;
+                                break;
+                        }
+                }
+
+                hva = (memslot->userspace_addr +
+                       (gfn - memslot->base_gfn) * PAGE_SIZE);
+
+                if (kthread || current->mm == mm)
+                        ret = get_user_pages_fast(hva, 1, 1, &pg);
+                else
+                        ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);

One of these needs the mm semaphore.

Indeed. Good catch.

+
+                if (ret != 1) {
+                        if (ret >= 0)
+                                ret = -EFAULT;
+                        break;
+                }
+
+                sg_set_page(sg, pg, len, offset);
+                ret = 0;
+        }
+
+        if (kthread)
+                unuse_mm(_intf->mm);
+
+        up_read(&kvm->slots_lock);
+
+        return ret;
+}
+
 static void
 xinterface_release(struct kvm_xinterface *intf)
 {
@@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
         .copy_from = xinterface_copy_from,
         .vmap      = xinterface_vmap,
         .ioevent   = xinterface_ioevent,
+        .sgmap     = xinterface_sgmap,
         .release   = xinterface_release,
 };
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
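For reference, a hedged sketch of the caller-side contract spelled out in the changelog: GPAs go in through the dma fields, translated pages come back, and the caller put_page()s them when done. The function below is illustrative; only the ops->sgmap() call and the put_page() obligation come from the patch:

/* Illustrative caller; 'intf' was obtained via kvm_xinterface_bind(). */
static long foo_pin_guest_buffers(struct kvm_xinterface *intf,
                                  unsigned long *gpas, unsigned int *lens,
                                  struct scatterlist *sgl, int nents)
{
        long ret;
        int i;

        sg_init_table(sgl, nents);

        /* input: the dma fields carry GPAs, per the changelog */
        for (i = 0; i < nents; i++) {
                sg_dma_address(&sgl[i]) = gpas[i];
                sg_dma_len(&sgl[i])     = lens[i];
        }

        ret = intf->ops->sgmap(intf, sgl, nents, 0);
        if (ret)
                return ret;

        /* ... perform zero-copy IO against the translated pages ... */

        /* completion: drop each page reference, per the changelog */
        for (i = 0; i < nents; i++)
                put_page(sg_page(&sgl[i]));

        return 0;
}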
[PATCH v2 0/4] KVM: xinterface
(Applies to kvm.git/master:083e9e10) For details, please read the patch headers. [ Changelog: v2: *) We now re-use the vmfd as the binding token, instead of creating a new separate namespace *) Added support for switch_to(mm), which is much faster *) Added support for memslot-cache for exploiting slot locality *) Added support for scatter-gather access *) Added support for xioevent interface v1: *) Initial release ] This series is included in upstream AlacrityVM and is well tested and known to work properly. Comments? Kind Regards, -Greg --- Gregory Haskins (4): KVM: add scatterlist support to xinterface KVM: add io services to xinterface KVM: introduce xinterface API for external interaction with guests mm: export use_mm() and unuse_mm() to modules arch/x86/kvm/Makefile |2 include/linux/kvm_host.h |3 include/linux/kvm_xinterface.h | 165 +++ kernel/fork.c |1 mm/mmu_context.c |3 virt/kvm/kvm_main.c| 24 ++ virt/kvm/xinterface.c | 587 7 files changed, 784 insertions(+), 1 deletions(-) create mode 100644 include/linux/kvm_xinterface.h create mode 100644 virt/kvm/xinterface.c -- Signature -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/4] mm: export use_mm() and unuse_mm() to modules
We want to use these functions from within KVM, which may be built as a module.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 mm/mmu_context.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index ded9081..f31ba20 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -6,6 +6,7 @@
 #include <linux/mm.h>
 #include <linux/mmu_context.h>
 #include <linux/sched.h>
+#include <linux/module.h>
 #include <asm/mmu_context.h>

@@ -37,6 +38,7 @@ void use_mm(struct mm_struct *mm)
         if (active_mm != mm)
                 mmdrop(active_mm);
 }
+EXPORT_SYMBOL_GPL(use_mm);

 /*
  * unuse_mm
@@ -56,3 +58,4 @@ void unuse_mm(struct mm_struct *mm)
         enter_lazy_tlb(mm, tsk);
         task_unlock(tsk);
 }
+EXPORT_SYMBOL_GPL(unuse_mm);
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
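For context, the pattern these exports enable looks roughly like the following sketch: a kernel thread temporarily adopts another process's mm so that user-address operations resolve against that address space. The worker function is illustrative, not from the series:

#include <linux/mm.h>
#include <linux/mmu_context.h>
#include <linux/uaccess.h>

/* Illustrative worker: runs in a kthread (current->mm == NULL) and
 * borrows a target mm so user-address accesses resolve against it. */
static int copy_from_target(struct mm_struct *mm,
                            void *dst, const void __user *src, size_t len)
{
        int ret = 0;

        use_mm(mm);                     /* adopt the target address space */
        if (copy_from_user(dst, src, len))
                ret = -EFAULT;
        unuse_mm(mm);                   /* drop back to lazy-tlb state */

        return ret;
}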
[PATCH v2 2/4] KVM: introduce xinterface API for external interaction with guests
What: xinterface is a mechanism that allows kernel modules external to the kvm.ko proper to interface with a running guest. It accomplishes this by creating an abstracted interface which does not expose any private details of the guest or its related KVM structures, and provides a mechanism to find and bind to this interface at run-time. Why: There are various subsystems that would like to interact with a KVM guest which are ideally suited to exist outside the domain of the kvm.ko core logic. For instance, external pci-passthrough, virtual-bus, and virtio-net modules are currently under development. In order for these modules to successfully interact with the guest, they need, at the very least, various interfaces for signaling IO events, pointer translation, and possibly memory mapping. The signaling case is covered by the recent introduction of the irqfd/ioeventfd mechanisms. This patch provides a mechanism to cover the other cases. Note that today we only expose pointer-translation related functions, but more could be added at a future date as needs arise. Example usage: QEMU instantiates a guest, and an external module foo that desires the ability to interface with the guest (say via open(/dev/foo)). QEMU may then pass the kvmfd to foo via an ioctl, such as: ioctl(foofd, FOO_SET_VMID, kvmfd). Upon receipt, the foo module can issue kvm_xinterface_bind(kvmfd) to acquire the proper context. Internally, the struct kvm* and associated struct module* will remain pinned at least until the foo module calls kvm_xinterface_put(). Signed-off-by: Gregory Haskins ghask...@novell.com --- arch/x86/kvm/Makefile |2 include/linux/kvm_host.h |3 include/linux/kvm_xinterface.h | 114 +++ kernel/fork.c |1 virt/kvm/kvm_main.c| 24 ++ virt/kvm/xinterface.c | 409 6 files changed, 552 insertions(+), 1 deletions(-) create mode 100644 include/linux/kvm_xinterface.h create mode 100644 virt/kvm/xinterface.c diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index 31a7035..0449d6e 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -7,7 +7,7 @@ CFLAGS_vmx.o := -I. kvm-y += $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ coalesced_mmio.o irq_comm.o eventfd.o \ - assigned-dev.o) + assigned-dev.o xinterface.o) kvm-$(CONFIG_IOMMU_API)+= $(addprefix ../../../virt/kvm/, iommu.o) kvm-y += x86.o mmu.o emulate.o i8259.o irq.o lapic.o \ diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index b985a29..7cc1afb 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -362,6 +362,9 @@ void kvm_arch_sync_events(struct kvm *kvm); int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu); void kvm_vcpu_kick(struct kvm_vcpu *vcpu); +struct kvm_xinterface * +kvm_xinterface_alloc(struct kvm *kvm, struct module *owner); + int kvm_is_mmio_pfn(pfn_t pfn); struct kvm_irq_ack_notifier { diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h new file mode 100644 index 000..01f092b --- /dev/null +++ b/include/linux/kvm_xinterface.h @@ -0,0 +1,114 @@ +#ifndef __KVM_XINTERFACE_H +#define __KVM_XINTERFACE_H + +/* + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ */
+
+#include <linux/kref.h>
+#include <linux/module.h>
+#include <linux/file.h>
+
+struct kvm_xinterface;
+struct kvm_xvmap;
+
+struct kvm_xinterface_ops {
+        unsigned long (*copy_to)(struct kvm_xinterface *intf,
+                                 unsigned long gpa, const void *src,
+                                 unsigned long len);
+        unsigned long (*copy_from)(struct kvm_xinterface *intf, void *dst,
+                                   unsigned long gpa, unsigned long len);
+        struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
+                                  unsigned long gpa,
+                                  unsigned long len);
+        void (*release)(struct kvm_xinterface *);
+};
+
+struct kvm_xinterface {
+        struct module *owner;
+        struct kref kref;
+        const struct kvm_xinterface_ops *ops;
+};
+
+static inline void
+kvm_xinterface_get(struct kvm_xinterface *intf)
+{
+        kref_get(&intf->kref);
+}
+
+static inline void
+_kvm_xinterface_release(struct kref *kref)
+{
+        struct kvm_xinterface *intf;
+        struct module *owner;
+
+        intf = container_of(kref, struct kvm_xinterface, kref);
+
+        owner = intf->owner;
+        rmb();
+
+        intf->ops->release(intf);
+        module_put(owner);
+}
+
+static inline void
+kvm_xinterface_put(struct kvm_xinterface *intf)
+{
+        kref_put(&intf->kref, _kvm_xinterface_release);
+}
+
+struct kvm_xvmap_ops {
+        void (*release)(struct
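To make the changelog's lifetime rule concrete, here is a hedged sketch of the consumer side, using the hypothetical foo module from the example usage above; the ERR_PTR return convention for kvm_xinterface_bind() and the bytes-not-copied convention for copy_from() are assumptions, since neither is spelled out in the quoted hunks:

#include <linux/err.h>
#include <linux/kvm_xinterface.h>

struct foo_priv {
        struct kvm_xinterface *intf;
};

static int foo_set_vmid(struct foo_priv *priv, int kvmfd)
{
        struct kvm_xinterface *intf;

        /* Pins the struct kvm and kvm.ko until the final put. */
        intf = kvm_xinterface_bind(kvmfd);
        if (IS_ERR(intf))
                return PTR_ERR(intf);

        priv->intf = intf;
        return 0;
}

static int foo_read_guest(struct foo_priv *priv, void *dst,
                          unsigned long gpa, unsigned long len)
{
        /* Assumed to mirror copy_from_user(): returns bytes NOT copied. */
        if (priv->intf->ops->copy_from(priv->intf, dst, gpa, len))
                return -EFAULT;
        return 0;
}

static void foo_detach(struct foo_priv *priv)
{
        /* Final put runs ops->release() and then module_put(owner),
         * unpinning kvm.ko, per _kvm_xinterface_release() above. */
        kvm_xinterface_put(priv->intf);
        priv->intf = NULL;
}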
[PATCH v2 3/4] KVM: add io services to xinterface
We want to add a more efficient way to get PIO signals out of the guest, so we add an xioevent interface. This allows a client to register for notifications when a specific MMIO/PIO address is touched by the guest. This is an alternative interface to ioeventfd, which is performance-limited by io-bus scaling and its eventfd wait-queue based notification mechanism. This also has the advantage of retaining the full PIO data payload and passing it to the recipient.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_xinterface.h |   47 ++
 virt/kvm/xinterface.c          |  106
 2 files changed, 153 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 01f092b..684b6f8 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -12,6 +12,16 @@
 struct kvm_xinterface;
 struct kvm_xvmap;
+struct kvm_xioevent;
+
+enum {
+        kvm_xioevent_flag_nr_pio,
+        kvm_xioevent_flag_nr_max,
+};
+
+#define KVM_XIOEVENT_FLAG_PIO        (1 << kvm_xioevent_flag_nr_pio)
+
+#define KVM_XIOEVENT_VALID_FLAG_MASK ((1 << kvm_xioevent_flag_nr_max) - 1)

 struct kvm_xinterface_ops {
         unsigned long (*copy_to)(struct kvm_xinterface *intf,
@@ -22,6 +32,10 @@ struct kvm_xinterface_ops {
         struct kvm_xvmap* (*vmap)(struct kvm_xinterface *intf,
                                   unsigned long gpa,
                                   unsigned long len);
+        struct kvm_xioevent* (*ioevent)(struct kvm_xinterface *intf,
+                                        u64 addr,
+                                        unsigned long len,
+                                        unsigned long flags);
         void (*release)(struct kvm_xinterface *);
 };

@@ -109,6 +123,39 @@ kvm_xvmap_put(struct kvm_xvmap *vmap)
         kref_put(&vmap->kref, _kvm_xvmap_release);
 }

+struct kvm_xioevent_ops {
+        void (*deassign)(struct kvm_xioevent *ioevent);
+};
+
+struct kvm_xioevent {
+        const struct kvm_xioevent_ops *ops;
+        struct kvm_xinterface         *intf;
+        void (*signal)(struct kvm_xioevent *ioevent, const void *val);
+        void                          *priv;
+};
+
+static inline void
+kvm_xioevent_init(struct kvm_xioevent *ioevent,
+                  const struct kvm_xioevent_ops *ops,
+                  struct kvm_xinterface *intf)
+{
+        memset(ioevent, 0, sizeof(*ioevent));
+        ioevent->ops  = ops;
+        ioevent->intf = intf;
+
+        kvm_xinterface_get(intf);
+}
+
+static inline void
+kvm_xioevent_deassign(struct kvm_xioevent *ioevent)
+{
+        struct kvm_xinterface *intf = ioevent->intf;
+        rmb();
+
+        ioevent->ops->deassign(ioevent);
+        kvm_xinterface_put(intf);
+}
+
 struct kvm_xinterface *kvm_xinterface_bind(int fd);

 #endif /* __KVM_XINTERFACE_H */
diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index 3b586c5..c356835 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -28,6 +28,8 @@
 #include <linux/kvm_host.h>
 #include <linux/kvm_xinterface.h>

+#include "iodev.h"
+
 struct _xinterface {
         struct kvm             *kvm;
         struct task_struct     *task;
@@ -42,6 +44,14 @@ struct _xvmap {
         struct kvm_xvmap vmap;
 };

+struct _ioevent {
+        u64                  addr;
+        int                  length;
+        struct kvm_io_bus   *bus;
+        struct kvm_io_device dev;
+        struct kvm_xioevent  ioevent;
+};
+
 static struct _xinterface *
 to_intf(struct kvm_xinterface *intf)
 {
@@ -362,6 +372,101 @@ fail:
         return ERR_PTR(ret);
 }

+/* MMIO/PIO writes trigger an event if the addr/val match */
+static int
+ioevent_write(struct kvm_io_device *dev, gpa_t addr, int len, const void *val)
+{
+        struct _ioevent     *p       = container_of(dev, struct _ioevent, dev);
+        struct kvm_xioevent *ioevent = &p->ioevent;
+
+        if (!(addr == p->addr && len == p->length))
+                return -EOPNOTSUPP;
+
+        if (!ioevent->signal)
+                return 0;
+
+        ioevent->signal(ioevent, val);
+        return 0;
+}
+
+static const struct kvm_io_device_ops ioevent_device_ops = {
+        .write = ioevent_write,
+};
+
+static void
+ioevent_deassign(struct kvm_xioevent *ioevent)
+{
+        struct _ioevent    *p     = container_of(ioevent, struct _ioevent, ioevent);
+        struct _xinterface *_intf = to_intf(ioevent->intf);
+        struct kvm         *kvm   = _intf->kvm;
+
+        kvm_io_bus_unregister_dev(kvm, p->bus, &p->dev);
+        kfree(p);
+}
+
+static const struct kvm_xioevent_ops ioevent_intf_ops = {
+        .deassign = ioevent_deassign,
+};
+
+static struct kvm_xioevent*
+xinterface_ioevent(struct kvm_xinterface *intf,
+                   u64 addr,
+                   unsigned long len,
+                   unsigned long flags)
+{
+        struct _xinterface *_intf = to_intf(intf);
+        struct kvm *kvm = _intf->kvm;
+        int pio = flags
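A hedged sketch of how a client would consume this interface, based on the ops and callback fields shown above; the port number, payload width, and handler are illustrative, and the ERR_PTR error convention is an assumption:

/* Illustrative consumer: watch a PIO port and receive the write payload. */
static void foo_pio_signal(struct kvm_xioevent *ioevent, const void *val)
{
        /* Unlike ioeventfd, the full data payload is delivered. */
        u16 data = *(const u16 *)val;

        pr_debug("guest wrote %u\n", data);
}

static struct kvm_xioevent *foo_watch_port(struct kvm_xinterface *intf)
{
        struct kvm_xioevent *ioevent;

        /* 0xc200 is an arbitrary example port; length 2 matches the
         * u16 payload the handler expects. */
        ioevent = intf->ops->ioevent(intf, 0xc200, 2, KVM_XIOEVENT_FLAG_PIO);
        if (IS_ERR(ioevent))
                return ioevent;

        ioevent->signal = foo_pio_signal;
        return ioevent;
}

/* Teardown: kvm_xioevent_deassign() unregisters the device from the io
 * bus and drops the interface reference, per the header above. */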
[PATCH v2 4/4] KVM: add scatterlist support to xinterface
This allows a scatter-gather approach to IO, which will be useful for building high performance interfaces, like zero-copy and low-latency copy (avoiding multiple calls to copy_to/from). The interface is based on the existing scatterlist infrastructure. The caller is expected to pass in a scatterlist with its dma field populated with valid GPAs. The xinterface will then populate each entry by translating the GPA to a page*. The caller signifies completion by simply performing a put_page() on each page returned in the list.

Signed-off-by: Gregory Haskins ghask...@novell.com
---

 include/linux/kvm_xinterface.h |    4 ++
 virt/kvm/xinterface.c          |   72
 2 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_xinterface.h b/include/linux/kvm_xinterface.h
index 684b6f8..eefb575 100644
--- a/include/linux/kvm_xinterface.h
+++ b/include/linux/kvm_xinterface.h
@@ -9,6 +9,7 @@
 #include <linux/kref.h>
 #include <linux/module.h>
 #include <linux/file.h>
+#include <linux/scatterlist.h>

 struct kvm_xinterface;
 struct kvm_xvmap;
@@ -36,6 +37,9 @@ struct kvm_xinterface_ops {
                                   u64 addr,
                                   unsigned long len,
                                   unsigned long flags);
+        unsigned long (*sgmap)(struct kvm_xinterface *intf,
+                               struct scatterlist *sgl, int nents,
+                               unsigned long flags);
         void (*release)(struct kvm_xinterface *);
 };

diff --git a/virt/kvm/xinterface.c b/virt/kvm/xinterface.c
index c356835..16729f6 100644
--- a/virt/kvm/xinterface.c
+++ b/virt/kvm/xinterface.c
@@ -467,6 +467,77 @@ fail:
 }

+static unsigned long
+xinterface_sgmap(struct kvm_xinterface *intf,
+                 struct scatterlist *sgl, int nents,
+                 unsigned long flags)
+{
+        struct _xinterface     *_intf   = to_intf(intf);
+        struct task_struct     *p       = _intf->task;
+        struct mm_struct       *mm      = _intf->mm;
+        struct kvm             *kvm     = _intf->kvm;
+        struct kvm_memory_slot *memslot = NULL;
+        bool                    kthread = !current->mm;
+        int                     ret;
+        struct scatterlist     *sg;
+        int                     i;
+
+        down_read(&kvm->slots_lock);
+
+        if (kthread)
+                use_mm(_intf->mm);
+
+        for_each_sg(sgl, sg, nents, i) {
+                unsigned long gpa    = sg_dma_address(sg);
+                unsigned long len    = sg_dma_len(sg);
+                unsigned long gfn    = gpa >> PAGE_SHIFT;
+                off_t         offset = offset_in_page(gpa);
+                unsigned long hva;
+                struct page  *pg;
+
+                /* ensure that we do not have more than one page per entry */
+                if ((PAGE_ALIGN(len + offset) >> PAGE_SHIFT) != 1) {
+                        ret = -EINVAL;
+                        break;
+                }
+
+                /* check for a memslot-cache miss */
+                if (!memslot
+                    || gfn < memslot->base_gfn
+                    || gfn >= memslot->base_gfn + memslot->npages) {
+                        memslot = gfn_to_memslot(kvm, gfn);
+                        if (!memslot) {
+                                ret = -EFAULT;
+                                break;
+                        }
+                }
+
+                hva = (memslot->userspace_addr +
+                       (gfn - memslot->base_gfn) * PAGE_SIZE);
+
+                if (kthread || current->mm == mm)
+                        ret = get_user_pages_fast(hva, 1, 1, &pg);
+                else
+                        ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
+
+                if (ret != 1) {
+                        if (ret >= 0)
+                                ret = -EFAULT;
+                        break;
+                }
+
+                sg_set_page(sg, pg, len, offset);
+                ret = 0;
+        }
+
+        if (kthread)
+                unuse_mm(_intf->mm);
+
+        up_read(&kvm->slots_lock);
+
+        return ret;
+}
+
 static void
 xinterface_release(struct kvm_xinterface *intf)
 {
@@ -483,6 +554,7 @@ struct kvm_xinterface_ops _xinterface_ops = {
         .copy_from = xinterface_copy_from,
         .vmap      = xinterface_vmap,
         .ioevent   = xinterface_ioevent,
+        .sgmap     = xinterface_sgmap,
         .release   = xinterface_release,
 };
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
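Regarding the review catch on this hunk ("One of these needs the mm semaphore"): the foreign-mm get_user_pages() path must run under that mm's mmap_sem. A hedged sketch of the corrected branch, against the GUP signatures of this era:

                if (kthread || current->mm == mm) {
                        /* the fast path takes mmap_sem internally as needed */
                        ret = get_user_pages_fast(hva, 1, 1, &pg);
                } else {
                        /* foreign mm: we must hold its mmap_sem ourselves */
                        down_read(&mm->mmap_sem);
                        ret = get_user_pages(p, mm, hva, 1, 1, 0, &pg, NULL);
                        up_read(&mm->mmap_sem);
                }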
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/30/2009 10:04 PM, Gregory Haskins wrote: A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work over vbus. Binary compatibility with existing virtio drivers, while nice to have, is not a specific requirement nor goal. We will simply load an updated KMP/MSI into those guests and they will work again. As previously discussed, this is how more or less any system works today. It's like we are removing an old adapter card and adding a new one to uprev the silicon. Virtualization is about not doing that. Sometimes it's necessary (when you have made unfixable design mistakes), but just to replace a bus, with no advantages to the guest that has to be changed (other hypervisors or hypervisorless deployment scenarios aren't). The problem is that your continued assertion that there is no advantage to the guest is a completely unsubstantiated claim. As it stands right now, I have a public git tree that, to my knowledge, is the fastest KVM PV networking implementation around. It also has capabilities that are demonstrably not found elsewhere, such as the ability to render generic shared-memory interconnects (scheduling, timers), interrupt-priority (qos), and interrupt-coalescing (exit-ratio reduction). I designed each of these capabilities after carefully analyzing where KVM was coming up short. Those are facts. I can't easily prove which of my new features alone are what makes it special per se, because I don't have unit tests for each part that breaks it down. What I _can_ state is that its the fastest and most feature rich KVM-PV tree that I am aware of, and others may download and test it themselves to verify my claims. The disproof, on the other hand, would be in a counter example that still meets all the performance and feature criteria under all the same conditions while maintaining the existing ABI. To my knowledge, this doesn't exist. Therefore, if you believe my work is irrelevant, show me a git tree that accomplishes the same feats in a binary compatible way, and I'll rethink my position. Until then, complaining about lack of binary compatibility is pointless since it is not an insurmountable proposition, and the one and only available solution declares it a required casualty. Further, non-shmem virtio can't work over vbus. Actually I misspoke earlier when I said virtio works over non-shmem. Thinking about it some more, both virtio and vbus fundamentally require shared-memory, since sharing their metadata concurrently on both sides is their raison d'être. The difference is that virtio utilizes a pre-translation/mapping (via -add_buf) from the guest side. OTOH, vbus uses a post translation scheme (via memctx) from the host-side. If anything, vbus is actually more flexible because it doesn't assume the entire guest address space is directly mappable. In summary, your statement is incorrect (though it is my fault for putting that idea in your head). Well, Xen requires pre-translation (since the guest has to give the host (which is just another guest) permissions to access the data). Actually I am not sure that it does require pre-translation. You might be able to use the memctx-copy_to/copy_from scheme in post translation as well, since those would be able to communicate to something like the xen kernel. But I suppose either method would result in extra exits, so there is no distinct benefit using vbus there..as you say below they're just different. 
The biggest difference is that my proposed model gets around the notion that the entire guest address space can be represented by an arbitrary pointer. For instance, the copy_to/copy_from routines take a GPA, but may use something indirect like a DMA controller to access that GPA. On the other hand, virtio fully expects a viable pointer to come out of the interface iiuc. This is in part what makes vbus more adaptable to non-virt. So neither is a superset of the other, they're just different. It doesn't really matter since Xen is unlikely to adopt virtio. Agreed. An interesting thing here is that you don't even need a fancy multi-homed setup to see the effects of my exit-ratio reduction work: even single port configurations suffer from the phenomenon since many devices have multiple signal-flows (e.g. network adapters tend to have at least 3 flows: rx-ready, tx-complete, and control-events (link-state, etc). Whats worse, is that the flows often are indirectly related (for instance, many host adapters will free tx skbs during rx operations, so you tend to get bursts of tx-completes at the same time as rx-ready. If the flows map 1:1 with IDT, they will suffer the same problem. You can simply use the same vector for both rx and tx and poll both at every interrupt. Yes, but that has its own problems: e.g. additional exits or at least additional overhead figuring out what happens each time. This is even
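For concreteness, the shared-vector alternative suggested above reduces to an ISR that polls every flow on each interrupt; a sketch follows, with the device structure and all foo_* helpers hypothetical:

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct foo_nic {
        struct napi_struct napi;
        /* ... device state ... */
};

static bool foo_rx_pending(struct foo_nic *nic);
static bool foo_tx_pending(struct foo_nic *nic);
static bool foo_ctrl_pending(struct foo_nic *nic);
static void foo_reap_tx_completions(struct foo_nic *nic);
static void foo_handle_link_event(struct foo_nic *nic);

/* One vector for all flows: every interrupt pays the cost of checking
 * each flow, which is the per-interrupt overhead objected to above. */
static irqreturn_t foo_shared_isr(int irq, void *data)
{
        struct foo_nic *nic = data;
        irqreturn_t ret = IRQ_NONE;

        if (foo_rx_pending(nic)) {
                napi_schedule(&nic->napi);      /* rx-ready */
                ret = IRQ_HANDLED;
        }
        if (foo_tx_pending(nic)) {
                foo_reap_tx_completions(nic);   /* tx-complete */
                ret = IRQ_HANDLED;
        }
        if (foo_ctrl_pending(nic)) {
                foo_handle_link_event(nic);     /* control-events */
                ret = IRQ_HANDLED;
        }

        return ret;
}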
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/26/2009 12:32 AM, Gregory Haskins wrote: I realize in retrospect that my choice of words above implies vbus _is_ complete, but this is not what I was saying. What I was trying to convey is that vbus is _more_ complete. Yes, in either case some kind of glue needs to be written. The difference is that vbus implements more of the glue generally, and leaves less required to be customized for each iteration. No argument there. Since you care about non-virt scenarios and virtio doesn't, naturally vbus is a better fit for them as the code stands. Thanks for finally starting to acknowledge there's a benefit, at least. I think I've mentioned vbus' finer grained layers as helpful here, though I doubt the value of this. Hypervisors are added rarely, while devices and drivers are added (and modified) much more often. I don't buy the anything-to-anything promise. The ease in which a new hypervisor should be able to integrate into the stack is only one of vbus's many benefits. To be more precise, IMO virtio is designed to be a performance oriented ring-based driver interface that supports all types of hypervisors (e.g. shmem based kvm, and non-shmem based Xen). vbus is designed to be a high-performance generic shared-memory interconnect (for rings or otherwise) framework for environments where linux is the underpinning host (physical or virtual). They are distinctly different, but complementary (the former addresses the part of the front-end, and latter addresses the back-end, and a different part of the front-end). They're not truly complementary since they're incompatible. No, that is incorrect. Not to be rude, but for clarity: Complementary \Com`ple*menta*ry\, a. Serving to fill out or to complete; as, complementary numbers. [1913 Webster] Citation: www.dict.org IOW: Something being complementary has nothing to do with guest/host binary compatibility. virtio-pci and virtio-vbus are both equally complementary to virtio since they fill in the bottom layer of the virtio stack. So yes, vbus is truly complementary to virtio afaict. A 2.6.27 guest, or Windows guest with the existing virtio drivers, won't work over vbus. Binary compatibility with existing virtio drivers, while nice to have, is not a specific requirement nor goal. We will simply load an updated KMP/MSI into those guests and they will work again. As previously discussed, this is how more or less any system works today. It's like we are removing an old adapter card and adding a new one to uprev the silicon. Further, non-shmem virtio can't work over vbus. Actually I misspoke earlier when I said virtio works over non-shmem. Thinking about it some more, both virtio and vbus fundamentally require shared-memory, since sharing their metadata concurrently on both sides is their raison d'être. The difference is that virtio utilizes a pre-translation/mapping (via -add_buf) from the guest side. OTOH, vbus uses a post translation scheme (via memctx) from the host-side. If anything, vbus is actually more flexible because it doesn't assume the entire guest address space is directly mappable. In summary, your statement is incorrect (though it is my fault for putting that idea in your head). Since virtio is guest-oriented and host-agnostic, it can't ignore non-shared-memory hosts (even though it's unlikely virtio will be adopted there) Well, to be fair no one said it has to ignore them. Either virtio-vbus transport is present and available to the virtio stack, or it isn't. 
If its present, it may or may not publish objects for consumption. Providing a virtio-vbus transport in no way limits or degrades the existing capabilities of the virtio stack. It only enhances them. I digress. The whole point is moot since I realized that the non-shmem distinction isn't accurate anyway. They both require shared-memory for the metadata, and IIUC virtio requires the entire address space to be mappable whereas vbus only assumes the metadata is. In addition, the kvm-connector used in AlacrityVM's design strives to add value and improve performance via other mechanisms, such as dynamic allocation, interrupt coalescing (thus reducing exit-ratio, which is a serious issue in KVM) Do you have measurements of inter-interrupt coalescing rates (excluding intra-interrupt coalescing). I actually do not have a rig setup to explicitly test inter-interrupt rates at the moment. Once things stabilize for me, I will try to re-gather some numbers here. Last time I looked, however, there were some decent savings for inter as well. Inter rates are interesting because they are what tends to ramp up with IO load more than intra since guest interrupt mitigation techniques like NAPI often quell intra-rates naturally. This is especially true for data-center, cloud, hpc-grid, etc, kind of workloads (vs vanilla desktops, etc) that tend to have multiple IO
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/24/2009 09:03 PM, Gregory Haskins wrote: I don't really see how vhost and vbus are different here. vhost expects signalling to happen through a couple of eventfds and requires someone to supply them and implement kernel support (if needed). vbus requires someone to write a connector to provide the signalling implementation. Neither will work out-of-the-box when implementing virtio-net over falling dominos, for example. I realize in retrospect that my choice of words above implies vbus _is_ complete, but this is not what I was saying. What I was trying to convey is that vbus is _more_ complete. Yes, in either case some kind of glue needs to be written. The difference is that vbus implements more of the glue generally, and leaves less required to be customized for each iteration. No argument there. Since you care about non-virt scenarios and virtio doesn't, naturally vbus is a better fit for them as the code stands. Thanks for finally starting to acknowledge there's a benefit, at least. To be more precise, IMO virtio is designed to be a performance oriented ring-based driver interface that supports all types of hypervisors (e.g. shmem based kvm, and non-shmem based Xen). vbus is designed to be a high-performance generic shared-memory interconnect (for rings or otherwise) framework for environments where linux is the underpinning host (physical or virtual). They are distinctly different, but complementary (the former addresses the part of the front-end, and latter addresses the back-end, and a different part of the front-end). In addition, the kvm-connector used in AlacrityVM's design strives to add value and improve performance via other mechanisms, such as dynamic allocation, interrupt coalescing (thus reducing exit-ratio, which is a serious issue in KVM) and priortizable/nestable signals. Today there is a large performance disparity between what a KVM guest sees and what a native linux application sees on that same host. Just take a look at some of my graphs between virtio, and native, for example: http://developer.novell.com/wiki/images/b/b7/31-rc4_throughput.png A dominant vbus design principle is to try to achieve the same IO performance for all linux applications whether they be literally userspace applications, or things like KVM vcpus or Ira's physical boards. It also aims to solve problems not previously expressible with current technologies (even virtio), like nested real-time. And even though you repeatedly insist otherwise, the neat thing here is that the two technologies mesh (at least under certain circumstances, like when virtio is deployed on a shared-memory friendly linux backend like KVM). I hope that my stack diagram below depicts that clearly. But that's not a strong argument for vbus; instead of adding vbus you could make virtio more friendly to non-virt Actually, it _is_ a strong argument then because adding vbus is what helps makes virtio friendly to non-virt, at least for when performance matters. (there's a limit how far you can take this, not imposed by the code, but by virtio's charter as a virtual device driver framework). Going back to our stack diagrams, you could think of a vhost solution like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined-1 ? -- | vhost -- and you could think of a vbus solution like this -- | virtio-net -- | virtio-ring -- | virtio-bus -- | bus-interface -- | ? undefined-2 ? -- | bus-model -- | virtio-net-device (vhost ported to vbus model? 
:) -- So the difference between vhost and vbus in this particular context is that you need to have undefined-1 do device discovery/hotswap, config-space, address-decode/isolation, signal-path routing, memory-path routing, etc. Today this function is filled by things like virtio-pci, pci-bus, KVM/ioeventfd, and QEMU for x86. I am not as familiar with lguest, but presumably it is filled there by components like virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher. And to use more contemporary examples, we might have virtio-domino, domino-bus, domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko, and ira-launcher. Contrast this to the vbus stack: The bus-X components (when optionally employed by the connector designer) do device-discovery, hotswap, config-space, address-decode/isolation, signal-path and memory-path routing, etc in a general (and pv-centric) way. The undefined-2 portion is the connector, and just needs to convey messages like DEVCALL and SHMSIGNAL. The rest is handled in other parts
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/24/2009 12:15 AM, Gregory Haskins wrote: There are various aspects about designing high-performance virtual devices such as providing the shortest paths possible between the physical resources and the consumers. Conversely, we also need to ensure that we meet proper isolation/protection guarantees at the same time. What this means is there are various aspects to any high-performance PV design that require to be placed in-kernel to maximize the performance yet properly isolate the guest. For instance, you are required to have your signal-path (interrupts and hypercalls), your memory-path (gpa translation), and addressing/isolation model in-kernel to maximize performance. Exactly. That's what vhost puts into the kernel and nothing more. Actually, no. Generally, _KVM_ puts those things into the kernel, and vhost consumes them. Without KVM (or something equivalent), vhost is incomplete. One of my goals with vbus is to generalize the something equivalent part here. I don't really see how vhost and vbus are different here. vhost expects signalling to happen through a couple of eventfds and requires someone to supply them and implement kernel support (if needed). vbus requires someone to write a connector to provide the signalling implementation. Neither will work out-of-the-box when implementing virtio-net over falling dominos, for example. I realize in retrospect that my choice of words above implies vbus _is_ complete, but this is not what I was saying. What I was trying to convey is that vbus is _more_ complete. Yes, in either case some kind of glue needs to be written. The difference is that vbus implements more of the glue generally, and leaves less required to be customized for each iteration. Going back to our stack diagrams, you could think of a vhost solution like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined-1 ? -- | vhost -- and you could think of a vbus solution like this -- | virtio-net -- | virtio-ring -- | virtio-bus -- | bus-interface -- | ? undefined-2 ? -- | bus-model -- | virtio-net-device (vhost ported to vbus model? :) -- So the difference between vhost and vbus in this particular context is that you need to have undefined-1 do device discovery/hotswap, config-space, address-decode/isolation, signal-path routing, memory-path routing, etc. Today this function is filled by things like virtio-pci, pci-bus, KVM/ioeventfd, and QEMU for x86. I am not as familiar with lguest, but presumably it is filled there by components like virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher. And to use more contemporary examples, we might have virtio-domino, domino-bus, domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko, and ira-launcher. Contrast this to the vbus stack: The bus-X components (when optionally employed by the connector designer) do device-discovery, hotswap, config-space, address-decode/isolation, signal-path and memory-path routing, etc in a general (and pv-centric) way. The undefined-2 portion is the connector, and just needs to convey messages like DEVCALL and SHMSIGNAL. The rest is handled in other parts of the stack. So to answer your question, the difference is that the part that has to be customized in vbus should be a fraction of what needs to be customized with vhost because it defines more of the stack. And, as eluded to in my diagram, both virtio-net and vhost (with some modifications to fit into the vbus framework) are potentially complementary, not competitors. 
Vbus accomplishes its in-kernel isolation model by providing a container concept, where objects are placed into this container by userspace. The host kernel enforces isolation/protection by using a namespace to identify objects that is only relevant within a specific container's context (namely, a u32 dev-id). The guest addresses the objects by its dev-id, and the kernel ensures that the guest can't access objects outside of its dev-id namespace. vhost manages to accomplish this without any kernel support. No, vhost manages to accomplish this because of KVM's kernel support (ioeventfd, etc). Without KVM-like in-kernel support, vhost is merely a kind of tuntap-like clone signalled by eventfds. Without a vbus-connector-falling-dominos, vbus-venet can't do anything either. Mostly covered above... However, I was addressing your assertion that vhost somehow magically accomplishes this container/addressing function without any specific kernel support. This is incorrect. I contend that this kernel support
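A sketch of the container/dev-id addressing model being described, with illustrative types and locking (the real vbus code differs in detail; reference counting is elided here):

#include <linux/idr.h>
#include <linux/spinlock.h>
#include <linux/errno.h>

/* Illustrative per-guest container: dev-ids only resolve inside it. */
struct vbus_container {
        struct idr devices;            /* u32 dev-id -> device object */
        spinlock_t lock;
};

struct vbus_device_obj {
        int (*call)(struct vbus_device_obj *dev, u32 func,
                    void *data, size_t len);
};

/* Host-side decode of a guest DEVCALL: the lookup can only ever see
 * objects userspace placed into this guest's container, which is the
 * isolation property described above. */
static int vbus_devcall(struct vbus_container *c, u32 devid, u32 func,
                        void *data, size_t len)
{
        struct vbus_device_obj *dev;

        spin_lock(&c->lock);
        dev = idr_find(&c->devices, devid);
        spin_unlock(&c->lock);

        if (!dev)
                return -ENOENT;        /* devid outside this namespace */

        return dev->call(dev, func, data, len);
}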
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/23/2009 10:37 PM, Avi Kivity wrote: Example: feature negotiation. If it happens in userspace, it's easy to limit what features we expose to the guest. If it happens in the kernel, we need to add an interface to let the kernel know which features it should expose to the guest. We also need to add an interface to let userspace know which features were negotiated, if we want to implement live migration. Something fairly trivial bloats rapidly. btw, we have this issue with kvm reporting cpuid bits to the guest. Instead of letting kvm talk directly to the hardware and the guest, kvm gets the cpuid bits from the hardware, strips away features it doesn't support, exposes that to userspace, and expects userspace to program the cpuid bits it wants to expose to the guest (which may be different than what kvm exposed to userspace, and different from guest to guest). This issue doesn't exist in the model I am referring to, as these are all virtual-devices anyway. See my last reply -Greg signature.asc Description: OpenPGP digital signature
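To make the three-stage negotiation flow described above concrete, a hedged sketch; the mask names are illustrative, not real KVM symbols:

/* hardware -> kvm filter -> userspace policy -> guest */
#define KVM_VIRTUALIZABLE_MASK 0x0fffull   /* illustrative: what the
                                            * kernel can virtualize */

static u64 kvm_report_features(u64 hw_bits)
{
        /* the kernel strips bits it cannot support ... */
        return hw_bits & KVM_VIRTUALIZABLE_MASK;
}

static u64 userspace_program_features(u64 kvm_bits, u64 policy)
{
        /* ... and userspace may narrow further, per guest (e.g. for
         * migration compatibility), but can never widen */
        return kvm_bits & policy;
}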
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/22/2009 12:43 AM, Ira W. Snyder wrote: Sure, virtio-ira and he is on his own to make a bus-model under that, or virtio-vbus + vbus-ira-connector to use the vbus framework. Either model can work, I agree. Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and virtio-s390. It isn't especially easy. I can steal lots of code from the lguest bus model, but sometimes it is good to generalize, especially after the fourth implemention or so. I think this is what GHaskins tried to do. Yes. vbus is more finely layered so there is less code duplication. To clarify, Ira was correct in stating this generalizing some of these components was one of the goals for the vbus project: IOW vbus finely layers and defines what's below virtio, not replaces it. You can think of a virtio-stack like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined ? -- IOW: The way I see it, virtio is a device interface model only. The rest of it is filled in by the virtio-transport and some kind of back-end. So today, we can complete the ? undefined ? block like this for KVM: -- | virtio-pci -- | -- | kvm.ko -- | qemu -- | tuntap -- In this case, kvm.ko and tuntap are providing plumbing, and qemu is providing a backend device model (pci-based, etc). You can, of course, plug a different stack in (such as virtio-lguest, virtio-ira, etc) but you are more or less on your own to recreate many of the various facilities contained in that stack (such as things provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering. Vbus tries to commoditize more components in the stack (like the bus model and backend-device model) so they don't need to be redesigned each time we solve this virtio-transport problem. IOW: stop the proliferation of the need for pci-bus, lguest-bus, foo-bus underneath virtio. Instead, we can then focus on the value add on top, like the models themselves or the simple glue between them. So now you might have something like -- | virtio-vbus -- | vbus-proxy -- | kvm-guest-connector -- | -- | kvm.ko -- | kvm-host-connector.ko -- | vbus.ko -- | virtio-net-backend.ko -- so now we don't need to worry about the bus-model or the device-model framework. We only need to implement the connector, etc. This is handy when you find yourself in an environment that doesn't support PCI (such as Ira's rig, or userspace containers), or when you want to add features that PCI doesn't have (such as fluid event channels for things like IPC services, or priortizable interrupts, etc). The virtio layering was more or less dictated by Xen which doesn't have shared memory (it uses grant references instead). As a matter of fact lguest, kvm/pci, and kvm/s390 all have shared memory, as you do, so that part is duplicated. It's probably possible to add a virtio-shmem.ko library that people who do have shared memory can reuse. Note that I do not believe the Xen folk use virtio, so while I can appreciate the foresight that went into that particular aspect of the design of the virtio model, I am not sure if its a realistic constraint. The reason why I decided to not worry about that particular model is twofold: 1) Trying to support non shared-memory designs is prohibitively high for my performance goals (for instance, requiring an exit on each -add_buf() in addition to the -kick()). 2) The Xen guys are unlikely to diverge from something like xenbus/xennet anyway, so it would be for naught. Therefore, I just went with a device model optimized for shared-memory outright. 
That said, I believe we can refactor what is called the vbus-proxy-device into this virtio-shmem interface that you and Anthony have described. We could make the feature optional and only support on architectures where this makes sense. snip Kind Regards, -Greg signature.asc Description: OpenPGP digital signature
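To make the pre- versus post-translation distinction concrete: a hedged sketch of a memctx-style interface, in which the backend works purely in guest addresses and the context decides how to reach them; the names are illustrative, not the actual vbus definitions:

#include <linux/errno.h>

/* Illustrative post-translation memory context: callers pass GPAs and
 * the context decides how to reach them (direct map, DMA engine, ...). */
struct memctx {
        unsigned long (*copy_to)(struct memctx *ctx, unsigned long gpa,
                                 const void *src, unsigned long len);
        unsigned long (*copy_from)(struct memctx *ctx, void *dst,
                                   unsigned long gpa, unsigned long len);
        void (*release)(struct memctx *ctx);
};

/* Backend consuming a descriptor by GPA; contrast with virtio, where
 * ->add_buf() pre-translates on the guest side and hands the device
 * already-usable addresses. */
static int backend_read_desc(struct memctx *ctx, unsigned long desc_gpa,
                             void *desc, unsigned long len)
{
        if (ctx->copy_from(ctx, desc, desc_gpa, len))
                return -EFAULT;
        return 0;
}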
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/23/2009 05:26 PM, Gregory Haskins wrote: Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and virtio-s390. It isn't especially easy. I can steal lots of code from the lguest bus model, but sometimes it is good to generalize, especially after the fourth implemention or so. I think this is what GHaskins tried to do. Yes. vbus is more finely layered so there is less code duplication. To clarify, Ira was correct in stating this generalizing some of these components was one of the goals for the vbus project: IOW vbus finely layers and defines what's below virtio, not replaces it. You can think of a virtio-stack like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined ? -- IOW: The way I see it, virtio is a device interface model only. The rest of it is filled in by the virtio-transport and some kind of back-end. So today, we can complete the ? undefined ? block like this for KVM: -- | virtio-pci -- | -- | kvm.ko -- | qemu -- | tuntap -- In this case, kvm.ko and tuntap are providing plumbing, and qemu is providing a backend device model (pci-based, etc). You can, of course, plug a different stack in (such as virtio-lguest, virtio-ira, etc) but you are more or less on your own to recreate many of the various facilities contained in that stack (such as things provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering. Vbus tries to commoditize more components in the stack (like the bus model and backend-device model) so they don't need to be redesigned each time we solve this virtio-transport problem. IOW: stop the proliferation of the need for pci-bus, lguest-bus, foo-bus underneath virtio. Instead, we can then focus on the value add on top, like the models themselves or the simple glue between them. So now you might have something like -- | virtio-vbus -- | vbus-proxy -- | kvm-guest-connector -- | -- | kvm.ko -- | kvm-host-connector.ko -- | vbus.ko -- | virtio-net-backend.ko -- so now we don't need to worry about the bus-model or the device-model framework. We only need to implement the connector, etc. This is handy when you find yourself in an environment that doesn't support PCI (such as Ira's rig, or userspace containers), or when you want to add features that PCI doesn't have (such as fluid event channels for things like IPC services, or priortizable interrupts, etc). Well, vbus does more, for example it tunnels interrupts instead of exposing them 1:1 on the native interface if it exists. As I've previously explained, that trait is a function of the kvm-connector I've chosen to implement, not of the overall design of vbus. The reason why my kvm-connector is designed that way is because my early testing/benchmarking shows one of the issues in KVM performance is the ratio of exits per IO operation are fairly high, especially as your scale io-load. Therefore, the connector achieves a substantial reduction in that ratio by treating interrupts to the same kind of benefits that NAPI brought to general networking: That is, we enqueue interrupt messages into a lockless ring and only hit the IDT for the first occurrence. Subsequent interrupts are injected in a parallel/lockless manner, without hitting the IDT nor incurring an extra EOI. This pays dividends as the IO rate increases, which is when the guest needs the most help. OTOH, it is entirely possible to design the connector such that we maintain a 1:1 ratio of signals to traditional IDT interrupts. 
It is also possible to design a connector which surfaces as something else, such as PCI devices (by terminating the connector in QEMU and utilizing its PCI emulation facilities), which would naturally employ 1:1 mapping. So if 1:1 mapping is a critical feature (I would argue to the contrary), vbus can support it. It also pulls parts of the device model into the host kernel. That is the point. Most of it needs to be there for performance. And what doesn't need to be there for performance can either be: a) skipped at the discretion of the connector/device-model designer OR b) included because its trivially small subset of the model (e.g. a mac-addr attribute) and its nice to have a cohesive solution instead of requiring a separate binary blob that can get out of sync, etc. The example Ive provided to date (venet on kvm) utilizes (b), but it certainly
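A hedged sketch of the coalescing scheme described earlier in this message: a guest-shared, single-producer message ring where only the empty-to-non-empty transition raises a real interrupt. Sizing, overflow checks, and the drain-race handling are deliberately elided:

#include <linux/kernel.h>

/* Illustrative host-side enqueue for a guest-shared interrupt ring. */
struct shm_signal_ring {
        u32 prod;                /* host writes */
        u32 cons;                /* guest writes */
        u32 msgs[64];            /* interrupt messages; power-of-2 size */
};

static void inject_signal(struct shm_signal_ring *ring, u32 msg,
                          void (*raise_irq)(void))
{
        bool was_empty = (ring->prod == ring->cons);

        ring->msgs[ring->prod % 64] = msg;
        smp_wmb();               /* publish the message before the index */
        ring->prod++;

        /* Only the empty->non-empty edge hits the guest's IDT (and pays
         * the EOI); while the guest is still draining, further messages
         * coalesce for free. */
        if (was_empty)
                raise_irq();
}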
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Gregory Haskins wrote: Avi Kivity wrote: On 09/23/2009 05:26 PM, Gregory Haskins wrote: Yes, I'm having to create my own bus model, a-la lguest, virtio-pci, and virtio-s390. It isn't especially easy. I can steal lots of code from the lguest bus model, but sometimes it is good to generalize, especially after the fourth implemention or so. I think this is what GHaskins tried to do. Yes. vbus is more finely layered so there is less code duplication. To clarify, Ira was correct in stating this generalizing some of these components was one of the goals for the vbus project: IOW vbus finely layers and defines what's below virtio, not replaces it. You can think of a virtio-stack like this: -- | virtio-net -- | virtio-ring -- | virtio-bus -- | ? undefined ? -- IOW: The way I see it, virtio is a device interface model only. The rest of it is filled in by the virtio-transport and some kind of back-end. So today, we can complete the ? undefined ? block like this for KVM: -- | virtio-pci -- | -- | kvm.ko -- | qemu -- | tuntap -- In this case, kvm.ko and tuntap are providing plumbing, and qemu is providing a backend device model (pci-based, etc). You can, of course, plug a different stack in (such as virtio-lguest, virtio-ira, etc) but you are more or less on your own to recreate many of the various facilities contained in that stack (such as things provided by QEMU, like discovery/hotswap/addressing), as Ira is discovering. Vbus tries to commoditize more components in the stack (like the bus model and backend-device model) so they don't need to be redesigned each time we solve this virtio-transport problem. IOW: stop the proliferation of the need for pci-bus, lguest-bus, foo-bus underneath virtio. Instead, we can then focus on the value add on top, like the models themselves or the simple glue between them. So now you might have something like -- | virtio-vbus -- | vbus-proxy -- | kvm-guest-connector -- | -- | kvm.ko -- | kvm-host-connector.ko -- | vbus.ko -- | virtio-net-backend.ko -- so now we don't need to worry about the bus-model or the device-model framework. We only need to implement the connector, etc. This is handy when you find yourself in an environment that doesn't support PCI (such as Ira's rig, or userspace containers), or when you want to add features that PCI doesn't have (such as fluid event channels for things like IPC services, or priortizable interrupts, etc). Well, vbus does more, for example it tunnels interrupts instead of exposing them 1:1 on the native interface if it exists. As I've previously explained, that trait is a function of the kvm-connector I've chosen to implement, not of the overall design of vbus. The reason why my kvm-connector is designed that way is because my early testing/benchmarking shows one of the issues in KVM performance is the ratio of exits per IO operation are fairly high, especially as your scale io-load. Therefore, the connector achieves a substantial reduction in that ratio by treating interrupts to the same kind of benefits that NAPI brought to general networking: That is, we enqueue interrupt messages into a lockless ring and only hit the IDT for the first occurrence. Subsequent interrupts are injected in a parallel/lockless manner, without hitting the IDT nor incurring an extra EOI. This pays dividends as the IO rate increases, which is when the guest needs the most help. OTOH, it is entirely possible to design the connector such that we maintain a 1:1 ratio of signals to traditional IDT interrupts. 
It is also possible to design a connector which surfaces as something else, such as PCI devices (by terminating the connector in QEMU and utilizing its PCI emulation facilities), which would naturally employ 1:1 mapping. So if 1:1 mapping is a critical feature (I would argue to the contrary), vbus can support it. It also pulls parts of the device model into the host kernel. That is the point. Most of it needs to be there for performance. To clarify this point: There are various aspects about designing high-performance virtual devices such as providing the shortest paths possible between the physical resources and the consumers. Conversely, we also need to ensure that we meet proper isolation/protection guarantees at the same time. What this means is there are various aspects to any high-performance PV
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/23/2009 08:58 PM, Gregory Haskins wrote: It also pulls parts of the device model into the host kernel. That is the point. Most of it needs to be there for performance. To clarify this point: There are various aspects about designing high-performance virtual devices such as providing the shortest paths possible between the physical resources and the consumers. Conversely, we also need to ensure that we meet proper isolation/protection guarantees at the same time. What this means is there are various aspects to any high-performance PV design that require to be placed in-kernel to maximize the performance yet properly isolate the guest. For instance, you are required to have your signal-path (interrupts and hypercalls), your memory-path (gpa translation), and addressing/isolation model in-kernel to maximize performance. Exactly. That's what vhost puts into the kernel and nothing more. Actually, no. Generally, _KVM_ puts those things into the kernel, and vhost consumes them. Without KVM (or something equivalent), vhost is incomplete. One of my goals with vbus is to generalize the something equivalent part here. I know you may not care about non-kvm use cases, and thats fine. No one says you have to. However, note that some of use do care about these non-kvm cases, and thus its a distinction I am making here as a benefit of the vbus framework. Vbus accomplishes its in-kernel isolation model by providing a container concept, where objects are placed into this container by userspace. The host kernel enforces isolation/protection by using a namespace to identify objects that is only relevant within a specific container's context (namely, a u32 dev-id). The guest addresses the objects by its dev-id, and the kernel ensures that the guest can't access objects outside of its dev-id namespace. vhost manages to accomplish this without any kernel support. No, vhost manages to accomplish this because of KVMs kernel support (ioeventfd, etc). Without a KVM-like in-kernel support, vhost is a merely a kind of tuntap-like clone signalled by eventfds. vbus on the other hand, generalizes one more piece of the puzzle (namely, the function of pio+ioeventfd and userspace's programming of it) by presenting the devid namespace and container concept. This goes directly to my rebuttal of your claim that vbus places too much in the kernel. I state that, one way or the other, address decode and isolation _must_ be in the kernel for performance. Vbus does this with a devid/container scheme. vhost+virtio-pci+kvm does it with pci+pio+ioeventfd. The guest simply has not access to any vhost resources other than the guest-host doorbell, which is handed to the guest outside vhost (so it's somebody else's problem, in userspace). You mean _controlled_ by userspace, right? Obviously, the other side of the kernel still needs to be programmed (ioeventfd, etc). Otherwise, vhost would be pointless: e.g. just use vanilla tuntap if you don't need fast in-kernel decoding. All that is required is a way to transport a message with a devid attribute as an address (such as DEVCALL(devid)) and the framework provides the rest of the decode+execute function. vhost avoids that. No, it doesn't avoid it. It just doesn't specify how its done, and relies on something else to do it on its behalf. Conversely, vbus specifies how its done, but not how to transport the verb across the wire. That is the role of the vbus-connector abstraction. Contrast this to vhost+virtio-pci (called simply vhost from here). It's the wrong name. 
vhost implements only the data path. Understood, but vhost+virtio-pci is what I am contrasting, and I use vhost for short from that point on because I am too lazy to type the whole name over and over ;) It is not immune to requiring in-kernel addressing support either, but rather it just does it differently (and its not as you might expect via qemu). Vhost relies on QEMU to render PCI objects to the guest, which the guest assigns resources (such as BARs, interrupts, etc). vhost does not rely on qemu. It relies on its user to handle configuration. In one important case it's qemu+pci. It could just as well be the lguest launcher. I meant vhost=vhost+virtio-pci here. Sorry for the confusion. The point I am making specifically is that vhost in general relies on other in-kernel components to function. I.e. It cannot function without having something like the PCI model to build an IO namespace. That namespace (in this case, pio addresses+data tuples) are used for the in-kernel addressing function under KVM + virtio-pci. The case of the lguest launcher is a good one to highlight. Yes, you can presumably also use lguest with vhost, if the requisite facilities are exposed to lguest-bus, and some eventfd based thing like ioeventfd is written for the host (if it doesnt exist already
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/15/2009 11:08 PM, Gregory Haskins wrote: There's virtio-console, virtio-blk etc. None of these have kernel-mode servers, but these could be implemented if/when needed. IIUC, Ira already needs at least ethernet and console capability. He's welcome to pick up the necessary code from qemu. The problem isn't where to find the models...the problem is how to aggregate multiple models to the guest. b) what do you suppose this protocol to aggregate the connections would look like? (hint: this is what a vbus-connector does). You mean multilink? You expose the device as a multiqueue. No, what I mean is how do you surface multiple ethernet and consoles to the guests? For Ira's case, I think he needs at minimum at least one of each, and he mentioned possibly having two unique ethernets at one point. You instantiate multiple vhost-nets. Multiple ethernet NICs is a supported configuration for kvm. But this is not KVM. His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? I don't really see the difference between 1 and N here. A KVM surfaces N virtio-devices as N pci-devices to the guest. What do we do in Ira's case where the entire guest represents itself as a PCI device to the host, and nothing the other way around? There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;) I'm not sure if you're talking about the configuration interface or data path here. I am talking about how we would tunnel the config space for N devices across his transport. As an aside, the vbus-kvm connector makes them one and the same, but they do not have to be. Its all in the connector design. c) how do you manage the configuration, especially on a per-board basis? pci (for kvm/x86). Ok, for kvm understood (and I would also add qemu to that mix). But we are talking about vhost's application in a non-kvm environment here, right?. So if the vhost-X devices are in the guest, They aren't in the guest. The best way to look at it is - a device side, with a dma engine: vhost-net - a driver side, only accessing its own memory: virtio-net Given that Ira's config has the dma engine in the ppc boards, that's where vhost-net would live (the ppc boards acting as NICs to the x86 board, essentially). That sounds convenient given his hardware, but it has its own set of problems. For one, the configuration/inventory of these boards is now driven by the wrong side and has to be addressed. Second, the role reversal will likely not work for many models other than ethernet (e.g. virtio-console or virtio-blk drivers running on the x86 board would be naturally consuming services from the slave boards...virtio-net is an exception because 802.x is generally symmetrical). IIUC, vbus would support having the device models live properly on the x86 side, solving both of these problems. It would be impossible to reverse vhost given its current design. and the x86 board is just a slave...How do you tell each ppc board how many devices and what config (e.g. MACs, etc) to instantiate? Do you assume that they should all be symmetric and based on positional (e.g. slot) data? What if you want asymmetric configurations (if not here, perhaps in a different environment)? I have no idea, that's for Ira to solve. Bingo. 
Thus my statement that the vhost proposal is incomplete. You have the virtio-net and vhost-net pieces covering the fast-path end-points, but nothing in the middle (transport, aggregation, config-space), and nothing on the management-side. vbus provides most of the other pieces, and can even support the same virtio-net protocol on top. The remaining part would be something like a udev script to populate the vbus with devices on board-insert events.

If he could fake the PCI config space as seen by the x86 board, he would just show the normal pci config and use virtio-pci (multiple channels would show up as a multifunction device). Given he can't, he needs to tunnel the virtio config space some other way.

Right, and note that vbus was designed to solve this. This tunneling can, of course, be done without vbus using some other design. However, whatever solution is created will look incredibly close to what I've already done, so my point is why reinvent it?

Yes. virtio is really virtualization oriented.

I would say that it's vhost in particular that is virtualization oriented. virtio, as a concept, generally should work in physical systems, if perhaps with some minor modifications. The biggest limit is having virt in its name ;)

Let me rephrase. The virtio developers
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/16/2009 02:44 PM, Gregory Haskins wrote:

The problem isn't where to find the models...the problem is how to aggregate multiple models to the guest.

You mean configuration? You instantiate multiple vhost-nets. Multiple ethernet NICs is a supported configuration for kvm.

But this is not KVM.

If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package. So the question is: is your position that vbus is all wrong and you wish to create a new bus-like thing to solve the problem? If so, how is it different from what I've already done? More importantly, what specific objections do you have to what I've done, as perhaps they can be fixed instead of starting over?

His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport?

I don't really see the difference between 1 and N here. A KVM surfaces N virtio-devices as N pci-devices to the guest.

What do we do in Ira's case where the entire guest represents itself as a PCI device to the host, and nothing the other way around?

There is no guest and host in this scenario. There's a device side (ppc) and a driver side (x86). The driver side can access configuration information on the device side. How to multiplex multiple devices is an interesting exercise for whoever writes the virtio binding for that setup.

Bingo. So now it's a question of do you want to write this layer from scratch, or re-use my framework?

There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;)

I'm not sure if you're talking about the configuration interface or data path here.

I am talking about how we would tunnel the config space for N devices across his transport.

Sounds trivial.

No one said it was rocket science. But it does need to be designed and implemented end-to-end, much of which I've already done in what I hope is an extensible way.

Write an address containing the device number and register number to one location, read or write data from another.

You mean like the u64 devh, and u32 func fields I have here for the vbus-kvm connector?

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=include/linux/vbus_pci.h;h=fe337590e644017392e4c9d9236150adb2333729;hb=ded8ce2005a85c174ba93ee26f8d67049ef11025#l64

Just like the PCI cf8/cfc interface.

They aren't in the guest. The best way to look at it is:
- a device side, with a dma engine: vhost-net
- a driver side, only accessing its own memory: virtio-net
Given that Ira's config has the dma engine in the ppc boards, that's where vhost-net would live (the ppc boards acting as NICs to the x86 board, essentially).

That sounds convenient given his hardware, but it has its own set of problems. For one, the configuration/inventory of these boards is now driven by the wrong side and has to be addressed.

Why is it the wrong side?

Wrong is probably too harsh a word when looking at ethernet. It's certainly odd, and possibly inconvenient. It would be like having vhost in a KVM guest, and virtio-net running on the host. You could do it, but it's weird and awkward.
Where it really falls apart and enters the wrong category is for non-symmetric devices, like disk-io.

Second, the role reversal will likely not work for many models other than ethernet (e.g. virtio-console or virtio-blk drivers running on the x86 board would be naturally consuming services from the slave boards...virtio-net is an exception because 802.x is generally symmetrical).

There is no role reversal.

So if I have a virtio-blk driver running on the x86 and a vhost-blk device running on the ppc board, I can use the ppc board as a block-device. What if I really wanted to go the other way?

The side doing dma is the device, the side accessing its own memory is the driver. Just like those other 1e12 driver/device pairs out there.

IIUC, his ppc boards really can be seen as guests (they are linux instances that are utilizing services from the x86, not the other way around). vhost forces the model to have the ppc boards act as IO-hosts, whereas vbus would likely work in either direction due to its more refined abstraction layer.

I have no idea, that's for Ira to solve.

Bingo. Thus my statement that the vhost proposal is incomplete. You have the virtio-net and vhost-net pieces covering the fast-path end-points, but nothing in the middle (transport, aggregation
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/16/2009 05:10 PM, Gregory Haskins wrote:

If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package.

In the kernel. IMO that's the wrong place for it.

In conversations with Ira, he indicated he needs kernel-to-kernel ethernet for performance, and needs at least ethernet and console connectivity. You could conceivably build a solution for this system 3 basic ways:

1) completely in userspace: use things like tuntap on the ppc boards, and tunnel packets across a custom point-to-point connection formed over the pci link to a userspace app on the x86 board. This app then reinjects the packets into the x86 kernel as a raw socket or tuntap, etc. Pretty much vanilla tuntap/vpn kind of stuff. Advantage: very little kernel code. Problem: performance (citation: hopefully obvious).

2) partially in userspace: have an in-kernel virtio-net driver talk to a userspace based virtio-net backend. This is the (current, non-vhost oriented) KVM/qemu model. Advantage: re-uses existing kernel code. Problem: performance (citation: see alacrityvm numbers).

3) in-kernel: You can do something like virtio-net to vhost to potentially meet some of the requirements, but not all. In order to fully meet (3), you would need to do some of that stuff you mentioned in the last reply with muxing device-nr/reg-nr. In addition, we need to have a facility for mapping eventfds and establishing a signaling mechanism (like PIO+qid), etc. KVM does this with IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be invented. To meet performance, this stuff has to be in kernel and there has to be a way to manage it. Since vbus was designed to do exactly that, this is what I would advocate. You could also reinvent these concepts and put your own mux and mapping code in place, in addition to all the other stuff that vbus does. But I am not clear why anyone would want to.

So no, the kernel is not the wrong place for it. It's the _only_ place for it. Otherwise, just use (1) and be done with it.

Further, if we adopt vbus, we drop compatibility with existing guests or have to support both vbus and virtio-pci.

We already need to support both (at least to support Ira). virtio-pci doesn't work here. Something else (vbus, or vbus-like) is needed.

So the question is: is your position that vbus is all wrong and you wish to create a new bus-like thing to solve the problem?

I don't intend to create anything new, I am satisfied with virtio. If it works for Ira, excellent. If not, too bad.

I think that about sums it up, then.

I believe it will work without too much trouble.

Afaict it won't, for the reasons I mentioned.

If so, how is it different from what I've already done? More importantly, what specific objections do you have to what I've done, as perhaps they can be fixed instead of starting over?

The two biggest objections are:
- the host side is in the kernel

As it needs to be.

- the guest side is a new bus instead of reusing pci (on x86/kvm), making Windows support more difficult

That's a function of the vbus-connector, which is different from vbus-core. If you don't like it (and I know you don't), we can write one that interfaces to qemu's pci system.
I just don't like the limitations that imposes, nor do I think we need the complexity of dealing with a split PCI model, so I chose not to implement vbus-kvm this way. With all due respect, based on all of your comments in aggregate I really do not think you are truly grasping what I am actually building here.

I guess these two are exactly what you think are vbus' greatest advantages, so we'll probably have to extend our agree-to-disagree on this one. I also had issues with using just one interrupt vector to service all events, but that's easily fixed.

Again, a function of the connector.

There is no guest and host in this scenario. There's a device side (ppc) and a driver side (x86). The driver side can access configuration information on the device side. How to multiplex multiple devices is an interesting exercise for whoever writes the virtio binding for that setup.

Bingo. So now it's a question of do you want to write this layer from scratch, or re-use my framework?

You will have to implement a connector or whatever for vbus as well. vbus has more layers so it's probably smaller for vbus.

Bingo! That is precisely the point. All the stuff for how to map eventfds, handle signal mitigation, demux device/function pointers, isolation, etc, is built in. All the connector has to do is transport the 4-6 verbs and provide a memory mapping/copy function, and the rest is reusable. The device models would
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/16/2009 10:22 PM, Gregory Haskins wrote: Avi Kivity wrote: On 09/16/2009 05:10 PM, Gregory Haskins wrote:

If kvm can do it, others can.

The problem is that you seem to either hand-wave over details like this, or you give details that are pretty much exactly what vbus does already. My point is that I've already sat down and thought about these issues and solved them in a freely available GPL'ed software package.

In the kernel. IMO that's the wrong place for it.

3) in-kernel: You can do something like virtio-net to vhost to potentially meet some of the requirements, but not all. In order to fully meet (3), you would need to do some of that stuff you mentioned in the last reply with muxing device-nr/reg-nr. In addition, we need to have a facility for mapping eventfds and establishing a signaling mechanism (like PIO+qid), etc. KVM does this with IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be invented.

irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

Not per se, but it needs to be interfaced. How do I register that eventfd with the fastpath in Ira's rig? How do I signal the eventfd (x86->ppc, and ppc->x86)? To take it to the next level, how do I organize that mechanism so that it works for more than one IO-stream (e.g. address the various queues within ethernet or a different device like the console)? KVM has IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not have the luxury of an established IO paradigm.

Is vbus the only way to implement a solution? No. But it is _a_ way, and it's one that was specifically designed to solve this very problem (as well as others).

(As an aside, note that you generally will want an abstraction on top of irqfd/eventfd like shm-signal or virtqueues to do shared-memory based event mitigation, but I digress. That is a separate topic).

To meet performance, this stuff has to be in kernel and there has to be a way to manage it.

and management belongs in userspace.

vbus does not dictate where the management must be. It's an extensible framework, governed by what you plug into it (ala connectors and devices). For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD and DEVDROP hotswap events into the interrupt stream, because they are simple and we already needed the interrupt stream anyway for fast-path. As another example: venet chose to put ->call(MACQUERY) config-space into its call namespace because it's simple, and we already need ->calls() for fastpath. It therefore exports an attribute to sysfs that allows the management app to set it. I could likewise have designed the connector or device-model differently so as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI userspace), but this seems silly to me when they are so trivial, so I didn't.

Since vbus was designed to do exactly that, this is what I would advocate. You could also reinvent these concepts and put your own mux and mapping code in place, in addition to all the other stuff that vbus does. But I am not clear why anyone would want to.

Maybe they like their backward compatibility and Windows support.

This is really not relevant to this thread, since we are talking about Ira's hardware. But if you must bring this up, then I will reiterate that you just design the connector to interface with QEMU+PCI and you have that too if that was important to you.
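[As a rough illustration of the "needs to be invented" plumbing described above: bridging the ppc board's interrupt to an eventfd so an in-kernel consumer can wait on it, KVM-irqfd style, might look something like the sketch below. The irq handler, struct names, and registration path are invented assumptions about Ira's hardware, not code from any patch in this thread; only eventfd_ctx_fdget()/eventfd_signal() are the real kernel API.]

    /*
     * Hypothetical sketch: signal an eventfd from the board's MSI
     * handler (board->x86 direction).  All board-specific names are
     * placeholders.
     */
    #include <linux/eventfd.h>
    #include <linux/interrupt.h>
    #include <linux/err.h>

    struct board_chan {
            struct eventfd_ctx *trigger; /* signalled on board->x86 events */
    };

    static irqreturn_t board_msi_handler(int irq, void *data)
    {
            struct board_chan *chan = data;

            eventfd_signal(chan->trigger, 1); /* wake whoever waits on the fd */
            return IRQ_HANDLED;
    }

    static int board_chan_assign(struct board_chan *chan, int fd)
    {
            struct eventfd_ctx *ctx = eventfd_ctx_fdget(fd);

            if (IS_ERR(ctx))
                    return PTR_ERR(ctx);
            chan->trigger = ctx;
            return 0;
    }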
But on the topic of backward compatibility and Windows support: since you could consider KVM a motherboard manufacturer of sorts (it just happens to be virtual hardware), I don't know why KVM seems to consider itself the only motherboard manufacturer in the world that has to make everything look legacy. If a company like ASUS wants to add some cutting edge IO controller/bus, they simply do it. Pretty much every product release may contain a different array of devices, many of which are not backwards compatible with any prior silicon. The guy/gal installing Windows on that system may see a "?" in device-manager until they load a driver that supports the new chip, and subsequently it works. It is certainly not a requirement to make said chip somehow work with existing drivers/facilities on bare metal, per se. Why should virtual systems be different?

So, yeah, the current design of the vbus-kvm connector means I have to provide a driver. This is understood, and I have no problem with that.

The only thing that I would agree has to be backwards compatible is the BIOS/boot function. If you can't support running an image like the Windows installer, you are hosed. If you can't use your ethernet until you get a chance to install a driver after the install completes, it's just like most other systems in existence. IOW: it's not a big deal. For cases where the IO system is needed as part of the boot/install, you provide BIOS and/or install-disk support for it.

So no, the kernel
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:

There is no role reversal.

So if I have a virtio-blk driver running on the x86 and a vhost-blk device running on the ppc board, I can use the ppc board as a block-device. What if I really wanted to go the other way?

It seems ppc is the only one that can initiate DMA to an arbitrary address, so you can't do this really, or you can by tunneling each request back to ppc, or doing an extra data copy, but it's unlikely to work well. The limitation comes from hardware, not from the API we use.

Understood, but presumably it can be exposed as a sub-function of the ppc board's register file as a DMA-controller service to the x86. This would fall into the "tunneling each request back" category you mention above, though I think "tunnel" implies a heavier protocol than it would actually require. This would look more like a PIO cycle to a DMA controller than some higher-layer protocol.

You would then utilize that DMA service inside the memctx, and the rest of vbus would work transparently with the existing devices/drivers. I do agree it would require some benchmarking to determine its feasibility, which is why I was careful to say things like "may work" ;). I also do not even know if it's possible to expose the service this way on his system. If this design is not possible or performs poorly, I admit vbus is just as hosed as vhost in regard to the role-correction benefit.

Kind Regards,
-Greg
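[To make the "PIO cycle to a DMA controller" idea concrete: a memctx-style copy hook backed by such a register window might look roughly like the sketch below. Every register offset and structure here is an invented placeholder (the actual board interface, and the real vbus memctx signature, are not shown in this thread), so this illustrates the shape of the idea rather than a real API. A 64-bit host is assumed for writeq/readq.]

    /*
     * Hypothetical sketch: a memctx-style copy_to() backed by a
     * doorbell window into the ppc board's DMA engine.
     */
    #include <linux/io.h>
    #include <linux/types.h>

    #define DMA_REG_SRC   0x00  /* bus address of the source buffer */
    #define DMA_REG_DST   0x08  /* bus address of the destination   */
    #define DMA_REG_LEN   0x10  /* transfer length in bytes         */
    #define DMA_REG_GO    0x18  /* doorbell: write 1 to start       */
    #define DMA_REG_STAT  0x20  /* bit 0 set while the engine is busy */

    struct board_memctx {
            void __iomem *regs; /* mapped BAR window of the ppc board */
    };

    static int board_memctx_copy_to(struct board_memctx *ctx,
                                    u64 dst, u64 src, size_t len)
    {
            writeq(src, ctx->regs + DMA_REG_SRC);
            writeq(dst, ctx->regs + DMA_REG_DST);
            writeq(len, ctx->regs + DMA_REG_LEN);
            writeq(1,   ctx->regs + DMA_REG_GO);

            /* Poll for completion; a real driver would sleep or take an irq. */
            while (readq(ctx->regs + DMA_REG_STAT) & 1)
                    cpu_relax();

            return 0;
    }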
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/14/2009 10:14 PM, Gregory Haskins wrote:

To reiterate, as long as the model is such that the ppc boards are considered the owner (direct access, no translation needed) I believe it will work. If the pointers are expected to be owned by the host, then my model doesn't work well either.

In this case the x86 is the owner and the ppc boards use translated access. Just switch drivers and device and it falls into place.

You could switch vbus roles as well, I suppose. Another potential option is that he can stop mapping host memory on the guest so that it follows the more traditional model. As a bus-master device, the ppc boards should have access to any host memory at least in the GFP_DMA range, which would include all relevant pointers here.

I digress: I was primarily addressing the concern that Ira would need to manage the host side of the link using hvas mapped from userspace (even if the host side is the ppc boards). vbus abstracts that access so as to allow something other than userspace/hva mappings. OTOH, having each ppc board run a userspace app to do the mapping on its behalf and feed it to vhost is probably not a huge deal either. Where vhost might really fall apart is when any assumptions about pageable memory occur, if any.

As an aside: a bigger issue is that, iiuc, Ira wants more than a single ethernet channel in his design (multiple ethernets, consoles, etc). A vhost solution in this environment is incomplete.

Note that Ira's architecture highlights that vbus's explicit management interface is more valuable here than it is in KVM, since KVM already has its own management interface via QEMU.

Kind Regards,
-Greg
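[A rough sketch of the "more traditional model" mentioned above, where the board bus-masters into host memory that the x86 side allocates: the x86 driver could allocate a DMA-able ring and publish its bus address through a board register. The register offsets and names are invented placeholders, not Ira's actual interface; dma_alloc_coherent() and writeq() are the real kernel API.]

    /*
     * Hypothetical sketch: allocate a descriptor ring in DMA-able
     * memory and hand its bus address to the ppc board, which can
     * then bus-master into host memory directly.
     */
    #include <linux/dma-mapping.h>
    #include <linux/io.h>
    #include <linux/pci.h>

    #define BOARD_REG_RING_ADDR  0x40 /* invented register offsets */
    #define BOARD_REG_RING_SIZE  0x48

    static int publish_ring(struct pci_dev *pdev, void __iomem *regs,
                            size_t ring_bytes)
    {
            dma_addr_t ring_dma;
            void *ring;

            ring = dma_alloc_coherent(&pdev->dev, ring_bytes,
                                      &ring_dma, GFP_KERNEL);
            if (!ring)
                    return -ENOMEM;

            /* Hand the bus address to the board; it DMAs from here on. */
            writeq(ring_dma, regs + BOARD_REG_RING_ADDR);
            writeq(ring_bytes, regs + BOARD_REG_RING_SIZE);
            return 0;
    }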
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Avi Kivity wrote: On 09/15/2009 04:03 PM, Gregory Haskins wrote:

In this case the x86 is the owner and the ppc boards use translated access. Just switch drivers and device and it falls into place.

You could switch vbus roles as well, I suppose.

Right, there's no real difference in this regard.

Another potential option is that he can stop mapping host memory on the guest so that it follows the more traditional model. As a bus-master device, the ppc boards should have access to any host memory at least in the GFP_DMA range, which would include all relevant pointers here. I digress: I was primarily addressing the concern that Ira would need to manage the host side of the link using hvas mapped from userspace (even if the host side is the ppc boards). vbus abstracts that access so as to allow something other than userspace/hva mappings. OTOH, having each ppc board run a userspace app to do the mapping on its behalf and feed it to vhost is probably not a huge deal either. Where vhost might really fall apart is when any assumptions about pageable memory occur, if any.

Why? vhost will call get_user_pages() or copy_*_user() which ought to do the right thing.

I was speaking generally, not specifically to Ira's architecture. What I mean is that vbus was designed to work without assuming that the memory is pageable. There are environments in which the host is not capable of mapping hvas/*page, but the memctx->copy_to/copy_from paradigm could still work (think rdma, for instance).

As an aside: a bigger issue is that, iiuc, Ira wants more than a single ethernet channel in his design (multiple ethernets, consoles, etc). A vhost solution in this environment is incomplete.

Why? Instantiate as many vhost-nets as needed.

a) what about non-ethernets?
b) what do you suppose this protocol to aggregate the connections would look like? (hint: this is what a vbus-connector does).
c) how do you manage the configuration, especially on a per-board basis?

Note that Ira's architecture highlights that vbus's explicit management interface is more valuable here than it is in KVM, since KVM already has its own management interface via QEMU.

vhost-net and vbus both need management, vhost-net via ioctls and vbus via configfs.

Actually, I have patches queued to allow vbus to be managed via ioctls as well, per your feedback (and it solves the permissions/lifetime criticisms in alacrityvm-v0.1).

The only difference is the implementation. vhost-net leaves much more to userspace, that's the main difference.

Also:
*) vhost is virtio-net specific, whereas vbus is a more generic device model where things like virtio-net or venet ride on top.
*) vhost is only designed to work with environments that look very similar to a KVM guest (slot/hva translatable). vbus can bridge various environments by abstracting the key components (such as memory access).
*) vhost requires an active userspace management daemon, whereas vbus can be driven by transient components, like scripts (ala udev)

Kind Regards,
-Greg
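[For readers unfamiliar with the "vhost-net via ioctls" management model referenced above, the userspace side boils down to a sequence along these lines. This is a condensed sketch based on the vhost interface as posted: a real daemon would also have to configure the memory table and vrings (VHOST_SET_MEM_TABLE and the VHOST_SET_VRING_* ioctls) before attaching a backend, and the tap fd setup is assumed to exist elsewhere.]

    /*
     * Condensed sketch of the vhost-net ioctl sequence; error paths
     * leak the fd for brevity.
     */
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    static int attach_vhost_net(int tap_fd)
    {
            struct vhost_vring_file backend = { .index = 0, .fd = tap_fd };
            int vhost_fd = open("/dev/vhost-net", O_RDWR);

            if (vhost_fd < 0)
                    return -1;

            /* Claim the device for this process... */
            if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0)
                    return -1;

            /* ...then point virtqueue 0 at the tap device. */
            if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend) < 0)
                    return -1;

            return vhost_fd;
    }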
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:

No, what I mean is how do you surface multiple ethernet and consoles to the guests? For Ira's case, I think he needs at minimum at least one of each, and he mentioned possibly having two unique ethernets at one point. His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;)

Can't all this be in userspace?

Can you outline your proposal?

-Greg
Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
Michael S. Tsirkin wrote: On Tue, Sep 15, 2009 at 04:43:58PM -0400, Gregory Haskins wrote: Michael S. Tsirkin wrote: On Tue, Sep 15, 2009 at 04:08:23PM -0400, Gregory Haskins wrote:

No, what I mean is how do you surface multiple ethernet and consoles to the guests? For Ira's case, I think he needs at minimum at least one of each, and he mentioned possibly having two unique ethernets at one point. His slave boards surface themselves as PCI devices to the x86 host. So how do you use that to make multiple vhost-based devices (say two virtio-nets, and a virtio-console) communicate across the transport? There are multiple ways to do this, but what I am saying is that whatever is conceived will start to look eerily like a vbus-connector, since this is one of its primary purposes ;)

Can't all this be in userspace? Can you outline your proposal? -Greg

Userspace in x86 maps a PCI region, uses it for communication with ppc?

And what do you propose this communication to look like?

-Greg
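[Michael's one-line suggestion presumably means something like the sketch below: userspace on the x86 maps one of the board's BARs through sysfs and treats it as a shared-memory window. The device path is a placeholder, and everything beyond the raw mapping (ring layout, signalling, aggregation) is exactly the open question Greg is asking about.]

    /*
     * Sketch: map BAR0 of a (placeholder) PCI device from userspace
     * via sysfs.  What protocol runs over the window is left
     * undefined, as in the thread.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <sys/mman.h>

    static volatile uint32_t *map_bar0(size_t len)
    {
            /* Placeholder device address. */
            const char *path = "/sys/bus/pci/devices/0000:01:00.0/resource0";
            int fd = open(path, O_RDWR | O_SYNC);
            void *p;

            if (fd < 0)
                    return NULL;
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
    }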