Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, May 03, 2009 at 01:11:39PM -0700, Davide Libenzi wrote: > On Sun, 3 May 2009, Al Viro wrote: > > > IOW, the sane solution would be to export something that returns your > > struct file *. And stop playing with fd completely. > > This builds but it's not tested at all. > > - Make all the work of the old anon_inode_getfd(), done by a new > anon_inode_getfile(), with anon_inode_getfd() using its services > > - Make all the work done by sys_eventfd(), done by a new > eventfd_file_create() (which in turn uses anon_inode_getfile()), with > sys_eventfd() using its services > > IRQfd can use eventfd_file_create(), fget(), get_unused_fd_flags() and > fd_install() just before returning. > Is that what you had in mind? More or less, but I'd like to see the irqfd side of that... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, 3 May 2009, Al Viro wrote: > IOW, the sane solution would be to export something that returns your > struct file *. And stop playing with fd completely. This builds but it's not tested at all. - Make all the work of the old anon_inode_getfd(), done by a new anon_inode_getfile(), with anon_inode_getfd() using its services - Make all the work done by sys_eventfd(), done by a new eventfd_file_create() (which in turn uses anon_inode_getfile()), with sys_eventfd() using its services IRQfd can use eventfd_file_create(), fget(), get_unused_fd_flags() and fd_install() just before returning. Is that what you had in mind? - Davide --- fs/anon_inodes.c| 68 +--- fs/eventfd.c| 44 +--- include/linux/anon_inodes.h |3 + include/linux/eventfd.h |6 +++ 4 files changed, 92 insertions(+), 29 deletions(-) Index: linux-2.6.mod/fs/anon_inodes.c === --- linux-2.6.mod.orig/fs/anon_inodes.c 2009-05-03 12:21:09.0 -0700 +++ linux-2.6.mod/fs/anon_inodes.c 2009-05-03 12:54:02.0 -0700 @@ -64,28 +64,24 @@ static const struct dentry_operations an * * Creates a new file by hooking it on a single inode. This is useful for files * that do not need to have a full-fledged inode in order to operate correctly. - * All the files created with anon_inode_getfd() will share a single inode, + * All the files created with anon_inode_getfile() will share a single inode, * hence saving memory and avoiding code duplication for the file/inode/dentry - * setup. Returns new descriptor or -error. + * setup. Returns the newly created file* or error. */ -int anon_inode_getfd(const char *name, const struct file_operations *fops, -void *priv, int flags) +struct file *anon_inode_getfile(const char *name, + const struct file_operations *fops, + void *priv, int flags) { struct qstr this; struct dentry *dentry; struct file *file; - int error, fd; + int error; if (IS_ERR(anon_inode_inode)) - return -ENODEV; + return ERR_PTR(-ENODEV); if (fops->owner && !try_module_get(fops->owner)) - return -ENOENT; - - error = get_unused_fd_flags(flags); - if (error < 0) - goto err_module; - fd = error; + return ERR_PTR(-ENOENT); /* * Link the inode to a directory entry by creating a unique name @@ -97,7 +93,7 @@ int anon_inode_getfd(const char *name, c this.hash = 0; dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this); if (!dentry) - goto err_put_unused_fd; + goto err_module; /* * We know the anon_inode inode count is always greater than zero, @@ -123,16 +119,54 @@ int anon_inode_getfd(const char *name, c file->f_version = 0; file->private_data = priv; + return file; + +err_dput: + dput(dentry); +err_module: + module_put(fops->owner); + return ERR_PTR(error); +} +EXPORT_SYMBOL_GPL(anon_inode_getfile); + +/** + * anon_inode_getfd - creates a new file instance by hooking it up to an + *anonymous inode, and a dentry that describe the "class" + *of the file + * + * @name:[in]name of the "class" of the new file + * @fops:[in]file operations for the new file + * @priv:[in]private data for the new file (will be file's private_data) + * @flags: [in]flags + * + * Creates a new file by hooking it on a single inode. This is useful for files + * that do not need to have a full-fledged inode in order to operate correctly. + * All the files created with anon_inode_getfd() will share a single inode, + * hence saving memory and avoiding code duplication for the file/inode/dentry + * setup. Returns new descriptor or -error. + */ +int anon_inode_getfd(const char *name, const struct file_operations *fops, +void *priv, int flags) +{ + int error, fd; + struct file *file; + + error = get_unused_fd_flags(flags); + if (error < 0) + return error; + fd = error; + + file = anon_inode_getfile(name, fops, priv, flags); + if (IS_ERR(file)) { + error = PTR_ERR(file); + goto err_put_unused_fd; + } fd_install(fd, file); return fd; -err_dput: - dput(dentry); err_put_unused_fd: put_unused_fd(fd); -err_module: - module_put(fops->owner); return error; } EXPORT_SYMBOL_GPL(anon_inode_getfd); Index: linux-2.6.mod/fs/eventfd.c === --- linux-2.6.mod.orig/fs/eventfd.c 2009-05-03 12:21:09.0 -0700 +++ linux-2.6.mod/fs/eventfd.c 2009
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, May 03, 2009 at 10:17:16PM +0300, Avi Kivity wrote: >> Actually there's a third option: add KVM_MASK_IRQ, KVM_UNMASK_IRQ ioctls >> which will block/unblock guest from getting interrupt on this irq, >> whatever the source. Interrupts are queued in kernel while masked. A >> third ioctl KVM_PENDING_IRQS will return the status for a set if IRQs. >> qemu would call these ioctls when guest edits the MSIX vector control or >> reads the pending bit array. >> > > I think this is the best option. Sounds good. -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
Michael S. Tsirkin wrote: On Sun, May 03, 2009 at 07:59:40PM +0300, Avi Kivity wrote: Michael S. Tsirkin wrote: On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: This allows an eventfd to be registered as an irq source with a guest. Any signaling operation on the eventfd (via userspace or kernel) will inject the registered GSI at the next available window. Signed-off-by: Gregory Haskins If we ever want to use this with e.g. MSI-X emulation in guest, and want to be stricly compliant to MSI-X, we'll need a way for guest to mask interrupts, and for host to report that a masked interrupt is pending. Ideally, all this will be doable with a couple of mmapped pages to avoid vmexits/system calls. We could do this in two ways: - move msix entry emulation into the kernel It's not too bad IMO: MSIX is just a table with a list of vectors, you check the mask bit on each interrupt, if masked mark vector pending and poll until unmasked. Right, but it's more and more core, and more and more bugs. Bugs in the kernel are more expensive to fix and get to users. - require the device to support replacing its irqfd, and juggle it like so: - guest disables msi - replace device model fd with eventfd belonging to us - when the device fires its eventfd, set the irq pending bit - guest enables msi - if the pending bit is set, fire the interrupt? - replace device model fd with the real irqfd Looks like a lot of code. No? We'll need exactly the same code if we do it in the kernel. The only addition is replacing the fd. I'm leaning towards the latter, though it's not an easy call. Actually there's a third option: add KVM_MASK_IRQ, KVM_UNMASK_IRQ ioctls which will block/unblock guest from getting interrupt on this irq, whatever the source. Interrupts are queued in kernel while masked. A third ioctl KVM_PENDING_IRQS will return the status for a set if IRQs. qemu would call these ioctls when guest edits the MSIX vector control or reads the pending bit array. I think this is the best option. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, May 03, 2009 at 07:59:40PM +0300, Avi Kivity wrote: > Michael S. Tsirkin wrote: >> On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: >> >>> This allows an eventfd to be registered as an irq source with a guest. Any >>> signaling operation on the eventfd (via userspace or kernel) will inject >>> the registered GSI at the next available window. >>> >>> Signed-off-by: Gregory Haskins >>> >> >> If we ever want to use this with e.g. MSI-X emulation in guest, and want >> to be stricly compliant to MSI-X, we'll need a way for guest to mask >> interrupts, and for host to report that a masked interrupt is pending. >> Ideally, all this will be doable with a couple of mmapped pages to avoid >> vmexits/system calls. >> >> > > We could do this in two ways: > > - move msix entry emulation into the kernel It's not too bad IMO: MSIX is just a table with a list of vectors, you check the mask bit on each interrupt, if masked mark vector pending and poll until unmasked. > - require the device to support replacing its irqfd, and juggle it like so: > - guest disables msi > - replace device model fd with eventfd belonging to us > - when the device fires its eventfd, set the irq pending bit > - guest enables msi >- if the pending bit is set, fire the interrupt? >- replace device model fd with the real irqfd Looks like a lot of code. No? > I'm leaning towards the latter, though it's not an easy call. Actually there's a third option: add KVM_MASK_IRQ, KVM_UNMASK_IRQ ioctls which will block/unblock guest from getting interrupt on this irq, whatever the source. Interrupts are queued in kernel while masked. A third ioctl KVM_PENDING_IRQS will return the status for a set if IRQs. qemu would call these ioctls when guest edits the MSIX vector control or reads the pending bit array. >>> +static void >>> +irqfd_inject(struct work_struct *work) >>> +{ >>> + struct _irqfd *irqfd = container_of(work, struct _irqfd, work); >>> + struct kvm *kvm = irqfd->kvm; >>> + >>> + mutex_lock(&kvm->lock); >>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); >>> + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); >>> + mutex_unlock(&kvm->lock); >>> >> >> This will do weird stuff (deliver the irq twice) if the irq is >> MSI/MSI-X. I know this was discussed already and is a temporary >> shortcut, but maybe add a comment that we really want kvm_toggle_irq, >> so that we won't forget? >> > > If so, that's a bug. MSI should ignore kvm_set_irq(..., 0). -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, May 03, 2009 at 11:07:26AM -0700, Davide Libenzi wrote: > On Sun, 3 May 2009, Al Viro wrote: > > On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: > > > + /* We re-use eventfd for irqfd */ > > > + fd = sys_eventfd2(0, 0); > > > + if (fd < 0) { > > > + ret = fd; > > > + goto fail; > > > + } > > > + > > > + /* We maintain a reference to eventfd for the irqfd lifetime */ > > > + file = eventfd_fget(fd); > > > + if (IS_ERR(file)) { > > > + ret = PTR_ERR(file); > > > + goto fail; > > > + } > > > + > > > + irqfd->file = file; > > > > This is just plain wrong. You have no promise whatsoever that caller of > > that sucker won't race with e.g. dup2(). IOW, you can't assume that > > file will be of the expected kind. > > The eventfd_fget() checks for the file_operations pointer, before > returning the file*, and fails if the fd in not an eventfd. Or you have > other concerns? OK, but... it's still wrong. Descriptor numbers are purely for interaction with userland; using them that way violates very general race-prevention rules, even if you do paper over the worst of consequences with check in eventfd_fget(). General rules: * descriptor you've generated is fit only for return to userland; * descriptor you've got from userland is fit only for *single* fget() or equivalent, unless you are one of the core syscalls manipulating the descriptor table itself (dup2, etc.) * once file is installed in descriptor table, you'd better be past the last failure exit; sys_close() on cleanup path is not acceptable. That's what reserving descriptors is for. IOW, the sane solution would be to export something that returns your struct file *. And stop playing with fd completely. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Sun, 3 May 2009, Al Viro wrote: > On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: > > + /* We re-use eventfd for irqfd */ > > + fd = sys_eventfd2(0, 0); > > + if (fd < 0) { > > + ret = fd; > > + goto fail; > > + } > > + > > + /* We maintain a reference to eventfd for the irqfd lifetime */ > > + file = eventfd_fget(fd); > > + if (IS_ERR(file)) { > > + ret = PTR_ERR(file); > > + goto fail; > > + } > > + > > + irqfd->file = file; > > This is just plain wrong. You have no promise whatsoever that caller of > that sucker won't race with e.g. dup2(). IOW, you can't assume that > file will be of the expected kind. The eventfd_fget() checks for the file_operations pointer, before returning the file*, and fails if the fd in not an eventfd. Or you have other concerns? - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
Michael S. Tsirkin wrote: On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: This allows an eventfd to be registered as an irq source with a guest. Any signaling operation on the eventfd (via userspace or kernel) will inject the registered GSI at the next available window. Signed-off-by: Gregory Haskins If we ever want to use this with e.g. MSI-X emulation in guest, and want to be stricly compliant to MSI-X, we'll need a way for guest to mask interrupts, and for host to report that a masked interrupt is pending. Ideally, all this will be doable with a couple of mmapped pages to avoid vmexits/system calls. We could do this in two ways: - move msix entry emulation into the kernel - require the device to support replacing its irqfd, and juggle it like so: - guest disables msi - replace device model fd with eventfd belonging to us - when the device fires its eventfd, set the irq pending bit - guest enables msi - if the pending bit is set, fire the interrupt? - replace device model fd with the real irqfd I'm leaning towards the latter, though it's not an easy call. +static void +irqfd_inject(struct work_struct *work) +{ + struct _irqfd *irqfd = container_of(work, struct _irqfd, work); + struct kvm *kvm = irqfd->kvm; + + mutex_lock(&kvm->lock); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); + mutex_unlock(&kvm->lock); This will do weird stuff (deliver the irq twice) if the irq is MSI/MSI-X. I know this was discussed already and is a temporary shortcut, but maybe add a comment that we really want kvm_toggle_irq, so that we won't forget? If so, that's a bug. MSI should ignore kvm_set_irq(..., 0). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: > + /* We re-use eventfd for irqfd */ > + fd = sys_eventfd2(0, 0); > + if (fd < 0) { > + ret = fd; > + goto fail; > + } > + > + /* We maintain a reference to eventfd for the irqfd lifetime */ > + file = eventfd_fget(fd); > + if (IS_ERR(file)) { > + ret = PTR_ERR(file); > + goto fail; > + } > + > + irqfd->file = file; This is just plain wrong. You have no promise whatsoever that caller of that sucker won't race with e.g. dup2(). IOW, you can't assume that file will be of the expected kind. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
On Mon, Apr 27, 2009 at 02:33:34PM -0400, Gregory Haskins wrote: > This allows an eventfd to be registered as an irq source with a guest. Any > signaling operation on the eventfd (via userspace or kernel) will inject > the registered GSI at the next available window. > > Signed-off-by: Gregory Haskins If we ever want to use this with e.g. MSI-X emulation in guest, and want to be stricly compliant to MSI-X, we'll need a way for guest to mask interrupts, and for host to report that a masked interrupt is pending. Ideally, all this will be doable with a couple of mmapped pages to avoid vmexits/system calls. > +static void > +irqfd_inject(struct work_struct *work) > +{ > + struct _irqfd *irqfd = container_of(work, struct _irqfd, work); > + struct kvm *kvm = irqfd->kvm; > + > + mutex_lock(&kvm->lock); > + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); > + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); > + mutex_unlock(&kvm->lock); This will do weird stuff (deliver the irq twice) if the irq is MSI/MSI-X. I know this was discussed already and is a temporary shortcut, but maybe add a comment that we really want kvm_toggle_irq, so that we won't forget? > +} > + -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[KVM PATCH v3 2/2] kvm: add support for irqfd via eventfd-notification interface
This allows an eventfd to be registered as an irq source with a guest. Any signaling operation on the eventfd (via userspace or kernel) will inject the registered GSI at the next available window. Signed-off-by: Gregory Haskins --- arch/x86/kvm/Makefile|2 - arch/x86/kvm/x86.c |1 include/linux/kvm.h |7 ++ include/linux/kvm_host.h |4 + virt/kvm/irqfd.c | 158 ++ virt/kvm/kvm_main.c | 11 +++ 6 files changed, 182 insertions(+), 1 deletions(-) create mode 100644 virt/kvm/irqfd.c diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile index b43c4ef..d5fff51 100644 --- a/arch/x86/kvm/Makefile +++ b/arch/x86/kvm/Makefile @@ -3,7 +3,7 @@ # common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o ioapic.o \ -coalesced_mmio.o irq_comm.o) +coalesced_mmio.o irq_comm.o irqfd.o) ifeq ($(CONFIG_KVM_TRACE),y) common-objs += $(addprefix ../../../virt/kvm/, kvm_trace.o) endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ab1fdac..0773433 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1027,6 +1027,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_REINJECT_CONTROL: case KVM_CAP_IRQ_INJECT_STATUS: case KVM_CAP_ASSIGN_DEV_IRQ: + case KVM_CAP_IRQFD: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/include/linux/kvm.h b/include/linux/kvm.h index 3db5d8d..0e594f2 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -415,6 +415,7 @@ struct kvm_trace_rec { #define KVM_CAP_ASSIGN_DEV_IRQ 29 /* Another bug in KVM_SET_USER_MEMORY_REGION fixed: */ #define KVM_CAP_JOIN_MEMORY_REGIONS_WORKS 30 +#define KVM_CAP_IRQFD 31 #ifdef KVM_CAP_IRQ_ROUTING @@ -454,6 +455,11 @@ struct kvm_irq_routing { #endif +struct kvm_irqfd { + __u32 gsi; + __u32 flags; +}; + /* * ioctls for VM fds */ @@ -498,6 +504,7 @@ struct kvm_irq_routing { #define KVM_ASSIGN_SET_MSIX_ENTRY \ _IOW(KVMIO, 0x74, struct kvm_assigned_msix_entry) #define KVM_DEASSIGN_DEV_IRQ _IOW(KVMIO, 0x75, struct kvm_assigned_irq) +#define KVM_IRQFD _IOW(KVMIO, 0x76, struct kvm_irqfd) /* * ioctls for vcpu fds diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 095ebb6..6a8d1c1 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -134,6 +134,7 @@ struct kvm { struct list_head vm_list; struct kvm_io_bus mmio_bus; struct kvm_io_bus pio_bus; + struct list_head irqfds; struct kvm_vm_stat stat; struct kvm_arch arch; atomic_t users_count; @@ -524,4 +525,7 @@ static inline void kvm_free_irq_routing(struct kvm *kvm) {} #endif +int kvm_irqfd(struct kvm *kvm, int gsi, int flags); +void kvm_irqfd_release(struct kvm *kvm); + #endif diff --git a/virt/kvm/irqfd.c b/virt/kvm/irqfd.c new file mode 100644 index 000..4fd2bea --- /dev/null +++ b/virt/kvm/irqfd.c @@ -0,0 +1,158 @@ +/* + * irqfd: Allows an eventfd to be used to inject an interrupt to the guest + * + * Credit goes to Avi Kivity for the original idea. + * + * Copyright 2009 Novell. All Rights Reserved. + * + * Author: + * Gregory Haskins + * + * This file is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +struct _irqfd { + struct kvm *kvm; + int gsi; + struct file *file; + struct list_head list; + poll_tablept; + wait_queue_head_t*wqh; + wait_queue_t wait; + struct work_structwork; +}; + +static void +irqfd_inject(struct work_struct *work) +{ + struct _irqfd *irqfd = container_of(work, struct _irqfd, work); + struct kvm *kvm = irqfd->kvm; + + mutex_lock(&kvm->lock); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 1); + kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0); + mutex_unlock(&kvm->lock); +} + +static int +irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct _irqfd *irqfd = container_of(wait, struct _irqfd, wait); + + /* +* The eventfd calls its wake_up