Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 02:53:53PM -0800, Davide Libenzi wrote: On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote: On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: Yes, I think this will work. Will test and report. Thanks! If you ever happen to find a solution in KVM that does not require eventfd changes, that'd be even better :) Otherwise ping me when you tested, that I'll puch this to Andrew. - Davide Hmm, the fix also needs changes in kvm driver to call the new API. To simplify dependencies, can we merge this patch through Avi's kvm tree as well? It is fine for me. I can officially post it to Avi, CCing kvm and lkml. - Davide Since it now uses wait_queue_t, I had to add include linux/wait.h to make it self-contained. The following works fine for me. Thanks! --- From: Davide Libenzi davi...@xmailserver.org Subject: eventfd: support eventfd_ctx_remove_wait_queue Add eventfd_ctx_remove_wait_queue API so that kernel users can clear counter while getting out of poll queue, atomically. Signed-off-by: Davide Libenzi davi...@xmailserver.org Tested-by: Michael S. Tsirkin m...@redhat.com -- diff --git a/fs/eventfd.c b/fs/eventfd.c index d26402f..4cda278 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) return events; } -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, - loff_t *ppos) +static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt) +{ + *cnt = (ctx-flags EFD_SEMAPHORE) ? 1: ctx-count; + ctx-count -= *cnt; +} + +/** + * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue. + * @ctx: [in] Pointer to eventfd context. + * @wait: [in] Wait queue to be removed. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked. + * + * This is used to atomically remove a wait queue entry from the eventfd wait + * queue head, and read/reset the counter value. + */ +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, + __u64 *cnt) +{ + unsigned long flags; + + spin_lock_irqsave(ctx-wqh.lock, flags); + eventfd_ctx_do_read(ctx, cnt); + __remove_wait_queue(ctx-wqh, wait); + if (*cnt != 0 waitqueue_active(ctx-wqh)) + wake_up_locked_poll(ctx-wqh, POLLOUT); + spin_unlock_irqrestore(ctx-wqh.lock, flags); + + return *cnt != 0 ? 0: -EAGAIN; +} +EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue); + +/** + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero. + * @ctx: [in] Pointer to eventfd context. + * @no_wait: [in] Different from zero if the operation should not block. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked but @no_wait was nonzero. + * -ERESTARTSYS : A signal interrupted the wait operation. + * + * If @no_wait is zero, the function might sleep until the eventfd internal + * counter becomes greater than zero. + */ +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) { - struct eventfd_ctx *ctx = file-private_data; ssize_t res; - __u64 ucnt = 0; DECLARE_WAITQUEUE(wait, current); - if (count sizeof(ucnt)) - return -EINVAL; spin_lock_irq(ctx-wqh.lock); + *cnt = 0; res = -EAGAIN; if (ctx-count 0) - res = sizeof(ucnt); - else if (!(file-f_flags O_NONBLOCK)) { + res = 0; + else if (!no_wait) { __add_wait_queue(ctx-wqh, wait); - for (res = 0;;) { + for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (ctx-count 0) { - res = sizeof(ucnt); + res = 0; break; } if (signal_pending(current)) { @@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (likely(res 0)) { - ucnt = (ctx-flags EFD_SEMAPHORE) ? 1 : ctx-count; - ctx-count -= ucnt; + if (likely(res == 0)) { + eventfd_ctx_do_read(ctx, cnt); if (waitqueue_active(ctx-wqh)) wake_up_locked_poll(ctx-wqh, POLLOUT); } spin_unlock_irq(ctx-wqh.lock); - if (res 0 put_user(ucnt, (__u64 __user *) buf)) - return -EFAULT;
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote: On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. No, event generation would only trigger a single interrupt. This race generates two interrupts for a single event. This can never happen with real hardware. eventfd_ctx_remove_wait_queue would solve this problem. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote: On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. No, event generation would only trigger a single interrupt. This race generates two interrupts for a single event. This can never happen with real hardware. eventfd_ctx_remove_wait_queue would solve this problem. In quoted test above you are saying that we will loose a message. So how does it generates two interrupts when we loose a message? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote: On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote: On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. No, event generation would only trigger a single interrupt. This race generates two interrupts for a single event. This can never happen with real hardware. eventfd_ctx_remove_wait_queue would solve this problem. In quoted test above you are saying that we will loose a message. So how does it generates two interrupts when we loose a message? -- Gleb. Right, sorry. I think what you miss is the fact that this is done during interrupt vector masking/unmasking, so events signalled while eventfd is not assigned to interrupt must not be lost, they should be pending and delivered later when interrupt vector is unmasked, that is when eventfd is reassigned to an interrupt. So this implementation really loses an interrupt: remove_wait_queue(irqfd-wqh, irqfd-wait); - at this point vector is masked: we are not polling eventfd anymore, event triggered at this point should cause interrupt after vector is unmasked, but the only thing is causes is counter increment in eventfd. eventfd_read_ctx(irqfd-eventfd, ucnt); - the above call would clear the counter, so we won't get an interrupt when vector is later unmasked. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote: On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote: On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. No, event generation would only trigger a single interrupt. This race generates two interrupts for a single event. This can never happen with real hardware. eventfd_ctx_remove_wait_queue would solve this problem. In quoted test above you are saying that we will loose a message. So how does it generates two interrupts when we loose a message? -- Gleb. Right, sorry. I think what you miss is the fact that this is done during interrupt vector masking/unmasking, so events signalled while eventfd is not assigned to interrupt must not be lost, they should be pending and delivered later when interrupt vector is unmasked, that is when eventfd is reassigned to an interrupt. Is this how MSI works? If interrupt is triggered while it is masked it is reasserted after unmasking? This is certainly no true for interrupt masking on irq chip level. So this implementation really loses an interrupt: remove_wait_queue(irqfd-wqh, irqfd-wait); - at this point vector is masked: we are not polling eventfd anymore, event triggered at this point should cause interrupt after vector is unmasked, but the only thing is causes is counter increment in eventfd. eventfd_read_ctx(irqfd-eventfd, ucnt); - the above call would clear the counter, so we won't get an interrupt when vector is later unmasked. Don't you going to use ucnt to set interrupt status bit? Can't you re-trigger the interrupt after unmasking if status bit is set? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:36:16AM +0200, Gleb Natapov wrote: On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote: On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote: On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote: Or if I do it the other way: remove_wait_queue(irqfd-wqh, irqfd-wait); - eventfd_read_ctx(irqfd-eventfd, ucnt); now, if device signals eventfd at point marked by -, it will not be sent but counter will be cleared, so we will loose a message. May be I am missing something here, but why doing it like that is a problem? Event delivery races with interrupt masking so making masking succeed before event delivery is OK. Event generation is asynchronous anyway and could have happened jiffy latter anyway. -- Gleb. No, event generation would only trigger a single interrupt. This race generates two interrupts for a single event. This can never happen with real hardware. eventfd_ctx_remove_wait_queue would solve this problem. In quoted test above you are saying that we will loose a message. So how does it generates two interrupts when we loose a message? -- Gleb. Right, sorry. I think what you miss is the fact that this is done during interrupt vector masking/unmasking, so events signalled while eventfd is not assigned to interrupt must not be lost, they should be pending and delivered later when interrupt vector is unmasked, that is when eventfd is reassigned to an interrupt. Is this how MSI works? If interrupt is triggered while it is masked it is reasserted after unmasking? This is certainly no true for interrupt masking on irq chip level. Yes. So this implementation really loses an interrupt: remove_wait_queue(irqfd-wqh, irqfd-wait); - at this point vector is masked: we are not polling eventfd anymore, event triggered at this point should cause interrupt after vector is unmasked, but the only thing is causes is counter increment in eventfd. eventfd_read_ctx(irqfd-eventfd, ucnt); - the above call would clear the counter, so we won't get an interrupt when vector is later unmasked. Don't you going to use ucnt to set interrupt status bit? Can't you re-trigger the interrupt after unmasking if status bit is set? Yes, this is what Davidel suggested originally. There's nowhere to save the ucnt though because event is being deasserted. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: Yes, I think this will work. Will test and report. Thanks! If you ever happen to find a solution in KVM that does not require eventfd changes, that'd be even better :) Otherwise ping me when you tested, that I'll puch this to Andrew. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote: On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: Yes, I think this will work. Will test and report. Thanks! If you ever happen to find a solution in KVM that does not require eventfd changes, that'd be even better :) Otherwise ping me when you tested, that I'll puch this to Andrew. - Davide Hmm, the fix also needs changes in kvm driver to call the new API. To simplify dependencies, can we merge this patch through Avi's kvm tree as well? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote: On Mon, 11 Jan 2010, Michael S. Tsirkin wrote: Yes, I think this will work. Will test and report. Thanks! If you ever happen to find a solution in KVM that does not require eventfd changes, that'd be even better :) Otherwise ping me when you tested, that I'll puch this to Andrew. - Davide Hmm, the fix also needs changes in kvm driver to call the new API. To simplify dependencies, can we merge this patch through Avi's kvm tree as well? It is fine for me. I can officially post it to Avi, CCing kvm and lkml. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: Sure, I was trying to be as brief as possible, here's a detailed summary. Description of the system (MSI emulation in KVM): KVM supports an ioctl to assign/deassign an eventfd file to interrupt message in guest OS. When this eventfd is signalled, interrupt message is sent. This assignment is done from qemu system emulator. eventfd is signalled from device emulation in another thread in userspace or from kernel, which talks with guest OS through another eventfd and shared memory (possibility of out of process was discussed but never got implemented yet). Note: it's okay to delay messages from correctness point of view, but generally this is latency-sensitive path. If multiple identical messages are requested, it's okay to send a single last message, but missing a message altogether causes deadlocks. Sending a message when none were requested might in theory cause crashes, in practice doing this causes performance degradation. Another KVM feature is interrupt masking: guest OS requests that we stop sending some interrupt message, possibly modified mapping and re-enables this message. This needs to be done without involving the device that might keep requesting events: while masked, message is marked pending, and guest might test the pending status. We can implement masking in system emulator in userspace, by using assign/deassign ioctls: when message is masked, we simply deassign all eventfd, and when it is unmasked, we assign them back. Here's some code to illustrate how this all works: assign/deassign code in kernel looks like the following: this is called to unmask interrupt static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct file *file = NULL; struct eventfd_ctx *eventfd = NULL; int ret; unsigned int events; irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL); ... file = eventfd_fget(fd); if (IS_ERR(file)) { ret = PTR_ERR(file); goto fail; } eventfd = eventfd_ctx_fileget(file); if (IS_ERR(eventfd)) { ret = PTR_ERR(eventfd); goto fail; } irqfd-eventfd = eventfd; /* * Install our own custom wake-up handling so we are notified via * a callback whenever someone signals the underlying eventfd */ init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup); init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc); spin_lock_irq(kvm-irqfds.lock); events = file-f_op-poll(file, irqfd-pt); list_add_tail(irqfd-list, kvm-irqfds.items); spin_unlock_irq(kvm-irqfds.lock); A. /* * Check if there was an event already pending on the eventfd * before we registered, and trigger it as if we didn't miss it. */ if (events POLLIN) schedule_work(irqfd-inject); /* * do not drop the file until the irqfd is fully initialized, otherwise * we might race against the POLLHUP */ fput(file); return 0; fail: ... } What is you do (under proper irqfd locking) something like: eventfd_ctx_read(ctx, 1, cnt); if (irqfd-cnt != cnt) { irqfd-cnt = cnt; schedule_work(irqfd-inject); } And deactivation deep down does this (from irqfd_cleanup_wq workqueue, so this is not under the spinlock): /* * Synchronize with the wait-queue and unhook ourselves to * prevent * further events. */ B. remove_wait_queue(irqfd-wqh, irqfd-wait); /* * It is now safe to release the object's resources */ eventfd_ctx_put(irqfd-eventfd); kfree(irqfd); And: eventfd_ctx_read(ctx, 1, irqfd-cnt); - remove_wait_queue(irqfd-wqh, irqfd-wait); - Davide Yes, this is exactly what I wanted to do. So, here's the issue: if an event is signalled at point -: after eventfd_ctx_read but before remove_wait_queue, then we inject interrupt but counter will be left non-zero and then when we unmask, we inject antoher, spurious interrupt. This is why I wanted to have eventfd_ctx_read not take wait queue head lock: then I could do: spin_lock_irqsave(ctx-wqh.lock, flags); eventfd_ctx_read(ctx, 1, irqfd-cnt); __remove_wait_queue(irqfd-wqh, irqfd-wait); spin_lock_irqrestore(ctx-wqh.lock, flags); -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: What is you do (under proper irqfd locking) something like: eventfd_ctx_read(ctx, 1, cnt); if (irqfd-cnt != cnt) { irqfd-cnt = cnt; schedule_work(irqfd-inject); } And deactivation deep down does this (from irqfd_cleanup_wq workqueue, so this is not under the spinlock): /* * Synchronize with the wait-queue and unhook ourselves to * prevent * further events. */ B. remove_wait_queue(irqfd-wqh, irqfd-wait); /* * It is now safe to release the object's resources */ eventfd_ctx_put(irqfd-eventfd); kfree(irqfd); And: eventfd_ctx_read(ctx, 1, irqfd-cnt); - remove_wait_queue(irqfd-wqh, irqfd-wait); - Davide Yes, this is exactly what I wanted to do. So, here's the issue: if an event is signalled at point -: after eventfd_ctx_read but before remove_wait_queue, then we inject interrupt but counter will be left non-zero and then when we unmask, we inject antoher, spurious interrupt. This is why I wanted to have eventfd_ctx_read not take wait queue head lock: then I could do: spin_lock_irqsave(ctx-wqh.lock, flags); eventfd_ctx_read(ctx, 1, irqfd-cnt); __remove_wait_queue(irqfd-wqh, irqfd-wait); spin_lock_irqrestore(ctx-wqh.lock, flags); This is why I said under proper irqfd locking. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: Not sure what you mean by irqfd locking. IMO we must take waitqueue lock, otherwise we can not prevent wait queue callback from being invoked, right? Note that if we take our own lock under wait queue callback, we won't be able to use this lock around remove_wait_queue. Also note how fs/evetfd.c uses __remove_wait_queue after taking wait queue lock - we'd like to do the same with our wait queue. Could you clarify pls? Wait a second. Looking at the current code in Linus tree, when you deassign an irqfd, you are actually destroying it, so the idea of saving the counter on deassign is not going to work. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote: On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: Not sure what you mean by irqfd locking. IMO we must take waitqueue lock, otherwise we can not prevent wait queue callback from being invoked, right? Note that if we take our own lock under wait queue callback, we won't be able to use this lock around remove_wait_queue. Also note how fs/evetfd.c uses __remove_wait_queue after taking wait queue lock - we'd like to do the same with our wait queue. Could you clarify pls? Wait a second. Looking at the current code in Linus tree, when you deassign an irqfd, you are actually destroying it, so the idea of saving the counter on deassign is not going to work. - Davide Yes, the only context passed between deassign and assign is the eventfd itself. So I think the following code for deassign would work provided eventfd_ctx_read_locked works with wait queue lock taken: spin_lock_irqsave(irqfd-wqh.lock, flags); eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt); __remove_wait_queue(irqfd-wqh, irqfd-wait); spin_lock_irqrestore(irqfd-wqh.lock, flags); And we discard ucnt here. This works because on on assign we do: events = file-f_op-poll(file, irqfd-pt); if (events POLLIN) schedule_work(irqfd-inject); -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote: On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: Not sure what you mean by irqfd locking. IMO we must take waitqueue lock, otherwise we can not prevent wait queue callback from being invoked, right? Note that if we take our own lock under wait queue callback, we won't be able to use this lock around remove_wait_queue. Also note how fs/evetfd.c uses __remove_wait_queue after taking wait queue lock - we'd like to do the same with our wait queue. Could you clarify pls? Wait a second. Looking at the current code in Linus tree, when you deassign an irqfd, you are actually destroying it, so the idea of saving the counter on deassign is not going to work. - Davide Yes, the only context passed between deassign and assign is the eventfd itself. So I think the following code for deassign would work provided eventfd_ctx_read_locked works with wait queue lock taken: spin_lock_irqsave(irqfd-wqh.lock, flags); eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt); __remove_wait_queue(irqfd-wqh, irqfd-wait); spin_lock_irqrestore(irqfd-wqh.lock, flags); As I said, you cannot do that from KVM, since write-witers will nedd to be woke up. Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested). - Davide --- fs/eventfd.c| 88 +++- include/linux/eventfd.h | 15 2 files changed, 88 insertions(+), 15 deletions(-) Index: linux-2.6.mod/fs/eventfd.c === --- linux-2.6.mod.orig/fs/eventfd.c 2010-01-06 18:33:27.0 -0800 +++ linux-2.6.mod/fs/eventfd.c 2010-01-10 11:02:00.0 -0800 @@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct return events; } -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, - loff_t *ppos) +static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt) +{ + *cnt = (ctx-flags EFD_SEMAPHORE) ? 1: ctx-count; + ctx-count -= *cnt; +} + +/** + * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue. + * @ctx: [in] Pointer to eventfd context. + * @wait: [in] Wait queue to be removed. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked. + * + * This is used to atomically remove a wait queue entry from the eventfd wait + * queue head, and read/reset the counter value. + */ +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, + __u64 *cnt) +{ + unsigned long flags; + + spin_lock_irqsave(ctx-wqh.lock, flags); + eventfd_ctx_do_read(ctx, cnt); + __remove_wait_queue(ctx-wqh, wait); + if (*cnt != 0 waitqueue_active(ctx-wqh)) + wake_up_locked_poll(ctx-wqh, POLLOUT); + spin_unlock_irqrestore(ctx-wqh.lock, flags); + + return *cnt != 0 ? 0: -EAGAIN; +} +EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue); + +/** + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero. + * @ctx: [in] Pointer to eventfd context. + * @no_wait: [in] Different from zero if the operation should not block. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked but @no_wait was nonzero. + * -ERESTARTSYS : A signal interrupted the wait operation. + * + * If @no_wait is zero, the function might sleep until the eventfd internal + * counter becomes greater than zero. + */ +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) { - struct eventfd_ctx *ctx = file-private_data; ssize_t res; - __u64 ucnt = 0; DECLARE_WAITQUEUE(wait, current); - if (count sizeof(ucnt)) - return -EINVAL; spin_lock_irq(ctx-wqh.lock); + *cnt = 0; res = -EAGAIN; if (ctx-count 0) - res = sizeof(ucnt); - else if (!(file-f_flags O_NONBLOCK)) { + res = 0; + else if (!no_wait) { __add_wait_queue(ctx-wqh, wait); - for (res = 0;;) { + for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (ctx-count 0) { - res = sizeof(ucnt); + res = 0; break; } if (signal_pending(current)) { @@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (likely(res
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, Jan 10, 2010 at 11:04:23AM -0800, Davide Libenzi wrote: On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote: On Sun, 10 Jan 2010, Michael S. Tsirkin wrote: Not sure what you mean by irqfd locking. IMO we must take waitqueue lock, otherwise we can not prevent wait queue callback from being invoked, right? Note that if we take our own lock under wait queue callback, we won't be able to use this lock around remove_wait_queue. Also note how fs/evetfd.c uses __remove_wait_queue after taking wait queue lock - we'd like to do the same with our wait queue. Could you clarify pls? Wait a second. Looking at the current code in Linus tree, when you deassign an irqfd, you are actually destroying it, so the idea of saving the counter on deassign is not going to work. - Davide Yes, the only context passed between deassign and assign is the eventfd itself. So I think the following code for deassign would work provided eventfd_ctx_read_locked works with wait queue lock taken: spin_lock_irqsave(irqfd-wqh.lock, flags); eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt); __remove_wait_queue(irqfd-wqh, irqfd-wait); spin_lock_irqrestore(irqfd-wqh.lock, flags); As I said, you cannot do that from KVM, since write-witers will nedd to be woke up. Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested). - Davide Yes, I think this will work. Will test and report. Thanks! --- fs/eventfd.c| 88 +++- include/linux/eventfd.h | 15 2 files changed, 88 insertions(+), 15 deletions(-) Index: linux-2.6.mod/fs/eventfd.c === --- linux-2.6.mod.orig/fs/eventfd.c 2010-01-06 18:33:27.0 -0800 +++ linux-2.6.mod/fs/eventfd.c2010-01-10 11:02:00.0 -0800 @@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct return events; } -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, - loff_t *ppos) +static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt) +{ + *cnt = (ctx-flags EFD_SEMAPHORE) ? 1: ctx-count; + ctx-count -= *cnt; +} + +/** + * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait queue. + * @ctx: [in] Pointer to eventfd context. + * @wait: [in] Wait queue to be removed. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked. + * + * This is used to atomically remove a wait queue entry from the eventfd wait + * queue head, and read/reset the counter value. + */ +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, + __u64 *cnt) +{ + unsigned long flags; + + spin_lock_irqsave(ctx-wqh.lock, flags); + eventfd_ctx_do_read(ctx, cnt); + __remove_wait_queue(ctx-wqh, wait); + if (*cnt != 0 waitqueue_active(ctx-wqh)) + wake_up_locked_poll(ctx-wqh, POLLOUT); + spin_unlock_irqrestore(ctx-wqh.lock, flags); + + return *cnt != 0 ? 0: -EAGAIN; +} +EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue); + +/** + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero. + * @ctx: [in] Pointer to eventfd context. + * @no_wait: [in] Different from zero if the operation should not block. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked but @no_wait was nonzero. + * -ERESTARTSYS : A signal interrupted the wait operation. + * + * If @no_wait is zero, the function might sleep until the eventfd internal + * counter becomes greater than zero. + */ +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) { - struct eventfd_ctx *ctx = file-private_data; ssize_t res; - __u64 ucnt = 0; DECLARE_WAITQUEUE(wait, current); - if (count sizeof(ucnt)) - return -EINVAL; spin_lock_irq(ctx-wqh.lock); + *cnt = 0; res = -EAGAIN; if (ctx-count 0) - res = sizeof(ucnt); - else if (!(file-f_flags O_NONBLOCK)) { + res = 0; + else if (!no_wait) { __add_wait_queue(ctx-wqh, wait); - for (res = 0;;) { + for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (ctx-count 0) { - res = sizeof(ucnt); + res = 0; break; } if (signal_pending(current)) { @@
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 11:25:40PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: OK. What I think we need is a way to remove ourselves from the eventfd wait queue and clear the counter atomically. We currently do remove_wait_queue(irqfd-wqh, irqfd-wait); where wqh saves the eventfd wait queue head. You do a remove_wait_queue() from inside a callback wakeup on the same wait queue head? No, not from callback, in ioctl context. If we do this before proposed eventfd_read_ctx, we can lose events. If we do this after, we can get spurious events. An unlocked read is one way to fix this. You posted one line of code and a two lines analysis of the issue. Can you be a little bit more verbose and show me more code, so that I can actually see what is going on? - Davide Sure, I was trying to be as brief as possible, here's a detailed summary. Description of the system (MSI emulation in KVM): KVM supports an ioctl to assign/deassign an eventfd file to interrupt message in guest OS. When this eventfd is signalled, interrupt message is sent. This assignment is done from qemu system emulator. eventfd is signalled from device emulation in another thread in userspace or from kernel, which talks with guest OS through another eventfd and shared memory (possibility of out of process was discussed but never got implemented yet). Note: it's okay to delay messages from correctness point of view, but generally this is latency-sensitive path. If multiple identical messages are requested, it's okay to send a single last message, but missing a message altogether causes deadlocks. Sending a message when none were requested might in theory cause crashes, in practice doing this causes performance degradation. Another KVM feature is interrupt masking: guest OS requests that we stop sending some interrupt message, possibly modified mapping and re-enables this message. This needs to be done without involving the device that might keep requesting events: while masked, message is marked pending, and guest might test the pending status. We can implement masking in system emulator in userspace, by using assign/deassign ioctls: when message is masked, we simply deassign all eventfd, and when it is unmasked, we assign them back. Here's some code to illustrate how this all works: assign/deassign code in kernel looks like the following: this is called to unmask interrupt static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct file *file = NULL; struct eventfd_ctx *eventfd = NULL; int ret; unsigned int events; irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL); ... file = eventfd_fget(fd); if (IS_ERR(file)) { ret = PTR_ERR(file); goto fail; } eventfd = eventfd_ctx_fileget(file); if (IS_ERR(eventfd)) { ret = PTR_ERR(eventfd); goto fail; } irqfd-eventfd = eventfd; /* * Install our own custom wake-up handling so we are notified via * a callback whenever someone signals the underlying eventfd */ init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup); init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc); spin_lock_irq(kvm-irqfds.lock); events = file-f_op-poll(file, irqfd-pt); list_add_tail(irqfd-list, kvm-irqfds.items); spin_unlock_irq(kvm-irqfds.lock); A. /* * Check if there was an event already pending on the eventfd * before we registered, and trigger it as if we didn't miss it. */ if (events POLLIN) schedule_work(irqfd-inject); /* * do not drop the file until the irqfd is fully initialized, otherwise * we might race against the POLLHUP */ fput(file); return 0; fail: ... } This is called to mask interrupt /* * shutdown any irqfd's that match fd+gsi */ static int kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct eventfd_ctx *eventfd; eventfd = eventfd_ctx_fdget(fd); if (IS_ERR(eventfd)) return PTR_ERR(eventfd); spin_lock_irq(kvm-irqfds.lock); list_for_each_entry_safe(irqfd, tmp, kvm-irqfds.items, list) { if (irqfd-eventfd == eventfd irqfd-gsi == gsi) irqfd_deactivate(irqfd); } spin_unlock_irq(kvm-irqfds.lock); eventfd_ctx_put(eventfd); /* * Block until we know all outstanding shutdown jobs have completed * so that we guarantee there will not be any more interrupts on this * gsi once this deassign function returns. */ flush_workqueue(irqfd_cleanup_wq); return 0; } And deactivation deep down does this (from
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: Sure, I was trying to be as brief as possible, here's a detailed summary. Description of the system (MSI emulation in KVM): KVM supports an ioctl to assign/deassign an eventfd file to interrupt message in guest OS. When this eventfd is signalled, interrupt message is sent. This assignment is done from qemu system emulator. eventfd is signalled from device emulation in another thread in userspace or from kernel, which talks with guest OS through another eventfd and shared memory (possibility of out of process was discussed but never got implemented yet). Note: it's okay to delay messages from correctness point of view, but generally this is latency-sensitive path. If multiple identical messages are requested, it's okay to send a single last message, but missing a message altogether causes deadlocks. Sending a message when none were requested might in theory cause crashes, in practice doing this causes performance degradation. Another KVM feature is interrupt masking: guest OS requests that we stop sending some interrupt message, possibly modified mapping and re-enables this message. This needs to be done without involving the device that might keep requesting events: while masked, message is marked pending, and guest might test the pending status. We can implement masking in system emulator in userspace, by using assign/deassign ioctls: when message is masked, we simply deassign all eventfd, and when it is unmasked, we assign them back. Here's some code to illustrate how this all works: assign/deassign code in kernel looks like the following: this is called to unmask interrupt static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct file *file = NULL; struct eventfd_ctx *eventfd = NULL; int ret; unsigned int events; irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL); ... file = eventfd_fget(fd); if (IS_ERR(file)) { ret = PTR_ERR(file); goto fail; } eventfd = eventfd_ctx_fileget(file); if (IS_ERR(eventfd)) { ret = PTR_ERR(eventfd); goto fail; } irqfd-eventfd = eventfd; /* * Install our own custom wake-up handling so we are notified via * a callback whenever someone signals the underlying eventfd */ init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup); init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc); spin_lock_irq(kvm-irqfds.lock); events = file-f_op-poll(file, irqfd-pt); list_add_tail(irqfd-list, kvm-irqfds.items); spin_unlock_irq(kvm-irqfds.lock); A. /* * Check if there was an event already pending on the eventfd * before we registered, and trigger it as if we didn't miss it. */ if (events POLLIN) schedule_work(irqfd-inject); /* * do not drop the file until the irqfd is fully initialized, otherwise * we might race against the POLLHUP */ fput(file); return 0; fail: ... } This is called to mask interrupt /* * shutdown any irqfd's that match fd+gsi */ static int kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct eventfd_ctx *eventfd; eventfd = eventfd_ctx_fdget(fd); if (IS_ERR(eventfd)) return PTR_ERR(eventfd); spin_lock_irq(kvm-irqfds.lock); list_for_each_entry_safe(irqfd, tmp, kvm-irqfds.items, list) { if (irqfd-eventfd == eventfd irqfd-gsi == gsi) irqfd_deactivate(irqfd); } spin_unlock_irq(kvm-irqfds.lock); eventfd_ctx_put(eventfd); /* * Block until we know all outstanding shutdown jobs have completed * so that we guarantee there will not be any more interrupts on this * gsi once this deassign function returns. */ flush_workqueue(irqfd_cleanup_wq); return 0; } And deactivation deep down does this (from irqfd_cleanup_wq workqueue, so this is not under the spinlock): /* * Synchronize with the wait-queue and unhook ourselves to * prevent * further events. */ B. remove_wait_queue(irqfd-wqh, irqfd-wait); /* * It is now safe to release the object's resources */ eventfd_ctx_put(irqfd-eventfd); kfree(irqfd); The problems (really the same bug) in KVM that I am trying to fix: 1. Because of A above, if event was requested while message was masked, we will not miss a message. However, because we never clear the counter, so we currently get a spurious message each time we unmask. We should clear the counter either each time we deliver message,
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Davide Libenzi wrote: When you unmask, shouldn't the fd be in a clear state anyway? Forget that. If it's used for IRQs you likely need to have a current state when you unmask. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: Sure, I was trying to be as brief as possible, here's a detailed summary. Description of the system (MSI emulation in KVM): KVM supports an ioctl to assign/deassign an eventfd file to interrupt message in guest OS. When this eventfd is signalled, interrupt message is sent. This assignment is done from qemu system emulator. eventfd is signalled from device emulation in another thread in userspace or from kernel, which talks with guest OS through another eventfd and shared memory (possibility of out of process was discussed but never got implemented yet). Note: it's okay to delay messages from correctness point of view, but generally this is latency-sensitive path. If multiple identical messages are requested, it's okay to send a single last message, but missing a message altogether causes deadlocks. Sending a message when none were requested might in theory cause crashes, in practice doing this causes performance degradation. Another KVM feature is interrupt masking: guest OS requests that we stop sending some interrupt message, possibly modified mapping and re-enables this message. This needs to be done without involving the device that might keep requesting events: while masked, message is marked pending, and guest might test the pending status. We can implement masking in system emulator in userspace, by using assign/deassign ioctls: when message is masked, we simply deassign all eventfd, and when it is unmasked, we assign them back. Here's some code to illustrate how this all works: assign/deassign code in kernel looks like the following: this is called to unmask interrupt static int kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi) { struct _irqfd *irqfd, *tmp; struct file *file = NULL; struct eventfd_ctx *eventfd = NULL; int ret; unsigned int events; irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL); ... file = eventfd_fget(fd); if (IS_ERR(file)) { ret = PTR_ERR(file); goto fail; } eventfd = eventfd_ctx_fileget(file); if (IS_ERR(eventfd)) { ret = PTR_ERR(eventfd); goto fail; } irqfd-eventfd = eventfd; /* * Install our own custom wake-up handling so we are notified via * a callback whenever someone signals the underlying eventfd */ init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup); init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc); spin_lock_irq(kvm-irqfds.lock); events = file-f_op-poll(file, irqfd-pt); list_add_tail(irqfd-list, kvm-irqfds.items); spin_unlock_irq(kvm-irqfds.lock); A. /* * Check if there was an event already pending on the eventfd * before we registered, and trigger it as if we didn't miss it. */ if (events POLLIN) schedule_work(irqfd-inject); /* * do not drop the file until the irqfd is fully initialized, otherwise * we might race against the POLLHUP */ fput(file); return 0; fail: ... } What is you do (under proper irqfd locking) something like: eventfd_ctx_read(ctx, 1, cnt); if (irqfd-cnt != cnt) { irqfd-cnt = cnt; schedule_work(irqfd-inject); } And deactivation deep down does this (from irqfd_cleanup_wq workqueue, so this is not under the spinlock): /* * Synchronize with the wait-queue and unhook ourselves to * prevent * further events. */ B. remove_wait_queue(irqfd-wqh, irqfd-wait); /* * It is now safe to release the object's resources */ eventfd_ctx_put(irqfd-eventfd); kfree(irqfd); And: eventfd_ctx_read(ctx, 1, irqfd-cnt); remove_wait_queue(irqfd-wqh, irqfd-wait); - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote: On Tue, 1 Sep 2009, Avi Kivity wrote: On 09/01/2009 02:45 AM, Davide Libenzi wrote: On Thu, 27 Aug 2009, Davide Libenzi wrote: On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: Oh, I stopped pushing EFD_STATE since we have a solution. Do you guys need the kernel-side eventfd_ctx_read() I posted or not? Because if nobody uses it, I'm not going to push it. Guys, I did not get a reply on this. Do you need me to push it, or you're not going to use it at the end? We'll use it eventually, but we're still some ways from it. OK, then bug me when you're going to need it. I won't push it before that. - Davide So, it turns out that we need this: be thought we don't because currently kvm does not zero eventfd counter when it polls eventfd. But this causes spurious interrupts when we disconnect irqfd from kvm and re-connect it back. However, since kvm does its own thing with the wait queue, and might read the counter from wait queue callback (which might be from interrupt context), a simpler, lower-level interface would be better for us. Does the following (build tested only) look palatable? Thanks! diff --git a/fs/eventfd.c b/fs/eventfd.c index d26402f..e350ffd 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -135,6 +135,17 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) return events; } +/* Caller must have wait queue head lock. */ +ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt) +{ + if (!ctx-count) + return -EAGAIN; + *ucnt = (ctx-flags EFD_SEMAPHORE) ? 1 : ctx-count; + ctx-count -= *ucnt; + return sizeof *ucnt; +} +EXPORT_SYMBOL_GPL(_eventfd_read_ctx); + static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) { @@ -146,17 +157,14 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, if (count sizeof(ucnt)) return -EINVAL; spin_lock_irq(ctx-wqh.lock); - res = -EAGAIN; - if (ctx-count 0) - res = sizeof(ucnt); - else if (!(file-f_flags O_NONBLOCK)) { + res = _eventfd_read_ctx(ctx, ucnt); + if (res 0 !(file-f_flags O_NONBLOCK)) { __add_wait_queue(ctx-wqh, wait); for (res = 0;;) { set_current_state(TASK_INTERRUPTIBLE); - if (ctx-count 0) { - res = sizeof(ucnt); + res = _eventfd_read_ctx(ctx, ucnt); + if (res 0) break; - } if (signal_pending(current)) { res = -ERESTARTSYS; break; @@ -169,8 +177,6 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, __set_current_state(TASK_RUNNING); } if (likely(res 0)) { - ucnt = (ctx-flags EFD_SEMAPHORE) ? 1 : ctx-count; - ctx-count -= ucnt; if (waitqueue_active(ctx-wqh)) wake_up_locked_poll(ctx-wqh, POLLOUT); } diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 94dd103..a3d0ce9 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -34,6 +34,7 @@ struct file *eventfd_fget(int fd); struct eventfd_ctx *eventfd_ctx_fdget(int fd); struct eventfd_ctx *eventfd_ctx_fileget(struct file *file); int eventfd_signal(struct eventfd_ctx *ctx, int n); +ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt); #else /* CONFIG_EVENTFD */ @@ -61,6 +62,11 @@ static inline void eventfd_ctx_put(struct eventfd_ctx *ctx) } +static inline ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt) +{ + return -ENOSYS; +} + #endif #endif /* _LINUX_EVENTFD_H */ -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 6 Jan 2010, Michael S. Tsirkin wrote: On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote: On Tue, 1 Sep 2009, Avi Kivity wrote: On 09/01/2009 02:45 AM, Davide Libenzi wrote: On Thu, 27 Aug 2009, Davide Libenzi wrote: On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: Oh, I stopped pushing EFD_STATE since we have a solution. Do you guys need the kernel-side eventfd_ctx_read() I posted or not? Because if nobody uses it, I'm not going to push it. Guys, I did not get a reply on this. Do you need me to push it, or you're not going to use it at the end? We'll use it eventually, but we're still some ways from it. OK, then bug me when you're going to need it. I won't push it before that. - Davide So, it turns out that we need this: be thought we don't because currently kvm does not zero eventfd counter when it polls eventfd. But this causes spurious interrupts when we disconnect irqfd from kvm and re-connect it back. However, since kvm does its own thing with the wait queue, and might read the counter from wait queue callback (which might be from interrupt context), a simpler, lower-level interface would be better for us. Does the following (build tested only) look palatable? What is wrong with you? :/ There was a patch attached with this email, and yet you did your own, and yet again you managed to add an underscore at the beginning of the API name, in an API set where there are no leading underscores. Even if KVM does that for its own reasons, you should be able to fit your naming style to the interface you're adding your droplets into. I will post an eventfd_read_ctx() to Andrew ASAP. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 12:43:07PM -0800, Davide Libenzi wrote: On Wed, 6 Jan 2010, Michael S. Tsirkin wrote: On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote: On Tue, 1 Sep 2009, Avi Kivity wrote: On 09/01/2009 02:45 AM, Davide Libenzi wrote: On Thu, 27 Aug 2009, Davide Libenzi wrote: On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: Oh, I stopped pushing EFD_STATE since we have a solution. Do you guys need the kernel-side eventfd_ctx_read() I posted or not? Because if nobody uses it, I'm not going to push it. Guys, I did not get a reply on this. Do you need me to push it, or you're not going to use it at the end? We'll use it eventually, but we're still some ways from it. OK, then bug me when you're going to need it. I won't push it before that. - Davide So, it turns out that we need this: be thought we don't because currently kvm does not zero eventfd counter when it polls eventfd. But this causes spurious interrupts when we disconnect irqfd from kvm and re-connect it back. However, since kvm does its own thing with the wait queue, and might read the counter from wait queue callback (which might be from interrupt context), a simpler, lower-level interface would be better for us. Does the following (build tested only) look palatable? What is wrong with you? :/ I was trying to be helpful. There was a patch attached with this email, and yet you did your own, I tried to explain, no? That patch was taking wait queue spinlock and was assuming that eventfd_read_ctx is called from a task that can block. KVM attaches its own poller so this is not a good fit for us. Hope this clarifies. and yet again you managed to add an underscore at the beginning of the API name, in an API set where there are no leading underscores. Even if KVM does that for its own reasons, you should be able to fit your naming style to the interface you're adding your droplets into. I will post an eventfd_read_ctx() to Andrew ASAP. - Davide Sorry about the underscore - it's easy to fix but I won't bore you with more patch revisions before I understand what makes you unhappy. Before KVM starts creating workqueues and doing schedule_work calls just to work around the API, can we discuss this please? I am not hung on my patch, but could we have it work so that we can call eventfd_read_ctx from wait queue callback directly? Thanks! -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 6 Jan 2010, Michael S. Tsirkin wrote: I tried to explain, no? That patch was taking wait queue spinlock and was assuming that eventfd_read_ctx is called from a task that can block. KVM attaches its own poller so this is not a good fit for us. Hope this clarifies. and yet again you managed to add an underscore at the beginning of the API name, in an API set where there are no leading underscores. Even if KVM does that for its own reasons, you should be able to fit your naming style to the interface you're adding your droplets into. I will post an eventfd_read_ctx() to Andrew ASAP. - Davide Sorry about the underscore - it's easy to fix but I won't bore you with more patch revisions before I understand what makes you unhappy. Before KVM starts creating workqueues and doing schedule_work calls just to work around the API, can we discuss this please? I am not hung on my patch, but could we have it work so that we can call eventfd_read_ctx from wait queue callback directly? The read needs to wake possible OUT waiters, *cannot* be done your way. On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). I thought you *already* do your own stuff from a work queue, why would you need to read from IRQ context? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 01:17:02PM -0800, Davide Libenzi wrote: On Wed, 6 Jan 2010, Michael S. Tsirkin wrote: I tried to explain, no? That patch was taking wait queue spinlock and was assuming that eventfd_read_ctx is called from a task that can block. KVM attaches its own poller so this is not a good fit for us. Hope this clarifies. and yet again you managed to add an underscore at the beginning of the API name, in an API set where there are no leading underscores. Even if KVM does that for its own reasons, you should be able to fit your naming style to the interface you're adding your droplets into. I will post an eventfd_read_ctx() to Andrew ASAP. - Davide Sorry about the underscore - it's easy to fix but I won't bore you with more patch revisions before I understand what makes you unhappy. Before KVM starts creating workqueues and doing schedule_work calls just to work around the API, can we discuss this please? I am not hung on my patch, but could we have it work so that we can call eventfd_read_ctx from wait queue callback directly? The read needs to wake possible OUT waiters, *cannot* be done your way. Right. Now I see. Thanks! On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! I thought you *already* do your own stuff from a work queue, why would you need to read from IRQ context? - Davide This was just work around while interrupt delivery was using mutex locks, so we had a workqueue to work around that limitation. Now that it has been all switched to RCU, we'll be getting getting rid of this. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) Yes, I know you are called locked, but it does not lead to a clean interface. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! This is a lot more complex than I thought. The wakeup code is already enumerating the list, and doing a wakeup might trigger a secondary enumeration/recursion. Do you really need to consume the value from IRQ context, or you can simply peek the value, and flush it later? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) Yes, I know you are called locked, but it does not lead to a clean interface. True. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! This is a lot more complex than I thought. The wakeup code is already enumerating the list, and doing a wakeup might trigger a secondary enumeration/recursion. For KVM what you describe is I think is not a problem: we check wake type and ignore POLLOUT events. Do you really need to consume the value from IRQ context, or you can simply peek the value, and flush it later? - Davide Maybe yes. I'll think it over and get back to you. Thanks! -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) Yes, I know you are called locked, but it does not lead to a clean interface. True. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! This is a lot more complex than I thought. The wakeup code is already enumerating the list, and doing a wakeup might trigger a secondary enumeration/recursion. For KVM what you describe is I think is not a problem: we check wake type and ignore POLLOUT events. You seem to think in one dimension only ;) The interface needs to be stable for everyone. Maybe yes. I'll think it over and get back to you. Thanks! Let me know. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) Yes, I know you are called locked, but it does not lead to a clean interface. True. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! This is a lot more complex than I thought. The wakeup code is already enumerating the list, and doing a wakeup might trigger a secondary enumeration/recursion. For KVM what you describe is I think is not a problem: we check wake type and ignore POLLOUT events. You seem to think in one dimension only ;) The interface needs to be stable for everyone. I agree it's better to have interface that can't be misused. I was just pointing out that users *can* break recursion by looking at event type. Maybe yes. I'll think it over and get back to you. Thanks! Let me know. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote: On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: On top of creating an interface which requires a lock, which noone can get from the interface itself, since it's not exposed. I think here's how KVM gets it: the way it does is by calling poll with our own poll table, then in poll_queue_proc we get wait queue pointer, and we use the wait queue. Lock is in there :) Yes, I know you are called locked, but it does not lead to a clean interface. True. I could split the two and have a locked one, and an unlocked one, but that looks shitty too (for the above reason). Yes, this will work. Thanks! This is a lot more complex than I thought. The wakeup code is already enumerating the list, and doing a wakeup might trigger a secondary enumeration/recursion. For KVM what you describe is I think is not a problem: we check wake type and ignore POLLOUT events. You seem to think in one dimension only ;) The interface needs to be stable for everyone. Maybe yes. I'll think it over and get back to you. Thanks! Let me know. - Davide OK. What I think we need is a way to remove ourselves from the eventfd wait queue and clear the counter atomically. We currently do remove_wait_queue(irqfd-wqh, irqfd-wait); where wqh saves the eventfd wait queue head. If we do this before proposed eventfd_read_ctx, we can lose events. If we do this after, we can get spurious events. An unlocked read is one way to fix this. -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote: OK. What I think we need is a way to remove ourselves from the eventfd wait queue and clear the counter atomically. We currently do remove_wait_queue(irqfd-wqh, irqfd-wait); where wqh saves the eventfd wait queue head. You do a remove_wait_queue() from inside a callback wakeup on the same wait queue head? If we do this before proposed eventfd_read_ctx, we can lose events. If we do this after, we can get spurious events. An unlocked read is one way to fix this. You posted one line of code and a two lines analysis of the issue. Can you be a little bit more verbose and show me more code, so that I can actually see what is going on? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote: On 08/27/2009 02:30 AM, Davide Libenzi wrote: On Wed, 26 Aug 2009, Davide Libenzi wrote: I see no kernel equivalent to read(), but that's easily done. Adding an in-kernel read based on ctx, that is no problem at all. Something like the untested below. I had thought you said the eventfd readers where in userspace, but I might have misunderstood you. No, they're all over the place. Further, with Davide's proposal you must be a reader to signal events. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Here it's the same, just it's about the case of writing a *given* value. What about having EFD_STATE simply mean do not use a counter, just write the value without affecting the way read works, and use /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); See below? Paolo On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } - 8-- --- Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag This implements a new EFD_ABSOLUTE flag for eventfd. This changes eventfd behaviour so that write simply stores the value written, and is always non-blocking. Untested (I just modified Michael's patch, and given the simpler code I'm not sure it's now worthwhile introducing the inline functions), but otherwise Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- fs/eventfd.c| 13 - include/linux/eventfd.h |4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 347a0e0..7b279e3 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -31,8 +31,6 @@ static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n) { - return ULLONG_MAX - n ctx-count; - return (ctx-flags EFD_ABSOLUTE) || (ULLONG_MAX - n ctx-count); } static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt) @@ -42,10 +40,14 @@ static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt) { - if (eventfd_writeable(ctx, ucnt)) - ucnt = ULLONG_MAX - ctx-count; + if (ctx-flags EFD_ABSOLUTE) + ctx-count = ucnt; + else { + if (ULLONG_MAX - ctx-count ucnt) + ucnt = ULLONG_MAX - ctx-count; - ctx-count += ucnt; + ctx-count += ucnt; + } } static inline u64 eventfd_doread(struct eventfd_ctx *ctx) diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 3b85ba6..78ff649 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -19,11 +19,12 @@ * shared O_* flags. */ #define EFD_SEMAPHORE (1 0) +#define EFD_ABSOLUTE (1 1) #define EFD_CLOEXEC O_CLOEXEC #define EFD_NONBLOCK O_NONBLOCK #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE) #ifdef CONFIG_EVENTFD -- 1.6.2.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 27, 2009 at 11:05:30AM +0200, Paolo Bonzini wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Here it's the same, just it's about the case of writing a *given* value. What about having EFD_STATE simply mean do not use a counter, just write the value without affecting the way read works, and use /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); That would work for kvm. See below? Paolo On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } - 8-- --- Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag This implements a new EFD_ABSOLUTE flag for eventfd. This changes eventfd behaviour so that write simply stores the value written, and is always non-blocking. Untested (I just modified Michael's patch, and given the simpler code I'm not sure it's now worthwhile introducing the inline functions), but otherwise Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- fs/eventfd.c| 13 - include/linux/eventfd.h |4 ++-- 2 files changed, 10 insertions(+), 7 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 347a0e0..7b279e3 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -31,8 +31,6 @@ static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n) { - return ULLONG_MAX - n ctx-count; - return (ctx-flags EFD_ABSOLUTE) || (ULLONG_MAX - n ctx-count); } static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt) @@ -42,10 +40,14 @@ static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt) { - if (eventfd_writeable(ctx, ucnt)) - ucnt = ULLONG_MAX - ctx-count; + if (ctx-flags EFD_ABSOLUTE) + ctx-count = ucnt; + else { + if (ULLONG_MAX - ctx-count ucnt) + ucnt = ULLONG_MAX - ctx-count; - ctx-count += ucnt; + ctx-count += ucnt; + } } static inline u64 eventfd_doread(struct eventfd_ctx *ctx) diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 3b85ba6..78ff649 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -19,11 +19,12 @@ * shared O_* flags. */ #define EFD_SEMAPHORE (1 0) +#define EFD_ABSOLUTE (1 1) #define EFD_CLOEXEC O_CLOEXEC #define EFD_NONBLOCK O_NONBLOCK #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE) #ifdef CONFIG_EVENTFD -- 1.6.2.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote: On 08/27/2009 02:30 AM, Davide Libenzi wrote: On Wed, 26 Aug 2009, Davide Libenzi wrote: I see no kernel equivalent to read(), but that's easily done. Adding an in-kernel read based on ctx, that is no problem at all. Something like the untested below. I had thought you said the eventfd readers where in userspace, but I might have misunderstood you. No, they're all over the place. Further, with Davide's proposal you must be a reader to signal events. To signal events, you must have an instance of ctx (with the new exposed eventfd_ctx_read). How would you do otherwise? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 27 Aug 2009, Paolo Bonzini wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Like I repeated 25 times already, EFD_SEMAPHORE was added, because a *semaphore* is a pretty widely known and used abstraction. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote: On Thu, 27 Aug 2009, Paolo Bonzini wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Like I repeated 25 times already, EFD_SEMAPHORE was added, because a *semaphore* is a pretty widely known and used abstraction. what about an atomic variable, btw? does it make sense to implement write that does compare and exchange? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote: On Thu, 27 Aug 2009, Paolo Bonzini wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Like I repeated 25 times already, EFD_SEMAPHORE was added, because a *semaphore* is a pretty widely known and used abstraction. what about an atomic variable, btw? does it make sense to implement write that does compare and exchange? It is surprising to me, that is front of a workable solution w/out any use-once additions, yet you want to try to add optimizations and new ad-hoc abstractions to user visible interfaces. Now, you tell me what an atomic variable has to do with an eventfd. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 27, 2009 at 07:38:46AM -0700, Davide Libenzi wrote: On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote: On Thu, 27 Aug 2009, Paolo Bonzini wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was added exactly to avoid a read+write combination for the case of decrementing a value. Like I repeated 25 times already, EFD_SEMAPHORE was added, because a *semaphore* is a pretty widely known and used abstraction. what about an atomic variable, btw? does it make sense to implement write that does compare and exchange? It is surprising to me, that is front of a workable solution w/out any use-once additions, yet you want to try to add optimizations and new ad-hoc abstractions to user visible interfaces. Now, you tell me what an atomic variable has to do with an eventfd. - Davide Oh, I stopped pushing EFD_STATE since we have a solution. I am just trying to grok what does and what does not consititute a use-once addition, in your mind, and what does and what does not belong in eventfd. The reason atomic does not belong there and semaphore does is because one waits on semaphore but not on atomic? Is that it? -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: Oh, I stopped pushing EFD_STATE since we have a solution. I am just trying to grok what does and what does not consititute a use-once addition, in your mind, and what does and what does not belong in eventfd. The reason atomic does not belong there and semaphore does is because one waits on semaphore but not on atomic? Is that it? An atomic variable is not a synchronization interface. Even if it'd come up that we really need an atomic variable abstraction via an fd, this should not go via eventfd. Yeah, maybe the name syncfd would have been better, but the very reason why eventfd born was to allow KAIO to signal events to poll/select/epoll. That first implementation was also usable as a mutex. Then it was propsed to make eventfd to act as a semaphore too. Code was trivial, a semaphore fitted the interface, and the abstraction was a pretty damn known too. So it went in. Yes, you could have implemented a semaphore with the existing eventfd, by reading the counter and writing counter-1. But again, a semaphore was something widely known and generic enough, and was fitting the bill. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote: Oh, I stopped pushing EFD_STATE since we have a solution. Do you guys need the kernel-side eventfd_ctx_read() I posted or not? Because if nobody uses it, I'm not going to push it. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote: On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve or best use existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch, it's hoping that you see the pattern behind our usage, and help generalize it. If I understand it correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it? Please do. - Davide Problem looks like this: There are multiple processes (devices) where each has a condition (interrupt line) which it has logic to determine is either true or false. A single other process (hypervisor) is interested in a condition (interrupt level) which is a logical OR of all interrupt lines. On changes, an interrupt level value needs to be read and copied to guest virtual cpu. We also want ability to replace some or all processes above by a kernel components, with condition changes done potentially from hardware interrupt context. How we wanted to solve it with EFD_STATE: Share a separate eventfd between each device and the hypervisor. device sets state to either 0 or 1. hypervisor polls all eventfds, reads interrupt line on changes, calculates the interrupt level and updates guest. Alternative solution: shared memory where each device writes interrupt line value. This makes setup more complex (need to share around much more than just an fd), and makes access from interrupt impossible unless we lock the memory (and locking userspace memory introduces yet another set of issues). -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/26/2009 01:29 PM, Michael S. Tsirkin wrote: How we wanted to solve it with EFD_STATE: Share a separate eventfd between each device and the hypervisor. device sets state to either 0 or 1. hypervisor polls all eventfds, reads interrupt line on changes, calculates the interrupt level and updates guest. Alternative solution: shared memory where each device writes interrupt line value. This makes setup more complex (need to share around much more than just an fd), and makes access from interrupt impossible unless we lock the memory (and locking userspace memory introduces yet another set of issues). For completeness: If the device is implemented in the same process as the hypervisor, an eventfd isn't really needed, as there is an ioctl which performs the same operation. An important class of device implementations is real devices that are assigned to a guest. We would like to forward the interrupt directly from the host interrupt handler to qemu. Currently, we have a kvm-specific interrupt handler that forwards the interrupt using kvm-specific interfaces. We would like to use a generic interrupt handler implemented by uio, so we want a generic interrupt transfer mechanism. uio already supports edge-triggered interrupts using an eventfd-like mechanism. So it makes sense to extend uio to support real eventfds, and to make it also support level-triggered interrupts. We can work around the lack of state eventfd by having userspace wait on whatever mechanism uio uses to make the interrupt state visible, and then use the ioctl mentioned above to inform the hypervisor of this state. But it's faster and nicer to give both components an eventfd and let them communicate directly. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 26 Aug 2009, Michael S. Tsirkin wrote: On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote: On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve or best use existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch, it's hoping that you see the pattern behind our usage, and help generalize it. If I understand it correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it? Please do. - Davide Problem looks like this: There are multiple processes (devices) where each has a condition (interrupt line) which it has logic to determine is either true or false. A single other process (hypervisor) is interested in a condition (interrupt level) which is a logical OR of all interrupt lines. On changes, an interrupt level value needs to be read and copied to guest virtual cpu. We also want ability to replace some or all processes above by a kernel components, with condition changes done potentially from hardware interrupt context. How we wanted to solve it with EFD_STATE: Share a separate eventfd between each device and the hypervisor. device sets state to either 0 or 1. hypervisor polls all eventfds, reads interrupt line on changes, calculates the interrupt level and updates guest. Alternative solution: shared memory where each device writes interrupt line value. This makes setup more complex (need to share around much more than just an fd), and makes access from interrupt impossible unless we lock the memory (and locking userspace memory introduces yet another set of issues). OK, if I get it correctly, there is one eventfd signaler (the device), and one eventfd reader (the hypervisor), right? Each hypervisor listens for multiple devices detecting state changes, and associating the eventfd line to the IRQ number by some configuration (ala PCI), right? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/26/2009 08:45 PM, Davide Libenzi wrote: OK, if I get it correctly, there is one eventfd signaler (the device), and one eventfd reader (the hypervisor), right? Each hypervisor listens for multiple devices detecting state changes, and associating the eventfd line to the IRQ number by some configuration (ala PCI), right? Yes. The PCI stuff happens in userspace, all the hypervisor sees is this eventfd is IRQ 10. There may be multiple eventfds routed to one IRQ (corresponding to a shared IRQ line). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 26 Aug 2009, Avi Kivity wrote: On 08/26/2009 08:45 PM, Davide Libenzi wrote: OK, if I get it correctly, there is one eventfd signaler (the device), and one eventfd reader (the hypervisor), right? Each hypervisor listens for multiple devices detecting state changes, and associating the eventfd line to the IRQ number by some configuration (ala PCI), right? Yes. The PCI stuff happens in userspace, all the hypervisor sees is this eventfd is IRQ 10. There may be multiple eventfds routed to one IRQ (corresponding to a shared IRQ line). Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd, cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd, cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd, cnt, sizeof(cnt)); return state - 1; } - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/26/2009 10:13 PM, Davide Libenzi wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } Hadn't though of read+write as set. While the 1+ is a little ugly, it's workable. I see no kernel equivalent to read(), but that's easily done. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 26 Aug 2009, Avi Kivity wrote: On 08/26/2009 10:13 PM, Davide Libenzi wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } Hadn't though of read+write as set. While the 1+ is a little ugly, it's workable. Pick what you want, as long as it always writes something != 0 :) I see no kernel equivalent to read(), but that's easily done. Adding an in-kernel read based on ctx, that is no problem at all. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote: On 08/26/2009 10:13 PM, Davide Libenzi wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } Hadn't though of read+write as set. While the 1+ is a little ugly, it's workable. It's two system calls instead of one to inject interrupt. I see no kernel equivalent to read(), but that's easily done. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 26 Aug 2009, Gleb Natapov wrote: On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote: On 08/26/2009 10:13 PM, Davide Libenzi wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } Hadn't though of read+write as set. While the 1+ is a little ugly, it's workable. It's two system calls instead of one to inject interrupt. I guess that's going to completely throw off-chart your RT performance, doesn't it? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, 26 Aug 2009, Davide Libenzi wrote: I see no kernel equivalent to read(), but that's easily done. Adding an in-kernel read based on ctx, that is no problem at all. Something like the untested below. I had thought you said the eventfd readers where in userspace, but I might have misunderstood you. - Davide --- fs/eventfd.c| 51 +--- include/linux/eventfd.h |7 ++ 2 files changed, 43 insertions(+), 15 deletions(-) Index: linux-2.6.mod/fs/eventfd.c === --- linux-2.6.mod.orig/fs/eventfd.c 2009-08-26 15:58:03.0 -0700 +++ linux-2.6.mod/fs/eventfd.c 2009-08-26 16:20:03.0 -0700 @@ -130,26 +130,33 @@ static unsigned int eventfd_poll(struct return events; } -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, - loff_t *ppos) +/** + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero. + * @ctx: [in] Pointer to eventfd context. + * @no_wait: [in] Different from zero if the operation should not block. + * @cnt: [out] Pointer to the 64bit conter value. + * + * Returns zero if successful, or the following error codes: + * + * -EAGAIN : The operation would have blocked but @no_wait was nonzero. + * -ERESTARTSYS : A signal interrupted the wait operation. + */ +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) { - struct eventfd_ctx *ctx = file-private_data; ssize_t res; - __u64 ucnt = 0; DECLARE_WAITQUEUE(wait, current); - if (count sizeof(ucnt)) - return -EINVAL; spin_lock_irq(ctx-wqh.lock); + *cnt = 0; res = -EAGAIN; if (ctx-count 0) - res = sizeof(ucnt); - else if (!(file-f_flags O_NONBLOCK)) { + res = 0; + else if (!no_wait) { __add_wait_queue(ctx-wqh, wait); - for (res = 0;;) { + for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (ctx-count 0) { - res = sizeof(ucnt); + res = 0; break; } if (signal_pending(current)) { @@ -163,18 +170,32 @@ static ssize_t eventfd_read(struct file __remove_wait_queue(ctx-wqh, wait); __set_current_state(TASK_RUNNING); } - if (likely(res 0)) { - ucnt = (ctx-flags EFD_SEMAPHORE) ? 1 : ctx-count; - ctx-count -= ucnt; + if (likely(res == 0)) { + *cnt = (ctx-flags EFD_SEMAPHORE) ? 1 : ctx-count; + ctx-count -= *cnt; if (waitqueue_active(ctx-wqh)) wake_up_locked_poll(ctx-wqh, POLLOUT); } spin_unlock_irq(ctx-wqh.lock); - if (res 0 put_user(ucnt, (__u64 __user *) buf)) - return -EFAULT; return res; } +EXPORT_SYMBOL_GPL(eventfd_ctx_read); + +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, + loff_t *ppos) +{ + struct eventfd_ctx *ctx = file-private_data; + ssize_t res; + __u64 cnt; + + if (count sizeof(cnt)) + return -EINVAL; + if ((res = eventfd_ctx_read(ctx, file-f_flags O_NONBLOCK, cnt)) 0) + return res; + + return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt); +} static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) Index: linux-2.6.mod/include/linux/eventfd.h === --- linux-2.6.mod.orig/include/linux/eventfd.h 2009-08-26 15:58:03.0 -0700 +++ linux-2.6.mod/include/linux/eventfd.h 2009-08-26 16:17:01.0 -0700 @@ -33,6 +33,7 @@ struct file *eventfd_fget(int fd); struct eventfd_ctx *eventfd_ctx_fdget(int fd); struct eventfd_ctx *eventfd_ctx_fileget(struct file *file); int eventfd_signal(struct eventfd_ctx *ctx, int n); +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt); #else /* CONFIG_EVENTFD */ @@ -55,6 +56,12 @@ static inline void eventfd_ctx_put(struc } +static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, + __u64 *cnt) +{ + return -ENOSYS; +} + #endif #endif /* _LINUX_EVENTFD_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/27/2009 02:30 AM, Davide Libenzi wrote: On Wed, 26 Aug 2009, Davide Libenzi wrote: I see no kernel equivalent to read(), but that's easily done. Adding an in-kernel read based on ctx, that is no problem at all. Something like the untested below. I had thought you said the eventfd readers where in userspace, but I might have misunderstood you. No, they're all over the place. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Wed, Aug 26, 2009 at 01:04:09PM -0700, Davide Libenzi wrote: On Wed, 26 Aug 2009, Gleb Natapov wrote: On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote: On 08/26/2009 10:13 PM, Davide Libenzi wrote: Ok, so why not using the eventfd counter as state? On the device side: void write_state(int sfd, int state) { u64 cnt; /* Clear the current state, sfd is in non-blocking mode */ read(sfd,cnt, sizeof(cnt)); /* Writes new state */ cnt = 1 + !!state; write(sfd,cnt, sizeof(cnt)); } On the hypervisor side: int read_state(int sfd) { u64 cnt; read(sfd,cnt, sizeof(cnt)); return state - 1; } Hadn't though of read+write as set. While the 1+ is a little ugly, it's workable. It's two system calls instead of one to inject interrupt. I guess that's going to completely throw off-chart your RT performance, doesn't it? Do you consider interrupt injection path not worth of optimizing? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
There are userspace libraries that do almost everything, but you hardly see things like pthread_(EFD_STATE-like)_create() or similar system interfaces based on such abstraction. It actually seems as close to a condition variable as an eventfd can be. A pthread condition typical code usage maps to eventfd like: while (read(efd, ...) 0) if (CONDITION) break; So a pthread condition is really a wakeup gate like eventfd is. EFD_STATE has nothing to do with a pthread condition. No, your code does not work for pthread_cond_broadcast (which arguably is the common case) because all the eventfd readers after the first would block. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Aug 24, 2009 at 03:15:05PM -0700, Davide Libenzi wrote: On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote: On Sun, 23 Aug 2009, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. Hmm. All we try to do is, associate a small key with the event that we signal. Is it really that uncommon/KVM specific? All I'm trying to do, is to avoid that eventfd will become an horrible multiplexor for every freaky one-time-use behaviors arising inside kernel modules. Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve or best use existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch, it's hoping that you see the pattern behind our usage, and help generalize it. If I understand it correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it? - Davide -- MST -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: Yes, we don't want that. The best thing is to try to restate the problem in a way that is generic, and then either solve or best use existing solution. Right? I thought I had that, but apparently not. The reason I'm Cc-ing you is not to try and spam you until you give up and accept the patch, it's hoping that you see the pattern behind our usage, and help generalize it. If I understand it correctly, you believe this is not possible and so any solution will have to be in KVM? Or maybe I didn't state the problem clearly enough and should restate it? Please do. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/24/2009 09:25 PM, Davide Libenzi wrote: Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. There are libraries that provide notifications (or fire watches) when some value changes. They're much less frequently used than events or semaphores, though. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, 23 Aug 2009, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote: On Sun, 23 Aug 2009, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. Hmm. All we try to do is, associate a small key with the event that we signal. Is it really that uncommon/KVM specific? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Mon, 24 Aug 2009, Avi Kivity wrote: On 08/24/2009 09:25 PM, Davide Libenzi wrote: Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. There are libraries that provide notifications (or fire watches) when some value changes. They're much less frequently used than events or semaphores, though. There are userspace libraries that do almost everything, but you hardly see things like pthread_(EFD_STATE-like)_create() or similar system interfaces based on such abstraction. Is that really difficult to understand where I'm standing, leaving the KVM hat off for a moment? - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
There are userspace libraries that do almost everything, but you hardly see things like pthread_(EFD_STATE-like)_create() or similar system interfaces based on such abstraction. It actually seems as close to a condition variable as an eventfd can be. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote: On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote: On Sun, 23 Aug 2009, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Indeed, the default eventfd behaviour is like, well, an event. Signaling (kernel side) or writing (userspace side), signals the event. Waiting (reading) it, will reset the event. If you use EFD_SEMAPHORE, you get a semaphore-like behavior. Events and sempahores are two widely known and used abstractions. The EFD_STATE proposed one, well, no. Not at all. Hmm. All we try to do is, associate a small key with the event that we signal. Is it really that uncommon/KVM specific? All I'm trying to do, is to avoid that eventfd will become an horrible multiplexor for every freaky one-time-use behaviors arising inside kernel modules. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Tue, 25 Aug 2009, Paolo Bonzini wrote: There are userspace libraries that do almost everything, but you hardly see things like pthread_(EFD_STATE-like)_create() or similar system interfaces based on such abstraction. It actually seems as close to a condition variable as an eventfd can be. A pthread condition typical code usage maps to eventfd like: while (read(efd, ...) 0) if (CONDITION) break; So a pthread condition is really a wakeup gate like eventfd is. EFD_STATE has nothing to do with a pthread condition. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/25/2009 01:08 AM, Davide Libenzi wrote: Is that really difficult to understand where I'm standing, leaving the KVM hat off for a moment? I understand it perfectly. I take the same position with kvm. I'm providing more data in the hope that you'll change you mind, not trying to flood you with email so you'll give up. We can always create our eventfd-lookalike for kvm, but I'd rather not do that (other options include a userspace proxy through existing interfaces, it might even be better than changing eventfd if we decide performance for level-triggered interrupts is not critical). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote: I thought the point was to move assigned devices out of KVM? Grr. Forgot about that. That's much more important. Looks like we'll have do with a separate char device or something? We can still tunnel it through userspace. Most assigned devices will have MSI. uio can signal the eventfd when the level changes, qemu can read the level and write it to kvm. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, Aug 23, 2009 at 04:01:29PM +0300, Avi Kivity wrote: On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote: I thought the point was to move assigned devices out of KVM? Grr. Forgot about that. That's much more important. Looks like we'll have do with a separate char device or something? We can still tunnel it through userspace. More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Most assigned devices will have MSI. uio can signal the eventfd when the level changes, qemu can read the level and write it to kvm. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. Look at eventfd as pipe replacement: for wake-up between processes, it works well. But what if we want to pass around some kind of key as well? pipe lets you pass single-byte values atomically. This flag allows us to use 8 byte values as keys. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/23/2009 04:30 PM, Michael S. Tsirkin wrote: On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote: On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote: More important here is realization that eventfd is a mutex/semaphore implementation, not a generic event reporting interface as we are trying to use it. Well it is a generic event reporting interface (for example, aio uses it). Davide, I think it's a valid point. For example, what read on eventfd does (zero a counter and return) is not like any semaphore I saw. It is similar to Win32 events (which can be used like contorted mutexes to sync two processes). Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 20 Aug 2009, Paolo Bonzini wrote: On 08/20/2009 07:44 PM, Davide Libenzi wrote: On Thu, 20 Aug 2009, Avi Kivity wrote: On 08/20/2009 07:20 PM, Davide Libenzi wrote: I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. I really didn't mean to be harsh, but seriously, we cannot just have a Multiplexing Feast over eventfd, with one-time users. EFD_STATE actually does two changes: a) makes read block until the value changes; b) makes each write override the previous one. How would you feel if the two changes were separated? I'm afraid that splitting the change won't make less inappropriate for eventfd. I've no doubt KVM needs something like this, but this just don't belong to eventfd, which act as either a mutex or a semaphore. This EFD_STATE does not belong to a generic interface, proof of which is the lack of such abstraction in any generic library I'm aware of. This aside from any consideration about how the code will look with that multiplexing in place. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] eventfd: new EFD_STATE flag
Davide, Looks like I got an ack from Avi and no comments from others, so the following patchset is identical to the last RFC. Can it be merged for 2.6.32? Thanks! This series implements a new EFD_STATE flag for eventfd. When set, this changes eventfd behaviour in the following way: - write simply stores the value written, and is always non-blocking - read unblocks when the value written changes, and returns the value written Motivation: we'd like to use eventfd in qemu to pass interrupts from (emulated or assigned) devices to guest. For level interrupts, the counter supported currently by eventfd is not a good match: we really need to set interrupt to a level, typically 0 or 1, wake the guest if there was a change and give the guest ability to see the last value written. The level can be set either from kernel (e.g. with assigned devices) or from userspace. Michael S. Tsirkin (2): eventfd: reorganize the code to simplify new flags eventfd: EFD_STATE flag fs/eventfd.c| 83 +++ include/linux/eventfd.h |3 +- 2 files changed, 71 insertions(+), 15 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 20 Aug 2009, Michael S. Tsirkin wrote: Davide, Looks like I got an ack from Avi and no comments from others, so the following patchset is identical to the last RFC. Can it be merged for 2.6.32? Thanks! I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/20/2009 07:20 PM, Davide Libenzi wrote: I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. In the meanwhile, we can let vhost-net support edge-triggered interrupts only, let qemu terminate those eventfds and convert then to level-triggered interrupts (which it can then inject using the existing ioctl). It will keep vhost-net and kvm simpler at the cost of some performance penalty to guests using level interrupts. These suck anyway so we'll point users at msi. -- I have a truly marvellous patch that fixes the bug which thisb signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, 20 Aug 2009, Avi Kivity wrote: On 08/20/2009 07:20 PM, Davide Libenzi wrote: I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. I really didn't mean to be harsh, but seriously, we cannot just have a Multiplexing Feast over eventfd, with one-time users. - Davide -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/20/2009 07:44 PM, Davide Libenzi wrote: On Thu, 20 Aug 2009, Avi Kivity wrote: On 08/20/2009 07:20 PM, Davide Libenzi wrote: I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. I really didn't mean to be harsh, but seriously, we cannot just have a Multiplexing Feast over eventfd, with one-time users. EFD_STATE actually does two changes: a) makes read block until the value changes; b) makes each write override the previous one. How would you feel if the two changes were separated? I can see each of them has use cases For example, (a) could be implemented by using select's xfds (POLLPRI) to poll for value changes (rfds would still poll for non-zeroness). Then Michael does not need even to read the eventfd; instead he'd check POLLIN with a zero timeout. (b) could be implemented with a flag like Michael did. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 20, 2009 at 08:38:48PM +0300, Avi Kivity wrote: On 08/20/2009 07:20 PM, Davide Libenzi wrote: I briefly looked at this while in vacation, although I did not reply hoping the horrible feeling about this code would go away. It didn't. I find this to be an ugly and ad-hoc multiplexing of eventfd with added functionalities of questionable general use. I'm pretty sure you can do better on KVM side, to solve the problem w/out littering eventfd. While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. In the meanwhile, we can let vhost-net support edge-triggered interrupts only, let qemu terminate those eventfds and convert then to level-triggered interrupts (which it can then inject using the existing ioctl). It will keep vhost-net and kvm simpler at the cost of some performance penalty to guests using level interrupts. These suck anyway so we'll point users at msi. I thought the point was to move assigned devices out of KVM? -- I have a truly marvellous patch that fixes the bug which thisb signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote: While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. In the meanwhile, we can let vhost-net support edge-triggered interrupts only, let qemu terminate those eventfds and convert then to level-triggered interrupts (which it can then inject using the existing ioctl). It will keep vhost-net and kvm simpler at the cost of some performance penalty to guests using level interrupts. These suck anyway so we'll point users at msi. I thought the point was to move assigned devices out of KVM? Grr. Forgot about that. That's much more important. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] eventfd: new EFD_STATE flag
On Thu, Aug 20, 2009 at 09:06:51PM +0300, Avi Kivity wrote: On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote: While we could argue about this my feeling is that we should drop this, at least until we can quantify what benefit it has and whether there are any Davide-acceptable alternatives. In the meanwhile, we can let vhost-net support edge-triggered interrupts only, let qemu terminate those eventfds and convert then to level-triggered interrupts (which it can then inject using the existing ioctl). It will keep vhost-net and kvm simpler at the cost of some performance penalty to guests using level interrupts. These suck anyway so we'll point users at msi. I thought the point was to move assigned devices out of KVM? Grr. Forgot about that. That's much more important. Looks like we'll have do with a separate char device or something? -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html