Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-13 Thread Michael S. Tsirkin
On Mon, Jan 11, 2010 at 02:53:53PM -0800, Davide Libenzi wrote:
 On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
 
  On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
   On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
   
Yes, I think this will work. Will test and report. Thanks!
   
   If you ever happen to find a solution in KVM that does not require 
   eventfd 
   changes, that'd be even better :)
   Otherwise ping me when you tested, that I'll puch this to Andrew.
   
   
   - Davide
  
  Hmm, the fix also needs changes in kvm driver to call the new API. To
  simplify dependencies, can we merge this patch through Avi's kvm tree as
  well?
 
 It is fine for me. I can officially post it to Avi, CCing kvm and lkml.
 
 
 - Davide

Since it now uses wait_queue_t, I had to add include linux/wait.h to
make it self-contained.  The following works fine for me. Thanks!

---

From: Davide Libenzi davi...@xmailserver.org
Subject: eventfd: support eventfd_ctx_remove_wait_queue

Add eventfd_ctx_remove_wait_queue API so that kernel users can clear
counter while getting out of poll queue, atomically.

Signed-off-by: Davide Libenzi davi...@xmailserver.org
Tested-by: Michael S. Tsirkin m...@redhat.com

--

diff --git a/fs/eventfd.c b/fs/eventfd.c
index d26402f..4cda278 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct file *file, 
poll_table *wait)
return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-   loff_t *ppos)
+static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+   *cnt = (ctx-flags  EFD_SEMAPHORE) ? 1: ctx-count;
+   ctx-count -= *cnt;
+}
+
+/**
+ * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait 
queue.
+ * @ctx: [in] Pointer to eventfd context.
+ * @wait: [in] Wait queue to be removed.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN  : The operation would have blocked.
+ *
+ * This is used to atomically remove a wait queue entry from the eventfd wait
+ * queue head, and read/reset the counter value.
+ */
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+ __u64 *cnt)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(ctx-wqh.lock, flags);
+   eventfd_ctx_do_read(ctx, cnt);
+   __remove_wait_queue(ctx-wqh, wait);
+   if (*cnt != 0  waitqueue_active(ctx-wqh))
+   wake_up_locked_poll(ctx-wqh, POLLOUT);
+   spin_unlock_irqrestore(ctx-wqh.lock, flags);
+
+   return *cnt != 0 ? 0: -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN  : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-   struct eventfd_ctx *ctx = file-private_data;
ssize_t res;
-   __u64 ucnt = 0;
DECLARE_WAITQUEUE(wait, current);
 
-   if (count  sizeof(ucnt))
-   return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
+   *cnt = 0;
res = -EAGAIN;
if (ctx-count  0)
-   res = sizeof(ucnt);
-   else if (!(file-f_flags  O_NONBLOCK)) {
+   res = 0;
+   else if (!no_wait) {
__add_wait_queue(ctx-wqh, wait);
-   for (res = 0;;) {
+   for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (ctx-count  0) {
-   res = sizeof(ucnt);
+   res = 0;
break;
}
if (signal_pending(current)) {
@@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file *file, char 
__user *buf, size_t count,
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (likely(res  0)) {
-   ucnt = (ctx-flags  EFD_SEMAPHORE) ? 1 : ctx-count;
-   ctx-count -= ucnt;
+   if (likely(res == 0)) {
+   eventfd_ctx_do_read(ctx, cnt);
if (waitqueue_active(ctx-wqh))
wake_up_locked_poll(ctx-wqh, POLLOUT);
}
spin_unlock_irq(ctx-wqh.lock);
-   if (res  0  put_user(ucnt, (__u64 __user *) buf))
-   return -EFAULT;
 

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Gleb Natapov
On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
 Or if I do it the other way:
   remove_wait_queue(irqfd-wqh, irqfd-wait);
   -
   eventfd_read_ctx(irqfd-eventfd, ucnt);
 
 now, if device signals eventfd at point marked by -,
 it will not be sent but counter will be cleared,
 so we will loose a message.
 
May be I am missing something here, but why doing it like that is a
problem? Event delivery races with interrupt masking so making masking
succeed before event delivery is OK. Event generation is asynchronous
anyway and could have happened jiffy latter anyway.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Michael S. Tsirkin
On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
 On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
  Or if I do it the other way:
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  -
  eventfd_read_ctx(irqfd-eventfd, ucnt);
  
  now, if device signals eventfd at point marked by -,
  it will not be sent but counter will be cleared,
  so we will loose a message.
  
 May be I am missing something here, but why doing it like that is a
 problem? Event delivery races with interrupt masking so making masking
 succeed before event delivery is OK. Event generation is asynchronous
 anyway and could have happened jiffy latter anyway.
 
 --
   Gleb.

No, event generation would only trigger a single interrupt.  This race
generates two interrupts for a single event.  This can never happen with
real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Gleb Natapov
On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
 On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
  On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
   Or if I do it the other way:
 remove_wait_queue(irqfd-wqh, irqfd-wait);
 -
 eventfd_read_ctx(irqfd-eventfd, ucnt);
   
   now, if device signals eventfd at point marked by -,
   it will not be sent but counter will be cleared,
   so we will loose a message.
   
  May be I am missing something here, but why doing it like that is a
  problem? Event delivery races with interrupt masking so making masking
  succeed before event delivery is OK. Event generation is asynchronous
  anyway and could have happened jiffy latter anyway.
  
  --
  Gleb.
 
 No, event generation would only trigger a single interrupt.  This race
 generates two interrupts for a single event.  This can never happen with
 real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
 
In quoted test above you are saying that we will loose a message. So
how does it generates two interrupts when we loose a message?
  
--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Michael S. Tsirkin
On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
 On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
  On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
   On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
Or if I do it the other way:
remove_wait_queue(irqfd-wqh, irqfd-wait);
-
eventfd_read_ctx(irqfd-eventfd, ucnt);

now, if device signals eventfd at point marked by -,
it will not be sent but counter will be cleared,
so we will loose a message.

   May be I am missing something here, but why doing it like that is a
   problem? Event delivery races with interrupt masking so making masking
   succeed before event delivery is OK. Event generation is asynchronous
   anyway and could have happened jiffy latter anyway.
   
   --
 Gleb.
  
  No, event generation would only trigger a single interrupt.  This race
  generates two interrupts for a single event.  This can never happen with
  real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
  
 In quoted test above you are saying that we will loose a message. So
 how does it generates two interrupts when we loose a message?
   
 --
   Gleb.


Right, sorry. I think what you miss is the fact that this is done during
interrupt vector masking/unmasking, so events signalled while eventfd is not
assigned to interrupt must not be lost, they should be pending and
delivered later when interrupt vector is unmasked, that is when
eventfd is reassigned to an interrupt.

So this implementation really loses an interrupt:

remove_wait_queue(irqfd-wqh, irqfd-wait);
- at this point vector is masked: we are not polling
   eventfd anymore, event triggered at this point should cause 
interrupt
   after vector is unmasked, but the only thing is causes
   is counter increment in eventfd.

eventfd_read_ctx(irqfd-eventfd, ucnt);
- the above call would clear the counter, so
   we won't get an interrupt when vector is later
   unmasked.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Gleb Natapov
On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote:
 On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
  On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
   On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
 Or if I do it the other way:
   remove_wait_queue(irqfd-wqh, irqfd-wait);
   -
   eventfd_read_ctx(irqfd-eventfd, ucnt);
 
 now, if device signals eventfd at point marked by -,
 it will not be sent but counter will be cleared,
 so we will loose a message.
 
May be I am missing something here, but why doing it like that is a
problem? Event delivery races with interrupt masking so making masking
succeed before event delivery is OK. Event generation is asynchronous
anyway and could have happened jiffy latter anyway.

--
Gleb.
   
   No, event generation would only trigger a single interrupt.  This race
   generates two interrupts for a single event.  This can never happen with
   real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.
   
  In quoted test above you are saying that we will loose a message. So
  how does it generates two interrupts when we loose a message?

  --
  Gleb.
 
 
 Right, sorry. I think what you miss is the fact that this is done during
 interrupt vector masking/unmasking, so events signalled while eventfd is not
 assigned to interrupt must not be lost, they should be pending and
 delivered later when interrupt vector is unmasked, that is when
 eventfd is reassigned to an interrupt.
 
Is this how MSI works? If interrupt is triggered while it is masked it
is reasserted after unmasking? This is certainly no true for interrupt
masking on irq chip level.

 So this implementation really loses an interrupt:
 
   remove_wait_queue(irqfd-wqh, irqfd-wait);
   - at this point vector is masked: we are not polling
  eventfd anymore, event triggered at this point should cause 
 interrupt
  after vector is unmasked, but the only thing is causes
  is counter increment in eventfd.
 
   eventfd_read_ctx(irqfd-eventfd, ucnt);
   - the above call would clear the counter, so
  we won't get an interrupt when vector is later
  unmasked.
 
Don't you going to use ucnt to set interrupt status bit? Can't you
re-trigger the interrupt after unmasking if status bit is set?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Michael S. Tsirkin
On Mon, Jan 11, 2010 at 11:36:16AM +0200, Gleb Natapov wrote:
 On Mon, Jan 11, 2010 at 11:19:19AM +0200, Michael S. Tsirkin wrote:
  On Mon, Jan 11, 2010 at 11:08:38AM +0200, Gleb Natapov wrote:
   On Mon, Jan 11, 2010 at 11:02:27AM +0200, Michael S. Tsirkin wrote:
On Mon, Jan 11, 2010 at 11:01:27AM +0200, Gleb Natapov wrote:
 On Thu, Jan 07, 2010 at 12:36:01PM +0200, Michael S. Tsirkin wrote:
  Or if I do it the other way:
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  -
  eventfd_read_ctx(irqfd-eventfd, ucnt);
  
  now, if device signals eventfd at point marked by -,
  it will not be sent but counter will be cleared,
  so we will loose a message.
  
 May be I am missing something here, but why doing it like that is a
 problem? Event delivery races with interrupt masking so making masking
 succeed before event delivery is OK. Event generation is asynchronous
 anyway and could have happened jiffy latter anyway.
 
 --
   Gleb.

No, event generation would only trigger a single interrupt.  This race
generates two interrupts for a single event.  This can never happen with
real hardware.  eventfd_ctx_remove_wait_queue would solve this problem.

   In quoted test above you are saying that we will loose a message. So
   how does it generates two interrupts when we loose a message?
 
   --
 Gleb.
  
  
  Right, sorry. I think what you miss is the fact that this is done during
  interrupt vector masking/unmasking, so events signalled while eventfd is not
  assigned to interrupt must not be lost, they should be pending and
  delivered later when interrupt vector is unmasked, that is when
  eventfd is reassigned to an interrupt.
  
 Is this how MSI works? If interrupt is triggered while it is masked it
 is reasserted after unmasking? This is certainly no true for interrupt
 masking on irq chip level.

Yes.

  So this implementation really loses an interrupt:
  
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  - at this point vector is masked: we are not polling
 eventfd anymore, event triggered at this point should cause 
  interrupt
 after vector is unmasked, but the only thing is causes
 is counter increment in eventfd.
  
  eventfd_read_ctx(irqfd-eventfd, ucnt);
  - the above call would clear the counter, so
 we won't get an interrupt when vector is later
 unmasked.
  
 Don't you going to use ucnt to set interrupt status bit? Can't you
 re-trigger the interrupt after unmasking if status bit is set?

Yes, this is what Davidel suggested originally.  There's nowhere to save
the ucnt though because event is being deasserted.

 --
   Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Davide Libenzi
On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:

 Yes, I think this will work. Will test and report. Thanks!

If you ever happen to find a solution in KVM that does not require eventfd 
changes, that'd be even better :)
Otherwise ping me when you tested, that I'll puch this to Andrew.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Michael S. Tsirkin
On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
 On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
 
  Yes, I think this will work. Will test and report. Thanks!
 
 If you ever happen to find a solution in KVM that does not require eventfd 
 changes, that'd be even better :)
 Otherwise ping me when you tested, that I'll puch this to Andrew.
 
 
 - Davide

Hmm, the fix also needs changes in kvm driver to call the new API. To
simplify dependencies, can we merge this patch through Avi's kvm tree as
well?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-11 Thread Davide Libenzi
On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:

 On Mon, Jan 11, 2010 at 11:14:14AM -0800, Davide Libenzi wrote:
  On Mon, 11 Jan 2010, Michael S. Tsirkin wrote:
  
   Yes, I think this will work. Will test and report. Thanks!
  
  If you ever happen to find a solution in KVM that does not require eventfd 
  changes, that'd be even better :)
  Otherwise ping me when you tested, that I'll puch this to Andrew.
  
  
  - Davide
 
 Hmm, the fix also needs changes in kvm driver to call the new API. To
 simplify dependencies, can we merge this patch through Avi's kvm tree as
 well?

It is fine for me. I can officially post it to Avi, CCing kvm and lkml.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Michael S. Tsirkin
On Thu, Jan 07, 2010 at 04:26:24PM -0800, Davide Libenzi wrote:
 On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
 
  Sure, I was trying to be as brief as possible, here's a detailed summary.
  
  Description of the system (MSI emulation in KVM):
  
  KVM supports an ioctl to assign/deassign an eventfd file to interrupt 
  message
  in guest OS.  When this eventfd is signalled, interrupt message is sent.
  This assignment is done from qemu system emulator.
  
  eventfd is signalled from device emulation in another thread in
  userspace or from kernel, which talks with guest OS through another
  eventfd and shared memory (possibility of out of process was discussed
  but never got implemented yet).
  
  Note: it's okay to delay messages from correctness point of view, but
  generally this is latency-sensitive path. If multiple identical messages
  are requested, it's okay to send a single last message, but missing a
  message altogether causes deadlocks.  Sending a message when none were
  requested might in theory cause crashes, in practice doing this causes
  performance degradation.
  
  Another KVM feature is interrupt masking: guest OS requests that we
  stop sending some interrupt message, possibly modified mapping
  and re-enables this message. This needs to be done without
  involving the device that might keep requesting events:
  while masked, message is marked pending, and guest might test
  the pending status.
  
  We can implement masking in system emulator in userspace, by using
  assign/deassign ioctls: when message is masked, we simply deassign all
  eventfd, and when it is unmasked, we assign them back.
  
  Here's some code to illustrate how this all works: assign/deassign code
  in kernel looks like the following:
  
  
  this is called to unmask interrupt
  
  static int
  kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
  {
  struct _irqfd *irqfd, *tmp;
  struct file *file = NULL;
  struct eventfd_ctx *eventfd = NULL;
  int ret;
  unsigned int events;
  
  irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
  
  ...
  
  file = eventfd_fget(fd);
  if (IS_ERR(file)) {
  ret = PTR_ERR(file);
  goto fail;
  }
  
  eventfd = eventfd_ctx_fileget(file);
  if (IS_ERR(eventfd)) {
  ret = PTR_ERR(eventfd);
  goto fail;
  }
  
  irqfd-eventfd = eventfd;
  
  /*
   * Install our own custom wake-up handling so we are notified via
   * a callback whenever someone signals the underlying eventfd
   */
  init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup);
  init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc);
  
  spin_lock_irq(kvm-irqfds.lock);
  
  events = file-f_op-poll(file, irqfd-pt);
  
  list_add_tail(irqfd-list, kvm-irqfds.items);
  spin_unlock_irq(kvm-irqfds.lock);
  
  A.
  /*
   * Check if there was an event already pending on the eventfd
   * before we registered, and trigger it as if we didn't miss it.
   */
  if (events  POLLIN)
  schedule_work(irqfd-inject);
  
  /*
   * do not drop the file until the irqfd is fully initialized, otherwise
   * we might race against the POLLHUP
   */
  fput(file);
  
  return 0;
  
  fail:
  ...
  }
 
 What is you do (under proper irqfd locking) something like:
 
   eventfd_ctx_read(ctx, 1, cnt);
   if (irqfd-cnt != cnt) {
   irqfd-cnt = cnt;
   schedule_work(irqfd-inject);
   }
 
 
 
 
  And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
  so this is not under the spinlock):
  
  /*
   * Synchronize with the wait-queue and unhook ourselves to
   * prevent
   * further events.
   */
  B.
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  
  
  
  /*
   * It is now safe to release the object's resources
   */
  eventfd_ctx_put(irqfd-eventfd);
  kfree(irqfd);
 
 And:
 
   eventfd_ctx_read(ctx, 1, irqfd-cnt);


-

   remove_wait_queue(irqfd-wqh, irqfd-wait);
 
 
 
 
 - Davide

Yes, this is exactly what I wanted to do.  So, here's the issue: if an
event is signalled at point -: after eventfd_ctx_read but before
remove_wait_queue, then we inject interrupt but counter will be left
non-zero and then when we unmask, we inject antoher, spurious interrupt.

This is why I wanted to have eventfd_ctx_read not take wait queue head
lock: then I could do:

spin_lock_irqsave(ctx-wqh.lock, flags);
eventfd_ctx_read(ctx, 1, irqfd-cnt);
__remove_wait_queue(irqfd-wqh, irqfd-wait);
spin_lock_irqrestore(ctx-wqh.lock, flags);


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Davide Libenzi
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

  What is you do (under proper irqfd locking) something like:
  
  eventfd_ctx_read(ctx, 1, cnt);
  if (irqfd-cnt != cnt) {
  irqfd-cnt = cnt;
  schedule_work(irqfd-inject);
  }
  
  
  
  
   And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
   so this is not under the spinlock):
   
   /*
* Synchronize with the wait-queue and unhook ourselves to
* prevent
* further events.
*/
   B.
   remove_wait_queue(irqfd-wqh, irqfd-wait);
   
 
   
   /*
* It is now safe to release the object's resources
*/
   eventfd_ctx_put(irqfd-eventfd);
   kfree(irqfd);
  
  And:
  
  eventfd_ctx_read(ctx, 1, irqfd-cnt);
 
 
 -
 
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  
  
  
  
  - Davide
 
 Yes, this is exactly what I wanted to do.  So, here's the issue: if an
 event is signalled at point -: after eventfd_ctx_read but before
 remove_wait_queue, then we inject interrupt but counter will be left
 non-zero and then when we unmask, we inject antoher, spurious interrupt.
 
 This is why I wanted to have eventfd_ctx_read not take wait queue head
 lock: then I could do:
 
   spin_lock_irqsave(ctx-wqh.lock, flags);
   eventfd_ctx_read(ctx, 1, irqfd-cnt);
   __remove_wait_queue(irqfd-wqh, irqfd-wait);
   spin_lock_irqrestore(ctx-wqh.lock, flags);

This is why I said under proper irqfd locking.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Davide Libenzi
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

 Not sure what you mean by irqfd locking.  IMO we must take waitqueue
 lock, otherwise we can not prevent wait queue callback from being
 invoked, right? Note that if we take our own lock under wait queue
 callback, we won't be able to use this lock around remove_wait_queue.
 Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
 queue lock - we'd like to do the same with our wait queue.
 
 Could you clarify pls?

Wait a second. Looking at the current code in Linus tree, when you 
deassign an irqfd, you are actually destroying it, so the idea of saving 
the counter on deassign is not going to work.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Michael S. Tsirkin
On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
 On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
 
  Not sure what you mean by irqfd locking.  IMO we must take waitqueue
  lock, otherwise we can not prevent wait queue callback from being
  invoked, right? Note that if we take our own lock under wait queue
  callback, we won't be able to use this lock around remove_wait_queue.
  Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
  queue lock - we'd like to do the same with our wait queue.
  
  Could you clarify pls?
 
 Wait a second. Looking at the current code in Linus tree, when you 
 deassign an irqfd, you are actually destroying it, so the idea of saving 
 the counter on deassign is not going to work.
 
 
 
 - Davide

Yes, the only context passed between deassign and assign is the eventfd
itself. So I think the following code for deassign would work provided
eventfd_ctx_read_locked works with wait queue lock taken:

spin_lock_irqsave(irqfd-wqh.lock, flags);
eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt);
__remove_wait_queue(irqfd-wqh, irqfd-wait);
spin_lock_irqrestore(irqfd-wqh.lock, flags);

And we discard ucnt here. This works because on on assign we do:

events = file-f_op-poll(file, irqfd-pt);
if (events  POLLIN)
schedule_work(irqfd-inject);

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Davide Libenzi
On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:

 On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
  On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
  
   Not sure what you mean by irqfd locking.  IMO we must take waitqueue
   lock, otherwise we can not prevent wait queue callback from being
   invoked, right? Note that if we take our own lock under wait queue
   callback, we won't be able to use this lock around remove_wait_queue.
   Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
   queue lock - we'd like to do the same with our wait queue.
   
   Could you clarify pls?
  
  Wait a second. Looking at the current code in Linus tree, when you 
  deassign an irqfd, you are actually destroying it, so the idea of saving 
  the counter on deassign is not going to work.
  
  
  
  - Davide
 
 Yes, the only context passed between deassign and assign is the eventfd
 itself. So I think the following code for deassign would work provided
 eventfd_ctx_read_locked works with wait queue lock taken:
 
 spin_lock_irqsave(irqfd-wqh.lock, flags);
 eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt);
 __remove_wait_queue(irqfd-wqh, irqfd-wait);
 spin_lock_irqrestore(irqfd-wqh.lock, flags);

As I said, you cannot do that from KVM, since write-witers will nedd to be 
woke up.
Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested).



- Davide


---
 fs/eventfd.c|   88 +++-
 include/linux/eventfd.h |   15 
 2 files changed, 88 insertions(+), 15 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===
--- linux-2.6.mod.orig/fs/eventfd.c 2010-01-06 18:33:27.0 -0800
+++ linux-2.6.mod/fs/eventfd.c  2010-01-10 11:02:00.0 -0800
@@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct 
return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-   loff_t *ppos)
+static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
+{
+   *cnt = (ctx-flags  EFD_SEMAPHORE) ? 1: ctx-count;
+   ctx-count -= *cnt;
+}
+
+/**
+ * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait 
queue.
+ * @ctx: [in] Pointer to eventfd context.
+ * @wait: [in] Wait queue to be removed.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN  : The operation would have blocked.
+ *
+ * This is used to atomically remove a wait queue entry from the eventfd wait
+ * queue head, and read/reset the counter value.
+ */
+int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
+ __u64 *cnt)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(ctx-wqh.lock, flags);
+   eventfd_ctx_do_read(ctx, cnt);
+   __remove_wait_queue(ctx-wqh, wait);
+   if (*cnt != 0  waitqueue_active(ctx-wqh))
+   wake_up_locked_poll(ctx-wqh, POLLOUT);
+   spin_unlock_irqrestore(ctx-wqh.lock, flags);
+
+   return *cnt != 0 ? 0: -EAGAIN;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
+
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN  : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ *
+ * If @no_wait is zero, the function might sleep until the eventfd internal
+ * counter becomes greater than zero.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-   struct eventfd_ctx *ctx = file-private_data;
ssize_t res;
-   __u64 ucnt = 0;
DECLARE_WAITQUEUE(wait, current);
 
-   if (count  sizeof(ucnt))
-   return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
+   *cnt = 0;
res = -EAGAIN;
if (ctx-count  0)
-   res = sizeof(ucnt);
-   else if (!(file-f_flags  O_NONBLOCK)) {
+   res = 0;
+   else if (!no_wait) {
__add_wait_queue(ctx-wqh, wait);
-   for (res = 0;;) {
+   for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (ctx-count  0) {
-   res = sizeof(ucnt);
+   res = 0;
break;
}
if (signal_pending(current)) {
@@ -168,18 +213,31 @@ static ssize_t eventfd_read(struct file 
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (likely(res  

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-10 Thread Michael S. Tsirkin
On Sun, Jan 10, 2010 at 11:04:23AM -0800, Davide Libenzi wrote:
 On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
 
  On Sun, Jan 10, 2010 at 09:27:34AM -0800, Davide Libenzi wrote:
   On Sun, 10 Jan 2010, Michael S. Tsirkin wrote:
   
Not sure what you mean by irqfd locking.  IMO we must take waitqueue
lock, otherwise we can not prevent wait queue callback from being
invoked, right? Note that if we take our own lock under wait queue
callback, we won't be able to use this lock around remove_wait_queue.
Also note how fs/evetfd.c uses __remove_wait_queue after taking wait
queue lock - we'd like to do the same with our wait queue.

Could you clarify pls?
   
   Wait a second. Looking at the current code in Linus tree, when you 
   deassign an irqfd, you are actually destroying it, so the idea of saving 
   the counter on deassign is not going to work.
   
   
   
   - Davide
  
  Yes, the only context passed between deassign and assign is the eventfd
  itself. So I think the following code for deassign would work provided
  eventfd_ctx_read_locked works with wait queue lock taken:
  
  spin_lock_irqsave(irqfd-wqh.lock, flags);
  eventfd_ctx_read_locked(ctx-eventfd, 1, ucnt);
  __remove_wait_queue(irqfd-wqh, irqfd-wait);
  spin_lock_irqrestore(irqfd-wqh.lock, flags);
 
 As I said, you cannot do that from KVM, since write-witers will nedd to be 
 woke up.
 Use the eventfd_ctx_remove_wait_queue() below (NOTE: compiled, not tested).
 
 
 
 - Davide
 

Yes, I think this will work. Will test and report. Thanks!

 ---
  fs/eventfd.c|   88 
 +++-
  include/linux/eventfd.h |   15 
  2 files changed, 88 insertions(+), 15 deletions(-)
 
 Index: linux-2.6.mod/fs/eventfd.c
 ===
 --- linux-2.6.mod.orig/fs/eventfd.c   2010-01-06 18:33:27.0 -0800
 +++ linux-2.6.mod/fs/eventfd.c2010-01-10 11:02:00.0 -0800
 @@ -135,26 +135,71 @@ static unsigned int eventfd_poll(struct 
   return events;
  }
  
 -static ssize_t eventfd_read(struct file *file, char __user *buf, size_t 
 count,
 - loff_t *ppos)
 +static void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
 +{
 + *cnt = (ctx-flags  EFD_SEMAPHORE) ? 1: ctx-count;
 + ctx-count -= *cnt;
 +}
 +
 +/**
 + * eventfd_ctx_remove_wait_queue - Read the current counter and removes wait 
 queue.
 + * @ctx: [in] Pointer to eventfd context.
 + * @wait: [in] Wait queue to be removed.
 + * @cnt: [out] Pointer to the 64bit conter value.
 + *
 + * Returns zero if successful, or the following error codes:
 + *
 + * -EAGAIN  : The operation would have blocked.
 + *
 + * This is used to atomically remove a wait queue entry from the eventfd wait
 + * queue head, and read/reset the counter value.
 + */
 +int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t 
 *wait,
 +   __u64 *cnt)
 +{
 + unsigned long flags;
 +
 + spin_lock_irqsave(ctx-wqh.lock, flags);
 + eventfd_ctx_do_read(ctx, cnt);
 + __remove_wait_queue(ctx-wqh, wait);
 + if (*cnt != 0  waitqueue_active(ctx-wqh))
 + wake_up_locked_poll(ctx-wqh, POLLOUT);
 + spin_unlock_irqrestore(ctx-wqh.lock, flags);
 +
 + return *cnt != 0 ? 0: -EAGAIN;
 +}
 +EXPORT_SYMBOL_GPL(eventfd_ctx_remove_wait_queue);
 +
 +/**
 + * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
 + * @ctx: [in] Pointer to eventfd context.
 + * @no_wait: [in] Different from zero if the operation should not block.
 + * @cnt: [out] Pointer to the 64bit conter value.
 + *
 + * Returns zero if successful, or the following error codes:
 + *
 + * -EAGAIN  : The operation would have blocked but @no_wait was nonzero.
 + * -ERESTARTSYS : A signal interrupted the wait operation.
 + *
 + * If @no_wait is zero, the function might sleep until the eventfd internal
 + * counter becomes greater than zero.
 + */
 +ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
  {
 - struct eventfd_ctx *ctx = file-private_data;
   ssize_t res;
 - __u64 ucnt = 0;
   DECLARE_WAITQUEUE(wait, current);
  
 - if (count  sizeof(ucnt))
 - return -EINVAL;
   spin_lock_irq(ctx-wqh.lock);
 + *cnt = 0;
   res = -EAGAIN;
   if (ctx-count  0)
 - res = sizeof(ucnt);
 - else if (!(file-f_flags  O_NONBLOCK)) {
 + res = 0;
 + else if (!no_wait) {
   __add_wait_queue(ctx-wqh, wait);
 - for (res = 0;;) {
 + for (;;) {
   set_current_state(TASK_INTERRUPTIBLE);
   if (ctx-count  0) {
 - res = sizeof(ucnt);
 + res = 0;
   break;
   }
   if (signal_pending(current)) {
 @@ 

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-07 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 11:25:40PM -0800, Davide Libenzi wrote:
 On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
 
  OK. What I think we need is a way to remove ourselves from the eventfd
  wait queue and clear the counter atomically.
  
  We currently do
  remove_wait_queue(irqfd-wqh, irqfd-wait);
  where wqh saves the eventfd wait queue head.
 
 You do a remove_wait_queue() from inside a callback wakeup on the same 
 wait queue head?
 

No, not from callback, in ioctl context.

  If we do this before proposed eventfd_read_ctx, we can lose events.
  If we do this after, we can get spurious events.
  
  An unlocked read is one way to fix this.
 
 You posted one line of code and a two lines analysis of the issue. Can you 
 be a little bit more verbose and show me more code, so that I can actually 
 see what is going on?
 
 
 - Davide


Sure, I was trying to be as brief as possible, here's a detailed summary.

Description of the system (MSI emulation in KVM):

KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
in guest OS.  When this eventfd is signalled, interrupt message is sent.
This assignment is done from qemu system emulator.

eventfd is signalled from device emulation in another thread in
userspace or from kernel, which talks with guest OS through another
eventfd and shared memory (possibility of out of process was discussed
but never got implemented yet).

Note: it's okay to delay messages from correctness point of view, but
generally this is latency-sensitive path. If multiple identical messages
are requested, it's okay to send a single last message, but missing a
message altogether causes deadlocks.  Sending a message when none were
requested might in theory cause crashes, in practice doing this causes
performance degradation.

Another KVM feature is interrupt masking: guest OS requests that we
stop sending some interrupt message, possibly modified mapping
and re-enables this message. This needs to be done without
involving the device that might keep requesting events:
while masked, message is marked pending, and guest might test
the pending status.

We can implement masking in system emulator in userspace, by using
assign/deassign ioctls: when message is masked, we simply deassign all
eventfd, and when it is unmasked, we assign them back.

Here's some code to illustrate how this all works: assign/deassign code
in kernel looks like the following:


this is called to unmask interrupt

static int
kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
{
struct _irqfd *irqfd, *tmp;
struct file *file = NULL;
struct eventfd_ctx *eventfd = NULL;
int ret;
unsigned int events;

irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);

...

file = eventfd_fget(fd);
if (IS_ERR(file)) {
ret = PTR_ERR(file);
goto fail;
}

eventfd = eventfd_ctx_fileget(file);
if (IS_ERR(eventfd)) {
ret = PTR_ERR(eventfd);
goto fail;
}

irqfd-eventfd = eventfd;

/*
 * Install our own custom wake-up handling so we are notified via
 * a callback whenever someone signals the underlying eventfd
 */
init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup);
init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc);

spin_lock_irq(kvm-irqfds.lock);

events = file-f_op-poll(file, irqfd-pt);

list_add_tail(irqfd-list, kvm-irqfds.items);
spin_unlock_irq(kvm-irqfds.lock);

A.
/*
 * Check if there was an event already pending on the eventfd
 * before we registered, and trigger it as if we didn't miss it.
 */
if (events  POLLIN)
schedule_work(irqfd-inject);

/*
 * do not drop the file until the irqfd is fully initialized, otherwise
 * we might race against the POLLHUP
 */
fput(file);

return 0;

fail:
...
}

This is called to mask interrupt

/*
 * shutdown any irqfd's that match fd+gsi
 */
static int
kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
{
struct _irqfd *irqfd, *tmp;
struct eventfd_ctx *eventfd;

eventfd = eventfd_ctx_fdget(fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);

spin_lock_irq(kvm-irqfds.lock);

list_for_each_entry_safe(irqfd, tmp, kvm-irqfds.items, list) {
if (irqfd-eventfd == eventfd  irqfd-gsi == gsi)
irqfd_deactivate(irqfd);
}

spin_unlock_irq(kvm-irqfds.lock);
eventfd_ctx_put(eventfd);

/*
 * Block until we know all outstanding shutdown jobs have completed
 * so that we guarantee there will not be any more interrupts on this
 * gsi once this deassign function returns.
 */
flush_workqueue(irqfd_cleanup_wq);

return 0;
}


And deactivation deep down does this (from 

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-07 Thread Davide Libenzi
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

 Sure, I was trying to be as brief as possible, here's a detailed summary.
 
 Description of the system (MSI emulation in KVM):
 
 KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
 in guest OS.  When this eventfd is signalled, interrupt message is sent.
 This assignment is done from qemu system emulator.
 
 eventfd is signalled from device emulation in another thread in
 userspace or from kernel, which talks with guest OS through another
 eventfd and shared memory (possibility of out of process was discussed
 but never got implemented yet).
 
 Note: it's okay to delay messages from correctness point of view, but
 generally this is latency-sensitive path. If multiple identical messages
 are requested, it's okay to send a single last message, but missing a
 message altogether causes deadlocks.  Sending a message when none were
 requested might in theory cause crashes, in practice doing this causes
 performance degradation.
 
 Another KVM feature is interrupt masking: guest OS requests that we
 stop sending some interrupt message, possibly modified mapping
 and re-enables this message. This needs to be done without
 involving the device that might keep requesting events:
 while masked, message is marked pending, and guest might test
 the pending status.
 
 We can implement masking in system emulator in userspace, by using
 assign/deassign ioctls: when message is masked, we simply deassign all
 eventfd, and when it is unmasked, we assign them back.
 
 Here's some code to illustrate how this all works: assign/deassign code
 in kernel looks like the following:
 
 
 this is called to unmask interrupt
 
 static int
 kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 {
   struct _irqfd *irqfd, *tmp;
   struct file *file = NULL;
   struct eventfd_ctx *eventfd = NULL;
   int ret;
   unsigned int events;
 
   irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
 
 ...
 
   file = eventfd_fget(fd);
   if (IS_ERR(file)) {
   ret = PTR_ERR(file);
   goto fail;
   }
 
   eventfd = eventfd_ctx_fileget(file);
   if (IS_ERR(eventfd)) {
   ret = PTR_ERR(eventfd);
   goto fail;
   }
 
   irqfd-eventfd = eventfd;
 
   /*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
*/
   init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup);
   init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc);
 
   spin_lock_irq(kvm-irqfds.lock);
 
   events = file-f_op-poll(file, irqfd-pt);
 
   list_add_tail(irqfd-list, kvm-irqfds.items);
   spin_unlock_irq(kvm-irqfds.lock);
 
 A.
   /*
* Check if there was an event already pending on the eventfd
* before we registered, and trigger it as if we didn't miss it.
*/
   if (events  POLLIN)
   schedule_work(irqfd-inject);
 
   /*
* do not drop the file until the irqfd is fully initialized, otherwise
* we might race against the POLLHUP
*/
   fput(file);
 
   return 0;
 
 fail:
   ...
 }
 
 This is called to mask interrupt
 
 /*
  * shutdown any irqfd's that match fd+gsi
  */
 static int
 kvm_irqfd_deassign(struct kvm *kvm, int fd, int gsi)
 {
   struct _irqfd *irqfd, *tmp;
   struct eventfd_ctx *eventfd;
 
   eventfd = eventfd_ctx_fdget(fd);
   if (IS_ERR(eventfd))
   return PTR_ERR(eventfd);
 
   spin_lock_irq(kvm-irqfds.lock);
 
   list_for_each_entry_safe(irqfd, tmp, kvm-irqfds.items, list) {
   if (irqfd-eventfd == eventfd  irqfd-gsi == gsi)
   irqfd_deactivate(irqfd);
   }
 
   spin_unlock_irq(kvm-irqfds.lock);
   eventfd_ctx_put(eventfd);
 
   /*
* Block until we know all outstanding shutdown jobs have completed
* so that we guarantee there will not be any more interrupts on this
* gsi once this deassign function returns.
*/
   flush_workqueue(irqfd_cleanup_wq);
 
   return 0;
 }
 
 
 And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
 so this is not under the spinlock):
 
 /*
  * Synchronize with the wait-queue and unhook ourselves to
  * prevent
  * further events.
  */
 B.
 remove_wait_queue(irqfd-wqh, irqfd-wait);
 
   
 
 /*
  * It is now safe to release the object's resources
  */
 eventfd_ctx_put(irqfd-eventfd);
 kfree(irqfd);
 
 
 The problems (really the same bug) in KVM that I am trying to fix:
 1. Because of A above, if event was requested while message was masked,
we will not miss a message. However, because we never clear
the counter, so we currently get a spurious message each time
we unmask.
 
We should clear the counter either each time we
deliver message, 

Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-07 Thread Davide Libenzi
On Thu, 7 Jan 2010, Davide Libenzi wrote:

 When you unmask, shouldn't the fd be in a clear state anyway?

Forget that. If it's used for IRQs you likely need to have a current state 
when you unmask.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-07 Thread Davide Libenzi
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

 Sure, I was trying to be as brief as possible, here's a detailed summary.
 
 Description of the system (MSI emulation in KVM):
 
 KVM supports an ioctl to assign/deassign an eventfd file to interrupt message
 in guest OS.  When this eventfd is signalled, interrupt message is sent.
 This assignment is done from qemu system emulator.
 
 eventfd is signalled from device emulation in another thread in
 userspace or from kernel, which talks with guest OS through another
 eventfd and shared memory (possibility of out of process was discussed
 but never got implemented yet).
 
 Note: it's okay to delay messages from correctness point of view, but
 generally this is latency-sensitive path. If multiple identical messages
 are requested, it's okay to send a single last message, but missing a
 message altogether causes deadlocks.  Sending a message when none were
 requested might in theory cause crashes, in practice doing this causes
 performance degradation.
 
 Another KVM feature is interrupt masking: guest OS requests that we
 stop sending some interrupt message, possibly modified mapping
 and re-enables this message. This needs to be done without
 involving the device that might keep requesting events:
 while masked, message is marked pending, and guest might test
 the pending status.
 
 We can implement masking in system emulator in userspace, by using
 assign/deassign ioctls: when message is masked, we simply deassign all
 eventfd, and when it is unmasked, we assign them back.
 
 Here's some code to illustrate how this all works: assign/deassign code
 in kernel looks like the following:
 
 
 this is called to unmask interrupt
 
 static int
 kvm_irqfd_assign(struct kvm *kvm, int fd, int gsi)
 {
   struct _irqfd *irqfd, *tmp;
   struct file *file = NULL;
   struct eventfd_ctx *eventfd = NULL;
   int ret;
   unsigned int events;
 
   irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
 
 ...
 
   file = eventfd_fget(fd);
   if (IS_ERR(file)) {
   ret = PTR_ERR(file);
   goto fail;
   }
 
   eventfd = eventfd_ctx_fileget(file);
   if (IS_ERR(eventfd)) {
   ret = PTR_ERR(eventfd);
   goto fail;
   }
 
   irqfd-eventfd = eventfd;
 
   /*
* Install our own custom wake-up handling so we are notified via
* a callback whenever someone signals the underlying eventfd
*/
   init_waitqueue_func_entry(irqfd-wait, irqfd_wakeup);
   init_poll_funcptr(irqfd-pt, irqfd_ptable_queue_proc);
 
   spin_lock_irq(kvm-irqfds.lock);
 
   events = file-f_op-poll(file, irqfd-pt);
 
   list_add_tail(irqfd-list, kvm-irqfds.items);
   spin_unlock_irq(kvm-irqfds.lock);
 
 A.
   /*
* Check if there was an event already pending on the eventfd
* before we registered, and trigger it as if we didn't miss it.
*/
   if (events  POLLIN)
   schedule_work(irqfd-inject);
 
   /*
* do not drop the file until the irqfd is fully initialized, otherwise
* we might race against the POLLHUP
*/
   fput(file);
 
   return 0;
 
 fail:
   ...
 }

What is you do (under proper irqfd locking) something like:

eventfd_ctx_read(ctx, 1, cnt);
if (irqfd-cnt != cnt) {
irqfd-cnt = cnt;
schedule_work(irqfd-inject);
}




 And deactivation deep down does this (from irqfd_cleanup_wq workqueue,
 so this is not under the spinlock):
 
 /*
  * Synchronize with the wait-queue and unhook ourselves to
  * prevent
  * further events.
  */
 B.
 remove_wait_queue(irqfd-wqh, irqfd-wait);
 
   
 
 /*
  * It is now safe to release the object's resources
  */
 eventfd_ctx_put(irqfd-eventfd);
 kfree(irqfd);

And:

eventfd_ctx_read(ctx, 1, irqfd-cnt);
remove_wait_queue(irqfd-wqh, irqfd-wait);




- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
 On Tue, 1 Sep 2009, Avi Kivity wrote:
 
  On 09/01/2009 02:45 AM, Davide Libenzi wrote:
   On Thu, 27 Aug 2009, Davide Libenzi wrote:
   
  
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

 
 Oh, I stopped pushing EFD_STATE since we have a solution.

Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
Because if nobody uses it, I'm not going to push it.
 
   Guys, I did not get a reply on this. Do you need me to push it, or you're
   not going to use it at the end?
  
  
  We'll use it eventually, but we're still some ways from it.
 
 OK, then bug me when you're going to need it. I won't push it before that.
 
 
 - Davide

So, it turns out that we need this: be thought we don't because
currently kvm does not zero eventfd counter when it polls eventfd.  But
this causes spurious interrupts when we disconnect irqfd from kvm and
re-connect it back.

However, since kvm does its own thing with the wait queue, and might
read the counter from wait queue callback (which might be from
interrupt context), a simpler, lower-level interface would be better for
us.  Does the following (build tested only) look palatable?

Thanks!


diff --git a/fs/eventfd.c b/fs/eventfd.c
index d26402f..e350ffd 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -135,6 +135,17 @@ static unsigned int eventfd_poll(struct file *file, 
poll_table *wait)
return events;
 }
 
+/* Caller must have wait queue head lock. */
+ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt)
+{
+   if (!ctx-count)
+   return -EAGAIN;
+   *ucnt = (ctx-flags  EFD_SEMAPHORE) ? 1 : ctx-count;
+   ctx-count -= *ucnt;
+   return sizeof *ucnt;
+}
+EXPORT_SYMBOL_GPL(_eventfd_read_ctx);
+
 static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
loff_t *ppos)
 {
@@ -146,17 +157,14 @@ static ssize_t eventfd_read(struct file *file, char 
__user *buf, size_t count,
if (count  sizeof(ucnt))
return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
-   res = -EAGAIN;
-   if (ctx-count  0)
-   res = sizeof(ucnt);
-   else if (!(file-f_flags  O_NONBLOCK)) {
+   res = _eventfd_read_ctx(ctx, ucnt);
+   if (res  0  !(file-f_flags  O_NONBLOCK)) {
__add_wait_queue(ctx-wqh, wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx-count  0) {
-   res = sizeof(ucnt);
+   res = _eventfd_read_ctx(ctx, ucnt);
+   if (res  0)
break;
-   }
if (signal_pending(current)) {
res = -ERESTARTSYS;
break;
@@ -169,8 +177,6 @@ static ssize_t eventfd_read(struct file *file, char __user 
*buf, size_t count,
__set_current_state(TASK_RUNNING);
}
if (likely(res  0)) {
-   ucnt = (ctx-flags  EFD_SEMAPHORE) ? 1 : ctx-count;
-   ctx-count -= ucnt;
if (waitqueue_active(ctx-wqh))
wake_up_locked_poll(ctx-wqh, POLLOUT);
}
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 94dd103..a3d0ce9 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -34,6 +34,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -61,6 +62,11 @@ static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
 
 }
 
+static inline ssize_t _eventfd_read_ctx(struct eventfd_ctx *ctx, u64 *ucnt)
+{
+   return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Davide Libenzi
On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:

 On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
  On Tue, 1 Sep 2009, Avi Kivity wrote:
  
   On 09/01/2009 02:45 AM, Davide Libenzi wrote:
On Thu, 27 Aug 2009, Davide Libenzi wrote:

   
 On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
 
  
  Oh, I stopped pushing EFD_STATE since we have a solution.
 
 Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
 Because if nobody uses it, I'm not going to push it.
  
Guys, I did not get a reply on this. Do you need me to push it, or 
you're
not going to use it at the end?
   
   
   We'll use it eventually, but we're still some ways from it.
  
  OK, then bug me when you're going to need it. I won't push it before that.
  
  
  - Davide
 
 So, it turns out that we need this: be thought we don't because
 currently kvm does not zero eventfd counter when it polls eventfd.  But
 this causes spurious interrupts when we disconnect irqfd from kvm and
 re-connect it back.
 
 However, since kvm does its own thing with the wait queue, and might
 read the counter from wait queue callback (which might be from
 interrupt context), a simpler, lower-level interface would be better for
 us.  Does the following (build tested only) look palatable?

What is wrong with you? :/
There was a patch attached with this email, and yet you did your own, and 
yet again you managed to add an underscore at the beginning of the API 
name, in an API set where there are no leading underscores.
Even if KVM does that for its own reasons, you should be able to fit your 
naming style to the interface you're adding your droplets into.
I will post an eventfd_read_ctx() to Andrew ASAP.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 12:43:07PM -0800, Davide Libenzi wrote:
 On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:
 
  On Tue, Sep 01, 2009 at 07:24:24AM -0700, Davide Libenzi wrote:
   On Tue, 1 Sep 2009, Avi Kivity wrote:
   
On 09/01/2009 02:45 AM, Davide Libenzi wrote:
 On Thu, 27 Aug 2009, Davide Libenzi wrote:
 

  On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
  
   
   Oh, I stopped pushing EFD_STATE since we have a solution.
  
  Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
  Because if nobody uses it, I'm not going to push it.
   
 Guys, I did not get a reply on this. Do you need me to push it, or 
 you're
 not going to use it at the end?


We'll use it eventually, but we're still some ways from it.
   
   OK, then bug me when you're going to need it. I won't push it before that.
   
   
   - Davide
  
  So, it turns out that we need this: be thought we don't because
  currently kvm does not zero eventfd counter when it polls eventfd.  But
  this causes spurious interrupts when we disconnect irqfd from kvm and
  re-connect it back.
  
  However, since kvm does its own thing with the wait queue, and might
  read the counter from wait queue callback (which might be from
  interrupt context), a simpler, lower-level interface would be better for
  us.  Does the following (build tested only) look palatable?
 
 What is wrong with you? :/

I was trying to be helpful.

 There was a patch attached with this email, and yet you did your own,

I tried to explain, no?  That patch was taking wait queue spinlock and
was assuming that eventfd_read_ctx is called from a task that can block.
KVM attaches its own poller so this is not a good fit for us.  Hope this
clarifies.

 and yet again you managed to add an underscore at the beginning of the
 API name, in an API set where there are no leading underscores.  Even
 if KVM does that for its own reasons, you should be able to fit your
 naming style to the interface you're adding your droplets into.  I
 will post an eventfd_read_ctx() to Andrew ASAP.
 
 
 
 - Davide

Sorry about the underscore - it's easy to fix but I won't bore you with
more patch revisions before I understand what makes you unhappy.

Before KVM starts creating workqueues and doing schedule_work calls just
to work around the API, can we discuss this please? I am not hung on my
patch, but could we have it work so that we can call eventfd_read_ctx
from wait queue callback directly?

Thanks!


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Davide Libenzi
On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:

 I tried to explain, no?  That patch was taking wait queue spinlock and
 was assuming that eventfd_read_ctx is called from a task that can block.
 KVM attaches its own poller so this is not a good fit for us.  Hope this
 clarifies.
 
  and yet again you managed to add an underscore at the beginning of the
  API name, in an API set where there are no leading underscores.  Even
  if KVM does that for its own reasons, you should be able to fit your
  naming style to the interface you're adding your droplets into.  I
  will post an eventfd_read_ctx() to Andrew ASAP.
  
  
  
  - Davide
 
 Sorry about the underscore - it's easy to fix but I won't bore you with
 more patch revisions before I understand what makes you unhappy.
 
 Before KVM starts creating workqueues and doing schedule_work calls just
 to work around the API, can we discuss this please? I am not hung on my
 patch, but could we have it work so that we can call eventfd_read_ctx
 from wait queue callback directly?

The read needs to wake possible OUT waiters, *cannot* be done your way.
On top of creating an interface which requires a lock, which noone can get 
from the interface itself, since it's not exposed.
I could split the two and have a locked one, and an unlocked one, but that 
looks shitty too (for the above reason).
I thought you *already* do your own stuff from a work queue, why would you 
need to read from IRQ context?



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 01:17:02PM -0800, Davide Libenzi wrote:
 On Wed, 6 Jan 2010, Michael S. Tsirkin wrote:
 
  I tried to explain, no?  That patch was taking wait queue spinlock and
  was assuming that eventfd_read_ctx is called from a task that can block.
  KVM attaches its own poller so this is not a good fit for us.  Hope this
  clarifies.
  
   and yet again you managed to add an underscore at the beginning of the
   API name, in an API set where there are no leading underscores.  Even
   if KVM does that for its own reasons, you should be able to fit your
   naming style to the interface you're adding your droplets into.  I
   will post an eventfd_read_ctx() to Andrew ASAP.
   
   
   
   - Davide
  
  Sorry about the underscore - it's easy to fix but I won't bore you with
  more patch revisions before I understand what makes you unhappy.
  
  Before KVM starts creating workqueues and doing schedule_work calls just
  to work around the API, can we discuss this please? I am not hung on my
  patch, but could we have it work so that we can call eventfd_read_ctx
  from wait queue callback directly?
 
 The read needs to wake possible OUT waiters, *cannot* be done your way.

Right. Now I see. Thanks!

 On top of creating an interface which requires a lock, which noone can get 
 from the interface itself, since it's not exposed.

I think here's how KVM gets it: the way it does is by calling poll with
our own poll table, then in poll_queue_proc we get wait queue pointer,
and we use the wait queue. Lock is in there :)

 I could split the two and have a locked one, and an unlocked one, but that 
 looks shitty too (for the above reason).

Yes, this will work. Thanks!

 I thought you *already* do your own stuff from a work queue, why would you 
 need to read from IRQ context?
 
 
 
 - Davide

This was just work around while interrupt delivery was using mutex
locks, so we had a workqueue to work around that limitation.  Now that
it has been all switched to RCU, we'll be getting getting rid of this.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Davide Libenzi
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

  On top of creating an interface which requires a lock, which noone can get 
  from the interface itself, since it's not exposed.
 
 I think here's how KVM gets it: the way it does is by calling poll with
 our own poll table, then in poll_queue_proc we get wait queue pointer,
 and we use the wait queue. Lock is in there :)

Yes, I know you are called locked, but it does not lead to a clean 
interface.



  I could split the two and have a locked one, and an unlocked one, but that 
  looks shitty too (for the above reason).
 
 Yes, this will work. Thanks!

This is a lot more complex than I thought. The wakeup code is already 
enumerating the list, and doing a wakeup might trigger a secondary 
enumeration/recursion.
Do you really need to consume the value from IRQ context, or you can 
simply peek the value, and flush it later?



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
 On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
 
   On top of creating an interface which requires a lock, which noone can 
   get 
   from the interface itself, since it's not exposed.
  
  I think here's how KVM gets it: the way it does is by calling poll with
  our own poll table, then in poll_queue_proc we get wait queue pointer,
  and we use the wait queue. Lock is in there :)
 
 Yes, I know you are called locked, but it does not lead to a clean 
 interface.

True.

   I could split the two and have a locked one, and an unlocked one, but 
   that 
   looks shitty too (for the above reason).
  
  Yes, this will work. Thanks!
 
 This is a lot more complex than I thought. The wakeup code is already 
 enumerating the list, and doing a wakeup might trigger a secondary 
 enumeration/recursion.

For KVM what you describe is I think is not a problem: we check wake type
and ignore POLLOUT events.

 Do you really need to consume the value from IRQ context, or you can 
 simply peek the value, and flush it later?
 
 
 
 - Davide

Maybe yes. I'll think it over and get back to you. Thanks!

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Davide Libenzi
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

 On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
  On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
  
On top of creating an interface which requires a lock, which noone can 
get 
from the interface itself, since it's not exposed.
   
   I think here's how KVM gets it: the way it does is by calling poll with
   our own poll table, then in poll_queue_proc we get wait queue pointer,
   and we use the wait queue. Lock is in there :)
  
  Yes, I know you are called locked, but it does not lead to a clean 
  interface.
 
 True.
 
I could split the two and have a locked one, and an unlocked one, but 
that 
looks shitty too (for the above reason).
   
   Yes, this will work. Thanks!
  
  This is a lot more complex than I thought. The wakeup code is already 
  enumerating the list, and doing a wakeup might trigger a secondary 
  enumeration/recursion.
 
 For KVM what you describe is I think is not a problem: we check wake type
 and ignore POLLOUT events.

You seem to think in one dimension only ;)
The interface needs to be stable for everyone.



 Maybe yes. I'll think it over and get back to you. Thanks!

Let me know.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote:
 On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
 
  On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
   On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
   
 On top of creating an interface which requires a lock, which noone 
 can get 
 from the interface itself, since it's not exposed.

I think here's how KVM gets it: the way it does is by calling poll with
our own poll table, then in poll_queue_proc we get wait queue pointer,
and we use the wait queue. Lock is in there :)
   
   Yes, I know you are called locked, but it does not lead to a clean 
   interface.
  
  True.
  
 I could split the two and have a locked one, and an unlocked one, but 
 that 
 looks shitty too (for the above reason).

Yes, this will work. Thanks!
   
   This is a lot more complex than I thought. The wakeup code is already 
   enumerating the list, and doing a wakeup might trigger a secondary 
   enumeration/recursion.
  
  For KVM what you describe is I think is not a problem: we check wake type
  and ignore POLLOUT events.
 
 You seem to think in one dimension only ;)
 The interface needs to be stable for everyone.
 

I agree it's better to have interface that can't be misused.  I was just
pointing out that users *can* break recursion by looking at event type.

  Maybe yes. I'll think it over and get back to you. Thanks!
 
 Let me know.
 
 
 
 - Davide
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Michael S. Tsirkin
On Wed, Jan 06, 2010 at 03:59:11PM -0800, Davide Libenzi wrote:
 On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
 
  On Wed, Jan 06, 2010 at 02:46:06PM -0800, Davide Libenzi wrote:
   On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:
   
 On top of creating an interface which requires a lock, which noone 
 can get 
 from the interface itself, since it's not exposed.

I think here's how KVM gets it: the way it does is by calling poll with
our own poll table, then in poll_queue_proc we get wait queue pointer,
and we use the wait queue. Lock is in there :)
   
   Yes, I know you are called locked, but it does not lead to a clean 
   interface.
  
  True.
  
 I could split the two and have a locked one, and an unlocked one, but 
 that 
 looks shitty too (for the above reason).

Yes, this will work. Thanks!
   
   This is a lot more complex than I thought. The wakeup code is already 
   enumerating the list, and doing a wakeup might trigger a secondary 
   enumeration/recursion.
  
  For KVM what you describe is I think is not a problem: we check wake type
  and ignore POLLOUT events.
 
 You seem to think in one dimension only ;)
 The interface needs to be stable for everyone.
 
 
 
  Maybe yes. I'll think it over and get back to you. Thanks!
 
 Let me know.
 
 
 
 - Davide

OK. What I think we need is a way to remove ourselves from the eventfd
wait queue and clear the counter atomically.

We currently do
remove_wait_queue(irqfd-wqh, irqfd-wait);
where wqh saves the eventfd wait queue head.

If we do this before proposed eventfd_read_ctx, we can lose events.
If we do this after, we can get spurious events.

An unlocked read is one way to fix this.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2010-01-06 Thread Davide Libenzi
On Thu, 7 Jan 2010, Michael S. Tsirkin wrote:

 OK. What I think we need is a way to remove ourselves from the eventfd
 wait queue and clear the counter atomically.
 
 We currently do
 remove_wait_queue(irqfd-wqh, irqfd-wait);
 where wqh saves the eventfd wait queue head.

You do a remove_wait_queue() from inside a callback wakeup on the same 
wait queue head?


 If we do this before proposed eventfd_read_ctx, we can lose events.
 If we do this after, we can get spurious events.
 
 An unlocked read is one way to fix this.

You posted one line of code and a two lines analysis of the issue. Can you 
be a little bit more verbose and show me more code, so that I can actually 
see what is going on?


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Michael S. Tsirkin
On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote:
 On 08/27/2009 02:30 AM, Davide Libenzi wrote:
 On Wed, 26 Aug 2009, Davide Libenzi wrote:


 I see no kernel equivalent to read(), but that's easily done.

 Adding an in-kernel read based on ctx, that is no problem at all.
  
 Something like the untested below.
 I had thought you said the eventfd readers where in userspace, but I might
 have misunderstood you.


 No, they're all over the place.

Further, with Davide's proposal you must be a reader to signal events.

 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Paolo Bonzini
 Ok, so why not using the eventfd counter as state?
 On the device side:
 
 void write_state(int sfd, int state) {
   u64 cnt;
 
   /* Clear the current state, sfd is in non-blocking mode */
   read(sfd,cnt, sizeof(cnt));
   /* Writes new state */
   cnt = 1 + !!state;
   write(sfd,cnt, sizeof(cnt));
 }

It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
added exactly to avoid a read+write combination for the case of
decrementing a value.  Here it's the same, just it's about the case of
writing a *given* value.  What about having EFD_STATE simply mean do
not use a counter, just write the value without affecting the way
read works, and use

/* Writes new state */
cnt = 1 + !!state;
write(sfd,cnt, sizeof(cnt));

See below?

Paolo

 On the hypervisor side:
 
 int read_state(int sfd) {
   u64 cnt;
 
   read(sfd,cnt, sizeof(cnt));
   return state - 1;
 }


- 8-- ---
Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag

This implements a new EFD_ABSOLUTE flag for eventfd.
This changes eventfd behaviour so that write simply
stores the value written, and is always non-blocking.

Untested (I just modified Michael's patch, and given
the simpler code I'm not sure it's now worthwhile
introducing the inline functions), but otherwise

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 fs/eventfd.c|   13 -
 include/linux/eventfd.h |4 ++--
 2 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 347a0e0..7b279e3 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -31,8 +31,6 @@
 
 static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n)
 {
-   return ULLONG_MAX - n  ctx-count;
-   return (ctx-flags  EFD_ABSOLUTE) || (ULLONG_MAX - n  ctx-count);
 }
 
 static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt)
@@ -42,10 +40,14 @@
 
 static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt)
 {
-   if (eventfd_writeable(ctx, ucnt))
-   ucnt = ULLONG_MAX - ctx-count;
+   if (ctx-flags  EFD_ABSOLUTE)
+   ctx-count = ucnt;
+   else {
+   if (ULLONG_MAX - ctx-count  ucnt)
+   ucnt = ULLONG_MAX - ctx-count;
 
-   ctx-count += ucnt;
+   ctx-count += ucnt;
+   }
 }
 
 static inline u64 eventfd_doread(struct eventfd_ctx *ctx)
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3b85ba6..78ff649 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -19,11 +19,12 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1  0)
+#define EFD_ABSOLUTE (1  1)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE)
 
 #ifdef CONFIG_EVENTFD
--
1.6.2.5 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Michael S. Tsirkin
On Thu, Aug 27, 2009 at 11:05:30AM +0200, Paolo Bonzini wrote:
  Ok, so why not using the eventfd counter as state?
  On the device side:
  
  void write_state(int sfd, int state) {
  u64 cnt;
  
  /* Clear the current state, sfd is in non-blocking mode */
  read(sfd,cnt, sizeof(cnt));
  /* Writes new state */
  cnt = 1 + !!state;
  write(sfd,cnt, sizeof(cnt));
  }
 
 It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
 added exactly to avoid a read+write combination for the case of
 decrementing a value.  Here it's the same, just it's about the case of
 writing a *given* value.  What about having EFD_STATE simply mean do
 not use a counter, just write the value without affecting the way
 read works, and use
 
   /* Writes new state */
   cnt = 1 + !!state;
   write(sfd,cnt, sizeof(cnt));

That would work for kvm.

 See below?
 
 Paolo
 
  On the hypervisor side:
  
  int read_state(int sfd) {
  u64 cnt;
  
  read(sfd,cnt, sizeof(cnt));
  return state - 1;
  }
 
 
 - 8-- ---
 Subject: [PATCH] eventfd: new EFD_ABSOLUTE flag
 
 This implements a new EFD_ABSOLUTE flag for eventfd.
 This changes eventfd behaviour so that write simply
 stores the value written, and is always non-blocking.
 
 Untested (I just modified Michael's patch, and given
 the simpler code I'm not sure it's now worthwhile
 introducing the inline functions), but otherwise
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  fs/eventfd.c|   13 -
  include/linux/eventfd.h |4 ++--
  2 files changed, 10 insertions(+), 7 deletions(-)
 
 diff --git a/fs/eventfd.c b/fs/eventfd.c
 index 347a0e0..7b279e3 100644
 --- a/fs/eventfd.c
 +++ b/fs/eventfd.c
 @@ -31,8 +31,6 @@
  
  static inline int eventfd_writeable(struct eventfd_ctx *ctx, u64 n)
  {
 - return ULLONG_MAX - n  ctx-count;
 - return (ctx-flags  EFD_ABSOLUTE) || (ULLONG_MAX - n  ctx-count);
  }
  
  static inline int eventfd_overflow(struct eventfd_ctx *ctx, u64 cnt)
 @@ -42,10 +40,14 @@
  
  static inline void eventfd_dowrite(struct eventfd_ctx *ctx, u64 ucnt)
  {
 - if (eventfd_writeable(ctx, ucnt))
 - ucnt = ULLONG_MAX - ctx-count;
 + if (ctx-flags  EFD_ABSOLUTE)
 + ctx-count = ucnt;
 + else {
 + if (ULLONG_MAX - ctx-count  ucnt)
 + ucnt = ULLONG_MAX - ctx-count;
  
 - ctx-count += ucnt;
 + ctx-count += ucnt;
 + }
  }
  
  static inline u64 eventfd_doread(struct eventfd_ctx *ctx)
 diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
 index 3b85ba6..78ff649 100644
 --- a/include/linux/eventfd.h
 +++ b/include/linux/eventfd.h
 @@ -19,11 +19,12 @@
   * shared O_* flags.
   */
  #define EFD_SEMAPHORE (1  0)
 +#define EFD_ABSOLUTE (1  1)
  #define EFD_CLOEXEC O_CLOEXEC
  #define EFD_NONBLOCK O_NONBLOCK
  
  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_ABSOLUTE)
  
  #ifdef CONFIG_EVENTFD
 --
 1.6.2.5 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Davide Libenzi
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

 On Thu, Aug 27, 2009 at 07:13:32AM +0300, Avi Kivity wrote:
  On 08/27/2009 02:30 AM, Davide Libenzi wrote:
  On Wed, 26 Aug 2009, Davide Libenzi wrote:
 
 
  I see no kernel equivalent to read(), but that's easily done.
 
  Adding an in-kernel read based on ctx, that is no problem at all.
   
  Something like the untested below.
  I had thought you said the eventfd readers where in userspace, but I might
  have misunderstood you.
 
 
  No, they're all over the place.
 
 Further, with Davide's proposal you must be a reader to signal events.

To signal events, you must have an instance of ctx (with the new 
exposed eventfd_ctx_read). How would you do otherwise?


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Davide Libenzi
On Thu, 27 Aug 2009, Paolo Bonzini wrote:

  Ok, so why not using the eventfd counter as state?
  On the device side:
  
  void write_state(int sfd, int state) {
  u64 cnt;
  
  /* Clear the current state, sfd is in non-blocking mode */
  read(sfd,cnt, sizeof(cnt));
  /* Writes new state */
  cnt = 1 + !!state;
  write(sfd,cnt, sizeof(cnt));
  }
 
 It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
 added exactly to avoid a read+write combination for the case of
 decrementing a value.

Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
*semaphore* is a pretty widely known and used abstraction.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Michael S. Tsirkin
On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
 On Thu, 27 Aug 2009, Paolo Bonzini wrote:
 
   Ok, so why not using the eventfd counter as state?
   On the device side:
   
   void write_state(int sfd, int state) {
 u64 cnt;
   
 /* Clear the current state, sfd is in non-blocking mode */
 read(sfd,cnt, sizeof(cnt));
 /* Writes new state */
 cnt = 1 + !!state;
 write(sfd,cnt, sizeof(cnt));
   }
  
  It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
  added exactly to avoid a read+write combination for the case of
  decrementing a value.
 
 Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
 *semaphore* is a pretty widely known and used abstraction.

what about an atomic variable, btw?  does it make sense to implement
write that does compare and exchange?

 
 - Davide
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Davide Libenzi
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

 On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
  On Thu, 27 Aug 2009, Paolo Bonzini wrote:
  
Ok, so why not using the eventfd counter as state?
On the device side:

void write_state(int sfd, int state) {
u64 cnt;

/* Clear the current state, sfd is in non-blocking mode */
read(sfd,cnt, sizeof(cnt));
/* Writes new state */
cnt = 1 + !!state;
write(sfd,cnt, sizeof(cnt));
}
   
   It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
   added exactly to avoid a read+write combination for the case of
   decrementing a value.
  
  Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
  *semaphore* is a pretty widely known and used abstraction.
 
 what about an atomic variable, btw?  does it make sense to implement
 write that does compare and exchange?

It is surprising to me, that is front of a workable solution w/out any 
use-once additions, yet you want to try to add optimizations and new 
ad-hoc abstractions to user visible interfaces.
Now, you tell me what an atomic variable has to do with an eventfd.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Michael S. Tsirkin
On Thu, Aug 27, 2009 at 07:38:46AM -0700, Davide Libenzi wrote:
 On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:
 
  On Thu, Aug 27, 2009 at 07:21:49AM -0700, Davide Libenzi wrote:
   On Thu, 27 Aug 2009, Paolo Bonzini wrote:
   
 Ok, so why not using the eventfd counter as state?
 On the device side:
 
 void write_state(int sfd, int state) {
   u64 cnt;
 
   /* Clear the current state, sfd is in non-blocking mode */
   read(sfd,cnt, sizeof(cnt));
   /* Writes new state */
   cnt = 1 + !!state;
   write(sfd,cnt, sizeof(cnt));
 }

It's interesting [no sarcasm intended, mind] that EFD_SEMAPHORE was
added exactly to avoid a read+write combination for the case of
decrementing a value.
   
   Like I repeated 25 times already, EFD_SEMAPHORE was added, because a 
   *semaphore* is a pretty widely known and used abstraction.
  
  what about an atomic variable, btw?  does it make sense to implement
  write that does compare and exchange?
 
 It is surprising to me, that is front of a workable solution w/out any 
 use-once additions, yet you want to try to add optimizations and new 
 ad-hoc abstractions to user visible interfaces.
 Now, you tell me what an atomic variable has to do with an eventfd.
 
 
 - Davide
 

Oh, I stopped pushing EFD_STATE since we have a solution.
I am just trying to grok what does and what does not consititute a
use-once addition, in your mind, and what does and what does not
belong in eventfd. The reason atomic does not belong there and
semaphore does is because one waits on semaphore but not
on atomic? Is that it?

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Davide Libenzi
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

 Oh, I stopped pushing EFD_STATE since we have a solution.
 I am just trying to grok what does and what does not consititute a
 use-once addition, in your mind, and what does and what does not
 belong in eventfd. The reason atomic does not belong there and
 semaphore does is because one waits on semaphore but not
 on atomic? Is that it?

An atomic variable is not a synchronization interface.
Even if it'd come up that we really need an atomic variable abstraction 
via an fd, this should not go via eventfd.
Yeah, maybe the name syncfd would have been better, but the very reason 
why eventfd born was to allow KAIO to signal events to poll/select/epoll.
That first implementation was also usable as a mutex.
Then it was propsed to make eventfd to act as a semaphore too. Code was 
trivial, a semaphore fitted the interface, and the abstraction was a 
pretty damn known too. So it went in.
Yes, you could have implemented a semaphore with the existing eventfd, by 
reading the counter and writing counter-1. But again, a semaphore was 
something widely known and generic enough, and was fitting the bill.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-27 Thread Davide Libenzi
On Thu, 27 Aug 2009, Michael S. Tsirkin wrote:

 Oh, I stopped pushing EFD_STATE since we have a solution.

Do you guys need the kernel-side eventfd_ctx_read() I posted or not?
Because if nobody uses it, I'm not going to push it.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Michael S. Tsirkin
On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote:
 On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
 
  Yes, we don't want that. The best thing is to try to restate the problem
  in a way that is generic, and then either solve or best use existing
  solution. Right?
  
  I thought I had that, but apparently not.  The reason I'm Cc-ing you is
  not to try and spam you until you give up and accept the patch, it's
  hoping that you see the pattern behind our usage, and help generalize
  it.
  
  If I understand it correctly, you believe this is not possible and so
  any solution will have to be in KVM? Or maybe I didn't state the problem
  clearly enough and should restate it?
 
 Please do.
 
 
 
 - Davide


Problem looks like this:

There are multiple processes (devices) where each has a condition
(interrupt line) which it has logic to determine is either true or
false.

A single other process (hypervisor) is interested in a condition
(interrupt level) which is a logical OR of all interrupt lines.
On changes, an interrupt level value needs to be read and copied to
guest virtual cpu.

We also want ability to replace some or all processes above by a kernel
components, with condition changes done potentially from hardware
interrupt context.


How we wanted to solve it with EFD_STATE: Share a separate eventfd
between each device and the hypervisor.  device sets state to either 0
or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
calculates the interrupt level and updates guest.

Alternative solution: shared memory where each device writes interrupt
line value. This makes setup more complex (need to share around much more
than just an fd), and makes access from interrupt impossible unless we
lock the memory (and locking userspace memory introduces yet another set
of issues).


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Avi Kivity

On 08/26/2009 01:29 PM, Michael S. Tsirkin wrote:

How we wanted to solve it with EFD_STATE: Share a separate eventfd
between each device and the hypervisor.  device sets state to either 0
or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
calculates the interrupt level and updates guest.

Alternative solution: shared memory where each device writes interrupt
line value. This makes setup more complex (need to share around much more
than just an fd), and makes access from interrupt impossible unless we
lock the memory (and locking userspace memory introduces yet another set
of issues).
   


For completeness:

If the device is implemented in the same process as the hypervisor, an 
eventfd isn't really needed, as there is an ioctl which performs the 
same operation.


An important class of device implementations is real devices that are 
assigned to a guest.  We would like to forward the interrupt directly 
from the host interrupt handler to qemu.  Currently, we have a 
kvm-specific interrupt handler that forwards the interrupt using 
kvm-specific interfaces.  We would like to use a generic interrupt 
handler implemented by uio, so we want a generic interrupt transfer 
mechanism.


uio already supports edge-triggered interrupts using an eventfd-like 
mechanism.  So it makes sense to extend uio to support real eventfds, 
and to make it also support level-triggered interrupts.


We can work around the lack of state eventfd by having userspace wait on 
whatever mechanism uio uses to make the interrupt state visible, and 
then use the ioctl mentioned above to inform the hypervisor of this 
state.  But it's faster and nicer to give both components an eventfd and 
let them communicate directly.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Davide Libenzi
On Wed, 26 Aug 2009, Michael S. Tsirkin wrote:

 On Tue, Aug 25, 2009 at 02:57:01PM -0700, Davide Libenzi wrote:
  On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
  
   Yes, we don't want that. The best thing is to try to restate the problem
   in a way that is generic, and then either solve or best use existing
   solution. Right?
   
   I thought I had that, but apparently not.  The reason I'm Cc-ing you is
   not to try and spam you until you give up and accept the patch, it's
   hoping that you see the pattern behind our usage, and help generalize
   it.
   
   If I understand it correctly, you believe this is not possible and so
   any solution will have to be in KVM? Or maybe I didn't state the problem
   clearly enough and should restate it?
  
  Please do.
  
  
  
  - Davide
 
 
 Problem looks like this:
 
 There are multiple processes (devices) where each has a condition
 (interrupt line) which it has logic to determine is either true or
 false.
 
 A single other process (hypervisor) is interested in a condition
 (interrupt level) which is a logical OR of all interrupt lines.
 On changes, an interrupt level value needs to be read and copied to
 guest virtual cpu.
 
 We also want ability to replace some or all processes above by a kernel
 components, with condition changes done potentially from hardware
 interrupt context.
 
 
 How we wanted to solve it with EFD_STATE: Share a separate eventfd
 between each device and the hypervisor.  device sets state to either 0
 or 1.  hypervisor polls all eventfds, reads interrupt line on changes,
 calculates the interrupt level and updates guest.
 
 Alternative solution: shared memory where each device writes interrupt
 line value. This makes setup more complex (need to share around much more
 than just an fd), and makes access from interrupt impossible unless we
 lock the memory (and locking userspace memory introduces yet another set
 of issues).

OK, if I get it correctly, there is one eventfd signaler (the device), and 
one eventfd reader (the hypervisor), right?
Each hypervisor listens for multiple devices detecting state changes, and 
associating the eventfd line to the IRQ number by some configuration 
(ala PCI), right?



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Avi Kivity

On 08/26/2009 08:45 PM, Davide Libenzi wrote:

OK, if I get it correctly, there is one eventfd signaler (the device), and
one eventfd reader (the hypervisor), right?
Each hypervisor listens for multiple devices detecting state changes, and
associating the eventfd line to the IRQ number by some configuration
(ala PCI), right?
   


Yes.  The PCI stuff happens in userspace, all the hypervisor sees is 
this eventfd is IRQ 10.  There may be multiple eventfds routed to one 
IRQ (corresponding to a shared IRQ line).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Davide Libenzi
On Wed, 26 Aug 2009, Avi Kivity wrote:

 On 08/26/2009 08:45 PM, Davide Libenzi wrote:
  OK, if I get it correctly, there is one eventfd signaler (the device), and
  one eventfd reader (the hypervisor), right?
  Each hypervisor listens for multiple devices detecting state changes, and
  associating the eventfd line to the IRQ number by some configuration
  (ala PCI), right?
 
 
 Yes.  The PCI stuff happens in userspace, all the hypervisor sees is this
 eventfd is IRQ 10.  There may be multiple eventfds routed to one IRQ
 (corresponding to a shared IRQ line).

Ok, so why not using the eventfd counter as state?
On the device side:

void write_state(int sfd, int state) {
u64 cnt;

/* Clear the current state, sfd is in non-blocking mode */
read(sfd, cnt, sizeof(cnt));
/* Writes new state */
cnt = 1 + !!state;
write(sfd, cnt, sizeof(cnt));
}


On the hypervisor side:

int read_state(int sfd) {
u64 cnt;

read(sfd, cnt, sizeof(cnt));
return state - 1;
}




- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Avi Kivity

On 08/26/2009 10:13 PM, Davide Libenzi wrote:

Ok, so why not using the eventfd counter as state?
On the device side:

void write_state(int sfd, int state) {
u64 cnt;

/* Clear the current state, sfd is in non-blocking mode */
read(sfd,cnt, sizeof(cnt));
/* Writes new state */
cnt = 1 + !!state;
write(sfd,cnt, sizeof(cnt));
}


On the hypervisor side:

int read_state(int sfd) {
u64 cnt;

read(sfd,cnt, sizeof(cnt));
return state - 1;
}

   


Hadn't though of read+write as set.  While the 1+ is a little ugly, it's 
workable.


I see no kernel equivalent to read(), but that's easily done.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Davide Libenzi
On Wed, 26 Aug 2009, Avi Kivity wrote:

 On 08/26/2009 10:13 PM, Davide Libenzi wrote:
  Ok, so why not using the eventfd counter as state?
  On the device side:
  
  void write_state(int sfd, int state) {
  u64 cnt;
  
  /* Clear the current state, sfd is in non-blocking mode */
  read(sfd,cnt, sizeof(cnt));
  /* Writes new state */
  cnt = 1 + !!state;
  write(sfd,cnt, sizeof(cnt));
  }
  
  
  On the hypervisor side:
  
  int read_state(int sfd) {
  u64 cnt;
  
  read(sfd,cnt, sizeof(cnt));
  return state - 1;
  }
  
 
 
 Hadn't though of read+write as set.  While the 1+ is a little ugly, it's
 workable.

Pick what you want, as long as it always writes something != 0 :)


 I see no kernel equivalent to read(), but that's easily done.

Adding an in-kernel read based on ctx, that is no problem at all.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Gleb Natapov
On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
 On 08/26/2009 10:13 PM, Davide Libenzi wrote:
 Ok, so why not using the eventfd counter as state?
 On the device side:
 
 void write_state(int sfd, int state) {
  u64 cnt;
 
  /* Clear the current state, sfd is in non-blocking mode */
  read(sfd,cnt, sizeof(cnt));
  /* Writes new state */
  cnt = 1 + !!state;
  write(sfd,cnt, sizeof(cnt));
 }
 
 
 On the hypervisor side:
 
 int read_state(int sfd) {
  u64 cnt;
 
  read(sfd,cnt, sizeof(cnt));
  return state - 1;
 }
 
 
 Hadn't though of read+write as set.  While the 1+ is a little ugly,
 it's workable.
 
It's two system calls instead of one to inject interrupt.

 I see no kernel equivalent to read(), but that's easily done.
 

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Davide Libenzi
On Wed, 26 Aug 2009, Gleb Natapov wrote:

 On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
  On 08/26/2009 10:13 PM, Davide Libenzi wrote:
  Ok, so why not using the eventfd counter as state?
  On the device side:
  
  void write_state(int sfd, int state) {
 u64 cnt;
  
 /* Clear the current state, sfd is in non-blocking mode */
 read(sfd,cnt, sizeof(cnt));
 /* Writes new state */
 cnt = 1 + !!state;
 write(sfd,cnt, sizeof(cnt));
  }
  
  
  On the hypervisor side:
  
  int read_state(int sfd) {
 u64 cnt;
  
 read(sfd,cnt, sizeof(cnt));
 return state - 1;
  }
  
  
  Hadn't though of read+write as set.  While the 1+ is a little ugly,
  it's workable.
  
 It's two system calls instead of one to inject interrupt.

I guess that's going to completely throw off-chart your RT performance, 
doesn't it?


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Davide Libenzi
On Wed, 26 Aug 2009, Davide Libenzi wrote:

  I see no kernel equivalent to read(), but that's easily done.
 
 Adding an in-kernel read based on ctx, that is no problem at all.

Something like the untested below.
I had thought you said the eventfd readers where in userspace, but I might 
have misunderstood you.



- Davide



---
 fs/eventfd.c|   51 +---
 include/linux/eventfd.h |7 ++
 2 files changed, 43 insertions(+), 15 deletions(-)

Index: linux-2.6.mod/fs/eventfd.c
===
--- linux-2.6.mod.orig/fs/eventfd.c 2009-08-26 15:58:03.0 -0700
+++ linux-2.6.mod/fs/eventfd.c  2009-08-26 16:20:03.0 -0700
@@ -130,26 +130,33 @@ static unsigned int eventfd_poll(struct 
return events;
 }
 
-static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
-   loff_t *ppos)
+/**
+ * eventfd_ctx_read - Reads the eventfd counter or wait if it is zero.
+ * @ctx: [in] Pointer to eventfd context.
+ * @no_wait: [in] Different from zero if the operation should not block.
+ * @cnt: [out] Pointer to the 64bit conter value.
+ *
+ * Returns zero if successful, or the following error codes:
+ *
+ * -EAGAIN  : The operation would have blocked but @no_wait was nonzero.
+ * -ERESTARTSYS : A signal interrupted the wait operation.
+ */
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
 {
-   struct eventfd_ctx *ctx = file-private_data;
ssize_t res;
-   __u64 ucnt = 0;
DECLARE_WAITQUEUE(wait, current);
 
-   if (count  sizeof(ucnt))
-   return -EINVAL;
spin_lock_irq(ctx-wqh.lock);
+   *cnt = 0;
res = -EAGAIN;
if (ctx-count  0)
-   res = sizeof(ucnt);
-   else if (!(file-f_flags  O_NONBLOCK)) {
+   res = 0;
+   else if (!no_wait) {
__add_wait_queue(ctx-wqh, wait);
-   for (res = 0;;) {
+   for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (ctx-count  0) {
-   res = sizeof(ucnt);
+   res = 0;
break;
}
if (signal_pending(current)) {
@@ -163,18 +170,32 @@ static ssize_t eventfd_read(struct file 
__remove_wait_queue(ctx-wqh, wait);
__set_current_state(TASK_RUNNING);
}
-   if (likely(res  0)) {
-   ucnt = (ctx-flags  EFD_SEMAPHORE) ? 1 : ctx-count;
-   ctx-count -= ucnt;
+   if (likely(res == 0)) {
+   *cnt = (ctx-flags  EFD_SEMAPHORE) ? 1 : ctx-count;
+   ctx-count -= *cnt;
if (waitqueue_active(ctx-wqh))
wake_up_locked_poll(ctx-wqh, POLLOUT);
}
spin_unlock_irq(ctx-wqh.lock);
-   if (res  0  put_user(ucnt, (__u64 __user *) buf))
-   return -EFAULT;
 
return res;
 }
+EXPORT_SYMBOL_GPL(eventfd_ctx_read);
+
+static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
+   loff_t *ppos)
+{
+   struct eventfd_ctx *ctx = file-private_data;
+   ssize_t res;
+   __u64 cnt;
+
+   if (count  sizeof(cnt))
+   return -EINVAL;
+   if ((res = eventfd_ctx_read(ctx, file-f_flags  O_NONBLOCK, cnt))  0)
+   return res;
+
+   return put_user(cnt, (__u64 __user *) buf) ? -EFAULT: sizeof(cnt);
+}
 
 static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t 
count,
 loff_t *ppos)
Index: linux-2.6.mod/include/linux/eventfd.h
===
--- linux-2.6.mod.orig/include/linux/eventfd.h  2009-08-26 15:58:03.0 
-0700
+++ linux-2.6.mod/include/linux/eventfd.h   2009-08-26 16:17:01.0 
-0700
@@ -33,6 +33,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 int eventfd_signal(struct eventfd_ctx *ctx, int n);
+ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt);
 
 #else /* CONFIG_EVENTFD */
 
@@ -55,6 +56,12 @@ static inline void eventfd_ctx_put(struc
 
 }
 
+static inline ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait,
+  __u64 *cnt)
+{
+   return -ENOSYS;
+}
+
 #endif
 
 #endif /* _LINUX_EVENTFD_H */
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Avi Kivity

On 08/27/2009 02:30 AM, Davide Libenzi wrote:

On Wed, 26 Aug 2009, Davide Libenzi wrote:

   

I see no kernel equivalent to read(), but that's easily done.
   

Adding an in-kernel read based on ctx, that is no problem at all.
 

Something like the untested below.
I had thought you said the eventfd readers where in userspace, but I might
have misunderstood you.
   


No, they're all over the place.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-26 Thread Gleb Natapov
On Wed, Aug 26, 2009 at 01:04:09PM -0700, Davide Libenzi wrote:
 On Wed, 26 Aug 2009, Gleb Natapov wrote:
 
  On Wed, Aug 26, 2009 at 10:42:05PM +0300, Avi Kivity wrote:
   On 08/26/2009 10:13 PM, Davide Libenzi wrote:
   Ok, so why not using the eventfd counter as state?
   On the device side:
   
   void write_state(int sfd, int state) {
u64 cnt;
   
/* Clear the current state, sfd is in non-blocking mode */
read(sfd,cnt, sizeof(cnt));
/* Writes new state */
cnt = 1 + !!state;
write(sfd,cnt, sizeof(cnt));
   }
   
   
   On the hypervisor side:
   
   int read_state(int sfd) {
u64 cnt;
   
read(sfd,cnt, sizeof(cnt));
return state - 1;
   }
   
   
   Hadn't though of read+write as set.  While the 1+ is a little ugly,
   it's workable.
   
  It's two system calls instead of one to inject interrupt.
 
 I guess that's going to completely throw off-chart your RT performance, 
 doesn't it?
 
Do you consider interrupt injection path not worth of optimizing?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Paolo Bonzini



There are userspace libraries that do almost everything, but you hardly
see things like pthread_(EFD_STATE-like)_create() or similar system
interfaces based on such abstraction.


It actually seems as close to a condition variable as an eventfd can be.


A pthread condition typical code usage maps to eventfd like:

while (read(efd, ...)  0)
if (CONDITION)
break;

So a pthread condition is really a wakeup gate like eventfd is.
EFD_STATE has nothing to do with a pthread condition.


No, your code does not work for pthread_cond_broadcast (which arguably 
is the common case) because all the eventfd readers after the first 
would block.


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Michael S. Tsirkin
On Mon, Aug 24, 2009 at 03:15:05PM -0700, Davide Libenzi wrote:
 On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:
 
  On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
   On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
   
On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
 On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
 More important here is realization that eventfd is a mutex/semaphore
 implementation, not a generic event reporting interface as we are 
 trying
 to use it.


 Well it is a generic event reporting interface (for example, aio uses 
 it).

Davide, I think it's a valid point.  For example, what read on eventfd
does (zero a counter and return) is not like any semaphore I saw.
   
   
   Indeed, the default eventfd behaviour is like, well, an event. Signaling 
   (kernel side) or writing (userspace side), signals the event.
   Waiting (reading) it, will reset the event.
   If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
   Events and sempahores are two widely known and used abstractions.
   The EFD_STATE proposed one, well, no. Not at all.
  
  Hmm. All we try to do is, associate a small key with the event
  that we signal. Is it really that uncommon/KVM specific?
 
 All I'm trying to do, is to avoid that eventfd will become an horrible 
 multiplexor for every freaky one-time-use behaviors arising inside kernel 
 modules.

Yes, we don't want that. The best thing is to try to restate the problem
in a way that is generic, and then either solve or best use existing
solution. Right?

I thought I had that, but apparently not.  The reason I'm Cc-ing you is
not to try and spam you until you give up and accept the patch, it's
hoping that you see the pattern behind our usage, and help generalize
it.

If I understand it correctly, you believe this is not possible and so
any solution will have to be in KVM? Or maybe I didn't state the problem
clearly enough and should restate it?


 
 
 - Davide

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-25 Thread Davide Libenzi
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:

 Yes, we don't want that. The best thing is to try to restate the problem
 in a way that is generic, and then either solve or best use existing
 solution. Right?
 
 I thought I had that, but apparently not.  The reason I'm Cc-ing you is
 not to try and spam you until you give up and accept the patch, it's
 hoping that you see the pattern behind our usage, and help generalize
 it.
 
 If I understand it correctly, you believe this is not possible and so
 any solution will have to be in KVM? Or maybe I didn't state the problem
 clearly enough and should restate it?

Please do.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Avi Kivity

On 08/24/2009 09:25 PM, Davide Libenzi wrote:

Indeed, the default eventfd behaviour is like, well, an event. Signaling
(kernel side) or writing (userspace side), signals the event.
Waiting (reading) it, will reset the event.
If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
Events and sempahores are two widely known and used abstractions.
The EFD_STATE proposed one, well, no. Not at all.
   


There are libraries that provide notifications (or fire watches) when 
some value changes.  They're much less frequently used than events or 
semaphores, though.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Davide Libenzi
On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:

 On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
  On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
  More important here is realization that eventfd is a mutex/semaphore
  implementation, not a generic event reporting interface as we are trying
  to use it.
 
 
  Well it is a generic event reporting interface (for example, aio uses it).
 
 Davide, I think it's a valid point.  For example, what read on eventfd
 does (zero a counter and return) is not like any semaphore I saw.


Indeed, the default eventfd behaviour is like, well, an event. Signaling 
(kernel side) or writing (userspace side), signals the event.
Waiting (reading) it, will reset the event.
If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
Events and sempahores are two widely known and used abstractions.
The EFD_STATE proposed one, well, no. Not at all.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Michael S. Tsirkin
On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
 On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
 
  On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
   On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
   More important here is realization that eventfd is a mutex/semaphore
   implementation, not a generic event reporting interface as we are trying
   to use it.
  
  
   Well it is a generic event reporting interface (for example, aio uses it).
  
  Davide, I think it's a valid point.  For example, what read on eventfd
  does (zero a counter and return) is not like any semaphore I saw.
 
 
 Indeed, the default eventfd behaviour is like, well, an event. Signaling 
 (kernel side) or writing (userspace side), signals the event.
 Waiting (reading) it, will reset the event.
 If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
 Events and sempahores are two widely known and used abstractions.
 The EFD_STATE proposed one, well, no. Not at all.

Hmm. All we try to do is, associate a small key with the event
that we signal. Is it really that uncommon/KVM specific?

 
 
 - Davide
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Davide Libenzi
On Mon, 24 Aug 2009, Avi Kivity wrote:

 On 08/24/2009 09:25 PM, Davide Libenzi wrote:
  Indeed, the default eventfd behaviour is like, well, an event. Signaling
  (kernel side) or writing (userspace side), signals the event.
  Waiting (reading) it, will reset the event.
  If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
  Events and sempahores are two widely known and used abstractions.
  The EFD_STATE proposed one, well, no. Not at all.
 
 
 There are libraries that provide notifications (or fire watches) when some
 value changes.  They're much less frequently used than events or semaphores,
 though.

There are userspace libraries that do almost everything, but you hardly 
see things like pthread_(EFD_STATE-like)_create() or similar system 
interfaces based on such abstraction.
Is that really difficult to understand where I'm standing, leaving the KVM 
hat off for a moment?



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Paolo Bonzini



There are userspace libraries that do almost everything, but you hardly
see things like pthread_(EFD_STATE-like)_create() or similar system
interfaces based on such abstraction.


It actually seems as close to a condition variable as an eventfd can be.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Davide Libenzi
On Tue, 25 Aug 2009, Michael S. Tsirkin wrote:

 On Mon, Aug 24, 2009 at 11:25:01AM -0700, Davide Libenzi wrote:
  On Sun, 23 Aug 2009, Michael S. Tsirkin wrote:
  
   On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
More important here is realization that eventfd is a mutex/semaphore
implementation, not a generic event reporting interface as we are 
trying
to use it.
   
   
Well it is a generic event reporting interface (for example, aio uses 
it).
   
   Davide, I think it's a valid point.  For example, what read on eventfd
   does (zero a counter and return) is not like any semaphore I saw.
  
  
  Indeed, the default eventfd behaviour is like, well, an event. Signaling 
  (kernel side) or writing (userspace side), signals the event.
  Waiting (reading) it, will reset the event.
  If you use EFD_SEMAPHORE, you get a semaphore-like behavior.
  Events and sempahores are two widely known and used abstractions.
  The EFD_STATE proposed one, well, no. Not at all.
 
 Hmm. All we try to do is, associate a small key with the event
 that we signal. Is it really that uncommon/KVM specific?

All I'm trying to do, is to avoid that eventfd will become an horrible 
multiplexor for every freaky one-time-use behaviors arising inside kernel 
modules.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Davide Libenzi
On Tue, 25 Aug 2009, Paolo Bonzini wrote:

 
  There are userspace libraries that do almost everything, but you hardly
  see things like pthread_(EFD_STATE-like)_create() or similar system
  interfaces based on such abstraction.
 
 It actually seems as close to a condition variable as an eventfd can be.

A pthread condition typical code usage maps to eventfd like:

while (read(efd, ...)  0)
if (CONDITION)
break;

So a pthread condition is really a wakeup gate like eventfd is.
EFD_STATE has nothing to do with a pthread condition.



- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-24 Thread Avi Kivity

On 08/25/2009 01:08 AM, Davide Libenzi wrote:

Is that really difficult to understand where I'm standing, leaving the KVM
hat off for a moment?
   


I understand it perfectly.  I take the same position with kvm.  I'm 
providing more data in the hope that you'll change you mind, not trying 
to flood you with email so you'll give up.


We can always create our eventfd-lookalike for kvm, but I'd rather not 
do that (other options include a userspace proxy through existing 
interfaces, it might even be better than changing eventfd if we decide 
performance for level-triggered interrupts is not critical).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-23 Thread Avi Kivity

On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote:

I thought the point was to move assigned devices out of KVM?

   

Grr.  Forgot about that.

That's much more important.
 

Looks like we'll have do with a separate char device or
something?
   


We can still tunnel it through userspace.  Most assigned devices will 
have MSI.  uio can signal the eventfd when the level changes, qemu can 
read the level and write it to kvm.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-23 Thread Michael S. Tsirkin
On Sun, Aug 23, 2009 at 04:01:29PM +0300, Avi Kivity wrote:
 On 08/20/2009 09:28 PM, Michael S. Tsirkin wrote:
 I thought the point was to move assigned devices out of KVM?


 Grr.  Forgot about that.

 That's much more important.
  
 Looks like we'll have do with a separate char device or
 something?


 We can still tunnel it through userspace.

More important here is realization that eventfd is a mutex/semaphore
implementation, not a generic event reporting interface as we are trying
to use it.

 Most assigned devices will  
 have MSI.  uio can signal the eventfd when the level changes, qemu can  
 read the level and write it to kvm.
 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-23 Thread Avi Kivity

On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:

More important here is realization that eventfd is a mutex/semaphore
implementation, not a generic event reporting interface as we are trying
to use it.
   


Well it is a generic event reporting interface (for example, aio uses it).

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-23 Thread Michael S. Tsirkin
On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:
 On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:
 More important here is realization that eventfd is a mutex/semaphore
 implementation, not a generic event reporting interface as we are trying
 to use it.


 Well it is a generic event reporting interface (for example, aio uses it).

Davide, I think it's a valid point.  For example, what read on eventfd
does (zero a counter and return) is not like any semaphore I saw.

Look at eventfd as pipe replacement: for wake-up between processes, it
works well. But what if we want to pass around some kind of key as well?
pipe lets you pass single-byte values atomically. This flag allows us to
use 8 byte values as keys.

 -- 
 error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-23 Thread Paolo Bonzini

On 08/23/2009 04:30 PM, Michael S. Tsirkin wrote:

On Sun, Aug 23, 2009 at 04:40:51PM +0300, Avi Kivity wrote:

On 08/23/2009 04:36 PM, Michael S. Tsirkin wrote:

More important here is realization that eventfd is a mutex/semaphore
implementation, not a generic event reporting interface as we are trying
to use it.



Well it is a generic event reporting interface (for example, aio uses it).


Davide, I think it's a valid point.  For example, what read on eventfd
does (zero a counter and return) is not like any semaphore I saw.


It is similar to Win32 events (which can be used like contorted mutexes 
to sync two processes).


Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-21 Thread Davide Libenzi
On Thu, 20 Aug 2009, Paolo Bonzini wrote:

 On 08/20/2009 07:44 PM, Davide Libenzi wrote:
  On Thu, 20 Aug 2009, Avi Kivity wrote:
  
   On 08/20/2009 07:20 PM, Davide Libenzi wrote:

I briefly looked at this while in vacation, although I did not reply
hoping the horrible feeling about this code would go away.
It didn't.
I find this to be an ugly and ad-hoc multiplexing of eventfd with added
functionalities of questionable general use.
I'm pretty sure you can do better on KVM side, to solve the problem
w/out
littering eventfd.
   
   While we could argue about this my feeling is that we should drop this, at
   least until we can quantify what benefit it has and whether there are any
   Davide-acceptable alternatives.
  
  I really didn't mean to be harsh, but seriously, we cannot just have a
  Multiplexing Feast over eventfd, with one-time users.
 
 EFD_STATE actually does two changes: a) makes read block until the value
 changes; b) makes each write override the previous one.  How would you feel if
 the two changes were separated?

I'm afraid that splitting the change won't make less inappropriate for 
eventfd.
I've no doubt KVM needs something like this, but this just don't belong to 
eventfd, which act as either a mutex or a semaphore.
This EFD_STATE does not belong to a generic interface, proof of which is 
the lack of such abstraction in any generic library I'm aware of.
This aside from any consideration about how the code will look with that 
multiplexing in place.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Michael S. Tsirkin
Davide,
Looks like I got an ack from Avi and no comments from others,
so the following patchset is identical to the last RFC.
Can it be merged for 2.6.32?
Thanks!

This series implements a new EFD_STATE flag for eventfd.
When set, this changes eventfd behaviour in the following way:
- write simply stores the value written, and is always non-blocking
- read unblocks when the value written changes, and
  returns the value written

Motivation: we'd like to use eventfd in qemu to pass interrupts from
(emulated or assigned) devices to guest. For level interrupts, the
counter supported currently by eventfd is not a good match: we really
need to set interrupt to a level, typically 0 or 1, wake the guest if
there was a change and give the guest ability to see the last value
written. The level can be set either from kernel (e.g. with assigned
devices) or from userspace.

Michael S. Tsirkin (2):
  eventfd: reorganize the code to simplify new flags
  eventfd: EFD_STATE flag

 fs/eventfd.c|   83 +++
 include/linux/eventfd.h |3 +-
 2 files changed, 71 insertions(+), 15 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Davide Libenzi
On Thu, 20 Aug 2009, Michael S. Tsirkin wrote:

 Davide,
 Looks like I got an ack from Avi and no comments from others,
 so the following patchset is identical to the last RFC.
 Can it be merged for 2.6.32?
 Thanks!

I briefly looked at this while in vacation, although I did not reply 
hoping the horrible feeling about this code would go away.
It didn't.
I find this to be an ugly and ad-hoc multiplexing of eventfd with added 
functionalities of questionable general use.
I'm pretty sure you can do better on KVM side, to solve the problem w/out 
littering eventfd.


- Davide

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Avi Kivity

On 08/20/2009 07:20 PM, Davide Libenzi wrote:


I briefly looked at this while in vacation, although I did not reply
hoping the horrible feeling about this code would go away.
It didn't.
I find this to be an ugly and ad-hoc multiplexing of eventfd with added
functionalities of questionable general use.
I'm pretty sure you can do better on KVM side, to solve the problem w/out
littering eventfd.

   


While we could argue about this my feeling is that we should drop this, 
at least until we can quantify what benefit it has and whether there are 
any Davide-acceptable alternatives.


In the meanwhile, we can let vhost-net support edge-triggered interrupts 
only, let qemu terminate those eventfds and convert then to 
level-triggered interrupts (which it can then inject using the existing 
ioctl).  It will keep vhost-net and kvm simpler at the cost of some 
performance penalty to guests using level interrupts.  These suck anyway 
so we'll point users at msi.


--
I have a truly marvellous patch that fixes the bug which thisb
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Davide Libenzi
On Thu, 20 Aug 2009, Avi Kivity wrote:

 On 08/20/2009 07:20 PM, Davide Libenzi wrote:
  
  I briefly looked at this while in vacation, although I did not reply
  hoping the horrible feeling about this code would go away.
  It didn't.
  I find this to be an ugly and ad-hoc multiplexing of eventfd with added
  functionalities of questionable general use.
  I'm pretty sure you can do better on KVM side, to solve the problem w/out
  littering eventfd.
  
 
 
 While we could argue about this my feeling is that we should drop this, at
 least until we can quantify what benefit it has and whether there are any
 Davide-acceptable alternatives.

I really didn't mean to be harsh, but seriously, we cannot just have a 
Multiplexing Feast over eventfd, with one-time users.


- Davide


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Paolo Bonzini

On 08/20/2009 07:44 PM, Davide Libenzi wrote:

On Thu, 20 Aug 2009, Avi Kivity wrote:


On 08/20/2009 07:20 PM, Davide Libenzi wrote:


I briefly looked at this while in vacation, although I did not reply
hoping the horrible feeling about this code would go away.
It didn't.
I find this to be an ugly and ad-hoc multiplexing of eventfd with added
functionalities of questionable general use.
I'm pretty sure you can do better on KVM side, to solve the problem w/out
littering eventfd.


While we could argue about this my feeling is that we should drop this, at
least until we can quantify what benefit it has and whether there are any
Davide-acceptable alternatives.


I really didn't mean to be harsh, but seriously, we cannot just have a
Multiplexing Feast over eventfd, with one-time users.


EFD_STATE actually does two changes: a) makes read block until the value 
changes; b) makes each write override the previous one.  How would you 
feel if the two changes were separated?  I can see each of them has use 
cases


For example, (a) could be implemented by using select's xfds (POLLPRI) 
to poll for value changes (rfds would still poll for non-zeroness). 
Then Michael does not need even to read the eventfd; instead he'd check 
POLLIN with a zero timeout.


(b) could be implemented with a flag like Michael did.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Michael S. Tsirkin
On Thu, Aug 20, 2009 at 08:38:48PM +0300, Avi Kivity wrote:
 On 08/20/2009 07:20 PM, Davide Libenzi wrote:

 I briefly looked at this while in vacation, although I did not reply
 hoping the horrible feeling about this code would go away.
 It didn't.
 I find this to be an ugly and ad-hoc multiplexing of eventfd with added
 functionalities of questionable general use.
 I'm pretty sure you can do better on KVM side, to solve the problem w/out
 littering eventfd.



 While we could argue about this my feeling is that we should drop this,  
 at least until we can quantify what benefit it has and whether there are  
 any Davide-acceptable alternatives.

 In the meanwhile, we can let vhost-net support edge-triggered interrupts  
 only, let qemu terminate those eventfds and convert then to  
 level-triggered interrupts (which it can then inject using the existing  
 ioctl).  It will keep vhost-net and kvm simpler at the cost of some  
 performance penalty to guests using level interrupts.  These suck anyway  
 so we'll point users at msi.

I thought the point was to move assigned devices out of KVM?

 -- 
 I have a truly marvellous patch that fixes the bug which thisb
 signature is too narrow to contain.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Avi Kivity

On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote:



While we could argue about this my feeling is that we should drop this,
at least until we can quantify what benefit it has and whether there are
any Davide-acceptable alternatives.

In the meanwhile, we can let vhost-net support edge-triggered interrupts
only, let qemu terminate those eventfds and convert then to
level-triggered interrupts (which it can then inject using the existing
ioctl).  It will keep vhost-net and kvm simpler at the cost of some
performance penalty to guests using level interrupts.  These suck anyway
so we'll point users at msi.
 

I thought the point was to move assigned devices out of KVM?
   


Grr.  Forgot about that.

That's much more important.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] eventfd: new EFD_STATE flag

2009-08-20 Thread Michael S. Tsirkin
On Thu, Aug 20, 2009 at 09:06:51PM +0300, Avi Kivity wrote:
 On 08/20/2009 08:55 PM, Michael S. Tsirkin wrote:

 While we could argue about this my feeling is that we should drop this,
 at least until we can quantify what benefit it has and whether there are
 any Davide-acceptable alternatives.

 In the meanwhile, we can let vhost-net support edge-triggered interrupts
 only, let qemu terminate those eventfds and convert then to
 level-triggered interrupts (which it can then inject using the existing
 ioctl).  It will keep vhost-net and kvm simpler at the cost of some
 performance penalty to guests using level interrupts.  These suck anyway
 so we'll point users at msi.
  
 I thought the point was to move assigned devices out of KVM?


 Grr.  Forgot about that.

 That's much more important.

Looks like we'll have do with a separate char device or
something?

 -- 
 I have a truly marvellous patch that fixes the bug which this
 signature is too narrow to contain.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html