[PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-06 Thread Martin Sustrik
When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling  (select, poll, epoll). For polling, real system-level file descriptor
is needed.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, however:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
!POLLIN & !POLLOUT, which is a perfectly valid combination for a network
protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  short events;
  void *ptr;
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 
---
 fs/eventfd.c|  105 ---
 include/linux/eventfd.h |3 +-
 2 files changed, 83 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..9fec49f 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,26 @@
 #include 
 #include 
 
+struct eventfd_mask {
+   short events;
+   void *ptr;
+};
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+   struct eventfd_mask mask;
+   };
unsigned int flags;
 };
 
@@ -55,6 +64,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 {
unsigned long flags;
 
+   /* This function should never be used with eventfd in the mask mode. */
+   BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;
@@ -123,12 +135,16 @@ static unsigned int eventfd_poll(struct file *file, 
poll_table *wait)
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->count > 0)
-   events |= POLLIN;
-   if (ctx->count == ULLONG_MAX)
-   events |= POLLERR;
-   if (ULLONG_MAX - 1 > ctx->count)
-   events |= POLLOUT;
+   if (ctx->flags & EFD_MASK) {
+   events = ctx->m

Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-07 Thread Martin Sustrik

On 07/02/13 20:12, Andy Lutomirski wrote:

On 02/06/2013 10:41 PM, Martin Sustrik wrote:

When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling  (select, poll, epoll). For polling, real system-level file descriptor
is needed.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, however:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
 !POLLIN&  !POLLOUT, which is a perfectly valid combination for a network
 protocol (rx buffer is empty and tx buffer is full), cannot be signaled
 using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
   short events;
   void *ptr;
};


IMO that should be u64 ptr to avoid compat problems.


I was following the user space declaration of epoll_data:

   typedef union epoll_data {
   void*ptr;  <-
   int  fd;
   uint32_t u32;
   uint64_t u64;
   } epoll_data_t;

However, now I'm looking at the kernel side definition of the whole 
union which looks like this (obviously it assumes that pointer is never 
longer than 64 bits):


 __u64 data;

Hm, not very helpful. Anyway, I am not a kernel developer, so any 
concrete suggestion about what type to use to map cleanly to user-space 
void* is welcome.



The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.


How does this interact with EPOLLET?


That's an interesting question. The original eventfd code doesn't do 
anything specific to either edge or level mode. Neither does my patch.


Inspection of the code seems to suggest that edge vs. level distinction 
is handled elsewhere (ep_send_events_proc) where there is a separate 
list of ready events and the function, after returning the event, 
decides whether to leave the event in the list (level) or delete it from 
the list (edge).


In any case, review from someone with experience with epoll 
implementation would help.


Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-07 Thread Martin Sustrik
When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling  (select, poll, epoll). And unfortunately, sockets that can't be polled
on allow only for building the simplest possible applications. Basically, you
can build a simple client, but once you want to implement a server handling
many sockets in parallel, you are stuck.

However, to do polling, real system-level file descriptor is needed,
not a fake one.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, but:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
!POLLIN & !POLLOUT, which is a perfectly valid combination for a network
protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  short events;
  void *ptr;
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 
---
 fs/eventfd.c|  105 ---
 include/linux/eventfd.h |3 +-
 2 files changed, 83 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..9fec49f 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,26 @@
 #include 
 #include 
 
+struct eventfd_mask {
+   short events;
+   void *ptr;
+};
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+   struct eventfd_mask mask;
+   };
unsigned int flags;
 };
 
@@ -55,6 +64,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 {
unsigned long flags;
 
+   /* This function should never be used with eventfd in the mask mode. */
+   BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;
@@ -123,12 +135,16 @@ static unsigned int eventfd_poll(struct file *file, 
poll_table *wait)
poll_wait(file, &ctx->wqh, wait);
 
spin_lock_irqsave(&ctx->wqh.lock, flags);
-   if (ctx->count &g

Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-07 Thread Martin Sustrik

On 07/02/13 23:44, Andrew Morton wrote:


So please update the changelog and then cc net...@vger.kernel.org on
the patch - the netdev people are probably best-situated to comment on
the proposal.


OK. Done. Thanks for the advice!

Martin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-07 Thread Martin Sustrik

Hi Andy,

On 08/02/13 02:03, Andy Lutomirski wrote:

There may be some
advantage to adding (later on, if needed) an option to change the
flags set in:

+   if (waitqueue_active(&ctx->wqh))
+   wake_up_locked_poll(&ctx->wqh,
+   (unsigned long)ctx->mask.events);

(i.e. to allow the second parameter to omit some bits that were
already signaled.)  Allowing write to write a bigger struct in the
future won't break anything.


I think I don't follow. Either the second parameter is supposed to be 
*newly* signaled events, in which case the events that were already 
signaled in the past should be ommitted, or it is meant to be *all* 
signaled events, in which case the current implementation is OK.


Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-07 Thread Martin Sustrik

On 08/02/13 07:36, Andy Lutomirski wrote:


On 08/02/13 02:03, Andy Lutomirski wrote:


There may be some
advantage to adding (later on, if needed) an option to change the
flags set in:

+   if (waitqueue_active(&ctx->wqh))
+   wake_up_locked_poll(&ctx->wqh,
+   (unsigned long)ctx->mask.events);

(i.e. to allow the second parameter to omit some bits that were
already signaled.)  Allowing write to write a bigger struct in the
future won't break anything.



I think I don't follow. Either the second parameter is supposed to be
*newly* signaled events, in which case the events that were already signaled
in the past should be ommitted, or it is meant to be *all* signaled events,
in which case the current implementation is OK.


I defer to the experts here.  But I suspect that if you want to
perfectly emulate sockets, you may need to vary what you specify.
(IIRC tcp sockets report an EPOLLIN edge every time data is received
even if the receive buffer wasn't empty.)


Hm. That sounds like leaking protocol implementation details to the 
user. That's a bad design IMO and should not be encouraged.


Anyway, I have implemented your other suggestions.

Btw, one thing I am not sure about is how to submit improved patches to 
the ML. Should I use the same patch name? Doesn't that cause confusion?


Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2013-02-08 Thread Martin Sustrik
When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling  (select, poll, epoll). And unfortunately, sockets that can't be polled
on allow only for building the simplest possible applications. Basically, you
can build a simple client, but once you want to implement a server handling
many sockets in parallel, you are stuck.

However, to do polling, real system-level file descriptor is needed,
not a fake one.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, but:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
!POLLIN & !POLLOUT, which is a perfectly valid combination for a network
protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  short events;
  union {
void *ptr;
uint32_t u32;
uint64_t u64;
  };
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
ptr, u32 and u63 are opaque data that are not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 
---
v2 - Concerns raised by Andy Lutomirski addressed:
- uses fixed size (64 bits) opaque data in eventfd_mask instead of void*
- write returns EINVAL if invalid POLL flags are specified
---
 fs/eventfd.c|  116 +--
 include/linux/eventfd.h |3 +-
 2 files changed, 94 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..bf83f6d87 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,35 @@
 #include 
 #include 
 
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
+/*  On x86-64 keep the same binary layout as on i386. */
+#ifdef __x86_64__
+#define EVENTFD_MASK_PACKED __packed
+#else
+#define EVENTFD_MASK_PACKED
+#endif
+
+struct eventfd_mask {
+   __u32 events;
+   __u64 data;
+} EVENTFD_MASK_PACKED;
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+   struct eventfd_mask mask;
+   };
unsigned int flags;
 };
 
@@ -55,6 +73,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 {
unsigned long flags;
 
+   /* T

Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-08 Thread Martin Sustrik

On 07/02/13 23:44, Andrew Morton wrote:


That's a nice changelog but it omitted a critical thing: why do you
think the kernel needs this feature?  What's the value and use case for
being able to poll these descriptors?


To address the question, I've written down detailed description of the 
challenges of the network protocol development in user space and how the 
proposed feature addresses the problems.


It's too long to fit into ChangeLog, but it may be worth reading when 
trying to judge the merit of the patch.


It can be found here: http://www.250bpm.com/blog:16

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-08 Thread Martin Sustrik

Hi Eric,

On 08/02/13 23:21, Eric Wong wrote:

Martin Sustrik  wrote:

On 07/02/13 23:44, Andrew Morton wrote:

That's a nice changelog but it omitted a critical thing: why do you
think the kernel needs this feature?  What's the value and use case for
being able to poll these descriptors?


To address the question, I've written down detailed description of
the challenges of the network protocol development in user space and
how the proposed feature addresses the problems.

It's too long to fit into ChangeLog, but it may be worth reading
when trying to judge the merit of the patch.

It can be found here: http://www.250bpm.com/blog:16


Using one eventfd per userspace socket still seems a bit wasteful.


Wasteful in what sense? Occupying a slot in file descriptor table? 
That's the price for having the socket uniquely identified by the fd.



Couldn't you use a single pipe for all sockets and write the efd_mask to
the pipe for each socket?

A read from the pipe would behave like epoll_wait.

You might need to use one-shot semantics; but that's probably
the easiest thing in multithreaded apps anyways.


Having multiple sockets represented by a single eventfd. how would you 
distinguish where did individual events came from?


  struct pollfd pfd;
  ...
  poll (pfd, 1, -1);
  if (pfd.revents & POLLIN) /* Incoming data on which socket? */
...

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-08 Thread Martin Sustrik

On 08/02/13 23:08, Eric Wong wrote:


poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will
be signaled when polling (select, poll, epoll) on the eventfd is done
later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.


How does this interact with EPOLLET?


That's an interesting question. The original eventfd code doesn't do
anything specific to either edge or level mode. Neither does my patch.

Inspection of the code seems to suggest that edge vs. level distinction is
handled elsewhere (ep_send_events_proc) where there is a separate list of
ready events and the function, after returning the event, decides whether to
leave the event in the list (level) or delete it from the list (edge).


Right, the edge vs. level distinction is internal to epoll.


I wrote a test program for EFD_MASK+EPOLLET and it seems to behave in 
intuitive kind of way:


int main ()
{
int fd;
struct efd_mask mask;
ssize_t nbytes;
int rc;
int ep;
struct epoll_event epe;

fd = eventfd (0, EFD_MASK);

ep = epoll_create (10);
assert (ep != -1);
epe.events = EPOLLIN | EPOLLET;
rc = epoll_ctl (ep, EPOLL_CTL_ADD, fd, &epe);
assert (rc != -1);

mask.events = 0;
nbytes = write (fd, &mask, sizeof (mask));
assert (nbytes == sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

mask.events = POLLIN;
nbytes = write (fd, &mask, sizeof (mask));
assert (nbytes == sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 1 && epe.events == EPOLLIN);
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

mask.events = POLLIN;
nbytes = write (fd, &mask, sizeof (mask));
mask.events = 0;
nbytes = write (fd, &mask, sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

rc = close (ep);
assert (rc == 0);
rc = close (fd);
assert (rc == 0);

return 0;
}

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-08 Thread Martin Sustrik

On 09/02/13 04:54, Eric Wong wrote:


Using one eventfd per userspace socket still seems a bit wasteful.


Wasteful in what sense? Occupying a slot in file descriptor table?
That's the price for having the socket uniquely identified by the
fd.


Yes.  I realize eventfd is small, but I don't think eventfd is needed
at all, here.  Just one pipe.


Ah. Got you! You mean not to change the kernel, just use pipe for the 
purpose.


However, the convoluted pipe-style design is the problem I am trying to 
solve rather than the solution. It leads to convoluted APIs with 
convoluted semantics as described in the article. I've been using that 
kind of design for past 8 years and every time I have to deal with it I 
swear that one day I will implement a proper in-kernel solution to get 
rid of the hack.


And now I have finally done so.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag

2013-02-09 Thread Martin Sustrik

On 2013-02-09 12:51, Eric Wong wrote:


Yes, your eventfd change is probably the best way if you want/need
to only watch a subset of your sockets, especially if you want
poll/select to be an option.


Yes, the poll/select thing is the important point.

I wouldn't care if the only problem was that I, as the protocol 
implementer, would have to implement some kind of workaround in my 
protocol library. The problem is that these convoluted semantics leak -- 
through the use of poll, select et al. -- to the end user.


From my personal experience I can say that end users have pretty hard 
time using such complex workarounds instead of simply using a native 
file descriptor with standardised semantics.


Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/1] eventfd: fix incorrect filename is a comment

2013-01-30 Thread Martin Sustrik
Comment in eventfd.h referred to 'include/asm-generic/fcntl.h'
while the correct path is 'include/uapi/asm-generic/fcntl.h'.

Signed-off-by: Martin Sustrik 
---
 include/linux/eventfd.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..cf5d2af 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -13,7 +13,7 @@
 #include 
 
 /*
- * CAREFUL: Check include/asm-generic/fcntl.h when defining
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
  * new flags, since they might collide with O_* ones. We want
  * to re-use O_* flags that couldn't possibly have a meaning
  * from eventfd, in order to leave a free define-space for
-- 
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2013-02-14 Thread Martin Sustrik

Hi Andrew,

Thanks for the detailed code review! I'll have a look at all the 
problems you've pointed out, however, one quick question:



-   ret = seq_printf(m, "eventfd-count: %16llx\n",
-(unsigned long long)ctx->count);
+   if (ctx->flags&  EFD_MASK) {
+   ret = seq_printf(m, "eventfd-mask: %x\n",
+(unsigned)ctx->mask.events);
+   } else {
+   ret = seq_printf(m, "eventfd-count: %16llx\n",
+(unsigned long long)ctx->count);
+   }
spin_unlock_irq(&ctx->wqh.lock);


This is a non-back-compatible userspace interface change.  A procfs
file which previously displayed

eventfd-count: 

can now also display

eventfd-mask: 

So existing userspace could misbehave.

Please fully describe the proposed interface change in the changelog.
That description should include the full pathname of the procfs file
and example before-and-after output and a discussion of whether and why
the risk to existing userspace is acceptable.


I am not sure what the policy is here. Is not printing out the state of 
the object acceptable way to maintain backward compatibility? If not so, 
does new type of object require new procfs file, which, AFAIU, is the 
only way to retain full backward compatibility?


Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2013-02-15 Thread Martin Sustrik

On 15/02/13 18:32, Andy Lutomirski wrote:

On Thu, Feb 14, 2013 at 9:24 PM, Andrew Morton
  wrote:

On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik  wrote:


This is a non-back-compatible userspace interface change.  A procfs
file which previously displayed

 eventfd-count: 

can now also display

 eventfd-mask: 

So existing userspace could misbehave.

Please fully describe the proposed interface change in the changelog.
That description should include the full pathname of the procfs file
and example before-and-after output and a discussion of whether and why
the risk to existing userspace is acceptable.


I am not sure what the policy is here. Is not printing out the state of
the object acceptable way to maintain backward compatibility? If not so,
does new type of object require new procfs file, which, AFAIU, is the
only way to retain full backward compatibility?


Adding a new file is the only way I can think of to preserve the API.
But from Andy's comment is sounds like we don't have to worry a lot
about back-compatibility.



I'm not even convinced there's an issue in the first place (other than
the fact that use of this feature will break old criu, regardless of
/proc changes).  The fdinfo files already vary by descriptor type.
Anything that screws up if unexpected fields are present is already
screwed.


Ok then. I'll leave the relevant code as is.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2013-02-18 Thread Martin Sustrik

On 14/02/13 23:54, Andrew Morton wrote:


+/*  On x86-64 keep the same binary layout as on i386. */
+#ifdef __x86_64__
+#define EVENTFD_MASK_PACKED __packed
+#else
+#define EVENTFD_MASK_PACKED
+#endif
+
+struct eventfd_mask {
+   __u32 events;
+   __u64 data;
+} EVENTFD_MASK_PACKED;


The x86-64 specific thing is ugly.  I can find no explanation of why it
was done, but it should go away.  You could make `events' a u64, or
swap the order of the two fields and make the struct __packed on all
architectures.

Given that the size of the types is fixed, I see no compat issues here.


I've just copied how the definition is done for epoll_event. The comment 
there goes like this:


/*
 * On x86-64 make the 64bit structure have the same alignment as the
 * 32bit structure. This makes 32bit emulation easier.
 *
 * UML/x86_64 needs the same packing as x86_64
 */

If you still think I should remove the #ifdef, I am happy to do so.

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v3 1/1] eventfd: implementation of EFD_MASK flag

2013-02-18 Thread Martin Sustrik
When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll accept
only true file descriptors) and forces protocol implementers to use various
workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half the way
towards this goal it has follwoing shorcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most notably,
simultaneous !POLLIN and !POLLOUT, which is a perfectly valid combination
for a network protocol (rx buffer is empty and tx buffer is full), cannot be
signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
  short events;
  union {
void*ptr;
uint32_t u32;
uint64_t u64;
  };
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr', 'u32' or 'u64' are opaque data that are not interpreted by eventfd
object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 
---
Following changes were made to the patch since v2:

- eventfd_mask structure renamed to efd_mask to keep user-space prefixes
  consistent
- efd_mask is __packed for all architectures
- eventfd.h header file added to uapi; given there wasn't one so far, in
  addition to moving efd_mask there, I've moved also all the eventfd flags that
  are meant to be visible from the user space
- comment for 'mask' member eventfd_ctx added 
- synchronisation bugs in eventfd_read fixed
- several variable declarations moved from the beginning of the function to
  the blocks where they are used
- changelog text made simpler and more up to the point

There was also a request to explain why this functionality is needed. I've
written an article explaining the problems user-space network protocol
implementers face here: http://www.250bpm.com/blog:16
---
 fs/eventfd.c |  194 --
 include/linux/eventfd.h  |   17 +---
 include/uapi/linux/eventfd.h |   40 +
 3 files changed, 172 insertions(+), 79 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..8f821b1 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */
 
@@ -22,18 +23,31 @@
 #include 
 #include 
 
+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+  

Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2013-02-18 Thread Martin Sustrik

On 14/02/13 23:54, Andrew Morton wrote:


This patch adds userspace interfaces which will require manpage
updates.  Please Cc Michael and work with him on getting those changes
completed.


Right. It adds the efd_mask structure. As far as I understand how it 
works is that the actual user-space definition of the structure should 
be provided by glibc (sys/eventfd.h), right?


If so, should the man page be updated now? If so, it may result in the 
situation where the man page describes a structure that's not available 
in the user space yet.


Is there a formal process to do this kind of kernel+glibc+docs changes?

Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 1/1] eventfd: implementation of EFD_MASK flag

2015-07-09 Thread Martin Sustrik

Hi Damian,

Yes, this patch would be geneally useful for implementing stuff in user 
space that otherwise would have to live in kernelspace.


Unfortunately, I have no cycles left to pursue getting it to the 
mainline. If you feel like you can take care of it, that would be great.


I can help with the documentation, if needed.

Martin

On 2015-07-09 09:57, Damian Hobson-Garcia wrote:

Hello Martin and all,

I recently came across this (quite old by now) patch submission for an
extension to the functionality of eventfd and I noticed that the
discussion seems to have fizzled out.  Is this functionality still of
use for user space network protocols?  It seems like it would be usable
for other notification use cases as well.  In particular I'm thinking 
of

poll/select/epoll support for a user space video codec via libv4l.
Is there value in re-examining this patch?

Thank you,
Damian

On 2013-02-18 8:34 PM, Martin Sustrik wrote:
When implementing network protocols in user space, one has to 
implement

fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll 
accept
only true file descriptors) and forces protocol implementers to use 
various

workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes 
half the way

towards this goal it has follwoing shorcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. 
Most notably,
simultaneous !POLLIN and !POLLOUT, which is a perfectly valid 
combination
for a network protocol (rx buffer is empty and tx buffer is full), 
cannot be

signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above 
problems.


Additionally, to provide a way to associate user-space state with 
eventfd

object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in 
such a way
as to signal no events on the file descriptor when it is polled on. 
'initval'

argument is ignored.

write(2):

User is allowed to write only buffers containing the following 
structure:


struct efd_mask {
  short events;
  union {
void*ptr;
uint32_t u32;
uint64_t u64;
  };
};

The value of 'events' should be any combination of event flags as 
defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified 
events will
be signaled when polling (select, poll, epoll) on the eventfd is done 
later on.
'ptr', 'u32' or 'u64' are opaque data that are not interpreted by 
eventfd

object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked 
by

EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events 
specified

in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 
---
Following changes were made to the patch since v2:

- eventfd_mask structure renamed to efd_mask to keep user-space 
prefixes

  consistent
- efd_mask is __packed for all architectures
- eventfd.h header file added to uapi; given there wasn't one so far, 
in
  addition to moving efd_mask there, I've moved also all the eventfd 
flags that

  are meant to be visible from the user space
- comment for 'mask' member eventfd_ctx added
- synchronisation bugs in eventfd_read fixed
- several variable declarations moved from the beginning of the 
function to

  the blocks where they are used
- changelog text made simpler and more up to the point

There was also a request to explain why this functionality is needed. 
I've

written an article explaining the problems user-space network protocol
implementers face here: http://www.250bpm.com/blog:16
---
 fs/eventfd.c |  194 
--

 include/linux/eventfd.h  |   17 +---
 include/uapi/linux/eventfd.h |   40 +
 3 files changed, 172 insertions(+), 79 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..8f821b1 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */

@@ -22,18 +23,31 @@
 #include 
 #include 

+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | 
EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | 
POLLHUP)

+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time 

Re: [PATCH v3 1/1] eventfd: implementation of EFD_MASK flag

2015-07-09 Thread Martin Sustrik

On 2015-07-09 11:06, Damian Hobson-Garcia wrote:

Hi Martin,

On 2015-07-09 5:41 PM, Martin Sustrik wrote:

Hi Damian,

Yes, this patch would be geneally useful for implementing stuff in 
user

space that otherwise would have to live in kernelspace.

Unfortunately, I have no cycles left to pursue getting it to the
mainline. If you feel like you can take care of it, that would be 
great.

I'll see what I can do. I'll rebase it and resubmit it.
Unless there are major changes (which I highly doubt), I'll keep you as
the author, if that's ok with you.


Sure. No problem.



I can help with the documentation, if needed.

Thank you, that would be very much appreciated.


Ok, ping me when it's needed.

Thanks!
Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] eventfd: implementation of EFD_MASK flag

2015-08-09 Thread Martin Sustrik

On 2015-08-10 08:23, Damian Hobson-Garcia wrote:

Replying to my own post, but I had the following comments/questions.
Martin, if you have any response to my comments I would be very happy 
to

hear them.

On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote:

From: Martin Sustrik 


[snip]


write(2):

User is allowed to write only buffers containing the following 
structure:


struct efd_mask {
  __u32 events;
  __u64 data;
};

The value of 'events' should be any combination of event flags as 
defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified 
events will
be signaled when polling (select, poll, epoll) on the eventfd is done 
later on.

'data' is opaque data that are not interpreted by eventfd object.


I'm not fully clear on the purpose that the 'data' member serves.  Does
this opaque handle need to be tied together with this event
synchronization construct?


It's a convenience thing. Imagine you are implementing your own file 
descriptor type in user space. You create an EFD_MASK socket and a 
structure that will hold any state that you need for the socket (tx/rx 
buffers and such).


Now you have two things to pass around. If you want to pass the fd to a 
function, it must have two parameters (fd and pointer to the structure).


To fix it you can put the fd into the structure. That way there's only 
one thing to pass around (the structure).


The problem with that approach is when you have generic code that deals 
with file descriptors. For example, a simple poller which accepts a list 
of (fd, callback) pairs and invokes the callback when one of the fds 
signals POLLIN. You can't send a pointer to a structure to such 
function. All you can send is the fd, but then, when the callback is 
invoked, fd is all you have. You have no idea where your state is.


'data' member allows you to put the pointer to the state to the socket 
itself. Thus, if you have a fd, you can always find out where the 
associated data is by reading the mask structure from the fd.




[snip]

@@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 
n)

 {
+	/* This function should never be used with eventfd in the mask mode. 
*/

+   BUG_ON(ctx->flags & EFD_MASK);
+

...
@@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct 
eventfd_ctx *ctx, wait_queue_t *wait,

 {
+	/* This function should never be used with eventfd in the mask mode. 
*/

+   BUG_ON(ctx->flags & EFD_MASK);
+

...
@@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, 
int no_wait, __u64 *cnt)
+	/* This function should never be used with eventfd in the mask mode. 
*/

+   BUG_ON(ctx->flags & EFD_MASK);
+


If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't
think that there will be a way to call these functions in the mask 
mode,

so it should be possible to get rid of the BUG_ON checks.


Sure. Feel free to do so.



[snip]
@@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, 
char __user *buf, size_t count,

ssize_t res;
__u64 cnt;

+   if (ctx->flags & EFD_MASK) {
+   struct efd_mask mask;
+
+   if (count < sizeof(mask))
+   return -EINVAL;
+   spin_lock_irq(&ctx->wqh.lock);
+   mask = ctx->mask;
+   spin_unlock_irq(&ctx->wqh.lock);
+   if (copy_to_user(buf, &mask, sizeof(mask)))
+   return -EFAULT;
+   return sizeof(mask);
+   }
+


For the other eventfd modes, reading the value will update the internal
state of the eventfd (either clearing or decrementing the counter).
Should something similar be done here? I'm thinking of a case where a
process is polling on this fd in a loop. Clearing the efd_mask data  on
read should provide an easy way for the polling process to know if it 
is

seeing new poll events.


No. In this case reading the value has no effect on the state of the fd. 
How it should work is rather:


// fd is in POLLIN state
poll(fd);
// function exits with POLLIN but fd remains in POLLIN state
my_recv(fd, buf, size);
// my_recv function have found out that there's no more data to recv and 
switched off the POLLIN flag
poll(fd); // we block here waiting for more data to arrive from the 
network




@@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file 
*m, struct file *f)

struct eventfd_ctx *ctx = f->private_data;

spin_lock_irq(&ctx->wqh.lock);
-   seq_printf(m, "eventfd-count: %16llx\n",
-  (unsigned long long)ctx->count);
+   if (ctx->flags & EFD_MASK) {
+   seq_printf(m, "eventfd-mask: %x\n",
+(unsigned)ctx->mask.events);
+   } else {
+   seq_printf(m, "eventfd-count: %16llx\n",
+   

Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag

2015-09-15 Thread Martin Sustrik

On 2015-09-16 08:27, Damian Hobson-Garcia wrote:

From: Martin Sustrik 

When implementing network protocols in user space, one has to implement
fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll
accept only true file descriptors) and forces protocol implementers to 
use
various workarounds resulting in complex, non-standard and convoluted 
APIs.


More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes 
half

the way towards this goal it has follwoing shorcomings:

I.  There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitrary combination of POLL* flags. Most
notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly 
valid
combination for a network protocol (rx buffer is empty and tx 
buffer is

full), cannot be signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above 
problems.


Additionally, to provide a way to associate user-space state with 
eventfd

object, it allows to attach user-space data to the file descriptor.


The above paragraph is a leftover from the past. The functionality no 
longer exist.




The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such 
a

way as to signal no events on the file descriptor when it is polled on.
The 'initval' argument is ignored.

write(2):

User is allowed to write only buffers containing the following 
structure:


struct efd_mask {
  uint32_t events;
};


Is it worth having a struct here? Why not just uint32_t?

Martin



The value of 'events' should be any combination of event flags as 
defined

by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will be signaled when polling (select, poll, epoll) on the 
eventfd

is done later on.

read(2):

read is not supported and will fail with EINVAL.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events
specified in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik 

[dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3]
Signed-off-by: Damian Hobson-Garcia 
---
 fs/eventfd.c | 102 
++-

 include/linux/eventfd.h  |  16 +--
 include/uapi/linux/eventfd.h |  39 +
 3 files changed, 132 insertions(+), 25 deletions(-)
 create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8d0c0df..1a6a066 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2013  Martin Sustrik 
  *
  */

@@ -22,18 +23,31 @@
 #include 
 #include 

+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | 
EFD_MASK)
+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | 
POLLHUP)

+
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
-   /*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
-*/
-   __u64 count;
+   union {
+   /*
+* Every time that a write(2) is performed on an eventfd, the
+* value of the __u64 being written is added to "count" and a
+* wakeup is performed on "wqh". A read(2) will return the
+* "count" value to userspace, and will reset "count" to zero.
+* The kernel side eventfd_signal() also, adds to the "count"
+* counter and issue a wakeup.
+*/
+   __u64 count;
+
+   /*
+* When using eventfd in EFD_MASK mode this stracture stores the
+* current events to be signaled on the eventfd (events member)
+* along with opaque user-defined data (data member).
+*/
+   struct efd_mask mask;
+   };
unsigned int flags;
 };

@@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file
*file, poll_table *wait)
return events;
 }

+static unsigned int eventfd_mask_poll(struct file *file, poll_table 
*wait)

+{
+   struct eventfd_ctx *ctx = file->private_data;
+
+   poll_wait(file, &ctx->wqh, wait);
+   return ctx->mask.events;
+}
+
 static void eventfd_ctx_do_read(stru

Re: [PATCH] eventfd: implementation of EFD_MASK flag

2015-08-10 Thread Martin Sustrik

On 2015-08-10 10:57, Damian Hobson-Garcia wrote:

Hi Martin,

Thanks for your comments.

On 2015-08-10 3:39 PM, Martin Sustrik wrote:

On 2015-08-10 08:23, Damian Hobson-Garcia wrote:

Replying to my own post, but I had the following comments/questions.
Martin, if you have any response to my comments I would be very happy 
to

hear them.

On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote:

From: Martin Sustrik 


[snip]


write(2):

User is allowed to write only buffers containing the following
structure:

struct efd_mask {
  __u32 events;
  __u64 data;
};

The value of 'events' should be any combination of event flags as
defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
events will
be signaled when polling (select, poll, epoll) on the eventfd is 
done

later on.
'data' is opaque data that are not interpreted by eventfd object.

I'm not fully clear on the purpose that the 'data' member serves.  
Does

this opaque handle need to be tied together with this event
synchronization construct?


It's a convenience thing. Imagine you are implementing your own file
descriptor type in user space. You create an EFD_MASK socket and a
structure that will hold any state that you need for the socket (tx/rx
buffers and such).

Now you have two things to pass around. If you want to pass the fd to 
a
function, it must have two parameters (fd and pointer to the 
structure).


To fix it you can put the fd into the structure. That way there's only
one thing to pass around (the structure).

The problem with that approach is when you have generic code that 
deals
with file descriptors. For example, a simple poller which accepts a 
list

of (fd, callback) pairs and invokes the callback when one of the fds
signals POLLIN. You can't send a pointer to a structure to such
function. All you can send is the fd, but then, when the callback is
invoked, fd is all you have. You have no idea where your state is.

'data' member allows you to put the pointer to the state to the socket
itself. Thus, if you have a fd, you can always find out where the
associated data is by reading the mask structure from the fd.



Ok, I see what you're saying. I guess that keeping track of the mapping
between the fd and the struct in user space could be non-trivial if
there are a large number of active fds that are polling very 
frequently.

Wouldn't it be sufficient to just use epoll() in this case though?  It
already seems to support this kind of thing.


My use case was like this:

int s = mysocket();
...
// myrecv() can get the pointer to the structure
// without user having to pass it as an argument
myrecv(s, buf, sizeof(buf));

However, same behaviour can be accomplished by simply keeping
a static array of pointers in the user space.

So let's cut this part out of the patch.





[snip]

@@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, 
__u64 n)

 {
+/* This function should never be used with eventfd in the mask
mode. */
+BUG_ON(ctx->flags & EFD_MASK);
+

...

@@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct
eventfd_ctx *ctx, wait_queue_t *wait,
 {
+/* This function should never be used with eventfd in the mask
mode. */
+BUG_ON(ctx->flags & EFD_MASK);
+

...
@@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx 
*ctx,

int no_wait, __u64 *cnt)
+/* This function should never be used with eventfd in the mask
mode. */
+BUG_ON(ctx->flags & EFD_MASK);
+


If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't
think that there will be a way to call these functions in the mask 
mode,

so it should be possible to get rid of the BUG_ON checks.


Sure. Feel free to do so.



[snip]

@@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file,
char __user *buf, size_t count,
 ssize_t res;
 __u64 cnt;

+if (ctx->flags & EFD_MASK) {
+struct efd_mask mask;
+
+if (count < sizeof(mask))
+return -EINVAL;
+spin_lock_irq(&ctx->wqh.lock);
+mask = ctx->mask;
+spin_unlock_irq(&ctx->wqh.lock);
+if (copy_to_user(buf, &mask, sizeof(mask)))
+return -EFAULT;
+return sizeof(mask);
+}
+


For the other eventfd modes, reading the value will update the 
internal

state of the eventfd (either clearing or decrementing the counter).
Should something similar be done here? I'm thinking of a case where a
process is polling on this fd in a loop. Clearing the efd_mask data  
on
read should provide an easy way for the polling process to know if it 
is

seeing new poll events.


No. In this case reading the value has no effect on the state of the 
fd.

How it should work is rather:

// fd is in POLLIN state
poll(fd);
// function exits with POLLIN but fd remains in POLLIN state
my_recv(fd, buf, size);
// my_recv function have found out that there'