[PATCH 1/1] eventfd: implementation of EFD_MASK flag
When implementing network protocols in user space, one has to implement fake user-space file descriptors to represent the sockets for the protocol. While all the BSD socket API functionality for such descriptors may be faked as well (myproto_send(), myproto_recv() etc.) this approach doesn't work for polling (select, poll, epoll). For polling, real system-level file descriptor is needed. In theory, eventfd may be used for this purpose, except that it is well suited only for signaling POLLIN. With some hacking it can be also used to signal POLLOUT and POLLERR, however: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitraty combination of POLL* flags. Most notably, !POLLIN & !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using current implementation of eventfd. This patch implements new EFD_MASK flag which attempts to solve this problem. Additionally, when implementing network protocols in user space, there's a need to associate user-space state with the each "socket". If eventfd object is used as a reference to the socket, it should be possible to associate an opaque pointer to user-space data with it. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; void *ptr; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr' is an opaque pointer that is not interpreted by eventfd object. read(2): User is allowed to read an efd_mask structure from the eventfd marked by EFD_MASK. Returned value shall be the last one written to the eventfd. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik --- fs/eventfd.c| 105 --- include/linux/eventfd.h |3 +- 2 files changed, 83 insertions(+), 25 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 35470d9..9fec49f 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,26 @@ #include #include +struct eventfd_mask { + short events; + void *ptr; +}; + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time that a write(2) is performed on an eventfd, the -* value of the __u64 being written is added to "count" and a -* wakeup is performed on "wqh". A read(2) will return the "count" -* value to userspace, and will reset "count" to zero. The kernel -* side eventfd_signal() also, adds to the "count" counter and -* issue a wakeup. -*/ - __u64 count; + union { + /* +* Every time that a write(2) is performed on an eventfd, the +* value of the __u64 being written is added to "count" and a +* wakeup is performed on "wqh". A read(2) will return the +* "count" value to userspace, and will reset "count" to zero. +* The kernel side eventfd_signal() also, adds to the "count" +* counter and issue a wakeup. +*/ + __u64 count; + struct eventfd_mask mask; + }; unsigned int flags; }; @@ -55,6 +64,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { unsigned long flags; + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + spin_lock_irqsave(&ctx->wqh.lock, flags); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; @@ -123,12 +135,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) poll_wait(file, &ctx->wqh, wait); spin_lock_irqsave(&ctx->wqh.lock, flags); - if (ctx->count > 0) - events |= POLLIN; - if (ctx->count == ULLONG_MAX) - events |= POLLERR; - if (ULLONG_MAX - 1 > ctx->count) - events |= POLLOUT; + if (ctx->flags & EFD_MASK) { + events = ctx->m
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 07/02/13 20:12, Andy Lutomirski wrote: On 02/06/2013 10:41 PM, Martin Sustrik wrote: When implementing network protocols in user space, one has to implement fake user-space file descriptors to represent the sockets for the protocol. While all the BSD socket API functionality for such descriptors may be faked as well (myproto_send(), myproto_recv() etc.) this approach doesn't work for polling (select, poll, epoll). For polling, real system-level file descriptor is needed. In theory, eventfd may be used for this purpose, except that it is well suited only for signaling POLLIN. With some hacking it can be also used to signal POLLOUT and POLLERR, however: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitraty combination of POLL* flags. Most notably, !POLLIN& !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using current implementation of eventfd. This patch implements new EFD_MASK flag which attempts to solve this problem. Additionally, when implementing network protocols in user space, there's a need to associate user-space state with the each "socket". If eventfd object is used as a reference to the socket, it should be possible to associate an opaque pointer to user-space data with it. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; void *ptr; }; IMO that should be u64 ptr to avoid compat problems. I was following the user space declaration of epoll_data: typedef union epoll_data { void*ptr; <- int fd; uint32_t u32; uint64_t u64; } epoll_data_t; However, now I'm looking at the kernel side definition of the whole union which looks like this (obviously it assumes that pointer is never longer than 64 bits): __u64 data; Hm, not very helpful. Anyway, I am not a kernel developer, so any concrete suggestion about what type to use to map cleanly to user-space void* is welcome. The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr' is an opaque pointer that is not interpreted by eventfd object. How does this interact with EPOLLET? That's an interesting question. The original eventfd code doesn't do anything specific to either edge or level mode. Neither does my patch. Inspection of the code seems to suggest that edge vs. level distinction is handled elsewhere (ep_send_events_proc) where there is a separate list of ready events and the function, after returning the event, decides whether to leave the event in the list (level) or delete it from the list (edge). In any case, review from someone with experience with epoll implementation would help. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] eventfd: implementation of EFD_MASK flag
When implementing network protocols in user space, one has to implement fake user-space file descriptors to represent the sockets for the protocol. While all the BSD socket API functionality for such descriptors may be faked as well (myproto_send(), myproto_recv() etc.) this approach doesn't work for polling (select, poll, epoll). And unfortunately, sockets that can't be polled on allow only for building the simplest possible applications. Basically, you can build a simple client, but once you want to implement a server handling many sockets in parallel, you are stuck. However, to do polling, real system-level file descriptor is needed, not a fake one. In theory, eventfd may be used for this purpose, except that it is well suited only for signaling POLLIN. With some hacking it can be also used to signal POLLOUT and POLLERR, but: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitraty combination of POLL* flags. Most notably, !POLLIN & !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using current implementation of eventfd. This patch implements new EFD_MASK flag which attempts to solve this problem. Additionally, when implementing network protocols in user space, there's a need to associate user-space state with the each "socket". If eventfd object is used as a reference to the socket, it should be possible to associate an opaque pointer to user-space data with it. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; void *ptr; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr' is an opaque pointer that is not interpreted by eventfd object. read(2): User is allowed to read an efd_mask structure from the eventfd marked by EFD_MASK. Returned value shall be the last one written to the eventfd. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik --- fs/eventfd.c| 105 --- include/linux/eventfd.h |3 +- 2 files changed, 83 insertions(+), 25 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 35470d9..9fec49f 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,26 @@ #include #include +struct eventfd_mask { + short events; + void *ptr; +}; + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time that a write(2) is performed on an eventfd, the -* value of the __u64 being written is added to "count" and a -* wakeup is performed on "wqh". A read(2) will return the "count" -* value to userspace, and will reset "count" to zero. The kernel -* side eventfd_signal() also, adds to the "count" counter and -* issue a wakeup. -*/ - __u64 count; + union { + /* +* Every time that a write(2) is performed on an eventfd, the +* value of the __u64 being written is added to "count" and a +* wakeup is performed on "wqh". A read(2) will return the +* "count" value to userspace, and will reset "count" to zero. +* The kernel side eventfd_signal() also, adds to the "count" +* counter and issue a wakeup. +*/ + __u64 count; + struct eventfd_mask mask; + }; unsigned int flags; }; @@ -55,6 +64,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { unsigned long flags; + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + spin_lock_irqsave(&ctx->wqh.lock, flags); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; @@ -123,12 +135,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) poll_wait(file, &ctx->wqh, wait); spin_lock_irqsave(&ctx->wqh.lock, flags); - if (ctx->count &g
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 07/02/13 23:44, Andrew Morton wrote: So please update the changelog and then cc net...@vger.kernel.org on the patch - the netdev people are probably best-situated to comment on the proposal. OK. Done. Thanks for the advice! Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
Hi Andy, On 08/02/13 02:03, Andy Lutomirski wrote: There may be some advantage to adding (later on, if needed) an option to change the flags set in: + if (waitqueue_active(&ctx->wqh)) + wake_up_locked_poll(&ctx->wqh, + (unsigned long)ctx->mask.events); (i.e. to allow the second parameter to omit some bits that were already signaled.) Allowing write to write a bigger struct in the future won't break anything. I think I don't follow. Either the second parameter is supposed to be *newly* signaled events, in which case the events that were already signaled in the past should be ommitted, or it is meant to be *all* signaled events, in which case the current implementation is OK. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 08/02/13 07:36, Andy Lutomirski wrote: On 08/02/13 02:03, Andy Lutomirski wrote: There may be some advantage to adding (later on, if needed) an option to change the flags set in: + if (waitqueue_active(&ctx->wqh)) + wake_up_locked_poll(&ctx->wqh, + (unsigned long)ctx->mask.events); (i.e. to allow the second parameter to omit some bits that were already signaled.) Allowing write to write a bigger struct in the future won't break anything. I think I don't follow. Either the second parameter is supposed to be *newly* signaled events, in which case the events that were already signaled in the past should be ommitted, or it is meant to be *all* signaled events, in which case the current implementation is OK. I defer to the experts here. But I suspect that if you want to perfectly emulate sockets, you may need to vary what you specify. (IIRC tcp sockets report an EPOLLIN edge every time data is received even if the receive buffer wasn't empty.) Hm. That sounds like leaking protocol implementation details to the user. That's a bad design IMO and should not be encouraged. Anyway, I have implemented your other suggestions. Btw, one thing I am not sure about is how to submit improved patches to the ML. Should I use the same patch name? Doesn't that cause confusion? Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
When implementing network protocols in user space, one has to implement fake user-space file descriptors to represent the sockets for the protocol. While all the BSD socket API functionality for such descriptors may be faked as well (myproto_send(), myproto_recv() etc.) this approach doesn't work for polling (select, poll, epoll). And unfortunately, sockets that can't be polled on allow only for building the simplest possible applications. Basically, you can build a simple client, but once you want to implement a server handling many sockets in parallel, you are stuck. However, to do polling, real system-level file descriptor is needed, not a fake one. In theory, eventfd may be used for this purpose, except that it is well suited only for signaling POLLIN. With some hacking it can be also used to signal POLLOUT and POLLERR, but: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitraty combination of POLL* flags. Most notably, !POLLIN & !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using current implementation of eventfd. This patch implements new EFD_MASK flag which attempts to solve this problem. Additionally, when implementing network protocols in user space, there's a need to associate user-space state with the each "socket". If eventfd object is used as a reference to the socket, it should be possible to associate an opaque pointer to user-space data with it. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; union { void *ptr; uint32_t u32; uint64_t u64; }; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. ptr, u32 and u63 are opaque data that are not interpreted by eventfd object. read(2): User is allowed to read an efd_mask structure from the eventfd marked by EFD_MASK. Returned value shall be the last one written to the eventfd. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik --- v2 - Concerns raised by Andy Lutomirski addressed: - uses fixed size (64 bits) opaque data in eventfd_mask instead of void* - write returns EINVAL if invalid POLL flags are specified --- fs/eventfd.c| 116 +-- include/linux/eventfd.h |3 +- 2 files changed, 94 insertions(+), 25 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 35470d9..bf83f6d87 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,35 @@ #include #include +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP) + +/* On x86-64 keep the same binary layout as on i386. */ +#ifdef __x86_64__ +#define EVENTFD_MASK_PACKED __packed +#else +#define EVENTFD_MASK_PACKED +#endif + +struct eventfd_mask { + __u32 events; + __u64 data; +} EVENTFD_MASK_PACKED; + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time that a write(2) is performed on an eventfd, the -* value of the __u64 being written is added to "count" and a -* wakeup is performed on "wqh". A read(2) will return the "count" -* value to userspace, and will reset "count" to zero. The kernel -* side eventfd_signal() also, adds to the "count" counter and -* issue a wakeup. -*/ - __u64 count; + union { + /* +* Every time that a write(2) is performed on an eventfd, the +* value of the __u64 being written is added to "count" and a +* wakeup is performed on "wqh". A read(2) will return the +* "count" value to userspace, and will reset "count" to zero. +* The kernel side eventfd_signal() also, adds to the "count" +* counter and issue a wakeup. +*/ + __u64 count; + struct eventfd_mask mask; + }; unsigned int flags; }; @@ -55,6 +73,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { unsigned long flags; + /* T
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 07/02/13 23:44, Andrew Morton wrote: That's a nice changelog but it omitted a critical thing: why do you think the kernel needs this feature? What's the value and use case for being able to poll these descriptors? To address the question, I've written down detailed description of the challenges of the network protocol development in user space and how the proposed feature addresses the problems. It's too long to fit into ChangeLog, but it may be worth reading when trying to judge the merit of the patch. It can be found here: http://www.250bpm.com/blog:16 Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
Hi Eric, On 08/02/13 23:21, Eric Wong wrote: Martin Sustrik wrote: On 07/02/13 23:44, Andrew Morton wrote: That's a nice changelog but it omitted a critical thing: why do you think the kernel needs this feature? What's the value and use case for being able to poll these descriptors? To address the question, I've written down detailed description of the challenges of the network protocol development in user space and how the proposed feature addresses the problems. It's too long to fit into ChangeLog, but it may be worth reading when trying to judge the merit of the patch. It can be found here: http://www.250bpm.com/blog:16 Using one eventfd per userspace socket still seems a bit wasteful. Wasteful in what sense? Occupying a slot in file descriptor table? That's the price for having the socket uniquely identified by the fd. Couldn't you use a single pipe for all sockets and write the efd_mask to the pipe for each socket? A read from the pipe would behave like epoll_wait. You might need to use one-shot semantics; but that's probably the easiest thing in multithreaded apps anyways. Having multiple sockets represented by a single eventfd. how would you distinguish where did individual events came from? struct pollfd pfd; ... poll (pfd, 1, -1); if (pfd.revents & POLLIN) /* Incoming data on which socket? */ ... Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 08/02/13 23:08, Eric Wong wrote: poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr' is an opaque pointer that is not interpreted by eventfd object. How does this interact with EPOLLET? That's an interesting question. The original eventfd code doesn't do anything specific to either edge or level mode. Neither does my patch. Inspection of the code seems to suggest that edge vs. level distinction is handled elsewhere (ep_send_events_proc) where there is a separate list of ready events and the function, after returning the event, decides whether to leave the event in the list (level) or delete it from the list (edge). Right, the edge vs. level distinction is internal to epoll. I wrote a test program for EFD_MASK+EPOLLET and it seems to behave in intuitive kind of way: int main () { int fd; struct efd_mask mask; ssize_t nbytes; int rc; int ep; struct epoll_event epe; fd = eventfd (0, EFD_MASK); ep = epoll_create (10); assert (ep != -1); epe.events = EPOLLIN | EPOLLET; rc = epoll_ctl (ep, EPOLL_CTL_ADD, fd, &epe); assert (rc != -1); mask.events = 0; nbytes = write (fd, &mask, sizeof (mask)); assert (nbytes == sizeof (mask)); rc = epoll_wait (ep, &epe, 1, 100); assert (rc == 0); mask.events = POLLIN; nbytes = write (fd, &mask, sizeof (mask)); assert (nbytes == sizeof (mask)); rc = epoll_wait (ep, &epe, 1, 100); assert (rc == 1 && epe.events == EPOLLIN); rc = epoll_wait (ep, &epe, 1, 100); assert (rc == 0); mask.events = POLLIN; nbytes = write (fd, &mask, sizeof (mask)); mask.events = 0; nbytes = write (fd, &mask, sizeof (mask)); rc = epoll_wait (ep, &epe, 1, 100); assert (rc == 0); rc = close (ep); assert (rc == 0); rc = close (fd); assert (rc == 0); return 0; } Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 09/02/13 04:54, Eric Wong wrote: Using one eventfd per userspace socket still seems a bit wasteful. Wasteful in what sense? Occupying a slot in file descriptor table? That's the price for having the socket uniquely identified by the fd. Yes. I realize eventfd is small, but I don't think eventfd is needed at all, here. Just one pipe. Ah. Got you! You mean not to change the kernel, just use pipe for the purpose. However, the convoluted pipe-style design is the problem I am trying to solve rather than the solution. It leads to convoluted APIs with convoluted semantics as described in the article. I've been using that kind of design for past 8 years and every time I have to deal with it I swear that one day I will implement a proper in-kernel solution to get rid of the hack. And now I have finally done so. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] eventfd: implementation of EFD_MASK flag
On 2013-02-09 12:51, Eric Wong wrote: Yes, your eventfd change is probably the best way if you want/need to only watch a subset of your sockets, especially if you want poll/select to be an option. Yes, the poll/select thing is the important point. I wouldn't care if the only problem was that I, as the protocol implementer, would have to implement some kind of workaround in my protocol library. The problem is that these convoluted semantics leak -- through the use of poll, select et al. -- to the end user. From my personal experience I can say that end users have pretty hard time using such complex workarounds instead of simply using a native file descriptor with standardised semantics. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/1] eventfd: fix incorrect filename is a comment
Comment in eventfd.h referred to 'include/asm-generic/fcntl.h' while the correct path is 'include/uapi/asm-generic/fcntl.h'. Signed-off-by: Martin Sustrik --- include/linux/eventfd.h |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index 3c3ef19..cf5d2af 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -13,7 +13,7 @@ #include /* - * CAREFUL: Check include/asm-generic/fcntl.h when defining + * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining * new flags, since they might collide with O_* ones. We want * to re-use O_* flags that couldn't possibly have a meaning * from eventfd, in order to leave a free define-space for -- 1.7.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
Hi Andrew, Thanks for the detailed code review! I'll have a look at all the problems you've pointed out, however, one quick question: - ret = seq_printf(m, "eventfd-count: %16llx\n", -(unsigned long long)ctx->count); + if (ctx->flags& EFD_MASK) { + ret = seq_printf(m, "eventfd-mask: %x\n", +(unsigned)ctx->mask.events); + } else { + ret = seq_printf(m, "eventfd-count: %16llx\n", +(unsigned long long)ctx->count); + } spin_unlock_irq(&ctx->wqh.lock); This is a non-back-compatible userspace interface change. A procfs file which previously displayed eventfd-count: can now also display eventfd-mask: So existing userspace could misbehave. Please fully describe the proposed interface change in the changelog. That description should include the full pathname of the procfs file and example before-and-after output and a discussion of whether and why the risk to existing userspace is acceptable. I am not sure what the policy is here. Is not printing out the state of the object acceptable way to maintain backward compatibility? If not so, does new type of object require new procfs file, which, AFAIU, is the only way to retain full backward compatibility? Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
On 15/02/13 18:32, Andy Lutomirski wrote: On Thu, Feb 14, 2013 at 9:24 PM, Andrew Morton wrote: On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik wrote: This is a non-back-compatible userspace interface change. A procfs file which previously displayed eventfd-count: can now also display eventfd-mask: So existing userspace could misbehave. Please fully describe the proposed interface change in the changelog. That description should include the full pathname of the procfs file and example before-and-after output and a discussion of whether and why the risk to existing userspace is acceptable. I am not sure what the policy is here. Is not printing out the state of the object acceptable way to maintain backward compatibility? If not so, does new type of object require new procfs file, which, AFAIU, is the only way to retain full backward compatibility? Adding a new file is the only way I can think of to preserve the API. But from Andy's comment is sounds like we don't have to worry a lot about back-compatibility. I'm not even convinced there's an issue in the first place (other than the fact that use of this feature will break old criu, regardless of /proc changes). The fdinfo files already vary by descriptor type. Anything that screws up if unexpected fields are present is already screwed. Ok then. I'll leave the relevant code as is. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
On 14/02/13 23:54, Andrew Morton wrote: +/* On x86-64 keep the same binary layout as on i386. */ +#ifdef __x86_64__ +#define EVENTFD_MASK_PACKED __packed +#else +#define EVENTFD_MASK_PACKED +#endif + +struct eventfd_mask { + __u32 events; + __u64 data; +} EVENTFD_MASK_PACKED; The x86-64 specific thing is ugly. I can find no explanation of why it was done, but it should go away. You could make `events' a u64, or swap the order of the two fields and make the struct __packed on all architectures. Given that the size of the types is fixed, I see no compat issues here. I've just copied how the definition is done for epoll_event. The comment there goes like this: /* * On x86-64 make the 64bit structure have the same alignment as the * 32bit structure. This makes 32bit emulation easier. * * UML/x86_64 needs the same packing as x86_64 */ If you still think I should remove the #ifdef, I am happy to do so. Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v3 1/1] eventfd: implementation of EFD_MASK flag
When implementing network protocols in user space, one has to implement fake file descriptors to represent the sockets for the protocol. Polling on such fake file descriptors is a problem (poll/select/epoll accept only true file descriptors) and forces protocol implementers to use various workarounds resulting in complex, non-standard and convoluted APIs. More generally, ability to create full-blown file descriptors for userspace-to-userspace signalling is missing. While eventfd(2) goes half the way towards this goal it has follwoing shorcomings: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitrary combination of POLL* flags. Most notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using eventfd. This patch implements new EFD_MASK flag which solves the above problems. Additionally, to provide a way to associate user-space state with eventfd object, it allows to attach user-space data to the file descriptor. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; union { void*ptr; uint32_t u32; uint64_t u64; }; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr', 'u32' or 'u64' are opaque data that are not interpreted by eventfd object. read(2): User is allowed to read an efd_mask structure from the eventfd marked by EFD_MASK. Returned value shall be the last one written to the eventfd. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik --- Following changes were made to the patch since v2: - eventfd_mask structure renamed to efd_mask to keep user-space prefixes consistent - efd_mask is __packed for all architectures - eventfd.h header file added to uapi; given there wasn't one so far, in addition to moving efd_mask there, I've moved also all the eventfd flags that are meant to be visible from the user space - comment for 'mask' member eventfd_ctx added - synchronisation bugs in eventfd_read fixed - several variable declarations moved from the beginning of the function to the blocks where they are used - changelog text made simpler and more up to the point There was also a request to explain why this functionality is needed. I've written an article explaining the problems user-space network protocol implementers face here: http://www.250bpm.com/blog:16 --- fs/eventfd.c | 194 -- include/linux/eventfd.h | 17 +--- include/uapi/linux/eventfd.h | 40 + 3 files changed, 172 insertions(+), 79 deletions(-) create mode 100644 include/uapi/linux/eventfd.h diff --git a/fs/eventfd.c b/fs/eventfd.c index 35470d9..8f821b1 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,31 @@ #include #include +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK) +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP) + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time that a write(2) is performed on an eventfd, the -* value of the __u64 being written is added to "count" and a -* wakeup is performed on "wqh". A read(2) will return the "count" -* value to userspace, and will reset "count" to zero. The kernel -* side eventfd_signal() also, adds to the "count" counter and -* issue a wakeup. -*/ - __u64 count; + union { + /* +* Every time that a write(2) is performed on an eventfd, the +* value of the __u64 being written is added to "count" and a +* wakeup is performed on "wqh". A read(2) will return the +* "count" value to userspace, and will reset "count" to zero. +* The kernel side eventfd_signal() also, adds to the "count" +* counter and issue a wakeup. +
Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
On 14/02/13 23:54, Andrew Morton wrote: This patch adds userspace interfaces which will require manpage updates. Please Cc Michael and work with him on getting those changes completed. Right. It adds the efd_mask structure. As far as I understand how it works is that the actual user-space definition of the structure should be provided by glibc (sys/eventfd.h), right? If so, should the man page be updated now? If so, it may result in the situation where the man page describes a structure that's not available in the user space yet. Is there a formal process to do this kind of kernel+glibc+docs changes? Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 1/1] eventfd: implementation of EFD_MASK flag
Hi Damian, Yes, this patch would be geneally useful for implementing stuff in user space that otherwise would have to live in kernelspace. Unfortunately, I have no cycles left to pursue getting it to the mainline. If you feel like you can take care of it, that would be great. I can help with the documentation, if needed. Martin On 2015-07-09 09:57, Damian Hobson-Garcia wrote: Hello Martin and all, I recently came across this (quite old by now) patch submission for an extension to the functionality of eventfd and I noticed that the discussion seems to have fizzled out. Is this functionality still of use for user space network protocols? It seems like it would be usable for other notification use cases as well. In particular I'm thinking of poll/select/epoll support for a user space video codec via libv4l. Is there value in re-examining this patch? Thank you, Damian On 2013-02-18 8:34 PM, Martin Sustrik wrote: When implementing network protocols in user space, one has to implement fake file descriptors to represent the sockets for the protocol. Polling on such fake file descriptors is a problem (poll/select/epoll accept only true file descriptors) and forces protocol implementers to use various workarounds resulting in complex, non-standard and convoluted APIs. More generally, ability to create full-blown file descriptors for userspace-to-userspace signalling is missing. While eventfd(2) goes half the way towards this goal it has follwoing shorcomings: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitrary combination of POLL* flags. Most notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using eventfd. This patch implements new EFD_MASK flag which solves the above problems. Additionally, to provide a way to associate user-space state with eventfd object, it allows to attach user-space data to the file descriptor. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { short events; union { void*ptr; uint32_t u32; uint64_t u64; }; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'ptr', 'u32' or 'u64' are opaque data that are not interpreted by eventfd object. read(2): User is allowed to read an efd_mask structure from the eventfd marked by EFD_MASK. Returned value shall be the last one written to the eventfd. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik --- Following changes were made to the patch since v2: - eventfd_mask structure renamed to efd_mask to keep user-space prefixes consistent - efd_mask is __packed for all architectures - eventfd.h header file added to uapi; given there wasn't one so far, in addition to moving efd_mask there, I've moved also all the eventfd flags that are meant to be visible from the user space - comment for 'mask' member eventfd_ctx added - synchronisation bugs in eventfd_read fixed - several variable declarations moved from the beginning of the function to the blocks where they are used - changelog text made simpler and more up to the point There was also a request to explain why this functionality is needed. I've written an article explaining the problems user-space network protocol implementers face here: http://www.250bpm.com/blog:16 --- fs/eventfd.c | 194 -- include/linux/eventfd.h | 17 +--- include/uapi/linux/eventfd.h | 40 + 3 files changed, 172 insertions(+), 79 deletions(-) create mode 100644 include/uapi/linux/eventfd.h diff --git a/fs/eventfd.c b/fs/eventfd.c index 35470d9..8f821b1 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,31 @@ #include #include +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK) +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP) + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time
Re: [PATCH v3 1/1] eventfd: implementation of EFD_MASK flag
On 2015-07-09 11:06, Damian Hobson-Garcia wrote: Hi Martin, On 2015-07-09 5:41 PM, Martin Sustrik wrote: Hi Damian, Yes, this patch would be geneally useful for implementing stuff in user space that otherwise would have to live in kernelspace. Unfortunately, I have no cycles left to pursue getting it to the mainline. If you feel like you can take care of it, that would be great. I'll see what I can do. I'll rebase it and resubmit it. Unless there are major changes (which I highly doubt), I'll keep you as the author, if that's ok with you. Sure. No problem. I can help with the documentation, if needed. Thank you, that would be very much appreciated. Ok, ping me when it's needed. Thanks! Martin -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] eventfd: implementation of EFD_MASK flag
On 2015-08-10 08:23, Damian Hobson-Garcia wrote: Replying to my own post, but I had the following comments/questions. Martin, if you have any response to my comments I would be very happy to hear them. On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: From: Martin Sustrik [snip] write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { __u32 events; __u64 data; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'data' is opaque data that are not interpreted by eventfd object. I'm not fully clear on the purpose that the 'data' member serves. Does this opaque handle need to be tied together with this event synchronization construct? It's a convenience thing. Imagine you are implementing your own file descriptor type in user space. You create an EFD_MASK socket and a structure that will hold any state that you need for the socket (tx/rx buffers and such). Now you have two things to pass around. If you want to pass the fd to a function, it must have two parameters (fd and pointer to the structure). To fix it you can put the fd into the structure. That way there's only one thing to pass around (the structure). The problem with that approach is when you have generic code that deals with file descriptors. For example, a simple poller which accepts a list of (fd, callback) pairs and invokes the callback when one of the fds signals POLLIN. You can't send a pointer to a structure to such function. All you can send is the fd, but then, when the callback is invoked, fd is all you have. You have no idea where your state is. 'data' member allows you to put the pointer to the state to the socket itself. Thus, if you have a fd, you can always find out where the associated data is by reading the mask structure from the fd. [snip] @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + ... @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, { + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + ... @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't think that there will be a way to call these functions in the mask mode, so it should be possible to get rid of the BUG_ON checks. Sure. Feel free to do so. [snip] @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, ssize_t res; __u64 cnt; + if (ctx->flags & EFD_MASK) { + struct efd_mask mask; + + if (count < sizeof(mask)) + return -EINVAL; + spin_lock_irq(&ctx->wqh.lock); + mask = ctx->mask; + spin_unlock_irq(&ctx->wqh.lock); + if (copy_to_user(buf, &mask, sizeof(mask))) + return -EFAULT; + return sizeof(mask); + } + For the other eventfd modes, reading the value will update the internal state of the eventfd (either clearing or decrementing the counter). Should something similar be done here? I'm thinking of a case where a process is polling on this fd in a loop. Clearing the efd_mask data on read should provide an easy way for the polling process to know if it is seeing new poll events. No. In this case reading the value has no effect on the state of the fd. How it should work is rather: // fd is in POLLIN state poll(fd); // function exits with POLLIN but fd remains in POLLIN state my_recv(fd, buf, size); // my_recv function have found out that there's no more data to recv and switched off the POLLIN flag poll(fd); // we block here waiting for more data to arrive from the network @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f) struct eventfd_ctx *ctx = f->private_data; spin_lock_irq(&ctx->wqh.lock); - seq_printf(m, "eventfd-count: %16llx\n", - (unsigned long long)ctx->count); + if (ctx->flags & EFD_MASK) { + seq_printf(m, "eventfd-mask: %x\n", +(unsigned)ctx->mask.events); + } else { + seq_printf(m, "eventfd-count: %16llx\n", +
Re: [PATCH v2 1/1] eventfd: implementation of EFD_MASK flag
On 2015-09-16 08:27, Damian Hobson-Garcia wrote: From: Martin Sustrik When implementing network protocols in user space, one has to implement fake file descriptors to represent the sockets for the protocol. Polling on such fake file descriptors is a problem (poll/select/epoll accept only true file descriptors) and forces protocol implementers to use various workarounds resulting in complex, non-standard and convoluted APIs. More generally, ability to create full-blown file descriptors for userspace-to-userspace signalling is missing. While eventfd(2) goes half the way towards this goal it has follwoing shorcomings: I. There's no way to signal POLLPRI, POLLHUP etc. II. There's no way to signal arbitrary combination of POLL* flags. Most notably, simultaneous !POLLIN and !POLLOUT, which is a perfectly valid combination for a network protocol (rx buffer is empty and tx buffer is full), cannot be signaled using eventfd. This patch implements new EFD_MASK flag which solves the above problems. Additionally, to provide a way to associate user-space state with eventfd object, it allows to attach user-space data to the file descriptor. The above paragraph is a leftover from the past. The functionality no longer exist. The semantics of EFD_MASK are as follows: eventfd(2): If eventfd is created with EFD_MASK flag set, it is initialised in such a way as to signal no events on the file descriptor when it is polled on. The 'initval' argument is ignored. write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { uint32_t events; }; Is it worth having a struct here? Why not just uint32_t? Martin The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. read(2): read is not supported and will fail with EINVAL. select(2), poll(2) and similar: When polling on the eventfd marked by EFD_MASK flag, all the events specified in last written 'events' field shall be signaled. Signed-off-by: Martin Sustrik [dhobs...@igel.co.jp: Rebased, and resubmitted for Linux 4.3] Signed-off-by: Damian Hobson-Garcia --- fs/eventfd.c | 102 ++- include/linux/eventfd.h | 16 +-- include/uapi/linux/eventfd.h | 39 + 3 files changed, 132 insertions(+), 25 deletions(-) create mode 100644 include/uapi/linux/eventfd.h diff --git a/fs/eventfd.c b/fs/eventfd.c index 8d0c0df..1a6a066 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi + * Copyright (C) 2013 Martin Sustrik * */ @@ -22,18 +23,31 @@ #include #include +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK) +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP) + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* -* Every time that a write(2) is performed on an eventfd, the -* value of the __u64 being written is added to "count" and a -* wakeup is performed on "wqh". A read(2) will return the "count" -* value to userspace, and will reset "count" to zero. The kernel -* side eventfd_signal() also, adds to the "count" counter and -* issue a wakeup. -*/ - __u64 count; + union { + /* +* Every time that a write(2) is performed on an eventfd, the +* value of the __u64 being written is added to "count" and a +* wakeup is performed on "wqh". A read(2) will return the +* "count" value to userspace, and will reset "count" to zero. +* The kernel side eventfd_signal() also, adds to the "count" +* counter and issue a wakeup. +*/ + __u64 count; + + /* +* When using eventfd in EFD_MASK mode this stracture stores the +* current events to be signaled on the eventfd (events member) +* along with opaque user-defined data (data member). +*/ + struct efd_mask mask; + }; unsigned int flags; }; @@ -134,6 +148,14 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) return events; } +static unsigned int eventfd_mask_poll(struct file *file, poll_table *wait) +{ + struct eventfd_ctx *ctx = file->private_data; + + poll_wait(file, &ctx->wqh, wait); + return ctx->mask.events; +} + static void eventfd_ctx_do_read(stru
Re: [PATCH] eventfd: implementation of EFD_MASK flag
On 2015-08-10 10:57, Damian Hobson-Garcia wrote: Hi Martin, Thanks for your comments. On 2015-08-10 3:39 PM, Martin Sustrik wrote: On 2015-08-10 08:23, Damian Hobson-Garcia wrote: Replying to my own post, but I had the following comments/questions. Martin, if you have any response to my comments I would be very happy to hear them. On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: From: Martin Sustrik [snip] write(2): User is allowed to write only buffers containing the following structure: struct efd_mask { __u32 events; __u64 data; }; The value of 'events' should be any combination of event flags as defined by poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will be signaled when polling (select, poll, epoll) on the eventfd is done later on. 'data' is opaque data that are not interpreted by eventfd object. I'm not fully clear on the purpose that the 'data' member serves. Does this opaque handle need to be tied together with this event synchronization construct? It's a convenience thing. Imagine you are implementing your own file descriptor type in user space. You create an EFD_MASK socket and a structure that will hold any state that you need for the socket (tx/rx buffers and such). Now you have two things to pass around. If you want to pass the fd to a function, it must have two parameters (fd and pointer to the structure). To fix it you can put the fd into the structure. That way there's only one thing to pass around (the structure). The problem with that approach is when you have generic code that deals with file descriptors. For example, a simple poller which accepts a list of (fd, callback) pairs and invokes the callback when one of the fds signals POLLIN. You can't send a pointer to a structure to such function. All you can send is the fd, but then, when the callback is invoked, fd is all you have. You have no idea where your state is. 'data' member allows you to put the pointer to the state to the socket itself. Thus, if you have a fd, you can always find out where the associated data is by reading the mask structure from the fd. Ok, I see what you're saying. I guess that keeping track of the mapping between the fd and the struct in user space could be non-trivial if there are a large number of active fds that are polling very frequently. Wouldn't it be sufficient to just use epoll() in this case though? It already seems to support this kind of thing. My use case was like this: int s = mysocket(); ... // myrecv() can get the pointer to the structure // without user having to pass it as an argument myrecv(s, buf, sizeof(buf)); However, same behaviour can be accomplished by simply keeping a static array of pointers in the user space. So let's cut this part out of the patch. [snip] @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { +/* This function should never be used with eventfd in the mask mode. */ +BUG_ON(ctx->flags & EFD_MASK); + ... @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, { +/* This function should never be used with eventfd in the mask mode. */ +BUG_ON(ctx->flags & EFD_MASK); + ... @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) +/* This function should never be used with eventfd in the mask mode. */ +BUG_ON(ctx->flags & EFD_MASK); + If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't think that there will be a way to call these functions in the mask mode, so it should be possible to get rid of the BUG_ON checks. Sure. Feel free to do so. [snip] @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, ssize_t res; __u64 cnt; +if (ctx->flags & EFD_MASK) { +struct efd_mask mask; + +if (count < sizeof(mask)) +return -EINVAL; +spin_lock_irq(&ctx->wqh.lock); +mask = ctx->mask; +spin_unlock_irq(&ctx->wqh.lock); +if (copy_to_user(buf, &mask, sizeof(mask))) +return -EFAULT; +return sizeof(mask); +} + For the other eventfd modes, reading the value will update the internal state of the eventfd (either clearing or decrementing the counter). Should something similar be done here? I'm thinking of a case where a process is polling on this fd in a loop. Clearing the efd_mask data on read should provide an easy way for the polling process to know if it is seeing new poll events. No. In this case reading the value has no effect on the state of the fd. How it should work is rather: // fd is in POLLIN state poll(fd); // function exits with POLLIN but fd remains in POLLIN state my_recv(fd, buf, size); // my_recv function have found out that there'