ChangeLog:
v3) Removed the "revents" field from the epoll item structure, as
suggested by Eric Dumazet
v2) In v1, I was trying to avoid to get the spinlock twice WRT yesterday
patch, but it turns out I can't since the ready list will be travelling
through a path w/out the ep->sem held. Oh well...
Epoll is doing multiple passes over the ready set at the moment, because
of the constraints over the f_op->poll() call. Looking at the code again,
I noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).
This is a stress application that can be used to test the new code. It
spwans multiple thread and call epoll_wait() and epoll_ctl() from many
threads. Stress tested on my dual Opteron 254 w/out any problems.
http://www.xmailserver.org/totalmess.c
This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:
http://www.xmailserver.org/epwbench.c
[1] Considering that epoll must be thread-safe, there are five ways we can
be hit during an epoll_wait() transfer loop (ep_send_events()):
1) The epoll fd going away and calling ep_free
This just can't happen, since we did an fget() in sys_epoll_wait
2) An epoll_ctl(EPOLL_CTL_DEL)
This can't happen because epoll_ctl() gets ep->sem in write, and
we're holding it in read during ep_send_events()
3) An fd stored inside the epoll fd going away
This can't happen because in eventpoll_release_file() we get
ep->sem in write, and we're holding it in read during
ep_send_events()
4) Another epoll_wait() happening on another thread
They both can be inside ep_send_events() at the same time, we get
(splice) the ready-list under the spinlock, so each one will get
its own ready list. Note that an fd cannot be at the same time
inside more than one ready list, because ep_poll_callback() will
not re-queue it if it sees it already linked:
if (ep_is_linked(&epi->rdllink))
goto is_linked;
Another case that can happen, is two concurrent epoll_wait(),
coming in with a userspace event buffer of size, say, ten.
Suppose there are 50 event ready in the list. The first
epoll_wait() will "steal" the whole list, while the second, seeing
no events, will go to sleep. But at the end of ep_send_events() in
the first epoll_wait(), we will re-inject surplus ready fds, and we
will trigger the proper wake_up to the second epoll_wait().
5) ep_poll_callback() hitting us asyncronously
This is the tricky part. As I said above, the ep_is_linked() test
done inside ep_poll_callback(), will guarantee us that until the
item will result linked to a list, ep_poll_callback() will not try
to re-queue it again (read, write data on any of its members). When
we do a list_del() in ep_send_events(), the item will still satisfy
the ep_is_linked() test (whatever data is written in prev/next,
it'll never be its own pointer), so ep_poll_callback() will still
leave us alone. It's only after the eventual
smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
that it'll become visible to ep_poll_callback(), but at the point
we're already past it.
Signed-off-by: Davide Libenzi
- Davide
eventpoll.c | 230 ++--
1 file changed, 85 insertions(+), 145 deletions(-)
diff -Nru linux-2.6.20/fs/eventpoll.c linux-2.6.20.mod/fs/eventpoll.c
--- linux-2.6.20/fs/eventpoll.c 2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20.mod/fs/eventpoll.c 2007-02-28 10:30:49.0 -0800
@@ -185,7 +185,7 @@
/*
* Each file descriptor added to the eventpoll interface will
- * have an entry of this type linked to the hash.
+ * have an entry of this type linked to the "rbr" RB tree.
*/
struct epitem {
/* RB-Tree node used to link this structure to the eventpoll rb-tree */
@@ -217,15 +217,6 @@
/* List header used to link this item to the "struct file" items list */
struct list_head fllink;
-
- /* List header used to link the item to the transfer list */
- struct list_head txlink;
-
- /*
-* This is used during the collection/transfer of events to userspace
-* to pin items empty events set.
-*/
- unsigned int revents;
};
/* Wrapper struct used by poll queueing */
@@ -258,11 +249,8 @@
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void
*key);
static int ep_eventpoll_close(struct inode *inode, struct file *file);
static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait);
-static int ep_collect_ready_items(struct eventpoll *ep,
-