Re: [patch - v3] epoll ready set loops diet ...

2007-03-01 Thread Ingo Molnar

* Davide Libenzi  wrote:

> > I was wrong about the size of epitem : it is now 68 bytes instead of 
> > 72. At least we now use/dirty one cache line instead of two per 
> > epitem.
> > 
> > Do you have another brilliant idea to shrink 4 more bytes ? :)
> 
> I don't think we can cleanly shove more stuff out of it.

then i'd suggest not aligning the events to 64 bytes, especially since the 
events are handled by the same CPU in one go. 68 bytes also 
automatically colors the fields in the L1 cache (which might matter).
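
A quick userspace sketch of both points (illustration only, not kernel code: 
the struct names are made up, the 68 bytes just stands in for 
sizeof(struct epitem) on this config, and the alignas() variant shows what 
forcing 64-byte alignment would cost):

	/* illustration only: made-up struct names, 68 stands in for sizeof(struct epitem) */
	#include <stdio.h>
	#include <stdalign.h>

	struct item_packed  { char payload[68]; };              /* the 68-byte item */
	struct item_aligned { alignas(64) char payload[68]; };  /* padded out to 128 */

	int main(void)
	{
		printf("packed item size:  %zu bytes\n", sizeof(struct item_packed));   /* 68 */
		printf("aligned item size: %zu bytes\n", sizeof(struct item_aligned));  /* 128 */

		/* consecutive 68-byte items start at rotating offsets within a 64-byte line */
		for (int i = 0; i < 8; i++)
			printf("item %d starts at line offset %zu\n",
			       i, sizeof(struct item_packed) * i % 64);
		return 0;
	}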

Ingo


Re: [patch - v3] epoll ready set loops diet ...

2007-02-28 Thread Davide Libenzi
On Wed, 28 Feb 2007, Eric Dumazet wrote:

> On Wednesday 28 February 2007 19:37, Davide Libenzi wrote:
> 
> > +	list_del(&epi->rdllink);
> > +	if (!(epi->event.events & EPOLLET) && (revents & epi->event.events))
> > +		list_add_tail(&epi->rdllink, &injlist);
> > +	else {
> 
> Is the ( ... & epi->event.events) really necessary? (It seems to be done already)

Yes, look here:

	if (epi->event.events & EPOLLONESHOT)
		epi->event.events &= EP_PRIVATE_BITS;

After this, epi->event.events has no event bits left, so the 
(revents & epi->event.events) test is what keeps the item out of the 
injected list. Oneshot events should not be requeued.
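
For reference, the userspace side of EPOLLONESHOT looks like this (a minimal 
sketch, not part of the patch, with names chosen only for illustration): once 
the event has been reported, the fd stays disarmed until it is explicitly 
re-armed with EPOLL_CTL_MOD.

	#include <sys/epoll.h>

	/* illustration only: 'epfd' and 'fd' are assumed to already exist; error handling omitted */
	static void rearm_oneshot(int epfd, int fd)
	{
		struct epoll_event ev = {
			.events = EPOLLIN | EPOLLONESHOT,
			.data.fd = fd,
		};

		/* MOD, not ADD: the item is still registered, only disarmed */
		epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
	}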


> I was wrong about the size of epitem : it is now 68 bytes instead of 72.
> At least we now use/dirty one cache line instead of two per epitem.
> 
> Do you have another brilliant idea to shrink 4 more bytes ? :)

I don't think we can cleanly shove more stuff out of it.



- Davide




Re: [patch - v3] epoll ready set loops diet ...

2007-02-28 Thread Eric Dumazet
On Wednesday 28 February 2007 19:37, Davide Libenzi wrote:

> +	list_del(&epi->rdllink);
> +	if (!(epi->event.events & EPOLLET) && (revents & epi->event.events))
> +		list_add_tail(&epi->rdllink, &injlist);
> +	else {

Is the ( ... & epi->event.events) really necessary? (It seems to be done already)

I was wrong about the size of epitem : it is now 68 bytes instead of 72.
At least we now use/dirty one cache line instead of two per epitem.

Do you have another brilliant idea to shrink 4 more bytes ? :)

It seems to me 'nwait' is only used at init time (so that 
ep_ptable_queue_proc() can signal that an error occurred).

Maybe another mechanism could let us delete nwait from epitem?

We could use a field in task_struct, for example (see the usage of 
total_link_count).
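
Something along these lines, maybe (a purely illustrative userspace model with 
made-up names, not a patch): the callback reports the failure through a 
per-thread variable which the caller checks afterwards, much like fs/namei.c 
does with current->total_link_count.

	/* userspace model only: queue_err, queue_proc and do_insert are made-up names */
	#include <stdio.h>

	static __thread int queue_err;		/* would be a field in task_struct */

	/* stands in for ep_ptable_queue_proc(): report failure via the flag */
	static void queue_proc(int fail_alloc)
	{
		if (fail_alloc)
			queue_err = 1;
	}

	/* stands in for ep_insert(): clear the flag, run the callback, check it */
	static int do_insert(int fail_alloc)
	{
		queue_err = 0;
		queue_proc(fail_alloc);		/* reached via f_op->poll() in the real code */
		return queue_err ? -1 : 0;
	}

	int main(void)
	{
		printf("insert: %d\n", do_insert(0));	/* 0  -> success */
		printf("insert: %d\n", do_insert(1));	/* -1 -> allocation failure propagated */
		return 0;
	}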

Thank you


[patch - v3] epoll ready set loops diet ...

2007-02-28 Thread Davide Libenzi

ChangeLog:

v3) Removed the "revents" field from the epoll item structure, as 
suggested by Eric Dumazet

v2) In v1 I was trying to avoid getting the spinlock twice WRT yesterday's 
patch, but it turns out I can't, since the ready list will be travelling 
through a path w/out the ep->sem held. Oh well...


Epoll is doing multiple passes over the ready set at the moment, because 
of the constraints on the f_op->poll() call. Looking at the code again, 
I noticed that we already hold the epoll semaphore in read, and this 
(together with other locking conditions that hold while doing an 
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace 
(in a single pass).

This is a stress application that can be used to test the new code. It 
spawns multiple threads and calls epoll_wait() and epoll_ctl() from many 
of them. Stress tested on my dual Opteron 254 w/out any problems.

http://www.xmailserver.org/totalmess.c

This is not a benchmark, just something that tries to stress and exploit 
possible problems with the new code.
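
Not totalmess.c, but a trimmed-down sketch of the same idea (made-up structure 
and names, error handling omitted): a few threads spin in epoll_wait() while 
others keep adding, readying and removing fds, so the transfer loop races with 
epoll_ctl() and with the wakeup callbacks.

	/* sketch only, not totalmess.c; error handling omitted */
	#include <pthread.h>
	#include <sys/epoll.h>
	#include <unistd.h>

	static int epfd;

	/* spin collecting events, racing with the churn below */
	static void *waiter(void *arg)
	{
		struct epoll_event evs[16];

		for (;;)
			epoll_wait(epfd, evs, 16, 100);
		return NULL;
	}

	/* keep adding pipes, making them ready, and tearing them down */
	static void *churner(void *arg)
	{
		struct epoll_event ev;
		int fds[2];

		for (;;) {
			if (pipe(fds))
				continue;
			ev.events = EPOLLIN;
			ev.data.fd = fds[0];
			epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev);
			write(fds[1], "x", 1);	/* fires ep_poll_callback() */
			epoll_ctl(epfd, EPOLL_CTL_DEL, fds[0], NULL);
			close(fds[0]);
			close(fds[1]);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[8];

		epfd = epoll_create(64);
		for (int i = 0; i < 4; i++)
			pthread_create(&t[i], NULL, waiter, NULL);
		for (int i = 4; i < 8; i++)
			pthread_create(&t[i], NULL, churner, NULL);
		pause();	/* run until interrupted */
		return 0;
	}
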
Also, I made a stupid micro-benchmark:

http://www.xmailserver.org/epwbench.c



[1] Considering that epoll must be thread-safe, there are five ways we can 
be hit during an epoll_wait() transfer loop (ep_send_events()):

1) The epoll fd going away and calling ep_free
   This just can't happen, since we did an fget() in sys_epoll_wait

2) An epoll_ctl(EPOLL_CTL_DEL)
   This can't happen because epoll_ctl() gets ep->sem in write, and 
   we're holding it in read during ep_send_events()

3) An fd stored inside the epoll fd going away
   This can't happen because in eventpoll_release_file() we get 
   ep->sem in write, and we're holding it in read during 
   ep_send_events()

4) Another epoll_wait() happening on another thread
   They can both be inside ep_send_events() at the same time; we get 
   (splice) the ready list under the spinlock, so each one will get 
   its own ready list. Note that an fd cannot be inside more than one 
   ready list at the same time, because ep_poll_callback() will not 
   re-queue it if it sees it already linked:

	if (ep_is_linked(&epi->rdllink))
		goto is_linked;

   Another case that can happen is two concurrent epoll_wait() calls 
   coming in with a userspace event buffer of size, say, ten.
   Suppose there are 50 events ready in the list. The first 
   epoll_wait() will "steal" the whole list, while the second, seeing 
   no events, will go to sleep. But at the end of ep_send_events() in 
   the first epoll_wait(), we will re-inject the surplus ready fds and 
   trigger the proper wake_up to the second epoll_wait() (a simplified 
   sketch of this flow follows the list).

5) ep_poll_callback() hitting us asynchronously
   This is the tricky part. As I said above, the ep_is_linked() test 
   done inside ep_poll_callback() guarantees us that as long as the 
   item appears linked to a list, ep_poll_callback() will not try 
   to re-queue it again (read: write data on any of its members). When 
   we do a list_del() in ep_send_events(), the item will still satisfy 
   the ep_is_linked() test (whatever data is written in prev/next, 
   it'll never be its own pointer), so ep_poll_callback() will still 
   leave us alone. It's only after the eventual 
   smp_mb()+INIT_LIST_HEAD(&epi->rdllink) that the unlinked item 
   becomes visible to ep_poll_callback() again, but at that point 
   we're already past it.
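
To make (4) and (5) a bit more concrete, here is a minimal userspace model of 
the splice/transfer/re-inject flow (not the kernel code: the list handling, 
locking and names are all simplified stand-ins):

	/* userspace model only, not fs/eventpoll.c */
	#include <pthread.h>
	#include <stdio.h>

	struct node { struct node *next; int val; };

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;	/* models ep->lock */
	static struct node *ready;					/* models ep->rdllist */

	static int consume(int maxevents)
	{
		struct node *list, *n;
		int done = 0;

		pthread_mutex_lock(&lock);
		list = ready;			/* "steal" (splice) the whole ready list */
		ready = NULL;
		pthread_mutex_unlock(&lock);

		/* transfer loop: runs without the lock held, like ep_send_events() */
		while (list && done < maxevents) {
			n = list;
			list = n->next;
			printf("delivered %d\n", n->val);
			done++;
		}

		if (list) {			/* surplus events: re-inject them */
			pthread_mutex_lock(&lock);
			for (n = list; n->next; n = n->next)
				;
			n->next = ready;
			ready = list;
			/* here the real code would also wake_up() the other epoll_wait()ers */
			pthread_mutex_unlock(&lock);
		}
		return done;
	}

	int main(void)
	{
		struct node nodes[5];

		for (int i = 4; i >= 0; i--) {	/* queue five "ready" items: 0..4 */
			nodes[i].val = i;
			nodes[i].next = ready;
			ready = &nodes[i];
		}
		printf("first pass: %d events\n", consume(3));	/* delivers 3, re-injects 2 */
		printf("second pass: %d events\n", consume(3));	/* picks up the surplus */
		return 0;
	}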



Signed-off-by: Davide Libenzi 



- Davide



eventpoll.c |  230 ++--
1 file changed, 85 insertions(+), 145 deletions(-)




diff -Nru linux-2.6.20/fs/eventpoll.c linux-2.6.20.mod/fs/eventpoll.c
--- linux-2.6.20/fs/eventpoll.c 2007-02-04 10:44:54.0 -0800
+++ linux-2.6.20.mod/fs/eventpoll.c 2007-02-28 10:30:49.0 -0800
@@ -185,7 +185,7 @@
 
 /*
  * Each file descriptor added to the eventpoll interface will
- * have an entry of this type linked to the hash.
+ * have an entry of this type linked to the "rbr" RB tree.
  */
 struct epitem {
 	/* RB-Tree node used to link this structure to the eventpoll rb-tree */
@@ -217,15 +217,6 @@
 
 	/* List header used to link this item to the "struct file" items list */
 	struct list_head fllink;
-
-	/* List header used to link the item to the transfer list */
-	struct list_head txlink;
-
-	/*
-	 * This is used during the collection/transfer of events to userspace
-	 * to pin items empty events set.
-	 */
-	unsigned int revents;
 };
 
 /* Wrapper struct used by poll queueing */
@@ -258,11 +249,8 @@
 static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key);
 static int ep_eventpoll_close(struct inode *inode, struct file *file);
 static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait);
-static int ep_collect_ready_items(struct eventpoll *ep,
-