Re: [patch] aio: add per task aio wait event condition

2007-02-01 Thread Zach Brown

That sounds like a programming error, don't you think?  Maybe
returning EINVAL is the right approach?


Maybe.  I think I'd prefer to be permissive and queue as much as  
possible, but it's not a strong preference.  Returning EINVAL seems  
ok, too.


- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-02-01 Thread Jeff Moyer
==> On Wed, 3 Jan 2007 10:50:06 -0800, Zach Brown <[EMAIL PROTECTED]> said:

Zach> The controversial part here happens when min_nr is larger than the
Zach> ring size.  In that case I think we should consider min_nr to be
Zach> equal to the ring size.  We'll return fewer events than userspace
Zach> asked for, but it'll still be a big batch.

That sounds like a programming error, don't you think?  Maybe
returning EINVAL is the right approach?

-Jeff
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-02-01 Thread Jeff Moyer
== On Wed, 3 Jan 2007 10:50:06 -0800, Zach Brown [EMAIL PROTECTED] said:

Zach The controversial part here happens when min_nr is larger than the
Zach ring size.  In that case I think we should consider min_nr to be
Zach equal to the ring size.  We'll return fewer events than userspace
Zach asked for, but it'll still be a big batch.

That sounds like a programming error, don't you think?  Maybe
returning EINVAL is the right approach?

-Jeff
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-02-01 Thread Zach Brown

That sounds like a programming error, don't you think?  Maybe
returning EINVAL is the right approach?


Maybe.  I think I'd prefer to be permissive and queue as much as  
possible, but it's not a strong preference.  Returning EINVAL seems  
ok, too.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-03 Thread Zach Brown


On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:


Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM

In the example you
gave earlier, task with min_nr of 2 will be woken up after 4  
completed

events.


I only gave 2 ios/events in that example.

Does that clear up the confusion?


It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.


Yeah.  I have to admit that I hadn't fully grasped the behaviour of  
the interface when we started this thread.  I realized just how  
insane it was as I was finishing off last night.



wakes up on event _4_.  If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.


I agree completely.  This leads me to assume that essentially *no  
one* has multiple waiters on a context who care about min_nr.  That's  
good news because we can consider it the slow path for now :)



So I can categorize my patchset as a bug fix instead of a
performance patch ;-)  Let's be serious, this ought to be fixed
one way or the other.


Yeah.  How about this:

Start with a cleanup of the event copying loop:

- prefault user event dest with fault_in_pages_writable()
- copy multiple events with _inatomic
- only copy events if min_nr are available
- back off and retry if _inatomic faults
- go back to sleep if too few events are available
- return once a copy doesn't fault

(I guess one could only fall back to prefaulting if the _inatomic  
fails.  I don't know if most event buffers fault or not, I'd guess not.)


This would get rid of the nonsense of consuming events (possibly  
depriving other concurrent waiters) without returning them to  
userspace in a timely fashion.  We get rid of a lot of the per-event- 
delivery overhead that the current loop suffers from.


The controversial part here happens when min_nr is larger than the  
ring size.  In that case I think we should consider min_nr to be  
equal to the ring size.  We'll return fewer events than userspace  
asked for, but it'll still be a big batch.


Does io_getevents() returning +ve but < min_nr raise alarm bells for  
anyone?  It's already done in the timeout case, so arguably code  
already handles it.


Now, on the waking and sleeping side:

- min_nr 0 and 1 as exclusive waiters in ctx->wait
- maintain a rbtree of with min_nr of waiters when > 1
- always wake ctx->wait if it's pending
- wake rb_first of the tree if nr_in_ring >= node->min_nr

The rbtree nodes would have a task_struct pointer and its waker would  
use wake_up_process().  It'd be maintained under the ctx lock.  It'll  
be expensive to maintain in the case of lots of concurrent min_nr > 1  
waiters, but we're declaring that the slow path.


The fast path of a single waiter with > 1 min_nr becomes a pointer  
indirection through the root node of the rb tree.  It should be just  
like the current list_head use in the wait queue.  If the wait queue  
head and rbtree root node share a cacheline in the context then  
checking them both (and almost always only finding one of them  
populated) shouldn't cost the completion path much.


I don't have a strong opinion about using exclusive wake-ups or not  
because I doubt we ever have concurrent waiters in the field.  I'm  
leaning towards avoiding the thundering herd, even if that risks the  
event delivery stream waiting behind a woken waiter who might be  
blocking while prefaulting their destination event buffer.  We should  
wake from io_getevents() before it sleeps or returns, then.  I could  
easily be convinced otherwise.


The wait queue lets us support lots of min_nr == 1 waiters without  
the rbtree maintenance overhead.. they're not as self-evidently  
broken in the current code as multiple min_nr > 1 waiters, so they  
may well be in use?  who knows.  It's easy.


How's this all sound?  I can throw this together if you'd prefer it  
that way.  Though I suspect you'll want to keep hacking on this :).


Finally, let's get a test case for multiple min_nr > 1 waiters into  
autotest.  I'd use our trivial example where the min_nr = 2 waiter  
fires of 3 ios for the concurrent min_nr = 3 waiter to ensure that  
things don't get stuck.


- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-03 Thread Zach Brown


On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:


Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM

In the example you
gave earlier, task with min_nr of 2 will be woken up after 4  
completed

events.


I only gave 2 ios/events in that example.

Does that clear up the confusion?


It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.


Yeah.  I have to admit that I hadn't fully grasped the behaviour of  
the interface when we started this thread.  I realized just how  
insane it was as I was finishing off last night.



wakes up on event _4_.  If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.


I agree completely.  This leads me to assume that essentially *no  
one* has multiple waiters on a context who care about min_nr.  That's  
good news because we can consider it the slow path for now :)



So I can categorize my patchset as a bug fix instead of a
performance patch ;-)  Let's be serious, this ought to be fixed
one way or the other.


Yeah.  How about this:

Start with a cleanup of the event copying loop:

- prefault user event dest with fault_in_pages_writable()
- copy multiple events with _inatomic
- only copy events if min_nr are available
- back off and retry if _inatomic faults
- go back to sleep if too few events are available
- return once a copy doesn't fault

(I guess one could only fall back to prefaulting if the _inatomic  
fails.  I don't know if most event buffers fault or not, I'd guess not.)


This would get rid of the nonsense of consuming events (possibly  
depriving other concurrent waiters) without returning them to  
userspace in a timely fashion.  We get rid of a lot of the per-event- 
delivery overhead that the current loop suffers from.


The controversial part here happens when min_nr is larger than the  
ring size.  In that case I think we should consider min_nr to be  
equal to the ring size.  We'll return fewer events than userspace  
asked for, but it'll still be a big batch.


Does io_getevents() returning +ve but  min_nr raise alarm bells for  
anyone?  It's already done in the timeout case, so arguably code  
already handles it.


Now, on the waking and sleeping side:

- min_nr 0 and 1 as exclusive waiters in ctx-wait
- maintain a rbtree of with min_nr of waiters when  1
- always wake ctx-wait if it's pending
- wake rb_first of the tree if nr_in_ring = node-min_nr

The rbtree nodes would have a task_struct pointer and its waker would  
use wake_up_process().  It'd be maintained under the ctx lock.  It'll  
be expensive to maintain in the case of lots of concurrent min_nr  1  
waiters, but we're declaring that the slow path.


The fast path of a single waiter with  1 min_nr becomes a pointer  
indirection through the root node of the rb tree.  It should be just  
like the current list_head use in the wait queue.  If the wait queue  
head and rbtree root node share a cacheline in the context then  
checking them both (and almost always only finding one of them  
populated) shouldn't cost the completion path much.


I don't have a strong opinion about using exclusive wake-ups or not  
because I doubt we ever have concurrent waiters in the field.  I'm  
leaning towards avoiding the thundering herd, even if that risks the  
event delivery stream waiting behind a woken waiter who might be  
blocking while prefaulting their destination event buffer.  We should  
wake from io_getevents() before it sleeps or returns, then.  I could  
easily be convinced otherwise.


The wait queue lets us support lots of min_nr == 1 waiters without  
the rbtree maintenance overhead.. they're not as self-evidently  
broken in the current code as multiple min_nr  1 waiters, so they  
may well be in use?  who knows.  It's easy.


How's this all sound?  I can throw this together if you'd prefer it  
that way.  Though I suspect you'll want to keep hacking on this :).


Finally, let's get a test case for multiple min_nr  1 waiters into  
autotest.  I'd use our trivial example where the min_nr = 2 waiter  
fires of 3 ios for the concurrent min_nr = 3 waiter to ensure that  
things don't get stuck.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you
> > gave earlier, task with min_nr of 2 will be woken up after 4 completed
> > events.
> 
> I only gave 2 ios/events in that example.
> 
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.

In the blocking path of read_events(), we essentially doing the
following loop (simplified for clarity):

while (i < nr) {
add_wait_queue_exclusive(>wait, );
do {
ret = aio_read_evt(ctx, );
if (!ret)
schedule();
while (1);
remove_wait_queue(>wait, );
copy_to_user(event, , sizeof(ent));
}

Noticed that when thread comes out of schedule(), it removes itself
from the wait queue, and requeue itself at the end of the wait queue
for each and every event it reaps.  So if there are multiple threads
waiting in io_getevents, completed I/O are handed out in round robin
scheme to all waiting threads.

To illustrate it in ascii graph, here is what happens:

   thread 1   thread 2

   queue at head
   schedule()

  queue at 2nd position
  schedule

aio_complete
(event 1)
   remove_wait_queue  (now thread 2 is at head)
   reap event 1
   requeue at tail
   schedule

aio_complete
(event 2)
  remove_wait_queue (now thread 1 is at 
head)
  reap event 2
  requeue at tail
  schedule

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps
second with min_nr = 3, then thread 1 wakes up on event _3_.
But if thread 2 sleeps first, thread 1 sleeps second, thread 1
wakes up on event _4_.  If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.

Looking back to the example Zach gave earlier, current
implementation behaves just like what described as an undesired
bug (modified and tortured):

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete
first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a
performance patch ;-)  Let's be serious, this ought to be fixed
one way or the other.


- Ken
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks waiting for
> >>> events, they
> >>> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> >>> __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
> >>> will be woken up when 2 ops completes.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though:  whoever sleep first will
> > be woken up first.
> 
> Why would the min_nr = 3 sleeper be woken up in that case?  Only 2  
> ios were issued.
> 
> Maybe the app was relying on the min_nr = 2 completion to issue 3  
> more ios for the min_nr = 3 sleeper, who knows.
> 
> Does that clear up the confusion?


Not really. I don't think I understand your concern. You gave an example:

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

Or:

issue 2 ops
first io_getevents sleeps with a min_nr of 3
second io_getevents sleeps with min_nr of 2
2 ops complete but only test the second sleeper's min_nr of 2
first sleeper twiddles thumbs


First scenario doesn't exist because in the new scheme, we test first
sleeper (as in head of the queue) when 2 ops complete. It wakes up first.

2nd scenario is OK to me because first sleeper waiting for 3 events,
and there are only 2 ops completed, so it waits.

The one scenario that I can think of that breaks down is that one task
sleeps with min_nr of 100.  Then 50 ops completed.  Comes along 2nd
thread does a io_getevents and it will take all 50 events in the 2nd
thread.  Is that what you are talking about?  It doesn't involve two
sleepers.  That I can fix by testing whether wait queue is active or
not at the beginning of fast path in read_events().

The bigger question is: what is the semantics on event reap order for
thread? Random, FIFO or round robin?  It is not specified anywhere.
What would be the most optimal policy?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:


Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM

That is not possible because when multiple tasks waiting for
events, they
enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2
will be woken up when 2 ops completes.


So switch the order of the two sleepers in the example?


Not sure why that would be a problem though:  whoever sleep first will
be woken up first.


Why would the min_nr = 3 sleeper be woken up in that case?  Only 2  
ios were issued.


Maybe the app was relying on the min_nr = 2 completion to issue 3  
more ios for the min_nr = 3 sleeper, who knows.



Before I challenge that semantics, I want to mention that in current
implementation, dribbling AIO events will be distributed in round  
robin

fashion to all pending tasks waiting in io_getevents.


Yeah, don't misunderstand me -- we agree that the current situation  
is bad.



  In the example you
gave earlier, task with min_nr of 2 will be woken up after 4 completed
events.


I only gave 2 ios/events in that example.

Does that clear up the confusion?

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks waiting for  
> > events, they
> > enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> > __add_wait_queue_tail().  So first io_getevents() with min_nr of 2  
> > will be woken up when 2 ops completes.
> 
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though:  whoever sleep first will
be woken up first.


> The point is that there's no way to guarantee that the head of the  
> wait queue will be the lowest min_nr.

Before I challenge that semantics, I want to mention that in current
implementation, dribbling AIO events will be distributed in round robin
fashion to all pending tasks waiting in io_getevents.  In the example you
gave earlier, task with min_nr of 2 will be woken up after 4 completed
events.  I consider that as an undesirable behavior as well.

Going back to your counter argument, why do we need the lowest min_nr in
the head of the queue?  These are tasks that shares one aio ctx and ioctx
is shareable only among threads.  Any reason why round robin policy is
superior than FIFO?  Also presumably, threads that shares ioctx should be
capable of handling events for the same ioctx.

>From wakeup order point of view, yes, tasks with lowest min_nr wakes up
first, but looking from io completion order, they are not. And these are
the source of excessive ctx switch.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


That is not possible because when multiple tasks waiting for  
events, they

enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2  
will

be woken up when 2 ops completes.


So switch the order of the two sleepers in the example?

The point is that there's no way to guarantee that the head of the  
wait queue will be the lowest min_nr.


I got list_add() from the add_wait_queue() still being used in  
wait_for_all_aios(), fwiw.  My mistake.


- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wake-up
> > process when that condition meets.  And this condition is added on a
> > per task base for handling multi-threaded app that shares single  
> > ioctx.
> 
> But only one of the waiting tasks is tested, the one at the head of  
> the list.  It looks like this change could starve a io_getevents()  
> with a low min_nr in the presence of another io_getevents() with a  
> larger min_nr.
> 
> > +   if (waitqueue_active(>wait)) {
> > +   struct aio_wait_queue *wait;
> > +   wait = container_of(ctx->wait.task_list.next,
> > +   struct aio_wait_queue, wait.task_list);
> > +   if (nr_evt >= wait->nr_wait)
> > +   wake_up(>wait);
> > +   }
> 
> First is the fear of starvation as mentioned previously.
> 
> issue 2 ops
> first io_getevents sleeps with a min_nr of 2
> second io_getevents sleeps with min_nr of 3
> 2 ops complete but only test the second sleeper's min_nr of 3
> first sleeper twiddles thumbs

That is not possible because when multiple tasks waiting for events, they
enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2 will
be woken up when 2 ops completes.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:


The AIO wake-up notification from aio_complete is really inefficient
in current AIO implementation in the presence of process waiting in
io_getevents().


Yeah, it's a real deficiency.  Thanks for taking a stab at it.


This patch adds a wait condition to the wait queue and only wake-up
process when that condition meets.  And this condition is added on a
per task base for handling multi-threaded app that shares single  
ioctx.


But only one of the waiting tasks is tested, the one at the head of  
the list.  It looks like this change could starve a io_getevents()  
with a low min_nr in the presence of another io_getevents() with a  
larger min_nr.



Before:
 0  0  0 3972608   7056  3131200 14100 0 7885  
13747  0  2 98  0

After:
 0  0  0 3972608   7056  3131200 13800 0 7885 
42  0  2 98  0


Nice.  What min_nr was used in this test?


+struct aio_wait_queue {
+   int nr_wait;/* wake-up condition */


It appears that this is never assigned a negative?  Can we make it  
that explicit in the type so that we reviewers don't have to worry  
about wrapping and signed comparisons?



-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue wait;



+   aio_init_wait();


This just changed from using default_wake_function() to  
autoremove_wait_function().  Very sneaky!  wait_for_all_aios() should  
be adding the wait queue before going to sleep each time.  (better  
still to just use wait_event()).


Was this on purpose?  I'm all for it as a way to reduce wakeups from  
a stream of completions to a single waiter.



+   nr_evt = ring->tail - ring->head;
+   if (nr_evt < 0)
+   nr_evt += info->nr;


 int = unsigned - unsigned;
 if (int < 0)

My head already hurts.  Can we clean this up so one doesn't have to  
live and breath type conversion rules to tell if this code is correct?



+   if (waitqueue_active(>wait)) {
+   struct aio_wait_queue *wait;
+   wait = container_of(ctx->wait.task_list.next,
+   struct aio_wait_queue, wait.task_list);
+   if (nr_evt >= wait->nr_wait)
+   wake_up(>wait);
+   }


First is the fear of starvation as mentioned previously.

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

This makes me think this elegant task_list approach is doomed.  I  
think this is what stopped Ben and I from being interested in this  
last time we talked about it :).


Also, is that container_of() and dereference safe in the presence of  
racing wake-ups?  It looks like we could get deref a freed wait and  
get a bogus nr_wait and decide not to wake.


Andrew, I fear we should remove this from -mm until it's fixed up.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:


The AIO wake-up notification from aio_complete is really inefficient
in current AIO implementation in the presence of process waiting in
io_getevents().


Yeah, it's a real deficiency.  Thanks for taking a stab at it.


This patch adds a wait condition to the wait queue and only wake-up
process when that condition meets.  And this condition is added on a
per task base for handling multi-threaded app that shares single  
ioctx.


But only one of the waiting tasks is tested, the one at the head of  
the list.  It looks like this change could starve a io_getevents()  
with a low min_nr in the presence of another io_getevents() with a  
larger min_nr.



Before:
 0  0  0 3972608   7056  3131200 14100 0 7885  
13747  0  2 98  0

After:
 0  0  0 3972608   7056  3131200 13800 0 7885 
42  0  2 98  0


Nice.  What min_nr was used in this test?


+struct aio_wait_queue {
+   int nr_wait;/* wake-up condition */


It appears that this is never assigned a negative?  Can we make it  
that explicit in the type so that we reviewers don't have to worry  
about wrapping and signed comparisons?



-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue wait;



+   aio_init_wait(wait);


This just changed from using default_wake_function() to  
autoremove_wait_function().  Very sneaky!  wait_for_all_aios() should  
be adding the wait queue before going to sleep each time.  (better  
still to just use wait_event()).


Was this on purpose?  I'm all for it as a way to reduce wakeups from  
a stream of completions to a single waiter.



+   nr_evt = ring-tail - ring-head;
+   if (nr_evt  0)
+   nr_evt += info-nr;


 int = unsigned - unsigned;
 if (int  0)

My head already hurts.  Can we clean this up so one doesn't have to  
live and breath type conversion rules to tell if this code is correct?



+   if (waitqueue_active(ctx-wait)) {
+   struct aio_wait_queue *wait;
+   wait = container_of(ctx-wait.task_list.next,
+   struct aio_wait_queue, wait.task_list);
+   if (nr_evt = wait-nr_wait)
+   wake_up(ctx-wait);
+   }


First is the fear of starvation as mentioned previously.

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

This makes me think this elegant task_list approach is doomed.  I  
think this is what stopped Ben and I from being interested in this  
last time we talked about it :).


Also, is that container_of() and dereference safe in the presence of  
racing wake-ups?  It looks like we could get deref a freed wait and  
get a bogus nr_wait and decide not to wake.


Andrew, I fear we should remove this from -mm until it's fixed up.

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
 On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
  This patch adds a wait condition to the wait queue and only wake-up
  process when that condition meets.  And this condition is added on a
  per task base for handling multi-threaded app that shares single  
  ioctx.
 
 But only one of the waiting tasks is tested, the one at the head of  
 the list.  It looks like this change could starve a io_getevents()  
 with a low min_nr in the presence of another io_getevents() with a  
 larger min_nr.
 
  +   if (waitqueue_active(ctx-wait)) {
  +   struct aio_wait_queue *wait;
  +   wait = container_of(ctx-wait.task_list.next,
  +   struct aio_wait_queue, wait.task_list);
  +   if (nr_evt = wait-nr_wait)
  +   wake_up(ctx-wait);
  +   }
 
 First is the fear of starvation as mentioned previously.
 
 issue 2 ops
 first io_getevents sleeps with a min_nr of 2
 second io_getevents sleeps with min_nr of 3
 2 ops complete but only test the second sleeper's min_nr of 3
 first sleeper twiddles thumbs

That is not possible because when multiple tasks waiting for events, they
enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2 will
be woken up when 2 ops completes.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


That is not possible because when multiple tasks waiting for  
events, they

enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2  
will

be woken up when 2 ops completes.


So switch the order of the two sleepers in the example?

The point is that there's no way to guarantee that the head of the  
wait queue will be the lowest min_nr.


I got list_add() from the add_wait_queue() still being used in  
wait_for_all_aios(), fwiw.  My mistake.


- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
  That is not possible because when multiple tasks waiting for  
  events, they
  enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
  __add_wait_queue_tail().  So first io_getevents() with min_nr of 2  
  will be woken up when 2 ops completes.
 
 So switch the order of the two sleepers in the example?

Not sure why that would be a problem though:  whoever sleep first will
be woken up first.


 The point is that there's no way to guarantee that the head of the  
 wait queue will be the lowest min_nr.

Before I challenge that semantics, I want to mention that in current
implementation, dribbling AIO events will be distributed in round robin
fashion to all pending tasks waiting in io_getevents.  In the example you
gave earlier, task with min_nr of 2 will be woken up after 4 completed
events.  I consider that as an undesirable behavior as well.

Going back to your counter argument, why do we need the lowest min_nr in
the head of the queue?  These are tasks that shares one aio ctx and ioctx
is shareable only among threads.  Any reason why round robin policy is
superior than FIFO?  Also presumably, threads that shares ioctx should be
capable of handling events for the same ioctx.

From wakeup order point of view, yes, tasks with lowest min_nr wakes up
first, but looking from io completion order, they are not. And these are
the source of excessive ctx switch.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Zach Brown


On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:


Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM

That is not possible because when multiple tasks waiting for
events, they
enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So first io_getevents() with min_nr of 2
will be woken up when 2 ops completes.


So switch the order of the two sleepers in the example?


Not sure why that would be a problem though:  whoever sleep first will
be woken up first.


Why would the min_nr = 3 sleeper be woken up in that case?  Only 2  
ios were issued.


Maybe the app was relying on the min_nr = 2 completion to issue 3  
more ios for the min_nr = 3 sleeper, who knows.



Before I challenge that semantics, I want to mention that in current
implementation, dribbling AIO events will be distributed in round  
robin

fashion to all pending tasks waiting in io_getevents.


Yeah, don't misunderstand me -- we agree that the current situation  
is bad.



  In the example you
gave earlier, task with min_nr of 2 will be woken up after 4 completed
events.


I only gave 2 ios/events in that example.

Does that clear up the confusion?

- z
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
 On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
  Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
  That is not possible because when multiple tasks waiting for
  events, they
  enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
  __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
  will be woken up when 2 ops completes.
 
  So switch the order of the two sleepers in the example?
 
  Not sure why that would be a problem though:  whoever sleep first will
  be woken up first.
 
 Why would the min_nr = 3 sleeper be woken up in that case?  Only 2  
 ios were issued.
 
 Maybe the app was relying on the min_nr = 2 completion to issue 3  
 more ios for the min_nr = 3 sleeper, who knows.
 
 Does that clear up the confusion?


Not really. I don't think I understand your concern. You gave an example:

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

Or:

issue 2 ops
first io_getevents sleeps with a min_nr of 3
second io_getevents sleeps with min_nr of 2
2 ops complete but only test the second sleeper's min_nr of 2
first sleeper twiddles thumbs


First scenario doesn't exist because in the new scheme, we test first
sleeper (as in head of the queue) when 2 ops complete. It wakes up first.

2nd scenario is OK to me because first sleeper waiting for 3 events,
and there are only 2 ops completed, so it waits.

The one scenario that I can think of that breaks down is that one task
sleeps with min_nr of 100.  Then 50 ops completed.  Comes along 2nd
thread does a io_getevents and it will take all 50 events in the 2nd
thread.  Is that what you are talking about?  It doesn't involve two
sleepers.  That I can fix by testing whether wait queue is active or
not at the beginning of fast path in read_events().

The bigger question is: what is the semantics on event reap order for
thread? Random, FIFO or round robin?  It is not specified anywhere.
What would be the most optimal policy?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [patch] aio: add per task aio wait event condition

2007-01-02 Thread Chen, Kenneth W
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
  In the example you
  gave earlier, task with min_nr of 2 will be woken up after 4 completed
  events.
 
 I only gave 2 ios/events in that example.
 
 Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.

In the blocking path of read_events(), we essentially doing the
following loop (simplified for clarity):

while (i  nr) {
add_wait_queue_exclusive(ctx-wait, wait);
do {
ret = aio_read_evt(ctx, ent);
if (!ret)
schedule();
while (1);
remove_wait_queue(ctx-wait, wait);
copy_to_user(event, ent, sizeof(ent));
}

Noticed that when thread comes out of schedule(), it removes itself
from the wait queue, and requeue itself at the end of the wait queue
for each and every event it reaps.  So if there are multiple threads
waiting in io_getevents, completed I/O are handed out in round robin
scheme to all waiting threads.

To illustrate it in ascii graph, here is what happens:

   thread 1   thread 2

   queue at head
   schedule()

  queue at 2nd position
  schedule

aio_complete
(event 1)
   remove_wait_queue  (now thread 2 is at head)
   reap event 1
   requeue at tail
   schedule

aio_complete
(event 2)
  remove_wait_queue (now thread 1 is at 
head)
  reap event 2
  requeue at tail
  schedule

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps
second with min_nr = 3, then thread 1 wakes up on event _3_.
But if thread 2 sleeps first, thread 1 sleeps second, thread 1
wakes up on event _4_.  If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.

Looking back to the example Zach gave earlier, current
implementation behaves just like what described as an undesired
bug (modified and tortured):

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete
first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a
performance patch ;-)  Let's be serious, this ought to be fixed
one way or the other.


- Ken
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[patch] aio: add per task aio wait event condition

2006-12-29 Thread Chen, Kenneth W
The AIO wake-up notification from aio_complete is really inefficient
in current AIO implementation in the presence of process waiting in
io_getevents().

For example, if app calls io_getevents with min_nr > 1, and aio event
queue doesn't have enough completed aio event, the process will block
in read_events().  However, aio_complete() will wake up the waiting
process for *each* complete I/O even though number of events that an
app is waiting for is much larger than 1.  This makes excessive and
unnecessary context switch because the waiting process will just reap
one single event and goes back to sleep again.  It is much more efficient
to wake up the waiting process when there are enough events for it to
reap.

This patch adds a wait condition to the wait queue and only wake-up
process when that condition meets.  And this condition is added on a
per task base for handling multi-threaded app that shares single ioctx.

To show the effect of this patch, here is an vmstat output before and
after the patch. The app does random O_DIRECT AIO on 60 disks. Context
switch is reduced from 13 thousand+ down to just 40+, an significant
improvement.

Before:
procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   incs us sy id wa
 0  0  0 3972608   7056  3131200 14000 0 7840 13715  0  2 98  0
 0  0  0 3972608   7056  3131200 14300 0 7793 13641  0  2 98  0
 0  0  0 3972608   7056  3131200 14100 0 7885 13747  0  2 98  0

After:
 0  0  0 3972608   7056  3131200 14000 0 784049  0  2 98  0
 0  0  0 3972608   7056  3131200 13800 0 779353  0  2 98  0
 0  0  0 3972608   7056  3131200 13800 0 788542  0  2 98  0


Signed-off-by: Ken Chen <[EMAIL PROTECTED]>


--- ./fs/aio.c.orig 2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c  2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+   int nr_wait;/* wake-up condition */
+   wait_queue_twait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+   wait->nr_wait = 0;
+   init_wait(>wait);
+}
+
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue wait;
 
spin_lock_irq(>ctx_lock);
if (!ctx->reqs_active)
goto out;
 
-   add_wait_queue(>wait, );
+   aio_init_wait();
+   add_wait_queue(>wait, );
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (ctx->reqs_active) {
spin_unlock_irq(>ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aios(struct kio
spin_lock_irq(>ctx_lock);
}
__set_task_state(tsk, TASK_RUNNING);
-   remove_wait_queue(>wait, );
+   remove_wait_queue(>wait, );
 
 out:
spin_unlock_irq(>ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
unsigned long   flags;
unsigned long   tail;
int ret;
+   int nr_evt = 0;
 
/*
 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
info->tail = tail;
ring->tail = tail;
 
+   nr_evt = ring->tail - ring->head;
+   if (nr_evt < 0)
+   nr_evt += info->nr;
put_aio_ring_event(event, KM_IRQ0);
kunmap_atomic(ring, KM_IRQ1);
 
@@ -1000,8 +1016,13 @@ put_rq:
/* everything turned out well, dispose of the aiocb. */
ret = __aio_put_req(ctx, iocb);
 
-   if (waitqueue_active(>wait))
-   wake_up(>wait);
+   if (waitqueue_active(>wait)) {
+   struct aio_wait_queue *wait;
+   wait = container_of(ctx->wait.task_list.next,
+   struct aio_wait_queue, wait.task_list);
+   if (nr_evt >= wait->nr_wait)
+   wake_up(>wait);
+   }
 
spin_unlock_irqrestore(>ctx_lock, flags);
return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
longstart_jiffies = jiffies;
struct task_struct  *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue   wait;
int ret;
int i = 0;
struct io_event ent;
@@ -1152,10 +1173,11 @@ retry:
set_timeout(start_jiffies, , );
}
 
+   aio_init_wait();
while (likely(i < nr)) {
-   add_wait_queue_exclusive(>wait, );

[patch] aio: add per task aio wait event condition

2006-12-29 Thread Chen, Kenneth W
The AIO wake-up notification from aio_complete is really inefficient
in current AIO implementation in the presence of process waiting in
io_getevents().

For example, if app calls io_getevents with min_nr  1, and aio event
queue doesn't have enough completed aio event, the process will block
in read_events().  However, aio_complete() will wake up the waiting
process for *each* complete I/O even though number of events that an
app is waiting for is much larger than 1.  This makes excessive and
unnecessary context switch because the waiting process will just reap
one single event and goes back to sleep again.  It is much more efficient
to wake up the waiting process when there are enough events for it to
reap.

This patch adds a wait condition to the wait queue and only wake-up
process when that condition meets.  And this condition is added on a
per task base for handling multi-threaded app that shares single ioctx.

To show the effect of this patch, here is an vmstat output before and
after the patch. The app does random O_DIRECT AIO on 60 disks. Context
switch is reduced from 13 thousand+ down to just 40+, an significant
improvement.

Before:
procs ---memory-- ---swap-- -io --system-- cpu
 r  b   swpd   free   buff  cache   si   sobibo   incs us sy id wa
 0  0  0 3972608   7056  3131200 14000 0 7840 13715  0  2 98  0
 0  0  0 3972608   7056  3131200 14300 0 7793 13641  0  2 98  0
 0  0  0 3972608   7056  3131200 14100 0 7885 13747  0  2 98  0

After:
 0  0  0 3972608   7056  3131200 14000 0 784049  0  2 98  0
 0  0  0 3972608   7056  3131200 13800 0 779353  0  2 98  0
 0  0  0 3972608   7056  3131200 13800 0 788542  0  2 98  0


Signed-off-by: Ken Chen [EMAIL PROTECTED]


--- ./fs/aio.c.orig 2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c  2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx 
kunmap_atomic((void *)((unsigned long)__event  PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+   int nr_wait;/* wake-up condition */
+   wait_queue_twait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+   wait-nr_wait = 0;
+   init_wait(wait-wait);
+}
+
 /* ioctx_alloc
  * Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
struct task_struct *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue wait;
 
spin_lock_irq(ctx-ctx_lock);
if (!ctx-reqs_active)
goto out;
 
-   add_wait_queue(ctx-wait, wait);
+   aio_init_wait(wait);
+   add_wait_queue(ctx-wait, wait.wait);
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (ctx-reqs_active) {
spin_unlock_irq(ctx-ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aios(struct kio
spin_lock_irq(ctx-ctx_lock);
}
__set_task_state(tsk, TASK_RUNNING);
-   remove_wait_queue(ctx-wait, wait);
+   remove_wait_queue(ctx-wait, wait.wait);
 
 out:
spin_unlock_irq(ctx-ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
unsigned long   flags;
unsigned long   tail;
int ret;
+   int nr_evt = 0;
 
/*
 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
info-tail = tail;
ring-tail = tail;
 
+   nr_evt = ring-tail - ring-head;
+   if (nr_evt  0)
+   nr_evt += info-nr;
put_aio_ring_event(event, KM_IRQ0);
kunmap_atomic(ring, KM_IRQ1);
 
@@ -1000,8 +1016,13 @@ put_rq:
/* everything turned out well, dispose of the aiocb. */
ret = __aio_put_req(ctx, iocb);
 
-   if (waitqueue_active(ctx-wait))
-   wake_up(ctx-wait);
+   if (waitqueue_active(ctx-wait)) {
+   struct aio_wait_queue *wait;
+   wait = container_of(ctx-wait.task_list.next,
+   struct aio_wait_queue, wait.task_list);
+   if (nr_evt = wait-nr_wait)
+   wake_up(ctx-wait);
+   }
 
spin_unlock_irqrestore(ctx-ctx_lock, flags);
return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
longstart_jiffies = jiffies;
struct task_struct  *tsk = current;
-   DECLARE_WAITQUEUE(wait, tsk);
+   struct aio_wait_queue   wait;
int ret;
int i = 0;
struct io_event ent;
@@ -1152,10 +1173,11 @@ retry:
set_timeout(start_jiffies, to, ts);
}
 
+   aio_init_wait(wait);
while