Re: [patch] aio: add per task aio wait event condition
> That sounds like a programming error, don't you think?  Maybe
> returning EINVAL is the right approach?

Maybe.  I think I'd prefer to be permissive and queue as much as
possible, but it's not a strong preference.  Returning EINVAL seems
ok, too.

- z
Re: [patch] aio: add per task aio wait event condition
==> On Wed, 3 Jan 2007 10:50:06 -0800, Zach Brown <[EMAIL PROTECTED]> said:

Zach> The controversial part here happens when min_nr is larger than the
Zach> ring size.  In that case I think we should consider min_nr to be
Zach> equal to the ring size.  We'll return fewer events than userspace
Zach> asked for, but it'll still be a big batch.

That sounds like a programming error, don't you think?  Maybe returning
EINVAL is the right approach?

-Jeff
Re: [patch] aio: add per task aio wait event condition
On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you gave earlier, task with min_nr of 2 will be
> > woken up after 4 completed events.
> >
> > I only gave 2 ios/events in that example.
> >
> > Does that clear up the confusion?
>
> It occurs to me that people might not be aware how peculiar the
> current io_getevents wakeup scheme is, to the extent of erratic
> behavior.

Yeah.  I have to admit that I hadn't fully grasped the behaviour of
the interface when we started this thread.  I realized just how insane
it was as I was finishing off last night.

> But if thread 2 sleeps first, thread 1 sleeps second, thread 1 wakes
> up on event _4_.  If someone asked me to describe the algorithm of
> the io_getevents wake-up scheme in the presence of multiple waiters,
> I would call it erratic and non-deterministic.

I agree completely.  This leads me to assume that essentially *no one*
has multiple waiters on a context who care about min_nr.  That's good
news because we can consider it the slow path for now :)

> So I can categorize my patchset as a bug fix instead of a performance
> patch ;-)  Let's be serious, this ought to be fixed one way or the
> other.

Yeah.  How about this:

Start with a cleanup of the event copying loop (a rough sketch follows
below):

 - prefault the user event destination with fault_in_pages_writable()
 - copy multiple events with _inatomic
 - only copy events if min_nr are available
 - back off and retry if the _inatomic copy faults
 - go back to sleep if too few events are available
 - return once a copy doesn't fault

(I guess one could only fall back to prefaulting if the _inatomic copy
fails.  I don't know if most event buffers fault or not; I'd guess
not.)

This would get rid of the nonsense of consuming events (possibly
depriving other concurrent waiters) without returning them to
userspace in a timely fashion.  We get rid of a lot of the
per-event-delivery overhead that the current loop suffers from.

The controversial part here happens when min_nr is larger than the
ring size.  In that case I think we should consider min_nr to be equal
to the ring size.  We'll return fewer events than userspace asked for,
but it'll still be a big batch.  Does io_getevents() returning +ve but
< min_nr raise alarm bells for anyone?  It's already done in the
timeout case, so arguably code already handles it.

Now, on the waking and sleeping side (also sketched below):

 - min_nr 0 and 1 sleepers wait as exclusive waiters in ctx->wait
 - maintain an rbtree of waiters with min_nr > 1
 - always wake ctx->wait if it's pending
 - wake rb_first of the tree if nr_in_ring >= node->min_nr

The rbtree nodes would have a task_struct pointer, and their waker
would use wake_up_process().  It'd be maintained under the ctx lock.
It'll be expensive to maintain in the case of lots of concurrent
min_nr > 1 waiters, but we're declaring that the slow path.

The fast path of a single waiter with min_nr > 1 becomes a pointer
indirection through the root node of the rbtree.  It should be just
like the current list_head use in the wait queue.  If the wait queue
head and the rbtree root node share a cacheline in the context, then
checking them both (and almost always finding only one of them
populated) shouldn't cost the completion path much.

I don't have a strong opinion about using exclusive wake-ups or not,
because I doubt we ever have concurrent waiters in the field.  I'm
leaning towards avoiding the thundering herd, even if that risks the
event delivery stream waiting behind a woken waiter who might be
blocking while prefaulting their destination event buffer.  We should
wake from io_getevents() before it sleeps or returns, then.  I could
easily be convinced otherwise.
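A minimal sketch of the copy loop described above, for concreteness.
aio_ring_avail(), aio_peek_events(), and aio_consume_events() are
hypothetical helpers (peek snapshots events without advancing the ring
head, consume advances it); fault_in_pages_writable(),
__copy_to_user_inatomic(), and pagefault_disable()/enable() are the
real interfaces of this era:

	/*
	 * Sketch only -- not an actual patch.  The shape to note:
	 * events are copied with the cheap _inatomic variant and only
	 * consumed from the ring after the copy succeeds, so a fault
	 * never strands events that other waiters could have used.
	 */
	static long copy_events_to_user(struct kioctx *ctx,
					struct io_event __user *uevents,
					long min_nr, long nr)
	{
		struct io_event ev[16];		/* kernel-side batch */
		long copied = 0;

		/* consume nothing until at least min_nr are queued */
		if (aio_ring_avail(ctx) < min_nr)
			return 0;	/* caller goes back to sleep */

		while (copied < nr) {
			long avail = aio_peek_events(ctx, ev,
						     min(nr - copied, 16L));
			if (avail <= 0)
				break;

			pagefault_disable();
			if (__copy_to_user_inatomic(uevents + copied, ev,
						    avail * sizeof(*ev))) {
				pagefault_enable();
				/* back off: prefault the dest, retry */
				if (fault_in_pages_writable((char __user *)
							(uevents + copied),
							avail * sizeof(*ev)))
					return copied ? copied : -EFAULT;
				continue;
			}
			pagefault_enable();

			/* the copy stuck; now take them off the ring */
			aio_consume_events(ctx, avail);
			copied += avail;
		}
		return copied;
	}

And a sketch of the wake-up side.  The ctx->waiters tree and struct
aio_waiter are invented names; rb_first(), rb_entry(),
waitqueue_active(), wake_up(), and wake_up_process() are real:

	struct aio_waiter {
		struct rb_node		node;	/* keyed by min_nr */
		long			min_nr;
		struct task_struct	*task;
	};

	/* called on completion with ctx->ctx_lock held */
	static void aio_wake_waiters(struct kioctx *ctx, long nr_in_ring)
	{
		struct rb_node *first;

		/* min_nr 0 and 1 sleepers: plain exclusive wait queue */
		if (waitqueue_active(&ctx->wait))
			wake_up(&ctx->wait);

		/* the smallest min_nr > 1 sleeper sits at rb_first() */
		first = rb_first(&ctx->waiters);
		if (first) {
			struct aio_waiter *w =
				rb_entry(first, struct aio_waiter, node);

			if (nr_in_ring >= w->min_nr)
				wake_up_process(w->task);
		}
	}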
The wait queue lets us support lots of min_nr == 1 waiters without the
rbtree maintenance overhead...  they're not as self-evidently broken in
the current code as multiple min_nr > 1 waiters are, so they may well
be in use?  who knows.  It's easy.

How's this all sound?  I can throw this together if you'd prefer it
that way.  Though I suspect you'll want to keep hacking on this :).

Finally, let's get a test case for multiple min_nr > 1 waiters into
autotest.  I'd use our trivial example where the min_nr = 2 waiter
fires off 3 ios for the concurrent min_nr = 3 waiter, to ensure that
things don't get stuck.

- z
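Such a test might look roughly like this with libaio -- a sketch under
stated assumptions: "testfile" must exist with at least 512 bytes of
data, error handling is minimal, and the autotest harness glue is
omitted.  io_setup(), io_submit(), io_getevents(), and io_prep_pread()
are the real libaio calls; everything else is scaffolding:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <pthread.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static io_context_t ctx;
	static int fd;
	static char buf[3][512] __attribute__((aligned(512)));

	static void submit(int n)	/* fire off n 512-byte reads */
	{
		struct iocb cb[3], *cbs[3];
		int i;

		for (i = 0; i < n; i++) {
			io_prep_pread(&cb[i], fd, buf[i], 512, 0);
			cbs[i] = &cb[i];
		}
		if (io_submit(ctx, n, cbs) != n) {
			perror("io_submit");
			exit(1);
		}
	}

	static void *min_nr_3_waiter(void *unused)
	{
		struct io_event ev[3];
		int r = io_getevents(ctx, 3, 3, ev, NULL);

		printf("min_nr=3 waiter got %d events\n", r);
		return NULL;
	}

	int main(void)
	{
		struct io_event ev[2];
		pthread_t t;
		int r;

		fd = open("testfile", O_RDONLY | O_DIRECT);
		if (fd < 0 || io_setup(8, &ctx) != 0) {
			perror("setup");
			return 1;
		}

		pthread_create(&t, NULL, min_nr_3_waiter, NULL);
		sleep(1);	/* let the min_nr = 3 waiter block first */
		submit(2);

		/* min_nr = 2 waiter; must reap both completions */
		r = io_getevents(ctx, 2, 2, ev, NULL);
		printf("min_nr=2 waiter got %d events\n", r);

		submit(3);	/* now satisfy the min_nr = 3 waiter */

		/* under the erratic round-robin behavior this can wedge,
		 * which is exactly what the test is meant to catch */
		pthread_join(t, NULL);
		return 0;
	}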
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you gave earlier, task with min_nr of 2 will be
> > woken up after 4 completed events.
>
> I only gave 2 ios/events in that example.
>
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the
current io_getevents wakeup scheme is, to the extent of erratic
behavior.  In the blocking path of read_events(), we are essentially
doing the following loop (simplified for clarity):

	while (i < nr) {
		add_wait_queue_exclusive(&ctx->wait, &wait);
		do {
			ret = aio_read_evt(ctx, &ent);
			if (!ret)
				schedule();
		} while (!ret);
		remove_wait_queue(&ctx->wait, &wait);
		copy_to_user(event + i, &ent, sizeof(ent));
		i++;
	}

Notice that when a thread comes out of schedule(), it removes itself
from the wait queue and requeues itself at the tail of the wait queue
for each and every event it reaps.  So if there are multiple threads
waiting in io_getevents, completed I/Os are handed out in a
round-robin scheme to all waiting threads.  To illustrate it in an
ascii graph, here is what happens:

	thread 1			thread 2
	--------			--------
	queue at head
	schedule()
					queue at 2nd position
					schedule()

		  aio_complete (event 1)

	remove_wait_queue
	(now thread 2 is at head)
	reap event 1
	requeue at tail
	schedule()

		  aio_complete (event 2)

					remove_wait_queue
					(now thread 1 is at head)
					reap event 2
					requeue at tail
					schedule()

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps second
with min_nr = 3, then thread 1 wakes up on event _3_.  But if thread 2
sleeps first, thread 1 sleeps second, thread 1 wakes up on event _4_.
If someone asked me to describe the algorithm of the io_getevents
wake-up scheme in the presence of multiple waiters, I would call it
erratic and non-deterministic.

Looking back at the example Zach gave earlier, the current
implementation behaves just like what he described as an undesired bug
(modified and tortured):

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete
	first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a performance
patch ;-)  Let's be serious, this ought to be fixed one way or the
other.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks are waiting for
> >>> events, they enter the wait queue in FIFO order;
> >>> prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> >>> first io_getevents() with min_nr of 2 will be woken up when 2 ops
> >>> complete.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though: whoever sleeps first
> > will be woken up first.
>
> Why would the min_nr = 3 sleeper be woken up in that case?  Only 2
> ios were issued.
>
> Maybe the app was relying on the min_nr = 2 completion to issue 3
> more ios for the min_nr = 3 sleeper, who knows.
>
> Does that clear up the confusion?

Not really.  I don't think I understand your concern.  You gave an
example:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

Or:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 3
	second io_getevents sleeps with min_nr of 2
	2 ops complete but only test the second sleeper's min_nr of 2
	first sleeper twiddles thumbs

The first scenario doesn't exist because in the new scheme, we test
the first sleeper (as in, the head of the queue) when 2 ops complete.
It wakes up first.  The second scenario is OK with me because the
first sleeper is waiting for 3 events, and only 2 ops have completed,
so it waits.

The one scenario I can think of that breaks down is when one task
sleeps with min_nr of 100, then 50 ops complete, and a second thread
comes along, does an io_getevents, and takes all 50 events.  Is that
what you are talking about?  It doesn't involve two sleepers.  That I
can fix by testing whether the wait queue is active or not at the
beginning of the fast path in read_events() (see the sketch below).

The bigger question is: what are the semantics of event reap order
across threads?  Random, FIFO, or round robin?  It is not specified
anywhere.  What would be the optimal policy?
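A minimal sketch of that fast-path check, with read_events() reduced
to its bare shape (the real function also handles timeouts and the
sleeping slow path); waitqueue_active() and aio_read_evt() are the
real interfaces of the era:

	static int read_events(struct kioctx *ctx, long min_nr, long nr,
			       struct io_event __user *event)
	{
		struct io_event ent;
		int i = 0, ret;

		/*
		 * Fast path: reap already-completed events without
		 * sleeping, but only if nobody is already blocked in
		 * io_getevents().  Otherwise a late arrival could strip
		 * events out from under a sleeper still short of its
		 * min_nr.
		 */
		while (i < nr && !waitqueue_active(&ctx->wait)) {
			ret = aio_read_evt(ctx, &ent);
			if (ret <= 0)
				break;
			if (copy_to_user(event + i, &ent, sizeof(ent)))
				return i ? i : -EFAULT;
			i++;
		}
		if (i >= min_nr)
			return i;

		/* ...otherwise enter the sleeping slow path (elided) */
		return i;
	}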
Re: [patch] aio: add per task aio wait event condition
On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > > That is not possible because when multiple tasks are waiting for
> > > events, they enter the wait queue in FIFO order;
> > > prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> > > first io_getevents() with min_nr of 2 will be woken up when 2 ops
> > > complete.
> >
> > So switch the order of the two sleepers in the example?
>
> Not sure why that would be a problem though: whoever sleeps first
> will be woken up first.

Why would the min_nr = 3 sleeper be woken up in that case?  Only 2 ios
were issued.

Maybe the app was relying on the min_nr = 2 completion to issue 3 more
ios for the min_nr = 3 sleeper, who knows.

> Before I challenge those semantics, I want to mention that in the
> current implementation, dribbling AIO events will be distributed in
> round-robin fashion to all pending tasks waiting in io_getevents.

Yeah, don't misunderstand me -- we agree that the current situation is
bad.

> In the example you gave earlier, the task with min_nr of 2 will be
> woken up after 4 completed events.

I only gave 2 ios/events in that example.

Does that clear up the confusion?

- z
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks are waiting for
> > events, they enter the wait queue in FIFO order;
> > prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> > first io_getevents() with min_nr of 2 will be woken up when 2 ops
> > complete.
>
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though: whoever sleeps first will
be woken up first.

> The point is that there's no way to guarantee that the head of the
> wait queue will be the lowest min_nr.

Before I challenge those semantics, I want to mention that in the
current implementation, dribbling AIO events will be distributed in
round-robin fashion to all pending tasks waiting in io_getevents.  In
the example you gave earlier, the task with min_nr of 2 will be woken
up after 4 completed events.  I consider that undesirable behavior as
well.

Going back to your counter-argument: why do we need the lowest min_nr
at the head of the queue?  These are tasks that share one aio ctx, and
an ioctx is shareable only among threads.  Any reason why a
round-robin policy is superior to FIFO?  Also, presumably, threads
that share an ioctx should be capable of handling events for the same
ioctx.

From the wakeup-order point of view, yes, tasks with the lowest min_nr
wake up first; but looking at the io completion order, they do not.
And that is the source of excessive context switches.
Re: [patch] aio: add per task aio wait event condition
> That is not possible because when multiple tasks are waiting for
> events, they enter the wait queue in FIFO order;
> prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> first io_getevents() with min_nr of 2 will be woken up when 2 ops
> complete.

So switch the order of the two sleepers in the example?  The point is
that there's no way to guarantee that the head of the wait queue will
be the lowest min_nr.

I got list_add() from the add_wait_queue() still being used in
wait_for_all_aios(), fwiw.  My mistake.

- z
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wakes
> > up a process when that condition is met.  And this condition is
> > added on a per-task basis for handling multi-threaded apps that
> > share a single ioctx.
>
> But only one of the waiting tasks is tested, the one at the head of
> the list.  It looks like this change could starve an io_getevents()
> with a low min_nr in the presence of another io_getevents() with a
> larger min_nr.
>
> > +	if (waitqueue_active(&ctx->wait)) {
> > +		struct aio_wait_queue *wait;
> > +		wait = container_of(ctx->wait.task_list.next,
> > +				struct aio_wait_queue, wait.task_list);
> > +		if (nr_evt >= wait->nr_wait)
> > +			wake_up(&ctx->wait);
> > +	}
>
> First is the fear of starvation as mentioned previously.
>
>	issue 2 ops
>	first io_getevents sleeps with a min_nr of 2
>	second io_getevents sleeps with min_nr of 3
>	2 ops complete but only test the second sleeper's min_nr of 3
>	first sleeper twiddles thumbs

That is not possible, because when multiple tasks are waiting for
events they enter the wait queue in FIFO order:
prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
first io_getevents(), with min_nr of 2, will be woken up when 2 ops
complete.
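For reference, prepare_to_wait_exclusive() in kernel/wait.c of this
era looks roughly like this (lightly condensed; the async-iocb
special case is dropped here).  The tail insertion is what gives the
FIFO ordering:

	void fastcall
	prepare_to_wait_exclusive(wait_queue_head_t *q,
				  wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue_tail(q, wait);	/* tail => FIFO */
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}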
Re: [patch] aio: add per task aio wait event condition
On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:

> The AIO wake-up notification from aio_complete is really inefficient
> in the current AIO implementation in the presence of a process
> waiting in io_getevents().

Yeah, it's a real deficiency.  Thanks for taking a stab at it.

> This patch adds a wait condition to the wait queue and only wakes up
> a process when that condition is met.  And this condition is added on
> a per-task basis for handling multi-threaded apps that share a single
> ioctx.

But only one of the waiting tasks is tested, the one at the head of
the list.  It looks like this change could starve an io_getevents()
with a low min_nr in the presence of another io_getevents() with a
larger min_nr.

> Before:
>  0  0  0 3972608  7056 3131200  14100  0  7885 13747  0  2 98  0
> After:
>  0  0  0 3972608  7056 3131200  13800  0  7885    42  0  2 98  0

Nice.  What min_nr was used in this test?

> +struct aio_wait_queue {
> +	int	nr_wait;	/* wake-up condition */

It appears that this is never assigned a negative?  Can we make that
explicit in the type so that we reviewers don't have to worry about
wrapping and signed comparisons?

> -	DECLARE_WAITQUEUE(wait, tsk);
> +	struct aio_wait_queue wait;
> +	aio_init_wait(&wait);

This just changed from using default_wake_function() to
autoremove_wake_function().  Very sneaky!  wait_for_all_aios() should
be adding the wait queue before going to sleep each time (better still
to just use wait_event()).  Was this on purpose?  I'm all for it as a
way to reduce wakeups from a stream of completions to a single waiter.

> +	nr_evt = ring->tail - ring->head;
> +	if (nr_evt < 0)
> +		nr_evt += info->nr;

int = unsigned - unsigned; if (int < 0).  My head already hurts.  Can
we clean this up so one doesn't have to live and breathe
type-conversion rules to tell whether this code is correct?

> +	if (waitqueue_active(&ctx->wait)) {
> +		struct aio_wait_queue *wait;
> +		wait = container_of(ctx->wait.task_list.next,
> +				struct aio_wait_queue, wait.task_list);
> +		if (nr_evt >= wait->nr_wait)
> +			wake_up(&ctx->wait);
> +	}

First is the fear of starvation as mentioned previously.

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

This makes me think this elegant task_list approach is doomed.  I
think this is what stopped Ben and me from being interested in this
the last time we talked about it :).

Also, is that container_of() and dereference safe in the presence of
racing wake-ups?  It looks like we could deref a freed wait, get a
bogus nr_wait, and decide not to wake.

Andrew, I fear we should remove this from -mm until it's fixed up.

- z
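As an aside, the signed/unsigned mixing complained about above can be
avoided by keeping the arithmetic unsigned and branching explicitly.
A sketch (the helper name is invented; head, tail, and nr mirror the
fields in the quoted patch):

	/* head and tail are ring indices that always lie in [0, nr) */
	static inline unsigned int aio_ring_events(unsigned int head,
						   unsigned int tail,
						   unsigned int nr)
	{
		/*
		 * No signed intermediate and no wraparound to reason
		 * about: if tail has wrapped past head, add one full
		 * ring length back in.
		 */
		return (tail >= head) ? tail - head : tail + nr - head;
	}

aio_complete() could then compare this count against the head waiter's
nr_wait with no implicit conversions.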