Re: [patch] aio: add per task aio wait event condition
> That sounds like a programming error, don't you think?  Maybe
> returning EINVAL is the right approach?

Maybe.  I think I'd prefer to be permissive and queue as much as
possible, but it's not a strong preference.  Returning EINVAL seems
ok, too.

- z
Re: [patch] aio: add per task aio wait event condition
==> On Wed, 3 Jan 2007 10:50:06 -0800, Zach Brown <[EMAIL PROTECTED]> said:

Zach> The controversial part here happens when min_nr is larger than the
Zach> ring size.  In that case I think we should consider min_nr to be
Zach> equal to the ring size.  We'll return fewer events than userspace
Zach> asked for, but it'll still be a big batch.

That sounds like a programming error, don't you think?  Maybe returning
EINVAL is the right approach?

-Jeff
Re: [patch] aio: add per task aio wait event condition
On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you gave earlier, task with min_nr of 2 will be
> > woken up after 4 completed events.
> >
> > I only gave 2 ios/events in that example.
> >
> > Does that clear up the confusion?
>
> It occurs to me that people might not be aware how peculiar the
> current io_getevents wakeup scheme is, to the extent of erratic
> behavior.

Yeah.  I have to admit that I hadn't fully grasped the behaviour of
the interface when we started this thread.  I realized just how insane
it was as I was finishing off last night.

> But if thread 2 sleeps first, thread 1 sleeps second, thread 1 wakes
> up on event _4_.  If someone asked me to describe the algorithm of
> the io_getevents wake-up scheme in the presence of multiple waiters,
> I would call it erratic and non-deterministic.

I agree completely.  This leads me to assume that essentially *no one*
has multiple waiters on a context who care about min_nr.  That's good
news because we can consider it the slow path for now :)

> So I can categorize my patchset as a bug fix instead of a performance
> patch ;-)  Let's be serious, this ought to be fixed one way or the
> other.

Yeah.  How about this:

Start with a cleanup of the event copying loop (a rough sketch follows
below):

 - prefault the user event destination with fault_in_pages_writable()
 - copy multiple events with _inatomic
 - only copy events if min_nr are available
 - back off and retry if the _inatomic copy faults
 - go back to sleep if too few events are available
 - return once a copy doesn't fault

(I guess one could only fall back to prefaulting if the _inatomic copy
fails.  I don't know if most event buffers fault or not; I'd guess
not.)

This would get rid of the nonsense of consuming events (possibly
depriving other concurrent waiters) without returning them to
userspace in a timely fashion.  We get rid of a lot of the
per-event-delivery overhead that the current loop suffers from.

The controversial part here happens when min_nr is larger than the
ring size.  In that case I think we should consider min_nr to be equal
to the ring size.  We'll return fewer events than userspace asked for,
but it'll still be a big batch.  Does io_getevents() returning +ve but
< min_nr raise alarm bells for anyone?  It's already done in the
timeout case, so arguably code already handles it.

Now, on the waking and sleeping side (also sketched below):

 - min_nr 0 and 1 sleepers wait as exclusive waiters in ctx->wait
 - maintain an rbtree of waiters with min_nr > 1
 - always wake ctx->wait if it's pending
 - wake rb_first of the tree if nr_in_ring >= node->min_nr

The rbtree nodes would have a task_struct pointer, and their waker
would use wake_up_process().  It'd be maintained under the ctx lock.
It'll be expensive to maintain in the case of lots of concurrent
min_nr > 1 waiters, but we're declaring that the slow path.

The fast path of a single waiter with min_nr > 1 becomes a pointer
indirection through the root node of the rbtree.  It should be just
like the current list_head use in the wait queue.  If the wait queue
head and the rbtree root node share a cacheline in the context, then
checking them both (and almost always finding only one of them
populated) shouldn't cost the completion path much.

I don't have a strong opinion about using exclusive wake-ups or not,
because I doubt we ever have concurrent waiters in the field.  I'm
leaning towards avoiding the thundering herd, even if that risks the
event delivery stream waiting behind a woken waiter who might be
blocking while prefaulting their destination event buffer.  We should
wake from io_getevents() before it sleeps or returns, then.  I could
easily be convinced otherwise.
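A minimal sketch of the copy loop described above, for concreteness.
aio_ring_avail(), aio_peek_events(), and aio_consume_events() are
hypothetical helpers (peek snapshots events without advancing the ring
head, consume advances it); fault_in_pages_writable(),
__copy_to_user_inatomic(), and pagefault_disable()/enable() are the
real interfaces of this era:

	/*
	 * Sketch only -- not an actual patch.  The shape to note:
	 * events are copied with the cheap _inatomic variant and only
	 * consumed from the ring after the copy succeeds, so a fault
	 * never strands events that other waiters could have used.
	 */
	static long copy_events_to_user(struct kioctx *ctx,
					struct io_event __user *uevents,
					long min_nr, long nr)
	{
		struct io_event ev[16];		/* kernel-side batch */
		long copied = 0;

		/* consume nothing until at least min_nr are queued */
		if (aio_ring_avail(ctx) < min_nr)
			return 0;	/* caller goes back to sleep */

		while (copied < nr) {
			long avail = aio_peek_events(ctx, ev,
						     min(nr - copied, 16L));
			if (avail <= 0)
				break;

			pagefault_disable();
			if (__copy_to_user_inatomic(uevents + copied, ev,
						    avail * sizeof(*ev))) {
				pagefault_enable();
				/* back off: prefault the dest, retry */
				if (fault_in_pages_writable((char __user *)
							(uevents + copied),
							avail * sizeof(*ev)))
					return copied ? copied : -EFAULT;
				continue;
			}
			pagefault_enable();

			/* the copy stuck; now take them off the ring */
			aio_consume_events(ctx, avail);
			copied += avail;
		}
		return copied;
	}

And a sketch of the wake-up side.  The ctx->waiters tree and struct
aio_waiter are invented names; rb_first(), rb_entry(),
waitqueue_active(), wake_up(), and wake_up_process() are real:

	struct aio_waiter {
		struct rb_node		node;	/* keyed by min_nr */
		long			min_nr;
		struct task_struct	*task;
	};

	/* called on completion with ctx->ctx_lock held */
	static void aio_wake_waiters(struct kioctx *ctx, long nr_in_ring)
	{
		struct rb_node *first;

		/* min_nr 0 and 1 sleepers: plain exclusive wait queue */
		if (waitqueue_active(&ctx->wait))
			wake_up(&ctx->wait);

		/* the smallest min_nr > 1 sleeper sits at rb_first() */
		first = rb_first(&ctx->waiters);
		if (first) {
			struct aio_waiter *w =
				rb_entry(first, struct aio_waiter, node);

			if (nr_in_ring >= w->min_nr)
				wake_up_process(w->task);
		}
	}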
The wait queue lets us support lots of min_nr == 1 waiters without the
rbtree maintenance overhead...  they're not as self-evidently broken in
the current code as multiple min_nr > 1 waiters are, so they may well
be in use?  who knows.  It's easy.

How's this all sound?  I can throw this together if you'd prefer it
that way.  Though I suspect you'll want to keep hacking on this :).

Finally, let's get a test case for multiple min_nr > 1 waiters into
autotest.  I'd use our trivial example where the min_nr = 2 waiter
fires off 3 ios for the concurrent min_nr = 3 waiter, to ensure that
things don't get stuck.

- z
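Such a test might look roughly like this with libaio -- a sketch under
stated assumptions: "testfile" must exist with at least 512 bytes of
data, error handling is minimal, and the autotest harness glue is
omitted.  io_setup(), io_submit(), io_getevents(), and io_prep_pread()
are the real libaio calls; everything else is scaffolding:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <pthread.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static io_context_t ctx;
	static int fd;
	static char buf[3][512] __attribute__((aligned(512)));

	static void submit(int n)	/* fire off n 512-byte reads */
	{
		struct iocb cb[3], *cbs[3];
		int i;

		for (i = 0; i < n; i++) {
			io_prep_pread(&cb[i], fd, buf[i], 512, 0);
			cbs[i] = &cb[i];
		}
		if (io_submit(ctx, n, cbs) != n) {
			perror("io_submit");
			exit(1);
		}
	}

	static void *min_nr_3_waiter(void *unused)
	{
		struct io_event ev[3];
		int r = io_getevents(ctx, 3, 3, ev, NULL);

		printf("min_nr=3 waiter got %d events\n", r);
		return NULL;
	}

	int main(void)
	{
		struct io_event ev[2];
		pthread_t t;
		int r;

		fd = open("testfile", O_RDONLY | O_DIRECT);
		if (fd < 0 || io_setup(8, &ctx) != 0) {
			perror("setup");
			return 1;
		}

		pthread_create(&t, NULL, min_nr_3_waiter, NULL);
		sleep(1);	/* let the min_nr = 3 waiter block first */
		submit(2);

		/* min_nr = 2 waiter; must reap both completions */
		r = io_getevents(ctx, 2, 2, ev, NULL);
		printf("min_nr=2 waiter got %d events\n", r);

		submit(3);	/* now satisfy the min_nr = 3 waiter */

		/* under the erratic round-robin behavior this can wedge,
		 * which is exactly what the test is meant to catch */
		pthread_join(t, NULL);
		return 0;
	}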
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you gave earlier, task with min_nr of 2 will be
> > woken up after 4 completed events.
>
> I only gave 2 ios/events in that example.
>
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the
current io_getevents wakeup scheme is, to the extent of erratic
behavior.  In the blocking path of read_events(), we are essentially
doing the following loop (simplified for clarity):

	while (i < nr) {
		add_wait_queue_exclusive(&ctx->wait, &wait);
		do {
			ret = aio_read_evt(ctx, &ent);
			if (!ret)
				schedule();
		} while (!ret);
		remove_wait_queue(&ctx->wait, &wait);
		copy_to_user(event + i, &ent, sizeof(ent));
		i++;
	}

Notice that when a thread comes out of schedule(), it removes itself
from the wait queue and requeues itself at the tail of the wait queue
for each and every event it reaps.  So if there are multiple threads
waiting in io_getevents, completed I/Os are handed out in a
round-robin scheme to all waiting threads.  To illustrate it in an
ascii graph, here is what happens:

	thread 1			thread 2
	--------			--------
	queue at head
	schedule()
					queue at 2nd position
					schedule()

		  aio_complete (event 1)

	remove_wait_queue
	(now thread 2 is at head)
	reap event 1
	requeue at tail
	schedule()

		  aio_complete (event 2)

					remove_wait_queue
					(now thread 1 is at head)
					reap event 2
					requeue at tail
					schedule()

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps second
with min_nr = 3, then thread 1 wakes up on event _3_.  But if thread 2
sleeps first, thread 1 sleeps second, thread 1 wakes up on event _4_.
If someone asked me to describe the algorithm of the io_getevents
wake-up scheme in the presence of multiple waiters, I would call it
erratic and non-deterministic.

Looking back at the example Zach gave earlier, the current
implementation behaves just like what he described as an undesired bug
(modified and tortured):

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete
	first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a performance
patch ;-)  Let's be serious, this ought to be fixed one way or the
other.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks are waiting for
> >>> events, they enter the wait queue in FIFO order;
> >>> prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> >>> first io_getevents() with min_nr of 2 will be woken up when 2 ops
> >>> complete.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though: whoever sleeps first
> > will be woken up first.
>
> Why would the min_nr = 3 sleeper be woken up in that case?  Only 2
> ios were issued.
>
> Maybe the app was relying on the min_nr = 2 completion to issue 3
> more ios for the min_nr = 3 sleeper, who knows.
>
> Does that clear up the confusion?

Not really.  I don't think I understand your concern.  You gave an
example:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

Or:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 3
	second io_getevents sleeps with min_nr of 2
	2 ops complete but only test the second sleeper's min_nr of 2
	first sleeper twiddles thumbs

The first scenario doesn't exist because in the new scheme, we test
the first sleeper (as in, the head of the queue) when 2 ops complete.
It wakes up first.  The second scenario is OK with me because the
first sleeper is waiting for 3 events, and only 2 ops have completed,
so it waits.

The one scenario I can think of that breaks down is when one task
sleeps with min_nr of 100, then 50 ops complete, and a second thread
comes along, does an io_getevents, and takes all 50 events.  Is that
what you are talking about?  It doesn't involve two sleepers.  That I
can fix by testing whether the wait queue is active or not at the
beginning of the fast path in read_events() (see the sketch below).

The bigger question is: what are the semantics of event reap order
across threads?  Random, FIFO, or round robin?  It is not specified
anywhere.  What would be the optimal policy?
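A minimal sketch of that fast-path check, with read_events() reduced
to its bare shape (the real function also handles timeouts and the
sleeping slow path); waitqueue_active() and aio_read_evt() are the
real interfaces of the era:

	static int read_events(struct kioctx *ctx, long min_nr, long nr,
			       struct io_event __user *event)
	{
		struct io_event ent;
		int i = 0, ret;

		/*
		 * Fast path: reap already-completed events without
		 * sleeping, but only if nobody is already blocked in
		 * io_getevents().  Otherwise a late arrival could strip
		 * events out from under a sleeper still short of its
		 * min_nr.
		 */
		while (i < nr && !waitqueue_active(&ctx->wait)) {
			ret = aio_read_evt(ctx, &ent);
			if (ret <= 0)
				break;
			if (copy_to_user(event + i, &ent, sizeof(ent)))
				return i ? i : -EFAULT;
			i++;
		}
		if (i >= min_nr)
			return i;

		/* ...otherwise enter the sleeping slow path (elided) */
		return i;
	}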
Re: [patch] aio: add per task aio wait event condition
On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > > That is not possible because when multiple tasks are waiting for
> > > events, they enter the wait queue in FIFO order;
> > > prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> > > first io_getevents() with min_nr of 2 will be woken up when 2 ops
> > > complete.
> >
> > So switch the order of the two sleepers in the example?
>
> Not sure why that would be a problem though: whoever sleeps first
> will be woken up first.

Why would the min_nr = 3 sleeper be woken up in that case?  Only 2 ios
were issued.

Maybe the app was relying on the min_nr = 2 completion to issue 3 more
ios for the min_nr = 3 sleeper, who knows.

> Before I challenge those semantics, I want to mention that in the
> current implementation, dribbling AIO events will be distributed in
> round-robin fashion to all pending tasks waiting in io_getevents.

Yeah, don't misunderstand me -- we agree that the current situation is
bad.

> In the example you gave earlier, the task with min_nr of 2 will be
> woken up after 4 completed events.

I only gave 2 ios/events in that example.

Does that clear up the confusion?

- z
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks are waiting for
> > events, they enter the wait queue in FIFO order;
> > prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> > first io_getevents() with min_nr of 2 will be woken up when 2 ops
> > complete.
>
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though: whoever sleeps first will
be woken up first.

> The point is that there's no way to guarantee that the head of the
> wait queue will be the lowest min_nr.

Before I challenge those semantics, I want to mention that in the
current implementation, dribbling AIO events will be distributed in
round-robin fashion to all pending tasks waiting in io_getevents.  In
the example you gave earlier, the task with min_nr of 2 will be woken
up after 4 completed events.  I consider that undesirable behavior as
well.

Going back to your counter-argument: why do we need the lowest min_nr
at the head of the queue?  These are tasks that share one aio ctx, and
an ioctx is shareable only among threads.  Any reason why a
round-robin policy is superior to FIFO?  Also, presumably, threads
that share an ioctx should be capable of handling events for the same
ioctx.

From the wakeup-order point of view, yes, tasks with the lowest min_nr
wake up first; but looking at the io completion order, they do not.
And that is the source of excessive context switches.
Re: [patch] aio: add per task aio wait event condition
> That is not possible because when multiple tasks are waiting for
> events, they enter the wait queue in FIFO order;
> prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
> first io_getevents() with min_nr of 2 will be woken up when 2 ops
> complete.

So switch the order of the two sleepers in the example?  The point is
that there's no way to guarantee that the head of the wait queue will
be the lowest min_nr.

I got list_add() from the add_wait_queue() still being used in
wait_for_all_aios(), fwiw.  My mistake.

- z
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wakes
> > up a process when that condition is met.  And this condition is
> > added on a per-task basis for handling multi-threaded apps that
> > share a single ioctx.
>
> But only one of the waiting tasks is tested, the one at the head of
> the list.  It looks like this change could starve an io_getevents()
> with a low min_nr in the presence of another io_getevents() with a
> larger min_nr.
>
> > +	if (waitqueue_active(&ctx->wait)) {
> > +		struct aio_wait_queue *wait;
> > +		wait = container_of(ctx->wait.task_list.next,
> > +				struct aio_wait_queue, wait.task_list);
> > +		if (nr_evt >= wait->nr_wait)
> > +			wake_up(&ctx->wait);
> > +	}
>
> First is the fear of starvation as mentioned previously.
>
>	issue 2 ops
>	first io_getevents sleeps with a min_nr of 2
>	second io_getevents sleeps with min_nr of 3
>	2 ops complete but only test the second sleeper's min_nr of 3
>	first sleeper twiddles thumbs

That is not possible, because when multiple tasks are waiting for
events they enter the wait queue in FIFO order:
prepare_to_wait_exclusive() does __add_wait_queue_tail().  So the
first io_getevents(), with min_nr of 2, will be woken up when 2 ops
complete.
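For reference, prepare_to_wait_exclusive() in kernel/wait.c of this
era looks roughly like this (lightly condensed; the async-iocb
special case is dropped here).  The tail insertion is what gives the
FIFO ordering:

	void fastcall
	prepare_to_wait_exclusive(wait_queue_head_t *q,
				  wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue_tail(q, wait);	/* tail => FIFO */
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}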
Re: [patch] aio: add per task aio wait event condition
On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:

> The AIO wake-up notification from aio_complete is really inefficient
> in the current AIO implementation in the presence of a process
> waiting in io_getevents().

Yeah, it's a real deficiency.  Thanks for taking a stab at it.

> This patch adds a wait condition to the wait queue and only wakes up
> a process when that condition is met.  And this condition is added on
> a per-task basis for handling multi-threaded apps that share a single
> ioctx.

But only one of the waiting tasks is tested, the one at the head of
the list.  It looks like this change could starve an io_getevents()
with a low min_nr in the presence of another io_getevents() with a
larger min_nr.

> Before:
>  0  0  0 3972608  7056 3131200  14100  0  7885 13747  0  2 98  0
> After:
>  0  0  0 3972608  7056 3131200  13800  0  7885    42  0  2 98  0

Nice.  What min_nr was used in this test?

> +struct aio_wait_queue {
> +	int	nr_wait;	/* wake-up condition */

It appears that this is never assigned a negative?  Can we make that
explicit in the type so that we reviewers don't have to worry about
wrapping and signed comparisons?

> -	DECLARE_WAITQUEUE(wait, tsk);
> +	struct aio_wait_queue wait;
> +	aio_init_wait(&wait);

This just changed from using default_wake_function() to
autoremove_wake_function().  Very sneaky!  wait_for_all_aios() should
be adding the wait queue before going to sleep each time (better still
to just use wait_event()).  Was this on purpose?  I'm all for it as a
way to reduce wakeups from a stream of completions to a single waiter.

> +	nr_evt = ring->tail - ring->head;
> +	if (nr_evt < 0)
> +		nr_evt += info->nr;

int = unsigned - unsigned; if (int < 0).  My head already hurts.  Can
we clean this up so one doesn't have to live and breathe
type-conversion rules to tell whether this code is correct?

> +	if (waitqueue_active(&ctx->wait)) {
> +		struct aio_wait_queue *wait;
> +		wait = container_of(ctx->wait.task_list.next,
> +				struct aio_wait_queue, wait.task_list);
> +		if (nr_evt >= wait->nr_wait)
> +			wake_up(&ctx->wait);
> +	}

First is the fear of starvation as mentioned previously.

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

This makes me think this elegant task_list approach is doomed.  I
think this is what stopped Ben and me from being interested in this
the last time we talked about it :).

Also, is that container_of() and dereference safe in the presence of
racing wake-ups?  It looks like we could deref a freed wait, get a
bogus nr_wait, and decide not to wake.

Andrew, I fear we should remove this from -mm until it's fixed up.

- z
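As an aside, the signed/unsigned mixing complained about above can be
avoided by keeping the arithmetic unsigned and branching explicitly.
A sketch (the helper name is invented; head, tail, and nr mirror the
fields in the quoted patch):

	/* head and tail are ring indices that always lie in [0, nr) */
	static inline unsigned int aio_ring_events(unsigned int head,
						   unsigned int tail,
						   unsigned int nr)
	{
		/*
		 * No signed intermediate and no wraparound to reason
		 * about: if tail has wrapped past head, add one full
		 * ring length back in.
		 */
		return (tail >= head) ? tail - head : tail + nr - head;
	}

aio_complete() could then compare this count against the head waiter's
nr_wait with no implicit conversions.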