RE: [patch] optimize o_direct on block device - v3
Randy Dunlap wrote on Thursday, January 11, 2007 1:45 PM
> > +/* return a pge back to pvec array */
>
> is pge just a typo or some other tla that i don't know?
> (not portland general electric or pacific gas & electric)

Typo with fat fingers.  Thanks for catching it.  Full patch with the typo
fixed:

[patch] fix blk_direct_IO bio preparation.

For a large DIO that needs multiple bios, one full page worth of data was
lost at the boundary of the bio's maximum sector or segment limits.  After
a bio is full and gets submitted, the outer "while (nbytes)" loop allocates
a new bio and just marches on to index into the next page.  It forgets
about the page that bio_add_page() rejected when the previous bio was
full.  Fix it by putting the rejected page back into pvec so we pick it up
again for the next bio.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.20-rc4/fs/block_dev.c linux-2.6.20.ken/fs/block_dev.c
--- linux-2.6.20-rc4/fs/block_dev.c	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20.ken/fs/block_dev.c	2007-01-10 19:54:53.0 -0800
@@ -190,6 +190,12 @@ static struct page *blk_get_page(unsigne
 	return pvec->page[pvec->idx++];
 }
 
+/* return a page back to pvec array */
+static void blk_unget_page(struct page *page, struct pvec *pvec)
+{
+	pvec->page[--pvec->idx] = page;
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t pos, unsigned long nr_segs)
@@ -278,6 +284,8 @@ same_bio:
 				count = min(count, nbytes);
 				goto same_bio;
 			}
+		} else {
+			blk_unget_page(page, &pvec);
 		}
 
 		/* bio is ready, submit it */
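For readers outside the thread, here is an illustrative sketch of the
pattern at issue -- a hypothetical helper, not the fs/block_dev.c code:
bio_add_page() returns 0 once a bio hits its sector or segment limits,
and the rejected page must be offered to the next bio rather than
skipped.

static void build_and_submit(int rw, struct block_device *bdev,
			     struct page **pages, int nr_pages,
			     sector_t sector)
{
	struct bio *bio = NULL;
	int i = 0;

	while (i < nr_pages) {
		if (!bio) {
			bio = bio_alloc(GFP_KERNEL,
					min_t(int, nr_pages - i, BIO_MAX_PAGES));
			bio->bi_bdev = bdev;
			bio->bi_sector = sector + i * (PAGE_SIZE >> 9);
		}
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0)) {
			i++;			/* page accepted */
		} else {
			/*
			 * bio full: submit it, but do NOT advance i --
			 * pages[i] must go into the next bio.  The bug
			 * fixed above was the equivalent of advancing
			 * here.  (Completion handling elided.)
			 */
			submit_bio(rw, bio);
			bio = NULL;
		}
	}
	if (bio)
		submit_bio(rw, bio);
}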
RE: [patch] optimize o_direct on block device - v3
Andrew Morton wrote on Thursday, January 11, 2007 11:29 AM
> On Thu, 11 Jan 2007 13:21:57 -0600
> Michael Reed <[EMAIL PROTECTED]> wrote:
> > Testing on my ia64 system reveals that this patch introduces a
> > data integrity error for direct i/o to a block device.  Device
> > errors which result in i/o failure do not propagate to the
> > process issuing direct i/o to the device.
> >
> > This can be reproduced by doing writes to a fibre channel block
> > device and then disabling the switch port connecting the host
> > adapter to the switch.
>
> Does this fix it?

Darn, kicking myself in the butt.  Thank you Andrew for fixing this.

We've also been running a DIO stress test almost non-stop over the last 30
days or so, and we did uncover another bug in that patch.  Andrew, would
you please take the following bug fix patch as well.  It is critical
because it also affects data integrity.

[patch] fix blk_direct_IO bio preparation.

For a large DIO that needs multiple bios, one full page worth of data was
lost at the boundary of the bio's maximum sector or segment limits.  After
a bio is full and gets submitted, the outer "while (nbytes)" loop allocates
a new bio and just marches on to index into the next page.  It forgets
about the page that bio_add_page() rejected when the previous bio was
full.  Fix it by putting the rejected page back into pvec so we pick it up
again for the next bio.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.20-rc4/fs/block_dev.c linux-2.6.20.ken/fs/block_dev.c
--- linux-2.6.20-rc4/fs/block_dev.c	2007-01-06 21:45:51.0 -0800
+++ linux-2.6.20.ken/fs/block_dev.c	2007-01-10 19:54:53.0 -0800
@@ -190,6 +190,12 @@ static struct page *blk_get_page(unsigne
 	return pvec->page[pvec->idx++];
 }
 
+/* return a pge back to pvec array */
+static void blk_unget_page(struct page *page, struct pvec *pvec)
+{
+	pvec->page[--pvec->idx] = page;
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t pos, unsigned long nr_segs)
@@ -278,6 +284,8 @@ same_bio:
 				count = min(count, nbytes);
 				goto same_bio;
 			}
+		} else {
+			blk_unget_page(page, &pvec);
 		}
 
 		/* bio is ready, submit it */
RE: [PATCH] 4/4 block: explicit plugging
Andrew Morton wrote on Wednesday, January 03, 2007 12:09 AM
> Do you have any benchmarks which got faster with these changes?

Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish. I can run some pseudo benchmarks on a
> 4-way here with some simulated storage to demonstrate the locking
> improvements.

Jens Axboe wrote on Thursday, January 04, 2007 6:39 AM
> > I will just retake the tip of your plug tree and retest.
>
> That would be great! There's a busy race fixed in the current branch,
> make sure that one is included as well.

Good news: the tip of the plug tree fixed the FC loop reset issue we were
seeing earlier.

Performance wise, our big db benchmark run came out with a 0.14%
regression compared to 2.6.20-rc2.  It is small enough that we declared it
a noise level change.  Unfortunately our internal profile tool broke on
2.6.20-rc2, so I don't have an execution profile to post.

- Ken
RE: open(O_DIRECT) on a tmpfs?
Hugh Dickins wrote on Thursday, January 04, 2007 11:14 AM
> On Thu, 4 Jan 2007, Hua Zhong wrote:
> > So I'd argue that it makes more sense to support O_DIRECT
> > on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.  Perhaps I'm
> out of date: when O_DIRECT came in, just a few filesystems supported
> it, and it was perfectly normal for open O_DIRECT to be failed; but
> I wouldn't want tmpfs to stand out now as a lone obstacle.

Maybe a bit hackish, but all we need is an empty .direct_IO method in
shmem_aops to make __dentry_open() pass the O_DIRECT check.  The
following patch adds 40 bytes to the kernel text on x86-64.  An even more
hackish but zero cost route is to make the .direct_IO member non-zero via
a cast of -1 or some such (that is probably ugly as hell).

diff -Nurp linus-2.6.git/mm/shmem.c linus-2.6.git.ken/mm/shmem.c
--- linus-2.6.git/mm/shmem.c	2006-12-27 19:06:11.0 -0800
+++ linus-2.6.git.ken/mm/shmem.c	2007-01-04 21:03:14.0 -0800
@@ -2314,10 +2314,18 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+ssize_t shmem_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs)
+{
+	/* dummy direct_IO function. Not to be executed */
+	BUG();
+}
+
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 #ifdef CONFIG_TMPFS
+	.direct_IO	= shmem_direct_IO,
 	.prepare_write	= shmem_prepare_write,
 	.commit_write	= simple_commit_write,
 #endif
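For reference, the open-time check that the dummy method satisfies looks
roughly like this in __dentry_open() in fs/open.c of that era (quoted
from memory, so treat it as approximate):

	if (f->f_flags & O_DIRECT) {
		if (!f->f_mapping->a_ops ||
		    ((!f->f_mapping->a_ops->direct_IO) &&
		     (!f->f_mapping->a_ops->get_xip_page))) {
			fput(f);
			f = ERR_PTR(-EINVAL);
		}
	}

Any non-NULL .direct_IO pointer is enough to get past this test, which is
why a never-called BUG() stub works.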
RE: [PATCH] 4/4 block: explicit plugging
Jens Axboe wrote on Wednesday, January 03, 2007 2:30 PM
> > We are having some trouble with the patch set: some of our fiber channel
> > host controllers don't initialize properly anymore, and we thus lost a
> > whole bunch of disks (somewhere around 200 out of 900) at boot time.
> > Presumably the FC loop initialization commands are done through the
> > block layer etc.  I haven't looked into the problem closely.
> >
> > Jens, I assume the spin lock bug in __blk_run_queue is fixed in this
> > patch set?
>
> It is. Are you still seeing problems after the initial mail exchange we
> had prior to christmas,

Yes.  Not the same kernel panic, but a problem with the FC loop reset
itself.

> or are you referencing that initial problem?

No, we got past that point thanks to the bug fix patch you gave me prior
to Christmas.  That fixed a kernel panic on boot up.

> It's not likely to be a block layer issue, more likely the SCSI <->
> block interactions. If you mail me a new dmesg (if your problem is with
> the __blk_run_queue() fixups), I can take a look. Otherwise please do
> test with the __blk_run_queue() fixup, just use the current patchset.

I will just retake the tip of your plug tree and retest.
RE: [PATCH] 4/4 block: explicit plugging
Jens Axboe wrote on Wednesday, January 03, 2007 12:22 AM
> > Do you have any benchmarks which got faster with these changes?
>
> On the hardware I have immediately available, I see no regressions wrt
> performance. With instrumentation it's simple to demonstrate that most
> of the queueing activity of an io heavy benchmark spends less time in
> the kernel (most merging activity takes place outside of the queue lock,
> hence queueing is lock free).
>
> I've asked Ken to run this series on some of his big iron, I hope he'll
> have some results for us soonish.

We are having some trouble with the patch set: some of our fiber channel
host controllers don't initialize properly anymore, and we thus lost a
whole bunch of disks (somewhere around 200 out of 900) at boot time.
Presumably the FC loop initialization commands are done through the block
layer etc.  I haven't looked into the problem closely.

Jens, I assume the spin lock bug in __blk_run_queue is fixed in this patch
set?

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you
> > gave earlier, task with min_nr of 2 will be woken up after 4 completed
> > events.
>
> I only gave 2 ios/events in that example.
>
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the current
io_getevents wakeup scheme is, to the extent of erratic behavior.

In the blocking path of read_events(), we essentially do the following
loop (simplified for clarity):

	while (i < nr) {
		add_wait_queue_exclusive(&ctx->wait, &wait);
		do {
			ret = aio_read_evt(ctx, &ent);
			if (!ret)
				schedule();
		} while (1);
		remove_wait_queue(&ctx->wait, &wait);
		copy_to_user(event, &ent, sizeof(ent));
	}

Notice that when a thread comes out of schedule(), it removes itself from
the wait queue, and requeues itself at the end of the wait queue, for
each and every event it reaps.  So if there are multiple threads waiting
in io_getevents, completed I/Os are handed out in a round robin scheme to
all waiting threads.  To illustrate it in an ascii graph, here is what
happens:

	thread 1			thread 2

	queue at head
	schedule()
					queue at 2nd position
					schedule()

	aio_complete (event 1)

	remove_wait_queue
	(now thread 2 is at head)
	reap event 1
	requeue at tail
	schedule()
					aio_complete (event 2)

					remove_wait_queue
					(now thread 1 is at head)
					reap event 2
					requeue at tail
					schedule()

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps second with
min_nr = 3, then thread 1 wakes up on event _3_.  But if thread 2 sleeps
first and thread 1 sleeps second, thread 1 wakes up on event _4_.  If
someone asked me to describe the algorithm of the io_getevents wake-up
scheme in the presence of multiple waiters, I would call it erratic and
non-deterministic.

Looking back at the example Zach gave earlier, the current implementation
behaves just like what he described as an undesired bug (modified and
tortured):

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete
	first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a performance
patch ;-)  Let's be serious, this ought to be fixed one way or the other.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks waiting for
> >>> events, they
> >>> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> >>> __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
> >>> will be woken up when 2 ops completes.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though: whoever sleep first will
> > be woken up first.
>
> Why would the min_nr = 3 sleeper be woken up in that case?  Only 2
> ios were issued.
>
> Maybe the app was relying on the min_nr = 2 completion to issue 3
> more ios for the min_nr = 3 sleeper, who knows.
>
> Does that clear up the confusion?

Not really.  I don't think I understand your concern.  You gave an
example:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 2
	second io_getevents sleeps with min_nr of 3
	2 ops complete but only test the second sleeper's min_nr of 3
	first sleeper twiddles thumbs

Or:

	issue 2 ops
	first io_getevents sleeps with a min_nr of 3
	second io_getevents sleeps with min_nr of 2
	2 ops complete but only test the second sleeper's min_nr of 2
	first sleeper twiddles thumbs

The first scenario doesn't exist because in the new scheme, we test the
first sleeper (as in the head of the queue) when 2 ops complete.  It
wakes up first.  The 2nd scenario is OK to me because the first sleeper
is waiting for 3 events, and there are only 2 ops completed, so it waits.

The one scenario that I can think of that breaks down is that one task
sleeps with min_nr of 100.  Then 50 ops complete.  Along comes a 2nd
thread doing io_getevents, and it will take all 50 events.  Is that what
you are talking about?  It doesn't involve two sleepers.  That I can fix
by testing whether the wait queue is active or not at the beginning of
the fast path in read_events().

The bigger question is: what is the semantics of event reap order for
threads?  Random, FIFO or round robin?  It is not specified anywhere.
What would be the most optimal policy?
RE: [patch] aio: streamline read events after woken up
Zach Brown wrote on Tuesday, January 02, 2007 5:06 PM
> To: Chen, Kenneth W
> > Given the previous patch "aio: add per task aio wait event condition"
> > that we properly wake up event waiting process knowing that we have
> > enough events to reap, it's just plain waste of time to insert itself
> > into a wait queue, and then immediately remove itself from the wait
> > queue for *every* event reap iteration.
>
> Hmm, I dunno.  It seems like we're still left with a pretty silly loop.
>
> Would it be reasonable to have a loop that copied multiple events at
> a time?  We could use some __copy_to_user_inatomic(), it didn't exist
> when this stuff was first written.

It sounds reasonable, but I think it will be complicated because of the
kmap_atomic on the ring buffer, along with the tail wrapping around the
ring buffer index there.  By then, most of you would probably veto the
patch anyway ;-)
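What a batched copy might look like, as a rough sketch only: this
hypothetical helper pretends the ring is one flat kernel mapping
('evts'), which is precisely the simplification the real fs/aio.c does
not get -- its ring is spread across pages and needs a kmap_atomic per
page, the complication mentioned above.

	static long copy_events_batched(struct io_event *evts,
					unsigned ring_nr, unsigned *head,
					unsigned tail,
					struct io_event __user *uevent,
					long want)
	{
		long copied = 0;

		while (copied < want && *head != tail) {
			/* contiguous run: stop at tail or the wrap point */
			unsigned end = (*head < tail) ? tail : ring_nr;
			unsigned run = min_t(unsigned, end - *head,
					     want - copied);

			if (__copy_to_user_inatomic(uevent + copied,
						    evts + *head,
						    run * sizeof(*evts)))
				break;	/* fall back to a slow path */

			*head = (*head + run) % ring_nr;
			copied += run;
		}
		return copied;
	}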
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks waiting for
> > events, they
> > enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> > __add_wait_queue_tail().  So first io_getevents() with min_nr of 2
> > will be woken up when 2 ops completes.
>
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though: whoever sleeps first will be
woken up first.

> The point is that there's no way to guarantee that the head of the
> wait queue will be the lowest min_nr.

Before I challenge that semantics, I want to mention that in the current
implementation, dribbling AIO events will be distributed in round robin
fashion to all pending tasks waiting in io_getevents.  In the example you
gave earlier, the task with min_nr of 2 will be woken up after 4
completed events.  I consider that an undesirable behavior as well.

Going back to your counter argument, why do we need the lowest min_nr at
the head of the queue?  These are tasks that share one aio ctx, and an
ioctx is shareable only among threads.  Any reason why a round robin
policy is superior to FIFO?  Also presumably, threads that share an ioctx
should be capable of handling events for the same ioctx.

From the wakeup order point of view, yes, tasks with the lowest min_nr
wake up first, but looking at it from the io completion order, they do
not.  And these are the source of excessive ctx switches.
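The FIFO claim comes straight from the waitqueue code; condensed from
kernel/wait.c of that era (the async-wait detail is elided here, so
treat this as a sketch rather than a verbatim quote):

	void prepare_to_wait_exclusive(wait_queue_head_t *q,
				       wait_queue_t *wait, int state)
	{
		unsigned long flags;

		wait->flags |= WQ_FLAG_EXCLUSIVE;
		spin_lock_irqsave(&q->lock, flags);
		if (list_empty(&wait->task_list))
			__add_wait_queue_tail(q, wait);	/* new waiters at tail */
		set_current_state(state);
		spin_unlock_irqrestore(&q->lock, flags);
	}

Since __wake_up() roughly walks the list from the head and stops after
waking an exclusive waiter, exclusive waiters are served in arrival
order.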
RE: [patch] aio: make aio_ring_info->nr_pages an unsigned int
Zach Brown wrote on Tuesday, January 02, 2007 5:14 PM
> To: Chen, Kenneth W
> > --- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
> > +++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
> > @@ -165,7 +165,7 @@ struct aio_ring_info {
> >
> >  	struct page		**ring_pages;
> >  	spinlock_t		ring_lock;
> > -	long			nr_pages;
> > +	unsigned		nr_pages;
> >
> >  	unsigned		nr, tail;
>
> Hmm.
>
> This seems so trivial as to not be worth it.  It'd be more compelling
> if it was more thorough -- doing things like updating the 'long i'
> iterators that a few have over ->nr_pages.  That kind of thing.
> Giving some confidence that the references of ->nr_pages were audited.

I had those changes earlier, but dropped them to make the patch smaller.
It all started with the head and tail index, which are defined as
unsigned int in the structure, while in aio.c all local variables doing
temporary head and tail calculations are unsigned long.  While cleaning
that up, it got expanded into nr_pages etc.  Oh well.

- Ken
RE: [patch] aio: add per task aio wait event condition
Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wake-up
> > process when that condition meets.  And this condition is added on a
> > per task base for handling multi-threaded app that shares single
> > ioctx.
>
> But only one of the waiting tasks is tested, the one at the head of
> the list.  It looks like this change could starve a io_getevents()
> with a low min_nr in the presence of another io_getevents() with a
> larger min_nr.
>
> > +	if (waitqueue_active(&ctx->wait)) {
> > +		struct aio_wait_queue *wait;
> > +		wait = container_of(ctx->wait.task_list.next,
> > +				struct aio_wait_queue, wait.task_list);
> > +		if (nr_evt >= wait->nr_wait)
> > +			wake_up(&ctx->wait);
> > +	}
>
> First is the fear of starvation as mentioned previously.
>
> 	issue 2 ops
> 	first io_getevents sleeps with a min_nr of 2
> 	second io_getevents sleeps with min_nr of 3
> 	2 ops complete but only test the second sleeper's min_nr of 3
> 	first sleeper twiddles thumbs

That is not possible, because when multiple tasks are waiting for events,
they enter the wait queue in FIFO order; prepare_to_wait_exclusive() does
__add_wait_queue_tail().  So the first io_getevents() with min_nr of 2
will be woken up when 2 ops complete.
RE: [patch] remove redundant iov segment check
Zach Brown wrote on Tuesday, January 02, 2007 10:22 AM
> >> I wonder if it wouldn't be better to make this change as part of a
> >> larger change that moves towards an explicit iovec container struct
> >> rather than bare 'struct iov *' and 'nr_segs' arguments.
> >
> > I suspect it should be rather trivial to get this started.  As a first
> > step we simply add a
> >
> > struct iodesc {
> > 	int nr_segs;
> > 	struct iovec ioc[];
> > };
>
> Agreed, does anyone plan to try this in the near future?

Yes, I will take a stab at it this week.
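As a sketch of where that could go (hypothetical; neither the struct
layout nor the helper is an API that exists, just the shape floated
above):

	struct iodesc {
		unsigned long nr_segs;
		size_t count;		/* cached total byte length */
		struct iovec iov[];	/* flexible array of segments */
	};

	/* validate segment lengths once and cache the total */
	static inline ssize_t iodesc_set_length(struct iodesc *io)
	{
		size_t count = 0;
		unsigned long seg;

		for (seg = 0; seg < io->nr_segs; seg++) {
			ssize_t len = (ssize_t)io->iov[seg].iov_len;

			/* reject negative lengths and total overflow,
			 * as the generic iovec checks do */
			if (len < 0 || count + len < count)
				return -EINVAL;
			count += len;
		}
		io->count = count;
		return count;
	}

Callers would then pass a single 'struct iodesc *' down the stack instead
of re-checking (iov, nr_segs) pairs at every layer, which is the
redundancy the patch in this thread removes by hand.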
[patch] aio: make aio_ring_info->nr_pages an unsigned int
The number of io_events currently allowed in an AIO event queue is no
more than 2^32-1, because the syscall defines:

	asmlinkage long sys_io_setup(unsigned nr_events,
				     aio_context_t __user *ctxp);

We internally allocate a ring buffer for nr_events and keep track of the
page descriptors for each of these ring buffer pages.  Since the page
size is significantly larger than the AIO event size (4096 versus 32
bytes), I don't think it is ever possible to overflow nr_pages as a
32-bit quantity.  This patch changes nr_pages to unsigned int.  On a
64-bit arch, changing it to unsigned int also allows better packing of
the aio_ring_info structure.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
+++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
@@ -165,7 +165,7 @@ struct aio_ring_info {
 
 	struct page		**ring_pages;
 	spinlock_t		ring_lock;
-	long			nr_pages;
+	unsigned		nr_pages;
 
 	unsigned		nr, tail;
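A quick back-of-envelope check of that claim (userspace sketch; the
4096/32 sizes are taken from the description above, and the aio_ring
header on the first page changes the result only marginally):

	#include <stdio.h>

	int main(void)
	{
		/* sys_io_setup() takes 'unsigned nr_events' */
		unsigned long long max_events = 0xffffffffULL;
		unsigned long long events_per_page = 4096 / 32;	/* 128 */
		unsigned long long max_pages =
			(max_events + events_per_page - 1) / events_per_page;

		/* prints 33554432, i.e. 2^25 -- comfortably inside
		 * the range of an unsigned int */
		printf("max ring pages: %llu\n", max_pages);
		return 0;
	}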
[patch] aio: remove spurious ring head index modulo info->nr
In aio_read_evt(), ring->head will never wrap past info->nr because we
already wrap it when updating the ring head index:

	if (head != ring->tail) {
		...
		head = (head + 1) % info->nr;
		ring->head = head;
	}

This makes the modulo on ring->head when loading it into the local
variable head unnecessary.  This patch removes that bogus code.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 22:01:36.0 -0800
+++ ./fs/aio.c	2006-12-24 22:34:48.0 -0800
@@ -1019,7 +1019,7 @@ static int aio_read_evt(struct kioctx *i
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
 	struct aio_ring *ring;
-	unsigned long head;
+	unsigned int head;
 	int ret = 0;
 
 	ring = kmap_atomic(info->ring_pages[0], KM_USER0);
@@ -1032,7 +1032,7 @@ static int aio_read_evt(struct kioctx *i
 
 	spin_lock(&info->ring_lock);
 
-	head = ring->head % info->nr;
+	head = ring->head;
 	if (head != ring->tail) {
 		struct io_event *evp = aio_ring_event(info, head, KM_USER1);
 		*ent = *evp;
[patch] aio: streamline read events after woken up
The read event loop in the blocking path is also inefficient.  For every
event it reaps (even when it doesn't block), it does the following in a
loop:

	while (i < nr) {
		prepare_to_wait_exclusive
		aio_read_evt
		finish_wait
		...
	}

Given the previous patch "aio: add per task aio wait event condition",
which properly wakes up the event waiting process knowing that we have
enough events to reap, it's just a plain waste of time to insert itself
into a wait queue, and then immediately remove itself from the wait
queue, for *every* event reap iteration.

This patch factors the wait queue insertion/deletion out of the event
reap loop and streamlines the event reaping after the process wakes up.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 17:04:52.0 -0800
+++ ./fs/aio.c	2006-12-24 17:05:10.0 -0800
@@ -1174,42 +1174,40 @@ retry:
 	}
 
 	aio_init_wait(&wait);
+wait:
+	prepare_to_wait_exclusive(&ctx->wait, &wait.wait, TASK_INTERRUPTIBLE);
+	ret = aio_read_evt(ctx, &ent);
+	if (!ret) {
+		wait.nr_wait = min_nr - i;
+		schedule();
+		if (signal_pending(tsk))
+			ret = -EINTR;
+	}
+	finish_wait(&ctx->wait, &wait.wait);
+
+	if (ret < 0)
+		goto out_cleanup;
+
 	while (likely(i < nr)) {
-		do {
-			prepare_to_wait_exclusive(&ctx->wait, &wait.wait,
-						TASK_INTERRUPTIBLE);
-			ret = aio_read_evt(ctx, &ent);
-			if (ret)
-				break;
-			if (min_nr <= i)
-				break;
-			ret = 0;
-			if (to.timed_out)	/* Only check after read evt */
-				break;
-			wait.nr_wait = min_nr - i;
-			schedule();
-			if (signal_pending(tsk)) {
-				ret = -EINTR;
+		if (ret) {
+			if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
+				dprintk("aio: lost an event due to EFAULT.\n");
+				ret = -EFAULT;
 				break;
 			}
-			/*ret = aio_read_evt(ctx, &ent);*/
-		} while (1) ;
-
-		finish_wait(&ctx->wait, &wait.wait);
-
-		if (unlikely(ret <= 0))
-			break;
+			event++;
+			i++;
+		}
 
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
+		ret = aio_read_evt(ctx, &ent);
+		if (unlikely(!ret)) {
+			if (i < min_nr && !to.timed_out)
+				goto wait;
 			break;
 		}
-
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
 	}
 
+out_cleanup:
 	if (timeout)
 		clear_timeout(&to);
 out:
[patch] aio: add per task aio wait event condition
The AIO wake-up notification from aio_complete is really inefficient in
the current AIO implementation in the presence of a process waiting in
io_getevents().  For example, if an app calls io_getevents with
min_nr > 1 and the aio event queue doesn't have enough completed aio
events, the process will block in read_events().  However, aio_complete()
will wake up the waiting process for *each* completed I/O, even though
the number of events that the app is waiting for is much larger than 1.
This causes excessive and unnecessary context switches, because the
waiting process will just reap one single event and go back to sleep
again.  It is much more efficient to wake up the waiting process when
there are enough events for it to reap.

This patch adds a wait condition to the wait queue and only wakes up the
process when that condition is met.  And this condition is added on a per
task base for handling multi-threaded apps that share a single ioctx.

To show the effect of this patch, here is a vmstat output before and
after the patch.  The app does random O_DIRECT AIO on 60 disks.  Context
switch is reduced from 13 thousand+ down to just 40+, a significant
improvement.  (The si/so columns, zero throughout, are reconstructed here
from the vmstat header.)

Before:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free  buff   cache  si  so    bi    bo   in    cs us sy id wa
 0  0      0 3972608  7056 3131200   0   0 14000     0 7840 13715  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 14300     0 7793 13641  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 14100     0 7885 13747  0  2 98  0

After:
 0  0      0 3972608  7056 3131200   0   0 14000     0 7840    49  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 13800     0 7793    53  0  2 98  0
 0  0      0 3972608  7056 3131200   0   0 13800     0 7885    42  0  2 98  0

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c	2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+	int		nr_wait;	/* wake-up condition */
+	wait_queue_t	wait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+	wait->nr_wait = 0;
+	init_wait(&wait->wait);
+}
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue wait;
 
 	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
 		goto out;
 
-	add_wait_queue(&ctx->wait, &wait);
+	aio_init_wait(&wait);
+	add_wait_queue(&ctx->wait, &wait.wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
 		spin_unlock_irq(&ctx->ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aios(struct kio
 		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
-	remove_wait_queue(&ctx->wait, &wait);
+	remove_wait_queue(&ctx->wait, &wait.wait);
 
 out:
 	spin_unlock_irq(&ctx->ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
 	unsigned long flags;
 	unsigned long tail;
 	int ret;
+	int nr_evt = 0;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
 	info->tail = tail;
 	ring->tail = tail;
 
+	nr_evt = ring->tail - ring->head;
+	if (nr_evt < 0)
+		nr_evt += info->nr;
 	put_aio_ring_event(event, KM_IRQ0);
 	kunmap_atomic(ring, KM_IRQ1);
 
@@ -1000,8 +1016,13 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb.
 	 */
 	ret = __aio_put_req(ctx, iocb);
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->wait)) {
+		struct aio_wait_queue *wait;
+		wait = container_of(ctx->wait.task_list.next,
+				struct aio_wait_queue, wait.task_list);
+		if (nr_evt >= wait->nr_wait)
+			wake_up(&ctx->wait);
+	}
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
 	long			start_jiffies = jiffies;
 	struct task_struct	*tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue	wait;
 	int			ret;
 	int			i = 0;
 	struct io_event		ent;
@@ -1152,10 +1173,11 @@ retry:
 		set_timeout(start_jiffies, &to, &ts);
 	}
 
+	aio_init_wait(&wait);
 	while (likely(i < nr)) {
-		add_wait_queue_exclusive(&ctx->wait, &wait);
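For context, here is a minimal userspace sketch (assuming the libaio
wrappers; device path and sizes are hypothetical, error handling
omitted) of the min_nr > 1 pattern this patch optimizes -- one
io_getevents call asking for a whole batch, where per-completion wakeups
are pure waste:

	#define _GNU_SOURCE
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdlib.h>

	#define NR 64

	int main(void)
	{
		io_context_t ctx = 0;
		struct iocb cbs[NR], *cbp[NR];
		struct io_event events[NR];
		void *buf;
		int fd, i;

		fd = open("/dev/sdX", O_RDONLY | O_DIRECT);	/* hypothetical device */
		io_setup(NR, &ctx);

		for (i = 0; i < NR; i++) {
			posix_memalign(&buf, 4096, 4096);	/* O_DIRECT alignment */
			io_prep_pread(&cbs[i], fd, buf, 4096, (long long)i * 4096);
			cbp[i] = &cbs[i];
		}
		io_submit(ctx, NR, cbp);

		/* Block until all NR completions are available: with the
		 * patch, aio_complete() only wakes this task on the NRth
		 * event instead of NR times. */
		io_getevents(ctx, NR, NR, events, NULL);

		io_destroy(ctx);
		return 0;
	}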
[patch] aio: fix buggy put_ioctx call in aio_complete - v2
An AIO bug was reported that a sleeping function is being called in
softirq context:

BUG: warning at kernel/mutex.c:132/__mutex_lock_common()
Call Trace:
 [<a00100577b00>] __mutex_lock_slowpath+0x640/0x6c0
 [<a00100577ba0>] mutex_lock+0x20/0x40
 [<a001000a25b0>] flush_workqueue+0xb0/0x1a0
 [<a0010018c0c0>] __put_ioctx+0xc0/0x240
 [<a0010018d470>] aio_complete+0x2f0/0x420
 [<a0010019cc80>] finished_one_bio+0x200/0x2a0
 [<a0010019d1c0>] dio_bio_complete+0x1c0/0x200
 [<a0010019d260>] dio_bio_end_aio+0x60/0x80
 [<a0010014acd0>] bio_endio+0x110/0x1c0
 [<a001002770e0>] __end_that_request_first+0x180/0xba0
 [<a00100277b90>] end_that_request_chunk+0x30/0x60
 [<a002073c0c70>] scsi_end_request+0x50/0x300 [scsi_mod]
 [<a002073c1240>] scsi_io_completion+0x200/0x8a0 [scsi_mod]
 [<a002074729b0>] sd_rw_intr+0x330/0x860 [sd_mod]
 [<a002073b3ac0>] scsi_finish_command+0x100/0x1c0 [scsi_mod]
 [<a002073c2910>] scsi_softirq_done+0x230/0x300 [scsi_mod]
 [<a00100277d20>] blk_done_softirq+0x160/0x1c0
 [<a00100083e00>] __do_softirq+0x200/0x240
 [<a00100083eb0>] do_softirq+0x70/0xc0

See report: http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in softirq context.
However, aio_complete() called from an I/O interrupt can potentially call
put_ioctx with the last ref count on the ioctx, and trigger the bug.  It
is simply incorrect to perform ioctx freeing from aio_complete.

The bug is trigger-able from a race between io_destroy() and
aio_complete().  A possible scenario:

	cpu0				cpu1
	io_destroy			aio_complete
	  wait_for_all_aios {		  __aio_put_req
	     ...			     ctx->reqs_active--;
	     if (!ctx->reqs_active)
		return;
	  }
	  ...
	  put_ioctx(ioctx)
					  put_ioctx(ctx);
					    __put_ioctx
					      bam! Bug trigger!

The real problem is that the condition check of ctx->reqs_active in
wait_for_all_aios() is incorrect, in that access to reqs_active is not
properly protected by a spin lock.  This patch adds that protective spin
lock, and at the same time removes all the duplicate ref counting for
each kiocb, as reqs_active is already used as a ref count for each active
ioctx.  This also ensures that the buggy call to flush_workqueue() in
softirq context is eliminated.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-21 08:08:14.0 -0800
+++ ./fs/aio.c	2006-12-21 08:14:27.0 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 
+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;
 
 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }
 
 /* wait_on_sync_kiocb:
@@ -424,7 +430,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -536,8 +541,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -782,8 +785,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -998,14 +1000,10 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb.
 	 */
 	ret = __aio_put_req(ctx, iocb);
 
-	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 
-	if (ret)
-		put_ioctx(ctx);
-
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
 }
[patch] aio: add per task aio wait event condition
The AIO wake-up notification from aio_complete is really inefficient in the current AIO implementation in the presence of a process waiting in io_getevents(). For example, if an app calls io_getevents with min_nr > 1 and the aio event queue doesn't have enough completed aio events, the process will block in read_events(). However, aio_complete() will wake up the waiting process for *each* completed I/O, even though the number of events the app is waiting for is much larger than 1. This causes excessive and unnecessary context switches, because the waiting process will just reap one single event and go back to sleep again. It is much more efficient to wake up the waiting process when there are enough events for it to reap.

This patch adds a wait condition to the wait queue and only wakes up the process when that condition is met. The condition is added on a per-task basis to handle multi-threaded apps that share a single ioctx.

To show the effect of this patch, here is vmstat output before and after the patch. The app does random O_DIRECT AIO on 60 disks. Context switches are reduced from 13 thousand+ down to just 40+, a significant improvement.

Before:
procs ---memory-- ---swap-- -io --system-- cpu
 r b swpd    free buff   cache si so    bi bo   in    cs us sy id wa
 0 0    0 3972608 7056 3131200       14000  0 7840 13715  0  2 98  0
 0 0    0 3972608 7056 3131200       14300  0 7793 13641  0  2 98  0
 0 0    0 3972608 7056 3131200       14100  0 7885 13747  0  2 98  0

After:
 0 0    0 3972608 7056 3131200       14000  0 7840    49  0  2 98  0
 0 0    0 3972608 7056 3131200       13800  0 7793    53  0  2 98  0
 0 0    0 3972608 7056 3131200       13800  0 7885    42  0  2 98  0

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 16:41:39.0 -0800
+++ ./fs/aio.c	2006-12-24 16:52:15.0 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx
 	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
 } while(0)
 
+struct aio_wait_queue {
+	int		nr_wait;	/* wake-up condition */
+	wait_queue_t	wait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+	wait->nr_wait = 0;
+	init_wait(&wait->wait);
+}
+
 /* ioctx_alloc
  *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
  */
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
 static void wait_for_all_aios(struct kioctx *ctx)
 {
 	struct task_struct *tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue wait;
 
 	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
 		goto out;
 
-	add_wait_queue(&ctx->wait, &wait);
+	aio_init_wait(&wait);
+	add_wait_queue(&ctx->wait, &wait.wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
 		spin_unlock_irq(&ctx->ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aio
 		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
-	remove_wait_queue(&ctx->wait, &wait);
+	remove_wait_queue(&ctx->wait, &wait.wait);
 
 out:
 	spin_unlock_irq(&ctx->ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
 	unsigned long	flags;
 	unsigned long	tail;
 	int		ret;
+	int		nr_evt = 0;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
 	info->tail = tail;
 	ring->tail = tail;
 
+	nr_evt = ring->tail - ring->head;
+	if (nr_evt < 0)
+		nr_evt += info->nr;
 	put_aio_ring_event(event, KM_IRQ0);
 	kunmap_atomic(ring, KM_IRQ1);
@@ -1000,8 +1016,13 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
 
-	if (waitqueue_active(&ctx->wait))
-		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->wait)) {
+		struct aio_wait_queue *wait;
+		wait = container_of(ctx->wait.task_list.next,
+				    struct aio_wait_queue, wait.task_list);
+		if (nr_evt >= wait->nr_wait)
+			wake_up(&ctx->wait);
+	}
 
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
 {
 	long			start_jiffies = jiffies;
 	struct task_struct	*tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
+	struct aio_wait_queue	wait;
 	int			ret;
 	int			i = 0;
 	struct io_event		ent;
@@ -1152,10 +1173,11 @@ retry:
 		set_timeout(start_jiffies, &to, &ts);
 	}
 
+	aio_init_wait(&wait);
 	while (likely(i < nr)) {
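To see the user-visible semantics this optimizes, here is a minimal, self-contained user-space sketch using the libaio API (the device path and I/O sizes are hypothetical; build with -laio). With min_nr set to 32, a kernel that wakes the waiter on every single completion does 31 useless wakeups per batch; with the patch above, the task is only woken once enough events have queued:

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <libaio.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_IO	64
	#define MIN_NR	32

	int main(void)
	{
		io_context_t ctx;
		struct iocb iocbs[NR_IO], *iocbp[NR_IO];
		struct io_event events[NR_IO];
		void *buf;
		int fd, i, n;

		memset(&ctx, 0, sizeof(ctx));
		if (io_setup(NR_IO, &ctx) < 0) {
			perror("io_setup");
			return 1;
		}

		fd = open("/dev/sdb", O_RDONLY | O_DIRECT);	/* hypothetical disk */
		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (i = 0; i < NR_IO; i++) {
			if (posix_memalign(&buf, 4096, 4096))
				return 1;
			io_prep_pread(&iocbs[i], fd, buf, 4096, (long long)i * 4096);
			iocbp[i] = &iocbs[i];
		}
		if (io_submit(ctx, NR_IO, iocbp) != NR_IO) {
			perror("io_submit");
			return 1;
		}

		/* Block until at least MIN_NR completions are queued.  This is
		 * the wait whose wake-up frequency the patch reduces. */
		n = io_getevents(ctx, MIN_NR, NR_IO, events, NULL);
		printf("reaped %d events\n", n);

		io_destroy(ctx);
		return 0;
	}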
[patch] aio: streamline read events after woken up
The read event loop in the blocking path is also inefficient. For every event it reaps (when not blocking), it does the following in a loop:

	while (i < nr) {
		prepare_to_wait_exclusive
		aio_read_evt
		finish_wait
		...
	}

Given the previous patch, "aio: add per task aio wait event condition", which wakes up the event-waiting process only when there are enough events to reap, it is a plain waste of time to insert the task into a wait queue and then immediately remove it from the wait queue for *every* event-reap iteration. This patch factors the wait queue insertion/deletion out of the event reap loop and streamlines the event reaping after the process wakes up.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 17:04:52.0 -0800
+++ ./fs/aio.c	2006-12-24 17:05:10.0 -0800
@@ -1174,42 +1174,40 @@ retry:
 	}
 
 	aio_init_wait(&wait);
+wait:
+	prepare_to_wait_exclusive(&ctx->wait, &wait.wait, TASK_INTERRUPTIBLE);
+	ret = aio_read_evt(ctx, &ent);
+	if (!ret) {
+		wait.nr_wait = min_nr - i;
+		schedule();
+		if (signal_pending(tsk))
+			ret = -EINTR;
+	}
+	finish_wait(&ctx->wait, &wait.wait);
+
+	if (ret < 0)
+		goto out_cleanup;
+
 	while (likely(i < nr)) {
-		do {
-			prepare_to_wait_exclusive(&ctx->wait, &wait.wait,
-						  TASK_INTERRUPTIBLE);
-			ret = aio_read_evt(ctx, &ent);
-			if (ret)
-				break;
-			if (min_nr <= i)
-				break;
-			ret = 0;
-			if (to.timed_out)	/* Only check after read evt */
-				break;
-			wait.nr_wait = min_nr - i;
-			schedule();
-			if (signal_pending(tsk)) {
-				ret = -EINTR;
+		if (ret) {
+			if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
+				dprintk("aio: lost an event due to EFAULT.\n");
+				ret = -EFAULT;
 				break;
 			}
-			/*ret = aio_read_evt(ctx, &ent);*/
-		} while (1) ;
-
-		finish_wait(&ctx->wait, &wait.wait);
-
-		if (unlikely(ret <= 0))
-			break;
+			event++;
+			i++;
+		}
 
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
+		ret = aio_read_evt(ctx, &ent);
+		if (unlikely(!ret)) {
+			if (i < min_nr && !to.timed_out)
+				goto wait;
 			break;
 		}
-
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
 	}
+out_cleanup:
 	if (timeout)
 		clear_timeout(&to);
 out:
[patch] aio: remove spurious ring head index modulo info->nr
In aio_read_evt(), ring->head will never wrap past info->nr, because we already do the wrap when updating the ring head index:

	if (head != ring->tail) {
		...
		head = (head + 1) % info->nr;
		ring->head = head;
	}

This makes the modulo of ring->head into the local variable head unnecessary. This patch removes that bogus code.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-24 22:01:36.0 -0800
+++ ./fs/aio.c	2006-12-24 22:34:48.0 -0800
@@ -1019,7 +1019,7 @@ static int aio_read_evt(struct kioctx *i
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
 	struct aio_ring *ring;
-	unsigned long head;
+	unsigned int head;
 	int ret = 0;
 
 	ring = kmap_atomic(info->ring_pages[0], KM_USER0);
@@ -1032,7 +1032,7 @@ static int aio_read_evt(struct kioctx *i
 
 	spin_lock(&info->ring_lock);
 
-	head = ring->head % info->nr;
+	head = ring->head;
 	if (head != ring->tail) {
 		struct io_event *evp = aio_ring_event(info, head, KM_USER1);
 		*ent = *evp;
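The invariant being relied on -- store a pre-wrapped index at update time, so the read side never needs a modulo -- is easy to demonstrate with a stand-alone sketch (invented names, not the aio ring code):

	#include <stdio.h>

	#define NR 8			/* ring size */

	static unsigned int head, tail;	/* both always kept in [0, NR) */
	static int ring[NR];

	static int ring_read(int *out)
	{
		if (head == tail)
			return 0;		/* empty */
		*out = ring[head];		/* no "head % NR" needed here... */
		head = (head + 1) % NR;		/* ...because the wrap happens here */
		return 1;
	}

	static int ring_write(int v)
	{
		unsigned int next = (tail + 1) % NR;

		if (next == head)
			return 0;		/* full */
		ring[tail] = v;
		tail = next;
		return 1;
	}

	int main(void)
	{
		int i, v;

		for (i = 0; i < 20; i++) {	/* force several wraps */
			ring_write(i);
			if (ring_read(&v))
				printf("%d\n", v);
		}
		return 0;
	}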
[patch] aio: make aio_ring_info->nr_pages an unsigned int
The number of io_events allowed in the AIO event queue is currently no more than 2^32-1, because the syscall defines:

	asmlinkage long sys_io_setup(unsigned nr_events, aio_context_t __user *ctxp)

We internally allocate a ring buffer for nr_events and keep track of page descriptors for each of these ring buffer pages. Since the page size is significantly larger than the AIO event size (4096 versus 32 bytes), I don't think it is ever possible to overflow nr_pages as a 32-bit quantity. This patch changes nr_pages to unsigned int. On a 64-bit arch, changing it to unsigned int also allows better packing of the aio_ring_info structure.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/aio.h.orig	2006-12-24 22:31:55.0 -0800
+++ ./include/linux/aio.h	2006-12-24 22:41:28.0 -0800
@@ -165,7 +165,7 @@ struct aio_ring_info {
 	struct page		**ring_pages;
 	spinlock_t		ring_lock;
-	long			nr_pages;
+	unsigned		nr_pages;
 
 	unsigned		nr, tail;
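A quick back-of-the-envelope check of that claim, as a stand-alone program (32 is the io_event size and 4096 the page size assumed in the text above):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t max_events = 0xffffffffULL;		/* largest unsigned nr_events */
		uint64_t ring_bytes = max_events * 32;		/* 32 bytes per io_event */
		uint64_t nr_pages   = (ring_bytes + 4095) / 4096;

		/* prints 33554432, i.e. 2^25 -- far below the 2^32 ceiling of
		 * an unsigned int, so the narrower type cannot overflow */
		printf("worst-case nr_pages = %llu\n", (unsigned long long)nr_pages);
		return 0;
	}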
RE: [PATCH] mm: fix page_mkclean_one
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > >
> > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > it..
> > >
> > > FWIW this program definitely triggers the bug for me.
> >
> > Ok, now that I have something simple to do repeatable stuff with, I can
> > say what the pattern is.. It's not all that surprising, but it's still
> > worth just stating for the record.
>
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
>
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <[EMAIL PROTECTED]>
> Date: Mon Sep 25 23:30:58 2006 -0700
>
>     [PATCH] mm: balance dirty pages
>
>     Now that we can detect writers of shared mappings, throttle them. Avoids OOM
>     by surprise.

Oh, never mind :-( I just didn't create enough write-out pressure when testing this. I just saw the bug get triggered on a kernel I previously thought was OK.
RE: [PATCH] mm: fix page_mkclean_one
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.

Running the test code, git bisect points its finger at this commit. Reverting this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.

edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <[EMAIL PROTECTED]>
Date: Mon Sep 25 23:30:58 2006 -0700

    [PATCH] mm: balance dirty pages

    Now that we can detect writers of shared mappings, throttle them. Avoids OOM
    by surprise.

    Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
    Cc: Hugh Dickins <[EMAIL PROTECTED]>
    Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
    Signed-off-by: Linus Torvalds <[EMAIL PROTECTED]>
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
[EMAIL PROTECTED] wrote on Thursday, December 21, 2006 9:35 AM
> kenneth.w.chen> Take ioctx_lock is one part, the other part is to move
> kenneth.w.chen> spin_unlock_irqrestore(&ctx->ctx_lock, flags);
> kenneth.w.chen> in aio_complete all the way down to the end of the
> kenneth.w.chen> function, after wakeup and put_ioctx. But then the ref
> kenneth.w.chen> counting on ioctx in aio_complete path is meaningless,
> kenneth.w.chen> which is the thing I'm trying to remove.
>
> OK, right. But are we simply papering over the real problem? Earlier in
> this thread, you stated:
>
> > flush_workqueue() is not allowed to be called in the softirq context.
> > However, aio_complete() called from I/O interrupt can potentially call
> > put_ioctx with last ref count on ioctx and trigger a bug warning. It
> > is simply incorrect to perform ioctx freeing from aio_complete.
>
> But how do we end up with the last reference to the ioctx in the aio
> completion path? That's a question I haven't seen answered.

It is explained in this posting, a race between io_destroy and aio_complete:
http://groups.google.com/group/linux.kernel/msg/d2ba7d73aca1dd0c?hl=en

Trond spotted a bug in that posting. The correct way of locking is to move the spin_unlock_irqrestore in aio_complete all the way down instead of calling aio_put_req at the end. Like this:

--- ./fs/aio.c.orig	2006-12-21 08:08:14.0 -0800
+++ ./fs/aio.c	2006-12-21 08:14:27.0 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);
 
+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;
 
 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }
 
 /* wait_on_sync_kiocb:
@@ -424,7 +430,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -536,8 +541,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -782,8 +785,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -998,14 +1000,10 @@ put_rq:
 	/* everything turned out well, dispose of the aiocb. */
 	ret = __aio_put_req(ctx, iocb);
-	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
-
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
-	if (ret)
-		put_ioctx(ctx);
-
+	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 	return ret;
 }
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
[EMAIL PROTECTED] wrote on Thursday, December 21, 2006 8:56 AM
> kenneth.w.chen> I think I'm going to abandon this whole synchronize thing
> kenneth.w.chen> and going to put the wake up call inside ioctx_lock spin
> kenneth.w.chen> lock along with the other patch you mentioned above in the
> kenneth.w.chen> waiter path. On top of that, I have another patch attempts
> kenneth.w.chen> to perform wake-up only when the waiter can truly proceed
> kenneth.w.chen> in aio_read_evt so dribbling I/O completion doesn't
> kenneth.w.chen> inefficiently waking up waiter too frequently and only to
> kenneth.w.chen> have waiter put back to sleep again. I will dig that up and
> kenneth.w.chen> experiment.
>
> In the mean time, can't we simply take the context lock in
> wait_for_all_aios? Unless I missed something, I think that will address
> the reference count problem.

Taking ioctx_lock is one part; the other part is to move

	spin_unlock_irqrestore(&ctx->ctx_lock, flags);

in aio_complete all the way down to the end of the function, after the wakeup and put_ioctx. But then the ref counting on the ioctx in the aio_complete path is meaningless, which is the thing I'm trying to remove.
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
Andrew Morton wrote on Thursday, December 21, 2006 12:18 AM
> Alas, your above description doesn't really tell us what the bug is, so I'm
> at a bit of a loss here.
>
> <finds http://marc.theaimsgroup.com/?l=linux-aio&m=116616463009218&w=2>
>
> So that's a refcounting bug. But it's really a locking bug, because
> refcounting needs locking too.

I should've quoted the original bug report (kicking myself for those fat fingers!):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

The bug manifested from an expectation that __put_ioctx can be called in softirq context, but it really shouldn't be. Normally, aio_complete will not decrement the last ref count on the ioctx, but under a stressed system, it might.

> > Problem is in wait_for_all_aios(), it is checking wait status without
> > properly holding an ioctx lock. Perhaps, this patch is walking on thin
> > ice. It abuses rcu over a buggy code. OTOH, I really don't want to hold
> > ctx_lock over the entire wakeup call at the end of aio_complete:
> >
> > 	if (waitqueue_active(&ctx->wait))
> > 		wake_up(&ctx->wait);
> >
> > I'm worried about longer lock hold time in aio_complete and potentially
> > increase lock contention for concurrent I/O completion.
>
> There is a potential problem where we deliver a storm of wakeups at the
> waitqueue, and until the woken-up process actually gets going and removes
> itself from the waitqueue, those wake_up()s do lots of work waking up an
> already-runnable task.
>
> If we were using DEFINE_WAIT/prepare_to_wait/finish_wait in there then the
> *first* wake_up() would do some work, but all the following ones are
> practically free.
>
> So if you're going to get in there and run the numbers on this, try both
> approaches.

Yes, I agree, and I would like to bring your patch on "DEFINE_WAIT..." back too. I will run that experiment.

> > A quick look
> > at lockmeter data I had on a 4 socket system (with dual core + HT), it
> > already showing signs of substantial lock contention in aio_complete.
> > I'm afraid putting the above call inside ioctx_lock will make things
> > worse.
>
> It beats oopsing.

Yeah, correctness absolutely overrules optimization. I just hope I can find a way to satisfy both.

> > And synchronize_rcu fits the bill perfectly: aio_complete sets wakeup
> > status, drop ioctx_lock, do the wakeup call all protected inside rcu
> > lock. Then wait_for_all_aios will just wait for all that sequence to
> > complete before it proceed with __put_ioctx(). All nice and easy.
>
> Possibly it would be less abusive to use preempt_disable()/enable (with
> suitable comments) and synchronize_kernel(). To at least remove the rcu
> signals from in there.

I think I'm going to abandon this whole synchronize thing and instead put the wake-up call inside the ioctx_lock spin lock, along with the other patch you mentioned above in the waiter path. On top of that, I have another patch that attempts to perform the wake-up only when the waiter can truly proceed in aio_read_evt, so dribbling I/O completion doesn't inefficiently wake the waiter too frequently only to have it put back to sleep again. I will dig that up and experiment.
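For context, the idiom Andrew refers to looks roughly like this (a generic sketch of the kernel wait-queue pattern, not code from fs/aio.c; "condition" is a placeholder):

	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&ctx->wait, &wait, TASK_UNINTERRUPTIBLE);
		if (condition)
			break;
		schedule();
	}
	finish_wait(&ctx->wait, &wait);

DEFINE_WAIT() sets up the entry with autoremove_wake_function, so the first wake_up() dequeues the task; subsequent wake_up() calls find the waitqueue entry already removed and are nearly free, which is exactly the property that tames a storm of completions.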
RE: [patch] aio: fix buggy put_ioctx call in aio_complete
Andrew Morton wrote on Wednesday, December 20, 2006 8:06 PM
> On Tue, 19 Dec 2006 13:49:18 -0800
> "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote:
> > Regarding to a bug report on:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2
> >
> > flush_workqueue() is not allowed to be called in the softirq context.
> > However, aio_complete() called from I/O interrupt can potentially call
> > put_ioctx with last ref count on ioctx and trigger a bug warning. It
> > is simply incorrect to perform ioctx freeing from aio_complete.
> >
> > This patch removes all duplicate ref counting for each kiocb as
> > reqs_active already used as a request ref count for each active ioctx.
> > This also ensures that buggy call to flush_workqueue() in softirq
> > context is eliminated. wait_for_all_aios currently will wait on last
> > active kiocb. However, it is racy. This patch also tightens it up
> > by utilizing the rcu synchronization mechanism to ensure no further
> > reference to ioctx before put_ioctx function is run.
>
> hrm, maybe. Does this count as "abuse of the RCU interfaces". Or "reuse"?

Yeah, it's abuse. The problem is in wait_for_all_aios(): it is checking wait status without properly holding an ioctx lock. Perhaps this patch is walking on thin ice. It abuses rcu over a buggy code. OTOH, I really don't want to hold ctx_lock over the entire wakeup call at the end of aio_complete:

	if (waitqueue_active(&ctx->wait))
		wake_up(&ctx->wait);

I'm worried about longer lock hold time in aio_complete and potentially increased lock contention for concurrent I/O completion. From a quick look at lockmeter data I had on a 4-socket system (with dual core + HT), it is already showing signs of substantial lock contention in aio_complete. I'm afraid putting the above call inside ioctx_lock will make things worse.

And synchronize_rcu fits the bill perfectly: aio_complete sets wakeup status, drops ioctx_lock, and does the wakeup call, all protected inside the rcu lock. Then wait_for_all_aios will just wait for all of that sequence to complete before it proceeds with __put_ioctx(). All nice and easy.
RE: [RFC PATCH 1/8] rqbased-dm: allow blk_get_request() to be called from interrupt context
Kiyoshi Ueda wrote on Wednesday, December 20, 2006 9:50 AM
> On Wed, 20 Dec 2006 14:48:49 +0100, Jens Axboe <[EMAIL PROTECTED]> wrote:
> > Big NACK on this - it's not only really ugly, it's also buggy to pass
> > interrupt flags as function arguments. As you also mention in the 0/1
> > mail, this also breaks CFQ.
> >
> > Why do you need in-interrupt request allocation?
>
> Because I'd like to use blk_get_request() in q->request_fn()
> which can be called from interrupt context like below:
> scsi_io_completion -> scsi_end_request -> scsi_next_command
> -> scsi_run_queue -> blk_run_queue -> q->request_fn
>
> [ ...]
>
> Do you think creating another function like blk_get_request_nowait()
> is acceptable?

You don't need to create another function. blk_get_request already has both wait and nowait semantics via the gfp_mask argument. If you cannot block, then clear the __GFP_WAIT bit in the mask before calling blk_get_request.

- Ken
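A minimal sketch of what that looks like at a call site (illustrative fragment only, assuming a request queue pointer q is in scope; GFP_ATOMIC is simply a mask without __GFP_WAIT set):

	struct request *rq;

	/* no __GFP_WAIT in the mask: the allocation may fail, but it will
	 * never sleep, so this is safe from interrupt context */
	rq = blk_get_request(q, WRITE, GFP_ATOMIC);
	if (!rq)
		return;		/* out of requests; the caller must retry later */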
[patch] aio: fix buggy put_ioctx call in aio_complete
Regarding a bug report on:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116599593200888&w=2

flush_workqueue() is not allowed to be called in the softirq context. However, aio_complete() called from I/O interrupt can potentially call put_ioctx with the last ref count on the ioctx and trigger a bug warning. It is simply incorrect to perform ioctx freeing from aio_complete.

This patch removes all duplicate ref counting for each kiocb, as reqs_active is already used as a request ref count for each active ioctx. This also ensures that the buggy call to flush_workqueue() in softirq context is eliminated. wait_for_all_aios currently will wait on the last active kiocb. However, it is racy. This patch also tightens it up by utilizing the rcu synchronization mechanism to ensure no further references to the ioctx before the put_ioctx function is run.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-12-19 08:35:01.0 -0800
+++ ./fs/aio.c	2006-12-19 08:46:34.0 -0800
@@ -308,6 +308,7 @@ static void wait_for_all_aios(struct kio
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
+	synchronize_rcu();
 	remove_wait_queue(&ctx->wait, &wait);
 }
 
@@ -425,7 +426,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -538,8 +538,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -795,8 +793,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -1012,6 +1009,7 @@ int fastcall aio_complete(struct kiocb *
 		iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
 put_rq:
 	/* everything turned out well, dispose of the aiocb. */
+	rcu_read_lock();
 	ret = __aio_put_req(ctx, iocb);
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 
@@ -1019,9 +1017,7 @@ put_rq:
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
 
-	if (ret)
-		put_ioctx(ctx);
-
+	rcu_read_unlock();
 	return ret;
 }
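Reduced to its generic RCU shape, the synchronization the patch relies on is the following handshake (sketch only, not the fs/aio.c code itself):

	/* completion side: everything that may still touch the ioctx runs
	 * inside an RCU read-side critical section */
	rcu_read_lock();
	/* ... final __aio_put_req(), unlock, wake up waiters ... */
	rcu_read_unlock();

	/* teardown side: synchronize_rcu() returns only after every
	 * read-side section that began before it has finished, so the
	 * ioctx cannot be freed out from under a completion in flight */
	synchronize_rcu();
	/* ... safe to run __put_ioctx() ... */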
RE: [PATCH] incorrect direct io error handling
Dmitriy Monakhov wrote on Monday, December 18, 2006 5:23 AM
> This patch is the result of a discussion started a week ago here:
> http://lkml.org/lkml/2006/12/11/66
> changes from original patch:
> - Update wrong comments about i_mutex locking.
> - Add BUG_ON(!mutex_is_locked(..)) for non blkdev.
> - vmtruncate call only for non blockdev
> LOG:
> If generic_file_direct_write() has failed (ENOSPC condition) inside
> __generic_file_aio_write_nolock(), it may have instantiated
> a few blocks outside i_size, and fsck will complain about wrong i_size
> (ext2, ext3 and reiserfs interpret a difference between i_size and the biggest
> block as an error); after fsck fixes the error, i_size will be increased to
> the biggest block, but these blocks contain garbage from the previous write
> attempt. This is not an information leak, but it is silent file data
> corruption. This issue affects filesystems regardless of the values of
> blocksize or pagesize.
> We need to truncate any block beyond i_size after the write has failed, as is
> done in the similar generic_file_buffered_write() error path. If the host is
> !S_ISBLK, i_mutex is always held inside generic_file_aio_write_nolock() and we
> may safely call vmtruncate().
> Some fs (XFS at least) may directly call generic_file_direct_write() with
> i_mutex not held. There is no general scenario in this case. Such filesystems
> have to handle generic_file_direct_write() errors in their own specific way
> (place).

I'm puzzled: if ext2 is able to instantiate some blocks, then why does it return a no-space error? Where is the error coming from?
RE: [PATCH] IA64: alignment bug in ldscript
Kirill Korotaev wrote on Monday, December 18, 2006 4:05 AM
> [IA64] bug in ldscript (mainstream)
>
> Occasionally, in mainstream number of fsys entries is even.

Is "fsys entries is even" a typo? If not, then this changelog is misleading. It is the instruction patch list of FSYS_RETURN that can potentially cause other data structures to be out of alignment, and the number of FSYS_RETURN call sites will not necessarily match the number of fsys entries.

> In OpenVZ it is odd and we get misaligned kernel image,
> which does not boot.

Otherwise, the patch looks fine to me.
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Trond Myklebust wrote on Friday, December 15, 2006 6:01 AM
> Oops. Missed the fact that you have removed the put_ioctx from
> aio_put_req, but the first sentence is still true. If you try to wake up
> wait_for_all_aios before you've changed the condition it is waiting for,
> then it may end up hanging forever.

The easy fix for that is to put the wake_up in aio_complete inside the ctx spin lock.

> Why not fix this by having the context freed via an RCU callback? That
> way you can protect the combined call to aio_put_req() +
> wake_up(&ctx->wait) using a simple preempt_off/preempt_on, and all is
> good.

That has been suggested before on a different subject. I will whip up something.
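For concreteness, the suggestion might look roughly like this. It is a hypothetical sketch: it assumes struct kioctx grows an rcu_head member named "rcu", which fs/aio.c does not have as-is (kioctx_cachep is the existing slab cache in fs/aio.c):

	static void ioctx_rcu_free(struct rcu_head *head)
	{
		struct kioctx *ctx = container_of(head, struct kioctx, rcu);

		kmem_cache_free(kioctx_cachep, ctx);
	}

	/* in place of freeing the ioctx directly from __put_ioctx(): */
	call_rcu(&ctx->rcu, ioctx_rcu_free);

The completion path could then bracket the final __aio_put_req() plus wake_up() with preempt_disable()/preempt_enable(), as Trond describes, and the deferred free would be guaranteed not to run until that region has been left.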
RE: [PATCH] incorrect error handling inside generic_file_direct_write
Christoph Hellwig wrote on Friday, December 15, 2006 2:44 AM
> So we're doing the sync_page_range once in __generic_file_aio_write
> with i_mutex held.
>
> > 	mutex_lock(&inode->i_mutex);
> > -	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
> > -			&iocb->ki_pos);
> > +	ret = __generic_file_aio_write(iocb, iov, nr_segs, pos);
> > 	mutex_unlock(&inode->i_mutex);
> >
> > 	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
>
> And then another time after it's unlocked, this seems wrong.

I didn't invent that mess though. I should've asked the question first: in 2.6.20-rc1, generic_file_aio_write will call sync_page_range twice, once from __generic_file_aio_write_nolock and once within the function itself. Is it redundant? Can we delete the one in the top-level function? Like the following?

--- ./mm/filemap.c.orig	2006-12-15 09:02:58.0 -0800
+++ ./mm/filemap.c	2006-12-15 09:03:19.0 -0800
@@ -2370,14 +2370,6 @@ ssize_t generic_file_aio_write(struct ki
 	ret = __generic_file_aio_write_nolock(iocb, iov, nr_segs,
 			&iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
-
-	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
-		ssize_t err;
-
-		err = sync_page_range(inode, mapping, pos, ret);
-		if (err < 0)
-			ret = err;
-	}
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Chen, Kenneth wrote on Thursday, December 14, 2006 5:59 PM
> > It seems utterly insane to have aio_complete() flush a workqueue. That
> > function has to be called from a number of different environments,
> > including non-sleep tolerant environments.
> >
> > For instance it means that directIO on NFS will now cause the rpciod
> > workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> > activity.
>
> The bug appears to be somewhere else; somehow the ref count on the ioctx
> is messed up.
>
> In aio_complete, __put_ioctx() should not be invoked, because the ref
> count on the ioctx is supposed to be more than 2; aio_complete decrements
> it once and should return without invoking the free function.
>
> The real freeing of the ioctx should come from exit_aio() or io_destroy(),
> both of which wait until there are no more pending AIO requests, via
> wait_for_all_aios().

Ah, I think I see the bug: it must be a race between io_destroy() and
aio_complete().  A possible scenario:

	cpu0				cpu1
	io_destroy			aio_complete
	  wait_for_all_aios {		  __aio_put_req
	      ...			      ctx->reqs_active--;
	      if (!ctx->reqs_active)
	          return;
	  }
	  ...
	  put_ioctx(ioctx)
					  put_ioctx(ctx); /* bam! bug triggered */

The AIO finished on cpu1, and while cpu1 is in the middle of aio_complete,
cpu0 starts the io_destroy sequence, sees no pending AIO, and goes ahead
and decrements the ref count on the ioctx.  At a later point in
aio_complete, put_ioctx decrements the last ref count and calls the ioctx
freeing function, and there it triggers the bug warning.

A simple fix is to access ctx->reqs_active inside the ctx spin lock in
wait_for_all_aios().  At the same time, I would like to remove the ref
counting on each iocb, because we already perform that ref counting via
reqs_active.  This would also prevent similar buggy code in the future.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/aio.c.orig	2006-11-29 13:57:37.000000000 -0800
+++ ./fs/aio.c	2006-12-14 20:45:14.000000000 -0800
@@ -298,17 +298,23 @@ static void wait_for_all_aios(struct kio
 	struct task_struct *tsk = current;
 	DECLARE_WAITQUEUE(wait, tsk);

+	spin_lock_irq(&ctx->ctx_lock);
 	if (!ctx->reqs_active)
-		return;
+		goto out;

 	add_wait_queue(&ctx->wait, &wait);
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	while (ctx->reqs_active) {
+		spin_unlock_irq(&ctx->ctx_lock);
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_lock_irq(&ctx->ctx_lock);
 	}
 	__set_task_state(tsk, TASK_RUNNING);
 	remove_wait_queue(&ctx->wait, &wait);
+
+out:
+	spin_unlock_irq(&ctx->ctx_lock);
 }

 /* wait_on_sync_kiocb:
@@ -425,7 +431,6 @@ static struct kiocb fastcall *__aio_get_
 	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
 	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
 		list_add(&req->ki_list, &ctx->active_reqs);
-		get_ioctx(ctx);
 		ctx->reqs_active++;
 		okay = 1;
 	}
@@ -538,8 +543,6 @@ int fastcall aio_put_req(struct kiocb *r
 	spin_lock_irq(&ctx->ctx_lock);
 	ret = __aio_put_req(ctx, req);
 	spin_unlock_irq(&ctx->ctx_lock);
-	if (ret)
-		put_ioctx(ctx);
 	return ret;
 }
@@ -795,8 +798,7 @@ static int __aio_run_iocbs(struct kioctx
 		 */
 		iocb->ki_users++;	/* grab extra reference */
 		aio_run_iocb(iocb);
-		if (__aio_put_req(ctx, iocb))	/* drop extra ref */
-			put_ioctx(ctx);
+		__aio_put_req(ctx, iocb);
 	}
 	if (!list_empty(&ctx->run_list))
 		return 1;
@@ -942,7 +944,6 @@ int fastcall aio_complete(struct kiocb *
 	struct io_event	*event;
 	unsigned long	flags;
 	unsigned long	tail;
-	int		ret;

 	/*
 	 * Special case handling for sync iocbs:
@@ -1011,18 +1012,12 @@ int fastcall aio_complete(struct kiocb *
 	pr_debug("%ld retries: %zd of %zd\n", iocb->ki_retried,
 		iocb->ki_nbytes - iocb->ki_left, iocb->ki_nbytes);
 put_rq:
-	/* everything turned out well, dispose of the aiocb. */
-	ret = __aio_put_req(ctx, iocb);
-
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);

 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);

-	if (ret)
-		put_ioctx(ctx);
-
-	return ret;
+	return aio_put_req(iocb);
 }

 /* aio_read_evt
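For readers following the locking change above, here is a minimal userspace
sketch of the check-under-lock pattern the fix applies. It is illustrative
only, assuming pthreads; the names mirror the kernel structures, but none of
this is the actual fs/aio.c code.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

struct ioctx {
	pthread_mutex_t lock;	/* stand-in for ctx->ctx_lock */
	int reqs_active;	/* pending AIO requests */
};

/* completion path: decrement under the lock (what __aio_put_req does) */
static void complete_req(struct ioctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	ctx->reqs_active--;
	pthread_mutex_unlock(&ctx->lock);
}

/*
 * destroy path: both the initial check and every re-check of reqs_active
 * happen under the lock, so "no pending requests" can no longer be
 * concluded while a completion is still mid-flight on another cpu.
 */
static void wait_for_all(struct ioctx *ctx)
{
	pthread_mutex_lock(&ctx->lock);
	while (ctx->reqs_active) {
		pthread_mutex_unlock(&ctx->lock);
		sched_yield();	/* the kernel version sleeps on a waitqueue */
		pthread_mutex_lock(&ctx->lock);
	}
	pthread_mutex_unlock(&ctx->lock);
}

int main(void)
{
	struct ioctx ctx = { PTHREAD_MUTEX_INITIALIZER, 1 };
	complete_req(&ctx);	/* the one outstanding request finishes */
	wait_for_all(&ctx);	/* now returns immediately */
	printf("done\n");
	return 0;
}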
RE: 2.6.18.4: flush_workqueue calls mutex_lock in interrupt environment
Andrew Morton wrote on Thursday, December 14, 2006 5:20 PM
> it's hard to disagree.
>
> Begin forwarded message:
> > On Wed, 2006-12-13 at 08:25 +0100, xb wrote:
> > > Hi all,
> > >
> > > Running some IO stress tests on an 8-way IA64 platform, we got a
> > > BUG: warning at kernel/mutex.c:132/__mutex_lock_common() message
> > > followed by:
> > > Unable to handle kernel paging request at virtual address
> > > 00200200
> > > an oops corresponding to anon_vma_unlink() calling list_del() on a
> > > poisoned list.
> > >
> > > Having a look at the stack, we see that flush_workqueue() calls
> > > mutex_lock() with softirqs disabled.
> >
> > Something is wrong here... flush_workqueue() is a sleeping function and
> > is not allowed to be called in such a context!
>
> It seems utterly insane to have aio_complete() flush a workqueue. That
> function has to be called from a number of different environments,
> including non-sleep tolerant environments.
>
> For instance it means that directIO on NFS will now cause the rpciod
> workqueues to call flush_workqueue(aio_wq), thus slowing down all RPC
> activity.

The bug appears to be somewhere else; somehow the ref count on the ioctx
is messed up.

In aio_complete, __put_ioctx() should not be invoked, because the ref count
on the ioctx is supposed to be more than 2; aio_complete decrements it once
and should return without invoking the free function.

The real freeing of the ioctx should come from exit_aio() or io_destroy(),
both of which wait until there are no more pending AIO requests, via
wait_for_all_aios().

- Ken
RE: cfq performance gap
Miquel van Smoorenburg wrote on Wednesday, December 13, 2006 1:57 AM
> Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > This rawio test plows through sequential I/O and distributes each small
> > record over a number of threads. So each thread appears to be
> > non-contiguous within its own process context, while the overall requests
> > hitting the device are sequential. I can't see how any application does
> > that kind of I/O pattern.
>
> A NNTP server that has many incoming connections, handled by
> multiple threads, that stores the data in cyclic buffers?

Then whichever thread dumps the buffer contents to the storage will do one
large contiguous I/O.
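To illustrate Miquel's scenario, here is a minimal sketch of that buffering
scheme, assuming POSIX I/O and pthreads; the buffer size, the names, and the
omission of wraparound handling are all simplifications, not any real NNTP
server's code. Many connection threads append small records, and one flush
produces a single large contiguous write.

#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)	/* 1MB cyclic buffer, illustrative */

static char buf[BUF_SIZE];
static size_t head;		/* next free byte; wraparound ignored here */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* many connection threads append small records */
void append_record(const char *rec, size_t len)
{
	pthread_mutex_lock(&lock);
	if (head + len <= BUF_SIZE) {
		memcpy(buf + head, rec, len);
		head += len;
	}
	pthread_mutex_unlock(&lock);
}

/* one thread flushes the accumulated region: one large contiguous write */
void flush_buffer(int fd, off_t dev_off)
{
	pthread_mutex_lock(&lock);
	pwrite(fd, buf, head, dev_off);
	head = 0;
	pthread_mutex_unlock(&lock);
}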
RE: cfq performance gap
AVANTIKA R. MATHUR wrote on Tuesday, December 12, 2006 5:33 PM
> >> rawio is actually performing sequential reads, but I don't believe it is
> >> purely sequential with the multiple processes.
> >> I am currently running the test with longer runtimes and will post
> >> results once it is complete.
> >> I've also attached the rawio source.
> >
> > It's certainly the slice and idling hurting here. But at the same time,
> > I don't really think your test case is very interesting. The test area
> > is very small and you have 16 threads trying to read the same thing,
> > optimizing for that would be silly as I don't think it has much real
> > world relevance.
>
> Could a database have similar workload to this test?

No. Nothing I have seen with db workloads exhibits such a pattern. There
are basically two types of db workloads. One does transaction processing,
where the I/O pattern is truly random with a large stride, both within a
process context and in the overall I/O seen at the device level. The second
is decision-making type db queries, which do large sequential I/O within
one process context.

This rawio test plows through sequential I/O and distributes each small
record over a number of threads. So each thread appears to be
non-contiguous within its own process context, while the overall requests
hitting the device are sequential. I can't see how any application does
that kind of I/O pattern.

- Ken
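To make the access pattern being described concrete, here is a hedged sketch
of the per-thread read loop; the record size, thread count, and names are
assumptions, not the actual rawio source. Thread t reads records t, t+N,
t+2N, ..., so each thread strides across the device while the union of all
threads covers it sequentially.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define RECORD_SIZE	4096	/* assumed small record */
#define NR_THREADS	16

struct worker { int fd; int tid; size_t nr_records; };

/*
 * Per-thread body: each thread reads every NR_THREADS-th record.  Strided
 * (non-contiguous) within the thread, but all threads together touch the
 * device sequentially.
 */
static void *reader(void *arg)
{
	struct worker *w = arg;
	char *buf = malloc(RECORD_SIZE);
	size_t i;

	for (i = w->tid; i < w->nr_records; i += NR_THREADS)
		pread(w->fd, buf, RECORD_SIZE, (off_t)i * RECORD_SIZE);
	free(buf);
	return NULL;
}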
RE: [PATCH] incorrect error handling inside generic_file_direct_write
Andrew Morton wrote on Tuesday, December 12, 2006 2:40 AM
> On Tue, 12 Dec 2006 16:18:32 +0300
> Dmitriy Monakhov <[EMAIL PROTECTED]> wrote:
> > >> but according to filemap's locking rules: mm/filemap.c:77
> > >>   ...
> > >>   *  ->i_mutex			(generic_file_buffered_write)
> > >>   *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
> > >>   ...
> > >> I'm confused a little bit, where is the truth?
> > >
> > > xfs_write() calls generic_file_direct_write() without taking i_mutex for
> > > O_DIRECT writes.
> >
> > Yes, but my question is about __generic_file_aio_write_nolock().
> > As I understand it, the _nolock suffix means that i_mutex was already
> > locked by the caller, am I right?
>
> Nope. It just means that __generic_file_aio_write_nolock() doesn't take
> the lock. We don't assume or require that the caller took it. For example
> the raw driver calls generic_file_aio_write_nolock() without taking
> i_mutex. Raw isn't relevant to the problem (although ocfs2 might be). But
> we cannot assume that all callers have taken i_mutex, I think.

I think we should also clean up generic_file_aio_write_nolock. This was
brought up a couple of weeks ago and I gave up too early. Here is my second
attempt.

How about the following patch: I think we can kill
generic_file_aio_write_nolock and merge both *file_aio_write_nolock
variants into one function, so that

	generic_file_aio_write
	ocfs2_file_aio_write
	blk_dev->aio_write

all point to a non-locking version, __generic_file_aio_write(). The first
two already hold i_mutex, while the block device's aio_write method doesn't
require i_mutex to be held.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/drivers/char/raw.c linux-2.6.19.ken/drivers/char/raw.c
--- linux-2.6.19/drivers/char/raw.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/drivers/char/raw.c	2006-12-12 16:41:39.000000000 -0800
@@ -242,7 +242,7 @@ static const struct file_operations raw_
 	.read	=	do_sync_read,
 	.aio_read =	generic_file_aio_read,
 	.write	=	do_sync_write,
-	.aio_write =	generic_file_aio_write_nolock,
+	.aio_write =	__generic_file_aio_write,
 	.open	=	raw_open,
 	.release =	raw_release,
 	.ioctl	=	raw_ioctl,
diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/block_dev.c	2006-12-12 16:47:58.000000000 -0800
@@ -1198,7 +1198,7 @@ const struct file_operations def_blk_fop
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= generic_file_aio_read,
-	.aio_write	= generic_file_aio_write_nolock,
+	.aio_write	= __generic_file_aio_write,
 	.mmap		= generic_file_mmap,
 	.fsync		= block_fsync,
 	.unlocked_ioctl	= block_ioctl,
diff -Nurp linux-2.6.19/fs/ocfs2/file.c linux-2.6.19.ken/fs/ocfs2/file.c
--- linux-2.6.19/fs/ocfs2/file.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/ocfs2/file.c	2006-12-12 16:42:09.000000000 -0800
@@ -1107,7 +1107,7 @@ static ssize_t ocfs2_file_aio_write(stru
 	/* communicate with ocfs2_dio_end_io */
 	ocfs2_iocb_set_rw_locked(iocb);

-	ret = generic_file_aio_write_nolock(iocb, iov, nr_segs, iocb->ki_pos);
+	ret = __generic_file_aio_write(iocb, iov, nr_segs, iocb->ki_pos);

 	/* buffered aio wouldn't have proper lock coverage today */
 	BUG_ON(ret == -EIOCBQUEUED && !(filp->f_flags & O_DIRECT));
diff -Nurp linux-2.6.19/include/linux/fs.h linux-2.6.19.ken/include/linux/fs.h
--- linux-2.6.19/include/linux/fs.h	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/include/linux/fs.h	2006-12-12 16:41:58.000000000 -0800
@@ -1742,7 +1742,7 @@ extern int file_send_actor(read_descript
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
 extern ssize_t generic_file_aio_read(struct kiocb *, const struct iovec *, unsigned long, loff_t);
 extern ssize_t generic_file_aio_write(struct kiocb *, const struct iovec *, unsigned long, loff_t);
-extern ssize_t generic_file_aio_write_nolock(struct kiocb *, const struct iovec *,
+extern ssize_t __generic_file_aio_write(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
 extern ssize_t generic_file_direct_write(struct kiocb *, const struct iovec *,
 		unsigned long *, loff_t, loff_t *, size_t, size_t);
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/mm/filemap.c	2006-12-12 16:47:58.000000000 -0800
@@ -2219,9 +2219,9 @@ zero_length_segment:
 }
 EXPORT_SYMBOL(generic_file_buffered_write);

-static ssize_t
-__generic_file_aio_write_nolock(struct kiocb *iocb, const struct iovec *iov,
-				unsigned long nr_segs, loff_t *ppos)
+ssize_t
+__generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
+				unsigned long nr_segs, loff_t *ppos)
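A sketch of the resulting calling convention, for illustration only; the
wrapper below is hypothetical, and the bodies are stubs rather than the real
2.6.19 code. Locked entry points take i_mutex themselves and then call the
non-locking worker, while the block device entry point calls the worker
directly.

#include <linux/fs.h>
#include <linux/aio.h>
#include <linux/mutex.h>
#include <linux/uio.h>

/* the single non-locking worker everything converges on */
extern ssize_t __generic_file_aio_write(struct kiocb *iocb,
					const struct iovec *iov,
					unsigned long nr_segs, loff_t *ppos);

/* regular-file style wrapper: provides i_mutex itself (illustrative) */
static ssize_t example_file_aio_write(struct kiocb *iocb,
				      const struct iovec *iov,
				      unsigned long nr_segs, loff_t pos)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;
	ssize_t ret;

	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);
	return ret;
}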
RE: [PATCH] connector: Some fixes for ia64 unaligned access errors
Pete Zaitcev wrote on Monday, December 11, 2006 5:29 PM
> On Mon, 11 Dec 2006 15:52:47 -0800, Matt Helsley <[EMAIL PROTECTED]> wrote:
> > I'm shocked memcpy() introduces 8-byte stores that violate architecture
> > alignment rules. Is there any chance this is a bug in ia64's memcpy()
> > implementation? I've tried to read it but since I'm not familiar with
> > ia64 asm I can't make out significant parts of it in
> > arch/ia64/lib/memcpy.S.
>
> The arch/ia64/lib/memcpy.S is probably fine, it must be gcc doing
> an inline substitution of a well-known function.

arch/ia64/lib/memcpy.S is fine because it checks alignment at the very
beginning of the function, and depending on the dst/src alignment it does
different things. The unaligned access is coming from gcc's inlined version
of memcpy.

But looking more deeply, the memory allocation for proc_event in
proc_fork_connector doesn't look correct at all. In
drivers/connector/cn_proc.c:

#define CN_PROC_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct proc_event))

void proc_fork_connector(struct task_struct *task)
{
	struct cn_msg *msg;
	struct proc_event *ev;
	__u8 buffer[CN_PROC_MSG_SIZE];

You can't do that, because gcc assumes struct proc_event aligns on a
certain boundary. Doing fancy hand-crafting like that breaks the code
generated by gcc.

- Ken
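One possible remedy, sketched below, is to force the on-stack buffer to the
alignment gcc assumes for the embedded structure. This is a hedged sketch:
the 8-byte alignment value is an assumption, the function name is marked as
a sketch, and this is not necessarily the fix that was eventually merged.

#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <linux/sched.h>

#define CN_PROC_MSG_SIZE (sizeof(struct cn_msg) + sizeof(struct proc_event))

void proc_fork_connector_sketch(struct task_struct *task)
{
	/*
	 * Force 8-byte alignment so that the struct proc_event following
	 * the struct cn_msg header sits where gcc expects an aligned
	 * object.  (The alignment value is an assumption for the sketch.)
	 */
	__u8 buffer[CN_PROC_MSG_SIZE] __attribute__((aligned(8)));
	struct cn_msg *msg = (struct cn_msg *)buffer;
	struct proc_event *ev = (struct proc_event *)msg->data;

	/* ... fill in msg and ev as the real function does ... */
	(void)task;
	(void)ev;
}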
RE: [patch] speed up single bio_vec allocation
> Chen, Kenneth wrote on Wednesday, December 06, 2006 10:20 AM
> > Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > This is what I had in mind, in case it wasn't completely clear. Not
> > tested, other than it compiles. Basically it eliminates the small
> > bio_vec pool, and grows the bio by 16 bytes on 64-bit archs, or by
> > 12 bytes on 32-bit archs instead, and uses the room at the end for the
> > bio_vec structure.
>
> Yeah, I had a very similar patch queued internally for the large benchmark
> measurement. I will post the result as soon as I get it.

Jens, this improves 0.25% on our db transaction processing benchmark setup.
The patch tested is (on top of 2.6.19):
http://marc.theaimsgroup.com/?l=linux-kernel&m=116539972229021&w=2

- Ken
RE: [patch] speed up single bio_vec allocation
Andi Kleen wrote on Thursday, December 07, 2006 6:28 PM
> "Chen, Kenneth W" <[EMAIL PROTECTED]> writes:
> > I tried to use cache_line_size() to find out the alignment of struct
> > bio, but stumbled on the fact that it is a runtime function for x86_64.
>
> It's a single global variable access:
>
> #define cache_line_size() (boot_cpu_data.x86_cache_alignment)
>
> Or do you mean it caused cache misses? boot_cpu_data is cache aligned
> and practically read only, so there shouldn't be any false sharing at
> least.

No, I was looking for a generic constant that describes the cache line
size. I needed a compile-time constant in order to avoid a runtime check
and to rely on the compiler to optimize a conditional check away.
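For illustration, here is a hedged sketch of the kind of build-time check a
constant enables, assuming the kernel's L1_CACHE_BYTES constant from
<linux/cache.h>; the struct and the check are made up. With a compile-time
constant, gcc can fold the conditional away entirely and emit only one arm,
which a runtime cache_line_size() lookup cannot guarantee.

#include <linux/cache.h>	/* L1_CACHE_BYTES is a compile-time constant */

struct example {
	unsigned long a, b;
};

/* the compiler evaluates this condition at build time */
static inline int example_fits_one_cache_line(void)
{
	if (sizeof(struct example) <= L1_CACHE_BYTES)
		return 1;
	return 0;
}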
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 1:46 PM
> the current code is straightforward and obviously correct. you want
> to make the alloc/dealloc paths more complex, by special-casing for an
> arbitrary limit of "small" I/O, AFAICT. of *course* you can expect
> less overhead when you're doing one large I/O vs. two small ones,
> that's the whole reason we have all this code to try to coalesce
> contiguous I/O, do readahead, swap page clustering, etc. we *want*
> more complexity if it will get us bigger I/Os. I don't see why we
> want more complexity to reduce the *inherent* penalty of doing smaller
> ones.

You should check out the latest proposal from Jens Axboe, which treats all
biovec sizes the same and stuffs the biovec along with struct bio. I think
it is a better approach than my first cut of special-casing the 1-segment
biovec. His patch will speed up I/O of all sizes.

- Ken
RE: [patch] speed up single bio_vec allocation
Nate Diller wrote on Thursday, December 07, 2006 11:22 AM
> > > I still can't help but think we can do better than this, and that this
> > > is nothing more than optimizing for a benchmark. For high performance
> > > I/O, you will be doing > 1 page bio's anyway and this patch wont help
> > > you at all. Perhaps we can just kill bio_vec slabs completely, and
> > > create bio slabs instead with differing sizes. So instead of having 1
> > > bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> > > for the bio_vec list at the end. That would always eliminate the extra
> > > allocation, at the cost of blowing the 256-page case into an order 1 page
> > > allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> > > which is something I've always tried to avoid.
> >
> > I took a quick query of biovec-* slab stats on various production
> > machines; the majority of the allocations are for 1 and 4 segments, and
> > usage falls off quickly at 16 or more. A 256-segment biovec allocation
> > is really rare. I think it makes sense to heavily bias towards smaller
> > biovec allocations and have a separate biovec allocation path for really
> > large ones.
>
> what file system? have you tested with more than one? have you
> tested with file systems that build their own bio's instead of using
> get_block() calls? have you tested with large files or streaming
> workloads? how about direct I/O?
>
> i think that a "heavy bias" toward small biovecs is FS and workload
> dependent, and that it's irresponsible to make such unjustified
> changes just to show improvement on your particular benchmark.

No doubt the above data is just a quick estimate of one usage model; there
are tons of other usages in the world. After all, any algorithm in the
kernel has to be generic and self-tune to its specific environment.

On a very large I/O, the relative overhead of allocating the biovec
decreases, because a larger I/O needs more code to do the setup, more code
to perform I/O completion, more code in the device driver, etc. So the time
spent on one mempool alloc is amortized over the size of the I/O. On a
smaller I/O, the overhead is more visible, and it thus makes sense to me to
cut down that relative overhead.

In fact, large I/O already has an unfair advantage: if you do one 1MB I/O,
only one call to mempool is made, to get a 256-segment biovec. However, if
you do two 512KB I/Os, two calls to mempool are made. So in some sense, the
current scheme is unfair to small I/O.

- Ken
[patch] optimize o_direct on block device - v3
This patch implements a block device specific .direct_IO method instead of
going through the generic direct_io_worker for block devices.
direct_io_worker is fairly complex because it needs to handle O_DIRECT on
file systems, where it has to perform block allocation, hole detection,
extending the file on write, and tons of other corner cases. The end result
is that it takes tons of CPU time to submit an I/O.

For a block device, the block allocation is much simpler, and a tight
triple loop can be written to iterate over each iovec and each page within
the iovec in order to construct/prepare a bio structure and then submit it
to the block layer. This significantly speeds up O_DIRECT on block devices.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

---
This is v3 of the patch; I have addressed all the comments from Andrew,
Christoph, and Zach. There are too many v2->v3 changes to list here. I've
created 34 test cases specifically for corner cases and tested this patch.
(My monkey test code is on http://kernel-perf.sourceforge.net/diotest).

diff -Nurp linux-2.6.19/fs/block_dev.c linux-2.6.19.ken/fs/block_dev.c
--- linux-2.6.19/fs/block_dev.c	2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.19.ken/fs/block_dev.c	2006-12-06 13:16:43.000000000 -0800
@@ -129,43 +129,188 @@ blkdev_get_block(struct inode *inode, se
 	return 0;
 }

-static int
-blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
+static int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
-	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	struct kiocb *iocb = bio->bi_private;
+	atomic_t *bio_count = &iocb->ki_bio_count;

-	if ((iblock + max_blocks) > end_block) {
-		max_blocks = end_block - iblock;
-		if ((long)max_blocks <= 0) {
-			if (create)
-				return -EIO;	/* write fully beyond EOF */
-			/*
-			 * It is a read which is fully beyond EOF. We return
-			 * a !buffer_mapped buffer
-			 */
-			max_blocks = 0;
-		}
+	if (bio_data_dir(bio) == READ)
+		bio_check_pages_dirty(bio);
+	else {
+		bio_release_pages(bio);
+		bio_put(bio);
+	}
+
+	/* iocb->ki_nbytes stores error code from LLDD */
+	if (error)
+		iocb->ki_nbytes = -EIO;
+
+	if (atomic_dec_and_test(bio_count)) {
+		if (iocb->ki_nbytes < 0)
+			aio_complete(iocb, iocb->ki_nbytes, 0);
+		else
+			aio_complete(iocb, iocb->ki_left, 0);
 	}

-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	bh->b_size = max_blocks << inode->i_blkbits;
-	if (max_blocks)
-		set_buffer_mapped(bh);
 	return 0;
 }

+#define VEC_SIZE	16
+struct pvec {
+	unsigned short nr;
+	unsigned short idx;
+	struct page *page[VEC_SIZE];
+};
+
+#define PAGES_SPANNED(addr, len)	\
+	(DIV_ROUND_UP((addr) + (len), PAGE_SIZE) - (addr) / PAGE_SIZE);
+
+/*
+ * get page pointer for user addr, we internally cache struct page array for
+ * (addr, count) range in pvec to avoid frequent call to get_user_pages.  If
+ * internal page list is exhausted, a batch count of up to VEC_SIZE is used
+ * to get next set of page struct.
+ */
+static struct page *blk_get_page(unsigned long addr, size_t count, int rw,
+				 struct pvec *pvec)
+{
+	int ret, nr_pages;
+	if (pvec->idx == pvec->nr) {
+		nr_pages = PAGES_SPANNED(addr, count);
+		nr_pages = min(nr_pages, VEC_SIZE);
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, nr_pages,
+				     rw == READ, 0, pvec->page, NULL);
+		up_read(&current->mm->mmap_sem);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		pvec->nr = ret;
+		pvec->idx = 0;
+	}
+	return pvec->page[pvec->idx++];
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-			loff_t offset, unsigned long nr_segs)
+			loff_t pos, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blocksize_mask = (1 << blkbits) - 1;
+	unsigned long seg = 0;	/* iov segment iterator */
+	unsigned long nvec;	/* number of bio vec needed */
+	unsigned long cur_off;	/* offset into current page */
+	unsigned long cur_len;	/* I/O len of current page, up to PAGE_SIZE */
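As a quick sanity check of the PAGES_SPANNED arithmetic in the patch above,
here is a standalone userspace rendition; this is not the kernel macro
itself, and PAGE_SIZE is assumed to be 4096 for the example.

#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed for the example */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define PAGES_SPANNED(addr, len) \
	(DIV_ROUND_UP((addr) + (len), PAGE_SIZE) - (addr) / PAGE_SIZE)

int main(void)
{
	/* one full, page-aligned page: spans exactly 1 page frame */
	assert(PAGES_SPANNED(0x1000UL, 4096UL) == 1);
	/* 4096 bytes starting mid-page: straddles 2 page frames */
	assert(PAGES_SPANNED(0x1800UL, 4096UL) == 2);
	/* 64KB starting unaligned: 16 pages of data touch 17 page frames */
	assert(PAGES_SPANNED(0x1234UL, 65536UL) == 17);
	printf("ok\n");
	return 0;
}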
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Wednesday, December 06, 2006 2:09 AM
> > > I will try that too. I'm a bit touchy about sharing a cache line for
> > > different bio. But given that there are 200,000 I/O per second we are
> > > currently pushing the kernel, the chances of two cpu working on two
> > > bio that sits in the same cache line are pretty small.
> >
> > Yep I really think so. Besides, it's not like we are repeatedly writing
> > to these objects in the first place.
>
> This is what I had in mind, in case it wasn't completely clear. Not
> tested, other than it compiles. Basically it eliminates the small
> bio_vec pool, and grows the bio by 16-bytes on 64-bit archs, or by
> 12-bytes on 32-bit archs instead and uses the room at the end for the
> bio_vec structure.

Yeah, I had a very similar patch queued internally for the large benchmark
measurement. I will post the result as soon as I get it.

> I still can't help but think we can do better than this, and that this
> is nothing more than optimizing for a benchmark. For high performance
> I/O, you will be doing > 1 page bio's anyway and this patch wont help
> you at all. Perhaps we can just kill bio_vec slabs completely, and
> create bio slabs instead with differing sizes. So instead of having 1
> bio slab and 5 bio_vec slabs, change that to 5 bio slabs that leave room
> for the bio_vec list at the end. That would always eliminate the extra
> allocation, at the cost of blowing the 256-page case into an order 1 page
> allocation (256*16 + sizeof(*bio) > PAGE_SIZE) for the 4kb 64-bit archs,
> which is something I've always tried to avoid.

I took a quick query of biovec-* slab stats on various production machines;
the majority of the allocations are for 1 and 4 segments, and usage falls
off quickly at 16 or more. A 256-segment biovec allocation is really rare.
I think it makes sense to heavily bias towards smaller biovec allocations
and have a separate biovec allocation path for really large ones.

- Ken
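The "room at the end" idea Jens describes is the familiar trailing-array
trick. Below is a minimal, self-contained userspace sketch with made-up
names, not Jens's actual patch: keep inline space for the common small case
inside the descriptor itself, and fall back to a second allocation only
when the I/O needs more segments than the inline space holds.

#include <stdlib.h>

struct vec { void *page; unsigned int len, offset; };

#define INLINE_VECS 1	/* inline space for the common 1-segment case */

struct io_desc {
	unsigned short vcnt;
	struct vec *vec;	/* points at inline_vecs or a heap array */
	struct vec inline_vecs[INLINE_VECS];	/* the "room at the end" */
};

static struct io_desc *io_desc_alloc(unsigned short nr_vecs)
{
	struct io_desc *d = malloc(sizeof(*d));
	if (!d)
		return NULL;
	d->vcnt = nr_vecs;
	if (nr_vecs <= INLINE_VECS) {
		d->vec = d->inline_vecs;	/* no second allocation */
	} else {
		d->vec = malloc(nr_vecs * sizeof(struct vec));
		if (!d->vec) {
			free(d);
			return NULL;
		}
	}
	return d;
}

static void io_desc_free(struct io_desc *d)
{
	if (d->vec != d->inline_vecs)
		free(d->vec);
	free(d);
}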
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:42 AM
> * Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > > but, please:
> > >
> > > > -#define SCHEDSTAT_VERSION 13
> > > > +#define SCHEDSTAT_VERSION 12
> > >
> > > change this to 14 instead. Versions should only go upwards, even if
> > > we revert to an earlier output format.
> >
> > Really? sched-decrease-number-of-load-balances.patch has not yet hit
> > the mainline and I think it's been in -mm for only a couple of weeks.
> > I'm trying to back out the change after briefly reviewing the patch.
>
> not a big issue but it costs nothing to go to version 14 - OTOH if any
> utility has been updated to version 13 and is forgotten about, it might
> break spuriously if we again go to 13 in the future.

OK, with reluctant agreement, I've changed the version to 14.

[patch] sched: remove lb_stopbalance counter

Remove the scheduler stats lb_stopbalance counter. This counter can be
calculated as lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to
create a gazillion counters when we can derive the value.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>
Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.000000000 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.000000000 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.000000000 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.000000000 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 14

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:32 AM
> * Chen, Kenneth W <[EMAIL PROTECTED]> wrote:
> > in -mm tree: I would like to revert the change adding the
> > lb_stopbalance counter. This count can be calculated as lb_balanced
> > - lb_nobusyg - lb_nobusyq. There is no need to create a gazillion
> > counters when we can derive the value. I'm rather against changing
> > the schedstat format unless it is absolutely necessary, as every
> > userland tool parsing /proc/schedstat needs to be updated, and it's a
> > real pain trying to keep multiple versions of it.
> >
> > Signed-off-by: Ken Chen <[EMAIL PROTECTED]>
>
> Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
>
> but, please:
>
> > -#define SCHEDSTAT_VERSION 13
> > +#define SCHEDSTAT_VERSION 12
>
> change this to 14 instead. Versions should only go upwards, even if we
> revert to an earlier output format.

Really? sched-decrease-number-of-load-balances.patch has not yet hit the
mainline, and I think it's been in -mm for only a couple of weeks. I'm
trying to back out the change after briefly reviewing the patch.

- Ken
[-mm patch] sched remove lb_stopbalance counter
Regarding sched-decrease-number-of-load-balances.patch, currently in the
-mm tree: I would like to revert the change adding the lb_stopbalance
counter. This count can be calculated as lb_balanced - lb_nobusyg -
lb_nobusyq. There is no need to create a gazillion counters when we can
derive the value. I'm rather against changing the schedstat format unless
it is absolutely necessary, as every userland tool parsing /proc/schedstat
needs to be updated, and it's a real pain trying to keep multiple versions
of it.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.000000000 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.000000000 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.000000000 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.000000000 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 12

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);
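For tools that still want the dropped counter, the derivation is mechanical.
A hedged userspace sketch follows; the inputs would come from parsing one
domain line of /proc/schedstat for one idle type, and the sample values are
made up for illustration.

#include <stdio.h>

/*
 * Derive the removed counter from the fields that remain:
 *   lb_stopbalance = lb_balanced - lb_nobusyg - lb_nobusyq
 */
static unsigned long lb_stopbalance(unsigned long lb_balanced,
				    unsigned long lb_nobusyg,
				    unsigned long lb_nobusyq)
{
	return lb_balanced - lb_nobusyg - lb_nobusyq;
}

int main(void)
{
	/* made-up sample values for illustration */
	printf("lb_stopbalance = %lu\n", lb_stopbalance(1000, 300, 200));
	return 0;
}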
RE: [-mm patch] sched remove lb_stopbalance counter
Ingo Molnar wrote on Tuesday, December 05, 2006 7:42 AM > * Chen, Kenneth W <[EMAIL PROTECTED]> wrote: > > > but, please: > > > > -#define SCHEDSTAT_VERSION 13 > > > > +#define SCHEDSTAT_VERSION 12 > > > change this to 14 instead. Versions should only go upwards, even if > > > we revert to an earlier output format. > > Really? sched-decrease-number-of-load-balances.patch has not yet hit > > mainline, and I think it has been in -mm for only a couple of weeks. > > I'm trying to back out the change after briefly reviewing the patch. > not a big issue but it costs nothing to go to version 14 - OTOH if any > utility has been updated to version 13 and is forgotten about, it might > break spuriously if we again go to 13 in the future. OK, with reluctant agreement, I've changed the version to 14.

[patch] sched: remove lb_stopbalance counter

Remove the scheduler stats lb_stopbalance counter. This counter can be calculated by: lb_balanced - lb_nobusyg - lb_nobusyq. There is no need to create a gazillion counters when we can derive the value.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]> Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]> Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

--- ./include/linux/sched.h.orig	2006-12-05 07:56:11.0 -0800
+++ ./include/linux/sched.h	2006-12-05 07:57:55.0 -0800
@@ -684,7 +684,6 @@ struct sched_domain {
 	unsigned long lb_hot_gained[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyg[MAX_IDLE_TYPES];
 	unsigned long lb_nobusyq[MAX_IDLE_TYPES];
-	unsigned long lb_stopbalance[MAX_IDLE_TYPES];

 	/* Active load balancing */
 	unsigned long alb_cnt;
--- ./kernel/sched.c.orig	2006-12-05 07:56:31.0 -0800
+++ ./kernel/sched.c	2006-12-05 08:02:11.0 -0800
@@ -428,7 +428,7 @@ static inline void task_rq_unlock(struct
  * bump this up when changing the output format or the meaning of an existing
  * format, so that tools can adapt (or abort)
  */
-#define SCHEDSTAT_VERSION 13
+#define SCHEDSTAT_VERSION 14

 static int show_schedstat(struct seq_file *seq, void *v)
 {
@@ -466,7 +466,7 @@ static int show_schedstat(struct seq_fil
 		seq_printf(seq, "domain%d %s", dcnt++, mask_str);
 		for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; itype++) {
-			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu",
+			seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu",
 				sd->lb_cnt[itype],
 				sd->lb_balanced[itype],
 				sd->lb_failed[itype],
@@ -474,8 +474,7 @@ static int show_schedstat(struct seq_fil
 				sd->lb_gained[itype],
 				sd->lb_hot_gained[itype],
 				sd->lb_nobusyq[itype],
-				sd->lb_nobusyg[itype],
-				sd->lb_stopbalance[itype]);
+				sd->lb_nobusyg[itype]);
 		}
 		seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu\n",
 			sd->alb_cnt, sd->alb_failed, sd->alb_pushed,
@@ -2579,10 +2578,8 @@ redo:
 	group = find_busiest_group(sd, this_cpu, &imbalance, idle, &sd_idle,
 				   &cpus, balance);

-	if (*balance == 0) {
-		schedstat_inc(sd, lb_stopbalance[idle]);
+	if (*balance == 0)
 		goto out_balanced;
-	}

 	if (!group) {
 		schedstat_inc(sd, lb_nobusyg[idle]);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
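The breakage Ingo is guarding against lives on the tool side. A hypothetical parser (not any particular existing utility) pinned to one format version shows why a reused version number would fail silently while a bumped one fails loudly:

#include <stdio.h>
#include <stdlib.h>

#define SUPPORTED_SCHEDSTAT_VERSION 14  /* the only format this parser knows */

int main(void)
{
        FILE *f = fopen("/proc/schedstat", "r");
        unsigned int ver;

        if (!f) {
                perror("fopen /proc/schedstat");
                return EXIT_FAILURE;
        }
        /* the file starts with a "version NN" header line */
        if (fscanf(f, "version %u", &ver) != 1 ||
            ver != SUPPORTED_SCHEDSTAT_VERSION) {
                /* if 13 were ever reused with a different field layout, a
                 * tool still pinned to the old 13 would parse garbage
                 * without noticing -- hence "versions only go upwards" */
                fprintf(stderr, "unsupported schedstat format\n");
                fclose(f);
                return EXIT_FAILURE;
        }
        /* ... parse cpuN and domainN lines here ... */
        fclose(f);
        return 0;
}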
RE: [patch] add an iterator index in struct pagevec
Andrew Morton wrote on Monday, December 04, 2006 9:45 PM > On Mon, 4 Dec 2006 21:21:31 -0800 > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > pagevec is never expected to hold more than PAGEVEC_SIZE pages; an > > unsigned char is enough to count them. This patch makes nr and cold > > unsigned char > > Is that on the right side of the speed/space tradeoff? I haven't measured speed. Size-wise, making them char shrinks the vmlinux text size by 112 bytes on x86_64 (using the default config options). > I must say I'm a bit skeptical about the need for this. But I haven't > looked closely at the blockdev-specific dio code yet. It was suggested to declare another struct that embeds pagevec to perform the iteration, but I prefer to give pagevec the capability itself; it is more compact this way. It would be nice if you could review the blockdev-specific dio code. I would appreciate it very much. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] add an iterator index in struct pagevec
pagevec is never expected to hold more than PAGEVEC_SIZE pages; an unsigned char is enough to count them. This patch makes nr and cold unsigned char and also adds an iterator index. With that, PAGEVEC_SIZE can even be bumped up by 1, to 15.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/include/linux/pagevec.h linux-2.6.19.ken/include/linux/pagevec.h
--- linux-2.6.19/include/linux/pagevec.h	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/include/linux/pagevec.h	2006-12-04 19:18:21.0 -0800
@@ -8,15 +8,16 @@
 #ifndef _LINUX_PAGEVEC_H
 #define _LINUX_PAGEVEC_H

-/* 14 pointers + two long's align the pagevec structure to a power of two */
-#define PAGEVEC_SIZE	14
+/* 15 pointers + 3 char's align the pagevec structure to a power of two */
+#define PAGEVEC_SIZE	15

 struct page;
 struct address_space;

 struct pagevec {
-	unsigned long nr;
-	unsigned long cold;
+	unsigned char nr;
+	unsigned char cold;
+	unsigned char idx;
 	struct page *pages[PAGEVEC_SIZE];
 };

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
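The power-of-two claim is easy to verify standalone. This sketch mocks the patched layout (it is not the kernel header itself) and checks the size for whatever ABI it is compiled on:

#include <stdio.h>

#define PAGEVEC_SIZE 15

struct page;                            /* opaque, pointer-only use */

struct pagevec {                        /* mock of the patched layout */
        unsigned char nr;
        unsigned char cold;
        unsigned char idx;
        struct page *pages[PAGEVEC_SIZE];
};

int main(void)
{
        size_t sz = sizeof(struct pagevec);

        /* 3 bytes + 5 padding + 15 * 8-byte pointers = 128 on 64-bit;
         * 3 bytes + 1 padding + 15 * 4-byte pointers = 64 on 32-bit */
        printf("sizeof(struct pagevec) = %zu, power of two: %s\n",
               sz, (sz & (sz - 1)) == 0 ? "yes" : "no");
        return 0;
}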
[patch] optimize o_direct on block device - v2
This patch implements a block device specific .direct_IO method instead of going through the generic direct_io_worker for block devices. direct_io_worker is fairly complex because it needs to handle O_DIRECT on file systems, where it needs to perform block allocation, hole detection, file extension on write, and tons of other corner cases. The end result is that it takes tons of CPU time to submit an I/O. For a block device, block allocation is much simpler, and a tight triple loop can be written to iterate each iovec and each page within the iovec in order to construct/prepare a bio structure and then submit it to the block layer. This significantly speeds up O_DIRECT on block devices.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

---
Changes since v1->v2:
* add BUILD_BUG_ON to ensure bio_count fits inside iocb->private
* add comment that bio_alloc won't fail with GFP_KERNEL
* fix back-out path if get_user_pages fails
* fix back-out path if iov segment doesn't align properly

 fs/bio.c            |    2
 fs/block_dev.c      |  173
 fs/read_write.c     |    2
 include/linux/bio.h |    1
 4 files changed, 150 insertions(+), 28 deletions(-)

--- ./fs/block_dev.c.orig	2006-11-29 13:57:37.0 -0800
+++ ./fs/block_dev.c	2006-12-04 18:38:53.0 -0800
@@ -129,43 +129,164 @@ blkdev_get_block(struct inode *inode, se
 	return 0;
 }

-static int
-blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		struct buffer_head *bh, int create)
+int blk_end_aio(struct bio *bio, unsigned int bytes_done, int error)
 {
-	sector_t end_block = max_block(I_BDEV(inode));
-	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
+	struct kiocb* iocb = bio->bi_private;
+	atomic_t* bio_count = (atomic_t *) &iocb->private;
+	long res;
+
+	if ((bio->bi_rw & 1) == READ)
+		bio_check_pages_dirty(bio);
+	else {
+		bio_release_pages(bio);
+		bio_put(bio);
+	}

-	if ((iblock + max_blocks) > end_block) {
-		max_blocks = end_block - iblock;
-		if ((long)max_blocks <= 0) {
-			if (create)
-				return -EIO;	/* write fully beyond EOF */
-			/*
-			 * It is a read which is fully beyond EOF. We return
-			 * a !buffer_mapped buffer
-			 */
-			max_blocks = 0;
-		}
+	if (error)
+		iocb->ki_left = -EIO;
+
+	if (atomic_dec_and_test(bio_count)) {
+		res = (iocb->ki_left < 0) ? iocb->ki_left : iocb->ki_nbytes;
+		aio_complete(iocb, res, 0);
 	}

-	bh->b_bdev = I_BDEV(inode);
-	bh->b_blocknr = iblock;
-	bh->b_size = max_blocks << inode->i_blkbits;
-	if (max_blocks)
-		set_buffer_mapped(bh);
 	return 0;
 }

+#define VEC_SIZE	16
+struct pvec {
+	unsigned short nr;
+	unsigned short idx;
+	struct page *page[VEC_SIZE];
+};
+
+
+struct page *blk_get_page(unsigned long addr, size_t count, int rw,
+			struct pvec *pvec)
+{
+	int ret, nr_pages;
+	if (pvec->idx == pvec->nr) {
+		nr_pages = (addr + count + PAGE_SIZE - 1) / PAGE_SIZE -
+			addr / PAGE_SIZE;
+		nr_pages = min(nr_pages, VEC_SIZE);
+		down_read(&current->mm->mmap_sem);
+		ret = get_user_pages(current, current->mm, addr, nr_pages,
+				     rw == READ, 0, pvec->page, NULL);
+		up_read(&current->mm->mmap_sem);
+		if (ret < 0)
+			return ERR_PTR(ret);
+		pvec->nr = ret;
+		pvec->idx = 0;
+	}
+	return pvec->page[pvec->idx++];
+}
+
 static ssize_t
 blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
-			loff_t offset, unsigned long nr_segs)
+			loff_t pos, unsigned long nr_segs)
 {
-	struct file *file = iocb->ki_filp;
-	struct inode *inode = file->f_mapping->host;
+	struct inode *inode = iocb->ki_filp->f_mapping->host;
+	unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+	unsigned blocksize_mask = (1 << blkbits) - 1;
+	unsigned long seg, nvec, cur_off, cur_len;
+
+	unsigned long addr;
+	size_t count, nbytes = iocb->ki_nbytes;
+	loff_t size;
+	struct bio * bio;
+	atomic_t *bio_count = (atomic_t *) &iocb->private;
+	struct page *page;
+	struct pvec pvec = {.nr = 0, .idx = 0, };
+
+	BUILD_BUG_ON(sizeof(atomic_t) > sizeof(iocb->private));
+
+	size = i_size_read(inode);
+	if (pos + nbytes > size)
+		nbytes = size - pos;
+
+	seg = 0;
+	addr = (unsigned long) iov[0].iov_base;
+	count = iov[0].iov_len;
+	atomic_set(bio_count, 1);
+
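The archive cuts the quoted patch off just after the submission-side setup, so the completion scheme is worth restating: iocb->private is reused as an atomic_t that counts in-flight bios plus one reference held by the submitter, and blk_end_aio() (shown in full above) completes the iocb when the count drops to zero. A simplified sketch of that handshake follows; it shows the pattern only, not the verbatim patch code:

	atomic_t *bio_count = (atomic_t *) &iocb->private;

	atomic_set(bio_count, 1);		/* submitter holds one reference */

	while (nbytes) {
		/* ... allocate a bio, set bio->bi_private = iocb and
		 * bio->bi_end_io = blk_end_aio, fill it with pages ... */
		atomic_inc(bio_count);		/* one reference per in-flight bio */
		submit_bio(rw, bio);
	}

	/* drop the submitter's reference; whoever reaches zero first --
	 * here or in blk_end_aio() -- calls aio_complete() */
	if (atomic_dec_and_test(bio_count))
		aio_complete(iocb, iocb->ki_nbytes, 0);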
RE: [patch] speed up single bio_vec allocation
Jens Axboe wrote on Monday, December 04, 2006 12:07 PM > On Mon, Dec 04 2006, Chen, Kenneth W wrote: > > On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio > > slab is created with the SLAB_HWCACHE_ALIGN flag, there is usually spare > > memory available at the end of a bio. I think we can utilize that memory > > for bio_vec allocation. The purpose is not so much to save memory > > consumption for bio_vec; instead, I'm attempting to optimize away a call > > to bvec_alloc_bs. > > > > So here is a patch to do just that for a 1-segment bio_vec (we currently > > only have space for 1 on 2.6.19). The detection of whether spare space > > is available is calculated at compile time. If there is no space > > available, there will be no runtime cost at all because gcc simply > > optimizes away all the code added in this patch. If there is space > > available, the only runtime check is to see what the size of the iovec > > is, and we do the appropriate assignment to bio->bi_io_vec etc. The cost > > is minimal and we gain a whole lot back from not calling the > > bvec_alloc_bs() function. > > > > I tried to use cache_line_size() to find out the alignment of struct > > bio, but stumbled on the fact that it is a runtime function on x86_64. > > So instead I made bio hint to the slab allocator to align on 32 bytes > > (slab will use the larger of the hw cache line size and the caller's > > "align" hint). I think it is a sane number for the majority of the CPUs > > out in the world. > > Any benchmarks for this one? About 0.2% on a database transaction processing benchmark. It was done a while back on top of a major Linux vendor kernel. I will retest it for 2.6.19. > [...] > > Another idea would be to kill SLAB_HWCACHE_ALIGN (it's pretty pointless, > I bet), and always alloc sizeof(*bio) + sizeof(*bvl) in one go when a > bio is allocated. It doesn't add a lot of overhead even for the case > where we do > 1 page bios, and it gets rid of the dual allocation for > the 1 page bio. I will try that too. I'm a bit touchy about sharing a cache line between different bios. But given that we are currently pushing the kernel to 200,000 I/Os per second, the chances of two CPUs working on two bios that sit in the same cache line are pretty small. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
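Jens's suggestion -- one allocation sized for the bio plus a single bio_vec -- would look roughly like the sketch below. The wrapper struct, slab, and helper names are hypothetical illustrations of the idea, not actual kernel symbols:

	/* one slab object carries both the bio and an inline 1-entry bvec */
	struct bio_plus_bvec {
		struct bio	bio;
		struct bio_vec	inline_vec;	/* used when nr_iovecs == 1 */
	};

	static kmem_cache_t *bio_plus_slab;	/* sized for the pair above */

	static struct bio *bio_alloc_one_seg(gfp_t gfp_mask)
	{
		struct bio_plus_bvec *b;

		b = kmem_cache_alloc(bio_plus_slab, gfp_mask);
		if (!b)
			return NULL;
		bio_init(&b->bio);
		b->bio.bi_io_vec = &b->inline_vec;	/* no second allocation */
		b->bio.bi_max_vecs = 1;
		return &b->bio;
	}

It would also sidestep the cache-line sharing worry above: the inline bvec shares a line with its own bio rather than with a neighboring one.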
RE: [patch] remove redundant iov segment check
Andrew Morton wrote on Monday, December 04, 2006 11:36 AM > On Mon, 4 Dec 2006 08:26:36 -0800 > "Chen, Kenneth W" <[EMAIL PROTECTED]> wrote: > > > So it's not possible to call down to generic_file_aio_read/write with an invalid > > iov segment. This patch deletes the redundant code. > > erp, please don't touch that code. > > The writev deadlock fixes which went into 2.6.17 trashed writev() > performance and Nick and I are slowly trying to get it back, while fixing > the has-been-there-forever pagefault-in-write() deadlock. > > This is all proving very hard to do, and we don't need sweeping code > cleanups happening under our feet ;) > > I'll bring those patches back in next -mm, but not very confidently. OK, I will wait until that bug settles down and resubmit. I really don't see the value of walking the iov multiple times doing the same thing. I will also dig up the archive to see if I can help in any way. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [patch] remove redundant iov segment check
Zach Brown wrote on Monday, December 04, 2006 11:19 AM > On Dec 4, 2006, at 8:26 AM, Chen, Kenneth W wrote: > > > The access_ok() and negative length check on each iov segment in > > function generic_file_aio_read/write are redundant. They are all > > already checked before calling down to these low level generic > > functions. > ... > > > So it's not possible to call down to generic_file_aio_read/write > > with an invalid iov segment. > > Well, generic_file_aio_{read,write}() are exported to modules, so > anything's *possible*. :) > > This change makes me nervous because it relies on our ability to > audit all code paths to ensure that it's correct. It'd be nice if > the code enforced the rules. Maybe we should create another internal generic_file_aio_read/write for in-kernel callers? fs/read_write.c and fs/aio.c cannot be built as modules, and the checks are already done there. For external modules, we can do the checks and then call down to the internal one. I hate to see the iov being walked multiple times. And this is part of my effort to bring O_DIRECT performance back to the level of a three-year-old vendor kernel based on 2.4. - Ken - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
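The split being floated here would keep validation only at the module-facing boundary. A rough sketch follows; the __-prefixed internal name follows the usual kernel convention but is hypothetical, not from an actual patch:

	/* in-kernel callers (fs/read_write.c, fs/aio.c) have already
	 * validated the iovec, so they call __generic_file_aio_read() */
	ssize_t generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
				      unsigned long nr_segs, loff_t pos)
	{
		size_t count = 0;
		unsigned long seg;

		/* validate once here, for external modules using the export */
		for (seg = 0; seg < nr_segs; seg++) {
			count += iov[seg].iov_len;
			if (unlikely((ssize_t)(count | iov[seg].iov_len) < 0))
				return -EINVAL;
			if (!access_ok(VERIFY_WRITE, iov[seg].iov_base,
				       iov[seg].iov_len))
				return -EFAULT;
		}
		return __generic_file_aio_read(iocb, iov, nr_segs, pos);
	}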
[patch] speed up single bio_vec allocation
On a 64-bit arch like x86_64, struct bio is 104 bytes. Since the bio slab is created with the SLAB_HWCACHE_ALIGN flag, there is usually spare memory available at the end of a bio. I think we can utilize that memory for bio_vec allocation. The purpose is not so much to save memory consumption for bio_vec; instead, I'm attempting to optimize away a call to bvec_alloc_bs.

So here is a patch to do just that for a 1-segment bio_vec (we currently only have space for 1 on 2.6.19). The detection of whether spare space is available is calculated at compile time. If there is no space available, there will be no runtime cost at all because gcc simply optimizes away all the code added in this patch. If there is space available, the only runtime check is to see what the size of the iovec is, and we do the appropriate assignment to bio->bi_io_vec etc. The cost is minimal and we gain a whole lot back from not calling the bvec_alloc_bs() function.

I tried to use cache_line_size() to find out the alignment of struct bio, but stumbled on the fact that it is a runtime function on x86_64. So instead I made bio hint to the slab allocator to align on 32 bytes (slab will use the larger of the hw cache line size and the caller's "align" hint). I think it is a sane number for the majority of the CPUs out in the world.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

--- ./fs/bio.c.orig	2006-12-03 17:20:36.0 -0800
+++ ./fs/bio.c	2006-12-03 21:29:20.0 -0800
@@ -29,11 +29,14 @@
 #include <scsi/sg.h>		/* for struct sg_iovec */

 #define BIO_POOL_SIZE 256
-
+#define BIO_ALIGN 32	/* minimal bio structure alignment */
 static kmem_cache_t *bio_slab __read_mostly;

 #define BIOVEC_NR_POOLS 6
+#define BIOVEC_FIT_INSIDE_BIO_CACHE_LINE			\
+	(ALIGN(sizeof(struct bio), BIO_ALIGN) ==		\
+	 ALIGN(sizeof(struct bio) + sizeof(struct bio_vec), BIO_ALIGN))

 /*
  * a small number of entries is fine, not going to be performance critical.
  * basically we just need to survive
@@ -113,7 +116,8 @@ void bio_free(struct bio *bio, struct bi

 	BIO_BUG_ON(pool_idx >= BIOVEC_NR_POOLS);

-	mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
+	if (!BIOVEC_FIT_INSIDE_BIO_CACHE_LINE || pool_idx)
+		mempool_free(bio->bi_io_vec, bio_set->bvec_pools[pool_idx]);
 	mempool_free(bio, bio_set->bio_pool);
 }

@@ -166,7 +170,15 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		struct bio_vec *bvl = NULL;

 		bio_init(bio);
-		if (likely(nr_iovecs)) {
+
+		/*
+		 * if bio_vec can fit into remaining cache line of struct
+		 * bio, go ahead use it and skip mempool allocation.
+		 */
+		if (nr_iovecs == 1 && BIOVEC_FIT_INSIDE_BIO_CACHE_LINE) {
+			bvl = (struct bio_vec*) (bio + 1);
+			bio->bi_max_vecs = 1;
+		} else if (likely(nr_iovecs)) {
 			unsigned long idx = 0; /* shut up gcc */

 			bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
@@ -1204,7 +1216,7 @@ static void __init biovec_init_slabs(voi
 		struct biovec_slab *bvs = bvec_slabs + i;

 		size = bvs->nr_vecs * sizeof(struct bio_vec);
-		bvs->slab = kmem_cache_create(bvs->name, size, 0,
+		bvs->slab = kmem_cache_create(bvs->name, size, BIO_ALIGN,
 				SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL, NULL);
 	}
 }

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
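Plugging in the numbers from the description makes the compile-time test concrete. This standalone arithmetic check assumes the quoted x86_64 sizes (struct bio 104 bytes, struct bio_vec 16 bytes); it is not kernel code:

#include <stdio.h>

#define BIO_ALIGN 32
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

int main(void)
{
        unsigned long bio_sz = 104;     /* sizeof(struct bio), x86_64 */
        unsigned long bvec_sz = 16;     /* sizeof(struct bio_vec), x86_64 */

        /* ALIGN(104, 32) == 128 and ALIGN(104 + 16, 32) == 128: the
         * 1-entry bio_vec rides in the slab padding for free */
        printf("padded bio: %lu, padded bio+bvec: %lu, fits: %s\n",
               ALIGN(bio_sz, BIO_ALIGN),
               ALIGN(bio_sz + bvec_sz, BIO_ALIGN),
               ALIGN(bio_sz, BIO_ALIGN) ==
               ALIGN(bio_sz + bvec_sz, BIO_ALIGN) ? "yes" : "no");
        return 0;
}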
[patch] remove redundant iov segment check
The access_ok() and negative length checks on each iov segment in generic_file_aio_read/write are redundant. They are all already done before calling down to these low-level generic functions. Vectored I/O (both sync and async) is checked via rw_copy_check_uvector(). For single-segment synchronous I/O, the checks are done in various places: access_ok() is checked in vfs_read/write, and the negative length is checked in rw_verify_area. For single-segment AIO, access_ok() is done in aio_setup_iocb, and the negative length is checked in io_submit_one. So it's not possible to call down to generic_file_aio_read/write with an invalid iov segment. This patch deletes the redundant code.

It also moves the negative length check for single-segment AIO into aio_setup_single_vector, to mirror aio_setup_vectored_rw so that both check for an illegal negative length. It fits better there too, because iocb->aio_nbytes now does double duty as the segment count for vectored AIO, and checking for a negative length in the entry point of all AIO isn't very obvious.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/fs/aio.c linux-2.6.19.ken/fs/aio.c
--- linux-2.6.19/fs/aio.c	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/fs/aio.c	2006-12-03 17:16:52.0 -0800
@@ -1416,6 +1416,8 @@ static ssize_t aio_setup_single_vector(s
 	kiocb->ki_nr_segs = 1;
 	kiocb->ki_cur_seg = 0;
 	kiocb->ki_nbytes = kiocb->ki_left;
+	if (unlikely((ssize_t) kiocb->ki_nbytes < 0))
+		return -EINVAL;
 	return 0;
 }

@@ -1560,8 +1562,7 @@ int fastcall io_submit_one(struct kioctx
 	/* prevent overflows */
 	if (unlikely(
 	    (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
-	    (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
-	    ((ssize_t)iocb->aio_nbytes < 0)
+	    (iocb->aio_nbytes != (size_t)iocb->aio_nbytes)
 	   )) {
 		pr_debug("EINVAL: io_submit: overflow check\n");
 		return -EINVAL;
diff -Nurp linux-2.6.19/mm/filemap.c linux-2.6.19.ken/mm/filemap.c
--- linux-2.6.19/mm/filemap.c	2006-11-29 13:57:37.0 -0800
+++ linux-2.6.19.ken/mm/filemap.c	2006-12-03 17:16:10.0 -0800
@@ -1143,29 +1143,9 @@ generic_file_aio_read(struct kiocb *iocb
 	struct file *filp = iocb->ki_filp;
 	ssize_t retval;
 	unsigned long seg;
-	size_t count;
+	size_t count = iocb->ki_left;
 	loff_t *ppos = &iocb->ki_pos;

-	count = 0;
-	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
-
-		/*
-		 * If any segment has a negative length, or the cumulative
-		 * length ever wraps negative then return -EINVAL.
-		 */
-		count += iv->iov_len;
-		if (unlikely((ssize_t)(count|iv->iov_len) < 0))
-			return -EINVAL;
-		if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len))
-			continue;
-		if (seg == 0)
-			return -EFAULT;
-		nr_segs = seg;
-		count -= iv->iov_len;	/* This segment is no good */
-		break;
-	}
-
 	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (filp->f_flags & O_DIRECT) {
 		loff_t size;
@@ -2228,32 +2208,11 @@ __generic_file_aio_write_nolock(struct k
 	size_t ocount;		/* original count */
 	size_t count;		/* after file limit checks */
 	struct inode	*inode = mapping->host;
-	unsigned long	seg;
 	loff_t		pos;
 	ssize_t		written;
 	ssize_t		err;

-	ocount = 0;
-	for (seg = 0; seg < nr_segs; seg++) {
-		const struct iovec *iv = &iov[seg];
-
-		/*
-		 * If any segment has a negative length, or the cumulative
-		 * length ever wraps negative then return -EINVAL.
-		 */
-		ocount += iv->iov_len;
-		if (unlikely((ssize_t)(ocount|iv->iov_len) < 0))
-			return -EINVAL;
-		if (access_ok(VERIFY_READ, iv->iov_base, iv->iov_len))
-			continue;
-		if (seg == 0)
-			return -EFAULT;
-		nr_segs = seg;
-		ocount -= iv->iov_len;	/* This segment is no good */
-		break;
-	}
-
-	count = ocount;
+	count = ocount = iocb->ki_left;
 	pos = *ppos;

 	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
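For context on where the retained checks live, this is the shape of the validation the changelog points at. It is condensed from the rw_copy_check_uvector() behavior described above; the helper name and exact body are illustrative, not verbatim kernel code:

	/* every segment is checked once, before the generic functions run */
	static ssize_t check_iovec(int access_type, const struct iovec *iov,
				   unsigned long nr_segs)
	{
		ssize_t total = 0;
		unsigned long seg;

		for (seg = 0; seg < nr_segs; seg++) {
			ssize_t len = (ssize_t) iov[seg].iov_len;

			/* negative length, or cumulative length wrapping */
			if (len < 0 || total + len < total)
				return -EINVAL;
			if (!access_ok(access_type, iov[seg].iov_base, len))
				return -EFAULT;
			total += len;
		}
		return total;
	}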
[patch] kill pointless ki_nbytes assignment in aio_setup_single_vector
io_submit_one assigns ki_left = ki_nbytes = iocb->aio_nbytes, then calls down to aio_setup_iocb and on to aio_setup_single_vector. There, ki_nbytes is reassigned to the same value it got two call levels above. There is no need to do so.

Signed-off-by: Ken Chen <[EMAIL PROTECTED]>

diff -Nurp linux-2.6.19/fs/aio.c linux-2.6.19.ken/fs/aio.c
--- linux-2.6.19/fs/aio.c	2006-12-03 17:16:52.0 -0800
+++ linux-2.6.19.ken/fs/aio.c	2006-12-03 17:19:06.0 -0800
@@ -1415,7 +1415,6 @@ static ssize_t aio_setup_single_vector(s
 	kiocb->ki_iovec->iov_len = kiocb->ki_left;
 	kiocb->ki_nr_segs = 1;
 	kiocb->ki_cur_seg = 0;
-	kiocb->ki_nbytes = kiocb->ki_left;
 	if (unlikely((ssize_t) kiocb->ki_nbytes < 0))
 		return -EINVAL;
 	return 0;

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/