Re: [Patch] invalidate range of pages after direct IO write
>> But this won't happen if next started as 0 and we didn't update it.
>> I don't know if retrying is the intended behaviour or if we care that
>> the start == 0 case doesn't do it.
>
> Good point.  Let's make it explicit?

Looks great.  I briefly had visions of some bitfield to pack the three
boolean ints we have and then quickly came to my senses. :)

I threw together those other two patches that work with ranges around
direct IO (unmapping before r/w, and writing and waiting before reads).
rc3-mm1 is angry with my test machine so they're actually against current
-bk with this first invalidation patch applied.  I hope that doesn't make
life harder than it needs to be.  I'll send them under separate cover.

- z
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [Patch] invalidate range of pages after direct IO write
Zach Brown <[EMAIL PROTECTED]> wrote:
>
> Andrew Morton wrote:
>
> > I'd be inclined to hold off on the macro until we actually get the
> > open-coded stuff right..  Sometimes the page lookup loops take an end+1
> > argument and sometimes they take an inclusive `end'.  I think.  That
> > might make it a bit messy.
> >
> > How's this look?
>
> Looks good.  It certainly stops the worst behaviour we were worried
> about.  I wonder about 2 things:
>
> 1) Now that we're calculating a specific length for pagevec_lookup(), is
> testing that page->index > end still needed for pages that get remapped
> somewhere weird before we locked?  If it is, I imagine we should test
> for < start as well.

Nope.  We're guaranteed that the pages which pagevec_lookup() returned are
a) at or beyond `start' and b) in ascending pgoff_t order, although not
necessarily contiguous.  There will be gaps for not-present pages.  It's
just an in-order gather.

So we need the `page->index > end' test to cope with the situation where a
request for three pages returned the three pages at indices 10, 11 and
3,000,000.

> 2) If we get a pagevec full of pages that fail the != mapping test we
> will probably just try again, not having changed next.  This is fine, we
> won't find them in our mapping again.

Yes, good point.  That page should go away once we drop our ref, and it's
not in the radix tree any more.

> But this won't happen if next started as 0 and we didn't update it.  I
> don't know if retrying is the intended behaviour or if we care that the
> start == 0 case doesn't do it.

Good point.  Let's make it explicit?
--- 25/mm/truncate.c~invalidate-range-of-pages-after-direct-io-write-fix-fix	Fri Feb  4 15:33:52 2005
+++ 25-akpm/mm/truncate.c	Fri Feb  4 15:34:47 2005
@@ -260,11 +260,12 @@ int invalidate_inode_pages2_range(struct
 	int i;
 	int ret = 0;
 	int did_range_unmap = 0;
+	int wrapped = 0;
 
 	pagevec_init(&pvec, 0);
 	next = start;
-	while (next <= end &&
-		!ret && pagevec_lookup(&pvec, mapping, next,
+	while (next <= end && !ret && !wrapped &&
+		pagevec_lookup(&pvec, mapping, next,
 			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -277,6 +278,8 @@ int invalidate_inode_pages2_range(struct
 			}
 			wait_on_page_writeback(page);
 			next = page->index + 1;
+			if (next == 0)
+				wrapped = 1;
 			while (page_mapped(page)) {
 				if (!did_range_unmap) {
 					/*
@@ -307,8 +310,6 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_release(&pvec);
 		cond_resched();
-		if (next == 0)
-			break;	/* The pgoff_t wrapped */
 	}
 	return ret;
 }
_

> I'm being less excited by the iterating macro the more I think about it.
> Tearing down the pagevec in some complicated for(;;) doesn't sound
> reliable and I fear that some function that takes a per-page callback
> function pointer from the caller will turn people's stomachs.

heh, OK.
Re: [Patch] invalidate range of pages after direct IO write
Andrew Morton wrote:

> I'd be inclined to hold off on the macro until we actually get the
> open-coded stuff right..  Sometimes the page lookup loops take an end+1
> argument and sometimes they take an inclusive `end'.  I think.  That
> might make it a bit messy.
>
> How's this look?

Looks good.  It certainly stops the worst behaviour we were worried
about.  I wonder about 2 things:

1) Now that we're calculating a specific length for pagevec_lookup(), is
testing that page->index > end still needed for pages that get remapped
somewhere weird before we locked?  If it is, I imagine we should test
for < start as well.

2) If we get a pagevec full of pages that fail the != mapping test we
will probably just try again, not having changed next.  This is fine, we
won't find them in our mapping again.  But this won't happen if next
started as 0 and we didn't update it.  I don't know if retrying is the
intended behaviour or if we care that the start == 0 case doesn't do it.

I'm being less excited by the iterating macro the more I think about it.
Tearing down the pagevec in some complicated for(;;) doesn't sound
reliable and I fear that some function that takes a per-page callback
function pointer from the caller will turn people's stomachs.

- z
Re: [Patch] invalidate range of pages after direct IO write
Zach Brown <[EMAIL PROTECTED]> wrote:
>
> > I'll make that change and plop the patch into -mm, but we need to think
> > about the infinite-loop problem..
>
> I can try hacking together that macro and auditing pagevec_lookup()
> callers..

I'd be inclined to hold off on the macro until we actually get the
open-coded stuff right..  Sometimes the page lookup loops take an end+1
argument and sometimes they take an inclusive `end'.  I think.  That might
make it a bit messy.

How's this look?

- Don't look up more pages than we're going to use

- Don't test page->index until we've locked the page

- Check for the cursor wrapping at the end of the mapping.


--- 25/mm/truncate.c~invalidate-range-of-pages-after-direct-io-write-fix	2005-02-03 18:20:22.000000000 -0800
+++ 25-akpm/mm/truncate.c	2005-02-03 18:28:24.627796400 -0800
@@ -264,18 +264,14 @@ int invalidate_inode_pages2_range(struct
 	pagevec_init(&pvec, 0);
 	next = start;
 	while (next <= end &&
-		!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		!ret && pagevec_lookup(&pvec, mapping, next,
+			min(end - next, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			int was_dirty;
 
-			if (page->index > end) {
-				next = page->index;
-				break;
-			}
-
 			lock_page(page);
-			if (page->mapping != mapping) {	/* truncate race? */
+			if (page->mapping != mapping || page->index > end) {
 				unlock_page(page);
 				continue;
 			}
@@ -311,6 +307,8 @@ int invalidate_inode_pages2_range(struct
 		}
 		pagevec_release(&pvec);
 		cond_resched();
+		if (next == 0)
+			break;	/* The pgoff_t wrapped */
 	}
 	return ret;
 }
_
Re: [Patch] invalidate range of pages after direct IO write
Andrew Morton wrote:

> Note that the same optimisation should be made in the call to
> unmap_mapping_range() in generic_file_direct_IO().  Currently we try and
> unmap the whole file, even if we're only writing a single byte.  Given
> that you're now calculating iov_length() in there we might as well use
> that number a few lines further up in that function.

Indeed.  I can throw that together.

I also have a patch that introduces filemap_write_and_wait_range() and
calls it from the read bit of __blockdev_direct_IO().  It didn't change
the cheesy fsx load I was using and then I got distracted.  I can try
harder.

> Reading the code, I'm unable to convince myself that it won't go into an
> infinite loop if it finds a page at page->index = -1.  But I didn't try
> very hard ;)

I'm unconvinced as well.  I got the pattern from
truncate_inode_pages_range() and it seems to have a related problem when
end is something that page_index can never be greater than.  It just
starts over.

I wonder if we should add some mapping_for_each_range() (less ridiculous
names welcome :)) macro that handles this crap for the caller who just
works with page pointers.  We could introduce some iterator struct that
the caller would put on the stack and pass in to hold state, something
like the 'n' in list_for_each_safe().  It looks like a few
pagevec_lookup() callers could use this.

> Minor note on this:
>
> 	return invalidate_inode_pages2_range(mapping, 0, ~0UL);
>
> I just use `-1' there.

Roger.  I was just mimicking invalidate_inode_pages().

> I'll make that change and plop the patch into -mm, but we need to think
> about the infinite-loop problem..

I can try hacking together that macro and auditing pagevec_lookup()
callers..

- z
Re: [Patch] invalidate range of pages after direct IO write
Zach Brown <[EMAIL PROTECTED]> wrote:
>
> After a direct IO write only invalidate the pages that the write
> intersected.  invalidate_inode_pages2_range(mapping, pgoff start, pgoff
> end) is added and called from generic_file_direct_IO().  This doesn't
> break some subtle agreement with some other part of the code, does it?

It should be OK.

Note that the same optimisation should be made in the call to
unmap_mapping_range() in generic_file_direct_IO().  Currently we try and
unmap the whole file, even if we're only writing a single byte.  Given that
you're now calculating iov_length() in there we might as well use that
number a few lines further up in that function.

Note that invalidate_inode_pages2[_range] also does the unmapping thing -
in the direct-io case we don't expect that path to ever trigger: the only
way we'll find mapped pages here is if someone raced and faulted a page
back in.

Reading the code, I'm unable to convince myself that it won't go into an
infinite loop if it finds a page at page->index = -1.  But I didn't try
very hard ;)

Minor note on this:

	return invalidate_inode_pages2_range(mapping, 0, ~0UL);

I just use `-1' there.  We don't _know_ that pgoff_t is an unsigned long.
Some smarty may come along one day and make it unsigned long long, in
which case the code will break.  Using -1 here just works everywhere.

I'll make that change and plop the patch into -mm, but we need to think
about the infinite-loop problem..
[Patch] invalidate range of pages after direct IO write
After a direct IO write only invalidate the pages that the write
intersected.  invalidate_inode_pages2_range(mapping, pgoff start, pgoff
end) is added and called from generic_file_direct_IO().  This doesn't
break some subtle agreement with some other part of the code, does it?

While we're in there, invalidate_inode_pages2() was calling
unmap_mapping_range() with the wrong convention in the single page case.
It was providing the byte offset of the final page rather than the length
of the hole being unmapped.  This is also fixed.

This was lightly tested with a 10k op fsx run with O_DIRECT on a 16MB
file in ext3 on a junky old IDE drive.  Totaling vmstat columns of blocks
read and written during the runs shows that read traffic drops
significantly.  The run time seems to have gone down a little.

Two runs before the patch gave the following user/real/sys times and
total blocks in and out:

	0m28.029s 0m20.093s 0m3.166s	16673 125107
	0m27.949s 0m20.068s 0m3.227s	18426 126094

and after the patch:

	0m26.775s 0m19.996s 0m3.060s	3505 124982
	0m26.856s 0m19.935s 0m3.052s	3505 125279

Signed-off-by: Zach Brown <[EMAIL PROTECTED]>
---

 include/linux/fs.h |    2 ++
 mm/filemap.c       |    5 ++++-
 mm/truncate.c      |   52 ++++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 44 insertions(+), 15 deletions(-)

Index: 2.6-mm-odirinv/include/linux/fs.h
===================================================================
--- 2.6-mm-odirinv.orig/include/linux/fs.h	2005-01-28 14:14:19.000000000 -0800
+++ 2.6-mm-odirinv/include/linux/fs.h	2005-01-28 14:14:35.000000000 -0800
@@ -1369,6 +1369,8 @@
 	invalidate_inode_pages(inode->i_mapping);
 }
 extern int invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2_range(struct address_space *mapping,
+					 pgoff_t start, pgoff_t end);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);

Index: 2.6-mm-odirinv/mm/filemap.c
===================================================================
--- 2.6-mm-odirinv.orig/mm/filemap.c	2005-01-28 13:32:06.000000000 -0800
+++ 2.6-mm-odirinv/mm/filemap.c	2005-01-28 14:21:04.000000000 -0800
@@ -2325,7 +2325,10 @@
 	retval = mapping->a_ops->direct_IO(rw, iocb, iov, offset, nr_segs);
 	if (rw == WRITE && mapping->nrpages) {
-		int err = invalidate_inode_pages2(mapping);
+		pgoff_t end = (offset + iov_length(iov, nr_segs) - 1)
+				>> PAGE_CACHE_SHIFT;
+		int err = invalidate_inode_pages2_range(mapping,
+					offset >> PAGE_CACHE_SHIFT, end);
 		if (err)
 			retval = err;
 	}

Index: 2.6-mm-odirinv/mm/truncate.c
===================================================================
--- 2.6-mm-odirinv.orig/mm/truncate.c	2005-01-28 13:32:06.000000000 -0800
+++ 2.6-mm-odirinv/mm/truncate.c	2005-01-28 17:03:09.783939857 -0800
@@ -99,7 +99,7 @@
 }
 
 /**
- * truncate_inode_pages - truncate range of pages specified by start and
+ * truncate_inode_pages_range - truncate range of pages specified by start and
 * end byte offsets
 * @mapping: mapping to truncate
 * @lstart: offset from which to truncate
@@ -279,28 +279,38 @@
 EXPORT_SYMBOL(invalidate_inode_pages);
 
 /**
- * invalidate_inode_pages2 - remove all pages from an address_space
+ * invalidate_inode_pages2_range - remove range of pages from an address_space
 * @mapping - the address_space
+ * @start: the page offset 'from' which to invalidate
+ * @end: the page offset 'to' which to invalidate (inclusive)
 *
 * Any pages which are found to be mapped into pagetables are unmapped prior to
 * invalidation.
 *
 * Returns -EIO if any pages could not be invalidated.
 */
-int invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2_range(struct address_space *mapping,
+	pgoff_t start, pgoff_t end)
 {
 	struct pagevec pvec;
-	pgoff_t next = 0;
+	pgoff_t next;
 	int i;
 	int ret = 0;
-	int did_full_unmap = 0;
+	int did_range_unmap = 0;
 
 	pagevec_init(&pvec, 0);
-	while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+	next = start;
+	while (next <= end &&
+		!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
 			int was_dirty;
 
+			if (page->index > end) {
+				next = page->index;
+				break;
+			}
+