On Wed, 23 May 2007 20:15:24 +0100 David Howells <[EMAIL PROTECTED]> wrote:
> Add a function - cancel_rejected_write() - to excise a rejected write from the > pagecache. This function is related to the truncation family of routines. It > permits the pages modified by a network filesystem client (such as AFS) to be > excised and discarded from the pagecache if the attempt to write them back to > the server fails. Could you please flesh this out a bit, from a higher level? As in: what does it mean for a server to "reject" a write? What's actually going on here? Why does the server do this? I assume that the application will see a short write or an EIO or something? How does this interact with MAP_SHARED? Do we expect that any other networked filesystem would want to do this? (and if not, why not?) Or do they already attempt to do it in some other fashion? > The dirty and writeback states of the afflicted pages are cancelled and the > pages themselves are detached for recycling. All PTEs referring to those > pages are removed. > > Note that the locking is tricky as it's very easy to deadlock against > truncate() and other routines once the pages have been unlocked as part of the > writeback process. To this end, the PG_error flag is set, then the > PG_writeback flag is cleared, and only *then* can lock_page() be called. hm. > Signed-off-by: David Howells <[EMAIL PROTECTED]> > --- > > include/linux/mm.h | 5 ++- > mm/truncate.c | 83 > ++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 86 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e4183c6..73688d4 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned > long, size_t); > > extern unsigned long do_brk(unsigned long, unsigned long); > > -/* filemap.c */ > -extern unsigned long page_unuse(struct page *); > +/* truncate.c */ > extern void truncate_inode_pages(struct address_space *, loff_t); > extern void truncate_inode_pages_range(struct address_space *, > loff_t lstart, loff_t lend); > +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t); > > +/* filemap.c */ > /* generic vm_area_ops exported for stackable file systems */ > extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, > int *); > extern int filemap_populate(struct vm_area_struct *, unsigned long, > diff --git a/mm/truncate.c b/mm/truncate.c > index 4fbe1a2..f742096 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space > *mapping) > return invalidate_inode_pages2_range(mapping, 0, -1); > } > EXPORT_SYMBOL_GPL(invalidate_inode_pages2); > + > +/* > + * Cancel that part of a rejected write that affects a particular page > + */ > +static void cancel_rejected_page(struct address_space *mapping, > + struct page *page, pgoff_t *_next) > +{ > + if (!TestSetPageError(page)) { > + /* can't lock the page until we've cleared PG_writeback lest we > + * deadlock with truncate (amongst other things) */ > + end_page_writeback(page); > + if (page->mapping == mapping) { > + lock_page(page); > + if (page->mapping == mapping) { > + truncate_complete_page(mapping, page); > + *_next = page->index + 1; > + } > + unlock_page(page); > + } > + } else if (PageWriteback(page) || PageDirty(page)) { > + BUG(); > + } > +} Why can't we just run the end_page_writeback() unconditionally? PG_writeback _must_ be set here, and it is the caller who set PG_writeback, so this thread of control "owns" that flag. Also, are you really really sure that there is no way in which PG_writeback can get itself set again after the end_page_writeback()? I'd have thought that we should be doing a wait_on_page_writeback() after the lock_page() there. Remember, other filesystems might want to be calling this, so we shouldn't be designing around AFS implementation specifics. > +/** > + * cancel_rejected_write - Cancel a write on a contiguous set of pages > + * @mapping: mapping affected > + * @start: first page in set > + * @end: last page in set > + * > + * Cancel a write of a contiguous set of pages when the writeback was > rejected > + * by the target medium or server. > + * > + * The pages in question are detached and discarded from the pagecache, and > the > + * writeback and dirty states are cleared prior to invalidation. The caller > + * must make sure that all the pages in the range are present in the > pagecache, > + * and the caller must hold PG_writeback on each of them. NOTE! All the > pages > + * are locked and unlocked as part of this process, so the caller must take > + * care to avoid deadlock. > + * > + * The PTEs pointing to those pages are also cleared, leading to the PTEs > being > + * reset when new pages are allocated and the contents reloaded. > + */ > +void cancel_rejected_write(struct address_space *mapping, > + pgoff_t start, pgoff_t end) > +{ > + struct pagevec pvec; > + pgoff_t n; > + int i; > + > + BUG_ON(mapping->nrpages < end - start + 1); > + > + /* dispose of any PTEs pointing to the affected pages */ > + unmap_mapping_range(mapping, > + (loff_t)start << PAGE_CACHE_SHIFT, > + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, > + 0); > + > + pagevec_init(&pvec, 0); > + do { > + cond_resched(); > + n = end - start + 1; > + if (n > PAGEVEC_SIZE) > + n = PAGEVEC_SIZE; > + n = pagevec_lookup(&pvec, mapping, start, n); > + for (i = 0; i < n; i++) { > + struct page *page = pvec.pages[i]; > + > + if (page->index < start || page->index > end) > + continue; > + start++; > + cancel_rejected_page(mapping, page, &start); > + } > + pagevec_release(&pvec); > + } while (start - 1 < end); > + > + /* dispose of any new PTEs pointing to the affected pages */ > + unmap_mapping_range(mapping, > + (loff_t)start << PAGE_CACHE_SHIFT, > + (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT, > + 0); > +} > +EXPORT_SYMBOL_GPL(cancel_rejected_write); hm, is the pte-unmapping here completely solid? Are there any race windows in which a faulter can end up owning, say, an anonymous page? We've had heaps of problems with that sort of thing in generic_file_direct_IO() and I don't expect it's this easy. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/