Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache

Andrew Morton Thu, 24 May 2007 13:40:27 -0700

On Wed, 23 May 2007 20:15:24 +0100
David Howells <[EMAIL PROTECTED]> wrote:


> Add a function - cancel_rejected_write() - to excise a rejected write from the
> pagecache.  This function is related to the truncation family of routines.  It
> permits the pages modified by a network filesystem client (such as AFS) to be
> excised and discarded from the pagecache if the attempt to write them back to
> the server fails.

Could you please flesh this out a bit, from a higher level?

As in: what does it mean for a server to "reject" a write?  What's actually
going on here?  Why does the server do this?  I assume that the application
will see a short write or an EIO or something?

How does this interact with MAP_SHARED?

Do we expect that any other networked filesystem would want to do this? 
(and if not, why not?) Or do they already attempt to do it in some other
fashion?

> The dirty and writeback states of the afflicted pages are cancelled and the
> pages themselves are detached for recycling.  All PTEs referring to those
> pages are removed.
> 
> Note that the locking is tricky as it's very easy to deadlock against
> truncate() and other routines once the pages have been unlocked as part of the
> writeback process.  To this end, the PG_error flag is set, then the
> PG_writeback flag is cleared, and only *then* can lock_page() be called.

hm.

> Signed-off-by: David Howells <[EMAIL PROTECTED]>
> ---
> 
>  include/linux/mm.h |    5 ++-
>  mm/truncate.c      |   83 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e4183c6..73688d4 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1086,12 +1086,13 @@ extern int do_munmap(struct mm_struct *, unsigned 
> long, size_t);
>  
>  extern unsigned long do_brk(unsigned long, unsigned long);
>  
> -/* filemap.c */
> -extern unsigned long page_unuse(struct page *);
> +/* truncate.c */
>  extern void truncate_inode_pages(struct address_space *, loff_t);
>  extern void truncate_inode_pages_range(struct address_space *,
>                                      loff_t lstart, loff_t lend);
> +extern void cancel_rejected_write(struct address_space *, pgoff_t, pgoff_t);
>  
> +/* filemap.c */
>  /* generic vm_area_ops exported for stackable file systems */
>  extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, 
> int *);
>  extern int filemap_populate(struct vm_area_struct *, unsigned long,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 4fbe1a2..f742096 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -443,3 +443,86 @@ int invalidate_inode_pages2(struct address_space 
> *mapping)
>       return invalidate_inode_pages2_range(mapping, 0, -1);
>  }
>  EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> +
> +/*
> + * Cancel that part of a rejected write that affects a particular page
> + */
> +static void cancel_rejected_page(struct address_space *mapping,
> +                              struct page *page, pgoff_t *_next)
> +{
> +     if (!TestSetPageError(page)) {
> +             /* can't lock the page until we've cleared PG_writeback lest we
> +              * deadlock with truncate (amongst other things) */
> +             end_page_writeback(page);
> +             if (page->mapping == mapping) {
> +                     lock_page(page);
> +                     if (page->mapping == mapping) {
> +                             truncate_complete_page(mapping, page);
> +                             *_next = page->index + 1;
> +                     }
> +                     unlock_page(page);
> +             }
> +     } else if (PageWriteback(page) || PageDirty(page)) {
> +             BUG();
> +     }
> +}

Why can't we just run the end_page_writeback() unconditionally? 
PG_writeback _must_ be set here, and it is the caller who set PG_writeback,
so this thread of control "owns" that flag.

Also, are you really really sure that there is no way in which PG_writeback
can get itself set again after the end_page_writeback()?  I'd have thought
that we should be doing a wait_on_page_writeback() after the lock_page()
there.  Remember, other filesystems might want to be calling this, so we
shouldn't be designing around AFS implementation specifics.

> +/**
> + * cancel_rejected_write - Cancel a write on a contiguous set of pages
> + * @mapping: mapping affected
> + * @start: first page in set
> + * @end: last page in set
> + *
> + * Cancel a write of a contiguous set of pages when the writeback was 
> rejected
> + * by the target medium or server.
> + *
> + * The pages in question are detached and discarded from the pagecache, and 
> the
> + * writeback and dirty states are cleared prior to invalidation.  The caller
> + * must make sure that all the pages in the range are present in the 
> pagecache,
> + * and the caller must hold PG_writeback on each of them.  NOTE! All the 
> pages
> + * are locked and unlocked as part of this process, so the caller must take
> + * care to avoid deadlock.
> + *
> + * The PTEs pointing to those pages are also cleared, leading to the PTEs 
> being
> + * reset when new pages are allocated and the contents reloaded.
> + */
> +void cancel_rejected_write(struct address_space *mapping,
> +                        pgoff_t start, pgoff_t end)
> +{
> +     struct pagevec pvec;
> +     pgoff_t n;
> +     int i;
> +
> +     BUG_ON(mapping->nrpages < end - start + 1);
> +
> +     /* dispose of any PTEs pointing to the affected pages */
> +     unmap_mapping_range(mapping,
> +                         (loff_t)start << PAGE_CACHE_SHIFT,
> +                         (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +                         0);
> +
> +     pagevec_init(&pvec, 0);
> +     do {
> +             cond_resched();
> +             n = end - start + 1;
> +             if (n > PAGEVEC_SIZE)
> +                     n = PAGEVEC_SIZE;
> +             n = pagevec_lookup(&pvec, mapping, start, n);
> +             for (i = 0; i < n; i++) {
> +                     struct page *page = pvec.pages[i];
> +
> +                     if (page->index < start || page->index > end)
> +                             continue;
> +                     start++;
> +                     cancel_rejected_page(mapping, page, &start);
> +             }
> +             pagevec_release(&pvec);
> +     } while (start - 1 < end);
> +
> +     /* dispose of any new PTEs pointing to the affected pages */
> +     unmap_mapping_range(mapping,
> +                         (loff_t)start << PAGE_CACHE_SHIFT,
> +                         (loff_t)(end - start + 1) << PAGE_CACHE_SHIFT,
> +                         0);
> +}
> +EXPORT_SYMBOL_GPL(cancel_rejected_write);

hm, is the pte-unmapping here completely solid?  Are there any race windows
in which a faulter can end up owning, say, an anonymous page?  We've had
heaps of problems with that sort of thing in generic_file_direct_IO() and I
don't expect it's this easy.





-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/4] AFS: Add a function to excise a rejected write from the pagecache

Reply via email to