On Wed, 16 Apr 2014, Johannes Weiner wrote:
> Subject: [patch] mm: filemap: update find_get_pages_tag() to deal with shadow
>  entries
> 
> Dave Jones reports the following crash when find_get_pages_tag() runs
> into an exceptional entry:
> 
> kernel BUG at mm/filemap.c:1347!
> RIP: 0010:[<ffffffffb815aeab>]  [<ffffffffb815aeab>] find_get_pages_tag+0x1cb/0x220
> Call Trace:
>  [<ffffffffb815ad16>] ? find_get_pages_tag+0x36/0x220
>  [<ffffffffb8168511>] pagevec_lookup_tag+0x21/0x30
>  [<ffffffffb81595de>] filemap_fdatawait_range+0xbe/0x1e0
>  [<ffffffffb8159727>] filemap_fdatawait+0x27/0x30
>  [<ffffffffb81f2fa4>] sync_inodes_sb+0x204/0x2a0
>  [<ffffffffb874d98f>] ? wait_for_completion+0xff/0x130
>  [<ffffffffb81fa5b0>] ? vfs_fsync+0x40/0x40
>  [<ffffffffb81fa5c9>] sync_inodes_one_sb+0x19/0x20
>  [<ffffffffb81caab2>] iterate_supers+0xb2/0x110
>  [<ffffffffb81fa864>] sys_sync+0x44/0xb0
>  [<ffffffffb875c4a9>] ia32_do_call+0x13/0x13
> 
> 1343                         /*
> 1344                          * This function is never used on a shmem/tmpfs
> 1345                          * mapping, so a swap entry won't be found here.
> 1346                          */
> 1347                         BUG();
> 
> After 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
> cache radix trees") this comment and BUG() are out of date because
> exceptional entries can now appear in all mappings - as shadows of
> recently evicted pages.
> 
> However, as Hugh Dickins notes,
> 
>   "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
>    any other PAGECACHE_TAG_*) to appear on an exceptional entry.
> 
>    I expect it comes down to an occasional race in RCU lookup of the
>    radix_tree: lacking absolute synchronization, we might sometimes
>    catch an exceptional entry, with the tag which really belongs with
>    the unexceptional entry which was there an instant before."
> 
> And indeed, not only is the tree walk lockless, the tags are also read
> in chunks, one radix tree node at a time.  There is plenty of time for
> page reclaim to swoop in and replace a page that was already looked up
> as tagged with a shadow entry.
> 
> Remove the BUG() and update the comment.  While reviewing all other
> lookup sites for whether they properly deal with shadow entries of
> evicted pages, update all the comments and fix memcg file charge
> moving to not miss shmem/tmpfs swapcache pages.
> 
> Reported-by: Dave Jones <da...@redhat.com>
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
> Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")

Looks exactly right to me, thanks Hannes.  Good catch in memcontrol.c.

Acked-by: Hugh Dickins <hu...@google.com>

And I realize now that the tag races which led me to defer to you are
actually just races we have lived with for years; before, they were all
handled invisibly at the "unlikely(!page)" stage, whereas now they simply
need active handling at the radix_tree_exception stage too.
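
To spell out the pattern (a hand-written sketch of the lockless lookup
loops, not a quote of the mm/filemap.c code, but the same shape):

	page = radix_tree_deref_slot(slot);
	if (unlikely(!page))
		/* slot emptied under us: the old, invisibly handled race */
		continue;
	if (radix_tree_exception(page)) {
		if (radix_tree_deref_retry(page))
			/* node was reshuffled, restart the walk */
			goto restart;
		/* shadow or swap entry: now needs an explicit skip */
		continue;
	}
	if (!page_cache_get_speculative(page))
		/* page was freed, perhaps reused: try this slot again */
		goto repeat;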

There is, by the way, a separate cleanup that I noticed last night,
while puzzling over the filemap.c:202 bug.  In mm/truncate.c there
are several "We rely upon deletion not changing page->index" comments
(and in mm/filemap.c "Leave page->index set: truncation relies upon it").
I think your indices[] everywhere have ended that reliance?  Whether you
also remove the "WARN_ON(page->index != index)"s is a matter of taste:
it is reassuring to have that checked somewhere, but no longer so
particular to those loops.
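
For concreteness, roughly the shape of those loops with your series
applied (a sketch from memory, not the literal mm/truncate.c code):

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	int i;

	pagevec_init(&pvec, 0);
	while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
			PAGEVEC_SIZE, indices)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			/* take the offset from indices[], not page->index */
			index = indices[i];
			if (index > end)
				break;
			if (radix_tree_exceptional_entry(page))
				continue;
			/* lock page, recheck mapping, invalidate ... */
		}
		pagevec_release(&pvec);
		cond_resched();
		index++;
	}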

> ---
>  mm/filemap.c    | 49 ++++++++++++++++++++++++++++---------------------
>  mm/memcontrol.c | 20 ++++++++++++--------
>  mm/truncate.c   |  8 --------
>  3 files changed, 40 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a82fbe4c9e8e..d92c437a79c4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -906,8 +906,8 @@ EXPORT_SYMBOL(page_cache_prev_hole);
>   * Looks up the page cache slot at @mapping & @offset.  If there is a
>   * page cache page, it is returned with an increased refcount.
>   *
> - * If the slot holds a shadow entry of a previously evicted page, it
> - * is returned.
> + * If the slot holds a shadow entry of a previously evicted page, or a
> + * swap entry from shmem/tmpfs, it is returned.
>   *
>   * Otherwise, %NULL is returned.
>   */
> @@ -928,9 +928,9 @@ repeat:
>                       if (radix_tree_deref_retry(page))
>                               goto repeat;
>                       /*
> -                      * Otherwise, shmem/tmpfs must be storing a swap entry
> -                      * here as an exceptional entry: so return it without
> -                      * attempting to raise page count.
> +                      * A shadow entry of a recently evicted page,
> +                      * or a swap entry from shmem/tmpfs.  Return
> +                      * it without attempting to raise page count.
>                        */
>                       goto out;
>               }
> @@ -983,8 +983,8 @@ EXPORT_SYMBOL(find_get_page);
>   * page cache page, it is returned locked and with an increased
>   * refcount.
>   *
> - * If the slot holds a shadow entry of a previously evicted page, it
> - * is returned.
> + * If the slot holds a shadow entry of a previously evicted page, or a
> + * swap entry from shmem/tmpfs, it is returned.
>   *
>   * Otherwise, %NULL is returned.
>   *
> @@ -1099,8 +1099,8 @@ EXPORT_SYMBOL(find_or_create_page);
>   * with ascending indexes.  There may be holes in the indices due to
>   * not-present pages.
>   *
> - * Any shadow entries of evicted pages are included in the returned
> - * array.
> + * Any shadow entries of evicted pages, or swap entries from
> + * shmem/tmpfs, are included in the returned array.
>   *
>   * find_get_entries() returns the number of pages and shadow entries
>   * which were found.
> @@ -1128,9 +1128,9 @@ repeat:
>                       if (radix_tree_deref_retry(page))
>                               goto restart;
>                       /*
> -                      * Otherwise, we must be storing a swap entry
> -                      * here as an exceptional entry: so return it
> -                      * without attempting to raise page count.
> +                      * A shadow entry of a recently evicted page,
> +                      * or a swap entry from shmem/tmpfs.  Return
> +                      * it without attempting to raise page count.
>                        */
>                       goto export;
>               }
> @@ -1198,9 +1198,9 @@ repeat:
>                               goto restart;
>                       }
>                       /*
> -                      * Otherwise, shmem/tmpfs must be storing a swap entry
> -                      * here as an exceptional entry: so skip over it -
> -                      * we only reach this from invalidate_mapping_pages().
> +                      * A shadow entry of a recently evicted page,
> +                      * or a swap entry from shmem/tmpfs.  Skip
> +                      * over it.
>                        */
>                       continue;
>               }
> @@ -1265,9 +1265,9 @@ repeat:
>                               goto restart;
>                       }
>                       /*
> -                      * Otherwise, shmem/tmpfs must be storing a swap entry
> -                      * here as an exceptional entry: so stop looking for
> -                      * contiguous pages.
> +                      * A shadow entry of a recently evicted page,
> +                      * or a swap entry from shmem/tmpfs.  Stop
> +                      * looking for contiguous pages.
>                        */
>                       break;
>               }
> @@ -1341,10 +1341,17 @@ repeat:
>                               goto restart;
>                       }
>                       /*
> -                      * This function is never used on a shmem/tmpfs
> -                      * mapping, so a swap entry won't be found here.
> +                      * A shadow entry of a recently evicted page.
> +                      *
> +                      * Those entries should never be tagged, but
> +                      * this tree walk is lockless and the tags are
> +                      * looked up in bulk, one radix tree node at a
> +                      * time, so there is a sizable window for page
> +                      * reclaim to evict a page we saw tagged.
> +                      *
> +                      * Skip over it.
>                        */
> -                     BUG();
> +                     continue;
>               }
>  
>               if (!page_cache_get_speculative(page))
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 29501f040568..c47dffdcb246 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6686,16 +6686,20 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
>               pgoff = pte_to_pgoff(ptent);
>  
>       /* page is moved even if it's not RSS of this task(page-faulted). */
> -     page = find_get_page(mapping, pgoff);
> -
>  #ifdef CONFIG_SWAP
>       /* shmem/tmpfs may report page out on swap: account for that too. */
> -     if (radix_tree_exceptional_entry(page)) {
> -             swp_entry_t swap = radix_to_swp_entry(page);
> -             if (do_swap_account)
> -                     *entry = swap;
> -             page = find_get_page(swap_address_space(swap), swap.val);
> -     }
> +     if (shmem_mapping(mapping)) {
> +             page = find_get_entry(mapping, pgoff);
> +             if (radix_tree_exceptional_entry(page)) {
> +                     swp_entry_t swp = radix_to_swp_entry(page);
> +                     if (do_swap_account)
> +                             *entry = swp;
> +                     page = find_get_page(swap_address_space(swp), swp.val);
> +             }
> +     } else
> +             page = find_get_page(mapping, pgoff);
> +#else
> +     page = find_get_page(mapping, pgoff);
>  #endif
>       return page;
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index e5cc39ab0751..6a78c814bebf 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -484,14 +484,6 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
>       unsigned long count = 0;
>       int i;
>  
> -     /*
> -      * Note: this function may get called on a shmem/tmpfs mapping:
> -      * pagevec_lookup() might then return 0 prematurely (because it
> -      * got a gangful of swap entries); but it's hardly worth worrying
> -      * about - it can rarely have anything to free from such a mapping
> -      * (most pages are dirty), and already skips over any difficulties.
> -      */
> -
>       pagevec_init(&pvec, 0);
>       while (index <= end && pagevec_lookup_entries(&pvec, mapping, index,
>                       min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
> -- 
> 1.9.2
> 
> 