On Thu, May 21, 2020 at 10:56:20PM -0700, Hugh Dickins wrote:
> I've only seen this livelock on one machine (repeatably, but not to
> order), and not fully analyzed it - two processes seen looping around
> getting -EEXIST from swapcache_prepare(), I guess a third (at lower
> priority? but wanting the same cpu as one of the loopers? preemption
> or cond_resched() not enough to let it back in?) set SWAP_HAS_CACHE,
> then went off into direct reclaim, scheduled away, and somehow could
> not get back to add the page to swap cache and let them all complete.
> 
> Restore the page allocation in __read_swap_cache_async() to before
> the swapcache_prepare() call: "mm: memcontrol: charge swapin pages
> on instantiation" moved it outside the loop, which indeed looks much
> nicer, but exposed this weakness.  We used to allocate new_page once
> and then keep it across all iterations of the loop: but I think that
> just optimizes for a rare case, and complicates the flow, so go with
> the new simpler structure, with allocate+free each time around (which
> is more considerate use of the memory too).
> 
> Fix the comment on the looping case, which has long been inaccurate:
> it's not a racing get_swap_page() that's the problem here.
> 
> Fix the add_to_swap_cache() and mem_cgroup_charge() error recovery:
> not swap_free(), but put_swap_page() to undo SWAP_HAS_CACHE, as was
> done before; but delete_from_swap_cache() already includes it.
> 
> And one more nit: I don't think it makes any difference in practice,
> but remove the "& GFP_KERNEL" mask from the mem_cgroup_charge() call:
> add_to_swap_cache() needs that, to convert gfp_mask from user and page
> cache allocation (e.g. highmem) to radix node allocation (lowmem), but
> we don't need or usually apply that mask when charging mem_cgroup.
> 
> Signed-off-by: Hugh Dickins <hu...@google.com>
> ---
> Mostly fixing mm-memcontrol-charge-swapin-pages-on-instantiation.patch
> but now I see that mm-memcontrol-delete-unused-lrucare-handling.patch
> made a further change here (took an arg off the mem_cgroup_charge call):
> as is, this patch is diffed to go on top of both of them, and better
> that I get it out now for Johannes look at; but could be rediffed for
> folding into blah-instantiation.patch later.
> 
> Earlier in the day I promised two patches to __read_swap_cache_async(),
> but find now that I cannot quite justify the second patch: it makes a
> slight adjustment in swapcache_prepare(), then removes the redundant
> __swp_swapcount() && swap_slot_cache_enabled business from blah_async().
> 
> I'd still like to do that, but this patch here brings back the
> alloc_page_vma() in between them, and I don't have any evidence to
> reassure us that I'm not then pessimizing a readahead case by doing
> unnecessary allocation and free. Leave it for some other time perhaps.
> 
>  mm/swap_state.c |   52 +++++++++++++++++++++++++---------------------
>  1 file changed, 29 insertions(+), 23 deletions(-)
> 
> --- 5.7-rc6-mm1/mm/swap_state.c       2020-05-20 12:21:56.149694170 -0700
> +++ linux/mm/swap_state.c     2020-05-21 20:17:50.188773901 -0700
> @@ -392,56 +392,62 @@ struct page *__read_swap_cache_async(swp
>                       return NULL;
>  
>               /*
> +              * Get a new page to read into from swap.  Allocate it now,
> +              * before marking swap_map SWAP_HAS_CACHE, when -EEXIST will
> +              * cause any racers to loop around until we add it to cache.
> +              */
> +             page = alloc_page_vma(gfp_mask, vma, addr);
> +             if (!page)
> +                     return NULL;
> +
> +             /*
>                * Swap entry may have been freed since our caller observed it.
>                */
>               err = swapcache_prepare(entry);
>               if (!err)
>                       break;
>  
> -             if (err == -EEXIST) {
> -                     /*
> -                      * We might race against get_swap_page() and stumble
> -                      * across a SWAP_HAS_CACHE swap_map entry whose page
> -                      * has not been brought into the swapcache yet.
> -                      */
> -                     cond_resched();
> -                     continue;
> -             }
> +             put_page(page);
> +             if (err != -EEXIST)
> +                     return NULL;
>  
> -             return NULL;
> +             /*
> +              * We might race against __delete_from_swap_cache(), and
> +              * stumble across a swap_map entry whose SWAP_HAS_CACHE
> +              * has not yet been cleared.  Or race against another
> +              * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> +              * in swap_map, but not yet added its page to swap cache.
> +              */
> +             cond_resched();
>       }
>  
>       /*
> -      * The swap entry is ours to swap in. Prepare a new page.
> +      * The swap entry is ours to swap in. Prepare the new page.
>        */
>  
> -     page = alloc_page_vma(gfp_mask, vma, addr);
> -     if (!page)
> -             goto fail_free;
> -
>       __SetPageLocked(page);
>       __SetPageSwapBacked(page);
>  
>       /* May fail (-ENOMEM) if XArray node allocation failed. */
> -     if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> +     if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL)) {
> +             put_swap_page(page, entry);
>               goto fail_unlock;
> +     }
>  
> -     if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
> -             goto fail_delete;
> +     if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> +             delete_from_swap_cache(page);
> +             goto fail_unlock;
> +     }
>  
> -     /* Initiate read into locked page */
> +     /* Caller will initiate read into locked page */
>       SetPageWorkingset(page);
>       lru_cache_add_anon(page);
>       *new_page_allocated = true;
>       return page;
>  
> -fail_delete:
> -     delete_from_swap_cache(page);
>  fail_unlock:
>       unlock_page(page);
>       put_page(page);
> -fail_free:
> -     swap_free(entry);
>       return NULL;
>  }
>  
> 
Acked-by: Rafael Aquini <aqu...@redhat.com>

Reply via email to