Re: [PATCH] mm: don't expose page to fast gup before it's ready

2019-05-14 Thread Yu Zhao
On Tue, May 14, 2019 at 02:25:27PM -0700, Andrew Morton wrote:
> On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao  wrote:
> 
> > > Also, what prevents reordering here? There do not seem to be any barriers
> > > to prevent __SetPageSwapBacked from leaking past set_pte_at with your patch.
> > 
> > I assumed mem_cgroup_commit_charge() acted as a full barrier. Since you
> > explicitly asked the question, I realized my assumption doesn't hold
> > when memcg is disabled. So we do need something to prevent reordering
> > in my patch. And it brings up the question of whether we want to add
> > more barriers to other places that call page_add_new_anon_rmap() and
> > set_pte_at().
> 
> Is a new version of this patch planned?

Sorry for the late reply. The last time I tried, I didn't come up
with a better fix because:
  1) as Michal pointed out, we need to make sure the fast gup sees
  all changes made before set_pte_at();
  2) pairing smp_wmb() in set_pte/pmd_at() with smp_rmb() in gup
  seems the best way to prevent any potential ordering-related
  problems in the future;
  3) but this slows down the paths that don't require the smp_wmb()
  unnecessarily.

I didn't give it further thought because the problem didn't seem
fatal at the time. Now the fast gup code has changed and the problem
is serious:

CPU 1                         CPU 2
set_pte_at                    get_user_pages_fast
page_add_new_anon_rmap        gup_pte_range
__SetPageSwapBacked (fetch)
                              try_get_compound_head
                              page_ref_add_unless
__SetPageSwapBacked (store)

A similar problem could happen in __do_huge_pmd_anonymous_page(),
because of the missing smp_wmb() between the non-atomic bit op
and set_pmd_at().
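
To make 2) concrete, the pairing would look roughly like this (a sketch
only, not a tested patch; variable names are simplified, and whether the
barriers live inside set_pte_at()/set_pmd_at() and gup_pte_range() or at
their call sites is exactly the open question):

        /* fault side, e.g. do_swap_page() */
        page_add_new_anon_rmap(page, vma, addr, false); /* does __SetPageSwapBacked() */
        mem_cgroup_commit_charge(page, memcg, false, false);
        smp_wmb();                      /* order the page setup before the PTE store */
        set_pte_at(vma->vm_mm, addr, ptep, pte);

        /* fast gup side, gup_pte_range() */
        pte = READ_ONCE(*ptep);         /* stand-in for the pte read done there */
        smp_rmb();                      /* pairs with the smp_wmb() above */
        /*
         * Once gup observes the new PTE, all of the page setup stores are
         * already visible, so they can no longer clobber gup's atomic
         * flag/refcount updates. The huge page path would need the same
         * smp_wmb() before set_pmd_at() in __do_huge_pmd_anonymous_page().
         */

The cost in 3) follows directly from this: every set_pte_at()/set_pmd_at()
caller would pay for the smp_wmb(), whether or not it is publishing a
freshly set up page.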

We could simply replace __SetPageSwapBacked() with its atomic
version. But 2) seems preferable to me because it addresses
my original problem:

> > I didn't observe the race directly. But I did get a few crashes when
> > trying to access the mem_cgroup of pages returned by get_user_pages_fast().
> > Those pages were charged and they showed a valid mem_cgroup in kdumps.
> > So this led me to think the problem came from a premature set_pte_at().

Thoughts? Thanks.
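
P.S. For reference, the simpler alternative mentioned above (the atomic
flag op) would be roughly this one-liner in page_add_new_anon_rmap(),
shown as a sketch only, not a tested patch:

        -       __SetPageSwapBacked(page);      /* plain read-modify-write of page->flags */
        +       SetPageSwapBacked(page);        /* atomic bit set, cannot discard concurrent updates */

It would close the flag-clobbering window, but unlike 2) it gives fast
gup no ordering guarantee about the rest of the page setup (rmap, memcg),
which is what my original crashes pointed at.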


Re: [PATCH] mm: don't expose page to fast gup before it's ready

2019-05-14 Thread Andrew Morton
On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao  wrote:

> > Also, what prevents reordering here? There do not seem to be any barriers
> > to prevent __SetPageSwapBacked from leaking past set_pte_at with your patch.
> 
> I assumed mem_cgroup_commit_charge() acted as a full barrier. Since you
> explicitly asked the question, I realized my assumption doesn't hold
> when memcg is disabled. So we do need something to prevent reordering
> in my patch. And it brings up the question of whether we want to add
> more barriers to other places that call page_add_new_anon_rmap() and
> set_pte_at().

Is a new version of this patch planned?


Re: [PATCH] mm: don't expose page to fast gup before it's ready

2018-01-31 Thread Andrew Morton
On Tue, 9 Jan 2018 02:10:50 -0800 Yu Zhao  wrote:

> On Tue, Jan 09, 2018 at 09:46:22AM +0100, Michal Hocko wrote:
> > On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> > > We don't want to expose a page before it's properly set up. During
> > > page setup, we may call page_add_new_anon_rmap(), which uses a
> > > non-atomic bit op. If the page is exposed before setup is done, we
> > > could overwrite page flags that are set by get_user_pages_fast() or
> > > its callers. Here is a non-fatal scenario (there might be other
> > > fatal problems that I didn't look into):
> > > 
> > >   CPU 1                       CPU 2
> > >   set_pte_at()                get_user_pages_fast()
> > >   page_add_new_anon_rmap()    gup_pte_range()
> > >   __SetPageSwapBacked()       SetPageReferenced()
> > > 
> > > Fix the problem by delaying set_pte_at() until the page is ready.
> > 
> > Have you seen this race happen in real workloads, or is this a
> > code-review-based fix or a theoretical issue? I am primarily asking
> > because the code has been like this at least throughout the git era
> > and I do not remember any issue like this. If you can really trigger
> > this tiny race window then we should mark the fix for stable.
> 
> I didn't observe the race directly. But I did get a few crashes when
> trying to access the mem_cgroup of pages returned by get_user_pages_fast().
> Those pages were charged and they showed a valid mem_cgroup in kdumps.
> So this led me to think the problem came from a premature set_pte_at().
> 
> I think the fact that nobody has complained about this problem is
> because the race only happens when using ksm+swap, and it might not
> cause any fatal problem even then. Nevertheless, it's nice to have
> set_pte_at() done consistently after the rmap is added and the page
> is charged.
> 
> > Also, what prevents reordering here? There do not seem to be any barriers
> > to prevent __SetPageSwapBacked from leaking past set_pte_at with your patch.
> 
> I assumed mem_cgroup_commit_charge() acted as a full barrier. Since you
> explicitly asked the question, I realized my assumption doesn't hold
> when memcg is disabled. So we do need something to prevent reordering
> in my patch. And it brings up the question of whether we want to add
> more barriers to other places that call page_add_new_anon_rmap() and
> set_pte_at().

No progress here?  I have the patch marked "to be updated", hence it is
stuck.  Please let's get it finished off for 4.17-rc1.



Re: [PATCH] mm: don't expose page to fast gup before it's ready

2018-01-09 Thread Yu Zhao
On Tue, Jan 09, 2018 at 09:46:22AM +0100, Michal Hocko wrote:
> On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> > We don't want to expose a page before it's properly set up. During
> > page setup, we may call page_add_new_anon_rmap(), which uses a
> > non-atomic bit op. If the page is exposed before setup is done, we
> > could overwrite page flags that are set by get_user_pages_fast() or
> > its callers. Here is a non-fatal scenario (there might be other
> > fatal problems that I didn't look into):
> > 
> >   CPU 1                       CPU 2
> >   set_pte_at()                get_user_pages_fast()
> >   page_add_new_anon_rmap()    gup_pte_range()
> >   __SetPageSwapBacked()       SetPageReferenced()
> > 
> > Fix the problem by delaying set_pte_at() until the page is ready.
> 
> Have you seen this race happen in real workloads, or is this a
> code-review-based fix or a theoretical issue? I am primarily asking
> because the code has been like this at least throughout the git era
> and I do not remember any issue like this. If you can really trigger
> this tiny race window then we should mark the fix for stable.

I didn't observe the race directly. But I did get a few crashes when
trying to access the mem_cgroup of pages returned by get_user_pages_fast().
Those pages were charged and they showed a valid mem_cgroup in kdumps.
So this led me to think the problem came from a premature set_pte_at().

I think the fact that nobody has complained about this problem is
because the race only happens when using ksm+swap, and it might not
cause any fatal problem even then. Nevertheless, it's nice to have
set_pte_at() done consistently after the rmap is added and the page
is charged.

> Also, what prevents reordering here? There do not seem to be any barriers
> to prevent __SetPageSwapBacked from leaking past set_pte_at with your patch.

I assumed mem_cgroup_commit_charge() acted as a full barrier. Since you
explicitly asked the question, I realized my assumption doesn't hold
when memcg is disabled. So we do need something to prevent reordering
in my patch. And it brings up the question of whether we want to add
more barriers to other places that call page_add_new_anon_rmap() and
set_pte_at().
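
To make the window concrete, here is a small userspace analogue (plain C
with pthreads, not kernel code; the flag names are just stand-ins): a
plain read-modify-write of a flags word, like the one __SetPageSwapBacked()
does, racing with an atomic bit set, like the SetPageReferenced() done by
fast gup. Nothing orders or serializes the two sides, so the atomic
update can be discarded:

#include <pthread.h>
#include <stdio.h>

#define PG_swapbacked   (1UL << 0)      /* stand-ins for the real page flags */
#define PG_referenced   (1UL << 1)
#define ITERATIONS      1000000

static unsigned long flags;
static pthread_barrier_t start_b, stop_b;

/* __SetPageSwapBacked() analogue: plain fetch, modify, store (deliberately racy) */
static void *fault_side(void *arg)
{
        for (int i = 0; i < ITERATIONS; i++) {
                pthread_barrier_wait(&start_b);
                unsigned long old = flags;      /* fetch */
                flags = old | PG_swapbacked;    /* store; may overwrite PG_referenced */
                pthread_barrier_wait(&stop_b);
        }
        return NULL;
}

/* SetPageReferenced() analogue: atomic bit set */
static void *gup_side(void *arg)
{
        for (int i = 0; i < ITERATIONS; i++) {
                pthread_barrier_wait(&start_b);
                __atomic_fetch_or(&flags, PG_referenced, __ATOMIC_SEQ_CST);
                pthread_barrier_wait(&stop_b);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b;
        long lost = 0;

        pthread_barrier_init(&start_b, NULL, 3);
        pthread_barrier_init(&stop_b, NULL, 3);
        pthread_create(&a, NULL, fault_side, NULL);
        pthread_create(&b, NULL, gup_side, NULL);

        for (int i = 0; i < ITERATIONS; i++) {
                flags = 0;                      /* both sides are parked at start_b */
                pthread_barrier_wait(&start_b);
                pthread_barrier_wait(&stop_b);  /* both sides are done for this round */
                if (!(flags & PG_referenced))
                        lost++;                 /* the atomic update was discarded */
        }

        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("lost PG_referenced updates: %ld of %d\n", lost, ITERATIONS);
        return 0;
}

A nonzero count corresponds to the clobbering described in the patch. In
the kernel the non-atomic op is fine as long as the page isn't yet
visible to fast gup, which is why delaying set_pte_at() (plus whatever
ordering we settle on) closes the window.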


Re: [PATCH] mm: don't expose page to fast gup before it's ready

2018-01-09 Thread Michal Hocko
On Mon 08-01-18 14:56:32, Yu Zhao wrote:
> We don't want to expose a page before it's properly set up. During
> page setup, we may call page_add_new_anon_rmap(), which uses a
> non-atomic bit op. If the page is exposed before setup is done, we
> could overwrite page flags that are set by get_user_pages_fast() or
> its callers. Here is a non-fatal scenario (there might be other
> fatal problems that I didn't look into):
> 
>   CPU 1                       CPU 2
>   set_pte_at()                get_user_pages_fast()
>   page_add_new_anon_rmap()    gup_pte_range()
>   __SetPageSwapBacked()       SetPageReferenced()
> 
> Fix the problem by delaying set_pte_at() until the page is ready.

Have you seen this race happen in real workloads, or is this a
code-review-based fix or a theoretical issue? I am primarily asking
because the code has been like this at least throughout the git era
and I do not remember any issue like this. If you can really trigger
this tiny race window then we should mark the fix for stable.

Also, what prevents reordering here? There do not seem to be any barriers
to prevent __SetPageSwapBacked from leaking past set_pte_at with your patch.

> Signed-off-by: Yu Zhao 
> ---
>  mm/memory.c   | 2 +-
>  mm/swapfile.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ca5674cbaff2..b8be1a4adf93 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3010,7 +3010,6 @@ int do_swap_page(struct vm_fault *vmf)
>   flush_icache_page(vma, page);
>   if (pte_swp_soft_dirty(vmf->orig_pte))
>   pte = pte_mksoft_dirty(pte);
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>   vmf->orig_pte = pte;
>  
>   /* ksm created a completely new copy */
> @@ -3023,6 +3022,7 @@ int do_swap_page(struct vm_fault *vmf)
>   mem_cgroup_commit_charge(page, memcg, true, false);
>   activate_page(page);
>   }
> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>  
>   swap_free(entry);
>   if (mem_cgroup_swap_full(page) ||
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 3074b02eaa09..bd49da2b5221 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1800,8 +1800,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>   dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
>   inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>   get_page(page);
> - set_pte_at(vma->vm_mm, addr, pte,
> -pte_mkold(mk_pte(page, vma->vm_page_prot)));
>   if (page == swapcache) {
>   page_add_anon_rmap(page, vma, addr, false);
>   mem_cgroup_commit_charge(page, memcg, true, false);
> @@ -1810,6 +1808,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>   mem_cgroup_commit_charge(page, memcg, false, false);
>   lru_cache_add_active_or_unevictable(page, vma);
>   }
> + set_pte_at(vma->vm_mm, addr, pte,
> +pte_mkold(mk_pte(page, vma->vm_page_prot)));
>   swap_free(entry);
>   /*
>* Move the page to the active list so it is not
> -- 
> 2.16.0.rc0.223.g4a4ac83678-goog
> 

-- 
Michal Hocko
SUSE Labs


[PATCH] mm: don't expose page to fast gup before it's ready

2018-01-08 Thread Yu Zhao
We don't want to expose a page before it's properly set up. During
page setup, we may call page_add_new_anon_rmap(), which uses a
non-atomic bit op. If the page is exposed before setup is done, we
could overwrite page flags that are set by get_user_pages_fast() or
its callers. Here is a non-fatal scenario (there might be other
fatal problems that I didn't look into):

  CPU 1                       CPU 2
  set_pte_at()                get_user_pages_fast()
  page_add_new_anon_rmap()    gup_pte_range()
  __SetPageSwapBacked()       SetPageReferenced()

Fix the problem by delaying set_pte_at() until the page is ready.

Signed-off-by: Yu Zhao 
---
 mm/memory.c   | 2 +-
 mm/swapfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ca5674cbaff2..b8be1a4adf93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3010,7 +3010,6 @@ int do_swap_page(struct vm_fault *vmf)
flush_icache_page(vma, page);
if (pte_swp_soft_dirty(vmf->orig_pte))
pte = pte_mksoft_dirty(pte);
-   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
vmf->orig_pte = pte;
 
/* ksm created a completely new copy */
@@ -3023,6 +3022,7 @@ int do_swap_page(struct vm_fault *vmf)
mem_cgroup_commit_charge(page, memcg, true, false);
activate_page(page);
}
+   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 
swap_free(entry);
if (mem_cgroup_swap_full(page) ||
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3074b02eaa09..bd49da2b5221 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1800,8 +1800,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
-   set_pte_at(vma->vm_mm, addr, pte,
-  pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, false);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -1810,6 +1808,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
mem_cgroup_commit_charge(page, memcg, false, false);
lru_cache_add_active_or_unevictable(page, vma);
}
+   set_pte_at(vma->vm_mm, addr, pte,
+  pte_mkold(mk_pte(page, vma->vm_page_prot)));
swap_free(entry);
/*
 * Move the page to the active list so it is not
-- 
2.16.0.rc0.223.g4a4ac83678-goog