Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-10-01 Thread Yu Zhao
On Tue, Oct 01, 2019 at 03:31:51PM -0700, John Hubbard wrote:
> On 9/26/19 10:06 PM, Yu Zhao wrote:
> > On Thu, Sep 26, 2019 at 08:26:46PM -0700, John Hubbard wrote:
> >> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
>  On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> >> ...
> > I'm thinking this patch makes stuff rather fragile. Should we instead
> > stick the barrier in set_p*d_at()? Or rather, make that store a
> > store-release?
> 
>  I prefer it this way too, but I suspected the majority would be
>  concerned with the performance implications, especially those
>  looping set_pte_at()s in mm/huge_memory.c.
> >>>
> >>> We can rename the current set_pte_at() to __set_pte_at() or something and
> >>> leave it in places where the barrier is not needed. The new set_pte_at()
> >>> will be used in the rest of the places with the barrier inside.
> >>
> >> +1, sounds nice. I was unhappy about the wide-ranging changes that would
> >> have to be maintained. So this seems much better.
> > 
> > Just to be clear, doing so will add unnecessary barriers to one
> > of the two paths that share set_pte_at().
> 
> Good point, maybe there's a better place to do it...
> 
> 
> > 
> >>> BTW, have you looked at other levels of the page table hierarchy? Do we have
> >>> the same issue for PMD/PUD/... pages?
> >>>
> >>
> >> Along the lines of "what other memory barriers might be missing for
> >> get_user_pages_fast()", I'm also concerned that the synchronization between
> >> get_user_pages_fast() and freeing the page tables might be technically
> >> broken, due to missing memory barriers on the get_user_pages_fast() side.
> >> Details:
> >>
> >> gup_fast() disables interrupts, but I think it also needs some sort of
> >> memory barrier(s), in order to prevent reads of the page table
> >> (gup_pgd_range, etc) from speculatively happening before the interrupts
> >> are disabled.
> > 
> > I was under the impression that switching back from interrupt context is a
> > full barrier (otherwise wouldn't we be vulnerable to some side
> > channel attacks?), so the reader side wouldn't need an explicit rmb.
> > 
> 
> Documentation/memory-barriers.txt points out:
> 
> INTERRUPT DISABLING FUNCTIONS
> -----------------------------
> 
> Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> (RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
> barriers are required in such a situation, they must be provided from some
> other means.
> 
> btw, I'm really sorry I missed your responses over the last 3 or 4 days.
> I just tracked down something in our email system that was sometimes
> moving some emails to spam (just few enough to escape immediate attention, 
> argghh!).
> I think I killed it off for good now. I wasn't ignoring you. :)

Thanks, John. I agree with all you said, including the irq disabling
function not being a sufficient smp_rmb().

I was hoping somebody could clarify whether ipi handlers used by tlb
flush are sufficient to prevent CPU 1 from seeing any stale data from
freed page tables on all supported archs.

CPU 1                           CPU 2

                                flush remote tlb by ipi
                                wait for the ipi handler
                                free page table
disable irq
walk page table
enable irq

I think they should be, because otherwise tlb flush itself wouldn't work:
CPU 1 would still see stale data from the freed page table, unless there
is a really strange CPU cache design I'm not aware of.
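
In code form, the pairing in question is roughly the following (a
simplified sketch with illustrative names, not actual kernel code):

	/* CPU 2: unmap/free side (simplified) */
	pmd_clear(pmdp);
	flush_tlb_range(vma, start, end);  /* sends the ipi, waits for remote handlers */
	pte_free(mm, pte_page);            /* the pte page may be reused after this */

	/* CPU 1: fast gup side (simplified) */
	local_irq_save(flags);
	/*
	 * This load must not return the stale, pre-clear value; that only
	 * holds if the flush ipi handler that ran earlier on this CPU
	 * acted as a full barrier.
	 */
	pmd = READ_ONCE(*pmdp);
	if (!pmd_none(pmd))
		walk_pte_table(pmd);       /* stand-in for gup_pte_range() */
	local_irq_restore(flags);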

Quoting comments from x86 ipi handler flush_tlb_func_common():
 * read active_mm's tlb_gen.  We don't need any explicit barriers
 * because all x86 flush operations are serializing and the
 * atomic64_read operation won't be reordered by the compiler.

For the ppc64 ipi handler radix__flush_tlb_range(), there is an "eieio"
instruction:
https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/assembler/idalangref_eieio_instrs.html
I'm not sure why it's not "sync" -- I'd guess something implicitly
works as "sync" already (or it's a bug).

I didn't find an ipi handler for tlb flush on arm64. There should be
one; otherwise fast gup on arm64 would be broken. Mark?


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-27 Thread Michal Hocko
On Thu 26-09-19 20:26:46, John Hubbard wrote:
> On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> > BTW, have you looked at other levels of the page table hierarchy? Do we have
> > the same issue for PMD/PUD/... pages?
> > 
> 
> Along the lines of "what other memory barriers might be missing for
> get_user_pages_fast()", I'm also concerned that the synchronization between
> get_user_pages_fast() and freeing the page tables might be technically broken,
> due to missing memory barriers on the get_user_pages_fast() side. Details:
> 
> gup_fast() disables interrupts, but I think it also needs some sort of
> memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
> etc) from speculatively happening before the interrupts are disabled. 

Could you be more specific about the race scenario, please? I thought
that the unmap path would be serialized by the pte lock.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-26 Thread John Hubbard
On 9/26/19 3:20 AM, Kirill A. Shutemov wrote:
> On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
>> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
...
>>> I'm thinking this patch makes stuff rather fragile. Should we instead
>>> stick the barrier in set_p*d_at()? Or rather, make that store a
>>> store-release?
>>
>> I prefer it this way too, but I suspected the majority would be
>> concerned with the performance implications, especially those
>> looping set_pte_at()s in mm/huge_memory.c.
> 
> We can rename the current set_pte_at() to __set_pte_at() or something and
> leave it in places where the barrier is not needed. The new set_pte_at()
> will be used in the rest of the places with the barrier inside.

+1, sounds nice. I was unhappy about the wide-ranging changes that would have
to be maintained. So this seems much better.

> 
> BTW, have you looked at other levels of the page table hierarchy? Do we have
> the same issue for PMD/PUD/... pages?
> 

Along the lines of "what other memory barriers might be missing for
get_user_pages_fast()", I'm also concerned that the synchronization between
get_user_pages_fast() and freeing the page tables might be technically broken,
due to missing memory barriers on the get_user_pages_fast() side. Details:

gup_fast() disables interrupts, but I think it also needs some sort of
memory barrier(s), in order to prevent reads of the page table (gup_pgd_range,
etc) from speculatively happening before the interrupts are disabled. 
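
To make that concrete, the sort of placement I have in mind would be
something like this (purely illustrative, not a proposed patch):

	local_irq_save(flags);
	/*
	 * Per Documentation/memory-barriers.txt, interrupt-disabling
	 * functions are only compiler barriers, so an explicit barrier
	 * would be needed to keep the page table reads below from being
	 * reordered ahead of this point.
	 */
	smp_mb();
	gup_pgd_range(addr, end, gup_flags, pages, &nr);
	local_irq_restore(flags);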

Leonardo Bras's recent patchset brought this to my attention. Here, he's
recommending adding an atomic counter inc/dec before and after the gup_fast()
irq disable/enable points:

   https://lore.kernel.org/r/20190920195047.7703-4-leona...@linux.ibm.com

...and that led to noticing a general lack of barriers there.

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-26 Thread Kirill A. Shutemov
On Wed, Sep 25, 2019 at 04:26:54PM -0600, Yu Zhao wrote:
> On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > > We don't want to expose a non-hugetlb page to the fast gup running
> > > on a remote CPU before all local non-atomic ops on the page flags
> > > are visible.
> > >
> > > For an anon page that isn't in swap cache, we need to make sure all
> > > prior non-atomic ops, especially __SetPageSwapBacked() in
> > > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > > the following race:
> > >
> > >   CPU 1                                 CPU 2
> > >   set_pte_at()                          get_user_pages_fast()
> > >     page_add_new_anon_rmap()              gup_pte_range()
> > >       __SetPageSwapBacked()                 SetPageReferenced()
> > >
> > > This demonstrates a non-fatal scenario. Though they haven't been directly
> > > observed, fatal ones can exist, e.g., PG_locked set by the fast gup
> > > caller and then overwritten by __SetPageSwapBacked().
> > >
> > > For an anon page that is already in swap cache or a file page, we
> > > don't need smp_wmb() before set_pte_at() because adding to the swap or
> > > file cache serves as a valid write barrier. Using non-atomic ops
> > > thereafter is a bug, obviously.
> > >
> > > smp_wmb() is added following 11 of the 12 page_add_new_anon_rmap()
> > > call sites, with the only exception being
> > > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > > 
> > 
> > I'm thinking this patch makes stuff rather fragile. Should we instead
> > stick the barrier in set_p*d_at()? Or rather, make that store a
> > store-release?
> 
> I prefer it this way too, but I suspected the majority would be
> concerned with the performance implications, especially those
> looping set_pte_at()s in mm/huge_memory.c.

We can rename the current set_pte_at() to __set_pte_at() or something and
leave it in places where the barrier is not needed. The new set_pte_at()
will be used in the rest of the places with the barrier inside.
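
Roughly along these lines (an illustrative sketch only, not a real patch):

	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		/*
		 * Order prior non-atomic page-flag updates (e.g.
		 * __SetPageSwapBacked()) before the pte becomes visible
		 * to a lockless walker such as fast gup.
		 */
		smp_wmb();
		__set_pte_at(mm, addr, ptep, pte);  /* today's set_pte_at() */
	}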

BTW, have you looked at other levels of the page table hierarchy? Do we have
the same issue for PMD/PUD/... pages?

-- 
 Kirill A. Shutemov


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-25 Thread Yu Zhao
On Wed, Sep 25, 2019 at 10:25:30AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> > We don't want to expose a non-hugetlb page to the fast gup running
> > on a remote CPU before all local non-atomic ops on the page flags
> > are visible.
> >
> > For an anon page that isn't in swap cache, we need to make sure all
> > prior non-atomic ops, especially __SetPageSwapBacked() in
> > page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> > the following race:
> >
> >   CPU 1                                 CPU 2
> >   set_pte_at()                          get_user_pages_fast()
> >     page_add_new_anon_rmap()              gup_pte_range()
> >       __SetPageSwapBacked()                 SetPageReferenced()
> >
> > This demonstrates a non-fatal scenario. Though they haven't been directly
> > observed, fatal ones can exist, e.g., PG_locked set by the fast gup
> > caller and then overwritten by __SetPageSwapBacked().
> >
> > For an anon page that is already in swap cache or a file page, we
> > don't need smp_wmb() before set_pte_at() because adding to the swap or
> > file cache serves as a valid write barrier. Using non-atomic ops
> > thereafter is a bug, obviously.
> >
> > smp_wmb() is added following 11 of the 12 page_add_new_anon_rmap()
> > call sites, with the only exception being
> > do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> > 
> 
> I'm thinking this patch makes stuff rather fragile. Should we instead
> stick the barrier in set_p*d_at()? Or rather, make that store a
> store-release?

I prefer it this way too, but I suspected the majority would be
concerned with the performance implications, especially those
looping set_pte_at()s in mm/huge_memory.c.

If we have a consensus that smp_wmb() would be better off wrapped
together with set_p*d_at() in a macro, I'd be glad to make the change.

And it seems to me applying smp_store_release() would have to happen
in each arch, which might be just as fragile.
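
E.g., on an arch where the pte fits in a single word (an x86-64-like pte_t
layout is assumed here, and the helper name is made up), it might look like:

	static inline void set_pte_release(pte_t *ptep, pte_t pte)
	{
		/*
		 * The release store orders all prior loads and stores
		 * before the pte store, so no separate smp_wmb() would
		 * be needed at the call sites.
		 */
		smp_store_release(&ptep->pte, pte_val(pte));
	}

Arches with multi-word ptes or special set_pte() requirements would each
need their own version, which is where the fragility comes from.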


Re: [PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-25 Thread Peter Zijlstra
On Tue, Sep 24, 2019 at 05:24:58PM -0600, Yu Zhao wrote:
> We don't want to expose a non-hugetlb page to the fast gup running
> on a remote CPU before all local non-atomic ops on the page flags
> are visible.
>
> For an anon page that isn't in swap cache, we need to make sure all
> prior non-atomic ops, especially __SetPageSwapBacked() in
> page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
> the following race:
>
>   CPU 1                                 CPU 2
>   set_pte_at()                          get_user_pages_fast()
>     page_add_new_anon_rmap()              gup_pte_range()
>       __SetPageSwapBacked()                 SetPageReferenced()
>
> This demonstrates a non-fatal scenario. Though they haven't been directly
> observed, fatal ones can exist, e.g., PG_locked set by the fast gup
> caller and then overwritten by __SetPageSwapBacked().
>
> For an anon page that is already in swap cache or a file page, we
> don't need smp_wmb() before set_pte_at() because adding to the swap or
> file cache serves as a valid write barrier. Using non-atomic ops
> thereafter is a bug, obviously.
>
> smp_wmb() is added following 11 of the 12 page_add_new_anon_rmap()
> call sites, with the only exception being
> do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().
> 

I'm thinking this patch makes stuff rather fragile. Should we instead
stick the barrier in set_p*d_at()? Or rather, make that store a
store-release?




[PATCH v3 3/4] mm: don't expose non-hugetlb page to fast gup prematurely

2019-09-24 Thread Yu Zhao
We don't want to expose a non-hugetlb page to the fast gup running
on a remote CPU before all local non-atomic ops on the page flags
are visible.

For an anon page that isn't in swap cache, we need to make sure all
prior non-atomic ops, especially __SetPageSwapBacked() in
page_add_new_anon_rmap(), are ordered before set_pte_at() to prevent
the following race:

  CPU 1                                 CPU 2
  set_pte_at()                          get_user_pages_fast()
    page_add_new_anon_rmap()              gup_pte_range()
      __SetPageSwapBacked()                 SetPageReferenced()

This demonstrates a non-fatal scenario. Though they haven't been directly
observed, fatal ones can exist, e.g., PG_locked set by the fast gup
caller and then overwritten by __SetPageSwapBacked().
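
The fatal variant boils down to a lost update on page->flags, along these
lines (illustrative interleaving only, not actual code):

	/* CPU 1, inside __SetPageSwapBacked(): non-atomic read-modify-write */
	unsigned long old = page->flags;
	/* CPU 2: lock_page() atomically sets PG_locked in the same window */
	page->flags = old | BIT(PG_swapbacked);  /* CPU 1's write drops PG_locked */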

For an anon page that is already in swap cache or a file page, we
don't need smp_wmb() before set_pte_at() because adding to the swap or
file cache serves as a valid write barrier. Using non-atomic ops
thereafter is a bug, obviously.

smp_wmb() is added following 11 of the 12 page_add_new_anon_rmap()
call sites, with the only exception being
do_huge_pmd_wp_page_fallback() because of an existing smp_wmb().

Signed-off-by: Yu Zhao 
---
 kernel/events/uprobes.c |  2 ++
 mm/huge_memory.c        |  6 ++
 mm/khugepaged.c         |  2 ++
 mm/memory.c             | 10 +-
 mm/migrate.c            |  2 ++
 mm/swapfile.c           |  6 --
 mm/userfaultfd.c        |  2 ++
 7 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..7069785e2e52 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -194,6 +194,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
flush_cache_page(vma, addr, pte_pfn(*pvmw.pte));
ptep_clear_flush_notify(vma, addr, pvmw.pte);
+   /* commit non-atomic ops before exposing to fast gup */
+   smp_wmb();
set_pte_at_notify(mm, addr, pvmw.pte,
mk_pte(new_page, vma->vm_page_prot));
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de1f15969e27..21d271a29d96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -616,6 +616,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
mem_cgroup_commit_charge(page, memcg, false, true);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
+   /* commit non-atomic ops before exposing to fast gup */
+   smp_wmb();
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
mm_inc_nr_ptes(vma->vm_mm);
@@ -1276,7 +1278,9 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
}
kfree(pages);
 
+   /* commit non-atomic ops before exposing to fast gup */
smp_wmb(); /* make pte visible before pmd */
+
pmd_populate(vma->vm_mm, vmf->pmd, pgtable);
page_remove_rmap(page, true);
spin_unlock(vmf->ptl);
@@ -1423,6 +1427,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
page_add_new_anon_rmap(new_page, vma, haddr, true);
mem_cgroup_commit_charge(new_page, memcg, false, true);
lru_cache_add_active_or_unevictable(new_page, vma);
+   /* commit non-atomic ops before exposing to fast gup */
+   smp_wmb();
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
if (!page) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 70ff98e1414d..f2901edce6de 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1074,6 +1074,8 @@ static void collapse_huge_page(struct mm_struct *mm,
count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
+   /* commit non-atomic ops before exposing to fast gup */
+   smp_wmb();
set_pmd_at(mm, address, pmd, _pmd);
update_mmu_cache_pmd(vma, address, pmd);
spin_unlock(pmd_ptl);
diff --git a/mm/memory.c b/mm/memory.c
index aa86852d9ec2..6dabbc3cd3b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2367,6 +2367,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 * mmu page tables (such as kvm shadow page tables), we want the
 * new page to be mapped directly into the secondary page table.
 */
+   /* commit non-atomic ops before exposing to fast gup */
+   smp_wmb();
set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
update_mmu_cache(vma, vmf->address, vmf->pte);
if (old_page) {
@@ -2877,7 +2879,6 @@ vm_fault_t do_swap_page(struct