On Tue, Aug 01, 2017 at 12:21:41PM -0700, Nadav Amit wrote:
> Minchan Kim <minc...@kernel.org> wrote:
> 
> > Nadav reported KSM can corrupt the user data by the TLB batching race[1].
> > That means data written by the user can be lost.
> > 
> > Quote from Nadav Amit
> > "
> > For this race we need 4 CPUs:
> > 
> > CPU0: Caches a writable and dirty PTE entry, and uses the stale value for
> > write later.
> > 
> > CPU1: Runs madvise_free on the range that includes the PTE. It would clear
> > the dirty-bit. It batches TLB flushes.
> > 
> > CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty. We
> > care about the fact that it clears the PTE write-bit, and of course, batches
> > TLB flushes.
> > 
> > CPU3: Runs KSM. Our purpose is to pass the following test in
> > write_protect_page():
> > 
> >     if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> >         (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
> > 
> > Since it will avoid TLB flush. And we want to do it while the PTE is stale.
> > Later, and before replacing the page, we would be able to change the page.
> > 
> > Note that all the operations the CPU1-3 perform can happen in parallel since
> > they only acquire mmap_sem for read.
> > 
> > We start with two identical pages. Everything below regards the same
> > page/PTE.
> > 
> > CPU0                CPU1            CPU2            CPU3
> > ----                ----            ----            ----
> > Write the same
> > value on page
> > 
> > [cache PTE as
> > dirty in TLB]
> > 
> >             MADV_FREE
> >             pte_mkclean()
> > 
> >                             4 > clear_refs
> >                             pte_wrprotect()
> > 
> >                                             write_protect_page()
> >                                             [ success, no flush ]
> > 
> >                                             pages_identical()
> >                                             [ ok ]
> > 
> > Write to page
> > different value
> > 
> > [Ok, using stale
> > PTE]
> > 
> >                                             replace_page()
> > 
> > Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late. CPU0
> > already wrote on the page, but KSM ignored this write, and it got lost.
> > "
> > 
> > In the above scenario, MADV_FREE is fixed by changing the TLB batching API
> > to include [set|clear]_tlb_flush_pending. What remains is the soft-dirty part.
> > 
> > This patch makes soft-dirty use the TLB batching API instead of flush_tlb_mm,
> > and makes KSM check for a pending TLB flush with mm_tlb_flush_pending, so that
> > it flushes the TLB to avoid data loss if other parallel threads have TLB
> > flushes pending.
> > 
> > [1] http://lkml.kernel.org/r/bd3a0ebe-ecf4-41d4-87fa-c755ea9ab...@gmail.com
> > 
> > Note:
> > I failed to reproduce this problem with Nadav's test program, which
> > needs its timing tuned to my system's speed, so I could not confirm that
> > the patch works.
> > Nadav, could you test this patch on your test machine?
> > 
> > Thanks!
> > 
> > Cc: Nadav Amit <nadav.a...@gmail.com>
> > Cc: Mel Gorman <mgor...@techsingularity.net>
> > Cc: Hugh Dickins <hu...@google.com>
> > Cc: Andrea Arcangeli <aarca...@redhat.com>
> > Signed-off-by: Minchan Kim <minc...@kernel.org>
> > ---
> > fs/proc/task_mmu.c | 4 +++-
> > mm/ksm.c           | 3 ++-
> > 2 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 9782dedeead7..58ef3a6abbc0 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1018,6 +1018,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >     enum clear_refs_types type;
> >     int itype;
> >     int rv;
> > +   struct mmu_gather tlb;
> > 
> >     memset(buffer, 0, sizeof(buffer));
> >     if (count > sizeof(buffer) - 1)
> > @@ -1062,6 +1063,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >             }
> > 
> >             down_read(&mm->mmap_sem);
> > +           tlb_gather_mmu(&tlb, mm, 0, -1);
> >             if (type == CLEAR_REFS_SOFT_DIRTY) {
> >                     for (vma = mm->mmap; vma; vma = vma->vm_next) {
> >                             if (!(vma->vm_flags & VM_SOFTDIRTY))
> > @@ -1083,7 +1085,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >             walk_page_range(0, mm->highest_vm_end, &clear_refs_walk);
> >             if (type == CLEAR_REFS_SOFT_DIRTY)
> >                     mmu_notifier_invalidate_range_end(mm, 0, -1);
> > -           flush_tlb_mm(mm);
> > +           tlb_finish_mmu(&tlb, 0, -1);
> >             up_read(&mm->mmap_sem);
> > out_mm:
> >             mmput(mm);
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 0c927e36a639..15dd7415f7b3 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1038,7 +1038,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> >             goto out_unlock;
> > 
> >     if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
> > -       (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte))) {
> > +       (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
> > +                                           mm_tlb_flush_pending(mm)) {
> >             pte_t entry;
> > 
> >             swapped = PageSwapCache(page);
> > -- 
> > 2.7.4
> 
> I tested the patch-set, and my PoC does not fail anymore.

Thanks for testing with your great reproducer, Nadav!
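
For readers following the thread, the effect of the mm/ksm.c hunk can be
sketched in userspace C. This is only a toy model under stated assumptions:
the toy_* names and struct layouts are made up for illustration and are not
the kernel's types, the pending-flush state is assumed to be a plain counter,
and the protnone/savedwrite clause of the real test is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a PTE and an mm; not the kernel structures. */
struct toy_pte {
	bool write;	/* models pte_write() */
	bool dirty;	/* models pte_dirty() */
};

struct toy_mm {
	int tlb_flush_pending;	/* assumed: number of batched flushes in flight */
};

/* A batching path (MADV_FREE, clear_refs) announces a deferred flush. */
static void toy_inc_tlb_flush_pending(struct toy_mm *mm)
{
	mm->tlb_flush_pending++;
}

/* The batching path has issued its flush. */
static void toy_dec_tlb_flush_pending(struct toy_mm *mm)
{
	assert(mm->tlb_flush_pending > 0);
	mm->tlb_flush_pending--;
}

static bool toy_mm_tlb_flush_pending(const struct toy_mm *mm)
{
	return mm->tlb_flush_pending > 0;
}

/* Test before the patch: only the PTE bits decide whether to flush. */
static bool toy_must_flush_old(const struct toy_pte *pte)
{
	return pte->write || pte->dirty;
}

/* Test after the patch: a pending batched flush also forces a flush. */
static bool toy_must_flush_new(const struct toy_mm *mm,
			       const struct toy_pte *pte)
{
	return pte->write || pte->dirty || toy_mm_tlb_flush_pending(mm);
}
```

Replaying the quoted timeline against this model: once MADV_FREE has cleared
the dirty bit and clear_refs has cleared the write bit, each with its flush
still pending, toy_must_flush_old() returns false while toy_must_flush_new()
returns true. After both batches finish, both return false.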
