On Thu, Sep 17, 2020 at 11:11:06AM -0700, Linus Torvalds wrote:
>  (a) if the pinner is going to change the page, it will have to get
> the pin with FOLL_WRITE in addition to FOLL_PIN
> 
>  (b) that means it will do the COW and mark the page writable and dirty

Yep
 
>  (c) the whole _point_ of the FOLL_PIN is that subsequent operations
> shouldn't make it non-writable any more (ie it can't be unmapped, and
> we should synchronize on fork etc)

It is the ideal, but FOLL_PIN has been troubled for a long time:
https://lwn.net/Articles/753027/

ie writeprotect is known to happen due to writeback. I had understood
that fork() will also cause write protect. Eg

copy_process()
 copy_mm()
  dup_mm()
   dup_mmap()
    copy_page_range()
     [..]
      copy_one_pte()

Gets to:

        if (is_cow_mapping(vm_flags) && pte_write(pte)) {
                ptep_set_wrprotect(src_mm, addr, src_pte);
                pte = pte_wrprotect(pte);
        }

Which blindly write protects a FOLL_PIN page. Here src_pte will be
under a pin.

I *think* the failing test is basically:

 1) pin_user_pages(mem, FOLL_FORCE | FOLL_WRITE)
 2) pid = fork()
 3) child: does a bit, then exec
 4) parent: waitpid(pid)
 5) *mem = 0

Here #5 is the WP triggered COW. We know a WP triggered COW is
happening from the bisect and success result with MADV_DONTFORK.

#2 triggers the dup_mmap() and the ptep_set_wrprotect() (From
inspection, at least)

This "Trial do_wp_page() simplification" patch means that when #5 goes
to do COW it gets a copy instead of a re-use because the
reuse_swap_page() was aborting the copy before.

So, to your point, yes ideally FOLL_PIN would never write-protect
pages!

Looking for awhile, this now looks reasonable and
doable. page_maybe_dma_pinned() was created for exactly this kind of
case.

I've attached a dumb sketch for the pte level (surely wrong! I have
never looked at this part of the mm before!) at the end of this
message. 

Peter, do you know this better? Does this inspire you to make a patch? :)

Would really love to see this. We have such a huge expensive mess with
MADV_DONTFORK, this would eliminate all that overhead.

Thanks,
Jason

diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76e1..6bc19a43da1391 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -689,6 +689,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct 
*vma, unsigned long addr,
 }
 #endif
 
+static int duplicate_page(pte_t *newpte, struct vm_area_struct *vma,
+                         unsigned long address, struct page *page)
+{
+       struct page *new_page;
+
+       new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+       if (!new_page)
+               return -ENOMEM;
+       copy_user_highpage(new_page, page, address, vma);
+
+       /* FIXME: surely more than this */
+       *newpte = mk_pte(new_page, vma->vm_page_prot);
+       return 0;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -703,6 +718,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
        unsigned long vm_flags = vma->vm_flags;
        pte_t pte = *src_pte;
        struct page *page;
+       bool do_src_wp;
 
        /* pte contains position in swap or file, so copy. */
        if (unlikely(!pte_present(pte))) {
@@ -775,15 +791,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
                goto out_set_pte;
        }
 
-       /*
-        * If it's a COW mapping, write protect it both
-        * in the parent and the child
-        */
-       if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-               ptep_set_wrprotect(src_mm, addr, src_pte);
-               pte = pte_wrprotect(pte);
-       }
-
        /*
         * If it's a shared mapping, mark it clean in
         * the child
@@ -800,11 +807,34 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct 
*src_mm,
        if (!(vm_flags & VM_UFFD_WP))
                pte = pte_clear_uffd_wp(pte);
 
+       do_src_wp = is_cow_mapping(vm_flags) && pte_write(pte);
        page = vm_normal_page(vma, addr, pte);
        if (page) {
                get_page(page);
                page_dup_rmap(page, false);
                rss[mm_counter(page)]++;
+
+               /*
+                * If a page is DMA pinned we never want to write protect it,
+                * copy it now.
+                */
+               if (do_src_wp && page_maybe_dma_pinned(page)) {
+                       do_src_wp = false;
+                       ret = duplicate_page(&pte, vma, addr, page);
+                       if (ret)
+                            /* FIXME: need to restructure a bit to handle this 
*/
+                            return ret;
+               }
+       }
+
+       /*
+        * If it's a COW mapping, write protect it both
+        * in the parent and the child
+        * FIXME check carefully this is new order is OK
+        */
+       if (do_src_wp) {
+               ptep_set_wrprotect(src_mm, addr, src_pte);
+               pte = pte_wrprotect(pte);
        }
 
 out_set_pte:

Reply via email to