> On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli <aarca...@redhat.com> wrote: > > From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli <aarca...@redhat.com> > Date: Tue, 22 Dec 2020 14:30:16 -0500 > Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after > userfaultfd_writeprotect() > > change_protection() is called by userfaultfd_writeprotect() with the > mmap_lock_read like in change_prot_numa(). > > The page fault code in wp_copy_page() rightfully assumes if the CPU > issued a write fault and the write bit in the pagetable is not set, no > CPU can write to the page. That's wrong assumption after > userfaultfd_writeprotect(). That's also wrong assumption after > change_prot_numa() where the regular page fault code also would assume > that if the present bit is not set and the page fault is running, > there should be no stale TLB entry, but there is still. > > So to stay safe, the page fault code must be prevented to run as long > as long as the TLB flush remains pending. That is already achieved by > the do_numa_page() path for change_prot_numa() and by the > userfaultfd_pte_wp() path for userfaultfd_writeprotect(). > > The problem that needs fixing is that an un-wrprotect > (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not > set) could run in between the original wrprotect > (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set) > and wp_copy_page, while the TLB flush remains pending.
I may need to read your patch more carefully, but in general I do not like the approach. You are much more experienced than I am, but IMHO the TLB flushing logic needs to be further simplified and generalized and not the other way around. The complexity is already too high. We have tlb_flush_batched and tlb_flush_pending, which I think should be (somehow) combined. tlb_gather_mmu() is designed for zapping, but can’t it be modified to suit protection changes to avoid buggy use-cases (as the wrong use in clear_refs_write() ) ? So adding new userfaultfd specific code, which potentially does not address all the interactions (now or the future), is concerning. In this regard, a similar problem to the one in userfaultfd (mmap_read_lock() while write-protecting) already exists with soft-dirty clearing, so any solution should also address the soft-dirty issue.