Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Jann Horn
On Tue, Aug 22, 2023 at 8:54 PM Hugh Dickins  wrote:
> But rather than reworking it, please let's just go with v1 for now.

Sounds good to me.


Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Jann Horn
On Tue, Aug 22, 2023 at 5:23 PM Matthew Wilcox  wrote:
> On Tue, Aug 22, 2023 at 04:39:43PM +0200, Jann Horn wrote:
> > > Perhaps something else will want that same behaviour in future (it's
> > > tempting, but difficult to guarantee correctness); for now, it is just
> > > userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> > > expecting uffd to add more such exceptional modes in future).
> >
> > Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
> > wanted to make it possible to reliably install PTE markers with
> > madvise() or something like that, which might be nice for allowing
> > userspace to create guard pages without unnecessary extra VMAs...)
>
> I don't know what a userspace API for this would look like, but I have
> a dream of creating guard VMAs which only live in the maple tree and
> don't require the allocation of a struct VMA.  Use some magic reserved
> pointer value like XA_ZERO_ENTRY to represent them ... seems more
> robust than putting a PTE marker in the page tables?

Chrome currently uses a lot of VMAs for its heap, which I think are
basically alternating PROT_NONE and PROT_READ|PROT_WRITE anonymous
VMAs. Like this:

[...]
3a10002cf000-3a10002d ---p  00:00 0
3a10002d-3a10002e6000 rw-p  00:00 0
3a10002e6000-3a10002e8000 ---p  00:00 0
3a10002e8000-3a10002f2000 rw-p  00:00 0
3a10002f2000-3a10002f4000 ---p  00:00 0
3a10002f4000-3a10002fb000 rw-p  00:00 0
3a10002fb000-3a10002fc000 ---p  00:00 0
3a10002fc000-3a1000303000 rw-p  00:00 0
3a1000303000-3a1000304000 ---p  00:00 0
3a1000304000-3a100031b000 rw-p  00:00 0
3a100031b000-3a100031c000 ---p  00:00 0
3a100031c000-3a1000326000 rw-p  00:00 0
3a1000326000-3a1000328000 ---p  00:00 0
3a1000328000-3a100033a000 rw-p  00:00 0
3a100033a000-3a100033c000 ---p  00:00 0
3a100033c000-3a100038b000 rw-p  00:00 0
3a100038b000-3a100038c000 ---p  00:00 0
3a100038c000-3a100039b000 rw-p  00:00 0
3a100039b000-3a100039c000 ---p  00:00 0
3a100039c000-3a10003af000 rw-p  00:00 0
3a10003af000-3a10003b ---p  00:00 0
3a10003b-3a10003e8000 rw-p  00:00 0
3a10003e8000-3a1000401000 ---p  00:00 0
3a1000401000-3a1000402000 rw-p  00:00 0
3a1000402000-3a100040c000 ---p  00:00 0
3a100040c000-3a100046f000 rw-p  00:00 0
3a100046f000-3a100047 ---p  00:00 0
3a100047-3a100047a000 rw-p  00:00 0
3a100047a000-3a100047c000 ---p  00:00 0
3a100047c000-3a1000492000 rw-p  00:00 0
3a1000492000-3a1000494000 ---p  00:00 0
3a1000494000-3a10004a2000 rw-p  00:00 0
3a10004a2000-3a10004a4000 ---p  00:00 0
3a10004a4000-3a10004b6000 rw-p  00:00 0
3a10004b6000-3a10004b8000 ---p  00:00 0
3a10004b8000-3a10004ea000 rw-p  00:00 0
3a10004ea000-3a10004ec000 ---p  00:00 0
3a10004ec000-3a10005f4000 rw-p  00:00 0
3a10005f4000-3a1000601000 ---p  00:00 0
3a1000601000-3a1000602000 rw-p  00:00 0
3a1000602000-3a1000604000 ---p  00:00 0
3a1000604000-3a100062b000 rw-p  00:00 0
3a100062b000-3a1000801000 ---p  00:00 0
[...]

I was thinking if you used PTE markers as guards, you could maybe turn
all that into more or less a single VMA?
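(For context, here is a minimal sketch of how such layouts typically arise - my own
illustration of the common reserve-then-mprotect pattern, not Chrome's actual
allocator code - and why each guard range currently costs extra VMAs:)

#include <err.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        /* reserve address space, inaccessible by default */
        char *base = mmap(NULL, 16 * 4096, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                err(1, "mmap");

        /*
         * Make two chunks usable; the untouched PROT_NONE pages between
         * them act as guards, but every protection boundary splits off
         * another VMA (5 VMAs here instead of 1).
         */
        if (mprotect(base + 1 * 4096, 4 * 4096, PROT_READ | PROT_WRITE) ||
            mprotect(base + 7 * 4096, 4 * 4096, PROT_READ | PROT_WRITE))
                err(1, "mprotect");

        printf("check /proc/%d/maps\n", getpid());
        pause();
}

If the guard ranges could instead be expressed as PTE markers (or magic maple
tree entries), that pattern wouldn't have to fragment the mapping like this.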


Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-22 Thread Jann Horn
On Tue, Aug 22, 2023 at 4:51 AM Hugh Dickins  wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins  wrote:
> > > Just for this case, take the pmd_lock() two steps earlier: not because
> > > it gives any protection against this case itself, but because ptlock
> > > nests inside it, and it's the dropping of ptlock which let the bug in.
> > > In other cases, continue to minimize the pmd_lock() hold time.
> >
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
>
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
>
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).

Hm, yeah, sounds okay. (I guess we'd also run into this if we ever
wanted to make it possible to reliably install PTE markers with
madvise() or something like that, which might be nice for allowing
userspace to create guard pages without unnecessary extra VMAs...)

> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
>
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know).  It may be a
> better idea anyway (saving a drop and retake of ptlock).

I was thinking it also has the advantage that it would still perform
okay if we got rid of the userfaultfd_armed() condition at some point
- though I realize that designing too much for hypothetical future
features is an antipattern.

> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
>
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

I prefer this version because it would make it easier to remove the
"userfaultfd_armed()" check in the future if we have to, but I guess
we could also always change it later if that becomes necessary, so I
don't really have strong feelings on it at this point.


Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

2023-08-21 Thread Jann Horn
On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins  wrote:
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.

I think we couldn't rely on anon_vma here anyway, since holding the
mmap_lock in read mode doesn't prevent concurrent creation of an
anon_vma?

> Just for this case, take the pmd_lock() two steps earlier: not because
> it gives any protection against this case itself, but because ptlock
> nests inside it, and it's the dropping of ptlock which let the bug in.
> In other cases, continue to minimize the pmd_lock() hold time.

Special-casing userfaultfd like this makes me a bit uncomfortable; but
I also can't find anything other than userfaultfd that would insert
pages into regions that are khugepaged-compatible, so I guess this
works?

I guess an alternative would be to use a spin_trylock() instead of the
current pmd_lock(), and if that fails, temporarily drop the page table
lock and then restart from step 2 with both locks held - and at that
point the page table scan should be fast since we expect it to usually
be empty.
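
Roughly like this (just a sketch against collapse_pte_mapped_thp() to show the
structure, not a tested patch; the rescan label is hypothetical):

        /* step 4: remove page table */
        pml = pmd_lockptr(mm, pmd);
        if (!spin_trylock(pml)) {
                /* ptlock nests inside the pmd lock, so drop it first */
                spin_unlock(ptl);
                spin_lock(pml);
                spin_lock(ptl);
                /*
                 * We dropped the ptlock, so redo the step-2 scan under
                 * both locks; the page table is normally still empty,
                 * so this should be quick.
                 */
                goto rescan_ptes;
        }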


[BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-08-14 Thread Jann Horn
On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins  wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

We can still have a racing userfaultfd operation at the "/* step 4:
remove page table */" point that installs a new PTE before the page
table is removed.

To reproduce, patch a delay into the kernel like this:


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9a6e0d507759..27cc8dfbf3a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>

 #include 
 #include 
@@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
*mm, unsigned long addr,
}

/* step 4: remove page table */
+   if (strcmp(current->comm, "DELAYME") == 0) {
+   pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
+   mdelay(5000);
+   pr_warn("%s: END DELAY INJECTION\n", __func__);
+   }

/* Huge page lock is still held, so page table must remain empty */
pml = pmd_lock(mm, pmd);


And then run the attached reproducer against mm/mm-everything. You
should get this in dmesg:

[  206.578096] BUG: Bad rss-counter state mm:0942ebea
type:MM_ANONPAGES val:1
// compile with "gcc -o khugepaged-vs-uffd khugepaged-vs-uffd.c -pthread"
#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/userfaultfd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

#define SYSCHK(x) ({  \
  typeof(x) __res = (x);  \
  if (__res == (typeof(x))-1) \
err(1, "SYSCHK(" #x ")"); \
  __res;  \
})

static void write_file(char *name, char *buf) {
  int fd = SYSCHK(open(name, O_WRONLY));
  if (write(fd, buf, strlen(buf)) != strlen(buf))
err(1, "write %s", name);
  close(fd);
}

static void write_map(char *name, int outer_id) {
  char buf[100];
  sprintf(buf, "0 %d 1", outer_id);
  write_file(name, buf);
}

static void *thread_fn(void *dummy) {
  system("head -n50 /proc/$PPID/smaps;echo;echo");
  SYSCHK(prctl(PR_SET_NAME, "DELAYME"));
  SYSCHK(madvise((void*)0x20UL, 0x20, MADV_COLLAPSE));
  SYSCHK(prctl(PR_SET_NAME, "thread"));
  system("head -n50 /proc/$PPID/smaps");
  return NULL;
}

int main(void) {
  int outer_uid = getuid();
  int outer_gid = getgid();
  SYSCHK(unshare(CLONE_NEWNS|CLONE_NEWUSER));
  SYSCHK(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
  write_file("/proc/self/setgroups", "deny");
  write_map("/proc/self/uid_map", outer_uid);
  write_map("/proc/self/gid_map", outer_gid);

  SYSCHK(mount("none", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, "huge=always"));
  int fd = SYSCHK(open("/tmp/a", O_RDWR|O_CREAT, 0600));
  SYSCHK(ftruncate(fd, 0x20));
  void *ptr = SYSCHK(mmap((void*)0x20UL, 0x10, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0));
  *(volatile char *)ptr;
  SYSCHK(mmap((void*)0x30UL, 0x10, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0x10));
  for (int i=0; i<512; i++)
*(volatile char *)(0x20UL + 0x1000 * i);

  int uffd = SYSCHK(syscall(__NR_userfaultfd, UFFD_USER_MODE_ONLY));

  struct uffdio_api api = { .api = UFFD_API, .features = 0 };
  SYSCHK(ioctl(uffd, UFFDIO_API, &api));

  struct uffdio_register reg = {
.range = { .start = 0x20, .len = 0x20 },
.mode = UFFDIO_REGISTER_MODE_MISSING
  };
  SYSCHK(ioctl(uffd, UFFDIO_REGISTER, &reg));

  pthread_t thread;
  if (pthread_create(&thread, NULL, thread_fn, NULL))
errx(1, "pthread_create");

  sleep(1);

  unsigned char dummy_page[0x1000] = {1};
  struct uffdio_copy copy = {
.dst = 0x201000,
.src = (unsigned long)dummy_page,
.len = 0x1000,
.mode = 0
  };
  SYSCHK(ioctl(uffd, UFFDIO_COPY, &copy));

  if (pthread_join(thread, NULL))
errx(1, "pthread_join");

  //system("cat /proc/$PPID/smaps");
}


Re: [PATCH 00/12] mm: free retracted page table by RCU

2023-06-02 Thread Jann Horn
On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins  wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:11 AM Hugh Dickins  wrote:
> > > Here is the third series of patches to mm (and a few architectures), based
> > > on v6.4-rc3 with the preceding two series applied: in which khugepaged
> > > takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.
> >
> > To clarify: Part of the design here is that when you look up a user
> > page table with pte_offset_map_nolock() or pte_offset_map() without
> > holding mmap_lock in write mode, and you later lock the page table
> > yourself, you don't know whether you actually have the real page table
> > or a detached table that is currently in its RCU grace period, right?
>
> Right.  (And I'd rather not assume anything of mmap_lock, but there are
> one or two or three places that may still do so.)
>
> > And detached tables are supposed to consist of only zeroed entries,
> > and we assume that no relevant codepath will do anything bad if one of
> > these functions spuriously returns a pointer to a page table full of
> > zeroed entries?
>
> (Nit that I expect you're well aware of: IIRC "zeroed" isn't 0 on s390.)

I was not aware, thanks. I only knew that on Intel's Knights Landing
CPUs, the A/D bits are ignored by pte_none() due to some erratum.

> If someone is using pte_offset_map() without lock, they must be prepared
> to accept page-table-like changes.  The limits of pte_offset_map_nolock()
> with later spin_lock(ptl): I'm still exploring: there's certainly an
> argument that one ought to do a pmd_same() check before proceeding,
> but I don't think anywhere needs that at present.
>
> Whether the page table has to be full of zeroed entries when detached:
> I believe it is always like that at present (by the end of the series,
> when the collapse_pte_offset_map() oddity is fixed), but whether it needs
> to be so I'm not sure.  Quite likely it will need to be; but I'm open to
> the possibility that all it needs is to be still a page table, with
> perhaps new entries from a new usage in it.

My understanding is that at least handle_pte_fault(), the way it is
currently written, would do bad things in that case:

// assume we enter with mmap_lock in read mode,
// for a write fault on a shared writable VMA without a page_mkwrite handler
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
  pte_t entry;

  if (unlikely(pmd_none(*vmf->pmd))) {
[ not executed ]
  } else {
/*
 * A regular pmd is established and it can't morph into a huge
 * pmd by anon khugepaged, since that takes mmap_lock in write
 * mode; but shmem or file collapse to THP could still morph
 * it into a huge pmd: just retry later if so.
 */
vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
 vmf->address, &vmf->ptl);
if (unlikely(!vmf->pte))
  [not executed]
[assume that at this point, a concurrent THP collapse operation
 removes the page table, and the page table has now been reused
 and contains a read-only PTE]
// this reads page table contents protected solely by RCU
vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

if (pte_none(vmf->orig_pte)) {
  pte_unmap(vmf->pte);
  vmf->pte = NULL;
}
  }

  if (!vmf->pte)
[not executed]

  if (!pte_present(vmf->orig_pte))
[not executed]

  if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
[not executed]

  spin_lock(vmf->ptl);
  entry = vmf->orig_pte;
  if (unlikely(!pte_same(*vmf->pte, entry))) {
[not executed]
  }
  if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
if (!pte_write(entry))
  // This will go into wp_page_shared(),
  // which will call wp_page_reuse(),
  // which will upgrade the page to writable
  return do_wp_page(vmf);
[ not executed ]
  }
  [ not executed ]
}

That looks like we could end up racing such that a read-only PTE in
the reused page table gets upgraded to writable, which would probably
be a security bug.

But I guess if you added bailout checks to every page table lock
operation, it'd be a different story, maybe?

> The most obvious vital thing (in the split ptlock case) is that it
> remains a struct page with a usable ptl spinlock embedded in it.
>
> The question becomes more urgent when/if extending to replacing the
> pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> wants to deposit the page table for later use even in the shmem/file
> case (and all arches in the anon case): I did work out the details once
> before, but I'm not sure whether I would still agree with myself; and was
> glad to leave replacement out of this

Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-06-02 Thread Jann Horn
On Fri, Jun 2, 2023 at 4:50 AM Hugh Dickins  wrote:
> On Wed, 31 May 2023, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:15 AM Hugh Dickins  wrote:
> > > Before putting them to use (several commits later), add rcu_read_lock()
> > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> > > separate commit, since it risks exposing imbalances: prior commits have
> > > fixed all the known imbalances, but we may find some have been missed.
> > [...]
> > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> > > index c7ab18a5fb77..674671835631 100644
> > > --- a/mm/pgtable-generic.c
> > > +++ b/mm/pgtable-generic.c
> > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long 
> > > addr, pmd_t *pmdvalp)
> > >  {
> > > pmd_t pmdval;
> > >
> > > -   /* rcu_read_lock() to be added later */
> > > +   rcu_read_lock();
> > > pmdval = pmdp_get_lockless(pmd);
> > > if (pmdvalp)
> > > *pmdvalp = pmdval;
> >
> > It might be a good idea to document that this series assumes that the
> > first argument to __pte_offset_map() is a pointer into a second-level
> > page table (and not a local copy of the entry) unless the containing
> > VMA is known to not be THP-eligible or the page table is detached from
> > the page table hierarchy or something like that. Currently a bunch of
> > places pass references to local copies of the entry, and while I think
> > all of these are fine, it would probably be good to at least document
> > why these are allowed to do it while other places aren't.
>
> Thanks Jann: but I have to guess that here you are showing awareness of
> an important issue that I'm simply ignorant of.
>
> I have been haunted by a dim recollection that there is one architecture
> (arm-32?) which is fussy about the placement of the pmdval being examined
> (deduces info missing from the arch-independent interface, by following
> up the address?), but I couldn't track it down when I tried.
>
> Please tell me more; or better, don't spend your time explaining to me,
> but please just send a link to a good reference on the issue.  I'll be
> unable to document what you ask there, without educating myself first.

Sorry, I think I was somewhat confused about what was going on when I
wrote that message.

After this series, __pte_offset_map() looks as follows, with added
comments describing my understanding of the semantics:

// `pmd` points to one of:
// case 1: a pmd_t stored outside a page table,
// referencing a page table detached by the caller
// case 2: a pmd_t stored outside a page table, which the caller copied
// from a page table in an RCU-critical section that extends
// until at least the end of this function
// case 3: a pmd_t stored inside a page table
pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
{
unsigned long __maybe_unused flags;
pmd_t pmdval;

// begin an RCU section; this is needed for case 3
rcu_read_lock();
config_might_irq_save(flags);
// read the pmd_t.
// if the pmd_t references a page table, this page table can not
// go away because:
//  - in case 1, the caller is the main owner of the page table
//  - in case 2, because the caller
//started an RCU read-side critical section before the caller
//read the original pmd_t. (This pmdp_get_lockless() is just
//reading a copied pmd_t off the stack.)
//  - in case 3, because we started an RCU section above before
//reading the pmd_t out of the page table here
pmdval = pmdp_get_lockless(pmd);
config_might_irq_restore(flags);

if (pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
goto nomap;
}
return __pte_map(&pmdval, addr);
nomap:
rcu_read_unlock();
return NULL;
}

case 1 is what happens in __page_table_check_pte_clear_range(),
__split_huge_zero_page_pmd() and __split_huge_pmd_locked().
case 2 happens in lockless page table traversal (gup_pte_range() and
perf_get_pgtable_size()).
case 3 is normal page table traversal under mmap lock or mapping lock.

I think having a function like this that can run in three different
contexts in which it is protected in three different ways is somewhat
hard to understand without comments. Though maybe I'm thinking about
it the wrong way?

Basically my point is

Re: [PATCH 08/12] mm/pgtable: add pte_free_defer() for pgtable as page

2023-06-01 Thread Jann Horn
On Mon, May 29, 2023 at 8:23 AM Hugh Dickins  wrote:
> Add the generic pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This version
> suits all those architectures which use an unfragmented page for one page
> table (none of whose pte_free()s use the mm arg which was passed to it).

Pages that have been scheduled for deferred freeing can still be
locked, right? So struct page's members "ptl" and "rcu_head" can now
be in use at the same time? If that's intended, it would probably be a
good idea to add comments in the "/* Page table pages */" part of
struct page to point out that the first two members can be used by the
rcu_head while the page is still used as a page table in some
contexts, including use of the ptl.
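
(For readers following along, the shape I take from the description is roughly
the following - a sketch based on the commit text, not necessarily the exact
patch:)

/* free a page table page via RCU, reusing the rcu_head in struct page */
static void pte_free_now(struct rcu_head *head)
{
        struct page *page = container_of(head, struct page, rcu_head);

        pte_free(NULL /* mm arg unused by these architectures */, page);
}

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
        struct page *page = pgtable;

        call_rcu(&page->rcu_head, pte_free_now);
}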


Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-31 Thread Jann Horn
On Wed, May 31, 2023 at 10:54 PM Peter Xu  wrote:
> On Wed, May 31, 2023 at 05:34:58PM +0200, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins  wrote:
> > > -static int retract_page_tables(struct address_space *mapping, pgoff_t 
> > > pgoff,
> > > -  struct mm_struct *target_mm,
> > > -  unsigned long target_addr, struct page 
> > > *hpage,
> > > -  struct collapse_control *cc)
> > > +static void retract_page_tables(struct address_space *mapping, pgoff_t 
> > > pgoff)
> > >  {
> > > struct vm_area_struct *vma;
> > > -   int target_result = SCAN_FAIL;
> > >
> > > -   i_mmap_lock_write(mapping);
> > > +   i_mmap_lock_read(mapping);
> > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > -   int result = SCAN_FAIL;
> > > -   struct mm_struct *mm = NULL;
> > > -   unsigned long addr = 0;
> > > -   pmd_t *pmd;
> > > -   bool is_target = false;
> > > +   struct mm_struct *mm;
> > > +   unsigned long addr;
> > > +   pmd_t *pmd, pgt_pmd;
> > > +   spinlock_t *pml;
> > > +   spinlock_t *ptl;
> > >
> > > /*
> > >  * Check vma->anon_vma to exclude MAP_PRIVATE mappings 
> > > that
> > > -* got written to. These VMAs are likely not worth 
> > > investing
> > > -* mmap_write_lock(mm) as PMD-mapping is likely to be 
> > > split
> > > -* later.
> > > +* got written to. These VMAs are likely not worth 
> > > removing
> > > +* page tables from, as PMD-mapping is likely to be split 
> > > later.
> > >  *
> > > -* Note that vma->anon_vma check is racy: it can be set 
> > > up after
> > > -* the check but before we took mmap_lock by the fault 
> > > path.
> > > -* But page lock would prevent establishing any new ptes 
> > > of the
> > > -* page, so we are safe.
> > > -*
> > > -* An alternative would be drop the check, but check that 
> > > page
> > > -* table is clear before calling pmdp_collapse_flush() 
> > > under
> > > -* ptl. It has higher chance to recover THP for the VMA, 
> > > but
> > > -* has higher cost too. It would also probably require 
> > > locking
> > > -* the anon_vma.
> > > +* Note that vma->anon_vma check is racy: it can be set 
> > > after
> > > +* the check, but page locks (with XA_RETRY_ENTRYs in 
> > > holes)
> > > +* prevented establishing new ptes of the page. So we are 
> > > safe
> > > +* to remove page table below, without even checking it's 
> > > empty.
> >
> > This "we are safe to remove page table below, without even checking
> > it's empty" assumes that the only way to create new anonymous PTEs is
> > to use existing file PTEs, right? What about private shmem VMAs that
> > are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> > looking at the mapping and its pages (except for checking that the
> > insertion point is before end-of-file), protected only by mmap_lock
> > (shared) and pte_offset_map_lock().
>
> Hmm, yes.  We probably need to keep that though, and 5b51072e97 explained
> the reason (to still respect file permissions).
>
> Maybe the anon_vma check can also be moved into the pgtable lock section,
> with some comments explaining (but it's getting a bit ugly..)?

Or check that all entries are pte_none() or something like that inside
the pgtable-locked section?
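
i.e. roughly this (sketch only; the bailout labels are whatever the surrounding
loop uses):

                start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
                if (!start_pte)
                        continue;
                for (i = 0, pte = start_pte; i < HPAGE_PMD_NR; i++, pte++) {
                        /* e.g. a racing UFFDIO_COPY may have installed this */
                        if (!pte_none(*pte))
                                goto unlock_skip;
                }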

[...]
> > The old code called collapse_and_free_pmd(), which involves MMU
> > notifier invocation...
[...]
> > ... while the new code only does pmdp_collapse_flush(), which clears
> > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> > notifiers. My understanding is that that's problematic - maybe (?) it
> > is sort of okay with regards to classic MMU notifier users like KVM,
> > but it's probably wrong for IOMM

Re: [PATCH 00/12] mm: free retracted page table by RCU

2023-05-31 Thread Jann Horn
On Mon, May 29, 2023 at 8:11 AM Hugh Dickins  wrote:
> Here is the third series of patches to mm (and a few architectures), based
> on v6.4-rc3 with the preceding two series applied: in which khugepaged
> takes advantage of pte_offset_map[_lock]() allowing for pmd transitions.

To clarify: Part of the design here is that when you look up a user
page table with pte_offset_map_nolock() or pte_offset_map() without
holding mmap_lock in write mode, and you later lock the page table
yourself, you don't know whether you actually have the real page table
or a detached table that is currently in its RCU grace period, right?
And detached tables are supposed to consist of only zeroed entries,
and we assume that no relevant codepath will do anything bad if one of
these functions spuriously returns a pointer to a page table full of
zeroed entries?

So in particular, in handle_pte_fault() we can reach the "if
(unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
detached zeroed page table, but we're okay with that because in that
case we know that !pte_none(vmf->orig_pte) && pte_none(*vmf->pte),
which implies !pte_same(*vmf->pte, entry), which means we'll bail
out?

If that's the intent, it might be good to add some comments, because
at least to me that's not very obvious.


Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

2023-05-31 Thread Jann Horn
On Mon, May 29, 2023 at 8:26 AM Hugh Dickins  wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

I think there's a weirdness in the existing code, and this change
probably turns that into a UAF bug.

collapse_pte_mapped_thp() can be called on an address that might not
be associated with a VMA anymore, and after this change, the page
tables for that address might be in the middle of page table teardown
in munmap(), right? The existing mmap_write_lock() guards against
concurrent munmap() (so in the old code we are guaranteed to either
see a normal VMA or not see the page tables anymore), but
mmap_read_lock() only guards against the part of munmap() up to the
mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
(including free_pgtables()) happens after that.

So we can now enter collapse_pte_mapped_thp() and race with concurrent
free_pgtables() such that a PUD disappears under us while we're
walking it or something like that:


int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
  bool install_pmd)
{
  struct mmu_notifier_range range;
  unsigned long haddr = addr & HPAGE_PMD_MASK;
  struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
  struct page *hpage;
  pte_t *start_pte, *pte;
  pmd_t *pmd, pgt_pmd;
  spinlock_t *pml, *ptl;
  int nr_ptes = 0, result = SCAN_FAIL;
  int i;

  mmap_assert_locked(mm);

  /* Fast check before locking page if already PMD-mapped */
  result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
  if (result == SCAN_PMD_MAPPED)
return result;

  if (!vma || !vma->vm_file || // <<< bailout happens too late
  !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
return SCAN_VMA_CHECK;


I guess the right fix here is to make sure that at least the basic VMA
revalidation stuff (making sure there still is a VMA covering this
range) happens before find_pmd_or_thp_or_none()? Like:


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 301c0e54a2ef..5db365587556 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct
*mm, unsigned long addr,

 mmap_assert_locked(mm);

+if (!vma || !vma->vm_file ||
+!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+return SCAN_VMA_CHECK;
+
 /* Fast check before locking page if already PMD-mapped */
 result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 if (result == SCAN_PMD_MAPPED)
 return result;

-if (!vma || !vma->vm_file ||
-!range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-return SCAN_VMA_CHECK;
-
 /*
  * If we are here, we've succeeded in replacing all the native pages
  * in the page cache with a single hugepage. If a mm were to fault-in


Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s

2023-05-31 Thread Jann Horn
On Mon, May 29, 2023 at 8:15 AM Hugh Dickins  wrote:
> Before putting them to use (several commits later), add rcu_read_lock()
> to pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
> separate commit, since it risks exposing imbalances: prior commits have
> fixed all the known imbalances, but we may find some have been missed.
[...]
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index c7ab18a5fb77..674671835631 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, 
> pmd_t *pmdvalp)
>  {
> pmd_t pmdval;
>
> -   /* rcu_read_lock() to be added later */
> +   rcu_read_lock();
> pmdval = pmdp_get_lockless(pmd);
> if (pmdvalp)
> *pmdvalp = pmdval;

It might be a good idea to document that this series assumes that the
first argument to __pte_offset_map() is a pointer into a second-level
page table (and not a local copy of the entry) unless the containing
VMA is known to not be THP-eligible or the page table is detached from
the page table hierarchy or something like that. Currently a bunch of
places pass references to local copies of the entry, and while I think
all of these are fine, it would probably be good to at least document
why these are allowed to do it while other places aren't.

$ vgrep 'pte_offset_map(&'
Index File  Line Content
0 arch/sparc/mm/tlb.c    151 pte = pte_offset_map(&pmd, vaddr);
1 kernel/events/core.c  7501 ptep = pte_offset_map(&pmd, addr);
2 mm/gup.c  2460 ptem = ptep = pte_offset_map(&pmd, addr);
3 mm/huge_memory.c  2057 pte = pte_offset_map(&_pmd, haddr);
4 mm/huge_memory.c  2214 pte = pte_offset_map(&_pmd, haddr);
5 mm/page_table_check.c  240 pte_t *ptep = pte_offset_map(&pmd, addr);


Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-05-31 Thread Jann Horn
On Mon, May 29, 2023 at 8:25 AM Hugh Dickins  wrote:
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -  struct mm_struct *target_mm,
> -  unsigned long target_addr, struct page *hpage,
> -  struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
> struct vm_area_struct *vma;
> -   int target_result = SCAN_FAIL;
>
> -   i_mmap_lock_write(mapping);
> +   i_mmap_lock_read(mapping);
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -   int result = SCAN_FAIL;
> -   struct mm_struct *mm = NULL;
> -   unsigned long addr = 0;
> -   pmd_t *pmd;
> -   bool is_target = false;
> +   struct mm_struct *mm;
> +   unsigned long addr;
> +   pmd_t *pmd, pgt_pmd;
> +   spinlock_t *pml;
> +   spinlock_t *ptl;
>
> /*
>  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -* got written to. These VMAs are likely not worth investing
> -* mmap_write_lock(mm) as PMD-mapping is likely to be split
> -* later.
> +* got written to. These VMAs are likely not worth removing
> +* page tables from, as PMD-mapping is likely to be split 
> later.
>  *
> -* Note that vma->anon_vma check is racy: it can be set up 
> after
> -* the check but before we took mmap_lock by the fault path.
> -* But page lock would prevent establishing any new ptes of 
> the
> -* page, so we are safe.
> -*
> -* An alternative would be drop the check, but check that page
> -* table is clear before calling pmdp_collapse_flush() under
> -* ptl. It has higher chance to recover THP for the VMA, but
> -* has higher cost too. It would also probably require locking
> -* the anon_vma.
> +* Note that vma->anon_vma check is racy: it can be set after
> +* the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +* prevented establishing new ptes of the page. So we are safe
> +* to remove page table below, without even checking it's 
> empty.

This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right? What about private shmem VMAs that
are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
looking at the mapping and its pages (except for checking that the
insertion point is before end-of-file), protected only by mmap_lock
(shared) and pte_offset_map_lock().


>  */
> -   if (READ_ONCE(vma->anon_vma)) {
> -   result = SCAN_PAGE_ANON;
> -   goto next;
> -   }
> +   if (READ_ONCE(vma->anon_vma))
> +   continue;
> +
> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << 
> PAGE_SHIFT);
> if (addr & ~HPAGE_PMD_MASK ||
> -   vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -   result = SCAN_VMA_CHECK;
> -   goto next;
> -   }
> -   mm = vma->vm_mm;
> -   is_target = mm == target_mm && addr == target_addr;
> -   result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -   if (result != SCAN_SUCCEED)
> -   goto next;
> -   /*
> -* We need exclusive mmap_lock to retract page table.
> -*
> -* We use trylock due to lock inversion: we need to acquire
> -* mmap_lock while holding page lock. Fault path does it in
> -* reverse order. Trylock is a way to avoid deadlock.
> -*
> -* Also, it's not MADV_COLLAPSE's job to collapse other
> -* mappings - let khugepaged take care of them later.
> -*/
> -   result = SCAN_PTE_MAPPED_HUGEPAGE;
> -   if ((cc->is_khugepaged || is_target) &&
> -   mmap_write_trylock(mm)) {
> -   /* trylock for the same lock inversion as above */
> -   if (!vma_try_start_write(vma))
> -   goto unlock_next;
> -
> -   /*
> -* Re-check whether we have an ->anon_vma, because
> -* collapse_and_free_pmd() requires that either no
> -* ->anon_vma exists or the anon_vma is locked.
> -  

Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread Jann Horn
On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko  wrote:
> On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > +locking maintainers
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +   up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
>
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.
>
> > So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
> >
> > But I'm not entirely sure about that, the locking folks might know better.
>
> I am not a locking expert but to me it looks like this should work
> because the final cleanup would have to happen rcu_read_unlock.
>
> Thanks, I have completely missed this aspect of the locking when looking
> into the code.
>
> Btw. looking at this again I have fully realized how hard it is actually
> to see that vm_area_free is guaranteed to sync up with ongoing readers.
> vma manipulation functions like __adjust_vma make my head spin. Would it
> make more sense to have a rcu style synchronization point in
> vm_area_free directly before call_rcu? This would add an overhead of
> uncontended down_write of course.

Something along those lines might be a good idea, but I think that
rather than synchronizing the removal, it should maybe be something
that splats (and bails out?) if it detects pending readers. If we get
to vm_area_free() on a VMA that has pending readers, we might already
be in a lot of trouble because the concurrent readers might have been
traversing page tables while we were tearing them down or fun stuff
like that.

I think maybe Suren was already talking about something like that in
another part of this patch series but I don't remember...


Re: [PATCH 27/41] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration

2023-01-18 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Page fault handlers might need to fire MMU notifications while a new
> notifier is being registered. Modify mm_take_all_locks to write-lock all
> VMAs and prevent this race with fault handlers that would hold VMA locks.
> VMAs are locked before i_mmap_rwsem and anon_vma to keep the same
> locking order as in page fault handlers.
>
> Signed-off-by: Suren Baghdasaryan 
> ---
>  mm/mmap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 30c7d1c5206e..a256deca0bc0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3566,6 +3566,7 @@ static void vm_lock_mapping(struct mm_struct *mm, 
> struct address_space *mapping)
>   * of mm/rmap.c:
>   *   - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for
>   * hugetlb mapping);
> + *   - all vmas marked locked

The existing comment above says that this is an *ordered* listing of
which locks are taken.

>   *   - all i_mmap_rwsem locks;
>   *   - all anon_vma->rwseml
>   *
> @@ -3591,6 +3592,7 @@ int mm_take_all_locks(struct mm_struct *mm)
> mas_for_each(&mas, vma, ULONG_MAX) {
> if (signal_pending(current))
> goto out_unlock;
> +   vma_write_lock(vma);
> if (vma->vm_file && vma->vm_file->f_mapping &&
> is_vm_hugetlb_page(vma))
> vm_lock_mapping(mm, vma->vm_file->f_mapping);

Note that multiple VMAs can have the same ->f_mapping, so with this,
the lock ordering between VMA locks and the mapping locks of hugetlb
VMAs is mixed: If you have two adjacent hugetlb VMAs with the same
->f_mapping, then the following operations happen:

1. lock VMA 1
2. lock mapping of VMAs 1 and 2
3. lock VMA 2
4. [second vm_lock_mapping() is a no-op]

So for VMA 1, we ended up taking the VMA lock first, but for VMA 2, we
took the mapping lock first.

The existing code has one loop per lock type to ensure that the locks
really are taken in the specified order, even when some of the locks
are associated with multiple VMAs.

If we don't care about the ordering between these two, maybe that's
fine and you just have to adjust the comment; but it would be clearer
to add a separate loop for the VMA locks.
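
Something like this (untested sketch), so each lock type keeps its own pass:

        mas_for_each(&mas, vma, ULONG_MAX) {
                if (signal_pending(current))
                        goto out_unlock;
                vma_write_lock(vma);
        }

        mas_set(&mas, 0);
        mas_for_each(&mas, vma, ULONG_MAX) {
                if (signal_pending(current))
                        goto out_unlock;
                if (vma->vm_file && vma->vm_file->f_mapping &&
                                is_vm_hugetlb_page(vma))
                        vm_lock_mapping(mm, vma->vm_file->f_mapping);
        }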

> @@ -3677,6 +3679,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
> if (vma->vm_file && vma->vm_file->f_mapping)
> vm_unlock_mapping(vma->vm_file->f_mapping);
> }
> +   vma_write_unlock_mm(mm);
>
> mutex_unlock(&mm_all_locks_mutex);
>  }
> --
> 2.39.0
>


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-18 Thread Jann Horn
On Wed, Jan 18, 2023 at 10:40 AM Michal Hocko  wrote:
> On Tue 17-01-23 21:28:06, Jann Horn wrote:
> > On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> > > On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > > > Protect VMA from concurrent page fault handler while collapsing a huge
> > > > page. Page fault handler needs a stable PMD to use PTL and relies on
> > > > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > > > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > > > not be detected by a page fault handler without proper locking.
> > >
> > > I am struggling with this changelog. Maybe because my recollection of
> > > the THP collapsing subtleties is weak. But aren't you just trying to say
> > > that the current #PF handling and THP collapsing need to be mutually
> > > exclusive currently so in order to keep that assumption you have mark
> > > the vma write locked?
> > >
> > > Also it is not really clear to me how that handles other vmas which can
> > > share the same thp?
> >
> > It's not about the hugepage itself, it's about how the THP collapse
> > operation frees page tables.
> >
> > Before this series, page tables can be walked under any one of the
> > mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
> > unlinks and frees page tables, it must ensure that all of those either
> > are locked or don't exist. This series adds a fourth lock under which
> > page tables can be traversed, and so khugepaged must also lock out that one.
> >
> > There is a codepath in khugepaged that iterates through all mappings
> > of a file to zap page tables (retract_page_tables()), which locks each
> > visited mm with mmap_write_trylock() and now also does
> > vma_write_lock().
>
> OK, I see. This would be a great addendum to the changelog.
>
> > I think one aspect of this patch that might cause trouble later on, if
> > support for non-anonymous VMAs is added, is that retract_page_tables()
> > now does vma_write_lock() while holding the mapping lock; the page
> > fault handling path would probably take the locks the other way
> > around, leading to a deadlock? So the vma_write_lock() in
> > retract_page_tables() might have to become a trylock later on.
>
> This, right?
> #PF                           retract_page_tables
> vma_read_lock
>                               i_mmap_lock_write
> i_mmap_lock_read
>                               vma_write_lock
>
>
> I might be missing something but I have only found huge_pmd_share to be
> called from the #PF path. That one should be safe as it cannot be a
> target for THP. Not that it would matter much because such a dependency
> chain would be really subtle.

Oops, yeah. Now that I'm looking closer I also don't see a path from
the #PF path to i_mmap_lock_read. Sorry for sending you on a wild
goose chase.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 10:28 PM Suren Baghdasaryan  wrote:
> On Tue, Jan 17, 2023 at 10:03 AM Jann Horn  wrote:
> >
> > +locking maintainers
>
> Thanks! I'll CC the locking maintainers in the next posting.
>
> >
> > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > to be exclusively locked during VMA tree modifications, instead of the
> > > usual lock/unlock pattern we mark a VMA as locked by taking per-VMA lock
> > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > locked.
> > [...]
> > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > +{
> > > +   up_read(&vma->lock);
> > > +}
> >
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else. So if you want to protect against concurrent
> > deletion, this might have to be something like:
> >
> > rcu_read_lock(); /* keeps vma alive */
> > up_read(&vma->lock);
> > rcu_read_unlock();
>
> But for deleting VMA one would need to write-lock the vma->lock first,
> which I assume can't happen until this up_read() is complete. Is that
> assumption wrong?

__up_read() does:

rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
DEBUG_RWSEMS_WARN_ON(tmp < 0, sem);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
  RWSEM_FLAG_WAITERS)) {
  clear_nonspinnable(sem);
  rwsem_wake(sem);
}

The atomic_long_add_return_release() is the point where we are doing
the main lock-releasing.

So if a reader dropped the read-lock while someone else was waiting on
the lock (RWSEM_FLAG_WAITERS) and no other readers were holding the
lock together with it, the reader also does clear_nonspinnable() and
rwsem_wake() afterwards.
But in rwsem_down_write_slowpath(), after we've set
RWSEM_FLAG_WAITERS, we can return successfully immediately once
rwsem_try_write_lock() sees that there are no active readers or
writers anymore (if RWSEM_LOCK_MASK is unset and the cmpxchg
succeeds). We're not necessarily waiting for the "nonspinnable" bit or
the wake.

So yeah, I think down_write() can return successfully before up_read()
is done with its memory accesses.

(Spinlocks are different - the kernel relies on being able to drop
references via spin_unlock() in some places.)


Re: [PATCH 28/41] mm: introduce lock_vma_under_rcu to be used from arch-specific code

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Introduce lock_vma_under_rcu function to lookup and lock a VMA during
> page fault handling. When VMA is not found, can't be locked or changes
> after being locked, the function returns NULL. The lookup is performed
> under RCU protection to prevent the found VMA from being destroyed before
> the VMA lock is acquired. VMA lock statistics are updated according to
> the results.
> For now only anonymous VMAs can be searched this way. In other cases the
> function returns NULL.
[...]
> +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> +   MA_STATE(mas, &mm->mm_mt, address, address);
> +   struct vm_area_struct *vma, *validate;
> +
> +   rcu_read_lock();
> +   vma = mas_walk(&mas);
> +retry:
> +   if (!vma)
> +   goto inval;
> +
> +   /* Only anonymous vmas are supported for now */
> +   if (!vma_is_anonymous(vma))
> +   goto inval;
> +
> +   if (!vma_read_trylock(vma))
> +   goto inval;
> +
> +   /* Check since vm_start/vm_end might change before we lock the VMA */
> +   if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
> +   vma_read_unlock(vma);
> +   goto inval;
> +   }
> +
> +   /* Check if the VMA got isolated after we found it */
> +   mas.index = address;
> +   validate = mas_walk(&mas);

Question for Maple Tree experts:

Are you allowed to use mas_walk() like this? If the first mas_walk()
call encountered a single-entry tree, it would store mas->node =
MAS_ROOT, right? And then the second call would go into
mas_state_walk(), mas_start() would return NULL, mas_is_ptr() would be
true, and then mas_state_walk() would return the result of
mas_start(), which is NULL? And we'd end up with mas_walk() returning
NULL on the second run even though the tree hasn't changed?

> +   if (validate != vma) {
> +   vma_read_unlock(vma);
> +   count_vm_vma_lock_event(VMA_LOCK_MISS);
> +   /* The area was replaced with another one. */
> +   vma = validate;
> +   goto retry;
> +   }
> +
> +   rcu_read_unlock();
> +   return vma;
> +inval:
> +   rcu_read_unlock();
> +   count_vm_vma_lock_event(VMA_LOCK_ABORT);
> +   return NULL;
> +}


Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 8:51 PM Jann Horn  wrote:
> On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> > Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> > handling under VMA lock and retry holding mmap_lock. This can be handled
> > more gracefully in the future.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > Suggested-by: Peter Xu 
> > ---
> >  mm/memory.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 20806bc8b4eb..12508f4d845a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> > mm_struct *mm,
> > if (!vma->anon_vma)
> > goto inval;
> >
> > +   /*
> > +* Due to the possibility of userfault handler dropping mmap_lock, 
> > avoid
> > +* it for now and fall back to page fault handling under mmap_lock.
> > +*/
> > +   if (userfaultfd_armed(vma))
> > +   goto inval;
>
> This looks racy wrt concurrent userfaultfd_register(). I think you'll
> want to do the userfaultfd_armed(vma) check _after_ locking the VMA,

I still think this change is needed...

> and ensure that the userfaultfd code write-locks the VMA before
> changing the __VM_UFFD_FLAGS in vma->vm_flags.

Ah, but now I see you already took care of this half of the issue with
the reset_vm_flags() change in
https://lore.kernel.org/linux-mm/20230109205336.3665937-16-sur...@google.com/
.


> > if (!vma_read_trylock(vma))
> > goto inval;
> >
> > --
> > 2.39.0
> >


Re: [PATCH 18/41] mm/khugepaged: write-lock VMA while collapsing a huge page

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 4:25 PM Michal Hocko  wrote:
> On Mon 09-01-23 12:53:13, Suren Baghdasaryan wrote:
> > Protect VMA from concurrent page fault handler while collapsing a huge
> > page. Page fault handler needs a stable PMD to use PTL and relies on
> > per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(),
> > set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will
> > not be detected by a page fault handler without proper locking.
>
> I am struggling with this changelog. Maybe because my recollection of
> the THP collapsing subtleties is weak. But aren't you just trying to say
> that the current #PF handling and THP collapsing need to be mutually
> exclusive currently so in order to keep that assumption you have mark
> the vma write locked?
>
> Also it is not really clear to me how that handles other vmas which can
> share the same thp?

It's not about the hugepage itself, it's about how the THP collapse
operation frees page tables.

Before this series, page tables can be walked under any one of the
mmap lock, the mapping lock, and the anon_vma lock; so when khugepaged
unlinks and frees page tables, it must ensure that all of those either
are locked or don't exist. This series adds a fourth lock under which
page tables can be traversed, and so khugepaged must also lock out that one.

There is a codepath in khugepaged that iterates through all mappings
of a file to zap page tables (retract_page_tables()), which locks each
visited mm with mmap_write_trylock() and now also does
vma_write_lock().


I think one aspect of this patch that might cause trouble later on, if
support for non-anonymous VMAs is added, is that retract_page_tables()
now does vma_write_lock() while holding the mapping lock; the page
fault handling path would probably take the locks the other way
around, leading to a deadlock? So the vma_write_lock() in
retract_page_tables() might have to become a trylock later on.

Related: Please add the new VMA lock to the big lock ordering comments
at the top of mm/rmap.c. (And maybe later mm/filemap.c, if/when you
add file VMA support.)


Re: [PATCH 32/41] mm: prevent userfaults to be handled under per-vma lock

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> Due to the possibility of handle_userfault dropping mmap_lock, avoid fault
> handling under VMA lock and retry holding mmap_lock. This can be handled
> more gracefully in the future.
>
> Signed-off-by: Suren Baghdasaryan 
> Suggested-by: Peter Xu 
> ---
>  mm/memory.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 20806bc8b4eb..12508f4d845a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5273,6 +5273,13 @@ struct vm_area_struct *lock_vma_under_rcu(struct 
> mm_struct *mm,
> if (!vma->anon_vma)
> goto inval;
>
> +   /*
> +* Due to the possibility of userfault handler dropping mmap_lock, 
> avoid
> +* it for now and fall back to page fault handling under mmap_lock.
> +*/
> +   if (userfaultfd_armed(vma))
> +   goto inval;

This looks racy wrt concurrent userfaultfd_register(). I think you'll
want to do the userfaultfd_armed(vma) check _after_ locking the VMA,
and ensure that the userfaultfd code write-locks the VMA before
changing the __VM_UFFD_FLAGS in vma->vm_flags.
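
Concretely, something like this (untested sketch on top of this patch):

        if (!vma_read_trylock(vma))
                goto inval;

        /*
         * Re-check under the per-VMA lock: if userfaultfd_register()
         * write-locks the VMA before setting __VM_UFFD_FLAGS, a racing
         * registration is either visible here or blocked on our lock.
         */
        if (unlikely(userfaultfd_armed(vma))) {
                vma_read_unlock(vma);
                goto inval;
        }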

> if (!vma_read_trylock(vma))
> goto inval;
>
> --
> 2.39.0
>


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 7:55 PM Suren Baghdasaryan  wrote:
> On Tue, Jan 17, 2023 at 10:47 AM Matthew Wilcox  wrote:
> >
> > On Tue, Jan 17, 2023 at 10:36:42AM -0800, Suren Baghdasaryan wrote:
> > > On Tue, Jan 17, 2023 at 10:31 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > > > > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > > > > >
> > > > > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan 
> > > > > >  wrote:
> > > > > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > > > > considerable space for each vm_area_struct. However vma_lock has
> > > > > > > two important specifics which can be used to replace rw_semaphore
> > > > > > > with a simpler structure:
> > > > > > [...]
> > > > > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > > >  {
> > > > > > > -   up_read(&vma->vm_lock->lock);
> > > > > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > > > > >  }
> > > > > >
> > > > > > I haven't properly reviewed this, but this bit looks like a
> > > > > > use-after-free because you're accessing the vma after dropping your
> > > > > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > > > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > > > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > > > > read-side critical section if the VMA is freed with RCU delay.
> > > > >
> > > > > vm_lock->count does not control the lifetime of the VMA, it's a
> > > > > counter of how many readers took the lock or it's negative if the lock
> > > > > is write-locked.
> > > >
> > > > Yes, but ...
> > > >
> > > > Task A:
> > > > atomic_dec_and_test(&vma->vm_lock->count)
> > > > Task B:
> > > > munmap()
> > > > write lock
> > > > free VMA
> > > > synchronize_rcu()
> > > > VMA is really freed
> > > > wake_up(&vma->vm_mm->vma_writer_wait);
> > > >
> > > > ... vma is freed.
> > > >
> > > > Now, I think this doesn't occur.  I'm pretty sure that every caller of
> > > > vma_read_unlock() is holding the RCU read lock.  But maybe we should
> > > > have that assertion?
> > >
> > > Yep, that's what this patch is doing
> > > https://lore.kernel.org/all/20230109205336.3665937-27-sur...@google.com/
> > > by calling vma_assert_no_reader() from __vm_area_free().
> >
> > That's not enough though.  Task A still has a pointer to vma after it
> > has called atomic_dec_and_test(), even after vma has been freed by
> > Task B, and before Task A dereferences vma->vm_mm.
>
> Ah, I see your point now. I guess I'll have to store vma->vm_mm in a
> local variable and call mmgrab() before atomic_dec_and_test(), then
> use it in wake_up() and call mmdrop(). Is that what you are thinking?

You shouldn't need mmgrab()/mmdrop(), because whoever is calling you
for page fault handling must be keeping the mm_struct alive.
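
So a sketch of what I'd expect (untested):

static inline void vma_read_unlock(struct vm_area_struct *vma)
{
        /*
         * The fault path keeps the mm_struct alive, so this pointer stays
         * valid even after we drop our reader count below.
         */
        struct mm_struct *mm = vma->vm_mm;

        if (atomic_dec_and_test(&vma->vm_lock->count))
                wake_up(&mm->vma_writer_wait);
}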


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Tue, Jan 17, 2023 at 7:31 PM Matthew Wilcox  wrote:
>
> On Tue, Jan 17, 2023 at 10:26:32AM -0800, Suren Baghdasaryan wrote:
> > On Tue, Jan 17, 2023 at 10:12 AM Jann Horn  wrote:
> > >
> > > On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan wrote:
> > > > rw_semaphore is a sizable structure of 40 bytes and consumes
> > > > considerable space for each vm_area_struct. However vma_lock has
> > > > two important specifics which can be used to replace rw_semaphore
> > > > with a simpler structure:
> > > [...]
> > > >  static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > >  {
> > > > -   up_read(&vma->vm_lock->lock);
> > > > +   if (atomic_dec_and_test(&vma->vm_lock->count))
> > > > +   wake_up(&vma->vm_mm->vma_writer_wait);
> > > >  }
> > >
> > > I haven't properly reviewed this, but this bit looks like a
> > > use-after-free because you're accessing the vma after dropping your
> > > reference on it. You'd have to first look up the vma->vm_mm, then do
> > > the atomic_dec_and_test(), and afterwards do the wake_up() without
> > > touching the vma. Or alternatively wrap the whole thing in an RCU
> > > read-side critical section if the VMA is freed with RCU delay.
> >
> > vm_lock->count does not control the lifetime of the VMA, it's a
> > counter of how many readers took the lock or it's negative if the lock
> > is write-locked.
>
> Yes, but ...
>
> Task A:
> atomic_dec_and_test(&vma->vm_lock->count)
> Task B:
> munmap()
> write lock
> free VMA
> synchronize_rcu()
> VMA is really freed
> wake_up(&vma->vm_mm->vma_writer_wait);
>
> ... vma is freed.
>
> Now, I think this doesn't occur.  I'm pretty sure that every caller of
> vma_read_unlock() is holding the RCU read lock.  But maybe we should
> have that assertion?

I don't see that. When do_user_addr_fault() is calling
vma_read_unlock(), there's no RCU read lock held, right?
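
As far as I can tell, the arch fault path in this series ends up looking
roughly like this (paraphrased from memory, details may be off):

        /* in do_user_addr_fault(), per-VMA-lock path */
        vma = lock_vma_under_rcu(mm, address);  /* takes and drops the RCU read lock internally */
        if (vma) {
                fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
                vma_read_unlock(vma);           /* <-- no rcu_read_lock() held around this */
        }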


Re: [PATCH 40/41] mm: separate vma->lock from vm_area_struct

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> vma->lock being part of the vm_area_struct causes performance regression
> during page faults because during contention its count and owner fields
> are constantly updated and having other parts of vm_area_struct used
> during page fault handling next to them causes constant cache line
> bouncing. Fix that by moving the lock outside of the vm_area_struct.
> All attempts to keep vma->lock inside vm_area_struct in a separate
> cache line still produce performance regression especially on NUMA
> machines. Smallest regression was achieved when lock is placed in the
> fourth cache line but that bloats vm_area_struct to 256 bytes.

Just checking: When you tested putting the lock in different cache
lines, did you force the slab allocator to actually store the
vm_area_struct with cacheline alignment (by setting SLAB_HWCACHE_ALIGN
on the slab or with a cacheline_aligned_in_smp on the struct
definition)?
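
Concretely, I mean something like one of these (sketch only):

        /* option 1: make the slab cache align objects to cache lines */
        vm_area_cachep = kmem_cache_create("vm_area_struct",
                        sizeof(struct vm_area_struct), 0,
                        SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

or

        /* option 2: force alignment through the type itself */
        struct vm_area_struct {
                ...
        } ____cacheline_aligned_in_smp;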


Re: [PATCH 41/41] mm: replace rw_semaphore with atomic_t in vma_lock

2023-01-17 Thread Jann Horn
On Mon, Jan 9, 2023 at 9:55 PM Suren Baghdasaryan  wrote:
> rw_semaphore is a sizable structure of 40 bytes and consumes
> considerable space for each vm_area_struct. However vma_lock has
> two important specifics which can be used to replace rw_semaphore
> with a simpler structure:
[...]
>  static inline void vma_read_unlock(struct vm_area_struct *vma)
>  {
> -   up_read(&vma->vm_lock->lock);
> +   if (atomic_dec_and_test(&vma->vm_lock->count))
> +   wake_up(&vma->vm_mm->vma_writer_wait);
>  }

I haven't properly reviewed this, but this bit looks like a
use-after-free because you're accessing the vma after dropping your
reference on it. You'd have to first look up the vma->vm_mm, then do
the atomic_dec_and_test(), and afterwards do the wake_up() without
touching the vma. Or alternatively wrap the whole thing in an RCU
read-side critical section if the VMA is freed with RCU delay.


Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-17 Thread Jann Horn
+locking maintainers

On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan  wrote:
> Introduce a per-VMA rw_semaphore to be used during page fault handling
> instead of mmap_lock. Because there are cases when multiple VMAs need
> to be exclusively locked during VMA tree modifications, instead of the
> usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> mmap_write_lock holder is done with all modifications and drops mmap_lock,
> it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> locked.
[...]
> +static inline void vma_read_unlock(struct vm_area_struct *vma)
> +{
> +   up_read(&vma->lock);
> +}

One thing that might be gnarly here is that I think you might not be
allowed to use up_read() to fully release ownership of an object -
from what I remember, I think that up_read() (unlike something like
spin_unlock()) can access the lock object after it's already been
acquired by someone else. So if you want to protect against concurrent
deletion, this might have to be something like:

rcu_read_lock(); /* keeps vma alive */
up_read(&vma->lock);
rcu_read_unlock();

But I'm not entirely sure about that, the locking folks might know better.

Also, it might not matter given that the rw_semaphore part is removed
in the current patch 41/41 anyway...


Re: [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.

2021-06-11 Thread Jann Horn
On Thu, Jun 10, 2021 at 10:35 AM Aneesh Kumar K.V wrote:
> To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> The lock was taken to ensure that rmap walk don't miss a page table entry due to
> PTE moves via move_pagetables(). The kernel does further optimization of
> this lock such that if we are going to find the newly added vma after the
> old vma, the rmap lock is not taken. This is because rmap walk would find the
> vmas in the same order and if we don't find the page table attached to
> older vma we would find it with the new vma which we would iterate later.
[...]
> Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")

probably also "Cc: sta...@vger.kernel.org"?


[PATCH] mmap.2: describe the 5level paging hack

2019-02-11 Thread Jann Horn
The manpage is missing information about the compatibility hack for
5-level paging that went in in 4.14, around commit ee00f4a32a76 ("x86/mm:
Allow userspace have mappings above 47-bit"). Add some information about
that.

While I don't think any hardware supporting this is shipping yet (?), I
think it's useful to try to write a manpage for this API, partly to
figure out how usable that API actually is, and partly because when this
hardware does ship, it'd be nice if distro manpages had information about
how to use it.

Signed-off-by: Jann Horn 
---
This patch goes on top of the patch "[PATCH] mmap.2: fix description of
treatment of the hint" that I just sent, but I'm not sending them in a
series because I want the first one to go in, and I think this one might
be a bit more controversial.

It would be nice if the architecture maintainers and mm folks could have
a look at this and check that what I wrote is right - I only looked at
the source for this, I haven't tried it.
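
For what it's worth, I'd expect usage to look roughly like this (untested
sketch, following the "hint beyond the end of the address space"
recommendation in the text; my assumption is that the exact hint value
doesn't matter as long as it is page-aligned and above the legacy limit):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
        /* page-aligned hint far beyond the end of userspace; opts in to
         * mappings above the legacy limit without pinning a fixed address */
        void *hint = (void *)(1UL << 63);
        void *p = mmap(hint, 0x1000, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("mapped at %p\n", p);
        return 0;
}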

 man2/mmap.2 | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 8556bbfeb..977782fa8 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -67,6 +67,8 @@ is NULL,
 then the kernel chooses the (page-aligned) address
 at which to create the mapping;
 this is the most portable method of creating a new mapping.
+On Linux, in this case, the kernel may limit the maximum address that can be
+used for allocations to a legacy limit for compatibility reasons.
 If
 .I addr
 is not NULL,
@@ -77,6 +79,19 @@ or equal to the value specified by
 and attempt to create the mapping there.
 If another mapping already exists there, the kernel picks a new
 address, independent of the hint.
+However, if a hint above the architecture's legacy address limit is provided
+(on x86-64: above 0x7ffffffff000, on arm64: above 0x1000000000000, on ppc64 with
+book3s: above 0x7fffffffffff or 0x3fffffffffff, depending on page size), the
+kernel is permitted to allocate mappings beyond the architecture's legacy
+address limit. The availability of such addresses is hardware-dependent.
+Therefore, if you want to be able to use the full virtual address space of
+hardware that supports addresses beyond the legacy range, you need to specify an
+address above that limit; however, for security reasons, you should avoid
+specifying a fixed valid address outside the compatibility range,
+since that would reduce the value of userspace address space layout
+randomization. Therefore, it is recommended to specify an address
+.I beyond
+the end of the userspace address space.
 .\" Before Linux 2.6.24, the address was rounded up to the next page
 .\" boundary; since 2.6.24, it is rounded down!
 The address of the new mapping is returned as the result of the call.
-- 
2.20.1.791.gb4d0f1c61a-goog



Re: [PATCH] powerpc: Don't print kernel instructions in show_user_instructions()

2018-10-05 Thread Jann Horn
On Fri, Oct 5, 2018 at 3:21 PM Michael Ellerman  wrote:
> Recently we implemented show_user_instructions() which dumps the code
> around the NIP when a user space process dies with an unhandled
> signal. This was modelled on the x86 code, and we even went so far as
> to implement the exact same bug, namely that if the user process
> crashed with its NIP pointing into the kernel we will dump kernel text
> to dmesg. eg:
>
>   bad-bctr[2996]: segfault (11) at c001 nip c001 lr 12d0b0894 code 1
>   bad-bctr[2996]: code: fbe10068 7cbe2b78 7c7f1b78 fb610048 38a10028 38810020 fb810050 7f8802a6
>   bad-bctr[2996]: code: 3860001c f8010080 48242371 6000 <7c7b1b79> 4082002c e8010080 eb610048
>
> This was discovered on x86 by Jann Horn and fixed in commit
> 342db04ae712 ("x86/dumpstack: Don't dump kernel memory based on usermode RIP").
>
> Fix it by checking the adjusted NIP value (pc) and number of
> instructions against USER_DS, and bail if we fail the check, eg:

This fix looks good to me.

In the long term, I think it is somewhat awkward to use
probe_kernel_address(), which uses set_fs(KERNEL_DS), when you
actually just want to access userspace memory. It might make sense to
provide a better helper for explicitly accessing memory with USER_DS.
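
Something like this is what I have in mind - a rough sketch, the name and
signature are made up:

long probe_user_address(const void __user *src, void *dst, size_t size)
{
        long ret;

        /* explicit user-pointer check instead of switching to KERNEL_DS */
        if (!access_ok(VERIFY_READ, src, size))
                return -EFAULT;

        pagefault_disable();
        ret = __copy_from_user_inatomic(dst, src, size);
        pagefault_enable();

        return ret ? -EFAULT : 0;
}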