Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
On Sat, Jan 09, 2021 at 05:37:09PM -0800, Linus Torvalds wrote:
> On Sat, Jan 9, 2021 at 5:19 PM Linus Torvalds wrote:
> >
> > And no, I didn't make the UFFDIO_WRITEPROTECT code take the mmap_sem
> > for writing. For whoever wants to look at that, it's
> > mwriteprotect_range() in mm/userfaultfd.c and the fix is literally to
> > turn the read-lock (and unlock) into a write-lock (and unlock).
>
> Oh, and if it wasn't obvious, we'll have to debate what to do with
> trying to mprotect() a pinned page. Do we just ignore the pinned page
> (the way my clear_refs patch did)? Or do we turn it into -EBUSY? Or
> what?

Agreed, I assume mprotect would have the same effect. mprotect in
parallel of a read or recvmsg may be undefined, so I didn't bring it
up, but it was pretty clear. The moment the write bit is cleared (no
matter why or by whom) and the PG lock released, if there's any GUP
pin, GUP currently loses synchrony.

In any case I intended to help exercise the new page_count logic with
the testcase, possibly to make it behave better somehow, no matter
how. I admit I'm also wondering myself about the exact semantics of
O_DIRECT under clear_refs or uffd-wp tracking, but the point is that
losing reads and getting unexpected data in the page still doesn't
look like good behavior, and it had to be at least checked.

To me, ultimately, the useful use case that becomes impossible with
page_count isn't even clear_refs nor uffd-wp. The useful case that I
can see zero fundamental flaws in is RDMA or some other device
computing in pure readonly DMA on the data while a program runs
normally and produces it. It could even be a framebuffer that doesn't
care about coherency. You may want to occasionally wrprotect the
memory under a readonly long term GUP pin for consistency, even
against bugs of the program itself. Why should wrprotecting make the
device lose synchrony? And what performance do we gain for the normal
useful cases by breaking the special case? Is there a benchmark
showing it?
> So it's not *just* the locking that needs to be fixed. But just take a
> look at that suggested clear_refs patch of mine - it sure isn't
> complicated.

If we can skip the wrprotection it's fairly easy, I fully agree. Even
then it still looks more difficult than using page_mapcount in
do_wp_page in my view, so I also don't see the simplification. And
overall the amount of kernel code had a net increase as a result.

Thanks,
Andrea
Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse
Hello,

On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described

I updated the above partly quoted sentence since in the previous
version it didn't have full accuracy:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a

"However since a single transient GUP pin on a tail page would elevate
the page_count for all other tail pages (unlike the mapcount which is
subpage granular), the COW page reuse inaccuracy would then cross
different vmas and the effect would happen at a distance in vmas of
different processes. A single GUP pin taken on a subpage mapped in a
different process could trigger 511 false positive COW copies in the
local process, after a fork()."

This is a best effort to try to document all side effects, but it'd
be great to hear from Kirill too on the above detail to have
confirmation.

Thanks and have a great weekend everyone,
Andrea
Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> +	struct page *page;
> +
> +	if (!is_cow_mapping(vma->vm_flags))
> +		return false;
> +	if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> +		return false;
> +	page = vm_normal_page(vma, addr, pte);
> +	if (!page)
> +		return false;
> +	if (page_mapcount(page) != 1)
> +		return false;
> +	return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from
09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
page_mapcount above as an optimization, to me it looks much simpler to
check it in a single place, in do_wp_page, where in addition to
optimizing away the superfluous copy, it would optimize away the above
complexity as well. And I won't comment on whether it's actually safe
to skip random pages or not. All I know is that for mprotect and
uffd-wp, the above approach definitely wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned  450 include/linux/mm_types.h  * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c             atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c                  atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c                  atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c          atomic_read(&src_mm->has_pinned) &&
has_pinned  819 mm/memory.c               if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32 bits in mm_struct as an atomic_t just to call
atomic_set(1) on it? Why isn't it an MMF_HAS_PINNED that can already
be set atomically under mmap_read_lock too? There are bits left free
there; we didn't run out yet to justify wasting another 31 bits. I
hope I'm overlooking something.
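To illustrate the alternative, here is a minimal userspace sketch of what a flag bit in an existing flags word buys over a dedicated atomic_t. The names `mm_sketch`, `mm_set_has_pinned` and `mm_has_pinned` are made up for illustration, not kernel API; the kernel would use set_bit()/test_bit() on mm->flags:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number, mirroring how MMF_* flags pack into the
 * already-existing mm->flags word instead of a dedicated atomic_t. */
#define MMF_HAS_PINNED 28

struct mm_sketch {
	atomic_ulong flags;	/* stands in for mm->flags */
};

/* set_bit()-style atomic set: safe to call concurrently under
 * mmap_read_lock because fetch_or is atomic against other setters. */
static void mm_set_has_pinned(struct mm_sketch *mm)
{
	atomic_fetch_or(&mm->flags, 1UL << MMF_HAS_PINNED);
}

static bool mm_has_pinned(struct mm_sketch *mm)
{
	return atomic_load(&mm->flags) & (1UL << MMF_HAS_PINNED);
}
```

The point of the sketch is that the set-once semantics of has_pinned need only one bit, and an atomic OR into the shared flags word gives the same "set atomically, read racily" behavior as the dedicated atomic_t.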
The existence of FOLL_LONGTERM is good and makes a difference at times
for writeback if it's on a MAP_SHARED, or it makes a difference during
GUP to do a page migration before taking the pin, but for the whole
rest of the VM it's irrelevant whether a pin is long or short term, so
I'm also concerned by what Jason mentioned about long term pins being
treated differently within the VM. That to me looks as fundamentally
flawed as introducing inaccuracies in do_wp_page, which, even ignoring
the performance implications caused by the inaccuracy, simply prevents
doing some useful things.

I obviously agree that all common workloads with no GUP pins and no
clear_refs and no uffd are way more important to keep optimal, but I
haven't seen a single benchmark showing the benefit of not taking the
PG_lock before a page copy, or any other runtime benefit coming from
page_count in do_wp_page. To the contrary, now I see additional
branches in fork and in various other paths.

The only thing where I see page_count providing a tangible benefit is
the vmsplice attack from child to parent, but that has not been fully
fixed, and if page_count is added to fix it in all COW faults, it'll
introduce extra inefficiency to the very common important workloads,
not only to the special GUP/clear_refs/uffd-wp workloads as your patch
above shows.

Thanks,
Andrea
[PATCH 0/1] mm: restore full accuracy in COW page reuse
Hello Andrew and everyone,

Once we agree that COW page reuse requires full accuracy, the next
step is to re-apply 17839856fd588f4ab6b789f482ed3ffd7c403e1f and to
return going in that direction.

Who is going to orthogonally secure vmsplice, Andy, Jason, Jens? Once
vmsplice is secured from taking unconstrained unprivileged long term
GUP pins, the attack from child to parent can still happen in theory,
but statistically speaking, once that huge window is closed, it won't
be a practical concern, so it'll give us time to perfect the full
solution by closing all windows in the VM core. vmsplice has to be
orthogonally fixed anyway, even if all windows were closed in the VM
core first.

Unfortunately it's still not clear exactly what failed with
17839856fd588f4ab6b789f482ed3ffd7c403e1f, but the whole point is that
we need to discuss that together.

Thanks,
Andrea

// SPDX-License-Identifier: GPL-3.0-or-later
/*
 * reproducer for v5.11 O_DIRECT mm corruption with page_count
 * instead of mapcount in do_wp_page.
 *
 * Copyright (C) 2021 Red Hat, Inc.
 *
 * gcc -O2 -o page_count_do_wp_page page_count_do_wp_page.c -lpthread
 * page_count_do_wp_page ./whateverfile
 *
 * NOTE: CONFIG_SOFT_DIRTY=y is required in the kernel config.
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>

#define PAGE_SIZE (1UL<<12)
#define HARDBLKSIZE 512

static void *writer(void *_mem)
{
	char *mem = (char *)_mem;
	for (;;) {
		usleep(random() % 1000);
		mem[PAGE_SIZE-1] = 0;
	}
}

static void *background_soft_dirty(void *data)
{
	long fd = (long) data;
	for (;;)
		if (write(fd, "4", 1) != 1)
			perror("write soft dirty"), exit(1);
}

int main(int argc, char *argv[])
{
	if (argc < 2)
		printf("%s <file>\n", argv[0]), exit(1);

	char path[PAGE_SIZE];
	strcpy(path, "/proc/");
	sprintf(path + strlen(path), "%d", getpid());
	strcat(path, "/clear_refs");
	long soft_dirty_fd = open(path, O_WRONLY);
	if (soft_dirty_fd < 0)
		perror("open clear_refs"), exit(1);

	char *mem;
	if (posix_memalign((void **)&mem, PAGE_SIZE, PAGE_SIZE*3))
		perror("posix_memalign"), exit(1);
	/* THP is not using page_count so it would not corrupt memory */
	if (madvise(mem, PAGE_SIZE, MADV_NOHUGEPAGE))
		perror("madvise"), exit(1);
	bzero(mem, PAGE_SIZE * 3);
	memset(mem + PAGE_SIZE * 2, 0xff, HARDBLKSIZE);

	/*
	 * This is not specific to O_DIRECT. Even if O_DIRECT was
	 * forced to use PAGE_SIZE minimum granularity for reads, a
	 * recvmsg would create the same issue since it also uses
	 * iov_iter_get_pages internally to create transient GUP pins
	 * on anon memory.
	 */
	int fd = open(argv[1], O_DIRECT|O_CREAT|O_RDWR|O_TRUNC, 0600);
	if (fd < 0)
		perror("open"), exit(1);
	if (write(fd, mem, PAGE_SIZE) != PAGE_SIZE)
		perror("write"), exit(1);

	pthread_t soft_dirty;
	if (pthread_create(&soft_dirty, NULL, background_soft_dirty,
			   (void *)soft_dirty_fd))
		perror("soft_dirty"), exit(1);
	pthread_t thread;
	if (pthread_create(&thread, NULL, writer, mem))
		perror("pthread_create"), exit(1);

	bool skip_memset = true;
	while (1) {
		if (pread(fd, mem, HARDBLKSIZE, 0) != HARDBLKSIZE)
			perror("read"), exit(1);
		if (memcmp(mem, mem+PAGE_SIZE, HARDBLKSIZE)) {
			if (memcmp(mem, mem+PAGE_SIZE*2, PAGE_SIZE)) {
				if (skip_memset)
					printf("unexpected memory "
					       "corruption detected\n");
				else
					printf("memory corruption detected, "
					       "dumping page\n");
				int end = PAGE_SIZE;
				if (!memcmp(mem+HARDBLKSIZE, mem+PAGE_SIZE,
					    PAGE_SIZE-HARDBLKSIZE))
					end = HARDBLKSIZE;
				for (int i = 0; i < end; i++)
					printf("%x", mem[i]);
				printf("\n");
			} else
				printf("memory corruption detected\n");
		}
		skip_memset = !skip_memset;
		if
[PATCH 1/1] mm: restore full accuracy in COW page reuse
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.

The example below uses O_DIRECT, on a process under clear_refs.
There's no FOLL_LONGTERM involvement. All GUP pins are transient. fork
is never called, and even when clear_refs is cleared by an external
program, fork() would not be involved.

  thread0                    thread1           other process (or thread 3)
  -------                    -------           ---------------------------
  read syscall
  O_DIRECT read DMA to
  vaddr+0, len = 512 bytes
  GUP(FOLL_WRITE)
  DMA writing to RAM started
                                               clear_refs
                                               pte_wrprotect
                             write vaddrA+512
                             page_count == 2
                             wp_copy_page
  read syscall returns
  read lost

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE
granularity is an ABI break by itself, this is of course not just
about O_DIRECT. recvmsg kernel TLS and plenty of other GUP FOLL_WRITE
iov_iter_get_pages users would write to the memory with sub-PAGE_SIZE
granularity, depending on the msghdr iov tiny buffers. Those recvmsg
reads are already silently lost too, just as the above O_DIRECT read
gets lost.

The fact COW must not happen too much is documented in commit
6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide full accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

Quote from the Link tagged:

==
In other words, the page_count check in do_wp_page extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
==

In addition FOLL_LONGTERM breaks for readonly DMA, despite
FOLL_LONGTERM pages now being treated specially and copied in fork().
FOLL_LONGTERM is in no way special with regard to the VM core and it
can't be.
From the VM standpoint all GUP pins are equal, and breaking one means
breaking them all the same. It is not possible to resolve this by
looking only at FOLL_LONGTERM: that only hides the problem, but it
doesn't fully resolve it, as shown above.

In addition this avoids storms of extra false positive COWs if the THP
are shared by different processes, which causes extra overhead, but
the several performance considerations are only a secondary concern.

After this commit, the copy in fork() for FOLL_LONGTERM also must be
extended to all GUP pins including transient pins, or we shouldn't
even copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any
different from a transient GUP pin is a sign that something is
fundamentally flawed. FOLL_LONGTERM can be treated differently in
writeback and during GUP runtime itself, but not in the VM core when
deciding when to COW or not to COW an anonymous page.

This revert causes no theoretical security regression because THP is
not yet using page_count in do_huge_pmd_wp_page. The alternative to
this patch would be to replace the mapcount with the page_count in
do_huge_pmd_wp_page too, in order to really close the vmsplice attack
from child to parent that way. However since a single transient GUP
pin on a tail page would elevate the page_count for all other tail
pages (unlike the mapcount which is subpage granular), the above
"fork-less" race would span over a HPAGE_PMD_SIZE range. It would even
cross different vmas and the effect would happen at a distance in vmas
of different processes, allowing a child to corrupt memory in the
parent. That's a problem that could happen not-maliciously too. So the
scenario described above, if extended to the THP new refcounting,
would be more concerning than the vmsplice issue, which can be tackled
first in vmsplice itself, and only at a second stage in the COW code.
Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com
Cc: sta...@kernel.org
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Andrea Arcangeli
---
 include/linux/ksm.h |  7 ++
 mm/ksm.c            | 25 +++
 mm/memory.c         | 59 -
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..e48b1e453ff5 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page,

 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
Hello Jason,

On Fri, Jan 08, 2021 at 08:42:55PM -0400, Jason Gunthorpe wrote:
> There is already a patch series floating about to do exactly that for
> FOLL_LONGTERM pins based on the existing code in GUP for CMA migration

Sounds great.

> The ship sailed on this a decade ago, it is completely infeasible to
> go back now, it would completely break widely used things like GPU,
> RDMA and more.

For all those that aren't using the mmu notifier and that rely solely
on page pins, they still require privilege, except they get it through
/dev/ permissions. Just the fact there's no capability check in the
read/write/ioctl doesn't mean those device inodes can be opened by any
luser: the fact the kernel allows it doesn't mean the /dev/ permission
does too. The same applies to /dev/kvm too, not just PCI device
drivers.

Device drivers that you need to open in /dev/ before you can take a
GUP pin require wholly different checks than syscalls like vmsplice
and io_uring that are universally available. The very same GUP long
term pinning kernel code can be perfectly safe to use without any
permission check for a device driver of an iommu in /dev/, but
completely unsafe for a syscall.

> If we want to have a high speed copy_from_user like thing that is not
> based on page pins but on mmu notifiers, then we should make that
> infrastructure and the various places that need it should use common
> code. At least vhost and io_uring are good candidates.

Actually the mmu notifier doesn't strictly require pins, it only
requires GUP. All users tend to use FOLL_GET just as a safety
precaution (I already tried to optimize away the two atomics per GUP,
but we were NAKed by the KVM maintainer who didn't want to take the
risk; I would have, but it's a fair point indeed, obviously it's safer
with the pin plus the mmu notifier, two is safer than one).

I'm not sure how any copy-user could obviate a secondary MMU mapping;
mappings and copies are mutually exclusive.
Any copy would be breaking memory coherency in this environment.

> Otherwise, we are pretending that they are DMA and using the DMA
> centric pin_user_pages() interface, which we still have to support and
> make work.

vhost and io_uring would be pure software constructs, but there are
hardware users of the GUP pin that don't use any DMA. The long term
GUP pin is not only about PCI devices doing DMA. KVM is not ever using
any DMA, despite it takes terabytes worth of very long term GUP pins.

> > In any case, the extra flags required in FOLL_LONGTERM should be
> > implied by FOLL_LONGTERM itself, once it enters the gup code, because
> > it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0),
> > let alone having to specify FOLL_FORCE for just a read. But this is
> > going offtopic.
>
> We really should revise this, I've been thinking for a while we need
> to internalize into gup.c the FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM
> idiom at least..

100% agreed.

> > > simply because it is using the CPU to memory copy as its "DMA".
> >
> > vmsplice can't find all put_pages that may release the pages when the
> > pipe is read, or it'd at least be able to do the unreliable
> > RLIMIT_MEMLOCK accounting.
>
> Yikes! So it can't even use pin_user_pages FOLL_LONGTERM properly
> because that requires unpin_user_pages(), which means finding all the
> unpin sites too :\

Exactly.

> > To make another example a single unprivileged pin on the movable zone,
> > can break memhotunplug unless you use the mmu notifier. Every other
> > advanced feature falls apart.
>
> As above FOLL_LONGTERM will someday migrate from movable zones.

Something like:

1) migrate from movable zones contextually to GUP

2) be accounted on the compound_order, not on the number of GUPs
   (io_uring needs fixing here)

3) maybe account not only in the rlimit, but also expose the total
   worth of GUP pins in page_order units (not pins) to the OOM killer
   to be added to the rss (will double count though).
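The accounting gap in point 2 is easy to see with a little arithmetic sketch (the helper names below are made up for illustration, not kernel or io_uring API): charging RLIMIT_MEMLOCK per pin counts a 2M or 1G compound page the same as a 4k one, while a compound_order-based charge reflects the real unreclaimable footprint.

```c
#include <stddef.h>

#define PAGE_SHIFT 12

/* Naive per-pin accounting: every GUP reference is charged as one
 * base page, regardless of what actually backs the pin. */
static size_t charge_per_pin(size_t npins)
{
	return npins << PAGE_SHIFT;
}

/* Compound-aware accounting: charge the full size of the compound
 * page backing the pin (order 0 = 4k, order 9 = 2M, order 18 = 1G
 * with 4k base pages). */
static size_t charge_per_compound(unsigned int order)
{
	return ((size_t)1 << order) << PAGE_SHIFT;
}
```

A single pin on a 1G gigapage is charged 4k by the naive scheme but holds down a gigabyte, which is the "counting apples as oranges" unreliability mentioned elsewhere in the thread.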
Maybe 3 is overkill, but without it the OOM killer won't even see
those GUP pins coming, so if not done it's still kind of unsafe; if
done it'll risk double counting.

Even then a GUP pin still prevents optimization: it can't converge the
io ring in the right NUMA node, just to make an example, but that's a
secondary performance concern.

The primary concern with the mmu notifier in io_uring is the
take_all_locks latency. Long-lived apps like qemu would be fine with
the mmu notifier. The main question is also if there's any short-lived
latency-sensitive io_uring usage... that wouldn't fly with
take_all_locks.

The problem with the mmu notifier as a universal solution, for
example, is that it can't wait for I/O completion of O_DIRECT since it
has no clue where the put_page is to wait for it; otherwise we could
avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be
completed before paging or anything else can unmap the page under I/O
from the pagetable. Even if we could reliably identify all the
put_page of transient pins rel
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 11:25:21AM -0800, Linus Torvalds wrote:
> On Fri, Jan 8, 2021 at 9:53 AM Andrea Arcangeli wrote:
> >
> > Do you intend to eventually fix the zygote vmsplice case or not?
> > Because in current upstream it's not fixed currently using the
> > enterprise default config.
>
> Is this the hugepage case? Neither of your patches actually touched
> that, so I've forgotten the details.

The two patches only fixed the TLB flushing deferral in clear_refs and
uffd-wp. So I didn't actually try to fix the hugepage case by adding
the page_count checks there too. I could try to do that; at least it'd
be consistent, but I still would try to find an alternate solution
later.

> > Irrelevant special case as in: long term GUP pin on the memory?
>
> Irrelevant special case in that
>
> (a) an extra COW shouldn't be a correctness issue unless somebody
> does something horribly wrong (and obviously the code that hasn't
> taken the mmap_lock for writing are then examples of that)
>
> and
>
> (b) it's not a performance issue either unless you can find a real
> load that does it.
>
> Hmm?

For b) I don't have a hard time imagining `ps` hanging for seconds if
clear_refs is touched on a 4T mm, but b) is not the main concern.
Having to rely on a) is the main concern, and it's not about TLB
flushes but the long term GUP pins.

Thanks,
Andrea
Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support
On Fri, Jan 08, 2021 at 10:55:05AM -0800, Andy Lutomirski wrote:
> The implementation was rather buggy. It unconditionally marked PTEs
> read-only and released the mmap lock before flushing the TLB, which could
> allow a racing CoW operation to falsely believe that the underlying
> memory was not writable.
>
> I can't find any users at all of this mechanism, so just remove it.

Reviewed-by: Andrea Arcangeli
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 10:31:24AM -0800, Andy Lutomirski wrote:
> Can we just remove vmsplice() support? We could make it do a normal

The single case I've seen vmsplice used so far that was really cool is
localhost live migration of qemu. However, despite being really cool,
it wasn't merged in the end, and I don't recall exactly why.

There are even more efficient (but slightly more complex) ways to do
that than vmsplice: using MAP_SHARED gigapages or MAP_SHARED tmpfs
with THP opted-in on the tmpfs mount as guest physical memory instead
of anon memory, and finding a way not to have it cleared by kexec (so
you can also upgrade the host kernel and not just qemu), is a more
optimal way than having to PIN and move all pages through the pipe and
still pay a superfluous copy at the destination.

My guess why it's not popular (and I may be completely wrong on this
since I basically never used vmsplice, other than to proof-of-concept
DoS my phone to verify the long term GUP pin exploit works) is that
vmsplice is a more efficient, but not the most efficient, option.
Exactly like in the live migration case, it's always more efficient to
share a tmpfs THP backed region and have true zero copy than to go
through a pipe that still does one copy at the receiving end. It may
also be simpler, and it's not dependent on obscure F_SETPIPE_SIZE
tunings. So in the end it's still too slow for apps that require
maximum performance, and not worth the extra work for those that
don't. I love vmsplice conceptually, I'd just rather prefer that an
luser cannot run it.

> copy, thereby getting rid of a fair amount of nastiness and potential
> attacks. Even ignoring issues relating to the length of time that the
> vmsplice reference is alive, we also have whatever problems could be
> caused by a malicious or misguided user vmsplice()ing some memory and
> then modifying it.

Sorry to ask, but I'm curious: what else goes wrong if the user
modifies memory under a GUP pin from vmsplice? That's not obvious to
see.
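For reference, the mechanism under discussion looks like this from userspace (minimal sketch; `splice_buf_to_pipe` is a made-up helper name): vmsplice(2) hands the pages of the user buffer to the pipe instead of copying them, and the kernel's reference on those pages lives until a reader drains the pipe, which is exactly the unbounded-lifetime pin being debated.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gift a user buffer to a pipe with vmsplice(2).  The kernel takes a
 * GUP reference on the backing pages rather than copying the data;
 * the reference is only dropped once the pipe contents are consumed. */
static ssize_t splice_buf_to_pipe(int pipe_write_fd, void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base = buf,
		.iov_len  = len,
	};

	return vmsplice(pipe_write_fd, &iov, 1, 0);
}
```

A process that vmsplices a buffer and then never reads the pipe keeps the pages pinned indefinitely, with no privilege required, which is the window the seccomp filter mentioned above is meant to close.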
Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 02:19:45PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 08, 2021 at 12:00:36PM -0500, Andrea Arcangeli wrote:
> > > The majority cannot be converted to notifiers because they are DMA
> > > based. Every one of those is an ABI for something, and does not expect
> > > extra privilege to function. It would be a major breaking change to
> > > have pin_user_pages require some cap.
> >
> > ... what makes them safe is to be transient GUP pin and not long
> > term.
> >
> > Please note the "long term" in the underlined line.
>
> Many of them are long term, though only 50 or so have been marked
> specifically with FOLL_LONGTERM. I don't see how we can make such a
> major ABI break.

io_uring is one of those indeed and I already flagged it.

This isn't a black and white issue: kernel memory is also pinned, but
it's not in movable pageblocks... How do you tell the VM in GUP to
migrate memory to a non-movable pageblock before pinning it? Because
that's what it should do to create less breakage.

For example an iommu obviously needs to be privileged; if your
argument is that it's enough to use the right API to take long term
pins unconstrained, that's not the case. Pins are pins and prevent
moving or freeing the memory; their effect is the same and again worse
than mlock on many levels.

(then I know on preempt-rt mlock should behave like a pin, and that's
fine, you disable all features for such purpose there)

io_uring is fine in comparison to vmsplice but still not perfect,
because it does the RLIMIT_MEMLOCK accounting, but unfortunately it is
tangibly unreliable since a pin can cost 2M or 1G (now 1G is basically
privileged so it doesn't hurt to get the accounting wrong in such a
case, but it's still technically counting apples as oranges).

Maybe io_uring could keep not doing the mmu notifier; I'd need more
investigation to be sure, but what's the point of keeping it
VM-breaking when it doesn't need to be? Is io_uring required to set up
the ring at high frequency?
> Looking at it, vmsplice() is simply wrong. A long term page pin must
> use pin_user_pages(), and either FOLL_LONGTERM|FOLL_WRITE (write mode)
> FOLL_LONGTERM|FOLL_FORCE|FOLL_WRITE (read mode)
>
> ie it must COW and it must reject cases that are not longterm safe,
> like DAX and CMA and so on.
>
> These are the well established rules, vmsplice does not get a pass

Where are the established rules written down? pin_user_pages.rst
doesn't even make a mention of FOLL_FORCE or FOLL_WRITE at all; mm.h,
same thing.

In any case, the extra flags required in FOLL_LONGTERM should be
implied by FOLL_LONGTERM itself, once it enters the gup code, because
it's not cool having to FOLL_WRITE in all drivers for a GUP(write=0),
let alone having to specify FOLL_FORCE for just a read. But this is
going offtopic.

> simply because it is using the CPU to memory copy as its "DMA".

vmsplice can't find all the put_pages that may release the pages when
the pipe is read, or it'd at least be able to do the unreliable
RLIMIT_MEMLOCK accounting. I'm glad we agree vmsplice is unsafe. The
PR for the seccomp filter is open, so if you don't mind, I'll link
your review as confirmation.

> > speaking in practice. io_uring has similar concern but it can use mmu
> > notifier, so it can totally fix it and be 100% safe from this.
>
> IIRC io_uring does use FOLL_LONGTERM and FOLL_WRITE..

Right, it's one of those 50. FOLL_WRITE won't magically allow the
memory to be swapped or migrated.

To make another example: a single unprivileged pin on the movable zone
can break memhotunplug unless you use the mmu notifier. Every other
advanced feature falls apart.

So again, if an unprivileged syscall allows only a very limited number
of pages, maybe it also checks if it got a THP or a gigapage from the
pin and it sets its own limit, then maybe again it's not a big
concern.
vmsplice currently with zero privilege allows this:

 2  0 1074432 9589344 13548 132186040    4  172 2052 9997  5  2 93  0  0
    -> vmsplice reproducer started here
 1  0 1074432 8538184 13548 132582000    0    0 1973 8838  4  3 93  0  0
 1  0 1074432 8538436 13548 132552400    0    0 1730 8168  4  2 94  0  0
 1  0 1074432 8539096 13556 132188000    0   72 1811 8640  3  2 95  0  0
 0  0 1074432 8539348 13564 132202800    0   36 1936 8684  4  2 95  0  0
    -> vmsplice killed here
 1  0 1074432 9586720 13564 132224800    0    0 1893 8514  4  2 94  0  0

That's ~1G that goes away for each task, and I didn't even check if
it's all THP pages getting in there. The rss is 3MB despite 1G being
taken down in GUP pins with zero privilege:

 1512 pts/25   S    0:00    0    0 133147 3044 0.0 ./vmspli
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Fri, Jan 08, 2021 at 09:39:56AM -0800, Linus Torvalds wrote:
> page_count() is simply the right and efficient thing to do.
>
> You talk about all these theoretical inefficiencies for cases like
> zygote and page pinning, which have never ever been seen except as a
> possible attack vector.

Do you intend to eventually fix the zygote vmsplice case or not?
Because in current upstream it's not fixed currently using the
enterprise default config.

> Stop talking about irrelevant things. Stop trying to "optimize" things
> that never happen and don't matter.
>
> Instead, what matters is the *NORMAL* VM flow.
>
> Things like COW.
>
> Things like "oh, now that we check just the page count, we don't even
> need the page lock for the common case any more".
>
> > For the long term, I can't see how using page_count in do_wp_page is a
> > tenable proposition,
>
> I think you should re-calibrate your expectations, and accept that
> page_count() is the right thing to do, and live with it.
>
> And instead of worrying about irrelevant special-case code, start

Irrelevant special case as in: long term GUP pin on the memory?

Or irrelevant special case as in: causing a secondary MMU to hit
silent data loss if a pte is ever wrprotected (arch code, vm86,
whatever, all under mmap_write_lock of course)?

> worrying about the code that gets triggered tens of thousands of times
> a second, on regular loads, without anybody doing anything odd or
> special at all, just running plain and normal shell scripts or any
> other normal traditional load.
>
> Those irrelevant special cases should be simple and work, not badly
> optimized to the point where they are buggy. And they are MUCH LESS
> IMPORTANT than the normal VM code, so if somebody does something odd,
> and it's slow, then that is the problem for the _odd_ case, not for
> the normal codepaths.
>
> This is why I refuse to add crazy new special cases to core code. Make
> the rules simple and straightforward, and make the VM code work well.
New special cases? Which new cases? There's nothing new here besides
the zygote, which wasn't fully fixed with
09854ba94c6aad7886996bfbee2530b3d8a7f4f4 and is actually the only new
case I can imagine where page_count actually isn't a regression.

All the old cases that you seem to refer to as irrelevant are in
production in v4.18; I don't see anything new here.

Even for the pure COW case with zero GUP involvement, a hugepage with
COWs happening in different processes would forever hit wp_copy_page,
since the page count is always > 1 despite the mapcount can be 1 for
all subpages. A simple app doing fork/exec would forever copy all
memory in the parent even after the exec is finished.

Thanks,
Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Fri, Jan 08, 2021 at 09:36:49AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 07, 2021 at 04:45:33PM -0500, Andrea Arcangeli wrote:
> > On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote:
> > > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote:
> > >
> > > > vmsplice syscall API is insecure allowing long term GUP PINs without
> > > >                                           ^
> > > > privilege.
> > >
> > > Lots of places are relying on pin_user_pages long term pins of memory,
> > > and cannot be converted to notifiers.
> > >
> > > I don't think it is reasonable to just declare that insecure and
> > > requires privileges, it is a huge ABI break.
> >
> > Where's that ABI? Are there specs or a code example in kernel besides
> > vmsplice itself?
>
> If I understand you right, you are trying to say that the 193
> pin_user_pages() callers cannot exist as unpriv any more?

193, 1k, 1m or their number in general won't just make them safe...

> The majority cannot be converted to notifiers because they are DMA
> based. Every one of those is an ABI for something, and does not expect
> extra privilege to function. It would be a major breaking change to
> have pin_user_pages require some cap.

... what makes them safe is to be transient GUP pins and not long
term.

Please note the "long term" in the underlined line.

O_DIRECT is perfectly ok to be unprivileged, obviously. The VM can
wait; eventually it goes away. Even a swapout is not an instant event
and can be held off by any number of other things besides a transient
GUP pin. It can be held off by a PG_lock, just to make an example.

mlock however is long term, persistent. vmsplice takes persistent pins
and can pin way too much memory for each mm; that doesn't feel safe.
The more places doing stuff like that, the more likely one causes a
safety issue, not the other way around in fact.

> > The whole zygote issue wouldn't even register if the child had the
> > exact same credentials of the parent.
Problem is the child dropped > > privileges and went with a luser id, that clearly cannot ptrace the > > parent, and so if long term unprivileged GUP pins are gone from the > > equation, what remains that the child can do is purely theoretical > > even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. > Sorry, I'm not sure I've found a good explanation how ptrace and GUP > are interacting to become a security problem. ptrace is not involved. What I meant by mentioning ptrace is that if the child can ptrace the parent, then it doesn't matter if it can also do the below, so the security concern is zero in such a case. With O_DIRECT or any transient pin you will never munmap while the I/O is in flight; if you munmap, what happens is undefined in such a case anyway. It is a theoretical security issue made practical by the vmsplice API, which allows enlarging the window from (not guaranteed) milliseconds to years of time, waiting for the parent to trigger the wp_page_reuse. Remove vmsplice and the security issue remains in theory, but with vmsplice removed it becomes statistically irrelevant in practice. io_uring has a similar concern, but it can use mmu notifier, so it can totally fix it and be 100% safe from this. The scheduler disclosure date was 2020-08-25 so I can freely explain the case that motivated all these changes.

case A)

if !fork() {
	// in child
	mmap one page
	vmsplice takes gup pin long term on such page
	munmap one page
	// mapcount == 1 (parent mm)
	// page_count == 2 (gup in child, and parent mm)
} else {
	parent writes to the page
	// mapcount == 1, wp_page_reuse
}

The parent did a COW with mapcount == 1, so the parent will take over a page that is still GUP pinned in the child. That's the security issue: in this case the GUP pin is malicious.
Now imagine this case B)

mmap one page
RDMA or any secondary MMU takes a long term GUP pin
munmap one page
// mapcount == 1 (parent mm)
// page_count == 2 (gup in RDMA, and parent mm)

How can the VM tell the two cases apart? It can't. The current page_count check in do_wp_page treats both cases the same: because page_count is 2 in both, it calls wp_page_copy and breaks COW in both. However, in the second case it is a feature and not a bug that wp_page_reuse is called instead; in fact it has to be called, or it's a bug (and that's the bug page_count in do_wp_page introduces). So page_count in do_wp_page breaks all valid users to take care of a purely theoretical security issue that isn't a practical concern, as long as vmsplice is secured at least as well as mlock. page_count in do_wp_page is fundamental
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Hello everyone, On Fri, Jan 08, 2021 at 12:48:16PM +, Will Deacon wrote: > On Thu, Jan 07, 2021 at 04:25:54PM -0800, Linus Torvalds wrote: > > Please. Why is the correct patch not the attached one (apart from the > > obvious fact that I haven't tested it and maybe just screwed up > > completely - but you get the idea)? > It certainly looks simple and correct to me, although it means we're now > taking the mmap sem for write in the case where we only want to clear the > access flag, which should be fine with the thing only held for read, no? I'm curious, would you also suggest that fixing just the TLB flushing symptom is enough and we can forget about the ABI break coming from page_count used in do_wp_page? One random example: clear_refs will still break all long term GUP pins, are you ok with that too? page_count in do_wp_page is a fix, for the original vmsplice security issue (where the child fools the parent into taking over the exclusive page in do_wp_page), that appears worse than the bug itself. page_count in do_wp_page, instead of isolating as malicious only the case where the parent reuses a page still queued in the vmsplice pipe, treats as malicious also all legit cases that had to reliably reuse the page to keep the secondary MMUs from going out of sync. page_count in do_wp_page is like a credit card provider blocking all credit cards of all customers because one credit card may have been cloned (by vmsplice), but nobody can know which one it was. Of course this technique will work perfectly as a security fix, because it treats all credit card users as malicious and blocks them all ("block" as in preventing re-use of the anon page). The problem is the other credit card users that weren't malicious and get their COW broken too. Those are all the legit long term GUP pins, if any anon page can still be wrprotected anywhere in the VM.
At the same time the real low hanging fruit (vmsplice) that, if taken care of, would have rendered the bug purely theoretical in security terms hasn't been fixed yet, despite those unprivileged long term GUP pins cause more reproducible security issues than just the COW one, since they can still DoS the OOM killer and they bypass at least the mlock enforcement, even for non compound pages. Of course just fixing vmsplice to require some privilege won't fix the bug in full, so it's not a suitable long term solution, but it has to happen orthogonally for other reasons, and it'd at least remove the short term security concern. In addition you're not experiencing the full fallout of the side effects of page_count being used to decide whether to re-use all anon COW pages, because the bug is still there (with enterprise default config options at least). Not all credit cards are blocked yet with only 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 applied. Only after you block them all will you experience all the side effects of replacing the per-subpage finegrained mapcount with the compound-wide page count. The two statements above combined result in my recommendation, at this point, to resolve this in userland by rendering the security issue theoretical: either remove vmsplice from the OCI schema allowlist, or enforce in userland that execve is always run after dropping privileges (as crun always does when it starts the container of course). For the long term, I can't see how using page_count in do_wp_page is a tenable proposition, unless we either drop all secondary MMUs from the kernel, or VM features like clear_refs are dropped, or unless the page_count is magically stabilized and the speculative pagecache lookups are also dropped. If trying to manage the fallout by enforcing that no anon page can ever be wrprotected in place (i.e.
dropping the clear_refs feature or rendering it unreliable by skipping elevated counts caused by spurious pagecache lookups), it'd still sound like too fragile a design, too prone to break, to rely on. There's random arch stuff wrprotecting memory too; even the very vm86 code does it under the hood (vm86 memory is unlikely to have a long term GUP pin on it of course, but still, who knows?). I mean the VM core cannot make assumptions like: "this vm86 case can still wrprotect without worry because probably vm86 isn't used anymore with any advanced secondary MMU, so if there's a GUP pin it probably is a malicious vmsplice and not a RDMA or GPU or Virtualization secondary MMU". Then there's the secondary concern of the inefficiency it introduces with extra unnecessary copies, when a single GUP pin will prevent reuse of 512 or 262144 subpages, in the 512 case also potentially mapped in different processes. The TLB flushing discussion registers as the last concern in my view. Thanks, Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Thu, Jan 07, 2021 at 02:51:24PM -0800, Linus Torvalds wrote: > Ho humm. I had obviously not looked very much at that code. I had done > a quick git grep, but now that I look closer, it *does* get the > mmap_sem for writing, but only for that VM_SOFTDIRTY bit clearing, and > then it does a mmap_write_downgrade(). > > So that's why I had looked more at the UFFD code, because that one was > the one I was aware of doing this all with just the read lock. I > _thought_ the softdirty code already took the write lock and wouldn't > race with page faults. > > But I had missed that write_downgrade. So yeah, this code has the same issue. I overlooked the same thing initially. It's only when I noticed it also used mmap_read_lock that I figured the group lock thingy of the uffd-wp ad-hoc solution, despite being fully self contained thanks to the handle_userfault() catcher for the uffd-wp bit in the pagetable, wasn't worth it, since uffd-wp could always use whatever clear_refs used to solve it. > Anyway, the fix is - I think - the same I outlined earlier when I was > talking about UFFD: take the thing for writing, so that you can't > race. Sure. > The alternate fix remains "make sure we always flush the TLB before > releasing the page table lock, and make COW do the copy under the page > table lock". But I really would prefer to just have this code follow The copy under PT lock alone isn't enough. Flushing the TLB before releasing the PT lock is enough of course. Note also the 2/2 patch that I sent: https://lkml.kernel.org/r/20210107200402.31095-3-aarca...@redhat.com 2/2 would have been my preferred solution for both and it works fine: all the corruption that was trivially reproducible with the heavy selftest program in the kernel is gone. If the pending TLB flush issue were the only regression page_count in do_wp_page introduced, I would never have suggested we re-evaluate it.
It'd be a good tradeoff in such a case, even if it'd penalize the soft-dirty runtime, especially if we were allowed to deploy 2/2 as a non-blocking solution. Until yesterday I fully intended to just propose 1/2 and 2/2, with a whole different cover letter, CC stable, and close this case. > all the usual rules, and if it does a write protect, then it should > take the mmap_sem for writing. The problem isn't about performance anymore; the problem is a silent ABI break for long term GUP pin users attached to an mm under clear_refs. > Why is that very simple rule so bad? > > (And see my unrelated but incidental note on it being a good idea to > try to minimize latency by making surfe we don't do any IO under the > mmap lock - whether held for reading _or_ writing. Because I do think > we can improve in that area, if you have some good test-case). That would be great indeed. Thanks, Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Thu, Jan 07, 2021 at 02:42:17PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 2:31 PM Andrea Arcangeli wrote: > > > > Random memory corruption will still silently materialize as result of > > the speculative lookups in the above scenario. > > Explain. > > Yes, you'll get random memory corruption if you keep doing wrprotect() > without mmap_sem held for writing. I didn't mean that. > But I thought we agreed earlier that that is a bug. And I thought the > softdirty code already got it for writing. softdirty used mmap_read_lock too, but this again isn't relevant here; for the sake of discussion we can safely assume mmap_read_lock doesn't exist in the kernel and that everything takes the mmap_write_lock whenever a mmap_lock is taken at all. I mean: something bad will happen if a write happens but soft dirty cannot register it because we didn't wrprotect the pte? Some dirty page won't be transferred to the destination, and it will be assumed there was no soft dirty event for such a page? Otherwise it would mean that wrprotecting is simply optional for all pages under clear_refs? Not doing the final TLB flush in softdirty caused issues even when there was no COW and the deferred flush would only delay the wrprotect fault: https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=rdjnu6vgf3hlo+...@mail.gmail.com/ https://lore.kernel.org/linux-mm/20210105221628.GA12854@willie-the-truck/ From the userland point of view, I'd guess that skipping the wrprotection of a pte, because a speculative pagecache lookup elevated a random page_count, would behave like missing the final TLB flush before clear_refs returns to userland, only worse. Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Thu, Jan 07, 2021 at 02:17:50PM -0800, Linus Torvalds wrote: > So I think we can agree that even that softdirty case we can just > handle by "don't do that then". Absolutely. The question is: if somebody was happily running clear_refs with an RDMA device attached to the process, by the time they update and reboot they'll currently find out the hard way, with silent mm corruption. So I was obliged to report this issue and the fact that there was a very strong reason why page_count was not used there; it's even documented explicitly in the source: * [..] however we only use * page_trans_huge_mapcount() in the copy-on-write faults where we * need full accuracy to avoid breaking page pinning, [..] In fact I didn't entirely forget the comment when I reiterated it also in Message-ID: <20200527212005.gc31...@redhat.com> on May 27 2020, since I recalled there was a very explicit reason why we had to use page_mapcount in do_wp_page and deliver full accuracy. Now I cannot prove there's any such user that will break, but we'll find those with a year or more of delay. Even the tlb flushing deferral, which caused clear_refs_write to also corrupt userland memory and literally lose userland writes even without any special secondary MMU hardware attached to the memory, took 6 months to find. > if a page is pinned, the dirty bit of it makes no sense, because it > might be dirtied complately asynchronously by the pinner. > > So I think none of the softdirty stuff should touch pinned pages. I > think it falls solidly under that "don't do it then". > > Just skipping over them in clear_soft_dirty[_pmd]() doesn't look that > hard, does it? 1) How do you know, again, whether it wasn't a speculative lookup or an O_DIRECT that made them look pinned? 2) if it's a hugepage, a 4k pin will make soft dirty skip 511 dirty bits by mistake, including those of subpages that weren't pinned and that userland is still writing through the transhuge pmd.
Until v4.x we had a page pin counter for each subpage, so the above wouldn't have happened; not anymore since it was simplified and improved, and now the page_count is pretty inefficient there, even if it'd be possible to make it safe. 3) the GUP(write=0) may be just reading from RAM and sending to the other end, so clear_refs currently may well have tracked CPU writes fine, with no interference whatsoever from the secondary MMU working readonly on the memory. Thanks, Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
On Thu, Jan 07, 2021 at 01:29:43PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 12:59 PM Andrea Arcangeli wrote: > > > > The problem is it's not even possible to detect reliably if there's > > really a long term GUP pin because of speculative pagecache lookups. > > So none of the normal code _needs_ that any more these days, which is > what I think is so nice. Any pinning will do the COW, and then we have > the logic to make sure it stays writable, and that keeps everything > nicely coherent and is all fairly simple. > > And yes, it does mean that if somebody then explicitly write-protects > a page, it may end up being COW'ed after all, but if you first pinned > it, and then started playing with the protections of that page, why > should you be surprised? > > So to me, this sounds like a "don't do that then" situation. > > Anybody who does page pinning and wants coherency should NOT TOUCH THE > MAPPING IT PINNED. > > (And if you do touch it, it's your own fault, and you get to keep both > of the broken pieces) > > Now, I do agree that from a QoI standpoint, it would be really lovely > if we actually enforced it. I'm not entirely sure we can, but maybe it > would be reasonable to use that > > mm->has_pinned && page_maybe_dma_pinned(page) > > at least as the beginning of a heuristic. > > In fact, I do think that "page_maybe_dma_pinned()" could possibly be > made stronger than it is. Because at *THAT* point, we might say "we > know a pinned page always must have a page_mapcount() of 1" - since as > part of pinning it and doing the GUP_PIN, we forced the COW, and then > subsequent fork() operations enforce it too. > > So I do think that it might be possible to make that clear_refs code > notice "this page is pinned, I can't mark it WP without the pinning > coherency breaking". > > It might not even be hard. But admittedly I'm somewhat handwaving > here, and I might not have thought of some situation. 
I suppose the objective would be to detect when it's a transient pin (such as an O_DIRECT write) and fail clear_refs with an error for all other cases of real long term pins that need to keep reading at full PCI bandwidth, without extra GUP invocations after the wp_copy_page run. Because the speculative lookups make the count unstable, it's not even enough to use mmu notifier and never use FOLL_GET in GUP to make it safe again (unlike what I mentioned in a previous email). Random memory corruption will still silently materialize as a result of the speculative lookups in the above scenario. My whole point here, in starting this new thread to suggest page_count in do_wp_page is an untenable solution, is that such a commit silently broke every single long term GUP pin user that may be used in combination with clear_refs, since 2013. Undetected silent memory corruption, or a detectable error out of clear_refs, are just different side effects of the same technical ABI break that rendered long term GUP pins fundamentally incompatible with clear_refs; detected or not, an ABI break it still is. I felt obliged to report there's something much deeper and fundamentally incompatible between the page_count in do_wp_page and any wrprotection of exclusive memory in place, as when used in combination with any RDMA for example. The TLB flushing and the mmap_read/write_lock were just the tip of the iceberg and they're not the main concern anymore. In addition, the inefficiency caused by the page_count effect being multiplied 512 or 262144 times while the mapcount remains 4k granular makes me think the page_count is unsuitable to be used there, and this specific cure with page_count in do_wp_page looks worse than the initial zygote disease. Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Thu, Jan 07, 2021 at 01:05:19PM -0800, Linus Torvalds wrote: > I think those would very much be worth fixing, so that if > UFFDIO_WRITEPROTECT taking the mmapo_sem for writing causes problems, > we can _fix_ those problems. > > But I think it's entirely wrong to treat UFFDIO_WRITEPROTECT as > specially as Andrea seems to want to treat it. Particularly with > absolutely zero use cases to back it up. Again for the record: there's nothing at all special in UFFDIO_WRITEPROTECT in this respect. If you could stop mentioning UFFDIO_WRITEPROTECT and only focus on softdirty/clear_refs, maybe you wouldn't think my judgment is biased towards clear_refs/softdirty too. You can imagine the side effects of page_count doing a COW erroneously, as a corollary of the fact that KSM won't ever allow merging two pages if one of them is under a GUP pin. Why is that? Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Thu, Jan 07, 2021 at 12:32:09PM -0800, Linus Torvalds wrote: > I think Andrea is blinded by his own love for UFFDIO: when I do a > debian codesearch for UFFDIO_WRITEPROTECT, all it finds is the kernel > and strace (and the qemu copies of the kernel headers). For the record, I feel obliged to reiterate I'm thinking purely in clear_refs terms here. It'd be great if we could only focus on clear_refs_write and nothing else. Like I said earlier, whatever way clear_refs/softdirty copes with do_wp_page, uffd-wp can do the identical thing, so uffd-wp is effectively irrelevant in this whole discussion. Clear_refs/softdirty predates uffd-wp by several years too. Thanks, Andrea
Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy
On Thu, Jan 07, 2021 at 04:25:25PM -0400, Jason Gunthorpe wrote: > On Thu, Jan 07, 2021 at 03:04:00PM -0500, Andrea Arcangeli wrote: > > > vmsplice syscall API is insecure allowing long term GUP PINs without > > privilege. > > Lots of places are relying on pin_user_pages long term pins of memory, > and cannot be converted to notifiers. > > I don't think it is reasonable to just declare that insecure and > requires privileges, it is a huge ABI break. Where's that ABI? Are there specs or a code example in the kernel besides vmsplice itself? I don't see how it's possible to consider long term GUP pins completely unprivileged if not using mmu notifier. vmsplice doesn't even account them in a rlimit (it cannot, because it cannot identify all the put_pages either). Long term GUP pins not using mmu notifier and not accounted in a rlimit are an order of magnitude more VM-intrusive than mlock. The reason it's worse than mlock, even ignoring all the performance features they break (including NUMA bindings) that mlock doesn't risk breaking, is that you can unmap the memory after taking those rlimit-unaccounted GUP pins. So the OOM killer won't even have a chance to see the GUP pins coming. So it can't be that mlock has to be privileged while unconstrained, unaccounted long term GUP pins as in vmsplice are ok to stay unprivileged. Now io_uring does account the GUP pins in the mlock rlimit, but after the vma is unmapped it'd still cause the same confusion to the OOM killer, and in addition the assumption that each GUP pin costs 4k is also flawed. However the io_uring model can use the mmu notifier without slowing down the fast paths, so fixing it won't cause any ABI break. Or to see it another way, it'd be fine to declare that all mlock rlimits are obsolete and memcg is the only way to constrain RAM usage, but then mlock should stop being privileged too, because mlock is a lesser concern: at least it won't risk confusing the OOM killer.
The good thing is the GUP pins won't escape memcg accounting, but that accounting also doesn't come entirely free. > FWIW, vhost tries to use notifiers as a replacement for GUP, and I > think it ended up quite strange and complicated. It is hard to > maintain performance when every access to the pages needs to hold some > protection against parallel invalidation. And that's fine; this is all about whether it should require a one liner change to add the username to the realtime group in /etc/group or not. You're focusing on your use case, but we have to put things in the perspective of why all these changes started. The whole zygote issue wouldn't even register if the child had the exact same credentials as the parent. The problem is that the child dropped privileges and went with a luser id that clearly cannot ptrace the parent, and so, if long term unprivileged GUP pins are gone from the equation, what remains that the child can do is purely theoretical even before commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f. NOTE: I'm all for fixing the COW for good, but vmsplice, or any long term GUP pin that is absolutely required to make such an attack practical, looks like the real low hanging fruit to fix here. However fixing it in a way that makes clear_refs fundamentally incompatible with mmu notifier users unless they all convert to pure !FOLL_GET GUPs, let alone with long term GUP pins not using mmu notifier, doesn't look great. For vmsplice the new break-COW is the fix, because it happens in the other process. For every legit long term GUP pin, where the break-COW happens in the single and only process, it's silent MM corruption. Thanks, Andrea
Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Hi Linus, On Thu, Jan 07, 2021 at 12:17:40PM -0800, Linus Torvalds wrote: > On Thu, Jan 7, 2021 at 12:04 PM Andrea Arcangeli wrote: > > > > However there are two cases that could wrprotecting exclusive anon > > pages with only the mmap_read_lock: > > I still think the real fix is "Don't do that then", and just take the > write lock. > > The UFFDIO_WRITEPROTECT case simply isn't that critical. It's not a > normal operation. Same goes for softdirty. > > Why have those become _so_ magical that they can break the VM for > everybody else? I see what you mean above and I agree. Like I said below: == In simple terms: the page_count check in do_wp_page makes it impossible to wrprotect memory if such memory is under a !FOLL_WRITE GUP pin. == So to simplify let's ignore UFFDIO_WRITEPROTECT here, which is new and adds no dependency on top of clear_refs in this respect. So yes, if we drop any code that has to wrprotect memory in place from the kernel (since all userland memory can be under GUP pin in read mode) and we make such an operation illegal, then it's fine, but that means clear_refs has to go, or it has to fail if there's a GUP pin during the wrprotection. The problem is that it's not even possible to reliably detect whether there's really a long term GUP pin, because of speculative pagecache lookups. We would need to declare that any secondary MMU which is supposed to be VM-neutral by using mmu notifier, like a GPU or an RDMA device, cannot be used in combination with clear_refs, and that would need to be enforced fully in userland. Most mmu notifier users drop the GUP pin during the invalidate for extra safety in case an invalidate goes missing: they would all need to drop FOLL_GET to be compliant and to stop causing memory corruption, if clear_refs is still to be allowed on mmu notifier capable secondary MMUs. Even then, how does userland know which devices attach to the memory with mmu notifier and never use FOLL_GET, and which don't?
It doesn't sound reliable to enforce this in userland. So I don't see how clear_refs can be saved. Now let's make another example that still shows at least some fundamental inefficiency that has nothing to do with clear_refs. Let's suppose a GUP pin is taken on subpageA by an RDMA device in process A (parent). Let's now assume subpageB is mapped in process B (child of process A). Both subpageA and subpageB are exclusive (mapcount == 1) and read-write, but they share the same page_count atomic counter (only the page_mapcounts are subpage granular). To still tame the zygote concern with the page_count in do_wp_page, process B, when it forks a child (process C), would forever have to do an extra superfluous COW even after process C exits. Is that what we want, on top of the fundamental unsafety added to clear_refs? Thanks, Andrea
[PATCH 0/2] page_count can't be used to decide when wp_page_copy
ll memory if using a jailer that doesn't execve) to avoid leaving the aforementioned side channel in place. Only RAM constrained embedded devices are justified in taking shortcuts, with the ensuing side channel security concern that emerges from it. For all the above reasons, because so far the cure has been worse than the disease itself, I'd recommend enterprise distro kernels ignore the zygote embedded model attack for the time being and not backport anything in this regard. What should not be backported specifically starts with 17839856fd588f4ab6b789f482ed3ffd7c403e1f. 17839856fd588f4ab6b789f482ed3ffd7c403e1f was supposed to be fine and not break anything; unfortunately I was too busy while 17839856fd588f4ab6b789f482ed3ffd7c403e1f morphed into 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, so I still missed a whole huge discussion in that transition. I don't know yet what was fundamentally flawed in 17839856fd588f4ab6b789f482ed3ffd7c403e1f. All I comment on here is purely the current end result, 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, not any of the intermediate steps that brought us there. I already mentioned the page_count wasn't workable in do_wp_page in Message-ID: <20200527212005.gc31...@redhat.com> on May 27 2020, quote: "I don't see how we can check the page_count to decide if to COW or not in the wrprotect fault, given [..] introduced in >=" Then Hugh said he wasn't happy about it on 1 Sep 2020 in Message-ID: . In https://lkml.kernel.org/r/x+o49hrck1fbd...@redhat.com I suggested "I hope we can find a way put the page_mapcount back where there", and now I have to double down and suggest that the page_count is fundamentally unsafe in do_wp_page. I see how the page_count would also fix the specific zygote child->parent attack, and I kept an open mind hoping it would just solve all problems magically. So I tried to fix even clear_refs to cope with it, but this is only the tip of the iceberg of what really breaks.
So in short I contextually self-NAK 2/2 of this patchset and we need to somehow reverse 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 instead.

Thanks,
Andrea

Andrea Arcangeli (1):
  mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

Will Deacon (1):
  mm: proc: Invalidate TLB after clearing soft-dirty page state

 fs/proc/task_mmu.c       | 26 ---
 include/linux/mm.h       | 46 +++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c            |  1 +
 mm/memory.c              | 69 
 mm/mprotect.c            |  4 +++
 6 files changed, 146 insertions(+), 5 deletions(-)
[PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Reported-by: Nadav Amit
Suggested-by: Yu Zhao
Signed-off-by: Andrea Arcangeli
---
 fs/proc/task_mmu.c       | 17 +-
 include/linux/mm.h       | 46 +++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c            |  1 +
 mm/memory.c              | 69 
 mm/mprotect.c            |  4 +++
 6 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a127262ba517..e75cb135db02 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1235,8 +1235,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	}
 	if (type == CLEAR_REFS_SOFT_DIRTY) {
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			if (!(vma->vm_flags & VM_SOFTDIRTY))
+			struct vm_area_struct *tmp;
+			if (!(vma->vm_flags & VM_SOFTDIRTY)) {
+				inc_wrprotect_tlb_flush_pending(vma);
 				continue;
+			}
+
+			/*
+			 * Rollback wrprotect_tlb_flush_pending before
+			 * releasing the mmap_lock.
+			 */
+			for (vma = mm->mmap; vma != tmp;
+			     vma = vma->vm_next)
+				dec_wrprotect_tlb_flush_pending(vma);
+
 			mmap_read_unlock(mm);
 			if (mmap_write_lock_killable(mm)) {
 				count = -EINTR;
@@ -1245,6 +1257,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				vma->vm_flags &= ~VM_SOFTDIRTY;
 				vma_set_page_prot(vma);
+				inc_wrprotect_tlb_flush_pending(vma);
 			}
 			mmap_write_downgrade(mm);
 			break;
@@ -1260,6 +1273,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
+			for (vma = mm->mmap; vma; vma = vma->vm_next)
+				dec_wrprotect_tlb_flush_pending(vma);
 			dec_tlb_flush_pending(mm);
 		}
 		mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..caa1d9a71cb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,51 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+/*
+ * NOTE: the mmap_lock must be hold and it cannot be released at any
+ * time in between inc_wrprotect_tlb_flush_pending() and
+ * dec_wrprotect_tlb_flush_pending().
+ *
+ * This counter has to be elevated before taking the PT-lock to
+ * wrprotect pagetables, if the TLB isn't flushed before the
+ * PT-unlock.
+ *
+ * The only reader is the page fault so this has to be elevated (in
+ * addition of the mm->tlb_flush_pending) only when the mmap_read_lock
+ * is taken instead of the mmap_write_lock (otherwise the page fault
+ * couldn't run concurrently in the first place).
+ *
+ * This doesn't need to be elevated when clearing pagetables even if
+ * only holding the mmap_read_lock (as in MADV_DONTNEED). The page
+ * fault doesn't risk to access the data of the page that is still
+ * under tlb-gather deferred flushing, if the pagetable is none,
+ * because the pagetable doesn't point to it anymore.
+ *
+ * This counter is read more specifically by the page fault when it
+ * has to issue a COW that doesn't result in page re-use because of
+ * the lack of stability of the page_count (vs speculative pagecache
+ * lookups) or because of a GUP pin exist on an exclusive and writable
+ * anon page.
+ *
+ * If this counter is elevated and the pageteable is wrprotected (as
+ * in !pte/pmd_write) and present, it means the page may be still
+ * modified by userland through a stale TLB entry that was
+ * instantiated before the wrprotection. In such case the COW fault,
+ * if it decides not to re-use the page, will have to either wait this
+ * counter to return zero, or it needs to flush the TLB before
+ * proceeding copying the page.
+ */
+static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_inc(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->
[PATCH 1/2] mm: proc: Invalidate TLB after clearing soft-dirty page state
From: Will Deacon Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"), TLB invalidation is elided in tlb_finish_mmu() if no entries were batched via the tlb_remove_*() functions. Consequently, the page-table modifications performed by clear_refs_write() in response to a write to /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is fine when simply aging the ptes, in the case of clearing the "soft-dirty" state we can end up with entries where pte_write() is false, yet a writable mapping remains in the TLB. Fix this by avoiding the mmu_gather API altogether: managing both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB invalidation for the soft-dirty path, much like mprotect() does already. Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush") Signed-off-by: Will Deacon Signed-off-by: Andrea Arcangeli --- fs/proc/task_mmu.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index ee5a235b3056..a127262ba517 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1189,7 +1189,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, struct mm_struct *mm; struct vm_area_struct *vma; enum clear_refs_types type; - struct mmu_gather tlb; int itype; int rv; @@ -1234,7 +1233,6 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, count = -EINTR; goto out_mm; } - tlb_gather_mmu(&tlb, mm, 0, -1); if (type == CLEAR_REFS_SOFT_DIRTY) { for (vma = mm->mmap; vma; vma = vma->vm_next) { if (!(vma->vm_flags & VM_SOFTDIRTY)) @@ -1252,15 +1250,18 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf, break; } + inc_tlb_flush_pending(mm); mmu_notifier_range_init(&range, MMU_NOTIFY_SOFT_DIRTY, 0, NULL, mm, 0, -1UL); mmu_notifier_invalidate_range_start(&range); } walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp); - if (type == CLEAR_REFS_SOFT_DIRTY) + if (type == 
CLEAR_REFS_SOFT_DIRTY) { mmu_notifier_invalidate_range_end(&range); - tlb_finish_mmu(&tlb, 0, -1); + flush_tlb_mm(mm); + dec_tlb_flush_pending(mm); + } mmap_read_unlock(mm); out_mm: mmput(mm);
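The root cause this patch addresses, tlb_finish_mmu() skipping the flush when nothing was batched, can be modeled in a few lines of userspace C. This is a hypothetical toy sketch: the real mmu_gather API is far richer, and the field and function names here are invented for illustration.

```c
#include <stdbool.h>

/*
 * Toy model of the mmu_gather elision introduced by commit
 * 0758cd830494: the final flush is skipped when no pages were batched.
 */
struct toy_tlb_gather {
	bool pages_batched;  /* set by the tlb_remove_*() analogue */
	int  flushes_issued; /* counts TLB invalidations performed */
};

static void toy_tlb_gather_mmu(struct toy_tlb_gather *tlb)
{
	tlb->pages_batched = false;
	tlb->flushes_issued = 0;
}

static void toy_tlb_remove_page(struct toy_tlb_gather *tlb)
{
	tlb->pages_batched = true;
}

static void toy_tlb_finish_mmu(struct toy_tlb_gather *tlb)
{
	/* The elision: nothing batched, nothing flushed. */
	if (tlb->pages_batched)
		tlb->flushes_issued++;
}
```

clear_refs only rewrites protection bits and never batches any page, so under this model it gets zero flushes and stale writable TLB entries survive: exactly why the patch drops mmu_gather and calls flush_tlb_mm() explicitly.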
Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()
On Thu, Jan 07, 2021 at 06:28:29PM +0100, Vlastimil Babka wrote: > On 1/6/21 9:18 PM, Hugh Dickins wrote: > > On Wed, 6 Jan 2021, Andrea Arcangeli wrote: > >> > >> I'd be surprised if the kernel can boot with BUG_ON() defined as "do > >> {}while(0)" so I guess it doesn't make any difference. > > > > I had been afraid of that too, when CONFIG_BUG is not set: > > but I think it's actually "if (cond) do {} while (0)". > > It's a maze of configs and arch-specific vs generic headers, but I do see this > in include/asm-generic/bug.h: > > #else /* !CONFIG_BUG */ > #ifndef HAVE_ARCH_BUG > #define BUG() do {} while (1) > #endif > > So seems to me there *are* configurations possible where side-effects are > indeed > thrown away, right? But this is not BUG_ON, and that is an infinite loop while (1), not an optimization away as in the while (0) I was suggesting, which would just throw away cond and make it a noop. BUG() is actually the thing to use to move functional stuff out of BUG_ON, so it's not going to cause issues if it just loops. This overall feels mostly like an aesthetic issue. Thanks, Andrea
Re: [PATCH] mm/mmap: replace if (cond) BUG() with BUG_ON()
Hello, On Wed, Jan 06, 2021 at 11:46:20AM -0800, Andrew Morton wrote: > On Tue, 5 Jan 2021 20:28:27 -0800 (PST) Hugh Dickins wrote: > > > Alex, please consider why the authors of these lines (whom you > > did not Cc) chose to write them without BUG_ON(): it has always > > been preferred practice to use BUG_ON() on predicates, but not on > > functionally effective statements (sorry, I've forgotten the proper > > term: I'd say statements with side-effects, but here they are not > > just side-effects: they are their main purpose). > > > > We prefer not to hide those away inside BUG macros > > Should we change that? I find BUG_ON(something_which_shouldnt_fail()) > to be quite natural and readable. > > As are things like the existing > > BUG_ON(mmap_read_trylock(mm)); > BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL)); > > etc. > > > No strong opinion here, but is current mostly-practice really > useful? I'd be surprised if the kernel can boot with BUG_ON() defined as "do {} while (0)" so I guess it doesn't make any difference. I've no strong opinion either, but personally my views match Hugh's on this. I certainly tried to stick to that in the past since I find it cleaner if a bugcheck just "checks" and can be deleted at any time without sudden breakage. That said, I also guess we're in the minority. Thanks, Andrea
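Hugh's point, that a bugcheck should only "check", is easy to demonstrate in plain C with a hypothetical compiled-out BUG_ON of the kind discussed in this thread. This is an illustrative userspace sketch, not the kernel's actual macro; the macro and function names are invented for the example.

```c
/*
 * Hypothetical compiled-out BUG_ON: the condition is discarded
 * entirely, as in the "do {} while (0)" definition discussed above.
 */
#define BUG_ON_COMPILED_OUT(cond) do { } while (0)

static int work_done;

/*
 * A call with a real purpose, not just a predicate: it performs work
 * and returns nonzero on failure, like mmap_read_trylock() above.
 */
static int do_essential_work(void)
{
	work_done++;
	return 0; /* 0 == success */
}
```

Writing `BUG_ON_COMPILED_OUT(do_essential_work());` silently throws the call away, while the "call first, then check" style keeps the work regardless of how the bugcheck is configured, which is exactly why the check can then be deleted at any time without breakage.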
Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
On Tue, Jan 05, 2021 at 10:16:29PM +, Will Deacon wrote: > On Tue, Jan 05, 2021 at 09:22:51PM +, Nadav Amit wrote: > > > On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli wrote: > > > > > > On Tue, Jan 05, 2021 at 07:26:43PM +, Nadav Amit wrote: > > >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli > > >>> wrote: > > >>> > > >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > > >>>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes > > >>>> tracking") > > >>> > > >>> Targeting a backport down to 2013 when nothing could wrong in practice > > >>> with page_mapcount sounds backwards and unnecessarily risky. > > >>> > > >>> In theory it was already broken and in theory > > >>> 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > > >>> previous code of 2013 is completely wrong, but in practice the code > > >>> from 2013 worked perfectly until Aug 21 2020. > > >> > > >> Well… If you consider the bug that Will recently fixed [1], then > > >> soft-dirty > > >> was broken (for a different, yet related reason) since 0758cd830494 > > >> ("asm-generic/tlb: avoid potential double flush”). > > >> > > >> This is not to say that I argue that the patch should be backported to > > >> 2013, > > >> just to say that memory corruption bugs can be unnoticed. > > >> > > >> [1] > > >> https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/ > > > > > > Is this a fix or a cleanup? > > > > > > The above is precisely what I said earlier that tlb_gather had no > > > reason to stay in clear_refs and it had to use inc_tlb_flush_pending > > > as mprotect, but it's not a fix? Is it? I suggested it as a pure > > > cleanup. So again no backport required. The commit says fix this but > > > it means "clean this up". > > > > It is actually a fix. I think the commit log is not entirely correct and > > should include: > > > > Fixes: 0758cd830494 ("asm-generic/tlb: avoid potential double flush”). Agreed. 
> > > > Since 0758cd830494, calling tlb_finish_mmu() without any previous call to > pte_free_tlb() and friends does not flush the TLB. The soft-dirty bug > producer that I sent fails without this patch of Will. > > Yes, it's a fix, but I didn't rush it for 5.10 because I don't think rushing > this sort of thing does anybody any favours. I agree that the commit log > should be updated; I mentioned this report in the cover letter: > > https://lore.kernel.org/linux-mm/CA+32v5zzFYJQ7eHfJP-2OHeR+6p5PZsX=rdjnu6vgf3hlo+...@mail.gmail.com/ > > demonstrating that somebody has independently stumbled over the missing TLB > invalidation in userspace, but it's not as bad as the other issues we've been > discussing in this thread. Thanks for the explanation Nadav and Will. The fact that the code was a 100% match to the cleanup I independently suggested a few weeks ago to reduce the confusion in clear_refs made me overlook the difference 0758cd830494 made. I didn't realize the flush got optimized away if no gathering happened. Backporting this sort of thing with Fixes I guess tends to give the same kind of favors as rushing it for 5.10, but then in general the Fixes is accurate here. Overall it looks like an obviously safe cleanup and it is also a fix starting in v5.6, so I don't think this can cause more issues than what it surely fixes. The cleanup was needed anyway, even before it became a fix, since if it was mandatory to use tlb_gather when you purely need inc_tlb_flush_pending, then mprotect couldn't get away with it. I guess the optimization in 0758cd830494 just made it more explicit that no code should use tlb_gather if it doesn't need to gather any page. Maybe adding some commentary in the comment on top of tlb_gather_mmu about the new behavior wouldn't hurt.
Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
On Tue, Jan 05, 2021 at 09:22:51PM +, Nadav Amit wrote: > It is also about performance due to unwarranted TLB flushes. If there is a problem, switching to the wait_flush_pending() model suggested by Peter may not even require changes to the common code in memory.c, since I'm thinking it may not even need to take a failure path if we plug it in the same place as the tlb flush. So instead of the flush we could always block there until we read zero in the atomic, then smp_rmb() and we're ready to start the copy. So either we flush with an IPI if we didn't read zero, or we block until we read zero; the difference is going to be hidden from do_wp_page. All do_wp_page cares about is that by the time the abstract call returns, there's no stale TLB left for such a pte. Whether that is achieved by blocking and waiting or by flushing the TLB doesn't matter too much. So, thinking of how badly the IPI will do: with the improved arm64 tlb flushing code in production, we keep track of how many simultaneous mm contexts there are, specifically to skip the SMP-unscalable TLBI broadcast on arm64 like we already avoid IPIs on lazy tlbs on x86 (see x86 tlb_is_not_lazy in native_flush_tlb_others). In other words the IPI will materialize only if there's more than one thread running while clear_refs runs. All lazy tlbs won't get IPIs on both x86 upstream and arm64 enterprise. This won't help multithreaded processes that compute from all CPUs at all times, but even multiple vcpu threads aren't always guaranteed to be running at all times. My main concern would be an IPI flood that slows down clear_refs and UFFDIO_WRITEPROTECT, but an incremental optimization (not required for correctness) is to have UFFDIO_WRITEPROTECT and clear_refs switch to lazy tlb mode before they call inc_tlb_flush_pending() and unlazy only after dec_tlb_flush_pending. So it's possible to at least guarantee the IPI won't slow them down.
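A minimal sketch of that "block until we read zero, then smp_rmb()" alternative, using C11 atomics in userspace. This is a toy model under stated assumptions: the per-VMA counter name follows the sketch patch earlier in the thread, and memory_order_acquire stands in for the smp_rmb() mentioned above; it is not a definitive implementation.

```c
#include <stdatomic.h>

/* Toy stand-in for the per-VMA counter from the sketch patch. */
struct toy_vma {
	atomic_int wrprotect_tlb_flush_pending;
};

/*
 * do_wp_page()-side wait: instead of issuing a TLB flush itself, the
 * COW fault spins until every pending wrprotect has flushed, i.e. the
 * counter reads zero. The acquire load orders the subsequent page copy
 * after the flusher's counter decrement (the smp_rmb() analogue).
 */
static void wait_wrprotect_flush(struct toy_vma *vma)
{
	while (atomic_load_explicit(&vma->wrprotect_tlb_flush_pending,
				    memory_order_acquire))
		; /* cpu_relax() in real kernel code */
}
```

After wait_wrprotect_flush() returns, no stale writable TLB entry can remain for the pte, so the copy can proceed; whether waiting beats an IPI flush depends on how long the wrprotect side keeps the counter elevated.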
> In addition, as I stated before, having some clean interfaces that tell > whether a TLB flush is needed or not would be helpful and simpler to follow. > For instance, we can have is_pte_prot_demotion(oldprot, newprot) to figure > out whether a TLB flush is needed in change_pte_range() and avoid > unnecessary flushes when unprotecting pages with either mprotect() or > userfaultfd. When you mentioned this earlier I was thinking what happens then with flush_tlb_fix_spurious_fault(). The fact it's safe doesn't guarantee it's a performance win if there's a stream of spurious faults as a result. So it'd need to be checked, especially in the case of mprotect where the flush can be deferred and coalesced in a single IPI at the end, so there's not so much to gain from it anyway. If you can guarantee there won't be a flood of spurious wrprotect faults, then it'll be a nice optimization. Thanks, Andrea
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 08:06:22PM +, Nadav Amit wrote: > I just thought that there might be some insinuation, as you mentioned VMware > by name. My response was half-jokingly and should have had a smiley to > prevent you from wasting your time on the explanation. No problem, actually I appreciate that you pointed it out and gave me the extra opportunity to further clarify I wasn't implying anything like that, sorry again for any confusion I may have generated. I mentioned vmware because I'd be shocked if for the whole duration of the wrprotect on the guest physical memory it'd have to halt all minor faults and all memory freeing the way it would happen to rust-vmm/qemu if we take the mmap_write_lock, that's all. Or am I wrong about this? For uffd-wp avoiding the mmap_write_lock isn't an immediate concern (obviously so in the rust-vmm case which won't even do postcopy live migration), but the above concern applies for the long term and maybe the mid term for qemu. The postcopy live snapshotting was the #1 use case so it's hard not to mention it, but there are still other interesting userland use cases of uffd-wp with various users already testing it in their apps, that may ultimately become more prevalent, who knows. The point is that those that will experiment with uffd-wp will run a benchmark, post a blog, others will see the blog, they will test too in their app and post their blog. It needs to deliver the full acceleration immediately, otherwise the evaluation may show it as a fail or not worth it. In theory we could just say, we'll optimize it later if a significant userbase emerges, but in my view it's a bit of a chicken-and-egg problem and I'm afraid that such theory may not work well in practice. Still, for the initial fix, avoiding the mmap_write_lock seems more important actually for clear_refs than for uffd-wp. uffd-wp is somewhat lucky and will just share any solution to keep clear_refs scalable, since the issue is identical. Thanks, Andrea
Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
On Tue, Jan 05, 2021 at 07:26:43PM +, Nadav Amit wrote: > > On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote: > > > > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes > >> tracking") > > > > Targeting a backport down to 2013 when nothing could go wrong in practice > > with page_mapcount sounds backwards and unnecessarily risky. > > > > In theory it was already broken and in theory > > 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the > > previous code of 2013 is completely wrong, but in practice the code > > from 2013 worked perfectly until Aug 21 2020. > > Well… If you consider the bug that Will recently fixed [1], then soft-dirty > was broken (for a different, yet related reason) since 0758cd830494 > ("asm-generic/tlb: avoid potential double flush”). > > This is not to say that I argue that the patch should be backported to 2013, > just to say that memory corruption bugs can be unnoticed. > > [1] > https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/ Is this a fix or a cleanup? The above is precisely what I said earlier: that tlb_gather had no reason to stay in clear_refs and that it had to use inc_tlb_flush_pending like mprotect. But it's not a fix, is it? I suggested it as a pure cleanup. So again no backport required. The commit says fix this but it means "clean this up". Now there are plenty of bugs that can go unnoticed for decades, including dirtycow and the very bug that allowed the fork child to attack the parent with vmsplice that ultimately motivated the page_mapcount->page_count change in do_wp_page in Aug 2020. Now let's take another example: 7066f0f933a1fd707bb38781866657769cff7efc, which also was found by source review only and never happened in practice, and unlike dirtycow and the vmsplice attack on the parent was not reproducible even at will after it was found (even then it wouldn't have been reproducibly exploitable).
So you can take 7066f0f933a1fd707bb38781866657769cff7efc as an example of a theoretical issue that might still crash the kernel if unlucky. So before 7066f0f933a1fd707bb38781866657769cff7efc, things worked by luck but not reliably so. How is all of the above relevant here? In my view none of it is. As I already stated, this specific issue, for both uffd-wp and clear_refs, wasn't even a theoretical bug before Aug 2020; it is not like 7066f0f933a1fd707bb38781866657769cff7efc, and it's not like https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/ which appears to be a pure cleanup and doesn't need backporting to any tree. The uffd-wp clear_refs corruption mathematically could not happen before Aug 2020; it worked by luck too, but unlike 7066f0f933a1fd707bb38781866657769cff7efc, reliably so. Philosophically I obviously agree the bug originated in 2013, but the stable trees don't feel like research material that would care about such an intellectual detail. So setting a Fixes tag back to 2013, which would go mess with all stable trees by actively backporting a performance regression to clear_refs that can break runtime performance, to fix a philosophical issue that isn't even a theoretical issue, doesn't sound ideal to me. > To summarize my action items based your (and others) feedback on both > patches: > > 1. I will break the first patch into two different patches, one with the > “optimization” for write-unprotect, based on your feedback. It will not > be backported. > > 2. I will try to add a patch to avoid TLB flushes on > userfaultfd-writeunprotect. It will also not be backported. I think 1 and 2 above could be in the same patch. Mixing an uffd-wp optimization with the actual fix for the memory corruption wasn't ideal, but doing the same optimization to both wrprotect and un-wrprotect in the same patch sounds ideal. The commit explanation would be identical and it can be de-duplicated this way.
I'd suggest coordinating with Peter on that, since I wasn't planning to work on this if somebody else offered to do it. > 3. Let me know if you want me to use your version of testing > mm_tlb_flush_pending() and conditionally flushing, wait for new version fro > you or Peter or to go with taking mmap_lock for write. Yes, as you suggested, I'm trying to clean it up and send a new version. Ultimately my view is there are a huge number of cases where mmap_write_lock, or some other heavy lock that will occasionally have to block on I/O, is simply impossible not to take. Even speculative page faults only attack the low hanging anon memory, and there's still MADV_DONTNEED/FREE and other stuff that may have to run in parallel with UFFDIO_WRITEPROTECT and clear_refs, not just p
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 07:05:22PM +, Nadav Amit wrote: > > On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote: > > I just don't like to slow down a feature required in the future for > > implementing postcopy live snapshotting or other snapshots to userland > > processes (for the non-KVM case, also unprivileged by default if using > > bounce buffers to feed the syscalls) that can be used by open source > > hypervisors to beat proprietary hypervisors like vmware. > > Ouch, that’s uncalled for. I am sure that you understand that I have no > hidden agenda and we all have the same goal. Ehm, I never said you had a hidden agenda, so I'm not exactly sure why you're accusing me of something I never said. I merely pointed out that one relevant justification for increasing kernel complexity here, by not slowing down the longstanding O(N) clear_refs/softdirty feature and the recent uffd-wp O(1) feature, is to be more competitive with proprietary software solutions, since at least for uffd-wp, postcopy live snapshotting is the #1 use case. I never questioned your contribution or your preference; to be even more explicit, it never crossed my mind that you have a hidden agenda. However, since everyone already acked your patches and I'm not ok with them because they will give a hit to KVM postcopy live snapshotting and other container clear_refs users, I have to justify why I NAK your patches, and remaining competitive with proprietary hypervisors is one of the reasons, so I don't see what is wrong with stating a tangible end goal here. Thanks, Andrea
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 01:41:34PM -0500, Peter Xu wrote: > Agreed. I didn't mention uffd_wp check (which I actually mentioned in the > reply > to v1 patchset) here only because the uffd_wp check is pure optimization; > while Agreed, it's a pure optimization. Only if we had used the group lock to fix this (which we didn't, since it wouldn't have helped clear_refs avoid the performance regression) would the optimization have stopped being an optimization. > the uffd_wp_resolve check is more critical because it is potentially a fix of > similar tlb flushing issue where we could have demoted the pte without being > noticed, so I think it's indeed more important as Nadav wanted to fix in the > same patch. I didn't get why that was touched in the same patch; I already suggested removing that optimization... > It would be even nicer if we have both covered (all of them can be in > unlikely() as Andrea suggested in the other email), then maybe nicer as a > standalone patch, then mention about the difference of the two in the commit > log (majorly, the resolving change will be more than optimization). Yes, if you want to go ahead optimizing both cases of UFFDIO_WRITEPROTECT, I don't think there's any dependency on this. huge_memory.c also needs covering, but I didn't look at it; hopefully the code will turn out as clean as in the pte case. I'll try to clean up the tlb flush in the meantime to see if it looks maintainable after the cleanups. Then we can change it to the wait_pending_flush(); return VM_FAULT_RETRY model if we want to, or if the IPI is slower; at least clear_refs will still not block on random pagein or swapin from disk, but only anon memory write access will block while clear_refs runs. Thanks, Andrea
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Mon, Jan 04, 2021 at 09:26:33PM +, Nadav Amit wrote: > I would feel more comfortable if you provide patches for uffd-wp. If you > want, I will do it, but I restate that I do not feel comfortable with this > solution (worried as it seems a bit ad-hoc and might leave out a scenario > we all missed or cause a TLB shootdown storm). > > As for soft-dirty, I thought that you said that you do not see a better > (backportable) solution for soft-dirty. Correct me if I am wrong. I think they should use the same technique, since they deal with the exact same challenge. I will try to clean up the patch in the meantime. I can also try to do the additional cleanups to clear_refs to eliminate the tlb_gather completely, since it doesn't gather any page and there is no point in using it. > Anyhow, I will add your comments regarding the stale TLB window to make the > description clearer. Having the mmap_write_lock solution as a backup won't hurt, but I think it's only plan B if plan A doesn't work, and the only stable tree that will have to apply this is v5.9.x. All earlier ones don't need any change in this respect. So there's no worry of rejects. It worked by luck until Aug 2020, but it did so reliably or somebody would have noticed already. And it's not exploitable either, it just works stable, but it was prone to break if the kernel changed in some other way, and it eventually changed in Aug 2020 when an unrelated change happened to the reuse logic. If you want to maintain the mmap_write_lock patch, it'd be ideal if you could drop the preserved_write part and adjust the Fixes tag to target Aug 2020. The uffd-wp side needs a different optimization that maybe Peter is already working on, or I can include it in the patchset for this, but definitely in a separate commit because it's orthogonal.
It's great you noticed the W->RO transition of un-wrprotect so we can optimize that too (it will have a positive runtime effect; it's not just theoretical since it's normal to un-wrprotect a huge range once the postcopy snapshotting of the virtual machine is complete). I was thinking of the previous case discussed in the other thread. I just don't like to slow down a feature required in the future for implementing postcopy live snapshotting or other snapshots to userland processes (for the non-KVM case, also unprivileged by default if using bounce buffers to feed the syscalls) that can be used by open source hypervisors to beat proprietary hypervisors like vmware. The security concern of uffd-wp, that it allows enlarging the window of use-after-free kernel bugs, is not as much of a concern as it is for regular processes. First, the jailer model can obtain the uffd before dropping all caps and before firing up seccomp in the child, so it won't even require lifting unprivileged_userfaultfd in the superior and cleaner monolithic jailer model. If the uffd and uffd-wp can only run in rust-vmm and qemu, that userland is system software to be trusted like the kernel from the guest point of view. It's similar to fuse: if somebody gets into the fuse process it can also stop kernel-initiated faults. In that respect fuse is also system software despite running in userland. In other words I think if there's a vm-escape that takes control of rust-vmm userland, the last worry is the fact that it can stop kernel-initiated page faults because the jailer took an uffd before dropping privs. Thanks, Andrea
Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup
On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: > Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") Targeting a backport down to 2013, when nothing could go wrong in practice with page_mapcount, sounds backwards and unnecessarily risky. In theory it was already broken and in theory 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is absolutely perfect and the previous code of 2013 is completely wrong, but in practice the code from 2013 worked perfectly until Aug 21 2020. Since nothing at all could go wrong in soft dirty and uffd-wp until 09854ba94c6aad7886996bfbee2530b3d8a7f4f4, the Fixes tag needs to target that, definitely not a patch from 2013. This means the backports will apply cleanly; they don't need a simple solution but one that doesn't regress the performance of open source virtual machines and open source products using clear_refs and uffd-wp in general. Thanks, Andrea
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 10:08:13AM -0500, Peter Xu wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > diff --git a/mm/mprotect.c b/mm/mprotect.c > > index ab709023e9aa..c08c4055b051 100644 > > --- a/mm/mprotect.c > > +++ b/mm/mprotect.c > > @@ -75,7 +75,8 @@ static unsigned long change_pte_range(struct > > vm_area_struct *vma, pmd_t *pmd, > > oldpte = *pte; > > if (pte_present(oldpte)) { > > pte_t ptent; > > - bool preserve_write = prot_numa && pte_write(oldpte); > > + bool preserve_write = (prot_numa || uffd_wp_resolve) && > > + pte_write(oldpte); > > Irrelevant of the other tlb issue, this is a standalone one and I commented in > v1 about simply ignore the change if necessary; unluckily that seems to be > ignored.. so I'll try again - would below be slightly better? > > if (uffd_wp_resolve && !pte_uffd_wp(oldpte)) > continue; I posted the exact same code before seeing the above, so I take it as a good sign :). I'd suggest adding the reverse check to the uffd_wp case too. > Firstly, current patch is confusing at least to me, because "uffd_wp_resolve" > means "unprotect the pte", whose write bit should mostly be cleared already > when uffd_wp_resolve is applicable. Then "preserve_write" for that pte looks > odd already. > > Meanwhile, if that really happens (when pte write bit set, but during a > uffd_wp_resolve request) imho there is really nothing we can do, so we should > simply avoid touching that at all, and also avoid ptep_modify_prot_start, > pte_modify, ptep_modify_prot_commit, calls etc., which takes extra cost. Agreed. It should not just defer the flush; by doing continue we will not flush anything. So ultimately the above will be an orthogonal optimization, but now I get why the deferred tlb flush on cpu0 in the v2 patch was the problematic one. I didn't see that we lacked the above optimization, and I thought we were still discussing the regular case where un-wrprotect is called on a pte with uffd-wp set. thanks, Andrea
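The "just continue" idea discussed above can be modeled with a toy pte bitmask. This is an illustrative userspace sketch under stated assumptions: the bit layout and the toy_change_pte() helper are invented for the example, while real pte helpers such as pte_uffd_wp() are architecture-specific.

```c
#include <stdbool.h>

#define PTE_WRITE   (1u << 0)
#define PTE_UFFD_WP (1u << 1)

/*
 * Toy model of the skip logic for change_pte_range(): return true if
 * the pte was actually touched (and hence a TLB flush may be owed).
 * If the uffd-wp state wouldn't change, leave the pte entirely alone.
 */
static bool toy_change_pte(unsigned *pte, bool uffd_wp, bool uffd_wp_resolve)
{
	if (uffd_wp) {
		if (*pte & PTE_UFFD_WP)
			return false;      /* already wrprotected: skip */
		*pte &= ~PTE_WRITE;        /* wrprotect */
		*pte |= PTE_UFFD_WP;
		return true;
	} else if (uffd_wp_resolve) {
		if (!(*pte & PTE_UFFD_WP))
			return false;      /* not uffd-wp: nothing to resolve */
		*pte &= ~PTE_UFFD_WP;      /* write bit restored later by the fault */
		return true;
	}
	return false;
}
```

The payoff is visible in the first case of the usage below: un-wrprotect on a pte that was never uffd-wp is a pure no-op, so no write bit is cleared and no TLB flush ever becomes due, which is exactly the window the thread is worried about.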
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote: > (your other email clarified this point; the COW needs to copy while > holding the PTL and we need TLBI under PTL if we're to change this) The COW doesn't need to hold the PT lock, and the TLBI broadcast doesn't need to be delivered under the PT lock either. There simply needs to be a TLBI broadcast before the copy. The patch I sent here https://lkml.kernel.org/r/x+qlr1wmgxms3...@redhat.com, which needs to be cleaned up with some abstraction and better commentary, also misses a smp_mb() in the case where flush_tlb_page is not called, but that's a small detail. > And I'm thinking the speculative page fault series steps right into all > this, it fundamentally avoids mmap_sem and entirely relies on the PTL. I thought about that, but that only applies to some kind of "anon" page fault. Here the problem isn't just the page fault; the problem is not to regress clear_refs to block on page fault I/O, and all MAP_PRIVATE/MAP_SHARED filebacked faults hitting the disk to read /usr/ will still prevent clear_refs from running (and the other way around) if it has to take the mmap_sem for writing. I haven't looked at the speculative page fault series in a while, but last I checked there was nothing there that could tame the above major regression from CPU speed to disk I/O speed that would be inflicted both on clear_refs on a huge mm and on uffd-wp. Thanks, Andrea
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Jan 05, 2021 at 09:58:57AM +0100, Peter Zijlstra wrote: > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > > > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > > > > > cpu0cpu1cpu2 > > > > > > > > [ Writable PTE > > > > cached in TLB > > > > ] > > > > userfaultfd_writeprotect() > > > > [ write-*unprotect* ] > > > > mwriteprotect_range() > > > > mmap_read_lock() > > > > change_protection() > > > > > > > > change_protection_range() > > > > ... > > > > change_pte_range() > > > > [ *clear* “write”-bit ] > > > > [ defer TLB flushes ] > > > > [ page-fault ] > > > > ... > > > > wp_page_copy() > > > > cow_user_page() > > > > [ copy page ] > > > > [ write to old > > > > page ] > > > > ... > > > > set_pte_at_notify() > > > > > > Yuck! > > > > > > > Note, the above was posted before we figured out the details so it > > wasn't showing the real deferred tlb flush that caused problems (the > > one showed on the left causes zero issues). > > > > The problematic one not pictured is the one of the wrprotect that has > > to be running in another CPU which is also isn't picture above. More > > accurate traces are posted later in the thread. > > Lets assume CPU0 does a read-lock, W -> RO with deferred flush. I was mistaken saying the deferred tlb flush was not shown in the v2 trace; it's just that this appears to be a new, different case we didn't happen to consider before. In the previous case we discussed earlier, when un-wrprotect above is called it never should have been a W->RO since a wrprotect ran first. Doesn't it ring a bell that if an un-wrprotect does a W->RO transition, something is going a bit backwards? I don't recall from the previous discussion that un-wrprotect was considered to be called on read-write memory. I think we need the below change to fix this new case.
if (uffd_wp) { + if (unlikely(pte_uffd_wp(oldpte))) + continue; ptent = pte_wrprotect(ptent); ptent = pte_mkuffd_wp(ptent); } else if (uffd_wp_resolve) { + if (unlikely(!pte_uffd_wp(oldpte))) + continue; /* * Leave the write bit to be handled * by PF interrupt handler, then * things like COW could be properly * handled. */ ptent = pte_clear_uffd_wp(ptent); } I now get why the v2 patch touches preserved_write, but this is not about preserve_write, it's not about leaving the write bit alone. This is about leaving the whole pte alone if the uffd-wp bit doesn't actually change. We shouldn't just defer the tlb flush if un-wprotect is called on read-write memory: we should not have flushed the tlb at all in such case. Same for hugepmd in huge_memory.c which will be somewhere else. Once the above is optimized, then un-wrprotect as in MM_CP_UFFD_WP_RESOLVE is usually preceded by wrprotect as in MM_CP_UFFD_WP, and so it'll never be a W->RO but a RO->RO transition that just clears the uffd_wp flag and nothing else and whose tlb flush is in turn irrelevant. The fix discussed still works for this new case too: I'm not suggesting we should rely on the above optimization for the tlb safety. The above is just a missing optimization. > > > Isn't this all rather similar to the problem that resulted in the > > > tlb_flush_pending mess? > > > > > > I still think that's all fundamentally buggered, th
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Mon, Jan 04, 2021 at 08:39:37PM +, Nadav Amit wrote: > > On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote: > > > > On Mon, Jan 04, 2021 at 07:35:06PM +, Nadav Amit wrote: > >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote: > >>> > >>> Hello, > >>> > >>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > >>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >>>> > >>>>> The scenario that happens in selftests/vm/userfaultfd is as follows: > >>>>> > >>>>> cpu0cpu1cpu2 > >>>>> > >>>>> [ Writable PTE > >>>>> cached in TLB > >>>>> ] > >>>>> userfaultfd_writeprotect() > >>>>> [ write-*unprotect* ] > >>>>> mwriteprotect_range() > >>>>> mmap_read_lock() > >>>>> change_protection() > >>>>> > >>>>> change_protection_range() > >>>>> ... > >>>>> change_pte_range() > >>>>> [ *clear* “write”-bit ] > >>>>> [ defer TLB flushes ] > >>>>> [ page-fault ] > >>>>> ... > >>>>> wp_page_copy() > >>>>> cow_user_page() > >>>>> [ copy page ] > >>>>> [ write to old > >>>>> page ] > >>>>> ... > >>>>> set_pte_at_notify() > >>>> > >>>> Yuck! > >>> > >>> Note, the above was posted before we figured out the details so it > >>> wasn't showing the real deferred tlb flush that caused problems (the > >>> one showed on the left causes zero issues). > >> > >> Actually it was posted after (note that this is v2). The aforementioned > >> scenario that Peter regards to is the one that I actually encountered (not > >> the second scenario that is “theoretical”). This scenario that Peter > >> regards > >> is indeed more “stupid” in the sense that we should just not write-protect > >> the PTE on userfaultfd write-unprotect. > >> > >> Let me know if I made any mistake in the description. > > > > I didn't say there is a mistake. I said it is not showing the real > > deferred tlb flush that cause problems. > > > > The issue here is that we have a "defer tlb flush" that runs after > > "write to old page". 
> > > > If you look at the above, you're induced to think the "defer tlb > > flush" that causes issues is the one in cpu0. It's not. That is > > totally harmless. > > I do not understand what you say. The deferred TLB flush on cpu0 *is* the > the one that causes the problem. The PTE is write-protected (although it is > a userfaultfd unprotect operation), causing cpu1 to encounter a #PF, handle > the page-fault (and copy), while cpu2 keeps writing to the source page. If > cpu0 did not defer the TLB flush, this problem would not happen. Your argument "If cpu0 did not defer the TLB flush, this problem would not happen" is identical to "if the cpu0 has a small TLB size and the tlb entry is recycled, the problem would not happen". There are a multitude of factors that are unrelated to the real problematic deferred tlb flush that may happen and still won't cause the issue, including a parallel IPI. The point is that we don't need to worry about the "defer TLB flushes" of the un-wrprotect, because you said earlier that deferring tlb flushes when you're doing "permission promotions" does not cause problems. The only "deferred tlb flush" we need to worry about, not in the picture, is the one following the actual permission removal (the wrprotection). > it shows the write that triggers the corruption instead of discussing > “windows”, which might be less clear. Running copy_user_page() with stale I think showing exactly where the race window opens is key to understand the issue, but then that's the w
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
On Mon, Jan 04, 2021 at 07:35:06PM +, Nadav Amit wrote: > > On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote: > > > > Hello, > > > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > >> > >>> The scenario that happens in selftests/vm/userfaultfd is as follows: > >>> > >>> cpu0 cpu1cpu2 > >>> > >>> [ Writable PTE > >>> cached in TLB ] > >>> userfaultfd_writeprotect() > >>> [ write-*unprotect* ] > >>> mwriteprotect_range() > >>> mmap_read_lock() > >>> change_protection() > >>> > >>> change_protection_range() > >>> ... > >>> change_pte_range() > >>> [ *clear* “write”-bit ] > >>> [ defer TLB flushes ] > >>> [ page-fault ] > >>> ... > >>> wp_page_copy() > >>>cow_user_page() > >>> [ copy page ] > >>> [ write to old > >>> page ] > >>> ... > >>>set_pte_at_notify() > >> > >> Yuck! > > > > Note, the above was posted before we figured out the details so it > > wasn't showing the real deferred tlb flush that caused problems (the > > one showed on the left causes zero issues). > > Actually it was posted after (note that this is v2). The aforementioned > scenario that Peter regards to is the one that I actually encountered (not > the second scenario that is “theoretical”). This scenario that Peter regards > is indeed more “stupid” in the sense that we should just not write-protect > the PTE on userfaultfd write-unprotect. > > Let me know if I made any mistake in the description. I didn't say there is a mistake. I said it is not showing the real deferred tlb flush that cause problems. The issue here is that we have a "defer tlb flush" that runs after "write to old page". If you look at the above, you're induced to think the "defer tlb flush" that causes issues is the one in cpu0. It's not. That is totally harmless. > > > The problematic one not pictured is the one of the wrprotect that has > > to be running in another CPU which is also isn't picture above. 
More > > accurate traces are posted later in the thread. > > I think I included this scenario as well in the commit log (of v2). Let me > know if I screwed up and the description is not clear. Instead of not showing the real "defer tlb flush" in the trace and then fixing it up in the comment, why don't you take the trace that shows the real problematic "defer tlb flush"? No need to reinvent it. https://lkml.kernel.org/r/x+jjqk91plkbv...@redhat.com See here the detail underlined: deferred tlb flush <- too late XX BUG RACE window close here This shows the real deferred tlb flush; your v2 still does not include it.
Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect
Hello, On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: > > > The scenario that happens in selftests/vm/userfaultfd is as follows: > > > > cpu0cpu1cpu2 > > > > [ Writable PTE > > cached in TLB ] > > userfaultfd_writeprotect() > > [ write-*unprotect* ] > > mwriteprotect_range() > > mmap_read_lock() > > change_protection() > > > > change_protection_range() > > ... > > change_pte_range() > > [ *clear* “write”-bit ] > > [ defer TLB flushes ] > > [ page-fault ] > > ... > > wp_page_copy() > > cow_user_page() > > [ copy page ] > > [ write to old > > page ] > > ... > > set_pte_at_notify() > > Yuck! > Note, the above was posted before we figured out the details so it wasn't showing the real deferred tlb flush that caused problems (the one shown on the left causes zero issues). The problematic one, not pictured, is the deferred flush of the wrprotect that has to be running on another CPU, which also isn't pictured above. More accurate traces are posted later in the thread. > Isn't this all rather similar to the problem that resulted in the > tlb_flush_pending mess? > > I still think that's all fundamentally buggered, the much saner solution > (IMO) would've been to make things wait for the pending flush, instead How do you intend to wait under the PT lock while the writer also has to take the PT lock repeatedly before it can do wake_up_var? If you release the PT lock before calling wait_tlb_flush_pending it all falls apart again. This, I guess, explains why a local pte/hugepmd invlpg is the only working solution for this issue, similarly to how it's done in rmap. > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 07d9acb5b19c..0210547ac424 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -649,7 +649,8 @@ static inline void dec_tlb_flush_pending(struct mm_struct > *mm) >* >* Therefore we must rely on tlb_flush_*() to guarantee order.
>*/ > - atomic_dec(&mm->tlb_flush_pending); > + if (atomic_dec_and_test(&mm->tlb_flush_pending)) > + wake_up_var(&mm->tlb_flush_pending); > } > > static inline bool mm_tlb_flush_pending(struct mm_struct *mm) > @@ -677,6 +678,12 @@ static inline bool mm_tlb_flush_nested(struct mm_struct > *mm) > return atomic_read(&mm->tlb_flush_pending) > 1; > } > > +static inline void wait_tlb_flush_pending(struct mm_struct *mm) > +{ > + wait_var_event(&mm->tlb_flush_pending, > +atomic_read(&mm->tlb_flush_pending) == 0); > +} I appreciate the effort in not regressing soft dirty and uffd-wp writeprotect to disk-I/O spindle bandwidth and not using mmap_sem for writing. At the same time, what was posted so far wasn't clean enough, and it wasn't even tested. If we abstract it in some clean way and mark all the connected points (soft dirty, uffd-wp, the wrprotect page fault), then I can be optimistic it will remain understandable when we look at it again a few years down the road. Or at the very least it can't get worse than the "tlb_flush_pending mess" you mentioned above. flush_tlb_batched_pending() has to be orthogonally re-reviewed for those things Nadav pointed out. But I'd rather keep that review in a separate thread, since any bug in that code has zero connection to this issue. The basic idea is similar but the methods and logic are different: our flush here will be granular and it's only going to run if VM_SOFTDIRTY isn't set and soft dirty is compiled in, or if VM_UFFD_WP is set. flush_tlb_batched_pending() is mm-wide, unconditional etc.. Pretty much all different. Thanks, Andrea
Re: kernelci/staging-next bisection: sleep.login on rk3288-rock2-square #2286-staging
Hello Mike, On Sun, Jan 03, 2021 at 03:47:53PM +0200, Mike Rapoport wrote: > Thanks for the logs, it seems that implicitly adding reserved regions to > memblock.memory wasn't that bright idea :) Would it be possible to somehow clean up the hack then? The only difference between the clean solution and the hack is that the hack intended to achieve the exact same thing, but without adding the reserved regions to memblock.memory. The comment on that problematic area says the reserved area cannot be used for DMA because of some unexplained hw issue, and that doing so prevents booting, but since the area got reserved, even with the clean solution, it should never have been used for DMA anyway? So I can only imagine that the physical memory region is way more problematic than just for DMA. It sounds like anything that touches it, including the CPU, will hang the system, not just DMA. It sounds somewhat similar to the other e820 direct mapping issue on x86? If you want to test the hack on the arm board to check if it boots you can use the below commit: https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=c3ea2633015104ce0df33dcddbc36f57de1392bc Thanks, Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Thu, Dec 24, 2020 at 01:49:45PM -0500, Andrea Arcangeli wrote: > Without the above, can't the CPU decrement the tlb_flush_pending while > the IPI to ack didn't arrive yet in csd_lock_wait? Ehm: csd_lock_wait has smp_acquire__after_ctrl_dep(), so the write side looks ok after all, sorry.
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 09:18:09PM -0800, Nadav Amit wrote: > I am not trying to be argumentative, and I did not think through about an > alternative solution. It sounds to me that your proposed solution is correct > and would probably be eventually (slightly) more efficient than anything > that I can propose. On a side note, on the last proposed solution, I've been wondering about the memory ordering mm_tlb_flush_pending. There's plenty of locking before the point where the actual data is read, but it's not a release/acquire full barrier (or more accurately it is only on x86), so smp_rmb() seems needed before cow_user_page to be sure no data can be read before we read the value of mm_tlb_flush_pending. To avoid an explicit smp_rmb() and to inherit the implicit smp_mb() from the release/acquire of PT unlock/PT lock, we'd need to put the flush before the previous PT unlock. (note, not after the next PT lock or it'd be even worse). So this made me look also the inc/dec: + smp_mb__before_atomic(); atomic_dec(&mm->tlb_flush_pending); Without the above, can't the CPU decrement the tlb_flush_pending while the IPI to ack didn't arrive yet in csd_lock_wait? The smp_rmb() is still needed in the page fault (implicit or explicit doesn't matter), but we also need a smp_mb__before_atomic() above to really make this work. > Yet, I do want to explain my position. Reasoning on TLB flushes is hard, as > this long thread shows. The question is whether it has to be so hard. In > theory, we can only think about architectural considerations - whether a PTE > permissions are promoted/demoted and whether the PTE was changed/cleared. > > Obviously, it is more complex than that. Yet, once you add into the equation > various parameters such as the VMA flags or whether a page is locked (which > Mel told me was once a consideration), things become much more complicated. 
> If all the logic of TLB flushes had been concentrated in a single point and > maintenance of this code did not require thought about users and use-cases, > I think things would have been much simpler, at least for me. In your previous email you also suggested to add range invalidates and bloom filter to index them by the address in the page fault since it'd help MADV_PAGEOUT. That would increase complexity and it won't go exactly in the above direction. I assume that here nobody wants to add gratuitous complexity, and I never suggested the code shouldn't become simpler and easier to understand and maintain. But we can't solve everything in a single thread in terms of cleaning up and auditing the entirety of the TLB code. To refocus: the only short term objective in this thread, is to fix the data corruption in uffd-wp and clear_refs_write without introducing new performance regressions compared to the previous clear_refs_write runtime. Once that is fixed and we didn't introduce a performance regression while fixing an userland memory corruption regression (i.e. not really a regression in theory, but in practice it is because it worked by luck), then we can look at all the rest without hurry. So if already want to start the cleanups like I mentioned in a previous email, and I'll say it more explicitly, the tlb gather is for freeing memory, it's just pure overhead and gives a false sense of security, like it can make any difference, when just changing protection with mmap_read_lock. It wouldn't be needed with the write lock and it can't help solve the races that trigger with mmap_read_lock either. It is needed when you have to store the page pointer outside the pte, clear a pte, flush the tlb and only then put_page. So it is needed to keep tracks of which pages got cleared in the ptes so you don't have to issue a tlb flush for each single pte that gets cleared. 
So the only case when to use the tlb_gather is when you must make the pte stop pointing to the page and you need an external storage that will keep track of those pages that we cannot yet lose track of since we didn't do put_page yet. That kind of external storage to keep track of the pages that have pending tlb flushes, is never needed in change_protection and clear_refs_write. change_protection is already correct in that respect, it doesn't use the tlb_gather. clear_refs_write is not ok and it need to stop using tlb_gather_mmu/tlb_finish_mmu. * tlb_gather_mmu - initialize an mmu_gather structure for page-table tear-down See also the tear-down mention which doesn't apply to clear_refs_write. clear_refs_write needs to become identical to change_protection_range. Just increasing inc_tlb_flush_pending and then doing a flush of the range right before the dec_tlb_flush_pending. That change is actually orthogonal to the regression fix to teach the COW to deal with stale TLB entries and it will clean up the code. So to pursue your simplification objective you could already go ahead doing this improvement since it's very confusing to see the clear_refs_write use something that it sh
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote: > > I think there are other cases in which Andy’s concern is relevant > > (MADV_PAGEOUT). I didn't try to figure out how it would help MADV_PAGEOUT. MADV_PAGEOUT sounds like a cool feature, but maybe it'd need a way to flush the invalidates out and to take a static key to enable the buildup of those ranges? I wouldn't like to slow down the fast paths even for MADV_PAGEOUT, and the same applies to uffd-wp and softdirty in fact. On Wed, Dec 23, 2020 at 08:34:00PM -0700, Yu Zhao wrote: > That patch only demonstrate a rough idea and I should have been > elaborate: if we ever decide to go that direction, we only need to > worry about "jumping through hoops", because the final patch (set) > I have in mind would not only have the build time optimization Andrea > suggested but also include runtime optimizations like skipping > do_swap_page() path and (!PageAnon() || page_mapcount > 1). Rest > assured, the performance impact on do_wp_page() from occasionally an > additional TLB flush on top of a page copy is negligible. Agreed, I just posted something in that direction. Feel free to refactor it, it's just a tentative implementation to show something that may deliver all the performance we need in all the cases involved, without slowing down the fast paths of non-uffd and non-softdirty (well, 1 branch). > On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote: > > Perhaps holding some small bitmap based on part of the deferred flushed > > pages (e.g., bits 12-17 of the address or some other kind of a single > > hash-function bloom-filter) would be more performant to avoid (most) The concerns here aren't only the page faults having to run the bloom filter, but also how to manage the RAM storage pointed to by the bloom filter, or whatever index into that storage, which would slow down mprotect.
Granted, mprotect is slow to begin with, but the idea is that we can't make it any slower just to make MADV_PAGEOUT or uffd-wp or clear_refs run faster, since mprotect is too important and too frequent in comparison. Just to restrict the potential false positive IPI caused by page_count's inevitable inaccuracies to uffd-wp and softdirty runtimes, a simple check on vm_flags should be enough. Thanks, Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote: > One other near zero cost improvement easy to add if this would be "if > (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made The next worry then is that if UFFDIO_WRITEPROTECT is very large, there would be a flood of wrprotect faults, and they'd all end up issuing a tlb flush during the UFFDIO_WRITEPROTECT itself, which again is a performance concern for certain uffd-wp use cases. Those use cases would be, for example, redis and postcopy live snapshotting, using it for async snapshots; unprivileged too in the case of redis, if it temporarily uses bounce buffers for the syscall I/O for the duration of the snapshot. Hypervisor tuned profiles need to manually lift unprivileged_userfaultfd to 1, unless their jailer leaves one capability in the snapshot thread. Moving the check after userfaultfd_pte_wp would solve userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid a flood of tlb flushes during userfaultfd_writeprotect(mode_wp=false) because change_protection doesn't restore the pte_write: } else if (uffd_wp_resolve) { /* * Leave the write bit to be handled * by PF interrupt handler, then * things like COW could be properly * handled. */ ptent = pte_clear_uffd_wp(ptent); } When the snapshot is complete, userfaultfd_writeprotect(mode_wp=false) would need to run again on the whole range, which can be very big. Orthogonally, I think we should also look at restoring the pte_write above in uffd-wp, so it'll get yet an extra boost compared to current redis snapshotting with fork(), which cannot restore pte_write after the snapshot child quits and forces a flood of spurious wrprotect faults (uffd-wp can solve that too).
However, even if uffd-wp restored the pte_write, things would remain suboptimal for a terabyte process under clear_refs, since softdirty wrprotect faults that start happening while softdirty is still running on the mm, won't be caught in userfaultfd_pte_wp. Something like below, if cleaned up, abstracted properly and documented well in the two places involved, will have a better chance to perform optimally for softdirty too. And on a side note the CONFIG_MEM_SOFT_DIRTY compile time check is compulsory because VM_SOFTDIRTY is defined to zero if softdirty is not built in. (for VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP can be removed and it won't make any measurable difference even when USERFAULTFD=n) RFC untested below, it's supposed to fix the softdirty testcase too, even without the incremental fix, since it already does tlb_gather_mmu before walk_page_range and tlb_finish_mmu after it and that appears enough to define the inc/dec_tlb_flush_pending. diff --git a/mm/memory.c b/mm/memory.c index 7d608765932b..66fd6d070c47 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (!new_page) goto oom; } else { + bool in_uffd_wp, in_softdirty; new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address); if (!new_page) goto oom; +#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP + in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP); +#else + in_uffd_wp = false; +#endif +#ifdef CONFIG_MEM_SOFT_DIRTY + in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY); +#else + in_softdirty = false; +#endif + if ((in_uffd_wp || in_softdirty) && + mm_tlb_flush_pending(mm)) + flush_tlb_page(vma, vmf->address); + if (!cow_user_page(new_page, old_page, vmf)) { /* * COW failed, if the fault was solved by other,
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote: > I don’t love this as a long term fix. AFAICT we can have mm_tlb_flush_pending > set for quite a while — mprotect seems like it can wait in IO while splitting > a huge page, for example. That gives us a window in which every write fault > turns into a TLB flush. mprotect can't run concurrently with a page fault in the first place. One other near zero cost improvement easy to add if this would be "if (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made conditional to the two config options too. Still I don't mind doing it in some other way, uffd-wp has much easier time doing it in another way in fact. Whatever performs better is fine, but queuing up pending invalidate ranges don't look very attractive since it'd be a fixed cost that we'd always have to pay even when there's no fault (and there can't be any fault at least for mprotect). Thanks, Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Hello Linus, On Wed, Dec 23, 2020 at 03:39:53PM -0800, Linus Torvalds wrote: > On Wed, Dec 23, 2020 at 1:39 PM Andrea Arcangeli wrote: > > > > On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote: > > > Thanks for the details. > > > > I hope we can find a way put the page_mapcount back where there's a > > page_count right now. > > I really don't think that's ever going to happen - at least if you're > talking about do_wp_page(). Yes, I was referring to do_wp_page(). > I refuse to make the *simple* core operations VM have to jump through > hoops, and the old COW mapcount logic was that. I *much* prefer the > newer COW code, because the rules are more straightforward. Agreed. I don't have any practical alternative proposal in fact, it was only an hypothetical wish/hope, echoing Hugh's view on this matter. > > page_count is far from optimal, but it is a feature it finally allowed > > us to notice that various code (clear_refs_write included apparently > > even after the fix) leaves stale too permissive TLB entries when it > > shouldn't. > > I absolutely agree that page_count isn't exactly optimal, but > "mapcount" is just so much worse. Agreed. The "isn't exactly optimal" part is the only explanation for wish/hope. > page_count() is at least _logical_, and has a very clear meaning: > "this page has other users". mapcount() means something else, and is > simply not sufficient or relevant wrt COW. > > That doesn't mean that page_mapcount() is wrong - it's just that it's > wrong for COW. page_mapcount() is great for rmap, so that we can see > when we need to shoot down a memory mapping of a page that gets > released (truncate being the classic example). > > I think that the mapcount games we used to have were horrible. I > absolutely much prefer where we are now wrt COW. > > The modern rules for COW handling are: > > - if we got a COW fault there might be another user, we copy (and > this is where the page count makes so much logical sense). 
> > - if somebody needs to pin the page in the VM, we either make sure > that it is pre-COWed and we > >(a) either never turn it back into a COW page (ie the fork()-time > stuff we have for pinned pages) > >(b) or there is some explicit marker on the page or in the page > table (ie the userfaultfd_pte_wp thing). > > those are _so_ much more straightforward than the very complex rules > we used to have that were actively buggy, in addition to requiring the > page lock. So they were buggy and slow. Agreed, this is a very solid solution that solves the problem the mapcount had with GUP pins. The only alternative, but very vapourware, thought I had on this matter is that all troubles start when we let a GUP-pinned page be unmapped from the pgtables. As I already told Jens, io_uring could use mmu notifiers already (that would make any io_uring GUP pin not count anymore with regard to the page_mapcount vs page_count difference), and vmsplice also needs some handling, or maybe it should become privileged. The above two improvements are orthogonal to the COW issue, since long term GUP pins do more than just mlock and they break most advanced VM features. It's not ok to just account GUP pins in the mlock rlimit. The above two improvements however would go in the direction of making mapcount more suitable for COW, but it'd still not be enough. The transient GUP pins would also need to be taken care of, but we could wait for those to be released, since those are guaranteed to be released (or something else, not just munmap()/MADV_DONTNEED, will remain in D state). Anyway.. until io_uring and vmsplice are solved first, it's pointless to even wonder about transient GUP pins. > And yes, I had forgotten about that userfaultfd_pte_wp() because I was > being myopic and only looking at wp_page_copy().
So using that as a > way to make sure that a page doesn't go through COW is a good way to > avoid the COW race, but I think that thing requires a bit in the page > table which might be a problem on some architectures? Correct, it requires a bit in the pgtable. The bit is required in order to disambiguate which regions have been marked for wrprotect faults and which didn't. A practical example would be: fork() called by an app with a very large vma with VM_UFFD_WP set. There would be a storm of WP userfaults if we didn't have the software bit to detect exactly which pte were wrprotected by uffd-wp and which ones were wrprotected by fork. That bit then sends the COW fault to a safe place if wrprotect is running (the problem is we didn't see the un-wrprotect would clear the bit while the TLB flush of the wrprotect could be still pending). The page_mapcount I gue
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 02:45:59PM -0800, Nadav Amit wrote: > I think it may be reasonable. Whatever solution is used, there will be 2 users of it: uffd-wp will use whatever technique is used by clear_refs_write to avoid the mmap_write_lock. My favorite is Yu's patch, not the group lock anymore. The con is that it changes the VM rules (which kind of reminds me of my initial proposal of adding a spurious tlb flush if mm_tlb_flush_pending is set, except I didn't correctly specify it'd need to go in the page fault), but it still appears the simplest. > Just a proposal: At some point we can also ask ourselves whether the > “artificial" limitation of the number of software bits per PTE should really > limit us, or do we want to hold some additional metadata per-PTE by either > putting it in an adjacent page (holding 64-bits of additional software-bits > per PTE) or by finding some place in the page-struct to link to this > metadata (and have the liberty of number of bits per PTE). One of the PTE > software-bits can be repurposed to say whether there is “extra-metadata” > associated with the PTE. > > I am fully aware that there will be some overhead associated, but it > can be limited to less-common use-cases. That's a good point; so far we didn't run out, so it's not an immediate concern (as opposed to page->flags, where we did run out and PG_tail went to some LSB). In general kicking the can down the road sounds like the best thing to do for those bit shortage matters, until we can't anymore at least.. There's no gain to the kernel runtime in doing something generically good here (again see where PG_tail rightfully went). So before spending RAM and CPU, we'd need to find a more compact encoding with the bits we already have available. This reminds me again that we could double check if we could make VM_UFFD_WP mutually exclusive with VM_SOFTDIRTY.
I wasn't sure if it could ever happen in a legit way to use both at the same time (CRIU on an app using uffd-wp for its own internal mm management?). Theoretically it's too late already for it, but VM_UFFD_WP is relatively new; if we're sure it cannot ever happen in a legit way, it would be possible to at least evaluate/discuss it. This is an immediate matter. What we'll do if we later run out is not an immediate matter, because resolving it now won't make our life any simpler. Thanks, Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote: > I was hesitant to suggest the following because it isn't that straight > forward. But since you seem to be less concerned with the complexity, > I'll just bring it on the table -- it would take care of both ufd and > clear_refs_write, wouldn't it? It certainly would, since this is basically declaring that "leaving stale TLB entries past mmap_read_lock" is now permitted as long as the pending flush counter is elevated under mmap_sem for reading. Anything that prevents uffd-wp from taking the mmap_write_lock looks great to me, anything, the below included, as long as it looks easy to enforce and understand. And the below certainly is. My view is that the below is at the very least an improvement in terms of total complexity, compared to v5.10. At least it'll be documented. What would not be ok to me is to depend on undocumented, not guaranteed behavior in do_wp_page like the page_mapcount, which is what we had until now in clear_refs_write and in uffd-wp (but only if wrprotect raced against un-wrprotect, a tiny window if compared to clear_refs_write). Documenting that clearing pte_write and deferring the flush is allowed if mm_tlb_flush_pending was elevated before taking the PT lock is a less complex and very well defined rule, compared to what we had before with the page_mapcount dependency of clear_refs_write, which was prone to break, and in fact it just did. We'll need a commentary in both userfaultfd_writeprotect and clear_refs_write that links to the below snippet. If we abstract it in a header we can hide there also a #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y && CONFIG_USERFAULTFD=y to make it even more explicit. However it may be simpler to keep it unconditional, I don't mind either way. If it were up to me I'd tie it to those 3 config options, to be sure I remember where it comes from and to force any other user to register, to be explicit that they depend on it.
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		goto unlock;
>  	}
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
>  			return do_wp_page(vmf);
> +		}
>  		entry = pte_mkdirty(entry);
>  	}
>  	entry = pte_mkyoung(entry);
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
> > On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
> >
> > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
> >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote:
> >>> Using mmap_write_lock() was my initial fix and there was a strong pushback
> >>> on this approach due to its potential impact on performance.
> >>
> >> From whom?
> >>
> >> Somebody who doesn't understand that correctness is more important
> >> than performance? And that userfaultfd is not the most important part
> >> of the system?
> >>
> >> The fact is, userfaultfd is CLEARLY BUGGY.
> >>
> >> Linus
> >
> > Fair enough.
> >
> > Nadav, for your patch (you might want to update the commit message).
> >
> > Reviewed-by: Yu Zhao
> >
> > While we are all here, there is also clear_soft_dirty() that could
> > use a similar fix…
>
> Just an update as for why I have still not sent v2: I fixed
> clear_soft_dirty(), created a reproducer, and the reproducer kept failing.
>
> So after some debugging, it appears that clear_refs_write() does not flush
> the TLB. It indeed calls tlb_finish_mmu() but since 0758cd830494
> ("asm-generic/tlb: avoid potential double flush"), tlb_finish_mmu() does not
> flush the TLB since clear_refs_write() does not call
> __tlb_adjust_range() (unless nested TLB flushes are pending).
>
> So I have a patch for this issue too: arguably the tlb_gather interface is
> not the right one for clear_refs_write() that does not clear PTEs but
> changes them.
>
> Yet, sadly, my reproducer keeps failing (less frequently, but still). So I
> will keep debugging to see what goes wrong. I will send v2 once I figure out
> what the heck is wrong in the code or my reproducer.

If you put the page_mapcount check back in do_wp_page instead of page_count, it'll stop reproducing, but the bug is still very much there...
It's a feature that page_count finally shows you the corruption now, by virtue of page_count being totally unreliable with all the speculative pagecache lookups randomly elevating it in the background. The proof it worked by luck is that an unrelated change (s/page_mapcount/page_count/) made the page fault behave slightly differently and broke clear_refs_write.

Even before page_mapcount was replaced with page_count, it has always been forbidden to leave too permissive stale TLB entries out of sync with a more restrictive pte/hugepmd permission past the PT unlock, unless you're holding the mmap_write_lock. So for example all rmap code has to flush before the PT unlock too; usually it clears the pte as a whole, but it's still a downgrade. The rmap_lock and the mmap_read_lock achieve the same: they keep the vma stable but they can't stop the page fault from running (that's a feature), so they have to flush inside the PT lock.

The tlb gather deals with preventing use after free (where userland could modify kernel memory), but it cannot deliver the guarantee the page fault requires. So the clear_refs_write patch linked, that alters the tlb flushing, appears to be a noop with respect to this bug. It cannot do anything to prevent the page fault running with writable stale TLB entries despite !pte_write.

If you don't add a marker here (the current code clears it: the exact opposite of what should be happening), there's no way to avoid the mmap_write_lock in my view:

	static inline void clear_soft_dirty(struct vm_area_struct *vma,
			unsigned long addr, pte_t *pte)
	{
		/*
		 * The soft-dirty tracker uses #PF-s to catch writes
		 * to pages, so write-protect the pte as well. See the
		 * Documentation/admin-guide/mm/soft-dirty.rst for full description
		 * of how soft-dirty works.
		 */
		pte_t ptent = *pte;

		if (pte_present(ptent)) {
			pte_t old_pte;

			old_pte = ptep_modify_prot_start(vma, addr, pte);
			ptent = pte_wrprotect(old_pte);
			ptent = pte_clear_soft_dirty(ptent);
	+		ptent = pte_mkuffd_wp(ptent);
			ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);

One solution that would fix the userland mm corruption in clear_refs_write is to take the mmap_read_lock, take some mutex somewhere (vma/mm whatever), then in clear_soft_dirty make the above modification adding the _PAGE_UFFD_WP, then flush the tlb, release the mutex and then release the mmap_read_lock. Then here:

		if (userfaultfd_pte_wp(vma, *vmf->pte)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
	+		if (vma->vm_flags & VM_SOFTDIRTY)
	+			return handle_soft_dirty(vma);
			return handle_userfault(vmf, VM_UFFD_WP);

Of course handle_soft_dirty will have to take the mutex once (1 mutex_lock/unlock cycle to run after any pending flush). And then we'll have to enforce that uffd-wp cannot be registered if VM_SOFTDIRTY is set, or the other way around, so that VM_UFFD* is mutually exclusive with VM_SOFTDIRTY.
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> Thanks for the details.

I hope we can find a way to put the page_mapcount back where there's a page_count right now.

If you're so worried about having to maintain a well defined, well documented (or to be documented even better if you ACK it) marker/catcher for userfaultfd_writeprotect, I can't see how you could consider it maintainable to keep the page fault safe against any random code leaving too permissive TLB entries out of sync with the more restrictive pte permissions, as was happening with clear_refs_write, which worked by luck until page_mapcount was changed to page_count.

page_count is far from optimal, but it is a feature that it finally allowed us to notice that various code (clear_refs_write included, apparently even after the fix) leaves stale too permissive TLB entries when it shouldn't.

The question is only which way you prefer to fix clear_refs_write, and I don't think we can deviate from the 3 methods that already exist today. So clear_refs_write will have to pick one of those, and currently it doesn't fall in the same category as mprotect even after the fix.

I think if clear_refs_write starts to take the mmap_write_lock and really starts to operate like mprotect, only then can we consider making userfaultfd_writeprotect also operate like mprotect. Even then I'd hope we can at least be allowed to make it operate like KSM write_protect_page for len <= HPAGE_PMD_SIZE, something clear_refs_write cannot do since it works in O(N) and tends to scan everything at once, so there would be no point in optimizing not to defer the flush for a process with a tiny amount of virtual memory mapped.

vm86 also should be fixed to fall in the same category as mprotect, since performance there is irrelevant.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 10:52:35AM -0500, Peter Xu wrote:
> On Tue, Dec 22, 2020 at 08:36:04PM -0700, Yu Zhao wrote:
> > In your patch, do we need to take wrprotect_rwsem in
> > handle_userfault() as well? Otherwise, it seems userspace would have
> > to synchronize between its wrprotect ioctl and fault handler? i.e.,
> > the fault handler needs to be aware that the content of write-
> > protected pages can actually change before the ioctl returns.
>
> The handle_userfault() thread should be sleeping until another uffd_wp_resolve
> fixes the page fault for it. However when the uffd_wp_resolve ioctl comes,
> then the rwsem (either the group rwsem lock as Andrea proposed, or the mmap_sem, or
> any new rwsem lock we'd like to introduce, maybe per-uffd rather than per-mm)
> should have guaranteed the previous wr-protect ioctls are finished and the tlb must
> have been flushed before this thread continues.
>
> And I don't know why it matters even if the data changed - IMHO what uffd-wp

The data will change indeed and it's fine.

> wants to do is simply to make sure after the wr-protect ioctl returns to userspace,
> no change on the page should ever happen anymore. So "whether data changed"
> seems to matter more on the ioctl thread than the handle_userfault()
> thread. IOW, I think data changes before tlb flush but after pte wr-protect are
> always fine - but that's not fine anymore once the syscall returns.

Agreed.
From the userland point of view all that matters is that writes through the stale TLB entries will stop in both cases:

1) before returning from the UFFDIO_WRITEPROTECT(mode_wp = true) ioctl syscall

2) before a parallel UFFDIO_WRITEPROTECT(mode_wp = false) can clear the _PAGE_UFFD_WP marker in the pte/hugepmd under the PT lock, assuming the syscall at point 1) is still in flight

Both points are guaranteed at all times by the group lock now, so userland cannot even measure or perceive the existence of any stale TLB at any given time in the whole uffd-wp workload. So it's perfectly safe and identical to NUMA balancing and requires zero extra locking in handle_userfault().

handle_userfault() is a dead end that simply waits and, when it's the right time, restarts the page fault. It can have occasional false positives after f9bf352224d7d4612b55b8d0cd0eaa981a3246cf, false positive as in restarting too soon, but even then it's perfectly safe since it's the equivalent of one more CPU hitting the page fault path. As long as the marker is there, any spurious userfault will re-enter handle_userfault().

handle_userfault() doesn't care about the data and in turn it cannot care less about any stale TLB either. Userland cares, but userland cannot make any assumption about writes being fully stopped until the ioctl returned anyway, and by that time the pending flush will be done; in fact by the time userland can make any assumption, also the mmap_write_lock would have been released with the first proposed patch.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 01:51:59PM -0500, Andrea Arcangeli wrote: > NOTE: about the above comment, that mprotect takes > mmap_read_lock. Your above code change in the commit above, still has write Correction to avoid any confusion.
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Wed, Dec 23, 2020 at 11:24:16AM -0500, Peter Xu wrote:
> I think this is not against Linus's example - where cpu2 does not have tlb
> cached so it sees RO while cpu3 does have tlb cached so cpu3 can still modify
> it. So IMHO there's no problem here.
>
> But I do think in step 2 here we overlooked the _PAGE_UFFD_WP bit. Note that if
> it's uffd-wp wr-protection it's always applied along with removing of the write
> bit in change_pte_range():
>
>         if (uffd_wp) {
>                 ptent = pte_wrprotect(ptent);
>                 ptent = pte_mkuffd_wp(ptent);
>         }
>
> So instead of being handled as a COW page, do_wp_page() will always trap
> userfaultfd-wp first, hence no chance to race with COW.
>
> COW could only trigger after another uffd-wp-resolve ioctl which could remove
> the _PAGE_UFFD_WP bit, but with Andrea's theory unprotect will only happen
> after all wr-protects complete, which guarantees that when reaching the COW
> path the tlb must have been flushed anyway. Then no one should be modifying
> the page anymore even without the pgtable lock in the COW path.
>
> So IIUC what Linus proposed on "flushing tlb within pgtable lock" seems to
> work, but it just may cause more tlb flushes than Andrea's proposal, especially
> when the protection range is large (it's common to have a huge protection range
> for e.g. VM live snapshotting, where we'll protect all guest rw ram).
>
> My understanding of the current issue is that either we can take Andrea's proposal
> (although I think the group rwsem may not be extremely better than a per-mm
> rwsem, which I don't know... at least not worse than that?), or we can also go
> the other way (also as Andrea mentioned) so that when wr-protecting:
>
>   - for <=2M range (pmd or less), we take the read rwsem, but flush tlb within
>     pgtable lock
>
>   - for >2M range, we take the write rwsem directly but flush tlb once

I fully agree indeed. I still have to fully understand how clear_refs_write was fixed.
clear_refs_write has not even a "catcher" in the page fault, but it clears the pte_write under mmap_read_lock and it doesn't flush before releasing the PT lock. It appears way more broken than userfaultfd_writeprotect ever was.

	static inline void clear_soft_dirty(struct vm_area_struct *vma,
			unsigned long addr, pte_t *pte)
	{
		/*
		 * The soft-dirty tracker uses #PF-s to catch writes
		 * to pages, so write-protect the pte as well. See the
		 * Documentation/admin-guide/mm/soft-dirty.rst for full description
		 * of how soft-dirty works.
		 */

https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/

"Although this is fine when simply aging the ptes, in the case of clearing the "soft-dirty" state we can end up with entries where pte_write() is false, yet a writable mapping remains in the TLB. Fix this by avoiding the mmu_gather API altogether: managing both the 'tlb_flush_pending' flag on the 'mm_struct' and explicit TLB invalidation for the soft-dirty path, much like mprotect() does already"

NOTE: about the above comment, that mprotect takes mmap_read_lock. Your above code change in the commit above still has mmap_read_lock; there's still no similarity with mprotect whatsoever.

Did you test what happens when clear_refs_write leaves a writable stale TLB entry and do_wp_page copies the page while the other CPU is still writing to it? Does clear_refs_write have a selftest as aggressive and racy as the uffd selftest to reproduce that workload?

The race fixed with the group lock internally to uffd is possible to fix thanks to the marker+catcher in NUMA balancing style. But that's not there for clear_refs_write even after the above commit. Nor does this patch appear to add a synchronous tlb flush inside the PT lock.

Overall, it would be unfair if clear_refs_write were allowed to do the same thing that userfaultfd_writeprotect has to do, but with the mmap_read_lock, while userfaultfd_writeprotect is forced to take the mmap_write_lock.
In my view there are 3 official ways to deal with this:

1) set a marker in the pte/hugepmd inside the PT lock, and add a catcher for the marker in the page fault to send the page fault to a dead end while there are stale TLB entries

   cases: userfaultfd_writeprotect() and NUMA balancing

2) take the mmap_write_lock

   case: mprotect

3) flush the TLB before the PT lock release

   case: KSM write_protect_page

Where does the below patch land in the 3 cases? It doesn't fit any of them, and what it does looks still not sufficient to fix the userfault memory corruption.

https://patchwork.kernel.org/project/linux-mm/patch/20201210121110.10094-2-w...@kernel.org/

It'd be backwards if the complexity of the kernel was increased for clear_refs_write, but forbidden for the O(1) API that delivers the exact same information (the clear_refs_write API delivers that info in O(N)). We went the last mile to guarantee uffd can be
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 05:23:39PM -0700, Yu Zhao wrote:
> and 2) people are spearheading multiple efforts to reduce the mmap_lock
> contention, which hopefully would make ufd users suffer less soon.

In my view UFFD is an already deployed working solution that eliminates the mmap_lock_write contention to allocate and free memory.

We need to add a UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page or THP and map it; it'll be indistinguishable from UFFDIO_ZEROPAGE, but it will solve the last performance bottleneck by avoiding a spurious wrprotect fault after the allocation).

After that, malloc based on uffd should become competitive single threaded, and it won't ever require the mmap_lock_write, so allocations and freeing of memory can continue indefinitely from all threads in parallel. There will never be another mmap or munmap stalling all threads. This is not why uffd was created, it's just a secondary performance benefit of uffd, but it's still a relevant benefit in my view.

Every time I hear people with major mmap_lock_write issues I recommend uffd, but, you know, until we add UFFDIO_POPULATE it will still have a higher fixed allocation overhead because of the wrprotect fault after UFFDIO_ZEROPAGE. UFFDIO_COPY also would not be as optimal as a clear_page and currently it's not even THP capable.

In addition you'll get a SIGBUS after a use after free. It's not like when you have a malloc lib doing MADV_DONTNEED at PAGE_SIZE granularity to rate limit the costly munmap, and then the app does a use after free and it reads zero or writes to a newly faulted in page.

The above will not require any special privilege and all allocated virtual memory remains fully swappable, because SIGBUS mode will never have to block any kernel initiated faults.
uffd-wp also is totally usable unprivileged by default to replace various software dirty bits with the info provided in O(1) instead of O(N), as long as the writes are done in userland: also unprivileged by default, without tweaking any sysctl, and with zero risk of increasing the reproducibility of any exploit against unrelated random kernel bugs.

So if we're forced to take the mmap_lock_write, it'd be cool if at least we can avoid it for a single pte or hugepmd wrprotection, as happens in write_protect_page() in KSM.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 04:39:46PM -0700, Yu Zhao wrote:
> We are talking about non-COW anon pages here -- they can't be mapped
> more than once. So why not just identify them by checking
> page_mapcount == 1 and then unconditionally reuse them? (This is
> probably where I've missed things.)

The problem with depending on page_mapcount to decide if it's COW or non-COW (respectively wp_page_copy or wp_page_reuse) is that GUP may elevate the count of a COW anon page that later becomes a non-COW anon page. This is Jann's idea not mine.

The problem is we have an unprivileged long term GUP like vmsplice that facilitates elevating the page count indefinitely, until the parent finally writes a secret to it.

Theoretically a short term pin would do it too, so it's not just vmsplice, but the short term pin would be incredibly more challenging to turn into a concern since it'd kill a phone battery and flash before it can read any data.

So what happens with your page_mapcount == 1 check is that it doesn't mean non-COW (we thought it did, until it didn't for the long term GUP pin in vmsplice).

Jann's testcase does fork() to set page_mapcount to 2 and page_count to 2, vmsplice to take an unprivileged indefinitely long GUP pin setting page_count to 3, queues the page in the pipe with page_count elevated, and munmaps to drop page_count to 2 and page_mapcount to 1.

page_mapcount is 1, so you'd think the page is non-COW and owned by the parent, but the child can still read it, so it's very much still wp_page_copy material if the parent tries to modify it. Otherwise the child can read the content.

This was supposed to be solvable by just doing the COW in the gup(write=0) case if page_mapcount > 1, with commit 17839856fd58.
I'm not exactly sure why that didn't fly and it had to be reverted by Peter in a308c71bf1e6e19cc2e4ced31853ee0fc7cb439a, but at the time this was happening I was sidetracked by urgent issues and I didn't manage to look back at how we ended up with the big hammer page_count == 1 check instead, to decide whether to call wp_page_reuse or wp_page_shared.

So anyway, the only thing that is clear to me is that keeping the child from reading the page_mapcount == 1 pages of the parent is the only reason why wp_page_reuse(vmf) will only be called on page_count(page) == 1 and not on page_mapcount(page) == 1.

It's also the reason why your page_mapcount assumption risks reintroducing the issue, and I only wish we could put page_mapcount == 1 back there.

Still, even if we put page_mapcount back there, it is not ok to leave the page fault with stale TLB entries and to rely on the fact wp_page_shared won't run. That would also avoid the problem, but I think if you leave stale TLB entries in change_protection just like NUMA balancing does, it also requires a catcher just like NUMA balancing has, or it'd truly work by luck.

So until we can put a page_mapcount == 1 check back there, page_count will be by definition unreliable because of the speculative lookups randomly elevating all non-zero page_counts at any time in the background on all pages, so you will never be able to tell if a page is true COW or if it's just a spurious COW because of a speculative lookup. It is impossible to differentiate a speculative lookup from a vmsplice ref in a child.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 02:14:41PM -0700, Yu Zhao wrote:
> This works but I don't prefer this option because 1) this is new
> way of making pte_wrprotect safe and 2) it's specific to ufd and
> can't be applied to clear_soft_dirty() which has no "catcher".

No, I didn't look into the clear_soft_dirty issue yet, I can look into that. To avoid having to deal with a 3rd solution it will have to use one of the two:

1) avoid deferring the tlb flush and enforce a sync flush before dropping the PT lock even if mm_tlb_flush_pending is true -> write_protect_page in KSM

2) add its own new catcher for its own specific marker (for uffd-wp the marker is _PAGE_UFFD_WP, for change_prot_numa it's PROT_NONE on a vma whose vm_page_prot is not PROT_NONE, for soft dirty it could be _PAGE_SOFT_DIRTY) to send the page fault to a dead end before the pte value is interpreted

> matter how good the documentation about this new way is now, it
> will be out of date, speaking from my personal experience.

A common catcher for all 3 is not possible because each catcher depends on whatever marker and whatever pte value they set, which may lead to a different deterministic path where to put the catcher, or even multiple paths. do_numa_page requires a catcher in a different place already. Or well, a common catcher for all 3 is technically possible, but it'd perform slower, requiring to check things twice. But perhaps soft dirty can use the same catcher as uffd-wp given the similarity?

> I'd go with what Nadav has -- the memory corruption problem has been
> there for a while and nobody has complained except Nadav. This makes
> me think people is less likely to notice any performance issues from
> holding mmap_lock for write in his patch either.

uffd-wp is a fairly new feature, the current users are irrelevant, keeping it optimal is just for the future potential.

> But I can't say I have zero concern with the potential performance
> impact given that I have been expecting the fix to go to stable,
> which I care most.
> So the next option on my list is to have a

Actually, stable would be very fine to go with Nadav's patch and use the mmap_lock_write unconditionally. The optimal fix is only relevant for the future potential, so it's only relevant for Linus's tree. However the feature is recent enough that it won't require a deep backport, so the optimal fix is likely fine for stable as well; generally stable prefers the same fix as upstream when there's no major backport rejection issue.

The alternative solution for uffd is to do the deferred flush under mmap_lock_write if len is > HPAGE_PMD_SIZE, or to tell change_protection not to defer the flush and to take the mmap_lock_read for <= HPAGE_PMD_SIZE. That will avoid depending on the catcher, and then userfaultfd_writeprotect(mode_wp=true) and userfaultfd_writeprotect(mode_wp=false) can even run in parallel at all times. The cons is that a large userfaultfd_writeprotect will block for I/O, and those would happen at least in the postcopy live snapshotting use case. The main cons is that it'd require modification to change_protection, so it actually looks more intrusive, not less.

Overall, anything that allows to wrprotect 1 pte with only the mmap_lock_read, exactly like KSM write_protect_page, would be enough for uffd-wp. What isn't ok in terms of future potential is the unconditional mmap_lock_write as in the original suggested patch, in my view. It doesn't mean we can't take the mmap_write_lock when the operation is large and there is actually more benefit from deferring the flush.

> common "catcher" in do_wp_page() which singles out pages that have
> page_mapcount equal to one and reuse them instead of trying to

I don't think the page_mapcount matters here. If the wp page reuse was more accurate (as it was before) we wouldn't notice this issue, but it still would be a bug that there were stale TLB entries. It worked by luck.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 12:58:18PM -0800, Nadav Amit wrote:
> I had somewhat similar ideas - saving in each page-struct the generation,
> which would allow to: (1) extend pte_same() to detect interim changes
> that were reverted (RO->RW->RO) and (2) per-PTE pending flushes.

What don't you feel safe about, what's the problem with RO->RW->RO, I don't get it.

The pte_same is perfectly ok without a sequence counter in my view; I've never seen anything that would not be ok with pte_same given all the invariants are respected. It's actually a great optimization compared to any unscalable sequence counter. The counter would slow down everything, having to increase a counter every time you change a pte; no matter if it's a counter per pgtable or per-vma or per-mm, it sounds very bad.

I'd rather prefer to take mmap_lock_write across the whole userfaultfd ioctl than having to deal with a new sequence counter increase for every pte modification on a heavily contended cacheline.

Also note the counter would have solved nothing for userfaultfd_writeprotect; it's useless to detect stale TLB entries. See how the !pte_write check happens after the counter was already increased:

	CPU 0				CPU 1				CPU 2
	-----				-----				-----
					userfaultfd_wrprotect(mode_wp = true)
					PT lock
					atomic set _PAGE_UFFD_WP and
					  clear _PAGE_WRITE
					false_shared_counter++
					PT unlock

	do_page_fault FAULT_FLAG_WRITE
									userfaultfd_wrprotect(mode_wp = false)
									PT lock
									ATOMIC clear _PAGE_UFFD_WP <- problem
									/* _PAGE_WRITE not set */
									false_shared_counter++
									PT unlock

									XX BUG RACE window open here

	PT lock
	counter = false_shared_counter
	FAULT_FLAG_WRITE is set by CPU
	_PAGE_WRITE is still clear in pte
	PT unlock

	wp_page_copy
	copy_user_page runs with stale TLB
	pte_same(counter, orig_pte, pte) -> PASS
	commit the copy to the pte with the lost writes

					deferred tlb flush <- too late

									XX BUG RACE window close here
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 12:19:49PM -0800, Nadav Amit wrote:
> Perhaps any change to PTE in a page-table should increase a page-table
> generation that we would save in the page-table page-struct (next to the

The current rule is that by the time the page fault finds a pte/hugepmd in a certain state under the "PT lock", the TLB cannot be left stale with a more permissive state at that point. So if under "PT lock" the page fault sees !pte_write, it means no other CPU can possibly modify the page anymore.

That must be enforced at all times; no sequence number is required if you enforce that, and the whole point of the fix is to enforce that. This is how things always worked in the page fault and it's perfectly fine. For the record, I never suggested changing anything in the page fault code.

So the only way we can leave a stale TLB after dropping the PT lock, with the mmap_read_lock and concurrent page faults, is with a marker:

1) take the PT lock

2) leave in the pte/hugepmd a unique identifiable marker

3) change the pte to downgrade permissions

4) drop the PT lock and leave stale TLB entries with the upgraded permissions

In point 3) the pte is built in a way that is guaranteed to trigger a deterministic path in the page fault. And in the way of that deterministic path you put the marker check for 2) to send the fault to a dead end where the stale TLB is actually irrelevant and harmless.

No change to the page fault is needed if you only enforce the above.

Even if we took the mmap_write_lock in userfaultfd_writeprotect, you would still have to deal with the above technique because of change_prot_numa(). Would you suggest adding a sequence counter and changing all the pte_same checks in order to make change_prot_numa safer, or is it safe enough already using the above technique that checks the marker in the page fault?
To be safe, NUMA balancing is using the same mechanism above of sending the page fault into a dead end, in order to call the very same function (change_protection) with the very same lock (mmap_read_lock) as userfaultfd_writeprotect. What happens in do_numa_page is then different from what happens in handle_userfault, but both are dead ends as far as the page fault code is concerned and they will never come back to the page fault code. That's how they avoid breaking the page fault.

The final rule for the above logic to be safe for uffd is that the marker cannot be cleared until after the deferred TLB flush is executed (that's where the window for the race existed, and the fix closed that window).

do_numa_page differs in clearing the marker while the TLB flush is still pending, but it can do that because it puts back the same exact pte value (with the upgraded permissions) that was there before it put the marker in the pte. In turn do_numa_page makes the pending TLB flush become a noop and it doesn't need to wait for it before removing the marker.

Does that last difference in how the marker is cleared warrant considering NUMA balancing so radically different as to forbid userfaultfd_writeprotect from using the same logic, in the very same function, with the very same lock, in order to decrease the VM complexity? In my view no.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Tue, Dec 22, 2020 at 08:15:53PM +, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> > My previous suggestion to use a mutex to serialize
> > userfaultfd_writeprotect with a mutex will still work, but we can run
> > as many wrprotect and un-wrprotect as we want in parallel, as long as
> > they're not simultaneous, we can do much better than a mutex.
> >
> > Ideally we would need a new two_group_semaphore, where each group can
> > run as many parallel instances as it wants, but no instance of one
> > group can run in parallel with any instance of the other group. AFAIK
> > such a kind of lock doesn't exist right now.
>
> Kent and I worked on one for a bit, and we called it a red-black mutex.
> If team red had the lock, more members of team red could join in.
> If team black had the lock, more members of team black could join in.
> I forget what our rule was around fairness (if team red has the lock,
> and somebody from team black is waiting, can another member of team red
> take the lock, or must they block?)

In this case they would need to block and provide full fairness.

Well, maybe just a bit of unfairness (to let a few more through the door before it shuts) wouldn't be a deal breaker, but it would need to be bounded or it'd starve the other color/side indefinitely. Otherwise an ioctl with mode_wp = true would block forever if more ioctls with mode_wp = false keep coming in on other CPUs (or the other way around).

The approximation with an rwsem and two atomics provides full fairness in both read and write mode (originally the read would starve the write IIRC, which was an issue for all mprotect etc.; not anymore, thankfully).

> It was to solve the direct-IO vs buffered-IO problem (you can have as many
> direct-IO readers/writers at once or you can have as many buffered-IO
> readers/writers at once, but exclude a mix of direct and buffered I/O).
> In the end, we decided it didn't work all that well.
Well, mixing buffered and direct-IO is certainly not a good practice, so it's reasonable to leave it up to userland to serialize if such a mix is needed; the kernel behavior is undefined if the mix is concurrent and out of order.
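To make the "two group semaphore" idea above concrete, here is a minimal userspace sketch using pthreads: any number of holders of one group (color) can hold the lock concurrently, but the two groups exclude each other, and once a waiter of the other color shows up, further lockers of the current color block too, so neither side can starve the other indefinitely. All names are illustrative; this is not a kernel API, and the fairness policy is a simplified version of what the thread discusses.

```c
#include <pthread.h>

/*
 * Sketch of a two-group ("red-black") lock: many concurrent holders
 * of one color, but the two colors mutually exclude each other.
 * Hypothetical names; not the kernel's rwsem+atomics implementation.
 */
struct two_group_lock {
	pthread_mutex_t mtx;
	pthread_cond_t cond;
	int color;		/* group currently holding the lock */
	int holders;		/* number of holders of that group */
	int waiters[2];		/* blocked lockers per group */
};

void tgl_init(struct two_group_lock *l)
{
	pthread_mutex_init(&l->mtx, NULL);
	pthread_cond_init(&l->cond, NULL);
	l->color = 0;
	l->holders = 0;
	l->waiters[0] = l->waiters[1] = 0;
}

void tgl_lock(struct two_group_lock *l, int color)
{
	pthread_mutex_lock(&l->mtx);
	l->waiters[color]++;
	/*
	 * Block while the other group holds the lock, or while someone
	 * of the other group is queued (the "door shuts" fairness rule,
	 * so one side cannot starve the other indefinitely).
	 */
	while (l->holders &&
	       (l->color != color || l->waiters[1 - color]))
		pthread_cond_wait(&l->cond, &l->mtx);
	l->waiters[color]--;
	l->color = color;
	l->holders++;
	pthread_mutex_unlock(&l->mtx);
}

void tgl_unlock(struct two_group_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (!--l->holders)
		pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}
```

In the uffd-wp scenario, all mode_wp = true ioctls would take one color and all mode_wp = false ioctls the other, so wrprotects can run in parallel with each other but never concurrently with un-wrprotects.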
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so

I assume you mean "in" mprotect_fixup, after change_protection.

If you downgraded the mmap_lock to read there, it'd severely slow down the non-contention case whenever there's more than one vma that needs change_protection. You'd need to throw away the prev->vm_next info and you'd need to do a new find_vma after dropping the mmap_lock for reading and re-taking the mmap_lock for writing at every iteration of the loop.

To do less harm to the non-contention case you could perhaps walk vma->vm_next and check if it's outside the mprotect range, and only downgrade in that case. So let's assume we intend to optimize with mmap_write_downgrade only the last vma.

The problem is that once you had to take the mmap_lock for writing, you already stalled for I/O and waited for all concurrent page faults, and blocked them as well for the vma allocations in split_vma, so that extra boost in SMP scalability you get is lost in the noise there at best. And the risk is that at worst that extra locked op of mmap_write_downgrade() will hurt SMP scalability, because it would increase the locked ops of mprotect on the hottest false-shared cacheline by 50%, and that may outweigh the benefit from unblocking the page faults half a usec sooner on large systems.

But the ultimate reason why mprotect cannot do mmap_write_downgrade(), while userfaultfd_writeprotect can do mmap_read_lock and avoid the mmap_write_lock altogether, is that mprotect leaves no mark in the pte/hugepmd that allows detecting when the TLB is stale, in order to redirect the page fault into a dead end (handle_userfault() or do_numa_page) until after the TLB has been flushed, as happens in the 4 cases below:

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
	 * can be stale. We cannot allow do_wp_page to proceed or
	 * it'll wrongly assume that nobody can still be writing to
	 * the page if !pte_write.
	 */
	if (userfaultfd_pte_wp(vma, *vmf->pte)) {

	/*
	 * STALE_TLB_WARNING: while the uffd_wp bit is set,
	 * the TLB can be stale. We cannot allow wp_huge_pmd()
	 * to proceed or it'll wrongly assume that nobody can
	 * still be writing to the page if !pmd_write.
	 */
	if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))

	/*
	 * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
	 * be stale.
	 */
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))

	/*
	 * STALE_TLB_WARNING: if the pmd is NUMA
	 * protnone the TLB can be stale.
	 */
	if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
…the less we have to block for I/O during those CPU-only events, the better. This is one of the potential use cases, but there's plenty more.

So the alternative would be to take mmap_read_lock for "len <= HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a parameter in change_protection to tell it to flush before dropping the PT lock and not defer the flush. That is going to be more intrusive in the VM and it's overall unnecessary.

The below is mostly untested... but it'd be good to hear some feedback before doing more work in this direction.

From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Tue, 22 Dec 2020 14:30:16 -0500
Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
 userfaultfd_writeprotect()

change_protection() is called by userfaultfd_writeprotect() with the mmap_lock held for reading, like in change_prot_numa().

The page fault code in wp_page_copy() rightfully assumes that if the CPU issued a write fault and the write bit in the pagetable is not set, no CPU can write to the page. That's a wrong assumption after userfaultfd_writeprotect(). That's also a wrong assumption after change_prot_numa(), where the regular page fault code would likewise assume that if the present bit is not set while the page fault is running, there can be no stale TLB entry, but there still is.

So to stay safe, the page fault code must be prevented from running as long as the TLB flush remains pending. That is already achieved by the do_numa_page() path for change_prot_numa() and by the userfaultfd_pte_wp() path for userfaultfd_writeprotect().

The problem that needs fixing is that an un-wrprotect (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not set) could run in between the original wrprotect (i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set) and wp_page_copy, while the TLB flush remains pending.
In such case the userfaultfd_pte_wp() check will stop being effective in preventing the regular page fault code from running:

	CPU 0			CPU 1			CPU 2
	-----			-----			-----
	userfaultfd_wrprotect(mode_wp = true)
	PT lock
	atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
	PT unlock

				do_page_fault FAULT_FLAG_WRITE
							userfaultfd_wrprotect(mode_wp = false)
							PT lock
							ATOMIC clear _PAGE_UFFD_WP <- problem
							/* _PAGE_WRITE not set */
							PT unlock
							XX BUG RACE window open here

				PT lock
				FAULT_FLAG_WRITE is set by CPU
				_PAGE_WRITE is still clear in pte
				PT unlock

				wp_page_copy
				copy_user_page runs with stale TLB

	deferred tlb flush <- too late
				XX BUG RACE window close here

If userfaultfd_wrprotect(mode_wp = false) can only run after the deferred TLB flush, the above cannot happen, because the uffd_wp bit will remain set until after the TLB flush and the page fault will reliably hit the handle_userfault() dead end as long as the TLB is stale.

So to fix this we need to prevent any un-wrprotect from starting until the last outstanding wrprotect completed, and to prevent any further wrprotect from starting until the last outstanding un-wrprotect completed. Then wp_page_copy can still run in parallel, but only with the un-wrprotect, and that's fine since it's a permission promotion.

We would need a new two_group_semaphore, where each group can run as many parallel instances as it wants, but no instance of one group can run in parallel with any instance of the other group. The below rwsem with two atomics approximates that kind of lock.
Signed-off-by: Andrea Arcangeli
---
 fs/userfaultfd.c | 39 +++
 mm/memory.c      | 20
 2 files changed, 59 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..3729ca99dae5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -76,6 +76,20 @@ struct userfaultfd_ctx {
 	bool mmap_changing;
 	/* mm with one ore more vmas attached to this userfaultfd_ctx */
 	struct mm_struct *mm;
+	/*
+	 * We cannot let userfaultd_writeprotect(mode_wp = false)
+	 * clear _PAGE_UFFD_WP from the pgtable, until the last
+	 * outstanding userfaultd_writeprotect(mode_wp = true)
+	 * completes, because the TLB flush is deferred. The page
+	 * fault must be forced to enter handle_userfault() while the
+	 * TLB flush is deferred, and that's achieved with the
+	 * _PAGE_UFFD_WP bit. The _PAGE_UFFD_WP can only be cleared
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Hello, On Sat, Dec 19, 2020 at 09:08:55PM -0800, Andy Lutomirski wrote: > On Sat, Dec 19, 2020 at 6:49 PM Andrea Arcangeli wrote: > > The ptes are changed always with the PT lock, in fact there's no > > problem with the PTE updates. The only difference with mprotect > > runtime is that the mmap_lock is taken for reading. And the effect > > contested for this change doesn't affect the PTE, but supposedly the > > tlb flushing deferral. > > Can you point me at where the lock ends up being taken in this path? pte_offset_map_lock in change_pte_range, as in mprotect, no difference. As I suspected on my follow up, the bug described wasn't there, but I'll look at the new theory posted. Thanks, Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Sat, Dec 19, 2020 at 06:01:39PM -0800, Andy Lutomirski wrote:
> I missed the beginning of this thread, but it looks to me like
> userfaultfd changes PTEs with not locking except mmap_read_lock(). It

There's no mmap_read_lock, I assume you mean mmap_lock for reading.

The ptes are changed always with the PT lock held; in fact there's no problem with the PTE updates. The only difference with the mprotect runtime is that the mmap_lock is taken for reading. And the effect contested for this change doesn't affect the PTE, but supposedly the tlb flushing deferral.

change_protection_range is identical to what already happens with zap_page_range. zap_page_range is called with mmap_lock for reading in MADV_DONTNEED, and by munmap with mmap_lock for writing. change_protection_range is called with mmap_lock for writing by mprotect, and mmap_lock for reading by UFFDIO_WRITEPROTECT.

> also calls inc_tlb_flush_pending(), which is very explicitly
> documented as requiring the pagetable lock. Those docs must be wrong,

The comment in inc_tlb_flush_pending() shows the pagetable lock is taken after inc_tlb_flush_pending():

	 * atomic_inc(&mm->tlb_flush_pending);
	 * spin_lock(&ptl);

> because mprotect() uses the mmap_sem write lock, which is just fine,
> but ISTM some kind of mutual exclusion with proper acquire/release
> ordering is indeed needed. So the userfaultfd code seems bogus.

If there's a bug, it'd be nice to fix it without taking the mmap_lock for writing. The vma is guaranteed not modified, so I think it'd be pretty bad if we had to take the mmap_lock for writing just to wait for a tlb flush that is issued deferred in the context of userfaultfd_writeprotect.

> I think userfaultfd either needs to take a real lock (probably doesn't
> matter which) or the core rules about PTEs need to be rewritten.

It's not exactly clear how do_wp_page could run on cpu1 before the unprotect did an extra flush; I guess the trace needs one more cpu/thread?
Anyway, to make the unprotect wait for the wrprotect to do its deferred flush before the unprotect can even start, one more mutex in the mm, taken in all callers of change_protection_range that run with the mmap_lock for reading, may be enough.

Thanks,
Andrea
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote:
> >
> > [ cc’ing some more people who have experience with similar problems ]
> >
> >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
> >>
> >> Hello,
> >>
> >> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>> Analyzing this problem indicates that there is a real bug since
> >>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >>
> >> Never having to take the mmap_sem for writing, and in turn never
> >> blocking, in order to modify the pagetables is quite an important
> >> feature in uffd that justifies uffd instead of mprotect. It's not the
> >> most important reason to use uffd, but it'd be nice if that guarantee
> >> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >> other pgtable manipulations.
> >>
> >>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>>
> >>> cpu0				cpu1
> >>>
> >>> userfaultfd_writeprotect()
> >>> [ write-protecting ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>> change_protection_range()
> >>> ...
> >>> change_pte_range()
> >>> [ defer TLB flushes]
> >>>				userfaultfd_writeprotect()
> >>>				mmap_read_lock()
> >>>				change_protection()
> >>>				[ write-unprotect ]
> >>>				...
> >>> [ unprotect PTE logically ]

Is the uffd selftest failing with upstream or after your kernel modification that removes the tlb flush from unprotect?

		} else if (uffd_wp_resolve) {
			/*
			 * Leave the write bit to be handled
			 * by PF interrupt handler, then
			 * things like COW could be properly
			 * handled.
			 */
			ptent = pte_clear_uffd_wp(ptent);
		}

Upstream this will still do pages++, there's a tlb flush before change_protection can return here, so I'm confused.

> >>> ...
> >>> [ page-fault]
> >>> ...
> >>> wp_page_copy() > >>> [ set new writable page in PTE] > >> > >> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL > >> is set and do an explicit (potentially spurious) tlb flush before > >> write-unprotect? > > > > There is a concrete scenario that I actually encountered and then there is a > > general problem. > > > > In general, the kernel code assumes that PTEs that are read from the > > page-tables are coherent across all the TLBs, excluding permission promotion > > (i.e., the PTE may have higher permissions in the page-tables than those > > that are cached in the TLBs). > > > > We therefore need to both: (a) protect change_protection_range() from the > > changes of others who might defer TLB flushes without taking mmap_sem for > > write (e.g., try_to_unmap_one()); and (b) to protect others (e.g., > > page-fault handlers) from concurrent changes of change_protection(). > > > > We have already encountered several similar bugs, and debugging such issues > > s time consuming and these bugs impact is substantial (memory corruption, > > security). So I think we should only stick to general solutions. > > > > So perhaps your the approach of your proposed solution is feasible, but it > > would have to be applied all over the place: we will need to add a check for > > mm_tlb_flush_pending() and conditionally flush the TLB in every case in > > which PTEs are read and there might be an assumption that the > > access-permission reflect what the TLBs hold. This includes page-fault > > handlers, but also NUMA migration code in change_protection(), softdirty > > cleanup in clear_refs_write() and maybe others. > > > > [ I have in mind another solution, such as keeping in each page-table a > > “table-generation” which is the mm-generation at the time of the chan
Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Hello, On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote: > Analyzing this problem indicates that there is a real bug since > mmap_lock is only taken for read in mwriteprotect_range(). This might Never having to take the mmap_sem for writing, and in turn never blocking, in order to modify the pagetables is quite an important feature in uffd that justifies uffd instead of mprotect. It's not the most important reason to use uffd, but it'd be nice if that guarantee would remain also for the UFFDIO_WRITEPROTECT API, not only for the other pgtable manipulations. > Consider the following scenario with 3 CPUs (cpu2 is not shown): > > cpu0 cpu1 > > userfaultfd_writeprotect() > [ write-protecting ] > mwriteprotect_range() > mmap_read_lock() > change_protection() > change_protection_range() >... >change_pte_range() >[ defer TLB flushes] > userfaultfd_writeprotect() >mmap_read_lock() >change_protection() >[ write-unprotect ] >... > [ unprotect PTE logically ] > ... > [ page-fault] > ... > wp_page_copy() > [ set new writable page in PTE] Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL is set and do an explicit (potentially spurious) tlb flush before write-unprotect? Thanks, Andrea
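The mm_tlb_flush_pending check proposed above can be modeled in a standalone way. The sketch below mirrors the kernel's inc_tlb_flush_pending()/mm_tlb_flush_pending() naming, but it is a userspace model of the idea, not kernel code: the unprotect side performs an explicit (possibly spurious) flush whenever a wrprotect's deferred flush is still outstanding, before it clears the uffd-wp bit. The struct and helper bodies are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model: the wrprotect side raises a "flush pending" count
 * while its TLB flush is deferred; the write-unprotect side flushes
 * first (possibly spuriously) if the count is nonzero, so it never
 * clears the uffd-wp marker while a stale TLB entry may exist.
 */
struct mm_model {
	atomic_int tlb_flush_pending;
	int flushes_done;	/* instrumentation for this sketch only */
};

static void inc_tlb_flush_pending(struct mm_model *mm)
{
	atomic_fetch_add(&mm->tlb_flush_pending, 1);
}

static void dec_tlb_flush_pending(struct mm_model *mm)
{
	atomic_fetch_sub(&mm->tlb_flush_pending, 1);
}

static bool mm_tlb_flush_pending(struct mm_model *mm)
{
	return atomic_load(&mm->tlb_flush_pending) > 0;
}

static void flush_tlb(struct mm_model *mm)
{
	mm->flushes_done++;	/* stands in for a real TLB flush */
}

/* Write-unprotect path: flush before dropping the uffd-wp marker if
 * any wrprotect still has its deferred flush outstanding. */
static void uffd_wp_resolve_model(struct mm_model *mm)
{
	if (mm_tlb_flush_pending(mm))
		flush_tlb(mm);
	/* ... now it is safe to clear the uffd-wp bit in the pte ... */
}
```

The flush here is potentially spurious exactly as the text says: it may run even when the pending wrprotect targeted a disjoint range, trading an extra IPI for correctness.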
Re: [PATCH v2 1/2] mm: memblock: enforce overlap of memory.memblock and memory.reserved
On Mon, Dec 14, 2020 at 12:18:07PM +0100, David Hildenbrand wrote: > On 14.12.20 12:12, Mike Rapoport wrote: > > On Mon, Dec 14, 2020 at 11:11:35AM +0100, David Hildenbrand wrote: > >> On 09.12.20 22:43, Mike Rapoport wrote: > >>> From: Mike Rapoport > >>> > >>> memblock does not require that the reserved memory ranges will be a subset > >>> of memblock.memory. > >>> > >>> As the result there maybe reserved pages that are not in the range of any > >>> zone or node because zone and node boundaries are detected based on > >>> memblock.memory and pages that only present in memblock.reserved are not > >>> taken into account during zone/node size detection. > >>> > >>> Make sure that all ranges in memblock.reserved are added to > >>> memblock.memory > >>> before calculating node and zone boundaries. > >>> > >>> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions > >>> rather that check each PFN") > >>> Reported-by: Andrea Arcangeli > >>> Signed-off-by: Mike Rapoport > >>> --- > >>> include/linux/memblock.h | 1 + > >>> mm/memblock.c| 24 > >>> mm/page_alloc.c | 7 +++ > >>> 3 files changed, 32 insertions(+) > >>> > >>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h > >>> index ef131255cedc..e64dae2dd1ce 100644 > >>> --- a/include/linux/memblock.h > >>> +++ b/include/linux/memblock.h > >>> @@ -120,6 +120,7 @@ int memblock_clear_nomap(phys_addr_t base, > >>> phys_addr_t size); > >>> unsigned long memblock_free_all(void); > >>> void reset_node_managed_pages(pg_data_t *pgdat); > >>> void reset_all_zones_managed_pages(void); > >>> +void memblock_enforce_memory_reserved_overlap(void); > >>> > >>> /* Low level functions */ > >>> void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags, > >>> diff --git a/mm/memblock.c b/mm/memblock.c > >>> index b68ee86788af..9277aca642b2 100644 > >>> --- a/mm/memblock.c > >>> +++ b/mm/memblock.c > >>> @@ -1857,6 +1857,30 @@ void __init_memblock > >>> memblock_trim_memory(phys_addr_t align) > >>> } > 
>>> } > >>> > >>> +/** > >>> + * memblock_enforce_memory_reserved_overlap - make sure every range in > >>> + * @memblock.reserved is covered by @memblock.memory > >>> + * > >>> + * The data in @memblock.memory is used to detect zone and node > >>> boundaries > >>> + * during initialization of the memory map and the page allocator. Make > >>> + * sure that every memory range present in @memblock.reserved is also > >>> added > >>> + * to @memblock.memory even if the architecture specific memory > >>> + * initialization failed to do so > >>> + */ > >>> +void __init memblock_enforce_memory_reserved_overlap(void) > >>> +{ > >>> + phys_addr_t start, end; > >>> + int nid; > >>> + u64 i; > >>> + > >>> + __for_each_mem_range(i, &memblock.reserved, &memblock.memory, > >>> + NUMA_NO_NODE, MEMBLOCK_NONE, &start, &end, &nid) { > >>> + pr_warn("memblock: reserved range [%pa-%pa] is not in memory\n", > >>> + &start, &end); > >>> + memblock_add_node(start, (end - start), nid); > >>> + } > >>> +} > >>> + > >>> void __init_memblock memblock_set_current_limit(phys_addr_t limit) > >>> { > >>> memblock.current_limit = limit; > >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>> index eaa227a479e4..dbc57dbbacd8 100644 > >>> --- a/mm/page_alloc.c > >>> +++ b/mm/page_alloc.c > >>> @@ -7436,6 +7436,13 @@ void __init free_area_init(unsigned long > >>> *max_zone_pfn) > >>> memset(arch_zone_highest_possible_pfn, 0, > >>> sizeof(arch_zone_highest_possible_pfn)); > >>> > >>> + /* > >>> + * Some architectures (e.g. x86) have reserved pages outside of > >>> + * memblock.memory. Make sure these pages are taken into account > >>> + * when detecting zone and node boundaries > >>> + */ > >>
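The invariant the quoted patch enforces — every range in memblock.reserved must be covered by memblock.memory — can be illustrated with a standalone checker. This simplified sketch assumes both arrays are sorted and non-overlapping; the real __for_each_mem_range() iterator walks the set difference directly, and all names here are invented for the example.

```c
#include <stddef.h>

/* Half-open pfn range [start, end), standing in for a memblock region. */
struct range { unsigned long start, end; };

/*
 * Return 1 and write the first reserved subrange not covered by any
 * memory range to *out, or return 0 if memblock.reserved is fully
 * covered by memblock.memory.  Both arrays must be sorted and
 * non-overlapping.
 */
static int first_uncovered(const struct range *mem, size_t nmem,
			   const struct range *res, size_t nres,
			   struct range *out)
{
	for (size_t i = 0; i < nres; i++) {
		unsigned long pos = res[i].start;
		for (size_t j = 0; j < nmem && pos < res[i].end; j++) {
			if (mem[j].end <= pos)
				continue;	/* memory range entirely behind us */
			if (mem[j].start > pos) {
				/* gap before this memory range is uncovered */
				out->start = pos;
				out->end = mem[j].start < res[i].end ?
					   mem[j].start : res[i].end;
				return 1;
			}
			pos = mem[j].end;	/* covered up to here */
		}
		if (pos < res[i].end) {
			out->start = pos;
			out->end = res[i].end;
			return 1;
		}
	}
	return 0;
}
```

With pfn 0 reserved but absent from memblock.memory (the x86 case in this thread), the checker reports [0, 1) as uncovered, which is exactly the range the patch would memblock_add_node().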
Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout
Hello,

On Wed, Dec 09, 2020 at 11:43:04PM +0200, Mike Rapoport wrote:
> +void __init __weak memmap_init(unsigned long size, int nid,
> +			       unsigned long zone,
> +			       unsigned long range_start_pfn)
> +{
> +	unsigned long start_pfn, end_pfn, hole_start_pfn = 0;
> 	unsigned long range_end_pfn = range_start_pfn + size;
> +	u64 pgcnt = 0;
> 	int i;
>
> 	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> 		start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> 		end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +		hole_start_pfn = clamp(hole_start_pfn, range_start_pfn,
> +				       range_end_pfn);
>
> 		if (end_pfn > start_pfn) {
> 			size = end_pfn - start_pfn;
> 			memmap_init_zone(size, nid, zone, start_pfn,
> 					 MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);
> 		}
> +
> +		if (hole_start_pfn < start_pfn)
> +			pgcnt += init_unavailable_range(hole_start_pfn,
> +							start_pfn, zone, nid);
> +		hole_start_pfn = end_pfn;
> 	}

After applying the new 1/2, the above loop seems to be functionally a noop compared to what was in -mm yesterday, so the above looks great as far as I'm concerned. Unlike the simple fix, this will not loop over holes that aren't part of memblock.memory nor memblock.reserved, and it drops the static variable, which would have required ordering and serialization.

By being functionally equivalent, it looks like it also suffers from the same dependency on pfn 0 (and not just pfn 0) being reserved that you pointed out earlier. I suppose to drop that further dependency we need a further round down in this logic, to the start of the pageblock_order or max-order block, as mentioned yesterday?

If the first pfn of a pageblock (or maybe better a max-order block) is valid, but not in memblock.reserved nor memblock.memory, and any other page in such a pageblock is freed to the buddy allocator, we should make sure the whole pageblock gets initialized (or at least the pages with a pfn lower than the one that was added to the buddy). So applying a round down in the above loop might just do the trick.
Since the removal of that extra dependency was mostly orthogonal to the above, I guess it's actually cleaner to do it incrementally.

I'd suggest to also document why we're doing it, in the code (not just the commit header) of the incremental patch, by mentioning which specific VM invariants we're enforcing that the VM code always depended upon, that required the round down etc...

In the meantime I'll try to update all systems again with this implementation to test it.

Thanks!
Andrea
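The round down being discussed is a simple mask operation on the pfn. A minimal sketch, with PAGEBLOCK_ORDER as an illustrative constant (the kernel's pageblock_order is typically HUGETLB_PAGE_ORDER, 9 on x86 with 4k pages):

```c
/*
 * Round a pfn down to the start of its pageblock, so that a hole
 * inside a partially covered pageblock gets initialized from the
 * block's first pfn rather than from the middle.  PAGEBLOCK_ORDER
 * here is a stand-in constant for the sake of the example.
 */
#define PAGEBLOCK_ORDER		9	/* 512 pages = 2MB with 4k pages */
#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)

static unsigned long pageblock_round_down(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}
```

Applying this to hole_start_pfn before calling init_unavailable_range() is the "round down in the above loop" idea: every page in the pageblock gets a zone/node assignment even if the hole begins mid-block.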
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Hi Mel,

On Thu, Nov 26, 2020 at 10:47:20AM +, Mel Gorman wrote:
> Agreed. This thread has a lot of different directions in it at this
> point so what I'd hope for is first, a patch that initialises holes with
> zone/node linkages within a 1<<(MAX_ORDER-1) alignment. If there is a
> hole, it would be expected the pages are PageReserved. Second, a fix to
> fast_isolate that forces PFNs returned to always be within the stated
> zone boundaries.

The last two patches should resolve the struct page initialization:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/

and the VM_BUG_ON_PAGE never happened again as expected.

So I looked back to check whether the "distance" logic is accurate. I added those checks and I get negative hits:

diff --git a/mm/compaction.c b/mm/compaction.c
index cc1a7f600a86..844a90b0acdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1331,6 +1331,12 @@ fast_isolate_freepages(struct compact_control *cc)
 	low_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 2));
 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));

+	WARN_ON_ONCE((long) low_pfn < 0);
+	WARN_ON_ONCE((long) min_pfn < 0);
+	if ((long) low_pfn < 0)
+		return cc->free_pfn;
+	if ((long) min_pfn < 0)
+		return cc->free_pfn;
 	if (WARN_ON_ONCE(min_pfn > low_pfn))
 		low_pfn = min_pfn;

Both warn-on-once triggers, so it goes negative. While it appears not to crash the kernel, since pfn_valid won't succeed on negative entries and they'll always be higher than any pfn in the freelist, is this a sign that there's further room for improvement here?

Thanks,
Andrea
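The underflow caught by those WARN_ON_ONCEs happens because pfns are unsigned: cc->free_pfn - (distance >> 1) wraps to a huge value when the backward distance exceeds free_pfn. A sketch of the signed-space check plus clamp, with illustrative names (not the actual compaction fix):

```c
/*
 * Compute "free_pfn - back" without letting it wrap below zero.
 * Doing the subtraction in signed space makes the underflow visible,
 * and the result is clamped to the lowest legal pfn of the zone.
 * Names and the clamp policy are illustrative.
 */
static unsigned long clamp_scan_pfn(unsigned long free_pfn,
				    unsigned long back,
				    unsigned long lowest_pfn)
{
	long pfn = (long)free_pfn - (long)back;	/* may go negative */

	return pfn < (long)lowest_pfn ? lowest_pfn : (unsigned long)pfn;
}
```

With this shape, a too-large distance degrades into scanning from the zone start instead of producing a bogus out-of-zone pfn.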
[PATCH 0/1] mm: initialize struct pages in reserved regions outside of the zone ranges
I'm running with this patch applied on all instances as a solution to the compaction crashes that started to trigger after the v5.9 upgrade. It's applied on top of https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org so that it boots and works fine even when there are memblock.reserved regions that don't overlap with memblock.memory (such as pfn 0, which is in memblock.reserved but not added to memblock.memory).

To facilitate any testing:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

===
# grep -A1 -i e820 /proc/iomem
39619000-397f4fff : Unknown E820 type
397f5000-3be55fff : System RAM

DMA      zone_start_pfn 0        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4909056  contiguous 1
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

235508 0x397f4000 0x40001000 reserved True  pfn_valid True
235509 0x397f5000 0x4000     reserved False pfn_valid True
===
# grep -A1 -i e820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bff : System RAM

DMA      zone_start_pfn 0        zone_end_pfn() 4096     contiguous 1
DMA32    zone_start_pfn 4096     zone_end_pfn() 1048576  contiguous 0
Normal   zone_start_pfn 1048576  zone_end_pfn() 4715392  contiguous 1
Movable  zone_start_pfn 0        zone_end_pfn() 0        contiguous 0

500246 0x7a216000 0x3fff1000 reserved True  pfn_valid True
500247 0x7a217000 0x3fff     reserved False pfn_valid True
===

I dislike the fix for pfn 0 merged in -mm so that it will still boot with DEBUG_VM=y, posted in https://lkml.kernel.org/r/20201203105107.gr123...@linux.ibm.com, because in addition to looping every 2m from the start at every memblock for every zone in an O(N*2) way, it'll keep assigning a "wrong" zoneid to pfn 0, and it literally reintroduces the problem that https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org was meant to solve.

This patch implements a more significant change since it alters the zone boundaries.
If there's any problem with extending the zone spans to include memblock.reserved regions it'd be good to know now, because if there's no problem with that, I only see benefits in being able to guarantee a proper zoneid and nid for all struct pages that are in reserved memblocks too, and where reserve_bootmem_region will attempt to call __SetPageReserved later expecting to deal with a "functional" page structure.

An alternative to this patch, which still sounds preferable to https://lkml.kernel.org/r/20201203105107.gr123...@linux.ibm.com, is to just teach reserve_bootmem_region not to try to set PageReserved on uninitialized PagePoison pages, to learn how to deal with "non functional" page structures, and to just leave them alone. PagePoison would conceptually act as a NO_ZONEID setting in such a case, orthogonal to all the rest; it'd be nice if it was set also with DEBUG_VM=n.

Overall anything looks preferable to once again assigning a wrong zoneid to a page struct that belongs to a pfn that is not part of the zone, in order to boot with the very fix that was meant to prevent such an invariant from being broken in the first place. I don't think pfn 0 deserves a magic exception and a pass to break the invariant (even ignoring that it may happen that the first pfn in nid > 0 might then also get the pass).

Andrea Arcangeli (1):
  mm: initialize struct pages in reserved regions outside of the zone ranges

 include/linux/memblock.h | 17 +---
 mm/debug.c               |  3 ++-
 mm/memblock.c            |  4 +--
 mm/page_alloc.c          | 57
 4 files changed, 63 insertions(+), 18 deletions(-)
[PATCH 1/1] mm: initialize struct pages in reserved regions outside of the zone ranges
Without this change, the pfn 0 isn't in any zone spanned range, and it's also not in any memory.memblock range, so the struct page of pfn 0 wasn't initialized and the PagePoison remained set when reserve_bootmem_region called __SetPageReserved, inducing a silent boot failure with DEBUG_VM (and correctly so, because the crash signaled the nodeid/nid of pfn 0 would be again wrong). There's no enforcement that all memblock.reserved ranges must overlap memblock.memory ranges, so the memblock.reserved ranges also require an explicit initialization and the zones ranges need to be extended to include all memblock.reserved ranges with struct pages too or they'll be left uninitialized with PagePoison as it happened to pfn 0. Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") Signed-off-by: Andrea Arcangeli --- include/linux/memblock.h | 17 +--- mm/debug.c | 3 ++- mm/memblock.c| 4 +-- mm/page_alloc.c | 57 4 files changed, 63 insertions(+), 18 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index ef131255cedc..c8e30cd69564 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -251,7 +251,8 @@ static inline bool memblock_is_nomap(struct memblock_region *m) int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn, unsigned long *end_pfn); void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, - unsigned long *out_end_pfn, int *out_nid); + unsigned long *out_end_pfn, int *out_nid, + struct memblock_type *type); /** * for_each_mem_pfn_range - early memory pfn range iterator @@ -263,9 +264,17 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, * * Walks over configured memory ranges. 
*/ -#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid) \ - for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid); \ -i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid)) +#define for_each_mem_pfn_range(i, nid, p_start, p_end, p_nid)\ + for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \ + &memblock.memory); \ +i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \ + &memblock.memory)) + +#define for_each_res_pfn_range(i, nid, p_start, p_end, p_nid)\ + for (i = -1, __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \ + &memblock.reserved);\ +i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid, \ + &memblock.reserved)) #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone, diff --git a/mm/debug.c b/mm/debug.c index ccca576b2899..6a1d534f5ffc 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -64,7 +64,8 @@ void __dump_page(struct page *page, const char *reason) * dump_page() when detected. 
*/ if (page_poisoned) { - pr_warn("page:%px is uninitialized and poisoned", page); + pr_warn("page:%px pfn:%ld is uninitialized and poisoned", + page, page_to_pfn(page)); goto hex_only; } diff --git a/mm/memblock.c b/mm/memblock.c index b68ee86788af..3964a5e8914f 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1198,9 +1198,9 @@ void __init_memblock __next_mem_range_rev(u64 *idx, int nid, */ void __init_memblock __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn, - unsigned long *out_end_pfn, int *out_nid) + unsigned long *out_end_pfn, int *out_nid, + struct memblock_type *type) { - struct memblock_type *type = &memblock.memory; struct memblock_region *r; int r_nid; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ce2bdaabdf96..3eed49598d66 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1458,6 +1458,7 @@ static void __meminit init_reserved_page(unsigned long pfn) { pg_data_t *pgdat; int nid, zid; + bool found = false; if (!early_page_uninitialised(pfn)) return; @@ -1468,10 +1469,15 @@ static void __meminit init_reserved_page(unsigned long pfn) for (zid = 0; zid < MAX_NR_ZONES; zid++) { struct zone *zone = &pgdat->node_zones[zid]; - if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) +
Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
On Fri, Dec 04, 2020 at 02:23:29PM -0500, Peter Xu wrote:
> If we see [1]:
>
>   if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma &&
>       !is_migration_entry)
>
> Then it's fundamentally the same as:
>
>   swp_entry(0, _UFFD_SWP_UFFD_WP) && !vma_is_anonymous(vma)

Yes, conceptually it's the same, but in practice it results in a different value in the raw pte if you set swp_offset to _UFFD_SWP_UFFD_WP. Setting swp_offset to _UFFD_SWP_UFFD_WP is just confusing; it's better to use magic type 1 offset 0 than to even touch _UFFD_SWP_UFFD_WP, if you reserve the magic value in the swap entry.

pte_swp_uffd_wp or _UFFD_SWP_UFFD_WP are the raw pte value. swp_entry(0, _UFFD_SWP_UFFD_WP) is not the raw pte value, and it has to be first converted to the raw pte value with swp_entry_to_pte, so the _UFFD_SWP_UFFD_WP gets shifted left.

If we use _UFFD_SWP_UFFD_WP, it looks much cleaner to keep it in the pte, not in the swp entry, since then you can use the already existing methods that can only take the pte_t in input (not the swp_entry_t).

Thanks,
Andrea
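The "gets shifted left" point can be shown with a toy model: a swp_entry_t holds (type, offset) as logical fields, and swp_entry_to_pte() packs them into the raw pte by shifting, so a flag value placed in swp_offset does not land on the same raw-pte bit as a dedicated pte flag. The bit layout below is invented purely for illustration and does not match any real architecture's pte format.

```c
#include <stdint.h>

/* Hypothetical layout: 5 type bits, swp entry shifted left by 3 in
 * the raw pte, one dedicated raw-pte flag bit.  Invented values. */
#define SWP_TYPE_BITS		5
#define SWP_PTE_SHIFT		3	/* low raw-pte bits reserved for flags */
#define PAGE_SWP_UFFD_WP	(1UL << 2)	/* hypothetical raw-pte flag */

typedef struct { unsigned long val; } swp_entry_t;
typedef struct { unsigned long val; } pte_t;

/* Logical swap entry: offset above the type field. */
static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	return (swp_entry_t){ (offset << SWP_TYPE_BITS) | type };
}

/* Convert to the raw pte: the whole entry shifts past the flag bits. */
static pte_t swp_entry_to_pte(swp_entry_t e)
{
	return (pte_t){ e.val << SWP_PTE_SHIFT };
}

/* Set the dedicated flag directly in the raw pte instead. */
static pte_t pte_swp_mkuffd_wp(pte_t pte)
{
	return (pte_t){ pte.val | PAGE_SWP_UFFD_WP };
}
```

Stuffing the flag value into swp_offset and setting the raw-pte flag therefore produce different raw ptes, which is why keeping the bit in the pte (where the existing pte_swp_* methods apply) is the cleaner of the two.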
Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
On Thu, Dec 03, 2020 at 11:10:18PM -0500, Andrea Arcangeli wrote: > from the pte, one that cannot ever be set in any swp entry today. I > assume it can't be _PAGE_SWP_UFFD_WP since that already can be set but > you may want to verify it... I thought more about the above, and I think the already existing pte_swp_mkuffd_wp will just be enough without having to reserve an extra bitflag if we encode it as a non migration entry. The check: if (!pte_present && !pte_none && pte_swp_uffd_wp && not_anonymous_vma && !is_migration_entry) should be enough to disambiguate it. When setting it, it'd be enough to set the pte to the value _PAGE_SWP_UFFD_WP. Although if you prefer to check for: if (!pte_present && !pte_none && swp_type == 1 && swp_offset == 0 && not_anonymous_vma && !is_migration_entry) that would do as well. It's up to you, just my preference is to reuse _PAGE_SWP_UFFD_WP since it has already to exist, there are already all the pte_swp_*uffd* methods available or uffd-wp cannot work. Thanks, Andrea
Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
Hi Peter,

On Thu, Dec 03, 2020 at 09:30:51PM -0500, Peter Xu wrote:
> I'm just afraid there's no space left for a migration entry, because migration
> entries fills in the pfn information into swp offset field rather than a real
> offset (please refer to make_migration_entry())?  I assume PFN can use any bit.
> Or did I miss anything?
>
> I went back to see the original proposal from Hugh:
>
>   IIUC you only need a single value, no need to carve out another whole
>   swp_type: could probably be swp_offset 0 of any swp_type other than 0.
>
> Hugh/Andrea, sorry if this is a stupid swap question: could you help explain
> why swp_offset=0 won't be used by any swap device?  I believe it's correct,
> it's just that I failed to figure out the reason myself. :(

Hugh may want to review if I got it wrong, but there are basically three ways.

swp_type would mean adding one more reserved value in addition to SWP_MIGRATION_READ and SWP_MIGRATION_WRITE (kind of increasing SWP_MIGRATION_NUM to 3).

swp_offset = 0 works in combination with SWP_MIGRATION_WRITE and SWP_MIGRATION_READ if we enforce that pfn 0 is never used by the kernel (I'd feel safer with pfn value -1UL truncated to the bits of the swp offset, since the swp_entry format is common code).

The bit I was suggesting is just one more bit like _PAGE_SWP_UFFD_WP from the pte, one that cannot ever be set in any swp entry today. I assume it can't be _PAGE_SWP_UFFD_WP since that already can be set, but you may want to verify it... It'd be set on the pte (not in the swap entry), then it doesn't matter much what's inside the swp_entry anymore. The pte value would be generated with this:

   pte_swp_uffd_wp_unmap(swp_entry_to_pte(swp_entry(SWP_MIGRATION_READ, 0)))

(maybe SWP_MIGRATION_READ could also be 0 and then it can be just enough to set that single bit in the pte and nothing else, all other bits zero)

We never store a raw swp entry in the pte (the raw swp entry is stored in the xarray, it's the index of the swapcache).
To solve our unmap issue we only deal with pte storage (no xarray index storage). This is why it can also live in the arch specific pte representation of the swp entry; it doesn't need to be a special value defined in the swp entry common code. The swap entry to pte conversion being arch dependent, such a bit needs to be defined by each arch (reserving an offset or type value in the swp entry would solve it in the common code instead).

   #define SWP_OFFSET_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)

All bits below PROTNONE are available for software use and we use bits 1 (soft dirty) and 2 (uffd_wp). The protnone bit 8 itself (global bit) must not be set or it'll look protnone and pte_present will be true. Bit 7 is PSE so it's also not available because pte_present checks that too. It appears you can pick between bits 3 4 5 6 at your own choice and it doesn't look like we're running out of those yet (if we were, there would be a bigger incentive to encode it as part of the swp entry format). Example:

   #define _PAGE_SWP_UFFD_WP_UNMAP _PAGE_PWT

If that bit is set and pte_present is false, then everything else in that pte is meaningless and it means uffd-wrprotected pte_none.

So in the migration-entry/swapin page fault path, you could go one step back and check the pte for such a bit; if it's set, it's not a migration entry. If there's a read access it should fill the page with shmem_fault, keep the pte wrprotected and then set _PAGE_UFFD_WP on the pte. If there's a write access it should invoke handle_userfault.

If there's any reason why the swp_entry reservation is simpler that's ok too; you'll see a huge lot more details once you try to implement it, so you'll be better able to judge later. I'm greatly simplifying everything but this is no simple feat...

Thanks, Andrea
Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
On Thu, Dec 03, 2020 at 01:02:34PM -0500, Peter Xu wrote: > On Wed, Dec 02, 2020 at 09:36:45PM -0800, Hugh Dickins wrote: > > On Wed, 2 Dec 2020, Peter Xu wrote: > > > On Wed, Dec 02, 2020 at 02:37:33PM -0800, Hugh Dickins wrote: > > > > On Tue, 1 Dec 2020, Andrea Arcangeli wrote: > > > > > > > > > > Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit > > > > > survive the pte invalidates in a way that remains associated to a > > > > > certain vaddr in a single mm (so it can shoot itself in the foot if it > > > > > wants, but it can't interfere with all other mm sharing the shmem > > > > > file) would be welcome... > > > > > > > > I think it has to be a new variety of swap-like non_swap_entry() pte, > > > > see include/linux/swapops.h. Anything else would be more troublesome. Agreed. Solving it by changing the unmapping of the ptes is also some trouble but less troublesome than adding new bitmaps to the vma to handle in vma_merge/vma_split. > > But those ptes will be pte_present(), so you must provide a pfn, > > Could I ask why? _PAGE_PROTNONE exists only for one reason, so pte_present returns true and the page is as good as mapped, the pfn is real and mapped, everything is up and running fine except _PAGE_PRESENT is not set in the pte. _PAGE_PROTNONE doesn't unmap the pte, it only triggers faults on a mapped pte. When we set _PAGE_PROTNONE and clear _PAGE_PRESENT atomically, nothing changes at all for all pte_present and all other VM common code, except now you get page faults. So numa hinting faults use that and it's a perfect fit for that, because you just want to change nothing, but still be notified on access. Here instead you need to really unmap the page and lose any pfn or page reference and everything along with it, just one bit need to survive the unmap: the _PAGE_UFFD_WP bit. 
I tend to agree this needs to work more similarly to the migration entry like Hugh suggested, or an entirely new mechanism to keep "vma specific state" alive, for filebacked mappings that get zapped but that still have a vma intact. The vma removal in the munmap() syscall is then the only point where the pte is allowed to be cleared for good and only then the pte can be freed. Not even MADV_DONTNEED should be allowed to zero out the pte; it can drop everything but that single bit.

> Meanwhile, this reminded me another option - besides _PAGE_PROTNONE, can we use
> _PAGE_SPECIAL?  That sounds better at least from its naming that it tells it's
> a special page already.  We can also leverage existing pte_special() checks here
> and there, then mimic what we do with pte_devmap(), maybe?

That's also for mapped pages, VM_PFNMAP or similar I think. From memory, the !pte_present case for filebacked vmas only exists as a migration entry, so I think we'll just add a branch to that case so that it can disambiguate the migration entry from the _PAGE_UFFD_WP bit.

So we can reserve one bit in the migration entry that is always enforced zero when it is a migration entry. When that bit is set on a non-present page in a filebacked vma, it will mean _UFFD_PAGE_WP is set for that vaddr in that mm.

Then we need to pass a parameter in all pte zapping operations to tell if the unmap event is happening because the vma has been truncated, or if it's happening for any other reason. If it's happening for any other reason (page truncate, MADV_DONTNEED, memory pressure to swap/writepage) we just convert any present pte with _UFFD_PAGE_WP set to the non-migration-entry non-present pte with the reserved migration entry bit set. If the present pte has no _UFFD_PAGE_WP then it'll be zapped as usual regardless of whether VM_UFFD_WP is in the vma or not.
If the pte zapping is instead invoked because of a vma truncation, and it means it's the last unmap operation on that vaddr, the pte zap logic will be told to ignore the _UFFD_PAGE_WP in any present pte so the ptes will be zeroed out and later freed as needed. Thanks, Andrea
Re: [PATCH] mm: refactor initialization of stuct page for holes in memory layout
Hello, On Thu, Dec 03, 2020 at 08:25:49AM +0200, Mike Rapoport wrote: > On Wed, Dec 02, 2020 at 03:47:36PM -0800, Andrew Morton wrote: > > On Tue, 1 Dec 2020 20:15:02 +0200 Mike Rapoport wrote: > > > > > From: Mike Rapoport > > > > > > There could be struct pages that are not backed by actual physical memory. > > > This can happen when the actual memory bank is not a multiple of > > > SECTION_SIZE or when an architecture does not register memory holes > > > reserved by the firmware as memblock.memory. > > > > > > Such pages are currently initialized using init_unavailable_mem() function > > > that iterated through PFNs in holes in memblock.memory and if there is a > > > struct page corresponding to a PFN, the fields if this page are set to > > > default values and it is marked as Reserved. > > > > > > init_unavailable_mem() does not take into account zone and node the page > > > belongs to and sets both zone and node links in struct page to zero. > > > > > > On a system that has firmware reserved holes in a zone above ZONE_DMA, for > > > instance in a configuration below: > > > > > > # grep -A1 E820 /proc/iomem > > > 7a17b000-7a216fff : Unknown E820 type > > > 7a217000-7bff : System RAM > > > > > > unset zone link in struct page will trigger > > > > > > VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); > > > > That sounds pretty serious. My understanding is that with DEBUG_VM=n the invariant that broke won't cause trouble, but Fedora is helping the upstream testing by keeping DEBUG_VM=y and it's shipping with v5.8 and v5.9 for a while, so it could very well crash those kernels if they've that type 20 thing in the e820 map. > > > > > because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link > > > in > > > struct page) in the same pageblock. > > > > > > Interleave initialization of pages that correspond to holes with the > > > initialization of memory map, so that zone and node information will be > > > properly set on such pages. 
> > > >
> > Should this be backported to -stable?  If so, do we have a suitable Fixes:?
>
> Sorry, I forgot to add
>
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")

I've been wondering already why I'm the only one getting a crash every two weeks. Once it crashed in MADV_HUGEPAGE of qemu; that would definitely have happened even with Fedora despite THP enabled = madvise, and it hung qemu for good, so it was noticeable since it was in direct compaction. Other times it was in kcompactd, so it just killed the kernel thread and was only noticeable in the kernel logs, and probably it doesn't happen that frequently unless THP enabled = always, although it could still happen: compaction isn't used just for THP.

Thanks, Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Thu, Dec 03, 2020 at 12:51:07PM +0200, Mike Rapoport wrote:
> On Thu, Dec 03, 2020 at 01:23:02AM -0500, Andrea Arcangeli wrote:
> > 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in
> >    memblock.reserve that doesn't overlap any memblock.memory zone.
>
> And, IMHO, this is the fundamental BUG.

We have a dozen arches that change all the time, new efi/e820 stuff, always new bios configs; the memblock code must be able to cope with the caller being wrong or changing behavior if the e820 map changes behaviour.

Trying to work around memblock core deficiencies in the caller is what 124049decbb121ec32742c94fb5d9d6bed8f24d8 already did, and it doesn't work and it's not even a clean thing to do.

In fact here there's really nothing to blame the caller for (i.e. the e820 init code), unless you declare that there must always be overlap (which I understood would break stuff for regions that must not have a direct mapping).

The caller here is correct and it can be summarized as below:

	if (e820 map type != system_ram)
		memblock_reserve(range)
	else
		memblock_add(range)

Then:

	zone_boundaries = [ 16M, 4G, 0 ]
	free_area_init(zone_boundaries)

It's not the caller's fault if the e820 map in the bios starts with:

	pfn 0-1 reserved
	pfn 1-N system RAM

This is better fixed in a way that must not require any change in the caller...

> There is RAM at address 0, there is a DIMM there, so why on earth this
> should not be a part of memblock.memory?

How can you blame the caller if you explicitly didn't require that .reserved ranges must always overlap .memory ranges? In fact it was explained as a feature, to avoid the direct mapping.

Declaring that there must always be overlap would be one way to cope with this issue, but even then memblock must detect a wrong caller, work around it by doing a memblock_add call inside memblock_reserve, and print a warning to improve the caller. It shouldn't crash at boot with the console off if the overlap is not there.
In my view the caller here is not to blame, all these issues are because memblock needs improvement and must cope with any caller. > >The memblock_start_of_DRAM moves the start of the DMA zone above > >the pfn 0, so then pfn 0 already ends up in the no zone land David > >mentioned where it's not trivial to decide to give it a zoneid if > >it's not even spanned in the zone. > > > >Not just the zoneid, there's not even a nid. > > > >So we have a pfn with no zoneid, and no nid, and not part of the > >zone ranges, not part of the nid ranges not part of the > >memory.memblock. > > We have a pfn that should have been in memblock.memory, in ZONE_DMA and > in the first node with memory. If it was not trimmed from there My incremental patch already solves how to extend the zones spans and the nid spans by taking memblock.reserved into account, not just memblock.memory, but still without actually messing with the direct mapping, otherwise I would have called memblock_add instead of walking memblock.reserved when detecting the nid and zoneid ranges. > > >We can't really skip the initialization of the the pfn 0, it has to > >get the zoneid 0 treatment or if pfn 1 ends up in compaction code, > >we'd crash again. (although we'd crash with a nice PagePoison(page) > >== true behavior) > > Agree. Struct page for pfn should get the same zoneid and nodeid as pfn 1. How are you sure there's no a zone ID 0 that starts at pfn 0 and ends at pfn 1 and the zone dma starts at pfn 1? In such case the pfn 0 would have a zoneid different from pfn 1. I'm not exactly sure why we should initialize a pfn 0 with a zoneid of a zone whose pfn 0 is not part of, that looks wrong. I'm not exactly sure why you don't fix the zone->zone_start_pfn and the respective nid node_start_pfn to start from pfn 0 instead, just like I did in my patch. 
> From 84a1c2531374706f3592a638523278aa29aaa448 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport
> Date: Thu, 3 Dec 2020 11:40:17 +0200
> Subject: [PATCH] fixup for "mm: refactor initialization of stuct page for
>  holes"
>
> Signed-off-by: Mike Rapoport
> ---
>  mm/page_alloc.c | 16
>  1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ce2bdaabdf96..86fde4424e87 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6227,7 +6227,8 @@ void __init __weak memmap_init(unsigned long size, int nid,
>  				unsigned long zone,
>  				unsigned long range_start_pfn)
>  {
> -	unsigned long start_pfn, end_pfn, next_pfn = 0;
> +	static uns
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Dec 02, 2020 at 07:39:23PM +0200, Mike Rapoport wrote: > Hmm, do you have any logs? It worked on my box and with various memory > configurations in qemu. It crashes in reserve_bootmem_region so no logs at all since the serial console isn't active yet. > I believe that the proper solution would be to have all the memory > reserved by the firmware added to memblock.memory, like all other > architectures do, but I'm not sure if we can easily and reliably > determine what of E820 reserved types are actually backed by RAM. > So this is not feasible as a short term solution. > > My patch [1], though, is an immediate improvement and I think it's worth > trying to fix off-by-one's that prevent your system from booting. > > [1] https://lore.kernel.org/lkml/20201201181502.2340-1-r...@kernel.org Yes that's the patch I applied. Relevant points after debugging: 1) can you try again with DEBUG_VM=y? 2) DEBUG_VM=y already enforces the memset(page, 0xff, sizeof(struct page)) on all struct pages allocated, exactly I was suggesting to do in the previous email. I wonder why we're not doing that even with DEBUG_VM=n, perhaps it's too slow for TiB systems. See page_init_poison(). 3) given 2), your patch below by deleting "0,0" initialization achieves the debug feature to keep PagePoison forever on all uninitialized pages, imagine PagePoison is really NO_NODEID/NO_ZONEID and doesn't need handling other than a crash. - __init_single_page(pfn_to_page(pfn), pfn, 0, 0); 4) because of the feature in 3) I now hit the PagePoison VM_BUG_ON because pfn 0 wasn't initialized when reserve_bootmem_region tries to call __SetPageReserved on a PagePoison page. 5) pfn 0 is the classical case where pfn 0 is in a reserved zone in memblock.reserve that doesn't overlap any memblock.memory zone. 
The memblock_start_of_DRAM moves the start of the DMA zone above pfn 0, so then pfn 0 already ends up in the no-zone land David mentioned, where it's not trivial to decide to give it a zoneid if it's not even spanned in the zone.

Not just the zoneid, there's not even a nid.

So we have a pfn with no zoneid, and no nid, and not part of the zone ranges, not part of the nid ranges, not part of memblock.memory.

We can't really skip the initialization of pfn 0: it has to get the zoneid 0 treatment or if pfn 1 ends up in compaction code, we'd crash again. (although we'd crash with a nice PagePoison(page) == true behavior)

The below fixes the issue and it boots fine again, and the compaction crash should be solved with both patches applied.

   DMA      zone_start_pfn 0     zone_end_pfn() 4096    contiguous 1
   DMA32    zone_start_pfn 4096  zone_end_pfn() 524252  contiguous 1
   Normal   zone_start_pfn 0     zone_end_pfn() 0       contiguous 0
   Movable  zone_start_pfn 0     zone_end_pfn() 0       contiguous 0

   500246 0x7a216000 0x1fff1000 reserved True  pfn_valid True
   500247 0x7a217000 0x1fff     reserved False pfn_valid True
   500248 0x7a218000 0x1fff     reserved False pfn_valid True

quote from previous email 73a6e474cb376921a311786652782155eac2fdf0 this changed it:

   DMA      zone_start_pfn 1     zone_end_pfn() 4096    contiguous 1
   DMA32    zone_start_pfn 4096  zone_end_pfn() 524252  contiguous 0
   Normal   zone_start_pfn 0     zone_end_pfn() 0       contiguous 0
   Movable  zone_start_pfn 0     zone_end_pfn() 0       contiguous 0

   500246 0x7a216000 0xfff1000      reserved True
   500247 0x7a217000 0x1fff         reserved False
   500248 0x7a218000 0x1fff00010200 reserved False

] quote from previous email

>From 89469f063c192ae70aea0bd6e1e2a7e99894e5b6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Wed, 2 Dec 2020 23:23:26 -0500
Subject: [PATCH 1/1] mm: initialize struct pages in reserved regions outside
 of the zone ranges

pfn 0 wasn't initialized and the PagePoison remained set when
reserve_bootmem_region() called __SetPageReserved, inducing a silent
boot failure with DEBUG_VM (and
correctly so, because the crash signaled the nodeid/nid of pfn 0 would be again wrong). Without this change, the pfn 0 part of a memblock.reserved range, isn't in any zone spanned range, nor in any nid spanned range, when we initialize the memblock.memory holes pfn 0 won't be included because it has no nid and no zoneid. There's no enforcement that all memblock.reserved ranges must overlap memblock.memory ranges, so the memblock.reserved ranges also require an explicit initialization and the zones and nid ranges need to be extended to include all memblock.reserved ranges with struct pages, or they'll be left uninitialized with PagePoison as
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Hello Mike,

On Sun, Nov 29, 2020 at 02:32:57PM +0200, Mike Rapoport wrote:
> Hello Andrea,
>
> On Thu, Nov 26, 2020 at 07:46:05PM +0200, Mike Rapoport wrote:
> > On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote:
> >
> > Let's try to merge init_unavailable_memory() into memmap_init().
> > Than it'll be able to set zone/nid for those nasty pfns that BIOS
> > decided to keep to itself, like in Andrea's case and will also take care
> > of struct pages that do not really have a frame in DRAM, but are there
> > because of arbitrary section size.
>
> Did you have a chance to check if this solves your issue?
> If yes, I'll resend this as a formal patch.

I tested the patch you sent, but it doesn't seem to boot. Reverting it booted.

Also I noticed leaving pages uninitialized already happened before and was fixed in 124049decbb121ec32742c94fb5d9d6bed8f24d8, where literally all holes were registered in memblock_reserve() by hand to work around this very same defect in the memblock callee we're fixing again. Then it was lifted in 9fd61bc95130d4971568b89c9548b5e0a4e18e0e since memblock was supposed to be able to initialize all pages.

Since this seems the second time this happens, I'd suggest changing the graceful "pfn, 0, 0" initialization to memset(page, 0xff, sizeof(struct page)) too, like we mentioned earlier, and then have at least a DEBUG_SOMETHING=y to search for struct pages with all 1s in page->flags to ensure the boot stage worked right, so perhaps there's a graceful notification at boot before a VM_BUG_ON triggers later. The page struct validation could be done based on DEBUG_VM=y too since it won't cause any runtime cost post boot.

Thanks, Andrea
Re: [PATCH v2] mm: Don't fault around userfaultfd-registered regions on reads
On Tue, Dec 01, 2020 at 05:30:33PM -0500, Peter Xu wrote: > On Tue, Dec 01, 2020 at 12:59:27PM +, Matthew Wilcox wrote: > > On Mon, Nov 30, 2020 at 06:06:03PM -0500, Peter Xu wrote: > > > Faulting around for reads are in most cases helpful for the performance > > > so that > > > continuous memory accesses may avoid another trip of page fault. However > > > it > > > may not always work as expected. > > > > > > For example, userfaultfd registered regions may not be the best candidate > > > for > > > pre-faults around the reads. > > > > > > For missing mode uffds, fault around does not help because if the page > > > cache > > > existed, then the page should be there already. If the page cache is not > > > there, nothing else we can do, either. If the fault-around code is > > > destined to > > > be helpless for userfault-missing vmas, then ideally we can skip it. > > > > This sounds like you're thinking of a file which has exactly one user. > > If there are multiple processes mapping the same file, then no, there's > > no reason to expect a page to be already present in the page table, > > just because it's present in the page cache. > > > > > For wr-protected mode uffds, errornously fault in those pages around > > > could lead > > > to threads accessing the pages without uffd server's awareness. For > > > example, > > > when punching holes on uffd-wp registered shmem regions, we'll first try > > > to > > > unmap all the pages before evicting the page cache but without locking the > > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is > > > called > > > before shmem_truncate_range()). When fault-around happens near a hole > > > being > > > punched, we might errornously fault in the "holes" right before it will be > > > punched. 
Then there's a small window before the page cache was finally > > > dropped, and after the page will be writable again (NOTE: the uffd-wp > > > protect > > > information is totally lost due to the pre-unmap in shmem_fallocate(), so > > > the > > > page can be writable within the small window). That's severe data loss. > > > > This still doesn't make sense. If the page is Uptodate in the page > > cache, then userspace gets to access it. If you don't want the page to > > be accessible, ClearPageUptodate(). read() can also access it if it's > > marked Uptodate. A write fault on a page will call the filesystem's > > page_mkwrite() and you can block it there. > > I still don't think the page_mkwrite() could help here... Though Andrea > pointed I tend to agree page_mkwrite won't help, there's no I/O, nor dirty memory pressure, not even VM_FAULT_MISSING can work on real filesystems yet. The uffd context is associated to certain virtual addresses in the "mm", read/write syscalls shouldn't notice any difference, as all other mm shouldn't notice anything either. It should be enough to check the bit in shmem_fault invoked in ->fault for this purpose, the problem is we need the bit to survive the invalidate. > out an more important issue against swap cache (in the v1 thread [1]). Indeed > if we have those figured out maybe we'll also rethink this patch then it could > become optional; while that seems to be required to allow shmem swap in/out > with uffd-wp which I haven't yet tested. As Hugh pointed out, purely reuse > the > _PAGE_SWP_UFFD_WP in swap cache may not work trivially since uffd-wp is > per-pte > rather than per-page, so I probably need to think a bit more on how to do > that... > > I don't know whether a patch like this could still be good in the future. For > now, let's drop this patch until we solve all the rest of the puzzle. > > My thanks to all the reviewers, and sorry for the noise! Thanks to you Peter! 
No noise here, it's great progress to have found the next piece of the puzzle. Any suggestions on how to have the per-vaddr per-mm _PAGE_UFFD_WP bit survive the pte invalidates in a way that remains associated to a certain vaddr in a single mm (so it can shoot itself in the foot if it wants, but it can't interfere with all other mm sharing the shmem file) would be welcome... Andrea
Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
Hi Hugh, On Tue, Dec 01, 2020 at 01:31:21PM -0800, Hugh Dickins wrote: > Please don't ever rely on that i_private business for correctness: as The out of order and lockless "if (inode->i_private)" branch didn't inspire much confidence in terms of being able to rely on it for locking correctness indeed. > more of the comment around "So refrain from" explains, it was added > to avoid a livelock with the trinity fuzzer, and I'd gladly delete > it all, if I could ever be confident that enough has changed in the > intervening years that it's no longer necessary. I was wondering if it's still needed, so now I checked the old code and it also did an unconditional lock_page in shmem_fault, so I assume it's still necessary. > It tends to prevent shmem faults racing hole punches in the same area > (without quite guaranteeing no race at all). But without the livelock > issue, I'd just have gone on letting them race. "Punch hole" == > "Lose data" - though I can imagine that UFFD would like more control > over that. Since map_pages is driven by faulting, it should already > be throttled too. Yes I see now it just needs to "eventually" stop the shmem_fault activity, it doesn't need to catch those faults already in flight, so it cannot relied upon as the form of serialization to zap the pageteables while truncating the page. > But Andrea in next mail goes on to see other issues with UFFD_WP > in relation to shmem swap, so this thread is probably dead now. > If there's a bit to spare for UFFD_WP in the anonymous swap entry, > then that bit should be usable for shmem too: but a shmem page > (and its swap entry) can be shared between many different users, > so I don't know whether that will make sense. 
An UFFD_WP software bit exists both for the present pte and for the swap entry:

	#define _PAGE_UFFD_WP     (_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
	#define _PAGE_SWP_UFFD_WP _PAGE_USER

It works similarly to soft dirty: if the page is swapped, _PAGE_UFFD_WP is translated to _PAGE_SWP_UFFD_WP and the other way around during swapin.

The problem is that the bit represents information that is specific to an mm and a single mapping. If you punch a hole in a shmem file, map the shmem file in two processes A and B, and register the mapped range in a uffd opened by process A using VM_UFFD_MISSING, only process A will generate userfaults if you access the virtual address that corresponds to the hole in the file; process B will still fill the hole normally and it won't trigger userfaults in process A.

The uffd context is attached to an mm, and the notification only has effect on that mm, the mm that created the context. Only the UFFDIO_ operations that resolve the fault (like UFFDIO_COPY) have effects visible by all file sharers.

So VM_UFFD_WP shall work the same, and the _PAGE_SWP_UFFD_WP information of a swapped-out shmem page shouldn't go in the xarray, because doing so would share it with all "mm" sharing the mapping.

Currently uffd-wp only works on anon memory, so this is a new challenge in extending uffd-wp to shmem it seems. If any pagetable of an anonymous memory mapping is zapped, then not just the _PAGE_SWP_UFFD_WP bit is lost, the whole data is lost too, so it'd break not just uffd-wp. By contrast, with shmem the ptes can be zapped at all times by memory pressure alone, even before the final ->writepage swap out is invoked.

It's always possible to make uffd-wp work without those bits (anon memory also would work without those bits), but the reason those bits exist is to avoid a flood of UFFDIO_WRITEPROTECT ioctl false-negatives after fork or after swapping (in the anon case).
In the shmem case, if we did it without a bitflag to track which page was wrprotected in which vma, the uffd handler would be required to mark each pte writable with a syscall every time a new pte has been established on the range and there's a write to the page.

One possibility could be to store the bit set by UFFDIO_WRITEPROTECT in a vmalloc bitmap associated with the shmem vma at uffd registration time, so it can be checked during shmem_fault() if VM_UFFD_WP is set on the vma? The vma_split and vma_merge handling would be the hard part.

Another possibility would be to change how the pte invalidate works for vmas with VM_UFFD_WP set: the pte and the _PAGE_UFFD_WP would need to survive all invalidates... only the final truncation under page lock would really be allowed to clear the whole pte, destroy the _PAGE_UFFD_WP information in the pte, and then free the pgtables on those ranges.

Once we find a way to make that bit survive the invalidates, we can check it in shmem_fault to decide whether to invoke handle_userfault if the bit is set and FAULT_FLAG_WRITE is set in vmf->flags, or to map the page wrprotected in that vaddr if the bit is set and FAULT_FLAG_WRITE is not set.

Thanks, Andrea
Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
Hi Peter,

On Sat, Nov 28, 2020 at 10:29:03AM -0500, Peter Xu wrote:
> Yes.  A trivial supplementary detail is that filemap_map_pages() will only set
> it read-only since alloc_set_pte() will only set write bit if it's a write.  In
> our case it's read fault-around so without it.  However it'll be quickly marked
> as writable as long as the thread quickly writes to it again.

The problem here is that all the information injected into the kernel through the UFFDIO_WRITEPROTECT syscalls is wiped by the quoted unmap_mapping_range. We can later discuss how we actually check _PAGE_UFFD_WP, but that's not the first thing to solve here. Here we need to find a way to retain _PAGE_UFFD_WP until the page is actually locked and is atomically disappearing from the xarray index as far as further shmem page faults are concerned.

Adding that information to the xarray isn't ideal because it's mm specific and not a property of the "pagecache", but keeping that information in the pagetables as we do for anon memory also causes issues as shown below, because the shmem pagetables are supposed to be zapped at any time and re-created on the fly based on the xarray. By contrast, this never happens with anon memory.

For example, when shmem swaps, the swap entry is written in the xarray, not in the pagetable: how are we going to let _PAGE_UFFD_WP survive swapping of a shmem page if we store _PAGE_UFFD_WP in the pte? Did you test swap?

So here I'm afraid there's something more fundamental in how to have _PAGE_UFFD_WP survive swapping and a random pte zap, and we need more infrastructure in unmap_mapping_range to retain that information so that swapping also works. So I don't think we can evaluate this patch without seeing the full patchset and how the rest is being handled. Still it's great you found the source of the corruption for the current shmem uffd-wp codebase so we can move forward!
> > However filemap_map_pages isn't capable to fill a hole and to undo the > > hole punch, all it can do are minor faults to re-fill the ptes from a > > not-yet-truncated inode page. > > > > Now it would be ok if filemap_map_pages re-filled the ptes, after they > > were zapped the first time by fallocate, and then the fallocate would > > zap them again before truncating the page, but I don't see a second > > pte zap... there's just the below single call of unmap_mapping_range: > > > > if ((u64)unmap_end > (u64)unmap_start) > > unmap_mapping_range(mapping, unmap_start, > > 1 + unmap_end - unmap_start, 0); > > shmem_truncate_range(inode, offset, offset + len - 1); > > IMHO the 2nd pte zap is inside shmem_truncate_range(), where we will call > truncate_inode_page() then onto truncate_cleanup_page(). I wrongly assumed there was only one invalidate, and I guessed that was what caused the problem since it ran outside the page lock, so it wasn't "atomic" with respect to the page truncation in the xarray. The comment "refrain from faulting pages into the hole while it's being punched" and the "inode->i_private waitqueue" logic is just to slowly throttle the faulting so the truncation is faster. It's still not entirely clear why it's needed, since shmem_getpage_gfp uses find_lock_entry and the page had to be locked. I didn't look at the history to see in which order those changes were added and whether it's still needed. Anyway, it already looked suspect that "unlikely(inode->i_private)" would be enough as an out of order check to fully serialize things, but for a performance "throttling" logic it's fine and not concerning (worst case it is not needed and the page lock waitqueue would have serialized the faults against the slow invalidate loops already). > Since we're at it, some more context: this is actually where I started to > notice the issue, that we'll try to pre-unmap the whole region first before > shmem_truncate_range(). 
> The thing is shmem_truncate_range() will zap the ptes
> too, in an even safer way (with page lock held). So before I came up with the
> current patch, I also tried below patch, and it also fixes the data corrupt
> issue:
>
> -8<-
> diff --git a/mm/shmem.c b/mm/shmem.c
> index efa19e33e470..b275f401fe1f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2777,7 +2777,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  	inode_lock(inode);
>
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
> -		struct address_space *mapping = file->f_mapping;
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>  		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
> @@ -2795,9 +2794,6 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		inode->i_private = &shmem_falloc;
>  		spin_unlock(&inode->i_lock);
>
> -		if
Re: [PATCH] mm: Don't fault around userfaultfd-registered regions on reads
Hello, On Fri, Nov 27, 2020 at 12:22:24PM +, Matthew Wilcox wrote: > On Thu, Nov 26, 2020 at 05:23:59PM -0500, Peter Xu wrote: > > For wr-protected mode uffds, errornously fault in those pages around could > > lead > > to threads accessing the pages without uffd server's awareness. For > > example, > > when punching holes on uffd-wp registered shmem regions, we'll first try to > > unmap all the pages before evicting the page cache but without locking the > > page (please refer to shmem_fallocate(), where unmap_mapping_range() is > > called > > before shmem_truncate_range()). When fault-around happens near a hole being > > punched, we might errornously fault in the "holes" right before it will be > > punched. Then there's a small window before the page cache was finally > > dropped, and after the page will be writable again (NOTE: the uffd-wp > > protect > > information is totally lost due to the pre-unmap in shmem_fallocate(), so > > the > > page can be writable within the small window). That's severe data loss. > > Sounds like you have a missing page_mkwrite implementation. If the real fault happened through the pagetable (which got dropped by the hole punching), a "missing type" userfault would be delivered to userland (because the pte would be none). Userland would invoke UFFDIO_COPY with the UFFDIO_COPY_MODE_WP flag. Such flag would then map the filled shmem page (not necessarily all zero and not necessarily the old content before the hole punch) with _PAGE_RW not set and _PAGE_UFFD_WP set, so the next write would also trigger a wrprotect userfault (this is what the uffd-wp app expects). filemap_map_pages doesn't notify userland when it fills a pte and it will map again the page read-write. However filemap_map_pages isn't capable to fill a hole and to undo the hole punch, all it can do are minor faults to re-fill the ptes from a not-yet-truncated inode page. 
Now it would be ok if filemap_map_pages re-filled the ptes, after they were zapped the first time by fallocate, and then the fallocate would zap them again before truncating the page, but I don't see a second pte zap... there's just the below single call of unmap_mapping_range:

	if ((u64)unmap_end > (u64)unmap_start)
		unmap_mapping_range(mapping, unmap_start,
				    1 + unmap_end - unmap_start, 0);
	shmem_truncate_range(inode, offset, offset + len - 1);

So looking at filemap_map_pages in shmem, I'm really wondering if the non-uffd case is correct or not. Do we end up with ptes pointing to non pagecache, so that the virtual mapping is then out of sync also with read/write syscalls, which will allocate another shmem page out of sync of the ptes? If a real page fault happened instead of filemap_map_pages, shmem_fault() would block during fallocate punch hole by checking inode->i_private, as the comment says: "So refrain from faulting pages into the hole while it's being punched." It's not immediately clear where filemap_map_pages refrains from faulting pages into the hole while it's being punched... given it's ignoring inode->i_private. So I'm not exactly sure how shmem can safely be faulted in through filemap_map_pages without going through shmem_fault... I suppose shmem simply is unsafe to use filemap_map_pages and it'd require a specific shmem_map_pages? If only filemap_map_pages refrained from re-faulting ptes of a shmem page that is about to be truncated (whose original ptes had _PAGE_RW unset and _PAGE_UFFD_WP set) there would be no problem with the uffd interaction. So a proper shmem_map_pages could co-exist with uffd; the userfaultfd_armed check would be only an optimization but it wouldn't be required to avoid userland memory corruption?
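The i_private check that shmem_fault honors and filemap_map_pages skips can be modeled in a few lines. Names and layout are simplified for illustration; in the real code the fault would return VM_FAULT_RETRY and wait on the falloc waitqueue rather than return a boolean:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the shmem_fault() check that filemap_map_pages()
 * skips: while a hole punch is in flight, inode->i_private points at
 * the range being punched, and faults inside that range must back
 * off instead of installing a pte that truncation has already
 * pre-zapped. */
struct shmem_falloc {
	unsigned long start;	/* first pgoff of the hole */
	unsigned long next;	/* one past the last pgoff */
};

struct toy_inode {
	struct shmem_falloc *i_private;	/* non-NULL while punching */
};

/* Return 1 if a fault at @pgoff may install a pte now, 0 if it must
 * back off and retry after the punch completes. */
static int shmem_fault_allowed(const struct toy_inode *inode,
			       unsigned long pgoff)
{
	const struct shmem_falloc *f = inode->i_private;

	if (f && pgoff >= f->start && pgoff < f->next)
		return 0;
	return 1;
}
```

A hypothetical shmem_map_pages would need exactly this bail-out before re-filling any pte in the faulted-around range.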
From 8c6fb1b7dde309f0c8b5666a8e13557ae46369b4 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 27 Nov 2020 19:12:44 -0500
Subject: [PATCH 1/1] shmem: stop faulting in pages without checking inode->i_private

Per the shmem_fault comment, shmem needs to "refrain from faulting
pages into the hole while it's being punched" and to do so it must
check inode->i_private, which filemap_map_pages won't do, so it's
unsafe to use filemap_map_pages in shmem because it can leave ptes
pointing to non-pagecache pages in shmem backed vmas.

Signed-off-by: Andrea Arcangeli
---
 mm/shmem.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 8e2b35ba93ad..f6f29b3e67c6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3942,7 +3942,6 @@ static const struct super_operations shmem_ops = {
 static const struct vm_operations_struct shmem_vm_ops = {
 	.fault		= shmem_fault,
-	.map_pages	= filemap_map_pages,
 #ifdef CONFIG_NUMA
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,

Thanks, Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Thu, Nov 26, 2020 at 09:44:26PM +0200, Mike Rapoport wrote: > TBH, the whole interaction between e820 and memblock keeps me puzzled > and I can only make educated guesses why some ranges here are > memblock_reserve()'d and some memblock_add()ed. The mixed usage in that interaction between memblock.reserve and memblock.memory, where sometimes it's used to reserve overlapping memblock.memory ranges (clearly ok), and sometimes it is used on ranges with no overlap (where it's not even clear why it's being called), is what makes the current code messy. Whenever there is no overlap, we're basically passing down the exact same information (inverted) through two different APIs. > I think what should be there is that e820 entries that are essentially > RAM, used by BIOS or not, should be listed in memblock.memory. Then > using memblock_reserve() for parts that BIOS claimed for itself would > have the same semantics as for memory allocated by kernel. > > I.e. if there is a DIMM from 0 to, say 512M, memblock.memory will have a > range [0, 512M]. And areas such as 0x000-0xfff, 0x9d000-0x9 will be > in memblock.reserved. > > Than in page_alloc.c we'll know that we have a physical memory bank from > 0 to 512M but there are some ranges that we cannot use. > > I suggested it back then when the issue with compaction was reported at > the first time, but Baoquan mentioned that there are systems that cannot > even tolerate having BIOS reserved areas in the page tables and I didn't > continue to pursue this. That explains why we can't add the e820 non-RAM regions with memblock_add; what's not clear is why it should be required to call memblock_reserve on a region that has no overlap with any memblock_add. Instead of the patch that adds a walk of memblock.reserved and that requires adding even more memblock_reserve calls to e820__memblock_setup for type 20, we can add a walk of the memblock.memory holes, and then we can remove the memblock_reserve for E820_TYPE_SOFT_RESERVED too. Thanks, Andrea
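The "same information inverted" point can be made concrete with a small model: given sorted, non-overlapping memblock.memory ranges, the non-overlapping reserved ranges are derivable as the complement. Illustrative userspace C, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the argument above: if memblock.reserved entries don't
 * overlap memblock.memory at all, they carry the same information as
 * the holes in memblock.memory, which can be derived directly.
 * Ranges are sorted, non-overlapping, half-open [start, end). */
struct range { unsigned long start, end; };

/* Fill @holes with the complement of @mem within [0, limit); return
 * the number of holes found.  @holes must have room for n + 1
 * entries. */
static size_t ranges_invert(const struct range *mem, size_t n,
			    unsigned long limit, struct range *holes)
{
	size_t i, out = 0;
	unsigned long cur = 0;

	for (i = 0; i < n; i++) {
		if (mem[i].start > cur) {
			holes[out].start = cur;
			holes[out].end = mem[i].start;
			out++;
		}
		cur = mem[i].end;
	}
	if (cur < limit) {
		holes[out].start = cur;
		holes[out].end = limit;
		out++;
	}
	return out;
}
```

This is the "walk of the memblock.memory holes" idea in miniature: nothing in the hole list needs to be passed down separately by the e820 caller.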
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote: > I think it's inveneted by your BIOS vendor :) BTW, all systems I use on a daily basis have that type 20... Only two of them reproduce the VM_BUG_ON on a weekly basis on v5.9. If you search 'E820 "type 20"' you'll get plenty of hits, so it's not just me at least :).. In fact my guess is there are probably more workstations/laptops with that type 20 than without. Maybe it only shows up if booting with EFI? Easy to check with `dmesg | grep "type 20"` after boot. One guess why this wasn't frequently reproduced is that some desktop distros make the mistake of keeping THP enabled = madvise by default, which reduces the overall compaction testing? Or maybe they're not all setting DEBUG_VM=y (but some other distro I'm sure ships v5.9 with DEBUG_VM=y). Often I hit this bug in kcompactd0 for example, which wouldn't happen with THP enabled=madvise. The two bpf tracing tools below can prove that the current defrag=madvise default only increases the allocation latency from a few usec to a dozen usec. Only when setting defrag=always does the latency go up to single digit milliseconds, because of the cost of direct compaction, which is only worth paying for MADV_HUGEPAGE ranges doing long-lived allocations (we know by now that defrag=always was a suboptimal default). https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d even apps like Redis using fork for snapshotting purposes should prefer THP enabled. (Besides, it would be better if Redis used uffd-wp as an alternative to fork.) 3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another concern, because the decade old "fork() vs gup/O_DIRECT vs thread" race was supposed to be unnoticeable from userland if the O_DIRECT min I/O granularity was enforced to be >=PAGE_SIZE. 
However with THP backed anon memory, that minimum granularity requirement increases to HPAGE_PMD_SIZE. Recent kernels are going in the direction of solving that race by doing cow during fork, as was originally proposed long ago (https://lkml.kernel.org/r/20090311165833.GI27823@random.random), which I suppose will solve the race with sub-PAGE_SIZE granularity too, but 3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the minimum I/O granularity requirement from HPAGE_PMD_SIZE to PAGE_SIZE, as some userland may have expected. The best of course is to fully prevent that race condition by setting MADV_DONTFORK on the regions under O_DIRECT (as qemu does for example). Overall the only tangible concern left is potentially higher memory usage for servers handling tiny object storage freed at PAGE_SIZE granularity with MADV_DONTNEED (instead of having a way to copy and defrag the virtual space of small objects through a callback that updates the pointer to the object...). Small object storage relying on jemalloc/tcmalloc for tiny object management simply needs to selectively disable THP to avoid wasting memory, either with MADV_NOHUGEPAGE or with the prctl PR_SET_THP_DISABLE. Flipping a switch in the OCI schema allows disabling THP too for those object storage apps making heavy use of MADV_DONTNEED; not even a single line of code needs changing in the app if it's deployed through the OCI container runtime. Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's being used, but I hope it already does the right thing and separates the small object arena, zapped with MADV_DONTNEED at PAGE_SIZE granularity, from the large object arena where THP shall remain enabled. glibc also should learn to separate small objects and big objects in different arenas. 
This has to be handled by the app, as it is already handled optimally in scylladb, which in fact invokes MADV_HUGEPAGE; plenty of other databases use not just THP but also hugetlbfs, which certainly won't fly if MADV_DONTNEED is attempted at PAGE_SIZE granularity.. and elasticsearch also gets a significant boost from THP etc.. Thanks, Andrea
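The small/large arena split suggested above amounts to a size-class policy at arena-selection time. The sketch below is a hypothetical policy, not jemalloc's or glibc's actual logic, and the 256KiB threshold is an arbitrary assumption:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size-class policy for the arena split discussed above:
 * tiny objects, freed with MADV_DONTNEED at PAGE_SIZE granularity,
 * live in an arena whose mapping would be marked MADV_NOHUGEPAGE at
 * creation; everything else goes to a THP-backed arena (optionally
 * MADV_HUGEPAGE).  The threshold is an illustrative assumption. */
enum arena_kind {
	ARENA_SMALL_NOHUGEPAGE,	/* madvise(MADV_NOHUGEPAGE) on the arena */
	ARENA_LARGE_THP,	/* THP allowed, or madvise(MADV_HUGEPAGE) */
};

#define SMALL_OBJECT_LIMIT (256 * 1024UL)	/* assumption, not a real default */

static enum arena_kind arena_for_size(size_t size)
{
	return size <= SMALL_OBJECT_LIMIT ? ARENA_SMALL_NOHUGEPAGE
					  : ARENA_LARGE_THP;
}
```

The design point is that the THP decision is made once per arena mapping, so per-object frees at PAGE_SIZE granularity never fragment a huge-page-backed region.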
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote: > memory.reserved cannot be calculated automatically. It represents all > the memory allocations made before page allocator is up. And as > memblock_reserve() is the most basic to allocate memory early at boot we > cannot really delete it ;-) Well this explanation totally covers "memory allocated early at boot" that overlaps with memblock.memory. Does the E820_TYPE_SOFT_RESERVED range added to memblock.reserve qualify as "memory allocated early at boot"? Does it overlap any RAM range added to memblock.memory?

	if (entry->type == E820_TYPE_SOFT_RESERVED)
		memblock_reserve(entry->addr, entry->size);

	if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
		continue;

	memblock_add(entry->addr, entry->size);

To me the above looks like it's being used for something completely different from reserving "memory allocated early at boot". Why is there no warning at boot if there's no overlap between memblock.reserve and memblock.memory? My question about memblock.reserve is really about the non overlapping ranges: why are ranges with no overlap against memblock.memory regions added to memblock.reserve, and why aren't those calculated automatically as the reverse of memblock.memory? It's easy to see that when memblock.reserve overlaps fully, it makes perfect sense and it has to stay. I was really only thinking of usage like the above of memblock_reserve, which looks like it should be turned into a noop and deleted. Thanks, Andrea
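The boot-time warning asked for above would be a simple overlap scan. This is a userspace model of the check, not kernel code; the ranges in the test come from the e820 dump discussed elsewhere in the thread:

```c
#include <assert.h>
#include <stddef.h>

/* Model of the missing sanity check: warn when a memblock.reserved
 * entry has no overlap at all with memblock.memory, i.e. when it is a
 * pure "inverse of memblock.memory" entry that could have been
 * derived automatically.  Ranges are half-open [start, end). */
struct mb_range { unsigned long start, end; };

static int ranges_overlap(struct mb_range a, struct mb_range b)
{
	return a.start < b.end && b.start < a.end;
}

/* Return 1 if @res overlaps at least one memory range ("memory
 * allocated early at boot", the legitimate use), 0 otherwise (the
 * case that deserves a boot-time warning). */
static int reserved_overlaps_memory(const struct mb_range *memory,
				    size_t n, struct mb_range res)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ranges_overlap(memory[i], res))
			return 1;
	return 0;
}
```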
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Thu, Nov 26, 2020 at 11:05:14AM +0100, David Hildenbrand wrote: > I agree that this is sub-optimal, as such pages are impossible to detect > (PageReserved is just not clear as discussed with Andrea). The basic > question is how we want to proceed: > > a) Make sure any online struct page has a valid nid/zid, and is spanned > by the nid/zid. > b) Use a fake nid that will bail out when used for page_zone() and > page_pgdat(), and make pfn walkers detect that. > > AFAIU, Andrea seems to prefer a). I thing b) might be easier in corner > cases. Thoughts? Yes that's a good summary. The reason I slightly prefer a) is that it will perform faster at runtime. I've also seen the proposed patch that adds a pfn_to_page call embedded in pfn_valid to show those pfns as invalid; that is especially slow and I don't like it for that reason. Additional bugchecks for NO_NODEID would slow things down too, so they should be justified by some benefit. It concerns me if pfn_valid becomes slower. b) will also make compaction bail out on that pageblock when it doesn't need to, so it'll provide a worse runtime for compaction as well. a) is what the code already does if only the e820 map range was reserved with memblock_reserve, after applying Mike's patch to initialize reserved memblock regions too. It turns out the VM_BUG_ON_PAGE, as far as compaction is concerned, is just a false positive: it detected a broken VM invariant, but it was fine to proceed in the DEBUG_VM=n case, which wouldn't crash. Before da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 the struct page corresponding to the e820 unknown type range wouldn't be marked PageReserved; that also looked wrong but it was basically harmless. Neither of the two defects is too bad: they could be ignored if we just remove the bugcheck, but it's just nice if the bugcheck turns out to be correct in the pageblock. If we initialize all RAM and non-RAM ranges in the same way, then there's no discrepancy. 
Mike's patch seems to do just that, by walking the memblock.reserved memblocks in the same way as memblock.memory. NOTE: I would much prefer b) if only it would guarantee zero runtime slowdown. (Which is why I especially dislike pfn_valid internally calling pfn_to_page.) Thanks, Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 12:34:41AM -0500, Andrea Arcangeli wrote:
> pfn    physaddr   page->flags
> 500224 0x7a200000 0x1fff1000 reserved True
> 500225 0x7a201000 0x1fff1000 reserved True
> *snip*
> 500245 0x7a215000 0x1fff1000 reserved True
> 500246 0x7a216000 0x1fff1000 reserved True
> 500247 0x7a217000 0x3fff reserved False
> 500248 0x7a218000 0x3fff reserved False

The range is still this:

7a17b000-7a216fff : Unknown E820 type
7a217000-7ffdbfff : System RAM

This is with v5.3 (interestingly the pages below 0x7a217000 weren't even marked reserved):

DMA     zone_start_pfn 1    zone_end_pfn() 4096   contiguous 1
DMA32   zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1
Normal  zone_start_pfn 0    zone_end_pfn() 0      contiguous 0
Movable zone_start_pfn 0    zone_end_pfn() 0      contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff00022036 reserved False
500248 0x7a218000 0x1fff2032 reserved False

4b094b7851bf4bf551ad456195d3f26e1c03bd74 no change (because the unknown e820 type 20 was still initialized by memmap_init_zone despite it was not part of memblock.memory):

DMA     zone_start_pfn 1    zone_end_pfn() 4096   contiguous 1
DMA32   zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1
Normal  zone_start_pfn 0    zone_end_pfn() 0      contiguous 0
Movable zone_start_pfn 0    zone_end_pfn() 0      contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff00022036 reserved False
500248 0x7a218000 0x1fff2032 reserved False

73a6e474cb376921a311786652782155eac2fdf0 changed it:

DMA     zone_start_pfn 1    zone_end_pfn() 4096   contiguous 1
DMA32   zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 0
Normal  zone_start_pfn 0    zone_end_pfn() 0      contiguous 0
Movable zone_start_pfn 0    zone_end_pfn() 0      contiguous 0

500246 0x7a216000 0xfff1000 reserved True
500247 0x7a217000 0x1fff reserved False
500248 0x7a218000 0x1fff00010200 reserved False

da50c57bdbb8e37ec6f8c934a2f8acbf4e4fdfb9 (73a6e474cb376921a311786652782155eac2fdf0^) no change:

DMA     zone_start_pfn 1    zone_end_pfn() 4096   contiguous 1
DMA32   zone_start_pfn 4096 zone_end_pfn() 524252 contiguous 1
Normal  zone_start_pfn 0    zone_end_pfn() 0      contiguous 0
Movable zone_start_pfn 0    zone_end_pfn() 0      contiguous 0

500246 0x7a216000 0x1fff reserved False
500247 0x7a217000 0x1fff2032 reserved False
500248 0x7a218000 0x1fff2032 reserved False

So as Mike suggested this started in 73a6e474cb376921a311786652782155eac2fdf0, although the previous code looked buggy too by not setting PageReserved when it should have. It appears the pfns in the e820 type 20 range were initialized by memmap_init_zone before commit 73a6e474cb376921a311786652782155eac2fdf0 and they're not initialized anymore after it, because memblock.memory didn't include the e820 type 20 range, so memmap_init_zone couldn't be called on it anymore. So after 4b094b7851bf4bf551ad456195d3f26e1c03bd74 uninitialized pfns get the 0,0 assignment (but even before 4b094b7851bf4bf551ad456195d3f26e1c03bd74 they would have gotten completely zero page->flags; 4b094b7851bf4bf551ad456195d3f26e1c03bd74 is completely unrelated to this issue). Mike, your patch alone doesn't help because nothing calls memblock_reserve on the range starting at 0x7a17b000. I appended at the end the output of memblock=debug. 
Only after I apply the below hack on top of your patch do all signs of the bug go away: PG_reserved is finally set (the old code was wrong in not setting it) and the zoneid/nid is right (the new code leaves it uninitialized and so it's wrong):

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 983cd53ed4c9..bbb5b4d7a117 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1303,6 +1303,8 @@ void __init e820__memblock_setup(void)
 		if (entry->type == E820_TYPE_SOFT_RESERVED)
 			memblock_reserve(entry->addr, entry->size);
 
+		if (entry->type == 20)
+			memblock_reserve(entry->addr, entry->size);
 		if (entry->type != E820_TYPE_RAM &&
 		    entry->type != E820_TYPE_RESERVED_KERN)
 			continue;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3fb35fe6a9e4..2e2047282ff3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6145,7 +6145,9 @@ void __meminit __weak memmap_init(unsigned long size
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote: > I think the very root cause is how e820__memblock_setup() registers > memory with memblock: > > if (entry->type == E820_TYPE_SOFT_RESERVED) > memblock_reserve(entry->addr, entry->size); > > if (entry->type != E820_TYPE_RAM && entry->type != > E820_TYPE_RESERVED_KERN) > continue; > > memblock_add(entry->addr, entry->size); > > From that point the system has inconsistent view of RAM in both > memblock.memory and memblock.reserved and, which is then translated to > memmap etc. > > Unfortunately, simply adding all RAM to memblock is not possible as > there are systems that for them "the addresses listed in the reserved > range must never be accessed, or (as we discovered) even be reachable by > an active page table entry" [1]. > > [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/ It looks like what's missing is a memblock_reserve, which I don't think would interfere at all with the issue above since it won't create a direct mapping and it'll simply invoke the second stage that wasn't invoked here. I guess this would have a better chance of getting the second initialization stage run in reserve_bootmem_region, and it would likely solve the problem without breaking E820_TYPE_RESERVED, which is known by the kernel:

> 	if (entry->type == E820_TYPE_SOFT_RESERVED)
> 		memblock_reserve(entry->addr, entry->size);
>
> +	if (entry->type == 20)
> +		memblock_reserve(entry->addr, entry->size);
>
> 	if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 		continue;
>

This is however just to show the problem; I didn't check what type 20 is. To me it doesn't look like the root cause though; the root cause is that if you don't call memblock_reserve the page->flags remain uninitialized. 
I think page_alloc.c needs to be more robust and detect at least if holes within zones (but ideally all pfn_valid struct pages in the system, even those beyond the end of the zone) aren't being initialized in the second stage, without relying on the arch code to remember to call memblock_reserve. In fact it's not clear why memblock_reserve even exists: that information can be calculated reliably by page_alloc as a function of memblock.memory alone, by walking all nodes and all zones. It doesn't even seem to help in destroying the direct mapping: reserve_bootmem_region just initializes the struct pages, so it doesn't need a special memblock.reserved to find those ranges. In fact it's scary that code then does stuff like this, trusting that memblock.reserved is nearly complete information (which it obviously isn't, given type 20 doesn't get queued and I have that type 20 in all my systems):

	for_each_reserved_mem_region(i, &start, &end) {
		if (addr >= start && addr_end <= end)
			return true;
	}

That code in irq-gic-v3-its.c should stop using for_each_reserved_mem_region and start doing pfn_valid(addr >> PAGE_SHIFT) && PageReserved(pfn_to_page(addr >> PAGE_SHIFT)) instead. At best memory.reserved should be calculated automatically by page_alloc.c based on zone_start_pfn/zone_end_pfn and not passed down by the e820 caller; instead of adding the memblock_reserve call for type 20 we should delete the memblock_reserve function. Thanks, Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 08:27:21PM +0100, David Hildenbrand wrote: > On 25.11.20 19:28, Andrea Arcangeli wrote: > > On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote: > >> Before that change, the memmap of memory holes were only zeroed > >> out. So the zones/nid was 0, however, pages were not reserved and > >> had a refcount of zero - resulting in other issues. > > > > So maybe that "0,0" zoneid/nid was not actually the thing that > > introduced the regression? Note: I didn't bisect anything yet, it was > > just a guess. > > I guess 0/0 is the issue, but that existed before when we had a simple > memmset(0). The root issue should be what Mike said: Yes, the second stage must have stopped running somehow. Is there anything we can do to induce a deterministically reproducible kernel crashing behavior if the second stage doesn't run? Why did we start doing a more graceful initialization in the first stage, instead of making it less graceful by setting it to 0xff instead of 0x00? > 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather > that check each PFN") So if that's not intentional, are you suggesting the zoneid/nid was a bug if it was set to 0,0 for a non-RAM valid pfn? > "correct" is problematic. If you have an actual memory hole, there is > not always a right answer - unless I am missing something important. > > > Assume you have a layout like this > > [ zone X ] [ hole ] [ zone Y ] > > If either X and or Y starts within a memory section, you have a valid > memmap for X - but what would be the right node/zone? > > > Assume you have a layout like this > > [ zone X ] > > whereby X ends inside a memory section. The you hotplug memory. Assume > it goes to X > > [ zone X ][ hole in X ][ zone X] > > or it goes to y > > [ zone X ][ hole ][ zone Y ] > > This can easily be reproduced by starting a VM in qemu with a memory > size not aligned to 128 MB (e.g., -M 4000) and hotplugging memory. I don't get what the problem is, sorry. 
You have a pfn; if pfn_valid() is true, pfn_to_page returns a page deterministically. It's up to the kernel to decide which page structure belongs to any pfn in the pfn_to_page function. Now if the pfn_to_page(pfn) function returns a page whose nid/zone_id in page->flags points to a node->zone whose zone_start_pfn - end_zone_pfn range doesn't contain "pfn", that is a bug in page_alloc.c. I don't see how it is not possible to deterministically enforce the above never happens. Only then would it be true that there's not always a right answer. Zones can overlap, but it can't be that you do pfn_to_page of a pfn_valid pfn and you obtain a page whose zone doesn't contain that pfn. Which is what is currently crashing compaction. I don't see how this is an unsolvable problem and why we should accept to live with bogus page->flags for reserved pages. > We can't. The general rule is (as I was once told by Michal IIRC) that The fact that we can't crash the kernel reliably when somebody uses the wrong 0,0 uninitialized value without adding an explicit PageReserved check is my primary concern with keeping those zoneid/nid uninitialized but non-kernel-crashing, since it already created this unreproducible bug. > I'm not rooting for "keep this at 0/0" - I'm saying that I think there > are corner cases where it might not be that easy. I'm not saying it's easy. What I don't see is how you don't always have the right answer and why it would be an unsolvable problem. It is certainly problematic and difficult to solve in the mem_map initialization logic, but to me having pfn_valid() && page_zone(pfn_to_page(pfn)) randomly returning the DMA zone on the first node also looks problematic and difficult to handle across all VM code, so overall it looks preferable to keep the complexity of the mem_map initialization self contained and not spilling over the rest of the VM. 
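The invariant argued for here is cheap to state in a model: for any valid pfn, the zoneid recorded in page->flags must name a zone whose span contains that pfn. The zone spans below are the ones from the dumps earlier in the thread; the lookup function is a stand-in for page_zone(pfn_to_page(pfn)), not kernel code:

```c
#include <assert.h>

/* Model of the invariant: zone spans as [zone_start_pfn,
 * zone_end_pfn), index into @zones playing the role of the zoneid
 * stored in page->flags. */
struct toy_zone { unsigned long zone_start_pfn, zone_end_pfn; };

static int zone_spans_pfn(const struct toy_zone *z, unsigned long pfn)
{
	return pfn >= z->zone_start_pfn && pfn < z->zone_end_pfn;
}

/* The check a VM_BUG_ON could enforce deterministically: the zoneid
 * recorded for @pfn must name a zone that spans it. */
static int pfn_zoneid_consistent(const struct toy_zone *zones, int zoneid,
				 unsigned long pfn)
{
	return zone_spans_pfn(&zones[zoneid], pfn);
}
```

With the thread's numbers, pfn 500247 linked to DMA32 satisfies the invariant, while the uninitialized "0,0" link into the DMA zone is exactly the violation that trips compaction.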
> Yes, but there is a "Some of these" :) > > Boot a VM with "-M 4000" and observe the memmap in the last section - > they won't get initialized a second time. Is the area beyond the end of the zone yet another case? I guess that's less likely to give us problems because it's beyond the end of the zone. Would pfn_valid return true for those pfns? If pfn_valid is not true it's not really a concern, but then again I'd prefer it if those struct pages beyond the end of the zone were set to 0xff so they'd crash the kernel if used. In other words I just don't see why we should ever prefer to leave some pages at a graceful and erroneous zoneid 0 nid 0 that wouldn't easily induce a crash if used. > AFAIK, the mem_map array might have mult
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 04:13:25PM +0200, Mike Rapoport wrote: > I suspect that memmap for the reserved pages is not properly initialized > after recent changes in free_area_init(). They are cleared at > init_unavailable_mem() to have zone=0 and node=0, but they seem to be I'd really like it if we would not leave those at 0,0 and instead set the whole struct page to 0xff if we miss the second stage that corrects the uninitialized values. The hope is that it'll crash faster and more reproducibly that way. > never re-initialized with proper zone and node links which was not the > case before commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock > regions rather that check each PFN"). What's strange is that 73a6e474cb37 was suggested as a fix for this bug... https://lkml.kernel.org/r/20200505124314.GA5029@MiWiFi-R3L-srv The addition of "pageblock_pfn_to_page" to validate min_pfn was added in commit 73a6e474cb37, so I assumed that the first report below didn't have commit 73a6e474cb37 already applied. https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw However if you're correct, perhaps the patch was already applied in 5.7.0-rc2-next-20200423+; it landed upstream in v5.8 after all. > Back then, memmap_init_zone() looped from zone_start_pfn till > zone_end_pfn and struct page for reserved pages with pfns inside the > zone would be initialized. > > Now the loop is for interesection of [zone_start_pfn, zone_end_pfn] with > memblock.memory and for x86 reserved ranges are not in memblock.memory, > so the memmap for them remains semi-initialized. That would match the symptoms. I'll test it as the first thing after confirming older kernels had the right zoneid/nid on the reserved pages. Thanks, Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 01:08:54PM +0100, Vlastimil Babka wrote: > Yeah I guess it would be simpler if zoneid/nid was correct for > pfn_valid() pfns within a zone's range, even if they are reserved due > not not being really usable memory. > > I don't think we want to introduce CONFIG_HOLES_IN_ZONE to x86. If the > chosen solution is to make this to a real hole, the hole should be > extended to MAX_ORDER_NR_PAGES aligned boundaries. The way pfn_valid works, it's not possible to render all non-RAM pfns as !pfn_valid; CONFIG_HOLES_IN_ZONE would not achieve it 100% either. So I don't think we can rely on that to eliminate all non-RAM reserved pages from the mem_map and avoid having to initialize them in the first place. Some could remain, as in this case, since in the same pageblock there's non-RAM followed by RAM and all pfns are valid. > In any case, compaction code can't fix this with better range checks. David's correct that it can, by adding enough PageReserved checks (I'm running all systems reproducing this with plenty of PageReserved checks in all places to work around it until we do a proper fix). My problem with that is that 1) it's simply not enforceable at runtime that there is no missing PageReserved check, and 2) what benefit would it provide to leave a wrong zoneid in reserved pages while having to add extra PageReserved checks? A struct page has a deterministic zoneid/nid if it's pointed to by a valid pfn (as in pfn_valid()); the simplest is that the zoneid/nid in the page remains correct no matter whether it was reserved at boot or marked reserved by a driver that swaps the page somewhere else with the GART or EFI or something else. All reserved pages should work the same, RAM and non-RAM, since the non-RAM status can basically change at runtime if a driver assigns the page to hw somehow. 
NOTE: on the compaction side, we still need to add the pageblock_pfn_to_page check to validate the "highest" pfn, because the pfn_valid() check is missing on the first pfn of the pageblock, as is the check for a pageblock that spans two different zones.

Thanks,
Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 12:41:55PM +0100, David Hildenbrand wrote:
> On 25.11.20 12:04, David Hildenbrand wrote:
> > On 25.11.20 11:39, Mel Gorman wrote:
> >> On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> >>>> Something must have changed more recently than v5.1 that caused the
> >>>> zoneid of reserved pages to be wrong, a possible candidate for the
> >>>> real cause would be this change below:
> >>>>
> >>>> +		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);
> >>>>
> >>>
> >>> Before that change, the memmap of memory holes were only zeroed out. So
> >>> the zones/nid was 0, however, pages were not reserved and had a refcount
> >>> of zero - resulting in other issues.
> >>>
> >>> Most pfn walkers shouldn't mess with reserved pages and simply skip
> >>> them. That would be the right fix here.
> >>>
> >>
> >> Ordinarily yes, pfn walkers should not care about reserved pages but it's
> >> still surprising that the node/zone linkages would be wrong for memory
> >> holes. If they are in the middle of a zone, it means that a hole with
> >> valid struct pages could be mistaken for overlapping nodes (if the hole
> >> was in node 1 for example) or overlapping zones which is just broken.
> >
> > I agree within zones - but AFAIU, the issue is reserved memory between
> > zones, right?
>
> Double checking, I was confused. This applies also to memory holes
> within zones in x86.

Yes, this is a memory hole within the DMA32 zone. Still, why should there be any difference? As long as a page struct exists, it's in a well defined mem_map array which comes from one and only one zoneid/nid combination. So what would be the benefit of treating memory holes within zones or in between zones differently, and leaving one or the other with an uninitialized zoneid/nid?
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 07:45:30AM +0100, David Hildenbrand wrote:
> Before that change, the memmap of memory holes were only zeroed
> out. So the zones/nid was 0, however, pages were not reserved and
> had a refcount of zero - resulting in other issues.

So maybe that "0,0" zoneid/nid was not actually the thing that introduced the regression? Note: I didn't bisect anything yet, it was just a guess. In fact, the first thing I need is to boot with an old v5.3 kernel and dump the page->flags around the "Unknown E820 type" region, to be 100% certain that the older kernels had correct page->flags for reserved pages. I will clear this up before the end of the day.

> Most pfn walkers shouldn't mess with reserved pages and simply skip
> them. That would be the right fix here.

What would then be the advantage of keeping wrong page->flags in reserved pages, compared to actually setting them correctly?

How do you plan to enforce that every bit of VM code will check PageReserved (as it would have to be introduced in pageblock_pfn_to_page in that case), if it's not possible to disambiguate a real DMA zone and nid 0 from the uninitialized value? How can we patch page_nodenum to enforce it won't read an uninitialized value because a PageReserved check was missing in the caller?

I don't see this being easily enforceable by code review; it'd be pretty flakey to leave a wrong 0,0 value in there with no way to verify whether a PageReserved check in the caller was missing.

> > It'd be preferable if the pfn_valid fails and the
> > pfn_to_section_nr(pfn) returns an invalid section for the intermediate
> > step. Even better memset 0xff over the whole page struct until the
> > second stage comes around.
>
> I recently discussed with Baoquan to
> 1. Using a new pagetype to mark memory holes
> 2. Using special nid/zid to mark memory holes
>
> Setting the memmap to 0xff would be even more dangerous - pfn_zone() might
> simply BUG_ON.

What would need to call pfn_zone in between the first and second stage?
If something calls pfn_zone in between the first and second stage, isn't it a feature if it crashes the kernel at boot? Note: I suggested the 0xff kernel crashing "until the second stage comes around" during meminit at boot, not permanently.

	/*
	 * Use a fake node/zone (0) for now. Some of these pages
	 * (in memblock.reserved but not in memblock.memory) will
	 * get re-initialized via reserve_bootmem_region() later.
	 */

Specifically, I relied on the comment "get re-initialized via reserve_bootmem_region() later". I assumed the second stage overwrites the 0,0 with the real zoneid/nid value, which is clearly not happening, hence it'd be preferable to get a crash at boot reliably.

Now I have CONFIG_DEFERRED_STRUCT_PAGE_INIT=n, so the second stage calling init_reserved_page(start_pfn) won't do much with CONFIG_DEFERRED_STRUCT_PAGE_INIT=n, but I already tried to enable CONFIG_DEFERRED_STRUCT_PAGE_INIT=y yesterday and it didn't help: the page->flags were still wrong for reserved pages in the "Unknown E820 type" region.

> > Whenever pfn_valid is true, it's better that the zoneid/nid is correct
> > all times, otherwise if the second stage fails we end up in a bug with
> > weird side effects.
>
> Memory holes with a valid memmap might not have a zone/nid. For now, skipping
> reserved pages should be good enough, no?

The zoneid is always a pure abstract concept: not just the RAM holes, even the RAM itself doesn't have a zoneid at all in hardware. The nid is hardware defined for RAM, but for reserved RAM holes it becomes a pure software construct like the zoneid. So while it's right that they have no zone/nid in the hardware, they still have a very well defined zoneid/nid in the software implementation: the page struct belongs to one and only one mem_map array, which has a specific zoneid and nid. So it can always be initialized right, even for non-RAM.
Only if the pfn is !pfn_valid does it have no page struct, and then there's no zone/nid association; but in that case there's no reason to even worry about it, since nobody can possibly get a wrong value out of the page struct, because there's no page struct in that case.

Last but not least, RAM pages can be marked reserved and assigned to hardware, so it'd be really messy if real reserved RAM pages given to hw behaved differently from non-RAM that accidentally got a struct page because of section alignment issues.

Thanks,
Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Wed, Nov 25, 2020 at 10:30:53AM +, Mel Gorman wrote:
> On Tue, Nov 24, 2020 at 03:56:22PM -0500, Andrea Arcangeli wrote:
> > Hello,
> >
> > On Tue, Nov 24, 2020 at 01:32:05PM +, Mel Gorman wrote:
> > > I would hope that is not the case because they are not meant to overlap.
> > > However, if the beginning of the pageblock was not the start of a zone
> > > then the pages would be valid but the pfn would still be outside the
> > > zone boundary. If it was reserved, the struct page is valid but not
> > > suitable for set_pfnblock_flags_mask. However, it is a concern in
> > > general because the potential is there that pages are isolated from the
> > > wrong zone.
> >
> > I guess we have more than one issue to correct in that function
> > because the same BUG_ON reproduced again even with the tentative patch
> > I posted earlier.
> >
> > So my guess is that the problematic reserved page isn't pointed by the
> > min_pfn, but it must have been pointed by the "highest" variable
> > calculated below?
> >
> > 	if (pfn >= highest)
> > 		highest = pageblock_start_pfn(pfn);
> >
> > When I looked at where "highest" comes from, it lacks
> > pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).
> >
> > Is that the real bug, which may be fixed by something like this? (untested)
> >
>
> It's plausible as it is a potential source of leaking but as you note
> in another mail, it's surprising to me that valid struct pages, even if
> within memory holes and reserved would have broken node/zone information
> in the page flags.

I think the patch to add pageblock_pfn_to_page is still needed to cope with !pfn_valid or a pageblock in between zones, but !pfn_valid or a pageblock in between zones is not what happens here. So the patch adding pageblock_pfn_to_page would have had the undesired side effect of hiding the bug, so it's best to deal with the other bug first.
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Hello,

On Mon, Nov 23, 2020 at 02:01:16PM +0100, Vlastimil Babka wrote:
> On 11/21/20 8:45 PM, Andrea Arcangeli wrote:
> > A corollary issue was fixed in
> > 39639000-39814fff : Unknown E820 type
> >
> > pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:
> >
> > 7a17b000-7a216fff : Unknown E820 type
>
> It would be nice to also provide a /proc/zoneinfo and how exactly the
> "zone_spans_pfn" was violated. I assume we end up below zone's
> start_pfn, but is it true?

Agreed, I was about to grab that info along with all the page structs around pfn 0x7a200 and phys address 0x7a216fff.

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bff : System RAM

DMA     zone_start_pfn 1       zone_end_pfn() 4096    contiguous 1
DMA32   zone_start_pfn 4096    zone_end_pfn() 1048576 contiguous 0
Normal  zone_start_pfn 1048576 zone_end_pfn() 4715392 contiguous 1
Movable zone_start_pfn 0       zone_end_pfn() 0       contiguous 0

500222 0x7a1fe000 0x1fff1000 reserved True
500223 0x7a1ff000 0x1fff1000 reserved True

# I suspect "highest pfn" was somewhere in the RAM range
# 0x7a217000-0x7a40 and the pageblock_start_pfn(pfn)
# made highest point to pfn 0x7a200 physaddr 0x7a20
# below, which is reserved indeed since it's non-RAM
# first number is pfn hex(500224) == 0x7a200

pfn    physaddr   page->flags
500224 0x7a20     0x1fff1000 reserved True
500225 0x7a201000 0x1fff1000 reserved True
*snip*
500245 0x7a215000 0x1fff1000 reserved True
500246 0x7a216000 0x1fff1000 reserved True
500247 0x7a217000 0x3fff     reserved False
500248 0x7a218000 0x3fff     reserved False

All non-reserved RAM pages start at 0x7a217000, as expected. The non-RAM page_zonenum(pfn_to_page(0x7a200)) points to ZONE_DMA, and page_zone(page) below was the DMA zone, despite the pfn 0x7a200 being in DMA32.
	VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

So the patch I sent earlier should prevent the above BUG_ON by not setting highest to 0x7a200 when the pfn is in the phys RAM range 0x7a217000-0x7a40, because pageblock_pfn_to_page will notice that the zone is the wrong one:

	if (page_zone(start_page) != zone)
		return NULL;

However, the real bug seems to be that reserved pages have a zero zone id in page->flags when they should have the real zone id/nid. The patch I sent earlier to validate highest would only be needed to deal with !pfn_valid.

Something must have changed more recently than v5.1 that caused the zoneid of reserved pages to be wrong; a possible candidate for the real cause would be this change below:

+		__init_single_page(pfn_to_page(pfn), pfn, 0, 0);

Even if it may not be it, in light of how the reserved page zoneid/nid initialization went wrong, the above line looks too flakey to stay. It'd be preferable if pfn_valid failed and pfn_to_section_nr(pfn) returned an invalid section for the intermediate step. Even better, memset 0xff over the whole page struct until the second stage comes around.

Whenever pfn_valid is true, it's better that the zoneid/nid is correct at all times, otherwise if the second stage fails we end up in a bug with weird side effects.

Maybe it's not the above that left a zero zoneid though; I haven't tried to bisect it yet to look at how the page->flags looked on an older kernel that didn't seem to reproduce this crash, I'm just guessing.

Thanks,
Andrea
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
Hello,

On Tue, Nov 24, 2020 at 01:32:05PM +, Mel Gorman wrote:
> I would hope that is not the case because they are not meant to overlap.
> However, if the beginning of the pageblock was not the start of a zone
> then the pages would be valid but the pfn would still be outside the
> zone boundary. If it was reserved, the struct page is valid but not
> suitable for set_pfnblock_flags_mask. However, it is a concern in
> general because the potential is there that pages are isolated from the
> wrong zone.

I guess we have more than one issue to correct in that function, because the same BUG_ON reproduced again even with the tentative patch I posted earlier.

So my guess is that the problematic reserved page isn't pointed to by min_pfn, but it must have been pointed to by the "highest" variable calculated below?

		if (pfn >= highest)
			highest = pageblock_start_pfn(pfn);

When I looked at where "highest" comes from, it lacks the pageblock_pfn_to_page check (which was added around v5.7 to min_pfn).

Is that the real bug, which may be fixed by something like this? (untested)

==
From 262671e88723b3074251189004ceae39dcd1689d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Sat, 21 Nov 2020 12:55:58 -0500
Subject: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
 pageblock_skip on reserved pages

A corollary issue was fixed in e577c8b64d58fe307ea4d5149d31615df2d90861.
A second issue remained in v5.7:

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

==
page:eaaa refcount:1 mapcount:0 mapping:2243743b index:0x0
flags: 0x1fffe01000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix the second issue, but it still reproduces with v5.9 on two different systems:

==
page:62b3e92f refcount:1 mapcount:0 mapping: index:0x0 pfn:0x39800
flags: 0x1000(reserved)
raw: 1000 f5b880e60008 f5b880e60008
raw: 0001
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==
page:4d32675d refcount:1 mapcount:0 mapping: index:0x0 pfn:0x7a200
flags: 0x1fff1000(reserved)
raw: 1fff1000 e6c5c1e88008 e6c5c1e88008
raw: 0001
page dumped because: VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn))
==

The page is "reserved" in all cases. In the last two crashes with the pfn:

pfn 0x39800 -> 0x3980 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

The pageblock_pfn_to_page check was missing when rewinding the pfn to the start of the pageblock to calculate the "highest" value. So the "highest" pfn could point to a non-valid pfn, or to one not within the zone; checking pageblock_pfn_to_page fixes it.
Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli
---
 mm/compaction.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 653862aba266..76f061af8f22 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1418,8 +1418,14 @@ fast_isolate_freepages(struct compact_control *cc)
 			nr_scanned++;
 			pfn = page_to_pfn(freepage);

-			if (pfn >= highest)
-				highest = pageblock_start_pfn(pfn);
+			if (pfn >= highest) {
+				unsigned long start_pfn, end_pfn;
+				start_pfn = pageblock_start_pfn(pfn);
+				end_pfn = pageblock_end_pfn(start_pfn);
+				if (pageblock_pfn_to_page(start_pfn, end_pfn,
+							  cc->zone))
+					highest = pfn;
+			}

 			if (pfn >= low_pfn) {
 				cc->fast_search_fail = 0;
==

Can't we also try to scan in between start_pfn and "pfn" to see if there's one pfn that passes the pageblock_pfn_to_page test, or isn't it worth it for the fast isolate variant?

> > Then compact_finished() detects that in
> > compact_zone(), but only after migrate_pages() and thus
> > fast_isolate_freepages() is called.
> >
> > That would mean distance can be negative, or rather a large unsigned number
> > and low_pfn and min_pfn end up away from the zone?
> >
> > Or maybe the above doesn't happen, but cc->free_pfn gets so close to start
Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
On Sat, Nov 21, 2020 at 02:45:06PM -0500, Andrea Arcangeli wrote:
> +	if (likely(!PageReserved(page)))

NOTE: this line will have to become "likely(page && !PageReserved(page))" to handle the case of non-contiguous zones, since pageblock_pfn_to_page might return NULL in that case.

I noticed right after sending, but I'll wait for some feedback before resending the correction, in case it gets fixed in another way.
[PATCH 0/1] VM_BUG_ON_PAGE(!zone_spans_pfn) in set_pfnblock_flags_mask
Hello,

After hitting this twice on two different systems, I'm now running with the tentative fix applied, but it's not a meaningful test since it's not reproducible. However, it is possible to inject this bug if you do "grep E820 /proc/iomem" and then find a phys addr there with a struct page (i.e. pfn_valid) in a zone, with this change:

 	min_pfn = pageblock_start_pfn(cc->free_pfn - (distance >> 1));
+	if (cc->zone is the zone where the e820 physaddr has a pfn_valid)
+		min_pfn = physaddr_of_E820_non_RAM_page_with_valid_pfn >> PAGE_SHIFT;

I didn't try to inject the bug to validate the fix, and it'd be great if someone could try that to validate this or any other fix.

Andrea Arcangeli (1):
  mm: compaction: avoid fast_isolate_around() to set pageblock_skip on
    reserved pages

 mm/compaction.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
[PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
A corollary issue was fixed in e577c8b64d58fe307ea4d5149d31615df2d90861.

A second issue remained in v5.7:

https://lkml.kernel.org/r/8c537eb7-85ee-4dcf-943e-3cc0ed0df...@lca.pw

==
page:eaaa refcount:1 mapcount:0 mapping:2243743b index:0x0
flags: 0x1fffe01000(reserved)
==

73a6e474cb376921a311786652782155eac2fdf0 was applied to supposedly fix the second issue, but I still reproduced it twice with v5.9 on two different systems:

==
page:62b3e92f refcount:1 mapcount:0 mapping: index:0x0 pfn:0x39800
flags: 0x1000(reserved)
==
page:2a7114f8 refcount:1 mapcount:0 mapping: index:0x0 pfn:0x7a200
flags: 0x1fff1000(reserved)
==

I actually never reproduced it until v5.9, but it's still the same bug as was reported first for v5.7. See, the page is "reserved" in all 3 cases. In the last two crashes with the pfn:

pfn 0x39800 -> 0x3980 min_pfn hit non-RAM:

39639000-39814fff : Unknown E820 type

pfn 0x7a200 -> 0x7a20 min_pfn hit non-RAM:

7a17b000-7a216fff : Unknown E820 type

This actually seems a false positive bugcheck: the page structures are valid and the zones are correct, it's just non-RAM, and setting pageblock_skip should do no harm. However, it's possible to solve the crash without lifting the bugcheck, by enforcing the invariant that the free_pfn cursor doesn't point to reserved pages (which would otherwise be implicitly achieved through the PageBuddy check, except in the new fast_isolate_around() path).
Fixes: 5a811889de10 ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Andrea Arcangeli
---
 mm/compaction.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 13cb7a961b31..d17e69549d34 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,7 +1433,10 @@ fast_isolate_freepages(struct compact_control *cc)
 				page = pageblock_pfn_to_page(min_pfn,
 					pageblock_end_pfn(min_pfn),
 					cc->zone);
-				cc->free_pfn = min_pfn;
+				if (likely(!PageReserved(page)))
+					cc->free_pfn = min_pfn;
+				else
+					page = NULL;
 			}
 		}
 	}