Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse

2021-01-11 Thread Kirill A. Shutemov
On Sat, Jan 09, 2021 at 09:54:05PM -0500, Andrea Arcangeli wrote:
> Hello,
> 
> On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> > allowing a child to corrupt memory in the parent. That's a problem
> > that could happen not-maliciously too. So the scenario described
> 
> I updated the above partially quoted sentence since in the previous
> version it wasn't fully accurate:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a
> 
> "However since a single transient GUP pin on a tail page, would elevate
> the page_count for all other tail pages (unlike the mapcount which is
> subpage granular), the COW page reuse inaccuracy would then cross
> different vmas and the effect would happen at a distance in vma of
> different processes. A single GUP pin taken on a subpage mapped in a
> different process could trigger 511 false positive COWs copies in the
> local process, after a fork()."
> 
> This is a best effort to try to document all side effects, but it'd be
> great to hear from Kirill too on the above detail, to have
> confirmation.

Yes, this side effect is possible. But I wouldn't worry about it too much.
If it routinely happens in real workloads (I doubt it does), the workload
can tune around it with MADV_NOHUGEPAGE/MADV_DONTFORK/MADV_WIPEONFORK or
something similar.
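
Purely as an illustration of that kind of tuning, a minimal userspace
sketch (the helper name, the 4k page size constant and the omitted error
handling are made up for this example, not anything from this thread):

	#include <stdlib.h>
	#include <sys/mman.h>

	/* Opt a pinned-I/O buffer out of THP and out of fork() sharing. */
	static void *alloc_dma_buffer(size_t len)
	{
		void *buf = aligned_alloc(4096, len); /* len assumed multiple of 4096 */

		if (!buf)
			return NULL;
		/* No THP: a transient pin on one 4k subpage can't affect 511 others. */
		madvise(buf, len, MADV_NOHUGEPAGE);
		/* A child created by fork() won't inherit (and COW-share) this range. */
		madvise(buf, len, MADV_DONTFORK);
		return buf;
	}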

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
Hello,

On Sat, Jan 09, 2021 at 07:44:35PM -0500, Andrea Arcangeli wrote:
> allowing a child to corrupt memory in the parent. That's a problem
> that could happen not-maliciously too. So the scenario described

I updated the above partially quoted sentence since in the previous
version it wasn't fully accurate:

https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?id=fc5a76b1c14e5e6cdc64ece306fc03773662d98a

"However since a single transient GUP pin on a tail page, would elevate
the page_count for all other tail pages (unlike the mapcount which is
subpage granular), the COW page reuse inaccuracy would then cross
different vmas and the effect would happen at a distance in vma of
different processes. A single GUP pin taken on a subpage mapped in a
different process could trigger 511 false positive COWs copies in the
local process, after a fork()."

This is a best effort to try to document all side effects, but it'd be
great to hear from Kirill too on the above detail, to have
confirmation.

Thanks and have a great weekend everyone,
Andrea



[PATCH 1/1] mm: restore full accuracy in COW page reuse

2021-01-09 Thread Andrea Arcangeli
This reverts commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4.
This reverts commit be068f29034fb00530a053d18b8cf140c32b12b3.
This reverts commit 1a0cf26323c80e2f1c58fc04f15686de61bfab0c.

The example below uses O_DIRECT on a process under clear_refs. There's
no FOLL_LONGTERM involvement: all GUP pins are transient. fork() is
never called, and even when clear_refs is written by an external
program, fork() would not be involved. (A userspace sketch of this
sequence follows the diagram.)

  thread0                      thread1                  other process
                                                        (or thread 3)
  =======                      =======                  =============
  read syscall
  O_DIRECT
  read DMA to vaddr+0
  len = 512 bytes
  GUP(FOLL_WRITE)
  DMA writing to RAM started
                                                        clear_refs
                                                        pte_wrprotect
                               write vaddrA+512
                               page_count == 2
                               wp_copy_page
  read syscall returns
  read lost
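
Purely for illustration, a rough userspace sketch of the sequence above;
the file path, sizes and thread orchestration are placeholders (the
clear_refs write from the external task is only shown as a comment), not
part of this patch:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	static char *buf;	/* page aligned O_DIRECT target */

	/* "thread1": touch a different 512 byte slice of the same page while
	 * the DMA into buf+0 is still in flight. If the PTE was wrprotected
	 * and page_count == 2 (mapping + transient GUP pin), the page_count
	 * based reuse check forces wp_copy_page and the DMA lands in the no
	 * longer mapped old page. */
	static void *writer(void *arg)
	{
		(void)arg;
		buf[512] = 1;
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		ssize_t ret;
		int fd = open("datafile", O_RDONLY | O_DIRECT);	/* placeholder path */

		buf = aligned_alloc(4096, 4096);
		memset(buf, 0, 4096);

		/* Meanwhile an external task (or thread 3) write-protects our
		 * PTEs, e.g.: echo 4 > /proc/<pid>/clear_refs  (soft-dirty). */

		pthread_create(&t, NULL, writer, NULL);
		ret = read(fd, buf, 512);	/* thread0: O_DIRECT DMA into buf+0 */
		pthread_join(t, NULL);

		/* If the spurious COW won the race, buf[0..511] misses the data. */
		printf("read returned %zd, buf[0] = %d\n", ret, buf[0]);
		return 0;
	}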

Notwithstanding the fact that failing O_DIRECT at sub-PAGE_SIZE
granularity is an ABI break by itself, this is of course not just
about O_DIRECT. recvmsg, kernel TLS and plenty of other GUP FOLL_WRITE
iov_iter_get_pages users write to the memory at sub-PAGE_SIZE
granularity, depending on how small the msghdr iov buffers are. Those
recvmsg results are already silently lost too, just as the above
O_DIRECT reads already get lost.
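
As an illustration of the sub-PAGE_SIZE granularity mentioned above, a
minimal recvmsg() sketch with tiny iov buffers (socket setup is omitted
and the offsets and helper name are made up); whether a given receive
path actually GUP-pins these pages depends on the feature in use
(e.g. kernel TLS), as noted above:

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/types.h>
	#include <sys/uio.h>

	/*
	 * In the GUP based receive paths referenced above, the kernel takes a
	 * transient FOLL_WRITE pin on the pages backing these buffers and
	 * writes only a few bytes into each page; the rest of each page holds
	 * unrelated userspace data that a spurious COW copy would disconnect
	 * from the DMA target.
	 */
	static ssize_t recv_tiny(int sock, char *page0, char *page1)
	{
		struct iovec iov[2] = {
			{ .iov_base = page0 + 100, .iov_len = 48 },	/* 48 bytes into page 0 */
			{ .iov_base = page1 + 200, .iov_len = 16 },	/* 16 bytes into page 1 */
		};
		struct msghdr msg;

		memset(&msg, 0, sizeof(msg));
		msg.msg_iov = iov;
		msg.msg_iovlen = 2;
		return recvmsg(sock, &msg, 0);
	}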

The fact that COW must not trigger spuriously is documented in commit
6d0a07edd17cfc12fdc1f36de8072fa17cc3666f:

==
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
==

And in the current comment above the THP mapcount:

==
 * [..] we only use
 * page_trans_huge_mapcount() in the copy-on-write faults where we
 * need full accuracy to avoid breaking page pinning, [..]
==

Quote from the Link tagged:

==
In other words, the page_count check in do_wp_page, extends the old
fork vs gup vs threads race, so that it also happens within a single
thread without fork() involvement.
==
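
For context before the diff, a simplified sketch of the page_count based
reuse check that 09854ba94c6a introduced in do_wp_page() and that this
revert replaces again with mapcount based logic; the wrapper function
name is invented for illustration (wp_page_reuse/wp_page_copy are the
existing mm/memory.c helpers), and this is not the verbatim code being
reverted:

	#include <linux/mm.h>

	/* Simplified illustration of the reuse-vs-copy decision. */
	static vm_fault_t wp_anon_page_reuse_or_copy(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;

		/*
		 * page_count based check: any extra reference, including a
		 * transient GUP pin taken from another thread (or, for THP
		 * tail pages, even another process), forces a copy.
		 */
		if (PageKsm(page) || page_count(page) != 1)
			return wp_page_copy(vmf);	/* possible false positive COW */

		/* Sole owner: reuse the page in place. */
		wp_page_reuse(vmf);
		return VM_FAULT_WRITE;
	}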

In addition FOLL_LONGTERM breaks for readonly DMA, despite
FOLL_LONGTERM pages now being treated specially and copied in fork().

FOLL_LONGTERM is in no way special with regard to the VM core and it
can't be. From the VM standpoint all GUP pins are equal, and breaking
one means breaking them all the same.

It is not possible to resolve this by looking only at FOLL_LONGTERM:
that only hides the problem, it doesn't fully resolve it, as shown
above.

In addition this avoids storms of extra false positive COWs when THPs
are shared by different processes, which would cause extra overhead,
but the various performance considerations are only a secondary concern.

After this commit, either the copy in fork() for FOLL_LONGTERM must be
extended to all GUP pins, including transient pins, or we shouldn't even
copy FOLL_LONGTERM pinned pages. Treating FOLL_LONGTERM any differently
from a transient GUP pin is a sign that something is fundamentally flawed.

FOLL_LONGTERM can be treated differently in writeback and during the GUP
runtime itself, but not in the VM core when deciding whether or not to
COW an anonymous page.

This revert causes no theoretical security regression because THP is
not yet using page_count in do_huge_pmd_wp_page.

The alternative to this patch would be to replace the mapcount with
the page_count in do_huge_pmd_wp_page too, in order to really close the
vmsplice attack from child to parent that way.

However since a single transient GUP pin on a tail page would elevate
the page_count for all other tail pages (unlike the mapcount, which is
subpage granular), the above "fork-less" race would span over a
HPAGE_PMD_SIZE range: it would even cross different vmas and the
effect would happen at a distance in vmas of different processes,
allowing a child to corrupt memory in the parent. That's a problem
that could happen not-maliciously too. So the scenario described
above, if extended to the new THP refcounting, would be more concerning
than the vmsplice issue, which can be tackled first in vmsplice itself,
and only at a second stage in the COW code.
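
To show why a pin on one tail page is visible through all 512 subpages,
roughly how the two counters resolve for a THP subpage; this is a
simplified sketch, the helper names are invented for illustration and
the compound_mapcount contribution of the PMD mapping is ignored:

	#include <linux/mm.h>

	/* Shared by every subpage of the same compound page: one transient
	 * pin on any tail page elevates the count seen here for all 512
	 * subpages of a 2M THP. */
	static inline int thp_subpage_refcount(struct page *subpage)
	{
		return page_ref_count(compound_head(subpage));
	}

	/* Tracked per subpage: a pin elsewhere in the THP doesn't change it. */
	static inline int thp_subpage_pte_mapcount(struct page *subpage)
	{
		return atomic_read(&subpage->_mapcount) + 1;	/* -1 means unmapped */
	}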

Link: https://lkml.kernel.org/r/20210107200402.31095-1-aarca...@redhat.com
Cc: sta...@kernel.org
Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Signed-off-by: Andrea Arcangeli 
---
 include/linux/ksm.h |  7 ++
 mm/ksm.c            | 25 +++
 mm/memory.c         | 59 -
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 161e8164abcf..e48b1e453ff5 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -53,6 +53,8 @@ struct page *ksm_might_need_to_copy(struct page *page,
 
 void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc);
 void ksm_migrate_page(struct page *newpage, struct page *oldpage);
+bool reuse_ksm_page(struc