[RFC 15/20] mm: detect deferred TLB flushes in vma granularity

2021-01-30 Thread Nadav Amit
From: Nadav Amit Currently, deferred TLB flushes are detected in the mm granularity: if there is any deferred TLB flush in the entire address space due to NUMA migration, pte_accessible() in x86 would return true, and ptep_clear_flush() would require a TLB flush. This would happen even if the

[RFC 14/20] mm: move inc/dec_tlb_flush_pending() to mmu_gather.c

2021-01-30 Thread Nadav Amit
From: Nadav Amit Reduce the chances that inc/dec_tlb_flush_pending() will be abused by moving them into mmu_gather.c, which is a more natural place for them. This also reduces the clutter in mm_types.h. Signed-off-by: Nadav Amit Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy

[RFC 12/20] mm/tlb: save the VMA that is flushed during tlb_start_vma()

2021-01-30 Thread Nadav Amit
From: Nadav Amit Certain architectures need information about the vma that is about to be flushed. Currently, an artificial vma is constructed using the original vma information. Instead of saving the flags, record the vma during tlb_start_vma() and use this vma when calling flush_tlb_range

[RFC 07/20] mm: move x86 tlb_gen to generic code

2021-01-30 Thread Nadav Amit
From: Nadav Amit x86 currently has a TLB-generation tracking logic that can be used by additional architectures (as long as they implement some additional logic). Extract the relevant pieces of code from x86 to general TLB code. This would make it possible to write the next "fine granul

[RFC 09/20] mm: create pte/pmd_tlb_flush_pending()

2021-01-30 Thread Nadav Amit
From: Nadav Amit In preparation for fine(r) granularity, introduce pte_tlb_flush_pending() and pmd_tlb_flush_pending(). Right now the function directs to mm_tlb_flush_pending(). Change pte_accessible() to provide the vma as well. No functional change. Next patches will use this information on
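
The forwarding described in this snippet can be pictured with a minimal userspace sketch. It is not the kernel code: the `_sketch` types and function names are hypothetical stand-ins, and the only point it shows is that the per-PTE and per-PMD queries accept finer-grained arguments while still answering from the mm-wide state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for mm_struct and vm_area_struct (assumption). */
struct mm_sketch {
    atomic_int tlb_flush_pending;   /* number of in-flight deferred flushes */
};

struct vma_sketch {
    struct mm_sketch *mm;
};

/* mm-wide query: is any deferred flush outstanding in this address space? */
static inline bool mm_tlb_flush_pending_sketch(struct mm_sketch *mm)
{
    return atomic_load(&mm->tlb_flush_pending) > 0;
}

/*
 * Finer-grained entry points. For now they ignore the entry pointer and
 * forward to the mm-wide query, so behavior is unchanged; a later change
 * could consult per-vma or per-table state instead.
 */
static inline bool pte_tlb_flush_pending_sketch(struct vma_sketch *vma,
                                                const void *ptep)
{
    (void)ptep;
    return mm_tlb_flush_pending_sketch(vma->mm);
}

static inline bool pmd_tlb_flush_pending_sketch(struct vma_sketch *vma,
                                                const void *pmdp)
{
    (void)pmdp;
    return mm_tlb_flush_pending_sketch(vma->mm);
}
```

Passing the vma and entry pointer from the start means callers do not need to change again when the answer becomes finer grained.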

[RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-01-30 Thread Nadav Amit
From: Nadav Amit Architecture-specific tlb_start_vma() and tlb_end_vma() seem unnecessary. They are currently used for: 1. Avoid per-VMA TLB flushes. This can be determined by introducing a new config option. 2. Avoid saving information on the vma that is being flushed. Saving this

[RFC 13/20] mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()

2021-01-30 Thread Nadav Amit
From: Nadav Amit Introduce tlb_start_ptes() and tlb_end_ptes() which would be called before and after PTEs are updated and TLB flushes are deferred. This will later be used for fine-granularity deferred TLB flushing detection. In the meantime, move flush_tlb_batched_pending() into

[RFC 10/20] mm: add pte_to_page()

2021-01-30 Thread Nadav Amit
From: Nadav Amit Add a pte_to_page(), which is similar to pmd_to_page, which will be used later. Inline pmd_to_page() as well. Signed-off-by: Nadav Amit Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Dave Hansen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Will Deacon Cc

[RFC 06/20] fs/task_mmu: use mmu_gather interface of clear-soft-dirty

2021-01-30 Thread Nadav Amit
From: Nadav Amit Use mmu_gather interface in task_mmu instead of {inc|dec}_tlb_flush_pending(). This consolidates the code and avoids potential bugs. Signed-off-by: Nadav Amit Cc: Andrea Arcangeli Cc: Andrew Morton Cc: Andy Lutomirski Cc: Dave Hansen Cc: Peter Zijlstra Cc

[RFC 08/20] mm: store completed TLB generation

2021-01-30 Thread Nadav Amit
From: Nadav Amit To detect deferred TLB flushes in fine granularity, we need to keep track of the completed TLB flush generation for each mm. Add logic to track for each mm the tlb_gen_completed, which tracks the completed TLB generation. It is the arch's responsibility to call
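
One way to picture the pairing of a requested generation with a completed generation is the userspace model below. It is only a sketch under stated assumptions: `struct mm_model` and the `mm_model_*` helpers are hypothetical, and the real patch may update and compare the counters differently. The idea shown is that a flush is still outstanding whenever the completed generation lags behind the requested one.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of per-mm generation tracking (assumption). */
struct mm_model {
    atomic_uint_fast64_t tlb_gen;            /* bumped when a flush is requested */
    atomic_uint_fast64_t tlb_gen_completed;  /* highest generation fully flushed */
};

static inline void mm_model_init(struct mm_model *mm)
{
    atomic_init(&mm->tlb_gen, 1);
    atomic_init(&mm->tlb_gen_completed, 1);
}

/* Request a flush: advance the generation and return the new value. */
static inline uint64_t mm_model_inc_tlb_gen(struct mm_model *mm)
{
    return atomic_fetch_add(&mm->tlb_gen, 1) + 1;
}

/* Record that every flush up to 'gen' has finished; never moves backwards. */
static inline void mm_model_mark_completed(struct mm_model *mm, uint64_t gen)
{
    uint_fast64_t cur = atomic_load(&mm->tlb_gen_completed);

    while (cur < gen &&
           !atomic_compare_exchange_weak(&mm->tlb_gen_completed, &cur, gen))
        ;   /* 'cur' is refreshed on failure; retry while it is still older */
}

/* A deferred flush is outstanding while completed < requested. */
static inline bool mm_model_flush_pending(struct mm_model *mm)
{
    return atomic_load(&mm->tlb_gen_completed) < atomic_load(&mm->tlb_gen);
}
```

The compare-and-exchange loop keeps the completed generation monotonic even if flushes finish out of order on different CPUs.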

[RFC 05/20] mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h

2021-01-30 Thread Nadav Amit
From: Nadav Amit Arguably, tlb.h is the natural place for TLB related code. In addition, task_mmu needs to be able to call flush_tlb_batched_pending() and therefore cannot (or should not) use mm/internal.h. Move all the functions that are controlled by

[RFC 02/20] mm/mprotect: use mmu_gather

2021-01-30 Thread Nadav Amit
From: Nadav Amit change_pXX_range() currently does not use mmu_gather, but instead implements its own deferred TLB flushes scheme. This both complicates the code, as developers need to be aware of different invalidation schemes, and prevents. Use mmu_gather in change_pXX_range(). As the pages

[RFC 01/20] mm/tlb: fix fullmm semantics

2021-01-30 Thread Nadav Amit
From: Nadav Amit fullmm in mmu_gather is supposed to indicate that the mm is torn-down (e.g., on process exit) and can therefore allow certain optimizations. However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that the TLB should be fully flushed. Change tlb_finish_mmu() to set
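
The distinction between "the mm is being torn down" and "flush the whole TLB" can be sketched with two separate flags. This is a simplified model, not the kernel structure: `struct gather_model`, the field `force_full_flush`, and both helpers are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a gather structure (assumption). */
struct gather_model {
    bool fullmm;            /* the whole address space is being torn down */
    bool force_full_flush;  /* flush the entire TLB, but the mm stays alive */
};

enum flush_kind {
    FLUSH_RANGE,            /* flush only the gathered range */
    FLUSH_ALL,              /* flush everything; mm is still in use */
    FLUSH_ALL_TEARDOWN      /* flush everything; mm is going away */
};

static inline enum flush_kind gather_flush_kind(const struct gather_model *tlb)
{
    if (tlb->fullmm)
        return FLUSH_ALL_TEARDOWN;
    if (tlb->force_full_flush)
        return FLUSH_ALL;
    return FLUSH_RANGE;
}

/*
 * Teardown-only optimizations must key off fullmm alone: a full flush of a
 * live mm does not make them safe.
 */
static inline bool gather_allows_teardown_opts(const struct gather_model *tlb)
{
    return tlb->fullmm;
}
```

Keeping the two meanings in separate flags avoids enabling teardown-only shortcuts on an address space that other threads are still using.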

[RFC 03/20] mm/mprotect: do not flush on permission promotion

2021-01-30 Thread Nadav Amit
From: Nadav Amit Currently, using mprotect() or uffd to unprotect a memory region causes a TLB flush. At least on x86, as protection is promoted, no TLB flush is needed. Add an arch-specific pte_may_need_flush() which tells whether a TLB flush is needed based on the
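
The promotion rule can be illustrated with a small userspace model. This is a sketch, not the patch's pte_may_need_flush(): the `pte_model` layout, the flag bits, and the exact conditions are assumptions, and real architectures also have to consider bits such as accessed and dirty. It shows only the core idea that adding permissions does not invalidate anything a stale TLB entry would wrongly allow, while removing permissions does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag layout for the model (assumption, not x86's). */
#define PTE_M_PRESENT (1u << 0)
#define PTE_M_WRITE   (1u << 1)
#define PTE_M_EXEC    (1u << 2)
#define PTE_M_PERMS   (PTE_M_WRITE | PTE_M_EXEC)

struct pte_model {
    uint64_t pfn;     /* page frame the entry maps */
    uint32_t flags;   /* PTE_M_* bits */
};

static inline bool pte_model_may_need_flush(struct pte_model oldp,
                                            struct pte_model newp)
{
    /* A non-present entry could not have been cached. */
    if (!(oldp.flags & PTE_M_PRESENT))
        return false;

    /* Unmapping or remapping to another frame leaves stale translations. */
    if (!(newp.flags & PTE_M_PRESENT) || oldp.pfn != newp.pfn)
        return true;

    /* Flush only if some permission the old entry granted was removed. */
    return (oldp.flags & PTE_M_PERMS & ~newp.flags) != 0;
}
```

With this model, changing a read-only entry to read-write returns false, while the reverse change returns true.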

[RFC 04/20] mm/mapping_dirty_helpers: use mmu_gather

2021-01-30 Thread Nadav Amit
From: Nadav Amit Avoid open-coding mmu_gather for no reason. There is no apparent reason not to use the existing mmu_gather interfaces. Use the newly introduced pte_may_need_flush() to check whether a flush is needed to avoid unnecessary flushes. Signed-off-by: Nadav Amit Cc: Andrea Arcangeli

[RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nadav Amit
From: Nadav Amit There are currently (at least?) 5 different TLB batching schemes in the kernel: 1. Using mmu_gather (e.g., zap_page_range()). 2. Using {inc|dec}_tlb_flush_pending() to inform other threads about the ongoing deferred TLB flush and flushing the entire range eventually (e.g

[PATCH v3] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Nadav Amit
From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "ca
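
The policy named in the subject line can be sketched as a small decision function. This is an illustration only: the macro, enum, and function names are hypothetical, and the assumption that caching mode is reported in bit 7 of the capability register should be checked against the VT-d specification before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: caching mode is reported in bit 7 of the capability register. */
#define MODEL_CAP_CACHING_MODE(cap) (((cap) >> 7) & 1ull)

enum inval_policy {
    INVAL_DEFERRED_QUEUE,   /* batch invalidations in a flush queue */
    INVAL_STRICT            /* invalidate synchronously on every unmap */
};

/*
 * Hypothetical policy helper: avoid the flush queue whenever the (virtual)
 * IOMMU reports caching mode, or when strict mode was requested anyway.
 */
static inline enum inval_policy choose_inval_policy(uint64_t cap,
                                                    bool strict_requested)
{
    if (strict_requested || MODEL_CAP_CACHING_MODE(cap))
        return INVAL_STRICT;
    return INVAL_DEFERRED_QUEUE;
}
```

The choice is made once from the capability bits, so the unmap path only has to consult the resulting policy.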

Re: [PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Nadav Amit
> On Jan 27, 2021, at 3:25 AM, Lu Baolu wrote: > > On 2021/1/27 14:17, Nadav Amit wrote: >> From: Nadav Amit >> When an Intel IOMMU is virtualized, and a physical device is >> passed-through to the VM, changes of the virtual IOMMU need to be >> propagated to t

[PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "ca

Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
> On Jan 26, 2021, at 4:26 PM, Lu Baolu wrote: > > Hi Nadav, > > On 1/27/21 4:38 AM, Nadav Amit wrote: >> From: Nadav Amit >> When an Intel IOMMU is virtualized, and a physical device is >> passed-through to the VM, changes of the virtual IOMMU need to be

[PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
From: Nadav Amit When an Intel IOMMU is virtualized, and a physical device is passed-through to the VM, changes of the virtual IOMMU need to be propagated to the physical IOMMU. The hypervisor therefore needs to monitor PTE mappings in the IOMMU page-tables. Intel specifications provide "ca

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-17 Thread Nadav Amit
> On Jan 17, 2021, at 11:25 AM, Yu Zhao wrote: > > On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote: >>> On Jan 17, 2021, at 1:16 AM, Yu Zhao wrote: >>> >>> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote: >>>>> On Jan 16

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-17 Thread Nadav Amit
> On Jan 17, 2021, at 1:16 AM, Yu Zhao wrote: > > On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote: >>> On Jan 16, 2021, at 8:41 PM, Yu Zhao wrote: >>> >>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote: >>>> On Tue, Ja

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-16 Thread Nadav Amit
> On Jan 16, 2021, at 8:41 PM, Yu Zhao wrote: > > On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote: >> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote: >>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote: >>>> On Tue, Jan 12, 202

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-12 Thread Nadav Amit
> On Jan 12, 2021, at 1:43 PM, Will Deacon wrote: > > On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote: >>> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote: >>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote: >>>> I will send an RFC

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-12 Thread Nadav Amit
> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote: > > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote: >>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour >>> wrote: >>> >>> On 12/01/2021 at 17:57, Peter Zijlstra wrote: >>>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-12 Thread Nadav Amit
> On Jan 12, 2021, at 11:02 AM, Laurent Dufour > wrote: > > On 12/01/2021 at 17:57, Peter Zijlstra wrote: >> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote: >>> On 12/01/2021 at 12:43, Vinayak Menon wrote: Possibility of race against other PTE modifiers 1) For

Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli wrote: > > On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote: >>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote: >>> >>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: >>>

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 11:45 AM, Andrea Arcangeli wrote: > > On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote: >>> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote: >>> I just don't like to slow down a feature required in the future for >>> i

Re: [RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote: > > On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote: >> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking") > > Targeting a backport down to 2013 when nothing could wrong i

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 7:08 AM, Peter Xu wrote: > > On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >> diff --git a/mm/mprotect.c b/mm/mprotect.c >> index ab709023e9aa..c08c4055b051 100644 >> --- a/mm/mprotect.c >> +++ b/mm/mprotect.c >>

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote: >> I would feel more comfortable if you provide patches for uffd-wp. If you >> want, I will do it, but I restate that I do not feel comfortable with this

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 12:58 AM, Peter Zijlstra wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >>>

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-05 Thread Nadav Amit
> On Jan 5, 2021, at 12:13 AM, Peter Zijlstra wrote: > > On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote: >> The problematic one not pictured is the one of the wrprotect that has >> to be running in another CPU which also isn't pictured above. More >> accurate traces are posted

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Nadav Amit
> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote: >>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote: >>> >>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: >>>>

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Nadav Amit
> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote: > > On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote: >>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote: >>> >>> Hello, >>> >>> On Mon, Jan 04, 2021 at 01:22:27PM +01

Re: [RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2021-01-04 Thread Nadav Amit
> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote: > > Hello, > > On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote: >> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote: >> >>> The scenario that happens in selftests/vm/user

[RFC PATCH v2 2/2] fs/task_mmu: acquire mmap_lock for write on soft-dirty cleanup

2020-12-25 Thread Nadav Amit
From: Nadav Amit Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory corruption as it clears the dirty-bit without acquiring the mmap_lock for write and defers TLB flushes. As a result of this behavior, it is possible that one of the CPUs would have the stale PTE cached in its

[RFC PATCH v2 1/2] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-25 Thread Nadav Amit
From: Nadav Amit Userfaultfd self-test fails occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range() and defers flushes, and since there is insufficient consideration of concurrent

[RFC PATCH v2 0/2] mm: fix races due to deferred TLB flushes

2020-12-25 Thread Nadav Amit
From: Nadav Amit This patch-set went from v1 to RFCv2, as there is still an ongoing discussion regarding the way of solving the recently found races due to deferred TLB flushes. These patches are only sent for reference for now, and can be applied later if no better solution is taken. In a

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 8:01 PM, Andrea Arcangeli wrote: > >> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote: >>> Perhaps holding some small bitmap based on part of the deferred flushed >>> pages (e.g., bits 12-17 of the address or some other kind of a
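
The bitmap idea floated in this message can be sketched as follows. It is an illustrative userspace model under stated assumptions: the structure and helper names are hypothetical, 4 KiB pages are assumed, and the six address bits 12-17 select one of 64 slots. Distinct pages can share a slot, so the query may report false positives but never false negatives.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of which pages have a deferred flush (assumption). */
struct deferred_bitmap {
    uint64_t bits;   /* one bit per slot, 64 slots */
};

/* Slot index taken from address bits 12-17 (six bits above the page offset). */
static inline unsigned deferred_slot(uint64_t addr)
{
    return (unsigned)((addr >> 12) & 0x3f);
}

/* Record that a flush for the page containing 'addr' was deferred. */
static inline void deferred_mark(struct deferred_bitmap *b, uint64_t addr)
{
    b->bits |= 1ull << deferred_slot(addr);
}

/* May return true for an aliasing page; never false for a marked page. */
static inline bool deferred_maybe_pending(const struct deferred_bitmap *b,
                                          uint64_t addr)
{
    return (b->bits >> deferred_slot(addr)) & 1;
}

/* Called once all deferred flushes have completed. */
static inline void deferred_clear(struct deferred_bitmap *b)
{
    b->bits = 0;
}
```

A false positive only costs an extra flush, which is why such a lossy summary can still be safe.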

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 7:34 PM, Yu Zhao wrote: > > On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote: >>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote: >>> >>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote: >>>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 7:09 PM, Nadav Amit wrote: > >> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote: >> >> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote: >>> I don’t love this as a long term fix. AFAICT we can have >>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote: > > On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote: >> I don’t love this as a long term fix. AFAICT we can have >> mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait >> in IO while splitting a huge

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 2:05 PM, Andrea Arcangeli wrote: > > On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote: >>> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote: >>> >>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote: >>>> On

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-23 Thread Nadav Amit
> On Dec 23, 2020, at 8:23 AM, Will Deacon wrote: > > On Tue, Dec 22, 2020 at 11:20:21AM -0800, Nadav Amit wrote: >>> On Dec 22, 2020, at 10:30 AM, Yu Zhao wrote: >>> >>> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote: >>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli wrote: > > From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001 > From: Andrea Arcangeli > Date: Tue, 22 Dec 2020 14:30:16 -0500 > Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after > userfaultfd_writeprotect() >

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski wrote: > > On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit wrote: >>> [ I have in mind another solution, such as keeping in each page-table a >>> “table-generation” which is the mm-generation at the time of the change,

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 22, 2020, at 11:44 AM, Andrea Arcangeli wrote: > > On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote: >> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so > > I assume you mean "in" mprotect_fixup, after change_protection.

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 22, 2020, at 10:30 AM, Yu Zhao wrote: > > On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote: >>> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote: >>> >>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote: >>>> On

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote: > > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote: >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote: >>> Using mmap_write_lock() was my initial fix and there was a strong pushback >>> on this app

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-22 Thread Nadav Amit
> On Dec 21, 2020, at 7:19 PM, Andy Lutomirski wrote: > > On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds > wrote: >> On Mon, Dec 21, 2020 at 2:30 PM Peter Xu wrote: >>> AFAIU mprotect() is the only one who modifies the pte using the mmap write >>> lock. NUMA balancing is also using read mmap l

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 3:30 PM, Linus Torvalds > wrote: > > On Mon, Dec 21, 2020 at 2:55 PM Nadav Amit wrote: >> So as an alternative solution, I can do copying under the PTL after >> flushing, which seems to solve the problem. > > ... > Note that th

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 2:30 PM, Peter Xu wrote: > > On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote: >> BTW: In general, I think that you are right, and that changing of PTEs >> should not require taking mmap_lock for write. However, I am not sure >> cow_user

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote: > > On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote: >> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote: >>> Using mmap_write_lock() was my initial fix and there was a strong pushback >>> on this app

Re: [RFC PATCH 03/13] selftests/vm/userfaultfd: wake after copy failure

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 12:52 PM, Peter Xu wrote: > > On Mon, Dec 21, 2020 at 07:51:52PM +0000, Nadav Amit wrote: >>> On Dec 21, 2020, at 11:28 AM, Peter Xu wrote: >>> >>> On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote: >>>> From: Na

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 11:55 AM, Linus Torvalds > wrote: > > On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao wrote: >> Nadav Amit found memory corruptions when running userfaultfd test above. >> It seems to me the problem is related to commit 09854ba94c6a ("mm: >>

Re: [RFC PATCH 03/13] selftests/vm/userfaultfd: wake after copy failure

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 11:28 AM, Peter Xu wrote: > > On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote: >> From: Nadav Amit >> >> When userfaultfd copy-ioctl fails since the PTE already exists, an >> -EEXIST error is returned and the faulting th

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-21 Thread Nadav Amit
> On Dec 21, 2020, at 9:27 AM, Peter Xu wrote: > > Hi, Nadav, > > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote: > > [...] > >> So to correct myself, I think that what I really encountered was actually >> during MM_CP_UFFD_WP_RESOLVE (i.e.,

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-20 Thread Nadav Amit
> On Dec 20, 2020, at 9:25 PM, Nadav Amit wrote: > >> On Dec 20, 2020, at 9:12 PM, Yu Zhao wrote: >> >> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote: >>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote: >>>> >>>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-20 Thread Nadav Amit
> On Dec 20, 2020, at 9:12 PM, Yu Zhao wrote: > > On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote: >>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote: >>> >>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote: >>>>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-20 Thread Nadav Amit
> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote: > > On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote: >>> On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote: >>> >>> [ cc’ing some more people who have experience with similar problems ] >>

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-20 Thread Nadav Amit
> On Dec 20, 2020, at 1:54 AM, Yu Zhao wrote: > > On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote: >>> On Dec 19, 2020, at 10:05 PM, Yu Zhao wrote: >>> >>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote: >>>> [ cc’ing s

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-20 Thread Nadav Amit
> On Dec 19, 2020, at 10:05 PM, Yu Zhao wrote: > > On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote: >> [ cc’ing some more people who have experience with similar problems ] >> >>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote: >>> >

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-19 Thread Nadav Amit
> On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote: > > [ cc’ing some more people who have experience with similar problems ] > >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote: >> >> Hello, >> >> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Na

Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-19 Thread Nadav Amit
[ cc’ing some more people who have experience with similar problems ] > On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote: > > Hello, > > On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote: >> Analyzing this problem indicates that there is a real bug since >

[PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

2020-12-18 Thread Nadav Amit
From: Nadav Amit Userfaultfd self-tests fail occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range(). This might cause the TLBs to be in an inconsistent state due to the deferred

Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2020-12-08 Thread Nadav Amit
> On Dec 8, 2020, at 12:34 AM, Mike Rapoport wrote: > > On Sun, Dec 06, 2020 at 08:31:39PM -0800, Nadav Amit wrote: >> Whenever I run into a non-standard and non-trivial synchronization algorithm >> in the kernel (and elsewhere), I become very confused and concerned. I

Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2020-12-06 Thread Nadav Amit
Thanks for the detailed answer, Mike. Things are clearer in regard to your intention. > On Dec 6, 2020, at 1:37 AM, Mike Rapoport wrote: > > The uffd monitor should *know* what is the state of child's memory and > without this patch it could only guess. I see - so mmap_changing is not just abou

Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Nadav Amit
I am not very familiar with membarrier, but here are my 2 cents while trying to answer your questions. > On Dec 3, 2020, at 9:26 PM, Andy Lutomirski wrote: > @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct > mm_struct *next, >* from one thread in a proc

Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races

2020-12-03 Thread Nadav Amit
Hello Mike, Regarding your (old) patch: > On May 23, 2018, at 12:42 AM, Mike Rapoport wrote: > > If a process monitored with userfaultfd changes its memory mappings or > forks() at the same time as uffd monitor fills the process memory with > UFFDIO_COPY, the actual creation of page table entr

Re: [RFC PATCH 11/13] fs/userfaultfd: complete write asynchronously

2020-12-01 Thread Nadav Amit
> On Nov 28, 2020, at 4:45 PM, Nadav Amit wrote: > > From: Nadav Amit > > Userfaultfd writes can now be used for copy/zeroing. When using iouring > with userfaultfd, performing the copying/zeroing on the faulting thread > instead of the handler/iouring thread has severa

Re: [RFC PATCH 12/13] fs/userfaultfd: kmem-cache for wait-queue objects

2020-11-30 Thread Nadav Amit
> On Nov 28, 2020, at 4:45 PM, Nadav Amit wrote: > > From: Nadav Amit > > Allocating work-queue objects on the stack usually has negative > performance side-effects. First, it is hard to ensure alignment to > cache-lines without increasing the stack size. Second, it might

Re: [RFC PATCH 07/13] fs/userfaultfd: support read_iter to use io_uring

2020-11-30 Thread Nadav Amit
> On Nov 30, 2020, at 10:20 AM, Jens Axboe wrote: > > On 11/28/20 5:45 PM, Nadav Amit wrote: >> From: Nadav Amit >> >> iouring with userfaultfd cannot currently be used with fixed buffers since >> userfaultfd does not provide read_iter(). This is required to all

[RFC PATCH 11/13] fs/userfaultfd: complete write asynchronously

2020-11-28 Thread Nadav Amit
From: Nadav Amit Userfaultfd writes can now be used for copy/zeroing. When using iouring with userfaultfd, performing the copying/zeroing on the faulting thread instead of the handler/iouring thread has several advantages: (1) The data of the faulting thread will be available on the local

[RFC PATCH 10/13] fs/userfaultfd: add write_iter() interface

2020-11-28 Thread Nadav Amit
From: Nadav Amit In order to use userfaultfd with io-uring, there are two options for extensions: support userfaultfd ioctls or provide similar functionality through the "write" interface. The latter approach seems more compelling as it does not require io-uring changes, and keeps all

[RFC PATCH 09/13] fs/userfaultfd: use iov_iter for copy/zero

2020-11-28 Thread Nadav Amit
From: Nadav Amit Use iov_iter for copy and zero ioctls. This is done in preparation to support a write_iter() interface that would provide similar services as UFFDIO_COPY/ZERO. In the case of UFFDIO_ZERO, the iov_iter is not really used for any purpose other than providing the length of the

[RFC PATCH 08/13] fs/userfaultfd: complete reads asynchronously

2020-11-28 Thread Nadav Amit
From: Nadav Amit Complete reads asynchronously to allow io_uring to complete reads asynchronously. Reads, which report page-faults and events, can only be performed asynchronously if the read is performed into a kernel buffer, and therefore guarantee that no page-fault would occur during the

[RFC PATCH 13/13] selftests/vm/userfaultfd: iouring and polling tests

2020-11-28 Thread Nadav Amit
From: Nadav Amit Add tests to check the use of userfaultfd with iouring, "write" interface of userfaultfd and with the "poll" feature of userfaultfd. Enabling the tests is done through new test "modifiers": "poll", "write" "iouring" t

[RFC PATCH 07/13] fs/userfaultfd: support read_iter to use io_uring

2020-11-28 Thread Nadav Amit
From: Nadav Amit iouring with userfaultfd cannot currently be used with fixed buffers since userfaultfd does not provide read_iter(). This is required to allow asynchronous (queued) reads from userfaultfd. To support async-reads of userfaultfd provide read_iter() instead of read(). Cc: Jens Axboe

[RFC PATCH 06/13] iov_iter: support atomic copy_page_from_iter_iovec()

2020-11-28 Thread Nadav Amit
From: Nadav Amit copy_page_from_iter_iovec() cannot be used when preemption is enabled. Change copy_page_from_iter_iovec() into __copy_page_from_iter_iovec() with an additional parameter that says whether the caller runs in atomic context. When __copy_page_from_iter_iovec() is used in an atomic

[RFC PATCH 05/13] fs/userfaultfd: introduce UFFD_FEATURE_POLL

2020-11-28 Thread Nadav Amit
From: Nadav Amit Add a feature UFFD_FEATURE_POLL that makes the faulting thread spin while waiting for the page-fault to be handled. Users of this feature should take care to run the page-fault handling thread on another physical CPU and potentially to ensure that there are available cores to

[RFC PATCH 12/13] fs/userfaultfd: kmem-cache for wait-queue objects

2020-11-28 Thread Nadav Amit
From: Nadav Amit Allocating work-queue objects on the stack usually has negative performance side-effects. First, it is hard to ensure alignment to cache-lines without increasing the stack size. Second, it might cause false sharing. Third, it is more likely to encounter TLB misses as objects are

[RFC PATCH 01/13] fs/userfaultfd: fix wrong error code on WP & !VM_MAYWRITE

2020-11-28 Thread Nadav Amit
From: Nadav Amit It is possible to get an EINVAL error instead of EPERM if the following test vm_flags have VM_UFFD_WP but do not have VM_MAYWRITE, as "ret" is overwritten since commit cab350afcbc9 ("userfaultfd: hugetlbfs: allow registration of ranges containing huge pages")

[RFC PATCH 03/13] selftests/vm/userfaultfd: wake after copy failure

2020-11-28 Thread Nadav Amit
From: Nadav Amit When userfaultfd copy-ioctl fails since the PTE already exists, an -EEXIST error is returned and the faulting thread is not woken. The current userfaultfd test does not wake the faulting thread in such case. The assumption is presumably that another thread set the PTE through

[RFC PATCH 04/13] fs/userfaultfd: simplify locks in userfaultfd_ctx_read

2020-11-28 Thread Nadav Amit
From: Nadav Amit Small refactoring to reduce the number of locations in which locks are released in userfaultfd_ctx_read(), as this makes the understanding of the code and its changes harder. No functional change intended. Cc: Jens Axboe Cc: Andrea Arcangeli Cc: Peter Xu Cc: Alexander Viro

[RFC PATCH 00/13] fs/userfaultfd: support iouring and polling

2020-11-28 Thread Nadav Amit
From: Nadav Amit While the overhead of userfaultfd is usually reasonable, this overhead can still be prohibitive for low-latency backing storage, such as RDMA, persistent memory or in-memory compression. In such cases the overhead of scheduling and entering/exiting the kernel becomes dominant

[RFC PATCH 02/13] fs/userfaultfd: fix wrong file usage with iouring

2020-11-28 Thread Nadav Amit
From: Nadav Amit Using io-uring with userfaultfd for reads can, upon a fork event, lead to the installation of the userfaultfd file descriptor on the worker kernel thread instead of on the process that initiated the read. io-uring assumes that no new file descriptors are installed during read. As a

Re: Lockdep warning on io_file_data_ref_zero() with 5.10-rc5

2020-11-28 Thread Nadav Amit
> On Nov 28, 2020, at 4:13 PM, Pavel Begunkov wrote: > > On 28/11/2020 23:59, Nadav Amit wrote: >> Hello Pavel, >> >> I got the following lockdep splat while rebasing my work on 5.10-rc5 on the >> kernel (based on 5.10-rc5+). >> >> I did not a

Lockdep warning on io_file_data_ref_zero() with 5.10-rc5

2020-11-28 Thread Nadav Amit
Hello Pavel, I got the following lockdep splat while rebasing my work on 5.10-rc5 on the kernel (based on 5.10-rc5+). I did not actually confirm that the problem is triggered without my changes, as my iouring workload requires some kernel changes (not iouring changes), yet IMHO it seems pretty cl

Re: [PATCH] x86/special_insn: reverse __force_order logic

2020-09-02 Thread Nadav Amit
> On Sep 2, 2020, at 9:56 AM, pet...@infradead.org wrote: > > On Wed, Sep 02, 2020 at 03:32:18PM +0000, Nadav Amit wrote: > >> Thanks for the pointer. I did not see the discussion, and embarrassingly, I have >> also never figured out how to reply on lkml emails withou

Re: [PATCH] x86/special_insn: reverse __force_order logic

2020-09-02 Thread Nadav Amit
> On Sep 2, 2020, at 5:54 AM, pet...@infradead.org wrote: > > On Tue, Sep 01, 2020 at 09:18:57AM -0700, Nadav Amit wrote: > >> Unless I misunderstand the logic, __force_order should also be used by >> rdpkru() and wrpkru() which do not have dependency on __force_

[PATCH] x86/special_insn: reverse __force_order logic

2020-09-01 Thread Nadav Amit
From: Nadav Amit The __force_order logic seems to be inverted. __force_order is supposedly used to manipulate the compiler to use its memory dependencies analysis to enforce orders between CR writes and reads. Therefore, the memory should behave as a "CR": when the CR is read, the mem

Re: [kvm-unit-tests PATCH] x86: pmu: Test full-width counter writes support

2020-06-19 Thread Nadav Amit
> On Jun 16, 2020, at 5:28 AM, Paolo Bonzini wrote: > > On 16/06/20 12:49, Thomas Huth wrote: >> On 29/05/2020 09.43, Like Xu wrote: >>> When the full-width writes capability is set, use the alternative MSR >>> range to write larger sign counter values (up to GP counter width). >>> >>> Signed-of

Re: [PATCH] smp: generic ipi_raise tracepoint

2020-05-21 Thread Nadav Amit
> On May 20, 2020, at 6:17 AM, Wojciech Kudla wrote: > > Preliminary discussion: > https://lkml.org/lkml/2020/5/13/1327

Re: [PATCH] VMCI: Release resource if the work is already queued

2019-09-07 Thread Nadav Amit
> On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman > wrote: > > On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote: >> Francois reported that VMware balloon gets stuck after a balloon reset, >> when the VMCI doorbell is removed. A similar error can occur when

Re: [PATCH v2 4/6] compiler-gcc.h: add asm_inline definition

2019-09-04 Thread Nadav Amit
> On Sep 4, 2019, at 5:18 PM, Nick Desaulniers wrote: > > On Fri, Aug 30, 2019 at 4:15 PM Rasmus Villemoes > wrote: >> This adds an asm_inline macro which expands to "asm inline" [1] when gcc >> is new enough (>= 9.1), and just asm for older gccs and other >> compilers. >> >> Using asm inline("
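
The macro described in the quoted patch text can be sketched as a compiler-version check. This is a minimal illustration, assuming only what the snippet states (use "asm inline" for gcc >= 9.1, plain asm otherwise); it is not the kernel header itself, and `asm_inline_probe()` is a hypothetical helper added only to show the macro in use.

```c
/*
 * Sketch only: pick "asm inline" when gcc is new enough (>= 9.1),
 * and fall back to plain asm for older gccs and other compilers.
 */
#if defined(__GNUC__) && !defined(__clang__) && \
    (__GNUC__ > 9 || (__GNUC__ == 9 && __GNUC_MINOR__ >= 1))
#define asm_inline __asm__ __inline__
#else
#define asm_inline __asm__
#endif

/* Hypothetical helper: an empty asm statement that leaves x unchanged. */
static int asm_inline_probe(void)
{
    int x = 41;

    asm_inline __volatile__("" : "+r"(x));
    return x + 1;
}
```

The `__asm__` and `__inline__` spellings are used so the sketch also builds in strict ISO C modes, where the bare `asm` keyword is unavailable.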

Re: [RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-09-03 Thread Nadav Amit
> On Sep 3, 2019, at 8:17 AM, Dave Hansen wrote: > > On 8/23/19 3:52 PM, Nadav Amit wrote: >> n_pages concurrent +deferred-pti change >> --- -- - -- >> 121191

Re: [RFC PATCH 0/5] make use of gcc 9's "asm inline()"

2019-08-29 Thread Nadav Amit
> On Aug 29, 2019, at 11:15 AM, Linus Torvalds > wrote: > > On Thu, Aug 29, 2019 at 10:36 AM Nick Desaulniers > wrote: >> I'm curious what "the size of the asm" means, and how it differs >> precisely from "how many instructions GCC thinks it is." I would >> think those are one and the same? O

Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-08-29 Thread Nadav Amit
> On Aug 27, 2019, at 5:30 PM, Andy Lutomirski wrote: > > On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit wrote: >>> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski wrote: >>> >>> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit wrote: >>>> INVPCID is cons
