From: Nadav Amit
Currently, deferred TLB flushes are detected at mm granularity: if
there is any deferred TLB flush in the entire address space due to NUMA
migration, pte_accessible() in x86 would return true, and
ptep_clear_flush() would require a TLB flush. This would happen even if
the
From: Nadav Amit
Reduce the chances that inc/dec_tlb_flush_pending() will be abused by
moving them into mmu_gather.c, which is more of their natural place.
This also reduces the clutter in mm_types.h.
Signed-off-by: Nadav Amit
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy
From: Nadav Amit
Certain architectures need information about the vma that is about to be
flushed. Currently, an artificial vma is constructed using the original
vma information. Instead of saving the flags, record the vma during
tlb_start_vma() and use this vma when calling flush_tlb_range
From: Nadav Amit
x86 currently has a TLB-generation tracking logic that can be used by
additional architectures (as long as they implement some additional
logic).
Extract the relevant pieces of code from x86 to general TLB code. This
would be useful to allow to write the next "fine granul
From: Nadav Amit
In preparation for fine(r) granularity, introduce
pte_tlb_flush_pending() and pmd_tlb_flush_pending(). Right now the
function directs to mm_tlb_flush_pending().
Change pte_accessible() to provide the vma as well.
No functional change. Next patches will use this information on
From: Nadav Amit
Architecture-specific tlb_start_vma() and tlb_end_vma() seem
unnecessary. They are currently used for:
1. Avoid per-VMA TLB flushes. This can be determined by introducing
a new config option.
2. Avoid saving information on the vma that is being flushed. Saving
this
From: Nadav Amit
Introduce tlb_start_ptes() and tlb_end_ptes() which would be called
before and after PTEs are updated and TLB flushes are deferred. This
will later be used for fine granularity deferred TLB flushing
detection.
In the meantime, move flush_tlb_batched_pending() into
From: Nadav Amit
Add a pte_to_page(), which is similar to pmd_to_page, which will be used
later.
Inline pmd_to_page() as well.
Signed-off-by: Nadav Amit
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Cc
From: Nadav Amit
Use mmu_gather interface in task_mmu instead of
{inc|dec}_tlb_flush_pending(). This consolidates the code and avoids
potential bugs.
Signed-off-by: Nadav Amit
Cc: Andrea Arcangeli
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc
From: Nadav Amit
To detect deferred TLB flushes at fine granularity, we need to keep
track of the completed TLB flush generation for each mm.
Add logic to track for each mm the tlb_gen_completed, which tracks the
completed TLB generation. It is the arch responsibility to call
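As a rough illustration of the generation tracking this snippet describes, the following userspace C sketch keeps two counters per mm: one bumped whenever a flush is deferred, and one recording the highest generation whose flush has completed. The struct, the sketch_* helpers, and the idea of comparing against a generation recorded for a page table are assumptions made for illustration; they are not taken from the patch itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct sketch_mm {
	_Atomic uint64_t tlb_gen;           /* bumped when a flush is deferred */
	_Atomic uint64_t tlb_gen_completed; /* highest generation fully flushed */
};

/* Record a deferred flush and return the generation it belongs to. */
static uint64_t sketch_defer_flush(struct sketch_mm *mm)
{
	return atomic_fetch_add(&mm->tlb_gen, 1) + 1;
}

/* Mark every flush up to and including 'gen' as completed (never moves backwards). */
static void sketch_flush_completed(struct sketch_mm *mm, uint64_t gen)
{
	uint64_t cur = atomic_load(&mm->tlb_gen_completed);

	assert(gen <= atomic_load(&mm->tlb_gen));
	while (cur < gen &&
	       !atomic_compare_exchange_weak(&mm->tlb_gen_completed, &cur, gen))
		; /* 'cur' is reloaded on failure; retry until we win or are overtaken */
}

/* A change made at generation 'table_gen' is still pending if it was not completed yet. */
static bool sketch_flush_pending(struct sketch_mm *mm, uint64_t table_gen)
{
	return atomic_load(&mm->tlb_gen_completed) < table_gen;
}
```

With counters like these, a check can be answered per page table rather than for the whole address space: only tables whose recorded generation is newer than tlb_gen_completed would still need a flush.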
From: Nadav Amit
Arguably, tlb.h is the natural place for TLB related code. In addition,
task_mmu needs to be able to call flush_tlb_batched_pending() and
therefore cannot (or should not) use mm/internal.h.
Move all the functions that are controlled by
From: Nadav Amit
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents.
Use mmu_gather in change_pXX_range(). As the pages
From: Nadav Amit
fullmm in mmu_gather is supposed to indicate that the mm is torn-down
(e.g., on process exit) and can therefore allow certain optimizations.
However, tlb_finish_mmu() sets fullmm, when in fact it wants to say that
the TLB should be fully flushed.
Change tlb_finish_mmu() to set
From: Nadav Amit
Currently, using mprotect() or uffd to unprotect a memory region causes
a TLB flush. At least on x86, no TLB flush is needed when protection is
promoted.
Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the
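The reasoning in this snippet can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the SK_* bits are made-up flags rather than the x86 PTE layout, sketch_pte_may_need_flush() is a hypothetical helper rather than the function added by the patch, and the sketch ignores accessed/dirty bits and changes of the mapped page frame, which a real implementation would have to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission bits, for illustration only. */
#define SK_PRESENT (1u << 0)
#define SK_WRITE   (1u << 1)
#define SK_NOEXEC  (1u << 2)

/*
 * Return true if a stale TLB entry built from old_flags could grant more
 * than new_flags allows, in which case a flush may be needed.
 */
static bool sketch_pte_may_need_flush(uint32_t old_flags, uint32_t new_flags)
{
	/* Assumption: a non-present entry is never cached in the TLB. */
	if (!(old_flags & SK_PRESENT))
		return false;
	/* Present -> non-present: the stale entry still maps the page. */
	if (!(new_flags & SK_PRESENT))
		return true;
	/* Write permission was removed. */
	if ((old_flags & SK_WRITE) && !(new_flags & SK_WRITE))
		return true;
	/* Execute permission was removed. */
	if (!(old_flags & SK_NOEXEC) && (new_flags & SK_NOEXEC))
		return true;
	/* Permissions were only kept or promoted. */
	return false;
}

/* Usage: promoting read-only to read-write needs no flush; demoting does. */
void sketch_usage(void)
{
	assert(!sketch_pte_may_need_flush(SK_PRESENT, SK_PRESENT | SK_WRITE));
	assert(sketch_pte_may_need_flush(SK_PRESENT | SK_WRITE, SK_PRESENT));
}
```

The design choice is to err on the side of flushing: only transitions that are known to add permissions skip the flush.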
From: Nadav Amit
Avoid open-coding mmu_gather for no reason. There is no apparent reason
not to use the existing mmu_gather interfaces.
Use the newly introduced pte_may_need_flush() to check whether a flush
is needed to avoid unnecessary flushes.
Signed-off-by: Nadav Amit
Cc: Andrea Arcangeli
From: Nadav Amit
There are currently (at least?) 5 different TLB batching schemes in the
kernel:
1. Using mmu_gather (e.g., zap_page_range()).
2. Using {inc|dec}_tlb_flush_pending() to inform other threads of the
ongoing deferred TLB flush and flushing the entire range eventually
(e.g
From: Nadav Amit
When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "ca
> On Jan 27, 2021, at 3:25 AM, Lu Baolu wrote:
>
> On 2021/1/27 14:17, Nadav Amit wrote:
>> From: Nadav Amit
>> When an Intel IOMMU is virtualized, and a physical device is
>> passed-through to the VM, changes of the virtual IOMMU need to be
>> propagated to t
From: Nadav Amit
When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "ca
> On Jan 26, 2021, at 4:26 PM, Lu Baolu wrote:
>
> Hi Nadav,
>
> On 1/27/21 4:38 AM, Nadav Amit wrote:
>> From: Nadav Amit
>> When an Intel IOMMU is virtualized, and a physical device is
>> passed-through to the VM, changes of the virtual IOMMU need to be
>
From: Nadav Amit
When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "ca
> On Jan 17, 2021, at 11:25 AM, Yu Zhao wrote:
>
> On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote:
>>> On Jan 17, 2021, at 1:16 AM, Yu Zhao wrote:
>>>
>>> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>>>> On Jan 16
> On Jan 17, 2021, at 1:16 AM, Yu Zhao wrote:
>
> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>> On Jan 16, 2021, at 8:41 PM, Yu Zhao wrote:
>>>
>>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>>>> On Tue, Ja
> On Jan 16, 2021, at 8:41 PM, Yu Zhao wrote:
>
> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote:
>>>> On Tue, Jan 12, 202
> On Jan 12, 2021, at 1:43 PM, Will Deacon wrote:
>
> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote:
>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>> I will send an RFC
> On Jan 12, 2021, at 11:56 AM, Yu Zhao wrote:
>
> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour
>>> wrote:
>>>
>>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>>>>
> On Jan 12, 2021, at 11:02 AM, Laurent Dufour
> wrote:
>
> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
Possibility of race against other PTE modifiers
1) For
> On Jan 5, 2021, at 12:39 PM, Andrea Arcangeli wrote:
>
> On Tue, Jan 05, 2021 at 07:26:43PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote:
>>>
>>> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>>>>
> On Jan 5, 2021, at 11:45 AM, Andrea Arcangeli wrote:
>
> On Tue, Jan 05, 2021 at 07:05:22PM +0000, Nadav Amit wrote:
>>> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote:
>>> I just don't like to slow down a feature required in the future for
>>> i
> On Jan 5, 2021, at 10:20 AM, Andrea Arcangeli wrote:
>
> On Fri, Dec 25, 2020 at 01:25:29AM -0800, Nadav Amit wrote:
>> Fixes: 0f8975ec4db2 ("mm: soft-dirty bits for user memory changes tracking")
>
> Targeting a backport down to 2013 when nothing could wrong i
> On Jan 5, 2021, at 7:08 AM, Peter Xu wrote:
>
> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index ab709023e9aa..c08c4055b051 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>>
> On Jan 5, 2021, at 10:45 AM, Andrea Arcangeli wrote:
>
> On Mon, Jan 04, 2021 at 09:26:33PM +0000, Nadav Amit wrote:
>> I would feel more comfortable if you provide patches for uffd-wp. If you
>> want, I will do it, but I restate that I do not feel comfortable with this
>
> On Jan 5, 2021, at 12:58 AM, Peter Zijlstra wrote:
>
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
>> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>>
> On Jan 5, 2021, at 12:13 AM, Peter Zijlstra wrote:
>
> On Mon, Jan 04, 2021 at 02:24:38PM -0500, Andrea Arcangeli wrote:
>> The problematic one not pictured is the one of the wrprotect that has
>> to be running in another CPU which also isn't pictured above. More
>> accurate traces are posted
> On Jan 4, 2021, at 1:01 PM, Andrea Arcangeli wrote:
>
> On Mon, Jan 04, 2021 at 08:39:37PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote:
>>>
>>> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>>>
> On Jan 4, 2021, at 12:19 PM, Andrea Arcangeli wrote:
>
> On Mon, Jan 04, 2021 at 07:35:06PM +0000, Nadav Amit wrote:
>>> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote:
>>>
>>> Hello,
>>>
>>> On Mon, Jan 04, 2021 at 01:22:27PM +01
> On Jan 4, 2021, at 11:24 AM, Andrea Arcangeli wrote:
>
> Hello,
>
> On Mon, Jan 04, 2021 at 01:22:27PM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 25, 2020 at 01:25:28AM -0800, Nadav Amit wrote:
>>
>>> The scenario that happens in selftests/vm/user
From: Nadav Amit
Clearing soft-dirty through /proc/[pid]/clear_refs can cause memory
corruption as it clears the dirty-bit without acquiring the mmap_lock
for write and defers TLB flushes.
As a result of this behavior, it is possible that one of the CPUs would
have the stale PTE cached in its
From: Nadav Amit
Userfaultfd self-test fails occasionally, indicating a memory
corruption.
Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range() and defers
flushes, and since there is insufficient consideration of concurrent
From: Nadav Amit
This patch-set went from v1 to RFCv2, as there is still an ongoing
discussion regarding the way of solving the recently found races due to
deferred TLB flushes. These patches are only sent for reference for now,
and can be applied later if no better solution is taken.
In a
> On Dec 23, 2020, at 8:01 PM, Andrea Arcangeli wrote:
>
>> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> Perhaps holding some small bitmap based on part of the deferred flushed
>>> pages (e.g., bits 12-17 of the address or some other kind of a
> On Dec 23, 2020, at 7:34 PM, Yu Zhao wrote:
>
> On Wed, Dec 23, 2020 at 07:09:10PM -0800, Nadav Amit wrote:
>>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote:
>>>
>>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>>>>
> On Dec 23, 2020, at 7:09 PM, Nadav Amit wrote:
>
>> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote:
>>
>> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>>> I don’t love this as a long term fix. AFAICT we can have
>>>
> On Dec 23, 2020, at 6:00 PM, Andrea Arcangeli wrote:
>
> On Wed, Dec 23, 2020 at 05:21:43PM -0800, Andy Lutomirski wrote:
>> I don’t love this as a long term fix. AFAICT we can have
>> mm_tlb_flush_pending set for quite a while — mprotect seems like it can wait
>> in IO while splitting a huge
> On Dec 23, 2020, at 2:05 PM, Andrea Arcangeli wrote:
>
> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
>>>
>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>> On
> On Dec 23, 2020, at 8:23 AM, Will Deacon wrote:
>
> On Tue, Dec 22, 2020 at 11:20:21AM -0800, Nadav Amit wrote:
>>> On Dec 22, 2020, at 10:30 AM, Yu Zhao wrote:
>>>
>>> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>>>
> On Dec 22, 2020, at 11:31 AM, Andrea Arcangeli wrote:
>
> From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli
> Date: Tue, 22 Dec 2020 14:30:16 -0500
> Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
> userfaultfd_writeprotect()
>
> On Dec 22, 2020, at 12:34 PM, Andy Lutomirski wrote:
>
> On Sat, Dec 19, 2020 at 2:06 PM Nadav Amit wrote:
>>> [ I have in mind another solution, such as keeping in each page-table a
>>> “table-generation” which is the mm-generation at the time of the change,
>
> On Dec 22, 2020, at 11:44 AM, Andrea Arcangeli wrote:
>
> On Mon, Dec 21, 2020 at 02:55:12PM -0800, Nadav Amit wrote:
>> wouldn’t mmap_write_downgrade() be executed before mprotect_fixup() (so
>
> I assume you mean "in" mprotect_fixup, after change_protection.
> On Dec 22, 2020, at 10:30 AM, Yu Zhao wrote:
>
> On Tue, Dec 22, 2020 at 04:40:32AM -0800, Nadav Amit wrote:
>>> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
>>>
>>> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>>>> On
> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
>
> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote:
>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>> on this app
> On Dec 21, 2020, at 7:19 PM, Andy Lutomirski wrote:
>
> On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds
> wrote:
>> On Mon, Dec 21, 2020 at 2:30 PM Peter Xu wrote:
>>> AFAIU mprotect() is the only one who modifies the pte using the mmap write
>>> lock. NUMA balancing is also using read mmap l
> On Dec 21, 2020, at 3:30 PM, Linus Torvalds
> wrote:
>
> On Mon, Dec 21, 2020 at 2:55 PM Nadav Amit wrote:
>> So as an alternative solution, I can do copying under the PTL after
>> flushing, which seems to solve the problem.
>
> ...
> Note that th
> On Dec 21, 2020, at 2:30 PM, Peter Xu wrote:
>
> On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
>> BTW: In general, I think that you are right, and that changing of PTEs
>> should not require taking mmap_lock for write. However, I am not sure
>> cow_user
> On Dec 21, 2020, at 1:24 PM, Yu Zhao wrote:
>
> On Mon, Dec 21, 2020 at 12:26:22PM -0800, Linus Torvalds wrote:
>> On Mon, Dec 21, 2020 at 12:23 PM Nadav Amit wrote:
>>> Using mmap_write_lock() was my initial fix and there was a strong pushback
>>> on this app
> On Dec 21, 2020, at 12:52 PM, Peter Xu wrote:
>
> On Mon, Dec 21, 2020 at 07:51:52PM +0000, Nadav Amit wrote:
>>> On Dec 21, 2020, at 11:28 AM, Peter Xu wrote:
>>>
>>> On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
>>>> From: Na
> On Dec 21, 2020, at 11:55 AM, Linus Torvalds
> wrote:
>
> On Mon, Dec 21, 2020 at 11:16 AM Yu Zhao wrote:
>> Nadav Amit found memory corruptions when running userfaultfd test above.
>> It seems to me the problem is related to commit 09854ba94c6a ("mm:
>>
> On Dec 21, 2020, at 11:28 AM, Peter Xu wrote:
>
> On Sat, Nov 28, 2020 at 04:45:38PM -0800, Nadav Amit wrote:
>> From: Nadav Amit
>>
>> When userfaultfd copy-ioctl fails since the PTE already exists, an
>> -EEXIST error is returned and the faulting th
> On Dec 21, 2020, at 9:27 AM, Peter Xu wrote:
>
> Hi, Nadav,
>
> On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
>
> [...]
>
>> So to correct myself, I think that what I really encountered was actually
>> during MM_CP_UFFD_WP_RESOLVE (i.e.,
> On Dec 20, 2020, at 9:25 PM, Nadav Amit wrote:
>
>> On Dec 20, 2020, at 9:12 PM, Yu Zhao wrote:
>>
>> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
>>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote:
>>>>
>>>>
> On Dec 20, 2020, at 9:12 PM, Yu Zhao wrote:
>
> On Sun, Dec 20, 2020 at 08:36:15PM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote:
>>>
>>> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
>>>>
> On Dec 19, 2020, at 6:20 PM, Andrea Arcangeli wrote:
>
> On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote:
>>>
>>> [ cc’ing some more people who have experience with similar problems ]
>>>
> On Dec 20, 2020, at 1:54 AM, Yu Zhao wrote:
>
> On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 10:05 PM, Yu Zhao wrote:
>>>
>>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
>>>> [ cc’ing s
> On Dec 19, 2020, at 10:05 PM, Yu Zhao wrote:
>
> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
>> [ cc’ing some more people who have experience with similar problems ]
>>
>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
>>>
>
> On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote:
>
> [ cc’ing some more people who have experience with similar problems ]
>
>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
>>
>> Hello,
>>
>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Na
[ cc’ing some more people who have experience with similar problems ]
> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
>
> Hello,
>
> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>> Analyzing this problem indicates that there is a real bug since
>>
From: Nadav Amit
Userfaultfd self-tests fail occasionally, indicating a memory
corruption.
Analyzing this problem indicates that there is a real bug since
mmap_lock is only taken for read in mwriteprotect_range(). This might
cause the TLBs to be in an inconsistent state due to the deferred
> On Dec 8, 2020, at 12:34 AM, Mike Rapoport wrote:
>
> On Sun, Dec 06, 2020 at 08:31:39PM -0800, Nadav Amit wrote:
>> Whenever I run into a non-standard and non-trivial synchronization algorithm
>> in the kernel (and elsewhere), I become very confused and concerned. I
Thanks for the detailed answer, Mike. Things are clearer in regard to your
intention.
> On Dec 6, 2020, at 1:37 AM, Mike Rapoport wrote:
>
> The uffd monitor should *know* what is the state of child's memory and
> without this patch it could only guess.
I see - so mmap_changing is not just abou
I am not very familiar with membarrier, but here are my 2 cents while trying
to answer your questions.
> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski wrote:
> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
>* from one thread in a proc
Hello Mike,
Regarding your (old) patch:
> On May 23, 2018, at 12:42 AM, Mike Rapoport wrote:
>
> If a process monitored with userfaultfd changes its memory mappings or
> forks() at the same time as uffd monitor fills the process memory with
> UFFDIO_COPY, the actual creation of page table entr
> On Nov 28, 2020, at 4:45 PM, Nadav Amit wrote:
>
> From: Nadav Amit
>
> Userfaultfd writes can now be used for copy/zeroing. When using iouring
> with userfaultfd, performing the copying/zeroing on the faulting thread
> instead of the handler/iouring thread has severa
> On Nov 28, 2020, at 4:45 PM, Nadav Amit wrote:
>
> From: Nadav Amit
>
> Allocating work-queue objects on the stack usually has negative
> performance side-effects. First, it is hard to ensure alignment to
> cache-lines without increasing the stack size. Second, it might
> On Nov 30, 2020, at 10:20 AM, Jens Axboe wrote:
>
> On 11/28/20 5:45 PM, Nadav Amit wrote:
>> From: Nadav Amit
>>
>> iouring with userfaultfd cannot currently be used with fixed buffers since
>> userfaultfd does not provide read_iter(). This is required to all
From: Nadav Amit
Userfaultfd writes can now be used for copy/zeroing. When using iouring
with userfaultfd, performing the copying/zeroing on the faulting thread
instead of the handler/iouring thread has several advantages:
(1) The data of the faulting thread will be available on the local
From: Nadav Amit
In order to use userfaultfd with io-uring, there are two options for
extensions: support userfaultfd ioctls or provide similar functionality
through the "write" interface. The latter approach seems more compelling
as it does not require io-uring changes, and keeps all
From: Nadav Amit
Use iov_iter for copy and zero ioctls. This is done in preparation to
support a write_iter() interface that would provide similar services as
UFFDIO_COPY/ZERO.
In the case of UFFDIO_ZERO, the iov_iter is not really used for any
purpose other than providing the length of the
From: Nadav Amit
Complete reads asynchronously to allow io_uring to complete reads
asynchronously.
Reads, which report page-faults and events, can only be performed
asynchronously if the read is performed into a kernel buffer, and
therefore guarantee that no page-fault would occur during the
From: Nadav Amit
Add tests to check the use of userfaultfd with iouring, "write"
interface of userfaultfd and with the "poll" feature of userfaultfd.
Enabling the tests is done through new test "modifiers": "poll", "write"
"iouring" t
From: Nadav Amit
iouring with userfaultfd cannot currently be used with fixed buffers since
userfaultfd does not provide read_iter(). This is required to allow
asynchronous (queued) reads from userfaultfd.
To support async-reads of userfaultfd provide read_iter() instead of
read().
Cc: Jens Axboe
From: Nadav Amit
copy_page_from_iter_iovec() cannot be used when preemption is enabled.
Change copy_page_from_iter_iovec() into __copy_page_from_iter_iovec()
with an additional parameter that says whether the caller runs in atomic
context. When __copy_page_from_iter_iovec() is used in an atomic
From: Nadav Amit
Add a feature UFFD_FEATURE_POLL that makes the faulting thread spin
while waiting for the page-fault to be handled.
Users of this feature would be wise to set the page-fault handling
thread on another physical CPU and to potentially ensure that there are
available cores to
From: Nadav Amit
Allocating work-queue objects on the stack usually has negative
performance side-effects. First, it is hard to ensure alignment to
cache-lines without increasing the stack size. Second, it might cause
false sharing. Third, it is more likely to encounter TLB misses as
objects are
From: Nadav Amit
It is possible to get an EINVAL error instead of EPERM if the following
test vm_flags have VM_UFFD_WP but do not have VM_MAYWRITE, as "ret" is
overwritten since commit cab350afcbc9 ("userfaultfd: hugetlbfs: allow
registration of ranges containing huge pages")
From: Nadav Amit
When userfaultfd copy-ioctl fails since the PTE already exists, an
-EEXIST error is returned and the faulting thread is not woken. The
current userfaultfd test does not wake the faulting thread in such case.
The assumption is presumably that another thread set the PTE through
From: Nadav Amit
Small refactoring to reduce the number of locations in which locks are
released in userfaultfd_ctx_read(), as this makes the understanding of
the code and its changes harder.
No functional change intended.
Cc: Jens Axboe
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Alexander Viro
From: Nadav Amit
While the overhead of userfaultfd is usually reasonable, this overhead
can still be prohibitive for low-latency backing storage, such as RDMA,
persistent memory or in-memory compression. In such cases the overhead
of scheduling and entering/exiting the kernel becomes dominant
From: Nadav Amit
Using io-uring with userfaultfd for reads can lead upon a fork event to
the installation of the userfaultfd file descriptor on the worker kernel
thread instead of the process that initiated the read. io-uring assumes
that no new file descriptors are installed during read.
As a
> On Nov 28, 2020, at 4:13 PM, Pavel Begunkov wrote:
>
> On 28/11/2020 23:59, Nadav Amit wrote:
>> Hello Pavel,
>>
>> I got the following lockdep splat while rebasing my work on 5.10-rc5 on the
>> kernel (based on 5.10-rc5+).
>>
>> I did not a
Hello Pavel,
I got the following lockdep splat while rebasing my work on 5.10-rc5 on the
kernel (based on 5.10-rc5+).
I did not actually confirm that the problem is triggered without my changes,
as my iouring workload requires some kernel changes (not iouring changes),
yet IMHO it seems pretty cl
> On Sep 2, 2020, at 9:56 AM, pet...@infradead.org wrote:
>
> On Wed, Sep 02, 2020 at 03:32:18PM +0000, Nadav Amit wrote:
>
>> Thanks for the pointer. I did not see the discussion, and embarrassingly, I have
>> also never figured out how to reply on lkml emails withou
> On Sep 2, 2020, at 5:54 AM, pet...@infradead.org wrote:
>
> On Tue, Sep 01, 2020 at 09:18:57AM -0700, Nadav Amit wrote:
>
>> Unless I misunderstand the logic, __force_order should also be used by
>> rdpkru() and wrpkru() which do not have dependency on __force_
From: Nadav Amit
The __force_order logic seems to be inverted. __force_order is
supposedly used to manipulate the compiler to use its memory
dependencies analysis to enforce orders between CR writes and reads.
Therefore, the memory should behave as a "CR": when the CR is read, the
mem
> On Jun 16, 2020, at 5:28 AM, Paolo Bonzini wrote:
>
> On 16/06/20 12:49, Thomas Huth wrote:
>> On 29/05/2020 09.43, Like Xu wrote:
>>> When the full-width writes capability is set, use the alternative MSR
>>> range to write larger sign counter values (up to GP counter width).
>>>
>>> Signed-of
> On May 20, 2020, at 6:17 AM, Wojciech Kudla wrote:
>
> Preliminary discussion:
> https://lkml.org/lkml/2020/5/13/1327
> On Sep 7, 2019, at 9:47 PM, Greg Kroah-Hartman
> wrote:
>
> On Tue, Aug 20, 2019 at 01:26:38PM -0700, Nadav Amit wrote:
>> Francois reported that VMware balloon gets stuck after a balloon reset,
>> when the VMCI doorbell is removed. A similar error can occur when
> On Sep 4, 2019, at 5:18 PM, Nick Desaulniers wrote:
>
> On Fri, Aug 30, 2019 at 4:15 PM Rasmus Villemoes
> wrote:
>> This adds an asm_inline macro which expands to "asm inline" [1] when gcc
>> is new enough (>= 9.1), and just asm for older gccs and other
>> compilers.
>>
>> Using asm inline("
> On Sep 3, 2019, at 8:17 AM, Dave Hansen wrote:
>
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> n_pages concurrent +deferred-pti change
>> --- -- - --
>> 121191
> On Aug 29, 2019, at 11:15 AM, Linus Torvalds
> wrote:
>
> On Thu, Aug 29, 2019 at 10:36 AM Nick Desaulniers
> wrote:
>> I'm curious what "the size of the asm" means, and how it differs
>> precisely from "how many instructions GCC thinks it is." I would
>> think those are one and the same? O
> On Aug 27, 2019, at 5:30 PM, Andy Lutomirski wrote:
>
> On Tue, Aug 27, 2019 at 4:52 PM Nadav Amit wrote:
>>> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski wrote:
>>>
>>> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit wrote:
>>>> INVPCID is cons