[RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
   ongoing deferred TLB flush and flushing the entire range eventually
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. By setting a flag to indicate that a deferred TLB flush operation took
   place, and flushing later when needed (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
at a later time.

Admittedly, I do not understand how things are not broken today, which
makes me wary of adding further batching before getting things in order.
For instance, why is it ok for zap_pte_range() to batch dirty-PTE flushes
for each page-table (but not at a greater granularity)? Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if the PTE permission is not demoted (see the sketch
   after this list).

3. Clean up mmu_gather to be less arch-dependent.

4. Use the mm's generations to track, at a finer granularity (either
   per-VMA or per page-table), whether a pending mmu_gather operation is
   outstanding. This should make it possible to avoid some TLB flushes
   when KSM or memory reclamation takes place while another operation
   such as munmap() or mprotect() is running.

5. Change the try_to_unmap_one() flushing scheme, as the current one seems
   broken: track in a bitmap which CPUs have outstanding TLB flushes
   instead of using a single flag.
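
As an illustration of (2), here is a minimal sketch of the kind of check it
implies, using the generic PTE helpers. The helper name below is made up,
and the actual patch ("mm/mprotect: do not flush on permission promotion")
may use different or additional checks:

/*
 * Sketch only: decide whether changing a PTE from oldpte to newpte may
 * skip the TLB flush. A flush is needed only when rights are removed or
 * when dirty/accessed state that the TLB may hold would be lost.
 */
static inline bool prot_change_needs_flush(pte_t oldpte, pte_t newpte)
{
	/* A non-present PTE cannot be cached by the TLB. */
	if (!pte_present(oldpte))
		return false;

	/* Unmapping or making the page inaccessible demotes permissions. */
	if (!pte_present(newpte))
		return true;

	/* Removing write permission demotes permissions. */
	if (pte_write(oldpte) && !pte_write(newpte))
		return true;

	/* Losing dirty or accessed state also requires a flush. */
	if ((pte_dirty(oldpte) && !pte_dirty(newpte)) ||
	    (pte_young(oldpte) && !pte_young(newpte)))
		return true;

	/* Pure permission promotion (e.g., RO -> RW) may skip the flush. */
	return false;
}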

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were only very lightly tested. I am looking forward to your
feedback on the overall approach, and on whether to split the series into
multiple patch-sets.

Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: linux-c...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: Mel Gorman 
Cc: Nick Piggin 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: x...@kernel.org
Cc: Yu Zhao 


Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h |   4 +-
 arch/arm/include/asm/pgtable.h|   4 +-
 arch/arm64/include/asm/pgtable.h  |   4 +-
 arch/csky/Kconfig |   1 +
 arch/csky/include/asm/tlb.h   |  12 --
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/tlb.h|   2 -
 arch/s390/Kconfig |   1 +
 arch/s390/include/asm/tlb.h   |   3 -
 arch/sparc/Kconfig|   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h   |   2 -
 arch/sparc/mm/init_64.c   |   2 +-
 arch/x86/Kconfig  |   3 +
 arch/x86/hyperv/mmu.c |   2 +-
 arch/x86/include/asm/mmu.h|  10 -
 arch/x86/include/asm/mmu_context.h|   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h|  24 +--
 arch/x86/include/asm/tlb.h|  21 +-
 arch/x86/include/asm/tlbbatch.h   |  15 --
 arch/x86/include/asm/tlbflush.h   |  61 --
 arch/x86/mm/tlb.c |  52 +++--
 arch/x86/xen/mmu_pv.c |   2 +-
 drivers/firmware/efi/efi.c|   1 +
 fs/proc/task_mmu.c|  29 ++-
 include/asm-generic/bitops/find.h |   8 +-
 include/asm-generic/tlb.h

[RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-01-30 Thread Nadav Amit
From: Nadav Amit 

Architecture-specific tlb_start_vma() and tlb_end_vma() seem
unnecessary. They are currently used to:

1. Avoid per-VMA TLB flushes. This can be determined by introducing
   a new config option.

2. Avoid saving information on the vma that is being flushed. Saving
   this information, even for architectures that do not need it, is
   cheap and we will need it for per-VMA deferred TLB flushing.

3. Avoid calling flush_cache_range().

Remove the architecture-specific tlb_start_vma() and tlb_end_vma() in
the following manner, corresponding to the previous requirements (a rough
sketch of the resulting generic helpers appears below):

1. Introduce a new config option -
   ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to
   define whether they want aggressive TLB flush batching (instead of
   flushing mappings of each VMA separately).

2. Save information on the vma regardless of architecture. Saving this
   information should have negligible overhead, and it will be needed
   for fine-granularity TLB flushes.

3. flush_cache_range() is anyhow not defined for the architectures that
   implement tlb_start/end_vma().

No functional change intended.
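
Below is a rough sketch, not the literal hunk from this patch, of how the
generic helpers could end up looking once the per-arch overrides are gone,
with the per-VMA flush compiled out for architectures that select the new
option:

static inline void tlb_start_vma(struct mmu_gather *tlb,
				 struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

	/* Save VM_EXEC/VM_HUGETLB etc. regardless of architecture. */
	tlb_update_vma_flags(tlb, vma);
#ifndef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
	flush_cache_range(vma, vma->vm_start, vma->vm_end);
#endif
}

static inline void tlb_end_vma(struct mmu_gather *tlb,
			       struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

#ifndef CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
	/*
	 * Without aggressive batching, flush the range covered by this
	 * VMA before moving on to the next one.
	 */
	tlb_flush_mmu_tlbonly(tlb);
#endif
}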

Signed-off-by: Nadav Amit 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Will Deacon 
Cc: Yu Zhao 
Cc: Nick Piggin 
Cc: linux-c...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: x...@kernel.org
---
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/tlb.h | 12 
 arch/powerpc/Kconfig|  1 +
 arch/powerpc/include/asm/tlb.h  |  2 --
 arch/s390/Kconfig   |  1 +
 arch/s390/include/asm/tlb.h |  3 ---
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/tlb_64.h |  2 --
 arch/x86/Kconfig|  1 +
 arch/x86/include/asm/tlb.h  |  3 ---
 include/asm-generic/tlb.h   | 15 +--
 init/Kconfig|  8 
 12 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 89dd2fcf38fa..924ff5721240 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,7 @@ config CSKY
select ARCH_HAS_SYNC_DMA_FOR_DEVICE
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2
+   select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
select ARCH_WANT_FRAME_POINTERS if !CPU_CK610
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select COMMON_CLK
diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h
index fdff9b8d70c8..8130a5f09a6b 100644
--- a/arch/csky/include/asm/tlb.h
+++ b/arch/csky/include/asm/tlb.h
@@ -6,18 +6,6 @@
 
 #include 
 
-#define tlb_start_vma(tlb, vma) \
-   do { \
-   if (!(tlb)->fullmm) \
-   flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \
-   }  while (0)
-
-#define tlb_end_vma(tlb, vma) \
-   do { \
-   if (!(tlb)->fullmm) \
-   flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \
-   }  while (0)
-
 #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
 
 #include 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..d9761b6f192a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
+   select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..880b7daf904e 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,8 +19,6 @@
 
 #include 
 
-#define tlb_start_vma(tlb, vma)do { } while (0)
-#define tlb_end_vma(tlb, vma)  do { } while (0)
 #define __tlb_remove_tlb_entry __tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c72874f09741..5b3dc5ca9873 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
+   select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_IPC_PARSE_VERSION
select BUILDTIME_TABLE_SORT
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 954fa8ca6cbd..03f31d59f97c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
   

Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski  wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>> From: Nadav Amit 
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.


Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Nadav Amit
> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit 
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
by performing set_page_dirty() for the batched pages when needed in
tlb_finish_mmu() [ i.e., by marking for each batched page whether
set_page_dirty() should be issued for that page while collecting them ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 
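
To make this concrete, a purely hypothetical sketch of such a split (the
names are made up and this is not part of the patch-set):

/*
 * Hypothetical: a small structure that only tracks what needs to be
 * flushed, embedded in the full mmu_gather that also batches the pages
 * to be freed.
 */
struct mmu_flush_gather {
	struct mm_struct	*mm;
	unsigned long		start;
	unsigned long		end;
	unsigned int		freed_tables : 1;
	unsigned int		cleared_ptes : 1;
};

struct mmu_gather {
	struct mmu_flush_gather	flush;		/* enough for mprotect-like users */
	struct mmu_gather_batch	*active;	/* page batching for unmap paths */
	struct mmu_gather_batch	local;
	struct page		*__pages[MMU_GATHER_BUNDLE];
	unsigned int		batch_count;
};

Paths that only change protections could then operate on the small structure
alone, while the zap/unmap paths keep using the full one.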

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include them in
the “fixes” section. I will restructure the patch-set according to the
feedback.

Thanks,
Nadav

Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-31 Thread Nadav Amit
> On Jan 30, 2021, at 11:57 PM, Nadav Amit  wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin  wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit 
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no, we cannot do so, since the buffers might be
removed from the page at that point.



Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-02 Thread Nadav Amit
> On Feb 1, 2021, at 10:41 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Peter Zijlstra's message of February 1, 2021 10:09 pm:
>> I also don't think AGRESSIVE_FLUSH_BATCHING quite captures what it does.
>> How about:
>> 
>>  CONFIG_MMU_GATHER_NO_PER_VMA_FLUSH
> 
> Yes please, have to have descriptive names.

Point taken. I will fix it.

> 
> I didn't quite see why this was much of an improvement though. Maybe 
> follow up patches take advantage of it? I didn't see how they all fit 
> together.

They do, but, as I said in other emails, I realized that I have a serious bug
in the deferred invalidation scheme.

Having said that, I think there is an advantage of having an explicit config
option instead of relying on whether tlb_end_vma is defined. For instance,
Arm does not define tlb_end_vma, and consequently it flushes the TLB after
each VMA. I suspect it is not intentional.



Re: [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()

2021-02-02 Thread Nadav Amit
> On Feb 2, 2021, at 1:31 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 02, 2021 at 07:20:55AM +0000, Nadav Amit wrote:
>> Arm does not define tlb_end_vma, and consequently it flushes the TLB after
>> each VMA. I suspect it is not intentional.
> 
> ARM is one of those that look at the VM_EXEC bit to explicitly flush
> ITLB IIRC, so it has to.

Hmm… I don’t think Arm is doing that. At least arm64 does not use the
default tlb_flush(), and it does not seem to consider VM_EXEC (at least in
this path):

static inline void tlb_flush(struct mmu_gather *tlb)
{
struct vm_area_struct vma = TLB_FLUSH_VMA(tlb->mm, 0);
bool last_level = !tlb->freed_tables;
unsigned long stride = tlb_get_unmap_size(tlb);
int tlb_level = tlb_get_level(tlb);

/*
 * If we're tearing down the address space then we only care about
 * invalidating the walk-cache, since the ASID allocator won't
 * reallocate our ASID without invalidating the entire TLB.
 */
if (tlb->mm_exiting) {
if (!last_level)
flush_tlb_mm(tlb->mm);
return;
}   

__flush_tlb_range(&vma, tlb->start, tlb->end, stride,
  last_level, tlb_level);
}

Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-16 Thread Nadav Amit



> On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> 
> @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> struct load_info *info)
>   add_kallsyms(mod, info);
> 
>   /* Arch-specific module finalizing. */
> - return module_finalize(info->hdr, info->sechdrs, mod);
> + ret = module_finalize(info->hdr, info->sechdrs, mod);
> + if (ret)
> + return ret;
> +
> + for_each_mod_mem_type(type) {
> + struct module_memory *mem = &mod->mem[type];
> +
> + if (mem->is_rox) {
> + if (!execmem_update_copy(mem->base, mem->rw_copy,
> +  mem->size))
> + return -ENOMEM;
> +
> + vfree(mem->rw_copy);
> + mem->rw_copy = NULL;
> + }
> + }
> +
> + return 0;
> }

I might be missing something, but it seems a bit racy.

IIUC, module_finalize() calls alternatives_smp_module_add(). At this
point, since you don’t hold the text_mutex, someone might do text_poke(),
e.g., by enabling/disabling static-key, and the update would be
overwritten. No?

Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-18 Thread Nadav Amit



> On 18 Apr 2024, at 13:20, Mike Rapoport  wrote:
> 
> On Tue, Apr 16, 2024 at 12:36:08PM +0300, Nadav Amit wrote:
>> 
>> 
>> 
>> I might be missing something, but it seems a bit racy.
>> 
>> IIUC, module_finalize() calls alternatives_smp_module_add(). At this
>> point, since you don’t hold the text_mutex, some might do text_poke(),
>> e.g., by enabling/disabling static-key, and the update would be
>> overwritten. No?
> 
> Right :(
> Even worse, for UP case alternatives_smp_unlock() will "patch" still empty
> area.
> 
> So I'm thinking about calling alternatives_smp_module_add() from an
> additional callback after the execmem_update_copy().
> 
> Does it make sense to you?

Going over the code again - I might have just been wrong: I confused the
alternatives and the jump-label mechanisms (as they do share a lot of
code and characteristics).

The jump-labels are updated when prepare_coming_module() is called, which
happens after post_relocation() [which means they would be updated using
text_poke() “inefficiently” but should be safe].

The “alternatives” appear to use text_poke() (as opposed to
text_poke_early()) only from a few very specific flows, e.g.,
common_cpu_up() -> alternatives_enable_smp().

Do those flows pose a problem after boot?

Anyhow, sorry for the noise.

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-01 Thread Nadav Amit



> On Jun 1, 2023, at 1:50 PM, Edgecombe, Rick P  
> wrote:
> 
> On Thu, 2023-06-01 at 14:38 -0400, Kent Overstreet wrote:
>> On Thu, Jun 01, 2023 at 06:13:44PM +, Edgecombe, Rick P wrote:
 text_poke() _does_ create a separate RW mapping.
>>> 
>>> Sorry, I meant a separate RW allocation.
>> 
>> Ah yes, that makes sense
>> 
>> 
>>> 
 
 The thing that sucks about text_poke() is that it always does a
 full
 TLB
 flush, and AFAICT that's not remotely needed. What it really
 wants to
 be
 doing is conceptually just
 
 kmap_local()
 mempcy()
 kunmap_loca()
 flush_icache();
 
 ...except that kmap_local() won't actually create a new mapping
 on
 non-highmem architectures, so text_poke() open codes it.
>>> 
>>> Text poke creates only a local CPU RW mapping. It's more secure
>>> because
>>> other threads can't write to it.
>> 
>> *nod*, same as kmap_local
> 
> It's only used and flushed locally, but it is accessible to all CPU's,
> right?
> 
>> 
>>> It also only needs to flush the local core when it's done since
>>> it's
>>> not using a shared MM.
>>  
>> Ahh! Thanks for that; perhaps the comment in text_poke() about IPIs
>> could be a bit clearer.
>> 
>> What is it (if anything) you don't like about text_poke() then? It
>> looks
>> like it's doing broadly similar things to kmap_local(), so should be
>> in the same ballpark from a performance POV?
> 
> The way text_poke() is used here, it is creating a new writable alias
> and flushing it for *each* write to the module (like for each write of
> an individual relocation, etc). I was just thinking it might warrant
> some batching or something.

I am not advocating to do so, but if you want to have many efficient
writes, perhaps you can just disable CR0.WP. Just saying that if you
are about to write all over the memory, text_poke() does not provide
too much security for the poking thread.



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Nadav Amit



> On Jun 5, 2023, at 9:10 AM, Edgecombe, Rick P  
> wrote:
> 
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
>> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
>>> On Thu, 1 Jun 2023 16:54:36 -0700
>>> Nadav Amit  wrote:
>>> 
>>>>> The way text_poke() is used here, it is creating a new writable
>>>>> alias
>>>>> and flushing it for *each* write to the module (like for each
>>>>> write of
>>>>> an individual relocation, etc). I was just thinking it might
>>>>> warrant
>>>>> some batching or something.  
>> 
>>>> I am not advocating to do so, but if you want to have many
>>>> efficient
>>>> writes, perhaps you can just disable CR0.WP. Just saying that if
>>>> you
>>>> are about to write all over the memory, text_poke() does not
>>>> provide
>>>> too much security for the poking thread.
>> 
>> Heh, this is definitely and easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up then to be made
> fully acceptable. It does sound much more efficient though.

Thanks for highlighting this issue. I understand the limitations of
CR0.WP. There is also always the concern that without CET or another
control-flow integrity mechanism, someone would abuse (using ROP/JOP)
functions that clear CR0.WP…



Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-19 Thread Nadav Amit



> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski  wrote:
> 
> But jit_text_alloc() can't do this, because the order of operations doesn't 
> match.  With jit_text_alloc(), the executable mapping shows up before the 
> text is populated, so there is no atomic change from not-there to 
> populated-and-executable.  Which means that there is an opportunity for CPUs, 
> speculatively or otherwise, to start filling various caches with intermediate 
> states of the text, which means that various architectures (even x86!) may 
> need serialization.
> 
> For eBPF- and module- like use cases, where JITting/code gen is quite 
> coarse-grained, perhaps something vaguely like:
> 
> jit_text_alloc() -> returns a handle and an executable virtual address, but 
> does *not* map it there
> jit_text_write() -> write to that handle
> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I 
> think)

Andy, would you mind explaining why you think a sync is not needed? I mean I 
have a “feeling” that perhaps TSO can guarantee something based on the order of 
write and page-table update. Is that the argument?

On this regard, one thing that I clearly do not understand is why *today* it is 
ok for users of bpf_arch_text_copy() not to call text_poke_sync(). Am I missing 
something?



Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-21 Thread Nadav Amit
> 
> On Jun 20, 2023, at 7:46 AM, Yair Podemsky  wrote:
> 
> @@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, 
> struct vm_area_struct *v
>   addr + HPAGE_PMD_SIZE);
>   mmu_notifier_invalidate_range_start(&range);
>   pmd = pmdp_collapse_flush(vma, addr, pmdp);
> - tlb_remove_table_sync_one();
> + tlb_remove_table_sync_one(mm);

Can’t pmdp_collapse_flush() have one additional argument “freed_tables”
that it would propagate, for instance on x86 to flush_tlb_mm_range() ?
Then you would not need tlb_remove_table_sync_one() to issue an additional
IPI, no?

It just seems that you might still have 2 IPIs in many cases instead of
one, and unless I am missing something, I don’t see why.
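
A purely hypothetical sketch of what I mean (this helper and the extra hint
do not exist upstream; they only illustrate propagating "freed_tables" to
the arch flush):

/*
 * Hypothetical: let the collapse path tell the architecture that a page
 * table is being freed, so that the flush IPI also invalidates the
 * paging-structure caches and reaches lazy-TLB CPUs, making the extra
 * tlb_remove_table_sync_one() IPI unnecessary.
 */
static inline pmd_t pmdp_collapse_flush_freed(struct vm_area_struct *vma,
					      unsigned long address, pmd_t *pmdp)
{
	pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);

	/* On x86, freed_tables=true makes the IPI reach lazy CPUs as well. */
	flush_tlb_mm_range(vma->vm_mm, address, address + HPAGE_PMD_SIZE,
			   PAGE_SHIFT, true);
	return pmd;
}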



Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-18 Thread Nadav Amit



> On Jan 18, 2023, at 12:00 AM, Nicholas Piggin  wrote:
> 
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> + struct mm_struct *mm = arg;
> +
> + if (current->active_mm == mm) {
> + WARN_ON_ONCE(current->mm);
> + current->active_mm = &init_mm;
> + switch_mm(mm, &init_mm, current);
> + }
> +}

I might be out of touch - doesn’t a flush already take place when we free
the page-tables, at least in common cases on x86?

IIUC exit_mmap() would free page-tables, and whenever page-tables are
freed, on x86, we do a shootdown regardless of whether the target CPU's TLB
state marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off()
and everything should be fine, no?

[ I understand you care about powerpc, just wondering on the effect on x86 ]
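
For reference, this is roughly the x86 behavior I am referring to,
paraphrased from flush_tlb_func() in arch/x86/mm/tlb.c (the exact form
varies between kernel versions):

	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
		/*
		 * A lazy-TLB CPU that receives the flush IPI just switches
		 * to init_mm instead of flushing, i.e., it leaves lazy mode.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}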



Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable

2023-01-22 Thread Nadav Amit




On 1/18/23 10:00 AM, Nicholas Piggin wrote:

Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin 
---
  Documentation/mm/active_mm.rst |  6 ++
  arch/Kconfig   | 17 +
  include/linux/sched/mm.h   | 18 +++---
  kernel/sched/core.c| 22 ++
  kernel/sched/sched.h   |  4 +++-
  5 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index 6f8269c284ed..2b0d08332400 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -4,6 +4,12 @@
  Active MM
  =
  
+Note, the mm_count refcount may no longer include the "lazy" users

+(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
+with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
+references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers which abstracts this config option.
+
  ::
  
   List:   linux-kernel

diff --git a/arch/Kconfig b/arch/Kconfig
index 12e3ddabac9d..b07d36f08fea 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -465,6 +465,23 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  irqs disabled over activate_mm. Architectures that do IPI based TLB
  shootdowns should enable this.
  
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.

+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example.
+#
+# To implement this, an arch *must*:
+# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when dropping the
+# lazy reference of a kthread's ->active_mm (non-arch code has been converted
+# already).
+config MMU_LAZY_TLB_REFCOUNT
+   def_bool y
+
  config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
  
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h

index 5376caf6fcf3..68bbe8d90c2e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -82,17 +82,29 @@ static inline void mmdrop_sched(struct mm_struct *mm)
  /* Helpers for lazy TLB mm refcounting */
  static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
  {
-   mmgrab(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+   mmgrab(mm);
  }
  
  static inline void mmdrop_lazy_tlb(struct mm_struct *mm)

  {
-   mmdrop(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+   mmdrop(mm);
+   } else {
+   /*
+* mmdrop_lazy_tlb must provide a full memory barrier, see the
+* membarrier comment finish_task_switch which relies on this.
+*/
+   smp_mb();
+   }
  }


Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various 
locations in which smp_mb() was not required, this comment might be 
confusing. IOW, in most cases where mmdrop_lazy_tlb() replaced mmdrop(),
this comment was irrelevant, and it therefore becomes confusing.


I am not sure that including the smp_mb() here instead of "open-coding" it
helps.


  
  static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)

  {
-   mmdrop_sched(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+   mmdrop_sched(mm);
+   else
+   smp_mb(); // see above
  }


Wrong style of comment.


Re: [PATCH v6 2/5] lazy tlb: allow lazy tlb mm refcounting to be configurable

2023-01-23 Thread Nadav Amit




On 1/23/23 9:35 AM, Nadav Amit wrote:

+    if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+    mmdrop(mm);
+    } else {
+    /*
+ * mmdrop_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment finish_task_switch which relies on this.
+ */
+    smp_mb();
+    }
  }


Considering the fact that mmdrop_lazy_tlb() replaced mmdrop() in various 
locations in which smp_mb() was not required, this comment might be 
confusing. IOW, for the cases in most cases where mmdrop_lazy_tlb() 
replaced mmdrop(), this comment was irrelevant, and therefore it now 
becomes confusing.


I am not sure the include the smp_mb() here instead of "open-coding" it 
helps.
I think that I now understand why you do need the smp_mb() here, so 
ignore my comment.


Re: [PATCH v6 3/5] lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme

2023-01-23 Thread Nadav Amit




On 1/19/23 6:22 AM, Nicholas Piggin wrote:

On Thu Jan 19, 2023 at 8:22 AM AEST, Nadav Amit wrote:




On Jan 18, 2023, at 12:00 AM, Nicholas Piggin  wrote:

+static void do_shoot_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   if (current->active_mm == mm) {
+   WARN_ON_ONCE(current->mm);
+   current->active_mm = &init_mm;
+   switch_mm(mm, &init_mm, current);
+   }
+}


I might be out of touch - doesn’t a flush already take place when we free
the page-tables, at least on common cases on x86?

IIUC exit_mmap() would free page-tables, and whenever page-tables are
freed, on x86, we do shootdown regardless to whether the target CPU TLB state
marks is_lazy. Then, flush_tlb_func() should call switch_mm_irqs_off() and
everything should be fine, no?

[ I understand you care about powerpc, just wondering on the effect on x86 ]


Now I come to think of it, Rik had done this for x86 a while back.

https://lore.kernel.org/all/20180728215357.3249-10-r...@surriel.com/

I didn't know about it when I wrote this, so I never dug into why it
didn't get merged. It might have missed the final __mmdrop races but
I'm not not sure, x86 lazy tlb mode is too complicated to know at a
glance. I would check with him though.


My point was that naturally (i.e., as done today), when exit_mmap() is 
done, you release the page tables (not just the pages). On x86 it means 
that you also send shootdown IPI to all the *lazy* CPUs to perform a 
flush, so they would exit the lazy mode.


[ this should be true for 99% of the cases, excluding cases where there
  were no page-tables, for instance ]

So Rik's patch, I think, does not help in the common cases, although it
may perhaps make implicit actions more explicit in the code.


Re: [PATCH 16/16] mm: pass get_user_pages_fast iterator arguments in a structure

2019-06-11 Thread Nadav Amit
> On Jun 11, 2019, at 5:52 PM, Nicholas Piggin  wrote:
> 
> Christoph Hellwig's on June 12, 2019 12:41 am:
>> Instead of passing a set of always repeated arguments down the
>> get_user_pages_fast iterators, create a struct gup_args to hold them and
>> pass that by reference.  This leads to an over 100 byte .text size
>> reduction for x86-64.
> 
> What does this do for performance? I've found this pattern can be
> bad for store aliasing detection.

Note that sometimes such an optimization can also have an adverse effect due
to the stack-protector code that gcc emits when you use such structs.

Matthew Wilcox encountered such a case:
https://patchwork.kernel.org/patch/10702741/


Re: [RFC PATCH 1/3] Revert "mm: always flush VMA ranges affected by zap_page_range"

2018-06-12 Thread Nadav Amit
at 12:16 AM, Nicholas Piggin  wrote:

> This reverts commit 4647706ebeee6e50f7b9f922b095f4ec94d581c3.
> 
> Patch 99baac21e4585 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss
> problem") provides a superset of the TLB flush coverage of this
> commit, and even includes in the changelog "this patch supersedes
> 'mm: Always flush VMA ranges affected by zap_page_range v2'".
> 
> Reverting this avoids double flushing the TLB range, and the less
> efficient flush_tlb_range() call (the mmu_gather API is more precise
> about what ranges it invalidates).
> 
> Signed-off-by: Nicholas Piggin 
> ---
> mm/memory.c | 14 +-
> 1 file changed, 1 insertion(+), 13 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7206a634270b..9d472e00fc2d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1603,20 +1603,8 @@ void zap_page_range(struct vm_area_struct *vma, 
> unsigned long start,
>   tlb_gather_mmu(&tlb, mm, start, end);
>   update_hiwater_rss(mm);
>   mmu_notifier_invalidate_range_start(mm, start, end);
> - for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
> + for ( ; vma && vma->vm_start < end; vma = vma->vm_next)
>   unmap_single_vma(&tlb, vma, start, end, NULL);
> -
> - /*
> -  * zap_page_range does not specify whether mmap_sem should be
> -  * held for read or write. That allows parallel zap_page_range
> -  * operations to unmap a PTE and defer a flush meaning that
> -  * this call observes pte_none and fails to flush the TLB.
> -  * Rather than adding a complex API, ensure that no stale
> -  * TLB entries exist when this call returns.
> -  */
> - flush_tlb_range(vma, start, end);
> - }
> -
>   mmu_notifier_invalidate_range_end(mm, start, end);
>   tlb_finish_mmu(&tlb, start, end);
> }

Yes, this was in my “to check when I have time” todo list, especially since
the flush was from start to end, not even vma->vm_start to vma->vm_end.

The revert seems correct.

Reviewed-by: Nadav Amit 



Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-15 Thread Nadav Amit



> On Sep 14, 2022, at 11:42 PM, Barry Song <21cn...@gmail.com> wrote:
> 
>> 
>> The very idea behind TLB deferral is the opportunity it (might) provide
>> to accumulate address ranges and cpu masks so that individual TLB flush
>> can be replaced with a more cost effective range based TLB flush. Hence
>> I guess unless address range or cpumask based cost effective TLB flush
>> is available, deferral does not improve the unmap performance as much.
> 
> 
> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> from all cpus in the system, tlbi is not scalable. The point here is that we
> avoid waiting for each individual TLBi. Alternatively, they are batched. If
> you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.

Just a minor correction: arch_tlbbatch_flush() does not collect ranges.
On x86 it only accumulates a CPU mask.
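
For reference, the x86 batch structure is just a CPU mask (see
arch/x86/include/asm/tlbbatch.h):

struct arch_tlbflush_unmap_batch {
	/*
	 * Each bit set is a CPU that potentially has a TLB entry for one of
	 * the PFNs being flushed.
	 */
	struct cpumask cpumask;
};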



Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

2022-09-21 Thread Nadav Amit
On Sep 20, 2022, at 11:53 PM, Anshuman Khandual  
wrote:

> On 8/22/22 13:51, Yicong Yang wrote:
>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
>> *batch,
>> + struct mm_struct *mm,
>> + unsigned long uaddr)
>> +{
>> + __flush_tlb_page_nosync(mm, uaddr);
>> +}
>> +
>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
>> *batch)
>> +{
>> + dsb(ish);
>> +}
> 
> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
> TLB invalidation requests on a given mm and try to generate a range based TLB
> invalidation such as flush_tlb_range().
> 
> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
> later be flushed in subsequent arch_tlbbatch_flush() ?
> 
> OR
> 
> It might not be worth the effort and complexity, in comparison to performance
> improvement, TLB range flush brings in ?

So here are my 2 cents, based on my experience with Intel-x86. It is likely
different on arm64, but perhaps it can provide you some insight into what
parameters you should measure and consider.

In general there is a tradeoff between full TLB flushes and entry-specific
ones. Flushing specific entries takes more time than flushing the entire
TLB, but saves TLB refills.

Dave Hansen made some calculations in the past and came up with 33 as a
magic cutoff number, i.e., if you need to flush more than 33 entries, just
flush the entire TLB. I am not sure that this exact number is very
meaningful, since one might argue that it should’ve taken PTI into account
(which might require twice as many TLB invalidations).
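
For reference, a simplified sketch of how that cutoff is applied on x86;
the real logic lives in the flush path in arch/x86/mm/tlb.c, and 33 is the
default of the tunable tlb_single_page_flush_ceiling:

/* Sketch only: decide between per-page invalidation and a full flush. */
static bool flush_entire_tlb(unsigned long start, unsigned long end,
			     unsigned int stride_shift)
{
	unsigned long nr_invalidations = (end - start) >> stride_shift;

	return nr_invalidations > 33;	/* tlb_single_page_flush_ceiling */
}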

Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track ranges,
but the question is whether you would actually succeed in forming continuous
ranges that are eventually (on x86) smaller than the full TLB flush cutoff
(=33). Questionable (perhaps better with MGLRU?).

Then, you should remember that tracking should be very efficient, since even
a few cache misses might have a greater cost than what you save by
selective flushing. Finally, on x86 you would need to invoke the smp/IPI
layer multiple times to send different cores the relevant range they need to
flush.

IOW: it is somewhat complicated to implement efficiently, and on x86, and
probably other IPI-based TLB shootdown systems, it does not have a clear
performance benefit (IMHO).



Re: [PATCH v1 4/6] mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite

2022-11-02 Thread Nadav Amit
On Nov 2, 2022, at 12:12 PM, David Hildenbrand  wrote:

> commit b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> NUMA hinting fault") added remembering write permissions using ordinary
> pte_write() for PROT_NONE mapped pages to avoid write faults when
> remapping the page !PROT_NONE on NUMA hinting faults.
> 

[ snip ]

Here’s a very shallow reviewed with some minor points...

> ---
> include/linux/mm.h |  2 ++
> mm/huge_memory.c   | 28 +---
> mm/ksm.c   |  9 -
> mm/memory.c| 19 ---
> mm/mprotect.c  |  7 ++-
> 5 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 25ff9a14a777..a0deeece5e87 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1975,6 +1975,8 @@ extern unsigned long move_page_tables(struct 
> vm_area_struct *vma,
> #define  MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>MM_CP_UFFD_WP_RESOLVE)
> 
> +bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
> +pte_t pte);

It might not be customary, but how about marking it as __pure?

> extern unsigned long change_protection(struct mmu_gather *tlb,
>  struct vm_area_struct *vma, unsigned long start,
>  unsigned long end, pgprot_t newprot,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2ad68e91896a..45abd27d75a0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1462,8 +1462,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
>int page_nid = NUMA_NO_NODE;
>int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> -   bool migrated = false;
> -   bool was_writable = pmd_savedwrite(oldpmd);
> +   bool try_change_writable, migrated = false;
>int flags = 0;
> 
>vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1472,13 +1471,22 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>goto out;
>}
> 
> +   /* See mprotect_fixup(). */
> +   if (vma->vm_flags & VM_SHARED)
> +   try_change_writable = vma_wants_writenotify(vma, 
> vma->vm_page_prot);
> +   else
> +   try_change_writable = !!(vma->vm_flags & VM_WRITE);

Do you find it better to copy the code instead of extracting it to a
separate function?

> +
>pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>page = vm_normal_page_pmd(vma, haddr, pmd);
>if (!page)
>goto out_map;
> 
>/* See similar comment in do_numa_page for explanation */
> -   if (!was_writable)
> +   if (try_change_writable && !pmd_write(pmd) &&
> +can_change_pmd_writable(vma, vmf->address, pmd))
> +   pmd = pmd_mkwrite(pmd);
> +   if (!pmd_write(pmd))
>flags |= TNF_NO_GROUP;
> 
>page_nid = page_to_nid(page);
> @@ -1523,8 +1531,12 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
>/* Restore the PMD */
>pmd = pmd_modify(oldpmd, vma->vm_page_prot);
>pmd = pmd_mkyoung(pmd);
> -   if (was_writable)
> +
> +   /* Similar to mprotect() protection updates, avoid write faults. */
> +   if (try_change_writable && !pmd_write(pmd) &&
> +can_change_pmd_writable(vma, vmf->address, pmd))

Why do I have a deja-vu? :)

There must be a way to avoid the redundant code and specifically the call to
can_change_pmd_writable(), no?

>pmd = pmd_mkwrite(pmd);
> +
>set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
>update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
>spin_unlock(vmf->ptl);
> @@ -1764,11 +1776,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>struct mm_struct *mm = vma->vm_mm;
>spinlock_t *ptl;
>pmd_t oldpmd, entry;
> -   bool preserve_write;
> -   int ret;
>bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +   int ret = 1;
> 
>tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> 
> @@ -1779,9 +1790,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>if (!ptl)
>return 0;
> 
> -   preserve_write = prot_numa && pmd_write(*pmd);
> -   ret = 1;
> -
> #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
>if (is_swap_pmd(*pmd)) {
>swp_entry_t entry = pmd_to_swp_entry(*pmd);
> @@ -1861,8 +1869,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct 
> vm_area_struct *vma,
>oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
> 
>entry = pmd_modify(oldpmd, newprot);
> -   if (preserve_write)
> -   entry = pmd_mk_savedwrite(entry);
>if (uff

Re: [PATCH v6 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-11-15 Thread Nadav Amit
On Nov 14, 2022, at 7:14 PM, Yicong Yang  wrote:

> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 8a497d902c16..5bd78ae55cd4 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
> }
> 
> static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
> *batch,
> - struct mm_struct *mm)
> + struct mm_struct *mm,
> + unsigned long uaddr)

Logic-wise it looks fine. I notice the “v6", and it should not be blocking,
but I would note that the name "arch_tlbbatch_add_mm()” does not make much
sense once the function also takes an address.

It could’ve been something like arch_set_tlb_ubc_flush_pending() but that’s
too long. I’m not very good with naming, but the current name is not great.




Re: [PATCH v6 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-11-15 Thread Nadav Amit
On Nov 15, 2022, at 5:50 PM, Yicong Yang  wrote:

> On 2022/11/16 7:38, Nadav Amit wrote:
>> On Nov 14, 2022, at 7:14 PM, Yicong Yang  wrote:
>> 
>>> diff --git a/arch/x86/include/asm/tlbflush.h 
>>> b/arch/x86/include/asm/tlbflush.h
>>> index 8a497d902c16..5bd78ae55cd4 100644
>>> --- a/arch/x86/include/asm/tlbflush.h
>>> +++ b/arch/x86/include/asm/tlbflush.h
>>> @@ -264,7 +264,8 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
>>> }
>>> 
>>> static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
>>> *batch,
>>> -struct mm_struct *mm)
>>> +struct mm_struct *mm,
>>> +unsigned long uaddr)
>> 
>> Logic-wise it looks fine. I notice the “v6", and it should not be blocking,
>> but I would note that the name "arch_tlbbatch_add_mm()” does not make much
>> sense once the function also takes an address.
> 
> ok the add_mm should still apply to x86 since the address is not used, but 
> not for arm64.
> 
>> It could’ve been something like arch_set_tlb_ubc_flush_pending() but that’s
>> too long. I’m not very good with naming, but the current name is not great.
> 
> What about arch_tlbbatch_add_pending()? Considering the x86 is pending the 
> flush operation
> while arm64 is pending the sychronization operation, 
> arch_tlbbatch_add_pending() should
> make sense to both.

Sounds reasonable. Thanks.




Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Nadav Amit
I am not very familiar with membarrier, but here are my 2 cents while trying
to answer your questions.

> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski  wrote:
> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>* from one thread in a process to another thread in the same
>* process. No TLB flush required.
>*/
> +
> + // XXX: why is this okay wrt membarrier?
>   if (!was_lazy)
>   return;

I am confused.

On one hand, it seems that membarrier_private_expedited() would issue an IPI
to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the
same as the one that the membarrier applies to. But… (see below)


> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>   smp_mb();
>   next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>   if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> - next_tlb_gen)
> + next_tlb_gen) {
> + /*
> +  * We're reactivating an mm, and membarrier might
> +  * need to serialize.  Tell membarrier.
> +  */
> +
> + // XXX: I can't understand the logic in
> + // membarrier_mm_sync_core_before_usermode().  What's
> + // the mm check for?
> + membarrier_mm_sync_core_before_usermode(next);

On the other hand, the reason for this mm check that you mention contradicts
my previous understanding, as the git log says:

commit 2840cf02fae627860156737e83326df354ee4ec6
Author: Mathieu Desnoyers 
Date:   Thu Sep 19 13:37:01 2019 -0400

sched/membarrier: Call sync_core only before usermode for same mm

When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

>   /*
>* When switching through a kernel thread, the loop in
>* membarrier_{private,global}_expedited() may have observed that
>* kernel thread and not issued an IPI. It is therefore possible to
>* schedule between user->kernel->user threads without passing though
>* switch_mm(). Membarrier requires a barrier after storing to
> -  * rq->curr, before returning to userspace, so provide them here:
> +  * rq->curr, before returning to userspace, and mmdrop() provides
> +  * this barrier.
>*
> -  * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -  *   provided by mmdrop(),
> -  * - a sync_core for SYNC_CORE.
> +  * XXX: I don't think mmdrop() actually does this.  There's no
> +  * smp_mb__before/after_atomic() in there.

I presume that since x86 is the only one that needs
membarrier_mm_sync_core_before_usermode(), nobody noticed the missing
smp_mb__before/after_atomic(). These are anyhow compiler barriers on x86,
and such a barrier would take place before the return to userspace.



Re: [Pv-drivers] [PATCH 29/36] cpuidle, xenpv: Make more PARAVIRT_XXL noinstr clean

2022-06-14 Thread Nadav Amit

Re: [PATCH v2 1/2] mm/migrate_device.c: Copy pte dirty bit to page

2022-08-17 Thread Nadav Amit
On Aug 17, 2022, at 12:17 AM, Huang, Ying  wrote:

> Alistair Popple  writes:
> 
>> Peter Xu  writes:
>> 
>>> On Wed, Aug 17, 2022 at 11:49:03AM +1000, Alistair Popple wrote:
 Peter Xu  writes:
 
> On Tue, Aug 16, 2022 at 04:10:29PM +0800, huang ying wrote:
>>> @@ -193,11 +194,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>bool anon_exclusive;
>>>pte_t swp_pte;
>>> 
>>> +   flush_cache_page(vma, addr, pte_pfn(*ptep));
>>> +   pte = ptep_clear_flush(vma, addr, ptep);
>> 
>> Although I think it's possible to batch the TLB flushing just before
>> unlocking PTL.  The current code looks correct.
> 
> If we're with unconditionally ptep_clear_flush(), does it mean we should
> probably drop the "unmapped" and the last flush_tlb_range() already since
> they'll be redundant?
 
 This patch does that, unless I missed something?
>>> 
>>> Yes it does.  Somehow I didn't read into the real v2 patch, sorry!
>>> 
> If that'll need to be dropped, it looks indeed better to still keep the
> batch to me but just move it earlier (before unlock iiuc then it'll be
> safe), then we can keep using ptep_get_and_clear() afaiu but keep "pte"
> updated.
 
 I think we would also need to check should_defer_flush(). Looking at
 try_to_unmap_one() there is this comment:
 
if (should_defer_flush(mm, flags) && !anon_exclusive) {
/*
 * We clear the PTE but do not flush so 
 potentially
 * a remote CPU could still be writing to the 
 folio.
 * If the entry was previously clean then the
 * architecture must guarantee that a 
 clear->dirty
 * transition on a cached TLB entry is written 
 through
 * and traps if the PTE is unmapped.
 */
 
 And as I understand it we'd need the same guarantee here. Given
 try_to_migrate_one() doesn't do batched TLB flushes either I'd rather
 keep the code as consistent as possible between
 migrate_vma_collect_pmd() and try_to_migrate_one(). I could look at
 introducing TLB flushing for both in some future patch series.
>>> 
>>> should_defer_flush() is TTU-specific code?
>> 
>> I'm not sure, but I think we need the same guarantee here as mentioned
>> in the comment otherwise we wouldn't see a subsequent CPU write that
>> could dirty the PTE after we have cleared it but before the TLB flush.
>> 
>> My assumption was should_defer_flush() would ensure we have that
>> guarantee from the architecture, but maybe there are alternate/better
>> ways of enforcing that?
>>> IIUC the caller sets TTU_BATCH_FLUSH showing that tlb can be omitted since
>>> the caller will be responsible for doing it.  In migrate_vma_collect_pmd()
>>> iiuc we don't need that hint because it'll be flushed within the same
>>> function but just only after the loop of modifying the ptes.  Also it'll be
>>> with the pgtable lock held.
>> 
>> Right, but the pgtable lock doesn't protect against HW PTE changes such
>> as setting the dirty bit so we need to ensure the HW does the right
>> thing here and I don't know if all HW does.
> 
> This sounds sensible.  But I take a look at zap_pte_range(), and find
> that it appears that the implementation requires the PTE dirty bit to be
> write-through.  Do I miss something?
> 
> Hi, Nadav, Can you help?

Sorry for joining the discussion late. I read most of this thread and I hope
I understand what you ask me. So at the risk of rehashing or repeating what
you already know - here are my 2 cents. Feel free to ask me again if I did
not understand your questions:

1. ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH is currently x86 specific. There is a
recent patch that want to use it for arm64 as well [1]. The assumption that
Alistair cited from the code (regarding should_defer_flush()) might not be
applicable to certain architectures (although most likely it is). I tried
to encapsulate the logic on whether an unflushed RO entry can become dirty
in an arch specific function [2].

2. Having said all of that, using the logic of “flush if there are pending
TLB flushes for this mm” as done by UNMAP_TLB_FLUSH makes sense IMHO
(although I would have considered doing it at a finer granularity of
VMA/page-table, as I proposed before and got a somewhat lukewarm response [3]).

3. There is no question that flushing after dropping the ptl is wrong. But
reading the thread, I think that you only focus on whether a dirty
indication might get lost. The problem, I think, is bigger, as it might also
cause correctness problems after concurrently removing mappings. What happens
if we get for a clean PTE something like:

  CPU0  

Re: [PATCH v3 1/4] Make place for common balloon code

2022-08-22 Thread Nadav Amit
On Aug 22, 2022, at 4:37 AM, Alexander Atanasov 
 wrote:

> mm/balloon_compaction.c -> mm/balloon.c
> File already contains code that is common along balloon
> drivers so rename it to reflect its contents.
> 
> include/linux/balloon_compaction.h -> include/linux/balloon.h
> Remove it from files which do not actually use it.
> Drop externs from function delcarations.
> 
> Signed-off-by: Alexander Atanasov 

Makes so much sense.

Acked-by: Nadav Amit