Re: KASAN debug kernel fails to boot at early stage when CONFIG_SMP=y is set (kernel 6.5-rc5, PowerMac G4 3,6)

2023-09-13 Thread Christophe Leroy


Le 12/09/2023 à 19:39, Christophe Leroy a écrit :
> 
> 
> Le 12/09/2023 à 17:59, Erhard Furtner a écrit :
>>
>> printk: bootconsole [udbg0] enabled
>> Total memory = 2048MB; using 4096kB for hash table
>> mapin_ram:125
>> mmu_mapin_ram:169 0 3000 140 200
>> __mmu_mapin_ram:146 0 140
>> __mmu_mapin_ram:155 140
>> __mmu_mapin_ram:146 140 3000
>> __mmu_mapin_ram:155 2000
>> __mapin_ram_chunk:107 2000 3000
>> __mapin_ram_chunk:117
>> mapin_ram:134
>> kasan_mmu_init:129
>> kasan_mmu_init:132 0
>> kasan_mmu_init:137
>> ioremap() called early from btext_map+0x64/0xdc. Use early_ioremap() instead
>> Linux version 6.6.0-rc1-PMacG4-dirty (root@T1000) (gcc (Gentoo 
>> 12.3.1_p20230526 p2) 12.3.1 20230526, GNU ld (Gentoo 2.40 p7) 2.40.0) #5 SMP 
>> Tue Sep 12 16:50:47 CEST 2023
>> kasan_init_region: c000 3000 f800 fe00
>> kasan_init_region: loop f800 fe00
>>
>>
>> So I get no "kasan_init_region: setbat" line and don't reach "KASAN init 
>> done".
> 
> Ah ok, maybe your CPU only has 4 BATs and they are all used; the
> following change would tell us:
> 
> diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
> index 850783cfa9c7..bd26767edce7 100644
> --- a/arch/powerpc/mm/book3s32/mmu.c
> +++ b/arch/powerpc/mm/book3s32/mmu.c
> @@ -86,6 +86,7 @@ int __init find_free_bat(void)
>  		if (!(bat[1].batu & 3))
>  			return b;
>  	}
> +	pr_err("NO FREE BAT (%d)\n", n);
>  	return -1;
>  }
> 
> 
> Or you have 8 BATs, in which case it's an alignment problem: you need
> to increase CONFIG_DATA_SHIFT to 23, and for that you need
> CONFIG_ADVANCED and CONFIG_DATA_SHIFT_BOOL.
> 
> But regardless of that, there is a problem we need to track down,
> because it should work even without BATs.
> 
> As the BATs allocation fails, it falls back to:
> 
> 		phys = memblock_phys_alloc_range(k_end - k_start, PAGE_SIZE, 0,
> 						 MEMBLOCK_ALLOC_ANYWHERE);
> 		if (!phys)
> 			return -ENOMEM;
> 	}
> 
> 	ret = kasan_init_shadow_page_tables(k_start, k_end);
> 	if (ret)
> 		return ret;
> 
> 	for (k_cur = k_start; k_cur < k_end; k_cur += PAGE_SIZE) {
> 		pmd_t *pmd = pmd_off_k(k_cur);
> 		pte_t pte = pfn_pte(PHYS_PFN(phys + k_cur - k_start), PAGE_KERNEL);
> 
> 		__set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0);
> 	}
> 	flush_tlb_kernel_range(k_start, k_end);
> 	memset(kasan_mem_to_shadow(start), 0, k_end - k_start);
> 
> 
> While the __weak function that you confirmed working is:
> 
> 	ret = kasan_init_shadow_page_tables(k_start, k_end);
> 	if (ret)
> 		return ret;
> 
> 	block = memblock_alloc(k_end - k_start, PAGE_SIZE);
> 	if (!block)
> 		return -ENOMEM;
> 
> 	for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
> 		pmd_t *pmd = pmd_off_k(k_cur);
> 		void *va = block + k_cur - k_start;
> 		pte_t pte = pfn_pte(PHYS_PFN(__pa(va)), PAGE_KERNEL);
> 
> 		__set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0);
> 	}
> 	flush_tlb_kernel_range(k_start, k_end);
> 
> 
> I'm having a hard time understanding what could be wrong in the first
> place.
> 
> Could you try the following change:
> 
> diff --git a/arch/powerpc/mm/kasan/book3s_32.c
> b/arch/powerpc/mm/kasan/book3s_32.c
> index 9954b7a3b7ae..e04f21908c6a 100644
> --- a/arch/powerpc/mm/kasan/book3s_32.c
> +++ b/arch/powerpc/mm/kasan/book3s_32.c
> @@ -38,7 +38,7 @@ int __init kasan_init_region(void *start, size_t size)
> 
>  	if (k_nobat < k_end) {
>  		phys = memblock_phys_alloc_range(k_end - k_nobat, PAGE_SIZE, 0,
> -						 MEMBLOCK_ALLOC_ANYWHERE);
> +						 MEMBLOCK_ALLOC_ACCESSIBLE);
>  		if (!phys)
>  			return -ENOMEM;
>  	}
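
For reference, the distinction being probed here (general memblock
semantics, not specific to this thread): the last argument bounds where
the allocator may place the block.

	/* capped at memblock.current_limit - usable this early in boot */
	phys = memblock_phys_alloc_range(sz, PAGE_SIZE, 0,
					 MEMBLOCK_ALLOC_ACCESSIBLE);

	/*
	 * no upper cap - may return memory the early kernel cannot
	 * address yet, a plausible culprit for the silent hang
	 */
	phys = memblock_phys_alloc_range(sz, PAGE_SIZE, 0,
					 MEMBLOCK_ALLOC_ANYWHERE);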
> 
> And also that one:
> 
> 
> diff --git a/arch/powerpc/mm/kasan/init_32.c
> b/arch/powerpc/mm/kasan/init_32.c
> index a70828a6d935..bc1c075489f4 100644
> --- a/arch/powerpc/mm/kasan/init_32.c
> +++ b/arch/powerpc/mm/kasan/init_32.c
> @@ -84,6 +84,9 @@ kasan_update_early_region(unsigned long k_start, unsigned long k_end, pte_t pte)
>  {
>  	unsigned long k_cur;
> 
> +	if (k_start == k_end)
> +		return;
> +
>  	for (k_cur = k_start; k_cur != k_end; k_cur += PAGE_SIZE) {
>  		pmd_t *pmd = pmd_off_k(k_cur);
>  		pte_t *ptep = pte_offset_kernel(pmd, k_cur);
> 
> 
> 

I tested the two vmlinux you sent me off-list; they both start without 
problems on QEMU.

Regarding the use of BATs, in fact a shift of 23 is still not enough to 
get free BATs for KASAN. But at least it allows you to map all linear 
memory with BATs, whereas a shift of 22 would require 9 BATs:

With shift 22 you have BATs with 

Re: [PATCH v3 2/2] powerpc/fadump: make is_kdump_kernel() return false when fadump is active

2023-09-13 Thread Baoquan He
On 09/12/23 at 01:59pm, Hari Bathini wrote:
> Currently, is_kdump_kernel() returns true in the crash dump capture
> kernel for both the kdump and fadump crash dump capturing methods, as
> both methods set elfcorehdr_addr. Some restrictions enforced in the
> crash dump capture kernel based on is_kdump_kernel() are specifically
> meant for the kdump case and not desirable for fadump - e.g. the IO
> queue restriction in device drivers. So, define is_kdump_kernel() to
> return false when f/w assisted dump is active.
> 
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/kexec.h |  8 ++--
>  arch/powerpc/kernel/crash_dump.c | 12 
>  2 files changed, 18 insertions(+), 2 deletions(-)

LGTM,

Acked-by: Baoquan He 

> 
> diff --git a/arch/powerpc/include/asm/kexec.h 
> b/arch/powerpc/include/asm/kexec.h
> index a1ddba01e7d1..e1b43aa12175 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -99,10 +99,14 @@ void relocate_new_kernel(unsigned long indirection_page, 
> unsigned long reboot_co
>  
>  void kexec_copy_flush(struct kimage *image);
>  
> -#if defined(CONFIG_CRASH_DUMP) && defined(CONFIG_PPC_RTAS)
> +#if defined(CONFIG_CRASH_DUMP)
> +bool is_kdump_kernel(void);
> +#define is_kdump_kernel  is_kdump_kernel
> +#if defined(CONFIG_PPC_RTAS)
>  void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
>  #define crash_free_reserved_phys_range crash_free_reserved_phys_range
> -#endif
> +#endif /* CONFIG_PPC_RTAS */
> +#endif /* CONFIG_CRASH_DUMP */
>  
>  #ifdef CONFIG_KEXEC_FILE
>  extern const struct kexec_file_ops kexec_elf64_ops;
> diff --git a/arch/powerpc/kernel/crash_dump.c 
> b/arch/powerpc/kernel/crash_dump.c
> index 9a3b85bfc83f..2086fa6cdc25 100644
> --- a/arch/powerpc/kernel/crash_dump.c
> +++ b/arch/powerpc/kernel/crash_dump.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef DEBUG
>  #include 
> @@ -92,6 +93,17 @@ ssize_t copy_oldmem_page(struct iov_iter *iter, unsigned 
> long pfn,
>   return csize;
>  }
>  
> +/*
> + * Return true only when the kexec based kernel dump capturing method is
> + * used. This ensures all restrictions applied for the kdump case are not
> + * automatically applied for the fadump case.
> + */
> +bool is_kdump_kernel(void)
> +{
> + return !is_fadump_active() && elfcorehdr_addr != ELFCORE_ADDR_MAX;
> +}
> +EXPORT_SYMBOL_GPL(is_kdump_kernel);
> +
>  #ifdef CONFIG_PPC_RTAS
>  /*
>   * The crashkernel region will almost always overlap the RTAS region, so
> -- 
> 2.41.0
> 
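For context, the kind of driver-side restriction the commit message
refers to - a minimal sketch (hypothetical driver, not from the patch):

	#include <linux/crash_dump.h>
	#include <linux/cpumask.h>

	static unsigned int foo_nr_io_queues(void)
	{
		/*
		 * A capture kernel runs with tight memory, so drivers
		 * commonly fall back to a single queue there.
		 */
		if (is_kdump_kernel())
			return 1;
		return num_online_cpus();
	}

With this series, a fadump capture kernel no longer triggers such
fallbacks, while a kexec-based kdump kernel still does.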



Re: [PATCH v3 1/2] vmcore: remove dependency with is_kdump_kernel() for exporting vmcore

2023-09-13 Thread Baoquan He
On 09/12/23 at 01:59pm, Hari Bathini wrote:
> Currently, is_kdump_kernel() returns true when elfcorehdr_addr is set.
> While elfcorehdr_addr is set for kexec based kernel dump mechanism,
> alternate dump capturing methods like fadump [1] also set it to export
> the vmcore. Since is_kdump_kernel() is used to restrict resources in
> the crash dump capture kernel and such restrictions may not be desirable
> for fadump, allow is_kdump_kernel() to be defined differently for such
> scenarios. With this, is_kdump_kernel() could be false while vmcore is
> usable. So, remove unnecessary dependency with is_kdump_kernel(), for
> exporting vmcore.
> 
> [1] https://docs.kernel.org/powerpc/firmware-assisted-dump.html
> 
> Suggested-by: Michael Ellerman 
> Signed-off-by: Hari Bathini 
> ---
> 
> Changes in v3:
> * Decoupled is_vmcore_usable() & vmcore_unusable() from is_kdump_kernel()
>   as suggested here: 
> 
> https://lore.kernel.org/linuxppc-dev/ZP7si3UMVpPfYV+w@MiWiFi-R3L-srv/T/#m13ae5a7e4ba6f4d8397f0f66581832292eee3a85
> 
> 
>  include/linux/crash_dump.h | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)

LGTM,

Acked-by: Baoquan He 

> 
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 0f3a656293b0..acc55626afdc 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -50,6 +50,7 @@ void vmcore_cleanup(void);
>  #define vmcore_elf64_check_arch(x) (elf_check_arch(x) || 
> vmcore_elf_check_arch_cross(x))
>  #endif
>  
> +#ifndef is_kdump_kernel
>  /*
>   * is_kdump_kernel() checks whether this kernel is booting after a panic of
>   * previous kernel or not. This is determined by checking if previous kernel
> @@ -64,6 +65,7 @@ static inline bool is_kdump_kernel(void)
>  {
>   return elfcorehdr_addr != ELFCORE_ADDR_MAX;
>  }
> +#endif
>  
>  /* is_vmcore_usable() checks if the kernel is booting after a panic and
>   * the vmcore region is usable.
> @@ -75,7 +77,8 @@ static inline bool is_kdump_kernel(void)
>  
>  static inline int is_vmcore_usable(void)
>  {
> - return is_kdump_kernel() && elfcorehdr_addr != ELFCORE_ADDR_ERR ? 1 : 0;
> + return elfcorehdr_addr != ELFCORE_ADDR_ERR &&
> + elfcorehdr_addr != ELFCORE_ADDR_MAX ? 1 : 0;
>  }
>  
>  /* vmcore_unusable() marks the vmcore as unusable,
> @@ -84,8 +87,7 @@ static inline int is_vmcore_usable(void)
>  
>  static inline void vmcore_unusable(void)
>  {
> - if (is_kdump_kernel())
> - elfcorehdr_addr = ELFCORE_ADDR_ERR;
> + elfcorehdr_addr = ELFCORE_ADDR_ERR;
>  }
>  
>  /**
> -- 
> 2.41.0
> 
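For reference, the three elfcorehdr_addr states the helpers distinguish
after this change (as implied by the patch):

  elfcorehdr_addr     is_vmcore_usable()   meaning
  ELFCORE_ADDR_MAX    0                    no vmcore was passed
  ELFCORE_ADDR_ERR    0                    vmcore marked unusable
  any other value     1                    vmcore is usable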



Re: [PATCH v2] ASoC: imx-rpmsg: Set ignore_pmdown_time for dai_link

2023-09-13 Thread Shengjiu Wang
On Wed, Sep 13, 2023 at 6:27 PM Chancel Liu  wrote:
>
> i.MX rpmsg sound cards work in codec slave mode. MCLK will be disabled
> by the CPU DAI driver in hw_free(). Some codecs require MCLK to be
> present during the power up/down sequence, so we need to set
> ignore_pmdown_time to power down the codec immediately, before MCLK is
> turned off.
>
> Take WM8962 as an example: if MCLK is disabled before DAPM powers down
> the playback stream, a FIFO error will arise in the WM8962, which has a
> bad impact on the next playback.
>
> Signed-off-by: Chancel Liu 

Acked-by: Shengjiu Wang 

Best regards
Wang Shengjiu
> ---
>  sound/soc/fsl/imx-rpmsg.c | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/sound/soc/fsl/imx-rpmsg.c b/sound/soc/fsl/imx-rpmsg.c
> index 3c7b95db2eac..b578f9a32d7f 100644
> --- a/sound/soc/fsl/imx-rpmsg.c
> +++ b/sound/soc/fsl/imx-rpmsg.c
> @@ -89,6 +89,14 @@ static int imx_rpmsg_probe(struct platform_device *pdev)
> SND_SOC_DAIFMT_NB_NF |
> SND_SOC_DAIFMT_CBC_CFC;
>
> +   /*
> +* i.MX rpmsg sound cards work in codec slave mode. MCLK will be
> +* disabled by the CPU DAI driver in hw_free(). Some codecs require
> +* MCLK present during the power up/down sequence, so set
> +* ignore_pmdown_time to power down the codec immediately before
> +* MCLK is turned off.
> +*/
> +   data->dai.ignore_pmdown_time = 1;
> +
> /* Optional codec node */
> ret = of_parse_phandle_with_fixed_args(np, "audio-codec", 0, 0, &args);
> if (ret) {
> --
> 2.25.1
>


Re: [PATCH v7 3/3 RESEND] powerpc/pseries: PLPKS SED Opal keystore support

2023-09-13 Thread Michael Ellerman
Nathan Chancellor  writes:
> Hi Greg,
>
> On Fri, Sep 08, 2023 at 10:30:56AM -0500, gjo...@linux.vnet.ibm.com wrote:
>> From: Greg Joyce 
>>
>> Define operations for SED Opal to read/write keys
>> from the POWER LPAR Platform KeyStore (PLPKS). This allows
>> non-volatile storage of SED Opal keys.
>>
>> Signed-off-by: Greg Joyce 
>> Reviewed-by: Jonathan Derrick 
>> Reviewed-by: Hannes Reinecke 
>
> After this change in -next as commit 9f2c7411ada9 ("powerpc/pseries:
> PLPKS SED Opal keystore support"), I see the following crash when
> booting some distribution configurations, such as OpenSUSE's [1] (the
> rootfs is available at [2] if necessary):

Thanks for testing Nathan.

The code needs to check plpks_is_available() somewhere, before calling
the plpks routines.

cheers
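
Presumably something along these lines - a sketch of the guard only,
not a tested fix (plpks_is_available() is the existing pseries helper):

	static int __init sed_opal_init(void)
	{
		/* bail out quietly on platforms without a PLPKS */
		if (!plpks_is_available())
			return 0;

		/* ... existing keystore setup continues here ... */
		return 0;
	}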

> $ qemu-system-ppc64 \
> -display none \
> -nodefaults \
> -device ipmi-bmc-sim,id=bmc0 \
> -device isa-ipmi-bt,bmc=bmc0,irq=10 \
> -machine powernv \
> -kernel arch/powerpc/boot/zImage.epapr \
> -initrd ppc64le-rootfs.cpio \
> -m 2G \
> -serial mon:stdio
> ...
> [0.00] Linux version 6.6.0-rc1-4-g9f2c7411ada9 
> (nathan@dev-arch.thelio-3990X) (powerpc64-linux-gcc (GCC) 13.2.0, GNU ld (GNU 
> Binutils) 2.41) #1 SMP Wed Sep 13 11:53:38 MST 2023
> ...
> [1.808911] [ cut here ]
> [1.810336] kernel BUG at arch/powerpc/kernel/syscall.c:34!
> [1.810799] Oops: Exception in kernel mode, sig: 5 [#1]
> [1.810985] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> [1.811191] Modules linked in:
> [1.811483] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 6.6.0-rc1-4-g9f2c7411ada9 #1
> [1.811825] Hardware name: IBM PowerNV (emulated by qemu) POWER9 0x4e1202 
> opal:v7.0 PowerNV
> [1.812133] NIP:  c002c8c4 LR: c000d620 CTR: 
> c000d4c0
> [1.812335] REGS: c2deb7b0 TRAP: 0700   Not tainted  
> (6.6.0-rc1-4-g9f2c7411ada9)
> [1.812595] MSR:  90029033   CR: 2800028d 
>  XER: 20040004
> [1.812930] CFAR: c000d61c IRQMASK: 3
> [1.812930] GPR00: c000d620 c2deba50 c15ef400 
> c2debe80
> [1.812930] GPR04: 4800028d   
> 
> [1.812930] GPR08: 79cd 0001  
> 
> [1.812930] GPR12:  c28b  
> 
> [1.812930] GPR16:    
> 
> [1.812930] GPR20:    
> 
> [1.812930] GPR24:    
> 
> [1.812930] GPR28:  4800028d c2debe80 
> c2debe10
> [1.814858] NIP [c002c8c4] system_call_exception+0x84/0x250
> [1.815480] LR [c000d620] system_call_common+0x160/0x2c4
> [1.815772] Call Trace:
> [1.815929] [c2debe50] [c000d620] 
> system_call_common+0x160/0x2c4
> [1.816178] --- interrupt: c00 at plpar_hcall+0x38/0x60
> [1.816330] NIP:  c00e43f8 LR: c00fb558 CTR: 
> 
> [1.816518] REGS: c2debe80 TRAP: 0c00   Not tainted  
> (6.6.0-rc1-4-g9f2c7411ada9)
> [1.816740] MSR:  9280b033   
> CR: 2800028d  XER: 
> [1.817039] IRQMASK: 0
> [1.817039] GPR00: 4800028d c2deb950 c15ef400 
> 0434
> [1.817039] GPR04: 028eb190 28ac6600 001d 
> 0010
> [1.817039] GPR08:    
> 
> [1.817039] GPR12:  c28b c0011188 
> 
> [1.817039] GPR16:    
> 
> [1.817039] GPR20:    
> 
> [1.817039] GPR24:    
> c00028ac6600
> [1.817039] GPR28: 0010 c28eb190 c00028ac6600 
> c2deba30
> [1.818785] NIP [c00e43f8] plpar_hcall+0x38/0x60
> [1.818929] LR [c00fb558] plpks_read_var+0x208/0x290
> [1.819093] --- interrupt: c00
> [1.819195] [c2deb950] [c00fb528] 
> plpks_read_var+0x1d8/0x290 (unreliable)
> [1.819433] [c2deba10] [c00fc1ac] sed_read_key+0x9c/0x170
> [1.819617] [c2debad0] [c20541a8] sed_opal_init+0xac/0x174
> [1.819823] [c2debc50] [c0010ad0] 
> do_one_initcall+0x80/0x3b0
> [1.820017] [c2debd30] [c2004860] 
> kernel_init_freeable+0x338/0x3dc
> [1.820229] [c2debdf0] [c00111b0] kernel_init+0x30/0x1a0
> [1.820411] [c2debe50] [c000d620] 
> system_call_common+0x160/0x2c4
> [1.820614] --- 

Re: [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-09-13 Thread Binbin Wu




On 9/14/2023 9:55 AM, Sean Christopherson wrote:

From: Chao Peng 

Currently in the mmu_notifier invalidate path, the hva range is recorded
and then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the soon-to-be-introduced private memory, a
page fault may not have an associated hva; checking the gfn (gpa) makes
more sense.

For existing hva-based shared memory, gfn is expected to also work. The
only downside is that when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.

Suggested-by: Sean Christopherson 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
---
  arch/x86/kvm/mmu/mmu.c   | 10 ++
  arch/x86/kvm/vmx/vmx.c   | 11 +--
  include/linux/kvm_host.h | 33 +
  virt/kvm/kvm_main.c  | 40 +++-
  4 files changed, 63 insertions(+), 31 deletions(-)


[...]
  
-void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned long start,
-			      unsigned long end)
+void kvm_mmu_invalidate_begin(struct kvm *kvm)
 {
+	lockdep_assert_held_write(&kvm->mmu_lock);
/*
 * The count increase must become visible at unlock time as no
 * spte can be established without taking the mmu_lock and
 * count is also read inside the mmu_lock critical section.
 */
kvm->mmu_invalidate_in_progress++;
+
+   if (likely(kvm->mmu_invalidate_in_progress == 1))
+   kvm->mmu_invalidate_range_start = INVALID_GPA;
+}
+
+void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	lockdep_assert_held_write(&kvm->mmu_lock);
+
+   WARN_ON_ONCE(!kvm->mmu_invalidate_in_progress);
+
if (likely(kvm->mmu_invalidate_in_progress == 1)) {
kvm->mmu_invalidate_range_start = start;
kvm->mmu_invalidate_range_end = end;
@@ -771,6 +781,12 @@ void kvm_mmu_invalidate_begin(struct kvm *kvm, unsigned 
long start,
}
  }
  
+static bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)

+{
+   kvm_mmu_invalidate_range_add(kvm, range->start, range->end);
+   return kvm_unmap_gfn_range(kvm, range);
+}
+
  static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
const struct mmu_notifier_range *range)
  {
@@ -778,7 +794,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
const struct kvm_mmu_notifier_range hva_range = {
.start  = range->start,
.end= range->end,
-   .handler= kvm_unmap_gfn_range,
+   .handler= kvm_mmu_unmap_gfn_range,
.on_lock= kvm_mmu_invalidate_begin,
.on_unlock  = kvm_arch_guest_memory_reclaimed,
.flush_on_ret   = true,
@@ -817,8 +833,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
return 0;
  }
  
-void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long start,
-			    unsigned long end)
+void kvm_mmu_invalidate_end(struct kvm *kvm)
  {
/*
 * This sequence increase will notify the kvm page fault that
@@ -833,6 +848,13 @@ void kvm_mmu_invalidate_end(struct kvm *kvm, unsigned long 
start,
 * in conjunction with the smp_rmb in mmu_invalidate_retry().
 */
kvm->mmu_invalidate_in_progress--;
+
+   /*
+* Assert that at least one range must be added between start() and
+* end().  Not adding a range isn't fatal, but it is a KVM bug.
+*/
+   WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
+kvm->mmu_invalidate_range_start == INVALID_GPA);
Should the check happen before the decrement of
kvm->mmu_invalidate_in_progress?
Otherwise, if KVM calls kvm_mmu_invalidate_begin() and then
kvm_mmu_invalidate_end() without adding a range, the check will not
take effect.


  }
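I.e., presumably (a sketch of the suggested reordering, not from the
patch):

	/*
	 * Check before the decrement so the final invalidate_end() call
	 * still sees in_progress != 0 and the assertion can fire.
	 */
	WARN_ON_ONCE(kvm->mmu_invalidate_in_progress &&
		     kvm->mmu_invalidate_range_start == INVALID_GPA);
	kvm->mmu_invalidate_in_progress--;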
  
  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,




[PATCH v5 11/11] docs: powerpc: Document nested KVM on POWER

2023-09-13 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it is
used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
v3:
  - Fix typos
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index d33b554ca7ba..23e449994c2a 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -26,6 +26,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..8b37981dc3d9
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Nested KVM on POWER
+===================
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+============
+
+Linux/KVM has had support for nesting as an L0 or L1 since 2018.
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall, h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+============
+
+The new PAPR API changes from the v1 API such that creating the L2 and
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN(). Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and create any associated vCPUs with h_guest_create_vcpu(). Getting
+and setting vCPU state can also be performed using the
+h_guest_{g,s}et hcalls.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is (a sketch in C follows the list):
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 run the vCPU with the H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
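A sketch of that flow from the L1's side (illustrative pseudo-C; the
lower-case wrapper names are assumptions, not the KVM implementation):

	h_guest_get_capabilities(0, &caps);	/* once, at L1 boot */
	h_guest_set_capabilities(0, caps);
	guest_id = h_guest_create(0);		/* token for later hcalls */
	h_guest_create_vcpu(0, guest_id, vcpu_id);
	h_guest_set_state(0, guest_id, vcpu_id, gsb);	/* initial state */
	for (;;) {
		exit = h_guest_vcpu_run(0, guest_id, vcpu_id);
		if (handle_exit(exit))	/* may read more via h_guest_get_state() */
			break;
	}
	h_guest_delete(0, guest_id);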
+More details of the individual hcalls follow:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. Refer to the latest version of the PAPR for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions
+(e.g. POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters:
+Input:
+  flags: 

[PATCH v5 10/11] KVM: PPC: Add support for nestedv2 guests

2023-09-13 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. KVM
already had an interface that allowed this on powernv platforms. This
existing interface will now be called "nestedv1". The newly added PAPR
interface will be called "nestedv2".  PHYP will support the nestedv2
interface. At this time the host side of the nestedv2 interface has not
been implemented on powernv but there is no technical reason why it
could not be added.

The nestedv1 interface is still supported.

Add support to KVM to utilize these hcalls to enable running nested
guests as a pseries guest on PHYP.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create an L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create an L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 run the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2
interface.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread-wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest-wide elements cannot
be added to the input buffer, so send them with a separate H_GUEST_SET
call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
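
A toy model of that load tracking (standalone C for illustration; the
names and the bitmap are assumptions, not the KVM data structures):

	#include <stdbool.h>
	#include <stdint.h>

	#define NR_GS_IDS 64

	static uint64_t cached[NR_GS_IDS];
	static bool fresh[NR_GS_IDS];

	uint64_t h_guest_get(int id);	/* stand-in for the H_GUEST_GET hcall */

	/* only go to the L0 when the cached copy is stale */
	static uint64_t get_reg(int id)
	{
		if (!fresh[id]) {
			cached[id] = h_guest_get(id);
			fresh[id] = true;
		}
		return cached[id];
	}

	/* everything may have changed across H_GUEST_VCPU_RUN ... */
	static void invalidate_all(void)
	{
		for (int i = 0; i < NR_GS_IDS; i++)
			fresh[i] = false;
	}

	/* ... except values returned in the run output buffer */
	static void output_buffer_elem(int id, uint64_t val)
	{
		cached[id] = val;
		fresh[id] = true;
	}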

Tested-by: Sachin Sant 
Signed-off-by: Vaibhav Jain 
Signed-off-by: Gautam Menghani 
Signed-off-by: Kautuk Consul 
Signed-off-by: Amit Machhiwal 
Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Gautam: Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Kautuk: Handle emul_inst as now a double word, init correctly
  - Use new GPR(), etc macros
  - Amit: Determine PAPR nested capabilities from cpu features
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Vaibhav: Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  - Propagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Amit: Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
v4:
  - Batch H_GUEST_GET calls in kvmhv_nestedv2_reload_ptregs()
  - Fix compile without CONFIG_PSERIES
  - Fix maybe uninitialized trap in kvmhv_p9_guest_entry()
  - Extend existing setters for arch_compat and lpcr
v5:
  - Check H_BUSY for {g,s}etting capabilities
  - Message if plpar_guest_get_capabilities() fails and nestedv1
support will be attempted.
  - Remove unused amor variable
---
 arch/powerpc/include/asm/guest-state-buffer.h |  91 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 137 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  20 +
 arch/powerpc/include/asm/kvm_ppc.h|  90 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 263 +
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 134 ++-
 arch/powerpc/kvm/book3s_hv.h  |  80 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  40 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 994 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  50 +
 14 files changed, 1843 insertions(+), 97 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index aaefe1075fc4..808149f31576 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
 #define _ASM_POWERPC_GUEST_STATE_BUFFER_H

[PATCH v5 09/11] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-09-13 Thread Jordan Niethe
The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word in struct kvm_arch. Currently, LPIDs are
already limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64-bit "Guest ID" to be used by the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.

Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Use u64
  - Change format strings instead of casting
---
 arch/powerpc/include/asm/kvm_book3s.h | 10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  2 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 +++---
 arch/powerpc/kvm/book3s_hv_nested.c   |  4 ++--
 arch/powerpc/kvm/book3s_hv_uvmem.c|  2 +-
 arch/powerpc/kvm/book3s_xive.c|  4 ++--
 9 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4c6558d5fefe..831c23e4f121 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -191,14 +191,14 @@ extern int kvmppc_mmu_radix_translate_table(struct 
kvm_vcpu *vcpu, gva_t eaddr,
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
-   unsigned int pshift, unsigned int lpid);
+   unsigned int pshift, u64 lpid);
 extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
unsigned int shift,
const struct kvm_memory_slot *memslot,
-   unsigned int lpid);
+   u64 lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
bool writing, unsigned long gpa,
-   unsigned int lpid);
+   u64 lpid);
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
@@ -207,7 +207,7 @@ extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu 
*vcpu,
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
 extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
- unsigned int lpid);
+ u64 lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
 extern void kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
@@ -300,7 +300,7 @@ void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
-void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d49065af08e9..572f9bbf1a25 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -624,7 +624,7 @@ static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 
 extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 unsigned long gpa, unsigned int level,
-unsigned long mmu_seq, unsigned int lpid,
+unsigned long mmu_seq, u64 lpid,
 unsigned long *rmapp, struct rmap_nested **n_rmap);
 extern void kvmhv_insert_nest_rmap(struct kvm *kvm, unsigned long *rmapp,
   struct rmap_nested **n_rmap);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..429b53bc1773 

[PATCH v5 08/11] KVM: PPC: Add helper library for Guest State Buffers

2023-09-13 Thread Jordan Niethe
The PAPR "Nestedv2" guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification. If the
nested host needs to read L2 state, that request likewise must go
through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format, the length is redundant, as the
length of each identity is fixed, but it is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <----2 bytes----> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |      Data      |
  +----------------+-----------+----------------+
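
A standalone sketch of packing one element in this layout (host-side
demo for illustration, not the kernel library; PAPR values are
big-endian):

	#include <endian.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* append one ID/length/data element at offset 'off' */
	static size_t gsb_put(uint8_t *buf, size_t off, uint16_t id,
			      const void *data, uint16_t len)
	{
		uint16_t be_id = htobe16(id), be_len = htobe16(len);

		memcpy(buf + off, &be_id, 2);
		memcpy(buf + off + 2, &be_len, 2);
		memcpy(buf + off + 4, data, len);
		return off + 4 + len;
	}

	int main(void)
	{
		uint8_t buf[64];
		uint32_t count = htobe32(1);		/* element count header */
		uint64_t tb_offset = htobe64(0x1234);	/* e.g. ID 0x0004 above */
		size_t off = sizeof(count);

		memcpy(buf, &count, sizeof(count));
		off = gsb_put(buf, off, 0x0004, &tb_offset, sizeof(tb_offset));
		printf("%zu byte buffer\n", off);	/* 4 + (2 + 2 + 8) = 16 */
		return 0;
	}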

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the nestedv2 guest support.

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSXs
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
v5:
  - Fix mismatched function comments
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 904 ++
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 571 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 +++
 5 files changed, 1818 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 2a54fadbeaf5..339c3a5f56f1 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..aaefe1075fc4
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,904 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+/* Element without a value and any length */
+#define KVMPPC_GSID_BLANK			0x0000
+/* Size required for the L0's internal VCPU representation */
+#define KVMPPC_GSID_HOST_STATE_SIZE		0x0001
+/* Minimum size for the H_GUEST_RUN_VCPU output buffer */
+#define KVMPPC_GSID_RUN_OUTPUT_MIN_SIZE		0x0002
+/* "Logical" PVR value as defined in the PAPR */
+#define KVMPPC_GSID_LOGICAL_PVR			0x0003
+/* L0 relative timebase offset */
+#define KVMPPC_GSID_TB_OFFSET			0x0004
+/* Partition Scoped Page Table Info */
+#define KVMPPC_GSID_PARTITION_TABLE		0x0005
+/* Process Table Info */
[PATCH v5 07/11] KVM: PPC: Book3S HV: Introduce low level MSR accessor

2023-09-13 Thread Jordan Niethe
kvmppc_get_msr() and kvmppc_set_msr_fast() serve as accessors for the
MSR. However because the MSR is kept in the shared regs they include a
conditional check for kvmppc_shared_big_endian() and endian conversion.

Within the Book3S HV specific code there are direct reads and writes of
shregs::msr. In preparation for Nested APIv2 these accesses need to be
replaced with accessor functions so it is possible to extend their
behavior. However, using the kvmppc_get_msr() and kvmppc_set_msr_fast()
functions is undesirable because it would introduce a conditional branch
and endian conversion that is not currently present.

kvmppc_set_msr_hv() already exists, it is used for the
kvmppc_ops::set_msr callback.

Introduce low level accessors __kvmppc_{s,g}et_msr_hv() that simply
get and set shregs::msr. These will be extended for Nested APIv2
support; a sketch of the pair follows.
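
Presumably a pair along these lines (sketch; the real definitions live
in the book3s_hv.h hunk counted in the diffstat):

	static inline u64 __kvmppc_get_msr_hv(struct kvm_vcpu *vcpu)
	{
		return vcpu->arch.shregs.msr;	/* no endian conversion */
	}

	static inline void __kvmppc_set_msr_hv(struct kvm_vcpu *vcpu, u64 val)
	{
		vcpu->arch.shregs.msr = val;
	}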

Signed-off-by: Jordan Niethe 
---
v4:
  - New to series
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  5 ++--
 arch/powerpc/kvm/book3s_hv.c | 34 ++--
 arch/powerpc/kvm/book3s_hv.h | 10 
 arch/powerpc/kvm/book3s_hv_builtin.c |  5 ++--
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index efd0ebf70a5e..fdfc2a62dd67 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -28,6 +28,7 @@
 #include 
 
 #include "book3s.h"
+#include "book3s_hv.h"
 #include "trace_hv.h"
 
 //#define DEBUG_RESIZE_HPT 1
@@ -347,7 +348,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
unsigned long v, orig_v, gr;
__be64 *hptep;
long int index;
-   int virtmode = vcpu->arch.shregs.msr & (data ? MSR_DR : MSR_IR);
+   int virtmode = __kvmppc_get_msr_hv(vcpu) & (data ? MSR_DR : MSR_IR);
 
if (kvm_is_radix(vcpu->kvm))
return kvmppc_mmu_radix_xlate(vcpu, eaddr, gpte, data, iswrite);
@@ -385,7 +386,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
 
/* Get PP bits and key for permission check */
pp = gr & (HPTE_R_PP0 | HPTE_R_PP);
-   key = (vcpu->arch.shregs.msr & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
+   key = (__kvmppc_get_msr_hv(vcpu) & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
key &= slb_v;
 
/* Calculate permissions */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 25025f6c4cce..5743f32bf45e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1374,7 +1374,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
  */
 static void kvmppc_cede(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.shregs.msr |= MSR_EE;
+   __kvmppc_set_msr_hv(vcpu, __kvmppc_get_msr_hv(vcpu) | MSR_EE);
vcpu->arch.ceded = 1;
smp_mb();
if (vcpu->arch.prodded) {
@@ -1589,7 +1589,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * That can happen due to a bug, or due to a machine check
 * occurring at just the wrong time.
 */
-   if (vcpu->arch.shregs.msr & MSR_HV) {
+   if (__kvmppc_get_msr_hv(vcpu) & MSR_HV) {
printk(KERN_EMERG "KVM trap in HV mode!\n");
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
vcpu->arch.trap, kvmppc_get_pc(vcpu),
@@ -1640,7 +1640,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * so that it knows that the machine check occurred.
 */
if (!vcpu->kvm->arch.fwnmi_enabled) {
-   ulong flags = (vcpu->arch.shregs.msr & 0x083c) |
+   ulong flags = (__kvmppc_get_msr_hv(vcpu) & 0x083c) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_machine_check(vcpu, flags);
r = RESUME_GUEST;
@@ -1670,7 +1670,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * as a result of a hypervisor emulation interrupt
 * (e40) getting turned into a 700 by BML RTAS.
 */
-   flags = (vcpu->arch.shregs.msr & 0x1full) |
+   flags = (__kvmppc_get_msr_hv(vcpu) & 0x1full) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_program(vcpu, flags);
r = RESUME_GUEST;
@@ -1680,7 +1680,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
{
int i;
 
-   if (unlikely(vcpu->arch.shregs.msr & MSR_PR)) {
+   if (unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
/*
 * Guest userspace executed sc 1. This can only be
 * reached by the P9 path because the old path
@@ -1758,7 +1758,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
  

[PATCH v5 06/11] KVM: PPC: Book3S HV: Use accessors for VCPU registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for Book3S HV VCPU registers. Use
the accessor functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
v5:
  - Remove unneeded trailing comment for line length
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
 arch/powerpc/kvm/book3s_hv.c   | 148 +
 arch/powerpc/kvm/book3s_hv.h   |  58 ++
 3 files changed, 139 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 5c71d6ae3a7b..ab646f59afd7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include "book3s_hv.h"
 #include 
 #include 
 #include 
@@ -294,9 +295,9 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t 
eaddr,
} else {
if (!(pte & _PAGE_PRIVILEGED)) {
/* Check AMR/IAMR to see if strict mode is in force */
-   if (vcpu->arch.amr & (1ul << 62))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 62))
gpte->may_read = 0;
-   if (vcpu->arch.amr & (1ul << 63))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 63))
gpte->may_write = 0;
if (vcpu->arch.iamr & (1ul << 62))
gpte->may_execute = 0;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 73d9a9eb376f..25025f6c4cce 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -868,7 +868,7 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
/* Guests can't breakpoint the hypervisor */
if ((value1 & CIABR_PRIV) == CIABR_PRIV_HYPER)
return H_P3;
-   vcpu->arch.ciabr  = value1;
+   kvmppc_set_ciabr_hv(vcpu, value1);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR0:
if (!kvmppc_power8_compatible(vcpu))
@@ -879,8 +879,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr0  = value1;
-   vcpu->arch.dawrx0 = value2;
+   kvmppc_set_dawr0_hv(vcpu, value1);
+   kvmppc_set_dawrx0_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR1:
if (!kvmppc_power8_compatible(vcpu))
@@ -895,8 +895,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr1  = value1;
-   vcpu->arch.dawrx1 = value2;
+   kvmppc_set_dawr1_hv(vcpu, value1);
+   kvmppc_set_dawrx1_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_ADDR_TRANS_MODE:
/*
@@ -1548,7 +1548,7 @@ static int kvmppc_pmu_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_PM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_PM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_PM);
 
return RESUME_GUEST;
 }
@@ -1558,7 +1558,7 @@ static int kvmppc_ebb_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_EBB))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_EBB;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_EBB);
 
return RESUME_GUEST;
 }
@@ -1568,7 +1568,7 @@ static int kvmppc_tm_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_TM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_TM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_TM);
 
return RESUME_GUEST;
 }
@@ -1867,7 +1867,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * Otherwise, we just generate a program interrupt to the guest.
 */
case BOOK3S_INTERRUPT_H_FAC_UNAVAIL: {
-   u64 cause = vcpu->arch.hfscr >> 56;
+   u64 cause = kvmppc_get_hfscr_hv(vcpu) >> 56;
 
r = EMULATE_FAIL;
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
@@ -2211,64 +2211,64 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->arch.dabrx);
break;
case KVM_REG_PPC_DSCR:
-   *val = 

[PATCH v5 05/11] KVM: PPC: Use accessors for VCORE registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for VCORE registers. Use the
accessor functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
  - Remove _hv suffix
  - Do not generate for setter arch_compat and lpcr
---
 arch/powerpc/include/asm/kvm_book3s.h | 25 -
 arch/powerpc/kvm/book3s_hv.c  | 24 
 arch/powerpc/kvm/book3s_hv_ras.c  |  4 ++--
 arch/powerpc/kvm/book3s_xive.c|  4 +---
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 1a220cd63227..4c6558d5fefe 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -483,6 +483,29 @@ KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
 
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size)\
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size)\
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR(reg, size)
\
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size) \
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size) \
+
+
+KVMPPC_BOOK3S_VCORE_ACCESSOR(vtb, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(arch_compat, 32)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(lpcr, 64)
+
 static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.dec_expires;
@@ -496,7 +519,7 @@ static inline void kvmppc_set_dec_expires(struct kvm_vcpu 
*vcpu, u64 val)
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27faecad1e3b..73d9a9eb376f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -794,7 +794,7 @@ static void kvmppc_update_vpa_dispatch(struct kvm_vcpu 
*vcpu,
 
vpa->enqueue_dispatch_tb = 
cpu_to_be64(be64_to_cpu(vpa->enqueue_dispatch_tb) + stolen);
 
-   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + vc->tb_offset, 
stolen);
+   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + 
kvmppc_get_tb_offset(vcpu), stolen);
 
vcpu->arch.vpa.dirty = true;
 }
@@ -845,9 +845,9 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 
 static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu)
 {
-   if (vcpu->arch.vcore->arch_compat >= PVR_ARCH_207)
+   if (kvmppc_get_arch_compat(vcpu) >= PVR_ARCH_207)
return true;
-   if ((!vcpu->arch.vcore->arch_compat) &&
+   if ((!kvmppc_get_arch_compat(vcpu)) &&
cpu_has_feature(CPU_FTR_ARCH_207S))
return true;
return false;
@@ -2283,7 +2283,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->arch.vcore->dpdes);
break;
case KVM_REG_PPC_VTB:
-   *val = get_reg_val(id, vcpu->arch.vcore->vtb);
+   *val = get_reg_val(id, kvmppc_get_vtb(vcpu));
break;
case KVM_REG_PPC_DAWR:
*val = get_reg_val(id, vcpu->arch.dawr0);
@@ -2342,11 +2342,11 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
	spin_unlock(&vcpu->arch.vpa_update_lock);
break;
case KVM_REG_PPC_TB_OFFSET:
-   *val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
+   *val = get_reg_val(id, kvmppc_get_tb_offset(vcpu));
break;
case KVM_REG_PPC_LPCR:
case KVM_REG_PPC_LPCR_64:
-   *val = get_reg_val(id, vcpu->arch.vcore->lpcr);
+   *val = get_reg_val(id, kvmppc_get_lpcr(vcpu));
break;
case KVM_REG_PPC_PPR:
*val = get_reg_val(id, vcpu->arch.ppr);
@@ -2418,7 +2418,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
break;
 #endif
case KVM_REG_PPC_ARCH_COMPAT:
-   *val = 

[PATCH v5 04/11] KVM: PPC: Use accessors for VCPU registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for VCPU registers. Use the
accessor functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h  | 37 +-
 arch/powerpc/kvm/book3s.c  | 22 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  4 +--
 arch/powerpc/kvm/book3s_hv.c   | 12 -
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |  4 +--
 arch/powerpc/kvm/powerpc.c |  4 +--
 6 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 109a5f56767a..1a220cd63227 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -458,10 +458,45 @@ static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, 
u32 val)
 }
 #endif
 
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size) \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size) \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR(reg, size) \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size)  \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size)  \
+
+KVMPPC_BOOK3S_VCPU_ACCESSOR(pid, 32)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(tar, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbhr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbrr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index c080dd2e96ac..6cd20ab9e94e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -565,7 +565,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
regs->msr = kvmppc_get_msr(vcpu);
regs->srr0 = kvmppc_get_srr0(vcpu);
regs->srr1 = kvmppc_get_srr1(vcpu);
-   regs->pid = vcpu->arch.pid;
+   regs->pid = kvmppc_get_pid(vcpu);
regs->sprg0 = kvmppc_get_sprg0(vcpu);
regs->sprg1 = kvmppc_get_sprg1(vcpu);
regs->sprg2 = kvmppc_get_sprg2(vcpu);
@@ -683,19 +683,19 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
*val = get_reg_val(id, vcpu->arch.fscr);
break;
case KVM_REG_PPC_TAR:
-   *val = get_reg_val(id, vcpu->arch.tar);
+   *val = get_reg_val(id, kvmppc_get_tar(vcpu));
break;
case KVM_REG_PPC_EBBHR:
-   *val = get_reg_val(id, vcpu->arch.ebbhr);
+   *val = get_reg_val(id, kvmppc_get_ebbhr(vcpu));
break;
case KVM_REG_PPC_EBBRR:
-   *val = get_reg_val(id, vcpu->arch.ebbrr);
+   *val = get_reg_val(id, kvmppc_get_ebbrr(vcpu));
break;
case KVM_REG_PPC_BESCR:
-   *val = get_reg_val(id, vcpu->arch.bescr);
+   *val = get_reg_val(id, kvmppc_get_bescr(vcpu));
break;
case KVM_REG_PPC_IC:
-   *val = get_reg_val(id, vcpu->arch.ic);
+   *val = get_reg_val(id, kvmppc_get_ic(vcpu));
break;
default:
r = -EINVAL;
@@ -768,19 +768,19 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
kvmppc_set_fpscr(vcpu, set_reg_val(id, *val));
break;
case KVM_REG_PPC_TAR:
-   vcpu->arch.tar = set_reg_val(id, *val);
+   kvmppc_set_tar(vcpu, 

[PATCH v5 03/11] KVM: PPC: Rename accessor generator macros

2023-09-13 Thread Jordan Niethe
More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Fix ACESSOR typo
---
 arch/powerpc/include/asm/kvm_ppc.h | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d16d80ad2ae4..d554bc56e7f3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -927,19 +927,19 @@ static inline bool kvmppc_shared_big_endian(struct 
kvm_vcpu *vcpu)
 #endif
 }
 
-#define SPRNG_WRAPPER_GET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)   \
 static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu)\
 {  \
return mfspr(bookehv_spr);  \
 }  \
 
-#define SPRNG_WRAPPER_SET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)   \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val)  \
 {  \
mtspr(bookehv_spr, val);
\
 }  \
 
-#define SHARED_WRAPPER_GET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size)
\
 static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -948,7 +948,7 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu)   \
   return le##size##_to_cpu(vcpu->arch.shared->reg);\
 }  \
 
-#define SHARED_WRAPPER_SET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size)
\
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -957,36 +957,36 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
*vcpu, u##size val)   \
   vcpu->arch.shared->reg = cpu_to_le##size(val);   \
 }  \
 
-#define SHARED_WRAPPER(reg, size)  \
-   SHARED_WRAPPER_GET(reg, size)   \
-   SHARED_WRAPPER_SET(reg, size)   \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size)\
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size) \
 
-#define SPRNG_WRAPPER(reg, bookehv_spr)
\
-   SPRNG_WRAPPER_GET(reg, bookehv_spr) \
-   SPRNG_WRAPPER_SET(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)   \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)\
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)\
 
 #ifdef CONFIG_KVM_BOOKE_HV
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SPRNG_WRAPPER(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, bookehv_spr) \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)\
 
 #else
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SHARED_WRAPPER(reg, size)   \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, bookehv_spr) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size) \
 
 #endif
 
-SHARED_WRAPPER(critical, 64)
-SHARED_SPRNG_WRAPPER(sprg0, 64, SPRN_GSPRG0)
-SHARED_SPRNG_WRAPPER(sprg1, 64, SPRN_GSPRG1)
-SHARED_SPRNG_WRAPPER(sprg2, 64, SPRN_GSPRG2)
-SHARED_SPRNG_WRAPPER(sprg3, 64, SPRN_GSPRG3)
-SHARED_SPRNG_WRAPPER(srr0, 64, SPRN_GSRR0)
-SHARED_SPRNG_WRAPPER(srr1, 64, SPRN_GSRR1)
-SHARED_SPRNG_WRAPPER(dar, 64, SPRN_GDEAR)
-SHARED_SPRNG_WRAPPER(esr, 64, SPRN_GESR)
-SHARED_WRAPPER_GET(msr, 64)
+KVMPPC_VCPU_SHARED_REGS_ACCESSOR(critical, 64)

[PATCH v5 02/11] KVM: PPC: Introduce FPR/VR accessor functions

2023-09-13 Thread Jordan Niethe
Introduce accessor functions for floating point and vector registers
like the ones that exist for GPRs. Use these to replace the existing FPR
and VR accessor macros.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.
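As a quick orientation (illustrative usage only, not part of the patch),
callers move from poking vcpu->arch.fp/vr directly, or via the old
VCPU_FPR() macro, to the new accessors; vector registers are passed by
reference:

  u64 fpr = kvmppc_get_fpr(vcpu, 0);        /* was VCPU_FPR(vcpu, 0) */
  kvmppc_set_fpr(vcpu, 0, fpr);

  #ifdef CONFIG_ALTIVEC
  vector128 vr;
  kvmppc_get_vsx_vr(vcpu, 0, &vr);          /* vectors go by reference */
  kvmppc_set_vsx_vr(vcpu, 0, &vr);
  #endif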

Signed-off-by: Gautam Menghani 
Signed-off-by: Jordan Niethe 
---
v3:
  - Gautam: Pass vector elements by reference
v4:
  - Split into unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h | 55 
 arch/powerpc/include/asm/kvm_booke.h  | 10 
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 72 +--
 5 files changed, 110 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..109a5f56767a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -403,6 +403,61 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j,
+ u64 val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
+#ifdef CONFIG_ALTIVEC
+static inline void kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
*v)
+{
+   *v = vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i,
+vector128 *val)
+{
+   vcpu->arch.vr.vr[i] = *val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+#endif
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..c080dd2e96ac 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+   kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
  

[PATCH v5 01/11] KVM: PPC: Always use the GPR accessors

2023-09-13 Thread Jordan Niethe
Always use the GPR accessor functions. This will be important later for
Nested APIv2 support which requires additional functionality for
accessing and modifying VCPU state.
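The conversion pattern is mechanical; shown here for orientation only, the
diff below is authoritative:

  vcpu->arch.regs.gpr[4] = 0;        /* before: direct register access   */
  kvmppc_set_gpr(vcpu, 4, 0);        /* after: accessor, which a later   */
                                     /* backend can hook, e.g. to mark   */
                                     /* L2 state dirty for the L0        */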

Signed-off-by: Jordan Niethe 
---
v4:
  - Split into unique patch
---
 arch/powerpc/kvm/book3s_64_vio.c | 4 ++--
 arch/powerpc/kvm/book3s_hv.c | 8 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 8 
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 4 ++--
 arch/powerpc/kvm/book3s_xive.c   | 4 ++--
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 93b695b289e9..4ba048f272f2 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -786,12 +786,12 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned 
long liobn,
idx = (ioba >> stt->page_shift) - stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
if (!page) {
-   vcpu->arch.regs.gpr[4] = 0;
+   kvmppc_set_gpr(vcpu, 4, 0);
return H_SUCCESS;
}
tbl = (u64 *)page_address(page);
 
-   vcpu->arch.regs.gpr[4] = tbl[idx % TCES_PER_PAGE];
+   kvmppc_set_gpr(vcpu, 4, tbl[idx % TCES_PER_PAGE]);
 
return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..4af5b68cf7f8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1267,10 +1267,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_HOST;
break;
 #endif
-   case H_RANDOM:
-   if (!arch_get_random_seed_longs(&vcpu->arch.regs.gpr[4], 1))
+   case H_RANDOM: {
+   unsigned long rand;
+
+   if (!arch_get_random_seed_longs(&rand, 1))
ret = H_HARDWARE;
+   kvmppc_set_gpr(vcpu, 4, rand);
break;
+   }
case H_RPT_INVALIDATE:
ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
  kvmppc_get_gpr(vcpu, 5),
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 0f5b021fa559..f3afe194e616 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -182,9 +182,13 @@ EXPORT_SYMBOL_GPL(kvmppc_hwrng_present);
 
 long kvmppc_rm_h_random(struct kvm_vcpu *vcpu)
 {
+   unsigned long rand;
+
if (ppc_md.get_random_seed &&
-   ppc_md.get_random_seed(&vcpu->arch.regs.gpr[4]))
+   ppc_md.get_random_seed(&rand)) {
+   kvmppc_set_gpr(vcpu, 4, rand);
return H_SUCCESS;
+   }
 
return H_HARDWARE;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9182324dbef9..17cb75a127b0 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -776,8 +776,8 @@ long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long 
flags,
r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
r &= ~HPTE_GR_RESERVED;
}
-   vcpu->arch.regs.gpr[4 + i * 2] = v;
-   vcpu->arch.regs.gpr[5 + i * 2] = r;
+   kvmppc_set_gpr(vcpu, 4 + i * 2, v);
+   kvmppc_set_gpr(vcpu, 5 + i * 2, r);
}
return H_SUCCESS;
 }
@@ -824,7 +824,7 @@ long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned 
long flags,
}
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
@@ -872,7 +872,7 @@ long kvmppc_h_clear_mod(struct kvm_vcpu *vcpu, unsigned 
long flags,
kvmppc_set_dirty_from_hpte(kvm, v, gr);
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index e165bfa842bf..e42984878503 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -481,7 +481,7 @@ static void icp_rm_down_cppr(struct kvmppc_xics *xics, 
struct kvmppc_icp *icp,
 
 unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.regs.gpr[5] = get_tb();
+   kvmppc_set_gpr(vcpu, 5, get_tb());
return xics_rm_h_xirr(vcpu);
 }
 
@@ -518,7 +518,7 @@ unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu)
} while (!icp_rm_try_update(icp, old_state, new_state));
 
/* Return the result in GPR4 */
-   vcpu->arch.regs.gpr[4] = xirr;
+   kvmppc_set_gpr(vcpu, 4, xirr);
 
return check_too_hard(xics, icp);
 }
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 

[PATCH v5 00/11] KVM: PPC: Nested APIv2 guest support

2023-09-13 Thread Jordan Niethe


A nested-HV API for PAPR has been developed based on the KVM-specific
nested-HV API that is upstream in Linux/KVM and QEMU. The PAPR API had
to break compatibility to accommodate implementation in other
hypervisors and partitioning firmware. The existing KVM-specific API
will be known as the Nested APIv1 and the PAPR API will be known as the
Nested APIv2. 

The control flow and interrupt processing between L0, L1, and L2 in
the Nested APIv2 are conceptually unchanged. Where Nested APIv1 is almost
stateless, the Nested APIv2 is stateful, with the L1 registering L2 virtual
machines and vCPUs with the L0. Supervisor-privileged register switching
duty is now the responsibility of the L0, which holds the canonical L2
register state and handles all switching. This new register handling
motivates the "getters and setters" wrappers that assist in syncing the
L2's state between the L1 and the L0.

Broadly, the new hcalls will be used for creating and managing guests
by a regular partition in the following way:

  - L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES

  - L1 requests the L0 create a L2 with
H_GUEST_CREATE and receives a handle to use in future hcalls

  - L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU

  - L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

  - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN

  - L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

  - L1 handles the exit using H_GUEST_GET_STATE if necessary

  - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

  - L1 frees the L2 in the L0 with H_GUEST_DELETE

Further details are available in Documentation/powerpc/kvm-nested.rst.
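Stitched together, the L1-side lifecycle looks roughly like the sketch
below. This is illustrative pseudocode only: the helper names and
signatures are stand-ins for the real plpar_guest_*() wrappers introduced
later in the series, and error handling is omitted.

  /* Hypothetical L1-side flow for one L2 vCPU (names are placeholders). */
  h_guest_set_capabilities(caps);                /* negotiate with the L0  */
  guest_id = h_guest_create();                   /* returns a guest handle */
  h_guest_create_vcpu(guest_id, 0);              /* register vCPU 0        */
  h_guest_set_state(guest_id, 0, &state_buf);    /* seed initial L2 state  */
  for (;;) {
          exit = h_guest_run_vcpu(guest_id, 0, &run_buf);
          if (!handle_exit(exit, &run_buf))      /* may use H_GUEST_GET_STATE */
                  break;
  }
  h_guest_delete(guest_id);                      /* free the L2 in the L0  */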

This series adds KVM support for using this hcall interface as a regular
PAPR partition, i.e. the L1. It does not add support for running as the
L0.

The new hcalls have been implemented in the spapr qemu model for
testing.

This is available at https://github.com/planetharsh/qemu/tree/upstream-0714-kop

There are scripts available to assist in setting up an environment for
testing nested guests at https://github.com/iamjpn/kvm-powervm-test

A tree with this series is available at
https://github.com/iamjpn/linux/tree/features/kvm-nestedv2-v5

Thanks to Amit Machhiwal, Kautuk Consul, Vaibhav Jain, Michael Neuling,
Shivaprasad Bhat, Harsh Prateek Bora, Paul Mackerras and Nicholas
Piggin.

Change overview in v5:
  - KVM: PPC: Add helper library for Guest State Buffers:
- Fix mismatched function comments
  - KVM: PPC: Add support for nestedv2 guests:
- Check H_BUSY for {g,s}etting capabilities
- Message if plpar_guest_get_capabilities() fails and nestedv1
  support will be attempted.
- Remove unused amor variable
  - KVM: PPC: Book3S HV: Use accessors for VCPU registers:
- Remove unneeded trailing comment for line length


Change overview in v4:
  - Split previous "KVM: PPC: Use getters and setters for vcpu register
state" into a number of seperate patches
- Remove _hv suffix from VCORE wrappers
- Do not create arch_compat and lpcr setters, use the existing ones
- Use #ifdef ALTIVEC
  - KVM: PPC: Rename accessor generator macros
- Fix typo
  - KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
- Use u64
- Change format strings instead of casting
  - KVM: PPC: Add support for nestedv2 guests
- Batch H_GUEST_GET calls in kvmhv_nestedv2_reload_ptregs()
- Fix compile without CONFIG_PSERIES
- Fix maybe uninitialized 'trap' in kvmhv_p9_guest_entry()
- Extend existing setters for arch_compat and lpcr


Change overview in v3:
  - KVM: PPC: Use getters and setters for vcpu register state
  - Do not add a helper for pvr
  - Use an expression when declaring variable in case
  - Squash in all getters and setters
  - Pass vector registers by reference
  - KVM: PPC: Rename accessor generator macros
  - New to series
  - KVM: PPC: Add helper library for Guest State Buffers
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
  - KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
  - New to series
  - KVM: PPC: Add support for nestedv2 guests
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  

Re: [PATCH] powerpc: Export kvm_guest static key, for bcachefs six locks

2023-09-13 Thread Michael Ellerman
Kent Overstreet  writes:
> bcachefs's six locks need kvm_guest, via
>  owner_on_cpu() -> vcpu_is_preempted() -> is_kvm_guest()
>
> Signed-off-by: Kent Overstreet 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/kernel/firmware.c | 2 ++
>  1 file changed, 2 insertions(+)

Acked-by: Michael Ellerman  (powerpc)

I'm happy for you to take this via your tree.

cheers

> diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
> index 20328f72f9f2..8987eee33dc8 100644
> --- a/arch/powerpc/kernel/firmware.c
> +++ b/arch/powerpc/kernel/firmware.c
> @@ -23,6 +23,8 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
>  
>  #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
>  DEFINE_STATIC_KEY_FALSE(kvm_guest);
> +EXPORT_SYMBOL_GPL(kvm_guest);
> +
>  int __init check_kvm_guest(void)
>  {
>   struct device_node *hyper_node;
> -- 
> 2.40.1


[RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access

2023-09-13 Thread Sean Christopherson
From: Ackerley Tng 

"Testing private access when memslot gets deleted" tests the behavior
of KVM when a private memslot gets deleted while the VM is using the
private memslot. When KVM looks up the deleted (slot = NULL) memslot,
KVM should exit to userspace with KVM_EXIT_MEMORY_FAULT.

In the second test, upon a private access to non-private memslot, KVM
should also exit to userspace with KVM_EXIT_MEMORY_FAULT.

sean: These testcases belong in set_memory_region_test.c, they're private
variants on existing testcases and aren't as robust, e.g. don't ensure
the vCPU is actually running and accessing memory when converting and
deleting.

Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   | 121 ++
 2 files changed, 122 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index 2b1ef809d73a..f7fdd8244547 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -82,6 +82,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
new file mode 100644
index ..1a61c51c2390
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#include 
+#include 
+#include 
+
+#include "kvm_util.h"
+#include "processor.h"
+#include "test_util.h"
+
+/* Arbitrarily selected to avoid overlaps with anything else */
+#define EXITS_TEST_GVA 0xc000
+#define EXITS_TEST_GPA EXITS_TEST_GVA
+#define EXITS_TEST_NPAGES 1
+#define EXITS_TEST_SIZE (EXITS_TEST_NPAGES * PAGE_SIZE)
+#define EXITS_TEST_SLOT 10
+
+static uint64_t guest_repeatedly_read(void)
+{
+   volatile uint64_t value;
+
+   while (true)
+   value = *((uint64_t *) EXITS_TEST_GVA);
+
+   return value;
+}
+
+static uint32_t run_vcpu_get_exit_reason(struct kvm_vcpu *vcpu)
+{
+   int r;
+
+   r = _vcpu_run(vcpu);
+   if (r) {
+   TEST_ASSERT(errno == EFAULT, KVM_IOCTL_ERROR(KVM_RUN, r));
+   TEST_ASSERT_EQ(vcpu->run->exit_reason, KVM_EXIT_MEMORY_FAULT);
+   }
+   return vcpu->run->exit_reason;
+}
+
+const struct vm_shape protected_vm_shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+};
+
+static void test_private_access_memslot_deleted(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   pthread_t vm_thread;
+   void *thread_return;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+   EXITS_TEST_GPA, EXITS_TEST_SLOT,
+   EXITS_TEST_NPAGES,
+   KVM_MEM_PRIVATE);
+
+   virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
+
+   /* Request to access page privately */
+   vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+
+   pthread_create(&vm_thread, NULL,
+  (void *(*)(void *))run_vcpu_get_exit_reason,
+  (void *)vcpu);
+
+   vm_mem_region_delete(vm, EXITS_TEST_SLOT);
+
+   pthread_join(vm_thread, &thread_return);
+   exit_reason = (uint32_t)(uint64_t)thread_return;
+
+   TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+   TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+
+   kvm_vm_free(vm);
+}
+
+static void test_private_access_memslot_not_private(void)
+{
+   struct kvm_vm *vm;
+   struct kvm_vcpu *vcpu;
+   uint32_t exit_reason;
+
+   vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
+  guest_repeatedly_read);
+
+   /* Add a non-private memslot (flags = 0) */
+   vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+   EXITS_TEST_GPA, EXITS_TEST_SLOT,
+   EXITS_TEST_NPAGES, 0);
+
+   virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);

[RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd()

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Add a selftest to verify the basic functionality of guest_memfd():

+ file descriptor created with the guest_memfd() ioctl does not allow
  read/write/mmap operations
+ file size and block size as returned from fstat are as expected
+ fallocate on the fd checks that offset/length on
  fallocate(FALLOC_FL_PUNCH_HOLE) should be page aligned
+ invalid inputs (misaligned size, invalid flags) are rejected

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 165 ++
 2 files changed, 166 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index b709a52d5cdb..2b1ef809d73a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -124,6 +124,7 @@ TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
b/tools/testing/selftests/kvm/guest_memfd_test.c
new file mode 100644
index ..75073645aaa1
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -0,0 +1,165 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright Intel Corporation, 2023
+ *
+ * Author: Chao Peng 
+ */
+
+#define _GNU_SOURCE
+#include "test_util.h"
+#include "kvm_util_base.h"
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static void test_file_read_write(int fd)
+{
+   char buf[64];
+
+   TEST_ASSERT(read(fd, buf, sizeof(buf)) < 0,
+   "read on a guest_mem fd should fail");
+   TEST_ASSERT(write(fd, buf, sizeof(buf)) < 0,
+   "write on a guest_mem fd should fail");
+   TEST_ASSERT(pread(fd, buf, sizeof(buf), 0) < 0,
+   "pread on a guest_mem fd should fail");
+   TEST_ASSERT(pwrite(fd, buf, sizeof(buf), 0) < 0,
+   "pwrite on a guest_mem fd should fail");
+}
+
+static void test_mmap(int fd, size_t page_size)
+{
+   char *mem;
+
+   mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+   TEST_ASSERT_EQ(mem, MAP_FAILED);
+}
+
+static void test_file_size(int fd, size_t page_size, size_t total_size)
+{
+   struct stat sb;
+   int ret;
+
+   ret = fstat(fd, &sb);
+   TEST_ASSERT(!ret, "fstat should succeed");
+   TEST_ASSERT_EQ(sb.st_size, total_size);
+   TEST_ASSERT_EQ(sb.st_blksize, page_size);
+}
+
+static void test_fallocate(int fd, size_t page_size, size_t total_size)
+{
+   int ret;
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, total_size);
+   TEST_ASSERT(!ret, "fallocate with aligned offset and size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size - 1, page_size);
+   TEST_ASSERT(ret, "fallocate with unaligned offset should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size, page_size);
+   TEST_ASSERT(ret, "fallocate beginning at total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, total_size + page_size, page_size);
+   TEST_ASSERT(ret, "fallocate beginning after total_size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) at total_size should succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   total_size + page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) after total_size should 
succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size - 1);
+   TEST_ASSERT(ret, "fallocate with unaligned size should fail");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
+   page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate(PUNCH_HOLE) with aligned offset and size 
should succeed");
+
+   ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, page_size, page_size);
+   TEST_ASSERT(!ret, "fallocate to restore punched hole should succeed");
+}
+
+static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
+{
+   uint64_t valid_flags = 0;
+   size_t page_size = getpagesize();
+   uint64_t flag;
+   size_t size;
+   int fd;
+
+   for (size = 1; size < 

[RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd()

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Expand set_memory_region_test to exercise various positive and negative
testcases for private memory.

 - Non-guest_memfd() file descriptor for private memory
 - guest_memfd() from different VM
 - Overlapping bindings
 - Unaligned bindings

Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
[sean: trim the testcases to remove duplicate coverage]
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  10 ++
 .../selftests/kvm/set_memory_region_test.c| 100 ++
 2 files changed, 110 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index edc0f380acc0..ac9356108df6 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -819,6 +819,16 @@ static inline struct kvm_vm *vm_create_barebones(void)
return vm_create(VM_SHAPE_DEFAULT);
 }
 
+static inline struct kvm_vm *vm_create_barebones_protected_vm(void)
+{
+   const struct vm_shape shape = {
+   .mode = VM_MODE_DEFAULT,
+   .type = KVM_X86_SW_PROTECTED_VM,
+   };
+
+   return vm_create(shape);
+}
+
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c 
b/tools/testing/selftests/kvm/set_memory_region_test.c
index b32960189f5f..ca83e3307a98 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -385,6 +385,98 @@ static void test_add_max_memory_regions(void)
kvm_vm_free(vm);
 }
 
+
+static void test_invalid_guest_memfd(struct kvm_vm *vm, int memfd,
+size_t offset, const char *msg)
+{
+   int r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+MEM_REGION_GPA, MEM_REGION_SIZE,
+0, memfd, offset);
+   TEST_ASSERT(r == -1 && errno == EINVAL, "%s", msg);
+}
+
+static void test_add_private_memory_region(void)
+{
+   struct kvm_vm *vm, *vm2;
+   int memfd, i;
+
+   pr_info("Testing ADD of KVM_MEM_PRIVATE memory regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail");
+   test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail");
+
+   memfd = kvm_memfd_alloc(MEM_REGION_SIZE, false);
+   test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail");
+   close(memfd);
+
+   vm2 = vm_create_barebones_protected_vm();
+   memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0);
+   test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should 
fail");
+
+   vm_set_user_memory_region2(vm2, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+   close(memfd);
+   kvm_vm_free(vm2);
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE, 0);
+   for (i = 1; i < PAGE_SIZE; i++)
+   test_invalid_guest_memfd(vm, memfd, i, "Unaligned offset should fail");
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE, 0, memfd, 0);
+   close(memfd);
+
+   kvm_vm_free(vm);
+}
+
+static void test_add_overlapping_private_memory_regions(void)
+{
+   struct kvm_vm *vm;
+   int memfd;
+   int r;
+
+   pr_info("Testing ADD of overlapping KVM_MEM_PRIVATE memory regions\n");
+
+   vm = vm_create_barebones_protected_vm();
+
+   memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, MEM_REGION_SIZE * 2, 0, memfd, 0);
+
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT + 1, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA * 2, MEM_REGION_SIZE * 2,
+  0, memfd, MEM_REGION_SIZE * 2);
+
+   /*
+* Delete the first memslot, and then attempt to recreate it except
+* with a "bad" offset that results in overlap in the guest_memfd().
+*/
+   vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+  MEM_REGION_GPA, 0, NULL, -1, 0);
+
+   /* Overlap the front half of the other slot. */
+   r = __vm_set_user_memory_region2(vm, MEM_REGION_SLOT, KVM_MEM_PRIVATE,
+MEM_REGION_GPA * 2 - MEM_REGION_SIZE,
+MEM_REGION_SIZE * 2,
+0, memfd, 0);
+   TEST_ASSERT(r == -1 && errno == EEXIST, "%s",
+  

[RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Add helpers to invoke KVM_SET_USER_MEMORY_REGION2 directly so that tests
can validate features that are unique to "version 2" of "set user
memory region", e.g. do negative testing on gmem_fd and gmem_offset.

Provide a raw version as well as an assert-success version to reduce
the amount of boilerplate code needed for basic usage.

Signed-off-by: Chao Peng 
Signed-off-by: Ackerley Tng 
---
 .../selftests/kvm/include/kvm_util_base.h |  7 +
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 +++
 2 files changed, 36 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index b608fbb832d5..edc0f380acc0 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -522,6 +522,13 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
uint64_t gpa, uint64_t size, void *hva);
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+   uint32_t flags, uint64_t gpa, uint64_t size,
+   void *hva, uint32_t gmem_fd, uint64_t gmem_offset);
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+uint32_t flags, uint64_t gpa, uint64_t size,
+void *hva, uint32_t gmem_fd, uint64_t gmem_offset);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 68afea10b469..8fc70c021c1c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -873,6 +873,35 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags,
errno, strerror(errno));
 }
 
+int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+uint32_t flags, uint64_t gpa, uint64_t size,
+void *hva, uint32_t gmem_fd, uint64_t gmem_offset)
+{
+   struct kvm_userspace_memory_region2 region = {
+   .slot = slot,
+   .flags = flags,
+   .guest_phys_addr = gpa,
+   .memory_size = size,
+   .userspace_addr = (uintptr_t)hva,
+   .gmem_fd = gmem_fd,
+   .gmem_offset = gmem_offset,
+   };
+
+   return ioctl(vm->fd, KVM_SET_USER_MEMORY_REGION2, &region);
+}
+
+void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot,
+   uint32_t flags, uint64_t gpa, uint64_t size,
+   void *hva, uint32_t gmem_fd, uint64_t gmem_offset)
+{
+   int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
+  gmem_fd, gmem_offset);
+
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
+   errno, strerror(errno));
+}
+
+
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions

2023-09-13 Thread Sean Christopherson
From: Vishal Annapurve 

Add a selftest to exercise implicit/explicit conversion functionality
within KVM and verify:

 - Shared memory is visible to host userspace
 - Private memory is not visible to host userspace
 - Host userspace and guest can communicate over shared memory
 - Data in shared backing is preserved across conversions (test's
   host userspace doesn't free the data)
 - Private memory is bound to the lifetime of the VM

Ideally, KVM's selftests infrastructure would be reworked to allow backing
a single region of guest memory with multiple memslots for _all_ backing
types and shapes, i.e. ideally the code for using a single backing fd
across multiple memslots would work for "regular" memory as well.  But
sadly, support for KVM_CREATE_GUEST_MEMFD has languished for far too long,
and overhauling selftests' memslots infrastructure would likely open a can
of worms, i.e. delay things even further.

Signed-off-by: Vishal Annapurve 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/Makefile  |   1 +
 .../kvm/x86_64/private_mem_conversions_test.c | 410 ++
 2 files changed, 411 insertions(+)
 create mode 100644 
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index a3bb36fb3cfc..b709a52d5cdb 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -81,6 +81,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
 TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
+TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
new file mode 100644
index ..50541246d6fd
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2022, Google LLC.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define BASE_DATA_SLOT 10
+#define BASE_DATA_GPA  ((uint64_t)(1ull << 32))
+#define PER_CPU_DATA_SIZE  ((uint64_t)(SZ_2M + PAGE_SIZE))
+
+/* Horrific macro so that the line info is captured accurately :-( */
+#define memcmp_g(gpa, pattern, size)					\
+do {									\
+	uint8_t *mem = (uint8_t *)gpa;					\
+	size_t i;							\
+									\
+	for (i = 0; i < size; i++)					\
+		__GUEST_ASSERT(mem[i] == pattern,			\
+			       "Expected 0x%x at offset %lu (gpa 0x%llx), got 0x%x", \
+			       pattern, i, gpa + i, mem[i]);		\
+} while (0)
+
+static void memcmp_h(uint8_t *mem, uint8_t pattern, size_t size)
+{
+   size_t i;
+
+   for (i = 0; i < size; i++)
+   TEST_ASSERT(mem[i] == pattern,
+   "Expected 0x%x at offset %lu, got 0x%x",
+   pattern, i, mem[i]);
+}
+
+/*
+ * Run memory conversion tests with explicit conversion:
+ * Execute KVM hypercall to map/unmap gpa range which will cause userspace exit
+ * to back/unback private memory. Subsequent accesses by guest to the gpa range
+ * will not cause exit to userspace.
+ *
+ * Test memory conversion scenarios with following steps:
+ * 1) Access private memory using private access and verify that memory 
contents
+ *   are not visible to userspace.
+ * 2) Convert memory to shared using explicit conversions and ensure that
+ *   userspace is able to access the shared regions.
+ * 3) Convert memory back to private using explicit conversions and ensure that
+ *   userspace is again not able to access converted private regions.
+ */
+
+#define GUEST_STAGE(o, s) { .offset = o, .size = s }
+
+enum ucall_syncs {
+   SYNC_SHARED,
+   SYNC_PRIVATE,
+};
+
+static void guest_sync_shared(uint64_t gpa, uint64_t size,
+ uint8_t current_pattern, uint8_t 

[RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data

2023-09-13 Thread Sean Christopherson
Add GUEST_SYNC[1-6]() so that tests can pass the maximum amount of
information supported via ucall(), without needing to resort to shared
memory.
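A usage sketch (illustrative only; the argument choices are hypothetical):

  /* Guest side: report four values in a single ucall. */
  GUEST_SYNC4(gpa, size, current_pattern, new_pattern);

  /* Host side: pull the values back out of the ucall. */
  struct ucall uc;

  get_ucall(vcpu, &uc);
  TEST_ASSERT_EQ(uc.cmd, UCALL_SYNC);
  /* uc.args[0..3] now hold gpa, size, current_pattern, new_pattern. */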

Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/include/ucall_common.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/ucall_common.h 
b/tools/testing/selftests/kvm/include/ucall_common.h
index 112bc1da732a..7cf40aba7add 100644
--- a/tools/testing/selftests/kvm/include/ucall_common.h
+++ b/tools/testing/selftests/kvm/include/ucall_common.h
@@ -54,6 +54,17 @@ int ucall_nr_pages_required(uint64_t page_size);
 #define GUEST_SYNC_ARGS(stage, arg1, arg2, arg3, arg4) \
ucall(UCALL_SYNC, 6, "hello", stage, arg1, 
arg2, arg3, arg4)
 #define GUEST_SYNC(stage)  ucall(UCALL_SYNC, 2, "hello", stage)
+#define GUEST_SYNC1(arg0)  ucall(UCALL_SYNC, 1, arg0)
+#define GUEST_SYNC2(arg0, arg1)ucall(UCALL_SYNC, 2, arg0, arg1)
+#define GUEST_SYNC3(arg0, arg1, arg2) \
+   ucall(UCALL_SYNC, 3, arg0, arg1, arg2)
+#define GUEST_SYNC4(arg0, arg1, arg2, arg3) \
+   ucall(UCALL_SYNC, 4, arg0, arg1, arg2, arg3)
+#define GUEST_SYNC5(arg0, arg1, arg2, arg3, arg4) \
+   ucall(UCALL_SYNC, 5, arg0, arg1, arg2, arg3, arg4)
+#define GUEST_SYNC6(arg0, arg1, arg2, arg3, arg4, arg5) \
+   ucall(UCALL_SYNC, 6, arg0, arg1, arg2, arg3, arg4, arg5)
+
 #define GUEST_PRINTF(_fmt, _args...) ucall_fmt(UCALL_PRINTF, _fmt, ##_args)
 #define GUEST_DONE()   ucall(UCALL_DONE, 0)
 
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type

2023-09-13 Thread Sean Christopherson
Add a "vm_shape" structure to encapsulate the selftests-defined "mode",
along with the KVM-defined "type" for use when creating a new VM.  "mode"
tracks physical and virtual address properties, as well as the preferred
backing memory type, while "type" corresponds to the VM type.

Taking the VM type will allow adding tests for KVM_CREATE_GUEST_MEMFD,
a.k.a. guest private memory, without needing an entirely separate set of
helpers.  Guest private memory is effectively usable only by confidential
VM types, and it's expected that x86 will double down and require unique
VM types for TDX and SNP guests.
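For orientation (illustrative usage only, mirroring the helpers in the
diff below):

  /* Create a VM of an explicit type... */
  const struct vm_shape shape = {
          .mode = VM_MODE_DEFAULT,
          .type = KVM_X86_SW_PROTECTED_VM,
  };
  struct kvm_vm *vm = vm_create(shape);

  /* ...while existing callers are converted mechanically: */
  vm = __vm_create(VM_SHAPE(VM_MODE_DEFAULT), 1, 0);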

Signed-off-by: Sean Christopherson 
---
 tools/testing/selftests/kvm/dirty_log_test.c  |  2 +-
 .../selftests/kvm/include/kvm_util_base.h | 54 +++
 .../selftests/kvm/kvm_page_table_test.c   |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 43 +++
 tools/testing/selftests/kvm/lib/memstress.c   |  3 +-
 .../kvm/x86_64/ucna_injection_test.c  |  2 +-
 6 files changed, 72 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/kvm/dirty_log_test.c 
b/tools/testing/selftests/kvm/dirty_log_test.c
index 936f3a8d1b83..6cbecf499767 100644
--- a/tools/testing/selftests/kvm/dirty_log_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_test.c
@@ -699,7 +699,7 @@ static struct kvm_vm *create_vm(enum vm_guest_mode mode, 
struct kvm_vcpu **vcpu,
 
pr_info("Testing guest mode: %s\n", vm_guest_mode_string(mode));
 
-   vm = __vm_create(mode, 1, extra_mem_pages);
+   vm = __vm_create(VM_SHAPE(mode), 1, extra_mem_pages);
 
log_mode_create_vm_done(vm);
*vcpu = vm_vcpu_add(vm, 0, guest_code);
diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a0315503ac3e..b608fbb832d5 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -188,6 +188,23 @@ enum vm_guest_mode {
NUM_VM_MODES,
 };
 
+struct vm_shape {
+   enum vm_guest_mode mode;
+   unsigned int type;
+};
+
+#define VM_TYPE_DEFAULT0
+
+#define VM_SHAPE(__mode)   \
+({ \
+   struct vm_shape shape = {   \
+   .mode = (__mode),   \
+   .type = VM_TYPE_DEFAULT \
+   };  \
+   \
+   shape;  \
+})
+
 #if defined(__aarch64__)
 
 extern enum vm_guest_mode vm_mode_default;
@@ -220,6 +237,8 @@ extern enum vm_guest_mode vm_mode_default;
 
 #endif
 
+#define VM_SHAPE_DEFAULT   VM_SHAPE(VM_MODE_DEFAULT)
+
 #define MIN_PAGE_SIZE  (1U << MIN_PAGE_SHIFT)
 #define PTES_PER_MIN_PAGE  ptes_per_page(MIN_PAGE_SIZE)
 
@@ -784,21 +803,21 @@ vm_paddr_t vm_alloc_page_table(struct kvm_vm *vm);
  * __vm_create() does NOT create vCPUs, @nr_runnable_vcpus is used purely to
  * calculate the amount of memory needed for per-vCPU data, e.g. stacks.
  */
-struct kvm_vm *vm_create(enum vm_guest_mode mode);
-struct kvm_vm *__vm_create(enum vm_guest_mode mode, uint32_t nr_runnable_vcpus,
+struct kvm_vm *vm_create(struct vm_shape shape);
+struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
   uint64_t nr_extra_pages);
 
 static inline struct kvm_vm *vm_create_barebones(void)
 {
-   return vm_create(VM_MODE_DEFAULT);
+   return vm_create(VM_SHAPE_DEFAULT);
 }
 
 static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus)
 {
-   return __vm_create(VM_MODE_DEFAULT, nr_runnable_vcpus, 0);
+   return __vm_create(VM_SHAPE_DEFAULT, nr_runnable_vcpus, 0);
 }
 
-struct kvm_vm *__vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
+struct kvm_vm *__vm_create_with_vcpus(struct vm_shape shape, uint32_t nr_vcpus,
  uint64_t extra_mem_pages,
				      void *guest_code, struct kvm_vcpu *vcpus[]);
 
@@ -806,17 +825,27 @@ static inline struct kvm_vm 
*vm_create_with_vcpus(uint32_t nr_vcpus,
  void *guest_code,
  struct kvm_vcpu *vcpus[])
 {
-   return __vm_create_with_vcpus(VM_MODE_DEFAULT, nr_vcpus, 0,
+   return __vm_create_with_vcpus(VM_SHAPE_DEFAULT, nr_vcpus, 0,
  guest_code, vcpus);
 }
 
+
+struct kvm_vm *__vm_create_shape_with_one_vcpu(struct vm_shape shape,
+  struct kvm_vcpu **vcpu,
+  uint64_t extra_mem_pages,
+  void *guest_code);
+
 /*
  * Create a VM with a single vCPU with reasonable defaults and @extra_mem_pages
  * additional pages of guest memory.  Returns 

[RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86)

2023-09-13 Thread Sean Christopherson
From: Vishal Annapurve 

Add helpers for x86 guests to invoke the KVM_HC_MAP_GPA_RANGE hypercall,
which KVM will forward to userspace and thus can be used by tests to
coordinate private<=>shared conversions between host userspace code and
guest code.
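Illustrative guest-side usage (the flag names come from the x86 kvm_para.h
UAPI and are assumed to be available in the selftests build):

  /* Ask the host to convert a range to private, then back to shared. */
  kvm_hypercall_map_gpa_range(gpa, size, KVM_MAP_GPA_RANGE_ENCRYPTED);
  kvm_hypercall_map_gpa_range(gpa, size, KVM_MAP_GPA_RANGE_DECRYPTED);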

Signed-off-by: Vishal Annapurve 
[sean: drop shared/private helpers (let tests specify flags)]
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/x86_64/processor.h  | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h 
b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 4fd042112526..1911c12d5bad 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include "../kvm_util.h"
@@ -1171,6 +1172,20 @@ uint64_t kvm_hypercall(uint64_t nr, uint64_t a0, 
uint64_t a1, uint64_t a2,
 uint64_t __xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 void xen_hypercall(uint64_t nr, uint64_t a0, void *a1);
 
+static inline uint64_t __kvm_hypercall_map_gpa_range(uint64_t gpa,
+						     uint64_t size, uint64_t flags)
+{
+   return kvm_hypercall(KVM_HC_MAP_GPA_RANGE, gpa, size >> PAGE_SHIFT, flags, 0);
+}
+
+static inline void kvm_hypercall_map_gpa_range(uint64_t gpa, uint64_t size,
+  uint64_t flags)
+{
+   uint64_t ret = __kvm_hypercall_map_gpa_range(gpa, size, flags);
+
+   GUEST_ASSERT(!ret);
+}
+
 void __vm_xsave_require_permission(uint64_t xfeature, const char *name);
 
 #define vm_xsave_require_permission(xfeature)  \
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared

2023-09-13 Thread Sean Christopherson
From: Vishal Annapurve 

Add helpers to convert memory between private and shared via KVM's
memory attributes, as well as helpers to free/allocate guest_memfd memory
via fallocate().  Userspace, i.e. tests, is NOT required to do fallocate()
when converting memory, as the attributes are the single source of truth.
The fallocate() helpers are provided so that tests can mimic a userspace
that frees private memory on conversion, e.g. to prioritize memory usage
over performance.
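Typical usage ends up looking like this (illustrative only):

  /* Convert a range via memory attributes... */
  vm_mem_set_private(vm, gpa, size);
  vm_mem_set_shared(vm, gpa, size);

  /* ...and optionally free or preallocate the guest_memfd backing. */
  vm_guest_mem_punch_hole(vm, gpa, size);
  vm_guest_mem_allocate(vm, gpa, size);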

Signed-off-by: Vishal Annapurve 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h | 48 +++
 tools/testing/selftests/kvm/lib/kvm_util.c| 26 ++
 2 files changed, 74 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 47ea25f9dc97..a0315503ac3e 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -333,6 +333,54 @@ static inline void vm_enable_cap(struct kvm_vm *vm, 
uint32_t cap, uint64_t arg0)
	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+   uint64_t size, uint64_t attributes)
+{
+   struct kvm_memory_attributes attr = {
+   .attributes = attributes,
+   .address = gpa,
+   .size = size,
+   .flags = 0,
+   };
+
+   /*
+* KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+* need significant enhancements to support multiple attributes.
+*/
+   TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
+   "Update me to support multiple attributes!");
+
+   vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+ uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_set_memory_attributes(vm, gpa, size, 0);
+}
+
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+   bool punch_hole);
+
+static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
+  uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, true);
+}
+
+static inline void vm_guest_mem_allocate(struct kvm_vm *vm, uint64_t gpa,
+uint64_t size)
+{
+   vm_guest_mem_fallocate(vm, gpa, size, false);
+}
+
 void vm_enable_dirty_ring(struct kvm_vm *vm, uint32_t ring_size);
 const char *vm_guest_mode_string(uint32_t i);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 127f44c6c83c..bf2bd5c39a96 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1176,6 +1176,32 @@ void vm_mem_region_delete(struct kvm_vm *vm, uint32_t 
slot)
__vm_mem_region_delete(vm, memslot2region(vm, slot), true);
 }
 
+void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
+   bool punch_hole)
+{
+   struct userspace_mem_region *region;
+   uint64_t end = gpa + size - 1;
+   off_t fd_offset;
+   int mode, ret;
+
+   region = userspace_mem_region_find(vm, gpa, gpa);
+   TEST_ASSERT(region && region->region.flags & KVM_MEM_PRIVATE,
+   "Private memory region not found for GPA 0x%lx", gpa);
+
+   TEST_ASSERT(region == userspace_mem_region_find(vm, end, end),
+   "fallocate() for guest_memfd must act on a single memslot");
+
+   fd_offset = region->region.gmem_offset +
+   (gpa - region->region.guest_phys_addr);
+
+   mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
+
+   ret = fallocate(region->region.gmem_fd, mode, fd_offset, size);
+   TEST_ASSERT(!ret, "fallocate() failed to %s at %lx[%lu], fd = %d, mode 
= %x, offset = %lx\n",
+punch_hole ? "punch hole" : "allocate", gpa, size,
+region->region.gmem_fd, mode, fd_offset);
+}
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots

2023-09-13 Thread Sean Christopherson
Add support for creating "private" memslots via KVM_CREATE_GUEST_MEMFD and
KVM_SET_USER_MEMORY_REGION2.  Make vm_userspace_mem_region_add() a wrapper
to its effective replacement, vm_mem_add(), so that private memslots are
fully opt-in, i.e. don't require updating all tests that add memory regions.

Pivot on the KVM_MEM_PRIVATE flag instead of the validity of the "gmem"
file descriptor so that simple tests can let vm_mem_add() do the heavy
lifting of creating the guest memfd, but also allow the caller to pass in
an explicit fd+offset so that fancier tests can do things like back
multiple memslots with a single file.  If the caller passes in a fd, dup()
the fd so that (a) __vm_mem_region_delete() can close the fd associated
with the memory region without needing yet another flag, and (b) so that
the caller can safely close its copy of the fd without having to first
destroy memslots.
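A sketch of the fancier case (illustrative only; the argument names are
placeholders):

  /* Back two private memslots with a single guest_memfd. */
  int fd = vm_create_guest_memfd(vm, size * 2, 0);

  vm_mem_add(vm, src_type, gpa, slot, npages,
             KVM_MEM_PRIVATE, fd, 0);
  vm_mem_add(vm, src_type, gpa + size, slot + 1, npages,
             KVM_MEM_PRIVATE, fd, size);

  /* Safe: each memslot holds its own dup()'d copy of the fd. */
  close(fd);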

Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h | 23 +
 .../testing/selftests/kvm/include/test_util.h |  5 ++
 tools/testing/selftests/kvm/lib/kvm_util.c| 85 ---
 3 files changed, 82 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 9f144841c2ee..47ea25f9dc97 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -431,6 +431,26 @@ static inline uint64_t vm_get_stat(struct kvm_vm *vm, 
const char *stat_name)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   struct kvm_create_guest_memfd gmem = {
+   .size = size,
+   .flags = flags,
+   };
+
+   return __vm_ioctl(vm, KVM_CREATE_GUEST_MEMFD, &gmem);
+}
+
+static inline int vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
+   uint64_t flags)
+{
+   int fd = __vm_create_guest_memfd(vm, size, flags);
+
+   TEST_ASSERT(fd >= 0, KVM_IOCTL_ERROR(KVM_CREATE_GUEST_MEMFD, fd));
+   return fd;
+}
+
 void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
   uint64_t gpa, uint64_t size, void *hva);
 int __vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t 
flags,
@@ -439,6 +459,9 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
enum vm_mem_backing_src_type src_type,
uint64_t guest_paddr, uint32_t slot, uint64_t npages,
uint32_t flags);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+   uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+   uint32_t flags, int gmem_fd, uint64_t gmem_offset);
 
 void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t slot, uint32_t flags);
 void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, uint64_t new_gpa);
diff --git a/tools/testing/selftests/kvm/include/test_util.h 
b/tools/testing/selftests/kvm/include/test_util.h
index 7e614adc6cf4..7257f2243ab9 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,11 @@ static inline bool backing_src_is_shared(enum 
vm_mem_backing_src_type t)
return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
+{
+   return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
+}
+
 /* Aligns x up to the next multiple of size. Size must be a power of 2. */
 static inline uint64_t align_up(uint64_t x, uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 3676b37bea38..127f44c6c83c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -669,6 +669,8 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
TEST_ASSERT(!ret, __KVM_SYSCALL_ERROR("munmap()", ret));
close(region->fd);
}
+   if (region->region.gmem_fd >= 0)
+   close(region->region.gmem_fd);
 
free(region);
 }
@@ -870,36 +872,15 @@ void vm_set_user_memory_region(struct kvm_vm *vm, 
uint32_t slot, uint32_t flags,
errno, strerror(errno));
 }
 
-/*
- * VM Userspace Memory Region Add
- *
- * Input Args:
- *   vm - Virtual Machine
- *   src_type - Storage source for this region.
- *  NULL to use anonymous memory.
- *   guest_paddr - Starting guest physical address
- *   slot - KVM region slot
- *   npages - Number of physical pages
- *   flags - KVM memory region flags (e.g. KVM_MEM_LOG_DIRTY_PAGES)
- *
- * Output Args: None
- *
- * Return: None
- *
- * Allocates a memory area of the number of pages specified by npages
- 

[RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2

2023-09-13 Thread Sean Christopherson
Use KVM_SET_USER_MEMORY_REGION2 throughout KVM's selftests library so
that support for guest private memory can be added without needing an
entirely separate set of helpers.

Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  2 +-
 tools/testing/selftests/kvm/lib/kvm_util.c| 19 ++-
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 967eaaeacd75..9f144841c2ee 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -44,7 +44,7 @@ typedef uint64_t vm_paddr_t; /* Virtual Machine (Guest) 
physical address */
 typedef uint64_t vm_vaddr_t; /* Virtual Machine (Guest) virtual address */
 
 struct userspace_mem_region {
-   struct kvm_userspace_memory_region region;
+   struct kvm_userspace_memory_region2 region;
struct sparsebit *unused_phy_pages;
int fd;
off_t offset;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index f09295d56c23..3676b37bea38 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -453,8 +453,9 @@ void kvm_vm_restart(struct kvm_vm *vmp)
vm_create_irqchip(vmp);
 
hash_for_each(vmp->regions.slot_hash, ctr, region, slot_node) {
-   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION, &region->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   int ret = ioctl(vmp->fd, KVM_SET_USER_MEMORY_REGION2, &region->region);
+
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%llx size: 0x%llx",
@@ -657,7 +658,7 @@ static void __vm_mem_region_delete(struct kvm_vm *vm,
}
 
region->region.memory_size = 0;
-   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
sparsebit_free(>unused_phy_pages);
ret = munmap(region->mmap_start, region->mmap_size);
@@ -1014,8 +1015,8 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
region->region.guest_phys_addr = guest_paddr;
region->region.memory_size = npages * vm->page_size;
region->region.userspace_addr = (uintptr_t) region->host_mem;
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i\n"
"  slot: %u flags: 0x%x\n"
"  guest_phys_addr: 0x%lx size: 0x%lx",
@@ -1097,9 +1098,9 @@ void vm_mem_region_set_flags(struct kvm_vm *vm, uint32_t 
slot, uint32_t flags)
 
region->region.flags = flags;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION IOCTL failed,\n"
+   TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
"  rc: %i errno: %i slot: %u flags: 0x%x",
ret, errno, slot, flags);
 }
@@ -1127,9 +1128,9 @@ void vm_mem_region_move(struct kvm_vm *vm, uint32_t slot, 
uint64_t new_gpa)
 
region->region.guest_phys_addr = new_gpa;
 
-   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region->region);
+   ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
 
-   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION failed\n"
+   TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed\n"
"ret: %i errno: %i slot: %u new_gpa: 0x%lx",
ret, errno, slot, new_gpa);
 }
-- 
2.42.0.283.g2d96d420d3-goog
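
For context, tests allocate memslots through the library helper rather than
issuing the ioctl directly, so the v1 -> v2 switch is invisible to callers.
A minimal sketch (the guest_paddr/slot/npages values are illustrative only):

	#include "kvm_util.h"

	static void add_example_memslot(void)
	{
		struct kvm_vm *vm = vm_create_barebones();

		/* Anonymous backing, no dirty logging; internally this now
		 * issues KVM_SET_USER_MEMORY_REGION2. */
		vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
					    0x10000000 /* guest_paddr */,
					    1 /* slot */, 16 /* npages */,
					    0 /* flags */);

		kvm_vm_free(vm);
	}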



[RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper

2023-09-13 Thread Sean Christopherson
Drop kvm_userspace_memory_region_find(), it's unused and a terrible API
(probably why it's unused).  If anything outside of kvm_util.c needs to
get at the memslot, userspace_mem_region_find() can be exposed to give
others full access to all memory region/slot information.

Signed-off-by: Sean Christopherson 
---
 .../selftests/kvm/include/kvm_util_base.h |  4 ---
 tools/testing/selftests/kvm/lib/kvm_util.c| 29 ---
 2 files changed, 33 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h 
b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..967eaaeacd75 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -776,10 +776,6 @@ vm_adjust_num_guest_pages(enum vm_guest_mode mode, 
unsigned int num_guest_pages)
return n;
 }
 
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end);
-
 #define sync_global_to_guest(vm, g) ({ \
typeof(g) *_p = addr_gva2hva(vm, (vm_vaddr_t)&(g)); \
memcpy(_p, &(g), sizeof(g));\
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c 
b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..f09295d56c23 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -590,35 +590,6 @@ userspace_mem_region_find(struct kvm_vm *vm, uint64_t 
start, uint64_t end)
return NULL;
 }
 
-/*
- * KVM Userspace Memory Region Find
- *
- * Input Args:
- *   vm - Virtual Machine
- *   start - Starting VM physical address
- *   end - Ending VM physical address, inclusive.
- *
- * Output Args: None
- *
- * Return:
- *   Pointer to overlapping region, NULL if no such region.
- *
- * Public interface to userspace_mem_region_find. Allows tests to look up
- * the memslot datastructure for a given range of guest physical memory.
- */
-struct kvm_userspace_memory_region *
-kvm_userspace_memory_region_find(struct kvm_vm *vm, uint64_t start,
-uint64_t end)
-{
-   struct userspace_mem_region *region;
-
-   region = userspace_mem_region_find(vm, start, end);
-   if (!region)
-   return NULL;
-
-   return &region->region;
-}
-
 __weak void vcpu_arch_free(struct kvm_vcpu *vcpu)
 {
 
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory

2023-09-13 Thread Sean Christopherson
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst  | 32 
 arch/x86/include/asm/kvm_host.h | 15 +--
 arch/x86/include/uapi/asm/kvm.h |  3 +++
 arch/x86/kvm/Kconfig| 12 
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 arch/x86/kvm/x86.c  | 16 +++-
 include/uapi/linux/kvm.h|  1 +
 virt/kvm/Kconfig|  5 +
 8 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c44ef5295a12..5e08f2a157ef 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,29 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+^^^^
+
+Supported X86 VM types can be queried via KVM_CAP_VM_TYPES.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
@@ -8558,6 +8577,19 @@ block sizes is exposed in 
KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 This capability indicates KVM supports per-page memory attributes and ioctls
 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
 
+8.41 KVM_CAP_VM_TYPES
+---------------------
+
+:Capability: KVM_CAP_VM_TYPES
+:Architectures: x86
+:Type: system ioctl
+
+This capability returns a bitmap of supported VM types.  The 1-setting of bit @n
+means the VM type with value @n is supported.  Possible values of @n are::
+
+  #define KVM_X86_DEFAULT_VM   0
+  #define KVM_X86_SW_PROTECTED_VM  1
+
 9. Known KVM API problems
 =========================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 44d67a97304e..95018cc653f5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1245,6 +1245,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+   unsigned long vm_type;
unsigned long n_used_mmu_pages;
unsigned long n_requested_mmu_pages;
unsigned long n_max_mmu_pages;
@@ -2079,6 +2080,12 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t 
new_pgd);
 void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
   int tdp_max_root_level, int tdp_huge_page_level);
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != 
KVM_X86_DEFAULT_VM)
+#else
+#define kvm_arch_has_private_mem(kvm) false
+#endif
+
 static inline u16 kvm_read_ldt(void)
 {
u16 ldt;
@@ -2127,14 +2134,10 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES 2
+/* SMM is currently unsupported for guests with private memory. */
+# define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 
2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
-
-static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
-{
-   return KVM_MAX_NR_ADDRESS_SPACES;
-}
-
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1a6a1f987949..a448d0964fc0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -562,4 +562,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE   BIT(0)
 
+#define KVM_X86_DEFAULT_VM 0
+#define KVM_X86_SW_PROTECTED_VM1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 091b74599c22..8452ed0228cb 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -77,6 +77,18 @@ config KVM_WERROR
 
  If in doubt, say "N".
 
+config KVM_SW_PROTECTED_VM
+   bool "Enable support for KVM software-protected VMs"
+   depends on EXPERT
+   depends on X86_64
+   select KVM_GENERIC_PRIVATE_MEM
+   help
+ Enable support for KVM software-protected VMs.  Currently "protected"
+ means the VM can be backed with memory provided by
+ KVM_CREATE_GUEST_MEMFD.
+
+ If unsure, say "N".
+
 config KVM_INTEL
tristate "KVM for Intel (and compatible) processors support"
depends on KVM && IA32_FEAT_CTL
diff --git 
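
As a usage illustration, a hedged sketch of the userspace side (the
capability and VM-type defines come from this patch; error handling trimmed):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int create_sw_protected_vm(void)
	{
		int kvm_fd = open("/dev/kvm", O_RDWR);
		/* Bitmap of supported VM types: bit N set => type N usable. */
		int types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);

		if (!(types & (1 << KVM_X86_SW_PROTECTED_VM)))
			return -1;

		/* The type is passed as KVM_CREATE_VM's "machine type" arg. */
		return ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_SW_PROTECTED_VM);
	}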

[RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM

2023-09-13 Thread Sean Christopherson
Let x86 track the number of address spaces on a per-VM basis so that KVM
can disallow SMM memslots for confidential VMs.  Confidential VMs are
fundamentally incompatible with emulating SMM, which as the name suggests
requires being able to read and write guest memory and register state.

Disallowing SMM will simplify support for guest private memory, as KVM
will not need to worry about tracking memory attributes for multiple
address spaces (SMM is the only "non-default" address space across all
architectures).

Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/book3s_hv.c|  2 +-
 arch/x86/include/asm/kvm_host.h |  8 +++-
 arch/x86/kvm/debugfs.c  |  2 +-
 arch/x86/kvm/mmu/mmu.c  |  8 
 arch/x86/kvm/mmu/tdp_mmu.c  |  2 +-
 arch/x86/kvm/x86.c  |  2 +-
 include/linux/kvm_host.h| 17 +++--
 virt/kvm/dirty_ring.c   |  2 +-
 virt/kvm/kvm_main.c | 26 ++
 9 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..9b0eaa17275a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
}
 
	srcu_idx = srcu_read_lock(&kvm->srcu);
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
struct kvm_memory_slot *memslot;
struct kvm_memslots *slots = __kvm_memslots(kvm, i);
int bkt;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 78d641056ec5..44d67a97304e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2126,9 +2126,15 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define KVM_ADDRESS_SPACE_NUM 2
+# define KVM_MAX_NR_ADDRESS_SPACES 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
+
+static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
+{
+   return KVM_MAX_NR_ADDRESS_SPACES;
+}
+
 #else
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
 #endif
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..42026b3f3ff3 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void 
*v)
	mutex_lock(&kvm->slots_lock);
	write_lock(&kvm->mmu_lock);
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
int bkt;
 
slots = __kvm_memslots(kvm, i);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9b48d8d0300b..269d4dc47c98 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
kvm_page_track_write_tracking_enabled(kvm))
goto out_success;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
kvm_for_each_memslot(slot, bkt, slots) {
/*
@@ -6301,7 +6301,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t 
gfn_start, gfn_t gfn_e
if (!kvm_memslots_have_rmaps(kvm))
return flush;
 
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
slots = __kvm_memslots(kvm, i);
 
		kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
@@ -6341,7 +6341,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
if (tdp_mmu_enabled) {
-   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++)
flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
  gfn_end, true, flush);
}
@@ -6802,7 +6802,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 
gen)
 * modifier prior to checking for a wrap of the MMIO generation so
 * that a wrap in any address space is detected.
 */
-   gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
+   gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
 
/*
 * The very rare case: if the MMIO generation number has wrapped,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c63f2d1675f..ca7ec39f17d3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -905,7 +905,7 @@ void kvm_tdp_mmu_zap_all(struct kvm 

[RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

2023-09-13 Thread Sean Christopherson
Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
KVM_ADDRESS_SPACE_NUM.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 -
 include/linux/kvm_host.h| 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91a28ddf7cfd..78d641056ec5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2126,7 +2126,6 @@ enum {
 #define HF_SMM_MASK(1 << 1)
 #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
 
-# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
 # define KVM_ADDRESS_SPACE_NUM 2
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 
1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 18d8f02a99a3..aea1b4306129 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
 #define KVM_MEM_SLOTS_NUM SHRT_MAX
 #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
 
-#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
+#if KVM_ADDRESS_SPACE_NUM == 1
 static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 {
return 0;
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

A KVM_MEM_PRIVATE memslot can include both fd-based private memory and
hva-based shared memory. Architecture code (like TDX code) can tell
whether the on-going fault is private or not. This patch adds a
'is_private' field to kvm_page_fault to indicate this and architecture
code is expected to set it.

To handle page fault for such memslot, the handling logic is different
depending on whether the fault is private or shared. KVM checks if
'is_private' matches the host's view of the page (maintained in
mem_attr_array).
  - For a successful match, private pfn is obtained with
restrictedmem_get_page() and shared pfn is obtained with existing
get_user_pages().
  - For a failed match, KVM causes a KVM_EXIT_MEMORY_FAULT exit to
userspace. Userspace then can convert memory between private/shared
in host's view and retry the fault.

Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/mmu/mmu.c  | 94 +++--
 arch/x86/kvm/mmu/mmu_internal.h |  1 +
 2 files changed, 90 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a079f36a8bf5..9b48d8d0300b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t 
gfn,
return level;
 }
 
-int kvm_mmu_max_mapping_level(struct kvm *kvm,
- const struct kvm_memory_slot *slot, gfn_t gfn,
- int max_level)
+static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
+  const struct kvm_memory_slot *slot,
+  gfn_t gfn, int max_level, bool 
is_private)
 {
struct kvm_lpage_info *linfo;
int host_level;
@@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
break;
}
 
+   if (is_private)
+   return max_level;
+
if (max_level == PG_LEVEL_4K)
return PG_LEVEL_4K;
 
@@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
return min(host_level, max_level);
 }
 
+int kvm_mmu_max_mapping_level(struct kvm *kvm,
+ const struct kvm_memory_slot *slot, gfn_t gfn,
+ int max_level)
+{
+   bool is_private = kvm_slot_can_be_private(slot) &&
+ kvm_mem_is_private(kvm, gfn);
+
+   return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, 
is_private);
+}
+
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
 {
struct kvm_memory_slot *slot = fault->slot;
@@ -3188,8 +3201,9 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, 
struct kvm_page_fault *fault
 * Enforce the iTLB multihit workaround after capturing the requested
 * level, which will be used to do precise, accurate accounting.
 */
-   fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, slot,
-fault->gfn, 
fault->max_level);
+   fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
+  fault->gfn, 
fault->max_level,
+  fault->is_private);
if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
return;
 
@@ -4261,6 +4275,55 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, 
struct kvm_async_pf *work)
kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true, NULL);
 }
 
+static inline u8 kvm_max_level_for_order(int order)
+{
+   BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+   KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+   order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+   order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+   return PG_LEVEL_1G;
+
+   if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+   return PG_LEVEL_2M;
+
+   return PG_LEVEL_4K;
+}
+
+static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+ struct kvm_page_fault *fault)
+{
+   kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
+ PAGE_SIZE, fault->write, fault->exec,
+ fault->is_private);
+}
+
+static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu,
+  struct kvm_page_fault *fault)
+{
+   int max_order, r;
+
+   if (!kvm_slot_can_be_private(fault->slot)) {
+   kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+   return -EFAULT;
+   }
+
+   r = kvm_gmem_get_pfn(vcpu->kvm, 
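
The diff is cut short above; paraphrasing the commit message, the remainder
of the private-fault plumbing amounts to the following sketch
(kvm_faultin_pfn_shared() is a stand-in name for the existing GUP-based
path, not a function added by this patch):

	/* Sketch of the dispatch described above, not the literal hunk. */
	static int faultin_pfn_sketch(struct kvm_vcpu *vcpu,
				      struct kvm_page_fault *fault)
	{
		/* Mismatch between the access and the host's view of the
		 * page (mem_attr_array) => implicit conversion, punt to
		 * userspace via KVM_EXIT_MEMORY_FAULT. */
		if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
			kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
			return -EFAULT;
		}

		if (fault->is_private)
			return kvm_faultin_pfn_private(vcpu, fault);

		return kvm_faultin_pfn_shared(vcpu, fault);	/* get_user_pages() */
	}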

[RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Disallow creating hugepages with mixed memory attributes, e.g. shared
versus private, as mapping a hugepage in this case would allow the guest
to access memory with the wrong attributes, e.g. overlaying private memory
with a shared hugepage.

Track whether or not attributes are mixed via the existing
disallow_lpage field, but use the most significant bit in 'disallow_lpage'
to indicate a hugepage has mixed attributes instead of using the normal
refcounting.  Whether or not attributes are mixed is binary; either they
are or they aren't.  Attempting to squeeze that info into the refcount is
unnecessarily complex as it would require knowing the previous state of
the mixed count when updating attributes.  Using a flag means KVM just
needs to ensure the current status is reflected in the memslots.

Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu/mmu.c  | 152 +++-
 arch/x86/kvm/x86.c  |   4 +
 3 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a2b53483524..91a28ddf7cfd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1838,6 +1838,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu);
 int kvm_mmu_init_vm(struct kvm *kvm);
 void kvm_mmu_uninit_vm(struct kvm *kvm);
 
+void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm,
+   struct kvm_memory_slot *slot);
+
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu);
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0f0231d2b74f..a079f36a8bf5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -795,16 +795,26 @@ static struct kvm_lpage_info *lpage_info_slot(gfn_t gfn,
	return &slot->arch.lpage_info[level - 2][idx];
 }
 
+/*
+ * The most significant bit in disallow_lpage tracks whether or not memory
+ * attributes are mixed, i.e. not identical for all gfns at the current level.
+ * The lower order bits are used to refcount other cases where a hugepage is
+ * disallowed, e.g. if KVM has shadowed a page table at the gfn.
+ */
+#define KVM_LPAGE_MIXED_FLAG   BIT(31)
+
 static void update_gfn_disallow_lpage_count(const struct kvm_memory_slot *slot,
gfn_t gfn, int count)
 {
struct kvm_lpage_info *linfo;
-   int i;
+   int old, i;
 
for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
linfo = lpage_info_slot(gfn, slot, i);
+
+   old = linfo->disallow_lpage;
linfo->disallow_lpage += count;
-   WARN_ON_ONCE(linfo->disallow_lpage < 0);
+   WARN_ON_ONCE((old ^ linfo->disallow_lpage) & 
KVM_LPAGE_MIXED_FLAG);
}
 }
 
@@ -7172,3 +7182,141 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
if (kvm->arch.nx_huge_page_recovery_thread)
kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
+
+#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+   int level)
+{
+   return lpage_info_slot(gfn, slot, level)->disallow_lpage & 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_clear_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage &= 
~KVM_LPAGE_MIXED_FLAG;
+}
+
+static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
+  int level)
+{
+   lpage_info_slot(gfn, slot, level)->disallow_lpage |= 
KVM_LPAGE_MIXED_FLAG;
+}
+
+static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
+  gfn_t gfn, int level, unsigned long attrs)
+{
+   const unsigned long start = gfn;
+   const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
+
+   if (level == PG_LEVEL_2M)
+   return kvm_range_has_memory_attributes(kvm, start, end, attrs);
+
+   for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
+   if (hugepage_test_mixed(slot, gfn, level - 1) ||
+   attrs != kvm_get_memory_attributes(kvm, gfn))
+   return false;
+   }
+   return true;
+}
+
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+struct kvm_gfn_range *range)
+{
+   unsigned long attrs = range->arg.attributes;
+   struct kvm_memory_slot *slot = range->slot;
+   int level;
+
+   lockdep_assert_held_write(&kvm->mmu_lock);
+   lockdep_assert_held(&kvm->slots_lock);
+
+   /*
+* KVM x86 currently only supports KVM_MEMORY_ATTRIBUTE_PRIVATE, skip
+ 
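
To make the encoding concrete, a standalone illustration of the
flag-plus-refcount scheme (plain C; in KVM the field is the int
disallow_lpage inside struct kvm_lpage_info):

	#include <stdbool.h>
	#include <stdint.h>

	#define MIXED_FLAG	(1u << 31)	/* mirrors KVM_LPAGE_MIXED_FLAG */

	/* Bits 0-30 refcount the "normal" reasons a hugepage is disallowed;
	 * bit 31 flags non-uniform (mixed) attributes within the range. */
	static bool hugepage_allowed(uint32_t disallow_lpage)
	{
		return disallow_lpage == 0;	/* zero refs and not mixed */
	}

	static uint32_t set_mixed(uint32_t disallow_lpage, bool mixed)
	{
		return mixed ? (disallow_lpage | MIXED_FLAG)
			     : (disallow_lpage & ~MIXED_FLAG);
	}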

[RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-09-13 Thread Sean Christopherson
Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
the probability of exiting to userspace with a stale run->exit_reason that
*appears* to be valid.

To support fd-based guest memory (guest memory without a corresponding
userspace virtual address), KVM will exit to userspace for various memory
related errors, which userspace *may* be able to resolve, instead of using
e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
utilize the same functionality to let userspace "intercept" and handle
memory faults when the userspace mapping is missing, i.e. when fast gup()
fails.

Because many of KVM's internal APIs related to guest memory use '0' to
indicate "success, continue on" and not "exit to userspace", reporting
memory faults/errors to userspace will set run->exit_reason and
corresponding fields in the run structure fields in conjunction with a
a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON.  And because
KVM already returns  -EFAULT in many paths, there's a relatively high
probability that KVM could return -EFAULT without setting run->exit_reason,
in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
whatever exit reason happened to be in the run structure.

Note, KVM must wait until after run->immediate_exit is serviced to
sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
preserved across KVM_RUN when run->immediate_exit is true.

Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8356907079e1..8d21b7b09bb5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10951,6 +10951,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
 {
int r;
 
+   vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->arch.l1tf_flush_l1d = true;
 
for (;;) {
-- 
2.42.0.283.g2d96d420d3-goog
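
For illustration, the contract a VMM can rely on after this change
(vcpu_fd, handle_exit() and friends are hypothetical; this is a sketch,
not the selftests code):

	/* run->exit_reason is definitely valid when KVM_RUN returns 0.
	 * On error, KVM_EXIT_UNKNOWN now means "KVM never filled the
	 * exit" rather than a stale-but-plausible reason.  The reset is
	 * skipped when run->immediate_exit is set, as the ABI preserves
	 * exit_reason across an immediate exit. */
	int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

	if (ret == 0)
		handle_exit(run);
	else if (run->exit_reason != KVM_EXIT_UNKNOWN)
		handle_error_exit(run);	/* e.g. KVM_EXIT_MEMORY_FAULT */
	else
		handle_errno(errno);	/* no usable exit info */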



[RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory

2023-09-13 Thread Sean Christopherson
TODO: writeme

Signed-off-by: Sean Christopherson 
---
 include/uapi/linux/kvm.h |  2 ++
 virt/kvm/guest_mem.c | 54 
 2 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index b6f90a273e2e..2df18796fd8e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -2314,6 +2314,8 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct 
kvm_create_guest_memfd)
 
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0)
+
 struct kvm_create_guest_memfd {
__u64 size;
__u64 flags;
diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 0dd3f836cf9c..a819367434e9 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -17,15 +17,48 @@ struct kvm_gmem {
struct list_head entry;
 };
 
-static struct folio *kvm_gmem_get_folio(struct file *file, pgoff_t index)
+static struct folio *kvm_gmem_get_huge_folio(struct inode *inode, pgoff_t 
index)
 {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   unsigned long huge_index = round_down(index, HPAGE_PMD_NR);
+   unsigned long flags = (unsigned long)inode->i_private;
+   struct address_space *mapping  = inode->i_mapping;
+   gfp_t gfp = mapping_gfp_mask(mapping);
struct folio *folio;
 
-   /* TODO: Support huge pages. */
-   folio = filemap_grab_folio(file->f_mapping, index);
-   if (IS_ERR_OR_NULL(folio))
+   if (!(flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE))
return NULL;
 
+   if (filemap_range_has_page(mapping, huge_index << PAGE_SHIFT,
+  (huge_index + HPAGE_PMD_NR - 1) << 
PAGE_SHIFT))
+   return NULL;
+
+   folio = filemap_alloc_folio(gfp, HPAGE_PMD_ORDER);
+   if (!folio)
+   return NULL;
+
+   if (filemap_add_folio(mapping, folio, huge_index, gfp)) {
+   folio_put(folio);
+   return NULL;
+   }
+
+   return folio;
+#else
+   return NULL;
+#endif
+}
+
+static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
+{
+   struct folio *folio;
+
+   folio = kvm_gmem_get_huge_folio(inode, index);
+   if (!folio) {
+   folio = filemap_grab_folio(inode->i_mapping, index);
+   if (IS_ERR_OR_NULL(folio))
+   return NULL;
+   }
+
/*
 * Use the up-to-date flag to track whether or not the memory has been
 * zeroed before being handed off to the guest.  There is no backing
@@ -323,7 +356,8 @@ static const struct inode_operations kvm_gmem_iops = {
.setattr= kvm_gmem_setattr,
 };
 
-static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount 
*mnt)
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
+struct vfsmount *mnt)
 {
const char *anon_name = "[kvm-gmem]";
const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
@@ -346,6 +380,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, 
struct vfsmount *mnt)
inode->i_mode |= S_IFREG;
inode->i_size = size;
mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+   mapping_set_large_folios(inode->i_mapping);
mapping_set_unmovable(inode->i_mapping);
/* Unmovable mappings are supposed to be marked unevictable as well. */
WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
@@ -396,6 +431,12 @@ static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
if (size < 0 || !PAGE_ALIGNED(size))
return false;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+   return false;
+#endif
+
return true;
 }
 
@@ -405,6 +446,9 @@ int kvm_gmem_create(struct kvm *kvm, struct 
kvm_create_guest_memfd *args)
u64 flags = args->flags;
u64 valid_flags = 0;
 
+   if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+   valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
+
if (flags & ~valid_flags)
return -EINVAL;
 
-- 
2.42.0.283.g2d96d420d3-goog
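
A sketch of userspace opting in (vm_fd is an assumption; per
kvm_gmem_is_valid_size() above, the size must be HPAGE_PMD_SIZE-aligned,
i.e. 2MiB on x86, when the flag is set):

	struct kvm_create_guest_memfd gmem = {
		.size  = 512ull * 1024 * 1024,	/* 2MiB-aligned */
		.flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE,
	};
	int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (gmem_fd < 0)
		perror("KVM_CREATE_GUEST_MEMFD");	/* e.g. misaligned size */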



[RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-09-13 Thread Sean Christopherson
TODO

Cc: Fuad Tabba 
Cc: Vishal Annapurve 
Cc: Ackerley Tng 
Cc: Jarkko Sakkinen 
Cc: Maciej Szmigiero 
Cc: Vlastimil Babka 
Cc: David Hildenbrand 
Cc: Quentin Perret 
Cc: Michael Roth 
Cc: Wang 
Cc: Liam Merwick 
Cc: Isaku Yamahata 
Co-developed-by: Kirill A. Shutemov 
Signed-off-by: Kirill A. Shutemov 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Co-developed-by: Chao Peng 
Signed-off-by: Chao Peng 
Co-developed-by: Ackerley Tng 
Signed-off-by: Ackerley Tng 
Co-developed-by: Isaku Yamahata 
Signed-off-by: Isaku Yamahata 
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h   |  48 +++
 include/uapi/linux/kvm.h   |  15 +-
 include/uapi/linux/magic.h |   1 +
 virt/kvm/Kconfig   |   4 +
 virt/kvm/Makefile.kvm  |   1 +
 virt/kvm/guest_mem.c   | 593 +
 virt/kvm/kvm_main.c|  61 +++-
 virt/kvm/kvm_mm.h  |  38 +++
 8 files changed, 756 insertions(+), 5 deletions(-)
 create mode 100644 virt/kvm/guest_mem.c

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9b695391b11c..18d8f02a99a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -591,8 +591,20 @@ struct kvm_memory_slot {
u32 flags;
short id;
u16 as_id;
+
+#ifdef CONFIG_KVM_PRIVATE_MEM
+   struct {
+   struct file __rcu *file;
+   pgoff_t pgoff;
+   } gmem;
+#endif
 };
 
+static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+{
+   return slot && (slot->flags & KVM_MEM_PRIVATE);
+}
+
 static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot 
*slot)
 {
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
@@ -687,6 +699,17 @@ static inline int kvm_arch_vcpu_memslots_id(struct 
kvm_vcpu *vcpu)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_has_private_mem if support for private memory
+ * is enabled.
+ */
+#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
+{
+   return false;
+}
+#endif
+
 struct kvm_memslots {
u64 generation;
atomic_long_t last_used_slot;
@@ -1401,6 +1424,7 @@ void *kvm_mmu_memory_cache_alloc(struct 
kvm_mmu_memory_cache *mc);
 void kvm_mmu_invalidate_begin(struct kvm *kvm);
 void kvm_mmu_invalidate_range_add(struct kvm *kvm, gfn_t start, gfn_t end);
 void kvm_mmu_invalidate_end(struct kvm *kvm);
+bool kvm_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
 long kvm_arch_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg);
@@ -2360,6 +2384,30 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 struct kvm_gfn_range *range);
+
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+   return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
+  kvm_get_memory_attributes(kvm, gfn) & 
KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+#else
+static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
+{
+   return false;
+}
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
+#ifdef CONFIG_KVM_PRIVATE_MEM
+int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
+ gfn_t gfn, kvm_pfn_t *pfn, int *max_order);
+#else
+static inline int kvm_gmem_get_pfn(struct kvm *kvm,
+  struct kvm_memory_slot *slot, gfn_t gfn,
+  kvm_pfn_t *pfn, int *max_order)
+{
+   KVM_BUG_ON(1, kvm);
+   return -EIO;
+}
+#endif /* CONFIG_KVM_PRIVATE_MEM */
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index f8642ff2eb9d..b6f90a273e2e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -102,7 +102,10 @@ struct kvm_userspace_memory_region2 {
__u64 guest_phys_addr;
__u64 memory_size;
__u64 userspace_addr;
-   __u64 pad[16];
+   __u64 gmem_offset;
+   __u32 gmem_fd;
+   __u32 pad1;
+   __u64 pad2[14];
 };
 
 /*
@@ -112,6 +115,7 @@ struct kvm_userspace_memory_region2 {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES(1UL << 0)
 #define KVM_MEM_READONLY   (1UL << 1)
+#define KVM_MEM_PRIVATE(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -1228,6 +1232,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
 #define KVM_CAP_USER_MEMORY2 230
 #define KVM_CAP_MEMORY_ATTRIBUTES 231
+#define KVM_CAP_GUEST_MEMFD 232
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2307,4 +2312,12 @@ struct kvm_memory_attributes {
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
 
+#define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO,  0xd4, struct 
kvm_create_guest_memfd)
+
+struct kvm_create_guest_memfd {
+   __u64 size;
+   __u64 flags;
+   __u64 reserved[6];
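
Tying the new pieces together, a hedged sketch of binding a guest_memfd to
a private memslot (vm_fd, gmem_fd, shared_buf and memslot_size are
assumptions; the shared half still comes from userspace_addr):

	struct kvm_userspace_memory_region2 region = {
		.slot            = 0,
		.flags           = KVM_MEM_PRIVATE,
		.guest_phys_addr = 0,
		.memory_size     = memslot_size,
		.userspace_addr  = (uintptr_t)shared_buf,	/* shared pages */
		.gmem_fd         = gmem_fd,			/* private pages */
		.gmem_offset     = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region))
		perror("KVM_SET_USER_MEMORY_REGION2");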

[RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM

2023-09-13 Thread Sean Christopherson
TODO: Throw this away, assuming KVM drops its dedicated file system.

Acked-by: Paul Moore 
Signed-off-by: Sean Christopherson 
---
 security/security.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/security.c b/security/security.c
index 23b129d482a7..0024156f867a 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1693,6 +1693,7 @@ int security_inode_init_security_anon(struct inode *inode,
return call_int_hook(inode_init_security_anon, 0, inode, name,
 context_inode);
 }
+EXPORT_SYMBOL_GPL(security_inode_init_security_anon);
 
 #ifdef CONFIG_SECURITY_PATH
 /**
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable

2023-09-13 Thread Sean Christopherson
Add an "unmovable" flag for mappings that cannot be migrated under any
circumstance.  KVM will use the flag for its upcoming GUEST_MEMFD support,
which will not support compaction/migration, at least not in the
foreseeable future.

Test AS_UNMOVABLE under folio lock as already done for the async
compaction/dirty folio case, as the mapping can be removed by truncation
while compaction is running.  To avoid having to lock every folio with a
mapping, assume/require that unmovable mappings are also unevictable, and
have mapping_set_unmovable() also set AS_UNEVICTABLE.

Cc: Matthew Wilcox 
Co-developed-by: Vlastimil Babka 
Signed-off-by: Vlastimil Babka 
Signed-off-by: Sean Christopherson 
---
 include/linux/pagemap.h | 19 +-
 mm/compaction.c | 43 +
 mm/migrate.c|  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..82c9bf506b79 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -203,7 +203,8 @@ enum mapping_flags {
/* writeback related tags are not used */
AS_NO_WRITEBACK_TAGS = 5,
AS_LARGE_FOLIO_SUPPORT = 6,
-   AS_RELEASE_ALWAYS,  /* Call ->release_folio(), even if no private 
data */
+   AS_RELEASE_ALWAYS = 7,  /* Call ->release_folio(), even if no private 
data */
+   AS_UNMOVABLE= 8,/* The mapping cannot be moved, ever */
 };
 
 /**
@@ -289,6 +290,22 @@ static inline void mapping_clear_release_always(struct 
address_space *mapping)
	clear_bit(AS_RELEASE_ALWAYS, &mapping->flags);
 }
 
+static inline void mapping_set_unmovable(struct address_space *mapping)
+{
+   /*
+* It's expected unmovable mappings are also unevictable. Compaction
+* migrate scanner (isolate_migratepages_block()) relies on this to
+* reduce page locking.
+*/
+   set_bit(AS_UNEVICTABLE, &mapping->flags);
+   set_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
+static inline bool mapping_unmovable(struct address_space *mapping)
+{
+   return test_bit(AS_UNMOVABLE, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
return mapping->gfp_mask;
diff --git a/mm/compaction.c b/mm/compaction.c
index 38c8d216c6a3..12b828aed7c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -883,6 +883,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/* Time to isolate some pages for migration */
for (; low_pfn < end_pfn; low_pfn++) {
+   bool is_dirty, is_unevictable;
 
if (skip_on_failure && low_pfn >= next_skip_pfn) {
/*
@@ -1080,8 +1081,10 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!folio_test_lru(folio))
goto isolate_fail_put;
 
+   is_unevictable = folio_test_unevictable(folio);
+
/* Compaction might skip unevictable pages but CMA takes them */
-   if (!(mode & ISOLATE_UNEVICTABLE) && 
folio_test_unevictable(folio))
+   if (!(mode & ISOLATE_UNEVICTABLE) && is_unevictable)
goto isolate_fail_put;
 
/*
@@ -1093,26 +1096,42 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if ((mode & ISOLATE_ASYNC_MIGRATE) && 
folio_test_writeback(folio))
goto isolate_fail_put;
 
-   if ((mode & ISOLATE_ASYNC_MIGRATE) && folio_test_dirty(folio)) {
-   bool migrate_dirty;
+   is_dirty = folio_test_dirty(folio);
+
+   if (((mode & ISOLATE_ASYNC_MIGRATE) && is_dirty) ||
+   (mapping && is_unevictable)) {
+   bool migrate_dirty = true;
+   bool is_unmovable;
 
/*
 * Only folios without mappings or that have
-* a ->migrate_folio callback are possible to
-* migrate without blocking.  However, we may
-* be racing with truncation, which can free
-* the mapping.  Truncation holds the folio lock
-* until after the folio is removed from the page
-* cache so holding it ourselves is sufficient.
+* a ->migrate_folio callback are possible to migrate
+* without blocking.
+*
+* Folios from unmovable mappings are not migratable.
+*
+* However, we can be racing with truncation, which can
+* free the mapping that we need to check. Truncation
+* holds the folio lock until after the folio is removed
+* from the page so holding it ourselves is 
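
Downstream of this change, the owner of such a mapping opts in at
inode-setup time; a minimal sketch (this mirrors what the guest_memfd
patch elsewhere in this series does in __kvm_gmem_create()):

	/* Unmovable mappings must also be unevictable;
	 * mapping_set_unmovable() sets both flags. */
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
	mapping_set_unmovable(inode->i_mapping);
	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));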

[RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

In confidential computing usages, whether a page is private or shared is
necessary information for KVM to perform operations like page fault
handling, page zapping etc. There are other potential use cases for
per-page memory attributes, e.g. to make memory read-only (or no-exec,
or exec-only, etc.) without having to modify memslots.

Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
userspace to operate on the per-page memory attributes.
  - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
a guest memory range.
  - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
memory attributes.

Use an xarray to store the per-page attributes internally, with a naive,
not fully optimized implementation, i.e. prioritize correctness over
performance for the initial implementation.

Because setting memory attributes is roughly analogous to mprotect() on
memory that is mapped into the guest, zap existing mappings prior to
updating the memory attributes.  Opportunistically provide an arch hook
for the post-set path (needed to complete invalidation anyways) in
anticipation of x86 needing the hook to update metadata related to
determining whether or not a given gfn can be backed with various sizes
of hugepages.

It's possible that future usages may not require an invalidation, e.g.
if KVM ends up supporting RWX protections and userspace grants _more_
protections, but again opt for simplicity and punt optimizations to
if/when they are needed.

Suggested-by: Sean Christopherson 
Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
Cc: Fuad Tabba 
Cc: Xu Yilun 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst |  60 ++
 include/linux/kvm_host.h   |  18 +++
 include/uapi/linux/kvm.h   |  14 +++
 virt/kvm/Kconfig   |   4 +
 virt/kvm/kvm_main.c| 212 +
 5 files changed, 308 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e28a13439a95..c44ef5295a12 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6070,6 +6070,56 @@ writes to the CNTVCT_EL0 and CNTPCT_EL0 registers using 
the SET_ONE_REG
 interface. No error will be returned, but the resulting offset will not be
 applied.
 
+4.139 KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES
+------------------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: u64 memory attributes bitmask(out)
+:Returns: 0 on success, <0 on error
+
+Returns supported memory attributes bitmask. Supported memory attributes will
+have the corresponding bits set in u64 memory attributes bitmask.
+
+The following memory attributes are defined::
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE   (1ULL << 3)
+
+4.140 KVM_SET_MEMORY_ATTRIBUTES
+-------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes(in/out)
+:Returns: 0 on success, <0 on error
+
+Sets memory attributes for pages in a guest memory range. Parameters are
+specified via the following structure::
+
+  struct kvm_memory_attributes {
+   __u64 address;
+   __u64 size;
+   __u64 attributes;
+   __u64 flags;
+  };
+
+The user sets the per-page memory attributes to a guest memory range indicated
+by address/size, and in return KVM adjusts address and size to reflect the
+actual pages of the memory range that have been successfully set to the
+attributes.
+If the call returns 0, "address" is updated to the last successful address + 1
+and "size" is updated to the remaining address size that has not been set
+successfully. The user should check the return value as well as the size to
+decide if the operation succeeded for the whole range or not. The user may want
+to retry the operation with the returned address/size if the previous range was
+partially successful.
+
+Both address and size should be page aligned and the supported attributes can 
be
+retrieved with KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES.
+
+The "flags" field may be used for future extensions and should be set to 0s.
+
 5. The kvm_run structure
 
 
@@ -8498,6 +8548,16 @@ block sizes is exposed in 
KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a
 64-bit bitmap (each bit describing a block size). The default value is
 0, to disable the eager page splitting.
 
+8.41 KVM_CAP_MEMORY_ATTRIBUTES
+------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES
+:Architectures: x86
+:Type: vm
+
+This capability indicates KVM supports per-page memory attributes and ioctls
+KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES/KVM_SET_MEMORY_ATTRIBUTES are available.
+
 9. Known KVM API problems
 =
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 
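
A sketch of the retry protocol described above (vm_fd, gpa and size are
assumptions; a robust caller would also bound the number of retries):

	struct kvm_memory_attributes attr = {
		.address    = gpa,
		.size       = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags      = 0,
	};

	/* On success KVM advances .address past the converted pages and
	 * shrinks .size to what remains; loop until done or hard error. */
	while (attr.size) {
		if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attr) < 0)
			break;
	}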

[RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events

2023-09-13 Thread Sean Christopherson
Add flags to "struct kvm_gfn_range" to let notifier events target only
shared and only private mappings, and wire up the existing mmu_notifier
events to be shared-only (private memory is never associated with a
userspace virtual address, i.e. can't be reached via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities (shared,
private, and shared+private) without needing something like a tri-state
enum.

Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/kvm_main.c  | 7 +++
 2 files changed, 9 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d8c6ce6c8211..b5373cee2b08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -263,6 +263,8 @@ struct kvm_gfn_range {
gfn_t start;
gfn_t end;
union kvm_mmu_notifier_arg arg;
+   bool only_private;
+   bool only_shared;
bool may_block;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 174de2789657..a41f8658dfe0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
__kvm_handle_hva_range(struct kvm *kvm,
 * the second or later invocation of the handler).
 */
gfn_range.arg = range->arg;
+
+   /*
+* HVA-based notifications aren't relevant to private
+* mappings as they don't have a userspace mapping.
+*/
+   gfn_range.only_private = false;
+   gfn_range.only_shared = true;
gfn_range.may_block = range->may_block;
 
/*
-- 
2.42.0.283.g2d96d420d3-goog
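
A sketch of how an arch range handler would consume the flags
(zap_private()/zap_shared() are hypothetical helpers; both flags clear
means "act on both"):

	static bool example_unmap_range(struct kvm *kvm,
					struct kvm_gfn_range *range)
	{
		bool flush = false;

		if (!range->only_shared)	/* private or both */
			flush |= zap_private(kvm, range);
		if (!range->only_private)	/* shared or both */
			flush |= zap_shared(kvm, range);

		return flush;
	}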



[RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-09-13 Thread Sean Christopherson
Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
__kvm_handle_hva_range() return whether or not an overlapping memslot
was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.

Use a small struct to return the tuple of the notifier-specific return,
plus whether or not overlap was found.  Because the iteration helpers are
__always_inlined, practically speaking, the struct will never actually be
returned from a function call (not to mention the size of the struct will
be two bytes in practice).

Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 53 +++--
 1 file changed, 37 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7c0e38752526..76d01de7838f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
bool may_block;
 };
 
+/*
+ * The inner-most helper returns a tuple containing the return value from the
+ * arch- and action-specific handler, plus a flag indicating whether or not at
+ * least one memslot was found, i.e. if the handler found guest memory.
+ *
+ * Note, most notifiers are averse to booleans, so even though KVM tracks the
+ * return from arch code as a bool, outer helpers will cast it to an int. :-(
+ */
+typedef struct kvm_mmu_notifier_return {
+   bool ret;
+   bool found_memslot;
+} kvm_mn_ret_t;
+
 /*
  * Use a dedicated stub instead of NULL to indicate that there is no callback
  * function/handler.  The compiler technically can't guarantee that a real
@@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg 
KVM_MMU_NOTIFIER_NO_ARG;
 node;   \
 node = interval_tree_iter_next(node, start, last))  \
 
-static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct 
kvm_mmu_notifier_range *range)
+static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
+  const struct 
kvm_mmu_notifier_range *range)
 {
-   bool ret = false, locked = false;
+   struct kvm_mmu_notifier_return r = {
+   .ret = false,
+   .found_memslot = false,
+   };
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
-   return 0;
+   return r;
 
/* A null handler is allowed if and only if on_lock() is provided. */
if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
 IS_KVM_NULL_FN(range->handler)))
-   return 0;
+   return r;
 
	idx = srcu_read_lock(&kvm->srcu);
 
@@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE 
- 1, slot);
gfn_range.slot = slot;
 
-   if (!locked) {
-   locked = true;
+   if (!r.found_memslot) {
+   r.found_memslot = true;
KVM_MMU_LOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_lock))
range->on_lock(kvm);
@@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
if (IS_KVM_NULL_FN(range->handler))
break;
}
-   ret |= range->handler(kvm, &gfn_range);
+   r.ret |= range->handler(kvm, &gfn_range);
}
}
 
-   if (range->flush_on_ret && ret)
+   if (range->flush_on_ret && r.ret)
kvm_flush_remote_tlbs(kvm);
 
-   if (locked) {
+   if (r.found_memslot) {
KVM_MMU_UNLOCK(kvm);
if (!IS_KVM_NULL_FN(range->on_unlock))
range->on_unlock(kvm);
@@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
 
	srcu_read_unlock(&kvm->srcu, idx);
 
-   /* The notifiers are averse to booleans. :-( */
-   return (int)ret;
+   return r;
 }
 
 static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
@@ -677,7 +692,7 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
.may_block  = false,
};
 
-   return __kvm_handle_hva_range(kvm, &range);
+   return __kvm_handle_hva_range(kvm, &range).ret;
 }
 
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier 
*mn,
@@ -696,7 +711,7 @@ static 

[RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Add a new KVM exit type to allow userspace to handle memory faults that
KVM cannot resolve, but that userspace *may* be able to handle (without
terminating the guest).

KVM will initially use KVM_EXIT_MEMORY_FAULT to report implicit
conversions between private and shared memory.  With guest private memory,
there will be two kinds of memory conversions:

  - explicit conversion: happens when the guest explicitly calls into KVM
to map a range (as private or shared)

  - implicit conversion: happens when the guest attempts to access a gfn
that is configured in the "wrong" state (private vs. shared)

On x86 (first architecture to support guest private memory), explicit
conversions will be reported via KVM_EXIT_HYPERCALL+KVM_HC_MAP_GPA_RANGE,
but reporting KVM_EXIT_HYPERCALL for implicit conversions is undesirable
as there is (obviously) no hypercall, and there is no guarantee that the
guest actually intends to convert between private and shared, i.e. what
KVM thinks is an implicit conversion "request" could actually be the
result of a guest code bug.

KVM_EXIT_MEMORY_FAULT will be used to report memory faults that appear to
be implicit conversions.

Place "struct memory_fault" in a second anonymous union so that filling
memory_fault doesn't clobber state from other yet-to-be-fulfilled exits,
and to provide additional information if KVM does NOT ultimately exit to
userspace with KVM_EXIT_MEMORY_FAULT, e.g. if KVM suppresses (or worse,
loses) the exit, as KVM often suppresses exits for memory failures that
occur when accessing paravirt data structures.  The initial usage for
private memory will be all-or-nothing, but other features such as the
proposed "userfault on missing mappings" support will use
KVM_EXIT_MEMORY_FAULT for potentially _all_ guest memory accesses, i.e.
will run afoul of KVM's various quirks.

Use bit 3 for flagging private memory so that KVM can use bits 0-2 for
capturing RWX behavior if/when userspace needs such information.

Note!  To allow for future possibilities where KVM reports
KVM_EXIT_MEMORY_FAULT and fills run->memory_fault on _any_ unresolved
fault, KVM returns "-EFAULT" (-1 with errno == EFAULT from userspace's
perspective), not '0'!  Due to historical baggage within KVM, exiting to
userspace with '0' from deep callstacks, e.g. in emulation paths, is
infeasible as doing so would require a near-complete overhaul of KVM,
whereas KVM already propagates -errno return codes to userspace even when
the -errno originated in a low level helper.

Link: https://lore.kernel.org/all/20230908222905.1321305-5-amoor...@google.com
Cc: Anish Moorthy 
Suggested-by: Sean Christopherson 
Co-developed-by: Yu Zhang 
Signed-off-by: Yu Zhang 
Signed-off-by: Chao Peng 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 Documentation/virt/kvm/api.rst | 24 
 include/linux/kvm_host.h   | 15 +++
 include/uapi/linux/kvm.h   | 24 
 3 files changed, 63 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 21a7578142a1..e28a13439a95 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6702,6 +6702,30 @@ array field represents return values. The userspace 
should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+   /* KVM_EXIT_MEMORY_FAULT */
+   struct {
+  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
+   __u64 flags;
+   __u64 gpa;
+   __u64 size;
+   } memory;
+
+KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
+could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
+guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
+describes properties of the faulting access that are likely pertinent:
+
+ - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
+   on a private memory access.  When clear, indicates the fault occurred on a
+   shared access.
+
+Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
+accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
+or EHWPOISON when KVM exits with KVM_EXIT_MEMORY_FAULT; userspace should assume
+kvm_run.exit_reason is stale/undefined for all other error numbers.
+
 ::
 
 /* KVM_EXIT_NOTIFY */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4e741ff27af3..d8c6ce6c8211 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2327,4 +2327,19 @@ static inline void kvm_account_pgtable_pages(void *virt, 
int nr)
 /* Max number of entries allowed for each kvm dirty ring */
 #define  KVM_DIRTY_RING_MAX_ENTRIES  65536
 
+static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
+  
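
The hunk above is truncated; for context, a hedged sketch of the userspace
half of an implicit conversion (field names follow the api.rst text above;
set_attributes() is a hypothetical wrapper around KVM_SET_MEMORY_ATTRIBUTES):

	int ret = ioctl(vcpu_fd, KVM_RUN, NULL);

	if (ret < 0 && errno == EFAULT &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		bool priv = run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

		/* Convert the range to match the guest's access, then
		 * simply re-enter the guest to retry the fault. */
		set_attributes(vm_fd, run->memory.gpa, run->memory.size,
			       priv ? KVM_MEMORY_ATTRIBUTE_PRIVATE : 0);
	}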

[RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2

2023-09-13 Thread Sean Christopherson
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional
information can be supplied without setting userspace up to fail.  The
padding in the new kvm_userspace_memory_region2 structure will be used to
pass a file descriptor in addition to the userspace_addr, i.e. allow
userspace to point at a file descriptor and map memory into a guest that
is NOT mapped into host userspace.

Alternatively, KVM could simply add "struct kvm_userspace_memory_region2"
without a new ioctl(), but as Paolo pointed out, adding a new ioctl()
makes detection of bad flags a bit more robust, e.g. if the new fd field
is guarded only by a flag and not a new ioctl(), then a userspace bug
(setting a "bad" flag) would generate out-of-bounds access instead of an
-EINVAL error.

Cc: Jarkko Sakkinen 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c   |  2 +-
 include/linux/kvm_host.h |  4 ++--
 include/uapi/linux/kvm.h | 13 +
 virt/kvm/kvm_main.c  | 38 ++
 4 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..8356907079e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12447,7 +12447,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, 
int id, gpa_t gpa,
}
 
for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
-   struct kvm_userspace_memory_region m;
+   struct kvm_userspace_memory_region2 m;
 
m.slot = id | (i << 16);
m.flags = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5faba69403ac..4e741ff27af3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1146,9 +1146,9 @@ enum kvm_mr_change {
 };
 
 int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem);
+ const struct kvm_userspace_memory_region2 *mem);
 int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region *mem);
+   const struct kvm_userspace_memory_region2 *mem);
 void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot);
 void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen);
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 13065dd96132..bd1abe067f28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -95,6 +95,16 @@ struct kvm_userspace_memory_region {
__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+   __u32 slot;
+   __u32 flags;
+   __u64 guest_phys_addr;
+   __u64 memory_size;
+   __u64 userspace_addr;
+   __u64 pad[16];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -1192,6 +1202,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 230
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1473,6 +1484,8 @@ struct kvm_vfio_spapr_tce {
struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR  _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8d21757cd5e9..7c0e38752526 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1571,7 +1571,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region 
*mem)
+static int check_memory_region_flags(const struct kvm_userspace_memory_region2 
*mem)
 {
u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
@@ -1973,7 +1973,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots 
*slots, int id,
  * Must be called holding kvm->slots_lock for write.
  */
 int __kvm_set_memory_region(struct kvm *kvm,
-   const struct kvm_userspace_memory_region *mem)
+   const struct kvm_userspace_memory_region2 *mem)
 {
struct kvm_memory_slot *old, *new;
struct kvm_memslots *slots;
@@ -2077,7 +2077,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 EXPORT_SYMBOL_GPL(__kvm_set_memory_region);
 
 int kvm_set_memory_region(struct kvm *kvm,
- const struct kvm_userspace_memory_region *mem)
+ const struct kvm_userspace_memory_region2 *mem)
 {
int r;
 
@@ 

[RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER

2023-09-13 Thread Sean Christopherson
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where
appropriate to effectively maintain existing behavior.  Using a proper
Kconfig will simplify building more functionality on top of KVM's
mmu_notifier infrastructure.

Add a forward declaration of kvm_gfn_range to kvm_types.h so that
including arch/powerpc/include/asm/kvm_ppc.h with CONFIG_KVM=n doesn't
generate warnings due to kvm_gfn_range being undeclared.  PPC defines
hooks for PR vs. HV without guarding them via #ifdeffery, e.g.

  bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);
  bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

Alternatively, PPC could forward declare kvm_gfn_range, but there's no
good reason not to define it in common KVM.
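
A minimal sketch of the new virt/kvm/Kconfig entry (an assumption,
inferred from the arch-side conversions below):

	config KVM_GENERIC_MMU_NOTIFIER
		select MMU_NOTIFIER
		bool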

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h   |  2 --
 arch/arm64/kvm/Kconfig  |  2 +-
 arch/mips/include/asm/kvm_host.h|  2 --
 arch/mips/kvm/Kconfig   |  2 +-
 arch/powerpc/include/asm/kvm_host.h |  2 --
 arch/powerpc/kvm/Kconfig|  8 
 arch/powerpc/kvm/powerpc.c  |  4 +---
 arch/riscv/include/asm/kvm_host.h   |  2 --
 arch/riscv/kvm/Kconfig  |  2 +-
 arch/x86/include/asm/kvm_host.h |  2 --
 arch/x86/kvm/Kconfig|  2 +-
 include/linux/kvm_host.h|  6 +++---
 include/linux/kvm_types.h   |  1 +
 virt/kvm/Kconfig|  4 
 virt/kvm/kvm_main.c | 10 +-
 15 files changed, 22 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index af06ccb7ee34..9e046b64847a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -921,8 +921,6 @@ int __kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
 int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 void kvm_arm_halt_guest(struct kvm *kvm);
 void kvm_arm_resume_guest(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 83c1e09be42e..1a15199f 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -22,7 +22,7 @@ menuconfig KVM
bool "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select KVM_GENERIC_HARDWARE_ENABLING
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select PREEMPT_NOTIFIERS
select HAVE_KVM_CPU_RELAX_INTERCEPT
select KVM_MMIO
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 54a85f1d4f2c..179f320cc231 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -810,8 +810,6 @@ int kvm_mips_mkclean_gpa_pt(struct kvm *kvm, gfn_t 
start_gfn, gfn_t end_gfn);
 pgd_t *kvm_pgd_alloc(void);
 void kvm_mmu_free_memory_caches(struct kvm_vcpu *vcpu);
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 /* Emulation */
 enum emulation_result update_pc(struct kvm_vcpu *vcpu, u32 cause);
 int kvm_get_badinstr(u32 *opc, struct kvm_vcpu *vcpu, u32 *out);
diff --git a/arch/mips/kvm/Kconfig b/arch/mips/kvm/Kconfig
index a8cdba75f98d..c04987d2ed2e 100644
--- a/arch/mips/kvm/Kconfig
+++ b/arch/mips/kvm/Kconfig
@@ -25,7 +25,7 @@ config KVM
select HAVE_KVM_EVENTFD
select HAVE_KVM_VCPU_ASYNC_IOCTL
select KVM_MMIO
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select INTERVAL_TREE
select KVM_GENERIC_HARDWARE_ENABLING
help
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..4b5c3f2acf78 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -62,8 +62,6 @@
 
 #include 
 
-#define KVM_ARCH_WANT_MMU_NOTIFIER
-
 #define HPTEG_CACHE_NUM(1 << 15)
 #define HPTEG_HASH_BITS_PTE13
 #define HPTEG_HASH_BITS_PTE_LONG   12
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..b33358ee6424 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -42,7 +42,7 @@ config KVM_BOOK3S_64_HANDLER
 config KVM_BOOK3S_PR_POSSIBLE
bool
select KVM_MMIO
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
 
 config KVM_BOOK3S_HV_POSSIBLE
bool
@@ -85,7 +85,7 @@ config KVM_BOOK3S_64_HV
tristate "KVM for POWER7 and later using hypervisor mode in host"
depends on KVM_BOOK3S_64 && PPC_POWERNV
select KVM_BOOK3S_HV_POSSIBLE
-   select MMU_NOTIFIER
+   select KVM_GENERIC_MMU_NOTIFIER
select CMA
help
  Support running unmodified book3s_64 guest kernels in
@@ -194,7 +194,7 @@ config KVM_E500V2
depends on !CONTEXT_TRACKING_USER

[RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU

2023-09-13 Thread Sean Christopherson
Advertise that KVM's MMU is synchronized with the primary MMU for all
flavors of PPC KVM support, i.e. advertise that the MMU is synchronized
when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor
mode (a.k.a. PR VMs).  PR VMs, via kvm_unmap_gfn_range_pr(), do the right
thing for mmu_notifier invalidation events, and more tellingly, KVM
returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n
and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a
synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y.
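
The userspace-visible effect is simply (a sketch, assuming a VM or
system fd in kvm_fd):

	/* On PPC this now returns 1 for both HV and PR VMs. */
	r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SYNC_MMU);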

Suggested-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/powerpc.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b0a512ede764..8d3ec483bc2b 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -635,11 +635,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
BUILD_BUG();
 #endif
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   r = hv_enabled;
-#else
r = 1;
-#endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
case KVM_CAP_PPC_HTAB_FD:
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER

2023-09-13 Thread Sean Christopherson
Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are
defined when KVM is enabled, and return '1' unconditionally for the
CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path.  All flavors of KVM support on PPC
select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally
defined by arch/powerpc/include/asm/kvm_host.h.

Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a
future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e.
will allow combining all of the

  #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)

checks into a single

  #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER

without having to worry about PPC's "bare" usage of
KVM_ARCH_WANT_MMU_NOTIFIER.

Signed-off-by: Sean Christopherson 
---
 arch/powerpc/kvm/powerpc.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7197c8256668..b0a512ede764 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -632,12 +632,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext)
break;
 #endif
case KVM_CAP_SYNC_MMU:
+#if !defined(CONFIG_MMU_NOTIFIER) || !defined(KVM_ARCH_WANT_MMU_NOTIFIER)
+   BUILD_BUG();
+#endif
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
r = hv_enabled;
-#elif defined(KVM_ARCH_WANT_MMU_NOTIFIER)
-   r = 1;
 #else
-   r = 0;
+   r = 1;
 #endif
break;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-- 
2.42.0.283.g2d96d420d3-goog



[RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry

2023-09-13 Thread Sean Christopherson
From: Chao Peng 

Currently in the mmu_notifier invalidate path, an hva range is recorded and
then checked against by mmu_notifier_retry_hva() in the page fault
handling path. However, for the to-be-introduced private memory, a page
fault may not have an associated hva; checking the gfn (gpa) makes more sense.

For existing hva-based shared memory, gfn is expected to work as well. The
only downside is that when aliasing multiple gfns to a single hva, the
current algorithm of checking multiple ranges could result in a much
larger range being rejected. Such aliasing should be uncommon, so the
impact is expected to be small.
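
A sketch of the gfn-based retry check this enables (simplified; the
locking assertions are elided and the helper/field names follow this
series):

	static inline bool mmu_invalidate_retry_gfn(struct kvm *kvm,
						    unsigned long mmu_seq,
						    gfn_t gfn)
	{
		/*
		 * Retry if the gfn hits the range of an in-progress
		 * invalidation, or if the fault began before the last
		 * completed invalidation (stale sequence count).
		 */
		if (unlikely(kvm->mmu_invalidate_in_progress) &&
		    gfn >= kvm->mmu_invalidate_range_start &&
		    gfn < kvm->mmu_invalidate_range_end)
			return true;

		return kvm->mmu_invalidate_seq != mmu_seq;
	}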

Suggested-by: Sean Christopherson 
Signed-off-by: Chao Peng 
Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 
[sean: convert vmx_set_apic_access_page_addr() to gfn-based API]
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/mmu/mmu.c   | 10 ++
 arch/x86/kvm/vmx/vmx.c   | 11 +--
 include/linux/kvm_host.h | 33 +
 virt/kvm/kvm_main.c  | 40 +++-
 4 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1d011c67cc6..0f0231d2b74f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3056,7 +3056,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
  *
  * There are several ways to safely use this helper:
  *
- * - Check mmu_invalidate_retry_hva() after grabbing the mapping level, before
+ * - Check mmu_invalidate_retry_gfn() after grabbing the mapping level, before
  *   consuming it.  In this case, mmu_lock doesn't need to be held during the
  *   lookup, but it does need to be held while checking the MMU notifier.
  *
@@ -4358,7 +4358,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu,
return true;
 
return fault->slot &&
-  mmu_invalidate_retry_hva(vcpu->kvm, fault->mmu_seq, fault->hva);
+  mmu_invalidate_retry_gfn(vcpu->kvm, fault->mmu_seq, fault->gfn);
 }
 
 static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault 
*fault)
@@ -6253,7 +6253,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
 
write_lock(>mmu_lock);
 
-   kvm_mmu_invalidate_begin(kvm, 0, -1ul);
+   kvm_mmu_invalidate_begin(kvm);
+
+   kvm_mmu_invalidate_range_add(kvm, gfn_start, gfn_end);
 
flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
@@ -6266,7 +6268,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, 
gfn_t gfn_end)
if (flush)
kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - 
gfn_start);
 
-   kvm_mmu_invalidate_end(kvm, 0, -1ul);
+   kvm_mmu_invalidate_end(kvm);
 
write_unlock(>mmu_lock);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..6e502ba93141 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6757,10 +6757,10 @@ static void vmx_set_apic_access_page_addr(struct 
kvm_vcpu *vcpu)
return;
 
/*
-* Grab the memslot so that the hva lookup for the mmu_notifier retry
-* is guaranteed to use the same memslot as the pfn lookup, i.e. rely
-* on the pfn lookup's validation of the memslot to ensure a valid hva
-* is used for the retry check.
+* Explicitly grab the memslot using KVM's internal slot ID to ensure
+* KVM doesn't unintentionally grab a userspace memslot.  It _should_
+* be impossible for userspace to create a memslot for the APIC when
+* APICv is enabled, but paranoia won't hurt in this case.
 */
slot = id_to_memslot(slots, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT);
if (!slot || slot->flags & KVM_MEMSLOT_INVALID)
@@ -6785,8 +6785,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu 
*vcpu)
return;
 
read_lock(>kvm->mmu_lock);
-   if (mmu_invalidate_retry_hva(kvm, mmu_seq,
-gfn_to_hva_memslot(slot, gfn))) {
+   if (mmu_invalidate_retry_gfn(kvm, mmu_seq, gfn)) {
kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
read_unlock(>kvm->mmu_lock);
goto out;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index fb6c6109fdca..11d091688346 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -787,8 +787,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_invalidate_seq;
long mmu_invalidate_in_progress;
-   unsigned long mmu_invalidate_range_start;
-   unsigned long mmu_invalidate_range_end;
+   gfn_t mmu_invalidate_range_start;
+   gfn_t mmu_invalidate_range_end;
 #endif
struct list_head devices;
u64 manual_dirty_log_protect;
@@ -1392,10 +1392,9 @@ void kvm_mmu_free_memory_cache(struct 
kvm_mmu_memory_cache *mc);
 void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 #endif
 

[RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges

2023-09-13 Thread Sean Christopherson
Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so
that the structure can be used to handle notifications that operate on gfn
context, i.e. that aren't tied to a host virtual address.

Practically speaking, this is a nop for 64-bit kernels as the only
meaningful change is to store start+end as u64s instead of unsigned longs.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..0524933856d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -541,18 +541,22 @@ static inline struct kvm *mmu_notifier_to_kvm(struct 
mmu_notifier *mn)
return container_of(mn, struct kvm, mmu_notifier);
 }
 
-typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
+typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
 
 typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start,
 unsigned long end);
 
 typedef void (*on_unlock_fn_t)(struct kvm *kvm);
 
-struct kvm_hva_range {
-   unsigned long start;
-   unsigned long end;
+struct kvm_mmu_notifier_range {
+   /*
+* 64-bit addresses, as KVM notifiers can operate on host virtual
+* addresses (unsigned long) and guest physical addresses (64-bit).
+*/
+   u64 start;
+   u64 end;
union kvm_mmu_notifier_arg arg;
-   hva_handler_t handler;
+   gfn_handler_t handler;
on_lock_fn_t on_lock;
on_unlock_fn_t on_unlock;
bool flush_on_ret;
@@ -581,7 +585,7 @@ static const union kvm_mmu_notifier_arg 
KVM_MMU_NOTIFIER_NO_ARG;
 node = interval_tree_iter_next(node, start, last))  \
 
 static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
- const struct kvm_hva_range 
*range)
+ const struct 
kvm_mmu_notifier_range *range)
 {
bool ret = false, locked = false;
struct kvm_gfn_range gfn_range;
@@ -608,9 +612,9 @@ static __always_inline int __kvm_handle_hva_range(struct 
kvm *kvm,
unsigned long hva_start, hva_end;
 
slot = container_of(node, struct kvm_memory_slot, 
hva_node[slots->node_idx]);
-   hva_start = max(range->start, slot->userspace_addr);
-   hva_end = min(range->end, slot->userspace_addr +
- (slot->npages << PAGE_SHIFT));
+   hva_start = max_t(unsigned long, range->start, 
slot->userspace_addr);
+   hva_end = min_t(unsigned long, range->end,
+   slot->userspace_addr + (slot->npages << 
PAGE_SHIFT));
 
/*
 * To optimize for the likely case where the address
@@ -660,10 +664,10 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
unsigned long start,
unsigned long end,
union kvm_mmu_notifier_arg arg,
-   hva_handler_t handler)
+   gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.arg= arg,
@@ -680,10 +684,10 @@ static __always_inline int kvm_handle_hva_range(struct 
mmu_notifier *mn,
 static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier 
*mn,
 unsigned long start,
 unsigned long end,
-hva_handler_t handler)
+gfn_handler_t handler)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range range = {
+   const struct kvm_mmu_notifier_range range = {
.start  = start,
.end= end,
.handler= handler,
@@ -771,7 +775,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
const struct mmu_notifier_range *range)
 {
struct kvm *kvm = mmu_notifier_to_kvm(mn);
-   const struct kvm_hva_range hva_range = {
+   const struct kvm_mmu_notifier_range hva_range = {
.start  = range->start,
.end= range->end,
.handler= 

[RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes

2023-09-13 Thread Sean Christopherson
This is hopefully the last RFC for implementing fd-based (instead of vma-based)
memory for KVM guests.  If you want the full background of why we are doing
this, please go read the v10 cover letter.  With luck, v13 will be a "normal"
series that's ready for inclusion.

Tagged RFC as there are still several empty changelogs, a lot of missing
documentation, and a handful of TODOs.  And I haven't tested or proofread this
anywhere near as much as I normally would.  I am posting even though the
remaining TODOs aren't _that_ big so that people can test this new version
without having to wait a few weeks to close out the remaining TODOs, i.e. to
give us at least some chance of hitting v6.7.

The most relevant TODO item for non-KVM folks is that we are planning on
dropping the dedicated "gmem" file system.  Assuming that pans out, the patch
to export security_inode_init_security_anon() should go away.

KVM folks, there are a few changes I want to highlight and get feedback on, all of
which are directly related to the "annotated memory faults" series[*]:

 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT or -EHWPOISON with exiting with KVM_EXIT_MEMORY_FAULT

The first one is pretty self-explanatory: "run->memory.gpa" looks quite odd and
would prevent ever doing something directly with memory.

Putting the struct in a separate union is not at all necessary for supporting
private memory; it's purely forward looking to Anish's series, which wants to
annotate (fill memory_fault) on all faults, even if KVM ultimately doesn't exit
to userspace (x86 has a few unfortunate flows where KVM can clobber a previous
exit, or suppress a memory fault exit).  Using a separate union, i.e. different
bytes in kvm_run, allows exiting to userspace with both memory_fault and the
"normal" union filled, e.g. if KVM starts an MMIO exit and then hits a memory
fault exit, the MMIO exit will be preserved.  It's unlikely userspace will be
able to do anything useful with the info in that case, but the reverse will
likely be much more interesting, e.g. if KVM hits a memory fault and then 
doesn't
report it to userspace for whatever reason.

As for returning -EFAULT/-EHWPOISON, far too many helpers that touch guest
memory, i.e. can "fault", return 0 on success, which makes it all but impossible
to use '0' to signal "exit to userspace".  Rather than use '0' for _just_ the
case where the guest is accessing private vs. shared, my thought is to use
-EFAULT everywhere except for the poisoned page case.
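
Under that scheme, a VMM's run loop would handle it roughly like this
(a sketch only; handle_memory_fault() and handle_run_error() are
hypothetical helpers, and the memory_fault field names are assumed from
this series):

	ret = ioctl(vcpu_fd, KVM_RUN, NULL);
	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		/* memory_fault is valid: gpa, size and flags describe it. */
		handle_memory_fault(vcpu_fd, run);
	} else if (ret < 0) {
		/* exit_reason is stale/undefined for any other errno. */
		handle_run_error(errno);
	}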

[*] https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com

TODOs [owner]:
 - Documentation [none]
 - Changelogs [Sean]
 - Fully anonymous inode vs. proper filesystem [Paolo]
 - kvm_gmem_error_page() testing (my version is untested) [Isaku?]

v12:
 - Squash fixes from others. [Many people]
 - Kill off the .on_unlock() callback and use .on_lock() when handling
   memory attributes updates. [Isaku]
 - Add more tests. [Ackerley]
 - Move range_has_attrs() to common code. [Paolo]
 - Return the actual number of address spaces for the VM-scoped version of
   KVM_CAP_MULTI_ADDRESS_SPACE. [Paolo]
 - Move forward declaration of "struct kvm_gfn_range" to kvm_types.h. [Yuan]
 - Plumb code to have HVA-based mmu_notifier events affect only shared
   mappings. [Anish]
 - Clean up kvm_vm_ioctl_set_mem_attributes() math. [Binbin]
 - Collect a few reviews and acks. [Paolo, Paul]
 - Unconditionally advertise a synchronized MMU on PPC. [Paolo]
 - Check for error return from filemap_grab_folio(). [A
 - Make max_order optional. [Fuad]
 - Remove signal injection, zap SPTEs on memory error. [Isaku]
 - Add KVM_CAP_GUEST_MEMFD. [Xiaoyao]
 - Invoke kvm_arch_pre_set_memory_attributes() instead of
   kvm_mmu_unmap_gfn_range().
 - Rename kvm_run.memory to kvm_run.memory_fault
 - Place "memory_fault" in a separate union
 - Return -EFAULT and -EHWPOISON with KVM_EXIT_MEMORY_FAULT
 - "Init" run->exit_reason in x86's vcpu_run()

v11:
 - https://lore.kernel.org/all/20230718234512.1690985-1-sea...@google.com
 - Test private<=>shared conversions *without* doing fallocate()
 - PUNCH_HOLE all memory between iterations of the conversion test so that
   KVM doesn't retain pages in the guest_memfd
 - Rename hugepage control to be a very generic ALLOW_HUGEPAGE, instead of
   giving it a THP or PMD specific name.
 - Fold in fixes from a lot of people (thank you!)
 - Zap SPTEs *before* updating attributes to ensure no weirdness, e.g. if
   KVM handles a page fault and looks at inconsistent attributes
 - Refactor MMU interaction with attributes updates to reuse much of KVM's
   framework for mmu_notifiers.

v10: 
https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.p...@linux.intel.com

Ackerley Tng (1):
  KVM: selftests: Test KVM exit behavior for private memory/access

Chao Peng (8):
  KVM: Use gfn instead of hva for mmu_notifier_retry
  KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
  KVM: Introduce 

[PATCH] powerpc: Export kvm_guest static key, for bcachefs six locks

2023-09-13 Thread Kent Overstreet
bcachefs's six locks need kvm_guest, via
  owner_on_cpu() -> vcpu_is_preempted() -> is_kvm_guest()

Signed-off-by: Kent Overstreet 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/firmware.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index 20328f72f9f2..8987eee33dc8 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -23,6 +23,8 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
 DEFINE_STATIC_KEY_FALSE(kvm_guest);
+EXPORT_SYMBOL_GPL(kvm_guest);
+
 int __init check_kvm_guest(void)
 {
struct device_node *hyper_node;
-- 
2.40.1



Re: [PATCH 2/2] arch: Reserve map_shadow_stack() syscall number for all architectures

2023-09-13 Thread Edgecombe, Rick P
On Wed, 2023-09-13 at 12:18 -0700, Sohil Mehta wrote:
> On 9/11/2023 2:10 PM, Edgecombe, Rick P wrote:
> > On Mon, 2023-09-11 at 18:02 +, Sohil Mehta wrote:
> > > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl
> > > b/arch/powerpc/kernel/syscalls/syscall.tbl
> > > index 20e50586e8a2..2767b8a42636 100644
> > > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > > @@ -539,3 +539,4 @@
> > >  450nospu   set_mempolicy_home_node sys_set_mempolicy_home_node
> > >  451common  cachestat   sys_cachestat
> > >  452common  fchmodat2   sys_fchmodat2
> > > +453common  map_shadow_stacksys_map_shadow_stack
> > 
> > I noticed in powerpc, the not implemented syscalls are manually
> > mapped
> > to sys_ni_syscall. It also has some special extra sys_ni_syscall()
> > implementation bits to handle both ARCH_HAS_SYSCALL_WRAPPER and
> > !ARCH_HAS_SYSCALL_WRAPPER. So wondering if it might need special
> > treatment. Did you see those parts?
> > 
> 
> Thanks for pointing this out. Powerpc seems to be unique in their
> handling of not implemented syscalls. Maybe it's because of their
> special casing of the ARCH_HAS_SYSCALL_WRAPPER.
> 
> The code below in arch/powerpc/include/asm/syscalls.h suggests to me
> that it should be safe to map map_shadow_stack() to sys_ni_syscall()
> and
> the special handling will be taken care of.
> 
> #ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
> long sys_ni_syscall(void);
> #else
> long sys_ni_syscall(const struct pt_regs *regs);
> #endif
> 
> I don't quite understand the underlying reasoning for it though. Do
> you
> have any additional insight into how we should handle this?
> 
> I am thinking of doing the following in the next iteration unless
> someone chimes in and says otherwise.
> 
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -539,4 +539,4 @@
>  450    nospu   set_mempolicy_home_node sys_set_mempolicy_home_node
>  451    common  cachestat   sys_cachestat
>  452    common  fchmodat2   sys_fchmodat2
> -453    common  map_shadow_stack    sys_map_shadow_stack
> +453    common  map_shadow_stack    sys_ni_syscall

It might have something to do with the fact that powerpc's COND_SYSCALL()
implementation only defines the struct pt_regs variety, but TBH I get a
bit lost when I get to the inline assembly symbol definition parts and
how it all ties together.

Doing powerpc's version as sys_ni_syscall seems to be consistent at
least, and makes sense with respect to the code you quoted.
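
In other words, with ARCH_HAS_SYSCALL_WRAPPER every table entry is
invoked with a pt_regs pointer, so slot 453 should resolve to something
like (a sketch, assuming the standard stub behavior rather than
powerpc's exact definition):

	#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
	long sys_ni_syscall(const struct pt_regs *regs)
	{
		return -ENOSYS;
	}
	#endif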


Re: [PATCH 0/2] arch: Sync all syscall tables with 2 newly added system calls

2023-09-13 Thread Sohil Mehta
On 9/11/2023 11:02 AM, Sohil Mehta wrote:
> Sohil Mehta (2):
>   tools headers UAPI: Sync fchmodat2() syscall table entries

I now see a patch by Arnaldo that does something similar:
https://lore.kernel.org/lkml/zp8be7axdbu%2fd...@kernel.org/

Also, it states that:

"The tools/perf/check-headers.sh script, part of the tools/ build
process, points out changes in the original files.

So its important not to touch the copies in tools/ when doing changes in
the original kernel headers, that will be done later, when
check-headers.sh inform about the change to the perf tools hackers."

I was unaware of this and therefore I'll drop all the tools/ related
changes from this series.

Sohil


Re: [PATCH 2/2] arch: Reserve map_shadow_stack() syscall number for all architectures

2023-09-13 Thread Sohil Mehta
On 9/11/2023 2:10 PM, Edgecombe, Rick P wrote:
> On Mon, 2023-09-11 at 18:02 +, Sohil Mehta wrote:
>> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl
>> b/arch/powerpc/kernel/syscalls/syscall.tbl
>> index 20e50586e8a2..2767b8a42636 100644
>> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
>> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
>> @@ -539,3 +539,4 @@
>>  450nospu   set_mempolicy_home_node sys_set_mempolicy_home_node
>>  451common  cachestat   sys_cachestat
>>  452common  fchmodat2   sys_fchmodat2
>> +453common  map_shadow_stacksys_map_shadow_stack
> 
> I noticed in powerpc, the not implemented syscalls are manually mapped
> to sys_ni_syscall. It also has some special extra sys_ni_syscall()
> implementation bits to handle both ARCH_HAS_SYSCALL_WRAPPER and
> !ARCH_HAS_SYSCALL_WRAPPER. So wondering if it might need special
> treatment. Did you see those parts?
> 

Thanks for pointing this out. Powerpc seems to be unique in their
handling of not implemented syscalls. Maybe it's because of their
special casing of the ARCH_HAS_SYSCALL_WRAPPER.

The code below in arch/powerpc/include/asm/syscalls.h suggests to me
that it should be safe to map map_shadow_stack() to sys_ni_syscall() and
the special handling will be taken care of.

#ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
long sys_ni_syscall(void);
#else
long sys_ni_syscall(const struct pt_regs *regs);
#endif

I don't quite understand the underlying reasoning for it though. Do you
have any additional insight into how we should handle this?

I am thinking of doing the following in the next iteration unless
someone chimes in and says otherwise.

--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -539,4 +539,4 @@
 450nospu   set_mempolicy_home_node sys_set_mempolicy_home_node
 451common  cachestat   sys_cachestat
 452common  fchmodat2   sys_fchmodat2
-453common  map_shadow_stacksys_map_shadow_stack
+453common  map_shadow_stacksys_ni_syscall



Re: [PATCH v14 00/15] phy: Add support for Lynx 10G SerDes

2023-09-13 Thread Vladimir Oltean
Hi Sean,

On Thu, Aug 10, 2023 at 03:58:36PM -0400, Sean Anderson wrote:
> I can look into doing this. It will be in my free time, so it will
> likely be a bit before I can update this series.

I was expecting you'd ask some clarification questions about the RCW
override procedure that I've informally described over email, so I guess
you haven't spent any more time on this.

I'm letting you know that very soon, I will have to start my work on
porting the backplane driver posted here:
https://patchwork.kernel.org/project/netdevbpf/cover/20230817150644.3605105-1-vladimir.olt...@nxp.com/
to the Lynx 10G SoCs. And for that, I need a SerDes driver as a base :)

I was wondering how inclined you are to respond positively to the
feedback that the lynx-10g driver should have a look and feel as close
as possible to lynx-28g, given that they're very similar.

Because internally within NXP, we do have a version of the lynx-10g
driver which is contemporary with lynx-28g from mainline, but we didn't
publish it because protocol changes didn't work (for the same reason
that they don't work with your driver). With that driver, you can think
of the feedback about the similar look and feel as being "implicitly applied"
(being written by the same author), so I'm starting to consider more and
more seriously the option of basing my work on that instead of your v14
(on which I'd need to spend extra time to modify the dt-bindings with PCCRs,
concept of lane groups, concept of PLL CCF driver, etc).

What are your thoughts?


Re: [PATCH v7 1/3 RESEND] block:sed-opal: SED Opal keystore

2023-09-13 Thread Nathan Chancellor
On Wed, Sep 13, 2023 at 01:49:39PM -0700, Nick Desaulniers wrote:
> On Wed, Sep 13, 2023 at 9:56 AM Nathan Chancellor  wrote:
> >
> > Hi Greg,
> >
> > On Fri, Sep 08, 2023 at 10:30:54AM -0500, gjo...@linux.vnet.ibm.com wrote:
> > > From: Greg Joyce 
> > >
> > > Add read and write functions that allow SED Opal keys to be stored
> > > in a permanent keystore.
> > >
> > > Signed-off-by: Greg Joyce 
> > > Reviewed-by: Jonathan Derrick 
> > > ---
> > >  block/Makefile   |  2 +-
> > >  block/sed-opal-key.c | 24 
> > >  include/linux/sed-opal-key.h | 15 +++
> > >  3 files changed, 40 insertions(+), 1 deletion(-)
> > >  create mode 100644 block/sed-opal-key.c
> > >  create mode 100644 include/linux/sed-opal-key.h
> > >
> > > diff --git a/block/Makefile b/block/Makefile
> > > index 46ada9dc8bbf..ea07d80402a6 100644
> > > --- a/block/Makefile
> > > +++ b/block/Makefile
> > > @@ -34,7 +34,7 @@ obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
> > >  obj-$(CONFIG_BLK_WBT)+= blk-wbt.o
> > >  obj-$(CONFIG_BLK_DEBUG_FS)   += blk-mq-debugfs.o
> > >  obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
> > > -obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o
> > > +obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o sed-opal-key.o
> > >  obj-$(CONFIG_BLK_PM) += blk-pm.o
> > >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)  += blk-crypto.o 
> > > blk-crypto-profile.o \
> > >  blk-crypto-sysfs.o
> > > diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
> > > new file mode 100644
> > > index ..16f380164c44
> > > --- /dev/null
> > > +++ b/block/sed-opal-key.c
> > > @@ -0,0 +1,24 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * SED key operations.
> > > + *
> > > + * Copyright (C) 2022 IBM Corporation
> > > + *
> > > + * These are the accessor functions (read/write) for SED Opal
> > > + * keys. Specific keystores can provide overrides.
> > > + *
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> > > +
> > > +int __weak sed_write_key(char *keyname, char *key, u_int keylen)
> > > +{
> > > + return -EOPNOTSUPP;
> > > +}
> >
> > This change causes a build failure for certain clang configurations due
> > to an unfortunate issue [1] with recordmcount, clang's integrated
> > assembler, and object files that contain a section with only weak
> > functions/symbols (in this case, the .text section in sed-opal-key.c),
> > resulting in
> >
> >   Cannot find symbol for section 2: .text.
> >   block/sed-opal-key.o: failed
> >
> > when building this file.
> 
> The definitions in
> block/sed-opal-key.c
> should be deleted. Instead, in
> include/linux/sed-opal-key.h
> CONFIG_PSERIES_PLPKS_SED should be used to define static inline
> versions when CONFIG_PSERIES_PLPKS_SED is not defined.
> 
> #ifdef CONFIG_PSERIES_PLPKS_SED
> int sed_read_key(char *keyname, char *key, u_int *keylen);
> int sed_write_key(char *keyname, char *key, u_int keylen);
> #else
> static inline
> int sed_read_key(char *keyname, char *key, u_int *keylen) {
>   return -EOPNOTSUPP;
> }
> static inline
> int sed_write_key(char *keyname, char *key, u_int keylen) {
>   return -EOPNOTSUPP;
> }
> #endif

Ah yes, this is the other solution. I figured the way that it was
written, sed_read_key() and sed_write_key() may be overridden by a
different architecture or translation unit in the future but I think
until it is needed, your solution would be perfectly fine. Thanks for
taking a look!

Cheers,
Nathan

> > Is there any real reason to have a separate translation unit for these
> > two functions versus just having them living in sed-opal.c? Those two
> > object files share the same Kconfig dependency. I am happy to send a
> > patch if that is an acceptable approach.
> >
> > [1]: https://github.com/ClangBuiltLinux/linux/issues/981
> >
> > Cheers,
> > Nathan
> >
> 
> 
> -- 
> Thanks,
> ~Nick Desaulniers


Re: [PATCH v7 1/3 RESEND] block:sed-opal: SED Opal keystore

2023-09-13 Thread Nick Desaulniers
On Wed, Sep 13, 2023 at 9:56 AM Nathan Chancellor  wrote:
>
> Hi Greg,
>
> On Fri, Sep 08, 2023 at 10:30:54AM -0500, gjo...@linux.vnet.ibm.com wrote:
> > From: Greg Joyce 
> >
> > Add read and write functions that allow SED Opal keys to be stored
> > in a permanent keystore.
> >
> > Signed-off-by: Greg Joyce 
> > Reviewed-by: Jonathan Derrick 
> > ---
> >  block/Makefile   |  2 +-
> >  block/sed-opal-key.c | 24 
> >  include/linux/sed-opal-key.h | 15 +++
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> >  create mode 100644 block/sed-opal-key.c
> >  create mode 100644 include/linux/sed-opal-key.h
> >
> > diff --git a/block/Makefile b/block/Makefile
> > index 46ada9dc8bbf..ea07d80402a6 100644
> > --- a/block/Makefile
> > +++ b/block/Makefile
> > @@ -34,7 +34,7 @@ obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
> >  obj-$(CONFIG_BLK_WBT)+= blk-wbt.o
> >  obj-$(CONFIG_BLK_DEBUG_FS)   += blk-mq-debugfs.o
> >  obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
> > -obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o
> > +obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o sed-opal-key.o
> >  obj-$(CONFIG_BLK_PM) += blk-pm.o
> >  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)  += blk-crypto.o blk-crypto-profile.o \
> >  blk-crypto-sysfs.o
> > diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
> > new file mode 100644
> > index ..16f380164c44
> > --- /dev/null
> > +++ b/block/sed-opal-key.c
> > @@ -0,0 +1,24 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * SED key operations.
> > + *
> > + * Copyright (C) 2022 IBM Corporation
> > + *
> > + * These are the accessor functions (read/write) for SED Opal
> > + * keys. Specific keystores can provide overrides.
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
> > +{
> > + return -EOPNOTSUPP;
> > +}
> > +
> > +int __weak sed_write_key(char *keyname, char *key, u_int keylen)
> > +{
> > + return -EOPNOTSUPP;
> > +}
>
> This change causes a build failure for certain clang configurations due
> to an unfortunate issue [1] with recordmcount, clang's integrated
> assembler, and object files that contain a section with only weak
> functions/symbols (in this case, the .text section in sed-opal-key.c),
> resulting in
>
>   Cannot find symbol for section 2: .text.
>   block/sed-opal-key.o: failed
>
> when building this file.

The definitions in
block/sed-opal-key.c
should be deleted. Instead, in
include/linux/sed-opal-key.h
CONFIG_PSERIES_PLPKS_SED should be used to define static inline
versions when CONFIG_PSERIES_PLPKS_SED is not defined.

#ifdef CONFIG_PSERIES_PLPKS_SED
int sed_read_key(char *keyname, char *key, u_int *keylen);
int sed_write_key(char *keyname, char *key, u_int keylen);
#else
static inline
int sed_read_key(char *keyname, char *key, u_int *keylen) {
  return -EOPNOTSUPP;
}
static inline
int sed_write_key(char *keyname, char *key, u_int keylen) {
  return -EOPNOTSUPP;
}
#endif

>
> Is there any real reason to have a separate translation unit for these
> two functions versus just having them living in sed-opal.c? Those two
> object files share the same Kconfig dependency. I am happy to send a
> patch if that is an acceptable approach.
>
> [1]: https://github.com/ClangBuiltLinux/linux/issues/981
>
> Cheers,
> Nathan
>


-- 
Thanks,
~Nick Desaulniers


Re: [PATCH v7 3/3 RESEND] powerpc/pseries: PLPKS SED Opal keystore support

2023-09-13 Thread Jens Axboe
On 9/13/23 12:59 PM, Nathan Chancellor wrote:
> Hi Greg,
> 
> On Fri, Sep 08, 2023 at 10:30:56AM -0500, gjo...@linux.vnet.ibm.com wrote:
>> From: Greg Joyce 
>>
>> Define operations for SED Opal to read/write keys
>> from POWER LPAR Platform KeyStore (PLPKS). This allows
>> non-volatile storage of SED Opal keys.
>>
>> Signed-off-by: Greg Joyce 
>> Reviewed-by: Jonathan Derrick 
>> Reviewed-by: Hannes Reinecke 
> 
> After this change in -next as commit 9f2c7411ada9 ("powerpc/pseries:
> PLPKS SED Opal keystore support"), I see the following crash when
> booting some distribution configurations, such as OpenSUSE's [1] (the
> rootfs is available at [2] if necessary):

I'll drop the series for now - I didn't push out the main branch just
yet as I don't publish the block next tree until at least -rc2 time,
so it's just in a private branch for now.

-- 
Jens Axboe




Re: [PATCH v7 3/3 RESEND] powerpc/pseries: PLPKS SED Opal keystore support

2023-09-13 Thread Nathan Chancellor
Hi Greg,

On Fri, Sep 08, 2023 at 10:30:56AM -0500, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
>
> Define operations for SED Opal to read/write keys
> from POWER LPAR Platform KeyStore (PLPKS). This allows
> non-volatile storage of SED Opal keys.
>
> Signed-off-by: Greg Joyce 
> Reviewed-by: Jonathan Derrick 
> Reviewed-by: Hannes Reinecke 

After this change in -next as commit 9f2c7411ada9 ("powerpc/pseries:
PLPKS SED Opal keystore support"), I see the following crash when
booting some distribution configurations, such as OpenSUSE's [1] (the
rootfs is available at [2] if necessary):

$ qemu-system-ppc64 \
-display none \
-nodefaults \
-device ipmi-bmc-sim,id=bmc0 \
-device isa-ipmi-bt,bmc=bmc0,irq=10 \
-machine powernv \
-kernel arch/powerpc/boot/zImage.epapr \
-initrd ppc64le-rootfs.cpio \
-m 2G \
-serial mon:stdio
...
[0.00] Linux version 6.6.0-rc1-4-g9f2c7411ada9 
(nathan@dev-arch.thelio-3990X) (powerpc64-linux-gcc (GCC) 13.2.0, GNU ld (GNU 
Binutils) 2.41) #1 SMP Wed Sep 13 11:53:38 MST 2023
...
[1.808911] [ cut here ]
[1.810336] kernel BUG at arch/powerpc/kernel/syscall.c:34!
[1.810799] Oops: Exception in kernel mode, sig: 5 [#1]
[1.810985] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[1.811191] Modules linked in:
[1.811483] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
6.6.0-rc1-4-g9f2c7411ada9 #1
[1.811825] Hardware name: IBM PowerNV (emulated by qemu) POWER9 0x4e1202 
opal:v7.0 PowerNV
[1.812133] NIP:  c002c8c4 LR: c000d620 CTR: c000d4c0
[1.812335] REGS: c2deb7b0 TRAP: 0700   Not tainted  
(6.6.0-rc1-4-g9f2c7411ada9)
[1.812595] MSR:  90029033   CR: 2800028d  
XER: 20040004
[1.812930] CFAR: c000d61c IRQMASK: 3
[1.812930] GPR00: c000d620 c2deba50 c15ef400 
c2debe80
[1.812930] GPR04: 4800028d   

[1.812930] GPR08: 79cd 0001  

[1.812930] GPR12:  c28b  

[1.812930] GPR16:    

[1.812930] GPR20:    

[1.812930] GPR24:    

[1.812930] GPR28:  4800028d c2debe80 
c2debe10
[1.814858] NIP [c002c8c4] system_call_exception+0x84/0x250
[1.815480] LR [c000d620] system_call_common+0x160/0x2c4
[1.815772] Call Trace:
[1.815929] [c2debe50] [c000d620] 
system_call_common+0x160/0x2c4
[1.816178] --- interrupt: c00 at plpar_hcall+0x38/0x60
[1.816330] NIP:  c00e43f8 LR: c00fb558 CTR: 
[1.816518] REGS: c2debe80 TRAP: 0c00   Not tainted  
(6.6.0-rc1-4-g9f2c7411ada9)
[1.816740] MSR:  9280b033   CR: 
2800028d  XER: 
[1.817039] IRQMASK: 0
[1.817039] GPR00: 4800028d c2deb950 c15ef400 
0434
[1.817039] GPR04: 028eb190 28ac6600 001d 
0010
[1.817039] GPR08:    

[1.817039] GPR12:  c28b c0011188 

[1.817039] GPR16:    

[1.817039] GPR20:    

[1.817039] GPR24:    
c00028ac6600
[1.817039] GPR28: 0010 c28eb190 c00028ac6600 
c2deba30
[1.818785] NIP [c00e43f8] plpar_hcall+0x38/0x60
[1.818929] LR [c00fb558] plpks_read_var+0x208/0x290
[1.819093] --- interrupt: c00
[1.819195] [c2deb950] [c00fb528] plpks_read_var+0x1d8/0x290 
(unreliable)
[1.819433] [c2deba10] [c00fc1ac] sed_read_key+0x9c/0x170
[1.819617] [c2debad0] [c20541a8] sed_opal_init+0xac/0x174
[1.819823] [c2debc50] [c0010ad0] do_one_initcall+0x80/0x3b0
[1.820017] [c2debd30] [c2004860] 
kernel_init_freeable+0x338/0x3dc
[1.820229] [c2debdf0] [c00111b0] kernel_init+0x30/0x1a0
[1.820411] [c2debe50] [c000d620] 
system_call_common+0x160/0x2c4
[1.820614] --- interrupt: c00 at plpar_hcall+0x38/0x60
[1.820755] NIP:  c00e43f8 LR: c00fb558 CTR: 
[1.820940] REGS: c2debe80 TRAP: 0c00   Not tainted  
(6.6.0-rc1-4-g9f2c7411ada9)
[1.821157] MSR:  9280b033   CR: 
2800028d  XER: 
[1.821444] IRQMASK: 0
[1.821444] GPR00: 4800028d c2deb950 

Re: [PATCH v5 24/31] net: wan: Add framer framework support

2023-09-13 Thread Mark Brown
On Tue, Sep 12, 2023 at 12:14:36PM +0200, Herve Codina wrote:
> A framer is a component in charge of an E1/T1 line interface.
> Connected usually to a TDM bus, it converts TDM frames to/from E1/T1
> frames. It also provides information related to the E1/T1 line.
> 
> The framer framework provides a set of APIs for the framer drivers
> (framer provider) to create/destroy a framer and APIs for the framer
> users (framer consumer) to obtain a reference to the framer, and
> use the framer.

If people are fine with this, could we perhaps get it applied on a branch
with a tag?  That way we could cut down the size of the series a little
and I could apply the generic ASoC bit too, neither of the two patches
have any dependency on the actual hardware.




Re: [PATCH v7 1/3 RESEND] block:sed-opal: SED Opal keystore

2023-09-13 Thread Nathan Chancellor
Hi Greg,

On Fri, Sep 08, 2023 at 10:30:54AM -0500, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> Add read and write functions that allow SED Opal keys to be stored
> in a permanent keystore.
> 
> Signed-off-by: Greg Joyce 
> Reviewed-by: Jonathan Derrick 
> ---
>  block/Makefile   |  2 +-
>  block/sed-opal-key.c | 24 
>  include/linux/sed-opal-key.h | 15 +++
>  3 files changed, 40 insertions(+), 1 deletion(-)
>  create mode 100644 block/sed-opal-key.c
>  create mode 100644 include/linux/sed-opal-key.h
> 
> diff --git a/block/Makefile b/block/Makefile
> index 46ada9dc8bbf..ea07d80402a6 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -34,7 +34,7 @@ obj-$(CONFIG_BLK_DEV_ZONED) += blk-zoned.o
>  obj-$(CONFIG_BLK_WBT)+= blk-wbt.o
>  obj-$(CONFIG_BLK_DEBUG_FS)   += blk-mq-debugfs.o
>  obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
> -obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o
> +obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o sed-opal-key.o
>  obj-$(CONFIG_BLK_PM) += blk-pm.o
>  obj-$(CONFIG_BLK_INLINE_ENCRYPTION)  += blk-crypto.o blk-crypto-profile.o \
>  blk-crypto-sysfs.o
> diff --git a/block/sed-opal-key.c b/block/sed-opal-key.c
> new file mode 100644
> index ..16f380164c44
> --- /dev/null
> +++ b/block/sed-opal-key.c
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * SED key operations.
> + *
> + * Copyright (C) 2022 IBM Corporation
> + *
> + * These are the accessor functions (read/write) for SED Opal
> + * keys. Specific keystores can provide overrides.
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +int __weak sed_read_key(char *keyname, char *key, u_int *keylen)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +int __weak sed_write_key(char *keyname, char *key, u_int keylen)
> +{
> + return -EOPNOTSUPP;
> +}

This change causes a build failure for certain clang configurations due
to an unfortunate issue [1] with recordmcount, clang's integrated
assembler, and object files that contain a section with only weak
functions/symbols (in this case, the .text section in sed-opal-key.c),
resulting in

  Cannot find symbol for section 2: .text.
  block/sed-opal-key.o: failed

when building this file.

Is there any real reason to have a separate translation unit for these
two functions versus just having them living in sed-opal.c? Those two
object files share the same Kconfig dependency. I am happy to send a
patch if that is an acceptable approach.

[1]: https://github.com/ClangBuiltLinux/linux/issues/981

Cheers,
Nathan


Re: [PATCH v5 08/31] dt-bindings: soc: fsl: cpm_qe: cpm1-scc-qmc: Add support for QMC HDLC

2023-09-13 Thread Conor Dooley
On Wed, Sep 13, 2023 at 03:56:16PM +0100, Conor Dooley wrote:
> On Wed, Sep 13, 2023 at 04:52:50PM +0200, Herve Codina wrote:
> > On Wed, 13 Sep 2023 15:42:45 +0100
> > Conor Dooley  wrote:
> > 
> > > On Wed, Sep 13, 2023 at 09:26:40AM +0200, Herve Codina wrote:
> > > > Hi Conor,
> > > > 
> > > > On Tue, 12 Sep 2023 18:21:58 +0100
> > > > Conor Dooley  wrote:
> > > >   
> > > > > On Tue, Sep 12, 2023 at 12:10:18PM +0200, Herve Codina wrote:  
> > > > > > The QMC (QUICC multichannel controller) is a controller present in 
> > > > > > some
> > > > > > PowerQUICC SoC such as MPC885.
> > > > > > The QMC HDLC uses the QMC controller to transfer HDLC data.
> > > > > > 
> > > > > > Additionally, a framer can be connected to the QMC HDLC.
> > > > > > If present, this framer is the interface between the TDM bus used 
> > > > > > by the
> > > > > > QMC HDLC and the E1/T1 line.
> > > > > > The QMC HDLC can use this framer to get information about the E1/T1 
> > > > > > line
> > > > > > and configure the E1/T1 line.
> > > > > > 
> > > > > > Signed-off-by: Herve Codina 
> > > > > > ---
> > > > > >  .../bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml   | 13 
> > > > > > +
> > > > > >  1 file changed, 13 insertions(+)
> > > > > > 
> > > > > > diff --git 
> > > > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > >  
> > > > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > > index 82d9beb48e00..b5073531f3f1 100644
> > > > > > --- 
> > > > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > > +++ 
> > > > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > > @@ -101,6 +101,16 @@ patternProperties:
> > > > > >Channel assigned Rx time-slots within the Rx time-slots 
> > > > > > routed by the
> > > > > >TSA to this cell.
> > > > > >  
> > > > > > +  compatible:
> > > > > > +const: fsl,qmc-hdlc
> > > > > > +
> > > > > > +  fsl,framer:
> > > > > > +$ref: /schemas/types.yaml#/definitions/phandle
> > > > > > +description:
> > > > > > +  phandle to the framer node. The framer is in charge of 
> > > > > > an E1/T1 line
> > > > > > +  interface connected to the TDM bus. It can be used to 
> > > > > > get the E1/T1 line
> > > > > > +  status such as link up/down.
> > > > > 
> > > > > Sounds like this fsl,framer property should depend on the compatible
> > > > > being present, no?  
> > > > 
> > > > Well from the implementation point of view, only the QMC HDLC driver 
> > > > uses this
> > > > property.
> > > > 
> > > > From the hardware description point of view, this property means that 
> > > > the time slots
> > > > handled by this channel are connected to the framer. So I think it 
> > > > makes sense for
> > > > any channel no matter the compatible (even if compatible is not 
> > > > present).
> > > > 
> > > > Should I change and constraint the fsl,framer property to the 
> > > > compatible presence ?
> > > > If so, is the following correct for this contraint ?
> > > >--- 8< ---
> > > >dependencies:
> > > >  - fsl,framer: [ compatible ];
> > > >--- 8< ---  
> > > 
> > > The regular sort of
> > > if:
> > >   compatible:
> > >     contains:
> > >       const: foo
> > > then:
> > >   required:
> > >     - fsl,framer
> > > would fit the bill, no?
> > 
> > Not sure.
> > "fsl,framer" is an optional property (depending on the hardware we can have
> > a framer or not).
> 
> Ah apologies, I had it backwards! Your suggestion seems fair in that
> case.

Or actually,
if:
  compatible:
    not:
      contains:
        const: foo
then:
  properties:
    fsl,framer: false
? That should do the trick in a more conventional way.




Re: [PATCH v5 08/31] dt-bindings: soc: fsl: cpm_qe: cpm1-scc-qmc: Add support for QMC HDLC

2023-09-13 Thread Conor Dooley
On Wed, Sep 13, 2023 at 04:52:50PM +0200, Herve Codina wrote:
> On Wed, 13 Sep 2023 15:42:45 +0100
> Conor Dooley  wrote:
> 
> > On Wed, Sep 13, 2023 at 09:26:40AM +0200, Herve Codina wrote:
> > > Hi Conor,
> > > 
> > > On Tue, 12 Sep 2023 18:21:58 +0100
> > > Conor Dooley  wrote:
> > >   
> > > > On Tue, Sep 12, 2023 at 12:10:18PM +0200, Herve Codina wrote:  
> > > > > The QMC (QUICC multichannel controller) is a controller present in some
> > > > > PowerQUICC SoC such as MPC885.
> > > > > The QMC HDLC uses the QMC controller to transfer HDLC data.
> > > > > 
> > > > > Additionally, a framer can be connected to the QMC HDLC.
> > > > > If present, this framer is the interface between the TDM bus used by 
> > > > > the
> > > > > QMC HDLC and the E1/T1 line.
> > > > > The QMC HDLC can use this framer to get information about the E1/T1 
> > > > > line
> > > > > and configure the E1/T1 line.
> > > > > 
> > > > > Signed-off-by: Herve Codina 
> > > > > ---
> > > > >  .../bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml   | 13 
> > > > > +
> > > > >  1 file changed, 13 insertions(+)
> > > > > 
> > > > > diff --git 
> > > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > >  
> > > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > index 82d9beb48e00..b5073531f3f1 100644
> > > > > --- 
> > > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > +++ 
> > > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > > @@ -101,6 +101,16 @@ patternProperties:
> > > > >Channel assigned Rx time-slots within the Rx time-slots 
> > > > > routed by the
> > > > >TSA to this cell.
> > > > >  
> > > > > +  compatible:
> > > > > +const: fsl,qmc-hdlc
> > > > > +
> > > > > +  fsl,framer:
> > > > > +$ref: /schemas/types.yaml#/definitions/phandle
> > > > > +description:
> > > > > +  phandle to the framer node. The framer is in charge of an 
> > > > > E1/T1 line
> > > > > +  interface connected to the TDM bus. It can be used to get 
> > > > > the E1/T1 line
> > > > > +  status such as link up/down.
> > > > 
> > > > Sounds like this fsl,framer property should depend on the compatible
> > > > being present, no?  
> > > 
> > > Well from the implementation point of view, only the QMC HDLC driver uses 
> > > this
> > > property.
> > > 
> > > From the hardware description point of view, this property means that the 
> > > time slots
> > > handled by this channel are connected to the framer. So I think it makes 
> > > sense for
> > > any channel no matter the compatible (even if compatible is not present).
> > > 
> > > Should I change and constrain the fsl,framer property to the compatible 
> > > presence ?
> > > If so, is the following correct for this constraint ?
> > >--- 8< ---
> > >dependencies:
> > >  - fsl,framer: [ compatible ];
> > >--- 8< ---  
> > 
> > The regular sort of
> > if:
> >   compatible:
> >     contains:
> >       const: foo
> > then:
> >   required:
> >     - fsl,framer
> > would fit the bill, no?
> 
> Not sure.
> "fsl,framer" is an optional property (depending on the hardware we can have
> a framer or not).

Ah apologies, I had it backwards! Your suggestion seems fair in that
case.

Thanks,
Conor.




Re: [PATCH v5 08/31] dt-bindings: soc: fsl: cpm_qe: cpm1-scc-qmc: Add support for QMC HDLC

2023-09-13 Thread Herve Codina
On Wed, 13 Sep 2023 15:42:45 +0100
Conor Dooley  wrote:

> On Wed, Sep 13, 2023 at 09:26:40AM +0200, Herve Codina wrote:
> > Hi Conor,
> > 
> > On Tue, 12 Sep 2023 18:21:58 +0100
> > Conor Dooley  wrote:
> >   
> > > On Tue, Sep 12, 2023 at 12:10:18PM +0200, Herve Codina wrote:  
> > > > The QMC (QUICC multichannel controller) is a controller present in some
> > > > PowerQUICC SoC such as MPC885.
> > > > The QMC HDLC uses the QMC controller to transfer HDLC data.
> > > > 
> > > > Additionally, a framer can be connected to the QMC HDLC.
> > > > If present, this framer is the interface between the TDM bus used by the
> > > > QMC HDLC and the E1/T1 line.
> > > > The QMC HDLC can use this framer to get information about the E1/T1 line
> > > > and configure the E1/T1 line.
> > > > 
> > > > Signed-off-by: Herve Codina 
> > > > ---
> > > >  .../bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml   | 13 +
> > > >  1 file changed, 13 insertions(+)
> > > > 
> > > > diff --git 
> > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > >  
> > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > index 82d9beb48e00..b5073531f3f1 100644
> > > > --- 
> > > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > +++ 
> > > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > > @@ -101,6 +101,16 @@ patternProperties:
> > > >Channel assigned Rx time-slots within the Rx time-slots 
> > > > routed by the
> > > >TSA to this cell.
> > > >  
> > > > +  compatible:
> > > > +const: fsl,qmc-hdlc
> > > > +
> > > > +  fsl,framer:
> > > > +$ref: /schemas/types.yaml#/definitions/phandle
> > > > +description:
> > > > +  phandle to the framer node. The framer is in charge of an 
> > > > E1/T1 line
> > > > +  interface connected to the TDM bus. It can be used to get 
> > > > the E1/T1 line
> > > > +  status such as link up/down.
> > > 
> > > Sounds like this fsl,framer property should depend on the compatible
> > > being present, no?  
> > 
> > Well from the implementation point of view, only the QMC HDLC driver uses 
> > this
> > property.
> > 
> > From the hardware description point of view, this property means that the 
> > time slots
> > handled by this channel are connected to the framer. So I think it makes 
> > sense for
> > any channel no matter the compatible (even if compatible is not present).
> > 
> > Should I change and constrain the fsl,framer property to the compatible 
> > presence ?
> > If so, is the following correct for this constraint ?
> >--- 8< ---
> >dependencies:
> >  - fsl,framer: [ compatible ];
> >--- 8< ---  
> 
> The regular sort of
> if:
>   compatible:
>     contains:
>       const: foo
> then:
>   required:
>     - fsl,framer
> would fit the bill, no?

Not sure.
"fsl,framer" is an optional property (depending on the hardware we can have
a framer or not).

Hervé


Re: [PATCH v5 08/31] dt-bindings: soc: fsl: cpm_qe: cpm1-scc-qmc: Add support for QMC HDLC

2023-09-13 Thread Conor Dooley
On Wed, Sep 13, 2023 at 09:26:40AM +0200, Herve Codina wrote:
> Hi Conor,
> 
> On Tue, 12 Sep 2023 18:21:58 +0100
> Conor Dooley  wrote:
> 
> > On Tue, Sep 12, 2023 at 12:10:18PM +0200, Herve Codina wrote:
> > > The QMC (QUICC multichannel controller) is a controller present in some
> > > PowerQUICC SoCs such as MPC885.
> > > The QMC HDLC uses the QMC controller to transfer HDLC data.
> > > 
> > > Additionally, a framer can be connected to the QMC HDLC.
> > > If present, this framer is the interface between the TDM bus used by the
> > > QMC HDLC and the E1/T1 line.
> > > The QMC HDLC can use this framer to get information about the E1/T1 line
> > > and configure the E1/T1 line.
> > > 
> > > Signed-off-by: Herve Codina 
> > > ---
> > >  .../bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml   | 13 +
> > >  1 file changed, 13 insertions(+)
> > > 
> > > diff --git 
> > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml 
> > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > index 82d9beb48e00..b5073531f3f1 100644
> > > --- 
> > > a/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > +++ 
> > > b/Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml
> > > @@ -101,6 +101,16 @@ patternProperties:
> > >Channel assigned Rx time-slots within the Rx time-slots routed 
> > > by the
> > >TSA to this cell.
> > >  
> > > +  compatible:
> > > +const: fsl,qmc-hdlc
> > > +
> > > +  fsl,framer:
> > > +$ref: /schemas/types.yaml#/definitions/phandle
> > > +description:
> > > +  phandle to the framer node. The framer is in charge of an 
> > > E1/T1 line
> > > +  interface connected to the TDM bus. It can be used to get the 
> > > E1/T1 line
> > > +  status such as link up/down.  
> > 
> > Sounds like this fsl,framer property should depend on the compatible
> > being present, no?
> 
> Well, from the implementation point of view, only the QMC HDLC driver uses
> this property.
> 
> From the hardware description point of view, this property means that the
> time slots handled by this channel are connected to the framer. So I think
> it makes sense for any channel no matter the compatible (even if the
> compatible is not present).
> 
> Should I change and constrain the fsl,framer property to the presence of
> the compatible?
> If so, is the following correct for this constraint?
>--- 8< ---
>dependencies:
>  - fsl,framer: [ compatible ];
>--- 8< ---

The regular sort of
if:
  compatible:
    contains:
      const: foo
then:
  required:
    - fsl,framer
would fit the bill, no?




Re: [PATCH v5 25/31] dt-bindings: net: Add the Lantiq PEF2256 E1/T1/J1 framer

2023-09-13 Thread Conor Dooley
On Tue, Sep 12, 2023 at 01:54:05PM -0500, Rob Herring wrote:
> > > +  lantiq,data-rate-bps:
> > > +$ref: /schemas/types.yaml#/definitions/uint32
> > > +enum: [2048000, 4096000, 8192000, 16384000]
> > 
> > -kBps is a standard suffix; would it be worth using that instead here?
> > What you have would fit as even multiples.
> > Otherwise Rob, should dt-schema grow -bps as a standard suffix?
> 
> Yeah, I think that makes sense. I've added it now.

Cool, thanks!




[PATCH v8 08/24] iommu: Reorganize iommu_get_default_domain_type() to respect def_domain_type()

2023-09-13 Thread Jason Gunthorpe
Except for dart (which forces IOMMU_DOMAIN_DMA) every driver returns 0 or
IDENTITY from ops->def_domain_type().

The drivers that return IDENTITY have some kind of good reason, typically
that quirky hardware really can't support anything other than IDENTITY.

Arrange things so that if the driver says it needs IDENTITY then
iommu_get_default_domain_type() either fails or returns IDENTITY.  It will
not ignore the driver's override to IDENTITY.

Split the function into two steps, reducing the group device list to the
driver's def_domain_type() and the untrusted flag.

Then compute the result based on those two reduced variables. Fully reject
combining untrusted with IDENTITY.

Remove the debugging print on the iommu_group_store_type() failure path,
userspace should not be able to trigger kernel prints.

This makes the next patch cleaner that wants to force IDENTITY always for
ARM_IOMMU because there is no support for DMA.

Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 117 --
 1 file changed, 79 insertions(+), 38 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0e13e566581c21..9188eae61e929e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1718,19 +1718,6 @@ struct iommu_group *fsl_mc_device_group(struct device 
*dev)
 }
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
-static int iommu_get_def_domain_type(struct device *dev)
-{
-   const struct iommu_ops *ops = dev_iommu_ops(dev);
-
-   if (dev_is_pci(dev) && to_pci_dev(dev)->untrusted)
-   return IOMMU_DOMAIN_DMA;
-
-   if (ops->def_domain_type)
-   return ops->def_domain_type(dev);
-
-   return 0;
-}
-
 static struct iommu_domain *
 __iommu_group_alloc_default_domain(const struct bus_type *bus,
   struct iommu_group *group, int req_type)
@@ -1740,6 +1727,23 @@ __iommu_group_alloc_default_domain(const struct bus_type 
*bus,
return __iommu_domain_alloc(bus, req_type);
 }
 
+/*
+ * Returns the iommu_ops for the devices in an iommu group.
+ *
+ * It is assumed that all devices in an iommu group are managed by a single
+ * IOMMU unit. Therefore, this returns the dev_iommu_ops of the first device
+ * in the group.
+ */
+static const struct iommu_ops *group_iommu_ops(struct iommu_group *group)
+{
+   struct group_device *device =
+   list_first_entry(&group->devices, struct group_device, list);
+
+   lockdep_assert_held(&group->mutex);
+
+   return dev_iommu_ops(device->dev);
+}
+
 /*
  * req_type of 0 means "auto" which means to select a domain based on
  * iommu_def_domain_type or what the driver actually supports.
@@ -1820,40 +1824,77 @@ static int iommu_bus_notifier(struct notifier_block *nb,
return 0;
 }
 
-/* A target_type of 0 will select the best domain type and cannot fail */
+/*
+ * Combine the driver's chosen def_domain_type across all the devices in a
+ * group. Drivers must give a consistent result.
+ */
+static int iommu_get_def_domain_type(struct iommu_group *group,
+struct device *dev, int cur_type)
+{
+   const struct iommu_ops *ops = group_iommu_ops(group);
+   int type;
+
+   if (!ops->def_domain_type)
+   return cur_type;
+
+   type = ops->def_domain_type(dev);
+   if (!type || cur_type == type)
+   return cur_type;
+   if (!cur_type)
+   return type;
+
+   dev_err_ratelimited(
+   dev,
+   "IOMMU driver error, requesting conflicting def_domain_type, %s 
and %s, for devices in group %u.\n",
+   iommu_domain_type_str(cur_type), iommu_domain_type_str(type),
+   group->id);
+
+   /*
+* Try to recover, drivers are allowed to force IDENTITY or DMA, IDENTITY
+* takes precedence.
+*/
+   if (type == IOMMU_DOMAIN_IDENTITY)
+   return type;
+   return cur_type;
+}
+
+/*
+ * A target_type of 0 will select the best domain type. 0 can be returned in
+ * this case meaning the global default should be used.
+ */
 static int iommu_get_default_domain_type(struct iommu_group *group,
 int target_type)
 {
-   int best_type = target_type;
+   struct device *untrusted = NULL;
struct group_device *gdev;
-   struct device *last_dev;
+   int driver_type = 0;
 
lockdep_assert_held(&group->mutex);
-
for_each_group_device(group, gdev) {
-   unsigned int type = iommu_get_def_domain_type(gdev->dev);
+   driver_type = iommu_get_def_domain_type(group, gdev->dev,
+   driver_type);
 
-   if (best_type && type && best_type != type) {
-   if (target_type) {
-   dev_err_ratelimited(
-   gdev->dev,
-

[PATCH v8 24/24] iommu: Convert remaining simple drivers to domain_alloc_paging()

2023-09-13 Thread Jason Gunthorpe
These drivers don't support IOMMU_DOMAIN_DMA, so this commit effectively
allows them to support that mode.

The prior work to require default_domains makes this safe because every
one of these drivers is either compilation incompatible with dma-iommu.c,
or already establishing a default_domain. In both cases domain_alloc()
will never be called with IOMMU_DOMAIN_DMA for these drivers so it is safe
to drop the test.

Removing these tests clarifies that the domain allocation path is only
about the functionality of a paging domain and has nothing to do with
policy of how the paging domain is used for UNMANAGED/DMA/DMA_FQ.

Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c| 7 ++-
 drivers/iommu/mtk_iommu_v1.c | 7 ++-
 drivers/iommu/omap-iommu.c   | 7 ++-
 drivers/iommu/s390-iommu.c   | 7 ++-
 4 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 26ed81cfeee897..a163cee0b7242d 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -302,13 +302,10 @@ static void __program_context(void __iomem *base, int ctx,
SET_M(base, ctx, 1);
 }
 
-static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *msm_iommu_domain_alloc_paging(struct device *dev)
 {
struct msm_priv *priv;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv)
goto fail_nomem;
@@ -691,7 +688,7 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 
 static struct iommu_ops msm_iommu_ops = {
.identity_domain = &msm_iommu_identity_domain,
-   .domain_alloc = msm_iommu_domain_alloc,
+   .domain_alloc_paging = msm_iommu_domain_alloc_paging,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 7c0c1d50df5f75..67e044c1a7d93b 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -270,13 +270,10 @@ static int mtk_iommu_v1_domain_finalise(struct 
mtk_iommu_v1_data *data)
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_v1_domain_alloc(unsigned type)
+static struct iommu_domain *mtk_iommu_v1_domain_alloc_paging(struct device 
*dev)
 {
struct mtk_iommu_v1_domain *dom;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!dom)
return NULL;
@@ -585,7 +582,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
.identity_domain = &mtk_iommu_v1_identity_domain,
-   .domain_alloc   = mtk_iommu_v1_domain_alloc,
+   .domain_alloc_paging = mtk_iommu_v1_domain_alloc_paging,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
.release_device = mtk_iommu_v1_release_device,
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 34340ef15241bc..fcf99bd195b32e 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1580,13 +1580,10 @@ static struct iommu_domain omap_iommu_identity_domain = 
{
.ops = &omap_iommu_identity_ops,
 };
 
-static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *omap_iommu_domain_alloc_paging(struct device *dev)
 {
struct omap_iommu_domain *omap_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
omap_domain = kzalloc(sizeof(*omap_domain), GFP_KERNEL);
if (!omap_domain)
return NULL;
@@ -1748,7 +1745,7 @@ static struct iommu_group *omap_iommu_device_group(struct 
device *dev)
 
 static const struct iommu_ops omap_iommu_ops = {
.identity_domain = &omap_iommu_identity_domain,
-   .domain_alloc   = omap_iommu_domain_alloc,
+   .domain_alloc_paging = omap_iommu_domain_alloc_paging,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index f0c867c57a5b9b..5695ad71d60e24 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -39,13 +39,10 @@ static bool s390_iommu_capable(struct device *dev, enum 
iommu_cap cap)
}
 }
 
-static struct iommu_domain *s390_domain_alloc(unsigned domain_type)
+static struct iommu_domain *s390_domain_alloc_paging(struct device *dev)
 {
struct s390_domain *s390_domain;
 
-   if (domain_type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
 

[PATCH v8 18/24] iommu/mtk_iommu: Add an IOMMU_IDENTITIY_DOMAIN

2023-09-13 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 640275873a271e..164f9759e1c039 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -777,6 +777,28 @@ static int mtk_iommu_attach_device(struct iommu_domain 
*domain,
return ret;
 }
 
+static int mtk_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct mtk_iommu_data *data = dev_iommu_priv_get(dev);
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   mtk_iommu_config(data, dev, false, 0);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_identity_ops = {
+   .attach_dev = mtk_iommu_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_identity_ops,
+};
+
 static int mtk_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -996,6 +1018,7 @@ static void mtk_iommu_get_resv_regions(struct device *dev,
 }
 
 static const struct iommu_ops mtk_iommu_ops = {
+   .identity_domain = &mtk_iommu_identity_domain,
.domain_alloc   = mtk_iommu_domain_alloc,
.probe_device   = mtk_iommu_probe_device,
.release_device = mtk_iommu_release_device,
-- 
2.42.0



[PATCH v8 23/24] iommu: Convert simple drivers with DOMAIN_DMA to domain_alloc_paging()

2023-09-13 Thread Jason Gunthorpe
These drivers are all trivially converted since the function is only
called if the domain type is going to be
IOMMU_DOMAIN_UNMANAGED/DMA.

Tested-by: Heiko Stuebner 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 6 ++
 drivers/iommu/exynos-iommu.c| 7 ++-
 drivers/iommu/ipmmu-vmsa.c  | 7 ++-
 drivers/iommu/mtk_iommu.c   | 7 ++-
 drivers/iommu/rockchip-iommu.c  | 7 ++-
 drivers/iommu/sprd-iommu.c  | 7 ++-
 drivers/iommu/sun50i-iommu.c| 9 +++--
 drivers/iommu/tegra-smmu.c  | 7 ++-
 8 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index bc45d18f350cb9..97b2122032b237 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -332,12 +332,10 @@ static int qcom_iommu_init_domain(struct iommu_domain 
*domain,
return ret;
 }
 
-static struct iommu_domain *qcom_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *qcom_iommu_domain_alloc_paging(struct device *dev)
 {
struct qcom_iommu_domain *qcom_domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
/*
 * Allocate the domain and initialise some of its data structures.
 * We can't really do anything meaningful until we've added a
@@ -605,7 +603,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct 
of_phandle_args *args)
 static const struct iommu_ops qcom_iommu_ops = {
.identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
-   .domain_alloc   = qcom_iommu_domain_alloc,
+   .domain_alloc_paging = qcom_iommu_domain_alloc_paging,
.probe_device   = qcom_iommu_probe_device,
.device_group   = generic_device_group,
.of_xlate   = qcom_iommu_of_xlate,
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 5e12b85dfe8705..d6dead2ed10c11 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -887,7 +887,7 @@ static inline void exynos_iommu_set_pte(sysmmu_pte_t *ent, 
sysmmu_pte_t val)
   DMA_TO_DEVICE);
 }
 
-static struct iommu_domain *exynos_iommu_domain_alloc(unsigned type)
+static struct iommu_domain *exynos_iommu_domain_alloc_paging(struct device 
*dev)
 {
struct exynos_iommu_domain *domain;
dma_addr_t handle;
@@ -896,9 +896,6 @@ static struct iommu_domain 
*exynos_iommu_domain_alloc(unsigned type)
/* Check if correct PTE offsets are initialized */
BUG_ON(PG_ENT_SHIFT < 0 || !dma_dev);
 
-   if (type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_UNMANAGED)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -1472,7 +1469,7 @@ static int exynos_iommu_of_xlate(struct device *dev,
 
 static const struct iommu_ops exynos_iommu_ops = {
.identity_domain = &exynos_identity_domain,
-   .domain_alloc = exynos_iommu_domain_alloc,
+   .domain_alloc_paging = exynos_iommu_domain_alloc_paging,
.device_group = generic_device_group,
.probe_device = exynos_iommu_probe_device,
.release_device = exynos_iommu_release_device,
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 04830d3931d239..eaabae76157761 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -563,13 +563,10 @@ static irqreturn_t ipmmu_irq(int irq, void *dev)
  * IOMMU Operations
  */
 
-static struct iommu_domain *ipmmu_domain_alloc(unsigned type)
+static struct iommu_domain *ipmmu_domain_alloc_paging(struct device *dev)
 {
struct ipmmu_vmsa_domain *domain;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
-   return NULL;
-
domain = kzalloc(sizeof(*domain), GFP_KERNEL);
if (!domain)
return NULL;
@@ -892,7 +889,7 @@ static struct iommu_group *ipmmu_find_group(struct device 
*dev)
 
 static const struct iommu_ops ipmmu_ops = {
.identity_domain = &ipmmu_iommu_identity_domain,
-   .domain_alloc = ipmmu_domain_alloc,
+   .domain_alloc_paging = ipmmu_domain_alloc_paging,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
.probe_finalize = ipmmu_probe_finalize,
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 164f9759e1c039..19ef50221c93db 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -689,13 +689,10 @@ static int mtk_iommu_domain_finalise(struct 
mtk_iommu_domain *dom,
return 0;
 }
 
-static struct iommu_domain *mtk_iommu_domain_alloc(unsigned type)
+static 

[PATCH v8 22/24] iommu: Add ops->domain_alloc_paging()

2023-09-13 Thread Jason Gunthorpe
This callback requests the driver to create only a __IOMMU_DOMAIN_PAGING
domain, so it saves a few lines in a lot of drivers needlessly checking
the type.

More critically, this allows us to sweep out all the
IOMMU_DOMAIN_UNMANAGED and IOMMU_DOMAIN_DMA checks from a lot of the
drivers, simplifying what is going on in the code and ultimately removing
the now-unused special cases in drivers where they did not support
IOMMU_DOMAIN_DMA.

domain_alloc_paging() should return a struct iommu_domain that is
functionally compatible with ARM_DMA_USE_IOMMU, dma-iommu.c and iommufd.

Be forwards looking and pass in a 'struct device *' argument. We can
provide this when allocating the default_domain. No drivers will look at
this.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 17 ++---
 include/linux/iommu.h |  3 +++
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 38856d542afc35..fe033043be467a 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2041,6 +2041,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain,
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
 static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+struct device *dev,
 unsigned int type)
 {
struct iommu_domain *domain;
@@ -2048,8 +2049,13 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct iommu_ops *ops,
 
if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
return ops->identity_domain;
+   else if (type & __IOMMU_DOMAIN_PAGING && ops->domain_alloc_paging)
+   domain = ops->domain_alloc_paging(dev);
+   else if (ops->domain_alloc)
+   domain = ops->domain_alloc(alloc_type);
+   else
+   return NULL;
 
-   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2074,14 +2080,19 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct iommu_ops *ops,
 static struct iommu_domain *
 __iommu_group_domain_alloc(struct iommu_group *group, unsigned int type)
 {
-   return __iommu_domain_alloc(group_iommu_ops(group), type);
+   struct device *dev =
+   list_first_entry(&group->devices, struct group_device, list)
+   ->dev;
+
+   return __iommu_domain_alloc(group_iommu_ops(group), dev, type);
 }
 
 struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus)
 {
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
-   return __iommu_domain_alloc(bus->iommu_ops, IOMMU_DOMAIN_UNMANAGED);
+   return __iommu_domain_alloc(bus->iommu_ops, NULL,
+   IOMMU_DOMAIN_UNMANAGED);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_alloc);
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 511dfeea527215..3f173307434dcc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -239,6 +239,8 @@ struct iommu_iotlb_gather {
  *   use. The information type is one of enum iommu_hw_info_type 
defined
  *   in include/uapi/linux/iommufd.h.
  * @domain_alloc: allocate iommu domain
+ * @domain_alloc_paging: Allocate an iommu_domain that can be used for
+ *   UNMANAGED, DMA, and DMA_FQ domain types.
  * @probe_device: Add device to iommu driver handling
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
@@ -273,6 +275,7 @@ struct iommu_ops {
 
/* Domain allocation and freeing by the iommu driver */
struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type);
+   struct iommu_domain *(*domain_alloc_paging)(struct device *dev);
 
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
-- 
2.42.0
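
For illustration, a minimal sketch of the shape a converted driver takes
under this op (a hypothetical "foo" driver; the names and the page-table
comment are assumptions, not code from this series):

struct foo_iommu_domain {
        struct iommu_domain domain;
        /* driver page-table state would live here */
};

static struct iommu_domain *foo_iommu_domain_alloc_paging(struct device *dev)
{
        struct foo_iommu_domain *dom;

        /* No type check needed: the core only invokes this op for
         * __IOMMU_DOMAIN_PAGING types (UNMANAGED, DMA, DMA_FQ). */
        dom = kzalloc(sizeof(*dom), GFP_KERNEL);
        if (!dom)
                return NULL;
        /* driver-specific page-table initialization would go here */
        return &dom->domain;
}

static const struct iommu_ops foo_iommu_ops = {
        .domain_alloc_paging = foo_iommu_domain_alloc_paging,
        /* ... remaining ops ... */
};

Whether the core calls domain_alloc_paging() or falls back to the old
domain_alloc() is decided in __iommu_domain_alloc(), as the hunk above shows.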



[PATCH v8 17/24] iommu/ipmmu: Add an IOMMU_IDENTITIY_DOMAIN

2023-09-13 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Also reverts commit 584d334b1393 ("iommu/ipmmu-vmsa: Remove
ipmmu_utlb_disable()")

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/ipmmu-vmsa.c | 43 ++
 1 file changed, 43 insertions(+)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 65ff69477c43e4..04830d3931d239 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -295,6 +295,18 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain 
*domain,
mmu->utlb_ctx[utlb] = domain->context_id;
 }
 
+/*
+ * Disable MMU translation for the microTLB.
+ */
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
+  unsigned int utlb)
+{
+   struct ipmmu_vmsa_device *mmu = domain->mmu;
+
+   ipmmu_imuctr_write(mmu, utlb, 0);
+   mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
+}
+
 static void ipmmu_tlb_flush_all(void *cookie)
 {
struct ipmmu_vmsa_domain *domain = cookie;
@@ -627,6 +639,36 @@ static int ipmmu_attach_device(struct iommu_domain 
*io_domain,
return 0;
 }
 
+static int ipmmu_iommu_identity_attach(struct iommu_domain *identity_domain,
+  struct device *dev)
+{
+   struct iommu_domain *io_domain = iommu_get_domain_for_dev(dev);
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct ipmmu_vmsa_domain *domain;
+   unsigned int i;
+
+   if (io_domain == identity_domain || !io_domain)
+   return 0;
+
+   domain = to_vmsa_domain(io_domain);
+   for (i = 0; i < fwspec->num_ids; ++i)
+   ipmmu_utlb_disable(domain, fwspec->ids[i]);
+
+   /*
+* TODO: Optimize by disabling the context when no device is attached.
+*/
+   return 0;
+}
+
+static struct iommu_domain_ops ipmmu_iommu_identity_ops = {
+   .attach_dev = ipmmu_iommu_identity_attach,
+};
+
+static struct iommu_domain ipmmu_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &ipmmu_iommu_identity_ops,
+};
+
 static int ipmmu_map(struct iommu_domain *io_domain, unsigned long iova,
 phys_addr_t paddr, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -849,6 +891,7 @@ static struct iommu_group *ipmmu_find_group(struct device 
*dev)
 }
 
 static const struct iommu_ops ipmmu_ops = {
+   .identity_domain = &ipmmu_iommu_identity_domain,
.domain_alloc = ipmmu_domain_alloc,
.probe_device = ipmmu_probe_device,
.release_device = ipmmu_release_device,
-- 
2.42.0



[PATCH v8 10/24] iommu/exynos: Implement an IDENTITY domain

2023-09-13 Thread Jason Gunthorpe
What exynos calls exynos_iommu_detach_device is actually putting the iommu
into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Tested-by: Marek Szyprowski 
Acked-by: Marek Szyprowski 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/exynos-iommu.c | 66 +---
 1 file changed, 32 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index c275fe71c4db32..5e12b85dfe8705 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -24,6 +24,7 @@
 
 typedef u32 sysmmu_iova_t;
 typedef u32 sysmmu_pte_t;
+static struct iommu_domain exynos_identity_domain;
 
 /* We do not consider super section mapping (16MB) */
 #define SECT_ORDER 20
@@ -829,7 +830,7 @@ static int __maybe_unused exynos_sysmmu_suspend(struct 
device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&data->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "saving state\n");
__sysmmu_disable(data);
}
@@ -847,7 +848,7 @@ static int __maybe_unused exynos_sysmmu_resume(struct 
device *dev)
struct exynos_iommu_owner *owner = dev_iommu_priv_get(master);
 
mutex_lock(&owner->rpm_lock);
-   if (data->domain) {
+   if (&data->domain->domain != &exynos_identity_domain) {
dev_dbg(data->sysmmu, "restoring state\n");
__sysmmu_enable(data);
}
@@ -980,17 +981,20 @@ static void exynos_iommu_domain_free(struct iommu_domain 
*iommu_domain)
kfree(domain);
 }
 
-static void exynos_iommu_detach_device(struct iommu_domain *iommu_domain,
-   struct device *dev)
+static int exynos_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct exynos_iommu_domain *domain = to_exynos_domain(iommu_domain);
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-   phys_addr_t pagetable = virt_to_phys(domain->pgtable);
+   struct exynos_iommu_domain *domain;
+   phys_addr_t pagetable;
struct sysmmu_drvdata *data, *next;
unsigned long flags;
 
-   if (!has_sysmmu(dev) || owner->domain != iommu_domain)
-   return;
+   if (owner->domain == identity_domain)
+   return 0;
+
+   domain = to_exynos_domain(owner->domain);
+   pagetable = virt_to_phys(domain->pgtable);
 
mutex_lock(&owner->rpm_lock);
 
@@ -1009,15 +1013,25 @@ static void exynos_iommu_detach_device(struct 
iommu_domain *iommu_domain,
list_del_init(&data->domain_node);
spin_unlock(&data->lock);
}
-   owner->domain = NULL;
+   owner->domain = identity_domain;
spin_unlock_irqrestore(&owner->lock, flags);

mutex_unlock(&owner->rpm_lock);
 
-   dev_dbg(dev, "%s: Detached IOMMU with pgtable %pa\n", __func__,
-   &pagetable);
+   dev_dbg(dev, "%s: Restored IOMMU to IDENTITY from pgtable %pa\n",
+   __func__, &pagetable);
+   return 0;
 }
 
+static struct iommu_domain_ops exynos_identity_ops = {
+   .attach_dev = exynos_iommu_identity_attach,
+};
+
+static struct iommu_domain exynos_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &exynos_identity_ops,
+};
+
 static int exynos_iommu_attach_device(struct iommu_domain *iommu_domain,
   struct device *dev)
 {
@@ -1026,12 +1040,11 @@ static int exynos_iommu_attach_device(struct 
iommu_domain *iommu_domain,
struct sysmmu_drvdata *data;
phys_addr_t pagetable = virt_to_phys(domain->pgtable);
unsigned long flags;
+   int err;
 
-   if (!has_sysmmu(dev))
-   return -ENODEV;
-
-   if (owner->domain)
-   exynos_iommu_detach_device(owner->domain, dev);
+   err = exynos_iommu_identity_attach(&exynos_identity_domain, dev);
+   if (err)
+   return err;
 
mutex_lock(&owner->rpm_lock);
 
@@ -1407,26 +1420,12 @@ static struct iommu_device 
*exynos_iommu_probe_device(struct device *dev)
return >iommu;
 }
 
-static void exynos_iommu_set_platform_dma(struct device *dev)
-{
-   struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
-
-   if (owner->domain) {
-   struct iommu_group *group = iommu_group_get(dev);
-
-   if (group) {
-   exynos_iommu_detach_device(owner->domain, dev);
-   iommu_group_put(group);
-   }
-   }
-}
-
 static void exynos_iommu_release_device(struct device *dev)
 {
struct exynos_iommu_owner *owner = dev_iommu_priv_get(dev);
struct sysmmu_drvdata *data;
 
-   exynos_iommu_set_platform_dma(dev);
+   

[PATCH v8 15/24] iommu: Remove ops->set_platform_dma_ops()

2023-09-13 Thread Jason Gunthorpe
All drivers are now using IDENTITY or PLATFORM domains for what this did,
so we can remove it now. It is no longer possible to attach to a NULL domain.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 30 +-
 include/linux/iommu.h |  4 
 2 files changed, 5 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1efd6351bbc2da..42a4585dd76da6 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2351,21 +2351,8 @@ static int __iommu_group_set_domain_internal(struct 
iommu_group *group,
if (group->domain == new_domain)
return 0;
 
-   /*
-* New drivers should support default domains, so set_platform_dma()
-* op will never be called. Otherwise the NULL domain represents some
-* platform specific behavior.
-*/
-   if (!new_domain) {
-   for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
-   if (!WARN_ON(!ops->set_platform_dma_ops))
-   ops->set_platform_dma_ops(gdev->dev);
-   }
-   group->domain = NULL;
-   return 0;
-   }
+   if (WARN_ON(!new_domain))
+   return -EINVAL;
 
/*
 * Changing the domain is done by calling attach_dev() on the new
@@ -2401,19 +2388,15 @@ static int __iommu_group_set_domain_internal(struct 
iommu_group *group,
 */
last_gdev = gdev;
for_each_group_device(group, gdev) {
-   const struct iommu_ops *ops = dev_iommu_ops(gdev->dev);
-
/*
-* If set_platform_dma_ops is not present a NULL domain can
-* happen only for first probe, in which case we leave
-* group->domain as NULL and let release clean everything up.
+* A NULL domain can happen only for first probe, in which case
+* we leave group->domain as NULL and let release clean
+* everything up.
 */
if (group->domain)
WARN_ON(__iommu_device_set_domain(
group, gdev->dev, group->domain,
IOMMU_SET_DOMAIN_MUST_SUCCEED));
-   else if (ops->set_platform_dma_ops)
-   ops->set_platform_dma_ops(gdev->dev);
if (gdev == last_gdev)
break;
}
@@ -3036,9 +3019,6 @@ static int iommu_setup_default_domain(struct iommu_group 
*group,
/*
 * There are still some drivers which don't support default domains, so
 * we ignore the failure and leave group->default_domain NULL.
-*
-* We assume that the iommu driver starts up the device in
-* 'set_platform_dma_ops' mode if it does not support default domains.
 */
dom = iommu_group_alloc_default_domain(group, req_type);
if (!dom) {
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a05480be05fd08..511dfeea527215 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -243,9 +243,6 @@ struct iommu_iotlb_gather {
  * @release_device: Remove device from iommu driver handling
  * @probe_finalize: Do final setup work after the device is added to an IOMMU
  *  group and attached to the groups domain
- * @set_platform_dma_ops: Returning control back to the platform DMA ops. This 
op
- *is to support old IOMMU drivers, new drivers should 
use
- *default domains, and the common IOMMU DMA ops.
  * @device_group: find iommu group for a particular device
  * @get_resv_regions: Request list of reserved regions for a device
  * @of_xlate: add OF master IDs to iommu grouping
@@ -280,7 +277,6 @@ struct iommu_ops {
struct iommu_device *(*probe_device)(struct device *dev);
void (*release_device)(struct device *dev);
void (*probe_finalize)(struct device *dev);
-   void (*set_platform_dma_ops)(struct device *dev);
struct iommu_group *(*device_group)(struct device *dev);
 
/* Request/Free a list of reserved regions for a device */
-- 
2.42.0



[PATCH v8 00/24] iommu: Make default_domain's mandatory

2023-09-13 Thread Jason Gunthorpe
It has been a long time coming; this series completes the default_domain
transition and makes it so that the core IOMMU code will always have a
non-NULL default_domain for every driver on every
platform. set_platform_dma_ops() turned out to be a bad idea, so it is
completely removed.

This is achieved by changing each driver to either:

1 - Convert the existing (or deleted) ops->detach_dev() into an
ops->attach_dev() of an IDENTITY domain.

This is based on the theory that the ARM32 HW is able to function when
the iommu is turned off, and so the turned-off state is an IDENTITY
translation.

2 - Use a new PLATFORM domain type. This is a hack to accommodate drivers
where we don't really know WTF they do. S390 is legitimately using this
to switch to its platform dma_ops implementation, which is where the
name comes from.

3 - Do #1 and force the default domain to be IDENTITY, this corrects
the tegra-smmu case where even an ARM64 system would have a NULL
default_domain.

Using this we can apply the rules:

a) ARM_DMA_USE_IOMMU mode always uses either the driver's
   ops->default_domain, ops->def_domain_type(), or an IDENTITY domain.
   All ARM32 drivers provide one of these three options.

b) dma-iommu.c mode uses either the driver's ops->default_domain,
   ops->def_domain_type or the usual DMA API policy logic based on the
   command line/etc to pick IDENTITY/DMA domain types

c) All other arch's (PPC/S390) use ops->default_domain always.

See the patch "Require a default_domain for all iommu drivers" for a
per-driver breakdown.

The conversion broadly teaches a bunch of ARM32 drivers that they can do
IDENTITY domains. There is some educated guessing involved that these are
actual IDENTITY domains. If this turns out to be wrong the driver can be
trivially changed to use a BLOCKING domain type instead. Further, the
domain type only matters for drivers using ARM64's dma-iommu.c mode as it
will select IDENTITY based on the command line and expect IDENTITY to
work. For ARM32 and other arch cases it is purely documentation.
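
As a concrete sketch of option 1 above (hypothetical "foo" driver; the
hardware helper is a placeholder, not a real function):

static int foo_iommu_identity_attach(struct iommu_domain *identity_domain,
                                     struct device *dev)
{
        /* The code that used to live in ops->detach_dev(): program the
         * HW so the device bypasses translation (a 1:1 mapping). */
        foo_hw_enable_bypass(dev);      /* placeholder for HW programming */
        return 0;
}

static struct iommu_domain_ops foo_iommu_identity_ops = {
        .attach_dev = foo_iommu_identity_attach,
};

static struct iommu_domain foo_iommu_identity_domain = {
        .type = IOMMU_DOMAIN_IDENTITY,
        .ops = &foo_iommu_identity_ops,
};

/* and in the driver: .identity_domain = &foo_iommu_identity_domain */

Error paths matter here: since the identity domain can be attached from
error handling, its attach_dev should avoid failing (see patch 1).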

Finally, based on all the analysis in this series, we can purge
IOMMU_DOMAIN_UNMANAGED/DMA constants from most of the drivers. This
greatly simplifies understanding the driver contract to the core
code. IOMMU drivers should not be involved in policy for how the DMA API
works; that should be a core code decision.

The main gain from this work is to remove a lot of ARM_DMA_USE_IOMMU
specific code and behaviors from drivers. All that remains in iommu
drivers after this series is the calls to arm_iommu_create_mapping().

This is a step toward removing ARM_DMA_USE_IOMMU.

The IDENTITY domains added to the ARM64 supporting drivers can be tested
by booting in ARM64 mode and enabling CONFIG_IOMMU_DEFAULT_PASSTHROUGH. If
the system still boots then most likely the implementation is an IDENTITY
domain. If not we can trivially change it to BLOCKING or at worst PLATFORM
if there is no detail what is going on in the HW.

I think this is pretty safe for the ARM32 drivers as they don't really
change; the code that was in detach_dev continues to be called in the same
places it was called before.

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_all_defdom

v8:
 - Rebase on v6.6-rc1
 - Adjust comments for ops.default_domain
v7:
 - Rebase on v6.5-rc6/Joerg's tree/iommufd
 - Most of patch "iommufd/selftest: Make the mock iommu driver into a real
   driver" is now in the iommufd tree, diffuse the remaining bits to
   "iommu: Add iommu_ops->identity_domain" and
   "iommu: Add IOMMU_DOMAIN_PLATFORM"
 - Move the check for domain->ops->free to patch 1 as the rockchip
   conversion relies on it
 - Add IOMMU_DOMAIN_PLATFORM to iommu_domain_type_str
 - Rewrite "iommu: Reorganize iommu_get_default_domain_type() to respect 
def_domain_type()"
   to be clearer and more robust
 - Remove left over .default_domain in tegra-smmu.c
 - Use group_iommu_ops() in all appropriate places
 - Typo s/paging/dev/ in sun50i
v6: 
https://lore.kernel.org/r/0-v6-e8114faedade+425-iommu_all_defdom_...@nvidia.com
 - Rebase on v6.5-rc1/Joerg's tree
 - Fix the iommufd self test missing the iommu_device_sysfs_add()
 - Update typo in msm commit message
v5: 
https://lore.kernel.org/r/0-v5-d0a204c678c7+3d16a-iommu_all_defdom_...@nvidia.com
 - Rebase on v6.5-rc1/Joerg's tree
 - Fix Dan's remark about 'gdev uninitialized' in patch 9
v4: 
https://lore.kernel.org/r/0-v4-874277bde66e+1a9f6-iommu_all_defdom_...@nvidia.com
 - Fix rebasing typo missing ops->alloc_domain_paging check
 - Rebase on latest Joerg tree
v3: 
https://lore.kernel.org/r/0-v3-89830a6c7841+43d-iommu_all_defdom_...@nvidia.com
 - FSL is back to a PLATFORM domain, with some fixing so it attach only
   does something when leaving an UNMANAGED domain like it always was
 - Rebase on Joerg's tree, adjust for "alloc_type" change
 - Change the ARM32 untrusted check to a WARN_ON since no ARM32 system
   can currently set trusted
v2: 

[PATCH v8 11/24] iommu/tegra-smmu: Implement an IDENTITY domain

2023-09-13 Thread Jason Gunthorpe
What tegra-smmu does during tegra_smmu_set_platform_dma() is actually
putting the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/tegra-smmu.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index e445f80d02263b..80481e1ba561b8 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -511,23 +511,39 @@ static int tegra_smmu_attach_dev(struct iommu_domain 
*domain,
return err;
 }
 
-static void tegra_smmu_set_platform_dma(struct device *dev)
+static int tegra_smmu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-   struct tegra_smmu_as *as = to_smmu_as(domain);
-   struct tegra_smmu *smmu = as->smmu;
+   struct tegra_smmu_as *as;
+   struct tegra_smmu *smmu;
unsigned int index;
 
if (!fwspec)
-   return;
+   return -ENODEV;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   as = to_smmu_as(domain);
+   smmu = as->smmu;
for (index = 0; index < fwspec->num_ids; index++) {
tegra_smmu_disable(smmu, fwspec->ids[index], as->id);
tegra_smmu_as_unprepare(smmu, as);
}
+   return 0;
 }
 
+static struct iommu_domain_ops tegra_smmu_identity_ops = {
+   .attach_dev = tegra_smmu_identity_attach,
+};
+
+static struct iommu_domain tegra_smmu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &tegra_smmu_identity_ops,
+};
+
 static void tegra_smmu_set_pde(struct tegra_smmu_as *as, unsigned long iova,
   u32 value)
 {
@@ -962,11 +978,22 @@ static int tegra_smmu_of_xlate(struct device *dev,
return iommu_fwspec_add_ids(dev, &id, 1);
 }
 
+static int tegra_smmu_def_domain_type(struct device *dev)
+{
+   /*
+* FIXME: For now we want to run all translation in IDENTITY mode, due
+* to some device quirks. Better would be to just quirk the troubled
+* devices.
+*/
+   return IOMMU_DOMAIN_IDENTITY;
+}
+
 static const struct iommu_ops tegra_smmu_ops = {
+   .identity_domain = &tegra_smmu_identity_domain,
+   .def_domain_type = &tegra_smmu_def_domain_type,
.domain_alloc = tegra_smmu_domain_alloc,
.probe_device = tegra_smmu_probe_device,
.device_group = tegra_smmu_device_group,
-   .set_platform_dma_ops = tegra_smmu_set_platform_dma,
.of_xlate = tegra_smmu_of_xlate,
.pgsize_bitmap = SZ_4K,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.42.0



[PATCH v8 09/24] iommu: Allow an IDENTITY domain as the default_domain in ARM32

2023-09-13 Thread Jason Gunthorpe
Even though dma-iommu.c and CONFIG_ARM_DMA_USE_IOMMU do approximately the
same stuff, the way they relate to the IOMMU core is quite different.

dma-iommu.c expects the core code to setup an UNMANAGED domain (of type
IOMMU_DOMAIN_DMA) and then configures itself to use that domain. This
becomes the default_domain for the group.

ARM_DMA_USE_IOMMU does not use the default_domain, instead it directly
allocates an UNMANAGED domain and operates it just like an external
driver. In this case group->default_domain is NULL.

If the driver provides a global static identity_domain then automatically
use it as the default_domain when in ARM_DMA_USE_IOMMU mode.

This allows drivers that implemented default_domain == NULL as an IDENTITY
translation to trivially get a properly labeled non-NULL default_domain on
ARM32 configs.

With this arrangement, when ARM_DMA_USE_IOMMU wants to disconnect from the
device, the normal detach_domain flow will restore the IDENTITY domain as
the default domain. Overall this makes attach_dev() of the IDENTITY domain
called in the same places as detach_dev().

This effectively migrates these drivers to default_domain mode. Drivers
that support ARM64 will gain support for the IDENTITY translation mode
for the dma_api and behave in a uniform way.

Drivers use this by setting ops->identity_domain to a static singleton
iommu_domain that implements the identity attach. If the core detects
ARM_DMA_USE_IOMMU mode then it automatically attaches the IDENTITY domain
during probe.

Drivers can continue to prevent the use of DMA translation by returning
IOMMU_DOMAIN_IDENTITY from def_domain_type, this will completely prevent
IOMMU_DMA from running but will not impact ARM_DMA_USE_IOMMU.

This allows removing the set_platform_dma_ops() from every remaining
driver.

Remove the set_platform_dma_ops from rockchip and mtk_v1 as all it does
is set an existing global static identity domain. mtk_v1 does not support
IOMMU_DOMAIN_DMA and it does not compile on ARM64, so this transformation
is safe.

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c  | 21 -
 drivers/iommu/mtk_iommu_v1.c   | 12 
 drivers/iommu/rockchip-iommu.c | 10 --
 3 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9188eae61e929e..1efd6351bbc2da 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1865,17 +1865,36 @@ static int iommu_get_def_domain_type(struct iommu_group 
*group,
 static int iommu_get_default_domain_type(struct iommu_group *group,
 int target_type)
 {
+   const struct iommu_ops *ops = group_iommu_ops(group);
struct device *untrusted = NULL;
struct group_device *gdev;
int driver_type = 0;
 
lockdep_assert_held(&group->mutex);
+
+   /*
+* ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
+* identity_domain and it will automatically become their default
+* domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
+* Override the selection to IDENTITY if we are sure the driver supports
+* it.
+*/
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain)
+   driver_type = IOMMU_DOMAIN_IDENTITY;
+
for_each_group_device(group, gdev) {
driver_type = iommu_get_def_domain_type(group, gdev->dev,
driver_type);
 
-   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted)
+   if (dev_is_pci(gdev->dev) && to_pci_dev(gdev->dev)->untrusted) {
+   /*
+* No ARM32 using systems will set untrusted, it cannot
+* work.
+*/
+   if (WARN_ON(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)))
+   return -1;
untrusted = gdev->dev;
+   }
}
 
if (untrusted) {
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index cc3e7d53d33ad9..7c0c1d50df5f75 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -337,11 +337,6 @@ static struct iommu_domain mtk_iommu_v1_identity_domain = {
.ops = &mtk_iommu_v1_identity_ops,
 };
 
-static void mtk_iommu_v1_set_platform_dma(struct device *dev)
-{
-   mtk_iommu_v1_identity_attach(&mtk_iommu_v1_identity_domain, dev);
-}
-
 static int mtk_iommu_v1_map(struct iommu_domain *domain, unsigned long iova,
phys_addr_t paddr, size_t pgsize, size_t pgcount,
int prot, gfp_t gfp, size_t *mapped)
@@ -457,11 +452,6 @@ static int mtk_iommu_v1_create_mapping(struct device *dev, 
struct of_phandle_arg
  

[PATCH v8 12/24] iommu/tegra-smmu: Support DMA domains in tegra

2023-09-13 Thread Jason Gunthorpe
All ARM64 iommu drivers should support IOMMU_DOMAIN_DMA to enable
dma-iommu.c.

tegra is blocking dma-iommu usage, and also default_domain's, because it
wants an identity translation. This is needed for some device quirk. The
correct way to do this is to support IDENTITY domains and use
ops->def_domain_type() to return IOMMU_DOMAIN_IDENTITY for only the quirky
devices.

Add support for IOMMU_DOMAIN_DMA and force IOMMU_DOMAIN_IDENTITY mode for
everything so no behavior changes.

Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/tegra-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 80481e1ba561b8..b91ad1b5a20d36 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -276,7 +276,7 @@ static struct iommu_domain 
*tegra_smmu_domain_alloc(unsigned type)
 {
struct tegra_smmu_as *as;
 
-   if (type != IOMMU_DOMAIN_UNMANAGED)
+   if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
return NULL;
 
as = kzalloc(sizeof(*as), GFP_KERNEL);
-- 
2.42.0
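
The "better" fix that the FIXME in tegra_smmu_def_domain_type() (patch 11)
alludes to would quirk only the troubled devices; roughly like this (the
compatible string is a made-up placeholder, not a real tegra quirk):

static int foo_smmu_def_domain_type(struct device *dev)
{
        /* Force IDENTITY only for the quirky device; returning 0 for
         * everything else leaves the normal DMA API policy in charge. */
        if (of_device_is_compatible(dev->of_node, "vendor,quirky-device"))
                return IOMMU_DOMAIN_IDENTITY;
        return 0;
}

Returning 0 means "no preference", matching the convention described in
patch 08.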



[PATCH v8 16/24] iommu/qcom_iommu: Add an IOMMU_IDENTITIY_DOMAIN

2023-09-13 Thread Jason Gunthorpe
This brings back the ops->detach_dev() code that commit
1b932ceddd19 ("iommu: Remove detach_dev callbacks") deleted and turns it
into an IDENTITY domain.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 39 +
 1 file changed, 39 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 775a3cbaff4ed0..bc45d18f350cb9 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -400,6 +400,44 @@ static int qcom_iommu_attach_dev(struct iommu_domain 
*domain, struct device *dev
return 0;
 }
 
+static int qcom_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct qcom_iommu_domain *qcom_domain;
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct qcom_iommu_dev *qcom_iommu = to_iommu(dev);
+   unsigned int i;
+
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   qcom_domain = to_qcom_iommu_domain(domain);
+   if (WARN_ON(!qcom_domain->iommu))
+   return -EINVAL;
+
+   pm_runtime_get_sync(qcom_iommu->dev);
+   for (i = 0; i < fwspec->num_ids; i++) {
+   struct qcom_iommu_ctx *ctx = to_ctx(qcom_domain, 
fwspec->ids[i]);
+
+   /* Disable the context bank: */
+   iommu_writel(ctx, ARM_SMMU_CB_SCTLR, 0);
+
+   ctx->domain = NULL;
+   }
+   pm_runtime_put_sync(qcom_iommu->dev);
+   return 0;
+}
+
+static struct iommu_domain_ops qcom_iommu_identity_ops = {
+   .attach_dev = qcom_iommu_identity_attach,
+};
+
+static struct iommu_domain qcom_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &qcom_iommu_identity_ops,
+};
+
 static int qcom_iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, size_t pgsize, size_t pgcount,
  int prot, gfp_t gfp, size_t *mapped)
@@ -565,6 +603,7 @@ static int qcom_iommu_of_xlate(struct device *dev, struct 
of_phandle_args *args)
 }
 
 static const struct iommu_ops qcom_iommu_ops = {
+   .identity_domain = &qcom_iommu_identity_domain,
.capable= qcom_iommu_capable,
.domain_alloc   = qcom_iommu_domain_alloc,
.probe_device   = qcom_iommu_probe_device,
-- 
2.42.0



[PATCH v8 06/24] iommu/tegra-gart: Remove tegra-gart

2023-09-13 Thread Jason Gunthorpe
Thierry says this is not used anymore, and doesn't think it makes sense as
an iommu driver. The HW it supports is about 10 years old now and newer HW
uses different IOMMU drivers.

As this is the only driver with a GART approach, and it doesn't really
meet the driver expectations from the IOMMU core, let's just remove it
so we don't have to think about how to make it fit in.

It has a number of identified problems:
 - The assignment of iommu_groups doesn't match the HW behavior

 - It claims to have an UNMANAGED domain but it is really an IDENTITY
   domain with a translation aperture. This is inconsistent with the core
   expectation for security sensitive operations

 - It doesn't implement a SW page table under struct iommu_domain so
   * It can't accept a map until the domain is attached
   * It forgets about all maps after the domain is detached
   * It doesn't clear the HW of maps once the domain is detached
 (made worse by having the wrong groups)

Cc: Thierry Reding 
Cc: Dmitry Osipenko 
Acked-by: Thierry Reding 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 arch/arm/configs/multi_v7_defconfig |   1 -
 arch/arm/configs/tegra_defconfig|   1 -
 drivers/iommu/Kconfig   |  11 -
 drivers/iommu/Makefile  |   1 -
 drivers/iommu/tegra-gart.c  | 371 
 drivers/memory/tegra/mc.c   |  34 ---
 drivers/memory/tegra/tegra20.c  |  28 ---
 include/soc/tegra/mc.h  |  26 --
 8 files changed, 473 deletions(-)
 delete mode 100644 drivers/iommu/tegra-gart.c

diff --git a/arch/arm/configs/multi_v7_defconfig 
b/arch/arm/configs/multi_v7_defconfig
index 23fc49f23d255a..5dc4416b75d36f 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1073,7 +1073,6 @@ CONFIG_QCOM_IPCC=y
 CONFIG_OMAP_IOMMU=y
 CONFIG_OMAP_IOMMU_DEBUG=y
 CONFIG_ROCKCHIP_IOMMU=y
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_EXYNOS_IOMMU=y
 CONFIG_QCOM_IOMMU=y
diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index 613f07b8ce1505..8635b7216bfc5a 100644
--- a/arch/arm/configs/tegra_defconfig
+++ b/arch/arm/configs/tegra_defconfig
@@ -292,7 +292,6 @@ CONFIG_CHROME_PLATFORMS=y
 CONFIG_CROS_EC=y
 CONFIG_CROS_EC_I2C=m
 CONFIG_CROS_EC_SPI=m
-CONFIG_TEGRA_IOMMU_GART=y
 CONFIG_TEGRA_IOMMU_SMMU=y
 CONFIG_ARCH_TEGRA_2x_SOC=y
 CONFIG_ARCH_TEGRA_3x_SOC=y
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 2b12b583ef4b1e..cd6727898b1175 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -236,17 +236,6 @@ config SUN50I_IOMMU
help
  Support for the IOMMU introduced in the Allwinner H6 SoCs.
 
-config TEGRA_IOMMU_GART
-   bool "Tegra GART IOMMU Support"
-   depends on ARCH_TEGRA_2x_SOC
-   depends on TEGRA_MC
-   select IOMMU_API
-   help
- Enables support for remapping discontiguous physical memory
- shared with the operating system into contiguous I/O virtual
- space through the GART (Graphics Address Relocation Table)
- hardware included on Tegra SoCs.
-
 config TEGRA_IOMMU_SMMU
bool "NVIDIA Tegra SMMU Support"
depends on ARCH_TEGRA
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 769e43d780ce89..95ad9dbfbda022 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -20,7 +20,6 @@ obj-$(CONFIG_OMAP_IOMMU) += omap-iommu.o
 obj-$(CONFIG_OMAP_IOMMU_DEBUG) += omap-iommu-debug.o
 obj-$(CONFIG_ROCKCHIP_IOMMU) += rockchip-iommu.o
 obj-$(CONFIG_SUN50I_IOMMU) += sun50i-iommu.o
-obj-$(CONFIG_TEGRA_IOMMU_GART) += tegra-gart.o
 obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
 obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
deleted file mode 100644
index a482ff838b5331..00
--- a/drivers/iommu/tegra-gart.c
+++ /dev/null
@@ -1,371 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * IOMMU API for Graphics Address Relocation Table on Tegra20
- *
- * Copyright (c) 2010-2012, NVIDIA CORPORATION.  All rights reserved.
- *
- * Author: Hiroshi DOYU 
- */
-
-#define dev_fmt(fmt)   "gart: " fmt
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-#define GART_REG_BASE  0x24
-#define GART_CONFIG(0x24 - GART_REG_BASE)
-#define GART_ENTRY_ADDR(0x28 - GART_REG_BASE)
-#define GART_ENTRY_DATA(0x2c - GART_REG_BASE)
-
-#define GART_ENTRY_PHYS_ADDR_VALID BIT(31)
-
-#define GART_PAGE_SHIFT12
-#define GART_PAGE_SIZE (1 << GART_PAGE_SHIFT)
-#define GART_PAGE_MASK GENMASK(30, GART_PAGE_SHIFT)
-
-/* bitmap of the page sizes currently supported */
-#define GART_IOMMU_PGSIZES (GART_PAGE_SIZE)
-
-struct gart_device {
-   void __iomem*regs;
-   u32 

[PATCH v8 04/24] iommu: Add IOMMU_DOMAIN_PLATFORM for S390

2023-09-13 Thread Jason Gunthorpe
The PLATFORM domain will be set as the default domain and attached as
normal during probe. The driver will ignore the initial attach from a NULL
domain to the PLATFORM domain.

After this, the PLATFORM domain's attach_dev will be called whenever we
detach from an UNMANAGED domain (eg for VFIO). This is the same time the
original design would have called op->detach_dev().

This is temporary until the S390 dma-iommu.c conversion is merged.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/s390-iommu.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index fbf59a8db29b11..f0c867c57a5b9b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -142,14 +142,31 @@ static int s390_iommu_attach_device(struct iommu_domain 
*domain,
return 0;
 }
 
-static void s390_iommu_set_platform_dma(struct device *dev)
+/*
+ * Switch control over the IOMMU to S390's internal dma_api ops
+ */
+static int s390_iommu_platform_attach(struct iommu_domain *platform_domain,
+ struct device *dev)
 {
struct zpci_dev *zdev = to_zpci_dev(dev);
 
+   if (!zdev->s390_domain)
+   return 0;
+
__s390_iommu_detach_device(zdev);
zpci_dma_init_device(zdev);
+   return 0;
 }
 
+static struct iommu_domain_ops s390_iommu_platform_ops = {
+   .attach_dev = s390_iommu_platform_attach,
+};
+
+static struct iommu_domain s390_iommu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &s390_iommu_platform_ops,
+};
+
 static void s390_iommu_get_resv_regions(struct device *dev,
struct list_head *list)
 {
@@ -428,12 +445,12 @@ void zpci_destroy_iommu(struct zpci_dev *zdev)
 }
 
 static const struct iommu_ops s390_iommu_ops = {
+   .default_domain = &s390_iommu_platform_domain,
.capable = s390_iommu_capable,
.domain_alloc = s390_domain_alloc,
.probe_device = s390_iommu_probe_device,
.release_device = s390_iommu_release_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = s390_iommu_set_platform_dma,
.pgsize_bitmap = SZ_4K,
.get_resv_regions = s390_iommu_get_resv_regions,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.42.0



[PATCH v8 14/24] iommu/msm: Implement an IDENTITY domain

2023-09-13 Thread Jason Gunthorpe
What msm does during msm_iommu_set_platform_dma() is actually putting the
iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA, however it cannot be
compiled on ARM64 either. Most likely it is fine to support dma-iommu.c

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/msm_iommu.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 79d89bad5132b7..26ed81cfeee897 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -443,15 +443,20 @@ static int msm_iommu_attach_dev(struct iommu_domain 
*domain, struct device *dev)
return ret;
 }
 
-static void msm_iommu_set_platform_dma(struct device *dev)
+static int msm_iommu_identity_attach(struct iommu_domain *identity_domain,
+struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct msm_priv *priv = to_msm_priv(domain);
+   struct msm_priv *priv;
unsigned long flags;
struct msm_iommu_dev *iommu;
struct msm_iommu_ctx_dev *master;
-   int ret;
+   int ret = 0;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   priv = to_msm_priv(domain);
free_io_pgtable_ops(priv->iop);
 
spin_lock_irqsave(&msm_iommu_lock, flags);
@@ -468,8 +473,18 @@ static void msm_iommu_set_platform_dma(struct device *dev)
}
 fail:
spin_unlock_irqrestore(&msm_iommu_lock, flags);
+   return ret;
 }
 
+static struct iommu_domain_ops msm_iommu_identity_ops = {
+   .attach_dev = msm_iommu_identity_attach,
+};
+
+static struct iommu_domain msm_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &msm_iommu_identity_ops,
+};
+
 static int msm_iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t pa, size_t pgsize, size_t pgcount,
 int prot, gfp_t gfp, size_t *mapped)
@@ -675,10 +690,10 @@ irqreturn_t msm_iommu_fault_handler(int irq, void *dev_id)
 }
 
 static struct iommu_ops msm_iommu_ops = {
+   .identity_domain = &msm_iommu_identity_domain,
.domain_alloc = msm_iommu_domain_alloc,
.probe_device = msm_iommu_probe_device,
.device_group = generic_device_group,
-   .set_platform_dma_ops = msm_iommu_set_platform_dma,
.pgsize_bitmap = MSM_IOMMU_PGSIZES,
.of_xlate = qcom_iommu_of_xlate,
.default_domain_ops = &(const struct iommu_domain_ops) {
-- 
2.42.0



[PATCH v8 01/24] iommu: Add iommu_ops->identity_domain

2023-09-13 Thread Jason Gunthorpe
This allows a driver to set a global static to an IDENTITY domain and
the core code will automatically use it whenever an IDENTITY domain
is requested.

By making it always available, the IDENTITY domain can be used in error
handling paths to force the iommu driver into a known state. Drivers
implementing global static identity domains should avoid failing their
attach_dev ops.

To make global static domains simpler allow drivers to omit their free
function and update the iommufd selftest.

Convert rockchip to use the new mechanism.
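
As an illustration of the pattern (a sketch with hypothetical "foo" names,
not code from this patch; the rockchip hunk below is the real conversion):

#include <linux/iommu.h>

static int foo_identity_attach(struct iommu_domain *identity_domain,
			       struct device *dev)
{
	/* put the device into bypass/identity translation */
	return 0;
}

static struct iommu_domain_ops foo_identity_ops = {
	.attach_dev = foo_identity_attach,
	/* no .free op: the core now tolerates its absence */
};

static struct iommu_domain foo_identity_domain = {
	.type = IOMMU_DOMAIN_IDENTITY,
	.ops = &foo_identity_ops,
};

static const struct iommu_ops foo_iommu_ops = {
	.identity_domain = &foo_identity_domain,
	/* ... */
};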

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c| 6 +-
 drivers/iommu/iommufd/selftest.c | 5 -
 drivers/iommu/rockchip-iommu.c   | 9 +
 include/linux/iommu.h| 3 +++
 4 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3bfc56df4f781c..33bd1107090720 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1978,6 +1978,9 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct bus_type *bus,
if (bus == NULL || bus->iommu_ops == NULL)
return NULL;
 
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && 
bus->iommu_ops->identity_domain)
+   return bus->iommu_ops->identity_domain;
+
domain = bus->iommu_ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
@@ -2011,7 +2014,8 @@ void iommu_domain_free(struct iommu_domain *domain)
if (domain->type == IOMMU_DOMAIN_SVA)
mmdrop(domain->mm);
iommu_put_dma_cookie(domain);
-   domain->ops->free(domain);
+   if (domain->ops->free)
+   domain->ops->free(domain);
 }
 EXPORT_SYMBOL_GPL(iommu_domain_free);
 
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 56506d5753f15c..d48a202a7c3b81 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -111,10 +111,6 @@ struct selftest_obj {
};
 };
 
-static void mock_domain_blocking_free(struct iommu_domain *domain)
-{
-}
-
 static int mock_domain_nop_attach(struct iommu_domain *domain,
  struct device *dev)
 {
@@ -122,7 +118,6 @@ static int mock_domain_nop_attach(struct iommu_domain 
*domain,
 }
 
 static const struct iommu_domain_ops mock_blocking_ops = {
-   .free = mock_domain_blocking_free,
.attach_dev = mock_domain_nop_attach,
 };
 
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 8ff69fbf9f65db..033678f2f8b3ab 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -989,13 +989,8 @@ static int rk_iommu_identity_attach(struct iommu_domain 
*identity_domain,
return 0;
 }
 
-static void rk_iommu_identity_free(struct iommu_domain *domain)
-{
-}
-
 static struct iommu_domain_ops rk_identity_ops = {
.attach_dev = rk_iommu_identity_attach,
-   .free = rk_iommu_identity_free,
 };
 
 static struct iommu_domain rk_identity_domain = {
@@ -1059,9 +1054,6 @@ static struct iommu_domain 
*rk_iommu_domain_alloc(unsigned type)
 {
struct rk_iommu_domain *rk_domain;
 
-   if (type == IOMMU_DOMAIN_IDENTITY)
-   return &rk_identity_domain;
-
if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
return NULL;
 
@@ -1186,6 +1178,7 @@ static int rk_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops rk_iommu_ops = {
+   .identity_domain = &rk_identity_domain,
.domain_alloc = rk_iommu_domain_alloc,
.probe_device = rk_iommu_probe_device,
.release_device = rk_iommu_release_device,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c50a769d569a60..d0920b2a9f1c0e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -260,6 +260,8 @@ struct iommu_iotlb_gather {
  *will be blocked by the hardware.
  * @pgsize_bitmap: bitmap of all possible supported page sizes
  * @owner: Driver module providing these ops
+ * @identity_domain: An always available, always attachable identity
+ *   translation.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -294,6 +296,7 @@ struct iommu_ops {
const struct iommu_domain_ops *default_domain_ops;
unsigned long pgsize_bitmap;
struct module *owner;
+   struct iommu_domain *identity_domain;
 };
 
 /**
-- 
2.42.0



[PATCH v8 19/24] iommu/sun50i: Add an IOMMU_IDENTITIY_DOMAIN

2023-09-13 Thread Jason Gunthorpe
Prior to commit 1b932ceddd19 ("iommu: Remove detach_dev callbacks") the
sun50i_iommu_detach_device() function was being called by
ops->detach_dev().

This is an IDENTITY domain, so convert sun50i_iommu_detach_device() into
sun50i_iommu_identity_attach() with a full IDENTITY domain, and thus hook it
back up the same way as the old ops->detach_dev().

Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/sun50i-iommu.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/sun50i-iommu.c b/drivers/iommu/sun50i-iommu.c
index 74c5cb93e90027..0bf08b120cf105 100644
--- a/drivers/iommu/sun50i-iommu.c
+++ b/drivers/iommu/sun50i-iommu.c
@@ -757,21 +757,32 @@ static void sun50i_iommu_detach_domain(struct 
sun50i_iommu *iommu,
iommu->domain = NULL;
 }
 
-static void sun50i_iommu_detach_device(struct iommu_domain *domain,
-  struct device *dev)
+static int sun50i_iommu_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
-   struct sun50i_iommu_domain *sun50i_domain = to_sun50i_domain(domain);
struct sun50i_iommu *iommu = dev_iommu_priv_get(dev);
+   struct sun50i_iommu_domain *sun50i_domain;
 
dev_dbg(dev, "Detaching from IOMMU domain\n");
 
-   if (iommu->domain != domain)
-   return;
+   if (iommu->domain == identity_domain)
+   return 0;
 
+   sun50i_domain = to_sun50i_domain(iommu->domain);
if (refcount_dec_and_test(&sun50i_domain->refcnt))
sun50i_iommu_detach_domain(iommu, sun50i_domain);
+   return 0;
 }
 
+static struct iommu_domain_ops sun50i_iommu_identity_ops = {
+   .attach_dev = sun50i_iommu_identity_attach,
+};
+
+static struct iommu_domain sun50i_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &sun50i_iommu_identity_ops,
+};
+
 static int sun50i_iommu_attach_device(struct iommu_domain *domain,
  struct device *dev)
 {
@@ -789,8 +800,7 @@ static int sun50i_iommu_attach_device(struct iommu_domain 
*domain,
if (iommu->domain == domain)
return 0;
 
-   if (iommu->domain)
-   sun50i_iommu_detach_device(iommu->domain, dev);
+   sun50i_iommu_identity_attach(&sun50i_iommu_identity_domain, dev);
 
sun50i_iommu_attach_domain(iommu, sun50i_domain);
 
@@ -827,6 +837,7 @@ static int sun50i_iommu_of_xlate(struct device *dev,
 }
 
 static const struct iommu_ops sun50i_iommu_ops = {
+   .identity_domain = &sun50i_iommu_identity_domain,
.pgsize_bitmap  = SZ_4K,
.device_group   = sun50i_iommu_device_group,
.domain_alloc   = sun50i_iommu_domain_alloc,
@@ -985,6 +996,7 @@ static int sun50i_iommu_probe(struct platform_device *pdev)
if (!iommu)
return -ENOMEM;
spin_lock_init(&iommu->iommu_lock);
+   iommu->domain = &sun50i_iommu_identity_domain;
platform_set_drvdata(pdev, iommu);
iommu->dev = &pdev->dev;
 
-- 
2.42.0



[PATCH v8 21/24] iommu: Add __iommu_group_domain_alloc()

2023-09-13 Thread Jason Gunthorpe
Allocate a domain from a group. Automatically obtains the iommu_ops to use
from the device list of the group. Convert the internal callers to use it.
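
The helper it relies on takes the ops from the first device in the group;
its shape, inferred from the hunk context below rather than quoted from
the patch, is approximately:

static const struct iommu_ops *group_iommu_ops(struct iommu_group *group)
{
	struct group_device *device =
		list_first_entry(&group->devices, struct group_device, list);

	lockdep_assert_held(&group->mutex);
	return dev_iommu_ops(device->dev);
}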

Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 59 +--
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index cfb597751f5bad..38856d542afc35 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -96,8 +96,8 @@ static const char * const iommu_group_resv_type_string[] = {
 static int iommu_bus_notifier(struct notifier_block *nb,
  unsigned long action, void *data);
 static void iommu_release_device(struct device *dev);
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type);
+static struct iommu_domain *
+__iommu_group_domain_alloc(struct iommu_group *group, unsigned int type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 struct device *dev);
 static int __iommu_attach_group(struct iommu_domain *domain,
@@ -1719,12 +1719,11 @@ struct iommu_group *fsl_mc_device_group(struct device 
*dev)
 EXPORT_SYMBOL_GPL(fsl_mc_device_group);
 
 static struct iommu_domain *
-__iommu_group_alloc_default_domain(const struct bus_type *bus,
-  struct iommu_group *group, int req_type)
+__iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
if (group->default_domain && group->default_domain->type == req_type)
return group->default_domain;
-   return __iommu_domain_alloc(bus, req_type);
+   return __iommu_group_domain_alloc(group, req_type);
 }
 
 /*
@@ -1751,9 +1750,7 @@ static const struct iommu_ops *group_iommu_ops(struct 
iommu_group *group)
 static struct iommu_domain *
 iommu_group_alloc_default_domain(struct iommu_group *group, int req_type)
 {
-   const struct bus_type *bus =
-   list_first_entry(&group->devices, struct group_device, list)
-   ->dev->bus;
+   const struct iommu_ops *ops = group_iommu_ops(group);
struct iommu_domain *dom;
 
lockdep_assert_held(&group->mutex);
@@ -1763,24 +1760,24 @@ iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 * domain. This should always be either an IDENTITY/BLOCKED/PLATFORM
 * domain. Do not use in new drivers.
 */
-   if (bus->iommu_ops->default_domain) {
+   if (ops->default_domain) {
if (req_type)
return ERR_PTR(-EINVAL);
-   return bus->iommu_ops->default_domain;
+   return ops->default_domain;
}
 
if (req_type)
-   return __iommu_group_alloc_default_domain(bus, group, req_type);
+   return __iommu_group_alloc_default_domain(group, req_type);
 
/* The driver gave no guidance on what type to use, try the default */
-   dom = __iommu_group_alloc_default_domain(bus, group, 
iommu_def_domain_type);
+   dom = __iommu_group_alloc_default_domain(group, iommu_def_domain_type);
if (dom)
return dom;
 
/* Otherwise IDENTITY and DMA_FQ defaults will try DMA */
if (iommu_def_domain_type == IOMMU_DOMAIN_DMA)
return NULL;
-   dom = __iommu_group_alloc_default_domain(bus, group, IOMMU_DOMAIN_DMA);
+   dom = __iommu_group_alloc_default_domain(group, IOMMU_DOMAIN_DMA);
if (!dom)
return NULL;
 
@@ -2043,19 +2040,16 @@ void iommu_set_fault_handler(struct iommu_domain 
*domain,
 }
 EXPORT_SYMBOL_GPL(iommu_set_fault_handler);
 
-static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus,
-unsigned type)
+static struct iommu_domain *__iommu_domain_alloc(const struct iommu_ops *ops,
+unsigned int type)
 {
struct iommu_domain *domain;
unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS;
 
-   if (bus == NULL || bus->iommu_ops == NULL)
-   return NULL;
+   if (alloc_type == IOMMU_DOMAIN_IDENTITY && ops->identity_domain)
+   return ops->identity_domain;
 
-   if (alloc_type == IOMMU_DOMAIN_IDENTITY && 
bus->iommu_ops->identity_domain)
-   return bus->iommu_ops->identity_domain;
-
-   domain = bus->iommu_ops->domain_alloc(alloc_type);
+   domain = ops->domain_alloc(alloc_type);
if (!domain)
return NULL;
 
@@ -2065,10 +2059,10 @@ static struct iommu_domain *__iommu_domain_alloc(const 
struct bus_type *bus,
 * may override this later
 */
if (!domain->pgsize_bitmap)
-   domain->pgsize_bitmap = 

[PATCH v8 13/24] iommu/omap: Implement an IDENTITY domain

2023-09-13 Thread Jason Gunthorpe
What omap does during omap_iommu_set_platform_dma() is actually putting
the iommu into identity mode.

Move to the new core support for ARM_DMA_USE_IOMMU by defining
ops->identity_domain.

This driver does not support IOMMU_DOMAIN_DMA; however, it cannot be
compiled on ARM64 either. Most likely it would be fine to support dma-iommu.c.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/omap-iommu.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 537e402f9bba97..34340ef15241bc 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1555,16 +1555,31 @@ static void _omap_iommu_detach_dev(struct 
omap_iommu_domain *omap_domain,
omap_domain->dev = NULL;
 }
 
-static void omap_iommu_set_platform_dma(struct device *dev)
+static int omap_iommu_identity_attach(struct iommu_domain *identity_domain,
+ struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct omap_iommu_domain *omap_domain = to_omap_domain(domain);
+   struct omap_iommu_domain *omap_domain;
 
+   if (domain == identity_domain || !domain)
+   return 0;
+
+   omap_domain = to_omap_domain(domain);
spin_lock(&omap_domain->lock);
_omap_iommu_detach_dev(omap_domain, dev);
spin_unlock(&omap_domain->lock);
+   return 0;
 }
 
+static struct iommu_domain_ops omap_iommu_identity_ops = {
+   .attach_dev = omap_iommu_identity_attach,
+};
+
+static struct iommu_domain omap_iommu_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &omap_iommu_identity_ops,
+};
+
 static struct iommu_domain *omap_iommu_domain_alloc(unsigned type)
 {
struct omap_iommu_domain *omap_domain;
@@ -1732,11 +1747,11 @@ static struct iommu_group 
*omap_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops omap_iommu_ops = {
+   .identity_domain = &omap_iommu_identity_domain,
.domain_alloc   = omap_iommu_domain_alloc,
.probe_device   = omap_iommu_probe_device,
.release_device = omap_iommu_release_device,
.device_group   = omap_iommu_device_group,
-   .set_platform_dma_ops = omap_iommu_set_platform_dma,
.pgsize_bitmap  = OMAP_IOMMU_PGSIZES,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = omap_iommu_attach_dev,
-- 
2.42.0



[PATCH v8 20/24] iommu: Require a default_domain for all iommu drivers

2023-09-13 Thread Jason Gunthorpe
At this point every iommu driver will cause a default_domain to be
selected, so we can finally remove this gap from the core code.

The following table explains what each driver supports and what the
resulting default_domain will be:

                                   ops->default_domain
                     IDENTITY  DMA  PLATFORM  ARM32     dma-iommu  ARCH
amd/iommu.c              Y      Y             N/A       either
apple-dart.c             Y      Y             N/A       either
arm-smmu.c               Y      Y             IDENTITY  either
qcom_iommu.c             G      Y             IDENTITY  either
arm-smmu-v3.c            Y      Y             N/A       either
exynos-iommu.c           G      Y             IDENTITY  either
fsl_pamu_domain.c        Y      Y             N/A       N/A        PLATFORM
intel/iommu.c            Y      Y             N/A       either
ipmmu-vmsa.c             G      Y             IDENTITY  either
msm_iommu.c              G                    IDENTITY  N/A
mtk_iommu.c              G      Y             IDENTITY  either
mtk_iommu_v1.c           G                    IDENTITY  N/A
omap-iommu.c             G                    IDENTITY  N/A
rockchip-iommu.c         G      Y             IDENTITY  either
s390-iommu.c             Y      Y             N/A       N/A        PLATFORM
sprd-iommu.c                    Y             N/A       DMA
sun50i-iommu.c           G      Y             IDENTITY  either
tegra-smmu.c             G      Y             IDENTITY  IDENTITY
virtio-iommu.c           Y      Y             N/A       either
spapr                    Y      Y             N/A       N/A        PLATFORM
 * G means ops->identity_domain is used
 * N/A means the driver will not compile in this configuration

ARM32 drivers select an IDENTITY default domain through either
ops->identity_domain or by directly requesting an IDENTITY domain through
alloc_domain().

In ARM64 mode tegra-smmu will still block the use of dma-iommu.c and
force an IDENTITY domain.

S390 uses a PLATFORM domain to represent when the dma_ops are set to the
s390 iommu code.

fsl_pamu uses a PLATFORM domain.

POWER SPAPR uses PLATFORM and blocking to enable its weird VFIO mode.

The x86 drivers continue unchanged.

After this patch group->default_domain is only NULL for a short period
during bus iommu probing while all the groups are constituted. Otherwise
it is always !NULL.

This completes changing the iommu subsystem driver contract to a system
where the current iommu_domain always represents some form of translation
and the driver is continuously asserting a definable translation mode.

It resolves the confusion that the original ops->detach_dev() caused
around what translation, exactly, the IOMMU is performing after
detach. There were at least three different answers to that question in
the tree; they are all now clearly named with domain types.

Tested-by: Heiko Stuebner 
Tested-by: Niklas Schnelle 
Tested-by: Steven Price 
Tested-by: Marek Szyprowski 
Tested-by: Nicolin Chen 
Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 42a4585dd76da6..cfb597751f5bad 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1865,7 +1865,6 @@ static int iommu_get_def_domain_type(struct iommu_group 
*group,
 static int iommu_get_default_domain_type(struct iommu_group *group,
 int target_type)
 {
-   const struct iommu_ops *ops = group_iommu_ops(group);
struct device *untrusted = NULL;
struct group_device *gdev;
int driver_type = 0;
@@ -1876,11 +1875,13 @@ static int iommu_get_default_domain_type(struct 
iommu_group *group,
 * ARM32 drivers supporting CONFIG_ARM_DMA_USE_IOMMU can declare an
 * identity_domain and it will automatically become their default
 * domain. Later on ARM_DMA_USE_IOMMU will install its UNMANAGED domain.
-* Override the selection to IDENTITY if we are sure the driver supports
-* it.
+* Override the selection to IDENTITY.
 */
-   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) && ops->identity_domain)
+   if (IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU)) {
+   static_assert(!(IS_ENABLED(CONFIG_ARM_DMA_USE_IOMMU) &&
+   IS_ENABLED(CONFIG_IOMMU_DMA)));
driver_type = IOMMU_DOMAIN_IDENTITY;
+   }
 
for_each_group_device(group, gdev) {
driver_type = 

[PATCH v8 02/24] iommu: Add IOMMU_DOMAIN_PLATFORM

2023-09-13 Thread Jason Gunthorpe
This is used when the iommu driver is taking control of the dma_ops,
currently only on S390 and power spapr. It is designed to preserve the
original ops->detach_dev() semantic that these drivers were built around.

Provide an opaque domain type and a 'default_domain' ops value that allows
the driver to trivially force any single domain as the default domain.

Update the iommufd selftest to use this instead of set_platform_dma_ops.
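
In driver terms the new mechanism amounts to something like the following
(hypothetical "foo" driver, a sketch rather than code from this patch; the
s390 conversion in patch 04 is the real example):

static int foo_platform_attach(struct iommu_domain *platform_domain,
			       struct device *dev)
{
	/* hand the device back to the arch's private dma_ops */
	return 0;
}

static struct iommu_domain_ops foo_platform_ops = {
	.attach_dev = foo_platform_attach,
};

static struct iommu_domain foo_platform_domain = {
	.type = IOMMU_DOMAIN_PLATFORM,
	.ops = &foo_platform_ops,
};

static const struct iommu_ops foo_iommu_ops = {
	.default_domain = &foo_platform_domain,
	/* ... */
};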

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c| 13 +
 drivers/iommu/iommufd/selftest.c | 14 +-
 include/linux/iommu.h|  8 
 3 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 33bd1107090720..0e13e566581c21 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -184,6 +184,8 @@ static const char *iommu_domain_type_str(unsigned int t)
case IOMMU_DOMAIN_DMA:
case IOMMU_DOMAIN_DMA_FQ:
return "Translated";
+   case IOMMU_DOMAIN_PLATFORM:
+   return "Platform";
default:
return "Unknown";
}
@@ -1752,6 +1754,17 @@ iommu_group_alloc_default_domain(struct iommu_group 
*group, int req_type)
 
lockdep_assert_held(&group->mutex);
 
+   /*
+* Allow legacy drivers to specify the domain that will be the default
+* domain. This should always be either an IDENTITY/BLOCKED/PLATFORM
+* domain. Do not use in new drivers.
+*/
+   if (bus->iommu_ops->default_domain) {
+   if (req_type)
+   return ERR_PTR(-EINVAL);
+   return bus->iommu_ops->default_domain;
+   }
+
if (req_type)
return __iommu_group_alloc_default_domain(bus, group, req_type);
 
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index d48a202a7c3b81..fb981ba97c4e87 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -281,14 +281,6 @@ static bool mock_domain_capable(struct device *dev, enum 
iommu_cap cap)
return cap == IOMMU_CAP_CACHE_COHERENCY;
 }
 
-static void mock_domain_set_plaform_dma_ops(struct device *dev)
-{
-   /*
-* mock doesn't setup default domains because we can't hook into the
-* normal probe path
-*/
-}
-
 static struct iommu_device mock_iommu_device = {
 };
 
@@ -298,12 +290,16 @@ static struct iommu_device *mock_probe_device(struct 
device *dev)
 }
 
 static const struct iommu_ops mock_ops = {
+   /*
+* IOMMU_DOMAIN_BLOCKED cannot be returned from def_domain_type()
+* because it is zero.
+*/
+   .default_domain = &mock_blocking_domain,
.owner = THIS_MODULE,
.pgsize_bitmap = MOCK_IO_PAGE_SIZE,
.hw_info = mock_domain_hw_info,
.domain_alloc = mock_domain_alloc,
.capable = mock_domain_capable,
-   .set_platform_dma_ops = mock_domain_set_plaform_dma_ops,
.device_group = generic_device_group,
.probe_device = mock_probe_device,
.default_domain_ops =
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d0920b2a9f1c0e..a05480be05fd08 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -64,6 +64,7 @@ struct iommu_domain_geometry {
 #define __IOMMU_DOMAIN_DMA_FQ  (1U << 3)  /* DMA-API uses flush queue*/
 
 #define __IOMMU_DOMAIN_SVA (1U << 4)  /* Shared process address space */
+#define __IOMMU_DOMAIN_PLATFORM	(1U << 5)
 
 #define IOMMU_DOMAIN_ALLOC_FLAGS ~__IOMMU_DOMAIN_DMA_FQ
 /*
@@ -81,6 +82,8 @@ struct iommu_domain_geometry {
  *   invalidation.
  * IOMMU_DOMAIN_SVA- DMA addresses are shared process addresses
  *   represented by mm_struct's.
+ * IOMMU_DOMAIN_PLATFORM   - Legacy domain for drivers that do their own
+ *   dma_api stuff. Do not use in new drivers.
  */
 #define IOMMU_DOMAIN_BLOCKED   (0U)
 #define IOMMU_DOMAIN_IDENTITY  (__IOMMU_DOMAIN_PT)
@@ -91,6 +94,7 @@ struct iommu_domain_geometry {
 __IOMMU_DOMAIN_DMA_API |   \
 __IOMMU_DOMAIN_DMA_FQ)
 #define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SVA)
+#define IOMMU_DOMAIN_PLATFORM  (__IOMMU_DOMAIN_PLATFORM)
 
 struct iommu_domain {
unsigned type;
@@ -262,6 +266,9 @@ struct iommu_iotlb_gather {
  * @owner: Driver module providing these ops
  * @identity_domain: An always available, always attachable identity
  *   translation.
+ * @default_domain: If not NULL this will always be set as the default domain.
+ *  This should be an IDENTITY/BLOCKED/PLATFORM domain.
+ *  Do not use in new drivers.
  */
 struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
@@ -297,6 +304,7 @@ struct iommu_ops {
unsigned long 

[PATCH v8 07/24] iommu/mtk_iommu_v1: Implement an IDENTITY domain

2023-09-13 Thread Jason Gunthorpe
What mtk does during mtk_iommu_v1_set_platform_dma() is actually putting
the iommu into identity mode. Make this available as a proper IDENTITY
domain.

The mtk_iommu_v1_def_domain_type() from
commit 8bbe13f52cb7 ("iommu/mediatek-v1: Add def_domain_type") explains
this was needed to allow probe_finalize() to be called, but now the
IDENTITY domain will do the same job so change the returned
def_domain_type.

mtk_v1 is the only driver that returns IOMMU_DOMAIN_UNMANAGED from
def_domain_type().  This allows the next patch to enforce an IDENTITY
domain policy for this driver.

Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/mtk_iommu_v1.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 8a0a5e5d049f4a..cc3e7d53d33ad9 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -319,11 +319,27 @@ static int mtk_iommu_v1_attach_device(struct iommu_domain 
*domain, struct device
return 0;
 }
 
-static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+static int mtk_iommu_v1_identity_attach(struct iommu_domain *identity_domain,
+   struct device *dev)
 {
struct mtk_iommu_v1_data *data = dev_iommu_priv_get(dev);
 
mtk_iommu_v1_config(data, dev, false);
+   return 0;
+}
+
+static struct iommu_domain_ops mtk_iommu_v1_identity_ops = {
+   .attach_dev = mtk_iommu_v1_identity_attach,
+};
+
+static struct iommu_domain mtk_iommu_v1_identity_domain = {
+   .type = IOMMU_DOMAIN_IDENTITY,
+   .ops = &mtk_iommu_v1_identity_ops,
+};
+
+static void mtk_iommu_v1_set_platform_dma(struct device *dev)
+{
+   mtk_iommu_v1_identity_attach(&mtk_iommu_v1_identity_domain, dev);
 }
 
 static int mtk_iommu_v1_map(struct iommu_domain *domain, unsigned long iova,
@@ -443,7 +459,7 @@ static int mtk_iommu_v1_create_mapping(struct device *dev, 
struct of_phandle_arg
 
 static int mtk_iommu_v1_def_domain_type(struct device *dev)
 {
-   return IOMMU_DOMAIN_UNMANAGED;
+   return IOMMU_DOMAIN_IDENTITY;
 }
 
 static struct iommu_device *mtk_iommu_v1_probe_device(struct device *dev)
@@ -578,6 +594,7 @@ static int mtk_iommu_v1_hw_init(const struct 
mtk_iommu_v1_data *data)
 }
 
 static const struct iommu_ops mtk_iommu_v1_ops = {
+   .identity_domain = &mtk_iommu_v1_identity_domain,
.domain_alloc   = mtk_iommu_v1_domain_alloc,
.probe_device   = mtk_iommu_v1_probe_device,
.probe_finalize = mtk_iommu_v1_probe_finalize,
-- 
2.42.0



[PATCH v8 03/24] powerpc/iommu: Setup a default domain and remove set_platform_dma_ops

2023-09-13 Thread Jason Gunthorpe
POWER is using the set_platform_dma_ops() callback to hook up its private
dma_ops, but this is buried under some indirection and is weirdly
happening for a BLOCKED domain as well.

For better documentation create a PLATFORM domain to manage the dma_ops,
since that is what it is for, and make the BLOCKED domain an alias for
it. BLOCKED is required for VFIO.

Also removes the leaky allocation of the BLOCKED domain by using a global
static.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 arch/powerpc/kernel/iommu.c | 38 +
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 14251bc5219eba..d6ad3fde85a212 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1280,7 +1280,7 @@ struct iommu_table_group_ops spapr_tce_table_group_ops = {
 /*
  * A simple iommu_ops to allow less cruft in generic VFIO code.
  */
-static int spapr_tce_blocking_iommu_attach_dev(struct iommu_domain *dom,
+static int spapr_tce_platform_iommu_attach_dev(struct iommu_domain *dom,
   struct device *dev)
 {
struct iommu_group *grp = iommu_group_get(dev);
@@ -1297,17 +1297,22 @@ static int spapr_tce_blocking_iommu_attach_dev(struct 
iommu_domain *dom,
return ret;
 }
 
-static void spapr_tce_blocking_iommu_set_platform_dma(struct device *dev)
-{
-   struct iommu_group *grp = iommu_group_get(dev);
-   struct iommu_table_group *table_group;
+static const struct iommu_domain_ops spapr_tce_platform_domain_ops = {
+   .attach_dev = spapr_tce_platform_iommu_attach_dev,
+};
 
-   table_group = iommu_group_get_iommudata(grp);
-   table_group->ops->release_ownership(table_group);
-}
+static struct iommu_domain spapr_tce_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &spapr_tce_platform_domain_ops,
+};
 
-static const struct iommu_domain_ops spapr_tce_blocking_domain_ops = {
-   .attach_dev = spapr_tce_blocking_iommu_attach_dev,
+static struct iommu_domain spapr_tce_blocked_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   /*
+* FIXME: SPAPR mixes blocked and platform behaviors, the blocked domain
+* also sets the dma_api ops
+*/
+   .ops = &spapr_tce_platform_domain_ops,
 };
 
 static bool spapr_tce_iommu_capable(struct device *dev, enum iommu_cap cap)
@@ -1324,18 +1329,9 @@ static bool spapr_tce_iommu_capable(struct device *dev, 
enum iommu_cap cap)
 
 static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
 {
-   struct iommu_domain *dom;
-
if (type != IOMMU_DOMAIN_BLOCKED)
return NULL;
-
-   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
-   if (!dom)
-   return NULL;
-
-   dom->ops = &spapr_tce_blocking_domain_ops;
-
-   return dom;
+   return &spapr_tce_blocked_domain;
 }
 
 static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
@@ -1371,12 +1367,12 @@ static struct iommu_group 
*spapr_tce_iommu_device_group(struct device *dev)
 }
 
 static const struct iommu_ops spapr_tce_iommu_ops = {
+   .default_domain = &spapr_tce_platform_domain,
.capable = spapr_tce_iommu_capable,
.domain_alloc = spapr_tce_iommu_domain_alloc,
.probe_device = spapr_tce_iommu_probe_device,
.release_device = spapr_tce_iommu_release_device,
.device_group = spapr_tce_iommu_device_group,
-   .set_platform_dma_ops = spapr_tce_blocking_iommu_set_platform_dma,
 };
 
 static struct attribute *spapr_tce_iommu_attrs[] = {
-- 
2.42.0



[PATCH v8 05/24] iommu/fsl_pamu: Implement a PLATFORM domain

2023-09-13 Thread Jason Gunthorpe
This driver is nonsensical. To not block migrating the core API away from
NULL default_domains, give it a hacky PLATFORM domain that keeps it
working exactly as it always did.

Leave some comments around to warn away any future people looking at this.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/fsl_pamu_domain.c | 41 ++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 4ac0e247ec2b51..e9d2bff4659b7c 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -196,6 +196,13 @@ static struct iommu_domain *fsl_pamu_domain_alloc(unsigned 
type)
 {
struct fsl_dma_domain *dma_domain;
 
+   /*
+* FIXME: This isn't creating an unmanaged domain since the
+* default_domain_ops do not have any map/unmap function it doesn't meet
+* the requirements for __IOMMU_DOMAIN_PAGING. The only purpose seems to
+* allow drivers/soc/fsl/qbman/qman_portal.c to do
+* fsl_pamu_configure_l1_stash()
+*/
if (type != IOMMU_DOMAIN_UNMANAGED)
return NULL;
 
@@ -283,15 +290,33 @@ static int fsl_pamu_attach_device(struct iommu_domain 
*domain,
return ret;
 }
 
-static void fsl_pamu_set_platform_dma(struct device *dev)
+/*
+ * FIXME: fsl/pamu is completely broken in terms of how it works with the iommu
+ * API. Immediately after probe the HW is left in an IDENTITY translation and
+ * the driver provides a non-working UNMANAGED domain that it can switch over
+ * to. However it cannot switch back to an IDENTITY translation, instead it
+ * switches to what looks like BLOCKING.
+ */
+static int fsl_pamu_platform_attach(struct iommu_domain *platform_domain,
+   struct device *dev)
 {
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
-   struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
+   struct fsl_dma_domain *dma_domain;
const u32 *prop;
int len;
struct pci_dev *pdev = NULL;
struct pci_controller *pci_ctl;
 
+   /*
+* Hack to keep things working as they always have, only leaving an
+* UNMANAGED domain makes it BLOCKING.
+*/
+   if (domain == platform_domain || !domain ||
+   domain->type != IOMMU_DOMAIN_UNMANAGED)
+   return 0;
+
+   dma_domain = to_fsl_dma_domain(domain);
+
/*
 * Use LIODN of the PCI controller while detaching a
 * PCI device.
@@ -312,8 +337,18 @@ static void fsl_pamu_set_platform_dma(struct device *dev)
detach_device(dev, dma_domain);
else
pr_debug("missing fsl,liodn property at %pOF\n", dev->of_node);
+   return 0;
 }
 
+static struct iommu_domain_ops fsl_pamu_platform_ops = {
+   .attach_dev = fsl_pamu_platform_attach,
+};
+
+static struct iommu_domain fsl_pamu_platform_domain = {
+   .type = IOMMU_DOMAIN_PLATFORM,
+   .ops = &fsl_pamu_platform_ops,
+};
+
 /* Set the domain stash attribute */
 int fsl_pamu_configure_l1_stash(struct iommu_domain *domain, u32 cpu)
 {
@@ -395,11 +430,11 @@ static struct iommu_device *fsl_pamu_probe_device(struct 
device *dev)
 }
 
 static const struct iommu_ops fsl_pamu_ops = {
+   .default_domain = &fsl_pamu_platform_domain,
.capable= fsl_pamu_capable,
.domain_alloc   = fsl_pamu_domain_alloc,
.probe_device   = fsl_pamu_probe_device,
.device_group   = fsl_pamu_device_group,
-   .set_platform_dma_ops = fsl_pamu_set_platform_dma,
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = fsl_pamu_attach_device,
.iova_to_phys   = fsl_pamu_iova_to_phys,
-- 
2.42.0



[PATCH] powerpc: Fix build issue with LD_DEAD_CODE_DATA_ELIMINATION and FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY

2023-09-13 Thread Naveen N Rao
We recently added support for -fpatchable-function-entry and it is
enabled by default on ppc32 (ppc64 needs gcc v13.1.0). When building the
kernel for ppc32 and also enabling CONFIG_LD_DEAD_CODE_DATA_ELIMINATION,
we see the below build error with older gcc versions:
  powerpc-linux-gnu-ld: init/main.o(__patchable_function_entries): error: need 
linked-to section for --gc-sections

This error is thrown because the __patchable_function_entries section
would be garbage collected with --gc-sections, since it does not reference
any other kept sections. This has subsequently been fixed in binutils with:
  
https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=b7d072167715829eed0622616f6ae0182900de3e

Disable LD_DEAD_CODE_DATA_ELIMINATION for gcc versions before v11.1.0 if
using -fpatchable-function-entry to avoid this bug.
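
The failure can be reproduced outside the kernel build with a trivial
translation unit; the exact commands below are my assumption based on the
flags named above, not taken from this report:

/* repro.c */
int foo(void)
{
	return 0;
}

/*
 * With an affected toolchain:
 *   powerpc-linux-gnu-gcc -ffunction-sections -fdata-sections \
 *       -fpatchable-function-entry=2 -c repro.c
 *   powerpc-linux-gnu-ld --gc-sections -e foo repro.o
 * fails with:
 *   repro.o(__patchable_function_entries): error: need linked-to
 *   section for --gc-sections
 * since the __patchable_function_entries section does not reference
 * any kept section.
 */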

Fixes: 0f71dcfb4aef ("powerpc/ftrace: Add support for 
-fpatchable-function-entry")
Reported-by: Michael Ellerman 
Signed-off-by: Naveen N Rao 
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 54b9387c3691..3aaadfd2c8eb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -255,7 +255,7 @@ config PPC
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
select HAVE_KRETPROBES
-   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if HAVE_OBJTOOL_MCOUNT
+   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if HAVE_OBJTOOL_MCOUNT && 
(!ARCH_USING_PATCHABLE_FUNCTION_ENTRY || (!CC_IS_GCC || GCC_VERSION >= 110100))
select HAVE_LIVEPATCH   if HAVE_DYNAMIC_FTRACE_WITH_REGS
select HAVE_MOD_ARCH_SPECIFIC
select HAVE_NMI if PERF_EVENTS || (PPC64 && 
PPC_BOOK3S)

base-commit: 0ad6bbfc7dab179bb3de79190140acec2862934d
-- 
2.41.0



Re: [PATCH] powerpc: add `cur_cpu_spec` symbol to vmcoreinfo

2023-09-13 Thread Sachin Sant



> On 11-Sep-2023, at 2:44 PM, Aditya Gupta  wrote:
> 
> Presently, while reading a vmcore, makedumpfile uses
> `cur_cpu_spec.mmu_features` to decide whether the crashed system had
> RADIX MMU or not.
> 
> Currently, makedumpfile fails to get the `cur_cpu_spec` symbol (unless
> a vmlinux is passed with the `-x` flag to makedumpfile), and hence
> assigns offsets and shifts (such as pgd_offset_l4) incorrectly,
> considering the MMU to be hash MMU.
> 
> Add the `cur_cpu_spec` symbol and the offset of `mmu_features` in the
> `cpu_spec` struct to VMCOREINFO, so that the symbol address and offset
> are accessible to makedumpfile without needing the vmlinux file.
> 
> Signed-off-by: Aditya Gupta 
> ---
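
(For context, the additions described above would look roughly like the
following in powerpc's vmcoreinfo setup code. This is a sketch based on
the description, not a quote of the actual diff.)

	VMCOREINFO_SYMBOL(cur_cpu_spec);
	VMCOREINFO_OFFSET(cpu_spec, mmu_features);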

Thanks for the patch. With this patch applied (along with the makedumpfile
changes) I am able to capture a vmcore against a kernel which contains
commit 8dc9a0ad0c3e.

Reported-by: Sachin Sant 
Tested-by: Sachin Sant 

- Sachin


Re: [PATCH v1 07/19] powerpc: Untangle fixmap.h and pgtable.h and mmu.h

2023-09-13 Thread kernel test robot
Hi Christophe,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.6-rc1 next-20230913]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Christophe-Leroy/powerpc-8xx-Fix-pte_access_permitted-for-PAGE_NONE/20230912-031616
base:   linus/master
patch link:
https://lore.kernel.org/r/c94717708db817a0a0a6349431a2701252686899.1694443576.git.christophe.leroy%40csgroup.eu
patch subject: [PATCH v1 07/19] powerpc: Untangle fixmap.h and pgtable.h and 
mmu.h
config: powerpc-randconfig-r013-20230912 
(https://download.01.org/0day-ci/archive/20230913/202309131942.k7ezjho8-...@intel.com/config)
compiler: powerpc-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20230913/202309131942.k7ezjho8-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202309131942.k7ezjho8-...@intel.com/

All errors (new ones prefixed by >>):

   arch/powerpc/platforms/83xx/misc.c: In function 'mpc83xx_setup_arch':
>> arch/powerpc/platforms/83xx/misc.c:126:28: error: implicit declaration of 
>> function 'fix_to_virt'; did you mean 'bus_to_virt'? 
>> [-Werror=implicit-function-declaration]
 126 | unsigned long va = fix_to_virt(FIX_IMMR_BASE);
 |^~~
 |bus_to_virt
>> arch/powerpc/platforms/83xx/misc.c:126:40: error: 'FIX_IMMR_BASE' undeclared 
>> (first use in this function)
 126 | unsigned long va = fix_to_virt(FIX_IMMR_BASE);
 |^
   arch/powerpc/platforms/83xx/misc.c:126:40: note: each undeclared identifier 
is reported only once for each function it appears in
   cc1: all warnings being treated as errors

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for HOTPLUG_CPU
   Depends on [n]: SMP [=y] && (PPC_PSERIES [=n] || PPC_PMAC [=n] || 
PPC_POWERNV [=n] || FSL_SOC_BOOKE [=n])
   Selected by [y]:
   - PM_SLEEP_SMP [=y] && SMP [=y] && (ARCH_SUSPEND_POSSIBLE [=y] || 
ARCH_HIBERNATION_POSSIBLE [=y]) && PM_SLEEP [=y]


vim +/FIX_IMMR_BASE +126 arch/powerpc/platforms/83xx/misc.c

fff69fd03d1290 Kevin Hao2016-08-23  121  
fff69fd03d1290 Kevin Hao2016-08-23  122  void __init 
mpc83xx_setup_arch(void)
fff69fd03d1290 Kevin Hao2016-08-23  123  {
6b7c095a51e1ba Christophe Leroy 2019-09-16  124 phys_addr_t immrbase = 
get_immrbase();
6b7c095a51e1ba Christophe Leroy 2019-09-16  125 int immrsize = 
IS_ALIGNED(immrbase, SZ_2M) ? SZ_2M : SZ_1M;
6b7c095a51e1ba Christophe Leroy 2019-09-16 @126 unsigned long va = 
fix_to_virt(FIX_IMMR_BASE);
6b7c095a51e1ba Christophe Leroy 2019-09-16  127  
1ce844973bb516 Christophe Leroy 2022-06-14  128 if (ppc_md.progress)
1ce844973bb516 Christophe Leroy 2022-06-14  129 
ppc_md.progress("mpc83xx_setup_arch()", 0);
1ce844973bb516 Christophe Leroy 2022-06-14  130  
6b7c095a51e1ba Christophe Leroy 2019-09-16  131 setbat(-1, va, 
immrbase, immrsize, PAGE_KERNEL_NCG);
6b7c095a51e1ba Christophe Leroy 2019-09-16  132 update_bats();
6b7c095a51e1ba Christophe Leroy 2019-09-16  133  }
0deae39cec6dab Christophe Leroy 2018-12-10  134  
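
A plausible fix, inferred from the error rather than stated anywhere in
this report, would be for misc.c to include the fixmap header directly
now that pgtable.h/mmu.h no longer pull it in:

	#include <asm/fixmap.h>	/* fix_to_virt() and FIX_IMMR_BASE */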

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


[PATCH v2] ASoC: imx-rpmsg: Set ignore_pmdown_time for dai_link

2023-09-13 Thread Chancel Liu
i.MX rpmsg sound cards work with the codec in slave mode. MCLK will be
disabled by the CPU DAI driver in hw_free(). Some codecs require MCLK to be
present during the power up/down sequence, so ignore_pmdown_time needs to
be set to power down the codec immediately, before MCLK is turned off.

Take WM8962 as an example: if MCLK is disabled before DAPM powers down the
playback stream, a FIFO error will arise in the WM8962, which has a bad
impact on the next playback.

Signed-off-by: Chancel Liu 
---
 sound/soc/fsl/imx-rpmsg.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/sound/soc/fsl/imx-rpmsg.c b/sound/soc/fsl/imx-rpmsg.c
index 3c7b95db2eac..b578f9a32d7f 100644
--- a/sound/soc/fsl/imx-rpmsg.c
+++ b/sound/soc/fsl/imx-rpmsg.c
@@ -89,6 +89,14 @@ static int imx_rpmsg_probe(struct platform_device *pdev)
SND_SOC_DAIFMT_NB_NF |
SND_SOC_DAIFMT_CBC_CFC;
 
+   /*
+* i.MX rpmsg sound cards work on codec slave mode. MCLK will be
+* disabled by CPU DAI driver in hw_free(). Some codec requires MCLK
+* present at power up/down sequence. So need to set ignore_pmdown_time
+* to power down codec immediately before MCLK is turned off.
+*/
+   data->dai.ignore_pmdown_time = 1;
+
/* Optional codec node */
ret = of_parse_phandle_with_fixed_args(np, "audio-codec", 0, 0, &args);
if (ret) {
-- 
2.25.1



[PATCH v2 3/8] arch/x86: Remove now superfluous sentinel elem from ctl_table arrays

2023-09-13 Thread Joel Granados via B4 Relay
From: Joel Granados 

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (for further information, see
https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/).

Remove sentinel element from sld_sysctl and itmt_kern_table. This
removal is safe because register_sysctl_init and register_sysctl
implicitly use the array size in addition to checking for the sentinel.
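
For reference, the now sentinel-free registration pattern looks like this
(a sketch with hypothetical names, not code from this patch):

#include <linux/init.h>
#include <linux/sysctl.h>

static int example_var;

static struct ctl_table example_table[] = {
	{
		.procname	= "example",
		.data		= &example_var,
		.maxlen		= sizeof(example_var),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	}
	/* no {} sentinel: register_sysctl_init() uses ARRAY_SIZE() */
};

static int __init example_sysctl_init(void)
{
	register_sysctl_init("kernel", example_table);
	return 0;
}
late_initcall(example_sysctl_init);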

Reviewed-by: Ingo Molnar 
Acked-by: Dave Hansen  # for x86
Signed-off-by: Joel Granados 
---
 arch/x86/kernel/cpu/intel.c | 3 +--
 arch/x86/kernel/itmt.c  | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index be4045628fd3..e63391b82624 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1015,8 +1015,7 @@ static struct ctl_table sld_sysctls[] = {
.proc_handler   = proc_douintvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
-   },
-   {}
+   }
 };
 
 static int __init sld_mitigate_sysctl_init(void)
diff --git a/arch/x86/kernel/itmt.c b/arch/x86/kernel/itmt.c
index ee4fe8cdb857..5f2ccff38297 100644
--- a/arch/x86/kernel/itmt.c
+++ b/arch/x86/kernel/itmt.c
@@ -73,8 +73,7 @@ static struct ctl_table itmt_kern_table[] = {
.proc_handler   = sched_itmt_update_handler,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
-   },
-   {}
+   }
 };
 
 static struct ctl_table_header *itmt_sysctl_header;

-- 
2.30.2



[PATCH v2 2/8] arm: Remove now superfluous sentinel elem from ctl_table arrays

2023-09-13 Thread Joel Granados via B4 Relay
From: Joel Granados 

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (for further information, see
https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/).

Removed the sentinel as well as the explicit size from ctl_isa_vars. The
size is redundant as the initialization sets it. Changed
insn_emulation->sysctl from a 2 element array of struct ctl_table to a
simple struct. This has no consequence for the sysctl registration as it
is forwarded as a pointer. Removed the sentinel from sve_default_vl_table,
sme_default_vl_table, tagged_addr_sysctl_table and
armv8_pmu_sysctl_table.

This removal is safe because register_sysctl_sz and register_sysctl use
the array size in addition to checking for the sentinel.

Signed-off-by: Joel Granados 
---
 arch/arm/kernel/isa.c| 4 ++--
 arch/arm64/kernel/armv8_deprecated.c | 8 +++-
 arch/arm64/kernel/fpsimd.c   | 6 ++
 arch/arm64/kernel/process.c  | 3 +--
 drivers/perf/arm_pmuv3.c | 3 +--
 5 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/isa.c b/arch/arm/kernel/isa.c
index 20218876bef2..0b9c28077092 100644
--- a/arch/arm/kernel/isa.c
+++ b/arch/arm/kernel/isa.c
@@ -16,7 +16,7 @@
 
 static unsigned int isa_membase, isa_portbase, isa_portshift;
 
-static struct ctl_table ctl_isa_vars[4] = {
+static struct ctl_table ctl_isa_vars[] = {
{
.procname   = "membase",
.data   = &isa_membase,
@@ -35,7 +35,7 @@ static struct ctl_table ctl_isa_vars[4] = {
.maxlen = sizeof(isa_portshift),
.mode   = 0444,
.proc_handler   = proc_dointvec,
-   }, {}
+   }
 };
 
 static struct ctl_table_header *isa_sysctl_header;
diff --git a/arch/arm64/kernel/armv8_deprecated.c 
b/arch/arm64/kernel/armv8_deprecated.c
index e459cfd33711..dd6ce86d4332 100644
--- a/arch/arm64/kernel/armv8_deprecated.c
+++ b/arch/arm64/kernel/armv8_deprecated.c
@@ -52,10 +52,8 @@ struct insn_emulation {
int min;
int max;
 
-   /*
-* sysctl for this emulation + a sentinal entry.
-*/
-   struct ctl_table sysctl[2];
+   /* sysctl for this emulation */
+   struct ctl_table sysctl;
 };
 
 #define ARM_OPCODE_CONDTEST_FAIL   0
@@ -558,7 +556,7 @@ static void __init register_insn_emulation(struct 
insn_emulation *insn)
update_insn_emulation_mode(insn, INSN_UNDEF);
 
if (insn->status != INSN_UNAVAILABLE) {
-   sysctl = &insn->sysctl[0];
+   sysctl = &insn->sysctl;
 
sysctl->mode = 0644;
sysctl->maxlen = sizeof(int);
diff --git a/arch/arm64/kernel/fpsimd.c b/arch/arm64/kernel/fpsimd.c
index 91e44ac7150f..db3ad1ba8272 100644
--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -588,8 +588,7 @@ static struct ctl_table sve_default_vl_table[] = {
.mode   = 0644,
.proc_handler   = vec_proc_do_default_vl,
.extra1 = &vl_info[ARM64_VEC_SVE],
-   },
-   { }
+   }
 };
 
 static int __init sve_sysctl_init(void)
@@ -612,8 +611,7 @@ static struct ctl_table sme_default_vl_table[] = {
.mode   = 0644,
.proc_handler   = vec_proc_do_default_vl,
.extra1 = &vl_info[ARM64_VEC_SME],
-   },
-   { }
+   }
 };
 
 static int __init sme_sysctl_init(void)
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 0fcc4eb1a7ab..48861cdc3aae 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -723,8 +723,7 @@ static struct ctl_table tagged_addr_sysctl_table[] = {
.proc_handler   = proc_dointvec_minmax,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
-   },
-   { }
+   }
 };
 
 static int __init tagged_addr_init(void)
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index e5a2ac4155f6..c4aa6a8d1b05 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1172,8 +1172,7 @@ static struct ctl_table armv8_pmu_sysctl_table[] = {
.proc_handler   = armv8pmu_proc_user_access_handler,
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_ONE,
-   },
-   { }
+   }
 };
 
 static void armv8_pmu_register_sysctl_table(void)

-- 
2.30.2



[PATCH v2 8/8] c-sky: Remove now superfluous sentinel element from ctl_table array

2023-09-13 Thread Joel Granados via B4 Relay
From: Joel Granados 

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which
will reduce the overall build time size of the kernel and run time
memory bloat by ~64 bytes per sentinel (for further information, see
https://lore.kernel.org/all/zo5yx5jfoggi%2f...@bombadil.infradead.org/).

Remove sentinel from alignment_tbl ctl_table array. This removal is safe
because register_sysctl_init implicitly uses ARRAY_SIZE() in addition to
checking for the sentinel.

Acked-by: Guo Ren 
Signed-off-by: Joel Granados 
---
 arch/csky/abiv1/alignment.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/csky/abiv1/alignment.c b/arch/csky/abiv1/alignment.c
index b60259daed1b..0d75ce7b0328 100644
--- a/arch/csky/abiv1/alignment.c
+++ b/arch/csky/abiv1/alignment.c
@@ -328,8 +328,7 @@ static struct ctl_table alignment_tbl[5] = {
.maxlen = sizeof(align_usr_count),
.mode = 0666,
.proc_handler = &proc_dointvec
-   },
-   {}
+   }
 };
 
 static int __init csky_alignment_init(void)

-- 
2.30.2


