[PATCH v5 11/11] docs: powerpc: Document nested KVM on POWER

2023-09-13 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it
is used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
v3:
  - Fix typos
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index d33b554ca7ba..23e449994c2a 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -26,6 +26,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..8b37981dc3d9
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Nested KVM on POWER
+===================
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+============
+
+Linux/KVM has had support for nesting as an L0 or L1 since 2018.
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall, h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs it until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
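+
+A rough sketch of the v1 flow from the L1's point of view (the hcall
+arguments are simplified here; refer to the PAPR and the KVM sources
+for the exact interfaces)::
+
+  h_set_partition_table(...)      /* register the L1's partition table */
+  while (running) {
+      h_enter_nested(hv_regs, regs)  /* full L2 vCPU state passed in */
+      /* ... the L0 runs the L2 until an exit condition is reached,
+       * then the updated L2 state is passed back and the L1 handles
+       * the exit ... */
+  }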
+
+New PAPR API
+============
+
+The new PAPR API differs from the v1 API in that creating the L2 and its
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN(). Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and create any associated vCPUs with h_guest_create_vcpu(). Getting
+and setting vCPU state can also be performed using the h_guest_{g|s}et
+hcalls.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is as follows (see the sketch after the list):
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 run the vCPU using the H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
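+In pseudocode, again from the L1's point of view (a simplified sketch,
+not the exact hcall signatures)::
+
+  H_GUEST_GET_CAPABILITIES(...)      /* what can the L0 offer?         */
+  H_GUEST_SET_CAPABILITIES(...)      /* what will this L1 use?         */
+  id = H_GUEST_CREATE(...)           /* token used in all later calls  */
+  H_GUEST_CREATE_VCPU(id, 0)
+  H_GUEST_SET(id, 0, state)          /* load initial vCPU 0 state      */
+  while (running) {
+      H_GUEST_VCPU_RUN(id, 0, ...)   /* run vCPU 0, get an exit reason */
+      H_GUEST_GET(id, 0, state)      /* read back state as needed      */
+  }
+  H_GUEST_DELETE(id)
+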
+More details of the individual hcalls follow:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. Refer to the latest version of the PAPR for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions (e.g.
+POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters

[PATCH v5 10/11] KVM: PPC: Add support for nestedv2 guests

2023-09-13 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. KVM
already had an interface that allowed this on powernv platforms. This
existing interface will now be called "nestedv1". The newly added PAPR
interface will be called "nestedv2".  PHYP will support the nestedv2
interface. At this time the host side of the nestedv2 interface has not
been implemented on powernv but there is no technical reason why it
could not be added.

The nestedv1 interface is still supported.

Add support to KVM to utilize these hcalls to enable running nested
guests as a pseries guest on PHYP.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create an L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create an L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 run the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2
interface.
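
A minimal sketch of that detection, assuming the
plpar_guest_get_capabilities() wrapper added by this series (the
surrounding helper is illustrative, not the exact code):

    static bool kvmhv_nestedv2_available(void)
    {
        unsigned long caps = 0;

        /* The wrapper retries on H_BUSY (see the v5 changelog) */
        if (plpar_guest_get_capabilities(0, &caps) != H_SUCCESS)
            return false;   /* fall back to trying nestedv1 */

        /* caps describes e.g. which CPU versions can run as L2s */
        return true;
    }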

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread-wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest-wide
elements cannot be added to the input buffer, so send them with a
separate H_GUEST_SET call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
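
For illustration, a getter with that caching could look roughly like
this (the helper names and the TAR guest state ID are stand-ins for the
series' actual bookkeeping):

    static u64 kvmppc_get_tar_nestedv2(struct kvm_vcpu *vcpu)
    {
        /*
         * Only ask the L0 if nothing has cached the value since the
         * last H_GUEST_VCPU_RUN; an element found in the run output
         * buffer also counts as already loaded.
         */
        if (!kvmhv_nestedv2_cached(vcpu, KVMPPC_GSID_TAR))
            kvmhv_nestedv2_reload_one(vcpu, KVMPPC_GSID_TAR); /* H_GUEST_GET */

        return vcpu->arch.tar;
    }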

Tested-by: Sachin Sant 
Signed-off-by: Vaibhav Jain 
Signed-off-by: Gautam Menghani 
Signed-off-by: Kautuk Consul 
Signed-off-by: Amit Machhiwal 
Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Gautam: Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Kautuk: Handle emul_inst as now a double word, init correctly
  - Use new GPR(), etc macros
  - Amit: Determine PAPR nested capabilities from cpu features
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Vaibhav: Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  - Propagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Amit: Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
v4:
  - Batch H_GUEST_GET calls in kvmhv_nestedv2_reload_ptregs()
  - Fix compile without CONFIG_PSERIES
  - Fix maybe uninitialized trap in kvmhv_p9_guest_entry()
  - Extend existing setters for arch_compat and lpcr
v5:
  - Check H_BUSY for {g,s}etting capabilities
  - Message if plpar_guest_get_capabilities() fails and nestedv1
support will be attempted.
  - Remove unused amor variable
---
 arch/powerpc/include/asm/guest-state-buffer.h |  91 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 137 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  20 +
 arch/powerpc/include/asm/kvm_ppc.h|  90 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 263 +
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 134 ++-
 arch/powerpc/kvm/book3s_hv.h  |  80 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  40 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 994 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  50 +
 14 files changed, 1843 insertions(+), 97 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index aaefe1075fc4..808149f31576 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STA

[PATCH v5 09/11] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-09-13 Thread Jordan Niethe
The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word in struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used by the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.

Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Use u64
  - Change format strings instead of casting
---
 arch/powerpc/include/asm/kvm_book3s.h | 10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  2 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 +++---
 arch/powerpc/kvm/book3s_hv_nested.c   |  4 ++--
 arch/powerpc/kvm/book3s_hv_uvmem.c|  2 +-
 arch/powerpc/kvm/book3s_xive.c|  4 ++--
 9 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4c6558d5fefe..831c23e4f121 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -191,14 +191,14 @@ extern int kvmppc_mmu_radix_translate_table(struct 
kvm_vcpu *vcpu, gva_t eaddr,
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
-   unsigned int pshift, unsigned int lpid);
+   unsigned int pshift, u64 lpid);
 extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
unsigned int shift,
const struct kvm_memory_slot *memslot,
-   unsigned int lpid);
+   u64 lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
bool writing, unsigned long gpa,
-   unsigned int lpid);
+   u64 lpid);
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
@@ -207,7 +207,7 @@ extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu 
*vcpu,
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
 extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
- unsigned int lpid);
+ u64 lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
 extern void kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
@@ -300,7 +300,7 @@ void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
-void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d49065af08e9..572f9bbf1a25 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -624,7 +624,7 @@ static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 
 extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 unsigned long gpa, unsigned int level,
-unsigned long mmu_seq, unsigned int lpid,
+unsigned long mmu_seq, u64 lpid,
 unsigned long *rmapp, struct rmap_nested **n_rmap);
 extern void kvmhv_insert_nest_rmap(struct kvm *kvm, unsigned long *rmapp,
   struct rmap_nested **n_rmap);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0d

[PATCH v5 08/11] KVM: PPC: Add helper library for Guest State Buffers

2023-09-13 Thread Jordan Niethe
The PAPR "Nestedv2" guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification.
Likewise, if the L1 needs to read L2 state, the request must go
through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format, the length is redundant as the
length of each identity is fixed, but it is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-------------
  | Element Count  | Elements...
  +----------------+-------------

  Element:

   <----2 bytes---> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |      Data      |
  +----------------+-----------+----------------+

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the nestedv2 guest support.
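
For example, serialising one element into a buffer comes down to
something like the following sketch (the helper name is made up; the
series adds kvmppc_gse_put_*() style helpers with more checking):

    /* Fields are big endian, as elsewhere in the PAPR. */
    static u8 *gsb_put_element(u8 *pos, u16 id, u16 len, const void *data)
    {
        __be16 beid  = cpu_to_be16(id);
        __be16 belen = cpu_to_be16(len);

        memcpy(pos, &beid, sizeof(beid));        /* 2 byte Guest State ID */
        memcpy(pos + 2, &belen, sizeof(belen));  /* 2 byte Length         */
        memcpy(pos + 4, data, len);              /* Length bytes of Data  */
        return pos + 4 + len;                    /* start of next element */
    }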

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSX
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
v5:
  - Fix mismatched function comments
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 904 ++
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 571 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 +++
 5 files changed, 1818 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 2a54fadbeaf5..339c3a5f56f1 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..aaefe1075fc4
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,904 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+/* Element without a value and any length */
+#define KVMPPC_GSID_BLANK  0x0000
+/* Size required for the L0's internal VCPU representation */
+#define KVMPPC_GSID_HOST_STATE_SIZE0x0001
+ /* Minimum size for the H_GUEST_RUN_VCPU output buffer */
+#define KVMPPC_GSID_RUN_OUTPUT_MIN_SIZE0x0002
+ /* "Logical" PVR value as defined in the PAPR */
+#define KVMPPC_GSID_LOGICAL_PVR0x0003
+ /* L0 relative timebase offset */
+#define KVMPPC_GSID_TB_OFFSET  0x0004
+ /* Partition Scoped Page Table Info */
+#de

[PATCH v5 07/11] KVM: PPC: Book3S HV: Introduce low level MSR accessor

2023-09-13 Thread Jordan Niethe
kvmppc_get_msr() and kvmppc_set_msr_fast() serve as accessors for the
MSR. However, because the MSR is kept in the shared regs, they include a
conditional check for kvmppc_shared_big_endian() and endian conversion.

Within the Book3S HV specific code there are direct reads and writes of
shregs::msr. In preparation for Nested APIv2 these accesses need to be
replaced with accessor functions so it is possible to extend their
behavior. However, using the kvmppc_get_msr() and kvmppc_set_msr_fast()
functions is undesirable because it would introduce a conditional branch
and endian conversion that is not currently present.

kvmppc_set_msr_hv() already exists; it is used for the
kvmppc_ops::set_msr callback.

Introduce a low level accessor __kvmppc_{s,g}et_msr_hv() that simply
gets and sets shregs::msr. This will be extended for Nested APIv2 support.
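
In essence the new accessors are just the following (a sketch of the
idea; the real definitions are added to book3s_hv.h):

    static inline u64 __kvmppc_get_msr_hv(struct kvm_vcpu *vcpu)
    {
        return vcpu->arch.shregs.msr;
    }

    static inline void __kvmppc_set_msr_hv(struct kvm_vcpu *vcpu, u64 val)
    {
        vcpu->arch.shregs.msr = val;
    }

There is no kvmppc_shared_big_endian() check and no endian conversion,
so they can replace the existing direct shregs::msr accesses without
changing behaviour.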

Signed-off-by: Jordan Niethe 
---
v4:
  - New to series
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  5 ++--
 arch/powerpc/kvm/book3s_hv.c | 34 ++--
 arch/powerpc/kvm/book3s_hv.h | 10 
 arch/powerpc/kvm/book3s_hv_builtin.c |  5 ++--
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index efd0ebf70a5e..fdfc2a62dd67 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -28,6 +28,7 @@
 #include 
 
 #include "book3s.h"
+#include "book3s_hv.h"
 #include "trace_hv.h"
 
 //#define DEBUG_RESIZE_HPT 1
@@ -347,7 +348,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
unsigned long v, orig_v, gr;
__be64 *hptep;
long int index;
-   int virtmode = vcpu->arch.shregs.msr & (data ? MSR_DR : MSR_IR);
+   int virtmode = __kvmppc_get_msr_hv(vcpu) & (data ? MSR_DR : MSR_IR);
 
if (kvm_is_radix(vcpu->kvm))
return kvmppc_mmu_radix_xlate(vcpu, eaddr, gpte, data, iswrite);
@@ -385,7 +386,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
 
/* Get PP bits and key for permission check */
pp = gr & (HPTE_R_PP0 | HPTE_R_PP);
-   key = (vcpu->arch.shregs.msr & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
+   key = (__kvmppc_get_msr_hv(vcpu) & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
key &= slb_v;
 
/* Calculate permissions */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 25025f6c4cce..5743f32bf45e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1374,7 +1374,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
  */
 static void kvmppc_cede(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.shregs.msr |= MSR_EE;
+   __kvmppc_set_msr_hv(vcpu, __kvmppc_get_msr_hv(vcpu) | MSR_EE);
vcpu->arch.ceded = 1;
smp_mb();
if (vcpu->arch.prodded) {
@@ -1589,7 +1589,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * That can happen due to a bug, or due to a machine check
 * occurring at just the wrong time.
 */
-   if (vcpu->arch.shregs.msr & MSR_HV) {
+   if (__kvmppc_get_msr_hv(vcpu) & MSR_HV) {
printk(KERN_EMERG "KVM trap in HV mode!\n");
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
vcpu->arch.trap, kvmppc_get_pc(vcpu),
@@ -1640,7 +1640,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * so that it knows that the machine check occurred.
 */
if (!vcpu->kvm->arch.fwnmi_enabled) {
-   ulong flags = (vcpu->arch.shregs.msr & 0x083c0000) |
+   ulong flags = (__kvmppc_get_msr_hv(vcpu) & 0x083c0000) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_machine_check(vcpu, flags);
r = RESUME_GUEST;
@@ -1670,7 +1670,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * as a result of a hypervisor emulation interrupt
 * (e40) getting turned into a 700 by BML RTAS.
 */
-   flags = (vcpu->arch.shregs.msr & 0x1full) |
+   flags = (__kvmppc_get_msr_hv(vcpu) & 0x1full) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_program(vcpu, flags);
r = RESUME_GUEST;
@@ -1680,7 +1680,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
{
int i;
 
-   if (unlikely(vcpu->arch.shregs.msr & MSR_PR)) {
+   if (unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
/*
 * Guest userspace executed sc 1. This can on

[PATCH v5 06/11] KVM: PPC: Book3S HV: Use accessors for VCPU registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for Book3S HV VCPU registers. Use
the accessor functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.
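
The generators follow the same shape as the other accessor macros in
the series, roughly (a sketch; the real macros are added to
book3s_hv.h):

    #define KVMPPC_BOOK3S_HV_VCPU_ACCESSOR(reg, size)                    \
    static inline void kvmppc_set_##reg##_hv(struct kvm_vcpu *vcpu,      \
                                             u##size val)                \
    {                                                                    \
        vcpu->arch.reg = val;                                            \
    }                                                                    \
    static inline u##size kvmppc_get_##reg##_hv(struct kvm_vcpu *vcpu)   \
    {                                                                    \
        return vcpu->arch.reg;                                           \
    }

    KVMPPC_BOOK3S_HV_VCPU_ACCESSOR(hfscr, 64) /* kvmppc_{g,s}et_hfscr_hv() */

which is what lets call sites such as kvmppc_set_ciabr_hv() and
kvmppc_get_hfscr_hv() in the diff below replace direct vcpu->arch
accesses.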

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
v5:
  - Remove unneeded trailing comment for line length
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
 arch/powerpc/kvm/book3s_hv.c   | 148 +
 arch/powerpc/kvm/book3s_hv.h   |  58 ++
 3 files changed, 139 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 5c71d6ae3a7b..ab646f59afd7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include "book3s_hv.h"
 #include 
 #include 
 #include 
@@ -294,9 +295,9 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t 
eaddr,
} else {
if (!(pte & _PAGE_PRIVILEGED)) {
/* Check AMR/IAMR to see if strict mode is in force */
-   if (vcpu->arch.amr & (1ul << 62))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 62))
gpte->may_read = 0;
-   if (vcpu->arch.amr & (1ul << 63))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 63))
gpte->may_write = 0;
if (vcpu->arch.iamr & (1ul << 62))
gpte->may_execute = 0;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 73d9a9eb376f..25025f6c4cce 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -868,7 +868,7 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
/* Guests can't breakpoint the hypervisor */
if ((value1 & CIABR_PRIV) == CIABR_PRIV_HYPER)
return H_P3;
-   vcpu->arch.ciabr  = value1;
+   kvmppc_set_ciabr_hv(vcpu, value1);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR0:
if (!kvmppc_power8_compatible(vcpu))
@@ -879,8 +879,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr0  = value1;
-   vcpu->arch.dawrx0 = value2;
+   kvmppc_set_dawr0_hv(vcpu, value1);
+   kvmppc_set_dawrx0_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR1:
if (!kvmppc_power8_compatible(vcpu))
@@ -895,8 +895,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr1  = value1;
-   vcpu->arch.dawrx1 = value2;
+   kvmppc_set_dawr1_hv(vcpu, value1);
+   kvmppc_set_dawrx1_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_ADDR_TRANS_MODE:
/*
@@ -1548,7 +1548,7 @@ static int kvmppc_pmu_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_PM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_PM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_PM);
 
return RESUME_GUEST;
 }
@@ -1558,7 +1558,7 @@ static int kvmppc_ebb_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_EBB))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_EBB;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_EBB);
 
return RESUME_GUEST;
 }
@@ -1568,7 +1568,7 @@ static int kvmppc_tm_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_TM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_TM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_TM);
 
return RESUME_GUEST;
 }
@@ -1867,7 +1867,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * Otherwise, we just generate a program interrupt to the guest.
 */
case BOOK3S_INTERRUPT_H_FAC_UNAVAIL: {
-   u64 cause = vcpu->arch.hfscr >> 56;
+   u64 cause = kvmppc_get_hfscr_hv(vcpu) >> 56;
 
r = EMULATE_FAIL;
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
@@ -2211,64 +2211,64 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 

[PATCH v5 05/11] KVM: PPC: Use accessors for VCORE registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for VCORE registers. Use the accessor
functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
  - Remove _hv suffix
  - Do not generate for setter arch_compat and lpcr
---
 arch/powerpc/include/asm/kvm_book3s.h | 25 -
 arch/powerpc/kvm/book3s_hv.c  | 24 
 arch/powerpc/kvm/book3s_hv_ras.c  |  4 ++--
 arch/powerpc/kvm/book3s_xive.c|  4 +---
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 1a220cd63227..4c6558d5fefe 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -483,6 +483,29 @@ KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
 
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size)\
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size)\
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR(reg, size)
\
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size) \
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size) \
+
+
+KVMPPC_BOOK3S_VCORE_ACCESSOR(vtb, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(arch_compat, 32)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(lpcr, 64)
+
 static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.dec_expires;
@@ -496,7 +519,7 @@ static inline void kvmppc_set_dec_expires(struct kvm_vcpu 
*vcpu, u64 val)
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27faecad1e3b..73d9a9eb376f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -794,7 +794,7 @@ static void kvmppc_update_vpa_dispatch(struct kvm_vcpu 
*vcpu,
 
vpa->enqueue_dispatch_tb = 
cpu_to_be64(be64_to_cpu(vpa->enqueue_dispatch_tb) + stolen);
 
-   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + vc->tb_offset, 
stolen);
+   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + 
kvmppc_get_tb_offset(vcpu), stolen);
 
vcpu->arch.vpa.dirty = true;
 }
@@ -845,9 +845,9 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 
 static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu)
 {
-   if (vcpu->arch.vcore->arch_compat >= PVR_ARCH_207)
+   if (kvmppc_get_arch_compat(vcpu) >= PVR_ARCH_207)
return true;
-   if ((!vcpu->arch.vcore->arch_compat) &&
+   if ((!kvmppc_get_arch_compat(vcpu)) &&
cpu_has_feature(CPU_FTR_ARCH_207S))
return true;
return false;
@@ -2283,7 +2283,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->arch.vcore->dpdes);
break;
case KVM_REG_PPC_VTB:
-   *val = get_reg_val(id, vcpu->arch.vcore->vtb);
+   *val = get_reg_val(id, kvmppc_get_vtb(vcpu));
break;
case KVM_REG_PPC_DAWR:
*val = get_reg_val(id, vcpu->arch.dawr0);
@@ -2342,11 +2342,11 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
spin_unlock(&vcpu->arch.vpa_update_lock);
break;
case KVM_REG_PPC_TB_OFFSET:
-   *val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
+   *val = get_reg_val(id, kvmppc_get_tb_offset(vcpu));
break;
case KVM_REG_PPC_LPCR:
case KVM_REG_PPC_LPCR_64:
-   *val = get_reg_val(id, vcpu->arch.vcore->lpcr);
+   *val = get_reg_val(id, kvmppc_get_lpcr(vcpu));
break;
case KVM_REG_PPC_PPR:
*val = get_reg_val(id, vcpu->arch.ppr);
@@ -2418,7 +2418,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 i

[PATCH v5 04/11] KVM: PPC: Use accessors for VCPU registers

2023-09-13 Thread Jordan Niethe
Introduce accessor generator macros for VCPU registers. Use the accessor
functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h  | 37 +-
 arch/powerpc/kvm/book3s.c  | 22 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  4 +--
 arch/powerpc/kvm/book3s_hv.c   | 12 -
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |  4 +--
 arch/powerpc/kvm/powerpc.c |  4 +--
 6 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 109a5f56767a..1a220cd63227 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -458,10 +458,45 @@ static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, 
u32 val)
 }
 #endif
 
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size) \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size) \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR(reg, size) \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size)  \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size)  \
+
+KVMPPC_BOOK3S_VCPU_ACCESSOR(pid, 32)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(tar, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbhr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbrr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index c080dd2e96ac..6cd20ab9e94e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -565,7 +565,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
regs->msr = kvmppc_get_msr(vcpu);
regs->srr0 = kvmppc_get_srr0(vcpu);
regs->srr1 = kvmppc_get_srr1(vcpu);
-   regs->pid = vcpu->arch.pid;
+   regs->pid = kvmppc_get_pid(vcpu);
regs->sprg0 = kvmppc_get_sprg0(vcpu);
regs->sprg1 = kvmppc_get_sprg1(vcpu);
regs->sprg2 = kvmppc_get_sprg2(vcpu);
@@ -683,19 +683,19 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
*val = get_reg_val(id, vcpu->arch.fscr);
break;
case KVM_REG_PPC_TAR:
-   *val = get_reg_val(id, vcpu->arch.tar);
+   *val = get_reg_val(id, kvmppc_get_tar(vcpu));
break;
case KVM_REG_PPC_EBBHR:
-   *val = get_reg_val(id, vcpu->arch.ebbhr);
+   *val = get_reg_val(id, kvmppc_get_ebbhr(vcpu));
break;
case KVM_REG_PPC_EBBRR:
-   *val = get_reg_val(id, vcpu->arch.ebbrr);
+   *val = get_reg_val(id, kvmppc_get_ebbrr(vcpu));
break;
case KVM_REG_PPC_BESCR:
-   *val = get_reg_val(id, vcpu->arch.bescr);
+   *val = get_reg_val(id, kvmppc_get_bescr(vcpu));
break;
case KVM_REG_PPC_IC:
-   *val = get_reg_val(id, vcpu->arch.ic);
+   *val = get_reg_val(id, kvmppc_get_ic(vcpu));
break;
default:
r = -EINVAL;
@@ -768,19 +768,19 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
kvmppc_set_fpscr(vcpu, set_reg_val(id, *val));
break;
case KVM_REG_PPC_TAR:
-   vcpu->ar

[PATCH v5 03/11] KVM: PPC: Rename accessor generator macros

2023-09-13 Thread Jordan Niethe
More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Fix ACESSOR typo
---
 arch/powerpc/include/asm/kvm_ppc.h | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d16d80ad2ae4..d554bc56e7f3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -927,19 +927,19 @@ static inline bool kvmppc_shared_big_endian(struct 
kvm_vcpu *vcpu)
 #endif
 }
 
-#define SPRNG_WRAPPER_GET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)   \
 static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu)\
 {  \
return mfspr(bookehv_spr);  \
 }  \
 
-#define SPRNG_WRAPPER_SET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)   \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val)  \
 {  \
mtspr(bookehv_spr, val);
\
 }  \
 
-#define SHARED_WRAPPER_GET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size)
\
 static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -948,7 +948,7 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu)   \
   return le##size##_to_cpu(vcpu->arch.shared->reg);\
 }  \
 
-#define SHARED_WRAPPER_SET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size)
\
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -957,36 +957,36 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
*vcpu, u##size val)   \
   vcpu->arch.shared->reg = cpu_to_le##size(val);   \
 }  \
 
-#define SHARED_WRAPPER(reg, size)  \
-   SHARED_WRAPPER_GET(reg, size)   \
-   SHARED_WRAPPER_SET(reg, size)   \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size)\
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size) \
 
-#define SPRNG_WRAPPER(reg, bookehv_spr)
\
-   SPRNG_WRAPPER_GET(reg, bookehv_spr) \
-   SPRNG_WRAPPER_SET(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)   \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)\
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)\
 
 #ifdef CONFIG_KVM_BOOKE_HV
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SPRNG_WRAPPER(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)\
 
 #else
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SHARED_WRAPPER(reg, size)   \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size) \
 
 #endif
 
-SHARED_WRAPPER(critical, 64)
-SHARED_SPRNG_WRAPPER(sprg0, 64, SPRN_GSPRG0)
-SHARED_SPRNG_WRAPPER(sprg1, 64, SPRN_GSPRG1)
-SHARED_SPRNG_WRAPPER(sprg2, 64, SPRN_GSPRG2)
-SHARED_SPRNG_WRAPPER(sprg3, 64, SPRN_GSPRG3)
-SHARED_SPRNG_WRAPPER(srr0, 64, SPRN_GSRR0)
-SHARED_SPRNG_WRAPPER(srr1, 64, SPRN_GSRR1)
-SHARED_SPRNG_WRAPPER(dar, 64, SPRN_GDEAR)
-SHARED_SPRNG_WRAPPER(esr, 64, SPRN_GESR)
-SHARED_WRAPPER_GET(msr, 64)
+KVMPPC_VCPU_SHARED_REGS

[PATCH v5 02/11] KVM: PPC: Introduce FPR/VR accessor functions

2023-09-13 Thread Jordan Niethe
Introduce accessor functions for floating point and vector registers
like the ones that exist for GPRs. Use these to replace the existing FPR
and VR accessor macros.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Gautam Menghani 
Signed-off-by: Jordan Niethe 
---
v3:
  - Gautam: Pass vector elements by reference
v4:
  - Split into unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h | 55 
 arch/powerpc/include/asm/kvm_booke.h  | 10 
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 72 +--
 5 files changed, 110 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..109a5f56767a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -403,6 +403,61 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j,
+ u64 val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
+#ifdef CONFIG_ALTIVEC
+static inline void kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
*v)
+{
+   *v =  vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i,
+vector128 *val)
+{
+   vcpu->arch.vr.vr[i] = *val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+#endif
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..c080dd2e96ac 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+

[PATCH v5 01/11] KVM: PPC: Always use the GPR accessors

2023-09-13 Thread Jordan Niethe
Always use the GPR accessor functions. This will be important later for
Nested APIv2 support which requires additional functionality for
accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split into unique patch
---
 arch/powerpc/kvm/book3s_64_vio.c | 4 ++--
 arch/powerpc/kvm/book3s_hv.c | 8 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 8 
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 4 ++--
 arch/powerpc/kvm/book3s_xive.c   | 4 ++--
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 93b695b289e9..4ba048f272f2 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -786,12 +786,12 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned 
long liobn,
idx = (ioba >> stt->page_shift) - stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
if (!page) {
-   vcpu->arch.regs.gpr[4] = 0;
+   kvmppc_set_gpr(vcpu, 4, 0);
return H_SUCCESS;
}
tbl = (u64 *)page_address(page);
 
-   vcpu->arch.regs.gpr[4] = tbl[idx % TCES_PER_PAGE];
+   kvmppc_set_gpr(vcpu, 4, tbl[idx % TCES_PER_PAGE]);
 
return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..4af5b68cf7f8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1267,10 +1267,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_HOST;
break;
 #endif
-   case H_RANDOM:
-   if (!arch_get_random_seed_longs(&vcpu->arch.regs.gpr[4], 1))
+   case H_RANDOM: {
+   unsigned long rand;
+
+   if (!arch_get_random_seed_longs(&rand, 1))
ret = H_HARDWARE;
+   kvmppc_set_gpr(vcpu, 4, rand);
break;
+   }
case H_RPT_INVALIDATE:
ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
  kvmppc_get_gpr(vcpu, 5),
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 0f5b021fa559..f3afe194e616 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -182,9 +182,13 @@ EXPORT_SYMBOL_GPL(kvmppc_hwrng_present);
 
 long kvmppc_rm_h_random(struct kvm_vcpu *vcpu)
 {
+   unsigned long rand;
+
if (ppc_md.get_random_seed &&
-   ppc_md.get_random_seed(&vcpu->arch.regs.gpr[4]))
+   ppc_md.get_random_seed(&rand)) {
+   kvmppc_set_gpr(vcpu, 4, rand);
return H_SUCCESS;
+   }
 
return H_HARDWARE;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9182324dbef9..17cb75a127b0 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -776,8 +776,8 @@ long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long 
flags,
r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
r &= ~HPTE_GR_RESERVED;
}
-   vcpu->arch.regs.gpr[4 + i * 2] = v;
-   vcpu->arch.regs.gpr[5 + i * 2] = r;
+   kvmppc_set_gpr(vcpu, 4 + i * 2, v);
+   kvmppc_set_gpr(vcpu, 5 + i * 2, r);
}
return H_SUCCESS;
 }
@@ -824,7 +824,7 @@ long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned 
long flags,
}
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
@@ -872,7 +872,7 @@ long kvmppc_h_clear_mod(struct kvm_vcpu *vcpu, unsigned 
long flags,
kvmppc_set_dirty_from_hpte(kvm, v, gr);
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index e165bfa842bf..e42984878503 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -481,7 +481,7 @@ static void icp_rm_down_cppr(struct kvmppc_xics *xics, 
struct kvmppc_icp *icp,
 
 unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.regs.gpr[5] = get_tb();
+   kvmppc_set_gpr(vcpu, 5, get_tb());
return xics_rm_h_xirr(vcpu);
 }
 
@@ -518,7 +518,7 @@ unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu)
} while (!icp_rm_try_update(icp, old_state, new_state));
 
/* Return the result in GPR4 */
-   vcpu->arch.regs.gpr[4] = xirr;
+   kvmppc_set_gpr(vcpu, 4, xirr);
 
return check_too_hard(xics, icp);
 }
diff --git a/

[PATCH v5 00/11] KVM: PPC: Nested APIv2 guest support

2023-09-13 Thread Jordan Niethe
ers and setters fail
  - Propagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
  - docs: powerpc: Document nested KVM on POWER
  - Fix typos


Change overview in v2:
  - Rebase on top of kvm ppc prefix instruction support
  - Make documentation an individual patch
  - Move guest state buffer files from arch/powerpc/lib/ to
arch/powerpc/kvm/
  - Use kunit for testing guest state buffer
  - Fix some build errors
  - Change HEIR element from 4 bytes to 8 bytes

Previous revisions:

  - v1: 
https://lore.kernel.org/linuxppc-dev/20230508072332.2937883-1-...@linux.vnet.ibm.com/
  - v2: 
https://lore.kernel.org/linuxppc-dev/20230605064848.12319-1-...@linux.vnet.ibm.com/
  - v3: 
https://lore.kernel.org/linuxppc-dev/20230807014553.1168699-1-jniet...@gmail.com/
  - v4: 
https://lore.kernel.org/linuxppc-dev/20230905034658.82835-1-jniet...@gmail.com/

Jordan Niethe (10):
  KVM: PPC: Always use the GPR accessors
  KVM: PPC: Introduce FPR/VR accessor functions
  KVM: PPC: Rename accessor generator macros
  KVM: PPC: Use accessors for VCPU registers
  KVM: PPC: Use accessors for VCORE registers
  KVM: PPC: Book3S HV: Use accessors for VCPU registers
  KVM: PPC: Book3S HV: Introduce low level MSR accessor
  KVM: PPC: Add helper library for Guest State Buffers
  KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
  KVM: PPC: Add support for nestedv2 guests

Michael Neuling (1):
  docs: powerpc: Document nested KVM on POWER

 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/kvm-nested.rst  | 636 +++
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 995 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 220 +++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   8 +-
 arch/powerpc/include/asm/kvm_booke.h  |  10 +
 arch/powerpc/include/asm/kvm_host.h   |  22 +-
 arch/powerpc/include/asm/kvm_ppc.h| 102 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 267 -
 arch/powerpc/kvm/Makefile |   4 +
 arch/powerpc/kvm/book3s.c |  38 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  31 +-
 arch/powerpc/kvm/book3s_64_vio.c  |   4 +-
 arch/powerpc/kvm/book3s_hv.c  | 358 +--
 arch/powerpc/kvm/book3s_hv.h  |  76 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  11 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  44 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 994 +
 arch/powerpc/kvm/book3s_hv_p9_entry.c |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|   2 +-
 arch/powerpc/kvm/book3s_xive.c|  12 +-
 arch/powerpc/kvm/emulate_loadstore.c  |   6 +-
 arch/powerpc/kvm/guest-state-buffer.c | 621 +++
 arch/powerpc/kvm/powerpc.c|  76 +-
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 ++
 31 files changed, 4672 insertions(+), 263 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

-- 
2.39.3



[PATCH v4 11/11] docs: powerpc: Document nested KVM on POWER

2023-09-04 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it
is used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
v3:
  - Fix typos
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index d33b554ca7ba..23e449994c2a 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -26,6 +26,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..8b37981dc3d9
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Nested KVM on POWER
+===================
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+============
+
+Linux/KVM has had support for nesting as an L0 or L1 since 2018.
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall, h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs it until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+============
+
+The new PAPR API differs from the v1 API in that creating the L2 and its
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN(). Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and create any associated vCPUs with h_guest_create_vcpu(). Getting
+and setting vCPU state can also be performed using the h_guest_{g|s}et
+hcalls.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 run the vCPU using the H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+More details of the individual hcalls follow:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. Refer to the latest version of the PAPR for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions (e.g.
+POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters

[PATCH v4 10/11] KVM: PPC: Add support for nestedv2 guests

2023-09-04 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. KVM
already had an interface that allowed this on powernv platforms. This
existing interface will now be called "nestedv1". The newly added PAPR
interface will be called "nestedv2".  PHYP will support the nestedv2
interface. At this time the host side of the nestedv2 interface has not
been implemented on powernv but there is no technical reason why it
could not be added.

The nestedv1 interface is still supported.

Add support to KVM to utilize these hcalls to enable running nested
guests as a pseries guest on PHYP.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create an L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create an L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 run the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2
interface.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest wide
elements cannot be added to the input buffer, so send them with a
separate H_GUEST_SET call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
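
A minimal sketch of that lazy-load idea (the helper names and the
bitmap are assumptions for illustration, not the actual kvmhv_nestedv2
implementation):

	static u64 l1_get_l2_reg(struct kvm_vcpu *vcpu, int gsid)
	{
		/* Value already cached since the last H_GUEST_VCPU_RUN
		 * (either loaded earlier or present in the run output
		 * buffer)?  Then avoid another hcall.
		 */
		if (test_bit(gsid, nestedv2_valid_bitmap(vcpu)))
			return nestedv2_cached_val(vcpu, gsid);

		/* Otherwise fetch it from the L0 and mark it valid until
		 * the next H_GUEST_VCPU_RUN invalidates the cache.
		 */
		h_guest_get_one(vcpu, gsid);
		set_bit(gsid, nestedv2_valid_bitmap(vcpu));
		return nestedv2_cached_val(vcpu, gsid);
	}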

Signed-off-by: Vaibhav Jain 
Signed-off-by: Gautam Menghani 
Signed-off-by: Kautuk Consul 
Signed-off-by: Amit Machhiwal 
Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Gautam: Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Kautuk: Handle emul_inst as now a double word, init correctly
  - Use new GPR(), etc macros
  - Amit: Determine PAPR nested capabilities from cpu features
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Vaibhav: Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  - Propagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Amit: Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
v4:
  - Batch H_GUEST_GET calls in kvmhv_nestedv2_reload_ptregs()
  - Fix compile without CONFIG_PSERIES
  - Fix maybe uninitialized trap in kvmhv_p9_guest_entry()
  - Extend existing setters for arch_compat and lpcr
---
 arch/powerpc/include/asm/guest-state-buffer.h |  91 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 137 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  20 +
 arch/powerpc/include/asm/kvm_ppc.h|  90 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 244 +
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 134 ++-
 arch/powerpc/kvm/book3s_hv.h  |  80 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 998 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  50 +
 14 files changed, 1826 insertions(+), 97 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index aaefe1075fc4..808149f31576 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
 #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
 
+#include "asm/hvcall.h"
 #include 
 #include 
 #include 
@@ -313,6 +314,8 @@ struct kvmppc_gs_buff *kvmppc_gsb_new(size_t

[PATCH v4 09/11] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-09-04 Thread Jordan Niethe
The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word in struct kvm_arch. Currently, LPIDs are
already limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used by the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.
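
For example, following the "change format strings instead of casting"
note in the changelog below, a caller building an irq name can print
the now 64 bit lpid directly (illustrative sketch only, not the exact
hunk):

	/* lpid is a u64 now, so widen the format rather than casting */
	name = kasprintf(GFP_KERNEL, "kvm-%llu-%d",
			 vcpu->kvm->arch.lpid, xc->server_num);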

Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Use u64
  - Change format strings instead of casting
---
 arch/powerpc/include/asm/kvm_book3s.h | 10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  2 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 +++---
 arch/powerpc/kvm/book3s_hv_nested.c   |  4 ++--
 arch/powerpc/kvm/book3s_hv_uvmem.c|  2 +-
 arch/powerpc/kvm/book3s_xive.c|  4 ++--
 9 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4c6558d5fefe..831c23e4f121 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -191,14 +191,14 @@ extern int kvmppc_mmu_radix_translate_table(struct 
kvm_vcpu *vcpu, gva_t eaddr,
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
-   unsigned int pshift, unsigned int lpid);
+   unsigned int pshift, u64 lpid);
 extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
unsigned int shift,
const struct kvm_memory_slot *memslot,
-   unsigned int lpid);
+   u64 lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
bool writing, unsigned long gpa,
-   unsigned int lpid);
+   u64 lpid);
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
@@ -207,7 +207,7 @@ extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu 
*vcpu,
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
 extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
- unsigned int lpid);
+ u64 lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
 extern void kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
@@ -300,7 +300,7 @@ void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
-void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d49065af08e9..572f9bbf1a25 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -624,7 +624,7 @@ static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 
 extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 unsigned long gpa, unsigned int level,
-unsigned long mmu_seq, unsigned int lpid,
+unsigned long mmu_seq, u64 lpid,
 unsigned long *rmapp, struct rmap_nested **n_rmap);
 extern void kvmhv_insert_nest_rmap(struct kvm *kvm, unsigned long *rmapp,
   struct rmap_nested **n_rmap);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0d

[PATCH v4 08/11] KVM: PPC: Add helper library for Guest State Buffers

2023-09-04 Thread Jordan Niethe
The PAPR "Nestedv2" guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification. If the
nested host needs to read L2 state, this request likewise must
go through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format the length is redundant as the
length of each identity is fixed but is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <---2 bytes----> <-2 bytes-> <--Length bytes-->
  +----------------+-----------+------------------+
  | Guest State ID |  Length   |       Data       |
  +----------------+-----------+------------------+

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the nestedv2 guest support.
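
For illustration, the layout described above maps onto something like
the following packed structures. This is a sketch only; the actual
header defines its own types and the library is used to build and
parse buffers rather than casting them:

	struct gs_elem {
		__be16 id;	/* Guest State ID */
		__be16 len;	/* fixed per ID, kept for validation */
		u8 data[];	/* 'len' bytes of data */
	} __packed;

	struct gs_buff {
		__be32 nelems;	/* element count */
		/* 'nelems' struct gs_elem entries follow back to back */
	} __packed;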

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSXs
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 904 ++
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 571 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 +++
 5 files changed, 1818 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 2a54fadbeaf5..339c3a5f56f1 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..aaefe1075fc4
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,904 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+/* Element without a value and any length */
+#define KVMPPC_GSID_BLANK  0x
+/* Size required for the L0's internal VCPU representation */
+#define KVMPPC_GSID_HOST_STATE_SIZE0x0001
+ /* Minimum size for the H_GUEST_RUN_VCPU output buffer */
+#define KVMPPC_GSID_RUN_OUTPUT_MIN_SIZE0x0002
+ /* "Logical" PVR value as defined in the PAPR */
+#define KVMPPC_GSID_LOGICAL_PVR0x0003
+ /* L0 relative timebase offset */
+#define KVMPPC_GSID_TB_OFFSET  0x0004
+ /* Partition Scoped Page Table Info */
+#define KVMPPC_GSID_PARTITION_TABLE  

[PATCH v4 07/11] KVM: PPC: Book3S HV: Introduce low level MSR accessor

2023-09-04 Thread Jordan Niethe
kvmppc_get_msr() and kvmppc_set_msr_fast() serve as accessors for the
MSR. However because the MSR is kept in the shared regs they include a
conditional check for kvmppc_shared_big_endian() and endian conversion.

Within the Book3S HV specific code there are direct reads and writes of
shregs::msr. In preparation for Nested APIv2 these accesses need to be
replaced with accessor functions so it is possible to extend their
behavior. However, using the kvmppc_get_msr() and kvmppc_set_msr_fast()
functions is undesirable because it would introduce a conditional branch
and endian conversion that is not currently present.

kvmppc_set_msr_hv() already exists; it is used for the
kvmppc_ops::set_msr callback.

Introduce a low level accessor __kvmppc_{s,g}et_msr_hv() that simply
gets and sets shregs::msr. This will be extended for Nested APIv2 support.
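
Based on the description above, the low level pair is roughly the
following (a sketch only; see the book3s_hv.h hunk for the real
definitions):

	static inline u64 __kvmppc_get_msr_hv(struct kvm_vcpu *vcpu)
	{
		return vcpu->arch.shregs.msr;
	}

	static inline void __kvmppc_set_msr_hv(struct kvm_vcpu *vcpu, u64 val)
	{
		vcpu->arch.shregs.msr = val;
	}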

Signed-off-by: Jordan Niethe 
---
v4:
  - New to series
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  5 ++--
 arch/powerpc/kvm/book3s_hv.c | 34 ++--
 arch/powerpc/kvm/book3s_hv.h | 10 
 arch/powerpc/kvm/book3s_hv_builtin.c |  5 ++--
 4 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index efd0ebf70a5e..fdfc2a62dd67 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -28,6 +28,7 @@
 #include 
 
 #include "book3s.h"
+#include "book3s_hv.h"
 #include "trace_hv.h"
 
 //#define DEBUG_RESIZE_HPT 1
@@ -347,7 +348,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
unsigned long v, orig_v, gr;
__be64 *hptep;
long int index;
-   int virtmode = vcpu->arch.shregs.msr & (data ? MSR_DR : MSR_IR);
+   int virtmode = __kvmppc_get_msr_hv(vcpu) & (data ? MSR_DR : MSR_IR);
 
if (kvm_is_radix(vcpu->kvm))
return kvmppc_mmu_radix_xlate(vcpu, eaddr, gpte, data, iswrite);
@@ -385,7 +386,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
 
/* Get PP bits and key for permission check */
pp = gr & (HPTE_R_PP0 | HPTE_R_PP);
-   key = (vcpu->arch.shregs.msr & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
+   key = (__kvmppc_get_msr_hv(vcpu) & MSR_PR) ? SLB_VSID_KP : SLB_VSID_KS;
key &= slb_v;
 
/* Calculate permissions */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index fabe99af0e0b..d4db8192753b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1374,7 +1374,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
  */
 static void kvmppc_cede(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.shregs.msr |= MSR_EE;
+   __kvmppc_set_msr_hv(vcpu, __kvmppc_get_msr_hv(vcpu) | MSR_EE);
vcpu->arch.ceded = 1;
smp_mb();
if (vcpu->arch.prodded) {
@@ -1589,7 +1589,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * That can happen due to a bug, or due to a machine check
 * occurring at just the wrong time.
 */
-   if (vcpu->arch.shregs.msr & MSR_HV) {
+   if (__kvmppc_get_msr_hv(vcpu) & MSR_HV) {
printk(KERN_EMERG "KVM trap in HV mode!\n");
printk(KERN_EMERG "trap=0x%x | pc=0x%lx | msr=0x%llx\n",
vcpu->arch.trap, kvmppc_get_pc(vcpu),
@@ -1640,7 +1640,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * so that it knows that the machine check occurred.
 */
if (!vcpu->kvm->arch.fwnmi_enabled) {
-   ulong flags = (vcpu->arch.shregs.msr & 0x083c) |
+   ulong flags = (__kvmppc_get_msr_hv(vcpu) & 0x083c) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_machine_check(vcpu, flags);
r = RESUME_GUEST;
@@ -1670,7 +1670,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * as a result of a hypervisor emulation interrupt
 * (e40) getting turned into a 700 by BML RTAS.
 */
-   flags = (vcpu->arch.shregs.msr & 0x1full) |
+   flags = (__kvmppc_get_msr_hv(vcpu) & 0x1full) |
(kvmppc_get_msr(vcpu) & SRR1_PREFIXED);
kvmppc_core_queue_program(vcpu, flags);
r = RESUME_GUEST;
@@ -1680,7 +1680,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
{
int i;
 
-   if (unlikely(vcpu->arch.shregs.msr & MSR_PR)) {
+   if (unlikely(__kvmppc_get_msr_hv(vcpu) & MSR_PR)) {
/*
 * Guest userspace executed sc 1. This can on

[PATCH v4 06/11] KVM: PPC: Book3S HV: Use accessors for VCPU registers

2023-09-04 Thread Jordan Niethe
Introduce accessor generator macros for Book3S HV VCPU registers. Use
the accessor functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.
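
The generators follow the same pattern as the VCORE accessors elsewhere
in the series; a minimal sketch, where the macro name is an assumption
rather than the exact one used in the hunks below:

	#define KVMPPC_BOOK3S_HV_VCPU_ACCESSOR(reg, size)			\
	static inline void kvmppc_set_##reg##_hv(struct kvm_vcpu *vcpu,	\
						 u##size val)			\
	{									\
		vcpu->arch.reg = val;						\
	}									\
	static inline u##size kvmppc_get_##reg##_hv(struct kvm_vcpu *vcpu)	\
	{									\
		return vcpu->arch.reg;						\
	}

	/* e.g. KVMPPC_BOOK3S_HV_VCPU_ACCESSOR(ciabr, 64) would generate the
	 * kvmppc_{get,set}_ciabr_hv() helpers used in the diff below.
	 */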

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
 arch/powerpc/kvm/book3s_hv.c   | 148 +
 arch/powerpc/kvm/book3s_hv.h   |  58 ++
 3 files changed, 139 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 5c71d6ae3a7b..ab646f59afd7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include "book3s_hv.h"
 #include 
 #include 
 #include 
@@ -294,9 +295,9 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t 
eaddr,
} else {
if (!(pte & _PAGE_PRIVILEGED)) {
/* Check AMR/IAMR to see if strict mode is in force */
-   if (vcpu->arch.amr & (1ul << 62))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 62))
gpte->may_read = 0;
-   if (vcpu->arch.amr & (1ul << 63))
+   if (kvmppc_get_amr_hv(vcpu) & (1ul << 63))
gpte->may_write = 0;
if (vcpu->arch.iamr & (1ul << 62))
gpte->may_execute = 0;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 73d9a9eb376f..fabe99af0e0b 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -868,7 +868,7 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
/* Guests can't breakpoint the hypervisor */
if ((value1 & CIABR_PRIV) == CIABR_PRIV_HYPER)
return H_P3;
-   vcpu->arch.ciabr  = value1;
+   kvmppc_set_ciabr_hv(vcpu, value1);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR0:
if (!kvmppc_power8_compatible(vcpu))
@@ -879,8 +879,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr0  = value1;
-   vcpu->arch.dawrx0 = value2;
+   kvmppc_set_dawr0_hv(vcpu, value1);
+   kvmppc_set_dawrx0_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_SET_DAWR1:
if (!kvmppc_power8_compatible(vcpu))
@@ -895,8 +895,8 @@ static int kvmppc_h_set_mode(struct kvm_vcpu *vcpu, 
unsigned long mflags,
return H_UNSUPPORTED_FLAG_START;
if (value2 & DABRX_HYP)
return H_P4;
-   vcpu->arch.dawr1  = value1;
-   vcpu->arch.dawrx1 = value2;
+   kvmppc_set_dawr1_hv(vcpu, value1);
+   kvmppc_set_dawrx1_hv(vcpu, value2);
return H_SUCCESS;
case H_SET_MODE_RESOURCE_ADDR_TRANS_MODE:
/*
@@ -1548,7 +1548,7 @@ static int kvmppc_pmu_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_PM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_PM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_PM);
 
return RESUME_GUEST;
 }
@@ -1558,7 +1558,7 @@ static int kvmppc_ebb_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_EBB))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_EBB;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_EBB);
 
return RESUME_GUEST;
 }
@@ -1568,7 +1568,7 @@ static int kvmppc_tm_unavailable(struct kvm_vcpu *vcpu)
if (!(vcpu->arch.hfscr_permitted & HFSCR_TM))
return EMULATE_FAIL;
 
-   vcpu->arch.hfscr |= HFSCR_TM;
+   kvmppc_set_hfscr_hv(vcpu, kvmppc_get_hfscr_hv(vcpu) | HFSCR_TM);
 
return RESUME_GUEST;
 }
@@ -1867,7 +1867,7 @@ static int kvmppc_handle_exit_hv(struct kvm_vcpu *vcpu,
 * Otherwise, we just generate a program interrupt to the guest.
 */
case BOOK3S_INTERRUPT_H_FAC_UNAVAIL: {
-   u64 cause = vcpu->arch.hfscr >> 56;
+   u64 cause = kvmppc_get_hfscr_hv(vcpu) >> 56;
 
r = EMULATE_FAIL;
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
@@ -2211,64 +2211,64 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->a

[PATCH v4 05/11] KVM: PPC: Use accessors VCORE registers

2023-09-04 Thread Jordan Niethe
Introduce accessor generator macros for VCORE registers. Use the accessor
functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
  - Remove _hv suffix
  - Do not generate for setter arch_compat and lpcr
---
 arch/powerpc/include/asm/kvm_book3s.h | 25 -
 arch/powerpc/kvm/book3s_hv.c  | 24 
 arch/powerpc/kvm/book3s_hv_ras.c  |  4 ++--
 arch/powerpc/kvm/book3s_xive.c|  4 +---
 4 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 1a220cd63227..4c6558d5fefe 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -483,6 +483,29 @@ KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
 KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
 
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size)\
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size)\
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define KVMPPC_BOOK3S_VCORE_ACCESSOR(reg, size)
\
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_SET(reg, size) \
+   KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(reg, size) \
+
+
+KVMPPC_BOOK3S_VCORE_ACCESSOR(vtb, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(arch_compat, 32)
+KVMPPC_BOOK3S_VCORE_ACCESSOR_GET(lpcr, 64)
+
 static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.dec_expires;
@@ -496,7 +519,7 @@ static inline void kvmppc_set_dec_expires(struct kvm_vcpu 
*vcpu, u64 val)
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27faecad1e3b..73d9a9eb376f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -794,7 +794,7 @@ static void kvmppc_update_vpa_dispatch(struct kvm_vcpu 
*vcpu,
 
vpa->enqueue_dispatch_tb = 
cpu_to_be64(be64_to_cpu(vpa->enqueue_dispatch_tb) + stolen);
 
-   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + vc->tb_offset, 
stolen);
+   __kvmppc_create_dtl_entry(vcpu, vpa, vc->pcpu, now + 
kvmppc_get_tb_offset(vcpu), stolen);
 
vcpu->arch.vpa.dirty = true;
 }
@@ -845,9 +845,9 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 
 static bool kvmppc_power8_compatible(struct kvm_vcpu *vcpu)
 {
-   if (vcpu->arch.vcore->arch_compat >= PVR_ARCH_207)
+   if (kvmppc_get_arch_compat(vcpu) >= PVR_ARCH_207)
return true;
-   if ((!vcpu->arch.vcore->arch_compat) &&
+   if ((!kvmppc_get_arch_compat(vcpu)) &&
cpu_has_feature(CPU_FTR_ARCH_207S))
return true;
return false;
@@ -2283,7 +2283,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
*val = get_reg_val(id, vcpu->arch.vcore->dpdes);
break;
case KVM_REG_PPC_VTB:
-   *val = get_reg_val(id, vcpu->arch.vcore->vtb);
+   *val = get_reg_val(id, kvmppc_get_vtb(vcpu));
break;
case KVM_REG_PPC_DAWR:
*val = get_reg_val(id, vcpu->arch.dawr0);
@@ -2342,11 +2342,11 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
spin_unlock(&vcpu->arch.vpa_update_lock);
break;
case KVM_REG_PPC_TB_OFFSET:
-   *val = get_reg_val(id, vcpu->arch.vcore->tb_offset);
+   *val = get_reg_val(id, kvmppc_get_tb_offset(vcpu));
break;
case KVM_REG_PPC_LPCR:
case KVM_REG_PPC_LPCR_64:
-   *val = get_reg_val(id, vcpu->arch.vcore->lpcr);
+   *val = get_reg_val(id, kvmppc_get_lpcr(vcpu));
break;
case KVM_REG_PPC_PPR:
*val = get_reg_val(id, vcpu->arch.ppr);
@@ -2418,7 +2418,7 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 i

[PATCH v4 04/11] KVM: PPC: Use accessors for VCPU registers

2023-09-04 Thread Jordan Niethe
Introduce accessor generator macros for VCPU registers. Use the accessor
functions to replace direct accesses to these registers.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split to unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h  | 37 +-
 arch/powerpc/kvm/book3s.c  | 22 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  4 +--
 arch/powerpc/kvm/book3s_hv.c   | 12 -
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |  4 +--
 arch/powerpc/kvm/powerpc.c |  4 +--
 6 files changed, 59 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 109a5f56767a..1a220cd63227 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -458,10 +458,45 @@ static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, 
u32 val)
 }
 #endif
 
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size) \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size) \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR(reg, size) \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size)  \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size)  \
+
+KVMPPC_BOOK3S_VCPU_ACCESSOR(pid, 32)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(tar, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbhr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbrr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(bescr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ic, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(vrsave, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - vcpu->arch.vcore->tb_offset;
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index c080dd2e96ac..6cd20ab9e94e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -565,7 +565,7 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, 
struct kvm_regs *regs)
regs->msr = kvmppc_get_msr(vcpu);
regs->srr0 = kvmppc_get_srr0(vcpu);
regs->srr1 = kvmppc_get_srr1(vcpu);
-   regs->pid = vcpu->arch.pid;
+   regs->pid = kvmppc_get_pid(vcpu);
regs->sprg0 = kvmppc_get_sprg0(vcpu);
regs->sprg1 = kvmppc_get_sprg1(vcpu);
regs->sprg2 = kvmppc_get_sprg2(vcpu);
@@ -683,19 +683,19 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
*val = get_reg_val(id, vcpu->arch.fscr);
break;
case KVM_REG_PPC_TAR:
-   *val = get_reg_val(id, vcpu->arch.tar);
+   *val = get_reg_val(id, kvmppc_get_tar(vcpu));
break;
case KVM_REG_PPC_EBBHR:
-   *val = get_reg_val(id, vcpu->arch.ebbhr);
+   *val = get_reg_val(id, kvmppc_get_ebbhr(vcpu));
break;
case KVM_REG_PPC_EBBRR:
-   *val = get_reg_val(id, vcpu->arch.ebbrr);
+   *val = get_reg_val(id, kvmppc_get_ebbrr(vcpu));
break;
case KVM_REG_PPC_BESCR:
-   *val = get_reg_val(id, vcpu->arch.bescr);
+   *val = get_reg_val(id, kvmppc_get_bescr(vcpu));
break;
case KVM_REG_PPC_IC:
-   *val = get_reg_val(id, vcpu->arch.ic);
+   *val = get_reg_val(id, kvmppc_get_ic(vcpu));
break;
default:
r = -EINVAL;
@@ -768,19 +768,19 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
kvmppc_set_fpscr(vcpu, set_reg_val(id, *val));
break;
case KVM_REG_PPC_TAR:
-   vcpu->ar

[PATCH v4 03/11] KVM: PPC: Rename accessor generator macros

2023-09-04 Thread Jordan Niethe
More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
v4:
  - Fix ACESSOR typo
---
 arch/powerpc/include/asm/kvm_ppc.h | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d16d80ad2ae4..d554bc56e7f3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -927,19 +927,19 @@ static inline bool kvmppc_shared_big_endian(struct 
kvm_vcpu *vcpu)
 #endif
 }
 
-#define SPRNG_WRAPPER_GET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)   \
 static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu)\
 {  \
return mfspr(bookehv_spr);  \
 }  \
 
-#define SPRNG_WRAPPER_SET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)   \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val)  \
 {  \
mtspr(bookehv_spr, val);
\
 }  \
 
-#define SHARED_WRAPPER_GET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size)
\
 static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -948,7 +948,7 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu)   \
   return le##size##_to_cpu(vcpu->arch.shared->reg);\
 }  \
 
-#define SHARED_WRAPPER_SET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size)
\
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -957,36 +957,36 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
*vcpu, u##size val)   \
   vcpu->arch.shared->reg = cpu_to_le##size(val);   \
 }  \
 
-#define SHARED_WRAPPER(reg, size)  \
-   SHARED_WRAPPER_GET(reg, size)   \
-   SHARED_WRAPPER_SET(reg, size)   \
+#define KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size)\
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_GET(reg, size) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR_SET(reg, size) \
 
-#define SPRNG_WRAPPER(reg, bookehv_spr)
\
-   SPRNG_WRAPPER_GET(reg, bookehv_spr) \
-   SPRNG_WRAPPER_SET(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)   \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_GET(reg, bookehv_spr)\
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR_SET(reg, bookehv_spr)\
 
 #ifdef CONFIG_KVM_BOOKE_HV
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SPRNG_WRAPPER(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_BOOKE_HV_SPRNG_ACCESSOR(reg, bookehv_spr)\
 
 #else
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SHARED_WRAPPER(reg, size)   \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_VCPU_SHARED_REGS_ACCESSOR(reg, size) \
 
 #endif
 
-SHARED_WRAPPER(critical, 64)
-SHARED_SPRNG_WRAPPER(sprg0, 64, SPRN_GSPRG0)
-SHARED_SPRNG_WRAPPER(sprg1, 64, SPRN_GSPRG1)
-SHARED_SPRNG_WRAPPER(sprg2, 64, SPRN_GSPRG2)
-SHARED_SPRNG_WRAPPER(sprg3, 64, SPRN_GSPRG3)
-SHARED_SPRNG_WRAPPER(srr0, 64, SPRN_GSRR0)
-SHARED_SPRNG_WRAPPER(srr1, 64, SPRN_GSRR1)
-SHARED_SPRNG_WRAPPER(dar, 64, SPRN_GDEAR)
-SHARED_SPRNG_WRAPPER(esr, 64, SPRN_GESR)
-SHARED_WRAPPER_GET(msr, 64)
+KVMPPC_VCPU_SHARED_REGS

[PATCH v4 02/11] KVM: PPC: Introduce FPR/VR accessor functions

2023-09-04 Thread Jordan Niethe
Introduce accessor functions for floating point and vector registers
like the ones that exist for GPRs. Use these to replace the existing FPR
and VR accessor macros.

This will be important later for Nested APIv2 support which requires
additional functionality for accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split into unique patch
---
 arch/powerpc/include/asm/kvm_book3s.h | 55 
 arch/powerpc/include/asm/kvm_booke.h  | 10 
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 72 +--
 5 files changed, 110 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..109a5f56767a 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -403,6 +403,61 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j,
+ u64 val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
+#ifdef CONFIG_ALTIVEC
+static inline void kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
*v)
+{
+   *v =  vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i,
+vector128 *val)
+{
+   vcpu->arch.vr.vr[i] = *val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+#endif
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 686d8d9eda3e..c080dd2e96ac 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+   kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
  

[PATCH v4 01/11] KVM: PPC: Always use the GPR accessors

2023-09-04 Thread Jordan Niethe
Always use the GPR accessor functions. This will be important later for
Nested APIv2 support which requires additional functionality for
accessing and modifying VCPU state.

Signed-off-by: Jordan Niethe 
---
v4:
  - Split into unique patch
---
 arch/powerpc/kvm/book3s_64_vio.c | 4 ++--
 arch/powerpc/kvm/book3s_hv.c | 8 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c | 6 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  | 8 
 arch/powerpc/kvm/book3s_hv_rm_xics.c | 4 ++--
 arch/powerpc/kvm/book3s_xive.c   | 4 ++--
 6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 93b695b289e9..4ba048f272f2 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -786,12 +786,12 @@ long kvmppc_h_get_tce(struct kvm_vcpu *vcpu, unsigned 
long liobn,
idx = (ioba >> stt->page_shift) - stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
if (!page) {
-   vcpu->arch.regs.gpr[4] = 0;
+   kvmppc_set_gpr(vcpu, 4, 0);
return H_SUCCESS;
}
tbl = (u64 *)page_address(page);
 
-   vcpu->arch.regs.gpr[4] = tbl[idx % TCES_PER_PAGE];
+   kvmppc_set_gpr(vcpu, 4, tbl[idx % TCES_PER_PAGE]);
 
return H_SUCCESS;
 }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 130bafdb1430..4af5b68cf7f8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1267,10 +1267,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_HOST;
break;
 #endif
-   case H_RANDOM:
-   if (!arch_get_random_seed_longs(&vcpu->arch.regs.gpr[4], 1))
+   case H_RANDOM: {
+   unsigned long rand;
+
+   if (!arch_get_random_seed_longs(&rand, 1))
ret = H_HARDWARE;
+   kvmppc_set_gpr(vcpu, 4, rand);
break;
+   }
case H_RPT_INVALIDATE:
ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
  kvmppc_get_gpr(vcpu, 5),
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 0f5b021fa559..f3afe194e616 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -182,9 +182,13 @@ EXPORT_SYMBOL_GPL(kvmppc_hwrng_present);
 
 long kvmppc_rm_h_random(struct kvm_vcpu *vcpu)
 {
+   unsigned long rand;
+
if (ppc_md.get_random_seed &&
-   ppc_md.get_random_seed(&vcpu->arch.regs.gpr[4]))
+   ppc_md.get_random_seed(&rand)) {
+   kvmppc_set_gpr(vcpu, 4, rand);
return H_SUCCESS;
+   }
 
return H_HARDWARE;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9182324dbef9..17cb75a127b0 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -776,8 +776,8 @@ long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long 
flags,
r = rev[i].guest_rpte | (r & (HPTE_R_R | HPTE_R_C));
r &= ~HPTE_GR_RESERVED;
}
-   vcpu->arch.regs.gpr[4 + i * 2] = v;
-   vcpu->arch.regs.gpr[5 + i * 2] = r;
+   kvmppc_set_gpr(vcpu, 4 + i * 2, v);
+   kvmppc_set_gpr(vcpu, 5 + i * 2, r);
}
return H_SUCCESS;
 }
@@ -824,7 +824,7 @@ long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned 
long flags,
}
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
@@ -872,7 +872,7 @@ long kvmppc_h_clear_mod(struct kvm_vcpu *vcpu, unsigned 
long flags,
kvmppc_set_dirty_from_hpte(kvm, v, gr);
}
}
-   vcpu->arch.regs.gpr[4] = gr;
+   kvmppc_set_gpr(vcpu, 4, gr);
ret = H_SUCCESS;
  out:
unlock_hpte(hpte, v & ~HPTE_V_HVLOCK);
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index e165bfa842bf..e42984878503 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -481,7 +481,7 @@ static void icp_rm_down_cppr(struct kvmppc_xics *xics, 
struct kvmppc_icp *icp,
 
 unsigned long xics_rm_h_xirr_x(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.regs.gpr[5] = get_tb();
+   kvmppc_set_gpr(vcpu, 5, get_tb());
return xics_rm_h_xirr(vcpu);
 }
 
@@ -518,7 +518,7 @@ unsigned long xics_rm_h_xirr(struct kvm_vcpu *vcpu)
} while (!icp_rm_try_update(icp, old_state, new_state));
 
/* Return the result in GPR4 */
-   vcpu->arch.regs.gpr[4] = xirr;
+   kvmppc_set_gpr(vcpu, 4, xirr);
 
return check_too_hard(xics, icp);
 }
diff --git a/

[PATCH v4 00/11] KVM: PPC: Nested APIv2 guest support

2023-09-04 Thread Jordan Niethe
es from arch/powerpc/lib/ to
arch/powerpc/kvm/
  - Use kunit for testing guest state buffer
  - Fix some build errors
  - Change HEIR element from 4 bytes to 8 bytes

Previous revisions:

  - v1: 
https://lore.kernel.org/linuxppc-dev/20230508072332.2937883-1-...@linux.vnet.ibm.com/
  - v2: 
https://lore.kernel.org/linuxppc-dev/20230605064848.12319-1-...@linux.vnet.ibm.com/
  - v3: 
https://lore.kernel.org/linuxppc-dev/20230807014553.1168699-1-jniet...@gmail.com/

Jordan Niethe (10):
  KVM: PPC: Always use the GPR accessors
  KVM: PPC: Introduce FPR/VR accessor functions
  KVM: PPC: Rename accessor generator macros
  KVM: PPC: Use accessors for VCPU registers
  KVM: PPC: Use accessors VCORE registers
  KVM: PPC: Book3S HV: Use accessors for VCPU registers
  KVM: PPC: Book3S HV: Introduce low level MSR accessor
  KVM: PPC: Add helper library for Guest State Buffers
  KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
  KVM: PPC: Add support for nestedv2 guests

Michael Neuling (1):
  docs: powerpc: Document nested KVM on POWER

 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/kvm-nested.rst  | 636 +++
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 995 +
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 220 +++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   8 +-
 arch/powerpc/include/asm/kvm_booke.h  |  10 +
 arch/powerpc/include/asm/kvm_host.h   |  22 +-
 arch/powerpc/include/asm/kvm_ppc.h| 102 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 248 -
 arch/powerpc/kvm/Makefile |   4 +
 arch/powerpc/kvm/book3s.c |  38 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  31 +-
 arch/powerpc/kvm/book3s_64_vio.c  |   4 +-
 arch/powerpc/kvm/book3s_hv.c  | 358 +--
 arch/powerpc/kvm/book3s_hv.h  |  76 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  11 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  42 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 998 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c|   2 +-
 arch/powerpc/kvm/book3s_xive.c|  12 +-
 arch/powerpc/kvm/emulate_loadstore.c  |   6 +-
 arch/powerpc/kvm/guest-state-buffer.c | 621 +++
 arch/powerpc/kvm/powerpc.c|  76 +-
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 ++
 31 files changed, 4655 insertions(+), 263 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

-- 
2.39.3



Re: [PATCH v3 4/6] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-08-15 Thread Jordan Niethe




On 15/8/23 8:45 pm, Michael Ellerman wrote:

"Nicholas Piggin"  writes:

On Mon Aug 7, 2023 at 11:45 AM AEST, Jordan Niethe wrote:

The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used be the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.


I don't suppose it would be worth having an lpid_t.


diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 4adff4f1896d..229f0a1ffdd4 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -886,10 +886,10 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, 
u8 prio,
  
  	if (single_escalation)

name = kasprintf(GFP_KERNEL, "kvm-%d-%d",
-vcpu->kvm->arch.lpid, xc->server_num);
+(unsigned int)vcpu->kvm->arch.lpid, 
xc->server_num);
else
name = kasprintf(GFP_KERNEL, "kvm-%d-%d-%d",
-vcpu->kvm->arch.lpid, xc->server_num, prio);
+(unsigned int)vcpu->kvm->arch.lpid, 
xc->server_num, prio);
if (!name) {
pr_err("Failed to allocate escalation irq name for queue %d of VCPU 
%d\n",
   prio, xc->server_num);


I would have thought you'd keep the type and change the format.


Yeah. Don't we risk having ambiguous names by discarding the high bits?
Not sure that would be a bug per se, but it could be confusing.


In this context it would always be constrained by the number of LPID 
bits so it wouldn't be ambiguous, but I'm going to change the format.




cheers



Re: [PATCH v3 2/6] KVM: PPC: Rename accessor generator macros

2023-08-15 Thread Jordan Niethe




On 14/8/23 6:27 pm, Nicholas Piggin wrote:

On Mon Aug 7, 2023 at 11:45 AM AEST, Jordan Niethe wrote:

More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Signed-off-by: Jordan Niethe 



---
v3:
   - New to series
---
  arch/powerpc/include/asm/kvm_ppc.h | 60 +++---
  1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d16d80ad2ae4..b66084a81dd0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -927,19 +927,19 @@ static inline bool kvmppc_shared_big_endian(struct 
kvm_vcpu *vcpu)
  #endif
  }
  
-#define SPRNG_WRAPPER_GET(reg, bookehv_spr)\

+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR_GET(reg, bookehv_spr)\
  static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu)   \
  { \
return mfspr(bookehv_spr);  \
  } \
  
-#define SPRNG_WRAPPER_SET(reg, bookehv_spr)\

+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR_SET(reg, bookehv_spr)\
  static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val) \
  { \
mtspr(bookehv_spr, val);
\
  } \
  
-#define SHARED_WRAPPER_GET(reg, size)	\

+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR_GET(reg, size) \
  static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu) \
  { \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -948,7 +948,7 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu)   \
   return le##size##_to_cpu(vcpu->arch.shared->reg);  \
  } \
  
-#define SHARED_WRAPPER_SET(reg, size)	\

+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR_SET(reg, size) \
  static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)   
\
  { \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -957,36 +957,36 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
*vcpu, u##size val)   \
   vcpu->arch.shared->reg = cpu_to_le##size(val); \
  } \
  
-#define SHARED_WRAPPER(reg, size)	\

-   SHARED_WRAPPER_GET(reg, size)   \
-   SHARED_WRAPPER_SET(reg, size)   \
+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR(reg, size) 
\
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR_GET(reg, size)  
\
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR_SET(reg, size)  
\
  
-#define SPRNG_WRAPPER(reg, bookehv_spr)	\

-   SPRNG_WRAPPER_GET(reg, bookehv_spr) \
-   SPRNG_WRAPPER_SET(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR(reg, bookehv_spr)
\
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR_GET(reg, bookehv_spr) 
\
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR_SET(reg, bookehv_spr) 
\
  
  #ifdef CONFIG_KVM_BOOKE_HV
  
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)			\

-   SPRNG_WRAPPER(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR(reg, bookehv_spr) \
  
  #else
  
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)			\

-   SHARED_WRAPPER(reg, size)   \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR(reg, size)  \


Not the greatest name I've ever seen :D Hard to be concise and
consistent though, this is an odd one.


Yes, it is a bit wordy.



Reviewed-by: Nicholas Piggin 


Thanks.



  
  #endif
  
-SHARED_WRAPPER(critical, 64)

-SHARED_SPRNG_WRAPPER(sprg0, 64, SPRN_GSPRG0)
-SHARED_SPRNG_WRAPPER(sprg1, 64, SPRN_GSPRG1)
-SHARED_SPRNG_WRAPPER(sprg2, 64, SPRN_GSPRG2)
-SHARED_SPRNG_WRAPPER(sprg3, 64, SPRN_GSPRG3)
-SHARED_SPRNG_WRAPPER(srr0, 64, SPRN_GSRR0)
-SHARED_SPRNG_WRAPPER(srr1, 64, SPRN_GSRR1)
-SHARED_SPRNG_WRAPP

Re: [PATCH v3 4/6] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-08-15 Thread Jordan Niethe




On 14/8/23 6:15 pm, David Laight wrote:

From: Jordan Niethe

Sent: 07 August 2023 02:46

The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used be the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.


Shouldn't it be changed to u64?


This will only be for 64-bit PPC so an unsigned long will always be 64 
bits wide, but I can use a u64 instead.




David
  


-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v3 4/6] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-08-15 Thread Jordan Niethe




On 14/8/23 6:12 pm, Nicholas Piggin wrote:

On Mon Aug 7, 2023 at 11:45 AM AEST, Jordan Niethe wrote:

The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used be the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.


I don't suppose it would be worth having an lpid_t.


I actually introduced that when I was developing the conversion, but I 
felt like it was unnecessary in the end; it is just a wider integer and 
it is simpler to treat it that way imho.





diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 4adff4f1896d..229f0a1ffdd4 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -886,10 +886,10 @@ int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, 
u8 prio,
  
  	if (single_escalation)

name = kasprintf(GFP_KERNEL, "kvm-%d-%d",
-vcpu->kvm->arch.lpid, xc->server_num);
+(unsigned int)vcpu->kvm->arch.lpid, 
xc->server_num);
else
name = kasprintf(GFP_KERNEL, "kvm-%d-%d-%d",
-vcpu->kvm->arch.lpid, xc->server_num, prio);
+(unsigned int)vcpu->kvm->arch.lpid, 
xc->server_num, prio);
if (!name) {
pr_err("Failed to allocate escalation irq name for queue %d of VCPU 
%d\n",
   prio, xc->server_num);


I would have thought you'd keep the type and change the format.


yeah, I will do that.



Otherwise seems okay too.


Thanks.



Thanks,
Nick



Re: [PATCH v3 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-08-15 Thread Jordan Niethe




On 14/8/23 6:08 pm, Nicholas Piggin wrote:

On Mon Aug 7, 2023 at 11:45 AM AEST, Jordan Niethe wrote:

There are already some getter and setter functions used for accessing
vcpu register state, e.g. kvmppc_get_pc(). There are also more
complicated examples that are generated by macros like
kvmppc_get_sprg0() which are generated by the SHARED_SPRNG_WRAPPER()
macro.

In the new PAPR "Nestedv2" API for nested guest partitions the L1 is
required to communicate with the L0 to modify and read nested guest
state.

Prepare to support this by replacing direct accesses to vcpu register
state with wrapper functions. Follow the existing pattern of using
macros to generate individual wrappers. These wrappers will
be augmented for supporting Nestedv2 guests later.

Signed-off-by: Gautam Menghani 
Signed-off-by: Jordan Niethe 
---
v3:
   - Do not add a helper for pvr
   - Use an expression when declaring variable in case
   - Squash in all getters and setters
   - Guatam: Pass vector registers by reference
---
  arch/powerpc/include/asm/kvm_book3s.h  | 123 +-
  arch/powerpc/include/asm/kvm_booke.h   |  10 ++
  arch/powerpc/kvm/book3s.c  |  38 ++---
  arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
  arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
  arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
  arch/powerpc/kvm/book3s_hv.c   | 220 +
  arch/powerpc/kvm/book3s_hv.h   |  58 +++
  arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
  arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
  arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
  arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
  arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
  arch/powerpc/kvm/book3s_xive.c |   9 +-
  arch/powerpc/kvm/emulate_loadstore.c   |   2 +-
  arch/powerpc/kvm/powerpc.c |  76 -
  16 files changed, 395 insertions(+), 189 deletions(-)



[snip]


+
  /* Expiry time of vcpu DEC relative to host TB */
  static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
  {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset_hv(vcpu);
  }


I don't see kvmppc_get_tb_offset_hv in this patch.


It should be generated by:

KVMPPC_BOOK3S_VCORE_ACCESSOR(tb_offset, 64)




diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..738f2ecbe9b9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -347,7 +347,7 @@ static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu 
*vcpu, gva_t eaddr,
unsigned long v, orig_v, gr;
__be64 *hptep;
long int index;
-   int virtmode = vcpu->arch.shregs.msr & (data ? MSR_DR : MSR_IR);
+   int virtmode = kvmppc_get_msr(vcpu) & (data ? MSR_DR : MSR_IR);
  
  	if (kvm_is_radix(vcpu->kvm))

return kvmppc_mmu_radix_xlate(vcpu, eaddr, gpte, data, iswrite);


So this isn't _only_ adding new accessors. This should be functionally a
noop, but I think it introduces a branch if PR is defined.


That being checking kvmppc_shared_big_endian()?



Shared page is a slight annoyance for HV, I'd like to get rid of it...
but that's another story. I think the pattern here would be to add a
kvmppc_get_msr_hv() accessor.


Yes, that will work.



And as a nitpick, for anywhere employing existing access functions, gprs
and such, could that be split into its own patch?


Sure will do. One other thing I could do is make the existing functions 
use the macros if they don't already. Do you think that is worth doing?




Looks pretty good aside from those little things.


Thanks.



Thanks,
Nick



[PATCH v3 6/6] docs: powerpc: Document nested KVM on POWER

2023-08-06 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it
used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
v3:
  - Fix typos
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index d33b554ca7ba..23e449994c2a 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -26,6 +26,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..8b37981dc3d9
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================
+Nested KVM on POWER
+===================
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+============
+
+Linux/KVM has had support for Nesting as an L0 or L1 since 2018
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+============
+
+The new PAPR API changes from the v1 API such that creating the L2 and
+its associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN(). Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and create any associated vCPUs with h_guest_create_vcpu(). Getting
+and setting vCPU state can also be performed using the h_guest_{g,s}et
+hcalls.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 run the vCPU using the H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+More details of the individual hcalls follow:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. The latest version of the PAPR can be referred to for
+more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions (e.g.
+POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters

[PATCH v3 5/6] KVM: PPC: Add support for nestedv2 guests

2023-08-06 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. KVM
already had an interface that allowed this on powernv platforms. This
existing interface will now be called "nestedv1". The newly added PAPR
interface will be called "nestedv2".  PHYP will support the nestedv2
interface. At this time the host side of the nestedv2 interface has not
been implemented on powernv but there is no technical reason why it
could not be added.

The nestedv1 interface is still supported.

Add support to KVM to utilize these hcalls to enable running nested
guests as a pseries guest on PHYP.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create a L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create a L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, use the nestedv2
interface.
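
A minimal sketch of that detection (illustrative only; the helper name
is made up and error/retry handling is omitted):

/* Returns true if the L0 implements the nestedv2 (PAPR) API. */
static bool nestedv2_api_available(void)
{
	unsigned long retbuf[PLPAR_HCALL_BUFSIZE] = { 0 };
	long rc;

	/* If H_GUEST_GET_CAPABILITIES succeeds, the new API exists. */
	rc = plpar_hcall(H_GUEST_GET_CAPABILITIES, retbuf, 0UL /* flags */);
	return rc == H_SUCCESS;
}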

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest wide
elements cannot be added to the input buffer, so send them with a
separate H_GUEST_SET call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
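
As an illustration of that caching scheme (a standalone model, not the
code in this patch; the names are placeholders):

/* Track which guest state elements are already cached in the L1. */
struct l2_state_cache {
	DECLARE_BITMAP(valid, 64);
	u64 value[64];
};

static u64 cached_get(struct l2_state_cache *c, unsigned int id,
		      u64 (*fetch_from_l0)(unsigned int id)) /* wraps H_GUEST_GET */
{
	if (!test_bit(id, c->valid)) {
		c->value[id] = fetch_from_l0(id);
		__set_bit(id, c->valid);
	}
	return c->value[id];
}

/*
 * After H_GUEST_VCPU_RUN, only elements present in the output buffer
 * stay valid; everything else is invalidated and re-fetched on demand.
 */
static void invalidate_after_run(struct l2_state_cache *c)
{
	bitmap_zero(c->valid, 64);
}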

Signed-off-by: Vaibhav Jain 
Signed-off-by: Gautam Menghani 
Signed-off-by: Kautuk Consul 
Signed-off-by: Amit Machhiwal 
Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Gautam: Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Kautuk: Handle emul_inst as now a double word, init correctly
  - Use new GPR(), etc macros
  - Amit: Determine PAPR nested capabilities from cpu features
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Vaibhav: Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  - Propagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Amit: Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
---
 arch/powerpc/include/asm/guest-state-buffer.h |  91 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 136 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  20 +
 arch/powerpc/include/asm/kvm_ppc.h|  96 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 188 
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 136 ++-
 arch/powerpc/kvm/book3s_hv.h  |  72 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 985 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  50 +
 14 files changed, 1757 insertions(+), 96 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nestedv2.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index aaefe1075fc4..808149f31576 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
 #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
 
+#include "asm/hvcall.h"
 #include 
 #include 
 #include 
@@ -313,6 +314,8 @@ struct kvmppc_gs_buff *kvmppc_gsb_new(size_t size, unsigned 
long guest_id,
  unsigned long vcpu_id, gfp_t flags);
 void kvmppc_gsb_free(struct kvmppc_gs_buff *gsb);
 void *kvmppc_gsb_put(struct kvmppc_gs_buff *gsb, size_t size);
+int kvmppc_gsb_s

[PATCH v3 4/6] KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long

2023-08-06 Thread Jordan Niethe
The LPID register is 32 bits long. The host keeps the lpids for each
guest in an unsigned word struct kvm_arch. Currently, LPIDs are already
limited by mmu_lpid_bits and KVM_MAX_NESTED_GUESTS_SHIFT.

The nestedv2 API returns a 64 bit "Guest ID" to be used by the L1 host
for each L2 guest. This value is used as an lpid, e.g. it is the
parameter used by H_RPT_INVALIDATE. To minimize needless special casing
it makes sense to keep this "Guest ID" in struct kvm_arch::lpid.

This means that struct kvm_arch::lpid is too small so prepare for this
and make it an unsigned long. This is not a problem for the KVM-HV and
nestedv1 cases as their lpid values are already limited to valid ranges
so in those contexts the lpid can be used as an unsigned word safely as
needed.

In the PAPR, the H_RPT_INVALIDATE pid/lpid parameter is already
specified as an unsigned long so change pseries_rpt_invalidate() to
match that.  Update the callers of pseries_rpt_invalidate() to also take
an unsigned long if they take an lpid value.

Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
---
 arch/powerpc/include/asm/kvm_book3s.h | 10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  2 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  4 ++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 +++---
 arch/powerpc/kvm/book3s_hv_nested.c   |  4 ++--
 arch/powerpc/kvm/book3s_xive.c|  4 ++--
 8 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 1a7e837ea2d5..98d4870ec4b3 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -191,14 +191,14 @@ extern int kvmppc_mmu_radix_translate_table(struct 
kvm_vcpu *vcpu, gva_t eaddr,
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
-   unsigned int pshift, unsigned int lpid);
+   unsigned int pshift, unsigned long lpid);
 extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
unsigned int shift,
const struct kvm_memory_slot *memslot,
-   unsigned int lpid);
+   unsigned long lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
bool writing, unsigned long gpa,
-   unsigned int lpid);
+   unsigned long lpid);
 extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
@@ -207,7 +207,7 @@ extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu 
*vcpu,
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
 extern void kvmppc_free_radix(struct kvm *kvm);
 extern void kvmppc_free_pgtable_radix(struct kvm *kvm, pgd_t *pgd,
- unsigned int lpid);
+ unsigned long lpid);
 extern int kvmppc_radix_init(void);
 extern void kvmppc_radix_exit(void);
 extern void kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
@@ -300,7 +300,7 @@ void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
-void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_set_ptbl_entry(unsigned long lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index d49065af08e9..9fc3ad3990f7 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -624,7 +624,7 @@ static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 
 extern int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 unsigned long gpa, unsigned int level,
-unsigned long mmu_seq, unsigned int lpid,
+unsigned long mmu_seq, unsigned long lpid,
 unsigned long *rmapp, struct rmap_nested **n_rmap);
 extern void kvmhv_insert_nest_rmap(struct kvm *kvm, unsigned long *rmapp,
   struct rmap_nested **n_rmap);
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 14ee0dece853..67dd3e749cac 100644
--- a/arch/powerpc/include/

[PATCH v3 3/6] KVM: PPC: Add helper library for Guest State Buffers

2023-08-06 Thread Jordan Niethe
The PAPR "Nestedv2" guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request the L0 perform the modification. If the
nested host needs to read L2 state, this request must likewise
go through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format the length is redundant as the
length of each identity is fixed but is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <----2 bytes---> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |      Data      |
  +----------------+-----------+----------------+

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the nestedv2 guest support.
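
For illustration, appending one element to such a buffer could look
like the following (a simplified sketch with no bounds checking, not
the kvmppc_gs* implementation):

struct gs_header {
	__be32 nelems;		/* element count */
	u8 data[];		/* elements follow */
};

/* Append one element: 2-byte ID, 2-byte length, then the data. */
static u8 *gs_put_element(struct gs_header *hdr, u8 *pos, u16 id,
			  const void *val, u16 len)
{
	put_unaligned_be16(id, pos);
	put_unaligned_be16(len, pos + 2);
	memcpy(pos + 4, val, len);
	hdr->nelems = cpu_to_be32(be32_to_cpu(hdr->nelems) + 1);
	return pos + 4 + len;	/* next free position */
}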

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSXs
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
v3:
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 904 ++
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 571 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 328 +++
 5 files changed, 1818 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 2a54fadbeaf5..339c3a5f56f1 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..aaefe1075fc4
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,904 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+/* Element without a value and any length */
+#define KVMPPC_GSID_BLANK  0x
+/* Size required for the L0's internal VCPU representation */
+#define KVMPPC_GSID_HOST_STATE_SIZE0x0001
+ /* Minimum size for the H_GUEST_RUN_VCPU output buffer */
+#define KVMPPC_GSID_RUN_OUTPUT_MIN_SIZE0x0002
+ /* "Logical" PVR value as defined in the PAPR */
+#define KVMPPC_GSID_LOGICAL_PVR0x0003
+ /* L0 relative timebase offset */
+#define KVMPPC_GSID_TB_OFFSET  0x0004
+ /* Partition Scoped Page Table Info */
+#define KVMPPC_GSID_PARTITION_TABLE  

[PATCH v3 2/6] KVM: PPC: Rename accessor generator macros

2023-08-06 Thread Jordan Niethe
More "wrapper" style accessor generating macros will be introduced for
the nestedv2 guest support. Rename the existing macros with more
descriptive names now so there is a consistent naming convention.

Signed-off-by: Jordan Niethe 
---
v3:
  - New to series
---
 arch/powerpc/include/asm/kvm_ppc.h | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index d16d80ad2ae4..b66084a81dd0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -927,19 +927,19 @@ static inline bool kvmppc_shared_big_endian(struct 
kvm_vcpu *vcpu)
 #endif
 }
 
-#define SPRNG_WRAPPER_GET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR_GET(reg, bookehv_spr)\
 static inline ulong kvmppc_get_##reg(struct kvm_vcpu *vcpu)\
 {  \
return mfspr(bookehv_spr);  \
 }  \
 
-#define SPRNG_WRAPPER_SET(reg, bookehv_spr)\
+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR_SET(reg, bookehv_spr)\
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, ulong val)  \
 {  \
mtspr(bookehv_spr, val);
\
 }  \
 
-#define SHARED_WRAPPER_GET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR_GET(reg, size) \
 static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -948,7 +948,7 @@ static inline u##size kvmppc_get_##reg(struct kvm_vcpu 
*vcpu)   \
   return le##size##_to_cpu(vcpu->arch.shared->reg);\
 }  \
 
-#define SHARED_WRAPPER_SET(reg, size)  \
+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR_SET(reg, size) \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
if (kvmppc_shared_big_endian(vcpu)) \
@@ -957,36 +957,36 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
*vcpu, u##size val)   \
   vcpu->arch.shared->reg = cpu_to_le##size(val);   \
 }  \
 
-#define SHARED_WRAPPER(reg, size)  \
-   SHARED_WRAPPER_GET(reg, size)   \
-   SHARED_WRAPPER_SET(reg, size)   \
+#define KVMPPC_VCPU_SHARED_REGS_ACESSOR(reg, size) 
\
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR_GET(reg, size)  
\
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR_SET(reg, size)  
\
 
-#define SPRNG_WRAPPER(reg, bookehv_spr)
\
-   SPRNG_WRAPPER_GET(reg, bookehv_spr) \
-   SPRNG_WRAPPER_SET(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_ACESSOR(reg, bookehv_spr)
\
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR_GET(reg, bookehv_spr) 
\
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR_SET(reg, bookehv_spr) 
\
 
 #ifdef CONFIG_KVM_BOOKE_HV
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SPRNG_WRAPPER(reg, bookehv_spr) \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_BOOKE_HV_SPRNG_ACESSOR(reg, bookehv_spr) \
 
 #else
 
-#define SHARED_SPRNG_WRAPPER(reg, size, bookehv_spr)   \
-   SHARED_WRAPPER(reg, size)   \
+#define KVMPPC_BOOKE_HV_SPRNG_OR_VCPU_SHARED_REGS_ACCESSOR(reg, size, 
bookehv_spr) \
+   KVMPPC_VCPU_SHARED_REGS_ACESSOR(reg, size)  \
 
 #endif
 
-SHARED_WRAPPER(critical, 64)
-SHARED_SPRNG_WRAPPER(sprg0, 64, SPRN_GSPRG0)
-SHARED_SPRNG_WRAPPER(sprg1, 64, SPRN_GSPRG1)
-SHARED_SPRNG_WRAPPER(sprg2, 64, SPRN_GSPRG2)
-SHARED_SPRNG_WRAPPER(sprg3, 64, SPRN_GSPRG3)
-SHARED_SPRNG_WRAPPER(srr0, 64, SPRN_GSRR0)
-SHARED_SPRNG_WRAPPER(srr1, 64, SPRN_GSRR1)
-SHARED_SPRNG_WRAPPER(dar, 64, SPRN_GDEAR)
-SHARED_SPRNG_WRAPPER(esr, 64, SPRN_GESR)
-SHARED_WRAPPER_GET(msr, 64)
+KVMPPC_VCPU_SHARED_REG

[PATCH v3 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-08-06 Thread Jordan Niethe
There are already some getter and setter functions used for accessing
vcpu register state, e.g. kvmppc_get_pc(). There are also more
complicated examples that are generated by macros like
kvmppc_get_sprg0() which are generated by the SHARED_SPRNG_WRAPPER()
macro.

In the new PAPR "Nestedv2" API for nested guest partitions the L1 is
required to communicate with the L0 to modify and read nested guest
state.

Prepare to support this by replacing direct accesses to vcpu register
state with wrapper functions. Follow the existing pattern of using
macros to generate individual wrappers. These wrappers will
be augmented for supporting Nestedv2 guests later.

Signed-off-by: Gautam Menghani 
Signed-off-by: Jordan Niethe 
---
v3:
  - Do not add a helper for pvr
  - Use an expression when declaring variable in case
  - Squash in all getters and setters
  - Gautam: Pass vector registers by reference
---
 arch/powerpc/include/asm/kvm_book3s.h  | 123 +-
 arch/powerpc/include/asm/kvm_booke.h   |  10 ++
 arch/powerpc/kvm/book3s.c  |  38 ++---
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
 arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
 arch/powerpc/kvm/book3s_hv.c   | 220 +
 arch/powerpc/kvm/book3s_hv.h   |  58 +++
 arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
 arch/powerpc/kvm/book3s_xive.c |   9 +-
 arch/powerpc/kvm/emulate_loadstore.c   |   2 +-
 arch/powerpc/kvm/powerpc.c |  76 -
 16 files changed, 395 insertions(+), 189 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..1a7e837ea2d5 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -392,6 +392,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.pid = val;
+}
+
+static inline u32 kvmppc_get_pid(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.pid;
+}
+
 static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
 static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
 {
@@ -403,10 +413,121 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j,
+ u64 val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
+#ifdef CONFIG_VSX
+static inline void kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
*v)
+{
+   *v =  vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i,
+vector128 *val)
+{
+   vcpu->arch.vr.vr[i] = *val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+#endif
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size) \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size) \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define KVMPPC_BOOK3S_VCPU_ACCESSOR(reg, size) \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_SET(reg, size)  \
+   KVMPPC_BOOK3S_VCPU_ACCESSOR_GET(reg, size)  \
+
+KVMPPC_BOOK3S_VCPU_ACCESSOR(tar, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbhr, 64)
+KVMPPC_BOOK3S_VCPU_ACCESSOR(ebbrr, 64)
+KVMPPC_BOOK3S_VC

[PATCH v3 0/6] KVM: PPC: Nested APIv2 guest support

2023-08-06 Thread Jordan Niethe
A nested-HV API for PAPR has been developed based on the KVM-specific
nested-HV API that is upstream in Linux/KVM and QEMU. The PAPR API had
to break compatibility to accommodate implementation in other
hypervisors and partitioning firmware. The existing KVM-specific API
will be known as the Nested APIv1 and the PAPR API will be known as the
Nested APIv2. 

The control flow and interrupt processing between L0, L1, and L2 in
the Nested APIv2 are conceptually unchanged. Where Nested APIv1 is almost
stateless, the Nested APIv2 is stateful, with the L1 registering L2 virtual
machines and vCPUs with the L0. Supervisor-privileged register switching
duty is now the responsibility of the L0, which holds canonical L2
register state and handles all switching. This new register handling
motivates the "getters and setters" wrappers to assist in syncing the
L2s' state between the L1 and the L0.

Broadly, the new hcalls will be used for creating and managing guests
by a regular partition in the following way:

  - L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES

  - L1 requests the L0 create a L2 with
H_GUEST_CREATE and receives a handle to use in future hcalls

  - L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU

  - L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

  - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN

  - L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

  - L1 handles the exit using H_GET_STATE if necessary

  - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

  - L1 frees the L2 in the L0 with H_GUEST_DELETE

Further details are available in Documentation/powerpc/kvm-nested.rst.

This series adds KVM support for using this hcall interface as a regular
PAPR partition, i.e. the L1. It does not add support for running as the
L0.

The new hcalls have been implemented in the spapr qemu model for
testing.

This is available at https://github.com/planetharsh/qemu/tree/upstream-0714-kop

There are scripts available to assist in setting up an environment for
testing nested guests at https://github.com/iamjpn/kvm-powervm-test

A tree with this series is available at
https://github.com/iamjpn/linux/tree/features/kvm-nestedv2-v3

Thanks to Amit Machhiwal, Kautuk Consul, Vaibhav Jain, Michael Neuling,
Shivaprasad Bhat, Harsh Prateek Bora, Paul Mackerras and Nicholas
Piggin. 

Change overview in v3:
  - KVM: PPC: Use getters and setters for vcpu register state
  - Do not add a helper for pvr
  - Use an expression when declaring variable in case
  - Squash in all getters and setters
  - Pass vector registers by reference
  - KVM: PPC: Rename accessor generator macros
  - New to series
  - KVM: PPC: Add helper library for Guest State Buffers
  - Use EXPORT_SYMBOL_GPL()
  - Use the kvmppc namespace
  - Move kvmppc_gsb_reset() out of kvmppc_gsm_fill_info()
  - Comments for GSID elements
  - Pass vector elements by reference
  - Remove generic put and get functions
  - KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
  - New to series
  - KVM: PPC: Add support for nestedv2 guests
  - Use EXPORT_SYMBOL_GPL()
  - Change to kvmhv_nestedv2 namespace
  - Make kvmhv_enable_nested() return -ENODEV on NESTEDv2 L1 hosts
  - s/kvmhv_on_papr/kvmhv_is_nestedv2/
  - mv book3s_hv_papr.c book3s_hv_nestedv2.c
  - Handle shared regs without a guest state id in the same wrapper
  - Use a static key for API version
  - Add a positive test for NESTEDv1
  - Give the amor a static value
  - s/struct kvmhv_nestedv2_host/struct kvmhv_nestedv2_io/
  - Propagate failure in kvmhv_vcpu_entry_nestedv2()
  - WARN if getters and setters fail
  - Progagate failure from kvmhv_nestedv2_parse_output()
  - Replace delay with sleep in plpar_guest_{create,delete,create_vcpu}()
  - Add logical PVR handling
  - Replace kvmppc_gse_{get,put} with specific version
  - docs: powerpc: Document nested KVM on POWER
  - Fix typos


Change overview in v2:
  - Rebase on top of kvm ppc prefix instruction support
  - Make documentation an individual patch
  - Move guest state buffer files from arch/powerpc/lib/ to
arch/powerpc/kvm/
  - Use kunit for testing guest state buffer
  - Fix some build errors
  - Change HEIR element from 4 bytes to 8 bytes

Previous revisions:

  - v1: 
https://lore.kernel.org/linuxppc-dev/20230508072332.2937883-1-...@linux.vnet.ibm.com/
  - v2: 
https://lore.kernel.org/linuxppc-dev/20230605064848.12319-1-...@linux.vnet.ibm.com/
 

Jordan Niethe (5):
  KVM: PPC: Use getters and setters for vcpu register state
  KVM: PPC: Rename accessor generator macros
  KVM: PPC: Add helper library for Guest State Buffers
  KVM: PPC: Book3s HV: Hold LPIDs in an unsigned long
  KVM: PPC: Add support for nestedv2 guests

Michael Neuling (1):
  docs: powerpc: Document nested KVM on POWER

 Documentati

Re: [PATCH] arch/powerpc: Remove unnecessary endian conversion code in XICS

2023-07-06 Thread Jordan Niethe




On 30/6/23 3:56 pm, Gautam Menghani wrote:

Remove an unnecessary piece of code that does an endianness conversion but
does not use the result. The following warning was reported by Clang's
static analyzer:

arch/powerpc/sysdev/xics/ics-opal.c:114:2: warning: Value stored to
'server' is never read [deadcode.DeadStores]
 server = be16_to_cpu(oserver);

As the result of endianness conversion is never used, delete the line
and fix the warning.

Signed-off-by: Gautam Menghani 


'server' was used as a parameter to opal_get_xive() in commit 
5c7c1e9444d8 ("powerpc/powernv: Add OPAL ICS backend") when it was 
introduced. 'server' was also used in an error message for the call to 
opal_get_xive().


'server' was always later set by a call to ics_opal_mangle_server() 
before being used.


Commit bf8e0f891a32 ("powerpc/powernv: Fix endian issues in OPAL ICS 
backend") used a new variable 'oserver' as the parameter to 
opal_get_xive() instead of 'server' for endian correctness. It also 
removed 'server' from the error message for the call to opal_get_xive().


It was commit bf8e0f891a32 that added the unnecessary conversion and 
never used the result.


Reviewed-by: Jordan Niethe 



---
  arch/powerpc/sysdev/xics/ics-opal.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xics/ics-opal.c 
b/arch/powerpc/sysdev/xics/ics-opal.c
index 6cfbb4fac7fb..5fe73dabab79 100644
--- a/arch/powerpc/sysdev/xics/ics-opal.c
+++ b/arch/powerpc/sysdev/xics/ics-opal.c
@@ -111,7 +111,6 @@ static int ics_opal_set_affinity(struct irq_data *d,
   __func__, d->irq, hw_irq, rc);
return -1;
}
-   server = be16_to_cpu(oserver);
  
  	wanted_server = xics_get_irq_server(d->irq, cpumask, 1);

if (wanted_server < 0) {



Re: [PATCH] KVM: ppc64: Enable ring-based dirty memory tracking

2023-07-05 Thread Jordan Niethe




On 8/6/23 10:34 pm, Kautuk Consul wrote:

Need at least a little context in the commit message itself:

"Enable ring-based dirty memory tracking on ppc64:"


- Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly
   ordered.
- Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the
   kvmppc_xive_native_set_attr is called in the context of an ioctl
   syscall and will call kvmppc_xive_native_eq_sync for setting the
   KVM_DEV_XIVE_EQ_SYNC attribute which will call mark_dirty_page()
   when there isn't a running vcpu. Implemented the
   kvm_arch_allow_write_without_running_vcpu to always return true
   to allow mark_page_dirty_in_slot to mark the page dirty in the
   memslot->dirty_bitmap in this case.


Should kvm_arch_allow_write_without_running_vcpu() only return true in 
the context of kvmppc_xive_native_eq_sync()?



- Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
   offset.
- Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required
   for the generic KVM code to call.
- Add a check to kvmppc_vcpu_run_hv for checking whether the dirty
   ring is soft full.
- Implement the kvm_arch_flush_remote_tlbs_memslot function to support
   the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option.

On testing with live migration it was found that there is around
150-180 ms improvement in overall migration time with this patch.

Signed-off-by: Kautuk Consul 
---
  Documentation/virt/kvm/api.rst  |  2 +-
  arch/powerpc/include/uapi/asm/kvm.h |  2 ++
  arch/powerpc/kvm/Kconfig|  2 ++
  arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 +
  arch/powerpc/kvm/book3s_hv.c|  3 +++
  include/linux/kvm_dirty_ring.h  |  5 
  6 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index add067793b90..ce1ebc513bae 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through the 
CPUID leaf.
  8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
  --
  
-:Architectures: x86, arm64

+:Architectures: x86, arm64, ppc64
  :Parameters: args[0] - size of the dirty log ring
  
  KVM is capable of tracking dirty memory using ring buffers that are

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 9f18fa090f1f..f722309ed7fb 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -33,6 +33,8 @@
  /* Not always available, but if it is, this is the correct offset.  */
  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
  
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64

+
  struct kvm_regs {
__u64 pc;
__u64 cr;
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 902611954200..c93354ec3bd5 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -26,6 +26,8 @@ config KVM
select IRQ_BYPASS_MANAGER
select HAVE_KVM_IRQ_BYPASS
select INTERVAL_TREE
+   select HAVE_KVM_DIRTY_RING_ACQ_REL
+   select NEED_KVM_DIRTY_RING_WITH_BITMAP
  
  config KVM_BOOK3S_HANDLER

bool
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..c92e8022e017 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
  
  	vcpu->arch.hflags |= BOOK3S_HFLAG_SLB;

  }
+
+/*
+ * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for selected
+ * dirty pages.
+ *
+ * It write protects selected pages to enable dirty logging for them.
+ */
+void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
+struct kvm_memory_slot *slot,
+gfn_t gfn_offset,
+unsigned long mask)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+   while (start < end) {
+   pte_t *ptep;
+   unsigned int shift;
+
+   ptep = find_kvm_secondary_pte(kvm, start, &shift);
+
+   *ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE));

On rpt I think you'd need to use kvmppc_radix_update_pte()?
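
For instance, the body of that loop might become something like the
following (a sketch only; if kvmppc_radix_update_pte() is still local
to book3s_64_mmu_radix.c it would need to be made visible here):

		if (ptep && pte_present(*ptep)) {
			/* Atomically clear _PAGE_WRITE in the radix PTE. */
			kvmppc_radix_update_pte(kvm, ptep, _PAGE_WRITE, 0,
						start, shift);
			/* The stale translation still needs to be flushed. */
			kvmppc_radix_tlbie_page(kvm, start, shift,
						kvm->arch.lpid);
		}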


+
+   start += PAGE_SIZE;
+   }
+}
+
+#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
+bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
+{
+   return true;
+}
+#endif
+
+#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
+void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
+   const struct kvm_memory_slot *memslot)
+{
+   kvm_flush_remote_tlbs(kvm);
+}
+#endif
diff --git 

Re: [PATCH] powernv/opal-prd: Silence memcpy() run-time false positive warnings

2023-07-04 Thread Jordan Niethe




On 26/6/23 5:04 pm, Mahesh Salgaonkar wrote:

opal_prd_msg_notifier extracts the opal prd message size from the message
header and uses it for allocating opal_prd_msg_queue_item that includes
the correct message size to be copied. However, while running under
CONFIG_FORTIFY_SOURCE=y, it triggers following run-time warning:

[ 6458.234352] memcpy: detected field-spanning write (size 32) of single field 
"&item->msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4)
[ 6458.234390] WARNING: CPU: 9 PID: 660 at 
arch/powerpc/platforms/powernv/opal-prd.c:355 opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[...]
[ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 
[opal_prd]
[ 6458.234736] Call Trace:
[ 6458.234742] [c002acb23c10] [c0080e0c0e68] 
opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable)
[ 6458.234759] [c002acb23ca0] [c019ccc0] 
notifier_call_chain+0xc0/0x1b0
[ 6458.234774] [c002acb23d00] [c019ceac] 
atomic_notifier_call_chain+0x2c/0x40
[ 6458.234788] [c002acb23d20] [c00d69b4] 
opal_message_notify+0xf4/0x2c0
[...]

Add a flexible array member to avoid false positive run-time warning.

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Mahesh Salgaonkar 
---
  arch/powerpc/platforms/powernv/opal-prd.c |7 +--
  1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 113bdb151f687..9e2c4775f75f5 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -30,7 +30,10 @@
   */
  struct opal_prd_msg_queue_item {
struct list_headlist;
-   struct opal_prd_msg_header  msg;
+   union {
+   struct opal_prd_msg_header  msg;
+   DECLARE_FLEX_ARRAY(__u8, msg_flex);
+   };
  };
  
  static struct device_node *prd_node;

@@ -352,7 +355,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
if (!item)
return -ENOMEM;
  
-	memcpy(&item->msg, msg->params, msg_size);

+   memcpy(&item->msg_flex, msg->params, msg_size);


This does silence the warning but it seems like we might be able to go a 
little further.


What about not adding that flex array to struct opal_prd_msg_queue_item, 
but adding one to struct opal_prd_msg? That is what the data format 
actually is.


So we'd have something like:


struct opal_prd_msg {
	struct opal_prd_msg_header hdr;
	char msg[];
};


and change things to use that instead?

But that might be more trouble than it is worth, alternatively we can 
just do:


	item->msg = *hdr;
	memcpy((char *)&item->msg + sizeof(*hdr), (char *)hdr + sizeof(*hdr),
	       msg_size - sizeof(*hdr));



  
  	spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);

	list_add_tail(&item->list, &opal_prd_msg_queue);






Re: [RFC PATCH v2 5/6] KVM: PPC: Add support for nested PAPR guests

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 7:09 PM Nicholas Piggin  wrote:
[snip]
>
> You lost your comments.

Thanks

>
> > diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> > b/arch/powerpc/include/asm/kvm_book3s.h
> > index 0ca2d8b37b42..c5c57552b447 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s.h
> > @@ -12,6 +12,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  struct kvmppc_bat {
> >   u64 raw;
> > @@ -316,6 +317,57 @@ long int kvmhv_nested_page_fault(struct kvm_vcpu 
> > *vcpu);
> >
> >  void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
> >
> > +
> > +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +
> > +extern bool __kvmhv_on_papr;
> > +
> > +static inline bool kvmhv_on_papr(void)
> > +{
> > + return __kvmhv_on_papr;
> > +}
>
> It's a nitpick, but kvmhv_on_pseries() is because we're runnning KVM-HV
> on a pseries guest kernel. Which is a papr guest kernel. So this kind of
> doesn't make sense if you read it the same way.
>
> kvmhv_nested_using_papr() or something like that might read a bit
> better.

Will we go with kvmhv_using_nested_v2()?

>
> This could be a static key too.

Will do.

>
> > @@ -575,6 +593,7 @@ struct kvm_vcpu_arch {
> >   ulong dscr;
> >   ulong amr;
> >   ulong uamor;
> > + ulong amor;
> >   ulong iamr;
> >   u32 ctrl;
> >   u32 dabrx;
>
> This belongs somewhere else.

It can be dropped.

>
> > @@ -829,6 +848,8 @@ struct kvm_vcpu_arch {
> >   u64 nested_hfscr;   /* HFSCR that the L1 requested for the nested 
> > guest */
> >   u32 nested_vcpu_id;
> >   gpa_t nested_io_gpr;
> > + /* For nested APIv2 guests*/
> > + struct kvmhv_papr_host papr_host;
> >  #endif
>
> This is not exactly a papr host. Might have to come up with a better
> name especially if we implement a L0 things could get confusing.

Any name ideas? nestedv2_state?

>
> > @@ -342,6 +343,203 @@ static inline long 
> > plpar_get_cpu_characteristics(struct h_cpu_char_result *p)
> >   return rc;
> >  }
> >
> > +static inline long plpar_guest_create(unsigned long flags, unsigned long 
> > *guest_id)
> > +{
> > + unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> > + unsigned long token;
> > + long rc;
> > +
> > + token = -1UL;
> > + while (true) {
> > + rc = plpar_hcall(H_GUEST_CREATE, retbuf, flags, token);
> > + if (rc == H_SUCCESS) {
> > + *guest_id = retbuf[0];
> > + break;
> > + }
> > +
> > + if (rc == H_BUSY) {
> > + token = retbuf[0];
> > + cpu_relax();
> > + continue;
> > + }
> > +
> > + if (H_IS_LONG_BUSY(rc)) {
> > + token = retbuf[0];
> > + mdelay(get_longbusy_msecs(rc));
>
> All of these things need a non-sleeping delay? Can we sleep instead?
> Or if not, might have to think about going back to the caller and it
> can retry.
>
> get/set state might be a bit inconvenient, although I don't expect
> that should potentially take so long as guest and vcpu create/delete,
> so at least those ones would be good if they're called while
> preemptable.

Yeah no reason not to sleep except for get/set, let me try it out.

>
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 521d84621422..f22ee582e209 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -383,6 +383,11 @@ static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu 
> > *vcpu)
> >   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
> >  }
> >
> > +static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
> > +{
> > + vcpu->arch.pvr = pvr;
> > +}
>
> Didn't you lose this in a previous patch? I thought it must have moved
> to a header but it reappears.

Yes, that was meant to stay put.

>
> > +
> >  /* Dummy value used in computing PCR value below */
> >  #define PCR_ARCH_31(PCR_ARCH_300 << 1)
> >
> > @@ -1262,13 +1267,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
> >   return RESUME_HOST;
> >   break;
> >  #endif
> > - case H_RANDOM:
> > + case H_RANDOM: {
> >   unsigned long rand;
> >
> >   if (!arch_get_random_seed_longs(&rand, 1))
> >   ret = H_HARDWARE;
> >   kvmppc_set_gpr(vcpu, 4, rand);
> >   break;
> > + }
> >   case H_RPT_INVALIDATE:
> >   ret = kvmppc_h_rpt_invalidate(vcpu, kvmppc_get_gpr(vcpu, 4),
> > kvmppc_get_gpr(vcpu, 5),
>
> Compile fix for a previous patch.

Thanks.

>
> > @@ -2921,14 +2927,21 @@ static int kvmppc_core_vcpu_create_hv(struct 
> > kvm_vcpu *vcpu)
> >   vcpu->arch.shared_big_endian = false;
> >  #endif
> >  #endif
> > - kvmppc_set_mmcr_hv(vcpu, 0, MMCR0_FC);
> >
> > + if (kvmhv_on_papr()) {
> > + 

Re: [RFC PATCH v2 4/6] KVM: PPC: Add helper library for Guest State Buffers

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 6:27 PM Nicholas Piggin  wrote:
[snip]
>
> This is a tour de force in one of these things, so I hate to be
> the "me smash with club" guy, but what if you allocated buffers
> with enough room for all the state (or 99% of cases, in which
> case an overflow would make an hcall)?
>
> What's actually a fast-path that we don't get from the interrupt
> return buffer? Getting and setting a few regs for MMIO emulation?

As it is a vcpu uses four buffers:

- One for registering it's input and output buffers
   This is allocated just large enough for GSID_RUN_OUTPUT_MIN_SIZE,
   GSID_RUN_INPUT and GSID_RUN_OUTPUT.
   Freed once the buffers are registered.
   I suppose we could just make a buffer big enough to be used for the
vcpu run input buffer then have it register its own address.

- One for process and partition table entries
   Because kvmhv_set_ptbl_entry() isn't associated with a vcpu.
   kvmhv_papr_set_ptbl_entry() allocates and frees a minimal sized
buffer on demand.

- The run vcpu input buffer
   Persists over the lifetime of the vcpu after creation. Large enough
to hold all VCPU-wide elements. The same buffer is also reused for:

 * GET state hcalls
 * SET guest wide state hcalls (guest wide can not be passed into
the vcpu run buffer)

- The run vcpu output buffer
   Persists over the lifetime of the vcpu after creation. This is
sized to be GSID_RUN_OUTPUT_MIN_SIZE as returned by the L0.
   It's unlikely that it would be larger than the run vcpu buffer
size, so I guess you could make it that size too. Probably you could
even use the run vcpu input buffer as the vcpu output buffer.

The buffers could all be that max size and could combine the
configuration buffer, input and output buffers, but I feel it's more
understandable like this.

[snip]

>
> The namespaces are a little abbreviated. KVM_PAPR_ might be nice if
> you're calling the API that.

Will we go with KVM_NESTED_V2_ ?

>
> > +
> > +#define GSID_HOST_STATE_SIZE 0x0001 /* Size of Hypervisor Internal 
> > Format VCPU state */
> > +#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002 /* Minimum size of the Run 
> > VCPU output buffer */
> > +#define GSID_LOGICAL_PVR 0x0003 /* Logical PVR */
> > +#define GSID_TB_OFFSET   0x0004 /* Timebase Offset */
> > +#define GSID_PARTITION_TABLE 0x0005 /* Partition Scoped Page Table 
> > */
> > +#define GSID_PROCESS_TABLE   0x0006 /* Process Table */
>
> > +
> > +#define GSID_RUN_INPUT   0x0C00 /* Run VCPU Input 
> > Buffer */
> > +#define GSID_RUN_OUTPUT  0x0C01 /* Run VCPU Out Buffer 
> > */
> > +#define GSID_VPA 0x0C02 /* HRA to Guest VCPU VPA */
> > +
> > +#define GSID_GPR(x)  (0x1000 + (x))
> > +#define GSID_HDEC_EXPIRY_TB  0x1020
> > +#define GSID_NIA 0x1021
> > +#define GSID_MSR 0x1022
> > +#define GSID_LR  0x1023
> > +#define GSID_XER 0x1024
> > +#define GSID_CTR 0x1025
> > +#define GSID_CFAR0x1026
> > +#define GSID_SRR00x1027
> > +#define GSID_SRR10x1028
> > +#define GSID_DAR 0x1029
>
> It's a shame you have to rip up all your wrapper functions now to
> shoehorn these in.
>
> If you included names analogous to the reg field names in the kvm
> structures, the wrappers could do macro expansions that get them.
>
> #define __GSID_WRAPPER_dar  GSID_DAR
>
> Or similar.

Before, I had something pretty hacky: in the macro accessors I had
something along the lines of

 gsid_table[offsetof(vcpu, reg)]

to get the GSID for the register.

We can do the wrapper idea, I just worry if it is getting too magic.
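
For the record, the shape of that idea (the names here are hypothetical,
not identifiers from the series):

/* Map a vcpu field name to its guest state ID at preprocessing time. */
#define __GSID_WRAPPER_dar	GSID_DAR
#define __GSID_WRAPPER_srr0	GSID_SRR0

#define KVMPPC_WRAPPED_SETTER(reg, size)				\
static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val) \
{									\
	vcpu->arch.reg = val;						\
	/* hypothetical helper: mark the element dirty for the L0 */	\
	kvmhv_mark_dirty(vcpu, __GSID_WRAPPER_##reg);			\
}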

>
> And since of course you have to explicitly enumerate all these, I
> wouldn't mind defining the types and lengths up-front rather than
> down in the type function. You'd like to be able to go through the
> spec and eyeball type, number, size.

Something like
#define KVM_NESTED_V2_GS_NIA (KVM_NESTED_V2_GSID_NIA | VCPU_WIDE |
READ_WRITE | DOUBLE_WORD)
etc
?

>
> [snip]
>
> > +/**
> > + * gsb_paddress() - the physical address of buffer
> > + * @gsb: guest state buffer
> > + *
> > + * Returns the physical address of the buffer.
> > + */
> > +static inline u64 gsb_paddress(struct gs_buff *gsb)
> > +{
> > + return __pa(gsb_header(gsb));
> > +}
>
> > +/**
> > + * __gse_put_reg() - add a register type guest state element to a buffer
> > + * @gsb: guest state buffer to add element to
> > + * @iden: guest state ID
> > + * @val: host endian value
> > + *
> > + * Adds a register type guest state element. Uses the guest state ID for
> > + * determining the length of the guest element. If the guest state ID has
> > + * bits that can not be set they will be cleared.
> > + */
> > +static inline int __gse_put_reg(struct gs_buff *gsb, u16 iden, u64 val)
> > +{
> > + 

Re: [RFC PATCH v2 2/6] KVM: PPC: Add fpr getters and setters

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 5:56 PM Nicholas Piggin  wrote:
[snip]
>
> Is there a particular reason some reg sets are broken into their own
> patches? Looking at this hunk you think the VR one got missed, but it's
> in its own patch.
>
> Not really a big deal but I wouldn't mind them all in one patch. Or at
> least the FP/VR/VSR ine one since they're quite regular and similar.

There's not really a reason.

Originally I had things even more broken apart but then thought one
patch made more sense. Part way through squashing the patches I had a
change of heart and thought I'd see if people had a preference.

I'll just finish the squashing for the next series.

Thanks,
Jordan
>
> Thanks,
> Nick
>


Re: [RFC PATCH v2 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 5:53 PM Nicholas Piggin  wrote:
[snip]
>
> The general idea is fine, some of the names could use a bit of
> improvement. What's a BOOK3S_WRAPPER for example, is it not a
> VCPU_WRAPPER, or alternatively why isn't a VCORE_WRAPPER Book3S
> as well?

Yeah the names are not great.
I didn't call it VCPU_WRAPPER because I wanted to keep separate
BOOK3S_WRAPPER for book3s registers
HV_WRAPPER for hv specific registers
I will change it to something like you suggested.

[snip]
>
> Stray hunk I think.

Yep.

>
> > @@ -957,10 +957,32 @@ static inline void kvmppc_set_##reg(struct kvm_vcpu 
> > *vcpu, u##size val) \
> >  vcpu->arch.shared->reg = cpu_to_le##size(val);   \
> >  }\
> >
> > +#define SHARED_CACHE_WRAPPER_GET(reg, size)  \
> > +static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  
> >   \
> > +{\
> > + if (kvmppc_shared_big_endian(vcpu)) \
> > +return be##size##_to_cpu(vcpu->arch.shared->reg);\
> > + else\
> > +return le##size##_to_cpu(vcpu->arch.shared->reg);\
> > +}\
> > +
> > +#define SHARED_CACHE_WRAPPER_SET(reg, size)  \
> > +static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
> >   \
> > +{\
> > + if (kvmppc_shared_big_endian(vcpu)) \
> > +vcpu->arch.shared->reg = cpu_to_be##size(val);   \
> > + else\
> > +vcpu->arch.shared->reg = cpu_to_le##size(val);   \
> > +}\
> > +
> >  #define SHARED_WRAPPER(reg, size)\
> >   SHARED_WRAPPER_GET(reg, size)   \
> >   SHARED_WRAPPER_SET(reg, size)   \
> >
> > +#define SHARED_CACHE_WRAPPER(reg, size)
> >   \
> > + SHARED_CACHE_WRAPPER_GET(reg, size) \
> > + SHARED_CACHE_WRAPPER_SET(reg, size) \
>
> SHARED_CACHE_WRAPPER that does the same thing as SHARED_WRAPPER.

That changes once the guest state buffer IDs are included in a later
patch.

>
> I know some of the names are a but crufty but it's probably a good time
> to rethink them a bit.
>
> KVMPPC_VCPU_SHARED_REG_ACCESSOR or something like that. A few
> more keystrokes could help imensely.

Yes, I will do something like that, for the BOOK3S_WRAPPER and
HV_WRAPPER
too.

>
> > diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
> > b/arch/powerpc/kvm/book3s_hv_p9_entry.c
> > index 34f1db212824..34bc0a8a1288 100644
> > --- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
> > +++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
> > @@ -305,7 +305,7 @@ static void switch_mmu_to_guest_radix(struct kvm *kvm, 
> > struct kvm_vcpu *vcpu, u6
> >   u32 pid;
> >
> >   lpid = nested ? nested->shadow_lpid : kvm->arch.lpid;
> > - pid = vcpu->arch.pid;
> > + pid = kvmppc_get_pid(vcpu);
> >
> >   /*
> >* Prior memory accesses to host PID Q3 must be completed before we
>
> Could add some accessors for get_lpid / get_guest_id which check for the
> correct KVM mode maybe.

True.

Thanks,
Jordan

>
> Thanks,
> Nick
>


Re: [RFC PATCH v2 0/6] KVM: PPC: Nested PAPR guests

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 3:54 PM Nicholas Piggin  wrote:
>
> On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote:
> > There is existing support for nested guests on powernv hosts however the
> > hcall interface this uses is not support by other PAPR hosts.
>
> I kind of liked it being called nested-HV v1 and v2 APIs as short and
> to the point, but I suppose that's ambiguous with version 2 of the v1
> API, so papr is okay. What's the old API called in this scheme, then?
> "Existing API" is not great after patches go upstream.

Yes I was trying for a more descriptive name but it is just more
confusing and I'm struggling for a better alternative.

In the next revision I'll use v1 and v2. For version 2 of the v1 API,
do we now call it v1.2 or something like that?

>
> And, you've probably explained it pretty well but slightly more of
> a background first up could be helpful. E.g.,
>
>   A nested-HV API for PAPR has been developed based on the KVM-specific
>   nested-HV API that is upstream in Linux/KVM and QEMU. The PAPR API
>   had to break compatibility to accommodate implementation in other
>   hypervisors and partitioning firmware.
>
> And key overall differences
>
>   The control flow and interrupt processing between L0, L1, and L2
>   in the new PAPR API are conceptually unchanged. Where the old API
>   is almost stateless, the PAPR API is stateful, with the L1 registering
>   L2 virtual machines and vCPUs with the L0. Supervisor-privileged
>   register switching duty is now the responsibility for the L0, which
>   holds canonical L2 register state and handles all switching. This
>   new register handling motivates the "getters and setters" wrappers
>   ...

I'll include something along those lines.

Thanks,
Jordan

>
> Thanks,
> Nick
>


Re: [PATCH RFC v2 6/6] docs: powerpc: Document nested KVM on POWER

2023-06-09 Thread Jordan Niethe
On Wed, Jun 7, 2023 at 3:38 PM Gautam Menghani  wrote:
>
> On Mon, Jun 05, 2023 at 04:48:48PM +1000, Jordan Niethe wrote:
> > From: Michael Neuling 
>
> Hi,
> There are some minor typos in the documentation pointed out below

Thank you, will correct in the next revision.

Jordan
>
>
> > +H_GUEST_GET_STATE()
> > +---
> > +
> > +This is called to get state associated with an L2 (Guest-wide or vCPU 
> > specific).
> > +This info is passed via the Guest State Buffer (GSB), a standard format as
> > +explained later in this doc, necessary details below:
> > +
> > +This can set either L2 wide or vcpu specific information. Examples of
>
> We are getting the info about vcpu here : s/set/get
>
> > +H_GUEST_RUN_VCPU()
> > +--
> > +
> > +This is called to run an L2 vCPU. The L2 and vCPU IDs are passed in as
> > +parameters. The vCPU run with the state set previously using
>
> Minor nit : s/run/runs
>
> > +H_GUEST_SET_STATE(). When the L2 exits, the L1 will resume from this
> > +hcall.
> > +
> > +This hcall also has associated input and output GSBs. Unlike
> > +H_GUEST_{S,G}ET_STATE(), these GSB pointers are not passed in as
> > +parameters to the hcall (This was done in the interest of
> > +performance). The locations of these GSBs must be preregistered using
> > +the H_GUEST_SET_STATE() call with ID 0x0c00 and 0x0c01 (see table
> > +below).
> > +
> >
> > --
> > 2.31.1
> >
>


[RFC PATCH v2 5/6] KVM: PPC: Add support for nested PAPR guests

2023-06-05 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. Add
support to KVM to utilize these hcalls to enable running nested guests.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create a L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create a L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, the new API will then
be used.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread-wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest-wide elements cannot
be added to the input buffer, so send them with a separate H_GUEST_SET
call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
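
As a rough sketch of this setter/getter pattern (the helper names and
GSID_TAR below are illustrative placeholders, not necessarily the
helpers this patch introduces):

	/* Illustrative sketch only; helper names are placeholders. */
	static inline void kvmppc_set_tar(struct kvm_vcpu *vcpu, u64 val)
	{
		vcpu->arch.tar = val;
		/* remember that TAR must go into the next VCPU_RUN input buffer */
		mark_state_dirty(vcpu, GSID_TAR);
	}

	static inline u64 kvmppc_get_tar(struct kvm_vcpu *vcpu)
	{
		/* issue H_GUEST_GET only if TAR was not already loaded or
		 * returned in the last VCPU_RUN output buffer
		 */
		reload_state_if_stale(vcpu, GSID_TAR);
		return vcpu->arch.tar;
	}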

There is existing support for running nested guests on KVM
with powernv. However, the interface used for this is not supported by
other PAPR hosts. This existing API is still supported.

Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Handle emul_inst as now a double word
  - Use new GPR(), etc macros
  - Determine PAPR nested capabilities from cpu features
---
 arch/powerpc/include/asm/guest-state-buffer.h | 105 +-
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 122 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  21 +
 arch/powerpc/include/asm/kvm_ppc.h|  64 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 198 
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 126 ++-
 arch/powerpc/kvm/book3s_hv.h  |  74 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_papr.c | 940 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  49 +
 14 files changed, 1684 insertions(+), 94 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index 65a840abf1bb..116126edd8e2 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
 #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
 
+#include "asm/hvcall.h"
 #include 
 #include 
 #include 
@@ -14,16 +15,16 @@
  **/
 #define GSID_BLANK 0x
 
-#define GSID_HOST_STATE_SIZE   0x0001 /* Size of Hypervisor Internal 
Format VCPU state */
-#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002 /* Minimum size of the Run VCPU 
output buffer */
-#define GSID_LOGICAL_PVR   0x0003 /* Logical PVR */
-#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
-#define GSID_PARTITION_TABLE   0x0005 /* Partition Scoped Page Table */
-#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
+#define GSID_HOST_STATE_SIZE   0x0001
+#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002
+#define GSID_LOGICAL_PVR   0x0003
+#define GSID_TB_OFFSET 0x0004
+#define GSID_PARTITION_TABLE   0x0005
+#define GSID_PROCESS_TABLE 0x0006
 
-#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */
-#define GSID_RUN_OUTPUT0x0C01 /* Run VCPU Out Buffer */
-#define GSID_VPA   0x0C02 /* HRA to Guest VCPU VPA */
+#define GSID_RUN_INPUT 0x0C00
+#define GSID_RUN_OUTPUT0x0C01
+#define GSID_VPA   0x0C02
 
 #define GSID_GPR(x)(0x1000 + (x))
 #define GSID_HDEC_EXPIRY_TB0x1020
@@ -300,6 +301,8 @@ struct gs_buff *gsb_new(size_t size, unsigned long guest_id,
unsigned long vcpu_id, g

[RFC PATCH v2 6/6] docs: powerpc: Document nested KVM on POWER

2023-06-05 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it
used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..5a15dc6389ab 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -25,6 +25,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..c0c2e29a59d3
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+Nested KVM on POWER
+
+
+Introduction
+
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+
+
+Linux/KVM has had support for Nesting as an L0 or L1 since 2018
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall h_enter_nested(). This
+call made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+it's partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+===
+
+The new PAPR API changes from the v1 API such that the creating L2 and
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN().  Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and any associated vCPUs created with h_guest_create_vCPU(). Getting
+and setting vCPU state can also be performed using the
+h_guest_{g|s}et() hcall.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 run the vCPU using the H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+More details of the individual hcalls follow:
+
+HCALL Details
+=
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. Refer to the latest version of the PAPR for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such as the CPU versions (e.g.
+POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters:
+Input:
+  f

[RFC PATCH v2 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-06-05 Thread Jordan Niethe
There are already some getter and setter functions used for accessing
vcpu register state, e.g. kvmppc_get_pc(). There are also more
complicated examples, such as kvmppc_get_sprg0(), which are generated
by the SHARED_SPRNG_WRAPPER() macro.

In the new PAPR API for nested guest partitions the L1 is required to
communicate with the L0 to modify and read nested guest state.

Prepare to support this by replacing direct accesses to vcpu register
state with wrapper functions. Follow the existing pattern of using
macros to generate individual wrappers. These wrappers will
be augmented for supporting PAPR nested guests later.
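
For example, after BOOK3S_WRAPPER(tar, 64) below, a direct field access
is replaced by the generated accessors (fragment for illustration only):

	/* before */
	vcpu->arch.tar = tar;
	tar = vcpu->arch.tar;

	/* after: kvmppc_set_tar()/kvmppc_get_tar() come from BOOK3S_WRAPPER(tar, 64) */
	kvmppc_set_tar(vcpu, tar);
	tar = kvmppc_get_tar(vcpu);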

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  68 +++-
 arch/powerpc/include/asm/kvm_ppc.h |  48 --
 arch/powerpc/kvm/book3s.c  |  22 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
 arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
 arch/powerpc/kvm/book3s_hv.c   | 222 +
 arch/powerpc/kvm/book3s_hv.h   |  59 +++
 arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
 arch/powerpc/kvm/book3s_xive.c |   9 +-
 arch/powerpc/kvm/powerpc.c |   4 +-
 15 files changed, 322 insertions(+), 158 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..4e91f54a3f9f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -392,6 +392,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.pid = val;
+}
+
+static inline u32 kvmppc_get_pid(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.pid;
+}
+
 static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
 static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
 {
@@ -403,10 +413,66 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+#define BOOK3S_WRAPPER_SET(reg, size)  \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define BOOK3S_WRAPPER_GET(reg, size)  \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define BOOK3S_WRAPPER(reg, size)  \
+   BOOK3S_WRAPPER_SET(reg, size)   \
+   BOOK3S_WRAPPER_GET(reg, size)   \
+
+BOOK3S_WRAPPER(tar, 64)
+BOOK3S_WRAPPER(ebbhr, 64)
+BOOK3S_WRAPPER(ebbrr, 64)
+BOOK3S_WRAPPER(bescr, 64)
+BOOK3S_WRAPPER(ic, 64)
+BOOK3S_WRAPPER(vrsave, 64)
+
+
+#define VCORE_WRAPPER_SET(reg, size)   \
+static inline void kvmppc_set_##reg ##_hv(struct kvm_vcpu *vcpu, u##size val)  
\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define VCORE_WRAPPER_GET(reg, size)   \
+static inline u##size kvmppc_get_##reg ##_hv(struct kvm_vcpu *vcpu)\
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define VCORE_WRAPPER(reg, size)   \
+   VCORE_WRAPPER_SET(reg, size)\
+   VCORE_WRAPPER_GET(reg, size)\
+
+
+VCORE_WRAPPER(vtb, 64)
+VCORE_WRAPPER(tb_offset, 64)
+VCORE_WRAPPER(lpcr, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset_hv(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index

[RFC PATCH v2 2/6] KVM: PPC: Add fpr getters and setters

2023-06-05 Thread Jordan Niethe
Add wrappers for fpr registers to prepare for supporting PAPR nested
guests.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 31 +++
 arch/powerpc/include/asm/kvm_booke.h  | 10 +
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 22 +--
 5 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4e91f54a3f9f..a632e79639f0 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -413,6 +413,37 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j, u64 
val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 2fe31b518886..6cd20ab9e94e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+   kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
break;
case KVM_REG_PPC_FPSCR:
vcpu->arch.fp.fpscr = set_reg_val(id, *val);
@@ -733,8 +733,8 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   vcpu->arch.fp.fpr[i][0] = val->vsxval[0];
-   vcpu->arch.fp.fpr[i][1] = val->vsxval[1];
+   kvmppc_set_vsx_fpr(vcpu, i, 0, val->vsxval[0]);
+   kvmppc_set_vsx_fpr(vcpu, i, 1, val->vsxval[1]);
   

[RFC PATCH v2 0/6] KVM: PPC: Nested PAPR guests

2023-06-05 Thread Jordan Niethe
There is existing support for nested guests on powernv hosts; however, the
hcall interface this uses is not supported by other PAPR hosts. A set of
new hcalls will be added to the PAPR to facilitate creating and managing
guests by a regular partition in the following way:

  - L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES

  - L1 requests the L0 create a L2 with
H_GUEST_CREATE and receives a handle to use in future hcalls

  - L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU

  - L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

  - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN

  - L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

  - L1 handles the exit using H_GET_STATE if necessary

  - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

  - L1 frees the L2 in the L0 with H_GUEST_DELETE

Further details are available in Documentation/powerpc/kvm-nested.rst.

This series adds KVM support for using this hcall interface as a regular
PAPR partition, i.e. the L1. It does not add support for running as the
L0.

The new hcalls have been implemented in the spapr qemu model for
testing.

This is available at https://github.com/mikey/qemu/tree/kvm-papr

There are scripts available to assist in setting up an environment for
testing nested guests at https://github.com/mikey/kvm-powervm-test

A tree with this series is available at
https://github.com/iamjpn/linux/tree/features/kvm-papr

Thanks to Amit Machhiwal, Kautuk Consul, Vaibhav Jain, Michael Neuling,
Shivaprasad Bhat, Harsh Prateek Bora, Paul Mackerras and Nicholas
Piggin. 

Change overview in v2:
  - Rebase on top of kvm ppc prefix instruction support
  - Make documentation an individual patch
  - Move guest state buffer files from arch/powerpc/lib/ to
arch/powerpc/kvm/
  - Use kunit for testing guest state buffer
  - Fix some build errors
  - Change HEIR element from 4 bytes to 8 bytes

Previous revisions:

  - v1: 
https://lore.kernel.org/linuxppc-dev/20230508072332.2937883-1-...@linux.vnet.ibm.com/

Jordan Niethe (5):
  KVM: PPC: Use getters and setters for vcpu register state
  KVM: PPC: Add fpr getters and setters
  KVM: PPC: Add vr getters and setters
  KVM: PPC: Add helper library for Guest State Buffers
  KVM: PPC: Add support for nested PAPR guests

Michael Neuling (1):
  docs: powerpc: Document nested KVM on POWER

 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/kvm-nested.rst  | 636 +++
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 988 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 205 +++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_booke.h  |  10 +
 arch/powerpc/include/asm/kvm_host.h   |  21 +
 arch/powerpc/include/asm/kvm_ppc.h|  80 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 198 
 arch/powerpc/kvm/Makefile |   4 +
 arch/powerpc/kvm/book3s.c |  38 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|   9 +-
 arch/powerpc/kvm/book3s_64_vio.c  |   4 +-
 arch/powerpc/kvm/book3s_hv.c  | 336 --
 arch/powerpc/kvm/book3s_hv.h  |  65 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  10 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |   4 +-
 arch/powerpc/kvm/book3s_hv_papr.c | 940 +
 arch/powerpc/kvm/book3s_hv_ras.c  |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |   4 +-
 arch/powerpc/kvm/book3s_xive.c|   9 +-
 arch/powerpc/kvm/emulate_loadstore.c  |   6 +-
 arch/powerpc/kvm/guest-state-buffer.c | 612 +++
 arch/powerpc/kvm/powerpc.c|  76 +-
 arch/powerpc/kvm/test-guest-state-buffer.c| 321 ++
 30 files changed, 4467 insertions(+), 213 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

-- 
2.31.1



[RFC PATCH v2 4/6] KVM: PPC: Add helper library for Guest State Buffers

2023-06-05 Thread Jordan Niethe
The new PAPR nested guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification.
Likewise, if the L1 needs to read L2 state, the request must go
through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format the length is redundant as the
length of each identity is fixed but is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <----2 bytes---> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |  Data          |
  +----------------+-----------+----------------+

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the PAPR nested guest support.
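
A minimal C sketch of the layout described above (illustrative only:
the struct and field names here are not the ones used by the library,
and big-endian wire format is assumed, as is usual for PAPR hcall
structures):

	#include <linux/compiler.h>
	#include <linux/types.h>

	/* Buffer header: element count, followed by that many elements. */
	struct gsb_header_example {
		__be32 n_elems;
	} __packed;

	/* One element: fixed-size ID and length, then 'len' bytes of data.
	 * Elements are variable length, so a buffer is walked by advancing
	 * sizeof(struct gsb_elem_example) + be16_to_cpu(len) each step.
	 */
	struct gsb_elem_example {
		__be16 id;	/* Guest State ID */
		__be16 len;	/* data length, fixed per ID by the PAPR */
		u8 data[];
	} __packed;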

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSXs
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 901 ++
 arch/powerpc/include/asm/kvm_book3s.h |   2 +
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 563 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 321 +++
 6 files changed, 1802 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 6aaf8dc60610..ed830a714720 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..65a840abf1bb
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,901 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+#define GSID_BLANK 0x
+
+#define GSID_HOST_STATE_SIZE   0x0001 /* Size of Hypervisor Internal 
Format VCPU state */
+#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002 /* Minimum size of the Run VCPU 
output buffer */
+#define GSID_LOGICAL_PVR   0x0003 /* Logical PVR */
+#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
+#define GSID_PARTITION_TABLE   0x0005 /* Partition Scoped Page Table */
+#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
+
+#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */
+#define GSID_RUN_OUTPUT0x0C01 /* Run VCPU Out Buffer */
+#define GSID_VPA   0x0C02 /* HRA to Guest VCPU VPA */
+
+#define GSID_GPR(x)(0x1000 + (x))
+#define G

[RFC PATCH v2 3/6] KVM: PPC: Add vr getters and setters

2023-06-05 Thread Jordan Niethe
Add wrappers for vr registers to prepare for supporting PAPR nested
guests.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 20 +++
 arch/powerpc/kvm/powerpc.c| 50 +--
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index a632e79639f0..77653c5b356b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -444,6 +444,26 @@ static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu 
*vcpu, int i, int j, u64 v
vcpu->arch.fp.fpr[i][j] = val;
 }
 
+static inline vector128 kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
val)
+{
+   vcpu->arch.vr.vr[i] = val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7f913e68342a..10436213aea2 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -934,9 +934,9 @@ static inline void kvmppc_set_vsr_dword(struct kvm_vcpu 
*vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, offset, gpr);
}
@@ -949,10 +949,10 @@ static inline void kvmppc_set_vsr_dword_dump(struct 
kvm_vcpu *vcpu,
int index = vcpu->arch.io_gpr & KVM_MMIO_REG_MASK;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[0] = gpr;
val.vsxval[1] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, 0, gpr);
kvmppc_set_vsx_fpr(vcpu, index, 1,  gpr);
@@ -970,7 +970,7 @@ static inline void kvmppc_set_vsr_word_dump(struct kvm_vcpu 
*vcpu,
val.vsx32val[1] = gpr;
val.vsx32val[2] = gpr;
val.vsx32val[3] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
val.vsx32val[0] = gpr;
val.vsx32val[1] = gpr;
@@ -991,9 +991,9 @@ static inline void kvmppc_set_vsr_word(struct kvm_vcpu 
*vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
dword_offset = offset / 2;
word_offset = offset % 2;
@@ -1058,9 +1058,9 @@ static inline void kvmppc_set_vmx_dword(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_word(struct kvm_vcpu *vcpu,
@@ -1074,9 +1074,9 @@ static inline void kvmppc_set_vmx_word(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_hword(struct kvm_vcpu *vcpu,
@@ -1090,9 +1090,9 @@ static inline void kvmppc_set_vmx_hword(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx16val[offset] = gpr16;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_byte(struct kvm_vcpu *vcpu,
@@ -1106,9 +1106,9 @@ static inline void kvmp

[RFC PATCH v1 5/5] KVM: PPC: Add support for nested PAPR guests

2023-05-08 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. Add
support to KVM to utilize these hcalls to enable running nested guests.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create a L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create a L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, the new API will then
be used.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread-wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest-wide elements cannot
be added to the input buffer, so send them with a separate H_GUEST_SET
call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.

There is existing support for running nested guests on KVM
with powernv. However, the interface used for this is not supported by
other PAPR hosts. This existing API is still supported.

Signed-off-by: Jordan Niethe 
---
 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/kvm-nested.rst  | 636 
 arch/powerpc/include/asm/guest-state-buffer.h | 150 ++-
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 122 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  21 +
 arch/powerpc/include/asm/kvm_ppc.h|  64 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 198 
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 118 ++-
 arch/powerpc/kvm/book3s_hv.h  |  75 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  34 +-
 arch/powerpc/kvm/book3s_hv_papr.c | 947 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 15 files changed, 2314 insertions(+), 93 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..5a15dc6389ab 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -25,6 +25,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..942f422d61a9
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+Nested KVM on POWER
+
+
+Introduction
+
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+
+
+Linux/KVM has had support for Nesting as an L0 or L1 since 2018
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall h_enter_nested(). This
+call made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of 

[RFC PATCH v1 4/5] powerpc: Add helper library for Guest State Buffers

2023-05-08 Thread Jordan Niethe
The new PAPR nested guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification.
Likewise, if the L1 needs to read L2 state, the request must go
through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR which assigns all relevant partition state a unique
identity. Unlike a typical TLV format the length is redundant as the
length of each identity is fixed but is included for checking
correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes--->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <----2 bytes---> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |  Data          |
  +----------------+-----------+----------------+

Guest State IDs have other attributes defined in the PAPR such as
whether they are per thread or per guest, or read-only.
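
For concreteness, a buffer holding a single element might look like the
following (a sketch only: GSID_LR is 0x1023 in the header added below,
but the 8-byte data length is an assumption here, since the per-ID
lengths are defined by the PAPR):

	/* One-element guest state buffer, big-endian on the wire. */
	static const unsigned char gsb_example[] = {
		0x00, 0x00, 0x00, 0x01,		/* element count = 1 */
		0x10, 0x23,			/* Guest State ID = GSID_LR */
		0x00, 0x08,			/* data length = 8 */
		0x00, 0x00, 0x00, 0x00,		/* data: the LR value ... */
		0xc0, 0x00, 0x05, 0x78,		/* ... (arbitrary example) */
	};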

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the PAPR nested guest support.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/guest-state-buffer.h | 1001 +
 arch/powerpc/lib/Makefile |3 +-
 arch/powerpc/lib/guest-state-buffer.c |  560 +
 arch/powerpc/lib/test-guest-state-buffer.c|  334 ++
 4 files changed, 1897 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/lib/guest-state-buffer.c
 create mode 100644 arch/powerpc/lib/test-guest-state-buffer.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..332669302a0b
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,1001 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+#define GSID_BLANK 0x
+
+#define GSID_HOST_STATE_SIZE   0x0001 /* Size of Hypervisor Internal 
Format VCPU state */
+#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002 /* Minimum size of the Run VCPU 
output buffer */
+#define GSID_LOGICAL_PVR   0x0003 /* Logical PVR */
+#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
+#define GSID_PARTITION_TABLE   0x0005 /* Partition Scoped Page Table */
+#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
+
+#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */
+#define GSID_RUN_OUTPUT0x0C01 /* Run VCPU Out Buffer */
+#define GSID_VPA   0x0C02 /* HRA to Guest VCPU VPA */
+
+#define GSID_GPR0  0x1000
+#define GSID_GPR1  0x1001
+#define GSID_GPR2  0x1002
+#define GSID_GPR3  0x1003
+#define GSID_GPR4  0x1004
+#define GSID_GPR5  0x1005
+#define GSID_GPR6  0x1006
+#define GSID_GPR7  0x1007
+#define GSID_GPR8  0x1008
+#define GSID_GPR9  0x1009
+#define GSID_GPR10 0x100A
+#define GSID_GPR11 0x100B
+#define GSID_GPR12 0x100C
+#define GSID_GPR13 0x100D
+#define GSID_GPR14 0x100E
+#define GSID_GPR15 0x100F
+#define GSID_GPR16 0x1010
+#define GSID_GPR17 0x1011
+#define GSID_GPR18 0x1012
+#define GSID_GPR19 0x1013
+#define GSID_GPR20 0x1014
+#define GSID_GPR21 0x1015
+#define GSID_GPR22 0x1016
+#define GSID_GPR23 0x1017
+#define GSID_GPR24 0x1018
+#define GSID_GPR25 0x1019
+#define GSID_GPR26 0x101A
+#define GSID_GPR27 0x101B
+#define GSID_GPR28 0x101C
+#define GSID_GPR29 0x101D
+#define GSID_GPR30 0x101E
+#define GSID_GPR31 0x101F
+#define GSID_HDEC_EXPIRY_TB 0x1020
+#define GSID_NIA   0x1021
+#define GSID_MSR   0x1022
+#define GSID_LR0x1023
+#define GSID_XER   0x1024
+#define GSID_CTR   0x1025
+#define GSID_CFAR  0x1026
+#define GSID_SRR0  0x1027
+#define GSID_SRR1  0x1028
+#define GSID_DAR   0x1029
+#define GSID_DEC_EXPIRY_TB 0x102A
+#define GSID_VTB   0x102B
+#define GSID_LPCR  0x102C
+#define GSID_HFSCR 0x102D
+#define GSID_F

[RFC PATCH v1 3/5] KVM: PPC: Add vr getters and setters

2023-05-08 Thread Jordan Niethe
Add wrappers for vr registers to prepare for supporting PAPR nested
guests.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 20 +++
 arch/powerpc/kvm/powerpc.c| 50 +--
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index a632e79639f0..77653c5b356b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -444,6 +444,26 @@ static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu 
*vcpu, int i, int j, u64 v
vcpu->arch.fp.fpr[i][j] = val;
 }
 
+static inline vector128 kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 
val)
+{
+   vcpu->arch.vr.vr[i] = val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 9468df8d9987..c1084d40e292 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -932,9 +932,9 @@ static inline void kvmppc_set_vsr_dword(struct kvm_vcpu 
*vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, offset, gpr);
}
@@ -947,10 +947,10 @@ static inline void kvmppc_set_vsr_dword_dump(struct 
kvm_vcpu *vcpu,
int index = vcpu->arch.io_gpr & KVM_MMIO_REG_MASK;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[0] = gpr;
val.vsxval[1] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, 0, gpr);
kvmppc_set_vsx_fpr(vcpu, index, 1,  gpr);
@@ -968,7 +968,7 @@ static inline void kvmppc_set_vsr_word_dump(struct kvm_vcpu 
*vcpu,
val.vsx32val[1] = gpr;
val.vsx32val[2] = gpr;
val.vsx32val[3] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
val.vsx32val[0] = gpr;
val.vsx32val[1] = gpr;
@@ -989,9 +989,9 @@ static inline void kvmppc_set_vsr_word(struct kvm_vcpu 
*vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
dword_offset = offset / 2;
word_offset = offset % 2;
@@ -1056,9 +1056,9 @@ static inline void kvmppc_set_vmx_dword(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_word(struct kvm_vcpu *vcpu,
@@ -1072,9 +1072,9 @@ static inline void kvmppc_set_vmx_word(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_hword(struct kvm_vcpu *vcpu,
@@ -1088,9 +1088,9 @@ static inline void kvmppc_set_vmx_hword(struct kvm_vcpu 
*vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx16val[offset] = gpr16;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_byte(struct kvm_vcpu *vcpu,
@@ -1104,9 +1104,9 @@ static inline void kvmp

[RFC PATCH v1 2/5] KVM: PPC: Add fpr getters and setters

2023-05-08 Thread Jordan Niethe
Add wrappers for fpr registers to prepare for supporting PAPR nested
guests.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 31 +++
 arch/powerpc/include/asm/kvm_booke.h  | 10 +
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 22 +--
 5 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4e91f54a3f9f..a632e79639f0 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -413,6 +413,37 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j, u64 
val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bcb335681387..c12066a12b30 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -614,17 +614,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -702,7 +702,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+   kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
break;
case KVM_REG_PPC_FPSCR:
vcpu->arch.fp.fpscr = set_reg_val(id, *val);
@@ -711,8 +711,8 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   vcpu->arch.fp.fpr[i][0] = val->vsxval[0];
-   vcpu->arch.fp.fpr[i][1] = val->vsxval[1];
+   kvmppc_set_vsx_fpr(vcpu, i, 0, val->vsxval[0]);
+   kvmppc_set_vsx_fpr(vcpu, i, 1, val->vsxval[1]);
   

[RFC PATCH v1 1/5] KVM: PPC: Use getters and setters for vcpu register state

2023-05-08 Thread Jordan Niethe
There are already some getter and setter functions used for accessing
vcpu register state, e.g. kvmppc_get_pc(). There are also more
complicated examples, such as kvmppc_get_sprg0(), which are generated
by the SHARED_SPRNG_WRAPPER() macro.

In the new PAPR API for nested guest partitions the L1 is required to
communicate with the L0 to modify and read nested guest state.

Prepare to support this by replacing direct accesses to vcpu register
state with wrapper functions. Follow the existing pattern of using
macros to generate individual wrappers. These wrappers will
be augmented for supporting PAPR nested guests later.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  68 +++-
 arch/powerpc/include/asm/kvm_ppc.h |  48 --
 arch/powerpc/kvm/book3s.c  |  22 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
 arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
 arch/powerpc/kvm/book3s_hv.c   | 221 +
 arch/powerpc/kvm/book3s_hv.h   |  59 +++
 arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
 arch/powerpc/kvm/book3s_xive.c |   9 +-
 arch/powerpc/kvm/powerpc.c |   4 +-
 15 files changed, 322 insertions(+), 157 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..4e91f54a3f9f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -392,6 +392,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.pid = val;
+}
+
+static inline u32 kvmppc_get_pid(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.pid;
+}
+
 static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
 static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
 {
@@ -403,10 +413,66 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu 
*vcpu)
return vcpu->arch.fault_dar;
 }
 
+#define BOOK3S_WRAPPER_SET(reg, size)  \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define BOOK3S_WRAPPER_GET(reg, size)  \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define BOOK3S_WRAPPER(reg, size)  \
+   BOOK3S_WRAPPER_SET(reg, size)   \
+   BOOK3S_WRAPPER_GET(reg, size)   \
+
+BOOK3S_WRAPPER(tar, 64)
+BOOK3S_WRAPPER(ebbhr, 64)
+BOOK3S_WRAPPER(ebbrr, 64)
+BOOK3S_WRAPPER(bescr, 64)
+BOOK3S_WRAPPER(ic, 64)
+BOOK3S_WRAPPER(vrsave, 64)
+
+
+#define VCORE_WRAPPER_SET(reg, size)   \
+static inline void kvmppc_set_##reg ##_hv(struct kvm_vcpu *vcpu, u##size val)  
\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define VCORE_WRAPPER_GET(reg, size)   \
+static inline u##size kvmppc_get_##reg ##_hv(struct kvm_vcpu *vcpu)\
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define VCORE_WRAPPER(reg, size)   \
+   VCORE_WRAPPER_SET(reg, size)\
+   VCORE_WRAPPER_GET(reg, size)\
+
+
+VCORE_WRAPPER(vtb, 64)
+VCORE_WRAPPER(tb_offset, 64)
+VCORE_WRAPPER(lpcr, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset_hv(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index

[RFC PATCH v1 0/5] KVM: PPC: Nested PAPR guests

2023-05-08 Thread Jordan Niethe
There is existing support for nested guests on powernv hosts; however, the
hcall interface this uses is not supported by other PAPR hosts. A set of
new hcalls will be added to the PAPR to facilitate creating and managing
guests by a regular partition in the following way:

  - L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES

  - L1 requests the L0 create a L2 with
H_GUEST_CREATE and receives a handle to use in future hcalls

  - L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU

  - L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

  - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN

  - L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

  - L1 handles the exit using H_GET_STATE if necessary

  - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

  - L1 frees the L2 in the L0 with H_GUEST_DELETE

Further details are available in Documentation/powerpc/kvm-nested.rst.

This series adds KVM support for using this hcall interface as a regular
PAPR partition, i.e. the L1. It does not add support for running as the
L0.

The new hcalls have been implemented in the spapr qemu model for
testing.

This is available at https://github.com/mikey/qemu/tree/kvm-papr

There are scripts available to assist in setting up an environment for
testing nested guests at https://github.com/mikey/kvm-powervm-test

Thanks to Kautuk Consul, Vaibhav Jain, Michael Neuling, Shivaprasad
Bhat, Harsh Prateek Bora, Paul Mackerras and Nicholas Piggin. 

Jordan Niethe (5):
  KVM: PPC: Use getters and setters for vcpu register state
  KVM: PPC: Add fpr getters and setters
  KVM: PPC: Add vr getters and setters
  powerpc: Add helper library for Guest State Buffers
  KVM: PPC: Add support for nested PAPR guests

 Documentation/powerpc/index.rst   |1 +
 Documentation/powerpc/kvm-nested.rst  |  636 +
 arch/powerpc/include/asm/guest-state-buffer.h | 1133 +
 arch/powerpc/include/asm/hvcall.h |   30 +
 arch/powerpc/include/asm/kvm_book3s.h |  203 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |6 +
 arch/powerpc/include/asm/kvm_booke.h  |   10 +
 arch/powerpc/include/asm/kvm_host.h   |   21 +
 arch/powerpc/include/asm/kvm_ppc.h|   80 +-
 arch/powerpc/include/asm/plpar_wrappers.h |  198 +++
 arch/powerpc/kvm/Makefile |1 +
 arch/powerpc/kvm/book3s.c |   38 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|9 +-
 arch/powerpc/kvm/book3s_64_vio.c  |4 +-
 arch/powerpc/kvm/book3s_hv.c  |  337 +++--
 arch/powerpc/kvm/book3s_hv.h  |   66 +
 arch/powerpc/kvm/book3s_hv_builtin.c  |   10 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |   34 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |4 +-
 arch/powerpc/kvm/book3s_hv_papr.c |  947 ++
 arch/powerpc/kvm/book3s_hv_ras.c  |5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |4 +-
 arch/powerpc/kvm/book3s_xive.c|9 +-
 arch/powerpc/kvm/emulate_loadstore.c  |6 +-
 arch/powerpc/kvm/powerpc.c|   76 +-
 arch/powerpc/lib/Makefile |3 +-
 arch/powerpc/lib/guest-state-buffer.c |  560 
 arch/powerpc/lib/test-guest-state-buffer.c|  334 +
 30 files changed, 4560 insertions(+), 217 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c
 create mode 100644 arch/powerpc/lib/guest-state-buffer.c
 create mode 100644 arch/powerpc/lib/test-guest-state-buffer.c

-- 
2.31.1



Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Finding the owner or a queued waiter on a lock with a preempted vcpu
> is indicative of an oversubscribed guest causing the lock to get into
> trouble. Provide some options to detect this situation and have new
> CPUs avoid queueing for a longer time (more steal iterations) to
> minimise the problems caused by vcpu preemption on the queue.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
>  arch/powerpc/lib/qspinlock.c   | 240 +++--
>  2 files changed, 232 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 35f9525381e6..4fbcc8a4230b 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -30,7 +30,7 @@ typedef struct qspinlock {
>   *
>   * 0: locked bit
>   *  1-14: lock holder cpu
> - *15: unused bit
> + *15: lock owner or queuer vcpus observed to be preempted bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)
>   */
> @@ -49,6 +49,11 @@ typedef struct qspinlock {
>  #error "qspinlock does not support such large CONFIG_NR_CPUS"
>  #endif
>  
> +#define _Q_SLEEPY_OFFSET 15
> +#define _Q_SLEEPY_BITS   1
> +#define _Q_SLEEPY_MASK   _Q_SET_MASK(SLEEPY_OWNER)
> +#define _Q_SLEEPY_VAL(1U << _Q_SLEEPY_OFFSET)
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5cfd69931e31..c18133c01450 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_spin_on_preempted_owner __read_mostly = false;
> +static bool pv_sleepy_lock __read_mostly = true;
> +static bool pv_sleepy_lock_sticky __read_mostly = false;

The sticky part could potentially be its own patch.

> +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> +static int pv_sleepy_lock_factor __read_mostly = 256;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
>  
> -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> +static __always_inline bool recently_sleepy(void)
> +{

Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.

> + if (pv_sleepy_lock_interval_ns) {
> + u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> +
> + if (seen) {
> + u64 delta = sched_clock() - seen;
> + if (delta < pv_sleepy_lock_interval_ns)
> + return true;
> + this_cpu_write(sleepy_lock_seen_clock, 0);
> + }
> + }
> +
> + return false;
> +}
> +
> +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool 
> sleepy)

It seems like paravirt is implied by sleepy.

>  {
>   if (remote) {
> - return REMOTE_STEAL_SPINS;
> + if (paravirt && sleepy)
> + return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return REMOTE_STEAL_SPINS;
>   } else {
> - return STEAL_SPINS;
> + if (paravirt && sleepy)
> + return STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return STEAL_SPINS;
>   }
>  }

I think that separate functions would still be nicer, but this could at least
get rid of the nested conditionals, e.g.

        int spins;

        if (remote)
                spins = REMOTE_STEAL_SPINS;
        else
                spins = STEAL_SPINS;

        if (sleepy)
                return spins * pv_sleepy_lock_factor;
        return spins;

>  
> -static __always_inline int get_head_spins(bool paravirt)
> +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
>  {
> - return HEAD_SPINS;
> + if (paravirt && sleepy)
> + return HEAD_SPINS * pv_sleepy_lock_factor;
> + else
> + return HEAD_SPINS;
>  }
>  
>  static inline u32 encode_tail_cpu(void)
> @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 
> old)
> +{
> + u32 prev;
> + u32 new = old | _Q_SLEEPY_VAL;
> +
> + BUG_ON(!(old 

Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.
> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++-
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>   int owner;
>   u32 yield_count;
>  
>   BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> + *preempted = false;
> +
>   if (!paravirt)
>   goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>   spin_end();
>  
> + *preempted = true;
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool *preempted)

It seems like the preempted parameter could be the return value of
yield_to_locked_owner(). Then callers that don't use the preempted value would
not need to create an unnecessary variable just to pass in.
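E.g. an untested sketch, with __yield_to_locked_owner() returning the
preempted state instead of writing it through a pointer:

        static __always_inline bool yield_to_locked_owner(struct qspinlock *lock, u32 val,
                                                          bool paravirt)
        {
                return __yield_to_locked_owner(lock, val, paravirt, false);
        }

        static __always_inline bool yield_head_to_locked_owner(struct qspinlock *lock, u32 val,
                                                                bool paravirt, bool clear_mustq)
        {
                return __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
        }

and the call sites become e.g.

                preempted = yield_to_locked_owner(lock, val, paravirt);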

>  {
> - __yield_to_locked_owner(lock, val, paravirt, false);
> + __yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> - __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> - int iters;
> + int iters = 0;
> +
> + if (!STEAL_SPINS) {
> + if (paravirt && pv_spin_on_preempted_owner) {
> + spin_begin();
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> + bool preempted;
> +
> + if (val & _Q_MUST_Q_VAL)
> + break;
> + if (!(val & _Q_LOCKED_VAL))
> + break;
> + if (!vcpu_is_preempted(get_owner_cpu(val)))
> + break;
> + yield_to_locked_owner(lock, val, paravirt, &preempted);
> + }
> + spin_end();
> + }
> + return false;
> + }
>  
>   /* Attempt to steal the lock */
>   spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
> + bool preempted;
>  
>   if (val & _Q_MUST_Q_VAL)
>   break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   continue;
>   }
>  
> - yield_to_locked_owner(lock, val, paravirt);
> -
> - iters++;
> + yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> + if (paravirt && preempted) {
> + if (!pv_spin_on_preempted_owner)
> + iters++;
> + /*
> +  

Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow for a reduction in the number of times a CPU from a different
> node than the owner can attempt to steal the lock before queueing.
> This could bias the transfer behaviour of the lock across the
> machine and reduce NUMA crossings.
> ---
>  arch/powerpc/lib/qspinlock.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index d4594c701f7d..24f68bd71e2b 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -24,6 +25,7 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
>  #if _Q_SPIN_TRY_LOCK_STEAL == 1
>  static const bool MAYBE_STEALERS = true;
>  #else
> @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(bool paravirt)
> +static __always_inline int get_steal_spins(bool paravirt, bool remote)
>  {
> - return STEAL_SPINS;
> + if (remote) {
> + return REMOTE_STEAL_SPINS;
> + } else {
> + return STEAL_SPINS;
> + }
>  }
>  
>  static __always_inline int get_head_spins(bool paravirt)
> @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>  
>   iters++;
>  
> - if (iters >= get_steal_spins(paravirt))
> + if (iters >= get_steal_spins(paravirt, false))
>   break;
> + if (iters >= get_steal_spins(paravirt, true)) {

There's no indication of what true and false mean here, which makes it hard to
read. To me it feels like two separate functions would be clearer.


> + int cpu = get_owner_cpu(val);
> + if (numa_node_id() != cpu_to_node(cpu))

What about using node_distance() instead?
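E.g. an untested sketch; using LOCAL_DISTANCE as the cut-off is just one
possible choice:

                if (iters >= get_steal_spins(paravirt, true)) {
                        int cpu = get_owner_cpu(val);

                        if (node_distance(numa_node_id(), cpu_to_node(cpu)) > LOCAL_DISTANCE)
                                break;
                }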


> + break;
> + }
>   }
>   spin_end();
>  
> @@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, 
> "%llu\n");
>  
> +static int remote_steal_spins_set(void *data, u64 val)
> +{
> + REMOTE_STEAL_SPINS = val;

REMOTE_STEAL_SPINS is int not u64.
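An untested sketch of one way to handle that in the setter:

        static int remote_steal_spins_set(void *data, u64 val)
        {
                if (val > INT_MAX)
                        return -EINVAL;

                REMOTE_STEAL_SPINS = val;

                return 0;
        }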

> +
> + return 0;
> +}
> +
> +static int remote_steal_spins_get(void *data, u64 *val)
> +{
> + *val = REMOTE_STEAL_SPINS;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, 
> remote_steal_spins_set, "%llu\n");
> +
>  static int head_spins_set(void *data, u64 val)
>  {
>   HEAD_SPINS = val;
> @@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, 
> pv_prod_head_get, pv_prod_head_set, "
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> _steal_spins);
> + debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, 
> NULL, _remote_steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> _head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_owner);



Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> to prevent threads issuing a lot of expensive priority nops which may not
> have much effect due to immediately executing low then medium priority.

Just a general comment regarding the spin_{begin,end} API: once the usage gets
more complicated than something like

        spin_begin()
        for (;;)
                spin_cpu_relax()
        spin_end()

it becomes difficult to keep track of. Unfortunately, I don't have any good
suggestions for how to improve it. Hopefully with P10's wait instruction we can
move away from this.

It might be useful to document each function's pre- and post-conditions
regarding the expectations about spin_begin() and spin_end().
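Something like this, for example (rough sketch only):

        /*
         * Precondition: called between spin_begin() and spin_end(), i.e. with
         * SMT priority already lowered.
         * Postcondition: returns with SMT priority still lowered; any path
         * that yields must spin_end() before yield_to_preempted() and
         * spin_begin() again before returning.
         */
        static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node,
                                                  int prev_cpu, bool paravirt)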

> ---
>  arch/powerpc/lib/qspinlock.c | 35 +++
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 277aef1fab0a..d4594c701f7d 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -233,6 +233,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -248,11 +250,13 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   yield_to_preempted(owner, yield_count);
>   if (clear_mustq)
>   lock_set_mustq(lock);
> + spin_begin();
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> + spin_begin();
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> @@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto yield_prev; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb();
>  
>   if (yield_cpu == node->yield_cpu) {
>   if (node->next && node->next->yield_cpu != yield_cpu)
>   node->next->yield_cpu = yield_cpu;
>   yield_to_preempted(yield_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  yield_prev:
>   if (!pv_yield_prev)
> @@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb(); /* See yield_to_locked_owner comment */
>  
>   if (!node->locked) {
>   yield_to_preempted(prev_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  
> @@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   int iters;
>  
>   /* Attempt to steal the lock */
> + spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> @@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   break;
>  
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + spin_end();
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> + spin_begin();
>   continue;
>   }
>  
> @@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   if (iters >= get_steal_spins(paravirt))
>   break;
>   }
> + spin_end();
>  
>   return false;
>  }
> @@ -418,8 +434,10 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
> + spin_begin();
>   while (!node->locked)
>   yield_to_prev(lock, node, prev_cpu, paravirt);
> + spin_end();
>  
>   /* Clear out stale propagated yield_cpu */
>   if (paravirt && pv_yield_propagate_owner && node->yield_cpu != 
> -1)
> @@ -432,10 +450,12 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   int set_yield_cpu = -1;
>  
>   /* We're at the head of the waitqueue, wait for the lock. */
> + spin_begin();
> 

Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This gives trylock slightly more strength, and it also gives most
> of the benefit of passing 'val' back through the slowpath without
> the complexity.
> ---
>  arch/powerpc/include/asm/qspinlock.h | 39 +++-
>  arch/powerpc/lib/qspinlock.c |  9 +++
>  2 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 44601b261e08..d3d2039237b2 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -5,6 +5,8 @@
>  #include 
>  #include 
>  
> +#define _Q_SPIN_TRY_LOCK_STEAL 1

Would this be a config option?

> +
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
>   return READ_ONCE(lock->val);
> @@ -26,11 +28,12 @@ static __always_inline u32 
> queued_spin_get_locked_val(void)
>   return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
>  }
>  
> -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock 
> *lock)
>  {
>   u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
> + /* Trylock succeeds only when unlocked and no queued nodes */
>   asm volatile(
>  "1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_nosteal

>  "cmpwi   0,%0,0  \n"
> @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct 
> qspinlock *lock)
>   return 0;
>  }
>  
> +static __always_inline int __queued_spin_trylock_steal(struct qspinlock 
> *lock)
> +{
> + u32 new = queued_spin_get_locked_val();
> + u32 prev, tmp;
> +
> + /* Trylock may get ahead of queued nodes if it finds unlocked */
> + asm volatile(
> +"1:  lwarx   %0,0,%2,%5  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_steal

> +"andc.   %1,%0,%4\n"
> +"bne-2f  \n"
> +"and %1,%0,%4\n"
> +"or  %1,%1,%3\n"
> +"stwcx.  %1,0,%2 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"

Just because there's a little bit more going on here...

_Q_TAIL_CPU_MASK  = 0xFFFE0000
~_Q_TAIL_CPU_MASK = 0x0001FFFF


1:      lwarx   prev, 0, &lock->val, IS_ENABLED_PPC64
        andc.   tmp, prev, _Q_TAIL_CPU_MASK     (tmp = prev & ~_Q_TAIL_CPU_MASK)
        bne-    2f                              (exit if locked)
        and     tmp, prev, _Q_TAIL_CPU_MASK     (tmp = prev & _Q_TAIL_CPU_MASK)
        or      tmp, tmp, new                   (tmp |= new)

        stwcx.  tmp, 0, &lock->val

        bne-    1b
        PPC_ACQUIRE_BARRIER
2:

... which seems correct.


> + : "=" (prev), "=" (tmp)
> + : "r" (>val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");
> +
> + if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
> + return 1;
> + return 0;
> +}
> +
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +{
> + if (!_Q_SPIN_TRY_LOCK_STEAL)
> + return __queued_spin_trylock_nosteal(lock);
> + else
> + return __queued_spin_trylock_steal(lock);
> +}
> +
>  void queued_spin_lock_slowpath(struct qspinlock *lock);
>  
>  static __always_inline void queued_spin_lock(struct qspinlock *lock)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 3b10e31bcf0a..277aef1fab0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -24,7 +24,11 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +static const bool MAYBE_STEALERS = true;
> +#else
>  static bool MAYBE_STEALERS __read_mostly = true;
> +#endif
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> @@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
>  #include 
>  static int steal_spins_set(void *data, u64 val)
>  {
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> + /* MAYBE_STEAL remains true */
> + STEAL_SPINS = val;
> +#else
>   static DEFINE_MUTEX(lock);
>  
>   mutex_lock();
> @@ -539,6 +547,7 @@ static int 

Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> After the head of the queue acquires the lock, it releases the
> next waiter in the queue to become the new head. Add an option
> to prod the new head if its vCPU was preempted. This may only
> have an effect if queue waiters are yielding.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 29 -
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 28c85a2d5635..3b10e31bcf0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int cpu;
>   int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
> @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
> +static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -392,6 +394,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   node = >nodes[idx];
>   node->next = NULL;
>   node->lock = lock;
> + node->cpu = smp_processor_id();

I suppose this could be used in some other places too.

For example, change the call to:

        yield_to_prev(lock, node, prev, paravirt);

In yield_to_prev() it could then access prev->cpu.
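An untested sketch (based on the yield_to_prev() introduced earlier in the
series, leaving out the yield_cpu propagation for brevity):

        static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node,
                                                  struct qnode *prev, bool paravirt)
        {
                u32 yield_count;

                if (!paravirt)
                        goto relax;

                if (!pv_yield_prev)
                        goto relax;

                yield_count = yield_count_of(prev->cpu);
                if ((yield_count & 1) == 0)
                        goto relax; /* owner vcpu is running */

                smp_rmb(); /* See yield_to_locked_owner comment */

                if (!node->locked) {
                        yield_to_preempted(prev->cpu, yield_count);
                        return;
                }

        relax:
                cpu_relax();
        }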

>   node->yield_cpu = -1;
>   node->locked = 0;
>  
> @@ -483,7 +486,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>* this store to locked. The corresponding barrier is the smp_rmb()
>* acquire barrier for mcs lock, above.
>*/
> - WRITE_ONCE(next->locked, 1);
> + if (paravirt && pv_prod_head) {
> + int next_cpu = next->cpu;
> + WRITE_ONCE(next->locked, 1);
> + if (vcpu_is_preempted(next_cpu))
> + prod_cpu(next_cpu);
> + } else {
> + WRITE_ONCE(next->locked, 1);
> + }
>  
>  release:
>   qnodesp->count--; /* release the node */
> @@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 
> *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, 
> pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
>  
> +static int pv_prod_head_set(void *data, u64 val)
> +{
> + pv_prod_head = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_prod_head_get(void *data, u64 *val)
> +{
> + *val = pv_prod_head;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, 
> pv_prod_head_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> _steal_spins);
> @@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
>   debugfs_create_file("qspl_pv_yield_allow_steal", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_allow_steal);
>   debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_prev);
>   debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_propagate_owner);
> + debugfs_create_file("qspl_pv_prod_head", 0600, 
> arch_debugfs_dir, NULL, _pv_prod_head);
>   }
>  
>   return 0;



Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to to yield
> to a preempted owner.
> 
> s390 addreses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;

This also seems to be enabled by default.

>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void 
> yield_head_to_locked_owner(struct qspinlock *lock, u
>   __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> +{
> + struct qnode *next;
> + int owner;
> +
> + if (!paravirt)
> + return;
> + if (!pv_yield_propagate_owner)
> + return;
> +
> + owner = get_owner_cpu(val);
> + if (*set_yield_cpu == owner)
> + return;
> +
> + next = READ_ONCE(node->next);
> + if (!next)
> + return;
> +
> + if (vcpu_is_preempted(owner)) {

Is there a difference between using vcpu_is_preempted() here and checking bit 0
of the yield count as in the other places?


> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + } else if (*set_yield_cpu != -1) {

It might be worth giving the -1 CPU a #define.
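E.g. (the name is only a placeholder):

        #define YIELD_CPU_NONE  (-1)

so node->yield_cpu = YIELD_CPU_NONE; at initialisation, and the comparisons
against -1 here and in yield_to_prev() become comparisons against
YIELD_CPU_NONE.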

> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + }
> +}

Does this need to pass set_yield_cpu by reference? Couldn't its new value be
returned? To me that makes it clearer that the function is used to change
set_yield_cpu. I think this would work:

int set_yield_cpu = -1;

static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val,
                                               int set_yield_cpu, bool paravirt)
{
        struct qnode *next;
        int owner;

        if (!paravirt)
                goto out;
        if (!pv_yield_propagate_owner)
                goto out;

        owner = get_owner_cpu(val);
        if (set_yield_cpu == owner)
                goto out;

        next = READ_ONCE(node->next);
        if (!next)
                goto out;

        if (vcpu_is_preempted(owner)) {
                next->yield_cpu = owner;
                return owner;
        } else if (set_yield_cpu != -1) {
                next->yield_cpu = owner;
                return owner;
        }

out:
        return set_yield_cpu;
}

set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);



> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> + int yield_cpu;
>  
>   if (!paravirt)
>   goto relax;
>  
> + if (!pv_yield_propagate_owner)
> + goto yield_prev;
> +
> + yield_cpu = READ_ONCE(node->yield_cpu);
> + if (yield_cpu == -1) {
> + /* Propagate back the -1 CPU */
> + if (node->next && node->next->yield_cpu != -1)
> + node->next->yield_cpu = yield_cpu;
> + goto yield_prev;
> + }
> +
> + yield_count = yield_count_of(yield_cpu);
> + if ((yield_count & 1) == 0)
> + goto yield_prev; /* owner vcpu is running */
> +
> + smp_rmb();
> +
> + if (yield_cpu == node->yield_cpu) {
> + if (node->next && node->next->yield_cpu != yield_cpu)
> + node->next->yield_cpu = yield_cpu;
> + yield_to_preempted(yield_cpu, yield_count);
> + return;
> + }
> +
> +yield_prev:
>   if (!pv_yield_prev)
>   goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct 

Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> If the head of queue is preventing stealing but it finds the owner vCPU
> is preempted, it will yield its cycles to the owner which could cause it
> to become preempted. Add an option to re-allow stealers before yielding,
> and disallow them again after returning from the yield.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 56 ++--
>  1 file changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index b39f8c5b329c..94f007f66942 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_allow_steal __read_mostly = false;

To me this one does read as a boolean, but if you go with those other changes
I'd make it pv_yield_steal_enable to be consistent.

>  static bool pv_yield_prev __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_clear_mustq  \n"
> +"andc%0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=" (prev)
> + : "r" (>val), "r" (new)
> + : "cr0", "memory");
> +

This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong 
caveat.


> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)

The /* See yield_to_locked_owner comment */ comment needs to be updated now.


>  {
>   int owner;
>   u32 yield_count;
> @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   smp_rmb();
>  
>   if (READ_ONCE(lock->val) == val) {
> + if (clear_mustq)
> + lock_clear_mustq(lock);
>   yield_to_preempted(owner, yield_count);
> + if (clear_mustq)
> + lock_set_mustq(lock);
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +{
> + __yield_to_locked_owner(lock, val, paravirt, false);
> +}
> +
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +{

The check for pv_yield_allow_steal seems like it could go here instead of
being done by the caller. __yield_to_locked_owner() checks for pv_yield_owner
itself, so that seems more consistent.
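I.e. something like this untested sketch, with the caller then only passing
whether it has set mustq:

        static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val,
                                                                bool paravirt, bool clear_mustq)
        {
                /* Only re-allow stealing if the option is enabled */
                if (!pv_yield_allow_steal)
                        clear_mustq = false;

                __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
        }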



> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> @@ -332,7 +364,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt, false);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -350,7 +382,8 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt,
> + pv_yield_allow_steal && 

Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Queued waiters which are not at the head of the queue don't spin on
> the lock word but their qnode lock word, waiting for the previous queued
> CPU to release them. Add an option which allows these waiters to yield
> to the previous CPU if its vCPU is preempted.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 46 +++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 55286ac91da5..b39f8c5b329c 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_prev __read_mostly = true;

Similar suggestion: maybe pv_yield_prev_enabled would read better.

Isn't this enabled by default contrary to the commit message?


>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)

yield_to_locked_owner() takes the raw val and works out the cpu to yield to.
I think for consistency yield_to_prev() could take the raw val and work it out
too.

> +{
> + u32 yield_count;
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_prev)
> + goto relax;
> +
> + yield_count = yield_count_of(prev_cpu);
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */
> +
> + smp_rmb(); /* See yield_to_locked_owner comment */
> +
> + if (!node->locked) {
> + yield_to_preempted(prev_cpu, yield_count);
> + return;
> + }
> +
> +relax:
> + cpu_relax();
> +}
> +
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> @@ -291,13 +317,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>*/
>   if (old & _Q_TAIL_CPU_MASK) {
>   struct qnode *prev = get_tail_qnode(lock, old);
> + int prev_cpu = get_tail_cpu(old);

This could then be removed.

>  
>   /* Link @node into the waitqueue. */
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
>   while (!node->locked)
> - cpu_relax();
> + yield_to_prev(lock, node, prev_cpu, paravirt);

And would have this as:
yield_to_prev(lock, node, old, paravirt);


>  
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
> @@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, 
> pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_prev_set(void *data, u64 val)
> +{
> + pv_yield_prev = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_yield_prev_get(void *data, u64 *val)
> +{
> + *val = pv_yield_prev;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, 
> pv_yield_prev_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> _steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> _head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_owner);
> + debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_prev);
>   }
>  
>   return 0;



Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner

2022-11-09 Thread Jordan Niethe
 On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
 [resend as utf-8, not utf-7]
> Waiters spinning on the lock word should yield to the lock owner if the
> vCPU is preempted. This improves performance when the hypervisor has
> oversubscribed physical CPUs.
> ---
>  arch/powerpc/lib/qspinlock.c | 97 ++--
>  1 file changed, 83 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index aa26cfe21f18..55286ac91da5 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define MAX_NODES4
>  
> @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
> +static bool pv_yield_owner __read_mostly = true;

Not macro case for these globals? Also, to me the name does not make it super
clear this is a boolean. What about pv_yield_owner_enabled?

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(void)
> +static __always_inline int get_steal_spins(bool paravirt)
>  {
>   return STEAL_SPINS;
>  }
>  
> -static __always_inline int get_head_spins(void)
> +static __always_inline int get_head_spins(bool paravirt)
>  {
>   return HEAD_SPINS;
>  }
> @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
> -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */
> +static inline int get_owner_cpu(u32 val)
> +{
> + return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> +}
> +
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static inline bool try_to_steal_lock(struct qspinlock *lock)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)

This name doesn't seem correct for the non-paravirt case.

> +{
> + int owner;
> + u32 yield_count;
> +
> + BUG_ON(!(val & _Q_LOCKED_VAL));
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_owner)
> + goto relax;
> +
> + owner = get_owner_cpu(val);
> + yield_count = yield_count_of(owner);
> +
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */

I wonder why not use vcpu_is_preempted()?
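I.e. something like this untested sketch (the yield count is still needed for
yield_to_preempted()):

                owner = get_owner_cpu(val);
                yield_count = yield_count_of(owner);

                if (!vcpu_is_preempted(owner))
                        goto relax; /* owner vcpu is running */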

> +
> + /*
> +  * Read the lock word after sampling the yield count. On the other side
> +  * there may a wmb because the yield count update is done by the
> +  * hypervisor preemption and the value update by the OS, however this
> +  * ordering might reduce the chance of out of order accesses and
> +  * improve the heuristic.
> +  */
> + smp_rmb();
> +
> + if (READ_ONCE(lock->val) == val) {
> + yield_to_preempted(owner, yield_count);
> + /* Don't relax if we yielded. Maybe we should? */
> + return;
> + }
> +relax:
> + cpu_relax();
> +}
> +
> +
> +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
>   int iters;
>  
> @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   continue;
>   }
>  
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   iters++;
>  
> - if (iters >= get_steal_spins())
> + if (iters >= get_steal_spins(paravirt))
>   break;
>   }
>  
>   return false;
>  }
>  
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock 
> *lock, bool paravirt)
>  {
>   struct qnodes *qnodesp;
>   struct qnode *next, *node;
> @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>  

Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Store the owner CPU number in the lock word so it may be yielded to,
> as powerpc's paravirtualised simple spinlocks do.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  8 +++-
>  arch/powerpc/include/asm/qspinlock_types.h | 10 ++
>  arch/powerpc/lib/qspinlock.c   |  6 +++---
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 3ab354159e5e..44601b261e08 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct 
> qspinlock *lock)
>   return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
> +static __always_inline u32 queued_spin_get_locked_val(void)

Maybe this function should have "encode" in the name to match with
encode_tail_cpu().


> +{
> + /* XXX: make this use lock value in paca like simple spinlocks? */

Is that the paca's lock_token which is 0x8000?


> + return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> +}
> +
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 8b20f5e22bba..35f9525381e6 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,6 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> + *  1-14: lock holder cpu
> + *15: unused bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)

So there is one more bit to store the tail cpu vs the lock holder cpu?

>   */
> @@ -39,6 +41,14 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
>  
> +#define _Q_OWNER_CPU_OFFSET  1
> +#define _Q_OWNER_CPU_BITS14
> +#define _Q_OWNER_CPU_MASK_Q_SET_MASK(OWNER_CPU)
> +
> +#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index a906cc8f15fa..aa26cfe21f18 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> @@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct 
> qspinlock *lock)
>  /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>  static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> u32 old)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   BUG_ON(old & _Q_LOCKED_VAL);
> @@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct 
> qspinlock *lock, u32 old, u3
>  /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
>  static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
>  {
> - u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> + u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
>  
>   if (__trylock_cmpxchg(lock, val, newval) == val)
>   return 1;



Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Give the queue head the ability to stop stealers. After a number of
> spins without sucessfully acquiring the lock, the queue head employs
> this, which will assure it is the next owner.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
>  arch/powerpc/lib/qspinlock.c   | 56 +-
>  2 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 210adf05b235..8b20f5e22bba 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,7 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> - * 16-31: tail cpu (+1)
> + *16: must queue bit
> + * 17-31: tail cpu (+1)
>   */
>  #define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> << _Q_ ## type ## _OFFSET)
> @@ -38,7 +39,12 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
>  
> -#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_MUST_Q_OFFSET 16
> +#define _Q_MUST_Q_BITS   1
> +#define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> +#define _Q_MUST_Q_VAL(1U << _Q_MUST_Q_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   17
>  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
>  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)

Not a big deal but some of these values could be calculated like in the
generic version. e.g.

#define _Q_PENDING_OFFSET   (_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)
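An untested sketch of how that could look here:

        #define _Q_MUST_Q_OFFSET        16
        #define _Q_MUST_Q_BITS          1

        #define _Q_TAIL_CPU_OFFSET      (_Q_MUST_Q_OFFSET + _Q_MUST_Q_BITS)
        #define _Q_TAIL_CPU_BITS        (32 - _Q_TAIL_CPU_OFFSET)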

>  
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 1625cce714b2..a906cc8f15fa 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -22,6 +22,7 @@ struct qnodes {
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
> +static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
>   return STEAL_SPINS;
>  }
>  
> +static __always_inline int get_head_spins(void)
> +{
> + return HEAD_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct 
> qspinlock *lock, u32 tail)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_set_mustq\n"

Is the EH bit not set because we don't hold the lock here?

> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=" (prev)
> + : "r" (>val), "r" (new)
> + : "cr0", "memory");

This is another usage close to the DEFINE_TESTOP() pattern.

> +
> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> + if (val & _Q_MUST_Q_VAL)
> + break;
> +
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   /* We must be the owner, just set the lock bit and acquire */
>   lock_set_locked(lock);
>   } else {
> + int iters = 0;
> + bool set_mustq = false;
> +
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>   cpu_relax();
>  
> + iters++;

It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?
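I.e. something like this untested sketch:

                /* We're at the head of the waitqueue, wait for the lock. */
                while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
                        cpu_relax();

                        iters++;
                        if (!(val & _Q_MUST_Q_VAL) && iters >= get_head_spins()) {
                                lock_set_mustq(lock);
                                val |= _Q_MUST_Q_VAL;
                        }
                }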

> + if (!set_mustq && iters >= get_head_spins()) {
> + set_mustq = true;
> + lock_set_mustq(lock);
> + val |= _Q_MUST_Q_VAL;
> + }
> + }
> +
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) 

Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow new waiters a number of spins on the lock word before queueing,
> which particularly helps paravirt performance when physical CPUs are
> oversubscribed.
> ---
>  arch/powerpc/lib/qspinlock.c | 152 ---
>  1 file changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 7c71e5e287df..1625cce714b2 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -19,8 +19,17 @@ struct qnodes {
>   struct qnode nodes[MAX_NODES];
>  };
>  
> +/* Tuning parameters */
> +static int STEAL_SPINS __read_mostly = (1<<5);
> +static bool MAYBE_STEALERS __read_mostly = true;

I can understand why, but macro case variables can be a bit confusing.

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> +static __always_inline int get_steal_spins(void)
> +{
> + return STEAL_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct 
> qspinlock *lock, u32 ol
>   return 0;
>  }
>  
> +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 
> old, u32 new)
> +{
> + u32 prev;
> +
> + BUG_ON(old & _Q_LOCKED_VAL);
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%4  # queued_spin_trylock_cmpxchg   \n"

s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

btw what format are you using for the '\n's in the inline asm?

> +"cmpw0,%0,%2 \n"
> +"bne-2f  \n"
> +"stwcx.  %3,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=" (prev)
> + : "r" (>val), "r"(old), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");

This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
some form of "test and set" primitive helper.

> +
> + return prev;
> +}
> +
> +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
> +{
> + u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +
> + if (__trylock_cmpxchg(lock, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val

> +}
> +
>  /*
>   * Publish our tail, replacing previous tail. Return previous value.
>   *
> @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> +static inline bool try_to_steal_lock(struct qspinlock *lock)
> +{
> + int iters;
> +
> + /* Attempt to steal the lock */
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> +
> + if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + if (trylock_with_tail_cpu(lock, val))
> + return true;
> + continue;
> + }

The continue would bypass iters++/cpu_relax, but the next time around the
  if (unlikely(!(val & _Q_LOCKED_VAL))) {
check should fail, so everything should be fine?

> +
> + cpu_relax();
> +
> + iters++;
> +
> + if (iters >= get_steal_spins())
> + break;
> + }
> +
> + return false;
> +}
> +
>  static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>   struct qnodes *qnodesp;
> @@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
>  
> - /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + if (!MAYBE_STEALERS) {
> + /* We're at the head of the waitqueue, wait for the lock. */
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + cpu_relax();
>  
> - /* If we're the last queued, must clean up the tail. */
> - if ((val & _Q_TAIL_CPU_MASK) == tail) {
> - if (trylock_clear_tail_cpu(lock, val))
> - goto release;
> - /* Another waiter must have enqueued */
> - }
> + /* If we're the last queued, must clean up the tail. */
> + if ((val & _Q_TAIL_CPU_MASK) == tail) {
> + if (trylock_clear_tail_cpu(lock, val))
> + goto release;
> +

Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This uses more optimal ll/sc style access patterns (rather than
> cmpxchg), and also sets the EH=1 lock hint on those operations
> which acquire ownership of the lock.
> ---
>  arch/powerpc/include/asm/qspinlock.h   | 25 +--
>  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>  arch/powerpc/lib/qspinlock.c   | 81 +++---
>  3 files changed, 79 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 79a1936fb68d..3ab354159e5e 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -2,28 +2,43 @@
>  #ifndef _ASM_POWERPC_QSPINLOCK_H
>  #define _ASM_POWERPC_QSPINLOCK_H
>  
> -#include 
>  #include 
>  #include 
>  
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> - return atomic_read(>val);
> + return READ_ONCE(lock->val);
>  }
>  
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> - return !atomic_read();
> + return !lock.val;
>  }
>  
>  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  {
> - return !!(atomic_read(>val) & _Q_TAIL_CPU_MASK);
> + return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (atomic_cmpxchg_acquire(>val, 0, _Q_LOCKED_VAL) == 0)
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"
> +"cmpwi   0,%0,0  \n"
> +"bne-2f  \n"
> +"stwcx.  %2,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=" (prev)
> + : "r" (>val), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)

btw IS_ENABLED() already returns 1 or 0

> + : "cr0", "memory");

This is the ISA's "test and set" atomic primitive. Do you think it would be 
worth seperating it as a helper?

> +
> + if (likely(prev == 0))
>   return 1;
>   return 0;

same optional style nit: return likely(prev == 0);

>  }
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 3425dab42576..210adf05b235 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -7,7 +7,7 @@
>  
>  typedef struct qspinlock {
>   union {
> - atomic_t val;
> + u32 val;
>  
>  #ifdef __LITTLE_ENDIAN
>   struct {
> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>   };
>  } arch_spinlock_t;
>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = 0 } }
>  
>  /*
> - * Bitfields in the atomic value:
> + * Bitfields in the lock word:
>   *
>   * 0: locked bit
>   * 16-31: tail cpu (+1)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5ebb88d95636..7c71e5e287df 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -22,32 +21,59 @@ struct qnodes {
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static inline int encode_tail_cpu(void)
> +static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>  }
>  
> -static inline int get_tail_cpu(int val)
> +static inline int get_tail_cpu(u32 val)
>  {
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
>  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

I think you missed deleting the above line.

> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - atomic_or(_Q_LOCKED_VAL, >val);
> - __atomic_acquire_fence();
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # lock_set_locked   \n"
> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> + : "=" (prev)
> + : "r" (>val), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 

Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> The first 16 bits of the lock are only modified by the owner, and other
> modifications always use atomic operations on the entire 32 bits, so
> unlocks can use plain stores on the 16 bits. This is the same kind of
> optimisation done by core qspinlock code.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  6 +-
>  arch/powerpc/include/asm/qspinlock_types.h | 19 +--
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index f06117aa60e1..79a1936fb68d 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct 
> qspinlock *lock)
>  
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> - for (;;) {
> - int val = atomic_read(&lock->val);
> - if (atomic_cmpxchg_release(&lock->val, val, val & 
> ~_Q_LOCKED_VAL) == val)
> - return;
> - }
> + smp_store_release(&lock->locked, 0);

Is it also possible for lock_set_locked() to use a non-atomic acquire
operation?

>  }
>  
>  #define arch_spin_is_locked(l)   queued_spin_is_locked(l)
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 9630e714c70d..3425dab42576 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -3,12 +3,27 @@
>  #define _ASM_POWERPC_QSPINLOCK_TYPES_H
>  
>  #include 
> +#include 
>  
>  typedef struct qspinlock {
> - atomic_t val;
> + union {
> + atomic_t val;
> +
> +#ifdef __LITTLE_ENDIAN
> + struct {
> + u16 locked;
> + u8  reserved[2];
> + };
> +#else
> + struct {
> + u8  reserved[2];
> + u16 locked;
> + };
> +#endif
> + };
>  } arch_spinlock_t;

Just to double check we have:

#define _Q_LOCKED_OFFSET0
#define _Q_LOCKED_BITS  1
#define _Q_LOCKED_MASK  0x0001
#define _Q_LOCKED_VAL   1

#define _Q_TAIL_CPU_OFFSET  16
#define _Q_TAIL_CPU_BITS16
#define _Q_TAIL_CPU_MASK    0xFFFF0000


so the ordering here looks correct.

>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { .val = ATOMIC_INIT(0) }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
>  
>  /*
>   * Bitfields in the atomic value:



Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

[resend as utf-8, not utf-7]
>  
> +/*
> + * Bitfields in the atomic value:
> + *
> + * 0: locked bit
> + * 16-31: tail cpu (+1)
> + */
> +#define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> +   << _Q_ ## type ## _OFFSET)
> +#define _Q_LOCKED_OFFSET 0
> +#define _Q_LOCKED_BITS   1
> +#define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
> +#define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
> +#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
> +

Just to state the obvious this is:

#define _Q_LOCKED_OFFSET0
#define _Q_LOCKED_BITS  1
#define _Q_LOCKED_MASK  0x0001
#define _Q_LOCKED_VAL   1

#define _Q_TAIL_CPU_OFFSET  16
#define _Q_TAIL_CPU_BITS16
#define _Q_TAIL_CPU_MASK    0xFFFF0000

> +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 8dbce99a373c..5ebb88d95636 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,12 +1,172 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> +#include 
> +#include 
> +#include 
>  #include 
> -#include 
> +#include 
> +#include 
>  #include 
>  
> -void queued_spin_lock_slowpath(struct qspinlock *lock)
> +#define MAX_NODES4
> +
> +struct qnode {
> + struct qnode*next;
> + struct qspinlock *lock;
> + u8  locked; /* 1 if lock acquired */
> +};
> +
> +struct qnodes {
> + int count;
> + struct qnode nodes[MAX_NODES];
> +};

I think it could be worth commenting why qnodes::count is used instead of
_Q_TAIL_IDX_OFFSET.

> +
> +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +
> +static inline int encode_tail_cpu(void)

I think the generic version that takes smp_processor_id() as a parameter is 
clearer - at least with this function name.
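i.e. something shaped like the generic code (a sketch only; callers would pass
smp_processor_id() explicitly):

static inline u32 encode_tail_cpu(int cpu)
{
	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
}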

> +{
> + return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> +}
> +
> +static inline int get_tail_cpu(int val)

It seems like there should be a "decode" function to pair up with the "encode" 
function.
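e.g. (name is only a suggestion):

static inline int decode_tail_cpu(u32 val)
{
	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
}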

> +{
> + return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> +}
> +
> +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

Does that comment mean it is not necessary to use an atomic_or here?

> +static __always_inline void lock_set_locked(struct qspinlock *lock)

nit: could just be called set_locked()

> +{
> + atomic_or(_Q_LOCKED_VAL, &lock->val);
> + __atomic_acquire_fence();
> +}
> +
> +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> int val)
> +{
> + int newval = _Q_LOCKED_VAL;
> +
> + if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, newval) == val);

> +}
> +
> +/*
> + * Publish our tail, replacing previous tail. Return previous value.
> + *
> + * This provides a release barrier for publishing node, and an acquire 
> barrier
> + * for getting the old node.
> + */
> +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)

Did you change from the xchg_tail() name in the generic version because of the 
release and acquire barriers this provides?
Does "publish" generally imply the old value will be returned?

>  {
> - while (!queued_spin_trylock(lock))
> + for (;;) {
> + int val = atomic_read(&lock->val);
> + int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> + int old;
> +
> + old = atomic_cmpxchg(&lock->val, val, newval);
> + if (old == val)
> + return old;
> + }
> +}
> +
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +{
> + int cpu = get_tail_cpu(val);
> + struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> + int idx;
> +
> + for (idx = 0; idx < MAX_NODES; idx++) {
> + struct qnode *qnode = &qnodesp->nodes[idx];
> + if (qnode->lock == lock)
> + return qnode;
> + }

In case anyone else is confused by this, Nick explained that each CPU can only be
queued on a given spinlock once, regardless of the "idx" nesting level.

> +
> + BUG();
> +}
> +
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +{
> + struct qnodes *qnodesp;
> + struct qnode *next, *node;
> + int val, old, tail;
> + int idx;
> +
> + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
> + qnodesp = this_cpu_ptr(&qnodes);
> + if (unlikely(qnodesp->count == MAX_NODES)) {

The comparison is >= in the 

Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> - smp_store_release(>locked, 0);
> - else
> - __pv_queued_spin_unlock(lock);
> + if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> + return 1;
> + return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

[resend as utf-8, not utf-7]



Re: [PATCH 2/2] powerpc/rtas: Fix RTAS MSR[HV] handling for Cell

2022-08-24 Thread Jordan Niethe
On Wed, 2022-08-24 at 22:04 +1000, Michael Ellerman wrote:
> Jordan Niethe  writes:
> > On Tue, 2022-08-23 at 21:59 +1000, Michael Ellerman wrote:
> > > The semi-recent changes to MSR handling when entering RTAS (firmware)
> > > cause crashes on IBM Cell machines. An example trace:
> ...
> > > diff --git a/arch/powerpc/kernel/rtas_entry.S 
> > > b/arch/powerpc/kernel/rtas_entry.S
> > > index 9a434d42e660..6ce95ddadbcd 100644
> > > --- a/arch/powerpc/kernel/rtas_entry.S
> > > +++ b/arch/powerpc/kernel/rtas_entry.S
> > > @@ -109,8 +109,12 @@ _GLOBAL(enter_rtas)
> > >* its critical regions (as specified in PAPR+ section 7.2.1). MSR[S]
> > >* is not impacted by RFI_TO_KERNEL (only urfid can unset it). So if
> > >* MSR[S] is set, it will remain when entering RTAS.
> > > +  * If we're in HV mode, RTAS must also run in HV mode, so extract MSR_HV
> > > +  * from the saved MSR value and insert into the value RTAS will use.
> > >*/
> > 
> > Interestingly it looks like these are the first uses of these extended
> > mnemonics in the kernel?
> 
> We used to have at least one use I know of in TM code, but it's since
> been converted to C.
> 
> > > + extrdi  r0, r6, 1, 63 - MSR_HV_LG
> > 
> > Or in non-mnemonic form...
> > rldicl  r0, r6, 64 - MSR_HV_LG, 63
> 
> It's rldicl all the way down.
> 
> > >   LOAD_REG_IMMEDIATE(r6, MSR_ME | MSR_RI)
> > > + insrdi  r6, r0, 1, 63 - MSR_HV_LG
> > 
> > Or in non-mnemonic form...
> > rldimi  r6, r0, MSR_HV_LG, 63 - MSR_HV_LG
> 
> I think the extended mnemonics are slightly more readable than the
> open-coded versions?

Yeah definitely. I was just noting the plain instruction as I think we
have some existing patterns that may be potential candidates for conversion to 
the
extended version. Like in exceptions-64s.S

rldicl. r0, r12, (64-MSR_TS_LG), (64-2) 
to 
extrdi. r0, r12, 2, 63 - MSR_TS_LG - 1

Would it be worth changing these?

> 
> > It is ok to use r0 as a scratch register as it is loaded with 0 afterwards 
> > anyway.
> 
> I originally used r7, but r0 is more obviously safe.
> 
> > >   li  r0,0
> > >   mtmsrd  r0,1/* disable RI before using SRR0/1 */
> > 
> > Reviewed-by: Jordan Niethe 
> 
> Thanks.
> 
> cheers



Re: [PATCH 2/2] powerpc/rtas: Fix RTAS MSR[HV] handling for Cell

2022-08-23 Thread Jordan Niethe
On Tue, 2022-08-23 at 21:59 +1000, Michael Ellerman wrote:
> The semi-recent changes to MSR handling when entering RTAS (firmware)
> cause crashes on IBM Cell machines. An example trace:
> 
>   kernel tried to execute user page (2fff01a8) - exploit attempt? (uid: 0)
>   BUG: Unable to handle kernel instruction fetch
>   Faulting instruction address: 0x2fff01a8
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=4 NUMA Cell
>   Modules linked in:
>   CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW  
> 6.0.0-rc2-00433-gede0a8d3307a #207
>   NIP:  2fff01a8 LR: 00032608 CTR: 
>   REGS: c15236b0 TRAP: 0400   Tainted: GW   
> (6.0.0-rc2-00433-gede0a8d3307a)
>   MSR:  08001002   CR:   XER: 2000
>   ...
>   NIP 0x2fff01a8
>   LR  0x32608
>   Call Trace:
> 0xc143c5f8 (unreliable)
> .rtas_call+0x224/0x320
> .rtas_get_boot_time+0x70/0x150
> .read_persistent_clock64+0x114/0x140
> .read_persistent_wall_and_boot_offset+0x24/0x80
> .timekeeping_init+0x40/0x29c
> .start_kernel+0x674/0x8f0
> start_here_common+0x1c/0x50
> 
> Unlike PAPR platforms where RTAS is only used in guests, on the IBM Cell
> machines Linux runs with MSR[HV] set but also uses RTAS, provided by
> SLOF.
> 
> Fix it by copying the MSR[HV] bit from the MSR value we've just read
> using mfmsr into the value used for RTAS.
> 
> It seems like we could also fix it using an #ifdef CELL to set MSR[HV],
> but that doesn't work because it's possible to build a single kernel
> image that runs on both Cell native and pseries.
> 
> Fixes: b6b1c3ce06ca ("powerpc/rtas: Keep MSR[RI] set when calling RTAS")
> Cc: sta...@vger.kernel.org # v5.19+
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/kernel/rtas_entry.S | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/rtas_entry.S 
> b/arch/powerpc/kernel/rtas_entry.S
> index 9a434d42e660..6ce95ddadbcd 100644
> --- a/arch/powerpc/kernel/rtas_entry.S
> +++ b/arch/powerpc/kernel/rtas_entry.S
> @@ -109,8 +109,12 @@ _GLOBAL(enter_rtas)
>* its critical regions (as specified in PAPR+ section 7.2.1). MSR[S]
>* is not impacted by RFI_TO_KERNEL (only urfid can unset it). So if
>* MSR[S] is set, it will remain when entering RTAS.
> +  * If we're in HV mode, RTAS must also run in HV mode, so extract MSR_HV
> +  * from the saved MSR value and insert into the value RTAS will use.
>*/

Interestingly it looks like these are the first uses of these extended
mnemonics in the kernel?

> + extrdi  r0, r6, 1, 63 - MSR_HV_LG

Or in non-mnemonic form...
rldicl  r0, r6, 64 - MSR_HV_LG, 63

>   LOAD_REG_IMMEDIATE(r6, MSR_ME | MSR_RI)
> + insrdi  r6, r0, 1, 63 - MSR_HV_LG

Or in non-mnemonic form...
rldimi  r6, r0, MSR_HV_LG, 63 - MSR_HV_LG

It is ok to use r0 as a scratch register as it is loaded with 0 afterwards 
anyway.

>  
>   li  r0,0
>   mtmsrd  r0,1/* disable RI before using SRR0/1 */

Reviewed-by: Jordan Niethe 



Re: [PATCH v4 2/2] selftests/powerpc: Add a test for execute-only memory

2022-08-17 Thread Jordan Niethe
On Wed, 2022-08-17 at 15:06 +1000, Russell Currey wrote:
> From: Nicholas Miehlbradt 
> 
> This selftest is designed to cover execute-only protections
> on the Radix MMU but will also work with Hash.
> 
> The tests are based on those found in pkey_exec_test with modifications
> to use the generic mprotect() instead of the pkey variants.

Would it make sense to rename pkey_exec_test to exec_test and have this test be
a part of that?

> 
> Signed-off-by: Nicholas Miehlbradt 
> Signed-off-by: Russell Currey 
> ---
> v4: new
> 
>  tools/testing/selftests/powerpc/mm/Makefile   |   3 +-
>  .../testing/selftests/powerpc/mm/exec_prot.c  | 231 ++
>  2 files changed, 233 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/powerpc/mm/exec_prot.c
> 
> diff --git a/tools/testing/selftests/powerpc/mm/Makefile 
> b/tools/testing/selftests/powerpc/mm/Makefile
> index 27dc09d0bfee..19dd0b2ea397 100644
> --- a/tools/testing/selftests/powerpc/mm/Makefile
> +++ b/tools/testing/selftests/powerpc/mm/Makefile
> @@ -3,7 +3,7 @@ noarg:
>   $(MAKE) -C ../
>  
>  TEST_GEN_PROGS := hugetlb_vs_thp_test subpage_prot prot_sao segv_errors 
> wild_bctr \
> -   large_vm_fork_separation bad_accesses pkey_exec_prot \
> +   large_vm_fork_separation bad_accesses exec_prot 
> pkey_exec_prot \
> pkey_siginfo stack_expansion_signal stack_expansion_ldst \
> large_vm_gpr_corruption
>  TEST_PROGS := stress_code_patching.sh
> @@ -22,6 +22,7 @@ $(OUTPUT)/wild_bctr: CFLAGS += -m64
>  $(OUTPUT)/large_vm_fork_separation: CFLAGS += -m64
>  $(OUTPUT)/large_vm_gpr_corruption: CFLAGS += -m64
>  $(OUTPUT)/bad_accesses: CFLAGS += -m64
> +$(OUTPUT)/exec_prot: CFLAGS += -m64
>  $(OUTPUT)/pkey_exec_prot: CFLAGS += -m64
>  $(OUTPUT)/pkey_siginfo: CFLAGS += -m64
>  
> diff --git a/tools/testing/selftests/powerpc/mm/exec_prot.c 
> b/tools/testing/selftests/powerpc/mm/exec_prot.c
> new file mode 100644
> index ..db75b2225de1
> --- /dev/null
> +++ b/tools/testing/selftests/powerpc/mm/exec_prot.c
> @@ -0,0 +1,231 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright 2022, Nicholas Miehlbradt, IBM Corporation
> + * based on pkey_exec_prot.c
> + *
> + * Test if applying execute protection on pages works as expected.
> + */
> +
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#include "pkeys.h"
> +
> +
> +#define PPC_INST_NOP 0x6000
> +#define PPC_INST_TRAP0x7fe8
> +#define PPC_INST_BLR 0x4e800020
> +
> +static volatile sig_atomic_t fault_code;
> +static volatile sig_atomic_t remaining_faults;
> +static volatile unsigned int *fault_addr;
> +static unsigned long pgsize, numinsns;
> +static unsigned int *insns;
> +static bool pkeys_supported;
> +
> +static bool is_fault_expected(int fault_code)
> +{
> + if (fault_code == SEGV_ACCERR)
> + return true;
> +
> + /* Assume any pkey error is fine since pkey_exec_prot test covers them 
> */
> + if (fault_code == SEGV_PKUERR && pkeys_supported)
> + return true;
> +
> + return false;
> +}
> +
> +static void trap_handler(int signum, siginfo_t *sinfo, void *ctx)
> +{
> + /* Check if this fault originated from the expected address */
> + if (sinfo->si_addr != (void *)fault_addr)
> + sigsafe_err("got a fault for an unexpected address\n");
> +
> + _exit(1);
> +}
> +
> +static void segv_handler(int signum, siginfo_t *sinfo, void *ctx)
> +{
> + fault_code = sinfo->si_code;
> +
> + /* Check if this fault originated from the expected address */
> + if (sinfo->si_addr != (void *)fault_addr) {
> + sigsafe_err("got a fault for an unexpected address\n");
> + _exit(1);
> + }
> +
> + /* Check if too many faults have occurred for a single test case */
> + if (!remaining_faults) {
> + sigsafe_err("got too many faults for the same address\n");
> + _exit(1);
> + }
> +
> +
> + /* Restore permissions in order to continue */
> + if (is_fault_expected(fault_code)) {
> + if (mprotect(insns, pgsize, PROT_READ | PROT_WRITE | 
> PROT_EXEC)) {
> + sigsafe_err("failed to set access permissions\n");
> + _exit(1);
> + }
> + } else {
> + sigsafe_err("got a fault with an unexpected code\n");
> + _exit(1);
> + }
> +
> + remaining_faults--;
> +}
> +
> +static int check_exec_fault(int rights)
> +{
> + /*
> +  * Jump to the executable region.
> +  *
> +  * The first iteration also checks if the overwrite of the
> +  * first instruction word from a trap to a no-op succeeded.
> +  */
> + fault_code = -1;
> + remaining_faults = 0;
> + if (!(rights & PROT_EXEC))
> + remaining_faults = 1;
> +
> + FAIL_IF(mprotect(insns, pgsize, rights) != 0);
> + asm volatile("mtctr %0; 

Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks

2022-08-14 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Finding the owner or a queued waiter on a lock with a preempted vcpu
> is indicative of an oversubscribed guest causing the lock to get into
> trouble. Provide some options to detect this situation and have new
> CPUs avoid queueing for a longer time (more steal iterations) to
> minimise the problems caused by vcpu preemption on the queue.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
>  arch/powerpc/lib/qspinlock.c   | 240 +++--
>  2 files changed, 232 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 35f9525381e6..4fbcc8a4230b 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -30,7 +30,7 @@ typedef struct qspinlock {
>   *
>   * 0: locked bit
>   *  1-14: lock holder cpu
> - *15: unused bit
> + *15: lock owner or queuer vcpus observed to be preempted bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)
>   */
> @@ -49,6 +49,11 @@ typedef struct qspinlock {
>  #error "qspinlock does not support such large CONFIG_NR_CPUS"
>  #endif
>  
> +#define _Q_SLEEPY_OFFSET 15
> +#define _Q_SLEEPY_BITS   1
> +#define _Q_SLEEPY_MASK   _Q_SET_MASK(SLEEPY_OWNER)
> +#define _Q_SLEEPY_VAL(1U << _Q_SLEEPY_OFFSET)
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5cfd69931e31..c18133c01450 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_spin_on_preempted_owner __read_mostly = false;
> +static bool pv_sleepy_lock __read_mostly = true;
> +static bool pv_sleepy_lock_sticky __read_mostly = false;

The sticky part could potentially be its own patch.

> +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> +static int pv_sleepy_lock_factor __read_mostly = 256;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
>  
> -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> +static __always_inline bool recently_sleepy(void)
> +{

Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.
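i.e. something like (a sketch of the same body, with the extra check added first):

static __always_inline bool recently_sleepy(void)
{
	if (pv_sleepy_lock && pv_sleepy_lock_interval_ns) {
		u64 seen = this_cpu_read(sleepy_lock_seen_clock);

		if (seen) {
			u64 delta = sched_clock() - seen;
			if (delta < pv_sleepy_lock_interval_ns)
				return true;
			this_cpu_write(sleepy_lock_seen_clock, 0);
		}
	}

	return false;
}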

> + if (pv_sleepy_lock_interval_ns) {
> + u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> +
> + if (seen) {
> + u64 delta = sched_clock() - seen;
> + if (delta < pv_sleepy_lock_interval_ns)
> + return true;
> + this_cpu_write(sleepy_lock_seen_clock, 0);
> + }
> + }
> +
> + return false;
> +}
> +
> +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool 
> sleepy)

It seems like paravirt is implied by sleepy.

>  {
>   if (remote) {
> - return REMOTE_STEAL_SPINS;
> + if (paravirt && sleepy)
> + return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return REMOTE_STEAL_SPINS;
>   } else {
> - return STEAL_SPINS;
> + if (paravirt && sleepy)
> + return STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return STEAL_SPINS;
>   }
>  }

I think that separate functions would still be nicer but this could get rid of
the nesting conditionals like


int spins;

if (remote)
	spins = REMOTE_STEAL_SPINS;
else
	spins = STEAL_SPINS;

if (sleepy)
	return spins * pv_sleepy_lock_factor;
return spins;

>  
> -static __always_inline int get_head_spins(bool paravirt)
> +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
>  {
> - return HEAD_SPINS;
> + if (paravirt && sleepy)
> + return HEAD_SPINS * pv_sleepy_lock_factor;
> + else
> + return HEAD_SPINS;
>  }
>  
>  static inline u32 encode_tail_cpu(void)
> @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 
> old)
> +{
> + u32 prev;
> + u32 new = old | _Q_SLEEPY_VAL;
> +
> + BUG_ON(!(old & _Q_LOCKED_VAL));
> + 

Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.
> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++-
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>   int owner;
>   u32 yield_count;
>  
>   BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> + *preempted = false;
> +
>   if (!paravirt)
>   goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>   spin_end();
>  
> + *preempted = true;
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool *preempted)

It seems like preempted parameter could be the return value of
yield_to_locked_owner(). Then callers that don't use the value returned in
preempted don't need to create an unnecessary variable to pass in.
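A sketch, assuming __yield_to_locked_owner() is also changed to return whether the
owner vcpu was preempted:

static __always_inline bool yield_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt)
{
	return __yield_to_locked_owner(lock, val, paravirt, false);
}

static __always_inline bool yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
{
	return __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
}

Callers that don't care about the preemption status can then just ignore the return
value.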

>  {
> - __yield_to_locked_owner(lock, val, paravirt, false);
> + __yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> - __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> - int iters;
> + int iters = 0;
> +
> + if (!STEAL_SPINS) {
> + if (paravirt && pv_spin_on_preempted_owner) {
> + spin_begin();
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> + bool preempted;
> +
> + if (val & _Q_MUST_Q_VAL)
> + break;
> + if (!(val & _Q_LOCKED_VAL))
> + break;
> + if (!vcpu_is_preempted(get_owner_cpu(val)))
> + break;
> + yield_to_locked_owner(lock, val, paravirt, &preempted);
> + }
> + spin_end();
> + }
> + return false;
> + }
>  
>   /* Attempt to steal the lock */
>   spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
> + bool preempted;
>  
>   if (val & _Q_MUST_Q_VAL)
>   break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   continue;
>   }
>  
> - yield_to_locked_owner(lock, val, paravirt);
> -
> - iters++;
> + yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> + if (paravirt && preempted) {
> + if (!pv_spin_on_preempted_owner)
> + iters++;
> + /*
> +  * 

Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Allow for a reduction in the number of times a CPU from a different
> node than the owner can attempt to steal the lock before queueing.
> This could bias the transfer behaviour of the lock across the
> machine and reduce NUMA crossings.
> ---
>  arch/powerpc/lib/qspinlock.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index d4594c701f7d..24f68bd71e2b 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -24,6 +25,7 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
>  #if _Q_SPIN_TRY_LOCK_STEAL == 1
>  static const bool MAYBE_STEALERS = true;
>  #else
> @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(bool paravirt)
> +static __always_inline int get_steal_spins(bool paravirt, bool remote)
>  {
> - return STEAL_SPINS;
> + if (remote) {
> + return REMOTE_STEAL_SPINS;
> + } else {
> + return STEAL_SPINS;
> + }
>  }
>  
>  static __always_inline int get_head_spins(bool paravirt)
> @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>  
>   iters++;
>  
> - if (iters >= get_steal_spins(paravirt))
> + if (iters >= get_steal_spins(paravirt, false))
>   break;
> + if (iters >= get_steal_spins(paravirt, true)) {

There's no indication of what true and false mean here which is hard to read.
To me it feels like two separate functions would be more clear.
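For example, a sketch of splitting it in two (keeping the current behaviour):

static __always_inline int get_steal_spins(bool paravirt)
{
	return STEAL_SPINS;
}

static __always_inline int get_remote_steal_spins(bool paravirt)
{
	return REMOTE_STEAL_SPINS;
}

so the call sites read as get_steal_spins(paravirt) and
get_remote_steal_spins(paravirt).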


> + int cpu = get_owner_cpu(val);
> + if (numa_node_id() != cpu_to_node(cpu))

What about using node_distance() instead?


> + break;
> + }
>   }
>   spin_end();
>  
> @@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, 
> "%llu\n");
>  
> +static int remote_steal_spins_set(void *data, u64 val)
> +{
> + REMOTE_STEAL_SPINS = val;

REMOTE_STEAL_SPINS is int not u64.

> +
> + return 0;
> +}
> +
> +static int remote_steal_spins_get(void *data, u64 *val)
> +{
> + *val = REMOTE_STEAL_SPINS;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, 
> remote_steal_spins_set, "%llu\n");
> +
>  static int head_spins_set(void *data, u64 val)
>  {
>   HEAD_SPINS = val;
> @@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, 
> pv_prod_head_get, pv_prod_head_set, "
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_steal_spins);
> + debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, 
> NULL, &fops_remote_steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_owner);



Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> to prevent threads issuing a lot of expensive priority nops which may not
> have much effect due to immediately executing low then medium priority.

Just a general comment regarding the spin_{begin,end} API: once it gets more
complicated than something like

spin_begin()
for (;;)
	spin_cpu_relax()
spin_end()

it becomes difficult to keep track of. Unfortunately, I don't have any good
suggestions for how to improve it. Hopefully with P10's wait instruction we can
try to move away from this.

It might be useful to comment the functions pre and post conditions regarding
expectations about spin_begin() and spin_end().
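For example, something along these lines on the yield helpers (wording is only
illustrative):

/*
 * Caller must be inside a spin_begin()/spin_end() section.  This function
 * may temporarily spin_end() around the yield_to_preempted() call, but it
 * always returns with spin_begin() back in effect.
 */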

> ---
>  arch/powerpc/lib/qspinlock.c | 35 +++
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 277aef1fab0a..d4594c701f7d 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -233,6 +233,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -248,11 +250,13 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   yield_to_preempted(owner, yield_count);
>   if (clear_mustq)
>   lock_set_mustq(lock);
> + spin_begin();
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> + spin_begin();
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> @@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto yield_prev; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb();
>  
>   if (yield_cpu == node->yield_cpu) {
>   if (node->next && node->next->yield_cpu != yield_cpu)
>   node->next->yield_cpu = yield_cpu;
>   yield_to_preempted(yield_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  yield_prev:
>   if (!pv_yield_prev)
> @@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb(); /* See yield_to_locked_owner comment */
>  
>   if (!node->locked) {
>   yield_to_preempted(prev_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  
> @@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   int iters;
>  
>   /* Attempt to steal the lock */
> + spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> @@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   break;
>  
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + spin_end();
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> + spin_begin();
>   continue;
>   }
>  
> @@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   if (iters >= get_steal_spins(paravirt))
>   break;
>   }
> + spin_end();
>  
>   return false;
>  }
> @@ -418,8 +434,10 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
> + spin_begin();
>   while (!node->locked)
>   yield_to_prev(lock, node, prev_cpu, paravirt);
> + spin_end();
>  
>   /* Clear out stale propagated yield_cpu */
>   if (paravirt && pv_yield_propagate_owner && node->yield_cpu != 
> -1)
> @@ -432,10 +450,12 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   int set_yield_cpu = -1;
>  
>   /* We're at the head of the waitqueue, wait for the lock. */
> + spin_begin();
>   while ((val = 

Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> This gives trylock slightly more strength, and it also gives most
> of the benefit of passing 'val' back through the slowpath without
> the complexity.
> ---
>  arch/powerpc/include/asm/qspinlock.h | 39 +++-
>  arch/powerpc/lib/qspinlock.c |  9 +++
>  2 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 44601b261e08..d3d2039237b2 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -5,6 +5,8 @@
>  #include 
>  #include 
>  
> +#define _Q_SPIN_TRY_LOCK_STEAL 1

Would this be a config option?

> +
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
>   return READ_ONCE(lock->val);
> @@ -26,11 +28,12 @@ static __always_inline u32 
> queued_spin_get_locked_val(void)
>   return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
>  }
>  
> -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock 
> *lock)
>  {
>   u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
> + /* Trylock succeeds only when unlocked and no queued nodes */
>   asm volatile(
>  "1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_nosteal

>  "cmpwi   0,%0,0  \n"
> @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct 
> qspinlock *lock)
>   return 0;
>  }
>  
> +static __always_inline int __queued_spin_trylock_steal(struct qspinlock 
> *lock)
> +{
> + u32 new = queued_spin_get_locked_val();
> + u32 prev, tmp;
> +
> + /* Trylock may get ahead of queued nodes if it finds unlocked */
> + asm volatile(
> +"1:  lwarx   %0,0,%2,%5  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_steal

> +"andc.   %1,%0,%4\n"
> +"bne-2f  \n"
> +"and %1,%0,%4\n"
> +"or  %1,%1,%3\n"
> +"stwcx.  %1,0,%2 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"

Just because there's a little bit more going on here...

Q_TAIL_CPU_MASK = 0xFFFE0000
~Q_TAIL_CPU_MASK = 0x0001FFFF


1:  lwarx   prev, 0, &lock->val, IS_ENABLED_PPC64
andc.   tmp, prev, _Q_TAIL_CPU_MASK (tmp = prev & ~_Q_TAIL_CPU_MASK)
bne-2f  (exit if locked)
and tmp, prev, _Q_TAIL_CPU_MASK (tmp = prev & _Q_TAIL_CPU_MASK)
or  tmp, tmp, new   (tmp |= new)

stwcx.  tmp, 0, >val  

bne-1b  
PPC_ACQUIRE_BARRIER 
2:

... which seems correct.


> + : "=" (prev), "=" (tmp)
> + : "r" (>val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");
> +
> + if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
> + return 1;
> + return 0;
> +}
> +
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +{
> + if (!_Q_SPIN_TRY_LOCK_STEAL)
> + return __queued_spin_trylock_nosteal(lock);
> + else
> + return __queued_spin_trylock_steal(lock);
> +}
> +
>  void queued_spin_lock_slowpath(struct qspinlock *lock);
>  
>  static __always_inline void queued_spin_lock(struct qspinlock *lock)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 3b10e31bcf0a..277aef1fab0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -24,7 +24,11 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +static const bool MAYBE_STEALERS = true;
> +#else
>  static bool MAYBE_STEALERS __read_mostly = true;
> +#endif
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> @@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
>  #include 
>  static int steal_spins_set(void *data, u64 val)
>  {
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> + /* MAYBE_STEAL remains true */
> + STEAL_SPINS = val;
> +#else
>   static DEFINE_MUTEX(lock);
>  
> + mutex_lock(&lock);
> @@ -539,6 +547,7 @@ static int steal_spins_set(void *data, u64 val)
> 

Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> After the head of the queue acquires the lock, it releases the
> next waiter in the queue to become the new head. Add an option
> to prod the new head if its vCPU was preempted. This may only
> have an effect if queue waiters are yielding.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 29 -
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 28c85a2d5635..3b10e31bcf0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int cpu;
>   int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
> @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
> +static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -392,6 +394,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   node = >nodes[idx];
>   node->next = NULL;
>   node->lock = lock;
> + node->cpu = smp_processor_id();

I suppose this could be used in some other places too.

For example change:
yield_to_prev(lock, node, prev, paravirt);

In yield_to_prev() it could then access the prev->cpu.

>   node->yield_cpu = -1;
>   node->locked = 0;
>  
> @@ -483,7 +486,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>* this store to locked. The corresponding barrier is the smp_rmb()
>* acquire barrier for mcs lock, above.
>*/
> - WRITE_ONCE(next->locked, 1);
> + if (paravirt && pv_prod_head) {
> + int next_cpu = next->cpu;
> + WRITE_ONCE(next->locked, 1);
> + if (vcpu_is_preempted(next_cpu))
> + prod_cpu(next_cpu);
> + } else {
> + WRITE_ONCE(next->locked, 1);
> + }
>  
>  release:
>   qnodesp->count--; /* release the node */
> @@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 
> *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, 
> pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
>  
> +static int pv_prod_head_set(void *data, u64 val)
> +{
> + pv_prod_head = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_prod_head_get(void *data, u64 *val)
> +{
> + *val = pv_prod_head;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, 
> pv_prod_head_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> _steal_spins);
> @@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
>   debugfs_create_file("qspl_pv_yield_allow_steal", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_allow_steal);
>   debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_prev);
>   debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_propagate_owner);
> + debugfs_create_file("qspl_pv_prod_head", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_prod_head);
>   }
>  
>   return 0;



Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to yield
> to a preempted owner.
> 
> s390 addresses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;

This also seems to be enabled by default.

>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void 
> yield_head_to_locked_owner(struct qspinlock *lock, u
>   __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> +{
> + struct qnode *next;
> + int owner;
> +
> + if (!paravirt)
> + return;
> + if (!pv_yield_propagate_owner)
> + return;
> +
> + owner = get_owner_cpu(val);
> + if (*set_yield_cpu == owner)
> + return;
> +
> + next = READ_ONCE(node->next);
> + if (!next)
> + return;
> +
> + if (vcpu_is_preempted(owner)) {

Is there a difference between using vcpu_is_preempted() here
and checking bit 0 as is done in other places?


> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + } else if (*set_yield_cpu != -1) {

It might be worth giving the -1 CPU a #define.
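e.g. (name is only a suggestion):

#define _Q_YIELD_CPU_NONE	(-1)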

> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + }
> +}

Does this need to pass set_yield_cpu by reference? Couldn't its new value be
returned? To me that would make it clearer that the function is used to change
set_yield_cpu. I think this would work:

int set_yield_cpu = -1;

static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val, int set_yield_cpu, bool paravirt)
{
	struct qnode *next;
	int owner;

	if (!paravirt)
		goto out;
	if (!pv_yield_propagate_owner)
		goto out;

	owner = get_owner_cpu(val);
	if (set_yield_cpu == owner)
		goto out;

	next = READ_ONCE(node->next);
	if (!next)
		goto out;

	if (vcpu_is_preempted(owner)) {
		next->yield_cpu = owner;
		return owner;
	} else if (set_yield_cpu != -1) {
		next->yield_cpu = owner;
		return owner;
	}

out:
	return set_yield_cpu;
}

set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);



> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> + int yield_cpu;
>  
>   if (!paravirt)
>   goto relax;
>  
> + if (!pv_yield_propagate_owner)
> + goto yield_prev;
> +
> + yield_cpu = READ_ONCE(node->yield_cpu);
> + if (yield_cpu == -1) {
> + /* Propagate back the -1 CPU */
> + if (node->next && node->next->yield_cpu != -1)
> + node->next->yield_cpu = yield_cpu;
> + goto yield_prev;
> + }
> +
> + yield_count = yield_count_of(yield_cpu);
> + if ((yield_count & 1) == 0)
> + goto yield_prev; /* owner vcpu is running */
> +
> + smp_rmb();
> +
> + if (yield_cpu == node->yield_cpu) {
> + if (node->next && node->next->yield_cpu != yield_cpu)
> + node->next->yield_cpu = yield_cpu;
> + yield_to_preempted(yield_cpu, yield_count);
> + return;
> + }
> +
> +yield_prev:
>   if (!pv_yield_prev)
>   goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   

Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> If the head of queue is preventing stealing but it finds the owner vCPU
> is preempted, it will yield its cycles to the owner which could cause it
> to become preempted. Add an option to re-allow stealers before yielding,
> and disallow them again after returning from the yield.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 56 ++--
>  1 file changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index b39f8c5b329c..94f007f66942 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_allow_steal __read_mostly = false;

To me this one does read as a boolean, but if you go with those other changes
I'd make it pv_yield_steal_enable to be consistent.

>  static bool pv_yield_prev __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_clear_mustq  \n"
> +"andc%0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=" (prev)
> + : "r" (>val), "r" (new)
> + : "cr0", "memory");
> +

This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong 
caveat.


> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)

 /* See yield_to_locked_owner comment */ comment needs to be updated now.


>  {
>   int owner;
>   u32 yield_count;
> @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   smp_rmb();
>  
>   if (READ_ONCE(lock->val) == val) {
> + if (clear_mustq)
> + lock_clear_mustq(lock);
>   yield_to_preempted(owner, yield_count);
> + if (clear_mustq)
> + lock_set_mustq(lock);
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +{
> + __yield_to_locked_owner(lock, val, paravirt, false);
> +}
> +
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +{

The check for pv_yield_allow_steal seems like it could go here instead of
being done by the caller.
__yield_to_locked_owner() checks for pv_yield_owner so it seems more
  consistent.
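i.e. a sketch of folding the check into the helper, so the call site only passes the
mustq state:

static __always_inline void yield_head_to_locked_owner(struct qspinlock *lock, u32 val, bool paravirt, bool clear_mustq)
{
	if (!pv_yield_allow_steal)
		clear_mustq = false;

	__yield_to_locked_owner(lock, val, paravirt, clear_mustq);
}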



> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> @@ -332,7 +364,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt, false);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -350,7 +382,8 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt,
> + pv_yield_allow_steal && set_mustq);
>  
>   

Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Queued waiters which are not at the head of the queue don't spin on
> the lock word but their qnode lock word, waiting for the previous queued
> CPU to release them. Add an option which allows these waiters to yield
> to the previous CPU if its vCPU is preempted.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 46 +++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 55286ac91da5..b39f8c5b329c 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_prev __read_mostly = true;

Similar suggestion, maybe pv_yield_prev_enabled would read better.

Isn't this enabled by default contrary to the commit message?


>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)

yield_to_locked_owner() takes a raw val and works out the cpu to yield to.
I think for consistency have yield_to_prev() take the raw val and work it out 
too.
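i.e. a sketch with the same body as in the patch, but taking the raw lock word:

static __always_inline void yield_to_prev(struct qspinlock *lock, struct qnode *node, u32 val, bool paravirt)
{
	int prev_cpu = get_tail_cpu(val);
	u32 yield_count;

	if (!paravirt)
		goto relax;

	if (!pv_yield_prev)
		goto relax;

	yield_count = yield_count_of(prev_cpu);
	if ((yield_count & 1) == 0)
		goto relax; /* owner vcpu is running */

	smp_rmb(); /* See yield_to_locked_owner comment */

	if (!node->locked) {
		yield_to_preempted(prev_cpu, yield_count);
		return;
	}

relax:
	cpu_relax();
}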

> +{
> + u32 yield_count;
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_prev)
> + goto relax;
> +
> + yield_count = yield_count_of(prev_cpu);
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */
> +
> + smp_rmb(); /* See yield_to_locked_owner comment */
> +
> + if (!node->locked) {
> + yield_to_preempted(prev_cpu, yield_count);
> + return;
> + }
> +
> +relax:
> + cpu_relax();
> +}
> +
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> @@ -291,13 +317,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>*/
>   if (old & _Q_TAIL_CPU_MASK) {
>   struct qnode *prev = get_tail_qnode(lock, old);
> + int prev_cpu = get_tail_cpu(old);

This could then be removed.

>  
>   /* Link @node into the waitqueue. */
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
>   while (!node->locked)
> - cpu_relax();
> + yield_to_prev(lock, node, prev_cpu, paravirt);

And would have this as:
yield_to_prev(lock, node, old, paravirt);


>  
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
> @@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, 
> pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_prev_set(void *data, u64 val)
> +{
> + pv_yield_prev = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_yield_prev_get(void *data, u64 *val)
> +{
> + *val = pv_yield_prev;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, 
> pv_yield_prev_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> _steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> _head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, _pv_yield_owner);
> + debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>   }
>  
>   return 0;



Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner

2022-08-11 Thread Jordan Niethe
 On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Waiters spinning on the lock word should yield to the lock owner if the
> vCPU is preempted. This improves performance when the hypervisor has
> oversubscribed physical CPUs.
> ---
>  arch/powerpc/lib/qspinlock.c | 97 ++--
>  1 file changed, 83 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index aa26cfe21f18..55286ac91da5 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define MAX_NODES4
>  
> @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
> +static bool pv_yield_owner __read_mostly = true;

Why not macro case for these globals? To me the name does not make it super clear
this is a boolean. What about pv_yield_owner_enabled?

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(void)
> +static __always_inline int get_steal_spins(bool paravirt)
>  {
>   return STEAL_SPINS;
>  }
>  
> -static __always_inline int get_head_spins(void)
> +static __always_inline int get_head_spins(bool paravirt)
>  {
>   return HEAD_SPINS;
>  }
> @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
> -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */
> +static inline int get_owner_cpu(u32 val)
> +{
> + return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> +}
> +
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static inline bool try_to_steal_lock(struct qspinlock *lock)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)

This name doesn't seem correct for the non paravirt case.

> +{
> + int owner;
> + u32 yield_count;
> +
> + BUG_ON(!(val & _Q_LOCKED_VAL));
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_owner)
> + goto relax;
> +
> + owner = get_owner_cpu(val);
> + yield_count = yield_count_of(owner);
> +
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */

I wonder why not use vcpu_is_preempted()?
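
e.g. something like this in yield_to_locked_owner() is what I mean (untested
sketch - vcpu_is_preempted() wraps the same parity test, and the raw count
still has to be sampled for yield_to_preempted()):

	owner = get_owner_cpu(val);
	yield_count = yield_count_of(owner);

	if (!vcpu_is_preempted(owner))
		goto relax; /* owner vcpu is running */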

> +
> + /*
> +  * Read the lock word after sampling the yield count. On the other side
> +  * there may be a wmb because the yield count update is done by the
> +  * hypervisor preemption and the value update by the OS, however this
> +  * ordering might reduce the chance of out of order accesses and
> +  * improve the heuristic.
> +  */
> + smp_rmb();
> +
> + if (READ_ONCE(lock->val) == val) {
> + yield_to_preempted(owner, yield_count);
> + /* Don't relax if we yielded. Maybe we should? */
> + return;
> + }
> +relax:
> + cpu_relax();
> +}
> +
> +
> +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
>   int iters;
>  
> @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   continue;
>   }
>  
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   iters++;
>  
> - if (iters >= get_steal_spins())
> + if (iters >= get_steal_spins(paravirt))
>   break;
>   }
>  
>   return false;
>  }
>  
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock 
> *lock, bool paravirt)
>  {
>   struct qnodes *qnodesp;
>   struct qnode *next, *node;
> @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   iters++;

Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word

2022-08-11 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Store the owner CPU number in the lock word so it may be yielded to,
> as powerpc's paravirtualised simple spinlocks do.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  8 +++-
>  arch/powerpc/include/asm/qspinlock_types.h | 10 ++
>  arch/powerpc/lib/qspinlock.c   |  6 +++---
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 3ab354159e5e..44601b261e08 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct 
> qspinlock *lock)
>   return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
> +static __always_inline u32 queued_spin_get_locked_val(void)

Maybe this function should have "encode" in the name to match with
encode_tail_cpu().


> +{
> + /* XXX: make this use lock value in paca like simple spinlocks? */

Is that the paca's lock_token which is 0x8000?


> + return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> +}
> +
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 8b20f5e22bba..35f9525381e6 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,6 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> + *  1-14: lock holder cpu
> + *15: unused bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)

So there is one more bit for storing the tail cpu than for the lock holder cpu?

>   */
> @@ -39,6 +41,14 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL	(1U << _Q_LOCKED_OFFSET)
>  
> +#define _Q_OWNER_CPU_OFFSET  1
> +#define _Q_OWNER_CPU_BITS	14
> +#define _Q_OWNER_CPU_MASK	_Q_SET_MASK(OWNER_CPU)
> +
> +#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index a906cc8f15fa..aa26cfe21f18 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> @@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct 
> qspinlock *lock)
>  /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>  static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> u32 old)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   BUG_ON(old & _Q_LOCKED_VAL);
> @@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct 
> qspinlock *lock, u32 old, u3
>  /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
>  static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
>  {
> - u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> + u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
>  
>   if (__trylock_cmpxchg(lock, val, newval) == val)
>   return 1;



Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Give the queue head the ability to stop stealers. After a number of
> spins without successfully acquiring the lock, the queue head employs
> this, which will assure it is the next owner.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
>  arch/powerpc/lib/qspinlock.c   | 56 +-
>  2 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 210adf05b235..8b20f5e22bba 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,7 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> - * 16-31: tail cpu (+1)
> + *16: must queue bit
> + * 17-31: tail cpu (+1)
>   */
>  #define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> << _Q_ ## type ## _OFFSET)
> @@ -38,7 +39,12 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL	(1U << _Q_LOCKED_OFFSET)
>  
> -#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_MUST_Q_OFFSET 16
> +#define _Q_MUST_Q_BITS   1
> +#define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> +#define _Q_MUST_Q_VAL	(1U << _Q_MUST_Q_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   17
>  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
>  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)

Not a big deal but some of these values could be calculated like in the
generic version. e.g.

#define _Q_PENDING_OFFSET   (_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)
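
For this layout that would look something like (sketch, same values as above):

#define _Q_MUST_Q_OFFSET	16
#define _Q_MUST_Q_BITS		1
#define _Q_TAIL_CPU_OFFSET	(_Q_MUST_Q_OFFSET + _Q_MUST_Q_BITS)
#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)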

>  
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 1625cce714b2..a906cc8f15fa 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -22,6 +22,7 @@ struct qnodes {
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
> +static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
>   return STEAL_SPINS;
>  }
>  
> +static __always_inline int get_head_spins(void)
> +{
> + return HEAD_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct 
> qspinlock *lock, u32 tail)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_set_mustq\n"

Is the EH bit not set because we don't hold the lock here?

> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new)
> + : "cr0", "memory");

This is another usage close to the DEFINE_TESTOP() pattern.

> +
> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> + if (val & _Q_MUST_Q_VAL)
> + break;
> +
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   /* We must be the owner, just set the lock bit and acquire */
>   lock_set_locked(lock);
>   } else {
> + int iters = 0;
> + bool set_mustq = false;
> +
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>   cpu_relax();
>  
> + iters++;

It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?
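
i.e. something like this (untested sketch):

	if (!(val & _Q_MUST_Q_VAL) && iters >= get_head_spins()) {
		lock_set_mustq(lock);
		val |= _Q_MUST_Q_VAL;
	}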

> + if (!set_mustq && iters >= get_head_spins()) {
> + set_mustq = true;
> + lock_set_mustq(lock);
> + val |= _Q_MUST_Q_VAL;
> + }
> + }
> +
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
>  

Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> Allow new waiters a number of spins on the lock word before queueing,
> which particularly helps paravirt performance when physical CPUs are
> oversubscribed.
> ---
>  arch/powerpc/lib/qspinlock.c | 152 ---
>  1 file changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 7c71e5e287df..1625cce714b2 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -19,8 +19,17 @@ struct qnodes {
>   struct qnode nodes[MAX_NODES];
>  };
>  
> +/* Tuning parameters */
> +static int STEAL_SPINS __read_mostly = (1<<5);
> +static bool MAYBE_STEALERS __read_mostly = true;

I can understand why, but macro case variables can be a bit confusing.

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> +static __always_inline int get_steal_spins(void)
> +{
> + return STEAL_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct 
> qspinlock *lock, u32 ol
>   return 0;
>  }
>  
> +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 
> old, u32 new)
> +{
> + u32 prev;
> +
> + BUG_ON(old & _Q_LOCKED_VAL);
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%4  # queued_spin_trylock_cmpxchg   \n"

s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

btw what format are you using for the '\n's in the inline asm?

> +"cmpw0,%0,%2 \n"
> +"bne-2f  \n"
> +"stwcx.  %3,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r"(old), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");

This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
some form of "test and set" primitive helper.
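
e.g. (sketch) both trylocks could then become thin wrappers around the one
ll/sc helper:

/* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, u32 old)
{
	return __trylock_cmpxchg(lock, old, _Q_LOCKED_VAL) == old;
}

/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 val)
{
	return __trylock_cmpxchg(lock, val, _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK)) == val;
}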

> +
> + return prev;
> +}
> +
> +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
> +{
> + u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +
> + if (__trylock_cmpxchg(lock, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val

> +}
> +
>  /*
>   * Publish our tail, replacing previous tail. Return previous value.
>   *
> @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> +static inline bool try_to_steal_lock(struct qspinlock *lock)
> +{
> + int iters;
> +
> + /* Attempt to steal the lock */
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> +
> + if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + if (trylock_with_tail_cpu(lock, val))
> + return true;
> + continue;
> + }

The continue would bypass iters++/cpu_relax but the next time around
  if (unlikely(!(val & _Q_LOCKED_VAL))) {
should fail so everything should be fine?

> +
> + cpu_relax();
> +
> + iters++;
> +
> + if (iters >= get_steal_spins())
> + break;
> + }
> +
> + return false;
> +}
> +
>  static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>   struct qnodes *qnodesp;
> @@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
>  
> - /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + if (!MAYBE_STEALERS) {
> + /* We're at the head of the waitqueue, wait for the lock. */
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + cpu_relax();
>  
> - /* If we're the last queued, must clean up the tail. */
> - if ((val & _Q_TAIL_CPU_MASK) == tail) {
> - if (trylock_clear_tail_cpu(lock, val))
> - goto release;
> - /* Another waiter must have enqueued */
> - }
> + /* If we're the last queued, must clean up the tail. */
> + if ((val & _Q_TAIL_CPU_MASK) == tail) {
> + if (trylock_clear_tail_cpu(lock, val))
> + goto release;
> + /* Another 

Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> This uses more optimal ll/sc style access patterns (rather than
> cmpxchg), and also sets the EH=1 lock hint on those operations
> which acquire ownership of the lock.
> ---
>  arch/powerpc/include/asm/qspinlock.h   | 25 +--
>  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>  arch/powerpc/lib/qspinlock.c   | 81 +++---
>  3 files changed, 79 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 79a1936fb68d..3ab354159e5e 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -2,28 +2,43 @@
>  #ifndef _ASM_POWERPC_QSPINLOCK_H
>  #define _ASM_POWERPC_QSPINLOCK_H
>  
> -#include 
>  #include 
>  #include 
>  
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> - return atomic_read(&lock->val);
> + return READ_ONCE(lock->val);
>  }
>  
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> - return !atomic_read(&lock.val);
> + return !lock.val;
>  }
>  
>  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  {
> - return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
> + return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"
> +"cmpwi   0,%0,0  \n"
> +"bne-2f  \n"
> +"stwcx.  %2,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)

btw IS_ENABLED() already returns 1 or 0

> + : "cr0", "memory");

This is the ISA's "test and set" atomic primitive. Do you think it would be 
worth separating it as a helper?
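
Something like this split is what I have in mind (untested sketch, helper name
made up):

/* cmpxchg-acquire 0 -> new on the lock word, returning the old value */
static __always_inline u32 __queued_spin_trylock_word(struct qspinlock *lock, u32 new)
{
	u32 prev;

	asm volatile(
"1:	lwarx	%0,0,%1,%3	# __queued_spin_trylock_word	\n"
"	cmpwi	0,%0,0		\n"
"	bne-	2f		\n"
"	stwcx.	%2,0,%1		\n"
"	bne-	1b		\n"
"\t"	PPC_ACQUIRE_BARRIER "	\n"
"2:				\n"
	: "=&r" (prev)
	: "r" (&lock->val), "r" (new),
	  "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
	: "cr0", "memory");

	return prev;
}

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	return likely(__queued_spin_trylock_word(lock, _Q_LOCKED_VAL) == 0);
}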

> +
> + if (likely(prev == 0))
>   return 1;
>   return 0;

same optional style nit: return likely(prev == 0);

>  }
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 3425dab42576..210adf05b235 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -7,7 +7,7 @@
>  
>  typedef struct qspinlock {
>   union {
> - atomic_t val;
> + u32 val;
>  
>  #ifdef __LITTLE_ENDIAN
>   struct {
> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>   };
>  } arch_spinlock_t;
>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = 0 } }
>  
>  /*
> - * Bitfields in the atomic value:
> + * Bitfields in the lock word:
>   *
>   * 0: locked bit
>   * 16-31: tail cpu (+1)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5ebb88d95636..7c71e5e287df 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -22,32 +21,59 @@ struct qnodes {
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static inline int encode_tail_cpu(void)
> +static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>  }
>  
> -static inline int get_tail_cpu(int val)
> +static inline int get_tail_cpu(u32 val)
>  {
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
>  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

I think you missed deleting the above line.

> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - atomic_or(_Q_LOCKED_VAL, &lock->val);
> - __atomic_acquire_fence();
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # lock_set_locked   \n"
> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", 

Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> The first 16 bits of the lock are only modified by the owner, and other
> modifications always use atomic operations on the entire 32 bits, so
> unlocks can use plain stores on the 16 bits. This is the same kind of
> optimisation done by core qspinlock code.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  6 +-
>  arch/powerpc/include/asm/qspinlock_types.h | 19 +--
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index f06117aa60e1..79a1936fb68d 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct 
> qspinlock *lock)
>  
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> - for (;;) {
> - int val = atomic_read(&lock->val);
> - if (atomic_cmpxchg_release(&lock->val, val, val & 
> ~_Q_LOCKED_VAL) == val)
> - return;
> - }
> + smp_store_release(&lock->locked, 0);

Is it also possible for lock_set_locked() to use a non-atomic acquire
operation?

>  }
>  
>  #define arch_spin_is_locked(l)   queued_spin_is_locked(l)
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 9630e714c70d..3425dab42576 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -3,12 +3,27 @@
>  #define _ASM_POWERPC_QSPINLOCK_TYPES_H
>  
>  #include 
> +#include 
>  
>  typedef struct qspinlock {
> - atomic_t val;
> + union {
> + atomic_t val;
> +
> +#ifdef __LITTLE_ENDIAN
> + struct {
> + u16 locked;
> + u8  reserved[2];
> + };
> +#else
> + struct {
> + u8  reserved[2];
> + u16 locked;
> + };
> +#endif
> + };
>  } arch_spinlock_t;

Just to double check we have:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000


so the ordering here looks correct.
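
If it seemed worthwhile, a compile-time check could document that overlay, e.g.
(untested sketch):

	/* locked must overlay the least significant 16 bits of val */
	static_assert(offsetof(arch_spinlock_t, locked) ==
		      (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) ? 2 : 0));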

>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { .val = ATOMIC_INIT(0) }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
>  
>  /*
>   * Bitfields in the atomic value:



Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:


>  
> +/*
> + * Bitfields in the atomic value:
> + *
> + * 0: locked bit
> + * 16-31: tail cpu (+1)
> + */
> +#define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> +   << _Q_ ## type ## _OFFSET)
> +#define _Q_LOCKED_OFFSET 0
> +#define _Q_LOCKED_BITS   1
> +#define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
> +#define _Q_LOCKED_VAL	(1U << _Q_LOCKED_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
> +#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
> +

Just to state the obvious this is:

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		1
#define _Q_LOCKED_MASK		0x00000001
#define _Q_LOCKED_VAL		1

#define _Q_TAIL_CPU_OFFSET	16
#define _Q_TAIL_CPU_BITS	16
#define _Q_TAIL_CPU_MASK	0xffff0000

> +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 8dbce99a373c..5ebb88d95636 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,12 +1,172 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> +#include 
> +#include 
> +#include 
>  #include 
> -#include 
> +#include 
> +#include 
>  #include 
>  
> -void queued_spin_lock_slowpath(struct qspinlock *lock)
> +#define MAX_NODES	4
> +
> +struct qnode {
> + struct qnode*next;
> + struct qspinlock *lock;
> + u8  locked; /* 1 if lock acquired */
> +};
> +
> +struct qnodes {
> + int count;
> + struct qnode nodes[MAX_NODES];
> +};

I think it could be worth commenting why qnodes::count is used instead of
_Q_TAIL_IDX_OFFSET.

> +
> +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +
> +static inline int encode_tail_cpu(void)

I think the generic version that takes smp_processor_id() as a parameter is 
clearer - at least with this function name.

> +{
> + return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> +}
> +
> +static inline int get_tail_cpu(int val)

It seems like there should be a "decode" function to pair up with the "encode" 
function.

> +{
> + return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> +}
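
i.e. something like this pair (sketch), following the generic code's shape:

static inline int encode_tail_cpu(int cpu)
{
	return (cpu + 1) << _Q_TAIL_CPU_OFFSET;
}

static inline int decode_tail_cpu(int val)
{
	return (val >> _Q_TAIL_CPU_OFFSET) - 1;
}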
> +
> +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

Does that comment mean it is not necessary to use an atomic_or here?

> +static __always_inline void lock_set_locked(struct qspinlock *lock)

nit: could just be called set_locked()

> +{
> + atomic_or(_Q_LOCKED_VAL, &lock->val);
> + __atomic_acquire_fence();
> +}
> +
> +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> int val)
> +{
> + int newval = _Q_LOCKED_VAL;
> +
> + if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, 
newval) == val);

> +}
> +
> +/*
> + * Publish our tail, replacing previous tail. Return previous value.
> + *
> + * This provides a release barrier for publishing node, and an acquire 
> barrier
> + * for getting the old node.
> + */
> +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)

Did you change from the xchg_tail() name in the generic version because of the 
release and acquire barriers this provides?
Does "publish" generally imply the old value will be returned?

>  {
> - while (!queued_spin_trylock(lock))
> + for (;;) {
> + int val = atomic_read(&lock->val);
> + int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> + int old;
> +
> + old = atomic_cmpxchg(&lock->val, val, newval);
> + if (old == val)
> + return old;
> + }
> +}
> +
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +{
> + int cpu = get_tail_cpu(val);
> + struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> + int idx;
> +
> + for (idx = 0; idx < MAX_NODES; idx++) {
> + struct qnode *qnode = &qnodesp->nodes[idx];
> + if (qnode->lock == lock)
> + return qnode;
> + }

In case anyone else is confused by this, Nick explained that each cpu can only
queue on a given spinlock once, regardless of "idx" level.

> +
> + BUG();
> +}
> +
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +{
> + struct qnodes *qnodesp;
> + struct qnode *next, *node;
> + int val, old, tail;
> + int idx;
> +
> + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
> + qnodesp = this_cpu_ptr(&qnodes);
> + if (unlikely(qnodesp->count == MAX_NODES)) {

The comparison is >= in the generic, I guess we've no 

Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-08-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> - smp_store_release(>locked, 0);
> - else
> - __pv_queued_spin_unlock(lock);
> + if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> + return 1;
> + return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);



Re: [PATCH] powerpc/64: Drop ppc_inst_as_str()

2022-06-02 Thread Jordan Niethe
On Thu, Jun 2, 2022 at 6:49 PM Segher Boessenkool
 wrote:
>
> On Thu, Jun 02, 2022 at 01:01:04PM +1000, Jordan Niethe wrote:
> > > What about the more fundamental thing?  Have the order of the two halves
> > > of a prefixed insn as ulong not depend on endianness?  It really is two
> > > opcodes, and the prefixed one is first, always, even in LE.
> > The reason would be that the value as a ulong is then used to write a
> > prefixed instruction to memory with std.
> > If both endiannesses had the halves in the same order, one of them would
> > store the suffix in front of the prefix.
>
> You cannot do such a (possibly) unaligned access from C though, not
> without invoking undefined behaviour.  The compiler usually lets you get
> away with it, but there are no guarantees.  You can make sure you only
> ever do such an access from assembler code of course.

Would using inline assembly to do it be ok?

>
> Swapping the two halves of a register costs at most one insn.  It is
> harmful premature optimisation to make this single cycle advantage
> override more important consideration (almost everything else :-) )

I'm not sure I follow. We are not doing this as an optimisation, but
out of the necessity of writing
the prefixed instruction to memory in a single instruction so that we
don't end up with half an
instruction in the kernel image.

>
>
> Segher


Re: [PATCH] powerpc/64: Drop ppc_inst_as_str()

2022-06-01 Thread Jordan Niethe
On Thu, Jun 2, 2022 at 2:22 AM Segher Boessenkool
 wrote:
>
> On Wed, Jun 01, 2022 at 08:43:01PM +1000, Michael Ellerman wrote:
> > Segher Boessenkool  writes:
> > > Hi!
> > >
> > > On Tue, May 31, 2022 at 04:59:36PM +1000, Michael Ellerman wrote:
> > >> More problematically it doesn't compile at all with GCC 12, due to the
> > >> fact that it returns the char buffer declared inside the macro:
> > >
> > > It returns a pointer to a buffer on stack.  It is not valid C to access
> > > that buffer after the function has returned (and indeed it does not
> > > work, in general).
> >
> > It's a statement expression though, not a function. So it doesn't return
> > as such, that would be obviously wrong.
>
> Yes, wrong language, my bad.  But luckily it doesn't matter if this is a
> function or not anyway: the question is about scopes and lifetimes :-)
>
> > But I'm not a language lawyer, so presumably it's not valid to refer to
> > the variable after it's gone out of scope.
> >
> > Although we do use that same pattern in many places where the value of
> > the expression is a scalar type.
>
> It's an object with automatic storage duration.  Its lifetime ends when
> the scope is left, which is at the end of the statement expression, so
> before the object is used.
>
> The value of the expression can be used just fine, sure, but the object
> it points to has ceased to exist, so dereferencing that pointer is
> undefined behaviour.
>
> > >> A simpler solution is to just print the value as an unsigned long. For
> > >> normal instructions the output is identical. For prefixed instructions
> > >> the value is printed as a single 64-bit quantity, whereas previously the
> > >> low half was printed first. But that is good enough for debug output,
> > >> especially as prefixed instructions will be rare in practice.
> > >
> > > Prefixed insns might be somewhat rare currently, but it will not stay
> > > that way.
> >
> > These are all printing kernel instructions, not userspace. I should have
> > said that in the change log.
>
> Ah!  In that case, it will take quite a bit longer before you will see
> many prefixed insns, sure.
>
> > The kernel doesn't build for -mcpu=power10 because we haven't done any
> > changes for pcrel.
> >
> > We will do that one day, but not soon.
>
> Yeah, pcrel is the big hitter currently.  But with the extra opcode
> space we have now, maybe something else will show up that even the
> kernel will use.  I cannot predict the future very well :-)
>
> > > It is not hard to fix the problem here?  The only tricky part is that
> > > ppc_inst_as_ulong swaps the two halves for LE, for as far as I can see
> > > no reason at all :-(
> > >
> > > If it didn't it would be easy to detect prefixed insns (because they
> > > then are guaranteed to be > 0x), and it is easy to print them
> > > with a space between the two opcodes, with a utility function:
> > >
> > > void print_insn_bytes_nicely(unsigned long insn)
> > > {
> > > if (insn > 0xffffffff)
> > > printf("%08x ", insn >> 32);
> > > printf("%08x", insn & 0xffffffff);
> > > }
> >
> > We don't want to do that because it can lead to interleaving messages
> > between different CPUs in the kernel log.
>
> Yuck.
>
> void print_insn_bytes_nicely(unsigned long insn)
> {
> if (insn > 0xffffffff)
> printf("%08x %08x", insn >> 32, insn & 0xffffffff);
> else
> printf("%08x", insn & 0xffffffff);
> }
>
> But it makes things much less enticing, alright.
>
> > In the medium term there's some changes to printk that might land soon
> > (printbuf), which would mean we could more easily define a custom printk
> > formatter for printing prefixed instructions.
>
> Yeah :-)
>
> What about the more fundamental thing?  Have the order of the two halves
> of a prefixed insn as ulong not depend on endianness?  It really is two
> opcodes, and the prefixed one is first, always, even in LE.
The reason would be that the value as a ulong is then used to write a
prefixed instruction to memory with std.
If both endiannesses had the halves in the same order, one of them would
store the suffix in front of the prefix.
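
For reference, the requirement is roughly this (simplified sketch of what
ppc_inst_as_ulong() has to do, taking the two words directly rather than a
ppc_inst_t):

static inline unsigned long prefixed_insn_as_ulong(u32 prefix, u32 suffix)
{
#ifdef __LITTLE_ENDIAN__
	/* a single 64-bit store puts the low word at the lower address on LE */
	return (unsigned long)suffix << 32 | prefix;
#else
	/* ... and the high word at the lower address on BE */
	return (unsigned long)prefix << 32 | suffix;
#endif
}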
>
>
> Segher


Re: [PATCH v7 3/5] powerpc: Rework and improve STRICT_KERNEL_RWX patching

2022-03-14 Thread Jordan Niethe
On Sat, Mar 12, 2022 at 6:30 PM Christophe Leroy
 wrote:
>
> Hi Jordan
>
> Le 10/11/2021 à 01:37, Jordan Niethe a écrit :
> > From: "Christopher M. Riedl" 
> >
> > Rework code-patching with STRICT_KERNEL_RWX to prepare for a later patch
> > which uses a temporary mm for patching under the Book3s64 Radix MMU.
> > Make improvements by adding a WARN_ON when the patchsite doesn't match
> > after patching and return the error from __patch_instruction() properly.
> >
> > Signed-off-by: Christopher M. Riedl 
> > Signed-off-by: Jordan Niethe 
> > ---
> > v7: still pass addr to map_patch_area()
>
>
> This patch doesn-t apply, can you rebase the series ?
Yep, will do.
>
> Thanks
> Christophe
>
> > ---
> >   arch/powerpc/lib/code-patching.c | 20 ++--
> >   1 file changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 29a30c3068ff..d586bf9c7581 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -75,6 +75,7 @@ static inline void stop_using_temp_mm(struct 
> > temp_mm_state prev_state)
> >   }
> >
> >   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > +static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> >
> >   static int text_area_cpu_up(unsigned int cpu)
> >   {
> > @@ -87,6 +88,7 @@ static int text_area_cpu_up(unsigned int cpu)
> >   return -1;
> >   }
> >   this_cpu_write(text_poke_area, area);
> > + this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
> >
> >   return 0;
> >   }
> > @@ -172,11 +174,10 @@ static inline int unmap_patch_area(unsigned long addr)
> >
> >   static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
> >   {
> > - int err;
> > + int err, rc = 0;
> >   u32 *patch_addr = NULL;
> >   unsigned long flags;
> >   unsigned long text_poke_addr;
> > - unsigned long kaddr = (unsigned long)addr;
> >
> >   /*
> >* During early early boot patch_instruction is called
> > @@ -188,15 +189,13 @@ static int do_patch_instruction(u32 *addr, struct 
> > ppc_inst instr)
> >
> >   local_irq_save(flags);
> >
> > - text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
> > - if (map_patch_area(addr, text_poke_addr)) {
> > - err = -1;
> > + text_poke_addr = __this_cpu_read(cpu_patching_addr);
> > + err = map_patch_area(addr, text_poke_addr);
> > + if (err)
> >   goto out;
> > - }
> > -
> > - patch_addr = (u32 *)(text_poke_addr + (kaddr & ~PAGE_MASK));
> >
> > - __patch_instruction(addr, instr, patch_addr);
> > + patch_addr = (u32 *)(text_poke_addr | offset_in_page(addr));
> > + rc = __patch_instruction(addr, instr, patch_addr);
> >
> >   err = unmap_patch_area(text_poke_addr);
> >   if (err)
> > @@ -204,8 +203,9 @@ static int do_patch_instruction(u32 *addr, struct 
> > ppc_inst instr)
> >
> >   out:
> >   local_irq_restore(flags);
> > + WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
> >
> > - return err;
> > + return rc ? rc : err;
> >   }
> >   #else /* !CONFIG_STRICT_KERNEL_RWX */
> >


Re: [PATCH 3/4] powerpc: Handle prefixed instructions in show_user_instructions()

2022-03-14 Thread Jordan Niethe
On Wed, Feb 23, 2022 at 1:34 AM Christophe Leroy
 wrote:
>
>
>
> Le 02/06/2020 à 07:27, Jordan Niethe a écrit :
> > Currently prefixed instructions are treated as two word instructions by
> > show_user_instructions(), treat them as a single instruction. '<' and
> > '>' are placed around the instruction at the NIP, and for prefixed
> > instructions this is placed around the prefix only. Make the '<' and '>'
> > wrap the prefix and suffix.
> >
> > Currently showing a prefixed instruction looks like:
> > fbe1fff8 3920 0600 a3e3 <0400> f7e4 ebe1fff8 4e800020
> >
> > Make it look like:
> > 0xfbe1fff8 0x3920 0x0600 0xa3e3 <0x0400 0xf7e4> 
> > 0xebe1fff8 0x4e800020 0x 0x
>
> Is it really needed to have the leading 0x ?
You are right, that is not consistent with how instructions are usually dumped.
That formatting comes from ppc_inst_as_str(); when mpe merged it, he removed
the leading 0x.

>
> And is there a reason for that two 0x at the end of the new line
> that we don't have at the end of the old line ?
No, that is wrong.
>
> This is initially split into 8 instructions per line in order to fit in
> a 80 columns screen/terminal.
>
> Could you make it such that it still fits within 80 cols ?
Sure that makes sense.
>
> Same for patch 4 on show_user_instructions()
>
> Christophe


[PATCH v7 5/5] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-11-09 Thread Jordan Niethe
From: "Christopher M. Riedl" 

When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization at boot for each
possible CPU in the system.

Introduce a new function, patch_instruction_mm(), to perform the
patching with a temporary mapping with write permissions at
patching_addr. Map the page with PAGE_KERNEL to set EAA[0] for the PTE
which ignores the AMR (so no need to unlock/lock KUAP) according to
PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 
Signed-off-by: Jordan Niethe 
---
v7: - Change to patch_instruction_mm() instead of map_patch_mm() and
   unmap_patch_mm()
- include ptesync
---
 arch/powerpc/lib/code-patching.c | 106 +--
 1 file changed, 101 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index aa466e4930ec..7722dec4a914 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -76,6 +77,7 @@ static inline void stop_using_temp_mm(struct temp_mm_state 
prev_state)
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -99,8 +101,48 @@ static int text_area_cpu_down(unsigned int cpu)
return 0;
 }
 
+static __always_inline void __poking_init_temp_mm(void)
+{
+   int cpu;
+   spinlock_t *ptl;
+   pte_t *ptep;
+   struct mm_struct *patching_mm;
+   unsigned long patching_addr;
+
+   for_each_possible_cpu(cpu) {
+   patching_mm = copy_init_mm();
+   WARN_ON(!patching_mm);
+   per_cpu(cpu_patching_mm, cpu) = patching_mm;
+
+   /*
+* Choose a randomized, page-aligned address from the range:
+* [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE] The lower
+* address bound is PAGE_SIZE to avoid the zero-page.  The
+* upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to
+* stay under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
+*/
+   patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) %
+(DEFAULT_MAP_WINDOW - 2 * 
PAGE_SIZE));
+   per_cpu(cpu_patching_addr, cpu) = patching_addr;
+
+   /*
+* PTE allocation uses GFP_KERNEL which means we need to
+* pre-allocate the PTE here because we cannot do the
+* allocation during patching when IRQs are disabled.
+*/
> +   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
+   WARN_ON(!ptep);
+   pte_unmap_unlock(ptep, ptl);
+   }
+}
+
 void __init poking_init(void)
 {
+   if (radix_enabled()) {
+   __poking_init_temp_mm();
+   return;
+   }
+
WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"powerpc/text_poke:online", text_area_cpu_up,
text_area_cpu_down) < 0);
@@ -167,6 +209,57 @@ static inline int unmap_patch_area(unsigned lon
