Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-09 Thread Bharata B Rao
On Thu, Jul 09, 2020 at 03:18:03PM +1000, Paul Mackerras wrote:
> On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> > In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> > H_RPT_INVALIDATE if available. The availability of this hcall
> > is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> > DT property.
> 
> What are we going to use when nested KVM supports HPT guests at L2?
> L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
> but H_RPT_INVALIDATE says in its name that it only handles radix
> page tables (i.e. R=1).

For L2 HPT guests, the old hcall (H_TLB_INVALIDATE) is expected to keep
working once support for the R=0 case is added to it?

The new hcall should be advertised via ibm,hypertas-functions only
for radix guests I suppose.
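(For reference, the guest-side detection is just a new entry in the
hypertas feature table; the pseries/firmware.c hunk isn't shown in this
excerpt, but it would presumably look roughly like the sketch below.)

/* Sketch of the arch/powerpc/platforms/pseries/firmware.c change */
static __initdata struct hypertas_fw_feature hypertas_fw_features_table[] = {
	/* ... existing entries ... */
	{FW_FEATURE_RPT_INVALIDATE,	"hcall-rpt-invalidate"},
};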

Regards,
Bharata.


Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-08 Thread Bharata B Rao
On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:
> When a secure memslot is dropped, all the pages backed in the secure device
> (aka really backed by secure memory by the Ultravisor) should be paged out
> to a normal page. Previously, this was achieved by triggering the page
> fault mechanism which calls kvmppc_svm_page_out() on each page.
> 
> This can't work when hot unplugging a memory slot because the memory slot
> is flagged as invalid and gfn_to_pfn() is then not trying to access the
> page, so the page fault mechanism is not triggered.
> 
> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
> simpler to call it directly instead of triggering such a mechanism. This
> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
> memslot.

Yes, this appears much simpler.

> 
> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
> the call to __kvmppc_svm_page_out() is made.
> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
> addition, the mmap_sem is held in read mode during that time, not in write
> mode since the virtual memory layout is not impacted, and
> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
> 
> Cc: Ram Pai 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
>  1 file changed, 37 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 852cc9ae6a0b..479ddf16d18c 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
> vm_area_struct *vma,
>   * fault on them, do fault time migration to replace the device PTEs in
>   * QEMU page table with normal PTEs from newly allocated pages.
>   */
> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>struct kvm *kvm, bool skip_page_out)
>  {
>   int i;
>   struct kvmppc_uvmem_page_pvt *pvt;
> - unsigned long pfn, uvmem_pfn;
> - unsigned long gfn = free->base_gfn;
> + struct page *uvmem_page;
> + struct vm_area_struct *vma = NULL;
> + unsigned long uvmem_pfn, gfn;
> + unsigned long addr, end;
> +
> + down_read(&kvm->mm->mmap_sem);

You should be using mmap_read_lock(kvm->mm) with recent kernels.
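i.e. something like this (a sketch; mmap_read_lock()/mmap_read_unlock()
are the new wrappers of the mmap locking API):

	mmap_read_lock(kvm->mm);
	/* ... fetch VMAs and page out the memslot ... */
	mmap_read_unlock(kvm->mm);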

> +
> + addr = slot->userspace_addr;
> + end = addr + (slot->npages * PAGE_SIZE);
>  
> - for (i = free->npages; i; --i, ++gfn) {
> - struct page *uvmem_page;
> + gfn = slot->base_gfn;
> + for (i = slot->npages; i; --i, ++gfn, addr += PAGE_SIZE) {
> +
> + /* Fetch the VMA if addr is not in the latest fetched one */
> + if (!vma || (addr < vma->vm_start || addr >= vma->vm_end)) {
> + vma = find_vma_intersection(kvm->mm, addr, end);
> + if (!vma ||
> + vma->vm_start > addr || vma->vm_end < end) {
> + pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
> + break;
> + }
> + }

The find_vma_intersection() here is called for the range spanning the
entire memslot, yet each iteration re-checks whether the VMA is still
valid for the new addr. I guess you wanted to fetch the VMA for one page
at a time and reuse it for subsequent pages as long as they fall within it?
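In other words, I would have expected something like the sketch below
(illustrative only), where the VMA is looked up per page and reused as
long as the address stays inside it:

	/*
	 * Sketch: fetch the VMA covering the current page and reuse it
	 * until addr walks past its end.
	 */
	if (!vma || addr >= vma->vm_end) {
		vma = find_vma_intersection(kvm->mm, addr, addr + PAGE_SIZE);
		if (!vma || vma->vm_start > addr) {
			pr_err("Can't find VMA for gfn:0x%lx\n", gfn);
			break;
		}
	}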

Regards,
Bharata.


Re: [PATCH v3 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-07-05 Thread Bharata B Rao
On Mon, Jul 06, 2020 at 07:19:02AM +0530, Santosh Sivaraj wrote:
> 
> Hi Bharata,
> 
> Bharata B Rao  writes:
> 
> > Make GTSE an MMU feature and enable it by default for radix.
> > However for guest, conditionally enable it if hypervisor supports
> > it via OV5 vector. Let prom_init ask for radix GTSE only if the
> > support exists.
> >
> > Having GTSE as an MMU feature will make it easy to enable radix
> > without GTSE. Currently radix assumes GTSE is enabled by default.
> >
> > Signed-off-by: Bharata B Rao 
> > Reviewed-by: Aneesh Kumar K.V 
> > ---
> >  arch/powerpc/include/asm/mmu.h|  4 
> >  arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
> >  arch/powerpc/kernel/prom_init.c   | 13 -
> >  arch/powerpc/mm/init_64.c |  5 -
> >  4 files changed, 17 insertions(+), 6 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
> > index f4ac25d4df05..884d51995934 100644
> > --- a/arch/powerpc/include/asm/mmu.h
> > +++ b/arch/powerpc/include/asm/mmu.h
> > @@ -28,6 +28,9 @@
> >   * Individual features below.
> >   */
> >  
> > +/* Guest Translation Shootdown Enable */
> > +#define MMU_FTR_GTSE   ASM_CONST(0x1000)
> > +
> >  /*
> >   * Support for 68 bit VA space. We added that from ISA 2.05
> >   */
> > @@ -173,6 +176,7 @@ enum {
> >  #endif
> >  #ifdef CONFIG_PPC_RADIX_MMU
> > MMU_FTR_TYPE_RADIX |
> > +   MMU_FTR_GTSE |
> >  #ifdef CONFIG_PPC_KUAP
> > MMU_FTR_RADIX_KUAP |
> >  #endif /* CONFIG_PPC_KUAP */
> > diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
> > b/arch/powerpc/kernel/dt_cpu_ftrs.c
> > index a0edeb391e3e..ac650c233cd9 100644
> > --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
> > +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
> > @@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct 
> > dt_cpu_feature *f)
> >  #ifdef CONFIG_PPC_RADIX_MMU
> > cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
> > cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
> > +   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
> > cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
> >  
> > return 1;
> > diff --git a/arch/powerpc/kernel/prom_init.c 
> > b/arch/powerpc/kernel/prom_init.c
> > index 90c604d00b7d..cbc605cfdec0 100644
> > --- a/arch/powerpc/kernel/prom_init.c
> > +++ b/arch/powerpc/kernel/prom_init.c
> > @@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
> > }
> > }
> >  
> > -   if (supported.radix_mmu && supported.radix_gtse &&
> > -   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
> > -   /* Radix preferred - but we require GTSE for now */
> > -   prom_debug("Asking for radix with GTSE\n");
> > +   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
> > +   /* Radix preferred - Check if GTSE is also supported */
> > +   prom_debug("Asking for radix\n");
> > ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
> > -   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
> > +   if (supported.radix_gtse)
> > +   ibm_architecture_vec.vec5.radix_ext =
> > +   OV5_FEAT(OV5_RADIX_GTSE);
> > +   else
> > +   prom_debug("Radix GTSE isn't supported\n");
> > } else if (supported.hash_mmu) {
> > /* Default to hash mmu (if we can) */
> > prom_debug("Asking for hash\n");
> > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> > index bc73abf0bc25..152aa0200cef 100644
> > --- a/arch/powerpc/mm/init_64.c
> > +++ b/arch/powerpc/mm/init_64.c
> > @@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
> > if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
> > OV5_FEAT(OV5_RADIX_GTSE))) {
> > pr_warn("WARNING: Hypervisor doesn't support RADIX with GTSE\n");
> > -   }
> > +   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
> > +   } else
> > +   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
> 
> The GTSE flag is set by default in feat_enable_mmu_radix(), should it
> be set again here?

Strictly speaking no, but it makes things a bit more explicit and also
follows what the related feature does below:

> > /* Do radix anyway - the hypervisor said we had to */
> > cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;

Regards,
Bharata.


[RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-03 Thread Bharata B Rao
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
H_RPT_INVALIDATE if available. The availability of this hcall
is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
DT property.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/firmware.h   |  4 +++-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 26 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 ++--
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE   ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+   FW_FEATURE_RPT_INVALIDATE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e738ea652192..8411e42eedbd 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -313,9 +314,17 @@ void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned 
long addr,
}
 
psi = shift_to_mmu_psize(pshift);
-   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
-   lpid, rb);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) {
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(0, 0, 1), lpid, rb);
+   } else {
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB,
+   psize_to_rpti_pgsize(psi),
+   addr, addr + psize);
+   }
if (rc)
pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
@@ -329,8 +338,15 @@ static void kvmppc_radix_flush_pwc(struct kvm *kvm, 
unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_PWC, H_RPTI_PAGE_ALL,
+   0, -1UL);
if (rc)
pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index efb78d37f29a..4d023c451be4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct patb_entry *pseries_partition_tb;
 
@@ -401,8 +402,16 @@ static void kvmhv_flush_lpid(unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC |
+

[RFC PATCH v0 1/2] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case only)

2020-07-03 Thread Bharata B Rao
Implements H_RPT_INVALIDATE hcall and supports only nested case
currently.

A KVM capability KVM_CAP_RPT_INVALIDATE is added to indicate the
support for this hcall.

Signed-off-by: Bharata B Rao 
---
 Documentation/virt/kvm/api.rst| 17 
 .../include/asm/book3s/64/tlbflush-radix.h| 18 
 arch/powerpc/include/asm/kvm_book3s.h |  3 +
 arch/powerpc/kvm/book3s_hv.c  | 32 +++
 arch/powerpc/kvm/book3s_hv_nested.c   | 94 +++
 arch/powerpc/kvm/powerpc.c|  3 +
 arch/powerpc/mm/book3s64/radix_tlb.c  |  4 -
 include/uapi/linux/kvm.h  |  1 +
 8 files changed, 168 insertions(+), 4 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 426f94582b7a..d235d16a4bf0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5843,6 +5843,23 @@ controlled by the kvm module parameter halt_poll_ns. 
This capability allows
 the maximum halt time to specified on a per-VM basis, effectively overriding
 the module parameter for the target VM.
 
+7.21 KVM_CAP_RPT_INVALIDATE
+--
+
+:Capability: KVM_CAP_RPT_INVALIDATE
+:Architectures: ppc
+:Type: vm
+
+This capability indicates that the kernel is capable of handling
+H_RPT_INVALIDATE hcall.
+
+In order to enable the use of H_RPT_INVALIDATE in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+This capability is always enabled.
+
 8. Other capabilities.
 ==
 
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 94439e0cefc9..aace7e9b2397 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -4,6 +4,10 @@
 
 #include 
 
+#define RIC_FLUSH_TLB 0
+#define RIC_FLUSH_PWC 1
+#define RIC_FLUSH_ALL 2
+
 struct vm_area_struct;
 struct mm_struct;
 struct mmu_gather;
@@ -21,6 +25,20 @@ static inline u64 psize_to_rpti_pgsize(unsigned long psize)
return H_RPTI_PAGE_ALL;
 }
 
+static inline int rpti_pgsize_to_psize(unsigned long page_size)
+{
+   if (page_size == H_RPTI_PAGE_4K)
+   return MMU_PAGE_4K;
+   if (page_size == H_RPTI_PAGE_64K)
+   return MMU_PAGE_64K;
+   if (page_size == H_RPTI_PAGE_2M)
+   return MMU_PAGE_2M;
+   if (page_size == H_RPTI_PAGE_1G)
+   return MMU_PAGE_1G;
+   else
+   return MMU_PAGE_64K; /* Default */
+}
+
 static inline int mmu_get_ap(int psize)
 {
return mmu_psize_defs[psize].ap;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index d32ec9ae73bd..0f1c5fa6e8ce 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -298,6 +298,9 @@ void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 
dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
 long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
+long kvmhv_h_rpti_nested(struct kvm_vcpu *vcpu, unsigned long lpid,
+unsigned long type, unsigned long pg_sizes,
+unsigned long start, unsigned long end);
 int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu,
  u64 time_limit, unsigned long lpcr);
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6bf66649ab92..2f772183f249 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -895,6 +895,28 @@ static int kvmppc_get_yield_count(struct kvm_vcpu *vcpu)
return yield_count;
 }
 
+static long kvmppc_h_rpt_invalidate(struct kvm_vcpu *vcpu,
+   unsigned long pid, unsigned long target,
+   unsigned long type, unsigned long pg_sizes,
+   unsigned long start, unsigned long end)
+{
+   if (end < start)
+   return H_P5;
+
+   if (!(type & H_RPTI_TYPE_NESTED))
+   return H_P3;
+
+   if (!nesting_enabled(vcpu->kvm))
+   return H_FUNCTION;
+
+   /* Support only cores as target */
+   if (target != H_RPTI_TARGET_CMMU)
+   return H_P2;
+
+   return kvmhv_h_rpti_nested(vcpu, pid, (type & ~H_RPTI_TYPE_NESTED),
+  pg_sizes, start, end);
+}
+
 int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 {
unsigned long req = kvmppc_get_gpr(vcpu, 3);
@@ -1103,6 +1125,14 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 */
ret = kvmppc_h_svm_init_abort

[RFC PATCH v0 0/2] Use H_RPT_INVALIDATE for nested guest

2020-07-03 Thread Bharata B Rao
This patchset adds support for the new hcall H_RPT_INVALIDATE
(currently handles nested case only) and replaces the nested tlb flush
calls with this new hcall if the support for the same exists.

This applies on top of "[PATCH v3 0/3] Off-load TLB invalidations to host
for !GTSE" patchset that was posted at:

https://lore.kernel.org/linuxppc-dev/20200703053608.12884-1-bhar...@linux.ibm.com/T/#t

H_RPT_INVALIDATE

Syntax:
int64   /* H_Success: Return code on successful completion */
    /* H_Busy - repeat the call with the same */
    /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid parameters */
    hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation 
lookaside information */
  uint64 pid,   /* PID/LPID to invalidate */
  uint64 target,    /* Invalidation target */
  uint64 type,  /* Type of lookaside information */
  uint64 pageSizes, /* Page sizes */
  uint64 start, /* Start of Effective Address (EA) range 
(inclusive) */
  uint64 end)   /* End of EA range (exclusive) */

Invalidation targets (target)
-
Core MMU    0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU    0x04 /* All nest/accelerator agents in use by the partition */

A combination of the above can be specified, except core and core local.

Type of translation to invalidate (type)
---
NESTED   0x0001  /* Invalidate nested guest partition-scope */
TLB  0x0002  /* Invalidate TLB */
PWC  0x0004  /* Invalidate Page Walk Cache */
PRT  0x0008  /* Invalidate Process Table Entries if NESTED is clear */
PAT  0x0008  /* Invalidate Partition Table Entries if NESTED is set */

A combination of the above can be specified.

Page size mask (pageSizes)
--
4K  0x01
64K 0x02
2M  0x04
1G  0x08
All sizes   (-1UL)

A combination of the above can be specified.
All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information
   matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters are
  different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested
  LPID allocated to this partition.
* Return H_P5 if (start, end) doesn't form a valid range. Start and end
  should be valid quadrant addresses and end > start.
* Return H_NotSupported if the partition is not running in radix
  translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid addresses.
  Else start and end should be aligned to 4kB (lower 11 bits clear).
* If NESTED is clear, then invalidate process scoped lookaside information.
  Else pid specifies a nested LPID, and the invalidation is performed
  on nested guest partition table and nested guest partition scope real
  addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and
  quadrant 0 spaces; else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
  Those which are partially covered are considered outside invalidation
  range, which allows a caller to optimally invalidate ranges that may
  contain mixed page sizes.
* Return H_SUCCESS on success.
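For reference, in-kernel callers go through the pseries_rpt_invalidate()
wrapper added in the prerequisite series; for illustration, flushing all
TLB and PWC entries of a nested guest LPID across the whole address range
would look like this:

	long rc;

	/*
	 * Invalidate all TLB and PWC entries of the nested guest (lpid),
	 * for all page sizes and the whole address range.
	 */
	rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
				    H_RPTI_TYPE_NESTED | H_RPTI_TYPE_TLB |
				    H_RPTI_TYPE_PWC,
				    H_RPTI_PAGE_ALL, 0, -1UL);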

Bharata B Rao (2):
  KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE (nested case
only)
  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

 Documentation/virt/kvm/api.rst|  17 +++
 .../include/asm/book3s/64/tlbflush-radix.h|  18 +++
 arch/powerpc/include/asm/firmware.h   |   4 +-
 arch/powerpc/include/asm/kvm_book3s.h |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c|  26 -
 arch/powerpc/kvm/book3s_hv.c  |  32 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 107 +-
 arch/powerpc/kvm/powerpc.c|   3 +
 arch/powerpc/mm/book3s64/radix_tlb.c  |   4 -
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 include/uapi/linux/kvm.h  |   1 +
 11 files changed, 204 insertions(+), 12 deletions(-)

-- 
2.21.3



[PATCH v3 3/3] powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when !GTSE

2020-07-02 Thread Bharata B Rao
From: Nicholas Piggin 

When platform doesn't support GTSE, let TLB invalidation requests
for radix guests be off-loaded to the host using H_RPT_INVALIDATE
hcall.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Bharata B Rao 
[hcall wrapper, error path handling and renames]
---
 .../include/asm/book3s/64/tlbflush-radix.h| 15 
 arch/powerpc/include/asm/hvcall.h | 34 +++-
 arch/powerpc/include/asm/plpar_wrappers.h | 52 
 arch/powerpc/mm/book3s64/radix_tlb.c  | 82 +--
 4 files changed, 175 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index ca8db193ae38..94439e0cefc9 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -2,10 +2,25 @@
 #ifndef _ASM_POWERPC_TLBFLUSH_RADIX_H
 #define _ASM_POWERPC_TLBFLUSH_RADIX_H
 
+#include 
+
 struct vm_area_struct;
 struct mm_struct;
 struct mmu_gather;
 
+static inline u64 psize_to_rpti_pgsize(unsigned long psize)
+{
+   if (psize == MMU_PAGE_4K)
+   return H_RPTI_PAGE_4K;
+   if (psize == MMU_PAGE_64K)
+   return H_RPTI_PAGE_64K;
+   if (psize == MMU_PAGE_2M)
+   return H_RPTI_PAGE_2M;
+   if (psize == MMU_PAGE_1G)
+   return H_RPTI_PAGE_1G;
+   return H_RPTI_PAGE_ALL;
+}
+
 static inline int mmu_get_ap(int psize)
 {
return mmu_psize_defs[psize].ap;
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index e90c073e437e..43486e773bd6 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -305,7 +305,8 @@
 #define H_SCM_UNBIND_ALL0x3FC
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
-#define MAX_HCALL_OPCODE   H_SCM_PERFORMANCE_STATS
+#define H_RPT_INVALIDATE   0x448
+#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -389,6 +390,37 @@
 #define PROC_TABLE_RADIX   0x04
 #define PROC_TABLE_GTSE0x01
 
+/*
+ * Defines for
+ * H_RPT_INVALIDATE - Invalidate RPT translation lookaside information.
+ */
+
+/* Type of translation to invalidate (type) */
+#define H_RPTI_TYPE_NESTED 0x0001  /* Invalidate nested guest 
partition-scope */
+#define H_RPTI_TYPE_TLB0x0002  /* Invalidate TLB */
+#define H_RPTI_TYPE_PWC0x0004  /* Invalidate Page Walk Cache */
+/* Invalidate Process Table Entries if H_RPTI_TYPE_NESTED is clear */
+#define H_RPTI_TYPE_PRT0x0008
+/* Invalidate Partition Table Entries if H_RPTI_TYPE_NESTED is set */
+#define H_RPTI_TYPE_PAT0x0008
+#define H_RPTI_TYPE_ALL(H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC | \
+H_RPTI_TYPE_PRT)
+#define H_RPTI_TYPE_NESTED_ALL (H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC | \
+H_RPTI_TYPE_PAT)
+
+/* Invalidation targets (target) */
+#define H_RPTI_TARGET_CMMU 0x01 /* All virtual processors in the 
partition */
+#define H_RPTI_TARGET_CMMU_LOCAL   0x02 /* Current virtual processor */
+/* All nest/accelerator agents in use by the partition */
+#define H_RPTI_TARGET_NMMU 0x04
+
+/* Page size mask (page sizes) */
+#define H_RPTI_PAGE_4K 0x01
+#define H_RPTI_PAGE_64K0x02
+#define H_RPTI_PAGE_2M 0x04
+#define H_RPTI_PAGE_1G 0x08
+#define H_RPTI_PAGE_ALL (-1UL)
+
 #ifndef __ASSEMBLY__
 #include 
 
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 4497c8afb573..4293c5d2ddf4 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -334,6 +334,51 @@ static inline long plpar_get_cpu_characteristics(struct 
h_cpu_char_result *p)
return rc;
 }
 
+/*
+ * Wrapper to H_RPT_INVALIDATE hcall that handles return values appropriately
+ *
+ * - Returns H_SUCCESS on success
+ * - For H_BUSY return value, we retry the hcall.
+ * - For any other hcall failures, attempt a full flush once before
+ *   resorting to BUG().
+ *
+ * Note: This hcall is expected to fail only very rarely. The correct
+ * error recovery of killing the process/guest will be eventually
+ * needed.
+ */
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 type,
+ u64 page_sizes, u64 start, u64 end)
+{
+   long rc;
+   unsigned long all;
+
+   while (true) {
+   rc = plpar_hcall_norets(H_RPT_INVALIDATE, pid, target, type,
+   page_sizes, start, end);
+   if (rc == H_BUSY) {
+   cpu_relax();
+   continue;
+   } else if (rc == H_SUCCESS)
+   return rc;
+
+   /* Flush request failed, try w

[PATCH v3 2/3] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-07-02 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
Reviewed-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index fd26f3d21d7b..f82569a505f1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned 
long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



[PATCH v3 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-07-02 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However for guest, conditionally enable it if hypervisor supports
it via OV5 vector. Let prom_init ask for radix GTSE only if the
support exists.

Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.

Signed-off-by: Bharata B Rao 
Reviewed-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/mmu.h|  4 
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 -
 arch/powerpc/mm/init_64.c |  5 -
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index a0edeb391e3e..ac650c233cd9 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct 
dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 90c604d00b7d..cbc605cfdec0 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index bc73abf0bc25..152aa0200cef 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with 
GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
 }
 
-- 
2.21.3



[PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE

2020-07-02 Thread Bharata B Rao
The hypervisor may choose not to enable the Guest Translation Shootdown
Enable (GTSE) option for the guest. When GTSE isn't on, the guest OS isn't
permitted to use instructions like tlbie and tlbsync directly, but is
expected to make hypervisor calls to get the TLB flushed.

This series enables the TLB flush routines in the radix code to
off-load TLB flushing to hypervisor via the newly proposed hcall
H_RPT_INVALIDATE. 
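Concretely, each flush path in radix_tlb.c ends up with a pattern along
these lines (a simplified sketch of the approach; details such as the
nest-MMU target handling are omitted):

	if (!mmu_has_feature(MMU_FTR_GTSE)) {
		/* GTSE off: ask the hypervisor to do the invalidation */
		pseries_rpt_invalidate(pid, H_RPTI_TARGET_CMMU,
				       H_RPTI_TYPE_ALL, H_RPTI_PAGE_ALL,
				       0, -1UL);
	} else {
		/* GTSE on: issue tlbie directly as before */
		_tlbie_pid(pid, RIC_FLUSH_ALL);
	}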

To easily check the availability of GTSE, it is made an MMU feature.
The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
handle GTSE as an optionally available feature and to not assume GTSE
when radix support is available.

The actual hcall implementation for KVM isn't included in this
patchset and will be posted separately.

Changes in v3
=
- Fixed a bug in the hcall wrapper code where we were missing setting
  H_RPTI_TYPE_NESTED while retrying the failed flush request with
  a full flush for the nested case.
- s/psize_to_h_rpti/psize_to_rpti_pgsize

v2: 
https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bhar...@linux.ibm.com/T/#t

Bharata B Rao (2):
  powerpc/mm: Enable radix GTSE only if supported.
  powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
enabled

Nicholas Piggin (1):
  powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
!GTSE

 .../include/asm/book3s/64/tlbflush-radix.h| 15 
 arch/powerpc/include/asm/hvcall.h | 34 +++-
 arch/powerpc/include/asm/mmu.h|  4 +
 arch/powerpc/include/asm/plpar_wrappers.h | 52 
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 +--
 arch/powerpc/mm/book3s64/radix_tlb.c  | 82 +--
 arch/powerpc/mm/init_64.c |  5 +-
 arch/powerpc/platforms/pseries/lpar.c |  8 +-
 9 files changed, 197 insertions(+), 17 deletions(-)

-- 
2.21.3



Re: [PATCH v2 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-06-28 Thread Bharata B Rao
On Fri, Jun 26, 2020 at 05:55:30PM -0300, Murilo Opsfelder Araújo wrote:
> > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> > index bc73abf0bc25..152aa0200cef 100644
> > --- a/arch/powerpc/mm/init_64.c
> > +++ b/arch/powerpc/mm/init_64.c
> > @@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
> > if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
> > OV5_FEAT(OV5_RADIX_GTSE))) {
> > pr_warn("WARNING: Hypervisor doesn't support RADIX with GTSE\n");
> > -   }
> > +   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
> > +   } else
> > +   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
> > /* Do radix anyway - the hypervisor said we had to */
> > cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
> > } else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
> > /* Hypervisor only supports hash - disable radix */
> > cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> > +   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
> > }
> >  }
> 
> Is this a part of the code where mmu_clear_feature() cannot be used?

Yes, it appears so. Jump label initialization isn't done yet.

Regards,
Bharata.


Re: [PATCH v3 0/4] Migrate non-migrated pages of a SVM.

2020-06-28 Thread Bharata B Rao
On Sun, Jun 28, 2020 at 09:41:53PM +0530, Bharata B Rao wrote:
> On Fri, Jun 19, 2020 at 03:43:38PM -0700, Ram Pai wrote:
> > The time taken to switch a VM to Secure-VM increases with the size of the
> > VM. A 100GB VM takes about 7 minutes. This is unacceptable. This linear
> > increase is
> > caused by a suboptimal behavior by the Ultravisor and the Hypervisor.  The
> > Ultravisor unnecessarily migrates all the GFN of the VM from normal-memory 
> > to
> > secure-memory. It has to just migrate the necessary and sufficient GFNs.
> > 
> > However when the optimization is incorporated in the Ultravisor, the 
> > Hypervisor
> > starts misbehaving. The Hypervisor has an inbuilt assumption that the
> > Ultravisor
> > will explicitly request to migrate each and every GFN of the VM. If only
> > necessary and sufficient GFNs are requested for migration, the Hypervisor
> > continues to manage the remaining GFNs as normal GFNs. This leads to memory
> > corruption, manifested consistently when the SVM reboots.
> > 
> > The same is true, when a memory slot is hotplugged into a SVM. The 
> > Hypervisor
> > expects the ultravisor to request migration of all GFNs to secure-GFN.  But 
> > at
> > the same time, the hypervisor is unable to handle any H_SVM_PAGE_IN requests
> > from the Ultravisor, done in the context of UV_REGISTER_MEM_SLOT ucall.  
> > This
> > problem manifests as random errors in the SVM, when a memory-slot is
> > hotplugged.
> > 
> > This patch series automatically migrates the non-migrated pages of a SVM,
> >  and thus solves the problem.
> 
> So this is what I understand as the objective of this patchset:
> 
> 1. Getting all the pages into the secure memory right when the guest
>transitions into secure mode is expensive. Ultravisor wants to just get
>the necessary and sufficient pages in and put the onus on the Hypervisor
>to mark the remaining pages (w/o actual page-in) as secure during
>H_SVM_INIT_DONE.
> 2. During H_SVM_INIT_DONE, you want a way to differentiate the pages that
>are already secure from the pages that are shared and that are paged-out.
>For this you are introducing all these new states in HV.
> 
> UV knows about the shared GFNs and maintains the state of the same. Hence
> let HV send all the pages (minus already secured pages) via H_SVM_PAGE_IN
> and if UV finds any shared pages in them, let it fail the uv-page-in call.
> Then HV can fail the migration for it  and the page continues to remain
> shared. With this, you don't need to maintain a state for secured GFN in HV.
> 
> In the unlikely case of sending a paged-out page to UV during
> H_SVM_INIT_DONE, let the page-in succeed and HV will fault on it again
> if required. With this, you don't need a state in HV to identify a
> paged-out-but-encrypted state.
> 
> Doesn't the above work?

I see that you in fact want to skip the uv-page-in calls from H_SVM_INIT_DONE.
So that would need the extra states in HV which you are proposing here.

Regards,
Bharata.


Re: [PATCH v3 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-28 Thread Bharata B Rao
On Fri, Jun 19, 2020 at 03:43:41PM -0700, Ram Pai wrote:
> H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly

As noted in the last iteration, can you reword the above please?
I don't see it as an incorrect assumption, but rather as an extension of
scope now :-)

> called H_SVM_PAGE_IN for all secure pages. These GFNs continue to be
> normal GFNs associated with normal PFNs; when infact, these GFNs should
> have been secure GFNs, associated with device PFNs.
> 
> Move all the PFNs associated with the SVM's GFNs, to secure-PFNs, in
> H_SVM_INIT_DONE. Skip the GFNs that are already Paged-in or Shared
> through H_SVM_PAGE_IN, or Paged-in followed by a Paged-out through
> UV_PAGE_OUT.
> 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Cc: Bharata B Rao 
> Cc: Aneesh Kumar K.V 
> Cc: Sukadev Bhattiprolu 
> Cc: Laurent Dufour 
> Cc: Thiago Jung Bauermann 
> Cc: David Gibson 
> Cc: Claudio Carvalho 
> Cc: kvm-...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Ram Pai 
> ---
>  Documentation/powerpc/ultravisor.rst|   2 +
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c  | 154 
> +++-
>  3 files changed, 132 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst 
> b/Documentation/powerpc/ultravisor.rst
> index 363736d..3bc8957 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -933,6 +933,8 @@ Return values
>   * H_UNSUPPORTED if called from the wrong context (e.g.
>   from an SVM or before an H_SVM_INIT_START
>   hypercall).
> + * H_STATE   if the hypervisor could not successfully
> +transition the VM to Secure VM.
>  
>  Description
>  ~~~
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
> b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> index 5a9834e..b9cd7eb 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -22,6 +22,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
>  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
>  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>struct kvm *kvm, bool skip_page_out);
> +int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
> + const struct kvm_memory_slot *memslot);
>  #else
>  static inline int kvmppc_uvmem_init(void)
>  {
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index c8c0290..449e8a7 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -93,6 +93,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static struct dev_pagemap kvmppc_uvmem_pgmap;
>  static unsigned long *kvmppc_uvmem_bitmap;
> @@ -339,6 +340,21 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, 
> struct kvm *kvm,
>   return false;
>  }
>  
> +/* return true, if the GFN is a shared-GFN, or a secure-GFN */
> +bool kvmppc_gfn_has_transitioned(unsigned long gfn, struct kvm *kvm)
> +{
> + struct kvmppc_uvmem_slot *p;
> +
> + list_for_each_entry(p, &kvm->arch.uvmem_pfns, list) {
> + if (gfn >= p->base_pfn && gfn < p->base_pfn + p->nr_pfns) {
> + unsigned long index = gfn - p->base_pfn;
> +
> + return (p->pfns[index] & KVMPPC_GFN_FLAG_MASK);
> + }
> + }
> + return false;
> +}
> +
>  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  {
>   struct kvm_memslots *slots;
> @@ -379,12 +395,31 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  
>  unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
>  {
> + struct kvm_memslots *slots;
> + struct kvm_memory_slot *memslot;
> + int srcu_idx;
> + long ret = H_SUCCESS;
> +
>   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
>   return H_UNSUPPORTED;
>  
> + /* migrate any unmoved normal pfn to device pfns*/
> + srcu_idx = srcu_read_lock(&kvm->srcu);
> + slots = kvm_memslots(kvm);
> + kvm_for_each_memslot(memslot, slots) {
> + ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
> + if (ret) {
> + ret = H_STATE;
> + goto out;
> + }
> + }
> +
>   kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
>   

Re: [PATCH v3 0/4] Migrate non-migrated pages of a SVM.

2020-06-28 Thread Bharata B Rao
On Fri, Jun 19, 2020 at 03:43:38PM -0700, Ram Pai wrote:
> The time taken to switch a VM to Secure-VM increases with the size of the VM.
> A 100GB VM takes about 7 minutes. This is unacceptable. This linear increase is
> caused by a suboptimal behavior by the Ultravisor and the Hypervisor.  The
> Ultravisor unnecessarily migrates all the GFN of the VM from normal-memory to
> secure-memory. It has to just migrate the necessary and sufficient GFNs.
> 
> However when the optimization is incorporated in the Ultravisor, the 
> Hypervisor
> starts misbehaving. The Hypervisor has an inbuilt assumption that the
> Ultravisor
> will explicitly request to migrate each and every GFN of the VM. If only
> necessary and sufficient GFNs are requested for migration, the Hypervisor
> continues to manage the remaining GFNs as normal GFNs. This leads to memory
> corruption, manifested consistently when the SVM reboots.
> 
> The same is true, when a memory slot is hotplugged into a SVM. The Hypervisor
> expects the ultravisor to request migration of all GFNs to secure-GFN.  But at
> the same time, the hypervisor is unable to handle any H_SVM_PAGE_IN requests
> from the Ultravisor, done in the context of UV_REGISTER_MEM_SLOT ucall.  This
> problem manifests as random errors in the SVM, when a memory-slot is
> hotplugged.
> 
> This patch series automatically migrates the non-migrated pages of a SVM,
>  and thus solves the problem.

So this is what I understand as the objective of this patchset:

1. Getting all the pages into the secure memory right when the guest
   transitions into secure mode is expensive. Ultravisor wants to just get
   the necessary and sufficient pages in and put the onus on the Hypervisor
   to mark the remaining pages (w/o actual page-in) as secure during
   H_SVM_INIT_DONE.
2. During H_SVM_INIT_DONE, you want a way to differentiate the pages that
   are already secure from the pages that are shared and that are paged-out.
   For this you are introducing all these new states in HV.

UV knows about the shared GFNs and maintains the state of the same. Hence
let HV send all the pages (minus already secured pages) via H_SVM_PAGE_IN
and if UV finds any shared pages in them, let it fail the uv-page-in call.
Then HV can fail the migration for it  and the page continues to remain
shared. With this, you don't need to maintain a state for secured GFN in HV.

In the unlikely case of sending a paged-out page to UV during
H_SVM_INIT_DONE, let the page-in succeed and HV will fault on it again
if required. With this, you don't need a state in HV to identify a
paged-out-but-encrypted state.

Doesn't the above work? If so, we can avoid all those extra states
in HV. That way HV can continue to differentiate only between two types
of pages: secure and not-secure. The rest of the states (shared,
paged-out-encrypted) actually belong to SVM/UV and let UV take care of
them.
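To make the suggestion concrete, a rough sketch of what H_SVM_INIT_DONE
would do under this scheme (the helper name is purely illustrative):

	unsigned long gfn;

	/*
	 * For every GFN that HV still considers normal, attempt a page-in;
	 * if UV rejects it (e.g. a shared GFN), the page simply stays as it
	 * is. No extra per-GFN state is kept in HV.
	 */
	for (gfn = memslot->base_gfn;
	     gfn < memslot->base_gfn + memslot->npages; gfn++) {
		if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL))
			continue;	/* already secure */
		/* hypothetical helper wrapping the usual migrate_vma +
		 * UV_PAGE_IN sequence; its failure is simply ignored */
		(void)try_uv_page_in(kvm, memslot, gfn);
	}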

Or did I miss something?

Regards,
Bharata.


[PATCH v2 1/3] powerpc/mm: Enable radix GTSE only if supported.

2020-06-26 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However for guest, conditionally enable it if hypervisor supports
it via OV5 vector. Let prom_init ask for radix GTSE only if the
support exists.

Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/mmu.h|  4 
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 -
 arch/powerpc/mm/init_64.c |  5 -
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index a0edeb391e3e..ac650c233cd9 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -336,6 +336,7 @@ static int __init feat_enable_mmu_radix(struct 
dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 90c604d00b7d..cbc605cfdec0 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index bc73abf0bc25..152aa0200cef 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -407,12 +407,15 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
 }
 
-- 
2.21.3



[PATCH v2 2/3] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-06-26 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index fd26f3d21d7b..f82569a505f1 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned 
long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



Re: [PATCH v1 2/3] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

2020-06-23 Thread Bharata B Rao
On Tue, Jun 23, 2020 at 04:07:34PM +0530, Aneesh Kumar K.V wrote:
> Bharata B Rao  writes:
> 
> > We can hit the following BUG_ON during memory unplug:
> >
> > kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
> > Oops: Exception in kernel mode, sig: 5 [#1]
> > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > NIP [c0093308] pmd_fragment_free+0x48/0xc0
> > LR [c147bfec] remove_pagetable+0x578/0x60c
> > Call Trace:
> > 0xc0805000 (unreliable)
> > remove_pagetable+0x384/0x60c
> > radix__remove_section_mapping+0x18/0x2c
> > remove_section_mapping+0x1c/0x3c
> > arch_remove_memory+0x11c/0x180
> > try_remove_memory+0x120/0x1b0
> > __remove_memory+0x20/0x40
> > dlpar_remove_lmb+0xc0/0x114
> > dlpar_memory+0x8b0/0xb20
> > handle_dlpar_errorlog+0xc0/0x190
> > pseries_hp_work_fn+0x2c/0x60
> > process_one_work+0x30c/0x810
> > worker_thread+0x98/0x540
> > kthread+0x1c4/0x1d0
> > ret_from_kernel_thread+0x5c/0x74
> >
> > This occurs when unplug is attempted for such memory which has
> > been mapped using memblock pages as part of early kernel page
> > table setup. We wouldn't have initialized the PMD or PTE fragment
> > count for those PMD or PTE pages.
> >
> > Fixing this includes 3 parts:
> >
> > - Re-walk the init_mm page tables from mem_init() and initialize
> >   the PMD and PTE fragment count to 1.
> > - When freeing PUD, PMD and PTE page table pages, check explicitly
> >   if they come from memblock and, if so, free them appropriately.
> > - When we do early memblock based allocation of PMD and PUD pages,
> >   allocate in PAGE_SIZE granularity so that we are sure the
> >   complete page is used as pagetable page.
> >
> > Since we now do PAGE_SIZE allocations for both PUD table and
> > PMD table (Note that PTE table allocation is already of PAGE_SIZE),
> > we end up allocating more memory for the same amount of system RAM.
> > Here is a comparison of how much more we need for a 64T and 2G
> > system after this patch:
> >
> > 1. 64T system
> > -
> > 64T RAM would need 64G for vmemmap with struct page size being 64B.
> >
> > 128 PUD tables for 64T memory (1G mappings)
> > 1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)
> >
> > With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
> > With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K
> >
> > 2. 2G system
> > 
> > 2G RAM would need 2M for vmemmap with struct page size being 64B.
> >
> > 1 PUD table for 2G memory (1G mapping)
> > 1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)
> >
> > With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
> > With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K
> 
> How about we just do
> 
> void pmd_fragment_free(unsigned long *pmd)
> {
>   struct page *page = virt_to_page(pmd);
> 
>   /*
>* Early pmd pages allocated via memblock
>* allocator need to be freed differently
>*/
>   if (PageReserved(page))
>   return free_reserved_page(page);
> 
>   BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
>   if (atomic_dec_and_test(&page->pt_frag_refcount)) {
>   pgtable_pmd_page_dtor(page);
>   __free_page(page);
>   }
> }
> 
> That way we could avoid the fixup_pgtable_fragments completely?

Yes we could, by doing the same for pte_fragment_free() too.
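i.e. something along these lines (a sketch mirroring your pmd version
above; untested):

void pte_fragment_free(unsigned long *table, int kernel)
{
	struct page *page = virt_to_page(table);

	/*
	 * Early pte pages allocated via memblock
	 * allocator need to be freed differently
	 */
	if (PageReserved(page))
		return free_reserved_page(page);

	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
		if (!kernel)
			pgtable_pte_page_dtor(page);
		__free_page(page);
	}
}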

However right from the early versions, we were going in the direction of
making the handling and behaviour of both early page tables and later
page tables as similar to each other as possible. Hence we started with
"fixing up" the early page tables.

If that's not a significant consideration, we can do away with fixup
and retain the other parts (PAGE_SIZE allocations and conditional
freeing) and still fix the bug.

Regards,
Bharata.


[PATCH v1 2/3] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

2020-06-23 Thread Bharata B Rao
We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c0093308] pmd_fragment_free+0x48/0xc0
LR [c147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc0805000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

Fixing this includes 3 parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment count to 1.
- When freeing PUD, PMD and PTE page table pages, check explicitly
  if they come from memblock and, if so, free them appropriately.
- When we do early memblock based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as pagetable page.

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparison of how much more we need for a 64T and 2G
system after this patch:

1. 64T system
-
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system

2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 11 ++-
 arch/powerpc/include/asm/book3s/64/radix.h   |  1 +
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 31 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 80 +++-
 arch/powerpc/mm/mem.c|  5 ++
 arch/powerpc/mm/pgtable-frag.c   |  9 ++-
 7 files changed, 129 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 69c5b051734f..56d695f0095c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -109,7 +109,16 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, 
unsigned long addr)
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+   struct page *page = virt_to_page(pud);
+
+   /*
+* Early pud pages allocated via memblock allocator
+* can't be directly freed to slab
+*/
+   if (PageReserved(page))
+   free_reserved_page(page);
+   else
+   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 0cba794c4fb8..90f05d52f46d 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -297,6 +297,7 @@ static inline unsigned long radix__get_tree_size(void)
 int radix__create_section_mapping(unsigned long start, unsigned long end,
  int nid, pgprot_t prot);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
+void radix__fixup_pgtable_fragments(void);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
 #endif
diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index c89b32443cff..d0b22a937a7a 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -16,6 +16,7 @@
 extern int create_section_mapping(unsigned long start, unsigned long end,
  int nid, pgprot_t prot);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
+void fixup_pgtable_fragments(void);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/ar

[PATCH v1 3/3] powerpc/mm/radix: Free PUD table when freeing pagetable

2020-06-23 Thread Bharata B Rao
remove_pagetable() isn't freeing the PUD table. This causes a memory
leak during memory unplug. Fix this.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 58e42393d5e8..8ec2110eaa1a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -782,6 +782,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
pud_clear(pud);
 }
 
+static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   pud_free(&init_mm, pud_start);
+   p4d_clear(p4d);
+}
+
 struct change_mapping_params {
pte_t *pte;
unsigned long start;
@@ -956,6 +971,7 @@ static void __meminit remove_pagetable(unsigned long start, 
unsigned long end)
 
pud_base = (pud_t *)p4d_page_vaddr(*p4d);
remove_pud_table(pud_base, addr, next);
+   free_pud_table(pud_base, p4d);
}
 
spin_unlock(&init_mm.page_table_lock);
-- 
2.21.3



[PATCH v1 1/3] powerpc/mm/radix: Create separate mappings for hot-plugged memory

2020-06-23 Thread Bharata B Rao
Memory that gets hot-plugged _during_ boot (and not the memory
that gets plugged in after boot) is mapped with 1G mappings
and will undergo splitting when it is unplugged. The splitting
code has a few issues:

1. Recursive locking

Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.

2. BUG: sleeping function called from in_atomic() context
-
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()

3. Bad unlock unbalance
---
Memory unplug path takes init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reacquire
the ptl. Releasing the ptl from a different thread than the one which
acquired it causes a bad unlock unbalance.

These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. Hence, during radix init, identify the hot-plugged memory region
and create separate mappings for each LMB so that they don't get mapped
with 1G mappings. The identification of hot-plugged memory has become
possible after the commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests").

To create separate mappings for every LMB in the hot-plugged
region, we need the LMB size, for which we use memory_block_size_bytes().
Since this is early init-time code, the machine type isn't probed yet
and hence memory_block_size_bytes() returns the default LMB size
of 16MB. Hence we end up issuing a larger number of mapping requests
than before.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 8acb96de0e48..ffccfe00ca2a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -320,6 +321,8 @@ static void __init radix_init_pgtable(void)
 {
unsigned long rts_field;
struct memblock_region *reg;
+   phys_addr_t addr;
+   u64 lmb_size = memory_block_size_bytes();
 
/* We don't support slb for radix */
mmu_slb_size = 0;
@@ -338,9 +341,15 @@ static void __init radix_init_pgtable(void)
continue;
}
 
-   WARN_ON(create_physical_mapping(reg->base,
-   reg->base + reg->size,
-   -1, PAGE_KERNEL));
+   if (memblock_is_hotpluggable(reg)) {
+   for (addr = reg->base; addr < (reg->base + reg->size);
+addr += lmb_size)
+   WARN_ON(create_physical_mapping(addr,
+   addr + lmb_size, -1, PAGE_KERNEL));
+   } else
+   WARN_ON(create_physical_mapping(reg->base,
+   reg->base + reg->size,
+   -1, PAGE_KERNEL));
}
 
/* Find out how many PID bits are supported */
-- 
2.21.3



[PATCH v1 0/3] powerpc/mm/radix: Memory unplug fixes

2020-06-23 Thread Bharata B Rao
This is the next version of the fixes for memory unplug on radix.
The issues and the fixes are described in the individual patches.

Changes in v1:
==
- Rebased to latest kernel.
- Took care of p4d changes.
- Addressed Aneesh's review feedback:
 - Added comments.
 - Indentation fixed.
- Dropped the 1st patch (setting DRCONF_MEM_HOTREMOVABLE lmb flags) as
  it is debatable if this flag should be set in the device tree by OS
  and not by platform in case of hotplug. This can be looked at separately.
  (The fixes in this patchset remain valid without the dropped patch)
- Dropped the last patch that removed split_kernel_mapping() to ensure
  that the splitting code is available for any radix guest running on
  platforms that don't set DRCONF_MEM_HOTREMOVABLE.

v0: 
https://lore.kernel.org/linuxppc-dev/20200406034925.22586-1-bhar...@linux.ibm.com/

Bharata B Rao (3):
  powerpc/mm/radix: Create separate mappings for hot-plugged memory
  powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
mappings
  powerpc/mm/radix: Free PUD table when freeing pagetable

 arch/powerpc/include/asm/book3s/64/pgalloc.h |  11 +-
 arch/powerpc/include/asm/book3s/64/radix.h   |   1 +
 arch/powerpc/include/asm/sparsemem.h |   1 +
 arch/powerpc/mm/book3s64/pgtable.c   |  31 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 111 +--
 arch/powerpc/mm/mem.c|   5 +
 arch/powerpc/mm/pgtable-frag.c   |   9 +-
 7 files changed, 157 insertions(+), 12 deletions(-)

-- 
2.21.3



[PATCH v1 5/5] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-06-18 Thread Bharata B Rao
In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
H_RPT_INVALIDATE if available. The availability of this hcall
is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
DT property.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/firmware.h   |  4 +++-
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 27 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 +--
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 6003c2e533a0..aa6a5ef5d483 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020)
 #define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 #define FW_FEATURE_STUFF_TCE   ASM_CONST(0x0080)
+#define FW_FEATURE_RPT_INVALIDATE ASM_CONST(0x0100)
 
 #ifndef __ASSEMBLY__
 
@@ -71,7 +72,8 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR |
+   FW_FEATURE_RPT_INVALIDATE,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 84acb4769487..fcf8b031a32e 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -313,10 +314,17 @@ void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned 
long addr,
return;
}
 
-   psi = shift_to_mmu_psize(pshift);
-   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
-   lpid, rb);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE)) {
+   psi = shift_to_mmu_psize(pshift);
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(0, 0, 1), lpid, rb);
+   } else {
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB, H_RPTI_PAGE_ALL,
+   addr, addr + psize);
+   }
if (rc)
pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
@@ -330,8 +338,15 @@ static void kvmppc_radix_flush_pwc(struct kvm *kvm, 
unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_PWC, H_RPTI_PAGE_ALL,
+   0, -1UL);
if (rc)
pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 75993f44519b..81f903284d34 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct patb_entry *pseries_partition_tb;
 
@@ -402,8 +403,16 @@ static void kvmhv_flush_lpid(unsigned int lpid)
return;
}
 
-   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
-   lpid, TLBIEL_INVAL_SET_LPID);
+   if (!firmware_has_feature(FW_FEATURE_RPT_INVALIDATE))
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE,
+   H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   else
+   rc = pseries_rpt_invalidate(lpid, H_RPTI_TARGET_CMMU,
+   H_RPTI_TYPE_NESTED |
+   H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC |
+

[PATCH v1 4/5] powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when !GTSE

2020-06-18 Thread Bharata B Rao
From: Nicholas Piggin 

When platform doesn't support GTSE, let TLB invalidation requests
for radix guests be off-loaded to the host using H_RPT_INVALIDATE
hcall.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Bharata B Rao 
[hcall wrapper, error path handling and renames]
---
 arch/powerpc/include/asm/hvcall.h | 27 ++-
 arch/powerpc/include/asm/plpar_wrappers.h | 52 +
 arch/powerpc/mm/book3s64/radix_tlb.c  | 95 +--
 3 files changed, 166 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index e90c073e437e..3f9bc7ad1cdd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -305,7 +305,8 @@
 #define H_SCM_UNBIND_ALL0x3FC
 #define H_SCM_HEALTH0x400
 #define H_SCM_PERFORMANCE_STATS 0x418
-#define MAX_HCALL_OPCODE   H_SCM_PERFORMANCE_STATS
+#define H_RPT_INVALIDATE   0x448
+#define MAX_HCALL_OPCODE   H_RPT_INVALIDATE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -389,6 +390,30 @@
 #define PROC_TABLE_RADIX   0x04
 #define PROC_TABLE_GTSE0x01
 
+/*
+ * Defines for
+ * H_RPT_INVALIDATE - Invalidate RPT translation lookaside information.
+ */
+
+/* Type of translation to invalidate (type) */
+#define H_RPTI_TYPE_NESTED 0x0001  /* Invalidate nested guest 
partition-scope */
+#define H_RPTI_TYPE_TLB0x0002  /* Invalidate TLB */
+#define H_RPTI_TYPE_PWC0x0004  /* Invalidate Page Walk Cache */
+#define H_RPTI_TYPE_PRT0x0008  /* Invalidate Process Table 
Entries if H_RPTI_TYPE_NESTED is clear */
+#define H_RPTI_TYPE_PAT0x0008  /* Invalidate Partition Table 
Entries if H_RPTI_TYPE_NESTED is set */
+
+/* Invalidation targets (target) */
+#define H_RPTI_TARGET_CMMU 0x01 /* All virtual processors in the 
partition */
+#define H_RPTI_TARGET_CMMU_LOCAL   0x02 /* Current virtual processor */
+#define H_RPTI_TARGET_NMMU 0x04 /* All nest/accelerator agents in 
use by the partition */
+
+/* Page size mask (page sizes) */
+#define H_RPTI_PAGE_4K 0x01
+#define H_RPTI_PAGE_64K0x02
+#define H_RPTI_PAGE_2M 0x04
+#define H_RPTI_PAGE_1G 0x08
+#define H_RPTI_PAGE_ALL (-1UL)
+
 #ifndef __ASSEMBLY__
 #include 
 
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 4497c8afb573..92320bb309c7 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -334,6 +334,51 @@ static inline long plpar_get_cpu_characteristics(struct 
h_cpu_char_result *p)
return rc;
 }
 
+/*
+ * Wrapper to H_RPT_INVALIDATE hcall that handles return values appropriately
+ *
+ * - Returns H_SUCCESS on success
+ * - For H_BUSY return value, we retry the hcall.
+ * - For any other hcall failures, attempt a full flush once before
+ *   resorting to BUG().
+ *
+ * Note: This hcall is expected to fail only very rarely. The correct
+ * error recovery of killing the process/guest will be eventually
+ * needed.
+ */
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 type,
+ u64 page_sizes, u64 start, u64 end)
+{
+   long rc;
+   unsigned long all = H_RPTI_TYPE_TLB | H_RPTI_TYPE_PWC;
+
+   while (true) {
+   rc = plpar_hcall_norets(H_RPT_INVALIDATE, pid, target, type,
+   page_sizes, start, end);
+   if (rc == H_BUSY) {
+   cpu_relax();
+   continue;
+   } else if (rc == H_SUCCESS)
+   return rc;
+
+   /* Flush request failed, try with a full flush once */
+   if (type & H_RPTI_TYPE_NESTED)
+   all |= H_RPTI_TYPE_PAT;
+   else
+   all |= H_RPTI_TYPE_PRT;
+retry:
+   rc = plpar_hcall_norets(H_RPT_INVALIDATE, pid, target,
+   all, page_sizes, 0, -1UL);
+   if (rc == H_BUSY) {
+   cpu_relax();
+   goto retry;
+   } else if (rc == H_SUCCESS)
+   return rc;
+
+   BUG();
+   }
+}
+
 #else /* !CONFIG_PPC_PSERIES */
 
 static inline long plpar_set_ciabr(unsigned long ciabr)
@@ -346,6 +391,13 @@ static inline long plpar_pte_read_4(unsigned long flags, 
unsigned long ptex,
 {
return 0;
 }
+
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 type,
+ u64 page_sizes, u64 start, u64 end)
+{
+   return 0;
+}
+
 #endif /* CONFIG_PPC_PSERIES */
 
 #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index b5cc9b23cf02..733935b68f37 100644
--- a/arch/powerpc/mm/book3s64/radix_tl

[PATCH v1 3/5] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-06-18 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index e4ed5317f117..58ba76bc1964 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned 
long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



[PATCH v1 2/5] powerpc/prom_init: Ask for Radix GTSE only if supported.

2020-06-18 Thread Bharata B Rao
In the case of radix, don't ask for GTSE by default but ask
for it only if GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/kernel/prom_init.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5f15b10eb007..16dd14f58ba6 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
-- 
2.21.3



[PATCH v1 1/5] powerpc/mm: Make GTSE an MMU FTR

2020-06-18 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However for a guest, conditionally enable it if the hypervisor
supports it via the OV5 vector.

Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/mmu.h| 4 
 arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
 arch/powerpc/mm/init_64.c | 5 -
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 3a409517c031..fcb815b3a84d 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -337,6 +337,7 @@ static int __init feat_enable_mmu_radix(struct 
dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c7ce4ec5060e..a7b571c60e90 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -408,12 +408,15 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with 
GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
 }
 
-- 
2.21.3



[PATCH v1 0/5] Off-load TLB invalidations to host for !GTSE

2020-06-18 Thread Bharata B Rao
The hypervisor may choose not to enable the Guest Translation Shootdown
Enable (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
permitted to use instructions like tlbie and tlbsync directly, but is
expected to make hypervisor calls to get the TLB flushed.

This series enables the TLB flush routines in the radix code to
off-load TLB flushing to hypervisor via the newly proposed hcall
H_RPT_INVALIDATE. The specification of this hcall is still evolving;
the patchset is posted here for early comments.

To easily check the availability of GTSE, it is made an MMU feature.
The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
handle GTSE as an optionally available feature and to not assume GTSE
when radix support is available.

The actual hcall implementation for KVM isn't included in this
patchset.

H_RPT_INVALIDATE

Syntax:
int64   /* H_Success: Return code on successful completion */
    /* H_Busy - repeat the call with the same */
    /* H_Parameter, H_P2, H_P3, H_P4, H_P5 : Invalid parameters */
    hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate RPT translation 
lookaside information */
  uint64 pid,   /* PID/LPID to invalidate */
  uint64 target,    /* Invalidation target */
  uint64 type,  /* Type of lookaside information */
  uint64 pageSizes, /* Page sizes */
  uint64 start, /* Start of Effective Address (EA) range 
(inclusive) */
  uint64 end)   /* End of EA range (exclusive) */

Invalidation targets (target)
-
Core MMU    0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU    0x04 /* All nest/accelerator agents in use by the partition */

A combination of the above can be specified, except core and core local.

Type of translation to invalidate (type)
---
NESTED   0x0001  /* invalidate nested guest partition-scope */
TLB  0x0002  /* Invalidate TLB */
PWC  0x0004  /* Invalidate Page Walk Cache */
PRT  0x0008  /* Invalidate Process Table Entries if NESTED is clear*/
PAT  0x0008  /* Invalidate Partition Table Entries  if NESTED is set*/

A combination of the above can be specified.

Page size mask (pages)
--
4K  0x01
64K 0x02
2M  0x04
1G  0x08
All sizes   (-1UL)

A combination of the above can be specified.
All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information
   matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, type, or pageSizes parameters are
  different from the defined values.
* Return H_PARAMETER if NESTED is set and pid is not a valid nested
  LPID allocated to this partition
* Return H_P5 if (start, end) doesn't form a valid range. Start and end
  should be a valid Quadrant address and end > start.
* Return H_NotSupported if the partition is not running in radix
  translation mode.
* May invalidate more translation information than requested.
* If start = 0 and end = -1, set the range to cover all valid addresses.
  Else start and end should be aligned to 4kB (lower 11 bits clear).
* If NESTED is clear, then invalidate process scoped lookaside information.
  Else pid specifies a nested LPID, and the invalidation is performed
  on nested guest partition table and nested guest partition scope real
  addresses.
* If pid = 0 and NESTED is clear, then valid addresses are quadrant 3 and
  quadrant 0 spaces, Else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
  Those which are partially covered are considered outside invalidation
  range, which allows a caller to optimally invalidate ranges that may
  contain mixed page sizes.
* Return H_SUCCESS on success.
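
As an illustration of how the parameters combine (my example, not part of
the spec): flushing all TLB entries for a process, over all page sizes and
the whole EA range, on all virtual processors of the partition, maps to a
single call of the pseries_rpt_invalidate() wrapper added in patch 4/5:

	/* Flush all TLB entries for 'pid' across the partition */
	pseries_rpt_invalidate(pid, H_RPTI_TARGET_CMMU, H_RPTI_TYPE_TLB,
			       H_RPTI_PAGE_ALL, 0, -1UL);

A nested partition-scope flush additionally sets H_RPTI_TYPE_NESTED and
passes the nested LPID in the pid argument, as patch 5/5 does.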

Bharata B Rao (4):
  powerpc/mm: Make GTSE an MMU FTR
  powerpc/prom_init: Ask for Radix GTSE only if supported.
  powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
enabled
  KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

Nicholas Piggin (1):
  powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
!GTSE

 arch/powerpc/include/asm/firmware.h   |  4 +-
 arch/powerpc/include/asm/hvcall.h | 27 ++-
 arch/powerpc/include/asm/mmu.h|  4 +
 arch/powerpc/include/asm/plpar_wrappers.h | 52 +
 arch/powerpc/kernel/dt_cpu_ftrs.c |  1 +
 arch/powerpc/kernel/prom_init.c   | 13 ++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 27 +--
 arch/powerpc/kvm/book3s_hv_nested.c   | 13 +++-
 arch/powerpc/mm/book3s64/radix_tlb.c  | 95 +--
 arch/powerpc/mm/init_64.c |  5 +-
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 arch/powerpc/platforms/pseries/lpar.c |  8 +-
 12 files changed

[RFC PATCH v0 3/4] powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if enabled

2020-06-08 Thread Bharata B Rao
H_REGISTER_PROC_TBL asks for GTSE by default. GTSE flag bit should
be set only when GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/lpar.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index e4ed5317f117..58ba76bc1964 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1680,9 +1680,11 @@ static int pseries_lpar_register_process_table(unsigned 
long base,
 
if (table_size)
flags |= PROC_TABLE_NEW;
-   if (radix_enabled())
-   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
-   else
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
flags |= PROC_TABLE_HPT_SLB;
for (;;) {
rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-- 
2.21.3



[RFC PATCH v0 4/4] powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when !GTSE

2020-06-08 Thread Bharata B Rao
From: Nicholas Piggin 

When the platform doesn't support GTSE, let TLB invalidation requests
for radix guests be off-loaded to the host using the H_RPT_INVALIDATE
hcall.

Signed-off-by: Nicholas Piggin 
Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h |   1 +
 arch/powerpc/include/asm/plpar_wrappers.h |  14 +++
 arch/powerpc/mm/book3s64/radix_tlb.c  | 105 --
 3 files changed, 113 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index e90c073e437e..08917147415b 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -335,6 +335,7 @@
 #define H_GET_24X7_CATALOG_PAGE0xF078
 #define H_GET_24X7_DATA0xF07C
 #define H_GET_PERF_COUNTER_INFO0xF080
+#define H_RPT_INVALIDATE   0xF084
 
 /* Platform-specific hcalls used for nested HV KVM */
 #define H_SET_PARTITION_TABLE  0xF800
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index 4497c8afb573..e952139b0e47 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -334,6 +334,13 @@ static inline long plpar_get_cpu_characteristics(struct 
h_cpu_char_result *p)
return rc;
 }
 
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 what,
+ u64 pages, u64 start, u64 end)
+{
+   return plpar_hcall_norets(H_RPT_INVALIDATE, pid, target, what,
+ pages, start, end);
+}
+
 #else /* !CONFIG_PPC_PSERIES */
 
 static inline long plpar_set_ciabr(unsigned long ciabr)
@@ -346,6 +353,13 @@ static inline long plpar_pte_read_4(unsigned long flags, 
unsigned long ptex,
 {
return 0;
 }
+
+static inline long pseries_rpt_invalidate(u32 pid, u64 target, u64 what,
+ u64 pages, u64 start, u64 end)
+{
+   return 0;
+}
+
 #endif /* CONFIG_PPC_PSERIES */
 
 #endif /* _ASM_POWERPC_PLPAR_WRAPPERS_H */
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index b5cc9b23cf02..4dd1d3c75562 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -16,11 +16,39 @@
 #include 
 #include 
 #include 
+#include 
 
 #define RIC_FLUSH_TLB 0
 #define RIC_FLUSH_PWC 1
 #define RIC_FLUSH_ALL 2
 
+#define H_TLBI_TLB 0x0001
+#define H_TLBI_PWC 0x0002
+#define H_TLBI_PRS 0x0004
+
+#define H_TLBI_TARGET_CMMU 0x01
+#define H_TLBI_TARGET_CMMU_LOCAL 0x02
+#define H_TLBI_TARGET_NMMU 0x04
+
+#define H_TLBI_PAGE_ALL (-1UL)
+#define H_TLBI_PAGE_4K 0x01
+#define H_TLBI_PAGE_64K0x02
+#define H_TLBI_PAGE_2M 0x04
+#define H_TLBI_PAGE_1G 0x08
+
+static inline u64 psize_to_h_tlbi(unsigned long psize)
+{
+   if (psize == MMU_PAGE_4K)
+   return H_TLBI_PAGE_4K;
+   if (psize == MMU_PAGE_64K)
+   return H_TLBI_PAGE_64K;
+   if (psize == MMU_PAGE_2M)
+   return H_TLBI_PAGE_2M;
+   if (psize == MMU_PAGE_1G)
+   return H_TLBI_PAGE_1G;
+   return H_TLBI_PAGE_ALL;
+}
+
 /*
  * tlbiel instruction for radix, set invalidation
  * i.e., r=1 and is=01 or is=10 or is=11
@@ -694,7 +722,14 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
goto local;
}
 
-   if (cputlb_use_tlbie()) {
+   if (!mmu_has_feature(MMU_FTR_GTSE)) {
+   unsigned long targ = H_TLBI_TARGET_CMMU;
+
+   if (atomic_read(&mm->context.copros) > 0)
+   targ |= H_TLBI_TARGET_NMMU;
+   pseries_rpt_invalidate(pid, targ, H_TLBI_TLB,
+  H_TLBI_PAGE_ALL, 0, -1UL);
+   } else if (cputlb_use_tlbie()) {
if (mm_needs_flush_escalation(mm))
_tlbie_pid(pid, RIC_FLUSH_ALL);
else
@@ -727,7 +762,16 @@ static void __flush_all_mm(struct mm_struct *mm, bool 
fullmm)
goto local;
}
}
-   if (cputlb_use_tlbie())
+   if (!mmu_has_feature(MMU_FTR_GTSE)) {
+   unsigned long targ = H_TLBI_TARGET_CMMU;
+   unsigned long what = H_TLBI_TLB | H_TLBI_PWC |
+H_TLBI_PRS;
+
+   if (atomic_read(&mm->context.copros) > 0)
+   targ |= H_TLBI_TARGET_NMMU;
+   pseries_rpt_invalidate(pid, targ, what,
+  H_TLBI_PAGE_ALL, 0, -1UL);
+   } else if (cputlb_use_tlbie())
_tlbie_pid(pid, RIC_FLUSH_ALL);
else
_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);

[RFC PATCH v0 1/4] powerpc/mm: Make GTSE as MMU FTR

2020-06-08 Thread Bharata B Rao
Make GTSE an MMU feature and enable it by default for radix.
However for a guest, conditionally enable it if the hypervisor
supports it via the OV5 vector.

Making GTSE an MMU feature will make it easy to enable radix
without GTSE.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/mmu.h| 4 
 arch/powerpc/kernel/dt_cpu_ftrs.c | 2 ++
 arch/powerpc/mm/init_64.c | 6 +-
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index f4ac25d4df05..884d51995934 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -28,6 +28,9 @@
  * Individual features below.
  */
 
+/* Guest Translation Shootdown Enable */
+#define MMU_FTR_GTSE   ASM_CONST(0x1000)
+
 /*
  * Support for 68 bit VA space. We added that from ISA 2.05
  */
@@ -173,6 +176,7 @@ enum {
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
+   MMU_FTR_GTSE |
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_RADIX_KUAP |
 #endif /* CONFIG_PPC_KUAP */
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 3a409517c031..571aa39e35d5 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -337,6 +337,8 @@ static int __init feat_enable_mmu_radix(struct 
dt_cpu_feature *f)
 #ifdef CONFIG_PPC_RADIX_MMU
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
+   /* TODO: Does this need a separate cpu dt feature? */
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
cur_cpu_spec->cpu_user_features |= PPC_FEATURE_HAS_MMU;
 
return 1;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c7ce4ec5060e..feb9bed9177c 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -408,13 +408,17 @@ static void __init early_check_vec5(void)
if (!(vec5[OV5_INDX(OV5_RADIX_GTSE)] &
OV5_FEAT(OV5_RADIX_GTSE))) {
pr_warn("WARNING: Hypervisor doesn't support RADIX with 
GTSE\n");
-   }
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
+   } else
+   cur_cpu_spec->mmu_features |= MMU_FTR_GTSE;
/* Do radix anyway - the hypervisor said we had to */
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
} else if (mmu_supported == OV5_FEAT(OV5_MMU_HASH)) {
/* Hypervisor only supports hash - disable radix */
cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_GTSE;
}
+
 }
 
 void __init mmu_early_init_devtree(void)
-- 
2.21.3



[RFC PATCH v0 2/4] powerpc/prom_init: Ask for Radix GTSE only if supported.

2020-06-08 Thread Bharata B Rao
In the case of radix, don't ask for GTSE by default but ask
for it only if GTSE is supported.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/kernel/prom_init.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 5f15b10eb007..16dd14f58ba6 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1336,12 +1336,15 @@ static void __init prom_check_platform_support(void)
}
}
 
-   if (supported.radix_mmu && supported.radix_gtse &&
-   IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
-   /* Radix preferred - but we require GTSE for now */
-   prom_debug("Asking for radix with GTSE\n");
+   if (supported.radix_mmu && IS_ENABLED(CONFIG_PPC_RADIX_MMU)) {
+   /* Radix preferred - Check if GTSE is also supported */
+   prom_debug("Asking for radix\n");
ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX);
-   ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
+   if (supported.radix_gtse)
+   ibm_architecture_vec.vec5.radix_ext =
+   OV5_FEAT(OV5_RADIX_GTSE);
+   else
+   prom_debug("Radix GTSE isn't supported\n");
} else if (supported.hash_mmu) {
/* Default to hash mmu (if we can) */
prom_debug("Asking for hash\n");
-- 
2.21.3



[RFC PATCH v0 0/4] Off-load TLB invalidations to host for !GTSE

2020-06-08 Thread Bharata B Rao
The hypervisor may choose not to enable the Guest Translation Shootdown
Enable (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
permitted to use instructions like tlbie and tlbsync directly, but is
expected to make hypervisor calls to get the TLB flushed.

This series enables the TLB flush routines in the radix code to
off-load TLB flushing to hypervisor via the newly proposed hcall
H_RPT_INVALIDATE. The specification of this hcall is still evolving;
the patchset is posted here for early comments.

To easily check the availability of GTSE, it is made an MMU feature.
(TODO: Check if this can be a static key instead of MMU feature)

The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
handle GTSE as an optionally available feature and to not assume GTSE
when radix support is available.

H_RPT_INVALIDATE

Syntax:
int64   /* H_Success: Return code on successful completion */
/* H_Busy - repeat the call with the same */
/* H_P2, H_P3, H_P4, H_Parameter: Invalid parameters */
hcall(const uint64 H_RPT_INVALIDATE, /* Invalidate process scoped RPT 
lookaside information */
  uint64 pid,   /* PID to invalidate */
  uint64 target,/* Invalidation target */
  uint64 what,  /* What type of lookaside information */
  uint64 pages, /* Page sizes */
  uint64 start, /* Start of Effective Address (EA) range */
  uint64 end)   /* End of EA range */

Invalidation targets (target)
-
Core MMU        0x01 /* All virtual processors in the partition */
Core local MMU  0x02 /* Current virtual processor */
Nest MMU        0x04 /* All nest/accelerator agents in use by the partition */
A combination of the above can be specified, except core and core local.

What to invalidate (what)
-
Reserved    0x0001  /* Reserved */
TLB         0x0002  /* Invalidate TLB */
PWC         0x0004  /* Invalidate Page Walk Cache */
PRS         0x0008  /* Invalidate Process Table Entries */
A combination of the above can be specified.

Page size mask (pages)
--
4K  0x01
64K 0x02
2M  0x04
1G  0x08
All sizes   (-1UL)
A combination of the above can be specified.
All page sizes can be selected with -1.

Semantics: Invalidate radix tree lookaside information
   matching the parameters given.
* Return H_P2, H_P3 or H_P4 if target, what or pages parameters are
  different from the defined values.
* Return H_PARAMETER if (start, end) doesn't form a valid range.
* May invalidate more translation information than was specified.
* If start = 0 and end = -1, set the range to cover all valid addresses.
  Else start and end should be aligned to 4kB (lower 11 bits clear).
* If pid = 0 then valid addresses are quadrant 3 and quadrant 0 spaces,
  Else valid addresses are quadrant 0.
* Pages which are fully covered by the range are to be invalidated.
  Those which are partially covered are considered outside invalidation
  range, which allows a caller to optimally invalidate ranges that may
  contain mixed page sizes.
* Return H_SUCCESS on success.

Bharata B Rao (3):
  powerpc/mm: Make GTSE as MMU FTR
  powerpc/prom_init: Ask for Radix GTSE only if supported.
  powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
enabled

Nicholas Piggin (1):
  powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
!GTSE

 arch/powerpc/include/asm/hvcall.h |   1 +
 arch/powerpc/include/asm/mmu.h|   4 +
 arch/powerpc/include/asm/plpar_wrappers.h |  14 +++
 arch/powerpc/kernel/dt_cpu_ftrs.c |   2 +
 arch/powerpc/kernel/prom_init.c   |  13 +--
 arch/powerpc/mm/book3s64/radix_tlb.c  | 105 --
 arch/powerpc/mm/init_64.c |   6 +-
 arch/powerpc/platforms/pseries/lpar.c |   8 +-
 8 files changed, 137 insertions(+), 16 deletions(-)

-- 
2.21.3



Re: [PATCH v1 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-03 Thread Bharata B Rao
On Wed, Jun 03, 2020 at 04:10:25PM -0700, Ram Pai wrote:
> On Tue, Jun 02, 2020 at 03:36:39PM +0530, Bharata B Rao wrote:
> > On Mon, Jun 01, 2020 at 12:05:35PM -0700, Ram Pai wrote:
> > > On Mon, Jun 01, 2020 at 05:25:18PM +0530, Bharata B Rao wrote:
> > > > On Sat, May 30, 2020 at 07:27:50PM -0700, Ram Pai wrote:
> > > > > H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
> > > > > called H_SVM_PAGE_IN for all secure pages.
> > > > 
> > > > I don't think that is quite true. HV doesn't assume anything about
> > > > secure pages by itself.
> > > 
> > > Yes. Currently, it does not assume anything about secure pages.  But I am
> > > proposing that it should consider all pages (except the shared pages) as
> > > secure pages, when H_SVM_INIT_DONE is called.
> > 
> > Ok, then may be also add the proposed changes to H_SVM_INIT_DONE
> > documentation.
> 
> ok.
> 
> > 
> > > 
> > > In other words, HV should treat all pages; except shared pages, as
> > > secure pages once H_SVM_INIT_DONE is called. And this includes pages
> > > added subsequently through memory hotplug.
> > 
> > So after H_SVM_INIT_DONE, if HV touches a secure page for any
> > reason and gets encrypted contents via page-out, HV drops the
> > device pfn at that time. So what state we would be in that case? We
> > have completed H_SVM_INIT_DONE, but still have a normal (but encrypted)
> > page in HV?
> 
> Good point.
> 
> The corresponding GFN will continue to be a secure GFN. Just that its
> backing PFN is not a device-PFN, but a memory-PFN. Also that backing
> memory-PFN contains encrypted content.
> 
> I will clarify this in the patch; about secure-GFN state.

I feel all this complicates the state tracking in HV and is avoidable
if UV just issued page-in calls during the memslot registration uvcall.

Regards,
Bharata.


Re: [PATCH v1 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-02 Thread Bharata B Rao
On Mon, Jun 01, 2020 at 12:05:35PM -0700, Ram Pai wrote:
> On Mon, Jun 01, 2020 at 05:25:18PM +0530, Bharata B Rao wrote:
> > On Sat, May 30, 2020 at 07:27:50PM -0700, Ram Pai wrote:
> > > H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
> > > called H_SVM_PAGE_IN for all secure pages.
> > 
> > I don't think that is quite true. HV doesn't assume anything about
> > secure pages by itself.
> 
> Yes. Currently, it does not assume anything about secure pages.  But I am
> proposing that it should consider all pages (except the shared pages) as
> secure pages, when H_SVM_INIT_DONE is called.

Ok, then may be also add the proposed changes to H_SVM_INIT_DONE
documentation.

> 
> In other words, HV should treat all pages; except shared pages, as
> secure pages once H_SVM_INIT_DONE is called. And this includes pages
> added subsequently through memory hotplug.

So after H_SVM_INIT_DONE, if HV touches a secure page for any
reason and gets encrypted contents via page-out, HV drops the
device pfn at that time. So what state we would be in that case? We
have completed H_SVM_INIT_DONE, but still have a normal (but encrypted)
page in HV?

> 
> Yes. the Ultravisor can explicitly request the HV to move the pages
> individually.  But that will slow down the transition too significantly.
> It takes above 20min to transition them, for a SVM of size 100G.
> 
> With this proposed enhancement, the switch completes in a few seconds.

I think many pages during the initial switch and most pages for hotplugged
memory are zero pages, for which we don't issue UV page-in calls anyway.
So is the 20min saving you are observing purely due to hcall overhead?

How about extending the H_SVM_PAGE_IN interface or adding a new hcall
to request multiple pages in one call?
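
Something like this purely hypothetical prototype is the kind of batched
interface I mean (the name and parameters are made up; nothing like this
exists in the patches or in PAPR today):

	/* Page in 'npages' guest physical pages starting at 'gpa' in one hcall */
	long h_svm_page_in_multi(unsigned long gpa, unsigned long flags,
				 unsigned long page_shift, unsigned long npages);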

Also, how about requesting bigger page sizes (2M)? Ralph Campbell
had patches that added THP support for the migrate_vma_* calls.

> 
> > 
> > > These GFNs continue to be
> > > normal GFNs associated with normal PFNs; when infact, these GFNs should
> > > have been secure GFNs associated with device PFNs.
> > 
> > Transition to secure state is driven by SVM/UV and HV just responds to
> > hcalls by issuing appropriate uvcalls. SVM/UV is in the best position to
> > determine the required pages that need to be moved into secure side.
> > HV just responds to it and tracks such pages as device private pages.
> > 
> > If SVM/UV doesn't get in all the pages to secure side by the time
> > of H_SVM_INIT_DONE, the remaining pages are just normal (shared or
> > otherwise) pages as far as HV is concerned.  Why should HV assume that
> > SVM/UV didn't ask for a few pages and hence push those pages during
> > H_SVM_INIT_DONE?
> 
> By definition, SVM is a VM backed by secure pages.
> Hence all pages(except shared) must turn secure when a VM switches to SVM.
> 
> UV is interested in only a certain pages for the VM, which it will
> request explicitly through H_SVM_PAGE_IN.  All other pages, need not
> be paged-in through UV_PAGE_IN.  They just need to be switched to
> device-pages.
> 
> > 
> > I think UV should drive the movement of pages into secure side both
> > of boot-time SVM memory and hot-plugged memory. HV does memslot
> > registration uvcall when new memory is plugged in, UV should explicitly
> > get the required pages in at that time instead of expecting HV to drive
> > the same.
> > 
> > > +static int uv_migrate_mem_slot(struct kvm *kvm,
> > > + const struct kvm_memory_slot *memslot)
> > > +{
> > > + unsigned long gfn = memslot->base_gfn;
> > > + unsigned long end;
> > > + bool downgrade = false;
> > > + struct vm_area_struct *vma;
> > > + int i, ret = 0;
> > > + unsigned long start = gfn_to_hva(kvm, gfn);
> > > +
> > > + if (kvm_is_error_hva(start))
> > > + return H_STATE;
> > > +
> > > + end = start + (memslot->npages << PAGE_SHIFT);
> > > +
> > > + down_write(&kvm->mm->mmap_sem);
> > > +
> > > + mutex_lock(&kvm->arch.uvmem_lock);
> > > + vma = find_vma_intersection(kvm->mm, start, end);
> > > + if (!vma || vma->vm_start > start || vma->vm_end < end) {
> > > + ret = H_STATE;
> > > + goto out_unlock;
> > > + }
> > > +
> > > + ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> > > +   MADV_UNMERGEABLE, &vma->vm_flags);
> > > + downgrade_write(&kvm->mm->mmap_sem);
> > > + downgrade = true;
> > > + if (ret) {
> > >

Re: [PATCH v1 3/4] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE

2020-06-01 Thread Bharata B Rao
On Sat, May 30, 2020 at 07:27:50PM -0700, Ram Pai wrote:
> H_SVM_INIT_DONE incorrectly assumes that the Ultravisor has explicitly
> called H_SVM_PAGE_IN for all secure pages.

I don't think that is quite true. HV doesn't assume anything about
secure pages by itself.

> These GFNs continue to be
> normal GFNs associated with normal PFNs; when infact, these GFNs should
> have been secure GFNs associated with device PFNs.

Transition to secure state is driven by SVM/UV and HV just responds to
hcalls by issuing appropriate uvcalls. SVM/UV is in the best position to
determine the required pages that need to be moved into secure side.
HV just responds to it and tracks such pages as device private pages.

If SVM/UV doesn't get in all the pages to secure side by the time
of H_SVM_INIT_DONE, the remaining pages are just normal (shared or
otherwise) pages as far as HV is concerned.  Why should HV assume that
SVM/UV didn't ask for a few pages and hence push those pages during
H_SVM_INIT_DONE?

I think UV should drive the movement of pages into secure side both
of boot-time SVM memory and hot-plugged memory. HV does memslot
registration uvcall when new memory is plugged in, UV should explicitly
get the required pages in at that time instead of expecting HV to drive
the same.

> +static int uv_migrate_mem_slot(struct kvm *kvm,
> + const struct kvm_memory_slot *memslot)
> +{
> + unsigned long gfn = memslot->base_gfn;
> + unsigned long end;
> + bool downgrade = false;
> + struct vm_area_struct *vma;
> + int i, ret = 0;
> + unsigned long start = gfn_to_hva(kvm, gfn);
> +
> + if (kvm_is_error_hva(start))
> + return H_STATE;
> +
> + end = start + (memslot->npages << PAGE_SHIFT);
> +
> + down_write(&kvm->mm->mmap_sem);
> +
> + mutex_lock(&kvm->arch.uvmem_lock);
> + vma = find_vma_intersection(kvm->mm, start, end);
> + if (!vma || vma->vm_start > start || vma->vm_end < end) {
> + ret = H_STATE;
> + goto out_unlock;
> + }
> +
> + ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> +   MADV_UNMERGEABLE, &vma->vm_flags);
> + downgrade_write(&kvm->mm->mmap_sem);
> + downgrade = true;
> + if (ret) {
> + ret = H_STATE;
> + goto out_unlock;
> + }
> +
> + for (i = 0; i < memslot->npages; i++, ++gfn) {
> + /* skip paged-in pages and shared pages */
> + if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL) ||
> + kvmppc_gfn_is_uvmem_shared(gfn, kvm))
> + continue;
> +
> + start = gfn_to_hva(kvm, gfn);
> + end = start + (1UL << PAGE_SHIFT);
> + ret = kvmppc_svm_migrate_page(vma, start, end,
> + (gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
> +
> + if (ret)
> + goto out_unlock;
> + }

Is there a guarantee that the vma you got for the start address remains
valid for all the addresses till the end of the memslot? If not, you should
re-get the vma for the current address in each iteration, I suppose.
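
Something along these lines is what I mean -- an illustrative sketch only,
reusing the variables from the hunk above, with the VMA looked up freshly
for each gfn:

	for (i = 0; i < memslot->npages; i++, ++gfn) {
		start = gfn_to_hva(kvm, gfn);
		if (kvm_is_error_hva(start))
			continue;
		end = start + (1UL << PAGE_SHIFT);

		/* Re-look-up the VMA covering the current address */
		vma = find_vma_intersection(kvm->mm, start, end);
		if (!vma || vma->vm_start > start || vma->vm_end < end) {
			ret = H_STATE;
			break;
		}

		/* ... skip uvmem/shared gfns and migrate this page
		 * using the freshly looked-up vma ... */
	}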

Regards,
Bharata.


Re: [RFC PATCH v0 0/5] powerpc/mm/radix: Memory unplug fixes

2020-05-19 Thread Bharata B Rao
Aneesh,

Do these memory unplug fixes on radix look fine? Do you want these
to be rebased on a recent kernel? Would you like me to test any specific
scenario with these fixes?

Regards,
Bharata.
 
On Mon, Apr 06, 2020 at 09:19:20AM +0530, Bharata B Rao wrote:
> Memory unplug has a few bugs which I had attempted to fix ealier
> at https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-July/194087.html
> 
> Now with Leonardo's patch for PAPR changes that add a separate flag bit
> to LMB flags for explicitly identifying hot-removable memory
> (https://lore.kernel.org/linuxppc-dev/f55a7b65a43cc9dc7b22385cf9960f8b11d5ce2e.ca...@linux.ibm.com/T/#t),
> a few other issues around memory unplug on radix can be fixed. This
> series is a combination of those fixes.
> 
> This series works on top of above mentioned Leonardo's patch.
> 
> Bharata B Rao (5):
>   powerpc/pseries/hotplug-memory: Set DRCONF_MEM_HOTREMOVABLE for
> hot-plugged mem
>   powerpc/mm/radix: Create separate mappings for hot-plugged memory
>   powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
> mappings
>   powerpc/mm/radix: Free PUD table when freeing pagetable
>   powerpc/mm/radix: Remove split_kernel_mapping()
> 
>  arch/powerpc/include/asm/book3s/64/pgalloc.h  |  11 +-
>  arch/powerpc/include/asm/book3s/64/radix.h|   1 +
>  arch/powerpc/include/asm/sparsemem.h  |   1 +
>  arch/powerpc/mm/book3s64/pgtable.c|  31 ++-
>  arch/powerpc/mm/book3s64/radix_pgtable.c  | 186 +++---
>  arch/powerpc/mm/mem.c |   5 +
>  arch/powerpc/mm/pgtable-frag.c|   9 +-
>  .../platforms/pseries/hotplug-memory.c|   6 +-
>  8 files changed, 167 insertions(+), 83 deletions(-)
> 
> -- 
> 2.21.0


Re: [RFC PATCH v0 0/5] powerpc/mm/radix: Memory unplug fixes

2020-04-08 Thread Bharata B Rao
On Mon, Apr 06, 2020 at 09:19:20AM +0530, Bharata B Rao wrote:
> Memory unplug has a few bugs which I had attempted to fix ealier
> at https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-July/194087.html
> 
> Now with Leonardo's patch for PAPR changes that add a separate flag bit
> to LMB flags for explicitly identifying hot-removable memory
> (https://lore.kernel.org/linuxppc-dev/f55a7b65a43cc9dc7b22385cf9960f8b11d5ce2e.ca...@linux.ibm.com/T/#t),
> a few other issues around memory unplug on radix can be fixed. This
> series is a combination of those fixes.
> 
> This series works on top of above mentioned Leonardo's patch.
> 
> Bharata B Rao (5):
>   powerpc/pseries/hotplug-memory: Set DRCONF_MEM_HOTREMOVABLE for
> hot-plugged mem
>   powerpc/mm/radix: Create separate mappings for hot-plugged memory
>   powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
> mappings
>   powerpc/mm/radix: Free PUD table when freeing pagetable
>   powerpc/mm/radix: Remove split_kernel_mapping()

3/5 in this series fixes a long-standing bug; multiple versions of it
have been posted outside of this series earlier.

4/5 fixes a memory leak.

I included the above two in this series because with the patches to
explicitly mark the hotplugged memory (1/5 and 2/5), reproducing the bug
fixed by 3/5 becomes easier.

Hence 3/5 and 4/5 can be considered as standalone fixes too.

Regards,
Bharata.



Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-06 Thread Bharata B Rao
On Mon, Apr 06, 2020 at 12:41:01PM -0300, Leonardo Bras wrote:
> Hello Bharata,
> 
> On Fri, 2020-04-03 at 20:08 +0530, Bharata B Rao wrote:
> > The patch would be more complete with the following change that ensures
> > that DRCONF_MEM_HOTREMOVABLE flag is set for non-boot-time hotplugged
> > memory too. This will ensure that ibm,dynamic-memory-vN property
> > reflects the right flags value for memory that gets hotplugged
> > post boot.
> > 
> 
> You just sent that on a separated patchset, so I think it's dealt with.
> Do you have any other comments on the present patch?

None, thanks.

Regards,
Bharata.



[RFC PATCH v0 4/5] powerpc/mm/radix: Free PUD table when freeing pagetable

2020-04-05 Thread Bharata B Rao
remove_pagetable() isn't freeing the PUD table. This causes a memory
leak during memory unplug. Fix this.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index e675c0bbf9a4..0d9ef3277579 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -767,6 +767,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
pud_clear(pud);
 }
 
+static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   pud_free(&init_mm, pud_start);
+   pgd_clear(pgd);
+}
+
 struct change_mapping_params {
pte_t *pte;
unsigned long start;
@@ -937,6 +952,7 @@ static void __meminit remove_pagetable(unsigned long start, 
unsigned long end)
 
pud_base = (pud_t *)pgd_page_vaddr(*pgd);
remove_pud_table(pud_base, addr, next);
+   free_pud_table(pud_base, pgd);
}
 
spin_unlock(&init_mm.page_table_lock);
-- 
2.21.0



[RFC PATCH v0 5/5] powerpc/mm/radix: Remove split_kernel_mapping()

2020-04-05 Thread Bharata B Rao
With hot-plugged memory always getting mapped with 2M mappings,
there will be no need to split any mappings during unplug.

Hence remove split_kernel_mapping() and associated code. This is
essentially a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 93 +---
 1 file changed, 19 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 0d9ef3277579..56f2c698deac 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -782,30 +781,6 @@ static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
pgd_clear(pgd);
 }
 
-struct change_mapping_params {
-   pte_t *pte;
-   unsigned long start;
-   unsigned long end;
-   unsigned long aligned_start;
-   unsigned long aligned_end;
-};
-
-static int __meminit stop_machine_change_mapping(void *data)
-{
-   struct change_mapping_params *params =
-   (struct change_mapping_params *)data;
-
-   if (!data)
-   return -1;
-
-   spin_unlock(&init_mm.page_table_lock);
-   pte_clear(&init_mm, params->aligned_start, params->pte);
-   create_physical_mapping(__pa(params->aligned_start), 
__pa(params->start), -1);
-   create_physical_mapping(__pa(params->end), __pa(params->aligned_end), 
-1);
-   spin_lock(&init_mm.page_table_lock);
-   return 0;
-}
-
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 unsigned long end)
 {
@@ -834,52 +809,6 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
}
 }
 
-/*
- * clear the pte and potentially split the mapping helper
- */
-static void __meminit split_kernel_mapping(unsigned long addr, unsigned long 
end,
-   unsigned long size, pte_t *pte)
-{
-   unsigned long mask = ~(size - 1);
-   unsigned long aligned_start = addr & mask;
-   unsigned long aligned_end = addr + size;
-   struct change_mapping_params params;
-   bool split_region = false;
-
-   if ((end - addr) < size) {
-   /*
-* We're going to clear the PTE, but not flushed
-* the mapping, time to remap and flush. The
-* effects if visible outside the processor or
-* if we are running in code close to the
-* mapping we cleared, we are in trouble.
-*/
-   if (overlaps_kernel_text(aligned_start, addr) ||
-   overlaps_kernel_text(end, aligned_end)) {
-   /*
-* Hack, just return, don't pte_clear
-*/
-   WARN_ONCE(1, "Linear mapping %lx->%lx overlaps kernel "
- "text, not splitting\n", addr, end);
-   return;
-   }
-   split_region = true;
-   }
-
-   if (split_region) {
-   params.pte = pte;
-   params.start = addr;
-   params.end = end;
-   params.aligned_start = addr & ~(size - 1);
-   params.aligned_end = min_t(unsigned long, aligned_end,
-   (unsigned long)__va(memblock_end_of_DRAM()));
-   stop_machine(stop_machine_change_mapping, ¶ms, NULL);
-   return;
-   }
-
-   pte_clear(&init_mm, addr, pte);
-}
-
 static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 unsigned long end)
 {
@@ -895,7 +824,12 @@ static void remove_pmd_table(pmd_t *pmd_start, unsigned 
long addr,
continue;
 
if (pmd_is_leaf(*pmd)) {
-   split_kernel_mapping(addr, end, PMD_SIZE, (pte_t *)pmd);
+   if (!IS_ALIGNED(addr, PMD_SIZE) ||
+   !IS_ALIGNED(next, PMD_SIZE)) {
+   WARN_ONCE(1, "%s: unaligned range\n", __func__);
+   continue;
+   }
+   pte_clear(&init_mm, addr, (pte_t *)pmd);
continue;
}
 
@@ -920,7 +854,12 @@ static void remove_pud_table(pud_t *pud_start, unsigned 
long addr,
continue;
 
if (pud_is_leaf(*pud)) {
-   split_kernel_mapping(addr, end, PUD_SIZE, (pte_t *)pud);
+   if (!IS_ALIGNED(addr, PUD_SIZE) ||
+   !IS_ALIGNED(next, PUD_SIZE)) {
+   WARN_ONCE(1, "%s: unaligned range\n", __func__);
+   

[RFC PATCH v0 3/5] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

2020-04-05 Thread Bharata B Rao
We can hit the following BUG_ON during memory unplug

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:344!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c0097d48] pmd_fragment_free+0x48/0xd0
LR [c16aaefc] remove_pagetable+0x494/0x530
Call Trace:
_raw_spin_lock+0x54/0x80 (unreliable)
remove_pagetable+0x2b0/0x530
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x38/0x5c
arch_remove_memory+0x124/0x190
try_remove_memory+0xd0/0x1c0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xbc/0x110
dlpar_memory+0xa90/0xd40
handle_dlpar_errorlog+0xa8/0x160
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x47c/0x870
worker_thread+0x364/0x5e0
kthread+0x1b4/0x1c0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for memory which has
been mapped using memblock pages as part of the early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

Fixing this includes 3 parts:

- Re-walk the init_mm page tables from mem_init() and initialize
  the PMD and PTE fragment count to 1.
- When freeing PUD, PMD and PTE page table pages, check explicitly
  if they come from memblock and if so free them appropriately.
- When we do early memblock based allocation of PMD and PUD pages,
  allocate in PAGE_SIZE granularity so that we are sure the
  complete page is used as pagetable page.
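
As a rough illustration of the first step, here is a hypothetical sketch
of seeding the fragment count for an early PTE table page while re-walking
init_mm (the helper name and walk details are made up for illustration;
the real change is in the diff below):

static void __init fixup_pte_frag_sketch(pmd_t *pmd)
{
	struct page *page;

	if (pmd_none(*pmd) || pmd_is_leaf(*pmd))
		return;

	/* PTE table page that was allocated from memblock at early boot */
	page = virt_to_page(pte_offset_kernel(pmd, 0));

	/* Fragment count was never initialized for memblock-allocated tables */
	if (!atomic_read(&page->pt_frag_refcount))
		atomic_set(&page->pt_frag_refcount, 1);
}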

Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparison of how much more memory we need for a 64T and a 2G
system after this patch:

1. 64T system
-
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system

2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 11 ++-
 arch/powerpc/include/asm/book3s/64/radix.h   |  1 +
 arch/powerpc/include/asm/sparsemem.h |  1 +
 arch/powerpc/mm/book3s64/pgtable.c   | 31 -
 arch/powerpc/mm/book3s64/radix_pgtable.c | 72 ++--
 arch/powerpc/mm/mem.c|  5 ++
 arch/powerpc/mm/pgtable-frag.c   |  9 ++-
 7 files changed, 121 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index a41e91bd0580..e96572fb2871 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -109,7 +109,16 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, 
unsigned long addr)
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+   struct page *page = virt_to_page(pud);
+
+   /*
+* Early pud pages allocated via memblock allocator
+* can't be directly freed to slab
+*/
+   if (PageReserved(page))
+   free_reserved_page(page);
+   else
+   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index d97db3ad9aae..0aff8750181a 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -291,6 +291,7 @@ static inline unsigned long radix__get_tree_size(void)
 #ifdef CONFIG_MEMORY_HOTPLUG
 int radix__create_section_mapping(unsigned long start, unsigned long end, int 
nid);
 int radix__remove_section_mapping(unsigned long start, unsigned long end);
+void radix__fixup_pgtable_fragments(void);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 #endif /* __ASSEMBLY__ */
 #endif
diff --git a/arch/powerpc/include/asm/sparsemem.h 
b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..e662f9232d35 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -15,6 +15,7 @@
 #ifdef CONFIG_MEMORY_HOTPLUG
 extern int create_section_mapping(unsigned long start, unsigned long end, int 
nid);
 extern int remove_section_mapping(unsigned long start, unsigned long end);
+void fixup_pgtable_fragments(void);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 extern int resize_hpt_for_hotplug(unsigned long new_mem_size);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtab

[RFC PATCH v0 2/5] powerpc/mm/radix: Create separate mappings for hot-plugged memory

2020-04-05 Thread Bharata B Rao
Memory that gets hot-plugged _during_ boot (and not the memory
that gets plugged in after boot) is mapped with 1G mappings
and will undergo splitting when it is unplugged. The splitting
code has a few issues:

1. Recursive locking

The memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However, stop_machine() takes
cpu_hotplug_lock again, causing a deadlock.

2. BUG: sleeping function called from in_atomic() context
-
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()

3. Bad unlock unbalance
---
The memory unplug path takes the init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reacquire
ptl. Releasing ptl from a different thread than the one that acquired it
causes a bad unlock unbalance.

These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. During radix init, identify(*) the hot-plugged memory region
and create separate mappings for each LMB so that they don't get mapped
with 1G mappings.

To create separate mappings for every LMB in the hot-plugged
region, we need lmb-size. I am currently using memory_block_size_bytes()
API to get the lmb-size. Since this is early init time code, the
machine type isn't probed yet and hence memory_block_size_bytes()
would return the default LMB size of 16MB. Hence we end up creating
separate mappings at much lower granularity than what we could ideally
do for a pseries machine.

(*) Identifying and differentiating hot-plugged memory from the
boot time memory is now possible with PAPR extension to LMB flags.
(Ref: 
https://lore.kernel.org/linuxppc-dev/f55a7b65a43cc9dc7b22385cf9960f8b11d5ce2e.ca...@linux.ibm.com/T/#t)

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index dd1bea45325c..4a4fb30f6c3d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -313,6 +314,8 @@ static void __init radix_init_pgtable(void)
 {
unsigned long rts_field;
struct memblock_region *reg;
+   phys_addr_t addr;
+   u64 lmb_size = memory_block_size_bytes();
 
/* We don't support slb for radix */
mmu_slb_size = 0;
@@ -331,9 +334,15 @@ static void __init radix_init_pgtable(void)
continue;
}
 
-   WARN_ON(create_physical_mapping(reg->base,
-   reg->base + reg->size,
-   -1));
+   if (memblock_is_hotpluggable(reg)) {
+   for (addr = reg->base; addr < (reg->base + reg->size);
+   addr += lmb_size)
+   WARN_ON(create_physical_mapping(addr,
+   addr + lmb_size, -1));
+   } else
+   WARN_ON(create_physical_mapping(reg->base,
+   reg->base + reg->size,
+   -1));
}
 
/* Find out how many PID bits are supported */
-- 
2.21.0



[RFC PATCH v0 0/5] powerpc/mm/radix: Memory unplug fixes

2020-04-05 Thread Bharata B Rao
Memory unplug has a few bugs which I had attempted to fix earlier
at https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-July/194087.html

Now with Leonardo's patch for PAPR changes that add a separate flag bit
to LMB flags for explicitly identifying hot-removable memory
(https://lore.kernel.org/linuxppc-dev/f55a7b65a43cc9dc7b22385cf9960f8b11d5ce2e.ca...@linux.ibm.com/T/#t),
a few other issues around memory unplug on radix can be fixed. This
series is a combination of those fixes.

This series works on top of above mentioned Leonardo's patch.

Bharata B Rao (5):
  powerpc/pseries/hotplug-memory: Set DRCONF_MEM_HOTREMOVABLE for
hot-plugged mem
  powerpc/mm/radix: Create separate mappings for hot-plugged memory
  powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
mappings
  powerpc/mm/radix: Free PUD table when freeing pagetable
  powerpc/mm/radix: Remove split_kernel_mapping()

 arch/powerpc/include/asm/book3s/64/pgalloc.h  |  11 +-
 arch/powerpc/include/asm/book3s/64/radix.h|   1 +
 arch/powerpc/include/asm/sparsemem.h  |   1 +
 arch/powerpc/mm/book3s64/pgtable.c|  31 ++-
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 186 +++---
 arch/powerpc/mm/mem.c |   5 +
 arch/powerpc/mm/pgtable-frag.c|   9 +-
 .../platforms/pseries/hotplug-memory.c|   6 +-
 8 files changed, 167 insertions(+), 83 deletions(-)

-- 
2.21.0



[RFC PATCH v0 1/5] powerpc/pseries/hotplug-memory: Set DRCONF_MEM_HOTREMOVABLE for hot-plugged mem

2020-04-05 Thread Bharata B Rao
In addition to setting DRCONF_MEM_HOTREMOVABLE for boot-time hot-plugged
memory, we should set the same too for the memory that gets hot-plugged
post-boot. This ensures that the correct LMB flags value is reflected in
the ibm,dynamic-memory-vN property.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index a4d40a3ceea3..6d75f6e182ae 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -395,7 +395,8 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~(DRCONF_MEM_ASSIGNED |
+   DRCONF_MEM_HOTREMOVABLE);
 
return 0;
 }
@@ -678,7 +679,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
} else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags |= (DRCONF_MEM_ASSIGNED |
+  DRCONF_MEM_HOTREMOVABLE);
}
 
return rc;
-- 
2.21.0



Re: [PATCH v3 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-03 Thread Bharata B Rao
On Thu, Apr 02, 2020 at 04:51:57PM -0300, Leonardo Bras wrote:
> While providing guests, it's desirable to resize it's memory on demand.
> 
> By now, it's possible to do so by creating a guest with a small base
> memory, hot-plugging all the rest, and using 'movable_node' kernel
> command-line parameter, which puts all hot-plugged memory in
> ZONE_MOVABLE, allowing it to be removed whenever needed.
> 
> But there is an issue regarding guest reboot:
> If memory is hot-plugged, and then the guest is rebooted, all hot-plugged
> memory goes to ZONE_NORMAL, which offers no guaranteed hot-removal.
> It usually prevents this memory to be hot-removed from the guest.
> 
> It's possible to use device-tree information to fix that behavior, as
> it stores flags for LMB ranges on ibm,dynamic-memory-vN.
> It involves marking each memblock with the correct flags as hotpluggable
> memory, which mm/memblock.c puts in ZONE_MOVABLE during boot if
> 'movable_node' is passed.
> 
> For carrying such information, the new flag DRCONF_MEM_HOTREMOVABLE was
> proposed and accepted into Power Architecture documentation.
> This flag should be:
> - true (b=1) if the hypervisor may want to hot-remove it later, and
> - false (b=0) if it does not care.
> 
> During boot, guest kernel reads the device-tree, early_init_drmem_lmb()
> is called for every added LMBs. Here, checking for this new flag and
> marking memblocks as hotplugable memory is enough to get the desirable
> behavior.
> 
> This should cause no change if 'movable_node' parameter is not passed
> in kernel command-line.
> 
> Signed-off-by: Leonardo Bras 
> Reviewed-by: Bharata B Rao 
> 
> ---
> 
> Changes since v2:
> - New flag name changed from DRCONF_MEM_HOTPLUGGED to
>   DRCONF_MEM_HOTREMOVABLE

The patch would be more complete with the following change that ensures
that the DRCONF_MEM_HOTREMOVABLE flag is set for non-boot-time hotplugged
memory too. This will ensure that the ibm,dynamic-memory-vN property
reflects the right flags value for memory that gets hotplugged
post boot.

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index a4d40a3ceea3..6d75f6e182ae 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -395,7 +395,8 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
-   lmb->flags &= ~DRCONF_MEM_ASSIGNED;
+   lmb->flags &= ~(DRCONF_MEM_ASSIGNED |
+   DRCONF_MEM_HOTREMOVABLE);
 
return 0;
 }
@@ -678,7 +679,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
invalidate_lmb_associativity_index(lmb);
lmb_clear_nid(lmb);
} else {
-   lmb->flags |= DRCONF_MEM_ASSIGNED;
+   lmb->flags |= (DRCONF_MEM_ASSIGNED |
+  DRCONF_MEM_HOTREMOVABLE);
}
 
return rc;

Regards,
Bharata.



Re: [RFC PATCH v2 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-04-02 Thread Bharata B Rao
On Wed, Apr 1, 2020 at 8:38 PM Leonardo Bras  wrote:
>
> On Thu, 2020-03-05 at 20:32 -0300, Leonardo Bras wrote:
> > ---
> > The new flag was already proposed on Power Architecture documentation,
> > and it's waiting for approval.
> >
> > I would like to get your comments on this change, but it's still not
> > ready for being merged.
>
> New flag got approved on the documentation.
> Please review this patch.

Looks good to me, also tested with PowerKVM guests.

Reviewed-by: Bharata B Rao 

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


Re: [PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-22 Thread Bharata B Rao
On Fri, Mar 20, 2020 at 03:36:05PM +0100, Laurent Dufour wrote:
> Le 20/03/2020 à 12:24, Bharata B Rao a écrit :
> > On Fri, Mar 20, 2020 at 11:26:43AM +0100, Laurent Dufour wrote:
> > > When the call to UV_REGISTER_MEM_SLOT is failing, for instance because
> > > there is not enough free secured memory, the Hypervisor (HV) has to call
> > > UV_RETURN to report the error to the Ultravisor (UV). Then the UV will 
> > > call
> > > H_SVM_INIT_ABORT to abort the securing phase and go back to the calling 
> > > VM.
> > > 
> > > If the kvm->arch.secure_guest is not set, in the return path rfid is 
> > > called
> > > but there is no valid context to get back to the SVM since the Hcall has
> > > been routed by the Ultravisor.
> > > 
> > > Move the setting of kvm->arch.secure_guest earlier in
> > > kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
> > > instead of rfid.
> > > 
> > > Cc: Bharata B Rao 
> > > Cc: Paul Mackerras 
> > > Cc: Benjamin Herrenschmidt 
> > > Cc: Michael Ellerman 
> > > Signed-off-by: Laurent Dufour 
> > > ---
> > >   arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> > > b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > index 79b1202b1c62..68dff151315c 100644
> > > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > > @@ -209,6 +209,8 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> > >   int ret = H_SUCCESS;
> > >   int srcu_idx;
> > > + kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;
> > > +
> > >   if (!kvmppc_uvmem_bitmap)
> > >   return H_UNSUPPORTED;
> > > @@ -233,7 +235,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
> > >   goto out;
> > >   }
> > >   }
> > > - kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;
> > 
> > There is an assumption that memory slots would have been registered with UV
> > if KVMPPC_SECURE_INIT_START has been done. KVM_PPC_SVM_OFF ioctl will skip
> > unregistration and other steps during reboot if KVMPPC_SECURE_INIT_START
> > hasn't been done.
> > 
> > Have you checked if that path isn't affected by this change?
> 
> I checked that and didn't find any issue there.
> 
> My only concern was that block:
>   kvm_for_each_vcpu(i, vcpu, kvm) {
>   spin_lock(&vcpu->arch.vpa_update_lock);
>   unpin_vpa_reset(kvm, &vcpu->arch.dtl);
>   unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
>   unpin_vpa_reset(kvm, &vcpu->arch.vpa);
>   spin_unlock(&vcpu->arch.vpa_update_lock);
>   }
> 
> But that seems to be safe.

Yes, looks like.

> 
> However I'm not a familiar with the KVM's code, do you think an additional
> KVMPPC_SECURE_INIT_* value needed here?

Maybe not; as long as UV can handle the unexpected uv_unregister_mem_slot()
calls, we are good.

Regards,
Bharata.



Re: [PATCH 2/2] KVM: PPC: Book3S HV: H_SVM_INIT_START must call UV_RETURN

2020-03-20 Thread Bharata B Rao
On Fri, Mar 20, 2020 at 11:26:43AM +0100, Laurent Dufour wrote:
> When the call to UV_REGISTER_MEM_SLOT is failing, for instance because
> there is not enough free secured memory, the Hypervisor (HV) has to call
> UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
> H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.
> 
> If the kvm->arch.secure_guest is not set, in the return path rfid is called
> but there is no valid context to get back to the SVM since the Hcall has
> been routed by the Ultravisor.
> 
> Move the setting of kvm->arch.secure_guest earlier in
> kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
> instead of rfid.
> 
> Cc: Bharata B Rao 
> Cc: Paul Mackerras 
> Cc: Benjamin Herrenschmidt 
> Cc: Michael Ellerman 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
> b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 79b1202b1c62..68dff151315c 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -209,6 +209,8 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   int ret = H_SUCCESS;
>   int srcu_idx;
>  
> + kvm->arch.secure_guest = KVMPPC_SECURE_INIT_START;
> +
>   if (!kvmppc_uvmem_bitmap)
>   return H_UNSUPPORTED;
>  
> @@ -233,7 +235,6 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>   goto out;
>   }
>   }
> - kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_START;

There is an assumption that memory slots would have been registered with UV
if KVMPPC_SECURE_INIT_START has been done. KVM_PPC_SVM_OFF ioctl will skip
unregistration and other steps during reboot if KVMPPC_SECURE_INIT_START
hasn't been done.
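
For reference, the kind of guard that reboot path relies on looks roughly
like this (illustrative sketch only, keyed off the same
KVMPPC_SECURE_INIT_START bit):

	/* Skip UV unregistration and page release when H_SVM_INIT_START
	 * never ran, since nothing was registered with UV in that case. */
	if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
		return 0;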

Have you checked if that path isn't affected by this change?

Regards,
Bharata.



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-20 Thread Bharata B Rao
On Fri, Mar 20, 2020 at 09:37:18AM +0100, Vlastimil Babka wrote:
> On 3/20/20 4:42 AM, Bharata B Rao wrote:
> > On Thu, Mar 19, 2020 at 02:47:58PM +0100, Vlastimil Babka wrote:
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 17dc00e33115..7113b1f9cd77 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> >> flags, int node,
> >>  
> >>if (node == NUMA_NO_NODE)
> >>searchnode = numa_mem_id();
> >> -  else if (!node_present_pages(node))
> >> -  searchnode = node_to_mem_node(node);
> >>  
> >>object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >>if (object || node != NUMA_NO_NODE)
> >> @@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, 
> >> gfp_t gfpflags, int node,
> >>struct page *page;
> >>  
> >>page = c->page;
> >> -  if (!page)
> >> +  if (!page) {
> >> +  /*
> >> +   * if the node is not online or has no normal memory, just
> >> +   * ignore the node constraint
> >> +   */
> >> +  if (unlikely(node != NUMA_NO_NODE &&
> >> +   !node_state(node, N_NORMAL_MEMORY)))
> >> +  node = NUMA_NO_NODE;
> >>goto new_slab;
> >> +  }
> >>  redo:
> >>  
> >>if (unlikely(!node_match(page, node))) {
> >> -  int searchnode = node;
> >> -
> >> -  if (node != NUMA_NO_NODE && !node_present_pages(node))
> >> -  searchnode = node_to_mem_node(node);
> >> -
> >> -  if (unlikely(!node_match(page, searchnode))) {
> >> +  /*
> >> +   * same as above but node_match() being false already
> >> +   * implies node != NUMA_NO_NODE
> >> +   */
> >> +  if (!node_state(node, N_NORMAL_MEMORY)) {
> >> +  node = NUMA_NO_NODE;
> >> +  goto redo;
> >> +  } else {
> >>stat(s, ALLOC_NODE_MISMATCH);
> >>deactivate_slab(s, page, c->freelist, c);
> >>goto new_slab;
> > 
> > This fixes the problem I reported at
> > https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/
> 
> Thanks, I hope it means I can make it Reported-and-tested-by: you

It was first reported by PUVICHAKRAVARTHY RAMACHANDRAN 

You can add my tested-by.

Regards,
Bharata.



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-19 Thread Bharata B Rao
On Thu, Mar 19, 2020 at 02:47:58PM +0100, Vlastimil Babka wrote:
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..7113b1f9cd77 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> flags, int node,
>  
>   if (node == NUMA_NO_NODE)
>   searchnode = numa_mem_id();
> - else if (!node_present_pages(node))
> - searchnode = node_to_mem_node(node);
>  
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)
> @@ -2563,17 +2561,27 @@ static void *___slab_alloc(struct kmem_cache *s, 
> gfp_t gfpflags, int node,
>   struct page *page;
>  
>   page = c->page;
> - if (!page)
> + if (!page) {
> + /*
> +  * if the node is not online or has no normal memory, just
> +  * ignore the node constraint
> +  */
> + if (unlikely(node != NUMA_NO_NODE &&
> +  !node_state(node, N_NORMAL_MEMORY)))
> + node = NUMA_NO_NODE;
>   goto new_slab;
> + }
>  redo:
>  
>   if (unlikely(!node_match(page, node))) {
> - int searchnode = node;
> -
> - if (node != NUMA_NO_NODE && !node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> -
> - if (unlikely(!node_match(page, searchnode))) {
> + /*
> +  * same as above but node_match() being false already
> +  * implies node != NUMA_NO_NODE
> +  */
> + if (!node_state(node, N_NORMAL_MEMORY)) {
> + node = NUMA_NO_NODE;
> + goto redo;
> + } else {
>   stat(s, ALLOC_NODE_MISMATCH);
>   deactivate_slab(s, page, c->freelist, c);
>   goto new_slab;

This fixes the problem I reported at
https://lore.kernel.org/linux-mm/20200317092624.gb22...@in.ibm.com/

Regards,
Bharata.



Re: [RFC 1/2] mm, slub: prevent kmalloc_node crashes and memory leaks

2020-03-18 Thread Bharata B Rao
On Wed, Mar 18, 2020 at 03:42:19PM +0100, Vlastimil Babka wrote:
> This is a PowerPC platform with following NUMA topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31
> node 1 size: 35247 MB
> node 1 free: 30907 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> possible numa nodes: 0-31
> 
> A related issue was reported by Bharata [3] where a similar PowerPC
> configuration, but without patch [2] ends up allocating large amounts of pages
> by kmalloc-1k kmalloc-512. This seems to have the same underlying issue with
> node_to_mem_node() not behaving as expected, and might probably also lead
> to an infinite loop with CONFIG_SLUB_CPU_PARTIAL.

This patch doesn't fix the issue of kmalloc caches consuming more
memory for the above-mentioned topology. Also, CONFIG_SLUB_CPU_PARTIAL is set
here and I have not observed an infinite loop so far.

Or, are you expecting your fix to work on top of Srikar's other patchset
https://lore.kernel.org/linuxppc-dev/2020030237.5731-1-sri...@linux.vnet.ibm.com/t/#u
 ?

With the above patchset, no fix is required to address the increased memory
consumption of kmalloc caches because this patchset prevents such a
topology from occurring, thereby making it impossible for the problem
to surface (or at least impossible for the specific topology that I
mentioned).

> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..4d798cacdae1 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1511,7 +1511,7 @@ static inline struct page *alloc_slab_page(struct 
> kmem_cache *s,
>   struct page *page;
>   unsigned int order = oo_order(oo);
>  
> - if (node == NUMA_NO_NODE)
> + if (node == NUMA_NO_NODE || !node_online(node))
>   page = alloc_pages(flags, order);
>   else
>   page = __alloc_pages_node(node, flags, order);
> @@ -1973,8 +1973,6 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> flags, int node,
>  
>   if (node == NUMA_NO_NODE)
>   searchnode = numa_mem_id();
> - else if (!node_present_pages(node))
> - searchnode = node_to_mem_node(node);

We still come here with memory-less node=0 (and not NUMA_NO_NODE), fail to
find a partial slab, go back and allocate a new one, thereby continuously
increasing the number of newly allocated slabs.

>  
>   object = get_partial_node(s, get_node(s, searchnode), c, flags);
>   if (object || node != NUMA_NO_NODE)
> @@ -2568,12 +2566,15 @@ static void *___slab_alloc(struct kmem_cache *s, 
> gfp_t gfpflags, int node,
>  redo:
>  
>   if (unlikely(!node_match(page, node))) {
> - int searchnode = node;
> -
> - if (node != NUMA_NO_NODE && !node_present_pages(node))
> - searchnode = node_to_mem_node(node);
> -
> - if (unlikely(!node_match(page, searchnode))) {
> + /*
> +  * node_match() false implies node != NUMA_NO_NODE
> +  * but if the node is not online or has no pages, just
> +  * ignore the constraint
> +  */
> + if ((!node_online(node) || !node_present_pages(node))) {
> + node = NUMA_NO_NODE;
> + goto redo;

Many calls that allocate a slab object from memory-less node 0 in my case
don't even hit the above check because they get short-circuited by the
goto new_slab label a few lines above. Hence I don't see
any reduction in the amount of slab memory with this fix.

Regards,
Bharata.



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Bharata B Rao
On Wed, Mar 18, 2020 at 08:50:44AM +0530, Srikar Dronamraju wrote:
> * Vlastimil Babka  [2020-03-17 17:45:15]:
> 
> > On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > > * Vlastimil Babka  [2020-03-17 16:56:04]:
> > > 
> > >> 
> > >> I wonder why do you get a memory leak while Sachin in the same situation 
> > >> [1]
> > >> gets a crash? I don't understand anything anymore.
> > > 
> > > Sachin was testing on linux-next which has Kirill's patch which modifies
> > > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > > upstream, which doesn't have this. 
> > 
> > Yes, that Kirill's patch was about the memcg shrinker map allocation. But 
> > the
> > patch hunk that Bharata posted as a "hack" that fixes the problem, it 
> > follows
> > that there has to be something else that calls kmalloc_node(node) where 
> > node is
> > one that doesn't have present pages.
> > 
> > He mentions alloc_fair_sched_group() which has:
> > 
> > for_each_possible_cpu(i) {
> > cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> >   GFP_KERNEL, cpu_to_node(i));
> > ...
> > se = kzalloc_node(sizeof(struct sched_entity),
> >   GFP_KERNEL, cpu_to_node(i));
> > 
> 
> 
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.
> 
> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
> 
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.
> 
> 
> > I assume one of these structs is 1k and other 512 bytes (rounded) and that 
> > for
> > some possible cpu's cpu_to_node(i) will be 0, which has no present pages. 
> > And as
> > Bharata pasted, node_to_mem_node(0) = 0

Correct, these two kzalloc_node() calls for all possible cpus are
causing the increased slab memory consumption in my case.

> > So this looks like the same scenario, but it doesn't crash? Is the node 0
> > actually online here, and/or does it have N_NORMAL_MEMORY state?
> 

Node 0 is online, but its N_NORMAL_MEMORY state is empty. In fact, the memory
leak goes away if I insert the below check/assignment in the slab
alloc code path:

+   if (!node_isset(node, node_states[N_NORMAL_MEMORY]))
+   node = NUMA_NO_NODE;
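
An equivalent, and arguably more idiomatic, form of the same check (this is
the form Vlastimil's RFC above ends up using) would be:

	if (!node_state(node, N_NORMAL_MEMORY))
		node = NUMA_NO_NODE;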

Regards,
Bharata.



Re: [PATCH 4/4] powerpc/numa: Set fallback nodes for offline nodes

2020-03-17 Thread Bharata B Rao
This patchset can also fix a related problem that I reported earlier at
https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-March/206076.html
with an additional change, suggested by Srikar as shown below:

On Tue, Mar 17, 2020 at 06:47:53PM +0530, Srikar Dronamraju wrote:
> Currently fallback nodes for offline nodes aren't set. Hence by default
> node 0 ends up being the default node. However node 0 might be offline.
> 
> Fix this by explicitly setting fallback node. Ensure first_memory_node
> is set before kernel does explicit setting of fallback node.
> 
>  arch/powerpc/mm/numa.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 281531340230..6e97ab6575cb 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -827,7 +827,16 @@ void __init dump_numa_cpu_topology(void)
>   if (!numa_enabled)
>   return;
>  
> - for_each_online_node(node) {
> + for_each_node(node) {
> + /*
> +  * For all possible but not yet online nodes, ensure their
> +  * node_numa_mem is set correctly so that kmalloc_node works
> +  * for such nodes.
> +  */
> + if (!node_online(node)) {

Change the above line to the following:

+   if (!node_state(node, N_MEMORY)) {

Regards,
Bharata.



Re: Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Bharata B Rao
On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
> Case 1: 2 node NUMA, node0 empty
> 
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16294 MB
> node 1 free: 15453 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..888e4d245444 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
> flags, int node,
>   void *object;
>   int searchnode = node;
>  
> - if (node == NUMA_NO_NODE)
> + if (node == NUMA_NO_NODE || !node_present_pages(node))
>   searchnode = numa_mem_id();
> - else if (!node_present_pages(node))
> - searchnode = node_to_mem_node(node);

For the above topology, I see this:

node_to_mem_node(1) = 1
node_to_mem_node(0) = 0
node_to_mem_node(NUMA_NO_NODE) = 0

Looks like the last two cases (returning memory-less node 0) are the
problem here?

Regards,
Bharata.



Slub: Increased mem consumption on cpu,mem-less node powerpc guest

2020-03-17 Thread Bharata B Rao
Hi,

We are seeing an increased slab memory consumption on PowerPC guest
LPAR (on PowerVM) having an uncommon topology where one NUMA node has no
CPUs or any memory and the other node has all the CPUs and memory. Though
QEMU prevents such topologies for KVM guest, I hacked QEMU to allow such
topology to get some slab numbers. Here is the comparision of such
a KVM guest with a single node KVM guest with equal amount of CPUs and memory.

Case 1: 2 node NUMA, node0 empty

# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16294 MB
node 1 free: 15453 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

Case 2: Single node
===
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16294 MB
node 0 free: 15675 MB
node distances:
node   0 
  0:  10 

Here is how the total slab memory consumptions compare right after boot:
# grep -i slab /proc/meminfo

Case 1: 442560 kB
Case 2: 195904 kB

Closer look at the individual slabs suggests that most of the increased
slab consumption in Case 1 can be attributed to kmalloc-N slabs. In
particular the following two caches account for most of the increase.

Case 1:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k      2869   1024   101.5M   1549/1540/0   32   0   99   2 U
kmalloc-512     3302    512   100.2M   1530/1522/0   64   0   99   1 U

Case 2:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k      2811   1024     6.1M       94/29/0   32   0   30  46 U
kmalloc-512     3207    512     3.5M       54/13/0   64   0   24  46 U

Here is the list of slub stats that significantly differ between two cases:

Case 1:
--
alloc_from_partial 6333 C0=1506 C1=525 C2=774 C3=478 C4=413 C5=1036 C6=698 
C7=903
alloc_slab 3350 C0=757 C1=336 C2=120 C3=72 C4=120 C5=912 C6=600 C7=433
alloc_slowpath 9792 C0=2329 C1=861 C2=916 C3=571 C4=533 C5=1948 C6=1298 C7=1336
cmpxchg_double_fail 31 C1=3 C2=2 C3=7 C4=3 C5=4 C6=2 C7=10
deactivate_full 38 C0=14 C1=2 C2=13 C5=3 C6=2 C7=4
deactivate_remote_frees 1 C7=1
deactivate_to_head 10092 C0=2654 C1=859 C2=903 C3=571 C4=533 C5=1945 C6=1296 
C7=1331
deactivate_to_tail 1 C7=1
free_add_partial 29 C0=7 C2=1 C3=5 C4=3 C5=6 C6=2 C7=5
free_frozen 32 C0=4 C1=3 C2=4 C3=3 C4=7 C5=3 C6=7 C7=1
free_remove_partial 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slab 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slowpath 7415 C0=2014 C1=486 C2=433 C3=525 C4=814 C5=1707 C6=586 C7=850
objects 2875 N1=2875
objects_partial 2587 N1=2587
partial 1542 N1=1542
slabs 1551 N1=1551
total_objects 49632 N1=49632

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=147813/152837/153714 pid=1-1074 
cpus=0-2,5-7 nodes=1

# cat free_calls (truncated) 
   2671  age=4295094831 pid=0 cpus=0 nodes=1
  2 free_fair_sched_group+0xa0/0x120 age=156576/156850/157125 pid=0 
cpus=0,5 nodes=1

Case 2:
--
alloc_from_partial 9231 C0=435 C1=2349 C2=2386 C3=1807 C4=882 C5=367 C6=559 
C7=446
alloc_slab 114 C0=12 C1=41 C2=28 C3=15 C4=9 C5=1 C6=1 C7=7
alloc_slowpath 9415 C0=448 C1=2390 C2=2414 C3=1891 C4=891 C5=368 C6=560 C7=453
cmpxchg_double_fail 22 C0=1 C1=1 C3=3 C4=8 C5=1 C6=5 C7=3
deactivate_full 512 C0=13 C1=143 C2=147 C3=147 C4=22 C5=10 C6=6 C7=24
deactivate_remote_frees 1 C4=1
deactivate_to_head 9099 C0=437 C1=2247 C2=2267 C3=1937 C4=870 C5=358 C6=554 
C7=429
deactivate_to_tail 1 C4=1
free_add_partial 447 C0=21 C1=140 C2=164 C3=60 C4=22 C5=16 C6=14 C7=10
free_frozen 22 C0=3 C2=3 C3=2 C4=1 C5=6 C6=6 C7=1
free_remove_partial 20 C1=5 C2=5 C4=3 C6=7
free_slab 20 C1=5 C2=5 C4=3 C6=7
free_slowpath 6953 C0=194 C1=2123 C2=1729 C3=850 C4=466 C5=725 C6=520 C7=346
objects 2812 N0=2812
objects_partial 733 N0=733
partial 29 N0=29
slabs 94 N0=94
total_objects 3008 N0=3008

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=43957/46225/46802 pid=1-1059 
cpus=1-5,7

# cat free_calls (truncated) 
   1516  age=4294987281 pid=0 cpus=0
647 free_fair_sched_group+0xa0/0x120 age=48798/49142/49628 pid=0-954 
cpus=1-2

We see a significant difference in the number of partial slabs and
the resulting total_objects between the two cases. I was trying to
see if this has anything to do with the way the node value is
arrived at in different slub routines. I haven't yet understood the slub
code well enough to say anything conclusive, but the following hack in the slub
code completely eliminates the increased slab consumption for Case 1 and
makes it very similar to Case 2.

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..888e4d245444 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
void *object;
int searchnode = node;
 
-   if (node == NUMA_NO_NODE)
+   if (node == NUMA_NO_NODE || !node_present_pages(

Re: [PATCH 1/1] powerpc/kernel: Enables memory hot-remove after reboot on pseries guests

2020-03-03 Thread Bharata B Rao
On Fri, Feb 28, 2020 at 11:36 AM Leonardo Bras  wrote:
>
> While providing guests, it's desirable to resize it's memory on demand.
>
> By now, it's possible to do so by creating a guest with a small base
> memory, hot-plugging all the rest, and using 'movable_node' kernel
> command-line parameter, which puts all hot-plugged memory in
> ZONE_MOVABLE, allowing it to be removed whenever needed.
>
> But there is an issue regarding guest reboot:
> If memory is hot-plugged, and then the guest is rebooted, all hot-plugged
> memory goes to ZONE_NORMAL, which offers no guaranteed hot-removal.
> It usually prevents this memory to be hot-removed from the guest.
>
> It's possible to use device-tree information to fix that behavior, as
> it stores flags for LMB ranges on ibm,dynamic-memory-vN.
> It involves marking each memblock with the correct flags as hotpluggable
> memory, which mm/memblock.c puts in ZONE_MOVABLE during boot if
> 'movable_node' is passed.
>
> For base memory, qemu assigns these flags for it's LMBs:
> (DRCONF_MEM_AI_INVALID | DRCONF_MEM_RESERVED)
> For hot-plugged memory, it assigns (DRCONF_MEM_ASSIGNED).
>
> While guest kernel reads the device-tree, early_init_drmem_lmb() is
> called for every added LMBs, doing nothing for base memory, and adding
> memblocks for hot-plugged memory. Skipping base memory happens here:
>
> if ((lmb->flags & DRCONF_MEM_RESERVED) ||
> !(lmb->flags & DRCONF_MEM_ASSIGNED))
> return;
>
> Marking memblocks added by this function as hotplugable memory
> is enough to get the desirable behavior, and should cause no change
> if 'movable_node' parameter is not passed to kernel.
>
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kernel/prom.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 6620f37abe73..f4d14c67bf53 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -518,6 +518,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb 
> *lmb,
> DBG("Adding: %llx -> %llx\n", base, size);
> if (validate_mem_limit(base, &size))
> memblock_add(base, size);
> +
> +   early_init_dt_mark_hotplug_memory_arch(base, size);

Hi,

I tried this a few years back
(https://patchwork.ozlabs.org/patch/800142/) and didn't pursue it
further because at that time, it was felt that the approach might not
work for PowerVM guests, because all the present memory except RMA
gets marked as hot-pluggable by PowerVM. This discussion is not
present in the above thread, but during my private discussions with
Reza and Nathan, it was noted that making all that memory MOVABLE
is not preferable for PowerVM guests as we might run out of memory for
kernel allocations.

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


[PATCH FIX] KVM: PPC: Book3S HV: Release lock on page-out failure path

2020-01-21 Thread Bharata B Rao
When migrate_vma_setup() fails in kvmppc_svm_page_out(),
release kvm->arch.uvmem_lock before returning.

Fixes: ca9f4942670 ("KVM: PPC: Book3S HV: Support for running secure guests")
Signed-off-by: Bharata B Rao 
---
Applies on paulus/kvm-ppc-next branch

 arch/powerpc/kvm/book3s_hv_uvmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 4d1f25a3959a..79b1202b1c62 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -571,7 +571,7 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned 
long start,
 
ret = migrate_vma_setup(&mig);
if (ret)
-   return ret;
+   goto out;
 
spage = migrate_pfn_to_page(*mig.src);
if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE))
-- 
2.21.0



[PATCH] powerpc: Ultravisor: Fix the dependencies for CONFIG_PPC_UV

2020-01-09 Thread Bharata B Rao
Let PPC_UV depend only on DEVICE_PRIVATE which in turn
will satisfy all the other required dependencies

Fixes: 013a53f2d25a ("powerpc: Ultravisor: Add PPC_UV config option")
Reported-by: kbuild test robot 
Signed-off-by: Bharata B Rao 
---
 arch/powerpc/Kconfig | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..e2a412113359 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -455,11 +455,7 @@ config PPC_TRANSACTIONAL_MEM
 config PPC_UV
bool "Ultravisor support"
depends on KVM_BOOK3S_HV_POSSIBLE
-   select ZONE_DEVICE
-   select DEV_PAGEMAP_OPS
-   select DEVICE_PRIVATE
-   select MEMORY_HOTPLUG
-   select MEMORY_HOTREMOVE
+   depends on DEVICE_PRIVATE
default n
help
  This option paravirtualizes the kernel to run in POWER platforms that
-- 
2.21.0



Re: [PATCH V3 2/2] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-12-15 Thread Bharata B Rao
On Sat, Dec 14, 2019 at 06:12:08PM -0800, Sukadev Bhattiprolu wrote:
> +unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
> +{
> + int i;
> +
> + if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> + return H_UNSUPPORTED;
> +
> + for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> + struct kvm_memory_slot *memslot;
> + struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> +
> + if (!slots)
> + continue;
> +
> + kvm_for_each_memslot(memslot, slots)
> + kvmppc_uvmem_drop_pages(memslot, kvm, false);
> + }

You need to hold srcu_read_lock(&kvm->srcu) here.
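
For illustration, a minimal sketch of the same walk with the SRCU read side
held (simplified from the quoted hunk above, error paths omitted):

	int srcu_idx = srcu_read_lock(&kvm->srcu);

	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
		struct kvm_memory_slot *memslot;

		if (!slots)
			continue;

		kvm_for_each_memslot(memslot, slots)
			kvmppc_uvmem_drop_pages(memslot, kvm, false);
	}

	srcu_read_unlock(&kvm->srcu, srcu_idx);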

Regards,
Bharata.



Re: [PATCH v11 0/7] KVM: PPC: Driver to manage pages of secure guest

2019-12-03 Thread Bharata B Rao
On Sun, Dec 01, 2019 at 12:24:50PM -0800, Hugh Dickins wrote:
> On Thu, 28 Nov 2019, Bharata B Rao wrote:
> > On Mon, Nov 25, 2019 at 08:36:24AM +0530, Bharata B Rao wrote:
> > > Hi,
> > > 
> > > This is the next version of the patchset that adds required support
> > > in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.
> > > 
> > 
> > Here is a fix for the issue Hugh identified with the usage of ksm_madvise()
> > in this patchset. It applies on top of this patchset.
> 
> It looks correct to me, and I hope will not spoil your performance in any
> way that matters.  But I have to say, the patch would be so much clearer,
> if you just named your bool "downgraded" instead of "downgrade".

Thanks for confirming. Yes "downgraded" would have been more
appropriate, will probably change it when we do any next change in this
part of the code.

Regards,
Bharata.



Re: [PATCH v11 0/7] KVM: PPC: Driver to manage pages of secure guest

2019-11-27 Thread Bharata B Rao
On Mon, Nov 25, 2019 at 08:36:24AM +0530, Bharata B Rao wrote:
> Hi,
> 
> This is the next version of the patchset that adds required support
> in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.
> 

Here is a fix for the issue Hugh identified with the usage of ksm_madvise()
in this patchset. It applies on top of this patchset.


>From 8a4d769bf4c61f921c79ce68923be3c403bd5862 Mon Sep 17 00:00:00 2001
From: Bharata B Rao 
Date: Thu, 28 Nov 2019 09:31:54 +0530
Subject: [PATCH 1/1] KVM: PPC: Book3S HV: Take write mmap_sem when calling
 ksm_madvise

In order to prevent the device private pages (that correspond to
pages of the secure guest) from participating in KSM merging, H_SVM_PAGE_IN
calls ksm_madvise() under the read version of mmap_sem. However, ksm_madvise()
needs to be called under the write lock; fix this.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index f24ac3cfb34c..2de264fc3156 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -46,11 +46,10 @@
  *
  * Locking order
  *
- * 1. srcu_read_lock(&kvm->srcu) - Protects KVM memslots
- * 2. down_read(&kvm->mm->mmap_sem) - find_vma, migrate_vma_pages and helpers
- * 3. mutex_lock(&kvm->arch.uvmem_lock) - protects read/writes to uvmem slots
- *   thus acting as sync-points
- *   for page-in/out
+ * 1. kvm->srcu - Protects KVM memslots
+ * 2. kvm->mm->mmap_sem - find_vma, migrate_vma_pages and helpers, ksm_madvise
+ * 3. kvm->arch.uvmem_lock - protects read/writes to uvmem slots thus acting
+ *  as sync-points for page-in/out
  */
 
 /*
@@ -344,7 +343,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 static int
 kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
   unsigned long end, unsigned long gpa, struct kvm *kvm,
-  unsigned long page_shift)
+  unsigned long page_shift, bool *downgrade)
 {
unsigned long src_pfn, dst_pfn = 0;
struct migrate_vma mig;
@@ -360,8 +359,15 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
long start,
mig.src = &src_pfn;
mig.dst = &dst_pfn;
 
+   /*
+* We come here with mmap_sem write lock held just for
+* ksm_madvise(), otherwise we only need read mmap_sem.
+* Hence downgrade to read lock once ksm_madvise() is done.
+*/
ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
  MADV_UNMERGEABLE, &vma->vm_flags);
+   downgrade_write(&kvm->mm->mmap_sem);
+   *downgrade = true;
if (ret)
return ret;
 
@@ -456,6 +462,7 @@ unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 unsigned long flags, unsigned long page_shift)
 {
+   bool downgrade = false;
unsigned long start, end;
struct vm_area_struct *vma;
int srcu_idx;
@@ -476,7 +483,7 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 
ret = H_PARAMETER;
srcu_idx = srcu_read_lock(&kvm->srcu);
-   down_read(&kvm->mm->mmap_sem);
+   down_write(&kvm->mm->mmap_sem);
 
start = gfn_to_hva(kvm, gfn);
if (kvm_is_error_hva(start))
@@ -492,12 +499,16 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (!vma || vma->vm_start > start || vma->vm_end < end)
goto out_unlock;
 
-   if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift))
+   if (!kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift,
+   &downgrade))
ret = H_SUCCESS;
 out_unlock:
mutex_unlock(&kvm->arch.uvmem_lock);
 out:
-   up_read(&kvm->mm->mmap_sem);
+   if (downgrade)
+   up_read(&kvm->mm->mmap_sem);
+   else
+   up_write(&kvm->mm->mmap_sem);
srcu_read_unlock(&kvm->srcu, srcu_idx);
return ret;
 }
-- 
2.21.0



Re: [PATCH v11 1/7] mm: ksm: Export ksm_madvise()

2019-11-26 Thread Bharata B Rao
On Tue, Nov 26, 2019 at 07:59:49PM -0800, Hugh Dickins wrote:
> On Mon, 25 Nov 2019, Bharata B Rao wrote:
> 
> > On PEF-enabled POWER platforms that support running of secure guests,
> > secure pages of the guest are represented by device private pages
> > in the host. Such pages needn't participate in KSM merging. This is
> > achieved by using ksm_madvise() call which need to be exported
> > since KVM PPC can be a kernel module.
> > 
> > Signed-off-by: Bharata B Rao 
> > Acked-by: Paul Mackerras 
> > Cc: Andrea Arcangeli 
> > Cc: Hugh Dickins 
> 
> I can say
> Acked-by: Hugh Dickins 
> to this one.
> 
> But not to your 2/7 which actually makes use of it: because sadly it
> needs down_write(&kvm->mm->mmap_sem) for the case when it switches off
> VM_MERGEABLE in vma->vm_flags.  That's frustrating, since I think it's
> the only operation for which down_read() is not good enough.

Oh ok! Thanks for pointing this out.

> 
> I have no idea how contended that mmap_sem is likely to be, nor how
> many to-be-secured pages that vma is likely to contain: you might find
> it okay simply to go with it down_write throughout, or you might want
> to start out with it down_read, and only restart with down_write (then
> perhaps downgrade_write later) when you see VM_MERGEABLE is set.

Using down_write throughout is not easy as we do migrate_vma_pages()
from the fault path (->migrate_to_ram()) too. Here we come with down_read
already held.

Starting with down_read and restarting with down_write if VM_MERGEABLE
is set -- this also looks a bit difficult as we will have challenges
with locking order if we release mmap_sem in between and re-acquire.

So I think I will start with down_write in this particular case
and will downgrade_write as soon as ksm_madvise() is complete.
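
Roughly, the pattern I have in mind is the following (illustrative sketch
only; the function name is made up, and the vma lookup plus error handling
are omitted):

static int svm_page_in_sketch(struct kvm *kvm, struct vm_area_struct *vma)
{
	int ret;

	/* write lock only for the vm_flags update done by ksm_madvise() */
	down_write(&kvm->mm->mmap_sem);
	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
			  MADV_UNMERGEABLE, &vma->vm_flags);

	/* the rest of the page-in only needs the read side */
	downgrade_write(&kvm->mm->mmap_sem);
	/* ... migrate_vma_setup()/migrate_vma_pages() etc. go here ... */
	up_read(&kvm->mm->mmap_sem);
	return ret;
}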

> 
> The crash you got (thanks for the link): that will be because your
> migrate_vma_pages() had already been applied to a page that was
> already being shared via KSM.
> 
> But if these secure pages are expected to be few and far between,
> maybe you'd prefer to keep VM_MERGEABLE, and add per-page checks
> of some kind into mm/ksm.c, to skip over these surprising hybrids.

I did bail out from a few routines in mm/ksm.c with an
is_device_private_page(page) check, but that wasn't good enough and
I encountered crashes in different code paths. I guess a bit more
understanding of KSM internals would be required before retrying that.

However, since all the pages of the guest except for a few will be turned
into secure pages early during boot, it appears better if secure guests
don't participate in KSM merging at all.

Regards,
Bharata.



Re: [PATCH v11 1/7] mm: ksm: Export ksm_madvise()

2019-11-24 Thread Bharata B Rao
On Mon, Nov 25, 2019 at 08:36:25AM +0530, Bharata B Rao wrote:
> On PEF-enabled POWER platforms that support running of secure guests,
> secure pages of the guest are represented by device private pages
> in the host. Such pages needn't participate in KSM merging. This is
> achieved by using ksm_madvise() call which need to be exported
> since KVM PPC can be a kernel module.
> 
> Signed-off-by: Bharata B Rao 
> Acked-by: Paul Mackerras 
> Cc: Andrea Arcangeli 
> Cc: Hugh Dickins 

Just want to point out that I observe a kernel crash when KSM is
dealing with device private pages. More details about the crash here:

https://lore.kernel.org/linuxppc-dev/20191115141006.ga21...@in.ibm.com/

Regards,
Bharata.



[PATCH v11 7/7] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-11-24 Thread Bharata B Rao
From: Anshuman Khandual 

CONFIG_PPC_UV adds support for ultravisor.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e446bb5b3f8d..1ec34e16ed65 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -452,6 +452,23 @@ config PPC_TRANSACTIONAL_MEM
help
  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+   bool "Ultravisor support"
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   select ZONE_DEVICE
+   select DEV_PAGEMAP_OPS
+   select DEVICE_PRIVATE
+   select MEMORY_HOTPLUG
+   select MEMORY_HOTREMOVE
+   default n
+   help
+ This option paravirtualizes the kernel to run in POWER platforms that
+ supports the Protected Execution Facility (PEF). On such platforms,
+ the ultravisor firmware runs at a privilege level above the
+ hypervisor.
+
+ If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if 
EXPERT
depends on PPC64
-- 
2.21.0



[PATCH v11 6/7] KVM: PPC: Support reset of secure guest

2019-11-24 Thread Bharata B Rao
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
the following steps:

- Release all device pages of the secure guest.
- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to secure
  side when guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinit the partition scoped page tables

After these steps, the guest is ready to issue the UV_ESM call once again
to switch to secure mode.
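
As a usage illustration, a VMM could invoke the ioctl during machine reset
roughly like this (hypothetical userspace sketch; vm_fd is assumed to be the
VM file descriptor obtained via KVM_CREATE_VM and is not part of this patch):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int reset_secure_guest(int vm_fd)
{
	/* no-op for a normal (non-secure) guest, per the semantics above */
	return ioctl(vm_fd, KVM_PPC_SVM_OFF, 0);
}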

Signed-off-by: Bharata B Rao 
Signed-off-by: Sukadev Bhattiprolu 
[Implementation of uv_svm_terminate() and its call from
guest shutdown path]
Signed-off-by: Ram Pai 
[Unpinning of VPA pages]
Signed-off-by: Paul Mackerras 
[Prevent any vcpus from running when unpinning VPAs]
---
 Documentation/virt/kvm/api.txt| 18 +
 arch/powerpc/include/asm/kvm_ppc.h|  1 +
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 ++
 arch/powerpc/kvm/book3s_hv.c  | 90 +++
 arch/powerpc/kvm/powerpc.c| 12 +++
 include/uapi/linux/kvm.h  |  1 +
 7 files changed, 128 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 4833904d32a5..5a773bd3e686 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4126,6 +4126,24 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL:if ultravisor failed to terminate the secure guest
+  ENOMEM:if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages and releases all the device pages that are used to
+track the secure pages by hypervisor.
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index ee62776e5433..3713e8e4d7ea 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -321,6 +321,7 @@ struct kvmppc_ops {
   int size);
int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
  int size);
+   int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 4b0d044caa2a..b66f6db7be6c 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -34,5 +34,6 @@
 #define UV_UNSHARE_PAGE0xF134
 #define UV_UNSHARE_ALL_PAGES   0xF140
 #define UV_PAGE_INVAL  0xF138
+#define UV_SVM_TERMINATE   0xF13C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index b8e59b7b4ac8..790b0e63681f 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -77,4 +77,9 @@ static inline int uv_page_inval(u64 lpid, u64 gpa, u64 
page_shift)
return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
 }
 
+static inline int uv_svm_terminate(u64 lpid)
+{
+   return ucall_norets(UV_SVM_TERMINATE, lpid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cb7ae1e9e4f2..a0bc1722dec1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5000,6 +5000,7 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
if (nesting_enabled(kvm))
kvmhv_release_all_nested(kvm);
kvm->arch.process_table = 0;
+   uv_svm_terminate(kvm->arch.lpid);
kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
}
 
@@ -5442,6 +5443,94 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, 
ulong *eaddr, void *ptr,
return rc;
 }
 
+static void unpin_vpa_reset(struct kvm *kvm, struct kvmppc_vpa *vpa)
+{
+   unpin_vpa(kvm, vpa);
+   vpa->gpa = 0;
+   vpa->pinned_addr = NULL;
+   vpa->dirty = false;
+   vpa->update_pending = 0;
+}
+
+/*
+ *  IOCTL handler to turn off secure mode of guest
+ *
+ * - Release all device pages
+ * - Issue ucall to terminate the guest on the UV side
+ * - Unpin the VPA pages.
+ * - Reinit the partition scoped page tables
+ */
+stat

[PATCH v11 5/7] KVM: PPC: Handle memory plug/unplug to secure VM

2019-11-24 Thread Bharata B Rao
Register the new memslot with UV during plug and unregister
the memslot during unplug. In addition, release all the
device pages during unplug.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  3 ++
 arch/powerpc/kvm/book3s_hv.c| 24 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 37 +
 6 files changed, 76 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 3033a9585b43..50204e228f16 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -19,6 +19,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+struct kvm *kvm);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -64,5 +66,9 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
 {
return -EFAULT;
 }
+
+static inline void
+kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+   struct kvm *kvm) { }
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index e774274ab30e..4b0d044caa2a 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -27,6 +27,7 @@
 #define UV_RETURN  0xF11C
 #define UV_ESM 0xF110
 #define UV_REGISTER_MEM_SLOT   0xF120
+#define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 #define UV_SHARE_PAGE  0xF130
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 40cc8bace654..b8e59b7b4ac8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -67,6 +67,11 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+   return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
 {
return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 9f6ba113ffe3..da857c8ba6e4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1101,6 +1101,9 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
unsigned long gpa;
unsigned int shift;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)
+   kvmppc_uvmem_drop_pages(memslot, kvm);
+
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return;
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80e84277d11f..cb7ae1e9e4f2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
@@ -4532,6 +4533,29 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
kvmppc_radix_flush_memslot(kvm, old);
+   /*
+* If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+*/
+   if (!kvm->arch.secure_guest)
+   return;
+
+   switch (change) {
+   case KVM_MR_CREATE:
+   if (kvmppc_uvmem_slot_init(kvm, new))
+   return;
+   uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id);
+   break;
+   case KVM_MR_DELETE:
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   kvmppc_uvmem_slot_free(kvm, old);
+   break;
+   default:
+   /* TODO: Handle KVM_MR_MOVE */
+   break;
+   }
 }
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 9266ed53cf7a..f24ac3cfb34c 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c

[PATCH v11 4/7] KVM: PPC: Radix changes for secure guest

2019-11-24 Thread Bharata B Rao
- After the guest becomes secure, when we handle a page fault of a page
  belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
- Ensure all those routines that walk the secondary page tables of
  the guest don't do so in case of secure VM. For secure guest, the
  active secondary page tables are in secure memory and the secondary
  page tables in HV are freed when guest becomes secure.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 
 arch/powerpc/kvm/book3s_64_mmu_radix.c  | 22 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 32 +
 5 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 95f389c2937b..3033a9585b43 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -18,6 +18,7 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -58,5 +59,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct 
kvm *kvm)
 {
return H_UNSUPPORTED;
 }
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
+{
+   return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 2483f15bd71a..e774274ab30e 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -32,5 +32,6 @@
 #define UV_SHARE_PAGE  0xF130
 #define UV_UNSHARE_PAGE0xF134
 #define UV_UNSHARE_ALL_PAGES   0xF140
+#define UV_PAGE_INVAL  0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 79bb005e8ee9..40cc8bace654 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -67,4 +67,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+   return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..9f6ba113ffe3 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!(dsisr & DSISR_PRTABLE_FAULT))
gpa |= ea & 0xfff;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return kvmppc_send_page_to_uv(kvm, gfn);
+
/* Get the corresponding memslot */
memslot = gfn_to_memslot(kvm, gfn);
 
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE) {
+   uv_page_inval(kvm->arch.lpid, gpa, PAGE_SHIFT);
+   return 0;
+   }
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
  

[PATCH v11 3/7] KVM: PPC: Shared pages support for secure guests

2019-11-24 Thread Bharata B Rao
A secure guest will share some of its pages with hypervisor (Eg. virtio
bounce buffers etc). Support sharing of pages between hypervisor and
ultravisor.

Shared page is reachable via both HV and UV side page tables. Once a
secure page is converted to shared page, the device page that represents
the secure page is unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 85 --
 2 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 4150732c81a0..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 1b8f4a3ceb12..51f094db43f8 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -19,7 +19,10 @@
  * available in the platform for running secure guests is hotplugged.
  * Whenever a page belonging to the guest becomes secure, a page from this
  * private device memory is used to represent and track that secure page
- * on the HV side.
+ * on the HV side. Some pages (like virtio buffers, VPA pages etc) are
+ * shared between UV and HV. However such pages aren't represented by
+ * device private memory and mappings to shared memory exist in both
+ * UV and HV page tables.
  */
 
 /*
@@ -64,6 +67,9 @@
  * UV splits and remaps the 2MB page if necessary and copies out the
  * required 64K page contents.
  *
+ * Shared pages: Whenever guest shares a secure page, UV will split and
+ * remap the 2MB page if required and issue H_SVM_PAGE_IN with 64K page size.
+ *
  * In summary, the current secure pages handling code in HV assumes
  * 64K page size and in fact fails any page-in/page-out requests of
  * non-64K size upfront. If and when UV starts supporting multiple
@@ -94,6 +100,7 @@ struct kvmppc_uvmem_slot {
 struct kvmppc_uvmem_page_pvt {
struct kvm *kvm;
unsigned long gpa;
+   bool skip_page_out;
 };
 
 int kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot *slot)
@@ -338,8 +345,64 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
long start,
return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+   int ret = H_PARAMETER;
+   struct page *uvmem_page;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   unsigned long gfn = gpa >> page_shift;
+   int srcu_idx;
+   unsigned long uvmem_pfn;
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   }
+
+retry:
+   mutex_unlock(&kvm->arch.uvmem_lock);
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (is_error_noslot_pfn(pfn))
+   goto out;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   kvm_release_pfn_clean(pfn);
+   goto retry;
+   }
+
+   if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+   ret = H_SUCCESS;
+   kvm_release_pfn_clean(pfn);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+out:
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
+ *
+ * H_PAGE_IN_SHARED flag makes the page shared which means that the same
+ * memory in is visible from both UV and HV.
  */
 unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
@@ -357,9 +420,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (page_shift != PAGE_SHIFT)
return H_P3;
 
-   if (flags)
+   if (flags & ~H_PAGE_IN_SHARED)
return H_P2;
 
+   

[PATCH v11 2/7] KVM: PPC: Support for running secure guests

2019-11-24 Thread Bharata B Rao
A pseries guest can be run as secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by
hypervisor (HV) and secure memory managed by Ultravisor (UV).

HV is informed about the guest's transition to secure mode via hcalls:

H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with
the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
the guest to secure mode is complete.

These two states (transition to secure mode STARTED and transition
to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
Setting these states will cause the assembly code that enters the
guest to call the UV_RETURN ucall instead of trying to enter the
guest directly.

Migration of pages between normal and secure memory of a secure
guest is implemented in H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages() using UV_PAGE_IN and
UV_PAGE_OUT ucalls.
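
For context, the HV side dispatches these hcalls from
kvmppc_pseries_do_hcall() in book3s_hv.c (that hunk is counted in the
diffstat below but truncated in this archive). A rough sketch of the
dispatch, assuming the usual hcall argument convention (arguments in
GPRs 4..6):

        case H_SVM_PAGE_IN:
                ret = kvmppc_h_svm_page_in(vcpu->kvm,
                                           kvmppc_get_gpr(vcpu, 4),
                                           kvmppc_get_gpr(vcpu, 5),
                                           kvmppc_get_gpr(vcpu, 6));
                break;
        case H_SVM_PAGE_OUT:
                ret = kvmppc_h_svm_page_out(vcpu->kvm,
                                            kvmppc_get_gpr(vcpu, 4),
                                            kvmppc_get_gpr(vcpu, 5),
                                            kvmppc_get_gpr(vcpu, 6));
                break;
        case H_SVM_INIT_START:
                ret = kvmppc_h_svm_init_start(vcpu->kvm);
                break;
        case H_SVM_INIT_DONE:
                ret = kvmppc_h_svm_init_done(vcpu->kvm);
                break;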

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   6 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  62 ++
 arch/powerpc/include/asm/kvm_host.h |   6 +
 arch/powerpc/include/asm/ultravisor-api.h   |   3 +
 arch/powerpc/include/asm/ultravisor.h   |  21 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  29 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 628 
 8 files changed, 758 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2023e327..4150732c81a0 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,12 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+#define H_SVM_INIT_START   0xEF08
+#define H_SVM_INIT_DONE0xEF0C
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..95f389c2937b
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_KVM_BOOK3S_UVMEM_H__
+#define __ASM_KVM_BOOK3S_UVMEM_H__
+
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+int kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot 
*slot);
+void kvmppc_uvmem_slot_free(struct kvm *kvm,
+   const struct kvm_memory_slot *slot);
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+  unsigned long gra,
+  unsigned long flags,
+  unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+   unsigned long gra,
+   unsigned long flags,
+   unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+   return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) { }
+
+static inline int
+kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot *slot)
+{
+   return 0;
+}
+
+static inline void
+kvmppc_uvmem_slot_free(struct kvm *kvm, const struct kvm_memory_slot *slot) { }
+
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+ unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;

[PATCH v11 1/7] mm: ksm: Export ksm_madvise()

2019-11-24 Thread Bharata B Rao
On PEF-enabled POWER platforms that support running of secure guests,
secure pages of the guest are represented by device private pages
in the host. Such pages needn't participate in KSM merging. This is
achieved by using the ksm_madvise() call, which needs to be exported
since KVM PPC can be a kernel module.

Signed-off-by: Bharata B Rao 
Acked-by: Paul Mackerras 
Cc: Andrea Arcangeli 
Cc: Hugh Dickins 
---
 mm/ksm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index dbee2eb4dd05..e45b02ad3f0b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2478,6 +2478,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long 
start,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(ksm_madvise);
 
 int __ksm_enter(struct mm_struct *mm)
 {
-- 
2.21.0



[PATCH v11 0/7] KVM: PPC: Driver to manage pages of secure guest

2019-11-24 Thread Bharata B Rao
Hi,

This is the next version of the patchset that adds required support
in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.

This version includes the following changes:

- Ensure that any malicious calls to the 4 hcalls (init_start, init_done,
  page_in and page_out) are handled safely by returning appropriate
  errors (Paul Mackerras)
- init_start hcall should work for only radix guests.
- Fix the page-size-order argument in uv_page_inval (Ram Pai)
- Don't free up partition scoped page tables in HV when guest
  becomes secure (Paul Mackerras)
- During guest reset, when we unpin VPA pages, make sure that no vcpu
  is running and fail the SVM_OFF ioctl if any are running (Paul Mackerras)
- Dropped the patch that implemented init_abort hcall as it still has
  unresolved questions.

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  mm: ksm: Export ksm_madvise()
  KVM: PPC: Support for running secure guests
  KVM: PPC: Shared pages support for secure guests
  KVM: PPC: Radix changes for secure guest
  KVM: PPC: Handle memory plug/unplug to secure VM
  KVM: PPC: Support reset of secure guest

 Documentation/virt/kvm/api.txt  |  18 +
 arch/powerpc/Kconfig|  17 +
 arch/powerpc/include/asm/hvcall.h   |   9 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  74 ++
 arch/powerpc/include/asm/kvm_host.h |   6 +
 arch/powerpc/include/asm/kvm_ppc.h  |   1 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h   |  36 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  25 +
 arch/powerpc/kvm/book3s_hv.c| 143 
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 774 
 arch/powerpc/kvm/powerpc.c  |  12 +
 include/uapi/linux/kvm.h|   1 +
 mm/ksm.c|   1 +
 15 files changed, 1126 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0



Re: [PATCH v10 1/8] mm: ksm: Export ksm_madvise()

2019-11-15 Thread Bharata B Rao
On Thu, Nov 07, 2019 at 04:45:35PM +1100, Paul Mackerras wrote:
> On Wed, Nov 06, 2019 at 12:15:42PM +0530, Bharata B Rao wrote:
> > On Wed, Nov 06, 2019 at 03:33:29PM +1100, Paul Mackerras wrote:
> > > On Mon, Nov 04, 2019 at 09:47:53AM +0530, Bharata B Rao wrote:
> > > > KVM PPC module needs ksm_madvise() for supporting secure guests.
> > > > Guest pages that become secure are represented as device private
> > > > pages in the host. Such pages shouldn't participate in KSM merging.
> > > 
> > > If we don't do the ksm_madvise call, then as far as I can tell, it
> > > should all still work correctly, but we might have KSM pulling pages
> > > in unnecessarily, causing a reduction in performance.  Is that right?
> > 
> > I thought so too. When KSM tries to merge a secure page, it should
> > cause a fault resulting in page-out the secure page. However I see
> > the below crash when KSM is enabled and KSM scan tries to kmap and
> > memcmp the device private page.
> > 
> > BUG: Unable to handle kernel data access at 0xc007fffe0001
> > Faulting instruction address: 0xc00ab5a0
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> > Modules linked in:
> > CPU: 0 PID: 22 Comm: ksmd Not tainted 5.4.0-rc2-00026-g2249c0ae4a53-dirty 
> > #376
> > NIP:  c00ab5a0 LR: c03d7c3c CTR: 0004
> > REGS: c001c85d79b0 TRAP: 0300   Not tainted  
> > (5.4.0-rc2-00026-g2249c0ae4a53-dirty)
> > MSR:  9280b033   CR: 24002242  
> > XER: 2004
> > CFAR: c00ab3d0 DAR: c007fffe0001 DSISR: 4000 IRQMASK: 0 
> > GPR00: 0004 c001c85d7c40 c18ce000 c001c388 
> > GPR04: c007fffe0001 0001   
> > GPR08: c1992298 603820002138  3a69 
> > GPR12: 24002242 c255 c001c870 c179b728 
> > GPR16: c00c01800040 c179b5b8 c00c0070e200  
> > GPR20:   f000 c179b648 
> > GPR24: c24464a0 c249f568 c1118918  
> > GPR28: c001c804c590 c249f518  c001c870 
> > NIP [c00ab5a0] memcmp+0x320/0x6a0
> > LR [c03d7c3c] memcmp_pages+0x8c/0xe0
> > Call Trace:
> > [c001c85d7c40] [c001c804c590] 0xc001c804c590 (unreliable)
> > [c001c85d7c70] [c04591d0] ksm_scan_thread+0x960/0x21b0
> > [c001c85d7db0] [c01bf328] kthread+0x198/0x1a0
> > [c001c85d7e20] [c000bfbc] ret_from_kernel_thread+0x5c/0x80
> > Instruction dump:
> > ebc1fff0 eba1ffe8 eb81ffe0 eb61ffd8 4e800020 3861 4d810020 3860 
> > 4e800020 3804 7c0903a6 7d201c28 <7d402428> 7c295040 38630008 38840008 
> 
> Hmmm, that seems like a bug in the ZONE_DEVICE stuff generally.  All
> that ksm is doing as far as I can see is follow_page() and
> kmap_atomic().  I wonder how many other places in the kernel might
> also be prone to crashing if they try to touch device pages?

In the crash shown above, we don't go via follow_page() and hence
I believe we don't hit the fault path. We come here after getting
the page from get_ksm_page(), which returns a device private page,
on which the subsequent memcmp_pages() does kmap_atomic() and
tries to access the address, resulting in the above crash.
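
For reference, memcmp_pages() (mm/util.c) is essentially the following
from my reading of the code, so the kmap_atomic() of the device private
page followed by the memcmp() is where we blow up:

int memcmp_pages(struct page *page1, struct page *page2)
{
        char *addr1, *addr2;
        int ret;

        addr1 = kmap_atomic(page1);
        addr2 = kmap_atomic(page2);
        ret = memcmp(addr1, addr2, PAGE_SIZE);
        kunmap_atomic(addr2);
        kunmap_atomic(addr1);
        return ret;
}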

> 
> > In anycase, we wouldn't want secure guests pages to be pulled out due
> > to KSM, hence disabled merging.
> 
> Sure, I don't disagree with that, but I worry that we are papering
> over a bug here.

Looks like it, yes. Maybe someone with a better understanding of the
KSM code can comment here?

Regards,
Bharata.



Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-13 Thread Bharata B Rao
On Tue, Nov 12, 2019 at 04:34:34PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> [snip]
> > @@ -5442,6 +5471,64 @@ static int kvmhv_store_to_eaddr(struct kvm_vcpu 
> > *vcpu, ulong *eaddr, void *ptr,
> > return rc;
> >  }
> >  
> > +/*
> > + *  IOCTL handler to turn off secure mode of guest
> > + *
> > + * - Issue ucall to terminate the guest on the UV side
> > + * - Unpin the VPA pages (Enables these pages to be migrated back
> > + *   when VM becomes secure again)
> > + * - Recreate partition table as the guest is transitioning back to
> > + *   normal mode
> > + * - Release all device pages
> > + */
> > +static int kvmhv_svm_off(struct kvm *kvm)
> > +{
> > +   struct kvm_vcpu *vcpu;
> > +   int srcu_idx;
> > +   int ret = 0;
> > +   int i;
> > +
> > +   if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
> > +   return ret;
> > +
> 
> A further comment on this code: it should check that no vcpus are
> running and fail if any are running, and it should prevent any vcpus
> from running until the function is finished, using code like that in
> kvmhv_configure_mmu().  That is, it should do something like this:
> 
>   mutex_lock(&kvm->arch.mmu_setup_lock);
>   mmu_was_ready = kvm->arch.mmu_ready;
>   if (kvm->arch.mmu_ready) {
>   kvm->arch.mmu_ready = 0;
>   /* order mmu_ready vs. vcpus_running */
>   smp_mb();
>   if (atomic_read(&kvm->arch.vcpus_running)) {
>   kvm->arch.mmu_ready = 1;
>   ret = -EBUSY;
>   goto out_unlock;
>   }
>   }
> 
> and then after clearing kvm->arch.secure_guest below:
> 
> > +   srcu_idx = srcu_read_lock(&kvm->srcu);
> > +   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +   struct kvm_memory_slot *memslot;
> > +   struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +
> > +   if (!slots)
> > +   continue;
> > +
> > +   kvm_for_each_memslot(memslot, slots) {
> > +   kvmppc_uvmem_drop_pages(memslot, kvm, true);
> > +   uv_unregister_mem_slot(kvm->arch.lpid, memslot->id);
> > +   }
> > +   }
> > +   srcu_read_unlock(&kvm->srcu, srcu_idx);
> > +
> > +   ret = uv_svm_terminate(kvm->arch.lpid);
> > +   if (ret != U_SUCCESS) {
> > +   ret = -EINVAL;
> > +   goto out;
> > +   }
> > +
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   spin_lock(&vcpu->arch.vpa_update_lock);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.dtl);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.slb_shadow);
> > +   unpin_vpa_reset(kvm, &vcpu->arch.vpa);
> > +   spin_unlock(&vcpu->arch.vpa_update_lock);
> > +   }
> > +
> > +   ret = kvmppc_reinit_partition_table(kvm);
> > +   if (ret)
> > +   goto out;
> > +
> > +   kvm->arch.secure_guest = 0;
> 
> you need to do:
> 
>   kvm->arch.mmu_ready = mmu_was_ready;
>  out_unlock:
>   mutex_unlock(&kvm->arch.mmu_setup_lock);
> 
> > +out:
> > +   return ret;
> > +}
> > +
> 
> With that extra check in place, it should be safe to unpin the vpas if
> there is a good reason to do so.  ("Userspace has some bug that we
> haven't found" isn't a good reason to do so.)

QEMU indeed does set_one_reg to reset the VPAs, but that only marks
the VPA update as pending. The actual unpinning happens when the vcpu
gets to run after reset, at which time the VPAs are updated after
any unpinning (if required).

When the secure guest reboots, vcpu 0 gets to run, unpins its
VPA pages and then proceeds with switching to secure mode. Here UV
tries to page-in all the guest pages, including the still-pinned
VPA pages corresponding to other vcpus which haven't had a chance
to run till now. They are all still pinned and hence page-in fails.

To prevent this, we have to explicitly unpin the VPA pages during
this svm off ioctl. This will ensure that an SMP secure guest is able
to reboot correctly.

So I will incorporate the code chunk you have shown above to fail
if any vcpu is running and prevent any vcpu from running when
we unpin VPAs from this ioctl.
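
Roughly, the svm off path would then look like this (untested sketch
combining your snippet above with the unpinning, using the same
mmu_setup_lock/vcpus_running fields):

        mutex_lock(&kvm->arch.mmu_setup_lock);
        mmu_was_ready = kvm->arch.mmu_ready;
        if (kvm->arch.mmu_ready) {
                kvm->arch.mmu_ready = 0;
                /* order mmu_ready vs. vcpus_running */
                smp_mb();
                if (atomic_read(&kvm->arch.vcpus_running)) {
                        kvm->arch.mmu_ready = 1;
                        ret = -EBUSY;
                        goto out_unlock;
                }
        }

        /* drop device pages, uv_svm_terminate(), unpin VPAs, reinit PTBL */

        kvm->arch.secure_guest = 0;
        kvm->arch.mmu_ready = mmu_was_ready;
 out_unlock:
        mutex_unlock(&kvm->arch.mmu_setup_lock);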

Regards,
Bharata.



Re: [PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-10 Thread Bharata B Rao
On Mon, Nov 11, 2019 at 04:28:06PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:58AM +0530, Bharata B Rao wrote:
> > Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
> > This ioctl will be issued by QEMU during reset and includes the
> > the following steps:
> > 
> > - Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
> > - Unpin the VPA pages so that they can be migrated back to secure
> >   side when guest becomes secure again. This is required because
> >   pinned pages can't be migrated.
> 
> Unpinning the VPA pages is normally handled during VM reset by QEMU
> doing set_one_reg operations to set the values for the
> KVM_REG_PPC_VPA_ADDR, KVM_REG_PPC_VPA_SLB and KVM_REG_PPC_VPA_DTL
> pseudo-registers to zero.  Is there some reason why this isn't
> happening for a secure VM, and if so, what is that reason?
> If it is happening, then why do we need to unpin the pages explicitly
> here?

We were observing these VPA pages still remaining pinned during
reset and hence subsequent paging-in of these pages was failing.
Unpinning them fixed the problem.

I will investigate and get back on why exactly these pages weren't
getting unpinned normally as part of reset.

> 
> > - Reinitialize guest's partitioned scoped page tables. These are
> >   freed when guest becomes secure (H_SVM_INIT_DONE)
> 
> It doesn't seem particularly useful to me to free the partition-scoped
> page tables when the guest becomes secure, and it feels like it makes
> things more fragile.  If you don't free them then, then you don't need
> to reallocate them now.

Sure, I will not free them in the next version.

Regards,
Bharata.



Re: [PATCH v10 4/8] KVM: PPC: Radix changes for secure guest

2019-11-06 Thread Bharata B Rao
On Wed, Nov 06, 2019 at 04:58:23PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:56AM +0530, Bharata B Rao wrote:
> > - After the guest becomes secure, when we handle a page fault of a page
> >   belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
> > - Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
> > - Ensure all those routines that walk the secondary page tables of
> >   the guest don't do so in case of secure VM. For secure guest, the
> >   active secondary page tables are in secure memory and the secondary
> >   page tables in HV are freed when guest becomes secure.
> 
> Why do we free the page tables?  Just to save a little memory?  It
> feels like it would make things more fragile.

I guess we could just leave the page tables around and they would get
populated again if and when the guest is reset (i.e., when it goes back
to non-secure mode).

However it appeared cleaner to clean up the page tables given that they
aren't in use any longer.
> 
> Also, I don't see where the freeing gets done in this patch.

There isn't a very good reason for the freeing code not to be part of this
patch. I just put it in the reset patch (6/8), where there is code for
reinitializing the page tables again.

Regards,
Bharata.



Re: [PATCH v10 3/8] KVM: PPC: Shared pages support for secure guests

2019-11-06 Thread Bharata B Rao
On Wed, Nov 06, 2019 at 01:52:39PM +0530, Bharata B Rao wrote:
> > However, since kvmppc_gfn_is_uvmem_pfn() returned true, doesn't that
> > mean that pfn here should be a device pfn, and in fact should be the
> > same as uvmem_pfn (possibly with some extra bit(s) set)?
> 
> If secure page is being converted to share, pfn will be uvmem_pfn (device 
> pfn).

Also, kvmppc_gfn_is_uvmem_pfn() needn't always return true. It returns
true only for secure pages, while this routine handles sharing of both
secure and normal pages.

Regards,
Bharata.



Re: [PATCH v10 3/8] KVM: PPC: Shared pages support for secure guests

2019-11-06 Thread Bharata B Rao
On Wed, Nov 06, 2019 at 03:52:38PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:55AM +0530, Bharata B Rao wrote:
> > A secure guest will share some of its pages with hypervisor (Eg. virtio
> > bounce buffers etc). Support sharing of pages between hypervisor and
> > ultravisor.
> > 
> > Shared page is reachable via both HV and UV side page tables. Once a
> > secure page is converted to shared page, the device page that represents
> > the secure page is unmapped from the HV side page tables.
> 
> I'd like to understand a little better what's going on - see below...
> 
> > +/*
> > + * Shares the page with HV, thus making it a normal page.
> > + *
> > + * - If the page is already secure, then provision a new page and share
> > + * - If the page is a normal page, share the existing page
> > + *
> > + * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
> > + * to unmap the device page from QEMU's page tables.
> > + */
> > +static unsigned long
> > +kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long 
> > page_shift)
> > +{
> > +
> > +   int ret = H_PARAMETER;
> > +   struct page *uvmem_page;
> > +   struct kvmppc_uvmem_page_pvt *pvt;
> > +   unsigned long pfn;
> > +   unsigned long gfn = gpa >> page_shift;
> > +   int srcu_idx;
> > +   unsigned long uvmem_pfn;
> > +
> > +   srcu_idx = srcu_read_lock(&kvm->srcu);
> > +   mutex_lock(&kvm->arch.uvmem_lock);
> > +   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
> > +   uvmem_page = pfn_to_page(uvmem_pfn);
> > +   pvt = uvmem_page->zone_device_data;
> > +   pvt->skip_page_out = true;
> > +   }
> > +
> > +retry:
> > +   mutex_unlock(&kvm->arch.uvmem_lock);
> > +   pfn = gfn_to_pfn(kvm, gfn);
> 
> At this point, pfn is the value obtained from the page table for
> userspace (e.g. QEMU), right?

Yes.

> I would think it should be equal to
> uvmem_pfn in most cases, shouldn't it?

Yes, in most cases (Common case is to share a page that is already secure)

> If not, what is it going to
> be?

It can be a regular pfn if non-secure page is being shared again.

> 
> > +   if (is_error_noslot_pfn(pfn))
> > +   goto out;
> > +
> > +   mutex_lock(&kvm->arch.uvmem_lock);
> > +   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
> > +   uvmem_page = pfn_to_page(uvmem_pfn);
> > +   pvt = uvmem_page->zone_device_data;
> > +   pvt->skip_page_out = true;
> > +   kvm_release_pfn_clean(pfn);
> 
> This is going to do a put_page(), unless pfn is a reserved pfn.  If it
> does a put_page(), where did we do the corresponding get_page()?

gfn_to_pfn() will come with a reference held.

> However, since kvmppc_gfn_is_uvmem_pfn() returned true, doesn't that
> mean that pfn here should be a device pfn, and in fact should be the
> same as uvmem_pfn (possibly with some extra bit(s) set)?

If a secure page is being converted to a shared page, pfn will be uvmem_pfn
(device pfn). If not, it will be a regular pfn.

>  What does
> kvm_is_reserved_pfn() return for a device pfn?

From this code path, we will never call kvm_release_pfn_clean() on a device
pfn. The prior call to gfn_to_pfn() would fault and result in a page-out, thus
converting the device pfn to a regular pfn (in the page-share request for a
secure page case).

Regards,
Bharata.



Re: [PATCH v10 1/8] mm: ksm: Export ksm_madvise()

2019-11-05 Thread Bharata B Rao
On Wed, Nov 06, 2019 at 03:33:29PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:53AM +0530, Bharata B Rao wrote:
> > KVM PPC module needs ksm_madvise() for supporting secure guests.
> > Guest pages that become secure are represented as device private
> > pages in the host. Such pages shouldn't participate in KSM merging.
> 
> If we don't do the ksm_madvise call, then as far as I can tell, it
> should all still work correctly, but we might have KSM pulling pages
> in unnecessarily, causing a reduction in performance.  Is that right?

I thought so too. When KSM tries to merge a secure page, it should
cause a fault resulting in page-out of the secure page. However I see
the below crash when KSM is enabled and KSM scan tries to kmap and
memcmp the device private page.

BUG: Unable to handle kernel data access at 0xc007fffe0001
Faulting instruction address: 0xc00ab5a0
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 0 PID: 22 Comm: ksmd Not tainted 5.4.0-rc2-00026-g2249c0ae4a53-dirty #376
NIP:  c00ab5a0 LR: c03d7c3c CTR: 0004
REGS: c001c85d79b0 TRAP: 0300   Not tainted  
(5.4.0-rc2-00026-g2249c0ae4a53-dirty)
MSR:  9280b033   CR: 24002242  XER: 
2004
CFAR: c00ab3d0 DAR: c007fffe0001 DSISR: 4000 IRQMASK: 0 
GPR00: 0004 c001c85d7c40 c18ce000 c001c388 
GPR04: c007fffe0001 0001   
GPR08: c1992298 603820002138  3a69 
GPR12: 24002242 c255 c001c870 c179b728 
GPR16: c00c01800040 c179b5b8 c00c0070e200  
GPR20:   f000 c179b648 
GPR24: c24464a0 c249f568 c1118918  
GPR28: c001c804c590 c249f518  c001c870 
NIP [c00ab5a0] memcmp+0x320/0x6a0
LR [c03d7c3c] memcmp_pages+0x8c/0xe0
Call Trace:
[c001c85d7c40] [c001c804c590] 0xc001c804c590 (unreliable)
[c001c85d7c70] [c04591d0] ksm_scan_thread+0x960/0x21b0
[c001c85d7db0] [c01bf328] kthread+0x198/0x1a0
[c001c85d7e20] [c000bfbc] ret_from_kernel_thread+0x5c/0x80
Instruction dump:
ebc1fff0 eba1ffe8 eb81ffe0 eb61ffd8 4e800020 3861 4d810020 3860 
4e800020 3804 7c0903a6 7d201c28 <7d402428> 7c295040 38630008 38840008 

In any case, we wouldn't want secure guest pages to be pulled out due
to KSM, hence merging is disabled.
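
For completeness, the way this is avoided in the KVM PPC code is by
marking the guest VMAs unmergeable before paging pages into secure
memory, roughly along these lines (sketch; vma is the VMA covering the
gfn, looked up under kvm->mm's mmap_sem):

        ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
                          MADV_UNMERGEABLE, &vma->vm_flags);
        if (ret)
                goto out;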

Regards,
Bharata.



Re: [PATCH v10 0/8] KVM: PPC: Driver to manage pages of secure guest

2019-11-05 Thread Bharata B Rao
On Wed, Nov 06, 2019 at 03:30:58PM +1100, Paul Mackerras wrote:
> On Mon, Nov 04, 2019 at 09:47:52AM +0530, Bharata B Rao wrote:
> > 
> > Now, all the dependencies required by this patchset are in powerpc/next
> > on which this patchset is based upon.
> 
> Can you tell me what patches that are in powerpc/next but not upstream
> this depends on?

Sorry, I should have been clear. All the dependencies are upstream.

Regards,
Bharata.



[PATCH v10 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-11-03 Thread Bharata B Rao
From: Anshuman Khandual 

CONFIG_PPC_UV adds support for ultravisor.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 3e56c9c2f16e..d7fef29b47c9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -451,6 +451,23 @@ config PPC_TRANSACTIONAL_MEM
help
  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+   bool "Ultravisor support"
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   select ZONE_DEVICE
+   select DEV_PAGEMAP_OPS
+   select DEVICE_PRIVATE
+   select MEMORY_HOTPLUG
+   select MEMORY_HOTREMOVE
+   default n
+   help
+ This option paravirtualizes the kernel to run on POWER platforms that
+ support the Protected Execution Facility (PEF). On such platforms,
+ the ultravisor firmware runs at a privilege level above the
+ hypervisor.
+
+ If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if 
EXPERT
depends on PPC64
-- 
2.21.0



[PATCH v10 7/8] KVM: PPC: Implement H_SVM_INIT_ABORT hcall

2019-11-03 Thread Bharata B Rao
From: Sukadev Bhattiprolu 

Implement the H_SVM_INIT_ABORT hcall which the Ultravisor can use to
abort an SVM after it has issued the H_SVM_INIT_START and before the
H_SVM_INIT_DONE hcalls. This hcall could be used when Ultravisor
encounters security violations or other errors when starting an SVM.

Note that this hcall is different from UV_SVM_TERMINATE ucall which
is used by HV to terminate/cleanup an SVM.

In case of H_SVM_INIT_ABORT, we should page-out all the pages back to
HV (i.e., we should not skip the page-out). Otherwise the VM's pages,
possibly including its text/data would be stuck in secure memory.
Since the SVM did not go secure, its MSR_S bit will be clear and the
VM wont be able to access its pages even to do a clean exit.

Based on patches and discussion with Ram Pai and Bharata Rao.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Ram Pai 
Signed-off-by: Bharata B Rao 
---
 Documentation/powerpc/ultravisor.rst| 39 +
 arch/powerpc/include/asm/hvcall.h   |  1 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 
 arch/powerpc/include/asm/kvm_host.h |  1 +
 arch/powerpc/kvm/book3s_hv.c|  3 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 23 ++--
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 29 +++
 7 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
index 730854f73830..286cabadc566 100644
--- a/Documentation/powerpc/ultravisor.rst
+++ b/Documentation/powerpc/ultravisor.rst
@@ -948,6 +948,45 @@ Use cases
 up its internal state for this virtual machine.
 
 
+H_SVM_INIT_ABORT
+
+
+Abort the process of securing an SVM.
+
+Syntax
+~~
+
+.. code-block:: c
+
+   uint64_t hypercall(const uint64_t H_SVM_INIT_ABORT)
+
+Return values
+~
+
+One of the following values:
+
+   * H_SUCCESS on success.
+   * H_UNSUPPORTED if called from the wrong context (e.g.
+   from an SVM or before an H_SVM_INIT_START
+   hypercall).
+
+Description
+~~~
+
+Abort the process of securing a virtual machine. This call must
+be made after a prior call to ``H_SVM_INIT_START`` hypercall.
+
+Use cases
+~
+
+
+If the Ultravisor is unable to secure a virtual machine, either due
+to lack of resources or because the VM's security information could
+not be validated, Ultravisor informs the Hypervisor about it.
+Hypervisor can use this call to clean up any internal state for this
+virtual machine.
+
 H_SVM_PAGE_IN
 -
 
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 13bd870609c3..e90c073e437e 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -350,6 +350,7 @@
 #define H_SVM_PAGE_OUT 0xEF04
 #define H_SVM_INIT_START   0xEF08
 #define H_SVM_INIT_DONE0xEF0C
+#define H_SVM_INIT_ABORT   0xEF14
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 3cf8425b9838..eaea400ea715 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -18,6 +18,7 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
 int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
 void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
 struct kvm *kvm, bool skip_page_out);
@@ -62,6 +63,11 @@ static inline unsigned long kvmppc_h_svm_init_done(struct 
kvm *kvm)
return H_UNSUPPORTED;
 }
 
+static inline unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
 static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
 {
return -EFAULT;
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 577ca95fac7c..8310c0407383 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -278,6 +278,7 @@ struct kvm_resize_hpt;
 /* Flag values for kvm_arch.secure_guest */
 #define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
 #define KVMPPC_SECURE_INIT_DONE  0x2 /* H_SVM_INIT_DONE completed */
+#define KVMPPC_SECURE_INIT_ABORT 0x4 /* H_SVM_INIT_ABORT issued */
 
 struct kvm_arch {
unsigned int lpid;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d2

[PATCH v10 6/8] KVM: PPC: Support reset of secure guest

2019-11-03 Thread Bharata B Rao
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes
the following steps:

- Ask UV to terminate the guest via UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to secure
  side when guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinitialize guest's partitioned scoped page tables. These are
  freed when guest becomes secure (H_SVM_INIT_DONE)
- Release all device pages of the secure guest.

After these steps, guest is ready to issue UV_ESM call once again
to switch to secure mode.
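
For illustration, the userspace side is a plain VM ioctl with no
payload; a QEMU-like reset path might do something like this
(hypothetical helper and fd names):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* vm_fd is the KVM VM file descriptor obtained via KVM_CREATE_VM */
static int svm_off(int vm_fd)
{
        /* No-op for normal guests; tears down secure state for SVMs */
        if (ioctl(vm_fd, KVM_PPC_SVM_OFF) < 0)
                return -errno;
        return 0;
}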

Signed-off-by: Bharata B Rao 
Signed-off-by: Sukadev Bhattiprolu 
[Implementation of uv_svm_terminate() and its call from
guest shutdown path]
Signed-off-by: Ram Pai 
[Unpinning of VPA pages]
---
 Documentation/virt/kvm/api.txt| 19 +
 arch/powerpc/include/asm/kvm_ppc.h|  2 +
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 ++
 arch/powerpc/kvm/book3s_hv.c  | 88 +++
 arch/powerpc/kvm/book3s_hv_uvmem.c|  6 ++
 arch/powerpc/kvm/powerpc.c| 12 
 include/uapi/linux/kvm.h  |  1 +
 8 files changed, 134 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 4833904d32a5..1b2e1d2002ba 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4126,6 +4126,25 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL:  if ultravisor failed to terminate the secure guest
+  ENOMEM:  if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages, reinitializes guest's partition scoped page
+tables and releases all the device pages that are used to track the
+secure pages by hypervisor.
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index ee62776e5433..6d1bb597fe95 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -177,6 +177,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm 
*kvm,
 extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
 extern void kvmppc_setup_partition_table(struct kvm *kvm);
+int kvmppc_reinit_partition_table(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
@@ -321,6 +322,7 @@ struct kvmppc_ops {
   int size);
int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
  int size);
+   int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 4b0d044caa2a..b66f6db7be6c 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -34,5 +34,6 @@
 #define UV_UNSHARE_PAGE0xF134
 #define UV_UNSHARE_ALL_PAGES   0xF140
 #define UV_PAGE_INVAL  0xF138
+#define UV_SVM_TERMINATE   0xF13C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index b8e59b7b4ac8..790b0e63681f 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -77,4 +77,9 @@ static inline int uv_page_inval(u64 lpid, u64 gpa, u64 
page_shift)
return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
 }
 
+static inline int uv_svm_terminate(u64 lpid)
+{
+   return ucall_norets(UV_SVM_TERMINATE, lpid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index cb7ae1e9e4f2..d2bc4e9bbe7e 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2443,6 +2443,15 @@ static void unpin_vpa(struct kvm *kvm, struct kvmppc_vpa 
*vpa)
vpa->dirty);
 }
 
+static void unpin_vpa_reset(struct kvm *kvm, struct kvmppc_vpa *vpa)
+{
+   unpin_vpa(kvm, vpa);
+   vpa->gpa = 0;
+   vpa->pinned_addr = NULL;
+   vpa->dirty = false;
+   vpa->update_pending = 0;
+}
+
 static void kvmppc

[PATCH v10 5/8] KVM: PPC: Handle memory plug/unplug to secure VM

2019-11-03 Thread Bharata B Rao
Register the new memslot with UV during plug and unregister
the memslot during unplug. In addition, release all the
device pages during unplug.

Signed-off-by: Bharata B Rao 
Signed-off-by: Sukadev Bhattiprolu 
[Added skip_page_out arg to kvmppc_uvmem_drop_pages()]
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  3 ++
 arch/powerpc/kvm/book3s_hv.c| 24 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 37 +
 6 files changed, 76 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 3033a9585b43..3cf8425b9838 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -19,6 +19,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
+void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+struct kvm *kvm, bool skip_page_out);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -64,5 +66,9 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, 
unsigned long gfn)
 {
return -EFAULT;
 }
+
+static inline void
+kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
+   struct kvm *kvm, bool skip_page_out) { }
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index e774274ab30e..4b0d044caa2a 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -27,6 +27,7 @@
 #define UV_RETURN  0xF11C
 #define UV_ESM 0xF110
 #define UV_REGISTER_MEM_SLOT   0xF120
+#define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 #define UV_SHARE_PAGE  0xF130
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 40cc8bace654..b8e59b7b4ac8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -67,6 +67,11 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+   return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
 {
return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 4aec55a0ebc7..ee70bfc28c82 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1101,6 +1101,9 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
unsigned long gpa;
unsigned int shift;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START)
+   kvmppc_uvmem_drop_pages(memslot, kvm, true);
+
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return;
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 80e84277d11f..cb7ae1e9e4f2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
@@ -4532,6 +4533,29 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
kvmppc_radix_flush_memslot(kvm, old);
+   /*
+* If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+*/
+   if (!kvm->arch.secure_guest)
+   return;
+
+   switch (change) {
+   case KVM_MR_CREATE:
+   if (kvmppc_uvmem_slot_init(kvm, new))
+   return;
+   uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id);
+   break;
+   case KVM_MR_DELETE:
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   kvmppc_uvmem_slot_free(kvm, old);
+   break;
+   default:
+   /* TODO: Handle KVM_MR_MOVE */
+   break;
+   }
 }
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 

[PATCH v10 4/8] KVM: PPC: Radix changes for secure guest

2019-11-03 Thread Bharata B Rao
- After the guest becomes secure, when we handle a page fault of a page
  belonging to SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
- Ensure all those routines that walk the secondary page tables of
  the guest don't do so in case of secure VM. For secure guest, the
  active secondary page tables are in secure memory and the secondary
  page tables in HV are freed when guest becomes secure.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  6 
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 
 arch/powerpc/kvm/book3s_64_mmu_radix.c  | 22 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 32 +
 5 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 95f389c2937b..3033a9585b43 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -18,6 +18,7 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -58,5 +59,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct 
kvm *kvm)
 {
return H_UNSUPPORTED;
 }
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
+{
+   return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 2483f15bd71a..e774274ab30e 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -32,5 +32,6 @@
 #define UV_SHARE_PAGE  0xF130
 #define UV_UNSHARE_PAGE0xF134
 #define UV_UNSHARE_ALL_PAGES   0xF140
+#define UV_PAGE_INVAL  0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 79bb005e8ee9..40cc8bace654 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -67,4 +67,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+   return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..4aec55a0ebc7 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!(dsisr & DSISR_PRTABLE_FAULT))
gpa |= ea & 0xfff;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return kvmppc_send_page_to_uv(kvm, gfn);
+
/* Get the corresponding memslot */
memslot = gfn_to_memslot(kvm, gfn);
 
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE) {
+   uv_page_inval(kvm->arch.lpid, gpa, PAGE_SIZE);
+   return 0;
+   }
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ 

[PATCH v10 3/8] KVM: PPC: Shared pages support for secure guests

2019-11-03 Thread Bharata B Rao
A secure guest will share some of its pages with hypervisor (Eg. virtio
bounce buffers etc). Support sharing of pages between hypervisor and
ultravisor.

Shared page is reachable via both HV and UV side page tables. Once a
secure page is converted to shared page, the device page that represents
the secure page is unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 85 --
 2 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 4150732c81a0..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index fe456fd07c74..d9395a23b10d 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -19,7 +19,10 @@
  * available in the platform for running secure guests is hotplugged.
  * Whenever a page belonging to the guest becomes secure, a page from this
  * private device memory is used to represent and track that secure page
- * on the HV side.
+ * on the HV side. Some pages (like virtio buffers, VPA pages etc) are
+ * shared between UV and HV. However such pages aren't represented by
+ * device private memory and mappings to shared memory exist in both
+ * UV and HV page tables.
  */
 
 /*
@@ -64,6 +67,9 @@
  * UV splits and remaps the 2MB page if necessary and copies out the
  * required 64K page contents.
  *
+ * Shared pages: Whenever guest shares a secure page, UV will split and
+ * remap the 2MB page if required and issue H_SVM_PAGE_IN with 64K page size.
+ *
  * In summary, the current secure pages handling code in HV assumes
  * 64K page size and in fact fails any page-in/page-out requests of
  * non-64K size upfront. If and when UV starts supporting multiple
@@ -93,6 +99,7 @@ struct kvmppc_uvmem_slot {
 struct kvmppc_uvmem_page_pvt {
struct kvm *kvm;
unsigned long gpa;
+   bool skip_page_out;
 };
 
 int kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot *slot)
@@ -329,8 +336,64 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
long start,
return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+   int ret = H_PARAMETER;
+   struct page *uvmem_page;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   unsigned long gfn = gpa >> page_shift;
+   int srcu_idx;
+   unsigned long uvmem_pfn;
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   }
+
+retry:
+   mutex_unlock(&kvm->arch.uvmem_lock);
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (is_error_noslot_pfn(pfn))
+   goto out;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, &uvmem_pfn)) {
+   uvmem_page = pfn_to_page(uvmem_pfn);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   kvm_release_pfn_clean(pfn);
+   goto retry;
+   }
+
+   if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+   ret = H_SUCCESS;
+   kvm_release_pfn_clean(pfn);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+out:
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
+ *
+ * H_PAGE_IN_SHARED flag makes the page shared which means that the same
+ * memory in is visible from both UV and HV.
  */
 unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
@@ -345,9 +408,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (page_shift != PAGE_SHIFT)
return H_P3;
 
-   if (flags)
+   if (flags & ~H_PAGE_IN_SHARED)
return H_P2;
 
+   

[PATCH v10 2/8] KVM: PPC: Support for running secure guests

2019-11-03 Thread Bharata B Rao
A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by
hypervisor (HV) and secure memory managed by Ultravisor (UV).

HV is informed about the guest's transition to secure mode via hcalls:

H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with
the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
the guest to secure mode is complete.

These two states (transition to secure mode STARTED and transition
to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
Setting these states will cause the assembly code that enters the
guest to call the UV_RETURN ucall instead of trying to enter the
guest directly.

Migration of pages between the normal and secure memory of a secure
guest is implemented in the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages() using UV_PAGE_IN and
UV_PAGE_OUT ucalls.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   6 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  62 ++
 arch/powerpc/include/asm/kvm_host.h |   6 +
 arch/powerpc/include/asm/ultravisor-api.h   |   3 +
 arch/powerpc/include/asm/ultravisor.h   |  21 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  29 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 613 
 8 files changed, 743 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2023e327..4150732c81a0 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,12 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+#define H_SVM_INIT_START   0xEF08
+#define H_SVM_INIT_DONE0xEF0C
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..95f389c2937b
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_KVM_BOOK3S_UVMEM_H__
+#define __ASM_KVM_BOOK3S_UVMEM_H__
+
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+int kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot 
*slot);
+void kvmppc_uvmem_slot_free(struct kvm *kvm,
+   const struct kvm_memory_slot *slot);
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+  unsigned long gra,
+  unsigned long flags,
+  unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+   unsigned long gra,
+   unsigned long flags,
+   unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+   return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) { }
+
+static inline int
+kvmppc_uvmem_slot_init(struct kvm *kvm, const struct kvm_memory_slot *slot)
+{
+   return 0;
+}
+
+static inline void
+kvmppc_uvmem_slot_free(struct kvm *kvm, const struct kvm_memory_slot *slot) { }
+
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+ unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+   return H_UNS

[PATCH v10 1/8] mm: ksm: Export ksm_madvise()

2019-11-03 Thread Bharata B Rao
KVM PPC module needs ksm_madvise() for supporting secure guests.
Guest pages that become secure are represented as device private
pages in the host. Such pages shouldn't participate in KSM merging.

Signed-off-by: Bharata B Rao 
---
 mm/ksm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index dbee2eb4dd05..e45b02ad3f0b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -2478,6 +2478,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long 
start,
 
return 0;
 }
+EXPORT_SYMBOL_GPL(ksm_madvise);
 
 int __ksm_enter(struct mm_struct *mm)
 {
-- 
2.21.0



[PATCH v10 0/8] KVM: PPC: Driver to manage pages of secure guest

2019-11-03 Thread Bharata B Rao
Hi,

This is the next version of the patchset that adds required support
in the KVM hypervisor to run secure guests on PEF-enabled POWER platforms.

The major change in this version is that we no longer use the
kvm.arch->rmap[] array to store device PFNs, and hence no longer depend
on memslot availability to reach the device PFN from the fault path.
Instead of rmap[], we now have a separate array which gets created and
destroyed along with memslot creation and deletion. These arrays hang off
kvm.arch and are arranged in a simple linked list for now. We could move
to some other data structure in the future if walking the linked list
becomes an overhead due to a large number of memslots.
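
For reference, the per-memslot tracking structure is roughly of the
following shape (a sketch only; the field names are an approximation of
what patch 2/8 introduces):

struct kvmppc_uvmem_slot {
        struct list_head list;          /* linked off kvm.arch, one node per memslot */
        unsigned long nr_pfns;          /* number of guest pages in the memslot */
        unsigned long base_pfn;         /* first GFN covered by the memslot */
        unsigned long *pfns;            /* per-GFN device PFN and state bits */
};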

Other changes include:

- Rearranged/merged/cleaned up patches, removed all Acks/Reviewed-by tags
  since all the patches have changed.
- Added a new patch to support the H_SVM_INIT_ABORT hcall (from Suka)
- Added KSM unmerge support so that VMAs that have device PFNs don't
  participate in KSM merging (and eventually crash in KSM code).
- Release device pages during unplug (Paul) and ensure that memory
  hotplug and unplug work correctly.
- Let the kvm-hv module load on PEF-disabled platforms (Ram) when
  CONFIG_PPC_UV is enabled, allowing regular non-secure guests
  to still run.
- Support guest reset while switching to secure is in progress.
- Check if the page is already secure in kvmppc_send_page_to_uv() before
  sending it to UV.
- Fixed the sentinel for header file kvm_book3s_uvmem.h (Jason)

Now, all the dependencies required by this patchset are in powerpc/next,
on which this patchset is based.

Outside of PowerPC code, this needs a change in the KSM code, as this
patchset uses ksm_madvise(), which is not currently exported.
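
For context, the kind of call the KVM side ends up making with that export
is roughly the sketch below (illustrative only; the exact call site and
the surrounding locking are as in patch 2/8, and the error handling here
is an assumption):

        /* Sketch: mark a guest VMA unmergeable before its pages are
         * moved to secure memory, so KSM never merges device pages.
         * Assumes the caller holds kvm->mm's mmap_sem for write. */
        ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
                          MADV_UNMERGEABLE, &vma->vm_flags);
        if (ret)
                return ret;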

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  mm: ksm: Export ksm_madvise()
  KVM: PPC: Support for running secure guests
  KVM: PPC: Shared pages support for secure guests
  KVM: PPC: Radix changes for secure guest
  KVM: PPC: Handle memory plug/unplug to secure VM
  KVM: PPC: Support reset of secure guest

Sukadev Bhattiprolu (1):
  KVM: PPC: Implement H_SVM_INIT_ABORT hcall

 Documentation/powerpc/ultravisor.rst|  39 +
 Documentation/virt/kvm/api.txt  |  19 +
 arch/powerpc/Kconfig|  17 +
 arch/powerpc/include/asm/hvcall.h   |  10 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  80 ++
 arch/powerpc/include/asm/kvm_host.h |   7 +
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h   |  36 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  25 +
 arch/powerpc/kvm/book3s_hv.c| 144 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  23 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 794 
 arch/powerpc/kvm/powerpc.c  |  12 +
 include/uapi/linux/kvm.h|   1 +
 mm/ksm.c|   1 +
 17 files changed, 1217 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0



Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-22 Thread Bharata B Rao
On Wed, Oct 23, 2019 at 03:17:54PM +1100, Paul Mackerras wrote:
> On Tue, Oct 22, 2019 at 11:59:35AM +0530, Bharata B Rao wrote:
> The mapping of pages in userspace memory, and the mapping of userspace
> memory to guest physical space, are two distinct things.  The memslots
> describe the mapping of userspace addresses to guest physical
> addresses, but don't say anything about what is mapped at those
> userspace addresses.  So you can indeed get a page fault on a
> userspace address at the same time that a memslot is being deleted
> (even a memslot that maps that particular userspace address), because
> removing the memslot does not unmap anything from userspace memory,
> it just breaks the association between that userspace memory and guest
> physical memory.  Deleting the memslot does unmap the pages from the
> guest but doesn't unmap them from the userspace process (e.g. QEMU).
> 
> It is an interesting question what the semantics should be when a
> memslot is deleted and there are pages of userspace currently paged
> out to the device (i.e. the ultravisor).  One approach might be to say
> that all those pages have to come back to the host before we finish
> the memslot deletion, but that is probably not necessary; I think we
> could just say that those pages are gone and can be replaced by zero
> pages if they get accessed on the host side.  If userspace then unmaps
> the corresponding region of the userspace memory map, we can then just
> forget all those pages with very little work.

There are 5 scenarios currently where we are replacing the device mappings:

1. Guest reset
2. Memslot free (Memory unplug) (Not present in this version though)
3. Converting secure page to shared page
4. HV touching the secure page
5. H_SVM_INIT_ABORT hcall to abort SVM due to errors when transitioning
   to secure mode (Not present in this version)

In the first 3 cases, we don't need to get the page contents back to HV
from the secure side and hence skip the page-out. However, currently we
do allocate a fresh page and replace the mapping with the new one.
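
In code terms the skip boils down to a check in the page-out path, roughly
like this (a sketch, not the exact code in the series):

        pvt = spage->zone_device_data;
        if (!pvt->skip_page_out)
                ret = uv_page_out(kvm->arch.lpid,
                                  page_to_pfn(dpage) << page_shift,
                                  pvt->gpa, 0, page_shift);
        /* else: don't copy from secure memory; the freshly allocated
         * normal page simply replaces the device mapping. */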
 
> > However if that sounds fragile, may be I can go back to my initial
> > design where we weren't using rmap[] to store device PFNs. That will
> > increase the memory usage but we give us an easy option to have
> > per-guest mutex to protect concurrent page-ins/outs/faults.
> 
> That sounds like it would be the best option, even if only in the
> short term.  At least it would give us a working solution, even if
> it's not the best performing solution.

Sure, will avoid using rmap[] in the next version.

Regards,
Bharata.



Re: [PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-10-21 Thread Bharata B Rao
On Fri, Oct 18, 2019 at 8:31 AM Paul Mackerras  wrote:
>
> On Wed, Sep 25, 2019 at 10:36:43AM +0530, Bharata B Rao wrote:
> > Manage migration of pages betwen normal and secure memory of secure
> > guest by implementing H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> >
> > H_SVM_PAGE_IN: Move the content of a normal page to secure page
> > H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> >
> > Private ZONE_DEVICE memory equal to the amount of secure memory
> > available in the platform for running secure guests is created.
> > Whenever a page belonging to the guest becomes secure, a page from
> > this private device memory is used to represent and track that secure
> > page on the HV side. The movement of pages between normal and secure
> > memory is done via migrate_vma_pages() using UV_PAGE_IN and
> > UV_PAGE_OUT ucalls.
>
> As we discussed privately, but mentioning it here so there is a
> record:  I am concerned about this structure
>
> > +struct kvmppc_uvmem_page_pvt {
> > + unsigned long *rmap;
> > + struct kvm *kvm;
> > + unsigned long gpa;
> > +};
>
> which keeps a reference to the rmap.  The reference could become stale
> if the memslot is deleted or moved, and nothing in the patch series
> ensures that the stale references are cleaned up.

I will add code to release the device PFNs when the memslot goes away. In
fact, the early versions of the patchset had this, but it subsequently
got removed.

>
> If it is possible to do without the long-term rmap reference, and
> instead find the rmap via the memslots (with the srcu lock held) each
> time we need the rmap, that would be safer, I think, provided that we
> can sort out the lock ordering issues.

All paths except the fault handler access rmap[] under the srcu lock. Even
in the case of the fault handler, for those faults induced by us (shared
page handling, releasing device pfns), we do hold the srcu lock. The
difficult case is when we fault due to HV accessing a device page. In this
case we come to the fault handler with mmap_sem already held and are not
in a position to take the kvm srcu lock, as that would lead to lock order
reversal. Given that we still have pages mapped in, I assume the memslot
can't go away while we access rmap[], so I think we should be ok here.

However, if that sounds fragile, maybe I can go back to my initial
design where we weren't using rmap[] to store device PFNs. That will
increase the memory usage but will give us an easy option to have a
per-guest mutex to protect concurrent page-ins/outs/faults.
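
For illustration, serialization would then reduce to something like the
sketch below (names assumed, not code from the series):

        /* One mutex per guest, taken around every page-in, page-out and
         * device fault, so they serialize against each other without
         * having to reach rmap[] through the memslots. */
        mutex_lock(&kvm->arch.uvmem_lock);
        ret = kvmppc_svm_page_in(vma, start, end, gpa, kvm, page_shift);
        mutex_unlock(&kvm->arch.uvmem_lock);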

Regards,
Bharata.
-- 
http://raobharata.wordpress.com/


[PATCH v9 8/8] KVM: PPC: Ultravisor: Add PPC_UV config option

2019-09-24 Thread Bharata B Rao
From: Anshuman Khandual 

CONFIG_PPC_UV adds support for ultravisor.

Signed-off-by: Anshuman Khandual 
Signed-off-by: Bharata B Rao 
Signed-off-by: Ram Pai 
[ Update config help and commit message ]
Signed-off-by: Claudio Carvalho 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/Kconfig | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d8dcd8820369..044838794112 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -448,6 +448,23 @@ config PPC_TRANSACTIONAL_MEM
help
  Support user-mode Transactional Memory on POWERPC.
 
+config PPC_UV
+   bool "Ultravisor support"
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   select ZONE_DEVICE
+   select DEV_PAGEMAP_OPS
+   select DEVICE_PRIVATE
+   select MEMORY_HOTPLUG
+   select MEMORY_HOTREMOVE
+   default n
+   help
+ This option paravirtualizes the kernel to run in POWER platforms that
+ supports the Protected Execution Facility (PEF). On such platforms,
+ the ultravisor firmware runs at a privilege level above the
+ hypervisor.
+
+ If unsure, say "N".
+
 config LD_HEAD_STUB_CATCH
bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if 
EXPERT
depends on PPC64
-- 
2.21.0



[PATCH v9 7/8] KVM: PPC: Support reset of secure guest

2019-09-24 Thread Bharata B Rao
Add support for reset of secure guest via a new ioctl KVM_PPC_SVM_OFF.
This ioctl will be issued by QEMU during reset and includes the
following steps:

- Ask UV to terminate the guest via the UV_SVM_TERMINATE ucall
- Unpin the VPA pages so that they can be migrated back to the secure
  side when the guest becomes secure again. This is required because
  pinned pages can't be migrated.
- Reinitialize the guest's partition scoped page tables. These are
  freed when the guest becomes secure (H_SVM_INIT_DONE).
- Release all device pages of the secure guest.

After these steps, the guest is ready to issue the UV_ESM call once again
to switch to secure mode.
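
For illustration, the userspace side of this is a plain VM ioctl with no
argument (a sketch; the error handling below is an assumption, not part
of this patch):

        /* Issued by QEMU (or any VMM) on the KVM VM fd during machine
         * reset; it has no effect for a normal, non-secure guest. */
        if (ioctl(vm_fd, KVM_PPC_SVM_OFF, 0) < 0)
                perror("KVM_PPC_SVM_OFF");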

Signed-off-by: Bharata B Rao 
Signed-off-by: Sukadev Bhattiprolu 
[Implementation of uv_svm_terminate() and its call from
guest shutdown path]
Signed-off-by: Ram Pai 
[Unpinning of VPA pages]
---
 Documentation/virt/kvm/api.txt  | 19 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  7 ++
 arch/powerpc/include/asm/kvm_ppc.h  |  2 +
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  5 ++
 arch/powerpc/kvm/book3s_hv.c| 74 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 60 +
 arch/powerpc/kvm/powerpc.c  | 12 
 include/uapi/linux/kvm.h|  1 +
 9 files changed, 181 insertions(+)

diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
index 2d067767b617..8e7a02e547e9 100644
--- a/Documentation/virt/kvm/api.txt
+++ b/Documentation/virt/kvm/api.txt
@@ -4111,6 +4111,25 @@ Valid values for 'action':
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+4.121 KVM_PPC_SVM_OFF
+
+Capability: basic
+Architectures: powerpc
+Type: vm ioctl
+Parameters: none
+Returns: 0 on successful completion,
+Errors:
+  EINVAL:if ultravisor failed to terminate the secure guest
+  ENOMEM:if hypervisor failed to allocate new radix page tables for guest
+
+This ioctl is used to turn off the secure mode of the guest or transition
+the guest from secure mode to normal mode. This is invoked when the guest
+is reset. This has no effect if called for a normal guest.
+
+This ioctl issues an ultravisor call to terminate the secure guest,
+unpins the VPA pages, reinitializes guest's partition scoped page
+tables and releases all the device pages that are used to track the
+secure pages by hypervisor.
 
 5. The kvm_run structure
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index fc924ef00b91..6b8cc8edd0ab 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -13,6 +13,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long page_shift);
 unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
 unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
+void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+   struct kvm_memslots *slots);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -37,5 +39,10 @@ static inline unsigned long kvmppc_h_svm_init_done(struct 
kvm *kvm)
 {
return H_UNSUPPORTED;
 }
+
+static inline void kvmppc_uvmem_free_memslot_pfns(struct kvm *kvm,
+ struct kvm_memslots *slots)
+{
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..e4093d067354 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -177,6 +177,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm 
*kvm,
 extern int kvmppc_switch_mmu_to_hpt(struct kvm *kvm);
 extern int kvmppc_switch_mmu_to_radix(struct kvm *kvm);
 extern void kvmppc_setup_partition_table(struct kvm *kvm);
+extern int kvmppc_reinit_partition_table(struct kvm *kvm);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce_64 *args);
@@ -321,6 +322,7 @@ struct kvmppc_ops {
   int size);
int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
  int size);
+   int (*svm_off)(struct kvm *kvm);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index cf200d4ce703..3a27a0c0be05 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -30,5 +30,6 @@
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 #define UV_PAGE_INVAL  0xF138
+#define UV_SVM_TERMINATE   0xF13C
 
 #endif /* _ASM

[PATCH v9 6/8] KVM: PPC: Radix changes for secure guest

2019-09-24 Thread Bharata B Rao
- After the guest becomes secure, when we handle a page fault for a page
  belonging to the SVM in HV, send that page to UV via UV_PAGE_IN.
- Whenever a page is unmapped on the HV side, inform UV via UV_PAGE_INVAL.
- Ensure that the routines that walk the secondary page tables of the
  guest don't do so in the case of a secure VM. For a secure guest, the
  active secondary page tables are in secure memory and the secondary
  page tables in HV are freed when the guest becomes secure.

Signed-off-by: Bharata B Rao 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/include/asm/kvm_host.h   | 12 
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c| 22 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c| 20 
 5 files changed, 60 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 726d35eb3bfe..c0c6603ddd6b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -877,6 +877,8 @@ static inline void kvm_arch_vcpu_block_finish(struct 
kvm_vcpu *vcpu) {}
 #ifdef CONFIG_PPC_UV
 int kvmppc_uvmem_init(void);
 void kvmppc_uvmem_free(void);
+bool kvmppc_is_guest_secure(struct kvm *kvm);
+int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa);
 #else
 static inline int kvmppc_uvmem_init(void)
 {
@@ -884,6 +886,16 @@ static inline int kvmppc_uvmem_init(void)
 }
 
 static inline void kvmppc_uvmem_free(void) {}
+
+static inline bool kvmppc_is_guest_secure(struct kvm *kvm)
+{
+   return false;
+}
+
+static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gpa)
+{
+   return -EFAULT;
+}
 #endif /* CONFIG_PPC_UV */
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 46b1ee381695..cf200d4ce703 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -29,5 +29,6 @@
 #define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
+#define UV_PAGE_INVAL  0xF138
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 719c0c3930b9..b333241bbe4c 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -57,4 +57,9 @@ static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
 }
 
+static inline int uv_page_inval(u64 lpid, u64 gpa, u64 page_shift)
+{
+   return ucall_norets(UV_PAGE_INVAL, lpid, gpa, page_shift);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 2d415c36a61d..93ad34e63045 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -19,6 +19,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /*
  * Supported radix tree geometry.
@@ -915,6 +917,9 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
if (!(dsisr & DSISR_PRTABLE_FAULT))
gpa |= ea & 0xfff;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return kvmppc_send_page_to_uv(kvm, gpa & PAGE_MASK);
+
/* Get the corresponding memslot */
memslot = gfn_to_memslot(kvm, gfn);
 
@@ -972,6 +977,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
 
+   if (kvmppc_is_guest_secure(kvm)) {
+   uv_page_inval(kvm->arch.lpid, gpa, PAGE_SIZE);
+   return 0;
+   }
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
@@ -989,6 +999,9 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
int ref = 0;
unsigned long old, *rmapp;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
@@ -1013,6 +1026,9 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
unsigned int shift;
int ref = 0;
 
+   if (kvmppc_is_guest_secure(kvm))
+   return ref;
+
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
@@ -1030,6 +1046,9 @@ static int kvm

[PATCH v9 5/8] KVM: PPC: Handle memory plug/unplug to secure VM

2019-09-24 Thread Bharata B Rao
Register the new memslot with UV during plug and unregister
the memslot during unplug.

Signed-off-by: Bharata B Rao 
Acked-by: Paul Mackerras 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/include/asm/ultravisor.h |  5 +
 arch/powerpc/kvm/book3s_hv.c  | 21 +
 3 files changed, 27 insertions(+)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index c578d9b13a56..46b1ee381695 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -26,6 +26,7 @@
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
 #define UV_REGISTER_MEM_SLOT   0xF120
+#define UV_UNREGISTER_MEM_SLOT 0xF124
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 58ccf5e2d6bb..719c0c3930b9 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -52,4 +52,9 @@ static inline int uv_register_mem_slot(u64 lpid, u64 
start_gpa, u64 size,
size, flags, slotid);
 }
 
+static inline int uv_unregister_mem_slot(u64 lpid, u64 slotid)
+{
+   return ucall_norets(UV_UNREGISTER_MEM_SLOT, lpid, slotid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3ba27fed3018..c5320cc0a534 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "book3s.h"
 
@@ -4517,6 +4518,26 @@ static void kvmppc_core_commit_memory_region_hv(struct 
kvm *kvm,
if (change == KVM_MR_FLAGS_ONLY && kvm_is_radix(kvm) &&
((new->flags ^ old->flags) & KVM_MEM_LOG_DIRTY_PAGES))
kvmppc_radix_flush_memslot(kvm, old);
+   /*
+* If UV hasn't yet called H_SVM_INIT_START, don't register memslots.
+*/
+   if (!kvm->arch.secure_guest)
+   return;
+
+   switch (change) {
+   case KVM_MR_CREATE:
+   uv_register_mem_slot(kvm->arch.lpid,
+new->base_gfn << PAGE_SHIFT,
+new->npages * PAGE_SIZE,
+0, new->id);
+   break;
+   case KVM_MR_DELETE:
+   uv_unregister_mem_slot(kvm->arch.lpid, old->id);
+   break;
+   default:
+   /* TODO: Handle KVM_MR_MOVE */
+   break;
+   }
 }
 
 /*
-- 
2.21.0



[PATCH v9 4/8] KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls

2019-09-24 Thread Bharata B Rao
H_SVM_INIT_START: Initiate securing a VM
H_SVM_INIT_DONE: Conclude securing a VM

As part of H_SVM_INIT_START, register all existing memslots with
the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
the guest to secure mode is complete.

These two states (transition to secure mode STARTED and transition
to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
Setting these states will cause the assembly code that enters the
guest to call the UV_RETURN ucall instead of trying to enter the
guest directly.

Signed-off-by: Bharata B Rao 
Acked-by: Paul Mackerras 
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/include/asm/hvcall.h   |  2 ++
 arch/powerpc/include/asm/kvm_book3s_uvmem.h | 12 
 arch/powerpc/include/asm/kvm_host.h |  4 +++
 arch/powerpc/include/asm/ultravisor-api.h   |  1 +
 arch/powerpc/include/asm/ultravisor.h   |  7 +
 arch/powerpc/kvm/book3s_hv.c|  7 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 34 +
 7 files changed, 67 insertions(+)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 4e98dd992bd1..13bd870609c3 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -348,6 +348,8 @@
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
+#define H_SVM_INIT_START   0xEF08
+#define H_SVM_INIT_DONE0xEF0C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
index 9603c2b48d67..fc924ef00b91 100644
--- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -11,6 +11,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
unsigned long gra,
unsigned long flags,
unsigned long page_shift);
+unsigned long kvmppc_h_svm_init_start(struct kvm *kvm);
+unsigned long kvmppc_h_svm_init_done(struct kvm *kvm);
 #else
 static inline unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
@@ -25,5 +27,15 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
 {
return H_UNSUPPORTED;
 }
+
+static inline unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
+{
+   return H_UNSUPPORTED;
+}
 #endif /* CONFIG_PPC_UV */
 #endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index a2e7502346a3..726d35eb3bfe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -281,6 +281,10 @@ struct kvm_hpt_info {
 
 struct kvm_resize_hpt;
 
+/* Flag values for kvm_arch.secure_guest */
+#define KVMPPC_SECURE_INIT_START 0x1 /* H_SVM_INIT_START has been called */
+#define KVMPPC_SECURE_INIT_DONE  0x2 /* H_SVM_INIT_DONE completed */
+
 struct kvm_arch {
unsigned int lpid;
unsigned int smt_mode;  /* # vcpus per virtual core */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 1cd1f595fd81..c578d9b13a56 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,6 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
+#define UV_REGISTER_MEM_SLOT   0xF120
 #define UV_PAGE_IN 0xF128
 #define UV_PAGE_OUT0xF12C
 
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 0fc4a974b2e8..58ccf5e2d6bb 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -45,4 +45,11 @@ static inline int uv_page_out(u64 lpid, u64 dst_ra, u64 
src_gpa, u64 flags,
page_shift);
 }
 
+static inline int uv_register_mem_slot(u64 lpid, u64 start_gpa, u64 size,
+  u64 flags, u64 slotid)
+{
+   return ucall_norets(UV_REGISTER_MEM_SLOT, lpid, start_gpa,
+   size, flags, slotid);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ef532cce85f9..3ba27fed3018 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1089,6 +1089,13 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_get_gpr(vcpu, 5),
kvmppc_get_gpr(vcpu, 6));
break;
+   case H_SVM_INIT_START:
+   ret = kvmppc_h_svm_init_start(vcpu->kvm);
+  

[PATCH v9 3/8] KVM: PPC: Shared pages support for secure guests

2019-09-24 Thread Bharata B Rao
A secure guest will share some of its pages with the hypervisor (e.g.
virtio bounce buffers). Support sharing of pages between the hypervisor
and the ultravisor.

A shared page is reachable via both HV and UV side page tables. Once a
secure page is converted to a shared page, the device page that represents
the secure page is unmapped from the HV side page tables.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h  |  3 ++
 arch/powerpc/kvm/book3s_hv_uvmem.c | 86 --
 2 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2595d0144958..4e98dd992bd1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,9 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Flags for H_SVM_PAGE_IN */
+#define H_PAGE_IN_SHARED0x1
+
 /* Platform-specific hcalls used by the Ultravisor */
 #define H_SVM_PAGE_IN  0xEF00
 #define H_SVM_PAGE_OUT 0xEF04
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 312f0fedde0b..5e5b5a3e9eec 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -19,7 +19,10 @@
  * available in the platform for running secure guests is hotplugged.
  * Whenever a page belonging to the guest becomes secure, a page from this
  * private device memory is used to represent and track that secure page
- * on the HV side.
+ * on the HV side. Some pages (like virtio buffers, VPA pages etc) are
+ * shared between UV and HV. However such pages aren't represented by
+ * device private memory and mappings to shared memory exist in both
+ * UV and HV page tables.
  *
  * For each page that gets moved into secure memory, a device PFN is used
  * on the HV side and migration PTE corresponding to that PFN would be
@@ -80,6 +83,7 @@ struct kvmppc_uvmem_page_pvt {
unsigned long *rmap;
struct kvm *kvm;
unsigned long gpa;
+   bool skip_page_out;
 };
 
 /*
@@ -190,8 +194,70 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned 
long start,
return ret;
 }
 
+/*
+ * Shares the page with HV, thus making it a normal page.
+ *
+ * - If the page is already secure, then provision a new page and share
+ * - If the page is a normal page, share the existing page
+ *
+ * In the former case, uses dev_pagemap_ops.migrate_to_ram handler
+ * to unmap the device page from QEMU's page tables.
+ */
+static unsigned long
+kvmppc_share_page(struct kvm *kvm, unsigned long gpa, unsigned long page_shift)
+{
+
+   int ret = H_PARAMETER;
+   struct page *uvmem_page;
+   struct kvmppc_uvmem_page_pvt *pvt;
+   unsigned long pfn;
+   unsigned long *rmap;
+   struct kvm_memory_slot *slot;
+   unsigned long gfn = gpa >> page_shift;
+   int srcu_idx;
+
+   srcu_idx = srcu_read_lock(&kvm->srcu);
+   slot = gfn_to_memslot(kvm, gfn);
+   if (!slot)
+   goto out;
+
+   rmap = &slot->arch.rmap[gfn - slot->base_gfn];
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+   uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   }
+
+retry:
+   mutex_unlock(&kvm->arch.uvmem_lock);
+   pfn = gfn_to_pfn(kvm, gfn);
+   if (is_error_noslot_pfn(pfn))
+   goto out;
+
+   mutex_lock(&kvm->arch.uvmem_lock);
+   if (kvmppc_rmap_type(rmap) == KVMPPC_RMAP_UVMEM_PFN) {
+   uvmem_page = pfn_to_page(*rmap & ~KVMPPC_RMAP_UVMEM_PFN);
+   pvt = uvmem_page->zone_device_data;
+   pvt->skip_page_out = true;
+   kvm_release_pfn_clean(pfn);
+   goto retry;
+   }
+
+   if (!uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0, page_shift))
+   ret = H_SUCCESS;
+   kvm_release_pfn_clean(pfn);
+   mutex_unlock(&kvm->arch.uvmem_lock);
+out:
+   srcu_read_unlock(&kvm->srcu, srcu_idx);
+   return ret;
+}
+
 /*
  * H_SVM_PAGE_IN: Move page from normal memory to secure memory.
+ *
+ * H_PAGE_IN_SHARED flag makes the page shared which means that the same
+ * memory in is visible from both UV and HV.
  */
 unsigned long
 kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
@@ -208,9 +274,12 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
if (page_shift != PAGE_SHIFT)
return H_P3;
 
-   if (flags)
+   if (flags & ~H_PAGE_IN_SHARED)
return H_P2;
 
+   if (flags & H_PAGE_IN_SHARED)
+   return kvmppc_share_page(kvm, gpa, page_shift);
+
ret = H_PARAMETER;
srcu_idx = srcu_read_lock(&kvm->srcu);
down_rea

[PATCH v9 2/8] KVM: PPC: Move pages between normal and secure memory

2019-09-24 Thread Bharata B Rao
Manage migration of pages between the normal and secure memory of a secure
guest by implementing the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.

H_SVM_PAGE_IN: Move the content of a normal page to secure page
H_SVM_PAGE_OUT: Move the content of a secure page to normal page

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages() using UV_PAGE_IN and
UV_PAGE_OUT ucalls.

Signed-off-by: Bharata B Rao 
---
 arch/powerpc/include/asm/hvcall.h   |   4 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  29 ++
 arch/powerpc/include/asm/kvm_host.h |  13 +
 arch/powerpc/include/asm/ultravisor-api.h   |   2 +
 arch/powerpc/include/asm/ultravisor.h   |  14 +
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_hv.c|  20 +
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 481 
 8 files changed, 566 insertions(+)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 2023e327..2595d0144958 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -342,6 +342,10 @@
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
 
+/* Platform-specific hcalls used by the Ultravisor */
+#define H_SVM_PAGE_IN  0xEF00
+#define H_SVM_PAGE_OUT 0xEF04
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h 
b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
new file mode 100644
index ..9603c2b48d67
--- /dev/null
+++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __POWERPC_KVM_PPC_HMM_H__
+#define __POWERPC_KVM_PPC_HMM_H__
+
+#ifdef CONFIG_PPC_UV
+unsigned long kvmppc_h_svm_page_in(struct kvm *kvm,
+  unsigned long gra,
+  unsigned long flags,
+  unsigned long page_shift);
+unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
+   unsigned long gra,
+   unsigned long flags,
+   unsigned long page_shift);
+#else
+static inline unsigned long
+kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gra,
+unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+
+static inline unsigned long
+kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gra,
+ unsigned long flags, unsigned long page_shift)
+{
+   return H_UNSUPPORTED;
+}
+#endif /* CONFIG_PPC_UV */
+#endif /* __POWERPC_KVM_PPC_HMM_H__ */
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 81cd221ccc04..a2e7502346a3 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -336,6 +336,7 @@ struct kvm_arch {
 #endif
struct kvmppc_ops *kvm_ops;
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   struct mutex uvmem_lock;
struct mutex mmu_setup_lock;/* nests inside vcpu mutexes */
u64 l1_ptcr;
int max_nested_lpid;
@@ -869,4 +870,16 @@ static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu 
*vcpu) {}
 static inline void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu) {}
 static inline void kvm_arch_vcpu_block_finish(struct kvm_vcpu *vcpu) {}
 
+#ifdef CONFIG_PPC_UV
+int kvmppc_uvmem_init(void);
+void kvmppc_uvmem_free(void);
+#else
+static inline int kvmppc_uvmem_init(void)
+{
+   return 0;
+}
+
+static inline void kvmppc_uvmem_free(void) {}
+#endif /* CONFIG_PPC_UV */
+
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 6a0f9c74f959..1cd1f595fd81 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -25,5 +25,7 @@
 /* opcodes */
 #define UV_WRITE_PATE  0xF104
 #define UV_RETURN  0xF11C
+#define UV_PAGE_IN 0xF128
+#define UV_PAGE_OUT0xF12C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index d7aa97aa7834..0fc4a974b2e8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -31,4 +31,18 @@ static inline int uv_register_pate(u64 lpid, u64 dw0, u64 
dw1

[PATCH v9 1/8] KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot

2019-09-24 Thread Bharata B Rao
From: Suraj Jitindar Singh 

The rmap array in the guest memslot has one entry per guest page and is
allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example for a hpt guest it is used to store a lock bit,
rc bits, a present bit and the index of a hpt entry in the guest hpt
which maps this page. For a radix guest which is running nested guests
it is used to store a pointer to a linked list of nested rmap entries
which store the nested guest physical address which maps this guest
address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, and the potential
for this to expand to more in the future, define a type field (being the
top 8 bits of the rmap entry) to be used to define the type of the rmap
entry which is currently present and define two values for this field
for the two current uses of the rmap array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set as is expected for a pointer.
Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
Signed-off-by: Bharata B Rao 
[Added rmap type KVMPPC_RMAP_UVMEM_PFN]
Reviewed-by: Sukadev Bhattiprolu 
---
 arch/powerpc/include/asm/kvm_host.h | 28 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 +-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 4bb552d639b8..81cd221ccc04 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -232,11 +232,31 @@ struct revmap_entry {
 };
 
 /*
- * We use the top bit of each memslot->arch.rmap entry as a lock bit,
- * and bit 32 as a present flag.  The bottom 32 bits are the
- * index in the guest HPT of a HPTE that points to the page.
+ * The rmap array of size number of guest pages is allocated for each memslot.
+ * This array is used to store usage specific information about the guest page.
+ * Below are the encodings of the various possible usage types.
  */
-#define KVMPPC_RMAP_LOCK_BIT   63
+/* Free bits which can be used to define a new usage */
+#define KVMPPC_RMAP_TYPE_MASK  0xff00
+#define KVMPPC_RMAP_NESTED 0xc000  /* Nested rmap array */
+#define KVMPPC_RMAP_HPT0x0100  /* HPT guest */
+#define KVMPPC_RMAP_UVMEM_PFN  0x0200  /* Secure GPA */
+
+static inline unsigned long kvmppc_rmap_type(unsigned long *rmap)
+{
+   return (*rmap & KVMPPC_RMAP_TYPE_MASK);
+}
+
+/*
+ * rmap usage definition for a hash page table (hpt) guest:
+ * 0x0800  Lock bit
+ * 0x0180  RC bits
+ * 0x0001  Present bit
+ * 0x  HPT index bits
+ * The bottom 32 bits are the index in the guest HPT of a HPTE that points to
+ * the page.
+ */
+#define KVMPPC_RMAP_LOCK_BIT   43
 #define KVMPPC_RMAP_RC_SHIFT   32
 #define KVMPPC_RMAP_REFERENCED (HPTE_R_R << KVMPPC_RMAP_RC_SHIFT)
 #define KVMPPC_RMAP_PRESENT0x1ul
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 63e0ce91e29d..7186c65c61c9 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -99,7 +99,7 @@ void kvmppc_add_revmap_chain(struct kvm *kvm, struct 
revmap_entry *rev,
} else {
rev->forw = rev->back = pte_index;
*rmap = (*rmap & ~KVMPPC_RMAP_INDEX) |
-   pte_index | KVMPPC_RMAP_PRESENT;
+   pte_index | KVMPPC_RMAP_PRESENT | KVMPPC_RMAP_HPT;
}
unlock_rmap(rmap);
 }
-- 
2.21.0



[PATCH v9 0/8] KVM: PPC: Driver to manage pages of secure guest

2019-09-24 Thread Bharata B Rao
[The main change in this version is the introduction of new
locking to prevent concurrent page-in and page-out calls. More
details about this are present in patch 2/8]

Hi,

A pseries guest can be run as a secure guest on Ultravisor-enabled
POWER platforms. On such platforms, this driver will be used to manage
the movement of guest pages between the normal memory managed by the
hypervisor (HV) and the secure memory managed by the Ultravisor (UV).

Private ZONE_DEVICE memory equal to the amount of secure memory
available in the platform for running secure guests is created.
Whenever a page belonging to the guest becomes secure, a page from
this private device memory is used to represent and track that secure
page on the HV side. The movement of pages between normal and secure
memory is done via migrate_vma_pages(). The reverse movement is driven
via pagemap_ops.migrate_to_ram().

The page-in or page-out requests from UV will come to HV as hcalls and
HV will call back into UV via uvcalls to satisfy these page requests.
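
For reference, the reverse path hangs off the device pagemap ops, roughly
as below (a sketch; the handler names are assumptions and may differ from
the actual patch):

static const struct dev_pagemap_ops kvmppc_uvmem_ops = {
        .page_free      = kvmppc_uvmem_page_free,       /* device PFN released */
        .migrate_to_ram = kvmppc_uvmem_migrate_to_ram,  /* HV touched a secure page */
};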

These patches are against hmm.git
(https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm)

plus

Claudio Carvalho's base ultravisor enablement patches that are present
in Michael Ellerman's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/log/?h=topic/ppc-kvm)

These patches along with Claudio's above patches are required to
run secure pseries guests on KVM. This patchset is based on hmm.git
because hmm.git has migrate_vma cleanup and not-device memremap_pages
patchsets that are required by this patchset.

Changes in v9
=
- Prevent concurrent page-in and page-out calls.
- Ensure device PFNs are allocated for zero-pages that are sent to UV.
- Failure to migrate a page during page-in will now return an error via
  the hcall.
- Address review comments by Suka
- Misc cleanups

v8: 
https://lore.kernel.org/linux-mm/20190910082946.7849-2-bhar...@linux.ibm.com/T/

Anshuman Khandual (1):
  KVM: PPC: Ultravisor: Add PPC_UV config option

Bharata B Rao (6):
  KVM: PPC: Move pages between normal and secure memory
  KVM: PPC: Shared pages support for secure guests
  KVM: PPC: H_SVM_INIT_START and H_SVM_INIT_DONE hcalls
  KVM: PPC: Handle memory plug/unplug to secure VM
  KVM: PPC: Radix changes for secure guest
  KVM: PPC: Support reset of secure guest

Suraj Jitindar Singh (1):
  KVM: PPC: Book3S HV: Define usage types for rmap array in guest
memslot

 Documentation/virt/kvm/api.txt  |  19 +
 arch/powerpc/Kconfig|  17 +
 arch/powerpc/include/asm/hvcall.h   |   9 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |  48 ++
 arch/powerpc/include/asm/kvm_host.h |  57 +-
 arch/powerpc/include/asm/kvm_ppc.h  |   2 +
 arch/powerpc/include/asm/ultravisor-api.h   |   6 +
 arch/powerpc/include/asm/ultravisor.h   |  36 ++
 arch/powerpc/kvm/Makefile   |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c  |  22 +
 arch/powerpc/kvm/book3s_hv.c| 122 
 arch/powerpc/kvm/book3s_hv_rm_mmu.c |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c  | 673 
 arch/powerpc/kvm/powerpc.c  |  12 +
 include/uapi/linux/kvm.h|   1 +
 15 files changed, 1024 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/kvm_book3s_uvmem.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_uvmem.c

-- 
2.21.0



Re: [PATCH v8 2/8] kvmppc: Movement of pages between normal and secure memory

2019-09-18 Thread Bharata B Rao
On Wed, Sep 18, 2019 at 12:42:10PM +0530, Bharata B Rao wrote:
> On Tue, Sep 17, 2019 at 04:31:39PM -0700, Sukadev Bhattiprolu wrote:
> > 
> > Minor: Can this allocation be outside the lock? I guess it would change
> > the order of cleanup at the end of the function.
> 
> Cleanup has bitmap_clear which needs be under spinlock, so this order
> of setup/alloc and cleanup will keep things simple is what I felt.
> 
> > 
> > > + spin_unlock(&kvmppc_uvmem_pfn_lock);
> > > +
> > > + *rmap = uvmem_pfn | KVMPPC_RMAP_UVMEM_PFN;
> > > + pvt->rmap = rmap;
> > > + pvt->gpa = gpa;
> > > + pvt->lpid = lpid;
> > > + dpage->zone_device_data = pvt;
> > > +
> > > + get_page(dpage);
> > > + return dpage;
> > > +
> > > +out_unlock:
> > > + unlock_page(dpage);
> > > +out_clear:
> > > + bitmap_clear(kvmppc_uvmem_pfn_bitmap, uvmem_pfn - pfn_first, 1);
> > 
> > Reuse variable 'bit'  here?
> 
> Sure.
> 
> > 
> > > +out:
> > > + spin_unlock(&kvmppc_uvmem_pfn_lock);
> > > + return NULL;
> > > +}
> > > +
> > > +/*
> > > + * Alloc a PFN from private device memory pool and copy page from normal
> > > + * memory to secure memory using UV_PAGE_IN uvcall.
> > > + */
> > > +static int
> > > +kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
> > > +unsigned long end, unsigned long *rmap,
> > > +unsigned long gpa, unsigned int lpid,
> > > +unsigned long page_shift)
> > > +{
> > > + unsigned long src_pfn, dst_pfn = 0;
> > > + struct migrate_vma mig;
> > > + struct page *spage;
> > > + unsigned long pfn;
> > > + struct page *dpage;
> > > + int ret = 0;
> > > +
> > > + memset(&mig, 0, sizeof(mig));
> > > + mig.vma = vma;
> > > + mig.start = start;
> > > + mig.end = end;
> > > + mig.src = &src_pfn;
> > > + mig.dst = &dst_pfn;
> > > +
> > > + ret = migrate_vma_setup(&mig);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + spage = migrate_pfn_to_page(*mig.src);
> > > + pfn = *mig.src >> MIGRATE_PFN_SHIFT;
> > > + if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE)) {
> > > + ret = 0;
> > 
> > Do we want to return success here (and have caller return H_SUCCESS) if
> > we can't find the source page?
> 
> spage is NULL for zero page. In this case we return success but there is
> no UV_PAGE_IN involved.
> 
> Absence of MIGRATE_PFN_MIGRATE indicates that the requested page
> can't be migrated. I haven't hit this case till now. Similar check
> is also present in the nouveau driver. I am not sure if this is strictly
> needed here.
> 
> Christoph, Jason - do you know if !(*mig.src & MIGRATE_PFN_MIGRATE)
> check is required and if so in which cases will it be true?

Looks like the currently existing check

if (!spage || !(*mig.src & MIGRATE_PFN_MIGRATE)) {
	ret = 0;
	goto out_finalize;
}

will prevent both

1. Zero pages and
2. Pages for which no page table entries exist

from getting migrated to the secure (device) side. In both the above cases
!spage is true (and MIGRATE_PFN_MIGRATE is set). In both cases
we needn't copy the page, but the migration should still complete.

I guess the following comment extract from migrate_vma_setup() is talking
about Case 2 above.

 * For empty entries inside CPU page table (pte_none() or pmd_none() is true) we
 * do set MIGRATE_PFN_MIGRATE flag inside the corresponding source array thus
 * allowing the caller to allocate device memory for those unback virtual
 * address.  For this the caller simply has to allocate device memory and
 * properly set the destination entry like for regular migration.  Note that

Is the above understanding correct? Will the same apply to the nouveau driver too?
nouveau_dmem_migrate_copy_one() also seems to bail out after a similar
check.
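
If that understanding is right, the page-in path could keep bailing out
only for the !MIGRATE_PFN_MIGRATE case and still allocate a device PFN
when there is simply no source page, skipping just the copy. A rough
sketch of that shape (the kvmppc_uvmem_get_page() call below is an
assumed signature, not the exact code):

	if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
		/* genuinely non-migratable entry: give up */
		goto out_finalize;
	}

	/* Allocate the device PFN even when there is no source page
	 * (zero page or empty PTE); only the copy is skipped. */
	dpage = kvmppc_uvmem_get_page(rmap, gpa, lpid);
	if (!dpage)
		goto out_finalize;

	spage = migrate_pfn_to_page(*mig.src);
	if (spage)
		uv_page_in(lpid, (*mig.src >> MIGRATE_PFN_SHIFT) << page_shift,
			   gpa, 0, page_shift);

	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
	migrate_vma_pages(&mig);
out_finalize:
	migrate_vma_finalize(&mig);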

Regards,
Bharata.



