Re: ppc64le STRICT_MODULE_RWX and livepatch apply_relocate_add() crashes

2021-11-03 Thread Suraj Jitindar Singh
Hi Russell,

On Mon, 2021-11-01 at 19:20 +1000, Russell Currey wrote:
> On Sun, 2021-10-31 at 22:43 -0400, Joe Lawrence wrote:
> > Starting with 5.14 kernels, I can reliably reproduce a crash [1] on
> > ppc64le when loading livepatches containing late klp-relocations
> > [2].
> > These are relocations, specific to livepatching, that are resolved
> > not
> > when a livepatch module is loaded, but only when a livepatch-target
> > module is loaded.
> 
> Hey Joe, thanks for the report.
> 
> > I haven't started looking at a fix yet, but in the case of the x86
> > code
> > update, its apply_relocate_add() implementation was modified to use
> > a
> > common text_poke() function to allowed us to drop
> > module_{en,dis}ble_ro() games by the livepatching code.
> 
> It should be a similar fix for Power, our patch_instruction() uses a
> text poke area but apply_relocate_add() doesn't use it and does its
> own
> raw patching instead.
> 
> > I can take a closer look this week, but thought I'd send out a
> > report
> > in case this may be a known todo for STRICT_MODULE_RWX on Power.
> 
> I'm looking into this now, will update when there's progress.  I
> personally wasn't aware but Jordan flagged this as an issue back in
> August [0].  Are the selftests in the klp-convert tree sufficient for
> testing?  I'm not especially familiar with livepatching & haven't
> used
> the userspace tools.
> 

You can test this by livepatching any module, since the crash only occurs
when writing relocations for modules: the vmlinux relocations are written
earlier, before the module text is mapped read-only.
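For reference, the fix being discussed amounts to routing these relocation
writes through the text-poke machinery instead of storing to the (now
read-only) module text directly. A minimal sketch of that idea, assuming
the patch_instruction()/ppc_inst() helpers (their signatures have shifted
across releases), and not the actual fix:

#include <asm/code-patching.h>
#include <asm/inst.h>

/*
 * Sketch only: instead of a raw store, which faults once module text is
 * mapped read-only under STRICT_MODULE_RWX, write the relocated
 * instruction through the text poke area.
 */
static int write_reloc_insn(u32 *location, u32 value)
{
	/* was: *location = value; */
	return patch_instruction(location, ppc_inst(value));
}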

- Suraj

> - Russell
> 
> [0] https://github.com/linuxppc/issues/issues/375
> 
> > 
> > -- Joe
> 
> 



Re: [PATCH v6 1/7] kvmppc: Driver to manage pages of secure guest

2019-08-19 Thread Suraj Jitindar Singh
On Fri, 2019-08-09 at 14:11 +0530, Bharata B Rao wrote:
> KVMPPC driver to manage page transitions of secure guest
> via H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
> 
> H_SVM_PAGE_IN: Move the content of a normal page to secure page
> H_SVM_PAGE_OUT: Move the content of a secure page to normal page
> 
> Private ZONE_DEVICE memory equal to the amount of secure memory
> available in the platform for running secure guests is created
> via a char device. Whenever a page belonging to the guest becomes
> secure, a page from this private device memory is used to
> represent and track that secure page on the HV side. The movement
> of pages between normal and secure memory is done via
> migrate_vma_pages() using UV_PAGE_IN and UV_PAGE_OUT ucalls.

Hi Bharata,

Please see my patch where I define the bits which determine the type of
the rmap entry:
https://patchwork.ozlabs.org/patch/1149791/

Please add an entry for the devm pfn type like:

#define KVMPPC_RMAP_PFN_DEVM	0x0200000000000000	/* secure guest devm pfn */

And the following in the appropriate header file:

static inline bool kvmppc_rmap_is_pfn_devm(unsigned long *rmapp)
{
	return !!((*rmapp & KVMPPC_RMAP_TYPE_MASK) == KVMPPC_RMAP_PFN_DEVM);
}

Also see comment below.

Thanks,
Suraj

> 
> Signed-off-by: Bharata B Rao 
> ---
>  arch/powerpc/include/asm/hvcall.h  |   4 +
>  arch/powerpc/include/asm/kvm_book3s_devm.h |  29 ++
>  arch/powerpc/include/asm/kvm_host.h|  12 +
>  arch/powerpc/include/asm/ultravisor-api.h  |   2 +
>  arch/powerpc/include/asm/ultravisor.h  |  14 +
>  arch/powerpc/kvm/Makefile  |   3 +
>  arch/powerpc/kvm/book3s_hv.c   |  19 +
>  arch/powerpc/kvm/book3s_hv_devm.c  | 492 +
>  8 files changed, 575 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/kvm_book3s_devm.h
>  create mode 100644 arch/powerpc/kvm/book3s_hv_devm.c
> 
[snip]
> +
> +struct kvmppc_devm_page_pvt {
> + unsigned long *rmap;
> + unsigned int lpid;
> + unsigned long gpa;
> +};
> +
> +struct kvmppc_devm_copy_args {
> + unsigned long *rmap;
> + unsigned int lpid;
> + unsigned long gpa;
> + unsigned long page_shift;
> +};
> +
> +/*
> + * Bits 60:56 in the rmap entry will be used to identify the
> + * different uses/functions of rmap. This definition with move
> + * to a proper header when all other functions are defined.
> + */
> +#define KVMPPC_PFN_DEVM  (0x2ULL << 56)
> +
> +static inline bool kvmppc_is_devm_pfn(unsigned long pfn)
> +{
> + return !!(pfn & KVMPPC_PFN_DEVM);
> +}
> +
> +/*
> + * Get a free device PFN from the pool
> + *
> + * Called when a normal page is moved to secure memory (UV_PAGE_IN).
> Device
> + * PFN will be used to keep track of the secure page on HV side.
> + *
> + * @rmap here is the slot in the rmap array that corresponds to
> @gpa.
> + * Thus a non-zero rmap entry indicates that the corresonding guest
> + * page has become secure, and is not mapped on the HV side.
> + *
> + * NOTE: In this and subsequent functions, we pass around and access
> + * individual elements of kvm_memory_slot->arch.rmap[] without any
> + * protection. Should we use lock_rmap() here?
> + */
> +static struct page *kvmppc_devm_get_page(unsigned long *rmap,
> + unsigned long gpa, unsigned
> int lpid)
> +{
> + struct page *dpage = NULL;
> + unsigned long bit, devm_pfn;
> + unsigned long nr_pfns = kvmppc_devm.pfn_last -
> + kvmppc_devm.pfn_first;
> + unsigned long flags;
> + struct kvmppc_devm_page_pvt *pvt;
> +
> + if (kvmppc_is_devm_pfn(*rmap))
> + return NULL;
> +
> + spin_lock_irqsave(&kvmppc_devm_lock, flags);
> + bit = find_first_zero_bit(kvmppc_devm.pfn_bitmap, nr_pfns);
> + if (bit >= nr_pfns)
> + goto out;
> +
> + bitmap_set(kvmppc_devm.pfn_bitmap, bit, 1);
> + devm_pfn = bit + kvmppc_devm.pfn_first;
> + dpage = pfn_to_page(devm_pfn);
> +
> + if (!trylock_page(dpage))
> + goto out_clear;
> +
> + *rmap = devm_pfn | KVMPPC_PFN_DEVM;
> + pvt = kzalloc(sizeof(*pvt), GFP_ATOMIC);
> + if (!pvt)
> + goto out_unlock;
> + pvt->rmap = rmap;

Am I missing something? Why does the rmap need to be stored in pvt?
The gpa is already stored, and that is enough to get back to the rmap
entry, right?
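For what it's worth, the point is that the slot is recoverable from the
gpa alone, roughly along these lines (a sketch using the generic memslot
helpers, not code from this patch; the helper name is made up):

/* Sketch: recover the rmap slot for a guest physical address. */
static unsigned long *kvmppc_gpa_to_rmap(struct kvm *kvm, unsigned long gpa)
{
	unsigned long gfn = gpa >> PAGE_SHIFT;
	struct kvm_memory_slot *memslot = gfn_to_memslot(kvm, gfn);

	if (!memslot)
		return NULL;
	return &memslot->arch.rmap[gfn - memslot->base_gfn];
}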

> + pvt->gpa = gpa;
> + pvt->lpid = lpid;
> + dpage->zone_device_data = pvt;
> + spin_unlock_irqrestore(&kvmppc_devm_lock, flags);
> +
> + get_page(dpage);
> + return dpage;
> +
> +out_unlock:
> + unlock_page(dpage);
> +out_clear:
> + bitmap_clear(kvmppc_devm.pfn_bitmap,
> +  devm_pfn - kvmppc_devm.pfn_first, 1);
> +out:
> + spin_unlock_irqrestore(&kvmppc_devm_lock, flags);
> + return NULL;
> +}
> +
> 
[snip]


Re: [PATCH 1/3] KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting

2019-07-14 Thread Suraj Jitindar Singh
On Sat, 2019-07-13 at 13:47 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> > The performance monitoring unit (PMU) registers are saved on guest
> > exit
> > when the guest has set the pmcregs_in_use flag in its lppaca, if it
> > exists, or unconditionally if it doesn't. If a nested guest is
> > being
> > run then the hypervisor doesn't, and in most cases can't, know if
> > the
> > pmu registers are in use since it doesn't know the location of the
> > lppaca
> > for the nested guest, although it may have one for its immediate
> > guest.
> > This results in the values of these registers being lost across
> > nested
> > guest entry and exit in the case where the nested guest was making
> > use
> > of the performance monitoring facility while it's nested guest
> > hypervisor
> > wasn't.
> > 
> > Further more the hypervisor could interrupt a guest hypervisor
> > between
> > when it has loaded up the pmu registers and it calling
> > H_ENTER_NESTED or
> > between returning from the nested guest to the guest hypervisor and
> > the
> > guest hypervisor reading the pmu registers, in
> > kvmhv_p9_guest_entry().
> > This means that it isn't sufficient to just save the pmu registers
> > when
> > entering or exiting a nested guest, but that it is necessary to
> > always
> > save the pmu registers whenever a guest is capable of running
> > nested guests
> > to ensure the register values aren't lost in the context switch.
> > 
> > Ensure the pmu register values are preserved by always saving their
> > value into the vcpu struct when a guest is capable of running
> > nested
> > guests.
> > 
> > This should have minimal performance impact however any impact can
> > be
> > avoided by booting a guest with "-machine pseries,cap-nested-
> > hv=false"
> > on the qemu commandline.
> > 
> > Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest
> > entry/exit path on P9 for radix guests"
> 
> I'm not clear why this and the next commit are marked as fixing the
> above commit. Wasn't it broken prior to that commit as well?

That was the commit which introduced the entry path we use for a nested
guest, the path on which we need to save and restore the pmu registers,
and so it is where the new code was introduced.

It wasn't technically broken before then, since you couldn't run nested
guests prior to that commit; in fact it's a few commits after that one
that we actually enabled the ability to run nested guests.

However, since that's the code which introduced the nested entry path,
it seemed like the best fit for the fixes tag for people who will be
looking for fixes in that area. Also, all the other nested entry path
fixes used that fixes tag, so it ties them together nicely.

Thanks,
Suraj

> 
> cheers
> 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c
> > b/arch/powerpc/kvm/book3s_hv.c
> > index ec1804f822af..b682a429f3ef 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -3654,6 +3654,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu
> > *vcpu, u64 time_limit,
> > vcpu->arch.vpa.dirty = 1;
> > save_pmu = lp->pmcregs_in_use;
> > }
> > +   /* Must save pmu if this guest is capable of running
> > nested guests */
> > +   save_pmu |= nesting_enabled(vcpu->kvm);
> >  
> > kvmhv_save_guest_pmu(vcpu, save_pmu);
> >  
> > -- 
> > 2.13.6


Re: [PATCH] powerpc: mm: Limit rma_size to 1TB when running without HV mode

2019-07-14 Thread Suraj Jitindar Singh
On Fri, 2019-07-12 at 23:09 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> > The virtual real mode addressing (VRMA) mechanism is used when a
> > partition is using HPT (Hash Page Table) translation and performs
> > real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
> > mode effective address bits 0:23 are treated as zero (i.e. the
> > access
> > is aliased to 0) and the access is performed using an implicit 1TB
> > SLB
> > entry.
> > 
> > The size of the RMA (Real Memory Area) is communicated to the guest
> > as
> > the size of the first memory region in the device tree. And because
> > of
> > the mechanism described above can be expected to not exceed 1TB. In
> > the
> > event that the host erroneously represents the RMA as being larger
> > than
> > 1TB, guest accesses in real mode to memory addresses above 1TB will
> > be
> > aliased down to below 1TB. This means that a memory access
> > performed in
> > real mode may differ to one performed in virtual mode for the same
> > memory
> > address, which would likely have unintended consequences.
> > 
> > To avoid this outcome have the guest explicitly limit the size of
> > the
> > RMA to the current maximum, which is 1TB. This means that even if
> > the
> > first memory block is larger than 1TB, only the first 1TB should be
> > accessed in real mode.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> 
> I added:
> 
> Fixes: c3ab300ea555 ("powerpc: Add POWER9 cputable entry")
> Cc: sta...@vger.kernel.org # v4.6+
> 
> 
> Which is not exactly correct, but probably good enough?

I think we actually want:
Fixes: c610d65c0ad0 ("powerpc/pseries: lift RTAS limit for hash")

That is what actually caused the breakage and made the issue present
itself.

> 
> cheers
> 
> > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c
> > b/arch/powerpc/mm/book3s64/hash_utils.c
> > index 28ced26f2a00..4d0e2cce9cd5 100644
> > --- a/arch/powerpc/mm/book3s64/hash_utils.c
> > +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> > @@ -1901,11 +1901,19 @@ void
> > hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
> >  *
> >  * For guests on platforms before POWER9, we clamp the it
> > limit to 1G
> >  * to avoid some funky things such as RTAS bugs etc...
> > +* On POWER9 we limit to 1TB in case the host erroneously
> > told us that
> > +* the RMA was >1TB. Effective address bits 0:23 are
> > treated as zero
> > +* (meaning the access is aliased to zero i.e. addr = addr
> > % 1TB)
> > +* for virtual real mode addressing and so it doesn't make
> > sense to
> > +* have an area larger than 1TB as it can't be addressed.
> >  */
> > if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
> > ppc64_rma_size = first_memblock_size;
> > if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
> > ppc64_rma_size = min_t(u64,
> > ppc64_rma_size, 0x4000);
> > +   else
> > +   ppc64_rma_size = min_t(u64,
> > ppc64_rma_size,
> > +  1UL <<
> > SID_SHIFT_1T);
> >  
> > /* Finally limit subsequent allocations */
> > memblock_set_current_limit(ppc64_rma_size);
> > -- 
> > 2.13.6


[PATCH] powerpc: mm: Limit rma_size to 1TB when running without HV mode

2019-07-09 Thread Suraj Jitindar Singh
The virtual real mode addressing (VRMA) mechanism is used when a
partition is using HPT (Hash Page Table) translation and performs
real mode accesses (MSR[IR|DR] = 0) in non-hypervisor mode. In this
mode effective address bits 0:23 are treated as zero (i.e. the access
is aliased to 0) and the access is performed using an implicit 1TB SLB
entry.

The size of the RMA (Real Memory Area) is communicated to the guest as
the size of the first memory region in the device tree, and because of
the mechanism described above it can be expected not to exceed 1TB. In the
event that the host erroneously represents the RMA as being larger than
1TB, guest accesses in real mode to memory addresses above 1TB will be
aliased down to below 1TB. This means that a memory access performed in
real mode may differ from one performed in virtual mode for the same memory
address, which would likely have unintended consequences.

To avoid this outcome have the guest explicitly limit the size of the
RMA to the current maximum, which is 1TB. This means that even if the
first memory block is larger than 1TB, only the first 1TB should be
accessed in real mode.
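To make the aliasing concrete: with the implicit 1TB SLB entry, zeroing
effective address bits 0:23 is equivalent to taking the address modulo
1TB. A rough illustration (not code from this patch):

/*
 * Illustration only: real mode accesses through the 1TB VRMA segment
 * are aliased like this, so memory above 1TB is unreachable in real
 * mode even if the first memory block is larger.
 */
static inline unsigned long vrma_real_mode_alias(unsigned long ea)
{
	return ea & ((1UL << 40) - 1);	/* keep EA bits 24:63, i.e. ea % 1TB */
}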

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 28ced26f2a00..4d0e2cce9cd5 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1901,11 +1901,19 @@ void hash__setup_initial_memory_limit(phys_addr_t 
first_memblock_base,
 *
 * For guests on platforms before POWER9, we clamp the it limit to 1G
 * to avoid some funky things such as RTAS bugs etc...
+* On POWER9 we limit to 1TB in case the host erroneously told us that
+* the RMA was >1TB. Effective address bits 0:23 are treated as zero
+* (meaning the access is aliased to zero i.e. addr = addr % 1TB)
+* for virtual real mode addressing and so it doesn't make sense to
+* have an area larger than 1TB as it can't be addressed.
 */
if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
ppc64_rma_size = first_memblock_size;
if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x4000);
+   else
+   ppc64_rma_size = min_t(u64, ppc64_rma_size,
+  1UL << SID_SHIFT_1T);
 
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
-- 
2.13.6



[PATCH 3/3] KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries

2019-07-02 Thread Suraj Jitindar Singh
The processor stop status and control register (PSSCR) is used to
control the power saving facilities of the processor. This register has
various fields, some of which can be modified only in hypervisor state,
and others which can be modified in both hypervisor and privileged
non-hypervisor state. The bits which can be modified in privileged
non-hypervisor state are referred to as guest visible.

Currently the L0 hypervisor saves and restores both its own host value
and the guest value of the psscr when context switching between
the hypervisor and guest. However a nested hypervisor running its own
nested guests (as indicated by kvmhv_on_pseries()) doesn't context
switch the psscr register. This means that if a nested (L2) guest
modifies the psscr, the L1 guest hypervisor will run with this
value, and if the L1 guest hypervisor modifies this value and then goes
to run the nested (L2) guest again, the L2 psscr value will be lost.

Fix this by having the (L1) nested hypervisor save and restore both its
host and the guest psscr value when entering and exiting a nested (L2)
guest. Note that only the guest visible parts of the psscr are context
switched, since this is all the L1 nested hypervisor can access. This is
fine, however, as these are the only fields the L0 hypervisor provides
guest control of anyway, and so all other fields are ignored.

This could also have been implemented by adding the psscr register to
the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED
hcall; however, this would have meant updating the structure layout and
thus required modifications to both the L0 and L1 kernels, whereas the
approach used here doesn't require L0 kernel modifications while
achieving the same result.

Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
P9 for radix guests"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b682a429f3ef..cde3f5a4b3e4 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3569,9 +3569,18 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
mtspr(SPRN_DEC, vcpu->arch.dec_expires - mftb());
 
if (kvmhv_on_pseries()) {
+   /*
+* We need to save and restore the guest visible part of the
+* psscr (i.e. using SPRN_PSSCR_PR) since the hypervisor
+* doesn't do this for us. Note only required if pseries since
+* this is done in kvmhv_load_hv_regs_and_go() below otherwise.
+*/
+   unsigned long host_psscr;
/* call our hypervisor to load up HV regs and go */
struct hv_guest_state hvregs;
 
+   host_psscr = mfspr(SPRN_PSSCR_PR);
+   mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);
kvmhv_save_hv_regs(vcpu, &hvregs);
hvregs.lpcr = lpcr;
vcpu->arch.regs.msr = vcpu->arch.shregs.msr;
@@ -3590,6 +3599,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
vcpu->arch.shregs.msr = vcpu->arch.regs.msr;
vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
+   vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);
+   mtspr(SPRN_PSSCR_PR, host_psscr);
 
/* H_CEDE has to be handled now, not later */
if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
-- 
2.13.6



[PATCH 2/3] PPC: PMC: Set pmcregs_in_use in paca when running as LPAR

2019-07-02 Thread Suraj Jitindar Singh
The ability to run nested guests under KVM means that a guest can also
act as a hypervisor for its own nested guest. Currently
ppc_set_pmu_inuse() assumes that either FW_FEATURE_LPAR is set,
indicating a guest environment, and so sets the pmcregs_in_use flag in
the lppaca, or that it isn't set, indicating a hypervisor environment,
and so sets the pmcregs_in_use flag in the paca.

The pmcregs_in_use flag in the lppaca is used to communicate this
information to a hypervisor and so must be set in a guest environment.
The pmcregs_in_use flag in the paca is used by KVM code to determine
whether the host state of the performance monitoring unit (PMU) must be
saved and restored when running a guest.

Thus when a guest also acts as a hypervisor it must set this bit in both
places, since it needs to ensure both that the real hypervisor saves its
pmu registers when it runs (requires the pmcregs_in_use flag in the lppaca),
and that it saves its own pmu registers when running a nested guest
(requires the pmcregs_in_use flag in the paca).

Modify ppc_set_pmu_inuse() so that the pmcregs_in_use bit is set in both
the lppaca and the paca when a guest (LPAR) is running with the
capability of running its own guests (CONFIG_KVM_BOOK3S_HV_POSSIBLE).
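With the change below applied, the resulting helper looks roughly like
the following (a sketch reconstructed from the diff and the commit
message; the surrounding #if context is assumed rather than shown in the
diff and may differ slightly from the tree):

static inline void ppc_set_pmu_inuse(int inuse)
{
#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
	if (firmware_has_feature(FW_FEATURE_LPAR)) {
#ifdef CONFIG_PPC_PSERIES
		/* running as a guest: tell our hypervisor via the lppaca */
		get_lppaca()->pmcregs_in_use = inuse;
#endif
	}
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
	/* may run guests ourselves: record it in the paca as well */
	get_paca()->pmcregs_in_use = inuse;
#endif
#endif
}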

Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
P9 for radix guests"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/pmc.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
index dc9a1ca70edf..c6bbe9778d3c 100644
--- a/arch/powerpc/include/asm/pmc.h
+++ b/arch/powerpc/include/asm/pmc.h
@@ -27,11 +27,10 @@ static inline void ppc_set_pmu_inuse(int inuse)
 #ifdef CONFIG_PPC_PSERIES
get_lppaca()->pmcregs_in_use = inuse;
 #endif
-   } else {
+   }
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   get_paca()->pmcregs_in_use = inuse;
+   get_paca()->pmcregs_in_use = inuse;
 #endif
-   }
 #endif
 }
 
-- 
2.13.6



[PATCH 1/3] KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting

2019-07-02 Thread Suraj Jitindar Singh
The performance monitoring unit (PMU) registers are saved on guest exit
when the guest has set the pmcregs_in_use flag in its lppaca, if it
exists, or unconditionally if it doesn't. If a nested guest is being
run then the hypervisor doesn't, and in most cases can't, know if the
pmu registers are in use since it doesn't know the location of the lppaca
for the nested guest, although it may have one for its immediate guest.
This results in the values of these registers being lost across nested
guest entry and exit in the case where the nested guest was making use
of the performance monitoring facility while its nested guest hypervisor
wasn't.

Furthermore the hypervisor could interrupt a guest hypervisor between
when it has loaded up the pmu registers and when it calls H_ENTER_NESTED, or
between returning from the nested guest to the guest hypervisor and the
guest hypervisor reading the pmu registers, in kvmhv_p9_guest_entry().
This means that it isn't sufficient to just save the pmu registers when
entering or exiting a nested guest, but that it is necessary to always
save the pmu registers whenever a guest is capable of running nested guests
to ensure the register values aren't lost in the context switch.

Ensure the pmu register values are preserved by always saving their
value into the vcpu struct when a guest is capable of running nested
guests.

This should have minimal performance impact however any impact can be
avoided by booting a guest with "-machine pseries,cap-nested-hv=false"
on the qemu commandline.

Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
P9 for radix guests"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ec1804f822af..b682a429f3ef 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3654,6 +3654,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
vcpu->arch.vpa.dirty = 1;
save_pmu = lp->pmcregs_in_use;
}
+   /* Must save pmu if this guest is capable of running nested guests */
+   save_pmu |= nesting_enabled(vcpu->kvm);
 
kvmhv_save_guest_pmu(vcpu, save_pmu);
 
-- 
2.13.6



[PATCH 2/3] KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decr

2019-06-19 Thread Suraj Jitindar Singh
On POWER9 the decrementer can operate in large decrementer mode, where
the decrementer is 56 bits and sign extended to 64 bits. When not
operating in this mode the decrementer behaves as a 32 bit decrementer
which is NOT sign extended (as on POWER8).

Currently when reading a guest decrementer value we don't take into
account whether the large decrementer is enabled or not, and this means
the value will be incorrect when the guest is not using the large
decrementer. Fix this by sign extending the value read when the guest
isn't using the large decrementer.
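To illustrate why the fix below matters (a standalone userspace sketch,
not kernel code): when the 32-bit decrementer has gone negative, the raw
value read back is a large unsigned number, and without sign extension
the computed expiry lands far in the future rather than just in the past.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t raw = 0xfffffff0ULL;		/* 32-bit DEC reading of -16 */
	int64_t no_extend = (int64_t)raw;	/* 4294967280: expiry pushed ~2^32 ticks out */
	int64_t extended = (int32_t)raw;	/* -16: expiry correctly just in the past */

	printf("without sign extension: %lld, with: %lld\n",
	       (long long)no_extend, (long long)extended);
	return 0;
}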

Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
P9 for radix guests"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d3684509da35..719fd2529eec 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3607,6 +3607,8 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
 
vcpu->arch.slb_max = 0;
dec = mfspr(SPRN_DEC);
+   if (!(lpcr & LPCR_LD)) /* Sign extend if not using large decrementer */
+   dec = (s32) dec;
tb = mftb();
vcpu->arch.dec_expires = dec + tb;
vcpu->cpu = -1;
-- 
2.13.6



[PATCH 3/3] KVM: PPC: Book3S HV: Clear pending decr exceptions on nested guest entry

2019-06-19 Thread Suraj Jitindar Singh
If we enter an L1 guest with a pending decrementer exception then this
is cleared on guest exit if the guest has written a positive value into
the decrementer (indicating that it handled the decrementer exception),
since there is no other way to detect that the guest has handled the
pending exception and that it should be dequeued. In the event that the
L1 guest tries to run a nested (L2) guest immediately after this and the
L2 guest decrementer is negative (which is loaded by L1 before making
the H_ENTER_NESTED hcall), then the pending decrementer exception
isn't cleared and the L2 entry is blocked since L1 has a pending
exception, even though L1 may have already handled the exception and
written a positive value for its decrementer. This results in a loop of
L1 trying to enter the L2 guest and L0 blocking the entry since L1 has
an interrupt pending, with the outcome that L2 never gets to run
and hangs.

Fix this by clearing any pending decrementer exceptions when L1 makes
the H_ENTER_NESTED hcall, since it won't do this itself if its decrementer
has gone negative. In any case its decrementer has been communicated to L0
in the hdec_expires field, and L0 will return control to L1 when this
goes negative by delivering an H_DECREMENTER exception.

Fixes: 95a6432ce903 "KVM: PPC: Book3S HV: Streamlined guest entry/exit path on 
P9 for radix guests"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 719fd2529eec..4a5eb29b952f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4128,8 +4128,15 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
 
preempt_enable();
 
-   /* cancel pending decrementer exception if DEC is now positive */
-   if (get_tb() < vcpu->arch.dec_expires && kvmppc_core_pending_dec(vcpu))
+   /*
+* cancel pending decrementer exception if DEC is now positive, or if
+* entering a nested guest in which case the decrementer is now owned
+* by L2 and the L1 decrementer is provided in hdec_expires
+*/
+   if (kvmppc_core_pending_dec(vcpu) &&
+   ((get_tb() < vcpu->arch.dec_expires) ||
+(trap == BOOK3S_INTERRUPT_SYSCALL &&
+ kvmppc_get_gpr(vcpu, 3) == H_ENTER_NESTED)))
kvmppc_core_dequeue_dec(vcpu);
 
trace_kvm_guest_exit(vcpu);
-- 
2.13.6



[PATCH 1/3] KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries

2019-06-19 Thread Suraj Jitindar Singh
When a guest vcpu moves from one physical thread to another it is
necessary for the host to perform a tlb flush on the previous core if
another vcpu from the same guest is going to run there. This is because the
guest may use the local form of the tlb invalidation instruction meaning
stale tlb entries would persist where it previously ran. This is handled
on guest entry in kvmppc_check_need_tlb_flush() which calls
flush_guest_tlb() to perform the tlb flush.

Previously the generic radix__local_flush_tlb_lpid_guest() function was
used, however the functionality was reimplemented in flush_guest_tlb()
to avoid the trace_tlbie() call as the flushing may be done in real
mode. The reimplementation in flush_guest_tlb() was missing an erat
invalidation after flushing the tlb.

This led to observable memory corruption in the guest due to the
caching of stale translations. Fix this by adding the erat invalidation.

Fixes: 70ea13f6e609 "KVM: PPC: Book3S HV: Flush TLB on secondary radix threads"

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 6035d24f1d1d..a46286f73eec 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -833,6 +833,7 @@ static void flush_guest_tlb(struct kvm *kvm)
}
}
asm volatile("ptesync": : :"memory");
+   asm volatile(PPC_INVALIDATE_ERAT : : :"memory");
 }
 
 void kvmppc_check_need_tlb_flush(struct kvm *kvm, int pcpu,
-- 
2.13.6



Re: [PATCH 0/2] Fix handling of h_set_dawr

2019-06-19 Thread Suraj Jitindar Singh
On Mon, 2019-06-17 at 11:06 +0200, Cédric Le Goater wrote:
> On 17/06/2019 09:16, Suraj Jitindar Singh wrote:
> > Series contains 2 patches to fix the host in kernel handling of the
> > hcall
> > h_set_dawr.
> > 
> > First patch from Michael Neuling is just a resend added here for
> > clarity.
> > 
> > Michael Neuling (1):
> >   KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
> > 
> > Suraj Jitindar Singh (1):
> >   KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr
> > in
> > real mode
> 
> 
> 
> Reviewed-by: Cédric Le Goater 
> 
> and 
> 
> Tested-by: Cédric Le Goater 
> 
> 
> but I see slowdowns in nested as if the IPIs were not delivered. Have
> we
> touch this part in 5.2 ? 

Hi,

I've seen the same and tracked it down to decrementer exceptions not
being delivered when the guest is using large decrementer. I've got a
patch I'm about to send so I'll CC you.

Another option is to disable the large decrementer with:
-machine pseries,cap-large-decr=false

Thanks,
Suraj

> 
> Thanks,
> 
> C.
> 


[PATCH 2/2] KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode

2019-06-17 Thread Suraj Jitindar Singh
The hcall H_SET_DAWR is used by a guest to set the data address
watchpoint register (DAWR). This hcall is handled in the host in
kvmppc_h_set_dawr() which can be called in either real mode on the guest
exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S, or in
virtual mode when called from kvmppc_pseries_do_hcall() in book3s_hv.c.

The function kvmppc_h_set_dawr updates the dawr and dawrx fields in the
vcpu struct accordingly and then also writes the respective values into
the DAWR and DAWRX registers directly. It is necessary to write the
registers directly here when calling the function in real mode since the
path to re-enter the guest won't do this. However when in virtual mode
the host DAWR and DAWRX values have already been restored, and so writing
the registers would overwrite these. Additionally there is no reason to
write the guest values here as these will be read from the vcpu struct
and written to the registers appropriately the next time the vcpu is
run.

This also avoids the case when handling h_set_dawr for a nested guest
where the guest hypervisor isn't able to write the DAWR and DAWRX
registers directly and must rely on the real hypervisor to do this for
it when it calls H_ENTER_NESTED.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 703cd6cd994d..337e64468d78 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2510,9 +2510,18 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
clrrdi  r4, r4, 3
std r4, VCPU_DAWR(r3)
std r5, VCPU_DAWRX(r3)
+   /*
+* If came in through the real mode hcall handler then it is necessary
+* to write the registers since the return path won't. Otherwise it is
+* sufficient to store them in the vcpu struct as they will be loaded
+* next time the vcpu is run.
+*/
+   mfmsr   r6
+   andi.   r6, r6, MSR_DR  /* in real mode? */
+   bne 4f
mtspr   SPRN_DAWR, r4
mtspr   SPRN_DAWRX, r5
-   li  r3, 0
+4: li  r3, 0
blr
 
 _GLOBAL(kvmppc_h_cede) /* r3 = vcpu pointer, r11 = msr, r13 = paca */
-- 
2.13.6



[PATCH 1/2] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()

2019-06-17 Thread Suraj Jitindar Singh
From: Michael Neuling 

Commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9
option") screwed up some assembler and corrupted a pointer in
r3. This resulted in crashes like the below:

  [   44.374746] BUG: Kernel NULL pointer dereference at 0x13bf
  [   44.374848] Faulting instruction address: 0xc010b044
  [   44.374906] Oops: Kernel access of bad area, sig: 11 [#1]
  [   44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA 
pSeries
  [   44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM 
iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack 
nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp 
bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables 
iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm 
sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi 
failover
  [   44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not 
tainted 5.2.0-rc4+ #3
  [   44.375500] NIP:  c010b044 LR: c008089dacf4 CTR: 
c010aff4
  [   44.375604] REGS: c0179b397710 TRAP: 0300   Not tainted  (5.2.0-rc4+)
  [   44.375691] MSR:  8280b033   CR: 
42244842  XER: 
  [   44.375815] CFAR: c010aff8 DAR: 13bf DSISR: 4200 
IRQMASK: 0
  [   44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 

  [   44.375815] GPR04:  0003 2444b05d 
c017f11c45d0
  [   44.375815] GPR08: 07803e018dfe 0028 0001 
0075
  [   44.375815] GPR12: c010aff4 c7ff6300  

  [   44.375815] GPR16:  c017f11d  
c017f11ca7a8
  [   44.375815] GPR20: c017f11c42ec   
000a
  [   44.375815] GPR24: fffc  c017f11c 
c1a77ed8
  [   44.375815] GPR28: c0179af7 fffc c008089ff170 
c0179ae88540
  [   44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68
  [   44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 
[kvm_hv]
  [   44.376849] Call Trace:
  [   44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c 
(unreliable)
  [   44.376982] [c0179b397a10] [c008089dd6bc] 
kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
  [   44.377084] [c0179b397ae0] [c008093f8bcc] 
kvmppc_vcpu_run+0x34/0x48 [kvm]
  [   44.377185] [c0179b397b00] [c008093f522c] 
kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
  [   44.377286] [c0179b397b90] [c008093e3618] 
kvm_vcpu_ioctl+0x460/0x850 [kvm]
  [   44.377384] [c0179b397d00] [c04ba6c4] do_vfs_ioctl+0xe4/0xb40
  [   44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110
  [   44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80
  [   44.377628] [c0179b397e20] [c000b888] system_call+0x5c/0x70
  [   44.377712] Instruction dump:
  [   44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0 896b 2c2b 
3860
  [   44.377862] 4d820020 50852e74 508516f6 78840724  f8a313c8 
7c942ba6 7cbc2ba6

Fix the bug by only changing r3 when we are returning immediately.

Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option")
Signed-off-by: Michael Neuling 
Reported-by: Cédric Le Goater 
--
mpe: This is for 5.2 fixes

v2: Review from Christophe Leroy
  - De-Mikey/Cedric-ify commit message
  - Add "Fixes:"
  - Other trivial commit messages changes
  - No code change
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d885a5831daa..703cd6cd994d 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2500,8 +2500,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
LOAD_REG_ADDR(r11, dawr_force_enable)
lbz r11, 0(r11)
cmpdi   r11, 0
+   bne 3f
li  r3, H_HARDWARE
-   beqlr
+   blr
+3:
/* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */
rlwimi  r5, r4, 5, DAWRX_DR | DAWRX_DW
rlwimi  r5, r4, 2, DAWRX_WT
-- 
2.13.6



[PATCH 0/2] Fix handling of h_set_dawr

2019-06-17 Thread Suraj Jitindar Singh
Series contains 2 patches to fix the host in kernel handling of the hcall
h_set_dawr.

First patch from Michael Neuling is just a resend added here for clarity.

Michael Neuling (1):
  KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()

Suraj Jitindar Singh (1):
  KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in
real mode

 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

-- 
2.13.6



Re: [PATCH] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()

2019-06-12 Thread Suraj Jitindar Singh
On Thu, 2019-06-13 at 10:16 +1000, Michael Neuling wrote:
> On Wed, 2019-06-12 at 09:43 +0200, Cédric Le Goater wrote:
> > On 12/06/2019 09:22, Michael Neuling wrote:
> > > In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9
> > > option") I screwed up some assembler and corrupted a pointer in
> > > r3. This resulted in crashes like the below from Cédric:
> > > 
> > >   [   44.374746] BUG: Kernel NULL pointer dereference at
> > > 0x13bf
> > >   [   44.374848] Faulting instruction address: 0xc010b044
> > >   [   44.374906] Oops: Kernel access of bad area, sig: 11 [#1]
> > >   [   44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP
> > > NR_CPUS=2048 NUMA pSeries
> > >   [   44.375018] Modules linked in: vhost_net vhost tap
> > > xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat
> > > xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4
> > > ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter
> > > ebtables ip6table_filter ip6_tables iptable_filter bpfilter
> > > vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel
> > > ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi
> > > failover
> > >   [   44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump:
> > > loaded Not tainted 5.2.0-rc4+ #3
> > >   [   44.375500] NIP:  c010b044 LR: c008089dacf4 CTR:
> > > c010aff4
> > >   [   44.375604] REGS: c0179b397710 TRAP: 0300   Not
> > > tainted  (5.2.0-rc4+)
> > >   [   44.375691] MSR:  8280b033
> > >   CR: 42244842  XER: 
> > >   [   44.375815] CFAR: c010aff8 DAR: 13bf
> > > DSISR: 4200 IRQMASK: 0
> > >   [   44.375815] GPR00: c008089dd6bc c0179b3979a0
> > > c00808a04300 
> > >   [   44.375815] GPR04:  0003
> > > 2444b05d c017f11c45d0
> > >   [   44.375815] GPR08: 07803e018dfe 0028
> > > 0001 0075
> > >   [   44.375815] GPR12: c010aff4 c7ff6300
> > >  
> > >   [   44.375815] GPR16:  c017f11d
> > >  c017f11ca7a8
> > >   [   44.375815] GPR20: c017f11c42ec 
> > >  000a
> > >   [   44.375815] GPR24: fffc 
> > > c017f11c c1a77ed8
> > >   [   44.375815] GPR28: c0179af7 fffc
> > > c008089ff170 c0179ae88540
> > >   [   44.376673] NIP [c010b044]
> > > kvmppc_h_set_dabr+0x50/0x68
> > >   [   44.376754] LR [c008089dacf4]
> > > kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv]
> > >   [   44.376849] Call Trace:
> > >   [   44.376886] [c0179b3979a0] [c017f11c]
> > > 0xc017f11c (unreliable)
> > >   [   44.376982] [c0179b397a10] [c008089dd6bc]
> > > kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
> > >   [   44.377084] [c0179b397ae0] [c008093f8bcc]
> > > kvmppc_vcpu_run+0x34/0x48 [kvm]
> > >   [   44.377185] [c0179b397b00] [c008093f522c]
> > > kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
> > >   [   44.377286] [c0179b397b90] [c008093e3618]
> > > kvm_vcpu_ioctl+0x460/0x850 [kvm]
> > >   [   44.377384] [c0179b397d00] [c04ba6c4]
> > > do_vfs_ioctl+0xe4/0xb40
> > >   [   44.377464] [c0179b397db0] [c04bb1e4]
> > > ksys_ioctl+0xc4/0x110
> > >   [   44.377547] [c0179b397e00] [c04bb258]
> > > sys_ioctl+0x28/0x80
> > >   [   44.377628] [c0179b397e20] [c000b888]
> > > system_call+0x5c/0x70
> > >   [   44.377712] Instruction dump:
> > >   [   44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0
> > > 896b 2c2b 3860
> > >   [   44.377862] 4d820020 50852e74 508516f6 78840724 
> > > f8a313c8 7c942ba6 7cbc2ba6
> > > 
> > > This fixes the problem by only changing r3 when we are returning
> > > immediately.
> > > 
> > > Signed-off-by: Michael Neuling 
> > > Reported-by: Cédric Le Goater 
> > 
> > On nested, I still see : 
> > 
> > [   94.609274] Oops: Exception in kernel mode, sig: 4 [#1]
> > [   94.609432] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048
> > NUMA pSeries
> > [   94.609596] Modules linked in: vhost_net vhost tap xt_CHECKSUM
> > iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack
> > nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT
> > nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables
> > ip6table_filter ip6_tables iptable_filter bpfilter vmx_crypto
> > kvm_hv crct10dif_vpmsum crc32c_vpmsum kvm sch_fq_codel ip_tables
> > x_tables autofs4 virtio_net virtio_scsi net_failover failover
> > [   94.610179] CPU: 12 PID: 2026 Comm: qemu-system-ppc Kdump:
> > loaded Not tainted 5.2.0-rc4+ #6
> > [   94.610290] NIP:  c010b050 LR: c00808bbacf4 CTR:
> > c010aff4
> > [   94.610400] REGS: c017913d7710 TRAP: 0700   Not
> > tainted  (5.2.0-rc4+)
> > [   94.610493] MSR:  8284b033
> >   CR: 42224842  XER:



Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-03-28 Thread Suraj Jitindar Singh
On Wed, 2019-03-27 at 17:51 +0100, Cédric Le Goater wrote:
> On 3/27/19 5:37 PM, Cédric Le Goater wrote:
> > On 3/27/19 1:36 PM, Sebastian Andrzej Siewior wrote:
> > > With qemu-system-ppc64le -machine pseries -smp 4 I get:
> > > 
> > > > #  chrt 1 hackbench
> > > > Running in process mode with 10 groups using 40 file
> > > > descriptors each (== 400 tasks)
> > > > Each sender will pass 100 messages of 100 bytes
> > > > Oops: Exception in kernel mode, sig: 4 [#1]
> > > > LE PAGE_SIZE=64K MMU=Hash PREEMPT SMP NR_CPUS=2048 NUMA pSeries
> > > > Modules linked in:
> > > > CPU: 0 PID: 629 Comm: hackbench Not tainted 5.1.0-rc2 #71
> > > > NIP:  c0046978 LR: c0046a38 CTR:
> > > > c00b0150
> > > > REGS: c001fffeb8e0 TRAP: 0700   Not tainted  (5.1.0-rc2)
> > > > MSR:  80089033   CR:
> > > > 42000874  XER: 
> > > > CFAR: c0046a34 IRQMASK: 1
> > > > GPR00: c00b0170 c001fffebb70 c0a6ba00
> > > > 2800
> > > 
> > > …
> > > > NIP [c0046978] doorbell_core_ipi+0x28/0x30
> > > > LR [c0046a38] doorbell_try_core_ipi+0xb8/0xf0
> > > > Call Trace:
> > > > [c001fffebb70] [c001fffebba0] 0xc001fffebba0
> > > > (unreliable)
> > > > [c001fffebba0] [c00b0170]
> > > > smp_pseries_cause_ipi+0x20/0x70
> > > > [c001fffebbd0] [c004b02c]
> > > > arch_send_call_function_single_ipi+0x8c/0xa0
> > > > [c001fffebbf0] [c01de600]
> > > > irq_work_queue_on+0xe0/0x130
> > > > [c001fffebc30] [c01340c8]
> > > > rto_push_irq_work_func+0xc8/0x120
> > > 
> > > …
> > > > Instruction dump:
> > > > 6000 6000 3c4c00a2 384250b0 3d220009 392949c8 8129
> > > > 3929
> > > > 7d231838 7c0004ac 5463017e 64632800 <7c00191c> 4e800020
> > > > 3c4c00a2 38425080
> > > > ---[ end trace eb842b544538cbdf ]---
> > > 
> > > and I was wondering whether this is a qemu bug or the kernel is
> > > using an
> > > opcode it should rather not. If I skip doorbell_try_core_ipi() in
> > > smp_pseries_cause_ipi() then there is no crash. The comment says
> > > "POWER9
> > > should not use this handler" so…
> > 
> > I would say Linux is using a msgsndp instruction which is not
> > implemented
> > in QEMU TCG. But why have we started using dbells in Linux ? 

Yeah, the kernel must have used msgsndp, which isn't implemented for TCG
yet. We use doorbells in Linux, but only for threads which are on the
same core.
And when I try to construct a situation with more than 1 thread per
core (e.g. -smp 4,threads=4), I get "TCG cannot support more than 1
thread/core on a pseries machine".

So I wonder why the guest thinks it can use msgsndp...

> 
> ah. It seems arch_local_irq_restore() / replay_interrupt() generated
> some interrupt.
> 
> C.
> 


[PATCH] powerpc: Add barrier_nospec to raw_copy_in_user()

2019-03-05 Thread Suraj Jitindar Singh
Commit ddf35cf3764b ("powerpc: Use barrier_nospec in copy_from_user()")
added barrier_nospec before loading from user-controlled pointers.
The intention was to order the load from the potentially user-controlled
pointer vs a previous branch based on an access_ok() check or similar.

In order to achieve the same result, add a barrier_nospec to the
raw_copy_in_user() function before loading from such a user-controlled
pointer.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/uaccess.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index e3a731793ea2..bb615592d5bb 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -306,6 +306,7 @@ extern unsigned long __copy_tofrom_user(void __user *to,
 static inline unsigned long
 raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
 {
+   barrier_nospec();
return __copy_tofrom_user(to, from, n);
 }
 #endif /* __powerpc64__ */
-- 
2.13.6



[PATCH] KVM: PPC: powerpc: Add count cache flush parameters to kvmppc_get_cpu_char()

2019-02-28 Thread Suraj Jitindar Singh
Add KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST &
KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE to the characteristics returned from
the H_GET_CPU_CHARACTERISTICS H-CALL, as queried from either the
hypervisor or the device tree.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/uapi/asm/kvm.h |  2 ++
 arch/powerpc/kvm/powerpc.c  | 18 ++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
b/arch/powerpc/include/uapi/asm/kvm.h
index 8c876c166ef2..26ca425f4c2c 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -463,10 +463,12 @@ struct kvm_ppc_cpu_char {
 #define KVM_PPC_CPU_CHAR_BR_HINT_HONOURED  (1ULL << 58)
 #define KVM_PPC_CPU_CHAR_MTTRIG_THR_RECONF (1ULL << 57)
 #define KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS   (1ULL << 56)
+#define KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST(1ull << 54)
 
 #define KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY  (1ULL << 63)
 #define KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR (1ULL << 62)
 #define KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR(1ULL << 61)
+#define KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE(1ull << 58)
 
 /* Per-vcpu XICS interrupt controller state */
 #define KVM_REG_PPC_ICP_STATE  (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0x8c)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b90a7d154180..a99dcac91e50 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -2189,10 +2189,12 @@ static int pseries_get_cpu_char(struct kvm_ppc_cpu_char 
*cp)
KVM_PPC_CPU_CHAR_L1D_THREAD_PRIV |
KVM_PPC_CPU_CHAR_BR_HINT_HONOURED |
KVM_PPC_CPU_CHAR_MTTRIG_THR_RECONF |
-   KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS;
+   KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS |
+   KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST;
cp->behaviour_mask = KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY |
KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR |
-   KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR;
+   KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR |
+   KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE;
}
return 0;
 }
@@ -2251,12 +2253,16 @@ static int kvmppc_get_cpu_char(struct kvm_ppc_cpu_char 
*cp)
if (have_fw_feat(fw_features, "enabled",
 "fw-count-cache-disabled"))
cp->character |= KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS;
+   if (have_fw_feat(fw_features, "enabled",
+"fw-count-cache-flush-bcctr2,0,0"))
+   cp->character |= KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST;
cp->character_mask = KVM_PPC_CPU_CHAR_SPEC_BAR_ORI31 |
KVM_PPC_CPU_CHAR_BCCTRL_SERIALISED |
KVM_PPC_CPU_CHAR_L1D_FLUSH_ORI30 |
KVM_PPC_CPU_CHAR_L1D_FLUSH_TRIG2 |
KVM_PPC_CPU_CHAR_L1D_THREAD_PRIV |
-   KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS;
+   KVM_PPC_CPU_CHAR_COUNT_CACHE_DIS |
+   KVM_PPC_CPU_CHAR_BCCTR_FLUSH_ASSIST;
 
if (have_fw_feat(fw_features, "enabled",
 "speculation-policy-favor-security"))
@@ -2267,9 +2273,13 @@ static int kvmppc_get_cpu_char(struct kvm_ppc_cpu_char 
*cp)
if (!have_fw_feat(fw_features, "disabled",
  "needs-spec-barrier-for-bound-checks"))
cp->behaviour |= KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR;
+   if (have_fw_feat(fw_features, "enabled",
+"needs-count-cache-flush-on-context-switch"))
+   cp->behaviour |= KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE;
cp->behaviour_mask = KVM_PPC_CPU_BEHAV_FAVOUR_SECURITY |
KVM_PPC_CPU_BEHAV_L1D_FLUSH_PR |
-   KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR;
+   KVM_PPC_CPU_BEHAV_BNDS_CHK_SPEC_BAR |
+   KVM_PPC_CPU_BEHAV_FLUSH_COUNT_CACHE;
 
of_node_put(fw_features);
}
-- 
2.13.6



[PATCH v2] KVM: PPC: Book3S: Add KVM stat largepages_[2M/1G]

2019-02-18 Thread Suraj Jitindar Singh
This adds an entry to the kvm_stats_debugfs directory which provides the
number of large (2M or 1G) pages which have been used to set up the guest
mappings.

Signed-off-by: Suraj Jitindar Singh 
---

V1 -> V2:
- Rename debugfs files from num_[2M/1G]_pages to largepages_[2M/1G] to match
  x86

 arch/powerpc/include/asm/kvm_host.h|  2 ++
 arch/powerpc/kvm/book3s.c  |  3 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 15 ++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..cbb090010312 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ struct kvm_nested_guest;
 
 struct kvm_vm_stat {
ulong remote_tlb_flush;
+   ulong num_2M_pages;
+   ulong num_1G_pages;
 };
 
 struct kvm_vcpu_stat {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd1a677dd9e4..72fd7d44379b 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -39,6 +39,7 @@
 #include "book3s.h"
 #include "trace.h"
 
+#define VM_STAT(x) offsetof(struct kvm, stat.x), KVM_STAT_VM
 #define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU
 
 /* #define EXIT_DEBUG */
@@ -71,6 +72,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "pthru_all",   VCPU_STAT(pthru_all) },
{ "pthru_host",  VCPU_STAT(pthru_host) },
{ "pthru_bad_aff",   VCPU_STAT(pthru_bad_aff) },
+   { "largepages_2M",VM_STAT(num_2M_pages) },
+   { "largepages_1G",VM_STAT(num_1G_pages) },
{ NULL }
 };
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 1b821c6efdef..f55ef071883f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -403,8 +403,13 @@ void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, 
unsigned long gpa,
if (!memslot)
return;
}
-   if (shift)
+   if (shift) { /* 1GB or 2MB page */
page_size = 1ul << shift;
+   if (shift == PMD_SHIFT)
+   kvm->stat.num_2M_pages--;
+   else if (shift == PUD_SHIFT)
+   kvm->stat.num_1G_pages--;
+   }
 
gpa &= ~(page_size - 1);
hpa = old & PTE_RPN_MASK;
@@ -878,6 +883,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
put_page(page);
}
 
+   /* Increment number of large pages if we (successfully) inserted one */
+   if (!ret) {
+   if (level == 1)
+   kvm->stat.num_2M_pages++;
+   else if (level == 2)
+   kvm->stat.num_1G_pages++;
+   }
+
return ret;
 }
 
-- 
2.13.6



[PATCH] KVM: PPC: Book3S: Add KVM stat num_[2M/1G]_pages

2019-02-14 Thread Suraj Jitindar Singh
This adds an entry to the kvm_stats_debugfs directory which provides the
number of large (2M or 1G) pages which have been used to set up the guest
mappings.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_host.h|  2 ++
 arch/powerpc/kvm/book3s.c  |  3 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 15 ++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..cbb090010312 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -99,6 +99,8 @@ struct kvm_nested_guest;
 
 struct kvm_vm_stat {
ulong remote_tlb_flush;
+   ulong num_2M_pages;
+   ulong num_1G_pages;
 };
 
 struct kvm_vcpu_stat {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd1a677dd9e4..3cc5215bdb2e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -39,6 +39,7 @@
 #include "book3s.h"
 #include "trace.h"
 
+#define VM_STAT(x) offsetof(struct kvm, stat.x), KVM_STAT_VM
 #define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU
 
 /* #define EXIT_DEBUG */
@@ -71,6 +72,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "pthru_all",   VCPU_STAT(pthru_all) },
{ "pthru_host",  VCPU_STAT(pthru_host) },
{ "pthru_bad_aff",   VCPU_STAT(pthru_bad_aff) },
+   { "num_2M_pages",VM_STAT(num_2M_pages) },
+   { "num_1G_pages",VM_STAT(num_1G_pages) },
{ NULL }
 };
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 1b821c6efdef..f55ef071883f 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -403,8 +403,13 @@ void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, 
unsigned long gpa,
if (!memslot)
return;
}
-   if (shift)
+   if (shift) { /* 1GB or 2MB page */
page_size = 1ul << shift;
+   if (shift == PMD_SHIFT)
+   kvm->stat.num_2M_pages--;
+   else if (shift == PUD_SHIFT)
+   kvm->stat.num_1G_pages--;
+   }
 
gpa &= ~(page_size - 1);
hpa = old & PTE_RPN_MASK;
@@ -878,6 +883,14 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
put_page(page);
}
 
+   /* Increment number of large pages if we (successfully) inserted one */
+   if (!ret) {
+   if (level == 1)
+   kvm->stat.num_2M_pages++;
+   else if (level == 2)
+   kvm->stat.num_1G_pages++;
+   }
+
return ret;
 }
 
-- 
2.13.6



[PATCH] KVM: PPC: Book3S HV: Optimise mmio emulation for devices on FAST_MMIO_BUS

2019-02-06 Thread Suraj Jitindar Singh
Devices on the KVM_FAST_MMIO_BUS by definition have length zero and are
thus used for notification purposes rather than data transfer. For
example eventfd for virtio devices.

This means that when emulating mmio instructions which target devices on
this bus we can immediately handle them and return without needing to load
the instruction from guest memory.

For now we restrict this to stores as this is the only use case at
present.

For a normal guest the effect is negligible; however, for a nested guest
we save on the order of 5us per access.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index bd2dcfbf00cd..be7bc070eae5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -442,6 +442,24 @@ int kvmppc_hv_emulate_mmio(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
u32 last_inst;
 
/*
+* Fast path - check if the guest physical address corresponds to a
+* device on the FAST_MMIO_BUS, if so we can avoid loading the
+* instruction all together, then we can just handle it and return.
+*/
+   if (is_store) {
+   int idx, ret;
+
+   idx = srcu_read_lock(&vcpu->kvm->srcu);
+   ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, (gpa_t) gpa, 0,
+  NULL);
+   srcu_read_unlock(&vcpu->kvm->srcu, idx);
+   if (!ret) {
+   kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
+   return RESUME_GUEST;
+   }
+   }
+
+   /*
 * If we fail, we just return to the guest and try executing it again.
 */
if (kvmppc_get_last_inst(vcpu, INST_GENERIC, &last_inst) !=
-- 
2.13.6



[PATCH V4 8/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest

2018-12-13 Thread Suraj Jitindar Singh
Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 9 -
 arch/powerpc/kvm/book3s_hv_nested.c| 5 -
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index da89d10e5886..8522b034a4b2 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -37,11 +37,10 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
int old_pid, old_lpid;
bool is_load = !!to;
 
-   /* Can't access quadrants 1 or 2 in non-HV mode */
-   if (kvmhv_on_pseries()) {
-   /* TODO h-call */
-   return -EPERM;
-   }
+   /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
+   if (kvmhv_on_pseries())
+   return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
+ __pa(to), __pa(from), n);
 
quadrant = 1;
if (!pid)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 5903175751b4..a9db12cbc0fa 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1284,11 +1284,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
}
 
/* passthrough of emulated MMIO case */
-   if (kvmhv_on_pseries()) {
-   pr_err("emulated MMIO passthrough?\n");
-   return -EINVAL;
-   }
-
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea, writing);
}
if (memslot->flags & KVM_MEM_READONLY) {
-- 
2.13.6



[PATCH V4 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-13 Thread Suraj Jitindar Singh
A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.
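
On the guest (L1) side the call ends up looking roughly as below (this
is how patch 8/8 wires it up): exactly one of the to/from buffers is
NULL to select the direction, and both are passed as L1 guest physical
addresses.

	rc = plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
				__pa(to), __pa(from), n);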

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/hvcall.h  |  1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  7 ++--
 arch/powerpc/kvm/book3s_hv.c   |  6 ++-
 arch/powerpc/kvm/book3s_hv_nested.c| 75 ++
 5 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 33a4fc891947..463c63a9fcf1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -335,6 +335,7 @@
 #define H_SET_PARTITION_TABLE  0xF800
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
+#define H_COPY_TOFROM_GUEST0xF80C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index ea94110bfde4..720483733bb2 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+   gva_t eaddr, void *to, void *from,
+   unsigned long n);
 extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
void *to, unsigned long n);
 extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
@@ -302,6 +305,7 @@ long kvmhv_nested_init(void);
 void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e1e3ef710bd0..da89d10e5886 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,9 +29,9 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-static unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
-   gva_t eaddr, void *to, void *from,
-   unsigned long n)
+unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+ gva_t eaddr, void *to, void *from,
+ unsigned long n)
 {
unsigned long quadrant, ret = n;
int old_pid, old_lpid;
@@ -82,6 +82,7 @@ static unsigned long __kvmhv_copy_tofrom_guest_radix(int 
lpid, int pid,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(__kvmhv_copy_tofrom_guest_radix);
 
 static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
  void *to, void *from, unsigned long n)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2280bc4778f5..bd07f9b7c5e8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -996,7 +996,11 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (nesting_enabled(vcpu->kvm))
ret = kvmhv_do_nested_tlbie(vcpu);
break;
-
+   case H_COPY_TOFROM_GUEST:
+   ret = H_FUNCTION;
+   if (nesting_enabled(vcpu->kvm))
+   ret = kvmhv_copy_tofrom_guest_nested(vcpu);
+   break;
default:
return RESUME_HOST;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 991f40ce4eea..5903175751b4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -462,6 +462,81 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Handle the H_COPY_TOFROM_GUEST hcall.
+ * r4 = L1 lpid of nested guest
+ * r5 = pid
+ * r6 = eaddr to access
+ * r7 = to buffer (L1 gpa)
+ * r8 = from buffer (L1 gpa)
+ * r9 = n bytes to copy
+ */
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
+{
+   struct kvm_nested_guest *gp;
+   int l1_lpid = kvmppc_get_gpr(vcpu, 4);
+   int pid = kvmppc_get_gpr(vcpu, 5);
+   gva_t

[PATCH V4 6/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest

2018-12-13 Thread Suraj Jitindar Singh
Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.

The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.

To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.
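
As a rough sketch of that completion step (the full hunk is not shown in
this excerpt; this assumes KVM's generic kvm_vcpu_write_guest() helper
and that 'gpr' holds the value just loaded):

	/* Sketch only: the destination is the L1 gpa saved earlier in
	 * nested_io_gpr, not an L1 register. */
	if (vcpu->arch.io_gpr == KVM_MMIO_REG_NESTED_GPR)
		kvm_vcpu_write_guest(vcpu, vcpu->arch.nested_io_gpr,
				     &gpr, sizeof(gpr));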

Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 43 ++-
 arch/powerpc/kvm/powerpc.c|  8 +++
 5 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 5883fcce7009..ea94110bfde4 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -311,7 +311,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu,
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
-long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
+long int kvmhv_nested_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fac6f631ed29..7a2483a139cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -793,6 +793,7 @@ struct kvm_vcpu_arch {
/* For support of nested guests */
struct kvm_nested_guest *nested;
u32 nested_vcpu_id;
+   gpa_t nested_io_gpr;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
@@ -827,6 +828,8 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x00c0
 #define KVM_MMIO_REG_VSX   0x0100
 #define KVM_MMIO_REG_VMX   0x0180
+#define KVM_MMIO_REG_NESTED_GPR0xffc0
+
 
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8a0921176a60..2280bc4778f5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -985,6 +985,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_set_gpr(vcpu, 3, 0);
vcpu->arch.hcall_needed = 0;
return -EINTR;
+   } else if (ret == H_TOO_HARD) {
+   kvmppc_set_gpr(vcpu, 3, 0);
+   vcpu->arch.hcall_needed = 0;
+   return RESUME_HOST;
}
break;
case H_TLB_INVALIDATE:
@@ -1336,7 +1340,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
return r;
 }
 
-static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
+static int kvmppc_handle_nested_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu)
 {
int r;
int srcu_idx;
@@ -1394,7 +1398,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
@@ -1404,7 +1408,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
 
@@ -4059,7 +4063,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
if (!nested)
r = kvmppc_handle_exit_hv(kvm_run, vcpu, current);
else
-   r = kvmppc_handle_nested_exit(vcpu);

[PATCH V4 5/8] KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants

2018-12-13 Thread Suraj Jitindar Singh
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.

This method of access, however, only works for L1 guests and will give
incorrect results for a nested guest.

We can, however, use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guest (and an L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.

At this stage there is no fallback method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95859c53a5cd..cb029fcab404 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -331,10 +331,17 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int r;
+   int r = -EINVAL;
 
vcpu->stat.st++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->store_to_eaddr)
+   r = vcpu->kvm->arch.kvm_ops->store_to_eaddr(vcpu, eaddr, ptr,
+   size);
+
+   if ((!r) || (r == -EAGAIN))
+   return r;
+
r = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
 XLATE_WRITE, &pte);
if (r < 0)
@@ -367,10 +374,17 @@ int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int rc;
+   int rc = -EINVAL;
 
vcpu->stat.ld++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->load_from_eaddr)
+   rc = vcpu->kvm->arch.kvm_ops->load_from_eaddr(vcpu, eaddr, ptr,
+ size);
+
+   if ((!rc) || (rc == -EAGAIN))
+   return rc;
+
rc = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
  XLATE_READ, &pte);
if (rc)
-- 
2.13.6



[PATCH V4 4/8] KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct

2018-12-13 Thread Suraj Jitindar Singh
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrants 1 and 2.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 
 arch/powerpc/kvm/book3s_hv.c   | 40 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9b89b1918dfc..159dd76700cb 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -326,6 +326,10 @@ struct kvmppc_ops {
unsigned long flags);
void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
int (*enable_nested)(struct kvm *kvm);
+   int (*load_from_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+  int size);
+   int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+ int size);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a56f8413758a..8a0921176a60 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5214,6 +5214,44 @@ static int kvmhv_enable_nested(struct kvm *kvm)
return 0;
 }
 
+static int kvmhv_load_from_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void 
*ptr,
+int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_from_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
+static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+   int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_to_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
 static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -5254,6 +5292,8 @@ static struct kvmppc_ops kvm_ops_hv = {
.get_rmmu_info = kvmhv_get_rmmu_info,
.set_smt_mode = kvmhv_set_smt_mode,
.enable_nested = kvmhv_enable_nested,
+   .load_from_eaddr = kvmhv_load_from_eaddr,
+   .store_to_eaddr = kvmhv_store_to_eaddr,
 };
 
 static int kvm_init_subcore_bitmap(void)
-- 
2.13.6



[PATCH V4 3/8] KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2

2018-12-13 Thread Suraj Jitindar Singh
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This then gives 4 possible quadrants: 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant | Hypervisor        | Guest
---------------------------------------------
         | EA[0:1] = 0b00    | EA[0:1] = 0b00
    0    | effLPID = 0       | effLPID = LPIDR
         | effPID  = PIDR    | effPID  = PIDR
---------------------------------------------
         | EA[0:1] = 0b01    |
    1    | effLPID = LPIDR   | Invalid Access
         | effPID  = PIDR    |
---------------------------------------------
         | EA[0:1] = 0b10    |
    2    | effLPID = LPIDR   | Invalid Access
         | effPID  = 0       |
---------------------------------------------
         | EA[0:1] = 0b11    | EA[0:1] = 0b11
    3    | effLPID = 0       | effLPID = LPIDR
         | effPID  = 0       | effPID  = 0
---------------------------------------------

In the Guest:
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host:
Quadrants 0 and 3 are used as above; however, the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.
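
As a rough illustration (not part of this patch), with the guest's lpid
loaded into LPIDR and the guest's pid into PIDR, the hypervisor can form
a quadrant-1 effective address for a guest EA as below; the helper name
is purely illustrative:

	/* Sketch only: EA[0:1] = 0b01 selects quadrant 1 (guest user space). */
	static unsigned long quadrant1_eaddr(unsigned long guest_eaddr)
	{
		/* keep the guest EA, clearing EA bits 0:11 (the top 12 bits) */
		unsigned long ea = guest_eaddr & ~(0xFFFUL << 52);

		return ea | (1UL << 62);	/* EA[0:1] = 0b01 */
	}

Loads and stores through such an address then translate with
effLPID = LPIDR and effPID = PIDR, i.e. through the guest's tables.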

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kernel/exceptions-64s.S   |  9 
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 97 ++
 arch/powerpc/mm/fault.c|  1 +
 4 files changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 09f8e9ba69bc..5883fcce7009 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,10 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+   void *to, unsigned long n);
+extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+ void *from, unsigned long n);
 extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
  struct kvmppc_pte *gpte, u64 root,
  u64 *pte_ret_p);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 89d32bb79d5e..db2691ff4c0b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -995,7 +995,16 @@ EXC_COMMON_BEGIN(h_data_storage_common)
bl  save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
+BEGIN_MMU_FTR_SECTION
+   ld  r4,PACA_EXGEN+EX_DAR(r13)
+   lwz r5,PACA_EXGEN+EX_DSISR(r13)
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
+   li  r5,SIGSEGV
+   bl  bad_page_fault
+MMU_FTR_SECTION_ELSE
bl  unknown

[PATCH V4 2/8] KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()

2018-12-13 Thread Suraj Jitindar Singh
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for whether the vcpu is running a nested guest.

Currently all nested guests must be radix but this may change in the
future.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 13 +
 arch/powerpc/kvm/book3s_hv_nested.c  |  1 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6d298145d564..7a9e472f2872 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -55,6 +55,7 @@ struct kvm_nested_guest {
cpumask_t need_tlb_flush;
cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
+   u8 radix;   /* is this nested guest radix */
 };
 
 /*
@@ -150,6 +151,18 @@ static inline bool kvm_is_radix(struct kvm *kvm)
return kvm->arch.radix;
 }
 
+static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
+{
+   bool radix;
+
+   if (vcpu->arch.nested)
+   radix = vcpu->arch.nested->radix;
+   else
+   radix = kvm_is_radix(vcpu->kvm);
+
+   return radix;
+}
+
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 401d2ecbebc5..4fca462e54c4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -480,6 +480,7 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm 
*kvm, unsigned int lpid)
if (shadow_lpid < 0)
goto out_free2;
gp->shadow_lpid = shadow_lpid;
+   gp->radix = 1;
 
memset(gp->prev_cpu, -1, sizeof(gp->prev_cpu));
 
-- 
2.13.6



[PATCH V4 1/8] KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines

2018-12-13 Thread Suraj Jitindar Singh
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in-kernel tce acceleration for vfio. However, this is
currently only available on a powernv machine, not on a pseries
machine.

Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2869a299c4ed..95859c53a5cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -496,6 +496,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;
/* Assume we're using HV mode when the HV module is loaded */
int hv_enabled = kvmppc_hv_ops ? 1 : 0;
+   int kvm_on_pseries = !cpu_has_feature(CPU_FTR_HVMODE);
 
if (kvm) {
/*
@@ -543,8 +544,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_SPAPR_TCE_64:
-   /* fallthrough */
+   r = 1;
+   break;
case KVM_CAP_SPAPR_TCE_VFIO:
+   r = !kvm_on_pseries;
+   break;
case KVM_CAP_PPC_RTAS:
case KVM_CAP_PPC_FIXUP_HCALL:
case KVM_CAP_PPC_ENABLE_HCALL:
-- 
2.13.6



[PATCH V4 0/8] KVM: PPC: Implement passthrough of emulated devices for nested guests

2018-12-13 Thread Suraj Jitindar Singh
This patch series allows for emulated devices to be passed through to nested
guests, irrespective of the level at which the device is being emulated.

Note that the emulated device must be using dma, not virtio.

For example, passing through an emulated e1000:

1. Emulate the device at L(n) for L(n+1)

qemu-system-ppc64 -netdev type=user,id=net0 -device e1000,netdev=net0

2. Assign the VFIO-PCI driver at L(n+1)

echo vfio-pci > /sys/bus/pci/devices/:00:00.0/driver_override
echo :00:00.0 > /sys/bus/pci/drivers/e1000/unbind
echo :00:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
chmod 666 /dev/vfio/0

3. Pass the device through from L(n+1) to L(n+2)

qemu-system-ppc64 -device vfio-pci,host=:00:00.0

4. L(n+2) can now access the device which will be emulated at L(n)

V2 -> V3:
1/8: None
2/8: None
3/8: None
4/8: None
5/8: None
6/8: Add if def to fix compilation for some platforms
7/8: None
8/8: None

Suraj Jitindar Singh (8):
  KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
  KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
  KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
  KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops
struct
  KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2
guest
  KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants
1 & 2
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3
guest

 arch/powerpc/include/asm/hvcall.h|   1 +
 arch/powerpc/include/asm/kvm_book3s.h|  10 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |  13 
 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   4 ++
 arch/powerpc/kernel/exceptions-64s.S |   9 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  97 ++
 arch/powerpc/kvm/book3s_hv.c |  58 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  | 114 +--
 arch/powerpc/kvm/powerpc.c   |  32 -
 arch/powerpc/mm/fault.c  |   1 +
 11 files changed, 327 insertions(+), 15 deletions(-)

-- 
2.13.6



[PATCH V3 8/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest

2018-12-13 Thread Suraj Jitindar Singh
Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 9 -
 arch/powerpc/kvm/book3s_hv_nested.c| 5 -
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index da89d10e5886..8522b034a4b2 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -37,11 +37,10 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
int old_pid, old_lpid;
bool is_load = !!to;
 
-   /* Can't access quadrants 1 or 2 in non-HV mode */
-   if (kvmhv_on_pseries()) {
-   /* TODO h-call */
-   return -EPERM;
-   }
+   /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
+   if (kvmhv_on_pseries())
+   return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
+ __pa(to), __pa(from), n);
 
quadrant = 1;
if (!pid)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 5903175751b4..a9db12cbc0fa 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1284,11 +1284,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
}
 
/* passthrough of emulated MMIO case */
-   if (kvmhv_on_pseries()) {
-   pr_err("emulated MMIO passthrough?\n");
-   return -EINVAL;
-   }
-
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea, writing);
}
if (memslot->flags & KVM_MEM_READONLY) {
-- 
2.13.6



[PATCH V3 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-13 Thread Suraj Jitindar Singh
A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/hvcall.h  |  1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  7 ++--
 arch/powerpc/kvm/book3s_hv.c   |  6 ++-
 arch/powerpc/kvm/book3s_hv_nested.c| 75 ++
 5 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 33a4fc891947..463c63a9fcf1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -335,6 +335,7 @@
 #define H_SET_PARTITION_TABLE  0xF800
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
+#define H_COPY_TOFROM_GUEST0xF80C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index ea94110bfde4..720483733bb2 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+   gva_t eaddr, void *to, void *from,
+   unsigned long n);
 extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
void *to, unsigned long n);
 extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
@@ -302,6 +305,7 @@ long kvmhv_nested_init(void);
 void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e1e3ef710bd0..da89d10e5886 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,9 +29,9 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-static unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
-   gva_t eaddr, void *to, void *from,
-   unsigned long n)
+unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+ gva_t eaddr, void *to, void *from,
+ unsigned long n)
 {
unsigned long quadrant, ret = n;
int old_pid, old_lpid;
@@ -82,6 +82,7 @@ static unsigned long __kvmhv_copy_tofrom_guest_radix(int 
lpid, int pid,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(__kvmhv_copy_tofrom_guest_radix);
 
 static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
  void *to, void *from, unsigned long n)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2280bc4778f5..bd07f9b7c5e8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -996,7 +996,11 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (nesting_enabled(vcpu->kvm))
ret = kvmhv_do_nested_tlbie(vcpu);
break;
-
+   case H_COPY_TOFROM_GUEST:
+   ret = H_FUNCTION;
+   if (nesting_enabled(vcpu->kvm))
+   ret = kvmhv_copy_tofrom_guest_nested(vcpu);
+   break;
default:
return RESUME_HOST;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 991f40ce4eea..5903175751b4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -462,6 +462,81 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Handle the H_COPY_TOFROM_GUEST hcall.
+ * r4 = L1 lpid of nested guest
+ * r5 = pid
+ * r6 = eaddr to access
+ * r7 = to buffer (L1 gpa)
+ * r8 = from buffer (L1 gpa)
+ * r9 = n bytes to copy
+ */
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
+{
+   struct kvm_nested_guest *gp;
+   int l1_lpid = kvmppc_get_gpr(vcpu, 4);
+   int pid = kvmppc_get_gpr(vcpu, 5);
+   gva_t

[PATCH V3 6/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest

2018-12-13 Thread Suraj Jitindar Singh
Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.

The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.

To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.

Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 43 ++-
 arch/powerpc/kvm/powerpc.c|  6 +
 5 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 5883fcce7009..ea94110bfde4 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -311,7 +311,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu,
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
-long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
+long int kvmhv_nested_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fac6f631ed29..7a2483a139cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -793,6 +793,7 @@ struct kvm_vcpu_arch {
/* For support of nested guests */
struct kvm_nested_guest *nested;
u32 nested_vcpu_id;
+   gpa_t nested_io_gpr;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
@@ -827,6 +828,8 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x00c0
 #define KVM_MMIO_REG_VSX   0x0100
 #define KVM_MMIO_REG_VMX   0x0180
+#define KVM_MMIO_REG_NESTED_GPR0xffc0
+
 
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8a0921176a60..2280bc4778f5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -985,6 +985,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_set_gpr(vcpu, 3, 0);
vcpu->arch.hcall_needed = 0;
return -EINTR;
+   } else if (ret == H_TOO_HARD) {
+   kvmppc_set_gpr(vcpu, 3, 0);
+   vcpu->arch.hcall_needed = 0;
+   return RESUME_HOST;
}
break;
case H_TLB_INVALIDATE:
@@ -1336,7 +1340,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
return r;
 }
 
-static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
+static int kvmppc_handle_nested_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu)
 {
int r;
int srcu_idx;
@@ -1394,7 +1398,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
@@ -1404,7 +1408,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
 
@@ -4059,7 +4063,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
if (!nested)
r = kvmppc_handle_exit_hv(kvm_run, vcpu, current);
else
-   r = kvmppc_handle_nested_exit(vcpu);
+ 

[PATCH V3 5/8] KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants

2018-12-13 Thread Suraj Jitindar Singh
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.

This method of access, however, only works for L1 guests and will give
incorrect results for a nested guest.

We can, however, use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guest (and an L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.

At this stage there is no fallback method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95859c53a5cd..cb029fcab404 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -331,10 +331,17 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int r;
+   int r = -EINVAL;
 
vcpu->stat.st++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->store_to_eaddr)
+   r = vcpu->kvm->arch.kvm_ops->store_to_eaddr(vcpu, eaddr, ptr,
+   size);
+
+   if ((!r) || (r == -EAGAIN))
+   return r;
+
r = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
 XLATE_WRITE, &pte);
if (r < 0)
@@ -367,10 +374,17 @@ int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int rc;
+   int rc = -EINVAL;
 
vcpu->stat.ld++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->load_from_eaddr)
+   rc = vcpu->kvm->arch.kvm_ops->load_from_eaddr(vcpu, eaddr, ptr,
+ size);
+
+   if ((!rc) || (rc == -EAGAIN))
+   return rc;
+
rc = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
  XLATE_READ, &pte);
if (rc)
-- 
2.13.6



[PATCH V3 4/8] KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct

2018-12-13 Thread Suraj Jitindar Singh
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrants 1 and 2.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 
 arch/powerpc/kvm/book3s_hv.c   | 40 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9b89b1918dfc..159dd76700cb 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -326,6 +326,10 @@ struct kvmppc_ops {
unsigned long flags);
void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
int (*enable_nested)(struct kvm *kvm);
+   int (*load_from_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+  int size);
+   int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+ int size);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a56f8413758a..8a0921176a60 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5214,6 +5214,44 @@ static int kvmhv_enable_nested(struct kvm *kvm)
return 0;
 }
 
+static int kvmhv_load_from_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void 
*ptr,
+int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_from_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
+static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+   int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_to_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
 static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -5254,6 +5292,8 @@ static struct kvmppc_ops kvm_ops_hv = {
.get_rmmu_info = kvmhv_get_rmmu_info,
.set_smt_mode = kvmhv_set_smt_mode,
.enable_nested = kvmhv_enable_nested,
+   .load_from_eaddr = kvmhv_load_from_eaddr,
+   .store_to_eaddr = kvmhv_store_to_eaddr,
 };
 
 static int kvm_init_subcore_bitmap(void)
-- 
2.13.6



[PATCH V3 3/8] KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2

2018-12-13 Thread Suraj Jitindar Singh
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This then gives 4 possible quadrants: 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant | Hypervisor        | Guest
---------------------------------------------
         | EA[0:1] = 0b00    | EA[0:1] = 0b00
    0    | effLPID = 0       | effLPID = LPIDR
         | effPID  = PIDR    | effPID  = PIDR
---------------------------------------------
         | EA[0:1] = 0b01    |
    1    | effLPID = LPIDR   | Invalid Access
         | effPID  = PIDR    |
---------------------------------------------
         | EA[0:1] = 0b10    |
    2    | effLPID = LPIDR   | Invalid Access
         | effPID  = 0       |
---------------------------------------------
         | EA[0:1] = 0b11    | EA[0:1] = 0b11
    3    | effLPID = 0       | effLPID = LPIDR
         | effPID  = 0       | effPID  = 0
---------------------------------------------

In the Guest:
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host:
Quadrants 0 and 3 are used as above; however, the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kernel/exceptions-64s.S   |  9 
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 97 ++
 arch/powerpc/mm/fault.c|  1 +
 4 files changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 09f8e9ba69bc..5883fcce7009 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,10 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+   void *to, unsigned long n);
+extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+ void *from, unsigned long n);
 extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
  struct kvmppc_pte *gpte, u64 root,
  u64 *pte_ret_p);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 89d32bb79d5e..db2691ff4c0b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -995,7 +995,16 @@ EXC_COMMON_BEGIN(h_data_storage_common)
bl  save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
+BEGIN_MMU_FTR_SECTION
+   ld  r4,PACA_EXGEN+EX_DAR(r13)
+   lwz r5,PACA_EXGEN+EX_DSISR(r13)
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
+   li  r5,SIGSEGV
+   bl  bad_page_fault
+MMU_FTR_SECTION_ELSE
bl  unknown

[PATCH V3 2/8] KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()

2018-12-13 Thread Suraj Jitindar Singh
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for whether the vcpu is running a nested guest.

Currently all nested guests must be radix but this may change in the
future.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 13 +
 arch/powerpc/kvm/book3s_hv_nested.c  |  1 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6d298145d564..7a9e472f2872 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -55,6 +55,7 @@ struct kvm_nested_guest {
cpumask_t need_tlb_flush;
cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
+   u8 radix;   /* is this nested guest radix */
 };
 
 /*
@@ -150,6 +151,18 @@ static inline bool kvm_is_radix(struct kvm *kvm)
return kvm->arch.radix;
 }
 
+static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
+{
+   bool radix;
+
+   if (vcpu->arch.nested)
+   radix = vcpu->arch.nested->radix;
+   else
+   radix = kvm_is_radix(vcpu->kvm);
+
+   return radix;
+}
+
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 401d2ecbebc5..4fca462e54c4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -480,6 +480,7 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm 
*kvm, unsigned int lpid)
if (shadow_lpid < 0)
goto out_free2;
gp->shadow_lpid = shadow_lpid;
+   gp->radix = 1;
 
memset(gp->prev_cpu, -1, sizeof(gp->prev_cpu));
 
-- 
2.13.6



[PATCH V3 1/8] KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines

2018-12-13 Thread Suraj Jitindar Singh
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in-kernel tce acceleration for vfio. However, this is
currently only available on a powernv machine, not on a pseries
machine.

Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2869a299c4ed..95859c53a5cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -496,6 +496,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;
/* Assume we're using HV mode when the HV module is loaded */
int hv_enabled = kvmppc_hv_ops ? 1 : 0;
+   int kvm_on_pseries = !cpu_has_feature(CPU_FTR_HVMODE);
 
if (kvm) {
/*
@@ -543,8 +544,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_SPAPR_TCE_64:
-   /* fallthrough */
+   r = 1;
+   break;
case KVM_CAP_SPAPR_TCE_VFIO:
+   r = !kvm_on_pseries;
+   break;
case KVM_CAP_PPC_RTAS:
case KVM_CAP_PPC_FIXUP_HCALL:
case KVM_CAP_PPC_ENABLE_HCALL:
-- 
2.13.6



[PATCH V3 0/8] KVM: PPC: Implement passthrough of emulated devices for nested guests

2018-12-13 Thread Suraj Jitindar Singh
This patch series allows for emulated devices to be passed through to nested
guests, irrespective of the level at which the device is being emulated.

Note that the emulated device must be using dma, not virtio.

For example, passing through an emulated e1000:

1. Emulate the device at L(n) for L(n+1)

qemu-system-ppc64 -netdev type=user,id=net0 -device e1000,netdev=net0

2. Assign the VFIO-PCI driver at L(n+1)

echo vfio-pci > /sys/bus/pci/devices/:00:00.0/driver_override
echo :00:00.0 > /sys/bus/pci/drivers/e1000/unbind
echo :00:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
chmod 666 /dev/vfio/0

3. Pass the device through from L(n+1) to L(n+2)

qemu-system-ppc64 -device vfio-pci,host=:00:00.0

4. L(n+2) can now access the device which will be emulated at L(n)

V2 -> V3:
1/8: None
2/8: None
3/8: None
4/8: None
5/8: None
6/8: None
7/8: Use guest physical address for the args in H_COPY_TOFROM_GUEST to
 match the comment.
8/8: None

Suraj Jitindar Singh (8):
  KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
  KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
  KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
  KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops
struct
  KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2
guest
  KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants
1 & 2
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3
guest

 arch/powerpc/include/asm/hvcall.h|   1 +
 arch/powerpc/include/asm/kvm_book3s.h|  10 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |  13 
 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   4 ++
 arch/powerpc/kernel/exceptions-64s.S |   9 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  97 ++
 arch/powerpc/kvm/book3s_hv.c |  58 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  | 114 +--
 arch/powerpc/kvm/powerpc.c   |  30 +++-
 arch/powerpc/mm/fault.c  |   1 +
 11 files changed, 325 insertions(+), 15 deletions(-)

-- 
2.13.6



Re: [PATCH V2 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-13 Thread Suraj Jitindar Singh
On Thu, 2018-12-13 at 16:24 +1100, Paul Mackerras wrote:
> On Mon, Dec 10, 2018 at 02:58:24PM +1100, Suraj Jitindar Singh wrote:
> > A guest cannot access quadrants 1 or 2 as this would result in an
> > exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used
> > by a
> > guest when it wants to perform an access to quadrants 1 or 2, for
> > example when it wants to access memory for one of its nested
> > guests.
> > 
> > Also provide an implementation for the kvm-hv module.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> 
> [snip]
> 
> >  /*
> > + * Handle the H_COPY_TOFROM_GUEST hcall.
> > + * r4 = L1 lpid of nested guest
> > + * r5 = pid
> > + * r6 = eaddr to access
> > + * r7 = to buffer (L1 gpa)
> > + * r8 = from buffer (L1 gpa)
> 
> Comment says these are GPAs...
> 
> > + * r9 = n bytes to copy
> > + */
> > +long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
> > +{
> > +   struct kvm_nested_guest *gp;
> > +   int l1_lpid = kvmppc_get_gpr(vcpu, 4);
> > +   int pid = kvmppc_get_gpr(vcpu, 5);
> > +   gva_t eaddr = kvmppc_get_gpr(vcpu, 6);
> > +   void *gp_to = (void *) kvmppc_get_gpr(vcpu, 7);
> > +   void *gp_from = (void *) kvmppc_get_gpr(vcpu, 8);
> > +   void *buf;
> > +   unsigned long n = kvmppc_get_gpr(vcpu, 9);
> > +   bool is_load = !!gp_to;
> > +   long rc;
> > +
> > +   if (gp_to && gp_from) /* One must be NULL to determine the
> > direction */
> > +   return H_PARAMETER;
> > +
> > +   if (eaddr & (0xFFFUL << 52))
> > +   return H_PARAMETER;
> > +
> > +   buf = kzalloc(n, GFP_KERNEL);
> > +   if (!buf)
> > +   return H_NO_MEM;
> > +
> > +   gp = kvmhv_get_nested(vcpu->kvm, l1_lpid, false);
> > +   if (!gp) {
> > +   rc = H_PARAMETER;
> > +   goto out_free;
> > +   }
> > +
> > +   mutex_lock(&gp->tlb_lock);
> > +
> > +   if (is_load) {
> > +   /* Load from the nested guest into our buffer */
> > +   rc = __kvmhv_copy_tofrom_guest_radix(gp-
> > >shadow_lpid, pid,
> > +eaddr, buf,
> > NULL, n);
> > +   if (rc)
> > +   goto not_found;
> > +
> > +   /* Write what was loaded into our buffer back to
> > the L1 guest */
> > +   rc = kvmppc_st(vcpu, (ulong *) &gp_to, n, buf,
> > true);
> 
> but using kvmppc_st implies that it is an EA (and in fact when you
> call it in the next patch you pass an EA).
> 
> It would be more like other hcalls to pass a GPA, meaning that you
> would use kvm_write_guest() here.  On the other hand, with the
> quadrant access, kvmppc_st() might well be faster than
> kvm_write_guest.
> 
> So you need to decide which it is and either fix the comment or
> change
> the code.

Let's stick with gpa for now then, for consistency, with room for
optimisation.
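
For reference, the gpa-based variant would replace the kvmppc_st() call
with something roughly like the following (assuming gp_to really holds
an L1 guest physical address), keeping the quadrant-based kvmppc_st()
path as a possible later optimisation:

	rc = kvm_write_guest(vcpu->kvm, (gpa_t) gp_to, buf, n);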

> 
> Paul.


[PATCH V2 8/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest

2018-12-09 Thread Suraj Jitindar Singh
Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 9 -
 arch/powerpc/kvm/book3s_hv_nested.c| 5 -
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index da89d10e5886..cf16e9d207a5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -37,11 +37,10 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
int old_pid, old_lpid;
bool is_load = !!to;
 
-   /* Can't access quadrants 1 or 2 in non-HV mode */
-   if (kvmhv_on_pseries()) {
-   /* TODO h-call */
-   return -EPERM;
-   }
+   /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
+   if (kvmhv_on_pseries())
+   return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
+ to, from, n);
 
quadrant = 1;
if (!pid)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index f54301fcfbe4..acde90eb56f7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1284,11 +1284,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
}
 
/* passthrough of emulated MMIO case */
-   if (kvmhv_on_pseries()) {
-   pr_err("emulated MMIO passthrough?\n");
-   return -EINVAL;
-   }
-
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea, writing);
}
if (memslot->flags & KVM_MEM_READONLY) {
-- 
2.13.6



[PATCH V2 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-09 Thread Suraj Jitindar Singh
A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/hvcall.h  |  1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  7 ++--
 arch/powerpc/kvm/book3s_hv.c   |  6 ++-
 arch/powerpc/kvm/book3s_hv_nested.c| 75 ++
 5 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 33a4fc891947..463c63a9fcf1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -335,6 +335,7 @@
 #define H_SET_PARTITION_TABLE  0xF800
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
+#define H_COPY_TOFROM_GUEST0xF80C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index ea94110bfde4..720483733bb2 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+   gva_t eaddr, void *to, void *from,
+   unsigned long n);
 extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
void *to, unsigned long n);
 extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
@@ -302,6 +305,7 @@ long kvmhv_nested_init(void);
 void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e1e3ef710bd0..da89d10e5886 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,9 +29,9 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-static unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
-   gva_t eaddr, void *to, void *from,
-   unsigned long n)
+unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+ gva_t eaddr, void *to, void *from,
+ unsigned long n)
 {
unsigned long quadrant, ret = n;
int old_pid, old_lpid;
@@ -82,6 +82,7 @@ static unsigned long __kvmhv_copy_tofrom_guest_radix(int 
lpid, int pid,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(__kvmhv_copy_tofrom_guest_radix);
 
 static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
  void *to, void *from, unsigned long n)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2280bc4778f5..bd07f9b7c5e8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -996,7 +996,11 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (nesting_enabled(vcpu->kvm))
ret = kvmhv_do_nested_tlbie(vcpu);
break;
-
+   case H_COPY_TOFROM_GUEST:
+   ret = H_FUNCTION;
+   if (nesting_enabled(vcpu->kvm))
+   ret = kvmhv_copy_tofrom_guest_nested(vcpu);
+   break;
default:
return RESUME_HOST;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 991f40ce4eea..f54301fcfbe4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -462,6 +462,81 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Handle the H_COPY_TOFROM_GUEST hcall.
+ * r4 = L1 lpid of nested guest
+ * r5 = pid
+ * r6 = eaddr to access
+ * r7 = to buffer (L1 gpa)
+ * r8 = from buffer (L1 gpa)
+ * r9 = n bytes to copy
+ */
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
+{
+   struct kvm_nested_guest *gp;
+   int l1_lpid = kvmppc_get_gpr(vcpu, 4);
+   int pid = kvmppc_get_gpr(vcpu, 5);
+   gva_t

[PATCH V2 6/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest

2018-12-09 Thread Suraj Jitindar Singh
Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.

The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.

To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.

Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.
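
The completion step can be pictured roughly as in the sketch below (simplified,
not the exact hunk in powerpc.c; the real code also masks the register-class
bits and accounts for L1 having a different endianness to the host):

    /* Sketch of the nested path in kvmppc_complete_mmio_load() */
    if (vcpu->arch.io_gpr >= KVM_MMIO_REG_NESTED_GPR) {
            /* nested_io_gpr holds the L1 gpa of the L2 gpr save slot */
            kvm_vcpu_write_guest(vcpu, vcpu->arch.nested_io_gpr,
                                 &gpr, sizeof(gpr));
            return;
    }
    kvmppc_set_gpr(vcpu, vcpu->arch.io_gpr, gpr);   /* normal (L1) case */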

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 43 ++-
 arch/powerpc/kvm/powerpc.c|  6 +
 5 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 5883fcce7009..ea94110bfde4 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -311,7 +311,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu,
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
-long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
+long int kvmhv_nested_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fac6f631ed29..7a2483a139cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -793,6 +793,7 @@ struct kvm_vcpu_arch {
/* For support of nested guests */
struct kvm_nested_guest *nested;
u32 nested_vcpu_id;
+   gpa_t nested_io_gpr;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
@@ -827,6 +828,8 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x00c0
 #define KVM_MMIO_REG_VSX   0x0100
 #define KVM_MMIO_REG_VMX   0x0180
+#define KVM_MMIO_REG_NESTED_GPR0xffc0
+
 
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8a0921176a60..2280bc4778f5 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -985,6 +985,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
kvmppc_set_gpr(vcpu, 3, 0);
vcpu->arch.hcall_needed = 0;
return -EINTR;
+   } else if (ret == H_TOO_HARD) {
+   kvmppc_set_gpr(vcpu, 3, 0);
+   vcpu->arch.hcall_needed = 0;
+   return RESUME_HOST;
}
break;
case H_TLB_INVALIDATE:
@@ -1336,7 +1340,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
return r;
 }
 
-static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
+static int kvmppc_handle_nested_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu)
 {
int r;
int srcu_idx;
@@ -1394,7 +1398,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
@@ -1404,7 +1408,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
 
@@ -4059,7 +4063,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
if (!nested)
r = kvmppc_handle_exit_hv(kvm_run, vcpu, current);
else
-   r = kvmppc_handle_nested_exit(vcpu);
+ 

[PATCH V2 5/8] KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants

2018-12-09 Thread Suraj Jitindar Singh
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.

This method of access however only works for L1 guests and will give the
incorrect results for a nested guest.

We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guest (and an L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.

At this stage there is no fall back method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95859c53a5cd..cb029fcab404 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -331,10 +331,17 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int r;
+   int r = -EINVAL;
 
vcpu->stat.st++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->store_to_eaddr)
+   r = vcpu->kvm->arch.kvm_ops->store_to_eaddr(vcpu, eaddr, ptr,
+   size);
+
+   if ((!r) || (r == -EAGAIN))
+   return r;
+
r = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
 XLATE_WRITE, &pte);
if (r < 0)
@@ -367,10 +374,17 @@ int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int rc;
+   int rc = -EINVAL;
 
vcpu->stat.ld++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->load_from_eaddr)
+   rc = vcpu->kvm->arch.kvm_ops->load_from_eaddr(vcpu, eaddr, ptr,
+ size);
+
+   if ((!rc) || (rc == -EAGAIN))
+   return rc;
+
rc = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
  XLATE_READ, &pte);
if (rc)
-- 
2.13.6



[PATCH V2 4/8] KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct

2018-12-09 Thread Suraj Jitindar Singh
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrant 1 and 2.
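
For reference, a generic caller consults the new ops along the lines of this
sketch (essentially what the kvmppc_ld()/kvmppc_st() patch later in this series
does):

    /* Sketch: prefer the implementation-specific accessor when available */
    int rc = -EINVAL;

    if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->load_from_eaddr)
            rc = vcpu->kvm->arch.kvm_ops->load_from_eaddr(vcpu, eaddr,
                                                          ptr, size);
    if (!rc || rc == -EAGAIN)
            return rc;
    /* otherwise fall back to the existing translation path */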

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 
 arch/powerpc/kvm/book3s_hv.c   | 40 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9b89b1918dfc..159dd76700cb 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -326,6 +326,10 @@ struct kvmppc_ops {
unsigned long flags);
void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
int (*enable_nested)(struct kvm *kvm);
+   int (*load_from_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+  int size);
+   int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+ int size);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a56f8413758a..8a0921176a60 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5214,6 +5214,44 @@ static int kvmhv_enable_nested(struct kvm *kvm)
return 0;
 }
 
+static int kvmhv_load_from_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void 
*ptr,
+int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_from_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
+static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+   int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_to_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
 static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -5254,6 +5292,8 @@ static struct kvmppc_ops kvm_ops_hv = {
.get_rmmu_info = kvmhv_get_rmmu_info,
.set_smt_mode = kvmhv_set_smt_mode,
.enable_nested = kvmhv_enable_nested,
+   .load_from_eaddr = kvmhv_load_from_eaddr,
+   .store_to_eaddr = kvmhv_store_to_eaddr,
 };
 
 static int kvm_init_subcore_bitmap(void)
-- 
2.13.6



[PATCH V2 3/8] KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2

2018-12-09 Thread Suraj Jitindar Singh
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant | Hypervisor       | Guest
------------------------------------------------
         | EA[0:1] = 0b00   | EA[0:1] = 0b00
    0    | effLPID = 0      | effLPID = LPIDR
         | effPID  = PIDR   | effPID  = PIDR
------------------------------------------------
         | EA[0:1] = 0b01   |
    1    | effLPID = LPIDR  | Invalid Access
         | effPID  = PIDR   |
------------------------------------------------
         | EA[0:1] = 0b10   |
    2    | effLPID = LPIDR  | Invalid Access
         | effPID  = 0      |
------------------------------------------------
         | EA[0:1] = 0b11   | EA[0:1] = 0b11
    3    | effLPID = 0      | effLPID = LPIDR
         | effPID  = 0      | effPID  = 0
------------------------------------------------

In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.
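
Concretely, a quadrant access boils down to something like the sketch below
(heavily simplified: the real code must disable interrupts and page faults,
restore the previous LPIDR/PIDR values and handle faults; guest_lpid,
guest_pid, buf, eaddr and n are assumed to be set up by the caller):

    /* Sketch of a quadrant-1 load (guest user space); quadrant 2 is the
     * same but with effPID = 0.
     */
    unsigned long q1_ea = eaddr | (1UL << 62);   /* EA[0:1] = 0b01 */
    unsigned long left;

    mtspr(SPRN_LPID, guest_lpid);    /* effLPID for the access */
    mtspr(SPRN_PID, guest_pid);      /* effPID for the access  */
    isync();
    left = raw_copy_from_user(buf, (void __user *)q1_ea, n);
    /* ...restore the old LPID/PID, isync(); 'left' is the number of
     * bytes that could not be copied.
     */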

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kernel/exceptions-64s.S   |  9 
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 97 ++
 arch/powerpc/mm/fault.c|  1 +
 4 files changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 09f8e9ba69bc..5883fcce7009 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,10 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+   void *to, unsigned long n);
+extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+ void *from, unsigned long n);
 extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
  struct kvmppc_pte *gpte, u64 root,
  u64 *pte_ret_p);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 89d32bb79d5e..db2691ff4c0b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -995,7 +995,16 @@ EXC_COMMON_BEGIN(h_data_storage_common)
bl  save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
+BEGIN_MMU_FTR_SECTION
+   ld  r4,PACA_EXGEN+EX_DAR(r13)
+   lwz r5,PACA_EXGEN+EX_DSISR(r13)
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
+   li  r5,SIGSEGV
+   bl  bad_page_fault
+MMU_FTR_SECTION_ELSE
bl  unknown

[PATCH V2 2/8] KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()

2018-12-09 Thread Suraj Jitindar Singh
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for if the vcpu is running a nested guest.

Currently all nested guests must be radix but this may change in the
future.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 13 +
 arch/powerpc/kvm/book3s_hv_nested.c  |  1 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6d298145d564..7a9e472f2872 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -55,6 +55,7 @@ struct kvm_nested_guest {
cpumask_t need_tlb_flush;
cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
+   u8 radix;   /* is this nested guest radix */
 };
 
 /*
@@ -150,6 +151,18 @@ static inline bool kvm_is_radix(struct kvm *kvm)
return kvm->arch.radix;
 }
 
+static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
+{
+   bool radix;
+
+   if (vcpu->arch.nested)
+   radix = vcpu->arch.nested->radix;
+   else
+   radix = kvm_is_radix(vcpu->kvm);
+
+   return radix;
+}
+
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 401d2ecbebc5..4fca462e54c4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -480,6 +480,7 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm 
*kvm, unsigned int lpid)
if (shadow_lpid < 0)
goto out_free2;
gp->shadow_lpid = shadow_lpid;
+   gp->radix = 1;
 
memset(gp->prev_cpu, -1, sizeof(gp->prev_cpu));
 
-- 
2.13.6



[PATCH V2 1/8] KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines

2018-12-09 Thread Suraj Jitindar Singh
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in kernel tce acceleration for vfio. However it is
currently the case that this is only available on a powernv machine,
not for a pseries machine.

Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2869a299c4ed..95859c53a5cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -496,6 +496,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;
/* Assume we're using HV mode when the HV module is loaded */
int hv_enabled = kvmppc_hv_ops ? 1 : 0;
+   int kvm_on_pseries = !cpu_has_feature(CPU_FTR_HVMODE);
 
if (kvm) {
/*
@@ -543,8 +544,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_SPAPR_TCE_64:
-   /* fallthrough */
+   r = 1;
+   break;
case KVM_CAP_SPAPR_TCE_VFIO:
+   r = !kvm_on_pseries;
+   break;
case KVM_CAP_PPC_RTAS:
case KVM_CAP_PPC_FIXUP_HCALL:
case KVM_CAP_PPC_ENABLE_HCALL:
-- 
2.13.6



[PATCH V2 0/8] KVM: PPC: Implement passthrough of emulated devices for nested guests

2018-12-09 Thread Suraj Jitindar Singh
This patch series allows for emulated devices to be passed through to nested
guests, irrespective of at which level the device is being emulated.

Note that the emulated device must be using dma, not virtio.

For example, passing through an emulated e1000:

1. Emulate the device at L(n) for L(n+1)

qemu-system-ppc64 -netdev type=user,id=net0 -device e1000,netdev=net0

2. Assign the VFIO-PCI driver at L(n+1)

echo vfio-pci > /sys/bus/pci/devices/:00:00.0/driver_override
echo :00:00.0 > /sys/bus/pci/drivers/e1000/unbind
echo :00:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
chmod 666 /dev/vfio/0

3. Pass the device through from L(n+1) to L(n+2)

qemu-system-ppc64 -device vfio-pci,host=:00:00.0

4. L(n+2) can now access the device which will be emulated at L(n)

V1 -> V2:
1/8: None
2/8: None
3/8: None
4/8: None
5/8: None
6/8: Account for L1 differing in endianness in kvmppc_complete_mmio_load()
7/8: None
8/8: None

Suraj Jitindar Singh (8):
  KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
  KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
  KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
  KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops
struct
  KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2
guest
  KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants
1 & 2
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3
guest

 arch/powerpc/include/asm/hvcall.h|   1 +
 arch/powerpc/include/asm/kvm_book3s.h|  10 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |  13 
 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   4 ++
 arch/powerpc/kernel/exceptions-64s.S |   9 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  97 ++
 arch/powerpc/kvm/book3s_hv.c |  58 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  | 114 +--
 arch/powerpc/kvm/powerpc.c   |  30 +++-
 arch/powerpc/mm/fault.c  |   1 +
 11 files changed, 325 insertions(+), 15 deletions(-)

-- 
2.13.6



Re: [PATCH 0/8] KVM: PPC: Implement passthrough of emulated devices for nested guests

2018-12-06 Thread Suraj Jitindar Singh
On Fri, 2018-12-07 at 14:43 +1100, Suraj Jitindar Singh wrote:
> This patch series allows for emulated devices to be passed through to
> nested
> guests, irrespective of at which level the device is being emulated.
> 
> Note that the emulated device must be using dma, not virtio.
> 
> For example, passing through an emulated e1000:
> 
> 1. Emulate the device at L(n) for L(n+1)
> 
> qemu-system-ppc64 -netdev type=user,id=net0 -device e1000,netdev=net0
> 
> 2. Assign the VFIO-PCI driver at L(n+1)
> 
> echo :00:00.0 > /sys/bus/pci/drivers/e1000/unbind
> echo :00:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
> chmod 666 /dev/vfio/0
> 
> 3. Pass the device through from L(n+1) to L(n+2)
> 
> qemu-system-ppc64 -device vfio-pci,host=:00:00.0
> 
> 4. L(n+2) can now access the device which will be emulated at L(n)

Note,
[PATCH] KVM: PPC: Book3S PR: Set hflag to indicate that POWER9 supports
1T segments
is not supposed to be part of this series

> 
> Suraj Jitindar Singh (8):
>   KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
>   KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
>   KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
>   KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops
> struct
>   KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
>   KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an
> L2
> guest
>   KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access
> quadrants
> 1 & 2
>   KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an
> L3
> guest
> 
>  arch/powerpc/include/asm/hvcall.h|   1 +
>  arch/powerpc/include/asm/kvm_book3s.h|  10 ++-
>  arch/powerpc/include/asm/kvm_book3s_64.h |  13 
>  arch/powerpc/include/asm/kvm_host.h  |   3 +
>  arch/powerpc/include/asm/kvm_ppc.h   |   4 ++
>  arch/powerpc/kernel/exceptions-64s.S |   9 +++
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   |  97
> ++
>  arch/powerpc/kvm/book3s_hv.c |  58 ++--
>  arch/powerpc/kvm/book3s_hv_nested.c  | 114
> +--
>  arch/powerpc/kvm/powerpc.c   |  28 +++-
>  arch/powerpc/mm/fault.c  |   1 +
>  11 files changed, 323 insertions(+), 15 deletions(-)
> 


[PATCH 8/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest

2018-12-06 Thread Suraj Jitindar Singh
Previously when a device was being emulated by an L1 guest for an L2
guest, that device couldn't then be passed through to an L3 guest. This
was because the L1 guest had no method for accessing L3 memory.

The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
passthrough can now be allowed.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 9 -
 arch/powerpc/kvm/book3s_hv_nested.c| 5 -
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index da89d10e5886..cf16e9d207a5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -37,11 +37,10 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
pid,
int old_pid, old_lpid;
bool is_load = !!to;
 
-   /* Can't access quadrants 1 or 2 in non-HV mode */
-   if (kvmhv_on_pseries()) {
-   /* TODO h-call */
-   return -EPERM;
-   }
+   /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
+   if (kvmhv_on_pseries())
+   return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
+ to, from, n);
 
quadrant = 1;
if (!pid)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index f54301fcfbe4..acde90eb56f7 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1284,11 +1284,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
}
 
/* passthrough of emulated MMIO case */
-   if (kvmhv_on_pseries()) {
-   pr_err("emulated MMIO passthrough?\n");
-   return -EINVAL;
-   }
-
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea, writing);
}
if (memslot->flags & KVM_MEM_READONLY) {
-- 
2.13.6



[PATCH 7/8] KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2

2018-12-06 Thread Suraj Jitindar Singh
A guest cannot access quadrants 1 or 2 as this would result in an
exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
guest when it wants to perform an access to quadrants 1 or 2, for
example when it wants to access memory for one of its nested guests.

Also provide an implementation for the kvm-hv module.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/hvcall.h  |  1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  7 ++--
 arch/powerpc/kvm/book3s_hv.c   |  6 ++-
 arch/powerpc/kvm/book3s_hv_nested.c| 75 ++
 5 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 33a4fc891947..463c63a9fcf1 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -335,6 +335,7 @@
 #define H_SET_PARTITION_TABLE  0xF800
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
+#define H_COPY_TOFROM_GUEST0xF80C
 
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index ea94110bfde4..720483733bb2 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+   gva_t eaddr, void *to, void *from,
+   unsigned long n);
 extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
void *to, unsigned long n);
 extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
@@ -302,6 +305,7 @@ long kvmhv_nested_init(void);
 void kvmhv_nested_exit(void);
 void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index e1e3ef710bd0..da89d10e5886 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,9 +29,9 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-static unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
-   gva_t eaddr, void *to, void *from,
-   unsigned long n)
+unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int pid,
+ gva_t eaddr, void *to, void *from,
+ unsigned long n)
 {
unsigned long quadrant, ret = n;
int old_pid, old_lpid;
@@ -82,6 +82,7 @@ static unsigned long __kvmhv_copy_tofrom_guest_radix(int 
lpid, int pid,
 
return ret;
 }
+EXPORT_SYMBOL_GPL(__kvmhv_copy_tofrom_guest_radix);
 
 static long kvmhv_copy_tofrom_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
  void *to, void *from, unsigned long n)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index e7233499e063..e2e15722584a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -995,7 +995,11 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (nesting_enabled(vcpu->kvm))
ret = kvmhv_do_nested_tlbie(vcpu);
break;
-
+   case H_COPY_TOFROM_GUEST:
+   ret = H_FUNCTION;
+   if (nesting_enabled(vcpu->kvm))
+   ret = kvmhv_copy_tofrom_guest_nested(vcpu);
+   break;
default:
return RESUME_HOST;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 991f40ce4eea..f54301fcfbe4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -462,6 +462,81 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
 }
 
 /*
+ * Handle the H_COPY_TOFROM_GUEST hcall.
+ * r4 = L1 lpid of nested guest
+ * r5 = pid
+ * r6 = eaddr to access
+ * r7 = to buffer (L1 gpa)
+ * r8 = from buffer (L1 gpa)
+ * r9 = n bytes to copy
+ */
+long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu)
+{
+   struct kvm_nested_guest *gp;
+   int l1_lpid = kvmppc_get_gpr(vcpu, 4);
+   int pid = kvmppc_get_gpr(vcpu, 5);
+   gva_t

[PATCH 6/8] KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest

2018-12-06 Thread Suraj Jitindar Singh
Allow for a device which is being emulated at L0 (the host) for an L1
guest to be passed through to a nested (L2) guest.

The existing kvmppc_hv_emulate_mmio function can be used here. The main
challenge is that for a load the result must be stored into the L2 gpr,
not an L1 gpr as would normally be the case after going out to qemu to
complete the operation. This presents a challenge as at this point the
L2 gpr state has been written back into L1 memory.

To work around this we store the address in L1 memory of the L2 gpr
where the result of the load is to be stored and use the new io_gpr
value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
which completion must be done when returning back into the kernel. Then
in kvmppc_complete_mmio_load() the resultant value is written into L1
memory at the location of the indicated L2 gpr.

Note that we don't currently let an L1 guest emulate a device for an L2
guest which is then passed through to an L3 guest.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h |  2 +-
 arch/powerpc/include/asm/kvm_host.h   |  3 +++
 arch/powerpc/kvm/book3s_hv.c  | 12 ++
 arch/powerpc/kvm/book3s_hv_nested.c   | 43 ++-
 arch/powerpc/kvm/powerpc.c|  4 
 5 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 5883fcce7009..ea94110bfde4 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -311,7 +311,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu,
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
 void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
   struct hv_guest_state *hr);
-long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
+long int kvmhv_nested_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index fac6f631ed29..7a2483a139cf 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -793,6 +793,7 @@ struct kvm_vcpu_arch {
/* For support of nested guests */
struct kvm_nested_guest *nested;
u32 nested_vcpu_id;
+   gpa_t nested_io_gpr;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
@@ -827,6 +828,8 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x00c0
 #define KVM_MMIO_REG_VSX   0x0100
 #define KVM_MMIO_REG_VMX   0x0180
+#define KVM_MMIO_REG_NESTED_GPR0xffc0
+
 
 #define __KVM_HAVE_ARCH_WQP
 #define __KVM_HAVE_CREATE_DEVICE
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6c8b4f632168..e7233499e063 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -984,6 +984,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (ret == H_INTERRUPT) {
kvmppc_set_gpr(vcpu, 3, 0);
return -EINTR;
+   } else if (ret == H_TOO_HARD) {
+   kvmppc_set_gpr(vcpu, 3, 0);
+   vcpu->arch.hcall_needed = 0;
+   return RESUME_HOST;
}
break;
case H_TLB_INVALIDATE:
@@ -1335,7 +1339,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
return r;
 }
 
-static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
+static int kvmppc_handle_nested_exit(struct kvm_run *run, struct kvm_vcpu 
*vcpu)
 {
int r;
int srcu_idx;
@@ -1393,7 +1397,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
 */
case BOOK3S_INTERRUPT_H_DATA_STORAGE:
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
@@ -1403,7 +1407,7 @@ static int kvmppc_handle_nested_exit(struct kvm_vcpu 
*vcpu)
if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = kvmhv_nested_page_fault(vcpu);
+   r = kvmhv_nested_page_fault(run, vcpu);
srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
break;
 
@@ -4058,7 +4062,7 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
if (!nested)
r = kvmppc_handle_exit_hv(kvm_run, vcpu, current);
else
-   r = kvmppc_handle_nested_exit(vcpu);
+

[PATCH 5/8] KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants

2018-12-06 Thread Suraj Jitindar Singh
The functions kvmppc_st and kvmppc_ld are used to access guest memory
from the host using a guest effective address. They do so by translating
through the process table to obtain a guest real address and then using
kvm_read_guest or kvm_write_guest to make the access with the guest real
address.

This method of access however only works for L1 guests and will give the
incorrect results for a nested guest.

We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
perform the access for a nested guest (and an L1 guest). So attempt this
method first and fall back to the old method if this fails and we aren't
running a nested guest.

At this stage there is no fall back method to perform the access for a
nested guest and this is left as a future improvement. For now we will
return to the nested guest and rely on the fact that a translation
should be faulted in before retrying the access.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 95859c53a5cd..cb029fcab404 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -331,10 +331,17 @@ int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int r;
+   int r = -EINVAL;
 
vcpu->stat.st++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->store_to_eaddr)
+   r = vcpu->kvm->arch.kvm_ops->store_to_eaddr(vcpu, eaddr, ptr,
+   size);
+
+   if ((!r) || (r == -EAGAIN))
+   return r;
+
r = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
 XLATE_WRITE, &pte);
if (r < 0)
@@ -367,10 +374,17 @@ int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int 
size, void *ptr,
 {
ulong mp_pa = vcpu->arch.magic_page_pa & KVM_PAM & PAGE_MASK;
struct kvmppc_pte pte;
-   int rc;
+   int rc = -EINVAL;
 
vcpu->stat.ld++;
 
+   if (vcpu->kvm->arch.kvm_ops && vcpu->kvm->arch.kvm_ops->load_from_eaddr)
+   rc = vcpu->kvm->arch.kvm_ops->load_from_eaddr(vcpu, eaddr, ptr,
+ size);
+
+   if ((!rc) || (rc == -EAGAIN))
+   return rc;
+
rc = kvmppc_xlate(vcpu, *eaddr, data ? XLATE_DATA : XLATE_INST,
  XLATE_READ, &pte);
if (rc)
-- 
2.13.6



[PATCH 4/8] KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct

2018-12-06 Thread Suraj Jitindar Singh
The kvmppc_ops struct is used to store function pointers to kvm
implementation specific functions.

Introduce two new functions load_from_eaddr and store_to_eaddr to be
used to load from and store to a guest effective address respectively.

Also implement these for the kvm-hv module. If we are using the radix
mmu then we can call the functions to access quadrant 1 and 2.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_ppc.h |  4 
 arch/powerpc/kvm/book3s_hv.c   | 40 ++
 2 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9b89b1918dfc..159dd76700cb 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -326,6 +326,10 @@ struct kvmppc_ops {
unsigned long flags);
void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
int (*enable_nested)(struct kvm *kvm);
+   int (*load_from_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+  int size);
+   int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+ int size);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d65b961661fb..6c8b4f632168 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -5213,6 +5213,44 @@ static int kvmhv_enable_nested(struct kvm *kvm)
return 0;
 }
 
+static int kvmhv_load_from_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void 
*ptr,
+int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_from_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
+static int kvmhv_store_to_eaddr(struct kvm_vcpu *vcpu, ulong *eaddr, void *ptr,
+   int size)
+{
+   int rc = -EINVAL;
+
+   if (kvmhv_vcpu_is_radix(vcpu)) {
+   rc = kvmhv_copy_to_guest_radix(vcpu, *eaddr, ptr, size);
+
+   if (rc > 0)
+   rc = -EINVAL;
+   }
+
+   /* For now quadrants are the only way to access nested guest memory */
+   if (rc && vcpu->arch.nested)
+   rc = -EAGAIN;
+
+   return rc;
+}
+
 static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -5253,6 +5291,8 @@ static struct kvmppc_ops kvm_ops_hv = {
.get_rmmu_info = kvmhv_get_rmmu_info,
.set_smt_mode = kvmhv_set_smt_mode,
.enable_nested = kvmhv_enable_nested,
+   .load_from_eaddr = kvmhv_load_from_eaddr,
+   .store_to_eaddr = kvmhv_store_to_eaddr,
 };
 
 static int kvm_init_subcore_bitmap(void)
-- 
2.13.6



[PATCH 3/8] KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2

2018-12-06 Thread Suraj Jitindar Singh
The POWER9 radix mmu has the concept of quadrants. The quadrant number
is the two high bits of the effective address and determines the fully
qualified address to be used for the translation. The fully qualified
address consists of the effective lpid, the effective pid and the
effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.

When accessing these quadrants the fully qualified address is obtained
as follows:

Quadrant | Hypervisor       | Guest
------------------------------------------------
         | EA[0:1] = 0b00   | EA[0:1] = 0b00
    0    | effLPID = 0      | effLPID = LPIDR
         | effPID  = PIDR   | effPID  = PIDR
------------------------------------------------
         | EA[0:1] = 0b01   |
    1    | effLPID = LPIDR  | Invalid Access
         | effPID  = PIDR   |
------------------------------------------------
         | EA[0:1] = 0b10   |
    2    | effLPID = LPIDR  | Invalid Access
         | effPID  = 0      |
------------------------------------------------
         | EA[0:1] = 0b11   | EA[0:1] = 0b11
    3    | effLPID = 0      | effLPID = LPIDR
         | effPID  = 0      | effPID  = 0
------------------------------------------------

In the Guest;
Quadrant 3 is normally used to address the operating system since this
uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
be switched.
Quadrant 0 is normally used to address user space since the effLPID and
effPID are taken from the corresponding registers.

In the Host;
Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
address the host.

Quadrants 1 and 2 can be used by the host to address guest memory using
a guest effective address. Since the effLPID comes from the LPID register,
the host loads the LPID of the guest it would like to access (and the
PID of the process) and can perform accesses to a guest effective
address.

This means quadrant 1 can be used to address the guest user space and
quadrant 2 can be used to address the guest operating system from the
hypervisor, using a guest effective address.

Access to the quadrants can cause a Hypervisor Data Storage Interrupt
(HDSI) due to being unable to perform partition scoped translation.
Previously this could only be generated from a guest and so the code
path expects us to take the KVM trampoline in the interrupt handler.
This is no longer the case so we modify the handler to call
bad_page_fault() to check if we were expecting this fault so we can
handle it gracefully and just return with an error code. In the hash mmu
case we still raise an unknown exception since quadrants aren't defined
for the hash mmu.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  4 ++
 arch/powerpc/kernel/exceptions-64s.S   |  9 
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 97 ++
 arch/powerpc/mm/fault.c|  1 +
 4 files changed, 111 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 09f8e9ba69bc..5883fcce7009 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,10 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern long kvmhv_copy_from_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+   void *to, unsigned long n);
+extern long kvmhv_copy_to_guest_radix(struct kvm_vcpu *vcpu, gva_t eaddr,
+ void *from, unsigned long n);
 extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
  struct kvmppc_pte *gpte, u64 root,
  u64 *pte_ret_p);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 89d32bb79d5e..db2691ff4c0b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -995,7 +995,16 @@ EXC_COMMON_BEGIN(h_data_storage_common)
bl  save_nvgprs
RECONCILE_IRQ_STATE(r10, r11)
addir3,r1,STACK_FRAME_OVERHEAD
+BEGIN_MMU_FTR_SECTION
+   ld  r4,PACA_EXGEN+EX_DAR(r13)
+   lwz r5,PACA_EXGEN+EX_DSISR(r13)
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
+   li  r5,SIGSEGV
+   bl  bad_page_fault
+MMU_FTR_SECTION_ELSE
bl  unknown

[PATCH 2/8] KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()

2018-12-06 Thread Suraj Jitindar Singh
There exists a function kvm_is_radix() which is used to determine if a
kvm instance is using the radix mmu. However this only applies to the
first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
be used to determine if the current execution context of the vcpu is
radix, accounting for if the vcpu is running a nested guest.

Currently all nested guests must be radix but this may change in the
future.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 13 +
 arch/powerpc/kvm/book3s_hv_nested.c  |  1 +
 2 files changed, 14 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 6d298145d564..7a9e472f2872 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -55,6 +55,7 @@ struct kvm_nested_guest {
cpumask_t need_tlb_flush;
cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
+   u8 radix;   /* is this nested guest radix */
 };
 
 /*
@@ -150,6 +151,18 @@ static inline bool kvm_is_radix(struct kvm *kvm)
return kvm->arch.radix;
 }
 
+static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
+{
+   bool radix;
+
+   if (vcpu->arch.nested)
+   radix = vcpu->arch.nested->radix;
+   else
+   radix = kvm_is_radix(vcpu->kvm);
+
+   return radix;
+}
+
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 401d2ecbebc5..4fca462e54c4 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -480,6 +480,7 @@ struct kvm_nested_guest *kvmhv_alloc_nested(struct kvm 
*kvm, unsigned int lpid)
if (shadow_lpid < 0)
goto out_free2;
gp->shadow_lpid = shadow_lpid;
+   gp->radix = 1;
 
memset(gp->prev_cpu, -1, sizeof(gp->prev_cpu));
 
-- 
2.13.6



[PATCH 1/8] KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines

2018-12-06 Thread Suraj Jitindar Singh
The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
availability of in kernel tce acceleration for vfio. However it is
currently the case that this is only available on a powernv machine,
not for a pseries machine.

Thus make this capability dependent on having the cpu feature
CPU_FTR_HVMODE.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/powerpc.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 2869a299c4ed..95859c53a5cd 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -496,6 +496,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
int r;
/* Assume we're using HV mode when the HV module is loaded */
int hv_enabled = kvmppc_hv_ops ? 1 : 0;
+   int kvm_on_pseries = !cpu_has_feature(CPU_FTR_HVMODE);
 
if (kvm) {
/*
@@ -543,8 +544,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 #ifdef CONFIG_PPC_BOOK3S_64
case KVM_CAP_SPAPR_TCE:
case KVM_CAP_SPAPR_TCE_64:
-   /* fallthrough */
+   r = 1;
+   break;
case KVM_CAP_SPAPR_TCE_VFIO:
+   r = !kvm_on_pseries;
+   break;
case KVM_CAP_PPC_RTAS:
case KVM_CAP_PPC_FIXUP_HCALL:
case KVM_CAP_PPC_ENABLE_HCALL:
-- 
2.13.6



[PATCH] KVM: PPC: Book3S PR: Set hflag to indicate that POWER9 supports 1T segments

2018-12-06 Thread Suraj Jitindar Singh
When booting a kvm-pr guest on a POWER9 machine the following message is
observed:
"qemu-system-ppc64: KVM does not support 1TiB segments which guest expects"

This is because the guest is expecting to be able to use 1T segments
however we don't indicate support for it. This is because we don't set
the BOOK3S_HFLAG_MULTI_PGSIZE flag in the hflags in kvmppc_set_pvr_pr()
on POWER9.

POWER9 does indeed have support for 1T segments, so add a case for
POWER9 to the switch statement to ensure it is set.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_pr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 4efd65d9e828..82840160c606 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -587,6 +587,7 @@ void kvmppc_set_pvr_pr(struct kvm_vcpu *vcpu, u32 pvr)
case PVR_POWER8:
case PVR_POWER8E:
case PVR_POWER8NVL:
+   case PVR_POWER9:
vcpu->arch.hflags |= BOOK3S_HFLAG_MULTI_PGSIZE |
BOOK3S_HFLAG_NEW_TLBIE;
break;
-- 
2.13.6



[PATCH 0/8] KVM: PPC: Implement passthrough of emulated devices for nested guests

2018-12-06 Thread Suraj Jitindar Singh
This patch series allows for emulated devices to be passed through to nested
guests, irrespective of at which level the device is being emulated.

Note that the emulated device must be using dma, not virtio.

For example, passing through an emulated e1000:

1. Emulate the device at L(n) for L(n+1)

qemu-system-ppc64 -netdev type=user,id=net0 -device e1000,netdev=net0

2. Assign the VFIO-PCI driver at L(n+1)

echo :00:00.0 > /sys/bus/pci/drivers/e1000/unbind
echo :00:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
chmod 666 /dev/vfio/0

3. Pass the device through from L(n+1) to L(n+2)

qemu-system-ppc64 -device vfio-pci,host=:00:00.0

4. L(n+2) can now access the device which will be emulated at L(n)

Suraj Jitindar Singh (8):
  KVM: PPC: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
  KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix()
  KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2
  KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops
struct
  KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2
guest
  KVM: PPC: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants
1 & 2
  KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3
guest

 arch/powerpc/include/asm/hvcall.h|   1 +
 arch/powerpc/include/asm/kvm_book3s.h|  10 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h |  13 
 arch/powerpc/include/asm/kvm_host.h  |   3 +
 arch/powerpc/include/asm/kvm_ppc.h   |   4 ++
 arch/powerpc/kernel/exceptions-64s.S |   9 +++
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  97 ++
 arch/powerpc/kvm/book3s_hv.c |  58 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  | 114 +--
 arch/powerpc/kvm/powerpc.c   |  28 +++-
 arch/powerpc/mm/fault.c  |   1 +
 11 files changed, 323 insertions(+), 15 deletions(-)

-- 
2.13.6



Re: [PATCH] KVM: PPC: Book3S HV: NULL check before some freeing functions is not needed.

2018-12-02 Thread Suraj Jitindar Singh
On Sun, 2018-12-02 at 21:52 +0100, Thomas Meyer wrote:
> NULL check before some freeing functions is not needed.

Technically true, however I think a comment should be added then to
make it clearer to someone reading the code why this is ok.

See below.

Suraj.

> 
> Signed-off-by: Thomas Meyer 
> ---
> 
> diff -u -p a/arch/powerpc/kvm/book3s_hv_nested.c
> b/arch/powerpc/kvm/book3s_hv_nested.c
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -1252,8 +1252,7 @@ static long int __kvmhv_nested_page_faul
>   rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
>   ret = kvmppc_create_pte(kvm, gp->shadow_pgtable, pte, n_gpa,
> level,
>   mmu_seq, gp->shadow_lpid, rmapp,
> &n_rmap);
> - if (n_rmap)
> - kfree(n_rmap);
> + kfree(n_rmap);

e.g.
/* n_rmap set to NULL in kvmppc_create_pte if reference preserved */

>   if (ret == -EAGAIN)
>   ret = RESUME_GUEST; /* Let the guest try
> again */
>  


Re: [PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED

2018-11-13 Thread Suraj Jitindar Singh
 8280f033PVR
> 004e1202 VRSAVE 
>   SPRG0  SPRG1 c1b0  SPRG2
> 772f9565a280  SPRG3 
>   SPRG4  SPRG5   SPRG6
> 0000  SPRG7 0000
>   HSRR0  HSRR1 
>CFAR 0000
>    LPCR 03d40413
>PTCR    DAR 772f9539  DSISR
> 4200
> 
> Fix this by setting vcpu->arch.hcall_needed = 0 to indicate
> completion
> of H_ENTER_NESTED before we exit to L0 userspace.

Nice Catch :)

Reviewed-by: Suraj Jitindar Singh 

> 
> Cc: linuxppc-...@ozlabs.org
> Cc: David Gibson 
> Cc: Paul Mackerras 
> Cc: Suraj Jitindar Singh 
> Signed-off-by: Michael Roth 
> ---
>  arch/powerpc/kvm/book3s_hv.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index d65b961661fb..a56f8413758a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu
> *vcpu)
>   ret = kvmhv_enter_nested_guest(vcpu);
>   if (ret == H_INTERRUPT) {
>   kvmppc_set_gpr(vcpu, 3, 0);
> + vcpu->arch.hcall_needed = 0;
>   return -EINTR;
>   }
>   break;


[PATCH 1/2] powerpc/prom: Remove VLA in prom_check_platform_support()

2018-09-04 Thread Suraj Jitindar Singh
In prom_check_platform_support() we retrieve and parse the
"ibm,arch-vec-5-platform-support" property of the chosen node.
Currently we use a variable length array; to avoid this, use an array
of constant length 8 instead.

This property is used to indicate the supported options of vector 5
bytes 23-26 of the ibm,architecture.vec node. Each of these options
is a pair of bytes, thus for 4 options we have a max length of 8 bytes.
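
For example (values purely hypothetical), a full 8-byte property carries four
{option byte, value} pairs covering bytes 23-26 of vector 5:

    /* Hypothetical property contents: {index, value} pairs */
    u8 vec[8] = { 0x17, 0x80,    /* byte 23 */
                  0x18, 0x80,    /* byte 24 */
                  0x19, 0x80,    /* byte 25 */
                  0x1a, 0x80 };  /* byte 26 */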

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kernel/prom_init.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 9b38a2e5dd35..ce5fc03dc69f 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1131,12 +1131,15 @@ static void __init prom_check_platform_support(void)
   "ibm,arch-vec-5-platform-support");
if (prop_len > 1) {
int i;
-   u8 vec[prop_len];
+   u8 vec[8];
prom_debug("Found ibm,arch-vec-5-platform-support, len: %d\n",
   prop_len);
+   if (prop_len > sizeof(vec))
+   prom_printf("WARNING: ibm,arch-vec-5-platform-support 
longer "\
+   " than expected (len: %d)\n", prop_len);
prom_getprop(prom.chosen, "ibm,arch-vec-5-platform-support",
 &vec, sizeof(vec));
-   for (i = 0; i < prop_len; i += 2) {
+   for (i = 0; i < sizeof(vec); i += 2) {
prom_debug("%d: index = 0x%x val = 0x%x\n", i / 2
  , vec[i]
  , vec[i + 1]);
-- 
2.13.6



[PATCH 2/2] powerpc/pseries: Remove VLA from lparcfg_write()

2018-09-04 Thread Suraj Jitindar Singh
In lparcfg_write() we hard code kbuf_sz and then use it as the length
of kbuf, creating a variable length array. Since we're hard coding the
length anyway, just define the array with that constant length and
remove the need for kbuf_sz, thus removing the variable length array.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/platforms/pseries/lparcfg.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lparcfg.c 
b/arch/powerpc/platforms/pseries/lparcfg.c
index 7c872dc01bdb..8bd590af488a 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -585,8 +585,7 @@ static ssize_t update_mpp(u64 *entitlement, u8 *weight)
 static ssize_t lparcfg_write(struct file *file, const char __user * buf,
 size_t count, loff_t * off)
 {
-   int kbuf_sz = 64;
-   char kbuf[kbuf_sz];
+   char kbuf[64];
char *tmp;
u64 new_entitled, *new_entitled_ptr = &new_entitled;
u8 new_weight, *new_weight_ptr = &new_weight;
@@ -595,7 +594,7 @@ static ssize_t lparcfg_write(struct file *file, const char 
__user * buf,
if (!firmware_has_feature(FW_FEATURE_SPLPAR))
return -EINVAL;
 
-   if (count > kbuf_sz)
+   if (count > sizeof(kbuf))
return -EINVAL;
 
if (copy_from_user(kbuf, buf, count))
-- 
2.13.6



[PATCH 0/2] Remove Variable Length Arrays from powerpc code

2018-09-04 Thread Suraj Jitindar Singh
This patch series removes two Variable Length Arrays (VLAs) from
the powerpc code.

Series based on v4.19-rc2

Suraj Jitindar Singh (2):
  powerpc/prom: Remove VLA in prom_check_platform_support()
  powerpc/pseries: Remove VLA from lparcfg_write()

 arch/powerpc/kernel/prom_init.c  | 7 +--
 arch/powerpc/platforms/pseries/lparcfg.c | 5 ++---
 2 files changed, 7 insertions(+), 5 deletions(-)

-- 
2.13.6



Re: [PATCH] KVM: PPC: Book3S: Add capabilities for Meltdown/Spectre workarounds

2018-01-09 Thread Suraj Jitindar Singh
On Tue, 2018-01-09 at 23:44 +1100, Alexey Kardashevskiy wrote:
> On 09/01/18 19:39, Suraj Jitindar Singh wrote:
> > On Tue, 2018-01-09 at 15:48 +1100, Paul Mackerras wrote:
> > > This adds three new capabilities that give userspace information
> > > about
> > > the underlying machine's level of vulnerability to the Meltdown
> > > and
> > > Spectre attacks, and what instructions the hardware implements to
> > > assist software to work around the vulnerabilities.
> > > 
> > > Each capability is a tri-state, where 0 indicates that the
> > > machine is
> > > vulnerable and no workarounds are implemented, 1 indicates that the
> > > machine is vulnerable but workaround assist instructions are
> > > available, and 2 indicates that the machine is not vulnerable.
> > > 
> > > The capabilities are:
> > > 
> > > KVM_CAP_PPC_SAFE_CACHE reports the vulnerability of the machine
> > > to
> > > attacks based on using speculative loads to data in L1 cache
> > > which
> > > should not be addressable.  The workaround provided by hardware
> > > is an
> > > instruction to invalidate the entire L1 data cache.
> > > 
> > > KVM_CAP_PPC_SAFE_BOUNDS_CHECK reports the vulnerability of the
> > > machine
> > > to attacks based on using speculative loads behind mispredicted
> > > bounds
> > > checks.  The workaround provided by hardware is an instruction
> > > that
> > > acts as a speculation barrier.
> > > 
> > > KVM_CAP_PPC_SAFE_INDIRECT_BRANCH reports the vulnerability of the
> > > machine to attacks based on poisoning the indirect branch
> > > predictor.
> > > No workaround that requires software changes is provided; the
> > > current
> > > hardware fix is to prevent speculation past indirect branches.
> > > 
> > > Signed-off-by: Paul Mackerras 
> > > ---
> > > Note: This patch depends on the patch "powerpc/pseries: Add
> > > H_GET_CPU_CHARACTERISTICS flags & wrapper" by Michael Ellerman,
> > > available at http://patchwork.ozlabs.org/patch/856914/ .
> > > 
> > >  Documentation/virtual/kvm/api.txt |  36 +++
> > >  arch/powerpc/kvm/powerpc.c| 202
> > > ++
> > >  include/uapi/linux/kvm.h  |   3 +
> > >  3 files changed, 241 insertions(+)
> > > 
> > > diff --git a/Documentation/virtual/kvm/api.txt
> > > b/Documentation/virtual/kvm/api.txt
> > > index 57d3ee9..8d76260 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -4369,3 +4369,39 @@ Parameters: none
> > >  This capability indicates if the flic device will be able to
> > > get/set
> > > the
> > >  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute
> > > and
> > > allows
> > >  to discover this without having to create a flic device.
> > > +
> > > +8.14 KVM_CAP_PPC_SAFE_CACHE
> > > +
> > > +Architectures: ppc
> > > +
> > > +This capability gives information about the underlying machine's
> > > +vulnerability or otherwise to the Meltdown attack.  Its value is
> > > a
> > > +tristate, where 0 indicates the machine is vulnerable, 1
> > > indicates
> > > the
> > > +hardware is vulnerable but provides assistance to work around
> > > the
> > > +vulnerability (specifically by providing a fast L1 data cache
> > > flush
> > > +facility), and 2 indicates that the machine is not vulnerable.
> > > +
> > > +8.15 KVM_CAP_PPC_SAFE_BOUNDS_CHECK
> > > +
> > > +Architectures: ppc
> > > +
> > > +This capability gives information about the underlying machine's
> > > +vulnerability or otherwise to the bounds-check variant of the
> > > Spectre
> > > +attack.  Its value is a tristate, where 0 indicates the machine
> > > is
> > > +vulnerable, 1 indicates the hardware is vulnerable but provides
> > > +assistance to work around the vulnerability (specifically by
> > > providing
> > > +an instruction that acts as a speculation barrier), and 2
> > > indicates
> > > +that the machine is not vulnerable.
> > > +
> > > +8.16 KVM_CAP_PPC_SAFE_INDIRECT_BRANCH
> > > +
> > > +Architectures: ppc
> > > +
> > > +This capability gives information about the 
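
For reference, a hypothetical userspace sketch of how the three tri-state
capabilities discussed in this thread could be queried through the standard
KVM_CHECK_EXTENSION ioctl. The KVM_CAP_PPC_SAFE_* names come from the patch
itself, so this only builds against a uapi linux/kvm.h with the patch applied:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static const char *tristate(int v)
{
	switch (v) {
	case 0:  return "vulnerable";
	case 1:  return "vulnerable, workaround available";
	case 2:  return "not vulnerable";
	default: return "unknown";
	}
}

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);

	if (kvm < 0) {
		perror("/dev/kvm");
		return 1;
	}
	printf("L1D cache:       %s\n", tristate(ioctl(kvm, KVM_CHECK_EXTENSION,
						       KVM_CAP_PPC_SAFE_CACHE)));
	printf("bounds check:    %s\n", tristate(ioctl(kvm, KVM_CHECK_EXTENSION,
						       KVM_CAP_PPC_SAFE_BOUNDS_CHECK)));
	printf("indirect branch: %s\n", tristate(ioctl(kvm, KVM_CHECK_EXTENSION,
						       KVM_CAP_PPC_SAFE_INDIRECT_BRANCH)));
	close(kvm);
	return 0;
}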

Re: [PATCH v2] KVM: PPC: Book3S: Add capabilities for Meltdown/Spectre workarounds

2018-01-09 Thread Suraj Jitindar Singh
On Tue, 2018-01-09 at 20:21 +1100, Paul Mackerras wrote:
> This adds three new capabilities that give userspace information
> about
> the underlying machine's level of vulnerability to the Meltdown and
> Spectre attacks, and what instructions the hardware implements to
> assist software to work around the vulnerabilities.
> 
> Each capability is a tri-state, where 0 indicates that the machine is
> vulnerable and no workarounds are implement, 1 indicates that the
> machine is vulnerable but workaround assist instructions are
> available, and 2 indicates that the machine is not vulnerable.
> 
> The capabilities are:
> 
> KVM_CAP_PPC_SAFE_CACHE reports the vulnerability of the machine to
> attacks based on using speculative loads to data in L1 cache which
> should not be addressable.  The workaround provided by hardware is an
> instruction to invalidate the entire L1 data cache.
> 
> KVM_CAP_PPC_SAFE_BOUNDS_CHECK reports the vulnerability of the
> machine
> to attacks based on using speculative loads behind mispredicted
> bounds
> checks.  The workaround provided by hardware is an instruction that
> acts as a speculation barrier.
> 
> KVM_CAP_PPC_SAFE_INDIRECT_BRANCH reports the vulnerability of the
> machine to attacks based on poisoning the indirect branch predictor.
> No workaround that requires software changes is provided; the current
> hardware fix is to prevent speculation past indirect branches.
> 

Tested-by: Suraj Jitindar Singh 

> Signed-off-by: Paul Mackerras 
> ---
> Note: This patch depends on the patch "powerpc/pseries: Add
> H_GET_CPU_CHARACTERISTICS flags & wrapper" by Michael Ellerman,
> available at http://patchwork.ozlabs.org/patch/856914/ .
> 
>  Documentation/virtual/kvm/api.txt |  35 +++
>  arch/powerpc/kvm/powerpc.c| 202
> ++
>  include/uapi/linux/kvm.h  |   3 +
>  3 files changed, 240 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt
> b/Documentation/virtual/kvm/api.txt
> index 57d3ee9..7107e52 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4369,3 +4369,38 @@ Parameters: none
>  This capability indicates if the flic device will be able to get/set
> the
>  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and
> allows
>  to discover this without having to create a flic device.
> +
> +8.14 KVM_CAP_PPC_SAFE_CACHE
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the Meltdown attack.  Its value is a
> +tristate, where 0 indicates the machine is vulnerable, 1 indicates
> the
> +hardware is vulnerable but provides assistance to work around the
> +vulnerability (specifically by providing a fast L1 data cache flush
> +facility), and 2 indicates that the machine is not vulnerable.
> +
> +8.15 KVM_CAP_PPC_SAFE_BOUNDS_CHECK
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the bounds-check variant of the
> Spectre
> +attack.  Its value is a tristate, where 0 indicates the machine is
> +vulnerable, 1 indicates the hardware is vulnerable but provides
> +assistance to work around the vulnerability (specifically by
> providing
> +an instruction that acts as a speculation barrier), and 2 indicates
> +that the machine is not vulnerable.
> +
> +8.16 KVM_CAP_PPC_SAFE_INDIRECT_BRANCH
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the indirect branch variant of the
> Spectre
> +attack.  Its value is a tristate, where 0 indicates the machine is
> +vulnerable and 2 indicates that the machine is not vulnerable.
> +(1 would indicate the availability of a workaround that software
> +needs to implement, but there is currently no workaround that needs
> +software changes.)
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 1915e86..bef76f8 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -39,6 +39,10 @@
>  #include 
>  #include 
>  #include 
> +#ifdef CONFIG_PPC_PSERIES
> +#include 
> +#include 
> +#endif
>  
>  #include "timing.h"
>  #include "irq.h"
> @@ -488,6 +492,193 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>   module_put(kvm->arch.kvm_ops->owner);
>  }
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +/*
> + * These functions check whether the underlying hardware is safe
> + * against the Meltdown/Spectre attacks and whether it supplies
> + * instructions f

Re: [PATCH] KVM: PPC: Book3S: Add capabilities for Meltdown/Spectre workarounds

2018-01-09 Thread Suraj Jitindar Singh
On Tue, 2018-01-09 at 15:48 +1100, Paul Mackerras wrote:
> This adds three new capabilities that give userspace information
> about
> the underlying machine's level of vulnerability to the Meltdown and
> Spectre attacks, and what instructions the hardware implements to
> assist software to work around the vulnerabilities.
> 
> Each capability is a tri-state, where 0 indicates that the machine is
> vulnerable and no workarounds are implement, 1 indicates that the
> machine is vulnerable but workaround assist instructions are
> available, and 2 indicates that the machine is not vulnerable.
> 
> The capabilities are:
> 
> KVM_CAP_PPC_SAFE_CACHE reports the vulnerability of the machine to
> attacks based on using speculative loads to data in L1 cache which
> should not be addressable.  The workaround provided by hardware is an
> instruction to invalidate the entire L1 data cache.
> 
> KVM_CAP_PPC_SAFE_BOUNDS_CHECK reports the vulnerability of the
> machine
> to attacks based on using speculative loads behind mispredicted
> bounds
> checks.  The workaround provided by hardware is an instruction that
> acts as a speculation barrier.
> 
> KVM_CAP_PPC_SAFE_INDIRECT_BRANCH reports the vulnerability of the
> machine to attacks based on poisoning the indirect branch predictor.
> No workaround that requires software changes is provided; the current
> hardware fix is to prevent speculation past indirect branches.
> 
> Signed-off-by: Paul Mackerras 
> ---
> Note: This patch depends on the patch "powerpc/pseries: Add
> H_GET_CPU_CHARACTERISTICS flags & wrapper" by Michael Ellerman,
> available at http://patchwork.ozlabs.org/patch/856914/ .
> 
>  Documentation/virtual/kvm/api.txt |  36 +++
>  arch/powerpc/kvm/powerpc.c| 202
> ++
>  include/uapi/linux/kvm.h  |   3 +
>  3 files changed, 241 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt
> b/Documentation/virtual/kvm/api.txt
> index 57d3ee9..8d76260 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4369,3 +4369,39 @@ Parameters: none
>  This capability indicates if the flic device will be able to get/set
> the
>  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and
> allows
>  to discover this without having to create a flic device.
> +
> +8.14 KVM_CAP_PPC_SAFE_CACHE
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the Meltdown attack.  Its value is a
> +tristate, where 0 indicates the machine is vulnerable, 1 indicates
> the
> +hardware is vulnerable but provides assistance to work around the
> +vulnerability (specifically by providing a fast L1 data cache flush
> +facility), and 2 indicates that the machine is not vulnerable.
> +
> +8.15 KVM_CAP_PPC_SAFE_BOUNDS_CHECK
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the bounds-check variant of the
> Spectre
> +attack.  Its value is a tristate, where 0 indicates the machine is
> +vulnerable, 1 indicates the hardware is vulnerable but provides
> +assistance to work around the vulnerability (specifically by
> providing
> +an instruction that acts as a speculation barrier), and 2 indicates
> +that the machine is not vulnerable.
> +
> +8.16 KVM_CAP_PPC_SAFE_INDIRECT_BRANCH
> +
> +Architectures: ppc
> +
> +This capability gives information about the underlying machine's
> +vulnerability or otherwise to the indirect branch variant of the
> Spectre
> +attack.  Its value is a tristate, where 0 indicates the machine is
> +vulnerable and 2 indicates that the machine is not vulnerable.
> +(1 would indicate the availability of a workaround that software
> +needs to implement, but there is currently no workaround that needs
> +software changes.)
> +
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 1915e86..58e863b 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -39,6 +39,10 @@
>  #include 
>  #include 
>  #include 
> +#ifdef CONFIG_PPC_PSERIES
> +#include 
> +#include 
> +#endif
>  
>  #include "timing.h"
>  #include "irq.h"
> @@ -488,6 +492,193 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>   module_put(kvm->arch.kvm_ops->owner);
>  }
>  
> +#ifdef CONFIG_PPC_BOOK3S_64
> +/*
> + * These functions check whether the underlying hardware is safe
> + * against the Meltdown/Spectre attacks and whether it supplies
> + * instructions for use in workarounds.  The information comes from
> + * firmware, either via the device tree on powernv platforms or
> + * from an hcall on pseries platforms.
> + *
> + * For check_safe_cache() and check_safe_bounds_check(), a return
> + * value of 0 means vulnerable, 1 means vulnerable but workaround
> + * instructions are provided, and 2 means not vulnerable (no
> workaround
> + * is needed).
> + * For check_safe_indirect_branch(), 0 means vulnerab

Re: [RFC PATCH 2/2] KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9

2018-01-02 Thread Suraj Jitindar Singh
On Fri, 2017-12-08 at 17:11 +1100, Paul Mackerras wrote:
> POWER9 has hardware bugs relating to transactional memory and thread
> reconfiguration (changes to hardware SMT mode).  Specifically, the
> core
> does not have enough storage to store a complete checkpoint of all
> the
> architected state for all four threads.  The DD2.2 version of POWER9
> includes hardware modifications designed to allow hypervisor software
> to implement workarounds for these problems.  This patch implements
> those workarounds in KVM code so that KVM guests see a full, working
> transactional memory implementation.
> 
> The problems center around the use of TM suspended state, where the
> CPU has a checkpointed state but execution is not transactional.  The
> workaround is to implement a "fake suspend" state, which looks to the
> guest like suspended state but the CPU does not store a checkpoint.
> In this state, any instruction that would cause a transition to
> transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
> checkpointed state (treclaim) causes a "soft patch" interrupt (vector
> 0x1500) to the hypervisor so that it can be emulated.  The trechkpt
> instruction also causes a soft patch interrupt.
> 
> On POWER9 DD2.2, we avoid returning to the guest in any state which
> would require a checkpoint to be present.  The trechkpt in the guest
> entry path which would normally create that checkpoint is replaced by
> either a transition to fake suspend state, if the guest is in suspend
> state, or a rollback to the pre-transactional state if the guest is
> in
> transactional state.  Fake suspend state is indicated by a flag in
> the
> PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only
> and
> reads back as 0.
> 
> On exit from the guest, if the guest is in fake suspend state, we
> still
> do the treclaim instruction as we would in real suspend state, in
> order
> to get into non-transactional state, but we do not save the resulting
> register state since there was no checkpoint.
> 
> Emulation of the instructions that cause a softpath interrupt is
> handled
> in two paths.  If the guest is in real suspend mode, we call
> kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
> transitioning to transactional state.  This is called before we do
> the treclaim in the guest exit path; because we haven't done
> treclaim,
> we can get back to the guest with the transaction still active.
> If the instruction is a case that kvmhv_p9_tm_emulation_early()
> doesn't
> handle, or if the guest is in fake suspend state, then we proceed to
> do the complete guest exit path and subsequently call
> kvmhv_p9_tm_emulation() in host context with the MMU on.  This
> handles all the cases including the cases that generate program
> interrupts (illegal instruction or TM Bad Thing) and facility
> unavailable interrupts.
> 
> The emulation is reasonably straightforward and is mostly concerned
> with checking for exception conditions and updating the state of
> registers such as MSR and CR0.  The treclaim emulation takes care to
> ensure that the TEXASR register gets updated as if it were the guest
> treclaim instruction that had done failure recording, not the
> treclaim
> done in hypervisor state in the guest exit path.
> 
> Signed-off-by: Paul Mackerras 
> 

With the following patch applied on top of the TM emulation code I was
able to get at least a basic test to run on the guest on real hardware.

[snip]

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index c7fe377ff6bc..adf2da6b2211 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -3049,6 +3049,7 @@ BEGIN_FTR_SECTION
li  r0, PSSCR_FAKE_SUSPEND
andcr3, r3, r0
mtspr   SPRN_PSSCR, r3
+   ld  r9, HSTATE_KVM_VCPU(r13)
b   1f
 2:
 END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_EMUL)
@@ -3273,8 +3274,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_EMUL)
b   9b  /* and return */
 10:stdur1, -PPC_MIN_STKFRM(r1)
/* guest is in transactional state, so simulate rollback */
+   mr  r3, r4
bl  kvmhv_emulate_tm_rollback
nop
+   ld  r4, HSTATE_KVM_VCPU(r13) /* our vcpu pointer has been
trashed */
addir1, r1, PPC_MIN_STKFRM
b   9b
 #endif


Re: [PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-14 Thread Suraj Jitindar Singh
On Wed, 2017-08-09 at 20:30 +1000, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> 
> > The host process table base is stored in the partition table by
> > calling
> > the function native_register_process_table(). Currently this just
> > sets
> > the entry in memory and is missing a proceeding cache invalidation
> > instruction. Any update to the partition table should be followed
> > by a
> > cache invalidation instruction specifying invalidation of the
> > caching of
> > any partition table entries (RIC = 2, PRS = 0).
> > 
> > We already have a function to update the partition table with the
> > required cache invalidation instructions -
> > mmu_partition_table_set_entry().
> > Update the native_register_process_table() function to call
> > mmu_partition_table_set_entry(), this ensures all appropriate
> > invalidation will be performed.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/mm/pgtable-radix.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/mm/pgtable-radix.c
> > b/arch/powerpc/mm/pgtable-radix.c
> > index 671a45d..1d5178f 100644
> > --- a/arch/powerpc/mm/pgtable-radix.c
> > +++ b/arch/powerpc/mm/pgtable-radix.c
> > @@ -33,7 +33,8 @@ static int native_register_process_table(unsigned
> > long base, unsigned long pg_sz
> >  {
> >     unsigned long patb1 = base | table_size | PATB_GR;
> >  
> > -   partition_tb->patb1 = cpu_to_be64(patb1);
> > +   mmu_partition_table_set_entry(0, be64_to_cpu(partition_tb-
> > >patb0),
> > +     patb1);
> 
> This is really a bit gross.
> 
> Can we agree on whether partition_tb is an array or not?

Well it is an array, it's just we only ever want the first element in
this function. That being said we might as well access it as an array
to make that clear.

> 
> How about ...
> 
> cheers
> 
> diff --git a/arch/powerpc/mm/pgtable-radix.c
> b/arch/powerpc/mm/pgtable-radix.c
> index c1185c8ecb22..5d8be076f8e5 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -28,9 +28,13 @@
>  static int native_register_process_table(unsigned long base,
> unsigned long pg_sz,
>  unsigned long table_size)
>  {
> -   unsigned long patb1 = base | table_size | PATB_GR;
> +   unsigned long patb0, patb1;
> +
> +   patb0 = be64_to_cpu(partition_tb[0].patb0);
> +   patb1 = base | table_size | PATB_GR;
> +
> +   mmu_partition_table_set_entry(0, patb0, patb1);
>  
> -   partition_tb->patb1 = cpu_to_be64(patb1);
> return 0;
>  }

Looks good :)


Re: [PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-03 Thread Suraj Jitindar Singh
On Fri, 2017-08-04 at 11:31 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2017-08-04 at 11:02 +1000, Suraj Jitindar Singh wrote:
> > 
> > I guess there's the possibility of:
> > [x] randomly crash
> > 
> > This is required to run a powernv kernel as a guest because we need
> > to
> > know when it's updated its process table location.
> 
> You mean in qemu full emu ? powernv kernels don't run as guest do
> they
> ?

For now they don't... :p

> 
> Cheers,
> Ben.
> 


Re: [PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-03 Thread Suraj Jitindar Singh
On Thu, 2017-08-03 at 17:35 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-08-03 at 16:30 +1000, Michael Ellerman wrote:
> > Suraj Jitindar Singh  writes:
> > 
> > > The host process table base is stored in the partition table by
> > > calling
> > > the function native_register_process_table(). Currently this just
> > > sets
> > > the entry in memory and is missing a proceeding cache
> > > invalidation
> > > instruction. Any update to the partition table should be followed
> > > by a
> > > cache invalidation instruction specifying invalidation of the
> > > caching of
> > > any partition table entries (RIC = 2, PRS = 0).
> > > 
> > > We already have a function to update the partition table with the
> > > required cache invalidation instructions -
> > > mmu_partition_table_set_entry().
> > > Update the native_register_process_table() function to call
> > > mmu_partition_table_set_entry(), this ensures all appropriate
> > > invalidation will be performed.
> > 
> > Without this patch the kernel will:
> >  [ ] work normally
> >  [ ] randomly crash
> >  [ ] catch fire
> 
> I think we get lucky because OPAL added a "flush the whole world" to
> opal_reinit_cpus() but this patch seems to improve general code
> "correctness".
> 
> Cheers,
> Ben.
> 

I guess there's the possibility of:
[x] randomly crash

This is required to run a powernv kernel as a guest because we need to
know when it's updated its process table location.


[PATCH] powerpc/mm: Invalidate partition table cache on host proc tbl base update

2017-08-02 Thread Suraj Jitindar Singh
The host process table base is stored in the partition table by calling
the function native_register_process_table(). Currently this just sets
the entry in memory and is missing a subsequent cache invalidation
instruction. Any update to the partition table should be followed by a
cache invalidation instruction specifying invalidation of the caching of
any partition table entries (RIC = 2, PRS = 0).

We already have a function to update the partition table with the
required cache invalidation instructions - mmu_partition_table_set_entry().
Update the native_register_process_table() function to call
mmu_partition_table_set_entry(), this ensures all appropriate
invalidation will be performed.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/mm/pgtable-radix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 671a45d..1d5178f 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -33,7 +33,8 @@ static int native_register_process_table(unsigned long base, 
unsigned long pg_sz
 {
unsigned long patb1 = base | table_size | PATB_GR;
 
-   partition_tb->patb1 = cpu_to_be64(patb1);
+   mmu_partition_table_set_entry(0, be64_to_cpu(partition_tb->patb0),
+ patb1);
return 0;
 }
 
-- 
2.9.4



Re: [PATCH v2] powerpc/mm: Check for _PAGE_PTE in *_devmap()

2017-07-27 Thread Suraj Jitindar Singh
On Fri, 2017-07-28 at 01:35 +1000, Oliver O'Halloran wrote:
> The ISA radix translation tree contains two different types of entry,
s/entry/entries
> directories and leaves. The formats of the two entries are different
> with the directory entries containing no spare bits for use by
> software.

Rather than saying the directory entries contain no spare bits, would
it be better to say something like: the devmap property only relates to
pte (leaf) entries and so we shouldn't perform the check on/should
always return false for page directory entries?

> As a result we need to ensure that the *_devmap() family of functions
> check fail for everything except leaf (PTE) entries.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
> "i'll just tweak the mbox before i sent it, what's the worst that can
> happen"
> *completely breaks KVM*
> "..."
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index d1da415..6bc6248 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -610,7 +610,9 @@ static inline pte_t pte_mkdevmap(pte_t pte)
>  
>  static inline int pte_devmap(pte_t pte)
>  {
> - return !!(pte_raw(pte) & cpu_to_be64(_PAGE_DEVMAP));
> + uint64_t mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
> +
> + return (pte_raw(pte) & mask) == mask;
>  }
>  
>  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)

Reviewed-by: Suraj Jitindar Singh 
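
To make the effect of the added _PAGE_PTE check concrete, a hypothetical,
self-contained illustration (the bit values below are made up; the real
definitions live in book3s/64/pgtable.h):

#include <stdio.h>
#include <stdint.h>

#define PAGE_PTE	0x1ull	/* illustrative value only */
#define PAGE_DEVMAP	0x2ull	/* illustrative value only */

/* Old check: anything with the DEVMAP bit set reports true, including
 * directory entries where that bit position means something else. */
static int devmap_old(uint64_t e)
{
	return !!(e & PAGE_DEVMAP);
}

/* New check: both PTE and DEVMAP must be set, so only leaf entries match. */
static int devmap_new(uint64_t e)
{
	uint64_t mask = PAGE_PTE | PAGE_DEVMAP;

	return (e & mask) == mask;
}

int main(void)
{
	uint64_t directory = PAGE_DEVMAP;		/* not a leaf entry */
	uint64_t leaf = PAGE_PTE | PAGE_DEVMAP;		/* devmap PTE */

	printf("old: dir=%d leaf=%d\n", devmap_old(directory), devmap_old(leaf));
	printf("new: dir=%d leaf=%d\n", devmap_new(directory), devmap_new(leaf));
	return 0;
}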


Re: KVM guests freeze under upstream kernel

2017-07-26 Thread Suraj Jitindar Singh
On Thu, 2017-07-27 at 13:14 +1000, Michael Ellerman wrote:
> jos...@linux.vnet.ibm.com writes:
> > On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com
> >  wrote:
> > > On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > > > 
> > > > Did you check the host kernel logs for any oops messages?
> > > 
> > > dmesg was clean but after sometime waiting (I forgot QEMU running
> > > in
> > > another terminal) I got the oops below (after rebooting the host
> > > I 
> > > couldn't reproduce it again).
> > > 
> > > Another test that I did was:
> > > Compile with transparent huge pages disabled: KVM works fine
> > > Compile with transparent huge pages enabled: doesn't work
> > >   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't
> > > work
> > > 
> > > Just out of my own curiosity I made this small change:
> > > 
> > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > b/arch/powerpc/include
> > > index c0737c8..f94a3b6 100644
> > > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > @@ -80,7 +80,7 @@
> > >  
> > >   #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software
> > > dirty
> > >   tracking 
> > >    #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special
> > > page */
> > >    -#define _PAGE_DEVMAP   _RPAGE_SW1 /* software:
> > > ZONE_DEVICE page */
> > >    +#define _PAGE_DEVMAP   _RPAGE_RSV3
> > > #define __HAVE_ARCH_PTE_DEVMAP
> > > 
> > > and it works. I chose _RPAGE_RSV3 because it uses the same value
> > > that
> > > x86 uses (0x0400UL) but I don't if it could have any
> > > side
> > > effect
> > > 
> > 
> > Does this change make any sense to you people?
> 
> No :)
> 
> I think it's just hiding the bug somehow. Presumably we have some
> code
> somewhere that is getting confused by _RPAGE_SW1 being set, or
> setting
> that bit incorrectly.

kernel BUG at 
/scratch/surajjs/linux/arch/powerpc/include/asm/book3s/64/radix.h:260!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 
NUMA 
PowerNV
Modules linked in:
CPU: 3 PID: 2050 Comm: qemu-system-ppc Not tainted 
4.13.0-rc2-1-g2f3013c-dirty #1
task: c00f1ebc task.stack: c00f1ec0
NIP: c0070fd4 LR: c00e2120 CTR: c00e20d0
REGS: c00f1ec036b0 TRAP: 0700   Not tainted  
(4.13.0-rc2-1-g2f3013c-dirty)
MSR: 9282b033 
  CR: 22244824  XER: 
CFAR: c0070e74 SOFTE: 1 
GPR00: 0009 c00f1ec03930 c1067400 19cf0a05 
GPR04: c000 050acf190f80 0005 0800 
GPR08: 0015 800f19cf0a05 c00f1eb64368 0009 
GPR12: 0009 cfd80f00 c00f1eca7a30 4000 
GPR16: 5f9f1780 40002000 7fff5fff 7fff879700a6 
GPR20: 8108 c110bce0 0f61 c00e20d0 
GPR24:  c00f1c7a6008 7fff6f60 7fff5fff 
GPR28: c00f19fd 0da0  c00f1ec03990 
NIP [c0070fd4] __find_linux_pte_or_hugepte+0x1d4/0x350
LR [c00e2120] kvm_unmap_radix+0x50/0x1d0
Call Trace:
[c00f1ec03930] [c00b2554] mark_page_dirty+0x34/0xa0 (unreliable)
[c00f1ec03970] [c00e2120] kvm_unmap_radix+0x50/0x1d0
[c00f1ec039c0] [c00dbea0] kvm_handle_hva_range+0x100/0x170
[c00f1ec03a30] [c00df43c] kvm_unmap_hva_range_hv+0x6c/0x80
[c00f1ec03a70] [c00c7588] kvm_unmap_hva_range+0x48/0x60
[c00f1ec03ab0] [c00bb77c] 
kvm_mmu_notifier_invalidate_range_start+0x8c/0x130
[c00f1ec03b10] [c0316f10] 
__mmu_notifier_invalidate_range_start+0xa0/0xf0
[c00f1ec03b60] [c02e95f0] change_protection+0x840/0xe20
[c00f1ec03cb0] [c0313050] change_prot_numa+0x50/0xd0
[c00f1ec03d00] [c0143f24] task_numa_work+0x2b4/0x3b0
[c00f1ec03dc0] [c0128738] task_work_run+0xf8/0x160
[c00f1ec03e00] [c001db94] do_notify_resume+0xe4/0xf0
[c00f1ec03e30] [c000b744] ret_from_except_lite+0x70/0x74
Instruction dump:
419e00ec 6000 78a70022 54a9403e 50a9c00e 54e3403e 50a9c42e 50e3c00e 
50e3c42e 792907c6 7d291b78 55270528 <0b07> 3ce04000 3c804000 78e707c6 
---[ end trace aecf406c356566bb ]---


The BUG_ON added was:

arch/powerpc/include/asm/book3s/64/radix.h:260:
258 static inline int radix__pmd_trans_huge(pmd_t pmd)
259 {
260 BUG_ON(pmd_val(pmd) & _PAGE_DEVMAP);
261 return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
262 }

> 
> cheers


Re: [PATCH V3 2/2] KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9

2017-05-31 Thread Suraj Jitindar Singh
On Mon, 2017-05-29 at 20:12 +1000, Paul Mackerras wrote:
> This allows userspace (e.g. QEMU) to enable large decrementer mode
> for
> the guest when running on a POWER9 host, by setting the LPCR_LD bit
> in
> the guest LPCR value.  With this, the guest exit code saves 64 bits
> of
> the guest DEC value on exit.  Other places that use the guest DEC
> value check the LPCR_LD bit in the guest LPCR value, and if it is
> set,
> omit the 32-bit sign extension that would otherwise be done.
> 
> This doesn't change the DEC emulation used by PR KVM because PR KVM
> is not supported on POWER9 yet.
> 
> This is partly based on an earlier patch by Oliver O'Halloran.
> 
> Signed-off-by: Paul Mackerras 

Tested with a hacked up qemu and upstream guest/host (with these
patches).

Tested-by: Suraj Jitindar Singh 

> ---
>  arch/powerpc/include/asm/kvm_host.h |  2 +-
>  arch/powerpc/kvm/book3s_hv.c|  6 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 29
> -
>  arch/powerpc/kvm/emulate.c  |  4 ++--
>  4 files changed, 33 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index 9c51ac4..3f879c8 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -579,7 +579,7 @@ struct kvm_vcpu_arch {
>   ulong mcsrr0;
>   ulong mcsrr1;
>   ulong mcsr;
> - u32 dec;
> + ulong dec;
>  #ifdef CONFIG_BOOKE
>   u32 decar;
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index 42b7a4f..9b2eb66 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1143,6 +1143,12 @@ static void kvmppc_set_lpcr(struct kvm_vcpu
> *vcpu, u64 new_lpcr,
>   mask = LPCR_DPFD | LPCR_ILE | LPCR_TC;
>   if (cpu_has_feature(CPU_FTR_ARCH_207S))
>   mask |= LPCR_AIL;
> + /*
> +  * On POWER9, allow userspace to enable large decrementer
> for the
> +  * guest, whether or not the host has it enabled.
> +  */
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + mask |= LPCR_LD;
>  
>   /* Broken 32-bit version of LPCR must not clear top bits */
>   if (preserve_top32)
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index e390b38..3c901b5 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -920,7 +920,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
>   mftbr7
>   subfr3,r7,r8
>   mtspr   SPRN_DEC,r3
> - stw r3,VCPU_DEC(r4)
> + std r3,VCPU_DEC(r4)
>  
>   ld  r5, VCPU_SPRG0(r4)
>   ld  r6, VCPU_SPRG1(r4)
> @@ -1032,7 +1032,13 @@ kvmppc_cede_reentry:   /* r4 =
> vcpu, r13 = paca */
>   li  r0, BOOK3S_INTERRUPT_EXTERNAL
>   bne cr1, 12f
>   mfspr   r0, SPRN_DEC
> - cmpwi   r0, 0
> +BEGIN_FTR_SECTION
> + /* On POWER9 check whether the guest has large decrementer
> enabled */
> + andis.  r8, r8, LPCR_LD@h
> + bne 15f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
> + extsw   r0, r0
> +15:  cmpdi   r0, 0
>   li  r0, BOOK3S_INTERRUPT_DECREMENTER
>   bge 5f
>  
> @@ -1459,12 +1465,18 @@ mc_cont:
>   mtspr   SPRN_SPURR,r4
>  
>   /* Save DEC */
> + ld  r3, HSTATE_KVM_VCORE(r13)
>   mfspr   r5,SPRN_DEC
>   mftbr6
> + /* On P9, if the guest has large decr enabled, don't sign
> extend */
> +BEGIN_FTR_SECTION
> + ld  r4, VCORE_LPCR(r3)
> + andis.  r4, r4, LPCR_LD@h
> + bne 16f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   extsw   r5,r5
> - add r5,r5,r6
> +16:  add r5,r5,r6
>   /* r5 is a guest timebase value here, convert to host TB */
> - ld  r3,HSTATE_KVM_VCORE(r13)
>   ld  r4,VCORE_TB_OFFSET(r3)
>   subfr5,r4,r5
>   std r5,VCPU_DEC_EXPIRES(r9)
> @@ -2376,8 +2388,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
>   mfspr   r3, SPRN_DEC
>   mfspr   r4, SPRN_HDEC
>   mftbr5
> +BEGIN_FTR_SECTION
> + /* On P9 check whether the guest has large decrementer mode
> enabled */
> + ld  r6, HSTATE_KVM_VCORE(r13)
> + ld  r6, VCORE_LPCR(r6)
> + andis.  r6, r6, LPCR_LD@h
> + bne 68f
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   extsw   r3, r3
> - EXTEND_HDEC(r4)
> +68:  EXTEND_HDEC(r4)
>   cmpdr3, r4
>   ble 67f
>   mtspr   SPRN_DEC, r4
> diff --git a/arch/powerpc/kvm/emulate
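
As an aside, the LPCR_LD / extsw logic in the patch can be illustrated with a
small self-contained sketch (a hypothetical helper, not the KVM assembly):

#include <stdio.h>
#include <stdint.h>

/* With the large decrementer enabled (LPCR_LD set) the guest DEC is treated
 * as a full 64-bit value; otherwise the hardware value is a 32-bit quantity
 * that must be sign-extended (the extsw) before it is compared against zero
 * or added to the timebase. */
static int64_t guest_dec_value(uint64_t raw_dec, int lpcr_ld)
{
	if (lpcr_ld)
		return (int64_t)raw_dec;		/* use all 64 bits */
	return (int64_t)(int32_t)raw_dec;		/* extsw-style sign extension */
}

int main(void)
{
	uint64_t raw = 0xfffffff0ull;	/* a small negative 32-bit DEC value */

	printf("LD=0: %lld\n", (long long)guest_dec_value(raw, 0));	/* -16 */
	printf("LD=1: %lld\n", (long long)guest_dec_value(raw, 1));	/* 4294967280 */
	return 0;
}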

[PATCH V3 2/2] powerpc: Update to new option-vector-5 format for CAS

2017-02-27 Thread Suraj Jitindar Singh
On POWER9 the ibm,client-architecture-support (CAS) negotiation process
has been updated to change how the host to guest negotiation is done for
the new hash/radix mmu as well as the nest mmu, process tables and guest
translation shootdown (GTSE).

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.

ibm,arch-vec-5-platform-support format:

index value pairs: <index> <val> ... <index> <val>

index: Option vector 5 byte number
val:   Some representation of supported values

Signed-off-by: Suraj Jitindar Singh 

---

V2 -> V3:
 - Check for new either with dynamic switching option in
   ibm,arch-vec-5-platform-support which is indicated by 0xC0 and is used
   to tell the guest it can choose either HASH or RADIX and is allowed to
   dynamically switch later via H_REGISTER_PROCESS_TABLE

V1 -> V2:
 - Fix error where whole byte was compared for mmu support instead of only the
   first two bytes
 - Break platform support parsing into multiple functions for clarity
 - Instead of printing WARNING: messages on old hypervisors change to a debug
   message
---
 arch/powerpc/include/asm/prom.h |  18 --
 arch/powerpc/kernel/prom_init.c | 121 ++--
 arch/powerpc/mm/init_64.c   |  36 ++--
 3 files changed, 159 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 8af2546..d1b240b 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -158,12 +158,18 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_ENCR0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
 #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
-#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix MMU supported */
-#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash MMU supported */
-#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no segmentation) */
-#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB or proc table */
-#define OV5_MMU_SLB0x1800  /* always use SLB */
-#define OV5_MMU_GTSE   0x1808  /* Guest translation shootdown */
+/* MMU Base Architecture */
+#define OV5_MMU_SUPPORT0x18C0  /* MMU Mode Support Mask */
+#define OV5_MMU_HASH   0x00/* Hash MMU Only */
+#define OV5_MMU_RADIX  0x40/* Radix MMU Only */
+#define OV5_MMU_EITHER 0x80/* Hash or Radix Supported */
+#define OV5_MMU_DYNAMIC0xC0/* Hash or Radix Can Switch 
Later */
+#define OV5_NMMU   0x1820  /* Nest MMU Available */
+/* Hash Table Extensions */
+#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment Tables Available */
+#define OV5_HASH_GTSE  0x1940  /* Guest Translation Shoot Down Avail */
+/* Radix Table Extensions */
+#define OV5_RADIX_GTSE 0x1A40  /* Guest Translation Shoot Down Avail */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02/* Linux is our OS */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 37b5a29..4110350 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -168,6 +168,14 @@ static unsigned long __initdata prom_tce_alloc_start;
 static unsigned long __initdata prom_tce_alloc_end;
 #endif
 
+static bool __initdata prom_radix_disable;
+
+struct platform_support {
+   bool hash_mmu;
+   bool radix_mmu;
+   bool radix_gtse;
+};
+
 /* Platforms codes are now obsolete in the kernel. Now only used within this
  * file and ultimately gone too. Feel free to change them if you need, they
  * are not shared with anything outside of this file anymore
@@ -626,6 +634,12 @@ static void __init early_cmdline_parse(void)
prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
 #endif
}
+
+   opt = strstr(prom_cmd_line, "disable_radix");
+   if (opt) {
+   prom_debug("Radix disabled from cmdline\n");
+   prom_radix_disable = true;
+   }
 }
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
@@ -693,8 +707,10 @@ struct option_vector5 {
__be16 reserved3;
u8 subprocessors;
u8 byte22;
-   u8 intarch;
+   u8 xive;
u8 mmu;
+   u8 hash_ext;
+   u8 radix_ext;
 } __packed;
 
 struct option_vector6 {
@@ -849,9 +865,10 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
.reserved2 = 0,
.reserved3 = 0,
.subprocessors = 1,
- 
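
A self-contained sketch of walking the ibm,arch-vec-5-platform-support
index/value byte pairs described above and deciding which MMU modes the
platform offers. This is an illustration only, not the actual prom_init.c
code; the 0x18 index and the 0x40/0x80/0xC0 values mirror the OV5_MMU_*
definitions in the patch:

#include <stdbool.h>
#include <stdio.h>

#define OV5_MMU_INDEX		0x18	/* option vector 5 byte holding MMU support */
#define MMU_SUPPORT_MASK	0xC0
#define MMU_RADIX_ONLY		0x40
#define MMU_EITHER		0x80
#define MMU_DYNAMIC		0xC0	/* either, may switch later */

struct platform_support {
	bool hash_mmu;
	bool radix_mmu;
};

static void parse_platform_support(const unsigned char *vec, int len,
				   struct platform_support *support)
{
	int i;

	for (i = 0; i + 1 < len; i += 2) {
		unsigned char index = vec[i];	/* option vector 5 byte number */
		unsigned char val = vec[i + 1];	/* supported values for that byte */

		if (index != OV5_MMU_INDEX)
			continue;		/* other bytes ignored in this sketch */

		switch (val & MMU_SUPPORT_MASK) {
		case MMU_RADIX_ONLY:
			support->radix_mmu = true;
			break;
		case MMU_EITHER:
		case MMU_DYNAMIC:
			support->hash_mmu = true;
			support->radix_mmu = true;
			break;
		default:			/* 0x00: hash only */
			support->hash_mmu = true;
		}
	}
}

int main(void)
{
	/* Example property: the MMU byte advertises "either, can switch
	 * later", plus a radix-GTSE pair this sketch does not interpret. */
	unsigned char prop[] = { 0x18, 0xC0, 0x1A, 0x40 };
	struct platform_support s = { false, false };

	parse_platform_support(prop, (int)sizeof(prop), &s);
	printf("hash: %d radix: %d\n", s.hash_mmu, s.radix_mmu);
	return 0;
}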

[PATCH V3 1/2] powerpc: Parse the command line before calling CAS

2017-02-27 Thread Suraj Jitindar Singh
On POWER9 the hypervisor requires the guest to decide whether it would
like to use a hash or radix mmu model at the time it calls
ibm,client-architecture-support (CAS) based on what the hypervisor has
said it's allowed to do. It is possible to disable radix by passing
"disable_radix" on the command line. The next patch will add support for
the new CAS format, thus we need to parse the command line before calling
CAS so we can correctly select which mmu we would like to use.

Signed-off-by: Suraj Jitindar Singh 
Reviewed-by: Paul Mackerras 

---

V1 -> V3:
 - Reword commit message for clarity. No functional change
---
 arch/powerpc/kernel/prom_init.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d3db1bc..37b5a29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2993,6 +2993,11 @@ unsigned long __init prom_init(unsigned long r3, 
unsigned long r4,
 */
prom_check_initrd(r3, r4);
 
+   /*
+* Do early parsing of command line
+*/
+   early_cmdline_parse();
+
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
/*
 * On pSeries, inform the firmware about our capabilities
@@ -3009,11 +3014,6 @@ unsigned long __init prom_init(unsigned long r3, 
unsigned long r4,
copy_and_flush(0, kbase, 0x100, 0);
 
/*
-* Do early parsing of command line
-*/
-   early_cmdline_parse();
-
-   /*
 * Initialize memory management within prom_init
 */
prom_init_mem();
-- 
2.5.5



Re: [PATCH v2] powerpc/powernv: add hdat attribute to sysfs

2017-02-23 Thread Suraj Jitindar Singh
On Fri, 2017-02-24 at 15:28 +1100, Matt Brown wrote:
> The HDAT data area is consumed by skiboot and turned into a device-
> tree.
> In some cases we would like to look directly at the HDAT, so this
> patch
> adds a sysfs node to allow it to be viewed.  This is not possible
> through
> /dev/mem as it is reserved memory which is stopped by the /dev/mem
> filter.
> 
> Signed-off-by: Matt Brown 
Your first patch, nice work! :)

See below.
> ---
> 
> Between v1 and v2 of the patch the following changes were made.
> Changelog:
>   - moved hdat code into opal-hdat.c
>   - added opal-hdat to the makefile
>   - changed struct and variable names from camelcase
> ---
>  arch/powerpc/include/asm/opal.h|  1 +
>  arch/powerpc/platforms/powernv/Makefile|  1 +
>  arch/powerpc/platforms/powernv/opal-hdat.c | 63
> ++
>  arch/powerpc/platforms/powernv/opal.c  |  2 +
>  4 files changed, 67 insertions(+)
>  create mode 100644 arch/powerpc/platforms/powernv/opal-hdat.c
> 
> diff --git a/arch/powerpc/include/asm/opal.h
> b/arch/powerpc/include/asm/opal.h
> index 5c7db0f..b26944e 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -277,6 +277,7 @@ extern int opal_async_comp_init(void);
>  extern int opal_sensor_init(void);
>  extern int opal_hmi_handler_init(void);
>  extern int opal_event_init(void);
> +extern void opal_hdat_sysfs_init(void);
>  
>  extern int opal_machine_check(struct pt_regs *regs);
>  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
> diff --git a/arch/powerpc/platforms/powernv/Makefile
> b/arch/powerpc/platforms/powernv/Makefile
> index b5d98cb..9a0c9d6 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -3,6 +3,7 @@ obj-y += opal-rtc.o opal-
> nvram.o opal-lpc.o opal-flash.o
>  obj-y+= rng.o opal-elog.o opal-dump.o opal-
> sysparam.o opal-sensor.o
>  obj-y+= opal-msglog.o opal-hmi.o opal-
> power.o opal-irqchip.o
>  obj-y+= opal-kmsg.o
> +obj-y+= opal-hdat.o
>  
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
>  obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o
> diff --git a/arch/powerpc/platforms/powernv/opal-hdat.c
> b/arch/powerpc/platforms/powernv/opal-hdat.c
> new file mode 100644
> index 000..bd305e0
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-hdat.c
> @@ -0,0 +1,63 @@
> +/*
> + * PowerNV OPAL in-memory console interface
> + *
> + * Copyright 2014 IBM Corp.
2014?
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
Check with someone maybe, but I thought we had to use V2.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct hdat_info {
> + char *base;
> + u64 size;
> +};
> +
> +static struct hdat_info hdat_inf;
> +
> +/* Read function for HDAT attribute in sysfs */
> +static ssize_t hdat_read(struct file *file, struct kobject *kobj,
I assume this is just misaligned in my mail client...
> +  struct bin_attribute *bin_attr, char *to,
> +  loff_t pos, size_t count)
> +{
> + if (!hdat_inf.base)
> + return -ENODEV;
> +
> + return memory_read_from_buffer(to, count, &pos,
> hdat_inf.base,
> + hdat_inf.size);
> +}
> +
> +
> +/* HDAT attribute for sysfs */
> +static struct bin_attribute hdat_attr = {
> + .attr = {.name = "hdat", .mode = 0444},
> + .read = hdat_read
> +};
> +
> +void __init opal_hdat_sysfs_init(void)
> +{
> + u64 hdat_addr[2];
> +
> + /* Check for the hdat-map prop in device-tree */
> + if (of_property_read_u64_array(opal_node, "hdat-map",
> hdat_addr, 2)) {
> + pr_debug("OPAL: Property hdat-map not found.\n");
> + return;
> + }
> +
> + /* Print out hdat-map values. [0]: base, [1]: size */
> + pr_debug("OPAL: HDAT Base address: %#llx\n", hdat_addr[0]);
> + pr_debug("OPAL: HDAT Size: %#llx\n", hdat_addr[1]);
> +
> + hdat_inf.base = phys_to_virt(hdat_addr[0]);
> + hdat_inf.size = hdat_addr[1];
> +
> + if (sysfs_create_bin_file(opal_kobj, &hdat_attr) != 0)
 
The "!= 0" is not required, this can be replaced with:
"if (sysfs_create_bin_file(opal_kobj, &hdat_attr))"
> + pr_debug("OPAL: sysfs file creation for HDAT
> failed");
> +
> +}
> diff --git a/arch/powerpc/platforms/powernv/opal.c
> b/arch/powerpc/platforms/powernv/opal.c
> index 2822935..cae3745 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -740,6 +740,8 @@
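
Assuming the attribute lands where the patch puts it (opal_kobj sits under
/sys/firmware, so the node would be /sys/firmware/opal/hdat), a hypothetical
userspace sketch to sanity-check the new file:

#include <stdio.h>

int main(void)
{
	/* Path assumed from the patch; adjust if the attribute moves. */
	FILE *f = fopen("/sys/firmware/opal/hdat", "rb");
	unsigned char buf[16];
	size_t n, i;

	if (!f) {
		perror("open hdat");
		return 1;
	}
	n = fread(buf, 1, sizeof(buf), f);
	for (i = 0; i < n; i++)
		printf("%02x ", buf[i]);
	printf("\n");
	fclose(f);
	return 0;
}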

[PATCH V2 2/2] arch/powerpc/CAS: Update to new option-vector-5 format for CAS

2017-02-23 Thread Suraj Jitindar Singh
On POWER9 the ibm,client-architecture-support (CAS) negotiation process
has been updated to change how the host to guest negotiation is done for
the new hash/radix mmu as well as the nest mmu, process tables and guest
translation shootdown (GTSE).

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.

ibm,arch-vec-5-platform-support format:

index value pairs: <index> <val> ... <index> <val>

index: Option vector 5 byte number
val:   Some representation of supported values

Signed-off-by: Suraj Jitindar Singh 

---

V1 -> V2:
 - Fix error where whole byte was compared for mmu support instead of only the
   first two bytes
 - Break platform support parsing into multiple functions for clarity
 - Instead of printing WARNING: messages on old hypervisors change to a debug
   message
---
 arch/powerpc/include/asm/prom.h |  17 --
 arch/powerpc/kernel/prom_init.c | 120 ++--
 arch/powerpc/mm/init_64.c   |  36 ++--
 3 files changed, 157 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 8af2546..d838b9d 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -158,12 +158,17 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_ENCR0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
 #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
-#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix MMU supported */
-#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash MMU supported */
-#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no segmentation) */
-#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB or proc table */
-#define OV5_MMU_SLB0x1800  /* always use SLB */
-#define OV5_MMU_GTSE   0x1808  /* Guest translation shootdown */
+/* MMU Base Architecture */
+#define OV5_MMU_SUPPORT0x18C0  /* MMU Mode Support Mask */
+#define OV5_MMU_HASH   0x00/* Hash MMU Only */
+#define OV5_MMU_RADIX  0x40/* Radix MMU Only */
+#define OV5_MMU_EITHER 0x80/* Hash or Radix Supported */
+#define OV5_NMMU   0x1820  /* Nest MMU Available */
+/* Hash Table Extensions */
+#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment Tables Available */
+#define OV5_HASH_GTSE  0x1940  /* Guest Translation Shoot Down Avail */
+/* Radix Table Extensions */
+#define OV5_RADIX_GTSE 0x1A40  /* Guest Translation Shoot Down Avail */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02/* Linux is our OS */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 37b5a29..08cd1b8 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -168,6 +168,14 @@ static unsigned long __initdata prom_tce_alloc_start;
 static unsigned long __initdata prom_tce_alloc_end;
 #endif
 
+static bool __initdata prom_radix_disable;
+
+struct platform_support {
+   bool hash_mmu;
+   bool radix_mmu;
+   bool radix_gtse;
+};
+
 /* Platforms codes are now obsolete in the kernel. Now only used within this
  * file and ultimately gone too. Feel free to change them if you need, they
  * are not shared with anything outside of this file anymore
@@ -626,6 +634,12 @@ static void __init early_cmdline_parse(void)
prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
 #endif
}
+
+   opt = strstr(prom_cmd_line, "disable_radix");
+   if (opt) {
+   prom_debug("Radix disabled from cmdline\n");
+   prom_radix_disable = true;
+   }
 }
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
@@ -693,8 +707,10 @@ struct option_vector5 {
__be16 reserved3;
u8 subprocessors;
u8 byte22;
-   u8 intarch;
+   u8 xive;
u8 mmu;
+   u8 hash_ext;
+   u8 radix_ext;
 } __packed;
 
 struct option_vector6 {
@@ -849,9 +865,10 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
.reserved2 = 0,
.reserved3 = 0,
.subprocessors = 1,
-   .intarch = 0,
-   .mmu = OV5_FEAT(OV5_MMU_RADIX_300) | OV5_FEAT(OV5_MMU_HASH_300) 
|
-   OV5_FEAT(OV5_MMU_PROC_TBL) | OV5_FEAT(OV5_MMU_GTSE),
+   .xive = 0,
+   .mmu = 0,
+   .hash_ext = 0,
+   .radix_ext = 0,
},
 
/* option vector 6: IBM PAPR hint

[PATCH V2 1/2] arch/powerpc/prom_init: Parse the command line before calling CAS

2017-02-23 Thread Suraj Jitindar Singh
On POWER9 the hypervisor requires the guest to decide whether it would
like to use a hash or radix mmu model at the time it calls
ibm,client-architecture-support (CAS) based on what the hypervisor has
said it's allowed to do. It is possible to disable radix by passing
"disable_radix" on the command line. The next patch will add support for
the new CAS format, thus we need to parse the command line before calling
CAS so we can correctly select which mmu we would like to use.

Signed-off-by: Suraj Jitindar Singh 
Reviewed-by: Paul Mackerras 

---

V1 -> V2:
 - Reword commit message for clarity. No functional change
---
 arch/powerpc/kernel/prom_init.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d3db1bc..37b5a29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2993,6 +2993,11 @@ unsigned long __init prom_init(unsigned long r3, 
unsigned long r4,
 */
prom_check_initrd(r3, r4);
 
+   /*
+* Do early parsing of command line
+*/
+   early_cmdline_parse();
+
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
/*
 * On pSeries, inform the firmware about our capabilities
@@ -3009,11 +3014,6 @@ unsigned long __init prom_init(unsigned long r3, 
unsigned long r4,
copy_and_flush(0, kbase, 0x100, 0);
 
/*
-* Do early parsing of command line
-*/
-   early_cmdline_parse();
-
-   /*
 * Initialize memory management within prom_init
 */
prom_init_mem();
-- 
2.5.5



Re: [RFC NO-MERGE 2/2] arch/powerpc/CAS: Update to new option-vector-5 format for CAS

2017-02-23 Thread Suraj Jitindar Singh
On Thu, 2017-02-23 at 15:44 +1100, Paul Mackerras wrote:
> On Tue, Feb 21, 2017 at 05:06:11PM +1100, Suraj Jitindar Singh wrote:
> > 
> > The CAS process has been updated to change how the host to guest
> Once again, explain CAS; perhaps "The ibm,client-architecture-support
> (CAS) negotiation process has been updated for POWER9 to ..."
> 
> > 
> > negotiation is done for the new hash/radix mmu as well as the nest
> > mmu,
> > process tables and guest translation shootdown (GTSE).
> > 
> > The host tells the guest which options it supports in
> > ibm,arch-vec-5-platform-support. The guest then chooses a subset of
> > these
> > to request in the CAS call and these are agreed to in the
> > ibm,architecture-vec-5 property of the chosen node.
> > 
> > Thus we read ibm,arch-vec-5-platform-support and make our selection
> > before
> > calling CAS. We then parse the ibm,architecture-vec-5 property of
> > the
> > chosen node to check whether we should run as hash or radix.
> > 
> > Signed-off-by: Suraj Jitindar Singh 
> > ---
> >  arch/powerpc/include/asm/prom.h | 16 ---
> >  arch/powerpc/kernel/prom_init.c | 99
> > +++--
> >  arch/powerpc/mm/init_64.c   | 31 ++---
> >  3 files changed, 130 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/prom.h
> > b/arch/powerpc/include/asm/prom.h
> > index 8af2546..19d2e84 100644
> > --- a/arch/powerpc/include/asm/prom.h
> > +++ b/arch/powerpc/include/asm/prom.h
> > @@ -158,12 +158,16 @@ struct of_drconf_cell {
> >  #define OV5_PFO_HW_ENCR0x1120  /* PFO
> > Encryption Accelerator */
> >  #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-
> > Processors supported */
> >  #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation
> > supported */
> > -#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix
> > MMU supported */
> > -#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash
> > MMU supported */
> > -#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no
> > segmentation) */
> > -#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB
> > or proc table */
> > -#define OV5_MMU_SLB0x1800  /* always use SLB
> > */
> > -#define OV5_MMU_GTSE   0x1808  /* Guest
> > translation shootdown */
> > +/* MMU Base Architecture */
> > +#define OV5_MMU_HASH_300   0x1800  /* ISA v3.00 Hash
> > MMU Only */
> This is actually legacy HPT as well as ISA v3.00 HPT.

True

> 
> > 
> > +#define OV5_MMU_RADIX_300  0x1840  /* ISA v3.00 Radix
> > MMU Only */
> > +#define OV5_MMU_EITHER_300 0x1880  /* ISA v3.00 Hash
> > or Radix Supported */
> I wonder if it would work better to have a define for the 2-bit field
> with subsidiary definitions for the field values.  Something like
> 
> #define OV5_MMU_SELECTION 0x18c0
> #define  OV5_MMU_HPT  0x00
> #define  OV5_MMU_RADIX0x40
> #define  OV5_MMU_EITHER   0x80

Yep that's clearer

> 
> > 
> > +#define OV5_NMMU   0x1820  /* Nest MMU
> > Available */
> > +/* Hash Table Extensions */
> > +#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment
> > Tables Available */
> > +#define OV5_HASH_GTSE  0x1940  /* Guest
> > Translation Shoot Down Avail */
> > +/* Radix Table Extensions */
> > +#define OV5_RADIX_GTSE 0x1A40  /* Guest
> > Translation Shoot Down Avail */
> >  
> >  /* Option Vector 6: IBM PAPR hints */
> >  #define OV6_LINUX  0x02/* Linux is our OS */
> > diff --git a/arch/powerpc/kernel/prom_init.c
> > b/arch/powerpc/kernel/prom_init.c
> > index 37b5a29..8272104 100644
> > --- a/arch/powerpc/kernel/prom_init.c
> > +++ b/arch/powerpc/kernel/prom_init.c
> > @@ -168,6 +168,8 @@ static unsigned long __initdata
> > prom_tce_alloc_start;
> >  static unsigned long __initdata prom_tce_alloc_end;
> >  #endif
> >  
> > +static bool __initdata prom_radix_disable;
> > +
> >  /* Platforms codes are now obsolete in the kernel. Now only used
> > within this
> >   * file and ultimately gone too. Feel free to change them if you
> > need, they
> >   * are not shared with anything outside of this file anymore
> > @@ -626,6 +628,12 @@ static void __init early_cmdline_parse(void)
> >     prom_memory_limit = ALIGN(prom_memory_limit,
> > 0x100);
> >  #endif
> >     }
> > +
> > +   opt = strstr(

[RFC NO-MERGE 2/2] arch/powerpc/CAS: Update to new option-vector-5 format for CAS

2017-02-20 Thread Suraj Jitindar Singh
The CAS process has been updated to change how the host to guest
negotiation is done for the new hash/radix mmu as well as the nest mmu,
process tables and guest translation shootdown (GTSE).

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.
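
For illustration only (this is not the patch's code), the selection step
boils down to something like the following, where radix_mmu, hash_mmu and
radix_gtse are flags derived from the ibm,arch-vec-5-platform-support bytes
and the field names follow the option vector 5 layout added below:

	/* Sketch: prefer radix when the platform offers it, else hash. */
	if (radix_mmu) {
		ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_RADIX_300);
		if (radix_gtse)
			ibm_architecture_vec.vec5.radix_ext = OV5_FEAT(OV5_RADIX_GTSE);
	} else if (hash_mmu) {
		ibm_architecture_vec.vec5.mmu = OV5_FEAT(OV5_MMU_HASH_300);
	}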

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/prom.h | 16 ---
 arch/powerpc/kernel/prom_init.c | 99 +++--
 arch/powerpc/mm/init_64.c   | 31 ++---
 3 files changed, 130 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 8af2546..19d2e84 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -158,12 +158,16 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_ENCR    0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
 #define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
-#define OV5_MMU_RADIX_300  0x1880  /* ISA v3.00 radix MMU supported */
-#define OV5_MMU_HASH_300   0x1840  /* ISA v3.00 hash MMU supported */
-#define OV5_MMU_SEGM_RADIX 0x1820  /* radix mode (no segmentation) */
-#define OV5_MMU_PROC_TBL   0x1810  /* hcall selects SLB or proc table */
-#define OV5_MMU_SLB        0x1800  /* always use SLB */
-#define OV5_MMU_GTSE   0x1808  /* Guest translation shootdown */
+/* MMU Base Architecture */
+#define OV5_MMU_HASH_300   0x1800  /* ISA v3.00 Hash MMU Only */
+#define OV5_MMU_RADIX_300  0x1840  /* ISA v3.00 Radix MMU Only */
+#define OV5_MMU_EITHER_300 0x1880  /* ISA v3.00 Hash or Radix Supported */
+#define OV5_NMMU   0x1820  /* Nest MMU Available */
+/* Hash Table Extensions */
+#define OV5_HASH_SEG_TBL   0x1980  /* In Memory Segment Tables Available */
+#define OV5_HASH_GTSE  0x1940  /* Guest Translation Shoot Down Avail */
+/* Radix Table Extensions */
+#define OV5_RADIX_GTSE 0x1A40  /* Guest Translation Shoot Down Avail */
 
 /* Option Vector 6: IBM PAPR hints */
 #define OV6_LINUX  0x02    /* Linux is our OS */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 37b5a29..8272104 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -168,6 +168,8 @@ static unsigned long __initdata prom_tce_alloc_start;
 static unsigned long __initdata prom_tce_alloc_end;
 #endif
 
+static bool __initdata prom_radix_disable;
+
 /* Platforms codes are now obsolete in the kernel. Now only used within this
  * file and ultimately gone too. Feel free to change them if you need, they
  * are not shared with anything outside of this file anymore
@@ -626,6 +628,12 @@ static void __init early_cmdline_parse(void)
prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
 #endif
}
+
+   opt = strstr(prom_cmd_line, "disable_radix");
+   if (opt) {
+   prom_debug("Radix disabled from cmdline\n");
+   prom_radix_disable = true;
+   }
 }
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
@@ -693,8 +701,10 @@ struct option_vector5 {
__be16 reserved3;
u8 subprocessors;
u8 byte22;
-   u8 intarch;
+   u8 xive;
u8 mmu;
+   u8 hash_ext;
+   u8 radix_ext;
 } __packed;
 
 struct option_vector6 {
@@ -849,9 +859,10 @@ struct ibm_arch_vec __cacheline_aligned ibm_architecture_vec = {
.reserved2 = 0,
.reserved3 = 0,
.subprocessors = 1,
-   .intarch = 0,
-   .mmu = OV5_FEAT(OV5_MMU_RADIX_300) | OV5_FEAT(OV5_MMU_HASH_300) |
-   OV5_FEAT(OV5_MMU_PROC_TBL) | OV5_FEAT(OV5_MMU_GTSE),
+   .xive = 0,
+   .mmu = 0,
+   .hash_ext = 0,
+   .radix_ext = 0,
},
 
/* option vector 6: IBM PAPR hints */
@@ -990,6 +1001,83 @@ static int __init prom_count_smt_threads(void)
 
 }
 
+static void __init prom_check_platform_support(void)
+{
+   int prop_len, i;
+   bool radix_gtse = false, radix_mmu = false, hash_mmu = false;
+
+   prop_len = prom_getproplen(prom.chosen,
+  "ibm,arch-vec-5-platform-support");
+   if (prop_len > 1) {
+   u8 val[prop_len];
+   prom_debug("Found ibm,arch-vec-5-platform-support, len: %d\n",
+  prop_len);
+   prom_getprop(prom.chosen, "ibm,arch-vec-5-platform-support

[RFC NO-MERGE 1/2] arch/powerpc/prom_init: Parse the command line before calling CAS

2017-02-20 Thread Suraj Jitindar Singh
CAS now requires the guest to tell the host whether it would like to use
a hash or radix mmu. It is possible to disable radix by passing
"disable_radix" on the command line. The next patch will add support for
the new CAS format, thus we need to parse the command line before calling
CAS so we can correctly represent which mmu we would like to use.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kernel/prom_init.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index d3db1bc..37b5a29 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -2993,6 +2993,11 @@ unsigned long __init prom_init(unsigned long r3, unsigned long r4,
 */
prom_check_initrd(r3, r4);
 
+   /*
+* Do early parsing of command line
+*/
+   early_cmdline_parse();
+
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
/*
 * On pSeries, inform the firmware about our capabilities
@@ -3009,11 +3014,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned long r4,
copy_and_flush(0, kbase, 0x100, 0);
 
/*
-* Do early parsing of command line
-*/
-   early_cmdline_parse();
-
-   /*
 * Initialize memory management within prom_init
 */
prom_init_mem();
-- 
2.5.5



Re: [PATCH] powerpc: Detect POWER9 architected mode

2017-02-16 Thread Suraj Jitindar Singh
On Fri, 2017-02-17 at 10:59 +1100, Russell Currey wrote:
> Signed-off-by: Russell Currey 
Tested-in-QEMU-by: Suraj Jitindar Singh 
> ---
>  arch/powerpc/kernel/cputable.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/cputable.c
> b/arch/powerpc/kernel/cputable.c
> index 6a82ef039c50..d23a54b09436 100644
> --- a/arch/powerpc/kernel/cputable.c
> +++ b/arch/powerpc/kernel/cputable.c
> @@ -386,6 +386,25 @@ static struct cpu_spec __initdata cpu_specs[] =
> {
>   .machine_check_early=
> __machine_check_early_realmode_p8,
>   .platform   = "power8",
>   },
> + {   /* 3.00-compliant processor, i.e. Power9
> "architected" mode */
> + .pvr_mask   = 0x,
> + .pvr_value  = 0x0f05,
> + .cpu_name   = "POWER9 (architected)",
> + .cpu_features   = CPU_FTRS_POWER9,
> + .cpu_user_features  = COMMON_USER_POWER9,
> + .cpu_user_features2 = COMMON_USER2_POWER9,
> + .mmu_features   = MMU_FTRS_POWER9,
> + .icache_bsize   = 128,
> + .dcache_bsize   = 128,
> + .num_pmcs   = 6,
> + .pmc_type   = PPC_PMC_IBM,
> + .oprofile_cpu_type  = "ppc64/ibm-compat-v1",
> + .oprofile_type  =
> PPC_OPROFILE_INVALID,
> + .cpu_setup  = __setup_cpu_power9,
> + .cpu_restore= __restore_cpu_power9,
> + .flush_tlb  = __flush_tlb_power9,
> + .platform   = "power9",
> + },
>   {   /* Power7 */
>   .pvr_mask   = 0x,
>   .pvr_value  = 0x003f,


Re: [PATCH] powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9

2017-02-15 Thread Suraj Jitindar Singh
On Thu, 2017-02-16 at 16:03 +1100, Paul Mackerras wrote:
> On POWER9, since commit cc3d2940133d ("powerpc/64: Enable use of
> radix
> MMU under hypervisor on POWER9", 2017-01-30), we set both the radix
> and
> HPT bits in the client-architecture-support (CAS) vector, which tells
> the hypervisor that we can do either radix or HPT.  According to
> PAPR,
> if we use this combination we are promising to do a
> H_REGISTER_PROC_TBL
> hcall later on to let the hypervisor know whether we are doing radix
> or HPT.  We currently do this call if we are doing radix but not if
> we are doing HPT.  If the hypervisor is able to support both radix
> and HPT guests, it would be entitled to defer allocation of the HPT
> until the H_REGISTER_PROC_TBL call, and to fail any attempts to
> create
> HPTEs until the H_REGISTER_PROC_TBL call.  Thus we need to do a
> H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may
> crash at boot time.
> 
> This adds the code to call H_REGISTER_PROC_TBL in this case, before
> we attempt to create any HPT entries using H_ENTER.
> 
> Fixes: cc3d2940133d ("powerpc/64: Enable use of radix MMU under
> hypervisor on POWER9")
> Signed-off-by: Paul Mackerras 
> ---
> This needs to go in after the topic/ppc-kvm branch.
> 
> arch/powerpc/mm/hash_utils_64.c   | 6 ++
>  arch/powerpc/platforms/pseries/lpar.c | 8 ++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hash_utils_64.c
> b/arch/powerpc/mm/hash_utils_64.c
> index 8033493..b0ed96e 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -839,6 +839,12 @@ static void __init htab_initialize(void)
>   /* Using a hypervisor which owns the htab */
>   htab_address = NULL;
>   _SDR1 = 0; 
> + /*
> +  * On POWER9, we need to do a H_REGISTER_PROC_TBL
> hcall
> +  * to inform the hypervisor that we wish to use the
> HPT.
> +  */
> + if (cpu_has_feature(CPU_FTR_ARCH_300))
> + register_process_table(0, 0, 0);
>  #ifdef CONFIG_FA_DUMP
>   /*
>    * If firmware assisted dump is active firmware
> preserves
> diff --git a/arch/powerpc/platforms/pseries/lpar.c
> b/arch/powerpc/platforms/pseries/lpar.c
> index 0587655..5b47026 100644
> --- a/arch/powerpc/platforms/pseries/lpar.c
> +++ b/arch/powerpc/platforms/pseries/lpar.c
> @@ -609,15 +609,18 @@ static int __init disable_bulk_remove(char
> *str)
>  
>  __setup("bulk_remove=", disable_bulk_remove);
>  
> -/* Actually only used for radix, so far */
>  static int pseries_lpar_register_process_table(unsigned long base,
>   unsigned long page_size, unsigned long
> table_size)
>  {
>   long rc;
> - unsigned long flags = PROC_TABLE_NEW;
> + unsigned long flags = 0;
>  
> + if (table_size)
> + flags |= PROC_TABLE_NEW;
>   if (radix_enabled())
>   flags |= PROC_TABLE_RADIX | PROC_TABLE_GTSE;
> + else
> + flags |= PROC_TABLE_HPT_SLB;
>   for (;;) {
>   rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags,
> base,
>   page_size, table_size);
> @@ -643,6 +646,7 @@ void __init hpte_init_pseries(void)
>   mmu_hash_ops.flush_hash_range    =
> pSeries_lpar_flush_hash_range;
>   mmu_hash_ops.hpte_clear_all  = pseries_hpte_clear_all;
>   mmu_hash_ops.hugepage_invalidate =
> pSeries_lpar_hugepage_invalidate;
> + register_process_table   =
> pseries_lpar_register_process_table;
>  }
>  
>  void radix_init_pseries(void)
FWIW:

Reviewed-by: Suraj Jitindar Singh 


Re: [PATCH 17/18] KVM: PPC: Book3S HV: Enable radix guest support

2017-01-22 Thread Suraj Jitindar Singh
On Thu, 2017-01-12 at 20:07 +1100, Paul Mackerras wrote:
> This adds a few last pieces of the support for radix guests:
> 
> * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
>   KVM_PPC_GET_RMMU_INFO ioctls for radix guests
> 
> * On POWER9, allow secondary threads to be on/off-lined while guests
>   are running.
> 
> * Set up LPCR and the partition table entry for radix guests.
> 
> * Don't allocate the rmap array in the kvm_memory_slot structure
>   on radix.
> 
> * Prevent the AIL field in the LPCR being set for radix guests,
>   since we can't yet handle getting interrupts from the guest with
>   the MMU on.
> 
> * Don't try to initialize the HPT for radix guests, since they don't
>   have an HPT.
> 
> * Take out the code that prevents the HV KVM module from
>   initializing on radix hosts.
> 
> At this stage, we only support radix guests if the host is running
> in radix mode, and only support HPT guests if the host is running in
> HPT mode.  Thus a guest cannot switch from one mode to the other,
> which enables some simplifications.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h  |  2 +
>  arch/powerpc/kvm/book3s_64_mmu_hv.c|  1 -
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 45 
>  arch/powerpc/kvm/book3s_hv.c   | 93
> --
>  arch/powerpc/kvm/powerpc.c |  2 +-
>  5 files changed, 115 insertions(+), 28 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 57dc407..2bf3501 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -189,6 +189,7 @@ extern int kvmppc_book3s_radix_page_fault(struct
> kvm_run *run,
>   unsigned long ea, unsigned long dsisr);
>  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t
> eaddr,
>   struct kvmppc_pte *gpte, bool data, bool
> iswrite);
> +extern int kvmppc_init_vm_radix(struct kvm *kvm);
>  extern void kvmppc_free_radix(struct kvm *kvm);
>  extern int kvmppc_radix_init(void);
>  extern void kvmppc_radix_exit(void);
> @@ -200,6 +201,7 @@ extern int kvm_test_age_radix(struct kvm *kvm,
> struct kvm_memory_slot *memslot,
>   unsigned long gfn);
>  extern long kvmppc_hv_get_dirty_log_radix(struct kvm *kvm,
>   struct kvm_memory_slot *memslot, unsigned
> long *map);
> +extern int kvmhv_get_rmmu_info(struct kvm *kvm, struct
> kvm_ppc_rmmu_info *info);
>  
>  /* XXX remove this export when load_last_inst() is generic */
>  extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size,
> void *ptr, bool data);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 7a9afbe..db8de17 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -155,7 +155,6 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32
> *htab_orderp)
>  
>  void kvmppc_free_hpt(struct kvm *kvm)
>  {
> - kvmppc_free_lpid(kvm->arch.lpid);
>   vfree(kvm->arch.revmap);
>   if (kvm->arch.hpt_cma_alloc)
>   kvm_release_hpt(virt_to_page(kvm->arch.hpt_virt),
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 125cc7c..4344651 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -610,6 +610,51 @@ long kvmppc_hv_get_dirty_log_radix(struct kvm
> *kvm,
>   return 0;
>  }
>  
> +static void add_rmmu_ap_encoding(struct kvm_ppc_rmmu_info *info,
> +  int psize, int *indexp)
> +{
> + if (!mmu_psize_defs[psize].shift)
> + return;
> + info->ap_encodings[*indexp] = mmu_psize_defs[psize].shift |
> + (mmu_psize_defs[psize].ap << 29);
> + ++(*indexp);
> +}
> +
> +int kvmhv_get_rmmu_info(struct kvm *kvm, struct kvm_ppc_rmmu_info
> *info)
> +{
> + int i;
> +
> + if (!radix_enabled())
> + return -EINVAL;
> + memset(info, 0, sizeof(*info));
> +
> + /* 4k page size */
> + info->geometries[0].page_shift = 12;
> + info->geometries[0].level_bits[0] = 9;
> + for (i = 1; i < 4; ++i)
> + info->geometries[0].level_bits[i] =
> p9_supported_radix_bits[i];
> + /* 64k page size */
> + info->geometries[1].page_shift = 16;
> + for (i = 0; i < 4; ++i)
> + info->geometries[1].level_bits[i] =
> p9_supported_radix_bits[i];
> +
> + i = 0;
> + add_rmmu_ap_encoding(info, MMU_PAGE_4K, &i);
> + add_rmmu_ap_encoding(info, MMU_PAGE_64K, &i);
> + add_rmmu_ap_encoding(info, MMU_PAGE_2M, &i);
> + add_rmmu_ap_encoding(info, MMU_PAGE_1G, &i);
> +
> + return 0;
> +}
> +
> +int kvmppc_init_vm_radix(struct kvm *kvm)
> +{
> + kvm->arch.pgtable = pgd_alloc(kvm->mm);
> + if (!kvm->arch.pgtable)
> + return -ENOMEM;
> + return 0;
> +}
> +
>  void kvmp

Re: [PATCH 14/18] KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests

2017-01-22 Thread Suraj Jitindar Singh
On Thu, 2017-01-12 at 20:07 +1100, Paul Mackerras wrote:
> This adapts our implementations of the MMU notifier callbacks
> (unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
> to call radix functions when the guest is using radix.  These
> implementations are much simpler than for HPT guests because we
> have only one PTE to deal with, so we don't need to traverse
> rmap chains.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h  |  6 
>  arch/powerpc/kvm/book3s_64_mmu_hv.c| 64 +++-
> --
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 54
> 
>  3 files changed, 103 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h
> b/arch/powerpc/include/asm/kvm_book3s.h
> index ff5cd5c..952cc4b 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -192,6 +192,12 @@ extern int kvmppc_mmu_radix_xlate(struct
> kvm_vcpu *vcpu, gva_t eaddr,
>  extern void kvmppc_free_radix(struct kvm *kvm);
>  extern int kvmppc_radix_init(void);
>  extern void kvmppc_radix_exit(void);
> +extern int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot
> *memslot,
> + unsigned long gfn);
> +extern int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot
> *memslot,
> + unsigned long gfn);
> +extern int kvm_test_age_radix(struct kvm *kvm, struct
> kvm_memory_slot *memslot,
> + unsigned long gfn);
>  
>  /* XXX remove this export when load_last_inst() is generic */
>  extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size,
> void *ptr, bool data);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 57690c2..fbb3de4 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -701,12 +701,13 @@ static void kvmppc_rmap_reset(struct kvm *kvm)
>   srcu_read_unlock(&kvm->srcu, srcu_idx);
>  }
>  
> +typedef int (*hva_handler_fn)(struct kvm *kvm, struct
> kvm_memory_slot *memslot,
> +   unsigned long gfn);
> +
>  static int kvm_handle_hva_range(struct kvm *kvm,
>   unsigned long start,
>   unsigned long end,
> - int (*handler)(struct kvm *kvm,
> -    unsigned long *rmapp,
> -    unsigned long gfn))
> + hva_handler_fn handler)
>  {
>   int ret;
>   int retval = 0;
> @@ -731,9 +732,7 @@ static int kvm_handle_hva_range(struct kvm *kvm,
>   gfn_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE -
> 1, memslot);
>  
>   for (; gfn < gfn_end; ++gfn) {
> - gfn_t gfn_offset = gfn - memslot->base_gfn;
> -
> - ret = handler(kvm, &memslot-
> >arch.rmap[gfn_offset], gfn);
> + ret = handler(kvm, memslot, gfn);
>   retval |= ret;
>   }
>   }
> @@ -742,20 +741,21 @@ static int kvm_handle_hva_range(struct kvm
> *kvm,
>  }
>  
>  static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
> -   int (*handler)(struct kvm *kvm, unsigned
> long *rmapp,
> -  unsigned long gfn))
> +   hva_handler_fn handler)
>  {
>   return kvm_handle_hva_range(kvm, hva, hva + 1, handler);
>  }
>  
> -static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
> +static int kvm_unmap_rmapp(struct kvm *kvm, struct kvm_memory_slot
> *memslot,
>      unsigned long gfn)
>  {
>   struct revmap_entry *rev = kvm->arch.revmap;
>   unsigned long h, i, j;
>   __be64 *hptep;
>   unsigned long ptel, psize, rcbits;
> + unsigned long *rmapp;
>  
> + rmapp = &memslot->arch.rmap[gfn - memslot->base_gfn];
>   for (;;) {
>   lock_rmap(rmapp);
>   if (!(*rmapp & KVMPPC_RMAP_PRESENT)) {
> @@ -816,26 +816,36 @@ static int kvm_unmap_rmapp(struct kvm *kvm,
> unsigned long *rmapp,
>  
>  int kvm_unmap_hva_hv(struct kvm *kvm, unsigned long hva)
>  {
> - kvm_handle_hva(kvm, hva, kvm_unmap_rmapp);
> + hva_handler_fn handler;
> +
> + handler = kvm->arch.radix ? kvm_unmap_radix : 
kvm_is_radix() for consistency?
> kvm_unmap_rmapp;
> + kvm_handle_hva(kvm, hva, handler);
>   return 0;
>  }
>  
>  int kvm_unmap_hva_range_hv(struct kvm *kvm, unsigned long start,
> unsigned long end)
>  {
> - kvm_handle_hva_range(kvm, start, end, kvm_unmap_rmapp);
> + hva_handler_fn handler;
> +
> + handler = kvm->arch.radix ? kvm_unmap_radix : 
ditto
> kvm_unmap_rmapp;
> + kvm_handle_hva_range(kvm, start, end, handler);
>   return 0;
>  }
>  
>  void kvmppc_core_flush_memslot_hv(struct kvm *kvm,
>     struct kvm_memory_slot *memslot

Re: [PATCH 13/18] KVM: PPC: Book3S HV: Page table construction and page faults for radix guests

2017-01-22 Thread Suraj Jitindar Singh
On Thu, 2017-01-12 at 20:07 +1100, Paul Mackerras wrote:
> This adds the code to construct the second-level ("partition-scoped"
> in
> architecturese) page tables for guests using the radix MMU.  Apart
> from
> the PGD level, which is allocated when the guest is created, the rest
> of the tree is all constructed in response to hypervisor page faults.
> 
> As well as hypervisor page faults for missing pages, we also get
> faults
> for reference/change (RC) bits needing to be set, as well as various
> other error conditions.  For now, we only set the R or C bit in the
> guest page table if the same bit is set in the host PTE for the
> backing page.
> 
> This code can take advantage of the guest being backed with either
> transparent or ordinary 2MB huge pages, and insert 2MB page entries
> into the guest page tables.  There is no support for 1GB huge pages
> yet.
> ---
>  arch/powerpc/include/asm/kvm_book3s.h  |   8 +
>  arch/powerpc/kvm/book3s.c  |   1 +
>  arch/powerpc/kvm/book3s_64_mmu_hv.c|   7 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 385
> +
>  arch/powerpc/kvm/book3s_hv.c   |  17 +-
>  5 files changed, 415 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 7adfcc0..ff5cd5c 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -170,6 +170,8 @@ extern int kvmppc_book3s_hv_page_fault(struct
> kvm_run *run,
>   unsigned long status);
>  extern long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr,
>   unsigned long slb_v, unsigned long valid);
> +extern int kvmppc_hv_emulate_mmio(struct kvm_run *run, struct
> kvm_vcpu *vcpu,
> + unsigned long gpa, gva_t ea, int is_store);
>  
>  extern void kvmppc_mmu_hpte_cache_map(struct kvm_vcpu *vcpu, struct
> hpte_cache *pte);
>  extern struct hpte_cache *kvmppc_mmu_hpte_cache_next(struct kvm_vcpu
> *vcpu);
> @@ -182,8 +184,14 @@ extern void kvmppc_mmu_hpte_sysexit(void);
>  extern int kvmppc_mmu_hv_init(void);
>  extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, unsigned
> long hc);
>  
> +extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
> + struct kvm_vcpu *vcpu,
> + unsigned long ea, unsigned long dsisr);
>  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t
> eaddr,
>   struct kvmppc_pte *gpte, bool data, bool
> iswrite);
> +extern void kvmppc_free_radix(struct kvm *kvm);
> +extern int kvmppc_radix_init(void);
> +extern void kvmppc_radix_exit(void);
>  
>  /* XXX remove this export when load_last_inst() is generic */
>  extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size,
> void *ptr, bool data);
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index 019f008..b6b5c18 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -239,6 +239,7 @@ void kvmppc_core_queue_data_storage(struct
> kvm_vcpu *vcpu, ulong dar,
>   kvmppc_set_dsisr(vcpu, flags);
>   kvmppc_book3s_queue_irqprio(vcpu,
> BOOK3S_INTERRUPT_DATA_STORAGE);
>  }
> +EXPORT_SYMBOL_GPL(kvmppc_core_queue_data_storage);   /* used by
> kvm_hv */
>  
>  void kvmppc_core_queue_inst_storage(struct kvm_vcpu *vcpu, ulong
> flags)
>  {
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index c208bf3..57690c2 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -395,8 +395,8 @@ static int instruction_is_store(unsigned int
> instr)
>   return (instr & mask) != 0;
>  }
>  
> -static int kvmppc_hv_emulate_mmio(struct kvm_run *run, struct
> kvm_vcpu *vcpu,
> -   unsigned long gpa, gva_t ea, int
> is_store)
> +int kvmppc_hv_emulate_mmio(struct kvm_run *run, struct kvm_vcpu
> *vcpu,
> +    unsigned long gpa, gva_t ea, int
> is_store)
>  {
>   u32 last_inst;
>  
> @@ -461,6 +461,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run
> *run, struct kvm_vcpu *vcpu,
>   unsigned long rcbits;
>   long mmio_update;
>  
> + if (kvm_is_radix(kvm))
> + return kvmppc_book3s_radix_page_fault(run, vcpu, ea,
> dsisr);
> +
>   /*
>    * Real-mode code has already searched the HPT and found the
>    * entry we're interested in.  Lock the entry and check that
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 9091407..865ea9b 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -137,3 +137,388 @@ int kvmppc_mmu_radix_xlate(struct kvm_vcpu
> *vcpu, gva_t eaddr,
>   return 0;
>  }
>  
> +#ifdef CONFIG_PPC_64K_PAGES
> +#define MMU_BASE_PSIZE   MMU_PAGE_64K
> +#else
> +#define MMU_BASE_PSIZE   MMU_PAGE_4K
> +#endif
> +
> +static void k

Re: [PATCH 10/18] KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9

2017-01-22 Thread Suraj Jitindar Singh
On Thu, 2017-01-12 at 20:07 +1100, Paul Mackerras wrote:
> This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
> for HPT guests on POWER9.  With this, we can return 1 for the
> KVM_CAP_PPC_MMU_HASH_V3 capability.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_host.h |  1 +
>  arch/powerpc/kvm/book3s_hv.c| 35
> +++
>  arch/powerpc/kvm/powerpc.c  |  2 +-
>  3 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h
> b/arch/powerpc/include/asm/kvm_host.h
> index e59b172..944532d 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -264,6 +264,7 @@ struct kvm_arch {
>   atomic_t hpte_mod_interest;
>   cpumask_t need_tlb_flush;
>   int hpt_cma_alloc;
> + u64 process_table;
>   struct dentry *debugfs_dir;
>   struct dentry *htab_dentry;
>  #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index 1736f87..6bd0f4a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3092,8 +3092,8 @@ static void kvmppc_setup_partition_table(struct
> kvm *kvm)
>   /* HTABSIZE and HTABORG fields */
>   dw0 |= kvm->arch.sdr1;
>  
> - /* Second dword has GR=0; other fields are unused since
> UPRT=0 */
> - dw1 = 0;
> + /* Second dword as set by userspace */
> + dw1 = kvm->arch.process_table;
>  
>   mmu_partition_table_set_entry(kvm->arch.lpid, dw0, dw1);
>  }
> @@ -3658,10 +3658,37 @@ static void init_default_hcalls(void)
>   }
>  }
>  
> -/* dummy implementations for now */
>  static int kvmhv_configure_mmu(struct kvm *kvm, struct
> kvm_ppc_mmuv3_cfg *cfg)
>  {
> - return -EINVAL;
> + unsigned long lpcr;
> +
> + /* If not on a POWER9, reject it */
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -ENODEV;
> +
> + /* If any unknown flags set, reject it */
> + if (cfg->flags & ~(KVM_PPC_MMUV3_RADIX |
> KVM_PPC_MMUV3_GTSE))
> + return -EINVAL;
> +
> + /* We can't do radix yet */
> + if (cfg->flags & KVM_PPC_MMUV3_RADIX)
> + return -EINVAL;
> +
> + /* GR (guest radix) bit in process_table field must match */
> + if (cfg->process_table & PATB_GR)
> + return -EINVAL;
> +
> + /* Process table size field must be reasonable, i.e. <= 24
> */
> + if ((cfg->process_table & PRTS_MASK) > 24)
> + return -EINVAL;
> +
> + kvm->arch.process_table = cfg->process_table;
> + kvmppc_setup_partition_table(kvm);
> +
> + lpcr = (cfg->flags & KVM_PPC_MMUV3_GTSE) ? LPCR_GTSE : 0;
> + kvmppc_update_lpcr(kvm, lpcr, LPCR_GTSE);
> +
> + return 0;
>  }
>  
>  static int kvmhv_get_rmmu_info(struct kvm *kvm, struct
> kvm_ppc_rmmu_info *info)
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 38c0d15..1476a48 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -569,7 +569,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm,
> long ext)
>   r = !!(0 && hv_enabled && radix_enabled());
>   break;
>   case KVM_CAP_PPC_MMU_HASH_V3:
> - r = !!(0 && hv_enabled && !radix_enabled() &&
> + r = !!(hv_enabled && !radix_enabled() &&
Just because we have radix enabled, is it correct to preclude a hash
guest from running? Isn't it the case that we may have support for
radix but a guest may choose to run in hash mode (for whatever reason)?
>      cpu_has_feature(CPU_FTR_ARCH_300));
>   break;
>  #endif


Re: [PATCH] powerpc/64: Don't try to use radix MMU under a hypervisor

2016-12-20 Thread Suraj Jitindar Singh
On Tue, 2016-12-20 at 22:40 +1100, Paul Mackerras wrote:
> Currently, if the kernel is running on a POWER9 processor under a
> hypervisor, it will try to use the radix MMU even though it doesn't
> have the necessary code to use radix under a hypervisor (it doesn't
> negotiate use of radix, and it doesn't do the H_REGISTER_PROC_TBL
> hcall).  The result is that the guest kernel will crash when it tries
> to turn on the MMU, because it will still actually be using the HPT
> MMU, but it won't have set up any SLB or HPT entries.  It does this
> because the only thing that the kernel looks at in deciding to use
> radix, on any platform, is the ibm,pa-features property on the cpu
> device nodes.
> 
> This fixes it by looking for the /chosen/ibm,architecture-vec-5
> property, and if it exists, clearing the radix MMU feature bit.
> We do this before we decide whether to initialize for radix or HPT.
> This property is created by the hypervisor as a result of the guest
> calling the ibm,client-architecture-support method to indicate
> its capabilities, so it only exists on systems with a hypervisor.
> The reason for using this property is that in future, when we
> have support for using radix under a hypervisor, we will need
> to check this property to see whether the hypervisor agreed to
> us using radix.
> 
> Fixes: 17a3dd2f5fc7 ("powerpc/mm/radix: Use firmware feature to
> enable Radix MMU")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/mm/init_64.c | 27 +++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a000c35..098531d 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -42,6 +42,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #include 
>  #include 
> @@ -344,6 +346,28 @@ static int __init parse_disable_radix(char *p)
>  }
>  early_param("disable_radix", parse_disable_radix);
>  
> +/*
> + * If we're running under a hypervisor, we currently can't do radix
> + * since we don't have the code to do the H_REGISTER_PROC_TBL hcall.
> + * We tell that we're running under a hypervisor by looking for the
> + * /chosen/ibm,architecture-vec-5 property.
> + */
> +static void early_check_vec5(void)
> +{
> + unsigned long root, chosen;
> + int size;
> + const u8 *vec5;
> +
> + root = of_get_flat_dt_root();
> + chosen = of_get_flat_dt_subnode_by_name(root, "chosen");
> + if (chosen == -FDT_ERR_NOTFOUND)
> + return;
> + vec5 = of_get_flat_dt_prop(chosen, "ibm,architecture-vec-5", 
> &size);
> + if (!vec5)
> + return;
> + cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
> +}
> +
Given that currently radix guest support doesn't exist upstream, it's
sufficient to check for the existence of the vec5 property to determine
that we are a guest and thus can't run radix.

Is it worth checking the specific radix feature bit of the vec5 property so
that this code is still correct for determining the lack of radix
support by the host platform once guest radix kernels are (in the
future) supported?
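
For example (illustrative only, not something in the patch), a stricter
check along those lines could reuse the OV5_FEAT()/OV5_INDX() helpers from
prom.h:

	/* Sketch: only force hash if the returned ibm,architecture-vec-5
	 * bytes are too short or don't have the radix bit set. */
	if (size <= OV5_INDX(OV5_MMU_RADIX_300) ||
	    !(vec5[OV5_INDX(OV5_MMU_RADIX_300)] & OV5_FEAT(OV5_MMU_RADIX_300)))
		cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
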
>  void __init mmu_early_init_devtree(void)
>  {
>   /* Disable radix mode based on kernel command line. */
> @@ -351,6 +375,9 @@ void __init mmu_early_init_devtree(void)
>   cur_cpu_spec->mmu_features &= ~MMU_FTR_TYPE_RADIX;
>  
>   if (early_radix_enabled())
> + early_check_vec5();
> +
> + if (early_radix_enabled())
>   radix__early_init_devtree();
>   else
>   hash__early_init_devtree();


[PATCH V4 2/2] powerpc/kvm: Update kvmppc_set_arch_compat() for ISA v3.00

2016-11-13 Thread Suraj Jitindar Singh
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.

Add support for v3.00 of the ISA which adds an ISA v2.07 compatibility mode
to the PCR.

We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).

Based on work by: Paul Mackerras 

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 38 +++---
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3686471..5d83ecb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -301,39 +301,47 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
 
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
-   unsigned long pcr = 0;
+   unsigned long host_pcr_bit = 0, guest_pcr_bit = 0;
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
+   /* We can (emulate) our own architecture version and anything older */
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   host_pcr_bit = PCR_ARCH_300;
+   else if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   host_pcr_bit = PCR_ARCH_207;
+   else if (cpu_has_feature(CPU_FTR_ARCH_206))
+   host_pcr_bit = PCR_ARCH_206;
+   else
+   host_pcr_bit = PCR_ARCH_205;
+
+   /* Determine lowest PCR bit needed to run guest in given PVR level */
if (arch_compat) {
switch (arch_compat) {
case PVR_ARCH_205:
-   /*
-* If an arch bit is set in PCR, all the defined
-* higher-order arch bits also have to be set.
-*/
-   pcr = PCR_ARCH_206 | PCR_ARCH_205;
+   guest_pcr_bit = PCR_ARCH_205;
break;
case PVR_ARCH_206:
case PVR_ARCH_206p:
-   pcr = PCR_ARCH_206;
+   guest_pcr_bit = PCR_ARCH_206;
break;
case PVR_ARCH_207:
+   guest_pcr_bit = PCR_ARCH_207;
+   break;
+   case PVR_ARCH_300:
+   guest_pcr_bit = PCR_ARCH_300;
break;
default:
return -EINVAL;
}
-
-   if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
-   /* POWER7 can't emulate POWER8 */
-   if (!(pcr & PCR_ARCH_206))
-   return -EINVAL;
-   pcr &= ~PCR_ARCH_206;
-   }
}
 
+   /* Check requested PCR bits don't exceed our capabilities */
+   if (guest_pcr_bit > host_pcr_bit)
+   return -EINVAL;
+
spin_lock(&vc->lock);
vc->arch_compat = arch_compat;
-   vc->pcr = pcr;
+   vc->pcr = host_pcr_bit - guest_pcr_bit;
spin_unlock(&vc->lock);
 
return 0;
-- 
2.5.5



[PATCH V4 1/2] powerpc: Define new ISA v3.00 logical PVR value and PCR register value

2016-11-13 Thread Suraj Jitindar Singh
ISA 3.00 adds the logical PVR value 0x0f05, so add a definition for
this.

Define PCR_ARCH_207 to reflect ISA 2.07 compatibility mode in the processor
compatibility register (PCR).

The next patch changes the algorithm used to determine the required PCR
value in the function kvmppc_set_arch_compat(). We use the PCR_ARCH_XXX
bits to specify and determine the compatibility level which we want to
emulate as well as the compatibility levels which the host is capable
of emulating. To show that we can emulate a v3.00 guest (which is actually
a v3.00 host with no compatibility bits set, at the moment) we need a
PCR_ARCH_300 bit to represent this; however, currently there is no such bit
defined by the ISA. Thus we define a 'dummy' v3.00 compat bit to be used.

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/include/asm/reg.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 9cd4e8c..30d897a 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -377,6 +377,16 @@
 #define   PCR_VEC_DIS  (1ul << (63-0)) /* Vec. disable (bit NA since POWER8) */
 #define   PCR_VSX_DIS  (1ul << (63-1)) /* VSX disable (bit NA since POWER8) */
 #define   PCR_TM_DIS   (1ul << (63-2)) /* Trans. memory disable (POWER8) */
+/*
+ * These bits are used in the function kvmppc_set_arch_compat() to specify and
+ * determine both the compatibility level which we want to emulate and the
+ * compatibility level which the host is capable of emulating. Thus we need a
+ * bit to show that we are capable of emulating an ISA v3.00 guest however as
+ * yet no such bit has been defined in the PCR register. Thus we have to define
+ * a 'dummy' value to be used.
+ */
+#define   PCR_ARCH_300 0x10    /* Dummy Architecture 3.00 */
+#define   PCR_ARCH_207 0x8 /* Architecture 2.07 */
 #define   PCR_ARCH_206 0x4 /* Architecture 2.06 */
 #define   PCR_ARCH_205 0x2 /* Architecture 2.05 */
 #define SPRN_HEIR   0x153   /* Hypervisor Emulated Instruction Register */
@@ -1218,6 +1228,7 @@
 #define PVR_ARCH_206   0x0f03
 #define PVR_ARCH_206p  0x0f13
 #define PVR_ARCH_207   0x0f04
+#define PVR_ARCH_300   0x0f05
 
 /* Macros for setting and retrieving special purpose registers */
 #ifndef __ASSEMBLY__
-- 
2.5.5



[PATCH V4 0/2] powerpc: add support for ISA v2.07 compat level

2016-11-13 Thread Suraj Jitindar Singh
Version v3.00 of the ISA added a new compat level to the processor
compatibility register (PCR), an ISA v2.07 compatibility mode.

Upstream QEMU already supports this so it may as well go into the kernel
now.

Change Log:

V1 -> V2: 
- Reworked logic to set and mask the PCR, no functional change

V2 -> V3:
- Reworked logic again, no functional change

V3 -> V4:
- Added a comment in the first patch to clarify why a 'dummy' PCR v3.00
  value is needed

Suraj Jitindar Singh (2):
  powerpc: Define new ISA v3.00 logical PVR value and PCR register value
  powerpc/kvm: Update kvmppc_set_arch_compat() for ISA v3.00

 arch/powerpc/include/asm/reg.h | 11 +++
 arch/powerpc/kvm/book3s_hv.c   | 38 +++---
 2 files changed, 34 insertions(+), 15 deletions(-)

-- 
2.5.5



Re: [PATCH V3 1/2] powerpc: Define new ISA v3.00 logical PVR value and PCR register value

2016-11-10 Thread Suraj Jitindar Singh
On Thu, 2016-11-10 at 21:36 +1100, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> 
> > 
> > On Tue, 2016-11-08 at 19:21 +1100, Michael Ellerman wrote:
> > > 
> > > Suraj Jitindar Singh  writes:
> > > 
> > > > 
> > > > 
> > > > ISA 3.00 adds the logical PVR value 0x0f05, so add a
> > > > definition
> > > > for
> > > > this.
> > > > 
> > > > Define PCR_ARCH_207 to reflect ISA 2.07 compatibility mode in
> > > > the
> > > > processor
> > > > compatibility register (PCR). Also define a dummy ISA 3.00
> > > > compatibility
> > > > mode PCR_ARCH_300 to be used in the next patch to help with
> > > > determining the
> > > > PCR value.
> > > What's "dummy" about the PCR value?
> > 
> > The next patch needs some PCR bit to specify that we want to
> > emulate
> > v3.00 and/or that the host can emulate v3.00 to follow the pattern
> > used
> > to determine that the host is capable of emulating the given compat
> > level and for determining which PCR bits to set. But no such bit is
> > defined for V3.00 compat mode yet so a "dummy" one is used to
> > represent
> > this even though it's never defined in the ISA.
> > > 
> > > 
> > > AFAICS that value is reserved in the ISA.
> > 
> > Yes it is a reserved bit in the PCR register but it will never
> > actually
> > be set, it will always be cleared by "host_pcr_bit -
> > guest_pcr_bit;"
> > 
> > > 
> > > 
> > > Are we assuming/hoping that ISA 4.0 will use 0x10 to mean ISA 3.0
> > > ?
> > 
> > Basically yes, and although I know nothing's given, it would follow
> > the
> > current pattern for whatever the next ISA version is to use 0x10 to
> > mean V3.00 compat mode. Otherwise this will need to be updated at
> > some
> > point when that's released... In fact if the compat bits are no
> > longer
> > sequential this will need rewriting.
> 
> OK thanks.
> 
> Please send a v4 with that detail in a comment and a better
> explanation
> in the change log.
> 
> I think a block comment before the #define would be best, ie.
> something
Will do and send a V4
> like:
> 
> #define   PCR_ARCH_207    0x8 /* Architecture 2.07 */
> 
> /*
>  * All that helpful detail from above ...
>  */
> #define   PCR_ARCH_300    0x10
> 
> 
> We should also ask if we can get 0x10 reserved in the ISA to mean
> 3.00.
Probably a good idea, might ask you about the process for this on
Monday...
> 
> cheers


[PATCH] powerpc/mm: Correct process and partition table max size

2016-11-08 Thread Suraj Jitindar Singh
Version 3.00 of the ISA states that the PATS (partition table size) field
of the PTCR (partition table control register) and the PRTS (process table
size) field of the partition table entry must both be less than or equal
to 24. However the actual size of the partition and process tables is
2^(12 + PATS) and 2^(12 + PRTS) bytes, respectively. This means that the
maximum allowable size of each of these tables is 2^36 bytes, or 64GB,
for both.

Thus when checking the size shift for each we should be checking for values
greater than 36 instead of the current checks for shifts larger than 24
and 23.
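
For reference (illustrative only), the relationship described above is:

	unsigned long table_bytes = 1UL << (12 + size_field);
	/* size_field <= 24  =>  table_bytes <= 1UL << 36 = 64GB */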

Fixes: 2bfd65e45e877fb5704730244da67c748d28a1b8
Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/mm/pgtable-radix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index ed7bddc..80f3479 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -159,7 +159,7 @@ static void __init radix_init_pgtable(void)
 * Allocate Partition table and process table for the
 * host.
 */
-   BUILD_BUG_ON_MSG((PRTB_SIZE_SHIFT > 23), "Process table size too large.");
+   BUILD_BUG_ON_MSG((PRTB_SIZE_SHIFT > 36), "Process table size too large.");
process_tb = early_alloc_pgtable(1UL << PRTB_SIZE_SHIFT);
/*
 * Fill in the process table.
@@ -181,7 +181,7 @@ static void __init radix_init_partition_table(void)
 
rts_field = radix__get_tree_size();
 
-   BUILD_BUG_ON_MSG((PATB_SIZE_SHIFT > 24), "Partition table size too large.");
+   BUILD_BUG_ON_MSG((PATB_SIZE_SHIFT > 36), "Partition table size too large.");
partition_tb = early_alloc_pgtable(1UL << PATB_SIZE_SHIFT);
partition_tb->patb0 = cpu_to_be64(rts_field | __pa(init_mm.pgd) |
  RADIX_PGD_INDEX_SIZE | PATB_HR);
-- 
2.5.5



Re: [PATCH V3 1/2] powerpc: Define new ISA v3.00 logical PVR value and PCR register value

2016-11-08 Thread Suraj Jitindar Singh
On Tue, 2016-11-08 at 19:21 +1100, Michael Ellerman wrote:
> Suraj Jitindar Singh  writes:
> 
> > 
> > ISA 3.00 adds the logical PVR value 0x0f05, so add a definition
> > for
> > this.
> > 
> > Define PCR_ARCH_207 to reflect ISA 2.07 compatibility mode in the
> > processor
> > compatibility register (PCR). Also define a dummy ISA 3.00
> > compatibility
> > mode PCR_ARCH_300 to be used in the next patch to help with
> > determining the
> > PCR value.
> What's "dummy" about the PCR value?
The next patch needs some PCR bit to specify that we want to emulate
v3.00 and/or that the host can emulate v3.00 to follow the pattern used
to determine that the host is capable of emulating the given compat
level and for determining which PCR bits to set. But no such bit is
defined for V3.00 compat mode yet so a "dummy" one is used to represent
this even though it's never defined in the ISA.
> 
> AFAICS that value is reserved in the ISA.
Yes it is a reserved bit in the PCR register but it will never actually
be set, it will always be cleared by "host_pcr_bit - guest_pcr_bit;"
> 
> Are we assuming/hoping that ISA 4.0 will use 0x10 to mean ISA 3.0 ?
Basically yes, and although I know nothing's given, it would follow the
current pattern for whatever the next ISA version is to use 0x10 to
mean V3.00 compat mode. Otherwise this will need to be updated at some
point when that's released... In fact if the compat bits are no longer
sequential this will need rewriting.
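
For illustration (not part of the patch), with those sequential bit values
the subtraction yields exactly the set of compat bits that need setting:

	/*
	 * P9 host (0x10), guest in 2.05 compat (0x2):
	 *     pcr = 0x10 - 0x2 = 0xe = PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205
	 * P9 host (0x10), guest in 2.07 compat (0x8):
	 *     pcr = 0x10 - 0x8 = 0x8 = PCR_ARCH_207
	 * Guest at the host's own level (e.g. PVR_ARCH_300 on a P9):
	 *     pcr = 0x10 - 0x10 = 0, so the dummy bit is never written.
	 */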
> 
> cheers


[PATCH V3 2/2] powerpc/kvm: Update kvmppc_set_arch_compat() for ISA v3.00

2016-10-31 Thread Suraj Jitindar Singh
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.

Add support for v3.00 of the ISA which adds an ISA v2.07 compatibility mode
to the PCR.

We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).

Based on work by: Paul Mackerras 

Signed-off-by: Suraj Jitindar Singh 
---
 arch/powerpc/kvm/book3s_hv.c | 38 +++---
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3686471..5d83ecb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -301,39 +301,47 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
 
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
-   unsigned long pcr = 0;
+   unsigned long host_pcr_bit = 0, guest_pcr_bit = 0;
struct kvmppc_vcore *vc = vcpu->arch.vcore;
 
+   /* We can (emulate) our own architecture version and anything older */
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   host_pcr_bit = PCR_ARCH_300;
+   else if (cpu_has_feature(CPU_FTR_ARCH_207S))
+   host_pcr_bit = PCR_ARCH_207;
+   else if (cpu_has_feature(CPU_FTR_ARCH_206))
+   host_pcr_bit = PCR_ARCH_206;
+   else
+   host_pcr_bit = PCR_ARCH_205;
+
+   /* Determine lowest PCR bit needed to run guest in given PVR level */
if (arch_compat) {
switch (arch_compat) {
case PVR_ARCH_205:
-   /*
-* If an arch bit is set in PCR, all the defined
-* higher-order arch bits also have to be set.
-*/
-   pcr = PCR_ARCH_206 | PCR_ARCH_205;
+   guest_pcr_bit = PCR_ARCH_205;
break;
case PVR_ARCH_206:
case PVR_ARCH_206p:
-   pcr = PCR_ARCH_206;
+   guest_pcr_bit = PCR_ARCH_206;
break;
case PVR_ARCH_207:
+   guest_pcr_bit = PCR_ARCH_207;
+   break;
+   case PVR_ARCH_300:
+   guest_pcr_bit = PCR_ARCH_300;
break;
default:
return -EINVAL;
}
-
-   if (!cpu_has_feature(CPU_FTR_ARCH_207S)) {
-   /* POWER7 can't emulate POWER8 */
-   if (!(pcr & PCR_ARCH_206))
-   return -EINVAL;
-   pcr &= ~PCR_ARCH_206;
-   }
}
 
+   /* Check requested PCR bits don't exceed our capabilities */
+   if (guest_pcr_bit > host_pcr_bit)
+   return -EINVAL;
+
spin_lock(&vc->lock);
vc->arch_compat = arch_compat;
-   vc->pcr = pcr;
+   vc->pcr = host_pcr_bit - guest_pcr_bit;
spin_unlock(&vc->lock);
 
return 0;
-- 
2.5.5


