Re: linux-next: Tree for Oct 4

2018-10-04 Thread Stephen Rothwell
Hi Guenter,

On Thu, 4 Oct 2018 18:33:02 -0700 Guenter Roeck  wrote:
>
> Most of the boot failures are hopefully fixed with
> https://lore.kernel.org/patchwork/patch/995254/

I have added that commit to linux-next today.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v4 25/32] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-10-04 Thread Paul Mackerras
On Fri, Oct 05, 2018 at 02:54:28PM +1000, David Gibson wrote:
> On Fri, Oct 05, 2018 at 02:23:50PM +1000, Paul Mackerras wrote:
> > On Fri, Oct 05, 2018 at 02:09:08PM +1000, David Gibson wrote:
> > > On Thu, Oct 04, 2018 at 09:56:02PM +1000, Paul Mackerras wrote:
> > > > From: Suraj Jitindar Singh 
> > > > 
> > > > This is only done at level 0, since only level 0 knows which physical
> > > > CPU a vcpu is running on.  This does for nested guests what L0 already
> > > > did for its own guests, which is to flush the TLB on a pCPU when it
> > > > goes to run a vCPU there, and there is another vCPU in the same VM
> > > > which previously ran on this pCPU and has now started to run on another
> > > > pCPU.  This is to handle the situation where the other vCPU touched
> > > > a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
> > > > on that new pCPU and thus left behind a stale TLB entry on this pCPU.
> > > > 
> > > > This introduces a limit on the vcpu_token values used in the
> > > > H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
> > > 
> > > This does make the vcpu tokens no longer entirely opaque to the L0.
> > > It works for now, because the only L1 is Linux and we know basically
> > > how it allocates those tokens.  Eventually we probably want some way
> > > to either remove this restriction or to advertise the limit to the L1.
> > 
> > Right, we could use something like a hash table and have it be
> > basically just as efficient as the array when the set of IDs is dense
> > while also handling arbitrary ID values.  (We'd have to make sure that
> > L1 couldn't trigger unbounded memory consumption in L0, though.)
> 
> Another approach would be to sacrifice some performance for L0
> simplicity:  when an L1 vCPU changes pCPU, flush all the nested LPIDs
> associated with that L1.  When an L2 vCPU changes L1 vCPU (and
> therefore, indirectly pCPU), the L1 would be responsible for flushing
> it.

That was one of the approaches I considered initially, but it has
complexities that aren't apparent, and it could be quite inefficient
for a guest with a lot of nested guests.  For a start you have to
provide a way for L1 to flush the TLB for another LPID, which guests
can't do themselves (it's a hypervisor privileged operation).  Then
there's the fact that it's not the pCPU where the moving vCPU has
moved to that needs the flush, it's the pCPU that it moved from (where
presumably something else is now running).  All in all, the simplest
solution was to have L0 do it, because L0 knows unambiguously the real
physical CPU where any given vCPU last ran.

Paul.


Re: [PATCH v4 25/32] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-10-04 Thread David Gibson
On Fri, Oct 05, 2018 at 02:23:50PM +1000, Paul Mackerras wrote:
> On Fri, Oct 05, 2018 at 02:09:08PM +1000, David Gibson wrote:
> > On Thu, Oct 04, 2018 at 09:56:02PM +1000, Paul Mackerras wrote:
> > > From: Suraj Jitindar Singh 
> > > 
> > > This is only done at level 0, since only level 0 knows which physical
> > > CPU a vcpu is running on.  This does for nested guests what L0 already
> > > did for its own guests, which is to flush the TLB on a pCPU when it
> > > goes to run a vCPU there, and there is another vCPU in the same VM
> > > which previously ran on this pCPU and has now started to run on another
> > > pCPU.  This is to handle the situation where the other vCPU touched
> > > a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
> > > on that new pCPU and thus left behind a stale TLB entry on this pCPU.
> > > 
> > > This introduces a limit on the vcpu_token values used in the
> > > H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
> > 
> > This does make the vcpu tokens no longer entirely opaque to the L0.
> > It works for now, because the only L1 is Linux and we know basically
> > how it allocates those tokens.  Eventually we probably want some way
> > to either remove this restriction or to advertise the limit to the L1.
> 
> Right, we could use something like a hash table and have it be
> basically just as efficient as the array when the set of IDs is dense
> while also handling arbitrary ID values.  (We'd have to make sure that
> L1 couldn't trigger unbounded memory consumption in L0, though.)

Another approach would be to sacrifice some performance for L0
simplicity:  when an L1 vCPU changes pCPU, flush all the nested LPIDs
associated with that L1.  When an L2 vCPU changes L1 vCPU (and
therefore, indirectly pCPU), the L1 would be responsible for flushing
it.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4 25/32] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-10-04 Thread Paul Mackerras
On Fri, Oct 05, 2018 at 02:09:08PM +1000, David Gibson wrote:
> On Thu, Oct 04, 2018 at 09:56:02PM +1000, Paul Mackerras wrote:
> > From: Suraj Jitindar Singh 
> > 
> > This is only done at level 0, since only level 0 knows which physical
> > CPU a vcpu is running on.  This does for nested guests what L0 already
> > did for its own guests, which is to flush the TLB on a pCPU when it
> > goes to run a vCPU there, and there is another vCPU in the same VM
> > which previously ran on this pCPU and has now started to run on another
> > pCPU.  This is to handle the situation where the other vCPU touched
> > a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
> > on that new pCPU and thus left behind a stale TLB entry on this pCPU.
> > 
> > This introduces a limit on the vcpu_token values used in the
> > H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
> 
> This does make the vcpu tokens no longer entirely opaque to the L0.
> It works for now, because the only L1 is Linux and we know basically
> how it allocates those tokens.  Eventually we probably want some way
> to either remove this restriction or to advertise the limit to the L1.

Right, we could use something like a hash table and have it be
basically just as efficient as the array when the set of IDs is dense
while also handling arbitrary ID values.  (We'd have to make sure that
L1 couldn't trigger unbounded memory consumption in L0, though.)

Paul.




Re: [PATCH v4 06/32] KVM: PPC: Book3S HV: Simplify real-mode interrupt handling

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:55:43PM +1000, Paul Mackerras wrote:
> This streamlines the first part of the code that handles a hypervisor
> interrupt that occurred in the guest.  With this, all of the real-mode
> handling that occurs is done before the "guest_exit_cont" label; once
> we get to that label we are committed to exiting to host virtual mode.
> Thus the machine check and HMI real-mode handling is moved before that
> label.
> 
> Also, the code to handle external interrupts is moved out of line, as
> is the code that calls kvmppc_realmode_hmi_handler().
> 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/kvm/book3s_hv_ras.c|   8 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 220 
>  2 files changed, 119 insertions(+), 109 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
> index b11043b..ee564b6 100644
> --- a/arch/powerpc/kvm/book3s_hv_ras.c
> +++ b/arch/powerpc/kvm/book3s_hv_ras.c
> @@ -331,5 +331,13 @@ long kvmppc_realmode_hmi_handler(void)
>   } else {
>   wait_for_tb_resync();
>   }
> +
> + /*
> +  * Reset tb_offset_applied so the guest exit code won't try
> +  * to subtract the previous timebase offset from the timebase.
> +  */
> + if (local_paca->kvm_hstate.kvm_vcore)
> + local_paca->kvm_hstate.kvm_vcore->tb_offset_applied = 0;
> +
>   return 0;
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 5b2ae34..772740d 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1018,8 +1018,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
>  no_xive:
>  #endif /* CONFIG_KVM_XICS */
>  
> -deliver_guest_interrupt:
> -kvmppc_cede_reentry: /* r4 = vcpu, r13 = paca */
> +deliver_guest_interrupt: /* r4 = vcpu, r13 = paca */
>   /* Check if we can deliver an external or decrementer interrupt now */
>   ld  r0, VCPU_PENDING_EXC(r4)
>  BEGIN_FTR_SECTION
> @@ -1269,18 +1268,26 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   std r3, VCPU_CTR(r9)
>   std r4, VCPU_XER(r9)
>  
> -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> - /* For softpatch interrupt, go off and do TM instruction emulation */
> - cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
> - beq kvmppc_tm_emul
> -#endif
> + /* Save more register state  */
> + mfdar   r6
> + mfdsisr r7
> + std r6, VCPU_DAR(r9)
> + stw r7, VCPU_DSISR(r9)
>  
>   /* If this is a page table miss then see if it's theirs or ours */
>   cmpwi   r12, BOOK3S_INTERRUPT_H_DATA_STORAGE
>   beq kvmppc_hdsi
> + std r6, VCPU_FAULT_DAR(r9)
> + stw r7, VCPU_FAULT_DSISR(r9)
>   cmpwi   r12, BOOK3S_INTERRUPT_H_INST_STORAGE
>   beq kvmppc_hisi
>  
> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> + /* For softpatch interrupt, go off and do TM instruction emulation */
> + cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
> + beq kvmppc_tm_emul
> +#endif
> +
>   /* See if this is a leftover HDEC interrupt */
>   cmpwi   r12,BOOK3S_INTERRUPT_HV_DECREMENTER
>   bne 2f
> @@ -1303,7 +1310,7 @@ BEGIN_FTR_SECTION
>  END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>   lbz r0, HSTATE_HOST_IPI(r13)
>   cmpwi   r0, 0
> - beq 4f
> + beq maybe_reenter_guest
>   b   guest_exit_cont
>  3:
>   /* If it's a hypervisor facility unavailable interrupt, save HFSCR */
> @@ -1315,82 +1322,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>  14:
>   /* External interrupt ? */
>   cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
> - bne+guest_exit_cont
> -
> - /* External interrupt, first check for host_ipi. If this is
> -  * set, we know the host wants us out so let's do it now
> -  */
> - bl  kvmppc_read_intr
> -
> - /*
> -  * Restore the active volatile registers after returning from
> -  * a C function.
> -  */
> - ld  r9, HSTATE_KVM_VCPU(r13)
> - li  r12, BOOK3S_INTERRUPT_EXTERNAL
> -
> - /*
> -  * kvmppc_read_intr return codes:
> -  *
> -  * Exit to host (r3 > 0)
> -  *   1 An interrupt is pending that needs to be handled by the host
> -  * Exit guest and return to host by branching to guest_exit_cont
> -  *
> -  *   2 Passthrough that needs completion in the host
> -  * Exit guest and return to host by branching to guest_exit_cont
> -  * However, we also set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
> -  * to indicate to the host to complete handling the interrupt
> -  *
> -  * Before returning to guest, we check if any CPU is heading out
> -  * to the host and if so, we head out also. If no CPUs are heading
> -  * check return values <= 0.
> -  *
> -  * Return to guest (r3 <= 0)
> -  *  0 No external interrupt is pending
> 

Re: [PATCH v4 17/32] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:55:54PM +1000, Paul Mackerras wrote:
> This starts the process of adding the code to support nested HV-style
> virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
> a nested hypervisor can use to set the base address and size of a
> partition table in its memory (analogous to the PTCR register).
> On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
> hypercall from the guest is handled by code that saves the virtual
> PTCR value for the guest.
> 
> This also adds code for creating and destroying nested guests and for
> reading the partition table entry for a nested guest from L1 memory.
> Each nested guest has its own shadow LPID value, different in general
> from the LPID value used by the nested hypervisor to refer to it.  The
> shadow LPID value is allocated at nested guest creation time.
> 
> Nested hypervisor functionality is only available for a radix guest,
> which therefore means a radix host on a POWER9 (or later) processor.
> 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/hvcall.h |   5 +
>  arch/powerpc/include/asm/kvm_book3s.h |  10 +-
>  arch/powerpc/include/asm/kvm_book3s_64.h  |  33 
>  arch/powerpc/include/asm/kvm_book3s_asm.h |   3 +
>  arch/powerpc/include/asm/kvm_host.h   |   5 +
>  arch/powerpc/kvm/Makefile |   3 +-
>  arch/powerpc/kvm/book3s_hv.c  |  27 ++-
>  arch/powerpc/kvm/book3s_hv_nested.c   | 298 ++
>  8 files changed, 377 insertions(+), 7 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_hv_nested.c
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index a0b17f9..c95c651 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -322,6 +322,11 @@
>  #define H_GET_24X7_DATA  0xF07C
>  #define H_GET_PERF_COUNTER_INFO  0xF080
>  
> +/* Platform-specific hcalls used for nested HV KVM */
> +#define H_SET_PARTITION_TABLE0xF800
> +#define H_ENTER_NESTED   0xF804
> +#define H_TLB_INVALIDATE 0xF808
> +
>  /* Values for 2nd argument to H_SET_MODE */
>  #define H_SET_MODE_RESOURCE_SET_CIABR1
>  #define H_SET_MODE_RESOURCE_SET_DAWR 2
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
> index 91c9779..43f212e 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -274,6 +274,13 @@ static inline void kvmppc_save_tm_sprs(struct kvm_vcpu *vcpu) {}
>  static inline void kvmppc_restore_tm_sprs(struct kvm_vcpu *vcpu) {}
>  #endif
>  
> +long kvmhv_nested_init(void);
> +void kvmhv_nested_exit(void);
> +void kvmhv_vm_nested_init(struct kvm *kvm);
> +long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
> +void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
> +void kvmhv_release_all_nested(struct kvm *kvm);
> +
>  void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
>  
>  extern int kvm_irq_bypass;
> @@ -387,9 +394,6 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
>  /* TO = 31 for unconditional trap */
>  #define INS_TW   0x7fe8
>  
> -/* LPIDs we support with this build -- runtime limit may be lower */
> -#define KVMPPC_NR_LPIDS  (LPID_RSVD + 1)
> -
>  #define SPLIT_HACK_MASK  0xff00
>  #define SPLIT_HACK_OFFS  0xfb00
>  
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 5c0e2d9..6d67b6a 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -23,6 +23,39 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
> +#ifdef CONFIG_PPC_PSERIES
> +static inline bool kvmhv_on_pseries(void)
> +{
> + return !cpu_has_feature(CPU_FTR_HVMODE);
> +}
> +#else
> +static inline bool kvmhv_on_pseries(void)
> +{
> + return false;
> +}
> +#endif
> +
> +/*
> + * Structure for a nested guest, that is, for a guest that is managed by
> + * one of our guests.
> + */
> +struct kvm_nested_guest {
> + struct kvm *l1_host;/* L1 VM that owns this nested guest */
> + int l1_lpid;/* lpid L1 guest thinks this guest is */
> + int shadow_lpid;/* real lpid of this nested guest */
> + pgd_t *shadow_pgtable;  /* our page table for this guest */
> + u64 l1_gr_to_hr;/* L1's addr of part'n-scoped table */
> + u64 process_table;  /* process table entry for this guest */
> + long refcnt;/* number of pointers to this struct */
> + struct mutex tlb_lock;  /* serialize page faults and tlbies */
> + struct kvm_nested_guest *next;
> +};
> +
> +struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int 

Re: [PATCH v4 25/32] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:56:02PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> This is only done at level 0, since only level 0 knows which physical
> CPU a vcpu is running on.  This does for nested guests what L0 already
> did for its own guests, which is to flush the TLB on a pCPU when it
> goes to run a vCPU there, and there is another vCPU in the same VM
> which previously ran on this pCPU and has now started to run on another
> pCPU.  This is to handle the situation where the other vCPU touched
> a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
> on that new pCPU and thus left behind a stale TLB entry on this pCPU.
> 
> This introduces a limit on the vcpu_token values used in the
> H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.

This does make the vcpu tokens no longer entirely opaque to the L0.
It works for now, because the only L1 is Linux and we know basically
how it allocates those tokens.  Eventually we probably want some way
to either remove this restriction or to advertise the limit to the L1.

> [pau...@ozlabs.org - made prev_cpu array be unsigned short[] to reduce
>  memory consumption.]
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h |   3 +
>  arch/powerpc/kvm/book3s_hv.c | 101 +++
>  arch/powerpc/kvm/book3s_hv_nested.c  |   5 ++
>  3 files changed, 71 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index aa5bf85..1e96027 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -52,6 +52,9 @@ struct kvm_nested_guest {
>   long refcnt;/* number of pointers to this struct */
>   struct mutex tlb_lock;  /* serialize page faults and tlbies */
>   struct kvm_nested_guest *next;
> + cpumask_t need_tlb_flush;
> + cpumask_t cpu_in_guest;
> + unsigned short prev_cpu[NR_CPUS];
>  };
>  
>  /*
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index ba58883..53a967ea 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -2397,10 +2397,18 @@ static void kvmppc_release_hwthread(int cpu)
>  
>  static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
>  {
> + struct kvm_nested_guest *nested = vcpu->arch.nested;
> + cpumask_t *cpu_in_guest;
>   int i;
>  
>   cpu = cpu_first_thread_sibling(cpu);
> - cpumask_set_cpu(cpu, >arch.need_tlb_flush);
> + if (nested) {
> + cpumask_set_cpu(cpu, >need_tlb_flush);
> + cpu_in_guest = >cpu_in_guest;
> + } else {
> + cpumask_set_cpu(cpu, >arch.need_tlb_flush);
> + cpu_in_guest = >arch.cpu_in_guest;
> + }
>   /*
>* Make sure setting of bit in need_tlb_flush precedes
>* testing of cpu_in_guest bits.  The matching barrier on
> @@ -2408,13 +2416,23 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
>*/
>   smp_mb();
>   for (i = 0; i < threads_per_core; ++i)
> - if (cpumask_test_cpu(cpu + i, >arch.cpu_in_guest))
> + if (cpumask_test_cpu(cpu + i, cpu_in_guest))
>   smp_call_function_single(cpu + i, do_nothing, NULL, 1);
>  }
>  
>  static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
>  {
> + struct kvm_nested_guest *nested = vcpu->arch.nested;
>   struct kvm *kvm = vcpu->kvm;
> + int prev_cpu;
> +
> + if (!cpu_has_feature(CPU_FTR_HVMODE))
> + return;
> +
> + if (nested)
> + prev_cpu = nested->prev_cpu[vcpu->arch.nested_vcpu_id];
> + else
> + prev_cpu = vcpu->arch.prev_cpu;
>  
>   /*
>* With radix, the guest can do TLB invalidations itself,
> @@ -2428,12 +2446,46 @@ static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
>* ran to flush the TLB.  The TLB is shared between threads,
>* so we use a single bit in .need_tlb_flush for all 4 threads.
>*/
> - if (vcpu->arch.prev_cpu != pcpu) {
> - if (vcpu->arch.prev_cpu >= 0 &&
> - cpu_first_thread_sibling(vcpu->arch.prev_cpu) !=
> + if (prev_cpu != pcpu) {
> + if (prev_cpu >= 0 &&
> + cpu_first_thread_sibling(prev_cpu) !=
>   cpu_first_thread_sibling(pcpu))
> - radix_flush_cpu(kvm, vcpu->arch.prev_cpu, vcpu);
> - vcpu->arch.prev_cpu = pcpu;
> + radix_flush_cpu(kvm, prev_cpu, vcpu);
> + if (nested)
> + nested->prev_cpu[vcpu->arch.nested_vcpu_id] = pcpu;
> + else
> + vcpu->arch.prev_cpu = pcpu;
> + }
> +}
> +
> +static void 

Re: [PATCH v4 24/32] KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:56:01PM +1000, Paul Mackerras wrote:
> This adds code to call the H_TLB_INVALIDATE hypercall when running as
> a guest, in the cases where we need to invalidate TLBs (or other MMU
> caches) as part of managing the mappings for a nested guest.  Calling
> H_TLB_INVALIDATE lets the nested hypervisor inform the parent
> hypervisor about changes to partition-scoped page tables or the
> partition table without needing to do hypervisor-privileged tlbie
> instructions.
> 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h |  5 +
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 --
>  arch/powerpc/kvm/book3s_hv_nested.c  | 30 --
>  3 files changed, 57 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
> index a02f0b3..aa5bf85 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_PPC_PSERIES
>  static inline bool kvmhv_on_pseries(void)
> @@ -117,6 +118,10 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
> bool create);
>  void kvmhv_put_nested(struct kvm_nested_guest *gp);
>  
> +/* Encoding of first parameter for H_TLB_INVALIDATE */
> +#define H_TLBIE_P1_ENC(ric, prs, r)  (___PPC_RIC(ric) | ___PPC_PRS(prs) | \
> +  ___PPC_R(r))
> +
>  /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
>  #define PPC_MIN_HPT_ORDER18
>  #define PPC_MAX_HPT_ORDER46
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 4c1eccb..ae0e3ed 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -201,17 +201,43 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
>   unsigned int pshift, unsigned int lpid)
>  {
>   unsigned long psize = PAGE_SIZE;
> + int psi;
> + long rc;
> + unsigned long rb;
>  
>   if (pshift)
>   psize = 1UL << pshift;
> + else
> + pshift = PAGE_SHIFT;
>  
>   addr &= ~(psize - 1);
> - radix__flush_tlb_lpid_page(lpid, addr, psize);
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_tlb_lpid_page(lpid, addr, psize);
> + return;
> + }
> +
> + psi = shift_to_mmu_psize(pshift);
> + rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
> + lpid, rb);
> + if (rc)
> + pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
>  }
>  
>  static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
>  {
> - radix__flush_pwc_lpid(lpid);
> + long rc;
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_pwc_lpid(lpid);
> + return;
> + }
> +
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
> + lpid, TLBIEL_INVAL_SET_LPID);
> + if (rc)
> + pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
>  }
>  
>  static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> index 26151e8..35f8111 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -298,14 +298,32 @@ void kvmhv_nested_exit(void)
>   }
>  }
>  
> +static void kvmhv_flush_lpid(unsigned int lpid)
> +{
> + long rc;
> +
> + if (!kvmhv_on_pseries()) {
> + radix__flush_tlb_lpid(lpid);
> + return;
> + }
> +
> + rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
> + lpid, TLBIEL_INVAL_SET_LPID);
> + if (rc)
> + pr_err("KVM: TLB LPID invalidation hcall failed, rc=%ld\n", rc);
> +}
> +
>  void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
>  {
> - if (cpu_has_feature(CPU_FTR_HVMODE)) {
> + if (!kvmhv_on_pseries()) {
>   mmu_partition_table_set_entry(lpid, dw0, dw1);
> - } else {
> - pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> - pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> + return;
>   }
> +
> + pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> + pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> + /* L0 will do the necessary barriers */
> + kvmhv_flush_lpid(lpid);
>  }
>  
>  static void kvmhv_set_nested_ptbl(struct kvm_nested_guest *gp)
> @@ -482,7 +500,7 @@ static void kvmhv_flush_nested(struct 

Re: [PATCH v4 23/32] KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:56:00PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> When running a nested (L2) guest the guest (L1) hypervisor will use
> the H_TLB_INVALIDATE hcall when it needs to change the partition
> scoped page tables or the partition table which it manages.  It will
> use this hcall in the situations where it would use a partition-scoped
> tlbie instruction if it were running in hypervisor mode.
> 
> The H_TLB_INVALIDATE hcall can invalidate different scopes:
> 
> Invalidate TLB for a given target address:
> - This invalidates a single L2 -> L1 pte
> - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
>   address space which is being invalidated. This is because a single
>   L2 -> L1 pte may have been mapped with more than one pte in the
>   L2 -> L0 page tables.
> 
> Invalidate the entire TLB for a given LPID or for all LPIDs:
> - Invalidate the entire shadow_pgtable for a given nested guest, or
>   for all nested guests.
> 
> Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
> - We don't cache the PWC, so nothing to do.
> 
> Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
> - Here we re-read the partition table entry and remove the nested state
>   for any nested guest for which the first doubleword of the partition
>   table entry is now zero.
> 
> The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
> word (of which only the RIC, PRS and R fields are used), the rS value
> (giving the lpid, where required) and the rB value (giving the IS, AP
> and EPN values).
> 
> [pau...@ozlabs.org - adapted to having the partition table in guest
> memory, added the H_TLB_INVALIDATE implementation, removed tlbie
> instruction emulation, reworded the commit message.]
> 
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

That said, there's one change I think could make it substantially
easier to read...

[snip]
> +static int kvmhv_emulate_tlbie_tlb_addr(struct kvm_vcpu *vcpu, int lpid,
> + int ap, long epn)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_nested_guest *gp;
> + long npages;
> + int shift;
> + unsigned long addr;
> +
> + shift = ap_to_shift(ap);
> + addr = epn << 12;
> + if (shift < 0)
> + /* Invalid ap encoding */
> + return -EINVAL;
> +
> + addr &= ~((1UL << shift) - 1);
> + npages = 1UL << (shift - PAGE_SHIFT);
> +
> + gp = kvmhv_get_nested(kvm, lpid, false);
> + if (!gp) /* No such guest -> nothing to do */
> + return 0;
> + mutex_lock(>tlb_lock);
> +
> + /* There may be more than one host page backing this single guest pte */
> + do {
> + kvmhv_invalidate_shadow_pte(vcpu, gp, addr, );
> +
> + npages -= 1UL << (shift - PAGE_SHIFT);
> + addr += 1UL << shift;

I read this about 6 times before realizing that 'shift' here has a
different value to what it did before the loop (which it has to in
order to be correct).  I'd suggest a loop local variable with a
different name (maybe 'shadow_shift') to make that more obvious.
-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v3 33/33] KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 07:48:26PM +1000, Paul Mackerras wrote:
> On Wed, Oct 03, 2018 at 04:21:44PM +1000, David Gibson wrote:
> > On Tue, Oct 02, 2018 at 09:31:32PM +1000, Paul Mackerras wrote:
> > > With this, userspace can enable a KVM-HV guest to run nested guests
> > > under it.
> [snip]
> > > +/* If set, guests are allowed to create and control nested guests */
> > > +static bool enable_nested = true;
> > > +module_param(enable_nested, bool, S_IRUGO | S_IWUSR);
> > > +MODULE_PARM_DESC(enable_nested, "Enable nested virtualization (only on 
> > > POWER9)");
> > 
> > I'd suggest calling the module parameter just "nested" to match x86.
> 
> OK.
> 
> > >  /* If set, the threads on each CPU core have to be in the same MMU mode 
> > > */
> > >  static bool no_mixing_hpt_and_radix;
> > >  
> > > @@ -5188,6 +5193,17 @@ static int kvmhv_configure_mmu(struct kvm *kvm, 
> > > struct kvm_ppc_mmuv3_cfg *cfg)
> > >   return err;
> > >  }
> > >  
> > > +static int kvmhv_enable_nested(struct kvm *kvm, bool enable)
> > > +{
> > > + if (!(enable_nested && cpu_has_feature(CPU_FTR_ARCH_300)))
> > > + return -EINVAL;
> > 
> > Maybe EPERM, rather than EINVAL.  There's nothing invalid about the
> > ioctl() parameters - we just can't do what they want.
> 
> Just for pedantry's sake, I'll make it EPERM for !enable_nested and
> ENODEV for !POWER9. :)

Sounds fair.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4 22/32] KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:55:59PM +1000, Paul Mackerras wrote:
> From: Suraj Jitindar Singh 
> 
> When a host (L0) page which is mapped into a (L1) guest is in turn
> mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
> so that these mappings can be retrieved later.
> 
> Whenever we create an entry in a shadow_pgtable for a nested guest we
> create a corresponding rmap entry and add it to the list for the
> L1 guest memslot at the index of the L1 guest page it maps. This means
> at the L1 guest memslot we end up with lists of rmaps.
> 
> When we are notified of a host page being invalidated which has been
> mapped through to a (L1) guest, we can then walk the rmap list for that
> guest page, and find and invalidate all of the corresponding
> shadow_pgtable entries.
> 
> In order to reduce memory consumption, we compress the information for
> each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
> for the guest real page frame number -- which will fit in a single
> unsigned long.  To avoid a scenario where a guest can trigger
> unbounded memory allocations, we scan the list when adding an entry to
> see if there is already an entry with the contents we need.  This can
> occur, because we don't ever remove entries from the middle of a list.
> 
> A struct nested guest rmap is a list pointer and an rmap entry;
> 
> ----------------
> | next pointer |
> ----------------
> | rmap entry   |
> ----------------
> 
> Thus the rmap pointer for each guest frame number in the memslot can be
> either NULL, a single entry, or a pointer to a list of nested rmap entries.
> 
> gfn	 memslot rmap array
>  	-------------------------
>  0	| NULL			|	(no rmap entry)
>  	-------------------------
>  1	| single rmap entry	|	(rmap entry with low bit set)
>  	-------------------------
>  2	| list head pointer	|	(list of rmap entries)
>  	-------------------------
> 
> The final entry always has the lowest bit set and is stored in the next
> pointer of the last list entry, or as a single rmap entry.
> With a list of rmap entries looking like;
> 
> -----------------	-----------------	-------------------------
> | list head ptr	| ----> | next pointer	| ----> | single rmap entry	|
> -----------------	-----------------	-------------------------
> 			| rmap entry	|	| rmap entry		|
> 			-----------------	-------------------------
> 
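The 52-bit packing described above can be sketched as a small userspace C fragment. This is an illustration only: the helper names are local to this sketch and not the kernel's; only the bit layout (12-bit lpid in bits 63..52, 40-bit guest page frame number in bits 51..12, single-entry flag in bit 0) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout from the commit message:
 *   bits 63..52  12-bit lpid
 *   bits 51..12  40-bit guest 4k page frame number (kept as a gpa mask)
 *   bit  0       single-entry flag
 */
#define LPID_MASK    0xFFF0000000000000ULL
#define LPID_SHIFT   52
#define GPA_MASK     0x000FFFFFFFFFF000ULL
#define SINGLE_ENTRY 0x0000000000000001ULL

/* Pack an (lpid, gpa) pair into one 64-bit rmap word. */
static uint64_t rmap_pack(uint32_t lpid, uint64_t gpa)
{
    return (((uint64_t)lpid << LPID_SHIFT) & LPID_MASK) | (gpa & GPA_MASK);
}

/* Recover the lpid field. */
static uint32_t rmap_lpid(uint64_t rmap)
{
    return (uint32_t)((rmap & LPID_MASK) >> LPID_SHIFT);
}

/* Recover the page-aligned guest physical address. */
static uint64_t rmap_gpa(uint64_t rmap)
{
    return rmap & GPA_MASK;
}
```

Because the three fields are disjoint, pack followed by either accessor round-trips losslessly, which is what lets a single unsigned long stand in for a full rmap structure.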
> Signed-off-by: Suraj Jitindar Singh 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/include/asm/kvm_book3s.h|   3 +
>  arch/powerpc/include/asm/kvm_book3s_64.h |  70 -
>  arch/powerpc/kvm/book3s_64_mmu_radix.c   |  44 +++
>  arch/powerpc/kvm/book3s_hv.c |   1 +
>  arch/powerpc/kvm/book3s_hv_nested.c  | 130 
> ++-
>  5 files changed, 233 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 63f7ccf..d7aeb6f 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -196,6 +196,9 @@ extern int kvmppc_mmu_radix_translate_table(struct 
> kvm_vcpu *vcpu, gva_t eaddr,
>   int table_index, u64 *pte_ret_p);
>  extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
>   struct kvmppc_pte *gpte, bool data, bool iswrite);
> +extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
> + unsigned int shift, struct kvm_memory_slot *memslot,
> + unsigned int lpid);
>  extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
>   bool writing, unsigned long gpa,
>   unsigned int lpid);
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 5496152..a02f0b3 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -53,6 +53,66 @@ struct kvm_nested_guest {
>   struct kvm_nested_guest *next;
>  };
>  
> +/*
> + * We define a nested rmap entry as a single 64-bit quantity
> + * 0xFFF0000000000000	12-bit lpid field
> + * 0x000FFFFFFFFFF000	40-bit guest 4k page frame number
> + * 0x0000000000000001	1-bit  single entry flag
> + */
> +#define RMAP_NESTED_LPID_MASK	0xFFF0000000000000UL
> +#define RMAP_NESTED_LPID_SHIFT	(52)
> +#define RMAP_NESTED_GPA_MASK	0x000FFFFFFFFFF000UL
> +#define RMAP_NESTED_IS_SINGLE_ENTRY	0x0000000000000001UL
> +
> +/* Structure for a nested guest rmap entry */
> +struct rmap_nested {
> + struct llist_node list;
> + u64 rmap;
> +};
> +
> +/*
> + * for_each_nest_rmap_safe - iterate over the list of nested rmap entries
> + * 

Re: [PATCH v3 22/33] KVM: PPC: Book3S HV: Handle page fault for a nested guest

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 07:21:20PM +1000, Paul Mackerras wrote:
> On Wed, Oct 03, 2018 at 03:39:13PM +1000, David Gibson wrote:
> > On Tue, Oct 02, 2018 at 09:31:21PM +1000, Paul Mackerras wrote:
> > > From: Suraj Jitindar Singh 
> > > @@ -367,7 +367,9 @@ struct kvmppc_pte {
> > >   bool may_write  : 1;
> > >   bool may_execute: 1;
> > >   unsigned long wimg;
> > > + unsigned long rc;
> > >   u8 page_size;   /* MMU_PAGE_xxx */
> > > + u16 page_shift;
> > 
> > It's a bit ugly that this has both page_size and page_shift, which is
> > redundant information AFAICT.  Also, why does page_shift need to be
> > u16 - given that 2^255 bytes is much more than our supported address
> > space, let alone a plausible page size.
> 
> These values are all essentially function outputs, so I don't think
> it's ugly to have the same information in different forms.  I actually
> don't like using the MMU_PAGE_xxx values, because the information in
> the mmu_psize_defs[] array depends on the MMU mode of the host, but
> KVM needs to be able to work with guests in both MMU modes.  More
> generally I don't think it's a good idea that the KVM <-> guest
> interface depends so much on what the host firmware tells us about the
> physical machine we're on.  Thus I'm trying to move away from using
> MMU_PSIZE_xxx values and mmu_psize_defs[] in KVM code.

Fair enough.

> I'll change the type to u8.
> 
> > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> > > b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > index bd06a95..ee6f493 100644
> > > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> > > @@ -29,43 +29,16 @@
> > >   */
> > >  static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
> > >  
> > > -/*
> > > - * Used to walk a partition or process table radix tree in guest memory
> > > - * Note: We exploit the fact that a partition table and a process
> > > - * table have the same layout, a partition-scoped page table and a
> > > - * process-scoped page table have the same layout, and the 2nd
> > > - * doubleword of a partition table entry has the same layout as
> > > - * the PTCR register.
> > > - */
> > > -int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
> > > -  struct kvmppc_pte *gpte, u64 table,
> > > -  int table_index, u64 *pte_ret_p)
> > > +int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
> > > +struct kvmppc_pte *gpte, u64 root,
> > > +u64 *pte_ret_p)
> > >  {
> > >   struct kvm *kvm = vcpu->kvm;
> > >   int ret, level, ps;
> > > - unsigned long ptbl, root;
> > > - unsigned long rts, bits, offset;
> > > - unsigned long size, index;
> > > - struct prtb_entry entry;
> > > + unsigned long rts, bits, offset, index;
> > >   u64 pte, base, gpa;
> > >   __be64 rpte;
> > >  
> > > - if ((table & PRTS_MASK) > 24)
> > > - return -EINVAL;
> > > - size = 1ul << ((table & PRTS_MASK) + 12);
> > > -
> > > - /* Is the table big enough to contain this entry? */
> > > - if ((table_index * sizeof(entry)) >= size)
> > > - return -EINVAL;
> > > -
> > > - /* Read the table to find the root of the radix tree */
> > > - ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> > > - ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> > > - if (ret)
> > > - return ret;
> > > -
> > > - /* Root is stored in the first double word */
> > > - root = be64_to_cpu(entry.prtb0);
> > 
> > This refactoring somewhat obscures the changes directly relevant to
> > the nested guest handling.  Ideally it would be nice to fold some of
> > this into the earlier reworkings.
> 
> True, but given the rapidly approaching merge window, I'm not inclined
> to rework it.

Yeah, ok.

> 
> > > + if (ret) {
> > > + /* We didn't find a pte */
> > > + if (ret == -EINVAL) {
> > > + /* Unsupported mmu config */
> > > + flags |= DSISR_UNSUPP_MMU;
> > > + } else if (ret == -ENOENT) {
> > > + /* No translation found */
> > > + flags |= DSISR_NOHPTE;
> > > + } else if (ret == -EFAULT) {
> > > + /* Couldn't access L1 real address */
> > > + flags |= DSISR_PRTABLE_FAULT;
> > > + vcpu->arch.fault_gpa = fault_addr;
> > > + } else {
> > > + /* Unknown error */
> > > + return ret;
> > > + }
> > > + goto resume_host;
> > 
> > This is effectively forwarding the fault to L1, yes?  In which case a
> > different name might be better than the ambiguous "resume_host".
> 
> I'll change it to "forward_to_l1".

Thanks.

> 
> Paul.
> 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!

Re: [PATCH v3 30/33] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 01:03:21PM +1000, Paul Mackerras wrote:
> On Wed, Oct 03, 2018 at 04:15:15PM +1000, David Gibson wrote:
> > On Tue, Oct 02, 2018 at 09:31:29PM +1000, Paul Mackerras wrote:
> > > With this, the KVM-HV module can be loaded in a guest running under
> > > KVM-HV, and if the hypervisor supports nested virtualization, this
> > > guest can now act as a nested hypervisor and run nested guests.
> > > 
> > > This also adds some checks to inform userspace that HPT guests are not
> > > supported by nested hypervisors, and to prevent userspace from
> > > configuring a guest to use HPT mode.
> > > 
> > > Signed-off-by: Paul Mackerras 
> > > ---
> > >  arch/powerpc/kvm/book3s_hv.c | 20 
> > >  1 file changed, 16 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > > index f630e91..196bff1 100644
> > > --- a/arch/powerpc/kvm/book3s_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_hv.c
> > > @@ -4237,6 +4237,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct 
> > > kvm *kvm,
> > >  {
> > >   struct kvm_ppc_one_seg_page_size *sps;
> > >  
> > > + /* If we're a nested hypervisor, we only support radix guests */
> > > + if (kvmhv_on_pseries())
> > > + return -EINVAL;
> > > +
> > >   /*
> > >* POWER7, POWER8 and POWER9 all support 32 storage keys for data.
> > >* POWER7 doesn't support keys for instruction accesses,
> > > @@ -4822,11 +4826,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct 
> > > kvm_vcpu *vcpu, int sprn,
> > >  
> > >  static int kvmppc_core_check_processor_compat_hv(void)
> > >  {
> > > - if (!cpu_has_feature(CPU_FTR_HVMODE) ||
> > > - !cpu_has_feature(CPU_FTR_ARCH_206))
> > > - return -EIO;
> > > + if (cpu_has_feature(CPU_FTR_HVMODE) &&
> > > + cpu_has_feature(CPU_FTR_ARCH_206))
> > > + return 0;
> > >  
> > > - return 0;
> > > + /* Can run as nested hypervisor on POWER9 in radix mode. */
> > > + if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())
> > 
> > Shouldn't we probe the parent hypervisor for ability to support nested
> > guests before we say "yes" here?
> 
> Well, we do check that the parent hypervisor can support nested
> hypervisors, it's just done later on.  And to match nitpick with
> nitpick, this is a function evaluating _processor_ compatibility, and
> a POWER9 processor in radix mode does have everything necessary to
> support nested hypervisors -- if the parent hypervisor doesn't support
> nested hypervisors, that's not a deficiency in the processor.

Fair enough.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4 30/32] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:56:07PM +1000, Paul Mackerras wrote:
> With this, the KVM-HV module can be loaded in a guest running under
> KVM-HV, and if the hypervisor supports nested virtualization, this
> guest can now act as a nested hypervisor and run nested guests.
> 
> This also adds some checks to inform userspace that HPT guests are not
> supported by nested hypervisors, and to prevent userspace from
> configuring a guest to use HPT mode.
> 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/kvm/book3s_hv.c | 20 
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 7561c99..7f89b22 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4214,6 +4214,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct kvm 
> *kvm,
>  {
>   struct kvm_ppc_one_seg_page_size *sps;
>  
> + /* If we're a nested hypervisor, we only support radix guests */
> + if (kvmhv_on_pseries())
> + return -EINVAL;
> +
>   /*
>* POWER7, POWER8 and POWER9 all support 32 storage keys for data.
>* POWER7 doesn't support keys for instruction accesses,
> @@ -4799,11 +4803,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct 
> kvm_vcpu *vcpu, int sprn,
>  
>  static int kvmppc_core_check_processor_compat_hv(void)
>  {
> - if (!cpu_has_feature(CPU_FTR_HVMODE) ||
> - !cpu_has_feature(CPU_FTR_ARCH_206))
> - return -EIO;
> + if (cpu_has_feature(CPU_FTR_HVMODE) &&
> + cpu_has_feature(CPU_FTR_ARCH_206))
> + return 0;
>  
> - return 0;
> + /* POWER9 in radix mode is capable of being a nested hypervisor. */
> + if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())
> + return 0;
> +
> + return -EIO;
>  }
>  
>  #ifdef CONFIG_KVM_XICS
> @@ -5121,6 +5129,10 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
> kvm_ppc_mmuv3_cfg *cfg)
>   if (radix && !radix_enabled())
>   return -EINVAL;
>  
> + /* If we're a nested hypervisor, we currently only support radix */
> + if (kvmhv_on_pseries() && !radix)
> + return -EINVAL;
> +
>   mutex_lock(&kvm->lock);
>   if (radix != kvm_is_radix(kvm)) {
>   if (kvm->arch.mmu_ready) {

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v4 32/32] KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization

2018-10-04 Thread David Gibson
On Thu, Oct 04, 2018 at 09:56:09PM +1000, Paul Mackerras wrote:
> With this, userspace can enable a KVM-HV guest to run nested guests
> under it.
> 
> The administrator can control whether any nested guests can be run;
> setting the "nested" module parameter to false prevents any guests
> becoming nested hypervisors (that is, any attempt to enable the nested
> capability on a guest will fail).  Guests which are already nested
> hypervisors will continue to be so.
> 
> Signed-off-by: Paul Mackerras 

Reviewed-by: David Gibson 

> ---
>  Documentation/virtual/kvm/api.txt  | 14 ++
>  arch/powerpc/include/asm/kvm_ppc.h |  1 +
>  arch/powerpc/kvm/book3s_hv.c   | 19 +++
>  arch/powerpc/kvm/powerpc.c | 12 
>  include/uapi/linux/kvm.h   |  1 +
>  5 files changed, 47 insertions(+)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 017d851..a2d4832 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4522,6 +4522,20 @@ hpage module parameter is not set to 1, -EINVAL is 
> returned.
>  While it is generally possible to create a huge page backed VM without
>  this capability, the VM will not be able to run.
>  
> +7.15 KVM_CAP_PPC_NESTED_HV
> +
> +Architectures: ppc
> +Parameters: enable flag (0 to disable, non-zero to enable)
> +Returns: 0 on success, -EINVAL when the implementation doesn't support
> +nested-HV virtualization.
> +
> +HV-KVM on POWER9 and later systems allows for "nested-HV"
> +virtualization, which provides a way for a guest VM to run guests that
> +can run using the CPU's supervisor mode (privileged non-hypervisor
> +state).  Enabling this capability on a VM depends on the CPU having
> +the necessary functionality and on the facility being enabled with a
> +kvm-hv module parameter.
> +
>  8. Other capabilities.
>  --
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 245e564..80f0091 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -327,6 +327,7 @@ struct kvmppc_ops {
>   int (*set_smt_mode)(struct kvm *kvm, unsigned long mode,
>   unsigned long flags);
>   void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
> + int (*enable_nested)(struct kvm *kvm, bool enable);
>  };
>  
>  extern struct kvmppc_ops *kvmppc_hv_ops;
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 7f89b22..d3cc013 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -118,6 +118,11 @@ module_param_cb(h_ipi_redirect, &module_param_ops, &h_ipi_redirect, 0644);
>  MODULE_PARM_DESC(h_ipi_redirect, "Redirect H_IPI wakeup to a free host 
> core");
>  #endif
>  
> +/* If set, guests are allowed to create and control nested guests */
> +static bool nested = true;
> +module_param(nested, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
> +
>  /* If set, the threads on each CPU core have to be in the same MMU mode */
>  static bool no_mixing_hpt_and_radix;
>  
> @@ -5165,6 +5170,19 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
> kvm_ppc_mmuv3_cfg *cfg)
>   return err;
>  }
>  
> +static int kvmhv_enable_nested(struct kvm *kvm, bool enable)
> +{
> + if (!nested)
> + return -EPERM;
> + if (!cpu_has_feature(CPU_FTR_ARCH_300))
> + return -ENODEV;
> +
> + /* kvm == NULL means the caller is testing if the capability exists */
> + if (kvm)
> + kvm->arch.nested_enable = enable;
> + return 0;
> +}
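The "kvm == NULL means the caller is only probing" convention above, and the matching KVM_CAP_PPC_NESTED_HV check in powerpc.c further down, can be sketched in plain userspace C. All type and function names here are hypothetical stand-ins for illustration, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirrors of the kernel structures involved. */
struct vm { int nested_enable; };
struct vm_ops { int (*enable_nested)(struct vm *vm, int enable); };

static int nested_param = 1;  /* stands in for the "nested" module param */

static int demo_enable_nested(struct vm *vm, int enable)
{
    if (!nested_param)
        return -1;            /* -EPERM in the real code */
    if (vm)                   /* NULL vm means "just checking" */
        vm->nested_enable = enable;
    return 0;
}

static const struct vm_ops demo_ops = { demo_enable_nested };
static const struct vm_ops no_ops = { NULL };

/* Mirrors the capability check: the op must exist AND the probe call
 * with a NULL vm must succeed, without mutating any VM state. */
static int cap_nested_hv(const struct vm_ops *ops)
{
    return ops->enable_nested && !ops->enable_nested(NULL, 0);
}
```

The point of the convention is that a single callback serves both as the capability probe and as the actual toggle, so the availability check can never drift out of sync with the enable path.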
> +
>  static struct kvmppc_ops kvm_ops_hv = {
>   .get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
>   .set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
> @@ -5204,6 +5222,7 @@ static struct kvmppc_ops kvm_ops_hv = {
>   .configure_mmu = kvmhv_configure_mmu,
>   .get_rmmu_info = kvmhv_get_rmmu_info,
>   .set_smt_mode = kvmhv_set_smt_mode,
> + .enable_nested = kvmhv_enable_nested,
>  };
>  
>  static int kvm_init_subcore_bitmap(void)
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index eba5756..449ae1d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -596,6 +596,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   case KVM_CAP_PPC_MMU_HASH_V3:
>   r = !!(hv_enabled && cpu_has_feature(CPU_FTR_ARCH_300));
>   break;
> + case KVM_CAP_PPC_NESTED_HV:
> + r = !!(hv_enabled && kvmppc_hv_ops->enable_nested &&
> +!kvmppc_hv_ops->enable_nested(NULL, false));
> + break;
>  #endif
>   case KVM_CAP_SYNC_MMU:
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> @@ -2114,6 +2118,14 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   r = kvm->arch.kvm_ops->set_smt_mode(kvm, mode, flags);
>  

[PATCH 2/2] fsl: add i2c controlled qixis driver

2018-10-04 Thread Pankaj Bansal
The FPGA on LX2160AQDS/LX2160ARDB is connected on an I2C bus, so add the
qixis driver, which is basically an I2C client driver, to control the FPGA.

Signed-off-by: Wang Dongsheng 
Signed-off-by: Pankaj Bansal 
---
 drivers/soc/fsl/Kconfig|  9 
 drivers/soc/fsl/Makefile   |  1 +
 drivers/soc/fsl/qixis_ctrl.c   | 75 
 include/linux/fsl/qixis_ctrl.h | 20 +
 4 files changed, 105 insertions(+)

diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
index 8f80e8bbf29e..c355c2cbbd45 100644
--- a/drivers/soc/fsl/Kconfig
+++ b/drivers/soc/fsl/Kconfig
@@ -28,4 +28,13 @@ config FSL_MC_DPIO
  other DPAA2 objects. This driver does not expose the DPIO
  objects individually, but groups them under a service layer
  API.
+
+config FSL_QIXIS
+   tristate "QIXIS system controller driver"
+   select REGMAP_I2C
+   default n
+   help
+ Say y here to enable the QIXIS system controller API. The qixis
+ driver provides FPGA functions to control the system.
+
 endmenu
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 803ef1bfb5ff..47e0cfc66ca4 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -5,5 +5,6 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_QIXIS)+= qixis_ctrl.o
 obj-$(CONFIG_FSL_GUTS) += guts.o
 obj-$(CONFIG_FSL_MC_DPIO)  += dpio/
diff --git a/drivers/soc/fsl/qixis_ctrl.c b/drivers/soc/fsl/qixis_ctrl.c
new file mode 100644
index ..b94649fb9726
--- /dev/null
+++ b/drivers/soc/fsl/qixis_ctrl.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+/* Freescale QIXIS system controller driver.
+ *
+ * Copyright 2015 Freescale Semiconductor, Inc.
+ * Copyright 2018 NXP
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct regmap *qixis_regmap;
+
+static struct regmap_config qixis_regmap_config = {
+   .reg_bits = 8,
+   .val_bits = 8,
+};
+
+static int fsl_qixis_i2c_probe(struct i2c_client *client)
+{
+   struct platform_device *pdev;
+   struct device_node *child;
+   u32 qver;
+
+   if (!i2c_check_functionality(client->adapter, I2C_FUNC_SMBUS_BYTE_DATA))
+   return -EOPNOTSUPP;
+
+   qixis_regmap = regmap_init_i2c(client, &qixis_regmap_config);
+
+   /* create platform device for  each of the child node of FPGA node */
+   for_each_child_of_node(client->dev.of_node, child) {
+   pdev = of_platform_device_create(child, NULL, &client->dev);
+   }
+
+   regmap_read(qixis_regmap, offsetof(struct fsl_qixis_regs, qixis_ver),
+   &qver);
+
+   pr_info("Freescale QIXIS Version: 0x%08x\n", qver);
+
+   return 0;
+}
+
+static int fsl_qixis_i2c_remove(struct i2c_client *client)
+{
+   return 0;
+}
+
+static const struct of_device_id fsl_qixis_of_match[] = {
+   { .compatible = "fsl,fpga-qixis-i2c" },
+   {}
+};
+MODULE_DEVICE_TABLE(of, fsl_qixis_of_match);
+
+static struct i2c_driver fsl_qixis_i2c_driver = {
+   .driver = {
+   .name   = "qixis_ctrl",
+   .owner  = THIS_MODULE,
+   .of_match_table = of_match_ptr(fsl_qixis_of_match),
+   },
+   .probe_new  = fsl_qixis_i2c_probe,
+   .remove = fsl_qixis_i2c_remove,
+};
+module_i2c_driver(fsl_qixis_i2c_driver);
+
+MODULE_AUTHOR("Wang Dongsheng ");
+MODULE_DESCRIPTION("Freescale QIXIS system controller driver");
+MODULE_LICENSE("GPL v2");
diff --git a/include/linux/fsl/qixis_ctrl.h b/include/linux/fsl/qixis_ctrl.h
new file mode 100644
index ..00e80ef21adc
--- /dev/null
+++ b/include/linux/fsl/qixis_ctrl.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0+
+ *
+ * Definitions for Freescale QIXIS system controller.
+ *
+ * Copyright 2015 Freescale Semiconductor, Inc.
+ * Copyright 2018 NXP
+ */
+
+#ifndef _FSL_QIXIS_CTRL_H_
+#define _FSL_QIXIS_CTRL_H_
+
+/* QIXIS MAP */
+struct fsl_qixis_regs {
+   u8  id; /* Identification Registers */
+   u8  version;/* Version Register */
+   u8  qixis_ver;  /* QIXIS Version Register */
+   u8  reserved1[0x1f];
+};
+
+#endif
-- 
2.17.1



[PATCH 1/2] dt-bindings: soc: fsl: Document Qixis FPGA usage

2018-10-04 Thread Pankaj Bansal
An FPGA-based system controller, called “Qixis”, manages
several critical system features, including:
• Reset sequencing
• Power supply configuration
• Board configuration
• Hardware configuration

The qixis registers are accessible over one or more system-specific
interfaces, typically I2C, JTAG or an embedded processor.

Signed-off-by: Pankaj Bansal 
---
 .../bindings/soc/fsl/qixis_ctrl.txt  | 33 ++
 1 file changed, 33 insertions(+)

diff --git a/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt 
b/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
new file mode 100644
index ..bc2950cab71d
--- /dev/null
+++ b/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
@@ -0,0 +1,33 @@
+* QIXIS FPGA block
+
+An FPGA-based system controller, called “Qixis”, manages
+several critical system features, including:
+• Configuration switch monitoring
+• Power on/off sequencing
+• Reset sequencing
+• Power supply configuration
+• Board configuration
+• Hardware configuration
+• Background power data collection (DCM)
+• Fault monitoring
+• RCW bypass SRAM (replace flash RCW with internal RCW) (NOR only)
+• Dedicated functional validation blocks (POSt/IRS, triggered event, and so on)
+• I2C master for remote board control even with no DUT available
+
+The qixis registers are accessible over one or more system-specific interfaces,
+typically I2C, JTAG or an embedded processor.
+
+Required properties:
+
+ - compatible : string, must contain "fsl,fpga-qixis-i2c"
+ - reg : i2c address of the qixis device.
+
+Examples:
+   /* The FPGA node */
+fpga@66 {
+   compatible = "fsl,lx2160aqds-fpga", "fsl,fpga-qixis-i2c";
+   reg = <0x66>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+   };
+
-- 
2.17.1



[PATCH 0/2] add i2c controlled qixis driver

2018-10-04 Thread Pankaj Bansal
The FPGA on LX2160AQDS/LX2160ARDB is connected on an I2C bus, so add the
qixis driver, which is basically an I2C client driver, to control the FPGA.

This driver is essential to control MDIO mux multiplexing.

Cc: Varun Sethi 

Pankaj Bansal (2):
  dt-bindings: soc: fsl: Document Qixis FPGA usage
  fsl: add i2c controlled qixis driver

 .../bindings/soc/fsl/qixis_ctrl.txt   | 33 
 drivers/soc/fsl/Kconfig   |  9 +++
 drivers/soc/fsl/Makefile  |  1 +
 drivers/soc/fsl/qixis_ctrl.c  | 75 +++
 include/linux/fsl/qixis_ctrl.h| 20 +
 5 files changed, 138 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
 create mode 100644 drivers/soc/fsl/qixis_ctrl.c
 create mode 100644 include/linux/fsl/qixis_ctrl.h

-- 
2.17.1



[PATCH 16/16] of: unittest: find overlays[] entry by name instead of index

2018-10-04 Thread frowand . list
From: Frank Rowand 

One accessor of overlays[] was using a hard coded index value to
find the correct array entry instead of searching for the entry
containing the correct name.
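The lookup pattern this patch adopts — a sentinel-terminated table searched by name rather than indexed by position — can be sketched as follows. The local types and table contents are hypothetical illustrations, not the unittest's actual `struct overlay_info` entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the unittest's overlay table entries. */
struct ov_info { const char *name; int expected_result; };

static const struct ov_info demo_overlays[] = {
    { "overlay_base", 0 },
    { "overlay_bad_symbol", -22 },   /* -EINVAL */
    { NULL, 0 }                      /* end marker terminates the scan */
};

/* Walk the table until the NULL-name sentinel, matching by name. */
static const struct ov_info *find_overlay(const struct ov_info *info,
                                          const char *name)
{
    for (; info && info->name; info++)
        if (!strcmp(info->name, name))
            return info;
    return NULL;                     /* not found */
}
```

Searching by name instead of hard-coded index means entries can be reordered or inserted without silently breaking callers, at the cost of an O(n) scan that is negligible for a test table.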

Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index b61a33f30a56..4d4ba4ddba9b 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2152,7 +2152,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
-/* order of entries is hard-coded into users of overlays[] */
+/* entries found by name */
 static struct overlay_info overlays[] = {
OVERLAY_INFO(overlay_base, -),
OVERLAY_INFO(overlay, 0),
@@ -2175,7 +2175,8 @@ struct overlay_info {
OVERLAY_INFO(overlay_bad_add_dup_prop, -EINVAL),
OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
-   {}
+   /* end marker */
+   {.dtb_begin = NULL, .dtb_end = NULL, .expected_result = 0, .name = NULL}
 };
 
 static struct device_node *overlay_base_root;
@@ -2205,6 +2206,19 @@ void __init unittest_unflatten_overlay_base(void)
u32 data_size;
void *new_fdt;
u32 size;
+   int found = 0;
+   const char *overlay_name = "overlay_base";
+
+   for (info = overlays; info && info->name; info++) {
+   if (!strcmp(overlay_name, info->name)) {
+   found = 1;
+   break;
+   }
+   }
+   if (!found) {
+   pr_err("no overlay data for %s\n", overlay_name);
+   return;
+   }
 
info = &overlays[0];
 
@@ -2252,11 +2266,10 @@ static int __init overlay_data_apply(const char 
*overlay_name, int *overlay_id)
 {
struct overlay_info *info;
int found = 0;
-   int k;
int ret;
u32 size;
 
-   for (k = 0, info = overlays; info && info->name; info++, k++) {
+   for (info = overlays; info && info->name; info++) {
if (!strcmp(overlay_name, info->name)) {
found = 1;
break;
-- 
Frank Rowand 



[PATCH 15/16] of: unittest: initialize args before calling of_irq_parse_one()

2018-10-04 Thread frowand . list
From: Frank Rowand 

Callers of of_irq_parse_one() blindly use the pointer args.np
without checking whether of_irq_parse_one() had an error and
thus did not set the value of args.np.  Initialize args to
zero so that using the format "%pOF" to show the value of
args.np will show "(null)" when of_irq_parse_one() has an
error and does not set args.np instead of trying to
dereference a random value.
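Why the zero-initialization matters can be shown with a minimal sketch: a parse step that fails without writing its output, like of_irq_parse_one() on an error path. The types and names here are hypothetical, for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical output struct, in the spirit of struct of_phandle_args. */
struct irq_args { const void *np; int args_count; };

/* Fails for out-of-range indices WITHOUT touching *out, mirroring the
 * error paths that never set args.np. */
static int demo_parse_one(int index, struct irq_args *out)
{
    if (index >= 4)
        return -22;                 /* error: *out never written */
    out->np = "node";               /* success fills in the output */
    out->args_count = 1;
    return 0;
}

/* With the memset (the fix in this patch), a failed parse leaves np as
 * NULL instead of whatever stack junk the struct happened to contain. */
static const void *np_after_failed_parse(void)
{
    struct irq_args args;

    memset(&args, 0, sizeof(args)); /* zero before the call */
    (void)demo_parse_one(7, &args); /* fails, args left untouched */
    return args.np;                 /* NULL thanks to the memset */
}
```

A "%pOF"-style print of that NULL then shows "(null)" instead of dereferencing a random value.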

Reported-by: Guenter Roeck 
Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 6d80f474c8f2..b61a33f30a56 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -780,7 +780,7 @@ static void __init of_unittest_parse_interrupts(void)
for (i = 0; i < 4; i++) {
bool passed = true;
 
-   args.args_count = 0;
+   memset(&args, 0, sizeof(args));
rc = of_irq_parse_one(np, i, &args);
 
passed &= !rc;
@@ -801,7 +801,7 @@ static void __init of_unittest_parse_interrupts(void)
for (i = 0; i < 4; i++) {
bool passed = true;
 
-   args.args_count = 0;
+   memset(&args, 0, sizeof(args));
rc = of_irq_parse_one(np, i, &args);
 
/* Test the values from tests-phandle.dtsi */
@@ -854,6 +854,7 @@ static void __init 
of_unittest_parse_interrupts_extended(void)
for (i = 0; i < 7; i++) {
bool passed = true;
 
+   memset(&args, 0, sizeof(args));
rc = of_irq_parse_one(np, i, &args);
 
/* Test the values from tests-phandle.dtsi */
-- 
Frank Rowand 



[PATCH 14/16] of: unittest: remove unused of_unittest_apply_overlay() argument

2018-10-04 Thread frowand . list
From: Frank Rowand 

Argument unittest_nr is not used in of_unittest_apply_overlay(),
remove it.

Signed-off-by: Frank Rowand 
---
 drivers/of/unittest.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index efd9c947f192..6d80f474c8f2 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -1419,8 +1419,7 @@ static void of_unittest_destroy_tracked_overlays(void)
} while (defers > 0);
 }
 
-static int __init of_unittest_apply_overlay(int overlay_nr, int unittest_nr,
-   int *overlay_id)
+static int __init of_unittest_apply_overlay(int overlay_nr, int *overlay_id)
 {
const char *overlay_name;
 
@@ -1453,7 +1452,7 @@ static int __init of_unittest_apply_overlay_check(int 
overlay_nr,
}
 
ovcs_id = 0;
-   ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, &ovcs_id);
+   ret = of_unittest_apply_overlay(overlay_nr, &ovcs_id);
if (ret != 0) {
/* of_unittest_apply_overlay already called unittest() */
return ret;
@@ -1489,7 +1488,7 @@ static int __init 
of_unittest_apply_revert_overlay_check(int overlay_nr,
 
/* apply the overlay */
ovcs_id = 0;
-   ret = of_unittest_apply_overlay(overlay_nr, unittest_nr, &ovcs_id);
+   ret = of_unittest_apply_overlay(overlay_nr, &ovcs_id);
if (ret != 0) {
/* of_unittest_apply_overlay already called unittest() */
return ret;
-- 
Frank Rowand 



[PATCH 13/16] of: overlay: check prevents multiple fragments touching same property

2018-10-04 Thread frowand . list
From: Frank Rowand 

Add test case of two fragments updating the same property.  After
adding the test case, the system hangs at end of boot, after
slub stack dumps from kfree() in crypto modprobe code.

Multiple overlay fragments adding, modifying, or deleting the same
property is not supported.  Add check to detect the attempt and fail
the overlay apply.

After applying this patch, the devicetree unittest messages will
include:

   OF: overlay: ERROR: multiple overlay fragments add, update, and/or delete 
property /testcase-data-2/substation@100/motor-1/rpm_avail

   ...

   ### dt-test ### end of unittest - 212 passed, 0 failed

The check to detect two fragments updating the same property is
folded into the patch that created the test case to maintain
bisectability.

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c   | 118 ++---
 drivers/of/unittest-data/Makefile  |   1 +
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 +
 drivers/of/unittest-data/overlay_base.dts  |   1 +
 drivers/of/unittest.c  |   5 +
 5 files changed, 112 insertions(+), 37 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 5376ae166caf..640435534675 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -514,52 +514,96 @@ static int build_changeset_symbols_node(struct 
overlay_changeset *ovcs,
return 0;
 }
 
+static int find_dup_cset_node_entry(struct overlay_changeset *ovcs,
+   struct of_changeset_entry *ce_1)
+{
+   struct of_changeset_entry *ce_2;
+   char *fn_1, *fn_2;
+   int node_path_match;
+
+   if (ce_1->action != OF_RECONFIG_ATTACH_NODE &&
+   ce_1->action != OF_RECONFIG_DETACH_NODE)
+   return 0;
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+   if ((ce_2->action == OF_RECONFIG_ATTACH_NODE ||
+ce_2->action == OF_RECONFIG_DETACH_NODE) &&
+   !of_node_cmp(ce_1->np->full_name, ce_2->np->full_name)) {
+
+   fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+   node_path_match = !strcmp(fn_1, fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (node_path_match) {
+   pr_err("ERROR: multiple overlay fragments add and/or delete node %pOF\n",
+  ce_1->np);
+   return -EINVAL;
+   }
+   }
+   }
+
+   return 0;
+}
+
+static int find_dup_cset_prop(struct overlay_changeset *ovcs,
+   struct of_changeset_entry *ce_1)
+{
+   struct of_changeset_entry *ce_2;
+   char *fn_1, *fn_2;
+   int node_path_match;
+
+   if (ce_1->action != OF_RECONFIG_ADD_PROPERTY &&
+   ce_1->action != OF_RECONFIG_REMOVE_PROPERTY &&
+   ce_1->action != OF_RECONFIG_UPDATE_PROPERTY)
+   return 0;
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+   if ((ce_2->action == OF_RECONFIG_ADD_PROPERTY ||
+ce_2->action == OF_RECONFIG_REMOVE_PROPERTY ||
+ce_2->action == OF_RECONFIG_UPDATE_PROPERTY) &&
+   !of_node_cmp(ce_1->np->full_name, ce_2->np->full_name)) {
+
+   fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+   node_path_match = !strcmp(fn_1, fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (node_path_match &&
+   !of_prop_cmp(ce_1->prop->name, ce_2->prop->name)) {
+   pr_err("ERROR: multiple overlay fragments add, update, and/or delete property %pOF/%s\n",
+  ce_1->np, ce_1->prop->name);
+   return -EINVAL;
+   }
+   }
+   }
+
+   return 0;
+}
+
 /**
- * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * changeset_dup_entry_check() - check for duplicate entries
  * @ovcs:  Overlay changeset
  *
- * Check changeset @ovcs->cset for multiple add node entries for the same
- * node.
+ * Check changeset @ovcs->cset for multiple {add or delete} node entries for
+ * the same node or duplicate {add, delete, or update} properties entries
+ * for the same property.
  *
- * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
- * invalid overlay in @ovcs->fragments[].
+ * Returns 0 on success, or -EINVAL if duplicate changeset entry found.
  */
-static int 

[PATCH 12/16] of: overlay: check prevents multiple fragments add or delete same node

2018-10-04 Thread frowand . list
From: Frank Rowand 

Multiple overlay fragments adding or deleting the same node is not
supported.  Replace code comment of such, with check to detect the
attempt and fail the overlay apply.

Devicetree unittest where multiple fragments added the same node was
added in the previous patch in the series.  After applying this patch
the unittest messages will no longer include:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

but will instead include:

   OF: overlay: ERROR: multiple overlay fragments add and/or delete node 
/testcase-data-2/substation@100/motor-1/controller

   ...

   ### dt-test ### end of unittest - 211 passed, 0 failed

Signed-off-by: Frank Rowand 
---

checkpatch errors "line over 80 characters" are OK; they will be
fixed two patches later in this series.

 drivers/of/overlay.c | 58 
 1 file changed, 49 insertions(+), 9 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index f89383331b88..5376ae166caf 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -398,14 +398,6 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
  *   a live devicetree created from Open Firmware.
  *
  * NOTE_2: Multiple mods of created nodes not supported.
- *   If more than one fragment contains a node that does not already exist
- *   in the live tree, then for each fragment of_changeset_attach_node()
- *   will add a changeset entry to add the node.  When the changeset is
- *   applied, __of_attach_node() will attach the node twice (once for
- *   each fragment).  At this point the device tree will be corrupted.
- *
- *   TODO: add integrity check to ensure that multiple fragments do not
- * create the same node.
  *
  * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
  * invalid @overlay.
@@ -523,6 +515,54 @@ static int build_changeset_symbols_node(struct overlay_changeset *ovcs,
 }
 
 /**
+ * check_changeset_dup_add_node() - changeset validation: duplicate add node
+ * @ovcs:  Overlay changeset
+ *
+ * Check changeset @ovcs->cset for multiple add node entries for the same
+ * node.
+ *
+ * Returns 0 on success, -ENOMEM if memory allocation failure, or -EINVAL if
+ * invalid overlay in @ovcs->fragments[].
+ */
+static int check_changeset_dup_add_node(struct overlay_changeset *ovcs)
+{
+   struct of_changeset_entry *ce_1, *ce_2;
+   char *fn_1, *fn_2;
+   int name_match;
+
+   list_for_each_entry(ce_1, &ovcs->cset.entries, node) {
+
+   if (ce_1->action == OF_RECONFIG_ATTACH_NODE ||
+   ce_1->action == OF_RECONFIG_DETACH_NODE) {
+
+   ce_2 = ce_1;
+   list_for_each_entry_continue(ce_2, &ovcs->cset.entries, node) {
+   if (ce_2->action == OF_RECONFIG_ATTACH_NODE ||
+   ce_2->action == OF_RECONFIG_DETACH_NODE) {
+   /* inexpensive name compare */
+   if (!of_node_cmp(ce_1->np->full_name,
+   ce_2->np->full_name)) {
+   /* expensive full path name compare */
+   fn_1 = kasprintf(GFP_KERNEL, "%pOF", ce_1->np);
+   fn_2 = kasprintf(GFP_KERNEL, "%pOF", ce_2->np);
+   name_match = !strcmp(fn_1, fn_2);
+   kfree(fn_1);
+   kfree(fn_2);
+   if (name_match) {
+   pr_err("ERROR: multiple overlay fragments add and/or delete node %pOF\n",
+  ce_1->np);
+   return -EINVAL;
+   }
+   }
+   }
+   }
+   }
+   }
+
+   return 0;
+}
+
+/**
 * build_changeset() - populate overlay changeset in @ovcs from @ovcs->fragments
  * @ovcs:  Overlay changeset
  *
@@ -577,7 +617,7 @@ static int build_changeset(struct overlay_changeset *ovcs)
}
}
 
-   return 0;
+   return check_changeset_dup_add_node(ovcs);
 }
 
 /*
-- 
Frank Rowand 



[PATCH 11/16] of: overlay: test case of two fragments adding same node

2018-10-04 Thread frowand . list
From: Frank Rowand 

Multiple overlay fragments adding or deleting the same node is not
supported.  An attempt to do so results in an incorrect devicetree.
The node name will be munged for the second add.

After adding this patch, the unittest messages will show:

   Duplicate name in motor-1, renamed to "controller#1"
   OF: overlay: of_overlay_apply() err=0
   ### dt-test ### of_overlay_fdt_apply() expected -22, ret=0, overlay_bad_add_dup_node
   ### dt-test ### FAIL of_unittest_overlay_high_level():2419 Adding overlay 'overlay_bad_add_dup_node' failed

   ...

   ### dt-test ### end of unittest - 210 passed, 1 failed

The incorrect (munged) node name "controller#1" can be seen in the
/proc filesystem:

   $ pwd
   /proc/device-tree/testcase-data-2/substation@100/motor-1
   $ ls
   compatible  controller  controller#1  name  phandle  spin
   $ ls controller
   power_bus
   $ ls controller#1
   power_bus_emergency

Signed-off-by: Frank Rowand 
---
 drivers/of/unittest-data/Makefile  |  1 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  | 28 ++
 drivers/of/unittest.c  |  5 
 3 files changed, 34 insertions(+)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts

diff --git a/drivers/of/unittest-data/Makefile b/drivers/of/unittest-data/Makefile
index 013d85e694c6..166dbdbfd1c5 100644
--- a/drivers/of/unittest-data/Makefile
+++ b/drivers/of/unittest-data/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_OF_OVERLAY) += overlay.dtb.o \
overlay_12.dtb.o \
overlay_13.dtb.o \
overlay_15.dtb.o \
+   overlay_bad_add_dup_node.dtb.o \
overlay_bad_phandle.dtb.o \
overlay_bad_symbol.dtb.o \
overlay_base.dtb.o
diff --git a/drivers/of/unittest-data/overlay_bad_add_dup_node.dts b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
new file mode 100644
index ..145dfc3b1024
--- /dev/null
+++ b/drivers/of/unittest-data/overlay_bad_add_dup_node.dts
@@ -0,0 +1,28 @@
+// SPDX-License-Identifier: GPL-2.0
+/dts-v1/;
+/plugin/;
+
+/*
+ * &electric_1/motor-1 and &spin_ctrl_1 are the same node:
+ *   /testcase-data-2/substation@100/motor-1
+ *
+ * Thus the new node "controller" in each fragment will
+ * result in an attempt to add the same node twice.
+ * This will result in an error and the overlay apply
+ * will fail.
+ */
+
+&electric_1 {
+
+   motor-1 {
+   controller {
+   power_bus = < 0x1 0x2 >;
+   };
+   };
+};
+
+&spin_ctrl_1 {
+   controller {
+   power_bus_emergency = < 0x101 0x102 >;
+   };
+};
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 722537e14848..471b8eb6e842 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -2147,6 +2147,7 @@ struct overlay_info {
 OVERLAY_INFO_EXTERN(overlay_12);
 OVERLAY_INFO_EXTERN(overlay_13);
 OVERLAY_INFO_EXTERN(overlay_15);
+OVERLAY_INFO_EXTERN(overlay_bad_add_dup_node);
 OVERLAY_INFO_EXTERN(overlay_bad_phandle);
 OVERLAY_INFO_EXTERN(overlay_bad_symbol);
 
@@ -2169,6 +2170,7 @@ struct overlay_info {
OVERLAY_INFO(overlay_12, 0),
OVERLAY_INFO(overlay_13, 0),
OVERLAY_INFO(overlay_15, 0),
+   OVERLAY_INFO(overlay_bad_add_dup_node, -EINVAL),
OVERLAY_INFO(overlay_bad_phandle, -EINVAL),
OVERLAY_INFO(overlay_bad_symbol, -EINVAL),
{}
@@ -2413,6 +2415,9 @@ static __init void of_unittest_overlay_high_level(void)
unittest(overlay_data_apply("overlay", NULL),
 "Adding overlay 'overlay' failed\n");
 
+   unittest(overlay_data_apply("overlay_bad_add_dup_node", NULL),
+"Adding overlay 'overlay_bad_add_dup_node' failed\n");
+
unittest(overlay_data_apply("overlay_bad_phandle", NULL),
 "Adding overlay 'overlay_bad_phandle' failed\n");
 
-- 
Frank Rowand 



[PATCH 10/16] of: overlay: make all pr_debug() and pr_err() messages unique

2018-10-04 Thread frowand . list
From: Frank Rowand 

Make overlay.c debug and error messages unique so that they can be
unambiguously found by grep.

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index e6fb3ffe9d93..f89383331b88 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -513,7 +513,7 @@ static int build_changeset_symbols_node(struct overlay_changeset *ovcs,
for_each_property_of_node(overlay_symbols_node, prop) {
ret = add_changeset_property(ovcs, target, prop, 1);
if (ret) {
-   pr_debug("Failed to apply prop @%pOF/%s, err=%d\n",
+   pr_debug("Failed to apply symbols prop @%pOF/%s, err=%d\n",
 target->np, prop->name, ret);
return ret;
}
@@ -557,7 +557,8 @@ static int build_changeset(struct overlay_changeset *ovcs)
ret = build_changeset_next_level(ovcs, &target,
 fragment->overlay);
if (ret) {
-   pr_debug("apply failed '%pOF'\n", fragment->target);
+   pr_debug("fragment apply failed '%pOF'\n",
+fragment->target);
return ret;
}
}
@@ -570,7 +571,8 @@ static int build_changeset(struct overlay_changeset *ovcs)
ret = build_changeset_symbols_node(ovcs, &target,
   fragment->overlay);
if (ret) {
-   pr_debug("apply failed '%pOF'\n", fragment->target);
+   pr_debug("symbols fragment apply failed '%pOF'\n",
+fragment->target);
return ret;
}
}
@@ -879,7 +881,7 @@ static int of_overlay_apply(const void *fdt, struct device_node *tree,
 
ret = __of_changeset_apply_notify(&ovcs->cset);
if (ret)
-   pr_err("overlay changeset entry notify error %d\n", ret);
+   pr_err("overlay apply changeset entry notify error %d\n", ret);
/* notify failure is not fatal, continue */
 
list_add_tail(&ovcs->ovcs_list, &ovcs_list);
@@ -1138,7 +1140,7 @@ int of_overlay_remove(int *ovcs_id)
 
ret = __of_changeset_revert_notify(&ovcs->cset);
if (ret)
-   pr_err("overlay changeset entry notify error %d\n", ret);
+   pr_err("overlay remove changeset entry notify error %d\n", ret);
/* notify failure is not fatal, continue */
 
*ovcs_id = 0;
-- 
Frank Rowand 



[PATCH 09/16] of: overlay: validate overlay properties #address-cells and #size-cells

2018-10-04 Thread frowand . list
From: Frank Rowand 

If overlay properties #address-cells or #size-cells are already in
the live devicetree for any given node, then the values in the
overlay must match the values in the live tree.

If the properties are already in the live tree then there is no
need to create a changeset entry to add them since they must
have the same value.  This reduces the memory used by the
changeset and eliminates a possible memory leak.  This is
verified by 12 fewer warnings during the devicetree unittest,
as the possible memory leak warnings about #address-cells and
#size-cells no longer occur.

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 29c33a5c533f..e6fb3ffe9d93 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -287,7 +287,12 @@ static struct property *dup_and_fixup_symbol_prop(
  * @target may be either in the live devicetree or in a new subtree that
  * is contained in the changeset.
  *
- * Some special properties are not updated (no error returned).
+ * Some special properties are not added or updated (no error returned):
+ * "name", "phandle", "linux,phandle".
+ *
+ * Properties "#address-cells" and "#size-cells" are not updated if they
+ * are already in the live tree, but if present in the live tree, the values
+ * in the overlay must match the values in the live tree.
  *
  * Update of property in symbols node is not allowed.
  *
@@ -300,6 +305,7 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
 {
struct property *new_prop = NULL, *prop;
int ret = 0;
+   bool check_for_non_overlay_node = false;
 
if (!of_prop_cmp(overlay_prop->name, "name") ||
!of_prop_cmp(overlay_prop->name, "phandle") ||
@@ -322,13 +328,39 @@ static int add_changeset_property(struct overlay_changeset *ovcs,
if (!new_prop)
return -ENOMEM;
 
-   if (!prop)
+   if (!prop) {
+
+   check_for_non_overlay_node = true;
ret = of_changeset_add_property(&ovcs->cset, target->np,
new_prop);
-   else
+
+   } else if (!of_prop_cmp(prop->name, "#address-cells")) {
+
+   if (prop->length != 4 || new_prop->length != 4 ||
+   *(u32 *)prop->value != *(u32 *)new_prop->value)
+   pr_err("ERROR: overlay and/or live tree #address-cells invalid in node %pOF\n",
+  target->np);
+
+   } else if (!of_prop_cmp(prop->name, "#size-cells")) {
+
+   if (prop->length != 4 || new_prop->length != 4 ||
+   *(u32 *)prop->value != *(u32 *)new_prop->value)
+   pr_err("ERROR: overlay and/or live tree #size-cells invalid in node %pOF\n",
+  target->np);
+
+   } else {
+
+   check_for_non_overlay_node = true;
ret = of_changeset_update_property(&ovcs->cset, target->np,
   new_prop);
 
+   }
+
+   if (check_for_non_overlay_node &&
+   !of_node_check_flag(target->np, OF_OVERLAY))
+   pr_err("WARNING: %s(), memory leak will occur if overlay removed.  Property: %pOF/%s\n",
+  __func__, target->np, new_prop->name);
+
if (ret) {
kfree(new_prop->name);
kfree(new_prop->value);
-- 
Frank Rowand 



[PATCH 08/16] of: overlay: reorder fields in struct fragment

2018-10-04 Thread frowand . list
From: Frank Rowand 

Order the fields of struct fragment in the same order as
struct of_overlay_notify_data.  The order in struct fragment is
not significant.  If both structs are ordered the same then when
examining the data in a debugger or dump the human involved does
not have to remember which context they are examining.

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index c113186e222c..29c33a5c533f 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -49,8 +49,8 @@ struct target {
  * @overlay:   pointer to the __overlay__ node
  */
 struct fragment {
-   struct device_node *target;
struct device_node *overlay;
+   struct device_node *target;
 };
 
 /**
-- 
Frank Rowand 



[PATCH 07/16] of: dynamic: change type of of_{at, de}tach_node() to void

2018-10-04 Thread frowand . list
From: Frank Rowand 

of_attach_node() and of_detach_node() always return zero, so
their return value is meaningless.  Change their return type to void
and update all callers to stop using the return value.

Signed-off-by: Frank Rowand 
---

Powerpc files not tested

 arch/powerpc/platforms/pseries/dlpar.c| 13 ++---
 arch/powerpc/platforms/pseries/reconfig.c |  6 +-
 drivers/of/dynamic.c  |  9 ++---
 include/linux/of.h|  4 ++--
 4 files changed, 7 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index e3010b14aea5..0027eea94a8b 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -244,15 +244,9 @@ struct device_node *dlpar_configure_connector(__be32 
drc_index,
 
 int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
 {
-   int rc;
-
dn->parent = parent;
 
-   rc = of_attach_node(dn);
-   if (rc) {
-   printk(KERN_ERR "Failed to add device node %pOF\n", dn);
-   return rc;
-   }
+   of_attach_node(dn);
 
return 0;
 }
@@ -260,7 +254,6 @@ int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
 int dlpar_detach_node(struct device_node *dn)
 {
struct device_node *child;
-   int rc;
 
child = of_get_next_child(dn, NULL);
while (child) {
@@ -268,9 +261,7 @@ int dlpar_detach_node(struct device_node *dn)
child = of_get_next_child(dn, child);
}
 
-   rc = of_detach_node(dn);
-   if (rc)
-   return rc;
+   of_detach_node(dn);
 
of_node_put(dn);
 
diff --git a/arch/powerpc/platforms/pseries/reconfig.c b/arch/powerpc/platforms/pseries/reconfig.c
index 0e0208117e77..0b72098da454 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -47,11 +47,7 @@ static int pSeries_reconfig_add_node(const char *path, struct property *proplist
goto out_err;
}
 
-   err = of_attach_node(np);
-   if (err) {
-   printk(KERN_ERR "Failed to add device node %s\n", path);
-   goto out_err;
-   }
+   of_attach_node(np);
 
of_node_put(np->parent);
 
diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index 275c0d7e2268..5f7c99b9de0d 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -224,7 +224,7 @@ static void __of_attach_node(struct device_node *np)
 /**
  * of_attach_node() - Plug a device node into the tree and global list.
  */
-int of_attach_node(struct device_node *np)
+void of_attach_node(struct device_node *np)
 {
struct of_reconfig_data rd;
unsigned long flags;
@@ -241,8 +241,6 @@ int of_attach_node(struct device_node *np)
mutex_unlock(&of_mutex);
 
of_reconfig_notify(OF_RECONFIG_ATTACH_NODE, &rd);
-
-   return 0;
 }
 
 void __of_detach_node(struct device_node *np)
@@ -273,11 +271,10 @@ void __of_detach_node(struct device_node *np)
 /**
  * of_detach_node() - "Unplug" a node from the device tree.
  */
-int of_detach_node(struct device_node *np)
+void of_detach_node(struct device_node *np)
 {
struct of_reconfig_data rd;
unsigned long flags;
-   int rc = 0;
 
memset(&rd, 0, sizeof(rd));
rd.dn = np;
@@ -291,8 +288,6 @@ int of_detach_node(struct device_node *np)
mutex_unlock(&of_mutex);
 
of_reconfig_notify(OF_RECONFIG_DETACH_NODE, &rd);
-
-   return rc;
 }
 EXPORT_SYMBOL_GPL(of_detach_node);
 
diff --git a/include/linux/of.h b/include/linux/of.h
index aa1dafaec6ae..72c593455019 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -406,8 +406,8 @@ extern int of_phandle_iterator_args(struct of_phandle_iterator *it,
 #define OF_RECONFIG_REMOVE_PROPERTY0x0004
 #define OF_RECONFIG_UPDATE_PROPERTY0x0005
 
-extern int of_attach_node(struct device_node *);
-extern int of_detach_node(struct device_node *);
+extern void of_attach_node(struct device_node *np);
+extern void of_detach_node(struct device_node *np);
 
 #define of_match_ptr(_ptr) (_ptr)
 
-- 
Frank Rowand 



[PATCH 06/16] of: overlay: do not duplicate properties from overlay for new nodes

2018-10-04 Thread frowand . list
From: Frank Rowand 

When allocating a new node, add_changeset_node() was duplicating the
properties from the respective node in the overlay instead of
allocating a node with no properties.

When this patch is applied, the errors reported by the devicetree
unittest from patch "of: overlay: add tests to validate kfrees from
overlay removal" will no longer occur.  These error messages are of
the form:

   "OF: ERROR: ..."

and the unittest results will change from:

   ### dt-test ### end of unittest - 203 passed, 7 failed

to

   ### dt-test ### end of unittest - 210 passed, 0 failed

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 0b0904f44bc7..c113186e222c 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -393,7 +393,7 @@ static int add_changeset_node(struct overlay_changeset *ovcs,
break;
 
if (!tchild) {
-   tchild = __of_node_dup(node, node_kbasename);
+   tchild = __of_node_dup(NULL, node_kbasename);
if (!tchild)
return -ENOMEM;
 
-- 
Frank Rowand 



[PATCH 05/16] of: overlay: use prop add changeset entry for property in new nodes

2018-10-04 Thread frowand . list
From: Frank Rowand 

The changeset entry 'update property' was used for new properties in
an overlay instead of 'add property'.

The decision of whether to use 'update property' was based on whether
the property already exists in the subtree where the node is being
spliced into.  At the top level of creating a changeset describing the
overlay, the target node is in the live devicetree, so checking whether
the property exists in the target node returns the correct result.
As soon as the changeset creation algorithm recurses into a new node,
the target is no longer in the live devicetree, but is instead in the
detached overlay tree, thus all properties are incorrectly found to
already exist in the target.

This fix will expose another devicetree bug that will be fixed
in the following patch in the series.

When this patch is applied, the errors reported by the devicetree
unittest will change, and the unittest results will change from:

   ### dt-test ### end of unittest - 210 passed, 0 failed

to

   ### dt-test ### end of unittest - 203 passed, 7 failed

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 112 ++-
 1 file changed, 74 insertions(+), 38 deletions(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 32cfee68f2e3..0b0904f44bc7 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -24,6 +24,26 @@
 #include "of_private.h"
 
 /**
+ * struct target - info about current target node as recursing through overlay
+ * @np:   node where current level of overlay will be applied
+ * @in_livetree:   @np is a node in the live devicetree
+ *
+ * Used in the algorithm to create the portion of a changeset that describes
+ * an overlay fragment, which is a devicetree subtree.  Initially @np is a node
+ * in the live devicetree where the overlay subtree is targeted to be grafted
+ * into.  When recursing to the next level of the overlay subtree, the target
+ * also recurses to the next level of the live devicetree, as long as overlay
+ * subtree node also exists in the live devicetree.  When a node in the overlay
+ * subtree does not exist at the same level in the live devicetree, target->np
+ * points to a newly allocated node, and all subsequent targets in the subtree
+ * will be newly allocated nodes.
+ */
+struct target {
+   struct device_node *np;
+   bool in_livetree;
+};
+
+/**
  * struct fragment - info about fragment nodes in overlay expanded device tree
 * @target:   target of the overlay operation
  * @overlay:   pointer to the __overlay__ node
@@ -72,8 +92,7 @@ static int devicetree_corrupt(void)
 }
 
 static int build_changeset_next_level(struct overlay_changeset *ovcs,
-   struct device_node *target_node,
-   const struct device_node *overlay_node);
+   struct target *target, const struct device_node *overlay_node);
 
 /*
  * of_resolve_phandles() finds the largest phandle in the live tree.
@@ -257,14 +276,17 @@ static struct property *dup_and_fixup_symbol_prop(
 /**
  * add_changeset_property() - add @overlay_prop to overlay changeset
  * @ovcs:  overlay changeset
- * @target_node:   where to place @overlay_prop in live tree
+ * @target:   where @overlay_prop will be placed
  * @overlay_prop:  property to add or update, from overlay tree
  * @is_symbols_prop:   1 if @overlay_prop is from node "/__symbols__"
  *
- * If @overlay_prop does not already exist in @target_node, add changeset entry
- * to add @overlay_prop in @target_node, else add changeset entry to update
+ * If @overlay_prop does not already exist in live devicetree, add changeset
+ * entry to add @overlay_prop in @target, else add changeset entry to update
  * value of @overlay_prop.
  *
+ * @target may be either in the live devicetree or in a new subtree that
+ * is contained in the changeset.
+ *
  * Some special properties are not updated (no error returned).
  *
  * Update of property in symbols node is not allowed.
@@ -273,20 +295,22 @@ static struct property *dup_and_fixup_symbol_prop(
  * invalid @overlay.
  */
 static int add_changeset_property(struct overlay_changeset *ovcs,
-   struct device_node *target_node,
-   struct property *overlay_prop,
+   struct target *target, struct property *overlay_prop,
bool is_symbols_prop)
 {
struct property *new_prop = NULL, *prop;
int ret = 0;
 
-   prop = of_find_property(target_node, overlay_prop->name, NULL);
-
if (!of_prop_cmp(overlay_prop->name, "name") ||
!of_prop_cmp(overlay_prop->name, "phandle") ||
!of_prop_cmp(overlay_prop->name, "linux,phandle"))
return 0;
 
+   if (target->in_livetree)
+   prop = of_find_property(target->np, overlay_prop->name, NULL);
+   else
+   prop = NULL;
+
if (is_symbols_prop) {
if (prop)
  

[PATCH 04/16] powerpc/pseries: add of_node_put() in dlpar_detach_node()

2018-10-04 Thread frowand . list
From: Frank Rowand 

"of: overlay: add missing of_node_get() in __of_attach_node_sysfs"
added a missing of_node_get() to __of_attach_node_sysfs().  This
results in a refcount imbalance for nodes attached with
dlpar_attach_node().  The calling sequence from dlpar_attach_node()
to __of_attach_node_sysfs() is:

   dlpar_attach_node()
  of_attach_node()
 __of_attach_node_sysfs()

Signed-off-by: Frank Rowand 
---

* UNTESTED.  I need people with the affected PowerPC systems
* (systems that dynamically allocate and deallocate
* devicetree nodes) to test this patch.

 arch/powerpc/platforms/pseries/dlpar.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index a0b20c03f078..e3010b14aea5 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -272,6 +272,8 @@ int dlpar_detach_node(struct device_node *dn)
if (rc)
return rc;
 
+   of_node_put(dn);
+
return 0;
 }
 
-- 
Frank Rowand 



[PATCH 03/16] of: overlay: add missing of_node_get() in __of_attach_node_sysfs

2018-10-04 Thread frowand . list
From: Frank Rowand 

There is a matching of_node_put() in __of_detach_node_sysfs()

Remove the misleading comment from the function header of
of_detach_node().

This patch may result in memory leaks from code that calls the
dynamic node add and delete functions directly instead of using
changesets.

Signed-off-by: Frank Rowand 
---

This patch should result in powerpc systems that dynamically
allocate a node, then later deallocate the node to have a
memory leak when the node is deallocated.

The next patch in the series will fix the leak.

 drivers/of/dynamic.c | 3 ---
 drivers/of/kobj.c| 4 +++-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index b04ee021a891..275c0d7e2268 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -272,9 +272,6 @@ void __of_detach_node(struct device_node *np)
 
 /**
  * of_detach_node() - "Unplug" a node from the device tree.
- *
- * The caller must hold a reference to the node.  The memory associated with
- * the node is not freed until its refcount goes to zero.
  */
 int of_detach_node(struct device_node *np)
 {
diff --git a/drivers/of/kobj.c b/drivers/of/kobj.c
index 7a0a18980b98..c72eef988041 100644
--- a/drivers/of/kobj.c
+++ b/drivers/of/kobj.c
@@ -133,6 +133,9 @@ int __of_attach_node_sysfs(struct device_node *np)
}
if (!name)
return -ENOMEM;
+
+   of_node_get(np);
+
rc = kobject_add(&np->kobj, parent, "%s", name);
kfree(name);
if (rc)
@@ -159,6 +162,5 @@ void __of_detach_node_sysfs(struct device_node *np)
kobject_del(&np->kobj);
}
 
-   /* finally remove the kobj_init ref */
of_node_put(np);
 }
-- 
Frank Rowand 



[PATCH 02/16] of: overlay: add missing of_node_put() after add new node to changeset

2018-10-04 Thread frowand . list
From: Frank Rowand 

The refcount of a newly added overlay node decrements to one
(instead of zero) when the overlay changeset is destroyed.  This
change allows the final decrement to reach zero.

After applying this patch, new validation warnings will be
reported from the devicetree unittest during boot due to
a pre-existing devicetree bug.  The warnings will be similar to:

  OF: ERROR: memory leak of_node_release() overlay node /testcase-data/overlay-node/test-bus/test-unittest4 before free overlay changeset

This pre-existing devicetree bug will also trigger a WARN_ONCE() from
refcount_sub_and_test_checked() when an overlay changeset is
destroyed without having first been applied.  This scenario occurs
when an error in the overlay is detected during the overlay changeset
creation:

  WARNING: CPU: 0 PID: 1 at lib/refcount.c:187 refcount_sub_and_test_checked+0xa8/0xbc
  refcount_t: underflow; use-after-free.

  (unwind_backtrace) from (show_stack+0x10/0x14)
  (show_stack) from (dump_stack+0x6c/0x8c)
  (dump_stack) from (__warn+0xdc/0x104)
  (__warn) from (warn_slowpath_fmt+0x44/0x6c)
  (warn_slowpath_fmt) from (refcount_sub_and_test_checked+0xa8/0xbc)
  (refcount_sub_and_test_checked) from (kobject_put+0x24/0x208)
  (kobject_put) from (of_changeset_destroy+0x2c/0xb4)
  (of_changeset_destroy) from (free_overlay_changeset+0x1c/0x9c)
  (free_overlay_changeset) from (of_overlay_remove+0x284/0x2cc)
  (of_overlay_remove) from (of_unittest_apply_revert_overlay_check.constprop.4+0xf8/0x1e8)
  (of_unittest_apply_revert_overlay_check.constprop.4) from (of_unittest_overlay+0x960/0xed8)
  (of_unittest_overlay) from (of_unittest+0x1cc4/0x2138)
  (of_unittest) from (do_one_initcall+0x4c/0x28c)
  (do_one_initcall) from (kernel_init_freeable+0x29c/0x378)
  (kernel_init_freeable) from (kernel_init+0x8/0x110)
  (kernel_init) from (ret_from_fork+0x14/0x2c)

Signed-off-by: Frank Rowand 
---
 drivers/of/overlay.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index 1176cb4b6e4e..32cfee68f2e3 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -379,7 +379,9 @@ static int add_changeset_node(struct overlay_changeset *ovcs,
if (ret)
return ret;
 
-   return build_changeset_next_level(ovcs, tchild, node);
+   ret = build_changeset_next_level(ovcs, tchild, node);
+   of_node_put(tchild);
+   return ret;
}
 
if (node->phandle && tchild->phandle)
-- 
Frank Rowand 



[PATCH 01/16] of: overlay: add tests to validate kfrees from overlay removal

2018-10-04 Thread frowand . list
From: Frank Rowand 

Add checks:
  - attempted kfree due to refcount reaching zero before overlay
    is removed
  - properties linked to an overlay node when the node is removed
  - node refcount > one during node removal in a changeset destroy,
    if the node was created by the changeset

After applying this patch, several validation warnings will be
reported from the devicetree unittest during boot due to
pre-existing devicetree bugs. The warnings will be similar to:

  OF: ERROR: of_node_release() overlay node /testcase-data/overlay-node/test-bus/test-unittest11/test-unittest111 contains unexpected properties
  OF: ERROR: memory leak - destroy cset entry: attach overlay node /testcase-data-2/substation@100/hvac-medium-2 with refcount 2

Signed-off-by: Frank Rowand 
---
 drivers/of/dynamic.c | 29 +
 drivers/of/overlay.c |  1 +
 include/linux/of.h   | 15 ++-
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index f4f8ed9b5454..b04ee021a891 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -330,6 +330,25 @@ void of_node_release(struct kobject *kobj)
if (!of_node_check_flag(node, OF_DYNAMIC))
return;
 
+   if (of_node_check_flag(node, OF_OVERLAY)) {
+
+   if (!of_node_check_flag(node, OF_OVERLAY_FREE_CSET)) {
+   /* premature refcount of zero, do not free memory */
+   pr_err("ERROR: memory leak %s() overlay node %pOF before free overlay changeset\n",
+  __func__, node);
+   return;
+   }
+
+   /*
+* If node->properties non-empty then properties were added
+* to this node either by different overlay that has not
+* yet been removed, or by a non-overlay mechanism.
+*/
+   if (node->properties)
+   pr_err("ERROR: %s() overlay node %pOF contains unexpected properties\n",
+  __func__, node);
+   }
+
property_list_free(node->properties);
property_list_free(node->deadprops);
 
@@ -434,6 +453,16 @@ struct device_node *__of_node_dup(const struct device_node *np,
 
 static void __of_changeset_entry_destroy(struct of_changeset_entry *ce)
 {
+   if (ce->action == OF_RECONFIG_ATTACH_NODE &&
+   of_node_check_flag(ce->np, OF_OVERLAY)) {
+   if (kref_read(&ce->np->kobj.kref) > 1) {
+   pr_err("ERROR: memory leak - destroy cset entry: attach overlay node %pOF with refcount %d\n",
+  ce->np, kref_read(&ce->np->kobj.kref));
+   } else {
+   of_node_set_flag(ce->np, OF_OVERLAY_FREE_CSET);
+   }
+   }
+
of_node_put(ce->np);
list_del(&ce->node);
kfree(ce);
diff --git a/drivers/of/overlay.c b/drivers/of/overlay.c
index eda57ef12fd0..1176cb4b6e4e 100644
--- a/drivers/of/overlay.c
+++ b/drivers/of/overlay.c
@@ -373,6 +373,7 @@ static int add_changeset_node(struct overlay_changeset *ovcs,
return -ENOMEM;
 
tchild->parent = target_node;
+   of_node_set_flag(tchild, OF_OVERLAY);
 
ret = of_changeset_attach_node(&ovcs->cset, tchild);
if (ret)
diff --git a/include/linux/of.h b/include/linux/of.h
index 4d25e4f952d9..aa1dafaec6ae 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -138,11 +138,16 @@ static inline void of_node_put(struct device_node *node) { }
 extern struct device_node *of_stdout;
 extern raw_spinlock_t devtree_lock;
 
-/* flag descriptions (need to be visible even when !CONFIG_OF) */
-#define OF_DYNAMIC 1 /* node and properties were allocated via kmalloc */
-#define OF_DETACHED    2 /* node has been detached from the device tree */
-#define OF_POPULATED   3 /* device already created for the node */
-#define OF_POPULATED_BUS   4 /* of_platform_populate recursed to children of this node */
+/*
+ * struct device_node flag descriptions
+ * (need to be visible even when !CONFIG_OF)
+ */
+#define OF_DYNAMIC 1 /* (and properties) allocated via kmalloc */
+#define OF_DETACHED    2 /* detached from the device tree */
+#define OF_POPULATED   3 /* device already created */
+#define OF_POPULATED_BUS   4 /* platform bus created for children */
+#define OF_OVERLAY 5 /* allocated for an overlay */
+#define OF_OVERLAY_FREE_CSET   6 /* in overlay cset being freed */
 
 #define OF_BAD_ADDR    ((u64)-1)
 
-- 
Frank Rowand 



[PATCH 00/16] of: overlay: validation checks, subsequent fixes

2018-10-04 Thread frowand . list
From: Frank Rowand 

Add checks to (1) overlay apply process and (2) memory freeing
triggered by overlay release.  The checks are intended to detect
possible memory leaks and invalid overlays.

The checks revealed bugs in existing code.  Fixed the bugs.

While fixing bugs, noted other issues, which are fixed in
separate patches.

*  Powerpc folks: I was not able to test the patches that
*  directly impact Powerpc systems that use dynamic
*  devicetree.  Please review that code carefully and
*  test.  The specific patches are: 03/16, 04/16, 07/16

FPGA folks:

  I made the validation checks that should result in an
  invalid live devicetree report "ERROR" and cause the overlay apply
  to fail.

  I made the memory leak validation tests report "WARNING" and allow
  the overlay apply to complete successfully.  Please let me know
  if you encounter the warnings.  There are at least two paths
  forward to deal with the cases that trigger the warning: (1) change
  the warning to an error and fail the overlay apply, or (2) find a
  way to detect the potential memory leaks and free the memory
  appropriately.

ALL people:

  The validations do _not_ address another major concern I have with
  releasing overlays, which is use after free errors.

Frank Rowand (16):
  of: overlay: add tests to validate kfrees from overlay removal
  of: overlay: add missing of_node_put() after add new node to changeset
  of: overlay: add missing of_node_get() in __of_attach_node_sysfs
  powerpc/pseries: add of_node_put() in dlpar_detach_node()
  of: overlay: use prop add changeset entry for property in new nodes
  of: overlay: do not duplicate properties from overlay for new nodes
  of: dynamic: change type of of_{at,de}tach_node() to void
  of: overlay: reorder fields in struct fragment
  of: overlay: validate overlay properties #address-cells and
#size-cells
  of: overlay: make all pr_debug() and pr_err() messages unique
  of: overlay: test case of two fragments adding same node
  of: overlay: check prevents multiple fragments add or delete same node
  of: overlay: check prevents multiple fragments touching same property
  of: unittest: remove unused of_unittest_apply_overlay() argument
  of: unittest: initialize args before calling of_irq_parse_one()
  of: unittest: find overlays[] entry by name instead of index

 arch/powerpc/platforms/pseries/dlpar.c |  15 +-
 arch/powerpc/platforms/pseries/reconfig.c  |   6 +-
 drivers/of/dynamic.c   |  41 +++-
 drivers/of/kobj.c  |   4 +-
 drivers/of/overlay.c   | 271 -
 drivers/of/unittest-data/Makefile  |   2 +
 .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 +++
 .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
 drivers/of/unittest-data/overlay_base.dts  |   1 +
 drivers/of/unittest.c  |  43 +++-
 include/linux/of.h |  19 +-
 11 files changed, 353 insertions(+), 101 deletions(-)
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
 create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts

-- 
Frank Rowand 



Re: [PATCH v4 4/6] clk: qoriq: Add clockgen support for lx2160a

2018-10-04 Thread Viresh Kumar
On 04-10-18, 06:33, Vabhav Sharma wrote:
> diff --git a/drivers/cpufreq/qoriq-cpufreq.c b/drivers/cpufreq/qoriq-cpufreq.c
> index 3d773f6..83921b7 100644
> --- a/drivers/cpufreq/qoriq-cpufreq.c
> +++ b/drivers/cpufreq/qoriq-cpufreq.c
> @@ -295,6 +295,7 @@ static const struct of_device_id node_matches[] __initconst = {
>   { .compatible = "fsl,ls1046a-clockgen", },
>   { .compatible = "fsl,ls1088a-clockgen", },
>   { .compatible = "fsl,ls2080a-clockgen", },
> + { .compatible = "fsl,lx2160a-clockgen", },
>   { .compatible = "fsl,p4080-clockgen", },
>   { .compatible = "fsl,qoriq-clockgen-1.0", },
>   { .compatible = "fsl,qoriq-clockgen-2.0", },

Acked-by: Viresh Kumar 

-- 
viresh


Re: [PATCH] memblock: stop using implicit alignment to SMP_CACHE_BYTES

2018-10-04 Thread Benjamin Herrenschmidt
On Fri, 2018-10-05 at 00:07 +0300, Mike Rapoport wrote:
> When memblock allocation APIs are called with align = 0, the alignment is
> implicitly set to SMP_CACHE_BYTES.
> 
> Replace all such uses of memblock APIs with the 'align' parameter explicitly
> set to SMP_CACHE_BYTES and stop implicit alignment assignment in the
> memblock internal allocation functions.
> 
> For the case when memblock APIs are used via helper functions, e.g. like
> iommu_arena_new_node() in Alpha, the helper functions were detected with
> Coccinelle's help and then manually examined and updated where appropriate.
> 
> The direct memblock APIs users were updated using the semantic patch below:

What is the purpose of this ? It sounds rather counter-intuitive...

Ben.




Re: powerpc/lib: fix book3s/32 boot failure due to code patching

2018-10-04 Thread Michael Ellerman
On Mon, 2018-10-01 at 12:21:10 UTC, Christophe Leroy wrote:
> Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init
> sections") accesses 'init_mem_is_free' flag too early, before the
> kernel is relocated. This provokes early boot failure (before the
> console is active).
> 
> As it is not necessary to do this verification that early, this
> patch moves the test into patch_instruction() instead of
> __patch_instruction().
> 
> This modification also has the advantage of avoiding unnecessary
> remappings.
> 
> Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections")
> Signed-off-by: Christophe Leroy 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/b45ba4a51cde29b2939365ef0c07ad

cheers


Re: lib/xz: Put CRC32_POLY_LE in xz_private.h

2018-10-04 Thread Michael Ellerman
On Fri, 2018-09-21 at 02:54:31 UTC, Joel Stanley wrote:
> This fixes a regression introduced by faa16bc404d72a5 ("lib: Use
> existing define with polynomial").
> 
> The cleanup added a dependency on include/linux, which broke the PowerPC
> boot wrapper/decompresser when KERNEL_XZ is enabled:
> 
>   BOOTCC  arch/powerpc/boot/decompress.o
>  In file included from arch/powerpc/boot/../../../lib/decompress_unxz.c:233,
>  from arch/powerpc/boot/decompress.c:42:
>  arch/powerpc/boot/../../../lib/xz/xz_crc32.c:18:10: fatal error:
>  linux/crc32poly.h: No such file or directory
>   #include <linux/crc32poly.h>
>            ^~~~~~~~~~~~~~~~~~~
> 
> The powerpc decompresser is a hairy corner of the kernel. Even while building
> a 64-bit kernel it needs to build a 32-bit binary and therefore avoid
> including files from include/linux.
> 
> This allows users of the xz library to avoid including headers from
> 'include/linux/' while still achieving the cleanup of the magic number.
> 
> Fixes: faa16bc404d72a5 ("lib: Use existing define with polynomial")
> Reported-by: Meelis Roos 
> Reported-by: kbuild test robot 
> Suggested-by: Christophe LEROY 
> Signed-off-by: Joel Stanley 
> Tested-by: Meelis Roos 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/242cdad873a75652f97c35aad61270

cheers


Re: [PATCH] powerpc/xmon/ppc-opc: Use ARRAY_SIZE macro

2018-10-04 Thread Tyrel Datwyler
On 10/04/2018 10:10 AM, Gustavo A. R. Silva wrote:
> Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element.
> 
> This code was detected with the help of Coccinelle.
> 
> Signed-off-by: Gustavo A. R. Silva 

Reviewed-by: Tyrel Datwyler 



[PATCH] memblock: stop using implicit alignment to SMP_CACHE_BYTES

2018-10-04 Thread Mike Rapoport
When memblock allocation APIs are called with align = 0, the alignment is
implicitly set to SMP_CACHE_BYTES.

Replace all such uses of memblock APIs with the 'align' parameter explicitly
set to SMP_CACHE_BYTES and stop implicit alignment assignment in the
memblock internal allocation functions.

For the case when memblock APIs are used via helper functions, e.g. like
iommu_arena_new_node() in Alpha, the helper functions were detected with
Coccinelle's help and then manually examined and updated where appropriate.

The direct memblock APIs users were updated using the semantic patch below:

@@
expression size, min_addr, max_addr, nid;
@@
(
|
- memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
nid)
|
- memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
+ memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
|
- memblock_alloc(size, 0)
+ memblock_alloc(size, SMP_CACHE_BYTES)
|
- memblock_alloc_raw(size, 0)
+ memblock_alloc_raw(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from(size, 0, min_addr)
+ memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_nopanic(size, 0)
+ memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low(size, 0)
+ memblock_alloc_low(size, SMP_CACHE_BYTES)
|
- memblock_alloc_low_nopanic(size, 0)
+ memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
|
- memblock_alloc_from_nopanic(size, 0, min_addr)
+ memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
|
- memblock_alloc_node(size, 0, nid)
+ memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
)

Suggested-by: Michal Hocko 
Signed-off-by: Mike Rapoport 
---
 arch/alpha/kernel/core_apecs.c|  3 ++-
 arch/alpha/kernel/core_lca.c  |  3 ++-
 arch/alpha/kernel/core_marvel.c   |  4 ++--
 arch/alpha/kernel/core_mcpcia.c   |  6 +++--
 arch/alpha/kernel/core_t2.c   |  2 +-
 arch/alpha/kernel/core_titan.c|  6 +++--
 arch/alpha/kernel/core_tsunami.c  |  6 +++--
 arch/alpha/kernel/core_wildfire.c |  6 +++--
 arch/alpha/kernel/pci-noop.c  |  4 ++--
 arch/alpha/kernel/pci.c   |  4 ++--
 arch/alpha/kernel/pci_iommu.c |  4 ++--
 arch/arm/kernel/setup.c   |  4 ++--
 arch/arm/mach-omap2/omap_hwmod.c  |  8 ---
 arch/arm64/kernel/setup.c |  2 +-
 arch/ia64/kernel/mca.c|  4 ++--
 arch/ia64/mm/tlb.c|  6 +++--
 arch/ia64/sn/kernel/io_common.c   |  4 +++-
 arch/ia64/sn/kernel/setup.c   |  5 ++--
 arch/m68k/sun3/sun3dvma.c |  2 +-
 arch/microblaze/mm/init.c |  2 +-
 arch/mips/kernel/setup.c  |  2 +-
 arch/powerpc/kernel/pci_32.c  |  3 ++-
 arch/powerpc/lib/alloc.c  |  2 +-
 arch/powerpc/mm/mmu_context_nohash.c  |  7 +++---
 arch/powerpc/platforms/powermac/nvram.c   |  2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  6 ++---
 arch/powerpc/sysdev/msi_bitmap.c  |  2 +-
 arch/um/drivers/net_kern.c|  2 +-
 arch/um/drivers/vector_kern.c |  2 +-
 arch/um/kernel/initrd.c   |  2 +-
 arch/unicore32/kernel/setup.c |  2 +-
 arch/x86/kernel/acpi/boot.c   |  2 +-
 arch/x86/kernel/apic/io_apic.c|  2 +-
 arch/x86/kernel/e820.c|  3 ++-
 arch/x86/platform/olpc/olpc_dt.c  |  2 +-
 arch/xtensa/platforms/iss/network.c   |  2 +-
 drivers/clk/ti/clk.c  |  2 +-
 drivers/firmware/memmap.c |  3 ++-
 drivers/macintosh/smu.c   |  2 +-
 drivers/of/of_reserved_mem.c  |  1 +
 include/linux/memblock.h  |  3 ++-
 init/main.c   | 13 +++
 kernel/power/snapshot.c   |  3 ++-
 lib/cpumask.c |  2 +-
 mm/memblock.c |  8 ---
 mm/page_alloc.c   |  6 +++--
 mm/percpu.c   | 39 ---
 mm/sparse.c   |  3 ++-
 48 files changed, 118 insertions(+), 95 deletions(-)

diff --git a/arch/alpha/kernel/core_apecs.c b/arch/alpha/kernel/core_apecs.c
index 1bf3eef..6df765f 100644
--- a/arch/alpha/kernel/core_apecs.c
+++ b/arch/alpha/kernel/core_apecs.c
@@ -346,7 +346,8 @@ apecs_init_arch(void)
 * Window 1 is direct access 1GB at 1GB
 * Window 2 is scatter-gather 8MB at 8MB (for isa)
 */
-   hose->sg_isa = iommu_arena_new(hose, 0x0080, 0x0080, 0);
+   hose->sg_isa = iommu_arena_new(hose, 0x0080, 0x0080,
+  SMP_CACHE_BYTES);
hose->sg_pci = 

[PATCH] powerpc/migration: Init nodes before remove memory

2018-10-04 Thread Michael Bringmann
In some LPAR migration scenarios, device-tree modifications are
made to the affinity of the memory in the system.  For instance,
it may occur that memory is installed to nodes 0,3 on a source
system, and to nodes 0,2 on a target system.  Node 2 may not have
been initialized/allocated on the target system.

During normal DLPAR memory 'hot add' operations, uninitialized nodes
are initialized/allocated prior to use.  After migration, if a
RTAS PRRN memory remove operation is made on a memory block that
was in node 3 on the source system, then try_offline_node tries
to remove it from node 2 on the target assuming that it was in
node 2 on the source system and that node 2 had been setup.
The NODE_DATA(2) block is not initialized on the target, and
there is no validation check to prevent the use of a NULL pointer.
Call traces such as the following may be observed:

pseries-hotplug-mem: Attempting to update LMB, drc index 8002
Offlined Pages 4096
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
Workqueue: pseries hotplug workque pseries_hp_work_fn
...
NIP [c02bc088] try_offline_node+0x48/0x1e0
LR [c02e0b84] remove_memory+0xb4/0xf0
Call Trace:
[c002bbee7a30] [c002bbee7a70] 0xc002bbee7a70 (unreliable)
[c002bbee7a70] [c02e0b84] remove_memory+0xb4/0xf0
[c002bbee7ab0] [c0097784] dlpar_remove_lmb+0xb4/0x160
[c002bbee7af0] [c0097f38] dlpar_memory+0x328/0xcb0
[c002bbee7ba0] [c00906d0] handle_dlpar_errorlog+0xc0/0x130
[c002bbee7c10] [c00907d4] pseries_hp_work_fn+0x94/0xa0
[c002bbee7c40] [c00e1cd0] process_one_work+0x1a0/0x4e0
[c002bbee7cd0] [c00e21b0] worker_thread+0x1a0/0x610
[c002bbee7d80] [c00ea458] kthread+0x128/0x150
[c002bbee7e30] [c000982c] ret_from_kernel_thread+0x5c/0xb0

A similar problem of moving memory to an uninitialized node has also
been observed on systems where multiple PRRN events occur prior to
a complete update of the device-tree.

This patch attempts to detect and initialize an uninitialized node
in the memory_add_physaddr_to_nid -> hot_add_scn_to_nid functions
used by powerpc DLPAR memory operations to compute the node of a
memory address based on the device-tree affinity configuration
after migration.  This occurs before try_offline_node is used by
remove_memory.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/mm/numa.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0ade0a1..d6f6e24 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1020,6 +1020,13 @@ int hot_add_scn_to_nid(unsigned long scn_addr)
if (nid < 0 || !node_possible(nid))
nid = first_online_node;
 
+   if (NODE_DATA(nid) == NULL) {
+   if (try_online_node(nid))
+   nid = first_online_node;
+   else
+   pr_debug("new nid %d for %#010lx\n", nid, scn_addr);
+   }
+
return nid;
 }
 



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread Michal Suchánek
On Thu, 4 Oct 2018 17:45:13 +0200
David Hildenbrand  wrote:

> On 04/10/2018 17:28, Michal Suchánek wrote:

> > 
> > The state of the art is to determine what to do with hotplugged
> > memory in userspace based on platform and virtualization type.  
> 
> Exactly.
> 
> > 
> > Changing the default to depend on the driver that added the memory
> > rather than platform type should solve the issue of VMs growing
> > different types of memory device emulation.  
> 
> Yes, my original proposal (this patch) was to handle it in the kernel
> for known types. But as we learned, there might be some use cases that
> might still require to make a decision in user space.
> 
> So providing the user space either with some type hint (auto-online
> vs. standby) or the driver that added it (system vs. hyper-v ...)
> would solve the issue.

Is that not available in the udev event?

Thanks

Michal


[PATCH] powerpc/xmon/ppc-opc: Use ARRAY_SIZE macro

2018-10-04 Thread Gustavo A. R. Silva
Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva 
---
 arch/powerpc/xmon/ppc-opc.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/xmon/ppc-opc.c b/arch/powerpc/xmon/ppc-opc.c
index ac2b55b..f3f57a1 100644
--- a/arch/powerpc/xmon/ppc-opc.c
+++ b/arch/powerpc/xmon/ppc-opc.c
@@ -966,8 +966,7 @@ const struct powerpc_operand powerpc_operands[] =
   { 0xff, 11, NULL, NULL, PPC_OPERAND_SIGNOPT },
 };
 
-const unsigned int num_powerpc_operands = (sizeof (powerpc_operands)
-  / sizeof (powerpc_operands[0]));
+const unsigned int num_powerpc_operands = ARRAY_SIZE(powerpc_operands);
 
 /* The functions used to insert and extract complicated operands.  */
 
@@ -6980,8 +6979,7 @@ const struct powerpc_opcode powerpc_opcodes[] = {
 {"fcfidu.",XRC(63,974,1),  XRA_MASK, POWER7|PPCA2, PPCVLE, {FRT, 
FRB}},
 };
 
-const int powerpc_num_opcodes =
-  sizeof (powerpc_opcodes) / sizeof (powerpc_opcodes[0]);
+const int powerpc_num_opcodes = ARRAY_SIZE(powerpc_opcodes);
 
 /* The VLE opcode table.
 
@@ -7219,8 +7217,7 @@ const struct powerpc_opcode vle_opcodes[] = {
 {"se_bl",  BD8(58,0,1),BD8_MASK,   PPCVLE, 0,  {B8}},
 };
 
-const int vle_num_opcodes =
-  sizeof (vle_opcodes) / sizeof (vle_opcodes[0]);
+const int vle_num_opcodes = ARRAY_SIZE(vle_opcodes);
 
 /* The macro table.  This is only used by the assembler.  */
 
@@ -7288,5 +7285,4 @@ const struct powerpc_macro powerpc_macros[] = {
 {"e_clrlslwi",4, PPCVLE, "e_rlwinm %0,%1,%3,(%2)-(%3),31-(%3)"},
 };
 
-const int powerpc_num_macros =
-  sizeof (powerpc_macros) / sizeof (powerpc_macros[0]);
+const int powerpc_num_macros = ARRAY_SIZE(powerpc_macros);
-- 
2.7.4



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread David Hildenbrand
On 04/10/2018 17:28, Michal Suchánek wrote:
> On Thu, 4 Oct 2018 10:13:48 +0200
> David Hildenbrand  wrote:
> 
> ok, so what is the problem here?
> 
> Handling the hotplug in userspace through udev may be suboptimal and
> kernel handling might be faster but that's orthogonal to the problem at
> hand.

Yes, that one to solve is a different story.

> 
> The state of the art is to determine what to do with hotplugged memory
> in userspace based on platform and virtualization type.

Exactly.

> 
> Changing the default to depend on the driver that added the memory
> rather than platform type should solve the issue of VMs growing
> different types of memory device emulation.

Yes, my original proposal (this patch) was to handle it in the kernel
for known types. But as we learned, there might be some use cases that
might still require to make a decision in user space.

So providing the user space either with some type hint (auto-online vs.
standby) or the driver that added it (system vs. hyper-v ...) would
solve the issue.

> 
> Am I missing something?
> 

No, that's it. Thanks!

> Thanks
> 
> Michal
> 


-- 

Thanks,

David / dhildenb


[PATCH v4 9/9] powerpc: clean stack pointers naming

2018-10-04 Thread Christophe Leroy
Some stack pointers used to also be thread_info pointers
and were called tp. Now that they are only stack pointers,
rename them sp.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c  | 17 +++--
 arch/powerpc/kernel/setup_64.c | 20 ++--
 2 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 62cfccf4af89..754f0efc507b 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -659,21 +659,21 @@ void __do_irq(struct pt_regs *regs)
 void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
-   void *curtp, *irqtp, *sirqtp;
+   void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   curtp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
-   irqtp = hardirq_ctx[raw_smp_processor_id()];
-   sirqtp = softirq_ctx[raw_smp_processor_id()];
+   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   irqsp = hardirq_ctx[raw_smp_processor_id()];
+   sirqsp = softirq_ctx[raw_smp_processor_id()];
 
/* Already there ? */
-   if (unlikely(curtp == irqtp || curtp == sirqtp)) {
+   if (unlikely(cursp == irqsp || cursp == sirqsp)) {
__do_irq(regs);
set_irq_regs(old_regs);
return;
}
/* Switch stack and call */
-   call_do_irq(regs, irqtp);
+   call_do_irq(regs, irqsp);
 
set_irq_regs(old_regs);
 }
@@ -732,10 +732,7 @@ void irq_ctx_init(void)
 
 void do_softirq_own_stack(void)
 {
-   void *irqtp;
-
-   irqtp = softirq_ctx[smp_processor_id()];
-   call_do_softirq(irqtp);
+   call_do_softirq(softirq_ctx[smp_processor_id()]);
 }
 
 irq_hw_number_t virq_to_hw(unsigned int virq)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6792e9c90689..4912ec0320b8 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -717,22 +717,22 @@ void __init emergency_stack_init(void)
limit = min(ppc64_bolted_size(), ppc64_rma_size);
 
for_each_possible_cpu(i) {
-   void *ti;
+   void *sp;
 
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->emergency_sp = sp + THREAD_SIZE;
 
 #ifdef CONFIG_PPC_BOOK3S_64
/* emergency stack for NMI exception handling. */
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->nmi_emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->nmi_emergency_sp = sp + THREAD_SIZE;
 
/* emergency stack for machine check exception handling. */
-   ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
-   paca_ptrs[i]->mc_emergency_sp = ti + THREAD_SIZE;
+   sp = alloc_stack(limit, i);
+   memset(sp, 0, THREAD_SIZE);
+   paca_ptrs[i]->mc_emergency_sp = sp + THREAD_SIZE;
 #endif
}
 }
-- 
2.13.3



[PATCH v4 8/9] powerpc/64: Remove CURRENT_THREAD_INFO

2018-10-04 Thread Christophe Leroy
Now that current_thread_info is located at the beginning of the 'current'
task struct, the CURRENT_THREAD_INFO macro is not really needed any more.

This patch replaces it with loads of the value at PACACURRENT(r13).

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/exception-64s.h   |  4 ++--
 arch/powerpc/include/asm/thread_info.h |  4 
 arch/powerpc/kernel/entry_64.S | 10 +-
 arch/powerpc/kernel/exceptions-64e.S   |  2 +-
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/kernel/idle_book3e.S  |  2 +-
 arch/powerpc/kernel/idle_power4.S  |  2 +-
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  6 +++---
 8 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index a86fead0..ca3af3e9015e 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -680,7 +680,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 
 #define RUNLATCH_ON\
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r3, r1);\
+   ld  r3, PACACURRENT(r13);   \
ld  r4,TI_LOCAL_FLAGS(r3);  \
andi.   r0,r4,_TLF_RUNLATCH;\
beqlppc64_runlatch_on_trampoline;   \
@@ -730,7 +730,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CTRL)
 #ifdef CONFIG_PPC_970_NAP
 #define FINISH_NAP \
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r9,TI_LOCAL_FLAGS(r11); \
andi.   r10,r9,_TLF_NAPPING;\
bnelpower4_fixup_nap;   \
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 361bb45b8990..2ee9e248c933 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -17,10 +17,6 @@
 
 #define THREAD_SIZE    (1 << THREAD_SHIFT)
 
-#ifdef CONFIG_PPC64
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, PACACURRENT(r13))
-#endif
-
 #ifndef __ASSEMBLY__
 #include 
 #include 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 6fce0f8fd8c4..06d9a7c084a1 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -158,7 +158,7 @@ system_call: /* label this so stack traces look sane */
li  r10,IRQS_ENABLED
std r10,SOFTE(r1)
 
-   CURRENT_THREAD_INFO(r11, r1)
+   ld  r11, PACACURRENT(r13)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne .Lsyscall_dotrace   /* does not return */
@@ -205,7 +205,7 @@ system_call: /* label this so stack traces look sane */
ld  r3,RESULT(r1)
 #endif
 
-   CURRENT_THREAD_INFO(r12, r1)
+   ld  r12, PACACURRENT(r13)
 
ld  r8,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3S
@@ -336,7 +336,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 
/* Repopulate r9 and r10 for the syscall path */
addi    r9,r1,STACK_FRAME_OVERHEAD
-   CURRENT_THREAD_INFO(r10, r1)
+   ld  r10, PACACURRENT(r13)
ld  r10,TI_FLAGS(r10)
 
cmpldi  r0,NR_syscalls
@@ -735,7 +735,7 @@ _GLOBAL(ret_from_except_lite)
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
 
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r3,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3E
ld  r10,PACACURRENT(r13)
@@ -849,7 +849,7 @@ resume_kernel:
 1: bl  preempt_schedule_irq
 
/* Re-test flags and eventually loop */
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r4,TI_FLAGS(r9)
andi.   r0,r4,_TIF_NEED_RESCHED
bne 1b
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 231d066b4a3d..dfafcd0af009 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -469,7 +469,7 @@ exc_##n##_bad_stack: \
  * interrupts happen before the wait instruction.
  */
 #define CHECK_NAPPING() \
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r10,TI_LOCAL_FLAGS(r11);\
andi.   r9,r10,_TLF_NAPPING;\
beq+1f; \
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index b9239dbf6d59..f776f30ecfcc 100644

[PATCH v4 7/9] powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU

2018-10-04 Thread Christophe Leroy
Now that thread_info is part of task_struct, its address is in r2,
so the CURRENT_THREAD_INFO() macro is useless. This patch removes it.

At the same time, as the 'cpu' field is no longer in thread_info,
this patch renames the TI_CPU offset to TASK_CPU.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Makefile  |  2 +-
 arch/powerpc/include/asm/thread_info.h |  2 --
 arch/powerpc/kernel/asm-offsets.c  |  2 +-
 arch/powerpc/kernel/entry_32.S | 43 --
 arch/powerpc/kernel/epapr_hcalls.S |  5 ++--
 arch/powerpc/kernel/head_fsl_booke.S   |  5 ++--
 arch/powerpc/kernel/idle_6xx.S |  8 +++
 arch/powerpc/kernel/idle_e500.S|  8 +++
 arch/powerpc/kernel/misc_32.S  |  3 +--
 arch/powerpc/mm/hash_low_32.S  | 14 ---
 arch/powerpc/sysdev/6xx-suspend.S  |  5 ++--
 11 files changed, 35 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 02e7ca1c15d4..f1e2d7f7b022 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -426,7 +426,7 @@ ifdef CONFIG_SMP
 prepare: task_cpu_prepare
 
 task_cpu_prepare: prepare0
-   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") print $$3;}' include/generated/asm-offsets.h))
+   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TASK_CPU") print $$3;}' include/generated/asm-offsets.h))
 endif
 
 # Use the file '.tmp_gas_check' for binutils tests, as gas won't output
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 61c8747cd926..361bb45b8990 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -19,8 +19,6 @@
 
 #ifdef CONFIG_PPC64
 #define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, PACACURRENT(r13))
-#else
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(mr dest, r2)
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 768ce602d624..31be6eb9c0d4 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -97,7 +97,7 @@ int main(void)
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
 #ifdef CONFIG_SMP
-   OFFSET(TI_CPU, task_struct, cpu);
+   OFFSET(TASK_CPU, task_struct, cpu);
 #endif
 
 #ifdef CONFIG_LIVEPATCH
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index bd3b146e18a3..d0c546ce387e 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -168,8 +168,7 @@ transfer_to_handler:
tophys(r11,r11)
addi    r11,r11,global_dbcr0@l
 #ifdef CONFIG_SMP
-   CURRENT_THREAD_INFO(r9, r1)
-   lwz r9,TI_CPU(r9)
+   lwz r9,TASK_CPU(r2)
slwi    r9,r9,3
add r11,r11,r9
 #endif
@@ -180,8 +179,7 @@ transfer_to_handler:
stw r12,4(r11)
 #endif
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9, r9)
+   tophys(r9, r2)
ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
 #endif
 
@@ -195,8 +193,7 @@ transfer_to_handler:
ble-stack_ovf   /* then the kernel stack overflowed */
 5:
 #if defined(CONFIG_6xx) || defined(CONFIG_E500)
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9,r9)   /* check local flags */
+   tophys(r9,r2)   /* check local flags */
lwz r12,TI_LOCAL_FLAGS(r9)
mtcrf   0x01,r12
bt- 31-TLF_NAPPING,4f
@@ -345,8 +342,7 @@ _GLOBAL(DoSyscall)
mtmsr   r11
 1:
 #endif /* CONFIG_TRACE_IRQFLAGS */
-   CURRENT_THREAD_INFO(r10, r1)
-   lwz r11,TI_FLAGS(r10)
+   lwz r11,TI_FLAGS(r2)
andi.   r11,r11,_TIF_SYSCALL_DOTRACE
bne-syscall_dotrace
 syscall_dotrace_cont:
@@ -379,13 +375,12 @@ ret_from_syscall:
lwz r3,GPR3(r1)
 #endif
mr  r6,r3
-   CURRENT_THREAD_INFO(r12, r1)
/* disable interrupts so current_thread_info()->flags can't change */
LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */
/* Note: We don't bother telling lockdep about it */
SYNC
MTMSRD(r10)
-   lwz r9,TI_FLAGS(r12)
+   lwz r9,TI_FLAGS(r2)
li  r8,-MAX_ERRNO
	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
bne-syscall_exit_work
@@ -432,8 +427,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
andi.   r4,r8,MSR_PR
beq 3f
-   CURRENT_THREAD_INFO(r4, r1)
-   ACCOUNT_CPU_USER_EXIT(r4, r5, r7)
+   ACCOUNT_CPU_USER_EXIT(r2, r5, r7)
 3:
 #endif
lwz r4,_LINK(r1)
@@ -526,7 +520,7 @@ syscall_exit_work:
/* Clear per-syscall TIF flags if any are set.  */
 
li  r11,_TIF_PERSYSCALL_MASK
-   addir12,r12,TI_FLAGS
+   addi

[PATCH v4 6/9] powerpc: 'current_set' is now a table of task_struct pointers

2018-10-04 Thread Christophe Leroy
The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.

Now the pointers of the 'current_set' table are both pointers
to task_struct and pointers to thread_info.

As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes their
type to task_struct pointers, and renames secondary_ti to
secondary_current.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 ++--
 arch/powerpc/kernel/head_32.S |  6 +++---
 arch/powerpc/kernel/head_44x.S|  4 ++--
 arch/powerpc/kernel/head_fsl_booke.S  |  4 ++--
 arch/powerpc/kernel/smp.c | 10 --
 5 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 9bc98c239305..ab0541f9da42 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -23,8 +23,8 @@
 #include 
 
 /* SMP */
-extern struct thread_info *current_set[NR_CPUS];
-extern struct thread_info *secondary_ti;
+extern struct task_struct *current_set[NR_CPUS];
+extern struct task_struct *secondary_current;
 void start_secondary(void *unused);
 
 /* kexec */
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 44dfd73b2a62..ba0341bd5a00 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -842,9 +842,9 @@ __secondary_start:
 #endif /* CONFIG_6xx */
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   tophys(r1,r1)
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   tophys(r2,r2)
+   lwz r2,secondary_current@l(r2)
tophys(r1,r2)
lwz r1,TASK_STACK(r1)
 
diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index 2c7e90f36358..48e4de4dfd0c 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -1021,8 +1021,8 @@ _GLOBAL(start_secondary_47x)
/* Now we can get our task struct and real stack pointer */
 
/* Get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* Current stack pointer */
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index b8a2b789677e..0d27bfff52dd 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -1076,8 +1076,8 @@ __secondary_start:
bl  call_setup_cpu
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* stack */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f22fcbeb9898..00193643f0da 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -74,7 +74,7 @@
 static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 #endif
 
-struct thread_info *secondary_ti;
+struct task_struct *secondary_current;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
@@ -644,7 +644,7 @@ void smp_send_stop(void)
 }
 #endif /* CONFIG_NMI_IPI */
 
-struct thread_info *current_set[NR_CPUS];
+struct task_struct *current_set[NR_CPUS];
 
 static void smp_store_cpu_info(int id)
 {
@@ -724,7 +724,7 @@ void smp_prepare_boot_cpu(void)
paca_ptrs[boot_cpuid]->__current = current;
 #endif
set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
-   current_set[boot_cpuid] = task_thread_info(current);
+   current_set[boot_cpuid] = current;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -809,15 +809,13 @@ static bool secondaries_inhibited(void)
 
 static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
 {
-   struct thread_info *ti = task_thread_info(idle);
-
 #ifdef CONFIG_PPC64
paca_ptrs[cpu]->__current = idle;
paca_ptrs[cpu]->kstack = (unsigned long)task_stack_page(idle) +
  THREAD_SIZE - STACK_FRAME_OVERHEAD;
 #endif
idle->cpu = cpu;
-   secondary_ti = current_set[cpu] = ti;
+   secondary_current = current_set[cpu] = idle;
 }
 
 int __cpu_up(unsigned int cpu, struct task_struct *tidle)
-- 
2.13.3



[PATCH v4 5/9] powerpc: regain entire stack space

2018-10-04 Thread Christophe Leroy
thread_info is no longer in the stack, so the entire stack
can now be used.

In the meantime, with the previous patch, pointers to the stacks
are no longer pointers to thread_info, so this patch changes them
to void*.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/irq.h   | 10 +-
 arch/powerpc/include/asm/processor.h |  3 +--
 arch/powerpc/kernel/asm-offsets.c|  1 -
 arch/powerpc/kernel/entry_32.S   | 14 --
 arch/powerpc/kernel/irq.c| 19 +--
 arch/powerpc/kernel/misc_32.S|  6 ++
 arch/powerpc/kernel/process.c|  9 +++--
 arch/powerpc/kernel/setup_64.c   |  8 
 8 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 2efbae8d93be..966ddd4d2414 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -48,9 +48,9 @@ struct pt_regs;
  * Per-cpu stacks for handling critical, debug and machine check
  * level interrupts.
  */
-extern struct thread_info *critirq_ctx[NR_CPUS];
-extern struct thread_info *dbgirq_ctx[NR_CPUS];
-extern struct thread_info *mcheckirq_ctx[NR_CPUS];
+extern void *critirq_ctx[NR_CPUS];
+extern void *dbgirq_ctx[NR_CPUS];
+extern void *mcheckirq_ctx[NR_CPUS];
 extern void exc_lvl_ctx_init(void);
 #else
 #define exc_lvl_ctx_init()
@@ -59,8 +59,8 @@ extern void exc_lvl_ctx_init(void);
 /*
  * Per-cpu stacks for handling hard and soft interrupts.
  */
-extern struct thread_info *hardirq_ctx[NR_CPUS];
-extern struct thread_info *softirq_ctx[NR_CPUS];
+extern void *hardirq_ctx[NR_CPUS];
+extern void *softirq_ctx[NR_CPUS];
 
 extern void irq_ctx_init(void);
 void call_do_softirq(void *sp);
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index b225c7f7c5a4..e763342265a2 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -331,8 +331,7 @@ struct thread_struct {
 #define ARCH_MIN_TASKALIGN 16
 
 #define INIT_SP		(sizeof(init_stack) + (unsigned long) &init_stack)
-#define INIT_SP_LIMIT \
-	(_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)&init_stack)
+#define INIT_SP_LIMIT	((unsigned long)&init_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 833d189df04c..768ce602d624 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -93,7 +93,6 @@ int main(void)
DEFINE(NMI_MASK, NMI_MASK);
OFFSET(TASKTHREADPPR, task_struct, thread.ppr);
 #else
-   DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index fa7a69ffb37a..bd3b146e18a3 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -97,14 +97,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,_SRR1(r11)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,SAVED_KSP_LIMIT(r11)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
@@ -121,14 +118,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,crit_srr1@l(0)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,saved_ksp_limit@l(0)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 3fdb6b6973cf..62cfccf4af89 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -618,9 +618,8 @@ static inline void check_stack_overflow(void)
sp = current_stack_pointer() & (THREAD_SIZE-1);
 
/* check for stack overflow: is there less than 2KB free? */
-   if (unlikely(sp < (sizeof(struct thread_info) + 2048))) {
-   pr_err("do_IRQ: stack overflow: %ld\n",
-   sp - sizeof(struct thread_info));
+   if (unlikely(sp < 2048)) {
+   pr_err("do_IRQ: stack overflow: %ld\n", sp);
dump_stack();
}
 #endif
@@ -660,7 +659,7 @@ void __do_irq(struct pt_regs *regs)
 void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs 

[PATCH v4 4/9] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2018-10-04 Thread Christophe Leroy
This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

This has the following consequences:
- thread_info is now located at the beginning of task_struct.
- The 'cpu' field is now in task_struct, and only exists when
CONFIG_SMP is active.
- thread_info no longer has the 'task' field.

This patch:
- Removes all copying of the thread_info struct when the stack changes.
- Changes the CURRENT_THREAD_INFO() macro to point to current.
- Selects CONFIG_THREAD_INFO_IN_TASK.
- Modifies raw_smp_processor_id() to get ->cpu from current without
including linux/sched.h to avoid circular inclusion and without
including asm/asm-offsets.h to avoid symbol names duplication
between ASM constants and C constants.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/Makefile  |  8 +-
 arch/powerpc/include/asm/ptrace.h  |  2 +-
 arch/powerpc/include/asm/smp.h | 17 +++-
 arch/powerpc/include/asm/thread_info.h | 17 ++--
 arch/powerpc/kernel/asm-offsets.c  |  7 +++--
 arch/powerpc/kernel/entry_32.S |  9 +++
 arch/powerpc/kernel/exceptions-64e.S   | 11 
 arch/powerpc/kernel/head_32.S  |  6 ++---
 arch/powerpc/kernel/head_44x.S |  4 +--
 arch/powerpc/kernel/head_64.S  |  1 +
 arch/powerpc/kernel/head_booke.h   |  8 +-
 arch/powerpc/kernel/head_fsl_booke.S   |  7 +++--
 arch/powerpc/kernel/irq.c  | 47 +-
 arch/powerpc/kernel/kgdb.c | 28 
 arch/powerpc/kernel/machine_kexec_64.c |  6 ++---
 arch/powerpc/kernel/setup-common.c |  2 +-
 arch/powerpc/kernel/setup_64.c | 21 ---
 arch/powerpc/kernel/smp.c  |  2 +-
 19 files changed, 51 insertions(+), 153 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 602eea723624..3b958cd4e284 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -238,6 +238,7 @@ config PPC
select RTC_LIB
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
+   select THREAD_INFO_IN_TASK
select VIRT_TO_BUS  if !PPC64
#
# Please keep this list sorted alphabetically.
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 81552c7b46eb..02e7ca1c15d4 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -422,6 +422,13 @@ else
 endif
 endif
 
+ifdef CONFIG_SMP
+prepare: task_cpu_prepare
+
+task_cpu_prepare: prepare0
+	$(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") print $$3;}' include/generated/asm-offsets.h))
+endif
+
 # Use the file '.tmp_gas_check' for binutils tests, as gas won't output
 # to stdout and these checks are run even on install targets.
 TOUT   := .tmp_gas_check
@@ -439,4 +446,3 @@ checkbin:
 
 
 CLEAN_FILES += $(TOUT)
-
diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
index 447cbd1bee99..3a7e5561630b 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -120,7 +120,7 @@ extern int ptrace_put_reg(struct task_struct *task, int regno,
  unsigned long data);
 
 #define current_pt_regs() \
-	((struct pt_regs *)((unsigned long)current_thread_info() + THREAD_SIZE) - 1)
+	((struct pt_regs *)((unsigned long)task_stack_page(current) + THREAD_SIZE) - 1)
 /*
  * We use the least-significant bit of the trap field to indicate
  * whether we have saved the full set of registers, or only a
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 95b66a0c639b..93a8cd120663 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -83,7 +83,22 @@ int is_cpu_dead(unsigned int cpu);
 /* 32-bit */
 extern int smp_hw_index[];
 
-#define raw_smp_processor_id() (current_thread_info()->cpu)
+/*
+ * This is particularly ugly: it appears we can't actually get the definition
+ * of task_struct here, but we need access to the CPU this task is running on.
+ * Instead of using task_struct we're using _TASK_CPU which is extracted from
+ * asm-offsets.h by kbuild to get the current processor ID.
+ *
+ * This also needs to be safeguarded when building asm-offsets.s because at
+ * that time _TASK_CPU is not defined yet. It could have been guarded by
+ * _TASK_CPU itself, but we want the build to fail if _TASK_CPU is missing
+ * when building something else than asm-offsets.s
+ */
+#ifdef GENERATING_ASM_OFFSETS
+#define raw_smp_processor_id() (0)
+#else
+#define raw_smp_processor_id() (*(unsigned int *)((void *)current + _TASK_CPU))
+#endif
 

[PATCH v4 3/9] powerpc: Prepare for moving thread_info into task_struct

2018-10-04 Thread Christophe Leroy
This patch cleans the powerpc kernel before activating
CONFIG_THREAD_INFO_IN_TASK:
- The purpose of the pointer given to call_do_softirq() and
call_do_irq() is to point to the new stack ==> change it to void* and
rename it 'sp'.
- Don't use CURRENT_THREAD_INFO() to locate the stack.
- Fix a few comments.
- Replace current_thread_info()->task by current.
- Remove unnecessary casts to thread_info, as they'll become invalid
once thread_info is not in the stack anymore.
- Rename THREAD_INFO to TASK_STACK: as it is in fact the offset of the
pointer to the stack in task_struct, this pointer will not be impacted
by the move of thread_info.
- Make TASK_STACK available to PPC64. PPC64 will need it to get the
stack pointer from current once thread_info has been moved.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/irq.h   |  4 ++--
 arch/powerpc/include/asm/livepatch.h |  2 +-
 arch/powerpc/include/asm/processor.h |  4 ++--
 arch/powerpc/include/asm/reg.h   |  2 +-
 arch/powerpc/kernel/asm-offsets.c|  2 +-
 arch/powerpc/kernel/entry_32.S   |  2 +-
 arch/powerpc/kernel/entry_64.S   |  2 +-
 arch/powerpc/kernel/head_32.S|  4 ++--
 arch/powerpc/kernel/head_40x.S   |  4 ++--
 arch/powerpc/kernel/head_44x.S   |  2 +-
 arch/powerpc/kernel/head_8xx.S   |  2 +-
 arch/powerpc/kernel/head_booke.h |  4 ++--
 arch/powerpc/kernel/head_fsl_booke.S |  4 ++--
 arch/powerpc/kernel/irq.c|  2 +-
 arch/powerpc/kernel/misc_32.S|  4 ++--
 arch/powerpc/kernel/process.c|  6 +++---
 arch/powerpc/kernel/setup_32.c   | 15 +--
 arch/powerpc/kernel/smp.c|  4 +++-
 18 files changed, 33 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index ee39ce56b2a2..2efbae8d93be 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -63,8 +63,8 @@ extern struct thread_info *hardirq_ctx[NR_CPUS];
 extern struct thread_info *softirq_ctx[NR_CPUS];
 
 extern void irq_ctx_init(void);
-extern void call_do_softirq(struct thread_info *tp);
-extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
+void call_do_softirq(void *sp);
+void call_do_irq(struct pt_regs *regs, void *sp);
 extern void do_IRQ(struct pt_regs *regs);
 extern void __init init_IRQ(void);
 extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/livepatch.h b/arch/powerpc/include/asm/livepatch.h
index 47a03b9b528b..818451bf629c 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -49,7 +49,7 @@ static inline void klp_init_thread_info(struct thread_info *ti)
ti->livepatch_sp = (unsigned long *)(ti + 1) + 1;
 }
 #else
-static void klp_init_thread_info(struct thread_info *ti) { }
+static inline void klp_init_thread_info(struct thread_info *ti) { }
 #endif /* CONFIG_LIVEPATCH */
 
 #endif /* _ASM_POWERPC_LIVEPATCH_H */
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 13589274fe9b..b225c7f7c5a4 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -40,7 +40,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
-#include 
+#include 
 #include 
 #include 
 
@@ -332,7 +332,7 @@ struct thread_struct {
 
 #define INIT_SP		(sizeof(init_stack) + (unsigned long) &init_stack)
 #define INIT_SP_LIMIT \
-	(_ALIGN_UP(sizeof(init_thread_info), 16) + (unsigned long) &init_stack)
+	(_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)&init_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 640a4d818772..d2528a0b2f5b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1058,7 +1058,7 @@
  * - SPRG9 debug exception scratch
  *
  * All 32-bit:
- * - SPRG3 current thread_info pointer
+ * - SPRG3 current thread_struct physical addr pointer
  *(virtual on BookE, physical on others)
  *
  * 32-bit classic:
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index a6d70fd2e499..c583a02e5a21 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -91,10 +91,10 @@ int main(void)
DEFINE(NMI_MASK, NMI_MASK);
OFFSET(TASKTHREADPPR, task_struct, thread.ppr);
 #else
-   OFFSET(THREAD_INFO, task_struct, stack);
DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #endif /* CONFIG_PPC64 */
+   OFFSET(TASK_STACK, task_struct, stack);
 
 #ifdef CONFIG_LIVEPATCH
OFFSET(TI_livepatch_sp, thread_info, livepatch_sp);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 77decded1175..7ea1d71f4546 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -1166,7 +1166,7 @@ 

[PATCH v4 2/9] powerpc: Only use task_struct 'cpu' field on SMP

2018-10-04 Thread Christophe Leroy
When moving to CONFIG_THREAD_INFO_IN_TASK, the thread_info 'cpu' field
gets moved into task_struct and only defined when CONFIG_SMP is set.

This patch ensures that TI_CPU is only used when CONFIG_SMP is set and
that task_struct 'cpu' field is not used directly out of SMP code.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/head_fsl_booke.S | 2 ++
 arch/powerpc/kernel/misc_32.S| 4 
 arch/powerpc/xmon/xmon.c | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index e2750b856c8f..05b574f416b3 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -243,8 +243,10 @@ set_ivor:
li  r0,0
	stwu	r0,THREAD_SIZE-STACK_FRAME_OVERHEAD(r1)
 
+#ifdef CONFIG_SMP
CURRENT_THREAD_INFO(r22, r1)
stw r24, TI_CPU(r22)
+#endif
 
bl  early_init
 
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 695b24a2d954..2f0fe8bfc078 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -183,10 +183,14 @@ _GLOBAL(low_choose_750fx_pll)
or  r4,r4,r5
mtspr   SPRN_HID1,r4
 
+#ifdef CONFIG_SMP
/* Store new HID1 image */
CURRENT_THREAD_INFO(r6, r1)
lwz r6,TI_CPU(r6)
	slwi	r6,r6,2
+#else
+   li  r6, 0
+#endif
addis   r6,r6,nap_save_hid1@ha
stw r4,nap_save_hid1@l(r6)
 
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index c70d17c9a6ba..1731793e1277 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2986,7 +2986,7 @@ static void show_task(struct task_struct *tsk)
printf("%px %016lx %6d %6d %c %2d %s\n", tsk,
tsk->thread.ksp,
tsk->pid, tsk->parent->pid,
-   state, task_thread_info(tsk)->cpu,
+   state, task_cpu(tsk),
tsk->comm);
 }
 
-- 
2.13.3



[PATCH v4 1/9] book3s/64: avoid circular header inclusion in mmu-hash.h

2018-10-04 Thread Christophe Leroy
When activating CONFIG_THREAD_INFO_IN_TASK, linux/sched.h
includes asm/current.h. This generates a circular dependency.
To avoid that, asm/processor.h shall not be included in mmu-hash.h

In order to do that, this patch moves the information from
asm/processor.h required by mmu-hash.h into a new header called
asm/task_size_user64.h.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/processor.h  | 34 +-
 arch/powerpc/include/asm/task_size_user64.h   | 42 +++
 arch/powerpc/kvm/book3s_hv_hmi.c  |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/asm/task_size_user64.h

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index e0e4ce8f77d6..02955d867067 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -23,7 +23,7 @@
  */
 #include 
 #include 
-#include <asm/processor.h>
+#include <asm/task_size_user64.h>
 #include 
 
 /*
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 52fadded5c1e..13589274fe9b 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -101,40 +101,8 @@ void release_thread(struct task_struct *);
 #endif
 
 #ifdef CONFIG_PPC64
-/*
- * 64-bit user address space can have multiple limits
- * For now supported values are:
- */
-#define TASK_SIZE_64TB  (0x0000400000000000UL)
-#define TASK_SIZE_128TB (0x0000800000000000UL)
-#define TASK_SIZE_512TB (0x0002000000000000UL)
-#define TASK_SIZE_1PB   (0x0004000000000000UL)
-#define TASK_SIZE_2PB   (0x0008000000000000UL)
-/*
- * With 52 bits in the address we can support
- * upto 4PB of range.
- */
-#define TASK_SIZE_4PB   (0x0010000000000000UL)
 
-/*
- * For now 512TB is only supported with book3s and 64K linux page size.
- */
-#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
-/*
- * Max value currently used:
- */
-#define TASK_SIZE_USER64   TASK_SIZE_4PB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
-#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
-#else
-#define TASK_SIZE_USER64   TASK_SIZE_64TB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
-/*
- * We don't need to allocate extended context ids for 4K page size, because
- * we limit the max effective address on this config to 64TB.
- */
-#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
-#endif
+#include <asm/task_size_user64.h>
 
 /*
  * 32-bit user address space is 4GB - 1 page
diff --git a/arch/powerpc/include/asm/task_size_user64.h b/arch/powerpc/include/asm/task_size_user64.h
new file mode 100644
index ..a4043075864b
--- /dev/null
+++ b/arch/powerpc/include/asm/task_size_user64.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_TASK_SIZE_USER64_H
+#define _ASM_POWERPC_TASK_SIZE_USER64_H
+
+#ifdef CONFIG_PPC64
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x0000400000000000UL)
+#define TASK_SIZE_128TB (0x0000800000000000UL)
+#define TASK_SIZE_512TB (0x0002000000000000UL)
+#define TASK_SIZE_1PB   (0x0004000000000000UL)
+#define TASK_SIZE_2PB   (0x0008000000000000UL)
+/*
+ * With 52 bits in the address we can support
+ * upto 4PB of range.
+ */
+#define TASK_SIZE_4PB   (0x0010000000000000UL)
+
+/*
+ * For now 512TB is only supported with book3s and 64K linux page size.
+ */
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64   TASK_SIZE_4PB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
+#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
+#else
+#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
+/*
+ * We don't need to allocate extended context ids for 4K page size, because
+ * we limit the max effective address on this config to 64TB.
+ */
+#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
+#endif
+
+#endif /* CONFIG_PPC64 */
+#endif /* _ASM_POWERPC_TASK_SIZE_USER64_H */
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index e3f738eb1cac..64b5011475c7 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include <asm/processor.h>
 
 void wait_for_subcore_guest_exit(void)
 {
-- 
2.13.3



[PATCH v4 0/9] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2018-10-04 Thread Christophe Leroy
The purpose of this series is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

Changes since RFC v3: (based on Nick's review)
 - Renamed task_size.h to task_size_user64.h to better relate to what it 
contains.
 - Handling of the isolation of thread_info cpu field inside CONFIG_SMP #ifdefs 
moved to a separate patch.
 - Removed CURRENT_THREAD_INFO macro completely.
 - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is 
defined.
 - Added a patch at the end to rename 'tp' pointers to 'sp' pointers
 - Renamed 'tp' into 'sp' pointers in preparation patch when relevant
 - Fixed a few commit logs
 - Fixed checkpatch report.

Changes since RFC v2:
 - Removed the modification of names in asm-offsets
 - Created a rule in arch/powerpc/Makefile to append the offset of current->cpu 
in CFLAGS
 - Modified asm/smp.h to use the offset set in CFLAGS
 - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch
 - Moved the modification of current_pt_regs in the patch activating 
CONFIG_THREAD_INFO_IN_TASK

Changes since RFC v1:
 - Removed the first patch which was modifying header inclusion order in timer
 - Modified some names in asm-offsets to avoid conflicts when including 
asm-offsets in C files
 - Modified asm/smp.h to avoid having to include linux/sched.h (using 
asm-offsets instead)
 - Moved some changes from the activation patch to the preparation patch.

Christophe Leroy (9):
  book3s/64: avoid circular header inclusion in mmu-hash.h
  powerpc: Only use task_struct 'cpu' field on SMP
  powerpc: Prepare for moving thread_info into task_struct
  powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
  powerpc: regain entire stack space
  powerpc: 'current_set' is now a table of task_struct pointers
  powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
  powerpc/64: Remove CURRENT_THREAD_INFO
  powerpc: clean stack pointers naming

 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/Makefile  |  8 ++-
 arch/powerpc/include/asm/asm-prototypes.h  |  4 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h  |  2 +-
 arch/powerpc/include/asm/exception-64s.h   |  4 +-
 arch/powerpc/include/asm/irq.h | 14 ++---
 arch/powerpc/include/asm/livepatch.h   |  2 +-
 arch/powerpc/include/asm/processor.h   | 39 +
 arch/powerpc/include/asm/ptrace.h  |  2 +-
 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/include/asm/smp.h | 17 +-
 arch/powerpc/include/asm/task_size_user64.h| 42 ++
 arch/powerpc/include/asm/thread_info.h | 19 ---
 arch/powerpc/kernel/asm-offsets.c  | 10 ++--
 arch/powerpc/kernel/entry_32.S | 66 --
 arch/powerpc/kernel/entry_64.S | 12 ++--
 arch/powerpc/kernel/epapr_hcalls.S |  5 +-
 arch/powerpc/kernel/exceptions-64e.S   | 13 +
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/kernel/head_32.S  | 14 ++---
 arch/powerpc/kernel/head_40x.S |  4 +-
 arch/powerpc/kernel/head_44x.S |  8 +--
 arch/powerpc/kernel/head_64.S  |  1 +
 arch/powerpc/kernel/head_8xx.S |  2 +-
 arch/powerpc/kernel/head_booke.h   | 12 +---
 arch/powerpc/kernel/head_fsl_booke.S   | 16 +++---
 arch/powerpc/kernel/idle_6xx.S |  8 +--
 arch/powerpc/kernel/idle_book3e.S  |  2 +-
 arch/powerpc/kernel/idle_e500.S|  8 +--
 arch/powerpc/kernel/idle_power4.S  |  2 +-
 arch/powerpc/kernel/irq.c  | 77 +-
 arch/powerpc/kernel/kgdb.c | 28 --
 arch/powerpc/kernel/machine_kexec_64.c |  6 +-
 arch/powerpc/kernel/misc_32.S  | 17 +++---
 arch/powerpc/kernel/process.c  | 15 ++---
 arch/powerpc/kernel/setup-common.c |  2 +-
 arch/powerpc/kernel/setup_32.c | 15 ++---
 arch/powerpc/kernel/setup_64.c | 41 --
 arch/powerpc/kernel/smp.c  | 16 +++---
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  6 +-
 arch/powerpc/kvm/book3s_hv_hmi.c   |  1 +
 arch/powerpc/mm/hash_low_32.S  | 14 ++---
 arch/powerpc/sysdev/6xx-suspend.S  |  5 +-
 arch/powerpc/xmon/xmon.c   |  2 +-
 44 files changed, 224 insertions(+), 362 deletions(-)
 create mode 100644 arch/powerpc/include/asm/task_size_user64.h

-- 
2.13.3



Re: [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread Michal Suchánek
On Thu, 4 Oct 2018 10:13:48 +0200
David Hildenbrand  wrote:

ok, so what is the problem here?

Handling the hotplug in userspace through udev may be suboptimal and
kernel handling might be faster but that's orthogonal to the problem at
hand.

The state of the art is to determine what to do with hotplugged memory
in userspace based on platform and virtualization type.

Changing the default to depend on the driver that added the memory
rather than platform type should solve the issue of VMs growing
different types of memory device emulation.

Am I missing something?

Thanks

Michal


Re: [PATCH] dma-direct: Fix return value of dma_direct_supported

2018-10-04 Thread Alexander Duyck
On Thu, Oct 4, 2018 at 4:25 AM Robin Murphy  wrote:
>
> On 04/10/18 00:48, Alexander Duyck wrote:
> > It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask
> > <= physiscal memory size") the logic of the test was changed from a "<" to
> > a ">=" however I don't see any reason for that change. I am assuming that
> > there was some additional change planned, specifically I suspect the logic
> > was intended to be reversed and possibly used for a return. Since that is
> > the case I have gone ahead and done that.
>
> Bah, seems I got hung up on the min_mask code above it and totally
> overlooked that the condition itself got flipped. It probably also can't
> help that it's an int return type, but treated as a bool by callers
> rather than "0 for success" as int tends to imply in isolation.
>
> Anyway, paying a bit more attention this time, I think this looks like
> the right fix - cheers Alex.
>
> Robin.

Thanks for the review.

- Alex

P.S. It looks like I forgot to add Christoph to the original mail
since I had just copied the To and Cc from the original submission, so
I added him to the Cc for this.

> > This addresses issues I had on my system that prevented me from booting
> > with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.
> >
> > Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory 
> > size")
> > Signed-off-by: Alexander Duyck 
> > ---
> >   kernel/dma/direct.c |4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> > index 5a0806b5351b..65872f6c2e93 100644
> > --- a/kernel/dma/direct.c
> > +++ b/kernel/dma/direct.c
> > @@ -301,9 +301,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
> >
> >   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> >
> > - if (mask >= phys_to_dma(dev, min_mask))
> > - return 0;
> > - return 1;
> > + return mask >= phys_to_dma(dev, min_mask);
> >   }
> >
> >   int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)
> >
> > ___
> > iommu mailing list
> > io...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >


Re: [RFC PATCH v3 2/7] powerpc: Prepare for moving thread_info into task_struct

2018-10-04 Thread Christophe LEROY




Le 03/10/2018 à 07:49, Christophe LEROY a écrit :



Le 03/10/2018 à 07:02, Nicholas Piggin a écrit :

On Mon,  1 Oct 2018 12:30:21 + (UTC)
Christophe Leroy  wrote:


This patch cleans the powerpc kernel before activating
CONFIG_THREAD_INFO_IN_TASK:
- The purpose of the pointer given to call_do_softirq() and
call_do_irq() is to point to the new stack ==> change it to void*
- Don't use CURRENT_THREAD_INFO() to locate the stack.
- Fix a few comments.
- TI_CPU is only used when CONFIG_SMP is set.
- Replace current_thread_info()->task by current
- Remove unnecessary casts to thread_info, as they'll become invalid
once thread_info is not in stack anymore.
- Ensure the task_struct 'cpu' field is not used directly outside of SMP code
- Rename THREAD_INFO to TASK_STACK: as it is in fact the offset of the
pointer to the stack in task_struct, this pointer will not be impacted
by the move of THREAD_INFO.
- Make TASK_STACK available to PPC64, which will need it to get the
stack pointer from current once thread_info has been moved.

Signed-off-by: Christophe Leroy 
---
  arch/powerpc/include/asm/irq.h   |  4 ++--
  arch/powerpc/include/asm/livepatch.h |  2 +-
  arch/powerpc/include/asm/processor.h |  4 ++--
  arch/powerpc/include/asm/reg.h   |  2 +-
  arch/powerpc/kernel/asm-offsets.c    |  2 +-
  arch/powerpc/kernel/entry_32.S   |  2 +-
  arch/powerpc/kernel/entry_64.S   |  2 +-
  arch/powerpc/kernel/head_32.S    |  4 ++--
  arch/powerpc/kernel/head_40x.S   |  4 ++--
  arch/powerpc/kernel/head_44x.S   |  2 +-
  arch/powerpc/kernel/head_8xx.S   |  2 +-
  arch/powerpc/kernel/head_booke.h |  4 ++--
  arch/powerpc/kernel/head_fsl_booke.S |  6 --
  arch/powerpc/kernel/irq.c    |  2 +-
  arch/powerpc/kernel/misc_32.S    |  8 ++--
  arch/powerpc/kernel/process.c    |  6 +++---
  arch/powerpc/kernel/setup_32.c   | 15 +--
  arch/powerpc/kernel/smp.c    |  4 +++-
  arch/powerpc/xmon/xmon.c |  2 +-
  19 files changed, 40 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h 
b/arch/powerpc/include/asm/irq.h

index ee39ce56b2a2..8108d1fe33ca 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -63,8 +63,8 @@ extern struct thread_info *hardirq_ctx[NR_CPUS];
  extern struct thread_info *softirq_ctx[NR_CPUS];
  extern void irq_ctx_init(void);
-extern void call_do_softirq(struct thread_info *tp);
-extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
+extern void call_do_softirq(void *tp);
+extern void call_do_irq(struct pt_regs *regs, void *tp);


void *sp for these ?


Yes, why not, but it means changing the code. I wanted to minimise the
changes and avoid cosmetic ones. Or maybe I should add a cosmetic patch
at the end?


In fact, I'll do it because the only additional impact is on a comment 
in misc_32.S


Christophe





This all seems okay to me except the 32-bit code which I don't know.
Would it be any trouble for you to put the TI_CPU bits into their own
patch?


No problem, I can put the TI_CPU bits in a separate patch.



Reviewed-by: Nicholas Piggin 



Thanks
Christophe




  extern void do_IRQ(struct pt_regs *regs);
  extern void __init init_IRQ(void);
  extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/livepatch.h 
b/arch/powerpc/include/asm/livepatch.h

index 47a03b9b528b..818451bf629c 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -49,7 +49,7 @@ static inline void klp_init_thread_info(struct 
thread_info *ti)

  ti->livepatch_sp = (unsigned long *)(ti + 1) + 1;
  }
  #else
-static void klp_init_thread_info(struct thread_info *ti) { }
+static inline void klp_init_thread_info(struct thread_info *ti) { }
  #endif /* CONFIG_LIVEPATCH */
  #endif /* _ASM_POWERPC_LIVEPATCH_H */
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h

index 353879db3e98..31873614392f 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -40,7 +40,7 @@
  #ifndef __ASSEMBLY__
  #include 
-#include 
+#include 
  #include 
  #include 
@@ -333,7 +333,7 @@ struct thread_struct {
  #define INIT_SP    (sizeof(init_stack) + (unsigned long) 
_stack)

  #define INIT_SP_LIMIT \
-    (_ALIGN_UP(sizeof(init_thread_info), 16) + (unsigned long) 
_stack)
+    (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long) 
_stack)

  #ifdef CONFIG_SPE
  #define SPEFSCR_INIT \
diff --git a/arch/powerpc/include/asm/reg.h 
b/arch/powerpc/include/asm/reg.h

index e5b314ed054e..f3a9cf19a986 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1053,7 +1053,7 @@
   *    - SPRG9 debug exception scratch
   *
   * All 32-bit:
- *    - SPRG3 current thread_info pointer
+ *    - SPRG3 current thread_struct physical addr pointer
   *    (virtual on BookE, physical on others)
   *
   * 32-bit classic:
diff --git 

[PATCH v4 6/6] arm64: dts: add LX2160ARDB board support

2018-10-04 Thread Vabhav Sharma
LX2160A reference design board (RDB) is a high-performance
computing, evaluation, and development platform with the LX2160A
SoC.

Signed-off-by: Priyanka Jain 
Signed-off-by: Sriram Dash 
Signed-off-by: Vabhav Sharma 
---
 arch/arm64/boot/dts/freescale/Makefile|   1 +
 arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts | 100 ++
 2 files changed, 101 insertions(+)
 create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts

diff --git a/arch/arm64/boot/dts/freescale/Makefile 
b/arch/arm64/boot/dts/freescale/Makefile
index 86e18ad..445b72b 100644
--- a/arch/arm64/boot/dts/freescale/Makefile
+++ b/arch/arm64/boot/dts/freescale/Makefile
@@ -13,3 +13,4 @@ dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2080a-rdb.dtb
 dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2080a-simu.dtb
 dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2088a-qds.dtb
 dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2088a-rdb.dtb
+dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-lx2160a-rdb.dtb
diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts 
b/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
new file mode 100644
index 000..1483071
--- /dev/null
+++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+//
+// Device Tree file for LX2160ARDB
+//
+// Copyright 2018 NXP
+
+/dts-v1/;
+
+#include "fsl-lx2160a.dtsi"
+
+/ {
+   model = "NXP Layerscape LX2160ARDB";
+   compatible = "fsl,lx2160a-rdb", "fsl,lx2160a";
+
+   chosen {
+   stdout-path = "serial0:115200n8";
+   };
+
+   sb_3v3: regulator-fixed {
+   compatible = "regulator-fixed";
+   regulator-name = "fixed-3.3V";
+   regulator-min-microvolt = <3300000>;
+   regulator-max-microvolt = <3300000>;
+   regulator-boot-on;
+   regulator-always-on;
+   };
+
+};
+
+ {
+   status = "okay";
+};
+
+ {
+   status = "okay";
+};
+
+ {
+   status = "okay";
+   i2c-mux@77 {
+   compatible = "nxp,pca9547";
+   reg = <0x77>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   i2c@2 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   reg = <0x2>;
+
+   power-monitor@40 {
+   compatible = "ti,ina220";
+   reg = <0x40>;
+   shunt-resistor = <1000>;
+   };
+   };
+
+   i2c@3 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   reg = <0x3>;
+
+   temperature-sensor@4c {
+   compatible = "nxp,sa56004";
+   reg = <0x4c>;
+   vcc-supply = <_3v3>;
+   };
+
+   temperature-sensor@4d {
+   compatible = "nxp,sa56004";
+   reg = <0x4d>;
+   vcc-supply = <_3v3>;
+   };
+   };
+   };
+};
+
+ {
+   status = "okay";
+
+   rtc@51 {
+   compatible = "nxp,pcf2129";
+   reg = <0x51>;
+   // IRQ10_B
+   interrupts = <0 150 0x4>;
+   };
+
+};
+
+ {
+   status = "okay";
+};
+
+ {
+   status = "okay";
+};
+
+ {
+   status = "okay";
+};
-- 
2.7.4



[PATCH v4 5/6] arm64: dts: add QorIQ LX2160A SoC support

2018-10-04 Thread Vabhav Sharma
LX2160A SoC is based on Layerscape Chassis Generation 3.2 Architecture.

LX2160A features 16 advanced 64-bit ARM v8 Cortex-A72 processor cores
in 8 clusters, CCN508, GICv3, two 64-bit DDR4 memory controllers, 8 I2C
controllers, 3 DSPI, 2 eSDHC, 2 USB 3.0, MMU-500, 3 SATA, 4 PL011 SBSA
UARTs, etc.

Signed-off-by: Ramneek Mehresh 
Signed-off-by: Zhang Ying-22455 
Signed-off-by: Nipun Gupta 
Signed-off-by: Priyanka Jain 
Signed-off-by: Yogesh Gaur 
Signed-off-by: Sriram Dash 
Signed-off-by: Vabhav Sharma 
---
 arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 702 +
 1 file changed, 702 insertions(+)
 create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi

diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi 
b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
new file mode 100644
index 000..c758268
--- /dev/null
+++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
@@ -0,0 +1,702 @@
+// SPDX-License-Identifier: (GPL-2.0 OR MIT)
+//
+// Device Tree Include file for Layerscape-LX2160A family SoC.
+//
+// Copyright 2018 NXP
+
+#include 
+
+/memreserve/ 0x80000000 0x00010000;
+
+/ {
+   compatible = "fsl,lx2160a";
+   interrupt-parent = <>;
+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   cpus {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   // 8 clusters having 2 Cortex-A72 cores each
+   cpu@0 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x0>;
+   clocks = < 1 0>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   i-cache-sets = <192>;
+   next-level-cache = <_l2>;
+   };
+
+   cpu@1 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x1>;
+   clocks = < 1 0>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   i-cache-sets = <192>;
+   next-level-cache = <_l2>;
+   };
+
+   cpu@100 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x100>;
+   clocks = < 1 1>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   i-cache-sets = <192>;
+   next-level-cache = <_l2>;
+   };
+
+   cpu@101 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x101>;
+   clocks = < 1 1>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   i-cache-sets = <192>;
+   next-level-cache = <_l2>;
+   };
+
+   cpu@200 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x200>;
+   clocks = < 1 2>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   i-cache-sets = <192>;
+   next-level-cache = <_l2>;
+   };
+
+   cpu@201 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a72";
+   enable-method = "psci";
+   reg = <0x201>;
+   clocks = < 1 2>;
+   d-cache-size = <0x8000>;
+   d-cache-line-size = <64>;
+   d-cache-sets = <128>;
+   i-cache-size = <0xC000>;
+   i-cache-line-size = <64>;
+   

[PATCH v4 4/6] clk: qoriq: Add clockgen support for lx2160a

2018-10-04 Thread Vabhav Sharma
From: Yogesh Gaur 

Add clockgen support for lx2160a.
Added entry for compat 'fsl,lx2160a-clockgen'.

Signed-off-by: Tang Yuantian 
Signed-off-by: Yogesh Gaur 
Signed-off-by: Vabhav Sharma 
Acked-by: Stephen Boyd 
---
 drivers/clk/clk-qoriq.c | 12 
 drivers/cpufreq/qoriq-cpufreq.c |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index e152bfb..99675de 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -570,6 +570,17 @@ static const struct clockgen_chipinfo chipinfo[] = {
.flags = CG_VER3 | CG_LITTLE_ENDIAN,
},
{
+   .compat = "fsl,lx2160a-clockgen",
+   .cmux_groups = {
+   _cmux_cga12, _cmux_cgb
+   },
+   .cmux_to_group = {
+   0, 0, 0, 0, 1, 1, 1, 1, -1
+   },
+   .pll_mask = 0x37,
+   .flags = CG_VER3 | CG_LITTLE_ENDIAN,
+   },
+   {
.compat = "fsl,p2041-clockgen",
.guts_compat = "fsl,qoriq-device-config-1.0",
.init_periph = p2041_init_periph,
@@ -1424,6 +1435,7 @@ CLK_OF_DECLARE(qoriq_clockgen_ls1043a, 
"fsl,ls1043a-clockgen", clockgen_init);
 CLK_OF_DECLARE(qoriq_clockgen_ls1046a, "fsl,ls1046a-clockgen", clockgen_init);
 CLK_OF_DECLARE(qoriq_clockgen_ls1088a, "fsl,ls1088a-clockgen", clockgen_init);
 CLK_OF_DECLARE(qoriq_clockgen_ls2080a, "fsl,ls2080a-clockgen", clockgen_init);
+CLK_OF_DECLARE(qoriq_clockgen_lx2160a, "fsl,lx2160a-clockgen", clockgen_init);
 
 /* Legacy nodes */
 CLK_OF_DECLARE(qoriq_sysclk_1, "fsl,qoriq-sysclk-1.0", sysclk_init);
diff --git a/drivers/cpufreq/qoriq-cpufreq.c b/drivers/cpufreq/qoriq-cpufreq.c
index 3d773f6..83921b7 100644
--- a/drivers/cpufreq/qoriq-cpufreq.c
+++ b/drivers/cpufreq/qoriq-cpufreq.c
@@ -295,6 +295,7 @@ static const struct of_device_id node_matches[] __initconst 
= {
{ .compatible = "fsl,ls1046a-clockgen", },
{ .compatible = "fsl,ls1088a-clockgen", },
{ .compatible = "fsl,ls2080a-clockgen", },
+   { .compatible = "fsl,lx2160a-clockgen", },
{ .compatible = "fsl,p4080-clockgen", },
{ .compatible = "fsl,qoriq-clockgen-1.0", },
{ .compatible = "fsl,qoriq-clockgen-2.0", },
-- 
2.7.4



[PATCH v4 3/6] clk: qoriq: increase array size of cmux_to_group

2018-10-04 Thread Vabhav Sharma
From: Yogesh Gaur 

Increase the size of the cmux_to_group array to accommodate the
-1 termination entry.

Added a -1 terminator entry for 4080_cmux_grpX.

Signed-off-by: Yogesh Gaur 
Signed-off-by: Vabhav Sharma 
---
 drivers/clk/clk-qoriq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index 3a1812f..e152bfb 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -79,7 +79,7 @@ struct clockgen_chipinfo {
const struct clockgen_muxinfo *cmux_groups[2];
const struct clockgen_muxinfo *hwaccel[NUM_HWACCEL];
void (*init_periph)(struct clockgen *cg);
-   int cmux_to_group[NUM_CMUX]; /* -1 terminates if fewer than NUM_CMUX */
+   int cmux_to_group[NUM_CMUX+1]; /* array should be -1 terminated */
u32 pll_mask;   /* 1 << n bit set if PLL n is valid */
u32 flags;  /* CG_xxx */
 };
@@ -601,7 +601,7 @@ static const struct clockgen_chipinfo chipinfo[] = {
_cmux_grp1, _cmux_grp2
},
.cmux_to_group = {
-   0, 0, 0, 0, 1, 1, 1, 1
+   0, 0, 0, 0, 1, 1, 1, 1, -1
},
.pll_mask = 0x1f,
},
-- 
2.7.4



[PATCH v4 2/6] soc/fsl/guts: Add compatible string for LX2160A

2018-10-04 Thread Vabhav Sharma
Add compatible string "lx2160a-dcfg" to
initialize the guts driver for LX2160A.

Signed-off-by: Vabhav Sharma 
---
 drivers/soc/fsl/guts.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
index 302e0c8..5e1e633 100644
--- a/drivers/soc/fsl/guts.c
+++ b/drivers/soc/fsl/guts.c
@@ -222,6 +222,7 @@ static const struct of_device_id fsl_guts_of_match[] = {
{ .compatible = "fsl,ls1088a-dcfg", },
{ .compatible = "fsl,ls1012a-dcfg", },
{ .compatible = "fsl,ls1046a-dcfg", },
+   { .compatible = "fsl,lx2160a-dcfg", },
{}
 };
 MODULE_DEVICE_TABLE(of, fsl_guts_of_match);
-- 
2.7.4



[PATCH v4 1/6] dt-bindings: arm64: add compatible for LX2160A

2018-10-04 Thread Vabhav Sharma
Add compatible for LX2160A SoC, QDS and RDB boards.
Add lx2160a compatible for clockgen and dcfg.

Signed-off-by: Vabhav Sharma 
Reviewed-by: Rob Herring 
---
 Documentation/devicetree/bindings/arm/fsl.txt   | 14 +-
 Documentation/devicetree/bindings/clock/qoriq-clock.txt |  1 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/fsl.txt 
b/Documentation/devicetree/bindings/arm/fsl.txt
index cdb9dd7..4f5d55b 100644
--- a/Documentation/devicetree/bindings/arm/fsl.txt
+++ b/Documentation/devicetree/bindings/arm/fsl.txt
@@ -126,7 +126,7 @@ core start address and release the secondary core from 
holdoff and startup.
   - compatible: Should contain a chip-specific compatible string,
Chip-specific strings are of the form "fsl,-dcfg",
The following s are known to be supported:
-   ls1012a, ls1021a, ls1043a, ls1046a, ls2080a.
+   ls1012a, ls1021a, ls1043a, ls1046a, ls2080a, lx2160a.
 
   - reg : should contain base address and length of DCFG memory-mapped 
registers
 
@@ -218,3 +218,15 @@ Required root node properties:
 LS2088A ARMv8 based RDB Board
 Required root node properties:
 - compatible = "fsl,ls2088a-rdb", "fsl,ls2088a";
+
+LX2160A SoC
+Required root node properties:
+- compatible = "fsl,lx2160a";
+
+LX2160A ARMv8 based QDS Board
+Required root node properties:
+- compatible = "fsl,lx2160a-qds", "fsl,lx2160a";
+
+LX2160A ARMv8 based RDB Board
+Required root node properties:
+- compatible = "fsl,lx2160a-rdb", "fsl,lx2160a";
diff --git a/Documentation/devicetree/bindings/clock/qoriq-clock.txt 
b/Documentation/devicetree/bindings/clock/qoriq-clock.txt
index 97f46ad..3fb9995 100644
--- a/Documentation/devicetree/bindings/clock/qoriq-clock.txt
+++ b/Documentation/devicetree/bindings/clock/qoriq-clock.txt
@@ -37,6 +37,7 @@ Required properties:
* "fsl,ls1046a-clockgen"
* "fsl,ls1088a-clockgen"
* "fsl,ls2080a-clockgen"
+   * "fsl,lx2160a-clockgen"
Chassis-version clock strings include:
* "fsl,qoriq-clockgen-1.0": for chassis 1.0 clocks
* "fsl,qoriq-clockgen-2.0": for chassis 2.0 clocks
-- 
2.7.4



[PATCH v4 0/6] arm64: dts: NXP: add basic dts file for LX2160A SoC

2018-10-04 Thread Vabhav Sharma
Changes for v4:
-Updated bindings for lx2160a clockgen and dcfg
-Modified commit message for lx2160a clockgen changes
-Updated interrupt property with macro definition
-Added required enable-method property to each core node with psci value
-Removed unused node syscon in device tree
-Removed blank lines in device tree fsl-lx2160a.dtsi
-Updated uart node compatible sbsa-uart first
-Added and defined vcc-supply property to temperature sensor node in
 device tree fsl-lx2160a-rdb.dts

Changes for v3:
-Split clockgen support patch into below two patches:
- a)Updated array size of cmux_to_group[] with NUM_CMUX+1 to include -1
 terminator and p4080 cmux_to_group[] array with -1 terminator
- b)Add clockgen support for lx2160a

Changes for v2:
- Modified cmux_to_group array to include -1 terminator
- Revert NUM_CMUX to original value 8 from 16
- Remove “As LX2160A is 16 core, so modified value for NUM_CMUX”
  in patch "[PATCH 3/5] drivers: clk-qoriq: Add clockgen support for
  lx2160a" description
- Populated cache properties for L1 and L2 cache in lx2160a device-tree.
- Removed reboot node from lx2160a device-tree as PSCI is implemented.
- Removed incorrect comment for timer node interrupt property in
  lx2160a device-tree.
- Modified pmu node compatible property from "arm,armv8-pmuv3" to
  "arm,cortex-a72-pmu" in lx2160a device-tree
- Non-standard aliases removed in lx2160a rdb board device-tree
- Updated i2c child nodes to generic name in lx2160a rdb device-tree.

Changes for v1:
- Add compatible string for LX2160A clockgen support
- Add compatible string to initialize LX2160A guts driver
- Add compatible string for LX2160A support in dt-bindings
- Add dts file to enable support for LX2160A SoC and LX2160A RDB
  (Reference design board)

Vabhav Sharma (4):
  dt-bindings: arm64: add compatible for LX2160A
  soc/fsl/guts: Add compatible string for LX2160A
  arm64: dts: add QorIQ LX2160A SoC support
  arm64: dts: add LX2160ARDB board support

Yogesh Gaur (2):
  clk: qoriq: increase array size of cmux_to_group
  clk: qoriq: Add clockgen support for lx2160a

 Documentation/devicetree/bindings/arm/fsl.txt  |  14 +-
 .../devicetree/bindings/clock/qoriq-clock.txt  |   1 +
 arch/arm64/boot/dts/freescale/Makefile |   1 +
 arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts  | 100 +++
 arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 702 +
 drivers/clk/clk-qoriq.c|  16 +-
 drivers/cpufreq/qoriq-cpufreq.c|   1 +
 drivers/soc/fsl/guts.c |   1 +
 8 files changed, 833 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a-rdb.dts
 create mode 100644 arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi

-- 
2.7.4



Re: [PATCH 1/7] macintosh: Use common code to access RTC

2018-10-04 Thread Geert Uytterhoeven
On Wed, Sep 12, 2018 at 2:18 AM Finn Thain  wrote:
> Now that the 68k Mac port has adopted the via-pmu driver, the same RTC
> code can be shared between m68k and powerpc. Replace duplicated code in
> arch/powerpc and arch/m68k with common RTC accessors for Cuda and PMU.
>
> Drop the problematic WARN_ON which was introduced in commit 22db552b50fa
> ("powerpc/powermac: Fix rtc read/write functions").
>
> Tested-by: Stan Johnson 
> Signed-off-by: Finn Thain 

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v4 32/32] KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization

2018-10-04 Thread Paul Mackerras
With this, userspace can enable a KVM-HV guest to run nested guests
under it.

The administrator can control whether any nested guests can be run;
setting the "nested" module parameter to false prevents any guests
becoming nested hypervisors (that is, any attempt to enable the nested
capability on a guest will fail).  Guests which are already nested
hypervisors will continue to be so.

Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt  | 14 ++
 arch/powerpc/include/asm/kvm_ppc.h |  1 +
 arch/powerpc/kvm/book3s_hv.c   | 19 +++
 arch/powerpc/kvm/powerpc.c | 12 
 include/uapi/linux/kvm.h   |  1 +
 5 files changed, 47 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 017d851..a2d4832 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -4522,6 +4522,20 @@ hpage module parameter is not set to 1, -EINVAL is 
returned.
 While it is generally possible to create a huge page backed VM without
 this capability, the VM will not be able to run.
 
+7.15 KVM_CAP_PPC_NESTED_HV
+
+Architectures: ppc
+Parameters: enable flag (0 to disable, non-zero to enable)
+Returns: 0 on success, -EINVAL when the implementation doesn't support
+nested-HV virtualization.
+
+HV-KVM on POWER9 and later systems allows for "nested-HV"
+virtualization, which provides a way for a guest VM to run guests that
+can run using the CPU's supervisor mode (privileged non-hypervisor
+state).  Enabling this capability on a VM depends on the CPU having
+the necessary functionality and on the facility being enabled with a
+kvm-hv module parameter.
+
 8. Other capabilities.
 --
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 245e564..80f0091 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -327,6 +327,7 @@ struct kvmppc_ops {
int (*set_smt_mode)(struct kvm *kvm, unsigned long mode,
unsigned long flags);
void (*giveup_ext)(struct kvm_vcpu *vcpu, ulong msr);
+   int (*enable_nested)(struct kvm *kvm, bool enable);
 };
 
 extern struct kvmppc_ops *kvmppc_hv_ops;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7f89b22..d3cc013 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -118,6 +118,11 @@ module_param_cb(h_ipi_redirect, _param_ops, 
_ipi_redirect, 0644);
 MODULE_PARM_DESC(h_ipi_redirect, "Redirect H_IPI wakeup to a free host core");
 #endif
 
+/* If set, guests are allowed to create and control nested guests */
+static bool nested = true;
+module_param(nested, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(nested, "Enable nested virtualization (only on POWER9)");
+
 /* If set, the threads on each CPU core have to be in the same MMU mode */
 static bool no_mixing_hpt_and_radix;
 
@@ -5165,6 +5170,19 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct 
kvm_ppc_mmuv3_cfg *cfg)
return err;
 }
 
+static int kvmhv_enable_nested(struct kvm *kvm, bool enable)
+{
+   if (!nested)
+   return -EPERM;
+   if (!cpu_has_feature(CPU_FTR_ARCH_300))
+   return -ENODEV;
+
+   /* kvm == NULL means the caller is testing if the capability exists */
+   if (kvm)
+   kvm->arch.nested_enable = enable;
+   return 0;
+}
+
 static struct kvmppc_ops kvm_ops_hv = {
.get_sregs = kvm_arch_vcpu_ioctl_get_sregs_hv,
.set_sregs = kvm_arch_vcpu_ioctl_set_sregs_hv,
@@ -5204,6 +5222,7 @@ static struct kvmppc_ops kvm_ops_hv = {
.configure_mmu = kvmhv_configure_mmu,
.get_rmmu_info = kvmhv_get_rmmu_info,
.set_smt_mode = kvmhv_set_smt_mode,
+   .enable_nested = kvmhv_enable_nested,
 };
 
 static int kvm_init_subcore_bitmap(void)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index eba5756..449ae1d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -596,6 +596,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_PPC_MMU_HASH_V3:
r = !!(hv_enabled && cpu_has_feature(CPU_FTR_ARCH_300));
break;
+   case KVM_CAP_PPC_NESTED_HV:
+   r = !!(hv_enabled && kvmppc_hv_ops->enable_nested &&
+  !kvmppc_hv_ops->enable_nested(NULL, false));
+   break;
 #endif
case KVM_CAP_SYNC_MMU:
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
@@ -2114,6 +2118,14 @@ static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
r = kvm->arch.kvm_ops->set_smt_mode(kvm, mode, flags);
break;
}
+
+   case KVM_CAP_PPC_NESTED_HV:
+   r = -EINVAL;
+   if (!is_kvmppc_hv_enabled(kvm) ||
+   !kvm->arch.kvm_ops->enable_nested)
+   break;
+   r = 

[PATCH v4 31/32] KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs

2018-10-04 Thread Paul Mackerras
This adds a list of valid shadow PTEs for each nested guest to
the 'radix' file for the guest in debugfs.  This can be useful for
debugging.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 39 +---
 arch/powerpc/kvm/book3s_hv_nested.c  | 15 
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 1e96027..d11f73c 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -120,6 +120,7 @@ struct rmap_nested {
 struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
  bool create);
 void kvmhv_put_nested(struct kvm_nested_guest *gp);
+int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid);
 
 /* Encoding of first parameter for H_TLB_INVALIDATE */
 #define H_TLBIE_P1_ENC(ric, prs, r)(___PPC_RIC(ric) | ___PPC_PRS(prs) | \
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index ae0e3ed..43b21e8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1002,6 +1002,7 @@ struct debugfs_radix_state {
struct kvm  *kvm;
struct mutexmutex;
unsigned long   gpa;
+   int lpid;
int chars_left;
int buf_index;
charbuf[128];
@@ -1043,6 +1044,7 @@ static ssize_t debugfs_radix_read(struct file *file, char 
__user *buf,
struct kvm *kvm;
unsigned long gpa;
pgd_t *pgt;
+   struct kvm_nested_guest *nested;
pgd_t pgd, *pgdp;
pud_t pud, *pudp;
pmd_t pmd, *pmdp;
@@ -1077,10 +1079,39 @@ static ssize_t debugfs_radix_read(struct file *file, 
char __user *buf,
}
 
gpa = p->gpa;
-   pgt = kvm->arch.pgtable;
-   while (len != 0 && gpa < RADIX_PGTABLE_RANGE) {
+   nested = NULL;
+   pgt = NULL;
+   while (len != 0 && p->lpid >= 0) {
+   if (gpa >= RADIX_PGTABLE_RANGE) {
+   gpa = 0;
+   pgt = NULL;
+   if (nested) {
+   kvmhv_put_nested(nested);
+   nested = NULL;
+   }
+   p->lpid = kvmhv_nested_next_lpid(kvm, p->lpid);
+   p->hdr = 0;
+   if (p->lpid < 0)
+   break;
+   }
+   if (!pgt) {
+   if (p->lpid == 0) {
+   pgt = kvm->arch.pgtable;
+   } else {
+   nested = kvmhv_get_nested(kvm, p->lpid, false);
+   if (!nested) {
+   gpa = RADIX_PGTABLE_RANGE;
+   continue;
+   }
+   pgt = nested->shadow_pgtable;
+   }
+   }
+   n = 0;
if (!p->hdr) {
-   n = scnprintf(p->buf, sizeof(p->buf),
+   if (p->lpid > 0)
+   n = scnprintf(p->buf, sizeof(p->buf),
+ "\nNested LPID %d: ", p->lpid);
+   n += scnprintf(p->buf + n, sizeof(p->buf) - n,
  "pgdir: %lx\n", (unsigned long)pgt);
p->hdr = 1;
goto copy;
@@ -1146,6 +1177,8 @@ static ssize_t debugfs_radix_read(struct file *file, char 
__user *buf,
}
}
p->gpa = gpa;
+   if (nested)
+   kvmhv_put_nested(nested);
 
  out:
mutex_unlock(>mutex);
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index 8c0da00..75b461d0 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1263,3 +1263,18 @@ long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu)
mutex_unlock(>tlb_lock);
return ret;
 }
+
+int kvmhv_nested_next_lpid(struct kvm *kvm, int lpid)
+{
+   int ret = -1;
+
+   spin_lock(>mmu_lock);
+   while (++lpid <= kvm->arch.max_nested_lpid) {
+   if (kvm->arch.nested_guests[lpid]) {
+   ret = lpid;
+   break;
+   }
+   }
+   spin_unlock(>mmu_lock);
+   return ret;
+}
-- 
2.7.4



[PATCH v4 30/32] KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode

2018-10-04 Thread Paul Mackerras
With this, the KVM-HV module can be loaded in a guest running under
KVM-HV, and if the hypervisor supports nested virtualization, this
guest can now act as a nested hypervisor and run nested guests.

This also adds some checks to inform userspace that HPT guests are not
supported by nested hypervisors, and to prevent userspace from
configuring a guest to use HPT mode.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 7561c99..7f89b22 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4214,6 +4214,10 @@ static int kvm_vm_ioctl_get_smmu_info_hv(struct kvm *kvm,
 {
struct kvm_ppc_one_seg_page_size *sps;
 
+   /* If we're a nested hypervisor, we only support radix guests */
+   if (kvmhv_on_pseries())
+   return -EINVAL;
+
/*
 * POWER7, POWER8 and POWER9 all support 32 storage keys for data.
 * POWER7 doesn't support keys for instruction accesses,
@@ -4799,11 +4803,15 @@ static int kvmppc_core_emulate_mfspr_hv(struct kvm_vcpu 
*vcpu, int sprn,
 
 static int kvmppc_core_check_processor_compat_hv(void)
 {
-   if (!cpu_has_feature(CPU_FTR_HVMODE) ||
-   !cpu_has_feature(CPU_FTR_ARCH_206))
-   return -EIO;
+   if (cpu_has_feature(CPU_FTR_HVMODE) &&
+   cpu_has_feature(CPU_FTR_ARCH_206))
+   return 0;
 
-   return 0;
+   /* POWER9 in radix mode is capable of being a nested hypervisor. */
+   if (cpu_has_feature(CPU_FTR_ARCH_300) && radix_enabled())
+   return 0;
+
+   return -EIO;
 }
 
 #ifdef CONFIG_KVM_XICS
@@ -5121,6 +5129,10 @@ static int kvmhv_configure_mmu(struct kvm *kvm, struct kvm_ppc_mmuv3_cfg *cfg)
if (radix && !radix_enabled())
return -EINVAL;
 
+   /* If we're a nested hypervisor, we currently only support radix */
+   if (kvmhv_on_pseries() && !radix)
+   return -EINVAL;
+
mutex_lock(&kvm->lock);
if (radix != kvm_is_radix(kvm)) {
if (kvm->arch.mmu_ready) {
-- 
2.7.4



[PATCH v4 29/32] KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
memory of a hv_regs struct and the address of a pt_regs struct.  The
hcall requests the L0 hypervisor to use the register values in these
structs to run a L2 guest and to return the exit state of the L2 guest
in these structs.  These are in the endianness of the L1 guest, rather
than being always big-endian as is usually the case for PAPR
hypercalls.

This is convenient because it means that the L1 guest can pass the
address of the regs field in its kvm_vcpu_arch struct.  This also
improves performance slightly by avoiding the need for two copies of
the pt_regs struct.

When reading/writing these structures, this patch handles the case
where the endianness of the L1 guest differs from that of the L0
hypervisor, by byteswapping the structures after reading and before
writing them back.

Since all the fields of the pt_regs are of the same type, i.e.,
unsigned long, we treat it as an array of unsigned longs.  The fields
of struct hv_guest_state are not all the same, so its fields are
byteswapped individually.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_nested.c | 51 -
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index f54f779..8c0da00 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -50,6 +50,48 @@ void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
hr->ppr = vcpu->arch.ppr;
 }
 
+static void byteswap_pt_regs(struct pt_regs *regs)
+{
+   unsigned long *addr = (unsigned long *) regs;
+
+   for (; addr < ((unsigned long *) (regs + 1)); addr++)
+   *addr = swab64(*addr);
+}
+
+static void byteswap_hv_regs(struct hv_guest_state *hr)
+{
+   hr->version = swab64(hr->version);
+   hr->lpid = swab32(hr->lpid);
+   hr->vcpu_token = swab32(hr->vcpu_token);
+   hr->lpcr = swab64(hr->lpcr);
+   hr->pcr = swab64(hr->pcr);
+   hr->amor = swab64(hr->amor);
+   hr->dpdes = swab64(hr->dpdes);
+   hr->hfscr = swab64(hr->hfscr);
+   hr->tb_offset = swab64(hr->tb_offset);
+   hr->dawr0 = swab64(hr->dawr0);
+   hr->dawrx0 = swab64(hr->dawrx0);
+   hr->ciabr = swab64(hr->ciabr);
+   hr->hdec_expiry = swab64(hr->hdec_expiry);
+   hr->purr = swab64(hr->purr);
+   hr->spurr = swab64(hr->spurr);
+   hr->ic = swab64(hr->ic);
+   hr->vtb = swab64(hr->vtb);
+   hr->hdar = swab64(hr->hdar);
+   hr->hdsisr = swab64(hr->hdsisr);
+   hr->heir = swab64(hr->heir);
+   hr->asdr = swab64(hr->asdr);
+   hr->srr0 = swab64(hr->srr0);
+   hr->srr1 = swab64(hr->srr1);
+   hr->sprg[0] = swab64(hr->sprg[0]);
+   hr->sprg[1] = swab64(hr->sprg[1]);
+   hr->sprg[2] = swab64(hr->sprg[2]);
+   hr->sprg[3] = swab64(hr->sprg[3]);
+   hr->pidr = swab64(hr->pidr);
+   hr->cfar = swab64(hr->cfar);
+   hr->ppr = swab64(hr->ppr);
+}
+
 static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
 struct hv_guest_state *hr)
 {
@@ -174,6 +216,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
  sizeof(struct hv_guest_state));
if (err)
return H_PARAMETER;
+   if (kvmppc_need_byteswap(vcpu))
+   byteswap_hv_regs(&l2_hv);
if (l2_hv.version != HV_GUEST_STATE_VERSION)
return H_P2;
 
@@ -182,7 +226,8 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
  sizeof(struct pt_regs));
if (err)
return H_PARAMETER;
-
+   if (kvmppc_need_byteswap(vcpu))
+   byteswap_pt_regs(&l2_regs);
if (l2_hv.vcpu_token >= NR_CPUS)
return H_PARAMETER;
 
@@ -254,6 +299,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
kvmhv_put_nested(l2);
 
/* copy l2_hv_state and regs back to guest */
+   if (kvmppc_need_byteswap(vcpu)) {
+   byteswap_hv_regs(&l2_hv);
+   byteswap_pt_regs(&l2_regs);
+   }
err = kvm_vcpu_write_guest(vcpu, hv_ptr, &l2_hv,
   sizeof(struct hv_guest_state));
if (err)
-- 
2.7.4



[PATCH v4 27/32] KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register

2018-10-04 Thread Paul Mackerras
This adds a one-reg register identifier which can be used to read and
set the virtual PTCR for the guest.  This register identifies the
address and size of the virtual partition table for the guest, which
contains information about the nested guests under this guest.

Migrating this value is the only extra requirement for migrating a
guest which has nested guests (assuming of course that the destination
host supports nested virtualization in the kvm-hv module).

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 Documentation/virtual/kvm/api.txt   | 1 +
 arch/powerpc/include/uapi/asm/kvm.h | 1 +
 arch/powerpc/kvm/book3s_hv.c| 6 ++
 3 files changed, 8 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index c664064..017d851 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1922,6 +1922,7 @@ registers, find a list below:
   PPC   | KVM_REG_PPC_TIDR  | 64
   PPC   | KVM_REG_PPC_PSSCR | 64
   PPC   | KVM_REG_PPC_DEC_EXPIRY| 64
+  PPC   | KVM_REG_PPC_PTCR  | 64
   PPC   | KVM_REG_PPC_TM_GPR0   | 64
   ...
   PPC   | KVM_REG_PPC_TM_GPR31  | 64
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 1b32b56..8c876c1 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -634,6 +634,7 @@ struct kvm_ppc_cpu_char {
 
 #define KVM_REG_PPC_DEC_EXPIRY (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xbe)
 #define KVM_REG_PPC_ONLINE (KVM_REG_PPC | KVM_REG_SIZE_U32 | 0xbf)
+#define KVM_REG_PPC_PTCR   (KVM_REG_PPC | KVM_REG_SIZE_U64 | 0xc0)
 
 /* Transactional Memory checkpointed state:
  * This is all GPRs, all VSX regs and a subset of SPRs
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4da9564..7561c99 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1710,6 +1710,9 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
case KVM_REG_PPC_ONLINE:
*val = get_reg_val(id, vcpu->arch.online);
break;
+   case KVM_REG_PPC_PTCR:
+   *val = get_reg_val(id, vcpu->kvm->arch.l1_ptcr);
+   break;
default:
r = -EINVAL;
break;
@@ -1941,6 +1944,9 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, u64 id,
atomic_dec(&vcpu->arch.vcore->online_count);
vcpu->arch.online = i;
break;
+   case KVM_REG_PPC_PTCR:
+   vcpu->kvm->arch.l1_ptcr = set_reg_val(id, *val);
+   break;
default:
r = -EINVAL;
break;
-- 
2.7.4



[PATCH v4 28/32] KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
nested (L2) guest into the vcpu structure. We need to sanitise these
values to ensure we don't let the L1 guest hypervisor do things we don't
want it to.

We don't let data address watchpoints or completed instruction address
breakpoints be set to match in hypervisor state.

We also don't let L1 enable features in the hypervisor facility status
and control register (HFSCR) for L2 which we have disabled for L1. That
is L2 will get the subset of features which the L0 hypervisor has
enabled for L1 and the features L1 wants to enable for L2. This could
mean we give L1 a hypervisor facility unavailable interrupt for a
facility it thinks it has enabled, however it shouldn't have enabled a
facility it itself doesn't have for the L2 guest.

We sanitise the registers when copying in the L2 hv_regs. We don't need
to sanitise when copying back the L1 hv_regs since these shouldn't be
able to contain invalid values as they're just what was copied out.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h  |  1 +
 arch/powerpc/kvm/book3s_hv_nested.c | 17 +
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 6fda746..c9069897 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -415,6 +415,7 @@
 #define   HFSCR_DSCR   __MASK(FSCR_DSCR_LG)
 #define   HFSCR_VECVSX __MASK(FSCR_VECVSX_LG)
 #define   HFSCR_FP __MASK(FSCR_FP_LG)
+#define   HFSCR_INTR_CAUSE (ASM_CONST(0xFF) << 56) /* interrupt cause */
 #define SPRN_TAR   0x32f   /* Target Address Register */
 #define SPRN_LPCR  0x13E   /* LPAR Control Register */
#define   LPCR_VPM0    ASM_CONST(0x8000000000000000)
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 1a8c40d..f54f779 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -85,6 +85,22 @@ static void save_hv_return_state(struct kvm_vcpu *vcpu, int trap,
}
 }
 
+static void sanitise_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
+{
+   /*
+* Don't let L1 enable features for L2 which we've disabled for L1,
+* but preserve the interrupt cause field.
+*/
+   hr->hfscr &= (HFSCR_INTR_CAUSE | vcpu->arch.hfscr);
+
+   /* Don't let data address watchpoint match in hypervisor state */
+   hr->dawrx0 &= ~DAWRX_HYP;
+
+   /* Don't let completed instruction address breakpt match in HV state */
+   if ((hr->ciabr & CIABR_PRIV) == CIABR_PRIV_HYPER)
+   hr->ciabr &= ~CIABR_PRIV;
+}
+
 static void restore_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
@@ -197,6 +213,7 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
mask = LPCR_DPFD | LPCR_ILE | LPCR_TC | LPCR_AIL | LPCR_LD |
LPCR_LPES | LPCR_MER;
lpcr = (vc->lpcr & ~mask) | (l2_hv.lpcr & mask);
+   sanitise_hv_regs(vcpu, &l2_hv);
restore_hv_regs(vcpu, &l2_hv);
 
vcpu->arch.ret = RESUME_GUEST;
-- 
2.7.4



[PATCH v4 26/32] KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested

2018-10-04 Thread Paul Mackerras
When running as a nested hypervisor, this avoids reading hypervisor
privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
instead reasonable default values are used.  This also avoids writing
LPIDR in the single-vcpu entry/exit path.

Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
since its only caller already checks this.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  7 +++
 arch/powerpc/kvm/book3s_hv.c| 33 +
 2 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 68e14af..c615617 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -268,14 +268,13 @@ int kvmppc_mmu_hv_init(void)
 {
unsigned long host_lpid, rsvd_lpid;
 
-   if (!cpu_has_feature(CPU_FTR_HVMODE))
-   return -EINVAL;
-
if (!mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
return -EINVAL;
 
/* POWER7 has 10-bit LPIDs (12-bit in POWER8) */
-   host_lpid = mfspr(SPRN_LPID);
+   host_lpid = 0;
+   if (cpu_has_feature(CPU_FTR_HVMODE))
+   host_lpid = mfspr(SPRN_LPID);
rsvd_lpid = LPID_RSVD;
 
kvmppc_init_lpid(rsvd_lpid + 1);
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 53a967ea..4da9564 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2174,15 +2174,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 * Set the default HFSCR for the guest from the host value.
 * This value is only used on POWER9.
 * On POWER9, we want to virtualize the doorbell facility, so we
-* turn off the HFSCR bit, which causes those instructions to trap.
+* don't set the HFSCR_MSGP bit, and that causes those instructions
+* to trap and then we emulate them.
 */
-   vcpu->arch.hfscr = mfspr(SPRN_HFSCR);
-   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   vcpu->arch.hfscr = HFSCR_TAR | HFSCR_EBB | HFSCR_PM | HFSCR_BHRB |
+   HFSCR_DSCR | HFSCR_VECVSX | HFSCR_FP;
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   vcpu->arch.hfscr &= mfspr(SPRN_HFSCR);
+   if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
+   vcpu->arch.hfscr |= HFSCR_TM;
+   }
+   if (cpu_has_feature(CPU_FTR_TM_COMP))
vcpu->arch.hfscr |= HFSCR_TM;
-   else if (!cpu_has_feature(CPU_FTR_TM_COMP))
-   vcpu->arch.hfscr &= ~HFSCR_TM;
-   if (cpu_has_feature(CPU_FTR_ARCH_300))
-   vcpu->arch.hfscr &= ~HFSCR_MSGP;
 
kvmppc_mmu_book3s_hv_init(vcpu);
 
@@ -4001,8 +4004,10 @@ int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
 
srcu_read_unlock(&vc->kvm->srcu, srcu_idx);
 
-   mtspr(SPRN_LPID, vc->kvm->arch.host_lpid);
-   isync();
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   mtspr(SPRN_LPID, vc->kvm->arch.host_lpid);
+   isync();
+   }
 
trace_hardirqs_off();
set_irq_happened(trap);
@@ -4622,9 +4627,13 @@ static int kvmppc_core_init_vm_hv(struct kvm *kvm)
kvm->arch.host_sdr1 = mfspr(SPRN_SDR1);
 
/* Init LPCR for virtual RMA mode */
-   kvm->arch.host_lpid = mfspr(SPRN_LPID);
-   kvm->arch.host_lpcr = lpcr = mfspr(SPRN_LPCR);
-   lpcr &= LPCR_PECE | LPCR_LPES;
+   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   kvm->arch.host_lpid = mfspr(SPRN_LPID);
+   kvm->arch.host_lpcr = lpcr = mfspr(SPRN_LPCR);
+   lpcr &= LPCR_PECE | LPCR_LPES;
+   } else {
+   lpcr = 0;
+   }
lpcr |= (4UL << LPCR_DPFD_SH) | LPCR_HDICE |
LPCR_VPM0 | LPCR_VPM1;
kvm->arch.vrma_slb_v = SLB_VSID_B_1T |
-- 
2.7.4



[PATCH v4 25/32] KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

This is only done at level 0, since only level 0 knows which physical
CPU a vcpu is running on.  This does for nested guests what L0 already
did for its own guests, which is to flush the TLB on a pCPU when it
goes to run a vCPU there, and there is another vCPU in the same VM
which previously ran on this pCPU and has now started to run on another
pCPU.  This is to handle the situation where the other vCPU touched
a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
on that new pCPU and thus left behind a stale TLB entry on this pCPU.

This introduces a limit on the vcpu_token values used in the
H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.

[pau...@ozlabs.org - made prev_cpu array be unsigned short[] to reduce
 memory consumption.]

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   3 +
 arch/powerpc/kvm/book3s_hv.c | 101 +++
 arch/powerpc/kvm/book3s_hv_nested.c  |   5 ++
 3 files changed, 71 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index aa5bf85..1e96027 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -52,6 +52,9 @@ struct kvm_nested_guest {
long refcnt;/* number of pointers to this struct */
struct mutex tlb_lock;  /* serialize page faults and tlbies */
struct kvm_nested_guest *next;
+   cpumask_t need_tlb_flush;
+   cpumask_t cpu_in_guest;
+   unsigned short prev_cpu[NR_CPUS];
 };
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ba58883..53a967ea 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2397,10 +2397,18 @@ static void kvmppc_release_hwthread(int cpu)
 
 static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 {
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
+   cpumask_t *cpu_in_guest;
int i;
 
cpu = cpu_first_thread_sibling(cpu);
-   cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
+   if (nested) {
+   cpumask_set_cpu(cpu, &nested->need_tlb_flush);
+   cpu_in_guest = &nested->cpu_in_guest;
+   } else {
+   cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
+   cpu_in_guest = &kvm->arch.cpu_in_guest;
+   }
/*
 * Make sure setting of bit in need_tlb_flush precedes
 * testing of cpu_in_guest bits.  The matching barrier on
@@ -2408,13 +2416,23 @@ static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 */
smp_mb();
for (i = 0; i < threads_per_core; ++i)
-   if (cpumask_test_cpu(cpu + i, &kvm->arch.cpu_in_guest))
+   if (cpumask_test_cpu(cpu + i, cpu_in_guest))
smp_call_function_single(cpu + i, do_nothing, NULL, 1);
 }
 
 static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
 {
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
struct kvm *kvm = vcpu->kvm;
+   int prev_cpu;
+
+   if (!cpu_has_feature(CPU_FTR_HVMODE))
+   return;
+
+   if (nested)
+   prev_cpu = nested->prev_cpu[vcpu->arch.nested_vcpu_id];
+   else
+   prev_cpu = vcpu->arch.prev_cpu;
 
/*
 * With radix, the guest can do TLB invalidations itself,
@@ -2428,12 +2446,46 @@ static void kvmppc_prepare_radix_vcpu(struct kvm_vcpu *vcpu, int pcpu)
 * ran to flush the TLB.  The TLB is shared between threads,
 * so we use a single bit in .need_tlb_flush for all 4 threads.
 */
-   if (vcpu->arch.prev_cpu != pcpu) {
-   if (vcpu->arch.prev_cpu >= 0 &&
-   cpu_first_thread_sibling(vcpu->arch.prev_cpu) !=
+   if (prev_cpu != pcpu) {
+   if (prev_cpu >= 0 &&
+   cpu_first_thread_sibling(prev_cpu) !=
cpu_first_thread_sibling(pcpu))
-   radix_flush_cpu(kvm, vcpu->arch.prev_cpu, vcpu);
-   vcpu->arch.prev_cpu = pcpu;
+   radix_flush_cpu(kvm, prev_cpu, vcpu);
+   if (nested)
+   nested->prev_cpu[vcpu->arch.nested_vcpu_id] = pcpu;
+   else
+   vcpu->arch.prev_cpu = pcpu;
+   }
+}
+
+static void kvmppc_radix_check_need_tlb_flush(struct kvm *kvm, int pcpu,
+ struct kvm_nested_guest *nested)
+{
+   cpumask_t *need_tlb_flush;
+   int lpid;
+
+   if (!cpu_has_feature(CPU_FTR_HVMODE))
+   return;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   pcpu &= ~0x3UL;
+
+   if (nested) {
+   lpid = nested->shadow_lpid;
+   need_tlb_flush = &nested->need_tlb_flush;
+   } else {
+   lpid 

[PATCH v4 24/32] KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested

2018-10-04 Thread Paul Mackerras
This adds code to call the H_TLB_INVALIDATE hypercall when running as
a guest, in the cases where we need to invalidate TLBs (or other MMU
caches) as part of managing the mappings for a nested guest.  Calling
H_TLB_INVALIDATE lets the nested hypervisor inform the parent
hypervisor about changes to partition-scoped page tables or the
partition table without needing to do hypervisor-privileged tlbie
instructions.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  5 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 30 --
 arch/powerpc/kvm/book3s_hv_nested.c  | 30 --
 3 files changed, 57 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index a02f0b3..aa5bf85 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC_PSERIES
 static inline bool kvmhv_on_pseries(void)
@@ -117,6 +118,10 @@ struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, 
int l1_lpid,
  bool create);
 void kvmhv_put_nested(struct kvm_nested_guest *gp);
 
+/* Encoding of first parameter for H_TLB_INVALIDATE */
+#define H_TLBIE_P1_ENC(ric, prs, r)(___PPC_RIC(ric) | ___PPC_PRS(prs) | \
+___PPC_R(r))
+
 /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
 #define PPC_MIN_HPT_ORDER  18
 #define PPC_MAX_HPT_ORDER  46
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 4c1eccb..ae0e3ed 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -201,17 +201,43 @@ static void kvmppc_radix_tlbie_page(struct kvm *kvm, unsigned long addr,
unsigned int pshift, unsigned int lpid)
 {
unsigned long psize = PAGE_SIZE;
+   int psi;
+   long rc;
+   unsigned long rb;
 
if (pshift)
psize = 1UL << pshift;
+   else
+   pshift = PAGE_SHIFT;
 
addr &= ~(psize - 1);
-   radix__flush_tlb_lpid_page(lpid, addr, psize);
+
+   if (!kvmhv_on_pseries()) {
+   radix__flush_tlb_lpid_page(lpid, addr, psize);
+   return;
+   }
+
+   psi = shift_to_mmu_psize(pshift);
+   rb = addr | (mmu_get_ap(psi) << PPC_BITLSHIFT(58));
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(0, 0, 1),
+   lpid, rb);
+   if (rc)
+   pr_err("KVM: TLB page invalidation hcall failed, rc=%ld\n", rc);
 }
 
 static void kvmppc_radix_flush_pwc(struct kvm *kvm, unsigned int lpid)
 {
-   radix__flush_pwc_lpid(lpid);
+   long rc;
+
+   if (!kvmhv_on_pseries()) {
+   radix__flush_pwc_lpid(lpid);
+   return;
+   }
+
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(1, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   if (rc)
+   pr_err("KVM: TLB PWC invalidation hcall failed, rc=%ld\n", rc);
 }
 
 static unsigned long kvmppc_radix_update_pte(struct kvm *kvm, pte_t *ptep,
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 26151e8..35f8111 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -298,14 +298,32 @@ void kvmhv_nested_exit(void)
}
 }
 
+static void kvmhv_flush_lpid(unsigned int lpid)
+{
+   long rc;
+
+   if (!kvmhv_on_pseries()) {
+   radix__flush_tlb_lpid(lpid);
+   return;
+   }
+
+   rc = plpar_hcall_norets(H_TLB_INVALIDATE, H_TLBIE_P1_ENC(2, 0, 1),
+   lpid, TLBIEL_INVAL_SET_LPID);
+   if (rc)
+   pr_err("KVM: TLB LPID invalidation hcall failed, rc=%ld\n", rc);
+}
+
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
 {
-   if (cpu_has_feature(CPU_FTR_HVMODE)) {
+   if (!kvmhv_on_pseries()) {
mmu_partition_table_set_entry(lpid, dw0, dw1);
-   } else {
-   pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
-   pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
+   return;
}
+
+   pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
+   pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
+   /* L0 will do the necessary barriers */
+   kvmhv_flush_lpid(lpid);
 }
 
 static void kvmhv_set_nested_ptbl(struct kvm_nested_guest *gp)
@@ -482,7 +500,7 @@ static void kvmhv_flush_nested(struct kvm_nested_guest *gp)
spin_lock(&kvm->mmu_lock);
kvmppc_free_pgtable_radix(kvm, gp->shadow_pgtable, gp->shadow_lpid);
spin_unlock(&kvm->mmu_lock);
-   radix__flush_tlb_lpid(gp->shadow_lpid);
+   kvmhv_flush_lpid(gp->shadow_lpid);
   

[PATCH v4 23/32] KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When running a nested (L2) guest the guest (L1) hypervisor will use
the H_TLB_INVALIDATE hcall when it needs to change the partition
scoped page tables or the partition table which it manages.  It will
use this hcall in the situations where it would use a partition-scoped
tlbie instruction if it were running in hypervisor mode.

The H_TLB_INVALIDATE hcall can invalidate different scopes:

Invalidate TLB for a given target address:
- This invalidates a single L2 -> L1 pte
- We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
  address space which is being invalidated. This is because a single
  L2 -> L1 pte may have been mapped with more than one pte in the
  L2 -> L0 page tables.

Invalidate the entire TLB for a given LPID or for all LPIDs:
- Invalidate the entire shadow_pgtable for a given nested guest, or
  for all nested guests.

Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
- We don't cache the PWC, so nothing to do.

Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
- Here we re-read the partition table entry and remove the nested state
  for any nested guest for which the first doubleword of the partition
  table entry is now zero.

The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
word (of which only the RIC, PRS and R fields are used), the rS value
(giving the lpid, where required) and the rB value (giving the IS, AP
and EPN values).

[pau...@ozlabs.org - adapted to having the partition table in guest
memory, added the H_TLB_INVALIDATE implementation, removed tlbie
instruction emulation, reworded the commit message.]

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  12 ++
 arch/powerpc/include/asm/kvm_book3s.h |   1 +
 arch/powerpc/include/asm/ppc-opcode.h |   1 +
 arch/powerpc/kvm/book3s_emulate.c |   1 -
 arch/powerpc/kvm/book3s_hv.c  |   3 +
 arch/powerpc/kvm/book3s_hv_nested.c   | 196 +-
 6 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index b3520b5..66db23e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -203,6 +203,18 @@ static inline unsigned int mmu_psize_to_shift(unsigned int mmu_psize)
BUG();
 }
 
+static inline unsigned int ap_to_shift(unsigned long ap)
+{
+   int psize;
+
+   for (psize = 0; psize < MMU_PAGE_COUNT; psize++) {
+   if (mmu_psize_defs[psize].ap == ap)
+   return mmu_psize_defs[psize].shift;
+   }
+
+   return -1;
+}
+
 static inline unsigned long get_sllp_encoding(int psize)
 {
unsigned long sllp;
diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index d7aeb6f..09f8e9b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -301,6 +301,7 @@ long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
 long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
+long kvmhv_do_nested_tlbie(struct kvm_vcpu *vcpu);
 int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
  u64 time_limit, unsigned long lpcr);
 void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 665af14..6093bc8 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -104,6 +104,7 @@
 #define OP_31_XOP_LHZUX 311
 #define OP_31_XOP_MSGSNDP   142
 #define OP_31_XOP_MSGCLRP   174
+#define OP_31_XOP_TLBIE 306
 #define OP_31_XOP_MFSPR 339
 #define OP_31_XOP_LWAX  341
 #define OP_31_XOP_LHAX  343
diff --git a/arch/powerpc/kvm/book3s_emulate.c b/arch/powerpc/kvm/book3s_emulate.c
index 2654df2..8c7e933 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ b/arch/powerpc/kvm/book3s_emulate.c
@@ -36,7 +36,6 @@
 #define OP_31_XOP_MTSR 210
 #define OP_31_XOP_MTSRIN   242
 #define OP_31_XOP_TLBIEL   274
-#define OP_31_XOP_TLBIE306
 /* Opcode is officially reserved, reuse it as sc 1 when sc 1 doesn't trap */
 #define OP_31_XOP_FAKE_SC1 308
 #define OP_31_XOP_SLBMTE   402
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2d8209a..ba58883 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -974,6 +974,9 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
case H_TLB_INVALIDATE:
ret = H_FUNCTION;
+   if (!vcpu->kvm->arch.nested_enable)
+   break;
+ 

[PATCH v4 22/32] KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When a host (L0) page which is mapped into a (L1) guest is in turn
mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
so that these mappings can be retrieved later.

Whenever we create an entry in a shadow_pgtable for a nested guest we
create a corresponding rmap entry and add it to the list for the
L1 guest memslot at the index of the L1 guest page it maps. This means
at the L1 guest memslot we end up with lists of rmaps.

When we are notified of a host page being invalidated which has been
mapped through to a (L1) guest, we can then walk the rmap list for that
guest page, and find and invalidate all of the corresponding
shadow_pgtable entries.

In order to reduce memory consumption, we compress the information for
each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
for the guest real page frame number -- which will fit in a single
unsigned long.  To avoid a scenario where a guest can trigger
unbounded memory allocations, we scan the list when adding an entry to
see if there is already an entry with the contents we need.  This can
occur, because we don't ever remove entries from the middle of a list.

A struct nested guest rmap is a list pointer and an rmap entry;

----------------
| next pointer |
----------------
| rmap entry   |
----------------


Thus the rmap pointer for each guest frame number in the memslot can be
either NULL, a single entry, or a pointer to a list of nested rmap entries.

gfn  memslot rmap array
    -------------------------
 0  | NULL                  |   (no rmap entry)
    -------------------------
 1  | single rmap entry     |   (rmap entry with low bit set)
    -------------------------
 2  | list head pointer     |   (list of rmap entries)
    -------------------------

The final entry always has the lowest bit set and is stored in the next
pointer of the last list entry, or as a single rmap entry.
With a list of rmap entries looking like;

-----------------      -----------------      ---------------------
| list head ptr | ---> | next pointer  | ---> | single rmap entry |
-----------------      -----------------      ---------------------
| rmap entry    |      | rmap entry    |
-----------------      -----------------

Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h|   3 +
 arch/powerpc/include/asm/kvm_book3s_64.h |  70 -
 arch/powerpc/kvm/book3s_64_mmu_radix.c   |  44 +++
 arch/powerpc/kvm/book3s_hv.c |   1 +
 arch/powerpc/kvm/book3s_hv_nested.c  | 130 ++-
 5 files changed, 233 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 63f7ccf..d7aeb6f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -196,6 +196,9 @@ extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
+extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
+   unsigned int shift, struct kvm_memory_slot *memslot,
+   unsigned int lpid);
 extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
bool writing, unsigned long gpa,
unsigned int lpid);
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 5496152..a02f0b3 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -53,6 +53,66 @@ struct kvm_nested_guest {
struct kvm_nested_guest *next;
 };
 
+/*
+ * We define a nested rmap entry as a single 64-bit quantity
+ * 0xFFF0000000000000  12-bit lpid field
+ * 0x000FFFFFFFFFF000  40-bit guest 4k page frame number
+ * 0x0000000000000001  1-bit  single entry flag
+ */
+#define RMAP_NESTED_LPID_MASK  0xFFF0000000000000UL
+#define RMAP_NESTED_LPID_SHIFT (52)
+#define RMAP_NESTED_GPA_MASK   0x000FFFFFFFFFF000UL
+#define RMAP_NESTED_IS_SINGLE_ENTRY    0x0000000000000001UL
+
+/* Structure for a nested guest rmap entry */
+struct rmap_nested {
+   struct llist_node list;
+   u64 rmap;
+};
+
+/*
+ * for_each_nest_rmap_safe - iterate over the list of nested rmap entries
+ *  safe against removal of the list entry or NULL list
+ * @pos:   a (struct rmap_nested *) to use as a loop cursor
+ * @node:  pointer to the first entry
+ * NOTE: this can be NULL
+ * @rmapp: an (unsigned long *) in which to return the rmap entries on each

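The rmap encoding shown in the hunk above (12-bit lpid field at bit 52, 40-bit guest 4k page frame number, low single-entry flag bit) can be exercised standalone. A minimal sketch — the DEMO_-prefixed masks restate the patch's stated field widths and are illustrative, not the kernel's own definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Masks restated from the stated field widths: lpid in bits 63:52,
 * guest 4k page frame number in bits 51:12, single-entry flag in bit 0. */
#define DEMO_RMAP_LPID_MASK	0xFFF0000000000000ULL
#define DEMO_RMAP_LPID_SHIFT	52
#define DEMO_RMAP_GPA_MASK	0x000FFFFFFFFFF000ULL
#define DEMO_RMAP_SINGLE_ENTRY	0x0000000000000001ULL

/* Pack an lpid and a guest physical address into one rmap word. */
static uint64_t demo_rmap_encode(uint32_t lpid, uint64_t gpa)
{
	return (((uint64_t)lpid << DEMO_RMAP_LPID_SHIFT) & DEMO_RMAP_LPID_MASK) |
	       (gpa & DEMO_RMAP_GPA_MASK);
}

static uint32_t demo_rmap_lpid(uint64_t rmap)
{
	return (uint32_t)((rmap & DEMO_RMAP_LPID_MASK) >> DEMO_RMAP_LPID_SHIFT);
}

static uint64_t demo_rmap_gpa(uint64_t rmap)
{
	return rmap & DEMO_RMAP_GPA_MASK;
}
```

The point of using a single 64-bit quantity is that one word identifies both which nested guest (lpid) and which guest page (gpa) a reverse-map entry refers to.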
[PATCH v4 21/32] KVM: PPC: Book3S HV: Handle page fault for a nested guest

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

Consider a normal (L1) guest running under the main hypervisor (L0),
and then a nested guest (L2) running under the L1 guest which is acting
as a nested hypervisor. L0 has page tables to map the address space for
L1 providing the translation from L1 real address -> L0 real address;

L1
|
| (L1 -> L0)
|
> L0

There are also page tables in L1 used to map the address space for L2
providing the translation from L2 real address -> L1 real address. Since
the hardware can only walk a single level of page table, we need to
maintain in L0 a "shadow_pgtable" for L2 which provides the translation
from L2 real address -> L0 real address, which looks like:

L2  L2
|   |
| (L2 -> L1)|
|   |
> L1| (L2 -> L0)
  | |
  | (L1 -> L0)  |
  | |
  > L0  > L0

When a page fault occurs while running a nested (L2) guest we need to
insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
do this we need to:

1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
   provide a page fault to L1 if this mapping doesn't exist.
2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
   or try to insert a pte for that mapping if it doesn't exist.
3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
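Steps 1-3 amount to composing two translations. A toy model under heavy assumptions — flat one-entry-per-page tables and demo_-named helpers are hypothetical stand-ins for the real radix walks done by the patch:

```c
#include <assert.h>
#include <stdint.h>

/* One toy pte per 4k page: a valid flag plus the translated real
 * address of the page.  Stand-in for a radix walk, not kernel code. */
struct demo_pte {
	uint64_t raddr;
	int valid;
};

/* Translate addr through one table; nonzero return models a page fault. */
static int demo_xlate(const struct demo_pte *tbl, uint64_t npages,
		      uint64_t addr, uint64_t *out)
{
	uint64_t pfn = addr >> 12;

	if (pfn >= npages || !tbl[pfn].valid)
		return -1;			/* pte missing: fault */
	*out = tbl[pfn].raddr | (addr & 0xfff);
	return 0;
}

/* Steps 1-3: walk L2->L1 (faulting back to L1 on a miss), then L1->L0,
 * yielding the L2->L0 translation that would go into the shadow table. */
static int demo_shadow_xlate(const struct demo_pte *l2_to_l1, uint64_t n1,
			     const struct demo_pte *l1_to_l0, uint64_t n0,
			     uint64_t l2_ra, uint64_t *l0_ra)
{
	uint64_t l1_ra;

	if (demo_xlate(l2_to_l1, n1, l2_ra, &l1_ra))
		return -1;		/* step 1 failed: reflect to L1 */
	return demo_xlate(l1_to_l0, n0, l1_ra, l0_ra);
}
```

Caching the composed result is what makes the shadow_pgtable worthwhile: the hardware only ever walks that single-level composition.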

Once this mapping exists we can take rc faults when hardware is unable
to automatically set the reference and change bits in the pte. On these
we need to:

1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
   the fault down to L1.
2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
   host page.
3. Set the rc bits in the L2 -> L0 pte.

Since we reuse a large number of functions in book3s_64_mmu_radix.c for
this, we also needed to refactor a number of them to take an lpid
parameter so that the correct lpid is used for TLB invalidations.
The functionality, however, remains the same.

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |   1 +
 arch/powerpc/include/asm/kvm_book3s.h  |  17 ++
 arch/powerpc/include/asm/kvm_book3s_64.h   |   4 +
 arch/powerpc/include/asm/kvm_host.h|   2 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 194 ++--
 arch/powerpc/kvm/book3s_hv_nested.c| 332 -
 arch/powerpc/mm/tlb-radix.c|   9 +
 7 files changed, 473 insertions(+), 86 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 1154a6d..671316f 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -53,6 +53,7 @@ extern void radix__flush_tlb_lpid_page(unsigned int lpid,
unsigned long addr,
unsigned long page_size);
 extern void radix__flush_pwc_lpid(unsigned int lpid);
+extern void radix__flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid(unsigned int lpid);
 extern void radix__local_flush_tlb_lpid_guest(unsigned int lpid);
 
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 093fd70..63f7ccf 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,17 +188,34 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm 
*kvm, unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, gva_t eaddr,
+ struct kvmppc_pte *gpte, u64 root,
+ u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, u64 table,
int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
+extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
+   bool writing, unsigned long gpa,
+   unsigned int lpid);
+extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+   unsigned long gpa,
+   struct kvm_memory_slot *memslot,
+

[PATCH v4 20/32] KVM: PPC: Book3S HV: Handle hypercalls correctly when nested

2018-10-04 Thread Paul Mackerras
When we are running as a nested hypervisor, we use a hypercall to
enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
that the hypercall handlers listed in hcall_real_table never get called.
There are some hypercalls that are handled there and not in
kvmppc_pseries_do_hcall(), which therefore won't get processed for
a nested guest.

To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
hypercalls, with the following exceptions:

- The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
  we only support radix mode for nested guests.

- H_CEDE has to be handled specially because the cede logic in
  kvmhv_run_single_vcpu assumes that it has been processed by the time
  that kvmhv_p9_guest_entry() returns.  Therefore we put a special
  case for H_CEDE in kvmhv_p9_guest_entry().

For the XICS hypercalls, if real-mode processing is enabled, then the
virtual-mode handlers assume that they are being called only to finish
up the operation.  Therefore we turn off the real-mode flag in the XICS
code when running as a nested hypervisor.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 +++
 arch/powerpc/kvm/book3s_hv.c  | 43 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  2 ++
 arch/powerpc/kvm/book3s_xics.c|  3 ++-
 4 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 5c9b00c..c55ba3b 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -167,4 +167,8 @@ void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
 int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
 
+long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);
+long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
+   unsigned long dabrx);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a7cb310..134d7c7 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -915,6 +916,19 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
break;
}
return RESUME_HOST;
+   case H_SET_DABR:
+   ret = kvmppc_h_set_dabr(vcpu, kvmppc_get_gpr(vcpu, 4));
+   break;
+   case H_SET_XDABR:
+   ret = kvmppc_h_set_xdabr(vcpu, kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5));
+   break;
+   case H_GET_TCE:
+   ret = kvmppc_h_get_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+   kvmppc_get_gpr(vcpu, 5));
+   if (ret == H_TOO_HARD)
+   return RESUME_HOST;
+   break;
case H_PUT_TCE:
ret = kvmppc_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
kvmppc_get_gpr(vcpu, 5),
@@ -938,6 +952,10 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (ret == H_TOO_HARD)
return RESUME_HOST;
break;
+   case H_RANDOM:
+   if (!powernv_get_random_long(&vcpu->arch.regs.gpr[4]))
+   ret = H_HARDWARE;
+   break;
 
case H_SET_PARTITION_TABLE:
ret = H_FUNCTION;
@@ -966,6 +984,24 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
return RESUME_GUEST;
 }
 
+/*
+ * Handle H_CEDE in the nested virtualization case where we haven't
+ * called the real-mode hcall handlers in book3s_hv_rmhandlers.S.
+ * This has to be done early, not in kvmppc_pseries_do_hcall(), so
+ * that the cede logic in kvmppc_run_single_vcpu() works properly.
+ */
+static void kvmppc_nested_cede(struct kvm_vcpu *vcpu)
+{
+   vcpu->arch.shregs.msr |= MSR_EE;
+   vcpu->arch.ceded = 1;
+   smp_mb();
+   if (vcpu->arch.prodded) {
+   vcpu->arch.prodded = 0;
+   smp_mb();
+   vcpu->arch.ceded = 0;
+   }
+}
+
 static int kvmppc_hcall_impl_hv(unsigned long cmd)
 {
switch (cmd) {
@@ -3422,6 +3458,13 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
vcpu->arch.shregs.msr = vcpu->arch.regs.msr;
vcpu->arch.shregs.dar = mfspr(SPRN_DAR);
vcpu->arch.shregs.dsisr = mfspr(SPRN_DSISR);
+
+   /* H_CEDE has to be handled now, not later */
+   if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
+   kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
+   kvmppc_nested_cede(vcpu);
+   trap = 0;
+   }
} else {
trap = 

[PATCH v4 19/32] KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor

2018-10-04 Thread Paul Mackerras
This adds code to call the H_IPI and H_EOI hypercalls when we are
running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
feature) and we would otherwise access the XICS interrupt controller
directly or via an OPAL call.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c |  7 +-
 arch/powerpc/kvm/book3s_hv_builtin.c | 44 +---
 arch/powerpc/kvm/book3s_hv_rm_xics.c |  8 +++
 3 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9900fd8..a7cb310 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -173,6 +173,10 @@ static bool kvmppc_ipi_thread(int cpu)
 {
unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
 
+   /* If we're a nested hypervisor, fall back to ordinary IPIs for now */
+   if (kvmhv_on_pseries())
+   return false;
+
/* On POWER9 we can use msgsnd to IPI any cpu */
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
msg |= get_hard_smp_processor_id(cpu);
@@ -5165,7 +5169,8 @@ static int kvmppc_book3s_init_hv(void)
 * indirectly, via OPAL.
 */
 #ifdef CONFIG_SMP
-   if (!xive_enabled() && !local_paca->kvm_hstate.xics_phys) {
+   if (!xive_enabled() && !kvmhv_on_pseries() &&
+   !local_paca->kvm_hstate.xics_phys) {
struct device_node *np;
 
np = of_find_compatible_node(NULL, NULL, "ibm,opal-intc");
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index ccfea5b..a71e2fc 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -231,6 +231,15 @@ void kvmhv_rm_send_ipi(int cpu)
void __iomem *xics_phys;
unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
 
+   /* For a nested hypervisor, use the XICS via hcall */
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf, get_hard_smp_processor_id(cpu),
+   IPI_PRIORITY);
+   return;
+   }
+
/* On POWER9 we can use msgsnd for any destination cpu. */
if (cpu_has_feature(CPU_FTR_ARCH_300)) {
msg |= get_hard_smp_processor_id(cpu);
@@ -460,12 +469,19 @@ static long kvmppc_read_one_intr(bool *again)
return 1;
 
/* Now read the interrupt from the ICP */
-   xics_phys = local_paca->kvm_hstate.xics_phys;
-   rc = 0;
-   if (!xics_phys)
-   rc = opal_int_get_xirr(&xirr, false);
-   else
-   xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   rc = plpar_hcall_raw(H_XIRR, retbuf, 0xFF);
+   xirr = cpu_to_be32(retbuf[0]);
+   } else {
+   xics_phys = local_paca->kvm_hstate.xics_phys;
+   rc = 0;
+   if (!xics_phys)
+   rc = opal_int_get_xirr(&xirr, false);
+   else
+   xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
+   }
if (rc < 0)
return 1;
 
@@ -494,7 +510,13 @@ static long kvmppc_read_one_intr(bool *again)
 */
if (xisr == XICS_IPI) {
rc = 0;
-   if (xics_phys) {
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf,
+   hard_smp_processor_id(), 0xff);
+   plpar_hcall_raw(H_EOI, retbuf, h_xirr);
+   } else if (xics_phys) {
__raw_rm_writeb(0xff, xics_phys + XICS_MFRR);
__raw_rm_writel(xirr, xics_phys + XICS_XIRR);
} else {
@@ -520,7 +542,13 @@ static long kvmppc_read_one_intr(bool *again)
/* We raced with the host,
 * we need to resend that IPI, bummer
 */
-   if (xics_phys)
+   if (kvmhv_on_pseries()) {
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+   plpar_hcall_raw(H_IPI, retbuf,
+   hard_smp_processor_id(),
+   IPI_PRIORITY);
+   } else if (xics_phys)
__raw_rm_writeb(IPI_PRIORITY,
xics_phys + XICS_MFRR);
else
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c 
b/arch/powerpc/kvm/book3s_hv_rm_xics.c
index 8b9f356..b3f5786 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
@@ -767,6 +767,14 @@ static void 

[PATCH v4 18/32] KVM: PPC: Book3S HV: Nested guest entry via hypercall

2018-10-04 Thread Paul Mackerras
This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
hypervisor to enter one of its nested guests.  The hypercall supplies
register values in two structs.  Those values are copied by the level 0
(L0) hypervisor (the one which is running in hypervisor mode) into the
vcpu struct of the L1 guest, and then the guest is run until an
interrupt or error occurs which needs to be reported to L1 via the
hypercall return value.

Currently this assumes that the L0 and L1 hypervisors are the same
endianness, and the structs passed as arguments are in native
endianness.  If they are of different endianness, the version number
check will fail and the hcall will be rejected.

Nested hypervisors do not support indep_threads_mode=N, so this adds
code to print a warning message if the administrator has set
indep_threads_mode=N, and treat it as Y.
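The version-number gate described above can be illustrated in a few lines. Everything here is a hedged sketch: the struct is abbreviated, and DEMO_H_PARAMETER (-4) is an assumed reject value for illustration, not taken from this patch:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HV_GUEST_STATE_VERSION	1
#define DEMO_H_PARAMETER		(-4L)	/* assumed error value */

/* Abbreviated register file: only the fields the check touches. */
struct demo_hv_guest_state {
	uint64_t version;
	uint32_t lpid;
	uint32_t vcpu_token;
	/* ... remaining registers elided ... */
};

/*
 * Reject any structure whose version doesn't match.  Because a
 * different-endian L1 would present a byte-swapped version field,
 * this same check also rejects endianness mismatches, as the commit
 * message notes.
 */
static long demo_enter_nested_check(const struct demo_hv_guest_state *hs)
{
	if (hs->version != DEMO_HV_GUEST_STATE_VERSION)
		return DEMO_H_PARAMETER;
	return 0;
}
```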

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/hvcall.h   |  36 +
 arch/powerpc/include/asm/kvm_book3s.h   |   7 +
 arch/powerpc/include/asm/kvm_host.h |   5 +
 arch/powerpc/kernel/asm-offsets.c   |   1 +
 arch/powerpc/kvm/book3s_hv.c| 214 -
 arch/powerpc/kvm/book3s_hv_nested.c | 230 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   8 ++
 7 files changed, 471 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index c95c651..45e8789 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -466,6 +466,42 @@ struct h_cpu_char_result {
u64 behaviour;
 };
 
+/* Register state for entering a nested guest with H_ENTER_NESTED */
+struct hv_guest_state {
+   u64 version;/* version of this structure layout */
+   u32 lpid;
+   u32 vcpu_token;
+   /* These registers are hypervisor privileged (at least for writing) */
+   u64 lpcr;
+   u64 pcr;
+   u64 amor;
+   u64 dpdes;
+   u64 hfscr;
+   s64 tb_offset;
+   u64 dawr0;
+   u64 dawrx0;
+   u64 ciabr;
+   u64 hdec_expiry;
+   u64 purr;
+   u64 spurr;
+   u64 ic;
+   u64 vtb;
+   u64 hdar;
+   u64 hdsisr;
+   u64 heir;
+   u64 asdr;
+   /* These are OS privileged but need to be set late in guest entry */
+   u64 srr0;
+   u64 srr1;
+   u64 sprg[4];
+   u64 pidr;
+   u64 cfar;
+   u64 ppr;
+};
+
+/* Latest version of hv_guest_state structure */
+#define HV_GUEST_STATE_VERSION 1
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HVCALL_H */
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 43f212e..093fd70 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -280,6 +280,13 @@ void kvmhv_vm_nested_init(struct kvm *kvm);
 long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
 void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
 void kvmhv_release_all_nested(struct kvm *kvm);
+long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
+int kvmhv_run_single_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu,
+ u64 time_limit, unsigned long lpcr);
+void kvmhv_save_hv_regs(struct kvm_vcpu *vcpu, struct hv_guest_state *hr);
+void kvmhv_restore_hv_return_state(struct kvm_vcpu *vcpu,
+  struct hv_guest_state *hr);
+long int kvmhv_nested_page_fault(struct kvm_vcpu *vcpu);
 
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index c35d4f2..ceb9f20 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -95,6 +95,7 @@ struct dtl_entry;
 
 struct kvmppc_vcpu_book3s;
 struct kvmppc_book3s_shadow_vcpu;
+struct kvm_nested_guest;
 
 struct kvm_vm_stat {
ulong remote_tlb_flush;
@@ -786,6 +787,10 @@ struct kvm_vcpu_arch {
u32 emul_inst;
 
u32 online;
+
+   /* For support of nested guests */
+   struct kvm_nested_guest *nested;
+   u32 nested_vcpu_id;
 #endif
 
 #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 7c3738d..d0abcbb 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -503,6 +503,7 @@ int main(void)
OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr);
OFFSET(VCPU_VPA_DIRTY, kvm_vcpu, arch.vpa.dirty);
OFFSET(VCPU_HEIR, kvm_vcpu, arch.emul_inst);
+   OFFSET(VCPU_NESTED, kvm_vcpu, arch.nested);
OFFSET(VCPU_CPU, kvm_vcpu, cpu);
OFFSET(VCPU_THREAD_CPU, kvm_vcpu, arch.thread_cpu);
 #endif
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ca2529e..9900fd8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -942,6 +942,13 @@ int 

[PATCH v4 17/32] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization

2018-10-04 Thread Paul Mackerras
This starts the process of adding the code to support nested HV-style
virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
a nested hypervisor can use to set the base address and size of a
partition table in its memory (analogous to the PTCR register).
On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
hypercall from the guest is handled by code that saves the virtual
PTCR value for the guest.

This also adds code for creating and destroying nested guests and for
reading the partition table entry for a nested guest from L1 memory.
Each nested guest has its own shadow LPID value, different in general
from the LPID value used by the nested hypervisor to refer to it.  The
shadow LPID value is allocated at nested guest creation time.

Nested hypervisor functionality is only available for a radix guest,
which therefore means a radix host on a POWER9 (or later) processor.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/hvcall.h |   5 +
 arch/powerpc/include/asm/kvm_book3s.h |  10 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |  33 
 arch/powerpc/include/asm/kvm_book3s_asm.h |   3 +
 arch/powerpc/include/asm/kvm_host.h   |   5 +
 arch/powerpc/kvm/Makefile |   3 +-
 arch/powerpc/kvm/book3s_hv.c  |  27 ++-
 arch/powerpc/kvm/book3s_hv_nested.c   | 298 ++
 8 files changed, 377 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_nested.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index a0b17f9..c95c651 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -322,6 +322,11 @@
 #define H_GET_24X7_DATA		0xF07C
 #define H_GET_PERF_COUNTER_INFO	0xF080
 
+/* Platform-specific hcalls used for nested HV KVM */
+#define H_SET_PARTITION_TABLE  0xF800
+#define H_ENTER_NESTED 0xF804
+#define H_TLB_INVALIDATE   0xF808
+
 /* Values for 2nd argument to H_SET_MODE */
 #define H_SET_MODE_RESOURCE_SET_CIABR  1
 #define H_SET_MODE_RESOURCE_SET_DAWR   2
diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 91c9779..43f212e 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -274,6 +274,13 @@ static inline void kvmppc_save_tm_sprs(struct kvm_vcpu 
*vcpu) {}
 static inline void kvmppc_restore_tm_sprs(struct kvm_vcpu *vcpu) {}
 #endif
 
+long kvmhv_nested_init(void);
+void kvmhv_nested_exit(void);
+void kvmhv_vm_nested_init(struct kvm *kvm);
+long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
+void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1);
+void kvmhv_release_all_nested(struct kvm *kvm);
+
 void kvmppc_giveup_fac(struct kvm_vcpu *vcpu, ulong fac);
 
 extern int kvm_irq_bypass;
@@ -387,9 +394,6 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
 /* TO = 31 for unconditional trap */
 #define INS_TW 0x7fe00008
 
-/* LPIDs we support with this build -- runtime limit may be lower */
-#define KVMPPC_NR_LPIDS	(LPID_RSVD + 1)
-
 #define SPLIT_HACK_MASK		0xff000000
 #define SPLIT_HACK_OFFS		0xfb000000
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 5c0e2d9..6d67b6a 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -23,6 +23,39 @@
 #include 
 #include 
 #include 
+#include 
+
+#ifdef CONFIG_PPC_PSERIES
+static inline bool kvmhv_on_pseries(void)
+{
+   return !cpu_has_feature(CPU_FTR_HVMODE);
+}
+#else
+static inline bool kvmhv_on_pseries(void)
+{
+   return false;
+}
+#endif
+
+/*
+ * Structure for a nested guest, that is, for a guest that is managed by
+ * one of our guests.
+ */
+struct kvm_nested_guest {
+   struct kvm *l1_host;/* L1 VM that owns this nested guest */
+   int l1_lpid;/* lpid L1 guest thinks this guest is */
+   int shadow_lpid;/* real lpid of this nested guest */
+   pgd_t *shadow_pgtable;  /* our page table for this guest */
+   u64 l1_gr_to_hr;/* L1's addr of part'n-scoped table */
+   u64 process_table;  /* process table entry for this guest */
+   long refcnt;/* number of pointers to this struct */
+   struct mutex tlb_lock;  /* serialize page faults and tlbies */
+   struct kvm_nested_guest *next;
+};
+
+struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
+ bool create);
+void kvmhv_put_nested(struct kvm_nested_guest *gp);
 
 /* Power architecture requires HPT is at least 256kiB, at most 64TiB */
 #define PPC_MIN_HPT_ORDER  18
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 

[PATCH v4 16/32] KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix()

2018-10-04 Thread Paul Mackerras
kvmppc_unmap_pte() does a sequence of operations that are open-coded in
kvm_unmap_radix().  This extends kvmppc_unmap_pte() a little so that it
can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 47f2b18..bd06a95 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -240,19 +240,22 @@ static void kvmppc_pmd_free(pmd_t *pmdp)
 }
 
 static void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte,
-unsigned long gpa, unsigned int shift)
+unsigned long gpa, unsigned int shift,
+struct kvm_memory_slot *memslot)
 
 {
-   unsigned long page_size = 1ul << shift;
unsigned long old;
 
old = kvmppc_radix_update_pte(kvm, pte, ~0UL, 0, gpa, shift);
kvmppc_radix_tlbie_page(kvm, gpa, shift);
if (old & _PAGE_DIRTY) {
unsigned long gfn = gpa >> PAGE_SHIFT;
-   struct kvm_memory_slot *memslot;
+   unsigned long page_size = PAGE_SIZE;
 
-   memslot = gfn_to_memslot(kvm, gfn);
+   if (shift)
+   page_size = 1ul << shift;
+   if (!memslot)
+   memslot = gfn_to_memslot(kvm, gfn);
if (memslot && memslot->dirty_bitmap)
kvmppc_update_dirty_map(memslot, gfn, page_size);
}
@@ -282,7 +285,7 @@ static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t 
*pte, bool full)
WARN_ON_ONCE(1);
kvmppc_unmap_pte(kvm, p,
 pte_pfn(*p) << PAGE_SHIFT,
-PAGE_SHIFT);
+PAGE_SHIFT, NULL);
}
}
 
@@ -304,7 +307,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full)
WARN_ON_ONCE(1);
kvmppc_unmap_pte(kvm, (pte_t *)p,
 pte_pfn(*(pte_t *)p) << PAGE_SHIFT,
-PMD_SHIFT);
+PMD_SHIFT, NULL);
}
} else {
pte_t *pte;
@@ -468,7 +471,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t 
*pgtable, pte_t pte,
goto out_unlock;
}
/* Valid 1GB page here already, remove it */
-   kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT);
+   kvmppc_unmap_pte(kvm, (pte_t *)pud, hgpa, PUD_SHIFT, NULL);
}
if (level == 2) {
if (!pud_none(*pud)) {
@@ -517,7 +520,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pgd_t 
*pgtable, pte_t pte,
goto out_unlock;
}
/* Valid 2MB page here already, remove it */
-   kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT);
+   kvmppc_unmap_pte(kvm, pmdp_ptep(pmd), lgpa, PMD_SHIFT, NULL);
}
if (level == 1) {
if (!pmd_none(*pmd)) {
@@ -780,20 +783,10 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
pte_t *ptep;
unsigned long gpa = gfn << PAGE_SHIFT;
unsigned int shift;
-   unsigned long old;
 
 ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
-   if (ptep && pte_present(*ptep)) {
-   old = kvmppc_radix_update_pte(kvm, ptep, ~0UL, 0,
- gpa, shift);
-   kvmppc_radix_tlbie_page(kvm, gpa, shift);
-   if ((old & _PAGE_DIRTY) && memslot->dirty_bitmap) {
-   unsigned long psize = PAGE_SIZE;
-   if (shift)
-   psize = 1ul << shift;
-   kvmppc_update_dirty_map(memslot, gfn, psize);
-   }
-   }
+   if (ptep && pte_present(*ptep))
+   kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot);
return 0;   
 }
 
-- 
2.7.4



[PATCH v4 15/32] KVM: PPC: Book3S HV: Refactor radix page fault handler

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

The radix page fault handler accounts for all cases, including just
needing to insert a pte.  This breaks it up into separate functions for
the two main cases: setting rc and inserting a pte.

This allows us to make the setting of rc and inserting of a pte
generic for any pgtable, not specific to the one for this guest.

[pau...@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 210 +++--
 1 file changed, 123 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f2976f4..47f2b18 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -400,8 +400,9 @@ static void kvmppc_unmap_free_pud_entry_table(struct kvm 
*kvm, pud_t *pud,
  */
 #define PTE_BITS_MUST_MATCH (~(_PAGE_WRITE | _PAGE_DIRTY | _PAGE_ACCESSED))
 
-static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
-unsigned int level, unsigned long mmu_seq)
+static int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
+unsigned long gpa, unsigned int level,
+unsigned long mmu_seq)
 {
pgd_t *pgd;
pud_t *pud, *new_pud = NULL;
@@ -410,7 +411,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
int ret;
 
/* Traverse the guest's 2nd-level tree, allocate new levels needed */
-   pgd = kvm->arch.pgtable + pgd_index(gpa);
+   pgd = pgtable + pgd_index(gpa);
pud = NULL;
if (pgd_present(*pgd))
pud = pud_offset(pgd, gpa);
@@ -565,95 +566,49 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, 
unsigned long gpa,
return ret;
 }
 
-int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
-  unsigned long ea, unsigned long dsisr)
+static bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
+   bool writing, unsigned long gpa)
+{
+   unsigned long pgflags;
+   unsigned int shift;
+   pte_t *ptep;
+
+   /*
+* Need to set an R or C bit in the 2nd-level tables;
+* since we are just helping out the hardware here,
+* it is sufficient to do what the hardware does.
+*/
+   pgflags = _PAGE_ACCESSED;
+   if (writing)
+   pgflags |= _PAGE_DIRTY;
+   /*
+* We are walking the secondary (partition-scoped) page table here.
+* We can do this without disabling irq because the Linux MM
+* subsystem doesn't do THP splits and collapses on this tree.
+*/
+   ptep = __find_linux_pte(pgtable, gpa, NULL, &shift);
+   if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) {
+   kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift);
+   return true;
+   }
+   return false;
+}
+
+static int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
+   unsigned long gpa,
+   struct kvm_memory_slot *memslot,
+   bool writing, bool kvm_ro,
+   pte_t *inserted_pte, unsigned int *levelp)
 {
struct kvm *kvm = vcpu->kvm;
-   unsigned long mmu_seq;
-   unsigned long gpa, gfn, hva;
-   struct kvm_memory_slot *memslot;
struct page *page = NULL;
-   long ret;
-   bool writing;
+   unsigned long mmu_seq;
+   unsigned long hva, gfn = gpa >> PAGE_SHIFT;
bool upgrade_write = false;
bool *upgrade_p = &upgrade_write;
pte_t pte, *ptep;
-   unsigned long pgflags;
unsigned int shift, level;
-
-   /* Check for unusual errors */
-   if (dsisr & DSISR_UNSUPP_MMU) {
-   pr_err("KVM: Got unsupported MMU fault\n");
-   return -EFAULT;
-   }
-   if (dsisr & DSISR_BADACCESS) {
-   /* Reflect to the guest as DSI */
-   pr_err("KVM: Got radix HV page fault with DSISR=%lx\n", dsisr);
-   kvmppc_core_queue_data_storage(vcpu, ea, dsisr);
-   return RESUME_GUEST;
-   }
-
-   /* Translate the logical address and get the page */
-   gpa = vcpu->arch.fault_gpa & ~0xfffUL;
-   gpa &= ~0xF000000000000000ul;
-   gfn = gpa >> PAGE_SHIFT;
-   if (!(dsisr & DSISR_PRTABLE_FAULT))
-   gpa |= ea & 0xfff;
-   memslot = gfn_to_memslot(kvm, gfn);
-
-   /* No memslot means it's an emulated MMIO region */
-   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID)) {
-   if (dsisr & (DSISR_PRTABLE_FAULT | DSISR_BADACCESS |
-DSISR_SET_RC)) {
-   /*
-* Bad address in guest 

[PATCH v4 14/32] KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table agnostic

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

kvmppc_mmu_radix_xlate() is used to translate an effective address
through the process tables. The process table and partition tables have
identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
function able to translate either an effective address through the
process tables or a guest real address through the partition tables.
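Since both tables are arrays of 16-byte entries selected by a size field and an index, a single bounds-checked indexing helper can serve either walk. A sketch with assumed DEMO_ constants (the real mask definitions live in the kernel headers and may differ):

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte entry shared by partition and process tables: two doublewords. */
struct demo_prtb_entry {
	uint64_t prtb0;		/* radix tree root, rts/rpds fields */
	uint64_t prtb1;		/* e.g. process table pointer */
};

#define DEMO_PRTS_MASK	0x1fULL		/* low 5 bits: size field */
#define DEMO_PRTB_MASK	(~0xfffULL)	/* assumed: 4k-aligned base */

/*
 * Compute the guest-memory address of entry `index` in a table register
 * value, applying the same two checks as the patch: size field capped
 * at 24, and the index must fall inside the table.
 */
static int demo_table_entry_addr(uint64_t table, uint64_t index,
				 uint64_t *entry_addr)
{
	uint64_t prts = table & DEMO_PRTS_MASK;
	uint64_t size;

	if (prts > 24)
		return -1;
	size = 1ULL << (prts + 12);
	if (index * sizeof(struct demo_prtb_entry) >= size)
		return -1;		/* index out of range */
	*entry_addr = (table & DEMO_PRTB_MASK) +
		      index * sizeof(struct demo_prtb_entry);
	return 0;
}
```

With pid (process table) or l1_lpid (partition table) as the index, the same arithmetic locates the entry in either structure.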

[pau...@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h  |   3 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 109 +++--
 2 files changed, 78 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index dd18d81..91c9779 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -188,6 +188,9 @@ extern int kvmppc_book3s_hcall_implemented(struct kvm *kvm, 
unsigned long hc);
 extern int kvmppc_book3s_radix_page_fault(struct kvm_run *run,
struct kvm_vcpu *vcpu,
unsigned long ea, unsigned long dsisr);
+extern int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
+   struct kvmppc_pte *gpte, u64 table,
+   int table_index, u64 *pte_ret_p);
 extern int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
struct kvmppc_pte *gpte, bool data, bool iswrite);
 extern int kvmppc_init_vm_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 71951b5..f2976f4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -29,83 +29,92 @@
  */
 static int p9_supported_radix_bits[4] = { 5, 9, 9, 13 };
 
-int kvmppc_mmu_radix_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
-  struct kvmppc_pte *gpte, bool data, bool iswrite)
+/*
+ * Used to walk a partition or process table radix tree in guest memory
+ * Note: We exploit the fact that a partition table and a process
+ * table have the same layout, a partition-scoped page table and a
+ * process-scoped page table have the same layout, and the 2nd
+ * doubleword of a partition table entry has the same layout as
+ * the PTCR register.
+ */
+int kvmppc_mmu_radix_translate_table(struct kvm_vcpu *vcpu, gva_t eaddr,
+struct kvmppc_pte *gpte, u64 table,
+int table_index, u64 *pte_ret_p)
 {
struct kvm *kvm = vcpu->kvm;
-   u32 pid;
int ret, level, ps;
-   __be64 prte, rpte;
-   unsigned long ptbl;
-   unsigned long root, pte, index;
+   unsigned long ptbl, root;
unsigned long rts, bits, offset;
-   unsigned long gpa;
-   unsigned long proc_tbl_size;
+   unsigned long size, index;
+   struct prtb_entry entry;
+   u64 pte, base, gpa;
+   __be64 rpte;
 
-   /* Work out effective PID */
-   switch (eaddr >> 62) {
-   case 0:
-   pid = vcpu->arch.pid;
-   break;
-   case 3:
-   pid = 0;
-   break;
-   default:
+   if ((table & PRTS_MASK) > 24)
return -EINVAL;
-   }
-   proc_tbl_size = 1 << ((kvm->arch.process_table & PRTS_MASK) + 12);
-   if (pid * 16 >= proc_tbl_size)
+   size = 1ul << ((table & PRTS_MASK) + 12);
+
+   /* Is the table big enough to contain this entry? */
+   if ((table_index * sizeof(entry)) >= size)
return -EINVAL;
 
-   /* Read partition table to find root of tree for effective PID */
-   ptbl = (kvm->arch.process_table & PRTB_MASK) + (pid * 16);
-   ret = kvm_read_guest(kvm, ptbl, &prte, sizeof(prte));
+   /* Read the table to find the root of the radix tree */
+   ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
+   ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
if (ret)
return ret;
 
-   root = be64_to_cpu(prte);
+   /* Root is stored in the first double word */
+   root = be64_to_cpu(entry.prtb0);
rts = ((root & RTS1_MASK) >> (RTS1_SHIFT - 3)) |
((root & RTS2_MASK) >> RTS2_SHIFT);
bits = root & RPDS_MASK;
-   root = root & RPDB_MASK;
+   base = root & RPDB_MASK;
 
offset = rts + 31;
 
-   /* current implementations only support 52-bit space */
+   /* Current implementations only support 52-bit space */
if (offset != 52)
return -EINVAL;
 
+   /* Walk each level of the radix tree */
for (level = 3; level >= 0; --level) {
+   /* Check a valid size */
if (level && bits != p9_supported_radix_bits[level])
return -EINVAL;
if (level == 0 && !(bits == 5 || bits == 9))

[PATCH v4 13/32] KVM: PPC: Book3S HV: Clear partition table entry on vm teardown

2018-10-04 Thread Paul Mackerras
From: Suraj Jitindar Singh 

When destroying a VM we return the LPID to the pool; however, we never
zero the partition table entry. This is instead done when we reallocate
the LPID.

Zero the partition table entry on VM teardown before returning the LPID
to the pool. This means if we were running as a nested hypervisor the
real hypervisor could use this to determine when it can free resources.
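The ordering the patch establishes can be sketched in plain C (all names here are hypothetical stand-ins, not the kernel's): the partition-table entry is cleared before the LPID goes back to the allocator, so a new VM can never be handed an id whose stale entry is still visible.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_LPIDS 16

static unsigned long partition_table[NR_LPIDS];
static bool lpid_in_use[NR_LPIDS];

static void free_lpid(int lpid)
{
    /* Whoever frees the id must already have cleared its entry. */
    assert(partition_table[lpid] == 0);
    lpid_in_use[lpid] = false;
}

static void destroy_vm(int lpid)
{
    partition_table[lpid] = 0;  /* clear the entry first (the patch's change) */
    free_lpid(lpid);            /* ...only then return the id to the pool */
}
```

The assertion in `free_lpid()` encodes the invariant the commit message describes: a real hypervisor underneath can treat a zeroed entry as "resources may be freed".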

Reviewed-by: David Gibson 
Signed-off-by: Suraj Jitindar Singh 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 01a0532..ca0e4f4 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4501,13 +4501,19 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
 
kvmppc_free_vcores(kvm);
 
-   kvmppc_free_lpid(kvm->arch.lpid);
 
if (kvm_is_radix(kvm))
kvmppc_free_radix(kvm);
else
kvmppc_free_hpt(&kvm->arch.hpt);
 
+   /* Perform global invalidation and return lpid to the pool */
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   kvm->arch.process_table = 0;
+   kvmppc_setup_partition_table(kvm);
+   }
+   kvmppc_free_lpid(kvm->arch.lpid);
+
kvmppc_free_pimap(kvm);
 }
 
-- 
2.7.4



[PATCH v4 12/32] KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct

2018-10-04 Thread Paul Mackerras
When the 'regs' field was added to struct kvm_vcpu_arch, the code
was changed to use several of the fields inside regs (e.g., gpr, lr,
etc.) but not the ccr field, because the ccr field in struct pt_regs
is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
only 32 bits.  This changes the code to use the regs.ccr field
instead of cr, and changes the assembly code on 64-bit platforms to
use 64-bit loads and stores instead of 32-bit ones.
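The need to switch the assembly to 64-bit accesses can be illustrated with a small byte-level model (a sketch, not kernel code): on a big-endian 64-bit field, a 32-bit store at the field's base offset lands in the high word, so a subsequent 64-bit load sees the value shifted up by 32 bits.

```c
#include <stdint.h>

/* Model a big-endian 64-bit memory slot, like the regs.ccr field on
 * 64-bit big-endian PowerPC. */
static uint64_t load64_be(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

static void store32_be(uint8_t *p, uint32_t v)
{
    /* 32-bit store (like stw) at offset 0 of the 8-byte slot */
    for (int i = 3; i >= 0; i--) {
        p[i] = v & 0xff;
        v >>= 8;
    }
}
```

A 32-bit store of `0x80000000` followed by a 64-bit load yields `0x8000000000000000`: the mixed access widths disagree about where the 32-bit value lives, which is why the patch makes both sides use 64-bit loads and stores.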

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s.h|  4 ++--
 arch/powerpc/include/asm/kvm_book3s_64.h |  4 ++--
 arch/powerpc/include/asm/kvm_booke.h |  4 ++--
 arch/powerpc/include/asm/kvm_host.h  |  2 --
 arch/powerpc/kernel/asm-offsets.c|  4 ++--
 arch/powerpc/kvm/book3s_emulate.c| 12 ++--
 arch/powerpc/kvm/book3s_hv.c |  4 ++--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  4 ++--
 arch/powerpc/kvm/book3s_hv_tm.c  |  6 +++---
 arch/powerpc/kvm/book3s_hv_tm_builtin.c  |  5 +++--
 arch/powerpc/kvm/book3s_pr.c |  4 ++--
 arch/powerpc/kvm/bookehv_interrupts.S|  8 
 arch/powerpc/kvm/emulate_loadstore.c |  1 -
 13 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 83a9aa3..dd18d81 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -301,12 +301,12 @@ static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, 
int num)
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, ulong val)
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index af25aaa..5c0e2d9 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -483,7 +483,7 @@ static inline u64 sanitize_msr(u64 msr)
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 static inline void copy_from_checkpoint(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.cr  = vcpu->arch.cr_tm;
+   vcpu->arch.regs.ccr  = vcpu->arch.cr_tm;
vcpu->arch.regs.xer = vcpu->arch.xer_tm;
vcpu->arch.regs.link  = vcpu->arch.lr_tm;
vcpu->arch.regs.ctr = vcpu->arch.ctr_tm;
@@ -500,7 +500,7 @@ static inline void copy_from_checkpoint(struct kvm_vcpu 
*vcpu)
 
 static inline void copy_to_checkpoint(struct kvm_vcpu *vcpu)
 {
-   vcpu->arch.cr_tm  = vcpu->arch.cr;
+   vcpu->arch.cr_tm  = vcpu->arch.regs.ccr;
vcpu->arch.xer_tm = vcpu->arch.regs.xer;
vcpu->arch.lr_tm  = vcpu->arch.regs.link;
vcpu->arch.ctr_tm = vcpu->arch.regs.ctr;
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index d513e3e..f0cef62 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -46,12 +46,12 @@ static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, 
int num)
 
 static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
 {
-   vcpu->arch.cr = val;
+   vcpu->arch.regs.ccr = val;
 }
 
 static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.cr;
+   return vcpu->arch.regs.ccr;
 }
 
 static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, ulong val)
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index a3d4f61..c9cc42f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -538,8 +538,6 @@ struct kvm_vcpu_arch {
ulong tar;
 #endif
 
-   u32 cr;
-
 #ifdef CONFIG_PPC_BOOK3S
ulong hflags;
ulong guest_owned_ext;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 89cf155..7c3738d 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -438,7 +438,7 @@ int main(void)
 #ifdef CONFIG_PPC_BOOK3S
OFFSET(VCPU_TAR, kvm_vcpu, arch.tar);
 #endif
-   OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
+   OFFSET(VCPU_CR, kvm_vcpu, arch.regs.ccr);
OFFSET(VCPU_PC, kvm_vcpu, arch.regs.nip);
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
OFFSET(VCPU_MSR, kvm_vcpu, arch.shregs.msr);
@@ -695,7 +695,7 @@ int main(void)
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
-   OFFSET(VCPU_CR, kvm_vcpu, arch.cr);
+   OFFSET(VCPU_CR, kvm_vcpu, arch.regs.ccr);
OFFSET(VCPU_XER, kvm_vcpu, arch.regs.xer);
OFFSET(VCPU_LR, kvm_vcpu, arch.regs.link);
OFFSET(VCPU_CTR, kvm_vcpu, arch.regs.ctr);
diff --git a/arch/powerpc/kvm/book3s_emulate.c 
b/arch/powerpc/kvm/book3s_emulate.c
index 36b11c5..2654df2 100644
--- a/arch/powerpc/kvm/book3s_emulate.c
+++ 

[PATCH v4 11/32] KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings

2018-10-04 Thread Paul Mackerras
This adds a file called 'radix' in the debugfs directory for the
guest, which when read gives all of the valid leaf PTEs in the
partition-scoped radix tree for a radix guest, in human-readable
format.  It is analogous to the existing 'htab' file which dumps
the HPT entries for a HPT guest.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |   1 +
 arch/powerpc/include/asm/kvm_host.h  |   1 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 179 +++
 arch/powerpc/kvm/book3s_hv.c |   2 +
 4 files changed, 183 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index dc435a5..af25aaa 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -435,6 +435,7 @@ static inline struct kvm_memslots *kvm_memslots_raw(struct 
kvm *kvm)
 }
 
 extern void kvmppc_mmu_debugfs_init(struct kvm *kvm);
+extern void kvmhv_radix_debugfs_init(struct kvm *kvm);
 
 extern void kvmhv_rm_send_ipi(int cpu);
 
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 3cd0b9f..a3d4f61 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -291,6 +291,7 @@ struct kvm_arch {
u64 process_table;
struct dentry *debugfs_dir;
struct dentry *htab_dentry;
+   struct dentry *radix_dentry;
struct kvm_resize_hpt *resize_hpt; /* protected by kvm->lock */
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 933c574..71951b5 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -10,6 +10,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -853,6 +856,182 @@ static void pmd_ctor(void *addr)
memset(addr, 0, RADIX_PMD_TABLE_SIZE);
 }
 
+struct debugfs_radix_state {
+   struct kvm  *kvm;
+   struct mutexmutex;
+   unsigned long   gpa;
+   int chars_left;
+   int buf_index;
+   charbuf[128];
+   u8  hdr;
+};
+
+static int debugfs_radix_open(struct inode *inode, struct file *file)
+{
+   struct kvm *kvm = inode->i_private;
+   struct debugfs_radix_state *p;
+
+   p = kzalloc(sizeof(*p), GFP_KERNEL);
+   if (!p)
+   return -ENOMEM;
+
+   kvm_get_kvm(kvm);
+   p->kvm = kvm;
+   mutex_init(&p->mutex);
+   file->private_data = p;
+
+   return nonseekable_open(inode, file);
+}
+
+static int debugfs_radix_release(struct inode *inode, struct file *file)
+{
+   struct debugfs_radix_state *p = file->private_data;
+
+   kvm_put_kvm(p->kvm);
+   kfree(p);
+   return 0;
+}
+
+static ssize_t debugfs_radix_read(struct file *file, char __user *buf,
+size_t len, loff_t *ppos)
+{
+   struct debugfs_radix_state *p = file->private_data;
+   ssize_t ret, r;
+   unsigned long n;
+   struct kvm *kvm;
+   unsigned long gpa;
+   pgd_t *pgt;
+   pgd_t pgd, *pgdp;
+   pud_t pud, *pudp;
+   pmd_t pmd, *pmdp;
+   pte_t *ptep;
+   int shift;
+   unsigned long pte;
+
+   kvm = p->kvm;
+   if (!kvm_is_radix(kvm))
+   return 0;
+
+   ret = mutex_lock_interruptible(&p->mutex);
+   if (ret)
+   return ret;
+
+   if (p->chars_left) {
+   n = p->chars_left;
+   if (n > len)
+   n = len;
+   r = copy_to_user(buf, p->buf + p->buf_index, n);
+   n -= r;
+   p->chars_left -= n;
+   p->buf_index += n;
+   buf += n;
+   len -= n;
+   ret = n;
+   if (r) {
+   if (!n)
+   ret = -EFAULT;
+   goto out;
+   }
+   }
+
+   gpa = p->gpa;
+   pgt = kvm->arch.pgtable;
+   while (len != 0 && gpa < RADIX_PGTABLE_RANGE) {
+   if (!p->hdr) {
+   n = scnprintf(p->buf, sizeof(p->buf),
+ "pgdir: %lx\n", (unsigned long)pgt);
+   p->hdr = 1;
+   goto copy;
+   }
+
+   pgdp = pgt + pgd_index(gpa);
+   pgd = READ_ONCE(*pgdp);
+   if (!(pgd_val(pgd) & _PAGE_PRESENT)) {
+   gpa = (gpa & PGDIR_MASK) + PGDIR_SIZE;
+   continue;
+   }
+
+   pudp = pud_offset(&pgd, gpa);
+   pud = READ_ONCE(*pudp);
+   if (!(pud_val(pud) & _PAGE_PRESENT)) {
+   gpa = (gpa & PUD_MASK) + PUD_SIZE;
+   continue;
+   }
+   

[PATCH v4 09/32] KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests

2018-10-04 Thread Paul Mackerras
This creates an alternative guest entry/exit path which is used for
radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
these circumstances there is exactly one vcpu per vcore and there is
no coordination required between vcpus or vcores; the vcpu can enter
the guest without needing to synchronize with anything else.

The new fast path is implemented almost entirely in C in book3s_hv.c
and runs with the MMU on until the guest is entered.  On guest exit
we use the existing path until the point where we are committed to
exiting the guest (as distinct from handling an interrupt in the
low-level code and returning to the guest) and we have pulled the
guest context from the XIVE.  At that point we check a flag in the
stack frame to see whether we came in via the old path or the new
path; if we came in via the new path then we go back to C code to do
the rest of the process of saving the guest context and restoring the
host context.

The C code is split into separate functions for handling the
OS-accessible state and the hypervisor state, with the idea that the
latter can be replaced by a hypercall when we implement nested
virtualization.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |   2 +
 arch/powerpc/include/asm/kvm_ppc.h|   2 +
 arch/powerpc/kvm/book3s_hv.c  | 425 +-
 arch/powerpc/kvm/book3s_hv_ras.c  |   2 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  95 ++-
 arch/powerpc/kvm/book3s_xive.c|  63 +
 6 files changed, 585 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 0c1a2b0..5c9b00c 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -165,4 +165,6 @@ void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
 void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 83d61b8..245e564 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -585,6 +585,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
+extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
   u32 priority) { return -1; }
@@ -607,6 +608,7 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
*vcpu, u64 icpval) { retur
 
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq,
  int level, bool line_status) { return 
-ENODEV; }
+static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
 #endif /* CONFIG_KVM_XIVE */
 
 /*
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0e17593..0dda782 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3080,6 +3080,269 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
 }
 
 /*
+ * Load up hypervisor-mode registers on P9.
+ */
+static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   s64 hdec;
+   u64 tb, purr, spurr;
+   int trap;
+   unsigned long host_hfscr = mfspr(SPRN_HFSCR);
+   unsigned long host_ciabr = mfspr(SPRN_CIABR);
+   unsigned long host_dawr = mfspr(SPRN_DAWR);
+   unsigned long host_dawrx = mfspr(SPRN_DAWRX);
+   unsigned long host_psscr = mfspr(SPRN_PSSCR);
+   unsigned long host_pidr = mfspr(SPRN_PID);
+
+   hdec = time_limit - mftb();
+   if (hdec < 0)
+   return BOOK3S_INTERRUPT_HV_DECREMENTER;
+   mtspr(SPRN_HDEC, hdec);
+
+   if (vc->tb_offset) {
+   u64 new_tb = mftb() + vc->tb_offset;
+   mtspr(SPRN_TBU40, new_tb);
+   tb = mftb();
+   if ((tb & 0xff) < (new_tb & 0xff))
+   mtspr(SPRN_TBU40, new_tb + 0x100);
+   vc->tb_offset_applied = vc->tb_offset;
+   }
+
+   if (vc->pcr)
+   mtspr(SPRN_PCR, vc->pcr);
+   mtspr(SPRN_DPDES, vc->dpdes);
+   mtspr(SPRN_VTB, vc->vtb);
+
+   local_paca->kvm_hstate.host_purr = mfspr(SPRN_PURR);
+   local_paca->kvm_hstate.host_spurr = mfspr(SPRN_SPURR);
+   mtspr(SPRN_PURR, vcpu->arch.purr);
+   mtspr(SPRN_SPURR, vcpu->arch.spurr);
+
+   if (cpu_has_feature(CPU_FTR_DAWR)) {
+   mtspr(SPRN_DAWR, vcpu->arch.dawr);
+   mtspr(SPRN_DAWRX, vcpu->arch.dawrx);
+   }
+   

[PATCH v4 10/32] KVM: PPC: Book3S HV: Handle hypervisor instruction faults better

2018-10-04 Thread Paul Mackerras
Currently the code for handling hypervisor instruction page faults
passes 0 for the flags indicating the type of fault, which is OK in
the usual case that the page is not mapped in the partition-scoped
page tables.  However, there are other causes for hypervisor
instruction page faults, such as not being able to update a reference
(R) or change (C) bit.  The cause is indicated in bits in HSRR1,
including a bit which indicates that the fault is due to not being
able to write to a page (for example to update an R or C bit).
Not handling these other kinds of faults correctly can lead to a
loop of continual faults without forward progress in the guest.

In order to handle these faults better, this patch constructs a
"DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
have in common, and passes it to kvmppc_book3s_hv_page_fault() so
that it knows what caused the fault.
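The construction described above can be sketched in C as a pure bit-manipulation helper. The mask values below are placeholders for illustration only, not the kernel's actual definitions:

```c
#include <stdint.h>

/* Placeholder bit definitions -- the real values live in asm/reg.h. */
#define DSISR_SRR1_MATCH_64S  0x00300000u  /* SRR1 bits shared with DSISR */
#define DSISR_ISSTORE         0x02000000u  /* fault was a store access */
#define HSRR1_HISI_WRITE      0x00010000u  /* HISI couldn't update memory */

/* Build a "DSISR-like" fault word from HSRR1 for a hypervisor
 * instruction storage interrupt: keep the bits that share positions
 * with DSISR, and synthesize the "store" bit when HSRR1 indicates a
 * failed write (e.g. an R/C bit update). */
static uint32_t hisi_fault_dsisr(uint64_t srr1)
{
    uint32_t dsisr = (uint32_t)(srr1 & DSISR_SRR1_MATCH_64S);

    if (srr1 & HSRR1_HISI_WRITE)
        dsisr |= DSISR_ISSTORE;
    return dsisr;
}
```

With this, `kvmppc_book3s_hv_page_fault()` can distinguish a write-type fault from a plain translation miss instead of always seeing 0.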

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/reg.h | 1 +
 arch/powerpc/kvm/book3s_hv.c   | 5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e5b314e..6fda746 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -766,6 +766,7 @@
 #define SPRN_HSRR0 0x13A   /* Save/Restore Register 0 */
 #define SPRN_HSRR1 0x13B   /* Save/Restore Register 1 */
 #define   HSRR1_DENORM 0x0010 /* Denorm exception */
+#define   HSRR1_HISI_WRITE 0x0001 /* HISI bcs couldn't update mem */
 
 #define SPRN_TBCTL 0x35f   /* PA6T Timebase control register */
 #define   TBCTL_FREEZE 0xull /* Freeze all tbs */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 0dda782..3aabbb2 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1188,7 +1188,10 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
break;
case BOOK3S_INTERRUPT_H_INST_STORAGE:
vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
-   vcpu->arch.fault_dsisr = 0;
+   vcpu->arch.fault_dsisr = vcpu->arch.shregs.msr &
+   DSISR_SRR1_MATCH_64S;
+   if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
+   vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
r = RESUME_PAGE_FAULT;
break;
/*
-- 
2.7.4



[PATCH v4 08/32] KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked

2018-10-04 Thread Paul Mackerras
Currently kvmppc_handle_exit_hv() is called with the vcore lock held
because it is called within a for_each_runnable_thread loop.
However, we already unlock the vcore within kvmppc_handle_exit_hv()
under certain circumstances, and this is safe because (a) any vcpus
that become runnable and are added to the runnable set by
kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
run in the guest (because the vcore state is VCORE_EXITING), and
(b) for_each_runnable_thread is safe against addition or removal
of vcpus from the runnable set.

Therefore, in order to simplify things for following patches, let's
drop the vcore lock in the for_each_runnable_thread loop, so
kvmppc_handle_exit_hv() gets called without the vcore lock held.
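The drop-lock-inside-the-loop pattern can be sketched as follows (a hedged model with hypothetical names, not the kernel code; the "lock" is a plain flag so the lock/unlock pairing stays checkable). It is safe only under the two conditions the commit message states: the vcore is in the EXITING state, and iteration over the runnable set tolerates concurrent add/remove.

```c
#include <assert.h>

enum vcore_state { VCORE_INACTIVE, VCORE_EXITING };

struct vcore {
    int locked;
    enum vcore_state state;
    int nrunnable;
    int trap[8];            /* per-vcpu trap codes */
};

static void vc_lock(struct vcore *vc)   { assert(!vc->locked); vc->locked = 1; }
static void vc_unlock(struct vcore *vc) { assert(vc->locked);  vc->locked = 0; }

static void post_guest_process(struct vcore *vc)
{
    vc_lock(vc);
    assert(vc->state == VCORE_EXITING);  /* no vcpu can (re)enter the guest */
    for (int i = 0; i < vc->nrunnable; i++) {
        vc_unlock(vc);    /* heavy per-vcpu work runs unlocked... */
        vc->trap[i] = 0;  /* ...e.g. exit handling, then clear the trap */
        vc_lock(vc);      /* retake before touching shared vcore state */
    }
    vc_unlock(vc);
}
```

This mirrors the hunk below that moves the unlock/lock pair into the `for_each_runnable_thread` loop body.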

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 49a686c..0e17593 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1084,7 +1084,6 @@ static int kvmppc_emulate_doorbell_instr(struct kvm_vcpu 
*vcpu)
return RESUME_GUEST;
 }
 
-/* Called with vcpu->arch.vcore->lock held */
 static int kvmppc_handle_exit_hv(struct kvm_run *run, struct kvm_vcpu *vcpu,
 struct task_struct *tsk)
 {
@@ -1205,10 +1204,7 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
swab32(vcpu->arch.emul_inst) :
vcpu->arch.emul_inst;
if (vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) {
-   /* Need vcore unlocked to call kvmppc_get_last_inst */
-   spin_unlock(&vcpu->arch.vcore->lock);
r = kvmppc_emulate_debug_inst(run, vcpu);
-   spin_lock(&vcpu->arch.vcore->lock);
} else {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
@@ -1224,12 +1220,8 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
case BOOK3S_INTERRUPT_H_FAC_UNAVAIL:
r = EMULATE_FAIL;
if (((vcpu->arch.hfscr >> 56) == FSCR_MSGP_LG) &&
-   cpu_has_feature(CPU_FTR_ARCH_300)) {
-   /* Need vcore unlocked to call kvmppc_get_last_inst */
-   spin_unlock(&vcpu->arch.vcore->lock);
+   cpu_has_feature(CPU_FTR_ARCH_300))
r = kvmppc_emulate_doorbell_instr(vcpu);
-   spin_lock(&vcpu->arch.vcore->lock);
-   }
if (r == EMULATE_FAIL) {
kvmppc_core_queue_program(vcpu, SRR1_PROGILL);
r = RESUME_GUEST;
@@ -2599,6 +2591,14 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
spin_lock(&vc->lock);
now = get_tb();
for_each_runnable_thread(i, vcpu, vc) {
+   /*
+* It's safe to unlock the vcore in the loop here, because
+* for_each_runnable_thread() is safe against removal of
+* the vcpu, and the vcore state is VCORE_EXITING here,
+* so any vcpus becoming runnable will have their arch.trap
+* set to zero and can't actually run in the guest.
+*/
+   spin_unlock(&vc->lock);
/* cancel pending dec exception if dec is positive */
if (now < vcpu->arch.dec_expires &&
kvmppc_core_pending_dec(vcpu))
@@ -2614,6 +2614,7 @@ static void post_guest_process(struct kvmppc_vcore *vc, 
bool is_master)
vcpu->arch.ret = ret;
vcpu->arch.trap = 0;
 
+   spin_lock(&vc->lock);
if (is_kvmppc_resume_guest(vcpu->arch.ret)) {
if (vcpu->arch.pending_exceptions)
kvmppc_core_prepare_to_enter(vcpu);
-- 
2.7.4



[PATCH v4 07/32] KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable

2018-10-04 Thread Paul Mackerras
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
which allows the caller to indicate whether it wants the nonvolatile
register state to be preserved across the call, as required by the C
calling conventions.  This parameter being non-zero also causes the
MSR bits that enable TM, FP, VMX and VSX to be preserved.  The
condition register and DSCR are now always preserved.

With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
from C code provided the 3rd parameter is non-zero.  So that these
functions can be called from modules, they now include code to set
the TOC pointer (r2) on entry, as they can call other built-in C
functions which will assume the TOC to have been set.

Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
assume that treclaim in fake-suspend state does not modify any registers,
which is the case on POWER9.  This enables the code to be simplified
quite a bit.

_kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
this change, since they now only need to save and restore TAR and pass
1 for the 3rd argument to __kvmppc_{save,restore}_tm.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |  10 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  49 +++---
 arch/powerpc/kvm/tm.S | 250 --
 3 files changed, 169 insertions(+), 140 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 024e8fc..0c1a2b0 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -150,6 +150,16 @@ extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_count_cache;
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
+void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
+#else
+static inline void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
+bool preserve_nv) { }
+static inline void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
+   bool preserve_nv) { }
+#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+
 void kvmhv_save_host_pmu(void);
 void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 772740d..67a847f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -759,11 +759,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r4
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_restore_tm_hv
+   nop
ld  r4, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -1603,11 +1605,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r9
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_save_tm_hv
+   nop
ld  r9, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -2486,11 +2490,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
ld  r3, HSTATE_KVM_VCPU(r13)
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_save_tm_hv
+   nop
 91:
 #endif
 
@@ -2606,11 +2612,13 @@ BEGIN_FTR_SECTION
b   91f
 END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
/*
-* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
+* NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
 */
mr  r3, r4
ld  r4, VCPU_MSR(r3)
+   li  r5, 0   /* don't preserve non-vol regs */
bl  kvmppc_restore_tm_hv
+   nop
ld  r4, HSTATE_KVM_VCPU(r13)
 91:
 #endif
@@ -2943,10 +2951,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
  * Save transactional state and TM-related registers.
  * Called with r3 pointing to the vcpu struct and r4 containing
  * the guest MSR value.
- * This can modify all checkpointed registers, but
+ * r5 is non-zero iff 

[PATCH v4 06/32] KVM: PPC: Book3S HV: Simplify real-mode interrupt handling

2018-10-04 Thread Paul Mackerras
This streamlines the first part of the code that handles a hypervisor
interrupt that occurred in the guest.  With this, all of the real-mode
handling that occurs is done before the "guest_exit_cont" label; once
we get to that label we are committed to exiting to host virtual mode.
Thus the machine check and HMI real-mode handling is moved before that
label.

Also, the code to handle external interrupts is moved out of line, as
is the code that calls kvmppc_realmode_hmi_handler().

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_hv_ras.c|   8 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 220 
 2 files changed, 119 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index b11043b..ee564b6 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -331,5 +331,13 @@ long kvmppc_realmode_hmi_handler(void)
} else {
wait_for_tb_resync();
}
+
+   /*
+* Reset tb_offset_applied so the guest exit code won't try
+* to subtract the previous timebase offset from the timebase.
+*/
+   if (local_paca->kvm_hstate.kvm_vcore)
+   local_paca->kvm_hstate.kvm_vcore->tb_offset_applied = 0;
+
return 0;
 }
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 5b2ae34..772740d 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1018,8 +1018,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
 no_xive:
 #endif /* CONFIG_KVM_XICS */
 
-deliver_guest_interrupt:
-kvmppc_cede_reentry:   /* r4 = vcpu, r13 = paca */
+deliver_guest_interrupt:   /* r4 = vcpu, r13 = paca */
/* Check if we can deliver an external or decrementer interrupt now */
ld  r0, VCPU_PENDING_EXC(r4)
 BEGIN_FTR_SECTION
@@ -1269,18 +1268,26 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
std r3, VCPU_CTR(r9)
std r4, VCPU_XER(r9)
 
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   /* For softpatch interrupt, go off and do TM instruction emulation */
-   cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
-   beq kvmppc_tm_emul
-#endif
+   /* Save more register state  */
+   mfdar   r6
+   mfdsisr r7
+   std r6, VCPU_DAR(r9)
+   stw r7, VCPU_DSISR(r9)
 
/* If this is a page table miss then see if it's theirs or ours */
cmpwi   r12, BOOK3S_INTERRUPT_H_DATA_STORAGE
beq kvmppc_hdsi
+   std r6, VCPU_FAULT_DAR(r9)
+   stw r7, VCPU_FAULT_DSISR(r9)
cmpwi   r12, BOOK3S_INTERRUPT_H_INST_STORAGE
beq kvmppc_hisi
 
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   /* For softpatch interrupt, go off and do TM instruction emulation */
+   cmpwi   r12, BOOK3S_INTERRUPT_HV_SOFTPATCH
+   beq kvmppc_tm_emul
+#endif
+
/* See if this is a leftover HDEC interrupt */
cmpwi   r12,BOOK3S_INTERRUPT_HV_DECREMENTER
bne 2f
@@ -1303,7 +1310,7 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
lbz r0, HSTATE_HOST_IPI(r13)
cmpwi   r0, 0
-   beq 4f
+   beq maybe_reenter_guest
b   guest_exit_cont
 3:
/* If it's a hypervisor facility unavailable interrupt, save HFSCR */
@@ -1315,82 +1322,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 14:
/* External interrupt ? */
cmpwi   r12, BOOK3S_INTERRUPT_EXTERNAL
-   bne+guest_exit_cont
-
-   /* External interrupt, first check for host_ipi. If this is
-* set, we know the host wants us out so let's do it now
-*/
-   bl  kvmppc_read_intr
-
-   /*
-* Restore the active volatile registers after returning from
-* a C function.
-*/
-   ld  r9, HSTATE_KVM_VCPU(r13)
-   li  r12, BOOK3S_INTERRUPT_EXTERNAL
-
-   /*
-* kvmppc_read_intr return codes:
-*
-* Exit to host (r3 > 0)
-*   1 An interrupt is pending that needs to be handled by the host
-* Exit guest and return to host by branching to guest_exit_cont
-*
-*   2 Passthrough that needs completion in the host
-* Exit guest and return to host by branching to guest_exit_cont
-* However, we also set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
-* to indicate to the host to complete handling the interrupt
-*
-* Before returning to guest, we check if any CPU is heading out
-* to the host and if so, we head out also. If no CPUs are heading
-* check return values <= 0.
-*
-* Return to guest (r3 <= 0)
-*  0 No external interrupt is pending
-* -1 A guest wakeup IPI (which has now been cleared)
-*In either case, we return to guest to deliver any pending
-*guest interrupts.
-*
-* -2 A PCI 

[PATCH v4 05/32] KVM: PPC: Book3S HV: Extract PMU save/restore operations as C-callable functions

2018-10-04 Thread Paul Mackerras
This pulls out the assembler code that is responsible for saving and
restoring the PMU state for the host and guest into separate functions
so they can be used from an alternate entry path.  The calling
convention is made compatible with C.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/asm-prototypes.h |   5 +
 arch/powerpc/kvm/book3s_hv_interrupts.S   |  95 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 363 --
 3 files changed, 253 insertions(+), 210 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 1f4691c..024e8fc 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -150,4 +150,9 @@ extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_count_cache;
 
+void kvmhv_save_host_pmu(void);
+void kvmhv_load_host_pmu(void);
+void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
+void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
+
 #endif /* _ASM_POWERPC_ASM_PROTOTYPES_H */
diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S 
b/arch/powerpc/kvm/book3s_hv_interrupts.S
index 666b91c..a6d1001 100644
--- a/arch/powerpc/kvm/book3s_hv_interrupts.S
+++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
@@ -64,52 +64,7 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 
/* Save host PMU registers */
-BEGIN_FTR_SECTION
-   /* Work around P8 PMAE bug */
-   li  r3, -1
-   clrrdi  r3, r3, 10
-   mfspr   r8, SPRN_MMCR2
-   mtspr   SPRN_MMCR2, r3  /* freeze all counters using MMCR2 */
-   isync
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-   li  r3, 1
-   sldi    r3, r3, 31  /* MMCR0_FC (freeze counters) bit */
-   mfspr   r7, SPRN_MMCR0  /* save MMCR0 */
-   mtspr   SPRN_MMCR0, r3  /* freeze all counters, disable interrupts */
-   mfspr   r6, SPRN_MMCRA
-   /* Clear MMCRA in order to disable SDAR updates */
-   li  r5, 0
-   mtspr   SPRN_MMCRA, r5
-   isync
-   lbz r5, PACA_PMCINUSE(r13)  /* is the host using the PMU? */
-   cmpwi   r5, 0
-   beq 31f /* skip if not */
-   mfspr   r5, SPRN_MMCR1
-   mfspr   r9, SPRN_SIAR
-   mfspr   r10, SPRN_SDAR
-   std r7, HSTATE_MMCR0(r13)
-   std r5, HSTATE_MMCR1(r13)
-   std r6, HSTATE_MMCRA(r13)
-   std r9, HSTATE_SIAR(r13)
-   std r10, HSTATE_SDAR(r13)
-BEGIN_FTR_SECTION
-   mfspr   r9, SPRN_SIER
-   std r8, HSTATE_MMCR2(r13)
-   std r9, HSTATE_SIER(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-   mfspr   r3, SPRN_PMC1
-   mfspr   r5, SPRN_PMC2
-   mfspr   r6, SPRN_PMC3
-   mfspr   r7, SPRN_PMC4
-   mfspr   r8, SPRN_PMC5
-   mfspr   r9, SPRN_PMC6
-   stw r3, HSTATE_PMC1(r13)
-   stw r5, HSTATE_PMC2(r13)
-   stw r6, HSTATE_PMC3(r13)
-   stw r7, HSTATE_PMC4(r13)
-   stw r8, HSTATE_PMC5(r13)
-   stw r9, HSTATE_PMC6(r13)
-31:
+   bl  kvmhv_save_host_pmu
 
/*
 * Put whatever is in the decrementer into the
@@ -161,3 +116,51 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
ld  r0, PPC_LR_STKOFF(r1)
	mtlr	r0
blr
+
+_GLOBAL(kvmhv_save_host_pmu)
+BEGIN_FTR_SECTION
+   /* Work around P8 PMAE bug */
+   li  r3, -1
+   clrrdi  r3, r3, 10
+   mfspr   r8, SPRN_MMCR2
+   mtspr   SPRN_MMCR2, r3  /* freeze all counters using MMCR2 */
+   isync
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+   li  r3, 1
+   sldi    r3, r3, 31  /* MMCR0_FC (freeze counters) bit */
+   mfspr   r7, SPRN_MMCR0  /* save MMCR0 */
+   mtspr   SPRN_MMCR0, r3  /* freeze all counters, disable interrupts */
+   mfspr   r6, SPRN_MMCRA
+   /* Clear MMCRA in order to disable SDAR updates */
+   li  r5, 0
+   mtspr   SPRN_MMCRA, r5
+   isync
+   lbz r5, PACA_PMCINUSE(r13)  /* is the host using the PMU? */
+   cmpwi   r5, 0
+   beq 31f /* skip if not */
+   mfspr   r5, SPRN_MMCR1
+   mfspr   r9, SPRN_SIAR
+   mfspr   r10, SPRN_SDAR
+   std r7, HSTATE_MMCR0(r13)
+   std r5, HSTATE_MMCR1(r13)
+   std r6, HSTATE_MMCRA(r13)
+   std r9, HSTATE_SIAR(r13)
+   std r10, HSTATE_SDAR(r13)
+BEGIN_FTR_SECTION
+   mfspr   r9, SPRN_SIER
+   std r8, HSTATE_MMCR2(r13)
+   std r9, HSTATE_SIER(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+   mfspr   r3, SPRN_PMC1
+   mfspr   r5, SPRN_PMC2
+   mfspr   r6, SPRN_PMC3
+   mfspr   r7, SPRN_PMC4
+   mfspr   r8, SPRN_PMC5
+   mfspr   r9, SPRN_PMC6
+   stw r3, HSTATE_PMC1(r13)
+   stw r5, HSTATE_PMC2(r13)
+   stw r6, HSTATE_PMC3(r13)
+   stw r7, 

[PATCH v4 00/32] KVM: PPC: Book3S HV: Nested HV virtualization

2018-10-04 Thread Paul Mackerras
This patch series implements nested virtualization in the KVM-HV
module for radix guests on POWER9 systems.  Unlike PR KVM, nested
guests are able to run in supervisor mode, so performance is much
better than with PR KVM and very close to that of a non-nested guest
for most things.

The way this works is that each nested guest is also a guest of the
real hypervisor, also known as the level 0 or L0 hypervisor, which
runs in the CPU's hypervisor mode.  Its guests are at level 1, and
when a L1 system wants to run a nested guest, it performs hypercalls
to L0 to set up a virtual partition table in its (L1's) memory and to
enter the L2 guest.  The L0 hypervisor maintains a shadow
partition-scoped page table for the L2 guest and demand-faults entries
into it by translating the L1 real addresses in the partition-scoped
page table in L1 memory into L0 real addresses and puts them in the
shadow partition-scoped page table for L2.

Essentially, this gives L1 the ability to perform (some) hypervisor
functions via paravirtualization; optionally, TLB invalidations can
be done through emulation of the tlbie instruction rather than a
hypercall.
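The two-level translation described above can be sketched in C. This is an
illustrative toy model only: the page size, table formats, and all names
(shadow_xlate, l1_table, l0_table) are invented for clarity and are not the
kernel's actual data structures.

```c
/* Toy sketch of the shadow-table demand fault for an L2 guest. */
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  (~((1ULL << TOY_PAGE_SHIFT) - 1))

typedef bool (*xlate_fn)(uint64_t in, uint64_t *out);

/* toy L1 partition-scoped table: L2 real addr -> L1 real addr */
static bool l1_table(uint64_t l2_ra, uint64_t *l1_ra)
{
	if ((l2_ra & TOY_PAGE_MASK) == 0x1000) { *l1_ra = 0x5000; return true; }
	return false;
}

/* toy L0 table for the L1 guest: L1 real addr -> L0 real addr */
static bool l0_table(uint64_t l1_ra, uint64_t *l0_ra)
{
	if (l1_ra == 0x5000) { *l0_ra = 0x9000; return true; }
	return false;
}

/* Compose the two lookups to build one shadow PTE for L2. */
static bool shadow_xlate(uint64_t l2_ra, xlate_fn l1_tbl, xlate_fn l0_tbl,
			 uint64_t *l0_ra)
{
	uint64_t l1_ra;

	if (!l1_tbl(l2_ra & TOY_PAGE_MASK, &l1_ra))
		return false;	/* reflect the fault to L1's table */
	if (!l0_tbl(l1_ra, l0_ra))
		return false;	/* L0 demand-faults its own table */
	*l0_ra |= l2_ra & ~TOY_PAGE_MASK;	/* keep the page offset */
	return true;
}
```

A fault that fails the first lookup is the case where L0 must reflect the
page fault to L1, since only L1 knows how to populate its own table.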

Along the way, this implements a new guest entry/exit path for radix
guests on POWER9 systems which is written almost entirely in C and
does not do any of the inter-thread coordination that the existing
entry/exit path does.  It is only used for radix guests and when
indep_threads_mode=Y (the default).

The limitations of this scheme are:

- Host and all nested hypervisors and their guests must be in radix
  mode.

- Nested hypervisors cannot use indep_threads_mode=N.

- If the host (i.e. the L0 hypervisor) has indep_threads_mode=N then
  only one nested vcpu can be run on any core at any given time; the
  secondary threads will do nothing.

- A nested hypervisor can't use a smaller page size than the base page
  size of the hypervisor(s) above it.

- A nested hypervisor is limited to having at most 1023 guests below
  it, each of which can have at most NR_CPUS virtual CPUs (and the
  virtual CPU ids have to be < NR_CPUS as well).

This patch series is against my kvm-ppc-fixes branch.

Changes in this version since version 3:

- Removed instruction emulation code.

- Validate parameter to H_SET_PARTITION_TABLE hcall.

- Minor changes in response to review comments.

Paul.

 Documentation/virtual/kvm/api.txt  |   15 +
 arch/powerpc/include/asm/asm-prototypes.h  |   21 +
 arch/powerpc/include/asm/book3s/64/mmu-hash.h  |   12 +
 .../powerpc/include/asm/book3s/64/tlbflush-radix.h |1 +
 arch/powerpc/include/asm/hvcall.h  |   41 +
 arch/powerpc/include/asm/kvm_asm.h |4 +-
 arch/powerpc/include/asm/kvm_book3s.h  |   45 +-
 arch/powerpc/include/asm/kvm_book3s_64.h   |  119 +-
 arch/powerpc/include/asm/kvm_book3s_asm.h  |3 +
 arch/powerpc/include/asm/kvm_booke.h   |4 +-
 arch/powerpc/include/asm/kvm_host.h|   16 +-
 arch/powerpc/include/asm/kvm_ppc.h |4 +
 arch/powerpc/include/asm/ppc-opcode.h  |1 +
 arch/powerpc/include/asm/reg.h |2 +
 arch/powerpc/include/uapi/asm/kvm.h|1 +
 arch/powerpc/kernel/asm-offsets.c  |5 +-
 arch/powerpc/kernel/cpu_setup_power.S  |4 +-
 arch/powerpc/kvm/Makefile  |3 +-
 arch/powerpc/kvm/book3s.c  |   43 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|7 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  718 ---
 arch/powerpc/kvm/book3s_emulate.c  |   13 +-
 arch/powerpc/kvm/book3s_hv.c   |  842 -
 arch/powerpc/kvm/book3s_hv_builtin.c   |   92 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S|   95 +-
 arch/powerpc/kvm/book3s_hv_nested.c| 1280 
 arch/powerpc/kvm/book3s_hv_ras.c   |   10 +
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   13 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|  809 +++--
 arch/powerpc/kvm/book3s_hv_tm.c|6 +-
 arch/powerpc/kvm/book3s_hv_tm_builtin.c|5 +-
 arch/powerpc/kvm/book3s_pr.c   |5 +-
 arch/powerpc/kvm/book3s_xics.c |   14 +-
 arch/powerpc/kvm/book3s_xive.c |   63 +
 arch/powerpc/kvm/book3s_xive_template.c|8 -
 arch/powerpc/kvm/bookehv_interrupts.S  |8 +-
 arch/powerpc/kvm/emulate_loadstore.c   |1 -
 arch/powerpc/kvm/powerpc.c |   12 +
 arch/powerpc/kvm/tm.S  |  250 ++--
 arch/powerpc/kvm/trace_book3s.h|1 -
 arch/powerpc/mm/tlb-radix.c|9 +
 include/uapi/linux/kvm.h  

[PATCH v4 01/32] powerpc: Turn off CPU_FTR_P9_TM_HV_ASSIST in non-hypervisor mode

2018-10-04 Thread Paul Mackerras
When doing nested virtualization, it is only necessary to do the
transactional memory hypervisor assist at level 0, that is, when
we are in hypervisor mode.  Nested hypervisors can just use the TM
facilities as architected.  Therefore we should clear the
CPU_FTR_P9_TM_HV_ASSIST bit when we are not in hypervisor mode,
along with the CPU_FTR_HVMODE bit.

Doing this will not change anything at this stage because the only
code that tests CPU_FTR_P9_TM_HV_ASSIST is in HV KVM, which currently
can only be used when CPU_FTR_HVMODE is set.
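The difference between the old xor and the new andc matters when a bit may
already be clear: xor toggles, while andc unconditionally clears. A minimal
C sketch, using made-up bit values rather than the kernel's real feature
masks:

```c
#include <stdint.h>

/* invented bit positions for illustration only */
#define TOY_FTR_HVMODE           (1ULL << 0)
#define TOY_FTR_P9_TM_HV_ASSIST  (1ULL << 1)

/* andc-style clear: correct whether or not each bit was set */
static uint64_t clear_hv_bits(uint64_t features)
{
	return features & ~(TOY_FTR_HVMODE | TOY_FTR_P9_TM_HV_ASSIST);
}

/* the old xor: toggles, so it would *set* a bit that was clear */
static uint64_t xor_hv_bits(uint64_t features)
{
	return features ^ (TOY_FTR_HVMODE | TOY_FTR_P9_TM_HV_ASSIST);
}
```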

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kernel/cpu_setup_power.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 458b928..c317080 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -147,8 +147,8 @@ __init_hvmode_206:
rldicl. r0,r3,4,63
bnelr
ld  r5,CPU_SPEC_FEATURES(r4)
-   LOAD_REG_IMMEDIATE(r6,CPU_FTR_HVMODE)
-   xor r5,r5,r6
+   LOAD_REG_IMMEDIATE(r6,CPU_FTR_HVMODE | CPU_FTR_P9_TM_HV_ASSIST)
+   andcr5,r5,r6
std r5,CPU_SPEC_FEATURES(r4)
blr
 
-- 
2.7.4



[PATCH v4 03/32] KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation

2018-10-04 Thread Paul Mackerras
This removes code that clears the external interrupt pending bit in
the pending_exceptions bitmap.  This is left over from an earlier
iteration of the code where this bit was set when an escalation
interrupt arrived in order to wake the vcpu from cede.  Currently
we set the vcpu->arch.irq_pending flag instead for this purpose.
Therefore there is no need to do anything with the pending_exceptions
bitmap.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/kvm/book3s_xive_template.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive_template.c 
b/arch/powerpc/kvm/book3s_xive_template.c
index 203ea65..033363d 100644
--- a/arch/powerpc/kvm/book3s_xive_template.c
+++ b/arch/powerpc/kvm/book3s_xive_template.c
@@ -280,14 +280,6 @@ X_STATIC unsigned long GLUE(X_PFX,h_xirr)(struct kvm_vcpu 
*vcpu)
/* First collect pending bits from HW */
GLUE(X_PFX,ack_pending)(xc);
 
-   /*
-* Cleanup the old-style bits if needed (they may have been
-* set by pull or an escalation interrupts).
-*/
-   if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions))
-   clear_bit(BOOK3S_IRQPRIO_EXTERNAL,
- &vcpu->arch.pending_exceptions);
-
pr_devel(" new pending=0x%02x hw_cppr=%d cppr=%d\n",
 xc->pending, xc->hw_cppr, xc->cppr);
 
-- 
2.7.4



[PATCH v4 02/32] KVM: PPC: Book3S: Simplify external interrupt handling

2018-10-04 Thread Paul Mackerras
Currently we use two bits in the vcpu pending_exceptions bitmap to
indicate that an external interrupt is pending for the guest, one
for "one-shot" interrupts that are cleared when delivered, and one
for interrupts that persist until cleared by an explicit action of
the OS (e.g. an acknowledge to an interrupt controller).  The
BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.

In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
Book3S platforms generally, and pseries in particular, expect
external interrupt requests to persist until they are acknowledged
at the interrupt controller.  That combined with the confusion
introduced by having two bits for what is essentially the same thing
makes it attractive to simplify things by only using one bit.  This
patch does that.

With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
it has the semantics of a persisting interrupt.  In order to avoid
breaking the ABI, we introduce a new "external_oneshot" flag which
preserves the behaviour of the KVM_INTERRUPT ioctl with the
KVM_INTERRUPT_SET argument.
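The resulting semantics can be sketched as follows. This is a toy model with
invented names (toy_vcpu, queue_external, deliver_external), not the patch's
actual code paths:

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_IRQPRIO_EXTERNAL 14

/* invented stand-in for the vcpu fields this patch touches */
struct toy_vcpu {
	uint64_t pending;		/* pending_exceptions bitmap */
	bool external_oneshot;		/* clear external irq after delivery */
};

static void queue_external(struct toy_vcpu *v, bool oneshot)
{
	v->external_oneshot = oneshot;	/* KVM_INTERRUPT_SET => one-shot */
	v->pending |= 1ULL << TOY_IRQPRIO_EXTERNAL;
}

/* On delivery, a one-shot request is consumed; a level request
 * persists until the guest acks the (emulated) interrupt controller. */
static void deliver_external(struct toy_vcpu *v)
{
	if (v->external_oneshot) {
		v->pending &= ~(1ULL << TOY_IRQPRIO_EXTERNAL);
		v->external_oneshot = false;
	}
}
```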

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_asm.h |  4 +--
 arch/powerpc/include/asm/kvm_host.h|  1 +
 arch/powerpc/kvm/book3s.c  | 43 --
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |  5 ++-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S|  4 +--
 arch/powerpc/kvm/book3s_pr.c   |  1 -
 arch/powerpc/kvm/book3s_xics.c | 11 +++
 arch/powerpc/kvm/book3s_xive_template.c|  2 +-
 arch/powerpc/kvm/trace_book3s.h|  1 -
 tools/perf/arch/powerpc/util/book3s_hv_exits.h |  1 -
 10 files changed, 44 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a790d5c..1f32191 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -84,7 +84,6 @@
 #define BOOK3S_INTERRUPT_INST_STORAGE  0x400
 #define BOOK3S_INTERRUPT_INST_SEGMENT  0x480
 #define BOOK3S_INTERRUPT_EXTERNAL  0x500
-#define BOOK3S_INTERRUPT_EXTERNAL_LEVEL0x501
 #define BOOK3S_INTERRUPT_EXTERNAL_HV   0x502
 #define BOOK3S_INTERRUPT_ALIGNMENT 0x600
 #define BOOK3S_INTERRUPT_PROGRAM   0x700
@@ -134,8 +133,7 @@
 #define BOOK3S_IRQPRIO_EXTERNAL14
 #define BOOK3S_IRQPRIO_DECREMENTER 15
 #define BOOK3S_IRQPRIO_PERFORMANCE_MONITOR 16
-#define BOOK3S_IRQPRIO_EXTERNAL_LEVEL  17
-#define BOOK3S_IRQPRIO_MAX 18
+#define BOOK3S_IRQPRIO_MAX 17
 
 #define BOOK3S_HFLAG_DCBZ320x1
 #define BOOK3S_HFLAG_SLB   0x2
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 906bcbdf..3cd0b9f 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -707,6 +707,7 @@ struct kvm_vcpu_arch {
u8 hcall_needed;
u8 epr_flags; /* KVMPPC_EPR_xxx */
u8 epr_needed;
+   u8 external_oneshot;/* clear external irq after delivery */
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 87348e4..66a5521 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -150,7 +150,6 @@ static int kvmppc_book3s_vec2irqprio(unsigned int vec)
case 0x400: prio = BOOK3S_IRQPRIO_INST_STORAGE; break;
case 0x480: prio = BOOK3S_IRQPRIO_INST_SEGMENT; break;
case 0x500: prio = BOOK3S_IRQPRIO_EXTERNAL; break;
-   case 0x501: prio = BOOK3S_IRQPRIO_EXTERNAL_LEVEL;   break;
case 0x600: prio = BOOK3S_IRQPRIO_ALIGNMENT;break;
case 0x700: prio = BOOK3S_IRQPRIO_PROGRAM;  break;
case 0x800: prio = BOOK3S_IRQPRIO_FP_UNAVAIL;   break;
@@ -236,18 +235,35 @@ EXPORT_SYMBOL_GPL(kvmppc_core_dequeue_dec);
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 struct kvm_interrupt *irq)
 {
-   unsigned int vec = BOOK3S_INTERRUPT_EXTERNAL;
-
-   if (irq->irq == KVM_INTERRUPT_SET_LEVEL)
-   vec = BOOK3S_INTERRUPT_EXTERNAL_LEVEL;
+   /*
+* This case (KVM_INTERRUPT_SET) should never actually arise for
+* a pseries guest (because pseries guests expect their interrupt
+* controllers to continue asserting an external interrupt request
+* until it is acknowledged at the interrupt controller), but is
+* included to avoid ABI breakage and potentially for other
+* sorts of guest.
+*
+* There is a subtlety here: HV KVM does not test the
+* external_oneshot flag in the code that synthesizes
+* external interrupts for the guest 

[PATCH v4 04/32] KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code

2018-10-04 Thread Paul Mackerras
This is based on a patch by Suraj Jitindar Singh.

This moves the code in book3s_hv_rmhandlers.S that generates an
external, decrementer or privileged doorbell interrupt just before
entering the guest to C code in book3s_hv_builtin.c.  This is to
make future maintenance and modification easier.  The algorithm
expressed in the C code is almost identical to the previous
algorithm.

Reviewed-by: David Gibson 
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/kvm_ppc.h  |  1 +
 arch/powerpc/kvm/book3s_hv.c|  3 +-
 arch/powerpc/kvm/book3s_hv_builtin.c| 48 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 70 -
 4 files changed, 67 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e991821..83d61b8 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -652,6 +652,7 @@ int kvmppc_rm_h_ipi(struct kvm_vcpu *vcpu, unsigned long 
server,
 unsigned long mfrr);
 int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
 int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
+void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu);
 
 /*
  * Host-side operations we want to set up while running in real
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 3e3a715..49a686c 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -730,8 +730,7 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
/*
 * Ensure that the read of vcore->dpdes comes after the read
 * of vcpu->doorbell_request.  This barrier matches the
-* lwsync in book3s_hv_rmhandlers.S just before the
-* fast_guest_return label.
+* smp_wmb() in kvmppc_guest_entry_inject_int().
 */
smp_rmb();
vc = vcpu->arch.vcore;
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index fc6bb96..ccfea5b 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -729,3 +729,51 @@ void kvmhv_p9_restore_lpcr(struct kvm_split_mode *sip)
smp_mb();
local_paca->kvm_hstate.kvm_split_mode = NULL;
 }
+
+/*
+ * Is there a PRIV_DOORBELL pending for the guest (on POWER9)?
+ * Can we inject a Decrementer or an External interrupt?
+ */
+void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu)
+{
+   int ext;
+   unsigned long vec = 0;
+   unsigned long lpcr;
+
+   /* Insert EXTERNAL bit into LPCR at the MER bit position */
+   ext = (vcpu->arch.pending_exceptions >> BOOK3S_IRQPRIO_EXTERNAL) & 1;
+   lpcr = mfspr(SPRN_LPCR);
+   lpcr |= ext << LPCR_MER_SH;
+   mtspr(SPRN_LPCR, lpcr);
+   isync();
+
+   if (vcpu->arch.shregs.msr & MSR_EE) {
+   if (ext) {
+   vec = BOOK3S_INTERRUPT_EXTERNAL;
+   } else {
+   long int dec = mfspr(SPRN_DEC);
+   if (!(lpcr & LPCR_LD))
+   dec = (int) dec;
+   if (dec < 0)
+   vec = BOOK3S_INTERRUPT_DECREMENTER;
+   }
+   }
+   if (vec) {
+   unsigned long msr, old_msr = vcpu->arch.shregs.msr;
+
+   kvmppc_set_srr0(vcpu, kvmppc_get_pc(vcpu));
+   kvmppc_set_srr1(vcpu, old_msr);
+   kvmppc_set_pc(vcpu, vec);
+   msr = vcpu->arch.intr_msr;
+   if (MSR_TM_ACTIVE(old_msr))
+   msr |= MSR_TS_S;
+   vcpu->arch.shregs.msr = msr;
+   }
+
+   if (vcpu->arch.doorbell_request) {
+   mtspr(SPRN_DPDES, 1);
+   vcpu->arch.vcore->dpdes = 1;
+   smp_wmb();
+   vcpu->arch.doorbell_request = 0;
+   }
+}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 77960e6..6752da1 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1101,13 +1101,20 @@ no_xive:
 #endif /* CONFIG_KVM_XICS */
 
 deliver_guest_interrupt:
-   ld  r6, VCPU_CTR(r4)
-   ld  r7, VCPU_XER(r4)
-
-   mtctr   r6
-   mtxer   r7
-
 kvmppc_cede_reentry:   /* r4 = vcpu, r13 = paca */
+   /* Check if we can deliver an external or decrementer interrupt now */
+   ld  r0, VCPU_PENDING_EXC(r4)
+BEGIN_FTR_SECTION
+   /* On POWER9, also check for emulated doorbell interrupt */
+   lbz r3, VCPU_DBELL_REQ(r4)
+   or  r0, r0, r3
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+   cmpdi   r0, 0
+   beq 71f
+   mr  r3, r4
+   bl  kvmppc_guest_entry_inject_int
+   ld  r4, HSTATE_KVM_VCPU(r13)
+71:
ld  r10, VCPU_PC(r4)
ld  r11, VCPU_MSR(r4)
ld  r6, VCPU_SRR0(r4)
@@ -1120,53 +1127,10 @@ 

Re: [PATCH] dma-direct: Fix return value of dma_direct_supported

2018-10-04 Thread Robin Murphy

On 04/10/18 00:48, Alexander Duyck wrote:

It appears that in commit 9d7a224b463e ("dma-direct: always allow dma mask
<= physiscal memory size") the logic of the test was changed from a "<" to
a ">="; however, I don't see any reason for that change. I am assuming that
there was some additional change planned; specifically, I suspect the logic
was intended to be reversed and possibly used for a return. Since that is
the case, I have gone ahead and done that.


Bah, seems I got hung up on the min_mask code above it and totally 
overlooked that the condition itself got flipped. It probably also can't 
help that it's an int return type, but treated as a bool by callers 
rather than "0 for success" as int tends to imply in isolation.
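For reference, the flipped condition and the fix reduce to the following
standalone sketch, with phys_to_dma() collapsed to a precomputed min_mask
value (function names here are illustrative, not the kernel's):

```c
#include <stdint.h>

/* before the fix: inverted condition, so a mask that covers all of
 * physical memory reported "unsupported" (0) and vice versa */
static int dma_direct_supported_broken(uint64_t mask, uint64_t min_mask)
{
	if (mask >= min_mask)
		return 0;
	return 1;
}

/* after Alexander's patch: the comparison itself is the boolean answer */
static int dma_direct_supported_fixed(uint64_t mask, uint64_t min_mask)
{
	return mask >= min_mask;
}
```

The broken variant returns the opposite answer for every input, which is why
callers that treat the int as a bool failed to boot.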


Anyway, paying a bit more attention this time, I think this looks like 
the right fix - cheers Alex.


Robin.


This addresses issues I had on my system that prevented me from booting
with the above mentioned commit applied on an x86_64 system w/ Intel IOMMU.

Fixes: 9d7a224b463e ("dma-direct: always allow dma mask <= physiscal memory 
size")
Signed-off-by: Alexander Duyck 
---
  kernel/dma/direct.c |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 5a0806b5351b..65872f6c2e93 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -301,9 +301,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
  
  	min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
  
-	if (mask >= phys_to_dma(dev, min_mask))
-		return 0;
-	return 1;
+	return mask >= phys_to_dma(dev, min_mask);
  }
  
  int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr)


___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


