Re: [RFC PATCH 2/9] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

2021-02-18 Thread Nicholas Piggin
Excerpts from Daniel Axtens's message of February 19, 2021 4:03 pm:
> Hi Nick,
> 
>> +maybe_skip:
>> +cmpwi   r12,0x200
>> +beq 1f
>> +cmpwi   r12,0x300
>> +beq 1f
>> +cmpwi   r12,0x380
>> +beq 1f
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +/* XXX: cbe stuff? instruction breakpoint? */
>> +cmpwi   r12,0xe02
>> +beq 2f
>> +#endif
>> +b   no_skip
>> +1:  mfspr   r9,SPRN_SRR0
>> +addi    r9,r9,4
>> +mtspr   SPRN_SRR0,r9
>> +ld  r12,HSTATE_SCRATCH0(r13)
>> +ld  r9,HSTATE_SCRATCH2(r13)
>> +GET_SCRATCH0(r13)
>> +RFI_TO_KERNEL
>> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>> +2:  mfspr   r9,SPRN_HSRR0
>> +addi    r9,r9,4
>> +mtspr   SPRN_HSRR0,r9
>> +ld  r12,HSTATE_SCRATCH0(r13)
>> +ld  r9,HSTATE_SCRATCH2(r13)
>> +GET_SCRATCH0(r13)
>> +HRFI_TO_KERNEL
>> +#endif
> 
> If I understand correctly, label 1 is the kvmppc_skip_interrupt and
> label 2 is the kvmppc_skip_Hinterrupt. Would it be easier to understand
> if we used symbolic labels, or do you think the RFI_TO_KERNEL vs
> HRFI_TO_KERNEL and other changes are sufficient?

Yeah my thinking was it's okay this way because we've got all the 
context there, whereas prior to this patch those were branched to from 
far away places so the names helped more.

If the discontiguity or nesting was any larger then yes I would say 
naming the labels is probably a good idea (e.g., maybe_skip / no_skip).

> Apart from that, I haven't checked the precise copy-paste to make sure
> nothing has changed by accident, but I am able to follow the general
> idea of the patch and am vigorously in favour of anything that
> simplifies our exception/interrupt paths!

Thanks,
Nick


Re: [RFC PATCH 2/9] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

2021-02-18 Thread Nicholas Piggin
Excerpts from Fabiano Rosas's message of February 13, 2021 6:33 am:
> Nicholas Piggin  writes:
> 
>> Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM
>> internal detail that has no real need to be in common handlers.
> 
> LGTM,
> 
>>
>> (XXX: Need to confirm CBE handlers etc)
> 
> Do these interrupts exist only in Cell? I see that they set HSRRs and
> MSR_HV, but CPU_FTRS_CELL does not contain CPU_HVMODE. So I don't get
> why they use the KVM macros.

Good question, I asked Michael Ellerman and he said there is a bare 
metal Cell which predates or otherwise does not define HVMODE.

However it does not support KVM. So I think we can remove it.

> And for the instruction_breakpoint (0x1300) I think it would help if we
> could at least restrict when it is built. But I don't know what
> ISA/processor version it is from.

Yeah we should be documenting these non-architected handlers a little 
better IMO.

I think we can get rid of the kvm skip from this one though. I've done 
that in a separate patch in the next series.

Thanks,
Nick


Re: [PATCH v12 13/14] mm/vmalloc: Hugepage vmalloc mappings

2021-02-18 Thread Nicholas Piggin
Excerpts from Ding Tianhong's message of February 19, 2021 1:45 pm:
> Hi Nicholas:
> 
> I met some problem for this patch, like this:
> 
> kva = vmalloc(3*1024k);
> 
> remap_vmalloc_range(xxx, kva, xxx)
> 
> It fails because the page_count(page) check sees a zero count and returns
> an error, which breaks some logic in current modules: the new huge page is
> not allocated as a compound page.

Hey Ding, that's a good catch. How are you testing this stuff, do you 
have a particular driver that does this?

> I think some callers are not used to the change where vmalloc transparently
> uses hugepages instead of small pages when the size is bigger than PMD_SIZE.

I think in this case vmalloc could allocate the large page as a compound
page, which would solve this problem (without having actually tested it).
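
Something like the following in the vmalloc allocation path is roughly what
I have in mind -- completely untested sketch, and the helper name is made
up, but __GFP_COMP and the vm_struct pages array are real:

	/*
	 * Untested sketch: back the huge mapping with a compound page.
	 * page_count() resolves through compound_head(), so the
	 * !page_count(page) check in vm_insert_page() would pass for
	 * tail pages too, which is what remap_vmalloc_range() needs.
	 */
	static int vmalloc_huge_backing(struct vm_struct *area, gfp_t gfp,
					int node, unsigned int order)
	{
		struct page *page;
		unsigned int i;

		/* __GFP_COMP makes the order-N block one compound page */
		page = alloc_pages_node(node, gfp | __GFP_COMP, order);
		if (!page)
			return -ENOMEM;

		for (i = 0; i < (1U << order); i++)
			area->pages[i] = page + i;

		return 0;
	}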

> Can we think about adding a new static huge page mode to fix it? For
> example, a new vmalloc_huge_xxx() function to distinguish it from the
> current function, so the user could choose either the transparent hugepage
> or the static hugepage for vmalloc.

Yeah that's a good question. There are a few things in the huge vmalloc
code that account things as small pages, so you can't assume large or
small. If there is a benefit from forcing large pages, that could certainly
be added.
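
Just to illustrate, the kind of interface being suggested might look
something like this -- a purely hypothetical sketch, where VM_FORCE_HUGE is
a made-up flag and none of this is in the series:

	/*
	 * Hypothetical sketch of an explicit-hugepage variant. The
	 * made-up VM_FORCE_HUGE flag would mean "fail rather than fall
	 * back to small pages", so callers get a known page geometry.
	 */
	void *vmalloc_huge(unsigned long size)
	{
		return __vmalloc_node_range(size, PMD_SIZE, VMALLOC_START,
				VMALLOC_END, GFP_KERNEL, PAGE_KERNEL,
				VM_FORCE_HUGE, NUMA_NO_NODE,
				__builtin_return_address(0));
	}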

Interestingly, remap_vmalloc_range in theory could map the pages as 
large in userspace as well. That takes more work but if something
really needs that for performance, it could be done.

Thanks,
Nick


[PATCH 13/13] KVM: PPC: Book3S HV: Implement the rest of the P9 entry/exit handling in C

2021-02-18 Thread Nicholas Piggin
Almost all logic is moved to C, by introducing a new in_guest mode that
selects and branches very early in the interrupt handler to the P9 exit
code.

The remaining assembly is only about 160 lines of low level stack setup,
with VCPU vs host register save and restore, plus a small shim to the
legacy paths in the interrupt handler.

There are two motivations for this: the first is just to make the code more
maintainable by having it in C. The second is to reduce the amount of code
running in a special KVM mode, "realmode". I put that in quotes because
with radix it is no longer necessarily real-mode in the MMU, but it
still has to be treated specially because it may be in real-mode, and
has various important registers like PID, DEC, TB, etc. set to guest values.
This is hostile to the rest of Linux and can't use arbitrary kernel
functionality or be instrumented well.

This initial patch is a reasonably faithful conversion of the asm code,
one notable difference being no real-mode hcall handler which necessitates
a small change to the H_CEDE call (although that makes it nicely common
with the nested HV case).

This also lacks any loop to return quickly back into the guest without
switching out of realmode in the case of unimportant or easily handled
interrupts.

The point was to reduce complexity and improve the ability to instrument
things, and it actually remains to be seen whether these shortcuts are
required for modern processors. Radix, independent threads, XIVE, large
decrementer etc should all combine to vastly reduce guest exit frequency
compared with a POWER8. And radix mode additionally makes such exits
less costly by avoiding the need to flush and reload the SLB, flush
ERATs, etc.

Some of these things may be re-added if performance requires, or if more
of the other paths begin to be converted into C as well.

not-yet-Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/asm-prototypes.h |   3 +-
 arch/powerpc/include/asm/kvm_asm.h        |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   2 +
 arch/powerpc/include/asm/kvm_ppc.h        |   2 +
 arch/powerpc/kernel/security.c            |   5 +-
 arch/powerpc/kvm/Makefile                 |   3 +
 arch/powerpc/kvm/book3s_64_entry.S        | 180 +++
 arch/powerpc/kvm/book3s_hv.c              |  19 +-
 arch/powerpc/kvm/book3s_hv_interrupt.c    | 208 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 112 +---
 arch/powerpc/kvm/book3s_xive.c            |  32 
 11 files changed, 453 insertions(+), 116 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index d0b832cbbec8..00eb5224019c 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -151,6 +151,7 @@ extern s32 patch__call_flush_branch_caches3;
 extern s32 patch__flush_count_cache_return;
 extern s32 patch__flush_link_stack_return;
 extern s32 patch__call_kvm_flush_link_stack;
+extern s32 patch__call_kvm_flush_link_stack_2;
 extern s32 patch__memset_nocache, patch__memcpy_nocache;
 
 extern long flush_branch_caches;
@@ -171,7 +172,7 @@ void kvmhv_load_host_pmu(void);
 void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
 void kvmhv_load_guest_pmu(struct kvm_vcpu *vcpu);
 
-int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+void kvmppc_p9_enter_guest(struct kvm_vcpu *vcpu);
 
 long kvmppc_h_set_dabr(struct kvm_vcpu *vcpu, unsigned long dabr);
 long kvmppc_h_set_xdabr(struct kvm_vcpu *vcpu, unsigned long dabr,
diff --git a/arch/powerpc/include/asm/kvm_asm.h 
b/arch/powerpc/include/asm/kvm_asm.h
index a3633560493b..b4f9996bd331 100644
--- a/arch/powerpc/include/asm/kvm_asm.h
+++ b/arch/powerpc/include/asm/kvm_asm.h
@@ -146,7 +146,8 @@
 #define KVM_GUEST_MODE_GUEST   1
 #define KVM_GUEST_MODE_SKIP2
 #define KVM_GUEST_MODE_GUEST_HV3
-#define KVM_GUEST_MODE_HOST_HV 4
+#define KVM_GUEST_MODE_GUEST_HV_FAST   4 /* ISA v3.0 with host radix mode */
+#define KVM_GUEST_MODE_HOST_HV 5
 
 #define KVM_INST_FETCH_FAILED  -1
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 9bb9bb370b53..7f08f6ed73df 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -153,6 +153,8 @@ static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu 
*vcpu)
return radix;
 }
 
+int __kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu);
+
 #define KVM_DEFAULT_HPT_ORDER  24  /* 16MB HPT by default */
 #endif
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 45b7610773b1..013e9d8b0754 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -670,6 +670,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 

[PATCH 12/13] KVM: PPC: Book3S HV: Move radix MMU switching together in the P9 path

2021-02-18 Thread Nicholas Piggin
Switching the MMU from radix<->radix mode is tricky particularly as the
MMU can remain enabled and requires a certain sequence of SPR updates.
Move these together into their own functions.

This also includes the radix TLB check / flush because it's tied in to
MMU switching due to tlbiel getting LPID from LPIDR.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 50 +++-
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 53d0cbfe5933..6bf7f5ce4865 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3446,12 +3446,38 @@ static noinline void kvmppc_run_core(struct 
kvmppc_vcore *vc)
trace_kvmppc_run_core(vc, 1);
 }
 
+static void switch_mmu_to_guest_radix(struct kvm *kvm, struct kvm_vcpu *vcpu, 
u64 lpcr)
+{
+   struct kvmppc_vcore *vc = vcpu->arch.vcore;
+   struct kvm_nested_guest *nested = vcpu->arch.nested;
+   u32 lpid;
+
+   lpid = nested ? nested->shadow_lpid : kvm->arch.lpid;
+
+   mtspr(SPRN_LPID, lpid);
+   mtspr(SPRN_LPCR, lpcr);
+   mtspr(SPRN_PID, vcpu->arch.pid);
+   isync();
+
+   /* TLBIEL must have LPIDR set, so set guest LPID before flushing. */
+   kvmppc_check_need_tlb_flush(kvm, vc->pcpu, nested);
+}
+
+static void switch_mmu_to_host_radix(struct kvm *kvm, u32 pid)
+{
+   mtspr(SPRN_PID, pid);
+   mtspr(SPRN_LPID, kvm->arch.host_lpid);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr);
+   isync();
+}
+
 /*
  * Load up hypervisor-mode registers on P9.
  */
 static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 unsigned long lpcr)
 {
+   struct kvm *kvm = vcpu->kvm;
struct kvmppc_vcore *vc = vcpu->arch.vcore;
s64 hdec;
u64 tb, purr, spurr;
@@ -3467,12 +3493,12 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
 * P8 and P9 suppress the HDEC exception when LPCR[HDICE] = 0,
 * so set HDICE before writing HDEC.
 */
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr | LPCR_HDICE);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr | LPCR_HDICE);
isync();
 
hdec = time_limit - mftb();
if (hdec < 0) {
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
+   mtspr(SPRN_LPCR, kvm->arch.host_lpcr);
isync();
return BOOK3S_INTERRUPT_HV_DECREMENTER;
}
@@ -3503,7 +3529,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
}
mtspr(SPRN_CIABR, vcpu->arch.ciabr);
mtspr(SPRN_IC, vcpu->arch.ic);
-   mtspr(SPRN_PID, vcpu->arch.pid);
 
mtspr(SPRN_PSSCR, vcpu->arch.psscr | PSSCR_EC |
  (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
@@ -3517,8 +3542,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
 
mtspr(SPRN_AMOR, ~0UL);
 
-   mtspr(SPRN_LPCR, lpcr);
-   isync();
+   switch_mmu_to_guest_radix(kvm, vcpu, lpcr);
 
kvmppc_xive_push_vcpu(vcpu);
 
@@ -3553,7 +3577,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
mtspr(SPRN_CIABR, host_ciabr);
mtspr(SPRN_DAWR0, host_dawr);
mtspr(SPRN_DAWRX0, host_dawrx);
-   mtspr(SPRN_PID, host_pidr);
 
/*
 * Since this is radix, do a eieio; tlbsync; ptesync sequence in
@@ -3568,9 +3591,6 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if (cpu_has_feature(CPU_FTR_ARCH_31))
asm volatile(PPC_CP_ABORT);
 
-   mtspr(SPRN_LPID, vcpu->kvm->arch.host_lpid);/* restore host LPID */
-   isync();
-
vc->dpdes = mfspr(SPRN_DPDES);
vc->vtb = mfspr(SPRN_VTB);
mtspr(SPRN_DPDES, 0);
@@ -3587,7 +3607,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
}
 
mtspr(SPRN_HDEC, 0x7fff);
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
+
+   switch_mmu_to_host_radix(kvm, host_pidr);
 
return trap;
 }
@@ -4117,7 +4138,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 {
struct kvm_run *run = vcpu->run;
int trap, r, pcpu;
-   int srcu_idx, lpid;
+   int srcu_idx;
struct kvmppc_vcore *vc;
struct kvm *kvm = vcpu->kvm;
struct kvm_nested_guest *nested = vcpu->arch.nested;
@@ -4191,13 +4212,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
vc->vcore_state = VCORE_RUNNING;
trace_kvmppc_run_core(vc, 0);
 
-   if (cpu_has_feature(CPU_FTR_HVMODE)) {
-   lpid = nested ? nested->shadow_lpid : kvm->arch.lpid;
-   mtspr(SPRN_LPID, lpid);
-   isync();
-   kvmppc_check_need_tlb_flush(kvm, pcpu, nested);
-   }
-
guest_enter_irqoff();
 

[PATCH 11/13] KVM: PPC: Book3S HV: Minimise hcall handler calling convention differences

2021-02-18 Thread Nicholas Piggin
This sets up the same calling convention from interrupt entry to
KVM interrupt handler for system calls as exists for other interrupt
types.

This is a better API: it uses a save area rather than an SPR, and it has
more registers free to use. Using a single common API helps maintain
it, and it becomes easier to use in C in a later patch.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 16 +++-
 arch/powerpc/kvm/book3s_64_entry.S   | 22 +++---
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 4878f05b5325..6f086ee922e0 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1923,8 +1923,22 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100)
 
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 TRAMP_REAL_BEGIN(kvm_hcall)
+   std r9,PACA_EXGEN+EX_R9(r13)
+   std r11,PACA_EXGEN+EX_R11(r13)
+   std r12,PACA_EXGEN+EX_R12(r13)
+   mfcr    r9
mfctr   r10
-   SET_SCRATCH0(r10) /* Save r13 in SCRATCH0 */
+   std r10,PACA_EXGEN+EX_R13(r13)
+   li  r10,0
+   std r10,PACA_EXGEN+EX_CFAR(r13)
+   std r10,PACA_EXGEN+EX_CTR(r13)
+BEGIN_FTR_SECTION
+   mfspr   r10,SPRN_PPR
+   std r10,PACA_EXGEN+EX_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
+   HMT_MEDIUM
+
 #ifdef CONFIG_RELOCATABLE
/*
	 * Requires __LOAD_FAR_HANDLER because kvmppc_hcall lives
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 6b2fda5dee38..6f06b58b1bdd 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -13,24 +13,9 @@
.global kvmppc_hcall
 .balign IFETCH_ALIGN_BYTES
 kvmppc_hcall:
-   /*
-* This is a hcall, so register convention is as
-* Documentation/powerpc/papr_hcalls.rst, with these additions:
-* R13  = PACA
-* guest R13 saved in SPRN_SCRATCH0
-* R10  = free
-*/
-BEGIN_FTR_SECTION
-   mfspr   r10,SPRN_PPR
-   std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-   HMT_MEDIUM
-   mfcr    r10
-   std r12,HSTATE_SCRATCH0(r13)
-   sldi    r12,r10,32
-   ori r12,r12,0xc00
-   ld  r10,PACA_EXGEN+EX_R10(r13)
-   b   do_kvm_interrupt
+   ld  r10,PACA_EXGEN+EX_R13(r13)
+   SET_SCRATCH0(r10)
+   li  r10,0xc00
 
.global kvmppc_interrupt
 .balign IFETCH_ALIGN_BYTES
@@ -61,7 +46,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r10,EX_R10(r11)
ld  r11,EX_R11(r11)
 
-do_kvm_interrupt:
/*
 * Hcalls and other interrupts come here after normalising register
 * contents and save locations:
-- 
2.23.0



[PATCH 10/13] KVM: PPC: Book3S HV: move bad_host_intr check to HV handler

2021-02-18 Thread Nicholas Piggin
This is not used by PR KVM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_64_entry.S  | 3 ---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++-
 arch/powerpc/kvm/book3s_segment.S   | 7 +++
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index ef946e202773..6b2fda5dee38 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -77,11 +77,8 @@ do_kvm_interrupt:
	beq-    .Lmaybe_skip
 .Lno_skip:
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   cmpwi   r9,KVM_GUEST_MODE_HOST_HV
-   beq kvmppc_bad_host_intr
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
cmpwi   r9,KVM_GUEST_MODE_GUEST
-   ld  r9,HSTATE_SCRATCH2(r13)
beq kvmppc_interrupt_pr
 #endif
b   kvmppc_interrupt_hv
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index bbf786a0c0d6..eff4437e381c 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1251,6 +1251,7 @@ hdec_soon:
 kvmppc_interrupt_hv:
/*
 * Register contents:
+* R9   = HSTATE_IN_GUEST
 * R12  = (guest CR << 32) | interrupt vector
 * R13  = PACA
 * guest R12 saved in shadow VCPU SCRATCH0
@@ -1258,6 +1259,8 @@ kvmppc_interrupt_hv:
 * guest R9 saved in HSTATE_SCRATCH2
 */
/* We're now back in the host but in guest MMU context */
+   cmpwi   r9,KVM_GUEST_MODE_HOST_HV
+   beq kvmppc_bad_host_intr
li  r9, KVM_GUEST_MODE_HOST_HV
stb r9, HSTATE_IN_GUEST(r13)
 
@@ -3254,7 +3257,6 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST)
  * cfar is saved in HSTATE_CFAR(r13)
  * ppr is saved in HSTATE_PPR(r13)
  */
-.global kvmppc_bad_host_intr
 kvmppc_bad_host_intr:
/*
 * Switch to the emergency stack, but start half-way down in
diff --git a/arch/powerpc/kvm/book3s_segment.S 
b/arch/powerpc/kvm/book3s_segment.S
index 1f492aa4c8d6..ef1d88b869bf 100644
--- a/arch/powerpc/kvm/book3s_segment.S
+++ b/arch/powerpc/kvm/book3s_segment.S
@@ -167,8 +167,15 @@ kvmppc_interrupt_pr:
 * R12 = (guest CR << 32) | exit handler id
 * R13 = PACA
 * HSTATE.SCRATCH0 = guest R12
+*
+* If HV is possible, additionally:
+* R9  = HSTATE_IN_GUEST
+* HSTATE.SCRATCH2 = guest R9
 */
 #ifdef CONFIG_PPC64
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   ld  r9,HSTATE_SCRATCH2(r13)
+#endif
/* Match 32-bit entry */
rotldi  r12, r12, 32  /* Flip R12 halves for stw */
stw r12, HSTATE_SCRATCH1(r13) /* CR is now in the low half */
-- 
2.23.0



[PATCH 09/13] KVM: PPC: Book3S HV: Move interrupt early register setup to KVM

2021-02-18 Thread Nicholas Piggin
Like the earlier patch for hcalls, KVM interrupt entry requires a
different calling convention than the Linux interrupt handlers
set up. Move the code that converts from one to the other into KVM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 126 ---
 arch/powerpc/kvm/book3s_64_entry.S   |  34 +++-
 2 files changed, 50 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 6872f0376003..4878f05b5325 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -191,7 +191,6 @@ do_define_int n
.endif
 .endm
 
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 /*
  * All interrupts which set HSRR registers, as well as SRESET and MCE and
  * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken,
@@ -224,54 +223,25 @@ do_define_int n
  * to KVM to handle.
  */
 
-.macro KVMTEST name
+.macro KVMTEST name handler
+#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
lbz r10,HSTATE_IN_GUEST(r13)
cmpwi   r10,0
-   bne \name\()_kvm
-.endm
-
-.macro GEN_KVM name
-   .balign IFETCH_ALIGN_BYTES
-\name\()_kvm:
-
-BEGIN_FTR_SECTION
-   ld  r10,IAREA+EX_CFAR(r13)
-   std r10,HSTATE_CFAR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
-
-   ld  r10,IAREA+EX_CTR(r13)
-   mtctr   r10
-BEGIN_FTR_SECTION
-   ld  r10,IAREA+EX_PPR(r13)
-   std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-   ld  r11,IAREA+EX_R11(r13)
-   ld  r12,IAREA+EX_R12(r13)
-   std r12,HSTATE_SCRATCH0(r13)
-   sldir12,r9,32
-   ld  r9,IAREA+EX_R9(r13)
-   ld  r10,IAREA+EX_R10(r13)
/* HSRR variants have the 0x2 bit added to their trap number */
.if IHSRR_IF_HVMODE
BEGIN_FTR_SECTION
-   ori r12,r12,(IVEC + 0x2)
+   li  r10,(IVEC + 0x2)
FTR_SECTION_ELSE
-   ori r12,r12,(IVEC)
+   li  r10,(IVEC)
ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
.elseif IHSRR
-   ori r12,r12,(IVEC+ 0x2)
+   li  r10,(IVEC + 0x2)
.else
-   ori r12,r12,(IVEC)
+   li  r10,(IVEC)
.endif
-   b   kvmppc_interrupt
-.endm
-
-#else
-.macro KVMTEST name
-.endm
-.macro GEN_KVM name
-.endm
+   bne \handler
 #endif
+.endm
 
 /*
  * This is the BOOK3S interrupt entry code macro.
@@ -413,7 +383,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
.endif
 
ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
@@ -436,7 +406,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
 DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 \name\()_common_virt:
.if IKVM_VIRT
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
 1:
.endif
.endif /* IVIRT */
@@ -450,7 +420,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name
+   KVMTEST \name kvm_interrupt
.endif
 .endm
 
@@ -1011,8 +981,6 @@ EXC_COMMON_BEGIN(system_reset_common)
EXCEPTION_RESTORE_REGS
RFI_TO_USER_OR_KERNEL
 
-   GEN_KVM system_reset
-
 
 /**
  * Interrupt 0x200 - Machine Check Interrupt (MCE).
@@ -1196,7 +1164,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
/*
 * Check if we are coming from guest. If yes, then run the normal
 * exception handler which will take the
-* machine_check_kvm->kvmppc_interrupt branch to deliver the MC event
+* machine_check_kvm->kvm_interrupt branch to deliver the MC event
 * to guest.
 */
lbz r11,HSTATE_IN_GUEST(r13)
@@ -1267,8 +1235,6 @@ EXC_COMMON_BEGIN(machine_check_common)
bl  machine_check_exception
b   interrupt_return
 
-   GEN_KVM machine_check
-
 
 #ifdef CONFIG_PPC_P7_NAP
 /*
@@ -1393,8 +1359,6 @@ MMU_FTR_SECTION_ELSE
b   handle_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 
-   GEN_KVM data_access
-
 
 /**
  * Interrupt 0x380 - Data Segment Interrupt (DSLB).
@@ -1449,8 +1413,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
bl  do_bad_slb_fault
b   interrupt_return
 
-   GEN_KVM data_access_slb
-
 
 /**
  * Interrupt 0x400 - Instruction Storage Interrupt (ISI).
@@ -1489,8 +1451,6 @@ MMU_FTR_SECTION_ELSE
b   handle_page_fault
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
 
-   GEN_KVM instruction_access
-
 
 /**
  * Interrupt 0x480 - Instruction Segment Interrupt (ISLB).
@@ -1540,8 +1500,6 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
bl  do_bad_slb_fault
b   interrupt_return
 
- 

[PATCH 08/13] KVM: PPC: Book3S HV: Move hcall early register setup to KVM

2021-02-18 Thread Nicholas Piggin
System calls / hcalls have a different calling convention than
other interrupts, so there is code in the KVMTEST to massage these
into the same form as other interrupt handlers.

Move this work into the KVM hcall handler. This means teaching KVM
a little more about the low level interrupt handler setup, PACA save
areas, etc., although that's not obviously worse than the current
approach of coming up with an entirely different interrupt register
/ save convention.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h | 13 +++
 arch/powerpc/kernel/exceptions-64s.S | 44 ++--
 arch/powerpc/kvm/book3s_64_entry.S   | 17 +
 3 files changed, 32 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index c1a8aac01cf9..bb6f78fcf981 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -35,6 +35,19 @@
 /* PACA save area size in u64 units (exgen, exmc, etc) */
 #define EX_SIZE10
 
+/* PACA save area offsets */
+#define EX_R9  0
+#define EX_R10 8
+#define EX_R11 16
+#define EX_R12 24
+#define EX_R13 32
+#define EX_DAR 40
+#define EX_DSISR   48
+#define EX_CCR 52
+#define EX_CFAR56
+#define EX_PPR 64
+#define EX_CTR 72
+
 /*
  * maximum recursive depth of MCE exceptions
  */
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index a61a45704925..6872f0376003 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -21,22 +21,6 @@
 #include 
 #include 
 
-/* PACA save area offsets (exgen, exmc, etc) */
-#define EX_R9  0
-#define EX_R10 8
-#define EX_R11 16
-#define EX_R12 24
-#define EX_R13 32
-#define EX_DAR 40
-#define EX_DSISR   48
-#define EX_CCR 52
-#define EX_CFAR56
-#define EX_PPR 64
-#define EX_CTR 72
-.if EX_SIZE != 10
-   .error "EX_SIZE is wrong"
-.endif
-
 /*
  * Following are fixed section helper macros.
  *
@@ -1995,45 +1979,21 @@ EXC_VIRT_END(system_call, 0x4c00, 0x100)
 
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 TRAMP_REAL_BEGIN(system_call_kvm)
-   /*
-* This is a hcall, so register convention is as above, with these
-* differences:
-* r13 = PACA
-* ctr = orig r13
-* orig r10 saved in PACA
-*/
-/*
- * Save the PPR (on systems that support it) before changing to
- * HMT_MEDIUM. That allows the KVM code to save that value into the
- * guest state (it is the guest's PPR value).
- */
-BEGIN_FTR_SECTION
-   mfspr   r10,SPRN_PPR
-   std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
-   HMT_MEDIUM
mfctr   r10
-   SET_SCRATCH0(r10)
-   mfcr    r10
-   std r12,HSTATE_SCRATCH0(r13)
-   sldi    r12,r10,32
-   ori r12,r12,0xc00
+   SET_SCRATCH0(r10) /* Save r13 in SCRATCH0 */
 #ifdef CONFIG_RELOCATABLE
/*
-* Requires __LOAD_FAR_HANDLER because kvmppc_interrupt lives
+* Requires __LOAD_FAR_HANDLER because kvmppc_hcall lives
 * outside the head section.
 */
__LOAD_FAR_HANDLER(r10, kvmppc_hcall)
mtctr   r10
-   ld  r10,PACA_EXGEN+EX_R10(r13)
bctr
 #else
-   ld  r10,PACA_EXGEN+EX_R10(r13)
b   kvmppc_hcall
 #endif
 #endif
 
-
 /**
  * Interrupt 0xd00 - Trace Interrupt.
  * This is a synchronous interrupt in response to instruction step or
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 53addbbe7b1a..7050197cd359 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -13,6 +13,23 @@
.global kvmppc_hcall
 .balign IFETCH_ALIGN_BYTES
 kvmppc_hcall:
+   /*
+* This is a hcall, so register convention is as
+* Documentation/powerpc/papr_hcalls.rst, with these additions:
+* R13  = PACA
+* guest R13 saved in SPRN_SCRATCH0
+* R10  = free
+*/
+BEGIN_FTR_SECTION
+   mfspr   r10,SPRN_PPR
+   std r10,HSTATE_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+   HMT_MEDIUM
+   mfcr    r10
+   std r12,HSTATE_SCRATCH0(r13)
+   sldi    r12,r10,32
+   ori r12,r12,0xc00
+   ld  r10,PACA_EXGEN+EX_R10(r13)
 
.global kvmppc_interrupt
 .balign IFETCH_ALIGN_BYTES
-- 
2.23.0



[PATCH 07/13] KVM: PPC: Book3S 64: add hcall interrupt handler

2021-02-18 Thread Nicholas Piggin
Add a separate hcall entry point. This can be used to deal with the
different calling convention.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 4 ++--
 arch/powerpc/kvm/book3s_64_entry.S   | 6 +-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 96f22c582213..a61a45704925 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -2023,13 +2023,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 * Requires __LOAD_FAR_HANDLER beause kvmppc_interrupt lives
 * outside the head section.
 */
-   __LOAD_FAR_HANDLER(r10, kvmppc_interrupt)
+   __LOAD_FAR_HANDLER(r10, kvmppc_hcall)
mtctr   r10
ld  r10,PACA_EXGEN+EX_R10(r13)
bctr
 #else
ld  r10,PACA_EXGEN+EX_R10(r13)
-   b   kvmppc_interrupt
+   b   kvmppc_hcall
 #endif
 #endif
 
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 820d103e5f50..53addbbe7b1a 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -7,9 +7,13 @@
 #include 
 
 /*
- * This is branched to from interrupt handlers in exception-64s.S which set
+ * These are branched to from interrupt handlers in exception-64s.S which set
  * IKVM_REAL or IKVM_VIRT, if HSTATE_IN_GUEST was found to be non-zero.
  */
+.global kvmppc_hcall
+.balign IFETCH_ALIGN_BYTES
+kvmppc_hcall:
+
.global kvmppc_interrupt
 .balign IFETCH_ALIGN_BYTES
 kvmppc_interrupt:
-- 
2.23.0



[PATCH 06/13] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

2021-02-18 Thread Nicholas Piggin
Move the GUEST_MODE_SKIP logic into KVM code. This is quite a KVM
internal detail that has no real need to be in common handlers.

Also add a comment explaining why this thing exists.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 60 --
 arch/powerpc/kvm/book3s_64_entry.S   | 64 
 2 files changed, 56 insertions(+), 68 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index a1640d6ea65d..96f22c582213 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -133,7 +133,6 @@ name:
 #define IBRANCH_TO_COMMON  .L_IBRANCH_TO_COMMON_\name\() /* ENTRY branch 
to common */
 #define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\() /* Common runs in 
realmode */
 #define IMASK  .L_IMASK_\name\()   /* IRQ soft-mask bit */
-#define IKVM_SKIP  .L_IKVM_SKIP_\name\()   /* Generate KVM skip handler */
 #define IKVM_REAL  .L_IKVM_REAL_\name\()   /* Real entry tests KVM */
 #define __IKVM_REAL(name)  .L_IKVM_REAL_ ## name
 #define IKVM_VIRT  .L_IKVM_VIRT_\name\()   /* Virt entry tests KVM */
@@ -191,9 +190,6 @@ do_define_int n
.ifndef IMASK
IMASK=0
.endif
-   .ifndef IKVM_SKIP
-   IKVM_SKIP=0
-   .endif
.ifndef IKVM_REAL
IKVM_REAL=0
.endif
@@ -254,15 +250,10 @@ do_define_int n
.balign IFETCH_ALIGN_BYTES
 \name\()_kvm:
 
-   .if IKVM_SKIP
-   cmpwi   r10,KVM_GUEST_MODE_SKIP
-   beq 89f
-   .else
 BEGIN_FTR_SECTION
ld  r10,IAREA+EX_CFAR(r13)
std r10,HSTATE_CFAR(r13)
 END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
-   .endif
 
ld  r10,IAREA+EX_CTR(r13)
mtctr   r10
@@ -289,27 +280,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ori r12,r12,(IVEC)
.endif
b   kvmppc_interrupt
-
-   .if IKVM_SKIP
-89:    mtocrf  0x80,r9
-   ld  r10,IAREA+EX_CTR(r13)
-   mtctr   r10
-   ld  r9,IAREA+EX_R9(r13)
-   ld  r10,IAREA+EX_R10(r13)
-   ld  r11,IAREA+EX_R11(r13)
-   ld  r12,IAREA+EX_R12(r13)
-   .if IHSRR_IF_HVMODE
-   BEGIN_FTR_SECTION
-   b   kvmppc_skip_Hinterrupt
-   FTR_SECTION_ELSE
-   b   kvmppc_skip_interrupt
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif IHSRR
-   b   kvmppc_skip_Hinterrupt
-   .else
-   b   kvmppc_skip_interrupt
-   .endif
-   .endif
 .endm
 
 #else
@@ -1128,7 +1098,6 @@ INT_DEFINE_BEGIN(machine_check)
ISET_RI=0
IDAR=1
IDSISR=1
-   IKVM_SKIP=1
IKVM_REAL=1
 INT_DEFINE_END(machine_check)
 
@@ -1419,7 +1388,6 @@ INT_DEFINE_BEGIN(data_access)
IVEC=0x300
IDAR=1
IDSISR=1
-   IKVM_SKIP=1
IKVM_REAL=1
 INT_DEFINE_END(data_access)
 
@@ -1465,7 +1433,6 @@ INT_DEFINE_BEGIN(data_access_slb)
IVEC=0x380
IRECONCILE=0
IDAR=1
-   IKVM_SKIP=1
IKVM_REAL=1
 INT_DEFINE_END(data_access_slb)
 
@@ -2111,7 +2078,6 @@ INT_DEFINE_BEGIN(h_data_storage)
IHSRR=1
IDAR=1
IDSISR=1
-   IKVM_SKIP=1
IKVM_REAL=1
IKVM_VIRT=1
 INT_DEFINE_END(h_data_storage)
@@ -3088,32 +3054,6 @@ EXPORT_SYMBOL(do_uaccess_flush)
 MASKED_INTERRUPT
 MASKED_INTERRUPT hsrr=1
 
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-kvmppc_skip_interrupt:
-   /*
-* Here all GPRs are unchanged from when the interrupt happened
-* except for r13, which is saved in SPRG_SCRATCH0.
-*/
-   mfspr   r13, SPRN_SRR0
-   addi    r13, r13, 4
-   mtspr   SPRN_SRR0, r13
-   GET_SCRATCH0(r13)
-   RFI_TO_KERNEL
-   b   .
-
-kvmppc_skip_Hinterrupt:
-   /*
-* Here all GPRs are unchanged from when the interrupt happened
-* except for r13, which is saved in SPRG_SCRATCH0.
-*/
-   mfspr   r13, SPRN_HSRR0
-   addi    r13, r13, 4
-   mtspr   SPRN_HSRR0, r13
-   GET_SCRATCH0(r13)
-   HRFI_TO_KERNEL
-   b   .
-#endif
-
/*
 * Relocation-on interrupts: A subset of the interrupts can be delivered
 * with IR=1/DR=1, if AIL==2 and MSR.HV won't be changed by delivering
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
index 147ebf1c3c1f..820d103e5f50 100644
--- a/arch/powerpc/kvm/book3s_64_entry.S
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -1,9 +1,10 @@
+#include 
 #include 
-#include 
+#include 
 #include 
-#include 
-#include 
 #include 
+#include 
+#include 
 
 /*
  * This is branched to from interrupt handlers in exception-64s.S which set
@@ -19,17 +20,64 @@ kvmppc_interrupt:
 * guest R12 saved in shadow VCPU SCRATCH0
 * guest R13 saved in SPRN_SCRATCH0
 */
+   std r9,HSTATE_SCRATCH2(r13)
+   lbz r9,HSTATE_IN_GUEST(r13)
+   cmpwi   

[PATCH 05/13] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point

2021-02-18 Thread Nicholas Piggin
Rather than bifurcate the call depending on whether or not HV is
possible, and have the HV entry test for PR, just make a single
common point which does the demultiplexing. This makes it simpler
to add another type of exit handler.

Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S|  8 +-
 arch/powerpc/kvm/Makefile   |  3 +++
 arch/powerpc/kvm/book3s_64_entry.S  | 35 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++--
 4 files changed, 41 insertions(+), 16 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_64_entry.S

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 5bc689a546ae..a1640d6ea65d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -212,7 +212,6 @@ do_define_int n
 .endm
 
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
  * All interrupts which set HSRR registers, as well as SRESET and MCE and
  * syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be taken,
@@ -242,13 +241,8 @@ do_define_int n
 
 /*
  * If an interrupt is taken while a guest is running, it is immediately routed
- * to KVM to handle. If both HV and PR KVM are possible, KVM interrupts go first
- * to kvmppc_interrupt_hv, which handles the PR guest case.
+ * to KVM to handle.
  */
-#define kvmppc_interrupt kvmppc_interrupt_hv
-#else
-#define kvmppc_interrupt kvmppc_interrupt_pr
-#endif
 
 .macro KVMTEST name
lbz r10,HSTATE_IN_GUEST(r13)
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 2bfeaa13befb..cdd119028f64 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -59,6 +59,9 @@ kvm-pr-y := \
 kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \
tm.o
 
+kvm-book3s_64-builtin-objs-y += \
+   book3s_64_entry.o
+
 ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
 kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HANDLER) += \
book3s_rmhandlers.o
diff --git a/arch/powerpc/kvm/book3s_64_entry.S 
b/arch/powerpc/kvm/book3s_64_entry.S
new file mode 100644
index ..147ebf1c3c1f
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_64_entry.S
@@ -0,0 +1,35 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * This is branched to from interrupt handlers in exception-64s.S which set
+ * IKVM_REAL or IKVM_VIRT, if HSTATE_IN_GUEST was found to be non-zero.
+ */
+.global kvmppc_interrupt
+.balign IFETCH_ALIGN_BYTES
+kvmppc_interrupt:
+   /*
+* Register contents:
+* R12  = (guest CR << 32) | interrupt vector
+* R13  = PACA
+* guest R12 saved in shadow VCPU SCRATCH0
+* guest R13 saved in SPRN_SCRATCH0
+*/
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   std r9, HSTATE_SCRATCH2(r13)
+   lbz r9, HSTATE_IN_GUEST(r13)
+   cmpwi   r9, KVM_GUEST_MODE_HOST_HV
+   beq kvmppc_bad_host_intr
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
+   cmpwi   r9, KVM_GUEST_MODE_GUEST
+   ld  r9, HSTATE_SCRATCH2(r13)
+   beq kvmppc_interrupt_pr
+#endif
+   b   kvmppc_interrupt_hv
+#else
+   b   kvmppc_interrupt_pr
+#endif
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 3988873b044c..bbf786a0c0d6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1255,16 +1255,8 @@ kvmppc_interrupt_hv:
 * R13  = PACA
 * guest R12 saved in shadow VCPU SCRATCH0
 * guest R13 saved in SPRN_SCRATCH0
+* guest R9 saved in HSTATE_SCRATCH2
 */
-   std r9, HSTATE_SCRATCH2(r13)
-   lbz r9, HSTATE_IN_GUEST(r13)
-   cmpwi   r9, KVM_GUEST_MODE_HOST_HV
-   beq kvmppc_bad_host_intr
-#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
-   cmpwi   r9, KVM_GUEST_MODE_GUEST
-   ld  r9, HSTATE_SCRATCH2(r13)
-   beq kvmppc_interrupt_pr
-#endif
/* We're now back in the host but in guest MMU context */
li  r9, KVM_GUEST_MODE_HOST_HV
stb r9, HSTATE_IN_GUEST(r13)
@@ -3262,6 +3254,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_P9_TM_HV_ASSIST)
  * cfar is saved in HSTATE_CFAR(r13)
  * ppr is saved in HSTATE_PPR(r13)
  */
+.global kvmppc_bad_host_intr
 kvmppc_bad_host_intr:
/*
 * Switch to the emergency stack, but start half-way down in
-- 
2.23.0



[PATCH 04/13] KVM: PPC: Book3S 64: remove unused kvmppc_h_protect argument

2021-02-18 Thread Nicholas Piggin
The va argument is not used in the function or set by its asm caller,
so remove it to be safe.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_ppc.h  | 3 +--
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0a056c64c317..45b7610773b1 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -765,8 +765,7 @@ long kvmppc_h_remove(struct kvm_vcpu *vcpu, unsigned long 
flags,
  unsigned long pte_index, unsigned long avpn);
 long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu);
 long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
-  unsigned long pte_index, unsigned long avpn,
-  unsigned long va);
+  unsigned long pte_index, unsigned long avpn);
 long kvmppc_h_read(struct kvm_vcpu *vcpu, unsigned long flags,
unsigned long pte_index);
 long kvmppc_h_clear_ref(struct kvm_vcpu *vcpu, unsigned long flags,
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index f87237927096..956522b6ea15 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -667,8 +667,7 @@ long kvmppc_h_bulk_remove(struct kvm_vcpu *vcpu)
 }
 
 long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
- unsigned long pte_index, unsigned long avpn,
- unsigned long va)
+ unsigned long pte_index, unsigned long avpn)
 {
struct kvm *kvm = vcpu->kvm;
__be64 *hpte;
-- 
2.23.0



[PATCH 03/13] KVM: PPC: Book3S HV: Ensure MSR[ME] is always set in guest MSR

2021-02-18 Thread Nicholas Piggin
Rather than add the ME bit to the MSR when the guest is entered, make
it clear that the hypervisor does not allow the guest to clear the bit.

The ME addition is kept in the code for now, but a future patch will
warn if it's not present.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv_builtin.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index dad118760a4e..ae8f291c5c48 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -661,6 +661,13 @@ static void kvmppc_end_cede(struct kvm_vcpu *vcpu)
 
 void kvmppc_set_msr_hv(struct kvm_vcpu *vcpu, u64 msr)
 {
+   /*
+* Guest must always run with machine check interrupt
+* enabled.
+*/
+   if (!(msr & MSR_ME))
+   msr |= MSR_ME;
+
/*
 * Check for illegal transactional state bit combination
 * and if we find it, force the TS field to a safe state.
-- 
2.23.0



[PATCH 02/13] powerpc/64s: remove KVM SKIP test from instruction breakpoint handler

2021-02-18 Thread Nicholas Piggin
The code being executed in KVM_GUEST_MODE_SKIP is hypervisor code with
MSR[IR]=0, so the faults of concern are the d-side ones caused by access
to guest context by the hypervisor.

Instruction breakpoint interrupts are not a concern here. It's unlikely
any good would come of causing breaks in this code, but skipping the
instruction that caused it won't help matters (e.g., skip the mtmsr that
sets MSR[DR]=0 or clears KVM_GUEST_MODE_SKIP).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 5d0ad3b38e90..5bc689a546ae 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -2597,7 +2597,6 @@ EXC_VIRT_NONE(0x5200, 0x100)
 INT_DEFINE_BEGIN(instruction_breakpoint)
IVEC=0x1300
 #ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
-   IKVM_SKIP=1
IKVM_REAL=1
 #endif
 INT_DEFINE_END(instruction_breakpoint)
-- 
2.23.0



[PATCH 01/13] powerpc/64s: Remove KVM handler support from CBE_RAS interrupts

2021-02-18 Thread Nicholas Piggin
Cell does not support KVM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 6 --
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 39cbea495154..5d0ad3b38e90 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -2574,8 +2574,6 @@ EXC_VIRT_NONE(0x5100, 0x100)
 INT_DEFINE_BEGIN(cbe_system_error)
IVEC=0x1200
IHSRR=1
-   IKVM_SKIP=1
-   IKVM_REAL=1
 INT_DEFINE_END(cbe_system_error)
 
 EXC_REAL_BEGIN(cbe_system_error, 0x1200, 0x100)
@@ -2745,8 +2743,6 @@ EXC_COMMON_BEGIN(denorm_exception_common)
 INT_DEFINE_BEGIN(cbe_maintenance)
IVEC=0x1600
IHSRR=1
-   IKVM_SKIP=1
-   IKVM_REAL=1
 INT_DEFINE_END(cbe_maintenance)
 
 EXC_REAL_BEGIN(cbe_maintenance, 0x1600, 0x100)
@@ -2798,8 +2794,6 @@ EXC_COMMON_BEGIN(altivec_assist_common)
 INT_DEFINE_BEGIN(cbe_thermal)
IVEC=0x1800
IHSRR=1
-   IKVM_SKIP=1
-   IKVM_REAL=1
 INT_DEFINE_END(cbe_thermal)
 
 EXC_REAL_BEGIN(cbe_thermal, 0x1800, 0x100)
-- 
2.23.0



[PATCH 00/13] KVM: PPC: Book3S: C-ify the P9 entry/exit code

2021-02-18 Thread Nicholas Piggin
This has a lot more implemented, things tidied up, and more things split
out. It's also implemented on top of powerpc next and kvm next which
have a few prerequisite patches (mainly removing EXSLB).

I've got a bunch more things after this including implementing HPT
guests with radix host in the "new" path -- whether we ever actually
want to do that, or port the legacy path up to C, or just leave it to
maintenance mode, I was just testing the waters there and making sure I
wasn't doing something fundamentally incompatible with hash.

I won't post anything further than this for now because I think it's a
good start and gets the total asm code for KVM entry and exit down to
about 160 lines plus the shim for the legacy paths. So would like to
concentrate on getting this in before juggling things around too much
or adding new things.

Thanks,
Nick

Nicholas Piggin (13):
  powerpc/64s: Remove KVM handler support from CBE_RAS interrupts
  powerpc/64s: remove KVM SKIP test from instruction breakpoint handler
  KVM: PPC: Book3S HV: Ensure MSR[ME] is always set in guest MSR
  KVM: PPC: Book3S 64: remove unused kvmppc_h_protect argument
  KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point
  KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM
  KVM: PPC: Book3S 64: add hcall interrupt handler
  KVM: PPC: Book3S HV: Move hcall early register setup to KVM
  KVM: PPC: Book3S HV: Move interrupt early register setup to KVM
  KVM: PPC: Book3S HV: move bad_host_intr check to HV handler
  KVM: PPC: Book3S HV: Minimise hcall handler calling convention
differences
  KVM: PPC: Book3S HV: Move radix MMU switching together in the P9 path
  KVM: PPC: Book3S HV: Implement the rest of the P9 entry/exit handling
in C

 arch/powerpc/include/asm/asm-prototypes.h |   3 +-
 arch/powerpc/include/asm/exception-64s.h  |  13 +
 arch/powerpc/include/asm/kvm_asm.h        |   3 +-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   2 +
 arch/powerpc/include/asm/kvm_ppc.h        |   5 +-
 arch/powerpc/kernel/exceptions-64s.S      | 257 +++
 arch/powerpc/kernel/security.c            |   5 +-
 arch/powerpc/kvm/Makefile                 |   6 +
 arch/powerpc/kvm/book3s_64_entry.S        | 295 ++
 arch/powerpc/kvm/book3s_hv.c              |  69 +++--
 arch/powerpc/kvm/book3s_hv_builtin.c      |   7 +
 arch/powerpc/kvm/book3s_hv_interrupt.c    | 208 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c       |   3 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 125 +
 arch/powerpc/kvm/book3s_segment.S         |   7 +
 arch/powerpc/kvm/book3s_xive.c            |  32 +++
 16 files changed, 670 insertions(+), 370 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_64_entry.S
 create mode 100644 arch/powerpc/kvm/book3s_hv_interrupt.c

-- 
2.23.0



Re: [RFC PATCH 2/9] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

2021-02-18 Thread Daniel Axtens
Hi Nick,

> +maybe_skip:
> + cmpwi   r12,0x200
> + beq 1f
> + cmpwi   r12,0x300
> + beq 1f
> + cmpwi   r12,0x380
> + beq 1f
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> + /* XXX: cbe stuff? instruction breakpoint? */
> + cmpwi   r12,0xe02
> + beq 2f
> +#endif
> + b   no_skip
> +1:   mfspr   r9,SPRN_SRR0
> + addi    r9,r9,4
> + mtspr   SPRN_SRR0,r9
> + ld  r12,HSTATE_SCRATCH0(r13)
> + ld  r9,HSTATE_SCRATCH2(r13)
> + GET_SCRATCH0(r13)
> + RFI_TO_KERNEL
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +2:   mfspr   r9,SPRN_HSRR0
> + addi    r9,r9,4
> + mtspr   SPRN_HSRR0,r9
> + ld  r12,HSTATE_SCRATCH0(r13)
> + ld  r9,HSTATE_SCRATCH2(r13)
> + GET_SCRATCH0(r13)
> + HRFI_TO_KERNEL
> +#endif

If I understand correctly, label 1 is the kvmppc_skip_interrupt and
label 2 is the kvmppc_skip_Hinterrupt. Would it be easier to understand
if we used symbolic labels, or do you think the RFI_TO_KERNEL vs
HRFI_TO_KERNEL and other changes are sufficient?

Apart from that, I haven't checked the precise copy-paste to make sure
nothing has changed by accident, but I am able to follow the general
idea of the patch and am vigorously in favour of anything that
simplifies our exception/interrupt paths!

Kind regards,
Daniel

> -- 
> 2.23.0


Re: [RFC PATCH 1/9] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point

2021-02-18 Thread Daniel Axtens
Hi Nick,

> +++ b/arch/powerpc/kvm/book3s_64_entry.S
> @@ -0,0 +1,34 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * We come here from the first-level interrupt handlers.
> + */
> +.global  kvmppc_interrupt
> +.balign IFETCH_ALIGN_BYTES
> +kvmppc_interrupt:
> + /*
> +  * Register contents:

Clearly r9 contains some data at this point, and I think it's guest r9
because of what you say later on in
book3s_hv_rmhandlers.S::kvmppc_interrupt_hv. Is that right? Should that
be documented in this comment as well?

> +  * R12  = (guest CR << 32) | interrupt vector
> +  * R13  = PACA
> +  * guest R12 saved in shadow VCPU SCRATCH0
> +  * guest R13 saved in SPRN_SCRATCH0
> +  */
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> + std r9, HSTATE_SCRATCH2(r13)
> + lbz r9, HSTATE_IN_GUEST(r13)
> + cmpwi   r9, KVM_GUEST_MODE_HOST_HV
> + beq kvmppc_bad_host_intr
> +#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
> + cmpwi   r9, KVM_GUEST_MODE_GUEST
> + ld  r9, HSTATE_SCRATCH2(r13)
> + beq kvmppc_interrupt_pr
> +#endif
> + b   kvmppc_interrupt_hv
> +#else
> + b   kvmppc_interrupt_pr
> +#endif

Apart from that I had a look and convinced myself that the code will
behave the same as before. On that basis:

Reviewed-by: Daniel Axtens 

Kind regards,
Daniel


Re: [PATCH v12 13/14] mm/vmalloc: Hugepage vmalloc mappings

2021-02-18 Thread Ding Tianhong
Hi Nicholas:

I met some problem for this patch, like this:

kva = vmalloc(3*1024k);

remap_vmalloc_range(xxx, kva, xxx)

It fails because the page_count(page) check sees a zero count and returns
an error, which breaks some logic in current modules: the new huge page is
not allocated as a compound page.

I think some callers are not used to the change where vmalloc transparently
uses hugepages instead of small pages when the size is bigger than PMD_SIZE.

Can we think about adding a new static huge page mode to fix it? For
example, a new vmalloc_huge_xxx() function to distinguish it from the
current function, so the user could choose either the transparent hugepage
or the static hugepage for vmalloc.

Thanks
Ding


On 2021/2/2 19:05, Nicholas Piggin wrote:
> Support huge page vmalloc mappings. Config option HAVE_ARCH_HUGE_VMALLOC
> enables support on architectures that define HAVE_ARCH_HUGE_VMAP and
> supports PMD sized vmap mappings.
> 
> vmalloc will attempt to allocate PMD-sized pages if allocating PMD size
> or larger, and fall back to small pages if that was unsuccessful.
> 
> Architectures must ensure that any arch specific vmalloc allocations
> that require PAGE_SIZE mappings (e.g., module allocations vs strict
> module rwx) use the VM_NOHUGE flag to inhibit larger mappings.
> 
> This can result in more internal fragmentation and memory overhead for a
> given allocation, an option nohugevmalloc is added to disable at boot.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/Kconfig            |  11 ++
>  include/linux/vmalloc.h |  21 
>  mm/page_alloc.c         |   5 +-
>  mm/vmalloc.c            | 215 +++-
>  4 files changed, 205 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 24862d15f3a3..eef170e0c9b8 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -724,6 +724,17 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>  config HAVE_ARCH_HUGE_VMAP
>   bool
>  
> +#
> +#  Archs that select this would be capable of PMD-sized vmaps (i.e.,
> +#  arch_vmap_pmd_supported() returns true), and they must make no assumptions
> +#  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP 
> flag
> +#  can be used to prohibit arch-specific allocations from using hugepages to
> +#  help with this (e.g., modules may require it).
> +#
> +config HAVE_ARCH_HUGE_VMALLOC
> + depends on HAVE_ARCH_HUGE_VMAP
> + bool
> +
>  config ARCH_WANT_HUGE_PMD_SHARE
>   bool
>  
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 99ea72d547dc..93270adf5db5 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -25,6 +25,7 @@ struct notifier_block;  /* in notifier.h */
>  #define VM_NO_GUARD  0x0040  /* don't add guard page */
>  #define VM_KASAN 0x0080  /* has allocated kasan shadow 
> memory */
>  #define VM_MAP_PUT_PAGES 0x0100  /* put pages and free array in 
> vfree */
> +#define VM_NO_HUGE_VMAP  0x0200  /* force PAGE_SIZE pte 
> mapping */
>  
>  /*
>   * VM_KASAN is used slighly differently depending on CONFIG_KASAN_VMALLOC.
> @@ -59,6 +60,9 @@ struct vm_struct {
>   unsigned long   size;
>   unsigned long   flags;
>   struct page **pages;
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> + unsigned intpage_order;
> +#endif
>   unsigned intnr_pages;
>   phys_addr_t phys_addr;
>   const void  *caller;
> @@ -193,6 +197,22 @@ void free_vm_area(struct vm_struct *area);
>  extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>  
> +static inline bool is_vm_area_hugepages(const void *addr)
> +{
> + /*
> +  * This may not 100% tell if the area is mapped with > PAGE_SIZE
> +  * page table entries, if for some reason the architecture indicates
> +  * larger sizes are available but decides not to use them, nothing
> +  * prevents that. This only indicates the size of the physical page
> +  * allocated in the vmalloc layer.
> +  */
> +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
> + return find_vm_area(addr)->page_order > 0;
> +#else
> + return false;
> +#endif
> +}
> +
>  #ifdef CONFIG_MMU
>  int vmap_range(unsigned long addr, unsigned long end,
>   phys_addr_t phys_addr, pgprot_t prot,
> @@ -210,6 +230,7 @@ static inline void set_vm_flush_reset_perms(void *addr)
>   if (vm)
>   vm->flags |= VM_FLUSH_RESET_PERMS;
>  }
> +
>  #else
>  static inline int
>  map_kernel_range_noflush(unsigned long start, unsigned long size,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 519a60d5b6f7..1116ce45744b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -72,6 +72,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -8240,6 +8241,7 @@ void *__init 

Re: [PATCH] ASoC: imx-hdmi: no need to set .owner when using module_platform_driver

2021-02-18 Thread Shengjiu Wang
On Thu, Feb 11, 2021 at 5:21 PM Tian Tao  wrote:
>
> the module_platform_driver will call platform_driver_register,
> and it will set the .owner to THIS_MODULE.
>
> Signed-off-by: Tian Tao 

Acked-by: Shengjiu Wang 


Re: [PATCH] of: error: 'const struct kimage' has no member named 'arch'

2021-02-18 Thread Lakshmi Ramasubramanian

On 2/18/21 5:13 PM, Thiago Jung Bauermann wrote:


Lakshmi Ramasubramanian  writes:


On 2/18/21 4:07 PM, Mimi Zohar wrote:

Hi Mimi,


On Thu, 2021-02-18 at 14:33 -0800, Lakshmi Ramasubramanian wrote:

of_kexec_alloc_and_setup_fdt() defined in drivers/of/kexec.c builds
a new device tree object that includes architecture specific data
for the kexec system call.  This should be defined only if the architecture
being built defines the kexec architecture structure "struct kimage_arch".

Define a new boolean config OF_KEXEC that is enabled if
CONFIG_KEXEC_FILE and CONFIG_OF_FLATTREE are enabled, and
the architecture is arm64 or powerpc64.  Build drivers/of/kexec.c
if CONFIG_OF_KEXEC is enabled.

Signed-off-by: Lakshmi Ramasubramanian 
Fixes: 33488dc4d61f ("of: Add a common kexec FDT setup function")
Reported-by: kernel test robot 
---
   drivers/of/Kconfig  | 6 ++
   drivers/of/Makefile | 7 +--
   2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index 18450437d5d5..f2e8fa54862a 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -100,4 +100,10 @@ config OF_DMA_DEFAULT_COHERENT
# arches should select this if DMA is coherent by default for OF devices
bool
 
+config OF_KEXEC
+   bool
+   depends on KEXEC_FILE
+   depends on OF_FLATTREE
+   default y if ARM64 || PPC64
+
   endif # OF
diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index c13b982084a3..287579dd1695 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -13,11 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
   obj-$(CONFIG_OF_RESOLVE)  += resolver.o
   obj-$(CONFIG_OF_OVERLAY) += overlay.o
   obj-$(CONFIG_OF_NUMA) += of_numa.o
-
-ifdef CONFIG_KEXEC_FILE
-ifdef CONFIG_OF_FLATTREE
-obj-y  += kexec.o
-endif
-endif
+obj-$(CONFIG_OF_KEXEC) += kexec.o
 obj-$(CONFIG_OF_UNITTEST) += unittest-data/

Is it possible to reuse CONFIG_HAVE_IMA_KEXEC here?



For ppc64 CONFIG_HAVE_IMA_KEXEC is selected when CONFIG_KEXEC_FILE is enabled.
So I don't see a problem in reusing CONFIG_HAVE_IMA_KEXEC for ppc.

But for arm64, CONFIG_HAVE_IMA_KEXEC is enabled in the final patch in the patch
set (the one for carrying forward IMA log across kexec for arm64). arm64 calls
of_kexec_alloc_and_setup_fdt() prior to enabling CONFIG_HAVE_IMA_KEXEC and hence
breaks the build for arm64.


One problem is that I believe that this patch won't placate the robot,
because IIUC it generates config files at random and this change still
allows hppa and s390 to enable CONFIG_OF_KEXEC.


I enabled CONFIG_OF_KEXEC for s390. With my patch applied, 
CONFIG_OF_KEXEC is removed. So I think the robot enabling this config 
would not be a problem.




Perhaps a new CONFIG_HAVE_KIMAGE_ARCH option? Not having that option
would still allow building kexec.o, but would be used inside kexec.c to
avoid accessing kimage.arch members.



I think this is a good idea - a new CONFIG_HAVE_KIMAGE_ARCH, which will 
be selected by arm64 and ppc for now. I tried this, and it fixes the 
build issue.


Although, the name for the new config can be misleading since PARISC, 
for instance, also defines "struct kimage_arch". Perhaps, 
CONFIG_HAVE_ELF_KIMAGE_ARCH since of_kexec_alloc_and_setup_fdt() is 
accessing ELF specific fields in "struct kimage_arch"?
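
To make that concrete, the guard inside of_kexec_alloc_and_setup_fdt()
could look something like the following -- a hypothetical sketch only,
where the config name, the property name and the kimage_arch field are
illustrative rather than a tested patch:

	/*
	 * Hypothetical sketch: kexec.o always builds, and only the
	 * accesses to ELF-specific kimage.arch members are guarded by
	 * the new symbol (selected by arm64 and powerpc for now).
	 */
	#ifdef CONFIG_HAVE_ELF_KIMAGE_ARCH
		ret = fdt_setprop_u64(fdt, chosen_node, "linux,elfcorehdr",
				      image->arch.elfcorehdr_addr);
		if (ret)
			goto out;
	#endif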


Rob/Mimi - please let us know which approach you think is better.

thanks,
 -lakshmi


Re: [PATCH 5/6] powerpc/mm/64s/hash: Add real-mode change_memory_range() for hash LPAR

2021-02-18 Thread Daniel Axtens
Michael Ellerman  writes:

> When we enabled STRICT_KERNEL_RWX we received some reports of boot
> failures when using the Hash MMU and running under phyp. The crashes
> are intermittent, and often exhibit as a completely unresponsive
> system, or possibly an oops.
>
> One example, which was caught in xmon:
>
>   [   14.068327][T1] devtmpfs: mounted
>   [   14.069302][T1] Freeing unused kernel memory: 5568K
>   [   14.142060][  T347] BUG: Unable to handle kernel instruction fetch
>   [   14.142063][T1] Run /sbin/init as init process
>   [   14.142074][  T347] Faulting instruction address: 0xc0004400
>   cpu 0x2: Vector: 400 (Instruction Access) at [cc7475e0]
>   pc: c0004400: exc_virt_0x4400_instruction_access+0x0/0x80
>   lr: c01862d4: update_rq_clock+0x44/0x110
>   sp: cc747880
>  msr: 800040001031
> current = 0xcc60d380
> paca= 0xc0001ec9de80   irqmask: 0x03   irq_happened: 0x01
>   pid   = 347, comm = kworker/2:1
>   ...
>   enter ? for help
>   [cc747880] c01862d4 update_rq_clock+0x44/0x110 (unreliable)
>   [cc7478f0] c0198794 update_blocked_averages+0xb4/0x6d0
>   [cc7479f0] c0198e40 update_nohz_stats+0x90/0xd0
>   [cc747a20] c01a13b4 _nohz_idle_balance+0x164/0x390
>   [cc747b10] c01a1af8 newidle_balance+0x478/0x610
>   [cc747be0] c01a1d48 pick_next_task_fair+0x58/0x480
>   [cc747c40] c0eaab5c __schedule+0x12c/0x950
>   [cc747cd0] c0eab3e8 schedule+0x68/0x120
>   [cc747d00] c016b730 worker_thread+0x130/0x640
>   [cc747da0] c0174d50 kthread+0x1a0/0x1b0
>   [cc747e10] c000e0f0 ret_from_kernel_thread+0x5c/0x6c
>
> This shows that CPU 2, which was idle, woke up and then appears to
> randomly take an instruction fault on a completely valid area of
> kernel text.
>
> The cause turns out to be the call to hash__mark_rodata_ro(), late in
> boot. Due to the way we layout text and rodata, that function actually
> changes the permissions for all of text and rodata to read-only plus
> execute.
>
> To do the permission change we use a hypervisor call, H_PROTECT. On
> phyp that appears to be implemented by briefly removing the mapping of
> the kernel text, before putting it back with the updated permissions.
> If any other CPU is executing during that window, it will see spurious
> faults on the kernel text and/or data, leading to crashes.

Jordan asked why we saw this on phyp but not under KVM? We had a look at
book3s_hv_rm_mmu.c but the code is a bit too obtuse for me to reason
about!

Nick suggests that the KVM hypervisor is invalidating the HPTE, but
because we run guests in VPM mode, the hypervisor would catch the page
fault and not reflect it down to the guest. It looks like Linux-as-a-HV
will take HPTE_V_HVLOCK, and then because it's running in VPM mode, the
hypervisor will catch the fault and not pass it to the guest. But if
phyp runs with VPM mode off, the guest will see the fault before the
hypervisor. (we think this is what's going on anyway.)

We spent a while pondering if phyp is doing something buggy or not...
Looking at the PAPR definition of H_PROTECT, that claims the hypervisor
will do the 'architected “Modifying a Page Table Entry General Case”
sequence'. s 5.10.1.2 of Book IIIS of the ISAv3 defines that, and the
non-atomic hardware sequence does indeed modify the PTE by going through
the invalid state. So it looks like if phyp is running without VPM mode
it's technically not buggy.
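
For anyone without the ISA open, that non-atomic sequence is roughly
the following (pseudocode; the helpers stand in for the real barrier
and tlbie instructions):

	/* "Modifying a Page Table Entry, General Case", sketched */
	hpte[0] &= ~HPTE_V_VALID;	/* 1. invalidate the entry */
	ptesync();			/* 2. order the store */
	tlbie(vpn);			/* 3. flush the old translation */
	eieio(); tlbsync(); ptesync();
	hpte[1] = new_r;		/* 4. install new RPN/permissions */
	eieio();
	hpte[0] = new_v;		/* 5. make it valid again */
	ptesync();

Any access that hits the entry between steps 1 and 5 sees an invalid
PTE -- which is exactly the window the guest is tripping over.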

Hopefully I'll get to have a look at the rest of the patch shortly!

Kind regards,
Daniel

> To fix it we use stop machine to collect all other CPUs, and then have
> them drop into real mode (MMU off), while we change the mapping. That
> way they are unaffected by the mapping temporarily disappearing.
>
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 105 +++-
>  1 file changed, 104 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 3663d3cdffac..01de985df2c4 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -400,6 +401,19 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +
> +struct change_memory_parms {
> + unsigned long start, end, newpp;
> + unsigned int step, nr_cpus, master_cpu;
> + atomic_t cpu_counter;
> +};
> +
> +// We'd rather this was on the stack but it has to be in the RMO
> +static struct change_memory_parms chmem_parms;
> +
> +// And therefore we need a lock to protect it from concurrent use
> +static DEFINE_MUTEX(chmem_lock);
> +
>  
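
For context, a minimal sketch of how those pieces would be driven from
stop_machine() -- the callback body here is an assumption based on the
changelog, not the rest of the patch:

	static int change_memory_range_fn(void *data)
	{
		struct change_memory_parms *parms = data;

		if (parms->master_cpu != smp_processor_id()) {
			/* Secondaries: (drop to real mode, elided here),
			 * then check in and spin until the master is done.
			 */
			atomic_inc(&parms->cpu_counter);
			while (atomic_read_acquire(&parms->cpu_counter) > 0)
				barrier();
			return 0;
		}

		/* Master: wait for all secondaries, do the update, release. */
		while (atomic_read(&parms->cpu_counter) < parms->nr_cpus - 1)
			barrier();
		change_memory_range(parms->start, parms->end,
				    parms->step, parms->newpp);
		atomic_set_release(&parms->cpu_counter, 0);
		return 0;
	}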

Re: [PATCH 4/6] powerpc/mm/64s/hash: Factor out change_memory_range()

2021-02-18 Thread Daniel Axtens
Michael Ellerman  writes:

> Pull the loop calling hpte_updateboltedpp() out of
> hash__change_memory_range() into a helper function. We need it to be a
> separate function for the next patch.
>
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/mm/book3s64/hash_pgtable.c | 23 +++
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
> index 03819c259f0a..3663d3cdffac 100644
> --- a/arch/powerpc/mm/book3s64/hash_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
> @@ -400,10 +400,23 @@ EXPORT_SYMBOL_GPL(hash__has_transparent_hugepage);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> +static void change_memory_range(unsigned long start, unsigned long end,
> + unsigned int step, unsigned long newpp)

Looking at the call paths, this gets called only in bare metal, not
virtualised: should the name reflect that?

> +{
> + unsigned long idx;
> +
> + pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 
> 0x%x\n",
> +  start, end, newpp, step);
> +
> + for (idx = start; idx < end; idx += step)
> + /* Not sure if we can do much with the return value */

Hmm, I realise this comment isn't changed, but it did make me wonder
what the return value was!

It turns out that the function doesn't actually return anything.

Tracking back the history of hpte_updateboltedpp, it looks like it has
not had a return value since the start of git history:

^1da177e4c3f4 include/asm-ppc64/machdep.h    void (*hpte_updateboltedpp)(unsigned long newpp,
3c726f8dee6f5 include/asm-powerpc/machdep.h                              unsigned long ea,
1189be6508d45 include/asm-powerpc/machdep.h                              int psize, int ssize);

The comment comes from commit cd65d6971334 ("powerpc/mm/hash: Implement
mark_rodata_ro() for hash") where Balbir added the comment, but again I
can't figure out what sort of return value there would be to ignore.

Should we drop the comment? (or return something from hpte_updateboltedpp)

> + mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
> + mmu_kernel_ssize);
> +}
> +
>  static bool hash__change_memory_range(unsigned long start, unsigned long end,
> unsigned long newpp)
>  {
> - unsigned long idx;
>   unsigned int step, shift;
>  
>   shift = mmu_psize_defs[mmu_linear_psize].shift;
> @@ -415,13 +428,7 @@ static bool hash__change_memory_range(unsigned long start, unsigned long end,
>   if (start >= end)
>   return false;
>  
> - pr_debug("Changing page protection on range 0x%lx-0x%lx, to 0x%lx, step 
> 0x%x\n",
> -  start, end, newpp, step);
> -
> - for (idx = start; idx < end; idx += step)
> - /* Not sure if we can do much with the return value */
> - mmu_hash_ops.hpte_updateboltedpp(newpp, idx, mmu_linear_psize,
> - mmu_kernel_ssize);
> + change_memory_range(start, end, step, newpp);

Looking at how change_memory_range is called, step is derived by:

shift = mmu_psize_defs[mmu_linear_psize].shift;
step = 1 << shift;

We probably therefore don't really need to pass step in to
change_memory_range. Having said that, I'm not sure it would really be that
much tidier to compute step in change_memory_range, especially since we
also need step for the other branch in hash__change_memory_range.

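If we did fold it in, it would look roughly like this (sketch):

	static void change_memory_range(unsigned long start, unsigned long end,
					unsigned long newpp)
	{
		/* derive the step from the linear mapping page size */
		unsigned int step = 1 << mmu_psize_defs[mmu_linear_psize].shift;
		unsigned long idx;

		for (idx = start; idx < end; idx += step)
			mmu_hash_ops.hpte_updateboltedpp(newpp, idx,
							 mmu_linear_psize,
							 mmu_kernel_ssize);
	}
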
Beyond that it all looks reasonable to me!

I also checked that the loop operations made sense, I think they do - we
cover from start inclusive to end exclusive and the alignment is done
before we call into change_memory_range.

Regards,
Daniel

>   return true;
>  }
> -- 
> 2.25.1


Re: [PATCH] of: error: 'const struct kimage' has no member named 'arch'

2021-02-18 Thread Thiago Jung Bauermann


Lakshmi Ramasubramanian  writes:

> On 2/18/21 4:07 PM, Mimi Zohar wrote:
>
> Hi Mimi,
>
>> On Thu, 2021-02-18 at 14:33 -0800, Lakshmi Ramasubramanian wrote:
>>> of_kexec_alloc_and_setup_fdt() defined in drivers/of/kexec.c builds
>>> a new device tree object that includes architecture specific data
>>> for kexec system call.  This should be defined only if the architecture
>>> being built defines kexec architecture structure "struct kimage_arch".
>>>
>>> Define a new boolean config OF_KEXEC that is enabled if
>>> CONFIG_KEXEC_FILE and CONFIG_OF_FLATTREE are enabled, and
>>> the architecture is arm64 or powerpc64.  Build drivers/of/kexec.c
>>> if CONFIG_OF_KEXEC is enabled.
>>>
>>> Signed-off-by: Lakshmi Ramasubramanian 
>>> Fixes: 33488dc4d61f ("of: Add a common kexec FDT setup function")
>>> Reported-by: kernel test robot 
>>> ---
>>>   drivers/of/Kconfig  | 6 ++
>>>   drivers/of/Makefile | 7 +--
>>>   2 files changed, 7 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
>>> index 18450437d5d5..f2e8fa54862a 100644
>>> --- a/drivers/of/Kconfig
>>> +++ b/drivers/of/Kconfig
>>> @@ -100,4 +100,10 @@ config OF_DMA_DEFAULT_COHERENT
>>> # arches should select this if DMA is coherent by default for OF devices
>>> bool
>>>   +config OF_KEXEC
>>> +   bool
>>> +   depends on KEXEC_FILE
>>> +   depends on OF_FLATTREE
>>> +   default y if ARM64 || PPC64
>>> +
>>>   endif # OF
>>> diff --git a/drivers/of/Makefile b/drivers/of/Makefile
>>> index c13b982084a3..287579dd1695 100644
>>> --- a/drivers/of/Makefile
>>> +++ b/drivers/of/Makefile
>>> @@ -13,11 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
>>>   obj-$(CONFIG_OF_RESOLVE)  += resolver.o
>>>   obj-$(CONFIG_OF_OVERLAY) += overlay.o
>>>   obj-$(CONFIG_OF_NUMA) += of_numa.o
>>> -
>>> -ifdef CONFIG_KEXEC_FILE
>>> -ifdef CONFIG_OF_FLATTREE
>>> -obj-y  += kexec.o
>>> -endif
>>> -endif
>>> +obj-$(CONFIG_OF_KEXEC) += kexec.o
>>> obj-$(CONFIG_OF_UNITTEST) += unittest-data/
>> Is it possible to reuse CONFIG_HAVE_IMA_KEXEC here?
>> 
>
> For ppc64 CONFIG_HAVE_IMA_KEXEC is selected when CONFIG_KEXEC_FILE is enabled.
> So I don't see a problem in reusing CONFIG_HAVE_IMA_KEXEC for ppc.
>
> But for arm64, CONFIG_HAVE_IMA_KEXEC is enabled in the final patch in the
> patch set (the one for carrying forward IMA log across kexec for arm64).
> arm64 calls of_kexec_alloc_and_setup_fdt() prior to enabling
> CONFIG_HAVE_IMA_KEXEC and hence breaks the build for arm64.

One problem is that I believe that this patch won't placate the robot,
because IIUC it generates config files at random and this change still
allows hppa and s390 to enable CONFIG_OF_KEXEC.

Perhaps a new CONFIG_HAVE_KIMAGE_ARCH option? Not having that option
would still allow building kexec.o, but would be used inside kexec.c to
avoid accessing kimage.arch members.

-- 
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH] of: error: 'const struct kimage' has no member named 'arch'

2021-02-18 Thread Lakshmi Ramasubramanian

On 2/18/21 4:07 PM, Mimi Zohar wrote:

Hi Mimi,


On Thu, 2021-02-18 at 14:33 -0800, Lakshmi Ramasubramanian wrote:

of_kexec_alloc_and_setup_fdt() defined in drivers/of/kexec.c builds
a new device tree object that includes architecture specific data
for kexec system call.  This should be defined only if the architecture
being built defines kexec architecture structure "struct kimage_arch".

Define a new boolean config OF_KEXEC that is enabled if
CONFIG_KEXEC_FILE and CONFIG_OF_FLATTREE are enabled, and
the architecture is arm64 or powerpc64.  Build drivers/of/kexec.c
if CONFIG_OF_KEXEC is enabled.

Signed-off-by: Lakshmi Ramasubramanian 
Fixes: 33488dc4d61f ("of: Add a common kexec FDT setup function")
Reported-by: kernel test robot 
---
  drivers/of/Kconfig  | 6 ++
  drivers/of/Makefile | 7 +--
  2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index 18450437d5d5..f2e8fa54862a 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -100,4 +100,10 @@ config OF_DMA_DEFAULT_COHERENT
# arches should select this if DMA is coherent by default for OF devices
bool
  
+config OF_KEXEC
+   bool
+   depends on KEXEC_FILE
+   depends on OF_FLATTREE
+   default y if ARM64 || PPC64
+
  endif # OF
diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index c13b982084a3..287579dd1695 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -13,11 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
  obj-$(CONFIG_OF_RESOLVE)  += resolver.o
  obj-$(CONFIG_OF_OVERLAY) += overlay.o
  obj-$(CONFIG_OF_NUMA) += of_numa.o
-
-ifdef CONFIG_KEXEC_FILE
-ifdef CONFIG_OF_FLATTREE
-obj-y  += kexec.o
-endif
-endif
+obj-$(CONFIG_OF_KEXEC) += kexec.o
  
  obj-$(CONFIG_OF_UNITTEST) += unittest-data/


Is it possible to reuse CONFIG_HAVE_IMA_KEXEC here?



For ppc64 CONFIG_HAVE_IMA_KEXEC is selected when CONFIG_KEXEC_FILE is 
enabled. So I don't see a problem in reusing CONFIG_HAVE_IMA_KEXEC for ppc.


But for arm64, CONFIG_HAVE_IMA_KEXEC is enabled in the final patch in 
the patch set (the one for carrying forward IMA log across kexec for 
arm64). arm64 calls of_kexec_alloc_and_setup_fdt() prior to enabling 
CONFIG_HAVE_IMA_KEXEC and hence breaks the build for arm64.


thanks,
 -lakshmi


Re: [PATCH] of: error: 'const struct kimage' has no member named 'arch'

2021-02-18 Thread Mimi Zohar
On Thu, 2021-02-18 at 14:33 -0800, Lakshmi Ramasubramanian wrote:
> of_kexec_alloc_and_setup_fdt() defined in drivers/of/kexec.c builds
> a new device tree object that includes architecture specific data
> for kexec system call.  This should be defined only if the architecture
> being built defines kexec architecture structure "struct kimage_arch".
> 
> Define a new boolean config OF_KEXEC that is enabled if
> CONFIG_KEXEC_FILE and CONFIG_OF_FLATTREE are enabled, and
> the architecture is arm64 or powerpc64.  Build drivers/of/kexec.c
> if CONFIG_OF_KEXEC is enabled.
> 
> Signed-off-by: Lakshmi Ramasubramanian 
> Fixes: 33488dc4d61f ("of: Add a common kexec FDT setup function")
> Reported-by: kernel test robot 
> ---
>  drivers/of/Kconfig  | 6 ++
>  drivers/of/Makefile | 7 +--
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
> index 18450437d5d5..f2e8fa54862a 100644
> --- a/drivers/of/Kconfig
> +++ b/drivers/of/Kconfig
> @@ -100,4 +100,10 @@ config OF_DMA_DEFAULT_COHERENT
>   # arches should select this if DMA is coherent by default for OF devices
>   bool
>  
> +config OF_KEXEC
> + bool
> + depends on KEXEC_FILE
> + depends on OF_FLATTREE
> + default y if ARM64 || PPC64
> +
>  endif # OF
> diff --git a/drivers/of/Makefile b/drivers/of/Makefile
> index c13b982084a3..287579dd1695 100644
> --- a/drivers/of/Makefile
> +++ b/drivers/of/Makefile
> @@ -13,11 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
>  obj-$(CONFIG_OF_RESOLVE)  += resolver.o
>  obj-$(CONFIG_OF_OVERLAY) += overlay.o
>  obj-$(CONFIG_OF_NUMA) += of_numa.o
> -
> -ifdef CONFIG_KEXEC_FILE
> -ifdef CONFIG_OF_FLATTREE
> -obj-y  += kexec.o
> -endif
> -endif
> +obj-$(CONFIG_OF_KEXEC) += kexec.o
>  
>  obj-$(CONFIG_OF_UNITTEST) += unittest-data/

Is it possible to reuse CONFIG_HAVE_IMA_KEXEC here?

Mimi


Re: linux-next: manual merge of the devicetree tree with the powerpc tree

2021-02-18 Thread Michael Ellerman
Rob Herring  writes:
> On Thu, Feb 18, 2021 at 5:34 AM Stephen Rothwell  
> wrote:
>> On Thu, 18 Feb 2021 21:44:37 +1100 Michael Ellerman  
>> wrote:
>> >
>> > I think it just needs this?
>> >
>> > diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
>> > index 87e34611f93d..0492ca6003f3 100644
>> > --- a/arch/powerpc/kexec/elf_64.c
>> > +++ b/arch/powerpc/kexec/elf_64.c
>> > @@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>> >
>> >   fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
>> >  initrd_len, cmdline,
>> > -					   fdt_totalsize(initial_boot_params));
>> > +					   kexec_fdt_totalsize_ppc64(image));
>> >   if (!fdt) {
>> >   pr_err("Error setting up the new device tree.\n");
>> >   ret = -EINVAL;
>> >
>>
>> I thought about that, but the last argument to
>> of_kexec_alloc_and_setup_fdt() is extra_fdt_size and the allocation
>> done is for this:
>>
>> fdt_size = fdt_totalsize(initial_boot_params) +
>>(cmdline ? strlen(cmdline) : 0) +
>>FDT_EXTRA_SPACE +
>>extra_fdt_size;
>>
>> and kexec_fdt_totalsize_ppc64() also includes
>> fdt_totalsize(initial_boot_params) so I was not sure.  Maybe
>> kexec_fdt_totalsize_ppc64() needs modification as well?
>
> You're both right. Michael's fix is sufficient for the merge. The only
> risk with a larger size is failing to allocate it, but we're talking
> only 10s of KB. Historically until the commit causing the conflict,
> PPC was just used 2x fdt_totalsize(initial_boot_params). You could
> drop 'fdt_size = fdt_totalsize(initial_boot_params) + (2 *
> COMMAND_LINE_SIZE);' from kexec_fdt_totalsize_ppc64() as well, but
> then the function name is misleading.
>
> Lakshmi can send a follow-up patch to fine tune the size and rename
> kexec_fdt_totalsize_ppc64.

Sounds good.

cheers


Re: [PATCH 2/6] powerpc/pseries: Add key to flags in pSeries_lpar_hpte_updateboltedpp()

2021-02-18 Thread Michael Ellerman
Daniel Axtens  writes:
> Michael Ellerman  writes:
>
>> The flags argument to plpar_pte_protect() (aka. H_PROTECT), includes
>> the key in bits 9-13, but currently we always set those bits to zero.
>>
>> In the past that hasn't been a problem because we always used key 0
>> for the kernel, and updateboltedpp() is only used for kernel mappings.
>>
>> However since commit d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3
>> for kernel mapping with hash translation") we are now inadvertently
>> changing the key (to zero) when we call plpar_pte_protect().
>>
>> That hasn't broken anything because updateboltedpp() is only used for
>> STRICT_KERNEL_RWX, which is currently disabled on 64s due to other
>> bugs.
>>
>> But we want to fix that, so first we need to pass the key correctly to
>> plpar_pte_protect(). In the `newpp` value the low 3 bits of the key
>> are already in the correct spot, but the high 2 bits of the key need
>> to be shifted down.
>>
>> Fixes: d94b827e89dc ("powerpc/book3s64/kuap: Use Key 3 for kernel mapping with hash translation")
>> Signed-off-by: Michael Ellerman 
>> ---
>>  arch/powerpc/platforms/pseries/lpar.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
>> index 764170fdb0f7..8bbbddff7226 100644
>> --- a/arch/powerpc/platforms/pseries/lpar.c
>> +++ b/arch/powerpc/platforms/pseries/lpar.c
>> @@ -976,11 +976,13 @@ static void pSeries_lpar_hpte_updateboltedpp(unsigned long newpp,
>>  slot = pSeries_lpar_hpte_find(vpn, psize, ssize);
>>  BUG_ON(slot == -1);
>>  
>> -flags = newpp & 7;
>> +flags = newpp & (HPTE_R_PP | HPTE_R_N);
>>  if (mmu_has_feature(MMU_FTR_KERNEL_RO))
>>  /* Move pp0 into bit 8 (IBM 55) */
>>  flags |= (newpp & HPTE_R_PP0) >> 55;
>>  
>> +flags |= ((newpp & HPTE_R_KEY_HI) >> 48) | (newpp & HPTE_R_KEY_LO);
>> +
>
> I'm really confused about how these bits are getting packed into the
> flags parameter. It seems to match how they are unpacked in
> kvmppc_h_pr_protect, but I cannot figure out why they are packed in that
> order, and the LoPAR doesn't seem especially illuminating on this topic
> - although I may have missed the relevant section.

Yeah I agree it's not very clearly specified.

The hcall we're using here is H_PROTECT, which is specified in section
14.5.4.1.6 of LoPAPR v1.1.

It takes a `flags` parameter, and the description for flags says:

 * flags: AVPN, pp0, pp1, pp2, key0-key4, n, and for the CMO
   option: CMO Option flags as defined in Table 189


If you then go to the start of the parent section, 14.5.4.1, on page
405, it says:

Register Linkage (For hcall() tokens 0x04 - 0x18)
 * On Call
   * R3 function call token
   * R4 flags (see Table 178, "Page Frame Table Access flags field definition," on page 401)


Then you have to go to section 14.5.3, and on page 394 there is a list
of hcalls and their tokens (table 176), and there you can see that
H_PROTECT == 0x18.

Finally you can look at table 178, on page 401, where it specifies the
layout of the bits for the key:

 Bit   | Function
-------+-----------
 50-54 | key0-key4


Those are big-endian bit numbers, converting to normal bit numbers you
get bits 9-13, or 0x3e00.

If you look at the kernel source we have:

#define HPTE_R_KEY_HI   ASM_CONST(0x3000000000000000)
#define HPTE_R_KEY_LO   ASM_CONST(0x0000000000000e00)

So the LO bits are already in the right place, and the HI bits just need
to be shifted down by 48.
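
Putting it together, the packing in the patch works out to (restating
the hunk above as one sketch, masks as defined in mmu-hash.h):

	flags  = newpp & (HPTE_R_PP | HPTE_R_N);	/* pp1, pp2, n */
	flags |= (newpp & HPTE_R_PP0) >> 55;		/* pp0 -> bit 8 (IBM 55) */
	flags |= (newpp & HPTE_R_KEY_HI) >> 48;		/* key0-1 -> bits 13-12 */
	flags |= newpp & HPTE_R_KEY_LO;			/* key2-4 already at 11-9 */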

Hope that makes it clearer :)

cheers


[PATCH] of: error: 'const struct kimage' has no member named 'arch'

2021-02-18 Thread Lakshmi Ramasubramanian
of_kexec_alloc_and_setup_fdt() defined in drivers/of/kexec.c builds
a new device tree object that includes architecture specific data
for kexec system call.  This should be defined only if the architecture
being built defines kexec architecture structure "struct kimage_arch".

Define a new boolean config OF_KEXEC that is enabled if
CONFIG_KEXEC_FILE and CONFIG_OF_FLATTREE are enabled, and
the architecture is arm64 or powerpc64.  Build drivers/of/kexec.c
if CONFIG_OF_KEXEC is enabled.

Signed-off-by: Lakshmi Ramasubramanian 
Fixes: 33488dc4d61f ("of: Add a common kexec FDT setup function")
Reported-by: kernel test robot 
---
 drivers/of/Kconfig  | 6 ++
 drivers/of/Makefile | 7 +--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index 18450437d5d5..f2e8fa54862a 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -100,4 +100,10 @@ config OF_DMA_DEFAULT_COHERENT
# arches should select this if DMA is coherent by default for OF devices
bool
 
+config OF_KEXEC
+   bool
+   depends on KEXEC_FILE
+   depends on OF_FLATTREE
+   default y if ARM64 || PPC64
+
 endif # OF
diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index c13b982084a3..287579dd1695 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -13,11 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
 obj-$(CONFIG_OF_RESOLVE)  += resolver.o
 obj-$(CONFIG_OF_OVERLAY) += overlay.o
 obj-$(CONFIG_OF_NUMA) += of_numa.o
-
-ifdef CONFIG_KEXEC_FILE
-ifdef CONFIG_OF_FLATTREE
-obj-y  += kexec.o
-endif
-endif
+obj-$(CONFIG_OF_KEXEC) += kexec.o
 
 obj-$(CONFIG_OF_UNITTEST) += unittest-data/
-- 
2.30.0



Re: linux-next: manual merge of the devicetree tree with the powerpc tree

2021-02-18 Thread Stephen Rothwell
Hi all,

On Thu, 18 Feb 2021 07:52:52 -0600 Rob Herring  wrote:
>
> On Thu, Feb 18, 2021 at 5:34 AM Stephen Rothwell  
> wrote:
> >
> > On Thu, 18 Feb 2021 21:44:37 +1100 Michael Ellerman  
> > wrote:  
> > >
> > > I think it just needs this?
> > >
> > > diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> > > index 87e34611f93d..0492ca6003f3 100644
> > > --- a/arch/powerpc/kexec/elf_64.c
> > > +++ b/arch/powerpc/kexec/elf_64.c
> > > @@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
> > >
> > >   fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
> > >  initrd_len, cmdline,
> > > -					   fdt_totalsize(initial_boot_params));
> > > +					   kexec_fdt_totalsize_ppc64(image));
> > >   if (!fdt) {
> > >   pr_err("Error setting up the new device tree.\n");
> > >   ret = -EINVAL;
> > >  
> >
> > I thought about that, but the last argument to
> > of_kexec_alloc_and_setup_fdt() is extra_fdt_size and the allocation
> > done is for this:
> >
> > fdt_size = fdt_totalsize(initial_boot_params) +
> >(cmdline ? strlen(cmdline) : 0) +
> >FDT_EXTRA_SPACE +
> >extra_fdt_size;
> >
> > and kexec_fdt_totalsize_ppc64() also includes
> > fdt_totalsize(initial_boot_params) so I was not sure.  Maybe
> > kexec_fdt_totalsize_ppc64() needs modification as well?  
> 
> You're both right. Michael's fix is sufficient for the merge. The only
> risk with a larger size is failing to allocate it, but we're talking
> only 10s of KB. Historically, until the commit causing the conflict,
> PPC just used 2x fdt_totalsize(initial_boot_params). You could
> drop 'fdt_size = fdt_totalsize(initial_boot_params) + (2 *
> COMMAND_LINE_SIZE);' from kexec_fdt_totalsize_ppc64() as well, but
> then the function name is misleading.
> 
> Lakshmi can send a follow-up patch to fine tune the size and rename
> kexec_fdt_totalsize_ppc64.

OK, I have made Michael's suggested change to my resolution from today.

-- 
Cheers,
Stephen Rothwell




Re: [PATCH] powerpc/4xx: Fix build errors from mfdcr()

2021-02-18 Thread Feng Tang
On Thu, Feb 18, 2021 at 11:30:58PM +1100, Michael Ellerman wrote:
> lkp reported a build error in fsp2.o:
> 
>   CC  arch/powerpc/platforms/44x/fsp2.o
>   {standard input}:577: Error: unsupported relocation against base
> 
> Which comes from:
> 
>   pr_err("GESR0: 0x%08x\n", mfdcr(base + PLB4OPB_GESR0));
> 
> Where our mfdcr() macro is stringifying "base + PLB4OPB_GESR0", and
> passing that to the assembler, which obviously doesn't work.
> 
> The mfdcr() macro already checks that the argument is constant using
> __builtin_constant_p(), and if not calls the out-of-line version of
> mfdcr(). But in this case GCC is smart enough to notice that "base +
> PLB4OPB_GESR0" will be constant, even though it's not something we can
> immediately stringify into a register number.
> 
> Segher pointed out that passing the register number to the inline asm
> as a constant would be better, and in fact it fixes the build error,
> presumably because it gives GCC a chance to resolve the value.
> 
> While we're at it, change mtdcr() similarly.
> 
> Reported-by: kernel test robot 
> Suggested-by: Segher Boessenkool 
> Signed-off-by: Michael Ellerman 

Acked-by: Feng Tang 

Thanks!

> ---
>  arch/powerpc/include/asm/dcr-native.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dcr-native.h b/arch/powerpc/include/asm/dcr-native.h
> index 7141ccea8c94..a92059964579 100644
> --- a/arch/powerpc/include/asm/dcr-native.h
> +++ b/arch/powerpc/include/asm/dcr-native.h
> @@ -53,8 +53,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
>  #define mfdcr(rn)\
>   ({unsigned int rval;\
>   if (__builtin_constant_p(rn) && rn < 1024)  \
> - asm volatile("mfdcr %0," __stringify(rn)\
> -   : "=r" (rval));   \
> + asm volatile("mfdcr %0, %1" : "=r" (rval)   \
> +   : "n" (rn));  \
>   else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR)))  \
>   rval = mfdcrx(rn);  \
>   else\
> @@ -64,8 +64,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
>  #define mtdcr(rn, v) \
>  do { \
>   if (__builtin_constant_p(rn) && rn < 1024)  \
> - asm volatile("mtdcr " __stringify(rn) ",%0" \
> -   : : "r" (v)); \
> + asm volatile("mtdcr %0, %1" \
> +   : : "n" (rn), "r" (v));   \
>   else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR)))  \
>   mtdcrx(rn, v);  \
>   else\
> -- 
> 2.25.1


Re: linux-next: manual merge of the devicetree tree with the powerpc tree

2021-02-18 Thread Rob Herring
On Thu, Feb 18, 2021 at 5:34 AM Stephen Rothwell  wrote:
>
> Hi Michael,
>
> On Thu, 18 Feb 2021 21:44:37 +1100 Michael Ellerman  
> wrote:
> >
> > I think it just needs this?
> >
> > diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> > index 87e34611f93d..0492ca6003f3 100644
> > --- a/arch/powerpc/kexec/elf_64.c
> > +++ b/arch/powerpc/kexec/elf_64.c
> > @@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
> >
> >   fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
> >  initrd_len, cmdline,
> > -					   fdt_totalsize(initial_boot_params));
> > +					   kexec_fdt_totalsize_ppc64(image));
> >   if (!fdt) {
> >   pr_err("Error setting up the new device tree.\n");
> >   ret = -EINVAL;
> >
>
> I thought about that, but the last argument to
> of_kexec_alloc_and_setup_fdt() is extra_fdt_size and the allocation
> done is for this:
>
> fdt_size = fdt_totalsize(initial_boot_params) +
>(cmdline ? strlen(cmdline) : 0) +
>FDT_EXTRA_SPACE +
>extra_fdt_size;
>
> and kexec_fdt_totalsize_ppc64() also includes
> fdt_totalsize(initial_boot_params) so I was not sure.  Maybe
> kexec_fdt_totalsize_ppc64() needs modification as well?

You're both right. Michael's fix is sufficient for the merge. The only
risk with a larger size is failing to allocate it, but we're talking
only 10s of KB. Historically, until the commit causing the conflict,
PPC just used 2x fdt_totalsize(initial_boot_params). You could
drop 'fdt_size = fdt_totalsize(initial_boot_params) + (2 *
COMMAND_LINE_SIZE);' from kexec_fdt_totalsize_ppc64() as well, but
then the function name is misleading.

Lakshmi can send a follow-up patch to fine tune the size and rename
kexec_fdt_totalsize_ppc64.

Rob
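
Spelled out, the over-allocation in question is roughly this (a sketch;
FDT_EXTRA_SPACE is the constant used in drivers/of/kexec.c):

	/* extra_fdt_size = kexec_fdt_totalsize_ppc64(image), which already
	 * counts fdt_totalsize(initial_boot_params), so the total becomes:
	 */
	fdt_size = fdt_totalsize(initial_boot_params)	/* base copy */
		 + fdt_totalsize(initial_boot_params)	/* counted again */
		 + 2 * COMMAND_LINE_SIZE
		 + (cmdline ? strlen(cmdline) : 0)
		 + FDT_EXTRA_SPACE;	/* a harmless few-KB overshoot */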


Re: [PATCH kernel] powerpc/iommu: Annotate nested lock for lockdep

2021-02-18 Thread Frederic Barrat




On 16/02/2021 04:20, Alexey Kardashevskiy wrote:

The IOMMU table is divided into pools for concurrent mappings and each
pool has a separate spinlock. When taking the ownership of an IOMMU group
to pass through a device to a VM, we lock these spinlocks which triggers
a false negative warning in lockdep (below).

This fixes it by annotating the large pool's spinlock as a nest lock.

===
WARNING: possible recursive locking detected
5.11.0-le_syzkaller_a+fstn1 #100 Not tainted

qemu-system-ppc/4129 is trying to acquire lock:
c000119bddb0 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

but task is already holding lock:
c000119bdd30 (&(p->lock)/1){}-{2:2}, at: iommu_take_ownership+0xac/0x1e0

other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0
----
   lock(&(p->lock)/1);
   lock(&(p->lock)/1);
===

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/kernel/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 557a09dd5b2f..2ee642a6731a 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1089,7 +1089,7 @@ int iommu_take_ownership(struct iommu_table *tbl)
  
	spin_lock_irqsave(&tbl->large_pool.lock, flags);

for (i = 0; i < tbl->nr_pools; i++)
-   spin_lock(&tbl->pools[i].lock);
+   spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);



We have the same pattern and therefore should have the same problem in 
iommu_release_ownership().


But as I understand, we're hacking our way around lockdep here, since
conceptually those locks are independent. I was wondering why it seems
to be fixed by worrying only about the large pool lock. That loop can
take many locks (up to 4 with the current config). However, if the DMA
window is less than 1GB, we would only have one, so it would make sense
for lockdep to stop complaining. Is that what happened? In that case,
this patch doesn't really fix it. Or I'm missing something :-)


  Fred




iommu_table_release_pages(tbl);
  
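
For reference, iommu_release_ownership() has the same locking shape, so
it would presumably want the matching annotation (a sketch, not a hunk
from this patch):

	spin_lock_irqsave(&tbl->large_pool.lock, flags);
	for (i = 0; i < tbl->nr_pools; i++)
		spin_lock_nest_lock(&tbl->pools[i].lock,
				    &tbl->large_pool.lock);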



[PATCH] powerpc/4xx: Fix build errors from mfdcr()

2021-02-18 Thread Michael Ellerman
lkp reported a build error in fsp2.o:

  CC  arch/powerpc/platforms/44x/fsp2.o
  {standard input}:577: Error: unsupported relocation against base

Which comes from:

  pr_err("GESR0: 0x%08x\n", mfdcr(base + PLB4OPB_GESR0));

Where our mfdcr() macro is stringifying "base + PLB4OPB_GESR0", and
passing that to the assembler, which obviously doesn't work.

The mfdcr() macro already checks that the argument is constant using
__builtin_constant_p(), and if not calls the out-of-line version of
mfdcr(). But in this case GCC is smart enough to notice that "base +
PLB4OPB_GESR0" will be constant, even though it's not something we can
immediately stringify into a register number.

Segher pointed out that passing the register number to the inline asm
as a constant would be better, and in fact it fixes the build error,
presumably because it gives GCC a chance to resolve the value.

While we're at it, change mtdcr() similarly.

Reported-by: kernel test robot 
Suggested-by: Segher Boessenkool 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/dcr-native.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/dcr-native.h b/arch/powerpc/include/asm/dcr-native.h
index 7141ccea8c94..a92059964579 100644
--- a/arch/powerpc/include/asm/dcr-native.h
+++ b/arch/powerpc/include/asm/dcr-native.h
@@ -53,8 +53,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
 #define mfdcr(rn)  \
({unsigned int rval;\
if (__builtin_constant_p(rn) && rn < 1024)  \
-   asm volatile("mfdcr %0," __stringify(rn)\
- : "=r" (rval));   \
+   asm volatile("mfdcr %0, %1" : "=r" (rval)   \
+ : "n" (rn));  \
else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR)))  \
rval = mfdcrx(rn);  \
else\
@@ -64,8 +64,8 @@ static inline void mtdcrx(unsigned int reg, unsigned int val)
 #define mtdcr(rn, v)   \
 do {   \
if (__builtin_constant_p(rn) && rn < 1024)  \
-   asm volatile("mtdcr " __stringify(rn) ",%0" \
- : : "r" (v)); \
+   asm volatile("mtdcr %0, %1" \
+ : : "n" (rn), "r" (v));   \
else if (likely(cpu_has_feature(CPU_FTR_INDEXED_DCR)))  \
mtdcrx(rn, v);  \
else\
-- 
2.25.1
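
To see why the stringify form failed, compare the two expansions for
mfdcr(base + PLB4OPB_GESR0) (illustrative; rval and base assumed in
scope):

	/* old macro: the C expression is pasted into the mnemonic as text,
	 * which the assembler cannot evaluate:
	 */
	asm volatile("mfdcr %0," "base + PLB4OPB_GESR0" : "=r" (rval));

	/* new macro: the "n" (immediate) constraint makes GCC fold the
	 * constant before the assembler ever sees it:
	 */
	asm volatile("mfdcr %0, %1" : "=r" (rval) : "n" (base + PLB4OPB_GESR0));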



Re: linux-next: manual merge of the devicetree tree with the powerpc tree

2021-02-18 Thread Stephen Rothwell
Hi Michael,

On Thu, 18 Feb 2021 21:44:37 +1100 Michael Ellerman  wrote:
>
> I think it just needs this?
> 
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index 87e34611f93d..0492ca6003f3 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
>  
>   fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
>  initrd_len, cmdline,
> -fdt_totalsize(initial_boot_params));
> +kexec_fdt_totalsize_ppc64(image));
>   if (!fdt) {
>   pr_err("Error setting up the new device tree.\n");
>   ret = -EINVAL;
> 

I thought about that, but the last argument to
of_kexec_alloc_and_setup_fdt() is extra_fdt_size and the allocation
done is for this:

fdt_size = fdt_totalsize(initial_boot_params) +
   (cmdline ? strlen(cmdline) : 0) +
   FDT_EXTRA_SPACE +
   extra_fdt_size;

and kexec_fdt_totalsize_ppc64() also includes
fdt_totalsize(initial_boot_params) so I was not sure.  Maybe
kexec_fdt_totalsize_ppc64() needs modification as well?

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v5 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block

2021-02-18 Thread Michael Ellerman
"Christopher M. Riedl"  writes:
> On Thu Feb 11, 2021 at 11:21 PM CST, Daniel Axtens wrote:
...
>>
>> My only concern here was whether it was valid to access
>> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
>> if CONFIG_PPC_TRANSACTIONAL_MEM was not defined, but I didn't think of
>> any obvious reason why it wouldn't be...
>
> Hmm, we don't really need it for the non-TM case and it is another extra
> uaccess. I will take your suggestion to remove the need for the other
> ifdefs but might keep this one. Keeping it also makes it very clear this
> call is only here for TM and possible to remove in a potentially TM-less
> future :)

Yep I agree.

cheers
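
For reference, the guarded read being kept looks roughly like this
(a sketch, not the actual hunk; the error path is an assumption):

	#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
		/* msr is only read here to check the transaction state bits */
		if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
			goto badframe;
	#endif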


Re: linux-next: manual merge of the devicetree tree with the powerpc tree

2021-02-18 Thread Michael Ellerman
Stephen Rothwell  writes:
> Hi all,
>
> Today's linux-next merge of the devicetree tree got a conflict in:
>
>   arch/powerpc/kexec/elf_64.c
>
> between commit:
>
>   2377c92e37fe ("powerpc/kexec_file: fix FDT size estimation for kdump kernel")
>
> from the powerpc tree and commit:
>
>   130b2d59cec0 ("powerpc: Use common of_kexec_alloc_and_setup_fdt()")
>
> from the devicetree tree.
>
> I can't easily see how to resolve these, so for now I have just used
> the latter's changes to this file.

I think it just needs this?

diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 87e34611f93d..0492ca6003f3 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -104,7 +104,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 
fdt = of_kexec_alloc_and_setup_fdt(image, initrd_load_addr,
   initrd_len, cmdline,
-  fdt_totalsize(initial_boot_params));
+  kexec_fdt_totalsize_ppc64(image));
if (!fdt) {
pr_err("Error setting up the new device tree.\n");
ret = -EINVAL;


cheers