Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
On Fri, 15 Sep 2017 07:52:49 +0200 Greg Kurz wrote:
> Dang! The mail relay at OVH has blacklisted Paul's address :-\
>
> : host smtp.samba.org[144.76.82.148] said: 550-blacklisted at
> zen.spamhaus.org 550 https://www.spamhaus.org/sbl/query/SBL370982
> (in reply to RCPT TO command)

Dumb me! It's the opposite... OVH is blacklisted by smtp.samba.org :-\

Sigh.
[PATCH v2] net/ethernet/freescale: fix warning for ucc_geth
uf_info.regs is resource_size_t, i.e. phys_addr_t, which can be either u32
or u64 depending on CONFIG_PHYS_ADDR_T_64BIT. The printk format is thus
adapted to u64 and the regs value cast to u64 to cover both cases.

Signed-off-by: Valentin Longchamp
---
 drivers/net/ethernet/freescale/ucc_geth.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index f77ba9fa257b..a96b838cffce 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -3857,8 +3857,9 @@ static int ucc_geth_probe(struct platform_device* ofdev)
 	}
 
 	if (netif_msg_probe(&debug))
-		pr_info("UCC%1d at 0x%8x (irq = %d)\n",
-			ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs,
+		pr_info("UCC%1d at 0x%8llx (irq = %d)\n",
+			ug_info->uf_info.ucc_num + 1,
+			(u64)ug_info->uf_info.regs,
 			ug_info->uf_info.irq);
 
 	/* Create an ethernet device instance */
-- 
2.13.5
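For reference, printk also provides a %pa specifier that prints a
phys_addr_t/resource_size_t passed by reference and sizes itself to the
configuration automatically. A minimal sketch of that alternative (not the
fix applied above):

	/* Sketch only: %pa takes the address of the resource_size_t and
	 * adapts to CONFIG_PHYS_ADDR_T_64BIT without any cast. */
	pr_info("UCC%1d at %pa (irq = %d)\n",
		ug_info->uf_info.ucc_num + 1,
		&ug_info->uf_info.regs,
		ug_info->uf_info.irq);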
Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
Dang! The mail relay at OVH has blacklisted Paul's address :-\

: host smtp.samba.org[144.76.82.148] said: 550-blacklisted at
zen.spamhaus.org 550 https://www.spamhaus.org/sbl/query/SBL370982
(in reply to RCPT TO command)

Cc'ing Paul at ozlabs.org

On Fri, 15 Sep 2017 10:48:39 +1000 David Gibson wrote:
> On Thu, Sep 14, 2017 at 11:56:25PM +0200, Greg Kurz wrote:
> > The following program causes a kernel oops:
> >
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sys/ioctl.h>
> > #include <linux/kvm.h>
> >
> > main()
> > {
> >     int fd = open("/dev/kvm", O_RDWR);
> >     ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
> > }
> >
> > This happens because when using the global KVM fd with
> > KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
> > called with a NULL kvm argument, which gets dereferenced
> > in is_kvmppc_hv_enabled(). Spotted while reading the code.
> >
> > Let's use the hv_enabled fallback variable, like everywhere
> > else in this function.
> >
> > Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
> > Cc: sta...@vger.kernel.org # v4.7+
> > Signed-off-by: Greg Kurz
>
> I don't think this is right. I'm pretty sure you want to fall back to
> hv_enabled *only when* kvm is NULL. Otherwise if you have a PR guest
> on an HV capable machine, this will give the wrong answer, when called
> for that specific VM.
>

Hmmm... this is what we get with this patch applied:

open("/dev/kvm", O_RDWR)               = 3
ioctl(3, KVM_CHECK_EXTENSION, 0x84)    = 1   <== if HV is present
ioctl(3, KVM_CREATE_VM, 0x1)           = 4   <== HV
ioctl(4, KVM_CHECK_EXTENSION, 0x84)    = 1
ioctl(3, KVM_CREATE_VM, 0x2)           = 5   <== PR
ioctl(5, KVM_CHECK_EXTENSION, 0x84)    = 0

The hv_enabled variable is set as follows:

	/* Assume we're using HV mode when the HV module is loaded */
	int hv_enabled = kvmppc_hv_ops ? 1 : 0;

	if (kvm) {
		/*
		 * Hooray - we know which VM type we're running on. Depend on
		 * that rather than the guess above.
		 */
		hv_enabled = is_kvmppc_hv_enabled(kvm);
	}

so we're good. :)

The last sentence in the commit message is maybe^wprobably not comprehensive
enough... What about the following ?

The hv_enabled variable is initialized to 1 if HV is loaded or 0 otherwise.
In the case KVM_CHECK_EXTENSION is used with a VM fd, hv_enabled is updated
to is_kvmppc_hv_enabled(kvm). Let's use it here, like everywhere else in
this function.

> > ---
> >  arch/powerpc/kvm/powerpc.c |    3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 3480faaf1ef8..ee279c7f4802 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  		break;
> >  #endif
> >  	case KVM_CAP_PPC_HTM:
> > -		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
> > -			is_kvmppc_hv_enabled(kvm);
> > +		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
> >  		break;
> >  	default:
> >  		r = 0;
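For completeness, a small userspace probe that reproduces the strace above.
This is a sketch assuming a powerpc host, with the HV (0x1) and PR (0x2)
machine types passed to KVM_CREATE_VM exactly as in the trace:

	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		int hv = ioctl(kvm, KVM_CREATE_VM, 0x1);	/* HV, per the trace */
		int pr = ioctl(kvm, KVM_CREATE_VM, 0x2);	/* PR, per the trace */

		/* 0x84 is KVM_CAP_PPC_HTM; expect 1/1/0 on an HV-capable host */
		printf("global fd: %d\n", ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		printf("HV VM fd:  %d\n", ioctl(hv, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		printf("PR VM fd:  %d\n", ioctl(pr, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		return 0;
	}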
[PATCH] KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception
On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
Interrupt (HDSI) the HDSISR is not updated at all.

To work around this we put a canary value into the HDSISR before returning
to a guest and then check for this canary when we take an HDSI. If we find
the canary on an HDSI, we know the hardware didn't update the HDSISR. In
this case we return to the guest to retake the HDSI, which should correctly
update the HDSISR the second time we take the HDSI.

After talking to Paulus we've applied this workaround to all POWER9 CPUs.
The workaround of returning to the guest shouldn't ever be triggered on a
well behaving CPU, and the extra instructions should have negligible
performance impact.

Signed-off-by: Michael Neuling
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 663a4a861e..70dca60569 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1118,6 +1118,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 BEGIN_FTR_SECTION
 	mtspr	SPRN_PPR, r0
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
+/* Move canary into DSISR to check for later */
+BEGIN_FTR_SECTION
+	li	r0, 0x7fff
+	mtspr	SPRN_HDSISR, r0
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+
 	ld	r0, VCPU_GPR(R0)(r4)
 	ld	r4, VCPU_GPR(R4)(r4)
 
@@ -1947,9 +1954,14 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
 kvmppc_hdsi:
 	ld	r3, VCPU_KVM(r9)
 	lbz	r0, KVM_RADIX(r3)
-	cmpwi	r0, 0
 	mfspr	r4, SPRN_HDAR
 	mfspr	r6, SPRN_HDSISR
+BEGIN_FTR_SECTION
+	/* Look for DSISR canary. If we find it, retry instruction */
+	cmpdi	r6, 0x7fff
+	beq	6f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+	cmpwi	r0, 0
 	bne	.Lradix_hdsi		/* on radix, just save DAR/DSISR/ASDR */
 	/* HPTE not found fault or protection fault? */
 	andis.	r0, r6, (DSISR_NOHPTE | DSISR_PROTFAULT)@h
-- 
2.11.0
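To make the workaround easier to follow, here is the canary pattern in C
pseudocode. This is illustrative only; the real workaround is the assembly
above, and mtspr()/mfspr() stand in for the corresponding instructions. The
canary value 0x7fff is the one the patch parks in HDSISR:

	#define HDSISR_CANARY	0x7fff

	/* Before returning to the guest, park the canary in HDSISR. */
	static void arm_hdsisr_canary(void)
	{
		mtspr(SPRN_HDSISR, HDSISR_CANARY);
	}

	/* On HDSI: if the canary is still there, the hardware failed to
	 * update HDSISR, so return to the guest and let it retake the
	 * interrupt, which should produce a fresh HDSISR value. */
	static bool hdsisr_is_stale(void)
	{
		return mfspr(SPRN_HDSISR) == HDSISR_CANARY;
	}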
[PATCH 1/2] powerpc: Add workaround for P9 vector CI load issue
POWER9 DD2.1 and earlier have an issue where some cache inhibited vector
loads will return bad data. The workaround is in two parts: one
firmware/microcode part triggers HMI interrupts when hitting such loads,
and the other part is this patch, which then emulates the instructions in
Linux. The affected instructions are limited to lxvd2x, lxvw4x, lxvb16x and
lxvh8x.

When an instruction triggers the HMI, all threads in the core will be sent
to the HMI handler, not just the one running the vector load. In general,
these spurious HMIs are detected by the emulation code and we just return
back to the running process. Unfortunately, if a spurious interrupt occurs
on a vector load that's to normal memory, we have no way to detect that
it's spurious (unless we walk the page tables, which is very expensive). In
this case we emulate the load, but we need to do so using a vector load
itself to ensure 128-bit atomicity is preserved.

Some additional debugfs emulated instruction counters are also added.

Signed-off-by: Michael Neuling
Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/include/asm/emulated_ops.h |   4 +
 arch/powerpc/include/asm/paca.h         |   1 +
 arch/powerpc/include/asm/uaccess.h      |  17 +++
 arch/powerpc/kernel/exceptions-64s.S    |  16 ++-
 arch/powerpc/kernel/mce.c               |  30 -
 arch/powerpc/kernel/traps.c             | 201 
 arch/powerpc/platforms/powernv/smp.c    |   7 ++
 7 files changed, 271 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/emulated_ops.h b/arch/powerpc/include/asm/emulated_ops.h
index f00e10e2a3..651e135449 100644
--- a/arch/powerpc/include/asm/emulated_ops.h
+++ b/arch/powerpc/include/asm/emulated_ops.h
@@ -55,6 +55,10 @@ extern struct ppc_emulated {
 	struct ppc_emulated_entry mfdscr;
 	struct ppc_emulated_entry mtdscr;
 	struct ppc_emulated_entry lq_stq;
+	struct ppc_emulated_entry lxvw4x;
+	struct ppc_emulated_entry lxvh8x;
+	struct ppc_emulated_entry lxvd2x;
+	struct ppc_emulated_entry lxvb16x;
 #endif
 } ppc_emulated;

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 04b60af027..1e06310ccc 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -210,6 +210,7 @@ struct paca_struct {
 	 */
 	u16 in_mce;
 	u8 hmi_event_available;		/* HMI event is available */
+	u8 hmi_p9_special_emu;		/* HMI P9 special emulation */
 #endif
 
 	/* Stuff for accurate time accounting */

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 9c0e60ca16..e34f15e727 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -173,6 +173,23 @@ do {								\
 
 extern long __get_user_bad(void);
 
+/*
+ * This does an atomic 128 byte aligned load from userspace.
+ * Up to the caller to do enable_kernel_vmx() before calling!
+ */
+#define __get_user_atomic_128_aligned(kaddr, uaddr, err)	\
+	__asm__ __volatile__(					\
+		"1:	lvx  0,0,%1	# get user\n"		\
+		"	stvx 0,0,%2	# put kernel\n"		\
+		"2:\n"						\
+		".section .fixup,\"ax\"\n"			\
+		"3:	li %0,%3\n"				\
+		"	b 2b\n"					\
+		".previous\n"					\
+		EX_TABLE(1b, 3b)				\
+		: "=r" (err)					\
+		: "b" (uaddr), "b" (kaddr), "i" (-EFAULT), "0" (err))
+
 #define __get_user_asm(x, addr, err, op)		\
 	__asm__ __volatile__(				\
 		"1:	"op" %1,0(%2)	# get_user\n"	\

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 48da0f5d2f..8c32dd4cac 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1010,6 +1010,8 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
 	EXCEPTION_PROLOG_COMMON_3(0xe60)
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	BRANCH_LINK_TO_FAR(hmi_exception_realmode) /* Function call ABI */
+	cmpdi	cr0,r3,0
+
 	/* Windup the stack. */
 	/* Move original HSRR0 and HSRR1 into the respective regs */
 	ld	r9,_MSR(r1)
@@ -1026,10 +1028,15 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
 	REST_8GPRS(2, r1)
 	REST_GPR(10, r1)
 	ld	r11,_CCR(r1)
+	REST_2GPRS(12, r1)
+	bne	1f
 	mtcr	r11
 	REST_GPR(11, r1)
-	REST_2GPRS(12, r1)
-	/* restore original r1. */
+	ld	r1,GPR1(r1)
+	hrfid
+
+1:	mtcr	r11
+	REST_GPR(11, r1)
 	ld	r1,GPR1(r1)
 
 	/*
@@ -1042,8 +1049,9 @@ hmi_exception_after_realmode:
 	EXCEPTION_P
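A hypothetical caller sketch for __get_user_atomic_128_aligned(), following
the comment's requirement that VMX be enabled around the copy.
enable_kernel_vmx() is the name the comment uses; the real call site lives
in the traps.c emulation code, which is not quoted here:

	static int fetch_user_vector128(void *kbuf, const void __user *uaddr)
	{
		int err = 0;

		preempt_disable();
		enable_kernel_vmx();	/* per the comment in uaccess.h above */
		__get_user_atomic_128_aligned(kbuf, uaddr, err);
		preempt_enable();

		return err;		/* 0 on success, -EFAULT on fault */
	}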
[PATCH 2/2] powerpc: Handle MCE on POWER9 with only DSISR bit 33 set
On POWER9 DD2.1 and below, it's possible to get a Machine Check Exception
(MCE) where only DSISR bit 33 is set. This will result in the Linux MCE
handler seeing an unknown event, which causes Linux to crash.

We change this by detecting unknown events in the MCE handler and marking
them as handled so that we no longer crash. We do this only on chip
revisions known to have this problem.

Signed-off-by: Michael Neuling
---
 arch/powerpc/kernel/mce_power.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca198e0..72ec667136 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -595,6 +595,7 @@ static long mce_handle_error(struct pt_regs *regs,
 	uint64_t addr;
 	uint64_t srr1 = regs->msr;
 	long handled;
+	unsigned long pvr;
 
 	if (SRR1_MC_LOADSTORE(srr1))
 		handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
@@ -604,6 +605,20 @@ static long mce_handle_error(struct pt_regs *regs,
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
 
+	/*
+	 * On POWER9 DD2.1 and below, it's possible to get machine
+	 * check where only DSISR bit 33 is set. This will result in
+	 * the MCE handler seeing an unknown event and us crashing.
+	 * Change this to mark as handled on these revisions.
+	 */
+	pvr = mfspr(SPRN_PVR);
+	if (((PVR_VER(pvr) == PVR_POWER9) &&
+	     (PVR_CFG(pvr) == 2) &&
+	     (PVR_MIN(pvr) <= 1)) || cpu_has_feature(CPU_FTR_POWER9_DD1))
+		/* DD2.1 and below */
+		if (mce_err.error_type == MCE_ERROR_TYPE_UNKNOWN)
+			handled = 1;
+
 	save_mce_event(regs, handled, &mce_err, regs->nip, addr);
 
 	return handled;
-- 
2.11.0
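The revision test in isolation, as a sketch. PVR_VER/PVR_CFG/PVR_MIN are
the kernel's PVR field accessors that the hunk above relies on:

	/* Sketch: true on POWER9 DD2.0/DD2.1 and on DD1 parts. */
	static bool p9_dd2_1_or_below(void)
	{
		unsigned long pvr = mfspr(SPRN_PVR);

		return (PVR_VER(pvr) == PVR_POWER9 &&
			PVR_CFG(pvr) == 2 &&		/* DD2.x */
			PVR_MIN(pvr) <= 1) ||		/* minor rev 0 or 1 */
		       cpu_has_feature(CPU_FTR_POWER9_DD1);
	}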
[PATCH 3/3] powerpc/5200: dts: digsy_mtc.dts: fix rv3029 compatible
The proper compatible for rv3029 is microcrystal,rv3029.

Signed-off-by: Alexandre Belloni
---
 arch/powerpc/boot/dts/digsy_mtc.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/digsy_mtc.dts b/arch/powerpc/boot/dts/digsy_mtc.dts
index c280e75c86bf..c3922fc03e0b 100644
--- a/arch/powerpc/boot/dts/digsy_mtc.dts
+++ b/arch/powerpc/boot/dts/digsy_mtc.dts
@@ -78,7 +78,7 @@
 		};
 
 		rtc@56 {
-			compatible = "mc,rv3029c2";
+			compatible = "microcrystal,rv3029";
 			reg = <0x56>;
 		};
-- 
2.14.1
[PATCH 2/3] ARM: dts: at91: usb_a9g20: fix rtc node
The rv3029 compatible is missing its vendor string, add it. Also fix the
node name to be a proper generic name.

Signed-off-by: Alexandre Belloni
---
 arch/arm/boot/dts/usb_a9g20_common.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/usb_a9g20_common.dtsi b/arch/arm/boot/dts/usb_a9g20_common.dtsi
index 088c2c3685ab..81c3fe0465d9 100644
--- a/arch/arm/boot/dts/usb_a9g20_common.dtsi
+++ b/arch/arm/boot/dts/usb_a9g20_common.dtsi
@@ -20,8 +20,8 @@
 	};
 
 	i2c-gpio-0 {
-		rv3029c2@56 {
-			compatible = "rv3029c2";
+		rtc@56 {
+			compatible = "microcrystal,rv3029";
 			reg = <0x56>;
 		};
 	};
-- 
2.14.1
[PATCH 1/3] RTC: rv3029: fix vendor string
The vendor string for Microcrystal is microcrystal.

Signed-off-by: Alexandre Belloni
---
 Documentation/devicetree/bindings/trivial-devices.txt | 2 +-
 drivers/rtc/rtc-rv3029c2.c                            | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/trivial-devices.txt b/Documentation/devicetree/bindings/trivial-devices.txt
index af284fbd4d23..aae37352c574 100644
--- a/Documentation/devicetree/bindings/trivial-devices.txt
+++ b/Documentation/devicetree/bindings/trivial-devices.txt
@@ -72,7 +72,6 @@ isil,isl29030	Intersil ISL29030 Ambient Light and Proximity Sensor
 maxim,ds1050		5 Bit Programmable, Pulse-Width Modulator
 maxim,max1237		Low-Power, 4-/12-Channel, 2-Wire Serial, 12-Bit ADCs
 maxim,max6625		9-Bit/12-Bit Temperature Sensors with I²C-Compatible Serial Interface
-mc,rv3029c2		Real Time Clock Module with I2C-Bus
 mcube,mc3230		mCube 3-axis 8-bit digital accelerometer
 memsic,mxc6225		MEMSIC 2-axis 8-bit digital accelerometer
 microchip,mcp4531-502	Microchip 7-bit Single I2C Digital Potentiometer (5k)
@@ -141,6 +140,7 @@ microchip,mcp4662-503	Microchip 8-bit Dual I2C Digital Potentiometer with NV Mem
 microchip,mcp4662-104	Microchip 8-bit Dual I2C Digital Potentiometer with NV Memory (100k)
 microchip,tc654		PWM Fan Speed Controller With Fan Fault Detection
 microchip,tc655		PWM Fan Speed Controller With Fan Fault Detection
+microcrystal,rv3029	Real Time Clock Module with I2C-Bus
 miramems,da226		MiraMEMS DA226 2-axis 14-bit digital accelerometer
 miramems,da280		MiraMEMS DA280 3-axis 14-bit digital accelerometer
 miramems,da311		MiraMEMS DA311 3-axis 12-bit digital accelerometer
diff --git a/drivers/rtc/rtc-rv3029c2.c b/drivers/rtc/rtc-rv3029c2.c
index aa09771de04f..cfe3aece51d1 100644
--- a/drivers/rtc/rtc-rv3029c2.c
+++ b/drivers/rtc/rtc-rv3029c2.c
@@ -876,6 +876,8 @@ static const struct i2c_device_id rv3029_id[] = {
 MODULE_DEVICE_TABLE(i2c, rv3029_id);
 
 static const struct of_device_id rv3029_of_match[] = {
+	{ .compatible = "microcrystal,rv3029" },
+	/* Backward compatibility only, do not use compatibles below: */
 	{ .compatible = "rv3029" },
 	{ .compatible = "rv3029c2" },
 	{ .compatible = "mc,rv3029c2" },
-- 
2.14.1
Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
Hi Mark,

Thank you for looking at this. We can't do this because the page table is
not set up until cpu_replace_ttbr1() is called, so we can't do memset() on
this memory until then.

Pasha
Re: [PATCH v2] powerpc/tm: Flush TM only if CPU has TM feature
On Wed, 2017-09-13 at 22:13 -0400, Gustavo Romero wrote:
> Commit cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
> added code to access TM SPRs in flush_tmregs_to_thread(). However
> flush_tmregs_to_thread() does not check if the TM feature is available
> on the CPU before trying to access TM SPRs in order to copy live state
> to thread structures. flush_tmregs_to_thread() is indeed guarded by
> CONFIG_PPC_TRANSACTIONAL_MEM, but it might be the case that the kernel
> was compiled with CONFIG_PPC_TRANSACTIONAL_MEM enabled and ran on a
> CPU without the TM feature available, resulting in the execution of
> TM instructions that are treated by the CPU as illegal instructions.
>
> The fix is just to add a proper check in flush_tmregs_to_thread()
> that the CPU has the TM feature before accessing any TM-specific
> resource, returning immediately if TM is not available on the CPU.
> Adding that check in flush_tmregs_to_thread() instead of in the places
> where it is called, like vsr_get() and vsr_set(), is better because it
> avoids the same problem cropping up elsewhere.
>
> Cc: sta...@vger.kernel.org # v4.13+
> Fixes: cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
> Signed-off-by: Gustavo Romero

Keeping in mind that I reviewed cd63f3c, I'm feeling a bit sheepish having
missed this.

Reviewed-by: Cyril Bur

> ---
>  arch/powerpc/kernel/ptrace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 07cd22e..f52ad5b 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -131,7 +131,7 @@ static void flush_tmregs_to_thread(struct task_struct *tsk)
>  	 * in the appropriate thread structures from live.
>  	 */
>
> -	if (tsk != current)
> +	if ((!cpu_has_feature(CPU_FTR_TM)) || (tsk != current))
>  		return;
>
>  	if (MSR_TM_SUSPENDED(mfmsr())) {
Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
On Thu, Sep 14, 2017 at 06:35:16PM -0400, Pavel Tatashin wrote:
> To optimize the performance of struct page initialization,
> vmemmap_populate() will no longer zero memory.
>
> We must explicitly zero the memory that is allocated by vmemmap_populate()
> for kasan, as this memory does not go through struct page initialization
> path.
>
> Signed-off-by: Pavel Tatashin
> Reviewed-by: Steven Sistare
> Reviewed-by: Daniel Jordan
> Reviewed-by: Bob Picco
> ---
>  arch/arm64/mm/kasan_init.c | 42 ++
>  1 file changed, 42 insertions(+)
>
> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
> index 81f03959a4ab..e78a9ecbb687 100644
> --- a/arch/arm64/mm/kasan_init.c
> +++ b/arch/arm64/mm/kasan_init.c
> @@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
>  		set_pgd(pgd_offset_k(start), __pgd(0));
>  }
>
> +/*
> + * Memory that was allocated by vmemmap_populate is not zeroed, so we must
> + * zero it here explicitly.
> + */
> +static void
> +zero_vmemmap_populated_memory(void)
> +{
> +	struct memblock_region *reg;
> +	u64 start, end;
> +
> +	for_each_memblock(memory, reg) {
> +		start = __phys_to_virt(reg->base);
> +		end = __phys_to_virt(reg->base + reg->size);
> +
> +		if (start >= end)
> +			break;
> +
> +		start = (u64)kasan_mem_to_shadow((void *)start);
> +		end = (u64)kasan_mem_to_shadow((void *)end);
> +
> +		/* Round to the start and end of the mapped pages */
> +		start = round_down(start, SWAPPER_BLOCK_SIZE);
> +		end = round_up(end, SWAPPER_BLOCK_SIZE);
> +		memset((void *)start, 0, end - start);
> +	}
> +
> +	start = (u64)kasan_mem_to_shadow(_text);
> +	end = (u64)kasan_mem_to_shadow(_end);
> +
> +	/* Round to the start and end of the mapped pages */
> +	start = round_down(start, SWAPPER_BLOCK_SIZE);
> +	end = round_up(end, SWAPPER_BLOCK_SIZE);
> +	memset((void *)start, 0, end - start);
> +}

I really don't see the need to duplicate the existing logic to iterate over
memblocks, calculate the addresses, etc.

Why can't we just have a zeroing wrapper? e.g. something like the below.

I really don't see why we couldn't have a generic function in core code to
do this, even if vmemmap_populate() doesn't.

Thanks,
Mark.

---->8----
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f0395..698d065 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,17 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+void kasan_populate_shadow(unsigned long shadow_start, unsigned long shadow_end,
+			   int nid)
+{
+	shadow_start = round_down(shadow_start, SWAPPER_BLOCK_SIZE);
+	shadow_end = round_up(shadow_end, SWAPPER_BLOCK_SIZE);
+
+	vmemmap_populate(shadow_start, shadow_end, nid);
+
+	memset((void *)shadow_start, 0, shadow_end - shadow_start);
+}
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
@@ -161,8 +172,8 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_populate_shadow(kimg_shadow_start, kimg_shadow_end,
+			      pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
 	 * vmemmap_populate() has populated the shadow region that covers the
@@ -191,9 +202,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				 (unsigned long)kasan_mem_to_shadow(end),
-				 pfn_to_nid(virt_to_pfn(start)));
+		kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(start),
+				      (unsigned long)kasan_mem_to_shadow(end),
+				      pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
On Thu, Sep 14, 2017 at 11:56:25PM +0200, Greg Kurz wrote:
> The following program causes a kernel oops:
>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/ioctl.h>
> #include <linux/kvm.h>
>
> main()
> {
>     int fd = open("/dev/kvm", O_RDWR);
>     ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
> }
>
> This happens because when using the global KVM fd with
> KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
> called with a NULL kvm argument, which gets dereferenced
> in is_kvmppc_hv_enabled(). Spotted while reading the code.
>
> Let's use the hv_enabled fallback variable, like everywhere
> else in this function.
>
> Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Greg Kurz

I don't think this is right. I'm pretty sure you want to fall back to
hv_enabled *only when* kvm is NULL. Otherwise if you have a PR guest on an
HV capable machine, this will give the wrong answer, when called for that
specific VM.

> ---
>  arch/powerpc/kvm/powerpc.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 3480faaf1ef8..ee279c7f4802 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		break;
>  #endif
>  	case KVM_CAP_PPC_HTM:
> -		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
> -			is_kvmppc_hv_enabled(kvm);
> +		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
>  		break;
>  	default:
>  		r = 0;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
The following program causes a kernel oops:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

main()
{
    int fd = open("/dev/kvm", O_RDWR);
    ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
}

This happens because when using the global KVM fd with KVM_CHECK_EXTENSION,
kvm_vm_ioctl_check_extension() gets called with a NULL kvm argument, which
gets dereferenced in is_kvmppc_hv_enabled(). Spotted while reading the code.

Let's use the hv_enabled fallback variable, like everywhere else in this
function.

Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
Cc: sta...@vger.kernel.org # v4.7+
Signed-off-by: Greg Kurz
---
 arch/powerpc/kvm/powerpc.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3480faaf1ef8..ee279c7f4802 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_PPC_HTM:
-		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
-			is_kvmppc_hv_enabled(kvm);
+		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
 		break;
 	default:
 		r = 0;
[PATCH v8 00/11] complete deferred page initialization
Copy paste error, changing the subject for the header to v8 from v7.

On 09/14/2017 06:35 PM, Pavel Tatashin wrote:

Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot
- new patch: mm: zero reserved and unavailable struct pages
- Removed patch x86/mm: reserve only exiting low pages, as it is not
  needed anymore because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found a few more issues, which I fixed in
  this iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as suggested
  by Michal Hocko
- Added code to handle issues related to accessing struct page memory
  before they are initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
  * Split changes to platforms into 4 patches
  * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of keeping
  memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only. Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:

                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%

==
SPARC M6 3600 MHz with 15T of memory

                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%

==
Raw dmesg output with timestamps:
x86 base no deferred:   https://hastebin.com/ofunepurit.scala
x86 base deferred:      https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:    https://hastebin.com/pegocohevo.scala
x86 fix deferred:       https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred:    https://hastebin.com/fariqimiyu.go
sparc fix no deferred:  https://hastebin.com/muhegoheyi.go
sparc fix deferred:     https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/mm/kasan_init.c          |  42 
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c             |  31 +++---
 arch/x86/mm/init_64.c               |   9 +-
 arch/x86/mm/kasan_init_64.c         |  66 
 include/linux/bootmem.h             |  27 +
 include/linux/memblock.h            |  16 +++
 include/linux/mm.h                  |  26 +
 mm/memblock.c                       |  60 +--
 mm/page_alloc.c                     | 207 ++
[PATCH v7 00/11] complete deferred page initialization
Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot
- new patch: mm: zero reserved and unavailable struct pages
- Removed patch x86/mm: reserve only exiting low pages, as it is not
  needed anymore because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found a few more issues, which I fixed in
  this iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as suggested
  by Michal Hocko
- Added code to handle issues related to accessing struct page memory
  before they are initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
  * Split changes to platforms into 4 patches
  * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of keeping
  memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only. Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:

                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%

==
SPARC M6 3600 MHz with 15T of memory

                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%

==
Raw dmesg output with timestamps:
x86 base no deferred:   https://hastebin.com/ofunepurit.scala
x86 base deferred:      https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:    https://hastebin.com/pegocohevo.scala
x86 fix deferred:       https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred:    https://hastebin.com/fariqimiyu.go
sparc fix no deferred:  https://hastebin.com/muhegoheyi.go
sparc fix deferred:     https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/mm/kasan_init.c          |  42 
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c             |  31 +++---
 arch/x86/mm/init_64.c               |   9 +-
 arch/x86/mm/kasan_init_64.c         |  66 
 include/linux/bootmem.h             |  27 +
 include/linux/memblock.h            |  16 +++
 include/linux/mm.h                  |  26 +
 mm/memblock.c                       |  60 +--
 mm/page_alloc.c                     | 207 
 mm/sparse-vmemmap.c                 |  15 ++-
 mm/sparse.c                         |   6 +-
 12 files changed, 406 insert
[PATCH v8 02/11] sparc64/mm: setting fields in deferred pages
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initializing:

mem_init() {
	register_page_bootmem_info();
	free_all_bootmem();
	...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
  register_page_bootmem_info
    register_page_bootmem_info_node
      get_page_bootmem
        .. setting fields here ..
        such as: page->freelist = (void *)type;

free_all_bootmem()
  free_low_memory_core_early()
    for_each_reserved_mem_region()
      reserve_bootmem_region()
        init_reserved_page() <- Only if this is deferred reserved page
          __init_single_pfn()
            __init_single_page()
              memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem as memory is zeroed. But, if flag asserts are
changed we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/mm/init_64.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index b2ba410b26f4..078f1352736e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2539,9 +2539,15 @@ void __init mem_init(void)
 {
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-	register_page_bootmem_info();
 	free_all_bootmem();
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/*
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
-- 
2.14.1
[PATCH v8 07/11] sparc64: optimized struct page zeroing
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch()
statement.

SPARC-M6 with 15T of memory, single thread performance:

                        BASE            FIX             OPTIMIZED_FIX
bootmem_init            28.440467985s   2.305674818s    2.305161615s
free_area_init_nodes    202.845901673s  225.343084508s  172.556506560s
Total                   231.286369658s  227.648759326s  174.861668175s

BASE:          current linux
FIX:           This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea is that the compiler optimizes out the switch() statement and
+ * only leaves clrx instructions
+ */
+#define	mm_zero_struct_page(pp) do {					\
+	unsigned long *_pp = (void *)(pp);				\
+									\
+	/* Check that struct page is either 64, 72, or 80 bytes */	\
+	BUILD_BUG_ON(sizeof(struct page) & 7);				\
+	BUILD_BUG_ON(sizeof(struct page) < 64);				\
+	BUILD_BUG_ON(sizeof(struct page) > 80);				\
+									\
+	switch (sizeof(struct page)) {					\
+	case 80:							\
+		_pp[9] = 0;	/* fallthrough */			\
+	case 72:							\
+		_pp[8] = 0;	/* fallthrough */			\
+	default:							\
+		_pp[7] = 0;						\
+		_pp[6] = 0;						\
+		_pp[5] = 0;						\
+		_pp[4] = 0;						\
+		_pp[3] = 0;						\
+		_pp[2] = 0;						\
+		_pp[1] = 0;						\
+		_pp[0] = 0;						\
+	}								\
+} while (0)
+
 /* PFNs are real physical page numbers. However, mem_map only begins to record
  * per-page information starting at pfn_base. This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.1
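For comparison, an architecture that does not provide its own
mm_zero_struct_page() would fall back to a plain memset(); a minimal
sketch of what such a generic fallback is assumed to look like elsewhere
in this series:

	/* Sketch of the assumed generic fallback: arches without an
	 * optimized override zero the struct page with memset(). */
	#ifndef mm_zero_struct_page
	#define mm_zero_struct_page(pp) \
		((void)memset((pp), 0, sizeof(struct page)))
	#endif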
[PATCH v8 09/11] x86/kasan: explicitly zero kasan shadow memory
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 arch/x86/mm/kasan_init_64.c | 66 +
 1 file changed, 66 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..cc0399032673 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -84,6 +84,66 @@ static struct notifier_block kasan_die_notifier = {
 };
 #endif
 
+/*
+ * x86 variant of vmemmap_populate() uses either PMD_SIZE pages or base pages
+ * to map allocated memory. This routine determines the page size for the given
+ * address from vmemmap.
+ */
+static u64 get_vmemmap_pgsz(u64 addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(addr);
+	BUG_ON(pgd_none(*pgd) || pgd_large(*pgd));
+
+	p4d = p4d_offset(pgd, addr);
+	BUG_ON(p4d_none(*p4d) || p4d_large(*p4d));
+
+	pud = pud_offset(p4d, addr);
+	BUG_ON(pud_none(*pud) || pud_large(*pud));
+
+	pmd = pmd_offset(pud, addr);
+	BUG_ON(pmd_none(*pmd));
+
+	if (pmd_large(*pmd))
+		return PMD_SIZE;
+	return PAGE_SIZE;
+}
+
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	u64 i, start, end;
+
+	for (i = 0; i < E820_MAX_ENTRIES && pfn_mapped[i].end; i++) {
+		void *kaddr_start = pfn_to_kaddr(pfn_mapped[i].start);
+		void *kaddr_end = pfn_to_kaddr(pfn_mapped[i].end);
+
+		start = (u64)kasan_mem_to_shadow(kaddr_start);
+		end = (u64)kasan_mem_to_shadow(kaddr_end);
+
+		/* Round to the start and end of the mapped pages */
+		start = rounddown(start, get_vmemmap_pgsz(start));
+		end = roundup(end, get_vmemmap_pgsz(start));
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_stext);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round to the start and end of the mapped pages */
+	start = rounddown(start, get_vmemmap_pgsz(start));
+	end = roundup(end, get_vmemmap_pgsz(start));
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_early_init(void)
 {
 	int i;
@@ -146,6 +206,12 @@ void __init kasan_init(void)
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/*
 	 * kasan_zero_page has been used as early shadow memory, thus it may
 	 * contain some garbage. Now we can clear and write protect it, since
-- 
2.14.1
[PATCH v8 05/11] mm: defining memblock_virt_alloc_try_nid_raw
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
    - Does not zero the allocated memory
    - Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify
that memory that was allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: Michal Hocko
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c           | 60 +++++
 mm/page_alloc.c         | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
 					    NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size, phys_addr_t align)
+{
+	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+					    BOOTMEM_ALLOC_ACCESSIBLE,
+					    NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
 	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size, phys_addr_t align)
+{
+	if (!align)
+		align = SMP_CACHE_BYTES;
+	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
 					  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+				min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t size, phys_addr_t align,
 			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated mem
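A hedged usage sketch of the new raw interface: a hypothetical client that
writes every field itself can skip the redundant zeroing. The struct entry
type, init_entry() helper and nentries count are made up for illustration;
memblock_virt_alloc_raw() is the wrapper added by this patch:

	struct entry *table;
	unsigned long i;

	/* raw variant: no memset of the block, no panic on failure */
	table = memblock_virt_alloc_raw(nentries * sizeof(*table),
					SMP_CACHE_BYTES);
	if (!table)
		return -ENOMEM;

	for (i = 0; i < nentries; i++)
		init_entry(&table[i]);	/* initializes every field itself */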
[PATCH v8 11/11] mm: stop zeroing memory during allocation in vmemmap
vmemmap_alloc_block() will no longer zero the block, so zero memory at its
call sites for everything except struct pages. Struct page memory is zero'd
by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So, we
will get the performance improvement by zeroing the memory in parallel when
struct pages are zeroed.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 include/linux/mm.h  | 11 +++
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c         |  6 +++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7bba4ce79ba..25848764570f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.1
[PATCH v8 03/11] mm: deferred_init_memmap improvements
This patch fixes two issues in deferred_init_memmap

=====
In deferred_init_memmap(), where all deferred struct pages are initialized,
we have a check like this:

	if (page->flags) {
		VM_BUG_ON(page_zone(page) != zone);
		goto free_range;
	}

This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed, and
the only way flags are not zero is if they went through __init_single_page()
before. But, once we change the current behavior and won't zero the memory
in the memblock allocator, we cannot trust anything inside "struct page"es
until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

=====
This patch fixes another existing issue on systems that have holes in
zones, i.e. when CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

	if (!pfn_valid_within(pfn))
		goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with these
kinds of holes, linux would get memory corruptions. I have fixed this issue
by defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 mm/page_alloc.c | 161 +++-
 1 file changed, 78 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..d132c801d2c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-					unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+				       unsigned long nr_pages)
 {
-	int i;
+	struct page *page;
+	unsigned long i;
 
-	if (!page)
+	if (!nr_pages)
 		return;
 
+	page = pfn_to_page(pfn);
+
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,82 @@ static inline void __init pgdat_init_report_one_done(void)
 	complete(&pgdat_init_all_done_comp);
 }
 
+#define DEFERRED_FREE(nr_free, free_base_pfn, page)			\
+({									\
+	unsigned long nr = (nr_free);					\
+									\
+	deferred_free_range((free_base_pfn), (nr));			\
+	(free_base_pfn) = 0;						\
+	(nr_free) = 0;							\
+	page = NULL;							\
+	nr;								\
+})
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+					 unsigned long end_pfn)
+{
+	struct mminit_pfnnid_cache nid_init_state = { };
+	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	unsigned long free_base_pfn = 0;
+	unsigned long nr_pages = 0;
+	unsigned long nr_free = 0;
+	struct page *page = NULL;
+
+	for (; pfn < end_pfn; pfn++) {
+		/*
+		 * First we check if pfn is valid on architectures where it is
+		 * possible to have holes within pageblock_nr_pages. On systems
+		 * where it is not possible, this function is optimized out.
+		 *
+		 * Then, we check if a current large page is valid by only
+		 * checking the validity of the head pfn.
+		 *
+		 * meminit_pfn_in_nid is checked on systems where pfns can
+		 * interleave within a node: a pfn is between start and end
+		 * of a node, but does not belong to this memory node.
+		 *
+		 * Finally, we minimize pfn page lookups and scheduler checks by
+		 * performing it only once every pageblock_nr_pages.
+		 */
+		if (!pfn_valid_within(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (page && (pfn & nr_pgmask))
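To illustrate the second bug, a sketch of the pre-fix loop shape (not the
old code verbatim): on a pfn hole the loop jumped ahead without resetting
'page', so a later iteration could advance a stale pointer:

	for (; pfn < end_pfn; pfn++) {
		if (!pfn_valid_within(pfn))
			goto free_range;	/* bug: 'page' is neither reset
						 * to NULL nor advanced, but
						 * 'pfn' keeps moving */
		...
		if (page && (pfn & (pageblock_nr_pages - 1)))
			page++;			/* stale 'page' + hole => this
						 * now points at the wrong
						 * struct page */
		else
			page = pfn_to_page(pfn);
		...
free_range:
		...
	}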
[PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 arch/arm64/mm/kasan_init.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..e78a9ecbb687 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	struct memblock_region *reg;
+	u64 start, end;
+
+	for_each_memblock(memory, reg) {
+		start = __phys_to_virt(reg->base);
+		end = __phys_to_virt(reg->base + reg->size);
+
+		if (start >= end)
+			break;
+
+		start = (u64)kasan_mem_to_shadow((void *)start);
+		end = (u64)kasan_mem_to_shadow((void *)end);
+
+		/* Round to the start and end of the mapped pages */
+		start = round_down(start, SWAPPER_BLOCK_SIZE);
+		end = round_up(end, SWAPPER_BLOCK_SIZE);
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_text);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round to the start and end of the mapped pages */
+	start = round_down(start, SWAPPER_BLOCK_SIZE);
+	end = round_up(end, SWAPPER_BLOCK_SIZE);
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
@@ -205,8 +240,15 @@ void __init kasan_init(void)
 			pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO));
 
 	memset(kasan_zero_page, 0, PAGE_SIZE);
+
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/* At this point kasan is fully initialized. Enable error messages */
 	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
-- 
2.14.1
[PATCH v8 04/11] sparc64: simplify vmemmap_populate
Remove duplicating code by using common functions vmemmap_pud_populate and
vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 078f1352736e..fc47afa518f5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2642,30 +2642,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.11.0
[PATCH v8 08/11] mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: not present in memblock.memory (because not backed by physical pages), but present in memblock.reserved. Such memory has backing struct pages, but they are not initialized by going through __init_single_page(). In some cases these struct pages are accessed even if they do not contain any data. One example: page_to_pfn() might access page->flags if this is where section information is stored (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS). Since struct pages are zeroed in __init_single_page(), and not at allocation time, we must zero such struct pages explicitly. The patch adds a new memblock iterator: for_each_resv_unavail_range(i, p_start, p_end) which iterates through the reserved && !memory ranges, and we zero those struct pages explicitly by calling mm_zero_struct_page(). Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco --- include/linux/memblock.h | 16 include/linux/mm.h | 6 ++ mm/page_alloc.c | 30 ++ 3 files changed, 52 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index bae11c7e7bf3..bdd4268f9323 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn); for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \ nid, flags, p_start, p_end, p_nid) +/** + * for_each_resv_unavail_range - iterate through reserved and unavailable memory + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over unavailable but reserved (reserved && !memory) areas of memblock. + * Available as soon as memblock is initialized. + * Note: because this memory does not belong to any physical node, flags and + * nid arguments do not make sense and thus are not exposed as arguments. + */ +#define for_each_resv_unavail_range(i, p_start, p_end) \ + for_each_mem_range(i, &memblock.reserved, &memblock.memory, \ + NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL) + static inline void memblock_set_region_flags(struct memblock_region *r, unsigned long flags) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 50b74d628243..a7bba4ce79ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2010,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn, struct mminit_pfnnid_cache *state); #endif +#ifdef CONFIG_HAVE_MEMBLOCK +void zero_resv_unavail(void); +#else +static inline void zero_resv_unavail(void) {} +#endif + extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4b630ee91430..1d38d391dffd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6202,6 +6202,34 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, free_area_init_core(pgdat); } +#ifdef CONFIG_HAVE_MEMBLOCK +/* + * Only struct pages that are backed by physical memory are zeroed and + * initialized by going through __init_single_page(). But, there are some + * struct pages which are reserved in memblock allocator and their fields + * may be accessed (for example page_to_pfn() on some configuration accesses + * flags). We must explicitly zero those struct pages. 
+ */ +void __paginginit zero_resv_unavail(void) +{ + phys_addr_t start, end; + unsigned long pfn; + u64 i, pgcnt; + + /* Loop through ranges that are reserved, but do not have reported +* physical memory backing. +*/ + pgcnt = 0; + for_each_resv_unavail_range(i, &start, &end) { + for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) { + mm_zero_struct_page(pfn_to_page(pfn)); + pgcnt++; + } + } + pr_info("Reserved but unavailable: %lld pages", pgcnt); +} +#endif /* CONFIG_HAVE_MEMBLOCK */ + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP #if MAX_NUMNODES > 1 @@ -6625,6 +6653,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) node_set_state(nid, N_MEMORY); check_for_memory(pgdat, nid); } + zero_resv_unavail(); } static int __init cmdline_parse_core(char *p, unsigned long *core) @@ -6788,6 +6817,7 @@ void __init free_area_init(unsigned long *zones_size) { free_area_init_node(0, zones_size, __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL); + zero_resv_unavail(); }
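The reserved && !memory walk above leans on for_each_mem_range(), which takes two memblock type lists and yields the pieces of the first list that are not covered by the second. A minimal standalone sketch of that set-difference idea over sorted, non-overlapping ranges (hypothetical types and callback, not kernel code):

struct range { unsigned long long start, end; };	/* [start, end) */

/* Emit each piece of an 'a' range that no 'b' range covers; both arrays
 * are sorted by start and internally non-overlapping. */
static void walk_a_minus_b(const struct range *a, int na,
			   const struct range *b, int nb,
			   void (*emit)(unsigned long long, unsigned long long))
{
	for (int i = 0; i < na; i++) {
		unsigned long long cur = a[i].start;

		for (int j = 0; j < nb && cur < a[i].end; j++) {
			if (b[j].end <= cur)
				continue;	/* b range entirely behind us */
			if (b[j].start >= a[i].end)
				break;		/* b range past this a range */
			if (b[j].start > cur)
				emit(cur, b[j].start);	/* uncovered gap */
			if (b[j].end > cur)
				cur = b[j].end;
		}
		if (cur < a[i].end)
			emit(cur, a[i].end);
	}
}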
[PATCH v8 06/11] mm: zero struct pages during initialization
Add struct page zeroing as a part of initialization of other fields in __init_single_page(). These single-thread performance numbers were collected on: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes): BASE FIX sparse_init 11.244671836s 0.007199623s zone_sizes_init 4.879775891s 8.355182299s -- Total 16.124447727s 8.362381922s sparse_init is where memory for struct pages is zeroed, and the zeroing part is moved later in this patch into __init_single_page(), which is called from zone_sizes_init(). Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco Acked-by: Michal Hocko --- include/linux/mm.h | 9 + mm/page_alloc.c| 1 + 2 files changed, 10 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index f8c10d336e42..50b74d628243 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly; #define mm_forbids_zeropage(X) (0) #endif +/* + * On some architectures it is expensive to call memset() for small sizes. + * Those architectures should provide their own implementation of "struct page" + * zeroing by defining this macro in <asm/pgtable.h>. + */ +#ifndef mm_zero_struct_page +#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page))) +#endif + /* * Default maximum number of active map areas, this limits the number of vmas * per mm struct. Users can overwrite this number by sysctl but there is a diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a8dbd405ed94..4b630ee91430 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone, static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { + mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); -- 2.14.1
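As a purely hypothetical illustration of the override hook above (this is not the sparc implementation, only the shape an arch-specific definition in <asm/pgtable.h> could take, assuming a 64-byte struct page):

/* Hypothetical arch override: zero struct page as eight 64-bit stores,
 * which can beat a small out-of-line memset() call on some CPUs. */
#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)

static inline void __mm_zero_struct_page(struct page *page)
{
	unsigned long *p = (unsigned long *)page;

	BUILD_BUG_ON(sizeof(struct page) != 64);
	p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
	p[4] = 0; p[5] = 0; p[6] = 0; p[7] = 0;
}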
[PATCH v8 01/11] x86/mm: setting fields in deferred pages
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT), flags and other fields in struct pages are never changed prior to first initializing struct pages by going through __init_single_page(). With the deferred struct page feature enabled, however, we set fields in register_page_bootmem_info that are subsequently clobbered right after in free_all_bootmem: mem_init() { register_page_bootmem_info(); free_all_bootmem(); ... } When register_page_bootmem_info() is called, only non-deferred struct pages have been initialized. But this function goes through some reserved pages which might be part of the deferred range, and thus are not yet initialized. mem_init register_page_bootmem_info register_page_bootmem_info_node get_page_bootmem .. setting fields here .. such as: page->freelist = (void *)type; free_all_bootmem() free_low_memory_core_early() for_each_reserved_mem_region() reserve_bootmem_region() init_reserved_page() <- Only if this is deferred reserved page __init_single_pfn() __init_single_page() memset(0) <-- Lose the set fields here We end up with an issue where, currently, we do not observe a problem because memory is explicitly zeroed. But if flag asserts are changed, we can start hitting issues. Also, because in this patch series we will stop zeroing struct page memory during allocation, we must make sure that struct pages are properly initialized prior to using them. The deferred-reserved pages are initialized in free_all_bootmem(). Therefore, the fix is to switch the above calls. Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco --- arch/x86/mm/init_64.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 048fbe8fc274..42b4b7a585c2 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1173,12 +1173,17 @@ void __init mem_init(void) /* clear_bss() already clear the empty_zero_page */ - register_page_bootmem_info(); - /* this will put all memory onto the freelists */ free_all_bootmem(); after_bootmem = 1; + /* Must be done after boot memory is put on freelist, because here we +* might set fields in deferred struct pages that have not yet been +* initialized, and free_all_bootmem() initializes all the reserved +* deferred pages for us. +*/ + register_page_bootmem_info(); + /* Register memory areas for /proc/kcore */ kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_OTHER); -- 2.14.1
Re: [v4 07/11] soc/fsl/qbman: Rework portal mapping calls for ARM/PPC
On 9/14/2017 10:00 AM, Catalin Marinas wrote: > On Thu, Aug 24, 2017 at 04:37:51PM -0400, Roy Pledge wrote: >> diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c >> index ff8998f..e31c843 100644 >> --- a/drivers/soc/fsl/qbman/bman.c >> +++ b/drivers/soc/fsl/qbman/bman.c >> @@ -154,7 +154,7 @@ struct bm_mc { >> }; >> >> struct bm_addr { >> -void __iomem *ce; /* cache-enabled */ >> +void *ce; /* cache-enabled */ >> void __iomem *ci; /* cache-inhibited */ >> }; > > You dropped __iomem from ce, which is fine since it is now set via > memremap. However, I haven't seen (at least not in this patch), a change > to bm_ce_in() which still uses __raw_readl(). > > (it may be worth checking this code with sparse, it may warn about this) Thanks, you're correct; I missed this. I will fix this (and the qman version) and run sparse. > >> diff --git a/drivers/soc/fsl/qbman/bman_portal.c >> b/drivers/soc/fsl/qbman/bman_portal.c >> index 39b39c8..bb03503 100644 >> --- a/drivers/soc/fsl/qbman/bman_portal.c >> +++ b/drivers/soc/fsl/qbman/bman_portal.c >> @@ -91,7 +91,6 @@ static int bman_portal_probe(struct platform_device *pdev) >> struct device_node *node = dev->of_node; >> struct bm_portal_config *pcfg; >> struct resource *addr_phys[2]; >> -void __iomem *va; >> int irq, cpu; >> >> pcfg = devm_kmalloc(dev, sizeof(*pcfg), GFP_KERNEL); >> @@ -123,23 +122,34 @@ static int bman_portal_probe(struct platform_device >> *pdev) >> } >> pcfg->irq = irq; >> >> -va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); >> -if (!va) { >> -dev_err(dev, "ioremap::CE failed\n"); >> +/* >> + * TODO: Ultimately we would like to use a cacheable/non-shareable >> + * (coherent) mapping for the portal on both architectures but that >> + * isn't currently available in the kernel. Because of HW differences >> + * PPC needs to be mapped cacheable while ARM SoCs will work with non >> + * cacheable mappings >> + */ > > This comment mentions "cacheable/non-shareable (coherent)". Was this > meant for ARM platforms? Because non-shareable is not coherent, nor is > this combination guaranteed to work with different CPUs and > interconnects. My wording is poor; I should have been clearer that non-shareable == non-coherent. I will fix this. We do understand that cacheable/non-shareable isn't supported on all CPU/interconnect combinations, but we have verified with ARM that for the CPU/interconnects we have integrated QBMan on, our use is OK. The note is here to try to explain why the mapping is different right now. Once we get the basic QBMan support integrated for ARM, we do plan to try to have patches integrated that enable the cacheable mapping, as it gives a significant performance boost. This is step 2; we understand the topic is complex and a little controversial, so I think treating it as an independent change will be easier than mixing it with the less interesting changes in this patchset. > >> +#ifdef CONFIG_PPC >> +/* PPC requires a cacheable/non-coherent mapping of the portal */ >> +pcfg->addr_virt_ce = memremap(addr_phys[0]->start, >> +resource_size(addr_phys[0]), MEMREMAP_WB); >> +#else >> +/* ARM can use a write combine mapping. */ >> +pcfg->addr_virt_ce = memremap(addr_phys[0]->start, >> +resource_size(addr_phys[0]), MEMREMAP_WC); >> +#endif > > Nitpick: you could define something like QBMAN_MAP_ATTR to be different > between PPC and the rest and just keep a single memremap() call. I will change this - it will be a little more compact. 
> > One may complain that "ce" is no longer "cache enabled" but I'm > personally fine to keep the same name for historical reasons. Cache Enabled is also how the 'data sheet' for the processor describes the region, and I think it is useful to keep it aligned so that anyone looking at the manual and the code can easily correlate the terminology. > >> diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h >> b/drivers/soc/fsl/qbman/dpaa_sys.h >> index 81a9a5e..0a1d573 100644 >> --- a/drivers/soc/fsl/qbman/dpaa_sys.h >> +++ b/drivers/soc/fsl/qbman/dpaa_sys.h >> @@ -51,12 +51,12 @@ >> >> static inline void dpaa_flush(void *p) >> { >> +/* >> + * Only PPC needs to flush the cache currently - on ARM the mapping >> + * is non cacheable >> + */ >> #ifdef CONFIG_PPC >> flush_dcache_range((unsigned long)p, (unsigned long)p+64); >> -#elif defined(CONFIG_ARM) >> -__cpuc_flush_dcache_area(p, 64); >> -#elif defined(CONFIG_ARM64) >> -__flush_dcache_area(p, 64); >> #endif >> } > > Dropping the private API cache maintenance is fine and the memory is WC > now for ARM (mapping to Normal NonCacheable). However, do you require > any barriers here? Normal NC doesn't guarantee any ordering. The barrier is done in the code where the command is f
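For reference, a minimal sketch of the more compact form being agreed on here (QBMAN_MAP_ATTR is the reviewer's suggested name, not an identifier from the posted patch):

/* Pick the portal mapping attribute once: PPC wants a cacheable (WB)
 * mapping, everything else uses write-combine (WC). */
#ifdef CONFIG_PPC
#define QBMAN_MAP_ATTR	MEMREMAP_WB
#else
#define QBMAN_MAP_ATTR	MEMREMAP_WC
#endif

	pcfg->addr_virt_ce = memremap(addr_phys[0]->start,
				      resource_size(addr_phys[0]),
				      QBMAN_MAP_ATTR);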
Re: [v4 05/11] soc/fsl/qbman: Drop L1_CACHE_BYTES compile time check
On 9/14/2017 9:49 AM, Catalin Marinas wrote: > On Thu, Aug 24, 2017 at 04:37:49PM -0400, Roy Pledge wrote: >> From: Claudiu Manoil >> >> Not relevant and arch dependent. Overkill for PPC. >> >> Signed-off-by: Claudiu Manoil >> Signed-off-by: Roy Pledge >> --- >> drivers/soc/fsl/qbman/dpaa_sys.h | 4 >> 1 file changed, 4 deletions(-) >> >> diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h >> b/drivers/soc/fsl/qbman/dpaa_sys.h >> index 2ce394a..f85c319 100644 >> --- a/drivers/soc/fsl/qbman/dpaa_sys.h >> +++ b/drivers/soc/fsl/qbman/dpaa_sys.h >> @@ -49,10 +49,6 @@ >> #define DPAA_PORTAL_CE 0 >> #define DPAA_PORTAL_CI 1 >> >> -#if (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64) >> -#error "Unsupported Cacheline Size" >> -#endif > > Maybe this check was for a reason on PPC as it uses WB memory mappings > for some of the qbman descriptors (which IIUC fit within a cacheline). > You could add a check for CONFIG_PPC if you think there is any chance of > this constant going higher. > No, the reason PPC needs WB (technically any cacheable mapping) is that the QBMan block on those parts will raise an error IRQ if it sees any transaction smaller than the cacheline size. We know that this cannot happen on PPC parts with QBMan when there is a cacheable mapping, because we also developed the interconnect for everything that has a QBMan block. We dropped the check for L1_CACHE_BYTES due to the value being set to 128 on ARM64 even on parts that have smaller caches. I don't think there is much to worry about here, as cacheline size isn't something SW controls in any case. If we produce a part with QBMan that has a larger cache granularity we will need to address that in other parts of the code as well. The check was in the code as a sanity check for PPC, but since the value isn't (in my opinion) meaningful on ARM, we can remove it to avoid problems.
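A minimal sketch of the compromise Catalin floats above, keeping the build-time check only where the cacheable portal mapping actually depends on it (assuming the old 32/64-byte constants still hold for all PPC parts with QBMan):

/* Only PPC maps portal descriptors cacheable, so only there does the
 * cacheline-size assumption matter at build time. */
#if defined(CONFIG_PPC) && (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64)
#error "Unsupported Cacheline Size"
#endif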
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Thu, Sep 14, 2017 at 10:54:08AM -0700, Ram Pai wrote: > On Thu, Sep 14, 2017 at 11:44:49AM +1000, Balbir Singh wrote: > > On Fri, 8 Sep 2017 15:44:44 -0700 > > Ram Pai wrote: > > > > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > > > in the 64K backed HPTE pages. This along with the earlier > > > patch will entirely free up the four bits from 64K PTE. > > > The bit numbers are big-endian as defined in the ISA3.0 > > > > > > This patch does the following change to 64K PTE backed > > > by 64K HPTE. > > > > > > H_PAGE_F_SECOND (S) which occupied bit 4 moves to the > > > second part of the pte to bit 60. > > > H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also > > > moves to the second part of the pte to bit 61, > > > 62, 63, 64 respectively > > > > > > Since bit 7 is now freed up, we move H_PAGE_BUSY (B) from > > > bit 9 to bit 7. > > > > > > The second part of the PTE will hold > > > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > > > NOTE: None of the bits in the secondary PTE were used > > > by 64k-HPTE backed PTE. > > > > > > Before the patch, the 64K HPTE backed 64k PTE format was > > > as follows > > > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > > : : : : : : : : : : :: > > > v v v v v v v v v v vv > > > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > > |x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > > | | | | | | | | | | | | |..| | | | | <- secondary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > > > After the patch, the 64k HPTE backed 64k PTE format is > > > as follows > > > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > > : : : : : : : : : : :: > > > v v v v v v v v v v vv > > > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > > |x|x|x| | | | |B |x| | |x|x||.|.|.|.| <- primary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > > | | | | | | | | | | | | |..|S|G|I|X| <- secondary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > > > The above PTE changes are applicable to hugetlb pages as well. > > > > > > The patch does the following code changes: > > > > > > a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to the 4k PTE > > > header since they are no longer needed by the 64k PTEs. > > > b) abstracts out __real_pte() and __rpte_to_hidx() so the > > > caller need not know the bit location of the slot. > > > c) moves the slot bits to the secondary pte. 
> > > > > > Reviewed-by: Aneesh Kumar K.V > > > Signed-off-by: Ram Pai > > > --- > > > arch/powerpc/include/asm/book3s/64/hash-4k.h |3 ++ > > > arch/powerpc/include/asm/book3s/64/hash-64k.h | 29 > > > +++- > > > arch/powerpc/include/asm/book3s/64/hash.h |3 -- > > > arch/powerpc/mm/hash64_64k.c | 23 --- > > > arch/powerpc/mm/hugetlbpage-hash64.c | 18 ++- > > > 5 files changed, 33 insertions(+), 43 deletions(-) > > > > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > index e66bfeb..dc153c6 100644 > > > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > @@ -16,6 +16,9 @@ > > > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > > > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > > > > > +#define H_PAGE_F_GIX_SHIFT 56 > > > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ > > > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > > > #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are > > > busy */ > > > > > > /* PTE flags to conserve for HPTE identification */ > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > index e038f1c..89ef5a9 100644 > > > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > @@ -12,7 +12,7 @@ > > > */ > > > #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ > > > #define H_PAGE_4K_PFN_RPAGE_RPN1 /* PFN is for a single 4k page */ > > > -#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are > > > busy */ > > > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are > > > busy */ > > > > > > /* > > > * We need to differentiate between explicit huge page and THP huge > > > @@ -21,8 +21,7 @@ > > > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN > > > > > > /* PTE flags to conserve for HPTE identification */ > >
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Thu, Sep 14, 2017 at 11:44:49AM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:44 -0700 > Ram Pai wrote: > > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > > in the 64K backed HPTE pages. This along with the earlier > > patch will entirely free up the four bits from 64K PTE. > > The bit numbers are big-endian as defined in the ISA3.0 > > > > This patch does the following change to 64K PTE backed > > by 64K HPTE. > > > > H_PAGE_F_SECOND (S) which occupied bit 4 moves to the > > second part of the pte to bit 60. > > H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also > > moves to the second part of the pte to bit 61, > > 62, 63, 64 respectively > > > > Since bit 7 is now freed up, we move H_PAGE_BUSY (B) from > > bit 9 to bit 7. > > > > The second part of the PTE will hold > > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > > NOTE: None of the bits in the secondary PTE were used > > by 64k-HPTE backed PTE. > > > > Before the patch, the 64K HPTE backed 64k PTE format was > > as follows > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > : : : : : : : : : : :: > > v v v v v v v v v v vv > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > |x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > | | | | | | | | | | | | |..| | | | | <- secondary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > After the patch, the 64k HPTE backed 64k PTE format is > > as follows > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > : : : : : : : : : : :: > > v v v v v v v v v v vv > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > |x|x|x| | | | |B |x| | |x|x||.|.|.|.| <- primary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > | | | | | | | | | | | | |..|S|G|I|X| <- secondary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > The above PTE changes are applicable to hugetlb pages as well. > > > > The patch does the following code changes: > > > > a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to the 4k PTE > > header since they are no longer needed by the 64k PTEs. > > b) abstracts out __real_pte() and __rpte_to_hidx() so the > > caller need not know the bit location of the slot. > > c) moves the slot bits to the secondary pte. 
> > > > Reviewed-by: Aneesh Kumar K.V > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/include/asm/book3s/64/hash-4k.h |3 ++ > > arch/powerpc/include/asm/book3s/64/hash-64k.h | 29 > > +++- > > arch/powerpc/include/asm/book3s/64/hash.h |3 -- > > arch/powerpc/mm/hash64_64k.c | 23 --- > > arch/powerpc/mm/hugetlbpage-hash64.c | 18 ++- > > 5 files changed, 33 insertions(+), 43 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > index e66bfeb..dc153c6 100644 > > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > @@ -16,6 +16,9 @@ > > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > > > +#define H_PAGE_F_GIX_SHIFT 56 > > +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ > > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > > #define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are > > busy */ > > > > /* PTE flags to conserve for HPTE identification */ > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > index e038f1c..89ef5a9 100644 > > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > @@ -12,7 +12,7 @@ > > */ > > #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ > > #define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ > > -#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are > > busy */ > > +#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are > > busy */ > > > > /* > > * We need to differentiate between explicit huge page and THP huge > > @@ -21,8 +21,7 @@ > > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN > > > > /* PTE flags to conserve for HPTE identification */ > > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \ > > -H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO) > > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO) > > /* > > * we support 16 fragments per PT
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On Wed, Sep 13, 2017 at 06:51:25PM -0500, Rob Landley wrote: > From: Rob Landley > > Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move > /dev/console open after devtmpfs mount. > > Add workaround for Debian bug that was copied by Ubuntu. > > Signed-off-by: Rob Landley > --- > > v2 discussion: http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html > > drivers/base/Kconfig | 14 -- > fs/namespace.c | 14 ++ > init/main.c | 15 +-- > 3 files changed, 27 insertions(+), 16 deletions(-) Always run scripts/checkpatch.pl so you don't get grumpy emails from reviewers telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl...
Re: [PATCH 5/7] powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6
On Thu, Sep 14, 2017 at 11:48:34AM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:45 -0700 > Ram Pai wrote: > > > We need PTE bits 3, 4, 5, 6 and 57 to support protection-keys, > > because these are the bits we want to consolidate on across all > > configurations to support protection keys. > > > > Bits 3, 4, 5 and 6 are currently used on 4K-pte kernels. But bits 9 > > and 10 are available. Hence we use the two available bits and > > free up bits 5 and 6. We will still not be able to free up bits 3 > > and 4. In the absence of any other free bits, we will have to > > stay satisfied with what we have :-(. This means we will not > > be able to support 32 protection keys, but only 8. The bit > > numbers are big-endian as defined in the ISA3.0 > > > > Any chance that for 4k PTEs we can do slot searching for the PTE? > I guess that adds additional complexity Aneesh, I think, is working on moving the slot information out of the PTE. If that happens, we will have room to support more keys. RP
Re: [PATCH 7/7] powerpc: capture the PTE format changes in the dump pte report
On Thu, Sep 14, 2017 at 01:22:27PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:47 -0700 > Ram Pai wrote: > > > The H_PAGE_F_SECOND,H_PAGE_F_GIX are not in the 64K main-PTE. > > Capture these changes in the dump pte report. > > > > Reviewed-by: Aneesh Kumar K.V > > Signed-off-by: Ram Pai > > --- > > So we lose slot and secondary information for 64K PTE's with > this change? Yes. It was anyway not there for 4k-backed-64k ptes. Now it won't be there for any 64k ptes. RP
Re: Machine Check in P2010(e500v2)
On Sat, 2017-09-09 at 14:59 +0200, Joakim Tjernlund wrote: > On Sat, 2017-09-09 at 14:45 +0200, Joakim Tjernlund wrote: > > On Fri, 2017-09-08 at 22:27 +, Leo Li wrote: > > > > -Original Message- > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com] > > > > Sent: Friday, September 08, 2017 7:51 AM > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; York Sun > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > On Fri, 2017-09-08 at 11:54 +0200, Joakim Tjernlund wrote: > > > > > On Thu, 2017-09-07 at 18:54 +, Leo Li wrote: > > > > > > > -Original Message- > > > > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com] > > > > > > > Sent: Thursday, September 07, 2017 3:41 AM > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; > > > > > > > York Sun > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote: > > > > > > > > On Wed, 2017-09-06 at 21:13 +, Leo Li wrote: > > > > > > > > > > -Original Message- > > > > > > > > > > From: Joakim Tjernlund > > > > > > > > > > [mailto:joakim.tjernl...@infinera.com] > > > > > > > > > > Sent: Wednesday, September 06, 2017 3:54 PM > > > > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li > > > > > > > > > > ; York Sun > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > On Wed, 2017-09-06 at 20:28 +, Leo Li wrote: > > > > > > > > > > > > -Original Message- > > > > > > > > > > > > From: Joakim Tjernlund > > > > > > > > > > > > [mailto:joakim.tjernl...@infinera.com] > > > > > > > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM > > > > > > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li > > > > > > > > > > > > ; York Sun > > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2017-09-06 at 19:31 +, Leo Li wrote: > > > > > > > > > > > > > > -Original Message- > > > > > > > > > > > > > > From: York Sun > > > > > > > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM > > > > > > > > > > > > > > To: Joakim Tjernlund > > > > > > > > > > > > > > ; > > > > > > > > > > > > > > linuxppc- d...@lists.ozlabs.org; Leo Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote: > > > > > > > > > > > > > > > So after some debugging I found this bug: > > > > > > > > > > > > > > > @@ -996,7 +998,7 @@ int > > > > > > > > > > > > > > > fsl_pci_mcheck_exception(struct pt_regs > > > > > > > > > > > > > > > > > > > > *regs) > > > > > > > > > > > > > > > if (is_in_pci_mem_space(addr)) { > > > > > > > > > > > > > > > if (user_mode(regs)) { > > > > > > > > > > > > > > > pagefault_disable(); > > > > > > > > > > > > > > > - ret = get_user(regs->nip, > > > > > > > > > > > > > > > &inst); > > > > > > > > > > > > > > > + ret = get_user(inst, > > > > > > > > > > > > > > > + (__u32 __user *)regs->nip); > > > > > > > > > > > > > > > pagefault_enable(); > > > > > > > > > > > > > > > } else { > > > > > > > > > > > > > > > ret = > > > > > > > > > > > > > > > probe_kernel_address(regs->nip, inst); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, the kernel still locked up after fixing > > > > > > > > > > > > > > > that. 
> > > > > > > > > > > > > > > Now I wonder why this fixup is there in the first > > > > > > > > > > > > > > > place? > > > > > > > > > > > > > > > The routine will not really fixup the insn, just > > > > > > > > > > > > > > > return 0xffffffff for the failing read and then > > > > > > > > > > > > > > > advance the > > > > > > > > process NIP. > > > > > > > > > > > > > > > > > > > > > > > > > > You are right. The code here only gives 0xffffffff to > > > > > > > > > > > > > the load instructions and > > > > > > > > > > > > > > > > > > > > > > > > continues with the next instruction when the load > > > > > > > > > > > > instruction is causing the machine check. This will > > > > > > > > > > > > prevent a system lockup when reading from a PCI/RapidIO > > > > > > > > > > > > device > > > > > > > > whose link is down. > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know what the actual problem is in your case. > > > > > > > > > > > > > Maybe it is a write > > > > > > > > > > > > > > > > > > > > > > > > instruction instead of a read? Or the code is in an > > > > > > > > > > > > infinite loop > > > > waiting for > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > > > > valid > > > > >
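The bug being discussed is a get_user() argument swap: get_user(x, ptr) copies *ptr into x, so the original code loaded a user word into regs->nip instead of fetching the faulting instruction. A minimal sketch of the corrected fetch, matching the hunk quoted above:

u32 inst;
int ret;

pagefault_disable();
/* get_user(dest, src_ptr) reads *src_ptr into dest: fetch the
 * instruction at the faulting NIP from user space */
ret = get_user(inst, (__u32 __user *)regs->nip);
pagefault_enable();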
Re: [PATCH 01/25] powerpc: initial pkey plumbing
On Thu, Sep 14, 2017 at 01:32:05PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:49 -0700 > Ram Pai wrote: > > > Basic plumbing to initialize the pkey system. > > Nothing is enabled yet. A later patch will enable it > > once all the infrastructure is in place. > > > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/Kconfig | 16 +++ > > arch/powerpc/include/asm/mmu_context.h |5 +++ > > arch/powerpc/include/asm/pkeys.h | 45 > > > > arch/powerpc/kernel/setup_64.c |4 +++ > > arch/powerpc/mm/Makefile |1 + > > arch/powerpc/mm/hash_utils_64.c|1 + > > arch/powerpc/mm/pkeys.c| 33 +++ > > 7 files changed, 105 insertions(+), 0 deletions(-) > > create mode 100644 arch/powerpc/include/asm/pkeys.h > > create mode 100644 arch/powerpc/mm/pkeys.c > > > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > > index 9fc3c0b..a4cd210 100644 > > --- a/arch/powerpc/Kconfig > > +++ b/arch/powerpc/Kconfig > > @@ -864,6 +864,22 @@ config SECCOMP > > > > If unsure, say Y. Only embedded should say N here. > > > > +config PPC64_MEMORY_PROTECTION_KEYS > > + prompt "PowerPC Memory Protection Keys" > > + def_bool y > > + # Note: only available in 64-bit mode > > + depends on PPC64 > > This is not sufficient, right? You need PPC_BOOK3S_64 > at least for compile time? OK, I had not thought too deeply about this. Thanks for the input. > > > + select ARCH_USES_HIGH_VMA_FLAGS > > + . > > +void __init pkey_initialize(void) > > +{ > > + /* disable the pkey system till everything > > +* is in place. A patch further down the > > +* line will enable it. > > +*/ > > Comment style is broken > checkpatch.pl does not complain. So is it really a broken comment style, or does checkpatch.pl need to be fixed? RP
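A sketch of the Kconfig dependency fix Balbir is asking for (assuming the implementation is hash-MMU/Book3S-64 only, as the review implies; this is not the posted hunk):

config PPC64_MEMORY_PROTECTION_KEYS
	prompt "PowerPC Memory Protection Keys"
	def_bool y
	# pkeys are implemented for the Book3S 64 MMU only
	depends on PPC_BOOK3S_64
	select ARCH_USES_HIGH_VMA_FLAGS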
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Thu, Sep 14, 2017 at 02:38:07PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:50 -0700 > Ram Pai wrote: > > > powerpc needs an additional vma bit to support 32 keys. > > Until the additional vma bit lands in include/linux/mm.h, > > we have to define it in a powerpc-specific header file. > > This is needed to get pkeys working on power. > > > > Signed-off-by: Ram Pai > > --- > > "This" being an arch specific hack for the additional bit? Yes, an arch-specific hack. I am trying to get the arch-specific changes merged in parallel, along with these patches. I don't know which one will merge first. Regardless of which patch set lands first, I have organized the code such that nothing breaks. RP
Re: [PATCH] net/ethernet/freescale: fix warning for ucc_geth
Hi Christophe, On Thu, 2017-09-14 at 15:24 +0200, Christophe LEROY wrote: > Hi, > > On 14/09/2017 at 14:05, Valentin Longchamp wrote: > > Simple printk format warning for the ucc registers address. > > Did you test your patch with mpc83xx_defconfig? No, I only tested on an 85xx where I had another (similar, because the physical addresses are u64 and not u32) warning. My quick fix indeed did not take into account the different typedefs for phys_addr_t. I will try to come up with a v2 that covers this. Thanks for the feedback. Valentin > > I get a new warning with your patch: > >    CC drivers/net/ethernet/freescale/ucc_geth.o > In file included from ./include/linux/printk.h:6:0, > from ./include/linux/kernel.h:13, > from drivers/net/ethernet/freescale/ucc_geth.c:18: > drivers/net/ethernet/freescale/ucc_geth.c: In function > ‘ucc_geth_probe’: > ./include/linux/kern_levels.h:4:18: warning: format ‘%llx’ expects > argument of type ‘long long unsigned int’, but argument 3 has type > ‘resource_size_t {aka unsigned int}’ [-Wformat=] > #define KERN_SOH "\001" /* ASCII Start Of Header */ >^ > ./include/linux/kern_levels.h:13:19: note: in expansion of macro > ‘KERN_SOH’ > #define KERN_INFO KERN_SOH "6" /* informational */ > ^ > ./include/linux/printk.h:308:9: note: in expansion of macro > ‘KERN_INFO’ >printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) > ^ > drivers/net/ethernet/freescale/ucc_geth.c:3860:3: note: in expansion > of > macro ‘pr_info’ > pr_info("UCC%1d at 0x%8llx (irq = %d)\n", > ^ > > Christophe > > > > > Signed-off-by: Valentin Longchamp > > --- > > drivers/net/ethernet/freescale/ucc_geth.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c > > b/drivers/net/ethernet/freescale/ucc_geth.c > > index f77ba9fa257b..56b8fdb35c3b 100644 > > --- a/drivers/net/ethernet/freescale/ucc_geth.c > > +++ b/drivers/net/ethernet/freescale/ucc_geth.c > > @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct > > platform_device* ofdev) > > } > > > > if (netif_msg_probe(&debug)) > > - pr_info("UCC%1d at 0x%8x (irq = %d)\n", > > + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", > > ug_info->uf_info.ucc_num + 1, ug_info- > > >uf_info.regs, > > ug_info->uf_info.irq); > >
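For what it's worth, a cast-free alternative that sidesteps the u32/u64 split entirely is the kernel's %pa printk specifier, which prints a phys_addr_t (and hence resource_size_t) passed by reference. A sketch against the same pr_info (untested):

	if (netif_msg_probe(&debug))
		pr_info("UCC%1d at %pa (irq = %d)\n",
			ug_info->uf_info.ucc_num + 1,
			&ug_info->uf_info.regs,	/* note: pass by reference */
			ug_info->uf_info.irq);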
Re: [v4 07/11] soc/fsl/qbman: Rework portal mapping calls for ARM/PPC
On Thu, Aug 24, 2017 at 04:37:51PM -0400, Roy Pledge wrote: > diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c > index ff8998f..e31c843 100644 > --- a/drivers/soc/fsl/qbman/bman.c > +++ b/drivers/soc/fsl/qbman/bman.c > @@ -154,7 +154,7 @@ struct bm_mc { > }; > > struct bm_addr { > - void __iomem *ce; /* cache-enabled */ > + void *ce; /* cache-enabled */ > void __iomem *ci; /* cache-inhibited */ > }; You dropped __iomem from ce, which is fine since it is now set via memremap. However, I haven't seen (at least not in this patch), a change to bm_ce_in() which still uses __raw_readl(). (it may be worth checking this code with sparse, it may warn about this) > diff --git a/drivers/soc/fsl/qbman/bman_portal.c > b/drivers/soc/fsl/qbman/bman_portal.c > index 39b39c8..bb03503 100644 > --- a/drivers/soc/fsl/qbman/bman_portal.c > +++ b/drivers/soc/fsl/qbman/bman_portal.c > @@ -91,7 +91,6 @@ static int bman_portal_probe(struct platform_device *pdev) > struct device_node *node = dev->of_node; > struct bm_portal_config *pcfg; > struct resource *addr_phys[2]; > - void __iomem *va; > int irq, cpu; > > pcfg = devm_kmalloc(dev, sizeof(*pcfg), GFP_KERNEL); > @@ -123,23 +122,34 @@ static int bman_portal_probe(struct platform_device > *pdev) > } > pcfg->irq = irq; > > - va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); > - if (!va) { > - dev_err(dev, "ioremap::CE failed\n"); > + /* > + * TODO: Ultimately we would like to use a cacheable/non-shareable > + * (coherent) mapping for the portal on both architectures but that > + * isn't currently available in the kernel. Because of HW differences > + * PPC needs to be mapped cacheable while ARM SoCs will work with non > + * cacheable mappings > + */ This comment mentions "cacheable/non-shareable (coherent)". Was this meant for ARM platforms? Because non-shareable is not coherent, nor is this combination guaranteed to work with different CPUs and interconnects. > +#ifdef CONFIG_PPC > + /* PPC requires a cacheable/non-coherent mapping of the portal */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, > + resource_size(addr_phys[0]), MEMREMAP_WB); > +#else > + /* ARM can use a write combine mapping. */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, > + resource_size(addr_phys[0]), MEMREMAP_WC); > +#endif Nitpick: you could define something like QBMAN_MAP_ATTR to be different between PPC and the rest and just keep a single memremap() call. One may complain that "ce" is no longer "cache enabled" but I'm personally fine to keep the same name for historical reasons. > diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h > b/drivers/soc/fsl/qbman/dpaa_sys.h > index 81a9a5e..0a1d573 100644 > --- a/drivers/soc/fsl/qbman/dpaa_sys.h > +++ b/drivers/soc/fsl/qbman/dpaa_sys.h > @@ -51,12 +51,12 @@ > > static inline void dpaa_flush(void *p) > { > + /* > + * Only PPC needs to flush the cache currently - on ARM the mapping > + * is non cacheable > + */ > #ifdef CONFIG_PPC > flush_dcache_range((unsigned long)p, (unsigned long)p+64); > -#elif defined(CONFIG_ARM) > - __cpuc_flush_dcache_area(p, 64); > -#elif defined(CONFIG_ARM64) > - __flush_dcache_area(p, 64); > #endif > } Dropping the private API cache maintenance is fine and the memory is WC now for ARM (mapping to Normal NonCacheable). However, do you require any barriers here? Normal NC doesn't guarantee any ordering. 
> diff --git a/drivers/soc/fsl/qbman/qman_portal.c > b/drivers/soc/fsl/qbman/qman_portal.c > index cbacdf4..41fe33a 100644 > --- a/drivers/soc/fsl/qbman/qman_portal.c > +++ b/drivers/soc/fsl/qbman/qman_portal.c > @@ -224,7 +224,6 @@ static int qman_portal_probe(struct platform_device *pdev) > struct device_node *node = dev->of_node; > struct qm_portal_config *pcfg; > struct resource *addr_phys[2]; > - void __iomem *va; > int irq, cpu, err; > u32 val; > > @@ -262,23 +261,34 @@ static int qman_portal_probe(struct platform_device > *pdev) > } > pcfg->irq = irq; > > - va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); > - if (!va) { > - dev_err(dev, "ioremap::CE failed\n"); > + /* > + * TODO: Ultimately we would like to use a cacheable/non-shareable > + * (coherent) mapping for the portal on both architectures but that > + * isn't currently available in the kernel. Because of HW differences > + * PPC needs to be mapped cacheable while ARM SoCs will work with non > + * cacheable mappings > + */ Same comment as above on non-shareable. > +#ifdef CONFIG_PPC > + /* PPC requires a cacheable mapping of the portal */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, +
Re: [v4 05/11] soc/fsl/qbman: Drop L1_CACHE_BYTES compile time check
On Thu, Aug 24, 2017 at 04:37:49PM -0400, Roy Pledge wrote: > From: Claudiu Manoil > > Not relevant and arch dependent. Overkill for PPC. > > Signed-off-by: Claudiu Manoil > Signed-off-by: Roy Pledge > --- > drivers/soc/fsl/qbman/dpaa_sys.h | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h > b/drivers/soc/fsl/qbman/dpaa_sys.h > index 2ce394a..f85c319 100644 > --- a/drivers/soc/fsl/qbman/dpaa_sys.h > +++ b/drivers/soc/fsl/qbman/dpaa_sys.h > @@ -49,10 +49,6 @@ > #define DPAA_PORTAL_CE 0 > #define DPAA_PORTAL_CI 1 > > -#if (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64) > -#error "Unsupported Cacheline Size" > -#endif Maybe this check was for a reason on PPC as it uses WB memory mappings for some of the qbman descriptors (which IIUC fit within a cacheline). You could add a check for CONFIG_PPC if you think there is any chance of this constant going higher. -- Catalin
Re: [v4 03/11] dt-bindings: soc/fsl: Update reserved memory binding for QBMan
On Thu, Aug 24, 2017 at 04:37:47PM -0400, Roy Pledge wrote: > Updates the QMan and BMan device tree bindings for reserved memory > nodes. This makes the reserved memory allocation compatible with > the shared-dma-pool usage. > > Signed-off-by: Roy Pledge > --- > Documentation/devicetree/bindings/soc/fsl/bman.txt | 12 +- > Documentation/devicetree/bindings/soc/fsl/qman.txt | 26 -- This needs to be reviewed by the DT maintainers. -- Catalin
Re: [v4 01/11] soc/fsl/qbman: Use shared-dma-pool for BMan private memory allocations
On Thu, Aug 24, 2017 at 04:37:45PM -0400, Roy Pledge wrote: > --- a/drivers/soc/fsl/qbman/bman_ccsr.c > +++ b/drivers/soc/fsl/qbman/bman_ccsr.c [...] > @@ -201,6 +202,38 @@ static int fsl_bman_probe(struct platform_device *pdev) > return -ENODEV; > } > > + /* > + * If FBPR memory wasn't defined using the qbman compatible string > + * try using the of_reserved_mem_device method > + */ > + if (!fbpr_a) { > + ret = of_reserved_mem_device_init(dev); > + if (ret) { > + dev_err(dev, "of_reserved_mem_device_init() failed > 0x%x\n", > + ret); > + return -ENODEV; > + } > + mem_node = of_parse_phandle(dev->of_node, "memory-region", 0); > + if (mem_node) { > + ret = of_property_read_u64(mem_node, "size", &size); > + if (ret) { > + dev_err(dev, "FBPR: of_address_to_resource > fails 0x%x\n", > + ret); > + return -ENODEV; > + } > + fbpr_sz = size; > + } else { > + dev_err(dev, "No memory-region found for FBPR\n"); > + return -ENODEV; > + } > + if (!dma_zalloc_coherent(dev, fbpr_sz, &fbpr_a, 0)) { > + dev_err(dev, "Alloc FBPR memory failed\n"); > + return -ENODEV; > + } > + } At a quick look, I think I spotted this pattern a couple more times in the subsequent patch. Could it be moved to a common function? -- Catalin
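A minimal sketch of the common function being suggested here (hypothetical name and signature, with the error reporting condensed; the real consolidation would live in the shared dpaa_sys code):

static int qbman_init_private_mem(struct device *dev, dma_addr_t *addr,
				  size_t *sz)
{
	struct device_node *mem_node;
	u64 size;
	int ret;

	ret = of_reserved_mem_device_init(dev);
	if (ret)
		return ret;

	mem_node = of_parse_phandle(dev->of_node, "memory-region", 0);
	if (!mem_node)
		return -ENODEV;
	ret = of_property_read_u64(mem_node, "size", &size);
	if (ret)
		return ret;
	*sz = size;

	if (!dma_zalloc_coherent(dev, *sz, addr, 0))
		return -ENOMEM;
	return 0;
}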
Re: [PATCH] net/ethernet/freescale: fix warning for ucc_geth
Hi, On 14/09/2017 at 14:05, Valentin Longchamp wrote: Simple printk format warning for the ucc registers address. Did you test your patch with mpc83xx_defconfig? I get a new warning with your patch: CC drivers/net/ethernet/freescale/ucc_geth.o In file included from ./include/linux/printk.h:6:0, from ./include/linux/kernel.h:13, from drivers/net/ethernet/freescale/ucc_geth.c:18: drivers/net/ethernet/freescale/ucc_geth.c: In function ‘ucc_geth_probe’: ./include/linux/kern_levels.h:4:18: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘resource_size_t {aka unsigned int}’ [-Wformat=] #define KERN_SOH "\001" /* ASCII Start Of Header */ ^ ./include/linux/kern_levels.h:13:19: note: in expansion of macro ‘KERN_SOH’ #define KERN_INFO KERN_SOH "6" /* informational */ ^ ./include/linux/printk.h:308:9: note: in expansion of macro ‘KERN_INFO’ printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) ^ drivers/net/ethernet/freescale/ucc_geth.c:3860:3: note: in expansion of macro ‘pr_info’ pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ^ Christophe Signed-off-by: Valentin Longchamp --- drivers/net/ethernet/freescale/ucc_geth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c index f77ba9fa257b..56b8fdb35c3b 100644 --- a/drivers/net/ethernet/freescale/ucc_geth.c +++ b/drivers/net/ethernet/freescale/ucc_geth.c @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct platform_device* ofdev) } if (netif_msg_probe(&debug)) - pr_info("UCC%1d at 0x%8x (irq = %d)\n", + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs, ug_info->uf_info.irq);
[PATCH] net/ethernet/freescale: fix warning for ucc_geth
Simple printk format warning for the ucc registers address. Signed-off-by: Valentin Longchamp --- drivers/net/ethernet/freescale/ucc_geth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c index f77ba9fa257b..56b8fdb35c3b 100644 --- a/drivers/net/ethernet/freescale/ucc_geth.c +++ b/drivers/net/ethernet/freescale/ucc_geth.c @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct platform_device* ofdev) } if (netif_msg_probe(&debug)) - pr_info("UCC%1d at 0x%8x (irq = %d)\n", + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs, ug_info->uf_info.irq); -- 2.13.5
Re: [RFC PATCH 2/2] powerpc/powernv: implement NMI IPIs with OPAL_SIGNAL_SYSTEM_RESET
On Wed, 13 Sep 2017 02:05:53 +1000 Nicholas Piggin wrote: > There are two complications. The first is that sresets from stop states > come in with SRR1 set to do a powersave wakeup, with an sreset reason > encoded. > > The second is that threads on the same core can't be signalled directly > so we must designate a bounce CPU to reflect the IPI back. This is a revised patch with only DD2 enablement. DD2 allows threads on the same core to be IPIed. It's much simpler, and most of the code is fixing the watchdog and preventing it from triggering from xmon (which will be split into other patches of course). It's probably a better starting point to get this working and merged first, and then revisit bouncing. --- arch/powerpc/include/asm/opal-api.h| 1 + arch/powerpc/include/asm/opal.h| 2 ++ arch/powerpc/kernel/irq.c | 20 ++ arch/powerpc/kernel/watchdog.c | 29 +++--- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + arch/powerpc/platforms/powernv/powernv.h | 1 + arch/powerpc/platforms/powernv/setup.c | 3 +++ arch/powerpc/platforms/powernv/smp.c | 24 + arch/powerpc/xmon/xmon.c | 17 +++ 9 files changed, 82 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 450a60b81d2a..9d191ebea706 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -188,6 +188,7 @@ #define OPAL_XIVE_DUMP 142 #define OPAL_XIVE_RESERVED3143 #define OPAL_XIVE_RESERVED4144 +#define OPAL_SIGNAL_SYSTEM_RESET 145 #define OPAL_NPU_INIT_CONTEXT 146 #define OPAL_NPU_DESTROY_CONTEXT 147 #define OPAL_NPU_MAP_LPAR 148 diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 726c23304a57..7d7613c49f2b 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -281,6 +281,8 @@ int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr); int opal_set_power_shift_ratio(u32 handle, int token, u32 psr); int opal_sensor_group_clear(u32 group_hndl, int token); +int64_t opal_signal_system_reset(int32_t cpu); + /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, int depth, void *data); diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 4e65bf82f5e0..8ffebb9437e5 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -407,11 +407,31 @@ static const u8 srr1_to_lazyirq[0x10] = { PACA_IRQ_HMI, 0, 0, 0, 0, 0 }; +static noinline void replay_system_reset(void) +{ + struct pt_regs regs; + + ppc_save_regs(&regs); + + get_paca()->in_nmi = 1; + system_reset_exception(&regs); + get_paca()->in_nmi = 0; +} + void irq_set_pending_from_srr1(unsigned long srr1) { unsigned int idx = (srr1 & SRR1_WAKEMASK_P8) >> 18; /* +* 0100b SRR1 reason is system reset. Take it now, +* which is immediately after registers are restored +* from idle. It's an NMI, so interrupts needn't be +* re-enabled. +*/ + if (unlikely(idx == 4)) + replay_system_reset(); + + /* * The 0 index (SRR1[42:45]=b0000) must always evaluate to 0, * so this can be called unconditionally with srr1 wake reason. 
*/ diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index 2f6eadd9408d..1fb9379dc683 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -97,8 +97,7 @@ static void wd_lockup_ipi(struct pt_regs *regs) else dump_stack(); - if (hardlockup_panic) - nmi_panic(regs, "Hard LOCKUP"); + /* Do not panic from here because that can recurse into NMI IPI layer */ } static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb) @@ -134,15 +133,18 @@ static void watchdog_smp_panic(int cpu, u64 tb) pr_emerg("Watchdog CPU:%d detected Hard LOCKUP other CPUS:%*pbl\n", cpu, cpumask_pr_args(&wd_smp_cpus_pending)); - /* -* Try to trigger the stuck CPUs. -*/ - for_each_cpu(c, &wd_smp_cpus_pending) { - if (c == cpu) - continue; - smp_send_nmi_ipi(c, wd_lockup_ipi, 100); + if (!sysctl_hardlockup_all_cpu_backtrace) { + /* +* Try to trigger the stuck CPUs, unless we are going to +* get a backtrace on all of them anyway. +*/ + for_each_cpu(c, &wd_smp_cpus_pending) { + if (c == cpu) + continue; +
Re: [PATCH 4/5] powerpc/jprobes: Disable preemption when triggered through ftrace
On Thu, 14 Sep 2017 15:55:39 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 05:05PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:35 +0530 > > "Naveen N. Rao" wrote: > > > > > KPROBES_SANITY_TEST throws the below splat when CONFIG_PREEMPT is > > > enabled: > > > > > > [3.140410] Kprobe smoke test: started > > > [3.149680] DEBUG_LOCKS_WARN_ON(val > preempt_count()) > > > [3.149684] [ cut here ] > > > [3.149695] WARNING: CPU: 19 PID: 1 at kernel/sched/core.c:3094 > > > preempt_count_sub+0xcc/0x140 > > > [3.149699] Modules linked in: > > > [3.149705] CPU: 19 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc7-nnr+ > > > #97 > > > [3.149709] task: c000fea8 task.stack: c000feb0 > > > [3.149713] NIP: c011d3dc LR: c011d3d8 CTR: > > > c0a090d0 > > > [3.149718] REGS: c000feb03400 TRAP: 0700 Not tainted > > > (4.13.0-rc7-nnr+) > > > [3.149722] MSR: 80021033 CR: 28000282 > > > XER: > > > [3.149732] CFAR: c015aa18 SOFTE: 0 > > > > > > [3.149786] NIP [c011d3dc] preempt_count_sub+0xcc/0x140 > > > [3.149790] LR [c011d3d8] preempt_count_sub+0xc8/0x140 > > > [3.149794] Call Trace: > > > [3.149798] [c000feb03680] [c011d3d8] > > > preempt_count_sub+0xc8/0x140 (unreliable) > > > [3.149804] [c000feb036e0] [c0046198] > > > kprobe_handler+0x228/0x4b0 > > > [3.149810] [c000feb03750] [c00269c8] > > > program_check_exception+0x58/0x3b0 > > > [3.149816] [c000feb037c0] [c000903c] > > > program_check_common+0x16c/0x170 > > > [3.149822] --- interrupt: 0 at kprobe_target+0x8/0x20 > > >LR = init_test_probes+0x248/0x7d0 > > > [3.149829] [c000feb03ab0] [c0e4f048] kp+0x0/0x80 > > > (unreliable) > > > [3.149835] [c000feb03b10] [c004ea60] > > > livepatch_handler+0x38/0x74 > > > [3.149841] [c000feb03ba0] [c0d0de54] > > > init_kprobes+0x1d8/0x208 > > > [3.149846] [c000feb03c40] [c000daa8] > > > do_one_initcall+0x68/0x1d0 > > > [3.149852] [c000feb03d00] [c0ce44f0] > > > kernel_init_freeable+0x298/0x374 > > > [3.149857] [c000feb03dc0] [c000dd84] > > > kernel_init+0x24/0x160 > > > [3.149863] [c000feb03e30] [c000bfec] > > > ret_from_kernel_thread+0x5c/0x70 > > > [3.149867] Instruction dump: > > > [3.149871] 419effdc 3d22001b 39299240 8129 2f89 409effc8 > > > 3c82ffcb 3c62ffcb > > > [3.149879] 3884bc68 3863bc18 4803d5fd 6000 <0fe0> 4ba8 > > > 6000 6000 > > > [3.149890] ---[ end trace 432dd46b4ce3d29f ]--- > > > [3.166003] Kprobe smoke test: passed successfully > > > > > > The issue is that we aren't disabling preemption in > > > kprobe_ftrace_handler(). Disable it. > > > > Oops, right! Similar patch may need for x86 too. > > Indeed, I will send a patch for that. > > On a related note, I've been looking into why we have !PREEMPT for > CONFIG_OPTPROBES. It looks like the primary reason is x86 having to deal > with replacing multiple instructions. However, that isn't true with arm > and powerpc. So, does it make sense to move 'depends on !PREEMPT' to the > x86 code? Are there other scenarios where it might cause issues for > arm/powerpc? Indeed! What I expected was to have to replace several instructions on RISC processors for jumping to a far site (like load & jump), but you chose a different approach. So there is no reason to prohibit that. Thanks! -- Masami Hiramatsu
Re: [PATCH 4/5] powerpc/jprobes: Disable preemption when triggered through ftrace
On 2017/09/13 05:05PM, Masami Hiramatsu wrote: > On Thu, 14 Sep 2017 02:50:35 +0530 > "Naveen N. Rao" wrote: > > > KPROBES_SANITY_TEST throws the below splat when CONFIG_PREEMPT is > > enabled: > > > > [3.140410] Kprobe smoke test: started > > [3.149680] DEBUG_LOCKS_WARN_ON(val > preempt_count()) > > [3.149684] [ cut here ] > > [3.149695] WARNING: CPU: 19 PID: 1 at kernel/sched/core.c:3094 > > preempt_count_sub+0xcc/0x140 > > [3.149699] Modules linked in: > > [3.149705] CPU: 19 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc7-nnr+ > > #97 > > [3.149709] task: c000fea8 task.stack: c000feb0 > > [3.149713] NIP: c011d3dc LR: c011d3d8 CTR: > > c0a090d0 > > [3.149718] REGS: c000feb03400 TRAP: 0700 Not tainted > > (4.13.0-rc7-nnr+) > > [3.149722] MSR: 80021033 CR: 28000282 > > XER: > > [3.149732] CFAR: c015aa18 SOFTE: 0 > > > > [3.149786] NIP [c011d3dc] preempt_count_sub+0xcc/0x140 > > [3.149790] LR [c011d3d8] preempt_count_sub+0xc8/0x140 > > [3.149794] Call Trace: > > [3.149798] [c000feb03680] [c011d3d8] > > preempt_count_sub+0xc8/0x140 (unreliable) > > [3.149804] [c000feb036e0] [c0046198] > > kprobe_handler+0x228/0x4b0 > > [3.149810] [c000feb03750] [c00269c8] > > program_check_exception+0x58/0x3b0 > > [3.149816] [c000feb037c0] [c000903c] > > program_check_common+0x16c/0x170 > > [3.149822] --- interrupt: 0 at kprobe_target+0x8/0x20 > >LR = init_test_probes+0x248/0x7d0 > > [3.149829] [c000feb03ab0] [c0e4f048] kp+0x0/0x80 > > (unreliable) > > [3.149835] [c000feb03b10] [c004ea60] > > livepatch_handler+0x38/0x74 > > [3.149841] [c000feb03ba0] [c0d0de54] > > init_kprobes+0x1d8/0x208 > > [3.149846] [c000feb03c40] [c000daa8] > > do_one_initcall+0x68/0x1d0 > > [3.149852] [c000feb03d00] [c0ce44f0] > > kernel_init_freeable+0x298/0x374 > > [3.149857] [c000feb03dc0] [c000dd84] kernel_init+0x24/0x160 > > [3.149863] [c000feb03e30] [c000bfec] > > ret_from_kernel_thread+0x5c/0x70 > > [3.149867] Instruction dump: > > [3.149871] 419effdc 3d22001b 39299240 8129 2f89 409effc8 > > 3c82ffcb 3c62ffcb > > [3.149879] 3884bc68 3863bc18 4803d5fd 6000 <0fe0> 4ba8 > > 6000 6000 > > [3.149890] ---[ end trace 432dd46b4ce3d29f ]--- > > [3.166003] Kprobe smoke test: passed successfully > > > > The issue is that we aren't disabling preemption in > > kprobe_ftrace_handler(). Disable it. > > Oops, right! Similar patch may need for x86 too. Indeed, I will send a patch for that. On a related note, I've been looking into why we have !PREEMPT for CONFIG_OPTPROBES. It looks like the primary reason is x86 having to deal with replacing multiple instructions. However, that isn't true with arm and powerpc. So, does it make sense to move 'depends on !PREEMPT' to the x86 code? Are there other scenarios where it might cause issues for arm/powerpc? Thanks! - Naveen
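For the x86 counterpart mentioned above, the shape of the fix is the same as on powerpc: hold preemption off across the ftrace-based handler. A minimal sketch (hypothetical helper name for the existing pre-handler invocation; this is not the upstream diff):

/* Sketch only: handlers expect to run with preemption disabled, just as
 * they do on the trap-based kprobe path. */
void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct pt_regs *regs)
{
	struct kprobe *p;

	preempt_disable();
	p = get_kprobe((kprobe_opcode_t *)ip);
	if (p && !kprobe_disabled(p))
		handle_kprobe_from_ftrace(p, regs);	/* hypothetical */
	preempt_enable_no_resched();
}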
[PATCH] drivers: of: static DT reservations incorrectly added to dynamic list
There are two types of memory reservations firmware can ask the kernel to make in the device tree: static and dynamic. See Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt If you have more than 16 entries in /reserved-memory (as we do on POWER9 systems) you would get this scary-looking error message: [0.00] OF: reserved mem: not enough space all defined regions. This is harmless if all your reservations are static (which with OPAL on POWER9, they are). It is not harmless if you have any dynamic reservations after the 16th. In the first pass over the fdt to find reservations, the child nodes of /reserved-memory are added to a static array in of_reserved_mem.c so that memory can be reserved in a 2nd pass. The array has 16 entries. This is why, on my dual-socket POWER9 system, I get that error 4 times with 20 static reservations. We don't have a problem on ppc though, as in arch/powerpc/kernel/prom.c we look at the new-style /reserved-ranges property to do reservations, and this logic was introduced in 0962e8004e974 (well before any powernv system shipped). Google turns up no occurrences of that exact error message, so we're probably safe in assuming that no machine people actually use is failing to reserve memory it should. The fix is simple: since there's a different code path for static and dynamic allocations, we just don't add the region to the list if it's static. Since it's a static *OR* dynamic region, this is a perfectly valid thing to do (although I have not checked every real-world device tree on the planet for this condition). Fixes: 3f0c8206644836e4f10a6b9fc47cda6a9a372f9b Signed-off-by: Stewart Smith --- NOTE: I've done only fairly limited testing of this on POWER, I certainly haven't tested on ARM or *anything* with dynamic allocations. So, testing and comments welcome. --- drivers/of/fdt.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index ce30c9a588a4..a9a44099ed69 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -587,7 +587,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long node, phys_addr_t base, size; int len; const __be32 *prop; - int nomap, first = 1; + int nomap; prop = of_get_flat_dt_prop(node, "reg", &len); if (!prop) @@ -614,10 +614,6 @@ static int __init __reserved_mem_reserve_reg(unsigned long node, uname, &base, (unsigned long)size / SZ_1M); len -= t_len; - if (first) { - fdt_reserved_mem_save_node(node, uname, base, size); - first = 0; - } } return 0; } -- 2.13.5
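To make the static/dynamic distinction concrete, the first-pass scan described above behaves roughly as below once the fix is applied. This is a condensed sketch (the helper name scan_reserved_mem_node and the control flow are simplified from drivers/of/fdt.c, with error handling elided): only dynamic nodes still consume one of the 16 saved entries.

/*
 * Condensed sketch of the first pass over /reserved-memory children:
 * a "reg" property marks a static region that can be reserved on the
 * spot, while a "size" property marks a dynamic region that must be
 * saved so the second pass can allocate it.
 */
static int __init scan_reserved_mem_node(unsigned long node, const char *uname)
{
        /* Static region: fixed addresses, reserve immediately */
        if (of_get_flat_dt_prop(node, "reg", NULL))
                return __reserved_mem_reserve_reg(node, uname);

        /* Dynamic region: the kernel picks the address in the second pass */
        if (of_get_flat_dt_prop(node, "size", NULL))
                fdt_reserved_mem_save_node(node, uname, 0, 0);

        return 0;
}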
Re: [PATCH 3/5] powerpc/kprobes: Fix warnings from __this_cpu_read() on preempt kernels
On Thu, 14 Sep 2017 12:17:20 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 05:36PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:34 +0530 > > "Naveen N. Rao" wrote: > > > > > Kamalesh pointed out that we are getting the below call traces with > > > livepatched functions when we enable CONFIG_PREEMPT: > > > > > > [ 495.470721] BUG: using __this_cpu_read() in preemptible [] > > > code: cat/8394 > > > [ 495.471167] caller is is_current_kprobe_addr+0x30/0x90 > > > [ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K > > > 4.13.0-rc7-nnr+ #95 > > > [ 495.471173] Call Trace: > > > [ 495.471178] [c0008fd9b960] [c09f039c] > > > dump_stack+0xec/0x160 (unreliable) > > > [ 495.471184] [c0008fd9b9a0] [c059169c] > > > check_preemption_disabled+0x15c/0x170 > > > [ 495.471187] [c0008fd9ba30] [c0046460] > > > is_current_kprobe_addr+0x30/0x90 > > > [ 495.471191] [c0008fd9ba60] [c004e9a0] ftrace_call+0x1c/0xb8 > > > [ 495.471195] [c0008fd9bc30] [c0376fd8] seq_read+0x238/0x5c0 > > > [ 495.471199] [c0008fd9bcd0] [c03cfd78] > > > proc_reg_read+0x88/0xd0 > > > [ 495.471203] [c0008fd9bd00] [c033e5d4] __vfs_read+0x44/0x1b0 > > > [ 495.471206] [c0008fd9bd90] [c03402ec] vfs_read+0xbc/0x1b0 > > > [ 495.471210] [c0008fd9bde0] [c0342138] SyS_read+0x68/0x110 > > > [ 495.471214] [c0008fd9be30] [c000bc6c] system_call+0x58/0x6c > > > > > > Commit c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for > > > jprobes") introduced a helper is_current_kprobe_addr() to help determine > > > if the current function has been livepatched or if it has a jprobe > > > installed, both of which modify the NIP. > > > > > > In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption > > > before calling into setjmp_pre_handler() which returns without disabling > > > pre-emption. This is done to ensure that the jprobe handler won't > > > disappear beneath us if the jprobe is unregistered between the > > > setjmp_pre_handler() and the subsequent longjmp_break_handler() called > > > from the jprobe handler. Due to this, we can use __this_cpu_read() in > > > is_current_kprobe_addr() with the pre-emption check as we know that > > > pre-emption will be disabled. > > > > > > However, if this function has been livepatched, we are still doing this > > > check and when we do so, pre-emption won't necessarily be disabled. This > > > results in the call trace shown above. > > > > > > Fix this by only invoking is_current_kprobe_addr() when pre-emption is > > > disabled. And since we now guard this within a pre-emption check, we can > > > instead use raw_cpu_read() to get the current_kprobe value skipping the > > > check done by __this_cpu_read(). > > > > Hmm, can you disable preempt temporary at this function and read it? > > Yes, but I felt this approach is more optimal specifically for live > patching. We don't normally expect preemption to be disabled while > handling a livepatched function, so the simple 'if (!preemptible())' > check helps us bypass looking at current_kprobe. Ah, I see. This is for checking "jprobes", since kprobes/kretprobes are usually already handled there. > > The check also ensures we can use raw_cpu_read() safely in other > scenarios. Do you see any other concerns with this approach? Yes, there are 2 reasons. First, if a user's custom kprobe pre-handler changes the NIP for their own hand-written livepatching, such a handler will enable preemption before returning. We don't prohibit that. I think this function cannot detect that case. 
Also, the function "name" is very confusing :) Especially since this symbol is exported: if it is meant for checking the jprobes context, it should be renamed, or at least start with "__" to show it is for local use. Thank you, > > Thanks, > Naveen > > > > > > Thank you, > > > > > > > > Fixes: c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for > > > jprobes") > > > Reported-by: Kamalesh Babulal > > > Signed-off-by: Naveen N. Rao > > > --- > > > arch/powerpc/kernel/kprobes.c | 8 ++-- > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > > > index e848fe2c93fb..db40b13fd3d1 100644 > > > --- a/arch/powerpc/kernel/kprobes.c > > > +++ b/arch/powerpc/kernel/kprobes.c > > > @@ -45,8 +45,12 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = > > > {{NULL, NULL}}; > > > > > > int is_current_kprobe_addr(unsigned long addr) > > > { > > > - struct kprobe *p = kprobe_running(); > > > - return (p && (unsigned long)p->addr == addr) ? 1 : 0; > > > + if (!preemptible()) { > > > + struct kprobe *p = raw_cpu_read(current_kprobe); > > > + return (p && (unsigned long)p->addr == addr) ? 1 : 0; > > > + } > > > + > > > + return 0; > > > } > > > > > > bool arch_within_kprobe_blacklist(unsigned long addr) > > > --
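For comparison, the alternative Masami suggested earlier in the thread ("can you disable preempt temporary at this function and read it?") would look roughly like the sketch below, which pins the CPU just around the per-cpu read instead of returning 0 whenever the caller happens to be preemptible:

int is_current_kprobe_addr(unsigned long addr)
{
        struct kprobe *p;
        int ret;

        /* Pin the CPU so current_kprobe can't change under us */
        preempt_disable();
        p = raw_cpu_read(current_kprobe);
        ret = (p && (unsigned long)p->addr == addr) ? 1 : 0;
        preempt_enable();

        return ret;
}

As the reply above notes, though, even this variant cannot distinguish a jprobe from some other custom pre-handler that changed the NIP.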
Re: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR
On Thu, 2017-09-14 at 09:27 +, David Laight wrote: > You can logically 'hotplug' PCI(e) on any system [1]. > > The 'problem' is that whatever enumerates the PCI(e) at system > powerup doesn't normally assign extra resources to bridges to allow > for devices that aren't present at boot time. > So you can normally only replace cards with ones that use the same > (or less) resources, or that are not behind any bridges. > This is problematic if you have a docking station connected via > a bridge. There's also the problem of Max Payload Size. If you can hotplug behind a bridge then the standard algorithm of finding the max of all devices behind a host bridge doesn't work anymore and you have to clamp everybody to 128 bytes. > [1] Apart from some annoying x86 Dell servers we have which generate > an NMI when the PCIe link goes down (when we reprogram the fpga). > They also fail to boot if a link doesn't come up... > > David
Re: [PATCH 2/5] powerpc/kprobes: Do not suppress instruction emulation if a single run failed
On 2017/09/14 02:45AM, Masami Hiramatsu wrote: > On Thu, 14 Sep 2017 12:08:07 +0530 > "Naveen N. Rao" wrote: > > > On 2017/09/13 04:53PM, Masami Hiramatsu wrote: > > > On Thu, 14 Sep 2017 02:50:33 +0530 > > > "Naveen N. Rao" wrote: > > > > > > > Currently, we disable instruction emulation if emulate_step() fails for > > > > any reason. However, such failures could be transient and specific to a > > > > particular run. Instead, only disable instruction emulation if we have > > > > never been able to emulate this. If we had emulated this instruction > > > > successfully at least once, then we single step only this probe hit and > > > > continue to try emulating the instruction in subsequent probe hits. > > > > > > Hmm, would this mean that the instruction is emulatable or not depends > > > on context? What kind of situation is considerable? > > > > Yes, as an example, a load/store instruction can cause exceptions > > depending on the address. In some of those cases, we will have to single > > step the instruction, but we will be able to emulate in most scenarios. > > OK, I got it. > Could you add this example as comment in the code so that readers can > easily understand? Sure. Thanks, Naveen
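As a sketch of the retry policy being agreed on here (the flag semantics are illustrative; the posted patch may plumb this differently), the idea is to permanently give up on emulation only for an instruction that has never emulated successfully:

/*
 * Illustrative sketch: a transient emulate_step() failure (e.g. a
 * load/store that would fault for this particular address) only
 * causes a single step for this one probe hit; emulation is disabled
 * for good only if it has never succeeded for this probe.
 */
static int try_to_emulate(struct kprobe *p, struct pt_regs *regs)
{
        int ret;

        if (p->ainsn.boostable == -1)
                return 0;                       /* gave up on emulation for this probe */

        ret = emulate_step(regs, *p->ainsn.insn);
        if (ret > 0) {
                p->ainsn.boostable = 1;         /* emulation has worked at least once */
                return 1;                       /* emulated; no single stepping needed */
        }

        if (!p->ainsn.boostable)
                p->ainsn.boostable = -1;        /* never emulated: stop trying */

        return 0;                               /* fall back to single stepping this hit */
}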
Re: [PATCH 3/5] powerpc/kprobes: Fix warnings from __this_cpu_read() on preempt kernels
On Thursday 14 September 2017 02:50 AM, Naveen N. Rao wrote: Kamalesh pointed out that we are getting the below call traces with livepatched functions when we enable CONFIG_PREEMPT: [ 495.470721] BUG: using __this_cpu_read() in preemptible [] code: cat/8394 [ 495.471167] caller is is_current_kprobe_addr+0x30/0x90 [ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K 4.13.0-rc7-nnr+ #95 [ 495.471173] Call Trace: [ 495.471178] [c0008fd9b960] [c09f039c] dump_stack+0xec/0x160 (unreliable) [ 495.471184] [c0008fd9b9a0] [c059169c] check_preemption_disabled+0x15c/0x170 [ 495.471187] [c0008fd9ba30] [c0046460] is_current_kprobe_addr+0x30/0x90 [ 495.471191] [c0008fd9ba60] [c004e9a0] ftrace_call+0x1c/0xb8 [ 495.471195] [c0008fd9bc30] [c0376fd8] seq_read+0x238/0x5c0 [ 495.471199] [c0008fd9bcd0] [c03cfd78] proc_reg_read+0x88/0xd0 [ 495.471203] [c0008fd9bd00] [c033e5d4] __vfs_read+0x44/0x1b0 [ 495.471206] [c0008fd9bd90] [c03402ec] vfs_read+0xbc/0x1b0 [ 495.471210] [c0008fd9bde0] [c0342138] SyS_read+0x68/0x110 [ 495.471214] [c0008fd9be30] [c000bc6c] system_call+0x58/0x6c Commit c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for jprobes") introduced a helper is_current_kprobe_addr() to help determine if the current function has been livepatched or if it has a jprobe installed, both of which modify the NIP. In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption before calling into setjmp_pre_handler() which returns without disabling pre-emption. This is done to ensure that the jprobe handler won't disappear beneath us if the jprobe is unregistered between the setjmp_pre_handler() and the subsequent longjmp_break_handler() called from the jprobe handler. Due to this, we can use __this_cpu_read() in is_current_kprobe_addr() with the pre-emption check as we know that pre-emption will be disabled. However, if this function has been livepatched, we are still doing this check and when we do so, pre-emption won't necessarily be disabled. This results in the call trace shown above. Fix this by only invoking is_current_kprobe_addr() when pre-emption is disabled. And since we now guard this within a pre-emption check, we can instead use raw_cpu_read() to get the current_kprobe value skipping the check done by __this_cpu_read(). Fixes: c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for jprobes") Reported-by: Kamalesh Babulal Signed-off-by: Naveen N. Rao Thanks, the call traces are not seen anymore with the patch. Tested-by: Kamalesh Babulal --- arch/powerpc/kernel/kprobes.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index e848fe2c93fb..db40b13fd3d1 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -45,8 +45,12 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}}; int is_current_kprobe_addr(unsigned long addr) { - struct kprobe *p = kprobe_running(); - return (p && (unsigned long)p->addr == addr) ? 1 : 0; + if (!preemptible()) { + struct kprobe *p = raw_cpu_read(current_kprobe); + return (p && (unsigned long)p->addr == addr) ? 1 : 0; + } + + return 0; } bool arch_within_kprobe_blacklist(unsigned long addr)
Re: [PATCH 2/5] powerpc/kprobes: Do not suppress instruction emulation if a single run failed
On Thu, 14 Sep 2017 12:08:07 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 04:53PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:33 +0530 > > "Naveen N. Rao" wrote: > > > > > Currently, we disable instruction emulation if emulate_step() fails for > > > any reason. However, such failures could be transient and specific to a > > > particular run. Instead, only disable instruction emulation if we have > > > never been able to emulate this. If we had emulated this instruction > > > successfully at least once, then we single step only this probe hit and > > > continue to try emulating the instruction in subsequent probe hits. > > > > Hmm, would this mean that the instruction is emulatable or not depends > > on context? What kind of situation is considerable? > > Yes, as an example, a load/store instruction can cause exceptions > depending on the address. In some of those cases, we will have to single > step the instruction, but we will be able to emulate in most scenarios. OK, I got it. Could you add this example as comment in the code so that readers can easily understand? Thank you, > > Thanks for the review! > - Naveen > -- Masami Hiramatsu
Re: [PATCH v3 04/20] mm: VMA sequence count
On (09/14/17 11:15), Laurent Dufour wrote: > On 14/09/2017 11:11, Sergey Senozhatsky wrote: > > On (09/14/17 10:58), Laurent Dufour wrote: > > [..] > >> That's right, but here this is the sequence counter mm->mm_seq, not the > >> vm_seq one. > > > > d'oh... you are right. > > So I'm doubting about the probability of a deadlock here, but I don't like > to see lockdep complaining. Is there an easy way to make it happy ? /* * well... answering your question - it seems raw versions of seqcount * functions don't call lockdep's lock_acquire/lock_release... * * but I have never told you that. never. */ lockdep, perhaps, can be wrong sometimes, and may be it's one of those cases. may be not... I'm not a MM guy myself. below is a lockdep splat I got yesterday. that's v3 of SPF patch set. [ 2763.365898] == [ 2763.365899] WARNING: possible circular locking dependency detected [ 2763.365902] 4.13.0-next-20170913-dbg-00039-ge3c06ea4b028-dirty #1837 Not tainted [ 2763.365903] -- [ 2763.365905] khugepaged/42 is trying to acquire lock: [ 2763.365906] (&mapping->i_mmap_rwsem){}, at: [] rmap_walk_file+0x5a/0x142 [ 2763.365913] but task is already holding lock: [ 2763.365915] (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire+0x12/0x35 [ 2763.365920] which lock already depends on the new lock. [ 2763.365922] the existing dependency chain (in reverse order) is: [ 2763.365924] -> #3 (fs_reclaim){+.+.}: [ 2763.365930]lock_acquire+0x176/0x19e [ 2763.365932]fs_reclaim_acquire+0x32/0x35 [ 2763.365934]__alloc_pages_nodemask+0x6d/0x1f9 [ 2763.365937]pte_alloc_one+0x17/0x62 [ 2763.365940]__pte_alloc+0x1f/0x83 [ 2763.365943]move_page_tables+0x2c3/0x5a2 [ 2763.365944]move_vma.isra.25+0xff/0x29f [ 2763.365946]SyS_mremap+0x41b/0x49e [ 2763.365949]entry_SYSCALL_64_fastpath+0x18/0xad [ 2763.365951] -> #2 (&vma->vm_sequence/1){+.+.}: [ 2763.365955]lock_acquire+0x176/0x19e [ 2763.365958]write_seqcount_begin_nested+0x1b/0x1d [ 2763.365959]__vma_adjust+0x1c4/0x5f1 [ 2763.365961]__split_vma+0x12c/0x181 [ 2763.365963]do_munmap+0x128/0x2af [ 2763.365965]vm_munmap+0x5a/0x73 [ 2763.365968]elf_map+0xb1/0xce [ 2763.365970]load_elf_binary+0x91e/0x137a [ 2763.365973]search_binary_handler+0x70/0x1f3 [ 2763.365974]do_execveat_common+0x45e/0x68e [ 2763.365978]call_usermodehelper_exec_async+0xf7/0x11f [ 2763.365980]ret_from_fork+0x27/0x40 [ 2763.365981] -> #1 (&vma->vm_sequence){+.+.}: [ 2763.365985]lock_acquire+0x176/0x19e [ 2763.365987]write_seqcount_begin_nested+0x1b/0x1d [ 2763.365989]__vma_adjust+0x1a9/0x5f1 [ 2763.365991]__split_vma+0x12c/0x181 [ 2763.365993]do_munmap+0x128/0x2af [ 2763.365994]vm_munmap+0x5a/0x73 [ 2763.365996]elf_map+0xb1/0xce [ 2763.365998]load_elf_binary+0x91e/0x137a [ 2763.365999]search_binary_handler+0x70/0x1f3 [ 2763.366001]do_execveat_common+0x45e/0x68e [ 2763.366003]call_usermodehelper_exec_async+0xf7/0x11f [ 2763.366005]ret_from_fork+0x27/0x40 [ 2763.366006] -> #0 (&mapping->i_mmap_rwsem){}: [ 2763.366010]__lock_acquire+0xa72/0xca0 [ 2763.366012]lock_acquire+0x176/0x19e [ 2763.366015]down_read+0x3b/0x55 [ 2763.366017]rmap_walk_file+0x5a/0x142 [ 2763.366018]page_referenced+0xfc/0x134 [ 2763.366022]shrink_active_list+0x1ac/0x37d [ 2763.366024]shrink_node_memcg.constprop.72+0x3ca/0x567 [ 2763.366026]shrink_node+0x3f/0x14c [ 2763.366028]try_to_free_pages+0x288/0x47a [ 2763.366030]__alloc_pages_slowpath+0x3a7/0xa49 [ 2763.366032]__alloc_pages_nodemask+0xf1/0x1f9 [ 2763.366035]khugepaged+0xc8/0x167c [ 2763.366037]kthread+0x133/0x13b [ 2763.366039]ret_from_fork+0x27/0x40 [ 2763.366040] other info that might help us 
debug this: [ 2763.366042] Chain exists of: &mapping->i_mmap_rwsem --> &vma->vm_sequence/1 --> fs_reclaim [ 2763.366048] Possible unsafe locking scenario: [ 2763.366049]CPU0CPU1 [ 2763.366050] [ 2763.366051] lock(fs_reclaim); [ 2763.366054]lock(&vma->vm_sequence/1); [ 2763.366056]lock(fs_reclaim); [ 2763.366058] lock(&mapping->i_mmap_rwsem); [ 2763.366061] *** DEADLOCK *** [ 2763.366063] 1 lock held by khugepaged/42: [ 2763.366064] #0: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire+0x12/0x35 [ 2763.366068] stack backtrace: [ 2763.366071] CPU: 2 PID: 42 Comm: khugepaged Not tainted 4.13.0-next-20170913-dbg-000
RE: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR
From: Benjamin Herrenschmidt > Sent: 14 September 2017 04:40 > On Thu, 2017-09-14 at 13:18 +1000, Alexey Kardashevskiy wrote: > > On 14/09/17 13:07, Benjamin Herrenschmidt wrote: > > > On Thu, 2017-09-14 at 12:45 +1000, Alexey Kardashevskiy wrote: > > > > On 31/08/17 13:34, Alexey Kardashevskiy wrote: > > > > > From: Benjamin Herrenschmidt > > > > > > > > Oops, this was not right :) > > > > > > > > Anyway, Ben, please comment. Thanks. > > > > > > This is incorrect, we can do hotplug behind switches afaik. > > > > Do we have an actual system which allows this? > > Tuleta no ? You can logically 'hotplug' PCI(e) on any system [1]. The 'problem' is that whatever enumerates the PCI(e) at system powerup doesn't normally assign extra resources to bridges to allow for devices that aren't present at boot time. So you can normally only replace cards with ones that use the same (or less) resources, or that are not behind any bridges. This is problematic if you have a docking station connected via a bridge. [1] Apart from some annoying x86 Dell servers we have which generate an NMI when the PCIe link goes down (when we reprogram the fpga). They also fail to boot if a link doesn't come up... David
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 14/09/2017 at 01:51, Rob Landley wrote: From: Rob Landley Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move /dev/console open after devtmpfs mount. Add workaround for Debian bug that was copied by Ubuntu. Is that a bug only for Debian? Why? Why should a Debian bug be fixed by a workaround in the mainline kernel? Signed-off-by: Rob Landley --- v2 discussion: http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html drivers/base/Kconfig | 14 -- fs/namespace.c | 14 ++ init/main.c | 15 +-- 3 files changed, 27 insertions(+), 16 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index f046d21..97352d4 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -48,16 +48,10 @@ config DEVTMPFS_MOUNT bool "Automount devtmpfs at /dev, after the kernel mounted the rootfs" depends on DEVTMPFS help - This will instruct the kernel to automatically mount the - devtmpfs filesystem at /dev, directly after the kernel has - mounted the root filesystem. The behavior can be overridden - with the commandline parameter: devtmpfs.mount=0|1. - This option does not affect initramfs based booting, here - the devtmpfs filesystem always needs to be mounted manually - after the rootfs is mounted. - With this option enabled, it allows to bring up a system in - rescue mode with init=/bin/sh, even when the /dev directory - on the rootfs is completely empty. + Automatically mount devtmpfs at /dev on the root filesystem, which + lets the system to come up in rescue mode with [rd]init=/bin/sh. + Override with devtmpfs.mount=0 on the commandline. Initramfs can + create a /dev dir as needed, other rootfs needs the mount point. Why modify the initial text? And why talk about rescue mode only, when this feature mainly concerns the standard mode? config STANDALONE bool "Select only drivers that don't need compile-time external firmware" diff --git a/fs/namespace.c b/fs/namespace.c index f8893dc..06057d7 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2417,7 +2417,21 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags) err = -EBUSY; if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path->mnt->mnt_root == path->dentry) + { + if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT) && + !strcmp(path->mnt->mnt_sb->s_type->name, "devtmpfs")) + { + /* Debian's kernel config enables DEVTMPFS_MOUNT, then + its initramfs setup script tries to mount devtmpfs + again, and if the second mount-over-itself fails + the script overmounts a tmpfs on /dev to hide the + existing contents, then boot fails with empty /dev. */ Does it matter for the kernel code what Debian's kernel config does? + printk(KERN_WARNING "Debian bug workaround for devtmpfs overmount."); Is this log message worth it when this modification goes into the mainline kernel? If so, pr_err() should be used instead. + + err = 0; + } goto unlock; + } err = -EINVAL; if (d_is_symlink(newmnt->mnt.mnt_root)) diff --git a/init/main.c b/init/main.c index 0ee9c686..0d8e5ec 100644 --- a/init/main.c +++ b/init/main.c @@ -1065,12 +1065,6 @@ static noinline void __init kernel_init_freeable(void) do_basic_setup(); - /* Open the /dev/console on the rootfs, this should never fail */ - if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) - pr_err("Warning: unable to open an initial console.\n"); - - (void) sys_dup(0); - (void) sys_dup(0); /* * check if there is an early userspace init. 
If yes, let it do all * the work @@ -1082,8 +1076,17 @@ static noinline void __init kernel_init_freeable(void) if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) { ramdisk_execute_command = NULL; prepare_namespace(); + } else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) { + sys_mkdir("/dev", 0755); Why not, but couldn't we also expect the initramfs to already contain that mount point? + devtmpfs_mount("/dev"); } + /* Open the /dev/console on the rootfs, this should never fail */ + if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) + pr_err("Warning: unable to open an initial console.\n"); + (void) sys_dup(0); + (void) sys_dup(0); + /* * Ok, we have completed the initial bootup, and * we're essentially up and running. Get rid of the Christophe
Re: [PATCH v3 04/20] mm: VMA sequence count
On 14/09/2017 11:11, Sergey Senozhatsky wrote: > On (09/14/17 10:58), Laurent Dufour wrote: > [..] >> That's right, but here this is the sequence counter mm->mm_seq, not the >> vm_seq one. > > d'oh... you are right. So I doubt a deadlock can really happen here, but I don't like to see lockdep complaining. Is there an easy way to make it happy?
Re: [PATCH v3 04/20] mm: VMA sequence count
On (09/14/17 10:58), Laurent Dufour wrote: [..] > That's right, but here this is the sequence counter mm->mm_seq, not the > vm_seq one. d'oh... you are right. -ss
Re: [PATCH v3 04/20] mm: VMA sequence count
On 14/09/2017 10:13, Sergey Senozhatsky wrote: > Hi, > > On (09/14/17 09:55), Laurent Dufour wrote: > [..] >>> so if there are two CPUs, one doing write_seqcount() and the other one >>> doing read_seqcount() then what can happen is something like this >>> >>> CPU0CPU1 >>> >>> fs_reclaim_acquire() >>> write_seqcount_begin() >>> fs_reclaim_acquire()read_seqcount_begin() >>> write_seqcount_end() >>> >>> CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from >>> CPU1, CPU1 can't read_seqcount_begin() because CPU0 did >>> write_seqcount_begin() >>> and now waits for fs_reclaim_acquire(). makes sense? >> >> Yes, this makes sense. >> >> But in the case of this series, there is no call to >> __read_seqcount_begin(), and the reader (the speculative page fault >> handler), is just checking for (vm_seq & 1) and if this is true, simply >> exit the speculative path without waiting. >> So there is no deadlock possibility. > > probably lockdep just knows that those locks interleave at some > point. > > > by the way, I think there is one path that can spin > > find_vma_srcu() > read_seqbegin() > read_seqcount_begin() >raw_read_seqcount_begin() > __read_seqcount_begin() That's right, but here this is the sequence counter mm->mm_seq, not the vm_seq one. This one is held to protect walking the VMA list "locklessly"... Cheers, Laurent.
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote: > The second part of the PTE will hold > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > NOTE: None of the bits in the secondary PTE were used > by 64k-HPTE backed PTEs. Have you measured the performance impact of this? With the second part of the PTE being in a different cache line, there could be one... Cheers, Ben.
Re: [PATCH v3 04/20] mm: VMA sequence count
Hi, On (09/14/17 09:55), Laurent Dufour wrote: [..] > > so if there are two CPUs, one doing write_seqcount() and the other one > > doing read_seqcount() then what can happen is something like this > > > > CPU0CPU1 > > > > fs_reclaim_acquire() > > write_seqcount_begin() > > fs_reclaim_acquire()read_seqcount_begin() > > write_seqcount_end() > > > > CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from > > CPU1, CPU1 can't read_seqcount_begin() because CPU0 did > > write_seqcount_begin() > > and now waits for fs_reclaim_acquire(). makes sense? > > Yes, this makes sense. > > But in the case of this series, there is no call to > __read_seqcount_begin(), and the reader (the speculative page fault > handler), is just checking for (vm_seq & 1) and if this is true, simply > exit the speculative path without waiting. > So there is no deadlock possibility. probably lockdep just knows that those locks interleave at some point. by the way, I think there is one path that can spin find_vma_srcu() read_seqbegin() read_seqcount_begin() raw_read_seqcount_begin() __read_seqcount_begin() -ss
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Thu, 2017-09-14 at 14:38 +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:50 -0700 > Ram Pai wrote: > > > powerpc needs an additional vma bit to support 32 keys. > > Till the additional vma bit lands in include/linux/mm.h > > we have to define it in powerpc specific header file. > > This is needed to get pkeys working on power. > > > > Signed-off-by: Ram Pai > > --- > > "This" being an arch specific hack for the additional bit? Arch VMA bits? Really? I'd rather we limit ourselves to 16 keys first, then push for adding the extra bit to the generic code. Ben.
Re: [PATCH v3 04/20] mm: VMA sequence count
Hi, On 14/09/2017 02:31, Sergey Senozhatsky wrote: > Hi, > > On (09/13/17 18:56), Laurent Dufour wrote: >> Hi Sergey, >> >> On 13/09/2017 13:53, Sergey Senozhatsky wrote: >>> Hi, >>> >>> On (09/08/17 20:06), Laurent Dufour wrote: > [..] >>> ok, so what I got on my box is: >>> >>> vm_munmap() -> down_write_killable(&mm->mmap_sem) >>> do_munmap() >>> __split_vma() >>>__vma_adjust() -> write_seqcount_begin(&vma->vm_sequence) >>>-> write_seqcount_begin_nested(&next->vm_sequence, >>> SINGLE_DEPTH_NESTING) >>> >>> so this gives 3 dependencies ->mmap_sem -> ->vm_seq >>> ->vm_seq -> ->vm_seq/1 >>> ->mmap_sem -> ->vm_seq/1 >>> >>> >>> SyS_mremap() -> down_write_killable(¤t->mm->mmap_sem) >>> move_vma() -> write_seqcount_begin(&vma->vm_sequence) >>> -> write_seqcount_begin_nested(&new_vma->vm_sequence, >>> SINGLE_DEPTH_NESTING); >>> move_page_tables() >>>__pte_alloc() >>> pte_alloc_one() >>> __alloc_pages_nodemask() >>> fs_reclaim_acquire() >>> >>> >>> I think here we have prepare_alloc_pages() call, that does >>> >>> -> fs_reclaim_acquire(gfp_mask) >>> -> fs_reclaim_release(gfp_mask) >>> >>> so that adds one more dependency ->mmap_sem -> ->vm_seq-> >>> fs_reclaim >>> ->mmap_sem -> ->vm_seq/1 -> >>> fs_reclaim >>> >>> >>> now, under memory pressure we hit the slow path and perform direct >>> reclaim. direct reclaim is done under fs_reclaim lock, so we end up >>> with the following call chain >>> >>> __alloc_pages_nodemask() >>> __alloc_pages_slowpath() >>> __perform_reclaim() -> fs_reclaim_acquire(gfp_mask); >>>try_to_free_pages() >>> shrink_node() >>> shrink_active_list() >>> rmap_walk_file() -> i_mmap_lock_read(mapping); >>> >>> >>> and this break the existing dependency. since we now take the leaf lock >>> (fs_reclaim) first and the the root lock (->mmap_sem). >> >> Thanks for looking at this. >> I'm sorry, I should have miss something. > > no prob :) > > >> My understanding is that there are 2 chains of locks: >> 1. from __vma_adjust() mmap_sem -> i_mmap_rwsem -> vm_seq >> 2. from move_vmap() mmap_sem -> vm_seq -> fs_reclaim >> 2. from __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem > > yes, as far as lockdep warning suggests. > >> So the solution would be to have in __vma_adjust() >> mmap_sem -> vm_seq -> i_mmap_rwsem >> >> But this will raised the following dependency from unmap_mapping_range() >> unmap_mapping_range()-> i_mmap_rwsem >> unmap_mapping_range_tree() >> unmap_mapping_range_vma() >>zap_page_range_single() >> unmap_single_vma() >> unmap_page_range() -> vm_seq >> >> And there is no way to get rid of it easily as in unmap_mapping_range() >> there is no VMA identified yet. >> >> That's being said I can't see any clear way to get lock dependency cleaned >> here. >> Furthermore, this is not clear to me how a deadlock could happen as vm_seq >> is a sequence lock, and there is no way to get blocked here. > > as far as I understand, >seq locks can deadlock, technically. 
not on the write() side, but on the read() side: > > read_seqcount_begin() > raw_read_seqcount_begin() >__read_seqcount_begin() > > and __read_seqcount_begin() spins for ever > >__read_seqcount_begin() >{ > repeat: > ret = READ_ONCE(s->sequence); > if (unlikely(ret & 1)) { > cpu_relax(); > goto repeat; > } > return ret; >} > > > so if there are two CPUs, one doing write_seqcount() and the other one > doing read_seqcount() then what can happen is something like this > > CPU0CPU1 > > fs_reclaim_acquire() > write_seqcount_begin() > fs_reclaim_acquire()read_seqcount_begin() > write_seqcount_end() > > CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from > CPU1, CPU1 can't read_seqcount_begin() because CPU0 did write_seqcount_begin() > and now waits for fs_reclaim_acquire(). makes sense? Yes, this makes sense. But in the case of this series, there is no call to __read_seqcount_begin(), and the reader (the speculative page fault handler) just checks for (vm_seq & 1); if this is true, it simply exits the speculative path without waiting. So there is no deadlock possibility. The bad case would be to have 2 concurrent threads calling write_seqcount_begin() on the same VMA, leading to an inconsistent sequence count, but this can't happen because the mmap_sem is held for write in such a case. Cheers, Laurent.
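To make the non-blocking reader concrete, the speculative path described above behaves roughly like the sketch below (the function name spf_try_fault is illustrative; vm_sequence is the seqcount this series adds to struct vm_area_struct). The reader never spins on an odd count; it simply retreats to the classic mmap_sem-protected fault path:

static int spf_try_fault(struct vm_area_struct *vma, unsigned long address)
{
        unsigned int seq;

        seq = raw_read_seqcount(&vma->vm_sequence);
        if (seq & 1)
                return VM_FAULT_RETRY; /* writer active: bail out, don't wait */

        /* ... speculative fault work, done without mmap_sem (elided) ... */

        if (read_seqcount_retry(&vma->vm_sequence, seq))
                return VM_FAULT_RETRY; /* the VMA changed under us: bail out */

        return 0;
}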