[PATCH] selftests/powerpc: Increase timeout for vmx_signal test
On the max config P10 machine (1920 threads and 64TB) this test fails
with a timeout:

  Sending signals to all threads 10 times...!! killing vmx_signal
  !! child died by signal 15
  failure: vmx_signal

The default timeout is 120sec so increase this 3x to 360sec. With this
change the test passes on these large machines.

Signed-off-by: Michael Neuling
---
 tools/testing/selftests/powerpc/math/vmx_signal.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/powerpc/math/vmx_signal.c b/tools/testing/selftests/powerpc/math/vmx_signal.c
index b340a5c4e79d..c307dff19c12 100644
--- a/tools/testing/selftests/powerpc/math/vmx_signal.c
+++ b/tools/testing/selftests/powerpc/math/vmx_signal.c
@@ -151,5 +151,6 @@ int test_signal_vmx(void)
 
 int main(int argc, char *argv[])
 {
+	test_harness_set_timeout(360);
 	return test_harness(test_signal_vmx, "vmx_signal");
 }
--
2.39.2
Re: [PATCH] powerpc/microwatt: Add LiteX MMC driver
On Thu, 2022-08-04 at 11:27 +0930, Joel Stanley wrote: > Enable the LiteX MMC device and it's dependency the common clock > framework. > > Signed-off-by: Joel Stanley Acked-by: Michael Neuling > --- > arch/powerpc/configs/microwatt_defconfig | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/arch/powerpc/configs/microwatt_defconfig > b/arch/powerpc/configs/microwatt_defconfig > index eff933ebbb9e..ea2dbd778aad 100644 > --- a/arch/powerpc/configs/microwatt_defconfig > +++ b/arch/powerpc/configs/microwatt_defconfig > @@ -75,7 +75,12 @@ CONFIG_SPI_BITBANG=y > CONFIG_SPI_SPIDEV=y > # CONFIG_HWMON is not set > # CONFIG_USB_SUPPORT is not set > +CONFIG_MMC=y > +# CONFIG_PWRSEQ_EMMC is not set > +# CONFIG_PWRSEQ_SIMPLE is not set > +CONFIG_MMC_LITEX=y > # CONFIG_VIRTIO_MENU is not set > +CONFIG_COMMON_CLK=y > # CONFIG_IOMMU_SUPPORT is not set > # CONFIG_NVMEM is not set > CONFIG_EXT4_FS=y
Re: [PATCH] powerpc/microwatt: Add mmu bits to device tree
On Fri, 2022-05-20 at 10:06 +1000, Nicholas Piggin wrote: > Excerpts from Joel Stanley's message of May 19, 2022 10:57 pm: > > In commit 5402e239d09f ("powerpc/64s: Get LPID bit width from device > > tree") the kernel tried to determine the pid and lpid bits from the > > device tree. If they are not found, there is a fallback, but Microwatt > > wasn't covered as has the unusual configuration of being both !HV and bare > > metal. > > > > Set the values in the device tree to avoid having to add a special case. > > The lpid value is the only one required, but add both for completeness. > > > > Fixes: 5402e239d09f ("powerpc/64s: Get LPID bit width from device tree") > > Signed-off-by: Joel Stanley > > LGTM... does Microwatt actually need 12 lpid bits, or does it just need > the LPID 0 partition table entry? Microwatt just assumes LPID=0 as it doesn't actually have an LPID SPR. It just uses the first entry in the partition table. There are some details here: https://github.com/antonblanchard/microwatt/commit/18120f153d138f733fa7e8a89c3456bb93683f96 > I wonder if you set it to 1. That's a minor nit though. .. or even set it to 0. Mikey
Re: [PATCH] powerpc/tm: Fix more userspace r13 corruption
On Fri, 2022-03-11 at 12:47 +1000, Nicholas Piggin wrote: > Commit cf13435b730a ("powerpc/tm: Fix userspace r13 corruption") fixes > a problem in treclaim where a SLB miss can occur on the > thread_struct->ckpt_regs while SCRATCH0 is live with the saved user r13 > value, clobbering it with the kernel r13 and ultimately resulting in > kernel r13 being stored in ckpt_regs. > > There is an equivalent problem in trechkpt where the user r13 value is > loaded into r13 from chkpt_regs to be recheckpointed, but a SLB miss > could occur on ckpt_regs accesses after that, which will result in r13 > being clobbered with a kernel value and that will get recheckpointed and > then restored to user registers. > > The same memory page is accessed right before this critical window where > a SLB miss could cause corruption, so hitting the bug requires the SLB > entry be removed within a small window of instructions, which is possible > if a SLB related MCE hits there. PAPR also permits the hypervisor to > discard this SLB entry (because slb_shadow->persistent is only set to > SLB_NUM_BOLTED) although it's not known whether any implementations would > do this (KVM does not). So this is an extremely unlikely bug, only found > by inspection. > > Fix this by also storing user r13 in a temporary location on the kernel > stack and don't chane the r13 register from kernel r13 until the RI=0 > critical section that does not fault. s/chane/change/ > > [ The SCRATCH0 change is not strictly part of the fix, it's only used in > the RI=0 section so it does not have the same problem as the previous > SCRATCH0 bug. ] > > Signed-off-by: Nicholas Piggin This needs to be marked for stable also. Other than that: Acked-by: Michael Neuling Thanks! 
> --- > arch/powerpc/kernel/tm.S | 25 - > 1 file changed, 16 insertions(+), 9 deletions(-) > > diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S > index 3beecc32940b..5a0f023a26e9 100644 > --- a/arch/powerpc/kernel/tm.S > +++ b/arch/powerpc/kernel/tm.S > @@ -443,7 +443,8 @@ restore_gprs: > > REST_GPR(0, r7) /* GPR0 */ > REST_GPRS(2, 4, r7) /* GPR2-4 */ > - REST_GPRS(8, 31, r7)/* GPR8-31 */ > + REST_GPRS(8, 12, r7)/* GPR8-12 */ > + REST_GPRS(14, 31, r7) /* GPR14-31 */ > > /* Load up PPR and DSCR here so we don't run with user values for long > */ > mtspr SPRN_DSCR, r5 > @@ -479,18 +480,24 @@ restore_gprs: > REST_GPR(6, r7) > > /* > - * Store r1 and r5 on the stack so that we can access them after we > - * clear MSR RI. > + * Store user r1 and r5 and r13 on the stack (in the unused save > + * areas / compiler reserved areas), so that we can access them after > + * we clear MSR RI. > */ > > REST_GPR(5, r7) > std r5, -8(r1) > - ld r5, GPR1(r7) > + ld r5, GPR13(r7) > std r5, -16(r1) > + ld r5, GPR1(r7) > + std r5, -24(r1) > > REST_GPR(7, r7) > > - /* Clear MSR RI since we are about to use SCRATCH0. EE is already off > */ > + /* Stash the stack pointer away for use after recheckpoint */ > + std r1, PACAR1(r13) > + > + /* Clear MSR RI since we are about to clobber r13. EE is already off > */ > li r5, 0 > mtmsrd r5, 1 > > @@ -501,9 +508,9 @@ restore_gprs: > * until we turn MSR RI back on. > */ > > - SET_SCRATCH0(r1) > ld r5, -8(r1) > - ld r1, -16(r1) > + ld r13, -16(r1) > + ld r1, -24(r1) > > /* Commit register state as checkpointed state: */ > TRECHKPT > @@ -519,9 +526,9 @@ restore_gprs: > */ > > GET_PACA(r13) > - GET_SCRATCH0(r1) > + ld r1, PACAR1(r13) > > - /* R1 is restored, so we are recoverable again. EE is still off */ > + /* R13, R1 is restored, so we are recoverable again. EE is still off > */ > li r4, MSR_RI > mtmsrd r4, 1 >
Re: [PATCH] powerpc/powernv/memtrace: Fake non-memblock aligned sized traces
CC Rashmica Gupta On Wed, 2020-11-11 at 16:55 +1100, Jordan Niethe wrote: > The hardware trace macros which use the memory provided by memtrace are > able to use trace sizes as small as 16MB. Only memblock aligned values > can be removed from each NUMA node by writing that value to > memtrace/enable in debugfs. This means setting up, say, a 16MB trace is > not possible. To allow such a trace size, instead align whatever value > is written to memtrace/enable to the memblock size for the purpose of > removing it from each NUMA node but report the written value from > memtrace/enable and memtrace/x/size in debugfs. > > Signed-off-by: Jordan Niethe > --- > arch/powerpc/platforms/powernv/memtrace.c | 20 ++-- > 1 file changed, 6 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/memtrace.c > b/arch/powerpc/platforms/powernv/memtrace.c > index 6828108486f8..1188bc8fd090 100644 > --- a/arch/powerpc/platforms/powernv/memtrace.c > +++ b/arch/powerpc/platforms/powernv/memtrace.c > @@ -191,7 +191,7 @@ static int memtrace_init_debugfs(void) > ent->dir = dir; > debugfs_create_file("trace", 0400, dir, ent, _fops); > debugfs_create_x64("start", 0400, dir, >start); > - debugfs_create_x64("size", 0400, dir, >size); > + debugfs_create_x64("size", 0400, dir, _size); > } > > return ret; > @@ -259,33 +259,25 @@ static int memtrace_enable_set(void *data, u64 val) > { > u64 bytes; > > - /* > - * Don't attempt to do anything if size isn't aligned to a memory > - * block or equal to zero. 
> - */ > - bytes = memory_block_size_bytes(); > - if (val & (bytes - 1)) { > - pr_err("Value must be aligned with 0x%llx\n", bytes); > - return -EINVAL; > - } > - > /* Re-add/online previously removed/offlined memory */ > if (memtrace_size) { > if (memtrace_online()) > return -EAGAIN; > } > > + memtrace_size = val; > + > if (!val) > return 0; > > - /* Offline and remove memory */ > - if (memtrace_init_regions_runtime(val)) > + /* Offline and remove memory aligned to memory blocks */ > + bytes = memory_block_size_bytes(); > + if (memtrace_init_regions_runtime(ALIGN(val, bytes))) > return -EINVAL; > > if (memtrace_init_debugfs()) > return -EINVAL; > > - memtrace_size = val; > > return 0; > }
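The core of the memtrace change above is rounding the requested trace size up to the memblock size for the actual removal while still reporting the value the user wrote. As a rough stand-alone illustration (the helper name is invented; the round-up is the kernel's usual power-of-two `ALIGN()` idiom):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the kernel's ALIGN() macro: round x up to the next
 * multiple of a, where a is a power of two (e.g. the memblock size). */
#define ALIGN_UP(x, a) (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Hypothetical helper mirroring the patch's logic: the amount removed
 * from each NUMA node is the aligned value, while debugfs keeps
 * reporting the requested size. */
static uint64_t memtrace_removed_bytes(uint64_t requested, uint64_t block)
{
	return ALIGN_UP(requested, block);
}
```

With, say, a 256MB memory block size, a 16MB request removes one full 256MB block, but size/enable in debugfs still read back 16MB — which is what lets the trace macros use sizes smaller than a memblock.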
[PATCH 1/2] powerpc: Fix user data corruption with P9N DD2.1 VSX CI load workaround emulation
__get_user_atomic_128_aligned() stores to kaddr using stvx which is a
VMX store instruction, hence kaddr must be 16 byte aligned otherwise
the store won't occur as expected.

Unfortunately when we call __get_user_atomic_128_aligned() in
p9_hmi_special_emu(), the buffer we pass as kaddr (ie. vbuf) isn't
guaranteed to be 16B aligned. This means that the write to vbuf in
__get_user_atomic_128_aligned() has the bottom bits of the address
truncated. This results in other local variables being overwritten.
Also vbuf will not contain the correct data which results in the
userspace emulation being wrong and hence user data corruption.

In the past we've been mostly lucky as vbuf has ended up aligned but
this is fragile and isn't always true. CONFIG_STACKPROTECTOR in
particular can change the stack arrangement enough that our luck runs
out.

This issue only occurs on POWER9 Nimbus <= DD2.1 bare metal.

The fix is to align vbuf to a 16 byte boundary.

Fixes: 5080332c2c89 ("powerpc/64s: Add workaround for P9 vector CI load issue")
Signed-off-by: Michael Neuling
Cc:  # v4.15+
---
 arch/powerpc/kernel/traps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index c5f39f13e96e..5006dcbe1d9f 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -885,7 +885,7 @@ static void p9_hmi_special_emu(struct pt_regs *regs)
 {
 	unsigned int ra, rb, t, i, sel, instr, rc;
 	const void __user *addr;
-	u8 vbuf[16], *vdst;
+	u8 vbuf[16] __aligned(16), *vdst;
 	unsigned long ea, msr, msr_mask;
 	bool swap;
--
2.26.2
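For background on why the alignment matters: a 16-byte VMX store (stvx) effectively ignores the low four bits of the effective address, so storing through an unaligned buffer writes to a truncated address instead. The sketch below only models that behaviour in portable C — `stvx_effective_addr()` is an illustration, not the real instruction — and shows the fix: in the kernel, `__aligned(16)` expands to the GCC attribute used here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * stvx ignores the low 4 bits of the effective address, so a store
 * through an unaligned pointer lands at the truncated address. This
 * models that truncation; it is an illustration, not the real VMX
 * instruction.
 */
static uintptr_t stvx_effective_addr(uintptr_t ea)
{
	return ea & ~(uintptr_t)0xf;
}

/*
 * The fix: ask the compiler for 16-byte alignment, as the patch does
 * with __aligned(16) (the kernel's shorthand for GCC's
 * __attribute__((aligned(16)))). The struct name here is invented
 * for illustration.
 */
struct emu_buf {
	uint8_t vbuf[16] __attribute__((aligned(16)));
};
```

With the attribute, the compiler guarantees `vbuf` starts on a 16-byte boundary regardless of how the stack frame is laid out, so the truncation above becomes a no-op.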
[PATCH 2/2] selftests/powerpc: Make alignment handler test P9N DD2.1 vector CI load workaround
alignment_handler currently only tests the unaligned cases but it can
also be useful for testing the workaround for the P9N DD2.1 vector CI
load issue fixed by p9_hmi_special_emu(). This workaround was
introduced in 5080332c2c89 ("powerpc/64s: Add workaround for P9 vector
CI load issue").

This changes the loop to start from offset 0 rather than 1 so that we
test the kernel emulation in p9_hmi_special_emu().

Signed-off-by: Michael Neuling
---
 .../selftests/powerpc/alignment/alignment_handler.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/alignment/alignment_handler.c b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
index 2a0503bc7e49..cb53a8b777e6 100644
--- a/tools/testing/selftests/powerpc/alignment/alignment_handler.c
+++ b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
@@ -266,8 +266,12 @@ int do_test(char *test_name, void (*test_func)(char *, char *))
 	}
 
 	rc = 0;
-	/* offset = 0 no alignment fault, so skip */
-	for (offset = 1; offset < 16; offset++) {
+	/*
+	 * offset = 0 is aligned but tests the workaround for the P9N
+	 * DD2.1 vector CI load issue (see 5080332c2c89 "powerpc/64s:
+	 * Add workaround for P9 vector CI load issue")
+	 */
+	for (offset = 0; offset < 16; offset++) {
 		width = 16; /* vsx == 16 bytes */
 		r = 0;
 
--
2.26.2
Re: [PATCH] powerpc: Fix P10 PVR revision in /proc/cpuinfo for SMT4 cores
On Mon, 2020-08-03 at 22:41 +1000, Michael Ellerman wrote:
> Michael Neuling writes:
> > On POWER10 bit 12 in the PVR indicates if the core is SMT4 or
> > SMT8. Bit 12 is set for SMT4.
> >
> > Without this patch, /proc/cpuinfo on a SMT4 DD1 POWER10 looks like
> > this:
> >   cpu     : POWER10, altivec supported
> >   revision: 17.0 (pvr 0080 1100)
> >
> > Signed-off-by: Michael Neuling
> > ---
> >  arch/powerpc/kernel/setup-common.c | 1 +
> >  1 file changed, 1 insertion(+)
>
> This should have a Fixes: pointing at something so it gets backported.

Yes it should.

Mikey
[PATCH] powerpc: Fix P10 PVR revision in /proc/cpuinfo for SMT4 cores
On POWER10 bit 12 in the PVR indicates if the core is SMT4 or
SMT8. Bit 12 is set for SMT4.

Without this patch, /proc/cpuinfo on a SMT4 DD1 POWER10 looks like
this:

  cpu     : POWER10, altivec supported
  revision: 17.0 (pvr 0080 1100)

Signed-off-by: Michael Neuling
---
 arch/powerpc/kernel/setup-common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index b198b0ff25..808ec9fab6 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -311,6 +311,7 @@ static int show_cpuinfo(struct seq_file *m, void *v)
 		min = pvr & 0xFF;
 		break;
 	case 0x004e: /* POWER9 bits 12-15 give chip type */
+	case 0x0080: /* POWER10 bit 12 gives SMT8/4 */
 		maj = (pvr >> 8) & 0x0F;
 		min = pvr & 0xFF;
 		break;
--
2.26.2
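To see why the bogus "17.0" is printed: without the patch POWER10 falls into the default PVR decode, which keeps the whole high byte of the low halfword — including the SMT4 bit (bit 12) — as the major revision, while the POWER9-style case the patch extends masks the chip-type/SMT bits off. A sketch with the two decodes side by side (helper names invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Default decode: whole byte is the major revision. On a POWER10
 * SMT4 DD1 part (pvr 0x00801100) this folds the SMT4 bit into the
 * major, giving 0x11, printed in decimal as "17". */
static unsigned pvr_maj_default(uint32_t pvr)
{
	return (pvr >> 8) & 0xFF;
}

/* POWER9/POWER10 decode from the patch: mask off the chip-type/SMT
 * bits so only the true major revision remains. */
static unsigned pvr_maj_p10(uint32_t pvr)
{
	return (pvr >> 8) & 0x0F;
}

static unsigned pvr_min(uint32_t pvr)
{
	return pvr & 0xFF;
}
```

So the same DD1 part decodes as revision 17.0 via the default path but 1.0 via the masked path, which is what the one-line `case 0x0080:` addition fixes.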
Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
On Fri, 2020-07-10 at 10:52 +0530, Pratik Rajesh Sampat wrote: > Additional registers DAWR0, DAWRX0 may be lost on Power 10 for > stop levels < 4. > Therefore save the values of these SPRs before entering a "stop" > state and restore their values on wakeup. > > Signed-off-by: Pratik Rajesh Sampat > --- > arch/powerpc/platforms/powernv/idle.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 19d94d021357..f2e2a6a4c274 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -600,6 +600,8 @@ struct p9_sprs { > u64 iamr; > u64 amor; > u64 uamor; > + u64 dawr0; > + u64 dawrx0; > }; > > static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on) > @@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) > sprs.iamr = mfspr(SPRN_IAMR); > sprs.amor = mfspr(SPRN_AMOR); > sprs.uamor = mfspr(SPRN_UAMOR); > + if (cpu_has_feature(CPU_FTR_ARCH_31)) { Can you add a comment here saying even though DAWR0 is ARCH_30, it's only required to be saved on 31. Otherwise this looks pretty odd. > + sprs.dawr0 = mfspr(SPRN_DAWR0); > + sprs.dawrx0 = mfspr(SPRN_DAWRX0); > + } > > srr1 = isa300_idle_stop_mayloss(psscr); /* go idle */ > > @@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long > psscr, bool mmu_on) > mtspr(SPRN_IAMR,sprs.iamr); > mtspr(SPRN_AMOR,sprs.amor); > mtspr(SPRN_UAMOR, sprs.uamor); > + if (cpu_has_feature(CPU_FTR_ARCH_31)) { > + mtspr(SPRN_DAWR0, sprs.dawr0); > + mtspr(SPRN_DAWRX0, sprs.dawrx0); > + } > > /* >* Workaround for POWER9 DD2.0, if we lost resources, the ERAT
Re: [PATCH v2 07/10] powerpc/perf: support BHRB disable bit and new filtering modes
On Wed, 2020-07-01 at 05:20 -0400, Athira Rajeev wrote: > PowerISA v3.1 has few updates for the Branch History Rolling Buffer(BHRB). > First is the addition of BHRB disable bit and second new filtering > modes for BHRB. > > BHRB disable is controlled via Monitor Mode Control Register A (MMCRA) > bit 26, namely "BHRB Recording Disable (BHRBRD)". This field controls > whether BHRB entries are written when BHRB recording is enabled by other > bits. Patch implements support for this BHRB disable bit. Probably good to note here that this is backwards compatible. So if you have a kernel that doesn't know about this bit, it'll clear it and hence you still get BHRB. You should also note why you'd want to do disable this (ie. the core will run faster). > Secondly PowerISA v3.1 introduce filtering support for > PERF_SAMPLE_BRANCH_IND_CALL/COND. The patch adds BHRB filter support > for "ind_call" and "cond" in power10_bhrb_filter_map(). > > 'commit bb19af816025 ("powerpc/perf: Prevent kernel address leak to userspace > via BHRB buffer")' > added a check in bhrb_read() to filter the kernel address from BHRB buffer. > Patch here modified > it to avoid that check for PowerISA v3.1 based processors, since PowerISA v3.1 > allows > only MSR[PR]=1 address to be written to BHRB buffer. > > Signed-off-by: Athira Rajeev > --- > arch/powerpc/perf/core-book3s.c | 27 +-- > arch/powerpc/perf/isa207-common.c | 13 + > arch/powerpc/perf/power10-pmu.c | 13 +++-- > arch/powerpc/platforms/powernv/idle.c | 14 ++ This touches the idle code so we should get those guys on CC (adding Vaidy and Ego). > 4 files changed, 59 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c > index fad5159..9709606 100644 > --- a/arch/powerpc/perf/core-book3s.c > +++ b/arch/powerpc/perf/core-book3s.c > @@ -466,9 +466,13 @@ static void power_pmu_bhrb_read(struct perf_event *event, > struct cpu_hw_events * >* addresses at this point. 
Check the privileges before >* exporting it to userspace (avoid exposure of regions >* where we could have speculative execution) > + * Incase of ISA 310, BHRB will capture only user-space > + * address,hence include a check before filtering code >*/ > - if (is_kernel_addr(addr) && perf_allow_kernel( > >attr) != 0) > - continue; > + if (!(ppmu->flags & PPMU_ARCH_310S)) > + if (is_kernel_addr(addr) && > + perf_allow_kernel(>attr) != 0) > + continue; > > /* Branches are read most recent first (ie. mfbhrb 0 is >* the most recent branch). > @@ -1212,7 +1216,7 @@ static void write_mmcr0(struct cpu_hw_events *cpuhw, > unsigned long mmcr0) > static void power_pmu_disable(struct pmu *pmu) > { > struct cpu_hw_events *cpuhw; > - unsigned long flags, mmcr0, val; > + unsigned long flags, mmcr0, val, mmcra = 0; > > if (!ppmu) > return; > @@ -1245,12 +1249,23 @@ static void power_pmu_disable(struct pmu *pmu) > mb(); > isync(); > > + val = mmcra = cpuhw->mmcr[2]; > + > /* >* Disable instruction sampling if it was enabled >*/ > - if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) { > - mtspr(SPRN_MMCRA, > - cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE); > + if (cpuhw->mmcr[2] & MMCRA_SAMPLE_ENABLE) > + mmcra = cpuhw->mmcr[2] & ~MMCRA_SAMPLE_ENABLE; > + > + /* Disable BHRB via mmcra [:26] for p10 if needed */ > + if (!(cpuhw->mmcr[2] & MMCRA_BHRB_DISABLE)) > + mmcra |= MMCRA_BHRB_DISABLE; > + > + /* Write SPRN_MMCRA if mmcra has either disabled > + * instruction sampling or BHRB > + */ > + if (val != mmcra) { > + mtspr(SPRN_MMCRA, mmcra); > mb(); > isync(); > } > diff --git a/arch/powerpc/perf/isa207-common.c b/arch/powerpc/perf/isa207- > common.c > index 7d4839e..463d925 100644 > --- a/arch/powerpc/perf/isa207-common.c > +++ b/arch/powerpc/perf/isa207-common.c > @@ -404,6 +404,12 @@ int isa207_compute_mmcr(u64 event[], int n_ev, > > mmcra = mmcr1 = mmcr2 = mmcr3 = 0; > > + /* Disable bhrb unless explicitly requested > + * by setting MMCRA [:26] bit. 
> + */ > + if (cpu_has_feature(CPU_FTR_ARCH_31)) > + mmcra |= MMCRA_BHRB_DISABLE; > + > /* Second pass: assign PMCs, set all MMCR1 fields */ > for (i = 0; i < n_ev; ++i) { > pmc =
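The MMCRA handling under review boils down to: clear the instruction-sampling bit, set the BHRB disable bit, and only perform the (relatively expensive) SPR write when the computed value actually differs from the current one. A stand-alone sketch of that control flow, using stand-in bit values rather than the architected MMCRA bit positions:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in bit definitions for illustration only; the real MMCRA
 * layout (SAMPLE_ENABLE, BHRBRD at bit 26) is not reproduced here. */
#define SAMPLE_ENABLE	(1ULL << 0)
#define BHRB_DISABLE	(1ULL << 1)

/* Compute the MMCRA value power_pmu_disable() would want: sampling
 * off, BHRB recording disabled. */
static uint64_t mmcra_disable_value(uint64_t old)
{
	uint64_t mmcra = old & ~SAMPLE_ENABLE;	/* disable sampling */

	mmcra |= BHRB_DISABLE;			/* stop BHRB writes */
	return mmcra;
}

/* The mtspr/mb/isync sequence is only needed if the value changed. */
static int mmcra_needs_write(uint64_t old)
{
	return mmcra_disable_value(old) != old;
}
```

This mirrors the `if (val != mmcra)` guard in the patch: if sampling was already off and BHRB already disabled, no SPR write (and no barrier pair) is issued.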
Re: [PATCH v2 06/10] powerpc/perf: power10 Performance Monitoring support
> @@ -480,6 +520,7 @@ int isa207_compute_mmcr(u64 event[], int n_ev, > mmcr[1] = mmcr1; > mmcr[2] = mmcra; > mmcr[3] = mmcr2; > + mmcr[4] = mmcr3; This is fragile like the kvm vcpu case I commented on before but it gets passed in via a function parameter?! Can you create a struct to store these in rather than this odd ball numbering? The cleanup should start in patch 1/10 here: /* * The order of the MMCR array is: -* - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2 +* - 64-bit, MMCR0, MMCR1, MMCRA, MMCR2, MMCR3 * - 32-bit, MMCR0, MMCR1, MMCR2 */ - unsigned long mmcr[4]; + unsigned long mmcr[5]; mikey
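The cleanup being asked for — named fields instead of a magic-index array — could look something like the struct below. This is purely a sketch of the suggestion; the field names and ordering are hypothetical, not a layout taken from the kernel:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical replacement for "unsigned long mmcr[5]" where mmcr[4]
 * silently means MMCR3. Named fields make isa207_compute_mmcr() and
 * its callers self-documenting and let the compiler catch misuse.
 */
struct mmcr_regs {
	unsigned long mmcr0;
	unsigned long mmcr1;
	unsigned long mmcr2;
	unsigned long mmcr3;
	unsigned long mmcra;
};
```

Callers would then write `mmcr->mmcr3 = mmcr3;` rather than `mmcr[4] = mmcr3;`, which is the point of the review comment: the ordinal no longer has to match an undocumented convention.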
Re: [PATCH v2 04/10] powerpc/perf: Add power10_feat to dt_cpu_ftrs
On Wed, 2020-07-01 at 05:20 -0400, Athira Rajeev wrote: > From: Madhavan Srinivasan > > Add power10 feature function to dt_cpu_ftrs.c along > with a power10 specific init() to initialize pmu sprs. Can you say why you're doing this? Can you add some text about what you're doing to the BHRB in this patch? Mikey > > Signed-off-by: Madhavan Srinivasan > --- > arch/powerpc/include/asm/reg.h| 3 +++ > arch/powerpc/kernel/cpu_setup_power.S | 7 +++ > arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++ > 3 files changed, 36 insertions(+) > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 21a1b2d..900ada1 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -1068,6 +1068,9 @@ > #define MMCR0_PMC2_LOADMISSTIME 0x5 > #endif > > +/* BHRB disable bit for PowerISA v3.10 */ > +#define MMCRA_BHRB_DISABLE 0x0020 > + > /* > * SPRG usage: > * > diff --git a/arch/powerpc/kernel/cpu_setup_power.S > b/arch/powerpc/kernel/cpu_setup_power.S > index efdcfa7..e8b3370c 100644 > --- a/arch/powerpc/kernel/cpu_setup_power.S > +++ b/arch/powerpc/kernel/cpu_setup_power.S > @@ -233,3 +233,10 @@ __init_PMU_ISA207: > li r5,0 > mtspr SPRN_MMCRS,r5 > blr > + > +__init_PMU_ISA31: > + li r5,0 > + mtspr SPRN_MMCR3,r5 > + LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE) > + mtspr SPRN_MMCRA,r5 > + blr > diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c > b/arch/powerpc/kernel/dt_cpu_ftrs.c > index a0edeb3..14a513f 100644 > --- a/arch/powerpc/kernel/dt_cpu_ftrs.c > +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c > @@ -449,6 +449,31 @@ static int __init feat_enable_pmu_power9(struct > dt_cpu_feature *f) > return 1; > } > > +static void init_pmu_power10(void) > +{ > + init_pmu_power9(); > + > + mtspr(SPRN_MMCR3, 0); > + mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE); > +} > + > +static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f) > +{ > + hfscr_pmu_enable(); > + > + init_pmu_power10(); > + init_pmu_registers = init_pmu_power10; > + > + 
cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA; > + cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT; > + > + cur_cpu_spec->num_pmcs = 6; > + cur_cpu_spec->pmc_type = PPC_PMC_IBM; > + cur_cpu_spec->oprofile_cpu_type = "ppc64/power10"; > + > + return 1; > +} > + > static int __init feat_enable_tm(struct dt_cpu_feature *f) > { > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > @@ -638,6 +663,7 @@ struct dt_cpu_feature_match { > {"pc-relative-addressing", feat_enable, 0}, > {"machine-check-power9", feat_enable_mce_power9, 0}, > {"performance-monitor-power9", feat_enable_pmu_power9, 0}, > + {"performance-monitor-power10", feat_enable_pmu_power10, 0}, > {"event-based-branch-v3", feat_enable, 0}, > {"random-number-generator", feat_enable, 0}, > {"system-call-vectored", feat_disable, 0},
Re: [PATCH v2 02/10] KVM: PPC: Book3S HV: Save/restore new PMU registers
> @@ -637,12 +637,12 @@ struct kvm_vcpu_arch {
> 	u32 ccr1;
> 	u32 dbsr;
>
> -	u64 mmcr[5];
> +	u64 mmcr[6];
> 	u32 pmc[8];
> 	u32 spmc[2];
> 	u64 siar;

> +	mfspr	r5, SPRN_MMCR3
> +	mfspr	r6, SPRN_SIER2
> +	mfspr	r7, SPRN_SIER3
> +	std	r5, VCPU_MMCR + 40(r9)
> +	std	r6, VCPU_SIER + 8(r9)
> +	std	r7, VCPU_SIER + 16(r9)

This is looking pretty fragile now. vcpu mmcr[6] stores (in this
strict order): mmcr0, mmcr1, mmcra, mmcr2, mmcrs, mmcr3. Can we clean
that up? Give mmcra and mmcrs their own entries in vcpu and then have
a flat array for mmcr0-3.

Mikey
Re: [RFC PATCH 1/4] powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
On Fri, 2020-05-29 at 11:24 +1000, Alistair Popple wrote: > For what it's worth I tested this series on Mambo PowerNV and it seems to > correctly enable/disable the prefix FSCR bit based on the cpu feature so feel > free to add: > > Tested-by: Alistair Popple > > Mikey is going to test out pseries. FWIW this worked for me in the P10 + powervm sim testing. Tested-by: Michael Neuling > > - Alistair > > On Thursday, 28 May 2020 12:58:40 AM AEST Michael Ellerman wrote: > > __init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc: > > Add support for context switching the TAR register") (Feb 2013), and > > only set FSCR_TAR. > > > > At that point FSCR (Facility Status and Control Register) was not > > context switched, so the setting was permanent after boot. > > > > Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit > > 54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again > > that was permanent after boot. > > > > Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on > > POWER8") (Aug 2013) added a limited context switch of FSCR, just the > > FSCR_DSCR bit was context switched based on thread.dscr_inherit. That > > commit said "This clears the H/FSCR DSCR bit initially", but it > > didn't, it left the initialisation of FSCR_DSCR in __init_FSCR(). > > However the initial context switch from init_task to pid 1 would clear > > FSCR_DSCR because thread.dscr_inherit was 0. > > > > That commit also introduced the requirement that FSCR_DSCR be clear > > for user processes, so that we can take the facility unavailable > > interrupt in order to manage dscr_inherit. > > > > Then in commit 152d523e6307 ("powerpc: Create context switch helpers > > save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to > > thread_struct. However it still wasn't fully context switched, we just > > took the existing value and set FSCR_DSCR if the new thread had > > dscr_inherit set. 
FSCR was still initialised at boot to FSCR_DSCR | > > FSCR_TAR, but that value was not propagated into the thread_struct, so > > the initial context switch set FSCR_DSCR back to 0. > > > > Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context > > switching") (Jun 2016) added a full context switch of the FSCR, and > > added an initialisation of init_task.thread.fscr to FSCR_TAR | > > FSCR_EBB, but omitted FSCR_DSCR. > > > > The end result is that swapper runs with FSCR_DSCR set because of the > > initialisation in __init_FSCR(), but no other processes do, they use > > the value from init_task.thread.fscr. > > > > Having FSCR_DSCR set for swapper allows it to access SPR 3 from > > userspace, but swapper never runs userspace, so it has no useful > > effect. It's also confusing to have the value initialised in two > > places to two different values. > > > > So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the > > point where there's a single value of FSCR, even if it's still set in > > two places. > > > > Signed-off-by: Michael Ellerman > > --- > > arch/powerpc/kernel/cpu_setup_power.S | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/kernel/cpu_setup_power.S > > b/arch/powerpc/kernel/cpu_setup_power.S index a460298c7ddb..f91ecb10d0ae > > 100644 > > --- a/arch/powerpc/kernel/cpu_setup_power.S > > +++ b/arch/powerpc/kernel/cpu_setup_power.S > > @@ -184,7 +184,7 @@ _GLOBAL(__restore_cpu_power9) > > > > __init_FSCR: > > mfspr r3,SPRN_FSCR > > - ori r3,r3,FSCR_TAR|FSCR_DSCR|FSCR_EBB > > + ori r3,r3,FSCR_TAR|FSCR_EBB > > mtspr SPRN_FSCR,r3 > > blr > >
[PATCH] powerpc: Fix misleading small cores print
Currently when we boot on a big core system, we get this print:

  [0.040500] Using small cores at SMT level

This is misleading as we've actually detected big cores. This patch
clears up the print to say we've detected big cores but are using
small cores for scheduling.

Signed-off-by: Michael Neuling
---
 arch/powerpc/kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 6d2a3a3666..c820c95162 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1383,7 +1383,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
-		pr_info("Using small cores at SMT level\n");
+		pr_info("Big cores detected but using small core scheduling\n");
 		power9_topology[0].mask = smallcore_smt_mask;
 		powerpc_topology[0].mask = smallcore_smt_mask;
 	}
--
2.26.2
[PATCH] powerpc/configs: Add LIBNVDIMM to ppc64_defconfig
This gives us OF_PMEM which is useful in mambo.

This adds 153K to the text of ppc64le_defconfig, which is 0.8% of the
total text.

  LIBNVDIMM     text      data     bss      dec       hex
  Without       18574833  5518150  1539240  25632223  1871ddf
  With          18727834  5546206  1539368  25813408  189e1a0

Signed-off-by: Michael Neuling
---
 arch/powerpc/configs/ppc64_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/ppc64_defconfig b/arch/powerpc/configs/ppc64_defconfig
index bae8170d74..0a92549924 100644
--- a/arch/powerpc/configs/ppc64_defconfig
+++ b/arch/powerpc/configs/ppc64_defconfig
@@ -281,6 +281,7 @@ CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_DS1307=y
 CONFIG_VIRTIO_PCI=m
 CONFIG_VIRTIO_BALLOON=m
+CONFIG_LIBNVDIMM=y
 CONFIG_RAS=y
 CONFIG_EXT2_FS=y
 CONFIG_EXT2_FS_XATTR=y
--
2.26.2
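The size accounting quoted above can be sanity-checked directly: text grows from 18574833 to 18727834 bytes, i.e. 153001 bytes (~153K), which integer arithmetic confirms is roughly 0.8% of the original text size. A trivial check:

```c
#include <assert.h>

/* Difference in text size between the "With" and "Without" rows of
 * the size(1)-style table in the commit message. */
static long text_delta(long with, long without)
{
	return with - without;
}

/* Growth expressed in tenths of a percent (per-mille), using integer
 * arithmetic to avoid floating point. */
static long growth_permille(long with, long without)
{
	return (text_delta(with, without) * 1000) / without;
}
```

8 per-mille rounds to the "0.8%" claimed in the commit message.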
Re: [PATCH v2 1/7] powerpc: Add new HWCAP bits
On Tue, 2020-05-19 at 10:31 +1000, Alistair Popple wrote: > POWER10 introduces two new architectural features - ISAv3.1 and matrix > multiply accumulate (MMA) instructions. Userspace detects the presence > of these features via two HWCAP bits introduced in this patch. These > bits have been agreed to by the compiler and binutils team. > > Signed-off-by: Alistair Popple I've test booted this series + powerpc/next (30df74d67d) on top of powervm and OPAL on a P10 simulator. In both cases, it enables MMA and prefix instructions and advertises them via HWCAP2 MMA + ISA 3.1. Hence: Tested-by: Michael Neuling > --- > arch/powerpc/include/uapi/asm/cputable.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/powerpc/include/uapi/asm/cputable.h > b/arch/powerpc/include/uapi/asm/cputable.h > index 540592034740..2692a56bf20b 100644 > --- a/arch/powerpc/include/uapi/asm/cputable.h > +++ b/arch/powerpc/include/uapi/asm/cputable.h > @@ -50,6 +50,8 @@ > #define PPC_FEATURE2_DARN0x0020 /* darn random number insn */ > #define PPC_FEATURE2_SCV 0x0010 /* scv syscall */ > #define PPC_FEATURE2_HTM_NO_SUSPEND 0x0008 /* TM w/out suspended state > */ > +#define PPC_FEATURE2_ARCH_3_10x0004 /* ISA 3.1 */ > +#define PPC_FEATURE2_MMA 0x0002 /* Matrix Multiply Accumulate > */ > > /* > * IMPORTANT!
Re: [PATCH] powerpc: Add new HWCAP bits
On Tue, 2020-03-31 at 12:12 -0300, Tulio Magno Quites Machado Filho wrote:
> Alistair Popple writes:
>
> > diff --git a/arch/powerpc/include/uapi/asm/cputable.h
> > b/arch/powerpc/include/uapi/asm/cputable.h
> > index 540592034740..c6fe10b2 100644
> > --- a/arch/powerpc/include/uapi/asm/cputable.h
> > +++ b/arch/powerpc/include/uapi/asm/cputable.h
> > @@ -50,6 +50,8 @@
> >  #define PPC_FEATURE2_DARN		0x0020 /* darn random number insn */
> >  #define PPC_FEATURE2_SCV		0x0010 /* scv syscall */
> >  #define PPC_FEATURE2_HTM_NO_SUSPEND	0x0008 /* TM w/out suspended state */
> > +#define PPC_FEATURE2_ARCH_3_10		0x0004 /* ISA 3.10 */
>
> I think this should have been:
>
> #define PPC_FEATURE2_ARCH_3_1		0x0004 /* ISA 3.1 */

Agreed. That's the new name.

Sorry Al I should have caught that earlier.

Mikey
[PATCH] powerpc/tm: Document h/rfid and mtmsrd quirk
The ISA has a quirk that's useful for the Linux implementation.
Document it here so others are less likely to trip over it.

Signed-off-by: Michael Neuling
Suggested-by: Michael Ellerman
---
 .../powerpc/transactional_memory.rst | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/powerpc/transactional_memory.rst b/Documentation/powerpc/transactional_memory.rst
index 09955103ac..74ae71001a 100644
--- a/Documentation/powerpc/transactional_memory.rst
+++ b/Documentation/powerpc/transactional_memory.rst
@@ -245,3 +245,30 @@ POWER9N DD2.2.
 Guest migration from POWER8 to POWER9 will work with POWER9N DD2.2
 and POWER9C DD1.2. Since earlier POWER9 processors don't support TM
 emulation, migration from POWER8 to POWER9 is not supported there.
+
+Kernel implementation
+=====================
+
+h/rfid mtmsrd quirk
+===================
+
+As defined in the ISA, rfid has a quirk which is useful in early
+exception handling. When in a userspace transaction and we enter the
+kernel via some exception, MSR will end up as TM=0 and TS=01 (ie. TM
+off but TM suspended). Regularly the kernel will want to change bits
+in the MSR and will perform an rfid to do this. In this case rfid can
+have SRR1 TM = 0 and TS = 00 (ie. TM off and non transactional) and
+the resulting MSR will retain TM = 0 and TS=01 from before (ie. stay
+in suspend). This is a quirk in the architecture as this would
+normally be a transition from TS=01 to TS=00 (ie. suspend -> non
+transactional) which is an illegal transition.
+
+This quirk is described in the architecture in the definition of rfid
+with these lines:
+
+  if (MSR 29:31 ¬ = 0b010 | SRR1 29:31 ¬ = 0b000) then
+      MSR 29:31 <- SRR1 29:31
+
+hrfid and mtmsrd have the same quirk.
+
+The Linux kernel uses this quirk in its early exception handling.
--
2.25.1
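The architected pseudocode quoted in the patch can be modelled directly. This sketch (not kernel code) shows that the MSR keeps its old TS/TM bits only in the one quirky combination — current state "TM off, suspended" (0b010) with SRR1 asking for "TM off, non-transactional" (0b000) — and takes SRR1's bits in every other case:

```c
#include <assert.h>

/*
 * Model of the rfid rule for MSR bits 29:31 (IBM bit numbering):
 *
 *   if (MSR 29:31 != 0b010 | SRR1 29:31 != 0b000) then
 *       MSR 29:31 <- SRR1 29:31
 *
 * i.e. SRR1 is copied unless MSR==0b010 and SRR1==0b000, in which
 * case the MSR is left alone and the CPU stays in suspend.
 */
static unsigned rfid_msr_29_31(unsigned msr, unsigned srr1)
{
	if (msr != 0x2 /* 0b010 */ || srr1 != 0x0 /* 0b000 */)
		return srr1;
	return msr;	/* the quirk: suspended state is retained */
}
```

This is what lets early exception handlers rfid with "plain" kernel MSR values in SRR1 without illegally transitioning a suspended transaction to non-transactional.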
Re: [RFC PATCH v2 00/12] Reduce ifdef mess in ptrace
Christophe, > Le 28/06/2019 à 17:47, Christophe Leroy a écrit : > > The purpose of this series is to reduce the amount of #ifdefs > > in ptrace.c > > > > Any feedback on this series which aims at fixing the issue you opened at > https://github.com/linuxppc/issues/issues/128 ? Yeah, sorry my bad. You did all the hard work and I ignored it. I like the approach and it is along the lines I was thinking. Putting it in a ptrace subdir, splitting out adv_debug_regs, TM, SPE, Altivec, VSX. ppc_gethwdinfo() looks a lot nicer now too (that was some of the worst of it). I've not gone through it with a fine-tooth comb though. There is (rightly) a lot of code moved around which could have introduced some issues. It applies on v5.2 but are you planning on updating it to a newer base? Mikey
Re: [PATCH] powerpc/chrp: Fix enter_rtas() with CONFIG_VMAP_STACK
On Mon, 2020-02-17 at 07:40 +0100, Christophe Leroy wrote: > > Le 16/02/2020 à 23:40, Michael Neuling a écrit : > > On Fri, 2020-02-14 at 08:33 +, Christophe Leroy wrote: > > > With CONFIG_VMAP_STACK, data MMU has to be enabled > > > to read data on the stack. > > > > Can you describe what goes wrong without this? Some oops message? rtas blows > > up? > > Get corrupt data? > > Larry reported a machine check. Or in fact, he reported an Oops in > kprobe_handler(), that Oops being a bug in kprobe_handler() triggered by > this machine check. > > By converting a VM address to a phys-like address as if it was linear > mem, you get in the dark. Either there is some physical memory at that > address and you corrupt it. Or there is none and you get a machine check. Excellent. Please put that in the commit message. > > Also can you say what you're actually doing (ie turning on MSR[DR]) > > Euh ... I'm saying that data MMU has to be enabled, so I'm enabling it. Yeah, it's a minor point. > > > > > Signed-off-by: Christophe Leroy > > > --- > > > arch/powerpc/kernel/entry_32.S | 9 +++-- > > > 1 file changed, 7 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/powerpc/kernel/entry_32.S > > > b/arch/powerpc/kernel/entry_32.S > > > index 0713daa651d9..bc056d906b51 100644 > > > --- a/arch/powerpc/kernel/entry_32.S > > > +++ b/arch/powerpc/kernel/entry_32.S > > > @@ -1354,12 +1354,17 @@ _GLOBAL(enter_rtas) > > > mtspr SPRN_SRR0,r8 > > > mtspr SPRN_SRR1,r9 > > > RFI > > > -1: tophys(r9,r1) > > > +1: tophys_novmstack r9, r1 > > > +#ifdef CONFIG_VMAP_STACK > > > + li r0, MSR_KERNEL & ~MSR_IR/* can take DTLB miss */ > > > > You're potentially turning on more than MSR DR here. This should be clear in > > the > > commit message. > > Am I ? > > At the time of the RFI just above, SRR1 contains the value of r9 which > has been set 2 lines before to MSR_KERNEL & ~(MSR_IR|MSR_DR). > > What should be clear in the commit message ? You're right. 
I was just looking at the patch and not the code. It's clearer in the code. Mikey > > > > + mtmsr r0 > > > + isync > > > +#endif > > > lwz r8,INT_FRAME_SIZE+4(r9) /* get return address */ > > > lwz r9,8(r9)/* original msr value */ > > > addir1,r1,INT_FRAME_SIZE > > > li r0,0 > > > - tophys(r7, r2) > > > + tophys_novmstack r7, r2 > > > stw r0, THREAD + RTAS_SP(r7) > > > mtspr SPRN_SRR0,r8 > > > mtspr SPRN_SRR1,r9 > > Christophe >
Re: [PATCH] KVM: PPC: Book3S HV: Treat unrecognized TM instructions as illegal
On Sun, 2020-02-16 at 23:57 -0600, Segher Boessenkool wrote: > On Mon, Feb 17, 2020 at 12:07:31PM +1100, Michael Neuling wrote: > > On Thu, 2020-02-13 at 10:15 -0500, Gustavo Romero wrote: > > > On P9 DD2.2 due to a CPU defect some TM instructions need to be emulated > > > by > > > KVM. This is handled at first by the hardware raising a softpatch > > > interrupt > > > when certain TM instructions that need KVM assistance are executed in the > > > guest. Some TM instructions, although not defined in the Power ISA, might > > > raise a softpatch interrupt. For instance, 'tresume.' instruction as > > > defined in the ISA must have bit 31 set (1), but an instruction that > > > matches 'tresume.' OP and XO opcodes but has bit 31 not set (0), like > > > 0x7cfe9ddc, also raises a softpatch interrupt, for example, if a code > > > like the following is executed in the guest it will raise a softpatch > > > interrupt just like a 'tresume.' when the TM facility is enabled: > > > > > > int main() { asm("tabort. 0; .long 0x7cfe9ddc;"); } > > > and then treats the executed instruction as 'nop' whilst it should > > > actually > > > be treated as an illegal instruction since it's not defined by the ISA. > > > > The ISA has this: > > > >1.3.3 Reserved Fields, Reserved Values, and Reserved SPRs > > > >Reserved fields in instructions are ignored by the pro- > >cessor. > > > > Hence the hardware will ignore reserved bits. For example executing your > > little > > program on P8 just exits normally with 0x7cfe9ddc being executed as a NOP. > > > > Hence, we should NOP this, not generate an illegal. > > It is not a reserved bit. > > The IMC entry for it matches op1=01 op2=101110 presumably, which > catches all TM instructions and nothing else (bits 0..5 and bits 21..30). > That does not look at bit 31, the softpatch handler has to deal with this. > > Some TM insns have bit 31 as 1 and some have it as /. All instructions > with a "." 
in the mnemonic have bit 31 as 1, all others have it reserved. > The tables in appendices D, E, F show tend. and tsr. as having it > reserved, which contradicts the individual instruction description (and > does not make much sense). (Only tcheck has /, everything else has 1; > everything else has a mnemonic with a dot, and does write CR0 always). Wow, interesting. P8 seems to be treating 31 as a reserved bit (with the table definition rather than the individual instruction description). I'm inclined to match P8 even though it's inconsistent with the dot mnemonic as you say. Mikey
Re: [PATCH] KVM: PPC: Book3S HV: Treat unrecognized TM instructions as illegal
On Thu, 2020-02-13 at 10:15 -0500, Gustavo Romero wrote: > On P9 DD2.2 due to a CPU defect some TM instructions need to be emulated by > KVM. This is handled at first by the hardware raising a softpatch interrupt > when certain TM instructions that need KVM assistance are executed in the > guest. Some TM instructions, although not defined in the Power ISA, might > raise a softpatch interrupt. For instance, 'tresume.' instruction as > defined in the ISA must have bit 31 set (1), but an instruction that > matches 'tresume.' OP and XO opcodes but has bit 31 not set (0), like > 0x7cfe9ddc, also raises a softpatch interrupt, for example, if a code > like the following is executed in the guest it will raise a softpatch > interrupt just like a 'tresume.' when the TM facility is enabled: > > int main() { asm("tabort. 0; .long 0x7cfe9ddc;"); } > > Currently in such a case KVM throws a complete trace like the following: > > [345523.705984] WARNING: CPU: 24 PID: 64413 at > arch/powerpc/kvm/book3s_hv_tm.c:211 kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv] > [345523.705985] Modules linked in: kvm_hv(E) xt_conntrack ipt_REJECT > nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat > iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 > ebtable_filter ebtables ip6table_filter > ip6_tables iptable_filter bridge stp llc sch_fq_codel ipmi_powernv at24 > vmx_crypto ipmi_devintf ipmi_msghandler > ibmpowernv uio_pdrv_genirq kvm opal_prd uio leds_powernv ib_iser rdma_cm > iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp > libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs > blake2b_generic zstd_compress raid10 raid456 > async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor > raid6_pq raid1 raid0 multipath linear tg3 > crct10dif_vpmsum crc32c_vpmsum ipr [last unloaded: kvm_hv] > [345523.706030] CPU: 24 PID: 64413 Comm: CPU 0/KVM Tainted: GW E > 5.5.0+ #1 > [345523.706031] NIP: c008072cb9c0 LR: c008072b5e80 CTR: > c008085c7850 > 
[345523.706034] REGS: c00399467680 TRAP: 0700 Tainted: GW E >(5.5.0+) > [345523.706034] MSR: 90010282b033 > CR: 24022428 XER: > [345523.706042] CFAR: c008072b5e7c IRQMASK: 0 > GPR00: c008072b5e80 c00399467910 c008072db500 > c00375ccc720 > GPR04: c00375ccc720 0003fbec a10395dda5a6 > > GPR08: 7cfe9ddc 7cfe9ddc05dc 7cfe9ddc7c0005dc > c008072cd530 > GPR12: c008085c7850 c003fffeb800 0001 > 7dfb737f > GPR16: c0002001edcca558 > 0001 > GPR20: c1b21258 c0002001edcca558 0018 > > GPR24: 0100 0001 > 1500 > GPR28: c0002001edcc4278 c0037dd8 80050280f033 > c00375ccc720 > [345523.706062] NIP [c008072cb9c0] kvmhv_p9_tm_emulation+0x68/0x620 > [kvm_hv] > [345523.706065] LR [c008072b5e80] > kvmppc_handle_exit_hv.isra.53+0x3e8/0x798 [kvm_hv] > [345523.706066] Call Trace: > [345523.706069] [c00399467910] [c00399467940] 0xc00399467940 > (unreliable) > [345523.706071] [c00399467950] [c00399467980] 0xc00399467980 > [345523.706075] [c003994679f0] [c008072bd1c4] > kvmhv_run_single_vcpu+0xa1c/0xb80 [kvm_hv] > [345523.706079] [c00399467ac0] [c008072bd8e0] > kvmppc_vcpu_run_hv+0x5b8/0xb00 [kvm_hv] > [345523.706087] [c00399467b90] [c008085c93cc] > kvmppc_vcpu_run+0x34/0x48 [kvm] > [345523.706095] [c00399467bb0] [c008085c582c] > kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] > [345523.706101] [c00399467c40] [c008085b7498] > kvm_vcpu_ioctl+0x3d0/0x7b0 [kvm] > [345523.706105] [c00399467db0] [c04adf9c] ksys_ioctl+0x13c/0x170 > [345523.706107] [c00399467e00] [c04adff8] sys_ioctl+0x28/0x80 > [345523.706111] [c00399467e20] [c000b278] system_call+0x5c/0x68 > [345523.706112] Instruction dump: > [345523.706114] 419e0390 7f8a4840 409d0048 6d497c00 2f89075d 419e021c > 6d497c00 2f8907dd > [345523.706119] 419e01c0 6d497c00 2f8905dd 419e00a4 <0fe0> 38210040 > 3860 ebc1fff0 > > and then treats the executed instruction as 'nop' whilst it should actually > be treated as an illegal instruction since it's not defined by the ISA. 
The ISA has this: 1.3.3 Reserved Fields, Reserved Values, and Reserved SPRs Reserved fields in instructions are ignored by the pro- cessor. Hence the hardware will ignore reserved bits. For example executing your little program on P8 just exits normally with 0x7cfe9ddc being executed as a NOP. Hence, we should NOP this, not generate an illegal. Mikey > This commit changes the handling of the case above by treating the > unrecognized TM instructions that can raise a softpatch but are not >
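For reference, the decode being argued over can be sketched as below. The field extraction follows the standard Power ISA instruction layout; the XO value 750 for tsr./tresume. is taken from the ISA and should be treated as an assumption here:

```c
#include <assert.h>
#include <stdint.h>

/* Power instructions are 32 bits; IBM bit 0 is the MSB, bit 31 the LSB */
static uint32_t primary_op(uint32_t insn) { return insn >> 26; }          /* bits 0:5   */
static uint32_t xo_field(uint32_t insn)   { return (insn >> 1) & 0x3ff; } /* bits 21:30 */
static uint32_t ibm_bit31(uint32_t insn)  { return insn & 1; }            /* Rc/reserved */
```

The encoding 0x7cfe9ddc from the thread decodes with primary opcode 31 and XO 750 (tsr.) but bit 31 clear — exactly the case under discussion, since the softpatch IMC match looks only at the opcode fields and ignores bit 31.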
Re: [PATCH v7 4/4] powerpc: Book3S 64-bit "heavyweight" KASAN support
Daniel. Can you start this commit message with a simple description of what you are actually doing? This reads like you've been on a long journey to Mordor and back, which as a reader of this patch in the long distant future, I don't care about. I just want to know what you're implementing. Also I'm struggling to review this as I don't know what software or hardware mechanisms you are using to perform sanitisation. Mikey On Thu, 2020-02-13 at 11:47 +1100, Daniel Axtens wrote: > KASAN support on Book3S is a bit tricky to get right: > > - It would be good to support inline instrumentation so as to be able to >catch stack issues that cannot be caught with outline mode. > > - Inline instrumentation requires a fixed offset. > > - Book3S runs code in real mode after booting. Most notably a lot of KVM >runs in real mode, and it would be good to be able to instrument it. > > - Because code runs in real mode after boot, the offset has to point to >valid memory both in and out of real mode. > > [ppc64 mm note: The kernel installs a linear mapping at effective > address c000... onward. This is a one-to-one mapping with physical > memory from ... onward. Because of how memory accesses work on > powerpc 64-bit Book3S, a kernel pointer in the linear map accesses the > same memory both with translations on (accessing as an 'effective > address'), and with translations off (accessing as a 'real > address'). This works in both guests and the hypervisor. For more > details, see s5.7 of Book III of version 3 of the ISA, in particular > the Storage Control Overview, s5.7.3, and s5.7.5 - noting that this > KASAN implementation currently only supports Radix.] > > One approach is just to give up on inline instrumentation. This way all > checks can be delayed until after everything set is up correctly, and the > address-to-shadow calculations can be overridden. However, the features and > speed boost provided by inline instrumentation are worth trying to do > better. 
> > If _at compile time_ it is known how much contiguous physical memory a > system has, the top 1/8th of the first block of physical memory can be set > aside for the shadow. This is a big hammer and comes with 3 big > consequences: > > - there's no nice way to handle physically discontiguous memory, so only >the first physical memory block can be used. > > - kernels will simply fail to boot on machines with less memory than >specified when compiling. > > - kernels running on machines with more memory than specified when >compiling will simply ignore the extra memory. > > Implement and document KASAN this way. The current implementation is Radix > only. > > Despite the limitations, it can still find bugs, > e.g. http://patchwork.ozlabs.org/patch/1103775/ > > At the moment, this physical memory limit must be set _even for outline > mode_. This may be changed in a later series - a different implementation > could be added for outline mode that dynamically allocates shadow at a > fixed offset. For example, see https://patchwork.ozlabs.org/patch/795211/ > > Suggested-by: Michael Ellerman > Cc: Balbir Singh # ppc64 out-of-line radix version > Cc: Christophe Leroy # ppc32 version > Signed-off-by: Daniel Axtens > > --- > Changes since v6: > - rework kasan_late_init support, which also fixes book3e problem that > snowpatch >picked up (I think) > - fix a checkpatch error that snowpatch picked up > - don't needlessly move the include in kasan.h > > Changes since v5: > - rebase on powerpc/merge, with Christophe's latest changes integrating >kasan-vmalloc > - documentation tweaks based on latest 32-bit changes > > Changes since v4: > - fix some ppc32 build issues > - support ptdump > - clean up the header file. It turns out we don't need or use > KASAN_SHADOW_SIZE, >so just dump it, and make KASAN_SHADOW_END the thing that varies between 32 >and 64 bit. 
As part of this, make sure KASAN_SHADOW_OFFSET is only > configured for >32 bit - it is calculated in the Makefile for ppc64. > - various cleanups > > Changes since v3: > - Address further feedback from Christophe. > - Drop changes to stack walking, it looks like the issue I observed is >related to that particular stack, not stack-walking generally. > > Changes since v2: > > - Address feedback from Christophe around cleanups and docs. > - Address feedback from Balbir: at this point I don't have a good solution >for the issues you identify around the limitations of the inline > implementation >but I think that it's worth trying to get the stack instrumentation > support. >I'm happy to have an alternative and more flexible outline mode - I had >envisoned this would be called 'lightweight' mode as it imposes fewer > restrictions. >I've linked to your implementation. I think it's best to add it in a > follow-up series. > - Made the default PHYS_MEM_SIZE_FOR_KASAN value
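The arithmetic behind "set aside the top 1/8th of the first block of physical memory" is the generic KASAN shadow mapping: one shadow byte covers eight bytes of memory. A sketch — the scale shift is the standard KASAN value, but the offset and sizes here are illustrative, not the real ppc64 constants:

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3	/* 1 shadow byte per 8 bytes */

/* Address-to-shadow translation used by inline/outline instrumentation */
static uintptr_t kasan_mem_to_shadow(uintptr_t addr, uintptr_t shadow_offset)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset;
}

/* Shadow needed to cover a contiguous block of 'size' bytes */
static uintptr_t kasan_shadow_size(uintptr_t size)
{
	return size >> KASAN_SHADOW_SCALE_SHIFT;
}
```

When the memory size is fixed at compile time, `shadow_offset` becomes a compile-time constant, which is what makes inline instrumentation possible.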
Re: Kernel (little-endian) crashing on POWER8 on heavy PowerKVM load
Paulus, Something below for you I think > We have an IBM POWER server (8247-42L) running Linux kernel 5.4.13 on Debian > unstable > hosting a big-endian ppc64 virtual machine running the same kernel in > big-endian > mode. > > When building OpenJDK-11 on the big-endian VM, the testsuite crashes the > *host* system > with the following kernel backtrace. The problem reproduces both with kernel > 4.19.98 > as well as 5.4.13. > > Backtrace has been attached at the end of this mail. > > Thanks, > Adrian > > watson login: [17667518570.438744] BUG: Unable to handle kernel data access > at 0xc2bfd038 > [17667518570.438772] Faulting instruction address: 0xc017a778 > [17667518570.438777] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [17667518570.438781] Faulting instruction address: 0xc02659a0 > [17667518570.438785] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [17667518570.438789] Faulting instruction address: 0xc02659a0 > [17667518570.438793] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [17667518570.438797] Faulting instruction address: 0xc02659a0 > [17667518570.438801] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [17667518570.438804] Faulting instruction address: 0xc02659a0 > [17667518570.438808] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [17667518570.439197] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [ 8142.397983] async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) > raid6_pq(E) libcrc32c(E) crc32c_generic(E) > [17667518570.439207] Faulting instruction address: 0xc02659a0 > [ 8142.397992] raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) > xhci_pci(E) xhci_hcd(E) > [17667518570.439215] Thread overran stack, or stack corrupted > [ 8142.398000] e1000e(E) usbcore(E) ptp(E) pps_core(E) ipr(E) usb_common(E) > [ 8142.398011] CPU: 48 PID: 2571 Comm: CPU 0/KVM Tainted: GE > 5.4.0-0.bpo.3-powerpc64le #1 Debian 5.4.13-1~bpo10+1 > [ 8142.398014] NIP: c00fe3117a00 
LR: c0196b9c CTR: > c00fe3117a00 > [17667518570.439234] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [ 8142.398026] REGS: c00fe315f4c0 TRAP: 0400 Tainted: GE > (5.4.0-0.bpo.3-powerpc64le Debian 5.4.13-1~bpo10+1) > [17667518570.439243] Faulting instruction address: 0xc02659a0 > [17667518570.439245] Thread overran stack, or stack corrupted > [ 8142.398038] MSR: 900010009033 CR: 28448484 > XER: > [ 8142.398046] CFAR: c0196b98 IRQMASK: 1 > [ 8142.398046] GPR00: c0196e0c c00fe315f750 c12e0800 > c00fe31179c0 > [ 8142.398046] GPR04: 0003 > > [ 8142.398046] GPR08: c00fe315f7f0 c00fe3117a00 8030 > c008082bcd80 > [ 8142.398046] GPR12: c00fe3117a00 c00f5a00 > 0008 > [ 8142.398046] GPR16: c13a5c18 c00ff1035e00 c00fe315f8e8 > 0001 > [ 8142.398046] GPR20: c00fe315f8e8 c00fe31179c0 > > [ 8142.398046] GPR24: c00fe315f7f0 0001 > 0003 > [ 8142.398046] GPR28: c00fedc6e750 0010 > c00fe311f8d0 > [ 8142.398079] NIP [c00fe3117a00] 0xc00fe3117a00 > [ 8142.398087] LR [c0196b9c] __wake_up_common+0xcc/0x290 > [17667518570.439321] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [ 8142.398109] Call Trace: > [17667518570.439328] Faulting instruction address: 0xc02659a0 > [17667518570.439330] Thread overran stack, or stack corrupted > [ 8142.398122] [c00fe315f750] [c0196b9c] > __wake_up_common+0xcc/0x290 (unreliable) > [ 8142.398127] [c00fe315f7d0] [c0196e0c] > __wake_up_common_lock+0xac/0x110 > [ 8142.398134] [c00fe315f850] [c008082a9760] > kvmppc_run_core+0x12f8/0x18c0 [kvm_hv] > [ 8142.398140] [c00fe315fa10] [c008082acf14] > kvmppc_vcpu_run_hv+0x62c/0xb20 [kvm_hv] > [ 8142.398149] [c00fe315fae0] [c008081098cc] > kvmppc_vcpu_run+0x34/0x48 [kvm] > [ 8142.398158] [c00fe315fb00] [c0080810587c] > kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] > [ 8142.398166] [c00fe315fb90] [c008080f7ac8] > kvm_vcpu_ioctl+0x340/0x7d0 [kvm] > [ 8142.398172] [c00fe315fd00] [c0445410] do_vfs_ioctl+0xe0/0xac0 > [ 8142.398176] [c00fe315fdb0] [c0445eb4] ksys_ioctl+0xc4/0x110 > [ 8142.398180] 
[c00fe315fe00] [c0445f28] sys_ioctl+0x28/0x80 > [ 8142.398184] [c00fe315fe20] [c000b9c8] system_call+0x5c/0x68 > [ 8142.398186] Instruction dump: > [17667518570.439406] BUG: Unable to handle kernel data access at > 0xc007f9070c08 > [ 8142.398196] >
Re: [PATCH] powerpc/chrp: Fix enter_rtas() with CONFIG_VMAP_STACK
On Fri, 2020-02-14 at 08:33 +, Christophe Leroy wrote: > With CONFIG_VMAP_STACK, data MMU has to be enabled > to read data on the stack. Can you describe what goes wrong without this? Some oops message? rtas blows up? Get corrupt data? Also can you say what you're actually doing (ie turning on MSR[DR]) > Signed-off-by: Christophe Leroy > --- > arch/powerpc/kernel/entry_32.S | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S > index 0713daa651d9..bc056d906b51 100644 > --- a/arch/powerpc/kernel/entry_32.S > +++ b/arch/powerpc/kernel/entry_32.S > @@ -1354,12 +1354,17 @@ _GLOBAL(enter_rtas) > mtspr SPRN_SRR0,r8 > mtspr SPRN_SRR1,r9 > RFI > -1: tophys(r9,r1) > +1: tophys_novmstack r9, r1 > +#ifdef CONFIG_VMAP_STACK > + li r0, MSR_KERNEL & ~MSR_IR/* can take DTLB miss */ You're potentially turning on more than MSR DR here. This should be clear in the commit message. > + mtmsr r0 > + isync > +#endif > lwz r8,INT_FRAME_SIZE+4(r9) /* get return address */ > lwz r9,8(r9)/* original msr value */ > addir1,r1,INT_FRAME_SIZE > li r0,0 > - tophys(r7, r2) > + tophys_novmstack r7, r2 > stw r0, THREAD + RTAS_SP(r7) > mtspr SPRN_SRR0,r8 > mtspr SPRN_SRR1,r9
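To make the "turning on more than MSR DR" point concrete: `MSR_KERNEL & ~MSR_IR` sets DR plus whatever else MSR_KERNEL carries, with only IR left off. A sketch, with MSR_KERNEL simplified to a plausible subset of the 32-bit definition (the exact contents vary by platform, so treat it as an assumption):

```c
#include <assert.h>

#define MSR_IR (1UL << 5)	/* instruction address translation */
#define MSR_DR (1UL << 4)	/* data address translation */
#define MSR_ME (1UL << 12)	/* machine check enable */
#define MSR_RI (1UL << 1)	/* recoverable interrupt */

/* Simplified 32-bit MSR_KERNEL, for illustration only */
#define MSR_KERNEL (MSR_ME | MSR_RI | MSR_IR | MSR_DR)

/* The value the patch loads into r0 before mtmsr */
#define RTAS_MSR (MSR_KERNEL & ~MSR_IR)
```

So the mtmsr enables data translation (the stated goal) but also ME and RI in this model — which is the reviewer's point about the commit message.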
Re: [PATCH 1/1] powerpc/cputable: Remove unnecessary copy of cpu_spec->oprofile_type
On Sat, 2020-02-15 at 02:36 -0300, Leonardo Bras wrote: > Before checking for cpu_type == NULL, this same copy happens, so doing > it here will just write the same value to the t->oprofile_type > again. > > Remove the repeated copy, as it is unnecessary. > > Signed-off-by: Leonardo Bras LGTM Reviewed-by: Michael Neuling > --- > arch/powerpc/kernel/cputable.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c > index e745abc5457a..5a87ec96582f 100644 > --- a/arch/powerpc/kernel/cputable.c > +++ b/arch/powerpc/kernel/cputable.c > @@ -2197,7 +2197,6 @@ static struct cpu_spec * __init setup_cpu_spec(unsigned > long offset, >*/ > if (old.oprofile_cpu_type != NULL) { > t->oprofile_cpu_type = old.oprofile_cpu_type; > - t->oprofile_type = old.oprofile_type; > } > } >
Re: [PATCH v3 1/3] powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery
> Found with tm-signal-context-force-tm kernel selftest. > > v3: Subject and comment improvements. > v2: Fix build failure when tm is disabled. > > Fixes: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the > signal context") > Cc: sta...@vger.kernel.org # v3.9 > Signed-off-by: Gustavo Luiz Duarte Acked-By: Michael Neuling
Re: [PATCH V5 06/14] powerpc/vas: Setup thread IRQ handler per VAS instance
On Sun, 2020-02-09 at 21:17 -0800, Haren Myneni wrote: > On Fri, 2020-02-07 at 16:57 +1100, Michael Neuling wrote: > > > /* > > > + * Process CRBs that we receive on the fault window. > > > + */ > > > +irqreturn_t vas_fault_handler(int irq, void *data) > > > +{ > > > + struct vas_instance *vinst = data; > > > + struct coprocessor_request_block buf, *crb; > > > + struct vas_window *window; > > > + void *fifo; > > > + > > > + /* > > > + * VAS can interrupt with multiple page faults. So process all > > > + * valid CRBs within fault FIFO until reaches invalid CRB. > > > + * NX updates nx_fault_stamp in CRB and pastes in fault FIFO. > > > + * kernel retrives send window from parition send window ID > > > + * (pswid) in nx_fault_stamp. So pswid should be non-zero and > > > + * use this to check whether CRB is valid. > > > + * After reading CRB entry, it is reset with 0's in fault FIFO. > > > + * > > > + * In case kernel receives another interrupt with different page > > > + * fault and CRBs are processed by the previous handling, will be > > > + * returned from this function when it sees invalid CRB (means 0's). > > > + */ > > > + do { > > > + mutex_lock(>mutex); > > > > This isn't going to work. > > > > From Documentation/locking/mutex-design.rst > > > > - Mutexes may not be used in hardware or software interrupt > > contexts such as tasklets and timers. > > Initially used kernel thread per VAS instance and later using IRQ > thread. > > vas_fault_handler() is IRQ thread function, not IRQ handler. I thought > we can use mutex_lock() in thread function. Sorry, I missed it was a threaded IRQ handler, so I think it is ok to use a mutex_lock() in there. You should run with CONFIG_DEBUG_MUTEXES and CONFIG_LOCKDEP enabled to give you some more confidence. It would be good to document how this mutex is used, and to note at the start of the function that it is a threaded handler, so it doesn't get changed later to a non-threaded handler. Mikey
Re: [PATCH V5 09/14] powerpc/vas: Update CSB and notify process for fault CRBs
> > > + > > > + csb.cc = CSB_CC_TRANSLATION; > > > + csb.ce = CSB_CE_TERMINATION; > > > + csb.cs = 0; > > > + csb.count = 0; > > > + > > > + /* > > > + * Returns the fault address in CPU format since it is passed with > > > + * signal. But if the user space expects BE format, need changes. > > > + * i.e either kernel (here) or user should convert to CPU format. > > > + * Not both! > > > + */ > > > + csb.address = be64_to_cpu(crb->stamp.nx.fault_storage_addr); > > > > This looks wrong and I don't understand the comment. You need to convert > > this > > back to be64 to write it to csb.address. ie. > > > > csb.address = cpu_to_be64(be64_to_cpu(crb->stamp.nx.fault_storage_addr)); > > > > Which I think you can just avoid the endian conversion all together. > > NX pastes fault CRB in big-endian, so passing this address in CPU format > to user space, otherwise the library has to convert. OK, then please change the definition in struct coprocessor_status_block to just __u64. struct coprocessor_status_block { u8 flags; u8 cs; u8 cc; u8 ce; __be32 count; __be64 address; } __packed __aligned(CSB_ALIGN); Big "but": I thought "struct coprocessor_status_block" was also written by hardware. If that's the case then it needs to be __be64 and you need the kernel to synthesize exactly what the hardware is doing. Hence the struct definition is correct and the kernel needs to convert to __be64 on writing. > What is the standard way for passing to user space? CPU endian. > > > + * process will be polling on csb.flags after request is sent to > > > + * NX. So generally CSB update should not fail except when an > > > + * application does not follow the process properly. So an error > > > + * message will be displayed and leave it to user space whether > > > + * to ignore or handle this signal. > > > + */ > > > + rcu_read_lock(); > > > + rc = kill_pid_info(SIGSEGV, , pid); > > > + rcu_read_unlock(); > > > > why the rcu_read_un/lock() here? 
> > Used same as in kill_proc_info()/kill_something_info() Please document. Mikey
Re: [PATCH V5 06/14] powerpc/vas: Setup thread IRQ handler per VAS instance
> /* > + * Process CRBs that we receive on the fault window. > + */ > +irqreturn_t vas_fault_handler(int irq, void *data) > +{ > + struct vas_instance *vinst = data; > + struct coprocessor_request_block buf, *crb; > + struct vas_window *window; > + void *fifo; > + > + /* > + * VAS can interrupt with multiple page faults. So process all > + * valid CRBs within fault FIFO until reaches invalid CRB. > + * NX updates nx_fault_stamp in CRB and pastes in fault FIFO. > + * kernel retrives send window from parition send window ID > + * (pswid) in nx_fault_stamp. So pswid should be non-zero and > + * use this to check whether CRB is valid. > + * After reading CRB entry, it is reset with 0's in fault FIFO. > + * > + * In case kernel receives another interrupt with different page > + * fault and CRBs are processed by the previous handling, will be > + * returned from this function when it sees invalid CRB (means 0's). > + */ > + do { > + mutex_lock(>mutex); This isn't going to work. >From Documentation/locking/mutex-design.rst - Mutexes may not be used in hardware or software interrupt contexts such as tasklets and timers. Mikey
Re: [PATCH V5 09/14] powerpc/vas: Update CSB and notify process for fault CRBs
On Wed, 2020-01-22 at 00:17 -0800, Haren Myneni wrote: > For each fault CRB, update fault address in CRB (fault_storage_addr) > and translation error status in CSB so that user space can touch the > fault address and resend the request. If the user space passed invalid > CSB address send signal to process with SIGSEGV. > > Signed-off-by: Sukadev Bhattiprolu > Signed-off-by: Haren Myneni > --- > arch/powerpc/platforms/powernv/vas-fault.c | 116 > + > 1 file changed, 116 insertions(+) > > diff --git a/arch/powerpc/platforms/powernv/vas-fault.c > b/arch/powerpc/platforms/powernv/vas-fault.c > index 5c2cada..2cfab0c 100644 > --- a/arch/powerpc/platforms/powernv/vas-fault.c > +++ b/arch/powerpc/platforms/powernv/vas-fault.c > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -26,6 +27,120 @@ > #define VAS_FAULT_WIN_FIFO_SIZE (4 << 20) > > /* > + * Update the CSB to indicate a translation error. > + * > + * If the fault is in the CSB address itself or if we are unable to > + * update the CSB, send a signal to the process, because we have no > + * other way of notifying the user process. > + * > + * Remaining settings in the CSB are based on wait_for_csb() of > + * NX-GZIP. > + */ > +static void update_csb(struct vas_window *window, > + struct coprocessor_request_block *crb) > +{ > + int rc; > + struct pid *pid; > + void __user *csb_addr; > + struct task_struct *tsk; > + struct kernel_siginfo info; > + struct coprocessor_status_block csb; > + > + /* > + * NX user space windows can not be opened for task->mm=NULL > + * and faults will not be generated for kernel requests. > + */ > + if (!window->mm || !window->user_win) > + return; > + > + csb_addr = (void *)be64_to_cpu(crb->csb_addr); > + > + csb.cc = CSB_CC_TRANSLATION; > + csb.ce = CSB_CE_TERMINATION; > + csb.cs = 0; > + csb.count = 0; > + > + /* > + * Returns the fault address in CPU format since it is passed with > + * signal. 
But if the user space expects BE format, need changes. > + * i.e either kernel (here) or user should convert to CPU format. > + * Not both! > + */ > + csb.address = be64_to_cpu(crb->stamp.nx.fault_storage_addr); This looks wrong and I don't understand the comment. You need to convert this back to be64 to write it to csb.address. ie. csb.address = cpu_to_be64(be64_to_cpu(crb->stamp.nx.fault_storage_addr)); Which I think you can just avoid the endian conversion all together. > + csb.flags = 0; > + > + pid = window->pid; > + tsk = get_pid_task(pid, PIDTYPE_PID); > + /* > + * Send window will be closed after processing all NX requests > + * and process exits after closing all windows. In multi-thread > + * applications, thread may not exists, but does not close FD > + * (means send window) upon exit. Parent thread (tgid) can use > + * and close the window later. > + * pid and mm references are taken when window is opened by > + * process (pid). So tgid is used only when child thread opens > + * a window and exits without closing it in multithread tasks. > + */ > + if (!tsk) { > + pid = window->tgid; > + tsk = get_pid_task(pid, PIDTYPE_PID); > + /* > + * Parent thread will be closing window during its exit. > + * So should not get here. > + */ > + if (!tsk) > + return; > + } > + > + /* Return if the task is exiting. */ > + if (tsk->flags & PF_EXITING) { > + put_task_struct(tsk); > + return; > + } > + > + use_mm(window->mm); > + rc = copy_to_user(csb_addr, , sizeof(csb)); > + /* > + * User space polls on csb.flags (first byte). So add barrier > + * then copy first byte with csb flags update. > + */ > + smp_mb(); > + if (!rc) { > + csb.flags = CSB_V; > + rc = copy_to_user(csb_addr, , sizeof(u8)); > + } > + unuse_mm(window->mm); > + put_task_struct(tsk); > + > + /* Success */ > + if (!rc) > + return; > + > + pr_err("Invalid CSB address 0x%p signalling pid(%d)\n", > + csb_addr, pid_vnr(pid)); This is a userspace error, not a kernel error. This should not be a pr_err(). 
Userspace could spam the console with this. > + > + clear_siginfo(); > + info.si_signo = SIGSEGV; > + info.si_errno = EFAULT; > + info.si_code = SEGV_MAPERR; > + info.si_addr = csb_addr; > + > + /* > + * process will be polling on csb.flags after request is sent to > + * NX. So generally CSB update should not fail except when an > + * application does not follow the process properly. So an error > + * message will be displayed and leave it to user space whether > + * to ignore
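The "add barrier then copy first byte with csb flags update" sequence in the patch is a classic publish pattern: fill in the payload, then set the valid flag last, so a poller never observes the valid bit with stale fields. A userspace C11 model of the idea — the struct layout and the CSB_V value here are illustrative, not the NX hardware definitions:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CSB_V 0x80	/* valid bit; illustrative value */

struct csb {
	_Atomic uint8_t flags;	/* polled by the consumer */
	uint8_t cs, cc, ce;
	uint64_t address;
};

/* Publish payload first, flags last, with release ordering */
static void csb_publish(struct csb *dst, const struct csb *src)
{
	dst->cs = src->cs;
	dst->cc = src->cc;
	dst->ce = src->ce;
	dst->address = src->address;
	/* release store plays the role of the smp_mb() in the patch */
	atomic_store_explicit(&dst->flags, CSB_V, memory_order_release);
}

static int csb_publish_demo(void)
{
	struct csb src = { 0 }, dst = { 0 };

	src.cc = 5;		/* e.g. a translation error code */
	src.address = 0x1000;	/* faulting address */
	csb_publish(&dst, &src);
	return atomic_load_explicit(&dst.flags, memory_order_acquire) == CSB_V
		&& dst.cc == 5 && dst.address == 0x1000;
}
```

In the kernel patch the same ordering is achieved with two copy_to_user() calls separated by a barrier, since the CSB lives in user memory.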
Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] after treclaim
On Thu, 2020-02-06 at 19:13 -0300, Gustavo Luiz Duarte wrote: > > On 2/5/20 1:58 AM, Michael Neuling wrote: > > Other than the minor things below that I think you need, the patch good with > > me. > > > > Acked-by: Michael Neuling > > > > > Subject: Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] > > > after treclaim > > > > The subject should mention "signals". > > How about "powerpc/tm: Clear the current thread's MSR[TS] when > transaction is reclaimed on signal delivery" ? mpe also likes the word "fix" in the subject if it's a fix. So maybe: powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery Thanks, Mikey
Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] after treclaim
Other than the minor things below that I think you need, the patch is good with me. Acked-by: Michael Neuling > Subject: Re: [PATCH v2 1/3] powerpc/tm: Clear the current thread's MSR[TS] > after treclaim The subject should mention "signals". On Mon, 2020-02-03 at 13:09 -0300, Gustavo Luiz Duarte wrote: > After a treclaim, we expect to be in non-transactional state. If we don't > immediately clear the current thread's MSR[TS] and we get preempted, then > tm_recheckpoint_new_task() will recheckpoint and we get rescheduled in > suspended transaction state. It's not "immediately", it's before re-enabling preemption. There is a similar comment in the code that needs to be fixed too. > When handling a signal caught in transactional state, handle_rt_signal64() > calls get_tm_stackpointer() that treclaims the transaction using > tm_reclaim_current() but without clearing the thread's MSR[TS]. This can cause > the TM Bad Thing exception below if later we pagefault and get preempted > trying > to access the user's sigframe, using __put_user(). Afterwards, when we are > rescheduled back into do_page_fault() (but now in suspended state since the > thread's MSR[TS] was not cleared), upon executing 'rfid' after completion of > the page fault handling, the exception is raised because a transition from > suspended to non-transactional state is invalid. 
> > Unexpected TM Bad Thing exception at c000de44 (msr > 0x800302a03031) tm_scratch=80010280b033 > Oops: Unrecoverable exception, sig: 6 [#1] > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 > nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink xts > vmx_crypto sg virtio_balloon > r_mod cdrom virtio_net net_failover virtio_blk virtio_scsi failover > dm_mirror dm_region_hash dm_log dm_mod > CPU: 25 PID: 15547 Comm: a.out Not tainted 5.4.0-rc2 #32 > NIP: c000de44 LR: c0034728 CTR: > REGS: c0003fe7bd70 TRAP: 0700 Not tainted (5.4.0-rc2) > MSR: 800302a03031 CR: 44000884 > XER: > CFAR: c000dda4 IRQMASK: 0 > PACATMSCRATCH: 80010280b033 > GPR00: c0034728 c00f65a17c80 c1662800 > 7fffacf3fd78 > GPR04: 1000 1000 > c00f611f8af0 > GPR08: 78006001 > 000c > GPR12: c00f611f84b0 c0003ffcb200 > > GPR16: > > GPR20: > c00f611f8140 > GPR24: 7fffacf3fd68 c00f65a17d90 > c00f611f7800 > GPR28: c00f65a17e90 c00f65a17e90 c1685e18 > 7fffacf3f000 > NIP [c000de44] fast_exception_return+0xf4/0x1b0 > LR [c0034728] handle_rt_signal64+0x78/0xc50 > Call Trace: > [c00f65a17c80] [c0034710] handle_rt_signal64+0x60/0xc50 > (unreliable) > [c00f65a17d30] [c0023640] do_notify_resume+0x330/0x460 > [c00f65a17e20] [c000dcc4] ret_from_except_lite+0x70/0x74 > Instruction dump: > 7c4ff120 e8410170 7c5a03a6 3840 f8410060 e8010070 e8410080 e8610088 > 6000 6000 e8810090 e8210078 <4c24> 4800 e8610178 > 88ed0989 > ---[ end trace 93094aa44b442f87 ]--- > > The simplified sequence of events that triggers the above exception is: > > ... # userspace in NON-TRANSACTIONAL state > tbegin # userspace in TRANSACTIONAL state > signal delivery # kernelspace in SUSPENDED state > handle_rt_signal64() > get_tm_stackpointer() > treclaim# kernelspace in NON-TRANSACTIONAL state > __put_user() > page fault happens. We will never get back here because of the TM Bad > Thing exception. 
> > page fault handling kicks in and we voluntarily preempt ourselves > do_page_fault() > __schedule() > __switch_to(other_task) > > our task is rescheduled and we recheckpoint because the thread's MSR[TS] > was not cleared > __switch_to(our_task) > switch_to_tm() > tm_recheckpoint_new_task() > trechkpt # kernelspace in SUSPENDED state > > The page fault handling resumes, but now we are in suspended transaction > state > do_page_fault() completes > rfid <- trying to get back where the page fau
CVE-2019-15031: Linux kernel: powerpc: data leak with FP/VMX triggerable by interrupt in transaction
The Linux kernel for powerpc since v4.15 has a bug in its TM handling during interrupts where any user can read the FP/VMX registers of a different user's process. Users of TM + FP/VMX can also experience corruption of their FP/VMX state. To trigger the bug, a process starts a transaction with FP/VMX off and then takes an interrupt. Due to the kernel's incorrect handling of the interrupt, FP/VMX is turned on but the checkpointed state is not updated. If this transaction then rolls back, the checkpointed state may contain the state of a different process. This checkpointed state can then be read by the process hence leaking data from one process to another. The trigger for this bug is an interrupt inside a transaction where FP/VMX is off, hence the process needs FP/VMX off when starting the transaction. FP/VMX availability is under the control of the kernel and is transparent to the user, hence the user has to retry the transaction many times to trigger this bug. High interrupt loads also help trigger this bug. All 64-bit machines where TM is present are affected. This includes all POWER8 variants and POWER9 VMs under KVM or LPARs under PowerVM. POWER9 bare metal doesn't support TM and hence is not affected. The bug was introduced in commit: a7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed") Which was originally merged in v4.15 The upstream fix is here: https://git.kernel.org/torvalds/c/a8318c13e79badb92bc6640704a64cc022a6eb97 The fix can be verified by running the tm-poison test from the kernel selftests. This test is in a patch here: https://patchwork.ozlabs.org/patch/1157467/ which should eventually end up here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/tm/tm-poison.c cheers Mikey
CVE-2019-15030: Linux kernel: powerpc: data leak with FP/VMX triggerable by unavailable exception in transaction
The Linux kernel for powerpc since v4.12 has a bug in its TM handling where any user can read the FP/VMX registers of a different user's process. Users of TM + FP/VMX can also experience corruption of their FP/VMX state. To trigger the bug, a process starts a transaction and reads a FP/VMX register. This transaction can then fail which causes a rollback to the checkpointed state. Due to the kernel taking an FP/VMX unavailable exception inside a transaction and the kernel's incorrect handling of this, the checkpointed state can be set to the FP/VMX registers of another process. This checkpointed state can then be read by the process hence leaking data from one process to another. The trigger for this bug is an FP/VMX unavailable exception inside a transaction, hence the process needs FP/VMX off when starting the transaction. FP/VMX availability is under the control of the kernel and is transparent to the user, hence the user has to retry the transaction many times to trigger this bug. All 64-bit machines where TM is present are affected. This includes all POWER8 variants and POWER9 VMs under KVM or LPARs under PowerVM. POWER9 bare metal doesn't support TM and hence is not affected. The bug was introduced in commit: f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption") Which was originally merged in v4.12 The upstream fix is here: https://git.kernel.org/torvalds/c/8205d5d98ef7f155de211f5e2eb6ca03d95a5a60 The fix can be verified by running the tm-poison test from the kernel selftests. This test is in a patch here: https://patchwork.ozlabs.org/patch/1157467/ which should eventually end up here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/tm/tm-poison.c cheers Mikey
[PATCH 2/3] powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts
From: Gustavo Romero When in userspace and MSR FP=0 the hardware FP state is unrelated to the current process. This is extended for transactions where if tbegin is run with FP=0, the hardware checkpoint FP state will also be unrelated to the current process. Due to this, we need to ensure this hardware checkpoint is updated with the correct state before we enable FP for this process. Unfortunately we get this wrong when returning to a process from a hardware interrupt. A process that starts a transaction with FP=0 can take an interrupt. When the kernel returns back to that process, we change to FP=1 but with hardware checkpoint FP state not updated. If this transaction is then rolled back, the FP registers now contain the wrong state. The process looks like this: Userspace: Kernel Start userspace with MSR FP=0 TM=1 < - ... tbegin bne Hardware interrupt > ret_from_except restore_math() /* sees FP=0 */ restore_fp() tm_active_with_fp() /* sees FP=1 (Incorrect) */ load_fp_state() FP = 0 -> 1 < - Return to userspace with MSR TM=1 FP=1 with junk in the FP TM checkpoint TM rollback reads FP junk When returning from the hardware exception, tm_active_with_fp() is incorrectly making restore_fp() call load_fp_state() which is setting FP=1. The fix is to remove tm_active_with_fp(). tm_active_with_fp() is attempting to handle the case where FP state has been changed inside a transaction. In this case the checkpointed and transactional FP state is different and hence we must restore the FP state (ie. we can't do lazy FP restore inside a transaction that's used FP). It's safe to remove tm_active_with_fp() as this case is handled by restore_tm_state(). restore_tm_state() detects if FP has been used inside a transaction and will set load_fp and call restore_math() to ensure the FP state (checkpoint and transaction) is restored. This is a data integrity problem for the current process as the FP registers are corrupted. 
It's also a security problem as the FP registers from one process may be leaked to another. Similarly for VMX. A simple testcase to replicate this will be posted to tools/testing/selftests/powerpc/tm/tm-poison.c This fixes CVE-2019-15031. Fixes: a7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed") Cc: sta...@vger.kernel.org # 4.15+ Signed-off-by: Gustavo Romero Signed-off-by: Michael Neuling --- arch/powerpc/kernel/process.c | 18 ++ 1 file changed, 2 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 437b57068c..7a84c9f177 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -101,21 +101,8 @@ static void check_if_tm_restore_required(struct task_struct *tsk) } } -static bool tm_active_with_fp(struct task_struct *tsk) -{ - return MSR_TM_ACTIVE(tsk->thread.regs->msr) && - (tsk->thread.ckpt_regs.msr & MSR_FP); -} - -static bool tm_active_with_altivec(struct task_struct *tsk) -{ - return MSR_TM_ACTIVE(tsk->thread.regs->msr) && - (tsk->thread.ckpt_regs.msr & MSR_VEC); -} #else static inline void check_if_tm_restore_required(struct task_struct *tsk) { } -static inline bool tm_active_with_fp(struct task_struct *tsk) { return false; } -static inline bool tm_active_with_altivec(struct task_struct *tsk) { return false; } #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ bool strict_msr_control; @@ -252,7 +239,7 @@ EXPORT_SYMBOL(enable_kernel_fp); static int restore_fp(struct task_struct *tsk) { - if (tsk->thread.load_fp || tm_active_with_fp(tsk)) { + if (tsk->thread.load_fp) { load_fp_state(&current->thread.fp_state); current->thread.load_fp++; return 1; @@ -334,8 +321,7 @@ EXPORT_SYMBOL_GPL(flush_altivec_to_thread); static int restore_altivec(struct task_struct *tsk) { - if (cpu_has_feature(CPU_FTR_ALTIVEC) && - (tsk->thread.load_vec || tm_active_with_altivec(tsk))) { + if (cpu_has_feature(CPU_FTR_ALTIVEC) && (tsk->thread.load_vec)) { load_vr_state(&tsk->thread.vr_state); tsk->thread.used_vr 
= 1; tsk->thread.load_vec++; -- 2.21.0
[PATCH 1/3] powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction
From: Gustavo Romero When we take an FP unavailable exception in a transaction we have to account for the hardware FP TM checkpointed registers being incorrect. In this case for this process we know the current and checkpointed FP registers must be the same (since FP wasn't used inside the transaction) hence in the thread_struct we copy the current FP registers to the checkpointed ones. This copy is done in tm_reclaim_thread(). We use thread->ckpt_regs.msr to determine if FP was on when in userspace. thread->ckpt_regs.msr represents the state of the MSR when exiting userspace. This is set up by check_if_tm_restore_required(). Unfortunately there is an optimisation in giveup_all() which returns early if tsk->thread.regs->msr (via local variable `usermsr`) has FP=VEC=VSX=SPE=0. This optimisation means that check_if_tm_restore_required() is not called and hence thread->ckpt_regs.msr is not updated and will contain an old value. This can happen if due to load_fp=255 we start a userspace process with MSR FP=1 and then we are context switched out. In this case thread->ckpt_regs.msr will contain FP=1. If that same process is then context switched in and load_fp overflows, MSR will have FP=0. If that process now enters a transaction and does an FP instruction, the FP unavailable exception will not update thread->ckpt_regs.msr (the bug) and MSR FP=1 will be retained in thread->ckpt_regs.msr. tm_reclaim_thread() will then not perform the required memcpy and the checkpointed FP regs in the thread struct will contain the wrong values. The code path for this happening is: Userspace: Kernel Start userspace with MSR FP/VEC/VSX/SPE=0 TM=1 < - ... 
tbegin bne fp instruction FP unavailable > fp_unavailable_tm() tm_reclaim_current() tm_reclaim_thread() giveup_all() return early since FP/VMX/VSX=0 /* ckpt MSR not updated (Incorrect) */ tm_reclaim() /* thread_struct ckpt FP regs contain junk (OK) */ /* Sees ckpt MSR FP=1 (Incorrect) */ no memcpy() performed /* thread_struct ckpt FP regs not fixed (Incorrect) */ tm_recheckpoint() /* Put junk in hardware checkpoint FP regs */ < - Return to userspace with MSR TM=1 FP=1 with junk in the FP TM checkpoint TM rollback reads FP junk This is a data integrity problem for the current process as the FP registers are corrupted. It's also a security problem as the FP registers from one process may be leaked to another. This patch moves up check_if_tm_restore_required() in giveup_all() to ensure thread->ckpt_regs.msr is updated correctly. A simple testcase to replicate this will be posted to tools/testing/selftests/powerpc/tm/tm-poison.c Similarly for VMX. This fixes CVE-2019-15030. Fixes: f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption") Cc: sta...@vger.kernel.org # 4.12+ Signed-off-by: Gustavo Romero Signed-off-by: Michael Neuling --- arch/powerpc/kernel/process.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 8fc4de0d22..437b57068c 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -497,13 +497,14 @@ void giveup_all(struct task_struct *tsk) if (!tsk->thread.regs) return; + check_if_tm_restore_required(tsk); + usermsr = tsk->thread.regs->msr; if ((usermsr & msr_all_available) == 0) return; msr_check_and_set(msr_all_available); - check_if_tm_restore_required(tsk); WARN_ON((usermsr & MSR_VSX) && !((usermsr & MSR_FP) && (usermsr & MSR_VEC))); -- 2.21.0
[PATCH 3/3] powerpc/tm: Add tm-poison test
From: Gustavo Romero Add TM selftest to check if FP or VEC register values from one process can leak into another process when both run on the same CPU. This tests for CVE-2019-15030 and CVE-2019-15031. Signed-off-by: Gustavo Romero Signed-off-by: Michael Neuling --- tools/testing/selftests/powerpc/tm/.gitignore | 1 + tools/testing/selftests/powerpc/tm/Makefile | 2 +- .../testing/selftests/powerpc/tm/tm-poison.c | 180 ++ 3 files changed, 182 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/powerpc/tm/tm-poison.c diff --git a/tools/testing/selftests/powerpc/tm/.gitignore b/tools/testing/selftests/powerpc/tm/.gitignore index 951fe855f7..98f2708d86 100644 --- a/tools/testing/selftests/powerpc/tm/.gitignore +++ b/tools/testing/selftests/powerpc/tm/.gitignore @@ -17,3 +17,4 @@ tm-vmx-unavail tm-unavailable tm-trap tm-sigreturn +tm-poison diff --git a/tools/testing/selftests/powerpc/tm/Makefile b/tools/testing/selftests/powerpc/tm/Makefile index c0734ed0ef..b15a1a325b 100644 --- a/tools/testing/selftests/powerpc/tm/Makefile +++ b/tools/testing/selftests/powerpc/tm/Makefile @@ -5,7 +5,7 @@ SIGNAL_CONTEXT_CHK_TESTS := tm-signal-context-chk-gpr tm-signal-context-chk-fpu TEST_GEN_PROGS := tm-resched-dscr tm-syscall tm-signal-msr-resv tm-signal-stack \ tm-vmxcopy tm-fork tm-tar tm-tmspr tm-vmx-unavail tm-unavailable tm-trap \ $(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn tm-signal-sigreturn-nt \ - tm-signal-context-force-tm + tm-signal-context-force-tm tm-poison top_srcdir = ../../../../.. include ../../lib.mk diff --git a/tools/testing/selftests/powerpc/tm/tm-poison.c b/tools/testing/selftests/powerpc/tm/tm-poison.c new file mode 100644 index 00..0113da7a8d --- /dev/null +++ b/tools/testing/selftests/powerpc/tm/tm-poison.c @@ -0,0 +1,180 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright 2019, Gustavo Romero, Michael Neuling, IBM Corp. + * + * This test will spawn two processes. Both will be attached to the same + * CPU (CPU 0). 
The child will be in a loop writing to FP register f31 and + * VMX/VEC/Altivec register vr31 a known value, called poison, calling + * sched_yield syscall after to allow the parent to switch on the CPU. + * Parent will set f31 and vr31 to 1 and in a loop will check if f31 and + * vr31 remain 1 as expected until a given timeout (2m). If the issue is + * present child's poison will leak into parent's f31 or vr31 registers, + * otherwise, poison will never leak into parent's f31 and vr31 registers. + * + * This tests for CVE-2019-15030 and CVE-2019-15031. + */ + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include + +#include "tm.h" + +int tm_poison_test(void) +{ + int pid; + cpu_set_t cpuset; + uint64_t poison = 0xdeadbeefc0dec0fe; + uint64_t unknown = 0; + bool fail_fp = false; + bool fail_vr = false; + + SKIP_IF(!have_htm()); + + /* Attach both Child and Parent to CPU 0 */ + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + sched_setaffinity(0, sizeof(cpuset), &cpuset); + + pid = fork(); + if (!pid) { + /** +* child +*/ + while (1) { + sched_yield(); + asm ( + "mtvsrd 31, %[poison];" // f31 = poison + "mtvsrd 63, %[poison];" // vr31 = poison + + : : [poison] "r" (poison) : ); + } + } + + /** +* parent +*/ + asm ( + /* +* Set r3, r4, and f31 to known value 1 before entering +* in transaction. They won't be written after that. +*/ + " li 3, 0x1 ;" + " li 4, 0x1 ;" + " mtvsrd 31, 4 ;" + + /* +* The Time Base (TB) is a 64-bit counter register that is +* independent of the CPU clock and which is incremented +* at a frequency of 512000000 Hz, so every 1.953125ns. +* So it's necessary 120s/1.953125ns = 61440000000 +* increments to get a 2 minutes timeout. Below we set that +* value in r5 and then use r6 to track initial TB value, +* updating TB values in r7 at every iteration and comparing it +* to r6. When r7 (current) - r6 (initial) > 61440000000 we bail +* out since for sure we spent already 2 minutes in the loop. +* SPR 268 is the TB register. 
+*/ + " lis 5, 14 ;" + " ori 5, 5, 19996 ;" + " sldi5, 5, 16
CVE-2019-13648: Linux kernel: powerpc: kernel crash in TM handling triggerable by any local user
The Linux kernel for powerpc since v3.9 has a bug in the TM handling where any unprivileged local user may crash the operating system. This bug affects machines using 64-bit CPUs where Transactional Memory (TM) is not present or has been disabled (see below for more details on affected CPUs). To trigger the bug a process constructs a signal context which still has the MSR TS bits set. That process then passes this signal context to the sigreturn() system call. When returning back to userspace, the kernel then crashes with a bad TM transition (TM Bad Thing) or by executing TM code on a non-TM system. All 64bit machines where TM is not present are affected. This includes PowerPC 970 (G5), PA6T, POWER5/6/7 VMs under KVM or LPARs under PowerVM and POWER9 bare metal. Additionally systems with TM hardware but where TM is disabled in software (via ppc_tm=off kernel cmdline) are also affected. This includes POWER8/9 VMs under KVM or LPARs under PowerVM and POWER8 bare metal. The bug was introduced in commit: 2b0a576d15e0 ("powerpc: Add new transactional memory state to the signal context") Which was originally merged in v3.9. The upstream fix is here: https://git.kernel.org/torvalds/c/f16d80b75a096c52354c6e0a574993f3b0dfbdfe The fix can be verified by running `sigfuz -m` from the kernel selftests: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/powerpc/signal/sigfuz.c?h=v5.2 cheers Mikey
[PATCH] powerpc/tm: Fix oops on sigreturn on systems without TM
On systems like P9 powernv where we have no TM (or P8 booted with ppc_tm=off), userspace can construct a signal context which still has the MSR TS bits set. The kernel tries to restore this context which results in the following crash: [ 74.980557] Unexpected TM Bad Thing exception at c00022fc (msr 0x800102a03031) tm_scratch=80020280f033 [ 74.980741] Oops: Unrecoverable exception, sig: 6 [#1] [ 74.980820] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 74.980917] Modules linked in: [ 74.980980] CPU: 0 PID: 1636 Comm: sigfuz Not tainted 5.2.0-11043-g0a8ad0ffa4 #69 [ 74.981096] NIP: c00022fc LR: 7fffb2d67e48 CTR: [ 74.981212] REGS: c0003fffbd70 TRAP: 0700 Not tainted (5.2.0-11045-g7142b497d8) [ 74.981325] MSR: 800102a03031 CR: 42004242 XER: [ 74.981463] CFAR: c00022e0 IRQMASK: 0 [ 74.981463] GPR00: 0072 7fffb2b6e560 7fffb2d87f00 0669 [ 74.981463] GPR04: 7fffb2b6e728 7fffb2b6f2a8 [ 74.981463] GPR08: [ 74.981463] GPR12: 7fffb2b76900 [ 74.981463] GPR16: 7fffb237 7fffb2d84390 7fffea3a15ac 01000a250420 [ 74.981463] GPR20: 7fffb2b6f260 10001770 [ 74.981463] GPR24: 7fffb2d843a0 7fffea3a14a0 0001 0080 [ 74.981463] GPR28: 7fffea3a14d8 003d0f00 7fffb2b6e728 [ 74.982420] NIP [c00022fc] rfi_flush_fallback+0x7c/0x80 [ 74.982517] LR [7fffb2d67e48] 0x7fffb2d67e48 [ 74.982593] Call Trace: [ 74.982632] Instruction dump: [ 74.982691] e96a0220 e96a02a8 e96a0330 e96a03b8 394a0400 4200ffdc 7d2903a6 e92d0c00 [ 74.982809] e94d0c08 e96d0c10 e82d0c18 7db242a6 <4c24> 7db243a6 7db142a6 f82d0c18 The problem is the signal code assumes TM is enabled when CONFIG_PPC_TRANSACTIONAL_MEM is on. This may not be the case as with P9 powernv or if `ppc_tm=off` is used on P8. This means any local user can crash the system. Fix the problem by returning a bad stack frame to the user if they try to set the MSR TS bits with sigreturn() on systems where TM is not supported. Found with sigfuz kernel selftest on P9. This fixes CVE-2019-13648. 
Fixes: 2b0a576d15 ("powerpc: Add new transactional memory state to the signal context") Cc: sta...@vger.kernel.org # v3.9 Reported-by: Praveen Pandey Signed-off-by: Michael Neuling --- arch/powerpc/kernel/signal_32.c | 3 +++ arch/powerpc/kernel/signal_64.c | 5 + 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c index f50b708d6d..98600b276f 100644 --- a/arch/powerpc/kernel/signal_32.c +++ b/arch/powerpc/kernel/signal_32.c @@ -1198,6 +1198,9 @@ SYSCALL_DEFINE0(rt_sigreturn) goto bad; if (MSR_TM_ACTIVE(msr_hi<<32)) { + /* Trying to start TM on non TM system */ + if (!cpu_has_feature(CPU_FTR_TM)) + goto bad; /* We only recheckpoint on return if we're * transaction. */ diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 2f80e270c7..117515564e 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -771,6 +771,11 @@ SYSCALL_DEFINE0(rt_sigreturn) if (MSR_TM_ACTIVE(msr)) { /* We recheckpoint on return. */ struct ucontext __user *uc_transact; + + /* Trying to start TM on non TM system */ + if (!cpu_has_feature(CPU_FTR_TM)) + goto badframe; + if (__get_user(uc_transact, &uc->uc_link)) goto badframe; if (restore_tm_sigcontexts(current, &uc->uc_mcontext, -- 2.21.0
Re: [PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation
On Mon, 2019-06-24 at 21:48 +1000, Michael Ellerman wrote: > Michael Neuling writes: > > When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The > > code currently sets: > > CR0 <- 00 || MSR[TS] > > but according to the ISA it should be: > > CR0 <- 0 || MSR[TS] || 0 > > Seems bad, what's the worst case impact? It's a data integrity issue as CR0 is corrupted. > Do we have a test case for this? Suraj has a KVM unit test for it. > > This fixes the bit shift to put the bits in the correct location. > > Fixes: ? It's been around since we first wrote the code so: Fixes: 4bb3c7a0208fc13c ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9") Mikey
[PATCH] KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation
When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The code currently sets: CR0 <- 00 || MSR[TS] but according to the ISA it should be: CR0 <- 0 || MSR[TS] || 0 This fixes the bit shift to put the bits in the correct location. Tested-by: Suraj Jitindar Singh Signed-off-by: Michael Neuling --- arch/powerpc/kvm/book3s_hv_tm.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_tm.c b/arch/powerpc/kvm/book3s_hv_tm.c index 888e2609e3..31cd0f327c 100644 --- a/arch/powerpc/kvm/book3s_hv_tm.c +++ b/arch/powerpc/kvm/book3s_hv_tm.c @@ -131,7 +131,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu) } /* Set CR0 to indicate previous transactional state */ vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) | - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28); + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29); /* L=1 => tresume, L=0 => tsuspend */ if (instr & (1 << 21)) { if (MSR_TM_SUSPENDED(msr)) @@ -175,7 +175,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu) /* Set CR0 to indicate previous transactional state */ vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) | - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28); + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29); vcpu->arch.shregs.msr &= ~MSR_TS_MASK; return RESUME_GUEST; @@ -205,7 +205,7 @@ int kvmhv_p9_tm_emulation(struct kvm_vcpu *vcpu) /* Set CR0 to indicate previous transactional state */ vcpu->arch.regs.ccr = (vcpu->arch.regs.ccr & 0x0fffffff) | - (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 28); + (((msr & MSR_TS_MASK) >> MSR_TS_S_LG) << 29); vcpu->arch.shregs.msr = msr | MSR_TS_S; return RESUME_GUEST; } -- 2.21.0
Re: [PATCH v5 2/2] powerpc: Fix compile issue with force DAWR
On Tue, 2019-06-18 at 18:28 +0200, Christophe Leroy wrote: > > Le 04/06/2019 à 05:00, Michael Neuling a écrit : > > If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail > > at linking with: > >arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined > > reference to `dawr_force_enable' > > > > This was caused by commit c1fe190c0672 ("powerpc: Add force enable of > > DAWR on P9 option"). > > > > This moves a bunch of code around to fix this. It moves a lot of the > > DAWR code in a new file and creates a new CONFIG_PPC_DAWR to enable > > compiling it. > > After looking at all this once more, I'm just wondering: why are we > creating stuff specific to DAWR ? > > In the old days, we only add DABR, and everything was named on DABR. > When DAWR was introduced some years ago we renamed stuff like do_dabr() > to do_break() so that we could regroup things together. And now we are > taking dawr() out of the rest. Why not keep dabr() stuff and dawr() > stuff all together in something dedicated to breakpoints, and try to > regroup all breakpoint stuff in a single place ? I see some > breakpointing stuff done in kernel/process.c and other things done in > hw_breakpoint.c, to common functions call from one file to the other, > preventing GCC to fully optimise, etc ... > > Also, behing this thinking, I have the idea that we could easily > implement 512 bytes breakpoints on the 8xx too. The 8xx have neither > DABR nor DAWR, but is using a set of comparators. And as you can see in > the 8xx version of __set_dabr() in kernel/process.c, we emulate the DABR > behaviour by setting two comparators. By using the same comparators with > a different setup, we should be able to implement breakpoints on larger > ranges of address. Christophe, I agree that there are opportunities to refactor this code and I appreciate your efforts in making this code better but... 
We have a problem here of not being able to compile an odd ball case that almost no one ever hits (it was just an odd mpe CI case). We're up to v5 of a simple fix which is just silly. So let's get this fix in and move on to the whole bunch of refactoring we can do in this code which is already documented in the github issue tracking. Regards, Mikey
Re: [PATCH 5/5] Powerpc/Watchpoint: Fix length calculation for unaligned target
On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote: > Watchpoint match range is always doubleword(8 bytes) aligned on > powerpc. If the given range is crossing doubleword boundary, we > need to increase the length such that next doubleword also get > covered. Ex, > > address len = 6 bytes > [ASCII diagram: a 6-byte range starting mid-doubleword spills into the next 8-byte doubleword] > > In such case, current code configures hw as: > start_addr = address & ~HW_BREAKPOINT_ALIGN > len = 8 bytes > > And thus read/write in last 4 bytes of the given range is ignored. > Fix this by including next doubleword in the length. Watchpoint > exception handler already ignores extraneous exceptions, so no > changes required for that. Nice catch. Thanks. I assume this has been broken forever? Should we be CCing stable? If so, it would be nice to have this self-contained (separate from the refactor) so we can more easily backport it. Also, can you update tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c to catch this issue? A couple more comments below. 
> > Signed-off-by: Ravi Bangoria > --- > arch/powerpc/include/asm/hw_breakpoint.h | 7 ++-- > arch/powerpc/kernel/hw_breakpoint.c | 44 +--- > arch/powerpc/kernel/process.c| 34 -- > 3 files changed, 60 insertions(+), 25 deletions(-) > > diff --git a/arch/powerpc/include/asm/hw_breakpoint.h > b/arch/powerpc/include/asm/hw_breakpoint.h > index 8acbbdd4a2d5..749a357164d5 100644 > --- a/arch/powerpc/include/asm/hw_breakpoint.h > +++ b/arch/powerpc/include/asm/hw_breakpoint.h > @@ -34,6 +34,8 @@ struct arch_hw_breakpoint { > #define HW_BRK_TYPE_PRIV_ALL (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \ >HW_BRK_TYPE_HYP) > > +#define HW_BREAKPOINT_ALIGN 0x7 > + > #ifdef CONFIG_HAVE_HW_BREAKPOINT > #include > #include > @@ -45,8 +47,6 @@ struct pmu; > struct perf_sample_data; > struct task_struct; > > -#define HW_BREAKPOINT_ALIGN 0x7 > - > extern int hw_breakpoint_slots(int type); > extern int arch_bp_generic_fields(int type, int *gen_bp_type); > extern int arch_check_bp_in_kernelspace(struct arch_hw_breakpoint *hw); > @@ -76,7 +76,8 @@ static inline void hw_breakpoint_disable(void) > } > extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); > int hw_breakpoint_handler(struct die_args *args); > - > +extern u16 hw_breakpoint_get_final_len(struct arch_hw_breakpoint *brk, > + unsigned long *start_addr, unsigned long *end_addr); > extern int set_dawr(struct arch_hw_breakpoint *brk); > extern bool dawr_force_enable; > static inline bool dawr_enabled(void) > diff --git a/arch/powerpc/kernel/hw_breakpoint.c > b/arch/powerpc/kernel/hw_breakpoint.c > index 36bcf705df65..c122fd55aa44 100644 > --- a/arch/powerpc/kernel/hw_breakpoint.c > +++ b/arch/powerpc/kernel/hw_breakpoint.c > @@ -126,6 +126,28 @@ int arch_bp_generic_fields(int type, int *gen_bp_type) > return 0; > } > > +/* Maximum len for DABR is 8 bytes and DAWR is 512 bytes */ > +static int hw_breakpoint_validate_len(struct arch_hw_breakpoint *hw) > +{ > + u16 length_max = 8; > + u16 final_len; > + unsigned long 
start_addr, end_addr; > + > + final_len = hw_breakpoint_get_final_len(hw, &start_addr, &end_addr); > + > + if (dawr_enabled()) { > + length_max = 512; > + /* DAWR region can't cross 512 bytes boundary */ > + if ((start_addr >> 9) != (end_addr >> 9)) > + return -EINVAL; > + } > + > + if (final_len > length_max) > + return -EINVAL; > + > + return 0; > +} > + > /* > * Validate the arch-specific HW Breakpoint register settings > */ > @@ -133,12 +155,10 @@ int hw_breakpoint_arch_parse(struct perf_event *bp, >const struct perf_event_attr *attr, >struct arch_hw_breakpoint *hw) > { > - int length_max; > - > if (!ppc_breakpoint_available()) > return -ENODEV; > > - if (!bp) > + if (!bp || !attr->bp_len) > return -EINVAL; > > hw->type = HW_BRK_TYPE_TRANSLATE; > @@ -160,23 +180,7 @@ int hw_breakpoint_arch_parse(struct perf_event *bp, > hw->address = attr->bp_addr; > hw->len = attr->bp_len; > > - length_max = 8; /* DABR */ > - if (dawr_enabled()) { > - length_max = 512 ; /* 64 doublewords */ > - /* DAWR region can't cross 512 bytes boundary */ > - if ((hw->address >> 9) != ((hw->address + hw->len - 1) >> 9)) > - return -EINVAL; > - } > - > - /* > - * Since breakpoint length can be a maximum of length_max and > - * breakpoint addresses are aligned to nearest double-word > - * HW_BREAKPOINT_ALIGN by rounding off to the lower address, > - * the 'symbolsize' should satisfy the check
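The validation logic in the patch above boils down to two checks: after double-word alignment, the effective length must not exceed 8 bytes (DABR) or 512 bytes (DAWR), and a DAWR range must not cross a 512-byte boundary. A stand-alone sketch of those checks, outside the kernel, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch (not the kernel code) of the checks in the quoted
 * hw_breakpoint_validate_len(): align the requested range to double
 * words (HW_BREAKPOINT_ALIGN is 0x7), then enforce the 8-byte DABR /
 * 512-byte DAWR limits and the DAWR 512-byte-boundary restriction.
 */
static bool bp_len_ok(unsigned long addr, unsigned long len, bool dawr)
{
    unsigned long start = addr & ~0x7UL;            /* round down */
    unsigned long end = (addr + len - 1) | 0x7UL;   /* round up   */
    unsigned long final_len = end - start + 1;
    unsigned long length_max = dawr ? 512 : 8;

    /* DAWR region can't cross a 512-byte boundary */
    if (dawr && (start >> 9) != (end >> 9))
        return false;

    return final_len <= length_max;
}
```

Note how an unaligned 8-byte target needs a 16-byte aligned range, which is exactly the case the series fixes for DABR.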
Re: [PATCH 0/5] Powerpc/hw-breakpoint: Fixes plus Code refactor
On Tue, 2019-06-18 at 08:01 +0200, Christophe Leroy wrote: > > Le 18/06/2019 à 06:27, Ravi Bangoria a écrit : > > patch 1-3: Code refactor > > patch 4: Speedup disabling breakpoint > > patch 5: Fix length calculation for unaligned targets > > While you are playing with hw breakpoints, did you have a look at > https://github.com/linuxppc/issues/issues/38 ? Agreed and also: https://github.com/linuxppc/issues/issues/170 https://github.com/linuxppc/issues/issues/128 Mikey
Re: [PATCH 4/5] Powerpc/hw-breakpoint: Optimize disable path
On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote: > Directly setting dawr and dawrx with 0 should be enough to > disable watchpoint. No need to reset individual bits in > variable and then set in hw. This seems like a pointless optimisation to me. I'm all for adding more code/complexity if it buys us some performance, but I can't imagine this is a fast path (nor have you stated any performance benefits). Mikey > > Signed-off-by: Ravi Bangoria > --- > arch/powerpc/include/asm/hw_breakpoint.h | 3 ++- > arch/powerpc/kernel/process.c| 12 > 2 files changed, 14 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/hw_breakpoint.h > b/arch/powerpc/include/asm/hw_breakpoint.h > index 78202d5fb13a..8acbbdd4a2d5 100644 > --- a/arch/powerpc/include/asm/hw_breakpoint.h > +++ b/arch/powerpc/include/asm/hw_breakpoint.h > @@ -19,6 +19,7 @@ struct arch_hw_breakpoint { > /* Note: Don't change the the first 6 bits below as they are in the same > order > * as the dabr and dabrx. > */ > +#define HW_BRK_TYPE_DISABLE 0x00 > #define HW_BRK_TYPE_READ 0x01 > #define HW_BRK_TYPE_WRITE0x02 > #define HW_BRK_TYPE_TRANSLATE0x04 > @@ -68,7 +69,7 @@ static inline void hw_breakpoint_disable(void) > struct arch_hw_breakpoint brk; > > brk.address = 0; > - brk.type = 0; > + brk.type = HW_BRK_TYPE_DISABLE; > brk.len = 0; > if (ppc_breakpoint_available()) > __set_breakpoint(&brk); > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index f002d286..265fac9fb3a4 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -793,10 +793,22 @@ static inline int set_dabr(struct arch_hw_breakpoint > *brk) > return __set_dabr(dabr, dabrx); > } > > +static int disable_dawr(void) > +{ > + if (ppc_md.set_dawr) > + return ppc_md.set_dawr(0, 0); > + > + mtspr(SPRN_DAWRX, 0); > + return 0; > +} > + > int set_dawr(struct arch_hw_breakpoint *brk) > { > unsigned long dawr, dawrx, mrd; > > + if (brk->type == HW_BRK_TYPE_DISABLE) > + return
disable_dawr(); > + > dawr = brk->address; > > dawrx = (brk->type & HW_BRK_TYPE_RDWR) << (63 - 58);
Re: [PATCH 3/5] Powerpc/hw-breakpoint: Refactor set_dawr()
This is going to collide with this patch https://patchwork.ozlabs.org/patch/1109594/ Mikey On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote: > Remove unnecessary comments. Code itself is self explanatory. > And, ISA already talks about MRD field. I Don't think we need > to re-describe it. > > Signed-off-by: Ravi Bangoria > --- > arch/powerpc/kernel/process.c | 17 + > 1 file changed, 5 insertions(+), 12 deletions(-) > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index f0fbbf6a6a1f..f002d286 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -799,18 +799,11 @@ int set_dawr(struct arch_hw_breakpoint *brk) > > dawr = brk->address; > > - dawrx = (brk->type & (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)) \ > -<< (63 - 58); //* read/write bits */ > - dawrx |= ((brk->type & (HW_BRK_TYPE_TRANSLATE)) >> 2) \ > -<< (63 - 59); //* translate */ > - dawrx |= (brk->type & (HW_BRK_TYPE_PRIV_ALL)) \ > ->> 3; //* PRIM bits */ > - /* dawr length is stored in field MDR bits 48:53. Matches range in > -doublewords (64 bits) baised by -1 eg. 0b00=1DW and > -0b11=64DW. > -brk->len is in bytes. > -This aligns up to double word size, shifts and does the bias. > - */ > + dawrx = (brk->type & HW_BRK_TYPE_RDWR) << (63 - 58); > + dawrx |= ((brk->type & HW_BRK_TYPE_TRANSLATE) >> 2) << (63 - 59); > + dawrx |= (brk->type & HW_BRK_TYPE_PRIV_ALL) >> 3; > + > + /* brk->len is in bytes. */ > mrd = ((brk->len + 7) >> 3) - 1; > dawrx |= (mrd & 0x3f) << (63 - 53); >
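The arithmetic being refactored above packs the breakpoint type and length into the DAWRX control word; the MRD field in particular is easy to get wrong, since it holds (range in doublewords) minus one. A stand-alone sketch of just that packing, with the shift constants copied from the quoted patch (the PRIV_ALL value is assumed from the kernel header's USER|KERNEL|HYP bits, which are not shown in this mail):

```c
#include <assert.h>

/* Type bits as in the quoted arch/powerpc header; PRIV_ALL assumed */
#define HW_BRK_TYPE_READ      0x01
#define HW_BRK_TYPE_WRITE     0x02
#define HW_BRK_TYPE_TRANSLATE 0x04
#define HW_BRK_TYPE_RDWR      (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)
#define HW_BRK_TYPE_PRIV_ALL  0x38

/* MRD (DAWRX bits 48:53) holds the match range in doublewords, biased
 * by -1: 0b000000 = 1 doubleword ... 0b111111 = 64 doublewords. */
static unsigned long mrd_from_len(unsigned long len)
{
    return ((len + 7) >> 3) - 1;  /* round up to doublewords, then bias */
}

/* Same packing as the refactored set_dawr() in the quoted patch */
static unsigned long pack_dawrx(int type, unsigned long len)
{
    unsigned long dawrx;

    dawrx  = (unsigned long)(type & HW_BRK_TYPE_RDWR) << (63 - 58);
    dawrx |= (unsigned long)((type & HW_BRK_TYPE_TRANSLATE) >> 2) << (63 - 59);
    dawrx |= (unsigned long)(type & HW_BRK_TYPE_PRIV_ALL) >> 3;
    dawrx |= (mrd_from_len(len) & 0x3f) << (63 - 53);
    return dawrx;
}
```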
Re: [PATCH 1/5] Powerpc/hw-breakpoint: Replace stale do_dabr() with do_break()
> Subject: Powerpc/hw-breakpoint: Replace stale do_dabr() with do_break() Can you add the word "comment" to this subject. Currently it implies there are code changes here. Mikey On Tue, 2019-06-18 at 09:57 +0530, Ravi Bangoria wrote: > do_dabr() was renamed with do_break() long ago. But I still see > some comments mentioning do_dabr(). Replace it. > > Signed-off-by: Ravi Bangoria > --- > arch/powerpc/kernel/hw_breakpoint.c | 2 +- > arch/powerpc/kernel/ptrace.c| 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/kernel/hw_breakpoint.c > b/arch/powerpc/kernel/hw_breakpoint.c > index a293a53b4365..1908e4fcc132 100644 > --- a/arch/powerpc/kernel/hw_breakpoint.c > +++ b/arch/powerpc/kernel/hw_breakpoint.c > @@ -232,7 +232,7 @@ int hw_breakpoint_handler(struct die_args *args) >* Return early after invoking user-callback function without restoring >* DABR if the breakpoint is from ptrace which always operates in >* one-shot mode. The ptrace-ed process will receive the SIGTRAP signal > - * generated in do_dabr(). > + * generated in do_break(). >*/ > if (bp->overflow_handler == ptrace_triggered) { > perf_bp_event(bp, regs); > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > index 684b0b315c32..44b823e5e8c8 100644 > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -2373,7 +2373,7 @@ void ptrace_triggered(struct perf_event *bp, > /* >* Disable the breakpoint request here since ptrace has defined a >* one-shot behaviour for breakpoint exceptions in PPC64. > - * The SIGTRAP signal is generated automatically for us in do_dabr(). > + * The SIGTRAP signal is generated automatically for us in do_break(). >* We don't have to do anything about that here >*/ > attr = bp->attr;
Re: [PATCH] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
> > > 3: > > > /* Emulate H_SET_DABR/X on P8 for the sake of compat mode > > > guests */ > > > rlwimi r5, r4, 5, DAWRX_DR | DAWRX_DW > > > c010b03c: 74 2e 85 50 rlwimi r5,r4,5,25,26 > > > rlwimi r5, r4, 2, DAWRX_WT > > > c010b040: f6 16 85 50 rlwimi r5,r4,2,27,27 > > > clrrdi r4, r4, 3 > > > c010b044: 24 07 84 78 rldicr r4,r4,0,60 > > > std r4, VCPU_DAWR(r3) > > > c010b048: c0 13 83 f8 std r4,5056(r3) > > > std r5, VCPU_DAWRX(r3) > > > c010b04c: c8 13 a3 f8 std r5,5064(r3) > > > mtspr SPRN_DAWR, r4 > > > c010b050: a6 2b 94 7c mtspr 180,r4 > > > mtspr SPRN_DAWRX, r5 > > > c010b054: a6 2b bc 7c mtspr 188,r5 > > > li r3, 0 > > > c010b058: 00 00 60 38 li r3,0 > > > blr > > > c010b05c: 20 00 80 4e blr > > > > It's the `mtspr SPRN_DAWR, r4` as you're HV=0. I'm not sure how > > nested works > > in that regard. Is the level above suppose to trap and emulate > > that? > > > > Yeah so as a nested hypervisor we need to avoid that call to mtspr > SPRN_DAWR since it's HV privileged and we run with HV = 0. > > The fix will be to check kvmhv_on_pseries() before doing the write. In > fact we should avoid the write any time we call the function from _not_ > real mode. > > I'll submit a fix for the KVM side. Doesn't look like this is anything > to do with Mikey's patch, was always broken as far as I can tell. Thanks Suraj. Mikey
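The fix sketched in this reply amounts to guarding the HV-privileged `mtspr SPRN_DAWR` behind a "running nested?" check: a nested hypervisor (HV=0) only records the guest's value and lets the level above apply it. Roughly, with all state mocked and all names illustrative rather than the actual KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Mocked state: the guest's software copy of DAWR vs. the real SPR. */
struct dawr_state {
    unsigned long soft;   /* vcpu->arch.dawr-style software copy */
    unsigned long hw;     /* stands in for the SPRN_DAWR register */
};

/*
 * mtspr SPRN_DAWR is HV-privileged. A nested hypervisor runs with
 * HV=0, so it must only update its software copy; writing the SPR
 * directly is what crashed here.
 */
static void set_guest_dawr(struct dawr_state *s, unsigned long val,
                           bool nested_on_pseries)
{
    s->soft = val;               /* always keep the software copy */
    if (!nested_on_pseries)
        s->hw = val;             /* the real mtspr, only when HV=1 */
}
```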
[PATCH v2] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
Commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") screwed up some assembler and corrupted a pointer in r3. This resulted in crashes like the below: [ 44.374746] BUG: Kernel NULL pointer dereference at 0x13bf [ 44.374848] Faulting instruction address: 0xc010b044 [ 44.374906] Oops: Kernel access of bad area, sig: 11 [#1] [ 44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi failover [ 44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3 [ 44.375500] NIP: c010b044 LR: c008089dacf4 CTR: c010aff4 [ 44.375604] REGS: c0179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) [ 44.375691] MSR: 8280b033 CR: 42244842 XER: [ 44.375815] CFAR: c010aff8 DAR: 13bf DSISR: 4200 IRQMASK: 0 [ 44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 [ 44.375815] GPR04: 0003 2444b05d c017f11c45d0 [ 44.375815] GPR08: 07803e018dfe 0028 0001 0075 [ 44.375815] GPR12: c010aff4 c7ff6300 [ 44.375815] GPR16: c017f11d c017f11ca7a8 [ 44.375815] GPR20: c017f11c42ec 000a [ 44.375815] GPR24: fffc c017f11c c1a77ed8 [ 44.375815] GPR28: c0179af7 fffc c008089ff170 c0179ae88540 [ 44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68 [ 44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv] [ 44.376849] Call Trace: [ 44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c (unreliable) [ 44.376982] [c0179b397a10] [c008089dd6bc] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] [ 44.377084] [c0179b397ae0] [c008093f8bcc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 44.377185] [c0179b397b00] [c008093f522c] 
kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [ 44.377286] [c0179b397b90] [c008093e3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [ 44.377384] [c0179b397d00] [c04ba6c4] do_vfs_ioctl+0xe4/0xb40 [ 44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110 [ 44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80 [ 44.377628] [c0179b397e20] [c000b888] system_call+0x5c/0x70 [ 44.377712] Instruction dump: [ 44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0 896b 2c2b 3860 [ 44.377862] 4d820020 50852e74 508516f6 78840724 f8a313c8 7c942ba6 7cbc2ba6 Fix the bug by only changing r3 when we are returning immediately. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling Reported-by: Cédric Le Goater -- mpe: This is for 5.2 fixes v2: Review from Christophe Leroy - De-Mikey/Cedric-ify commit message - Add "Fixes:" - Other trivial commit messages changes - No code change --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 139027c62d..f781ee1458 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -2519,8 +2519,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) LOAD_REG_ADDR(r11, dawr_force_enable) lbz r11, 0(r11) cmpdi r11, 0 + bne 3f li r3, H_HARDWARE - beqlr + blr +3: /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */ rlwimi r5, r4, 5, DAWRX_DR | DAWRX_DW rlwimi r5, r4, 2, DAWRX_WT -- 2.21.0
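The underlying bug this patch fixes is a classic conditional-return hazard: the original sequence loaded the error code into r3 *before* the conditional return (`beqlr`), so the fall-through path continued with r3 (the vcpu pointer) already clobbered. In C terms, the broken and fixed control flow look roughly like this (a sketch of the shape of the bug, not the real KVM code):

```c
#include <assert.h>

#define H_HARDWARE (-1L)

/* Broken shape: r3 is clobbered before we know whether we return,
 * so the fall-through path loses the vcpu pointer. */
static long h_set_dabr_broken(long vcpu_r3, int dawr_force_enable)
{
    long r3 = vcpu_r3;

    r3 = H_HARDWARE;                /* li r3, H_HARDWARE -- clobbers r3 */
    if (dawr_force_enable == 0)     /* beqlr: return only if disabled */
        return r3;
    return r3;                      /* fall through: r3 is no longer vcpu */
}

/* Fixed shape: branch around the error path; touch r3 only on the
 * path that actually returns. */
static long h_set_dabr_fixed(long vcpu_r3, int dawr_force_enable)
{
    long r3 = vcpu_r3;

    if (dawr_force_enable == 0) {   /* bne 3f */
        r3 = H_HARDWARE;            /* li r3, H_HARDWARE */
        return r3;                  /* blr */
    }
    return r3;                      /* 3: continue with the vcpu intact */
}
```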
Re: [PATCH] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
On Wed, 2019-06-12 at 09:43 +0200, Cédric Le Goater wrote: > On 12/06/2019 09:22, Michael Neuling wrote: > > In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 > > option") I screwed up some assembler and corrupted a pointer in > > r3. This resulted in crashes like the below from Cédric: > > > > [ 44.374746] BUG: Kernel NULL pointer dereference at 0x13bf > > [ 44.374848] Faulting instruction address: 0xc010b044 > > [ 44.374906] Oops: Kernel access of bad area, sig: 11 [#1] > > [ 44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > > pSeries > > [ 44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM > > iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack > > nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp > > bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables > > iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv > > kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover > > virtio_scsi failover > > [ 44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not > > tainted 5.2.0-rc4+ #3 > > [ 44.375500] NIP: c010b044 LR: c008089dacf4 CTR: > > c010aff4 > > [ 44.375604] REGS: c0179b397710 TRAP: 0300 Not tainted > > (5.2.0-rc4+) > > [ 44.375691] MSR: 8280b033 > > CR: 42244842 XER: > > [ 44.375815] CFAR: c010aff8 DAR: 13bf DSISR: > > 4200 IRQMASK: 0 > > [ 44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 > > > > [ 44.375815] GPR04: 0003 2444b05d > > c017f11c45d0 > > [ 44.375815] GPR08: 07803e018dfe 0028 0001 > > 0075 > > [ 44.375815] GPR12: c010aff4 c7ff6300 > > > > [ 44.375815] GPR16: c017f11d > > c017f11ca7a8 > > [ 44.375815] GPR20: c017f11c42ec > > 000a > > [ 44.375815] GPR24: fffc c017f11c > > c1a77ed8 > > [ 44.375815] GPR28: c0179af7 fffc c008089ff170 > > c0179ae88540 > > [ 44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68 > > [ 44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 > > [kvm_hv] > > 
[ 44.376849] Call Trace: > > [ 44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c > > (unreliable) > > [ 44.376982] [c0179b397a10] [c008089dd6bc] > > kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] > > [ 44.377084] [c0179b397ae0] [c008093f8bcc] > > kvmppc_vcpu_run+0x34/0x48 [kvm] > > [ 44.377185] [c0179b397b00] [c008093f522c] > > kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] > > [ 44.377286] [c0179b397b90] [c008093e3618] > > kvm_vcpu_ioctl+0x460/0x850 [kvm] > > [ 44.377384] [c0179b397d00] [c04ba6c4] > > do_vfs_ioctl+0xe4/0xb40 > > [ 44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110 > > [ 44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80 > > [ 44.377628] [c0179b397e20] [c0000000b888] system_call+0x5c/0x70 > > [ 44.377712] Instruction dump: > > [ 44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0 896b > > 2c2b 3860 > > [ 44.377862] 4d820020 50852e74 508516f6 78840724 f8a313c8 > > 7c942ba6 7cbc2ba6 > > > > This fixes the problem by only changing r3 when we are returning > > immediately. > > > > Signed-off-by: Michael Neuling > > Reported-by: Cédric Le Goater > > On nested, I still see : > > [ 94.609274] Oops: Exception in kernel mode, sig: 4 [#1] > [ 94.609432] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > pSeries > [ 94.609596] Modules linked in: vhost_net vhost tap xt_CHECKSUM > iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack > nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp > bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables > iptable_filter bpfilter vmx_crypto kvm_hv crct10dif_vpmsum crc32c_vpmsum kvm > sch_fq_codel ip_tables x_tables autofs4 virtio_net virtio_scsi net_failover > failover > [ 94.610179] CPU: 12 PID: 2026 Comm: qemu-system-ppc Kdump: loaded No
[PATCH] KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr()
In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") I screwed up some assembler and corrupted a pointer in r3. This resulted in crashes like the below from Cédric: [ 44.374746] BUG: Kernel NULL pointer dereference at 0x13bf [ 44.374848] Faulting instruction address: 0xc010b044 [ 44.374906] Oops: Kernel access of bad area, sig: 11 [#1] [ 44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries [ 44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi failover [ 44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3 [ 44.375500] NIP: c010b044 LR: c008089dacf4 CTR: c010aff4 [ 44.375604] REGS: c0179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) [ 44.375691] MSR: 8280b033 CR: 42244842 XER: [ 44.375815] CFAR: c010aff8 DAR: 13bf DSISR: 4200 IRQMASK: 0 [ 44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 [ 44.375815] GPR04: 0003 2444b05d c017f11c45d0 [ 44.375815] GPR08: 07803e018dfe 0028 0001 0075 [ 44.375815] GPR12: c010aff4 c7ff6300 [ 44.375815] GPR16: c017f11d c017f11ca7a8 [ 44.375815] GPR20: c017f11c42ec 000a [ 44.375815] GPR24: fffc c017f11c c1a77ed8 [ 44.375815] GPR28: c0179af7 fffc c008089ff170 c0179ae88540 [ 44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68 [ 44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv] [ 44.376849] Call Trace: [ 44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c (unreliable) [ 44.376982] [c0179b397a10] [c008089dd6bc] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] [ 44.377084] [c0179b397ae0] [c008093f8bcc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 44.377185] [c0179b397b00] [c008093f522c] 
kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] [ 44.377286] [c0179b397b90] [c008093e3618] kvm_vcpu_ioctl+0x460/0x850 [kvm] [ 44.377384] [c0179b397d00] [c04ba6c4] do_vfs_ioctl+0xe4/0xb40 [ 44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110 [ 44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80 [ 44.377628] [c0179b397e20] [c000b888] system_call+0x5c/0x70 [ 44.377712] Instruction dump: [ 44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0 896b 2c2b 3860 [ 44.377862] 4d820020 50852e74 508516f6 78840724 f8a313c8 7c942ba6 7cbc2ba6 This fixes the problem by only changing r3 when we are returning immediately. Signed-off-by: Michael Neuling Reported-by: Cédric Le Goater -- mpe: This is for 5.2 fixes --- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 139027c62d..f781ee1458 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -2519,8 +2519,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) LOAD_REG_ADDR(r11, dawr_force_enable) lbz r11, 0(r11) cmpdi r11, 0 + bne 3f li r3, H_HARDWARE - beqlr + blr +3: /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */ rlwimi r5, r4, 5, DAWRX_DR | DAWRX_DW rlwimi r5, r4, 2, DAWRX_WT -- 2.21.0
Re: [PATCH v2] powerpc: Add force enable of DAWR on P9 option
On Tue, 2019-06-11 at 09:51 +0200, Christophe Leroy wrote: > > Le 11/06/2019 à 09:24, Michael Neuling a écrit : > > On Tue, 2019-06-11 at 08:48 +0200, Cédric Le Goater wrote: > > > On 11/06/2019 08:44, Michael Neuling wrote: > > > > > > 2: > > > > > > -BEGIN_FTR_SECTION > > > > > > - /* POWER9 with disabled DAWR */ > > > > > > + LOAD_REG_ADDR(r11, dawr_force_enable) > > > > > > + lbz r11, 0(r11) > > > > > > + cmpdi r11, 0 > > > > > > li r3, H_HARDWARE > > > > > > - blr > > > > > > -END_FTR_SECTION_IFCLR(CPU_FTR_DAWR) > > > > > > + beqlr > > > > > > > > > > Why is this a 'beqlr' ? Shouldn't it be a blr ? > > > > > > > > I believe it's right and should be a beqlr. It's to replace the FTR > > > > section to > > > > make it dynamic based on the dawr_force_enable bit. > > > > > > hmm, see the crash below on a L1 running a nested guest. r3 is set > > > to -1 (H_HARDWARE) but a vpcu pointer was expected. How can we fix > > > this ? > > > > > > C. > > > > > > > > > [ 44.374746] BUG: Kernel NULL pointer dereference at 0x13bf > > > [ 44.374848] Faulting instruction address: 0xc010b044 > > > [ 44.374906] Oops: Kernel access of bad area, sig: 11 [#1] > > > [ 44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > > > pSeries > > > [ 44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM > > > iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack > > > nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 > > > xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter > > > ip6_tables iptable_filter bpfilter vmx_crypto crct10dif_vpmsum > > > crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 > > > virtio_net net_failover virtio_scsi failover > > > [ 44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not > > > tainted 5.2.0-rc4+ #3 > > > [ 44.375500] NIP: c010b044 LR: c008089dacf4 CTR: > > > c010aff4 > > > [ 44.375604] REGS: c0179b397710 TRAP: 0300 Not tainted (5.2.0- > > > rc4+) > > > [ 
44.375691] MSR: 8280b033 > > > CR: 42244842 XER: > > > [ 44.375815] CFAR: c010aff8 DAR: 13bf DSISR: > > > 4200 IRQMASK: 0 > > > [ 44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 > > > > > > [ 44.375815] GPR04: 0003 2444b05d > > > c017f11c45d0 > > > [ 44.375815] GPR08: 07803e018dfe 0028 0001 > > > 0075 > > > [ 44.375815] GPR12: c010aff4 c7ff6300 > > > > > > [ 44.375815] GPR16: c017f11d > > > c017f11ca7a8 > > > [ 44.375815] GPR20: c017f11c42ec > > > 000a > > > [ 44.375815] GPR24: fffc c017f11c > > > c1a77ed8 > > > [ 44.375815] GPR28: c0179af7 fffc c008089ff170 > > > c0179ae88540 > > > [ 44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68 > > > [ 44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 > > > [kvm_hv] > > > [ 44.376849] Call Trace: > > > [ 44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c > > > (unreliable) > > > [ 44.376982] [c0179b397a10] [c008089dd6bc] > > > kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] > > > [ 44.377084] [c0179b397ae0] [c008093f8bcc] > > > kvmppc_vcpu_run+0x34/0x48 [kvm] > > > [ 44.377185] [c0179b397b00] [c008093f522c] > > > kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] > > > [ 44.377286] [c0179b397b90] [c008093e3618] > > > kvm_vcpu_ioctl+0x460/0x850 [kvm] > > > [ 44.377384] [c0179b397d00] [c04ba6c4] > > > do_vfs_ioctl+0xe4/0xb40 > > > [ 44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110 > > > [ 44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80 > > > [ 44.377628] [c0179b397e20] [c000b888] system_call+0x5c/0x70 > > > [ 44.3777
Re: [PATCH v2] powerpc: Add force enable of DAWR on P9 option
On Tue, 2019-06-11 at 08:48 +0200, Cédric Le Goater wrote: > On 11/06/2019 08:44, Michael Neuling wrote: > > > > 2: > > > > -BEGIN_FTR_SECTION > > > > - /* POWER9 with disabled DAWR */ > > > > + LOAD_REG_ADDR(r11, dawr_force_enable) > > > > + lbz r11, 0(r11) > > > > + cmpdi r11, 0 > > > > li r3, H_HARDWARE > > > > - blr > > > > -END_FTR_SECTION_IFCLR(CPU_FTR_DAWR) > > > > + beqlr > > > > > > Why is this a 'beqlr' ? Shouldn't it be a blr ? > > > > I believe it's right and should be a beqlr. It's to replace the FTR > > section to > > make it dynamic based on the dawr_force_enable bit. > > hmm, see the crash below on a L1 running a nested guest. r3 is set > to -1 (H_HARDWARE) but a vpcu pointer was expected. How can we fix > this ? > > C. > > > [ 44.374746] BUG: Kernel NULL pointer dereference at 0x13bf > [ 44.374848] Faulting instruction address: 0xc010b044 > [ 44.374906] Oops: Kernel access of bad area, sig: 11 [#1] > [ 44.374951] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA > pSeries > [ 44.375018] Modules linked in: vhost_net vhost tap xt_CHECKSUM > iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack > nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp > bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables > iptable_filter bpfilter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm > sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover virtio_scsi > failover > [ 44.375401] CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not > tainted 5.2.0-rc4+ #3 > [ 44.375500] NIP: c010b044 LR: c008089dacf4 CTR: > c010aff4 > [ 44.375604] REGS: c0179b397710 TRAP: 0300 Not tainted (5.2.0-rc4+) > [ 44.375691] MSR: 8280b033 CR: > 42244842 XER: > [ 44.375815] CFAR: c010aff8 DAR: 13bf DSISR: 4200 > IRQMASK: 0 > [ 44.375815] GPR00: c008089dd6bc c0179b3979a0 c00808a04300 > > [ 44.375815] GPR04: 0003 2444b05d > c017f11c45d0 > [ 44.375815] GPR08: 07803e018dfe 0028 0001 > 0075 > [ 44.375815] GPR12: 
c010aff4 c7ff6300 > > [ 44.375815] GPR16: c017f11d > c017f11ca7a8 > [ 44.375815] GPR20: c017f11c42ec > 000a > [ 44.375815] GPR24: fffc c017f11c > c1a77ed8 > [ 44.375815] GPR28: c0179af7 fffc c008089ff170 > c0179ae88540 > [ 44.376673] NIP [c010b044] kvmppc_h_set_dabr+0x50/0x68 > [ 44.376754] LR [c008089dacf4] kvmppc_pseries_do_hcall+0xa3c/0xeb0 > [kvm_hv] > [ 44.376849] Call Trace: > [ 44.376886] [c0179b3979a0] [c017f11c] 0xc017f11c > (unreliable) > [ 44.376982] [c0179b397a10] [c008089dd6bc] > kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv] > [ 44.377084] [c0179b397ae0] [c008093f8bcc] > kvmppc_vcpu_run+0x34/0x48 [kvm] > [ 44.377185] [c0179b397b00] [c008093f522c] > kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm] > [ 44.377286] [c0179b397b90] [c008093e3618] > kvm_vcpu_ioctl+0x460/0x850 [kvm] > [ 44.377384] [c0179b397d00] [c04ba6c4] do_vfs_ioctl+0xe4/0xb40 > [ 44.377464] [c0179b397db0] [c04bb1e4] ksys_ioctl+0xc4/0x110 > [ 44.377547] [c0179b397e00] [c04bb258] sys_ioctl+0x28/0x80 > [ 44.377628] [c0179b397e20] [c000b888] system_call+0x5c/0x70 > [ 44.377712] Instruction dump: > [ 44.377765] 4082fff4 4c00012c 3860 4e800020 e96280c0 896b 2c2b > 3860 > [ 44.377862] 4d820020 50852e74 508516f6 78840724 f8a313c8 > 7c942ba6 7cbc2ba6 Oops, it's because I corrupted r3 :-( Does this fix it? diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index 139027c62d..f781ee1458 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -2519,8 +2519,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) LOAD_REG_ADDR(r11, dawr_force_enable) lbz r11, 0(r11) cmpdi r11, 0 + bne 3f li r3, H_HARDWARE - beqlr + blr +3: /* Emulate H_SET_DABR/X on P8 for the sake of compat mode guests */ rlwimi r5, r4, 5, DAWRX_DR | DAWRX_DW rlwimi r5, r4, 2, DAWRX_WT
Re: [PATCH v2] powerpc: Add force enable of DAWR on P9 option
> > 2: > > -BEGIN_FTR_SECTION > > - /* POWER9 with disabled DAWR */ > > + LOAD_REG_ADDR(r11, dawr_force_enable) > > + lbz r11, 0(r11) > > + cmpdi r11, 0 > > li r3, H_HARDWARE > > - blr > > -END_FTR_SECTION_IFCLR(CPU_FTR_DAWR) > > + beqlr > > Why is this a 'beqlr' ? Shouldn't it be a blr ? I believe it's right and should be a beqlr. It's to replace the FTR section to make it dynamic based on the dawr_force_enable bit. Mikey
Re: [PATCH] Powerpc/Watchpoint: Restore nvgprs while returning from exception
On Thu, 2019-06-06 at 12:59 +0530, Ravi Bangoria wrote: > Powerpc hw triggers watchpoint before executing the instruction. > To make trigger-after-execute behavior, kernel emulates the > instruction. If the instruction is 'load something into non-volatile > register', exception handler should restore emulated > register state while returning back, otherwise there will be > register state corruption. Ex, Adding a watchpoint on a list > can corrupt the list: > > # cat /proc/kallsyms | grep kthread_create_list > c121c8b8 d kthread_create_list > > Add watchpoint on kthread_create_list->next: > > # perf record -e mem:0xc121c8c0 > > Run some workload such that new kthread gets invoked. Ex, I > just logged out from console: > > list_add corruption. next->prev should be prev (c1214e00), \ > but was c121c8b8. (next=c121c8b8). > WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0 > CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69 > ... > NIP __list_add_valid+0xb4/0xc0 > LR __list_add_valid+0xb0/0xc0 > Call Trace: > __list_add_valid+0xb0/0xc0 (unreliable) > __kthread_create_on_node+0xe0/0x260 > kthread_create_on_node+0x34/0x50 > create_worker+0xe8/0x260 > worker_thread+0x444/0x560 > kthread+0x160/0x1a0 > ret_from_kernel_thread+0x5c/0x70 > > Signed-off-by: Ravi Bangoria How long has this been around? Should we be CCing stable? Mikey > --- > arch/powerpc/kernel/exceptions-64s.S | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/exceptions-64s.S > b/arch/powerpc/kernel/exceptions-64s.S > index 9481a11..96de0d1 100644 > --- a/arch/powerpc/kernel/exceptions-64s.S > +++ b/arch/powerpc/kernel/exceptions-64s.S > @@ -1753,7 +1753,7 @@ handle_dabr_fault: > ld r5,_DSISR(r1) > addi r3,r1,STACK_FRAME_OVERHEAD > bl do_break > -12: b ret_from_except_lite > +12: b ret_from_except > > > #ifdef CONFIG_PPC_BOOK3S_64
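The corruption signature quoted above comes from the kernel's __list_add_valid() sanity check, which verifies the doubly-linked-list invariants before splicing in a new node. A simplified stand-alone version of that check (sketched from lib/list_debug.c) shows exactly why a stale, unrestored register holding a list pointer trips the "next->prev should be prev" warning:

```c
#include <assert.h>
#include <stdbool.h>

struct list_head { struct list_head *next, *prev; };

/*
 * The invariants __list_add_valid() enforces before inserting 'node'
 * between 'prev' and 'next' (simplified). Emulating a watched load
 * without restoring the non-volatile register that holds 'next' is
 * precisely what makes these checks fire.
 */
static bool list_add_valid(struct list_head *node,
                           struct list_head *prev, struct list_head *next)
{
    return next->prev == prev &&   /* "next->prev should be prev" */
           prev->next == next &&
           node != prev && node != next;
}
```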
[PATCH v5 1/2] powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool
From: Mathieu Malaterre In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") the following piece of code was added: smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0); Since GCC 8 this triggers the following warning about incompatible function types: arch/powerpc/kernel/hw_breakpoint.c:408:21: error: cast between incompatible function types from 'int (*)(struct arch_hw_breakpoint *)' to 'void (*)(void *)' [-Werror=cast-function-type] Since the warning is there for a reason, and should not be hidden behind a cast, provide an intermediate callback function to avoid the warning. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Suggested-by: Christoph Hellwig Signed-off-by: Mathieu Malaterre Signed-off-by: Michael Neuling --- arch/powerpc/kernel/hw_breakpoint.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c index da307dd93e..ca3a2358b7 100644 --- a/arch/powerpc/kernel/hw_breakpoint.c +++ b/arch/powerpc/kernel/hw_breakpoint.c @@ -384,6 +384,11 @@ void hw_breakpoint_pmu_read(struct perf_event *bp) bool dawr_force_enable; EXPORT_SYMBOL_GPL(dawr_force_enable); +static void set_dawr_cb(void *info) +{ + set_dawr(info); +} + static ssize_t dawr_write_file_bool(struct file *file, const char __user *user_buf, size_t count, loff_t *ppos) @@ -403,7 +408,7 @@ static ssize_t dawr_write_file_bool(struct file *file, /* If we are clearing, make sure all CPUs have the DAWR cleared */ if (!dawr_force_enable) - smp_call_function((smp_call_func_t)set_dawr, &null_brk, 0); + smp_call_function(set_dawr_cb, &null_brk, 0); return rc; } -- 2.21.0
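The fix above is the standard pattern for -Wcast-function-type: instead of casting int (*)(struct arch_hw_breakpoint *) to void (*)(void *), route the call through a trampoline that has the exact signature the API expects. A minimal illustration of the same pattern outside the kernel (all names here are hypothetical stand-ins, not the real API):

```c
#include <assert.h>

/* A worker with its natural signature */
struct fake_brk { int val; };

static int last_seen;

static int fake_set_dawr(struct fake_brk *brk)
{
    last_seen = brk->val;
    return 0;
}

/* Trampoline with the exact callback type -- no function-pointer cast,
 * so the warning stays away and the indirect call is well-defined. */
static void fake_set_dawr_cb(void *info)
{
    fake_set_dawr(info);
}

typedef void (*call_func_t)(void *);

/* Minimal stand-in for smp_call_function(): runs the callback once. */
static void call_everywhere(call_func_t fn, void *info)
{
    fn(info);
}
```

Calling through a pointer of the wrong function type is undefined behaviour even when it happens to work, which is why GCC 8 started warning on the cast rather than just noting a style issue.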
[PATCH v5 2/2] powerpc: Fix compile issue with force DAWR
If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail at linking with: arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined reference to `dawr_force_enable' This was caused by commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option"). This moves a bunch of code around to fix this. It moves a lot of the DAWR code in a new file and creates a new CONFIG_PPC_DAWR to enable compiling it. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling -- v5: - Changes based on comments by hch - Change // to /* comments - Change return code from -1 to -ENODEV - Remove unneeded default n in new Kconfig option - Remove setting to default value - Remove unnecessary braces v4: - Fix merge conflict with patch from Mathieu Malaterre: powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool - Fixed checkpatch issues noticed by Christophe Leroy. v3: Fixes based on Christophe Leroy's comments: - Fix Kconfig options to better reflect reality - Reorder alphabetically - Inline vs #define - Fixed default return for dawr_enabled() when CONFIG_PPC_DAWR=N V2: Fixes based on Christophe Leroy's comments: - Fix commit message formatting - Move more DAWR code into dawr.c --- arch/powerpc/Kconfig | 4 + arch/powerpc/include/asm/hw_breakpoint.h | 21 +++-- arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/dawr.c | 98 arch/powerpc/kernel/hw_breakpoint.c | 61 --- arch/powerpc/kernel/process.c| 28 --- arch/powerpc/kvm/Kconfig | 1 + 7 files changed, 118 insertions(+), 96 deletions(-) create mode 100644 arch/powerpc/kernel/dawr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8c1c636308..9d61b36df4 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -234,6 +234,7 @@ config PPC select OLD_SIGSUSPEND select PCI_DOMAINS if PCI select PCI_SYSCALL if PCI + select PPC_DAWR if PPC64 select RTC_LIB select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE @@ -370,6 +371,9 @@ config 
PPC_ADV_DEBUG_DAC_RANGE depends on PPC_ADV_DEBUG_REGS && 44x default y +config PPC_DAWR + bool + config ZONE_DMA bool default y if PPC_BOOK3E_64 diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h index 0fe8c1e46b..41abdae6d0 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -90,18 +90,25 @@ static inline void hw_breakpoint_disable(void) extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); int hw_breakpoint_handler(struct die_args *args); -extern int set_dawr(struct arch_hw_breakpoint *brk); +#else /* CONFIG_HAVE_HW_BREAKPOINT */ +static inline void hw_breakpoint_disable(void) { } +static inline void thread_change_pc(struct task_struct *tsk, + struct pt_regs *regs) { } + +#endif /* CONFIG_HAVE_HW_BREAKPOINT */ + + +#ifdef CONFIG_PPC_DAWR extern bool dawr_force_enable; static inline bool dawr_enabled(void) { return dawr_force_enable; } - -#else /* CONFIG_HAVE_HW_BREAKPOINT */ -static inline void hw_breakpoint_disable(void) { } -static inline void thread_change_pc(struct task_struct *tsk, - struct pt_regs *regs) { } +int set_dawr(struct arch_hw_breakpoint *brk); +#else static inline bool dawr_enabled(void) { return false; } -#endif /* CONFIG_HAVE_HW_BREAKPOINT */ +static inline int set_dawr(struct arch_hw_breakpoint *brk) { return -1; } +#endif + #endif /* __KERNEL__ */ #endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */ diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 0ea6c4aa3a..56dfa7a2a6 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -56,6 +56,7 @@ obj-$(CONFIG_PPC64) += setup_64.o sys_ppc32.o \ obj-$(CONFIG_VDSO32) += vdso32/ obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o +obj-$(CONFIG_PPC_DAWR) += dawr.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o 
obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c new file mode 100644 index 00..ae3bd6abac --- /dev/null +++ b/arch/powerpc/kernel/dawr.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * DAWR infrastructure + * + * Copyright 2019, Michael Neuling, IBM Corporation. + */ + +#include +#include +#include +#include +#include +#include +#include + +bool dawr_force_enable; +EXPORT_SYMBOL_GPL(dawr_force_ena
Re: [PATCH v4 2/2] powerpc: Fix compile issue with force DAWR
I agree with all the below and will address in v5. Mikey On Tue, 2019-05-28 at 23:28 -0700, Christoph Hellwig wrote: > > +config PPC_DAWR > > + bool > > + default n > > "default n" is the default default. No need to write this line. > > > +++ b/arch/powerpc/kernel/dawr.c > > @@ -0,0 +1,100 @@ > > +// SPDX-License-Identifier: GPL-2.0+ > > +// > > +// DAWR infrastructure > > +// > > +// Copyright 2019, Michael Neuling, IBM Corporation. > > Normal top of file header should be /* */, //-style comments are only > for the actual SPDX header line. > > > + /* Send error to user if they hypervisor won't allow us to write DAWR */ > > + if ((!dawr_force_enable) && > > + (firmware_has_feature(FW_FEATURE_LPAR)) && > > + (set_dawr(&null_brk) != H_SUCCESS)) > > None of the three inner brace sets here are required, and the code > becomes much easier to read without them. > > > + return -1; > > What about returning a proper error code? > > > +static int __init dawr_force_setup(void) > > +{ > > + dawr_force_enable = false; > > This variable already is initialized to false by default, so this line > is not required. > > > + if (PVR_VER(mfspr(SPRN_PVR)) == PVR_POWER9) { > > + /* Turn DAWR off by default, but allow admin to turn it on */ > > + dawr_force_enable = false; > > .. and neither is this one. >
[PATCH v4 1/2] powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool
From: Mathieu Malaterre In commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") the following piece of code was added: smp_call_function((smp_call_func_t)set_dawr, _brk, 0); Since GCC 8 this triggers the following warning about incompatible function types: arch/powerpc/kernel/hw_breakpoint.c:408:21: error: cast between incompatible function types from 'int (*)(struct arch_hw_breakpoint *)' to 'void (*)(void *)' [-Werror=cast-function-type] Since the warning is there for a reason, and should not be hidden behind a cast, provide an intermediate callback function to avoid the warning. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Suggested-by: Christoph Hellwig Signed-off-by: Mathieu Malaterre Signed-off-by: Michael Neuling --- arch/powerpc/kernel/hw_breakpoint.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c index da307dd93e..ca3a2358b7 100644 --- a/arch/powerpc/kernel/hw_breakpoint.c +++ b/arch/powerpc/kernel/hw_breakpoint.c @@ -384,6 +384,11 @@ void hw_breakpoint_pmu_read(struct perf_event *bp) bool dawr_force_enable; EXPORT_SYMBOL_GPL(dawr_force_enable); +static void set_dawr_cb(void *info) +{ + set_dawr(info); +} + static ssize_t dawr_write_file_bool(struct file *file, const char __user *user_buf, size_t count, loff_t *ppos) @@ -403,7 +408,7 @@ static ssize_t dawr_write_file_bool(struct file *file, /* If we are clearing, make sure all CPUs have the DAWR cleared */ if (!dawr_force_enable) - smp_call_function((smp_call_func_t)set_dawr, _brk, 0); + smp_call_function(set_dawr_cb, _brk, 0); return rc; } -- 2.21.0
[PATCH v4 2/2] powerpc: Fix compile issue with force DAWR
If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail at linking with: arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined reference to `dawr_force_enable' This was caused by commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option"). This moves a bunch of code around to fix this. It moves a lot of the DAWR code in a new file and creates a new CONFIG_PPC_DAWR to enable compiling it. Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Signed-off-by: Michael Neuling -- v4: - Fix merge conflict with patch from Mathieu Malaterre: powerpc: silence a -Wcast-function-type warning in dawr_write_file_bool - Fixed checkpatch issues noticed by Christophe Leroy. v3: Fixes based on Christophe Leroy's comments: - Fix Kconfig options to better reflect reality - Reorder alphabetically - Inline vs #define - Fixed default return for dawr_enabled() when CONFIG_PPC_DAWR=N V2: Fixes based on Christophe Leroy's comments: - Fix commit message formatting - Move more DAWR code into dawr.c --- arch/powerpc/Kconfig | 5 ++ arch/powerpc/include/asm/hw_breakpoint.h | 21 +++-- arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/dawr.c | 100 +++ arch/powerpc/kernel/hw_breakpoint.c | 61 -- arch/powerpc/kernel/process.c| 28 --- arch/powerpc/kvm/Kconfig | 1 + 7 files changed, 121 insertions(+), 96 deletions(-) create mode 100644 arch/powerpc/kernel/dawr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 8c1c636308..87a3ce4e92 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -234,6 +234,7 @@ config PPC select OLD_SIGSUSPEND select PCI_DOMAINS if PCI select PCI_SYSCALL if PCI + select PPC_DAWR if PPC64 select RTC_LIB select SPARSE_IRQ select SYSCTL_EXCEPTION_TRACE @@ -370,6 +371,10 @@ config PPC_ADV_DEBUG_DAC_RANGE depends on PPC_ADV_DEBUG_REGS && 44x default y +config PPC_DAWR + bool + default n + config ZONE_DMA bool default y if PPC_BOOK3E_64 diff --git a/arch/powerpc/include/asm/hw_breakpoint.h 
b/arch/powerpc/include/asm/hw_breakpoint.h index 0fe8c1e46b..41abdae6d0 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -90,18 +90,25 @@ static inline void hw_breakpoint_disable(void) extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); int hw_breakpoint_handler(struct die_args *args); -extern int set_dawr(struct arch_hw_breakpoint *brk); +#else /* CONFIG_HAVE_HW_BREAKPOINT */ +static inline void hw_breakpoint_disable(void) { } +static inline void thread_change_pc(struct task_struct *tsk, + struct pt_regs *regs) { } + +#endif /* CONFIG_HAVE_HW_BREAKPOINT */ + + +#ifdef CONFIG_PPC_DAWR extern bool dawr_force_enable; static inline bool dawr_enabled(void) { return dawr_force_enable; } - -#else /* CONFIG_HAVE_HW_BREAKPOINT */ -static inline void hw_breakpoint_disable(void) { } -static inline void thread_change_pc(struct task_struct *tsk, - struct pt_regs *regs) { } +int set_dawr(struct arch_hw_breakpoint *brk); +#else static inline bool dawr_enabled(void) { return false; } -#endif /* CONFIG_HAVE_HW_BREAKPOINT */ +static inline int set_dawr(struct arch_hw_breakpoint *brk) { return -1; } +#endif + #endif /* __KERNEL__ */ #endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */ diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 0ea6c4aa3a..56dfa7a2a6 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -56,6 +56,7 @@ obj-$(CONFIG_PPC64) += setup_64.o sys_ppc32.o \ obj-$(CONFIG_VDSO32) += vdso32/ obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o +obj-$(CONFIG_PPC_DAWR) += dawr.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c new file mode 100644 index 00..c8b3fb610c --- /dev/null +++ b/arch/powerpc/kernel/dawr.c @@ -0,0 +1,100 @@ 
+// SPDX-License-Identifier: GPL-2.0+ +// +// DAWR infrastructure +// +// Copyright 2019, Michael Neuling, IBM Corporation. + +#include +#include +#include +#include +#include +#include +#include + +bool dawr_force_enable; +EXPORT_SYMBOL_GPL(dawr_force_enable); + +int set_dawr(struct arch_hw_breakpoint *brk) +{ + unsigned long dawr, dawrx, mrd; + + dawr = brk->address; + + dawrx = (brk->type & (HW_BRK_TYPE_READ | HW_BRK_TYPE_WRITE)) +
Re: [PATCH v2] powerpc: Fix compile issue with force DAWR
> > > > -- > > > > v2: > > > > Fixes based on Christophe Leroy's comments: > > > > - Fix commit message formatting > > > > - Move more DAWR code into dawr.c > > > > --- > > > >arch/powerpc/Kconfig | 5 ++ > > > >arch/powerpc/include/asm/hw_breakpoint.h | 20 --- > > > >arch/powerpc/kernel/Makefile | 1 + > > > >arch/powerpc/kernel/dawr.c | 75 > > > > > > > >arch/powerpc/kernel/hw_breakpoint.c | 56 -- > > > >arch/powerpc/kvm/Kconfig | 1 + > > > >6 files changed, 94 insertions(+), 64 deletions(-) > > > >create mode 100644 arch/powerpc/kernel/dawr.c > > > > > > > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > > > > index 2711aac246..f4b19e48cc 100644 > > > > --- a/arch/powerpc/Kconfig > > > > +++ b/arch/powerpc/Kconfig > > > > @@ -242,6 +242,7 @@ config PPC > > > > select SYSCTL_EXCEPTION_TRACE > > > > select THREAD_INFO_IN_TASK > > > > select VIRT_TO_BUS if !PPC64 > > > > + select PPC_DAWR_FORCE_ENABLE if PPC64 || PERF > > > > > > What's PERF ? Did you mean PERF_EVENTS ? > > > > > > Then what you mean is: > > > - Selected all the time for PPC64 > > > - Selected for PPC32 when PERF is also selected. > > > > > > Is that what you want ? At first I thought it was linked to P9. > > > > This is wrong. I think we just want PPC64. PERF is a red herring. > > Are you sure ? Michael suggested PERF || KVM as far as I remember. It was mpe but I think it was wrong. We could make it more selective with something like: PPC64 && (HW_BREAK || PPC_ADV_DEBUG_REGS || PERF) but I think that will end up back in the larger mess of https://github.com/linuxppc/issues/issues/128 I don't think the minor size gain is worth delving into that mess, hence I made it simply PPC64 which is hopefully an improvement on what we have since it eliminates 32bit. 
> > > >#else /* CONFIG_HAVE_HW_BREAKPOINT */ > > > >static inline void hw_breakpoint_disable(void) { } > > > >static inline void thread_change_pc(struct task_struct *tsk, > > > > struct pt_regs *regs) { } > > > > -static inline bool dawr_enabled(void) { return false; } > > > > + > > > >#endif /* CONFIG_HAVE_HW_BREAKPOINT */ > > > > + > > > > +extern bool dawr_force_enable; > > > > + > > > > +#ifdef CONFIG_PPC_DAWR_FORCE_ENABLE > > > > +extern bool dawr_enabled(void); > > > > > > Functions should not be 'extern'. I'm sure checkpatch --strict will tell > > > you. > > > > That's not what's currently being done in this header file. I'm keeping > > with > > the style of that file. > > style is not defined on a per file basis. There is the style from the > past and the nowadays style. If you keep old style just because the file > includes old style statements, then the code will never improve. > > All new patches should come with clean 'checkpatch' report and follow > new style. Having mixed styles in a file is not a problem, that's the > way to improvement. See arch/powerpc/mm/mmu_decl.h as an example. I'm not sure that's mpe's policy. mpe? > > > > + > > > > +static ssize_t dawr_write_file_bool(struct file *file, > > > > + const char __user *user_buf, > > > > + size_t count, loff_t *ppos) > > > > +{ > > > > + struct arch_hw_breakpoint null_brk = {0, 0, 0}; > > > > + size_t rc; > > > > + > > > > + /* Send error to user if they hypervisor won't allow us to write > > > > DAWR */ > > > > + if ((!dawr_force_enable) && > > > > + (firmware_has_feature(FW_FEATURE_LPAR)) && > > > > + (set_dawr(&null_brk) != H_SUCCESS)) > > > > > > The above is not real clear. > > > set_dabr() returns 0, H_SUCCESS is not used there. > > > > pseries_set_dawr() will return an hcall number. > > Right, then it maybe means set_dawr() should be fixed? Sorry, I don't understand this. > > This code hasn't changed. I'm just moving it. > > Right, but could be an improvement for another patch. 
> As far as I remember you are the one who wrote that code in the first place, > aren't you? Yep, classic crap Mikey code :-) Mikey
Re: [PATCH v2] powerpc: Fix compile issue with force DAWR
On Mon, 2019-05-13 at 11:08 +0200, Christophe Leroy wrote: > > Le 13/05/2019 à 09:17, Michael Neuling a écrit : > > If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail > > at linking with: > >arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined > > reference to `dawr_force_enable' > > > > This was caused by commit c1fe190c0672 ("powerpc: Add force enable of > > DAWR on P9 option"). > > > > This puts more of the dawr infrastructure in a new file. > > I think you are doing a bit more than that. I think you should explain > that you define a new CONFIG_ option, when it is selected, etc ... > > The commit you are referring to is talking about P9. It looks like your > patch covers a lot more, so it should be mentioned here I guess. Not really. It looks like I'm doing a lot but I'm really just moving code around to deal with the ugliness of a bunch of config options and dependencies. > > Signed-off-by: Michael Neuling > > You should add a Fixes: tag, ie: > > Fixes: c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option") Ok > > > -- > > v2: > >Fixes based on Christophe Leroy's comments: > >- Fix commit message formatting > >- Move more DAWR code into dawr.c > > --- > > arch/powerpc/Kconfig | 5 ++ > > arch/powerpc/include/asm/hw_breakpoint.h | 20 --- > > arch/powerpc/kernel/Makefile | 1 + > > arch/powerpc/kernel/dawr.c | 75 > > arch/powerpc/kernel/hw_breakpoint.c | 56 -- > > arch/powerpc/kvm/Kconfig | 1 + > > 6 files changed, 94 insertions(+), 64 deletions(-) > > create mode 100644 arch/powerpc/kernel/dawr.c > > > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > > index 2711aac246..f4b19e48cc 100644 > > --- a/arch/powerpc/Kconfig > > +++ b/arch/powerpc/Kconfig > > @@ -242,6 +242,7 @@ config PPC > > select SYSCTL_EXCEPTION_TRACE > > select THREAD_INFO_IN_TASK > > select VIRT_TO_BUS if !PPC64 > > + select PPC_DAWR_FORCE_ENABLE if PPC64 || PERF > > What's PERF ? Did you mean PERF_EVENTS ? 
> > Then what you mean is: > - Selected all the time for PPC64 > - Selected for PPC32 when PERF is also selected. > > Is that what you want ? At first I thought it was linked to P9. This is wrong. I think we just want PPC64. PERF is a red herring. > And ... did you read below statement ? Clearly not :-) > > > # > > # Please keep this list sorted alphabetically. > > # > > @@ -369,6 +370,10 @@ config PPC_ADV_DEBUG_DAC_RANGE > > depends on PPC_ADV_DEBUG_REGS && 44x > > default y > > > > +config PPC_DAWR_FORCE_ENABLE > > + bool > > + default y > > + > > Why defaulting it to y. Then how is it set to n ? Good point. > > > config ZONE_DMA > > bool > > default y if PPC_BOOK3E_64 > > diff --git a/arch/powerpc/include/asm/hw_breakpoint.h > > b/arch/powerpc/include/asm/hw_breakpoint.h > > index 0fe8c1e46b..ffbc8eab41 100644 > > --- a/arch/powerpc/include/asm/hw_breakpoint.h > > +++ b/arch/powerpc/include/asm/hw_breakpoint.h > > @@ -47,6 +47,8 @@ struct arch_hw_breakpoint { > > #define HW_BRK_TYPE_PRIV_ALL (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | > > \ > > HW_BRK_TYPE_HYP) > > > > +extern int set_dawr(struct arch_hw_breakpoint *brk); > > + > > #ifdef CONFIG_HAVE_HW_BREAKPOINT > > #include > > #include > > @@ -90,18 +92,20 @@ static inline void hw_breakpoint_disable(void) > > extern void thread_change_pc(struct task_struct *tsk, struct pt_regs > > *regs); > > int hw_breakpoint_handler(struct die_args *args); > > > > -extern int set_dawr(struct arch_hw_breakpoint *brk); > > -extern bool dawr_force_enable; > > -static inline bool dawr_enabled(void) > > -{ > > - return dawr_force_enable; > > -} > > - > > That's a very simple function, why not keep it here (or in another .h) > as 'static inline' ? Sure. 
> > #else /* CONFIG_HAVE_HW_BREAKPOINT */ > > static inline void hw_breakpoint_disable(void) { } > > static inline void thread_change_pc(struct task_struct *tsk, > > struct pt_regs *regs) { } > > -static inline bool dawr_enabled(void) { return false; } > > + > > #endif/* CONFIG_HAVE_HW_BREAKPOINT */ > > + > > +extern bool daw
[PATCH v2] powerpc: Fix compile issue with force DAWR
If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail at linking with: arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined reference to `dawr_force_enable' This was caused by commit c1fe190c0672 ("powerpc: Add force enable of DAWR on P9 option"). This puts more of the dawr infrastructure in a new file. Signed-off-by: Michael Neuling -- v2: Fixes based on Christophe Leroy's comments: - Fix commit message formatting - Move more DAWR code into dawr.c --- arch/powerpc/Kconfig | 5 ++ arch/powerpc/include/asm/hw_breakpoint.h | 20 --- arch/powerpc/kernel/Makefile | 1 + arch/powerpc/kernel/dawr.c | 75 arch/powerpc/kernel/hw_breakpoint.c | 56 -- arch/powerpc/kvm/Kconfig | 1 + 6 files changed, 94 insertions(+), 64 deletions(-) create mode 100644 arch/powerpc/kernel/dawr.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 2711aac246..f4b19e48cc 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -242,6 +242,7 @@ config PPC select SYSCTL_EXCEPTION_TRACE select THREAD_INFO_IN_TASK select VIRT_TO_BUS if !PPC64 + select PPC_DAWR_FORCE_ENABLEif PPC64 || PERF # # Please keep this list sorted alphabetically. 
# @@ -369,6 +370,10 @@ config PPC_ADV_DEBUG_DAC_RANGE depends on PPC_ADV_DEBUG_REGS && 44x default y +config PPC_DAWR_FORCE_ENABLE + bool + default y + config ZONE_DMA bool default y if PPC_BOOK3E_64 diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h index 0fe8c1e46b..ffbc8eab41 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -47,6 +47,8 @@ struct arch_hw_breakpoint { #define HW_BRK_TYPE_PRIV_ALL (HW_BRK_TYPE_USER | HW_BRK_TYPE_KERNEL | \ HW_BRK_TYPE_HYP) +extern int set_dawr(struct arch_hw_breakpoint *brk); + #ifdef CONFIG_HAVE_HW_BREAKPOINT #include #include @@ -90,18 +92,20 @@ static inline void hw_breakpoint_disable(void) extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); int hw_breakpoint_handler(struct die_args *args); -extern int set_dawr(struct arch_hw_breakpoint *brk); -extern bool dawr_force_enable; -static inline bool dawr_enabled(void) -{ - return dawr_force_enable; -} - #else /* CONFIG_HAVE_HW_BREAKPOINT */ static inline void hw_breakpoint_disable(void) { } static inline void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs) { } -static inline bool dawr_enabled(void) { return false; } + #endif /* CONFIG_HAVE_HW_BREAKPOINT */ + +extern bool dawr_force_enable; + +#ifdef CONFIG_PPC_DAWR_FORCE_ENABLE +extern bool dawr_enabled(void); +#else +#define dawr_enabled() true +#endif + #endif /* __KERNEL__ */ #endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */ diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 0ea6c4aa3a..a9c497c34f 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -56,6 +56,7 @@ obj-$(CONFIG_PPC64) += setup_64.o sys_ppc32.o \ obj-$(CONFIG_VDSO32) += vdso32/ obj-$(CONFIG_PPC_WATCHDOG) += watchdog.o obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o +obj-$(CONFIG_PPC_DAWR_FORCE_ENABLE)+= dawr.o obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_ppc970.o cpu_setup_pa6t.o 
obj-$(CONFIG_PPC_BOOK3S_64)+= cpu_setup_power.o obj-$(CONFIG_PPC_BOOK3S_64)+= mce.o mce_power.o diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c new file mode 100644 index 00..cf1d02fe1e --- /dev/null +++ b/arch/powerpc/kernel/dawr.c @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// DAWR infrastructure +// +// Copyright 2019, Michael Neuling, IBM Corporation. + +#include +#include +#include +#include +#include +#include +#include + +bool dawr_force_enable; +EXPORT_SYMBOL_GPL(dawr_force_enable); + +extern bool dawr_enabled(void) +{ + return dawr_force_enable; +} +EXPORT_SYMBOL_GPL(dawr_enabled); + +static ssize_t dawr_write_file_bool(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct arch_hw_breakpoint null_brk = {0, 0, 0}; + size_t rc; + + /* Send error to user if they hypervisor won't allow us to write DAWR */ + if ((!dawr_force_enable) && + (firmware_has_feature(FW_FEATURE_LPAR)) && + (set_dawr(_brk) != H_SUCCESS)) + return -1; + + rc = debugfs_write_file_bool(file, user_buf, count, ppos); + if (rc) + return rc; + + /* If we are clearing, make sure all CPUs have the DAWR cleared
[PATCH] powerpc: Document xive=off option
commit 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE interrupt controller") added an option to turn off Linux native XIVE usage via the xive=off kernel command line option. This documents this option. Signed-off-by: Michael Neuling --- Documentation/admin-guide/kernel-parameters.txt | 9 + 1 file changed, 9 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c45a19d654..ee410d0ef4 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5177,6 +5177,15 @@ Format: ,[,[,[,]]] + xive= [PPC] + By default on POWER9 and above, the kernel will + natively use the XIVE interrupt controller. This option + allows the fallback firmware mode to be used: + + off Fallback to firmware control of XIVE interrupt + controller on both pseries and powernv + platforms. Only useful on POWER9 and above. + xhci-hcd.quirks [USB,KNL] A hex value specifying bitmask with supplemental xhci host controller quirks. Meaning of each bit can be -- 2.21.0
[PATCH] powerpc: Fix compile issue with force DAWR
If you compile with KVM but without CONFIG_HAVE_HW_BREAKPOINT you fail at linking with: arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x708): undefined reference to `dawr_force_enable' This was caused by this recent patch: commit c1fe190c06723322f2dfac31d3b982c581e434ef Author: Michael Neuling powerpc: Add force enable of DAWR on P9 option This builds dawr_force_enable in always via a new file. Signed-off-by: Michael Neuling --- arch/powerpc/kernel/Makefile| 2 +- arch/powerpc/kernel/dawr.c | 11 +++ arch/powerpc/kernel/hw_breakpoint.c | 3 --- 3 files changed, 12 insertions(+), 4 deletions(-) create mode 100644 arch/powerpc/kernel/dawr.c diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile index 0ea6c4aa3a..48a20ef5be 100644 --- a/arch/powerpc/kernel/Makefile +++ b/arch/powerpc/kernel/Makefile @@ -49,7 +49,7 @@ obj-y := cputable.o ptrace.o syscalls.o \ signal.o sysfs.o cacheinfo.o time.o \ prom.o traps.o setup-common.o \ udbg.o misc.o io.o misc_$(BITS).o \ - of_platform.o prom_parse.o + of_platform.o prom_parse.o dawr.o obj-$(CONFIG_PPC64)+= setup_64.o sys_ppc32.o \ signal_64.o ptrace32.o \ paca.o nvram_64.o firmware.o diff --git a/arch/powerpc/kernel/dawr.c b/arch/powerpc/kernel/dawr.c new file mode 100644 index 00..ca343efd23 --- /dev/null +++ b/arch/powerpc/kernel/dawr.c @@ -0,0 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0+ +// +// DAWR global variables +// +// Copyright 2019, Michael Neuling, IBM Corporation. 
+ +#include +#include + +bool dawr_force_enable; +EXPORT_SYMBOL_GPL(dawr_force_enable); diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c index da307dd93e..78a17454f4 100644 --- a/arch/powerpc/kernel/hw_breakpoint.c +++ b/arch/powerpc/kernel/hw_breakpoint.c @@ -381,9 +381,6 @@ void hw_breakpoint_pmu_read(struct perf_event *bp) /* TODO */ } -bool dawr_force_enable; -EXPORT_SYMBOL_GPL(dawr_force_enable); - static ssize_t dawr_write_file_bool(struct file *file, const char __user *user_buf, size_t count, loff_t *ppos) -- 2.21.0
Re: [PATCH] powerpc/tm: Avoid machine crash on rt_sigreturn
On Wed, 2019-01-16 at 14:47 -0200, Breno Leitao wrote: > There is a kernel crash that happens if rt_sigreturn is called inside a > transactional block. > > This crash happens if the kernel hits an in-kernel page fault when > accessing userspace memory, usually through copy_ckvsx_to_user(). A major > page fault calls the might_sleep() function, which can cause a task > reschedule. A task reschedule (switch_to()) reclaims and recheckpoints the > TM states, but, in the signal return path, the checkpointed memory was > already reclaimed, thus the exception stack has an MSR that points to > MSR[TS]=0. > > When the code returns from might_sleep() and a task reschedule happened, > then this task is returned with the memory recheckpointed, and > CPU MSR[TS] = suspended. > > This means that there is a side effect at might_sleep() if it is called > with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0. > > This side effect can cause a TM bad thing, since at the exception entrance, > the stack saves MSR[TS]=0, and this is what will be used at RFID, but, > the processor has MSR[TS] = Suspended, and this transition will be invalid > and a TM Bad thing will be raised, causing the following crash: > > Unexpected TM Bad Thing exception at c000e9ec (msr > 0x800302a03031) tm_scratch=80010280b033 > cpu 0xc: Vector: 700 (Program Check) at [c0003ff1fd70] > pc: c000e9ec: fast_exception_return+0x100/0x1bc > lr: c0032948: handle_rt_signal64+0xb8/0xaf0 > sp: c004263ebc40 > msr: 800302a03031 > current = 0xc00415050300 > paca= 0xc0003ffc4080 irqmask: 0x03 irq_happened: 0x01 > pid = 25006, comm = sigfuz > Linux version 5.0.0-rc1-1-g3bd6e94bec12 (breno@debian) (gcc version > 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019 > WARNING: exception is not recoverable, can't continue > enter ? 
for help > [c004263ebc40] c0032948 handle_rt_signal64+0xb8/0xaf0 > (unreliable) > [c004263ebd30] c0022780 do_notify_resume+0x2f0/0x430 > [c004263ebe20] c000e844 ret_from_except_lite+0x70/0x74 > --- Exception: c00 (System Call) at 7fffbaac400c > SP (7fffeca90f40) is in userspace > > The solution for this problem is running the sigreturn code with > regs->msr[TS] disabled, thus, avoiding hitting the side effect above. This > does not seem to be a problem since regs->msr will be replaced by the > ucontext value, so, it is being flushed already. In this case, it is > flushed earlier. > > Signed-off-by: Breno Leitao Acked-by: Michael Neuling This still applies on powerpc/next so just acking rather than reposting > --- > arch/powerpc/kernel/signal_64.c | 27 ++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index 6794466f6420..06c299ef6132 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -565,7 +565,7 @@ static long restore_tm_sigcontexts(struct task_struct > *tsk, > preempt_disable(); > > /* pull in MSR TS bits from user context */ > - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); > + regs->msr |= msr & MSR_TS_MASK; > > /* >* Ensure that TM is enabled in regs->msr before we leave the signal > @@ -745,6 +745,31 @@ SYSCALL_DEFINE0(rt_sigreturn) > if (MSR_TM_SUSPENDED(mfmsr())) > tm_reclaim_current(0); > > + /* > + * Disable MSR[TS] bit also, so, if there is an exception in the > + * code below (as a page fault in copy_ckvsx_to_user()), it does > + * not recheckpoint this task if there was a context switch inside > + * the exception. > + * > + * A major page fault can indirectly call schedule(). 
A reschedule > + * process in the middle of an exception can have a side effect > + * (Changing the CPU MSR[TS] state), since schedule() is called > + * with the CPU MSR[TS] disable and returns with MSR[TS]=Suspended > + * (switch_to() calls tm_recheckpoint() for the 'new' process). In > + * this case, the process continues to be the same in the CPU, but > + * the CPU state just changed. > + * > + * This can cause a TM Bad Thing, since the MSR in the stack will > + * have the MSR[TS]=0, and this is what will be used to RFID. > + * > + * Clearing MSR[TS] state here will avoid a recheckpoint if there > + * is any process reschedule in kernel space. The MSR[TS] state > + * does not need to be saved also, since it will be replaced with > + * the MSR[TS] that came from user context later, at > + * restore_tm_sigcontexts. > + */ > + regs->msr &= ~MSR_TS_MASK; > + > if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR])) > goto badframe; > if (MSR_TM_ACTIVE(msr)) {
Re: [PATCH] Documentation: powerpc: Expand the DAWR acronym
On Mon, 2019-04-01 at 16:41 +1030, Joel Stanley wrote: > Those of us not drowning in POWER might not know what this means. Hehe... thanks! > Signed-off-by: Joel Stanley Acked-by: Michael Neuling > --- > Documentation/powerpc/DAWR-POWER9.txt | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/Documentation/powerpc/DAWR-POWER9.txt > b/Documentation/powerpc/DAWR-POWER9.txt > index 2feaa6619658..f6c24e54ce83 100644 > --- a/Documentation/powerpc/DAWR-POWER9.txt > +++ b/Documentation/powerpc/DAWR-POWER9.txt > @@ -1,10 +1,10 @@ > DAWR issues on POWER9 > > > -On POWER9 the DAWR can cause a checkstop if it points to cache > -inhibited (CI) memory. Currently Linux has no way to disinguish CI > -memory when configuring the DAWR, so (for now) the DAWR is disabled by > -this commit: > +On POWER9 the Data Address Watchpoint Register (DAWR) can cause a checkstop > +if it points to cache inhibited (CI) memory. Currently Linux has no way to > +disinguish CI memory when configuring the DAWR, so (for now) the DAWR is > +disabled by this commit: > > commit 9654153158d3e0684a1bdb76dbababdb7111d5a0 > Author: Michael Neuling
[PATCH v2] powerpc: Add force enable of DAWR on P9 option
This adds a flag so that the DAWR can be enabled on P9 via: echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous The DAWR was previously force disabled on POWER9 in: 9654153158 powerpc: Disable DAWR in the base POWER9 CPU features Also see Documentation/powerpc/DAWR-POWER9.txt This is a dangerous setting, USE AT YOUR OWN RISK. Some users may not care about a bad user crashing their box (ie. single user/desktop systems) and really want the DAWR. This allows them to force enable DAWR. This flag can also be used to disable DAWR access. Once this is cleared, all DAWR access should be cleared immediately and your machine once again safe from crashing. Userspace may get confused by toggling this. If DAWR is force enabled/disabled between getting the number of breakpoints (via PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an inconsistent view of what's available. Similarly for guests. For the DAWR to be enabled in a KVM guest, the DAWR needs to be force enabled in the host AND the guest. For this reason, this won't work on POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the dawr_enable_dangerous file will fail if the hypervisor doesn't support writing the DAWR. To double check the DAWR is working, run this kernel selftest: tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c Any errors/failures/skips mean something is wrong. Signed-off-by: Michael Neuling --- v2: Fix compile errors found by kbuild test robot. 
--- Documentation/powerpc/DAWR-POWER9.txt| 32 arch/powerpc/include/asm/hw_breakpoint.h | 8 +++ arch/powerpc/kernel/hw_breakpoint.c | 62 +++- arch/powerpc/kernel/process.c| 9 ++-- arch/powerpc/kernel/ptrace.c | 3 +- arch/powerpc/kvm/book3s_hv.c | 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 23 + 7 files changed, 123 insertions(+), 17 deletions(-) diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/DAWR-POWER9.txt index 2feaa66196..bdec036509 100644 --- a/Documentation/powerpc/DAWR-POWER9.txt +++ b/Documentation/powerpc/DAWR-POWER9.txt @@ -56,3 +56,35 @@ POWER9. Loads and stores to the watchpoint locations will not be trapped in GDB. The watchpoint is remembered, so if the guest is migrated back to the POWER8 host, it will start working again. +Force enabling the DAWR += +Kernels (since ~v5.2) have an option to force enable the DAWR via: + + echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous + +This enables the DAWR even on POWER9. + +This is a dangerous setting, USE AT YOUR OWN RISK. + +Some users may not care about a bad user crashing their box +(ie. single user/desktop systems) and really want the DAWR. This +allows them to force enable DAWR. + +This flag can also be used to disable DAWR access. Once this is +cleared, all DAWR access should be cleared immediately and your +machine once again safe from crashing. + +Userspace may get confused by toggling this. If DAWR is force +enabled/disabled between getting the number of breakpoints (via +PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an +inconsistent view of what's available. Similarly for guests. + +For the DAWR to be enabled in a KVM guest, the DAWR needs to be force +enabled in the host AND the guest. For this reason, this won't work on +POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the +dawr_enable_dangerous file will fail if the hypervisor doesn't support +writing the DAWR. 
+ +To double check the DAWR is working, run this kernel selftest: + tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c +Any errors/failures/skips mean something is wrong. diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h index ece4dc89c9..0fe8c1e46b 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -90,10 +90,18 @@ static inline void hw_breakpoint_disable(void) extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); int hw_breakpoint_handler(struct die_args *args); +extern int set_dawr(struct arch_hw_breakpoint *brk); +extern bool dawr_force_enable; +static inline bool dawr_enabled(void) +{ + return dawr_force_enable; +} + #else /* CONFIG_HAVE_HW_BREAKPOINT */ static inline void hw_breakpoint_disable(void) { } static inline void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs) { } +static inline bool dawr_enabled(void) { return false; } #endif /* CONFIG_HAVE_HW_BREAKPOINT */ #endif /* __KERNEL__ */ #endif /* _PPC_BOOK3S_64_HW_BREAKPOINT_H */ diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c index fec8a67731..da307dd93e 100644 --- a/arch/powerpc/kernel/hw_breakpoint.c +++ b/arch/powerpc/kernel/hw_breakpoint.c @@ -29,11 +29,15 @@ #include #include #include +#include +#i
[PATCH] powerpc: Add force enable of DAWR on P9 option
This adds a flag so that the DAWR can be enabled on P9 via: echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous The DAWR was previously force disabled on POWER9 in: 9654153158 powerpc: Disable DAWR in the base POWER9 CPU features Also see Documentation/powerpc/DAWR-POWER9.txt This is a dangerous setting, USE AT YOUR OWN RISK. Some users may not care about a bad user crashing their box (ie. single user/desktop systems) and really want the DAWR. This allows them to force enable DAWR. This flag can also be used to disable DAWR access. Once this is cleared, all DAWR access should be cleared immediately and your machine once again safe from crashing. Userspace may get confused by toggling this. If DAWR is force enabled/disabled between getting the number of breakpoints (via PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an inconsistent view of what's available. Similarly for guests. For the DAWR to be enabled in a KVM guest, the DAWR needs to be force enabled in the host AND the guest. For this reason, this won't work on POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the dawr_enable_dangerous file will fail if the hypervisor doesn't support writing the DAWR. To double check the DAWR is working, run this kernel selftest: tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c Any errors/failures/skips mean something is wrong. 
Signed-off-by: Michael Neuling --- Documentation/powerpc/DAWR-POWER9.txt| 32 + arch/powerpc/include/asm/hw_breakpoint.h | 7 +++ arch/powerpc/kernel/hw_breakpoint.c | 60 +++- arch/powerpc/kernel/process.c| 9 ++-- arch/powerpc/kernel/ptrace.c | 3 +- arch/powerpc/kvm/book3s_hv.c | 3 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 23 + 7 files changed, 120 insertions(+), 17 deletions(-) diff --git a/Documentation/powerpc/DAWR-POWER9.txt b/Documentation/powerpc/DAWR-POWER9.txt index 2feaa66196..bdec036509 100644 --- a/Documentation/powerpc/DAWR-POWER9.txt +++ b/Documentation/powerpc/DAWR-POWER9.txt @@ -56,3 +56,35 @@ POWER9. Loads and stores to the watchpoint locations will not be trapped in GDB. The watchpoint is remembered, so if the guest is migrated back to the POWER8 host, it will start working again. +Force enabling the DAWR += +Kernels (since ~v5.2) have an option to force enable the DAWR via: + + echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous + +This enables the DAWR even on POWER9. + +This is a dangerous setting, USE AT YOUR OWN RISK. + +Some users may not care about a bad user crashing their box +(ie. single user/desktop systems) and really want the DAWR. This +allows them to force enable DAWR. + +This flag can also be used to disable DAWR access. Once this is +cleared, all DAWR access should be cleared immediately and your +machine once again safe from crashing. + +Userspace may get confused by toggling this. If DAWR is force +enabled/disabled between getting the number of breakpoints (via +PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an +inconsistent view of what's available. Similarly for guests. + +For the DAWR to be enabled in a KVM guest, the DAWR needs to be force +enabled in the host AND the guest. For this reason, this won't work on +POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the +dawr_enable_dangerous file will fail if the hypervisor doesn't support +writing the DAWR. 
+ +To double check the DAWR is working, run this kernel selftest: + tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c +Any errors/failures/skips mean something is wrong. diff --git a/arch/powerpc/include/asm/hw_breakpoint.h b/arch/powerpc/include/asm/hw_breakpoint.h index ece4dc89c9..87c2a74f64 100644 --- a/arch/powerpc/include/asm/hw_breakpoint.h +++ b/arch/powerpc/include/asm/hw_breakpoint.h @@ -90,6 +90,13 @@ static inline void hw_breakpoint_disable(void) extern void thread_change_pc(struct task_struct *tsk, struct pt_regs *regs); int hw_breakpoint_handler(struct die_args *args); +extern int set_dawr(struct arch_hw_breakpoint *brk); +extern bool dawr_force_enable; +static inline bool dawr_enabled(void) +{ + return dawr_force_enable; +} + #else /* CONFIG_HAVE_HW_BREAKPOINT */ static inline void hw_breakpoint_disable(void) { } static inline void thread_change_pc(struct task_struct *tsk, diff --git a/arch/powerpc/kernel/hw_breakpoint.c b/arch/powerpc/kernel/hw_breakpoint.c index fec8a67731..0b2461e1d9 100644 --- a/arch/powerpc/kernel/hw_breakpoint.c +++ b/arch/powerpc/kernel/hw_breakpoint.c @@ -29,11 +29,14 @@ #include #include #include +#include +#include #include #include #include #include +#include #include /* @@ -174,7 +177,7 @@ int hw_breakpoint_arch_parse(struct perf_event *bp, if (!ppc_breakpoint_available()) return -ENODEV; length_max = 8; /* DABR */ - if (cpu_has_f
Re: [PATCH] powerpc/security: Fix spectre_v2 reporting
On Thu, 2019-03-21 at 15:24 +1100, Michael Ellerman wrote: > When I updated the spectre_v2 reporting to handle software count cache > flush I got the logic wrong when there's no software count cache > enabled at all. > > The result is that on systems with the software count cache flush > disabled we print: > > Mitigation: Indirect branch cache disabled, Software count cache flush > > Which correctly indicates that the count cache is disabled, but > incorrectly says the software count cache flush is enabled. > > The root of the problem is that we are trying to handle all > combinations of options. But we know now that we only expect to see > the software count cache flush enabled if the other options are false. > > So split the two cases, which simplifies the logic and fixes the bug. > We were also missing a space before "(hardware accelerated)". > > The result is we see one of: > > Mitigation: Indirect branch serialisation (kernel only) > Mitigation: Indirect branch cache disabled > Mitigation: Software count cache flush > Mitigation: Software count cache flush (hardware accelerated) > > Fixes: ee13cb249fab ("powerpc/64s: Add support for software count cache > flush") > Cc: sta...@vger.kernel.org # v4.19+ > Signed-off-by: Michael Ellerman LGTM Reviewed-by: Michael Neuling > --- > arch/powerpc/kernel/security.c | 23 --- > 1 file changed, 8 insertions(+), 15 deletions(-) > > diff --git a/arch/powerpc/kernel/security.c b/arch/powerpc/kernel/security.c > index 9b8631533e02..b33bafb8fcea 100644 > --- a/arch/powerpc/kernel/security.c > +++ b/arch/powerpc/kernel/security.c > @@ -190,29 +190,22 @@ ssize_t cpu_show_spectre_v2(struct device *dev, struct > device_attribute *attr, c > bcs = security_ftr_enabled(SEC_FTR_BCCTRL_SERIALISED); > ccd = security_ftr_enabled(SEC_FTR_COUNT_CACHE_DISABLED); > > - if (bcs || ccd || count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) { > - bool comma = false; > + if (bcs || ccd) { > seq_buf_printf(, "Mitigation: "); > > - if (bcs) { > + if 
(bcs) > seq_buf_printf(, "Indirect branch serialisation > (kernel only)"); > - comma = true; > - } > > - if (ccd) { > - if (comma) > - seq_buf_printf(, ", "); > - seq_buf_printf(, "Indirect branch cache disabled"); > - comma = true; > - } > - > - if (comma) > + if (bcs && ccd) > seq_buf_printf(, ", "); > > - seq_buf_printf(, "Software count cache flush"); > + if (ccd) > + seq_buf_printf(, "Indirect branch cache disabled"); > + } else if (count_cache_flush_type != COUNT_CACHE_FLUSH_NONE) { > + seq_buf_printf(, "Mitigation: Software count cache flush"); > > if (count_cache_flush_type == COUNT_CACHE_FLUSH_HW) > - seq_buf_printf(, "(hardware accelerated)"); > + seq_buf_printf(, " (hardware accelerated)"); > } else if (btb_flush_enabled) { > seq_buf_printf(, "Mitigation: Branch predictor state flush"); > } else {
Re: [PATCH] selftests/powerpc: New TM signal self test
On Wed, 2018-11-28 at 11:23 -0200, Breno Leitao wrote: > A new self test that forces MSR[TS] to be set without calling any TM > instruction. This test also tries to cause a page fault at a signal > handler, exactly between MSR[TS] set and tm_recheckpoint(), forcing > thread->texasr to be rewritten with TEXASR[FS] = 0, which will cause a BUG > when tm_recheckpoint() is called. > > This test is not deterministic since it is hard to guarantee that the page > access will cause a page fault. Tests have shown that the bug could be > exposed with few interactions in a buggy kernel. This test is configured to > loop 5000x, having a good chance to hit the kernel issue in just one run. > This self test takes less than two seconds to run. You could try using sigaltstack() to put the ucontext somewhere else. Then you could play tricks with that memory to try to force a fault. madvise()+MADV_DONTNEED or fadvise()+POSIX_FADV_DONTNEED might do the trick. This is more extra credit to make it more reliable. Not a requirement. > This test uses set/getcontext because the kernel will recheckpoint > zeroed structures, causing the test to segfault, which is undesired because > the test needs to rerun, so, there is a signal handler for SIGSEGV which > will restart the test. Please put this description at the top of the test also. Other than that, it looks good. 
Mikey > > Signed-off-by: Breno Leitao > --- > tools/testing/selftests/powerpc/tm/.gitignore | 1 + > tools/testing/selftests/powerpc/tm/Makefile | 3 +- > .../powerpc/tm/tm-signal-force-msr.c | 115 ++ > 3 files changed, 118 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/powerpc/tm/tm-signal-force-msr.c > > diff --git a/tools/testing/selftests/powerpc/tm/.gitignore > b/tools/testing/selftests/powerpc/tm/.gitignore > index c3ee8393dae8..89679822ebc9 100644 > --- a/tools/testing/selftests/powerpc/tm/.gitignore > +++ b/tools/testing/selftests/powerpc/tm/.gitignore > @@ -11,6 +11,7 @@ tm-signal-context-chk-fpu > tm-signal-context-chk-gpr > tm-signal-context-chk-vmx > tm-signal-context-chk-vsx > +tm-signal-force-msr > tm-vmx-unavail > tm-unavailable > tm-trap > diff --git a/tools/testing/selftests/powerpc/tm/Makefile > b/tools/testing/selftests/powerpc/tm/Makefile > index 9fc2cf6fbc92..58a2ebd13958 100644 > --- a/tools/testing/selftests/powerpc/tm/Makefile > +++ b/tools/testing/selftests/powerpc/tm/Makefile > @@ -4,7 +4,7 @@ SIGNAL_CONTEXT_CHK_TESTS := tm-signal-context-chk-gpr tm- > signal-context-chk-fpu > > TEST_GEN_PROGS := tm-resched-dscr tm-syscall tm-signal-msr-resv tm-signal- > stack \ > tm-vmxcopy tm-fork tm-tar tm-tmspr tm-vmx-unavail tm-unavailable > tm-trap > \ > - $(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn > + $(SIGNAL_CONTEXT_CHK_TESTS) tm-sigreturn tm-signal-force-msr > > top_srcdir = ../../../../.. 
> include ../../lib.mk > @@ -20,6 +20,7 @@ $(OUTPUT)/tm-vmx-unavail: CFLAGS += -pthread -m64 > $(OUTPUT)/tm-resched-dscr: ../pmu/lib.c > $(OUTPUT)/tm-unavailable: CFLAGS += -O0 -pthread -m64 -Wno- > error=uninitialized -mvsx > $(OUTPUT)/tm-trap: CFLAGS += -O0 -pthread -m64 > +$(OUTPUT)/tm-signal-force-msr: CFLAGS += -pthread > > SIGNAL_CONTEXT_CHK_TESTS := $(patsubst > %,$(OUTPUT)/%,$(SIGNAL_CONTEXT_CHK_TESTS)) > $(SIGNAL_CONTEXT_CHK_TESTS): tm-signal.S > diff --git a/tools/testing/selftests/powerpc/tm/tm-signal-force-msr.c > b/tools/testing/selftests/powerpc/tm/tm-signal-force-msr.c > new file mode 100644 > index ..4441d61c2328 > --- /dev/null > +++ b/tools/testing/selftests/powerpc/tm/tm-signal-force-msr.c > @@ -0,0 +1,115 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright 2018, Breno Leitao, Gustavo Romero, IBM Corp. > + */ > + > +#define _GNU_SOURCE > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "tm.h" > +#include "utils.h" > + > +#define __MASK(X) (1UL<<(X)) > +#define MSR_TS_S_LG 33 /* Trans Mem state: Suspended */ > +#define MSR_TM __MASK(MSR_TM_LG) /* Transactional Mem Available */ > +#define MSR_TS_S__MASK(MSR_TS_S_LG) /* Transaction Suspended */ Surely we have these defined somewhere else in selftests? > + > +#define COUNT_MAX 5000/* Number of interactions */ > + > +/* Setting contexts because the test will crash and we want to recover */ > +ucontext_t init_context, main_context; > + > +static int count, first_time; > + > +void trap_signal_handler(int signo, siginfo_t *si, void *uc) > +{ > + ucontext_t *ucp = uc; > + > + /* > + * Allocating memory in a signal handler, and never freeing it on > + * purpose, forcing the heap increase, so, the memory leak is what > + * we want here. > + */ > + ucp->uc_link = malloc(sizeof(ucontext_t)); > + memcpy(>uc_link, >uc_mcontext, sizeof(ucp->uc_mcontext)); > + > + /* Forcing to enable MSR[TM] */ > + ucp->uc_mcontext.gp_regs[PT_MSR]
Re: [PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint
> Do you mean in this part of code? > > SYSCALL_DEFINE0(rt_sigreturn) > { > > if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set))) > goto badframe; > > ... > if (MSR_TM_SUSPENDED(mfmsr())) > tm_reclaim_current(0); I'm actually thinking after the reclaim, not before. If I follow your original email properly, you have a problem because you end up in this scenario: 1) Current MSR is not TM suspended 2) regs->msr[TS] set 3) get_user() (which may fault) After the tm_reclaim there are cases in restore_tm_sigcontexts() where the above is also the case. Hence why I think we have a problem there too. Mikey
Re: [PATCH] powerpc/tm: Set MSR[TS] just prior to recheckpoint
On Mon, 2018-11-19 at 10:44 -0200, Breno Leitao wrote: > On a signal handler return, the user could set a context with MSR[TS] bits > set, and these bits would be copied to task regs->msr. > > At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set, > several __get_user() are called and then a recheckpoint is executed. Do we have the same problem on the entry side? There are a bunch of copy_from_user() before we do a tm_reclaim() (ie when still transactional). Is there some chance of the same problem there? > This is a problem since a page fault (in kernel space) could happen when > calling __get_user(). If it happens, the process MSR[TS] bits were > already set, but recheckpoint was not executed, and SPRs are still invalid. > > The page fault can cause the current process to be de-scheduled, with > MSR[TS] active and without tm_recheckpoint() being called. More > importantly, without TEXAR[FS] bit set also. This patch is great and should go in ASAP > Since TEXASR might not have the FS bit set, and when the process is > scheduled back, it will try to reclaim, which will be aborted because of > the CPU is not in the suspended state, and, then, recheckpoint. This > recheckpoint will restore thread->texasr into TEXASR SPR, which might be > zero, hitting a BUG_ON(). > > [ 2181.457997] kernel BUG at arch/powerpc/kernel/tm.S:446! Can you put more of the backtrace here? Can be useful for people searching for a similar problem. > This patch simply delays the MSR[TS] set, so, if there is any page fault in > the __get_user() section, it does not have regs->msr[TS] set, since the TM > structures are still invalid, thus avoiding doing TM operations for > in-kernel exceptions and possible process reschedule. Can you put a comment in the code what says after the MSR[TS] setting, there can be no get/put_user before the recheckpoint? Also, it looks like the 32bit signals case is safe, but please add a comment in there too. 
> With this patch, the MSR[TS] will only be set just before recheckpointing > and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in > invalid state. > > It is not possible to move tm_recheckpoint to happen earlier, because it is > required to get the checkpointed registers from userspace, with > __get_user(), thus, the only way to avoid this undesired behavior is > delaying the MSR[TS] set, as done in this patch. > > Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals") > Cc: sta...@vger.kernel.org (v3.9+) > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/signal_64.c | 29 +++-- > 1 file changed, 15 insertions(+), 14 deletions(-) > > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index 83d51bf586c7..15b153bdd826 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -467,20 +467,6 @@ static long restore_tm_sigcontexts(struct task_struct > *tsk, > if (MSR_TM_RESV(msr)) > return -EINVAL; > > - /* pull in MSR TS bits from user context */ > - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); > - > - /* > - * Ensure that TM is enabled in regs->msr before we leave the signal > - * handler. It could be the case that (a) user disabled the TM bit > - * through the manipulation of the MSR bits in uc_mcontext or (b) the > - * TM bit was disabled because a sufficient number of context switches > - * happened whilst in the signal handler and load_tm overflowed, > - * disabling the TM bit. In either case we can end up with an illegal > - * TM state leading to a TM Bad Thing when we return to userspace. 
> - */ > - regs->msr |= MSR_TM; > - > /* pull in MSR LE from user context */ > regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE); > > @@ -572,6 +558,21 @@ static long restore_tm_sigcontexts(struct task_struct > *tsk, > tm_enable(); > /* Make sure the transaction is marked as failed */ > tsk->thread.tm_texasr |= TEXASR_FS; > + > + /* pull in MSR TS bits from user context */ > + regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); > + > + /* > + * Ensure that TM is enabled in regs->msr before we leave the signal > + * handler. It could be the case that (a) user disabled the TM bit > + * through the manipulation of the MSR bits in uc_mcontext or > (b) the > + * TM bit was disabled because a sufficient number of context switches > + * happened whilst in the signal handler and load_tm overflowed, > + * disabling the TM bit. In either case we can end up with an illegal > + * TM state leading to a TM Bad Thing when we return to userspace. > + */ > + regs->msr |= MSR_TM; > + > /* This loads the checkpointed FP/VEC state, if used */ > tm_recheckpoint(&tsk->thread); >
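The ordering constraint being asked for — no user memory access between setting MSR[TS] and the recheckpoint — can be sketched as follows (pseudocode only, not the actual kernel flow):

```
restore_tm_sigcontexts():
    /* all __get_user()/copy_from_user() of the checkpointed state happen
       here, while regs->msr[TS] is still clear: a fault is harmless */
    __get_user(...)

    tm_enable()
    tsk->thread.tm_texasr |= TEXASR_FS      /* mark transaction as failed */

    /* point of no return: from here until tm_recheckpoint() there must be
       no get_user()/put_user() and nothing else that can fault */
    regs->msr |= (msr & MSR_TS_MASK)
    regs->msr |= MSR_TM
    tm_recheckpoint(&tsk->thread)
```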
Re: [RFC PATCH 13/14] powerpc/tm: Do not restore TM without SPRs
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > Currently the signal context restore code enables the bit on the MSR > register without restoring the TM SPRs, which can cause undesired side > effects. > > This is not correct because if TM is enabled in MSR, it means the TM SPR > registers are valid and updated, which is not correct here. In fact, the > live registers may contain previous' thread SPRs. > > Functions check if the register values are valid or not through looking > if the facility is enabled at MSR, as MSR[TM] set means that the TM SPRs > are hot. > > When just enabling MSR[TM] without updating the live SPRs, this can cause a > crash, since current TM SPR from previous thread will be saved on the > current thread, and might not have TEXASR[FS] set, for example. > > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/signal_64.c | 12 +++- > 1 file changed, 11 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index 487c3b6aa2e3..156b90e8ee78 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -478,8 +478,18 @@ static long restore_tm_sigcontexts(struct task_struct > *tsk, >* happened whilst in the signal handler and load_tm overflowed, >* disabling the TM bit. In either case we can end up with an illegal >* TM state leading to a TM Bad Thing when we return to userspace. > + * > + * Every time MSR_TM is enabled, mainly for the b) case, the TM SPRs > + * must be restored in the live registers, since the live registers > + * could contain garbage and later we want to read from live, since > + * MSR_TM is enabled, and MSR[TM] is what is used to check if the > + * TM SPRs live registers are valid or not. >*/ > - regs->msr |= MSR_TM; > + if ((regs->msr & MSR_TM) == 0) { > + regs->msr |= MSR_TM; > + tm_enable(); > + tm_restore_sprs(>thread); > + } I'm wondering if we should put the save/restore TM registers in the early entry/exit code too. 
We'd need to add the check on msr[tm]/load_tm. Distributing the SPR save/restore throughout the kernel is just going to lead us to similar problems that we are having now with reclaim/recheckpoint. Mikey > > /* pull in MSR LE from user context */ > regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
Re: [RFC PATCH 09/14] powerpc/tm: Warn if state is transactional
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > Since every kernel entrance is calling TM_KERNEL_ENTRY, it is not > expected to arrive at this point with a suspended transaction. > > If that is the case, cause a warning and reclaim the current thread in > order to avoid a TM Bad Thing. > > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/process.c | 7 +++ > arch/powerpc/kernel/signal.c | 2 +- > 2 files changed, 4 insertions(+), 5 deletions(-) > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index 73872f751b33..849591bf0881 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -1752,11 +1752,10 @@ void start_thread(struct pt_regs *regs, unsigned long > start, unsigned long sp) > > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > /* > - * Clear any transactional state, we're exec()ing. The cause is > - * not important as there will never be a recheckpoint so it's not > - * user visible. > + * It is a bug if the transaction was not reclaimed until this > + * point. Warn us and try to workaround it calling tm_reclaim(). >*/ > - if (MSR_TM_SUSPENDED(mfmsr())) > + if (WARN_ON(MSR_TM_SUSPENDED(mfmsr( > tm_reclaim_current(0); > #endif Let's turn these into BUG_ON() Mikey > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c > index b3e8db376ecd..cbaccc2be0fb 100644 > --- a/arch/powerpc/kernel/signal.c > +++ b/arch/powerpc/kernel/signal.c > @@ -203,7 +203,7 @@ unsigned long get_tm_stackpointer(struct task_struct *tsk) > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > BUG_ON(tsk != current); > > - if (MSR_TM_ACTIVE(tsk->thread.regs->msr)) { > + if (WARN_ON(MSR_TM_ACTIVE(mfmsr( { > tm_reclaim_current(TM_CAUSE_SIGNAL); > if (MSR_TM_TRANSACTIONAL(tsk->thread.regs->msr)) > return tsk->thread.ckpt_regs.gpr[1];
Re: [RFC PATCH 08/14] powerpc/tm: Recheckpoint at exit path
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > In the past, TIF_RESTORE_TM was being handled with the rest of the TIF > workers, > but, that was too early, and can cause some IRQ to be replayed in suspended > state (after recheckpoint). > > This patch moves TIF_RESTORE_TM handler to as late as possible, it also > forces the IRQ to be disabled, and it will continue to be until RFID, so, > no IRQ will be replayed at all. I.e, if trecheckpoint happens, it will RFID > to userspace. > > This makes TIF_RESTORE_TM a special case that should not be handled > similarly to the _TIF_USER_WORK_MASK. > > Since _TIF_RESTORE_TM is not part of _TIF_USER_WORK_MASK anymore, we > need to force system_call_exit to continue to leaves through > fast_exception_return, so, we add the flags together with > _TIF_USER_WORK_MASK at system_call_exist path. > > If this flag is set at system_call_exit, it means that recheckpoint > will be called, and doing it through fast_exception_return is the only > way to do so. 
> > Signed-off-by: Breno Leitao > --- > arch/powerpc/include/asm/thread_info.h | 2 +- > arch/powerpc/kernel/entry_64.S | 23 ++- > 2 files changed, 15 insertions(+), 10 deletions(-) > > diff --git a/arch/powerpc/include/asm/thread_info.h > b/arch/powerpc/include/asm/thread_info.h > index 544cac0474cb..2835d60bc9ef 100644 > --- a/arch/powerpc/include/asm/thread_info.h > +++ b/arch/powerpc/include/asm/thread_info.h > @@ -139,7 +139,7 @@ void arch_setup_new_exec(void); > > #define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \ >_TIF_NOTIFY_RESUME | _TIF_UPROBE | \ > - _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \ > + _TIF_PATCH_PENDING | \ >_TIF_FSCHECK) > #define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR) > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > index 17484ebda66c..a86619edf29d 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -255,7 +255,7 @@ system_call_exit: > > ld r9,TI_FLAGS(r12) > li r11,-MAX_ERRNO > - andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MAS > K|_TIF_PERSYSCALL_MASK) > + andi. r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK| > _TIF_PERSYSCALL_MASK |_TIF_RESTORE_TM) > bne-.Lsyscall_exit_work > > andi. r0,r8,MSR_FP > @@ -784,14 +784,6 @@ _GLOBAL(ret_from_except_lite) > SCHEDULE_USER > b ret_from_except_lite > 2: > -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM > - andi. r0,r4,_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM > - bne 3f /* only restore TM if nothing else to do */ > - addir3,r1,STACK_FRAME_OVERHEAD > - bl restore_tm_state > - b restore > -3: > -#endif > bl save_nvgprs > /* >* Use a non volatile GPR to save and restore our thread_info flags > @@ -938,6 +930,19 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS) >*/ > .globl fast_exception_return > fast_exception_return: > + > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM > + CURRENT_THREAD_INFO(r4, r1) > + ld r4,TI_FLAGS(r4) > + andi. 
r0,r4,_TIF_RESTORE_TM > + beq 22f > + ld r4,_MSR(r1) /* TODO: MSR[!PR] shouldn't be here */ > + andi. r0,r4,MSR_PR > + beq 22f /* Skip if Kernel thread */ > + addir3,r1,STACK_FRAME_OVERHEAD > + bl restore_tm_state Calling out to C here is a bit concerning this late. The main thing you are calling out to is asm anyway (with a small C wrapper). We might want to adapt the asm this case, rather than the old model of getting it called from __switch_to(). We shouldn't need to call tm_reclaim/tm_recheckpoint from C anymore (especially if we use BUG_ON() rather than WARN_ON() in the cases already mentioned.). Same for patch 1. > +22: > +#endif > ld r3,_MSR(r1) > ld r4,_CTR(r1) > ld r0,_LINK(r1)
Re: [RFC PATCH 05/14] powerpc/tm: Refactor the __switch_to_tm code
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > __switch_to_tm is the function that switches between two tasks which might > have TM enabled. This function is clearly split in two parts, the task that > is leaving the CPU, known as 'prev' and the task that is being scheduled, > known as 'new'. > > It starts checking if the previous task had TM enable, if so, it increases > the load_tm (this is the only place we increment load_tm). It also saves > the TM SPRs here. > > If the previous task was scheduled out with a transaction active, the > failure cause needs to be updated, since it might contain the failure cause > that caused the exception, as TM_CAUSE_MISC. In this case, since there was > a context switch, overwrite the failure cause. > > If the previous task has overflowed load_tm, disable TM, putting the > facility save/restore lazy mechanism at lazy mode. > > Regarding the 'new' task being scheduled, restoring TM SPRs is enough if > the task had TM enabled when it was de-scheduled. (Checking if a > recheckpoint would be required will be done later, at restore_tm_state() > stage.) > > On top of that, both tm_reclaim_task() and tm_recheckpoint_new_task() > functions are not used anymore, removing them. Is the above describing the previous functionality or the refactored functionality? > > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/process.c | 167 -- > 1 file changed, 78 insertions(+), 89 deletions(-) > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index 1842fd96b123..73872f751b33 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -912,48 +912,6 @@ void tm_reclaim_current(uint8_t cause) > tm_reclaim_thread(>thread, cause); > } > > -static inline void tm_reclaim_task(struct task_struct *tsk) > -{ > - /* We have to work out if we're switching from/to a task that's in the > - * middle of a transaction. 
> - * > - * In switching we need to maintain a 2nd register state as > - * oldtask->thread.ckpt_regs. We tm_reclaim(oldproc); this saves the > - * checkpointed (tbegin) state in ckpt_regs, ckfp_state and > - * ckvr_state > - * > - * We also context switch (save) TFHAR/TEXASR/TFIAR in here. > - */ > - struct thread_struct *thr = >thread; > - > - if (!thr->regs) > - return; > - > - if (!MSR_TM_ACTIVE(thr->regs->msr)) > - goto out_and_saveregs; > - > - WARN_ON(tm_suspend_disabled); > - > - TM_DEBUG("--- tm_reclaim on pid %d (NIP=%lx, " > - "ccr=%lx, msr=%lx, trap=%lx)\n", > - tsk->pid, thr->regs->nip, > - thr->regs->ccr, thr->regs->msr, > - thr->regs->trap); > - > - tm_reclaim_thread(thr, TM_CAUSE_RESCHED); > - > - TM_DEBUG("--- tm_reclaim on pid %d complete\n", > - tsk->pid); > - > -out_and_saveregs: > - /* Always save the regs here, even if a transaction's not active. > - * This context-switches a thread's TM info SPRs. We do it here to > - * be consistent with the restore path (in recheckpoint) which > - * cannot happen later in _switch(). > - */ > - tm_save_sprs(thr); > -} > - > extern void __tm_recheckpoint(struct thread_struct *thread); > > void tm_recheckpoint(struct thread_struct *thread) > @@ -980,59 +938,91 @@ void tm_recheckpoint(struct thread_struct *thread) > local_irq_restore(flags); > } > > -static inline void tm_recheckpoint_new_task(struct task_struct *new) > +static void tm_change_failure_cause(struct task_struct *task, uint8_t cause) > { > - if (!cpu_has_feature(CPU_FTR_TM)) > - return; > - > - /* Recheckpoint the registers of the thread we're about to switch to. > - * > - * If the task was using FP, we non-lazily reload both the original and > - * the speculative FP register states. This is because the kernel > - * doesn't see if/when a TM rollback occurs, so if we take an FP > - * unavailable later, we are unable to determine which set of FP regs > - * need to be restored. 
> - */ > - if (!tm_enabled(new)) > - return; > - > - if (!MSR_TM_ACTIVE(new->thread.regs->msr)){ > - tm_restore_sprs(&new->thread); > - return; > - } > - /* Recheckpoint to restore original checkpointed register state. */ > - TM_DEBUG("*** tm_recheckpoint of pid %d (new->msr 0x%lx)\n", > - new->pid, new->thread.regs->msr); > - > - tm_recheckpoint(&new->thread); > - > - /* > - * The checkpointed state has been restored but the live state has > - * not, ensure all the math functionality is turned off to trigger > - * restore_math() to reload. > - */ > - new->thread.regs->msr &= ~(MSR_FP | MSR_VEC | MSR_VSX); > - > - TM_DEBUG("*** tm_recheckpoint of pid %d complete " > - "(kernel msr 0x%lx)\n", > -
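The switch-out flow the commit message describes (increment load_tm, save the TM SPRs, overwrite the failure cause for an active transaction, lazily disable TM on load_tm overflow) can be summarised in a small model. This is an illustrative C sketch only, not the kernel code; `struct task_model` and `TM_LOAD_LIMIT` are invented for the sketch (`TM_CAUSE_RESCHED` does match the kernel's uapi value):

```c
#include <stdbool.h>
#include <stdint.h>

#define TM_CAUSE_RESCHED 0xde  /* as in arch/powerpc/include/uapi/asm/tm.h */
#define TM_LOAD_LIMIT    0x7f  /* hypothetical overflow point for load_tm */

struct task_model {
	bool    tm_enabled;     /* MSR[TM] for the task */
	bool    tm_active;      /* scheduled out mid-transaction? */
	uint8_t load_tm;        /* lazy-enable counter */
	uint8_t failure_cause;  /* e.g. TM_CAUSE_MISC from an earlier trap */
	bool    sprs_saved;
};

/* Model of the 'prev' half of __switch_to_tm(), per the commit message. */
static void switch_out(struct task_model *prev)
{
	if (!prev->tm_enabled)
		return;

	prev->load_tm++;          /* the only place load_tm is incremented */
	prev->sprs_saved = true;  /* TM SPRs are saved on switch-out */

	if (prev->tm_active)      /* a context switch overwrites the cause */
		prev->failure_cause = TM_CAUSE_RESCHED;

	if (prev->load_tm > TM_LOAD_LIMIT)
		prev->tm_enabled = false;  /* back to fully lazy mode */
}
```

The 'new' half, per the message, only needs the TM SPRs restored; the recheckpoint decision is deferred to restore_tm_state().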
Re: [RFC PATCH 03/14] powerpc/tm: Recheckpoint when exiting from kernel
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > This is the only place we are going to recheckpoint now. Now the task > needs to have the TIF_RESTORE_TM flag set, which will get into > restore_tm_state() at the exception exit path, and execute the recheckpoint > depending on the MSR. > > Every time a task is required to recheckpoint, or just have the TM SPRs > restored, the TIF_RESTORE_TM flag should be set and the task MSR should > properly be in a transactional state, which will be checked by > restore_tm_state(). > > After the facility registers are recheckpointed, they are clobbered with > the values that were recheckpointed (and are now also in the checkpoint > area). Which facility registers? I don't understand this. > If a facility is enabled in the MSR that is being returned to user space, then > the facility registers need to be restored, otherwise userspace will see > invalid values. > > This patch simplifies restore_tm_state() to just restore the facility > registers that are enabled when returning to userspace, i.e. the MSR will > be the same that will be put into SRR1, which will be the MSR after RFID. 
> > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/process.c | 38 --- > 1 file changed, 26 insertions(+), 12 deletions(-) > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index 4d5322cfad25..c7e758a42b8f 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -1049,8 +1049,6 @@ static inline void __switch_to_tm(struct task_struct > *prev, > */ > void restore_tm_state(struct pt_regs *regs) > { > - unsigned long msr_diff; > - > /* > * This is the only moment we should clear TIF_RESTORE_TM as > * it is here that ckpt_regs.msr and pt_regs.msr become the same > @@ -1061,19 +1059,35 @@ void restore_tm_state(struct pt_regs *regs) > if (!MSR_TM_ACTIVE(regs->msr)) > return; > > - msr_diff = current->thread.ckpt_regs.msr & ~regs->msr; > - msr_diff &= MSR_FP | MSR_VEC | MSR_VSX; > + tm_enable(); > + /* The only place we recheckpoint */ > + tm_recheckpoint(&current->thread); > > - /* Ensure that restore_math() will restore */ > - if (msr_diff & MSR_FP) > - current->thread.load_fp = 1; > + /* > + * Restore the facility registers that were clobbered during > + * recheckpoint. > + */ > + if (regs->msr & MSR_FP) { > + /* > + * Using load_fp_state() instead of restore_fp() because we > + * want to force the restore, independent of > + * tsk->thread.load_fp. Same for other cases below. > + */ > + load_fp_state(&current->thread.fp_state); > + } > #ifdef CONFIG_ALTIVEC > - if (cpu_has_feature(CPU_FTR_ALTIVEC) && msr_diff & MSR_VEC) > - current->thread.load_vec = 1; > + if (cpu_has_feature(CPU_FTR_ALTIVEC) && regs->msr & MSR_VEC) > + load_vr_state(&current->thread.vr_state); > +#endif > +#ifdef CONFIG_VSX > + if (cpu_has_feature(CPU_FTR_VSX) && regs->msr & MSR_VSX) { > + /* > + * If VSX is enabled, it is expected that VEC and FP are > + * also enabled and already restored the full register set. > + * Cause a warning if that is not the case. 
> + */ > + WARN_ON(!(regs->msr & MSR_VEC) || !(regs->msr & MSR_FP)); } > #endif > - restore_math(regs); > - > - regs->msr |= msr_diff; > } > > #else
Re: [RFC PATCH 01/14] powerpc/tm: Reclaim transaction on kernel entry
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > This patch creates a macro that will be invoked on all entrances to the > kernel, so, in kernel space the transaction will be completely reclaimed > and not suspended anymore. > > This patchset checks if we are coming from PR, if not, skip. Remove the double negative here. ie "This skips when coming from the OS". or "Only happens when coming from PR" > This is useful > when there is an irq_replay() being called after recheckpoint, when the IRQ > is re-enabled. So we are talking about tm_recheckpoint on exit? On exit, we do: tm_recheckpoint -> irq_replay -> rfid? Why not swap the order of the recheckpoint and the replay to avoid this problem? > In this case, we do not want to re-reclaim and > re-recheckpoint, thus, if not coming from PR, skip it completely. More double negatives... Try: "if coming from the OS, skip" or "only do it when coming from userspace" > This macro does not touch the TM SPRs either; they will only be saved and > restored in the context switch code from now on. > This macro will return 0 or 1 in the r3 register, to specify whether a reclaim was > executed or not. 
> > This patchset is based on initial work done by Cyril: > https://patchwork.ozlabs.org/cover/875341/ > > Signed-off-by: Breno Leitao > --- > arch/powerpc/include/asm/exception-64s.h | 46 > arch/powerpc/kernel/entry_64.S | 10 ++ > arch/powerpc/kernel/exceptions-64s.S | 12 +-- > 3 files changed, 66 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/exception-64s.h > b/arch/powerpc/include/asm/exception-64s.h > index 3b4767ed3ec5..931a74ba037b 100644 > --- a/arch/powerpc/include/asm/exception-64s.h > +++ b/arch/powerpc/include/asm/exception-64s.h > @@ -36,6 +36,7 @@ > */ > #include > #include > +#include > > /* PACA save area offsets (exgen, exmc, etc) */ > #define EX_R90 > @@ -677,10 +678,54 @@ BEGIN_FTR_SECTION \ > beqlppc64_runlatch_on_trampoline; \ > END_FTR_SECTION_IFSET(CPU_FTR_CTRL) > > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM > + > +/* > + * This macro will reclaim a transaction if called when coming from userspace > + * (MSR.PR = 1) and if the transaction state is active or suspended. > + * > + * Since we don't want to reclaim when coming from kernel, for instance after > + * a trechkpt. or a IRQ replay, the live MSR is not useful and instead of it > the > + * MSR from thread stack is used to check the MSR.PR bit. > + * This macro has one argument which is the cause that will be used by > treclaim. > + * and returns in r3 '1' if the reclaim happens or '0' if reclaim didn't > + * happen, which is useful to know what registers were clobbered. > + * > + * NOTE: If addition registers are clobbered here, make sure the callee > + * function restores them before proceeding. > + */ > +#define TM_KERNEL_ENTRY(cause) > \ > + ld r3, _MSR(r1); \ > + andi. r0, r3, MSR_PR; /* Coming from userspace? */\ > + beq 1f; /* Skip reclaim if MSR.PR != 1 */ \ > + rldicl. r0, r3, (64-MSR_TM_LG), 63; /* Is TM enabled? */\ > + beq 1f; /* Skip reclaim if TM is off */ \ > + rldicl. 
r0, r3, (64-MSR_TS_LG), 62; /* Is active */ \ > + beq 1f; /* Skip reclaim if neither */ \ > + /* \ > + * If there is a transaction active or suspended, save the \ > + * non-volatile GPRs if they are not already saved. \ > + */ \ > + bl save_nvgprs;\ > + /* \ > + * Soft disable the IRQs, otherwise it might cause a CPU hang. \ > + */ \ > + RECONCILE_IRQ_STATE(r10, r11); \ > + li r3, cause; \ > + bl tm_reclaim_current; \ Are we ready to call out to C at this point in the exception handlers? > + li r3, 1; /* Reclaim happened */ \ > + b 2f; \ > +1: li r3, 0; /* Reclaim didn't happen */ \ > +2: > +#else > +#define TM_KERNEL_ENTRY(cause) > +#endif > + > #define EXCEPTION_COMMON(area, trap, label, hdlr, ret, additions) \ > EXCEPTION_PROLOG_COMMON(trap, area);\ > /* Volatile regs are potentially clobbered here */ \ > additions; \ > + TM_KERNEL_ENTRY(TM_CAUSE_MISC);
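The guard at the top of TM_KERNEL_ENTRY — skip unless we trapped from userspace with TM enabled and a transaction active or suspended — boils down to a three-way predicate. A hedged C model (bit positions invented for the sketch; the real checks are the andi./rldicl. sequence in the macro above):

```c
#include <stdint.h>

/* Bit positions invented for this model. */
#define MSR_PR      (1u << 0)  /* problem state, i.e. userspace */
#define MSR_TM      (1u << 1)  /* TM facility enabled */
#define MSR_TS_MASK (3u << 2)  /* transaction state: active or suspended */

/* Returns what TM_KERNEL_ENTRY leaves in r3: 1 if a reclaim happened. */
static int tm_kernel_entry_reclaims(uint32_t msr)
{
	if (!(msr & MSR_PR))       /* from the kernel (e.g. after a */
		return 0;          /* trecheckpoint or IRQ replay): skip */
	if (!(msr & MSR_TM))       /* TM facility off: skip */
		return 0;
	if (!(msr & MSR_TS_MASK))  /* no transaction in flight: skip */
		return 0;
	return 1;                  /* reclaim with the supplied cause */
}
```

Note the MSR tested is the one saved on the stack (`_MSR(r1)`), not the live MSR, exactly because the live MSR is meaningless for the "did we come from PR?" question once we are in the kernel.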
Re: [RFC PATCH 02/14] powerpc/tm: Reclaim on unavailable exception
On Tue, 2018-11-06 at 10:40 -0200, Breno Leitao wrote: > If there is a FP/VEC/Altivec touch inside a transaction and the facility is > disabled, then a facility unavailable exception is raised and ends up > calling {fp,vec,vsx}_unavailable_tm, which was reclaiming and > recheckpointing. > > This is not required anymore, since the checkpointed state was reclaimed in > the exception entrance, and it will be recheckpointed by restore_tm_state > later. > > Adding a WARN_ON() warning if we hit the _unavailable_tm() in suspended > mode, i.e, the reclaim was not executed somehow in the trap entrance, and > this is a bug. The "why" above is good and the important part of the commit but, Can you also add what you're doing? The above would suggest you're just removing some things but you're actually adding the TM_KERNEL_ENTRY() macro too. Mikey > > Signed-off-by: Breno Leitao > --- > arch/powerpc/include/asm/exception-64s.h | 4 > arch/powerpc/kernel/exceptions-64s.S | 3 +++ > arch/powerpc/kernel/traps.c | 22 -- > 3 files changed, 11 insertions(+), 18 deletions(-) > > diff --git a/arch/powerpc/include/asm/exception-64s.h > b/arch/powerpc/include/asm/exception-64s.h > index 931a74ba037b..80f01d5683c3 100644 > --- a/arch/powerpc/include/asm/exception-64s.h > +++ b/arch/powerpc/include/asm/exception-64s.h > @@ -711,6 +711,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_CTRL) >* Soft disable the IRQs, otherwise it might cause a CPU hang. 
\ >*/ \ > RECONCILE_IRQ_STATE(r10, r11); \ > + /* \ > + * Although this cause will be set initially, it might be \ > + * updated later, once the exception is better understood \ > + */ \ > li r3, cause; \ > bl tm_reclaim_current; \ > li r3, 1; /* Reclaim happened */ \ > diff --git a/arch/powerpc/kernel/exceptions-64s.S > b/arch/powerpc/kernel/exceptions-64s.S > index 5c685a46202d..47e05b09eed6 100644 > --- a/arch/powerpc/kernel/exceptions-64s.S > +++ b/arch/powerpc/kernel/exceptions-64s.S > @@ -786,6 +786,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM) > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > 2: /* User process was in a transaction */ > bl save_nvgprs > + TM_KERNEL_ENTRY(TM_CAUSE_FAC_UNAV) > RECONCILE_IRQ_STATE(r10, r11) > addir3,r1,STACK_FRAME_OVERHEAD > bl fp_unavailable_tm > @@ -1128,6 +1129,7 @@ BEGIN_FTR_SECTION > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > 2: /* User process was in a transaction */ > bl save_nvgprs > + TM_KERNEL_ENTRY(TM_CAUSE_FAC_UNAV) > RECONCILE_IRQ_STATE(r10, r11) > addir3,r1,STACK_FRAME_OVERHEAD > bl altivec_unavailable_tm > @@ -1164,6 +1166,7 @@ BEGIN_FTR_SECTION > #ifdef CONFIG_PPC_TRANSACTIONAL_MEM > 2: /* User process was in a transaction */ > bl save_nvgprs > + TM_KERNEL_ENTRY(TM_CAUSE_FAC_UNAV) > RECONCILE_IRQ_STATE(r10, r11) > addir3,r1,STACK_FRAME_OVERHEAD > bl vsx_unavailable_tm > diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c > index 9a86572db1ef..e74b735e974c 100644 > --- a/arch/powerpc/kernel/traps.c > +++ b/arch/powerpc/kernel/traps.c > @@ -1742,23 +1742,10 @@ void fp_unavailable_tm(struct pt_regs *regs) > * transaction, and probably retry but now with FP enabled. So the > * checkpointed FP registers need to be loaded. >*/ > - tm_reclaim_current(TM_CAUSE_FAC_UNAV); > - > - /* > - * Reclaim initially saved out bogus (lazy) FPRs to ckfp_state, and > - * then it was overwrite by the thr->fp_state by tm_reclaim_thread(). > - * > - * At this point, ck{fp,vr}_state contains the exact values we want to > - * recheckpoint. 
> - */ > + WARN_ON(MSR_TM_SUSPENDED(mfmsr())); > > /* Enable FP for the task: */ > current->thread.load_fp = 1; > - > - /* > - * Recheckpoint all the checkpointed ckpt, ck{fp, vr}_state registers. > - */ > - tm_recheckpoint(>thread); > } > > void altivec_unavailable_tm(struct pt_regs *regs) > @@ -1770,10 +1757,10 @@ void altivec_unavailable_tm(struct pt_regs *regs) > TM_DEBUG("Vector Unavailable trap whilst transactional at 0x%lx," >"MSR=%lx\n", >regs->nip, regs->msr); > - tm_reclaim_current(TM_CAUSE_FAC_UNAV); > + WARN_ON(MSR_TM_SUSPENDED(mfmsr())); > current->thread.load_vec = 1; > - tm_recheckpoint(>thread); > current->thread.used_vr = 1; > + > } > > void vsx_unavailable_tm(struct pt_regs *regs) > @@ -1792,12 +1779,11 @@ void vsx_unavailable_tm(struct pt_regs *regs) >
Re: [RFC PATCH v2 00/14] New TM Model
> In fact, I was the one that identified this performance degradation issue, > and reported to Adhemerval who kindly fixed it with > f0458cf4f9ff3d870c43b624e6dccaaf657d5e83. > > Anyway, I think we are safe here. FWIW Agreed. PPC_FEATURE2_HTM_NOSC should be preserved by this series. Mikey
Re: [PATCH v4] powerpc: Avoid code patching freed init sections
On Tue, 2018-10-02 at 23:35 +0200, Andreas Schwab wrote: > On Sep 14 2018, Michael Neuling wrote: > > > This stops us from doing code patching in init sections after they've > > been freed. > > This breaks booting on PowerBook6,7, crashing very early. Sorry. Can you try: http://patchwork.ozlabs.org/patch/977195/ Mikey
Re: [PATCH 2/2] powerpc/tm: Avoid SPR flush if TM is disabled
On Mon, 2018-10-01 at 16:47 -0300, Breno Leitao wrote: > There is a bug in the flush_tmregs_to_thread() function, where it forces > TM SPRs to be saved to the thread even if the TM facility is disabled. > > This bug could be reproduced using a simple test case: > > mtspr(SPRN_TEXASR, XX); > sleep until load_tm == 0 > cause a coredump > read SPRN_TEXASR in the coredump > > In this case, the coredump may contain an invalid SPR, because the > current code is flushing live SPRs (Used by the last thread with TM > active) into the current thread, overwriting the latest SPRs (which were > valid). > > This patch checks if TM is enabled for current task before > saving the SPRs, otherwise, the TM is lazily disabled and the thread > value is already up-to-date and could be used directly, and saving is > not required. Acked-by: Michael Neuling Breno, can you send your selftest upstream that also? Mikey > > Fixes: cd63f3cf1d5 ("powerpc/tm: Fix saving of TM SPRs in core dump") > Signed-off-by: Breno Leitao > --- > arch/powerpc/kernel/ptrace.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > index 9667666eb18e..e0a2ee865032 100644 > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -138,7 +138,12 @@ static void flush_tmregs_to_thread(struct task_struct > *tsk) > > if (MSR_TM_SUSPENDED(mfmsr())) { > tm_reclaim_current(TM_CAUSE_SIGNAL); > - } else { > + } else if (tm_enabled(tsk)) { > + /* > + * Only flush TM SPRs to the thread if TM was enabled, > + * otherwise (TM lazily disabled), the thread already > + * contains the latest SPR value > + */ > tm_enable(); > tm_save_sprs(&(tsk->thread)); > }
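The fixed flush_tmregs_to_thread() now distinguishes three cases instead of two. A sketch of the decision (illustrative C, not the kernel code):

```c
#include <stdbool.h>

enum flush_action {
	FLUSH_RECLAIM,    /* MSR TM-suspended: reclaim the live transaction */
	FLUSH_SAVE_SPRS,  /* task has TM enabled: live SPRs belong to it */
	FLUSH_NOTHING     /* TM lazily off: thread copy already up to date */
};

static enum flush_action flush_action(bool msr_tm_suspended,
				      bool task_tm_enabled)
{
	if (msr_tm_suspended)
		return FLUSH_RECLAIM;
	if (task_tm_enabled)
		return FLUSH_SAVE_SPRS;
	return FLUSH_NOTHING;
}
```

Before the fix, the FLUSH_NOTHING case incorrectly fell into FLUSH_SAVE_SPRS, copying whatever stale TM SPRs happened to be live on the CPU (possibly another task's) over the thread's valid values — the coredump corruption described above.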
Re: [PATCH] powerpc/lib: fix book3s/32 boot failure due to code patching
On Mon, 2018-10-01 at 12:21 +, Christophe Leroy wrote: > Commit 51c3c62b58b3 ("powerpc: Avoid code patching freed init > sections") accesses 'init_mem_is_free' flag too early, before the > kernel is relocated. This provokes early boot failure (before the > console is active). > > As it is not necessary to do this verification that early, this > patch moves the test into patch_instruction() instead of > __patch_instruction(). > > This modification also has the advantage of avoiding unnecessary > remappings. > > Fixes: 51c3c62b58b3 ("powerpc: Avoid code patching freed init sections") > Signed-off-by: Christophe Leroy Thanks Acked-by: Michael Neuling The original patch was also marked for stable so we should do the same here. Cc: sta...@vger.kernel.org # 4.13+ > --- > arch/powerpc/lib/code-patching.c | 20 > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code- > patching.c > index 6ae2777c220d..5ffee298745f 100644 > --- a/arch/powerpc/lib/code-patching.c > +++ b/arch/powerpc/lib/code-patching.c > @@ -28,12 +28,6 @@ static int __patch_instruction(unsigned int *exec_addr, > unsigned int instr, > { > int err; > > - /* Make sure we aren't patching a freed init section */ > - if (init_mem_is_free && init_section_contains(exec_addr, 4)) { > - pr_debug("Skipping init section patching addr: 0x%px\n", > exec_addr); > - return 0; > - } > - > __put_user_size(instr, patch_addr, 4, err); > if (err) > return err; > @@ -148,7 +142,7 @@ static inline int unmap_patch_area(unsigned long addr) > return 0; > } > > -int patch_instruction(unsigned int *addr, unsigned int instr) > +static int do_patch_instruction(unsigned int *addr, unsigned int instr) > { > int err; > unsigned int *patch_addr = NULL; > @@ -188,12 +182,22 @@ int patch_instruction(unsigned int *addr, unsigned int > instr) > } > #else /* !CONFIG_STRICT_KERNEL_RWX */ > > -int patch_instruction(unsigned int *addr, unsigned int instr) > +static int 
do_patch_instruction(unsigned int *addr, unsigned int instr) > { > return raw_patch_instruction(addr, instr); > } > > #endif /* CONFIG_STRICT_KERNEL_RWX */ > + > +int patch_instruction(unsigned int *addr, unsigned int instr) > +{ > + /* Make sure we aren't patching a freed init section */ > + if (init_mem_is_free && init_section_contains(addr, 4)) { > + pr_debug("Skipping init section patching addr: 0x%px\n", addr); > + return 0; > + } > + return do_patch_instruction(addr, instr); > +} > NOKPROBE_SYMBOL(patch_instruction); > > int patch_branch(unsigned int *addr, unsigned long target, int flags)
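The net effect of the move is that the freed-init-section check now guards every path through patch_instruction(), including the raw !CONFIG_STRICT_KERNEL_RWX one, while running before any remapping work. A hedged sketch of the guard itself (plain C with integer addresses; the explicit range bounds stand in for init_section_contains()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the guard: skip patching a 4-byte instruction that lies
 * inside the init section once that section has been freed. */
static bool skip_patch(bool init_mem_is_free, uintptr_t addr,
		       uintptr_t init_start, uintptr_t init_end)
{
	return init_mem_is_free &&
	       addr >= init_start && addr + 4 <= init_end;
}
```

Patching proceeds normally for any address outside the init range, or at any time before free_initmem() flips the flag.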
Re: [v4] powerpc: Avoid code patching freed init sections
On Mon, 2018-10-01 at 13:25 +0200, Christophe LEROY wrote: > > Le 21/09/2018 à 13:59, Michael Ellerman a écrit : > > On Fri, 2018-09-14 at 01:14:11 UTC, Michael Neuling wrote: > > > This stops us from doing code patching in init sections after they've > > > been freed. > > > > > > In this chain: > > >kvm_guest_init() -> > > > kvm_use_magic_page() -> > > >fault_in_pages_readable() -> > > >__get_user() -> > > > __get_user_nocheck() -> > > >barrier_nospec(); > > > > > > We have a code patching location at barrier_nospec() and > > > kvm_guest_init() is an init function. This whole chain gets inlined, > > > so when we free the init section (hence kvm_guest_init()), this code > > > goes away and hence should no longer be patched. > > > > > > We seen this as userspace memory corruption when using a memory > > > checker while doing partition migration testing on powervm (this > > > starts the code patching post migration via > > > /sys/kernel/mobility/migration). In theory, it could also happen when > > > using /sys/kernel/debug/powerpc/barrier_nospec. > > > > > > cc: sta...@vger.kernel.org # 4.13+ > > > Signed-off-by: Michael Neuling > > > Reviewed-by: Nicholas Piggin > > > Reviewed-by: Christophe Leroy > > > > Applied to powerpc fixes, thanks. > > > > https://git.kernel.org/powerpc/c/51c3c62b58b357e8d35e4cc32f7b4e > > > > This patch breaks booting on my MPC83xx board (book3s/32) very early > (before console is active), provoking restart. > u-boot reports a checkstop reset at restart. > > Reverting this commit fixes the issue. > > The following patch fixes the issue as well, but I think it is not the > best solution. I still think the test should be in patch_instruction() > instead of being in __patch_instruction(), see my comment on v2 Arrh, sorry. Can you write this up as a real patch with a signed off by so mpe can take it? 
Mikey > > Christophe > > diff --git a/arch/powerpc/lib/code-patching.c > b/arch/powerpc/lib/code-patching.c > index 6ae2777..6192fda 100644 > --- a/arch/powerpc/lib/code-patching.c > +++ b/arch/powerpc/lib/code-patching.c > @@ -29,7 +29,7 @@ static int __patch_instruction(unsigned int > *exec_addr, unsigned int instr, > int err; > > /* Make sure we aren't patching a freed init section */ > - if (init_mem_is_free && init_section_contains(exec_addr, 4)) { > + if (*PTRRELOC(&init_mem_is_free) && > init_section_contains(exec_addr, 4)) { > pr_debug("Skipping init section patching addr: > 0x%px\n", exec_addr); > return 0; > } > > > Christophe >
Re: [RFC PATCH 08/11] powerpc/tm: Do not reclaim on ptrace
On Sun, 2018-09-30 at 20:51 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/28/2018 02:36 AM, Michael Neuling wrote: > > > > > + WARN_ON(MSR_TM_SUSPENDED(mfmsr())); + + tm_enable(); + > > > > > tm_save_sprs(&(tsk->thread)); > > > > > > > > Do we need to check if TM was enabled in the task before saving the > > > > TM SPRs? > > > > > > > > What happens if TM was lazily off and hence we had someone else's TM > > > > SPRs in the CPU currently? Wouldn't this flush the wrong values to > > > > the task_struct? > > > > > > > > I think we need to check the processes MSR before doing this. > > > > > > Yes, it is a *very* good point, and I think we are vulnerable even > > > before this patch (in the current kernel). Take a look above, we are > > > calling tm_save_sprs() if MSR is not TM suspended independently if TM > > > is lazily off. > > > > I think you're right, we might already have an issue. There are some > > paths in here that don't check the userspace msr or any of the lazy tm > > enable. :( > > I was able to create a test case that reproduces this bug cleanly. > > The testcase basically sleeps for N cycles, and then segfaults. > > If N is high enough to have load_tm overflowed, then you see a corrupted > TEXASR value in the core dump file. If load_tm != 0 during the coredump, you > see the expected TEXASR value. > > I wrote a small bash that check for both cases. > > $ git clone https://github.com/leitao/texasr_corrupt.git > $ make check > > Anyway, I will propose a fix for this problem soon, since this whole patchset > may delay to get ready. Is it OK? Yeah, best to get a fix for this one out soon. Mikey
Re: [RFC PATCH 08/11] powerpc/tm: Do not reclaim on ptrace
On Thu, 2018-09-27 at 18:03 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/18/2018 02:36 AM, Michael Neuling wrote: > > On Wed, 2018-09-12 at 16:40 -0300, Breno Leitao wrote: > > > Make sure that we are not suspended on ptrace and that the registers were > > > already reclaimed. > > > > > > Since the data was already reclaimed, there is nothing to be done here > > > except to restore the SPRs. > > > > > > Signed-off-by: Breno Leitao > > > --- > > > arch/powerpc/kernel/ptrace.c | 10 -- > > > 1 file changed, 4 insertions(+), 6 deletions(-) > > > > > > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > > > index 9667666eb18e..cf6ee9154b11 100644 > > > --- a/arch/powerpc/kernel/ptrace.c > > > +++ b/arch/powerpc/kernel/ptrace.c > > > @@ -136,12 +136,10 @@ static void flush_tmregs_to_thread(struct > > > task_struct > > > *tsk) > > > if ((!cpu_has_feature(CPU_FTR_TM)) || (tsk != current)) > > > return; > > > > > > - if (MSR_TM_SUSPENDED(mfmsr())) { > > > - tm_reclaim_current(TM_CAUSE_SIGNAL); > > > - } else { > > > - tm_enable(); > > > - tm_save_sprs(&(tsk->thread)); > > > - } > > > + WARN_ON(MSR_TM_SUSPENDED(mfmsr())); > > > + > > > + tm_enable(); > > > + tm_save_sprs(&(tsk->thread)); > > > > Do we need to check if TM was enabled in the task before saving the TM SPRs? > > > > What happens if TM was lazily off and hence we had someone else's TM SPRs in > > the > > CPU currently? Wouldn't this flush the wrong values to the task_struct? > > > > I think we need to check the processes MSR before doing this. > > Yes, it is a *very* good point, and I think we are vulnerable even before > this patch (in the current kernel). Take a look above, we are calling > tm_save_sprs() if MSR is not TM suspended independently if TM is lazily off. I think you're right, we might already have an issue. There are some paths in here that don't check the userspace msr or any of the lazy tm enable. :( Mikey > It shouldn't be hard to create a test case for it. 
I can try to call > ptrace(PTRACE_GETVRREGS) on a task that sleeps until TM is lazily disabled, > and compare whether the TM SPRs changed in the meantime (while doing HTM in parallel > to keep the HTM SPRs changing). Let's see if I can read another task's TM SPRs. > > Thank you. >
Re: [RFC PATCH 05/11] powerpc/tm: Function that updates the failure code
On Thu, 2018-09-27 at 17:58 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/17/2018 10:29 PM, Michael Neuling wrote: > > On Wed, 2018-09-12 at 16:40 -0300, Breno Leitao wrote: > > > Now the transaction reclaim happens much earlier in the trap handler, and > > > it is impossible to know precisely, at that early time, what should be set > > > as the failure cause for some specific cases, as, if the task will be > > > rescheduled, thus, the transaction abort case should be updated from > > > TM_CAUSE_MISC to TM_CAUSE_RESCHED, for example. > > > > Please add comments to where this is used (in the EXCEPTION_COMMON macro I > > think) > > that say this might happen. > > Is it OK if I comment it in the TM_KERNEL_ENTRY macro, since the failure cause > could be updated independently of the exception being executed, so, every call > to TM_KERNEL_ENTRY can have the cause overwritten. Sure. > I.e. it does not matter if the exception is a system call or a page fault, > the failure cause will be updated if there is a process reschedule after the > exception/syscall is handled. > > Thank you >
Re: [RFC PATCH 11/11] selftests/powerpc: Adapt the test
On Thu, 2018-09-27 at 17:57 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/18/2018 03:36 AM, Michael Neuling wrote: > > On Wed, 2018-09-12 at 16:40 -0300, Breno Leitao wrote: > > > The Documentation/powerpc/transactional_memory.txt says: > > > > > > "Syscalls made from within a suspended transaction are performed as > > > normal > > > and the transaction is not explicitly doomed by the kernel. However, > > > what the kernel does to perform the syscall may result in the > > > transaction > > > being doomed by the hardware." > > > > > > With this new TM mechanism, the syscall will continue to be executed if > > > the > > > syscall happens on a suspended syscall, but, the syscall will *fail* if > > > the > > > transaction is still active during the syscall invocation. > > > > Not sure I get this. This doesn't seem any different to before. > > > > An active (not suspended) transaction *will* result in the syscall failing > > and > > the transaction being doomed. > > > > A syscall in a suspended transaction should succeed and the transaction. > > I understand that a transaction will always be doomed if there is a > reclaim/checkpoint, thus, if the you make a system call inside a suspended > transaction, it will reclaim and recheckpoint, thus, dooming the transaction, > and failing it on the next RFID. So currently a syscall won't explicitly treclaim/trecheckpoint so it won't necessarily be doomed. With your new patches it will be. > So, the syscall executed in suspended mode may succeed, because it will not > be skipped (as in active mode), but the transaction will *always* fail, > because there was a reclaim and recheckpoint. > > > You might need to clean up the language. 
I try to use: > > >Active == transactional but not suspended (ie MSR[TS] = T) > >Suspended == suspended (ie MSR[TS] = S) > >Doomed == transaction to be rolled back at next opportunity (ie tcheck > > returns doomed) > > > > (note: the kernel MSR_TM_ACTIVE() macro is not consistent with this since > > it's > > MSR[TS] == S or T). > > Ok, I agree with this language as well. I might want to improve the code to > follow the same language all across the code. > > > > On the syscall path, if the transaction is active and not suspended, it > > > will call TM_KERNEL_ENTRY which will reclaim and recheckpoint the > > > transaction, thus, dooming the transaction on userspace return, with > > > failure code TM_CAUSE_SYSCALL. > > > > But the test below is on a suspend transaction? > > Sorry, I meant "suspended transaction" above instead of "transaction is > active and not suspended". > > > > > > This new model will break part of this test, but I understand that the > > > documentation above didn't guarantee that the syscall would succeed, and > > > it > > > will never succeed from now on. > > > > The syscall should pass in suspend (modulo the normal syscall checks). The > > transaction may fail as a result. > > Perfect, and if the transaction fails, the CPU will rollback the changes and > restore the checkpoint registers (replacing the r3 that contains the pid > value), thus, it will be as if the "getpid" system call didn't execute. No. If we are suspended, then we go back right after the sc. We don't get rolled back till the tresume. Mikey > For this test specifically, it assumes the syscall didn't execute if the > transaction failed. Take a look: > > FUNC_START(getppid_tm_suspended) > tbegin. > beq 1f > li r0, __NR_getppid > tsuspend. > sc > tresume. > tend. > blr > 1: > li r3, -1 > blr > > So, the test thinks the syscall failed because the transaction aborted. > > Anyway, I can try to improve this test other than removing this test, but I > am not sure how. 
> > Thank you >
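Mikey's rollback rule from this exchange — a doomed transaction only rolls back on the transition into Transactional state, so immediately at rfid when returning to T, but only at the later tresume when returning to S — can be stated as a pair of predicates. Illustrative C only:

```c
#include <stdbool.h>

enum ts_state { TS_NONE, TS_SUSPENDED, TS_TRANSACTIONAL };

/* A doomed transaction rolls back on the MSR[TS] -> T transition. */
static bool rolls_back_at_rfid(enum ts_state user_ts)
{
	return user_ts == TS_TRANSACTIONAL;  /* rfid lands directly in T */
}

static bool rolls_back_at_tresume(enum ts_state user_ts)
{
	return user_ts == TS_SUSPENDED;      /* S -> T happens at tresume */
}
```

So in getppid_tm_suspended above, the sc and the syscall itself complete while suspended and execution continues after the sc; the rollback that discards r3 only happens at the tresume.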
Re: [RFC PATCH 10/11] powerpc/tm: Set failure summary
On Thu, 2018-09-27 at 17:52 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/18/2018 02:50 AM, Michael Neuling wrote: > > On Wed, 2018-09-12 at 16:40 -0300, Breno Leitao wrote: > > > Since the transaction will be doomed with treckpt., the TEXASR[FS] > > > should be set, to reflect that the transaction is a failure. This patch > > > ensures it before recheckpointing, and remove changes from other places > > > that were calling recheckpoint. > > > > TEXASR[FS] should be set by the reclaim. > > Do you mean that the CPU should set TEXASR[FS] when treclaim is called, or, > that the tm_reclaim? > > Looking at the ISA, I didn't see TEXASR[FS] being set automatically when a > reclaim happens, treclaim in ISA talks about "TMRecordFailure(cause)" and that macro sets TEXASR[FS]=1. So yes treclaim always sets TEXASR[FS]=1. > although, I see it needs TEXASR[FS] to be set when executing a > trecheckpoint, otherwise it will cause a TM Bad Thing. Yep > That is why I am forcing TEXASR[FS]=1 to doom the transaction so we can > recheckpoint it, but it seems I understood this wrongly. You shouldn't need to. We do a bug_on() just before the trecheckpoint to make sure it's set, but you shouldn't need to set it (other than the signals case). > > I don't know why you'd need to set this > > explicitly in process.c. The only case is when the user supplies a bad > > signal > > context, but we should check that in the signals code, not process.c > > > > Hence I think this patch is wrong. > > > > Also, according to the architecture, TEXASR[FS] HAS TO BE SET on > > trecheckpoint > > otherwise you'll get a TM Bad Thing. You should say that rather than > > suggesting > > it's because the transaction is doomed. It's illegal to not do it. That's > > why we > > have this check in arch/powerpc/kernel/tm.S. > > When you say "HAS TO BE SET", do you mean that the hardware will set it and > we shouldn't care about this flag? Thus, if I am hitting EMIT_BUG_ENTRY, it > means my TEXASR was messed somehow? 
I'm just saying what you said before about the tm bad thing. We have to make sure it's set before we do the trecheckpoint otherwise we end up in the TM bad thing. The treclaim should handle this for us (but again the signals case need to be checked). Mikey
Re: [RFC PATCH 09/11] powerpc/tm: Do not restore default DSCR
On Thu, 2018-09-27 at 17:51 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/18/2018 02:41 AM, Michael Neuling wrote: > > On Wed, 2018-09-12 at 16:40 -0300, Breno Leitao wrote: > > > In the previous TM code, trecheckpoint was being executed in the middle of > > > an exception, thus, DSCR was being restored to default kernel DSCR value > > > after trecheckpoint was done. > > > > > > With this current patchset, trecheckpoint is executed just before getting > > > to userspace, at ret_from_except_lite, for example. Thus, we do not need > > > to > > > set default kernel DSCR value anymore, as we are leaving kernel space. It > > > is OK to keep the checkpointed DSCR value into the live SPR, mainly > > > because > > > the transaction is doomed and it will fail soon (after RFID), > > > > What if we are going back to a suspended transaction? It will remain live > > until > > userspace does a tresume > > Hmm, I understand that once we get in kernel space, and call > treclaim/trecheckpoint, the transaction will be doomed and it will abort and > rollback when we leave kernel space. I.e., if we call treclaim/trecheckpoint > in kernel space, the transaction will *always* abort just after RFID, so, a > possible tresume will never be executed. Is my understanding wrong? Your understanding is wrong. We don't roll back until MSR[TS] = T. There are two cases for the RFID. 1) if you RFID back to userspace that is transactional (ie MSR[TS] = T), then it will immediately rollback as soon as the RFID happens since we are going directly to T. 2) If you RFID back to userspace that is suspended (ie MSR[TS] = S), then it won't roll back until userspace does a tresume. It doesn't rollback until we go MSR[TS] = S -> T. > > > > > > so, > > > continuing with the pre-checkpointed DSCR value is what seems correct. > > > > Reading this description suggests this patch isn't really needed. Right? 
> > Maybe the description is not clear, but I understand this patch is required, > otherwise we will leave userspace with a default DSCR value. > > By the way, do you know if there is a change in DSCR inside a transaction, > will it be reverted if the transaction is aborted? Yes it will be reverted. We even have a selftest for it in tools/testing/selftests/powerpc/tm/tm-resched-dscr.c There are a bunch of checkpointed SPRs. From the arch: Checkpointed registers: The set of registers that are saved to the “checkpoint area” when a transaction is initiated, and restored upon transaction failure, is a subset of the architected register state, consisting of the General Purpose Registers, Floating-Point Registers, Vector Registers, Vector-Scalar Registers, and the following Special Registers and fields: CR fields other than CR0, XER, LR, CTR, FPSCR, AMR, PPR, VRSAVE, VSCR, DSCR, and TAR. Mikey
[PATCH] powerpc/tm: Reformat comments
The comments in this file don't conform to the coding style so take them to "Comment Formatting Re-Education Camp" Suggested-by: Michael "Camp Drill Sargent" Ellerman Signed-off-by: Michael Neuling --- arch/powerpc/kernel/tm.S | 49 +--- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S index 50e5cff10d..c954aecd9b 100644 --- a/arch/powerpc/kernel/tm.S +++ b/arch/powerpc/kernel/tm.S @@ -92,7 +92,8 @@ _GLOBAL(tm_abort) blr EXPORT_SYMBOL_GPL(tm_abort); -/* void tm_reclaim(struct thread_struct *thread, +/* + * void tm_reclaim(struct thread_struct *thread, *uint8_t cause) * * - Performs a full reclaim. This destroys outstanding @@ -163,14 +164,16 @@ _GLOBAL(tm_reclaim) */ TRECLAIM(R4)/* Cause in r4 */ - /* GPRs */ - /* Stash the checkpointed r13 away in the scratch SPR and get the real + /* +* GPRs +* Stash the checkpointed r13 away in the scratch SPR and get the real * paca */ SET_SCRATCH0(r13) GET_PACA(r13) - /* Stash the checkpointed r1 away in paca tm_scratch and get the real + /* +* Stash the checkpointed r1 away in paca tm_scratch and get the real * stack pointer back */ std r1, PACATMSCRATCH(r13) @@ -195,7 +198,8 @@ _GLOBAL(tm_reclaim) addir7, r12, PT_CKPT_REGS /* Thread's ckpt_regs */ - /* Make r7 look like an exception frame so that we + /* +* Make r7 look like an exception frame so that we * can use the neat GPRx(n) macros. r7 is NOT a pt_regs ptr! */ subir7, r7, STACK_FRAME_OVERHEAD @@ -223,7 +227,8 @@ _GLOBAL(tm_reclaim) /* NIP */ mfspr r3, SPRN_TFHAR std r3, _NIP(r7)/* Returns to failhandler */ - /* The checkpointed NIP is ignored when rescheduling/rechkpting, + /* +* The checkpointed NIP is ignored when rescheduling/rechkpting, * but is used in signal return to 'wind back' to the abort handler. 
*/ @@ -246,12 +251,14 @@ _GLOBAL(tm_reclaim) std r3, THREAD_TM_TAR(r12) std r4, THREAD_TM_DSCR(r12) - /* MSR and flags: We don't change CRs, and we don't need to alter + /* +* MSR and flags: We don't change CRs, and we don't need to alter * MSR. */ - /* FPR/VR/VSRs + /* +* FPR/VR/VSRs * After reclaiming, capture the checkpointed FPRs/VRs. * * We enabled VEC/FP/VSX in the msr above, so we can execute these @@ -277,7 +284,8 @@ _GLOBAL(tm_reclaim) stfdfr0,FPSTATE_FPSCR(r7) - /* TM regs, incl TEXASR -- these live in thread_struct. Note they've + /* +* TM regs, incl TEXASR -- these live in thread_struct. Note they've * been updated by the treclaim, to explain to userland the failure * cause (aborted). */ @@ -313,7 +321,7 @@ _GLOBAL(tm_reclaim) blr - /* + /* * void __tm_recheckpoint(struct thread_struct *thread) * - Restore the checkpointed register state saved by tm_reclaim *when we switch_to a process. @@ -329,7 +337,8 @@ _GLOBAL(__tm_recheckpoint) std r2, STK_GOT(r1) stdur1, -TM_FRAME_SIZE(r1) - /* We've a struct pt_regs at [r1+STACK_FRAME_OVERHEAD]. + /* +* We've a struct pt_regs at [r1+STACK_FRAME_OVERHEAD]. * This is used for backing up the NVGPRs: */ SAVE_NVGPRS(r1) @@ -338,7 +347,8 @@ _GLOBAL(__tm_recheckpoint) addir7, r3, PT_CKPT_REGS/* Thread's ckpt_regs */ - /* Make r7 look like an exception frame so that we + /* +* Make r7 look like an exception frame so that we * can use the neat GPRx(n) macros. r7 is now NOT a pt_regs ptr! */ subir7, r7, STACK_FRAME_OVERHEAD @@ -407,14 +417,15 @@ restore_gprs: REST_NVGPRS(r7) /* GPR14-31 */ - /* Load up PPR and DSCR here so we don't run with user values for long -*/ + /* Load up PPR and DSCR here so we don't run with user values for long */ mtspr SPRN_DSCR, r5 mtspr SPRN_PPR, r6 - /* Do final sanity check on TEXASR to make sure FS is set. Do this + /* +* Do final sanity check on TEXASR to make sure FS is set. 
Do this * here before we load up the userspace r1 so any bugs we hit will get - * a call chain */ + * a call chain + */ mfspr r5, SPRN_TEXASR
Re: [PATCH] powerpc/tm: Avoid possible userspace r1 corruption on reclaim
On Tue, 2018-09-25 at 22:00 +1000, Michael Ellerman wrote: > Michael Neuling writes: > > Current we store the userspace r1 to PACATMSCRATCH before finally > > saving it to the thread struct. > > > > In theory an exception could be taken here (like a machine check or > > SLB miss) that could write PACATMSCRATCH and hence corrupt the > > userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but > > others do. > > > > We've never actually seen this happen but it's theoretically > > possible. Either way, the code is fragile as it is. > > > > This patch saves r1 to the kernel stack (which can't fault) before we > > turn MSR[RI] back on. PACATMSCRATCH is still used but only with > > MSR[RI] off. We then copy r1 from the kernel stack to the thread > > struct once we have MSR[RI] back on. > > > > Suggested-by: Breno Leitao > > Signed-off-by: Michael Neuling > > --- > > arch/powerpc/kernel/tm.S | 8 +++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S > > index 701b0f5b09..8207816a1e 100644 > > --- a/arch/powerpc/kernel/tm.S > > +++ b/arch/powerpc/kernel/tm.S > > @@ -178,6 +178,12 @@ _GLOBAL(tm_reclaim) > > > > std r11, GPR11(r1) /* Temporary stash */ > > > > + /* Move r1 to kernel stack in case PACATMSCRATCH is used once > > +* we turn on RI > > +*/ > > I see we still need to send you to Comment Formatting Re-Education Camp. The rest of that file has that style, so I'm just keeping with that. I can submit a patch later to fix them all up. > I rewrote it a bit too, to hopefully be clearer? > > /* >* Move the saved user r1 to the kernel stack in case PACATMSCRATCH is >* clobbered by an exception once we turn on MSR_RI below. >*/ > ld r11, PACATMSCRATCH(r13) > std r11, GPR1(r1) Yeah, that's clearer... thanks. Mikey
[PATCH] powerpc/tm: Avoid possible userspace r1 corruption on reclaim
Currently we store the userspace r1 to PACATMSCRATCH before finally saving it to the thread struct. In theory an exception could be taken here (like a machine check or SLB miss) that could write PACATMSCRATCH and hence corrupt the userspace r1. The SLB fault currently doesn't touch PACATMSCRATCH, but others do. We've never actually seen this happen but it's theoretically possible. Either way, the code is fragile as it is. This patch saves r1 to the kernel stack (which can't fault) before we turn MSR[RI] back on. PACATMSCRATCH is still used but only with MSR[RI] off. We then copy r1 from the kernel stack to the thread struct once we have MSR[RI] back on. Suggested-by: Breno Leitao Signed-off-by: Michael Neuling --- arch/powerpc/kernel/tm.S | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S index 701b0f5b09..8207816a1e 100644 --- a/arch/powerpc/kernel/tm.S +++ b/arch/powerpc/kernel/tm.S @@ -178,6 +178,12 @@ _GLOBAL(tm_reclaim) std r11, GPR11(r1) /* Temporary stash */ + /* Move r1 to kernel stack in case PACATMSCRATCH is used once +* we turn on RI +*/ + ld r11, PACATMSCRATCH(r13) + std r11, GPR1(r1) + /* Store r13 away so we can free up the scratch SPR for the * SLB fault handler (needed once we start access the * thread_struct) @@ -214,7 +220,7 @@ _GLOBAL(tm_reclaim) SAVE_GPR(8, r7) /* user r8 */ SAVE_GPR(9, r7) /* user r9 */ SAVE_GPR(10, r7)/* user r10 */ - ld r3, PACATMSCRATCH(r13) /* user r1 */ + ld r3, GPR1(r1)/* user r1 */ ld r4, GPR7(r1)/* user r7 */ ld r5, GPR11(r1) /* user r11 */ ld r6, GPR12(r1) /* user r12 */ -- 2.17.1
Re: [PATCH] powerpc/tm: Fix userspace r13 corruption
On Mon, 2018-09-24 at 11:32 -0300, Breno Leitao wrote: > Hi Mikey, > > On 09/24/2018 04:27 AM, Michael Neuling wrote: > > When we treclaim we store the userspace checkpointed r13 to a scratch > > SPR and then later save the scratch SPR to the user thread struct. > > > > Unfortunately, this doesn't work as accessing the user thread struct > > can take an SLB fault and the SLB fault handler will write the same > > scratch SPRG that now contains the userspace r13. > > > > To fix this, we store r13 to the kernel stack (which can't fault) > > before we access the user thread struct. > > > > Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen > > as a random userspace segfault with r13 looking like a kernel address. > > > > Signed-off-by: Michael Neuling > > Reviewed-by: Breno Leitao > > I am wondering if the same problem could not happen with r1 as well, since r1 > is kept in the paca->tm_scratch after MSR[RI] is enabled, in a code similar > to this: > > std r1, PACATMSCRATCH(r13) > .. > li r11, MSR_RI > mtmsrd r11, 1 > > ld r3, PACATMSCRATCH(r13) /* user r1 */ > std r3, GPR1(r7) > > There might be a corruption if the exception that is raised just after > MSR[RI] is enabled somehow exits through 'fast_exception_return' (being > called by most of the exception handlers as I understand) which will > overwrite paca->tm_scratch with the upcoming MSR (aka SRR1) just before the > SRR1, as: > > ld r3,_MSR(r1) > > ... > std r3, PACATMSCRATCH(r13) /* Stash returned-to MSR */ > > > It seems that slb_miss_common and instruction_access_slb does not go through > this path, but handle_page_fault calls > ret_from_except_lite->fast_exception_return, which might cause this > 'possible' issue. Yeah, good catch! I think we are ok but it's delicate. I'll store it in the kernel stack and avoid PACATMSCRATCH before I turn RI on. This look ok? 
diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S index 701b0f5b09..ff781b2852 100644 --- a/arch/powerpc/kernel/tm.S +++ b/arch/powerpc/kernel/tm.S @@ -178,6 +178,12 @@ _GLOBAL(tm_reclaim) std r11, GPR11(r1) /* Temporary stash */ + /* Move r1 to kernel stack in case PACATMSCRATCH is used once +* we turn on RI +*/ + ld r11, PACATMSCRATCH(r13) + std r11, GPR1(r1) + /* Store r13 away so we can free up the scratch SPR for the * SLB fault handler (needed once we start access the * thread_struct) @@ -214,7 +220,7 @@ _GLOBAL(tm_reclaim) SAVE_GPR(8, r7) /* user r8 */ SAVE_GPR(9, r7) /* user r9 */ SAVE_GPR(10, r7)/* user r10 */ - ld r3, PACATMSCRATCH(r13) /* user r1 */ + ld r3, GPR1(r1)/* user r1 */ ld r4, GPR7(r1)/* user r7 */ ld r5, GPR11(r1) /* user r11 */ ld r6, GPR12(r1) /* user r12 */
[PATCH] powerpc/tm: Fix userspace r13 corruption
When we treclaim we store the userspace checkpointed r13 to a scratch SPR and then later save the scratch SPR to the user thread struct. Unfortunately, this doesn't work as accessing the user thread struct can take an SLB fault and the SLB fault handler will write the same scratch SPRG that now contains the userspace r13. To fix this, we store r13 to the kernel stack (which can't fault) before we access the user thread struct. Found by running P8 guest + powervm + disable_1tb_segments + TM. Seen as a random userspace segfault with r13 looking like a kernel address. Signed-off-by: Michael Neuling --- arch/powerpc/kernel/tm.S | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/tm.S b/arch/powerpc/kernel/tm.S index 6bffbc5aff..701b0f5b09 100644 --- a/arch/powerpc/kernel/tm.S +++ b/arch/powerpc/kernel/tm.S @@ -176,13 +176,20 @@ _GLOBAL(tm_reclaim) std r1, PACATMSCRATCH(r13) ld r1, PACAR1(r13) - /* Store the PPR in r11 and reset to decent value */ std r11, GPR11(r1) /* Temporary stash */ + /* Store r13 away so we can free up the scratch SPR for the +* SLB fault handler (needed once we start access the +* thread_struct) +*/ + GET_SCRATCH0(r11) + std r11, GPR13(r1) + /* Reset MSR RI so we can take SLB faults again */ li r11, MSR_RI mtmsrd r11, 1 + /* Store the PPR in r11 and reset to decent value */ mfspr r11, SPRN_PPR HMT_MEDIUM @@ -211,7 +218,7 @@ _GLOBAL(tm_reclaim) ld r4, GPR7(r1)/* user r7 */ ld r5, GPR11(r1) /* user r11 */ ld r6, GPR12(r1) /* user r12 */ - GET_SCRATCH0(8) /* user r13 */ + ld r8, GPR13(r1) /* user r13 */ std r3, GPR1(r7) std r4, GPR7(r7) std r5, GPR11(r7) -- 2.17.1
Re: [PATCH v8 1/3] powerpc: Detect the presence of big-cores via "ibm,thread-groups"
On Fri, 2018-09-21 at 22:47 +0530, Gautham R Shenoy wrote: > Hello Michael, > > On Fri, Sep 21, 2018 at 01:02:45PM +1000, Michael Neuling wrote: > > This doesn't compile for me with: > > > > arch/powerpc/kernel/smp.c: In function ‘smp_prepare_cpus’: > > arch/powerpc/kernel/smp.c:812:23: error: ‘tg.threads_per_group’ may be used > > uninitialized in this function [-Werror=maybe-uninitialized] > > struct thread_groups tg; > >^ > > arch/powerpc/kernel/smp.c:812:23: error: ‘tg.nr_groups’ may be used > > uninitialized in this function [-Werror=maybe-uninitialized] > > cc1: all warnings being treated as errors > > /home/mikey/src/linux-ozlabs/scripts/Makefile.build:305: recipe for target > > 'arch/powerpc/kernel/smp.o' failed > > > > I couldn't get this error with gcc 4.8.5 and 8.1.1 with > pseries_defconfig and powernv_defconfig with CONFIG_PPC_WERROR=y. I'm using 5.4.0 which is in Ubuntu 16.04 (LTS). > Does the following the following delta patch make it work? Yep that fixes it, thanks. Mikey