[PATCH 1/2] powerpc: Kconfig: Replace tabs with whitespaces
Replace tabs after keywords with whitespaces to be consistent.

Signed-off-by: Juerg Haefliger
---
 arch/powerpc/Kconfig | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 174edabb74fa..b4acaa77837a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -11,7 +11,7 @@ config 64BIT

 config LIVEPATCH_64
 	def_bool PPC64
-	depends	on LIVEPATCH
+	depends on LIVEPATCH

 config MMU
 	bool
@@ -446,7 +446,7 @@ choice
 	default MATH_EMULATION_FULL
 	depends on MATH_EMULATION

-config	MATH_EMULATION_FULL
+config MATH_EMULATION_FULL
 	bool "Emulate all the floating point instructions"
 	help
 	  Select this option will enable the kernel to support to emulate
@@ -1235,7 +1235,7 @@ config PHYSICAL_START
 	default "0x"
 endif

-config	ARCH_RANDOM
+config ARCH_RANDOM
 	def_bool n

 config PPC_LIB_RHEAP
--
2.32.0
[PATCH 2/2] powerpc: Kconfig.debug: Remove extra empty line
Remove a stray extra empty line.

Signed-off-by: Juerg Haefliger
---
 arch/powerpc/Kconfig.debug | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 192f0ed0097f..2c019e4ac432 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -305,7 +305,6 @@ config PPC_EARLY_DEBUG_OPAL
 	def_bool y
 	depends on PPC_EARLY_DEBUG_OPAL_RAW || PPC_EARLY_DEBUG_OPAL_HVSI

-
 config PPC_EARLY_DEBUG_HVSI_VTERMNO
 	hex "vterm number to use with early debug HVSI"
 	depends on PPC_EARLY_DEBUG_LPAR_HVSI
--
2.32.0
[PATCH 0/2] powerpc: Kconfig cleanups
Replace some stray tabs with whitespaces and remove an extra empty line.

Juerg Haefliger (2):
  powerpc: Kconfig: Replace tabs with whitespaces
  powerpc: Kconfig.debug: Remove extra empty line

 arch/powerpc/Kconfig       | 6 +++---
 arch/powerpc/Kconfig.debug | 1 -
 2 files changed, 3 insertions(+), 4 deletions(-)

--
2.32.0
Re: [Buildroot] [PATCH] linux: Fix powerpc64le defconfig selection
Arnout Vandecappelle writes:
> On 16/05/2022 15:17, Michael Ellerman wrote:
>> Arnout Vandecappelle writes:
>>> On 10/05/2022 04:20, Joel Stanley wrote:
>>>> The default defconfig target for the 64 bit powerpc kernel is
>>>> ppc64_defconfig, the big endian configuration. When building for
>>>> powerpc64le users want the little endian kernel as they can't boot LE
>>>> userspace on a BE kernel. Fix up the defconfig used in this case.
>>>>
>>>> This will avoid the following autobuilder failure:
>>>>
>>>>   VDSO32A arch/powerpc/kernel/vdso32/sigtramp.o
>>>>   cc1: error: '-m32' not supported in this configuration
>>>>   make[4]: *** [arch/powerpc/kernel/vdso32/Makefile:49: arch/powerpc/kernel/vdso32/sigtramp.o] Error 1
>>>>
>>>>   http://autobuild.buildroot.net/results/dd76d53bab56470c0b83e296872d7bb90f9e8296/
>>>>
>>>> Note that the failure indicates the toolchain is configured to disable
>>>> the 32 bit target, causing the kernel to fail when building the 32 bit
>>>> VDSO. This is only a problem on the BE kernel as the LE kernel disables
>>>> CONFIG_COMPAT, aka 32 bit userspace support, by default.
>>>>
>>>> Signed-off-by: Joel Stanley
>>>
>>>  Applied to master, thanks. However, the defconfig mechanism for *all* powerpc
>>> seems pretty broken. Here's what we have in 5.16, before that there was
>>> something similar:
>>>
>>> # If we're on a ppc/ppc64/ppc64le machine use that defconfig, otherwise just use
>>> # ppc64_defconfig because we have nothing better to go on.
>>> uname := $(shell uname -m)
>>> KBUILD_DEFCONFIG := $(if $(filter ppc%,$(uname)),$(uname),ppc64)_defconfig
>>>
>>>  So I guess we should use a specific defconfig for *all* powerpc.
>>>
>>>  The arch-default defconfig is generally not really reliable, for example for
>>> arm it always takes v7_multi, but that won't work for v7m targets...
>>
>> There's a fundamental problem that just the "arch" is not sufficient
>> detail when you're building a kernel.
>
> Yes, which is pretty much unavoidable.
>
>> Two CPUs that implement the same user-visible "arch" may differ enough
>> at the kernel level to require a different defconfig.
>>
>> Having said that I think we could handle this better in the powerpc
>> kernel. Other arches allow specifying a different value for ARCH, which
>> then is fed into the defconfig.
>
> I don't know if it's worth bothering with that. It certainly would not make
> our life easier, because it would mean we need to set ARCH correctly. If we can
> do that, we can just as well set the defconfig correctly.

OK.

>> That way you could at least pass ARCH=ppc/ppc64/ppc64le, and get an
>> appropriate defconfig.
>>
>> I'll work on some kernel changes for that.
>
> I think the most important thing is that it makes no sense to rely on uname
> when ARCH and/or CROSS_COMPILE are set.

I'm not sure I entirely agree. Neither ARCH nor CROSS_COMPILE give us
enough information to know which defconfig to use, so we still have to
guess somehow.

CROSS_COMPILE can be set even when you're building on ppc, it's the
easiest way to specify a different toolchain from the default.

cheers
[PATCH] powerpc/perf: Give generic PMU a nice name
When booting on a machine that uses the compat pmu driver we see this:

 [    0.071192] GENERIC_COMPAT performance monitor hardware support registered

Which is a bit shouty. Give it a nicer name.

Signed-off-by: Joel Stanley
---
Other options:

 - ISAv3 (because it is relevant for PowerISA 3.0B and beyond, see the
   comment in init_generic_compat_pmu)
 - Generic Compat (same, but less shouty)

 arch/powerpc/perf/generic-compat-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/generic-compat-pmu.c b/arch/powerpc/perf/generic-compat-pmu.c
index f3db88aee4dd..5be5a5ebaf42 100644
--- a/arch/powerpc/perf/generic-compat-pmu.c
+++ b/arch/powerpc/perf/generic-compat-pmu.c
@@ -292,7 +292,7 @@ static int generic_compute_mmcr(u64 event[], int n_ev,
 }

 static struct power_pmu generic_compat_pmu = {
-	.name			= "GENERIC_COMPAT",
+	.name			= "Architected",
 	.n_counter		= MAX_PMU_COUNTERS,
 	.add_fields		= ISA207_ADD_FIELDS,
 	.test_adder		= ISA207_TEST_ADDER,
--
2.35.1
Re: [PATCH] powerpc/64/interrupt: Fix return to masked context after hard-mask irq becomes pending
Excerpts from Sachin Sant's message of March 9, 2022 6:37 pm:
>
>> On 07-Mar-2022, at 8:21 PM, Nicholas Piggin wrote:
>>
>> When a synchronous interrupt[1] is taken in a local_irq_disable() region
>> which has MSR[EE]=1, the interrupt handler will enable MSR[EE] as part
>> of enabling MSR[RI], for performance and profiling reasons.
>>
>> [1] Typically a hash fault, but in error cases this could be a page
>>     fault or facility unavailable as well.
>>
>> If an asynchronous interrupt hits here and its masked handler requires
>> MSR[EE] to be cleared (it is a PACA_IRQ_MUST_HARD_MASK interrupt), then
>> MSR[EE] must remain disabled until that pending interrupt is replayed.
>> The problem is that the MSR of the original context has MSR[EE]=1, so
>> returning directly to that causes MSR[EE] to be enabled while the
>> interrupt is still pending.
>>
>> This issue was hacked around in the interrupt return code by just
>> clearing the hard mask to avoid a warning, and taking the masked
>> interrupt again immediately in the return context, which would disable
>> MSR[EE]. However in the case of a pending PMI, it is possible that it is
>> not masked in the calling context so the full handler will be run while
>> there is a PMI pending, and this confuses the perf code and causes
>> warnings with its PMI pending management.
>>
>> Fix this by removing the hack, and adjusting the return MSR if it has
>> MSR[EE]=1 and there is a PACA_IRQ_MUST_HARD_MASK interrupt pending.
>>
>> Fixes: 4423eb5ae32e ("powerpc/64/interrupt: make normal synchronous interrupts enable MSR[EE] if possible")
>> Signed-off-by: Nicholas Piggin
>> ---
>> arch/powerpc/kernel/interrupt.c    | 10 -
>> arch/powerpc/kernel/interrupt_64.S | 34 +++---
>> 2 files changed, 31 insertions(+), 13 deletions(-)
>
> With this patch on top of powerpc/merge following rcu stalls are seen while
> running powerpc selftests (mitigation-patching) on P9. I don't see this
> issue on P10.
>
> [ 1841.248838] link-stack-flush: flush disabled.
> [ 1841.248905] count-cache-flush: software flush enabled.
> [ 1841.248911] link-stack-flush: software flush enabled.
> [ 1901.249668] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1901.249703] rcu: 12-...!: (5999 ticks this GP) idle=d0f/1/0x4002 softirq=37019/37027 fqs=0
> [ 1901.249720] (t=6000 jiffies g=106273 q=1624)
> [ 1901.249729] rcu: rcu_sched kthread starved for 6000 jiffies! g106273 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=6
> [ 1901.249743] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 1901.249752] rcu: RCU grace-period kthread stack dump:
> [ 1901.249759] task:rcu_sched state:R running task stack: 0 pid: 11 ppid: 2 flags:0x0800
> [ 1901.249775] Call Trace:
> [ 1901.249781] [c76ab870] [0001] 0x1 (unreliable)
> [ 1901.249795] [c76aba60] [c001e508] __switch_to+0x288/0x4a0
> [ 1901.249811] [c76abac0] [c0d15950] __schedule+0x2c0/0x950
> [ 1901.249824] [c76abb80] [c0d16048] schedule+0x68/0x130
> [ 1901.249836] [c76abbb0] [c0d1df1c] schedule_timeout+0x25c/0x3f0
> [ 1901.249849] [c76abc90] [c021522c] rcu_gp_fqs_loop+0x2fc/0x3e0
> [ 1901.249863] [c76abd40] [c021a0fc] rcu_gp_kthread+0x13c/0x180
> [ 1901.249875] [c76abdc0] [c018ce94] kthread+0x124/0x130
> [ 1901.249887] [c76abe10] [c000cec0] ret_from_kernel_thread+0x5c/0x64
> [ 1901.249900] rcu: Stack dump where RCU GP kthread last ran:
> [ 1901.249908] Sending NMI from CPU 12 to CPUs 6:
> [ 1901.249944] NMI backtrace for cpu 6
> [ 1901.249957] CPU: 6 PID: 40 Comm: migration/6 Not tainted 5.17.0-rc6-00327-g782b30d101f6-dirty #3
> [ 1901.249971] Stopper: multi_cpu_stop+0x0/0x230 <- stop_machine_cpuslocked+0x188/0x1e0
> [ 1901.249987] NIP: c0d14e0c LR: c0214280 CTR: c02914f0
> [ 1901.249996] REGS: c785b980 TRAP: 0500 Not tainted (5.17.0-rc6-00327-g782b30d101f6-dirty)
> [ 1901.250007] MSR: 8280b033 CR: 48002822 XER:
> [ 1901.250038] CFAR: IRQMASK: 0
> [ 1901.250038] GPR00: c029165c c785bc20 c2a2 0002
> [ 1901.250038] GPR04: c009fb60ab80 c009fb60ab70 c001e508
> [ 1901.250038] GPR08: c009fb68f5a8 0009f94c 0098967f
> [ 1901.250038] GPR12: c0001ec57a00 c018cd78 c7234f80
> [ 1901.250038] GPR16:
> [ 1901.250038] GPR20: 0001
> [ 1901.250038] GPR24: 0002 0003 c2a62138
> [ 1901.250038] GPR28: c000ee70faf8 0001 c000ee70fb1c 000
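For readers following the fix: conceptually, the change Nick describes
amounts to dropping MSR[EE] from the saved return MSR while a
must-hard-mask interrupt is pending. A minimal sketch of that idea (not
the actual patch, which also touches the asm in interrupt_64.S):

	/*
	 * Sketch: before returning from a synchronous interrupt, clear
	 * MSR[EE] in the saved MSR if a PACA_IRQ_MUST_HARD_MASK interrupt
	 * is still pending, so it cannot fire before it is replayed.
	 */
	if ((regs->msr & MSR_EE) &&
	    (local_paca->irq_happened & PACA_IRQ_MUST_HARD_MASK))
		regs->msr &= ~MSR_EE;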
Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On 5/24/22 16:45, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page. It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock. It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu
> ---
> v3:
> - Rebase to akpm/mm-unstable
> - Copy arch maintainers
> ---
>  arch/arc/mm/fault.c | 4

Acked-by: Vineet Gupta

Thx,
-Vineet
Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote:
> I observed that for each of the shared file-backed page faults, we're very
> likely to retry one more time for the 1st write fault upon no page. It's
> because we'll need to release the mmap lock for dirty rate limit purpose
> with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
>
> Then after that throttling we return VM_FAULT_RETRY.
>
> We did that probably because VM_FAULT_RETRY is the only way we can return
> to the fault handler at that time telling it we've released the mmap lock.
>
> However that's not ideal because it's very likely the fault does not need
> to be retried at all since the pgtable was well installed before the
> throttling, so the next continuous fault (including taking mmap read lock,
> walk the pgtable, etc.) could be in most cases unnecessary.
>
> It's not only slowing down page faults for shared file-backed, but also add
> more mmap lock contention which is in most cases not needed at all.
>
> To observe this, one could try to write to some shmem page and look at
> "pgfault" value in /proc/vmstat, then we should expect 2 counts for each
> shmem write simply because we retried, and vm event "pgfault" will capture
> that.
>
> To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
> show that we've completed the whole fault and released the lock. It's also
> a hint that we should very possibly not need another fault immediately on
> this page because we've just completed it.
>
> This patch provides a ~12% perf boost on my aarch64 test VM with a simple
> program sequentially dirtying 400MB shmem file being mmap()ed and these are
> the time it needs:
>
>   Before: 650.980 ms (+-1.94%)
>   After:  569.396 ms (+-1.38%)
>
> I believe it could help more than that.
>
> We need some special care on GUP and the s390 pgfault handler (for gmap
> code before returning from pgfault), the rest changes in the page fault
> handlers should be relatively straightforward.
>
> Another thing to mention is that mm_account_fault() does take this new
> fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
>
> I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
> not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
> them as-is.
>
> Signed-off-by: Peter Xu

Acked-by: Johannes Weiner
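For context, the way an architecture's fault handler consumes the new
return code is roughly the following (a sketch based on the description
above, not any specific arch's hunk from the patch):

	/*
	 * Sketch of an arch page fault handler checking the new code:
	 * the fault was fully handled and the mmap lock was already
	 * released inside handle_mm_fault(), so there is nothing left
	 * to unlock or retry.
	 */
	vm_fault_t fault = handle_mm_fault(vma, address, flags, regs);
	if (fault & VM_FAULT_COMPLETED)
		return;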
Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds
On Tue, May 24, 2022 at 11:20:42AM +0530, Vaibhav Jain wrote:
> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's
> ima-kexec-buffer lies outside the addressable memory range. This can result
> in a kernel panic if the new kernel is booted with 'mem=X' arg and the
> ima-kexec-buffer was allocated beyond that range by the previous kernel.
> The panic is usually of the form below:
>
> $ sudo kexec --initrd initrd vmlinux --append='mem=16G'
>
> BUG: Unable to handle kernel data access on read at 0xc000c01fff7f
> Faulting instruction address: 0xc0837974
> Oops: Kernel access of bad area, sig: 11 [#1]
>
> NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0
> LR [c083b55c] ima_load_kexec_buffer+0xac/0x160
> Call Trace:
> [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160
> [c371fb00] [c20512c4] ima_init+0x80/0x108
> [c371fb70] [c20514dc] init_ima+0x4c/0x120
> [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0
> [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec
> [c371fda0] [c00128a4] kernel_init+0x34/0x1b0
> [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64
> Instruction dump:
> f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330
> 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010
> ---[ end trace ]---
>
> Fix this issue by checking returned PFN range of previous kernel's
> ima-kexec-buffer with pfn_valid to ensure correct memory bounds.
>
> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the previous kernel")
> Cc: Frank Rowand
> Cc: Prakhar Srivastava
> Cc: Lakshmi Ramasubramanian
> Cc: Thiago Jung Bauermann
> Cc: Rob Herring
> Signed-off-by: Vaibhav Jain
>
> ---
> Changelog
> ==
>
> v2:
> * Instead of using memblock to determine the valid bounds use pfn_valid() to do
>   so since memblock may not be available late after the kernel init. [ Mpe ]
> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ]
> * Updated the 'Fixes' tag to point to correct commit that introduced this
>   function. [ Rob ]
> * Fixed some whitespace/tab issues in the patch description [ Rob ]
> * Added another check for checking if 'tmp_size' for ima-kexec-buffer is 0
> ---
> drivers/of/kexec.c | 17 +
> 1 file changed, 17 insertions(+)
>
> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
> index 8d374cc552be..879e984fe901 100644
> --- a/drivers/of/kexec.c
> +++ b/drivers/of/kexec.c
> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>  {
>  	int ret, len;
>  	unsigned long tmp_addr;
> +	unsigned int start_pfn, end_pfn;
>  	size_t tmp_size;
>  	const void *prop;
>
> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size)
>  	if (ret)
>  		return ret;
>
> +	/* Do some sanity on the returned size for the ima-kexec buffer */
> +	if (!tmp_size)
> +		return -ENOENT;
> +
> +	/*
> +	 * Calculate the PFNs for the buffer and ensure
> +	 * they are within addressable memory.
> +	 */
> +	start_pfn = PHYS_PFN(tmp_addr);
> +	end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
> +	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {

pfn_valid() isn't necessarily RAM, only that you have a struct page
AIUI. Maybe page_is_ram() instead? Thanks to Robin for this.

Rob
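A sketch of the page_is_ram() variant Rob suggests, reusing the same
variables as the patch above (untested; the -EINVAL return value is
illustrative, not taken from the patch):

	/*
	 * page_is_ram() checks that the pfn is backed by actual RAM,
	 * not merely covered by a struct page as pfn_valid() does.
	 */
	if (!page_is_ram(PHYS_PFN(tmp_addr)) ||
	    !page_is_ram(PHYS_PFN(tmp_addr + tmp_size - 1)))
		return -EINVAL;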
Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support
On Thu, May 26, 2022 at 3:37 AM Heiko Stübner wrote:
>
> Am Mittwoch, 25. Mai 2022, 18:08:22 CEST schrieb Guo Ren:
> > Thx Heiko & Guenter,
> >
> > On Wed, May 25, 2022 at 7:10 PM Heiko Stübner wrote:
> > >
> > > Am Mittwoch, 25. Mai 2022, 12:57:30 CEST schrieb Heiko Stübner:
> > > > Am Mittwoch, 25. Mai 2022, 00:06:46 CEST schrieb Guenter Roeck:
> > > > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > > > [ ... ]
> > > > > >
> > > > > > > The problem is come from "__dls3's vdso decode part in musl's
> > > > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > > > >
> > > > > > > I think the root cause is from musl's implementation with the wrong
> > > > > > > elf parser. I would fix that soon.
> > > > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > > > solve this, but anyone who could help dig in is welcome.
> > > > >
> > > > > I am not sure I understand what you are saying here. Point is that my
> > > > > root file system, generated with musl a year or so ago, crashes with
> > > > > your patch set applied. That is a regression, even if there is a bug
> > > > > in musl.
> > Thx for the report, it's a valuable regression for riscv-compat.
> >
> > > > Also as I said in the other part of the thread, the rootfs seems innocent,
> > > > as my completely-standard Debian riscv64 rootfs is also affected.
> > > >
> > > > The merged version seems to be v12 [0] - not sure how this discussion
> > > > ended up in v9, but I just tested this revision in two variants:
> > > >
> > > > - v5.17 + this v9 -> works nicely
> > >
> > > I take that back ... now going back to that build I somehow also run into
> > > that issue here ... will investigate more.
> > Yeah, it's my fault. I've fixed up it, please have a try:
> >
> > https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u
>
> very cool that you found the issue.
> I've tested your patch and it seems to fix the issue for me.
>
> Thanks for figuring out the cause

I should thx Guenter Roeck. It just surprised me that compat_vdso could
work with quite a lot of rv64 apps.

> Heiko
>
> > > > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> > > > The only rebase-conflict was with the introduction of restartable
> > > > sequences and removal of the tracehook include, but turning CONFIG_RSEQ
> > > > off doesn't seem to affect the breakage.
> > > >
> > > > So it looks like something changed between 5.17 and 5.18 that causes
> > > > the issue.
> > > >
> > > > Heiko
> > > >
> > > > [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
Re: [GIT PULL] Modules fixes for v5.19-rc1
Sorry the subject should say "Modules changes".

I also forgot to itemize possible merge conflicts and resolutions which
linux-next reported:

powerpc: https://lkml.kernel.org/r/20220520154055.7f964...@canb.auug.org.au
kbuild:  https://lkml.kernel.org/r/20220523120859.570f7...@canb.auug.org.au

  Luis
[GIT PULL] Modules fixes for v5.19-rc1
OK, finally some changes for modules. It is still pretty boring, but I
am hopeful that the cleanup will yield nice results in the future as
further cleanups will make the code much easier to read, maintain and
test. Perhaps the most exciting thing is Christophe Leroy's
CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC. In reviewing Rick Edgecombe's
prior work on enhancements for special allocators I suspect this is
going to help as module space was the more complex aspect to deal with
in his work.

AFAICT you *may* run into conflicts *if* bpf folks submit the
module_alloc_huge() stuff which I was still reviewing with Rick. To my
taste that effort seems to be going fast and I like to take time to
consider a proper interface for it which aligns well with what others
have in mind, especially in consideration for what other architectures
might need. The VM_FLUSH_RESET_PERMS stuff was what was loose there. It
doesn't seem we can address that stuff in a generic neat way yet, and so
the x86 open codes its own solution for it. I suspect we'll also need
more tests on the huge page front so that if more module_alloc() users
want to convert we can enable folks to give more realistic performance
information rather than loose numbers.

In the future I suspect we'll just generalize module_alloc() to
vmalloc_exec() as its users are growing and the technical debt of not
drawing a clean API for it is growing.

Let me know if there are any issues.

  Luis

The following changes since commit 3123109284176b1532874591f7c81f3837bbdc17:

  Linux 5.18-rc1 (2022-04-03 14:08:21 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/modules-5.19-rc1

for you to fetch changes up to 7390b94a3c2d93272d6da4945b81a9cf78055b7b:

  module: merge check_exported_symbol() into find_exported_symbol_in_section() (2022-05-12 10:29:41 -0700)

----------------------------------------------------------------
Modules updates for v5.19-rc1

As promised, for v5.19 I queued up quite a bit of work for modules, but
still with a pretty conservative eye. These changes have been soaking on
modules-next (and so linux-next) for quite some time, the code shift was
merged onto modules-next on March 22, and the last patch was queued on
May 5th.

The following are the highlights of what bells and whistles we will get
for v5.19:

1) It was time to tidy up kernel/module.c and one way of starting with
   that effort was to split it up into files. At my request Aaron Tomlin
   spearheaded that effort with the goal to not introduce any functional
   change at all during that endeavour. The penalty for the split is
   +1322 bytes total, +112 bytes in data, +1210 bytes in text while bss
   is unchanged. One of the benefits of this other than helping make the
   code easier to read and review is summoning more help on review for
   changes with livepatching so kernel/module/livepatch.c is now pegged
   as maintained by the live patching folks.

   The before and after with just the move on a defconfig on x86-64:

   $ size kernel/module.o
      text    data     bss     dec     hex filename
     38434    4540     104   43078    a846 kernel/module.o

   $ size -t kernel/module/*.o
      text    data     bss     dec     hex filename
      4785     120       0    4905    1329 kernel/module/kallsyms.o
     28577    4416     104   33097    8149 kernel/module/main.o
      1158       8       0    1166     48e kernel/module/procfs.o
       902     108       0    1010     3f2 kernel/module/strict_rwx.o
      3390       0       0    3390     d3e kernel/module/sysfs.o
       832       0       0     832     340 kernel/module/tree_lookup.o
     39644    4652     104   44400    ad70 (TOTALS)

2) Aaron added module unload taint tracking
   (MODULE_UNLOAD_TAINT_TRACKING), to enable tracking unloaded modules
   which did taint the kernel.

3) Christophe Leroy added CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
   which lets architectures request having module data in the vmalloc
   area instead of the module area. There are three reasons why an
   architecture might want this:

   a) On some architectures (like book3s/32) it is not possible to
      protect against execution on a page basis. The exec stuff can be
      mapped by different arch segment sizes (on book3s/32 that is 256M
      segments). By default the module area is in an Exec segment while
      the vmalloc area is in a NoExec segment. Using vmalloc lets you
      map module data as NoExec on those architectures whereas before
      you could not.

   b) By pushing more module data to vmalloc you also increase the
      probability of module text to remain within a closer distance
      from kernel core text and this reduces trampolines, this has been
      reported on arm first and powerpc folks are following that lead.

   c) Free'ing module_alloc() (Exec by default) area leaves this
      exposed as Exec by default, some architectures h
Re: [PATCH] kexec_file: Drop weak attribute from arch_kexec_apply_relocations[_add]
On Fri, 20 May 2022 14:25:05 -0500 "Eric W. Biederman" wrote:

> > I am not strongly against taking off __weak, just wondering if there's
> > chance to fix it in recordmcount, and the cost comparing with kernel fix;
> > except of this issue, any other weakness of __weak. Noticed Andrew has
> > picked this patch, as a witness of this moment, raise a tiny concern.
>
> I just don't see what else we can realistically do.

I think converting all of the kexec __weaks to use the ifdef approach
makes sense, if only because kexec is now using two different styles.

But for now, I'll send Naveen's v2 patch in to Linus to get us out of
trouble.

I'm thinking that we should add cc:stable to that patch as well, to
reduce the amount of problems which people experience when using newer
binutils on older kernels?
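For context, the "ifdef approach" referred to here is the usual kernel
pattern for replacing a __weak symbol (a sketch of the style rather than
the exact patch text):

	/* Arch header: the arch provides its own implementation and
	 * announces it by defining a macro named after the function. */
	#define arch_kexec_apply_relocations_add arch_kexec_apply_relocations_add
	int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
					     Elf_Shdr *section,
					     const Elf_Shdr *relsec,
					     const Elf_Shdr *symtab);

	/* Generic header: the fallback is compiled only when no arch
	 * version was declared, so no weak symbol is needed. */
	#ifndef arch_kexec_apply_relocations_add
	static inline int
	arch_kexec_apply_relocations_add(struct purgatory_info *pi,
					 Elf_Shdr *section,
					 const Elf_Shdr *relsec,
					 const Elf_Shdr *symtab)
	{
		return -ENOEXEC;
	}
	#endif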
Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support
Am Mittwoch, 25. Mai 2022, 18:08:22 CEST schrieb Guo Ren:
> Thx Heiko & Guenter,
>
> On Wed, May 25, 2022 at 7:10 PM Heiko Stübner wrote:
> >
> > Am Mittwoch, 25. Mai 2022, 12:57:30 CEST schrieb Heiko Stübner:
> > > Am Mittwoch, 25. Mai 2022, 00:06:46 CEST schrieb Guenter Roeck:
> > > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > > [ ... ]
> > > > >
> > > > > > The problem is come from "__dls3's vdso decode part in musl's
> > > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > > >
> > > > > > I think the root cause is from musl's implementation with the wrong
> > > > > > elf parser. I would fix that soon.
> > > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > > solve this, but anyone who could help dig in is welcome.
> > > >
> > > > I am not sure I understand what you are saying here. Point is that my
> > > > root file system, generated with musl a year or so ago, crashes with
> > > > your patch set applied. That is a regression, even if there is a bug
> > > > in musl.
> Thx for the report, it's a valuable regression for riscv-compat.
>
> > > Also as I said in the other part of the thread, the rootfs seems innocent,
> > > as my completely-standard Debian riscv64 rootfs is also affected.
> > >
> > > The merged version seems to be v12 [0] - not sure how this discussion
> > > ended up in v9, but I just tested this revision in two variants:
> > >
> > > - v5.17 + this v9 -> works nicely
> >
> > I take that back ... now going back to that build I somehow also run into
> > that issue here ... will investigate more.
> Yeah, it's my fault. I've fixed up it, please have a try:
>
> https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u

very cool that you found the issue.
I've tested your patch and it seems to fix the issue for me.

Thanks for figuring out the cause
Heiko

> > > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> > > The only rebase-conflict was with the introduction of restartable
> > > sequences and removal of the tracehook include, but turning CONFIG_RSEQ
> > > off doesn't seem to affect the breakage.
> > >
> > > So it looks like something changed between 5.17 and 5.18 that causes the
> > > issue.
> > >
> > > Heiko
> > >
> > > [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc
On 25/05/22 23:09, Christophe Leroy wrote:
> Hi Sathvika,
>
> Le 25/05/2022 à 12:14, Sathvika Vasireddy a écrit :
>> Hi Christophe,
>>
>> On 24/05/22 18:47, Christophe Leroy wrote:
>>> This draft series adds PPC32 support to Sathvika's series.
>>> Verified on pmac32 on QEMU.
>>>
>>> It should in principle also work for PPC64 BE but for the time being
>>> something goes wrong. In the beginning I had a segfault hence the first
>>> patch. But I still get no mcount section in the files.
>> Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
>> And so, the relocation records in case of PPC64BE point to "._mcount",
>> rather than just "_mcount". We should be looking for "._mcount" to be
>> able to generate mcount_loc section in the files.
>>
>> Like:
>>
>> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
>> index 70be5a72e838..7da5bf8c7236 100644
>> --- a/tools/objtool/check.c
>> +++ b/tools/objtool/check.c
>> @@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file *file)
>>  		if (arch_is_retpoline(func))
>>  			func->retpoline_thunk = true;
>>
>> -		if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")))
>> +		if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount")))
>>  			func->fentry = true;
>>
>>  		if (is_profiling_func(func->name))
>>
>> With this change, I could see __mcount_loc section being
>> generated in individual ppc64be object files.
>
> Or should we implement an equivalent of arch_ftrace_match_adjust() in objtool ?

Yeah, I think it makes more sense if we make it arch specific.
Thanks for the suggestion. I'll make this change in next revision :-)

- Sathvika
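An objtool-side equivalent of arch_ftrace_match_adjust() could be as
small as the sketch below (the helper name mirrors the kernel's ftrace
hook and is illustrative; objtool has no such interface today):

	/* powerpc version: strip the dot that the ELFv1 ABI prepends to
	 * text symbols, so "._mcount" is matched as "_mcount". */
	const char *arch_ftrace_match_adjust(const char *name)
	{
		return name[0] == '.' ? name + 1 : name;
	}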
Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc
Hi Sathvika,

Le 25/05/2022 à 12:14, Sathvika Vasireddy a écrit :
> Hi Christophe,
>
> On 24/05/22 18:47, Christophe Leroy wrote:
>> This draft series adds PPC32 support to Sathvika's series.
>> Verified on pmac32 on QEMU.
>>
>> It should in principle also work for PPC64 BE but for the time being
>> something goes wrong. In the beginning I had a segfault hence the first
>> patch. But I still get no mcount section in the files.
> Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
> And so, the relocation records in case of PPC64BE point to "._mcount",
> rather than just "_mcount". We should be looking for "._mcount" to be
> able to generate mcount_loc section in the files.
>
> Like:
>
> diff --git a/tools/objtool/check.c b/tools/objtool/check.c
> index 70be5a72e838..7da5bf8c7236 100644
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file *file)
>  		if (arch_is_retpoline(func))
>  			func->retpoline_thunk = true;
>
> -		if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")))
> +		if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount")))
>  			func->fentry = true;
>
>  		if (is_profiling_func(func->name))
>
>
> With this change, I could see __mcount_loc section being
> generated in individual ppc64be object files.
>

Or should we implement an equivalent of arch_ftrace_match_adjust() in
objtool ?

Christophe
Re: [RFC PATCH 4/4] objtool/powerpc: Add --mcount specific implementation
Le 24/05/2022 à 15:33, Christophe Leroy a écrit :
> Le 24/05/2022 à 13:00, Sathvika Vasireddy a écrit :
>>>> +{
>>>> +	switch (elf->ehdr.e_machine) {
>>>> +	case EM_X86_64:
>>>> +		return R_X86_64_64;
>>>> +	case EM_PPC64:
>>>> +		return R_PPC64_ADDR64;
>>>> +	default:
>>>> +		WARN("unknown machine...");
>>>> +		exit(-1);
>>>> +	}
>>>> +}
>>>
>>> Wouldn't it be better to make that function arch specific ?
>> This is so that we can support cross architecture builds.
>
> I'm not sure I follow you here. This is only based on the target, it
> doesn't depend on the build host so I can't see the link with cross
> arch builds.
>
> The same as you have arch_decode_instruction(), you could have
> arch_elf_reloc_type_long()

It would make sense indeed, because there is no point in supporting X86
relocation when you don't support X86 instruction decoding.

Could simply be some macro defined in
tools/objtool/arch/powerpc/include/arch/elf.h and
tools/objtool/arch/x86/include/arch/elf.h

The x86 version would be:

	#define R_ADDR(elf) R_X86_64_64

And the powerpc version would be:

	#define R_ADDR(elf) (elf->ehdr.e_machine == EM_PPC64 ? R_PPC64_ADDR64 : R_PPC_ADDR32)

Christophe
Re: [RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"
Le 25/05/2022 à 18:34, Peter Zijlstra a écrit :
> On Wed, May 25, 2022 at 05:58:14PM +0200, Christophe Leroy wrote:
>> This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.
>>
>> That commit is problematic as we miss some static calls.
>
> Revert ?!?! who committed this. And there's a ton more broken than just
> static calls. This must absolutely not be.

No worry, it is just a follow-up of my previous series which includes it.
Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc
On Wed, May 25, 2022 at 03:44:04PM +0530, Sathvika Vasireddy wrote:
> On 24/05/22 18:47, Christophe Leroy wrote:
>> This draft series adds PPC32 support to Sathvika's series.
>> Verified on pmac32 on QEMU.
>>
>> It should in principle also work for PPC64 BE but for the time being
>> something goes wrong. In the beginning I had a segfault hence the first
>> patch. But I still get no mcount section in the files.
> Since PPC64 BE uses older elfv1 ABI, it prepends a dot to symbols.
> And so, the relocation records in case of PPC64BE point to "._mcount",
> rather than just "_mcount". We should be looking for "._mcount" to be
> able to generate mcount_loc section in the files.

The dotted symbol is on the actual function. The "normal" symbol is on
the "official procedure descriptor" (opd), which is what you get if you
(in C) take the address of a function.

A procedure descriptor holds one or two more pointers, the GOT and
environment pointers. We don't use the environment one, but the GOT
pointer is necessary everywhere :-)


Segher
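For reference, the procedure descriptor Segher describes is just three
pointers; this sketch mirrors the layout the kernel itself uses for
ELFv1 (struct func_desc on powerpc64):

	struct func_desc {
		unsigned long addr;	/* entry point, i.e. the dotted symbol */
		unsigned long toc;	/* GOT/TOC pointer */
		unsigned long env;	/* environment pointer, unused by Linux */
	};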
Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers
Hi

Am 21.05.22 um 04:49 schrieb Benjamin Herrenschmidt:
> On Thu, 2022-05-19 at 09:27 +0200, Thomas Zimmermann wrote:
>>> to build without PCI to see what happens. If you bring any of the
>>> "heuristic" and palette support code in, you need PCI. I don't see any
>>> reason to take it out. Those old Macs use BootX, right?
>>
>> BootX is not supported ATM, as I don't have the HW to test. Is there
>> an emulator for it?
>
> It isn't ? When did it break ? :-)

I meant that BootX is not (yet) supported by this new driver. The Linux
kernel overall probably supports it. If anyone wants to make patches for
BootX, I'd be happy to add them.

>> The offb driver also supports a number of special cases for palette
>> handling. That might be necessary for ofdrm as well.
>
> The palette handling is useful when using a real Open Firmware
> implementation which tends to boot in 8-bit mode, so without palette
> things will look ... bad. It's not necessary when using 16/32 bpp
> framebuffers which is typically ... what BootX provides :-)

Maybe the odd color formats can be tested via qemu. I don't mind adding
DRM support for BootX displays, but getting the necessary test HW with a
suitable Linux seems to be laborious. Would a G4 Powerbook work?

Best regards
Thomas

> Cheers,
> Ben.

--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Re: [RFC PATCH v1 2/4] objtool: Add R_REL32 macro
Hi!

On Wed, May 25, 2022 at 05:58:15PM +0200, Christophe Leroy wrote:
> In order to allow other architectures than x86 to use 32 bits
> relative relocations, define a R_REL32 macro that each architecture
> will define, in the same way as already done for R_NONE.

What are the expected semantics of this relocation? It is PC-relative,
sure, but what is the destination? S+A-P always? That works for both
x86-64 and for PowerPC, but it should be written down somewhere :-)


Segher
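Spelling out the S+A-P semantics mentioned above (a sketch; both
R_X86_64_PC32 and R_PPC_REL32 are defined this way in their respective
ABIs):

	/* S = symbol value, A = addend, P = address of the field being
	 * relocated; the result must fit a signed 32-bit field. */
	static inline s32 rel32_value(u64 S, s64 A, u64 P)
	{
		return (s32)(S + A - P);
	}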
Re: [RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"
On Wed, May 25, 2022 at 05:58:14PM +0200, Christophe Leroy wrote:
> This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.
>
> That commit is problematic as we miss some static calls.

Revert ?!?! who committed this. And there's a ton more broken than just
static calls. This must absolutely not be.
Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support
Thx Heiko & Guenter,

On Wed, May 25, 2022 at 7:10 PM Heiko Stübner wrote:
>
> Am Mittwoch, 25. Mai 2022, 12:57:30 CEST schrieb Heiko Stübner:
> > Am Mittwoch, 25. Mai 2022, 00:06:46 CEST schrieb Guenter Roeck:
> > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote:
> > > [ ... ]
> > > >
> > > > > The problem is come from "__dls3's vdso decode part in musl's
> > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong.
> > > > >
> > > > > I think the root cause is from musl's implementation with the wrong
> > > > > elf parser. I would fix that soon.
> > > > Not elf parser, it's "aux vector just past environ[]". I think I could
> > > > solve this, but anyone who could help dig in is welcome.
> > >
> > > I am not sure I understand what you are saying here. Point is that my
> > > root file system, generated with musl a year or so ago, crashes with
> > > your patch set applied. That is a regression, even if there is a bug
> > > in musl.
Thx for the report, it's a valuable regression for riscv-compat.

> > Also as I said in the other part of the thread, the rootfs seems innocent,
> > as my completely-standard Debian riscv64 rootfs is also affected.
> >
> > The merged version seems to be v12 [0] - not sure how this discussion
> > ended up in v9, but I just tested this revision in two variants:
> >
> > - v5.17 + this v9 -> works nicely
>
> I take that back ... now going back to that build I somehow also run into
> that issue here ... will investigate more.
Yeah, it's my fault. I've fixed up it, please have a try:

https://lore.kernel.org/linux-riscv/20220525160404.2930984-1-guo...@kernel.org/T/#u

> > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot
> > The only rebase-conflict was with the introduction of restartable
> > sequences and removal of the tracehook include, but turning CONFIG_RSEQ
> > off doesn't seem to affect the breakage.
> >
> > So it looks like something changed between 5.17 and 5.18 that causes the
> > issue.
> >
> > Heiko
> >
> > [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
[RFC PATCH v1 2/4] objtool: Add R_REL32 macro
In order to allow other architectures than x86 to use 32 bits
relative relocations, define a R_REL32 macro that each architecture
will define, in the same way as already done for R_NONE.

Signed-off-by: Christophe Leroy
---
 tools/objtool/arch/x86/include/arch/elf.h |  1 +
 tools/objtool/check.c                     | 10 +-
 tools/objtool/orc_gen.c                   |  2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/objtool/arch/x86/include/arch/elf.h b/tools/objtool/arch/x86/include/arch/elf.h
index 69cc4264b28a..8aa8c29607da 100644
--- a/tools/objtool/arch/x86/include/arch/elf.h
+++ b/tools/objtool/arch/x86/include/arch/elf.h
@@ -2,5 +2,6 @@
 #define _OBJTOOL_ARCH_ELF

 #define R_NONE R_X86_64_NONE
+#define R_REL32	R_X86_64_PC32

 #endif /* _OBJTOOL_ARCH_ELF */
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index 70be5a72e838..1627d14a01c9 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -650,7 +650,7 @@ static int create_static_call_sections(struct objtool_file *file)
 		/* populate reloc for 'addr' */
 		if (elf_add_reloc_to_insn(file->elf, sec,
 					  idx * sizeof(struct static_call_site),
-					  R_X86_64_PC32,
+					  R_REL32,
 					  insn->sec, insn->offset))
 			return -1;

@@ -691,7 +691,7 @@ static int create_static_call_sections(struct objtool_file *file)
 		/* populate reloc for 'key' */
 		if (elf_add_reloc(file->elf, sec,
 				  idx * sizeof(struct static_call_site) + 4,
-				  R_X86_64_PC32, key_sym,
+				  R_REL32, key_sym,
 				  is_sibling_call(insn) * STATIC_CALL_SITE_TAIL))
 			return -1;

@@ -735,7 +735,7 @@ static int create_retpoline_sites_sections(struct objtool_file *file)
 		if (elf_add_reloc_to_insn(file->elf, sec,
 					  idx * sizeof(int),
-					  R_X86_64_PC32,
+					  R_REL32,
 					  insn->sec, insn->offset)) {
 			WARN("elf_add_reloc_to_insn: .retpoline_sites");
 			return -1;
@@ -787,7 +787,7 @@ static int create_ibt_endbr_seal_sections(struct objtool_file *file)
 		if (elf_add_reloc_to_insn(file->elf, sec,
 					  idx * sizeof(int),
-					  R_X86_64_PC32,
+					  R_REL32,
 					  insn->sec, insn->offset)) {
 			WARN("elf_add_reloc_to_insn: .ibt_endbr_seal");
 			return -1;
@@ -3716,7 +3716,7 @@ static int validate_ibt_insn(struct objtool_file *file, struct instruction *insn
 			continue;

 		off = reloc->sym->offset;
-		if (reloc->type == R_X86_64_PC32 || reloc->type == R_X86_64_PLT32)
+		if (reloc->type == R_REL32 || reloc->type == R_X86_64_PLT32)
 			off += arch_dest_reloc_offset(reloc->addend);
 		else
 			off += reloc->addend;
diff --git a/tools/objtool/orc_gen.c b/tools/objtool/orc_gen.c
index 1f22b7ebae58..49a877b9c879 100644
--- a/tools/objtool/orc_gen.c
+++ b/tools/objtool/orc_gen.c
@@ -101,7 +101,7 @@ static int write_orc_entry(struct elf *elf, struct section *orc_sec,
 	orc->bp_offset = bswap_if_needed(elf, orc->bp_offset);

 	/* populate reloc for ip */
-	if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_X86_64_PC32,
+	if (elf_add_reloc_to_insn(elf, ip_sec, idx * sizeof(int), R_REL32,
 				  insn_sec, insn_off))
 		return -1;

--
2.35.3
[RFC PATCH v1 4/4] powerpc/static_call: Implement inline static calls
Implement inline static calls:
- Put a 'bl' to the destination function
- Put a 'nop' when the destination function is NULL
- Put a 'li r3,0' when the destination is the RET0 function

For the time being it only works if the destination is within 32Mb
from the caller.

Signed-off-by: Christophe Leroy
---
 arch/powerpc/Kconfig                          |  1 +
 arch/powerpc/include/asm/static_call.h        |  2 +
 arch/powerpc/kernel/static_call.c             | 41 ---
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 5ef8bf8eb202..3257a1c258d8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -246,6 +246,7 @@ config PPC
 	select HAVE_STACKPROTECTOR		if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
 	select HAVE_STACKPROTECTOR		if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
 	select HAVE_STATIC_CALL			if PPC32
+	select HAVE_STATIC_CALL_INLINE		if PPC32
 	select HAVE_SYSCALL_TRACEPOINTS
 	select HAVE_VIRT_CPU_ACCOUNTING
 	select HUGETLB_PAGE_SIZE_VARIABLE	if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
index de1018cc522b..e3d5d3823dac 100644
--- a/arch/powerpc/include/asm/static_call.h
+++ b/arch/powerpc/include/asm/static_call.h
@@ -26,4 +26,6 @@
 #define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)	__PPC_SCT(name, "blr")
 #define ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)	__PPC_SCT(name, "b .+20")

+#define CALL_INSN_SIZE	4
+
 #endif /* _ASM_POWERPC_STATIC_CALL_H */
diff --git a/arch/powerpc/kernel/static_call.c b/arch/powerpc/kernel/static_call.c
index 863a7aa24650..fd25954cfd24 100644
--- a/arch/powerpc/kernel/static_call.c
+++ b/arch/powerpc/kernel/static_call.c
@@ -9,25 +9,38 @@ void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
 	int err;
 	bool is_ret0 = (func == __static_call_return0);
 	unsigned long target = (unsigned long)(is_ret0 ? tramp + PPC_SCT_RET0 : func);
-	bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
-
-	if (!tramp)
-		return;

 	mutex_lock(&text_mutex);

-	if (func && !is_short) {
-		err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
-		if (err)
-			goto out;
+	if (tramp) {
+		bool is_short = is_offset_in_branch_range((long)target - (long)tramp);
+
+		if (func && !is_short) {
+			err = patch_instruction(tramp + PPC_SCT_DATA, ppc_inst(target));
+			if (err)
+				goto out;
+		}
+
+		if (!func)
+			err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
+		else if (is_short)
+			err = patch_branch(tramp, target, 0);
+		else
+			err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
 	}

-	if (!func)
-		err = patch_instruction(tramp, ppc_inst(PPC_RAW_BLR()));
-	else if (is_short)
-		err = patch_branch(tramp, target, 0);
-	else
-		err = patch_instruction(tramp, ppc_inst(PPC_RAW_NOP()));
+	if (site) {
+		bool is_short = is_offset_in_branch_range((long)func - (long)site);
+
+		if (!func)
+			err = patch_instruction(site, ppc_inst(PPC_RAW_NOP()));
+		else if (is_ret0)
+			err = patch_instruction(site, ppc_inst(PPC_RAW_LI(_R3, 0)));
+		else if (is_short)
+			err = patch_branch(site, target, BRANCH_SET_LINK);
+		else
+			panic("%s: function %pS is out of reach of %pS\n", __func__, func, site);
+	}

 out:
 	mutex_unlock(&text_mutex);
diff --git a/tools/objtool/arch/powerpc/include/arch/elf.h b/tools/objtool/arch/powerpc/include/arch/elf.h
index 3c8ebb7d2a6b..18784c764c14 100644
--- a/tools/objtool/arch/powerpc/include/arch/elf.h
+++ b/tools/objtool/arch/powerpc/include/arch/elf.h
@@ -4,5 +4,6 @@
 #define _OBJTOOL_ARCH_ELF

 #define R_NONE R_PPC_NONE
+#define R_REL32	R_PPC_REL32

 #endif /* _OBJTOOL_ARCH_ELF */
--
2.35.3
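For readers unfamiliar with the API being wired up here, a static call
is declared and used roughly as follows (a generic usage sketch from the
common static call API; my_hook and its targets are made-up names):

	DEFINE_STATIC_CALL(my_hook, default_impl);	/* initial target */

	/* With HAVE_STATIC_CALL_INLINE this call site compiles to a direct
	 * 'bl default_impl' that arch_static_call_transform() can later
	 * repatch to another target, a 'nop', or 'li r3,0'. */
	static_call(my_hook)(arg);

	/* Retarget all sites at runtime: */
	static_call_update(my_hook, &other_impl);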
[RFC PATCH v1 0/4] Implement inline static calls on PPC32
This is first draft for implementing inline static calls on PPC32.

This series applies on top of the series v2 "objtool: Enable and
implement --mcount option on powerpc"

For the time being only the case where functions are within 'bl' reach
is supported. Otherwise panic() is invoked.

For the other case, we'll need to use the trampoline we have at startup
before initialising inline static calls. But it seems that at the time
being once inline static calls are initialised we don't know anymore
where the trampoline was. We'd need to keep the information somewhere
(in the static_call_key ?)

We may also need to keep the information for when the trampoline itself
is out of 'bl' reach, in that case there is a trampoline setup by the
compiler and we'll need to remember the location of that trampoline.
Guess it should get saved somewhere when we initialise inline static
calls ?

Christophe Leroy (4):
  Revert "objtool: Enable objtool to run only on files with ftrace
    enabled"
  objtool: Add R_REL32 macro
  static_call: Call static_call_init() from start_kernel()
  powerpc/static_call: Implement inline static calls

 arch/powerpc/Kconfig                          |  1 +
 arch/powerpc/include/asm/static_call.h        |  2 +
 arch/powerpc/kernel/static_call.c             | 41 ---
 init/main.c                                   |  1 +
 scripts/Makefile.build                        |  4 +-
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 tools/objtool/arch/x86/include/arch/elf.h     |  1 +
 tools/objtool/check.c                         | 10 ++---
 tools/objtool/orc_gen.c                       |  2 +-
 9 files changed, 41 insertions(+), 22 deletions(-)

--
2.35.3
[RFC PATCH v1 1/4] Revert "objtool: Enable objtool to run only on files with ftrace enabled"
This reverts commit cf3013dfad89ad5ac7d16d56dced72d7c138a20e.

That commit is problematic as we miss some static calls.

Signed-off-by: Christophe Leroy
---
 scripts/Makefile.build | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 06ceffd92921..2e0c3f9c1459 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -258,8 +258,8 @@ else
 # 'OBJECT_FILES_NON_STANDARD_foo.o := 'y': skip objtool checking for a file
 # 'OBJECT_FILES_NON_STANDARD_foo.o := 'n': override directory skip for a file

-$(obj)/%.o: objtool-enabled = $(and $(if $(filter-out y%, $(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y), \
-	$(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)),y),y)
+$(obj)/%.o: objtool-enabled = $(if $(filter-out y%, \
+	$(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y)

 endif

--
2.35.3
[RFC PATCH v1 3/4] static_call: Call static_call_init() from start_kernel()
Call static_call_init() just after jump_label_init().

x86 already called it from setup_arch(). This is not a problem as
static_call_init() is guarded from double call.

Signed-off-by: Christophe Leroy
---
 init/main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/init/main.c b/init/main.c
index 98182c3c2c4b..b6c49c18ec5d 100644
--- a/init/main.c
+++ b/init/main.c
@@ -962,6 +962,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
 	pr_notice("Kernel command line: %s\n", saved_command_line);
 	/* parameters may set static keys */
 	jump_label_init();
+	static_call_init();
 	parse_early_param();
 	after_dashes = parse_args("Booting kernel",
 				  static_command_line, __start___param,
--
2.35.3
[PATCH 5/5] KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path
Alter the data collection points for the debug timing code in the P9 path to be more in line with what the code does. The points where we accumulate time are now the following: vcpu_entry: From vcpu_run_hv entry until the start of the inner loop; guest_entry: From the start of the inner loop until the guest entry in asm; in_guest: From the guest entry in asm until the return to KVM C code; guest_exit: From the return into KVM C code until the corresponding hypercall/page fault handling or re-entry into the guest; hypercall: Time spent handling hcalls in the kernel (hcalls can go to QEMU, not accounted here); page_fault: Time spent handling page faults; vcpu_exit: vcpu_run_hv exit (almost no code here currently). Like before, these are exposed in debugfs in a file called "timings". There are four values: - number of occurrences of the accumulation point; - total time the vcpu spent in the phase in ns; - shortest time the vcpu spent in the phase in ns; - longest time the vcpu spent in the phase in ns; === Before: rm_entry: 53132 16793518 256 4060 rm_intr: 53132 2125914 22 340 rm_exit: 53132 24108344 374 2180 guest: 53132 40980507996 404 9997650 cede: 0 0 0 0 After: vcpu_entry: 34637 7716108 178 4416 guest_entry: 52414 49365608 324 747542 in_guest: 52411 40828715840 258 9997480 guest_exit: 52410 19681717182 826 102496674 vcpu_exit: 34636 1744462 38 182 hypercall: 45712 22878288 38 1307962 page_fault: 992 04034 568 168688 With just one instruction (hcall): vcpu_entry: 1 942 942 942 guest_entry: 1 4044 4044 4044 in_guest: 1 1540 1540 1540 guest_exit: 1 3542 3542 3542 vcpu_exit: 1 80 80 80 hypercall: 0 0 0 0 page_fault: 0 0 0 0 === Signed-off-by: Fabiano Rosas --- arch/powerpc/include/asm/kvm_host.h | 12 +++- arch/powerpc/kvm/Kconfig | 9 + arch/powerpc/kvm/book3s_hv.c | 23 ++- arch/powerpc/kvm/book3s_hv_p9_entry.c | 14 -- 4 files changed, 34 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 37f03665bfa2..de2b226aa350 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -827,11 +827,13 @@ struct kvm_vcpu_arch { struct kvmhv_tb_accumulator *cur_activity; /* What we're timing */ u64 cur_tb_start; /* when it started */ #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING - struct kvmhv_tb_accumulator rm_entry; /* real-mode entry code */ - struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */ - struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */ - struct kvmhv_tb_accumulator guest_time; /* guest execution */ - struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */ + struct kvmhv_tb_accumulator vcpu_entry; + struct kvmhv_tb_accumulator vcpu_exit; + struct kvmhv_tb_accumulator in_guest; + struct kvmhv_tb_accumulator hcall; + struct kvmhv_tb_accumulator pg_fault; + struct kvmhv_tb_accumulator guest_entry; + struct kvmhv_tb_accumulator guest_exit; #else struct kvmhv_tb_accumulator rm_entry; /* real-mode entry code */ struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */ diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 191347f44731..cedf1e0f50e1 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -135,10 +135,11 @@ config KVM_BOOK3S_HV_P9_TIMING select KVM_BOOK3S_HV_EXIT_TIMING depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS help - Calculate time taken for each vcpu in various parts of the - code. 
The total, minimum and maximum times in nanoseconds - together with the number of executions are reported in debugfs in - kvm/vm#/vcpu#/timings. + Calculate time taken for each vcpu during vcpu entry and + exit, time spent inside the guest and time spent handling + hypercalls and page faults. The total, minimum and maximum + times in nanoseconds together with the number of executions + are reported in debugfs in kvm/vm#/vcpu#/timings. If unsure, say N. diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 69a6b40d58b9..f485632f247a 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -2654,11 +2654,13 @@ static struct debugfs_timings_element { size_t offset; } timings[] = { #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING - {"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)}, - {"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)}, - {"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)}, - {"guest", offsetof(struct kvm_vcpu, arch.guest_time)}, - {"cede",offsetof(struct kvm_vcpu, arch.cede_time)}, + {"vc
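For readers of the "timings" file format above, each accumulator is a small seqcount-protected record. A minimal sketch of a stable read, assuming the usual even/odd seqcount convention (the struct layout follows kvm_host.h; the reader helper itself is illustrative only):

	struct kvmhv_tb_accumulator {
		u64 seqcount;	/* also doubles as count * 2 */
		u64 tb_total;	/* total time in timebase ticks */
		u64 tb_min;	/* shortest occurrence */
		u64 tb_max;	/* longest occurrence */
	};

	static void read_accumulator(struct kvmhv_tb_accumulator *acc,
				     struct kvmhv_tb_accumulator *out)
	{
		u64 seq;

		for (;;) {
			seq = READ_ONCE(acc->seqcount);
			if (seq & 1)
				continue;	/* writer mid-update, retry */
			smp_rmb();
			*out = *acc;
			smp_rmb();
			if (READ_ONCE(acc->seqcount) == seq)
				break;		/* snapshot was stable */
		}
	}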
[PATCH 4/5] KVM: PPC: Book3S HV: Expose timing functions to module code
The next patch adds new timing points to the P9 entry path, some of which are in the module code, so we need to export the timing functions. Signed-off-by: Fabiano Rosas --- arch/powerpc/kvm/book3s_hv.h | 10 ++ arch/powerpc/kvm/book3s_hv_p9_entry.c | 11 ++- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv.h b/arch/powerpc/kvm/book3s_hv.h index 6b7f07d9026b..2f2e59d7d433 100644 --- a/arch/powerpc/kvm/book3s_hv.h +++ b/arch/powerpc/kvm/book3s_hv.h @@ -40,3 +40,13 @@ void switch_pmu_to_guest(struct kvm_vcpu *vcpu, struct p9_host_os_sprs *host_os_sprs); void switch_pmu_to_host(struct kvm_vcpu *vcpu, struct p9_host_os_sprs *host_os_sprs); + +#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING +void accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next); +#define start_timing(vcpu, next) accumulate_time(vcpu, next) +#define end_timing(vcpu) accumulate_time(vcpu, NULL) +#else +#define accumulate_time(vcpu, next) do {} while (0) +#define start_timing(vcpu, next) do {} while (0) +#define end_timing(vcpu) do {} while (0) +#endif diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c b/arch/powerpc/kvm/book3s_hv_p9_entry.c index f8ce473149b7..8b2a9a360e4e 100644 --- a/arch/powerpc/kvm/book3s_hv_p9_entry.c +++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c @@ -438,7 +438,7 @@ void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu, EXPORT_SYMBOL_GPL(restore_p9_host_os_sprs); #ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING -static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next) +void accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next) { struct kvmppc_vcore *vc = vcpu->arch.vcore; struct kvmhv_tb_accumulator *curr; @@ -468,14 +468,7 @@ static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator smp_wmb(); curr->seqcount = seq + 2; } - -#define start_timing(vcpu, next) __accumulate_time(vcpu, next) -#define end_timing(vcpu) __accumulate_time(vcpu, NULL) -#define accumulate_time(vcpu, next) __accumulate_time(vcpu, next) -#else -#define start_timing(vcpu, next) do {} while (0) -#define end_timing(vcpu) do {} while (0) -#define accumulate_time(vcpu, next) do {} while (0) +EXPORT_SYMBOL_GPL(accumulate_time); #endif static inline u64 mfslbv(unsigned int idx) -- 2.35.1
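As a usage illustration, the exported interface is meant to bracket phases like this (phase names taken from patch 5/5; error handling omitted):

	start_timing(vcpu, &vcpu->arch.vcpu_entry);	/* open the first phase */
	/* ... setup before the inner loop ... */
	accumulate_time(vcpu, &vcpu->arch.guest_entry);	/* close previous, open next */
	/* ... enter and exit the guest ... */
	end_timing(vcpu);				/* close whatever phase is open */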
[PATCH 3/5] KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path
We are currently doing the timing for debug purposes of the P9 entry path using the accumulators and terminology defined by the old entry path for P8 machines. Not only are the "real-mode" and "napping" mentions out of place for the P9 Radix entry path, but we also cannot change them, because the timing code is coupled to the structures defined in struct kvm_vcpu_arch. Add a new CONFIG_KVM_BOOK3S_HV_P9_TIMING to enable the timing code for the P9 entry path. For now, just add the new CONFIG and duplicate the structures. A subsequent patch will add the P9 changes. Signed-off-by: Fabiano Rosas --- arch/powerpc/include/asm/kvm_host.h | 8 arch/powerpc/kvm/Kconfig | 14 +- arch/powerpc/kvm/book3s_hv.c | 13 +++-- arch/powerpc/kvm/book3s_hv_p9_entry.c | 2 +- 4 files changed, 33 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index faf301d0dec0..37f03665bfa2 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -826,11 +826,19 @@ struct kvm_vcpu_arch { #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING struct kvmhv_tb_accumulator *cur_activity; /* What we're timing */ u64 cur_tb_start; /* when it started */ +#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING struct kvmhv_tb_accumulator rm_entry; /* real-mode entry code */ struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */ struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */ struct kvmhv_tb_accumulator guest_time; /* guest execution */ struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */ +#else + struct kvmhv_tb_accumulator rm_entry; /* real-mode entry code */ + struct kvmhv_tb_accumulator rm_intr;/* real-mode intr handling */ + struct kvmhv_tb_accumulator rm_exit;/* real-mode exit code */ + struct kvmhv_tb_accumulator guest_time; /* guest execution */ + struct kvmhv_tb_accumulator cede_time; /* time napping inside guest */ +#endif #endif /* CONFIG_KVM_BOOK3S_HV_EXIT_TIMING */ }; diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index 73f8277df7d1..191347f44731 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -130,10 +130,22 @@ config KVM_BOOK3S_64_PR config KVM_BOOK3S_HV_EXIT_TIMING bool +config KVM_BOOK3S_HV_P9_TIMING + bool "Detailed timing for the P9 entry point" + select KVM_BOOK3S_HV_EXIT_TIMING + depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS + help + Calculate time taken for each vcpu in various parts of the + code. The total, minimum and maximum times in nanoseconds + together with the number of executions are reported in debugfs in + kvm/vm#/vcpu#/timings. + + If unsure, say N. 
+ config KVM_BOOK3S_HV_P8_TIMING bool "Detailed timing for hypervisor real-mode code (for POWER8)" select KVM_BOOK3S_HV_EXIT_TIMING - depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS + depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS && !KVM_BOOK3S_HV_P9_TIMING help Calculate time taken for each vcpu in the real-mode guest entry, exit, and interrupt handling code, plus time spent in the guest diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c index 6fa518f6501d..69a6b40d58b9 100644 --- a/arch/powerpc/kvm/book3s_hv.c +++ b/arch/powerpc/kvm/book3s_hv.c @@ -2653,11 +2653,19 @@ static struct debugfs_timings_element { const char *name; size_t offset; } timings[] = { +#ifdef CONFIG_KVM_BOOK3S_HV_P9_TIMING {"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)}, {"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)}, {"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)}, {"guest", offsetof(struct kvm_vcpu, arch.guest_time)}, {"cede",offsetof(struct kvm_vcpu, arch.cede_time)}, +#else + {"rm_entry",offsetof(struct kvm_vcpu, arch.rm_entry)}, + {"rm_intr", offsetof(struct kvm_vcpu, arch.rm_intr)}, + {"rm_exit", offsetof(struct kvm_vcpu, arch.rm_exit)}, + {"guest", offsetof(struct kvm_vcpu, arch.guest_time)}, + {"cede",offsetof(struct kvm_vcpu, arch.cede_time)}, +#endif }; #define N_TIMINGS (ARRAY_SIZE(timings)) @@ -2776,8 +2784,9 @@ static const struct file_operations debugfs_timings_ops = { /* Create a debugfs directory for the vcpu */ static int kvmppc_arch_create_vcpu_debugfs_hv(struct kvm_vcpu *vcpu, struct dentry *debugfs_dentry) { - debugfs_create_file("timings", 0444, debugfs_dentry, vcpu, - &debugfs_timings_ops); + if (cpu_has_feature(CPU_FTR_ARCH_300) == IS_ENABLED(CONFIG_KVM_BOOK3S_HV_P9_TIMING)) + debugfs_create_file("timings", 0444, debugfs_dentry, vcpu, +
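The guard added in the last hunk is worth unpacking: CPU_FTR_ARCH_300 is set on POWER9-or-later hosts, so the "timings" file is only created when the compiled-in timing flavour matches the entry path the host will actually use. In sketch form (same logic, spelled out; local variable names are illustrative):

	bool p9_host = cpu_has_feature(CPU_FTR_ARCH_300);
	bool p9_timing = IS_ENABLED(CONFIG_KVM_BOOK3S_HV_P9_TIMING);

	/* P9 host with P9 timing, or pre-P9 host with P8 timing */
	if (p9_host == p9_timing)
		debugfs_create_file("timings", 0444, debugfs_dentry, vcpu,
				    &debugfs_timings_ops);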
[PATCH 2/5] KVM: PPC: Book3S HV: Add a new config for P8 debug timing
Turn the existing Kconfig KVM_BOOK3S_HV_EXIT_TIMING into KVM_BOOK3S_HV_P8_TIMING in preparation for the addition of a new config for P9 timings. This applies only to P8 code, the generic timing code is still kept under KVM_BOOK3S_HV_EXIT_TIMING. Signed-off-by: Fabiano Rosas --- arch/powerpc/kernel/asm-offsets.c | 2 +- arch/powerpc/kvm/Kconfig| 6 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 3 files changed, 18 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index eec536aef83a..8c10f536e478 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -379,7 +379,7 @@ int main(void) OFFSET(VCPU_SPRG2, kvm_vcpu, arch.shregs.sprg2); OFFSET(VCPU_SPRG3, kvm_vcpu, arch.shregs.sprg3); #endif -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING OFFSET(VCPU_TB_RMENTRY, kvm_vcpu, arch.rm_entry); OFFSET(VCPU_TB_RMINTR, kvm_vcpu, arch.rm_intr); OFFSET(VCPU_TB_RMEXIT, kvm_vcpu, arch.rm_exit); diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig index ddd88179110a..73f8277df7d1 100644 --- a/arch/powerpc/kvm/Kconfig +++ b/arch/powerpc/kvm/Kconfig @@ -128,7 +128,11 @@ config KVM_BOOK3S_64_PR and system calls on the host. config KVM_BOOK3S_HV_EXIT_TIMING - bool "Detailed timing for hypervisor real-mode code" + bool + +config KVM_BOOK3S_HV_P8_TIMING + bool "Detailed timing for hypervisor real-mode code (for POWER8)" + select KVM_BOOK3S_HV_EXIT_TIMING depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS help Calculate time taken for each vcpu in the real-mode guest entry, diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S index d185dee26026..c34932e31dcd 100644 --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S @@ -229,14 +229,14 @@ kvm_novcpu_wakeup: cmpdi r4, 0 beq kvmppc_primary_no_guest -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING addir3, r4, VCPU_TB_RMENTRY bl kvmhv_start_timing #endif b kvmppc_got_guest kvm_novcpu_exit: -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING ld r4, HSTATE_KVM_VCPU(r13) cmpdi r4, 0 beq 13f @@ -515,7 +515,7 @@ kvmppc_hv_entry: li r6, KVM_GUEST_MODE_HOST_HV stb r6, HSTATE_IN_GUEST(r13) -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING /* Store initial timestamp */ cmpdi r4, 0 beq 1f @@ -886,7 +886,7 @@ fast_guest_return: li r9, KVM_GUEST_MODE_GUEST_HV stb r9, HSTATE_IN_GUEST(r13) -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING /* Accumulate timing */ addir3, r4, VCPU_TB_GUEST bl kvmhv_accumulate_time @@ -937,7 +937,7 @@ secondary_too_late: cmpdi r4, 0 beq 11f stw r12, VCPU_TRAP(r4) -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING addir3, r4, VCPU_TB_RMEXIT bl kvmhv_accumulate_time #endif @@ -951,7 +951,7 @@ hdec_soon: li r12, BOOK3S_INTERRUPT_HV_DECREMENTER 12:stw r12, VCPU_TRAP(r4) mr r9, r4 -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING addir3, r4, VCPU_TB_RMEXIT bl kvmhv_accumulate_time #endif @@ -1048,7 +1048,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) li r0, MSR_RI mtmsrd r0, 1 -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING addir3, r9, VCPU_TB_RMINTR mr r4, r9 bl kvmhv_accumulate_time @@ -1127,7 +1127,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = paca */ -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING 
+#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING addir3, r9, VCPU_TB_RMEXIT mr r4, r9 bl kvmhv_accumulate_time @@ -1487,7 +1487,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) mtspr SPRN_LPCR,r8 isync -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING /* Finish timing, if we have a vcpu */ ld r4, HSTATE_KVM_VCPU(r13) cmpdi r4, 0 @@ -2155,7 +2155,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM) ld r4, HSTATE_KVM_VCPU(r13) std r3, VCPU_DEC_EXPIRES(r4) -#ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING +#ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING ld r4, HSTATE_KVM_VCPU(r13) addir3, r4, VCPU_TB_CEDE bl kvmhv_accumulate_time @@ -2223,7 +2223,7 @@ kvm_end_cede: /* get vcpu pointer */ ld r4,
[PATCH 1/5] KVM: PPC: Book3S HV: Fix "rm_exit" entry in debugfs timings
At debugfs/kvm/vm#/vcpu0/timings we show how long each part of the code takes to run:

$ cat /sys/kernel/debug/kvm/*-*/vcpu0/timings
rm_entry: 123785 49398892 118 4898
rm_intr: 123780 6075890 22 390
rm_exit: 0 0 0 0 <-- NOK
guest: 123780 46732919988 402 9997638
cede: 0 0 0 0 <-- OK, no cede napping in P9

The "rm_exit" always shows zero because it is the last one and end_timing does not increment the counter of the previous entry. We can fix it by calling accumulate_time again instead of end_timing. That way the counter gets incremented. The rest of the arithmetic can be ignored because there are no timing points after this and the accumulators are reset before the next round. Signed-off-by: Fabiano Rosas --- arch/powerpc/kvm/book3s_hv_p9_entry.c | 13 ++--- 1 file changed, 2 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c b/arch/powerpc/kvm/book3s_hv_p9_entry.c index a28e5b3daabd..f7591b6c92d1 100644 --- a/arch/powerpc/kvm/book3s_hv_p9_entry.c +++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c @@ -438,15 +438,6 @@ void restore_p9_host_os_sprs(struct kvm_vcpu *vcpu, EXPORT_SYMBOL_GPL(restore_p9_host_os_sprs); #ifdef CONFIG_KVM_BOOK3S_HV_EXIT_TIMING -static void __start_timing(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next) -{ - struct kvmppc_vcore *vc = vcpu->arch.vcore; - u64 tb = mftb() - vc->tb_offset_applied; - - vcpu->arch.cur_activity = next; - vcpu->arch.cur_tb_start = tb; -} - static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator *next) { struct kvmppc_vcore *vc = vcpu->arch.vcore; @@ -478,8 +469,8 @@ static void __accumulate_time(struct kvm_vcpu *vcpu, struct kvmhv_tb_accumulator curr->seqcount = seq + 2; } -#define start_timing(vcpu, next) __start_timing(vcpu, next) -#define end_timing(vcpu) __start_timing(vcpu, NULL) +#define start_timing(vcpu, next) __accumulate_time(vcpu, next) +#define end_timing(vcpu) __accumulate_time(vcpu, NULL) #define accumulate_time(vcpu, next) __accumulate_time(vcpu, next) #else #define start_timing(vcpu, next) do {} while (0) -- 2.35.1
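To see why the substitution fixes the count, compare what the two helpers do to the entry being closed. A condensed sketch of the accumulate path (memory barriers and the exact seqcount protocol trimmed; see book3s_hv_p9_entry.c for the real code):

	static void accumulate_sketch(struct kvm_vcpu *vcpu,
				      struct kvmhv_tb_accumulator *next)
	{
		struct kvmhv_tb_accumulator *curr = vcpu->arch.cur_activity;
		u64 tb = mftb() - vcpu->arch.vcore->tb_offset_applied;
		u64 delta = tb - vcpu->arch.cur_tb_start;

		vcpu->arch.cur_activity = next;	/* NULL means "stop timing" */
		vcpu->arch.cur_tb_start = tb;

		if (!curr)
			return;

		/* The old end_timing (== __start_timing(vcpu, NULL)) stopped
		 * above, so the last entry's counter and totals were never
		 * updated; this is the missing half: */
		curr->seqcount++;		/* displayed count is seqcount / 2 */
		curr->tb_total += delta;
		if (curr->seqcount == 1 || delta < curr->tb_min)
			curr->tb_min = delta;
		if (delta > curr->tb_max)
			curr->tb_max = delta;
		curr->seqcount++;
	}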
[PATCH 0/5] KVM: PPC: Book3S HV: Update debug timing code
We have some debug information at /sys/kernel/debug/kvm/vm#/vcpu#/timings which shows the time it takes to run various parts of the code. That infrastructure was written in the P8 timeframe and wasn't updated along with the guest entry point changes for P9. Ideally we would be able to just add new/different accounting points to the code as it changes over time, but since the P8 and P9 entry points are different code paths we first need to separate them from each other. This series alters KVM Kconfig to make that distinction.

Currently:
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
- timing infrastructure in asm (P8 only)
- timing infrastructure in C (P9 only)
- generic timing variables (P8/P9)
- debugfs code
- timing points for P8

After this series:
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING
- generic timing variables (P8/P9)
- debugfs code

CONFIG_KVM_BOOK3S_HV_P8_TIMING
- timing infrastructure in asm (P8 only)
- timing points for P8

CONFIG_KVM_BOOK3S_HV_P9_TIMING
- timing infrastructure in C (P9 only)
- timing points for P9

The new Kconfig rules are:
a) CONFIG_KVM_BOOK3S_HV_P8_TIMING selects CONFIG_KVM_BOOK3S_HV_EXIT_TIMING, resulting in the previous behavior. Tested on P8.
b) CONFIG_KVM_BOOK3S_HV_P9_TIMING selects CONFIG_KVM_BOOK3S_HV_EXIT_TIMING, resulting in the new behavior. Tested on P9.
c) CONFIG_KVM_BOOK3S_HV_P8_TIMING and CONFIG_KVM_BOOK3S_HV_P9_TIMING are mutually exclusive. If both are set, P9 takes precedence.

Fabiano Rosas (5):
KVM: PPC: Book3S HV: Fix "rm_exit" entry in debugfs timings
KVM: PPC: Book3S HV: Add a new config for P8 debug timing
KVM: PPC: Book3S HV: Decouple the debug timing from the P8 entry path
KVM: PPC: Book3S HV: Expose timing functions to module code
KVM: PPC: Book3S HV: Provide more detailed timings for P9 entry path

arch/powerpc/include/asm/kvm_host.h | 10 +++
arch/powerpc/kernel/asm-offsets.c | 2 +-
arch/powerpc/kvm/Kconfig | 19 -
arch/powerpc/kvm/book3s_hv.c | 26 --
arch/powerpc/kvm/book3s_hv.h | 10 +++
arch/powerpc/kvm/book3s_hv_p9_entry.c | 36 +
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 24 -
7 files changed, 82 insertions(+), 45 deletions(-)
-- 2.35.1
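Condensing the diffs from patches 2 and 3, the resulting Kconfig dependency scheme looks like this (abridged, help texts omitted). Rule c)'s precedence comes from the !KVM_BOOK3S_HV_P9_TIMING dependency on the P8 option:

	config KVM_BOOK3S_HV_EXIT_TIMING
		bool

	config KVM_BOOK3S_HV_P9_TIMING
		bool "Detailed timing for the P9 entry point"
		select KVM_BOOK3S_HV_EXIT_TIMING
		depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS

	config KVM_BOOK3S_HV_P8_TIMING
		bool "Detailed timing for hypervisor real-mode code (for POWER8)"
		select KVM_BOOK3S_HV_EXIT_TIMING
		depends on KVM_BOOK3S_HV_POSSIBLE && DEBUG_FS && !KVM_BOOK3S_HV_P9_TIMING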
[PATCH] KVM: PPC: Align pt_regs in kvm_vcpu_arch structure
The H_ENTER_NESTED hypercall receives as its second parameter the address of a region of memory containing the values for the nested guest privileged registers. We currently use the pt_regs structure contained within kvm_vcpu_arch to that end. Most hypercalls that receive a memory address expect that region to not cross a 4k page boundary. We would want H_ENTER_NESTED to follow the same pattern, so this patch ensures the pt_regs structure sits within a page. Signed-off-by: Fabiano Rosas --- arch/powerpc/include/asm/kvm_host.h | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index faf301d0dec0..87eba60f2920 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -519,7 +519,11 @@ struct kvm_vcpu_arch { struct kvmppc_book3s_shadow_vcpu *shadow_vcpu; #endif - struct pt_regs regs; + /* +* This is passed along to the HV via H_ENTER_NESTED. Align to +* prevent it crossing a real 4K page. +*/ + struct pt_regs regs __aligned(512); struct thread_fp_state fp; -- 2.35.1
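The 512 here is not arbitrary: pt_regs on ppc64 is smaller than 512 bytes, and aligning an object to a power of two at least as large as its size means it can never straddle a 4K boundary, because 4096 is a multiple of 512. A build-time check along these lines would document that invariant (hypothetical, not part of the patch):

	/* hypothetical guard: keep pt_regs inside its alignment slot */
	static_assert(sizeof(struct pt_regs) <= 512,
		      "pt_regs must not outgrow its __aligned(512) slot");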
Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On Tue, May 24, 2022 at 07:45:31PM -0400, Peter Xu wrote: > I observed that for each of the shared file-backed page faults, we're very > likely to retry one more time for the 1st write fault upon no page. It's > because we'll need to release the mmap lock for dirty rate limit purpose > with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()). > > Then after that throttling we return VM_FAULT_RETRY. > > We did that probably because VM_FAULT_RETRY is the only way we can return > to the fault handler at that time telling it we've released the mmap lock. > > However that's not ideal because it's very likely the fault does not need > to be retried at all since the pgtable was well installed before the > throttling, so the next continuous fault (including taking mmap read lock, > walk the pgtable, etc.) could be in most cases unnecessary. > > It's not only slowing down page faults for shared file-backed, but also add > more mmap lock contention which is in most cases not needed at all. > > To observe this, one could try to write to some shmem page and look at > "pgfault" value in /proc/vmstat, then we should expect 2 counts for each > shmem write simply because we retried, and vm event "pgfault" will capture > that. > > To make it more efficient, add a new VM_FAULT_COMPLETED return code just to > show that we've completed the whole fault and released the lock. It's also > a hint that we should very possibly not need another fault immediately on > this page because we've just completed it. > > This patch provides a ~12% perf boost on my aarch64 test VM with a simple > program sequentially dirtying 400MB shmem file being mmap()ed and these are > the time it needs: > > Before: 650.980 ms (+-1.94%) > After: 569.396 ms (+-1.38%) > > I believe it could help more than that. > > We need some special care on GUP and the s390 pgfault handler (for gmap > code before returning from pgfault), the rest changes in the page fault > handlers should be relatively straightforward. > > Another thing to mention is that mm_account_fault() does take this new > fault as a generic fault to be accounted, unlike VM_FAULT_RETRY. > > I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do > not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping > them as-is. > > Signed-off-by: Peter Xu Acked-by: Peter Zijlstra (Intel)
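The per-architecture change the description alludes to is mostly mechanical; in sketch form, each fault handler gains one early-out right after handle_mm_fault() (exact context differs per arch):

	fault = handle_mm_fault(vma, address, flags, regs);

	/*
	 * The core mm completed the fault and already dropped the
	 * mmap lock for us: no retry, nothing left to unlock.
	 */
	if (fault & VM_FAULT_COMPLETED)
		return;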
Re: [RFC PATCH 1/4] objtool: Add --mnop as an option to --mcount
On Tue, May 24, 2022 at 04:01:48PM +0530, Naveen N. Rao wrote: > We need to know for sure either way. Nop'ing out the _mcount locations at > boot allows us to discover existing long branch trampolines. If we want to > avoid it, we need to note down those locations during build time. > > Do you have a different approach in mind? If you put _mcount in a separate section then the compiler cannot tell where it is and is forced to always emit a long branch trampoline. Does that help?
Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support
On Wednesday, 25 May 2022 at 12:57:30 CEST, Heiko Stübner wrote: > On Wednesday, 25 May 2022 at 00:06:46 CEST, Guenter Roeck wrote: > > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote: > > [ ... ] > > > > > > The problem comes from "__dls3's vdso decode part in musl's > > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong. > > > > > > > > I think the root cause is from musl's implementation with the wrong > > > > elf parser. I would fix that soon. > > > Not elf parser, it's "aux vector just past environ[]". I think I could > > > solve this, but anyone who could help dig in is welcome. > > > > > > > I am not sure I understand what you are saying here. Point is that my > > root file system, generated with musl a year or so ago, crashes with > > your patch set applied. That is a regression, even if there is a bug > > in musl. > > Also as I said in the other part of the thread, the rootfs seems innocent, > as my completely-standard Debian riscv64 rootfs is also affected. > > The merged version seems to be v12 [0] - not sure how this discussion > ended up in v9, but I just tested this revision in two variants: > > - v5.17 + this v9 -> works nicely I take that back ... now going back to that build I somehow also run into that issue here ... will investigate more. > - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot > The only rebase conflict was with the introduction of restartable > sequences and removal of the tracehook include, but turning CONFIG_RSEQ > off doesn't seem to affect the breakage. > > So it looks like something changed between 5.17 and 5.18 that causes the > issue. > > > Heiko > > > [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/ >
Re: [RFC PATCH v2 5/7] objtool: Enable objtool to run only on files with ftrace enabled
Hi Peter, On 25/05/22 01:20, Peter Zijlstra wrote: On Tue, May 24, 2022 at 06:59:50PM +0000, Christophe Leroy wrote: On 24/05/2022 at 20:02, Peter Zijlstra wrote: On Tue, May 24, 2022 at 08:01:39PM +0200, Peter Zijlstra wrote: On Tue, May 24, 2022 at 03:17:45PM +0200, Christophe Leroy wrote: From: Sathvika Vasireddy This patch makes sure objtool runs only on the object files that have ftrace enabled, instead of running on all the object files. Signed-off-by: Naveen N. Rao Signed-off-by: Sathvika Vasireddy Signed-off-by: Christophe Leroy --- scripts/Makefile.build | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/scripts/Makefile.build b/scripts/Makefile.build index 2e0c3f9c1459..06ceffd92921 100644 --- a/scripts/Makefile.build +++ b/scripts/Makefile.build @@ -258,8 +258,8 @@ else # 'OBJECT_FILES_NON_STANDARD_foo.o := 'y': skip objtool checking for a file # 'OBJECT_FILES_NON_STANDARD_foo.o := 'n': override directory skip for a file -$(obj)/%.o: objtool-enabled = $(if $(filter-out y%, \ - $(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y) +$(obj)/%.o: objtool-enabled = $(and $(if $(filter-out y%, $(OBJECT_FILES_NON_STANDARD_$(basetarget).o)$(OBJECT_FILES_NON_STANDARD)n),y), \ +$(if $(findstring $(strip $(CC_FLAGS_FTRACE)),$(_c_flags)),y),y) I think this breaks x86, quite a few files have ftrace disabled but very much must run objtool anyway. Also; since the Changelog gives 0 clue as to what problem it's trying to solve, I can't suggest anything. I asked Sathvika on the previous series, see https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220523175548.922671-3...@linux.ibm.com/ He says it is to solve the problem I reported at https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220318105140.43914-4...@linux.ibm.com/#2861128 So on x86 we have: arch/x86/entry/vdso/Makefile:OBJECT_FILES_NON_STANDARD := y to kill objtool for the whole of the VDSO. When we run objtool on vmlinux it isn't a problem, because the VDSO ends up as a data section through linker scripts. Right.. Like you and Christophe mentioned, arch/powerpc/kernel/vdso/Makefile:OBJECT_FILES_NON_STANDARD := y should solve it for powerpc as well. I'll drop this patch and replace it with the above change as part of the next revision of the series. Thanks for reviewing! - Sathvika
Re: [PATCH v3] mm: Avoid unnecessary page fault retires on shared memory types
On Wed, May 25, 2022 at 1:45 AM Peter Xu wrote: > I observed that for each of the shared file-backed page faults, we're very > likely to retry one more time for the 1st write fault upon no page. It's > because we'll need to release the mmap lock for dirty rate limit purpose > with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()). > > Then after that throttling we return VM_FAULT_RETRY. > > We did that probably because VM_FAULT_RETRY is the only way we can return > to the fault handler at that time telling it we've released the mmap lock. > > However that's not ideal because it's very likely the fault does not need > to be retried at all since the pgtable was well installed before the > throttling, so the next continuous fault (including taking mmap read lock, > walk the pgtable, etc.) could be in most cases unnecessary. > > It's not only slowing down page faults for shared file-backed, but also add > more mmap lock contention which is in most cases not needed at all. > > To observe this, one could try to write to some shmem page and look at > "pgfault" value in /proc/vmstat, then we should expect 2 counts for each > shmem write simply because we retried, and vm event "pgfault" will capture > that. > > To make it more efficient, add a new VM_FAULT_COMPLETED return code just to > show that we've completed the whole fault and released the lock. It's also > a hint that we should very possibly not need another fault immediately on > this page because we've just completed it. > > This patch provides a ~12% perf boost on my aarch64 test VM with a simple > program sequentially dirtying 400MB shmem file being mmap()ed and these are > the time it needs: > > Before: 650.980 ms (+-1.94%) > After: 569.396 ms (+-1.38%) > > I believe it could help more than that. > > We need some special care on GUP and the s390 pgfault handler (for gmap > code before returning from pgfault), the rest changes in the page fault > handlers should be relatively straightforward. > > Another thing to mention is that mm_account_fault() does take this new > fault as a generic fault to be accounted, unlike VM_FAULT_RETRY. > > I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do > not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping > them as-is. > > Signed-off-by: Peter Xu > arch/m68k/mm/fault.c | 4 Acked-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH V9 20/20] riscv: compat: Add COMPAT Kbuild skeletal support
On Wednesday, 25 May 2022 at 00:06:46 CEST, Guenter Roeck wrote: > On Wed, May 25, 2022 at 01:46:38AM +0800, Guo Ren wrote: > [ ... ] > > > > The problem comes from "__dls3's vdso decode part in musl's > > > ldso/dynlink.c". The ehdr->e_phnum & ehdr->e_phentsize are wrong. > > > > > > I think the root cause is from musl's implementation with the wrong > > > elf parser. I would fix that soon. > > Not elf parser, it's "aux vector just past environ[]". I think I could > > solve this, but anyone who could help dig in is welcome. > > > > I am not sure I understand what you are saying here. Point is that my > root file system, generated with musl a year or so ago, crashes with > your patch set applied. That is a regression, even if there is a bug > in musl. Also as I said in the other part of the thread, the rootfs seems innocent, as my completely-standard Debian riscv64 rootfs is also affected. The merged version seems to be v12 [0] - not sure how this discussion ended up in v9, but I just tested this revision in two variants: - v5.17 + this v9 -> works nicely - v5.18-rc6 + this v9 (rebased onto it) -> breaks the boot The only rebase conflict was with the introduction of restartable sequences and removal of the tracehook include, but turning CONFIG_RSEQ off doesn't seem to affect the breakage. So it looks like something changed between 5.17 and 5.18 that causes the issue. Heiko [0] https://lore.kernel.org/all/20220405071314.3225832-1-guo...@kernel.org/
Re: [RFC PATCH v2 0/7] objtool: Enable and implement --mcount option on powerpc
Hi Christophe, On 24/05/22 18:47, Christophe Leroy wrote: This draft series adds PPC32 support to Sathvika's series. Verified on pmac32 on QEMU. It should in principle also work for PPC64 BE but for the time being something goes wrong. In the beginning I had a segfault, hence the first patch. But I still get no mcount section in the files. Since PPC64 BE uses the older ELFv1 ABI, it prepends a dot to symbols. And so, the relocation records in case of PPC64BE point to "._mcount", rather than just "_mcount". We should be looking for "._mcount" to be able to generate the __mcount_loc section in the files. Like: diff --git a/tools/objtool/check.c b/tools/objtool/check.c index 70be5a72e838..7da5bf8c7236 100644 --- a/tools/objtool/check.c +++ b/tools/objtool/check.c @@ -2185,7 +2185,7 @@ static int classify_symbols(struct objtool_file *file) if (arch_is_retpoline(func)) func->retpoline_thunk = true; - if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount"))) + if ((!strcmp(func->name, "__fentry__")) || (!strcmp(func->name, "_mcount")) || (!strcmp(func->name, "._mcount"))) func->fentry = true; if (is_profiling_func(func->name)) With this change, I could see __mcount_loc section being generated in individual ppc64be object files. - Sathvika
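An alternative to listing "._mcount" explicitly would be to strip the ELFv1 dot prefix before comparing; a sketch of that variant (untested, names taken from the diff above):

	const char *name = func->name;

	/* ELFv1 (PPC64 BE) function text symbols carry a leading dot */
	if (name[0] == '.')
		name++;

	if (!strcmp(name, "__fentry__") || !strcmp(name, "_mcount"))
		func->fentry = true;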
Re: [PATCH] powerpc/64: Use tick accounting by default
On 22/05/2017 at 07:13, Anton Blanchard wrote: Hi Michael, ppc64 is the only architecture that turns on VIRT_CPU_ACCOUNTING_NATIVE by default. The overhead of this option is extremely high - a context switch microbenchmark using sched_yield() is almost 20% slower. Running on what? It should all be nop'ed out unless you're on a platform that needs it (SPLPAR). POWERNV native. We don't nop out all the vtime_account_* gunk, do we? It is all those functions that are a large part of the problem. To get finer grained user/hardirq/softirq statistics, the IRQ_TIME_ACCOUNTING option can be used instead, which has much lower overhead. Can it? We don't select HAVE_IRQ_TIME_ACCOUNTING, so AFAICS it can't be enabled. I have a separate patch to enable it. Doesn't dropping this mean we never count stolen time? Perhaps. Do we have any applications left that care? This patch has been superseded by Nick's patch https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220525081346.871535-1-npig...@gmail.com/ Christophe
Re: [PATCH -next v4 3/7] arm64: add support for machine check error safe
On Thu, May 19, 2022 at 02:29:54PM +0800, Tong Tiangen wrote: > > > On 2022/5/13 23:26, Mark Rutland wrote: > > On Wed, Apr 20, 2022 at 03:04:14AM +, Tong Tiangen wrote: > > > During the processing of arm64 kernel hardware memory errors(do_sea()), if > > > the error is consumed in the kernel, the current processing is panic. > > > However, it is not optimal. > > > > > > Take uaccess for example, if the uaccess operation fails due to memory > > > error, only the user process will be affected, kill the user process > > > and isolate the user page with hardware memory errors is a better choice. > > > > Conceptually, I'm fine with the idea of constraining what we do for a > > true uaccess, but I don't like the implementation of this at all, and I > > think we first need to clean up the arm64 extable usage to clearly > > distinguish a uaccess from another access. > > OK, using EX_TYPE_UACCESS and this extable type could be recovered, this is > more reasonable. Great. > For EX_TYPE_UACCESS_ERR_ZERO, today we use it for kernel accesses in a > couple of cases, such as > get_user/futex/__user_cache_maint()/__user_swpX_asm(), Those are all user accesses. However, __get_kernel_nofault() and __put_kernel_nofault() use EX_TYPE_UACCESS_ERR_ZERO by way of __{get,put}_mem_asm(), so we'd need to refactor that code to split the user/kernel cases higher up the callchain. > your suggestion is: > get_user continues to use EX_TYPE_UACCESS_ERR_ZERO and the other cases use > new type EX_TYPE_FIXUP_ERR_ZERO? Yes, that's the rough shape. We could make the latter EX_TYPE_KACCESS_ERR_ZERO to be clearly analogous to EX_TYPE_UACCESS_ERR_ZERO, and with that I suspect we could remove EX_TYPE_FIXUP. Thanks, Mark.
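Sketching the direction agreed above: kernel accesses that currently reuse EX_TYPE_UACCESS_ERR_ZERO would move to a parallel type, so a machine-check path can key off the uaccess type alone (type values and the helper are illustrative, not the real arm64 asm-extable.h):

	#define EX_TYPE_UACCESS_ERR_ZERO	3	/* true user accesses */
	#define EX_TYPE_KACCESS_ERR_ZERO	9	/* kernel accesses wanting err/zero */

	/* only a genuine uaccess fixup makes a memory error recoverable */
	static bool ex_is_recoverable_uaccess(const struct exception_table_entry *ex)
	{
		return ex->type == EX_TYPE_UACCESS_ERR_ZERO;
	}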
[PATCH] powerpc/64: Remove PPC64 special case for cputime accounting default
Distro kernels tend to be moving to VIRT_CPU_ACCOUNTING_GEN, and there is not much reason why PPC64 should be special here. VIRT_CPU_ACCOUNTING_NATIVE does provide scaled vtime and stolen time apportioned between system and user time, and vtime accounting is not unconditionally enabled, and possibly other things. But it would be better at this point to extend GEN to cover important missing features rather than directing users back to a less used option. Signed-off-by: Nicholas Piggin --- After implementing stolen time for GEN for powerpc, can we try this and see who screams? Thanks, Nick init/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index ddcbefe535e9..544ed8b0707a 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -473,8 +473,7 @@ config VIRT_CPU_ACCOUNTING choice prompt "Cputime accounting" - default TICK_CPU_ACCOUNTING if !PPC64 - default VIRT_CPU_ACCOUNTING_NATIVE if PPC64 + default TICK_CPU_ACCOUNTING # Kind of a stub config for the pure tick based cputime accounting config TICK_CPU_ACCOUNTING -- 2.35.1
Re: [PATCH v1 4/4] watchdog/pseries-wdt: initial support for PAPR H_WATCHDOG timers
On 5/24/22 23:35, Alexey Kardashevskiy wrote: On 5/21/22 04:35, Scott Cheloha wrote: PAPR v2.12 defines a new hypercall, H_WATCHDOG. The hypercall permits guest control of one or more virtual watchdog timers. The timers have millisecond granularity. The guest is terminated when a timer expires. This patch adds a watchdog driver for these timers, "pseries-wdt". pseries_wdt_probe() currently assumes the existence of only one platform device and always assigns it watchdogNumber 1. If we ever expose more than one timer to userspace we will need to devise a way to assign a distinct watchdogNumber to each platform device at device registration time. This one should go before 4/4 in the series for bisectability. What is platform_device_register_simple("pseries-wdt",...) going to do without the driver? Signed-off-by: Scott Cheloha --- .../watchdog/watchdog-parameters.rst | 12 + drivers/watchdog/Kconfig | 8 + drivers/watchdog/Makefile | 1 + drivers/watchdog/pseries-wdt.c | 337 ++ 4 files changed, 358 insertions(+) create mode 100644 drivers/watchdog/pseries-wdt.c diff --git a/Documentation/watchdog/watchdog-parameters.rst b/Documentation/watchdog/watchdog-parameters.rst index 223c99361a30..4ffe725e796c 100644 --- a/Documentation/watchdog/watchdog-parameters.rst +++ b/Documentation/watchdog/watchdog-parameters.rst @@ -425,6 +425,18 @@ pnx833x_wdt: - +pseries-wdt: + action: + Action taken when watchdog expires: 1 (power off), 2 (restart), + 3 (dump and restart). (default=2) + timeout: + Initial watchdog timeout in seconds. (default=60) + nowayout: + Watchdog cannot be stopped once started. + (default=kernel config parameter) + +- + rc32434_wdt: timeout: Watchdog timeout value, in seconds (default=20) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index c4e82a8d863f..06b412603f3e 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -1932,6 +1932,14 @@ config MEN_A21_WDT # PPC64 Architecture +config PSERIES_WDT + tristate "POWER Architecture Platform Watchdog Timer" + depends on PPC_PSERIES + select WATCHDOG_CORE + help + Driver for virtual watchdog timers provided by PAPR + hypervisors (e.g. PowerVM, KVM). + config WATCHDOG_RTAS tristate "RTAS watchdog" depends on PPC_RTAS diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile index f7da867e8782..f35660409f17 100644 --- a/drivers/watchdog/Makefile +++ b/drivers/watchdog/Makefile @@ -184,6 +184,7 @@ obj-$(CONFIG_BOOKE_WDT) += booke_wdt.o obj-$(CONFIG_MEN_A21_WDT) += mena21_wdt.o # PPC64 Architecture +obj-$(CONFIG_PSERIES_WDT) += pseries-wdt.o obj-$(CONFIG_WATCHDOG_RTAS) += wdrtas.o # S390 Architecture diff --git a/drivers/watchdog/pseries-wdt.c b/drivers/watchdog/pseries-wdt.c new file mode 100644 index ..f41bc4d3b7a2 --- /dev/null +++ b/drivers/watchdog/pseries-wdt.c @@ -0,0 +1,337 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022 International Business Machines, Inc. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#define DRV_NAME "pseries-wdt" + +/* + * The PAPR's MSB->LSB bit ordering is 0->63. These macros simplify + * defining bitfields as described in the PAPR without needing to + * transpose values to the more C-like 63->0 ordering. 
+ */ +#define SETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) << PPC_BITLSHIFT(_e)) & PPC_BITMASK((_b), (_e))) +#define GETFIELD(_v, _b, _e) \ + (((unsigned long)(_v) & PPC_BITMASK((_b), (_e))) >> PPC_BITLSHIFT(_e)) + +/* + * H_WATCHDOG Hypercall Input + * + * R4: "flags": + * + * A 64-bit value structured as follows: + * + * Bits 0-46: Reserved (must be zero). + */ +#define PSERIES_WDTF_RESERVED PPC_BITMASK(0, 46) + +/* + * Bit 47: "leaveOtherWatchdogsRunningOnTimeout" + * + * 0 Stop outstanding watchdogs on timeout. + * 1 Leave outstanding watchdogs running on timeout. + */ +#define PSERIES_WDTF_LEAVE_OTHER PPC_BIT(47) + +/* + * Bits 48-55: "operation" + * + * 0x01 Start Watchdog + * 0x02 Stop Watchdog + * 0x03 Query Watchdog Capabilities + * 0x04 Query Watchdog LPM Requirement + */ +#define PSERIES_WDTF_OP(op) SETFIELD((op), 48, 55) +#define PSERIES_WDTF_OP_START PSERIES_WDTF_OP(0x1) +#define PSERIES_WDTF_OP_STOP PSERIES_WDTF_OP(0x2) +#define PSERIES_WDTF_OP_QUERY PSERIES_WDTF_OP(0x3) +#define PSERIES_WDTF_OP_QUERY_LPM PSERIES_WDTF_OP(0x4) + +/* + * Bits 56-63: "timeoutAction" + * + * 0x01 Hard poweroff + * 0x02 Hard restart + * 0x03 Dump restart + */ +#define PSERIES_WDTF_ACTION(ac) SETFIE
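Putting the flag helpers above together, a start operation would be issued roughly like this (a sketch only: the hypercall wrapper and argument order are assumed from the PAPR description quoted in this patch, and pseries_wdt_start_sketch() is a hypothetical name):

	static int pseries_wdt_start_sketch(unsigned long watchdog_num,
					    u64 timeout_ms)
	{
		unsigned long flags = PSERIES_WDTF_OP_START |
				      PSERIES_WDTF_ACTION(0x2);	/* hard restart */
		long rc = plpar_hcall_norets(H_WATCHDOG, flags, watchdog_num,
					     timeout_ms);

		return rc == H_SUCCESS ? 0 : -EIO;
	}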
Re: [PATCH v2] of: check previous kernel's ima-kexec-buffer against memory bounds
Hi Ritesh, thanks for looking into this patch, Ritesh Harjani writes: > Just a minor nit which I noticed. > > On 22/05/24 11:20AM, Vaibhav Jain wrote: >> Presently ima_get_kexec_buffer() doesn't check if the previous kernel's >> ima-kexec-buffer lies outside the addressable memory range. This can result >> in a kernel panic if the new kernel is booted with 'mem=X' arg and the >> ima-kexec-buffer was allocated beyond that range by the previous kernel. >> The panic is usually of the form below: >> >> $ sudo kexec --initrd initrd vmlinux --append='mem=16G' >> >> >> BUG: Unable to handle kernel data access on read at 0xc000c01fff7f >> Faulting instruction address: 0xc0837974 >> Oops: Kernel access of bad area, sig: 11 [#1] >> >> NIP [c0837974] ima_restore_measurement_list+0x94/0x6c0 >> LR [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> Call Trace: >> [c371fa80] [c083b55c] ima_load_kexec_buffer+0xac/0x160 >> [c371fb00] [c20512c4] ima_init+0x80/0x108 >> [c371fb70] [c20514dc] init_ima+0x4c/0x120 >> [c371fbf0] [c0012240] do_one_initcall+0x60/0x2c0 >> [c371fcc0] [c2004ad0] kernel_init_freeable+0x344/0x3ec >> [c371fda0] [c00128a4] kernel_init+0x34/0x1b0 >> [c371fe10] [c000ce64] ret_from_kernel_thread+0x5c/0x64 >> Instruction dump: >> f92100b8 f92100c0 90e10090 910100a0 4182050c 282a0017 3bc0 40810330 >> 7c0802a6 fb610198 7c9b2378 f80101d0 2c090001 40820614 e9240010 >> ---[ end trace ]--- >> >> Fix this issue by checking the returned PFN range of the previous kernel's >> ima-kexec-buffer with pfn_valid to ensure correct memory bounds. >> >> Fixes: 467d27824920 ("powerpc: ima: get the kexec buffer passed by the >> previous kernel") >> Cc: Frank Rowand >> Cc: Prakhar Srivastava >> Cc: Lakshmi Ramasubramanian >> Cc: Thiago Jung Bauermann >> Cc: Rob Herring >> Signed-off-by: Vaibhav Jain >> >> --- >> Changelog >> == >> >> v2: >> * Instead of using memblock to determine the valid bounds, use pfn_valid() to >> do >> so, since memblock may not be available late after the kernel init. [ Mpe ] >> * Changed the patch prefix from 'powerpc' to 'of' [ Mpe ] >> * Updated the 'Fixes' tag to point to the correct commit that introduced this >> function. [ Rob ] >> * Fixed some whitespace/tab issues in the patch description [ Rob ] >> * Added another check for checking if 'tmp_size' for ima-kexec-buffer is > 0 >> --- >> drivers/of/kexec.c | 17 + >> 1 file changed, 17 insertions(+) >> >> diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c >> index 8d374cc552be..879e984fe901 100644 >> --- a/drivers/of/kexec.c >> +++ b/drivers/of/kexec.c >> @@ -126,6 +126,7 @@ int ima_get_kexec_buffer(void **addr, size_t *size) >> { >> int ret, len; >> unsigned long tmp_addr; >> +unsigned int start_pfn, end_pfn; > > ^^^ Shouldn't this be unsigned long? Thanks for catching this. Yes that should be 'unsigned long'. Will resend the patch with this fixed. > > -ritesh > >> size_t tmp_size; >> const void *prop; >> >> @@ -140,6 +141,22 @@ int ima_get_kexec_buffer(void **addr, size_t *size) >> if (ret) >> return ret; >> >> +/* Do a sanity check on the returned size for the ima-kexec buffer */ >> +if (!tmp_size) >> +return -ENOENT; >> + >> +/* >> + * Calculate the PFNs for the buffer and ensure >> + * they are within addressable memory. 
>> + */ >> +start_pfn = PHYS_PFN(tmp_addr); >> +end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1); >> +if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) { >> +pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n", >> +tmp_addr, tmp_size); >> +return -EINVAL; >> +} >> + >> *addr = __va(tmp_addr); >> *size = tmp_size; >> >> -- >> 2.35.1 >> -- Cheers ~ Vaibhav
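For clarity, the promised v3 respin would only change the PFN variables' type; the bounds check itself stays as reviewed (sketch assembled from the thread, assuming no other changes):

	unsigned long start_pfn, end_pfn;

	/* Do a sanity check on the returned size */
	if (!tmp_size)
		return -ENOENT;

	/* PFNs for the buffer, which must lie within addressable memory */
	start_pfn = PHYS_PFN(tmp_addr);
	end_pfn = PHYS_PFN(tmp_addr + tmp_size - 1);
	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn)) {
		pr_warn("IMA buffer at 0x%lx, size = 0x%zx beyond memory\n",
			tmp_addr, tmp_size);
		return -EINVAL;
	}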