Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Wed, 7 Feb 2024 13:07:36 +0100 > Mete Durlu wrote: > >> wouldn't the following scenario explain the behavior we are seeing. >> When using event triggers, trace uses lockless ringbuffer control paths. >> If cmdline update and trace output reading is happening on different >> cpus, the ordering can get messed up. >> 1. event happens and trace trigger tells ring buffer to stop writes >> 2. (on cpu#1) test calculates checksum on current state of trace output. >> 3. (on cpu#2) not knowing about the trace buffer's status yet, writer adds >> one last entry which would collide with a pid in the cmdline map before >> actually stopping. While (on cpu#1) the checksum is being calculated, the new >> saved_cmdlines entry is waiting for spinlocks to be unlocked and then >> gets added. >> 4. test calculates checksum again and finds that the trace output has >> changed. <...> is put on the collided pid. > > But the failure is here: > > on=`cat tracing_on` > if [ $on != "0" ]; then > fail "Tracing is not off" > fi It might be misleading because we're discussing two issues in one thread. The failure above was one problem, which the initial fix addresses. The other issue we're still seeing is in the test below: > csum1=`md5sum trace` > sleep $SLEEP_TIME > csum2=`md5sum trace` > > if [ "$csum1" != "$csum2" ]; then > fail "Tracing file is still changing" > fi > > 1. tracing is off > 2. do checksum of trace > 3. sleep > 4. do another checksum of trace > 5. compare the two checksums > > Now how did they come up differently in that amount of time? The > saved_cmdlines really should not have been updated. My assumption without reading the code is that something like this happens:

CPU0                                      CPU1
[ringbuffer enabled]
ring_buffer_write()
 if (atomic_read(&buffer->record_disabled))
         goto out;
                                          echo 0 > tracing_on
                                          record_disabled |= RB_BUFFER_OFF
                                          csum1=`md5sum trace`
[adds trace entry to ring buffer,
 overwriting saved_cmdlines entry
 because it thinks the ring buffer
 is enabled]
                                          csum2=`md5sum trace`
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Tue, 06 Feb 2024 09:48:16 +0100 > Sven Schnelle wrote: > >> I added some logging, and the test is not triggering this issue. So i >> assume the default of 128 cmdline entries is just to small. Sorry for >> the noise. Lets see whether we're still triggering some failures with >> the other fix applied in CI. If we do, maybe we want to increase the >> saved_cmdline_size for the ftrace test suite. > > I wonder if it is a good idea to increase the size when tracing starts, > like the ring buffer expanding code. Maybe to 1024? Or is that still > too small? Not sure whether that is enough, have to test. However, it's not really a fix, it would just require a bit more load and the test would fail again. The fundamental problem is that even after disabling tracing there might be some tracing line added due to the lockless nature of the ringbuffer. This might then replace some existing cmdline entry. So maybe we should change the test to ignore the program name when calculating the checksums.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Sven Schnelle writes: > Looking at trace_save_cmdline(): > > tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000 > > so this is basically > > tpid = tsk->pid & 0x7fff; > > further on: > > // might clash with other pid if (otherpid & 0x7fff) == (tsk->pid & > 0x7fff) > idx = savedcmd->map_pid_to_cmdline[tpid]; > if (idx == NO_CMDLINE_MAP) { > > idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num; > savedcmd->map_pid_to_cmdline[tpid] = idx; > savedcmd->cmdline_idx = idx; > } > > So i think the problem that sometimes '<...>' instead of the correct > comm is logged is just expected behaviour given the code above. I added some logging, and the test is not triggering this issue. So I assume the default of 128 cmdline entries is just too small. Sorry for the noise. Let's see whether we're still triggering some failures with the other fix applied in CI. If we do, maybe we want to increase saved_cmdlines_size for the ftrace test suite.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Looking at trace_save_cmdline():

	tpid = tsk->pid & (PID_MAX_DEFAULT - 1);

where PID_MAX_DEFAULT = 0x8000, so this is basically

	tpid = tsk->pid & 0x7fff;

further on:

	// might clash with another pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff)
	idx = savedcmd->map_pid_to_cmdline[tpid];
	if (idx == NO_CMDLINE_MAP) {
		// This will pick an existing entry if there are
		// more than cmdline_num entries present
		idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;
		savedcmd->map_pid_to_cmdline[tpid] = idx;
		savedcmd->cmdline_idx = idx;
	}

So I think that logging '<...>' instead of the correct comm is sometimes just expected behaviour given the code above.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Ok, my understanding is that it will override the entry in the list if another process comes up with the same PID. But i haven't read the code carefully - let me do that now.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Mon, 5 Feb 2024 07:53:40 +0100 > Sven Schnelle wrote: > >> tracer_tracing_is_on() checks whether record_disabled is not zero. This >> checks both the record_disabled counter and the RB_BUFFER_OFF flag. >> Reading the source it looks like this function should only check for >> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). >> This fixes spurious fails in the 'test for function traceon/off triggers' >> test from the ftrace testsuite when the system is under load. >> > > I've seen these spurious failures too, but haven't looked deeper into > it. Thanks! Another issue I'm hitting sometimes is this part:

	csum1=`md5sum trace`
	sleep $SLEEP_TIME
	csum2=`md5sum trace`

	if [ "$csum1" != "$csum2" ]; then
		fail "Tracing file is still changing"
	fi

This is because the command line was replaced in the saved_cmdlines_buffer, an example diff between both files is:

  ftracetest-17950 [005] . 344507.002490: sched_process_wait: comm=ftracetest pid=0 prio=120
  ftracetest-17950 [005] . 344507.002492: sched_process_wait: comm=ftracetest pid=0 prio=120
- stress-ng-fanot-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
+ <...>-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
  ftracetest-17950 [005] d.h.. 344507.009901: sched_stat_runtime: comm=ftracetest pid=17950 runtime=7417915 [ns]
  stress-ng-fanot-17819 [003] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=9983473 [ns]
- stress-ng-fanot-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
+ <...>-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
  stress-ng-fanot-17819 [004] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=8388039 [ns]

This can be improved by:

	echo 32768 > /sys/kernel/tracing/saved_cmdlines_size

But this is of course not a fix - should we maybe replace the program name with <...> before comparing, remove the check completely, or do anything else? What do you think?

Thanks,
Sven
[PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load.

Signed-off-by: Sven Schnelle
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..47e221e1e720 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1532,7 +1532,7 @@ void disable_trace_on_warning(void)
 bool tracer_tracing_is_on(struct trace_array *tr)
 {
 	if (tr->array_buffer.buffer)
-		return ring_buffer_record_is_on(tr->array_buffer.buffer);
+		return ring_buffer_record_is_set_on(tr->array_buffer.buffer);
 	return !tr->buffer_disabled;
 }
--
2.40.1
Re: [PATCH 4/5] kbuild: unify vdso_install rules
Masahiro Yamada writes: > Currently, there is no standard implementation for vdso_install, > leading to various issues: > > 1. Code duplication > > Many architectures duplicate similar code just for copying files > to the install destination. > > Some architectures (arm, sparc, x86) create build-id symlinks, > introducing more code duplication. > > 2. Accidental updates of in-tree build artifacts > > The vdso_install rule depends on the vdso files to install. > It may update in-tree build artifacts. This can be problematic, > as explained in commit 19514fc665ff ("arm, kbuild: make > "make install" not depend on vmlinux"). > > 3. Broken code in some architectures > > Makefile code is often copied from one architecture to another > without proper adaptation or testing. > > The previous commits removed broken code from csky, UML, and parisc. > > Another issue is that 'make vdso_install' for ARCH=s390 installs > vdso64, but not vdso32. > > To address these problems, this commit introduces the generic vdso_install. > > Architectures that support vdso_install need to define vdso-install-y > in arch/*/Makefile. > > vdso-install-y lists the files to install. For example, arch/x86/Makefile > looks like this: > > vdso-install-$(CONFIG_X86_64) += arch/x86/entry/vdso/vdso64.so.dbg > vdso-install-$(CONFIG_X86_X32_ABI) += > arch/x86/entry/vdso/vdsox32.so.dbg > vdso-install-$(CONFIG_X86_32) += arch/x86/entry/vdso/vdso32.so.dbg > vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg > > These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix, > if exists, stripped away. > > vdso-install-y can optionally take the second field after the colon > separator. This is needed because some architectures install vdso > files as a different base name. > > The following is a snippet from arch/arm64/Makefile. 
> > vdso-install-$(CONFIG_COMPAT_VDSO) += > arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so > > This will rename vdso.so.dbg to vdso32.so during installation. If such > architectures change their implementation so that the file names match, > this workaround will go away. > > Signed-off-by: Masahiro Yamada > --- Acked-by: Sven Schnelle # s390
Re: [PATCH v4 4/9] fprobe: rethook: Use ftrace_regs in fprobe exit handler and rethook
Masami Hiramatsu (Google) writes: >> > IOW, it is ftrace save regs/restore regs code issue. I need to check how >> > the >> > function_graph implements it. >> >> gpr2-gpr14 are always saved in ftrace_caller/ftrace_regs_caller(), >> regardless of the FTRACE_WITH_REGS flags. The only difference is that >> without the FTRACE_WITH_REGS flag the program status word (psw) is not >> saved because collecting that is a rather expensive operation. > > Thanks for checking that! So s390 will recover those saved registers > even if FTRACE_WITH_REGS flag is not set? (I wonder what is the requirement > of the ftrace_regs when returning from ftrace_call() without > FTRACE_WITH_REGS?) Yes, it will recover these in all cases. >> >> I used the following commands to test rethook (is that the correct >> testcase?) >> >> #!/bin/bash >> cd /sys/kernel/tracing >> >> echo 'r:icmp_rcv icmp_rcv' >kprobe_events >> echo 1 >events/kprobes/icmp_rcv/enable >> ping -c 1 127.0.0.1 >> cat trace > > No, the kprobe will pass pt_regs to rethook. > Can you run > > echo "f:icmp_rcv%return icmp_rcv" >> dynamic_events Ah, ok. Seems to work as well:

 ping-481 [001] ..s2. 53.918480: icmp_rcv: (ip_protocol_deliver_rcu+0x42/0x218 <- icmp_rcv)
 ping-481 [001] ..s2. 53.918491: icmp_rcv: (ip_protocol_deliver_rcu+0x42/0x218 <- icmp_rcv)
Re: [RESEND][PATCH] tracing/synthetic: Fix order of struct trace_dynamic_info
Hi Steven, Steven Rostedt writes: > From: "Steven Rostedt (Google)" > > To make handling BIG and LITTLE endian better the offset/len of dynamic > fields of the synthetic events was changed into a structure of: > > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > u16 offset; > u16 len; > #else > u16 len; > u16 offset; > #endif > }; > > to replace the manual changes of: > > data_offset = offset & 0xffff; > data_offset |= len << 16; > > But if you look closely, the above is: > > len << 16 | offset > > Which in little endian would be in memory: > > offset_lo offset_hi len_lo len_hi > > and in big endian: > > len_hi len_lo offset_hi offset_lo > > Which if broken into a structure would be: > > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > u16 len; > u16 offset; > #else > u16 offset; > u16 len; > #endif > }; > > Which is the opposite of what was defined. > > Fix this and just to be safe also add "__packed". > > Link: https://lore.kernel.org/all/20230908154417.5172e...@gandalf.local.home/ > > Cc: sta...@vger.kernel.org > Fixes: ddeea494a16f3 ("tracing/synthetic: Use union instead of casts") > Signed-off-by: Steven Rostedt (Google) > --- > > [ Resending to the correct mailing list this time :-p ] > > include/linux/trace_events.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h > index 12f875e9e69a..21ae37e49319 100644 > --- a/include/linux/trace_events.h > +++ b/include/linux/trace_events.h > @@ -62,13 +62,13 @@ void trace_event_printf(struct trace_iterator *iter, > const char *fmt, ...); > /* Used to find the offset and length of dynamic fields in trace events */ > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > - u16 offset; > u16 len; > + u16 offset; > #else > - u16 len; > u16 offset; > + u16 len; > #endif > -}; > +} __packed; > > /* > * The trace entry - the most basic unit of tracing.
This issue was also present on BE, but as you noted, "covered" by the broken test case. With this patch everything works as expected. So: Tested-by: Sven Schnelle
Re: Is s390's new generic-using syscall code actually correct?
Sven Schnelle writes: > Hi Andy, > > Andy Lutomirski writes: > >> On Wed, Mar 24, 2021 at 10:39 AM Vasily Gorbik wrote: >>> >>> Hi Andy, >>> >>> On Sat, Mar 20, 2021 at 08:48:34PM -0700, Andy Lutomirski wrote: >>> > Hi all- >>> > >>> > I'm working on my kentry patchset, and I encountered: >>> > >>> > commit 56e62a73702836017564eaacd5212e4d0fa1c01d >>> > Author: Sven Schnelle >>> > Date: Sat Nov 21 11:14:56 2020 +0100 >>> > >>> > s390: convert to generic entry >>> > >>> > As part of this work, I was cleaning up the generic syscall helpers, >>> > and I encountered the goodies in do_syscall() and __do_syscall(). >>> > >>> > I'm trying to wrap my head around the current code, and I'm rather >>> > confused. >>> > >>> > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just >>> > the syscall exit work. So a do_syscall() that gets called twice will >>> > do the loopy part of the exit work (e.g. signal handling) twice. Is >>> > this intentional? If so, why? >>> > >>> > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed >>> > to work. Looking at the code in Linus' tree, if a signal is pending >>> > and a syscall returns -ERESTARTSYS, the syscall will return back to >>> > do_syscall(). The work (as in (1)) gets run, calling do_signal(), >>> > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. >>> > Presumably it will also push the signal frame onto the stack and aim >>> > the return address at the svc instruction mentioned in the commit >>> > message from "s390: convert to generic entry". Then __do_syscall() >>> > will turn interrupts back on and loop right back into do_syscall(). >>> > That seems incorrect. >>> > >>> > Can you enlighten me? My WIP tree is here: >>> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry >>> > >>> >>> For all the details to that change we'd have to wait for Sven, who is back >>> next week. 
>>> >>> > Here are my changes to s390, and I don't think they're really correct: >>> > >>> > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c >>> >>> Couple of things: syscall_exit_to_user_mode_prepare is static, >>> and there is another code path in arch/s390/kernel/traps.c using >>> enter_from_user_mode/exit_to_user_mode. >>> >>> Anyhow I gave your branch a spin and got few new failures on strace test >>> suite, in particular on restart_syscall test. I'll try to find time to >>> look into details. >> >> I refreshed the branch, but I confess I haven't compile tested it. :) >> >> I would guess that the new test case failures are a result of the >> buggy syscall restart logic. I think that all of the "restart" cases >> except execve() should just be removed. Without my patch, I suspect >> that signal delivery with -ERESTARTSYS would create the signal frame, >> do an accidental "restarted" syscall that was a no-op, and then >> deliver the signal. With my patch, it may simply repeat the original >> interrupted signal forever. > > PIF_SYSCALL_RESTART is set in arch_do_signal_or_restart(), but only if > there's no signal handler registered. In that case we don't need a > signal frame, so that should be fine. > > The problem why your branch is not working is that arch_do_signal_or_restart() > gets called from exit_to_user_mode_prepare(), and that is after the > check whether PIF_SYSCALL_RESTART is set in __do_syscall(). > > So i'm wondering how to fix that. x86 simply rewinds the pc, so the > system call instruction gets re-executed when returning to user > space. For s390 that doesn't work, as the s390 svc instruction might > have the syscall number encoded. If we would have to restart a system > call with restart_systemcall(), we need to change the syscall number to > __NR_restart_syscall. 
As we cannot change the hardcoded system call > number, we somehow need to handle that inside of the kernel. > > So I wonder whether we should implement the PIF_SYSCALL_RESTART check in > entry.S after all the return-to-user-space entry code has run but before > doing the real switch back to user space. If PIF_SYSCALL_RESTART is set > we would then just jump back to the entry code and pretend we came from > user space. > > That would have the benefit that the entry C code looks the same as on > other architectures and the amount of code to add in entry.S shouldn't > be much. Thinking about this again, I guess this idea won't work because the exit loop might have scheduled the old process away already...
Re: Is s390's new generic-using syscall code actually correct?
Hi Andy, Andy Lutomirski writes: > On Wed, Mar 24, 2021 at 10:39 AM Vasily Gorbik wrote: >> >> Hi Andy, >> >> On Sat, Mar 20, 2021 at 08:48:34PM -0700, Andy Lutomirski wrote: >> > Hi all- >> > >> > I'm working on my kentry patchset, and I encountered: >> > >> > commit 56e62a73702836017564eaacd5212e4d0fa1c01d >> > Author: Sven Schnelle >> > Date: Sat Nov 21 11:14:56 2020 +0100 >> > >> > s390: convert to generic entry >> > >> > As part of this work, I was cleaning up the generic syscall helpers, >> > and I encountered the goodies in do_syscall() and __do_syscall(). >> > >> > I'm trying to wrap my head around the current code, and I'm rather >> > confused. >> > >> > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just >> > the syscall exit work. So a do_syscall() that gets called twice will >> > do the loopy part of the exit work (e.g. signal handling) twice. Is >> > this intentional? If so, why? >> > >> > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed >> > to work. Looking at the code in Linus' tree, if a signal is pending >> > and a syscall returns -ERESTARTSYS, the syscall will return back to >> > do_syscall(). The work (as in (1)) gets run, calling do_signal(), >> > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. >> > Presumably it will also push the signal frame onto the stack and aim >> > the return address at the svc instruction mentioned in the commit >> > message from "s390: convert to generic entry". Then __do_syscall() >> > will turn interrupts back on and loop right back into do_syscall(). >> > That seems incorrect. >> > >> > Can you enlighten me? My WIP tree is here: >> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry >> > >> >> For all the details to that change we'd have to wait for Sven, who is back >> next week. 
>> >> > Here are my changes to s390, and I don't think they're really correct: >> > >> > >> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c >> >> Couple of things: syscall_exit_to_user_mode_prepare is static, >> and there is another code path in arch/s390/kernel/traps.c using >> enter_from_user_mode/exit_to_user_mode. >> >> Anyhow I gave your branch a spin and got few new failures on strace test >> suite, in particular on restart_syscall test. I'll try to find time to >> look into details. > > I refreshed the branch, but I confess I haven't compile tested it. :) > > I would guess that the new test case failures are a result of the > buggy syscall restart logic. I think that all of the "restart" cases > except execve() should just be removed. Without my patch, I suspect > that signal delivery with -ERESTARTSYS would create the signal frame, > do an accidental "restarted" syscall that was a no-op, and then > deliver the signal. With my patch, it may simply repeat the original > interrupted signal forever. PIF_SYSCALL_RESTART is set in arch_do_signal_or_restart(), but only if there's no signal handler registered. In that case we don't need a signal frame, so that should be fine. The problem why your branch is not working is that arch_do_signal_or_restart() gets called from exit_to_user_mode_prepare(), and that is after the check whether PIF_SYSCALL_RESTART is set in __do_syscall(). So i'm wondering how to fix that. x86 simply rewinds the pc, so the system call instruction gets re-executed when returning to user space. For s390 that doesn't work, as the s390 svc instruction might have the syscall number encoded. If we would have to restart a system call with restart_systemcall(), we need to change the syscall number to __NR_restart_syscall. As we cannot change the hardcoded system call number, we somehow need to handle that inside of the kernel. 
So I wonder whether we should implement the PIF_SYSCALL_RESTART check in entry.S after all the return-to-user-space entry code has run, but before doing the real switch back to user space. If PIF_SYSCALL_RESTART is set we would then just jump back to the entry code and pretend we came from user space. That would have the benefit that the entry C code looks the same as on other architectures, and the amount of code to add in entry.S shouldn't be much. Any thoughts? Regards Sven
Re: [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86
Mark Rutland writes: > On Thu, Mar 04, 2021 at 11:06:01AM -0800, Andy Lutomirski wrote: >> In principle, the generic entry code is generic, and the goal is to use it >> in many architectures once it settles down more. Move CONFIG_DEBUG_ENTRY >> to the generic config so that it can be used in the generic entry code and >> not just in arch/x86. >> >> Disable it on arm64. arm64 uses some but not all of the kentry >> code, and trying to debug the resulting state machine will be painful. >> arm64 can turn it back on when it starts using the entire generic >> path. > > Can we make this depend on CONFIG_GENERIC_ENTRY instead of !ARM64? > That'd be more in line with "use the generic entry code, get the generic > functionality". Note that arm64 doesn't select CONFIG_GENERIC_ENTRY > today. > > I see that s390 selects CONFIG_GENERIC_ENTRY, and either way this will > enable DEBUG_ENTRY for them, so it'd ve worth checking whether this is > ok for them. > > Sven, thoughts? > For s390 that change should be fine.
Re: Is s390's new generic-using syscall code actually correct?
Hi Andy, sorry for the late reply, i was away from kernel development for a few weeks. Andy Lutomirski writes: > Hi all- > > I'm working on my kentry patchset, and I encountered: > > commit 56e62a73702836017564eaacd5212e4d0fa1c01d > Author: Sven Schnelle > Date: Sat Nov 21 11:14:56 2020 +0100 > > s390: convert to generic entry > > As part of this work, I was cleaning up the generic syscall helpers, > and I encountered the goodies in do_syscall() and __do_syscall(). > > I'm trying to wrap my head around the current code, and I'm rather confused. > > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just > the syscall exit work. So a do_syscall() that gets called twice will > do the loopy part of the exit work (e.g. signal handling) twice. Is > this intentional? If so, why? Not really intentional, but i decided to accept the overhead for now and fix that later by splitting up the generic entry code. Otherwise the patch would have had even more dependencies. I have not looked yet into your kentry branch, but will do that later. Maybe there is already a better way to do it or we can work something out. > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed > to work. Looking at the code in Linus' tree, if a signal is pending > and a syscall returns -ERESTARTSYS, the syscall will return back to > do_syscall(). The work (as in (1)) gets run, calling do_signal(), > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. > Presumably it will also push the signal frame onto the stack and aim > the return address at the svc instruction mentioned in the commit > message from "s390: convert to generic entry". Then __do_syscall() > will turn interrupts back on and loop right back into do_syscall(). > That seems incorrect. That sounds indeed broken. My understanding is that x86 is always rewinding the pc in the restart case, and is exiting to user mode. That would than also work with signal frames. 
However, on s390 we cannot simply rewind the pc, as the syscall number might be encoded in the system call instruction. If a user has rewritten the system call number (e.g. with seccomp), rewinding would still execute the original syscall. That makes me wonder why I have not seen problems with signals and system call restarting so far. I have to read the code again. Regards Sven
[tip: sched/core] uprobes: (Re)add missing get_uprobe() in __find_uprobe()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b0d6d4789677d128b1933af023083054f0973574
Gitweb:        https://git.kernel.org/tip/b0d6d4789677d128b1933af023083054f0973574
Author:        Sven Schnelle
AuthorDate:    Tue, 09 Feb 2021 16:07:11 +01:00
Committer:     Ingo Molnar
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

uprobes: (Re)add missing get_uprobe() in __find_uprobe()

commit c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers") accidentally removed the refcount increase. Add it again.

Fixes: c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20210209150711.36778-1-sv...@linux.ibm.com
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fd5160d..3ea7f8f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);

 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));

 	return NULL;
 }
[tip: locking/core] s390: Use arch_local_irq_{save,restore}() in early boot code
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     b38085ba60246fccc2f49d2ac162528dedbc4e71
Gitweb:        https://git.kernel.org/tip/b38085ba60246fccc2f49d2ac162528dedbc4e71
Author:        Sven Schnelle
AuthorDate:    Wed, 10 Feb 2021 14:24:16 +01:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 10 Feb 2021 14:44:39 +01:00

s390: Use arch_local_irq_{save,restore}() in early boot code

Commit 997acaf6b4b5 ("lockdep: report broken irq restoration") makes compiling s390 fail because the irq enable/disable functions are now no longer fully contained in header files.

Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
---
 drivers/s390/char/sclp_early_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/char/sclp_early_core.c b/drivers/s390/char/sclp_early_core.c
index ec9f8ad..b7329af 100644
--- a/drivers/s390/char/sclp_early_core.c
+++ b/drivers/s390/char/sclp_early_core.c
@@ -66,13 +66,13 @@ int sclp_early_cmd(sclp_cmdw_t cmd, void *sccb)
 	unsigned long flags;
 	int rc;

-	raw_local_irq_save(flags);
+	flags = arch_local_irq_save();
 	rc = sclp_service_call(cmd, sccb);
 	if (rc)
 		goto out;
 	sclp_early_wait_irq();
 out:
-	raw_local_irq_restore(flags);
+	arch_local_irq_restore(flags);
 	return rc;
 }
[tip: sched/core] uprobes: (Re)add missing get_uprobe() in __find_uprobe()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2c3496a02cb06ffe957854d8488a5799d7bfb252
Gitweb:        https://git.kernel.org/tip/2c3496a02cb06ffe957854d8488a5799d7bfb252
Author:        Sven Schnelle
AuthorDate:    Tue, 09 Feb 2021 16:07:11 +01:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 10 Feb 2021 14:44:48 +01:00

uprobes: (Re)add missing get_uprobe() in __find_uprobe()

commit c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers") accidentally
removed the refcount increase. Add it again.

Fixes: c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20210209150711.36778-1-sv...@linux.ibm.com
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fd5160d..3ea7f8f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
[PATCH] uprobes: add missing get_uprobe() in __find_uprobe()
commit c6bc9bd06dff49fa4c ("rbtree, uprobes: Use rbtree helpers") from
next-20210208 accidentally removed the refcount increase. Add it again.

Signed-off-by: Sven Schnelle
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7e15b2efdd87..6addc9780319 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
-- 
2.17.1
Re: [PATCH] s390: allow reschedule on syscall restart
Christian Borntraeger writes:

> On 21.01.21 15:39, Sven Schnelle wrote:
>> Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
>> within idle loop") from next-20210121 causes a warning because s390
>> doesn't call sched_resched_local_allow() when restarting a syscall.
>>
>> Signed-off-by: Sven Schnelle
>> ---
>>  arch/s390/kernel/syscall.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
>> index bc8e650e377d..2b39ac40f970 100644
>> --- a/arch/s390/kernel/syscall.c
>> +++ b/arch/s390/kernel/syscall.c
>> @@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
>>  		do_syscall(regs);
>>  		if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
>>  			break;
>> +		sched_resched_local_allow();
>>  		local_irq_enable();
>>  	}
>>  	exit_to_user_mode();
>
> Yesterdays next now fails with
>
> arch/s390/kernel/syscall.c: In function '__do_syscall':
> arch/s390/kernel/syscall.c:165:3: error: implicit declaration of function 'sched_resched_local_allow' [-Werror=implicit-function-declaration]
>   165 |   sched_resched_local_allow();
>       |   ^
> cc1: some warnings being treated as errors
> make[2]: *** [scripts/Makefile.build:288: arch/s390/kernel/syscall.o] Error 1
> make[2]: *** Waiting for unfinished jobs
> make[1]: *** [scripts/Makefile.build:530: arch/s390/kernel] Error 2
> make[1]: *** Waiting for unfinished jobs

Looks to me like 845f44e8ef28 ("sched: Report local wake up on resched
blind zone") was removed from linux-next. Stephen, can you remove my
commit as well? It is no longer needed.

Thanks
Sven
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> Hi Sven,
>
> Thanks for the outstanding analysis!
>
> On 2021-01-23, Sven Schnelle wrote:
>>> 1401		if (buf_size > 0)
>>> 1402			text[len] = 0;
>>
>> I don't think i have really understood how all the printk magic works,
>> but using r->text_buf[len] seems to be the correct place to put the
>> zero byte in that case?
>
> Yes, you are correct! @text is pointing to the beginning of the
> currently processed line, not the beginning of the buffer.
>
> I will submit a patch to fix our recent fix (unless you would like to do
> that).

Please go ahead, thank you!
Re: [PATCH] printk: fix buffer overflow potential for print_text()
Sven Schnelle writes:

> John Ogness writes:
>
>> On 2021-01-22, Sven Schnelle wrote:
>
> I was able to reproduce it in a virtual machine where i have a few more
> ways to debug. What i got was:
>
> 01: -> 001B8814"  MVI  92001000  >> 0163F1CD  CC 2
>
> That's a watchpoint telling me that the code at 0x1b8814 wants to store
> one byte to 0x163f1cd, which is the second byte of console_drivers.
>
> gdb tells me about 0x1b8814:
>
> (gdb) list *(0x1b8814)
> 0x1b8814 is in record_print_text
> (/home/svens/ibmgit/linux/kernel/printk/printk.c:1402).
> 1397	 * If a buffer was provided, it will be terminated. Space for the
> 1398	 * string terminator is guaranteed to be available. The terminator is
> 1399	 * not counted in the return value.
> 1400	 */
> 1401	if (buf_size > 0)
> 1402		text[len] = 0;

I don't think i have really understood how all the printk magic works,
but using r->text_buf[len] seems to be the correct place to put the
zero byte in that case?

> 1403
> 1404	return len;
> 1405 }
> 1406
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> On 2021-01-22, Sven Schnelle wrote:
>>
>> So somehow the pointer for console_drivers changes.
>>
>> I can't provide the normal kernel crash output as printk is no longer
>> working,
>
> I don't understand what you mean here. The crash tool can dump the
> printk buffer.
>
> I would be curious to see what the messages look like. It would also be
> helpful to know the last message you saw on the console.

The last message is:

[ 1845.640466] User process fault: interruption code 0007 ilc:2 in libm.so[7d78c000+a3000]
[ 1845.640474] CPU: 9 PID: 859915 Comm: ld.so.1 Not tainted 5.11.0-20210122.rc4.git0.9f29bd8b2e71.300.fc33.s390x+git #1
[ 1845.640476] Hardware name: IBM 8561 T01 703 (LPAR)
[ 1845.640478] User PSW : 070530008000 7d7983b2
[ 1845.640480] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:3 PM:0 RI:0 EA:1
[ 1845.640483] User GPRS: 7d798360 0010 03ff0004
[ 1845.640485] 7f854dd0 7d7e9788 03ff00401b30 7f854ff8
[ 1845.640487] 03ff 03ff7d5b4da0 03ff 03ff0010
[ 1845.640489] 03ff004022c6 03ff004024b0 80401796 7f854de8

That's also what's written to the console. Please note that the output
above is expected. It's only reporting that a user space program
performed an invalid FPU operation. The kernel crash is neither written
to the console nor the dmesg buffer.

>> kmsg_dump_rewind+290 is:
>>
>> if (!(con->flags & CON_ENABLED))
>> 	continue;
>>
>> in kernel/printk/printk.c:1845
>>
>> I haven't dived into whether our show_code() is doing something wrong
>> which was covered in the past or whether that's because of the patch
>> above but wanted to make you aware of that in case you have an idea.
>> Otherwise i have to dig into the code.
>
> Unless we are dealing with very long printk messages that normally get
> truncated (800+ characters) this patch simply adds a string
> terminator. I do not see how that could possibly cause this kind of
> damage.
>
> When this triggers, there is nothing happening with consoles registering
> or deregistering, right?
That's correct. No registering/unregistering taking place.

> The string terminator (kernel/printk/printk.c:1402) is only added for
> paranoia. If you comment that out, this patch will have no effect (for
> "normal" length lines). I would be curious if that somehow makes a
> difference for you.

I was able to reproduce it in a virtual machine where i have a few more
ways to debug. What i got was:

01: -> 001B8814"  MVI  92001000  >> 0163F1CD  CC 2

That's a watchpoint telling me that the code at 0x1b8814 wants to store
one byte to 0x163f1cd, which is the second byte of console_drivers.

gdb tells me about 0x1b8814:

(gdb) list *(0x1b8814)
0x1b8814 is in record_print_text
(/home/svens/ibmgit/linux/kernel/printk/printk.c:1402).
1397	 * If a buffer was provided, it will be terminated. Space for the
1398	 * string terminator is guaranteed to be available. The terminator is
1399	 * not counted in the return value.
1400	 */
1401	if (buf_size > 0)
1402		text[len] = 0;
1403
1404	return len;
1405 }
1406

In s390 assembly, this is the store:

0x001b8810 <+0x130>: la  %r1,0(%r7,%r9)
0x001b8814 <+0x134>: mvi 0(%r1),0

The cpu registers at the time of write:

01: GRG  0 = 0020 0163F1CD
01: GRG  2 = 0042 03E000623A98
01: GRG  4 = 000E 0087BB70
01: GRG  6 = 0400 *0224*
01: GRG  8 = 000F *0163EFA9*
01: GRG 10 = 0033
01: GRG 12 = 809AE000 03E000623A98
01: GRG 14 = 001B884C 03E0006239E8

So r9 is 0163EFA9 - seems to be the start of the current message:

0163EFA9: *[   24.069514]7d7af3c2: b30d0002..debr.%f0,%f2.debr.%f0,%f2

while r7 is the offset, and that one is way too big: 0224

If you add that to 0163EFA9, you'll see that we're outside of the
buffer and overwriting the space after it, which is console_drivers.
The message it's trying to print is rather long, so the 0x224 could be
the size of the whole line. This is not from the same crash, but this
is what the message it's trying to print looks like:

[   23.960773] User Code: 7d7af3ba: 78005000	le %f0,0(%r5)
[   23.960773]            7d7af3be: 78205004	le %f2,4(%r5)
[   23.960773] #          7d7af3c2: b30d0002	debr %f0,%f2
[   23.960773] >00
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> Before commit b6cf8b3f3312 ("printk: add lockless ringbuffer"),
> msg_print_text() would only write up to size-1 bytes into the
> provided buffer. Some callers expect this behavior and append
> a terminator to the returned string. In particular:
>
> arch/powerpc/xmon/xmon.c:dump_log_buf()
> arch/um/kernel/kmsg_dump.c:kmsg_dumper_stdout()
>
> msg_print_text() has been replaced by record_print_text(), which
> currently fills the full size of the buffer. This causes a
> buffer overflow for the above callers.
>
> Change record_print_text() so that it will only use size-1 bytes
> for text data. Also, for paranoia's sake, add a terminator after
> the text data.
>
> And finally, document this behavior so that it is clear that only
> size-1 bytes are used and a terminator is added.
>
> Fixes: b6cf8b3f3312 ("printk: add lockless ringbuffer")
> Signed-off-by: John Ogness
> ---
> [..]

I'm seeing crashes on s390x with this patch while running the glibc
testsuite. The glibc test suite triggers a few FPU exceptions which are
printed to the kernel log by default.
Looking at the crash dump, i see that the console_drivers pointer seems
to be overwritten while printing that (user space) warning:

crash> print *console_drivers
$1 = {
  name = "\247\071\377\373\354!\004\214\000\331\300\345\000\033\353_",
  write = 0xa7190001eb119078,
  read = 0xe6e320b0050090,
  device = 0xa51e0010a7210001,
  unblank = 0xa7291000b9e28012,
  setup = 0xe320f0a4e320,
  exit = 0x208e0094eb112000,
  match = 0xcec1c006e017f,
  flags = -4984,
  index = 133,
  cflag = 8143136,
  data = 0x800458108004ec12,
  next = 0x5007eaf00
}

crash> x/16i console_drivers
0x79009700: lghi   %r3,-5
0x79009704: aghik  %r2,%r1,1164
0x7900970a: brasl  %r14,0x79386dc8
0x79009710: lghi   %r1,1
0x79009714: laog   %r1,%r1,120(%r9)
0x7900971a: llgc   %r2,5(%r11)
0x79009720: llilh  %r1,16
0x79009724: tmll   %r2,1
0x79009728: lghi   %r2,4096
0x7900972c: locgre %r1,%r2
0x79009730: lg     %r2,160(%r15)
0x79009736: llc    %r2,142(%r2)
0x7900973c: srlg   %r1,%r1,0(%r2)
0x79009742: clij   %r1,1,12,0x7900981e
0x79009748: cgij   %r8,0,8,0x79009852
0x7900974e: la     %r2,4(%r8)

crash> sym 0x79009700
79009700 (t) iomap_finish_ioend+192 /usr/src/debug/kernel-5.10.fc33/linux-5.11.0-20210122.rc4.git0.9f29bd8b2e71.300.fc33.s390x/./include/linux/pagemap.h: 59

So somehow the pointer for console_drivers changes.
I can't provide the normal kernel crash output as printk is no longer
working, but in crash the backtrace looks like this:

crash> bt
PID: 859915  TASK: dad24000  CPU: 9  COMMAND: "ld.so.1"

LOWCORE INFO:
 -psw      : 0x000200018000 0x78c8400e
 -function : stop_run at 78c8400e
 -prefix   : 0x0041e000
 -cpu timer: 0x7953a957e166
 -clock cmp: 0xd92876863ac66b00
 -general registers:
  0xf0f0f0f0 0x78c8400e 0x79984830 0x79984830
  0x79ac62c0 0x0002 0x0038 00
  0x7a19157e 00 0xfffa 00
  0xdad24000 00 0x78c8416e 0x0380033136f0
 -access registers:
  0x7d5b4da0 0xa2a795c0 00 00 00 00 00 00
  00 00 00 00 00 00 00 00
 -control registers:
  0x00a014966a10 0x7a15c007 0x0001d0a0 00
  0x 0x0001d080 0x3b00 0x0001f29b81c7
  0x8000 00 00 00
  00 0x7a15c007 0xdb00 0x0001d0c0
 -floating point registers:
  0x5fabeb67f92f 0x0007fff8 0x4040eb67f000 00
  0x0007 0x0064 0x02aa2a63f028 0x0963cf85
  0x2a642c60 00 0x7f85516ceb67f222 0x03ffeb67f528
  0x03ffeb67f230 0x0001 0x03ffeb67f21d 0x02aa28d2d970

 #0 [38003313710] stop_run at 78c8400e
 #1 [38003313728] atomic_notifier_call_chain at 78ced536
 #2 [38003313778] panic at 7974aeca
 #3 [38003313818] die at 78c897ea
 #4 [38003313880] do_no_context at 78c9b230
 #5 [380033138b8] pgm_check_handler at 7976438c
 PSW: 0404c0018000 78d304ba (kmsg_dump_rewind+290)
 GPRS: 0020 ec88 000e 793f4800
       0224 0224 0005007eaf00 dad24000
       038003313a68 038003313a18

 #0 [38003313a68] console_unlock at 78d3158e
 #1 [38003313b48] vprintk_emit at 78d32f50
 #2 [38003313ba8] vprintk_default at 78d32f9e
 #3 [38003313c00] printk at
[PATCH] s390: allow reschedule on syscall restart
Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
within idle loop") from next-20210121 causes a warning because s390
doesn't call sched_resched_local_allow() when restarting a syscall.

Signed-off-by: Sven Schnelle
---
 arch/s390/kernel/syscall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
index bc8e650e377d..2b39ac40f970 100644
--- a/arch/s390/kernel/syscall.c
+++ b/arch/s390/kernel/syscall.c
@@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
 		do_syscall(regs);
 		if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
 			break;
+		sched_resched_local_allow();
 		local_irq_enable();
 	}
 	exit_to_user_mode();
-- 
2.17.1
Warning with next-20210121 on s390
there's a warning with linux-next on s390:

[   33.893818] systemd-xdg-autostart-generator[544]: Not generating service for XDG autostart app-at\x2dspi\x2ddbus\x2dbus-autostart.service, startup phases are not supported.
[   54.613344] [ cut here ]
[   54.613702] Late current task rescheduling may be lost
[   54.613709] WARNING: CPU: 0 PID: 574 at linux/kernel/sched/core.c:628 sched_resched_local_assert_allowed+0x86/0x90
[   54.613714] Modules linked in:
[   54.613717] CPU: 0 PID: 574 Comm: qemu-system-s39 Not tainted 5.11.0-rc4-next-20210121 #2601
[   54.613721] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[   54.613724] Krnl PSW : 0404e0018000 00d86bea (sched_resched_local_assert_allowed+0x8a/0x90)
[   54.613730] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3

This is because of a non-obvious merge conflict between 56e62a737028
("s390: convert to generic entry") and 845f44e8ef28 ("sched: Report
local wake up on resched blind zone within idle loop"). Can you include
the attached patch into linux-next?

Thanks,
Sven
[tip: core/entry] entry: Add exit_to_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     310de1a678b2184c078c593dae343cb79c807f8d
Gitweb:        https://git.kernel.org/tip/310de1a678b2184c078c593dae343cb79c807f8d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:54 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Add exit_to_user_mode() wrapper

Called from architecture specific code when syscall_exit_to_user_mode()
is not suitable. It simply calls __exit_to_user_mode(). This way
__exit_to_user_mode() can still be inlined because it is declared static
__always_inline.

[ tglx: Amended comments and moved it to a different place in the header ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-5-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 23 +--
 kernel/entry/common.c        | 18 ++
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index da60980..e370be8 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -301,6 +301,25 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 #endif
 
 /**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall/interrupt exit enables interrupts, but the kernel state is
+ * interrupts disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Invoke architecture specific last minute exit code, e.g. speculation
+ *    mitigations, etc.: arch_exit_to_user_mode()
+ * 4) Tell lockdep that interrupts are enabled
+ *
+ * Invoked from architecture specific code when syscall_exit_to_user_mode()
+ * is not suitable as the last step before returning to userspace. Must be
+ * invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ */
+void exit_to_user_mode(void);
+
+/**
  * syscall_exit_to_user_mode - Handle work before returning to user mode
  * @regs: Pointer to currents pt_regs
  *
@@ -322,8 +341,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
  *    - Architecture specific one time work arch_exit_to_user_mode_prepare()
  *    - Address limit and lockdep checks
  *
- * 3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
- *    arch_exit_to_user_mode() to handle e.g. speculation mitigations
+ * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
+ *    functionality in exit_to_user_mode().
  */
 void syscall_exit_to_user_mode(struct pt_regs *regs);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 17b1e03..48d30ce 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -117,18 +117,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 	instrumentation_end();
 }
 
-/**
- * __exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall/interupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *    mitigations, etc.
- * 4) Tell lockdep that interrupts are enabled
- */
+/* See comment for exit_to_user_mode() in entry-common.h */
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
@@ -141,6 +130,11 @@ static __always_inline void __exit_to_user_mode(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+void noinstr exit_to_user_mode(void)
+{
+	__exit_to_user_mode();
+}
+
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
[tip: core/entry] entry: Rename enter_from_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     bb714fb3bc7b2e8be72b9c92f2d8a89ea2dc
Gitweb:        https://git.kernel.org/tip/bb714fb3bc7b2e8be72b9c92f2d8a89ea2dc
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:51 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Rename enter_from_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional enter_from_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-2-sv...@linux.ibm.com
---
 kernel/entry/common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e661e70..8e294a7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,7 +11,7 @@
 #include 
 
 /**
- * enter_from_user_mode - Establish state when coming from user mode
+ * __enter_from_user_mode - Establish state when coming from user mode
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -20,7 +20,7 @@
  * 2) Invoke context tracking if enabled to reactivate RCU
  * 3) Trace interrupts off state
  */
-static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
 	arch_check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
@@ -103,7 +103,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 {
 	long ret;
 
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 
 	instrumentation_begin();
 	local_irq_enable();
@@ -115,7 +115,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 
 noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 	instrumentation_begin();
 	local_irq_enable();
 	instrumentation_end();
@@ -304,7 +304,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 }
 
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
[tip: core/entry] entry: Add enter_from_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     96e2fbccd0fc806364a964fdf072bfc858a66109
Gitweb:        https://git.kernel.org/tip/96e2fbccd0fc806364a964fdf072bfc858a66109
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:53 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Add enter_from_user_mode() wrapper

To be called from architecture specific code if the combo interfaces are
not suitable. It simply calls __enter_from_user_mode(). This way
__enter_from_user_mode will still be inlined because it is declared
static __always_inline.

[ tglx: Amend comments and move it to a different location in the header ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-4-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 24 +++-
 kernel/entry/common.c        | 16 ++--
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index a6e98b4..da60980 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -102,6 +102,27 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 #endif
 
 /**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and interrupts are still
+ * disabled. The subsequent functions can be instrumented.
+ *
+ * This is invoked when there is architecture specific functionality to be
+ * done between establishing state and enabling interrupts. The caller must
+ * enable interrupts before invoking syscall_enter_from_user_mode_work().
+ */
+void enter_from_user_mode(struct pt_regs *regs);
+
+/**
  * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
  * @regs: Pointer to currents pt_regs
  *
@@ -110,7 +131,8 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
  * function returns all state is correct, interrupts are enabled and the
  * subsequent functions can be instrumented.
  *
- * This handles lockdep, RCU (context tracking) and tracing state.
+ * This handles lockdep, RCU (context tracking) and tracing state, i.e.
+ * the functionality provided by enter_from_user_mode().
  *
  * This is invoked when there is extra architecture specific functionality
  * to be done between establishing state and handling user mode entry work.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index dff07b4..17b1e03 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -10,16 +10,7 @@
 #define CREATE_TRACE_POINTS
 #include 
 
-/**
- * __enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- */
+/* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
 	arch_check_user_regs(regs);
@@ -33,6 +24,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	instrumentation_end();
 }
 
+void noinstr enter_from_user_mode(struct pt_regs *regs)
+{
+	__enter_from_user_mode(regs);
+}
+
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
 	if (unlikely(audit_context())) {
[tip: core/entry] entry: Rename exit_to_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     bb793562f0da7317adf6c456316bca651ff46f5d
Gitweb:        https://git.kernel.org/tip/bb793562f0da7317adf6c456316bca651ff46f5d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:52 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Rename exit_to_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional exit_to_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-3-sv...@linux.ibm.com
---
 kernel/entry/common.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8e294a7..dff07b4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 }
 
 /**
- * exit_to_user_mode - Fixup state when exiting to user mode
+ * __exit_to_user_mode - Fixup state when exiting to user mode
  *
  * Syscall/interupt exit enables interrupts, but the kernel state is
  * interrupts disabled when this is invoked. Also tell RCU about it.
@@ -133,7 +133,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
  *    mitigations, etc.
  * 4) Tell lockdep that interrupts are enabled
  */
-static __always_inline void exit_to_user_mode(void)
+static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
 	trace_hardirqs_on_prepare();
@@ -299,7 +299,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
@@ -312,7 +312,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	instrumentation_begin();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[tip: core/entry] entry: Add syscall_exit_to_user_mode_work()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     c6156e1da633f241e132eaea3b676d674376d770
Gitweb:        https://git.kernel.org/tip/c6156e1da633f241e132eaea3b676d674376d770
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:55 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:58 +01:00

entry: Add syscall_exit_to_user_mode_work()

This is the same as syscall_exit_to_user_mode() but without calling
exit_to_user_mode(). This can be used if there is an architectural
reason to avoid the combo function, e.g. restarting a syscall without
returning to userspace. Before returning to user space the caller has
to invoke exit_to_user_mode().

[ tglx: Amended comments ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-6-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 20
 kernel/entry/common.c        | 14 --
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index e370be8..7c581a4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -316,10 +316,26 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
  * is not suitable as the last step before returning to userspace. Must be
  * invoked with interrupts disabled and the caller must be
  * non-instrumentable.
+ * The caller has to invoke syscall_exit_to_user_mode_work() before this.
  */
 void exit_to_user_mode(void);
 
 /**
+ * syscall_exit_to_user_mode_work - Handle work before returning to user mode
+ * @regs: Pointer to currents pt_regs
+ *
+ * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling
+ * exit_to_user_mode() to perform the final transition to user mode.
+ *
+ * Calling convention is the same as for syscall_exit_to_user_mode() and it
+ * returns with all work handled and interrupts disabled. The caller must
+ * invoke exit_to_user_mode() before actually switching to user mode to
+ * make the final state transitions. Interrupts must stay disabled between
+ * return from this function and the invocation of exit_to_user_mode().
+ */
+void syscall_exit_to_user_mode_work(struct pt_regs *regs);
+
+/**
  * syscall_exit_to_user_mode - Handle work before returning to user mode
  * @regs: Pointer to currents pt_regs
  *
@@ -343,6 +359,10 @@ void exit_to_user_mode(void);
  *
  * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
  *    functionality in exit_to_user_mode().
+ *
+ * This is a combination of syscall_exit_to_user_mode_work() (1,2) and
+ * exit_to_user_mode(). This function is preferred unless there is a
+ * compelling architectural reason to use the seperate functions.
  */
 void syscall_exit_to_user_mode(struct pt_regs *regs);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 48d30ce..d6b7393 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -282,12 +282,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 		syscall_exit_work(regs, work);
 }
 
-__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
-	instrumentation_begin();
 	syscall_exit_to_user_mode_prepare(regs);
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
+}
+
+void syscall_exit_to_user_mode_work(struct pt_regs *regs)
+{
+	__syscall_exit_to_user_mode_work(regs);
+}
+
+__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	__syscall_exit_to_user_mode_work(regs);
 	instrumentation_end();
 	__exit_to_user_mode();
 }
[tip: core/entry] entry: Rename exit_to_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     f052615ca0112378c3571fc9662bbbcd18cf6b8d
Gitweb:        https://git.kernel.org/tip/f052615ca0112378c3571fc9662bbbcd18cf6b8d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:52 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00

entry: Rename exit_to_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional exit_to_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-3-sv...@linux.ibm.com
---
 kernel/entry/common.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8e294a7..dff07b4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 }
 
 /**
- * exit_to_user_mode - Fixup state when exiting to user mode
+ * __exit_to_user_mode - Fixup state when exiting to user mode
  *
  * Syscall/interupt exit enables interrupts, but the kernel state is
  * interrupts disabled when this is invoked. Also tell RCU about it.
@@ -133,7 +133,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
  *    mitigations, etc.
  * 4) Tell lockdep that interrupts are enabled
  */
-static __always_inline void exit_to_user_mode(void)
+static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
 	trace_hardirqs_on_prepare();
@@ -299,7 +299,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
@@ -312,7 +312,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	instrumentation_begin();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[tip: core/entry] entry: Add enter_from_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip: Commit-ID: 6546601d7e7c8da38f5d87204aa01f7b394f4fb3 Gitweb: https://git.kernel.org/tip/6546601d7e7c8da38f5d87204aa01f7b394f4fb3 Author: Sven Schnelle AuthorDate: Tue, 01 Dec 2020 15:27:53 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add enter_from_user_mode() wrapper To be called from architecture specific code if the combo interfaces are not suitable. It simply calls __enter_from_user_mode(). This way __enter_from_user_mode() will still be inlined because it is declared static __always_inline. [ tglx: Amend comments and move it to a different location in the header ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-4-sv...@linux.ibm.com --- include/linux/entry-common.h | 24 +++- kernel/entry/common.c| 16 ++-- 2 files changed, 29 insertions(+), 11 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index a6e98b4..da60980 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -102,6 +102,27 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs #endif /** + * enter_from_user_mode - Establish state when coming from user mode + * + * Syscall/interrupt entry disables interrupts, but user mode is traced as + * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. + * + * 1) Tell lockdep that interrupts are disabled + * 2) Invoke context tracking if enabled to reactivate RCU + * 3) Trace interrupts off state + * + * Invoked from architecture specific syscall entry code with interrupts + * disabled. The calling code has to be non-instrumentable. When the + * function returns all state is correct and interrupts are still + * disabled. The subsequent functions can be instrumented. 
+ * + * This is invoked when there is architecture specific functionality to be + * done between establishing state and enabling interrupts. The caller must + * enable interrupts before invoking syscall_enter_from_user_mode_work(). + */ +void enter_from_user_mode(struct pt_regs *regs); + +/** * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts * @regs: Pointer to currents pt_regs * @@ -110,7 +131,8 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs * function returns all state is correct, interrupts are enabled and the * subsequent functions can be instrumented. * - * This handles lockdep, RCU (context tracking) and tracing state. + * This handles lockdep, RCU (context tracking) and tracing state, i.e. + * the functionality provided by enter_from_user_mode(). * * This is invoked when there is extra architecture specific functionality * to be done between establishing state and handling user mode entry work. diff --git a/kernel/entry/common.c b/kernel/entry/common.c index dff07b4..17b1e03 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -10,16 +10,7 @@ #define CREATE_TRACE_POINTS #include -/** - * __enter_from_user_mode - Establish state when coming from user mode - * - * Syscall/interrupt entry disables interrupts, but user mode is traced as - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
- * - * 1) Tell lockdep that interrupts are disabled - * 2) Invoke context tracking if enabled to reactivate RCU - * 3) Trace interrupts off state - */ +/* See comment for enter_from_user_mode() in entry-common.h */ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); @@ -33,6 +24,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) {
[tip: core/entry] entry: Add syscall_exit_to_user_mode_work()
The following commit has been merged into the core/entry branch of tip: Commit-ID: 1568b5540b3e6ff3fe43a2cf889cb777cf8149fc Gitweb: https://git.kernel.org/tip/1568b5540b3e6ff3fe43a2cf889cb777cf8149fc Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:55 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add syscall_exit_to_user_mode_work() This is the same as syscall_exit_to_user_mode() but without calling exit_to_user_mode(). This can be used if there is an architectural reason to avoid the combo function, e.g. restarting a syscall without returning to userspace. Before returning to user space the caller has to invoke exit_to_user_mode(). [ tglx: Amended comments ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-6-sv...@linux.ibm.com --- include/linux/entry-common.h | 20 kernel/entry/common.c| 14 -- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index e370be8..7c581a4 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -316,10 +316,26 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * is not suitable as the last step before returning to userspace. Must be * invoked with interrupts disabled and the caller must be * non-instrumentable. + * The caller has to invoke syscall_exit_to_user_mode_work() before this. */ void exit_to_user_mode(void); /** + * syscall_exit_to_user_mode_work - Handle work before returning to user mode + * @regs: Pointer to currents pt_regs + * + * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling + * exit_to_user_mode() to perform the final transition to user mode. + * + * Calling convention is the same as for syscall_exit_to_user_mode() and it + * returns with all work handled and interrupts disabled. 
The caller must + * invoke exit_to_user_mode() before actually switching to user mode to + * make the final state transitions. Interrupts must stay disabled between + * return from this function and the invocation of exit_to_user_mode(). + */ +void syscall_exit_to_user_mode_work(struct pt_regs *regs); + +/** * syscall_exit_to_user_mode - Handle work before returning to user mode * @regs: Pointer to currents pt_regs * @@ -343,6 +359,10 @@ void exit_to_user_mode(void); * * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the * functionality in exit_to_user_mode(). + * + * This is a combination of syscall_exit_to_user_mode_work() (1,2) and + * exit_to_user_mode(). This function is preferred unless there is a + * compelling architectural reason to use the seperate functions. */ void syscall_exit_to_user_mode(struct pt_regs *regs); diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 48d30ce..d6b7393 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -282,12 +282,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) syscall_exit_work(regs, work); } -__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs) { - instrumentation_begin(); syscall_exit_to_user_mode_prepare(regs); local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); +} + +void syscall_exit_to_user_mode_work(struct pt_regs *regs) +{ + __syscall_exit_to_user_mode_work(regs); +} + +__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +{ + instrumentation_begin(); + __syscall_exit_to_user_mode_work(regs); instrumentation_end(); __exit_to_user_mode(); }
[tip: core/entry] entry: Add exit_to_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip: Commit-ID: 7918e4b844d1481c2200445f758bb2d1cd14346c Gitweb: https://git.kernel.org/tip/7918e4b844d1481c2200445f758bb2d1cd14346c Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:54 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add exit_to_user_mode() wrapper Called from architecture specific code when syscall_exit_to_user_mode() is not suitable. It simply calls __exit_to_user_mode(). This way __exit_to_user_mode() can still be inlined because it is declared static __always_inline. [ tglx: Amended comments and moved it to a different place in the header ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-5-sv...@linux.ibm.com --- include/linux/entry-common.h | 23 +-- kernel/entry/common.c| 18 ++ 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index da60980..e370be8 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -301,6 +301,25 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) #endif /** + * exit_to_user_mode - Fixup state when exiting to user mode + * + * Syscall/interrupt exit enables interrupts, but the kernel state is + * interrupts disabled when this is invoked. Also tell RCU about it. + * + * 1) Trace interrupts on state + * 2) Invoke context tracking if enabled to adjust RCU state + * 3) Invoke architecture specific last minute exit code, e.g. speculation + *mitigations, etc.: arch_exit_to_user_mode() + * 4) Tell lockdep that interrupts are enabled + * + * Invoked from architecture specific code when syscall_exit_to_user_mode() + * is not suitable as the last step before returning to userspace. Must be + * invoked with interrupts disabled and the caller must be + * non-instrumentable. 
+ */ +void exit_to_user_mode(void); + +/** * syscall_exit_to_user_mode - Handle work before returning to user mode * @regs: Pointer to currents pt_regs * @@ -322,8 +341,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * - Architecture specific one time work arch_exit_to_user_mode_prepare() * - Address limit and lockdep checks * - * 3) Final transition (lockdep, tracing, context tracking, RCU). Invokes - * arch_exit_to_user_mode() to handle e.g. speculation mitigations + * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the + * functionality in exit_to_user_mode(). */ void syscall_exit_to_user_mode(struct pt_regs *regs); diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 17b1e03..48d30ce 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -117,18 +117,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) instrumentation_end(); } -/** - * __exit_to_user_mode - Fixup state when exiting to user mode - * - * Syscall/interupt exit enables interrupts, but the kernel state is - * interrupts disabled when this is invoked. Also tell RCU about it. - * - * 1) Trace interrupts on state - * 2) Invoke context tracking if enabled to adjust RCU state - * 3) Invoke architecture specific last minute exit code, e.g. speculation - *mitigations, etc. - * 4) Tell lockdep that interrupts are enabled - */ +/* See comment for exit_to_user_mode() in entry-common.h */ static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); @@ -141,6 +130,11 @@ static __always_inline void __exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
[tip: core/entry] entry: Rename enter_from_user_mode()
The following commit has been merged into the core/entry branch of tip: Commit-ID: e2391bd55155ca98293b1bbaa44aa2815d3d054f Gitweb: https://git.kernel.org/tip/e2391bd55155ca98293b1bbaa44aa2815d3d054f Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:51 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:17 +01:00 entry: Rename enter_from_user_mode() In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-2-sv...@linux.ibm.com --- kernel/entry/common.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e661e70..8e294a7 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -11,7 +11,7 @@ #include /** - * enter_from_user_mode - Establish state when coming from user mode + * __enter_from_user_mode - Establish state when coming from user mode * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
@@ -20,7 +20,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -103,7 +103,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -115,7 +115,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); @@ -304,7 +304,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
Re: [PATCH v2 2/5] entry: rename exit_from_user_mode()
Sven Schnelle writes: > In order to make this function publicly available rename > it so it can still be inlined. An additional exit_from_user_mode() > function will be added with a later commit. That should of course be exit_to_user_mode() in the commit description... > Signed-off-by: Sven Schnelle > --- > kernel/entry/common.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > index 683a8e1b5388..076ee1cde67f 100644 > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -111,7 +111,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct > pt_regs *regs) > } > > /** > - * exit_to_user_mode - Fixup state when exiting to user mode > + * __exit_to_user_mode - Fixup state when exiting to user mode > * > * Syscall/interupt exit enables interrupts, but the kernel state is > * interrupts disabled when this is invoked. Also tell RCU about it. > @@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct > pt_regs *regs) > *mitigations, etc. > * 4) Tell lockdep that interrupts are enabled > */ > -static __always_inline void exit_to_user_mode(void) > +static __always_inline void __exit_to_user_mode(void) > { > instrumentation_begin(); > trace_hardirqs_on_prepare(); > @@ -265,7 +265,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct > pt_regs *regs) > local_irq_disable_exit_to_user(); > exit_to_user_mode_prepare(regs); > instrumentation_end(); > - exit_to_user_mode(); > + __exit_to_user_mode(); > } > > noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) > @@ -278,7 +278,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs > *regs) > instrumentation_begin(); > exit_to_user_mode_prepare(regs); > instrumentation_end(); > - exit_to_user_mode(); > + __exit_to_user_mode(); > } > > noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[PATCH v2 4/5] entry: add exit_to_user_mode() wrapper
Can be called from architecture dependent code, and simply calls __exit_to_user_mode(). This way __exit_to_user_mode() can still be inlined because it is declared static. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 15 +++ kernel/entry/common.c| 17 + 2 files changed, 20 insertions(+), 12 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 28a8554fad7d..112007525f50 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -413,4 +413,19 @@ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); * 3) Trace interrupts off state */ void enter_from_user_mode(struct pt_regs *regs); + +/** + * exit_to_user_mode - Fixup state when exiting to user mode + * + * Syscall/interrupt exit enables interrupts, but the kernel state is + * interrupts disabled when this is invoked. Also tell RCU about it. + * + * 1) Trace interrupts on state + * 2) Invoke context tracking if enabled to adjust RCU state + * 3) Invoke architecture specific last minute exit code, e.g. speculation + *mitigations, etc. + * 4) Tell lockdep that interrupts are enabled + */ + +void exit_to_user_mode(void); #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index ee588ee9f122..e696f6912642 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -105,18 +105,6 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) instrumentation_end(); } -/** - * __exit_to_user_mode - Fixup state when exiting to user mode - * - * Syscall/interupt exit enables interrupts, but the kernel state is - * interrupts disabled when this is invoked. Also tell RCU about it. - * - * 1) Trace interrupts on state - * 2) Invoke context tracking if enabled to adjust RCU state - * 3) Invoke architecture specific last minute exit code, e.g. speculation - *mitigations, etc. 
- * 4) Tell lockdep that interrupts are enabled - */ static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); @@ -129,6 +117,11 @@ static __always_inline void __exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } -- 2.17.1
[PATCH v2 2/5] entry: rename exit_from_user_mode()
In order to make this function publicly available rename it so it can still be inlined. An additional exit_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle --- kernel/entry/common.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 683a8e1b5388..076ee1cde67f 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -111,7 +111,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) } /** - * exit_to_user_mode - Fixup state when exiting to user mode + * __exit_to_user_mode - Fixup state when exiting to user mode * * Syscall/interupt exit enables interrupts, but the kernel state is * interrupts disabled when this is invoked. Also tell RCU about it. @@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) *mitigations, etc. * 4) Tell lockdep that interrupts are enabled */ -static __always_inline void exit_to_user_mode(void) +static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); trace_hardirqs_on_prepare(); @@ -265,7 +265,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) @@ -278,7 +278,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) instrumentation_begin(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) -- 2.17.1
[PATCH v2 3/5] entry: add enter_from_user_mode() wrapper
Can be called from architecture dependent code, and simply calls __enter_from_user_mode(). This way __enter_from_user_mode can still be inlined because it is declared static. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 11 +++ kernel/entry/common.c| 15 +-- 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..28a8554fad7d 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -402,4 +402,15 @@ void irqentry_exit_cond_resched(void); */ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); +/** + * enter_from_user_mode - Establish state when coming from user mode + * + * Syscall/interrupt entry disables interrupts, but user mode is traced as + * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. + * + * 1) Tell lockdep that interrupts are disabled + * 2) Invoke context tracking if enabled to reactivate RCU + * 3) Trace interrupts off state + */ +void enter_from_user_mode(struct pt_regs *regs); #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 076ee1cde67f..ee588ee9f122 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -8,16 +8,6 @@ #define CREATE_TRACE_POINTS #include -/** - * __enter_from_user_mode - Establish state when coming from user mode - * - * Syscall/interrupt entry disables interrupts, but user mode is traced as - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
- * - * 1) Tell lockdep that interrupts are disabled - * 2) Invoke context tracking if enabled to reactivate RCU - * 3) Trace interrupts off state - */ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); @@ -31,6 +21,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) { -- 2.17.1
[PATCH v2 1/5] entry: rename enter_from_user_mode()
In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle --- kernel/entry/common.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..683a8e1b5388 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -9,7 +9,7 @@ #include /** - * enter_from_user_mode - Establish state when coming from user mode + * __enter_from_user_mode - Establish state when coming from user mode * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. @@ -18,7 +18,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -92,7 +92,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -104,7 +104,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); @@ -270,7 +270,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) -- 2.17.1
[PATCH v2 5/5] entry: add syscall_exit_to_user_mode_work()
This is the same as syscall_exit_to_user_mode() but without calling exit_to_user_mode(). This is useful if a syscall has to be restarted without leaving to user space. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 9 + kernel/entry/common.c| 14 -- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 112007525f50..bdf6b005bbfb 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -312,6 +312,15 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) */ void syscall_exit_to_user_mode(struct pt_regs *regs); +/** + * syscall_exit_to_user_mode_work - Handle work before returning to user mode + * @regs: Pointer to currents pt_regs + * + * Same as syscall_exit_to_user_mode() but without calling exit_to_user_mode() + * to perform the final transition to user mode. + */ +void syscall_exit_to_user_mode_work(struct pt_regs *regs); + /** * irqentry_enter_from_user_mode - Establish state before invoking the irq handler * @regs: Pointer to currents pt_regs diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e696f6912642..c3c4ba21824a 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -246,12 +246,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) syscall_exit_work(regs, cached_flags); } -__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs) { - instrumentation_begin(); syscall_exit_to_user_mode_prepare(regs); local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); +} + +void syscall_exit_to_user_mode_work(struct pt_regs *regs) +{ + __syscall_exit_to_user_mode_work(regs); +} + +__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +{ + instrumentation_begin(); + __syscall_exit_to_user_mode_work(regs); instrumentation_end(); __exit_to_user_mode(); } -- 2.17.1
[PATCH v2] split up lockdep and syscall related functionality in generic entry code
This is v2 of the patch. I've split it up into multiple ones and added documentation. The first patch was basically a hack to demonstrate what I need, sorry for not sending a more cleaned-up version. As additional explanation here's the content of the v1 cover letter: I'm currently working on converting s390 to use the generic entry functionality. So far things are straight-forward, there's only one slight problem. There is a syscall_enter_from_user_mode() which sets lockdep state and other initial stuff + does the entry work at the same time. This is a problem on s390 because the way we restart syscalls isn't as easy as on x86. My understanding on x86 is that syscalls are restarted there by just rewinding the program counter and returning to user space, so the instruction causing the syscall gets executed again. On s390 this doesn't work, because the syscall number might be hard-coded into the 'svc' instruction, so when the syscall number has to be changed we would repeat the wrong (old) syscall. So we would need functions that only do the stuff that is required when switching from user space to kernel and back, and functions which do the system call tracing and work which might be called repeatedly. 
With the attached patch, the s390 code now looks like this (I removed some s390-specific stuff here to make the function easier to read). __do_syscall is the function which gets called by low level entry.S code: void noinstr __do_syscall(struct pt_regs *regs) { enter_from_user_mode(regs); /* sets lockdep state, and other initial stuff */ /* * functions that need to run with irqs disabled, * but lockdep state and other stuff set up */ memcpy(&regs->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long)); memcpy(&regs->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code)); regs->psw = S390_lowcore.svc_old_psw; update_timer_sys(); local_irq_enable(); regs->orig_gpr2 = regs->gprs[2]; do { regs->flags = _PIF_SYSCALL; do_syscall(regs); } while (test_pt_regs_flag(regs, PIF_SYSCALL_RESTART)); exit_to_user_mode(); } __do_syscall calls do_syscall which does all the syscall work, and this might be called more than once if PIF_SYSCALL_RESTART is set: void do_syscall(struct pt_regs *regs) { unsigned long nr = regs->int_code & 0xffff; nr = syscall_enter_from_user_mode_work(regs, nr); regs->gprs[2] = -ENOSYS; if (likely(nr < NR_syscalls)) { regs->gprs[2] = current->thread.sys_call_table[nr]( regs->orig_gpr2, regs->gprs[3], regs->gprs[4], regs->gprs[5], regs->gprs[6], regs->gprs[7]); } syscall_exit_to_user_mode_work(regs); }
Re: [PATCH 1/2] sched/idle: Fix arch_cpu_idle() vs tracing
Hi Peter, Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:00:03PM -0800, Guenter Roeck wrote: >> On Fri, Nov 20, 2020 at 12:41:46PM +0100, Peter Zijlstra wrote: >> > We call arch_cpu_idle() with RCU disabled, but then use >> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. >> > >> > Switch all arch_cpu_idle() implementations to use >> > raw_local_irq_{en,dis}able() and carefully manage the >> > lockdep,rcu,tracing state like we do in entry. >> > >> > (XXX: we really should change arch_cpu_idle() to not return with >> > interrupts enabled) >> > >> >> Has this patch been tested on s390 ? Reason for asking is that it causes >> all my s390 emulations to crash. Reverting it fixes the problem. > > My understanding is that it changes the error on s390. Previously it > would complain about the local_irq_enable() in arch_cpu_idle(), now it > complains when taking an interrupt during idle. I looked into adding the required functionality for s390, but the code we would need to add to entry.S is rather large - as you noted we would have to duplicate large portions of irqentry_enter() into our code. Given that s390 was fine before that patch, can you revert it and submit it again during the next merge window? Thanks Sven
Re: [PATCH 1/2] sched/idle: Fix arch_cpu_idle() vs tracing
Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:00:03PM -0800, Guenter Roeck wrote: >> On Fri, Nov 20, 2020 at 12:41:46PM +0100, Peter Zijlstra wrote: >> > We call arch_cpu_idle() with RCU disabled, but then use >> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. >> > >> > Switch all arch_cpu_idle() implementations to use >> > raw_local_irq_{en,dis}able() and carefully manage the >> > lockdep,rcu,tracing state like we do in entry. >> > >> > (XXX: we really should change arch_cpu_idle() to not return with >> > interrupts enabled) >> > >> >> Has this patch been tested on s390 ? Reason for asking is that it causes >> all my s390 emulations to crash. Reverting it fixes the problem. > > My understanding is that it changes the error on s390. Previously it > would complain about the local_irq_enable() in arch_cpu_idle(), now it > complains when taking an interrupt during idle. The errors on s390 were all fixed in the meantime. I cannot say which patch fixed it, but we haven't seen any warning in our internal CI during the last weeks. So reverting the patch would likely fix the issue for us. s390 is likely to switch to generic entry with the next merge window (I'm working on it), so all that stuff will be easier then.
[PATCH] entry: split lockdep and syscall work functions
On s390, we cannot call one function which both sets lockdep state and does the syscall work at the same time. Therefore make enter_from_user_mode() and exit_to_user_mode() public, and add syscall_exit_to_user_mode1(), which does the same as syscall_exit_to_user_mode() but skips the final exit_to_user_mode(). Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 4 +++- kernel/entry/common.c | 35 +++ 2 files changed, 30 insertions(+), 9 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..496c9a47eab4 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -124,7 +124,7 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs * to be done between establishing state and handling user mode entry work. */ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); - +void enter_from_user_mode(struct pt_regs *regs); /** * syscall_enter_from_user_mode_work - Check and handle work before invoking *a syscall @@ -311,6 +311,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * arch_exit_to_user_mode() to handle e.g. 
speculation mitigations */ void syscall_exit_to_user_mode(struct pt_regs *regs); +void syscall_exit_to_user_mode1(struct pt_regs *regs); +void exit_to_user_mode(void); /** * irqentry_enter_from_user_mode - Establish state before invoking the irq handler diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..3ad462ebfa15 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -18,7 +18,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -31,6 +31,11 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) { @@ -92,7 +97,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -104,14 +109,14 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); } /** - * exit_to_user_mode - Fixup state when exiting to user mode + * __exit_to_user_mode - Fixup state when exiting to user mode * * Syscall/interupt exit enables interrupts, but the kernel state is * interrupts disabled when this is invoked. Also tell RCU about it. @@ -122,7 +127,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) *mitigations, etc. 
* 4) Tell lockdep that interrupts are enabled */ -static __always_inline void exit_to_user_mode(void) +static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); trace_hardirqs_on_prepare(); @@ -134,6 +139,11 @@ static __always_inline void exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } @@ -265,12 +275,21 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); +} + +__visible noinstr void syscall_exit_to_user_mode1(struct pt_regs *regs) +{ + instrumentation_begin(); + syscall_exit_to_user_mode_prepare(regs); + local_irq_disable_exit_to_user(); + exit_to_user_mode_prepare(regs); + instrumentation_end(); } noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) @@ -278,7 +297,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) instrumentation_begin(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr irqentry_state_t irqentry_enter(struct
split up lockdep and syscall related functionality in generic entry code
I'm currently working on converting s390 to use the generic entry functionality. So far things are straightforward, there's only one slight problem. There is a syscall_enter_from_user_mode() which sets lockdep state and other initial stuff + does the entry work at the same time. This is a problem on s390 because the way we restart syscalls isn't as easy as on x86. My understanding on x86 is that syscalls are restarted there by just rewinding the program counter and returning to user space, so the instruction causing the syscall gets executed again. On s390 this doesn't work, because the syscall number might be hard coded into the 'svc' instruction, so when the syscall number has to be changed we would repeat the wrong (old) syscall. So we would need functions that only do the stuff that is required when switching from user space to kernel and back, and functions which do the system call tracing and work which might be called repeatedly. With the attached patch, the s390 code now looks like this: (I removed some s390 specific stuff here to make the function easier to read) __do_syscall is the function which gets called by low level entry.S code: void noinstr __do_syscall(struct pt_regs *regs) { enter_from_user_mode(regs); /* sets lockdep state, and other initial stuff */ /* * functions that need to run with irqs disabled, * but lockdep state and other stuff set up */ memcpy(&regs->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long)); memcpy(&regs->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code)); regs->psw = S390_lowcore.svc_old_psw; update_timer_sys(); local_irq_enable(); regs->orig_gpr2 = regs->gprs[2]; do { regs->flags = _PIF_SYSCALL; do_syscall(regs); } while (test_pt_regs_flag(regs, PIF_SYSCALL_RESTART)); exit_to_user_mode(); } __do_syscall calls do_syscall which does all the syscall work, and this might be called more than once if PIF_SYSCALL_RESTART is set: void do_syscall(struct pt_regs *regs) { unsigned long nr = regs->int_code & 0xffff; nr = 
syscall_enter_from_user_mode_work(regs, nr); regs->gprs[2] = -ENOSYS; if (likely(nr < NR_syscalls)) { regs->gprs[2] = current->thread.sys_call_table[nr]( regs->orig_gpr2, regs->gprs[3], regs->gprs[4], regs->gprs[5], regs->gprs[6], regs->gprs[7]); } syscall_exit_to_user_mode1(regs); } What do you think about the attached patch? I'm also open for a proper name for syscall_exit_to_user_mode1() ;-)
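To make the restart semantics concrete, here is a minimal user-space sketch of the loop in __do_syscall() above. Everything in it (struct pt_regs_sketch, do_syscall_sketch, the flag values) is made up for illustration and is not kernel code; the point is only that the restart happens inside the kernel loop, instead of re-executing the 'svc' instruction in user space:

```c
#include <assert.h>

#define PIF_SYSCALL         0x1UL
#define PIF_SYSCALL_RESTART 0x2UL

/* hypothetical stand-in for struct pt_regs */
struct pt_regs_sketch {
	unsigned long flags;
	int calls; /* counts how often the syscall work ran */
};

/* stand-in for do_syscall(): pretend the first invocation requests a
 * restart, e.g. because signal handling changed the syscall number */
static void do_syscall_sketch(struct pt_regs_sketch *regs)
{
	regs->calls++;
	if (regs->calls == 1)
		regs->flags |= PIF_SYSCALL_RESTART;
}

/* mirrors the loop in __do_syscall(): re-run the syscall work while
 * PIF_SYSCALL_RESTART is set, without rewinding the program counter */
static int run_syscall_loop(struct pt_regs_sketch *regs)
{
	do {
		regs->flags = PIF_SYSCALL; /* clears PIF_SYSCALL_RESTART */
		do_syscall_sketch(regs);
	} while (regs->flags & PIF_SYSCALL_RESTART);
	return regs->calls;
}
```

With one requested restart, the syscall work runs exactly twice before exit_to_user_mode() would be reached.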
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi Peter, Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:52:11PM +0100, Peter Zijlstra wrote: >> On Mon, Nov 30, 2020 at 01:31:33PM +0100, Sven Schnelle wrote: >> > [0.670280] [ cut here ] >> > [0.670288] WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:1054 >> > rcu_irq_enter+0x7e/0xa8 >> > [0.670293] Modules linked in: >> > [0.670299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW >> > 5.10.0-rc6 #2263 >> > [0.670304] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) >> > [0.670309] Krnl PSW : 0404d0018000 00d8a8da >> > (rcu_irq_enter+0x82/0xa8) >> > [0.670318]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 >> > PM:0 RI:0 EA:3 >> > [0.670325] Krnl GPRS: 8002 >> > 0001 0101fcee >> > [0.670331] >> > >> > [0.670337]03e00029ff48 >> > 017212d8 0001 >> > [0.670343]05ba0100 000324bb >> > 03e00029fe40 03e00029fe10 >> > >> > [0.670358] Krnl Code: 00d8a8ca: ec180013017ecij >> > %r1,1,8,00d8a8f0 >> > [0.670358]00d8a8d0: ecb80005007ecij >> > %r11,0,8,00d8a8da >> > [0.670358] #00d8a8d6: af00mc >> > 0,0 >> > [0.670358] >00d8a8da: ebbff0a4lmg >> > %r11,%r15,160(%r15) >> > [0.670358]00d8a8e0: c0f4ff68brcl >> > 15,00d8a7b0 >> > [0.670358]00d8a8e6: c0e538c1brasl >> > %r14,00d91a68 >> > [0.670358]00d8a8ec: a7f4ffdcbrc >> > 15,00d8a8a4 >> > [0.670358]00d8a8f0: c0e538bcbrasl >> > %r14,00d91a68 >> > [0.670392] Call Trace: >> > [0.670396] [<00d8a8da>] rcu_irq_enter+0x82/0xa8 >> > [0.670401] [<00157f9a>] irq_enter+0x22/0x30 >> > [0.670404] [<0010e51c>] do_IRQ+0x64/0xd0 >> > [0.670408] [<00d9a65a>] ext_int_handler+0x18e/0x194 >> > [0.670412] [<00d9a6a0>] psw_idle+0x40/0x48 >> > [0.670416] ([<00104202>] enabled_wait+0x22/0xf0) >> > [0.670419] [<001046e2>] arch_cpu_idle+0x22/0x38 >> > [0.670423] [<00d986cc>] default_idle_call+0x74/0xd8 >> > [0.670427] [<0019a94a>] do_idle+0xf2/0x1b0 >> > [0.670431] [<0019ac7e>] cpu_startup_entry+0x36/0x40 >> > [0.670435] [<00118b9a>] smp_start_secondary+0x82/0x88 >> >> But but but... 
>> >> do_idle() # IRQs on >> local_irq_disable(); # IRQs off >> defaul_idle_call() # IRQs off > lockdep_hardirqs_on(); # IRQs off, but lockdep things they're on >> arch_cpu_idle()# IRQs off >> enabled_wait() # IRQs off >>raw_local_save() # still off >>psw_idle()# very much off >> ext_int_handler # get an interrupt ?!?! > rcu_irq_enter() # lockdep thinks IRQs are on <- FAIL > > > I can't much read s390 assembler, but ext_int_handler() has a > TRACE_IRQS_OFF, which would be sufficient to re-align the lockdep state > with the actual state, but there's some condition before it, what's that > test and is that right? That test was introduced to only track changes in IRQ state, because of recursion problems in lockdep. This now seems to no longer work. We probably could remove it, as lockdep can now handle recursion much better with all the recent changes, but given that we're already at -rc6, I don't want to touch entry.S code because of a debugging feature.
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi, Sven Schnelle writes: > Hi Peter, > > Peter Zijlstra writes: > >> On Sun, Nov 29, 2020 at 11:31:41AM -0800, Linus Torvalds wrote: >>> On Sun, Nov 29, 2020 at 5:38 AM Thomas Gleixner wrote: >>> > >>> > Yet two more places which invoke tracing from RCU disabled regions in the >>> > idle path. Similar to the entry path the low level idle functions have to >>> > be non-instrumentable. >>> >>> This really seems less than optimal. >>> >>> In particular, lookie here: >>> >>> > @@ -94,9 +94,35 @@ void __cpuidle default_idle_call(void) >>> > >>> > trace_cpu_idle(1, smp_processor_id()); >>> > stop_critical_timings(); >>> > + >>> > + /* >>> > +* arch_cpu_idle() is supposed to enable IRQs, however >>> > +* we can't do that because of RCU and tracing. >>> > +* >>> > +* Trace IRQs enable here, then switch off RCU, and have >>> > +* arch_cpu_idle() use raw_local_irq_enable(). Note that >>> > +* rcu_idle_enter() relies on lockdep IRQ state, so >>> > switch that >>> > +* last -- this is very similar to the entry code. >>> > +*/ >>> > + trace_hardirqs_on_prepare(); >>> > + lockdep_hardirqs_on_prepare(_THIS_IP_); >>> > rcu_idle_enter(); >>> > + lockdep_hardirqs_on(_THIS_IP_); >>> > + >>> > arch_cpu_idle(); >>> > + >>> > + /* >>> > +* OK, so IRQs are enabled here, but RCU needs them >>> > disabled to >>> > +* turn itself back on.. funny thing is that disabling >>> > IRQs >>> > +* will cause tracing, which needs RCU. Jump through >>> > hoops to >>> > +* make it 'work'. >>> > +*/ >>> > + raw_local_irq_disable(); >>> > + lockdep_hardirqs_off(_THIS_IP_); >>> > rcu_idle_exit(); >>> > + lockdep_hardirqs_on(_THIS_IP_); >>> > + raw_local_irq_enable(); >>> > + >>> > start_critical_timings(); >>> > trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); >>> > } >>> >>> And look at what the code generation for the idle exit path is when >>> lockdep isn't even on. >> >> Agreed. >> >> The idea was to flip all of arch_cpu_idle() to not enable interrupts. 
>> >> This is suboptimal for things like x86 where arch_cpu_idle() is >> basically STI;HLT, but x86 isn't likely to actually use this code path >> anyway, given all the various cpuidle drivers it has. >> >> Many of the other archs are now doing things like arm's: >> wfi();raw_local_irq_enable(). >> >> Doing that tree-wide interrupt-state flip was something I didn't want to >> do at this late a stage, the chanse of messing that up is just too high. >> >> After that I need to go look at flipping cpuidle, which is even more >> 'interesting'. cpuidle_enter() has the exact same semantics, and this is >> the code path that x86 actually uses, and here it's inconsitent at best. > > On s390 this introduces the following splat: > [..] I sent you the wrong backtrace. This is the correct one: [0.667491] smp: Bringing up secondary CPUs ... [0.670262] random: get_random_bytes called from __warn+0x12a/0x160 with crng_init=1 [0.670280] [ cut here ] [0.670288] WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:1054 rcu_irq_enter+0x7e/0xa8 [0.670293] Modules linked in: [0.670299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW 5.10.0-rc6 #2263 [0.670304] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [0.670309] Krnl PSW : 0404d0018000 00d8a8da (rcu_irq_enter+0x82/0xa8) [0.670318]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [0.670325] Krnl GPRS: 8002 0001 0101fcee [0.670331] [0.670337]03e00029ff48 017212d8 0001 [0.670343]05ba0100 000324bb 03e00029fe40
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi Peter, Peter Zijlstra writes: > On Sun, Nov 29, 2020 at 11:31:41AM -0800, Linus Torvalds wrote: >> On Sun, Nov 29, 2020 at 5:38 AM Thomas Gleixner wrote: >> > >> > Yet two more places which invoke tracing from RCU disabled regions in the >> > idle path. Similar to the entry path the low level idle functions have to >> > be non-instrumentable. >> >> This really seems less than optimal. >> >> In particular, lookie here: >> >> > @@ -94,9 +94,35 @@ void __cpuidle default_idle_call(void) >> > >> > trace_cpu_idle(1, smp_processor_id()); >> > stop_critical_timings(); >> > + >> > + /* >> > +* arch_cpu_idle() is supposed to enable IRQs, however >> > +* we can't do that because of RCU and tracing. >> > +* >> > +* Trace IRQs enable here, then switch off RCU, and have >> > +* arch_cpu_idle() use raw_local_irq_enable(). Note that >> > +* rcu_idle_enter() relies on lockdep IRQ state, so switch >> > that >> > +* last -- this is very similar to the entry code. >> > +*/ >> > + trace_hardirqs_on_prepare(); >> > + lockdep_hardirqs_on_prepare(_THIS_IP_); >> > rcu_idle_enter(); >> > + lockdep_hardirqs_on(_THIS_IP_); >> > + >> > arch_cpu_idle(); >> > + >> > + /* >> > +* OK, so IRQs are enabled here, but RCU needs them >> > disabled to >> > +* turn itself back on.. funny thing is that disabling IRQs >> > +* will cause tracing, which needs RCU. Jump through hoops >> > to >> > +* make it 'work'. >> > +*/ >> > + raw_local_irq_disable(); >> > + lockdep_hardirqs_off(_THIS_IP_); >> > rcu_idle_exit(); >> > + lockdep_hardirqs_on(_THIS_IP_); >> > + raw_local_irq_enable(); >> > + >> > start_critical_timings(); >> > trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); >> > } >> >> And look at what the code generation for the idle exit path is when >> lockdep isn't even on. > > Agreed. > > The idea was to flip all of arch_cpu_idle() to not enable interrupts. 
> > This is suboptimal for things like x86 where arch_cpu_idle() is > basically STI;HLT, but x86 isn't likely to actually use this code path > anyway, given all the various cpuidle drivers it has. > > Many of the other archs are now doing things like arm's: > wfi();raw_local_irq_enable(). > > Doing that tree-wide interrupt-state flip was something I didn't want to > do at this late a stage, the chanse of messing that up is just too high. > > After that I need to go look at flipping cpuidle, which is even more > 'interesting'. cpuidle_enter() has the exact same semantics, and this is > the code path that x86 actually uses, and here it's inconsitent at best. On s390 this introduces the following splat: [2.962283] Testing tracer wakeup_rt: [3.017102] sched: DL replenish lagged too much [3.061777] PASSED [3.062076] Testing tracer wakeup_dl: PASSED [3.161296] [ cut here ] [3.161301] DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key != current->curr_chain_key) [3.161310] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4155 lockdep_hardirqs_on+0x1ea/0x1f8 [3.161316] Modules linked in: [3.161323] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-rc6-07209-g9e30ba6d2d5c #2249 [3.161327] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [3.161331] Krnl PSW : 0404d0018000 00d730fe (lockdep_hardirqs_on+0x1ee/0x1f8) [3.161340]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [3.161347] Krnl GPRS: c000bfff 8001 004a 001e33b8 [3.161352]03807b88 03807b80 [3.161357] 0151a610 01361500 00d81d40 [3.161362]00010400 00d88010 00d730fa 03807db8 [3.161461] Krnl Code: 00d730ee: c0200012d9adlarl %r2,00fce448 [3.161461]00d730f4: c0e5451abrasl %r14,00d5bb28 [3.161461] #00d730fa: af00mc 0,0 [3.161461] >00d730fe: a7f4ff70brc 15,00d72fde [3.161461]00d73102: 0707bcr 0,%r7 [3.161461]00d73104: 0707bcr 0,%r7 [3.161461]00d73106: 0707bcr 0,%r7 [3.161461]00d73108: c418002884eclgrl %r1,01283ae0 [3.161518] Call Trace: [3.161526] [<00d730fe>]
Re: [PATCH 0/2] More RCU vs idle fixes
Peter Zijlstra writes: > Both arm64 and s390 are tripping over arch_cpu_idle() RCU,tracing,lockdep > interaction. While looking at that I also found fail in inte_idle. > > Please consider for this cycle. Is anyone taking this patchset? For s390, we also need to change the local_irq_save/restore to the raw variants in enabled_wait() in arch/s390/kernel/idle.c. I can make a patch and carry that via the s390 tree, but I want to make sure the s390 change in this patchset also reaches Linus' tree. Thanks Sven
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Hi Jens, Jens Axboe writes: > On 11/3/20 4:00 AM, Sven Schnelle wrote: >> Hi Jens, >> >> Heiko Carstens writes: >> >>> On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: >>>> Wire up TIF_NOTIFY_SIGNAL handling for s390. >>>> >>>> Cc: linux-s...@vger.kernel.org >>>> Signed-off-by: Jens Axboe >>>> --- >>>> >>>> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting >>>> for details: >>>> >>>> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/ >>>> >>>> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs, >>>> as that will enable a set of cleanups once all of them support it. I'm >>>> happy carrying this patch if need be, or it can be funelled through the >>>> arch tree. Let me know. >>>> >>>> arch/s390/include/asm/thread_info.h | 2 ++ >>>> arch/s390/kernel/entry.S| 7 ++- >>>> 2 files changed, 8 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/arch/s390/include/asm/thread_info.h >>>> b/arch/s390/include/asm/thread_info.h >>>> index 13a04fcf7762..0045341ade48 100644 >>>> --- a/arch/s390/include/asm/thread_info.h >>>> +++ b/arch/s390/include/asm/thread_info.h >>>> @@ -65,6 +65,7 @@ void arch_setup_new_exec(void); >>>> #define TIF_GUARDED_STORAGE 4 /* load guarded storage control >>>> block */ >>>> #define TIF_PATCH_PENDING 5 /* pending live patching update */ >>>> #define TIF_PGSTE 6 /* New mm's will use 4K page tables */ >>>> +#define TIF_NOTIFY_SIGNAL 7 /* signal notifications exist */ >>>> #define TIF_ISOLATE_BP8 /* Run process with isolated BP >>>> */ >>>> #define TIF_ISOLATE_BP_GUEST 9 /* Run KVM guests with isolated >>>> BP */ >>>> >>>> @@ -82,6 +83,7 @@ void arch_setup_new_exec(void); >>>> #define TIF_SYSCALL_TRACEPOINT27 /* syscall tracepoint >>>> instrumentation */ >>>> >>>> #define _TIF_NOTIFY_RESUMEBIT(TIF_NOTIFY_RESUME) >>>> +#define _TIF_NOTIFY_SIGNALBIT(TIF_NOTIFY_SIGNAL) >>>> #define _TIF_SIGPENDING BIT(TIF_SIGPENDING) >>>> #define _TIF_NEED_RESCHED BIT(TIF_NEED_RESCHED) >>>> #define 
_TIF_UPROBE BIT(TIF_UPROBE) >>>> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S >>>> index 86235919c2d1..a30d891e8045 100644 >>>> --- a/arch/s390/kernel/entry.S >>>> +++ b/arch/s390/kernel/entry.S >>>> @@ -52,7 +52,8 @@ STACK_SIZE = 1 << STACK_SHIFT >>>> STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE >>>> >>>> _TIF_WORK = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \ >>>> - _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING) >>>> + _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING | \ >>>> + _TIF_NOTIFY_SIGNAL) >>>> _TIF_TRACE= (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | >>>> _TIF_SECCOMP | \ >>>> _TIF_SYSCALL_TRACEPOINT) >>>> _CIF_WORK = (_CIF_ASCE_PRIMARY | _CIF_ASCE_SECONDARY | _CIF_FPU) >>>> @@ -463,6 +464,8 @@ ENTRY(system_call) >>>> #endif >>>>TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART >>>>jo .Lsysc_syscall_restart >>>> + TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >>>> + jo .Lsysc_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >>>>jo .Lsysc_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >>>> @@ -857,6 +860,8 @@ ENTRY(io_int_handler) >>>> #endif >>>>TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >>>>jo .Lio_sigpending >>>> + TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >>>> + jo .Lio_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >>>>jo .Lio_notify_resume >>>>TSTMSK __TI_flags(%r12),_TIF_GUARDED_STORAGE >>> >>> (full quote so you can make sense of the patch below). >>> >>> Please merge the patch below into this one. With that: >>> >>> Acked-by: Heiko Carstens >>> >>> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/e
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Hi Jens, Heiko Carstens writes: > On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: >> Wire up TIF_NOTIFY_SIGNAL handling for s390. >> >> Cc: linux-s...@vger.kernel.org >> Signed-off-by: Jens Axboe >> --- >> >> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting >> for details: >> >> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/ >> >> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs, >> as that will enable a set of cleanups once all of them support it. I'm >> happy carrying this patch if need be, or it can be funelled through the >> arch tree. Let me know. >> >> arch/s390/include/asm/thread_info.h | 2 ++ >> arch/s390/kernel/entry.S| 7 ++- >> 2 files changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/arch/s390/include/asm/thread_info.h >> b/arch/s390/include/asm/thread_info.h >> index 13a04fcf7762..0045341ade48 100644 >> --- a/arch/s390/include/asm/thread_info.h >> +++ b/arch/s390/include/asm/thread_info.h >> @@ -65,6 +65,7 @@ void arch_setup_new_exec(void); >> #define TIF_GUARDED_STORAGE 4 /* load guarded storage control block */ >> #define TIF_PATCH_PENDING 5 /* pending live patching update */ >> #define TIF_PGSTE 6 /* New mm's will use 4K page tables */ >> +#define TIF_NOTIFY_SIGNAL 7 /* signal notifications exist */ >> #define TIF_ISOLATE_BP 8 /* Run process with isolated BP >> */ >> #define TIF_ISOLATE_BP_GUEST9 /* Run KVM guests with isolated >> BP */ >> >> @@ -82,6 +83,7 @@ void arch_setup_new_exec(void); >> #define TIF_SYSCALL_TRACEPOINT 27 /* syscall tracepoint >> instrumentation */ >> >> #define _TIF_NOTIFY_RESUME BIT(TIF_NOTIFY_RESUME) >> +#define _TIF_NOTIFY_SIGNAL BIT(TIF_NOTIFY_SIGNAL) >> #define _TIF_SIGPENDING BIT(TIF_SIGPENDING) >> #define _TIF_NEED_RESCHED BIT(TIF_NEED_RESCHED) >> #define _TIF_UPROBE BIT(TIF_UPROBE) >> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S >> index 86235919c2d1..a30d891e8045 100644 >> --- a/arch/s390/kernel/entry.S >> +++ 
b/arch/s390/kernel/entry.S >> @@ -52,7 +52,8 @@ STACK_SIZE = 1 << STACK_SHIFT >> STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE >> >> _TIF_WORK = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \ >> - _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING) >> + _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING | \ >> + _TIF_NOTIFY_SIGNAL) >> _TIF_TRACE = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \ >> _TIF_SYSCALL_TRACEPOINT) >> _CIF_WORK = (_CIF_ASCE_PRIMARY | _CIF_ASCE_SECONDARY | _CIF_FPU) >> @@ -463,6 +464,8 @@ ENTRY(system_call) >> #endif >> TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART >> jo .Lsysc_syscall_restart >> +TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >> +jo .Lsysc_sigpending >> TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >> jo .Lsysc_sigpending >> TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >> @@ -857,6 +860,8 @@ ENTRY(io_int_handler) >> #endif >> TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >> jo .Lio_sigpending >> +TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >> +jo .Lio_sigpending >> TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >> jo .Lio_notify_resume >> TSTMSK __TI_flags(%r12),_TIF_GUARDED_STORAGE > > (full quote so you can make sense of the patch below). > > Please merge the patch below into this one. 
With that: > > Acked-by: Heiko Carstens > > diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S > index a30d891e8045..31f16d903ef3 100644 > --- a/arch/s390/kernel/entry.S > +++ b/arch/s390/kernel/entry.S > @@ -464,9 +464,7 @@ ENTRY(system_call) > #endif > TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART > jo .Lsysc_syscall_restart > - TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL > - jo .Lsysc_sigpending > - TSTMSK __TI_flags(%r12),_TIF_SIGPENDING > + TSTMSK __TI_flags(%r12),(_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL) > jo .Lsysc_sigpending We also need to change the jo to jnz: in combination with tm, jo means 'jump if all tested bits are set', while jnz means 'jump if at least one bit is set'. > TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME > jo .Lsysc_notify_resume > @@ -858,9 +856,7 @@ ENTRY(io_int_handler) > TSTMSK __TI_flags(%r12),_TIF_PATCH_PENDING > jo .Lio_patch_pending > #endif > - TSTMSK __TI_flags(%r12),_TIF_SIGPENDING > - jo .Lio_sigpending > - TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL > + TSTMSK __TI_flags(%r12),(_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL) Same here. > jo .Lio_sigpending > TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME > jo .Lio_notify_resume PS: I didn't get the previous emails, so I replied to a raw download.
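The jo/jnz distinction can be expressed in C (a sketch with made-up bit values, not the real _TIF_* definitions): after a tm (test under mask), jo branches only on condition code 3, i.e. when all tested bits are one, while jnz branches on any non-zero result:

```c
#include <assert.h>

/* illustrative bit values only, not the real s390 TIF numbering */
#define TIF_SIGPENDING_BIT    (1UL << 1)
#define TIF_NOTIFY_SIGNAL_BIT (1UL << 7)

/* 'tm ...,mask; jo' taken: ALL tested bits are set (condition code 3) */
static int tm_jo(unsigned long flags, unsigned long mask)
{
	return (flags & mask) == mask;
}

/* 'tm ...,mask; jnz' taken: at least ONE tested bit set (cc 1, 2 or 3) */
static int tm_jnz(unsigned long flags, unsigned long mask)
{
	return (flags & mask) != 0;
}
```

With only TIF_SIGPENDING set, a jo after the combined mask test would wrongly skip .Lsysc_sigpending, while jnz takes the branch as intended.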
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Jens Axboe writes: > On 11/2/20 9:59 AM, Qian Cai wrote: >> On Sun, 2020-11-01 at 17:31 +, Heiko Carstens wrote: >>> On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: Wire up TIF_NOTIFY_SIGNAL handling for s390. Cc: linux-s...@vger.kernel.org Signed-off-by: Jens Axboe >> >> Even though I did confirm that today's linux-next contains this additional >> patch >> from Heiko below, a z10 guest is still unable to boot. Reverting the whole >> series (reverting only "s390: add support for TIF_NOTIFY_SIGNAL" introduced >> compiling errors) fixed the problem, i.e., git revert --no-edit >> af0dd809f3d3..7b074c15374c [1] > > That's odd, it should build fine without that patch. How did it fail for you? > > Can you try and add this on top? Looks like I forgot the signal change for > s390, though that shouldn't really cause any issues. > > diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c > index 9e900a8977bd..a68c3796a1bf 100644 > --- a/arch/s390/kernel/signal.c > +++ b/arch/s390/kernel/signal.c > @@ -472,7 +472,7 @@ void do_signal(struct pt_regs *regs) > current->thread.system_call = > test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0; > > - if (get_signal()) { > + if (test_thread_flag(TIF_NOTIFY_SIGNAL) && get_signal()) { Shouldn't that be TIF_SIGPENDING? > /* Whee! Actually deliver the signal. */ > if (current->thread.system_call) { > regs->int_code = current->thread.system_call;
Re: autofs crash with latest linux-next
Kees Cook writes: > On Wed, Oct 14, 2020 at 09:15:47AM +0200, Krzysztof Kozlowski wrote: >> I hit this since few days as well. Although the bisect points to the >> merge, the issue looks like a result of mentioned commit 4d03e3cc5982 >> ("fs: don't allow kernel reads and writes without iter ops"). >> >> The __kernel_read() last argument 'pos' can be NULL and it is >> dereferenced here (added by the commit): >> >> 525 ssize_t __kernel_write(struct file *file, const void *buf, size_t >> count, loff_t *pos) >> ... >> 547 kiocb.ki_pos = *pos; >> 548 iov_iter_kvec(, WRITE, , 1, iov.iov_len); >> >> >> The __kernel_read() is called with NULL in fs/autofs/waitq.c: >> >> 45 static int autofs_write(struct autofs_sb_info *sbi, >> 46 struct file *file, const void *addr, int bytes) >> >> ... >> 54 mutex_lock(>pipe_mutex); >> 55 while (bytes) { >> 56 wr = __kernel_write(file, data, bytes, NULL); > > I think the thread here is the same thing, but you've found it in > autofs... > https://lore.kernel.org/lkml/CAHk-=wgj=mken-efv5tkwjnehplg0dybq+r5zyguc4weunq...@mail.gmail.com/ Indeed. Thanks, missed that. Sven
autofs crash with latest linux-next
Hi, on s390 i see the following crash with linux-next: [ 4525.432605] Unable to handle kernel pointer dereference in virtual kernel address space [ 4525.432612] Failing address: TEID: 0483 [ 4525.432613] Fault in home space mode while using kernel ASCE. [ 4525.432616] AS:cf048007 R3:0001fffec007 S:00011800 P:003d [ 4525.432640] Oops: 0004 ilc:3 [#1] SMP [ 4525.432644] Modules linked in: dm_crypt encrypted_keys lcs ctcm fsm nfsv3 nfs_acl nfs lockd grace quota_v2 quota_tree tun overlay ntfs exfat vfat fat sctp vfio_pci irqbypass vfio_virqfd scsi_debug vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb vfio_ap kvm loop nft_counter bridge stp llc dm_service_time nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio zcrypt_cex4 eadm_sch sch_fq_codel ip_tables x_tables ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 [last unloaded: dummy_del_mod] [ 4525.432691] CPU: 9 PID: 1050921 Comm: find Tainted: G OE 5.9.0-20201011.rc8.git0.d67bc7812221.300.fc32.s390x+next #1 [ 4525.432693] Hardware name: IBM 3906 M04 704 (LPAR) [ 4525.432694] Krnl PSW : 0704d0018000 cde29f70 (__kernel_write+0x1a0/0x2a0) [ 4525.432702]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [ 4525.432704] Krnl GPRS: 000100067343 0130 0001 [ 4525.432705]0006 5f82be2f 0130 8c6ab568 [ 4525.432728]84441f00 0130 84441f00 [ 4525.432729]81476000 0001 cde29ef4 03e002f5b6f0 [ 4525.432735] Krnl Code: cde29f62: a728lhi %r2,0 cde29f66: a7f4ff9dbrc 15,cde29ea0 #cde29f6a: e310f0f4lg %r1,240(%r15) >cde29f70: e3109024stg %r1,0(%r9) cde29f76: 9104b044tm 68(%r11),4 cde29f7a: a784000fbrc 8,cde29f98 cde29f7e: e3100344lg %r1,832 cde29f84: b904002algr %r2,%r10 [ 4525.432748] Call Trace: [ 4525.432750] [] __kernel_write+0x1a0/0x2a0 [ 
4525.432752] ([ ] __kernel_write+0x124/0x2a0) [ 4525.432756] [<03ff80004cfa>] autofs_write+0x5a/0x140 [autofs4] [ 4525.432758] [<03ff80005262>] autofs_notify_daemon.constprop.0+0x10a/0x1c8 [autofs4] [ 4525.432760] [<03ff80005872>] autofs_wait+0x552/0x718 [autofs4] [ 4525.432762] [<03ff800033ca>] autofs_mount_wait+0x5a/0xb0 [autofs4] [ 4525.432764] [<03ff800048b2>] autofs_d_automount+0x102/0x278 [autofs4] [ 4525.432766] [ ] __traverse_mounts+0x9e/0x270 [ 4525.432768] [ ] step_into+0x1de/0x280 [ 4525.432770] [ ] open_last_lookups+0xb8/0x3f8 [ 4525.432772] [ ] path_openat+0x86/0x1d0 [ 4525.432773] [ ] do_filp_open+0x78/0x118 [ 4525.432776] [ ] do_sys_openat2+0xa8/0x168 [ 4525.432778] [ ] __s390x_sys_openat+0x6a/0x98 [ 4525.432781] [ ] system_call+0xdc/0x2a4 [ 4525.432782] Last Breaking-Event-Address: [ 4525.432783] [ ] __kernel_write+0x12c/0x2a0 This seems to be caused by the result of merging 0fb702791bf ("autofs: use __kernel_write() for the autofs pipe writing") and 4d03e3cc5982 ("fs: don't allow kernel reads and writes without iter ops"). __kernel_write() gets now called with a NULL pointer as pos argument, but __kernel_write expects a valid pointer as it fetches/stores the pos value there. Is there a fix pending somewhere? Thanks Sven
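A sketch of the pattern that avoids this class of crash (illustrative only, not the actual kernel fix): treat a NULL pos as "the caller does not track a position" and guard both the read and the write-back of *pos:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* stand-in for the kernel's loff_t in this user-space sketch */
typedef long long loff_t_sketch;

/* hypothetical __kernel_write()-style helper that tolerates pos == NULL,
 * unlike the crashing code path which did kiocb.ki_pos = *pos blindly */
static long kernel_write_sketch(char *dst, const char *buf, size_t count,
				loff_t_sketch *pos)
{
	loff_t_sketch kpos = pos ? *pos : 0;	/* guard the dereference */

	memcpy(dst + kpos, buf, count);		/* "perform the write" at kpos */
	if (pos)
		*pos = kpos + (loff_t_sketch)count; /* write back only if tracked */
	return (long)count;
}
```

Callers like autofs_write() that pass NULL then simply write at offset 0 without any position write-back, instead of dereferencing a NULL pointer.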
Re: [PATCH] s390/idle: Fix suspicious RCU usage
Hi Peter,

Peter Zijlstra writes:
> On Wed, Oct 07, 2020 at 12:05:51PM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 07, 2020 at 09:53:25AM +0200, Sven Schnelle wrote:
>> > Hi Peter,
>> >
>> > pet...@infradead.org writes:
>> >
>> > > After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the
>> > > lock tracepoints are visible to lockdep and RCU-lockdep is finding a
>> > > bunch more RCU violations that were previously hidden.
>> > >
>> > > Switch the idle->seqcount over to using raw_write_*() to avoid the
>> > > lockdep annotation and thus the lock tracepoints.
>> > >
>> > > Reported-by: Guenter Roeck
>> > > Signed-off-by: Peter Zijlstra (Intel)
>> > > [..]
>> >
>> > I'm still seeing the splat below on s390 when irq tracing is enabled:
>>
>> Damn... :/
>>
>> This one is tricky, trouble seems to be that arch_cpu_idle() is defined
>> to enable interrupts (no doubt because ot x86 :/), but we call it before
>> rcu_exit_idle().
>>
>> What a mess... let me rummage around the various archs to see what makes
>> most sense here.
>
> Maybe something like so, I've not yet tested it. I need to figure out
> how to force x86 into this path.

I've given this patch a quick test on linux-next from today and haven't seen the splat again. However, it wasn't happening all the time, so I will test it a bit longer. I haven't looked into the tracing code in detail, but I guess it was only happening when the lock was contended.

The only thing with this patch is that RCU complains that it gets called with interrupts enabled on s390 when rcu_irq_enter() is called. But a few trace_hardirqs_{on,off} calls at the beginning and end of the IRQ handlers fix this. I will check why this worked in the past.

Sven
Re: [PATCH] s390/idle: Fix suspicious RCU usage
Hi Peter, pet...@infradead.org writes: > After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the > lock tracepoints are visible to lockdep and RCU-lockdep is finding a > bunch more RCU violations that were previously hidden. > > Switch the idle->seqcount over to using raw_write_*() to avoid the > lockdep annotation and thus the lock tracepoints. > > Reported-by: Guenter Roeck > Signed-off-by: Peter Zijlstra (Intel) > [..] I'm still seeing the splat below on s390 when irq tracing is enabled: [ 1273.747948] = [ 1273.747953] WARNING: suspicious RCU usage [ 1273.747960] 5.9.0-20201006.rc8.git0.162c12a918a1.300.fc32.s390x+debug #1 Not tainted [ 1273.747965] - [ 1273.747971] include/trace/events/lock.h:74 suspicious rcu_dereference_check() usage! [ 1273.747976] other info that might help us debug this: [ 1273.747982] rcu_scheduler_active = 2, debug_locks = 1 [ 1273.747987] RCU used illegally from extended quiescent state! [ 1273.747993] 1 lock held by swapper/8/0: [ 1273.747998] #0: 00010f7281b8 (max_trace_lock){..-.}-{2:2}, at: check_critical_timing+0x7c/0x1c8 [ 1273.748019] stack backtrace: [ 1273.748034] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.9.0-20201006.rc8.git0.162c12a918a1.300.fc32.s390x+debug #1 [ 1273.748040] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [ 1273.748045] Call Trace: [ 1273.748053] [<00010e656de0>] show_stack+0x90/0xf8 [ 1273.748060] [<00010eea94da>] dump_stack+0xa2/0xd8 [ 1273.748068] [<00010e711cc8>] trace_lock_acquired+0x178/0x180 [ 1273.748075] [<00010e718ed4>] lock_contended+0x24/0xd8 [ 1273.748083] [<00010f26d148>] _raw_spin_lock_irqsave+0xb0/0xd8 [ 1273.748089] [<00010e7ec404>] check_critical_timing+0x7c/0x1c8 [ 1273.748096] [<00010e7ecaa8>] tracer_hardirqs_on+0x128/0x148 [ 1273.748103] [<00010e7eae0c>] trace_hardirqs_on+0x6c/0x1b0 [ 1273.748110] [<00010e644ba8>] arch_cpu_idle+0x28/0x38 [ 1273.748116] [<00010f26ccd6>] default_idle_call+0x56/0x98 [ 1273.748124] [<00010e6da81a>] do_idle+0xf2/0x1b0 [ 1273.748130] 
[<00010e6dab4e>] cpu_startup_entry+0x36/0x40 [ 1273.748137] [<00010e6590fa>] smp_start_secondary+0x82/0x88 [ 1273.748142] 1 lock held by swapper/8/0: [ 1273.748147] #0: 00010f7281b8 (max_trace_lock){..-.}-{2:2}, at: check_critical_timing+0x7c/0x1c8 I think this happens because trace_lock_acquired gets called from idle context? Regards Sven
[tip: timers/urgent] lib/vdso: Allow to add architecture-specific vdso data
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: d60d7de3e16d7cea998bad17d87366a359625894
Gitweb: https://git.kernel.org/tip/d60d7de3e16d7cea998bad17d87366a359625894
Author: Sven Schnelle
AuthorDate: Tue, 04 Aug 2020 17:01:22 +02:00
Committer: Thomas Gleixner
CommitterDate: Thu, 06 Aug 2020 10:57:30 +02:00

lib/vdso: Allow to add architecture-specific vdso data

The initial assumption that all VDSO related data can be completely generic does not hold. S390 needs architecture specific storage to access the clock steering information.

Add struct arch_vdso_data to the vdso data struct. For architectures which do not need extra data this defaults to an empty struct. Architectures which require it, enable CONFIG_ARCH_HAS_VDSO_DATA and provide their specific struct in asm/vdso/data.h.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20200804150124.41692-2-sv...@linux.ibm.com
---
 arch/Kconfig            |  3 +++
 include/vdso/datapage.h | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a112448..b44dd6b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -975,6 +975,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index ee810ca..73eb622 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,12 @@
 #include
 #include
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#else
+struct arch_vdso_data {};
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -64,6 +70,8 @@ struct vdso_timestamp {
 * @tz_dsttime: type of DST correction
 * @hrtimer_res: hrtimer resolution
 * @__unused: unused
+ * @arch_data: architecture specific data (optional, defaults
+ * to an empty struct)
 *
 * vdso_data will be accessed by 64 bit and compat code at the same time
 * so we should be careful before modifying this structure.
@@ -97,6 +105,8 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+
+	struct arch_vdso_data arch_data;
 };

/*
Re: [PATCH v3 3/3] s390: convert to GENERIC_VDSO
Hi Thomas, Thomas Gleixner writes: > Sven Schnelle writes: >> --- /dev/null >> +++ b/arch/s390/include/asm/vdso/data.h >> @@ -0,0 +1,13 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef __S390_ASM_VDSO_DATA_H >> +#define __S390_ASM_VDSO_DATA_H >> + >> +#include >> +#include > > I don't think this header needs vdso/datapage.h Agreed. >> +struct arch_vdso_data { >> +__u64 tod_steering_delta; >> +__u64 tod_steering_end; >> +}; >> + >> +#endif /* __S390_ASM_VDSO_DATA_H */ > >> --- /dev/null >> +++ b/arch/s390/include/asm/vdso/gettimeofday.h > >> +static inline u64 __arch_get_hw_counter(s32 clock_mode) >> +{ >> +const struct vdso_data *vdso = __arch_get_vdso_data(); > > If you go and implement time namespace support for VDSO then this > becomes a problem. See do_hres_timens(). > > As we did not expect that this function needs vdso_data we should just > add a vdso_data pointer argument to __arch_get_hw_counter(). And while > looking at it you're not the first one. MIPS already uses it and because > it does not support time namespaces so far nobody noticed yet. > > That's fine for all others because the compiler will optimize it > out when it's unused. If the compiler fails to do so, then this is the > least of our worries :) > > As there is no new VDSO conversion pending in next, I can just queue > that (see below) along with #1 and #2 of this series and send it to > Linus by end of the week. Fine with me. >> +u64 adj, now; >> + >> +now = get_tod_clock(); >> +adj = vdso->arch.tod_steering_end - now; >> +if (unlikely((s64) adj > 0)) >> +now += (vdso->arch.tod_steering_delta < 0) ? 
(adj >> 15) : -(adj >> 15);
>> +	return now;
>> +}
>
>> --- /dev/null
>> +++ b/arch/s390/include/asm/vdso/processor.h
>> @@ -0,0 +1,5 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef __ASM_VDSO_PROCESSOR_H
>> +#define __ASM_VDSO_PROCESSOR_H
>> +
>> +#endif /* __ASM_VDSO_PROCESSOR_H */
>
> The idea of this file was to provide cpu_relax() self contained without
> pulling in tons of other things from asm/processor.h.

Thanks, I'll add that.

>> diff --git a/arch/s390/include/asm/vdso/vdso.h
>> b/arch/s390/include/asm/vdso/vdso.h
>> new file mode 100644
>> index ..e69de29bb2d1
>> diff --git a/arch/s390/include/asm/vdso/vsyscall.h
>> b/arch/s390/include/asm/vdso/vsyscall.h
>> new file mode 100644
>> index ..6c67c08cefdd
>> --- /dev/null
>> +++ b/arch/s390/include/asm/vdso/vsyscall.h
>> @@ -0,0 +1,26 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __ASM_VDSO_VSYSCALL_H
>> +#define __ASM_VDSO_VSYSCALL_H
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +#include
>> +#include
>
> I know you copied that from x86 or some other architecture, but thinking
> about the above these two includes are not required for building VDSO
> itself. Only the kernel update side needs them and they are included
> already there. I'm going to remove them from x86 as well.

Thanks, I removed them from my patch.

>> @@ -443,9 +396,8 @@ static void clock_sync_global(unsigned long long delta)
>> 	panic("TOD clock sync offset %lli is too large to drift\n",
>> 	      tod_steering_delta);
>> 	tod_steering_end = now + (abs(tod_steering_delta) << 15);
>> -	vdso_data->ts_dir = (tod_steering_delta < 0) ? 0 : 1;
>> -	vdso_data->ts_end = tod_steering_end;
>> -	vdso_data->tb_update_count++;
>> +	vdso_data->arch.tod_steering_end = tod_steering_end;
>
> Leftover from the previous version. Should be arch_data.tod

Oops, indeed.

> Heiko, do you consider this 5.9 material or am I right to assume that
> this targets 5.10?

I talked to Heiko yesterday and we agreed that it's too late for 5.9, so we target 5.10.
Thanks, Sven
[PATCH v3 3/3] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special things on s390:

- vDSO can be called without a stack frame - glibc did this in the past. So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied time steering to it. We need to do the same in the new code. This is done in the architecture specific __arch_get_hw_counter function. The steering information is stored in an architecture specific area in the vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  13 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  71 +
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  66 ++--
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 219 insertions(+), 420 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h create mode 100644 arch/s390/include/asm/vdso/processor.h create mode 100644 arch/s390/include/asm/vdso/vdso.h create mode 100644 arch/s390/include/asm/vdso/vsyscall.h delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c7d7ede6300c..eb50f748e151 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -73,6 +73,7 @@ config S390 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL + select ARCH_HAS_VDSO_DATA select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK select ARCH_INLINE_READ_LOCK_BH @@ -118,6 +119,7 @@ config S390 select GENERIC_CPU_AUTOPROBE select GENERIC_CPU_VULNERABILITIES select GENERIC_FIND_FIRST_BIT + select GENERIC_GETTIMEOFDAY select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL select HAVE_ALIGNED_STRUCT_PAGE if SLUB @@ -149,6 +151,7 @@ config S390 select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS + select HAVE_GENERIC_VDSO select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h new file mode 100644 index ..03434369fce4 --- /dev/null +++ b/arch/s390/include/asm/clocksource.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* s390-specific clocksource additions */ + +#ifndef _ASM_S390_CLOCKSOURCE_H +#define _ASM_S390_CLOCKSOURCE_H + +#endif /* _ASM_S390_CLOCKSOURCE_H */ diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h index 0cd085cdeb4f..82f86b3c394b 100644 --- a/arch/s390/include/asm/vdso.h +++ b/arch/s390/include/asm/vdso.h @@ -2,6 +2,8 @@ #ifndef 
__S390_VDSO_H__ #define __S390_VDSO_H__ +#include + /* Default link addresses for the vDSOs */ #define VDSO32_LBASE 0 #define VDSO64_LBASE 0 @@ -18,30 +20,7 @@ * itself and may change without notice. */ -struct vdso_data { - __u64 tb_update_count; /* Timebase atomicity ctr 0x00 */ - __u64 xtime_tod_stamp; /* TOD clock for xtime 0x08 */ - __u64 xtime_clock_sec; /* Kernel time 0x10
[PATCH v3 1/3] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture specific vDSO data to struct vdso_data. This is useful if the arch specific user space VDSO code needs additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will include asm/vdso/data.h which should contain 'struct arch_vdso_data'. This structure will be embedded in the generic vDSO data.

Signed-off-by: Sven Schnelle
---
 arch/Kconfig            | 3 +++
 include/vdso/datapage.h | 9 +
 2 files changed, 12 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..a11e89e2f547 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,12 @@
 #include
 #include
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#else
+struct arch_vdso_data {};
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -64,6 +70,7 @@ struct vdso_timestamp {
 * @tz_dsttime: type of DST correction
 * @hrtimer_res: hrtimer resolution
 * @__unused: unused
+ * @arch_data: architecture specific data
 *
 * vdso_data will be accessed by 64 bit and compat code at the same time
 * so we should be careful before modifying this structure.
@@ -97,6 +104,8 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+
+	struct arch_vdso_data arch_data;
 };

/*
--
2.17.1
[PATCH v3 2/3] timekeeping/vsyscall: Provide vdso_update_begin/end()
From: Thomas Gleixner Architectures can have the requirement to add additional architecture specific data to the VDSO data page which needs to be updated independent of the timekeeper updates. To protect these updates vs. concurrent readers and a conflicting update through timekeeping, provide helper functions to make such updates safe. vdso_update_begin() takes the timekeeper_lock to protect against a potential update from timekeeper code and increments the VDSO sequence count to signal data inconsistency to concurrent readers. vdso_update_end() makes the sequence count even again to signal data consistency and drops the timekeeper lock. Signed-off-by: Thomas Gleixner Signed-off-by: Sven Schnelle --- include/vdso/vsyscall.h| 3 +++ kernel/time/timekeeping.c | 2 +- kernel/time/timekeeping_internal.h | 11 +++--- kernel/time/vsyscall.c | 32 ++ 4 files changed, 44 insertions(+), 4 deletions(-) diff --git a/include/vdso/vsyscall.h b/include/vdso/vsyscall.h index 2c6134e0c23d..b0fdc9c6bf43 100644 --- a/include/vdso/vsyscall.h +++ b/include/vdso/vsyscall.h @@ -6,6 +6,9 @@ #include +unsigned long vdso_update_begin(void); +void vdso_update_end(unsigned long flags); + #endif /* !__ASSEMBLY__ */ #endif /* __VDSO_VSYSCALL_H */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index d20d489841c8..8459aa87986f 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -50,7 +50,7 @@ static struct { .seq = SEQCNT_ZERO(tk_core.seq), }; -static DEFINE_RAW_SPINLOCK(timekeeper_lock); +DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; /** diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h index bcbb52db2256..4ca2787d1642 100644 --- a/kernel/time/timekeeping_internal.h +++ b/kernel/time/timekeeping_internal.h @@ -1,12 +1,14 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _TIMEKEEPING_INTERNAL_H #define _TIMEKEEPING_INTERNAL_H -/* - * timekeeping debug functions - */ + #include +#include 
 #include
+/*
+ * timekeeping debug functions
+ */
 #ifdef CONFIG_DEBUG_FS
 extern void tk_debug_account_sleep_time(const struct timespec64 *t);
 #else
@@ -31,4 +33,7 @@ static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
 }
 #endif
+
+/* Semi public for serialization of non timekeeper VDSO updates. */
+extern raw_spinlock_t timekeeper_lock;
+
 #endif /* _TIMEKEEPING_INTERNAL_H */

diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index 54ce6eb2ca36..e8806eda6874 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -13,6 +13,8 @@
 #include
 #include
+#include "timekeeping_internal.h"
+
 static inline void update_vdso_data(struct vdso_data *vdata, struct timekeeper *tk)
 {
@@ -127,3 +129,33 @@ void update_vsyscall_tz(void)
 	__arch_sync_vdso_data(vdata);
 }
+
+/**
+ * vdso_update_begin - Start of a VDSO update section
+ *
+ * Allows architecture code to safely update the architecture specific VDSO
+ * data.
+ */
+unsigned long vdso_update_begin(void)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&timekeeper_lock, flags);
+	vdso_write_begin(vdata);
+	return flags;
+}
+
+/**
+ * vdso_update_end - End of a VDSO update section
+ *
+ * Pairs with vdso_update_begin().
+ */
+void vdso_update_end(unsigned long flags)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+
+	vdso_write_end(vdata);
+	__arch_sync_vdso_data(vdata);
+	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+}
--
2.17.1
[PATCH RFC v3] s390: convert to GENERIC_VDSO
Here's the third version of the generic vdso patchset for s390.

Changes in v3:
- update the architecture specific data patch to the correct one.

Changes in v2:
- added patch from Thomas that adds vdso_update_begin()/vdso_update_end()
- changed the s390 code to use the vdso update functions
- changed the name of the architecture specific data to 'arch_data'
Re: [PATCH 1/3] vdso: allow to add architecture-specific vdso data
Hi, Sven Schnelle writes: > Add the possibility to add architecture specific vDSO > data to struct vdso_data. This is useful if the arch specific > user space VDSO code needs additional data during execution. > If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will > include asm/vdso/data.h which should contain 'struct arch_vdso_data'. > This structure will be embedded in the generic vDSO data. > > Signed-off-by: Sven Schnelle > --- > arch/Kconfig| 3 +++ > include/vdso/datapage.h | 7 +++ > 2 files changed, 10 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 8cc35dc556c7..e1017ce979e2 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR > entries at 4000, 5000 and 6000 locations. This option turns on syscall > related optimizations for a given architecture. > > +config ARCH_HAS_VDSO_DATA > + bool > + > source "kernel/gcov/Kconfig" > > source "scripts/gcc-plugins/Kconfig" > diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h > index 7955c56d6b3c..74e730238ce6 100644 > --- a/include/vdso/datapage.h > +++ b/include/vdso/datapage.h > @@ -19,6 +19,10 @@ > #include > #include > > +#ifdef CONFIG_ARCH_HAS_VDSO_DATA > +#include > +#endif > + > #define VDSO_BASES (CLOCK_TAI + 1) > #define VDSO_HRES(BIT(CLOCK_REALTIME)| \ >BIT(CLOCK_MONOTONIC) | \ > @@ -97,6 +101,9 @@ struct vdso_data { > s32 tz_dsttime; > u32 hrtimer_res; > u32 __unused; > +#ifdef CONFIG_ARCH_HAS_VDSO_DATA > + struct arch_vdso_data arch; > +#endif > }; > > /* I've sent the wrong version of this patch. I'll send a revised version with the requested changes later.
[PATCH 3/3] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special things on s390:

- vDSO can be called without a stack frame - glibc did this in the past. So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied time steering to it. We need to do the same in the new code. This is done in the architecture specific __arch_get_hw_counter function. The steering information is stored in an architecture specific area in the vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  13 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  71 +
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  66 ++--
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 219 insertions(+), 420 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h create mode 100644 arch/s390/include/asm/vdso/processor.h create mode 100644 arch/s390/include/asm/vdso/vdso.h create mode 100644 arch/s390/include/asm/vdso/vsyscall.h delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c7d7ede6300c..eb50f748e151 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -73,6 +73,7 @@ config S390 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL + select ARCH_HAS_VDSO_DATA select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK select ARCH_INLINE_READ_LOCK_BH @@ -118,6 +119,7 @@ config S390 select GENERIC_CPU_AUTOPROBE select GENERIC_CPU_VULNERABILITIES select GENERIC_FIND_FIRST_BIT + select GENERIC_GETTIMEOFDAY select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL select HAVE_ALIGNED_STRUCT_PAGE if SLUB @@ -149,6 +151,7 @@ config S390 select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS + select HAVE_GENERIC_VDSO select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h new file mode 100644 index ..03434369fce4 --- /dev/null +++ b/arch/s390/include/asm/clocksource.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* s390-specific clocksource additions */ + +#ifndef _ASM_S390_CLOCKSOURCE_H +#define _ASM_S390_CLOCKSOURCE_H + +#endif /* _ASM_S390_CLOCKSOURCE_H */ diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h index 0cd085cdeb4f..82f86b3c394b 100644 --- a/arch/s390/include/asm/vdso.h +++ b/arch/s390/include/asm/vdso.h @@ -2,6 +2,8 @@ #ifndef 
__S390_VDSO_H__ #define __S390_VDSO_H__ +#include + /* Default link addresses for the vDSOs */ #define VDSO32_LBASE 0 #define VDSO64_LBASE 0 @@ -18,30 +20,7 @@ * itself and may change without notice. */ -struct vdso_data { - __u64 tb_update_count; /* Timebase atomicity ctr 0x00 */ - __u64 xtime_tod_stamp; /* TOD clock for xtime 0x08 */ - __u64 xtime_clock_sec; /* Kernel time 0x10
[PATCH RFC v2] s390: convert to GENERIC_VDSO
As discussed, here's the second version of the generic vdso patch for s390.

Changes in v2:
- added patch from Thomas that adds vdso_update_begin()/vdso_update_end()
- changed the s390 code to use the vdso update functions
- changed the name of the architecture specific data to 'arch_data'
[PATCH 1/3] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture specific vDSO data to struct vdso_data. This is useful if the arch specific user space VDSO code needs additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will include asm/vdso/data.h which should contain 'struct arch_vdso_data'. This structure will be embedded in the generic vDSO data.

Signed-off-by: Sven Schnelle
---
 arch/Kconfig            | 3 +++
 include/vdso/datapage.h | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..74e730238ce6 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,10 @@
 #include
 #include
+
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -97,6 +101,9 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+	struct arch_vdso_data arch;
+#endif
 };

/*
--
2.17.1
[PATCH 2/3] timekeeping/vsyscall: Provide vdso_update_begin/end()
From: Thomas Gleixner Architectures can have the requirement to add additional architecture specific data to the VDSO data page which needs to be updated independent of the timekeeper updates. To protect these updates vs. concurrent readers and a conflicting update through timekeeping, provide helper functions to make such updates safe. vdso_update_begin() takes the timekeeper_lock to protect against a potential update from timekeeper code and increments the VDSO sequence count to signal data inconsistency to concurrent readers. vdso_update_end() makes the sequence count even again to signal data consistency and drops the timekeeper lock. Signed-off-by: Thomas Gleixner Signed-off-by: Sven Schnelle --- include/vdso/vsyscall.h| 3 +++ kernel/time/timekeeping.c | 2 +- kernel/time/timekeeping_internal.h | 11 +++--- kernel/time/vsyscall.c | 32 ++ 4 files changed, 44 insertions(+), 4 deletions(-) diff --git a/include/vdso/vsyscall.h b/include/vdso/vsyscall.h index 2c6134e0c23d..b0fdc9c6bf43 100644 --- a/include/vdso/vsyscall.h +++ b/include/vdso/vsyscall.h @@ -6,6 +6,9 @@ #include +unsigned long vdso_update_begin(void); +void vdso_update_end(unsigned long flags); + #endif /* !__ASSEMBLY__ */ #endif /* __VDSO_VSYSCALL_H */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index d20d489841c8..8459aa87986f 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -50,7 +50,7 @@ static struct { .seq = SEQCNT_ZERO(tk_core.seq), }; -static DEFINE_RAW_SPINLOCK(timekeeper_lock); +DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; /** diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h index bcbb52db2256..4ca2787d1642 100644 --- a/kernel/time/timekeeping_internal.h +++ b/kernel/time/timekeeping_internal.h @@ -1,12 +1,14 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _TIMEKEEPING_INTERNAL_H #define _TIMEKEEPING_INTERNAL_H -/* - * timekeeping debug functions - */ + #include +#include 
 #include
+/*
+ * timekeeping debug functions
+ */
 #ifdef CONFIG_DEBUG_FS
 extern void tk_debug_account_sleep_time(const struct timespec64 *t);
 #else
@@ -31,4 +33,7 @@ static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
 }
 #endif
+
+/* Semi public for serialization of non timekeeper VDSO updates. */
+extern raw_spinlock_t timekeeper_lock;
+
 #endif /* _TIMEKEEPING_INTERNAL_H */

diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index 54ce6eb2ca36..e8806eda6874 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -13,6 +13,8 @@
 #include
 #include
+#include "timekeeping_internal.h"
+
 static inline void update_vdso_data(struct vdso_data *vdata, struct timekeeper *tk)
 {
@@ -127,3 +129,33 @@ void update_vsyscall_tz(void)
 	__arch_sync_vdso_data(vdata);
 }
+
+/**
+ * vdso_update_begin - Start of a VDSO update section
+ *
+ * Allows architecture code to safely update the architecture specific VDSO
+ * data.
+ */
+unsigned long vdso_update_begin(void)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&timekeeper_lock, flags);
+	vdso_write_begin(vdata);
+	return flags;
+}
+
+/**
+ * vdso_update_end - End of a VDSO update section
+ *
+ * Pairs with vdso_update_begin().
+ */
+void vdso_update_end(unsigned long flags)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+
+	vdso_write_end(vdata);
+	__arch_sync_vdso_data(vdata);
+	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+}
--
2.17.1
Re: [PATCH 2/2] s390: convert to GENERIC_VDSO
Hi,

Thomas Gleixner writes:
> Heiko Carstens writes:
>
>> On Mon, Aug 03, 2020 at 06:05:24PM +0200, Thomas Gleixner wrote:
>>> +/**
>>> + * vdso_update_begin - Start of a VDSO update section
>>> + *
>>> + * Allows architecture code to safely update the architecture specific VDSO
>>> + * data.
>>> + */
>>> +void vdso_update_begin(void)
>>> +{
>>> +	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>> +
>>> +	raw_spin_lock(&timekeeper_lock);
>>> +	vdso_write_begin(vdata);
>>> +}
>>
>> I would assume that this only works if vdso_update_begin() is called
>> with irqs disabled, otherwise it could deadlock, no?
>
> Yes.
>
>> Maybe something like:
>>
>> void vdso_update_begin(unsigned long *flags)
>> {
>> 	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>
>> 	raw_spin_lock_irqsave(&timekeeper_lock, *flags);
>> 	vdso_write_begin(vdata);
>
> Shudder. Why not returning flags?
>
>> }
>>
>> void vdso_update_end(unsigned long *flags)
>
> Ditto, why pointer and not value?
>
>> {
>> 	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>
>> 	vdso_write_end(vdata);
>> 	__arch_sync_vdso_data(vdata);
>> 	raw_spin_unlock_irqrestore(&timekeeper_lock, *flags);
>> }
>>
>> ? Just wondering.
>
> Thought about that briefly, but then hated the flags thing and delegated
> it to the caller. Lockdep will yell if that lock is taken with
> interrupts enabled :)
>
> But aside of the pointer vs. value thing, I'm fine with doing it in the
> functions.

Thanks Thomas & Heiko. I'll incorporate the changes into my patchset and send an updated version. Thomas, I think it's fine if I update your patch and we take it through the s390 tree?

Regards
Sven
Re: [PATCH 2/2] s390: convert to GENERIC_VDSO
Thomas Gleixner writes:
> Sven Schnelle writes:
>
>> - CPUCLOCK_VIRT is now handled with a syscall fallback, which might
>>   be slower/less accurate than the old implementation.
>
> I can understand the slower, but why does it become less accurate?

Because we saved the system/user times as almost the last instruction when
leaving the kernel to userspace. Now it's a bit earlier, because it is done
in the C code. So it's not really related to the syscall fallback, but to
the switch from assembly to C.

>> Performance numbers from my system doing 100 million gettimeofday() calls:
>>
>> Plain syscall: 8.6s
>> Generic VDSO:  1.3s
>> old ASM VDSO:  1s
>>
>> So it's a bit slower but still much faster than syscalls.
>
> Where is the overhead coming from?

It's because we have to allocate a stack frame, which we didn't do before,
and the compiler-generated code is less optimized than the hand-crafted
assembly code we had before.

>> +static inline u64 __arch_get_hw_counter(s32 clock_mode)
>> +{
>> +	const struct vdso_data *vdso = __arch_get_vdso_data();
>> +	u64 adj, now;
>> +	int cnt;
>> +
>> +	do {
>> +		do {
>> +			cnt = READ_ONCE(vdso->arch.tb_update_cnt);
>> +		} while (cnt & 1);
>
> smp_rmb() ?
>
>> +		now = get_tod_clock();
>> +		adj = vdso->arch.tod_steering_end - now;
>> +		if (unlikely((s64) adj > 0))
>> +			now += (vdso->arch.tod_steering_delta < 0) ? (adj >> 15) : -(adj >> 15);
>
> smp_rmb() ?
>
>> +	} while (cnt != READ_ONCE(vdso->arch.tb_update_cnt));
>> +	return now;

>> 	if (ptff_query(PTFF_QTO) && ptff(&qto, sizeof(qto), PTFF_QTO) == 0)
>> 		lpar_offset = qto.tod_epoch_difference;
>> @@ -599,6 +550,13 @@ static int stp_sync_clock(void *data)
>> 	if (stp_info.todoff[0] || stp_info.todoff[1] ||
>> 	    stp_info.todoff[2] || stp_info.todoff[3] ||
>> 	    stp_info.tmd != 2) {
>> +		vdso_data->arch.tb_update_cnt++;
>> +		/*
>> +		 * This barrier isn't really needed as we're called
>> +		 * from stop_machine_cpuslocked(). However it doesn't
>> +		 * hurt in case the code gets changed.
>> +		 */
>> +		smp_wmb();
>
> WMB without a corresponding RMB and an explanation what's ordered
> against what is voodoo at best.
>
>> 		rc = chsc_sstpc(stp_page, STP_OP_SYNC, 0, &clock_delta);
>> 		if (rc == 0) {
>> @@ -609,6 +567,8 @@ static int stp_sync_clock(void *data)
>> 			if (rc == 0 && stp_info.tmd != 2)
>> 				rc = -EAGAIN;
>> 		}
>> +		smp_wmb(); /* see comment above */
>
> See my comments above :) :-)

What do you think about my question on using vdso_write_begin/end()?
__arch_get_hw_counter() is called inside a vdso_read_retry() loop, so I
would think that just enclosing this update with vdso_write_begin/end()
should be sufficient. But I'm not sure whether arch/ should call these
functions.

Thanks
Sven
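The open-coded sequence counter discussed above can be modeled in portable userspace C. The sketch below is an illustration only, not the kernel code: the struct layout, the `read_steered_clock()` helper, and the use of GCC `__atomic` acquire loads (standing in for the `READ_ONCE()`/`smp_rmb()` pairing Thomas is asking about) are all assumptions. The reader spins while the counter is odd, computes the steered value, and retries if the counter changed underneath it.

```c
#include <stdint.h>

/* Userspace model of the s390 vDSO steering data. Field names follow the
 * patch under discussion, but this struct is illustrative, not the real
 * kernel layout. */
struct tod_vdso_model {
	uint32_t tb_update_cnt;		/* odd while an update is in flight */
	uint64_t tod_steering_end;
	int64_t  tod_steering_delta;
};

/* Reader side: retry until a consistent snapshot was read. The acquire
 * loads stand in for the smp_rmb() calls discussed in the review. */
static uint64_t read_steered_clock(const struct tod_vdso_model *v, uint64_t raw_tod)
{
	uint32_t cnt;
	uint64_t now;
	int64_t adj;

	do {
		do {	/* writer active: wait for an even counter */
			cnt = __atomic_load_n(&v->tb_update_cnt, __ATOMIC_ACQUIRE);
		} while (cnt & 1);

		now = raw_tod;
		adj = (int64_t)(v->tod_steering_end - now);
		if (adj > 0)	/* steering in progress: slew by adj/2^15 */
			now += (v->tod_steering_delta < 0) ? (uint64_t)(adj >> 15)
							   : -(uint64_t)(adj >> 15);
		/* retry if a writer ran while we were reading */
	} while (cnt != __atomic_load_n(&v->tb_update_cnt, __ATOMIC_ACQUIRE));

	return now;
}
```

With no pending steering the raw value passes through unchanged; while steering is active the result is slewed by a 2^-15 fraction of the remaining interval.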
Re: [PATCH 1/2] vdso: allow to add architecture-specific vdso data
Thomas Gleixner writes:
> just a few nits.
>
> Can you please spare that #ifdef and do:
>
> #ifdef CONFIG_ARCH_HAS_VDSO_DATA
> #include <asm/vdso/data.h>
> #else
> struct arch_vdso_data {};
> #endif

Ok.

> Please keep the tabular alignment of the struct members and add kernel
> doc in the comment above the struct.

Ok.

> Aside of that 'arch' is not really an intuitive name. 'arch_data' or
> something like that. Hmm?

arch_data is fine, thank you for your comments.

Regards
Sven
Re: [PATCH] perf: add psw_idle and psw_idle_exit to list of idle symbols
Sven Schnelle writes: > Add the s390 idle functions so they don't show up in top when > using software sampling. > > Signed-off-by: Sven Schnelle > --- > tools/perf/util/symbol.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > index 5ddf84dcbae7..d33d24c61d24 100644 > --- a/tools/perf/util/symbol.c > +++ b/tools/perf/util/symbol.c > @@ -666,6 +666,8 @@ static bool symbol__is_idle(const char *name) > "poll_idle", > "ppc64_runlatch_off", > "pseries_dedicated_idle_sleep", > + "psw_idle", > + "psw_idle_exit", > NULL > }; > int i; gentle ping? Regards Sven
[PATCH RFC] s390: convert to GENERIC_VDSO
these two patches convert the s390 architecture to the generic VDSO.

The first patch adds an option to add architecture-specific information to
struct vdso_data. We need that information because the old s390 assembly
code had a steering capability, which steered the clock slowly. To emulate
that behaviour we need to add the steering offset to struct vdso_data.

This requirement results in the need for a seqlock-like lock, which is
implemented open-coded in __arch_get_hw_counter(). Open-coded because we
cannot include seqlock.h in userspace code (and using the normal seqlock
interface on the kernel side might result in people changing struct
seqlock, but not changing the vDSO userspace part), therefore both sides
are open-coded. I think in theory we could also call
vdso_write_begin()/vdso_write_end(). What do you think?

If there are no objections we would carry both patches through the s390
tree.

Thanks
Sven
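The writer side of the open-coded seqlock described above can be sketched in userspace C as follows. This is a hedged model, not the kernel implementation: the struct, the helper name, and the use of GCC `__atomic` builtins (the fences model `smp_wmb()`) are assumptions. The counter is bumped to an odd value before the data is touched and back to an even value afterwards, which is exactly what lets the reader's retry loop detect an update in flight.

```c
#include <stdint.h>

/* Illustrative writer side of an open-coded seqlock: readers treat an
 * odd counter as "update in progress" and retry if the counter changed. */
struct steering_model {
	uint32_t update_cnt;
	uint64_t steering_end;
	int64_t  steering_delta;
};

static void steering_update(struct steering_model *s, uint64_t end, int64_t delta)
{
	/* enter update section: counter becomes odd */
	__atomic_store_n(&s->update_cnt, s->update_cnt + 1, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* models smp_wmb() */

	s->steering_end = end;
	s->steering_delta = delta;

	__atomic_thread_fence(__ATOMIC_RELEASE);	/* models smp_wmb() */
	/* leave update section: counter becomes even again */
	__atomic_store_n(&s->update_cnt, s->update_cnt + 1, __ATOMIC_RELAXED);
}
```

Because the counter is incremented twice per update, any reader that observes the same even value before and after its reads is guaranteed to have seen a consistent pair of fields.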
[PATCH 2/2] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special
things on s390:

- vDSO can be called without a stack frame - glibc did this in the past.
  So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied
  time steering to it. We need to do the same in the new code. This is
  done in the architecture specific __arch_get_hw_counter function. The
  steering information is stored in an architecture specific area in the
  vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be
  slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no
generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
Reviewed-by: Heiko Carstens
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  14 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  78 ++
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  70 ++---
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 232 insertions(+), 419 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
 create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/s390/include/asm/vdso/processor.h
 create mode 100644 arch/s390/include/asm/vdso/vdso.h
 create mode 100644 arch/s390/include/asm/vdso/vsyscall.h
 delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S
 delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S
 delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S
 create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c
 create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c7d7ede6300c..eb50f748e151 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
	select ARCH_HAS_STRICT_MODULE_RWX
	select ARCH_HAS_SYSCALL_WRAPPER
	select ARCH_HAS_UBSAN_SANITIZE_ALL
+	select ARCH_HAS_VDSO_DATA
	select ARCH_HAVE_NMI_SAFE_CMPXCHG
	select ARCH_INLINE_READ_LOCK
	select ARCH_INLINE_READ_LOCK_BH
@@ -118,6 +119,7 @@ config S390
	select GENERIC_CPU_AUTOPROBE
	select GENERIC_CPU_VULNERABILITIES
	select GENERIC_FIND_FIRST_BIT
+	select GENERIC_GETTIMEOFDAY
	select GENERIC_SMP_IDLE_THREAD
	select GENERIC_TIME_VSYSCALL
	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
@@ -149,6 +151,7 @@ config S390
	select HAVE_FUNCTION_TRACER
	select HAVE_FUTEX_CMPXCHG if FUTEX
	select HAVE_GCC_PLUGINS
+	select HAVE_GENERIC_VDSO
	select HAVE_KERNEL_BZIP2
	select HAVE_KERNEL_GZIP
	select HAVE_KERNEL_LZ4
diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h
new file mode 100644
index ..03434369fce4
--- /dev/null
+++ b/arch/s390/include/asm/clocksource.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* s390-specific clocksource additions */
+
+#ifndef _ASM_S390_CLOCKSOURCE_H
+#define _ASM_S390_CLOCKSOURCE_H
+
+#endif /* _ASM_S390_CLOCKSOURCE_H */
diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h
index 0cd085cdeb4f..82f86b3c394b 100644
--- a/arch/s390/include/asm/vdso.h
+++ b/arch/s390/include/asm/vdso.h
@@ -2,6 +2,8 @@
 #ifndef __S390_VDSO_H__
 #define __S390_VDSO_H__

+#include
+
 /* Default link addresses for the vDSOs */
 #define VDSO32_LBASE	0
 #define VDSO64_LBASE	0
@@ -18,30 +20,7 @@
  * itself and may change without notice.
  */

-struct vdso_data {
-	__u64 tb_update_count;	/* Timebase atomicity ctr	0x00 */
-	__u64 xtime_tod_stamp;	/* TOD clock for xtime		0x08 */
-	__u64 xtime_clock_sec;	/* Kernel time
[PATCH 1/2] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture-specific vDSO data to struct
vdso_data. This is useful if the arch-specific user space vDSO code needs
additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined,
the generic code will include asm/vdso/data.h, which should contain
'struct arch_vdso_data'. This structure will be embedded in the generic
vDSO data.

Signed-off-by: Sven Schnelle
Reviewed-by: Heiko Carstens
---
 arch/Kconfig            |  3 +++
 include/vdso/datapage.h |  7 +++
 2 files changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
	  entries at 4000, 5000 and 6000 locations. This option turns on syscall
	  related optimizations for a given architecture.

+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"

 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..74e730238ce6 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,10 @@
 #include
 #include

+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES  (BIT(CLOCK_REALTIME)  | \
		    BIT(CLOCK_MONOTONIC) | \
@@ -97,6 +101,9 @@ struct vdso_data {
	s32	tz_dsttime;
	u32	hrtimer_res;
	u32	__unused;
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+	struct arch_vdso_data	arch;
+#endif
 };

 /*
--
2.17.1
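The embedding mechanism the patch describes can be sketched in plain C. This is a trimmed-down illustration: the arch struct's fields and the `arch_steering_end()` helper are invented for the example (on s390 the real fields come from `asm/vdso/data.h`), and `reserved` stands in for the generic struct's `__unused` member. The point is that arch code reaches its private data through the same pointer the generic vDSO code uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Pretend the architecture selected ARCH_HAS_VDSO_DATA. */
#define CONFIG_ARCH_HAS_VDSO_DATA 1

/* Would come from the architecture's asm/vdso/data.h; fields invented. */
struct arch_vdso_data {
	uint64_t tod_steering_end;
	int64_t  tod_steering_delta;
	uint32_t tb_update_cnt;
};

/* Trimmed-down model of the generic struct vdso_data tail. */
struct vdso_data {
	int32_t  tz_minuteswest;
	int32_t  tz_dsttime;
	uint32_t hrtimer_res;
	uint32_t reserved;
#ifdef CONFIG_ARCH_HAS_VDSO_DATA
	struct arch_vdso_data arch;	/* arch-specific data rides along */
#endif
};

/* Arch code reaches its private data through the generic pointer. */
static uint64_t arch_steering_end(const struct vdso_data *vd)
{
	return vd->arch.tod_steering_end;
}
```

Because the arch struct is embedded rather than pointed to, the whole thing still lives in the single vDSO data page mapped into user space, which is the property the patch relies on.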
[PATCH] perf: add psw_idle and psw_idle_exit to list of idle symbols
Add the s390 idle functions so they don't show up in top when using
software sampling.

Signed-off-by: Sven Schnelle
---
 tools/perf/util/symbol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 5ddf84dcbae7..d33d24c61d24 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -666,6 +666,8 @@ static bool symbol__is_idle(const char *name)
		"poll_idle",
		"ppc64_runlatch_off",
		"pseries_dedicated_idle_sleep",
+		"psw_idle",
+		"psw_idle_exit",
		NULL
	};
	int i;
--
2.17.1
[PATCH] kprobes: use strncpy_from_kernel_nofault() in fetch_store_string()
With the latest linux-next I noticed that some tests in the ftrace test
suite are failing on s390, namely:

[FAIL] Kprobe event symbol argument
[FAIL] Kprobe event with comm arguments

The following doesn't work anymore:

cd /sys/kernel/tracing
echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
echo 1 >events/kprobes/testprobe/enable
cat trace

it will just show

test.sh-519   [012]    18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)

Looking at the source I see that there are two helpers for reading strings:

fetch_store_string_user() -> read string from user space
fetch_store_string()      -> read string from kernel space(?)

but in the end both are using strncpy_from_user_nofault(), while
fetch_store_string() should use strncpy_from_kernel_nofault().

Signed-off-by: Sven Schnelle
Acked-by: Masami Hiramatsu
Acked-by: Christoph Hellwig
---
 kernel/trace/trace_kprobe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index b1f21d558e45..ea8d0b094f1b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1278,7 +1278,7 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
	 * Try to get string again, since the string can be changed while
	 * probing.
	 */
-	ret = strncpy_from_user_nofault(__dest, (void *)addr, maxlen);
+	ret = strncpy_from_kernel_nofault(__dest, (void *)addr, maxlen);

	if (ret >= 0)
		*(u32 *)dest = make_data_loc(ret, __dest - base);
--
2.17.1
Re: kprobes string reading broken on s390
Masami,

On Sat, Jun 06, 2020 at 01:58:06AM +0900, Masami Hiramatsu wrote:
> Hi Sven,
>
> On Fri, 5 Jun 2020 15:25:41 +0200
> Christoph Hellwig wrote:
>
> > Yes, this looks correct. You probably want to write a small changelog
> > and add a Fixes tag, though.
> >
> > On Fri, Jun 05, 2020 at 01:05:34PM +0200, Sven Schnelle wrote:
> > > Hi Christoph,
> > >
> > > with the latest linux-next i noticed that some tests in the
> > > ftrace test suites are failing on s390, namely:
> > >
> > > [FAIL] Kprobe event symbol argument
> > > [FAIL] Kprobe event with comm arguments
> > >
> > > The following doesn't work anymore:
> > >
> > > cd /sys/kernel/tracing
> > > echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
> > > echo 1 >/sys/kernel/tracing/events/kprobes/testprobe/enable
> > > cat /sys/kernel/tracing/trace
> > >
> > > it will just show
> > >
> > > test.sh-519 [012] 18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)
> > >
> > > Looking at d411a9c4e95a ("tracing/kprobes: handle mixed kernel/userspace probes
> > > better") i see that there are two helpers for reading strings:
> > >
> > > fetch_store_string_user() -> read string from user space
> > > fetch_store_string() -> read string from kernel space(?)
> > >
> > > but in the end both are using strncpy_from_user_nofault(), but i would
> > > think that fetch_store_string() should use strncpy_from_kernel_nofault().
> > > However, i'm not sure about the exact semantics of fetch_store_string(),
> > > as there where a lot of wrong assumptions in the past, especially since
> > > on x86 you usually don't fail if you use the same function for accessing kernel
> > > and userspace although it's technically wrong.
>
> Thanks for fixing!
> This report can be a good changelog.
> Please resend it with Fixed tag as Christoph said.

Which SHA1 should the Fixes tag carry? The one from linux-next?

Regards
Sven
kprobes string reading broken on s390
Hi Christoph,

with the latest linux-next I noticed that some tests in the ftrace test
suite are failing on s390, namely:

[FAIL] Kprobe event symbol argument
[FAIL] Kprobe event with comm arguments

The following doesn't work anymore:

cd /sys/kernel/tracing
echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
echo 1 >/sys/kernel/tracing/events/kprobes/testprobe/enable
cat /sys/kernel/tracing/trace

it will just show

test.sh-519   [012]    18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)

Looking at d411a9c4e95a ("tracing/kprobes: handle mixed kernel/userspace
probes better") I see that there are two helpers for reading strings:

fetch_store_string_user() -> read string from user space
fetch_store_string()      -> read string from kernel space(?)

but in the end both are using strncpy_from_user_nofault(), while I would
think that fetch_store_string() should use strncpy_from_kernel_nofault().
However, I'm not sure about the exact semantics of fetch_store_string(),
as there were a lot of wrong assumptions in the past, especially since on
x86 you usually don't fail if you use the same function for accessing
kernel and userspace, although it's technically wrong.

Regards,
Sven

commit 81408eab8fcc79dc0871a95462b13176d3446f5e
Author: Sven Schnelle
Date:   Fri Jun 5 13:01:24 2020 +0200

    kprobes: use strncpy_from_kernel_nofault() in fetch_store_string()

    Signed-off-by: Sven Schnelle

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index b1f21d558e45..ea8d0b094f1b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1278,7 +1278,7 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
	 * Try to get string again, since the string can be changed while
	 * probing.
	 */
-	ret = strncpy_from_user_nofault(__dest, (void *)addr, maxlen);
+	ret = strncpy_from_kernel_nofault(__dest, (void *)addr, maxlen);

	if (ret >= 0)
		*(u32 *)dest = make_data_loc(ret, __dest - base);
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve,

On Wed, May 13, 2020 at 03:30:33PM -0400, Steven Rostedt wrote:
> On Wed, 13 May 2020 18:15:57 +0200
> Sven Schnelle wrote:
>
> > Thanks for looking into this. I've attached my /proc/config.gz to this Mail.
> > The x86 system is my Laptop which is a Thinkpad X280 with 4 HT CPUs (so 8 cpus
> > in total). I've tried disabling preemption, but this didn't help.
> >
> > It's always this check that causes the loop:
> >
> > 	if (iter->head >= rb_page_size(iter->head_page)) {
> > 		rb_inc_iter(iter);
> > 		goto again;
> > 	}
> >
> > On the first loop iter->head is some value > 0 and rb_page_size returns
> > 0, afterwards it gets twice to this check with both values 0. The third
> > time the warning is triggered. Maybe that information helps.
>
> I figured out what was causing this, and that's just that the writer and
> the iterator could end up almost "in sync" where the writer writes to each
> of the pages the iterator is trying to read, and this can trigger the three
> failures of "zero commits" per page.
>
> I had a way to detect this, but then realized that it may be possible for
> an active writer to possibly trigger the other failures to get an event,
> that I just decided to force it to try three times, and simply return NULL
> if an active writer is messing with the iterator. The iterator is a "best
> effort" to read the buffer if there's an active writer. The consumer read
> (trace_pipe) is made for that.
>
> This patch should solve your issues.
>
> (care to give a Tested-by: if it works for you?)

Well, as there's no longer a RB_WARN_ON that indeed fixes the issue :-)

Tested-by: Sven Schnelle

Thanks!
Sven
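The "try three times, then give up" approach Steven describes can be modeled with a small userspace sketch. Everything here is illustrative: the callback shape, `ITER_RETRIES`, and the struct simulating writer interference are not the ring buffer API. The point is the design change: a bounded retry that returns failure instead of warning or looping forever, because the iterator is only a best-effort reader against a live writer.

```c
#include <stdint.h>

#define ITER_RETRIES 3

/* Simulates a page source where a concurrent writer may invalidate
 * the read a given number of times before a peek can succeed. */
struct peek_source {
	int failures_left;	/* pending "writer moved the page" events */
	uint64_t event;
};

static int try_peek(struct peek_source *s, uint64_t *out)
{
	if (s->failures_left > 0) {
		s->failures_left--;
		return -1;	/* writer touched the page under us */
	}
	*out = s->event;
	return 0;
}

/* Best-effort peek: bounded retries, then a clean failure instead of
 * an RB_WARN_ON()-style warning or an endless loop. */
static int iter_peek(struct peek_source *s, uint64_t *out)
{
	for (int i = 0; i < ITER_RETRIES; i++)
		if (try_peek(s, out) == 0)
			return 0;
	return -1;	/* busy writer: caller falls back to consuming read */
}
```

A caller that gets the failure result would fall back to the consuming read path (trace_pipe in the real code), which is designed to coexist with an active writer.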
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve,

On Wed, May 13, 2020 at 09:29:22AM -0400, Steven Rostedt wrote:
> On Wed, 13 May 2020 11:19:06 +0200
> Sven Schnelle wrote:
>
> > Did you have a chance to look into this? I can easily reproduce this
> > both on x86 and s390 by doing:
> >
> > cd /sys/kernel/tracing
> > cat /dev/zero >/dev/null &	# generate some load
> > echo function >current_tracer
> > # wait a few seconds to fill the buffer
> > cat trace
> >
> > Usually it will print the warn after a few seconds.
> >
> > I haven't dug through all the ring buffer code yet, so I thought I
> > might ask whether you have an idea what's going on.
>
> Can you send me the config for where you can reproduce it on x86?
>
> The iterator now doesn't stop the ring buffer when it iterates, and is
> doing so over a live buffer (but should be able to handle it). It's
> triggering something I thought wasn't suppose to happen (which must be
> happening).
>
> Perhaps with your config I'd be able to reproduce it.

Thanks for looking into this. I've attached my /proc/config.gz to this
Mail. The x86 system is my Laptop, which is a Thinkpad X280 with 4 HT CPUs
(so 8 CPUs in total). I've tried disabling preemption, but this didn't
help.

It's always this check that causes the loop:

	if (iter->head >= rb_page_size(iter->head_page)) {
		rb_inc_iter(iter);
		goto again;
	}

On the first loop iter->head is some value > 0 and rb_page_size returns 0;
afterwards it gets twice to this check with both values 0. The third time
the warning is triggered. Maybe that information helps.

Thanks,
Sven

[attachment: config.gz (application/gzip)]
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve, On Wed, Apr 29, 2020 at 05:05:09PM +0800, kernel test robot wrote: > > kern :warn : [ 886.763510] WARNING: CPU: 70 PID: 22584 at > kernel/trace/ring_buffer.c:4067 ring_buffer_iter_peek+0x13c/0x1d0 > kern :warn : [ 886.776216] Modules linked in: test_firmware intel_rapl_msr > intel_rapl_common skx_edac nfit libnvdimm btrfs x86_pkg_temp_thermal > intel_powerclamp blake2b_generic xor zstd_decompress zstd_compress coretemp > ast kvm_intel kvm raid6_pq drm_vram_helper drm_ttm_helper libcrc32c irqbypass > ttm crct10dif_pclmul crc32_pclmul drm_kms_helper snd_pcm crc32c_intel > ghash_clmulni_intel ahci snd_timer syscopyarea libahci aesni_intel ipmi_ssif > sysfillrect snd crypto_simd sysimgblt cryptd nvme fb_sys_fops mei_me > soundcore ipmi_si glue_helper pcspkr drm nvme_core ioatdma ipmi_devintf > t10_pi libata mei i2c_i801 joydev lpc_ich dca ipmi_msghandler wmi acpi_pad > acpi_power_meter ip_tables > kern :warn : [ 886.838112] CPU: 70 PID: 22584 Comm: cat Not tainted > 5.6.0-rc4-00017-g06e0a548bad0f #1 > kern :warn : [ 886.846573] RIP: 0010:ring_buffer_iter_peek+0x13c/0x1d0 > kern :warn : [ 886.852355] Code: 41 5e 41 5f c3 83 38 1d 0f 85 98 00 00 00 > 48 89 df e8 78 cb ff ff e9 32 ff ff ff 80 3c 24 00 75 c8 49 8b 44 24 10 f0 ff > 40 08 <0f> 0b 4c 89 f6 4c 89 ef e8 37 8c 8f 00 48 83 c4 08 31 c0 5b 5d 41 > kern :warn : [ 886.872239] RSP: 0018:c900213e7d78 EFLAGS: 00010002 > kern :warn : [ 886.878030] RAX: 889883ea7740 RBX: 88a3fbd93720 > RCX: 88a42101b700 > kern :warn : [ 886.885731] RDX: 88a415c02e00 RSI: 88a42101b440 > RDI: 88a3fbd93720 > kern :warn : [ 886.893432] RBP: R08: 00ce959ae6ab > R09: 88a42101bb80 > kern :warn : [ 886.901129] R10: 88a3e99e2300 R11: 0178 > R12: 88a415c02e00 > kern :warn : [ 886.908831] R13: 88a415c02e18 R14: 0296 > R15: c900213e7df8 > kern :warn : [ 886.916521] FS: 7f12d51d2640() > GS:88a44f78() knlGS: > kern :warn : [ 886.925179] CS: 0010 DS: ES: CR0: > 80050033 > kern :warn : [ 886.931495] CR2: 7f12d4e36000 CR3: 0023e901e006 > CR4: 
007606e0 > kern :warn : [ 886.939194] DR0: DR1: > DR2: > kern :warn : [ 886.946896] DR3: DR6: fffe0ff0 > DR7: 0400 > kern :warn : [ 886.954588] PKRU: 5554 > kern :warn : [ 886.957858] Call Trace: > kern :warn : [ 886.960866] peek_next_entry+0x22/0x70 > kern :warn : [ 886.965165] __find_next_entry+0xaf/0x180 > kern :warn : [ 886.969714] trace_find_next_entry_inc+0x1e/0x80 > kern :warn : [ 886.974863] s_next+0x49/0x70 > kern :warn : [ 886.978355] seq_read+0x23f/0x400 > kern :warn : [ 886.982193] vfs_read+0x9b/0x160 > kern :warn : [ 886.985937] ksys_read+0xa1/0xe0 > kern :warn : [ 886.989677] do_syscall_64+0x5b/0x1f0 > kern :warn : [ 886.993853] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > kern :warn : [ 886.999408] RIP: 0033:0x7f12d50f559e > kern :warn : [ 887.003479] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 36 fc 09 00 > e8 e9 e7 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 > 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 > kern :warn : [ 887.023260] RSP: 002b:7ffdab31acd8 EFLAGS: 0246 > ORIG_RAX: > kern :warn : [ 887.031350] RAX: ffda RBX: 0002 > RCX: 7f12d50f559e > kern :warn : [ 887.039006] RDX: 0002 RSI: 7f12d4e36000 > RDI: 0005 > kern :warn : [ 887.046657] RBP: 7f12d4e36000 R08: 7f12d4e35010 > R09: > kern :warn : [ 887.054305] R10: fc4d R11: 0246 > R12: 7f12d4e36000 > kern :warn : [ 887.061958] R13: 0005 R14: 0002 > R15: 0f93 > kern :warn : [ 887.069608] ---[ end trace 507492ef8332a5b4 ]--- > Did you had a chance to look into this? I can easily reproduce this both on x86 and s390 by doing: cd /sys/kernel/tracing cat /dev/zero >/dev/null & # generate some load echo function >current_tracer # wait a few seconds to fill the buffer cat trace Usually it will print the warn after a few seconds. I haven't digged through all the ring buffer code yet, so i thought i might ask whether you have an idea what's going on. Thanks Sven
Re: [PATCH -next] s390: Remove two unused inline functions
Hi Joe,

On Mon, May 11, 2020 at 01:38:57PM -0700, Joe Perches wrote:
> Awhile back, I posted a list of apparently unused static inline
> functions in .h files treewide found by a script:
>
> https://lore.kernel.org/lkml/4603e761a5f39f4d97375e1e08d20d720c526341.ca...@perches.com/
>
> Here are the s390 entries:
>
> arch/s390/include/asm/atomic_ops.h:138:static inline long __atomic64_cmpxchg_bool(long *ptr, long old, long new)
> arch/s390/include/asm/bitops.h:278:static inline void __set_bit_inv(unsigned long nr, volatile unsigned long *ptr)
> arch/s390/include/asm/bitops.h:283:static inline void __clear_bit_inv(unsigned long nr, volatile unsigned long *ptr)
> arch/s390/include/asm/cpu_mcf.h:106:static inline int kernel_cpumcf_begin(void)
> arch/s390/include/asm/cpu_mcf.h:114:static inline void kernel_cpumcf_end(void)
> arch/s390/include/asm/ftrace.h:64:static inline int is_ftrace_nop(struct ftrace_insn *insn)
> arch/s390/include/asm/kvm_para.h:146:static inline long kvm_hypercall5(unsigned long nr, unsigned long p1,
> arch/s390/include/asm/kvm_para.h:175:static inline long kvm_hypercall6(unsigned long nr, unsigned long p1,
> arch/s390/include/asm/pci_dma.h:134:static inline void invalidate_table_entry(unsigned long *entry)
> arch/s390/include/asm/pci_dma.h:176:static inline int entry_isprotected(unsigned long entry)
> arch/s390/include/asm/timex.h:52:static inline void store_clock_comparator(__u64 *time)

Thanks, I'll take a look and prepare a patch.

Regards
Sven
Re: [PATCH -next] s390: Remove two unused inline functions
Hi,

On Fri, May 08, 2020 at 10:07:24PM +0800, YueHaibing wrote:
> commit 657480d9c015 ("s390: support KPROBES_ON_FTRACE")
> left behind this, remove it.
>
> Signed-off-by: YueHaibing
> ---
>  arch/s390/kernel/ftrace.c | 16
>  1 file changed, 16 deletions(-)
>
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index 4cd9b1ada834..44e01dd1e624 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -72,22 +72,6 @@ static inline void ftrace_generate_orig_insn(struct ftrace_insn *insn)
>  #endif
>  }
>
> -static inline void ftrace_generate_kprobe_nop_insn(struct ftrace_insn *insn)
> -{
> -#ifdef CONFIG_KPROBES
> -	insn->opc = BREAKPOINT_INSTRUCTION;
> -	insn->disp = KPROBE_ON_FTRACE_NOP;
> -#endif
> -}
> -
> -static inline void ftrace_generate_kprobe_call_insn(struct ftrace_insn *insn)
> -{
> -#ifdef CONFIG_KPROBES
> -	insn->opc = BREAKPOINT_INSTRUCTION;
> -	insn->disp = KPROBE_ON_FTRACE_CALL;
> -#endif
> -}
> -
>  int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>  		       unsigned long addr)
>  {
> --
> 2.17.1

Thanks for noticing, looks like I missed them.

Acked-by: Sven Schnelle

Sven
Re: [PATCH] uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
Hi,

On Mon, May 04, 2020 at 08:40:44PM +0200, Christian Borntraeger wrote:
>
> On 04.05.20 18:47, Oleg Nesterov wrote:
> > uprobe_write_opcode() must not cross page boundary; prepare_uprobe()
> > relies on arch_uprobe_analyze_insn() which should validate "vaddr" but
> > some architectures (csky, s390, and sparc) don't do this.
>
> I think the idea was that the uprobe instruction is 2 bytes and
> instructions are always aligned to 2 bytes on s390. (we can have 2, 4
> or 6 bytes).
>
> > We can remove the BUG_ON() check in prepare_uprobe() and validate the
> > offset early in __uprobe_register(). The new IS_ALIGNED() check matches
> > the alignment check in arch_prepare_kprobe() on supported architectures,
> > so I think that all insns must be aligned to UPROBE_SWBP_INSN_SIZE.
>
> Not sure if it would have been possible to try to create a uprobe on an
> odd address. If yes, then the new IS_ALIGNED check certainly makes this
> better for s390, so the patch looks sane. Adding Vasily and Sven to
> double check.

I did a quick test, and without this patch it is possible to place a
uprobe at an odd address. With the patch it fails with EINVAL, which is
more reasonable.

Regards
Sven
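The early validation under discussion can be sketched as follows. This is a hedged model, not the kernel patch: `validate_uprobe_offset()`, the local `EINVAL_ERR` constant, and the bare-bones `IS_ALIGNED` macro are invented for the example; the 2-byte size matches s390's minimum instruction length mentioned above. Rejecting misaligned offsets up front guarantees that the later opcode write can never straddle an instruction (or page) boundary.

```c
/* s390-style 2-byte breakpoint instruction (assumption for this sketch) */
#define UPROBE_SWBP_INSN_SIZE	2
#define EINVAL_ERR		22	/* stand-in for -EINVAL */

/* works for power-of-two alignments, like the kernel macro */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

static int validate_uprobe_offset(unsigned long offset,
				  unsigned long ref_ctr_offset)
{
	/* probe address must sit on an instruction boundary */
	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
		return -EINVAL_ERR;
	/* the reference counter, if used, must be naturally aligned too */
	if (ref_ctr_offset && !IS_ALIGNED(ref_ctr_offset, sizeof(short)))
		return -EINVAL_ERR;
	return 0;
}
```

This mirrors the behaviour Sven observed in his test: an odd probe address is refused with EINVAL instead of being written blindly.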
Re: net: tulip: de2104x: Checking a kmemdup() call in de21041_get_srom_info()
On Sat, Oct 12, 2019 at 07:03:09PM +0200, Markus Elfring wrote:
> Hello,
>
> I tried another script for the semantic patch language out.
> This source code analysis approach points out that the implementation
> of the function "de21041_get_srom_info" contains still an unchecked call
> of the function "kmemdup".
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/dec/tulip/de2104x.c?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n1940
> https://elixir.bootlin.com/linux/v5.4-rc2/source/drivers/net/ethernet/dec/tulip/de2104x.c#L1940
>
> How do you think about to improve it?

If I have not missed a place, the only user is de_get_eeprom(), which
checks whether de->ee_data is valid. So I think, although not obvious,
there's no problem here.

Regards
Sven
Re: allow larger than require DMA masks
Hi,

On Fri, Feb 15, 2019 at 03:45:54PM +0100, Christoph Hellwig wrote:
> Hi all,
>
> this series finishes off converting our dma mask model to split between
> device capabilities (dev->dma_mask and dev->coherent_dma_mask) and system
> limitations (dev->bus_dma_mask). We already accept larger than required
> masks in most dma_map_ops implementation, in case of x86 and
> implementations based on it since the dawn of time. Only one parisc
> and two sparc64 instances failed larger than required DMA masks, and
> this series fixes that up and updates the documentation that devices
> don't need to handle DMA mask fallbacks.

I just tried the latest linux-5.4 git on my HP c8000 (parisc) and got the
following error:

[   27.246866] sata_sil24 :00:01.0: Applying completion IRQ loss on PCI-X errata fix
[   27.336968] sata_sil24 :00:01.0: DMA enable failed
[   27.476922] sata_sil24: probe of :00:01.0 failed with error -5

This is caused by commit dcc02c19cc06bd7bc1b6db0aa0087a2b6eb05b94:

    Author: Christoph Hellwig
    Date:   Mon Aug 26 12:57:24 2019 +0200

    sata_sil24: use dma_set_mask_and_coherent

    Use the dma_set_mask_and_coherent helper to set the DMA mask. Rely
    on the relatively recent change that setting a larger than required
    mask will never fail to avoid the need for the boilerplate 32-bit
    fallback code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

However, the real problem seems to be in sba_dma_supported():

	/* Documentation/DMA-API-HOWTO.txt tells drivers to try 64-bit
	 * first, then fall back to 32-bit if that fails.
	 * We are just "encouraging" 32-bit DMA masks here since we can
	 * never allow IOMMU bypass unless we add special support for ZX1.
	 */
	if (mask > ~0U)
		return 0;

Removing the if() makes the DMA mapping work. It's almost midnight here,
so I won't look into that any further today. Does anyone have an opinion
on this behaviour? Otherwise I will look a bit more into this in the next
few days.

Regards
Sven
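The boilerplate that the sata_sil24 commit removed, and that sba_dma_supported() still assumed, is the classic "try 64-bit, fall back to 32-bit" dance from the DMA API HOWTO. A hedged userspace sketch of that interaction, with a mock that refuses wide masks the way the old SBA code does:

```c
#include <stdint.h>

#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Mock of an IOMMU like the parisc SBA that refuses masks wider than
 * 32 bit; return value and name are illustrative, not the kernel API. */
static int mock_set_dma_mask(uint64_t mask)
{
	if (mask > 0xffffffffULL)
		return -5;	/* models -EIO: mask rejected */
	return 0;
}

/* The pre-"larger masks always succeed" driver boilerplate: try 64-bit
 * first, then fall back to 32-bit. Returns the accepted mask, 0 if none. */
static uint64_t set_mask_with_fallback(int (*set_mask)(uint64_t))
{
	if (set_mask(DMA_BIT_MASK(64)) == 0)
		return DMA_BIT_MASK(64);
	if (set_mask(DMA_BIT_MASK(32)) == 0)
		return DMA_BIT_MASK(32);
	return 0;
}
```

Once drivers drop the fallback and call the equivalent of `set_mask(DMA_BIT_MASK(64))` only once, a backend that still rejects wide masks (like the `if (mask > ~0U) return 0;` above) turns a working device into a probe failure, which is exactly the sil24 symptom reported here.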
Re: [PATCH] kprobes/parisc: remove arch_kprobe_on_func_entry()
On Wed, Aug 21, 2019 at 09:56:40AM +0000, Jisheng Zhang wrote:
> The common kprobes provides a weak implementation of
> arch_kprobe_on_func_entry(). The parisc version is the same as the
> common version, so remove it.
>
> Signed-off-by: Jisheng Zhang

Acked-by: Sven Schnelle

> ---
>  arch/parisc/kernel/kprobes.c | 4
>  1 file changed, 4 deletions(-)
>
> diff --git a/arch/parisc/kernel/kprobes.c b/arch/parisc/kernel/kprobes.c
> index 5d7f2692ac5a..77ec51818916 100644
> --- a/arch/parisc/kernel/kprobes.c
> +++ b/arch/parisc/kernel/kprobes.c
> @@ -281,10 +281,6 @@ int __kprobes arch_trampoline_kprobe(struct kprobe *p)
>  {
>  	return p->addr == trampoline_p.addr;
>  }
> -bool arch_kprobe_on_func_entry(unsigned long offset)
> -{
> -	return !offset;
> -}
>
>  int __init arch_init_kprobes(void)
>  {
> --
> 2.23.0.rc1
Re: [PATCH v9 00/21] Generic page walk and ptdump
Hi Steven,

On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
>
> parisc is more interesting and I'm not sure if this is necessarily
> correct. I originally proposed a patch with the line "For parisc, we
> don't support large pages, so add stubs returning 0" which got Acked by
> Helge Deller. However going back to look at that again I see there was a
> follow up thread[2] which possibly suggests I was wrong?

I just started implementing ptdump for PA-RISC a week ago. I didn't
notice that you're working on making it generic, which is nice. I'll
adjust my code to use the infrastructure you're currently developing.

> Can anyone shed some light on whether parisc does support leaf entries
> of the page table tree at a higher than the normal depth?
>
> [1] https://lkml.org/lkml/2019/2/27/572
> [2] https://lkml.org/lkml/2019/3/5/610

My understanding is that PA-RISC only has leaf entries at the PTE level.

> The intention is that the page table walker would be available for all
> architectures so that it can be used in any generic code - PTDUMP simply
> seemed like a good place to start.
>
> > Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range()
> > these functions need to be defined on all arch irrespective if they use
> > PTDUMP or not or otherwise just define it for archs which need them now
> > for sure i.e x86 and arm64 (which are moving to new generic PTDUMP
> > framework). Other archs can implement these later.

I'll take care of the PA-RISC part - for 32-bit your generic code works,
for 64-bit I need to learn a bit more about the following hack:

arch/parisc/include/asm/pgalloc.h:15

	/* Allocate the top level pgd (page directory)
	 *
	 * Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
	 * allocate the first pmd adjacent to the pgd. This means that we can
	 * subtract a constant offset to get to it. The pmd and pgd sizes are
	 * arranged so that a single pmd covers 4GB (giving a full 64-bit
	 * process access to 8TB) so our lookups are effectively L2 for the
	 * first 4GB of the kernel (i.e. for all ILP32 processes and all the
	 * kernel for machines with under 4GB of memory) */

I see that your change clears P?D entries when p?d_bad() returns true,
which - I think - would be the case with the PA-RISC implementation.

Regards
Sven
Re: [PATCH] kprobes: fix compilation when KPROBE_EVENTS is enabled without kretprobes
Hi Naveen,

On Sat, Apr 06, 2019 at 10:52:47PM +0530, Naveen N. Rao wrote:
> Sven Schnelle wrote:
> > While implementing kprobes on PA-RISC (without kretprobes) compilation
> > fails when CONFIG_KPROBE_EVENTS is enabled:
>
> Thanks for working on that! Is there a specific reason kretprobes is not
> being enabled on parisc?

The reason is that I just started to add kprobes support to PA-RISC. I'm
planning to do kretprobes as a second step and to keep separate patches
for kprobes and kretprobes so that things stay small.

Regards
Sven
[PATCH] kprobes: fix compilation when KPROBE_EVENTS is enabled without kretprobes
While implementing kprobes on PA-RISC (without kretprobes), compilation
fails when CONFIG_KPROBE_EVENTS is enabled:

kernel/trace/trace_kprobe.o: in function `trace_kprobe_create':
kernel/trace/trace_kprobe.c:666: undefined reference to `kprobe_on_func_entry'
kernel/trace/trace_kprobe.o: in function `trace_kprobe_on_func_entry':
kernel/trace/trace_kprobe.c:167: undefined reference to `kprobe_on_func_entry'
kernel/trace/trace_kprobe.c:167: undefined reference to `kprobe_on_func_entry'
make: *** [Makefile:1029: vmlinux] Error 1

Given the fact that these functions have no dependencies on kretprobes,
I think they should be moved out of the #ifdef CONFIG_KRETPROBES block.

Signed-off-by: Sven Schnelle
Cc: "Naveen N. Rao"
Cc: Anil S Keshavamurthy
Cc: "David S. Miller"
Cc: Masami Hiramatsu
---
 kernel/kprobes.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index c83e54727131..10a7e67fea67 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1819,6 +1819,26 @@ unsigned long __weak arch_deref_entry_point(void *entry)
 	return (unsigned long)entry;
 }
 
+bool __weak arch_kprobe_on_func_entry(unsigned long offset)
+{
+	return !offset;
+}
+
+bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym,
+			  unsigned long offset)
+{
+	kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);
+
+	if (IS_ERR(kp_addr))
+		return false;
+
+	if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) ||
+	    !arch_kprobe_on_func_entry(offset))
+		return false;
+
+	return true;
+}
+
 #ifdef CONFIG_KRETPROBES
 /*
  * This kprobe pre_handler is registered with every kretprobe. When probe
@@ -1875,25 +1895,6 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(pre_handler_kretprobe);
 
-bool __weak arch_kprobe_on_func_entry(unsigned long offset)
-{
-	return !offset;
-}
-
-bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset)
-{
-	kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);
-
-	if (IS_ERR(kp_addr))
-		return false;
-
-	if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) ||
-	    !arch_kprobe_on_func_entry(offset))
-		return false;
-
-	return true;
-}
-
 int register_kretprobe(struct kretprobe *rp)
 {
 	int ret = 0;
-- 
2.20.1