Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Wed, 7 Feb 2024 13:07:36 +0100 > Mete Durlu wrote: > >> wouldn't the following scenario explain the behavior we are seeing. >> When using event triggers, trace uses lockless ringbuffer control paths. >> If cmdline update and trace output reading is happening on different >> cpus, the ordering can get messed up. >> 1. event happens and trace trigger tells ring buffer to stop writes >> 2. (on cpu#1) test calculates checksum on current state of trace output. >> 3. (on cpu#2) not knowing about the trace buffer's status yet, writer adds >> one last entry which would collide with a pid in the cmdline map before >> actually stopping. While (on cpu#1) the checksum is being calculated, the new >> saved_cmdlines entry is waiting for spinlocks to be unlocked and then >> gets added. >> 4. test calculates checksum again and finds that the trace output has >> changed. <...> is put on the collided pid. > > But the failure is here: > > on=`cat tracing_on` > if [ $on != "0" ]; then > fail "Tracing is not off" > fi It might be misleading because we're discussing two issues in one thread. The failure above was one problem, which the initial fix addresses. The other issue we're still seeing is in the test below: > csum1=`md5sum trace` > sleep $SLEEP_TIME > csum2=`md5sum trace` > > if [ "$csum1" != "$csum2" ]; then > fail "Tracing file is still changing" > fi > > 1. tracing is off > 2. do checksum of trace > 3. sleep > 4. do another checksum of trace > 5. compare the two checksums > > Now how did they come up differently in that amount of time? The > saved_cmdlines really should not have been updated. My assumption without reading the code is that something like this happens:

CPU0                                      CPU1
[ringbuffer enabled]
ring_buffer_write()
 if (atomic_read(&buffer->record_disabled))
         goto out;
                                          echo 0 > tracing_on
                                          record_disabled |= RB_BUFFER_OFF
                                          csum1=`md5sum trace`
[adds trace entry to ring buffer,
 overwriting saved_cmdlines entry
 because it thinks the ring buffer
 is enabled]
                                          csum2=`md5sum trace`
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Tue, 06 Feb 2024 09:48:16 +0100 > Sven Schnelle wrote: > >> I added some logging, and the test is not triggering this issue. So i >> assume the default of 128 cmdline entries is just to small. Sorry for >> the noise. Lets see whether we're still triggering some failures with >> the other fix applied in CI. If we do, maybe we want to increase the >> saved_cmdline_size for the ftrace test suite. > > I wonder if it is a good idea to increase the size when tracing starts, > like the ring buffer expanding code. Maybe to 1024? Or is that still > too small? Not sure whether that is enough, have to test. However, it's not really a fix, it would just require a bit more load and the test would fail again. The fundamental problem is that even after disabling tracing there might be some tracing line added due to the lockless nature of the ringbuffer. This might then replace some existing cmdline entry. So maybe we should change the test to ignore the program name when calculating the checksums.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Sven Schnelle writes: > Looking at trace_save_cmdline(): > > tpid = tsk->pid & (PID_MAX_DEFAULT - 1); where PID_MAX_DEFAULT = 0x8000 > > so this is basically > > tpid = tsk->pid & 0x7fff; > > further on: > > // might clash with other pid if (otherpid & 0x7fff) == (tsk->pid & > 0x7fff) > idx = savedcmd->map_pid_to_cmdline[tpid]; > if (idx == NO_CMDLINE_MAP) { > > idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num; > savedcmd->map_pid_to_cmdline[tpid] = idx; > savedcmd->cmdline_idx = idx; > } > > So i think the problem that sometimes '<...>' instead of the correct > comm is logged is just expected behaviour given the code above. I added some logging, and the test is not triggering this issue. So I assume the default of 128 cmdline entries is just too small. Sorry for the noise. Let's see whether we're still triggering some failures with the other fix applied in CI. If we do, maybe we want to increase saved_cmdlines_size for the ftrace test suite.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Looking at trace_save_cmdline():

	tpid = tsk->pid & (PID_MAX_DEFAULT - 1);

where PID_MAX_DEFAULT = 0x8000, so this is basically

	tpid = tsk->pid & 0x7fff;

further on:

	// might clash with another pid if (otherpid & 0x7fff) == (tsk->pid & 0x7fff)
	idx = savedcmd->map_pid_to_cmdline[tpid];
	if (idx == NO_CMDLINE_MAP) {
		// This will pick an existing entry if there are
		// more than cmdline_num entries present
		idx = (savedcmd->cmdline_idx + 1) % savedcmd->cmdline_num;
		savedcmd->map_pid_to_cmdline[tpid] = idx;
		savedcmd->cmdline_idx = idx;
	}

So I think that logging '<...>' instead of the correct comm is sometimes just expected behaviour given the code above.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Mon, 05 Feb 2024 14:16:30 +0100 > Sven Schnelle wrote: >> >> Another issue i'm hitting sometimes is this part: >> >> csum1=`md5sum trace` >> sleep $SLEEP_TIME >> csum2=`md5sum trace` >> >> if [ "$csum1" != "$csum2" ]; then >> fail "Tracing file is still changing" >> fi >> >> This is because the command line was replaced in the >> saved_cmdlines_buffer, an example diff between both files >> is: > > [..] > >> >> This can be improved by: >> >> echo 32768 > /sys/kernel/tracing/saved_cmdlines_size >> >> But this is of course not a fix - should we maybe replace the program >> name with <...> before comparing, remove the check completely, or do >> anything else? What do you think? > > Hmm, actually I would say that this exposes a real bug. Not a major > one, but one that I find annoying. The saved commandlines should only > be updated when a trace event occurs. But really, it should only be > updated if one is added to the ring buffer. If the ring buffer isn't > being updated, we shouldn't be adding new command lines. > > There may be a location that has tracing off but still updating the > cmdlines which will break the saved cache. Ok, my understanding is that it will override the entry in the list if another process comes up with the same PID. But i haven't read the code carefully - let me do that now.
Re: [PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
Hi Steven, Steven Rostedt writes: > On Mon, 5 Feb 2024 07:53:40 +0100 > Sven Schnelle wrote: > >> tracer_tracing_is_on() checks whether record_disabled is not zero. This >> checks both the record_disabled counter and the RB_BUFFER_OFF flag. >> Reading the source it looks like this function should only check for >> the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). >> This fixes spurious fails in the 'test for function traceon/off triggers' >> test from the ftrace testsuite when the system is under load. >> > > I've seen these spurious failures too, but haven't looked deeper into > it. Thanks! Another issue I'm hitting sometimes is this part:

	csum1=`md5sum trace`
	sleep $SLEEP_TIME
	csum2=`md5sum trace`

	if [ "$csum1" != "$csum2" ]; then
		fail "Tracing file is still changing"
	fi

This is because the command line was replaced in the saved_cmdlines_buffer, an example diff between both files is:

  ftracetest-17950 [005] . 344507.002490: sched_process_wait: comm=ftracetest pid=0 prio=120
  ftracetest-17950 [005] . 344507.002492: sched_process_wait: comm=ftracetest pid=0 prio=120
- stress-ng-fanot-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
+ <...>-17820 [006] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=1054 [ns]
  ftracetest-17950 [005] d.h.. 344507.009901: sched_stat_runtime: comm=ftracetest pid=17950 runtime=7417915 [ns]
  stress-ng-fanot-17819 [003] d.h.. 344507.009901: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=9983473 [ns]
- stress-ng-fanot-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
+ <...>-17820 [007] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17820 runtime=865 [ns]
  stress-ng-fanot-17819 [004] d.h.. 344507.079900: sched_stat_runtime: comm=stress-ng-fanot pid=17819 runtime=8388039 [ns]

This can be improved by:

	echo 32768 > /sys/kernel/tracing/saved_cmdlines_size

But this is of course not a fix - should we maybe replace the program name with <...> before comparing, remove the check completely, or do anything else? What do you think?

Thanks,
Sven
[PATCH] tracing: use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load.

Signed-off-by: Sven Schnelle
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2a7c6fd934e9..47e221e1e720 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1532,7 +1532,7 @@ void disable_trace_on_warning(void)
 bool tracer_tracing_is_on(struct trace_array *tr)
 {
 	if (tr->array_buffer.buffer)
-		return ring_buffer_record_is_on(tr->array_buffer.buffer);
+		return ring_buffer_record_is_set_on(tr->array_buffer.buffer);
 	return !tr->buffer_disabled;
 }
--
2.40.1
Re: [PATCH 4/5] kbuild: unify vdso_install rules
Masahiro Yamada writes: > Currently, there is no standard implementation for vdso_install, > leading to various issues: > > 1. Code duplication > > Many architectures duplicate similar code just for copying files > to the install destination. > > Some architectures (arm, sparc, x86) create build-id symlinks, > introducing more code duplication. > > 2. Accidental updates of in-tree build artifacts > > The vdso_install rule depends on the vdso files to install. > It may update in-tree build artifacts. This can be problematic, > as explained in commit 19514fc665ff ("arm, kbuild: make > "make install" not depend on vmlinux"). > > 3. Broken code in some architectures > > Makefile code is often copied from one architecture to another > without proper adaptation or testing. > > The previous commits removed broken code from csky, UML, and parisc. > > Another issue is that 'make vdso_install' for ARCH=s390 installs > vdso64, but not vdso32. > > To address these problems, this commit introduces the generic vdso_install. > > Architectures that support vdso_install need to define vdso-install-y > in arch/*/Makefile. > > vdso-install-y lists the files to install. For example, arch/x86/Makefile > looks like this: > > vdso-install-$(CONFIG_X86_64) += arch/x86/entry/vdso/vdso64.so.dbg > vdso-install-$(CONFIG_X86_X32_ABI) += > arch/x86/entry/vdso/vdsox32.so.dbg > vdso-install-$(CONFIG_X86_32) += arch/x86/entry/vdso/vdso32.so.dbg > vdso-install-$(CONFIG_IA32_EMULATION) += arch/x86/entry/vdso/vdso32.so.dbg > > These files will be installed to $(MODLIB)/vdso/ with the .dbg suffix, > if exists, stripped away. > > vdso-install-y can optionally take the second field after the colon > separator. This is needed because some architectures install vdso > files as a different base name. > > The following is a snippet from arch/arm64/Makefile. 
> > vdso-install-$(CONFIG_COMPAT_VDSO) += > arch/arm64/kernel/vdso32/vdso.so.dbg:vdso32.so > > This will rename vdso.so.dbg to vdso32.so during installation. If such > architectures change their implementation so that the file names match, > this workaround will go away. > > Signed-off-by: Masahiro Yamada > --- Acked-by: Sven Schnelle # s390
Re: [PATCH v4 4/9] fprobe: rethook: Use ftrace_regs in fprobe exit handler and rethook
Masami Hiramatsu (Google) writes: >> > IOW, it is ftrace save regs/restore regs code issue. I need to check how >> > the >> > function_graph implements it. >> >> gpr2-gpr14 are always saved in ftrace_caller/ftrace_regs_caller(), >> regardless of the FTRACE_WITH_REGS flags. The only difference is that >> without the FTRACE_WITH_REGS flag the program status word (psw) is not >> saved because collecting that is a rather expensive operation. > > Thanks for checking that! So s390 will recover those saved registers > even if FTRACE_WITH_REGS flag is not set? (I wonder what is the requirement > of the ftrace_regs when returning from ftrace_call() without > FTRACE_WITH_REGS?) Yes, it will recover these in all cases. >> >> I used the following commands to test rethook (is that the correct >> testcase?) >> >> #!/bin/bash >> cd /sys/kernel/tracing >> >> echo 'r:icmp_rcv icmp_rcv' >kprobe_events >> echo 1 >events/kprobes/icmp_rcv/enable >> ping -c 1 127.0.0.1 >> cat trace > > No, the kprobe will pass pt_regs to rethook. > Can you run > > echo "f:icmp_rcv%return icmp_rcv" >> dynamic_events Ah, ok. Seems to work as well:

 ping-481 [001] ..s2. 53.918480: icmp_rcv: (ip_protocol_deliver_rcu+0x42/0x218 <- icmp_rcv)
 ping-481 [001] ..s2. 53.918491: icmp_rcv: (ip_protocol_deliver_rcu+0x42/0x218 <- icmp_rcv)
Re: [RESEND][PATCH] tracing/synthetic: Fix order of struct trace_dynamic_info
Hi Steven, Steven Rostedt writes: > From: "Steven Rostedt (Google)" > > To make handling BIG and LITTLE endian better the offset/len of dynamic > fields of the synthetic events was changed into a structure of: > > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > u16 offset; > u16 len; > #else > u16 len; > u16 offset; > #endif > }; > > to replace the manual changes of: > > data_offset = offset & 0xffff; > data_offset |= len << 16; > > But if you look closely, the above is: > > len << 16 | offset > > Which in little endian would be in memory: > > offset_lo offset_hi len_lo len_hi > > and in big endian: > > len_hi len_lo offset_hi offset_lo > > Which if broken into a structure would be: > > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > u16 len; > u16 offset; > #else > u16 offset; > u16 len; > #endif > }; > > Which is the opposite of what was defined. > > Fix this and just to be safe also add "__packed". > > Link: https://lore.kernel.org/all/20230908154417.5172e...@gandalf.local.home/ > > Cc: sta...@vger.kernel.org > Fixes: ddeea494a16f3 ("tracing/synthetic: Use union instead of casts") > Signed-off-by: Steven Rostedt (Google) > --- > > [ Resending to the correct mailing list this time :-p ] > > include/linux/trace_events.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h > index 12f875e9e69a..21ae37e49319 100644 > --- a/include/linux/trace_events.h > +++ b/include/linux/trace_events.h > @@ -62,13 +62,13 @@ void trace_event_printf(struct trace_iterator *iter, > const char *fmt, ...); > /* Used to find the offset and length of dynamic fields in trace events */ > struct trace_dynamic_info { > #ifdef CONFIG_CPU_BIG_ENDIAN > - u16 offset; > u16 len; > + u16 offset; > #else > - u16 len; > u16 offset; > + u16 len; > #endif > -}; > +} __packed; > > /* > * The trace entry - the most basic unit of tracing.
This issue was also present on BE, but as you noted, "covered" by the broken test case. With this patch everything works as expected. So: Tested-by: Sven Schnelle
Re: Is s390's new generic-using syscall code actually correct?
Sven Schnelle writes: > Hi Andy, > > Andy Lutomirski writes: > >> On Wed, Mar 24, 2021 at 10:39 AM Vasily Gorbik wrote: >>> >>> Hi Andy, >>> >>> On Sat, Mar 20, 2021 at 08:48:34PM -0700, Andy Lutomirski wrote: >>> > Hi all- >>> > >>> > I'm working on my kentry patchset, and I encountered: >>> > >>> > commit 56e62a73702836017564eaacd5212e4d0fa1c01d >>> > Author: Sven Schnelle >>> > Date: Sat Nov 21 11:14:56 2020 +0100 >>> > >>> > s390: convert to generic entry >>> > >>> > As part of this work, I was cleaning up the generic syscall helpers, >>> > and I encountered the goodies in do_syscall() and __do_syscall(). >>> > >>> > I'm trying to wrap my head around the current code, and I'm rather >>> > confused. >>> > >>> > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just >>> > the syscall exit work. So a do_syscall() that gets called twice will >>> > do the loopy part of the exit work (e.g. signal handling) twice. Is >>> > this intentional? If so, why? >>> > >>> > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed >>> > to work. Looking at the code in Linus' tree, if a signal is pending >>> > and a syscall returns -ERESTARTSYS, the syscall will return back to >>> > do_syscall(). The work (as in (1)) gets run, calling do_signal(), >>> > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. >>> > Presumably it will also push the signal frame onto the stack and aim >>> > the return address at the svc instruction mentioned in the commit >>> > message from "s390: convert to generic entry". Then __do_syscall() >>> > will turn interrupts back on and loop right back into do_syscall(). >>> > That seems incorrect. >>> > >>> > Can you enlighten me? My WIP tree is here: >>> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry >>> > >>> >>> For all the details to that change we'd have to wait for Sven, who is back >>> next week. 
>>> >>> > Here are my changes to s390, and I don't think they're really correct: >>> > >>> > >>> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c >>> >>> Couple of things: syscall_exit_to_user_mode_prepare is static, >>> and there is another code path in arch/s390/kernel/traps.c using >>> enter_from_user_mode/exit_to_user_mode. >>> >>> Anyhow I gave your branch a spin and got few new failures on strace test >>> suite, in particular on restart_syscall test. I'll try to find time to >>> look into details. >> >> I refreshed the branch, but I confess I haven't compile tested it. :) >> >> I would guess that the new test case failures are a result of the >> buggy syscall restart logic. I think that all of the "restart" cases >> except execve() should just be removed. Without my patch, I suspect >> that signal delivery with -ERESTARTSYS would create the signal frame, >> do an accidental "restarted" syscall that was a no-op, and then >> deliver the signal. With my patch, it may simply repeat the original >> interrupted signal forever. > > PIF_SYSCALL_RESTART is set in arch_do_signal_or_restart(), but only if > there's no signal handler registered. In that case we don't need a > signal frame, so that should be fine. > > The problem why your branch is not working is that arch_do_signal_or_restart() > gets called from exit_to_user_mode_prepare(), and that is after the > check whether PIF_SYSCALL_RESTART is set in __do_syscall(). > > So i'm wondering how to fix that. x86 simply rewinds the pc, so the > system call instruction gets re-executed when returning to user > space. For s390 that doesn't work, as the s390 svc instruction might > have the syscall number encoded. If we would have to restart a system > call with restart_systemcall(), we need to change the syscall number to > __NR_restart_syscall. 
As we cannot change the hardcoded system call > number, we somehow need to handle that inside of the kernel. > > So I wonder whether we should implement the PIF_SYSCALL_RESTART check in > entry.S after all the return-to-user-space entry code has run but before > doing the real switch back to user space. If PIF_SYSCALL_RESTART is set > we would then just jump back to the entry code and pretend we came from > user space. > > That would have the benefit that the entry C code looks the same as on > other architectures and the amount of code to add in entry.S shouldn't > be much. Thinking about this again, I guess this idea won't work because the exit loop might have scheduled the old process away already...
Re: Is s390's new generic-using syscall code actually correct?
Hi Andy, Andy Lutomirski writes: > On Wed, Mar 24, 2021 at 10:39 AM Vasily Gorbik wrote: >> >> Hi Andy, >> >> On Sat, Mar 20, 2021 at 08:48:34PM -0700, Andy Lutomirski wrote: >> > Hi all- >> > >> > I'm working on my kentry patchset, and I encountered: >> > >> > commit 56e62a73702836017564eaacd5212e4d0fa1c01d >> > Author: Sven Schnelle >> > Date: Sat Nov 21 11:14:56 2020 +0100 >> > >> > s390: convert to generic entry >> > >> > As part of this work, I was cleaning up the generic syscall helpers, >> > and I encountered the goodies in do_syscall() and __do_syscall(). >> > >> > I'm trying to wrap my head around the current code, and I'm rather >> > confused. >> > >> > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just >> > the syscall exit work. So a do_syscall() that gets called twice will >> > do the loopy part of the exit work (e.g. signal handling) twice. Is >> > this intentional? If so, why? >> > >> > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed >> > to work. Looking at the code in Linus' tree, if a signal is pending >> > and a syscall returns -ERESTARTSYS, the syscall will return back to >> > do_syscall(). The work (as in (1)) gets run, calling do_signal(), >> > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. >> > Presumably it will also push the signal frame onto the stack and aim >> > the return address at the svc instruction mentioned in the commit >> > message from "s390: convert to generic entry". Then __do_syscall() >> > will turn interrupts back on and loop right back into do_syscall(). >> > That seems incorrect. >> > >> > Can you enlighten me? My WIP tree is here: >> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/kentry >> > >> >> For all the details to that change we'd have to wait for Sven, who is back >> next week. 
>> >> > Here are my changes to s390, and I don't think they're really correct: >> > >> > >> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/diff/arch/s390/kernel/syscall.c?h=x86/kentry=58a459922be0fb8e0f17aeaebcb0ac8d0575a62c >> >> Couple of things: syscall_exit_to_user_mode_prepare is static, >> and there is another code path in arch/s390/kernel/traps.c using >> enter_from_user_mode/exit_to_user_mode. >> >> Anyhow I gave your branch a spin and got few new failures on strace test >> suite, in particular on restart_syscall test. I'll try to find time to >> look into details. > > I refreshed the branch, but I confess I haven't compile tested it. :) > > I would guess that the new test case failures are a result of the > buggy syscall restart logic. I think that all of the "restart" cases > except execve() should just be removed. Without my patch, I suspect > that signal delivery with -ERESTARTSYS would create the signal frame, > do an accidental "restarted" syscall that was a no-op, and then > deliver the signal. With my patch, it may simply repeat the original > interrupted signal forever. PIF_SYSCALL_RESTART is set in arch_do_signal_or_restart(), but only if there's no signal handler registered. In that case we don't need a signal frame, so that should be fine. The problem why your branch is not working is that arch_do_signal_or_restart() gets called from exit_to_user_mode_prepare(), and that is after the check whether PIF_SYSCALL_RESTART is set in __do_syscall(). So i'm wondering how to fix that. x86 simply rewinds the pc, so the system call instruction gets re-executed when returning to user space. For s390 that doesn't work, as the s390 svc instruction might have the syscall number encoded. If we would have to restart a system call with restart_systemcall(), we need to change the syscall number to __NR_restart_syscall. As we cannot change the hardcoded system call number, we somehow need to handle that inside of the kernel. 
So I wonder whether we should implement the PIF_SYSCALL_RESTART check in entry.S after all the return-to-user-space entry code has run, but before doing the real switch back to user space. If PIF_SYSCALL_RESTART is set we would then just jump back to the entry code and pretend we came from user space. That would have the benefit that the entry C code looks the same as on other architectures, and the amount of code to add in entry.S shouldn't be much. Any thoughts? Regards Sven
Re: [PATCH v3 08/11] entry: Make CONFIG_DEBUG_ENTRY available outside x86
Mark Rutland writes: > On Thu, Mar 04, 2021 at 11:06:01AM -0800, Andy Lutomirski wrote: >> In principle, the generic entry code is generic, and the goal is to use it >> in many architectures once it settles down more. Move CONFIG_DEBUG_ENTRY >> to the generic config so that it can be used in the generic entry code and >> not just in arch/x86. >> >> Disable it on arm64. arm64 uses some but not all of the kentry >> code, and trying to debug the resulting state machine will be painful. >> arm64 can turn it back on when it starts using the entire generic >> path. > > Can we make this depend on CONFIG_GENERIC_ENTRY instead of !ARM64? > That'd be more in line with "use the generic entry code, get the generic > functionality". Note that arm64 doesn't select CONFIG_GENERIC_ENTRY > today. > > I see that s390 selects CONFIG_GENERIC_ENTRY, and either way this will > enable DEBUG_ENTRY for them, so it'd ve worth checking whether this is > ok for them. > > Sven, thoughts? > For s390 that change should be fine.
Re: Is s390's new generic-using syscall code actually correct?
Hi Andy, sorry for the late reply, i was away from kernel development for a few weeks. Andy Lutomirski writes: > Hi all- > > I'm working on my kentry patchset, and I encountered: > > commit 56e62a73702836017564eaacd5212e4d0fa1c01d > Author: Sven Schnelle > Date: Sat Nov 21 11:14:56 2020 +0100 > > s390: convert to generic entry > > As part of this work, I was cleaning up the generic syscall helpers, > and I encountered the goodies in do_syscall() and __do_syscall(). > > I'm trying to wrap my head around the current code, and I'm rather confused. > > 1. syscall_exit_to_user_mode_work() does *all* the exit work, not just > the syscall exit work. So a do_syscall() that gets called twice will > do the loopy part of the exit work (e.g. signal handling) twice. Is > this intentional? If so, why? Not really intentional, but i decided to accept the overhead for now and fix that later by splitting up the generic entry code. Otherwise the patch would have had even more dependencies. I have not looked yet into your kentry branch, but will do that later. Maybe there is already a better way to do it or we can work something out. > 2. I don't understand how this PIF_SYSCALL_RESTART thing is supposed > to work. Looking at the code in Linus' tree, if a signal is pending > and a syscall returns -ERESTARTSYS, the syscall will return back to > do_syscall(). The work (as in (1)) gets run, calling do_signal(), > which will notice -ERESTARTSYS and set PIF_SYSCALL_RESTART. > Presumably it will also push the signal frame onto the stack and aim > the return address at the svc instruction mentioned in the commit > message from "s390: convert to generic entry". Then __do_syscall() > will turn interrupts back on and loop right back into do_syscall(). > That seems incorrect. That sounds indeed broken. My understanding is that x86 is always rewinding the pc in the restart case, and is exiting to user mode. That would than also work with signal frames. 
However, on s390 we cannot simply rewind the pc, as the syscall number might be encoded in the system call instruction. If a user has rewritten the system call number (e.g. with seccomp), rewinding would still execute the original syscall. That makes me wonder why I have not seen problems with signals and system call restarting so far. I have to read the code again. Regards Sven
[tip: sched/core] uprobes: (Re)add missing get_uprobe() in __find_uprobe()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b0d6d4789677d128b1933af023083054f0973574
Gitweb:        https://git.kernel.org/tip/b0d6d4789677d128b1933af023083054f0973574
Author:        Sven Schnelle
AuthorDate:    Tue, 09 Feb 2021 16:07:11 +01:00
Committer:     Ingo Molnar
CommitterDate: Wed, 17 Feb 2021 14:12:42 +01:00

uprobes: (Re)add missing get_uprobe() in __find_uprobe()

commit c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers") accidentally removed the refcount increase. Add it again.

Fixes: c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20210209150711.36778-1-sv...@linux.ibm.com
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fd5160d..3ea7f8f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);

 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));

 	return NULL;
 }
[tip: locking/core] s390: Use arch_local_irq_{save,restore}() in early boot code
The following commit has been merged into the locking/core branch of tip:

Commit-ID:     b38085ba60246fccc2f49d2ac162528dedbc4e71
Gitweb:        https://git.kernel.org/tip/b38085ba60246fccc2f49d2ac162528dedbc4e71
Author:        Sven Schnelle
AuthorDate:    Wed, 10 Feb 2021 14:24:16 +01:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 10 Feb 2021 14:44:39 +01:00

s390: Use arch_local_irq_{save,restore}() in early boot code

Commit 997acaf6b4b5 ("lockdep: report broken irq restoration") makes compiling s390 fail because the irq enable/disable functions are now no longer fully contained in header files.

Fixes: 997acaf6b4b5 ("lockdep: report broken irq restoration")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
---
 drivers/s390/char/sclp_early_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/char/sclp_early_core.c b/drivers/s390/char/sclp_early_core.c
index ec9f8ad..b7329af 100644
--- a/drivers/s390/char/sclp_early_core.c
+++ b/drivers/s390/char/sclp_early_core.c
@@ -66,13 +66,13 @@ int sclp_early_cmd(sclp_cmdw_t cmd, void *sccb)
 	unsigned long flags;
 	int rc;

-	raw_local_irq_save(flags);
+	flags = arch_local_irq_save();
 	rc = sclp_service_call(cmd, sccb);
 	if (rc)
 		goto out;
 	sclp_early_wait_irq();
 out:
-	raw_local_irq_restore(flags);
+	arch_local_irq_restore(flags);
 	return rc;
 }
[tip: sched/core] uprobes: (Re)add missing get_uprobe() in __find_uprobe()
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2c3496a02cb06ffe957854d8488a5799d7bfb252
Gitweb:        https://git.kernel.org/tip/2c3496a02cb06ffe957854d8488a5799d7bfb252
Author:        Sven Schnelle
AuthorDate:    Tue, 09 Feb 2021 16:07:11 +01:00
Committer:     Peter Zijlstra
CommitterDate: Wed, 10 Feb 2021 14:44:48 +01:00

uprobes: (Re)add missing get_uprobe() in __find_uprobe()

commit c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers") accidentally
removed the refcount increase. Add it again.

Fixes: c6bc9bd06dff ("rbtree, uprobes: Use rbtree helpers")
Signed-off-by: Sven Schnelle
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20210209150711.36778-1-sv...@linux.ibm.com
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index fd5160d..3ea7f8f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
[PATCH] uprobes: add missing get_uprobe() in __find_uprobe()
commit c6bc9bd06dff49fa4c ("rbtree, uprobes: Use rbtree helpers") from
next-20210208 accidentally removed the refcount increase. Add it again.

Signed-off-by: Sven Schnelle
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 7e15b2efdd87..6addc9780319 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -661,7 +661,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return __node_2_uprobe(node);
+		return get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
-- 
2.17.1
Re: [PATCH] s390: allow reschedule on syscall restart
Christian Borntraeger writes:

> On 21.01.21 15:39, Sven Schnelle wrote:
>> Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
>> within idle loop") from next-20210121 causes a warning because s390
>> doesn't call sched_resched_local_allow() when restarting a syscall.
>>
>> Signed-off-by: Sven Schnelle
>> ---
>>  arch/s390/kernel/syscall.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
>> index bc8e650e377d..2b39ac40f970 100644
>> --- a/arch/s390/kernel/syscall.c
>> +++ b/arch/s390/kernel/syscall.c
>> @@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
>>  		do_syscall(regs);
>>  		if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
>>  			break;
>> +		sched_resched_local_allow();
>>  		local_irq_enable();
>>  	}
>>  	exit_to_user_mode();
>
> Yesterdays next now fails with
>
> arch/s390/kernel/syscall.c: In function '__do_syscall':
> arch/s390/kernel/syscall.c:165:3: error: implicit declaration of function 'sched_resched_local_allow' [-Werror=implicit-function-declaration]
>   165 |   sched_resched_local_allow();
>       |   ^
> cc1: some warnings being treated as errors
> make[2]: *** [scripts/Makefile.build:288: arch/s390/kernel/syscall.o] Error 1
> make[2]: *** Waiting for unfinished jobs
> make[1]: *** [scripts/Makefile.build:530: arch/s390/kernel] Error 2
> make[1]: *** Waiting for unfinished jobs

Looks to me like 845f44e8ef28 ("sched: Report local wake up on resched
blind zone") was removed from linux-next. Stephen, can you remove my
commit as well? It is no longer needed.

Thanks
Sven
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> Hi Sven,
>
> Thanks for the outstanding analysis!
>
> On 2021-01-23, Sven Schnelle wrote:
>>> 1401		if (buf_size > 0)
>>> 1402			text[len] = 0;
>>
>> I don't think i have really understood how all the printk magic works,
>> but using r->text_buf[len] seems to be the correct place to put the
>> zero byte in that case?
>
> Yes, you are correct! @text is pointing to the beginning of the
> currently processed line, not the beginning of the buffer.
>
> I will submit a patch to fix our recent fix (unless you would like to do
> that).

Please go ahead, thank you!
Re: [PATCH] printk: fix buffer overflow potential for print_text()
Sven Schnelle writes:

> John Ogness writes:
>
>> On 2021-01-22, Sven Schnelle wrote:
>
> I was able to reproduce it in a virtual machine where i have a few more
> ways to debug. What i got was:
>
> 01: -> 001B8814"  MVI  92001000  >> 0163F1CD  CC 2
>
> That's a watchpoint telling me that the code at 0x1b8814 wants to store
> one byte to 0x163f1cd, which is the second byte of console_drivers.
>
> gdb tells me about 0x1b8814:
>
> (gdb) list *(0x1b8814)
> 0x1b8814 is in record_print_text
> (/home/svens/ibmgit/linux/kernel/printk/printk.c:1402).
> 1397	 * If a buffer was provided, it will be terminated. Space for the
> 1398	 * string terminator is guaranteed to be available. The terminator is
> 1399	 * not counted in the return value.
> 1400	 */
> 1401	if (buf_size > 0)
> 1402		text[len] = 0;

I don't think i have really understood how all the printk magic works,
but using r->text_buf[len] seems to be the correct place to put the
zero byte in that case?

> 1403
> 1404	return len;
> 1405 }
> 1406
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> On 2021-01-22, Sven Schnelle wrote:
>>
>> So somehow the pointer for console_drivers changes.
>>
>> I can't provide the normal kernel crash output as printk is no longer
>> working,
>
> I don't understand what you mean here. The crash tool can dump the
> printk buffer.
>
> I would be curious to see what the messages look like. It would also be
> helpful to know the last message you saw on the console.

The last message is:

[ 1845.640466] User process fault: interruption code 0007 ilc:2 in libm.so[7d78c000+a3000]
[ 1845.640474] CPU: 9 PID: 859915 Comm: ld.so.1 Not tainted 5.11.0-20210122.rc4.git0.9f29bd8b2e71.300.fc33.s390x+git #1
[ 1845.640476] Hardware name: IBM 8561 T01 703 (LPAR)
[ 1845.640478] User PSW : 070530008000 7d7983b2
[ 1845.640480] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:3 PM:0 RI:0 EA:1
[ 1845.640483] User GPRS: 7d798360 0010 03ff0004
[ 1845.640485] 7f854dd0 7d7e9788 03ff00401b30 7f854ff8
[ 1845.640487] 03ff 03ff7d5b4da0 03ff 03ff0010
[ 1845.640489] 03ff004022c6 03ff004024b0 80401796 7f854de8

That's also what's written to the console. Please note that the output
above is expected. It's only reporting that a user space program
performed an invalid FPU operation. The kernel crash is neither written
to the console nor the dmesg buffer.

>> kmsg_dump_rewind+290 is:
>>
>> if (!(con->flags & CON_ENABLED))
>> 	continue;
>>
>> in kernel/printk/printk.c:1845
>>
>> I haven't dived into whether our show_code() is doing something wrong
>> which was covered in the past or whether that's because of the patch
>> above but wanted to make you aware of that in case you have an idea.
>> Otherwise i have to dig into the code.
>
> Unless we are dealing with very long printk messages that normally get
> truncated (800+ characters) this patch simply adds a string
> terminator. I do not see how that could possibly cause this kind of
> damage.
>
> When this triggers, there is nothing happening with consoles registering
> or deregistering, right?
That's correct. No registering/unregistering taking place.

> The string terminator (kernel/printk/printk.c:1402) is only added for
> paranoia. If you comment that out, this patch will have no effect (for
> "normal" length lines). I would be curious if that somehow makes a
> difference for you.

I was able to reproduce it in a virtual machine where i have a few more
ways to debug. What i got was:

01: -> 001B8814"  MVI  92001000  >> 0163F1CD  CC 2

That's a watchpoint telling me that the code at 0x1b8814 wants to store
one byte to 0x163f1cd, which is the second byte of console_drivers.

gdb tells me about 0x1b8814:

(gdb) list *(0x1b8814)
0x1b8814 is in record_print_text
(/home/svens/ibmgit/linux/kernel/printk/printk.c:1402).
1397	 * If a buffer was provided, it will be terminated. Space for the
1398	 * string terminator is guaranteed to be available. The terminator is
1399	 * not counted in the return value.
1400	 */
1401	if (buf_size > 0)
1402		text[len] = 0;
1403
1404	return len;
1405 }
1406

In s390 assembly, this is the store:

0x001b8810 <+0x130>: la  %r1,0(%r7,%r9)
0x001b8814 <+0x134>: mvi 0(%r1),0

The cpu registers at the time of write:

01: GRG  0 = 0020 0163F1CD
01: GRG  2 = 0042 03E000623A98
01: GRG  4 = 000E 0087BB70
01: GRG  6 = 0400 *0224*
01: GRG  8 = 000F *0163EFA9*
01: GRG 10 = 0033
01: GRG 12 = 809AE000 03E000623A98
01: GRG 14 = 001B884C 03E0006239E8

So r9 is 0163EFA9 - seems to be the start of the current message:

0163EFA9: *[   24.069514]7d7af3c2: b30d0002..debr.%f0,%f2.debr.%f0,%f2

while r7 is the offset, and that one is way too big: 0224

If you add that to 0163EFA9, you'll see that we're outside of the
buffer and overwriting the space after it, which is console_drivers.
The message it's trying to print is rather long, so the 0x224 could be
the size of the whole line. This is not from the same crash, but this
is what the message it's trying to print looks like:

[   23.960773] User Code: 7d7af3ba: 78005000	le %f0,0(%r5)
[   23.960773]            7d7af3be: 78205004	le %f2,4(%r5)
[   23.960773] #          7d7af3c2: b30d0002	debr %f0,%f2
[   23.960773] >00
Re: [PATCH] printk: fix buffer overflow potential for print_text()
John Ogness writes:

> Before commit b6cf8b3f3312 ("printk: add lockless ringbuffer"),
> msg_print_text() would only write up to size-1 bytes into the
> provided buffer. Some callers expect this behavior and append
> a terminator to the returned string. In particular:
>
> arch/powerpc/xmon/xmon.c:dump_log_buf()
> arch/um/kernel/kmsg_dump.c:kmsg_dumper_stdout()
>
> msg_print_text() has been replaced by record_print_text(), which
> currently fills the full size of the buffer. This causes a
> buffer overflow for the above callers.
>
> Change record_print_text() so that it will only use size-1 bytes
> for text data. Also, for paranoia's sake, add a terminator after
> the text data.
>
> And finally, document this behavior so that it is clear that only
> size-1 bytes are used and a terminator is added.
>
> Fixes: b6cf8b3f3312 ("printk: add lockless ringbuffer")
> Signed-off-by: John Ogness
> ---
> [..]

I'm seeing crashes on s390x with this patch while running the glibc
testsuite. The glibc test suite triggers a few FPU exceptions which are
printed to the kernel log by default.
Looking at the crash dump, i see that the console_drivers pointer seems
to be overwritten while printing that (user space) warning:

crash> print *console_drivers
$1 = {
  name = "\247\071\377\373\354!\004\214\000\331\300\345\000\033\353_",
  write = 0xa7190001eb119078,
  read = 0xe6e320b0050090,
  device = 0xa51e0010a7210001,
  unblank = 0xa7291000b9e28012,
  setup = 0xe320f0a4e320,
  exit = 0x208e0094eb112000,
  match = 0xcec1c006e017f,
  flags = -4984,
  index = 133,
  cflag = 8143136,
  data = 0x800458108004ec12,
  next = 0x5007eaf00
}

crash> x/16i console_drivers
0x79009700: lghi   %r3,-5
0x79009704: aghik  %r2,%r1,1164
0x7900970a: brasl  %r14,0x79386dc8
0x79009710: lghi   %r1,1
0x79009714: laog   %r1,%r1,120(%r9)
0x7900971a: llgc   %r2,5(%r11)
0x79009720: llilh  %r1,16
0x79009724: tmll   %r2,1
0x79009728: lghi   %r2,4096
0x7900972c: locgre %r1,%r2
0x79009730: lg     %r2,160(%r15)
0x79009736: llc    %r2,142(%r2)
0x7900973c: srlg   %r1,%r1,0(%r2)
0x79009742: clij   %r1,1,12,0x7900981e
0x79009748: cgij   %r8,0,8,0x79009852
0x7900974e: la     %r2,4(%r8)

crash> sym 0x79009700
79009700 (t) iomap_finish_ioend+192 /usr/src/debug/kernel-5.10.fc33/linux-5.11.0-20210122.rc4.git0.9f29bd8b2e71.300.fc33.s390x/./include/linux/pagemap.h: 59

So somehow the pointer for console_drivers changes.
I can't provide the normal kernel crash output as printk is no longer
working, but in crash the backtrace looks like this:

crash> bt
PID: 859915  TASK: dad24000  CPU: 9  COMMAND: "ld.so.1"

LOWCORE INFO:
 -psw      : 0x000200018000 0x78c8400e
 -function : stop_run at 78c8400e
 -prefix   : 0x0041e000
 -cpu timer: 0x7953a957e166
 -clock cmp: 0xd92876863ac66b00
 -general registers:
  0xf0f0f0f0 0x78c8400e 0x79984830 0x79984830
  0x79ac62c0 0x0002 0x0038 00
  0x7a19157e 00 0xfffa 00
  0xdad24000 00 0x78c8416e 0x0380033136f0
 -access registers:
  0x7d5b4da0 0xa2a795c0 00 00 00 00 00 00
  00 00 00 00 00 00 00 00
 -control registers:
  0x00a014966a10 0x7a15c007 0x0001d0a0 00
  0x 0x0001d080 0x3b00 0x0001f29b81c7
  0x8000 00 00 00
  00 0x7a15c007 0xdb00 0x0001d0c0
 -floating point registers:
  0x5fabeb67f92f 0x0007fff8 0x4040eb67f000 00
  0x0007 0x0064 0x02aa2a63f028 0x0963cf85
  0x2a642c60 00 0x7f85516ceb67f222 0x03ffeb67f528
  0x03ffeb67f230 0x0001 0x03ffeb67f21d 0x02aa28d2d970

 #0 [38003313710] stop_run at 78c8400e
 #1 [38003313728] atomic_notifier_call_chain at 78ced536
 #2 [38003313778] panic at 7974aeca
 #3 [38003313818] die at 78c897ea
 #4 [38003313880] do_no_context at 78c9b230
 #5 [380033138b8] pgm_check_handler at 7976438c
 PSW: 0404c0018000 78d304ba (kmsg_dump_rewind+290)
 GPRS: 0020 ec88 000e 793f4800
       0224 0224 0005007eaf00 dad24000
       038003313a68 038003313a18

 #0 [38003313a68] console_unlock at 78d3158e
 #1 [38003313b48] vprintk_emit at 78d32f50
 #2 [38003313ba8] vprintk_default at 78d32f9e
 #3 [38003313c00] printk at
[PATCH] s390: allow reschedule on syscall restart
Commit 845f44e8ef28 ("sched: Report local wake up on resched blind zone
within idle loop") from next-20210121 causes a warning because s390
doesn't call sched_resched_local_allow() when restarting a syscall.

Signed-off-by: Sven Schnelle
---
 arch/s390/kernel/syscall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/s390/kernel/syscall.c b/arch/s390/kernel/syscall.c
index bc8e650e377d..2b39ac40f970 100644
--- a/arch/s390/kernel/syscall.c
+++ b/arch/s390/kernel/syscall.c
@@ -162,6 +162,7 @@ void noinstr __do_syscall(struct pt_regs *regs, int per_trap)
 		do_syscall(regs);
 		if (!test_pt_regs_flag(regs, PIF_SYSCALL_RESTART))
 			break;
+		sched_resched_local_allow();
 		local_irq_enable();
 	}
 	exit_to_user_mode();
-- 
2.17.1
Warning with next-20210121 on s390
there's a warning with linux-next on s390:

[   33.893818] systemd-xdg-autostart-generator[544]: Not generating service for XDG autostart app-at\x2dspi\x2ddbus\x2dbus-autostart.service, startup phases are not supported.
[   54.613344] [ cut here ]
[   54.613702] Late current task rescheduling may be lost
[   54.613709] WARNING: CPU: 0 PID: 574 at linux/kernel/sched/core.c:628 sched_resched_local_assert_allowed+0x86/0x90
[   54.613714] Modules linked in:
[   54.613717] CPU: 0 PID: 574 Comm: qemu-system-s39 Not tainted 5.11.0-rc4-next-20210121 #2601
[   54.613721] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[   54.613724] Krnl PSW : 0404e0018000 00d86bea (sched_resched_local_assert_allowed+0x8a/0x90)
[   54.613730] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3

This is because of a non-obvious merge conflict between 56e62a737028
("s390: convert to generic entry") and 845f44e8ef28 ("sched: Report
local wake up on resched blind zone within idle loop"). Can you include
the attached patch into linux-next?

Thanks,
Sven
[tip: core/entry] entry: Add exit_to_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     310de1a678b2184c078c593dae343cb79c807f8d
Gitweb:        https://git.kernel.org/tip/310de1a678b2184c078c593dae343cb79c807f8d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:54 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Add exit_to_user_mode() wrapper

Called from architecture specific code when syscall_exit_to_user_mode()
is not suitable. It simply calls __exit_to_user_mode(). This way
__exit_to_user_mode() can still be inlined because it is declared static
__always_inline.

[ tglx: Amended comments and moved it to a different place in the header ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-5-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 23 +--
 kernel/entry/common.c        | 18 ++
 2 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index da60980..e370be8 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -301,6 +301,25 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
 #endif
 
 /**
+ * exit_to_user_mode - Fixup state when exiting to user mode
+ *
+ * Syscall/interrupt exit enables interrupts, but the kernel state is
+ * interrupts disabled when this is invoked. Also tell RCU about it.
+ *
+ * 1) Trace interrupts on state
+ * 2) Invoke context tracking if enabled to adjust RCU state
+ * 3) Invoke architecture specific last minute exit code, e.g. speculation
+ *    mitigations, etc.: arch_exit_to_user_mode()
+ * 4) Tell lockdep that interrupts are enabled
+ *
+ * Invoked from architecture specific code when syscall_exit_to_user_mode()
+ * is not suitable as the last step before returning to userspace. Must be
+ * invoked with interrupts disabled and the caller must be
+ * non-instrumentable.
+ */
+void exit_to_user_mode(void);
+
+/**
  * syscall_exit_to_user_mode - Handle work before returning to user mode
  * @regs: Pointer to currents pt_regs
  *
@@ -322,8 +341,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
  *    - Architecture specific one time work arch_exit_to_user_mode_prepare()
  *    - Address limit and lockdep checks
  *
- * 3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
- *    arch_exit_to_user_mode() to handle e.g. speculation mitigations
+ * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
+ *    functionality in exit_to_user_mode().
  */
 void syscall_exit_to_user_mode(struct pt_regs *regs);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 17b1e03..48d30ce 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -117,18 +117,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 	instrumentation_end();
 }
 
-/**
- * __exit_to_user_mode - Fixup state when exiting to user mode
- *
- * Syscall/interupt exit enables interrupts, but the kernel state is
- * interrupts disabled when this is invoked. Also tell RCU about it.
- *
- * 1) Trace interrupts on state
- * 2) Invoke context tracking if enabled to adjust RCU state
- * 3) Invoke architecture specific last minute exit code, e.g. speculation
- *    mitigations, etc.
- * 4) Tell lockdep that interrupts are enabled
- */
+/* See comment for exit_to_user_mode() in entry-common.h */
 static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
@@ -141,6 +130,11 @@ static __always_inline void __exit_to_user_mode(void)
 	lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
+void noinstr exit_to_user_mode(void)
+{
+	__exit_to_user_mode();
+}
+
 /* Workaround to allow gradual conversion of architecture code */
 void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
[tip: core/entry] entry: Rename enter_from_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     bb714fb3bc7b2e8be72b9c92f2d8a89ea2dc
Gitweb:        https://git.kernel.org/tip/bb714fb3bc7b2e8be72b9c92f2d8a89ea2dc
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:51 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Rename enter_from_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional enter_from_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-2-sv...@linux.ibm.com
---
 kernel/entry/common.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e661e70..8e294a7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,7 +11,7 @@
 #include 
 
 /**
- * enter_from_user_mode - Establish state when coming from user mode
+ * __enter_from_user_mode - Establish state when coming from user mode
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -20,7 +20,7 @@
  * 2) Invoke context tracking if enabled to reactivate RCU
  * 3) Trace interrupts off state
  */
-static __always_inline void enter_from_user_mode(struct pt_regs *regs)
+static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
 	arch_check_user_regs(regs);
 	lockdep_hardirqs_off(CALLER_ADDR0);
@@ -103,7 +103,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 {
 	long ret;
 
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 
 	instrumentation_begin();
 	local_irq_enable();
@@ -115,7 +115,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall)
 
 noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 	instrumentation_begin();
 	local_irq_enable();
 	instrumentation_end();
@@ -304,7 +304,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
+	__enter_from_user_mode(regs);
 }
 
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
[tip: core/entry] entry: Add enter_from_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     96e2fbccd0fc806364a964fdf072bfc858a66109
Gitweb:        https://git.kernel.org/tip/96e2fbccd0fc806364a964fdf072bfc858a66109
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:53 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Add enter_from_user_mode() wrapper

To be called from architecture specific code if the combo interfaces are
not suitable. It simply calls __enter_from_user_mode(). This way
__enter_from_user_mode will still be inlined because it is declared
static __always_inline.

[ tglx: Amend comments and move it to a different location in the header ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-4-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 24 +++-
 kernel/entry/common.c        | 16 ++--
 2 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index a6e98b4..da60980 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -102,6 +102,27 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
 #endif
 
 /**
+ * enter_from_user_mode - Establish state when coming from user mode
+ *
+ * Syscall/interrupt entry disables interrupts, but user mode is traced as
+ * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
+ *
+ * 1) Tell lockdep that interrupts are disabled
+ * 2) Invoke context tracking if enabled to reactivate RCU
+ * 3) Trace interrupts off state
+ *
+ * Invoked from architecture specific syscall entry code with interrupts
+ * disabled. The calling code has to be non-instrumentable. When the
+ * function returns all state is correct and interrupts are still
+ * disabled. The subsequent functions can be instrumented.
+ *
+ * This is invoked when there is architecture specific functionality to be
+ * done between establishing state and enabling interrupts. The caller must
+ * enable interrupts before invoking syscall_enter_from_user_mode_work().
+ */
+void enter_from_user_mode(struct pt_regs *regs);
+
+/**
  * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
  * @regs: Pointer to currents pt_regs
  *
@@ -110,7 +131,8 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs
  * function returns all state is correct, interrupts are enabled and the
  * subsequent functions can be instrumented.
  *
- * This handles lockdep, RCU (context tracking) and tracing state.
+ * This handles lockdep, RCU (context tracking) and tracing state, i.e.
+ * the functionality provided by enter_from_user_mode().
  *
  * This is invoked when there is extra architecture specific functionality
  * to be done between establishing state and handling user mode entry work.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index dff07b4..17b1e03 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -10,16 +10,7 @@
 #define CREATE_TRACE_POINTS
 #include 
 
-/**
- * __enter_from_user_mode - Establish state when coming from user mode
- *
- * Syscall/interrupt entry disables interrupts, but user mode is traced as
- * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
- *
- * 1) Tell lockdep that interrupts are disabled
- * 2) Invoke context tracking if enabled to reactivate RCU
- * 3) Trace interrupts off state
- */
+/* See comment for enter_from_user_mode() in entry-common.h */
 static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 {
 	arch_check_user_regs(regs);
@@ -33,6 +24,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs)
 	instrumentation_end();
 }
 
+void noinstr enter_from_user_mode(struct pt_regs *regs)
+{
+	__enter_from_user_mode(regs);
+}
+
 static inline void syscall_enter_audit(struct pt_regs *regs, long syscall)
 {
 	if (unlikely(audit_context())) {
[tip: core/entry] entry: Rename exit_to_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     bb793562f0da7317adf6c456316bca651ff46f5d
Gitweb:        https://git.kernel.org/tip/bb793562f0da7317adf6c456316bca651ff46f5d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:52 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:57 +01:00

entry: Rename exit_to_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional exit_to_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-3-sv...@linux.ibm.com
---
 kernel/entry/common.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8e294a7..dff07b4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 }
 
 /**
- * exit_to_user_mode - Fixup state when exiting to user mode
+ * __exit_to_user_mode - Fixup state when exiting to user mode
  *
  * Syscall/interupt exit enables interrupts, but the kernel state is
  * interrupts disabled when this is invoked. Also tell RCU about it.
@@ -133,7 +133,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
  *    mitigations, etc.
  * 4) Tell lockdep that interrupts are enabled
  */
-static __always_inline void exit_to_user_mode(void)
+static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
 	trace_hardirqs_on_prepare();
@@ -299,7 +299,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
@@ -312,7 +312,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	instrumentation_begin();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[tip: core/entry] entry: Add syscall_exit_to_user_mode_work()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     c6156e1da633f241e132eaea3b676d674376d770
Gitweb:        https://git.kernel.org/tip/c6156e1da633f241e132eaea3b676d674376d770
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:55 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 15:07:58 +01:00

entry: Add syscall_exit_to_user_mode_work()

This is the same as syscall_exit_to_user_mode() but without calling
exit_to_user_mode(). This can be used if there is an architectural
reason to avoid the combo function, e.g. restarting a syscall without
returning to userspace. Before returning to user space the caller has
to invoke exit_to_user_mode().

[ tglx: Amended comments ]

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-6-sv...@linux.ibm.com
---
 include/linux/entry-common.h | 20
 kernel/entry/common.c        | 14 --
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index e370be8..7c581a4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -316,10 +316,26 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
  * is not suitable as the last step before returning to userspace. Must be
  * invoked with interrupts disabled and the caller must be
  * non-instrumentable.
+ * The caller has to invoke syscall_exit_to_user_mode_work() before this.
  */
 void exit_to_user_mode(void);
 
 /**
+ * syscall_exit_to_user_mode_work - Handle work before returning to user mode
+ * @regs: Pointer to currents pt_regs
+ *
+ * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling
+ * exit_to_user_mode() to perform the final transition to user mode.
+ *
+ * Calling convention is the same as for syscall_exit_to_user_mode() and it
+ * returns with all work handled and interrupts disabled. The caller must
+ * invoke exit_to_user_mode() before actually switching to user mode to
+ * make the final state transitions. Interrupts must stay disabled between
+ * return from this function and the invocation of exit_to_user_mode().
+ */
+void syscall_exit_to_user_mode_work(struct pt_regs *regs);
+
+/**
  * syscall_exit_to_user_mode - Handle work before returning to user mode
  * @regs: Pointer to currents pt_regs
  *
@@ -343,6 +359,10 @@ void exit_to_user_mode(void);
  *
  * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the
  *    functionality in exit_to_user_mode().
+ *
+ * This is a combination of syscall_exit_to_user_mode_work() (1,2) and
+ * exit_to_user_mode(). This function is preferred unless there is a
+ * compelling architectural reason to use the seperate functions.
  */
 void syscall_exit_to_user_mode(struct pt_regs *regs);
 
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 48d30ce..d6b7393 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -282,12 +282,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 		syscall_exit_work(regs, work);
 }
 
-__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs)
 {
-	instrumentation_begin();
 	syscall_exit_to_user_mode_prepare(regs);
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
+}
+
+void syscall_exit_to_user_mode_work(struct pt_regs *regs)
+{
+	__syscall_exit_to_user_mode_work(regs);
+}
+
+__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	__syscall_exit_to_user_mode_work(regs);
 	instrumentation_end();
 	__exit_to_user_mode();
 }
[tip: core/entry] entry: Rename exit_to_user_mode()
The following commit has been merged into the core/entry branch of tip:

Commit-ID:     f052615ca0112378c3571fc9662bbbcd18cf6b8d
Gitweb:        https://git.kernel.org/tip/f052615ca0112378c3571fc9662bbbcd18cf6b8d
Author:        Sven Schnelle
AuthorDate:    Tue, 01 Dec 2020 15:27:52 +01:00
Committer:     Thomas Gleixner
CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00

entry: Rename exit_to_user_mode()

In order to make this function publicly available rename it so it can
still be inlined. An additional exit_to_user_mode() function will be
added with a later commit.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lore.kernel.org/r/20201201142755.31931-3-sv...@linux.ibm.com
---
 kernel/entry/common.c | 8
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 8e294a7..dff07b4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
 }
 
 /**
- * exit_to_user_mode - Fixup state when exiting to user mode
+ * __exit_to_user_mode - Fixup state when exiting to user mode
  *
  * Syscall/interupt exit enables interrupts, but the kernel state is
  * interrupts disabled when this is invoked. Also tell RCU about it.
@@ -133,7 +133,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
  *    mitigations, etc.
  * 4) Tell lockdep that interrupts are enabled
  */
-static __always_inline void exit_to_user_mode(void)
+static __always_inline void __exit_to_user_mode(void)
 {
 	instrumentation_begin();
 	trace_hardirqs_on_prepare();
@@ -299,7 +299,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs)
 	local_irq_disable_exit_to_user();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
@@ -312,7 +312,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 	instrumentation_begin();
 	exit_to_user_mode_prepare(regs);
 	instrumentation_end();
-	exit_to_user_mode();
+	__exit_to_user_mode();
 }
 
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[tip: core/entry] entry: Add enter_from_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip: Commit-ID: 6546601d7e7c8da38f5d87204aa01f7b394f4fb3 Gitweb: https://git.kernel.org/tip/6546601d7e7c8da38f5d87204aa01f7b394f4fb3 Author: Sven Schnelle AuthorDate: Tue, 01 Dec 2020 15:27:53 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add enter_from_user_mode() wrapper To be called from architecture specific code if the combo interfaces are not suitable. It simply calls __enter_from_user_mode(). This way __enter_from_user_mode() will still be inlined because it is declared static __always_inline. [ tglx: Amend comments and move it to a different location in the header ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-4-sv...@linux.ibm.com --- include/linux/entry-common.h | 24 +++- kernel/entry/common.c| 16 ++-- 2 files changed, 29 insertions(+), 11 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index a6e98b4..da60980 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -102,6 +102,27 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs #endif /** + * enter_from_user_mode - Establish state when coming from user mode + * + * Syscall/interrupt entry disables interrupts, but user mode is traced as + * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. + * + * 1) Tell lockdep that interrupts are disabled + * 2) Invoke context tracking if enabled to reactivate RCU + * 3) Trace interrupts off state + * + * Invoked from architecture specific syscall entry code with interrupts + * disabled. The calling code has to be non-instrumentable. When the + * function returns all state is correct and interrupts are still + * disabled. The subsequent functions can be instrumented. 
+ * + * This is invoked when there is architecture specific functionality to be + * done between establishing state and enabling interrupts. The caller must + * enable interrupts before invoking syscall_enter_from_user_mode_work(). + */ +void enter_from_user_mode(struct pt_regs *regs); + +/** * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts * @regs: Pointer to currents pt_regs * @@ -110,7 +131,8 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs * function returns all state is correct, interrupts are enabled and the * subsequent functions can be instrumented. * - * This handles lockdep, RCU (context tracking) and tracing state. + * This handles lockdep, RCU (context tracking) and tracing state, i.e. + * the functionality provided by enter_from_user_mode(). * * This is invoked when there is extra architecture specific functionality * to be done between establishing state and handling user mode entry work. diff --git a/kernel/entry/common.c b/kernel/entry/common.c index dff07b4..17b1e03 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -10,16 +10,7 @@ #define CREATE_TRACE_POINTS #include -/** - * __enter_from_user_mode - Establish state when coming from user mode - * - * Syscall/interrupt entry disables interrupts, but user mode is traced as - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
- * - * 1) Tell lockdep that interrupts are disabled - * 2) Invoke context tracking if enabled to reactivate RCU - * 3) Trace interrupts off state - */ +/* See comment for enter_from_user_mode() in entry-common.h */ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); @@ -33,6 +24,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) {
[tip: core/entry] entry: Add syscall_exit_to_user_mode_work()
The following commit has been merged into the core/entry branch of tip: Commit-ID: 1568b5540b3e6ff3fe43a2cf889cb777cf8149fc Gitweb: https://git.kernel.org/tip/1568b5540b3e6ff3fe43a2cf889cb777cf8149fc Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:55 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add syscall_exit_to_user_mode_work() This is the same as syscall_exit_to_user_mode() but without calling exit_to_user_mode(). This can be used if there is an architectural reason to avoid the combo function, e.g. restarting a syscall without returning to userspace. Before returning to user space the caller has to invoke exit_to_user_mode(). [ tglx: Amended comments ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-6-sv...@linux.ibm.com --- include/linux/entry-common.h | 20 kernel/entry/common.c| 14 -- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index e370be8..7c581a4 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -316,10 +316,26 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * is not suitable as the last step before returning to userspace. Must be * invoked with interrupts disabled and the caller must be * non-instrumentable. + * The caller has to invoke syscall_exit_to_user_mode_work() before this. */ void exit_to_user_mode(void); /** + * syscall_exit_to_user_mode_work - Handle work before returning to user mode + * @regs: Pointer to currents pt_regs + * + * Same as step 1 and 2 of syscall_exit_to_user_mode() but without calling + * exit_to_user_mode() to perform the final transition to user mode. + * + * Calling convention is the same as for syscall_exit_to_user_mode() and it + * returns with all work handled and interrupts disabled. 
The caller must + * invoke exit_to_user_mode() before actually switching to user mode to + * make the final state transitions. Interrupts must stay disabled between + * return from this function and the invocation of exit_to_user_mode(). + */ +void syscall_exit_to_user_mode_work(struct pt_regs *regs); + +/** * syscall_exit_to_user_mode - Handle work before returning to user mode * @regs: Pointer to currents pt_regs * @@ -343,6 +359,10 @@ void exit_to_user_mode(void); * * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the * functionality in exit_to_user_mode(). + * + * This is a combination of syscall_exit_to_user_mode_work() (1,2) and + * exit_to_user_mode(). This function is preferred unless there is a + * compelling architectural reason to use the seperate functions. */ void syscall_exit_to_user_mode(struct pt_regs *regs); diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 48d30ce..d6b7393 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -282,12 +282,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) syscall_exit_work(regs, work); } -__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs) { - instrumentation_begin(); syscall_exit_to_user_mode_prepare(regs); local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); +} + +void syscall_exit_to_user_mode_work(struct pt_regs *regs) +{ + __syscall_exit_to_user_mode_work(regs); +} + +__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +{ + instrumentation_begin(); + __syscall_exit_to_user_mode_work(regs); instrumentation_end(); __exit_to_user_mode(); }
[tip: core/entry] entry: Add exit_to_user_mode() wrapper
The following commit has been merged into the core/entry branch of tip: Commit-ID: 7918e4b844d1481c2200445f758bb2d1cd14346c Gitweb: https://git.kernel.org/tip/7918e4b844d1481c2200445f758bb2d1cd14346c Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:54 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:18 +01:00 entry: Add exit_to_user_mode() wrapper Called from architecture specific code when syscall_exit_to_user_mode() is not suitable. It simply calls __exit_to_user_mode(). This way __exit_to_user_mode() can still be inlined because it is declared static __always_inline. [ tglx: Amended comments and moved it to a different place in the header ] Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-5-sv...@linux.ibm.com --- include/linux/entry-common.h | 23 +-- kernel/entry/common.c| 18 ++ 2 files changed, 27 insertions(+), 14 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index da60980..e370be8 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -301,6 +301,25 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) #endif /** + * exit_to_user_mode - Fixup state when exiting to user mode + * + * Syscall/interrupt exit enables interrupts, but the kernel state is + * interrupts disabled when this is invoked. Also tell RCU about it. + * + * 1) Trace interrupts on state + * 2) Invoke context tracking if enabled to adjust RCU state + * 3) Invoke architecture specific last minute exit code, e.g. speculation + *mitigations, etc.: arch_exit_to_user_mode() + * 4) Tell lockdep that interrupts are enabled + * + * Invoked from architecture specific code when syscall_exit_to_user_mode() + * is not suitable as the last step before returning to userspace. Must be + * invoked with interrupts disabled and the caller must be + * non-instrumentable. 
+ */ +void exit_to_user_mode(void); + +/** * syscall_exit_to_user_mode - Handle work before returning to user mode * @regs: Pointer to currents pt_regs * @@ -322,8 +341,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * - Architecture specific one time work arch_exit_to_user_mode_prepare() * - Address limit and lockdep checks * - * 3) Final transition (lockdep, tracing, context tracking, RCU). Invokes - * arch_exit_to_user_mode() to handle e.g. speculation mitigations + * 3) Final transition (lockdep, tracing, context tracking, RCU), i.e. the + * functionality in exit_to_user_mode(). */ void syscall_exit_to_user_mode(struct pt_regs *regs); diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 17b1e03..48d30ce 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -117,18 +117,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) instrumentation_end(); } -/** - * __exit_to_user_mode - Fixup state when exiting to user mode - * - * Syscall/interupt exit enables interrupts, but the kernel state is - * interrupts disabled when this is invoked. Also tell RCU about it. - * - * 1) Trace interrupts on state - * 2) Invoke context tracking if enabled to adjust RCU state - * 3) Invoke architecture specific last minute exit code, e.g. speculation - *mitigations, etc. - * 4) Tell lockdep that interrupts are enabled - */ +/* See comment for exit_to_user_mode() in entry-common.h */ static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); @@ -141,6 +130,11 @@ static __always_inline void __exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal) { }
[tip: core/entry] entry: Rename enter_from_user_mode()
The following commit has been merged into the core/entry branch of tip: Commit-ID: e2391bd55155ca98293b1bbaa44aa2815d3d054f Gitweb: https://git.kernel.org/tip/e2391bd55155ca98293b1bbaa44aa2815d3d054f Author:Sven Schnelle AuthorDate:Tue, 01 Dec 2020 15:27:51 +01:00 Committer: Thomas Gleixner CommitterDate: Wed, 02 Dec 2020 10:32:17 +01:00 entry: Rename enter_from_user_mode() In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/r/20201201142755.31931-2-sv...@linux.ibm.com --- kernel/entry/common.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e661e70..8e294a7 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -11,7 +11,7 @@ #include /** - * enter_from_user_mode - Establish state when coming from user mode + * __enter_from_user_mode - Establish state when coming from user mode * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
@@ -20,7 +20,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -103,7 +103,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -115,7 +115,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); @@ -304,7 +304,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
Re: [PATCH v2 2/5] entry: rename exit_from_user_mode()
Sven Schnelle writes: > In order to make this function publicly available rename > it so it can still be inlined. An additional exit_from_user_mode() > function will be added with a later commit. That should of course be exit_to_user_mode() in the commit description... > Signed-off-by: Sven Schnelle > --- > kernel/entry/common.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > index 683a8e1b5388..076ee1cde67f 100644 > --- a/kernel/entry/common.c > +++ b/kernel/entry/common.c > @@ -111,7 +111,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct > pt_regs *regs) > } > > /** > - * exit_to_user_mode - Fixup state when exiting to user mode > + * __exit_to_user_mode - Fixup state when exiting to user mode > * > * Syscall/interupt exit enables interrupts, but the kernel state is > * interrupts disabled when this is invoked. Also tell RCU about it. > @@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct > pt_regs *regs) > *mitigations, etc. > * 4) Tell lockdep that interrupts are enabled > */ > -static __always_inline void exit_to_user_mode(void) > +static __always_inline void __exit_to_user_mode(void) > { > instrumentation_begin(); > trace_hardirqs_on_prepare(); > @@ -265,7 +265,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct > pt_regs *regs) > local_irq_disable_exit_to_user(); > exit_to_user_mode_prepare(regs); > instrumentation_end(); > - exit_to_user_mode(); > + __exit_to_user_mode(); > } > > noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) > @@ -278,7 +278,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs > *regs) > instrumentation_begin(); > exit_to_user_mode_prepare(regs); > instrumentation_end(); > - exit_to_user_mode(); > + __exit_to_user_mode(); > } > > noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
[PATCH v2 4/5] entry: add exit_to_user_mode() wrapper
Can be called from architecture dependent code, and simply calls __exit_to_user_mode(). This way __exit_to_user_mode() can still be inlined because it is declared static. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 15 +++ kernel/entry/common.c| 17 + 2 files changed, 20 insertions(+), 12 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 28a8554fad7d..112007525f50 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -413,4 +413,19 @@ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); * 3) Trace interrupts off state */ void enter_from_user_mode(struct pt_regs *regs); + +/** + * exit_to_user_mode - Fixup state when exiting to user mode + * + * Syscall/interrupt exit enables interrupts, but the kernel state is + * interrupts disabled when this is invoked. Also tell RCU about it. + * + * 1) Trace interrupts on state + * 2) Invoke context tracking if enabled to adjust RCU state + * 3) Invoke architecture specific last minute exit code, e.g. speculation + *mitigations, etc. + * 4) Tell lockdep that interrupts are enabled + */ + +void exit_to_user_mode(void); #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index ee588ee9f122..e696f6912642 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -105,18 +105,6 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) instrumentation_end(); } -/** - * __exit_to_user_mode - Fixup state when exiting to user mode - * - * Syscall/interupt exit enables interrupts, but the kernel state is - * interrupts disabled when this is invoked. Also tell RCU about it. - * - * 1) Trace interrupts on state - * 2) Invoke context tracking if enabled to adjust RCU state - * 3) Invoke architecture specific last minute exit code, e.g. speculation - *mitigations, etc. 
- * 4) Tell lockdep that interrupts are enabled - */ static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); @@ -129,6 +117,11 @@ static __always_inline void __exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } -- 2.17.1
[PATCH v2 2/5] entry: rename exit_from_user_mode()
In order to make this function publicly available rename it so it can still be inlined. An additional exit_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle --- kernel/entry/common.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 683a8e1b5388..076ee1cde67f 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -111,7 +111,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) } /** - * exit_to_user_mode - Fixup state when exiting to user mode + * __exit_to_user_mode - Fixup state when exiting to user mode * * Syscall/interupt exit enables interrupts, but the kernel state is * interrupts disabled when this is invoked. Also tell RCU about it. @@ -122,7 +122,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) *mitigations, etc. * 4) Tell lockdep that interrupts are enabled */ -static __always_inline void exit_to_user_mode(void) +static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); trace_hardirqs_on_prepare(); @@ -265,7 +265,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) @@ -278,7 +278,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) instrumentation_begin(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs) -- 2.17.1
[PATCH v2 3/5] entry: add enter_from_user_mode() wrapper
Can be called from architecture dependent code, and simply calls __enter_from_user_mode(). This way __enter_from_user_mode can still be inlined because it is declared static. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 11 +++ kernel/entry/common.c| 15 +-- 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..28a8554fad7d 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -402,4 +402,15 @@ void irqentry_exit_cond_resched(void); */ void noinstr irqentry_exit(struct pt_regs *regs, irqentry_state_t state); +/** + * enter_from_user_mode - Establish state when coming from user mode + * + * Syscall/interrupt entry disables interrupts, but user mode is traced as + * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. + * + * 1) Tell lockdep that interrupts are disabled + * 2) Invoke context tracking if enabled to reactivate RCU + * 3) Trace interrupts off state + */ +void enter_from_user_mode(struct pt_regs *regs); #endif diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 076ee1cde67f..ee588ee9f122 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -8,16 +8,6 @@ #define CREATE_TRACE_POINTS #include -/** - * __enter_from_user_mode - Establish state when coming from user mode - * - * Syscall/interrupt entry disables interrupts, but user mode is traced as - * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. 
- * - * 1) Tell lockdep that interrupts are disabled - * 2) Invoke context tracking if enabled to reactivate RCU - * 3) Trace interrupts off state - */ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); @@ -31,6 +21,11 @@ static __always_inline void __enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) { -- 2.17.1
[PATCH v2 1/5] entry: rename enter_from_user_mode()
In order to make this function publicly available rename it so it can still be inlined. An additional enter_from_user_mode() function will be added with a later commit. Signed-off-by: Sven Schnelle --- kernel/entry/common.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..683a8e1b5388 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -9,7 +9,7 @@ #include /** - * enter_from_user_mode - Establish state when coming from user mode + * __enter_from_user_mode - Establish state when coming from user mode * * Syscall/interrupt entry disables interrupts, but user mode is traced as * interrupts enabled. Also with NO_HZ_FULL RCU might be idle. @@ -18,7 +18,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -92,7 +92,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -104,7 +104,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); @@ -270,7 +270,7 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) -- 2.17.1
[PATCH v2 5/5] entry: add syscall_exit_to_user_mode_work()
This is the same as syscall_exit_to_user_mode() but without calling exit_to_user_mode(). This is useful if a syscall has to be restarted without leaving to user space. Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 9 + kernel/entry/common.c| 14 -- 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 112007525f50..bdf6b005bbfb 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -312,6 +312,15 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) */ void syscall_exit_to_user_mode(struct pt_regs *regs); +/** + * syscall_exit_to_user_mode_work - Handle work before returning to user mode + * @regs: Pointer to currents pt_regs + * + * Same as syscall_exit_to_user_mode() but without calling exit_to_user_mode() + * to perform the final transition to user mode. + */ +void syscall_exit_to_user_mode_work(struct pt_regs *regs); + /** * irqentry_enter_from_user_mode - Establish state before invoking the irq handler * @regs: Pointer to currents pt_regs diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e696f6912642..c3c4ba21824a 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -246,12 +246,22 @@ static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs) syscall_exit_work(regs, cached_flags); } -__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +static __always_inline void __syscall_exit_to_user_mode_work(struct pt_regs *regs) { - instrumentation_begin(); syscall_exit_to_user_mode_prepare(regs); local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); +} + +void syscall_exit_to_user_mode_work(struct pt_regs *regs) +{ + __syscall_exit_to_user_mode_work(regs); +} + +__visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) +{ + instrumentation_begin(); + __syscall_exit_to_user_mode_work(regs); instrumentation_end(); __exit_to_user_mode(); } -- 2.17.1
[PATCH v2] split up lockdep and syscall related functionality in generic entry code
This is v2 of the patch. I've split it up into multiple ones and added documentation. The first patch was basically a hack to demonstrate what I need, sorry for not sending a more cleaned-up version. As additional explanation here's the content of the v1 cover letter: I'm currently working on converting s390 to use the generic entry functionality. So far things are straight-forward, there's only one slight problem. There is a syscall_enter_from_user_mode() which sets lockdep state and other initial stuff + does the entry work at the same time. This is a problem on s390 because the way we restart syscalls isn't as easy as on x86. My understanding on x86 is that syscalls are restarted there by just rewinding the program counter and returning to user space, so the instruction causing the syscall gets executed again. On s390 this doesn't work, because the syscall number might be hard-coded into the 'svc' instruction, so when the syscall number has to be changed we would repeat the wrong (old) syscall. So we would need functions that only do the stuff that is required when switching from user space to kernel and back, and functions which do the system call tracing and work which might be called repeatedly. 
With the attached patch, the s390 code now looks like this (I removed some s390-specific stuff here to make the function easier to read). __do_syscall is the function which gets called by low level entry.S code: void noinstr __do_syscall(struct pt_regs *regs) { enter_from_user_mode(regs); /* sets lockdep state, and other initial stuff */ /* * functions that need to run with irqs disabled, * but lockdep state and other stuff set up */ memcpy(&regs->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long)); memcpy(&regs->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code)); regs->psw = S390_lowcore.svc_old_psw; update_timer_sys(); local_irq_enable(); regs->orig_gpr2 = regs->gprs[2]; do { regs->flags = _PIF_SYSCALL; do_syscall(regs); } while (test_pt_regs_flag(regs, PIF_SYSCALL_RESTART)); exit_to_user_mode(); } __do_syscall calls do_syscall which does all the syscall work, and this might be called more than once if PIF_SYSCALL_RESTART is set: void do_syscall(struct pt_regs *regs) { unsigned long nr = regs->int_code & 0xffff; nr = syscall_enter_from_user_mode_work(regs, nr); regs->gprs[2] = -ENOSYS; if (likely(nr < NR_syscalls)) { regs->gprs[2] = current->thread.sys_call_table[nr]( regs->orig_gpr2, regs->gprs[3], regs->gprs[4], regs->gprs[5], regs->gprs[6], regs->gprs[7]); } syscall_exit_to_user_mode_work(regs); }
Re: [PATCH 1/2] sched/idle: Fix arch_cpu_idle() vs tracing
Hi Peter, Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:00:03PM -0800, Guenter Roeck wrote: >> On Fri, Nov 20, 2020 at 12:41:46PM +0100, Peter Zijlstra wrote: >> > We call arch_cpu_idle() with RCU disabled, but then use >> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. >> > >> > Switch all arch_cpu_idle() implementations to use >> > raw_local_irq_{en,dis}able() and carefully manage the >> > lockdep,rcu,tracing state like we do in entry. >> > >> > (XXX: we really should change arch_cpu_idle() to not return with >> > interrupts enabled) >> > >> >> Has this patch been tested on s390 ? Reason for asking is that it causes >> all my s390 emulations to crash. Reverting it fixes the problem. > > My understanding is that it changes the error on s390. Previously it > would complain about the local_irq_enable() in arch_cpu_idle(), now it > complains when taking an interrupt during idle. I looked into adding the required functionality for s390, but the code we would need to add to entry.S is rather large - as you noted we would have to duplicate large portions of irqentry_enter() into our code. Given that s390 was fine before that patch, can you revert it and submit it again during the next merge window? Thanks Sven
Re: [PATCH 1/2] sched/idle: Fix arch_cpu_idle() vs tracing
Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:00:03PM -0800, Guenter Roeck wrote: >> On Fri, Nov 20, 2020 at 12:41:46PM +0100, Peter Zijlstra wrote: >> > We call arch_cpu_idle() with RCU disabled, but then use >> > local_irq_{en,dis}able(), which invokes tracing, which relies on RCU. >> > >> > Switch all arch_cpu_idle() implementations to use >> > raw_local_irq_{en,dis}able() and carefully manage the >> > lockdep,rcu,tracing state like we do in entry. >> > >> > (XXX: we really should change arch_cpu_idle() to not return with >> > interrupts enabled) >> > >> >> Has this patch been tested on s390 ? Reason for asking is that it causes >> all my s390 emulations to crash. Reverting it fixes the problem. > > My understanding is that it changes the error on s390. Previously it > would complain about the local_irq_enable() in arch_cpu_idle(), now it > complains when taking an interrupt during idle. The errors on s390 were all fixed in the meantime. I cannot say which patch fixed it, but we haven't seen any warning in our internal CI during the last weeks. So reverting the patch would likely fix the issue for us. s390 is likely to switch to generic entry with the next merge window (I'm working on it), so all that stuff will be easier then.
[PATCH] entry: split lockdep and syscall work functions
On s390, we cannot call one function which both sets lockdep state and does the syscall work at the same time. Therefore make enter_from_user_mode() and exit_to_user_mode() public, and add syscall_exit_to_user_mode1(), which does the same as syscall_exit_to_user_mode() but skips the final exit_to_user_mode(). Signed-off-by: Sven Schnelle --- include/linux/entry-common.h | 4 +++- kernel/entry/common.c | 35 +++ 2 files changed, 30 insertions(+), 9 deletions(-) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 474f29638d2c..496c9a47eab4 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -124,7 +124,7 @@ static inline __must_check int arch_syscall_enter_tracehook(struct pt_regs *regs * to be done between establishing state and handling user mode entry work. */ void syscall_enter_from_user_mode_prepare(struct pt_regs *regs); - +void enter_from_user_mode(struct pt_regs *regs); /** * syscall_enter_from_user_mode_work - Check and handle work before invoking *a syscall @@ -311,6 +311,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step) * arch_exit_to_user_mode() to handle e.g. 
speculation mitigations */ void syscall_exit_to_user_mode(struct pt_regs *regs); +void syscall_exit_to_user_mode1(struct pt_regs *regs); +void exit_to_user_mode(void); /** * irqentry_enter_from_user_mode - Establish state before invoking the irq handler diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..3ad462ebfa15 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -18,7 +18,7 @@ * 2) Invoke context tracking if enabled to reactivate RCU * 3) Trace interrupts off state */ -static __always_inline void enter_from_user_mode(struct pt_regs *regs) +static __always_inline void __enter_from_user_mode(struct pt_regs *regs) { arch_check_user_regs(regs); lockdep_hardirqs_off(CALLER_ADDR0); @@ -31,6 +31,11 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs) instrumentation_end(); } +void noinstr enter_from_user_mode(struct pt_regs *regs) +{ + __enter_from_user_mode(regs); +} + static inline void syscall_enter_audit(struct pt_regs *regs, long syscall) { if (unlikely(audit_context())) { @@ -92,7 +97,7 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) { long ret; - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); @@ -104,14 +109,14 @@ noinstr long syscall_enter_from_user_mode(struct pt_regs *regs, long syscall) noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); instrumentation_begin(); local_irq_enable(); instrumentation_end(); } /** - * exit_to_user_mode - Fixup state when exiting to user mode + * __exit_to_user_mode - Fixup state when exiting to user mode * * Syscall/interupt exit enables interrupts, but the kernel state is * interrupts disabled when this is invoked. Also tell RCU about it. @@ -122,7 +127,7 @@ noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs) *mitigations, etc. 
* 4) Tell lockdep that interrupts are enabled */ -static __always_inline void exit_to_user_mode(void) +static __always_inline void __exit_to_user_mode(void) { instrumentation_begin(); trace_hardirqs_on_prepare(); @@ -134,6 +139,11 @@ static __always_inline void exit_to_user_mode(void) lockdep_hardirqs_on(CALLER_ADDR0); } +void noinstr exit_to_user_mode(void) +{ + __exit_to_user_mode(); +} + /* Workaround to allow gradual conversion of architecture code */ void __weak arch_do_signal(struct pt_regs *regs) { } @@ -265,12 +275,21 @@ __visible noinstr void syscall_exit_to_user_mode(struct pt_regs *regs) local_irq_disable_exit_to_user(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); +} + +__visible noinstr void syscall_exit_to_user_mode1(struct pt_regs *regs) +{ + instrumentation_begin(); + syscall_exit_to_user_mode_prepare(regs); + local_irq_disable_exit_to_user(); + exit_to_user_mode_prepare(regs); + instrumentation_end(); } noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); + __enter_from_user_mode(regs); } noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) @@ -278,7 +297,7 @@ noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs) instrumentation_begin(); exit_to_user_mode_prepare(regs); instrumentation_end(); - exit_to_user_mode(); + __exit_to_user_mode(); } noinstr irqentry_state_t irqentry_enter(struct
split up lockdep and syscall related functionality in generic entry code
I'm currently working on converting s390 to use the generic entry functionality. So far things are straightforward, there's only one slight problem. There is a syscall_enter_from_user_mode() which sets lockdep state and other initial stuff + does the entry work at the same time. This is a problem on s390 because the way we restart syscalls isn't as easy as on x86. My understanding on x86 is that syscalls are restarted there by just rewinding the program counter and returning to user space, so the instruction causing the syscall gets executed again. On s390 this doesn't work, because the syscall number might be hard coded into the 'svc' instruction, so when the syscall number has to be changed we would repeat the wrong (old) syscall. So we would need functions that only do the stuff that is required when switching from user space to kernel and back, and functions which do the system call tracing and work which might be called repeatedly. With the attached patch, the s390 code now looks like this: (I removed some s390 specific stuff here to make the function easier to read) __do_syscall is the function which gets called by low level entry.S code: void noinstr __do_syscall(struct pt_regs *regs) { enter_from_user_mode(regs); /* sets lockdep state, and other initial stuff */ /* * functions that need to run with irqs disabled, * but lockdep state and other stuff set up */ memcpy(&regs->gprs[8], S390_lowcore.save_area_sync, 8 * sizeof(unsigned long)); memcpy(&regs->int_code, &S390_lowcore.svc_ilc, sizeof(regs->int_code)); regs->psw = S390_lowcore.svc_old_psw; update_timer_sys(); local_irq_enable(); regs->orig_gpr2 = regs->gprs[2]; do { regs->flags = _PIF_SYSCALL; do_syscall(regs); } while (test_pt_regs_flag(regs, PIF_SYSCALL_RESTART)); exit_to_user_mode(); } __do_syscall calls do_syscall which does all the syscall work, and this might be called more than once if PIF_SYSCALL_RESTART is set: void do_syscall(struct pt_regs *regs) { unsigned long nr = regs->int_code & 0xffff; nr = 
syscall_enter_from_user_mode_work(regs, nr); regs->gprs[2] = -ENOSYS; if (likely(nr < NR_syscalls)) { regs->gprs[2] = current->thread.sys_call_table[nr]( regs->orig_gpr2, regs->gprs[3], regs->gprs[4], regs->gprs[5], regs->gprs[6], regs->gprs[7]); } syscall_exit_to_user_mode1(regs); } What do you think about the attached patch? I'm also open for a proper name for syscall_exit_to_user_mode1() ;-)
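To make the restart semantics concrete, here is a minimal user-space sketch of the loop in __do_syscall() above. Everything in it (struct pt_regs_sketch, do_syscall_sketch, the flag values) is made up for illustration and is not kernel code; the point is only that the restart happens inside the kernel loop, instead of re-executing the 'svc' instruction in user space:

```c
#include <assert.h>

#define PIF_SYSCALL         0x1UL
#define PIF_SYSCALL_RESTART 0x2UL

/* hypothetical stand-in for struct pt_regs */
struct pt_regs_sketch {
	unsigned long flags;
	int calls; /* counts how often the syscall work ran */
};

/* stand-in for do_syscall(): pretend the first invocation requests a
 * restart, e.g. because signal handling changed the syscall number */
static void do_syscall_sketch(struct pt_regs_sketch *regs)
{
	regs->calls++;
	if (regs->calls == 1)
		regs->flags |= PIF_SYSCALL_RESTART;
}

/* mirrors the loop in __do_syscall(): re-run the syscall work while
 * PIF_SYSCALL_RESTART is set, without rewinding the program counter */
static int run_syscall_loop(struct pt_regs_sketch *regs)
{
	do {
		regs->flags = PIF_SYSCALL; /* clears PIF_SYSCALL_RESTART */
		do_syscall_sketch(regs);
	} while (regs->flags & PIF_SYSCALL_RESTART);
	return regs->calls;
}
```

With one requested restart, the syscall work runs exactly twice before exit_to_user_mode() would be reached.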
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi Peter, Peter Zijlstra writes: > On Mon, Nov 30, 2020 at 01:52:11PM +0100, Peter Zijlstra wrote: >> On Mon, Nov 30, 2020 at 01:31:33PM +0100, Sven Schnelle wrote: >> > [0.670280] [ cut here ] >> > [0.670288] WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:1054 >> > rcu_irq_enter+0x7e/0xa8 >> > [0.670293] Modules linked in: >> > [0.670299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW >> > 5.10.0-rc6 #2263 >> > [0.670304] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) >> > [0.670309] Krnl PSW : 0404d0018000 00d8a8da >> > (rcu_irq_enter+0x82/0xa8) >> > [0.670318]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 >> > PM:0 RI:0 EA:3 >> > [0.670325] Krnl GPRS: 8002 >> > 0001 0101fcee >> > [0.670331] >> > >> > [0.670337]03e00029ff48 >> > 017212d8 0001 >> > [0.670343]05ba0100 000324bb >> > 03e00029fe40 03e00029fe10 >> > >> > [0.670358] Krnl Code: 00d8a8ca: ec180013017ecij >> > %r1,1,8,00d8a8f0 >> > [0.670358]00d8a8d0: ecb80005007ecij >> > %r11,0,8,00d8a8da >> > [0.670358] #00d8a8d6: af00mc >> > 0,0 >> > [0.670358] >00d8a8da: ebbff0a4lmg >> > %r11,%r15,160(%r15) >> > [0.670358]00d8a8e0: c0f4ff68brcl >> > 15,00d8a7b0 >> > [0.670358]00d8a8e6: c0e538c1brasl >> > %r14,00d91a68 >> > [0.670358]00d8a8ec: a7f4ffdcbrc >> > 15,00d8a8a4 >> > [0.670358]00d8a8f0: c0e538bcbrasl >> > %r14,00d91a68 >> > [0.670392] Call Trace: >> > [0.670396] [<00d8a8da>] rcu_irq_enter+0x82/0xa8 >> > [0.670401] [<00157f9a>] irq_enter+0x22/0x30 >> > [0.670404] [<0010e51c>] do_IRQ+0x64/0xd0 >> > [0.670408] [<00d9a65a>] ext_int_handler+0x18e/0x194 >> > [0.670412] [<00d9a6a0>] psw_idle+0x40/0x48 >> > [0.670416] ([<00104202>] enabled_wait+0x22/0xf0) >> > [0.670419] [<001046e2>] arch_cpu_idle+0x22/0x38 >> > [0.670423] [<00d986cc>] default_idle_call+0x74/0xd8 >> > [0.670427] [<0019a94a>] do_idle+0xf2/0x1b0 >> > [0.670431] [<0019ac7e>] cpu_startup_entry+0x36/0x40 >> > [0.670435] [<00118b9a>] smp_start_secondary+0x82/0x88 >> >> But but but... 
>> >> do_idle() # IRQs on >> local_irq_disable(); # IRQs off >> defaul_idle_call() # IRQs off > lockdep_hardirqs_on(); # IRQs off, but lockdep things they're on >> arch_cpu_idle()# IRQs off >> enabled_wait() # IRQs off >>raw_local_save() # still off >>psw_idle()# very much off >> ext_int_handler # get an interrupt ?!?! > rcu_irq_enter() # lockdep thinks IRQs are on <- FAIL > > > I can't much read s390 assembler, but ext_int_handler() has a > TRACE_IRQS_OFF, which would be sufficient to re-align the lockdep state > with the actual state, but there's some condition before it, what's that > test and is that right? That test was introduced to only track changes in IRQ state, because of recursion problems in lockdep. This now seems to no longer work. We probably could remove it, as lockdep can now handle recursion much better with all the recent changes, but given that we're already at -rc6, I don't want to touch entry.S code because of a debugging feature.
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi, Sven Schnelle writes: > Hi Peter, > > Peter Zijlstra writes: > >> On Sun, Nov 29, 2020 at 11:31:41AM -0800, Linus Torvalds wrote: >>> On Sun, Nov 29, 2020 at 5:38 AM Thomas Gleixner wrote: >>> > >>> > Yet two more places which invoke tracing from RCU disabled regions in the >>> > idle path. Similar to the entry path the low level idle functions have to >>> > be non-instrumentable. >>> >>> This really seems less than optimal. >>> >>> In particular, lookie here: >>> >>> > @@ -94,9 +94,35 @@ void __cpuidle default_idle_call(void) >>> > >>> > trace_cpu_idle(1, smp_processor_id()); >>> > stop_critical_timings(); >>> > + >>> > + /* >>> > +* arch_cpu_idle() is supposed to enable IRQs, however >>> > +* we can't do that because of RCU and tracing. >>> > +* >>> > +* Trace IRQs enable here, then switch off RCU, and have >>> > +* arch_cpu_idle() use raw_local_irq_enable(). Note that >>> > +* rcu_idle_enter() relies on lockdep IRQ state, so >>> > switch that >>> > +* last -- this is very similar to the entry code. >>> > +*/ >>> > + trace_hardirqs_on_prepare(); >>> > + lockdep_hardirqs_on_prepare(_THIS_IP_); >>> > rcu_idle_enter(); >>> > + lockdep_hardirqs_on(_THIS_IP_); >>> > + >>> > arch_cpu_idle(); >>> > + >>> > + /* >>> > +* OK, so IRQs are enabled here, but RCU needs them >>> > disabled to >>> > +* turn itself back on.. funny thing is that disabling >>> > IRQs >>> > +* will cause tracing, which needs RCU. Jump through >>> > hoops to >>> > +* make it 'work'. >>> > +*/ >>> > + raw_local_irq_disable(); >>> > + lockdep_hardirqs_off(_THIS_IP_); >>> > rcu_idle_exit(); >>> > + lockdep_hardirqs_on(_THIS_IP_); >>> > + raw_local_irq_enable(); >>> > + >>> > start_critical_timings(); >>> > trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); >>> > } >>> >>> And look at what the code generation for the idle exit path is when >>> lockdep isn't even on. >> >> Agreed. >> >> The idea was to flip all of arch_cpu_idle() to not enable interrupts. 
>> >> This is suboptimal for things like x86 where arch_cpu_idle() is >> basically STI;HLT, but x86 isn't likely to actually use this code path >> anyway, given all the various cpuidle drivers it has. >> >> Many of the other archs are now doing things like arm's: >> wfi();raw_local_irq_enable(). >> >> Doing that tree-wide interrupt-state flip was something I didn't want to >> do at this late a stage, the chanse of messing that up is just too high. >> >> After that I need to go look at flipping cpuidle, which is even more >> 'interesting'. cpuidle_enter() has the exact same semantics, and this is >> the code path that x86 actually uses, and here it's inconsitent at best. > > On s390 this introduces the following splat: > [..] I sent you the wrong backtrace. This is the correct one: [0.667491] smp: Bringing up secondary CPUs ... [0.670262] random: get_random_bytes called from __warn+0x12a/0x160 with crng_init=1 [0.670280] [ cut here ] [0.670288] WARNING: CPU: 1 PID: 0 at kernel/rcu/tree.c:1054 rcu_irq_enter+0x7e/0xa8 [0.670293] Modules linked in: [0.670299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW 5.10.0-rc6 #2263 [0.670304] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [0.670309] Krnl PSW : 0404d0018000 00d8a8da (rcu_irq_enter+0x82/0xa8) [0.670318]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [0.670325] Krnl GPRS: 8002 0001 0101fcee [0.670331] [0.670337]03e00029ff48 017212d8 0001 [0.670343]05ba0100 000324bb 03e00029fe40
Re: [GIT pull] locking/urgent for v5.10-rc6
Hi Peter, Peter Zijlstra writes: > On Sun, Nov 29, 2020 at 11:31:41AM -0800, Linus Torvalds wrote: >> On Sun, Nov 29, 2020 at 5:38 AM Thomas Gleixner wrote: >> > >> > Yet two more places which invoke tracing from RCU disabled regions in the >> > idle path. Similar to the entry path the low level idle functions have to >> > be non-instrumentable. >> >> This really seems less than optimal. >> >> In particular, lookie here: >> >> > @@ -94,9 +94,35 @@ void __cpuidle default_idle_call(void) >> > >> > trace_cpu_idle(1, smp_processor_id()); >> > stop_critical_timings(); >> > + >> > + /* >> > +* arch_cpu_idle() is supposed to enable IRQs, however >> > +* we can't do that because of RCU and tracing. >> > +* >> > +* Trace IRQs enable here, then switch off RCU, and have >> > +* arch_cpu_idle() use raw_local_irq_enable(). Note that >> > +* rcu_idle_enter() relies on lockdep IRQ state, so switch >> > that >> > +* last -- this is very similar to the entry code. >> > +*/ >> > + trace_hardirqs_on_prepare(); >> > + lockdep_hardirqs_on_prepare(_THIS_IP_); >> > rcu_idle_enter(); >> > + lockdep_hardirqs_on(_THIS_IP_); >> > + >> > arch_cpu_idle(); >> > + >> > + /* >> > +* OK, so IRQs are enabled here, but RCU needs them >> > disabled to >> > +* turn itself back on.. funny thing is that disabling IRQs >> > +* will cause tracing, which needs RCU. Jump through hoops >> > to >> > +* make it 'work'. >> > +*/ >> > + raw_local_irq_disable(); >> > + lockdep_hardirqs_off(_THIS_IP_); >> > rcu_idle_exit(); >> > + lockdep_hardirqs_on(_THIS_IP_); >> > + raw_local_irq_enable(); >> > + >> > start_critical_timings(); >> > trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()); >> > } >> >> And look at what the code generation for the idle exit path is when >> lockdep isn't even on. > > Agreed. > > The idea was to flip all of arch_cpu_idle() to not enable interrupts. 
> > This is suboptimal for things like x86 where arch_cpu_idle() is > basically STI;HLT, but x86 isn't likely to actually use this code path > anyway, given all the various cpuidle drivers it has. > > Many of the other archs are now doing things like arm's: > wfi();raw_local_irq_enable(). > > Doing that tree-wide interrupt-state flip was something I didn't want to > do at this late a stage, the chanse of messing that up is just too high. > > After that I need to go look at flipping cpuidle, which is even more > 'interesting'. cpuidle_enter() has the exact same semantics, and this is > the code path that x86 actually uses, and here it's inconsitent at best. On s390 this introduces the following splat: [2.962283] Testing tracer wakeup_rt: [3.017102] sched: DL replenish lagged too much [3.061777] PASSED [3.062076] Testing tracer wakeup_dl: PASSED [3.161296] [ cut here ] [3.161301] DEBUG_LOCKS_WARN_ON(current->hardirq_chain_key != current->curr_chain_key) [3.161310] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:4155 lockdep_hardirqs_on+0x1ea/0x1f8 [3.161316] Modules linked in: [3.161323] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-rc6-07209-g9e30ba6d2d5c #2249 [3.161327] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [3.161331] Krnl PSW : 0404d0018000 00d730fe (lockdep_hardirqs_on+0x1ee/0x1f8) [3.161340]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [3.161347] Krnl GPRS: c000bfff 8001 004a 001e33b8 [3.161352]03807b88 03807b80 [3.161357] 0151a610 01361500 00d81d40 [3.161362]00010400 00d88010 00d730fa 03807db8 [3.161461] Krnl Code: 00d730ee: c0200012d9adlarl %r2,00fce448 [3.161461]00d730f4: c0e5451abrasl %r14,00d5bb28 [3.161461] #00d730fa: af00mc 0,0 [3.161461] >00d730fe: a7f4ff70brc 15,00d72fde [3.161461]00d73102: 0707bcr 0,%r7 [3.161461]00d73104: 0707bcr 0,%r7 [3.161461]00d73106: 0707bcr 0,%r7 [3.161461]00d73108: c418002884eclgrl %r1,01283ae0 [3.161518] Call Trace: [3.161526] [<00d730fe>]
Re: [PATCH 0/2] More RCU vs idle fixes
Peter Zijlstra writes: > Both arm64 and s390 are tripping over arch_cpu_idle() RCU,tracing,lockdep > interaction. While looking at that I also found fail in inte_idle. > > Please consider for this cycle. Is anyone taking this patchset? For s390, we also need to change the local_irq_save/restore to the raw variants in enabled_wait() in arch/s390/kernel/idle.c. I can make a patch and carry that via the s390 tree, but I want to make sure the s390 change in this patchset also reaches Linus' tree. Thanks Sven
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Hi Jens, Jens Axboe writes: > On 11/3/20 4:00 AM, Sven Schnelle wrote: >> Hi Jens, >> >> Heiko Carstens writes: >> >>> On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: >>>> Wire up TIF_NOTIFY_SIGNAL handling for s390. >>>> >>>> Cc: linux-s...@vger.kernel.org >>>> Signed-off-by: Jens Axboe >>>> --- >>>> >>>> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting >>>> for details: >>>> >>>> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/ >>>> >>>> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs, >>>> as that will enable a set of cleanups once all of them support it. I'm >>>> happy carrying this patch if need be, or it can be funelled through the >>>> arch tree. Let me know. >>>> >>>> arch/s390/include/asm/thread_info.h | 2 ++ >>>> arch/s390/kernel/entry.S| 7 ++- >>>> 2 files changed, 8 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/arch/s390/include/asm/thread_info.h >>>> b/arch/s390/include/asm/thread_info.h >>>> index 13a04fcf7762..0045341ade48 100644 >>>> --- a/arch/s390/include/asm/thread_info.h >>>> +++ b/arch/s390/include/asm/thread_info.h >>>> @@ -65,6 +65,7 @@ void arch_setup_new_exec(void); >>>> #define TIF_GUARDED_STORAGE 4 /* load guarded storage control >>>> block */ >>>> #define TIF_PATCH_PENDING 5 /* pending live patching update */ >>>> #define TIF_PGSTE 6 /* New mm's will use 4K page tables */ >>>> +#define TIF_NOTIFY_SIGNAL 7 /* signal notifications exist */ >>>> #define TIF_ISOLATE_BP8 /* Run process with isolated BP >>>> */ >>>> #define TIF_ISOLATE_BP_GUEST 9 /* Run KVM guests with isolated >>>> BP */ >>>> >>>> @@ -82,6 +83,7 @@ void arch_setup_new_exec(void); >>>> #define TIF_SYSCALL_TRACEPOINT27 /* syscall tracepoint >>>> instrumentation */ >>>> >>>> #define _TIF_NOTIFY_RESUMEBIT(TIF_NOTIFY_RESUME) >>>> +#define _TIF_NOTIFY_SIGNALBIT(TIF_NOTIFY_SIGNAL) >>>> #define _TIF_SIGPENDING BIT(TIF_SIGPENDING) >>>> #define _TIF_NEED_RESCHED BIT(TIF_NEED_RESCHED) >>>> #define 
_TIF_UPROBE BIT(TIF_UPROBE) >>>> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S >>>> index 86235919c2d1..a30d891e8045 100644 >>>> --- a/arch/s390/kernel/entry.S >>>> +++ b/arch/s390/kernel/entry.S >>>> @@ -52,7 +52,8 @@ STACK_SIZE = 1 << STACK_SHIFT >>>> STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE >>>> >>>> _TIF_WORK = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \ >>>> - _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING) >>>> + _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING | \ >>>> + _TIF_NOTIFY_SIGNAL) >>>> _TIF_TRACE= (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | >>>> _TIF_SECCOMP | \ >>>> _TIF_SYSCALL_TRACEPOINT) >>>> _CIF_WORK = (_CIF_ASCE_PRIMARY | _CIF_ASCE_SECONDARY | _CIF_FPU) >>>> @@ -463,6 +464,8 @@ ENTRY(system_call) >>>> #endif >>>>TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART >>>>jo .Lsysc_syscall_restart >>>> + TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >>>> + jo .Lsysc_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >>>>jo .Lsysc_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >>>> @@ -857,6 +860,8 @@ ENTRY(io_int_handler) >>>> #endif >>>>TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >>>>jo .Lio_sigpending >>>> + TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >>>> + jo .Lio_sigpending >>>>TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >>>>jo .Lio_notify_resume >>>>TSTMSK __TI_flags(%r12),_TIF_GUARDED_STORAGE >>> >>> (full quote so you can make sense of the patch below). >>> >>> Please merge the patch below into this one. With that: >>> >>> Acked-by: Heiko Carstens >>> >>> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/e
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Hi Jens, Heiko Carstens writes: > On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: >> Wire up TIF_NOTIFY_SIGNAL handling for s390. >> >> Cc: linux-s...@vger.kernel.org >> Signed-off-by: Jens Axboe >> --- >> >> 5.11 has support queued up for TIF_NOTIFY_SIGNAL, see this posting >> for details: >> >> https://lore.kernel.org/io-uring/20201026203230.386348-1-ax...@kernel.dk/ >> >> As part of that work, I'm adding TIF_NOTIFY_SIGNAL support to all archs, >> as that will enable a set of cleanups once all of them support it. I'm >> happy carrying this patch if need be, or it can be funelled through the >> arch tree. Let me know. >> >> arch/s390/include/asm/thread_info.h | 2 ++ >> arch/s390/kernel/entry.S| 7 ++- >> 2 files changed, 8 insertions(+), 1 deletion(-) >> >> diff --git a/arch/s390/include/asm/thread_info.h >> b/arch/s390/include/asm/thread_info.h >> index 13a04fcf7762..0045341ade48 100644 >> --- a/arch/s390/include/asm/thread_info.h >> +++ b/arch/s390/include/asm/thread_info.h >> @@ -65,6 +65,7 @@ void arch_setup_new_exec(void); >> #define TIF_GUARDED_STORAGE 4 /* load guarded storage control block */ >> #define TIF_PATCH_PENDING 5 /* pending live patching update */ >> #define TIF_PGSTE 6 /* New mm's will use 4K page tables */ >> +#define TIF_NOTIFY_SIGNAL 7 /* signal notifications exist */ >> #define TIF_ISOLATE_BP 8 /* Run process with isolated BP >> */ >> #define TIF_ISOLATE_BP_GUEST9 /* Run KVM guests with isolated >> BP */ >> >> @@ -82,6 +83,7 @@ void arch_setup_new_exec(void); >> #define TIF_SYSCALL_TRACEPOINT 27 /* syscall tracepoint >> instrumentation */ >> >> #define _TIF_NOTIFY_RESUME BIT(TIF_NOTIFY_RESUME) >> +#define _TIF_NOTIFY_SIGNAL BIT(TIF_NOTIFY_SIGNAL) >> #define _TIF_SIGPENDING BIT(TIF_SIGPENDING) >> #define _TIF_NEED_RESCHED BIT(TIF_NEED_RESCHED) >> #define _TIF_UPROBE BIT(TIF_UPROBE) >> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S >> index 86235919c2d1..a30d891e8045 100644 >> --- a/arch/s390/kernel/entry.S >> +++ 
b/arch/s390/kernel/entry.S >> @@ -52,7 +52,8 @@ STACK_SIZE = 1 << STACK_SHIFT >> STACK_INIT = STACK_SIZE - STACK_FRAME_OVERHEAD - __PT_SIZE >> >> _TIF_WORK = (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED | \ >> - _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING) >> + _TIF_UPROBE | _TIF_GUARDED_STORAGE | _TIF_PATCH_PENDING | \ >> + _TIF_NOTIFY_SIGNAL) >> _TIF_TRACE = (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | _TIF_SECCOMP | \ >> _TIF_SYSCALL_TRACEPOINT) >> _CIF_WORK = (_CIF_ASCE_PRIMARY | _CIF_ASCE_SECONDARY | _CIF_FPU) >> @@ -463,6 +464,8 @@ ENTRY(system_call) >> #endif >> TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART >> jo .Lsysc_syscall_restart >> +TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >> +jo .Lsysc_sigpending >> TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >> jo .Lsysc_sigpending >> TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >> @@ -857,6 +860,8 @@ ENTRY(io_int_handler) >> #endif >> TSTMSK __TI_flags(%r12),_TIF_SIGPENDING >> jo .Lio_sigpending >> +TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL >> +jo .Lio_sigpending >> TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME >> jo .Lio_notify_resume >> TSTMSK __TI_flags(%r12),_TIF_GUARDED_STORAGE > > (full quote so you can make sense of the patch below). > > Please merge the patch below into this one. 
With that: > > Acked-by: Heiko Carstens > > diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S > index a30d891e8045..31f16d903ef3 100644 > --- a/arch/s390/kernel/entry.S > +++ b/arch/s390/kernel/entry.S > @@ -464,9 +464,7 @@ ENTRY(system_call) > #endif > TSTMSK __PT_FLAGS(%r11),_PIF_SYSCALL_RESTART > jo .Lsysc_syscall_restart > - TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL > - jo .Lsysc_sigpending > - TSTMSK __TI_flags(%r12),_TIF_SIGPENDING > + TSTMSK __TI_flags(%r12),(_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL) > jo .Lsysc_sigpending We also need to change the jo to jnz: in combination with tm, jo means 'jump if all tested bits are set', while jnz means 'jump if at least one bit is set'. > TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME > jo .Lsysc_notify_resume > @@ -858,9 +856,7 @@ ENTRY(io_int_handler) > TSTMSK __TI_flags(%r12),_TIF_PATCH_PENDING > jo .Lio_patch_pending > #endif > - TSTMSK __TI_flags(%r12),_TIF_SIGPENDING > - jo .Lio_sigpending > - TSTMSK __TI_flags(%r12),_TIF_NOTIFY_SIGNAL > + TSTMSK __TI_flags(%r12),(_TIF_SIGPENDING|_TIF_NOTIFY_SIGNAL) Same here. > jo .Lio_sigpending > TSTMSK __TI_flags(%r12),_TIF_NOTIFY_RESUME > jo .Lio_notify_resume PS: I didn't get the previous emails, so I replied to a raw download.
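The jo/jnz distinction can be expressed in C (a sketch with made-up bit values, not the real _TIF_* definitions): after a tm (test under mask), jo branches only on condition code 3, i.e. when all tested bits are one, while jnz branches on any non-zero result:

```c
#include <assert.h>

/* illustrative bit values only, not the real s390 TIF numbering */
#define TIF_SIGPENDING_BIT    (1UL << 1)
#define TIF_NOTIFY_SIGNAL_BIT (1UL << 7)

/* 'tm ...,mask; jo' taken: ALL tested bits are set (condition code 3) */
static int tm_jo(unsigned long flags, unsigned long mask)
{
	return (flags & mask) == mask;
}

/* 'tm ...,mask; jnz' taken: at least ONE tested bit set (cc 1, 2 or 3) */
static int tm_jnz(unsigned long flags, unsigned long mask)
{
	return (flags & mask) != 0;
}
```

With only TIF_SIGPENDING set, a jo after the combined mask test would wrongly skip .Lsysc_sigpending, while jnz takes the branch as intended.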
Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL
Jens Axboe writes: > On 11/2/20 9:59 AM, Qian Cai wrote: >> On Sun, 2020-11-01 at 17:31 +, Heiko Carstens wrote: >>> On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote: Wire up TIF_NOTIFY_SIGNAL handling for s390. Cc: linux-s...@vger.kernel.org Signed-off-by: Jens Axboe >> >> Even though I did confirm that today's linux-next contains this additional >> patch >> from Heiko below, a z10 guest is still unable to boot. Reverting the whole >> series (reverting only "s390: add support for TIF_NOTIFY_SIGNAL" introduced >> compiling errors) fixed the problem, i.e., git revert --no-edit >> af0dd809f3d3..7b074c15374c [1] > > That's odd, it should build fine without that patch. How did it fail for you? > > Can you try and add this on top? Looks like I forgot the signal change for > s390, though that shouldn't really cause any issues. > > diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c > index 9e900a8977bd..a68c3796a1bf 100644 > --- a/arch/s390/kernel/signal.c > +++ b/arch/s390/kernel/signal.c > @@ -472,7 +472,7 @@ void do_signal(struct pt_regs *regs) > current->thread.system_call = > test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0; > > - if (get_signal()) { > + if (test_thread_flag(TIF_NOTIFY_SIGNAL) && get_signal()) { Shouldn't that be TIF_SIGPENDING? > /* Whee! Actually deliver the signal. */ > if (current->thread.system_call) { > regs->int_code = current->thread.system_call;
Re: autofs crash with latest linux-next
Kees Cook writes: > On Wed, Oct 14, 2020 at 09:15:47AM +0200, Krzysztof Kozlowski wrote: >> I hit this since few days as well. Although the bisect points to the >> merge, the issue looks like a result of mentioned commit 4d03e3cc5982 >> ("fs: don't allow kernel reads and writes without iter ops"). >> >> The __kernel_read() last argument 'pos' can be NULL and it is >> dereferenced here (added by the commit): >> >> 525 ssize_t __kernel_write(struct file *file, const void *buf, size_t >> count, loff_t *pos) >> ... >> 547 kiocb.ki_pos = *pos; >> 548 iov_iter_kvec(, WRITE, , 1, iov.iov_len); >> >> >> The __kernel_read() is called with NULL in fs/autofs/waitq.c: >> >> 45 static int autofs_write(struct autofs_sb_info *sbi, >> 46 struct file *file, const void *addr, int bytes) >> >> ... >> 54 mutex_lock(>pipe_mutex); >> 55 while (bytes) { >> 56 wr = __kernel_write(file, data, bytes, NULL); > > I think the thread here is the same thing, but you've found it in > autofs... > https://lore.kernel.org/lkml/CAHk-=wgj=mken-efv5tkwjnehplg0dybq+r5zyguc4weunq...@mail.gmail.com/ Indeed. Thanks, missed that. Sven
autofs crash with latest linux-next
Hi, on s390 i see the following crash with linux-next: [ 4525.432605] Unable to handle kernel pointer dereference in virtual kernel address space [ 4525.432612] Failing address: TEID: 0483 [ 4525.432613] Fault in home space mode while using kernel ASCE. [ 4525.432616] AS:cf048007 R3:0001fffec007 S:00011800 P:003d [ 4525.432640] Oops: 0004 ilc:3 [#1] SMP [ 4525.432644] Modules linked in: dm_crypt encrypted_keys lcs ctcm fsm nfsv3 nfs_acl nfs lockd grace quota_v2 quota_tree tun overlay ntfs exfat vfat fat sctp vfio_pci irqbypass vfio_virqfd scsi_debug vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb vfio_ap kvm loop nft_counter bridge stp llc dm_service_time nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua s390_trng vfio_ccw vfio_mdev mdev vfio_iommu_type1 vfio zcrypt_cex4 eadm_sch sch_fq_codel ip_tables x_tables ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4 [last unloaded: dummy_del_mod] [ 4525.432691] CPU: 9 PID: 1050921 Comm: find Tainted: G OE 5.9.0-20201011.rc8.git0.d67bc7812221.300.fc32.s390x+next #1 [ 4525.432693] Hardware name: IBM 3906 M04 704 (LPAR) [ 4525.432694] Krnl PSW : 0704d0018000 cde29f70 (__kernel_write+0x1a0/0x2a0) [ 4525.432702]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3 [ 4525.432704] Krnl GPRS: 000100067343 0130 0001 [ 4525.432705]0006 5f82be2f 0130 8c6ab568 [ 4525.432728]84441f00 0130 84441f00 [ 4525.432729]81476000 0001 cde29ef4 03e002f5b6f0 [ 4525.432735] Krnl Code: cde29f62: a728lhi %r2,0 cde29f66: a7f4ff9dbrc 15,cde29ea0 #cde29f6a: e310f0f4lg %r1,240(%r15) >cde29f70: e3109024stg %r1,0(%r9) cde29f76: 9104b044tm 68(%r11),4 cde29f7a: a784000fbrc 8,cde29f98 cde29f7e: e3100344lg %r1,832 cde29f84: b904002algr %r2,%r10 [ 4525.432748] Call Trace: [ 4525.432750] [] __kernel_write+0x1a0/0x2a0 [ 
4525.432752] ([ ] __kernel_write+0x124/0x2a0) [ 4525.432756] [<03ff80004cfa>] autofs_write+0x5a/0x140 [autofs4] [ 4525.432758] [<03ff80005262>] autofs_notify_daemon.constprop.0+0x10a/0x1c8 [autofs4] [ 4525.432760] [<03ff80005872>] autofs_wait+0x552/0x718 [autofs4] [ 4525.432762] [<03ff800033ca>] autofs_mount_wait+0x5a/0xb0 [autofs4] [ 4525.432764] [<03ff800048b2>] autofs_d_automount+0x102/0x278 [autofs4] [ 4525.432766] [ ] __traverse_mounts+0x9e/0x270 [ 4525.432768] [ ] step_into+0x1de/0x280 [ 4525.432770] [ ] open_last_lookups+0xb8/0x3f8 [ 4525.432772] [ ] path_openat+0x86/0x1d0 [ 4525.432773] [ ] do_filp_open+0x78/0x118 [ 4525.432776] [ ] do_sys_openat2+0xa8/0x168 [ 4525.432778] [ ] __s390x_sys_openat+0x6a/0x98 [ 4525.432781] [ ] system_call+0xdc/0x2a4 [ 4525.432782] Last Breaking-Event-Address: [ 4525.432783] [ ] __kernel_write+0x12c/0x2a0 This seems to be caused by the result of merging 0fb702791bf ("autofs: use __kernel_write() for the autofs pipe writing") and 4d03e3cc5982 ("fs: don't allow kernel reads and writes without iter ops"). __kernel_write() gets now called with a NULL pointer as pos argument, but __kernel_write expects a valid pointer as it fetches/stores the pos value there. Is there a fix pending somewhere? Thanks Sven
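A sketch of the pattern that avoids this class of crash (illustrative only, not the actual kernel fix): treat a NULL pos as "the caller does not track a position" and guard both the read and the write-back of *pos:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* stand-in for the kernel's loff_t in this user-space sketch */
typedef long long loff_t_sketch;

/* hypothetical __kernel_write()-style helper that tolerates pos == NULL,
 * unlike the crashing code path which did kiocb.ki_pos = *pos blindly */
static long kernel_write_sketch(char *dst, const char *buf, size_t count,
				loff_t_sketch *pos)
{
	loff_t_sketch kpos = pos ? *pos : 0;	/* guard the dereference */

	memcpy(dst + kpos, buf, count);		/* "perform the write" at kpos */
	if (pos)
		*pos = kpos + (loff_t_sketch)count; /* write back only if tracked */
	return (long)count;
}
```

Callers like autofs_write() that pass NULL then simply write at offset 0 without any position write-back, instead of dereferencing a NULL pointer.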
Re: [PATCH] s390/idle: Fix suspicious RCU usage
Hi Peter,

Peter Zijlstra writes:
> On Wed, Oct 07, 2020 at 12:05:51PM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 07, 2020 at 09:53:25AM +0200, Sven Schnelle wrote:
>> > Hi Peter,
>> >
>> > pet...@infradead.org writes:
>> >
>> > > After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the
>> > > lock tracepoints are visible to lockdep and RCU-lockdep is finding a
>> > > bunch more RCU violations that were previously hidden.
>> > >
>> > > Switch the idle->seqcount over to using raw_write_*() to avoid the
>> > > lockdep annotation and thus the lock tracepoints.
>> > >
>> > > Reported-by: Guenter Roeck
>> > > Signed-off-by: Peter Zijlstra (Intel)
>> > > [..]
>> >
>> > I'm still seeing the splat below on s390 when irq tracing is enabled:
>>
>> Damn... :/
>>
>> This one is tricky, trouble seems to be that arch_cpu_idle() is defined
>> to enable interrupts (no doubt because ot x86 :/), but we call it before
>> rcu_exit_idle().
>>
>> What a mess... let me rummage around the various archs to see what makes
>> most sense here.
>
> Maybe something like so, I've not yet tested it. I need to figure out
> how to force x86 into this path.

I've given this patch a quick test on linux-next from today and haven't seen the splat again. However, it wasn't happening all the time, so I will test it a bit longer. I haven't looked into the tracing code in detail, but I guess it was only happening when the lock was contended.

The only thing with this patch is that RCU complains that it gets called with interrupts enabled on s390 when rcu_irq_enter() is called. But a few trace_hardirqs_{on,off} calls at the beginning and end of the IRQ handlers fix this. I will check why this worked in the past.

Sven
Re: [PATCH] s390/idle: Fix suspicious RCU usage
Hi Peter, pet...@infradead.org writes: > After commit eb1f00237aca ("lockdep,trace: Expose tracepoints") the > lock tracepoints are visible to lockdep and RCU-lockdep is finding a > bunch more RCU violations that were previously hidden. > > Switch the idle->seqcount over to using raw_write_*() to avoid the > lockdep annotation and thus the lock tracepoints. > > Reported-by: Guenter Roeck > Signed-off-by: Peter Zijlstra (Intel) > [..] I'm still seeing the splat below on s390 when irq tracing is enabled: [ 1273.747948] = [ 1273.747953] WARNING: suspicious RCU usage [ 1273.747960] 5.9.0-20201006.rc8.git0.162c12a918a1.300.fc32.s390x+debug #1 Not tainted [ 1273.747965] - [ 1273.747971] include/trace/events/lock.h:74 suspicious rcu_dereference_check() usage! [ 1273.747976] other info that might help us debug this: [ 1273.747982] rcu_scheduler_active = 2, debug_locks = 1 [ 1273.747987] RCU used illegally from extended quiescent state! [ 1273.747993] 1 lock held by swapper/8/0: [ 1273.747998] #0: 00010f7281b8 (max_trace_lock){..-.}-{2:2}, at: check_critical_timing+0x7c/0x1c8 [ 1273.748019] stack backtrace: [ 1273.748034] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.9.0-20201006.rc8.git0.162c12a918a1.300.fc32.s390x+debug #1 [ 1273.748040] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0) [ 1273.748045] Call Trace: [ 1273.748053] [<00010e656de0>] show_stack+0x90/0xf8 [ 1273.748060] [<00010eea94da>] dump_stack+0xa2/0xd8 [ 1273.748068] [<00010e711cc8>] trace_lock_acquired+0x178/0x180 [ 1273.748075] [<00010e718ed4>] lock_contended+0x24/0xd8 [ 1273.748083] [<00010f26d148>] _raw_spin_lock_irqsave+0xb0/0xd8 [ 1273.748089] [<00010e7ec404>] check_critical_timing+0x7c/0x1c8 [ 1273.748096] [<00010e7ecaa8>] tracer_hardirqs_on+0x128/0x148 [ 1273.748103] [<00010e7eae0c>] trace_hardirqs_on+0x6c/0x1b0 [ 1273.748110] [<00010e644ba8>] arch_cpu_idle+0x28/0x38 [ 1273.748116] [<00010f26ccd6>] default_idle_call+0x56/0x98 [ 1273.748124] [<00010e6da81a>] do_idle+0xf2/0x1b0 [ 1273.748130] 
[<00010e6dab4e>] cpu_startup_entry+0x36/0x40 [ 1273.748137] [<00010e6590fa>] smp_start_secondary+0x82/0x88 [ 1273.748142] 1 lock held by swapper/8/0: [ 1273.748147] #0: 00010f7281b8 (max_trace_lock){..-.}-{2:2}, at: check_critical_timing+0x7c/0x1c8 I think this happens because trace_lock_acquired gets called from idle context? Regards Sven
[tip: timers/urgent] lib/vdso: Allow to add architecture-specific vdso data
The following commit has been merged into the timers/urgent branch of tip:

Commit-ID: d60d7de3e16d7cea998bad17d87366a359625894
Gitweb: https://git.kernel.org/tip/d60d7de3e16d7cea998bad17d87366a359625894
Author: Sven Schnelle
AuthorDate: Tue, 04 Aug 2020 17:01:22 +02:00
Committer: Thomas Gleixner
CommitterDate: Thu, 06 Aug 2020 10:57:30 +02:00

lib/vdso: Allow to add architecture-specific vdso data

The initial assumption that all VDSO related data can be completely generic does not hold. S390 needs architecture specific storage to access the clock steering information.

Add struct arch_vdso_data to the vdso data struct. For architectures which do not need extra data this defaults to an empty struct. Architectures which require it, enable CONFIG_ARCH_HAS_VDSO_DATA and provide their specific struct in asm/vdso/data.h.

Signed-off-by: Sven Schnelle
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/20200804150124.41692-2-sv...@linux.ibm.com
---
 arch/Kconfig            |  3 +++
 include/vdso/datapage.h | 10 ++
 2 files changed, 13 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index a112448..b44dd6b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -975,6 +975,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index ee810ca..73eb622 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,12 @@
 #include
 #include
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#else
+struct arch_vdso_data {};
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -64,6 +70,8 @@ struct vdso_timestamp {
 * @tz_dsttime: type of DST correction
 * @hrtimer_res: hrtimer resolution
 * @__unused: unused
+ * @arch_data: architecture specific data (optional, defaults
+ * to an empty struct)
 *
 * vdso_data will be accessed by 64 bit and compat code at the same time
 * so we should be careful before modifying this structure.
@@ -97,6 +105,8 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+
+	struct arch_vdso_data arch_data;
 };

/*
Re: [PATCH v3 3/3] s390: convert to GENERIC_VDSO
Hi Thomas, Thomas Gleixner writes: > Sven Schnelle writes: >> --- /dev/null >> +++ b/arch/s390/include/asm/vdso/data.h >> @@ -0,0 +1,13 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#ifndef __S390_ASM_VDSO_DATA_H >> +#define __S390_ASM_VDSO_DATA_H >> + >> +#include >> +#include > > I don't think this header needs vdso/datapage.h Agreed. >> +struct arch_vdso_data { >> +__u64 tod_steering_delta; >> +__u64 tod_steering_end; >> +}; >> + >> +#endif /* __S390_ASM_VDSO_DATA_H */ > >> --- /dev/null >> +++ b/arch/s390/include/asm/vdso/gettimeofday.h > >> +static inline u64 __arch_get_hw_counter(s32 clock_mode) >> +{ >> +const struct vdso_data *vdso = __arch_get_vdso_data(); > > If you go and implement time namespace support for VDSO then this > becomes a problem. See do_hres_timens(). > > As we did not expect that this function needs vdso_data we should just > add a vdso_data pointer argument to __arch_get_hw_counter(). And while > looking at it you're not the first one. MIPS already uses it and because > it does not support time namespaces so far nobody noticed yet. > > That's fine for all others because the compiler will optimize it > out when it's unused. If the compiler fails to do so, then this is the > least of our worries :) > > As there is no new VDSO conversion pending in next, I can just queue > that (see below) along with #1 and #2 of this series and send it to > Linus by end of the week. Fine with me. >> +u64 adj, now; >> + >> +now = get_tod_clock(); >> +adj = vdso->arch.tod_steering_end - now; >> +if (unlikely((s64) adj > 0)) >> +now += (vdso->arch.tod_steering_delta < 0) ? 
(adj >> 15) : -(adj >> 15);
>> +	return now;
>> +}
>
>> --- /dev/null
>> +++ b/arch/s390/include/asm/vdso/processor.h
>> @@ -0,0 +1,5 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef __ASM_VDSO_PROCESSOR_H
>> +#define __ASM_VDSO_PROCESSOR_H
>> +
>> +#endif /* __ASM_VDSO_PROCESSOR_H */
>
> The idea of this file was to provide cpu_relax() self contained without
> pulling in tons of other things from asm/processor.h.

Thanks, I'll add that.

>> diff --git a/arch/s390/include/asm/vdso/vdso.h
>> b/arch/s390/include/asm/vdso/vdso.h
>> new file mode 100644
>> index ..e69de29bb2d1
>> diff --git a/arch/s390/include/asm/vdso/vsyscall.h
>> b/arch/s390/include/asm/vdso/vsyscall.h
>> new file mode 100644
>> index ..6c67c08cefdd
>> --- /dev/null
>> +++ b/arch/s390/include/asm/vdso/vsyscall.h
>> @@ -0,0 +1,26 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __ASM_VDSO_VSYSCALL_H
>> +#define __ASM_VDSO_VSYSCALL_H
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +#include
>> +#include
>
> I know you copied that from x86 or some other architecture, but thinking
> about the above these two includes are not required for building VDSO
> itself. Only the kernel update side needs them and they are included
> already there. I'm going to remove them from x86 as well.

Thanks, I removed them from my patch.

>> @@ -443,9 +396,8 @@ static void clock_sync_global(unsigned long long delta)
>> 	panic("TOD clock sync offset %lli is too large to drift\n",
>> 	      tod_steering_delta);
>> 	tod_steering_end = now + (abs(tod_steering_delta) << 15);
>> -	vdso_data->ts_dir = (tod_steering_delta < 0) ? 0 : 1;
>> -	vdso_data->ts_end = tod_steering_end;
>> -	vdso_data->tb_update_count++;
>> +	vdso_data->arch.tod_steering_end = tod_steering_end;
>
> Leftover from the previous version. Should be arch_data.tod

Oops, indeed.

> Heiko, do you consider this 5.9 material or am I right to assume that
> this targets 5.10?

I talked to Heiko yesterday and we agreed that it's too late for 5.9, so we target 5.10.
Thanks, Sven
[PATCH v3 3/3] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special things on s390:

- vDSO can be called without a stack frame - glibc did this in the past. So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied time steering to it. We need to do the same in the new code. This is done in the architecture specific __arch_get_hw_counter function. The steering information is stored in an architecture specific area in the vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  13 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  71 +
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  66 ++--
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 219 insertions(+), 420 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h create mode 100644 arch/s390/include/asm/vdso/processor.h create mode 100644 arch/s390/include/asm/vdso/vdso.h create mode 100644 arch/s390/include/asm/vdso/vsyscall.h delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c7d7ede6300c..eb50f748e151 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -73,6 +73,7 @@ config S390 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL + select ARCH_HAS_VDSO_DATA select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK select ARCH_INLINE_READ_LOCK_BH @@ -118,6 +119,7 @@ config S390 select GENERIC_CPU_AUTOPROBE select GENERIC_CPU_VULNERABILITIES select GENERIC_FIND_FIRST_BIT + select GENERIC_GETTIMEOFDAY select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL select HAVE_ALIGNED_STRUCT_PAGE if SLUB @@ -149,6 +151,7 @@ config S390 select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS + select HAVE_GENERIC_VDSO select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h new file mode 100644 index ..03434369fce4 --- /dev/null +++ b/arch/s390/include/asm/clocksource.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* s390-specific clocksource additions */ + +#ifndef _ASM_S390_CLOCKSOURCE_H +#define _ASM_S390_CLOCKSOURCE_H + +#endif /* _ASM_S390_CLOCKSOURCE_H */ diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h index 0cd085cdeb4f..82f86b3c394b 100644 --- a/arch/s390/include/asm/vdso.h +++ b/arch/s390/include/asm/vdso.h @@ -2,6 +2,8 @@ #ifndef 
__S390_VDSO_H__ #define __S390_VDSO_H__ +#include + /* Default link addresses for the vDSOs */ #define VDSO32_LBASE 0 #define VDSO64_LBASE 0 @@ -18,30 +20,7 @@ * itself and may change without notice. */ -struct vdso_data { - __u64 tb_update_count; /* Timebase atomicity ctr 0x00 */ - __u64 xtime_tod_stamp; /* TOD clock for xtime 0x08 */ - __u64 xtime_clock_sec; /* Kernel time 0x10
[PATCH v3 1/3] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture specific vDSO data to struct vdso_data. This is useful if the arch specific user space VDSO code needs additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will include asm/vdso/data.h which should contain 'struct arch_vdso_data'. This structure will be embedded in the generic vDSO data.

Signed-off-by: Sven Schnelle
---
 arch/Kconfig            | 3 +++
 include/vdso/datapage.h | 9 +
 2 files changed, 12 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..a11e89e2f547 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,12 @@
 #include
 #include
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#else
+struct arch_vdso_data {};
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -64,6 +70,7 @@ struct vdso_timestamp {
 * @tz_dsttime: type of DST correction
 * @hrtimer_res: hrtimer resolution
 * @__unused: unused
+ * @arch_data: architecture specific data
 *
 * vdso_data will be accessed by 64 bit and compat code at the same time
 * so we should be careful before modifying this structure.
@@ -97,6 +104,8 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+
+	struct arch_vdso_data arch_data;
 };

/*
--
2.17.1
[PATCH v3 2/3] timekeeping/vsyscall: Provide vdso_update_begin/end()
From: Thomas Gleixner Architectures can have the requirement to add additional architecture specific data to the VDSO data page which needs to be updated independent of the timekeeper updates. To protect these updates vs. concurrent readers and a conflicting update through timekeeping, provide helper functions to make such updates safe. vdso_update_begin() takes the timekeeper_lock to protect against a potential update from timekeeper code and increments the VDSO sequence count to signal data inconsistency to concurrent readers. vdso_update_end() makes the sequence count even again to signal data consistency and drops the timekeeper lock. Signed-off-by: Thomas Gleixner Signed-off-by: Sven Schnelle --- include/vdso/vsyscall.h| 3 +++ kernel/time/timekeeping.c | 2 +- kernel/time/timekeeping_internal.h | 11 +++--- kernel/time/vsyscall.c | 32 ++ 4 files changed, 44 insertions(+), 4 deletions(-) diff --git a/include/vdso/vsyscall.h b/include/vdso/vsyscall.h index 2c6134e0c23d..b0fdc9c6bf43 100644 --- a/include/vdso/vsyscall.h +++ b/include/vdso/vsyscall.h @@ -6,6 +6,9 @@ #include +unsigned long vdso_update_begin(void); +void vdso_update_end(unsigned long flags); + #endif /* !__ASSEMBLY__ */ #endif /* __VDSO_VSYSCALL_H */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index d20d489841c8..8459aa87986f 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -50,7 +50,7 @@ static struct { .seq = SEQCNT_ZERO(tk_core.seq), }; -static DEFINE_RAW_SPINLOCK(timekeeper_lock); +DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; /** diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h index bcbb52db2256..4ca2787d1642 100644 --- a/kernel/time/timekeeping_internal.h +++ b/kernel/time/timekeeping_internal.h @@ -1,12 +1,14 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _TIMEKEEPING_INTERNAL_H #define _TIMEKEEPING_INTERNAL_H -/* - * timekeeping debug functions - */ + #include +#include 
 #include
+/*
+ * timekeeping debug functions
+ */
 #ifdef CONFIG_DEBUG_FS
 extern void tk_debug_account_sleep_time(const struct timespec64 *t);
 #else
@@ -31,4 +33,7 @@ static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
 }
 #endif
+
+/* Semi public for serialization of non timekeeper VDSO updates. */
+extern raw_spinlock_t timekeeper_lock;
+
 #endif /* _TIMEKEEPING_INTERNAL_H */

diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index 54ce6eb2ca36..e8806eda6874 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -13,6 +13,8 @@
 #include
 #include
+#include "timekeeping_internal.h"
+
 static inline void update_vdso_data(struct vdso_data *vdata, struct timekeeper *tk)
 {
@@ -127,3 +129,33 @@ void update_vsyscall_tz(void)
 	__arch_sync_vdso_data(vdata);
 }
+
+/**
+ * vdso_update_begin - Start of a VDSO update section
+ *
+ * Allows architecture code to safely update the architecture specific VDSO
+ * data.
+ */
+unsigned long vdso_update_begin(void)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&timekeeper_lock, flags);
+	vdso_write_begin(vdata);
+	return flags;
+}
+
+/**
+ * vdso_update_end - End of a VDSO update section
+ *
+ * Pairs with vdso_update_begin().
+ */
+void vdso_update_end(unsigned long flags)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+
+	vdso_write_end(vdata);
+	__arch_sync_vdso_data(vdata);
+	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+}
--
2.17.1
[PATCH RFC v3] s390: convert to GENERIC_VDSO
Here's the third version of the generic vdso patchset for s390.

Changes in v3:
- update the architecture specific data patch to the correct one.

Changes in v2:
- added patch from Thomas that adds vdso_update_begin()/vdso_update_end()
- changed the s390 code to use the vdso update functions
- changed the name of the architecture specific data to 'arch_data'
Re: [PATCH 1/3] vdso: allow to add architecture-specific vdso data
Hi, Sven Schnelle writes: > Add the possibility to add architecture specific vDSO > data to struct vdso_data. This is useful if the arch specific > user space VDSO code needs additional data during execution. > If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will > include asm/vdso/data.h which should contain 'struct arch_vdso_data'. > This structure will be embedded in the generic vDSO data. > > Signed-off-by: Sven Schnelle > --- > arch/Kconfig| 3 +++ > include/vdso/datapage.h | 7 +++ > 2 files changed, 10 insertions(+) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 8cc35dc556c7..e1017ce979e2 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR > entries at 4000, 5000 and 6000 locations. This option turns on syscall > related optimizations for a given architecture. > > +config ARCH_HAS_VDSO_DATA > + bool > + > source "kernel/gcov/Kconfig" > > source "scripts/gcc-plugins/Kconfig" > diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h > index 7955c56d6b3c..74e730238ce6 100644 > --- a/include/vdso/datapage.h > +++ b/include/vdso/datapage.h > @@ -19,6 +19,10 @@ > #include > #include > > +#ifdef CONFIG_ARCH_HAS_VDSO_DATA > +#include > +#endif > + > #define VDSO_BASES (CLOCK_TAI + 1) > #define VDSO_HRES(BIT(CLOCK_REALTIME)| \ >BIT(CLOCK_MONOTONIC) | \ > @@ -97,6 +101,9 @@ struct vdso_data { > s32 tz_dsttime; > u32 hrtimer_res; > u32 __unused; > +#ifdef CONFIG_ARCH_HAS_VDSO_DATA > + struct arch_vdso_data arch; > +#endif > }; > > /* I've sent the wrong version of this patch. I'll send a revised version with the requested changes later.
[PATCH 3/3] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special things on s390:

- vDSO can be called without a stack frame - glibc did this in the past. So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied time steering to it. We need to do the same in the new code. This is done in the architecture specific __arch_get_hw_counter function. The steering information is stored in an architecture specific area in the vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  13 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  71 +
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  66 ++--
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 219 insertions(+), 420 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h create mode 100644 arch/s390/include/asm/vdso/processor.h create mode 100644 arch/s390/include/asm/vdso/vdso.h create mode 100644 arch/s390/include/asm/vdso/vsyscall.h delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index c7d7ede6300c..eb50f748e151 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -73,6 +73,7 @@ config S390 select ARCH_HAS_STRICT_MODULE_RWX select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL + select ARCH_HAS_VDSO_DATA select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_INLINE_READ_LOCK select ARCH_INLINE_READ_LOCK_BH @@ -118,6 +119,7 @@ config S390 select GENERIC_CPU_AUTOPROBE select GENERIC_CPU_VULNERABILITIES select GENERIC_FIND_FIRST_BIT + select GENERIC_GETTIMEOFDAY select GENERIC_SMP_IDLE_THREAD select GENERIC_TIME_VSYSCALL select HAVE_ALIGNED_STRUCT_PAGE if SLUB @@ -149,6 +151,7 @@ config S390 select HAVE_FUNCTION_TRACER select HAVE_FUTEX_CMPXCHG if FUTEX select HAVE_GCC_PLUGINS + select HAVE_GENERIC_VDSO select HAVE_KERNEL_BZIP2 select HAVE_KERNEL_GZIP select HAVE_KERNEL_LZ4 diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h new file mode 100644 index ..03434369fce4 --- /dev/null +++ b/arch/s390/include/asm/clocksource.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* s390-specific clocksource additions */ + +#ifndef _ASM_S390_CLOCKSOURCE_H +#define _ASM_S390_CLOCKSOURCE_H + +#endif /* _ASM_S390_CLOCKSOURCE_H */ diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h index 0cd085cdeb4f..82f86b3c394b 100644 --- a/arch/s390/include/asm/vdso.h +++ b/arch/s390/include/asm/vdso.h @@ -2,6 +2,8 @@ #ifndef 
__S390_VDSO_H__ #define __S390_VDSO_H__ +#include + /* Default link addresses for the vDSOs */ #define VDSO32_LBASE 0 #define VDSO64_LBASE 0 @@ -18,30 +20,7 @@ * itself and may change without notice. */ -struct vdso_data { - __u64 tb_update_count; /* Timebase atomicity ctr 0x00 */ - __u64 xtime_tod_stamp; /* TOD clock for xtime 0x08 */ - __u64 xtime_clock_sec; /* Kernel time 0x10
[PATCH RFC v2] s390: convert to GENERIC_VDSO
As discussed, here's the second version of the generic vdso patch for s390.

Changes in v2:
- added patch from Thomas that adds vdso_update_begin()/vdso_update_end()
- changed the s390 code to use the vdso update functions
- changed the name of the architecture specific data to 'arch_data'
[PATCH 1/3] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture specific vDSO data to struct vdso_data. This is useful if the arch specific user space VDSO code needs additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined, the generic code will include asm/vdso/data.h which should contain 'struct arch_vdso_data'. This structure will be embedded in the generic vDSO data.

Signed-off-by: Sven Schnelle
---
 arch/Kconfig            | 3 +++
 include/vdso/datapage.h | 7 +++
 2 files changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
 entries at 4000, 5000 and 6000 locations. This option turns on syscall
 related optimizations for a given architecture.
+
+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"
 source "scripts/gcc-plugins/Kconfig"

diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..74e730238ce6 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,10 @@
 #include
 #include
+
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES (BIT(CLOCK_REALTIME) | \
 BIT(CLOCK_MONOTONIC) | \
@@ -97,6 +101,9 @@ struct vdso_data {
 s32 tz_dsttime;
 u32 hrtimer_res;
 u32 __unused;
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+	struct arch_vdso_data arch;
+#endif
 };

/*
--
2.17.1
[PATCH 2/3] timekeeping/vsyscall: Provide vdso_update_begin/end()
From: Thomas Gleixner Architectures can have the requirement to add additional architecture specific data to the VDSO data page which needs to be updated independent of the timekeeper updates. To protect these updates vs. concurrent readers and a conflicting update through timekeeping, provide helper functions to make such updates safe. vdso_update_begin() takes the timekeeper_lock to protect against a potential update from timekeeper code and increments the VDSO sequence count to signal data inconsistency to concurrent readers. vdso_update_end() makes the sequence count even again to signal data consistency and drops the timekeeper lock. Signed-off-by: Thomas Gleixner Signed-off-by: Sven Schnelle --- include/vdso/vsyscall.h| 3 +++ kernel/time/timekeeping.c | 2 +- kernel/time/timekeeping_internal.h | 11 +++--- kernel/time/vsyscall.c | 32 ++ 4 files changed, 44 insertions(+), 4 deletions(-) diff --git a/include/vdso/vsyscall.h b/include/vdso/vsyscall.h index 2c6134e0c23d..b0fdc9c6bf43 100644 --- a/include/vdso/vsyscall.h +++ b/include/vdso/vsyscall.h @@ -6,6 +6,9 @@ #include +unsigned long vdso_update_begin(void); +void vdso_update_end(unsigned long flags); + #endif /* !__ASSEMBLY__ */ #endif /* __VDSO_VSYSCALL_H */ diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index d20d489841c8..8459aa87986f 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -50,7 +50,7 @@ static struct { .seq = SEQCNT_ZERO(tk_core.seq), }; -static DEFINE_RAW_SPINLOCK(timekeeper_lock); +DEFINE_RAW_SPINLOCK(timekeeper_lock); static struct timekeeper shadow_timekeeper; /** diff --git a/kernel/time/timekeeping_internal.h b/kernel/time/timekeeping_internal.h index bcbb52db2256..4ca2787d1642 100644 --- a/kernel/time/timekeeping_internal.h +++ b/kernel/time/timekeeping_internal.h @@ -1,12 +1,14 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _TIMEKEEPING_INTERNAL_H #define _TIMEKEEPING_INTERNAL_H -/* - * timekeeping debug functions - */ + #include +#include 
 #include
+/*
+ * timekeeping debug functions
+ */
 #ifdef CONFIG_DEBUG_FS
 extern void tk_debug_account_sleep_time(const struct timespec64 *t);
 #else
@@ -31,4 +33,7 @@ static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
 }
 #endif
+
+/* Semi public for serialization of non timekeeper VDSO updates. */
+extern raw_spinlock_t timekeeper_lock;
+
 #endif /* _TIMEKEEPING_INTERNAL_H */

diff --git a/kernel/time/vsyscall.c b/kernel/time/vsyscall.c
index 54ce6eb2ca36..e8806eda6874 100644
--- a/kernel/time/vsyscall.c
+++ b/kernel/time/vsyscall.c
@@ -13,6 +13,8 @@
 #include
 #include
+#include "timekeeping_internal.h"
+
 static inline void update_vdso_data(struct vdso_data *vdata, struct timekeeper *tk)
 {
@@ -127,3 +129,33 @@ void update_vsyscall_tz(void)
 	__arch_sync_vdso_data(vdata);
 }
+
+/**
+ * vdso_update_begin - Start of a VDSO update section
+ *
+ * Allows architecture code to safely update the architecture specific VDSO
+ * data.
+ */
+unsigned long vdso_update_begin(void)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&timekeeper_lock, flags);
+	vdso_write_begin(vdata);
+	return flags;
+}
+
+/**
+ * vdso_update_end - End of a VDSO update section
+ *
+ * Pairs with vdso_update_begin().
+ */
+void vdso_update_end(unsigned long flags)
+{
+	struct vdso_data *vdata = __arch_get_k_vdso_data();
+
+	vdso_write_end(vdata);
+	__arch_sync_vdso_data(vdata);
+	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+}
--
2.17.1
Re: [PATCH 2/2] s390: convert to GENERIC_VDSO
Hi,

Thomas Gleixner writes:
> Heiko Carstens writes:
>
>> On Mon, Aug 03, 2020 at 06:05:24PM +0200, Thomas Gleixner wrote:
>>> +/**
>>> + * vdso_update_begin - Start of a VDSO update section
>>> + *
>>> + * Allows architecture code to safely update the architecture specific VDSO
>>> + * data.
>>> + */
>>> +void vdso_update_begin(void)
>>> +{
>>> +	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>> +
>>> +	raw_spin_lock(&timekeeper_lock);
>>> +	vdso_write_begin(vdata);
>>> +}
>>
>> I would assume that this only works if vdso_update_begin() is called
>> with irqs disabled, otherwise it could deadlock, no?
>
> Yes.
>
>> Maybe something like:
>>
>> void vdso_update_begin(unsigned long *flags)
>> {
>> 	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>
>> 	raw_spin_lock_irqsave(&timekeeper_lock, *flags);
>> 	vdso_write_begin(vdata);
>
> Shudder. Why not returning flags?
>
>> }
>>
>> void vdso_update_end(unsigned long *flags)
>
> Ditto, why pointer and not value?
>
>> {
>> 	struct vdso_data *vdata = __arch_get_k_vdso_data();
>>
>> 	vdso_write_end(vdata);
>> 	__arch_sync_vdso_data(vdata);
>> 	raw_spin_unlock_irqrestore(&timekeeper_lock, *flags);
>> }
>>
>> ? Just wondering.
>
> Thought about that briefly, but then hated the flags thing and delegated
> it to the caller. Lockdep will yell if that lock is taken with
> interrupts enabled :)
>
> But aside of the pointer vs. value thing, I'm fine with doing it in the
> functions.

Thanks Thomas & Heiko. I'll incorporate the changes into my patchset and send an updated version. Thomas, I think it's fine if I update your patch and we take it through the s390 tree?

Regards
Sven
Re: [PATCH 2/2] s390: convert to GENERIC_VDSO
Thomas Gleixner writes:
> Sven Schnelle writes:
>
>> - CPUCLOCK_VIRT is now handled with a syscall fallback, which might
>>   be slower/less accurate than the old implementation.
>
> I can understand the slower, but why does it become less accurate?

Because we saved the system/user times as almost the last instruction when
leaving the kernel to userspace. Now it's a bit earlier, because it is done
in the C code. So it's not really related to the syscall fallback, but to
the switch from assembly to C.

>> Performance numbers from my system doing 100 million gettimeofday() calls:
>>
>> Plain syscall: 8.6s
>> Generic VDSO:  1.3s
>> old ASM VDSO:  1s
>>
>> So it's a bit slower but still much faster than syscalls.
>
> Where is the overhead coming from?

It's because we have to allocate a stack frame, which we didn't do before,
and the compiler-generated code is less optimized than the hand-crafted
assembly code we had before.

>> +static inline u64 __arch_get_hw_counter(s32 clock_mode)
>> +{
>> +	const struct vdso_data *vdso = __arch_get_vdso_data();
>> +	u64 adj, now;
>> +	int cnt;
>> +
>> +	do {
>> +		do {
>> +			cnt = READ_ONCE(vdso->arch.tb_update_cnt);
>> +		} while (cnt & 1);
>
> smp_rmb() ?
>
>> +		now = get_tod_clock();
>> +		adj = vdso->arch.tod_steering_end - now;
>> +		if (unlikely((s64) adj > 0))
>> +			now += (vdso->arch.tod_steering_delta < 0) ? (adj >> 15) : -(adj >> 15);
>
> smp_rmb() ?
>
>> +	} while (cnt != READ_ONCE(vdso->arch.tb_update_cnt));
>> +	return now;

>> 	if (ptff_query(PTFF_QTO) && ptff(&qto, sizeof(qto), PTFF_QTO) == 0)
>> 		lpar_offset = qto.tod_epoch_difference;
>> @@ -599,6 +550,13 @@ static int stp_sync_clock(void *data)
>> 	if (stp_info.todoff[0] || stp_info.todoff[1] ||
>> 	    stp_info.todoff[2] || stp_info.todoff[3] ||
>> 	    stp_info.tmd != 2) {
>> +		vdso_data->arch.tb_update_cnt++;
>> +		/*
>> +		 * This barrier isn't really needed as we're called
>> +		 * from stop_machine_cpuslocked(). However it doesn't
>> +		 * hurt in case the code gets changed.
>> +		 */
>> +		smp_wmb();
>
> WMB without a corresponding RMB and an explanation what's ordered
> against what is voodoo at best.
>
>> 		rc = chsc_sstpc(stp_page, STP_OP_SYNC, 0, &clock_delta);
>> 		if (rc == 0) {
>> @@ -609,6 +567,8 @@ static int stp_sync_clock(void *data)
>> 			if (rc == 0 && stp_info.tmd != 2)
>> 				rc = -EAGAIN;
>> 		}
>> +		smp_wmb(); /* see comment above */
>
> See my comments above :) :-)

What do you think about my question on using vdso_write_begin/end()?
__arch_get_hw_counter() is called inside a vdso_read_retry() loop, so I
would think that just enclosing this update with vdso_write_begin/end()
should be sufficient. But I'm not sure whether arch/ should call these
functions.

Thanks
Sven
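The open-coded sequence counter discussed above can be modeled in portable userspace C. The sketch below is an illustration only, not the kernel code: the struct layout, the `read_steered_clock()` helper, and the use of GCC `__atomic` acquire loads (standing in for the `READ_ONCE()`/`smp_rmb()` pairing Thomas is asking about) are all assumptions. The reader spins while the counter is odd, computes the steered value, and retries if the counter changed underneath it.

```c
#include <stdint.h>

/* Userspace model of the s390 vDSO steering data. Field names follow the
 * patch under discussion, but this struct is illustrative, not the real
 * kernel layout. */
struct tod_vdso_model {
	uint32_t tb_update_cnt;		/* odd while an update is in flight */
	uint64_t tod_steering_end;
	int64_t  tod_steering_delta;
};

/* Reader side: retry until a consistent snapshot was read. The acquire
 * loads stand in for the smp_rmb() calls discussed in the review. */
static uint64_t read_steered_clock(const struct tod_vdso_model *v, uint64_t raw_tod)
{
	uint32_t cnt;
	uint64_t now;
	int64_t adj;

	do {
		do {	/* writer active: wait for an even counter */
			cnt = __atomic_load_n(&v->tb_update_cnt, __ATOMIC_ACQUIRE);
		} while (cnt & 1);

		now = raw_tod;
		adj = (int64_t)(v->tod_steering_end - now);
		if (adj > 0)	/* steering in progress: slew by adj/2^15 */
			now += (v->tod_steering_delta < 0) ? (uint64_t)(adj >> 15)
							   : -(uint64_t)(adj >> 15);
		/* retry if a writer ran while we were reading */
	} while (cnt != __atomic_load_n(&v->tb_update_cnt, __ATOMIC_ACQUIRE));

	return now;
}
```

With no pending steering the raw value passes through unchanged; while steering is active the result is slewed by a 2^-15 fraction of the remaining interval.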
Re: [PATCH 1/2] vdso: allow to add architecture-specific vdso data
Thomas Gleixner writes:
> just a few nits.
>
> Can you please spare that #ifdef and do:
>
> #ifdef CONFIG_ARCH_HAS_VDSO_DATA
> #include <asm/vdso/data.h>
> #else
> struct arch_vdso_data {};
> #endif

Ok.

> Please keep the tabular alignment of the struct members and add kernel
> doc in the comment above the struct.

Ok.

> Aside of that 'arch' is not really an intuitive name. 'arch_data' or
> something like that. Hmm?

arch_data is fine, thank you for your comments.

Regards
Sven
Re: [PATCH] perf: add psw_idle and psw_idle_exit to list of idle symbols
Sven Schnelle writes: > Add the s390 idle functions so they don't show up in top when > using software sampling. > > Signed-off-by: Sven Schnelle > --- > tools/perf/util/symbol.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c > index 5ddf84dcbae7..d33d24c61d24 100644 > --- a/tools/perf/util/symbol.c > +++ b/tools/perf/util/symbol.c > @@ -666,6 +666,8 @@ static bool symbol__is_idle(const char *name) > "poll_idle", > "ppc64_runlatch_off", > "pseries_dedicated_idle_sleep", > + "psw_idle", > + "psw_idle_exit", > NULL > }; > int i; gentle ping? Regards Sven
[PATCH RFC] s390: convert to GENERIC_VDSO
these two patches convert the s390 architecture to the generic VDSO.

The first patch adds an option to add architecture-specific information to
struct vdso_data. We need that information because the old s390 assembly
code had a steering capability, which steered the clock slowly. To emulate
that behaviour we need to add the steering offset to struct vdso_data.

This requirement results in the need for a seqlock-like lock, which is
implemented open-coded in __arch_get_hw_counter(). Open-coded because we
cannot include seqlock.h in userspace code (and using the normal seqlock
interface on the kernel side might result in people changing struct
seqlock, but not changing the vDSO userspace part), therefore both sides
are open-coded. I think in theory we could also call
vdso_write_begin()/vdso_write_end(). What do you think?

If there are no objections we would carry both patches through the s390
tree.

Thanks
Sven
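The writer side of the open-coded seqlock described above can be sketched in userspace C as follows. This is a hedged model, not the kernel implementation: the struct, the helper name, and the use of GCC `__atomic` builtins (the fences model `smp_wmb()`) are assumptions. The counter is bumped to an odd value before the data is touched and back to an even value afterwards, which is exactly what lets the reader's retry loop detect an update in flight.

```c
#include <stdint.h>

/* Illustrative writer side of an open-coded seqlock: readers treat an
 * odd counter as "update in progress" and retry if the counter changed. */
struct steering_model {
	uint32_t update_cnt;
	uint64_t steering_end;
	int64_t  steering_delta;
};

static void steering_update(struct steering_model *s, uint64_t end, int64_t delta)
{
	/* enter update section: counter becomes odd */
	__atomic_store_n(&s->update_cnt, s->update_cnt + 1, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* models smp_wmb() */

	s->steering_end = end;
	s->steering_delta = delta;

	__atomic_thread_fence(__ATOMIC_RELEASE);	/* models smp_wmb() */
	/* leave update section: counter becomes even again */
	__atomic_store_n(&s->update_cnt, s->update_cnt + 1, __ATOMIC_RELAXED);
}
```

Because the counter is incremented twice per update, any reader that observes the same even value before and after its reads is guaranteed to have seen a consistent pair of fields.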
[PATCH 2/2] s390: convert to GENERIC_VDSO
This patch converts s390 to the generic vDSO. There are a few special
things on s390:

- vDSO can be called without a stack frame - glibc did this in the past.
  So we need to allocate a stackframe on our own.

- The former assembly code used stcke to get the TOD clock and applied
  time steering to it. We need to do the same in the new code. This is
  done in the architecture specific __arch_get_hw_counter function. The
  steering information is stored in an architecture specific area in the
  vDSO data.

- CPUCLOCK_VIRT is now handled with a syscall fallback, which might be
  slower/less accurate than the old implementation.

The getcpu() function stays as an assembly function because there is no
generic implementation and the code is just a few lines.

Performance numbers from my system doing 100 million gettimeofday() calls:

Plain syscall: 8.6s
Generic VDSO:  1.3s
old ASM VDSO:  1s

So it's a bit slower but still much faster than syscalls.

Signed-off-by: Sven Schnelle
Reviewed-by: Heiko Carstens
---
 arch/s390/Kconfig                           |   3 +
 arch/s390/include/asm/clocksource.h         |   7 +
 arch/s390/include/asm/vdso.h                |  25 +--
 arch/s390/include/asm/vdso/clocksource.h    |   8 +
 arch/s390/include/asm/vdso/data.h           |  14 ++
 arch/s390/include/asm/vdso/gettimeofday.h   |  78 ++
 arch/s390/include/asm/vdso/processor.h      |   5 +
 arch/s390/include/asm/vdso/vdso.h           |   0
 arch/s390/include/asm/vdso/vsyscall.h       |  26
 arch/s390/kernel/asm-offsets.c              |  20 ---
 arch/s390/kernel/entry.S                    |   6 -
 arch/s390/kernel/setup.c                    |   1 -
 arch/s390/kernel/time.c                     |  70 ++---
 arch/s390/kernel/vdso.c                     |  29 +---
 arch/s390/kernel/vdso64/Makefile            |  19 ++-
 arch/s390/kernel/vdso64/clock_getres.S      |  50 --
 arch/s390/kernel/vdso64/clock_gettime.S     | 163
 arch/s390/kernel/vdso64/gettimeofday.S      |  71 -
 arch/s390/kernel/vdso64/vdso64_generic.c    |  18 +++
 arch/s390/kernel/vdso64/vdso_user_wrapper.S |  38 +
 20 files changed, 232 insertions(+), 419 deletions(-)
 create mode 100644 arch/s390/include/asm/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/clocksource.h
 create mode 100644 arch/s390/include/asm/vdso/data.h
 create mode 100644 arch/s390/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/s390/include/asm/vdso/processor.h
 create mode 100644 arch/s390/include/asm/vdso/vdso.h
 create mode 100644 arch/s390/include/asm/vdso/vsyscall.h
 delete mode 100644 arch/s390/kernel/vdso64/clock_getres.S
 delete mode 100644 arch/s390/kernel/vdso64/clock_gettime.S
 delete mode 100644 arch/s390/kernel/vdso64/gettimeofday.S
 create mode 100644 arch/s390/kernel/vdso64/vdso64_generic.c
 create mode 100644 arch/s390/kernel/vdso64/vdso_user_wrapper.S

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c7d7ede6300c..eb50f748e151 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
	select ARCH_HAS_STRICT_MODULE_RWX
	select ARCH_HAS_SYSCALL_WRAPPER
	select ARCH_HAS_UBSAN_SANITIZE_ALL
+	select ARCH_HAS_VDSO_DATA
	select ARCH_HAVE_NMI_SAFE_CMPXCHG
	select ARCH_INLINE_READ_LOCK
	select ARCH_INLINE_READ_LOCK_BH
@@ -118,6 +119,7 @@ config S390
	select GENERIC_CPU_AUTOPROBE
	select GENERIC_CPU_VULNERABILITIES
	select GENERIC_FIND_FIRST_BIT
+	select GENERIC_GETTIMEOFDAY
	select GENERIC_SMP_IDLE_THREAD
	select GENERIC_TIME_VSYSCALL
	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
@@ -149,6 +151,7 @@ config S390
	select HAVE_FUNCTION_TRACER
	select HAVE_FUTEX_CMPXCHG if FUTEX
	select HAVE_GCC_PLUGINS
+	select HAVE_GENERIC_VDSO
	select HAVE_KERNEL_BZIP2
	select HAVE_KERNEL_GZIP
	select HAVE_KERNEL_LZ4
diff --git a/arch/s390/include/asm/clocksource.h b/arch/s390/include/asm/clocksource.h
new file mode 100644
index ..03434369fce4
--- /dev/null
+++ b/arch/s390/include/asm/clocksource.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* s390-specific clocksource additions */
+
+#ifndef _ASM_S390_CLOCKSOURCE_H
+#define _ASM_S390_CLOCKSOURCE_H
+
+#endif /* _ASM_S390_CLOCKSOURCE_H */
diff --git a/arch/s390/include/asm/vdso.h b/arch/s390/include/asm/vdso.h
index 0cd085cdeb4f..82f86b3c394b 100644
--- a/arch/s390/include/asm/vdso.h
+++ b/arch/s390/include/asm/vdso.h
@@ -2,6 +2,8 @@
 #ifndef __S390_VDSO_H__
 #define __S390_VDSO_H__

+#include
+
 /* Default link addresses for the vDSOs */
 #define VDSO32_LBASE	0
 #define VDSO64_LBASE	0
@@ -18,30 +20,7 @@
  * itself and may change without notice.
  */

-struct vdso_data {
-	__u64 tb_update_count;	/* Timebase atomicity ctr	0x00 */
-	__u64 xtime_tod_stamp;	/* TOD clock for xtime		0x08 */
-	__u64 xtime_clock_sec;	/* Kernel time
[PATCH 1/2] vdso: allow to add architecture-specific vdso data
Add the possibility to add architecture-specific vDSO data to struct
vdso_data. This is useful if the arch-specific user space vDSO code needs
additional data during execution. If CONFIG_ARCH_HAS_VDSO_DATA is defined,
the generic code will include asm/vdso/data.h, which should contain
'struct arch_vdso_data'. This structure will be embedded in the generic
vDSO data.

Signed-off-by: Sven Schnelle
Reviewed-by: Heiko Carstens
---
 arch/Kconfig            |  3 +++
 include/vdso/datapage.h |  7 +++
 2 files changed, 10 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..e1017ce979e2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -979,6 +979,9 @@ config HAVE_SPARSE_SYSCALL_NR
	  entries at 4000, 5000 and 6000 locations. This option turns on syscall
	  related optimizations for a given architecture.

+config ARCH_HAS_VDSO_DATA
+	bool
+
 source "kernel/gcov/Kconfig"

 source "scripts/gcc-plugins/Kconfig"
diff --git a/include/vdso/datapage.h b/include/vdso/datapage.h
index 7955c56d6b3c..74e730238ce6 100644
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -19,6 +19,10 @@
 #include
 #include

+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+#include <asm/vdso/data.h>
+#endif
+
 #define VDSO_BASES (CLOCK_TAI + 1)
 #define VDSO_HRES  (BIT(CLOCK_REALTIME)  | \
		    BIT(CLOCK_MONOTONIC) | \
@@ -97,6 +101,9 @@ struct vdso_data {
	s32	tz_dsttime;
	u32	hrtimer_res;
	u32	__unused;
+#ifdef CONFIG_ARCH_HAS_VDSO_DATA
+	struct arch_vdso_data	arch;
+#endif
 };

 /*
--
2.17.1
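The embedding mechanism the patch describes can be sketched in plain C. This is a trimmed-down illustration: the arch struct's fields and the `arch_steering_end()` helper are invented for the example (on s390 the real fields come from `asm/vdso/data.h`), and `reserved` stands in for the generic struct's `__unused` member. The point is that arch code reaches its private data through the same pointer the generic vDSO code uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Pretend the architecture selected ARCH_HAS_VDSO_DATA. */
#define CONFIG_ARCH_HAS_VDSO_DATA 1

/* Would come from the architecture's asm/vdso/data.h; fields invented. */
struct arch_vdso_data {
	uint64_t tod_steering_end;
	int64_t  tod_steering_delta;
	uint32_t tb_update_cnt;
};

/* Trimmed-down model of the generic struct vdso_data tail. */
struct vdso_data {
	int32_t  tz_minuteswest;
	int32_t  tz_dsttime;
	uint32_t hrtimer_res;
	uint32_t reserved;
#ifdef CONFIG_ARCH_HAS_VDSO_DATA
	struct arch_vdso_data arch;	/* arch-specific data rides along */
#endif
};

/* Arch code reaches its private data through the generic pointer. */
static uint64_t arch_steering_end(const struct vdso_data *vd)
{
	return vd->arch.tod_steering_end;
}
```

Because the arch struct is embedded rather than pointed to, the whole thing still lives in the single vDSO data page mapped into user space, which is the property the patch relies on.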
[PATCH] perf: add psw_idle and psw_idle_exit to list of idle symbols
Add the s390 idle functions so they don't show up in top when using
software sampling.

Signed-off-by: Sven Schnelle
---
 tools/perf/util/symbol.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 5ddf84dcbae7..d33d24c61d24 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -666,6 +666,8 @@ static bool symbol__is_idle(const char *name)
		"poll_idle",
		"ppc64_runlatch_off",
		"pseries_dedicated_idle_sleep",
+		"psw_idle",
+		"psw_idle_exit",
		NULL
	};
	int i;
--
2.17.1
[PATCH] kprobes: use strncpy_from_kernel_nofault() in fetch_store_string()
With the latest linux-next I noticed that some tests in the ftrace test
suite are failing on s390, namely:

[FAIL] Kprobe event symbol argument
[FAIL] Kprobe event with comm arguments

The following doesn't work anymore:

cd /sys/kernel/tracing
echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
echo 1 >events/kprobes/testprobe/enable
cat trace

it will just show

test.sh-519   [012]    18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)

Looking at the source I see that there are two helpers for reading strings:

fetch_store_string_user() -> read string from user space
fetch_store_string()      -> read string from kernel space(?)

but in the end both are using strncpy_from_user_nofault(), while
fetch_store_string() should use strncpy_from_kernel_nofault().

Signed-off-by: Sven Schnelle
Acked-by: Masami Hiramatsu
Acked-by: Christoph Hellwig
---
 kernel/trace/trace_kprobe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index b1f21d558e45..ea8d0b094f1b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1278,7 +1278,7 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
	 * Try to get string again, since the string can be changed while
	 * probing.
	 */
-	ret = strncpy_from_user_nofault(__dest, (void *)addr, maxlen);
+	ret = strncpy_from_kernel_nofault(__dest, (void *)addr, maxlen);

	if (ret >= 0)
		*(u32 *)dest = make_data_loc(ret, __dest - base);
--
2.17.1
Re: kprobes string reading broken on s390
Masami,

On Sat, Jun 06, 2020 at 01:58:06AM +0900, Masami Hiramatsu wrote:
> Hi Sven,
>
> On Fri, 5 Jun 2020 15:25:41 +0200
> Christoph Hellwig wrote:
>
> > Yes, this looks correct. You probably want to write a small changelog
> > and add a Fixes tag, though.
> >
> > On Fri, Jun 05, 2020 at 01:05:34PM +0200, Sven Schnelle wrote:
> > > Hi Christoph,
> > >
> > > with the latest linux-next i noticed that some tests in the
> > > ftrace test suites are failing on s390, namely:
> > >
> > > [FAIL] Kprobe event symbol argument
> > > [FAIL] Kprobe event with comm arguments
> > >
> > > The following doesn't work anymore:
> > >
> > > cd /sys/kernel/tracing
> > > echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
> > > echo 1 >/sys/kernel/tracing/events/kprobes/testprobe/enable
> > > cat /sys/kernel/tracing/trace
> > >
> > > it will just show
> > >
> > > test.sh-519 [012] 18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)
> > >
> > > Looking at d411a9c4e95a ("tracing/kprobes: handle mixed kernel/userspace probes
> > > better") i see that there are two helpers for reading strings:
> > >
> > > fetch_store_string_user() -> read string from user space
> > > fetch_store_string() -> read string from kernel space(?)
> > >
> > > but in the end both are using strncpy_from_user_nofault(), but i would
> > > think that fetch_store_string() should use strncpy_from_kernel_nofault().
> > > However, i'm not sure about the exact semantics of fetch_store_string(),
> > > as there where a lot of wrong assumptions in the past, especially since
> > > on x86 you usually don't fail if you use the same function for accessing kernel
> > > and userspace although it's technically wrong.
>
> Thanks for fixing!
> This report can be a good changelog.
> Please resend it with Fixed tag as Christoph said.

Which SHA1 should the Fixes tag carry? The one from linux-next?

Regards
Sven
kprobes string reading broken on s390
Hi Christoph,

with the latest linux-next I noticed that some tests in the ftrace test
suite are failing on s390, namely:

[FAIL] Kprobe event symbol argument
[FAIL] Kprobe event with comm arguments

The following doesn't work anymore:

cd /sys/kernel/tracing
echo 'p:testprobe _do_fork comm=$comm ' >kprobe_events
echo 1 >/sys/kernel/tracing/events/kprobes/testprobe/enable
cat /sys/kernel/tracing/trace

it will just show

test.sh-519   [012]    18.580625: testprobe: (_do_fork+0x0/0x3c8) comm=(fault)

Looking at d411a9c4e95a ("tracing/kprobes: handle mixed kernel/userspace
probes better") I see that there are two helpers for reading strings:

fetch_store_string_user() -> read string from user space
fetch_store_string()      -> read string from kernel space(?)

but in the end both are using strncpy_from_user_nofault(), while I would
think that fetch_store_string() should use strncpy_from_kernel_nofault().
However, I'm not sure about the exact semantics of fetch_store_string(),
as there were a lot of wrong assumptions in the past, especially since on
x86 you usually don't fail if you use the same function for accessing
kernel and userspace, although it's technically wrong.

Regards,
Sven

commit 81408eab8fcc79dc0871a95462b13176d3446f5e
Author: Sven Schnelle
Date:   Fri Jun 5 13:01:24 2020 +0200

    kprobes: use strncpy_from_kernel_nofault() in fetch_store_string()

    Signed-off-by: Sven Schnelle

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index b1f21d558e45..ea8d0b094f1b 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1278,7 +1278,7 @@ fetch_store_string(unsigned long addr, void *dest, void *base)
	 * Try to get string again, since the string can be changed while
	 * probing.
	 */
-	ret = strncpy_from_user_nofault(__dest, (void *)addr, maxlen);
+	ret = strncpy_from_kernel_nofault(__dest, (void *)addr, maxlen);

	if (ret >= 0)
		*(u32 *)dest = make_data_loc(ret, __dest - base);
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve,

On Wed, May 13, 2020 at 03:30:33PM -0400, Steven Rostedt wrote:
> On Wed, 13 May 2020 18:15:57 +0200
> Sven Schnelle wrote:
>
> > Thanks for looking into this. I've attached my /proc/config.gz to this Mail.
> > The x86 system is my Laptop which is a Thinkpad X280 with 4 HT CPUs (so 8 cpus
> > in total). I've tried disabling preemption, but this didn't help.
> >
> > It's always this check that causes the loop:
> >
> > 	if (iter->head >= rb_page_size(iter->head_page)) {
> > 		rb_inc_iter(iter);
> > 		goto again;
> > 	}
> >
> > On the first loop iter->head is some value > 0 and rb_page_size returns
> > 0, afterwards it gets twice to this check with both values 0. The third
> > time the warning is triggered. Maybe that information helps.
>
> I figured out what was causing this, and that's just that the writer and
> the iterator could end up almost "in sync" where the writer writes to each
> of the pages the iterator is trying to read, and this can trigger the three
> failures of "zero commits" per page.
>
> I had a way to detect this, but then realized that it may be possible for
> an active writer to possibly trigger the other failures to get an event,
> that I just decided to force it to try three times, and simply return NULL
> if an active writer is messing with the iterator. The iterator is a "best
> effort" to read the buffer if there's an active writer. The consumer read
> (trace_pipe) is made for that.
>
> This patch should solve your issues.
>
> (care to give a Tested-by: if it works for you?)

Well, as there's no longer a RB_WARN_ON that indeed fixes the issue :-)

Tested-by: Sven Schnelle

Thanks!
Sven
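The "try three times, then give up" approach Steven describes can be modeled with a small userspace sketch. Everything here is illustrative: the callback shape, `ITER_RETRIES`, and the struct simulating writer interference are not the ring buffer API. The point is the design change: a bounded retry that returns failure instead of warning or looping forever, because the iterator is only a best-effort reader against a live writer.

```c
#include <stdint.h>

#define ITER_RETRIES 3

/* Simulates a page source where a concurrent writer may invalidate
 * the read a given number of times before a peek can succeed. */
struct peek_source {
	int failures_left;	/* pending "writer moved the page" events */
	uint64_t event;
};

static int try_peek(struct peek_source *s, uint64_t *out)
{
	if (s->failures_left > 0) {
		s->failures_left--;
		return -1;	/* writer touched the page under us */
	}
	*out = s->event;
	return 0;
}

/* Best-effort peek: bounded retries, then a clean failure instead of
 * an RB_WARN_ON()-style warning or an endless loop. */
static int iter_peek(struct peek_source *s, uint64_t *out)
{
	for (int i = 0; i < ITER_RETRIES; i++)
		if (try_peek(s, out) == 0)
			return 0;
	return -1;	/* busy writer: caller falls back to consuming read */
}
```

A caller that gets the failure result would fall back to the consuming read path (trace_pipe in the real code), which is designed to coexist with an active writer.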
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve,

On Wed, May 13, 2020 at 09:29:22AM -0400, Steven Rostedt wrote:
> On Wed, 13 May 2020 11:19:06 +0200
> Sven Schnelle wrote:
>
> > Did you have a chance to look into this? I can easily reproduce this
> > both on x86 and s390 by doing:
> >
> > cd /sys/kernel/tracing
> > cat /dev/zero >/dev/null &	# generate some load
> > echo function >current_tracer
> > # wait a few seconds to fill the buffer
> > cat trace
> >
> > Usually it will print the warn after a few seconds.
> >
> > I haven't dug through all the ring buffer code yet, so I thought I
> > might ask whether you have an idea what's going on.
>
> Can you send me the config for where you can reproduce it on x86?
>
> The iterator now doesn't stop the ring buffer when it iterates, and is
> doing so over a live buffer (but should be able to handle it). It's
> triggering something I thought wasn't suppose to happen (which must be
> happening).
>
> Perhaps with your config I'd be able to reproduce it.

Thanks for looking into this. I've attached my /proc/config.gz to this
Mail. The x86 system is my Laptop, which is a Thinkpad X280 with 4 HT CPUs
(so 8 CPUs in total). I've tried disabling preemption, but this didn't
help.

It's always this check that causes the loop:

	if (iter->head >= rb_page_size(iter->head_page)) {
		rb_inc_iter(iter);
		goto again;
	}

On the first loop iter->head is some value > 0 and rb_page_size returns 0;
afterwards it gets twice to this check with both values 0. The third time
the warning is triggered. Maybe that information helps.

Thanks,
Sven

[attachment: config.gz (application/gzip)]
Re: [tracing] 06e0a548ba: WARNING:at_kernel/trace/ring_buffer.c:#ring_buffer_iter_peek
Hi Steve, On Wed, Apr 29, 2020 at 05:05:09PM +0800, kernel test robot wrote: > > kern :warn : [ 886.763510] WARNING: CPU: 70 PID: 22584 at > kernel/trace/ring_buffer.c:4067 ring_buffer_iter_peek+0x13c/0x1d0 > kern :warn : [ 886.776216] Modules linked in: test_firmware intel_rapl_msr > intel_rapl_common skx_edac nfit libnvdimm btrfs x86_pkg_temp_thermal > intel_powerclamp blake2b_generic xor zstd_decompress zstd_compress coretemp > ast kvm_intel kvm raid6_pq drm_vram_helper drm_ttm_helper libcrc32c irqbypass > ttm crct10dif_pclmul crc32_pclmul drm_kms_helper snd_pcm crc32c_intel > ghash_clmulni_intel ahci snd_timer syscopyarea libahci aesni_intel ipmi_ssif > sysfillrect snd crypto_simd sysimgblt cryptd nvme fb_sys_fops mei_me > soundcore ipmi_si glue_helper pcspkr drm nvme_core ioatdma ipmi_devintf > t10_pi libata mei i2c_i801 joydev lpc_ich dca ipmi_msghandler wmi acpi_pad > acpi_power_meter ip_tables > kern :warn : [ 886.838112] CPU: 70 PID: 22584 Comm: cat Not tainted > 5.6.0-rc4-00017-g06e0a548bad0f #1 > kern :warn : [ 886.846573] RIP: 0010:ring_buffer_iter_peek+0x13c/0x1d0 > kern :warn : [ 886.852355] Code: 41 5e 41 5f c3 83 38 1d 0f 85 98 00 00 00 > 48 89 df e8 78 cb ff ff e9 32 ff ff ff 80 3c 24 00 75 c8 49 8b 44 24 10 f0 ff > 40 08 <0f> 0b 4c 89 f6 4c 89 ef e8 37 8c 8f 00 48 83 c4 08 31 c0 5b 5d 41 > kern :warn : [ 886.872239] RSP: 0018:c900213e7d78 EFLAGS: 00010002 > kern :warn : [ 886.878030] RAX: 889883ea7740 RBX: 88a3fbd93720 > RCX: 88a42101b700 > kern :warn : [ 886.885731] RDX: 88a415c02e00 RSI: 88a42101b440 > RDI: 88a3fbd93720 > kern :warn : [ 886.893432] RBP: R08: 00ce959ae6ab > R09: 88a42101bb80 > kern :warn : [ 886.901129] R10: 88a3e99e2300 R11: 0178 > R12: 88a415c02e00 > kern :warn : [ 886.908831] R13: 88a415c02e18 R14: 0296 > R15: c900213e7df8 > kern :warn : [ 886.916521] FS: 7f12d51d2640() > GS:88a44f78() knlGS: > kern :warn : [ 886.925179] CS: 0010 DS: ES: CR0: > 80050033 > kern :warn : [ 886.931495] CR2: 7f12d4e36000 CR3: 0023e901e006 > CR4: 
007606e0 > kern :warn : [ 886.939194] DR0: DR1: > DR2: > kern :warn : [ 886.946896] DR3: DR6: fffe0ff0 > DR7: 0400 > kern :warn : [ 886.954588] PKRU: 5554 > kern :warn : [ 886.957858] Call Trace: > kern :warn : [ 886.960866] peek_next_entry+0x22/0x70 > kern :warn : [ 886.965165] __find_next_entry+0xaf/0x180 > kern :warn : [ 886.969714] trace_find_next_entry_inc+0x1e/0x80 > kern :warn : [ 886.974863] s_next+0x49/0x70 > kern :warn : [ 886.978355] seq_read+0x23f/0x400 > kern :warn : [ 886.982193] vfs_read+0x9b/0x160 > kern :warn : [ 886.985937] ksys_read+0xa1/0xe0 > kern :warn : [ 886.989677] do_syscall_64+0x5b/0x1f0 > kern :warn : [ 886.993853] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > kern :warn : [ 886.999408] RIP: 0033:0x7f12d50f559e > kern :warn : [ 887.003479] Code: c0 e9 c6 fe ff ff 50 48 8d 3d 36 fc 09 00 > e8 e9 e7 01 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 > 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 > kern :warn : [ 887.023260] RSP: 002b:7ffdab31acd8 EFLAGS: 0246 > ORIG_RAX: > kern :warn : [ 887.031350] RAX: ffda RBX: 0002 > RCX: 7f12d50f559e > kern :warn : [ 887.039006] RDX: 0002 RSI: 7f12d4e36000 > RDI: 0005 > kern :warn : [ 887.046657] RBP: 7f12d4e36000 R08: 7f12d4e35010 > R09: > kern :warn : [ 887.054305] R10: fc4d R11: 0246 > R12: 7f12d4e36000 > kern :warn : [ 887.061958] R13: 0005 R14: 0002 > R15: 0f93 > kern :warn : [ 887.069608] ---[ end trace 507492ef8332a5b4 ]--- > Did you had a chance to look into this? I can easily reproduce this both on x86 and s390 by doing: cd /sys/kernel/tracing cat /dev/zero >/dev/null & # generate some load echo function >current_tracer # wait a few seconds to fill the buffer cat trace Usually it will print the warn after a few seconds. I haven't digged through all the ring buffer code yet, so i thought i might ask whether you have an idea what's going on. Thanks Sven
Re: [PATCH -next] s390: Remove two unused inline functions
Hi Joe,

On Mon, May 11, 2020 at 01:38:57PM -0700, Joe Perches wrote:
> Awhile back, I posted a list of apparently unused static inline
> functions in .h files treewide found by a script:
>
> https://lore.kernel.org/lkml/4603e761a5f39f4d97375e1e08d20d720c526341.ca...@perches.com/
>
> Here are the s390 entries:
>
> arch/s390/include/asm/atomic_ops.h:138:static inline long __atomic64_cmpxchg_bool(long *ptr, long old, long new)
> arch/s390/include/asm/bitops.h:278:static inline void __set_bit_inv(unsigned long nr, volatile unsigned long *ptr)
> arch/s390/include/asm/bitops.h:283:static inline void __clear_bit_inv(unsigned long nr, volatile unsigned long *ptr)
> arch/s390/include/asm/cpu_mcf.h:106:static inline int kernel_cpumcf_begin(void)
> arch/s390/include/asm/cpu_mcf.h:114:static inline void kernel_cpumcf_end(void)
> arch/s390/include/asm/ftrace.h:64:static inline int is_ftrace_nop(struct ftrace_insn *insn)
> arch/s390/include/asm/kvm_para.h:146:static inline long kvm_hypercall5(unsigned long nr, unsigned long p1,
> arch/s390/include/asm/kvm_para.h:175:static inline long kvm_hypercall6(unsigned long nr, unsigned long p1,
> arch/s390/include/asm/pci_dma.h:134:static inline void invalidate_table_entry(unsigned long *entry)
> arch/s390/include/asm/pci_dma.h:176:static inline int entry_isprotected(unsigned long entry)
> arch/s390/include/asm/timex.h:52:static inline void store_clock_comparator(__u64 *time)

Thanks, I'll take a look and prepare a patch.

Regards
Sven
Re: [PATCH -next] s390: Remove two unused inline functions
Hi,

On Fri, May 08, 2020 at 10:07:24PM +0800, YueHaibing wrote:
> commit 657480d9c015 ("s390: support KPROBES_ON_FTRACE")
> left behind this, remove it.
>
> Signed-off-by: YueHaibing
> ---
>  arch/s390/kernel/ftrace.c | 16
>  1 file changed, 16 deletions(-)
>
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index 4cd9b1ada834..44e01dd1e624 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -72,22 +72,6 @@ static inline void ftrace_generate_orig_insn(struct ftrace_insn *insn)
>  #endif
>  }
>
> -static inline void ftrace_generate_kprobe_nop_insn(struct ftrace_insn *insn)
> -{
> -#ifdef CONFIG_KPROBES
> -	insn->opc = BREAKPOINT_INSTRUCTION;
> -	insn->disp = KPROBE_ON_FTRACE_NOP;
> -#endif
> -}
> -
> -static inline void ftrace_generate_kprobe_call_insn(struct ftrace_insn *insn)
> -{
> -#ifdef CONFIG_KPROBES
> -	insn->opc = BREAKPOINT_INSTRUCTION;
> -	insn->disp = KPROBE_ON_FTRACE_CALL;
> -#endif
> -}
> -
>  int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
>  		       unsigned long addr)
>  {
> --
> 2.17.1

Thanks for noticing, looks like I missed them.

Acked-by: Sven Schnelle

Sven
Re: [PATCH] uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
Hi,

On Mon, May 04, 2020 at 08:40:44PM +0200, Christian Borntraeger wrote:
>
> On 04.05.20 18:47, Oleg Nesterov wrote:
> > uprobe_write_opcode() must not cross page boundary; prepare_uprobe()
> > relies on arch_uprobe_analyze_insn() which should validate "vaddr" but
> > some architectures (csky, s390, and sparc) don't do this.
>
> I think the idea was that the uprobe instruction is 2 bytes and
> instructions are always aligned to 2 bytes on s390. (we can have 2, 4
> or 6 bytes).
>
> > We can remove the BUG_ON() check in prepare_uprobe() and validate the
> > offset early in __uprobe_register(). The new IS_ALIGNED() check matches
> > the alignment check in arch_prepare_kprobe() on supported architectures,
> > so I think that all insns must be aligned to UPROBE_SWBP_INSN_SIZE.
>
> Not sure if it would have been possible to try to create a uprobe on an
> odd address. If yes, then the new IS_ALIGNED check certainly makes this
> better for s390, so the patch looks sane. Adding Vasily and Sven to
> double check.

I did a quick test, and without this patch it is possible to place a
uprobe at an odd address. With the patch it fails with EINVAL, which is
more reasonable.

Regards
Sven
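The early validation under discussion can be sketched as follows. This is a hedged model, not the kernel patch: `validate_uprobe_offset()`, the local `EINVAL_ERR` constant, and the bare-bones `IS_ALIGNED` macro are invented for the example; the 2-byte size matches s390's minimum instruction length mentioned above. Rejecting misaligned offsets up front guarantees that the later opcode write can never straddle an instruction (or page) boundary.

```c
/* s390-style 2-byte breakpoint instruction (assumption for this sketch) */
#define UPROBE_SWBP_INSN_SIZE	2
#define EINVAL_ERR		22	/* stand-in for -EINVAL */

/* works for power-of-two alignments, like the kernel macro */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

static int validate_uprobe_offset(unsigned long offset,
				  unsigned long ref_ctr_offset)
{
	/* probe address must sit on an instruction boundary */
	if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE))
		return -EINVAL_ERR;
	/* the reference counter, if used, must be naturally aligned too */
	if (ref_ctr_offset && !IS_ALIGNED(ref_ctr_offset, sizeof(short)))
		return -EINVAL_ERR;
	return 0;
}
```

This mirrors the behaviour Sven observed in his test: an odd probe address is refused with EINVAL instead of being written blindly.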
Re: net: tulip: de2104x: Checking a kmemdup() call in de21041_get_srom_info()
On Sat, Oct 12, 2019 at 07:03:09PM +0200, Markus Elfring wrote:
> Hello,
>
> I tried another script for the semantic patch language out.
> This source code analysis approach points out that the implementation
> of the function "de21041_get_srom_info" contains still an unchecked call
> of the function "kmemdup".
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/dec/tulip/de2104x.c?id=1c0cc5f1ae5ee5a6913704c0d75a6e99604ee30a#n1940
> https://elixir.bootlin.com/linux/v5.4-rc2/source/drivers/net/ethernet/dec/tulip/de2104x.c#L1940
>
> How do you think about to improve it?

If I have not missed a place, the only user is de_get_eeprom(), which
checks whether de->ee_data is valid. So I think, although not obvious,
there's no problem here.

Regards
Sven
Re: allow larger than require DMA masks
Hi,

On Fri, Feb 15, 2019 at 03:45:54PM +0100, Christoph Hellwig wrote:
> Hi all,
>
> this series finishes off converting our dma mask model to split between
> device capabilities (dev->dma_mask and dev->coherent_dma_mask) and system
> limitations (dev->bus_dma_mask). We already accept larger than required
> masks in most dma_map_ops implementation, in case of x86 and
> implementations based on it since the dawn of time. Only one parisc
> and two sparc64 instances failed larger than required DMA masks, and
> this series fixes that up and updates the documentation that devices
> don't need to handle DMA mask fallbacks.

I just tried the latest linux-5.4 git on my HP c8000 (parisc) and got the
following error:

[   27.246866] sata_sil24 :00:01.0: Applying completion IRQ loss on PCI-X errata fix
[   27.336968] sata_sil24 :00:01.0: DMA enable failed
[   27.476922] sata_sil24: probe of :00:01.0 failed with error -5

This is caused by commit dcc02c19cc06bd7bc1b6db0aa0087a2b6eb05b94:

    Author: Christoph Hellwig
    Date:   Mon Aug 26 12:57:24 2019 +0200

    sata_sil24: use dma_set_mask_and_coherent

    Use the dma_set_mask_and_coherent helper to set the DMA mask. Rely
    on the relatively recent change that setting a larger than required
    mask will never fail to avoid the need for the boilerplate 32-bit
    fallback code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

However, the real problem seems to be in sba_dma_supported():

	/* Documentation/DMA-API-HOWTO.txt tells drivers to try 64-bit
	 * first, then fall back to 32-bit if that fails.
	 * We are just "encouraging" 32-bit DMA masks here since we can
	 * never allow IOMMU bypass unless we add special support for ZX1.
	 */
	if (mask > ~0U)
		return 0;

Removing the if() makes the DMA mapping work. It's almost midnight here,
so I won't look into that any further today. Does anyone have an opinion
on this behaviour? Otherwise I will look a bit more into this in the next
few days.

Regards
Sven
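The boilerplate that the sata_sil24 commit removed, and that sba_dma_supported() still assumed, is the classic "try 64-bit, fall back to 32-bit" dance from the DMA API HOWTO. A hedged userspace sketch of that interaction, with a mock that refuses wide masks the way the old SBA code does:

```c
#include <stdint.h>

#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Mock of an IOMMU like the parisc SBA that refuses masks wider than
 * 32 bit; return value and name are illustrative, not the kernel API. */
static int mock_set_dma_mask(uint64_t mask)
{
	if (mask > 0xffffffffULL)
		return -5;	/* models -EIO: mask rejected */
	return 0;
}

/* The pre-"larger masks always succeed" driver boilerplate: try 64-bit
 * first, then fall back to 32-bit. Returns the accepted mask, 0 if none. */
static uint64_t set_mask_with_fallback(int (*set_mask)(uint64_t))
{
	if (set_mask(DMA_BIT_MASK(64)) == 0)
		return DMA_BIT_MASK(64);
	if (set_mask(DMA_BIT_MASK(32)) == 0)
		return DMA_BIT_MASK(32);
	return 0;
}
```

Once drivers drop the fallback and call the equivalent of `set_mask(DMA_BIT_MASK(64))` only once, a backend that still rejects wide masks (like the `if (mask > ~0U) return 0;` above) turns a working device into a probe failure, which is exactly the sil24 symptom reported here.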
Re: [PATCH] kprobes/parisc: remove arch_kprobe_on_func_entry()
On Wed, Aug 21, 2019 at 09:56:40AM +0000, Jisheng Zhang wrote:
> The common kprobes provides a weak implementation of
> arch_kprobe_on_func_entry(). The parisc version is the same as the
> common version, so remove it.
>
> Signed-off-by: Jisheng Zhang

Acked-by: Sven Schnelle

> ---
>  arch/parisc/kernel/kprobes.c | 4
>  1 file changed, 4 deletions(-)
>
> diff --git a/arch/parisc/kernel/kprobes.c b/arch/parisc/kernel/kprobes.c
> index 5d7f2692ac5a..77ec51818916 100644
> --- a/arch/parisc/kernel/kprobes.c
> +++ b/arch/parisc/kernel/kprobes.c
> @@ -281,10 +281,6 @@ int __kprobes arch_trampoline_kprobe(struct kprobe *p)
>  {
>  	return p->addr == trampoline_p.addr;
>  }
> -bool arch_kprobe_on_func_entry(unsigned long offset)
> -{
> -	return !offset;
> -}
>
>  int __init arch_init_kprobes(void)
>  {
> --
> 2.23.0.rc1
Re: [PATCH v9 00/21] Generic page walk and ptdump
Hi Steven,

On Mon, Jul 29, 2019 at 12:32:25PM +0100, Steven Price wrote:
>
> parisc is more interesting and I'm not sure if this is necessarily
> correct. I originally proposed a patch with the line "For parisc, we
> don't support large pages, so add stubs returning 0" which got Acked by
> Helge Deller. However going back to look at that again I see there was a
> follow up thread[2] which possibly suggests I was wrong?

I just started implementing ptdump for PA-RISC a week ago. I didn't
notice that you're working on making it generic, which is nice. I'll
adjust my code to use the infrastructure you're currently developing.

> Can anyone shed some light on whether parisc does support leaf entries
> of the page table tree at a higher than the normal depth?
>
> [1] https://lkml.org/lkml/2019/2/27/572
> [2] https://lkml.org/lkml/2019/3/5/610

My understanding is that PA-RISC only has leaf entries at the PTE level.

> The intention is that the page table walker would be available for all
> architectures so that it can be used in any generic code - PTDUMP simply
> seemed like a good place to start.
>
> > Now that pmd_leaf() and pud_leaf() are getting used in walk_page_range()
> > these functions need to be defined on all arch irrespective if they use
> > PTDUMP or not or otherwise just define it for archs which need them now
> > for sure i.e x86 and arm64 (which are moving to new generic PTDUMP
> > framework). Other archs can implement these later.

I'll take care of the PA-RISC part - for 32-bit your generic code works,
for 64-bit I need to learn a bit more about the following hack:

arch/parisc/include/asm/pgalloc.h:15

	/* Allocate the top level pgd (page directory)
	 *
	 * Here (for 64 bit kernels) we implement a Hybrid L2/L3 scheme: we
	 * allocate the first pmd adjacent to the pgd. This means that we can
	 * subtract a constant offset to get to it. The pmd and pgd sizes are
	 * arranged so that a single pmd covers 4GB (giving a full 64-bit
	 * process access to 8TB) so our lookups are effectively L2 for the
	 * first 4GB of the kernel (i.e. for all ILP32 processes and all the
	 * kernel for machines with under 4GB of memory) */

I see that your change clears P?D entries when p?d_bad() returns true,
which - I think - would be the case with the PA-RISC implementation.

Regards
Sven
Re: [PATCH] kprobes: fix compilation when KPROBE_EVENTS is enabled without kretprobes
Hi Naveen,

On Sat, Apr 06, 2019 at 10:52:47PM +0530, Naveen N. Rao wrote:
> Sven Schnelle wrote:
> > While implementing kprobes on PA-RISC (without kretprobes) compilation
> > fails when CONFIG_KPROBE_EVENTS is enabled:
>
> Thanks for working on that! Is there a specific reason kretprobes is not
> being enabled on parisc?

The reason is that I just started to add kprobes support to PA-RISC. I'm
planning to do kretprobes as a second step and to keep separate patches
for kprobes and kretprobes so that things stay small.

Regards
Sven
[PATCH] kprobes: fix compilation when KPROBE_EVENTS is enabled without kretprobes
While implementing kprobes on PA-RISC (without kretprobes), compilation
fails when CONFIG_KPROBE_EVENTS is enabled:

kernel/trace/trace_kprobe.o: in function `trace_kprobe_create':
kernel/trace/trace_kprobe.c:666: undefined reference to `kprobe_on_func_entry'
kernel/trace/trace_kprobe.o: in function `trace_kprobe_on_func_entry':
kernel/trace/trace_kprobe.c:167: undefined reference to `kprobe_on_func_entry'
kernel/trace/trace_kprobe.c:167: undefined reference to `kprobe_on_func_entry'
make: *** [Makefile:1029: vmlinux] Error 1

Given the fact that these functions have no dependencies on kretprobes,
I think they should be moved out of the #ifdef CONFIG_KRETPROBES block.

Signed-off-by: Sven Schnelle
Cc: "Naveen N. Rao"
Cc: Anil S Keshavamurthy
Cc: "David S. Miller"
Cc: Masami Hiramatsu
---
 kernel/kprobes.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index c83e54727131..10a7e67fea67 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1819,6 +1819,26 @@ unsigned long __weak arch_deref_entry_point(void *entry)
 	return (unsigned long)entry;
 }
 
+bool __weak arch_kprobe_on_func_entry(unsigned long offset)
+{
+	return !offset;
+}
+
+bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym,
+			  unsigned long offset)
+{
+	kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);
+
+	if (IS_ERR(kp_addr))
+		return false;
+
+	if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) ||
+	    !arch_kprobe_on_func_entry(offset))
+		return false;
+
+	return true;
+}
+
 #ifdef CONFIG_KRETPROBES
 /*
  * This kprobe pre_handler is registered with every kretprobe. When probe
@@ -1875,25 +1895,6 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
 }
 NOKPROBE_SYMBOL(pre_handler_kretprobe);
 
-bool __weak arch_kprobe_on_func_entry(unsigned long offset)
-{
-	return !offset;
-}
-
-bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset)
-{
-	kprobe_opcode_t *kp_addr = _kprobe_addr(addr, sym, offset);
-
-	if (IS_ERR(kp_addr))
-		return false;
-
-	if (!kallsyms_lookup_size_offset((unsigned long)kp_addr, NULL, &offset) ||
-	    !arch_kprobe_on_func_entry(offset))
-		return false;
-
-	return true;
-}
-
 int register_kretprobe(struct kretprobe *rp)
 {
 	int ret = 0;
-- 
2.20.1