Re: Appropriate liburcu cache line size for Power

2024-03-28 Thread Mathieu Desnoyers

On 2024-03-25 16:34, Nathan Lynch wrote:

Mathieu Desnoyers  writes:

In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on GitHub [2] asking to
change this to 128 bytes. All the material provided indicates
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256-byte POWER5+ L3 cache line size we have in
liburcu comes from, and whether it's the right choice for a cache line
size on all powerpc, considering that the Linux kernel cache line size
appears to be 128 bytes on recent Power architectures. I recall some
benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+
machine that benefited from a 256-byte cache line size, and I suppose
this is why we came up with this value, but I don't have the detailed
specs of that machine.

Any feedback on this matter would be appreciated.


For what it's worth, I found a copy of an IBM Journal of Research &
Development article confirming that POWER5's L3 had a 256-byte line
size:

   Each slice [of the L3] is 12-way set-associative, with 4,096
   congruence classes of 256-byte lines managed as two 128-byte sectors
   to match the L2 line size.

https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf

I don't know of any reason to prefer 256 over 128 for current Power
processors though.


Thanks for the pointer. I will add a reference to it in the liburcu
source code to explain the cache line size choice.

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: Appropriate liburcu cache line size for Power

2024-03-26 Thread Mathieu Desnoyers

On 2024-03-26 03:19, Michael Ellerman wrote:

Mathieu Desnoyers  writes:

Hi,


Hi Mathieu,


In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on GitHub [2] asking to
change this to 128 bytes. All the material provided indicates
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256-byte POWER5+ L3 cache line size we have in
liburcu comes from, and whether it's the right choice for a cache line
size on all powerpc, considering that the Linux kernel cache line size
appears to be 128 bytes on recent Power architectures. I recall some
benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+
machine that benefited from a 256-byte cache line size, and I suppose
this is why we came up with this value, but I don't have the detailed
specs of that machine.

Any feedback on this matter would be appreciated.


The ISA doesn't specify the cache line size, other than it is smaller
than a page.

In practice all the 64-bit IBM server CPUs I'm aware of have used 128
bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
Freescale e6500.

It is possible to discover at runtime via AUXV headers. But that's no
use if you want a compile-time constant.


Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.
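
For completeness, the runtime query you mention would look roughly like
the minimal sketch below (getauxval(AT_DCACHEBSIZE) from the auxiliary
vector). It is handy for diagnostics, but as you say it is of no use for
the compile-time ABI constant:

#include <elf.h>        /* AT_DCACHEBSIZE */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */

int main(void)
{
	unsigned long dcache_bsize = getauxval(AT_DCACHEBSIZE);

	if (!dcache_bsize)
		printf("AT_DCACHEBSIZE not provided by the kernel\n");
	else
		printf("Data cache block size: %lu bytes\n", dcache_bsize);
	return 0;
}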



I'm happy to run some benchmarks if you can point me at what to run. I
had a poke around the repository and found short_bench, but it seemed to
run for a very long time.


I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

Example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

This would point to false-sharing appearing with strides smaller than
64 bytes. I get similar results (false-sharing under 64 bytes) with an
AMD EPYC 9654 96-Core Processor.

The test program runs 4 threads by default; this can be overridden
with "-t N", which may be needed if you want to use all the cores of a
larger machine. See "-h" for the list of options.

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below a 128-byte stride.

On an e6500, altivec supported, Model 2.0 (pvr 8040 0120):

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

This points at false-sharing below a 64-byte stride.

I prefer to be cautious about this cache line size value and aim for
a value which takes into account the largest known cache line size
for an architecture, rather than a value which is too small, given the
large overhead caused by false-sharing.
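
For reference, the core of such a stride test boils down to something
like the following minimal sketch (this is *not* the actual
tests/unit/test_false_sharing.c from the branch above; the fixed
4-thread / 1-second parameters and all names are simplifications for
illustration):

#include <inttypes.h>
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS	4
#define RUN_MS		1000

static volatile int stop;

struct worker_arg {
	volatile uint64_t *counter;	/* this thread's counter location */
	uint64_t count;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	while (!stop) {
		(*arg->counter)++;	/* keep dirtying the same location */
		arg->count++;
	}
	return NULL;
}

int main(int argc, char **argv)
{
	size_t stride = argc > 1 ? strtoul(argv[1], NULL, 0) : 64;
	struct worker_arg args[NR_THREADS] = { 0 };
	pthread_t tids[NR_THREADS];
	unsigned char *array;
	uint64_t total = 0;
	int i;

	if (stride < sizeof(uint64_t))
		stride = sizeof(uint64_t);	/* keep counters from overlapping */

	/* Place one counter per thread, "stride" bytes apart. */
	if (posix_memalign((void **) &array, 4096, stride * NR_THREADS))
		abort();
	memset(array, 0, stride * NR_THREADS);

	for (i = 0; i < NR_THREADS; i++) {
		args[i].counter = (uint64_t *) (array + i * stride);
		pthread_create(&tids[i], NULL, worker, &args[i]);
	}
	poll(NULL, 0, RUN_MS);		/* let the workers run for RUN_MS ms */
	stop = 1;
	for (i = 0; i < NR_THREADS; i++) {
		pthread_join(tids[i], NULL);
		total += args[i].count;
	}
	printf("Stride %zu bytes, increments per ms per thread: %" PRIu64 "\n",
	       stride, total / RUN_MS / NR_THREADS);
	free(array);
	return 0;
}

The idea is simply that when the stride is smaller than the cache line
size, the per-thread counters share a line and every increment bounces
that line between cores, which is what shows up in the numbers above.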

Feedback is welcome.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Appropriate liburcu cache line size for Power

2024-03-24 Thread Mathieu Desnoyers

Hi,

In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on GitHub [2] asking to
change this to 128 bytes. All the material provided indicates
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256-byte POWER5+ L3 cache line size we have in
liburcu comes from, and whether it's the right choice for a cache line
size on all powerpc, considering that the Linux kernel cache line size
appears to be 128 bytes on recent Power architectures. I recall some
benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+
machine that benefited from a 256-byte cache line size, and I suppose
this is why we came up with this value, but I don't have the detailed
specs of that machine.

Any feedback on this matter would be appreciated.

Thanks!

Mathieu

[1] https://liburcu.org
[2] https://github.com/urcu/userspace-rcu/pull/22
[3] https://www.7-cpu.com/


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


Re: [powerpc] Boot failure kernel BUG at mm/usercopy.c:102

2023-01-10 Thread Mathieu Desnoyers

On 2023-01-10 05:42, Sachin Sant wrote:

6.2.0-rc3-next-20230109 fails to boot on powerpc with following:

[ 0.444834] [ cut here ]
[ 0.444838] kernel BUG at mm/usercopy.c:102!
[ 0.444842] Oops: Exception in kernel mode, sig: 5 [#1]
[ 0.444845] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
[ 0.444849] Modules linked in:
[ 0.444853] CPU: 23 PID: 201 Comm: modprobe Not tainted 6.2.0-rc3-next-20230109 #1
[ 0.444858] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries
[ 0.444862] NIP: c055a934 LR: c055a930 CTR: 00725d90
[ 0.444866] REGS: c7f937c0 TRAP: 0700 Not tainted (6.2.0-rc3-next-20230109)
[ 0.444871] MSR: 80029033  CR: 28088822 XER: 0007
[ 0.444879] CFAR: c02012a8 IRQMASK: 0
[ 0.444879] GPR00: c055a930 c7f93a60 c13b0800 0066
[ 0.444879] GPR04: 7fff c7f93880 c7f93878
[ 0.444879] GPR08: 7fff  c25e7150 c29672b8
[ 0.444879] GPR12: 48088824 c00e87bf6300 c017f458 c34b8100
[ 0.444879] GPR16: 00012f18eac0 7fffc5c095d0 7fffc5c095d8 00012f140040
[ 0.444879] GPR20: fcff 001f 5455 0008
[ 0.444879] GPR24: c723a0c0 7fffc5c09368  7fffc5c09370
[ 0.444879] GPR28: 0250 0001 c3017000 c723a0c0
[ 0.444922] NIP [c055a934] usercopy_abort+0xa4/0xb0
[ 0.444928] LR [c055a930] usercopy_abort+0xa0/0xb0
[ 0.444932] Call Trace:
[ 0.444933] [c7f93a60] [c055a930] usercopy_abort+0xa0/0xb0 (unreliable)
[ 0.444939] [c7f93ad0] [c050eeb8] __check_heap_object+0x198/0x1d0
[ 0.444945] [c7f93b10] [c055a7e0] __check_object_size+0x290/0x340
[ 0.444949] [c7f93b50] [c060eba4] create_elf_tables.isra.20+0xc04/0xc90
[ 0.444956] [c7f93c10] [c0610b2c] load_elf_binary+0xdac/0x1320
[ 0.444962] [c7f93d00] [c0571cf0] bprm_execve+0x3d0/0x7c0
[ 0.444966] [c7f93dc0] [c0572b9c] kernel_execve+0x1ac/0x270
[ 0.444971] [c7f93e10] [c017f5cc] call_usermodehelper_exec_async+0x17c/0x250
[ 0.444978] [c7f93e50] [c000e054] ret_from_kernel_thread+0x5c/0x64
[ 0.444983] --- interrupt: 0 at 0x0
[ 0.444986] NIP:  LR:  CTR: 
[ 0.444990] REGS: c7f93e80 TRAP:  Not tainted (6.2.0-rc3-next-20230109)
[ 0.444994] MSR:  <> CR:  XER: 
[ 0.444998] CFAR:  IRQMASK: 0
[ 0.444998] GPR00:  c7f94000
[ 0.444998] GPR04:
[ 0.444998] GPR08:
[ 0.444998] GPR12:   c017f458 c34b8100
[ 0.444998] GPR16:
[ 0.444998] GPR20:
[ 0.444998] GPR24:
[ 0.444998] GPR28:
[ 0.445039] NIP [] 0x0
[ 0.445042] LR [] 0x0
[ 0.445044] --- interrupt: 0
[ 0.445046] Code: 392990f8 4814 3d02ffe9 390827f0 7d074378 7d094378 7c661b78 3c62ffe7 f9610060 386319f0 4bca6935 6000 <0fe0> 7c0802a6
[ 0.445061] ---[ end trace ]---

Git bisect points to following patch:

commit 317c8194e6aeb8b3b573ad139fc2a0635856498e
  rseq: Introduce feature size and alignment ELF auxiliary vector entries

Reverting the patch helps boot the kernel.


Can you try with the following fix ?

https://lore.kernel.org/r/20230104192054.34046-1-mathieu.desnoy...@efficios.com
"rseq: Fix: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries"

Peter, I suspect it would be good to merge that fix into tip/master through
sched/core.

Thanks,

Mathieu




Thanks
-Sachin


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1

2022-12-07 Thread Mathieu Desnoyers

On 2022-12-06 21:09, Michael Ellerman wrote:

Mathieu Desnoyers  writes:

On 2022-12-05 17:50, Michael Ellerman wrote:

Michael Jeanson  writes:

On 2022-12-05 15:11, Michael Jeanson wrote:

Michael Jeanson  writes:

In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on
PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table
changing from their dot prefixed variant to the non-prefixed ones.

Since ftrace prefixes a dot to the syscall names when matching them to
build its syscall event list, this resulted in no syscall events being
available.

Remove the PPC64_ELF_ABI_V1 specific version of
arch_syscall_match_sym_name to have the same behavior across all powerpc
variants.


This doesn't seem to work for me.

Even with it applied I still don't see anything in
/sys/kernel/debug/tracing/events/syscalls

Did we break it in some other way recently?

cheers


I did some further testing; my config also enabled KALLSYMS_ALL, and when I
remove it there are indeed no syscall events.


Aha, OK that explains it I guess.

I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS,
but does not have KALLSYMS_ALL. So I guess there's some other bug
lurking in there.


I don't have the setup handy to validate it, but I suspect it is caused
by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol
entry needs to be integrated into the assembler output when
--all-symbols is not specified. It only keeps symbols which addresses
are in the text range. On PPC64_ELF_ABI_V1, this means only the
dot-prefixed symbols will be kept (those point to the function begin),
leaving out the non-dot-prefixed symbols (those point to the function
descriptors).


OK. So I guess it never worked without KALLSYMS_ALL.


I suspect it worked prior to kernel v5.7, because back then the PPC64 
ABIv1 system call table contained pointers to the text section 
(beginning of functions) rather than function descriptors.


So changing this system call table to point to C functions introduced 
the dot-prefix match issue for syscall tracing as well as a dependency 
on KALLSYMS_ALL.




It seems like most distros enable KALLSYMS_ALL, so I guess that's why
we've never noticed.


Or very few people run recent PPC64 ABIv1 kernels :)




So I see two possible solutions there: either we ensure that
FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify
scripts/kallsyms.c:symbol_valid() to also include function descriptor
symbols. This would mean accepting symbols pointing into the .opd ELF
section.


My only worry is that it will cause some other breakage, because .opd
symbols are not really "text" in the normal sense, i.e. you can't execute
them directly.


AFAIU adding the .opd section to scripts/kallsyms.c text_ranges will 
only affect the result of symbol_valid(), which decides which symbols 
get pulled into a KALLSYMS=n/KALLSYMS_ALL=n kernel build. Considering 
that a KALLSYMS_ALL=y build pulls those function descriptor symbols into 
 the image without breaking anything, I don't see why adding the .opd 
section to this script text_ranges would have any ill side-effect.




On the other hand the help for KALLSYMS_ALL says:

   "Normally kallsyms only contains the symbols of functions"

But without .opd included that's not really true. In practice it
probably doesn't really matter, because eg. backtraces will point to dot
symbols which can be resolved.


Indeed, I don't see this affecting backtraces, but as soon as a lookup 
depends on comparing the C function pointer to a function descriptor, 
the .opd symbols are needed. Not having those function descriptor 
symbols in KALLSYMS_ALL=n builds seems rather error-prone.





IMHO the second option would be better because it does not increase the
kernel image size as much as KALLSYMS_ALL.


Yes I agree.

Even if that did break something, any breakage would be limited to
arches which use function descriptors, which are now all rare.


Yes, it would only impact those arches using function descriptors, which 
are broken today with respect to system call tracing. Are you aware of 
architectures other than PPC64 ELF ABI v1 supported by the Linux 
kernel that use function descriptors ?




Relatedly we have a patch in next to optionally use ABIv2 for 64-bit big
endian builds.


Interesting. Does it require a matching user-space ? (built with PPC64 
ABIv2 ?) Does it handle legacy PPC32 executables ?


Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1

2022-12-06 Thread Mathieu Desnoyers

On 2022-12-05 17:50, Michael Ellerman wrote:

Michael Jeanson  writes:

On 2022-12-05 15:11, Michael Jeanson wrote:

Michael Jeanson  writes:

In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on
PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table
changing from their dot prefixed variant to the non-prefixed ones.

Since ftrace prefixes a dot to the syscall names when matching them to
build its syscall event list, this resulted in no syscall events being
available.

Remove the PPC64_ELF_ABI_V1 specific version of
arch_syscall_match_sym_name to have the same behavior across all powerpc
variants.


This doesn't seem to work for me.

Even with it applied I still don't see anything in
/sys/kernel/debug/tracing/events/syscalls

Did we break it in some other way recently?

cheers


I did some further testing; my config also enabled KALLSYMS_ALL, and when I
remove it there are indeed no syscall events.


Aha, OK that explains it I guess.

I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS,
but does not have KALLSYMS_ALL. So I guess there's some other bug
lurking in there.


I don't have the setup handy to validate it, but I suspect it is caused 
by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol 
entry needs to be integrated into the assembler output when 
--all-symbols is not specified. It only keeps symbols whose addresses 
are in the text range. On PPC64_ELF_ABI_V1, this means only the 
dot-prefixed symbols will be kept (those point to the beginning of the 
function), 
leaving out the non-dot-prefixed symbols (those point to the function 
descriptors).
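
(For readers unfamiliar with ABIv1: a non-dot symbol such as "sys_read"
points at a function descriptor, a small data object in the .opd section,
rather than at code. Roughly, and only for illustration:

struct func_desc {		/* one .opd entry on PPC64 ELF ABIv1 */
	unsigned long entry;	/* code address, what the dot-prefixed symbol points to */
	unsigned long toc;	/* TOC base (r2) to load before calling */
	unsigned long env;	/* environment pointer, unused by C */
};

So the non-dot symbols land in a data section, outside the text range
checked by symbol_valid().)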


So I see two possible solutions there: either we ensure that 
FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify 
scripts/kallsyms.c:symbol_valid() to also include function descriptor 
symbols. This would mean accepting symbols pointing into the .opd ELF 
section.


IMHO the second option would be better because it does not increase the 
kernel image size as much as KALLSYMS_ALL.
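
Concretely, the second option could be as small as the sketch below. This
is not a tested patch; it assumes the text_ranges[]/symbol_in_range()
logic that scripts/kallsyms.c uses for symbol_valid(), and the
__start_opd/__end_opd markers emitted by the powerpc64 kernel linker
script:

 static struct addr_range text_ranges[] = {
 	{ "_stext",     "_etext"     },
 	{ "_sinittext", "_einittext" },
+	{ "__start_opd", "__end_opd" },	/* PPC64 ABIv1 function descriptors */
 };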


Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH printk v3 00/40] reduce console_lock scope

2022-11-11 Thread Mathieu Desnoyers

On 2022-11-07 09:15, John Ogness wrote:
[...]


The base commit for this series is from Paul McKenney's RCU tree
and provides an NMI-safe SRCU implementation [1]. Without the
NMI-safe SRCU implementation, this series is not less safe than
mainline. But we will need the NMI-safe SRCU implementation for
atomic consoles anyway, so we might as well get it in
now. Especially since it _does_ increase the reliability for
mainline in the panic path.


So, your email got me to review the SRCU nmi-safe series:

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/log/?h=srcunmisafe.2022.10.21a

Especially this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=srcunmisafe.2022.10.21a=5d0f5953b60f5f7a278085b55ddc73e2932f4c33

I disagree with the overall approach taken there, which is to create
yet another SRCU flavor, this time with explicit "nmi-safe" read-locks.
This adds complexity to the kernel APIs and I think we can be clever
about this and make SRCU nmi-safe without requiring a whole new incompatible
API.

You can find the basic idea needed to achieve this in the libside RCU
user-space implementation. I needed to introduce a split-counter concept
to support rseq vs atomics to keep track of per-cpu grace period counters.
The "rseq" counter is the fast-path, but if rseq fails, the abort handler
uses the atomic counter instead.

https://github.com/compudj/side/blob/main/src/rcu.h#L23

struct side_rcu_percpu_count {
uintptr_t begin;
uintptr_t rseq_begin;
uintptr_t end;
uintptr_t rseq_end;
}  __attribute__((__aligned__(SIDE_CACHE_LINE_SIZE)));

The idea is to "split" each percpu counter into two counters, one for rseq,
and the other for atomics. When a grace period wants to observe the value of
a percpu counter, it simply sums the two counters:

https://github.com/compudj/side/blob/main/src/rcu.c#L112

The same idea can be applied to SRCU in the kernel: one counter for percpu ops,
and the other counter for nmi context, so basically:

srcu_read_lock()

if (likely(!in_nmi()))
  increment the percpu-ops lock counter
else
  increment the atomic lock counter

srcu_read_unlock()

if (likely(!in_nmi()))
  increment the percpu-ops unlock counter
else
  increment the atomic unlock counter

Then in the grace period sum the percpu-ops and the atomic values whenever
each counter value is read.
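
A sketch of what that sum looks like, using the side_rcu_percpu_count
layout quoted above (illustrative only, not the exact libside code):

static uintptr_t sum_period_begin(const struct side_rcu_percpu_count *counts,
				  int nr_cpus)
{
	uintptr_t sum = 0;
	int cpu;

	/* The observed value is always the sum of the two halves. */
	for (cpu = 0; cpu < nr_cpus; cpu++)
		sum += __atomic_load_n(&counts[cpu].begin, __ATOMIC_RELAXED) +
		       __atomic_load_n(&counts[cpu].rseq_begin, __ATOMIC_RELAXED);
	return sum;
}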

This would allow SRCU to be NMI-safe without requiring the callers to
explicitly state whether they need to be NMI-safe or not, and would only
take the overhead of the atomics in the NMI handlers rather than for all
users which happen to share SRCU read locks with NMI handlers.

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2022-01-12 Thread Mathieu Desnoyers
include/linux/sync_core.h
> deleted file mode 100644
> index 013da4b8b327..
> --- a/include/linux/sync_core.h
> +++ /dev/null
> @@ -1,21 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef _LINUX_SYNC_CORE_H
> -#define _LINUX_SYNC_CORE_H
> -
> -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> -#include 
> -#else
> -/*
> - * This is a dummy sync_core_before_usermode() implementation that can be 
> used
> - * on all architectures which return to user-space through core serializing
> - * instructions.
> - * If your architecture returns to user-space through non-core-serializing
> - * instructions, you need to write your own functions.
> - */
> -static inline void sync_core_before_usermode(void)
> -{
> -}
> -#endif
> -
> -#endif /* _LINUX_SYNC_CORE_H */
> -

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch

2022-01-12 Thread Mathieu Desnoyers
- On Jan 8, 2022, at 11:43 AM, Andy Lutomirski l...@kernel.org wrote:

> powerpc did the following on some, but not all, paths through
> switch_mm_irqs_off():
> 
>   /*
>* Only need the full barrier when switching between processes.
>* Barrier when switching from kernel to userspace is not
>* required here, given that it is implied by mmdrop(). Barrier
>* when switching from userspace to kernel is not needed after
>* store to rq->curr.
>*/
>   if (likely(!(atomic_read(&next->membarrier_state) &
>(MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
>   return;
> 
> This is puzzling: if !prev, then one might expect that we are switching
> from kernel to user, not user to kernel, which is inconsistent with the
> comment.  But this is all nonsense, because the one and only caller would
> never have prev == NULL and would, in fact, OOPS if prev == NULL.
> 
> In any event, this code is unnecessary, since the new generic
> membarrier_finish_switch_mm() provides the same barrier without arch help.
> 
> arch/powerpc/include/asm/membarrier.h remains as an empty header,
> because a later patch in this series will add code to it.

My disagreement with "membarrier: Make the post-switch-mm barrier explicit"
may affect this patch significantly, or even make it irrelevant.

Thanks,

Mathieu

> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin 
> Cc: Mathieu Desnoyers 
> Cc: Peter Zijlstra 
> Signed-off-by: Andy Lutomirski 
> ---
> arch/powerpc/include/asm/membarrier.h | 24 
> arch/powerpc/mm/mmu_context.c |  1 -
> 2 files changed, 25 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/membarrier.h
> b/arch/powerpc/include/asm/membarrier.h
> index de7f79157918..b90766e95bd1 100644
> --- a/arch/powerpc/include/asm/membarrier.h
> +++ b/arch/powerpc/include/asm/membarrier.h
> @@ -1,28 +1,4 @@
> #ifndef _ASM_POWERPC_MEMBARRIER_H
> #define _ASM_POWERPC_MEMBARRIER_H
> 
> -static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
> -  struct mm_struct *next,
> -  struct task_struct *tsk)
> -{
> - /*
> -  * Only need the full barrier when switching between processes.
> -  * Barrier when switching from kernel to userspace is not
> -  * required here, given that it is implied by mmdrop(). Barrier
> -  * when switching from userspace to kernel is not needed after
> -  * store to rq->curr.
> -  */
> - if (IS_ENABLED(CONFIG_SMP) &&
> - likely(!(atomic_read(&next->membarrier_state) &
> -  (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
> -   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
> - return;
> -
> - /*
> -  * The membarrier system call requires a full memory barrier
> -  * after storing to rq->curr, before going back to user-space.
> -  */
> - smp_mb();
> -}
> -
> #endif /* _ASM_POWERPC_MEMBARRIER_H */
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 74246536b832..5f2daa6b0497 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
> mm_struct *next,
>   asm volatile ("dssall");
> 
>   if (!new_on_cpu)
> - membarrier_arch_switch_mm(prev, next, tsk);
> 
>   /*
>* The actual HW switching method differs between the various
> --
> 2.33.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v3 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-09-02 Thread Mathieu Desnoyers
- On Sep 1, 2021, at 4:30 PM, Sean Christopherson sea...@google.com wrote:

> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> migrated while the kernel is handling KVM_RUN.  This is a regression test
> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> without updating rseq, leading to a stale CPU ID and other badness.
> 
> Signed-off-by: Sean Christopherson 

Thanks!

Acked-by: Mathieu Desnoyers 

> ---
> tools/testing/selftests/kvm/.gitignore  |   1 +
> tools/testing/selftests/kvm/Makefile|   3 +
> tools/testing/selftests/kvm/rseq_test.c | 236 
> 3 files changed, 240 insertions(+)
> create mode 100644 tools/testing/selftests/kvm/rseq_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore
> b/tools/testing/selftests/kvm/.gitignore
> index 0709af0144c8..6d031ff6b68e 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -47,6 +47,7 @@
> /kvm_page_table_test
> /memslot_modification_stress_test
> /memslot_perf_test
> +/rseq_test
> /set_memory_region_test
> /steal_time
> /kvm_binary_stats_test
> diff --git a/tools/testing/selftests/kvm/Makefile
> b/tools/testing/selftests/kvm/Makefile
> index 5832f510a16c..0756e79cb513 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
> TEST_GEN_PROGS_x86_64 += kvm_page_table_test
> TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test
> TEST_GEN_PROGS_x86_64 += memslot_perf_test
> +TEST_GEN_PROGS_x86_64 += rseq_test
> TEST_GEN_PROGS_x86_64 += set_memory_region_test
> TEST_GEN_PROGS_x86_64 += steal_time
> TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test
> @@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test
> TEST_GEN_PROGS_aarch64 += dirty_log_perf_test
> TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus
> TEST_GEN_PROGS_aarch64 += kvm_page_table_test
> +TEST_GEN_PROGS_aarch64 += rseq_test
> TEST_GEN_PROGS_aarch64 += set_memory_region_test
> TEST_GEN_PROGS_aarch64 += steal_time
> TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test
> @@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test
> TEST_GEN_PROGS_s390x += dirty_log_test
> TEST_GEN_PROGS_s390x += kvm_create_max_vcpus
> TEST_GEN_PROGS_s390x += kvm_page_table_test
> +TEST_GEN_PROGS_s390x += rseq_test
> TEST_GEN_PROGS_s390x += set_memory_region_test
> TEST_GEN_PROGS_s390x += kvm_binary_stats_test
> 
> diff --git a/tools/testing/selftests/kvm/rseq_test.c
> b/tools/testing/selftests/kvm/rseq_test.c
> new file mode 100644
> index ..060538bd405a
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/rseq_test.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#define _GNU_SOURCE /* for program_invocation_short_name */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "kvm_util.h"
> +#include "processor.h"
> +#include "test_util.h"
> +
> +#define VCPU_ID 0
> +
> +static __thread volatile struct rseq __rseq = {
> + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> +};
> +
> +/*
> + * Use an arbitrary, bogus signature for configuring rseq, this test does not
> + * actually enter an rseq critical section.
> + */
> +#define RSEQ_SIG 0xdeadbeef
> +
> +/*
> + * Any bug related to task migration is likely to be timing-dependent; 
> perform
> + * a large number of migrations to reduce the odds of a false negative.
> + */
> +#define NR_TASK_MIGRATIONS 10
> +
> +static pthread_t migration_thread;
> +static cpu_set_t possible_mask;
> +static bool done;
> +
> +static atomic_t seq_cnt;
> +
> +static void guest_code(void)
> +{
> + for (;;)
> + GUEST_SYNC(0);
> +}
> +
> +static void sys_rseq(int flags)
> +{
> + int r;
> +
> + r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG);
> + TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno));
> +}
> +
> +static void *migration_worker(void *ign)
> +{
> + cpu_set_t allowed_mask;
> + int r, i, nr_cpus, cpu;
> +
> + CPU_ZERO(&allowed_mask);
> +
> + nr_cpus = CPU_COUNT(&possible_mask);
> +
> + for (i = 0; i < NR_TASK_MIGRATIONS; i++) {
> + cpu = i % nr_cpus;
> + if (!CPU_ISSET(cpu, &possible_mask))
> + continue;
> +
> + CPU_SET

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote:

> On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
[...]
>> Does it reproduce if we randomize the delay to have it picked randomly from 
>> 0us
>> to 100us (with 1us step) ? It would remove a lot of the needs for 
>> arch-specific
>> magic delay value.
> 
> My less-than-scientific testing shows that it can reproduce at delays up to
> ~500us,
> but above ~10us the reproducibility starts to drop.  The bug still reproduces
> reliably, it just takes more iterations, and obviously the test runs a bit
> slower.
> 
> Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?

Works for me, thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers



- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
>> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com 
>> wrote:
>> >> >> +  r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> >> +  TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> >> +  errno, strerror(errno));
>> >> >> +  atomic_inc(&seq_cnt);
>> >> >> +
>> >> >> +  CPU_CLR(cpu, &allowed_mask);
>> >> >> +
>> >> >> +  /*
>> >> >> +   * Let the read-side get back into KVM_RUN to improve the odds
>> >> >> +   * of task migration coinciding with KVM's run loop.
>> >> > 
>> >> > This comment should be about increasing the odds of letting the seqlock
>> >> > read-side complete. Otherwise, the delay between the two back-to-back
>> >> > atomic_inc is so small that the seqlock read-side may never have time to
>> >> > complete the reading the rseq cpu id and the sched_getcpu() call, and 
>> >> > can
>> >> > retry forever.
>> > 
>> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock 
>> > isn't
>> > possible (though that syscall would have to be screaming fast),
>> 
>> I don't think we have the same understanding of the livelock scenario. AFAIU 
>> the
>> livelock
>> would be caused by a too-small delay between the two consecutive atomic_inc()
>> from one
>> loop iteration to the next compared to the time it takes to perform a 
>> read-side
>> critical
>> section of the seqlock. Back-to-back atomic_inc can be performed very 
>> quickly,
>> so I
>> doubt that the sched_getcpu implementation have good odds to be fast enough 
>> to
>> complete
>> in that narrow window, leading to lots of read seqlock retry.
> 
> Ooooh, yeah, brain fart on my side.  I was thinking of the two atomic_inc() in
> the
> same loop iteration and completely ignoring the next iteration.  Yes, I 100%
> agree
> that a delay to ensure forward progress is needed.  An assertion in main() 
> that
> the
> reader complete at least some reasonable number of KVM_RUNs is also probably a
> good
> idea, e.g. to rule out a false pass due to the reader never making forward
> progress.

Agreed.

> 
> FWIW, the do-while loop does make forward progress without a delay, but at 
> ~50%
> throughput, give or take.

I did not expect absolutely no progress, but a significant slow down of
the read-side.

> 
>> > but the primary motivation is very much to allow the read-side enough time
>> > to get back into KVM proper.
>> 
>> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we have:
>> 
>> migration thread                          KVM_RUN/read-side thread
>> -----------------------------------------------------------------------------
>>                                           - ioctl(KVM_RUN)
>> - atomic_inc_seq_cst()
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>>                                           - a = atomic_load() & ~1
>>                                           - smp_rmb()
>>                                           - b = LOAD_ONCE(__rseq_abi->cpu_id);
>>                                           - sched_getcpu()
>>                                           - smp_rmb()
>>                                           - re-load seqcnt/compare (succeeds)
>>                                             - Can only succeed if entire read-side
>>                                               happens while the seqcnt is in an
>>                                               even numbered state.
>>                                           - if (a != b) abort()
>>   /* no delay. Even counter state is very
>>      short. */
>> - atomic_inc_seq_cst()
>>   /* Let's suppose the lack of delay causes the
>>      setaffinity to complete too early compared
>>      with KVM_RUN ioctl */
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>> 
>>   /* no delay. Even counter state is very
>>      short. */
>> - atomic_inc_seq_cst()
>>   /* Then a setaffinity from a following
>>      migration thread loop wil

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-26 Thread Mathieu Desnoyers
- On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote:

> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>> 
>> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
>> > wrote:
>> > 
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >> 
>> > 
>> > [...]
>> > 
>> > +#define RSEQ_SIG 0xdeadbeef
>> > 
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> > 
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
> 
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that
> the test doesn't actually do any rseq critical sections.

It might be good to add a comment near this define stating this then, so nobody
attempts to copy this as an example.

> 
>> > [...]
>> > 
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> + cpu_set_t allowed_mask;
>> >> + int r, i, nr_cpus, cpu;
>> >> +
>> >> + CPU_ZERO(_mask);
>> >> +
>> >> + nr_cpus = CPU_COUNT(_mask);
>> >> +
>> >> + for (i = 0; i < 2; i++) {
>> >> + cpu = i % nr_cpus;
>> >> + if (!CPU_ISSET(cpu, _mask))
>> >> + continue;
>> >> +
>> >> + CPU_SET(cpu, _mask);
>> >> +
>> >> + /*
>> >> +  * Bump the sequence count twice to allow the reader to detect
>> >> +  * that a migration may have occurred in between rseq and sched
>> >> +  * CPU ID reads.  An odd sequence count indicates a migration
>> >> +  * is in-progress, while a completely different count indicates
>> >> +  * a migration occurred since the count was last read.
>> >> +  */
>> >> + atomic_inc(&seq_cnt);
>> > 
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(>val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
> 
> Yeah, I got quite lost trying to figure out what atomics the test would 
> actually
> end up with.

At the very least, until things are clarified in the selftests atomic header, I 
would
recommend adding a comment stating which memory barriers are required alongside 
each
use of atomic_inc here. I would even prefer that we add extra (currently 
unneeded)
write barriers to make extra sure that this stays documented. Performance of 
the write-side
does not matter much here.

> 
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> > 
>> > 
>> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> + errno, strerror(errno));
>> >> + atomic_inc(&seq_cnt);
>> >> +
>> >> + CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> +  * Let the read-side get back into KVM_RUN to improve the odds
>> >> +  * of task migration coinciding with KVM's run loop.
>> > 
>> > This comment should be about increasing the odds of letting the se

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-23 Thread Mathieu Desnoyers
[ re-send to Darren Hart ]

- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com 
> wrote:
> 
>> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> migrated while the kernel is handling KVM_RUN.  This is a regression test
>> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> without updating rseq, leading to a stale CPU ID and other badness.
>> 
> 
> [...]
> 
> +#define RSEQ_SIG 0xdeadbeef
> 
> Is there any reason for defining a custom signature rather than including
> tools/testing/selftests/rseq/rseq.h ? This should take care of including
> the proper architecture header which will define the appropriate signature.
> 
> Arguably you don't define rseq critical sections in this test per se, but
> I'm wondering why the custom signature here.
> 
> [...]
> 
>> +
>> +static void *migration_worker(void *ign)
>> +{
>> +cpu_set_t allowed_mask;
>> +int r, i, nr_cpus, cpu;
>> +
>> +CPU_ZERO(&allowed_mask);
>> +
>> +nr_cpus = CPU_COUNT(&possible_mask);
>> +
>> +for (i = 0; i < 2; i++) {
>> +cpu = i % nr_cpus;
>> +if (!CPU_ISSET(cpu, &possible_mask))
>> +continue;
>> +
>> +CPU_SET(cpu, &allowed_mask);
>> +
>> +/*
>> + * Bump the sequence count twice to allow the reader to detect
>> + * that a migration may have occurred in between rseq and sched
>> + * CPU ID reads.  An odd sequence count indicates a migration
>> + * is in-progress, while a completely different count indicates
>> + * a migration occurred since the count was last read.
>> + */
>> +atomic_inc(&seq_cnt);
> 
> So technically this atomic_inc contains the required barriers because the
> selftests
> implementation uses "__sync_add_and_fetch(>val, 1)". But it's rather odd
> that
> the semantic differs from the kernel implementation in terms of memory 
> barriers:
> the
> kernel implementation of atomic_inc guarantees no memory barriers, but this 
> one
> happens to provide full barriers pretty much by accident (selftests
> futex/include/atomic.h documents no such guarantee).
> 
> If this full barrier guarantee is indeed provided by the selftests atomic.h
> header,
> I would really like a comment stating that in the atomic.h header so the 
> carpet
> is
> not pulled from under our feet by a future optimization.
> 
> 
>> +r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> +TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> +errno, strerror(errno));
>> +atomic_inc(&seq_cnt);
>> +
>> +CPU_CLR(cpu, &allowed_mask);
>> +
>> +/*
>> + * Let the read-side get back into KVM_RUN to improve the odds
>> + * of task migration coinciding with KVM's run loop.
> 
> This comment should be about increasing the odds of letting the seqlock
> read-side
> complete. Otherwise, the delay between the two back-to-back atomic_inc is so
> small
> that the seqlock read-side may never have time to complete the reading the 
> rseq
> cpu id and the sched_getcpu() call, and can retry forever.
> 
> I'm wondering if 1 microsecond is sufficient on other architectures as well. 
> One
> alternative way to make this depend less on the architecture's implementation 
> of
> sched_getcpu (whether it's a vDSO, or goes through a syscall) would be to read
> the rseq cpu id and call sched_getcpu a few times (e.g. 3 times) in the
> migration
> thread rather than use usleep, and throw away the value read. This would 
> ensure
> the delay is appropriate on all architectures.
> 
> Thanks!
> 
> Mathieu
> 
>> + */
>> +usleep(1);
>> +}
>> +done = true;
>> +return NULL;
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> +struct kvm_vm *vm;
>> +u32 cpu, rseq_cpu;
>> +int r, snapshot;
>> +
>> +/* Tell stdout not to buffer its content */
>> +setbuf(stdout, NULL);
>> +
> +r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
>> +TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
>> +strerror(errno));
>> +
>> +if (CPU_COUNT(_

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-23 Thread Mathieu Desnoyers
_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> + "Guest failed?");
> +
> + /*
> +  * Verify rseq's CPU matches sched's CPU.  Ensure migration
> +  * doesn't occur between sched_getcpu() and reading the rseq
> +  * cpu_id by rereading both if the sequence count changes, or
> +  * if the count is odd (migration in-progress).
> +  */
> + do {
> + /*
> +  * Drop bit 0 to force a mismatch if the count is odd,
> +  * i.e. if a migration is in-progress.
> +  */
> + snapshot = atomic_read(&seq_cnt) & ~1;
> + smp_rmb();
> + cpu = sched_getcpu();
> + rseq_cpu = READ_ONCE(__rseq.cpu_id);
> + smp_rmb();
> + } while (snapshot != atomic_read(&seq_cnt));
> +
> + TEST_ASSERT(rseq_cpu == cpu,
> + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
> + }
> +
> + pthread_join(migration_thread, NULL);
> +
> + kvm_vm_free(vm);
> +
> + sys_rseq(RSEQ_FLAG_UNREGISTER);
> +
> + return 0;
> +}
> --
> 2.33.0.rc2.250.ged5fa647cd-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest

2021-08-23 Thread Mathieu Desnoyers
- On Aug 20, 2021, at 6:49 PM, Sean Christopherson sea...@google.com wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
> transferring to a KVM guest, which is roughly equivalent to an exit to
> userspace and processes many of the same pending actions.  While the task
> cannot be in an rseq critical section as the KVM path is reachable only
> by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
> critical section still apply, e.g. the current CPU needs to be updated if
> the task is migrated.
> 
> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
> and other badness in userspace VMMs that use rseq in combination with KVM,
> e.g. due to the CPU ID being stale after task migration.

Acked-by: Mathieu Desnoyers 

> 
> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
> Reported-by: Peter Foley 
> Bisected-by: Doug Evans 
> Cc: Shakeel Butt 
> Cc: Thomas Gleixner 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson 
> ---
> kernel/entry/kvm.c |  4 +++-
> kernel/rseq.c  | 14 +++---
> 2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 49972ee99aff..049fd06b4c3d 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
>   if (ti_work & _TIF_NEED_RESCHED)
>   schedule();
> 
> - if (ti_work & _TIF_NOTIFY_RESUME)
> + if (ti_work & _TIF_NOTIFY_RESUME) {
>   tracehook_notify_resume(NULL);
> + rseq_handle_notify_resume(NULL, NULL);
> + }
> 
>   ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
>   if (ret)
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..6d45ac3dae7f 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> struct pt_regs *regs)
> 
>   if (unlikely(t->flags & PF_EXITING))
>   return;
> - ret = rseq_ip_fixup(regs);
> - if (unlikely(ret < 0))
> - goto error;
> +
> + /*
> +  * regs is NULL if and only if the caller is in a syscall path.  Skip
> +  * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> +  * kill a misbehaving userspace on debug kernels.
> +  */
> + if (regs) {
> + ret = rseq_ip_fixup(regs);
> + if (unlikely(ret < 0))
> + goto error;
> + }
>   if (unlikely(rseq_update_cpu_id(t)))
>   goto error;
>   return;
> --
> 2.33.0.rc2.250.ged5fa647cd-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest

2021-08-20 Thread Mathieu Desnoyers
- On Aug 19, 2021, at 7:48 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
>> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com 
>> wrote:
>> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
>> > * If not nested over a rseq critical section, restart is useless.
>> > * Clear the rseq_cs pointer and return.
>> > */
>> > -  if (!in_rseq_cs(ip, _cs))
>> > +  if (!regs || !in_rseq_cs(ip, _cs))
>> 
>> I think clearing the thread's rseq_cs unconditionally here when regs is NULL
>> is not the behavior we want when this is called from xfer_to_guest_mode_work.
>> 
>> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
>> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
>> kill this application in the rseq_syscall handler when exiting back to 
>> usermode
>> when the ioctl eventually returns.
>> 
>> However, clearing the thread's rseq_cs will prevent this from happening.
>> 
>> So I would favor an approach where we simply do:
>> 
>> if (!regs)
>>  return 0;
>> 
>> Immediately at the beginning of rseq_ip_fixup, before getting the instruction
>> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, 
>> it
>> is not relevant to do any fixup here, because it is nested in a ioctl system
>> call.
>> 
>> Effectively, this would preserve the SIGSEGV behavior when this ioctl is
>> erroneously called by user-space from a rseq critical section.
> 
> Ha, that's effectively what I implemented first, but I changed it because of 
> the
> comment in clear_rseq_cs() that says:
> 
>  The rseq_cs field is set to NULL on preemption or signal delivery ... as well
>  as well as on top of code outside of the rseq assembly block.
> 
> Which makes it sound like something might rely on clearing rseq_cs?

This comment succinctly describes the lazy-clear scheme for rseq_cs.

Without the lazy clear scheme, a rseq c.s. would look like:

 * init(rseq_cs)
 * cpu = TLS->rseq::cpu_id_start
 *   [1]   TLS->rseq::rseq_cs = rseq_cs
 *   [start_ip]
 *   [2]   if (cpu != TLS->rseq::cpu_id)
 * goto abort_ip;
 *   [3]   
 *   [post_commit_ip]  
 *   [4]   TLS->rseq::rseq_cs = NULL

But as a fast-path optimization, [4] is not entirely needed because the rseq_cs
descriptor contains information about the instruction pointer range of the 
critical
section. Therefore, userspace can omit [4], but if the kernel never clears it, 
it
means that it will have to re-read the rseq_cs descriptor's content each time it
needs to check it to confirm that it is not nested over a rseq c.s..

So making the kernel lazily clear the rseq_cs pointer is just an optimization 
which
ensures that the kernel won't do useless work the next time it needs to check
rseq_cs, given that it has already validated that the userspace code is 
currently
not within the rseq c.s. currently advertised by the rseq_cs field.

> 
> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in 
> an
> rseq critical section, and because syscalls in critical sections are illegal, 
> by
> definition clearing rseq_cs is a nop unless userspace is misbehaving.

Not quite, as I described above. But we want it to stay set so the 
CONFIG_DEBUG_RSEQ
code executed when returning from ioctl to userspace will be able to validate 
that
it is not nested within a rseq critical section.

> 
> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is 
> it
> not worth the extra code to detect an error that will likely be caught 
> anyways?

The error will indeed already be caught on return from ioctl to userspace, so I
don't see any added value in duplicating this check.

Thanks,

Mathieu

> 
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..28b8342290b0 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,
> struct pt_regs *regs)
> 
>if (unlikely(t->flags & PF_EXITING))
>return;
> +   if (!regs) {
> +#ifdef CONFIG_DEBUG_RSEQ
> +   if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
> +   goto error;
> +#endif
> +       return;
> +   }
>ret = rseq_ip_fixup(regs);
>    if (unlikely(ret < 0))
>goto error;
> 
>> Thanks for looking into this !
>> 
>> Mathieu
>> 
>> >return clear_rseq_cs(t);
>> >ret = rseq_need_restart(t, rseq_cs.flags);
>> >if (ret <= 0)
>> > --
>> > 2.33.0.rc1.237.g0d66db33f3-goog
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-20 Thread Mathieu Desnoyers
- On Aug 19, 2021, at 7:33 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
>> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com 
>> wrote:
>> 
>> > Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> > migrated while the kernel is handling KVM_RUN.  This is a regression test
>> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> > without updating rseq, leading to a stale CPU ID and other badness.
>> > 
>> > Signed-off-by: Sean Christopherson 
>> > ---
>> 
>> [...]
>> 
>> > +  while (!done) {
>> > +  vcpu_run(vm, VCPU_ID);
>> > +  TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
>> > +  "Guest failed?");
>> > +
>> > +  cpu = sched_getcpu();
>> > +  rseq_cpu = READ_ONCE(__rseq.cpu_id);
>> > +
>> > +  /*
>> > +   * Verify rseq's CPU matches sched's CPU, and that sched's CPU
>> > +   * is stable.  This doesn't handle the case where the task is
>> > +   * migrated between sched_getcpu() and reading rseq, and again
>> > +   * between reading rseq and sched_getcpu(), but in practice no
>> > +   * false positives have been observed, while on the other hand
>> > +   * blocking migration while this thread reads CPUs messes with
>> > +   * the timing and prevents hitting failures on a buggy kernel.
>> > +   */
>> 
>> I think you could get a stable cpu id between sched_getcpu and 
>> __rseq_abi.cpu_id
>> if you add a pthread mutex to protect:
>> 
>> sched_getcpu and __rseq_abi.cpu_id  reads
>> 
>> vs
>> 
>> sched_setaffinity calls within the migration thread.
>> 
>> Thoughts ?
> 
> I tried that and couldn't reproduce the bug.  That's what I attempted to call
> out
> in the blurb "blocking migration while this thread reads CPUs ... prevents
> hitting
> failures on a buggy kernel".
> 
> I considered adding arbitrary delays around the mutex to try and hit the bug,
> but
> I was worried that even if I got it "working" for this bug, the test would be
> too
> tailored to this bug and potentially miss future regression.  Letting the two
> threads run wild seemed like it would provide the best coverage, at the cost 
> of
> potentially causing to false failures.

OK, so your point is that using mutual exclusion to ensure stability of the cpu 
id
changes the timings too much, to a point where the issues don't reproduce. I 
understand
that this mutex ties the migration thread timings to the vcpu thread's use of 
the mutex,
which will reduce timings randomness, which is unwanted here.

I still really hate flakiness in tests, because then people stop caring when 
they
fail once in a while. And with the nature of rseq, a once-in-a-while failure is 
a
big deal. Let's see if we can use other tricks to ensure stability of the cpu id
without changing timings too much.

One idea would be to use a seqcount lock. But even if we use that, I'm 
concerned that
the very long writer critical section calling sched_setaffinity would need to be
alternated with a sleep to ensure the read-side progresses. The sleep delay 
could be
relatively small compared to the duration of the sched_setaffinity call, e.g. 
ratio
1:10.

static volatile uint64_t seqcnt;

The thread responsible for setting the affinity would do something like:

for (;;) {
  atomic_inc_seq_cst();
  sched_setaffinity(..., n++ % nr_cpus);
  atomic_inc_seq_cst();
  usleep(1);  /* this is where read-side is allowed to progress. */
}

And the thread reading the rseq cpu id and calling sched_getcpu():

uint64_t snapshot;

do {
  snapshot = atomic_load() & ~1; /* force retry if odd */
  smp_rmb();
  cpu = sched_getcpu();
  rseq_cpu = READ_ONCE(__rseq.cpu_id);
  smp_rmb();
} while (snapshot != atomic_load());

So the reader retry the cpu id reads whenever sched_setaffinity is being
called by the migration thread, and whenever it is preempted for more
than one migration thread loop.

That should achieve our goal of providing cpu id stability without significantly
changing the timings of the migration thread, given that it never blocks waiting
for the reader.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-19 Thread Mathieu Desnoyers
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote:

> Add a test to verify an rseq's CPU ID is updated correctly if the task is
> migrated while the kernel is handling KVM_RUN.  This is a regression test
> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
> without updating rseq, leading to a stale CPU ID and other badness.
> 
> Signed-off-by: Sean Christopherson 
> ---

[...]

> +
> +static void *migration_worker(void *ign)
> +{
> + cpu_set_t allowed_mask;
> + int r, i, nr_cpus, cpu;
> +
> + CPU_ZERO(&allowed_mask);
> +
> + nr_cpus = CPU_COUNT(&possible_mask);
> +
> + for (i = 0; i < 2; i++) {
> + cpu = i % nr_cpus;
> + if (!CPU_ISSET(cpu, &possible_mask))
> + continue;
> +
> + CPU_SET(cpu, &allowed_mask);
> +
> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno,
> + strerror(errno));
> +
> + CPU_CLR(cpu, &allowed_mask);
> +
> + usleep(10);
> + }
> + done = true;
> + return NULL;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + struct kvm_vm *vm;
> + u32 cpu, rseq_cpu;
> + int r;
> +
> + /* Tell stdout not to buffer its content */
> + setbuf(stdout, NULL);
> +
> + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask);
> + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno,
> + strerror(errno));
> +
> + if (CPU_COUNT(&possible_mask) < 2) {
> + print_skip("Only one CPU, task migration not possible\n");
> + exit(KSFT_SKIP);
> + }
> +
> + sys_rseq(0);
> +
> + /*
> +  * Create and run a dummy VM that immediately exits to userspace via
> +  * GUEST_SYNC, while concurrently migrating the process by setting its
> +  * CPU affinity.
> +  */
> + vm = vm_create_default(VCPU_ID, 0, guest_code);
> +
> + pthread_create(&migration_thread, NULL, migration_worker, 0);
> +
> + while (!done) {
> + vcpu_run(vm, VCPU_ID);
> + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC,
> + "Guest failed?");
> +
> + cpu = sched_getcpu();
> + rseq_cpu = READ_ONCE(__rseq.cpu_id);
> +
> + /*
> +  * Verify rseq's CPU matches sched's CPU, and that sched's CPU
> +  * is stable.  This doesn't handle the case where the task is
> +  * migrated between sched_getcpu() and reading rseq, and again
> +  * between reading rseq and sched_getcpu(), but in practice no
> +  * false positives have been observed, while on the other hand
> +  * blocking migration while this thread reads CPUs messes with
> +  * the timing and prevents hitting failures on a buggy kernel.
> +  */

I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id
if you add a pthread mutex to protect:

sched_getcpu and __rseq_abi.cpu_id  reads

vs

sched_setaffinity calls within the migration thread.
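
For what it's worth, a minimal sketch of what I have in mind, reusing the
variables from the test quoted above (the mutex name is just a placeholder,
not an actual patch):

static pthread_mutex_t affinity_lock = PTHREAD_MUTEX_INITIALIZER;

/* migration_worker(), around the affinity change: */
pthread_mutex_lock(&affinity_lock);
r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
pthread_mutex_unlock(&affinity_lock);

/* main(), around the two cpu id reads after the KVM_RUN exit: */
pthread_mutex_lock(&affinity_lock);
cpu = sched_getcpu();
rseq_cpu = READ_ONCE(__rseq.cpu_id);
pthread_mutex_unlock(&affinity_lock);

That way sched_getcpu() and the rseq cpu_id read are observed atomically
with respect to the migration thread's sched_setaffinity() calls.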

Thoughts ?

Thanks,

Mathieu

> + TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(),
> + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);
> + }
> +
> + pthread_join(migration_thread, NULL);
> +
> + kvm_vm_free(vm);
> +
> + sys_rseq(RSEQ_FLAG_UNREGISTER);
> +
> + return 0;
> +}
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest

2021-08-19 Thread Mathieu Desnoyers
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
> transferring to a KVM guest, which is roughly equivalent to an exit to
> userspace and processes many of the same pending actions.  While the task
> cannot be in an rseq critical section as the KVM path is reachable only
> via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
> critical section still apply, e.g. the CPU ID needs to be updated if the
> task is migrated.
> 
> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
> and other badness in userspace VMMs that use rseq in combination with KVM,
> e.g. due to the CPU ID being stale after task migration.

I agree with the problem assessment, but I would recommend a small change
to this fix.

> 
> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")
> Reported-by: Peter Foley 
> Bisected-by: Doug Evans 
> Cc: Shakeel Butt 
> Cc: Thomas Gleixner 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson 
> ---
> kernel/entry/kvm.c | 4 +++-
> kernel/rseq.c  | 4 ++--
> 2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 49972ee99aff..049fd06b4c3d 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,
> unsigned long ti_work)
>   if (ti_work & _TIF_NEED_RESCHED)
>   schedule();
> 
> - if (ti_work & _TIF_NOTIFY_RESUME)
> + if (ti_work & _TIF_NOTIFY_RESUME) {
>   tracehook_notify_resume(NULL);
> + rseq_handle_notify_resume(NULL, NULL);
> + }
> 
>   ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
>   if (ret)
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 35f7bd0fced0..58c79a7918cd 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs
> *rseq_cs)
> 
> static int rseq_ip_fixup(struct pt_regs *regs)
> {
> - unsigned long ip = instruction_pointer(regs);
> + unsigned long ip = regs ? instruction_pointer(regs) : 0;
>   struct task_struct *t = current;
>   struct rseq_cs rseq_cs;
>   int ret;
> @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)
>* If not nested over a rseq critical section, restart is useless.
>* Clear the rseq_cs pointer and return.
>*/
> - if (!in_rseq_cs(ip, &rseq_cs))
> + if (!regs || !in_rseq_cs(ip, &rseq_cs))

I think clearing the thread's rseq_cs unconditionally here when regs is NULL
is not the behavior we want when this is called from xfer_to_guest_mode_work.

If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
kill this application in the rseq_syscall handler when exiting back to usermode
when the ioctl eventually returns.

However, clearing the thread's rseq_cs will prevent this from happening.

So I would favor an approach where we simply do:

if (!regs)
 return 0;

Immediately at the beginning of rseq_ip_fixup, before getting the instruction
pointer, effectively skipping all side-effects of the ip fixup code. Indeed,
it is not relevant to do any fixup here, because it is nested in an ioctl
system call.

Effectively, this would preserve the SIGSEGV behavior when this ioctl is
erroneously called by user-space from a rseq critical section.
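
To illustrate, the alternative would look roughly like the following sketch
(based on the function quoted above; not a tested patch):

static int rseq_ip_fixup(struct pt_regs *regs)
{
        unsigned long ip;

        /*
         * When invoked from the ioctl(KVM_RUN) xfer-to-guest path, regs is
         * NULL: no fixup is relevant, so skip all side-effects (including
         * clearing the rseq_cs pointer), which keeps the CONFIG_DEBUG_RSEQ
         * check effective when the ioctl returns to user-space.
         */
        if (!regs)
                return 0;

        ip = instruction_pointer(regs);
        /* ... rest of the existing fixup logic, unchanged ... */
}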

Thanks for looking into this !

Mathieu

>   return clear_rseq_cs(t);
>   ret = rseq_need_restart(t, rseq_cs.flags);
>   if (ret <= 0)
> --
> 2.33.0.rc1.237.g0d66db33f3-goog

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()

2021-08-19 Thread Mathieu Desnoyers
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote:

> Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
> that the two function are always called back-to-back by architectures
> that have rseq.  The rseq helper is stubbed out for architectures that
> don't support rseq, i.e. this is a nop across the board.
> 
> Note, tracehook_notify_resume() is horribly named and arguably does not
> belong in tracehook.h as literally every line of code in it has nothing
> to do with tracing.  But, that's been true since commit a42c6ded827d
> ("move key_repace_session_keyring() into tracehook_notify_resume()")
> first usurped tracehook_notify_resume() back in 2012.  Punt cleaning that
> mess up to future patches.
> 
> No functional change intended.

This will make it harder to introduce new code paths which consume
NOTIFY_RESUME without calling the rseq callback, which is what introduces
issues. Agreed.

Acked-by: Mathieu Desnoyers 

> 
> Signed-off-by: Sean Christopherson 
> ---
> arch/arm/kernel/signal.c | 1 -
> arch/arm64/kernel/signal.c   | 1 -
> arch/csky/kernel/signal.c| 4 +---
> arch/mips/kernel/signal.c| 4 +---
> arch/powerpc/kernel/signal.c | 4 +---
> arch/s390/kernel/signal.c| 1 -
> include/linux/tracehook.h| 2 ++
> kernel/entry/common.c| 4 +---
> kernel/entry/kvm.c   | 4 +---
> 9 files changed, 7 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
> index a3a38d0a4c85..9df68d139965 100644
> --- a/arch/arm/kernel/signal.c
> +++ b/arch/arm/kernel/signal.c
> @@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int
> thread_flags, int syscall)
>   uprobe_notify_resume(regs);
>   } else {
>   tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
>   }
>   }
>   local_irq_disable();
> diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
> index 23036334f4dc..22b55db13da6 100644
> --- a/arch/arm64/kernel/signal.c
> +++ b/arch/arm64/kernel/signal.c
> @@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
> 
>   if (thread_flags & _TIF_NOTIFY_RESUME) {
>   tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
> 
>   /*
>* If we reschedule after checking the affinity
> diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c
> index 312f046d452d..bc4238b9f709 100644
> --- a/arch/csky/kernel/signal.c
> +++ b/arch/csky/kernel/signal.c
> @@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs,
>   if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>   do_signal(regs);
> 
> - if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> + if (thread_info_flags & _TIF_NOTIFY_RESUME)
>   tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
> - }
> }
> diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c
> index f1e985109da0..c9b2a75563e1 100644
> --- a/arch/mips/kernel/signal.c
> +++ b/arch/mips/kernel/signal.c
> @@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, 
> void
> *unused,
>   if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
>   do_signal(regs);
> 
> - if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> + if (thread_info_flags & _TIF_NOTIFY_RESUME)
>   tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
> - }
> 
>   user_enter();
> }
> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> index e600764a926c..b93b87df499d 100644
> --- a/arch/powerpc/kernel/signal.c
> +++ b/arch/powerpc/kernel/signal.c
> @@ -293,10 +293,8 @@ void do_notify_resume(struct pt_regs *regs, unsigned long
> thread_info_flags)
>   do_signal(current);
>   }
> 
> - if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> + if (thread_info_flags & _TIF_NOTIFY_RESUME)
>   tracehook_notify_resume(regs);
> - rseq_handle_notify_resume(NULL, regs);
> - }
> }
> 
> static unsigned long get_tm_stackpointer(struct task_struct *tsk)
> diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
> index 78ef53b29958..b307db26bf2d 100644
> --- a/arch/s390/kernel/signal.c
> +++ b/arch/s390/kernel/signal.c
> 

Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-18 Thread Mathieu Desnoyers
- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski l...@kernel.org wrote:

> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
>> - On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote:
>> 
>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
>> > 
>> >> Please change back this #ifndef / #else / #endif within function for
>> >> 
>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>> >>   ...
>> >> } else {
>> >>   ...
>> >> }
>> >> 
>> >> I don't think mixing up preprocessor and code logic makes it more 
>> >> readable.
>> > 
>> > I agree, but I don't know how to make the result work well.
>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
>> > case, so either I need to fake up a definition or use #ifdef.
>> > 
>> > If I faked up a definition, I would want to assert, at build time, that
>> > it isn't called.  I don't think we can do:
>> > 
>> > static void membarrier_sync_core_before_usermode()
>> > {
>> >BUILD_BUG_IF_REACHABLE();
>> > }
>> 
>> Let's look at the context here:
>> 
>> static void ipi_sync_core(void *info)
>> {
>> []
>> membarrier_sync_core_before_usermode()
>> }
>> 
>> ^ this can be within #ifdef / #endif
>> 
>> static int membarrier_private_expedited(int flags, int cpu_id)
>> [...]
>>if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
>> return -EINVAL;
>> if (!(atomic_read(&mm->membarrier_state) &
>>   MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>> return -EPERM;
>> ipi_func = ipi_sync_core;
>> 
>> All we need to make the line above work is to define an empty ipi_sync_core
>> function in the #else case after the ipi_sync_core() function definition.
>> 
>> Or am I missing your point ?
> 
> Maybe?
> 
> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the 
> core.
> I would be fine with that if I could have the compiler statically verify that
> it’s not called, but I’m uncomfortable having it there if the implementation 
> is
> actively incorrect.

I see. Another approach would be to implement a "setter" function to populate
"ipi_func". That setter function would return -EINVAL in its
#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE implementation.
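
For illustration, a rough sketch of that alternative (the helper name
membarrier_select_sync_core_ipi() is made up for this example):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
        /* Same body as today: order memory, then core-serialize. */
        smp_mb();
        membarrier_sync_core_before_usermode();
}

static int membarrier_select_sync_core_ipi(smp_call_func_t *ipi_func)
{
        *ipi_func = ipi_sync_core;
        return 0;
}
#else
static int membarrier_select_sync_core_ipi(smp_call_func_t *ipi_func)
{
        return -EINVAL; /* sync-core is not implemented by this architecture. */
}
#endif

membarrier_private_expedited() would then call the setter for
MEMBARRIER_FLAG_SYNC_CORE and bail out on error, keeping the
MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY check as it is, so the
only ipi_sync_core definition left is one that really does sync the core.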

Would that be better ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

2021-06-18 Thread Mathieu Desnoyers
- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.le...@csgroup.eu 
wrote:
[...]
> 
> I don't understand all that complexity to just replace a simple
> 'smp_mb__after_unlock_lock()'.
> 
> #define smp_mb__after_unlock_lock()   smp_mb()
> #define smp_mb()  barrier()
> # define barrier() __asm__ __volatile__("": : :"memory")
> 
> 
> Am I missing some subtility ?

On powerpc CONFIG_SMP, smp_mb() is actually defined as:

#define smp_mb()        __smp_mb()
#define __smp_mb()      mb()
#define mb()            __asm__ __volatile__ ("sync" : : : "memory")

So the original motivation here was to skip a "sync" instruction whenever
switching between threads which are part of the same process. But based on
recent discussions, I suspect my implementation may be doing so inaccurately.

Thanks,

Mathieu


> 
> Thanks
> Christophe

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-18 Thread Mathieu Desnoyers
- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote:

> On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> 
>> Please change back this #ifndef / #else / #endif within function for
>> 
>> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>>   ...
>> } else {
>>   ...
>> }
>> 
>> I don't think mixing up preprocessor and code logic makes it more readable.
> 
> I agree, but I don't know how to make the result work well.
> membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> case, so either I need to fake up a definition or use #ifdef.
> 
> If I faked up a definition, I would want to assert, at build time, that
> it isn't called.  I don't think we can do:
> 
> static void membarrier_sync_core_before_usermode()
> {
>BUILD_BUG_IF_REACHABLE();
> }

Let's look at the context here:

static void ipi_sync_core(void *info)
{
[]
membarrier_sync_core_before_usermode()
}

^ this can be within #ifdef / #endif

static int membarrier_private_expedited(int flags, int cpu_id)
[...]
   if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
return -EINVAL;
if (!(atomic_read(&mm->membarrier_state) &
  MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
return -EPERM;
ipi_func = ipi_sync_core;

All we need to make the line above work is to define an empty ipi_sync_core
function in the #else case after the ipi_sync_core() function definition.

Or am I missing your point ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-17 Thread Mathieu Desnoyers
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote:

[...]

> +# An architecture that wants to support
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what 
> it
> +# is supposed to do and implement membarrier_sync_core_before_usermode() to
> +# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
> +# Kconfig.  Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
> +# fantastic API and may not make sense on all architectures.  Once an
> +# architecture meets these requirements,

Can we please remove the editorial comment about the quality of the membarrier
sync-core's API ?

At least it's better than having all userspace rely on mprotect()'s
undocumented side-effects to perform something which typically works, until
it won't, or until this prevents mprotect's implementation from being
improved because it will start breaking JITs all over the place.

We can simply state that what membarrier sync-core does is defined
per-architecture, and document the sequence of operations to perform when
doing cross-modifying code specifically for each architecture.

Now if there are weird architectures where membarrier is an odd fit (I've
heard that riscv might need address ranges to which the core sync needs to
apply), then those might need to implement their own arch-specific system
call, which is all fine.

> +#
> +# On x86, a program can safely modify code, issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via
> +# the modified address or an alias, from any thread in the calling process.
> +#
> +# On arm64, a program can modify code, flush the icache as needed, and issue
> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context 
> synchronizing
> +# event", aka pipeline flush on all CPUs that might run the calling process.
> +# Then the program can execute the modified code as long as it is executed
> +# from an address consistent with the icache flush and the CPU's cache type.
> +#
> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
> +# similarly to arm64.  It would be nice if the powerpc maintainers could
> +# add a more clear explanantion.

We should document the requirements on ARMv7 as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-17 Thread Mathieu Desnoyers
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote:

> The old sync_core_before_usermode() comments suggested that a 
> non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.
> 
> This is misleading.  The incantation needed to modify code from one
> CPU and execute it on another CPU is highly architecture dependent.
> On x86, according to the SDM, one must modify the code, issue SFENCE
> if the modification was WC or nontemporal, and then issue a "serializing
> instruction" on the CPU that will execute the code.  membarrier() can do
> the latter.
> 
> On arm64 and powerpc, one must flush the icache and then flush the pipeline
> on the target CPU, although the CPU manuals don't necessarily use this
> language.
> 
> So let's drop any pretense that we can have a generic way to define or
> implement membarrier's SYNC_CORE operation and instead require all
> architectures to define the helper and supply their own documentation as to
> how to use it.

Agreed. Documentation of the sequence of operations that need to be performed
when cross-modifying code on SMP should be per-architecture. The documentation
of the architectural effects of membarrier sync-core should be per-arch as well.

> This means x86, arm64, and powerpc for now.

And also arm32, as discussed in the other leg of the patchset's email thread.

> Let's also
> rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

OK

> 
[...]
> 
> static void ipi_rseq(void *info)
> {
> @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int
> cpu_id)
>   smp_call_func_t ipi_func = ipi_mb;
> 
>   if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
> - if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
>   return -EINVAL;
> +#else
>   if (!(atomic_read(&mm->membarrier_state) &
> MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
>   return -EPERM;
>   ipi_func = ipi_sync_core;
> +#endif

Please change back this #ifndef / #else / #endif within function for

if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
  ...
} else {
  ...
}

I don't think mixing up preprocessor and code logic makes it more readable.

Thanks,

Mathieu

>   } else if (flags == MEMBARRIER_FLAG_RSEQ) {
>   if (!IS_ENABLED(CONFIG_RSEQ))
>   return -EINVAL;
> --
> 2.31.1

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski l...@kernel.org wrote:

> On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
>  wrote:
>>
>> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote:
>>
>> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> >  wrote:
>> >>
>> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> > After chatting with rmk about this (but without claiming that any of
>> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> > to synchronize code changes, userspace should first do the code
>> >> > change, then flush icache as appropriate for the architecture, and
>> >> > then do the membarrier() to ensure that the old code is unused?
>>
>> ^ exactly, yes.
>>
>> >> >
>> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> > 32-bit arm?
>>
>> This was the original thinking, yes. The cacheflush IPI will flush specific
>> regions of code, and the membarrier IPI issues context synchronizing
>> instructions.
>>
>> Architectures with coherent i/d caches don't need the cacheflush step.
> 
> There are different levels of coherency -- VIVT architectures may have
> differing requirements compared to PIPT, etc.
> 
> In any case, I feel like the approach taken by the documentation is
> fundamentally confusing.  Architectures don't all speak the same
> language

Agreed.

> How about something like:

I dislike the wording "barrier" and the association between "write" and
"instruction fetch" done in the descriptions below. It leads one to think
that this behaves like a memory barrier, when in fact my understanding of a
context synchronizing instruction is that it simply flushes internal CPU
state which would otherwise cause coherency issues if the CPU observed both
the old and then the new code without that state being flushed.

[ Sorry if I take more time to reply and if my replies are a bit more
  concise than usual. I'm currently on parental leave, so I have
  non-maskable interrupts to attend to. ;-) ]

Thanks,

Mathieu

> 
> The SYNC_CORE operation causes all threads in the caller's address
> space (including the caller) to execute an architecture-defined
> barrier operation.  membarrier() will ensure that this barrier is
> executed at a time such that all data writes done by the calling
> thread before membarrier() are made visible by the barrier.
> Additional architecture-dependent cache management operations may be
> required to use this for JIT code.
> 
> x86: SYNC_CORE executes a barrier that will cause subsequent
> instruction fetches to observe prior writes.  Currently this will be a
> "serializing" instruction, but, if future improved CPU documentation
> becomes available and relaxes this requirement, the barrier may
> change.  The kernel guarantees that writing new or modified
> instructions to normal memory (and issuing SFENCE if the writes were
> non-temporal) then doing a membarrier SYNC_CORE operation is
> sufficient to cause all threads in the caller's address space to
> execute the new or modified instructions.  This is true regardless of
> whether or not those instructions are written at the same virtual
> address from which they are subsequently executed.  No additional
> cache management is required on x86.
> 
> arm: Something about the cache management syscalls.
> 
> arm64: Ditto
> 
> powerpc: I have no idea.

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski l...@kernel.org wrote:

[...]

>> You seem to have noticed odd cases on arm64 where this guarantee does not
>> match reality. Where exactly can we find this in the code, and which part
>> of the architecture manual can you point us to which supports your concern ?
>>
>> Based on the notes I have, use of `eret` on aarch64 guarantees a context
>> synchronizing
>> instruction when returning to user-space.
> 
> Based on my reading of the manual, ERET on ARM doesn't synchronize
> anything at all.  I can't find any evidence that it synchronizes data
> or instructions, and I've seen reports that the CPU will happily
> speculate right past it.

Reading [1], there appear to be 3 kinds of context synchronization events:

- Taking an exception,
- Returning from an exception,
- ISB.

This other source [2] adds (search for Context synchronization operation):

- Exit from Debug state
- Executing a DCPS instruction
- Executing a DRPS instruction

"ERET" falls into the second kind of events, and AFAIU should be context
synchronizing. That was confirmed to me by Will Deacon when membarrier
sync-core was implemented for aarch64. If the architecture reference manuals
are wrong, is there an errata ?

As for the algorithm to use on ARMv8 to update instructions, see [2]
B2.3.4  Implication of caches for the application programmer
"Synchronization and coherency issues between data and instruction accesses"

Membarrier only takes care of making sure the "ISB" part of the algorithm can be
done easily and efficiently on multiprocessor systems.

Thanks,

Mathieu

[1] 
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail
[2] 
https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
>  [...] guarantee that all its running thread siblings have executed a core
>  serializing instruction.  This guarantee is provided only for threads in
>  the same process as the calling thread.
> 
>  The "expedited" commands complete faster than the  non-expedited
>  ones,  they  never block, but have the downside of causing extra
>  overhead.
> 
>  A process must register its intent to use the private  expedited
>  sync core command prior to using it.
> 
> This just says that the siblings have executed a serialising
> instruction, in other words a barrier. It makes no claims concerning
> cache coherency - and without some form of cache maintenance, there
> can be no expectation that the I and D streams to be coherent with
> each other.

Right, membarrier is not doing anything wrt I/D caches. On architectures
without coherent caches, users should use other system calls or instructions
provided by the architecture to synchronize the appropriate address ranges. 

> This description is also weird in another respect. "guarantee that
> all its running thread siblings have executed a core serializing
> instruction" ... "The expedited commands ... never block".
> 
> So, the core executing this call is not allowed to block, but the
> other part indicates that the other CPUs _have_ executed a serialising
> instruction before this call returns... one wonders how that happens
> without blocking. Maybe the CPU spins waiting for completion instead?

Membarrier expedited sync-core issues IPIs to all CPUs running sibling
threads. AFAIR the IPI mechanism uses the "csd lock" which is basically
busy waiting. So it does not "block", it busy-waits.

For completeness of the explanation, other (non-running) threads acting
on the same mm will eventually issue the context synchronizing instruction
before returning to user-space whenever they are scheduled back.

Thanks,

Mathieu


> 
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Mathieu Desnoyers
cache coherency. We should perhaps explicitly point
it out though.

>>
>> So, either Andy has a misunderstanding, or the man page is wrong, or
>> my rudimentary understanding of what membarrier is supposed to be
>> doing is wrong...
> 
> Look at the latest man page:
> 
> https://man7.org/linux/man-pages/man2/membarrier.2.html
> 
> for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
> all that enlightening.

As described in the membarrier(2) man page and in
include/uapi/linux/membarrier.h, membarrier
MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing
instructions in addition to the memory ordering guaranteed by
MEMBARRIER_CMD_PRIVATE_EXPEDITED.

It does not guarantee anything wrt i/d cache coherency. I initially considered
adding such guarantees when we introduced membarrier sync-core, but decided
against it because it would basically replicate what architectures already
expose to user-space, e.g. flush_icache_user_range on arm32.

So between code modification and allowing other threads to jump to that code,
it should be expected that architectures without coherent i/d cache will need
to flush caches to ensure coherency *and* to issue membarrier to make sure
core serializing instructions will be issued by every thread acting on the
same mm either immediately by means of the IPI, or before they return to
user-space if they do not happen to be currently running when the membarrier
system call is invoked.
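
To make the expected user-space sequence concrete, here is an illustrative
sketch (not from this thread) of a JIT publishing code on an architecture
without coherent i/d caches. publish_code() and sys_membarrier() are
hypothetical helpers, while membarrier(2) and __builtin___clear_cache() are
real interfaces, and the process is assumed to have issued
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE beforehand:

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_membarrier(int cmd, unsigned int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

/* Hypothetical helper: publish freshly generated code at @dst. */
static int publish_code(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);          /* 1) write the new code */
        /* 2) architecture-specific cache maintenance for the range */
        __builtin___clear_cache((char *)dst, (char *)dst + len);
        /* 3) core serializing instruction on every thread of this mm */
        return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

On architectures with coherent i/d caches the cache maintenance step
typically compiles to nothing, but the membarrier step is still what provides
the core serialization discussed above before any thread may jump to the new
code.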

Hoping this clarifies things. I suspect we will need to clarify documentation
about what membarrier *does not* guarantee, given that you mistakenly expected
membarrier to take care of cache flushing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-27 Thread Mathieu Desnoyers
- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote:

> The old sync_core_before_usermode() comments said that a non-icache-syncing
> return-to-usermode instruction is x86-specific and that all other
> architectures automatically notice cross-modified code on return to
> userspace.  Based on my general understanding of how CPUs work and based on
> my atttempt to read the ARM manual, this is not true at all.  In fact, x86
> seems to be a bit of an anomaly in the other direction: x86's IRET is
> unusually heavyweight for a return-to-usermode instruction.
> 
> So let's drop any pretense that we can have a generic way implementation
> behind membarrier's SYNC_CORE flush and require all architectures that opt
> in to supply their own.

Removing the generic implementation is OK with me, as this will really require
architecture maintainers to think hard about it when porting this feature.

> This means x86, arm64, and powerpc for now.  Let's
> also rename the function from sync_core_before_usermode() to
> membarrier_sync_core_before_usermode() because the precise flushing details
> may very well be specific to membarrier, and even the concept of
> "sync_core" in the kernel is mostly an x86-ism.

Works for me too.

> 
> I admit that I'm rather surprised that the code worked at all on arm64,
> and I'm suspicious that it has never been very well tested.  My apologies
> for not reviewing this more carefully in the first place.

Please refer to 
Documentation/features/sched/membarrier-sync-core/arch-support.txt

It clearly states that only arm, arm64, powerpc and x86 support the membarrier
sync core feature as of now:


# Architecture requirements
#
# * arm/arm64/powerpc
#
# Rely on implicit context synchronization as a result of exception return
# when returning from IPI handler, and when returning to user-space.
#
# * x86
#
# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
# instruction is core serializing, but not SYSEXIT.
#
# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
# However, it can return to user-space through either SYSRETL (compat code),
# SYSRETQ, or IRET.
#
# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
# instead on write_cr3() performed by switch_mm() to provide core serialization
# after changing the current mm, and deal with the special case of kthread ->
# uthread (temporarily keeping current mm into active_mm) by issuing a
# sync_core_before_usermode() in that specific case.

This is based on direct feedback from the architecture maintainers.

You seem to have noticed odd cases on arm64 where this guarantee does not
match reality. Where exactly can we find this in the code, and which part
of the architecture manual can you point us to which supports your concern ?

Based on the notes I have, use of `eret` on aarch64 guarantees a context
synchronizing instruction when returning to user-space.

Thanks,

Mathieu

> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Nicholas Piggin 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: Mathieu Desnoyers 
> Cc: x...@kernel.org
> Cc: sta...@vger.kernel.org
> Fixes: 70216e18e519 ("membarrier: Provide core serializing command,
> *_SYNC_CORE")
> Signed-off-by: Andy Lutomirski 
> ---
> 
> Hi arm64 and powerpc people-
> 
> This is part of a series here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes
> 
> Before I send out the whole series, I'm hoping that some arm64 and powerpc
> people can help me verify that I did this patch right.  Once I get
> some feedback on this patch, I'll send out the whole pile.  And once
> *that's* done, I'll start giving the mm lazy stuff some serious thought.
> 
> The x86 part is already fixed in Linus' tree.
> 
> Thanks,
> Andy
> 
> arch/arm64/include/asm/sync_core.h   | 21 +
> arch/powerpc/include/asm/sync_core.h | 20 
> arch/x86/Kconfig |  1 -
> arch/x86/include/asm/sync_core.h |  7 +++
> include/linux/sched/mm.h |  1 -
> include/linux/sync_core.h| 21 -
> init/Kconfig |  3 ---
> kernel/sched/membarrier.c| 15 +++
> 8 files changed, 55 insertions(+), 34 deletions(-)
> create mode 100644 arch/arm64/include/asm/sync_core.h
> create mode 100644 arch/powerpc/include/asm/sync_core.h
> delete mode 100644 include/linux/sync_core.h
> 
> diff --git a/arch/arm64/include/asm/sync_core.h
> b/arch/arm64/include/asm/sync_core.h
> new file mode

Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Mathieu Desnoyers
- On Dec 4, 2020, at 3:17 AM, Nadav Amit nadav.a...@gmail.com wrote:

> I am not very familiar with membarrier, but here are my 2 cents while trying
> to answer your questions.
> 
>> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski  wrote:
>> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
>>   * from one thread in a process to another thread in the same
>>   * process. No TLB flush required.
>>   */
>> +
>> +// XXX: why is this okay wrt membarrier?
>>  if (!was_lazy)
>>  return;
> 
> I am confused.
> 
> On one hand, it seems that membarrier_private_expedited() would issue an IPI
> to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the
> same as the one that the membarrier applies to.

If the scheduler switches from one thread to another which both have the same 
mm,
it means cpu_rq(cpu)->curr->mm is invariant, even though ->curr changes. So 
there
is no need to issue a memory barrier or sync core for membarrier in this case,
because there is no way the IPI can be missed.

> But… (see below)
> 
> 
>> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct
>> mm_struct *next,
>>  smp_mb();
>>  next_tlb_gen = atomic64_read(&next->context.tlb_gen);
>>  if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
>> -next_tlb_gen)
>> +next_tlb_gen) {
>> +/*
>> + * We're reactivating an mm, and membarrier might
>> + * need to serialize.  Tell membarrier.
>> + */
>> +
>> +// XXX: I can't understand the logic in
>> +// membarrier_mm_sync_core_before_usermode().  What's
>> +// the mm check for?
>> +membarrier_mm_sync_core_before_usermode(next);
> 
> On the other hand the reason for this mm check that you mention contradicts
> my previous understanding as the git log says:
> 
> commit 2840cf02fae627860156737e83326df354ee4ec6
> Author: Mathieu Desnoyers 
> Date:   Thu Sep 19 13:37:01 2019 -0400
> 
>sched/membarrier: Call sync_core only before usermode for same mm
>
>When the prev and next task's mm change, switch_mm() provides the core
>serializing guarantees before returning to usermode. The only case
>where an explicit core serialization is needed is when the scheduler
>keeps the same mm for prev and next.

Hrm, so your point here is that if the scheduler keeps the same mm for
prev and next, it means membarrier will have observed the same rq->curr->mm,
and therefore the IPI won't be missed. I wonder if that
membarrier_mm_sync_core_before_usermode is needed at all then or if we
have just been too careful and did not consider that all the scenarios which
need to be core-sync'd are indeed taken care of ?

I see here that my prior commit message indeed discusses prev and next task's
mm, but in reality, we are comparing current->mm with rq->prev_mm. So from
a lazy TLB perspective, this probably matters, and we may still need a core sync
in some lazy TLB scenarios.

> 
>>  /*
>>   * When switching through a kernel thread, the loop in
>>   * membarrier_{private,global}_expedited() may have observed that
>>   * kernel thread and not issued an IPI. It is therefore possible to
>>   * schedule between user->kernel->user threads without passing though
>>   * switch_mm(). Membarrier requires a barrier after storing to
>> - * rq->curr, before returning to userspace, so provide them here:
>> + * rq->curr, before returning to userspace, and mmdrop() provides
>> + * this barrier.
>>   *
>> - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
>> - *   provided by mmdrop(),
>> - * - a sync_core for SYNC_CORE.
>> + * XXX: I don't think mmdrop() actually does this.  There's no
>> + * smp_mb__before/after_atomic() in there.
> 
> I presume that since x86 is the only one that needs
> membarrier_mm_sync_core_before_usermode(), nobody noticed the missing
> smp_mb__before/after_atomic(). These are anyhow a compiler barrier in x86,
> and such a barrier would take place before the return to userspace.

mmdrop already provides the memory barriers for membarrier, as I documented
within the function.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-04 Thread Mathieu Desnoyers
en user->kernel->user threads without passing though
>* switch_mm(). Membarrier requires a barrier after storing to
> -  * rq->curr, before returning to userspace, so provide them here:
> +  * rq->curr, before returning to userspace, and mmdrop() provides
> +  * this barrier.
>*
> -  * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -  *   provided by mmdrop(),
> -  * - a sync_core for SYNC_CORE.
> +  * XXX: I don't think mmdrop() actually does this.  There's no
> +  * smp_mb__before/after_atomic() in there.

I recall mmdrop providing a memory barrier. It looks like I even went through
the trouble of documenting it myself. ;-)

static inline void mmdrop(struct mm_struct *mm)
{
/*
 * The implicit full barrier implied by atomic_dec_and_test() is
 * required by the membarrier system call before returning to
 * user-space, after storing to rq->curr.
 */
if (unlikely(atomic_dec_and_test(&mm->mm_count)))
__mmdrop(mm);
}


>*/
> - if (mm) {
> - membarrier_mm_sync_core_before_usermode(mm);

OK so here is the meat. The current code is using the (possibly incomplete)
lazy TLB state known by the scheduler to sync core, and it appears it may be
a bit heavier than what is strictly needed.

Your change instead relies on the internal knowledge of lazy TLB within x86
switch_mm_irqs_off to achieve this, which overall seems like an improvement.

I agree with Nick's comment that it should go on top of his exit_lazy_mm
patches.

Thanks,

Mathieu


> + if (mm)
>   mmdrop(mm);
> - }
> +
>   if (unlikely(prev_state == TASK_DEAD)) {
>   if (prev->sched_class->task_dead)
>   prev->sched_class->task_dead(prev);
> --
> 2.28.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-11-30 Thread Mathieu Desnoyers
- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote:

> And get rid of the generic sync_core_before_usermode facility. This is
> functionally a no-op in the core scheduler code, but it also catches

This sentence is incomplete.

> 
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.

Ideally, yes: this complexity should sit within the x86 architecture code
if only that architecture requires it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote:
>> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org 
>> wrote:
>> 
>> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
>> > 
>> >> That being said, the x86 sync core gap that I imagined could be fixed
>> >> by changing to rq->curr == rq->idle test does not actually exist because
>> >> the global membarrier does not have a sync core option. So fixing the
>> >> exit_lazy_tlb points that this series does *should* fix that. So
>> >> PF_KTHREAD may be less problematic than I thought from implementation
>> >> point of view, only semantics.
>> > 
>> > So I've been trying to figure out where that PF_KTHREAD comes from,
>> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
>> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
>> > 
>> > So the first version:
>> > 
>> >  
>> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
>> > 
>> > appears to unconditionally send the IPI and checks p->mm in the IPI
>> > context, but then v2:
>> > 
>> >  
>> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
>> > 
>> > has the current code. But I've been unable to find the reason the
>> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.
>> 
>> Looking back at my inbox, it seems like you are the one who proposed to
>> skip all kthreads:
>> 
>> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net
> 
> I had a feeling it might've been me ;-) I just couldn't find the email.
> 
>> > The comment doesn't really help either; sure we have the whole lazy mm
>> > thing, but that's ->active_mm, not ->mm.
>> > 
>> > Possibly it is because {,un}use_mm() do not have sufficient barriers to
>> > make the remote p->mm test work? Or were we over-eager with the !p->mm
>> > doesn't imply kthread 'cleanups' at the time?
>> 
>> The nice thing about adding back kthreads to the threads considered for
>> membarrier
>> IPI is that it has no observable effect on the user-space ABI. No 
>> pre-existing
>> kthread
>> rely on this, and we just provide an additional guarantee for future kthread
>> implementations.
>> 
>> > Also, I just realized, I still have a fix for use_mm() now
>> > kthread_use_mm() that seems to have been lost.
>> 
>> I suspect we need to at least document the memory barriers in kthread_use_mm 
>> and
>> kthread_unuse_mm to state that they are required by membarrier if we want to
>> ipi kthreads as well.
> 
> Right, so going by that email you found it was mostly a case of being
> lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add
> any other bits that might be needed, covering kthreads should be
> possible.
> 
> No objections from me for making it so.

I'm OK with making membarrier cover kthreads using mm as well, provided we
audit kthread_{,un}use_mm() to make sure the proper barriers are in place
after setting task->mm and before clearing it.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote:
> 
>> That being said, the x86 sync core gap that I imagined could be fixed
>> by changing to rq->curr == rq->idle test does not actually exist because
>> the global membarrier does not have a sync core option. So fixing the
>> exit_lazy_tlb points that this series does *should* fix that. So
>> PF_KTHREAD may be less problematic than I thought from implementation
>> point of view, only semantics.
> 
> So I've been trying to figure out where that PF_KTHREAD comes from,
> commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy
> load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'.
> 
> So the first version:
> 
>  
> https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com
> 
> appears to unconditionally send the IPI and checks p->mm in the IPI
> context, but then v2:
> 
>  
> https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com
> 
> has the current code. But I've been unable to find the reason the
> 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'.

Looking back at my inbox, it seems like you are the one who proposed to
skip all kthreads: 

https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net

> 
> The comment doesn't really help either; sure we have the whole lazy mm
> thing, but that's ->active_mm, not ->mm.
> 
> Possibly it is because {,un}use_mm() do not have sufficient barriers to
> make the remote p->mm test work? Or were we over-eager with the !p->mm
> doesn't imply kthread 'cleanups' at the time?

The nice thing about adding back kthreads to the threads considered for
membarrier IPI is that it has no observable effect on the user-space ABI.
No pre-existing kthread relies on this, and we just provide an additional
guarantee for future kthread implementations.

> Also, I just realized, I still have a fix for use_mm() now
> kthread_use_mm() that seems to have been lost.

I suspect we need to at least document the memory barriers in kthread_use_mm and
kthread_unuse_mm to state that they are required by membarrier if we want to
ipi kthreads as well.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-21 Thread Mathieu Desnoyers
- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am:
[...]
> 
> Yeah you're probably right in this case I think. Quite likely most kernel
> tasks that asynchronously write to user memory would at least have some
> kind of producer-consumer barriers.
> 
> But is that restriction of all async modifications documented and enforced
> anywhere?
> 
>>> How about other memory accesses via kthread_use_mm? Presumably there is
>>> still ordering requirement there for membarrier,
>> 
>> Please provide an example case with memory accesses via kthread_use_mm where
>> ordering matters to support your concern.
> 
> I think the concern Andy raised with io_uring was less a specific
> problem he saw and more a general concern that we have these memory
> accesses which are not synchronized with membarrier.
> 
>>> so I really think
>>> it's a fragile interface with no real way for the user to know how
>>> kernel threads may use its mm for any particular reason, so membarrier
>>> should synchronize all possible kernel users as well.
>> 
>> I strongly doubt so, but perhaps something should be clarified in the
>> documentation
>> if you have that feeling.
> 
> I'd rather go the other way and say if you have reasoning or numbers for
> why PF_KTHREAD is an important optimisation above rq->curr == rq->idle
> then we could think about keeping this subtlety with appropriate
> documentation added, otherwise we can just kill it and remove all doubt.
> 
> That being said, the x86 sync core gap that I imagined could be fixed
> by changing to rq->curr == rq->idle test does not actually exist because
> the global membarrier does not have a sync core option. So fixing the
> exit_lazy_tlb points that this series does *should* fix that. So
> PF_KTHREAD may be less problematic than I thought from implementation
> point of view, only semantics.

Today, the membarrier global expedited command explicitly skips kernel
threads, but it happens that membarrier private expedited considers those
with the same mm as targets for the IPI.

So we already implement a semantic which differs between private and global
expedited membarriers. This can be explained in part by the fact that
kthread_use_mm was introduced after 4.16, when the most recent membarrier
commands were introduced. It seems that the effect on membarrier was not
considered when kthread_use_mm was introduced.

Looking at membarrier(2) documentation, it states that IPIs are only sent to
threads belonging to the same process as the calling thread. If my understanding
of the notion of process is correct, this should rule out sending the IPI to
kernel threads, given they are not "part" of the same process, only borrowing
the mm. But I agree that the distinction is moot, and should be clarified.

Without a clear use-case to justify adding a constraint on membarrier, I am
tempted to simply clarify documentation of current membarrier commands,
stating clearly that they are not guaranteed to affect kernel threads. Then,
if we have a compelling use-case to implement a different behavior which covers
kthreads, this could be added consistently across membarrier commands with a
flag (or by adding new commands).

Does this approach make sense ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-20 Thread Mathieu Desnoyers
- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm:
>> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
>> [...]
>>> 
>>> membarrier does replace barrier instructions on remote CPUs, which do
>>> order accesses performed by the kernel on the user address space. So
>>> membarrier should too I guess.
>>> 
>>> Normal process context accesses like read(2) will do so because they
>>> don't get filtered out from IPIs, but kernel threads using the mm may
>>> not.
>> 
>> But it should not be an issue, because membarrier's ordering is only with
>> respect
>> to submit and completion of io_uring requests, which are performed through
>> system calls from the context of user-space threads, which are called from 
>> the
>> right mm.
> 
> Is that true? Can io completions be written into an address space via a
> kernel thread? I don't know the io_uring code well but it looks like
> that's asynchonously using the user mm context.

Indeed, the io completion appears to be signaled asynchronously between kernel
and user-space. Therefore, both kernel and userspace code need to have proper
memory barriers in place to signal completion, otherwise user-space could read
garbage after it notices completion of a read.

I did not review the entire io_uring implementation, but the publish side
for completion appears to be:

static void __io_commit_cqring(struct io_ring_ctx *ctx)
{
struct io_rings *rings = ctx->rings;

/* order cqe stores with ring update */
smp_store_release(&rings->cq.tail, ctx->cached_cq_tail);

if (wq_has_sleeper(&ctx->cq_wait)) {
wake_up_interruptible(&ctx->cq_wait);
kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
}
}

The store-release on tail should be paired with a load_acquire on the
reader-side (it's called "read_barrier()" in the code):

tools/io_uring/queue.c:

static int __io_uring_get_cqe(struct io_uring *ring,
  struct io_uring_cqe **cqe_ptr, int wait)
{
struct io_uring_cq *cq = &ring->cq;
const unsigned mask = *cq->kring_mask;
unsigned head;
int ret;

*cqe_ptr = NULL;
head = *cq->khead;
do {
/*
 * It's necessary to use a read_barrier() before reading
 * the CQ tail, since the kernel updates it locklessly. The
 * kernel has the matching store barrier for the update. The
 * kernel also ensures that previous stores to CQEs are ordered
 * with the tail update.
 */
read_barrier();
if (head != *cq->ktail) {
*cqe_ptr = &cq->cqes[head & mask];
break;
}
if (!wait)
break;
ret = io_uring_enter(ring->ring_fd, 0, 1,
IORING_ENTER_GETEVENTS, NULL);
if (ret < 0)
return -errno;
} while (1);

return 0;
}

So as far as membarrier memory ordering dependencies are concerned, it relies
on the store-release/load-acquire dependency chain in the completion queue to
order against anything that was done prior to the completed requests.

What is in-flight while the requests are being serviced provides no memory
ordering guarantee whatsoever.

> How about other memory accesses via kthread_use_mm? Presumably there is
> still ordering requirement there for membarrier,

Please provide an example case with memory accesses via kthread_use_mm where
ordering matters to support your concern.

> so I really think
> it's a fragile interface with no real way for the user to know how
> kernel threads may use its mm for any particular reason, so membarrier
> should synchronize all possible kernel users as well.

I strongly doubt so, but perhaps something should be clarified in the
documentation if you have that feeling.

Thanks,

Mathieu

> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 1:44 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> >> > I agree with Nick: A memory barrier is needed somewhere between the
>> >> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> >> > with the Store Buffer pattern having a memory barrier on only one side,
>> >> > and it is well known that this arrangement does not guarantee any
>> >> > ordering.
>> >> 
>> >> Yes, I see this now. I'm still trying to wrap my head around why the 
>> >> memory
>> >> barrier at the end of membarrier() needs to be paired with a scheduler
>> >> barrier though.
>> > 
>> > The memory barrier at the end of membarrier() on CPU0 is necessary in
>> > order to enforce the guarantee that any writes occurring on CPU1 before
>> > the membarrier() is executed will be visible to any code executing on
>> > CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
>> > 
>> >CPU0CPU1
>> >x = 1
>> >barrier()
>> >y = 1
>> >r2 = y
>> >membarrier():
>> >  a: smp_mb()
>> >  b: send IPI   IPI-induced mb
>> >  c: smp_mb()
>> >r1 = x
>> > 
>> > The writes to x and y are unordered by the hardware, so it's possible to
>> > have r2 = 1 even though the write to x doesn't execute until b.  If the
>> > memory barrier at c is omitted then "r1 = x" can be reordered before b
>> > (although not before a), so we get r1 = 0.  This violates the guarantee
>> > that membarrier() is supposed to provide.
>> > 
>> > The timing of the memory barrier at c has to ensure that it executes
>> > after the IPI-induced memory barrier on CPU1.  If it happened before
>> > then we could still end up with r1 = 0.  That's why the pairing matters.
>> > 
>> > I hope this helps your head get properly wrapped.  :-)
>> 
>> It does help a bit! ;-)
>> 
>> This explains this part of the comment near the smp_mb at the end of 
>> membarrier:
>> 
>>  * Memory barrier on the caller thread _after_ we finished
>>  * waiting for the last IPI. [...]
>> 
>> However, it does not explain why it needs to be paired with a barrier in the
>> scheduler, clearly for the case where the IPI is skipped. I wonder whether 
>> this
>> part
>> of the comment is factually correct:
>> 
>>  * [...] Matches memory barriers around rq->curr modification in 
>> scheduler.
> 
> The reasoning is pretty much the same as above:
> 
>   CPU0CPU1
>   x = 1
>   barrier()
>   y = 1
>   r2 = y
>   membarrier():
> a: smp_mb()
>   switch to kthread (includes mb)
> b: read rq->curr == kthread
>   switch to user (includes mb)
> c: smp_mb()
>   r1 = x
> 
> Once again, it is possible that x = 1 doesn't become visible to CPU0
> until shortly before b.  But if c is omitted then "r1 = x" can be
> reordered before b (to any time after a), so we can have r1 = 0.
> 
> Here the timing requirement is that c executes after the first memory
> barrier on CPU1 -- which is one of the ones around the rq->curr
> modification.  (In fact, in this scenario CPU1's switch back to the user
> process is irrelevant.)

That indeed covers the last scenario I was wondering about. Thanks Alan!

Mathieu

> 
> Alan Stern

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu wrote:

>> > I agree with Nick: A memory barrier is needed somewhere between the
>> > assignment at 6 and the return to user mode at 8.  Otherwise you end up
>> > with the Store Buffer pattern having a memory barrier on only one side,
>> > and it is well known that this arrangement does not guarantee any
>> > ordering.
>> 
>> Yes, I see this now. I'm still trying to wrap my head around why the memory
>> barrier at the end of membarrier() needs to be paired with a scheduler
>> barrier though.
> 
> The memory barrier at the end of membarrier() on CPU0 is necessary in
> order to enforce the guarantee that any writes occurring on CPU1 before
> the membarrier() is executed will be visible to any code executing on
> CPU0 after the membarrier().  Ignoring the kthread issue, we can have:
> 
>   CPU0CPU1
>   x = 1
>   barrier()
>   y = 1
>   r2 = y
>   membarrier():
> a: smp_mb()
> b: send IPI   IPI-induced mb
> c: smp_mb()
>   r1 = x
> 
> The writes to x and y are unordered by the hardware, so it's possible to
> have r2 = 1 even though the write to x doesn't execute until b.  If the
> memory barrier at c is omitted then "r1 = x" can be reordered before b
> (although not before a), so we get r1 = 0.  This violates the guarantee
> that membarrier() is supposed to provide.
> 
> The timing of the memory barrier at c has to ensure that it executes
> after the IPI-induced memory barrier on CPU1.  If it happened before
> then we could still end up with r1 = 0.  That's why the pairing matters.
> 
> I hope this helps your head get properly wrapped.  :-)

It does help a bit! ;-)

This explains this part of the comment near the smp_mb at the end of membarrier:

 * Memory barrier on the caller thread _after_ we finished
 * waiting for the last IPI. [...]

However, it does not explain why it needs to be paired with a barrier in the
scheduler, clearly for the case where the IPI is skipped. I wonder whether
this part of the comment is factually correct:

 * [...] Matches memory barriers around rq->curr modification in 
scheduler.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 17, 2020, at 10:51 AM, Alan Stern st...@rowland.harvard.edu wrote:

> On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu 
>> wrote:
>> 
>> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> >> mathieu.desnoy...@efficios.com wrote:
>> >> 
>> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> >> > mathieu.desnoy...@efficios.com wrote:
>> >> > 
>> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> >> wrote:
>> >> >>> I should be more complete here, especially since I was complaining
>> >> >>> about unclear barrier comment :)
>> >> >>> 
>> >> >>> 
>> >> >>> CPU0 CPU1
>> >> >>> a. user stuff1. user stuff
>> >> >>> b. membarrier()  2. enter kernel
>> >> >>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>> >> >>> d. read rq->curr 4. rq->curr switched to kthread
>> >> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >> >>> f. return to user6. rq->curr switched to user thread
>> >> >>> g. user stuff7. switch_to user thread
>> >> >>> 8. exit kernel
>> >> >>> 9. more user stuff
> 
> ...
> 
>> >> Requiring a memory barrier between update of rq->curr (back to current 
>> >> process's
>> >> thread) and following user-space memory accesses does not seem to 
>> >> guarantee
>> >> anything more than what the initial barrier at the beginning of __schedule
>> >> already
>> >> provides, because the guarantees are only about accesses to user-space 
>> >> memory.
> 
> ...
> 
>> > Is it correct to say that the switch_to operations in 5 and 7 include
>> > memory barriers?  If they do, then skipping the IPI should be okay.
>> > 
>> > The reason is as follows: The guarantee you need to enforce is that
>> > anything written by CPU0 before the membarrier() will be visible to CPU1
>> > after it returns to user mode.  Let's say that a writes to X and 9
>> > reads from X.
>> > 
>> > Then we have an instance of the Store Buffer pattern:
>> > 
>> >CPU0CPU1
>> >a. Write X  6. Write rq->curr for user thread
>> >c. smp_mb() 7. switch_to memory barrier
>> >d. Read rq->curr9. Read X
>> > 
>> > In this pattern, the memory barriers make it impossible for both reads
>> > to miss their corresponding writes.  Since d does fail to read 6 (it
>> > sees the earlier value stored by 4), 9 must read a.
>> > 
>> > The other guarantee you need is that g on CPU0 will observe anything
>> > written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
>> > memory barrier and d reads from 4.
>> 
>> Right, and Nick's reply involving pairs of loads/stores on each side
>> clarifies the situation even further.
> 
> The key part of my reply was the question: "Is it correct to say that
> the switch_to operations in 5 and 7 include memory barriers?"  From the
> text quoted above and from Nick's reply, it seems clear that they do
> not.

I remember that switch_mm implies it, but not switch_to.

The scenario that triggered this discussion is when the scheduler does a
lazy tlb entry/exit, which is basically a switch from a user task to a
kernel thread without changing the mm, usually switching back afterwards.
This optimization means the rq->curr mm temporarily differs, which prevents
IPIs from being sent by membarrier, but without involving a switch_mm.
This requires explicit memory barriers either on entry/exit of lazy tlb
mode, or explicit barriers in the scheduler for those special cases.

> I agree with Nick: A memory barrier is needed somewhere between the
> assignment at 6 and the return to user mode at 8.  Otherwise you end up
> with the Store Buffer pattern having a memory barrier on only one side,
> and it is well known that this arrangement does not guarantee any
> ordering.

Yes, I see this now. I'm still trying to wrap my head around why the memory
barrier at the end of membarrier() needs to be paired with a scheduler
barrier though.

> One thing I don't understand about all this: Any context switch has to
> include a memory barrier somewhere, but both you and Nick seem to be
> saying that steps 6 and 7 don't include (or don't need) any memory
> barriers.  What am I missing?

All context switches include the smp_mb__after_spinlock() at the beginning of
__schedule(), which I suspect is what you refer to. However, this barrier
is before the store to rq->curr, not after.
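
To make the placement explicit, here is a simplified sketch of the relevant
part of __schedule() (not the actual kernel code, just where that barrier
sits relative to the rq->curr store):

	__schedule()
		rq_lock(rq, &rf);		/* takes rq->lock */
		smp_mb__after_spinlock();	/* full barrier, *before* the rq->curr update */
		...
		rq->curr = next;		/* no barrier after this store by itself */
		context_switch() -> switch_to();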

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote:
[...]
> 
> membarrier does replace barrier instructions on remote CPUs, which do
> order accesses performed by the kernel on the user address space. So
> membarrier should too I guess.
> 
> Normal process context accesses like read(2) will do so because they
> don't get filtered out from IPIs, but kernel threads using the mm may
> not.

But it should not be an issue, because membarrier's ordering is only with
respect to submission and completion of io_uring requests, which are performed
through system calls from the context of user-space threads, i.e. from the
right mm.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-17 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote:

> On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote:
>> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
>> > mathieu.desnoy...@efficios.com wrote:
>> > 
>> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com 
>> >> wrote:
>> >>> I should be more complete here, especially since I was complaining
>> >>> about unclear barrier comment :)
>> >>> 
>> >>> 
>> >>> CPU0 CPU1
>> >>> a. user stuff1. user stuff
>> >>> b. membarrier()  2. enter kernel
>> >>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>> >>> d. read rq->curr 4. rq->curr switched to kthread
>> >>> e. is kthread, skip IPI  5. switch_to kthread
>> >>> f. return to user6. rq->curr switched to user thread
>> >>> g. user stuff7. switch_to user thread
>> >>> 8. exit kernel
>> >>> 9. more user stuff
>> >>> 
>> >>> What you're really ordering is a, g vs 1, 9 right?
>> >>> 
>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> >>> etc.
>> >>> 
>> >>> Userspace does not care where the barriers are exactly or what kernel
>> >>> memory accesses might be being ordered by them, so long as there is a
>> >>> mb somewhere between a and g, and 1 and 9. Right?
>> >> 
>> >> This is correct.
>> > 
>> > Actually, sorry, the above is not quite right. It's been a while
>> > since I looked into the details of membarrier.
>> > 
>> > The smp_mb() at the beginning of membarrier() needs to be paired with a
>> > smp_mb() _after_ rq->curr is switched back to the user thread, so the
>> > memory barrier is between store to rq->curr and following user-space
>> > accesses.
>> > 
>> > The smp_mb() at the end of membarrier() needs to be paired with the
>> > smp_mb__after_spinlock() at the beginning of schedule, which is
>> > between accesses to userspace memory and switching rq->curr to kthread.
>> > 
>> > As to *why* this ordering is needed, I'd have to dig through additional
>> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?
>> 
>> Thinking further about this, I'm beginning to consider that maybe we have 
>> been
>> overly cautious by requiring memory barriers before and after store to 
>> rq->curr.
>> 
>> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process
>> (current)
>> while running the membarrier system call, it necessarily means that CPU1 had
>> to issue smp_mb__after_spinlock when entering the scheduler, between any
>> user-space
>> loads/stores and update of rq->curr.
>> 
>> Requiring a memory barrier between update of rq->curr (back to current 
>> process's
>> thread) and following user-space memory accesses does not seem to guarantee
>> anything more than what the initial barrier at the beginning of __schedule
>> already
>> provides, because the guarantees are only about accesses to user-space 
>> memory.
>> 
>> Therefore, with the memory barrier at the beginning of __schedule, just
>> observing that
>> CPU1's rq->curr differs from current should guarantee that a memory barrier 
>> was
>> issued
>> between any sequentially consistent instructions belonging to the current
>> process on
>> CPU1.
>> 
>> Or am I missing/misremembering an important point here ?
> 
> Is it correct to say that the switch_to operations in 5 and 7 include
> memory barriers?  If they do, then skipping the IPI should be okay.
> 
> The reason is as follows: The guarantee you need to enforce is that
> anything written by CPU0 before the membarrier() will be visible to CPU1
> after it returns to user mode.  Let's say that a writes to X and 9
> reads from X.
> 
> Then we have an instance of the Store Buffer pattern:
> 
>   CPU0CPU1
>   a. Write X  6. Write rq->curr for user thread
>   c. smp_mb() 7. switch_to memory barrier
>   d. Read rq->curr9. Read X
> 
> In this pattern, the memory barriers make it impossible for both reads
> to miss their corresponding writes.  Since d does fail to read 6 (it
> sees the earlier value stored by 4), 9 must read a.
> 
> The other guarantee you need is that g on CPU0 will observe anything
> written by CPU1 in 1.  This is easier to see, using the fact that 3 is a
> memory barrier and d reads from 4.

Right, and Nick's reply involving pairs of loads/stores on each side
clarifies the situation even further.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers
> mathieu.desnoy...@efficios.com wrote:
> 
>> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
>>> I should be more complete here, especially since I was complaining
>>> about unclear barrier comment :)
>>> 
>>> 
>>> CPU0 CPU1
>>> a. user stuff1. user stuff
>>> b. membarrier()  2. enter kernel
>>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>>> d. read rq->curr 4. rq->curr switched to kthread
>>> e. is kthread, skip IPI  5. switch_to kthread
>>> f. return to user6. rq->curr switched to user thread
>>> g. user stuff7. switch_to user thread
>>> 8. exit kernel
>>> 9. more user stuff
>>> 
>>> What you're really ordering is a, g vs 1, 9 right?
>>> 
>>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>>> etc.
>>> 
>>> Userspace does not care where the barriers are exactly or what kernel
>>> memory accesses might be being ordered by them, so long as there is a
>>> mb somewhere between a and g, and 1 and 9. Right?
>> 
>> This is correct.
> 
> Actually, sorry, the above is not quite right. It's been a while
> since I looked into the details of membarrier.
> 
> The smp_mb() at the beginning of membarrier() needs to be paired with a
> smp_mb() _after_ rq->curr is switched back to the user thread, so the
> memory barrier is between store to rq->curr and following user-space
> accesses.
> 
> The smp_mb() at the end of membarrier() needs to be paired with the
> smp_mb__after_spinlock() at the beginning of schedule, which is
> between accesses to userspace memory and switching rq->curr to kthread.
> 
> As to *why* this ordering is needed, I'd have to dig through additional
> scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thinking further about this, I'm beginning to consider that maybe we have been
overly cautious by requiring memory barriers before and after store to rq->curr.

If CPU0 observes a CPU1's rq->curr->mm which differs from its own process
(current) while running the membarrier system call, it necessarily means that
CPU1 had to issue smp_mb__after_spinlock when entering the scheduler, between
any user-space loads/stores and update of rq->curr.

Requiring a memory barrier between update of rq->curr (back to current
process's thread) and following user-space memory accesses does not seem to
guarantee anything more than what the initial barrier at the beginning of
__schedule already provides, because the guarantees are only about accesses
to user-space memory.

Therefore, with the memory barrier at the beginning of __schedule, just
observing that CPU1's rq->curr differs from current should guarantee that a
memory barrier was issued between any sequentially consistent instructions
belonging to the current process on CPU1.

Or am I missing/misremembering an important point here ?

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
>> Note that the accesses to user-space memory can be
>> done either by user-space code or kernel code, it doesn't matter.
>> However, in order to be considered as happening before/after
>> either membarrier or the matching compiler barrier, kernel code
>> needs to have causality relationship with user-space execution,
>> e.g. user-space does a system call, or returns from a system call.
>> 
>> In the case of io_uring, submitting a request or returning from waiting
>> on request completion appear to provide this causality relationship.
>> 
>> Thanks,
>> 
>> Mathieu
>> 
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
>> I should be more complete here, especially since I was complaining
>> about unclear barrier comment :)
>> 
>> 
>> CPU0 CPU1
>> a. user stuff1. user stuff
>> b. membarrier()  2. enter kernel
>> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
>> d. read rq->curr 4. rq->curr switched to kthread
>> e. is kthread, skip IPI  5. switch_to kthread
>> f. return to user6. rq->curr switched to user thread
>> g. user stuff7. switch_to user thread
>> 8. exit kernel
>> 9. more user stuff
>> 
>> What you're really ordering is a, g vs 1, 9 right?
>> 
>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
>> etc.
>> 
>> Userspace does not care where the barriers are exactly or what kernel
>> memory accesses might be being ordered by them, so long as there is a
>> mb somewhere between a and g, and 1 and 9. Right?
> 
> This is correct.

Actually, sorry, the above is not quite right. It's been a while
since I looked into the details of membarrier.

The smp_mb() at the beginning of membarrier() needs to be paired with a
smp_mb() _after_ rq->curr is switched back to the user thread, so the
memory barrier is between store to rq->curr and following user-space
accesses.

The smp_mb() at the end of membarrier() needs to be paired with the
smp_mb__after_spinlock() at the beginning of schedule, which is
between accesses to userspace memory and switching rq->curr to kthread.

As to *why* this ordering is needed, I'd have to dig through additional
scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ?

Thanks,

Mathieu


> Note that the accesses to user-space memory can be
> done either by user-space code or kernel code, it doesn't matter.
> However, in order to be considered as happening before/after
> either membarrier or the matching compiler barrier, kernel code
> needs to have causality relationship with user-space execution,
> e.g. user-space does a system call, or returns from a system call.
> 
> In the case of io_uring, submitting a request or returning from waiting
> on request completion appear to provide this causality relationship.
> 
> Thanks,
> 
> Mathieu
> 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers



- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote:
> I should be more complete here, especially since I was complaining
> about unclear barrier comment :)
> 
> 
> CPU0 CPU1
> a. user stuff1. user stuff
> b. membarrier()  2. enter kernel
> c. smp_mb()  3. smp_mb__after_spinlock(); // in __schedule
> d. read rq->curr 4. rq->curr switched to kthread
> e. is kthread, skip IPI  5. switch_to kthread
> f. return to user6. rq->curr switched to user thread
> g. user stuff7. switch_to user thread
> 8. exit kernel
> 9. more user stuff
> 
> What you're really ordering is a, g vs 1, 9 right?
> 
> In other words, 9 must see a if it sees g, g must see 1 if it saw 9,
> etc.
> 
> Userspace does not care where the barriers are exactly or what kernel
> memory accesses might be being ordered by them, so long as there is a
> mb somewhere between a and g, and 1 and 9. Right?

This is correct. Note that the accesses to user-space memory can be
done either by user-space code or kernel code; it doesn't matter.
However, in order to be considered as happening before/after
either membarrier or the matching compiler barrier, kernel code
needs to have a causality relationship with user-space execution,
e.g. user-space does a system call, or returns from a system call.

In the case of io_uring, submitting a request or returning from waiting
on request completion appears to provide this causality relationship.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-16 Thread Mathieu Desnoyers
- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote:
>> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm:
>> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote:
>> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
>> >> But I’m wondering if all this deferred sync stuff is wrong. In the
>> >> brave new world of io_uring and such, perhaps kernel access matter
>> >> too.  Heck, even:
>> > 
>> > IIRC the membarrier SYNC_CORE use-case is about user-space
>> > self-modifying code.
>> > 
>> > Userspace re-uses a text address and needs to SYNC_CORE before it can be
>> > sure the old text is forgotten. Nothing the kernel does matters there.
>> > 
>> > I suppose the manpage could be more clear there.
>> 
>> True, but memory ordering of kernel stores from kernel threads for
>> regular mem barrier is the concern here.
>> 
>> Does io_uring update completion queue from kernel thread or interrupt,
>> for example? If it does, then membarrier will not order such stores
>> with user memory accesses.
> 
> So we're talking about regular membarrier() then? Not the SYNC_CORE
> variant per-se.
> 
> Even there, I'll argue we don't care, but perhaps Mathieu has a
> different opinion.

I agree with Peter that we don't care about accesses to user-space
memory performed concurrently with membarrier.

What we'd care about in terms of accesses to user-space memory from the
kernel is something that would be clearly ordered as happening before
or after the membarrier call, for instance a read(2) followed by
membarrier(2) after the read returns, or a read(2) issued after return
from membarrier(2). The other scenario we'd care about is with the compiler
barrier paired with membarrier: e.g. read(2) returns, compiler barrier,
followed by a store. Or load, compiler barrier, followed by write(2).

All those scenarios imply before/after ordering wrt either membarrier or
the compiler barrier. I notice that io_uring has a "completion" queue.
Let's try to come up with realistic usage scenarios.

So the dependency chain would be provided by e.g.:

* Infrequent read / Frequent write, communicating read completion through 
variable X

wait for io_uring read request completion -> membarrier -> store X=1

with matching

load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit 
io_uring write request

or this other scenario:

* Frequent read / Infrequent write, communicating read completion through 
variable X

load from X (waiting for X==1) -> membarrier -> submit io_uring write request

with matching

wait for io_uring read request completion -> asm volatile (::: "memory") -> 
store X=1
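
For illustration, a minimal user-space sketch of the first scenario above
(hypothetical helpers; the io_uring submission/completion calls are elided,
and the process is assumed to have registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED beforehand):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int X;	/* communicates read completion */

/* Infrequent read side: wait for the io_uring read completion, issue
 * membarrier, then publish X = 1. */
static void read_side(void)
{
	/* ... wait for io_uring read request completion ... */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
	X = 1;
}

/* Frequent write side: wait for X == 1, compiler barrier, then submit
 * the io_uring write request. */
static void write_side(void)
{
	while (X != 1)
		;			/* busy-wait for the flag */
	asm volatile ("" ::: "memory");	/* matches the membarrier above */
	/* ... submit io_uring write request ... */
}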

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH v2] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE

2020-07-15 Thread Mathieu Desnoyers
- On Jul 15, 2020, at 5:48 AM, Nicholas Piggin npig...@gmail.com wrote:
[...]
> index 47bd4ea0837d..a4704f405e8d 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -68,6 +68,13 @@
>  *
>  * The nop instructions allow us to insert one or more instructions to flush 
> the
>  * L1-D cache when returning to userspace or a guest.
> + *
> + * powerpc relies on return from interrupt/syscall being context 
> synchronising
> + * (which hrfid, rfid, and rfscv are) to support 
> ARCH_HAS_MEMBARRIER_SYNC_CORE
> + * without additional additional synchronisation instructions. soft-masked
> + * interrupt replay does not include a context-synchronising rfid, but those
> + * always return to kernel, the context sync is only required for IPIs which
> + * return to user.
>  */
> #define RFI_FLUSH_SLOT
> \
>   RFI_FLUSH_FIXUP_SECTION;\

I suspect the statement "the context sync is only required for IPIs which
return to user." is misleading.

As I recall, we need more than just a context sync after the IPI. We need a
context sync in the return path of any trap/interrupt/system call which
returns to user-space, else we'd need to add the proper core serializing
barriers in the scheduler, as we had to do for lazy tlb on x86.

Or am I missing something ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Mathieu Desnoyers
- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
>> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
>>> Also, as it stands, I can easily see in_irq() ceasing to promise to
>>> serialize.  There are older kernels for which it does not promise to
>>> serialize.  And I have plans to make it stop serializing in the
>>> nearish future.
>> 
>> You mean x86's return from interrupt? Sounds fun... you'll konw where to
>> update the membarrier sync code, at least :)
> 
> Oh, I should actually say Mathieu recently clarified a return from
> interrupt doesn't fundamentally need to serialize in order to support
> membarrier sync core.

Clarification to your statement:

Return from interrupt to kernel code does not need to be context serializing
as long as the kernel serializes before returning to user-space.

However, return from interrupt to user-space needs to be context serializing.

Thanks,

Mathieu

> 
> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html
> 
> So you may not need to do anything more if you relaxed it.
> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Mathieu Desnoyers
membarrier_{private,global}_expedited() may have observed that
>* kernel thread and not issued an IPI. It is therefore possible to
>* schedule between user->kernel->user threads without passing though
> -  * switch_mm(). Membarrier requires a barrier after storing to
> -  * rq->curr, before returning to userspace, so provide them here:
> -  *
> -  * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
> -  *   provided by mmdrop(),
> -  * - a sync_core for SYNC_CORE.
> +  * switch_mm(). Membarrier requires a full barrier after storing to
> +  * rq->curr, before returning to userspace, for
> +  * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop().
>*/
> - if (mm) {
> - membarrier_mm_sync_core_before_usermode(mm);
> + if (mm)
>   mmdrop(mm);
> - }
> +
>   if (unlikely(prev_state == TASK_DEAD)) {
>   if (prev->sched_class->task_dead)
>   prev->sched_class->task_dead(prev);
> @@ -6292,6 +6289,7 @@ void idle_task_exit(void)
>   BUG_ON(current != this_rq()->idle);
> 
>   if (mm != _mm) {
> + /* enter_lazy_tlb is not done because we're about to go down */
>   switch_mm(mm, _mm, current);
>   finish_arch_post_lock_switch();
>   }
> --
> 2.23.0

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 4:46 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Thu, Jul 09, 2020 at 01:56:19PM -0400, Mathieu Desnoyers wrote:
>> > Just to make sure I understand your recommendation. So rather than
>> > hard coding r17 as the temporary registers, we could explicitly
>> > declare the temporary register as a C variable, pass it as an
>> > input operand to the inline asm, and then refer to it by operand
>> > name in the macros using it. This way the compiler would be free
>> > to perform its own register allocation.
>> > 
>> > If that is what you have in mind, then yes, I think it makes a
>> > lot of sense.
>> 
>> Except that asm goto have this limitation with gcc: those cannot
>> have any output operand, only inputs, clobbers and target labels.
>> We cannot modify a temporary register received as input operand. So I don't
>> see how to get a temporary register allocated by the compiler considering
>> this limitation.
> 
> Heh, yet another reason not to obfuscate your inline asm: it didn't
> register this is asm goto.
> 
> A clobber is one way, yes (those *are* allowed in asm goto).  Another
> way is to not actually change that register: move the original value
> back into there at the end of the asm!  (That isn't always easy to do,
> it depends on your code).  So something like
> 
>   long start = ...;
>   long tmp = start;
>   asm("stuff that modifies %0; ...; mr %0,%1" : : "r"(tmp), "r"(start));
> 
> is just fine: %0 isn't actually modified at all, as far as GCC is
> concerned, and this isn't lying to it!

It comes at the cost of adding one extra instruction on the fast path to
restore the register to its original value. I'll leave it to Boqun, who
authored the original rseq-ppc code, to figure out what works best
performance-wise (when he finds time).

Thanks for the pointers!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 1:42 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 9, 2020, at 1:37 PM, Segher Boessenkool 
> seg...@kernel.crashing.org
> wrote:
> 
>> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote:
>>> > What protects r17 *after* this asm statement?
>>> 
>>> As discussed in the other leg of the thread (with the code example),
>>> r17 is in the clobber list of all asm statements using this macro, and
>>> is used as a temporary register within each inline asm.
>> 
>> That works fine then, for a testcase.  Using r17 is not a great idea for
>> performance (it increases the active register footprint, and causes more
>> registers to be saved in the prologue of the functions, esp. on older
>> compilers), and it is easier to just let the compiler choose a good
>> register to use.  But maybe you want to see r17 in the generated
>> testcases, as eyecatcher or something, dunno :-)
> 
> Just to make sure I understand your recommendation. So rather than
> hard coding r17 as the temporary registers, we could explicitly
> declare the temporary register as a C variable, pass it as an
> input operand to the inline asm, and then refer to it by operand
> name in the macros using it. This way the compiler would be free
> to perform its own register allocation.
> 
> If that is what you have in mind, then yes, I think it makes a
> lot of sense.

Except that asm goto has this limitation with gcc: it cannot have any
output operands, only inputs, clobbers and target labels. We cannot modify
a temporary register received as an input operand, so I don't see how to
get a temporary register allocated by the compiler given this limitation.
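
For reference, here is a stripped-down sketch of the kind of asm goto
statement we are talking about (a hypothetical helper modeled on the
RSEQ_ASM_OP_CMPEQ/RSEQ_ASM_CMP_CPU_ID macros): the temporary has to be a
hard-coded register listed in the clobbers, since no output operand can
hand a compiler-chosen register back to us:

static inline int word_cmpeq(int *v, int expect)
{
	asm goto ("lwz%U[v]%X[v] %%r17, %[v]\n\t"
		  "cmpw cr7, %[expect], %%r17\n\t"
		  "bne- cr7, %l[ne]\n\t"
		  : /* asm goto: no output operands allowed */
		  : [v] "m" (*v), [expect] "r" (expect)
		  : "r17", "cr7", "memory"
		  : ne);
	return 1;
ne:
	return 0;
}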

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 1:37 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote:
>> > What protects r17 *after* this asm statement?
>> 
>> As discussed in the other leg of the thread (with the code example),
>> r17 is in the clobber list of all asm statements using this macro, and
>> is used as a temporary register within each inline asm.
> 
> That works fine then, for a testcase.  Using r17 is not a great idea for
> performance (it increases the active register footprint, and causes more
> registers to be saved in the prologue of the functions, esp. on older
> compilers), and it is easier to just let the compiler choose a good
> register to use.  But maybe you want to see r17 in the generated
> testcases, as eyecatcher or something, dunno :-)

Just to make sure I understand your recommendation. So rather than
hard coding r17 as the temporary registers, we could explicitly
declare the temporary register as a C variable, pass it as an
input operand to the inline asm, and then refer to it by operand
name in the macros using it. This way the compiler would be free
to perform its own register allocation.

If that is what you have in mind, then yes, I think it makes a
lot of sense.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 8:18 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Wed, Jul 08, 2020 at 08:01:23PM -0400, Mathieu Desnoyers wrote:
>> > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label)
>> > > \
>> > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
>> > >\
>> > 
>> > The way this hardcodes r17 *will* break, btw.  The compiler will not
>> > likely want to use r17 as long as your code (after inlining etc.!) stays
>> > small, but there is Murphy's law.
>> 
>> r17 is in the clobber list, so it should be ok.
> 
> What protects r17 *after* this asm statement?

As discussed in the other leg of the thread (with the code example),
r17 is in the clobber list of all asm statements using this macro, and
is used as a temporary register within each inline asm.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 8:10 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> Hi!
> 
> On Wed, Jul 08, 2020 at 10:00:01AM -0400, Mathieu Desnoyers wrote:
[...]
> 
>> -#define STORE_WORD "std "
>> -#define LOAD_WORD  "ld "
>> -#define LOADX_WORD "ldx "
>> +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" 
>> __rseq_str(arg)
>> "] "/* To memory ("m" constraint) */
>> +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "
>> /* From memory ("m" constraint) */
> 
> That cannot work (you typoed "ld" here).

Indeed, I noticed it before pushing to master (lwd -> ld).

> 
> Some more advice about this code, pretty generic stuff:

Let's take an example to support the discussion here. I'm taking it from the
master branch (after a cleanup changing e.g. LOAD_WORD into RSEQ_LOAD_LONG).
So for powerpc32 we have (code edited to remove testing instrumentation):

#define __rseq_str_1(x) #x
#define __rseq_str(x)   __rseq_str_1(x)

#define RSEQ_STORE_LONG(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* To memory ("m" constraint) */
#define RSEQ_STORE_INT(arg) RSEQ_STORE_LONG(arg)
/* To memory ("m" constraint) */
#define RSEQ_LOAD_LONG(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* From memory ("m" constraint) */
#define RSEQ_LOAD_INT(arg)  RSEQ_LOAD_LONG(arg) 
/* From memory ("m" constraint) */
#define RSEQ_LOADX_LONG "lwzx " 
/* From base register ("b" constraint) */
#define RSEQ_CMP_LONG   "cmpw "

#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags,  
\
start_ip, post_commit_offset, abort_ip) 
\
".pushsection __rseq_cs, \"aw\"\n\t"
\
".balign 32\n\t"
\
__rseq_str(label) ":\n\t"   
\
".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t"  
\
/* 32-bit only supported on BE */   
\
".long 0x0, " __rseq_str(start_ip) ", 0x0, " 
__rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \
".popsection\n\t"   \
".pushsection __rseq_cs_ptr_array, \"aw\"\n\t"  \
".long 0x0, " __rseq_str(label) "b\n\t" \
".popsection\n\t"

/*
 * Exit points of a rseq critical section consist of all instructions outside
 * of the critical section where a critical section can either branch to or
 * reach through the normal course of its execution. The abort IP and the
 * post-commit IP are already part of the __rseq_cs section and should not be
 * explicitly defined as additional exit points. Knowing all exit points is
 * useful to assist debuggers stepping over the critical section.
 */
#define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip)   
\
".pushsection __rseq_exit_point_array, \"aw\"\n\t"  
\
/* 32-bit only supported on BE */   
\
".long 0x0, " __rseq_str(start_ip) ", 0x0, " 
__rseq_str(exit_ip) "\n\t" \
".popsection\n\t"

#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)
\
"lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t"  
\
"addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t"   
\
RSEQ_STORE_INT(rseq_cs) "%%r17, %[" __rseq_str(rseq_cs) "]\n\t" 
\
__rseq_str(label) ":\n\t"

#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip)
\
__RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip,  
\
(post_commit_ip - start_ip), abort_ip)

#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)  
\
RSEQ_LOAD_INT(current_cpu_id) "%%r17, %[" 
__rseq_str(current_cpu_id) "]\n\t" \
"cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t"

Re: Failure to build librseq on ppc

2020-07-08 Thread Mathieu Desnoyers


- Segher Boessenkool  wrote:
> Hi!
> 
> On Wed, Jul 08, 2020 at 10:27:27PM +1000, Michael Ellerman wrote:
> > Segher Boessenkool  writes:
> > > You'll have to show the actual failing machine code, and with enough
> > > context that we can relate this to the source code.
> > >
> > > -save-temps helps, or use -S instead of -c, etc.
> > 
> > Attached below.
> 
> Thanks!
> 
> > I think that's from:
> > 
> > #define LOAD_WORD   "ld "
> > 
> > #define RSEQ_ASM_OP_CMPEQ(var, expect, label)   
> > \
> > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"   
> > \
> 
> The way this hardcodes r17 *will* break, btw.  The compiler will not
> likely want to use r17 as long as your code (after inlining etc.!) stays
> small, but there is Murphy's law.

r17 is in the clobber list, so it should be ok.

> 
> Anyway...  something in rseq_str is wrong, missing %X.  This may
> have to do with the abuse of inline asm here, making a fix harder :-(

I just committed a fix which enhances the macros.

Thanks for your help!

Mathieu

> 
> 
> Segher

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH 1/1] powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at

2020-07-08 Thread Mathieu Desnoyers
The placeholder for instruction selection should use the second
argument's operand, which is %1, not %0. This could generate incorrect
assembly code if the instruction selection for argument %0 ever differs
from argument %1.

Fixes: 9bf2b5cdc5fe ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support")
Signed-off-by: Mathieu Desnoyers 
Cc: Christophe Leroy 
Cc: Kumar Gala 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc:  # v2.6.28+
---
 arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +-
 arch/powerpc/include/asm/nohash/pgtable.h| 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 224912432821..f1467b3c417a 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -529,7 +529,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
__asm__ __volatile__("\
stw%U0%X0 %2,%0\n\
eieio\n\
-   stw%U0%X0 %L2,%1"
+   stw%U1%X1 %L2,%1"
: "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
: "r" (pte) : "memory");
 
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index 4b7c3472eab1..a00e4c1746d6 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -199,7 +199,7 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
__asm__ __volatile__("\
stw%U0%X0 %2,%0\n\
eieio\n\
-   stw%U0%X0 %L2,%1"
+   stw%U1%X1 %L2,%1"
: "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
: "r" (pte) : "memory");
return;
-- 
2.11.0



powerpc: Incorrect stw operand modifier in __set_pte_at

2020-07-08 Thread Mathieu Desnoyers
Hi,

Reviewing use of the patterns "Un%Xn" with lwz and stw instructions
(where n should be the operand number) within the Linux kernel led
me to spot those 2 weird cases:

arch/powerpc/include/asm/nohash/pgtable.h:__set_pte_at()

__asm__ __volatile__("\
stw%U0%X0 %2,%0\n\
eieio\n\
stw%U0%X0 %L2,%1"
: "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
: "r" (pte) : "memory");

I would have expected the stw to be:

stw%U1%X1 %L2,%1"

and:
arch/powerpc/include/asm/book3s/32/pgtable.h:__set_pte_at()

__asm__ __volatile__("\
stw%U0%X0 %2,%0\n\
eieio\n\
stw%U0%X0 %L2,%1"
: "=m" (*ptep), "=m" (*((unsigned char *)ptep+4))
: "r" (pte) : "memory");

where I would have expected:

stw%U1%X1 %L2,%1"

Is it a bug or am I missing something ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-08 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 10:21 AM, Christophe Leroy christophe.le...@csgroup.eu 
wrote:

> Le 08/07/2020 à 16:00, Mathieu Desnoyers a écrit :
>> - On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
>> 
>>> - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool 
>>> seg...@kernel.crashing.org
>>> wrote:
>> [...]
>>>>
>>>> So perhaps you have code like
>>>>
>>>>   int *p;
>>>>   int x;
>>>>   ...
>>>>   asm ("lwz %0,%1" : "=r"(x) : "m"(*p));
>>>
>>> We indeed have explicit "lwz" and "stw" instructions in there.
>>>
>>>>
>>>> where that last line should actually read
>>>>
>>>>   asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p));
>>>
>>> Indeed, turning those into "lwzx" and "stwx" seems to fix the issue.
>>>
>>> There has been some level of extra CPP macro coating around those 
>>> instructions
>>> to
>>> support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is 
>>> not
>>> trivial.
>>> Let me see what can be done here.
>> 
>> I did the following changes which appear to generate valid asm.
>> See attached corresponding .S output.
>> 
>> I grepped for uses of "m" asm operand in Linux powerpc code and noticed it's
>> pretty much
>> always used with e.g. "lwz%U1%X1". I could find one blog post discussing 
>> that %U
>> is about
>> update flag, and nothing about %X. Are those documented ?
> 
> As far as I can see, %U is mentioned in
> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html in the
> powerpc subpart, at the "m" constraint.

Yep, I did notice it, but mistakenly thought it was only needed for "m<>" 
operand,
not "m".

Thanks,

Mathieu

> 
> For the %X I don't know.
> 
> Christophe
> 
>> 
>> Although it appears to generate valid asm, I have the feeling I'm relying on
>> undocumented
>> features here. :-/

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE

2020-07-08 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 1:17 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Mathieu Desnoyers's message of July 7, 2020 9:25 pm:
>> - On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
[...]
>>> I should actually change the comment for 64-bit because soft masked
>>> interrupt replay is an interesting case. I thought it was okay (because
>>> the IPI would cause a hard interrupt which does do the rfi) but that
>>> should at least be written.
>> 
>> Yes.
>> 
>>> The context synchronisation happens before
>>> the Linux IPI function is called, but for the purpose of membarrier I
>>> think that is okay (the membarrier just needs to have caused a memory
>>> barrier + context synchronistaion by the time it has done).
>> 
>> Can you point me to the code implementing this logic ?
> 
> It's mostly in arch/powerpc/kernel/exception-64s.S and
> powerpc/kernel/irq.c, but a lot of asm so easier to explain.
> 
> When any Linux code does local_irq_disable(), we set interrupts as
> software-masked in a per-cpu flag. When interrupts (including IPIs) come
> in, the first thing we do is check that flag and if we are masked, then
> record that the interrupt needs to be "replayed" in another per-cpu
> flag. The interrupt handler then exits back using RFI (which is context
> synchronising the CPU). Later, when the kernel code does
> local_irq_enable(), it checks the replay flag to see if anything needs
> to be done. At that point we basically just call the interrupt handler
> code like a normal function, and when that returns there is no context
> synchronising instruction.

AFAIU this can only happen for interrupts nesting over irqoff sections,
therefore over kernel code, never userspace, right ?

> 
> So membarrier IPI will always cause target CPUs to perform a context
> synchronising instruction, but sometimes it happens before the IPI
> handler function runs.

If my understanding is correct, the replayed interrupt handler logic
only nests over kernel code, which will eventually need to issue a
context synchronizing instruction before returning to user-space.

All we care about is that, starting from the membarrier, each core
either:

- is interrupted to issue the context synchronizing instruction if it
  was running user-space, or
- _eventually_ issues a context synchronizing instruction before returning
  to user-space if it was running kernel code.

So your earlier statement "the membarrier just needs to have caused a memory
barrier + context synchronistaion by the time it has done" is not strictly
correct: the context synchronizing instruction does not strictly need to
happen on each core before membarrier returns. A similar line of thought
can be followed for memory barriers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-08 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool 
> seg...@kernel.crashing.org
> wrote:
[...]
>> 
>> So perhaps you have code like
>> 
>>  int *p;
>>  int x;
>>  ...
>>  asm ("lwz %0,%1" : "=r"(x) : "m"(*p));
> 
> We indeed have explicit "lwz" and "stw" instructions in there.
> 
>> 
>> where that last line should actually read
>> 
>>  asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p));
> 
> Indeed, turning those into "lwzx" and "stwx" seems to fix the issue.
> 
> There has been some level of extra CPP macro coating around those instructions
> to
> support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is not
> trivial.
> Let me see what can be done here.

I did the following changes which appear to generate valid asm.
See attached corresponding .S output.

I grepped for uses of the "m" asm operand in Linux powerpc code and noticed
it's pretty much always used with e.g. "lwz%U1%X1". I could find one blog post
discussing that %U is about the update flag, and nothing about %X. Are those
documented ?

Although it appears to generate valid asm, I have the feeling I'm relying on
undocumented features here. :-/

Here is the diff on 
https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/rseq-ppc.h
It's only compile-tested on powerpc32 so far:

diff --git a/include/rseq/rseq-ppc.h b/include/rseq/rseq-ppc.h
index eb53953..f689fe9 100644
--- a/include/rseq/rseq-ppc.h
+++ b/include/rseq/rseq-ppc.h
@@ -47,9 +47,9 @@ do {  
\

 #ifdef __PPC64__

-#define STORE_WORD "std "
-#define LOAD_WORD  "ld "
-#define LOADX_WORD "ldx "
+#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* To memory ("m" constraint) */
+#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "
/* From memory ("m" constraint) */
+#define LOADX_WORD "ldx "  
/* From base register ("b" constraint) */
 #define CMP_WORD   "cmpd "

 #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, 
\
@@ -89,9 +89,9 @@ do {  
\

 #else /* #ifdef __PPC64__ */

-#define STORE_WORD "stw "
-#define LOAD_WORD  "lwz "
-#define LOADX_WORD "lwzx "
+#define STORE_WORD(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* To memory ("m" constraint) */
+#define LOAD_WORD(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "
/* From memory ("m" constraint) */
+#define LOADX_WORD "lwzx " 
/* From base register ("b" constraint) */
 #define CMP_WORD   "cmpw "

 #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, 
\
@@ -125,7 +125,7 @@ do {
\
RSEQ_INJECT_ASM(1)  
\
"lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t"  
\
"addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t"   
\
-   "stw %%r17, %[" __rseq_str(rseq_cs) "]\n\t" 
\
+   "stw%U[" __rseq_str(rseq_cs) "]%X[" __rseq_str(rseq_cs) "] 
%%r17, %[" __rseq_str(rseq_cs) "]\n\t"   \
__rseq_str(label) ":\n\t"

 #endif /* #ifdef __PPC64__ */
@@ -136,7 +136,7 @@ do {
\

 #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) 
\
RSEQ_INJECT_ASM(2)  
\
-   "lwz %%r17, %[" __rseq_str(current_cpu_id) "]\n\t"  
\
+   "lwz%U[" __rseq_str(current_cpu_id) "]%X[" 
__rseq_str(current_cpu_id) "] %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \
"cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t"
\
"bne- cr7, " __rseq_str(label) "\n\t"
@@ -153,25 +153,25 @@ do { 

Re: Failure to build librseq on ppc

2020-07-08 Thread Mathieu Desnoyers
- On Jul 7, 2020, at 8:59 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> Hi!
> 
> On Tue, Jul 07, 2020 at 03:17:10PM -0400, Mathieu Desnoyers wrote:
>> I'm trying to build librseq at:
>> 
>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git
>> 
>> on powerpc, and I get these errors when building the rseq basic
>> test mirrored from the kernel selftests code:
>> 
>> /tmp/ccieEWxU.s: Assembler messages:
>> /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8'
>> /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `('
>> /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8'
>> Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed
> 
> You'll have to show the actual failing machine code, and with enough
> context that we can relate this to the source code.
> 
> -save-temps helps, or use -S instead of -c, etc.

Sure, see attached .S file.

> 
>> I am using this compiler:
>> 
>> gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
>> Target: powerpc-linux-gnu
>> 
>> So far, I got things to build by changing "m" operands to "Q" operands.
>> Based on
>> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
>> it seems that "Q" means "A memory operand addressed by just a base register."
> 
> Yup.
> 
>> I suspect that lwz and stw don't expect some kind of immediate offset which
>> can be kept with "m", and "Q" fixes this. Is that the right fix ?
>> 
>> And should we change all operands passed to lwz and stw to a "Q" operand ?
> 
> No, lwz and stw exactly *do* take an immediate offset.
> 
> It sounds like the compiler passed memory addressed by indexed
> addressing, instead.  Which is fine for "m", and also fine for those
> insns... well, you need lwzx and stwx.
> 
> So perhaps you have code like
> 
>  int *p;
>  int x;
>  ...
>  asm ("lwz %0,%1" : "=r"(x) : "m"(*p));

We indeed have explicit "lwz" and "stw" instructions in there.

> 
> where that last line should actually read
> 
>  asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p));

Indeed, turning those into "lwzx" and "stwx" seems to fix the issue.

There has been some level of extra CPP macro coating around those
instructions to support both ppc32 and ppc64 with the same assembly. So
adding %X[arg] is not trivial. Let me see what can be done here.
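
For my own notes, a minimal standalone illustration of the two ways out
(a hypothetical example, not the librseq macros themselves):

static inline int load_word(int *p)
{
	int x;

	/* Keep "m" and let %X1 append the 'x' suffix when the compiler
	 * picks an indexed (reg+reg) address, so both forms assemble: */
	asm ("lwz%U1%X1 %0, %1" : "=r" (x) : "m" (*p));

	/* Alternative: the "Q" constraint restricts the operand to a plain
	 * base-register address, so a bare "lwz" stays valid: */
	asm ("lwz %0, %1" : "=r" (x) : "Q" (*p));

	return x;
}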

Thanks,

Mathieu


> 
> ?
> 
> 
> Segher

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


basic_percpu_ops_test.S
Description: Binary data


Failure to build librseq on ppc

2020-07-07 Thread Mathieu Desnoyers
Hi Boqun,

I'm trying to build librseq at:

https://git.kernel.org/pub/scm/libs/librseq/librseq.git

on powerpc, and I get these errors when building the rseq basic
test mirrored from the kernel selftests code:

/tmp/ccieEWxU.s: Assembler messages:
/tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:118: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:121: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:626: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:629: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:735: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:738: Error: junk at end of line: `,8'
/tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `('
/tmp/ccieEWxU.s:741: Error: junk at end of line: `,8'
Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed

I am using this compiler:

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)
Target: powerpc-linux-gnu

So far, I got things to build by changing "m" operands to "Q" operands.
Based on 
https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints
it seems that "Q" means "A memory operand addressed by just a base register."
I suspect that lwz and stw don't expect some kind of immediate offset which
can be kept with "m", and "Q" fixes this. Is that the right fix ?

And should we change all operands passed to lwz and stw to a "Q" operand ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE

2020-07-07 Thread Mathieu Desnoyers
- On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote:

> Excerpts from Christophe Leroy's message of July 6, 2020 7:53 pm:
>> 
>> 
>> Le 06/07/2020 à 04:18, Nicholas Piggin a écrit :
>>> diff --git a/arch/powerpc/include/asm/exception-64s.h
>>> b/arch/powerpc/include/asm/exception-64s.h
>>> index 47bd4ea0837d..b88cb3a989b6 100644
>>> --- a/arch/powerpc/include/asm/exception-64s.h
>>> +++ b/arch/powerpc/include/asm/exception-64s.h
>>> @@ -68,6 +68,10 @@
>>>*
>>>* The nop instructions allow us to insert one or more instructions to 
>>> flush the
>>>* L1-D cache when returning to userspace or a guest.
>>> + *
>>> + * powerpc relies on return from interrupt/syscall being context 
>>> synchronising
>>> + * (which hrfid, rfid, and rfscv are) to support 
>>> ARCH_HAS_MEMBARRIER_SYNC_CORE
>>> + * without additional additional synchronisation instructions.
>> 
>> This file is dedicated to BOOK3S/64. What about other ones ?
>> 
>> On 32 bits, this is also valid as 'rfi' is also context synchronising,
>> but then why just add some comment in exception-64s.h and only there ?
> 
> Yeah you're right, I basically wanted to keep a note there just in case,
> because it's possible we would get a less synchronising return (maybe
> unlikely with meltdown) or even return from a kernel interrupt using
> something faster (e.g., bctar if we don't use tar register in the kernel
> anywhere).
> 
> So I wonder where to add the note, entry_32.S and 64e.h as well?
> 

For 64-bit powerpc, I would be tempted to either place the comment in the header
implementing the RFI_TO_USER and RFI_TO_USER_OR_KERNEL macros or the .S files
using them, e.g. either:

arch/powerpc/include/asm/exception-64e.h
arch/powerpc/include/asm/exception-64s.h

or

arch/powerpc/kernel/exceptions-64s.S
arch/powerpc/kernel/entry_64.S

And for 32-bit powerpc, AFAIU

arch/powerpc/kernel/entry_32.S

uses SYNC + RFI to return to user-space. RFI is defined in

arch/powerpc/include/asm/ppc_asm.h

So a comment either near the RFI define or near its uses should work.

> I should actually change the comment for 64-bit because soft masked
> interrupt replay is an interesting case. I thought it was okay (because
> the IPI would cause a hard interrupt which does do the rfi) but that
> should at least be written.

Yes.

> The context synchronisation happens before
> the Linux IPI function is called, but for the purpose of membarrier I
> think that is okay (the membarrier just needs to have caused a memory
> barrier + context synchronisation by the time it has done).

Can you point me to the code implementing this logic ?

Thanks,

Mathieu

> 
> Thanks,
> Nick

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH for 5.2 10/12] rseq/selftests: powerpc code signature: generate valid instructions

2019-04-29 Thread Mathieu Desnoyers
Use "twui" as the guard instruction for the restartable sequence abort
handler.

Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Boqun Feng 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: Alan Modra 
CC: linuxppc-dev@lists.ozlabs.org
---
 tools/testing/selftests/rseq/rseq-ppc.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rseq/rseq-ppc.h 
b/tools/testing/selftests/rseq/rseq-ppc.h
index 9df18487fa9f..76be90196fe4 100644
--- a/tools/testing/selftests/rseq/rseq-ppc.h
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -6,7 +6,15 @@
  * (C) Copyright 2016-2018 - Boqun Feng 
  */
 
-#define RSEQ_SIG   0x53053053
+/*
+ * RSEQ_SIG is used with the following trap instruction:
+ *
+ * powerpc-be:0f e5 00 0b   twui   r5,11
+ * powerpc64-le:  0b 00 e5 0f   twui   r5,11
+ * powerpc64-be:  0f e5 00 0b   twui   r5,11
+ */
+
+#define RSEQ_SIG   0x0fe5000b
 
 #define rseq_smp_mb()  __asm__ __volatile__ ("sync"::: "memory", 
"cc")
 #define rseq_smp_lwsync()  __asm__ __volatile__ ("lwsync"  ::: "memory", 
"cc")
-- 
2.11.0
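
(For context, a rough sketch of how such a signature word is typically laid
out before the abort handler; the section and label names here are
illustrative rather than the exact selftests macros. The kernel only honours
an abort target whose preceding four bytes match the signature registered
through rseq(2), and encoding that word as a real "twui" keeps it valid for
disassemblers and tools that scan executable text:)

	.pushsection	__rseq_failure, "ax"
	/* Signature word checked by the kernel; decodes as "twui r5,11". */
	.long	0x0fe5000b
abort_ip:
	b	do_abort	/* do_abort: illustrative abort handler label */
	.popsection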



[RFC PATCH for 5.2 09/10] rseq/selftests: powerpc code signature: generate valid instructions

2019-04-24 Thread Mathieu Desnoyers
Use "twui" as the guard instruction for the restartable sequence abort
handler.

Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Boqun Feng 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: Alan Modra 
CC: linuxppc-dev@lists.ozlabs.org
---
 tools/testing/selftests/rseq/rseq-ppc.h | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/rseq/rseq-ppc.h 
b/tools/testing/selftests/rseq/rseq-ppc.h
index 9df18487fa9f..76be90196fe4 100644
--- a/tools/testing/selftests/rseq/rseq-ppc.h
+++ b/tools/testing/selftests/rseq/rseq-ppc.h
@@ -6,7 +6,15 @@
  * (C) Copyright 2016-2018 - Boqun Feng 
  */
 
-#define RSEQ_SIG   0x53053053
+/*
+ * RSEQ_SIG is used with the following trap instruction:
+ *
+ * powerpc-be:0f e5 00 0b   twui   r5,11
+ * powerpc64-le:  0b 00 e5 0f   twui   r5,11
+ * powerpc64-be:  0f e5 00 0b   twui   r5,11
+ */
+
+#define RSEQ_SIG   0x0fe5000b
 
 #define rseq_smp_mb()  __asm__ __volatile__ ("sync"::: "memory", 
"cc")
 #define rseq_smp_lwsync()  __asm__ __volatile__ ("lwsync"  ::: "memory", 
"cc")
-- 
2.11.0



[RFC PATCH for 4.21 09/16] powerpc: Wire up cpu_opv system call

2018-11-01 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Boqun Feng 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 01b5171ea189..8f58710f5e8b 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -394,3 +394,4 @@ SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
 COMPAT_SYS(io_pgetevents)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 985534d0b448..112e2c54750a 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -400,5 +400,6 @@
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
 #define __NR_io_pgetevents 388
+#define __NR_cpu_opv   389
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.21 09/16] powerpc: Wire up cpu_opv system call

2018-10-10 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Boqun Feng 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 01b5171ea189..8f58710f5e8b 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -394,3 +394,4 @@ SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
 COMPAT_SYS(io_pgetevents)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 985534d0b448..112e2c54750a 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -400,5 +400,6 @@
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
 #define __NR_io_pgetevents 388
+#define __NR_cpu_opv   389
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



Re: [RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call

2018-06-05 Thread Mathieu Desnoyers
- On Jun 5, 2018, at 1:18 AM, Michael Ellerman m...@ellerman.id.au wrote:

> Mathieu Desnoyers  writes:
> 
>> From: Boqun Feng 
>>
>> Wire up the rseq system call on powerpc.
>>
>> This provides an ABI improving the speed of a user-space getcpu
>> operation on powerpc by skipping the getcpu system call on the fast
>> path, as well as improving the speed of user-space operations on per-cpu
>> data compared to using load-reservation/store-conditional atomics.
>>
>> Signed-off-by: Boqun Feng 
>> Signed-off-by: Mathieu Desnoyers 
>> CC: Benjamin Herrenschmidt 
>> CC: Paul Mackerras 
>> CC: Michael Ellerman 
>> CC: Peter Zijlstra 
>> CC: "Paul E. McKenney" 
>> CC: linuxppc-dev@lists.ozlabs.org
>> ---
>>  arch/powerpc/include/asm/systbl.h  | 1 +
>>  arch/powerpc/include/asm/unistd.h  | 2 +-
>>  arch/powerpc/include/uapi/asm/unistd.h | 1 +
>>  3 files changed, 3 insertions(+), 1 deletion(-)
> 
> Looks fine to me.
> 
> I don't have any other new syscalls in my next, so this should not
> conflict with anything for 4.18.
> 
> Acked-by: Michael Ellerman  (powerpc)

Added your ack to the series, thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences

2018-06-05 Thread Mathieu Desnoyers
- On Jun 5, 2018, at 1:21 AM, Michael Ellerman m...@ellerman.id.au wrote:

> Mathieu Desnoyers  writes:
>> From: Boqun Feng 
>>
>> Syscalls are not allowed inside restartable sequences, so add a call to
>> rseq_syscall() at the very beginning of system call exiting path for
>> CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
>> is a syscall issued inside restartable sequences.
>>
>> [ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to
>>   be tested on 32-bit powerpc kernel. ]
>>
>> Signed-off-by: Boqun Feng 
>> Signed-off-by: Mathieu Desnoyers 
>> CC: Benjamin Herrenschmidt 
>> CC: Paul Mackerras 
>> CC: Michael Ellerman 
>> CC: Peter Zijlstra 
>> CC: "Paul E. McKenney" 
>> CC: linuxppc-dev@lists.ozlabs.org
>> ---
>>  arch/powerpc/kernel/entry_32.S | 7 +++
>>  arch/powerpc/kernel/entry_64.S | 8 
>>  2 files changed, 15 insertions(+)
> 
> I don't _love_ the #ifdefs in here, but they look correct and there's
> not really a better option until we rewrite the syscall handler in C.
> 
> The rseq selftests passed for me with this applied and enabled. So if
> you like here's some tags:
> 
> Tested-by: Michael Ellerman 
> Acked-by: Michael Ellerman 
> 

Adding your ack to the series.

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call

2018-06-02 Thread Mathieu Desnoyers
From: Boqun Feng 

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

Signed-off-by: Boqun Feng 
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0
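
(To illustrate the fast path this enables, a minimal user-space sketch; the
__rseq_abi TLS symbol follows the selftests convention and
rseq_current_cpu_raw() is a made-up name, so treat this as an assumption
rather than part of the patch:)

#include <linux/rseq.h>

/* Registered once per thread with the rseq(2) system call; the kernel
 * keeps cpu_id current across migrations and context switches. */
extern __thread volatile struct rseq __rseq_abi;

static inline int rseq_current_cpu_raw(void)
{
	/* A single TLS load replaces a getcpu()/sched_getcpu() system call. */
	return (int)__rseq_abi.cpu_id;
}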



[RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences

2018-06-02 Thread Mathieu Desnoyers
From: Boqun Feng 

Syscalls are not allowed inside restartable sequences, so add a call to
rseq_syscall() at the very beginning of system call exiting path for
CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
is a syscall issued inside restartable sequences.

[ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to
  be tested on 32-bit powerpc kernel. ]

Signed-off-by: Boqun Feng 
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/entry_32.S | 7 +++
 arch/powerpc/kernel/entry_64.S | 8 
 2 files changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index eb8d01bae8c6..973577f2141c 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -365,6 +365,13 @@ syscall_dotrace_cont:
blrl/* Call handler */
.globl  ret_from_syscall
 ret_from_syscall:
+#ifdef CONFIG_DEBUG_RSEQ
+   /* Check whether the syscall is issued inside a restartable sequence */
+   stw r3,GPR3(r1)
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  rseq_syscall
+   lwz r3,GPR3(r1)
+#endif
mr  r6,r3
CURRENT_THREAD_INFO(r12, r1)
/* disable interrupts so current_thread_info()->flags can't change */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 51695608c68b..1c374387656a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -184,6 +184,14 @@ system_call:   /* label this so stack 
traces look sane */
 
 .Lsyscall_exit:
std r3,RESULT(r1)
+
+#ifdef CONFIG_DEBUG_RSEQ
+   /* Check whether the syscall is issued inside a restartable sequence */
+   addir3,r1,STACK_FRAME_OVERHEAD
+   bl  rseq_syscall
+   ld  r3,RESULT(r1)
+#endif
+
CURRENT_THREAD_INFO(r12, r1)
 
ld  r8,_MSR(r1)
-- 
2.11.0



[RFC PATCH for 4.18 08/16] powerpc: Add support for restartable sequences

2018-06-02 Thread Mathieu Desnoyers
From: Boqun Feng 

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.

Signed-off-by: Boqun Feng 
Signed-off-by: Mathieu Desnoyers 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Peter Zijlstra 
CC: "Paul E. McKenney" 
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..ed21a777e8c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0



Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-28 Thread Mathieu Desnoyers
- On May 24, 2018, at 3:03 AM, Michael Ellerman m...@ellerman.id.au wrote:

> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> writes:
>> - On May 23, 2018, at 4:14 PM, Mathieu Desnoyers
>> mathieu.desnoy...@efficios.com wrote:
> ...
>>> 
>>> Hi Boqun,
>>> 
>>> I tried your patch in a ppc64 le environment, and it does not survive boot
>>> with CONFIG_DEBUG_RSEQ=y. init gets killed right away.
> 
> 
> Sorry this code is super gross and hard to deal with.
> 
>> The following fixup gets ppc64 to work:
>>
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -208,6 +208,7 @@ system_call_exit:
>> /* Check whether the syscall is issued inside a restartable sequence 
>> */
>> addir3,r1,STACK_FRAME_OVERHEAD
>> bl  rseq_syscall
>> +   ld  r3,RESULT(r1)
>>  #endif
>> /*
>>  * Disable interrupts so current_thread_info()->flags can't change,
> 
> I don't think that's safe.
> 
> If you look above that, we have r3, r8 and r12 all live:
> 
> .Lsyscall_exit:
>   std r3,RESULT(r1)
>   CURRENT_THREAD_INFO(r12, r1)
> 
>   ld  r8,_MSR(r1)
> #ifdef CONFIG_PPC_BOOK3S
>   /* No MSR:RI on BookE */
>   andi.   r10,r8,MSR_RI
>   beq-.Lunrecov_restore
> #endif
> 
> 
> They're all volatile across function calls:
> 
>  
> http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655240_68174.html
> 
> 
> The system_call_exit symbol is actually there for kprobes and cosmetic
> purposes. The actual syscall return flow starts at .Lsyscall_exit.
> 
> So I think this would work:
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index db4df061c33a..e19f377a25e0 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -184,6 +184,14 @@ system_call: /* label this so stack 
> traces look sane */
> 
> .Lsyscall_exit:
>   std r3,RESULT(r1)
> +
> +#ifdef CONFIG_DEBUG_RSEQ
> + /* Check whether the syscall is issued inside a restartable sequence */
> + addir3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> + ld  r3,RESULT(r1)
> +#endif
> +
>   CURRENT_THREAD_INFO(r12, r1)
> 
>   ld  r8,_MSR(r1)
> 
> 
> I'll try and get this series into my test setup at some point, been a
> bit busy lately :)

Yes, this was needed. I had this in my tree already, but there is still
a kernel OOPS when running the rseq selftests on ppc64 with CONFIG_DEBUG_RSEQ=y.

My current dev tree is at: 
https://github.com/compudj/linux-percpu-dev/tree/rseq/dev-local

So considering we are at rc7 now, should I plan on removing the powerpc bits
for merge window submission, or is there someone planning to spend time on
fixing and testing ppc integration before the merge window opens ?

Thanks,

Mathieu


> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-23 Thread Mathieu Desnoyers
- On May 23, 2018, at 4:14 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On May 20, 2018, at 10:08 AM, Boqun Feng boqun.f...@gmail.com wrote:
> 
>> On Fri, May 18, 2018 at 02:17:17PM -0400, Mathieu Desnoyers wrote:
>>> - On May 17, 2018, at 7:50 PM, Boqun Feng boqun.f...@gmail.com wrote:
>>> [...]
>>> >> > I think you're right. So we have to introduce callsite to 
>>> >> > rseq_syscall()
>>> >> > in syscall path, something like:
>>> >> > 
>>> >> > diff --git a/arch/powerpc/kernel/entry_64.S 
>>> >> > b/arch/powerpc/kernel/entry_64.S
>>> >> > index 51695608c68b..a25734a96640 100644
>>> >> > --- a/arch/powerpc/kernel/entry_64.S
>>> >> > +++ b/arch/powerpc/kernel/entry_64.S
>>> >> > @@ -222,6 +222,9 @@ system_call_exit:
>>> >> >mtmsrd  r11,1
>>> >> > #endif /* CONFIG_PPC_BOOK3E */
>>> >> > 
>>> >> > +  addir3,r1,STACK_FRAME_OVERHEAD
>>> >> > +  bl  rseq_syscall
>>> >> > +
>>> >> >ld  r9,TI_FLAGS(r12)
>>> >> >li  r11,-MAX_ERRNO
>>> >> >andi.
>>> >> >
>>> >> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>>> >> > 
>>> 
>>> By the way, I think this is not the right spot to call rseq_syscall, because
>>> interrupts are disabled. I think we should move this hunk right after
>>> system_call_exit.
>>> 
>> 
>> Good point.
>> 
>>> Would you like to implement and test an updated patch adding those calls 
>>> for ppc
>>> 32 and 64 ?
>>> 
>> 
>> I'd like to help, but I don't have a handy ppc environment for test...
>> So I made the below patch which has only been build-tested, hope it
>> could be somewhat helpful.
> 
> Hi Boqun,
> 
> I tried your patch in a ppc64 le environment, and it does not survive boot
> with CONFIG_DEBUG_RSEQ=y. init gets killed right away.

The following fixup gets ppc64 to work:

--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -208,6 +208,7 @@ system_call_exit:
/* Check whether the syscall is issued inside a restartable sequence */
addir3,r1,STACK_FRAME_OVERHEAD
bl  rseq_syscall
+   ld  r3,RESULT(r1)
 #endif
/*
 * Disable interrupts so current_thread_info()->flags can't change,

> Moreover, I'm not sure that the r3 register doesn't contain something worth
> saving before the call on ppc32. Just after there is a "mr" instruction
> which AFAIU takes r3 as input register.

I'll start testing on ppc32 now.

Thanks,

Mathieu

> 
> Can you look into it ?
> 
> Thanks,
> 
> Mathieu
> 
>> 
>> Regards,
>> Boqun
>> 
>> ->8
>> Subject: [PATCH] powerpc: Add syscall detection for restartable sequences
>> 
>> Syscalls are not allowed inside restartable sequences, so add a call to
>> rseq_syscall() at the very beginning of system call exiting path for
>> CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
>> is a syscall issued inside restartable sequences.
>> 
>> Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
>> ---
>> arch/powerpc/kernel/entry_32.S | 5 +
>> arch/powerpc/kernel/entry_64.S | 5 +
>> 2 files changed, 10 insertions(+)
>> 
>> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
>> index eb8d01bae8c6..2f134eebe7ed 100644
>> --- a/arch/powerpc/kernel/entry_32.S
>> +++ b/arch/powerpc/kernel/entry_32.S
>> @@ -365,6 +365,11 @@ syscall_dotrace_cont:
>>  blrl/* Call handler */
>>  .globl  ret_from_syscall
>> ret_from_syscall:
>> +#ifdef CONFIG_DEBUG_RSEQ
>> +/* Check whether the syscall is issued inside a restartable sequence */
>> +addir3,r1,STACK_FRAME_OVERHEAD
>> +bl  rseq_syscall
>> +#endif
>>  mr  r6,r3
>>  CURRENT_THREAD_INFO(r12, r1)
>>  /* disable interrupts so current_thread_info()->flags can't change */
>> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> index 2cb5109a7ea3..2e2d59bb45d0 100644
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -204,6 +204,11 @@ system_call:/* label this so stack 
>> traces look sane */
>>  * This is blacklisted from kprobes further below with 
>> _ASM_NOKPROBE_SYMBOL().
>>  */
>> system_call_exit:
>> +#ifdef CONFIG_DEBUG_RSEQ
>> +/* Check whether the syscall is issued inside a restartable sequence */
>> +addir3,r1,STACK_FRAME_OVERHEAD
>> +bl  rseq_syscall
>> +#endif
>>  /*
>>   * Disable interrupts so current_thread_info()->flags can't change,
>>   * and so that we don't get interrupted after loading SRR0/1.
>> --
>> 2.16.2
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-23 Thread Mathieu Desnoyers
- On May 20, 2018, at 10:08 AM, Boqun Feng boqun.f...@gmail.com wrote:

> On Fri, May 18, 2018 at 02:17:17PM -0400, Mathieu Desnoyers wrote:
>> - On May 17, 2018, at 7:50 PM, Boqun Feng boqun.f...@gmail.com wrote:
>> [...]
>> >> > I think you're right. So we have to introduce callsite to rseq_syscall()
>> >> > in syscall path, something like:
>> >> > 
>> >> > diff --git a/arch/powerpc/kernel/entry_64.S 
>> >> > b/arch/powerpc/kernel/entry_64.S
>> >> > index 51695608c68b..a25734a96640 100644
>> >> > --- a/arch/powerpc/kernel/entry_64.S
>> >> > +++ b/arch/powerpc/kernel/entry_64.S
>> >> > @@ -222,6 +222,9 @@ system_call_exit:
>> >> > mtmsrd  r11,1
>> >> > #endif /* CONFIG_PPC_BOOK3E */
>> >> > 
>> >> > +   addir3,r1,STACK_FRAME_OVERHEAD
>> >> > +   bl  rseq_syscall
>> >> > +
>> >> > ld  r9,TI_FLAGS(r12)
>> >> > li  r11,-MAX_ERRNO
>> >> > andi.
>> >> > 
>> >> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>> >> > 
>> 
>> By the way, I think this is not the right spot to call rseq_syscall, because
>> interrupts are disabled. I think we should move this hunk right after
>> system_call_exit.
>> 
> 
> Good point.
> 
>> Would you like to implement and test an updated patch adding those calls for 
>> ppc
>> 32 and 64 ?
>> 
> 
> I'd like to help, but I don't have a handy ppc environment for test...
> So I made the below patch which has only been build-tested, hope it
> could be somewhat helpful.

Hi Boqun,

I tried your patch in a ppc64 le environment, and it does not survive boot
with CONFIG_DEBUG_RSEQ=y. init gets killed right away.

Moreover, I'm not sure that the r3 register doesn't contain something worth
saving before the call on ppc32. Just after there is a "mr" instruction
which AFAIU takes r3 as input register.

Can you look into it ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
> ->8
> Subject: [PATCH] powerpc: Add syscall detection for restartable sequences
> 
> Syscalls are not allowed inside restartable sequences, so add a call to
> rseq_syscall() at the very beginning of system call exiting path for
> CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
> is a syscall issued inside restartable sequences.
> 
> Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
> ---
> arch/powerpc/kernel/entry_32.S | 5 +
> arch/powerpc/kernel/entry_64.S | 5 +
> 2 files changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index eb8d01bae8c6..2f134eebe7ed 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -365,6 +365,11 @@ syscall_dotrace_cont:
>   blrl/* Call handler */
>   .globl  ret_from_syscall
> ret_from_syscall:
> +#ifdef CONFIG_DEBUG_RSEQ
> + /* Check whether the syscall is issued inside a restartable sequence */
> + addir3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> +#endif
>   mr  r6,r3
>   CURRENT_THREAD_INFO(r12, r1)
>   /* disable interrupts so current_thread_info()->flags can't change */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 2cb5109a7ea3..2e2d59bb45d0 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -204,6 +204,11 @@ system_call:     /* label this so stack 
> traces look sane */
>  * This is blacklisted from kprobes further below with _ASM_NOKPROBE_SYMBOL().
>  */
> system_call_exit:
> +#ifdef CONFIG_DEBUG_RSEQ
> + /* Check whether the syscall is issued inside a restartable sequence */
> + addir3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> +#endif
>   /*
>* Disable interrupts so current_thread_info()->flags can't change,
>* and so that we don't get interrupted after loading SRR0/1.
> --
> 2.16.2

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-18 Thread Mathieu Desnoyers
- On May 17, 2018, at 7:50 PM, Boqun Feng boqun.f...@gmail.com wrote:
[...]
>> > I think you're right. So we have to introduce callsite to rseq_syscall()
>> > in syscall path, something like:
>> > 
>> > diff --git a/arch/powerpc/kernel/entry_64.S 
>> > b/arch/powerpc/kernel/entry_64.S
>> > index 51695608c68b..a25734a96640 100644
>> > --- a/arch/powerpc/kernel/entry_64.S
>> > +++ b/arch/powerpc/kernel/entry_64.S
>> > @@ -222,6 +222,9 @@ system_call_exit:
>> >mtmsrd  r11,1
>> > #endif /* CONFIG_PPC_BOOK3E */
>> > 
>> > +  addir3,r1,STACK_FRAME_OVERHEAD
>> > +  bl  rseq_syscall
>> > +
>> >ld  r9,TI_FLAGS(r12)
>> >li  r11,-MAX_ERRNO
>> >andi.
>> >
>> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>> > 

By the way, I think this is not the right spot to call rseq_syscall, because
interrupts are disabled. I think we should move this hunk right after 
system_call_exit.

Would you like to implement and test an updated patch adding those calls for 
ppc 32 and 64 ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-17 Thread Mathieu Desnoyers
- On May 16, 2018, at 9:19 PM, Boqun Feng boqun.f...@gmail.com wrote:

> On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
>> - On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org 
>> wrote:
>> 
>> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
>> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> >> index c32a181a7cbb..ed21a777e8c6 100644
>> >> --- a/arch/powerpc/Kconfig
>> >> +++ b/arch/powerpc/Kconfig
>> >> @@ -223,6 +223,7 @@ config PPC
>> >>   select HAVE_SYSCALL_TRACEPOINTS
>> >>   select HAVE_VIRT_CPU_ACCOUNTING
>> >>   select HAVE_IRQ_TIME_ACCOUNTING
>> >> + select HAVE_RSEQ
>> >>   select IRQ_DOMAIN
>> >>   select IRQ_FORCED_THREADING
>> >>   select MODULES_USE_ELF_RELA
>> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
>> >> index 61db86ecd318..d3bb3aaaf5ac 100644
>> >> --- a/arch/powerpc/kernel/signal.c
>> >> +++ b/arch/powerpc/kernel/signal.c
>> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
>> >>   /* Re-enable the breakpoints for the signal stack */
>> >>   thread_change_pc(tsk, tsk->thread.regs);
>> >>  
>> >> + rseq_signal_deliver(tsk->thread.regs);
>> >> +
>> >>   if (is32) {
>> >>   if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>> >>   ret = handle_rt_signal32(&ksig, oldset, tsk);
>> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned 
>> >> long
>> >> thread_info_flags)
>> >>   if (thread_info_flags & _TIF_NOTIFY_RESUME) {
>> >>   clear_thread_flag(TIF_NOTIFY_RESUME);
>> >>   tracehook_notify_resume(regs);
>> >> + rseq_handle_notify_resume(regs);
>> >>   }
>> >>  
>> >>   user_enter();
>> > 
>> > Again no rseq_syscall().
>> 
>> Same question for PowerPC as for ARM:
>> 
>> Considering that rseq_syscall is implemented as follows:
>> 
>> +void rseq_syscall(struct pt_regs *regs)
>> +{
>> +   unsigned long ip = instruction_pointer(regs);
>> +   struct task_struct *t = current;
>> +   struct rseq_cs rseq_cs;
>> +
>> +   if (!t->rseq)
>> +   return;
>> +   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
>> +   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
>> +   force_sig(SIGSEGV, t);
>> +}
>> 
>> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
>> now used in the fast-path since KPTI), I wonder where we should call
> 
> So we actually detect this after the syscall takes effect, right? I
> wonder whether this could be problematic, because "disallowing syscall"
> in rseq areas may mean the syscall won't take effect to some people, I
> guess?
> 
>> this on PowerPC ?  I was under the impression that PowerPC return to
>> userspace fast-path was not calling C code unless work flags were set,
>> but I might be wrong.
>> 
> 
> I think you're right. So we have to introduce callsite to rseq_syscall()
> in syscall path, something like:
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 51695608c68b..a25734a96640 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -222,6 +222,9 @@ system_call_exit:
>   mtmsrd  r11,1
> #endif /* CONFIG_PPC_BOOK3E */
> 
> + addir3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> +
>   ld  r9,TI_FLAGS(r12)
>   li  r11,-MAX_ERRNO
>   andi.
>   
> r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> 
> But I think it's important for us to first decide where (before or after
> the syscall) we do the detection.

As Peter said, we don't really care whether it's on syscall entry or exit, as
long as the process gets killed when the erroneous use is detected. I think 
doing
it on syscall exit is a bit easier because we can clearly access the userspace
TLS, which AFAIU may be less straightforward on syscall entry.

We may want to add #ifdef CONFIG_DEBUG_RSEQ / #endif around the code you
proposed above, so it's only compiled in if CONFIG_DEBUG_RSEQ=y.

On the ARM leg of the email thread, Will Deacon suggests testing whether
current->rseq is non-NULL before calling rseq_syscall(). I wonder if this
added check is justified at the assembly level, considering that this is just
a debugging option. We already do that check at the very beginning of
rseq_syscall().

Thoughts ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> Thoughts ?
>> 
>> Thanks!
>> 
>> Mathieu
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-16 Thread Mathieu Desnoyers
- On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org wrote:

> On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index c32a181a7cbb..ed21a777e8c6 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -223,6 +223,7 @@ config PPC
>>  select HAVE_SYSCALL_TRACEPOINTS
>>  select HAVE_VIRT_CPU_ACCOUNTING
>>  select HAVE_IRQ_TIME_ACCOUNTING
>> +select HAVE_RSEQ
>>  select IRQ_DOMAIN
>>  select IRQ_FORCED_THREADING
>>  select MODULES_USE_ELF_RELA
>> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
>> index 61db86ecd318..d3bb3aaaf5ac 100644
>> --- a/arch/powerpc/kernel/signal.c
>> +++ b/arch/powerpc/kernel/signal.c
>> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
>>  /* Re-enable the breakpoints for the signal stack */
>>  thread_change_pc(tsk, tsk->thread.regs);
>>  
>> +rseq_signal_deliver(tsk->thread.regs);
>> +
>>  if (is32) {
>>  if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>>  ret = handle_rt_signal32(&ksig, oldset, tsk);
>> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long
>> thread_info_flags)
>>  if (thread_info_flags & _TIF_NOTIFY_RESUME) {
>>  clear_thread_flag(TIF_NOTIFY_RESUME);
>>  tracehook_notify_resume(regs);
>> +rseq_handle_notify_resume(regs);
>>  }
>>  
>>  user_enter();
> 
> Again no rseq_syscall().

Same question for PowerPC as for ARM:

Considering that rseq_syscall is implemented as follows:

+void rseq_syscall(struct pt_regs *regs)
+{
+   unsigned long ip = instruction_pointer(regs);
+   struct task_struct *t = current;
+   struct rseq_cs rseq_cs;
+
+   if (!t->rseq)
+   return;
+   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
+   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
+   force_sig(SIGSEGV, t);
+}

and that x86 calls it from syscall_return_slowpath() (which AFAIU is
now used in the fast-path since KPTI), I wonder where we should call
this on PowerPC ?  I was under the impression that PowerPC return to
userspace fast-path was not calling C code unless work flags were set,
but I might be wrong.

Thoughts ?

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH 08/14] powerpc: Wire up restartable sequences system call

2018-04-30 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within an rseq critical
section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[PATCH 07/14] powerpc: Add support for restartable sequences

2018-04-30 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..ed21a777e8c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0



[RFC PATCH for 4.18 14/23] powerpc: Wire up cpu_opv system call

2018-04-12 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 45d4d37495fd..4131825b5a05 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -393,3 +393,4 @@ SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 1e9708632dce..c19379f0a32e 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls388
+#define NR_syscalls389
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index ac5ba55066dd..f7a221bdb5df 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -399,5 +399,6 @@
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
+#define __NR_cpu_opv   388
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.18 07/23] powerpc: Add support for restartable sequences

2018-04-12 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Perform fixup on the pre-signal when a signal is delivered on top of a
restartable sequence critical section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..90700b6918ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0



[RFC PATCH for 4.18 08/23] powerpc: Wire up restartable sequences system call

2018-04-12 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

TODO: wire up rseq_syscall() on return from system call. It is used with
CONFIG_DEBUG_RSEQ=y to ensure system calls are not issued within an rseq critical
section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.17 12/21] powerpc: Wire up cpu_opv system call

2018-03-27 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 45d4d37495fd..4131825b5a05 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -393,3 +393,4 @@ SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 1e9708632dce..c19379f0a32e 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls388
+#define NR_syscalls389
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index ac5ba55066dd..f7a221bdb5df 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -399,5 +399,6 @@
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
 #define __NR_rseq  387
+#define __NR_cpu_opv   388
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.17 08/21] powerpc: Wire up restartable sequences system call

2018-03-27 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index d61f9c96d916..45d4d37495fd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -392,3 +392,4 @@ SYSCALL(statx)
 SYSCALL(pkey_alloc)
 SYSCALL(pkey_free)
 SYSCALL(pkey_mprotect)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index daf1ba97a00c..1e9708632dce 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define NR_syscalls387
+#define NR_syscalls388
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index 389c36fd8299..ac5ba55066dd 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -398,5 +398,6 @@
 #define __NR_pkey_alloc384
 #define __NR_pkey_free 385
 #define __NR_pkey_mprotect 386
+#define __NR_rseq  387
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.17 07/21] powerpc: Add support for restartable sequences

2018-03-27 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal when a
signal is delivered on top of a restartable sequence critical section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 73ce5dd07642..90700b6918ef 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -223,6 +223,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 61db86ecd318..d3bb3aaaf5ac 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
user_enter();
-- 
2.11.0



Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

2018-02-05 Thread Mathieu Desnoyers
- On Feb 5, 2018, at 3:22 PM, Ingo Molnar mi...@kernel.org wrote:

> * Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> 
>>  
>> +config ARCH_HAS_MEMBARRIER_HOOKS
>> +bool
> 
> Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated 
> it
> through the rest of the patches. "Callback" is the canonical name, and I also
> cringe every time I see 'hook'.
> 
> Please let me know if there are any big objections against this minor cleanup.

No objection at all,

Thanks!

Mathieu

> 
> Thanks,
> 
>   Ingo

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()

2018-01-29 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but which belong to different thread
groups is a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 845fc25812f1..34c1ecd5a5d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8929,6 +8929,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a44734..09b02180b8a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 
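
(The hunk above is truncated in this archive. As a hedged sketch of the
mechanism the changelog describes, not the exact patch content, the function
parameters and field names below are assumptions carried over from the
changelog:)

/* Illustrative only: skip the costly barrier in switch_mm() unless the
 * incoming mm belongs to a process registered for private expedited
 * membarrier. */
static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
					     struct mm_struct *next,
					     struct task_struct *tsk)
{
	if (likely(!(atomic_read(&next->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED)))
		return;

	/*
	 * The membarrier command requires a full memory barrier after
	 * storing to rq->curr, before returning to user-space.
	 */
	smp_mb();
}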

[PATCH 02/11] powerpc: membarrier: Skip memory barrier in switch_mm() (v7)

2018-01-23 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but which belong to different thread
groups is a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
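
A minimal sketch of the switch_mm() hook this describes is included below
for readers skimming the archive. The name of the mm field holding the
membarrier state is an assumption here; the flag test and the barrier
placement follow the description above, not necessarily the exact patch.

static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
                                             struct mm_struct *next,
                                             struct task_struct *tsk)
{
        /*
         * Skip the barrier when the next mm never registered for private
         * expedited membarrier, or when there is no previous user mm
         * (the kernel-thread case, where mmdrop() already implies a
         * full barrier).
         */
        if (likely(!(atomic_read(&next->membarrier_state) &
                     MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
                return;

        /*
         * The membarrier system call requires a full memory barrier
         * after storing to rq->curr, before returning to user-space.
         */
        smp_mb();
}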

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e3581413420c..11ff47c28b12 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8931,6 +8931,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a44734..09b02180b8a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[PATCH for 4.16 02/11] powerpc: membarrier: Skip memory barrier in switch_mm() (v7)

2018-01-17 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
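
The registration ordering this implies looks roughly like the sketch
below. The mm field name is an assumption; the two-flag sequence and the
single-user shortcut come straight from the text above.

static void membarrier_register_private_expedited(void)
{
        struct task_struct *p = current;
        struct mm_struct *mm = p->mm;

        if (atomic_read(&mm->membarrier_state) &
            MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
                return;
        atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
        if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
                /*
                 * Make sure every thread of this mm observes the
                 * PRIVATE_EXPEDITED flag on its next context switch
                 * before commands are allowed to succeed.
                 */
                synchronize_sched();
        }
        atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
                  &mm->membarrier_state);
}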

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 18994806e441..c2f0d9a48a10 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8931,6 +8931,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..a63adb082c0a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[PATCH for 4.16 02/10] powerpc: membarrier: Skip memory barrier in switch_mm() (v7)

2018-01-15 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
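
From user space, the resulting two-step usage can be sketched as follows,
using the command names from the uapi <linux/membarrier.h>; whether the
register command is accepted obviously depends on running a kernel that
carries this series.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, int flags)
{
        return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
        /* Registration arms the switch_mm() hook for this process. */
        if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                return 1;
        /* Expedited barrier across the threads of this process only. */
        return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
}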

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 18994806e441..c2f0d9a48a10 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8931,6 +8931,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..a63adb082c0a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[PATCH v7 for 4.16 03/11] powerpc: membarrier: Skip memory barrier in switch_mm()

2018-01-10 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
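
The two state bits named above can be pictured as in the sketch below;
the actual bit values and header layout may differ, only the protocol is
what the changelog describes.

enum {
        /*
         * Set by membarrier_register_private_expedited() before its
         * synchronize_sched(); tested by membarrier_arch_switch_mm()
         * to decide whether the extra smp_mb() is needed.
         */
        MEMBARRIER_STATE_PRIVATE_EXPEDITED        = (1U << 0),
        /*
         * Set only after that synchronize_sched() has returned; tested
         * by the private expedited command, which fails until it is set.
         */
        MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY  = (1U << 1),
};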

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 95c3fa1f520f..a0ad9fe5735b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8931,6 +8931,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..a63adb082c0a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[RFC PATCH for 4.16 03/11] powerpc: membarrier: Skip memory barrier in switch_mm() (v7)

2017-12-22 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
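
On the command side, the READY gate described above amounts to roughly
the following sketch; helper and field names are assumptions, the -EPERM
behaviour is the "commands fail until registration completes" rule from
the changelog.

static int membarrier_private_expedited(void)
{
        struct mm_struct *mm = current->mm;

        if (!(atomic_read(&mm->membarrier_state) &
              MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
                return -EPERM;
        /*
         * IPI every CPU currently running a thread of this mm so that
         * each of them issues a full memory barrier before returning
         * to user-space, e.g. via smp_call_function_many().
         */
        return 0;
}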

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a6e86e20761e..bd1666217c73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8934,6 +8934,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..a63adb082c0a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[RFC PATCH for 4.16 12/21] powerpc: Wire up cpu_opv system call

2017-12-14 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 964321a5799c..f9cdb896fbaa 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@ COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index e76bd5601ea4..48f80f452e31 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 385
+#define NR_syscalls 386
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index b1980fcd56d5..972a7d68c143 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_kexec_file_load   382
 #define __NR_statx 383
 #define __NR_rseq  384
+#define __NR_cpu_opv   385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0
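
For completeness, a user-space probe for the newly wired number can look
like the sketch below. It only checks whether the kernel wires the
number; the cpu_opv argument list itself is defined elsewhere in this
series, so a NULL operation vector is passed purely to distinguish
ENOSYS from anything else.

#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_cpu_opv
#define __NR_cpu_opv 385  /* number wired by this patch on powerpc */
#endif

int main(void)
{
        /*
         * Note: on kernels without this series, syscall number 385 may
         * belong to a different system call, so treat any non-ENOSYS
         * result as "possibly wired" only.
         */
        long ret = syscall(__NR_cpu_opv, NULL, 0, 0, 0);

        if (ret < 0 && errno == ENOSYS)
                printf("cpu_opv is not wired in this kernel\n");
        else
                printf("cpu_opv appears to be wired (ret=%ld, errno=%d)\n",
                       ret, errno);
        return 0;
}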



[RFC PATCH for 4.16 07/21] powerpc: Add support for restartable sequences

2017-12-14 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Call the rseq_handle_notify_resume() function on return to userspace if
TIF_NOTIFY_RESUME thread flag is set.

Increment the event counter and perform fixup on the pre-signal frame
when a signal is delivered on top of a restartable sequence critical
section.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/Kconfig | 1 +
 arch/powerpc/kernel/signal.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..e9992f80819c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -221,6 +221,7 @@ config PPC
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING
+   select HAVE_RSEQ
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
select MODULES_USE_ELF_RELA
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 3d7539b90010..a7b95f7bcaf1 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
/* Re-enable the breakpoints for the signal stack */
thread_change_pc(tsk, tsk->thread.regs);
 
+   rseq_signal_deliver(tsk->thread.regs);
+
if (is32) {
if (ksig.ka.sa.sa_flags & SA_SIGINFO)
ret = handle_rt_signal32(&ksig, oldset, tsk);
@@ -161,6 +163,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
if (thread_info_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+   rseq_handle_notify_resume(regs);
}
 
if (thread_info_flags & _TIF_PATCH_PENDING)
-- 
2.11.0



[RFC PATCH for 4.16 08/21] powerpc: Wire up restartable sequences system call

2017-12-14 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.
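
Concretely, the fast path this enables looks like the sketch below from
user space. The struct layout, the 4-argument registration call and the
RSEQ_SIG constant follow the rseq uapi that was eventually merged and may
differ in detail from this RFC.

#include <linux/rseq.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_rseq
#define __NR_rseq 384  /* number wired by this patch on powerpc */
#endif

#define RSEQ_SIG 0x53053053  /* abort signature; must match on unregister */

static __thread struct rseq rseq_area __attribute__((aligned(32)));

int main(void)
{
        /*
         * Register this thread's rseq area once. Fails with EBUSY if
         * something (e.g. a recent glibc) already registered one.
         */
        if (syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, RSEQ_SIG))
                return 1;
        /* getcpu fast path: a plain load, no system call. */
        printf("running on cpu %u\n", rseq_area.cpu_id);
        return 0;
}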

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 449912f057f6..964321a5799c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -389,3 +389,4 @@ COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 9ba11dbcaca9..e76bd5601ea4 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 384
+#define NR_syscalls 385
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index df8684f31919..b1980fcd56d5 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -395,5 +395,6 @@
 #define __NR_pwritev2  381
 #define __NR_kexec_file_load   382
 #define __NR_statx 383
+#define __NR_rseq  384
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.15 v7 19/22] powerpc: membarrier: Skip memory barrier in switch_mm()

2017-11-21 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but belonging to different thread
groups are a tricky case. This has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Paul E. McKenney <paul...@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Andrew Hunter <a...@google.com>
CC: Maged Michael <maged.mich...@gmail.com>
CC: Avi Kivity <a...@scylladb.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Dave Watson <davejwat...@fb.com>
CC: Alan Stern <st...@rowland.harvard.edu>
CC: Will Deacon <will.dea...@arm.com>
CC: Andy Lutomirski <l...@kernel.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Alexander Viro <v...@zeniv.linux.org.uk>
CC: Nicholas Piggin <npig...@gmail.com>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 25 +
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 12 +++-
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  9 +
 8 files changed, 57 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index ba9137c1f295..92f460afeaaf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8938,6 +8938,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9992f80819c..d41c6ede0709 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
se

[RFC PATCH for 4.15 12/22] powerpc: Wire up cpu_opv system call

2017-11-21 Thread Mathieu Desnoyers
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Boqun Feng <boqun.f...@gmail.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 964321a5799c..f9cdb896fbaa 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -390,3 +390,4 @@ COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
 SYSCALL(rseq)
+SYSCALL(cpu_opv)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index e76bd5601ea4..48f80f452e31 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 385
+#define NR_syscalls 386
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index b1980fcd56d5..972a7d68c143 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -396,5 +396,6 @@
 #define __NR_kexec_file_load   382
 #define __NR_statx 383
 #define __NR_rseq  384
+#define __NR_cpu_opv   385
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0



[RFC PATCH for 4.15 08/22] powerpc: Wire up restartable sequences system call

2017-11-21 Thread Mathieu Desnoyers
From: Boqun Feng <boqun.f...@gmail.com>

Wire up the rseq system call on powerpc.

This provides an ABI improving the speed of a user-space getcpu
operation on powerpc by skipping the getcpu system call on the fast
path, as well as improving the speed of user-space operations on per-cpu
data compared to using load-reservation/store-conditional atomics.

Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
CC: Benjamin Herrenschmidt <b...@kernel.crashing.org>
CC: Paul Mackerras <pau...@samba.org>
CC: Michael Ellerman <m...@ellerman.id.au>
CC: Peter Zijlstra <pet...@infradead.org>
CC: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
CC: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h 
b/arch/powerpc/include/asm/systbl.h
index 449912f057f6..964321a5799c 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -389,3 +389,4 @@ COMPAT_SYS_SPU(preadv2)
 COMPAT_SYS_SPU(pwritev2)
 SYSCALL(kexec_file_load)
 SYSCALL(statx)
+SYSCALL(rseq)
diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 9ba11dbcaca9..e76bd5601ea4 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define NR_syscalls 384
+#define NR_syscalls 385
 
 #define __NR__exit __NR_exit
 
diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
b/arch/powerpc/include/uapi/asm/unistd.h
index df8684f31919..b1980fcd56d5 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -395,5 +395,6 @@
 #define __NR_pwritev2  381
 #define __NR_kexec_file_load   382
 #define __NR_statx 383
+#define __NR_rseq  384
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
-- 
2.11.0


