Re: [PATCH bpf 5/6] tools: bpftool: resolve calls without using imm field

2018-05-17 Thread Sandipan Das
Hi Jakub,

On 05/18/2018 12:21 AM, Jakub Kicinski wrote:
> On Thu, 17 May 2018 12:05:47 +0530, Sandipan Das wrote:
>> Currently, we resolve the callee's address for a JITed function
>> call by using the imm field of the call instruction as an offset
>> from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
>> use this address to get the callee's kernel symbol's name.
>>
>> For some architectures, such as powerpc64, the imm field is not
>> large enough to hold this offset. So, instead of assigning this
>> offset to the imm field, the verifier now assigns the subprog
>> id. Also, a list of kernel symbol addresses for all the JITed
>> functions is provided in the program info. We now use the imm
>> field as an index for this list to lookup a callee's symbol's
>> address and resolve its name.
>>
>> Suggested-by: Daniel Borkmann 
>> Signed-off-by: Sandipan Das 
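A minimal sketch of the lookup described above (the dump_data fields mirror this patch; the helper and its name-lookup call are hypothetical):

    /* imm now carries an index into the jited_ksyms array from bpf_prog_info */
    static const char *resolve_callee_name(struct dump_data *dd, unsigned int imm)
    {
            unsigned long addr;

            if (imm >= dd->nr_jited_ksyms)
                    return NULL;                        /* malformed program info */

            addr = (unsigned long)dd->jited_ksyms[imm]; /* callee's kernel symbol address */
            return lookup_ksym_name(dd, addr);          /* hypothetical name lookup */
    }
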
> 
> A few nit-picks below, thank you for the patch!
> 
>>  tools/bpf/bpftool/prog.c  | 31 +++
>>  tools/bpf/bpftool/xlated_dumper.c | 24 +---
>>  tools/bpf/bpftool/xlated_dumper.h |  2 ++
>>  3 files changed, 50 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
>> index 9bdfdf2d3fbe..ac2f62a97e84 100644
>> --- a/tools/bpf/bpftool/prog.c
>> +++ b/tools/bpf/bpftool/prog.c
>> @@ -430,6 +430,10 @@ static int do_dump(int argc, char **argv)
>>  unsigned char *buf;
>>  __u32 *member_len;
>>  __u64 *member_ptr;
>> +unsigned int nr_addrs;
>> +unsigned long *addrs = NULL;
>> +__u32 *ksyms_len;
>> +__u64 *ksyms_ptr;
> 
> nit: please try to keep the variables ordered longest to shortest like
> we do in networking code (please do it in all functions).
> 
>>  ssize_t n;
>>  int err;
>>  int fd;
>> @@ -437,6 +441,8 @@ static int do_dump(int argc, char **argv)
>>  if (is_prefix(*argv, "jited")) {
>>  member_len = &info.jited_prog_len;
>>  member_ptr = &info.jited_prog_insns;
>> +ksyms_len = &info.nr_jited_ksyms;
>> +ksyms_ptr = &info.jited_ksyms;
>>  } else if (is_prefix(*argv, "xlated")) {
>>  member_len = &info.xlated_prog_len;
>>  member_ptr = &info.xlated_prog_insns;
>> @@ -496,10 +502,23 @@ static int do_dump(int argc, char **argv)
>>  return -1;
>>  }
>>  
>> +nr_addrs = *ksyms_len;
> 
> Here and ...
> 
>> +if (nr_addrs) {
>> +addrs = malloc(nr_addrs * sizeof(__u64));
>> +if (!addrs) {
>> +p_err("mem alloc failed");
>> +free(buf);
>> +close(fd);
>> +return -1;
> 
> You can just jump to err_free here.
> 
>> +}
>> +}
>> +
>>  memset(&info, 0, sizeof(info));
>>  
>>  *member_ptr = ptr_to_u64(buf);
>>  *member_len = buf_size;
>> +*ksyms_ptr = ptr_to_u64(addrs);
>> +*ksyms_len = nr_addrs;
> 
> ... here - this function is getting long, so maybe I'm not seeing
> something, but are ksyms_ptr and ksyms_len guaranteed to be initialized?
> 
>>  err = bpf_obj_get_info_by_fd(fd, &info, &len);
>>  close(fd);
>> @@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv)
>>  goto err_free;
>>  }
>>  
>> +if (*ksyms_len > nr_addrs) {
>> +p_err("too many addresses returned");
>> +goto err_free;
>> +}
>> +
>>  if ((member_len == &info.jited_prog_len &&
>>   info.jited_prog_insns == 0) ||
>>  (member_len == &info.xlated_prog_len &&
>> @@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv)
>>  dump_xlated_cfg(buf, *member_len);
>>  } else {
>>  kernel_syms_load();
>> +dd.jited_ksyms = ksyms_ptr;
>> +dd.nr_jited_ksyms = *ksyms_len;
>> +
>>  if (json_output)
>>  dump_xlated_json(&dd, buf, *member_len, opcodes);
>>  else
>> @@ -566,10 +593,14 @@ static int do_dump(int argc, char **argv)
>>  }
>>  
>>  free(buf);
>> +if (addrs)
>> +free(addrs);
> 
> Free can deal with NULL pointers, no need for an if.
> 
>>  return 0;
>>  
>>  err_free:
>>  free(buf);
>> +if (addrs)
>> +free(addrs);
>>  return -1;
>>  }
>>  
>> diff --git a/tools/bpf/bpftool/xlated_dumper.c 
>> b/tools/bpf/bpftool/xlated_dumper.c
>> index 7a3173b76c16..dc8e4eca0387 100644
>> --- a/tools/bpf/bpftool/xlated_dumper.c
>> +++ b/tools/bpf/bpftool/xlated_dumper.c
>> @@ -178,8 +178,12 @@ static const char *print_call_pcrel(struct dump_data 
>> *dd,
>>  snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>>   "%+d#%s", insn->off, sym->name);
>>  else
> 
> else if (address)
> 
> saves us the indentation.
> 
>> -snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>> - "%+d#0x%lx", insn->off, address);
>> +if (address)
>> +snprintf(dd->scratch_buff, 

Re: [PATCH 0/3] Add support to disable sensor groups in P9

2018-05-17 Thread Shilpasri G Bhat


On 05/17/2018 06:08 PM, Guenter Roeck wrote:
> On 05/16/2018 11:10 PM, Shilpasri G Bhat wrote:
>>
>>
>> On 05/15/2018 08:32 PM, Guenter Roeck wrote:
>>> On Thu, Mar 22, 2018 at 04:24:32PM +0530, Shilpasri G Bhat wrote:
 This patch series adds support to enable/disable OCC based
 inband-sensor groups at runtime. The environmental sensor groups are
 managed in HWMON and the remaining platform specific sensor groups are
 managed in /sys/firmware/opal.

 The firmware changes required for this patch is posted below:
 https://lists.ozlabs.org/pipermail/skiboot/2018-March/010812.html

>>>
>>> Sorry for not getting back earlier. This is a tough one.
>>>
>>
>> Thanks for the reply. I have tried to answer your questions according to my
>> understanding below:
>>
>>> Key problem is that you are changing the ABI with those new attributes.
>>> On top of that, the attributes _do_ make some sense (many chips support
>>> enabling/disabling of individual sensors), suggesting that those or
>>> similar attributes may or even should at some point be added to the ABI.
>>>
>>> At the same time, returning "0" as measurement values when sensors are
>>> disabled does not seem like a good idea, since "0" is a perfectly valid
>>> measurement, at least for most sensors.
>>
>> I agree.
>>
>>>
>>> Given that, we need to have a discussion about adding _enable attributes to
>>> the ABI
>>
>>> what is the scope,
>> IIUC the scope should be RW and the attribute is defined for each supported
>> sensor group
>>
> 
> That is _your_ need. I am not aware of any other chip where a per-sensor group
> attribute would make sense. The discussion we need has to extend beyond the 
> need
> of a single chip.
> 
> Guenter
> 


Is it okay if the ABI provides for both types of attribute,
power_enable and powerX_enable? And is it okay to decide which type of
attribute to use based on the capability reported by the hwmon chip?


- Shilpa

>>> when should the attributes exist and when not,
>> We control this currently via device-tree
>>
>>> do we want/need power_enable or powerX_enable or both, and so on), and
>> We need power_enable right now
>>
>>> what to return if a sensor is disabled (such as -ENODATA).
>> -ENODATA sounds good.
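A rough standalone sketch of that behaviour (the names below are illustrative, not the ibmpowernv driver's interface):

    #include <errno.h>

    struct sensor {
            int group_enabled;      /* toggled via the proposed *_enable attribute */
            long cached_value;      /* last value read from the hardware */
    };

    /* Read path: a disabled sensor group yields -ENODATA, never a fake 0 */
    static int sensor_read(const struct sensor *s, long *val)
    {
            if (!s->group_enabled)
                    return -ENODATA;

            *val = s->cached_value;
            return 0;
    }
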
>>
>> Thanks and Regards,
>> Shilpa
>>
>>   Once we have an
>>> agreement, we can continue with an implementation.
>>>
>>> Guenter
>>>
 Shilpasri G Bhat (3):
powernv:opal-sensor-groups: Add support to enable sensor groups
hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
powernv: opal-sensor-groups: Add attributes to disable/enable sensors

   .../ABI/testing/sysfs-firmware-opal-sensor-groups  |  34 ++
   Documentation/hwmon/ibmpowernv |  31 -
   arch/powerpc/include/asm/opal-api.h|   4 +-
   arch/powerpc/include/asm/opal.h|   2 +
   .../powerpc/platforms/powernv/opal-sensor-groups.c | 104 
 -
   arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
   drivers/hwmon/ibmpowernv.c | 127
 +++--
   7 files changed, 265 insertions(+), 38 deletions(-)
   create mode 100644
 Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups

 -- 
 1.8.3.1

 -- 
 To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
> 



[PATCH] powerpc: Clear PCR on boot

2018-05-17 Thread Michael Neuling
Clear the PCR on boot to ensure we are not running in a compat mode.

We've seen this cause problems when a crash (and kdump) occurs while
running compat mode guests. The kdump kernel then runs with the PCR
set and causes problems. The symptom in the kdump kernel (also seen in
petitboot after fast-reboot) is early userspace programs taking
sigills on newer instructions (seen in libc).

Signed-off-by: Michael Neuling 
Cc: 
---
 arch/powerpc/kernel/cpu_setup_power.S | 6 ++
 arch/powerpc/kernel/dt_cpu_ftrs.c | 1 +
 2 files changed, 7 insertions(+)

diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
b/arch/powerpc/kernel/cpu_setup_power.S
index 3f30c994e9..458b928dbd 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -28,6 +28,7 @@ _GLOBAL(__setup_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
bl  __init_LPCR_ISA206
@@ -41,6 +42,7 @@ _GLOBAL(__restore_cpu_power7)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
li  r4,(LPCR_LPES1 >> LPCR_LPES_SH)
bl  __init_LPCR_ISA206
@@ -57,6 +59,7 @@ _GLOBAL(__setup_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
li  r4,0 /* LPES = 0 */
@@ -78,6 +81,7 @@ _GLOBAL(__restore_cpu_power8)
beqlr
li  r0,0
mtspr   SPRN_LPID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
ori r3, r3, LPCR_PECEDH
li  r4,0 /* LPES = 0 */
@@ -99,6 +103,7 @@ _GLOBAL(__setup_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE  | 
LPCR_HEIC)
or  r3, r3, r4
@@ -123,6 +128,7 @@ _GLOBAL(__restore_cpu_power9)
mtspr   SPRN_PSSCR,r0
mtspr   SPRN_LPID,r0
mtspr   SPRN_PID,r0
+   mtspr   SPRN_PCR,r0
mfspr   r3,SPRN_LPCR
LOAD_REG_IMMEDIATE(r4, LPCR_PECEDH | LPCR_PECE_HVEE | LPCR_HVICE | 
LPCR_HEIC)
or  r3, r3, r4
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 8ab51f6ca0..c904477aba 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -101,6 +101,7 @@ static void __restore_cpu_cpufeatures(void)
if (hv_mode) {
mtspr(SPRN_LPID, 0);
mtspr(SPRN_HFSCR, system_registers.hfscr);
+   mtspr(SPRN_PCR, 0);
}
mtspr(SPRN_FSCR, system_registers.fscr);
 
-- 
2.14.1



[PATCH RESEND] powerpc/lib: Fix "integer constant is too large" build failure

2018-05-17 Thread Finn Thain
My powerpc-linux-gnu-gcc v4.4.5 compiler can't build a 32-bit kernel
any more:

arch/powerpc/lib/sstep.c: In function 'do_popcnt':
arch/powerpc/lib/sstep.c:1068: error: integer constant is too large for 'long' 
type
arch/powerpc/lib/sstep.c:1069: error: integer constant is too large for 'long' 
type
arch/powerpc/lib/sstep.c:1069: error: integer constant is too large for 'long' 
type
arch/powerpc/lib/sstep.c:1070: error: integer constant is too large for 'long' 
type
arch/powerpc/lib/sstep.c:1079: error: integer constant is too large for 'long' 
type
arch/powerpc/lib/sstep.c: In function 'do_prty':
arch/powerpc/lib/sstep.c:1117: error: integer constant is too large for 'long' 
type

This file gets compiled with -std=gnu89 which means a constant can be
given the type 'long' even if it won't fit. Fix the errors with a 'ULL'
suffix on the relevant constants.
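
A minimal illustration of the failure mode (not taken from the patch itself):

    /* With -std=gnu89 on a 32-bit target, gcc 4.4 rejects an unsuffixed
     * constant wider than 'long', as in the build log above; an explicit
     * ULL suffix gives the constant the intended 64-bit type. */
    unsigned long long bad  = 0x5555555555555555;    /* "integer constant is too large for 'long' type" */
    unsigned long long good = 0x5555555555555555ULL; /* accepted under gnu89 and gnu99 */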

Fixes: 2c979c489fee ("powerpc/lib/sstep: Add prty instruction emulation")
Fixes: dcbd19b48d31 ("powerpc/lib/sstep: Add popcnt instruction emulation")
Signed-off-by: Finn Thain 
---
This change was compile tested but not regression tested.
---
I'm re-sending this because the first submission didn't show up on
patchwork.ozlabs.org, apparently because I sent it without any
message-id header. Sorry about that.
---
 arch/powerpc/lib/sstep.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 34d68f1b1b40..49427a3ee104 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1065,9 +1065,10 @@ static nokprobe_inline void do_popcnt(const struct 
pt_regs *regs,
 {
unsigned long long out = v1;
 
-   out -= (out >> 1) & 0x5555555555555555;
-   out = (0x3333333333333333 & out) + (0x3333333333333333 & (out >> 2));
-   out = (out + (out >> 4)) & 0x0f0f0f0f0f0f0f0f;
+   out -= (out >> 1) & 0x5555555555555555ULL;
+   out = (0x3333333333333333ULL & out) +
+ (0x3333333333333333ULL & (out >> 2));
+   out = (out + (out >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
 
if (size == 8) {/* popcntb */
op->val = out;
@@ -1076,7 +1077,7 @@ static nokprobe_inline void do_popcnt(const struct 
pt_regs *regs,
out += out >> 8;
out += out >> 16;
if (size == 32) {   /* popcntw */
-   op->val = out & 0x0000003f0000003f;
+   op->val = out & 0x0000003f0000003fULL;
return;
}
 
@@ -1114,7 +1115,7 @@ static nokprobe_inline void do_prty(const struct pt_regs 
*regs,
 
res ^= res >> 16;
if (size == 32) {   /* prtyw */
-   op->val = res & 0x0000000100000001;
+   op->val = res & 0x0000000100000001ULL;
return;
}
 
-- 
2.16.1



Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-17 Thread Boqun Feng


On Thu, May 17, 2018, at 11:28 PM, Mathieu Desnoyers wrote:
> - On May 16, 2018, at 9:19 PM, Boqun Feng boqun.f...@gmail.com wrote:
> 
> > On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
> >> - On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org 
> >> wrote:
> >> 
> >> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
> >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> >> >> index c32a181a7cbb..ed21a777e8c6 100644
> >> >> --- a/arch/powerpc/Kconfig
> >> >> +++ b/arch/powerpc/Kconfig
> >> >> @@ -223,6 +223,7 @@ config PPC
> >> >> select HAVE_SYSCALL_TRACEPOINTS
> >> >> select HAVE_VIRT_CPU_ACCOUNTING
> >> >> select HAVE_IRQ_TIME_ACCOUNTING
> >> >> +   select HAVE_RSEQ
> >> >> select IRQ_DOMAIN
> >> >> select IRQ_FORCED_THREADING
> >> >> select MODULES_USE_ELF_RELA
> >> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
> >> >> index 61db86ecd318..d3bb3aaaf5ac 100644
> >> >> --- a/arch/powerpc/kernel/signal.c
> >> >> +++ b/arch/powerpc/kernel/signal.c
> >> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
> >> >> /* Re-enable the breakpoints for the signal stack */
> >> >> thread_change_pc(tsk, tsk->thread.regs);
> >> >>  
> >> >> +   rseq_signal_deliver(tsk->thread.regs);
> >> >> +
> >> >> if (is32) {
> >> >> if (ksig.ka.sa.sa_flags & SA_SIGINFO)
> >> >> ret = handle_rt_signal32(, oldset, tsk);
> >> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, 
> >> >> unsigned long
> >> >> thread_info_flags)
> >> >> if (thread_info_flags & _TIF_NOTIFY_RESUME) {
> >> >> clear_thread_flag(TIF_NOTIFY_RESUME);
> >> >> tracehook_notify_resume(regs);
> >> >> +   rseq_handle_notify_resume(regs);
> >> >> }
> >> >>  
> >> >> user_enter();
> >> > 
> >> > Again no rseq_syscall().
> >> 
> >> Same question for PowerPC as for ARM:
> >> 
> >> Considering that rseq_syscall is implemented as follows:
> >> 
> >> +void rseq_syscall(struct pt_regs *regs)
> >> +{
> >> +   unsigned long ip = instruction_pointer(regs);
> >> +   struct task_struct *t = current;
> >> +   struct rseq_cs rseq_cs;
> >> +
> >> +   if (!t->rseq)
> >> +   return;
> >> +   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
> >> +   rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> >> +   force_sig(SIGSEGV, t);
> >> +}
> >> 
> >> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
> >> now used in the fast-path since KPTI), I wonder where we should call
> > 
> > So we actually detect this after the syscall takes effect, right? I
> > wonder whether this could be problematic, because "disallowing syscall"
> > in rseq areas may mean the syscall won't take effect for some people, I
> > guess?
> > 
> >> this on PowerPC ?  I was under the impression that PowerPC return to
> >> userspace fast-path was not calling C code unless work flags were set,
> >> but I might be wrong.
> >> 
> > 
> > I think you're right. So we have to introduce callsite to rseq_syscall()
> > in syscall path, something like:
> > 
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 51695608c68b..a25734a96640 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -222,6 +222,9 @@ system_call_exit:
> > mtmsrd  r11,1
> > #endif /* CONFIG_PPC_BOOK3E */
> > 
> > +   addi    r3,r1,STACK_FRAME_OVERHEAD
> > +   bl  rseq_syscall
> > +
> > ld  r9,TI_FLAGS(r12)
> > li  r11,-MAX_ERRNO
> > andi.
> > 
> > r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> > 
> > But I think it's important for us to first decide where (before or after
> > the syscall) we do the detection.
> 
> As Peter said, we don't really care whether it's on syscall entry or 
> exit, as
> long as the process gets killed when the erroneous use is detected. I 
> think doing
> it on syscall exit is a bit easier because we can clearly access the 
> userspace
> TLS, which AFAIU may be less straightforward on syscall entry.
>

Fair enough.
 
> We may want to add #ifdef CONFIG_DEBUG_RSEQ / #endif around the code you
> proposed above, so it's only compiled in if CONFIG_DEBUG_RSEQ=y.
> 

OK.

> On the ARM leg of the email thread, Will Deacon suggests to test whether 
> current->rseq
> is non-NULL before calling rseq_syscall(). I wonder if this added check 
> is justified
> as the assembly level, considering that this is just a debugging option. 
> We already do
> that check at the very beginning of rseq_syscall().
> 

Yes, I think it's better to do the check in rseq_syscall(), leaving the asm
code a bit cleaner.

Regards,
Boqun

> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> > 
> > Regards,
> > Boqun
> > 
> 

Re: [PATCH] powerpc: Ensure gcc doesn't move around cache flushing in __patch_instruction

2018-05-17 Thread Segher Boessenkool
On Fri, May 18, 2018 at 08:30:27AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2018-05-17 at 14:23 -0500, Segher Boessenkool wrote:
> > On Thu, May 17, 2018 at 01:06:10PM +1000, Benjamin Herrenschmidt wrote:
> > > The current asm statement in __patch_instruction() for the cache flushes
> > > doesn't have a "volatile" statement and no memory clobber. That means
> > > gcc can potentially move it around (or move the store done by put_user
> > > past the flush).
> > 
> > volatile is completely superfluous here, except maybe as documentation:
> > any asm without outputs is always volatile.
> 
> I wasn't aware of that. I was drilled early on to always stick volatile
> in my asm statements if they have any form of side effect :-)

If an asm without output was not marked automatically as having another
side effect, every such asm would be immediately deleted ;-)

Adding volatile as documentation for side effects can be good; it just
doesn't do much (nothing, in fact) for asms without output as far as
the compiler is concerned.

> > (And the memory clobber does not prevent the compiler from moving the
> > asm around, or duplicating it, etc., and neither does the volatile).
> 
> It prevents load/stores from moving around doesn't it ? I wanted to
> make sure the store of the instruction doesn't move in/pass the asm. If
> you say that's not needed then ignore the patch.

No, it's fine here, and you want either that or put exactly the memory
you are touching in a constraint (probably overkill here).  I just
wanted to say that a "memory" clobber does nothing more than say the
asm touches some unspecified memory; there is no magic other meaning
to it.  Your patch is correct, just the "volatile" part isn't needed,
and the explanation was a bit cargo-culty ;-)
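
A toy sketch of the pattern being discussed (not __patch_instruction() itself; the flush sequence is paraphrased): the asm has no outputs, so it is implicitly volatile, and the "memory" clobber tells gcc the asm may touch arbitrary memory, which is what keeps the store to *p from being reordered past the flush.

    static inline void store_then_flush(unsigned int *p, unsigned int insn)
    {
            *p = insn;                              /* the instruction being patched in */
            asm volatile("dcbst 0,%0; sync; icbi 0,%0; sync; isync"
                         : : "r" (p) : "memory");   /* clobber orders the store vs. the flush */
    }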


Segher


[RFC v3 4/4] powerpc/hotplug/drcinfo: Improve code for ibm,drc-info device processing

2018-05-17 Thread Michael Bringmann
This patch extends the use of a common parse function for the
ibm,drc-info property that can be modified by a callback function
to the hotplug device processing.  Candidate code is replaced by
a call to the parser, including a pointer to a local context-specific
function and local data.

In addition, the original set missed several opportunities to compress
and reuse common code which this patch attempts to provide.

Finally, a bug with the registration of slots was observed on some
systems, and the code was rewritten to prevent its reoccurrence.

Signed-off-by: Michael Bringmann 
Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info
firmware feature" -- end of patch series applied to powerpc next)
---
Changes in V3:
  -- Update code to account for latest kernel checkins.
  -- Fix bug searching for virtual device slots.
  -- Rebased to 4.17-rc5 kernel
---
 drivers/pci/hotplug/rpaphp_core.c |  188 ++---
 1 file changed, 130 insertions(+), 58 deletions(-)

diff --git a/drivers/pci/hotplug/rpaphp_core.c 
b/drivers/pci/hotplug/rpaphp_core.c
index dccdf62..974147a 100644
--- a/drivers/pci/hotplug/rpaphp_core.c
+++ b/drivers/pci/hotplug/rpaphp_core.c
@@ -222,49 +222,52 @@ static int rpaphp_check_drc_props_v1(struct device_node 
*dn, char *drc_name,
return -EINVAL;
 }
 
-static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name,
-   char *drc_type, unsigned int my_index)
+struct check_drc_props_v2_struct {
+   char *drc_name;
+   char *drc_type;
+unsigned int my_index;
+};
+
+static int check_drc_props_v2_checkRun(struct of_drc_info *drc,
+void *idata, void *not_used,
+   int *ret_code)
 {
-   struct property *info;
-   unsigned int entries;
-   struct of_drc_info drc;
-   const __be32 *value;
+   struct check_drc_props_v2_struct *cdata = idata;
char cell_drc_name[MAX_DRC_NAME_LEN];
-   int j, fndit;
-
-   info = of_find_property(dn->parent, "ibm,drc-info", NULL);
-   if (info == NULL)
-   return -EINVAL;
-
-   value = of_prop_next_u32(info, NULL, &entries);
-   if (!value)
-   return -EINVAL;
-   value++;
-
-   for (j = 0; j < entries; j++) {
-   of_read_drc_info_cell(&info, &value, &drc);
 
-   /* Should now know end of current entry */
+   (*ret_code) = -EINVAL;
 
-   if (my_index > drc.last_drc_index)
-   continue;
+   if (cdata->my_index > drc->last_drc_index)
+   return 0;
 
-   fndit = 1;
-   break;
+   /* Found drc_index.  Now match the rest. */
+   sprintf(cell_drc_name, "%s%d", drc->drc_name_prefix, 
+   cdata->my_index - drc->drc_index_start +
+   drc->drc_name_suffix_start);
+
+   if (((cdata->drc_name == NULL) ||
+(cdata->drc_name && !strcmp(cdata->drc_name, cell_drc_name))) &&
+   ((cdata->drc_type == NULL) ||
+(cdata->drc_type && !strcmp(cdata->drc_type, drc->drc_type)))) {
+   (*ret_code) = 0;
+   return 1;
}
-   /* Found it */
 
-   if (fndit)
-   sprintf(cell_drc_name, "%s%d", drc.drc_name_prefix, 
-   my_index);
+return 0;
+}
 
-   if (((drc_name == NULL) ||
-(drc_name && !strcmp(drc_name, cell_drc_name))) &&
-   ((drc_type == NULL) ||
-(drc_type && !strcmp(drc_type, drc.drc_type))))
-   return 0;
+static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name,
+   char *drc_type, unsigned int my_index)
+{
+   struct device_node *root = dn;
+   struct check_drc_props_v2_struct cdata = {
+   drc_name, drc_type, be32_to_cpu(my_index) };
 
-   return -EINVAL;
+   if (!drc_type || (drc_type && strcmp(drc_type, "SLOT")))
+   root = dn->parent;
+
+   return drc_info_parser(root, check_drc_props_v2_checkRun,
+   drc_type, &cdata);
 }
 
 int rpaphp_check_drc_props(struct device_node *dn, char *drc_name,
@@ -287,7 +290,6 @@ int rpaphp_check_drc_props(struct device_node *dn, char 
*drc_name,
 }
 EXPORT_SYMBOL_GPL(rpaphp_check_drc_props);
 
-
 static int is_php_type(char *drc_type)
 {
unsigned long value;
@@ -347,17 +349,40 @@ static int is_php_dn(struct device_node *dn, const int 
**indexes,
  *
  * To remove a slot, it suffices to call rpaphp_deregister_slot().
  */
-int rpaphp_add_slot(struct device_node *dn)
+
+static int rpaphp_add_slot_common(struct device_node *dn,
+   u32 drc_index, char *drc_name, char *drc_type,
+   u32 drc_power_domain)
 {
struct slot *slot;
int retval = 0;
-   int i;
+
+   slot = alloc_slot_struct(dn, drc_index, drc_name,
+

[RFC v3 3/4] powerpc/hotplug/drcinfo: Fix hot-add CPU issues

2018-05-17 Thread Michael Bringmann
This patch applies a common parse function for the ibm,drc-info
property that can be modified by a callback function to the
hot-add CPU code.  Candidate code is replaced by a call to the
parser, including a pointer to a local context-specific function,
and local data.

In addition, a bug in the release of the previous patch set may
break things in some of the CPU DLPAR operations.  For instance,
when attempting to hot-add a new CPU or set of CPUs, the original
patch failed to always properly calculate the available resources,
and aborted the operation.

Signed-off-by: Michael Bringmann 
Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info
firmware feature" -- end of patch series applied to powerpc next)
---
Changes in V3:
  -- Update code to account for latest kernel checkins.
  -- Rebased to 4.17-rc5 kernel
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c|  129 +--
 arch/powerpc/platforms/pseries/pseries_energy.c |  112 ++--
 2 files changed, 154 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 6ef77ca..a408217 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -411,25 +411,67 @@ static bool dlpar_cpu_exists(struct device_node *parent, 
u32 drc_index)
return found;
 }
 
-static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
+static bool check_cpu_drc_index(struct device_node *parent,
+   int (*checkRun)(struct of_drc_info *drc,
+   void *data,
+   void *not_used,
+   int *ret_code),
+   void *cdata)
 {
-   bool found = false;
-   int rc, index;
+   int found = 0;
+
+   if (firmware_has_feature(FW_FEATURE_DRC_INFO)) {
+   found = drc_info_parser(parent, checkRun, "CPU", cdata);
+   } else {
+   int rc, index = 0;
 
-   index = 0;
-   while (!found) {
-   u32 drc;
+   while (!found) {
+   u32 drc;
 
-   rc = of_property_read_u32_index(parent, "ibm,drc-indexes",
+   rc = of_property_read_u32_index(parent,
+   "ibm,drc-indexes",
index++, &drc);
-   if (rc)
-   break;
+   if (rc)
+   break;
+   found = checkRun(NULL, cdata, &drc, NULL);
+   }
+   }
 
-   if (drc == drc_index)
-   found = true;
+   return (bool)found;
+}
+
+struct valid_cpu_drc_index_struct {
+   u32 targ_drc_index;
+};
+
+static int valid_cpu_drc_index_checkRun(struct of_drc_info *drc,
+   void *idata,
+   void *drc_index,
+   int *ret_code)
+{
+   struct valid_cpu_drc_index_struct *cdata = idata;
+
+   if (drc) {
+   if ((drc->drc_index_start <= cdata->targ_drc_index) &&
+   (cdata->targ_drc_index <= drc->last_drc_index)) {
+   (*ret_code) = 1;
+   return 1;
+   }
+   } else {
+   if (*((u32*)drc_index) == cdata->targ_drc_index) {
+   (*ret_code) = 1;
+   return 1;
+   }
}
+   return 0;
+}
 
-   return found;
+static bool valid_cpu_drc_index(struct device_node *parent, u32 drc_index)
+{
+   struct valid_cpu_drc_index_struct cdata = { drc_index };
+
+   return check_cpu_drc_index(parent, valid_cpu_drc_index_checkRun,
+   &cdata);
 }
 
 static ssize_t dlpar_cpu_add(u32 drc_index)
@@ -721,11 +763,45 @@ static int dlpar_cpu_remove_by_count(u32 cpus_to_remove)
return rc;
 }
 
+struct find_dlpar_cpus_to_add_struct {
+   struct device_node *parent;
+   u32 *cpu_drcs;
+   u32 cpus_to_add;
+   u32 cpus_found;
+};
+
+static int find_dlpar_cpus_to_add_checkRun(struct of_drc_info *drc,
+   void *idata,
+   void *drc_index,
+   int *ret_code)
+{
+   struct find_dlpar_cpus_to_add_struct *cdata = idata;
+
+   if (drc) {
+   int k;
+
+   for (k = 0; (k < drc->num_sequential_elems) &&
+   (cdata->cpus_found < cdata->cpus_to_add); k++) {
+   u32 idrc = drc->drc_index_start +
+   (k * drc->sequential_inc);
+
+   if (dlpar_cpu_exists(cdata->parent, idrc))
+   

[RFC v3 2/4] powerpc/hotplug/drcinfo: Provide common parser for ibm,drc-info

2018-05-17 Thread Michael Bringmann
This patch provides a common parse function for the ibm,drc-info
property that can be modified by a callback function.  The caller
provides a pointer to the function and a pointer to their unique
data, and the parser provides the current lmb set from the struct.
The callback function may return codes indicating that the parsing
is complete, or should continue, along with an error code that may
be returned to the caller.
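
A rough usage sketch of that interface (the callback below is hypothetical, but follows the usercb prototype added by this patch):

    /* Count how many DRCs of the requested type the property describes.
     * Returning non-zero stops the walk; *ret_code is handed back to the caller. */
    static int count_drcs_cb(struct of_drc_info *drc, void *data,
                             void *optional_data, int *ret_code)
    {
            unsigned int *total = data;

            *total += drc->num_sequential_elems;
            *ret_code = 0;
            return 0;               /* keep iterating over the remaining entries */
    }

    /* ... unsigned int total = 0; drc_info_parser(dn, count_drcs_cb, "CPU", &total); */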

Signed-off-by: Michael Bringmann 
Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info
firmware feature" -- end of patch series applied to powerpc next)
---
Changes in V3:
  -- Update code to account for latest kernel checkins.
  -- Rebased to 4.17-rc5 kernel
---
 arch/powerpc/include/asm/prom.h |7 +++
 arch/powerpc/platforms/pseries/Makefile |2 -
 arch/powerpc/platforms/pseries/drchelpers.c |   66 +++
 3 files changed, 74 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/pseries/drchelpers.c

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index b04c5ce..2e947b3 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -94,6 +94,13 @@ struct of_drc_info {
 extern int of_read_drc_info_cell(struct property **prop,
const __be32 **curval, struct of_drc_info *data);
 
+extern int drc_info_parser(struct device_node *dn,
+   int (*usercb)(struct of_drc_info *drc,
+   void *data,
+   void *optional_data,
+   int *ret_code),
+   char *opt_drc_type,
+   void *data);
 
 /*
  * There are two methods for telling firmware what our capabilities are.
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 13eede6..38c8547 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -3,7 +3,7 @@ ccflags-$(CONFIG_PPC64) := $(NO_MINIMAL_TOC)
 ccflags-$(CONFIG_PPC_PSERIES_DEBUG)+= -DDEBUG
 
 obj-y  := lpar.o hvCall.o nvram.o reconfig.o \
-  of_helpers.o \
+  of_helpers.o drchelpers.o \
   setup.o iommu.o event_sources.o ras.o \
   firmware.o power.o dlpar.o mobility.o rng.o \
   pci.o pci_dlpar.o eeh_pseries.o msi.o
diff --git a/arch/powerpc/platforms/pseries/drchelpers.c 
b/arch/powerpc/platforms/pseries/drchelpers.c
new file mode 100644
index 000..556e05d
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/drchelpers.c
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2018 Michael Bringmann , IBM
+ *
+ * pSeries specific routines for device-tree properties.
+ * 
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ * 
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include "pseries.h"
+
+#define MAX_DRC_NAME_LEN 64
+
+int drc_info_parser(struct device_node *dn,
+   int (*usercb)(struct of_drc_info *drc,
+   void *data,
+   void *optional_data,
+   int *ret_code),
+   char *opt_drc_type,
+   void *data)
+{
+   struct property *info;
+   unsigned int entries;
+   struct of_drc_info drc;
+   const __be32 *value;
+   int j, done = 0, ret_code = -EINVAL;
+
+   info = of_find_property(dn, "ibm,drc-info", NULL);
+   if (info == NULL)
+   return -EINVAL;
+
+   value = of_prop_next_u32(info, NULL, &entries);
+   if (!value)
+   return -EINVAL;
+   value++;
+
+   for (j = 0, done = 0; (j < entries) && (!done); j++) {
+   of_read_drc_info_cell(&info, &value, &drc);
+
+   if (opt_drc_type && strcmp(opt_drc_type, drc.drc_type))
+   continue;
+
+   done = usercb(&drc, data, NULL, &ret_code);
+   }
+
+   return ret_code;
+}
+EXPORT_SYMBOL(drc_info_parser);



[RFC v3 1/4] powerpc/hotplug/drcinfo: Fix bugs parsing ibm,drc-info structs

2018-05-17 Thread Michael Bringmann
[Replace/withdraw previous patch submission to ensure that testing
of related patches on similar hardware progresses together.]

This patch fixes a memory parsing bug when using of_prop_next_u32
calls at the start of a structure.  Depending upon the value of
"cur" memory pointer argument to of_prop_next_u32, it will or it
won't advance the value of the returned memory pointer by the
size of one u32.  This patch corrects the code to deal with that
indexing feature when parsing the ibm,drc-info structs for CPUs.
Also, we need to advance the pointer at the end of of_read_drc_info_cell
for the same reason.
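
A small standalone model of the semantics being corrected (simplified; not the kernel's of_prop_next_u32() implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns a pointer AT the value just read: starting from cur == NULL it
     * does not step past the first cell, while a non-NULL cur is advanced by
     * one u32 first.  Callers therefore have to bump the returned pointer
     * themselves before parsing the next field, which is what this fix adds. */
    static const uint32_t *next_u32(const uint32_t *prop, size_t len,
                                    const uint32_t *cur, uint32_t *out)
    {
            const uint32_t *p = cur ? cur + 1 : prop;

            if (p >= prop + len)
                    return NULL;
            *out = *p;              /* byte-order conversion omitted in this model */
            return p;
    }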

Signed-off-by: Michael Bringmann 
Fixes: 3f38000eda48 ("powerpc/firmware: Add definitions for new drc-info 
firmware feature" -- end of patch series applied to powerpc next)
---
Changes in V3:
  -- Rebased patch to 4.17-rc5 kernel
---
 arch/powerpc/platforms/pseries/of_helpers.c |5 ++---
 arch/powerpc/platforms/pseries/pseries_energy.c |2 ++
 drivers/pci/hotplug/rpaphp_core.c   |1 +
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
b/arch/powerpc/platforms/pseries/of_helpers.c
index 6df192f..20598b2 100644
--- a/arch/powerpc/platforms/pseries/of_helpers.c
+++ b/arch/powerpc/platforms/pseries/of_helpers.c
@@ -65,9 +65,7 @@ int of_read_drc_info_cell(struct property **prop, const 
__be32 **curval,
 
/* Get drc-index-start:encode-int */
p2 = (const __be32 *)p;
-   p2 = of_prop_next_u32(*prop, p2, &data->drc_index_start);
-   if (!p2)
-   return -EINVAL;
+   data->drc_index_start = of_read_number(p2, 1);
 
/* Get drc-name-suffix-start:encode-int */
p2 = of_prop_next_u32(*prop, p2, &data->drc_name_suffix_start);
@@ -88,6 +86,7 @@ int of_read_drc_info_cell(struct property **prop, const 
__be32 **curval,
p2 = of_prop_next_u32(*prop, p2, &data->drc_power_domain);
if (!p2)
return -EINVAL;
+   p2++;
 
/* Should now know end of current entry */
(*curval) = (void *)p2;
diff --git a/arch/powerpc/platforms/pseries/pseries_energy.c 
b/arch/powerpc/platforms/pseries/pseries_energy.c
index 6ed2212..c7d84aa 100644
--- a/arch/powerpc/platforms/pseries/pseries_energy.c
+++ b/arch/powerpc/platforms/pseries/pseries_energy.c
@@ -64,6 +64,7 @@ static u32 cpu_to_drc_index(int cpu)
value = of_prop_next_u32(info, NULL, &num_set_entries);
if (!value)
goto err_of_node_put;
+   value++;
 
for (j = 0; j < num_set_entries; j++) {
 
@@ -126,6 +127,7 @@ static int drc_index_to_cpu(u32 drc_index)
value = of_prop_next_u32(info, NULL, &num_set_entries);
if (!value)
goto err_of_node_put;
+   value++;
 
for (j = 0; j < num_set_entries; j++) {
 
diff --git a/drivers/pci/hotplug/rpaphp_core.c 
b/drivers/pci/hotplug/rpaphp_core.c
index fb5e084..dccdf62 100644
--- a/drivers/pci/hotplug/rpaphp_core.c
+++ b/drivers/pci/hotplug/rpaphp_core.c
@@ -239,6 +239,7 @@ static int rpaphp_check_drc_props_v2(struct device_node 
*dn, char *drc_name,
value = of_prop_next_u32(info, NULL, &entries);
if (!value)
return -EINVAL;
+   value++;
 
for (j = 0; j < entries; j++) {
of_read_drc_info_cell(&info, &value, &drc);



[RFC v3 0/4] powerpc/drcinfo: Fix bugs 'ibm,drc-info' property

2018-05-17 Thread Michael Bringmann
This patch set corrects some errors and omissions in the previous
set of patches adding support for the "ibm,drc-info" property to
powerpc systems.

Unfortunately, some errors in the previous patch set break things
in some of the DLPAR operations.  In particular when attempting to
hot-add a new CPU or set of CPUs, the original patch failed to
properly calculate the available resources, and aborted the operation.
In addition, the original set missed several opportunities to compress
and reuse common code, especially, in the area of device processing.

Signed-off-by: Michael W. Bringmann 
---
Changes in V3:
  -- Update code for latest kernel checkins.
  -- Fix bug with virtual devices.
  -- Rebase on top of 4.17-rc5 kernel



Re: [PATCH] powerpc: Ensure gcc doesn't move around cache flushing in __patch_instruction

2018-05-17 Thread Benjamin Herrenschmidt
On Thu, 2018-05-17 at 14:23 -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, May 17, 2018 at 01:06:10PM +1000, Benjamin Herrenschmidt wrote:
> > The current asm statement in __patch_instruction() for the cache flushes
> > doesn't have a "volatile" statement and no memory clobber. That means
> > gcc can potentially move it around (or move the store done by put_user
> > past the flush).
> 
> volatile is completely superfluous here, except maybe as documentation:
> any asm without outputs is always volatile.

I wasn't aware of that. I was drilled early on to always stick volatile
in my asm statements if they have any form of side effect :-)

> (And the memory clobber does not prevent the compiler from moving the
> asm around, or duplicating it, etc., and neither does the volatile).

It prevents load/stores from moving around doesn't it ? I wanted to
make sure the store of the instruction doesn't move in/pass the asm. If
you say that's not needed then ignore the patch.

Cheers,
Ben.
 
> 
> Segher


[RFC v4 3/3] powerpc migration/memory: Associativity & memory updates

2018-05-17 Thread Michael Bringmann
powerpc migration/memory: This patch adds more recognition for changes
to the associativity of memory blocks described by the device-tree
properties and updates local and general kernel data structures to
reflect those changes.  These differences may include:

* Evaluating 'ibm,dynamic-memory' properties when processing the
  topology of LPARS in Post Migration events.  Previous efforts
  only recognized whether a memory block's assignment had changed
  in the property.  Changes here include checking the aa_index
  values for each drc_index of the old/new LMBs and to 'readd'
  any block for which the setting has changed.

* In an LPAR migration scenario, the "ibm,associativity-lookup-arrays"
  property may change.  In the event that a row of the array differs,
  locate all assigned memory blocks with that 'aa_index' and 're-add'
  them to the system memory block data structures.  In the process of
  the 're-add', the system routines will update the corresponding entry
  for the memory in the LMB structures and any other relevant kernel
  data structures.

* Extend the previous work for the 'ibm,associativity-lookup-array'
  and 'ibm,dynamic-memory' properties to support the property
  'ibm,dynamic-memory-v2' by means of the DRMEM LMB interpretation
  code.

Signed-off-by: Michael Bringmann 
---
Changes in RFC:
  -- Simplify code to update memory nodes during mobility checks.
  -- Reuse code from DRMEM changes to scan for LMBs when updating
 aa_index
  -- Combine common code for properties 'ibm,dynamic-memory' and
 'ibm,dynamic-memory-v2' after integrating DRMEM features.
  -- Rearrange patches to co-locate memory property-related changes.
  -- Use new paired list iterator for the drmem info arrays.
  -- Use direct calls to add/remove memory from the update drconf
 function as those operations are only intended for user DLPAR
 ops, and should not occur during Migration reconfig notifier
 changes.
  -- Correct processing bug in processing of ibm,associativity-lookup-arrays
  -- Rebase to 4.17-rc5 kernel
---
 arch/powerpc/platforms/pseries/hotplug-memory.c |  172 +++
 1 file changed, 139 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f5..15c6f74 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -994,13 +994,29 @@ static int pseries_add_mem_node(struct device_node *np)
return (ret < 0) ? -EINVAL : 0;
 }
 
-static int pseries_update_drconf_memory(struct of_reconfig_data *pr)
+static void pseries_queue_memory_event(u32 drc_index, int action)
 {
-   struct of_drconf_cell_v1 *new_drmem, *old_drmem;
-   unsigned long memblock_size;
-   u32 entries;
-   __be32 *p;
-   int i, rc = -EINVAL;
+   struct pseries_hp_errorlog *hp_elog;
+
+   hp_elog = kzalloc(sizeof(*hp_elog), GFP_KERNEL);
+   if(!hp_elog)
+   return;
+
+   hp_elog->resource = PSERIES_HP_ELOG_RESOURCE_MEM;
+   hp_elog->action = action;
+   hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX;
+   hp_elog->_drc_u.drc_index = drc_index;
+
+   queue_hotplug_event(hp_elog, NULL, NULL);
+
+   kfree(hp_elog);
+}
+
+static int pseries_update_drconf_memory(struct drmem_lmb_info *new_dinfo)
+{
+   struct drmem_lmb *old_lmb, *new_lmb;
+   unsigned long memblock_size;
+   int rc = 0;
 
if (rtas_hp_event)
return 0;
@@ -1009,42 +1025,123 @@ static int pseries_update_drconf_memory(struct 
of_reconfig_data *pr)
if (!memblock_size)
return -EINVAL;
 
-   p = (__be32 *) pr->old_prop->value;
-   if (!p)
-   return -EINVAL;
+   /* Arrays should have the same size and DRC indexes */
+   for_each_pair_drmem_lmb(drmem_info, old_lmb, new_dinfo, new_lmb) {
 
-   /* The first int of the property is the number of lmb's described
-* by the property. This is followed by an array of of_drconf_cell
-* entries. Get the number of entries and skip to the array of
-* of_drconf_cell's.
-*/
-   entries = be32_to_cpu(*p++);
-   old_drmem = (struct of_drconf_cell_v1 *)p;
-
-   p = (__be32 *)pr->prop->value;
-   p++;
-   new_drmem = (struct of_drconf_cell_v1 *)p;
+   if (new_lmb->drc_index != old_lmb->drc_index)
+   continue;
 
-   for (i = 0; i < entries; i++) {
-   if ((be32_to_cpu(old_drmem[i].flags) & DRCONF_MEM_ASSIGNED) &&
-   (!(be32_to_cpu(new_drmem[i].flags) & DRCONF_MEM_ASSIGNED))) 
{
+   if ((old_lmb->flags & DRCONF_MEM_ASSIGNED) &&
+   (!(new_lmb->flags & DRCONF_MEM_ASSIGNED))) {
rc = pseries_remove_memblock(
-   be64_to_cpu(old_drmem[i].base_addr),
- 

[RFC v4 2/3] powerpc migration/cpu: Associativity & cpu changes

2018-05-17 Thread Michael Bringmann
powerpc migration/cpu: Now apply changes to the associativity of cpus
for the topology of LPARS in Post Migration events.  Recognize more
changes to the associativity of memory blocks described by the
'cpu' properties when processing the topology of LPARS in Post Migration
events.  Previous efforts only recognized whether a memory block's
assignment had changed in the property.  Changes here include:

* Provide hotplug CPU 'readd by index' operation
* Checking for changes in cpu associativity and making 'readd' calls
  when differences are observed.
* Queue up  changes to CPU properties so that they may take place
  after all PowerPC device-tree changes have been applied i.e. after
  the device hotplug is released in the mobility code.

Signed-off-by: Michael Bringmann 
---
Changes include:
  -- Rearrange patches to co-locate CPU property-related changes.
  -- Modify dlpar_cpu_add & dlpar_cpu_remove to skip DRC index acquire
 or release operations during the CPU readd process.
  -- Correct a bug in DRC index selection for queued operation.
  -- Rebase to 4.17-rc5 kernel
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  123 +++---
 arch/powerpc/platforms/pseries/mobility.c|3 +
 2 files changed, 95 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index a408217..23d4cb8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -474,7 +474,7 @@ static bool valid_cpu_drc_index(struct device_node *parent, 
u32 drc_index)
);
 }
 
-static ssize_t dlpar_cpu_add(u32 drc_index)
+static ssize_t dlpar_cpu_add(u32 drc_index, bool acquire_drc)
 {
struct device_node *dn, *parent;
int rc, saved_rc;
@@ -499,19 +499,22 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
return -EINVAL;
}
 
-   rc = dlpar_acquire_drc(drc_index);
-   if (rc) {
-   pr_warn("Failed to acquire DRC, rc: %d, drc index: %x\n",
-   rc, drc_index);
-   of_node_put(parent);
-   return -EINVAL;
+   if (acquire_drc) {
+   rc = dlpar_acquire_drc(drc_index);
+   if (rc) {
+   pr_warn("Failed to acquire DRC, rc: %d, drc index: 
%x\n",
+   rc, drc_index);
+   of_node_put(parent);
+   return -EINVAL;
+   }
}
 
dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
if (!dn) {
pr_warn("Failed call to configure-connector, drc index: %x\n",
drc_index);
-   dlpar_release_drc(drc_index);
+   if (acquire_drc)
+   dlpar_release_drc(drc_index);
of_node_put(parent);
return -EINVAL;
}
@@ -526,8 +529,9 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
pr_warn("Failed to attach node %s, rc: %d, drc index: %x\n",
dn->name, rc, drc_index);
 
-   rc = dlpar_release_drc(drc_index);
-   if (!rc)
+   if (acquire_drc)
+   rc = dlpar_release_drc(drc_index);
+   if (!rc || acquire_drc)
dlpar_free_cc_nodes(dn);
 
return saved_rc;
@@ -540,7 +544,7 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
dn->name, rc, drc_index);
 
rc = dlpar_detach_node(dn);
-   if (!rc)
+   if (!rc && acquire_drc)
dlpar_release_drc(drc_index);
 
return saved_rc;
@@ -608,12 +612,13 @@ static int dlpar_offline_cpu(struct device_node *dn)
 
 }
 
-static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
+static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index,
+   bool release_drc)
 {
int rc;
 
-   pr_debug("Attempting to remove CPU %s, drc index: %x\n",
-dn->name, drc_index);
+   pr_debug("Attempting to remove CPU %s, drc index: %x (%d)\n",
+dn->name, drc_index, release_drc);
 
rc = dlpar_offline_cpu(dn);
if (rc) {
@@ -621,12 +626,14 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, 
u32 drc_index)
return -EINVAL;
}
 
-   rc = dlpar_release_drc(drc_index);
-   if (rc) {
-   pr_warn("Failed to release drc (%x) for CPU %s, rc: %d\n",
-   drc_index, dn->name, rc);
-   dlpar_online_cpu(dn);
-   return rc;
+   if (release_drc) {
+   rc = dlpar_release_drc(drc_index);
+   if (rc) {
+   pr_warn("Failed to release drc (%x) for CPU %s, rc: 
%d\n",
+   drc_index, dn->name, rc);
+   

[RFC v4 1/3] powerpc migration/drmem: Modify DRMEM code to export more features

2018-05-17 Thread Michael Bringmann
powerpc migration/drmem: Export many of the functions of DRMEM to
parse "ibm,dynamic-memory" and "ibm,dynamic-memory-v2" during
hotplug operations and for Post Migration events.

Also modify the DRMEM initialization code to allow it to,

* Be called after system initialization
* Provide a separate user copy of the LMB array that it produces
* Free the user copy upon request

Signed-off-by: Michael Bringmann 
---
Changes in RFC:
  -- Separate DRMEM changes into a standalone patch
  -- Do not export excess functions.  Make exported names more explicit.
  -- Add new iterator to work through a pair of drmem_info arrays.
  -- Modify DRMEM code to replace usages of dt_root_addr_cells, and
 dt_mem_next_cell, as these are only available at first boot.
  -- Rebase to 4.17-rc5 kernel
---
 arch/powerpc/include/asm/drmem.h |   10 +
 arch/powerpc/mm/drmem.c  |   78 +++---
 2 files changed, 66 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index ce242b9..c964b89 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -35,6 +35,13 @@ struct drmem_lmb_info {
&drmem_info->lmbs[0],   \
&drmem_info->lmbs[drmem_info->n_lmbs - 1])
 
+#define for_each_pair_drmem_lmb(dinfo1, lmb1, dinfo2, lmb2)\
+   for ((lmb1) = (&dinfo1->lmbs[0]),   \
+(lmb2) = (&dinfo2->lmbs[0]);   \
+ ((lmb1) <= (&dinfo1->lmbs[dinfo1->n_lmbs - 1])) &&\
+ ((lmb2) <= (&dinfo2->lmbs[dinfo2->n_lmbs - 1]));  \
+(lmb1)++, (lmb2)++)
+
 /*
  * The of_drconf_cell_v1 struct defines the layout of the LMB data
  * specified in the ibm,dynamic-memory device tree property.
@@ -94,6 +101,9 @@ void __init walk_drmem_lmbs(struct device_node *dn,
void (*func)(struct drmem_lmb *, const __be32 **));
 int drmem_update_dt(void);
 
+struct drmem_lmb_info* drmem_init_lmbs(struct property *prop);
+void drmem_lmbs_free(struct drmem_lmb_info *dinfo);
+
 #ifdef CONFIG_PPC_PSERIES
 void __init walk_drmem_lmbs_early(unsigned long node,
void (*func)(struct drmem_lmb *, const __be32 **));
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 3f18036..d9b281c 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -20,6 +20,7 @@
 
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
+static int n_root_addr_cells;
 
 u64 drmem_lmb_memory_max(void)
 {
@@ -193,12 +194,13 @@ int drmem_update_dt(void)
return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
-   lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, );
+   lmb->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
lmb->drc_index = of_read_number(p++, 1);
 
p++; /* skip reserved field */
@@ -209,7 +211,7 @@ static void __init read_drconf_v1_cell(struct drmem_lmb 
*lmb,
*prop = p;
 }
 
-static void __init __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *usm,
+static void __walk_drmem_v1_lmbs(const __be32 *prop, const __be32 *data,
void (*func)(struct drmem_lmb *, const __be32 **))
 {
struct drmem_lmb lmb;
@@ -221,17 +223,18 @@ static void __init __walk_drmem_v1_lmbs(const __be32 
*prop, const __be32 *usm,
 
for (i = 0; i < n_lmbs; i++) {
read_drconf_v1_cell(&lmb, &prop);
-   func(&lmb, &usm);
+   func(&lmb, &data);
}
 }
 
-static void __init read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
+static void read_drconf_v2_cell(struct of_drconf_cell_v2 *dr_cell,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
dr_cell->seq_lmbs = of_read_number(p++, 1);
-   dr_cell->base_addr = dt_mem_next_cell(dt_root_addr_cells, );
+   dr_cell->base_addr = of_read_number(p, n_root_addr_cells);
+   p += n_root_addr_cells;
dr_cell->drc_index = of_read_number(p++, 1);
dr_cell->aa_index = of_read_number(p++, 1);
dr_cell->flags = of_read_number(p++, 1);
@@ -239,7 +242,7 @@ static void __init read_drconf_v2_cell(struct 
of_drconf_cell_v2 *dr_cell,
*prop = p;
 }
 
-static void __init __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *usm,
+static void __walk_drmem_v2_lmbs(const __be32 *prop, const __be32 *data,
void (*func)(struct drmem_lmb *, const __be32 **))
 {
struct of_drconf_cell_v2 dr_cell;
@@ -263,7 +266,7 @@ static void __init __walk_drmem_v2_lmbs(const __be32 *prop, 
const __be32 *usm,
lmb.aa_index = dr_cell.aa_index;
lmb.flags = dr_cell.flags;
 
-  

[RFC v4 0/3] powerpc/hotplug: Fix affinity assoc for LPAR migration

2018-05-17 Thread Michael Bringmann
The migration of LPARs across Power systems affects many attributes
including that of the associativity of memory blocks and CPUs.  The
patches in this set execute when a system is coming up fresh upon a
migration target.  They are intended to,

* Recognize changes to the associativity of memory and CPUs recorded
  in internal data structures when compared to the latest copies in
  the device tree (e.g. ibm,dynamic-memory, ibm,dynamic-memory-v2,
  cpus),
* Recognize changes to the associativity mapping (e.g. ibm,
  associativity-lookup-arrays), locate all assigned memory blocks
  corresponding to each changed row, and readd all such blocks.
* Generate calls to other code layers to reset the data structures
  related to associativity of the CPUs and memory.
* Re-register the 'changed' entities into the target system.
  Re-registration of CPUs and memory blocks mostly entails acting as
  if they have been newly hot-added into the target system.

Signed-off-by: Michael Bringmann 

Michael Bringmann (3):
  powerpc migration/drmem: Modify DRMEM code to export more features
  powerpc migration/cpu: Associativity & cpu changes
  powerpc migration/memory: Associativity & memory updates
---
Changes in RFC:
  -- Restructure and rearrange content of patches to co-locate
 similar or related modifications
  -- Rename pseries_update_drconf_cpu to pseries_update_cpu
  -- Simplify code to update CPU nodes during mobility checks.
 Remove functions to generate extra HP_ELOG messages in favor
 of direct function calls to dlpar_cpu_readd_by_index, or
 dlpar_memory_readd_by_index.
  -- Revise code order in dlpar_cpu_readd_by_index() to present
 more appropriate error codes from underlying layers of the
 implementation.
  -- Add hotplug device lock around all property updates
  -- Schedule all CPU and memory changes due to device-tree updates /
 LPAR mobility as workqueue operations
  -- Export DRMEM accessor functions to parse 'ibm,dynamic-memory-v2'
  -- Export DRMEM functions to provide user copies of LMB array
  -- Compress code using DRMEM accessor functions.
  -- Split topology timer crash fix into new patch.
  -- Modify DRMEM code to replace usages of dt_root_addr_cells, and
 dt_mem_next_cell, as these are only available at first boot.
  -- Correct a bug in DRC index selection for queued operation.
  -- Rebase to 4.17-rc5 kernel



Re: [PATCH v17 0/9] Address error and recovery for AER and DPC

2018-05-17 Thread Bjorn Helgaas
[+cc Russell, Sam, Bryant, linuxppc-dev, Sebastian, linux-s390]

Sorry, I should have pulled in these new CC's earlier because ppc and
s390 both have PCI error handling similar to what Oza is changing
here.

The basic issue is that the new PCIe DPC (Downstream Port Containment,
see PCIe r4.0, sec 6.2.10) feature doesn't fit very well in the
framework of the pci_error_handlers callbacks.

When DPC is enabled, a Downstream Port (either a Root Port or a Switch
Downstream Port) that receives an ERR_FATAL message automatically
disables its Link.  IIUC, this is also intended for use in hot-unplug
scenarios.

When the DPC hardware takes the Link down, it resets all the
downstream devices, and there's not much point in calling the
pci_error_handlers callbacks because the devices are unreachable.
Even after the Link comes back up, we can't be certain the same device
is there because of the hotplug possibility.

The software side of DPC recovery basically consists of detaching the
drivers of the downstream devices (calling their .remove() methods),
bringing the link back up, re-enumerating the downstream devices, and
re-attaching the drivers (calling their .probe() methods).

The existing AER code also responds to ERR_FATAL messages, but it does
call the pci_error_handlers callbacks and also resets the link.

This is a bit of a mess because things look a lot different to the
driver depending on whether the platform supports AER or DPC.

Since we can't change the way DPC works, the idea of this series is
basically to make AER handle ERR_FATAL more like DPC does, i.e., by
resetting the link, detaching, and re-attaching the drivers.

This series is currently on my pci/aer branch
(https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/log/?h=pci/aer)
and is headed for v4.18 unless somebody raises major objections.

On Thu, May 17, 2018 at 03:43:02AM -0400, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks, so that callbacks are handled appropriately not only when AER
> gets triggered, but also when DPC gets triggered (e.g. ERR_FATAL).
> 
> The goal of the patch-set is:
> DPC should handle error handling and recovery similarly to AER, because
> ultimately both are attempting recovery in one way or another, and for
> that the error handling and recovery framework has to be loosely coupled.
> 
> It achieves uniformity and transparency to the error handling agents such
> as AER, DPC, with respect to recovery and error handling.
> 
> So, this patch-set tries to unify lot of things between error agents and
> make them behave in a well defined way. (be it error (FATAL, NON_FATAL)
> handling or recovery).
> 
> The FATAL error handling is handled with remove/reset_link/re-enumerate
> sequence while the NON_FATAL follows the default path.
> Documentation/PCI/pci-error-recovery.txt talks more on that.

I applied this series with a trivial change to remove an unused variable to
pci/aer for v4.18, thanks!

> Changes since v16:
> Bjorn's comments addressed
> > remove call pci_walk_bus(dev->subordinate, report_resume, _data)
> > pci_cleanup_aer_uncorrect_error_status(dev); happens only if service is 
> AER
> > aer_error_resume does not handle ERR_FATAL clearing anymore
> Changes since v15:
> Bjorn's comments addressed
> > minor comments fixed
> > made FATAL sequence aligned to existing one, as far as clearing status 
> are concerned.
> > pcie_do_fatal_recovery and pcie_do_nonfatal_recovery functions made to 
> modularize
> > pcie_do_fatal_recovery now takes service as an argument
> Changes since v14:
> Bjorn's comments addressed
> > simplified the patch set, and moved AER_FATAL handling in the beginning.
> > rebase the code to 4.17-rc1.
> Changes since v13:
> Bjorn's comments addressed
> > handle FATAL errors by removing devices followed by re-enumeration.
> > changes in AER and DPC along with required Documentation.
> Changes since v12:
> Bjorn's and Keith's Comments addressed.
> > Made DPC and AER error handling identical 
> > handled cases for hotplug-enabled systems differently.
> Changes since v11:
> Bjorn's comments addressed.
> > rename pcie-err.c to err.c
> > removed EXPORT_SYMBOL
> > made generic find_service function in port driver.
> > removed mutex patch as no need to have mutex in pcie_do_recovery
> > brought in DPC_FATAL in aer.h
> > so now all the error codes (AER and DPC) are unified in aer.h
> Changes since v10:
> Christoph Hellwig's, David Laight's and Randy Dunlap's
> comments addressed.
> > renamed pci_do_recovery to pcie_do_recovery
> > removed inner braces in conditional statements.
> > restructured the code in 

Re: [PATCH bpf 5/6] tools: bpftool: resolve calls without using imm field

2018-05-17 Thread Jakub Kicinski
On Thu, 17 May 2018 12:05:47 +0530, Sandipan Das wrote:
> Currently, we resolve the callee's address for a JITed function
> call by using the imm field of the call instruction as an offset
> from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
> use this address to get the callee's kernel symbol's name.
> 
> For some architectures, such as powerpc64, the imm field is not
> large enough to hold this offset. So, instead of assigning this
> offset to the imm field, the verifier now assigns the subprog
> id. Also, a list of kernel symbol addresses for all the JITed
> functions is provided in the program info. We now use the imm
> field as an index for this list to lookup a callee's symbol's
> address and resolve its name.
> 
> Suggested-by: Daniel Borkmann 
> Signed-off-by: Sandipan Das 

A few nit-picks below, thank you for the patch!

>  tools/bpf/bpftool/prog.c  | 31 +++
>  tools/bpf/bpftool/xlated_dumper.c | 24 +---
>  tools/bpf/bpftool/xlated_dumper.h |  2 ++
>  3 files changed, 50 insertions(+), 7 deletions(-)
> 
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 9bdfdf2d3fbe..ac2f62a97e84 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -430,6 +430,10 @@ static int do_dump(int argc, char **argv)
>   unsigned char *buf;
>   __u32 *member_len;
>   __u64 *member_ptr;
> + unsigned int nr_addrs;
> + unsigned long *addrs = NULL;
> + __u32 *ksyms_len;
> + __u64 *ksyms_ptr;

nit: please try to keep the variables ordered longest to shortest like
we do in networking code (please do it in all functions).

>   ssize_t n;
>   int err;
>   int fd;
> @@ -437,6 +441,8 @@ static int do_dump(int argc, char **argv)
>   if (is_prefix(*argv, "jited")) {
>   member_len = _prog_len;
>   member_ptr = _prog_insns;
> + ksyms_len = _jited_ksyms;
> + ksyms_ptr = _ksyms;
>   } else if (is_prefix(*argv, "xlated")) {
>   member_len = _prog_len;
>   member_ptr = _prog_insns;
> @@ -496,10 +502,23 @@ static int do_dump(int argc, char **argv)
>   return -1;
>   }
>  
> + nr_addrs = *ksyms_len;

Here and ...

> + if (nr_addrs) {
> + addrs = malloc(nr_addrs * sizeof(__u64));
> + if (!addrs) {
> + p_err("mem alloc failed");
> + free(buf);
> + close(fd);
> + return -1;

You can just jump to err_free here.

> + }
> + }
> +
>   memset(, 0, sizeof(info));
>  
>   *member_ptr = ptr_to_u64(buf);
>   *member_len = buf_size;
> + *ksyms_ptr = ptr_to_u64(addrs);
> + *ksyms_len = nr_addrs;

... here - this function is getting long, so maybe I'm not seeing
something, but are ksyms_ptr and ksyms_len guaranteed to be initialized?

>   err = bpf_obj_get_info_by_fd(fd, , );
>   close(fd);
> @@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv)
>   goto err_free;
>   }
>  
> + if (*ksyms_len > nr_addrs) {
> + p_err("too many addresses returned");
> + goto err_free;
> + }
> +
>   if ((member_len == _prog_len &&
>info.jited_prog_insns == 0) ||
>   (member_len == _prog_len &&
> @@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv)
>   dump_xlated_cfg(buf, *member_len);
>   } else {
>   kernel_syms_load();
> + dd.jited_ksyms = ksyms_ptr;
> + dd.nr_jited_ksyms = *ksyms_len;
> +
>   if (json_output)
>   dump_xlated_json(, buf, *member_len, opcodes);
>   else
> @@ -566,10 +593,14 @@ static int do_dump(int argc, char **argv)
>   }
>  
>   free(buf);
> + if (addrs)
> + free(addrs);

Free can deal with NULL pointers, no need for an if.

>   return 0;
>  
>  err_free:
>   free(buf);
> + if (addrs)
> + free(addrs);
>   return -1;
>  }
>  
> diff --git a/tools/bpf/bpftool/xlated_dumper.c 
> b/tools/bpf/bpftool/xlated_dumper.c
> index 7a3173b76c16..dc8e4eca0387 100644
> --- a/tools/bpf/bpftool/xlated_dumper.c
> +++ b/tools/bpf/bpftool/xlated_dumper.c
> @@ -178,8 +178,12 @@ static const char *print_call_pcrel(struct dump_data *dd,
>   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
>"%+d#%s", insn->off, sym->name);
>   else

else if (address)

saves us the indentation.

> - snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
> -  "%+d#0x%lx", insn->off, address);
> + if (address)
> + snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
> +  "%+d#0x%lx", insn->off, address);
> + else
> + 

Re: [PATCH] powerpc: Ensure gcc doesn't move around cache flushing in __patch_instruction

2018-05-17 Thread Segher Boessenkool
Hi!

On Thu, May 17, 2018 at 01:06:10PM +1000, Benjamin Herrenschmidt wrote:
> The current asm statement in __patch_instruction() for the cache flushes
> doesn't have a "volatile" statement and no memory clobber. That means
> gcc can potentially move it around (or move the store done by put_user
> past the flush).

volatile is completely superfluous here, except maybe as documentation:
any asm without outputs is always volatile.

(And the memory clobber does not prevent the compiler from moving the
asm around, or duplicating it, etc., and neither does the volatile).
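
For example, GCC treats these two the same (a minimal sketch, not the actual
__patch_instruction() code; the instruction sequence is only illustrative):

static inline void flush_line_sketch(unsigned long addr)
{
	/* No outputs, so this asm is implicitly volatile already */
	asm("dcbst 0,%0; sync; icbi 0,%0; isync" : : "r" (addr) : "memory");

	/* The explicit volatile adds nothing here beyond documentation */
	asm volatile("dcbst 0,%0; sync; icbi 0,%0; isync"
		     : : "r" (addr) : "memory");
}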


Segher


Re: [PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Randy Dunlap
On 05/17/2018 10:19 AM, Matthew Wilcox wrote:
> On Thu, May 17, 2018 at 09:36:00AM -0700, Randy Dunlap wrote:
>>> +If the speculative page fault fails because of a concurrency is
>>
>>   because a concurrency is
> 
> While one can use concurrency as a noun, it sounds archaic to me.  I'd
> rather:
> 
>   If the speculative page fault fails because a concurrent modification
>   is detected or because underlying PMD or PTE tables are not yet

Yeah, OK.

>>> +detected or because underlying PMD or PTE tables are not yet
>>> +allocating, it is failing its processing and a classic page fault
>>
>>   allocated, the speculative page fault fails and a classic page fault
>>
>>> +is then tried.


-- 
~Randy


Re: [PATCH v4 4/4] powerpc/kbuild: move -mprofile-kernel check to Kconfig

2018-05-17 Thread kbuild test robot
Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-patches-for-new-Kconfig-language/20180517-224044
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allmodconfig
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        make.cross ARCH=powerpc allmodconfig
        make.cross ARCH=powerpc

All errors (new ones prefixed by >>):

   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:468: syntax error
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:467: invalid option
   make[2]: *** [allmodconfig] Error 1
   make[1]: *** [allmodconfig] Error 2
   make: *** [sub-make] Error 2
--
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:468: syntax error
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:467: invalid option
   make[2]: *** [oldconfig] Error 1
   make[1]: *** [oldconfig] Error 2
   make: *** [sub-make] Error 2
--
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:468: syntax error
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
   arch/powerpc/Kconfig:467:warning: ignoring unsupported character '$'
>> arch/powerpc/Kconfig:467: invalid option
   make[2]: *** [olddefconfig] Error 1
   make[2]: Target 'oldnoconfig' not remade because of errors.
   make[1]: *** [oldnoconfig] Error 2
   make: *** [sub-make] Error 2

vim +468 arch/powerpc/Kconfig

e05c0e81 Kevin Hao       2013-07-16  443  
3d72bbc4 Michael Neuling 2013-02-13  444  config PPC_TRANSACTIONAL_MEM
3d72bbc4 Michael Neuling 2013-02-13  445  	bool "Transactional Memory support for POWERPC"
3d72bbc4 Michael Neuling 2013-02-13  446  	depends on PPC_BOOK3S_64
3d72bbc4 Michael Neuling 2013-02-13  447  	depends on SMP
7b37a123 Michael Neuling 2014-01-08  448  	select ALTIVEC
7b37a123 Michael Neuling 2014-01-08  449  	select VSX
3d72bbc4 Michael Neuling 2013-02-13  450  	default n
3d72bbc4 Michael Neuling 2013-02-13  451  	---help---
3d72bbc4 Michael Neuling 2013-02-13  452  	  Support user-mode Transactional Memory on POWERPC.
3d72bbc4 Michael Neuling 2013-02-13  453  
951eedeb Nicholas Piggin 2017-05-29  454  config LD_HEAD_STUB_CATCH
951eedeb Nicholas Piggin 2017-05-29  455  	bool "Reserve 256 bytes to cope with linker stubs in HEAD text" if EXPERT
951eedeb Nicholas Piggin 2017-05-29  456  	depends on PPC64
951eedeb Nicholas Piggin 2017-05-29  457  	default n
951eedeb Nicholas Piggin 2017-05-29  458  	help
951eedeb Nicholas Piggin 2017-05-29  459  	  Very large kernels can cause linker branch stubs to be generated by
951eedeb Nicholas Piggin 2017-05-29  460  	  code in head_64.S, which moves the head text sections out of their
951eedeb Nicholas Piggin 2017-05-29  461  	  specified location. This option can work around the problem.
951eedeb Nicholas Piggin 2017-05-29  462  
951eedeb Nicholas Piggin 2017-05-29  463  	  If unsure, say "N".
951eedeb Nicholas Piggin 2017-05-29  464  
8c50b72a Torsten Duwe    2016-03-03  465  config MPROFILE_KERNEL
8c50b72a Torsten Duwe    2016-03-03  466  	depends on PPC64 && CPU_LITTLE_ENDIAN
4421b963 Nicholas Piggin 2018-05-17 @467  	def_bool $(success $(srctree)/arch/powerpc/tools/gcc-check-mprofile-kernel.sh $(CC) -I$(srctree)/include -D__KERNEL__)
8c50b72a Torsten Duwe    2016-03-03 @468  

:: The code at line 468 was first introduced by commit
:: 8c50b72a3b4f1f7cdfdfebd233b1cbd121262e65 powerpc/ftrace: Add Kconfig & Make glue for mprofile-kernel

:: TO: Torsten Duwe <d...@lst.de>
:: CC: Michael Ellerman <m...@ellerman.id.au>

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Matthew Wilcox
On Thu, May 17, 2018 at 09:36:00AM -0700, Randy Dunlap wrote:
> > +If the speculative page fault fails because of a concurrency is
> 
>because a concurrency is

While one can use concurrency as a noun, it sounds archaic to me.  I'd
rather:

If the speculative page fault fails because a concurrent modification
is detected or because underlying PMD or PTE tables are not yet

> > +detected or because underlying PMD or PTE tables are not yet
> > +allocating, it is failing its processing and a classic page fault
> 
>allocated, the speculative page fault fails and a classic page fault
> 
> > +is then tried.


Re: [PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Randy Dunlap
Hi,

On 05/17/2018 04:06 AM, Laurent Dufour wrote:
> This configuration variable will be used to build the code needed to
> handle speculative page fault.
> 
> By default it is turned off, and activated depending on architecture
> support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.
> 
> The architecture support is needed since the speculative page fault handler
> is called from the architecture's page faulting code, and some code has to
> be added there to handle the speculative handler.
> 
> The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
> does processing that is not compatible with the speculative handling in the
> case ARCH_HAS_PTE_SPECIAL is not set.
> 
> Suggested-by: Thomas Gleixner 
> Suggested-by: David Rientjes 
> Signed-off-by: Laurent Dufour 
> ---
>  mm/Kconfig | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 1d0888c5b97a..a38796276113 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -761,3 +761,25 @@ config GUP_BENCHMARK
>  
>  config ARCH_HAS_PTE_SPECIAL
>   bool
> +
> +config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +   def_bool n
> +
> +config SPECULATIVE_PAGE_FAULT
> +   bool "Speculative page faults"
> +   default y
> +   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
> +   depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
> +   help
> + Try to handle user space page faults without holding the mmap_sem.
> +
> +  This should allow better concurrency for massively threaded process

 processes

> +  since the page fault handler will not wait for other threads memory

  thread's

> +  layout change to be done, assuming that this change is done in another
> +  part of the process's memory space. This type of page fault is named
> +  speculative page fault.
> +
> +  If the speculative page fault fails because of a concurrency is

 because a concurrency is

> +  detected or because underlying PMD or PTE tables are not yet
> +  allocating, it is failing its processing and a classic page fault

 allocated, the speculative page fault fails and a classic page fault

> +  is then tried.


Also, all of the help text (below the "help" line) should be indented by
1 tab + 2 spaces (in coding-style.rst).
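
For reference, a minimal example of the expected layout (keyword lines use
one tab, the help body uses one tab plus two spaces):

config SPECULATIVE_PAGE_FAULT
	bool "Speculative page faults"
	help
	  Try to handle user space page faults without holding the mmap_sem.

	  If unsure, say N.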


-- 
~Randy


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-17 Thread Mathieu Desnoyers
- On May 16, 2018, at 9:19 PM, Boqun Feng boqun.f...@gmail.com wrote:

> On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
>> - On May 16, 2018, at 12:18 PM, Peter Zijlstra pet...@infradead.org 
>> wrote:
>> 
>> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
>> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> >> index c32a181a7cbb..ed21a777e8c6 100644
>> >> --- a/arch/powerpc/Kconfig
>> >> +++ b/arch/powerpc/Kconfig
>> >> @@ -223,6 +223,7 @@ config PPC
>> >>   select HAVE_SYSCALL_TRACEPOINTS
>> >>   select HAVE_VIRT_CPU_ACCOUNTING
>> >>   select HAVE_IRQ_TIME_ACCOUNTING
>> >> + select HAVE_RSEQ
>> >>   select IRQ_DOMAIN
>> >>   select IRQ_FORCED_THREADING
>> >>   select MODULES_USE_ELF_RELA
>> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
>> >> index 61db86ecd318..d3bb3aaaf5ac 100644
>> >> --- a/arch/powerpc/kernel/signal.c
>> >> +++ b/arch/powerpc/kernel/signal.c
>> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
>> >>   /* Re-enable the breakpoints for the signal stack */
>> >>   thread_change_pc(tsk, tsk->thread.regs);
>> >>  
>> >> + rseq_signal_deliver(tsk->thread.regs);
>> >> +
>> >>   if (is32) {
>> >>   if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>> >>   ret = handle_rt_signal32(, oldset, tsk);
>> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned 
>> >> long
>> >> thread_info_flags)
>> >>   if (thread_info_flags & _TIF_NOTIFY_RESUME) {
>> >>   clear_thread_flag(TIF_NOTIFY_RESUME);
>> >>   tracehook_notify_resume(regs);
>> >> + rseq_handle_notify_resume(regs);
>> >>   }
>> >>  
>> >>   user_enter();
>> > 
>> > Again no rseq_syscall().
>> 
>> Same question for PowerPC as for ARM:
>> 
>> Considering that rseq_syscall is implemented as follows:
>> 
>> +void rseq_syscall(struct pt_regs *regs)
>> +{
>> +   unsigned long ip = instruction_pointer(regs);
>> +   struct task_struct *t = current;
>> +   struct rseq_cs rseq_cs;
>> +
>> +   if (!t->rseq)
>> +   return;
>> +   if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
>> +   rseq_get_rseq_cs(t, _cs) || in_rseq_cs(ip, _cs))
>> +   force_sig(SIGSEGV, t);
>> +}
>> 
>> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
>> now used in the fast-path since KPTI), I wonder where we should call
> 
> So we actually detect this after the syscall takes effect, right? I
> wonder whether this could be problematic, because "disallowing syscalls"
> in rseq areas may mean, to some people, that the syscall won't take
> effect, I guess?
> 
>> this on PowerPC ?  I was under the impression that PowerPC return to
>> userspace fast-path was not calling C code unless work flags were set,
>> but I might be wrong.
>> 
> 
> I think you're right. So we have to introduce callsite to rseq_syscall()
> in syscall path, something like:
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 51695608c68b..a25734a96640 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -222,6 +222,9 @@ system_call_exit:
>   mtmsrd  r11,1
> #endif /* CONFIG_PPC_BOOK3E */
> 
> + addir3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> +
>   ld  r9,TI_FLAGS(r12)
>   li  r11,-MAX_ERRNO
>   andi.   r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> 
> But I think it's important for us to first decide where (before or after
> the syscall) we do the detection.

As Peter said, we don't really care whether it's on syscall entry or exit, as
long as the process gets killed when the erroneous use is detected. I think 
doing
it on syscall exit is a bit easier because we can clearly access the userspace
TLS, which AFAIU may be less straightforward on syscall entry.

We may want to add #ifdef CONFIG_DEBUG_RSEQ / #endif around the code you
proposed above, so it's only compiled in if CONFIG_DEBUG_RSEQ=y.

On the ARM leg of the email thread, Will Deacon suggests testing whether
current->rseq is non-NULL before calling rseq_syscall(). I wonder if this
added check is justified at the assembly level, considering that this is
just a debugging option. We already do that check at the very beginning of
rseq_syscall().

Thoughts ?

Thanks,

Mathieu

> 
> Regards,
> Boqun
> 
>> Thoughts ?
>> 
>> Thanks!
>> 
>> Mathieu
>> 
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
> > http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Failure to allocate HTAB for guest - CMA allocation failures?

2018-05-17 Thread Daniel Axtens
Hi,

I have reports from a user who is experiencing intermittent issues
with qemu being unable to allocate memory for the guest HPT. We see:

libvirtError: internal error: process exited while connecting to monitor: 
Unexpected error in spapr_alloc_htab() at 
/build/qemu-UwnbKa/qemu-2.5+dfsg/hw/ppc/spapr.c:1030:
qemu-system-ppc64le: Failed to allocate HTAB of requested size, try with 
smaller maxmem

and in the kernel logs:

[10103945.040498] alloc_contig_range: 19127 callbacks suppressed
[10103945.040502] alloc_contig_range: [7a5d00, 7a6500) PFNs busy
[10103945.040526] alloc_contig_range: [7a5d00, 7a6504) PFNs busy
[10103945.040548] alloc_contig_range: [7a5d00, 7a6508) PFNs busy
[10103945.040569] alloc_contig_range: [7a5d00, 7a650c) PFNs busy
[10103945.040591] alloc_contig_range: [7a5d00, 7a6510) PFNs busy
[10103945.040612] alloc_contig_range: [7a5d00, 7a6514) PFNs busy
[10103945.040634] alloc_contig_range: [7a5d00, 7a6518) PFNs busy
[10103945.040655] alloc_contig_range: [7a5d00, 7a651c) PFNs busy
[10103945.040676] alloc_contig_range: [7a5d00, 7a6520) PFNs busy
[10103945.040698] alloc_contig_range: [7a5d00, 7a6524) PFNs busy

I understand that this is caused when the request for an appropriately
sized and aligned piece of contiguous host memory for the guest hash
page table cannot be satisfied from the CMA. The user was attempting
to start a 16GB guest, so if I can read qemu code correctly, it would
be asking for 128MB of contiguous memory.

The CMA is pretty large - this is taken from /proc/meminfo some time
after the allocation failure:

CmaTotal: 26853376 kB
CmaFree: 4024448 kB

(The CMA is ~25GB, the host has 512GB of RAM.)

My guess is that the CMA has become fragmented (the machine had 112
days of uptime) and that was interfering with the ability of the
kernel to service the request?

Some googling suggests that these sorts of failures have been seen
before:

 * [1] is a Launchpad bug mirrored from the IBM Bugzilla that talks
   about this issue especially in the context of PCI passthrough
   leading to more memory being pinned. No PCI passthrough is
   occurring in this case.

 * [2] is from Red Hat - it seems to be especially focussed on
   particularly huge guests and memory hotplug. I don't think either
   of those apply here either.

I noticed from [1] that there is a patch from Balbir that apparently
helps when VFIO is used - 2e5bbb5461f1 ("KVM: PPC: Book3S HV: Migrate
pinned pages out of CMA"). The user is running a 4.4 kernel with this
backported. There's also reference to some work Alexey was doing to
unpin pages in a more timely fashion. It looks like that stalled, and
I can't see anything else particularly relevant in the kernel tree
between then and now - although I may well be missing stuff.

So:

 - have I missed anything obvious here/have I gone completely wrong in
   my analysis somewhere?

 - have I missed any great changes since 4.4 that would fix this?

 - is there any ongoing work in increasing CMA availability?

 - I noticed in arch/powerpc/kvm/book3s_hv_builtin.c,
   kvm_cma_resv_ratio is defined as a boot parameter. By default 5% of
   host memory is reserved for CMA. Presumably increasing this will
   increase the likelihood that the kernel can service a request for
   contiguous memory. Are there any recommended tunings here?

 - is there anything else the user could try?
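
For reference, the tuning mentioned above is a kernel command line
parameter, e.g. (value purely illustrative, not a recommendation):

    kvm_cma_resv_ratio=10

which would reserve 10% of host memory for CMA instead of the default 5%.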

Thanks!

Regards,
Daniel

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300


Re: [2/2] powerpc/powernv: Fix NVRAM sleep in invalid context when crashing

2018-05-17 Thread Michael Ellerman
On Mon, 2018-05-14 at 15:59:47 UTC, Nicholas Piggin wrote:
> Similarly to opal_event_shutdown, opal_nvram_write can be called in
> the crash path with irqs disabled. Special case the delay to avoid
> sleeping in invalid context.
> 
> Cc: sta...@vger.kernel.org # v3.2
> Fixes: 3b8070335f ("powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops")
> Signed-off-by: Nicholas Piggin 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/c1d2a31397ec51f0370f6bd17b19b3

cheers


Re: [v4, 1/2] cxl: Set the PBCQ Tunnel BAR register when enabling capi mode

2018-05-17 Thread Michael Ellerman
On Mon, 2018-05-14 at 08:27:35 UTC, Philippe Bergheaud wrote:
> Skiboot used to set the default Tunnel BAR register value when capi mode
> was enabled. This approach was ok for the cxl driver, but prevented other
> drivers from choosing different values.
> 
> Skiboot versions > 5.11 will not set the default value any longer. This
> patch modifies the cxl driver to set/reset the Tunnel BAR register when
> entering/exiting the cxl mode, with pnv_pci_set_tunnel_bar().
> 
> That should work with old skiboot (since we are re-writing the value
> already set) and new skiboot.
> 
> Signed-off-by: Philippe Bergheaud 
> Reviewed-by: Christophe Lombard 
> Acked-by: Frederic Barrat 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/401dca8cbd14fc4b32d93499dcd12a

cheers


Re: [PATCH] Revert "powerpc/64: Fix checksum folding in csum_add()"

2018-05-17 Thread Segher Boessenkool
On Tue, Apr 10, 2018 at 08:34:37AM +0200, Christophe Leroy wrote:
> This reverts commit 6ad966d7303b70165228dba1ee8da1a05c10eefe.
> 
> That commit was pointless, because csum_add() sums two 32-bit
> values, so the sum is 0x1fffffffe at the maximum.
> And then when adding the upper part (1) and the lower part (0xfffffffe),
> the result is 0xffffffff, which doesn't carry.
> Any lower value will not carry either.
> 
> And behind the fact that this commit is useless, it also kills the
> whole purpose of having an arch specific inline csum_add()
> because the resulting code gets even worse than what is obtained
> with the generic implementation of csum_add()

:-)

> And the reverted implementation for PPC64 gives:
> 
> 0240 <.csum_add>:
>  240: 7c 84 1a 14 add r4,r4,r3
>  244: 78 80 00 22 rldicl  r0,r4,32,32
>  248: 7c 80 22 14 add r4,r0,r4
>  24c: 78 83 00 20 clrldi  r3,r4,32
>  250: 4e 80 00 20 blr

If you really, really, *really* want to optimise this you could
make it:

rldimi r3,r3,0,32
rldimi r4,r4,0,32
add r3,r3,r4
srdi r3,r3,32
blr

which is the same size, but has a shorter critical path length.  Very
analogous to how you fold 64->32.
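
For reference, the operation both versions implement is the usual 32-bit
end-around-carry add, sketched here from memory rather than copied from
include/net/checksum.h (__force casts omitted):

static inline __wsum csum_add_sketch(__wsum csum, __wsum addend)
{
	u32 res = (u32)csum + (u32)addend;

	/* fold any carry out of bit 32 back into the low word */
	return (__wsum)(res + (res < (u32)addend));
}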


Segher


Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm

2018-05-17 Thread Segher Boessenkool
On Thu, May 17, 2018 at 03:27:37PM +0200, Christophe LEROY wrote:
> Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :
> >>I guess we've been enabling this for all 32-bit targets for ever so it
> >>must be a reasonable option.
> >
> >On 603, load multiple (and string) are one cycle slower than doing all the
> >loads separately, and store is essentially the same as separate stores.
> >On 7xx and 7xxx both loads and stores are one cycle slower as multiple
> >than as separate insns.
> 
> That's in theory when the instructions are already in the cache.
> 
> But loading several instructions into the cache takes time.

Yes, of course, that's why I wrote:

> >load/store multiple are nice for saving/storing registers.

:-)


Segher


Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

2018-05-17 Thread Christophe LEROY



Le 17/05/2018 à 15:46, Michael Ellerman a écrit :

Nicholas Piggin  writes:


On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy  wrote:


commit 87a156fb18fe1 ("Align hot loops of some string functions")
degraded the performance of string functions by adding useless
nops

A simple benchmark on an 8xx calling 10x a memchr() that
matches the first byte runs in 41668 TB ticks before this patch
and in 35986 TB ticks after this patch. So this gives an
improvement of approx 10%

Another benchmark doing the same with a memchr() matching the 128th
byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
after this patch, so regardless on the number of loops, removing
those useless nops improves the test by 5683 TB ticks.

Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
Signed-off-by: Christophe Leroy 
---
  Was sent already as part of a series optimising string functions.
  Resending on its own as it is independent of the other changes in the
  series

  arch/powerpc/lib/string.S | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index a787776822d8..a026d8fa8a99 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -23,7 +23,9 @@ _GLOBAL(strncpy)
mtctr   r5
addir6,r3,-1
addir4,r4,-1
+#ifdef CONFIG_PPC64
.balign 16
+#endif
  1:lbzur0,1(r4)
cmpwi   0,r0,0
stbur0,1(r6)


The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.


Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN   .balign IFETCH_ALIGN_BYTES
#endif

...

addir4,r4,-1
IFETCH_BALIGN
   1:   lbzur0,1(r4)




Why not just define IFETCH_ALIGN_SHIFT for PPC32 as well in asm/cache.h,
then replace the .balign 16 with .balign IFETCH_ALIGN_BYTES (or .align
IFETCH_ALIGN_SHIFT)?
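
Something like the following, maybe (a sketch only, the numbers are
illustrative and 0 simply means no extra alignment):

/* asm/cache.h */
#ifdef CONFIG_PPC64
#define IFETCH_ALIGN_SHIFT	4	/* keep the existing 64-bit value */
#else
#define IFETCH_ALIGN_SHIFT	0	/* small 32-bit cores: no padding */
#endif
#define IFETCH_ALIGN_BYTES	(1 << IFETCH_ALIGN_SHIFT)

/* arch/powerpc/lib/string.S */
	.balign	IFETCH_ALIGN_BYTES
1:	lbzu	r0,1(r4)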


Christophe


cheers



Re: [PATCH v4 3/4] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()

2018-05-17 Thread Michael Ellerman
wei.guo.si...@gmail.com writes:
> From: Simon Guo 
>
> This patch is based on the previous VMX patch on memcmp().
>
> To optimize ppc64 memcmp() with VMX instruction, we need to think about
> the VMX penalty brought with: If kernel uses VMX instruction, it needs
> to save/restore current thread's VMX registers. There are 32 x 128 bits
> VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.
>
> The major concern regarding memcmp() performance in the kernel is KSM,
> which uses memcmp() frequently to merge identical pages. So it will
> make sense to take some measures/enhancement on KSM to see whether any
> improvement can be done here.  Cyril Bur indicates that the memcmp() for
> KSM has a higher possibility to fail (unmatch) early in previous bytes
> in following mail.
>   https://patchwork.ozlabs.org/patch/817322/#1773629
> And I am taking a follow-up on this with this patch.
>
> Per some testing, it shows KSM memcmp() will fail early within the first
> 32 bytes.  More specifically:
> - 76% cases will fail/unmatch before 16 bytes;
> - 83% cases will fail/unmatch before 32 bytes;
> - 84% cases will fail/unmatch before 64 bytes;
> So 32 bytes looks a better choice than other bytes for pre-checking.
>
> This patch adds a 32 bytes pre-checking firstly before jumping into VMX
> operations, to avoid the unnecessary VMX penalty. And the testing shows
> ~20% improvement on memcmp() average execution time with this patch.
>
> The detail data and analysis is at:
> https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md
>
> Any suggestion is welcome.

Thanks for digging into that, really great work.

I'm inclined to make this not depend on KSM though. It seems like a good
optimisation to do in general.

So can we just call it the 'pre-check' or something, and always do it?

cheers

> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 6303bbf..df2eec0 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -405,6 +405,35 @@ _GLOBAL(memcmp)
>   /* Enter with src/dst addrs has the same offset with 8 bytes
>* align boundary
>*/
> +
> +#ifdef CONFIG_KSM
> + /* KSM will always compare at page boundary so it falls into
> +  * .Lsameoffset_vmx_cmp.
> +  *
> +  * There is an optimization for KSM based on following fact:
> +  * KSM pages memcmp() prones to fail early at the first bytes. In
> +  * a statisis data, it shows 76% KSM memcmp() fails at the first
> +  * 16 bytes, and 83% KSM memcmp() fails at the first 32 bytes, 84%
> +  * KSM memcmp() fails at the first 64 bytes.
> +  *
> +  * Before applying VMX instructions which will lead to 32x128bits VMX
> +  * regs load/restore penalty, let us compares the first 32 bytes
> +  * so that we can catch the ~80% fail cases.
> +  */
> +
> + li  r0,4
> + mtctr   r0
> +.Lksm_32B_loop:
> + LD  rA,0,r3
> + LD  rB,0,r4
> + cmpld   cr0,rA,rB
> + addir3,r3,8
> + addir4,r4,8
> + bne cr0,.LcmpAB_lightweight
> + addir5,r5,-8
> + bdnz.Lksm_32B_loop
> +#endif
> +
>   ENTER_VMX_OPS
>   beq cr1,.Llong_novmx_cmp
>  
> -- 
> 1.8.3.1


Re: [PATCH 2/2] selftests/powerpc: Add core file test for Protection Key registers

2018-05-17 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> This test verifies that the AMR, IAMR and UAMOR are being written to a
> process' core file.
>
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  tools/testing/selftests/powerpc/ptrace/Makefile|   5 +-
>  tools/testing/selftests/powerpc/ptrace/core-pkey.c | 460 
> +

Also failing w/out pkeys:

  test: core_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 265
  failure: core_pkey


cheers


Re: [PATCH 1/2] selftests/powerpc: Add ptrace tests for Protection Key registers

2018-05-17 Thread Michael Ellerman
Thiago Jung Bauermann  writes:

> This test exercises read and write access to the AMR, IAMR and UAMOR.
>
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  tools/testing/selftests/powerpc/include/reg.h  |   1 +
>  tools/testing/selftests/powerpc/ptrace/Makefile|   5 +-
>  tools/testing/selftests/powerpc/ptrace/child.h | 130 
>  .../testing/selftests/powerpc/ptrace/ptrace-pkey.c | 326 
> +

This is failing on machines without pkeys:

  test: ptrace_pkey
  tags: git_version:52e7d87
  [FAIL] Test FAILED on line 117
  [FAIL] Test FAILED on line 191
  failure: ptrace_pkey


I think the first fail is in the child here:

int ptrace_read_regs(pid_t child, unsigned long type, unsigned long regs[],
 int n)
{
struct iovec iov;
long ret;

FAIL_IF(start_trace(child));

iov.iov_base = regs;
iov.iov_len = n * sizeof(unsigned long);

ret = ptrace(PTRACE_GETREGSET, child, type, );
FAIL_IF(ret != 0);


Which makes sense.

The test needs to skip if pkeys are not available/enabled. Using the
availability of the REGSET might actually be a nice way to detect that,
because it's read-only.
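
Something like this, perhaps (a sketch only; NT_PPC_PKEY may need a local
fallback definition if the toolchain's elf.h lacks it, and the exact errno
for a missing regset is kernel dependent, so any failure is treated as
"unsupported" here):

static int pkeys_unsupported(pid_t child)
{
	unsigned long regs[3];		/* AMR, IAMR, UAMOR */
	struct iovec iov = {
		.iov_base = regs,
		.iov_len  = sizeof(regs),
	};

	/* Read-only probe: fails when the kernel doesn't expose the regset */
	return ptrace(PTRACE_GETREGSET, child, NT_PPC_PKEY, &iov) != 0;
}

	/* then, early in the test: */
	SKIP_IF(pkeys_unsupported(child));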

cheers


Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

2018-05-17 Thread Segher Boessenkool
On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
> 
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
> 
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
> 
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
> 
>   Before  After   Improvement
> 01:75775682   25%
> 02:   416685682   86%
> 03:   51137   13258   74%
> 04:   454555682   87%
> 05:   58713   13258   77%
> 06:   58712   13258   77%
> 07:   68183   20834   70%
> 08:   56819   15153   73%
> 09:   70077   28411   60%
> 10:   70077   28411   60%
> 11:   79546   35986   55%
> 12:   68182   28411   58%
> 13:   81440   35986   55%
> 14:   81440   39774   51%
> 15:   94697   43562   54%
> 16:   79546   37881   52%

Could you show results with a more recent GCC?  What version was this?

What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
ticks to do a 1-byte memcmp, which is just 3 instructions after all.


Segher


Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

2018-05-17 Thread Michael Ellerman
Nicholas Piggin  writes:

> On Thu, 17 May 2018 12:04:13 +0200 (CEST)
> Christophe Leroy  wrote:
>
>> commit 87a156fb18fe1 ("Align hot loops of some string functions")
>> degraded the performance of string functions by adding useless
>> nops
>> 
>> A simple benchmark on an 8xx calling 10x a memchr() that
>> matches the first byte runs in 41668 TB ticks before this patch
>> and in 35986 TB ticks after this patch. So this gives an
>> improvement of approx 10%
>> 
>> Another benchmark doing the same with a memchr() matching the 128th
>> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
>> after this patch, so regardless on the number of loops, removing
>> those useless nops improves the test by 5683 TB ticks.
>> 
>> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
>> Signed-off-by: Christophe Leroy 
>> ---
>>  Was sent already as part of a series optimising string functions.
>>  Resending on its own as it is independent of the other changes in the
>>  series
>> 
>>  arch/powerpc/lib/string.S | 6 ++
>>  1 file changed, 6 insertions(+)
>> 
>> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
>> index a787776822d8..a026d8fa8a99 100644
>> --- a/arch/powerpc/lib/string.S
>> +++ b/arch/powerpc/lib/string.S
>> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>>  mtctr   r5
>>  addir6,r3,-1
>>  addir4,r4,-1
>> +#ifdef CONFIG_PPC64
>>  .balign 16
>> +#endif
>>  1:  lbzur0,1(r4)
>>  cmpwi   0,r0,0
>>  stbur0,1(r6)
>
> The ifdefs are a bit ugly, but you can't argue with the numbers. These
> alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
> the ifetch performance when you have such a loop (although there is
> always a tradeoff for a single iteration).
>
> Would it make sense to define that for 32-bit as well, and you could use
> it here instead of the ifdefs? Small CPUs could just use 0.

Can we do it with a macro in the header, eg. like:

#ifdef CONFIG_PPC64
#define IFETCH_BALIGN   .balign IFETCH_ALIGN_BYTES
#endif

...

addir4,r4,-1
IFETCH_BALIGN
  1:lbzur0,1(r4)


cheers


Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

2018-05-17 Thread Benjamin Herrenschmidt
On Thu, 2018-05-17 at 15:21 +0200, Christophe LEROY wrote:
> > > +static inline int __memcmp8(const void *p, const void *q, int off)
> > > +{
> > > +   s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + 
> > > off));
> > 
> > I always assumed 64bits unaligned access would trigger an exception.
> > Is this correct ?
> 
> As far as I know, an unaligned access will only occur when the operand 
> of lmw, stmw, lwarx, or stwcx. is not aligned.
> 
> Maybe that's different for PPC64 ?

It's very implementation specific.

Recent ppc64 chips generally don't trap (unless it's cache inhibited
space). Earlier variants might trap on page boundaries or segment
boundaries. Some embedded parts are less forgiving... some earlier
POWER chips will trap on unaligned in LE mode...

I wouldn't worry too much about it though. I think if 8xx shows an
improvement then it's probably fine everywhere else :-)

Cheers,
Ben.



Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm

2018-05-17 Thread Benjamin Herrenschmidt
On Thu, 2018-05-17 at 22:10 +1000, Michael Ellerman wrote:
> Christophe Leroy  writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

There have been known cases of 4xx LE ports though none ever made it
upstream ...

> > The patch does the same for the asm parts, for consistency
> > 
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.
> 
> > Signed-off-by: Christophe Leroy 
> > ---
> >  v2: Swapped both patches in the serie to reduce number of impacted
> >  lines and added the same modification in ppc_save_regs()
> > 
> >  arch/powerpc/include/asm/ppc_asm.h  |  5 +
> >  arch/powerpc/kernel/misc.S  | 10 ++
> >  arch/powerpc/kernel/ppc_save_regs.S |  4 
> >  3 files changed, 19 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/ppc_asm.h 
> > b/arch/powerpc/include/asm/ppc_asm.h
> > index 13f7f4c0e1ea..4bb765d0b758 100644
> > --- a/arch/powerpc/include/asm/ppc_asm.h
> > +++ b/arch/powerpc/include/asm/ppc_asm.h
> > @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
> >  #else
> >  #define SAVE_GPR(n, base)  stw n,GPR0+4*(n)(base)
> >  #define REST_GPR(n, base)  lwz n,GPR0+4*(n)(base)
> > +#ifdef CONFIG_CPU_BIG_ENDIAN
> > +#define SAVE_NVGPRS(base)  stmw13, GPR0+4*13(base)
> > +#define REST_NVGPRS(base)  lmw 13, GPR0+4*13(base)
> > +#else
> >  #define SAVE_NVGPRS(base)  SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
> > SAVE_10GPRS(22, base)
> >  #define REST_NVGPRS(base)  REST_GPR(13, base); REST_8GPRS(14, base); \
> > REST_10GPRS(22, base)
> 
> There is no 32-bit little endian, so this is basically dead code now.
> 
> Maybe there'll be a 32-bit LE port one day, but if so we can put the
> code back then.
> 
> So I'll just drop the else case.
> 
> >  #endif
> > +#endif
> >  
> >  #define SAVE_2GPRS(n, base)SAVE_GPR(n, base); SAVE_GPR(n+1, base)
> >  #define SAVE_4GPRS(n, base)SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, 
> > base)
> > diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> > index 746ee0320ad4..a316d90a5c26 100644
> > --- a/arch/powerpc/kernel/misc.S
> > +++ b/arch/powerpc/kernel/misc.S
> > @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
> > PPC_STL r0,0(r3)
> > PPC_STL r1,SZL(r3)
> > PPC_STL r2,2*SZL(r3)
> > +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)
> 
> And this could just be:
> 
> #ifdef CONFIG_PPC32
> 
> > +   mfcrr12
> > +   stmwr12, 3*SZL(r3)
> > +#else
> > mfcrr0
> > PPC_STL r0,3*SZL(r3)
> > PPC_STL r13,4*SZL(r3)
> > @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
> > PPC_STL r29,20*SZL(r3)
> > PPC_STL r30,21*SZL(r3)
> > PPC_STL r31,22*SZL(r3)
> > +#endif
> 
> It's a pity to end up with this basically split in half by ifdefs for
> 32/64-bit, but maybe we can clean that up later.
> 
> cheers


Re: [PATCH v6] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-17 Thread Nicholas Piggin
On Thu, 17 May 2018 12:59:29 +0200 (CEST)
Christophe Leroy  wrote:

> Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
> userspace instruction miss") has shown that limiting the read of
> faulting instruction to likely cases improves performance.
> 
> This patch goes further in this direction by limiting the read
> of the faulting instruction to the only cases where it is definitely
> needed.
> 
> On an MPC885, with the same benchmark app as in the commit referred
> above, we see a reduction of 4000 dTLB misses (approx 3%):
> 
> Before the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  720495838  cpu-cycles( +-  0.04% )
> 141769  dTLB-load-misses  ( +-  0.02% )
>  52722  iTLB-load-misses  ( +-  0.01% )
>  19611  faults( +-  0.02% )
> 
>5.750535176 seconds time elapsed   ( +-  0.16% )
> 
> With the patch:
>  Performance counter stats for './fault 500' (10 runs):
> 
>  717669123  cpu-cycles( +-  0.02% )
> 137344  dTLB-load-misses  ( +-  0.03% )
>  52731  iTLB-load-misses  ( +-  0.01% )
>  19614  faults( +-  0.03% )
> 
>5.728423115 seconds time elapsed   ( +-  0.14% )
> 
> The proper work of the huge stack expansion was tested with the
> following app:
> 
> int main(int argc, char **argv)
> {
>   char buf[1024 * 1025];
> 
>   sprintf(buf, "Hello world !\n");
>   printf(buf);
> 
>   exit(0);
> }
> 
> Signed-off-by: Christophe Leroy 
> ---
>  v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() 
> instead of get_user() in order
>  to move it inside the semaphored area. That removes all the complexity 
> of the patch.
> 
>  v5: Reworked to fit after Benh do_fault improvement and rebased on top of 
> powerpc/merge (65152902e43fef)
> 
>  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
> verification before __get_user_xxx()
> 
>  v3: Do a first try with pagefault disabled before releasing the semaphore
> 
>  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'
> 
>  arch/powerpc/mm/fault.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index c01d627e687a..a7d5cc76a8ce 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -72,8 +72,18 @@ static inline bool notify_page_fault(struct pt_regs *regs)
>  static bool store_updates_sp(struct pt_regs *regs)
>  {
>   unsigned int inst;
> + int ret;
>  
> - if (get_user(inst, (unsigned int __user *)regs->nip))
> + /*
> +  * Using get_user_in_atomic() as reading code around nip can result in
> +  * fault, which may cause a deadlock when called with mmap_sem held,
> +  * however since we are reading the instruction that generated the DSI
> +  * we are handling, the page is necessarily already present.
> +  */
> + pagefault_disable();
> + ret = __get_user_inatomic(inst, (unsigned int __user *)regs->nip);
> + pagefault_enable();
> + if (ret)
>   return false;

Problem is that the page can be removed from page tables between
taking the fault and reading the address here.

That case would be so rare that it should be fine to do a big hammer
fix like drop the mmap_sem, do a fault_in_pages_readable, and then
restart from taking the mmap_sem again.
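
Something along these lines, perhaps (purely illustrative, not code from the
series; the caller restarts its VMA lookup after this returns true):

static bool fault_in_nip(struct mm_struct *mm, struct pt_regs *regs)
{
	up_read(&mm->mmap_sem);

	/* Fault the instruction page in with no locks held */
	if (fault_in_pages_readable((const char __user *)regs->nip,
				    sizeof(unsigned int)))
		return false;		/* really unreadable */

	down_read(&mm->mmap_sem);
	return true;			/* caller redoes the lookup */
}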

Thanks,
Nick


Re: [PATCH v2 1/5] powerpc/lib: move PPC32 specific functions out of string.S

2018-05-17 Thread Segher Boessenkool
On Thu, May 17, 2018 at 12:49:50PM +0200, Christophe Leroy wrote:
> In preparation of optimisation patches, move PPC32 specific
> memcmp() and __clear_user() into string_32.S

> --- /dev/null
> +++ b/arch/powerpc/lib/string_32.S
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * String handling functions for PowerPC32
> + *
> + * Copyright (C) 2018 CS Systemes d'Information
> + *
> + * Author: Christophe Leroy 
> + *
> + */

That is wrong; that code is a plain copy, from 1996 already.


Segher


Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm

2018-05-17 Thread Christophe LEROY



Le 17/05/2018 à 15:15, Segher Boessenkool a écrit :

On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:

Christophe Leroy  writes:

arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
in order to use multiple word instructions in functions entry/exit


True, though that could be a lot simpler because the MULTIPLEWORD value
is only used for PPC32, which is always big endian. I'll send a patch
for that.


Do you mean in the kernel?  Many 32-bit processors can do LE, and many
do not implement multiple or string insns in LE mode.


The patch does the same for the asm parts, for consistency

On processors like the 8xx on which insn fetching is pretty slow,
this speeds up registers save/restore


OK. I've always heard that they should be avoided, but that's coming
from 64-bit land.

I guess we've been enabling this for all 32-bit targets for ever so it
must be a reasonable option.


On 603, load multiple (and string) are one cycle slower than doing all the
loads separately, and store is essentially the same as separate stores.
On 7xx and 7xxx both loads and stores are one cycle slower as multiple
than as separate insns.


That's in theory when the instructions are already in the cache.

But loading several instructions into the cache takes time.

Christophe



load/store multiple are nice for saving/storing registers.


Segher



Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

2018-05-17 Thread Nicholas Piggin
On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy  wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
> 
> A simple benchmark on an 8xx calling 10x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
> 
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
> 
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy 
> ---
>  Was sent already as part of a series optimising string functions.
>  Resending on its own as it is independent of the other changes in the
>  series
> 
>  arch/powerpc/lib/string.S | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>   mtctr   r5
>   addir6,r3,-1
>   addir4,r4,-1
> +#ifdef CONFIG_PPC64
>   .balign 16
> +#endif
>  1:   lbzur0,1(r4)
>   cmpwi   0,r0,0
>   stbur0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.

Thanks,
Nick


Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

2018-05-17 Thread Christophe LEROY



Le 17/05/2018 à 15:03, Mathieu Malaterre a écrit :

On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
 wrote:

In my 8xx configuration, I get 208 calls to memcmp()
Within those 208 calls, about half of them have constant sizes,
46 have a size of 8, 17 have a size of 16, only a few have a
size over 16. Other fixed sizes are mostly 4, 6 and 10.

This patch inlines calls to memcmp() when size
is constant and lower than or equal to 16

In my 8xx configuration, this reduces the number of calls
to memcmp() from 208 to 123

The following table shows the number of TB timeticks to perform
a constant size memcmp() before and after the patch depending on
the size

 Before  After   Improvement
01:  75775682   25%
02: 416685682   86%
03: 51137   13258   74%
04: 454555682   87%
05: 58713   13258   77%
06: 58712   13258   77%
07: 68183   20834   70%
08: 56819   15153   73%
09: 70077   28411   60%
10: 70077   28411   60%
11: 79546   35986   55%
12: 68182   28411   58%
13: 81440   35986   55%
14: 81440   39774   51%
15: 94697   43562   54%
16: 79546   37881   52%

Signed-off-by: Christophe Leroy 
---
  arch/powerpc/include/asm/string.h | 46 +++
  1 file changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 35f1aaad9b50..80cf0f9605dd 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -4,6 +4,8 @@

  #ifdef __KERNEL__

+#include 
+
  #define __HAVE_ARCH_STRNCPY
  #define __HAVE_ARCH_STRNCMP
  #define __HAVE_ARCH_MEMSET
@@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, 
__kernel_size_t size)
 return __strncmp(p, q, size);
  }

+static inline int __memcmp1(const void *p, const void *q, int off)


Does that change anything if you change void* to char* pointer ? I
find void* arithmetic hard to read.


Yes that's not the same

void* means you can use any pointer, for instance pointers to two 
structs you want to compare.


char* will force users to cast their pointers to char*




+{
+   return *(u8*)(p + off) - *(u8*)(q + off);
+}
+
+static inline int __memcmp2(const void *p, const void *q, int off)
+{
+   return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
+}
+
+static inline int __memcmp4(const void *p, const void *q, int off)
+{
+   return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
+}
+
+static inline int __memcmp8(const void *p, const void *q, int off)
+{
+   s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));


I always assumed 64bits unaligned access would trigger an exception.
Is this correct ?


As far as I know, an unaligned access will only occur when the operand 
of lmw, stmw, lwarx, or stwcx. is not aligned.


Maybe that's different for PPC64 ?

Christophe




+   return tmp >> 32 ? : (int)tmp;
+}
+
+static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t 
size)
+{
+   if (size == 1)
+   return __memcmp1(p, q, 0);
+   if (size == 2)
+   return __memcmp2(p, q, 0);
+   if (size == 3)
+   return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
+   if (size == 4)
+   return __memcmp4(p, q, 0);
+   if (size == 5)
+   return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
+   if (size == 6)
+   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
+   if (size == 7)
+   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : 
__memcmp1(p, q, 6);
+   return __memcmp8(p, q, 0);
+}
+
  static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
  {
 if (unlikely(!size))
 return 0;
+   if (__builtin_constant_p(size) && size <= 8)
+   return __memcmp_cst(p, q, size);
+   if (__builtin_constant_p(size) && size <= 16)
+   return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 
8);
 return __memcmp(p, q, size);
  }

--
2.13.3



Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm

2018-05-17 Thread Segher Boessenkool
On Thu, May 17, 2018 at 10:10:21PM +1000, Michael Ellerman wrote:
> Christophe Leroy  writes:
> > arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> > in order to use multiple word instructions in functions entry/exit
> 
> True, though that could be a lot simpler because the MULTIPLEWORD value
> is only used for PPC32, which is always big endian. I'll send a patch
> for that.

Do you mean in the kernel?  Many 32-bit processors can do LE, and many
do not implement multiple or string insns in LE mode.

> > The patch does the same for the asm parts, for consistency
> >
> > On processors like the 8xx on which insn fetching is pretty slow,
> > this speeds up registers save/restore
> 
> OK. I've always heard that they should be avoided, but that's coming
> from 64-bit land.
> 
> I guess we've been enabling this for all 32-bit targets for ever so it
> must be a reasonable option.

On 603, load multiple (and string) are one cycle slower than doing all the
loads separately, and store is essentially the same as separate stores.
On 7xx and 7xxx both loads and stores are one cycle slower as multiple
than as separate insns.

load/store multiple are nice for saving/storing registers.


Segher


Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

2018-05-17 Thread Mathieu Malaterre
On Thu, May 17, 2018 at 12:49 PM, Christophe Leroy
 wrote:
> In my 8xx configuration, I get 208 calls to memcmp()
> Within those 208 calls, about half of them have constant sizes,
> 46 have a size of 8, 17 have a size of 16, only a few have a
> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>
> This patch inlines calls to memcmp() when size
> is constant and lower than or equal to 16
>
> In my 8xx configuration, this reduces the number of calls
> to memcmp() from 208 to 123
>
> The following table shows the number of TB timeticks to perform
> a constant size memcmp() before and after the patch depending on
> the size
>
> Before  After   Improvement
> 01:  75775682   25%
> 02: 416685682   86%
> 03: 51137   13258   74%
> 04: 454555682   87%
> 05: 58713   13258   77%
> 06: 58712   13258   77%
> 07: 68183   20834   70%
> 08: 56819   15153   73%
> 09: 70077   28411   60%
> 10: 70077   28411   60%
> 11: 79546   35986   55%
> 12: 68182   28411   58%
> 13: 81440   35986   55%
> 14: 81440   39774   51%
> 15: 94697   43562   54%
> 16: 79546   37881   52%
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/string.h | 46 
> +++
>  1 file changed, 46 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/string.h 
> b/arch/powerpc/include/asm/string.h
> index 35f1aaad9b50..80cf0f9605dd 100644
> --- a/arch/powerpc/include/asm/string.h
> +++ b/arch/powerpc/include/asm/string.h
> @@ -4,6 +4,8 @@
>
>  #ifdef __KERNEL__
>
> +#include 
> +
>  #define __HAVE_ARCH_STRNCPY
>  #define __HAVE_ARCH_STRNCMP
>  #define __HAVE_ARCH_MEMSET
> @@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, 
> __kernel_size_t size)
> return __strncmp(p, q, size);
>  }
>
> +static inline int __memcmp1(const void *p, const void *q, int off)

Does that change anything if you change void* to char* pointer ? I
find void* arithmetic hard to read.

> +{
> +   return *(u8*)(p + off) - *(u8*)(q + off);
> +}
> +
> +static inline int __memcmp2(const void *p, const void *q, int off)
> +{
> +   return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
> +}
> +
> +static inline int __memcmp4(const void *p, const void *q, int off)
> +{
> +   return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
> +}
> +
> +static inline int __memcmp8(const void *p, const void *q, int off)
> +{
> +   s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + 
> off));

I always assumed 64bits unaligned access would trigger an exception.
Is this correct ?

> +   return tmp >> 32 ? : (int)tmp;
> +}
> +
> +static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t 
> size)
> +{
> +   if (size == 1)
> +   return __memcmp1(p, q, 0);
> +   if (size == 2)
> +   return __memcmp2(p, q, 0);
> +   if (size == 3)
> +   return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
> +   if (size == 4)
> +   return __memcmp4(p, q, 0);
> +   if (size == 5)
> +   return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
> +   if (size == 6)
> +   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
> +   if (size == 7)
> +   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : 
> __memcmp1(p, q, 6);
> +   return __memcmp8(p, q, 0);
> +}
> +
>  static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
>  {
> if (unlikely(!size))
> return 0;
> +   if (__builtin_constant_p(size) && size <= 8)
> +   return __memcmp_cst(p, q, size);
> +   if (__builtin_constant_p(size) && size <= 16)
> +   return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size 
> - 8);
> return __memcmp(p, q, size);
>  }
>
> --
> 2.13.3
>
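
As an illustrative example of the kind of call site this optimisation
targets (not taken from the patch itself), a fixed-size comparison such as
the one below now expands inline into one 32-bit and one 16-bit big-endian
compare instead of calling __memcmp():

static inline bool mac_equal(const u8 *a, const u8 *b)
{
	/* ETH_ALEN (from <linux/if_ether.h>) is 6, a compile-time
	 * constant <= 16, so memcmp() is expanded inline here. */
	return memcmp(a, b, ETH_ALEN) == 0;
}
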


Re: [PATCH 0/3] Add support to disable sensor groups in P9

2018-05-17 Thread Guenter Roeck

On 05/16/2018 11:10 PM, Shilpasri G Bhat wrote:



On 05/15/2018 08:32 PM, Guenter Roeck wrote:

On Thu, Mar 22, 2018 at 04:24:32PM +0530, Shilpasri G Bhat wrote:

This patch series adds support to enable/disable OCC based
inband-sensor groups at runtime. The environmental sensor groups are
managed in HWMON and the remaining platform specific sensor groups are
managed in /sys/firmware/opal.

The firmware changes required for this patch are posted below:
https://lists.ozlabs.org/pipermail/skiboot/2018-March/010812.html



Sorry for not getting back earlier. This is a tough one.



Thanks for the reply. I have tried to answer your questions according to my
understanding below:


Key problem is that you are changing the ABI with those new attributes.
On top of that, the attributes _do_ make some sense (many chips support
enabling/disabling of individual sensors), suggesting that those or
similar attributes may or even should at some point be added to the ABI.

At the same time, returning "0" as measurement values when sensors are
disabled does not seem like a good idea, since "0" is a perfectly valid
measurement, at least for most sensors.


I agree.



Given that, we need to have a discussion about adding _enable attributes to
the ABI



what is the scope,

IIUC the scope should be RW and the attribute is defined for each supported
sensor group



That is _your_ need. I am not aware of any other chip where a per-sensor group
attribute would make sense. The discussion we need has to extend beyond the need
of a single chip.

Guenter


when should the attributes exist and when not,

We control this currently via device-tree


do we want/need power_enable or powerX_enable or both, and so on), and

We need power_enable right now


what to return if a sensor is disabled (such as -ENODATA).

-ENODATA sounds good.

Thanks and Regards,
Shilpa

Once we have an agreement, we can continue with an implementation.

Guenter


Shilpasri G Bhat (3):
   powernv:opal-sensor-groups: Add support to enable sensor groups
   hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
   powernv: opal-sensor-groups: Add attributes to disable/enable sensors

  .../ABI/testing/sysfs-firmware-opal-sensor-groups  |  34 ++
  Documentation/hwmon/ibmpowernv |  31 -
  arch/powerpc/include/asm/opal-api.h|   4 +-
  arch/powerpc/include/asm/opal.h|   2 +
  .../powerpc/platforms/powernv/opal-sensor-groups.c | 104 -
  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
  drivers/hwmon/ibmpowernv.c | 127 +++--
  7 files changed, 265 insertions(+), 38 deletions(-)
  create mode 100644 Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups

--
1.8.3.1


Re: [PATCH v2 2/2] powerpc/32be: use stmw/lmw for registers save/restore in asm

2018-05-17 Thread Michael Ellerman
Christophe Leroy  writes:
> arch/powerpc/Makefile activates -mmultiple on BE PPC32 configs
> in order to use multiple word instructions in function entry/exit.

True, though that could be a lot simpler because the MULTIPLEWORD value
is only used for PPC32, which is always big endian. I'll send a patch
for that.

> The patch does the same for the asm parts, for consistency
>
> On processors like the 8xx on which insn fetching is pretty slow,
> this speeds up register save/restore.

OK. I've always heard that they should be avoided, but that's coming
from 64-bit land.

I guess we've been enabling this for all 32-bit targets for ever so it
must be a reasonable option.

> Signed-off-by: Christophe Leroy 
> ---
>  v2: Swapped both patches in the serie to reduce number of impacted
>  lines and added the same modification in ppc_save_regs()
>
>  arch/powerpc/include/asm/ppc_asm.h  |  5 +
>  arch/powerpc/kernel/misc.S  | 10 ++
>  arch/powerpc/kernel/ppc_save_regs.S |  4 
>  3 files changed, 19 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/ppc_asm.h 
> b/arch/powerpc/include/asm/ppc_asm.h
> index 13f7f4c0e1ea..4bb765d0b758 100644
> --- a/arch/powerpc/include/asm/ppc_asm.h
> +++ b/arch/powerpc/include/asm/ppc_asm.h
> @@ -80,11 +80,16 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  #else
>  #define SAVE_GPR(n, base)stw n,GPR0+4*(n)(base)
>  #define REST_GPR(n, base)lwz n,GPR0+4*(n)(base)
> +#ifdef CONFIG_CPU_BIG_ENDIAN
> +#define SAVE_NVGPRS(base)stmw13, GPR0+4*13(base)
> +#define REST_NVGPRS(base)lmw 13, GPR0+4*13(base)
> +#else
>  #define SAVE_NVGPRS(base)SAVE_GPR(13, base); SAVE_8GPRS(14, base); \
>   SAVE_10GPRS(22, base)
>  #define REST_NVGPRS(base)REST_GPR(13, base); REST_8GPRS(14, base); \
>   REST_10GPRS(22, base)

There is no 32-bit little endian, so this is basically dead code now.

Maybe there'll be a 32-bit LE port one day, but if so we can put the
code back then.

So I'll just drop the else case.

>  #endif
> +#endif
>  
>  #define SAVE_2GPRS(n, base)  SAVE_GPR(n, base); SAVE_GPR(n+1, base)
>  #define SAVE_4GPRS(n, base)  SAVE_2GPRS(n, base); SAVE_2GPRS(n+2, base)
> diff --git a/arch/powerpc/kernel/misc.S b/arch/powerpc/kernel/misc.S
> index 746ee0320ad4..a316d90a5c26 100644
> --- a/arch/powerpc/kernel/misc.S
> +++ b/arch/powerpc/kernel/misc.S
> @@ -49,6 +49,10 @@ _GLOBAL(setjmp)
>   PPC_STL r0,0(r3)
>   PPC_STL r1,SZL(r3)
>   PPC_STL r2,2*SZL(r3)
> +#if defined(CONFIG_PPC32) && defined(CONFIG_CPU_BIG_ENDIAN)

And this could just be:

#ifdef CONFIG_PPC32

> + mfcrr12
> + stmwr12, 3*SZL(r3)
> +#else
>   mfcrr0
>   PPC_STL r0,3*SZL(r3)
>   PPC_STL r13,4*SZL(r3)
> @@ -70,10 +74,15 @@ _GLOBAL(setjmp)
>   PPC_STL r29,20*SZL(r3)
>   PPC_STL r30,21*SZL(r3)
>   PPC_STL r31,22*SZL(r3)
> +#endif

It's a pity to end up with this basically split in half by ifdefs for
32/64-bit, but maybe we can clean that up later.

cheers


[PATCH v11 24/26] x86/mm: add speculative pagefault handling

2018-05-17 Thread Laurent Dufour
From: Peter Zijlstra 

Try a speculative fault before acquiring mmap_sem; if it returns with
VM_FAULT_RETRY, continue with the mmap_sem acquisition and do the
traditional fault.
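
In outline, the arch fault handler takes the following shape (a simplified
sketch of the hunk below; the real code additionally skips the speculative
path for protection-key faults):

	fault = handle_speculative_fault(mm, address, flags);
	if (fault != VM_FAULT_RETRY) {
		/* handled (or definitively failed) without taking mmap_sem */
		perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
		goto done;
	}

	/* VM_FAULT_RETRY: fall back to the classic path under mmap_sem */
	down_read(&mm->mmap_sem);
	...
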

Signed-off-by: Peter Zijlstra (Intel) 

[Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
 handle_speculative_fault()]
[Retry with the usual fault path in the case VM_ERROR is returned by
 handle_speculative_fault(). This allows signals to be delivered]
[Don't build SPF call if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Handle memory protection key fault]
Signed-off-by: Laurent Dufour 
---
 arch/x86/mm/fault.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fd84edf82252..11944bfc805a 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1224,7 +1224,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
struct mm_struct *mm;
int fault, major = 0;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
-   u32 pkey;
+   u32 pkey, *pt_pkey = 
 
tsk = current;
mm = tsk->mm;
@@ -1314,6 +1314,27 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
flags |= FAULT_FLAG_INSTRUCTION;
 
/*
+* Do not try to do a speculative page fault if the fault was due to
+* protection keys since it can't be resolved.
+*/
+   if (!(error_code & X86_PF_PK)) {
+   fault = handle_speculative_fault(mm, address, flags);
+   if (fault != VM_FAULT_RETRY) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+   /*
+* Do not advertise for the pkey value since we don't
+* know it.
+* This is not a matter as we checked for X86_PF_PK
+* earlier, so we should not handle pkey fault here,
+* but to be sure that mm_fault_error() callees will
+* not try to use it, we invalidate the pointer.
+*/
+   pt_pkey = NULL;
+   goto done;
+   }
+   }
+
+   /*
 * When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in
 * the kernel and should generate an OOPS.  Unfortunately, in the
@@ -1427,8 +1448,10 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
error_code,
}
 
up_read(>mmap_sem);
+
+done:
if (unlikely(fault & VM_FAULT_ERROR)) {
-   mm_fault_error(regs, error_code, address, , fault);
+   mm_fault_error(regs, error_code, address, pt_pkey, fault);
return;
}
 
-- 
2.7.4



[PATCH v11 26/26] arm64/mm: add speculative page fault

2018-05-17 Thread Laurent Dufour
From: Mahendran Ganesh 

This patch enables the speculative page fault on the arm64
architecture.

I completed the SPF porting in 4.9. From the test results,
we can see that app launch time improved by about 10% on average.
For apps which have more than 50 threads, an improvement of 15% or
even more can be obtained.

Signed-off-by: Ganesh Mahendran 

[handle_speculative_fault() is no more returning the vma pointer]
Signed-off-by: Laurent Dufour 
---
 arch/arm64/mm/fault.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 91c53a7d2575..fb9f840367f9 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -411,6 +411,16 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
/*
+* let's try a speculative page fault without grabbing the
+* mmap_sem.
+*/
+   fault = handle_speculative_fault(mm, addr, mm_flags);
+   if (fault != VM_FAULT_RETRY) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, addr);
+   goto done;
+   }
+
+   /*
 * As per x86, we may deadlock here. However, since the kernel only
 * validly references user space from well defined areas of the code,
 * we can bug out early if this is from code which shouldn't.
@@ -460,6 +470,8 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
}
up_read(>mmap_sem);
 
+done:
+
/*
 * Handle the "normal" (no error) case first.
 */
-- 
2.7.4



[PATCH v11 25/26] powerpc/mm: add speculative page fault

2018-05-17 Thread Laurent Dufour
This patch enables the speculative page fault on the PowerPC
architecture.

This will try a speculative page fault without holding the mmap_sem;
if it returns with VM_FAULT_RETRY, the mmap_sem is acquired and the
traditional page fault processing is done.

The speculative path is only tried for multithreaded processes as there is
no risk of contention on the mmap_sem otherwise.
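
The multithreaded-only check is not in the hunk below; it lives in the
generic handle_speculative_fault() wrapper (patch 19 mentions a check
against mm->mm_users). Roughly, and only as an assumed sketch:

static inline int handle_speculative_fault(struct mm_struct *mm,
					   unsigned long address,
					   unsigned int flags)
{
	/*
	 * A single-threaded process cannot contend on the mmap_sem,
	 * so skip the speculative path and ask for the classic retry.
	 */
	if (unlikely(atomic_read(&mm->mm_users) == 1))
		return VM_FAULT_RETRY;

	return __handle_speculative_fault(mm, address, flags);
}
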

Signed-off-by: Laurent Dufour 
---
 arch/powerpc/mm/fault.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index ef268d5d9db7..d7b5742ffb2b 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -465,6 +465,21 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
if (is_exec)
flags |= FAULT_FLAG_INSTRUCTION;
 
+   /*
+* Try speculative page fault before grabbing the mmap_sem.
+* The Page fault is done if VM_FAULT_RETRY is not returned.
+* But if the memory protection keys are active, we don't know if the
+* fault is due to key mistmatch or due to a classic protection check.
+* To differentiate that, we will need the VMA we no more have, so
+* let's retry with the mmap_sem held.
+*/
+   fault = handle_speculative_fault(mm, address, flags);
+   if (fault != VM_FAULT_RETRY && (IS_ENABLED(CONFIG_PPC_MEM_KEYS) &&
+   fault != VM_FAULT_SIGSEGV)) {
+   perf_sw_event(PERF_COUNT_SW_SPF, 1, regs, address);
+   goto done;
+   }
+
/* When running in the kernel we expect faults to occur only to
 * addresses in user space.  All other faults represent errors in the
 * kernel and should generate an OOPS.  Unfortunately, in the case of an
@@ -565,6 +580,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
 
up_read(>mm->mmap_sem);
 
+done:
if (unlikely(fault & VM_FAULT_ERROR))
return mm_fault_error(regs, address, fault);
 
-- 
2.7.4



[PATCH v11 13/26] mm: cache some VMA fields in the vm_fault structure

2018-05-17 Thread Laurent Dufour
When handling a speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no longer any guarantee that these fields will not change behind
our back. They are therefore saved in the vm_fault structure before the VMA
is checked for changes.

In detail, when we deal with a speculative page fault, the mmap_sem is
not taken, so parallel VMA changes can occur. When a VMA change is
done which will impact the page fault processing, we assume that the VMA
sequence counter will be changed.  In the page fault processing, at the
time the PTE is locked, we check the VMA sequence counter to detect
changes done behind our back. If no change is detected we can continue
further. But this doesn't prevent the VMA from being changed behind our
back while the PTE is locked. So VMA fields which are used while the PTE
is locked must be saved to ensure that we are using *static* values.
This is important since the PTE changes will be made with regard to these
VMA fields and they need to be consistent. This concerns the vma->vm_flags
and vma->vm_page_prot VMA fields.

This patch also sets the fields in hugetlb_no_page() and
__collapse_huge_page_swapin() even if it is not needed by the callee.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 10 --
 mm/huge_memory.c   |  6 +++---
 mm/hugetlb.c   |  2 ++
 mm/khugepaged.c|  2 ++
 mm/memory.c| 50 ++
 mm/migrate.c   |  2 +-
 6 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3f8b2ce0ef7c..f385d721867d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -373,6 +373,12 @@ struct vm_fault {
 * page table to avoid allocation from
 * atomic context.
 */
+   /*
+* These entries are required when handling speculative page fault.
+* This way the page handling is done using consistent field values.
+*/
+   unsigned long vma_flags;
+   pgprot_t vma_page_prot;
 };
 
 /* page entry size for vm->huge_fault() */
@@ -693,9 +699,9 @@ void free_compound_page(struct page *page);
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
  * that do not have writing enabled, when used by access_process_vm.
  */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
 {
-   if (likely(vma->vm_flags & VM_WRITE))
+   if (likely(vma_flags & VM_WRITE))
pte = pte_mkwrite(pte);
return pte;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 323acdd14e6e..6bf5420cc62e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1194,8 +1194,8 @@ static int do_huge_pmd_wp_page_fallback(struct vm_fault 
*vmf, pmd_t orig_pmd,
 
for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
pte_t entry;
-   entry = mk_pte(pages[i], vma->vm_page_prot);
-   entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+   entry = mk_pte(pages[i], vmf->vma_page_prot);
+   entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
@@ -2168,7 +2168,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
entry = pte_swp_mksoft_dirty(entry);
} else {
entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
-   entry = maybe_mkwrite(entry, vma);
+   entry = maybe_mkwrite(entry, vma->vm_flags);
if (!write)
entry = pte_wrprotect(entry);
if (!young)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 129088710510..d7764b6568f5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3718,6 +3718,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
vm_area_struct *vma,
.vma = vma,
.address = address,
.flags = flags,
+   .vma_flags = vma->vm_flags,
+   .vma_page_prot = vma->vm_page_prot,
/*
 * Hard to debug if it ends up being
 * used by a callee that assumes
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0b28af4b950d..2b02a9f9589e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -887,6 +887,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
*mm,
.flags = FAULT_FLAG_ALLOW_RETRY,

[PATCH v11 22/26] perf tools: add support for the SPF perf event

2018-05-17 Thread Laurent Dufour
Add support for the new speculative faults event.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 tools/include/uapi/linux/perf_event.h | 1 +
 tools/perf/util/evsel.c   | 1 +
 tools/perf/util/parse-events.c| 4 
 tools/perf/util/parse-events.l| 1 +
 tools/perf/util/python.c  | 1 +
 5 files changed, 8 insertions(+)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index b8e288a1f740..e2b74c055f51 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 4cd2cf93f726..088ed45c68c1 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -429,6 +429,7 @@ const char *perf_evsel__sw_names[PERF_COUNT_SW_MAX] = {
"alignment-faults",
"emulation-faults",
"dummy",
+   "speculative-faults",
 };
 
 static const char *__perf_evsel__sw_name(u64 config)
diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index 2fb0272146d8..54719f566314 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -140,6 +140,10 @@ struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
.symbol = "bpf-output",
.alias  = "",
},
+   [PERF_COUNT_SW_SPF] = {
+   .symbol = "speculative-faults",
+   .alias  = "spf",
+   },
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index a1a01b1ac8b8..86584d3a3068 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -308,6 +308,7 @@ emulation-faults{ return 
sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EM
 dummy  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 duration_time  { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 bpf-output { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_BPF_OUTPUT); }
+speculative-faults|spf { return sym(yyscanner, 
PERF_TYPE_SOFTWARE, PERF_COUNT_SW_SPF); }
 
/*
 * We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately.
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index 863b61478edd..df4f7ff9bdff 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -1181,6 +1181,7 @@ static struct {
PERF_CONST(COUNT_SW_ALIGNMENT_FAULTS),
PERF_CONST(COUNT_SW_EMULATION_FAULTS),
PERF_CONST(COUNT_SW_DUMMY),
+   PERF_CONST(COUNT_SW_SPF),
 
PERF_CONST(SAMPLE_IP),
PERF_CONST(SAMPLE_TID),
-- 
2.7.4



[PATCH v11 23/26] mm: add speculative page fault vmstats

2018-05-17 Thread Laurent Dufour
Add a speculative_pgfault vmstat counter to count successfully handled
speculative page faults.

Also fix a minor typo in include/linux/vm_event_item.h.

Signed-off-by: Laurent Dufour 
---
 include/linux/vm_event_item.h | 3 +++
 mm/memory.c   | 3 +++
 mm/vmstat.c   | 5 -
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5c7f010676a7..a240acc09684 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
SWAP_RA,
SWAP_RA_HIT,
 #endif
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   SPECULATIVE_PGFAULT,
+#endif
NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/memory.c b/mm/memory.c
index 30433bde32f2..48e1cf0a54ef 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4509,6 +4509,9 @@ int __handle_speculative_fault(struct mm_struct *mm, 
unsigned long address,
 
put_vma(vma);
 
+   if (ret != VM_FAULT_RETRY)
+   count_vm_event(SPECULATIVE_PGFAULT);
+
/*
 * The task may have entered a memcg OOM situation but
 * if the allocation error was handled gracefully (no
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a2b9518980ce..3af74498a969 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1289,7 +1289,10 @@ const char * const vmstat_text[] = {
"swap_ra",
"swap_ra_hit",
 #endif
-#endif /* CONFIG_VM_EVENTS_COUNTERS */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   "speculative_pgfault",
+#endif
+#endif /* CONFIG_VM_EVENT_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
 
-- 
2.7.4



[PATCH v11 21/26] perf: add a speculative page fault sw event

2018-05-17 Thread Laurent Dufour
Add a new software event to count successful speculative page faults.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/uapi/linux/perf_event.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b8e288a1f740..e2b74c055f51 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -112,6 +112,7 @@ enum perf_sw_ids {
PERF_COUNT_SW_EMULATION_FAULTS  = 8,
PERF_COUNT_SW_DUMMY = 9,
PERF_COUNT_SW_BPF_OUTPUT= 10,
+   PERF_COUNT_SW_SPF   = 11,
 
PERF_COUNT_SW_MAX,  /* non-ABI */
 };
-- 
2.7.4



[PATCH v11 20/26] mm: adding speculative page fault failure trace events

2018-05-17 Thread Laurent Dufour
This patch adds a set of new trace events to collect speculative page fault
failure events.

Signed-off-by: Laurent Dufour 
---
 include/trace/events/pagefault.h | 80 
 mm/memory.c  | 57 ++--
 2 files changed, 125 insertions(+), 12 deletions(-)
 create mode 100644 include/trace/events/pagefault.h

diff --git a/include/trace/events/pagefault.h b/include/trace/events/pagefault.h
new file mode 100644
index ..d9438f3e6bad
--- /dev/null
+++ b/include/trace/events/pagefault.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pagefault
+
+#if !defined(_TRACE_PAGEFAULT_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGEFAULT_H
+
+#include 
+#include 
+
+DECLARE_EVENT_CLASS(spf,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, caller)
+   __field(unsigned long, vm_start)
+   __field(unsigned long, vm_end)
+   __field(unsigned long, address)
+   ),
+
+   TP_fast_assign(
+   __entry->caller = caller;
+   __entry->vm_start   = vma->vm_start;
+   __entry->vm_end = vma->vm_end;
+   __entry->address= address;
+   ),
+
+   TP_printk("ip:%lx vma:%lx-%lx address:%lx",
+ __entry->caller, __entry->vm_start, __entry->vm_end,
+ __entry->address)
+);
+
+DEFINE_EVENT(spf, spf_vma_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_noanon,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_notsup,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_vma_access,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+DEFINE_EVENT(spf, spf_pmd_changed,
+
+   TP_PROTO(unsigned long caller,
+struct vm_area_struct *vma, unsigned long address),
+
+   TP_ARGS(caller, vma, address)
+);
+
+#endif /* _TRACE_PAGEFAULT_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/mm/memory.c b/mm/memory.c
index 78c7b9cd..30433bde32f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -80,6 +80,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 #if defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS) && !defined(CONFIG_COMPILE_TEST)
 #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for 
last_cpupid.
 #endif
@@ -2324,8 +2327,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
 
 again:
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2333,8 +2338,10 @@ static bool pte_spinlock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
@@ -2345,6 +2352,7 @@ static bool pte_spinlock(struct vm_fault *vmf)
 
if (vma_has_changed(vmf)) {
spin_unlock(vmf->ptl);
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
}
 
@@ -2378,8 +2386,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 */
 again:
local_irq_disable();
-   if (vma_has_changed(vmf))
+   if (vma_has_changed(vmf)) {
+   trace_spf_vma_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
@@ -2387,8 +2397,10 @@ static bool pte_map_lock(struct vm_fault *vmf)
 * is not a huge collapse operation in progress in our back.
 */
pmdval = READ_ONCE(*vmf->pmd);
-   if (!pmd_same(pmdval, vmf->orig_pmd))
+   if (!pmd_same(pmdval, vmf->orig_pmd)) {
+   trace_spf_pmd_changed(_RET_IP_, vmf->vma, vmf->address);
goto out;
+   }
 #endif
 
/*
@@ -2408,6 +2420,7 @@ static bool pte_map_lock(struct vm_fault *vmf)
 
if 

[PATCH v11 19/26] mm: provide speculative fault infrastructure

2018-05-17 Thread Laurent Dufour
From: Peter Zijlstra 

Provide infrastructure to do a speculative fault (not holding
mmap_sem).

Not holding mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including unmapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
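
The change detection relies on the per-VMA sequence count (vm_sequence)
introduced earlier in the series. A rough sketch of the pattern, using the
standard seqcount API (illustrative only; the actual logic lives in
vma_has_changed() and __handle_speculative_fault()):

	seq = raw_read_seqcount(&vma->vm_sequence);	/* snapshot before the walk */
	if (seq & 1)
		return VM_FAULT_RETRY;			/* a writer is touching the VMA */

	/* ... gup_fast()-style page-table walk, page allocation ... */

	if (read_seqcount_retry(&vma->vm_sequence, seq))
		return VM_FAULT_RETRY;			/* the VMA changed behind us */

	/* the state we started with is still valid: update the PTE */
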

Signed-off-by: Peter Zijlstra (Intel) 

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmd.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explicitly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
 other CPU's invalidating the TLB and requiring this CPU to catch the
 inter processor's interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
 mm->mm_users in handle_speculative_fault() defined in mm.h]
Signed-off-by: Laurent Dufour 
---
 include/linux/hugetlb_inline.h |   2 +-
 include/linux/mm.h |  30 
 include/linux/pagemap.h|   4 +-
 mm/internal.h  |  16 +-
 mm/memory.c| 340 -
 5 files changed, 385 insertions(+), 7 deletions(-)

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..9e25283d6fc9 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-   return !!(vma->vm_flags & VM_HUGETLB);
+   return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
 }
 
 #else
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05cbba70104b..31acf98a7d92 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER0x40/* The fault originated in 
userspace */
 #define FAULT_FLAG_REMOTE  0x80/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100  /* The fault was during an instruction 
fetch */
+#define FAULT_FLAG_SPECULATIVE 0x200   /* Speculative fault, not holding 
mmap_sem */
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -343,6 +344,10 @@ struct vm_fault {
gfp_t gfp_mask; /* gfp mask to be used for allocations 
*/
pgoff_t pgoff;  /* Logical page offset based on vma */
unsigned long address;  /* Faulting virtual address */
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   unsigned int sequence;
+   pmd_t orig_pmd; /* value of PMD at the time of fault */
+#endif
pmd_t *pmd; /* Pointer to pmd entry matching
 * the 'address' */
pud_t *pud; /* Pointer to pud entry matching
@@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags);
+
+#ifdef 

[PATCH v11 18/26] mm: protect mm_rb tree with a rwlock

2018-05-17 Thread Laurent Dufour
This change is inspired by Peter's proposal patch [1] which was
protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
that particular case, and it introduces major performance degradation
due to excessive scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
protects its access using a rwlock.  As the mm_rb tree is an O(log n)
search it is safe to protect it using such a lock.  The VMA cache is not
protected by the new rwlock and it should not be used without holding the
mmap_sem.

To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1.  Each time the VMA is picked with the rwlock held its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, the VMA is no longer used and should be freed.

This patch is preparing for 2 kinds of VMA access:
 - as usual, under the control of the mmap_sem,
 - without holding the mmap_sem for the speculative page fault handler.

Access done under the control of the mmap_sem doesn't require grabbing the
rwlock to protect read access to the mm_rb tree, but write access must
still be done under the protection of the rwlock. This affects inserting
and removing elements in the RB tree.

The patch is introducing 2 new functions:
 - vma_get() to find a VMA based on an address while holding the new rwlock.
 - vma_put() to release the VMA when it is no longer used.
These services are designed to be used when accesses are made to the RB tree
without holding the mmap_sem.
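
A sketch of the intended lookup/release pattern (illustrative only; note
that the memory.c hunks later in the series spell the release helper
put_vma()):

	vma = vma_get(mm, address);	/* read-locks mm_rb_lock, bumps vm_ref_count */
	if (!vma)
		return VM_FAULT_RETRY;	/* nothing found, fall back to the classic path */

	/* ... speculative handling, validated against vma->vm_sequence ... */

	vma_put(vma);			/* drops vm_ref_count; the VMA is freed at 0 */
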

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.

When free_vma() is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remain in use until
the VMA's use count reaches 0, which may happen later when exiting an
in-progress speculative page fault.

[1] https://patchwork.kernel.org/patch/5108281/

Cc: Peter Zijlstra (Intel) 
Cc: Matthew Wilcox 
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h   |   1 +
 include/linux/mm_types.h |   4 ++
 kernel/fork.c|   3 ++
 mm/init-mm.c |   3 ++
 mm/internal.h|   6 +++
 mm/mmap.c| 115 +++
 6 files changed, 104 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bcebec117d4d..05cbba70104b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1314,6 +1314,7 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
INIT_LIST_HEAD(>anon_vma_chain);
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_init(>vm_sequence);
+   atomic_set(>vm_ref_count, 1);
 #endif
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb5962308183..b16ba02f7fd6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -337,6 +337,7 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 #ifdef CONFIG_SPECULATIVE_PAGE_FAULT
seqcount_t vm_sequence;
+   atomic_t vm_ref_count;  /* see vma_get(), vma_put() */
 #endif
 } __randomize_layout;
 
@@ -355,6 +356,9 @@ struct kioctx_table;
 struct mm_struct {
struct vm_area_struct *mmap;/* list of VMAs */
struct rb_root mm_rb;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_t mm_rb_lock;
+#endif
u32 vmacache_seqnum;   /* per-thread vmacache */
 #ifdef CONFIG_MMU
unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 99198a02efe9..f1258c2ade09 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -907,6 +907,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
mm->vmacache_seqnum = 0;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   rwlock_init(>mm_rb_lock);
+#endif
atomic_set(>mm_users, 1);
atomic_set(>mm_count, 1);
init_rwsem(>mmap_sem);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index f0179c9c04c2..228134f5a336 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -17,6 +17,9 @@
 
 struct mm_struct init_mm = {
.mm_rb  = RB_ROOT,
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   .mm_rb_lock = __RW_LOCK_UNLOCKED(init_mm.mm_rb_lock),
+#endif
.pgd= swapper_pg_dir,
.mm_users   = ATOMIC_INIT(2),
.mm_count   = ATOMIC_INIT(1),
diff --git a/mm/internal.h b/mm/internal.h
index 62d8c34e63d5..fb2667b20f0a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -40,6 +40,12 @@ void page_writeback_init(void);
 
 int do_swap_page(struct vm_fault *vmf);
 

[PATCH v11 17/26] mm: introduce __page_add_new_anon_rmap()

2018-05-17 Thread Laurent Dufour
When dealing with the speculative page fault handler, we may race with a
VMA being split or merged. In this case the vma->vm_start and vma->vm_end
fields may not match the address at which the page fault is occurring.

This can only happen when the VMA is split, but in that case the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma() the new->anon_vma is set to src->anon_vma when
*new = *vma.

So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.

If the VMA has been merged, then the VMA into which it has been merged
must have the same anon_vma pointer, otherwise the merge can't be done.

So in all cases we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA, and since there is an anon_vma this
means that at one time a page has been backed, and before the VMA
is cleaned the page table lock has to be grabbed to clean the
PTE, and the anon_vma field is checked once the PTE is locked.

This patch introduces a new __page_add_new_anon_rmap() service which
doesn't check the VMA boundaries, and creates a new inline one
which does the check.

When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.

Signed-off-by: Laurent Dufour 
---
 include/linux/rmap.h | 12 ++--
 mm/memory.c  |  8 
 mm/rmap.c|  5 ++---
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 988d176472df..a5d282573093 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -174,8 +174,16 @@ void page_add_anon_rmap(struct page *, struct 
vm_area_struct *,
unsigned long, bool);
 void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
   unsigned long, int);
-void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
-   unsigned long, bool);
+void __page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
+ unsigned long, bool);
+static inline void page_add_new_anon_rmap(struct page *page,
+ struct vm_area_struct *vma,
+ unsigned long address, bool compound)
+{
+   VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+   __page_add_new_anon_rmap(page, vma, address, compound);
+}
+
 void page_add_file_rmap(struct page *, bool);
 void page_remove_rmap(struct page *, bool);
 
diff --git a/mm/memory.c b/mm/memory.c
index cc4e6221ee7b..ab32b0b4bd69 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2569,7 +2569,7 @@ static int wp_page_copy(struct vm_fault *vmf)
 * thread doing COW.
 */
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
-   page_add_new_anon_rmap(new_page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
__lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
@@ -3106,7 +3106,7 @@ int do_swap_page(struct vm_fault *vmf)
 
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
@@ -3257,7 +3257,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
}
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
@@ -3511,7 +3511,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup 
*memcg,
/* copy-on-write page */
if (write && !(vmf->vma_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
-   page_add_new_anon_rmap(page, vma, vmf->address, false);
+   __page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
__lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
diff --git a/mm/rmap.c b/mm/rmap.c
index 6db729dc4c50..42d1ebed2b5b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1136,7 

[PATCH v11 16/26] mm: introduce __vm_normal_page()

2018-05-17 Thread Laurent Dufour
When dealing with the speculative fault path we should use the VMA's
cached field values stored in the vm_fault structure.

Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which
receives the vm_flags value as a parameter.

Note: The speculative path is turned on for architectures providing support
for the special PTE flag. So only the first block of vm_normal_page() is
used during the speculative path.

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 18 +++---
 mm/memory.c| 21 -
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f385d721867d..bcebec117d4d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1317,9 +1317,21 @@ static inline void INIT_VMA(struct vm_area_struct *vma)
 #endif
 }
 
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device);
-#define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags);
+static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
+   unsigned long addr, pte_t pte,
+   bool with_public_device)
+{
+   return __vm_normal_page(vma, addr, pte, with_public_device,
+   vma->vm_flags);
+}
+static inline struct page *vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte)
+{
+   return _vm_normal_page(vma, addr, pte, false);
+}
 
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
diff --git a/mm/memory.c b/mm/memory.c
index deac7f12d777..cc4e6221ee7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -780,7 +780,8 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
 }
 
 /*
- * vm_normal_page -- This function gets the "struct page" associated with a 
pte.
+ * __vm_normal_page -- This function gets the "struct page" associated with
+ * a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
@@ -821,8 +822,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
-pte_t pte, bool with_public_device)
+struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, bool with_public_device,
+ unsigned long vma_flags)
 {
unsigned long pfn = pte_pfn(pte);
 
@@ -831,7 +833,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
goto check_pfn;
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   if (vma_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
return NULL;
@@ -863,8 +865,8 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
 
/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
 
-   if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-   if (vma->vm_flags & VM_MIXEDMAP) {
+   if (unlikely(vma_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
+   if (vma_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
return NULL;
goto out;
@@ -873,7 +875,7 @@ struct page *_vm_normal_page(struct vm_area_struct *vma, 
unsigned long addr,
off = (addr - vma->vm_start) >> PAGE_SHIFT;
if (pfn == vma->vm_pgoff + off)
return NULL;
-   if (!is_cow_mapping(vma->vm_flags))
+   if (!is_cow_mapping(vma_flags))
return NULL;
}
}
@@ -2753,7 +2755,8 @@ static int do_wp_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
 
-   vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+   vmf->page = __vm_normal_page(vma, vmf->address, vmf->orig_pte, false,
+vmf->vma_flags);
if (!vmf->page) {
/*
 * 

[PATCH v11 15/26] mm: introduce __lru_cache_add_active_or_unevictable

2018-05-17 Thread Laurent Dufour
The speculative page fault handler, which is run without holding the
mmap_sem, is calling lru_cache_add_active_or_unevictable() but the vm_flags
is not guaranteed to remain constant.
Introduce __lru_cache_add_active_or_unevictable() which takes the vma flags
value as a parameter instead of the vma pointer.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/swap.h | 10 --
 mm/memory.c  |  8 
 mm/swap.c|  6 +++---
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f73eafcaf4e9..730c14738574 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,14 @@ extern void deactivate_file_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
-extern void lru_cache_add_active_or_unevictable(struct page *page,
-   struct vm_area_struct *vma);
+extern void __lru_cache_add_active_or_unevictable(struct page *page,
+   unsigned long vma_flags);
+
+static inline void lru_cache_add_active_or_unevictable(struct page *page,
+   struct vm_area_struct *vma)
+{
+   return __lru_cache_add_active_or_unevictable(page, vma->vm_flags);
+}
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
diff --git a/mm/memory.c b/mm/memory.c
index cb6310b74cfb..deac7f12d777 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2569,7 +2569,7 @@ static int wp_page_copy(struct vm_fault *vmf)
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
mem_cgroup_commit_charge(new_page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(new_page, vma);
+   __lru_cache_add_active_or_unevictable(new_page, vmf->vma_flags);
/*
 * We call the notify macro here because, when using secondary
 * mmu page tables (such as kvm shadow page tables), we want the
@@ -3105,7 +3105,7 @@ int do_swap_page(struct vm_fault *vmf)
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
mem_cgroup_commit_charge(page, memcg, true, false);
@@ -3256,7 +3256,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
 setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
 
@@ -3510,7 +3510,7 @@ int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup 
*memcg,
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
mem_cgroup_commit_charge(page, memcg, false, false);
-   lru_cache_add_active_or_unevictable(page, vma);
+   __lru_cache_add_active_or_unevictable(page, vmf->vma_flags);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
page_add_file_rmap(page, false);
diff --git a/mm/swap.c b/mm/swap.c
index 26fc9b5f1b6c..ba97d437e68a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -456,12 +456,12 @@ void lru_cache_add(struct page *page)
  * directly back onto it's zone's unevictable list, it does NOT use a
  * per cpu pagevec.
  */
-void lru_cache_add_active_or_unevictable(struct page *page,
-struct vm_area_struct *vma)
+void __lru_cache_add_active_or_unevictable(struct page *page,
+  unsigned long vma_flags)
 {
VM_BUG_ON_PAGE(PageLRU(page), page);
 
-   if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
+   if (likely((vma_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
SetPageActive(page);
else if (!TestSetPageMlocked(page)) {
/*
-- 
2.7.4



[PATCH v11 14/26] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-05-17 Thread Laurent Dufour
migrate_misplaced_page() is only called during page fault handling, so
it's better to pass the pointer to the struct vm_fault instead of the vma.

This way, during the speculative page fault path, the saved vma->vm_flags
can be used.

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/migrate.h | 4 ++--
 mm/memory.c | 2 +-
 mm/migrate.c| 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f2b4abbca55e..fd4c3ab7bd9c 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -126,14 +126,14 @@ static inline void __ClearPageMovable(struct page *page)
 #ifdef CONFIG_NUMA_BALANCING
 extern bool pmd_trans_migrating(pmd_t pmd);
 extern int migrate_misplaced_page(struct page *page,
- struct vm_area_struct *vma, int node);
+ struct vm_fault *vmf, int node);
 #else
 static inline bool pmd_trans_migrating(pmd_t pmd)
 {
return false;
 }
 static inline int migrate_misplaced_page(struct page *page,
-struct vm_area_struct *vma, int node)
+struct vm_fault *vmf, int node)
 {
return -EAGAIN; /* can't migrate now */
 }
diff --git a/mm/memory.c b/mm/memory.c
index 9dc455ae550c..cb6310b74cfb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3904,7 +3904,7 @@ static int do_numa_page(struct vm_fault *vmf)
}
 
/* Migrate to the requested node */
-   migrated = migrate_misplaced_page(page, vma, target_nid);
+   migrated = migrate_misplaced_page(page, vmf, target_nid);
if (migrated) {
page_nid = target_nid;
flags |= TNF_MIGRATED;
diff --git a/mm/migrate.c b/mm/migrate.c
index ae3d0faf72cb..884c57a16b7a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1945,7 +1945,7 @@ bool pmd_trans_migrating(pmd_t pmd)
  * node. Caller is expected to have an elevated reference count on
  * the page that will be dropped by this function before returning.
  */
-int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+int migrate_misplaced_page(struct page *page, struct vm_fault *vmf,
   int node)
 {
pg_data_t *pgdat = NODE_DATA(node);
@@ -1958,7 +1958,7 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
 * with execute permissions as they are probably shared libraries.
 */
if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
-   (vma->vm_flags & VM_EXEC))
+   (vmf->vma_flags & VM_EXEC))
goto out;
 
/*
-- 
2.7.4



[PATCH v11 12/26] mm: protect SPF handler against anon_vma changes

2018-05-17 Thread Laurent Dufour
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.

In addition, don't try the speculative page fault if the VMA doesn't have
an anon_vma structure allocated, because its allocation should be
protected by the mmap_sem.

In __vma_adjust() when importer->anon_vma is set, there is no need to
protect against speculative page faults since speculative page fault
is aborted if the vma->anon_vma is not set.

When calling page_add_new_anon_rmap() vma->anon_vma is necessarily
valid, since we checked for it when locking the pte, and the anon_vma is
removed once the pte is unlocked. So even if the speculative page
fault handler is running concurrently with do_unmap(), the pte is
locked in unmap_region() - through unmap_vmas() - and the anon_vma is
unlinked later. Because we check the vma sequence counter, which is
updated in unmap_page_range() before locking the pte and then in
free_pgtables(), the change will be detected when locking the pte.

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 551a1916da5d..d0b5f14cfe69 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -624,7 +624,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 * Hide vma from rmap and truncate_pagecache before freeing
 * pgtables
 */
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
@@ -638,7 +640,9 @@ void free_pgtables(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
   && !is_vm_hugetlb_page(next)) {
vma = next;
next = vma->vm_next;
+   vm_write_begin(vma);
unlink_anon_vmas(vma);
+   vm_write_end(vma);
unlink_file_vma(vma);
}
free_pgd_range(tlb, addr, vma->vm_end,
-- 
2.7.4



[PATCH v11 11/26] mm: protect mremap() against SPF handler

2018-05-17 Thread Laurent Dufour
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the ptes have been moved by the other thread. This means that the moved
ptes will overwrite those created by the page fault handler, leading to
pages being leaked.

CPU 1   CPU2
enter mremap()
unmap the dest area
copy_vma()  Enter speculative page fault handler
   >> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler

move_ptes()
  > it is assumed that the dest area is empty,
  > the move ptes overwrite the page mapped by the CPU2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler.
The usual way to do so is to rely on vm_write_begin()/end().
This is already done in __vma_adjust(), called by copy_vma() (through
vma_merge()). But __vma_adjust() is calling vm_write_end() before returning,
which creates a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
__vma_adjust().
The assumption is that copy_vma() is returning a vma which should be
released by the callee calling vm_raw_write_end() once the ptes have
been moved.
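
On the mremap() side (the mm/mremap.c hunk is not shown above), the expected
sequence is assumed to be roughly the following (sketch only):

	/*
	 * copy_vma() now returns the destination VMA still write-marked
	 * (vm_raw_write_begin() held through __vma_adjust(.., keep_locked)),
	 * so the SPF handler cannot pick it up yet.
	 */
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, &need_rmap_locks);

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
				     old_len, need_rmap_locks);

	/* only now expose the destination VMA to the speculative handler */
	vm_raw_write_end(new_vma);
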

Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h | 24 +++-
 mm/mmap.c  | 53 +
 mm/mremap.c| 13 +
 3 files changed, 73 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 18acfdeee759..3f8b2ce0ef7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2253,18 +2253,32 @@ void anon_vma_interval_tree_verify(struct 
anon_vma_chain *node);
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int 
cap_sys_admin);
+
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand);
+   struct vm_area_struct *expand, bool keep_locked);
+
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-   return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+   return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+
+extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
+   struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t pgoff, struct mempolicy *mpol,
+   struct vm_userfaultfd_ctx uff, bool keep_locked);
+
+static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
-   unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *, struct vm_userfaultfd_ctx);
+   unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+   pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+   return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
+  pol, uff, false);
+}
+
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index add13b4e1d8d..2450860e3f8e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -689,7 +689,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-   struct vm_area_struct *expand)
+   struct vm_area_struct *expand, bool keep_locked)
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -805,8 +805,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
 
importer->anon_vma = exporter->anon_vma;
error = anon_vma_clone(importer, exporter);
-   if (error)
+   if (error) {
+   if (next && next != vma)
+   vm_raw_write_end(next);
+   vm_raw_write_end(vma);
return error;
+   }
}
}
 again:
@@ -1001,7 +1005,8 @@ int 

[PATCH v11 10/26] mm: protect VMA modifications using VMA sequence count

2018-05-17 Thread Laurent Dufour
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modifications done in:
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultfd registering services

In addition, VMA fields which will be read during the speculative fault
path need to be written using WRITE_ONCE() to prevent the write from being
split and intermediate values from being made visible to other CPUs.
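
For reference, the write-side pattern used by the hunks below is (a minimal
sketch; new_flags stands for whatever value the modified path computes):

	vm_write_begin(vma);
	WRITE_ONCE(vma->vm_flags, new_flags);	/* single, untorn store */
	vm_write_end(vma);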

Signed-off-by: Laurent Dufour 
---
 fs/proc/task_mmu.c |  5 -
 fs/userfaultfd.c   | 17 +
 mm/khugepaged.c|  3 +++
 mm/madvise.c   |  6 +-
 mm/mempolicy.c | 51 ++-
 mm/mlock.c | 13 -
 mm/mmap.c  | 22 +-
 mm/mprotect.c  |  4 +++-
 mm/swap_state.c|  8 ++--
 9 files changed, 89 insertions(+), 40 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 597969db9e90..7247d6d5afba 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1137,8 +1137,11 @@ static ssize_t clear_refs_write(struct file *file, const 
char __user *buf,
goto out_mm;
}
for (vma = mm->mmap; vma; vma = vma->vm_next) {
-   vma->vm_flags &= ~VM_SOFTDIRTY;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & 
~VM_SOFTDIRTY);
vma_set_page_prot(vma);
+   vm_write_end(vma);
}
downgrade_write(>mmap_sem);
break;
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index cec550c8468f..b8212ba17695 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -659,8 +659,11 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct 
list_head *fcs)
 
octx = vma->vm_userfaultfd_ctx.ctx;
if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+   vm_write_begin(vma);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
-   vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
+   WRITE_ONCE(vma->vm_flags,
+  vma->vm_flags & ~(VM_UFFD_WP | VM_UFFD_MISSING));
+   vm_write_end(vma);
return 0;
}
 
@@ -885,8 +888,10 @@ static int userfaultfd_release(struct inode *inode, struct 
file *file)
vma = prev;
else
prev = vma;
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
}
up_write(>mmap_sem);
mmput(mm);
@@ -1434,8 +1439,10 @@ static int userfaultfd_register(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx.ctx = ctx;
+   vm_write_end(vma);
 
skip:
prev = vma;
@@ -1592,8 +1599,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx 
*ctx,
 * the next vma was merged into the current one and
 * the current one has not been updated yet.
 */
-   vma->vm_flags = new_flags;
+   vm_write_begin(vma);
+   WRITE_ONCE(vma->vm_flags, new_flags);
vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
+   vm_write_end(vma);
 
skip:
prev = vma;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d7b2a4bf8671..0b28af4b950d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1011,6 +1011,7 @@ static void collapse_huge_page(struct mm_struct *mm,
if (mm_find_pmd(mm, address) != pmd)
goto out;
 
+   vm_write_begin(vma);
anon_vma_lock_write(vma->anon_vma);
 
pte = pte_offset_map(pmd, address);
@@ -1046,6 +1047,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd_populate(mm, pmd, pmd_pgtable(_pmd));
spin_unlock(pmd_ptl);
anon_vma_unlock_write(vma->anon_vma);
+   vm_write_end(vma);
result = SCAN_FAIL;
goto 

[PATCH v11 09/26] mm: VMA sequence count

2018-05-17 Thread Laurent Dufour
From: Peter Zijlstra 

Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA has changed.

The calls to vm_write_begin/end() in unmap_page_range() are
used to detect when a VMA is being unmapped and thus that a new page fault
should not be satisfied for this VMA. If the seqcount hasn't changed when
the page tables are locked, this means we are safe to satisfy the page
fault.

The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.

The VMA's sequence counter is also used to detect changes to various VMA
fields used during the page fault handling, such as:
 - vm_start, vm_end
 - vm_pgoff
 - vm_flags, vm_page_prot
 - anon_vma
 - vm_policy
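
The read side, used by the later speculative fault patches (not by this
patch), then looks roughly like this sketch:

	unsigned int seq;

	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto fallback;	/* a write is in progress on this VMA */

	/* ... speculatively read vm_start, vm_end, vm_flags, ... */

	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto fallback;	/* the VMA changed, use the classic path */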

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
 CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
 using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
 path]
[move sequence initialisation INIT_VMA()]
[Review the patch description about unmap_page_range()]
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h   | 44 
 include/linux/mm_types.h |  3 +++
 mm/memory.c  |  2 ++
 mm/mmap.c| 31 +++
 4 files changed, 80 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35ecb983ff36..18acfdeee759 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1306,6 +1306,9 @@ struct zap_details {
 static inline void INIT_VMA(struct vm_area_struct *vma)
 {
INIT_LIST_HEAD(>anon_vma_chain);
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_init(>vm_sequence);
+#endif
 }
 
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
@@ -1428,6 +1431,47 @@ static inline void unmap_shared_mapping_range(struct 
address_space *mapping,
unmap_mapping_range(mapping, holebegin, holelen, 0);
 }
 
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+   write_seqcount_begin(>vm_sequence);
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+   write_seqcount_begin_nested(>vm_sequence, subclass);
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+   write_seqcount_end(>vm_sequence);
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_begin(>vm_sequence);
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+   raw_write_seqcount_end(>vm_sequence);
+}
+#else
+static inline void vm_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_write_begin_nested(struct vm_area_struct *vma,
+int subclass)
+{
+}
+static inline void vm_write_end(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_begin(struct vm_area_struct *vma)
+{
+}
+static inline void vm_raw_write_end(struct vm_area_struct *vma)
+{
+}
+#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
+
 extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
void *buf, int len, unsigned int gup_flags);
 extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 54f1e05ecf3e..fb5962308183 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -335,6 +335,9 @@ struct vm_area_struct {
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
+   seqcount_t vm_sequence;
+#endif
 } __randomize_layout;
 
 struct core_thread {
diff --git a/mm/memory.c b/mm/memory.c
index 75163c145c76..551a1916da5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1499,6 +1499,7 @@ void unmap_page_range(struct mmu_gather *tlb,
unsigned long next;
 
BUG_ON(addr >= end);
+   vm_write_begin(vma);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
do {
@@ -1508,6 +1509,7 @@ void unmap_page_range(struct mmu_gather *tlb,
next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
} while (pgd++, addr = next, addr != end);
tlb_end_vma(tlb, vma);
+   vm_write_end(vma);
 }
 
 
diff --git a/mm/mmap.c b/mm/mmap.c
index ceb1c2c1b46b..eeafd0bc8b36 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -701,6 +701,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long 
start,
long adjust_next = 0;
int remove_next = 0;
 
+   /*
+

[PATCH v11 08/26] mm: introduce INIT_VMA()

2018-05-17 Thread Laurent Dufour
Some VMA struct fields need to be initialized once the VMA structure is
allocated.
Currently this only concerns the anon_vma_chain field, but others will be
added to support the speculative page fault.

Instead of spreading the initialization calls all over the code, let's
introduce a dedicated inline function.

Signed-off-by: Laurent Dufour 
---
 fs/exec.c  |  2 +-
 include/linux/mm.h |  5 +
 kernel/fork.c  |  2 +-
 mm/mmap.c  | 10 +-
 mm/nommu.c |  2 +-
 5 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 6fc98cfd3bdb..7e134a588ef3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -311,7 +311,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
vma->vm_start = vma->vm_end - PAGE_SIZE;
vma->vm_flags = VM_SOFTDIRTY | VM_STACK_FLAGS | 
VM_STACK_INCOMPLETE_SETUP;
vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(vma);
 
err = insert_vm_struct(mm, vma);
if (err)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 113b572471ca..35ecb983ff36 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1303,6 +1303,11 @@ struct zap_details {
pgoff_t last_index; /* Highest page->index to unmap 
*/
 };
 
+static inline void INIT_VMA(struct vm_area_struct *vma)
+{
+   INIT_LIST_HEAD(>anon_vma_chain);
+}
+
 struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 pte_t pte, bool with_public_device);
 #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
diff --git a/kernel/fork.c b/kernel/fork.c
index 744d6fbba8f8..99198a02efe9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -458,7 +458,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (!tmp)
goto fail_nomem;
*tmp = *mpnt;
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(tmp);
retval = vma_dup_policy(mpnt, tmp);
if (retval)
goto fail_nomem_policy;
diff --git a/mm/mmap.c b/mm/mmap.c
index d2ef1060a2d2..ceb1c2c1b46b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1709,7 +1709,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vma->vm_flags = vm_flags;
vma->vm_page_prot = vm_get_page_prot(vm_flags);
vma->vm_pgoff = pgoff;
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(vma);
 
if (file) {
if (vm_flags & VM_DENYWRITE) {
@@ -2595,7 +2595,7 @@ int __split_vma(struct mm_struct *mm, struct 
vm_area_struct *vma,
/* most fields are the same, copy all, and then fixup */
*new = *vma;
 
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(new);
 
if (new_below)
new->vm_end = addr;
@@ -2965,7 +2965,7 @@ static int do_brk_flags(unsigned long addr, unsigned long 
request, unsigned long
return -ENOMEM;
}
 
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(vma);
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
@@ -3184,7 +3184,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct 
**vmap,
new_vma->vm_pgoff = pgoff;
if (vma_dup_policy(vma, new_vma))
goto out_free_vma;
-   INIT_LIST_HEAD(_vma->anon_vma_chain);
+   INIT_VMA(new_vma);
if (anon_vma_clone(new_vma, vma))
goto out_free_mempol;
if (new_vma->vm_file)
@@ -3327,7 +3327,7 @@ static struct vm_area_struct *__install_special_mapping(
if (unlikely(vma == NULL))
return ERR_PTR(-ENOMEM);
 
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(vma);
vma->vm_mm = mm;
vma->vm_start = addr;
vma->vm_end = addr + len;
diff --git a/mm/nommu.c b/mm/nommu.c
index 4452d8bd9ae4..ece424315cc5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1212,7 +1212,7 @@ unsigned long do_mmap(struct file *file,
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
 
-   INIT_LIST_HEAD(>anon_vma_chain);
+   INIT_VMA(vma);
vma->vm_flags = vm_flags;
vma->vm_pgoff = pgoff;
 
-- 
2.7.4



[PATCH v11 07/26] mm: make pte_unmap_same compatible with SPF

2018-05-17 Thread Laurent Dufour
pte_unmap_same() makes the assumption that the page tables are still
around because the mmap_sem is held.
This is no longer the case when running a speculative page fault, so an
additional check must be made to ensure that the final page tables are
still there.

This is now done by calling pte_spinlock() to check for the VMA's
consistency while locking the page tables.

This requires passing a vm_fault structure to pte_unmap_same(), which
contains all the needed parameters.

As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been modified behind our back, pte_unmap_same() now returns one of
3 cases:
1. the ptes are the same (0)
2. the ptes are different (VM_FAULT_PTNOTSAME)
3. a change to the VMA has been detected (VM_FAULT_RETRY)

Case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.
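
A caller is then expected to handle the three cases roughly as follows (a
condensed sketch of the do_swap_page() change in the hunk below):

	ret = pte_unmap_same(vmf);
	if (ret) {
		/* Another thread already did the swap operation in our back. */
		if (ret == VM_FAULT_PTNOTSAME)
			ret = 0;
		goto out;	/* VM_FAULT_RETRY is simply passed up */
	}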

Acked-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 include/linux/mm.h |  4 +++-
 mm/memory.c| 39 ---
 2 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 338b8a1afb02..113b572471ca 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1249,6 +1249,7 @@ static inline void clear_page_pfmemalloc(struct page 
*page)
 #define VM_FAULT_NEEDDSYNC  0x2000 /* ->fault did not modify page tables
 * and needs fsync() to complete (for
 * synchronous page faults in DAX) */
+#define VM_FAULT_PTNOTSAME 0x4000  /* Page table entries have changed */
 
 #define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
@@ -1267,7 +1268,8 @@ static inline void clear_page_pfmemalloc(struct page 
*page)
{ VM_FAULT_RETRY,   "RETRY" }, \
{ VM_FAULT_FALLBACK,"FALLBACK" }, \
{ VM_FAULT_DONE_COW,"DONE_COW" }, \
-   { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" }
+   { VM_FAULT_NEEDDSYNC,   "NEEDDSYNC" },  \
+   { VM_FAULT_PTNOTSAME,   "PTNOTSAME" }
 
 /* Encode hstate index for a hwpoisoned large page */
 #define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/memory.c b/mm/memory.c
index fa0d9493acac..75163c145c76 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2319,21 +2319,29 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
  * parts, do_swap_page must check under lock before unmapping the pte and
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
+ *
+ * pte_unmap_same() returns:
+ * 0   if the PTE are the same
+ * VM_FAULT_PTNOTSAME  if the PTE are different
+ * VM_FAULT_RETRY  if the VMA has changed in our back during
+ * a speculative page fault handling.
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-   pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
-   int same = 1;
+   int ret = 0;
+
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
if (sizeof(pte_t) > sizeof(unsigned long)) {
-   spinlock_t *ptl = pte_lockptr(mm, pmd);
-   spin_lock(ptl);
-   same = pte_same(*page_table, orig_pte);
-   spin_unlock(ptl);
+   if (pte_spinlock(vmf)) {
+   if (!pte_same(*vmf->pte, vmf->orig_pte))
+   ret = VM_FAULT_PTNOTSAME;
+   spin_unlock(vmf->ptl);
+   } else
+   ret = VM_FAULT_RETRY;
}
 #endif
-   pte_unmap(page_table);
-   return same;
+   pte_unmap(vmf->pte);
+   return ret;
 }
 
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned 
long va, struct vm_area_struct *vma)
@@ -2922,10 +2930,19 @@ int do_swap_page(struct vm_fault *vmf)
pte_t pte;
int locked;
int exclusive = 0;
-   int ret = 0;
+   int ret;
 
-   if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+   ret = pte_unmap_same(vmf);
+   if (ret) {
+   /*
+* If pte != orig_pte, this means another thread did the
+* swap operation in our back.
+* So nothing else to do.
+*/
+   if (ret == VM_FAULT_PTNOTSAME)
+   ret = 0;
goto out;
+   }
 
entry = pte_to_swp_entry(vmf->orig_pte);
if (unlikely(non_swap_entry(entry))) {
-- 
2.7.4



[PATCH v11 06/26] mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE

2018-05-17 Thread Laurent Dufour
When handling a page fault without holding the mmap_sem, the fetch of the
pte lock pointer and the locking will have to be done while ensuring
that the VMA is not modified behind our back.

So move the fetch and locking operations into a dedicated function.
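
With this patch pte_spinlock() always succeeds; the speculative variant
added later in the series is expected to look roughly like the sketch below
(the FAULT_FLAG_SPECULATIVE check and the vmf->sequence field are
assumptions based on the series description, not part of this patch):

	static inline bool pte_spinlock(struct vm_fault *vmf)
	{
		if ((vmf->flags & FAULT_FLAG_SPECULATIVE) &&
		    read_seqcount_retry(&vmf->vma->vm_sequence, vmf->sequence))
			return false;	/* VMA changed, abort the speculation */

		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
		spin_lock(vmf->ptl);
		return true;
	}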

Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index a55e72c8e469..fa0d9493acac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2298,6 +2298,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static inline bool pte_spinlock(struct vm_fault *vmf)
+{
+   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
+   spin_lock(vmf->ptl);
+   return true;
+}
+
 static inline bool pte_map_lock(struct vm_fault *vmf)
 {
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
@@ -3814,8 +3821,8 @@ static int do_numa_page(struct vm_fault *vmf)
 * validation through pte_unmap_same(). It's of NUMA type but
 * the pfn may be screwed if the read is non atomic.
 */
-   vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -4008,8 +4015,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
 
-   vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
-   spin_lock(vmf->ptl);
+   if (!pte_spinlock(vmf))
+   return VM_FAULT_RETRY;
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry)))
goto unlock;
-- 
2.7.4



[PATCH v11 05/26] mm: prepare for FAULT_FLAG_SPECULATIVE

2018-05-17 Thread Laurent Dufour
From: Peter Zijlstra 

When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.

Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.

Signed-off-by: Peter Zijlstra (Intel) 

[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
 implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
[move the define of FAULT_FLAG_SPECULATIVE later in the series]
[review error path in do_swap_page(), do_anonymous_page() and
 wp_page_copy()]
Signed-off-by: Laurent Dufour 
---
 mm/memory.c | 87 -
 1 file changed, 58 insertions(+), 29 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 14578158ed20..a55e72c8e469 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2298,6 +2298,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned 
long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
+static inline bool pte_map_lock(struct vm_fault *vmf)
+{
+   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+  vmf->address, >ptl);
+   return true;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
@@ -2487,25 +2494,26 @@ static int wp_page_copy(struct vm_fault *vmf)
const unsigned long mmun_start = vmf->address & PAGE_MASK;
const unsigned long mmun_end = mmun_start + PAGE_SIZE;
struct mem_cgroup *memcg;
+   int ret = VM_FAULT_OOM;
 
if (unlikely(anon_vma_prepare(vma)))
-   goto oom;
+   goto out;
 
if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
new_page = alloc_zeroed_user_highpage_movable(vma,
  vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
} else {
new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (!new_page)
-   goto oom;
+   goto out;
cow_user_page(new_page, old_page, vmf->address, vma);
}
 
if (mem_cgroup_try_charge(new_page, mm, GFP_KERNEL, , false))
-   goto oom_free_new;
+   goto out_free_new;
 
__SetPageUptodate(new_page);
 
@@ -2514,7 +2522,10 @@ static int wp_page_copy(struct vm_fault *vmf)
/*
 * Re-check the pte - we dropped the lock
 */
-   vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, >ptl);
+   if (!pte_map_lock(vmf)) {
+   ret = VM_FAULT_RETRY;
+   goto out_uncharge;
+   }
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
@@ -2601,12 +2612,14 @@ static int wp_page_copy(struct vm_fault *vmf)
put_page(old_page);
}
return page_copied ? VM_FAULT_WRITE : 0;
-oom_free_new:
+out_uncharge:
+   mem_cgroup_cancel_charge(new_page, memcg, false);
+out_free_new:
put_page(new_page);
-oom:
+out:
if (old_page)
put_page(old_page);
-   return VM_FAULT_OOM;
+   return ret;
 }
 
 /**
@@ -2627,8 +2640,8 @@ static int wp_page_copy(struct vm_fault *vmf)
 int finish_mkwrite_fault(struct vm_fault *vmf)
 {
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-   vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
-  >ptl);
+   if (!pte_map_lock(vmf))
+   return VM_FAULT_RETRY;
/*
 * We might have raced with another page fault while we released the
 * pte_offset_map_lock.
@@ -2746,8 +2759,11 @@ static int do_wp_page(struct vm_fault *vmf)
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
lock_page(vmf->page);
-   vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
-   vmf->address, >ptl);
+   if (!pte_map_lock(vmf)) {
+   unlock_page(vmf->page);
+   put_page(vmf->page);
+   return VM_FAULT_RETRY;
+   }
if (!pte_same(*vmf->pte, vmf->orig_pte)) {
unlock_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2954,11 +2970,15 @@ int do_swap_page(struct vm_fault 

[PATCH v11 03/26] powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Laurent Dufour
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for BOOK3S_64. This enables
the Speculative Page Fault handler.

Support is only provided for BOOK3S_64 currently because:
- it requires CONFIG_PPC_STD_MMU because of checks done in
  set_access_flags_filter()
- it requires BOOK3S because we can't support book3e_hugetlb_preload()
  called by update_mmu_cache()

Cc: Michael Ellerman 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index be7aca467692..75f71b963630 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -232,6 +232,7 @@ config PPC
select OLD_SIGACTIONif PPC32
select OLD_SIGSUSPEND
select SPARSE_IRQ
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT if PPC_BOOK3S_64
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS  if !PPC64
#
-- 
2.7.4



[PATCH v11 04/26] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Laurent Dufour
From: Mahendran Ganesh 

Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT for arm64. This
enables Speculative Page Fault handler.

Signed-off-by: Ganesh Mahendran 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 4759566a78cb..c932ae6d2cce 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -147,6 +147,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
help
  ARM 64-bit (AArch64) Linux support.
 
-- 
2.7.4



[PATCH v11 02/26] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Laurent Dufour
Set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT which turns on the
Speculative Page Fault handler when building for 64-bit.

Cc: Thomas Gleixner 
Signed-off-by: Laurent Dufour 
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 47e7f582f86a..603f788a3e83 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -32,6 +32,7 @@ config X86_64
select SWIOTLB
select X86_DEV_DMA_OPS
select ARCH_HAS_SYSCALL_WRAPPER
+   select ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
 
 #
 # Arch settings
-- 
2.7.4



[PATCH v11 01/26] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT

2018-05-17 Thread Laurent Dufour
This configuration variable will be used to build the code needed to
handle speculative page fault.

By default it is turned off, and activated depending on architecture
support, ARCH_HAS_PTE_SPECIAL, SMP and MMU.

The architecture support is needed since the speculative page fault handler
is called from the architecture's page faulting code, and some code has to
be added there to handle the speculative handler.

The dependency on ARCH_HAS_PTE_SPECIAL is required because vm_normal_page()
does processing that is not compatible with the speculative handling in the
case ARCH_HAS_PTE_SPECIAL is not set.

Suggested-by: Thomas Gleixner 
Suggested-by: David Rientjes 
Signed-off-by: Laurent Dufour 
---
 mm/Kconfig | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 1d0888c5b97a..a38796276113 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -761,3 +761,25 @@ config GUP_BENCHMARK
 
 config ARCH_HAS_PTE_SPECIAL
bool
+
+config ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+   def_bool n
+
+config SPECULATIVE_PAGE_FAULT
+   bool "Speculative page faults"
+   default y
+   depends on ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
+   depends on ARCH_HAS_PTE_SPECIAL && MMU && SMP
+   help
+ Try to handle user space page faults without holding the mmap_sem.
+
+This should allow better concurrency for massively threaded process
+since the page fault handler will not wait for other threads memory
+layout change to be done, assuming that this change is done in another
+part of the process's memory space. This type of page fault is named
+speculative page fault.
+
+If the speculative page fault fails because of a concurrency is
+detected or because underlying PMD or PTE tables are not yet
+allocating, it is failing its processing and a classic page fault
+is then tried.
-- 
2.7.4



[PATCH v11 00/26] Speculative page faults

2018-05-17 Thread Laurent Dufour
This is a port to kernel 4.17 of the work done by Peter Zijlstra to handle
page faults without holding the mm semaphore [1].

The idea is to try to handle user space page faults without holding the
mmap_sem. This should allow better concurrency for massively threaded
processes since the page fault handler will not wait for other threads'
memory layout changes to be done, assuming that such a change is done in
another part of the process's memory space. This type of page fault is
named a speculative page fault. If the speculative page fault fails because
a concurrent change is detected or because the underlying PMD or PTE tables
are not yet allocated, it aborts and a classic page fault is then tried.

The speculative page fault (SPF) has to look for the VMA matching the fault
address without holding the mmap_sem; this is done by introducing a rwlock
which protects the access to the mm_rb tree. Previously this was done using
SRCU, but it was introducing a lot of scheduling to process the VMA's
freeing operation, which was hitting the performance by 20% as reported by
Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree limits
the locking contention to these operations, which are expected to be in
O(log n) order. In addition, to ensure that the VMA is not freed behind
our back, a reference count is added and 2 services (get_vma() and
put_vma()) are introduced to handle the reference count. Once a VMA is
fetched from the RB tree using get_vma(), it must be later released using
put_vma(). I no longer see the overhead I previously observed with the
will-it-scale benchmark.
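
Usage of the two new services is expected to follow this pattern (a sketch;
the exact prototypes are those defined later in the series):

	vma = get_vma(mm, address);	/* takes a reference under the mm_rb rwlock */
	if (!vma)
		goto classic_fault;

	/* ... speculative handling of the fault ... */

	put_vma(vma);			/* drops the reference, may free the VMA */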

The VMA's attributes checked during the speculative page fault processing
have to be protected against parallel changes. This is done by using a
per-VMA sequence lock. This sequence lock allows the speculative page fault
handler to quickly check for parallel changes in progress and to abort the
speculative page fault in that case.

Once the VMA has been found, the speculative page fault handler checks the
VMA's attributes to verify whether the page fault can be handled this way
or not. Thus, the VMA is protected through a sequence lock which
allows fast detection of concurrent VMA changes. If such a change is
detected, the speculative page fault is aborted and a *classic* page fault
is tried.  VMA sequence locking is added where the VMA attributes which are
checked during the page fault are modified.

When the PTE is fetched, the VMA is checked to see if it has been changed,
so once the page table is locked, the VMA is valid. Any other change
leading to touching this PTE will need to lock the page table, so no
parallel change is possible at this time.

The locking of the PTE is done with interrupts disabled; this allows
checking the PMD to ensure that there is not an ongoing collapsing
operation. Since khugepaged first sets the PMD to pmd_none and then waits
for the other CPU to have caught the IPI interrupt, if the pmd is
valid at the time the PTE is locked, we have the guarantee that the
collapsing operation will have to wait on the PTE lock to move forward.
This allows the SPF handler to map the PTE safely. If the PMD value is
different from the one recorded at the beginning of the SPF operation, the
classic page fault handler will be called to handle the operation while
holding the mmap_sem. As the PTE lock is taken with interrupts disabled,
the lock is acquired using spin_trylock() to avoid deadlock when handling a
page fault while a TLB invalidate is requested by another CPU holding the
PTE lock.

In pseudo code, this could be seen as:
speculative_page_fault()
{
vma = get_vma()
check vma sequence count
check vma's support
disable interrupt
  check pgd,p4d,...,pte
  save pmd and pte in vmf
  save vma sequence counter in vmf
enable interrupt
check vma sequence count
handle_pte_fault(vma)
..
page = alloc_page()
pte_map_lock()
disable interrupt
abort if sequence counter has changed
abort if pmd or pte has changed
pte map and lock
enable interrupt
if abort
   free page
   abort
...
}

arch_fault_handler()
{
if (speculative_page_fault())
   goto done
again:
lock(mmap_sem)
vma = find_vma();
handle_pte_fault(vma);
if retry
   unlock(mmap_sem)
   goto again;
done:
handle fault error
}

Support for THP is not done because when checking for the PMD, we can be
confused by an in progress collapsing operation done by khugepaged. The
issue is that pmd_none() could 

[PATCH v6] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2018-05-17 Thread Christophe Leroy
Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
userspace instruction miss") has shown that limiting the read of the
faulting instruction to likely cases improves performance.

This patch goes further in this direction by limiting the read
of the faulting instruction to the only cases where it is definitely
needed.

On an MPC885, with the same benchmark app as in the commit referred
above, we see a reduction of 4000 dTLB misses (approx 3%):

Before the patch:
 Performance counter stats for './fault 500' (10 runs):

 720495838  cpu-cycles  ( +-  0.04% )
141769  dTLB-load-misses( +-  0.02% )
 52722  iTLB-load-misses( +-  0.01% )
 19611  faults  ( +-  0.02% )

   5.750535176 seconds time elapsed ( +-  0.16% )

With the patch:
 Performance counter stats for './fault 500' (10 runs):

 717669123  cpu-cycles  ( +-  0.02% )
137344  dTLB-load-misses( +-  0.03% )
 52731  iTLB-load-misses( +-  0.01% )
 19614  faults  ( +-  0.03% )

   5.728423115 seconds time elapsed ( +-  0.14% )

The proper operation of the huge stack expansion was tested with the
following app:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	char buf[1024 * 1025];

	sprintf(buf, "Hello world !\n");
	printf(buf);

	exit(0);
}

Signed-off-by: Christophe Leroy 
---
 v6: Rebased on latest powerpc/merge branch ; Using __get_user_inatomic() 
instead of get_user() in order
 to move it inside the semaphored area. That removes all the complexity of 
the patch.

 v5: Reworked to fit after Benh do_fault improvement and rebased on top of 
powerpc/merge (65152902e43fef)

 v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
verification before __get_user_xxx()

 v3: Do a first try with pagefault disabled before releasing the semaphore

 v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'

 arch/powerpc/mm/fault.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index c01d627e687a..a7d5cc76a8ce 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -72,8 +72,18 @@ static inline bool notify_page_fault(struct pt_regs *regs)
 static bool store_updates_sp(struct pt_regs *regs)
 {
unsigned int inst;
+   int ret;
 
-   if (get_user(inst, (unsigned int __user *)regs->nip))
+   /*
+* Using get_user_in_atomic() as reading code around nip can result in
+* fault, which may cause a deadlock when called with mmap_sem held,
+* however since we are reading the instruction that generated the DSI
+* we are handling, the page is necessarily already present.
+*/
+   pagefault_disable();
+   ret = __get_user_inatomic(inst, (unsigned int __user *)regs->nip);
+   pagefault_enable();
+   if (ret)
return false;
/* check for 1 in the rA field */
if (((inst >> 16) & 0x1f) != 1)
@@ -234,8 +244,7 @@ static bool bad_kernel_fault(bool is_exec, unsigned long 
error_code,
 }
 
 static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
-   struct vm_area_struct *vma,
-   bool store_update_sp)
+   struct vm_area_struct *vma)
 {
/*
 * N.B. The POWER/Open ABI allows programs to access up to
@@ -264,7 +273,7 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
unsigned long address,
 * between the last mapped region and the stack will
 * expand the stack rather than segfaulting.
 */
-   if (address + 2048 < uregs->gpr[1] && !store_update_sp)
+   if (address + 2048 < uregs->gpr[1] && !store_updates_sp(regs))
return true;
}
return false;
@@ -403,7 +412,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
int is_user = user_mode(regs);
int is_write = page_fault_is_write(error_code);
int fault, major = 0;
-   bool store_update_sp = false;
 
if (notify_page_fault(regs))
return 0;
@@ -449,14 +457,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned 
long address,
return bad_key_fault_exception(regs, address,
   get_mm_addr_key(mm, address));
 
-   /*
-* We want to do this outside mmap_sem, because reading code around nip
-* can result in fault, which will cause a deadlock when called with
-* mmap_sem held
-*/
-   if (is_write && is_user)
-   store_update_sp = store_updates_sp(regs);
-
if (is_user)
flags |= 

[PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes

2018-05-17 Thread Christophe Leroy
In my 8xx configuration, I get 208 calls to memcmp().
Within those 208 calls, about half of them have constant sizes:
46 have a size of 8, 17 have a size of 16, and only a few have a
size over 16. Other fixed sizes are mostly 4, 6 and 10.

This patch inlines calls to memcmp() when the size
is constant and lower than or equal to 16.

In my 8xx configuration, this reduces the number of calls
to memcmp() from 208 to 123.

The following table shows the number of TB timeticks needed to perform
a constant-size memcmp() before and after the patch, depending on
the size:

Before  After   Improvement
01:  75775682   25%
02: 416685682   86%
03: 51137   13258   74%
04: 454555682   87%
05: 58713   13258   77%
06: 58712   13258   77%
07: 68183   20834   70%
08: 56819   15153   73%
09: 70077   28411   60%
10: 70077   28411   60%
11: 79546   35986   55%
12: 68182   28411   58%
13: 81440   35986   55%
14: 81440   39774   51%
15: 94697   43562   54%
16: 79546   37881   52%
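
As an illustration, a hypothetical call site that benefits from this (not
taken from the kernel tree):

	struct key { u64 hi, lo; };

	static inline bool key_equal(const struct key *a, const struct key *b)
	{
		/*
		 * sizeof(*a) is the constant 16, so with this patch GCC expands
		 * the comparison inline through __memcmp8()/__memcmp_cst()
		 * instead of branching to the out-of-line memcmp().
		 */
		return memcmp(a, b, sizeof(*a)) == 0;
	}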

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/string.h | 46 +++
 1 file changed, 46 insertions(+)

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 35f1aaad9b50..80cf0f9605dd 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -4,6 +4,8 @@
 
 #ifdef __KERNEL__
 
+#include 
+
 #define __HAVE_ARCH_STRNCPY
 #define __HAVE_ARCH_STRNCMP
 #define __HAVE_ARCH_MEMSET
@@ -51,10 +53,54 @@ static inline int strncmp(const char *p, const char *q, 
__kernel_size_t size)
return __strncmp(p, q, size);
 }
 
+static inline int __memcmp1(const void *p, const void *q, int off)
+{
+   return *(u8*)(p + off) - *(u8*)(q + off);
+}
+
+static inline int __memcmp2(const void *p, const void *q, int off)
+{
+   return be16_to_cpu(*(u16*)(p + off)) - be16_to_cpu(*(u16*)(q + off));
+}
+
+static inline int __memcmp4(const void *p, const void *q, int off)
+{
+   return be32_to_cpu(*(u32*)(p + off)) - be32_to_cpu(*(u32*)(q + off));
+}
+
+static inline int __memcmp8(const void *p, const void *q, int off)
+{
+   s64 tmp = be64_to_cpu(*(u64*)(p + off)) - be64_to_cpu(*(u64*)(q + off));
+   return tmp >> 32 ? : (int)tmp;
+}
+
+static inline int __memcmp_cst(const void *p,const void *q,__kernel_size_t 
size)
+{
+   if (size == 1)
+   return __memcmp1(p, q, 0);
+   if (size == 2)
+   return __memcmp2(p, q, 0);
+   if (size == 3)
+   return __memcmp2(p, q, 0) ? : __memcmp1(p, q, 2);
+   if (size == 4)
+   return __memcmp4(p, q, 0);
+   if (size == 5)
+   return __memcmp4(p, q, 0) ? : __memcmp1(p, q, 4);
+   if (size == 6)
+   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4);
+   if (size == 7)
+   return __memcmp4(p, q, 0) ? : __memcmp2(p, q, 4) ? : 
__memcmp1(p, q, 6);
+   return __memcmp8(p, q, 0);
+}
+
 static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
 {
if (unlikely(!size))
return 0;
+   if (__builtin_constant_p(size) && size <= 8)
+   return __memcmp_cst(p, q, size);
+   if (__builtin_constant_p(size) && size <= 16)
+   return __memcmp8(p, q, 0) ? : __memcmp_cst(p + 8, q + 8, size - 
8);
return __memcmp(p, q, size);
 }
 
-- 
2.13.3



[PATCH v2 4/5] powerpc/lib: inline string functions NUL size verification

2018-05-17 Thread Christophe Leroy
Many calls to memcmp(), strncmp(), strncpy() and memchr()
are done with constant size.

This patch gives GCC a chance to optimise out
the NUL size verification.

This is only done when CONFIG_FORTIFY_SOURCE is not set, because
when CONFIG_FORTIFY_SOURCE is set, other inline versions of the
functions are defined in linux/string.h and conflict with ours.
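
As an illustration, a hypothetical call site (not from this patch): with a
constant non-zero size, GCC folds away the unlikely(!size) test in the
inline wrapper and only the call to the out-of-line __memcmp() remains.

	static bool mac_is_zero(const u8 *addr)
	{
		static const u8 zero[6];

		/* The size is the constant 6, so the !size check is optimised out. */
		return memcmp(addr, zero, sizeof(zero)) == 0;
	}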

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/string.h  | 45 +++---
 arch/powerpc/kernel/prom_init_check.sh |  2 +-
 arch/powerpc/lib/memcmp_64.S   |  8 ++
 arch/powerpc/lib/string.S  | 14 +++
 arch/powerpc/lib/string_32.S   |  8 ++
 5 files changed, 73 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/string.h 
b/arch/powerpc/include/asm/string.h
index 9b8cedf618f4..35f1aaad9b50 100644
--- a/arch/powerpc/include/asm/string.h
+++ b/arch/powerpc/include/asm/string.h
@@ -15,17 +15,56 @@
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE
 
 extern char * strcpy(char *,const char *);
-extern char * strncpy(char *,const char *, __kernel_size_t);
 extern __kernel_size_t strlen(const char *);
 extern int strcmp(const char *,const char *);
-extern int strncmp(const char *, const char *, __kernel_size_t);
 extern char * strcat(char *, const char *);
 extern void * memset(void *,int,__kernel_size_t);
 extern void * memcpy(void *,const void *,__kernel_size_t);
 extern void * memmove(void *,const void *,__kernel_size_t);
+extern void * memcpy_flushcache(void *,const void *,__kernel_size_t);
+
+#ifdef CONFIG_FORTIFY_SOURCE
+
+extern char * strncpy(char *,const char *, __kernel_size_t);
+extern int strncmp(const char *, const char *, __kernel_size_t);
 extern int memcmp(const void *,const void *,__kernel_size_t);
 extern void * memchr(const void *,int,__kernel_size_t);
-extern void * memcpy_flushcache(void *,const void *,__kernel_size_t);
+
+#else
+
+extern char *__strncpy(char *,const char *, __kernel_size_t);
+extern int __strncmp(const char *, const char *, __kernel_size_t);
+extern int __memcmp(const void *,const void *,__kernel_size_t);
+extern void *__memchr(const void *,int,__kernel_size_t);
+
+static inline char *strncpy(char *p, const char *q, __kernel_size_t size)
+{
+   if (unlikely(!size))
+   return p;
+   return __strncpy(p, q, size);
+}
+
+static inline int strncmp(const char *p, const char *q, __kernel_size_t size)
+{
+   if (unlikely(!size))
+   return 0;
+   return __strncmp(p, q, size);
+}
+
+static inline int memcmp(const void *p,const void *q,__kernel_size_t size)
+{
+   if (unlikely(!size))
+   return 0;
+   return __memcmp(p, q, size);
+}
+
+static inline void *memchr(const void *p, int c, __kernel_size_t size)
+{
+   if (unlikely(!size))
+   return NULL;
+   return __memchr(p, c, size);
+}
+#endif
 
 #ifdef CONFIG_PPC64
 #define __HAVE_ARCH_MEMSET32
diff --git a/arch/powerpc/kernel/prom_init_check.sh 
b/arch/powerpc/kernel/prom_init_check.sh
index acb6b9226352..2d87e5f9d87b 100644
--- a/arch/powerpc/kernel/prom_init_check.sh
+++ b/arch/powerpc/kernel/prom_init_check.sh
@@ -18,7 +18,7 @@
 
 WHITELIST="add_reloc_offset __bss_start __bss_stop copy_and_flush
 _end enter_prom memcpy memset reloc_offset __secondary_hold
-__secondary_hold_acknowledge __secondary_hold_spinloop __start
+__secondary_hold_acknowledge __secondary_hold_spinloop __start  __strncmp
 strcmp strcpy strlcpy strlen strncmp strstr kstrtobool logo_linux_clut224
 reloc_got2 kernstart_addr memstart_addr linux_banner _stext
 __prom_init_toc_start __prom_init_toc_end btext_setup_display TOC."
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b7bd55..9b28286b85cf 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -29,8 +29,14 @@
 #define LD ldx
 #endif
 
+#ifndef CONFIG_FORTIFY_SOURCE
+#define memcmp __memcmp
+#endif
+
 _GLOBAL(memcmp)
+#ifdef CONFIG_FORTIFY_SOURCE
cmpdi   cr1,r5,0
+#endif
 
/* Use the short loop if both strings are not 8B aligned */
or  r6,r3,r4
@@ -39,7 +45,9 @@ _GLOBAL(memcmp)
/* Use the short loop if length is less than 32B */
cmpdi   cr6,r5,31
 
+#ifdef CONFIG_FORTIFY_SOURCE
beq cr1,.Lzero
+#endif
bne .Lshort
bgt cr6,.Llong
 
diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index 0ef189847337..2521c159e644 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -14,12 +14,20 @@
 #include 
 
.text
+
+#ifndef CONFIG_FORTIFY_SOURCE
+#define strncpy __strncpy
+#define strncmp __strncmp
+#define memchr __memchr
+#endif

 /* This clears out any unused part of the destination buffer,
just as the libc version does.  -- paulus */
 _GLOBAL(strncpy)
+#ifdef CONFIG_FORTIFY_SOURCE
PPC_LCMPI 0,r5,0
beqlr
+#endif
mtctr   r5
addir6,r3,-1

[PATCH v2 3/5] powerpc/lib: optimise PPC32 memcmp

2018-05-17 Thread Christophe Leroy
At the time being, memcmp() compares two chunks of memory
byte by byte.

This patch optimises the comparison by comparing word by word.

A small benchmark performed on an 8xx comparing two chunks
of 512 bytes, performed 10 times, gives:

Before : 5852274 TB ticks
After:   1488638 TB ticks

This is almost 4 times faster.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/lib/string_32.S | 42 +++---
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/string_32.S b/arch/powerpc/lib/string_32.S
index 2c11c2019b69..5c0e77baa9c7 100644
--- a/arch/powerpc/lib/string_32.S
+++ b/arch/powerpc/lib/string_32.S
@@ -20,13 +20,41 @@
 _GLOBAL(memcmp)
PPC_LCMPI 0,r5,0
beq-2f
-   mtctr   r5
-   addir6,r3,-1
-   addir4,r4,-1
-1: lbzur3,1(r6)
-   lbzur0,1(r4)
-   subf.   r3,r0,r3
-   bdnzt   2,1b
+   srawi.  r7, r5, 2   /* Divide len by 4 */
+   mr  r6, r3
+   beq-3f
+   mtctr   r7
+   li  r7, 0
+1:
+#ifdef __LITTLE_ENDIAN__
+   lwbrx   r3, r6, r7
+   lwbrx   r0, r4, r7
+#else
+   lwzxr3, r6, r7
+   lwzxr0, r4, r7
+#endif
+   addir7, r7, 4
+   subf.   r3, r0, r3
+   bdnzt   eq, 1b
+   bnelr
+   andi.   r5, r5, 3
+   beqlr
+3: cmplwi  cr1, r5, 2
+   blt-cr1, 4f
+#ifdef __LITTLE_ENDIAN__
+   lhbrx   r3, r6, r7
+   lhbrx   r0, r4, r7
+#else
+   lhzxr3, r6, r7
+   lhzxr0, r4, r7
+#endif
+   addir7, r7, 2
+   subf.   r3, r0, r3
+   beqlr   cr1
+   bnelr
+4: lbzxr3, r6, r7
+   lbzxr0, r4, r7
+   subf.   r3, r0, r3
blr
 2: li  r3,0
blr
-- 
2.13.3



[PATCH v2 2/5] powerpc/lib: optimise 32 bits __clear_user()

2018-05-17 Thread Christophe Leroy
Rewrite clear_user() on the same principle as memset(0), making use
of dcbz to clear complete cache lines.

This code is a copy/paste of memset(), with some modifications
in order to retrieve the remaining number of bytes to be cleared,
as it needs to be returned in case of error.

On a MPC885, throughput is almost doubled:

Before:
~# dd if=/dev/zero of=/dev/null bs=1M count=1000
1048576000 bytes (1000.0MB) copied, 18.990779 seconds, 52.7MB/s

After:
~# dd if=/dev/zero of=/dev/null bs=1M count=1000
1048576000 bytes (1000.0MB) copied, 9.611468 seconds, 104.0MB/s

On a MPC8321, throughput is multiplied by 2.12:

Before:
root@vgoippro:~# dd if=/dev/zero of=/dev/null bs=1M count=1000
1048576000 bytes (1000.0MB) copied, 6.844352 seconds, 146.1MB/s

After:
root@vgoippro:~# dd if=/dev/zero of=/dev/null bs=1M count=1000
1048576000 bytes (1000.0MB) copied, 3.218854 seconds, 310.7MB/s

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/lib/string_32.S | 85 +++-
 1 file changed, 60 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/lib/string_32.S b/arch/powerpc/lib/string_32.S
index ab8c4f5f31b6..2c11c2019b69 100644
--- a/arch/powerpc/lib/string_32.S
+++ b/arch/powerpc/lib/string_32.S
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
.text
 
@@ -31,44 +32,78 @@ _GLOBAL(memcmp)
blr
 EXPORT_SYMBOL(memcmp)
 
+CACHELINE_BYTES = L1_CACHE_BYTES
+LG_CACHELINE_BYTES = L1_CACHE_SHIFT
+CACHELINE_MASK = (L1_CACHE_BYTES-1)
+
 _GLOBAL(__clear_user)
-   addir6,r3,-4
-   li  r3,0
-   li  r5,0
-   cmplwi  0,r4,4
+/*
+ * Use dcbz on the complete cache lines in the destination
+ * to set them to zero.  This requires that the destination
+ * area is cacheable.
+ */
+   cmplwi  cr0, r4, 4
+   mr  r10, r3
+   li  r3, 0
blt 7f
-   /* clear a single word */
-11:stwur5,4(r6)
+
+11:stw r3, 0(r10)
beqlr
-   /* clear word sized chunks */
-   andi.   r0,r6,3
-   add r4,r0,r4
-   subfr6,r0,r6
-   srwir0,r4,2
-   andi.   r4,r4,3
+   andi.   r0, r10, 3
+   add r11, r0, r4
+   subfr6, r0, r10
+
+   clrlwi  r7, r6, 32 - LG_CACHELINE_BYTES
+   add r8, r7, r11
+   srwir9, r8, LG_CACHELINE_BYTES
+   addic.  r9, r9, -1  /* total number of complete cachelines */
+   ble 2f
+   xorir0, r7, CACHELINE_MASK & ~3
+   srwi.   r0, r0, 2
+   beq 3f
+   mtctr   r0
+4: stwur3, 4(r6)
+   bdnz4b
+3: mtctr   r9
+   li  r7, 4
+10:dcbzr7, r6
+   addir6, r6, CACHELINE_BYTES
+   bdnz10b
+   clrlwi  r11, r8, 32 - LG_CACHELINE_BYTES
+   addir11, r11, 4
+
+2: srwir0 ,r11 ,2
mtctr   r0
-   bdz 7f
-1: stwur5,4(r6)
+   bdz 6f
+1: stwur3, 4(r6)
bdnz1b
-   /* clear byte sized chunks */
-7: cmpwi   0,r4,0
+6: andi.   r11, r11, 3
beqlr
-   mtctr   r4
-   addir6,r6,3
-8: stbur5,1(r6)
+   mtctr   r11
+   addir6, r6, 3
+8: stbur3, 1(r6)
bdnz8b
blr
-90:mr  r3,r4
+
+7: cmpwi   cr0, r4, 0
+   beqlr
+   mtctr   r4
+   addir6, r10, -1
+9: stbur3, 1(r6)
+   bdnz9b
blr
-91:mfctr   r3
-   slwir3,r3,2
-   add r3,r3,r4
+
+90:mr  r3, r4
blr
-92:mfctr   r3
+91:add r3, r10, r4
+   subfr3, r6, r3
blr
 
EX_TABLE(11b, 90b)
+   EX_TABLE(4b, 91b)
+   EX_TABLE(10b, 91b)
EX_TABLE(1b, 91b)
-   EX_TABLE(8b, 92b)
+   EX_TABLE(8b, 91b)
+   EX_TABLE(9b, 91b)
 
 EXPORT_SYMBOL(__clear_user)
-- 
2.13.3



[PATCH v2 1/5] powerpc/lib: move PPC32 specific functions out of string.S

2018-05-17 Thread Christophe Leroy
In preparation for optimisation patches, move the PPC32-specific
memcmp() and __clear_user() into string_32.S.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/lib/Makefile|  5 +--
 arch/powerpc/lib/string.S| 61 
 arch/powerpc/lib/string_32.S | 74 
 3 files changed, 77 insertions(+), 63 deletions(-)
 create mode 100644 arch/powerpc/lib/string_32.S

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 653901042ad7..2c9b8c0adf22 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -26,13 +26,14 @@ obj-$(CONFIG_PPC_BOOK3S_64) += copyuser_power7.o 
copypage_power7.o \
   memcpy_power7.o
 
 obj64-y+= copypage_64.o copyuser_64.o mem_64.o hweight_64.o \
-  string_64.o memcpy_64.o memcmp_64.o pmem.o
+  memcpy_64.o memcmp_64.o pmem.o
 
 obj64-$(CONFIG_SMP)+= locks.o
 obj64-$(CONFIG_ALTIVEC)+= vmx-helper.o
 obj64-$(CONFIG_KPROBES_SANITY_TEST) += test_emulate_step.o
 
-obj-y  += checksum_$(BITS).o checksum_wrappers.o
+obj-y  += checksum_$(BITS).o checksum_wrappers.o \
+  string_$(BITS).o
 
 obj-y  += sstep.o ldstfp.o quad.o
 obj64-y+= quad.o
diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index a026d8fa8a99..0ef189847337 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -59,23 +59,6 @@ _GLOBAL(strncmp)
blr
 EXPORT_SYMBOL(strncmp)
 
-#ifdef CONFIG_PPC32
-_GLOBAL(memcmp)
-   PPC_LCMPI 0,r5,0
-   beq-2f
-   mtctr   r5
-   addir6,r3,-1
-   addir4,r4,-1
-1: lbzur3,1(r6)
-   lbzur0,1(r4)
-   subf.   r3,r0,r3
-   bdnzt   2,1b
-   blr
-2: li  r3,0
-   blr
-EXPORT_SYMBOL(memcmp)
-#endif
-
 _GLOBAL(memchr)
PPC_LCMPI 0,r5,0
beq-2f
@@ -91,47 +74,3 @@ _GLOBAL(memchr)
 2: li  r3,0
blr
 EXPORT_SYMBOL(memchr)
-
-#ifdef CONFIG_PPC32
-_GLOBAL(__clear_user)
-   addir6,r3,-4
-   li  r3,0
-   li  r5,0
-   cmplwi  0,r4,4
-   blt 7f
-   /* clear a single word */
-11:stwur5,4(r6)
-   beqlr
-   /* clear word sized chunks */
-   andi.   r0,r6,3
-   add r4,r0,r4
-   subfr6,r0,r6
-   srwir0,r4,2
-   andi.   r4,r4,3
-   mtctr   r0
-   bdz 7f
-1: stwur5,4(r6)
-   bdnz1b
-   /* clear byte sized chunks */
-7: cmpwi   0,r4,0
-   beqlr
-   mtctr   r4
-   addir6,r6,3
-8: stbur5,1(r6)
-   bdnz8b
-   blr
-90:mr  r3,r4
-   blr
-91:mfctr   r3
-   slwir3,r3,2
-   add r3,r3,r4
-   blr
-92:mfctr   r3
-   blr
-
-   EX_TABLE(11b, 90b)
-   EX_TABLE(1b, 91b)
-   EX_TABLE(8b, 92b)
-
-EXPORT_SYMBOL(__clear_user)
-#endif
diff --git a/arch/powerpc/lib/string_32.S b/arch/powerpc/lib/string_32.S
new file mode 100644
index ..ab8c4f5f31b6
--- /dev/null
+++ b/arch/powerpc/lib/string_32.S
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * String handling functions for PowerPC32
+ *
+ * Copyright (C) 2018 CS Systemes d'Information
+ *
+ * Author: Christophe Leroy 
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+   .text
+
+_GLOBAL(memcmp)
+   PPC_LCMPI 0,r5,0
+   beq-2f
+   mtctr   r5
+   addir6,r3,-1
+   addir4,r4,-1
+1: lbzur3,1(r6)
+   lbzur0,1(r4)
+   subf.   r3,r0,r3
+   bdnzt   2,1b
+   blr
+2: li  r3,0
+   blr
+EXPORT_SYMBOL(memcmp)
+
+_GLOBAL(__clear_user)
+   addir6,r3,-4
+   li  r3,0
+   li  r5,0
+   cmplwi  0,r4,4
+   blt 7f
+   /* clear a single word */
+11:stwur5,4(r6)
+   beqlr
+   /* clear word sized chunks */
+   andi.   r0,r6,3
+   add r4,r0,r4
+   subfr6,r0,r6
+   srwir0,r4,2
+   andi.   r4,r4,3
+   mtctr   r0
+   bdz 7f
+1: stwur5,4(r6)
+   bdnz1b
+   /* clear byte sized chunks */
+7: cmpwi   0,r4,0
+   beqlr
+   mtctr   r4
+   addir6,r6,3
+8: stbur5,1(r6)
+   bdnz8b
+   blr
+90:mr  r3,r4
+   blr
+91:mfctr   r3
+   slwir3,r3,2
+   add r3,r3,r4
+   blr
+92:mfctr   r3
+   blr
+
+   EX_TABLE(11b, 90b)
+   EX_TABLE(1b, 91b)
+   EX_TABLE(8b, 92b)
+
+EXPORT_SYMBOL(__clear_user)
-- 
2.13.3



[PATCH v2 0/5] powerpc/lib: Optimisation of string functions (mainly for PPC32)

2018-05-17 Thread Christophe Leroy
This series intends to optimise string functions for PPC32 in the
same spirit as already done on PPC64.

The first patch moves PPC32-specific functions from string.S into
a dedicated file named string_32.S.
The second patch rewrites __clear_user() by using the dcbz instruction.
The third patch rewrites memcmp() to compare 32-bit words instead
of comparing byte by byte.
The fourth patch removes the NUL size verification from the assembly
functions so that GCC can optimise it out when the size is constant.
The last patch inlines memcmp() for constant sizes <= 16.

As shown in each individual commit log, the second, third and last patches
provide significant improvements.

Changes in v2:
- Moved out the patch removing the hot loop alignment on PPC32
- Squashed the changes related to NUL size verification in a single patch
- Reordered the patches in a more logical order
- Modified the inlining patch to avoid warning about impossibility to version 
symbols.

Christophe Leroy (5):
  powerpc/lib: move PPC32 specific functions out of string.S
  powerpc/lib: optimise 32 bits __clear_user()
  powerpc/lib: optimise PPC32 memcmp
  powerpc/lib: inline string functions NUL size verification
  powerpc/lib: inline memcmp() for small constant sizes

 arch/powerpc/include/asm/string.h  |  91 -
 arch/powerpc/kernel/prom_init_check.sh |   2 +-
 arch/powerpc/lib/Makefile  |   5 +-
 arch/powerpc/lib/memcmp_64.S   |   8 ++
 arch/powerpc/lib/string.S  |  75 -
 arch/powerpc/lib/string_32.S   | 145 +
 6 files changed, 259 insertions(+), 67 deletions(-)
 create mode 100644 arch/powerpc/lib/string_32.S

-- 
2.13.3



Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-17 Thread Florian Weimer

On 05/16/2018 10:35 PM, Ram Pai wrote:

So let me see if I understand the overall idea.

Application can allocate new keys through a new syscall
sys_pkey_alloc_1(flags, init_val, sig_init_val)

'sig_init_val' is the permission-state of the key in signal context.


I would keep the existing system call and just add a flag, say 
PKEY_ALLOC_SETSIGNAL.  If the current thread needs different access 
rights, it can set those rights just after pkey_alloc returns.  There is 
no race that matters here, I think.
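
For illustration, userspace usage under the proposed flag could look like
this sketch (PKEY_ALLOC_SETSIGNAL is only the name suggested above, not an
existing flag; pkey_alloc()/pkey_set() are the glibc wrappers):

#define _GNU_SOURCE
#include <err.h>
#include <sys/mman.h>

/* Allocate a key whose signal-handler rights are "no access", then give
 * the current thread full access to it. */
static int alloc_signal_safe_key(void)
{
	int pkey = pkey_alloc(PKEY_ALLOC_SETSIGNAL, PKEY_DISABLE_ACCESS);

	if (pkey < 0)
		err(1, "pkey_alloc");

	pkey_set(pkey, 0);	/* full access for this thread only */
	return pkey;
}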


Thanks,
Florian


Re: [PATCH] pkeys: Introduce PKEY_ALLOC_SIGNALINHERIT and change signal semantics

2018-05-17 Thread Florian Weimer

On 05/16/2018 11:07 PM, Ram Pai wrote:


what would change the key-permission-values enforced in signal-handler
context?  Or can it never be changed, ones set through sys_pkey_alloc()?


The access rights can only be set by pkey_alloc and are unchanged after 
that (so we do not have to discuss whether the signal handler access 
rights are per-thread or not).



I suppose key-permission-values change done in non-signal-handler context,
will not apply to those in signal-handler context.


Correct, that is the plan.


Can the signal handler change the key-permission-values from the
signal-handler context?


Yes, changes are possible.  The access rights given to pkey_alloc only 
specify the initial access rights when the signal handler is entered.


We need to decide if we should restore it on exit from the signal 
handler.  There is also the matter of siglongjmp, which currently does 
not restore the current thread's access rights.  In general, this might 
be difficult to implement because of the limited space in jmp_buf.


Thanks,
Florian


[PATCH] powerpc/lib: Remove .balign inside string functions for PPC32

2018-05-17 Thread Christophe Leroy
commit 87a156fb18fe1 ("Align hot loops of some string functions")
degraded the performance of string functions by adding useless
nops.

A simple benchmark on an 8xx calling 10x a memchr() that
matches the first byte runs in 41668 TB ticks before this patch
and in 35986 TB ticks after this patch. So this gives an
improvement of approx 10%.

Another benchmark doing the same with a memchr() matching the 128th
byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
after this patch, so regardless of the number of loops, removing
those useless nops improves the test by 5683 TB ticks.

Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
Signed-off-by: Christophe Leroy 
---
 Was sent already as part of a serie optimising string functions.
 Resending on itself as it is independent of the other changes in the serie

 arch/powerpc/lib/string.S | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index a787776822d8..a026d8fa8a99 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -23,7 +23,9 @@ _GLOBAL(strncpy)
	mtctr	r5
	addi	r6,r3,-1
	addi	r4,r4,-1
+#ifdef CONFIG_PPC64
	.balign 16
+#endif
 1:	lbzu	r0,1(r4)
	cmpwi	0,r0,0
	stbu	r0,1(r6)
@@ -43,7 +45,9 @@ _GLOBAL(strncmp)
	mtctr	r5
	addi	r5,r3,-1
	addi	r4,r4,-1
+#ifdef CONFIG_PPC64
	.balign 16
+#endif
 1:	lbzu	r3,1(r5)
	cmpwi	1,r3,0
	lbzu	r0,1(r4)
@@ -77,7 +81,9 @@ _GLOBAL(memchr)
	beq-	2f
	mtctr	r5
	addi	r3,r3,-1
+#ifdef CONFIG_PPC64
	.balign 16
+#endif
 1:	lbzu	r0,1(r3)
	cmpw	0,r0,r4
	bdnzf	2,1b
-- 
2.13.3



Patch "futex: Remove duplicated code and fix undefined behaviour" has been added to the 4.9-stable tree

2018-05-17 Thread gregkh

This is a note to let you know that I've just added the patch titled

futex: Remove duplicated code and fix undefined behaviour

to the 4.9-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 futex-remove-duplicated-code-and-fix-undefined-behaviour.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


From 30d6e0a4190d37740e9447e4e4815f06992dd8c3 Mon Sep 17 00:00:00 2001
From: Jiri Slaby 
Date: Thu, 24 Aug 2017 09:31:05 +0200
Subject: futex: Remove duplicated code and fix undefined behaviour

From: Jiri Slaby 

commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.

There is code duplicated over all architectures' headers for
futex_atomic_op_inuser: namely op decoding, the access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

This effectively distributes Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause an undefined behaviour report.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).
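
For context, the consolidated generic helper described above has roughly
the following shape (a simplified sketch of the pattern, not the exact
upstream code):

static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
{
	unsigned int op  = (encoded_op & 0x70000000) >> 28;
	unsigned int cmp = (encoded_op & 0x0f000000) >> 24;
	int oparg  = sign_extend32((encoded_op & 0x00fff000) >> 12, 11);
	int cmparg = sign_extend32(encoded_op & 0x00000fff, 11);
	int oldval, ret;

	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
		/* Reject negative or oversized shift counts instead of
		 * hitting undefined behaviour in 1 << oparg. */
		if (oparg < 0 || oparg > 31)
			return -EINVAL;
		oparg = 1 << oparg;
	}

	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
		return -EFAULT;

	/* Only this part remains per-architecture assembly. */
	ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
	if (ret)
		return ret;

	switch (cmp) {
	case FUTEX_OP_CMP_EQ: return oldval == cmparg;
	case FUTEX_OP_CMP_NE: return oldval != cmparg;
	case FUTEX_OP_CMP_LT: return oldval <  cmparg;
	case FUTEX_OP_CMP_GE: return oldval >= cmparg;
	case FUTEX_OP_CMP_LE: return oldval <= cmparg;
	case FUTEX_OP_CMP_GT: return oldval >  cmparg;
	default:	      return -ENOSYS;
	}
}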

Signed-off-by: Jiri Slaby 
Signed-off-by: Thomas Gleixner 
Acked-by: Russell King 
Acked-by: Michael Ellerman  (powerpc)
Acked-by: Heiko Carstens  [s390]
Acked-by: Chris Metcalf  [for tile]
Reviewed-by: Darren Hart (VMware) 
Reviewed-by: Will Deacon  [core/arm64]
Cc: linux-m...@linux-mips.org
Cc: Rich Felker 
Cc: linux-i...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: pet...@infradead.org
Cc: Benjamin Herrenschmidt 
Cc: Max Filippov 
Cc: Paul Mackerras 
Cc: sparcli...@vger.kernel.org
Cc: Jonas Bonn 
Cc: linux-s...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: Yoshinori Sato 
Cc: linux-hexa...@vger.kernel.org
Cc: Helge Deller 
Cc: "James E.J. Bottomley" 
Cc: Catalin Marinas 
Cc: Matt Turner 
Cc: linux-snps-...@lists.infradead.org
Cc: Fenghua Yu 
Cc: Arnd Bergmann 
Cc: linux-xte...@linux-xtensa.org
Cc: Stefan Kristiansson 
Cc: openr...@lists.librecores.org
Cc: Ivan Kokshaysky 
Cc: Stafford Horne 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Richard Henderson 
Cc: Chris Zankel 
Cc: Michal Simek 
Cc: Tony Luck 
Cc: linux-par...@vger.kernel.org
Cc: Vineet Gupta 
Cc: Ralf Baechle 
Cc: Richard Kuo 
Cc: linux-al...@vger.kernel.org
Cc: Martin Schwidefsky 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" 
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jsl...@suse.cz
Cc: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/alpha/include/asm/futex.h  |   26 +++---
 arch/arc/include/asm/futex.h|   40 +++-
 arch/arm/include/asm/futex.h|   26 ++
 arch/arm64/include/asm/futex.h  |   27 ++-
 arch/frv/include/asm/futex.h|3 +-
 arch/frv/kernel/futex.c |   27 ++-
 arch/hexagon/include/asm/futex.h|   38 ++-
 arch/ia64/include/asm/futex.h   |   25 ++
 arch/microblaze/include/asm/futex.h |   38 ++-
 arch/mips/include/asm/futex.h   |   25 ++
 arch/parisc/include/asm/futex.h |   26 ++
 arch/powerpc/include/asm/futex.h|   26 +++---
 arch/s390/include/asm/futex.h   |   23 +++-
 arch/sh/include/asm/futex.h |   26 ++
 arch/sparc/include/asm/futex_64.h   |   26 +++---
 arch/tile/include/asm/futex.h   |   40 +++-
 arch/x86/include/asm/futex.h|   40 +++-
 

Patch "futex: Remove duplicated code and fix undefined behaviour" has been added to the 4.4-stable tree

2018-05-17 Thread gregkh

This is a note to let you know that I've just added the patch titled

futex: Remove duplicated code and fix undefined behaviour

to the 4.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 futex-remove-duplicated-code-and-fix-undefined-behaviour.patch
and it can be found in the queue-4.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


From 30d6e0a4190d37740e9447e4e4815f06992dd8c3 Mon Sep 17 00:00:00 2001
From: Jiri Slaby 
Date: Thu, 24 Aug 2017 09:31:05 +0200
Subject: futex: Remove duplicated code and fix undefined behaviour

From: Jiri Slaby 

commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.

There is code duplicated over all architectures' headers for
futex_atomic_op_inuser: namely op decoding, the access_ok check for uaddr,
and comparison of the result.

Remove this duplication and leave up to the arches only the needed
assembly which is now in arch_futex_atomic_op_inuser.

This effectively distributes Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for negative oparg too, because it was
also reported to cause an undefined behaviour report.

Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") as access_ok there returns true.
We introduce it back to the helper for the sake of simplicity (it gets
optimized away anyway).

Signed-off-by: Jiri Slaby 
Signed-off-by: Thomas Gleixner 
Acked-by: Russell King 
Acked-by: Michael Ellerman  (powerpc)
Acked-by: Heiko Carstens  [s390]
Acked-by: Chris Metcalf  [for tile]
Reviewed-by: Darren Hart (VMware) 
Reviewed-by: Will Deacon  [core/arm64]
Cc: linux-m...@linux-mips.org
Cc: Rich Felker 
Cc: linux-i...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: pet...@infradead.org
Cc: Benjamin Herrenschmidt 
Cc: Max Filippov 
Cc: Paul Mackerras 
Cc: sparcli...@vger.kernel.org
Cc: Jonas Bonn 
Cc: linux-s...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: Yoshinori Sato 
Cc: linux-hexa...@vger.kernel.org
Cc: Helge Deller 
Cc: "James E.J. Bottomley" 
Cc: Catalin Marinas 
Cc: Matt Turner 
Cc: linux-snps-...@lists.infradead.org
Cc: Fenghua Yu 
Cc: Arnd Bergmann 
Cc: linux-xte...@linux-xtensa.org
Cc: Stefan Kristiansson 
Cc: openr...@lists.librecores.org
Cc: Ivan Kokshaysky 
Cc: Stafford Horne 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Richard Henderson 
Cc: Chris Zankel 
Cc: Michal Simek 
Cc: Tony Luck 
Cc: linux-par...@vger.kernel.org
Cc: Vineet Gupta 
Cc: Ralf Baechle 
Cc: Richard Kuo 
Cc: linux-al...@vger.kernel.org
Cc: Martin Schwidefsky 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" 
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jsl...@suse.cz
Cc: Ben Hutchings 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/alpha/include/asm/futex.h  |   26 +++---
 arch/arc/include/asm/futex.h|   40 +++-
 arch/arm/include/asm/futex.h|   26 ++
 arch/arm64/include/asm/futex.h  |   26 ++
 arch/frv/include/asm/futex.h|3 +-
 arch/frv/kernel/futex.c |   27 ++-
 arch/hexagon/include/asm/futex.h|   38 ++-
 arch/ia64/include/asm/futex.h   |   25 ++
 arch/microblaze/include/asm/futex.h |   38 ++-
 arch/mips/include/asm/futex.h   |   25 ++
 arch/parisc/include/asm/futex.h |   25 ++
 arch/powerpc/include/asm/futex.h|   26 +++---
 arch/s390/include/asm/futex.h   |   23 +++-
 arch/sh/include/asm/futex.h |   26 ++
 arch/sparc/include/asm/futex_64.h   |   26 +++---
 arch/tile/include/asm/futex.h   |   40 +++-
 arch/x86/include/asm/futex.h|   40 +++-
 

Re: [PATCH] crypto: reorder paes test lexicographically

2018-05-17 Thread Corentin Labbe
On Fri, May 11, 2018 at 09:04:06AM +0100, Gilad Ben-Yossef wrote:
> Due to a snafu "paes" testmgr tests were not ordered
> lexicographically, which led to boot time warnings.
> Reorder the tests as needed.
> 
> Fixes: a794d8d ("crypto: ccree - enable support for hardware keys")
> Reported-by: Abdul Haleem 
> Signed-off-by: Gilad Ben-Yossef 

Tested-by: Corentin Labbe 

Thanks


Re: [PATCH 07/14] powerpc: Add support for restartable sequences

2018-05-17 Thread Peter Zijlstra
On Thu, May 17, 2018 at 09:19:49AM +0800, Boqun Feng wrote:
> On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:

> > and that x86 calls it from syscall_return_slowpath() (which AFAIU is
> > now used in the fast-path since KPTI), I wonder where we should call
> 
> So we actually detect this after the syscall takes effect, right? I
> wonder whether this could be problematic, because "disallowing syscall"
> in rseq areas may mean the syscall won't take effect to some people, I
> guess?

It doesn't really matter, I suspect; the important part is the program
getting killed.

I agree that doing it on sysenter is slightly nicer, but I'll take
sysexit if that's what it takes.

> > this on PowerPC ?  I was under the impression that PowerPC return to
> > userspace fast-path was not calling C code unless work flags were set,
> > but I might be wrong.
> > 
> 
> I think you're right. So we have to introduce a callsite to rseq_syscall()
> in syscall path, something like:
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 51695608c68b..a25734a96640 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -222,6 +222,9 @@ system_call_exit:
>   mtmsrd  r11,1
>  #endif /* CONFIG_PPC_BOOK3E */
>  
> +	addi	r3,r1,STACK_FRAME_OVERHEAD
> + bl  rseq_syscall
> +
>   ld  r9,TI_FLAGS(r12)
>   li  r11,-MAX_ERRNO
>   andi.   
> r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
> 
> But I think it's important for us to first decide where (before or after
> the syscall) we do the detection.

The important thing is the process getting very dead. Either sysenter
or sysexit gets that done.


[PATCH v3] KVM: PPC: Book3S HV: lockless tlbie for HPT hcalls

2018-05-17 Thread Nicholas Piggin
tlbies to an LPAR do not have to be serialised since POWER4/PPC970,
after which the MMU_FTR_LOCKLESS_TLBIE feature was introduced to
avoid tlbie locking.

Since commit c17b98cf6028 ("KVM: PPC: Book3S HV: Remove code for
PPC970 processors"), KVM no longer supports processors that do not
have this feature, so the tlbie locking can be removed completely.
A sanity check for the feature is put in kvmppc_mmu_hv_init.

Testing was done on a POWER9 system in HPT mode, with a -smp 32 guest
in HPT mode. 32 instances of the powerpc fork benchmark from selftests
were run with --fork, and the results measured.

Without this patch, total throughput was about 13.5K/sec, and this is
the top of the host profile:

   74.52%  [k] do_tlbies
2.95%  [k] kvmppc_book3s_hv_page_fault
1.80%  [k] calc_checksum
1.80%  [k] kvmppc_vcpu_run_hv
1.49%  [k] kvmppc_run_core

After this patch, throughput was about 51K/sec, with this profile:

   21.28%  [k] do_tlbies
5.26%  [k] kvmppc_run_core
4.88%  [k] kvmppc_book3s_hv_page_fault
3.30%  [k] _raw_spin_lock_irqsave
3.25%  [k] gup_pgd_range

Signed-off-by: Nicholas Piggin 
---
Changes since v2:
- Removed the locking code completely (thanks mpe)
- Updated changelog (thanks mpe)

 arch/powerpc/include/asm/kvm_host.h |  1 -
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  3 +++
 arch/powerpc/kvm/book3s_hv_rm_mmu.c | 21 -
 3 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 17498e9a26e4..7756b0c6da75 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -269,7 +269,6 @@ struct kvm_arch {
unsigned long host_lpcr;
unsigned long sdr1;
unsigned long host_sdr1;
-   int tlbie_lock;
unsigned long lpcr;
unsigned long vrma_slb_v;
int mmu_ready;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index a670fa5fbe50..37cd6434d1c8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -272,6 +272,9 @@ int kvmppc_mmu_hv_init(void)
if (!cpu_has_feature(CPU_FTR_HVMODE))
return -EINVAL;
 
+   if (!mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE))
+   return -EINVAL;
+
/* POWER7 has 10-bit LPIDs (12-bit in POWER8) */
host_lpid = mfspr(SPRN_LPID);
rsvd_lpid = LPID_RSVD;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 78e6a392330f..89d909b3b881 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -434,24 +434,6 @@ static inline int is_mmio_hpte(unsigned long v, unsigned 
long r)
(HPTE_R_KEY_HI | HPTE_R_KEY_LO));
 }
 
-static inline int try_lock_tlbie(unsigned int *lock)
-{
-   unsigned int tmp, old;
-   unsigned int token = LOCK_TOKEN;
-
-   asm volatile("1:	lwarx	%1,0,%2\n"
-"	cmpwi	cr0,%1,0\n"
-"	bne	2f\n"
-"	stwcx.	%3,0,%2\n"
-"	bne-	1b\n"
-"	isync\n"
-"2:"
-: "=&r" (tmp), "=&r" (old)
-: "r" (lock), "r" (token)
-: "cc", "memory");
-   return old == 0;
-}
-
 static void do_tlbies(struct kvm *kvm, unsigned long *rbvalues,
  long npages, int global, bool need_sync)
 {
@@ -463,8 +445,6 @@ static void do_tlbies(struct kvm *kvm, unsigned long 
*rbvalues,
 * the RS field, this is backwards-compatible with P7 and P8.
 */
if (global) {
-   while (!try_lock_tlbie(&kvm->arch.tlbie_lock))
-   cpu_relax();
if (need_sync)
asm volatile("ptesync" : : : "memory");
for (i = 0; i < npages; ++i) {
@@ -483,7 +463,6 @@ static void do_tlbies(struct kvm *kvm, unsigned long 
*rbvalues,
}
 
asm volatile("eieio; tlbsync; ptesync" : : : "memory");
-   kvm->arch.tlbie_lock = 0;
} else {
if (need_sync)
asm volatile("ptesync" : : : "memory");
-- 
2.17.0



[PATCH bpf 0/6] bpf: enhancements for multi-function programs

2018-05-17 Thread Sandipan Das
This patch series introduces the following:

[1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.

[2] A way to resolve function calls, needed because of the way JITed
images are allocated on powerpc64.

[3] Fix to get JITed instruction dumps for multi-function programs from
the bpf system call.

Sandipan Das (6):
  bpf: support 64-bit offsets for bpf function calls
  bpf: powerpc64: add JIT support for multi-function programs
  bpf: get kernel symbol addresses via syscall
  tools: bpf: sync bpf uapi header
  tools: bpftool: resolve calls without using imm field
  bpf: fix JITed dump for multi-function programs via syscall

 arch/powerpc/net/bpf_jit_comp64.c | 79 ++-
 include/uapi/linux/bpf.h  |  2 +
 kernel/bpf/syscall.c  | 56 ---
 kernel/bpf/verifier.c | 22 +++
 tools/bpf/bpftool/prog.c  | 31 +++
 tools/bpf/bpftool/xlated_dumper.c | 24 
 tools/bpf/bpftool/xlated_dumper.h |  2 +
 tools/include/uapi/linux/bpf.h|  2 +
 8 files changed, 189 insertions(+), 29 deletions(-)

-- 
2.14.3



[PATCH bpf 4/6] tools: bpf: sync bpf uapi header

2018-05-17 Thread Sandipan Das
Syncing the bpf.h uapi header with tools so that struct
bpf_prog_info has the two new fields for passing on the
addresses of the kernel symbols corresponding to each
function in a JITed program.

Signed-off-by: Sandipan Das 
---
 tools/include/uapi/linux/bpf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 83a95ae388dd..c14a74eea910 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2107,6 +2107,8 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;
__aligned_u64 xlated_prog_insns;
+   __aligned_u64 jited_ksyms;
+   __u32 nr_jited_ksyms;
__u64 load_time;/* ns since boottime */
__u32 created_by_uid;
__u32 nr_map_ids;
-- 
2.14.3



[PATCH bpf 2/6] bpf: powerpc64: add JIT support for multi-function programs

2018-05-17 Thread Sandipan Das
This adds support for bpf-to-bpf function calls in the powerpc64
JIT compiler. The JIT compiler converts the bpf call instructions
to native branch instructions. After a round of the usual passes,
the start addresses of the JITed images for the callee functions
are known. Finally, to fixup the branch target addresses, we need
to perform an extra pass.

Because of the address range in which JITed images are allocated
on powerpc64, the offsets of the start addresses of these images
from __bpf_call_base are as large as 64 bits. So, for a function
call, we cannot use the imm field of the instruction to determine
the callee's address. Instead, we use the alternative method of
getting it from the list of function addresses in the auxiliary
data of the caller by using the off field as an index.

Signed-off-by: Sandipan Das 
---
 arch/powerpc/net/bpf_jit_comp64.c | 79 ++-
 1 file changed, 69 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c 
b/arch/powerpc/net/bpf_jit_comp64.c
index 1bdb1aff0619..25939892d8f7 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -256,7 +256,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct 
codegen_context *ctx, u32
 /* Assemble the body code between the prologue & epilogue */
 static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
  struct codegen_context *ctx,
- u32 *addrs)
+ u32 *addrs, bool extra_pass)
 {
const struct bpf_insn *insn = fp->insnsi;
int flen = fp->len;
@@ -712,11 +712,23 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
break;
 
/*
-* Call kernel helper
+* Call kernel helper or bpf function
 */
case BPF_JMP | BPF_CALL:
ctx->seen |= SEEN_FUNC;
-   func = (u8 *) __bpf_call_base + imm;
+
+   /* bpf function call */
+   if (insn[i].src_reg == BPF_PSEUDO_CALL && extra_pass)
+   if (fp->aux->func && off < fp->aux->func_cnt)
+   /* use the subprog id from the off
+* field to lookup the callee address
+*/
+   func = (u8 *) 
fp->aux->func[off]->bpf_func;
+   else
+   return -EINVAL;
+   /* kernel helper call */
+   else
+   func = (u8 *) __bpf_call_base + imm;
 
bpf_jit_emit_func_call(image, ctx, (u64)func);
 
@@ -864,6 +876,14 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
return 0;
 }
 
+struct powerpc64_jit_data {
+   struct bpf_binary_header *header;
+   u32 *addrs;
+   u8 *image;
+   u32 proglen;
+   struct codegen_context ctx;
+};
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 {
u32 proglen;
@@ -871,6 +891,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
u8 *image = NULL;
u32 *code_base;
u32 *addrs;
+   struct powerpc64_jit_data *jit_data;
struct codegen_context cgctx;
int pass;
int flen;
@@ -878,6 +899,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
struct bpf_prog *org_fp = fp;
struct bpf_prog *tmp_fp;
bool bpf_blinded = false;
+   bool extra_pass = false;
 
if (!fp->jit_requested)
return org_fp;
@@ -891,7 +913,28 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
fp = tmp_fp;
}
 
+   jit_data = fp->aux->jit_data;
+   if (!jit_data) {
+   jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
+   if (!jit_data) {
+   fp = org_fp;
+   goto out;
+   }
+   fp->aux->jit_data = jit_data;
+   }
+
flen = fp->len;
+   addrs = jit_data->addrs;
+   if (addrs) {
+   cgctx = jit_data->ctx;
+   image = jit_data->image;
+   bpf_hdr = jit_data->header;
+   proglen = jit_data->proglen;
+   alloclen = proglen + FUNCTION_DESCR_SIZE;
+   extra_pass = true;
+   goto skip_init_ctx;
+   }
+
addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
if (addrs == NULL) {
fp = org_fp;
@@ -904,10 +947,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
 
/* Scouting faux-generate pass 0 */
-   if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
+   if (bpf_jit_build_body(fp, 0, 

[PATCH bpf 6/6] bpf: fix JITed dump for multi-function programs via syscall

2018-05-17 Thread Sandipan Das
Currently, for multi-function programs, we cannot get the JITed
instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
command. Because of this, userspace tools such as bpftool fail
to identify a multi-function program as being JITed or not.

With the JIT enabled and the test program running, this can be
verified as follows:

  # cat /proc/sys/net/core/bpf_jit_enable
  1

Before applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
  loaded_at 2018-05-16T11:43:38+0530  uid 0
  xlated 216B  not jited  memlock 65536B
  ...

  # bpftool prog dump jited id 1
  no instructions returned

After applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
  loaded_at 2018-05-16T12:13:01+0530  uid 0
  xlated 216B  jited 308B  memlock 65536B
  ...

  # bpftool prog dump jited id 1
 0:   nop
 4:   nop
 8:   mflrr0
 c:   std r0,16(r1)
10:   stdur1,-112(r1)
14:   std r31,104(r1)
18:   addir31,r1,48
1c:   li  r3,10
  ...

Signed-off-by: Sandipan Das 
---
 kernel/bpf/syscall.c | 38 --
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 03c8437a2990..b2f70718aca7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1896,7 +1896,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
struct bpf_prog_info info = {};
u32 info_len = attr->info.info_len;
char __user *uinsns;
-   u32 ulen;
+   u32 ulen, i;
int err;
 
err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
@@ -1922,7 +1922,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
ulen = min_t(u32, info.nr_map_ids, ulen);
if (ulen) {
u32 __user *user_map_ids = u64_to_user_ptr(info.map_ids);
-   u32 i;
 
for (i = 0; i < ulen; i++)
if (put_user(prog->aux->used_maps[i]->id,
@@ -1970,13 +1969,41 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog 
*prog,
 * for offload.
 */
ulen = info.jited_prog_len;
-   info.jited_prog_len = prog->jited_len;
+   if (prog->aux->func_cnt) {
+   info.jited_prog_len = 0;
+   for (i = 0; i < prog->aux->func_cnt; i++)
+   info.jited_prog_len += prog->aux->func[i]->jited_len;
+   } else {
+   info.jited_prog_len = prog->jited_len;
+   }
+
if (info.jited_prog_len && ulen) {
if (bpf_dump_raw_ok()) {
uinsns = u64_to_user_ptr(info.jited_prog_insns);
ulen = min_t(u32, info.jited_prog_len, ulen);
-   if (copy_to_user(uinsns, prog->bpf_func, ulen))
-   return -EFAULT;
+
+   /* for multi-function programs, copy the JITed
+* instructions for all the functions
+*/
+   if (prog->aux->func_cnt) {
+   u32 len, free;
+   u8 *img;
+
+   free = ulen;
+   for (i = 0; i < prog->aux->func_cnt; i++) {
+   len = prog->aux->func[i]->jited_len;
+   img = (u8 *) 
prog->aux->func[i]->bpf_func;
+   if (len > free)
+   break;
+   if (copy_to_user(uinsns, img, len))
+   return -EFAULT;
+   uinsns += len;
+   free -= len;
+   }
+   } else {
+   if (copy_to_user(uinsns, prog->bpf_func, ulen))
+   return -EFAULT;
+   }
} else {
info.jited_prog_insns = 0;
}
@@ -1987,7 +2014,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
if (info.nr_jited_ksyms && ulen) {
u64 __user *user_jited_ksyms = 
u64_to_user_ptr(info.jited_ksyms);
ulong ksym_addr;
-   u32 i;
 
/* copy the address of the kernel symbol corresponding to
 * each function
-- 
2.14.3



[PATCH bpf 5/6] tools: bpftool: resolve calls without using imm field

2018-05-17 Thread Sandipan Das
Currently, we resolve the callee's address for a JITed function
call by using the imm field of the call instruction as an offset
from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
use this address to get the callee's kernel symbol's name.

For some architectures, such as powerpc64, the imm field is not
large enough to hold this offset. So, instead of assigning this
offset to the imm field, the verifier now assigns the subprog
id. Also, a list of kernel symbol addresses for all the JITed
functions is provided in the program info. We now use the imm
field as an index for this list to lookup a callee's symbol's
address and resolve its name.

Suggested-by: Daniel Borkmann 
Signed-off-by: Sandipan Das 
---
 tools/bpf/bpftool/prog.c  | 31 +++
 tools/bpf/bpftool/xlated_dumper.c | 24 +---
 tools/bpf/bpftool/xlated_dumper.h |  2 ++
 3 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 9bdfdf2d3fbe..ac2f62a97e84 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -430,6 +430,10 @@ static int do_dump(int argc, char **argv)
unsigned char *buf;
__u32 *member_len;
__u64 *member_ptr;
+   unsigned int nr_addrs;
+   unsigned long *addrs = NULL;
+   __u32 *ksyms_len;
+   __u64 *ksyms_ptr;
ssize_t n;
int err;
int fd;
@@ -437,6 +441,8 @@ static int do_dump(int argc, char **argv)
if (is_prefix(*argv, "jited")) {
member_len = &info.jited_prog_len;
member_ptr = &info.jited_prog_insns;
+   ksyms_len = &info.nr_jited_ksyms;
+   ksyms_ptr = &info.jited_ksyms;
} else if (is_prefix(*argv, "xlated")) {
member_len = &info.xlated_prog_len;
member_ptr = &info.xlated_prog_insns;
@@ -496,10 +502,23 @@ static int do_dump(int argc, char **argv)
return -1;
}
 
+   nr_addrs = *ksyms_len;
+   if (nr_addrs) {
+   addrs = malloc(nr_addrs * sizeof(__u64));
+   if (!addrs) {
+   p_err("mem alloc failed");
+   free(buf);
+   close(fd);
+   return -1;
+   }
+   }
+
memset(&info, 0, sizeof(info));
 
*member_ptr = ptr_to_u64(buf);
*member_len = buf_size;
+   *ksyms_ptr = ptr_to_u64(addrs);
+   *ksyms_len = nr_addrs;
 
err = bpf_obj_get_info_by_fd(fd, , );
close(fd);
@@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv)
goto err_free;
}
 
+   if (*ksyms_len > nr_addrs) {
+   p_err("too many addresses returned");
+   goto err_free;
+   }
+
if ((member_len == &info.jited_prog_len &&
 info.jited_prog_insns == 0) ||
(member_len == &info.xlated_prog_len &&
@@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv)
dump_xlated_cfg(buf, *member_len);
} else {
kernel_syms_load();
+   dd.jited_ksyms = ksyms_ptr;
+   dd.nr_jited_ksyms = *ksyms_len;
+
if (json_output)
dump_xlated_json(, buf, *member_len, opcodes);
else
@@ -566,10 +593,14 @@ static int do_dump(int argc, char **argv)
}
 
free(buf);
+   if (addrs)
+   free(addrs);
return 0;
 
 err_free:
free(buf);
+   if (addrs)
+   free(addrs);
return -1;
 }
 
diff --git a/tools/bpf/bpftool/xlated_dumper.c 
b/tools/bpf/bpftool/xlated_dumper.c
index 7a3173b76c16..dc8e4eca0387 100644
--- a/tools/bpf/bpftool/xlated_dumper.c
+++ b/tools/bpf/bpftool/xlated_dumper.c
@@ -178,8 +178,12 @@ static const char *print_call_pcrel(struct dump_data *dd,
snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
 "%+d#%s", insn->off, sym->name);
else
-   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
-"%+d#0x%lx", insn->off, address);
+   if (address)
+   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
+"%+d#0x%lx", insn->off, address);
+   else
+   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
+"%+d", insn->off);
return dd->scratch_buff;
 }
 
@@ -200,14 +204,20 @@ static const char *print_call(void *private_data,
  const struct bpf_insn *insn)
 {
struct dump_data *dd = private_data;
-   unsigned long address = dd->address_call_base + insn->imm;
-   struct kernel_sym *sym;
+   unsigned long address = 0;
+   struct kernel_sym *sym = NULL;
 
-   sym = kernel_syms_search(dd, address);
-   if (insn->src_reg == BPF_PSEUDO_CALL)
+   if (insn->src_reg == BPF_PSEUDO_CALL) {
+ 

[PATCH bpf 3/6] bpf: get kernel symbol addresses via syscall

2018-05-17 Thread Sandipan Das
This adds two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of kernel symbol addresses for all functions in a
given program to userspace, using the bpf system call
with the BPF_OBJ_GET_INFO_BY_FD command.

When bpf_jit_kallsyms is enabled, we can get the address
of the corresponding kernel symbol for a callee function
and resolve the symbol's name. The address is determined
by adding the value of the call instruction's imm field
to __bpf_call_base. This offset gets assigned to the imm
field by the verifier.

For some architectures, such as powerpc64, the imm field
is not large enough to hold this offset.

We resolve this by:

[1] Assigning the subprog id to the imm field of a call
instruction in the verifier instead of the offset of
the callee's symbol's address from __bpf_call_base.

[2] Determining the address of a callee's corresponding
symbol by using the imm field as an index for the
list of kernel symbol addresses now available from
the program info.
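
For illustration, userspace consumption of the new fields could look
roughly like the sketch below (assuming libbpf's bpf_obj_get_info_by_fd()
wrapper and an already-loaded program fd; include paths depend on the
setup and error reporting is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>	/* bpf_obj_get_info_by_fd() */

/* Sketch: print the kernel symbol address of every JITed function of
 * the program behind prog_fd. */
static void print_jited_ksyms(int prog_fd)
{
	struct bpf_prog_info info = {};
	__u32 len = sizeof(info);
	__u64 *ksyms;
	__u32 nr, i;

	/* First call with no buffers: the kernel reports the number of
	 * functions in nr_jited_ksyms. */
	if (bpf_obj_get_info_by_fd(prog_fd, &info, &len) || !info.nr_jited_ksyms)
		return;
	nr = info.nr_jited_ksyms;

	ksyms = calloc(nr, sizeof(*ksyms));
	if (!ksyms)
		return;

	memset(&info, 0, sizeof(info));
	info.jited_ksyms = (__u64)(unsigned long)ksyms;
	info.nr_jited_ksyms = nr;
	len = sizeof(info);

	/* Second call: one address per JITed function is copied back. */
	if (!bpf_obj_get_info_by_fd(prog_fd, &info, &len))
		for (i = 0; i < info.nr_jited_ksyms && i < nr; i++)
			printf("func %u -> 0x%llx\n", i,
			       (unsigned long long)ksyms[i]);
	free(ksyms);
}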

Suggested-by: Daniel Borkmann 
Signed-off-by: Sandipan Das 
---
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/syscall.c | 20 
 kernel/bpf/verifier.c|  7 +--
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 93d5a4eeec2a..061482d3be11 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2108,6 +2108,8 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;
__aligned_u64 xlated_prog_insns;
+   __aligned_u64 jited_ksyms;
+   __u32 nr_jited_ksyms;
__u64 load_time;/* ns since boottime */
__u32 created_by_uid;
__u32 nr_map_ids;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c286e75ec087..03c8437a2990 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1933,6 +1933,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
if (!capable(CAP_SYS_ADMIN)) {
info.jited_prog_len = 0;
info.xlated_prog_len = 0;
+   info.nr_jited_ksyms = 0;
goto done;
}
 
@@ -1981,6 +1982,25 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
}
}
 
+   ulen = info.nr_jited_ksyms;
+   info.nr_jited_ksyms = prog->aux->func_cnt;
+   if (info.nr_jited_ksyms && ulen) {
+   u64 __user *user_jited_ksyms = 
u64_to_user_ptr(info.jited_ksyms);
+   ulong ksym_addr;
+   u32 i;
+
+   /* copy the address of the kernel symbol corresponding to
+* each function
+*/
+   ulen = min_t(u32, info.nr_jited_ksyms, ulen);
+   for (i = 0; i < ulen; i++) {
+   ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
+   ksym_addr &= PAGE_MASK;
+   if (put_user((u64) ksym_addr, &user_jited_ksyms[i]))
+   return -EFAULT;
+   }
+   }
+
 done:
if (copy_to_user(uinfo, &info, info_len) ||
put_user(info_len, &uattr->info.info_len))
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index aa76879f4fd1..fc864eb3e29d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5416,17 +5416,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 * later look the same as if they were interpreted only.
 */
for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
-   unsigned long addr;
-
if (insn->code != (BPF_JMP | BPF_CALL) ||
insn->src_reg != BPF_PSEUDO_CALL)
continue;
insn->off = env->insn_aux_data[i].call_imm;
subprog = find_subprog(env, i + insn->off + 1);
-   addr  = (unsigned long)func[subprog]->bpf_func;
-   addr &= PAGE_MASK;
-   insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
-   addr - __bpf_call_base;
+   insn->imm = subprog;
}
 
prog->jited = 1;
-- 
2.14.3



[PATCH bpf 1/6] bpf: support 64-bit offsets for bpf function calls

2018-05-17 Thread Sandipan Das
The imm field of a bpf instruction is a signed 32-bit integer.
For JITed bpf-to-bpf function calls, it stores the offset of the
start address of the callee's JITed image from __bpf_call_base.

For some architectures, such as powerpc64, this offset may be
as large as 64 bits and cannot be accommodated in the imm field
without truncation.

We resolve this by:

[1] Additionally using the auxiliary data of each function to
keep a list of start addresses of the JITed images for all
functions determined by the verifier.

[2] Retaining the subprog id inside the off field of the call
instructions and using it to index into the list mentioned
above and lookup the callee's address.

To make sure that the existing JIT compilers continue to work
without requiring changes, we keep the imm field as it is.
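
With this in place, an arch JIT that needs the full 64-bit address can
resolve a callee in its extra pass along the following lines (this is the
pattern the powerpc64 JIT adopts in patch 2/6; shown here only as a
sketch):

	/* Sketch of the resolution step in an arch JIT's extra pass. */
	if (insn->src_reg == BPF_PSEUDO_CALL && extra_pass) {
		/* off still carries the subprog id; use it to index the
		 * list of JITed image start addresses in the aux data. */
		if (fp->aux->func && insn->off < fp->aux->func_cnt)
			func = (u8 *)fp->aux->func[insn->off]->bpf_func;
		else
			return -EINVAL;
	} else {
		/* Ordinary kernel helper call, still resolved via imm. */
		func = (u8 *)__bpf_call_base + insn->imm;
	}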

Signed-off-by: Sandipan Das 
---
 kernel/bpf/verifier.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d5e1a6c4165d..aa76879f4fd1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5373,11 +5373,24 @@ static int jit_subprogs(struct bpf_verifier_env *env)
insn->src_reg != BPF_PSEUDO_CALL)
continue;
subprog = insn->off;
-   insn->off = 0;
insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
func[subprog]->bpf_func -
__bpf_call_base;
}
+
+   /* we use the aux data to keep a list of the start addresses
+* of the JITed images for each function in the program
+*
+* for some architectures, such as powerpc64, the imm field
+* might not be large enough to hold the offset of the start
+* address of the callee's JITed image from __bpf_call_base
+*
+* in such cases, we can lookup the start address of a callee
+* by using its subprog id, available from the off field of
+* the call instruction, as an index for this list
+*/
+   func[i]->aux->func = func;
+   func[i]->aux->func_cnt = env->subprog_cnt + 1;
}
for (i = 0; i < env->subprog_cnt; i++) {
old_bpf_func = func[i]->bpf_func;
-- 
2.14.3



Re: [PATCH 0/3] Add support to disable sensor groups in P9

2018-05-17 Thread Shilpasri G Bhat


On 05/15/2018 08:32 PM, Guenter Roeck wrote:
> On Thu, Mar 22, 2018 at 04:24:32PM +0530, Shilpasri G Bhat wrote:
>> This patch series adds support to enable/disable OCC based
>> inband-sensor groups at runtime. The environmental sensor groups are
>> managed in HWMON and the remaining platform specific sensor groups are
>> managed in /sys/firmware/opal.
>>
>> The firmware changes required for this patch is posted below:
>> https://lists.ozlabs.org/pipermail/skiboot/2018-March/010812.html
>>
> 
> Sorry for not getting back earlier. This is a tough one.
> 

Thanks for the reply. I have tried to answer your questions according to my
understanding below:

> Key problem is that you are changing the ABI with those new attributes.
> On top of that, the attributes _do_ make some sense (many chips support
> enabling/disabling of individual sensors), suggesting that those or
> similar attributes may or even should at some point be added to the ABI.
> 
> At the same time, returning "0" as measurement values when sensors are
> disabled does not seem like a good idea, since "0" is a perfectly valid
> measurement, at least for most sensors.

I agree.

> 
> Given that, we need to have a discussion about adding _enable attributes to
> the ABI 

> what is the scope,
IIUC the scope should be RW and the attribute is defined for each supported
sensor group

> when should the attributes exist and when not,
We control this currently via device-tree

> do we want/need power_enable or powerX_enable or both, and so on), and 
We need power_enable right now

> what to return if a sensor is disabled (such as -ENODATA).
-ENODATA sounds good.
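
For example, the read path I have in mind would roughly look like the
sketch below (hwmon_ops based; the data structure and helper names here
are made up for illustration, not the actual ibmpowernv code):

#include <linux/hwmon.h>

/* Sketch only: report -ENODATA for a sensor whose group is disabled. */
static int occ_hwmon_read(struct device *dev, enum hwmon_sensor_types type,
			  u32 attr, int channel, long *val)
{
	struct occ_hwmon_data *data = dev_get_drvdata(dev);

	if (!data->group_enabled[channel])
		return -ENODATA;	/* disabled, not a real 0 reading */

	*val = occ_read_sensor(data, type, attr, channel);
	return 0;
}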

Thanks and Regards,
Shilpa

> Once we have an
> agreement, we can continue with an implementation.
> 
> Guenter
> 
>> Shilpasri G Bhat (3):
>>   powernv:opal-sensor-groups: Add support to enable sensor groups
>>   hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
>>   powernv: opal-sensor-groups: Add attributes to disable/enable sensors
>>
>>  .../ABI/testing/sysfs-firmware-opal-sensor-groups  |  34 ++
>>  Documentation/hwmon/ibmpowernv |  31 -
>>  arch/powerpc/include/asm/opal-api.h|   4 +-
>>  arch/powerpc/include/asm/opal.h|   2 +
>>  .../powerpc/platforms/powernv/opal-sensor-groups.c | 104 -
>>  arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>>  drivers/hwmon/ibmpowernv.c | 127 
>> +++--
>>  7 files changed, 265 insertions(+), 38 deletions(-)
>>  create mode 100644 
>> Documentation/ABI/testing/sysfs-firmware-opal-sensor-groups
>>
>> -- 
>> 1.8.3.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-hwmon" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>