Re: [PATCH 1/5] powerpc/asm: Add a patch_site macro & helpers for patching instructions

2018-08-08 Thread Christophe LEROY




On 08/08/2018 at 18:30, Christophe LEROY wrote:



On 23/07/2018 at 17:07, Michael Ellerman wrote:

Add a macro and some helper C functions for patching single asm
instructions.

The gas macro means we can do something like:

   1:    nop
   patch_site 1b, patch__foo

Which is less visually distracting than defining a GLOBAL symbol at 1,
and also doesn't pollute the symbol table which can confuse eg. perf.

These are obviously similar to our existing feature sections, but are
not automatically patched based on CPU/MMU features, rather they are
designed to be manually patched by C code at some arbitrary point.

Signed-off-by: Michael Ellerman 
---
  arch/powerpc/include/asm/code-patching-asm.h | 18 ++
  arch/powerpc/include/asm/code-patching.h |  2 ++
  arch/powerpc/lib/code-patching.c | 16 
  3 files changed, 36 insertions(+)
  create mode 100644 arch/powerpc/include/asm/code-patching-asm.h

diff --git a/arch/powerpc/include/asm/code-patching-asm.h b/arch/powerpc/include/asm/code-patching-asm.h
new file mode 100644
index ..ed7b1448493a
--- /dev/null
+++ b/arch/powerpc/include/asm/code-patching-asm.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Copyright 2018, Michael Ellerman, IBM Corporation.
+ */
+#ifndef _ASM_POWERPC_CODE_PATCHING_ASM_H
+#define _ASM_POWERPC_CODE_PATCHING_ASM_H
+
+/* Define a "site" that can be patched */
+.macro patch_site label name
+    .pushsection ".rodata"
+    .balign 4
+    .global \name
+\name:
+    .4byte    \label - .
+    .popsection
+.endm
+
+#endif /* _ASM_POWERPC_CODE_PATCHING_ASM_H */
diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h
index 812535f40124..b2051234ada8 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -32,6 +32,8 @@ unsigned int create_cond_branch(const unsigned int *addr,

  int patch_branch(unsigned int *addr, unsigned long target, int flags);
  int patch_instruction(unsigned int *addr, unsigned int instr);
  int raw_patch_instruction(unsigned int *addr, unsigned int instr);
+int patch_instruction_site(s32 *addr, unsigned int instr);
+int patch_branch_site(s32 *site, unsigned long target, int flags);


Why use s32* instead of unsigned int* as usual for pointer to code ?


Forget my stupid question, I didn't see it was a relative address and 
not an absolute one.
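
As a minimal sketch of what the relative site encodes (the helper name
below is illustrative; the same arithmetic is used by
patch_instruction_site() and patch_branch_site() in the hunk further
down): the site stores "label - .", a signed 32-bit offset from the
site to the instruction, so recovering the instruction address is just:

static inline unsigned int *site_to_addr(s32 *site)
{
	/* *site is the offset from the site itself to the nop/branch. */
	return (unsigned int *)((unsigned long)site + *site);
}

C code can then patch the "1: nop" from the example via
patch_instruction_site(&patch__foo, ...) without the site ever holding
an absolute address.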


Christophe



Christophe


  int instr_is_relative_branch(unsigned int instr);
  int instr_is_relative_link_branch(unsigned int instr);
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index e0d881ab304e..850f3b8f4da5 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -195,6 +195,22 @@ int patch_branch(unsigned int *addr, unsigned long target, int flags)

  return patch_instruction(addr, create_branch(addr, target, flags));
  }
+int patch_branch_site(s32 *site, unsigned long target, int flags)
+{
+    unsigned int *addr;
+
+    addr = (unsigned int *)((unsigned long)site + *site);
+    return patch_instruction(addr, create_branch(addr, target, flags));
+}
+
+int patch_instruction_site(s32 *site, unsigned int instr)
+{
+    unsigned int *addr;
+
+    addr = (unsigned int *)((unsigned long)site + *site);
+    return patch_instruction(addr, instr);
+}
+
  bool is_offset_in_branch_range(long offset)
  {
  /*



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-08 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:
> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
>> Mahesh J Salgaonkar  writes:
>>> From: Mahesh Salgaonkar 
>>>
>>> Introduce recovery action for recovered memory errors (MCEs). There are
>>> soft memory errors like SLB Multihit, which can be a result of a bad
>>> hardware OR software BUG. Kernel can easily recover from these soft errors
>>> by flushing SLB contents. After the recovery kernel can still continue to
>>> function without any issue. But in some scenarios we may keep getting
>>> these soft errors until the root cause is fixed. To be able to analyze and
>>> find the root cause, the best way is to gather enough data and system state at
>>> the time of MCE. Hence this patch introduces a sysctl knob where user can
>>> decide either to continue after recovery or panic the kernel to capture the
>>> dump.
>> 
>> I'm not convinced we want this.
>> 
>> As we've discovered it's often not possible to reconstruct what happened
>> based on a dump anyway.
>> 
>> The key thing you need is the content of the SLB and that's not included
>> in a dump.
>> 
>> So I think we should dump the SLB content when we get the MCE (which
>> this series does) and any other useful info, and then if we can recover
>> we should.
>
> The reasoning there is: what if we got a multi-hit due to some corruption 
> in slb_cache_ptr, i.e. some part of the kernel wrongly updating the paca 
> data structure via a bad pointer. Now that is far fetched, but then 
> possible, right? Hence the idea that, if we don't have much insight into 
> why an slb multi-hit occurs from the dmesg which includes slb content, 
> slb_cache contents etc., there should be an easy way to force a dump that 
> might assist in further debug.

If you're debugging something complex that you can't determine from the
SLB dump then you should be running a debug kernel anyway. And if
anything you want to drop into xmon and sit there, preserving the most
state, rather than taking a dump.

The last SLB multi-hit I debugged was this:

  
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db7130d63fd8


Which took quite a while to track down, including a bunch of tracing and
so on. A dump would not have helped in the slightest.

cheers


Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-08 Thread Scott Wood
On Thu, 2018-08-09 at 03:28 +, Bharat Bhushan wrote:
> > -Original Message-
> > From: Scott Wood [mailto:o...@buserror.net]
> > Sent: Wednesday, August 8, 2018 11:27 PM
> > To: Bharat Bhushan ;
> > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org
> > Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> > j...@perches.com
> > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > 
> > On Wed, 2018-08-08 at 06:28 +, Bharat Bhushan wrote:
> > > > -Original Message-
> > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > Sent: Wednesday, August 8, 2018 11:26 AM
> > > > To: Bharat Bhushan ;
> > > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > > > ker...@vger.kernel.org
> > > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > > P2020
> > > > 
> > > > On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > > > > > -Original Message-
> > > > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > > > Sent: Wednesday, August 8, 2018 2:44 AM
> > > > > > To: Bharat Bhushan ;
> > > > > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > > > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > > > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > > > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> > > > > > linux- ker...@vger.kernel.org
> > > > > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > > > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > > > > P2020
> > > > > > 
> > > > > > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > > > > > MPIC on NXP (Freescale) P2020 supports following irq
> > > > > > > ranges:
> > > > > > >   > 0 - 11  (External interrupt)
> > > > > > >   > 16 - 79 (Internal interrupt)
> > > > > > >   > 176 - 183   (Messaging interrupt)
> > > > > > >   > 224 - 231   (Shared message signaled interrupt)
> > > > > > 
> > > > > > Why don't you convert to the 4-cell interrupt specifiers that
> > > > > > make dealing with these ranges less error-prone?
> > > > > 
> > > > > Ok , will do if we agree to have this series as per comment on
> > > > > other patch.
> > > > 
> > > > If you're concerned with errors, this would be a good things to do
> > > > regardless.
> > > >  Actually, it seems that p2020si-post.dtsi already uses 4-cell
> > > > interrupts.
> > > > 
> > > > What is motivating this patchset?  Is there something wrong in the
> > > > existing dts files?
> > > 
> > > There is no error in device tree. Main motivation is to improve code
> > > for following reasons:
> > >   - During code study it was found that if a reserved irq-number is used
> > > then there is no check in the driver. The irq will be configured as correct
> > > and the interrupt will never fire.
> > 
> > Again, a wrong interrupt number won't fire, whether an interrupt by that
> > number exists or not.  I wouldn't mind a sanity check in the driver if the
> > programming model made it properly discoverable, but I don't think it's
> > worth messing with device trees just for this (and even less so given that
> > there don't seem to be new chips coming out that this would be relevant
> > for).
> 
> Fair enough, we can use the MPIC version to define the supported interrupt ranges.
> Will that be acceptable?

It's better than device tree changes but I'm not convinced it's worthwhile
just to suppress some simulator warnings.  If the warnings really bother you,
you can use pic-no-reset in the device tree (assuming this isn't some new chip
that you want to make sure doesn't fall over when the usual mpic init happens)
and/or convince the hardware people to make the interface properly
discoverable including discontiguous regions (if there *is* some new chip I
haven't heard about).

-Scott
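
The motivation quoted above is that a reserved irq number is silently
accepted; a minimal sketch of the kind of per-SoC sanity check being
discussed (the ranges are the P2020 ones listed earlier in the thread,
and mpic_irq_in_supported_range() is an illustrative helper, not an
existing mpic.c function):

struct mpic_irq_range {
	unsigned int start;
	unsigned int end;
};

static const struct mpic_irq_range p2020_irq_ranges[] = {
	{   0,  11 },	/* external interrupts */
	{  16,  79 },	/* internal interrupts */
	{ 176, 183 },	/* messaging interrupts */
	{ 224, 231 },	/* shared message signaled interrupts */
};

static bool mpic_irq_in_supported_range(unsigned int hwirq)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(p2020_irq_ranges); i++) {
		if (hwirq >= p2020_irq_ranges[i].start &&
		    hwirq <= p2020_irq_ranges[i].end)
			return true;
	}

	return false;
}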



Re: [PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-08 Thread Michael Ellerman
rashmica  writes:
> On 08/08/18 17:25, Michael Ellerman wrote:
>> Christophe Leroy  writes:
>>> diff --git a/lib/test_hexdump.c b/lib/test_hexdump.c
>>> index 3f415d8101f3..626f580b4ff7 100644
>>> --- a/lib/test_hexdump.c
>>> +++ b/lib/test_hexdump.c
>>> @@ -32,16 +32,33 @@ static const char * const test_data_2_le[] __initconst 
>>> = {
>>> "d14c", "9919", "b143", "0caf",
>>>  };
>>>  
>>> +static const char * const test_data_2_be[] __initconst = {
>>> +   "be32", "db7b", "0a18", "93b2",
>>> +   "70ba", "c424", "7d83", "349b",
>>> +   "a69c", "31ad", "9c0f", "ace9",
>>> +   "4cd1", "1999", "43b1", "af0c",
>>> +};
>>> +
>>>  static const char * const test_data_4_le[] __initconst = {
>>> "7bdb32be", "b293180a", "24c4ba70", "9b34837d",
>>> "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
>>>  };
>>>  
>>> +static const char * const test_data_4_be[] __initconst = {
>>> +   "be32db7b", "0a1893b2", "70bac424", "7d83349b",
>>> +   "a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
>>> +};
>>> +
>> Is there a reason we can't do it all at compile time?
>
> mpe I sent a patch doing that awhile ago and you obviously didn't like
> it because you never merged it :P

Sorry, I wasn't sure who should merge it, and never followed up.

cheers


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 08:13:32AM +1000, Benjamin Herrenschmidt wrote:
> > > - if (xen_domain())
> > > + if (xen_domain() || pseries_secure_vm())
> > >   return true;
> > 
> > I don't think it's pseries specific actually. E.g. I suspect AMD SEV
> > might benefit from the same kind of hack.
> 
> As long as they can provide the same guarantee that the DMA ops are
> completely equivalent between virtio and other PCI devices, at least on
> the same bus, ie, we don't have to go hack special DMA ops.
> 
> I think the latter is really what Christoph wants to avoid for good
> reasons.

Yes.  I also generally want to avoid too much arch specific magic.

FYI, I'm off to a week-long vacation today, don't expect quick replies.


Re: [RFC PATCH 3/3] cpuidle/powernv: Conditionally save-restore sprs using opal

2018-08-08 Thread Gautham R Shenoy
Hello Nicholas,

On Fri, Aug 03, 2018 at 12:05:47AM +1000, Nicholas Piggin wrote:
> On Thu,  2 Aug 2018 10:21:32 +0530
> Akshay Adiga  wrote:
> 
> > From: Abhishek Goel 
> > 
> > If a state has "opal-supported" compat flag in device-tree, an opal call
> > needs to be made during the entry and exit of the stop state. This patch
> > passes a hint to the power9_idle_stop and power9_offline_stop.
> > 
> > This patch moves the saving and restoring of sprs for P9 cpuidle
> > from kernel to opal. This patch still uses existing code to detect
> > first thread in core.
> > In an attempt to make the powernv idle code backward compatible,
> > and to some extent forward compatible, add support for pre-stop entry
> > and post-stop exit actions in OPAL. If a kernel knows about this
> > opal call, then just a firmware supporting newer hardware is required,
> > instead of waiting for kernel updates.
> 
> Still think we should make these do-everything calls. Including
> executing nap/stop instructions, restoring timebase, possibly even
> saving and restoring SLB (although a return code could be used to
> tell the kernel to do that maybe if performance advantage is
> enough).

So, if we execute the stop instruction in OPAL, the wakeup from stop
still happens at the hypervisor vector 0x100. On wakeup, we need to check
SRR1 to see if we have lost state, in which case the stop exit also
needs to be handled inside OPAL. On return from this OPAL call, we
need to unwind the extra stack frame that would have been created when
the kernel entered OPAL to execute the stop from which there was no
return. In the case where a lossy stop state was requested, but wakeup
happened from a lossless stop state, this adds additional overhead.

Furthermore, measurements show that performing the restore of the
resources in OPAL rather than in the kernel on wakeup from stop adds
an additional 5-10us. For the current stop states that lose hypervisor
state, since the latency is relatively high (100s of us), this is a
relatively small penalty (~1%).

However, if in future we have states that lose only a part of the
hypervisor state and provide a wakeup latency on the order of a few
tens of microseconds, the additional latency caused by the OPAL call
would become noticeable (e.g. an extra 10us on a ~30us wakeup is
roughly a 30% hit), no?


> 
> I haven't had a lot of time to go through it, I'm working on moving
> ~all of idle_book3s.S to C code, I'd like to do that before this
> OPAL idle driver if possible.
> 
> A minor thing I just noticed, you don't have to allocate the opal
> spr save space in Linux, just do it all in OPAL.

The idea was to not leave any state in OPAL, as OPAL is supposed to be
stateless. However, I agree that if OPAL is not going to interpret
the contents of the save area, it should be harmless to move that bit
into OPAL.

That said, if we are going to add the logic of determining the first
thread in the core waking up, etc, then we have no choice but to
maintain that state in OPAL.


> 
> Thanks,
> Nick
> 
--
Thanks and Regards
gautham.



[PATCH v6 2/2] powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

2018-08-08 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Each of the SMT4 cores forming a big-core is a more or less independent
unit. Thus when multiple tasks are scheduled to run on the fused
core, we get the best performance when the tasks are spread across the
pair of SMT4 cores.

This patch achieves this by setting the SMT level mask to correspond
to the smallcore sibling mask on big-core systems. This patch also
ensures that while checking for shared caches on a big-core system, we
use the smallcore_sibling_mask to compare with the l2_cache_mask.
This ensures that the CACHE level sched-domain is created, whose groups
correspond to the threads of the big-core.
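
The core of the idea, as a minimal sketch (cpu_smallcore_sibling_mask()
and has_big_cores come from this series; powerpc_smt_mask() is an
illustrative name for the SMT-level mask selection, not necessarily how
the final hunk spells it):

/* On big-core systems report only the small-core siblings at the SMT
 * sched-domain level, so the load balancer spreads tasks across the
 * two SMT4 halves of the fused core. */
static const struct cpumask *powerpc_smt_mask(int cpu)
{
	if (has_big_cores)
		return cpu_smallcore_sibling_mask(cpu);

	return cpu_sibling_mask(cpu);
}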

With this patch, the SMT sched-domains with SMT=8,4,2 on big-core
systems are as follows:

1) ppc64_cpu --smt=8

 CPU0 attaching sched-domain(s):
  domain-0: span=0,2,4,6 level=SMT
   groups: 0:{ span=0 cap=294 }, 2:{ span=2 cap=294 },
   4:{ span=4 cap=294 }, 6:{ span=6 cap=294 }
 CPU1 attaching sched-domain(s):
  domain-0: span=1,3,5,7 level=SMT
   groups: 1:{ span=1 cap=294 }, 3:{ span=3 cap=294 },
   5:{ span=5 cap=294 }, 7:{ span=7 cap=294 }

2) ppc64_cpu --smt=4

 CPU0 attaching sched-domain(s):
  domain-0: span=0,2 level=SMT
   groups: 0:{ span=0 cap=589 }, 2:{ span=2 cap=589 }
 CPU1 attaching sched-domain(s):
  domain-0: span=1,3 level=SMT
   groups: 1:{ span=1 cap=589 }, 3:{ span=3 cap=589 }

3) ppc64_cpu --smt=2
   SMT domain ceases to exist as each domain consists of just one
   group.

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/smp.h |  6 
 arch/powerpc/kernel/smp.c  | 62 ++
 2 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 29ffaab..30798c7 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -99,6 +99,7 @@ static inline void set_hard_smp_processor_id(int cpu, int phys)
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, cpu_sibling_map);
+DECLARE_PER_CPU(cpumask_var_t, cpu_smallcore_sibling_map);
 DECLARE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DECLARE_PER_CPU(cpumask_var_t, cpu_core_map);
 
@@ -107,6 +108,11 @@ static inline struct cpumask *cpu_sibling_mask(int cpu)
return per_cpu(cpu_sibling_map, cpu);
 }
 
+static inline struct cpumask *cpu_smallcore_sibling_mask(int cpu)
+{
+   return per_cpu(cpu_smallcore_sibling_map, cpu);
+}
+
 static inline struct cpumask *cpu_core_mask(int cpu)
 {
return per_cpu(cpu_core_map, cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 4794d6b..a515780 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,10 +76,12 @@
 struct thread_info *secondary_ti;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
 
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_smallcore_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
@@ -689,6 +691,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
for_each_possible_cpu(cpu) {
zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
+   zalloc_cpumask_var_node(&per_cpu(cpu_smallcore_sibling_map,
+cpu),
+   GFP_KERNEL, cpu_to_node(cpu));
zalloc_cpumask_var_node(&per_cpu(cpu_l2_cache_map, cpu),
GFP_KERNEL, cpu_to_node(cpu));
zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
@@ -707,6 +712,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
cpumask_set_cpu(boot_cpuid, cpu_sibling_mask(boot_cpuid));
cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
+   if (has_big_cores) {
+   cpumask_set_cpu(boot_cpuid,
+   cpu_smallcore_sibling_mask(boot_cpuid));
+   }
 
if (smp_ops && smp_ops->probe)
smp_ops->probe();
@@ -991,6 +1000,10 @@ static void remove_cpu_from_masks(int cpu)
set_cpus_unrelated(cpu, i, cpu_core_mask);
set_cpus_unrelated(cpu, i, cpu_l2_cache_mask);
set_cpus_unrelated(cpu, i, cpu_sibling_mask);
+   if (has_big_cores) {
+   set_cpus_unrelated(cpu, i,
+  cpu_smallcore_sibling_mask);
+   }
}
 }
 #endif
@@ -999,7 +1012,17 @@ static void add_cpu_to_masks(int cpu)
 {
int first_thread = cpu_first_thread_sibling(cpu);
int chipid = cpu_to_chip_id(cpu);
-   int i;
+
+   struct thread_groups tg;
+   int i, cpu_group_start = -1;
+
+   if

[PATCH v6 1/2] powerpc: Detect the presence of big-cores via "ibm, thread-groups"

2018-08-08 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

On IBM POWER9, the device tree exposes a property array identified by
"ibm,thread-groups" which will indicate which groups of threads share a
particular set of resources.

As of today we only have one form of grouping identifying the group of
threads in the core that share the L1 cache, translation cache and
instruction data flow.

This patch defines the helper function to parse the contents of
"ibm,thread-groups" and a new structure to contain the parsed output.

The patch also creates the sysfs file named "small_core_siblings" that
returns the physical ids of the threads in the core that share the L1
cache, translation cache and instruction data flow.

Signed-off-by: Gautham R. Shenoy 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
 arch/powerpc/include/asm/cputhreads.h  |  22 +++
 arch/powerpc/kernel/setup-common.c | 154 +
 arch/powerpc/kernel/sysfs.c|  35 +
 4 files changed, 219 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
index 9c5e7732..52c9b50 100644
--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -487,3 +487,11 @@ Description:   Information about CPU vulnerabilities
"Not affected"CPU is not affected by the vulnerability
"Vulnerable"  CPU is affected and no mitigation in effect
"Mitigation: $M"  CPU is affected and mitigation $M is in effect
+
+What:  /sys/devices/system/cpu/cpu[0-9]+/small_core_siblings
+Date:  06-Aug-2018
+KernelVersion: v4.19.0
+Contact:   Linux for PowerPC mailing list 
+Description:   List of Physical ids of CPUs which share the L1 cache,
+   translation cache and instruction data-flow with this CPU.
+Values:Comma separated list of decimal integers.
diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
index d71a909..33226d7 100644
--- a/arch/powerpc/include/asm/cputhreads.h
+++ b/arch/powerpc/include/asm/cputhreads.h
@@ -23,11 +23,13 @@
 extern int threads_per_core;
 extern int threads_per_subcore;
 extern int threads_shift;
+extern bool has_big_cores;
 extern cpumask_t threads_core_mask;
 #else
 #define threads_per_core   1
 #define threads_per_subcore1
 #define threads_shift  0
+#define has_big_cores  0
 #define threads_core_mask  (*get_cpu_mask(0))
 #endif
 
@@ -69,12 +71,32 @@ static inline cpumask_t cpu_online_cores_map(void)
return cpu_thread_mask_to_cores(cpu_online_mask);
 }
 
+#define MAX_THREAD_LIST_SIZE   8
+struct thread_groups {
+   unsigned int property;
+   unsigned int nr_groups;
+   unsigned int threads_per_group;
+   unsigned int thread_list[MAX_THREAD_LIST_SIZE];
+};
+
 #ifdef CONFIG_SMP
 int cpu_core_index_of_thread(int cpu);
 int cpu_first_thread_of_core(int core);
+int parse_thread_groups(struct device_node *dn, struct thread_groups *tg);
+int get_cpu_thread_group_start(int cpu, struct thread_groups *tg);
 #else
 static inline int cpu_core_index_of_thread(int cpu) { return cpu; }
 static inline int cpu_first_thread_of_core(int core) { return core; }
+static inline int parse_thread_groups(struct device_node *dn,
+ struct thread_groups *tg)
+{
+   return -ENODATA;
+}
+
+static inline int get_cpu_thread_group_start(int cpu, struct thread_groups *tg)
+{
+   return -1;
+}
 #endif
 
 static inline int cpu_thread_in_core(int cpu)
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 40b44bb..989edc1 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -402,10 +402,12 @@ void __init check_for_initrd(void)
 #ifdef CONFIG_SMP
 
 int threads_per_core, threads_per_subcore, threads_shift;
+bool has_big_cores;
 cpumask_t threads_core_mask;
 EXPORT_SYMBOL_GPL(threads_per_core);
 EXPORT_SYMBOL_GPL(threads_per_subcore);
 EXPORT_SYMBOL_GPL(threads_shift);
+EXPORT_SYMBOL_GPL(has_big_cores);
 EXPORT_SYMBOL_GPL(threads_core_mask);
 
 static void __init cpu_init_thread_core_maps(int tpc)
@@ -433,6 +435,152 @@ static void __init cpu_init_thread_core_maps(int tpc)
 
 u32 *cpu_to_phys_id = NULL;
 
+/*
+ * parse_thread_groups: Parses the "ibm,thread-groups" device tree
+ *  property for the CPU device node @dn and stores
+ *  the parsed output in the thread_groups
+ *  structure @tg.
+ *
+ * @dn: The device node of the CPU device.
+ * @tg: Pointer to a thread group structure into which the parsed
+ * output of "ibm,thread-groups" is stored.
+ *
+ * ibm,thread-groups[0..N-1] array defines which group of threads in
+ * the CPU-device node can be grouped together based on the property.
+ *
+ * ibm,thread-groups[0] tells us the property
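
As a hedged illustration of the array that parse_thread_groups() decodes,
using the struct thread_groups fields defined earlier in this patch (the
concrete values below are an assumed example for an interleaved big-core,
not dumped from real firmware):

/*
 * ibm,thread-groups = <1		property: threads sharing L1/translation cache
 *			2		nr_groups: two small cores in the big-core
 *			4		threads_per_group
 *			0 2 4 6		thread ids of small core 0
 *			1 3 5 7>;	thread ids of small core 1
 */
static const struct thread_groups example_tg = {
	.property		= 1,
	.nr_groups		= 2,
	.threads_per_group	= 4,
	.thread_list		= { 0, 2, 4, 6, 1, 3, 5, 7 },
};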

[PATCH v6 0/2] powerpc: Detection and scheduler optimization for POWER9 bigcore

2018-08-08 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is the sixth iteration of the patchset to add support for
big-core on POWER9. This patchset also optimizes the task placement on
such big-core systems.

The previous versions can be found here:

v5: https://lkml.org/lkml/2018/8/6/587
v4: https://lkml.org/lkml/2018/7/24/79
v3: https://lkml.org/lkml/2018/7/6/255
v2: https://lkml.org/lkml/2018/7/3/401
v1: https://lkml.org/lkml/2018/5/11/245

Changes :

v5 --> v6:
   - Fixed the code to build without warnings for !CONFIG_SCHED_SMT.
   - While checking for shared caches on big-core systems, use the
 smallcore_sibling_mask to compare with the l2_cache_mask, which
 will ensure that the CACHE level sched-domain is created.
   - Added benchmark results with hackbench to demonstrate the
 benefits of having the CACHE level sched-domain.

v4 --> v5:
   - Patch 2 is entirely different: Instead of using CPU_FTR_ASYM_SMT
 feature, use the small core siblings at the SMT level
 sched-domain. This was suggested by Nicholas Piggin and Michael
 Ellerman.

   - A more detailed description follows below.

v3 --> v4:
   - Build fix for powerpc-g5 : Enable CPU_FTR_ASYM_SMT only on
 CONFIG_PPC_POWERNV and CONFIG_PPC_PSERIES.
   - Fixed a minor error in the ABI description.

v2 --> v3
- Set sane values in the tg->property, tg->nr_groups inside
parse_thread_groups before returning due to an error.
- Define a helper function to determine whether a CPU device node
  is a big-core or not.
- Updated the comments around the functions to describe the
  arguments passed to them.

v1 --> v2
- Added comments explaining the "ibm,thread-groups" device tree property.
- Uses cleaner device-tree parsing functions to parse the u32 arrays.
- Adds a sysfs file listing the small-core siblings for every CPU.
- Enables the scheduler optimization by setting the CPU_FTR_ASYM_SMT bit
  in the cur_cpu_spec->cpu_features on detecting the presence
  of interleaved big-core.
- Handles the corner case where there is only a single thread-group
  or when there is a single thread in a thread-group.

Description:

A pair of IBM POWER9 SMT4 cores can be fused together to form a
big-core with 8 SMT threads. This can be discovered via the
"ibm,thread-groups" CPU property in the device tree which will
indicate which groups of threads share the L1 cache, translation
cache and instruction data flow.  If there are multiple such groups
of threads, then the core is a big-core. Furthermore, on POWER9 the
thread-ids of such a big-core are obtained by interleaving the
thread-ids of the component SMT4 cores.

Eg: Threads in the pair of component SMT4 cores of an interleaved
big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.

          -------------------------
          |       L1 Cache        |
       ----------------------------
       |L2|     |     |     |     |
       |  |  0  |  2  |  4  |  6  |   Small Core0
       |C |     |     |     |     |
   Big |a -------------------------
  Core |c |     |     |     |     |
       |h |  1  |  3  |  5  |  7  |   Small Core1
       |e |     |     |     |     |
       ----------------------------
          |       L1 Cache        |
          -------------------------

On such a big-core system, when multiple tasks are scheduled to run on
the big-core, we get the best performance when the tasks are spread
across the pair of SMT4 cores.

Eg: Suppose 4 tasks {p1, p2, p3, p4} are run on a big-core, then

An Example of Optimal Task placement:

           -------------------------
           |     |     |     |     |
           |  0  |  2  |  4  |  6  |   Small Core0
           | (p1)| (p2)|     |     |
Big Core   -------------------------
           |     |     |     |     |
           |  1  |  3  |  5  |  7  |   Small Core1
           |     | (p3)|     | (p4)|
           -------------------------

An example of Suboptimal Task placement:

           -------------------------
           |     |     |     |     |
           |  0  |  2  |  4  |  6  |   Small Core0
           | (p1)| (p2)|     | (p4)|
Big Core   -------------------------
           |     |     |     |     |
           |  1  |  3  |  5  |  7  |   Small Core1
           |     | (p3)|     |     |
           -------------------------

In order to achieve optimal task placement, on big-core systems, we
define the SMT level sched-domain to consist of the threads belonging
to the small cores. The CACHE level sched domain will consist of all
the threads belonging to the big-core. With this, the Linux Kernel
load-balancer will ensure that the tasks are spread across all the
component small cores in the system, thereby yielding optimum
performance.

Furthermore, this solution works correctly across all SMT modes
(8,4,2), as the interleaved thread-ids ensures that when we go to
lower SMT modes (4,2) the threads are offlined in a descending 

Re: [PATCH kernel RFC 0/3] powerpc/pseries/iommu: GPU coherent memory pass through

2018-08-08 Thread Alexey Kardashevskiy



On 25/07/2018 19:50, Alexey Kardashevskiy wrote:
> I am trying to pass through a 3D controller:
> [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2] [10de:1db1] (rev a1)
> 
> which has a quite unique feature as coherent memory directly accessible
> from a POWER9 CPU via an NVLink2 transport.
> 
> So in addition to passing a PCI device + accompanying NPU devices,
> we will also be passing the host physical address range as it is done
> on the bare metal system.
> 
> The memory on the host is presented as:
> 
> ===
> [aik@yc02goos ~]$ lsprop /proc/device-tree/memory@420
> ibm,chip-id  00fe (254)
> device_type  "memory"
> compatible   "ibm,coherent-device-memory"
> reg  0420  0020 
> linux,usable-memory
>  0420   
> phandle  0726 (1830)
> name "memory"
> ibm,associativity
>  0004 00fe 00fe 00fe 00fe
> ===
> 
> and the host does not touch it as the second 64bit value of
> "linux,usable-memory" - the size - is null. Later on the NVIDIA driver
> trains the NVLink2 and probes this memory and this is how it becomes
> onlined.
> 
> In the virtual environment I am planning on doing the same thing,
> however there is a difference in 64bit DMA handling. The powernv
> platform uses a PHB3 bypass mode and that just works but
> the pseries platform uses DDW RTAS API to achieve the same
> result and the problem with this is that we need a huge DMA
> window to start from zero (because this GPU supports less than
> 50bits for DMA address space) and cover not just present memory
> but also this new coherent memory.
> 
> 
> This is based on sha1
> d72e90f3 Linus Torvalds "Linux 4.18-rc6".
> 
> Please comment. Thanks.


Ping?
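
A hedged sketch of the max-RAM-address calculation that the third patch
in the quoted series refers to (the function name and the assumption of
2 address cells and 2 size cells in "reg" are illustrative, not the
actual patch):

static phys_addr_t ddw_max_ram_address(void)
{
	struct device_node *np;
	phys_addr_t max = 0;

	for_each_node_by_type(np, "memory") {
		const __be32 *reg;
		u64 base, size;
		int len;

		reg = of_get_property(np, "reg", &len);
		if (!reg || len < (int)(4 * sizeof(__be32)))
			continue;

		/* Assumes #address-cells = <2> and #size-cells = <2>. */
		base = of_read_number(reg, 2);
		size = of_read_number(reg + 2, 2);
		if (base + size > max)
			max = base + size;
	}

	return max;
}

This would make the DDW window large enough to also cover the
"ibm,coherent-device-memory" node shown above, since that node is a
memory@ node too.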


> 
> 
> 
> Alexey Kardashevskiy (3):
>   powerpc/pseries/iommu: Allow dynamic window to start from zero
>   powerpc/pseries/iommu: Force default DMA window removal
>   powerpc/pseries/iommu: Use memory@ nodes in max RAM address
> calculation
> 
>  arch/powerpc/platforms/pseries/iommu.c | 77 
> ++
>  1 file changed, 70 insertions(+), 7 deletions(-)
> 

-- 
Alexey


Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100

2018-08-08 Thread Alexey Kardashevskiy



On 08/08/2018 18:39, Alexey Kardashevskiy wrote:
> 
> 
> On 02/08/2018 02:16, Alex Williamson wrote:
>> On Wed, 1 Aug 2018 18:37:35 +1000
>> Alexey Kardashevskiy  wrote:
>>
>>> On 01/08/2018 00:29, Alex Williamson wrote:
 On Tue, 31 Jul 2018 14:03:35 +1000
 Alexey Kardashevskiy  wrote:
   
> On 31/07/2018 02:29, Alex Williamson wrote:  
>> On Mon, 30 Jul 2018 18:58:49 +1000
>> Alexey Kardashevskiy  wrote:  
>>> After some local discussions, it was pointed out that force disabling
>>> nvlinks won't bring us much as for an nvlink to work, both sides need to
>>> enable it so malicious guests cannot penetrate good ones (or a host)
>>> unless a good guest enabled the link but won't happen with a well
>>> behaving guest. And if two guests became malicious, then can still only
>>> harm each other, and so can they via other ways such network. This is
>>> different from PCIe as once PCIe link is unavoidably enabled, a well
>>> behaving device cannot firewall itself from peers as it is up to the
>>> upstream bridge(s) now to decide the routing; with nvlink2, a GPU still
>>> has means to protect itself, just like a guest can run "firewalld" for
>>> network.
>>>
>>> Although it would be a nice feature to have an extra barrier between
>>> GPUs, is inability to block the links in hypervisor still a blocker for
>>> V100 pass through?
>>
>> How is the NVLink configured by the guest, is it 'on'/'off' or are
>> specific routes configured? 
>
> The GPU-GPU links need not to be blocked and need to be enabled
> (==trained) by a driver in the guest. There are no routes between GPUs
> in NVLink fabric, these are direct links, it is just a switch on each
> side, both switches need to be on for a link to work.  

 Ok, but there is at least the possibility of multiple direct links per
 GPU, the very first diagram I find of NVlink shows 8 interconnected
 GPUs:

 https://www.nvidia.com/en-us/data-center/nvlink/  
>>>
>>> Our design is like the left part of the picture but it is just a detail.
>>
>> Unless we can specifically identify a direct link vs a mesh link, we
>> shouldn't be making assumptions about the degree of interconnect.
>>  
 So if each switch enables one direct, point to point link, how does the
 guest know which links to open for which peer device?  
>>>
>>> It uses PCI config space on GPUs to discover the topology.
>>
>> So do we need to virtualize this config space if we're going to
>> virtualize the topology?
>>
 And of course
 since we can't see the spec, a security audit is at best hearsay :-\  
>>>
>>> Yup, the exact discovery protocol is hidden.
>>
>> It could be reverse engineered...
>>
> The GPU-CPU links - the GPU bit is the same switch, the CPU NVlink state
> is controlled via the emulated PCI bridges which I pass through together
> with the GPU.  

 So there's a special emulated switch, is that how the guest knows which
 GPUs it can enable NVLinks to?  
>>>
>>> Since it only has PCI config space (there is nothing relevant in the
>>> device tree at all), I assume (double checking with the NVIDIA folks
>>> now) the guest driver enables them all, tests which pair works and
>>> disables the ones which do not. This gives a malicious guest a tiny
>>> window of opportunity to break into a good guest. Hm :-/
>>
>> Let's not minimize that window, that seems like a prime candidate for
>> an exploit.
>>
>> If the former, then isn't a non-malicious
>> guest still susceptible to a malicious guest?
>
> A non-malicious guest needs to turn its switch on for a link to a GPU
> which belongs to a malicious guest.  

 Actual security, or obfuscation, will we ever know...  
>>> If the latter, how is  
>> routing configured by the guest given that the guest view of the
>> topology doesn't match physical hardware?  Are these routes
>> deconfigured by device reset?  Are they part of the save/restore
>> state?  Thanks,

 Still curious what happens to these routes on reset.  Can a later user
 of a GPU inherit a device where the links are already enabled?  Thanks,  
>>>
>>> I am told that the GPU reset disables links. As a side effect, we get an
>>> HMI (a hardware fault which reset the host machine) when trying
>>> accessing the GPU RAM which indicates that the link is down as the
>>> memory is only accessible via the nvlink. We have special fencing code
>>> in our host firmware (skiboot) to fence this memory on PCI reset so
>>> reading from it returns zeroes instead of HMIs.
>>
>> What sort of reset is required for this?  Typically we rely on
>> secondary bus reset for GPUs, but it would be a problem if GPUs were to
>> start implementing FLR and nobody had a spec to learn that FLR maybe
>> didn't disable the link.  The better approach to me still seems to be
>> virtualizing th

RE: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-08 Thread Bharat Bhushan


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Wednesday, August 8, 2018 11:27 PM
> To: Bharat Bhushan ;
> b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> ga...@kernel.crashing.org; mark.rutl...@arm.com;
> kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> ker...@vger.kernel.org
> Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> j...@perches.com
> Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> 
> On Wed, 2018-08-08 at 06:28 +, Bharat Bhushan wrote:
> > > -Original Message-
> > > From: Scott Wood [mailto:o...@buserror.net]
> > > Sent: Wednesday, August 8, 2018 11:26 AM
> > > To: Bharat Bhushan ;
> > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > > ker...@vger.kernel.org
> > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > P2020
> > >
> > > On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > > > > -Original Message-
> > > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > > Sent: Wednesday, August 8, 2018 2:44 AM
> > > > > To: Bharat Bhushan ;
> > > > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> > > > > linux- ker...@vger.kernel.org
> > > > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > > > P2020
> > > > >
> > > > > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > > > > MPIC on NXP (Freescale) P2020 supports following irq
> > > > > > ranges:
> > > > > >   > 0 - 11  (External interrupt)
> > > > > >   > 16 - 79 (Internal interrupt)
> > > > > >   > 176 - 183   (Messaging interrupt)
> > > > > >   > 224 - 231   (Shared message signaled interrupt)
> > > > >
> > > > > Why don't you convert to the 4-cell interrupt specifiers that
> > > > > make dealing with these ranges less error-prone?
> > > >
> > > > Ok , will do if we agree to have this series as per comment on
> > > > other patch.
> > >
> > > If you're concerned with errors, this would be a good things to do
> > > regardless.
> > >  Actually, it seems that p2020si-post.dtsi already uses 4-cell interrupts.
> > >
> > > What is motivating this patchset?  Is there something wrong in the
> > > existing dts files?
> >
> > There is no error in device tree. Main motivation is to improve code
> > for following reasons:
> >   - During code study it was found that if a reserved irq-number is used
> > then there is no check in the driver. The irq will be configured as correct
> > and the interrupt will never fire.
> 
> Again, a wrong interrupt number won't fire, whether an interrupt by that
> number exists or not.  I wouldn't mind a sanity check in the driver if the
> programming model made it properly discoverable, but I don't think it's
> worth messing with device trees just for this (and even less so given that
> there don't seem to be new chips coming out that this would be relevant
> for).

Fair enough, we can use the MPIC version to define the supported interrupt ranges.
Will that be acceptable?

Thanks
-Bharat

> 
> > > > One other confusing observation I have is that "irq_count" from
> > > > platform code is given precedence over "last-interrupt-source" in
> > > > device-tree. Should not device-tree have precedence? Otherwise there
> > > > is no point using "last-interrupt-source" if platform code passes
> > > > "irq_count" in mpic_alloc().
> > >
> > > Maybe, though I don't think it matters much given that
> > > last-interrupt-source was only added to avoid having to pass
> > > irq_count in platform code.
> >
> > Thanks for clarifying;
> >
> > My understanding was that "last-interrupt-source" was added to ensure that
> > we can override the value passed from platform code. In that case we do
> > not need to change code and can control this from the device tree.
> 
> The changelog says, "To avoid needing to write custom board-specific code
> to detect that scenario, allow it to be easily overridden in the device-tree,"
> where "it" means the value provided by hardware.  The goal was to pass in
> 256 without board code in the kernel, not to override the 256.
> 
> -Scott



Re: [PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-08 Thread rashmica



On 08/08/18 17:25, Michael Ellerman wrote:
> Christophe Leroy  writes:
>> diff --git a/lib/test_hexdump.c b/lib/test_hexdump.c
>> index 3f415d8101f3..626f580b4ff7 100644
>> --- a/lib/test_hexdump.c
>> +++ b/lib/test_hexdump.c
>> @@ -32,16 +32,33 @@ static const char * const test_data_2_le[] __initconst = 
>> {
>>  "d14c", "9919", "b143", "0caf",
>>  };
>>  
>> +static const char * const test_data_2_be[] __initconst = {
>> +"be32", "db7b", "0a18", "93b2",
>> +"70ba", "c424", "7d83", "349b",
>> +"a69c", "31ad", "9c0f", "ace9",
>> +"4cd1", "1999", "43b1", "af0c",
>> +};
>> +
>>  static const char * const test_data_4_le[] __initconst = {
>>  "7bdb32be", "b293180a", "24c4ba70", "9b34837d",
>>  "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
>>  };
>>  
>> +static const char * const test_data_4_be[] __initconst = {
>> +"be32db7b", "0a1893b2", "70bac424", "7d83349b",
>> +"a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
>> +};
>> +
> Is there a reason we can't do it all at compile time?

mpe I sent a patch doing that a while ago and you obviously didn't like
it because you never merged it :P
http://patchwork.ozlabs.org/patch/620405/
I prefer this version because of the IS_ENABLED.
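
A minimal sketch of the IS_ENABLED() style being referred to (the array
contents are the ones already in the patch above; the selection macro is
illustrative and not necessarily how the patchwork version spells it):

static const char * const test_data_4_le[] __initconst = {
	"7bdb32be", "b293180a", "24c4ba70", "9b34837d",
	"ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
};

static const char * const test_data_4_be[] __initconst = {
	"be32db7b", "0a1893b2", "70bac424", "7d83349b",
	"a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
};

#define test_data_4 (IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? \
		     test_data_4_le : test_data_4_be)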



> eg:
>
> static const char * const test_data_4[] __initconst = {
> #ifdef CONFIG_CPU_LITTLE_ENDIAN
>   "7bdb32be", "b293180a", "24c4ba70", "9b34837d",
>   "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
> #else
>   "be32db7b", "0a1893b2", "70bac424", "7d83349b",
>   "a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
> #endif
> };
>
>
> cheers



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Thu, 2018-08-09 at 08:13 +1000, Benjamin Herrenschmidt wrote:
> > For completeness, virtio could also have its own bounce buffer
> > outside of DMA API one. I don't see lots of benefits to this
> > though.
> 
> Not fan of that either...

To elaborate a bit ...

For our secure VMs, we will need bounce buffering for everything
anyway. virtio, emulated PCI, or vfio.

By ensuring that we create an identity mapping in the IOMMU for
the bounce buffering pool, we enable virtio "legacy/direct" to
use the same mapping ops as things using the iommu.

That said, we still need somewhere in arch/powerpc a set of dma
ops which we'll attach to all PCI devices of a secure VM to force
bouncing always, rather than just based on address (which is what
the standard swiotlb ones do)... Unless we can tweak the swiotlb
"threshold" for example by using an empty mask.

We'll need the same set of DMA ops for VIO devices too, not just PCI.

Cheers,
Ben.




Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Thu, 2018-08-09 at 10:54 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > These are identical to the arch specific ones, so remove them.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Acked-by: Benjamin Herrenschmidt 

Note: We will still need to implement some custom variant of this
for our secure VMs ... 

Basically we'll need to use the existing bounce buffering as-is but
the condition will be different, it won't be whether the address is
below a certain limit, it will be *always*.
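
A minimal sketch of what that could look like, reusing the bus notifier
shape from the quoted patch (is_secure_vm() is a placeholder predicate,
not an existing function; the real series will pick its own condition):

static int ppc_secure_vm_bus_notify(struct notifier_block *nb,
				    unsigned long action, void *data)
{
	struct device *dev = data;

	/* We are only interested in device addition */
	if (action != BUS_NOTIFY_ADD_DEVICE)
		return NOTIFY_DONE;

	/* Always bounce, regardless of the device's DMA mask */
	if (is_secure_vm())
		set_dma_ops(dev, &swiotlb_dma_ops);

	return NOTIFY_DONE;
}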

Cheers,
Ben.

> > ---
> >  arch/powerpc/include/asm/dma-direct.h |  4 
> >  arch/powerpc/include/asm/swiotlb.h|  2 --
> >  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
> >  arch/powerpc/sysdev/fsl_pci.c |  2 +-
> >  4 files changed, 7 insertions(+), 29 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/dma-direct.h 
> > b/arch/powerpc/include/asm/dma-direct.h
> > index 0fba19445ae8..657f84ddb20d 100644
> > --- a/arch/powerpc/include/asm/dma-direct.h
> > +++ b/arch/powerpc/include/asm/dma-direct.h
> > @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device 
> > *dev, dma_addr_t daddr)
> > return daddr - PCI_DRAM_OFFSET;
> > return daddr - dev->archdata.dma_offset;
> >  }
> > +
> > +u64 swiotlb_powerpc_get_required(struct device *dev);
> > +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> > +
> >  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> > diff --git a/arch/powerpc/include/asm/swiotlb.h 
> > b/arch/powerpc/include/asm/swiotlb.h
> > index f65ecf57b66c..1d8c1da26ab3 100644
> > --- a/arch/powerpc/include/asm/swiotlb.h
> > +++ b/arch/powerpc/include/asm/swiotlb.h
> > @@ -13,8 +13,6 @@
> >  
> >  #include 
> >  
> > -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> > -
> >  extern unsigned int ppc_swiotlb_enable;
> >  int __init swiotlb_setup_bus_notifier(void);
> >  
> > diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> > b/arch/powerpc/kernel/dma-swiotlb.c
> > index 25986fcd1e5e..0c269de61f39 100644
> > --- a/arch/powerpc/kernel/dma-swiotlb.c
> > +++ b/arch/powerpc/kernel/dma-swiotlb.c
> > @@ -24,7 +24,7 @@
> >  
> >  unsigned int ppc_swiotlb_enable;
> >  
> > -static u64 swiotlb_powerpc_get_required(struct device *dev)
> > +u64 swiotlb_powerpc_get_required(struct device *dev)
> >  {
> > u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
> >  
> > @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device 
> > *dev)
> > return mask;
> >  }
> >  
> > -/*
> > - * At the moment, all platforms that use this code only require
> > - * swiotlb to be used if we're operating on HIGHMEM.  Since
> > - * we don't ever call anything other than map_sg, unmap_sg,
> > - * map_page, and unmap_page on highmem, use normal dma_ops
> > - * for everything else.
> > - */
> > -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> > -   .alloc = dma_direct_alloc,
> > -   .free = dma_direct_free,
> > -   .mmap = dma_nommu_mmap_coherent,
> > -   .map_sg = swiotlb_map_sg_attrs,
> > -   .unmap_sg = swiotlb_unmap_sg_attrs,
> > -   .dma_supported = swiotlb_dma_supported,
> > -   .map_page = swiotlb_map_page,
> > -   .unmap_page = swiotlb_unmap_page,
> > -   .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> > -   .sync_single_for_device = swiotlb_sync_single_for_device,
> > -   .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> > -   .sync_sg_for_device = swiotlb_sync_sg_for_device,
> > -   .mapping_error = swiotlb_dma_mapping_error,
> > -   .get_required_mask = swiotlb_powerpc_get_required,
> > -};
> > -
> >  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
> >  {
> > struct pci_controller *hose;
> > @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block 
> > *nb,
> >  
> > /* May need to bounce if the device can't address all of DRAM */
> > if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> > -   set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> > +   set_dma_ops(dev, &swiotlb_dma_ops);
> >  
> > return NOTIFY_DONE;
> >  }
> > diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> > index 918be816b097..daf44bc0108d 100644
> > --- a/arch/powerpc/sysdev/fsl_pci.c
> > +++ b/arch/powerpc/sysdev/fsl_pci.c
> > @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller 
> > *hose)
> >  {
> > if (ppc_swiotlb_enable) {
> > hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> > -   set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> > +   set_pci_dma_ops(&swiotlb_dma_ops);
> > }
> >  }
> >  #else



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-08 Thread Nicholas Piggin
On Thu, 09 Aug 2018 00:56:00 +1000
Michael Ellerman  wrote:

> Mahesh J Salgaonkar  writes:
> > From: Mahesh Salgaonkar 
> >
> > Introduce recovery action for recovered memory errors (MCEs). There are
> > soft memory errors like SLB Multihit, which can be a result of a bad
> > hardware OR software BUG. Kernel can easily recover from these soft errors
> > by flushing SLB contents. After the recovery kernel can still continue to
> > function without any issue. But in some scenarios we may keep getting
> > these soft errors until the root cause is fixed. To be able to analyze and
> > find the root cause, the best way is to gather enough data and system state at
> > the time of MCE. Hence this patch introduces a sysctl knob where user can
> > decide either to continue after recovery or panic the kernel to capture the
> > dump.  
> 
> I'm not convinced we want this.
> 
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
> 
> The key thing you need is the content of the SLB and that's not included
> in a dump.
> 
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.

Yeah it's a lot of knobs that administrators can hardly be expected to
tune. Hypervisor or firmware should really eventually make the MCE
unrecoverable if we aren't making progress.

That said, x86 has a bunch of options, and for debugging a rare crash
or specialised installations it might be useful. But we should follow
the normal format, /proc/sys/kernel/panic_on_mce.
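
A hedged sketch of what wiring that up could look like (the table,
variable name and use of register_sysctl() are illustrative, not an
existing powerpc patch; the 0/1 bounds follow the usual sysctl idiom):

static int panic_on_mce;
static int zero;
static int one = 1;

static struct ctl_table mce_sysctl_table[] = {
	{
		.procname	= "panic_on_mce",
		.data		= &panic_on_mce,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &zero,
		.extra2		= &one,
	},
	{ }
};

static int __init mce_sysctl_init(void)
{
	register_sysctl("kernel", mce_sysctl_table);
	return 0;
}
device_initcall(mce_sysctl_init);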

Thanks,
Nick


Re: [PATCH v7 7/9] powerpc/pseries: Dump the SLB contents on SLB MCE errors.

2018-08-08 Thread Michael Ellerman
Mahesh J Salgaonkar  writes:

> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 7f22929ce915..233d25ff6f64 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -254,6 +254,10 @@ struct paca_struct {
>  #endif
>  #ifdef CONFIG_PPC_PSERIES
>   u8 *mce_data_buf;   /* buffer to hold per cpu rtas errlog */
> +
> + /* Capture SLB related old contents in MCE handler. */
> + struct slb_entry *mce_faulty_slbs;
> + u16 slb_save_cache_ptr;
>  #endif /* CONFIG_PPC_PSERIES */

 ^

> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index e89f675f1b5e..16a53689ffd4 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -151,6 +151,79 @@ void slb_flush_and_rebolt_realmode(void)
>   get_paca()->slb_cache_ptr = 0;
>  }
>  
> +void slb_save_contents(struct slb_entry *slb_ptr)
> +{
> + int i;
> + unsigned long e, v;
> +
> + /* Save slb_cache_ptr value. */
> + get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;

This isn't inside CONFIG_PPC_PSERIES which breaks lots of configs, eg
powernv.

  arch/powerpc/mm/slb.c:160:12: error: 'struct paca_struct' has no member named 'slb_save_cache_ptr'
  arch/powerpc/mm/slb.c:218:27: error: 'struct paca_struct' has no member named 'slb_save_cache_ptr'
  arch/powerpc/mm/slb.c:216:49: error: 'struct paca_struct' has no member named 'slb_save_cache_ptr'

http://kisskb.ozlabs.ibm.com/kisskb/head/219f20e490add009194d94fdeb480da2e385f1c6/
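
One way to avoid the breakage, as a sketch (the helper name is
illustrative and this is not the actual respin): keep the pseries-only
paca accesses behind the same #ifdef and give the other platforms an
empty stub, e.g. in slb.c:

#ifdef CONFIG_PPC_PSERIES
static void slb_cache_ptr_save(void)
{
	/* Save slb_cache_ptr value. */
	get_paca()->slb_save_cache_ptr = get_paca()->slb_cache_ptr;
}
#else
static void slb_cache_ptr_save(void) { }
#endif

slb_save_contents() can then call slb_cache_ptr_save() unconditionally
and still build for powernv.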

cheers


Re: [PATCH 20/20] powerpc/dma: remove dma_nommu_mmap_coherent

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The remaining implementation for coherent caches is functionally
> identical to the default provided in common code.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  7 ---
>  arch/powerpc/kernel/dma-iommu.c|  1 -
>  arch/powerpc/kernel/dma.c  | 13 -
>  arch/powerpc/platforms/pseries/vio.c   |  1 -
>  4 files changed, 22 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 879c4efba785..e62e23aa3714 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -18,13 +18,6 @@
>  #include 
>  #include 
>  
> -/* Some dma direct funcs must be visible for use in other dma_ops */
> -extern int dma_nommu_mmap_coherent(struct device *dev,
> - struct vm_area_struct *vma,
> - void *cpu_addr, dma_addr_t handle,
> - size_t size, unsigned long attrs);
> -
> -
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
>   if (dev->dma_mask && *dev->dma_mask)
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index f9fe2080ceb9..bf5234e1f71b 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -114,7 +114,6 @@ int dma_iommu_mapping_error(struct device *dev, 
> dma_addr_t dma_addr)
>  struct dma_map_ops dma_iommu_ops = {
>   .alloc  = dma_iommu_alloc_coherent,
>   .free   = dma_iommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_iommu_map_sg,
>   .unmap_sg   = dma_iommu_unmap_sg,
>   .dma_supported  = dma_iommu_dma_supported,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 08b12cbd7abf..5b71c9d1b8cc 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -70,18 +70,6 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   iommu_free_coherent(iommu, size, vaddr, dma_handle);
>  }
>  
> -int dma_nommu_mmap_coherent(struct device *dev, struct vm_area_struct *vma,
> -  void *cpu_addr, dma_addr_t handle, size_t size,
> -  unsigned long attrs)
> -{
> - unsigned long pfn = page_to_pfn(virt_to_page(cpu_addr));
> -
> - return remap_pfn_range(vma, vma->vm_start,
> -pfn + vma->vm_pgoff,
> -vma->vm_end - vma->vm_start,
> -vma->vm_page_prot);
> -}
> -
>  /* note: needs to be called arch_get_required_mask for dma-noncoherent.c */
>  u64 arch_get_required_mask(struct device *dev)
>  {
> @@ -98,7 +86,6 @@ u64 arch_get_required_mask(struct device *dev)
>  const struct dma_map_ops dma_nommu_ops = {
>   .alloc  = dma_nommu_alloc_coherent,
>   .free   = dma_nommu_free_coherent,
> - .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_direct_map_sg,
>   .map_page   = dma_direct_map_page,
>   .get_required_mask  = arch_get_required_mask,
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 49e04ec19238..51d564313bd0 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -618,7 +618,6 @@ static u64 vio_dma_get_required_mask(struct device *dev)
>  static const struct dma_map_ops vio_dma_mapping_ops = {
>   .alloc = vio_dma_iommu_alloc_coherent,
>   .free  = vio_dma_iommu_free_coherent,
> - .mmap  = dma_nommu_mmap_coherent,
>   .map_sg= vio_dma_iommu_map_sg,
>   .unmap_sg  = vio_dma_iommu_unmap_sg,
>   .map_page  = vio_dma_iommu_map_page,



Re: [PATCH 18/20] powerpc/dma-noncoherent: use generic dma_noncoherent_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The generic dma-noncoherent code provides all that is needed by powerpc.
> 
> Note that the cache maintenance in the existing code is a bit odd
> as it implements both the sync_to_device and sync_to_cpu callouts,
> but never flushes caches when unmapping.  This patch keeps both
> directions around, which will lead to more flushing than the previous
> implementation.  Someone more familiar with the required CPUs should
> eventually take a look and optimize the cache flush handling if needed.

The original code looks bogus indeed.

I think we got away with it because those older CPUs wouldn't speculate
or prefetch aggressively enough (or at all) so the flush on map was
sufficient, the stuff wouldn't come back into the cache.

But safe is better than sorry, so ... tentative Ack, I do need to try
to dig one of these things to test, which might take a while.
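
For reference, a minimal sketch of the two callouts the generic
dma-noncoherent code hooks into, wired to the existing powerpc helper
(the glue and the exact prototypes are from memory and may differ from
the series; treat this as illustrative only):

void arch_sync_dma_for_device(struct device *dev, phys_addr_t paddr,
			      size_t size, enum dma_data_direction dir)
{
	/* Make CPU writes visible to the device before the DMA starts. */
	__dma_sync(phys_to_virt(paddr), size, dir);
}

void arch_sync_dma_for_cpu(struct device *dev, phys_addr_t paddr,
			   size_t size, enum dma_data_direction dir)
{
	/* Flush again on unmap/sync-to-CPU, the extra flushing noted above. */
	__dma_sync(phys_to_virt(paddr), size, dir);
}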

Acked-by: Benjamin Herrenschmidt 


> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/Kconfig   |  2 +-
>  arch/powerpc/include/asm/dma-mapping.h | 29 -
>  arch/powerpc/kernel/dma.c  | 59 +++---
>  arch/powerpc/kernel/pci-common.c   |  5 ++-
>  arch/powerpc/kernel/setup-common.c |  4 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 52 +--
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  arch/powerpc/platforms/Kconfig.cputype |  6 ++-
>  8 files changed, 60 insertions(+), 99 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index bbfa6a8df4da..33c6017ffce6 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -129,7 +129,7 @@ config PPC
>   # Please keep this list sorted alphabetically.
>   #
>   select ARCH_HAS_DEVMEM_IS_ALLOWED
> - select ARCH_HAS_DMA_SET_COHERENT_MASK
> + select ARCH_HAS_DMA_SET_COHERENT_MASK if !NOT_COHERENT_CACHE
>   select ARCH_HAS_ELF_RANDOMIZE
>   select ARCH_HAS_FORTIFY_SOURCE
>   select ARCH_HAS_GCOV_PROFILE_ALL
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index f0bf7ac2686c..879c4efba785 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -19,40 +19,11 @@
>  #include 
>  
>  /* Some dma direct funcs must be visible for use in other dma_ops */
> -extern void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -  dma_addr_t *dma_handle, gfp_t flag,
> -  unsigned long attrs);
> -extern void __dma_nommu_free_coherent(struct device *dev, size_t size,
> -void *vaddr, dma_addr_t dma_handle,
> -unsigned long attrs);
>  extern int dma_nommu_mmap_coherent(struct device *dev,
>   struct vm_area_struct *vma,
>   void *cpu_addr, dma_addr_t handle,
>   size_t size, unsigned long attrs);
>  
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> -/*
> - * DMA-consistent mapping functions for PowerPCs that don't support
> - * cache snooping.  These allocate/free a region of uncached mapped
> - * memory space for use with DMA devices.  Alternatively, you could
> - * allocate the space "normally" and use the cache management functions
> - * to ensure it is consistent.
> - */
> -struct device;
> -extern void __dma_sync(void *vaddr, size_t size, int direction);
> -extern void __dma_sync_page(struct page *page, unsigned long offset,
> -  size_t size, int direction);
> -extern unsigned long __dma_get_coherent_pfn(unsigned long cpu_addr);
> -
> -#else /* ! CONFIG_NOT_COHERENT_CACHE */
> -/*
> - * Cache coherent cores.
> - */
> -
> -#define __dma_sync(addr, size, rw)   ((void)0)
> -#define __dma_sync_page(pg, off, sz, rw) ((void)0)
> -
> -#endif /* ! CONFIG_NOT_COHERENT_CACHE */
>  
>  static inline unsigned long device_to_mask(struct device *dev)
>  {
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2b90a403cdac..b2e88075b2ea 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -36,12 +36,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_dma_mask))
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - return __dma_nommu_alloc_coherent(dev, size, dma_handle,
> -flag, attrs);
> -#else
>   return dma_direct_alloc(dev, size, dma_handle, flag, attrs);
> -#endif
>  
>   /* Ok we can't ... do we have an iommu ? If not, fail */
>   iommu = get_iommu_table_base(dev);
> @@ -62,12 +57,7 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>  
>   /* See comments in dma_nommu_al

Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These are identical to the arch specific ones, so remove them.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h |  4 
>  arch/powerpc/include/asm/swiotlb.h|  2 --
>  arch/powerpc/kernel/dma-swiotlb.c | 28 ++-
>  arch/powerpc/sysdev/fsl_pci.c |  2 +-
>  4 files changed, 7 insertions(+), 29 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h 
> b/arch/powerpc/include/asm/dma-direct.h
> index 0fba19445ae8..657f84ddb20d 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -30,4 +30,8 @@ static inline phys_addr_t __dma_to_phys(struct device *dev, 
> dma_addr_t daddr)
>   return daddr - PCI_DRAM_OFFSET;
>   return daddr - dev->archdata.dma_offset;
>  }
> +
> +u64 swiotlb_powerpc_get_required(struct device *dev);
> +#define swiotlb_get_required_mask swiotlb_powerpc_get_required
> +
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/swiotlb.h 
> b/arch/powerpc/include/asm/swiotlb.h
> index f65ecf57b66c..1d8c1da26ab3 100644
> --- a/arch/powerpc/include/asm/swiotlb.h
> +++ b/arch/powerpc/include/asm/swiotlb.h
> @@ -13,8 +13,6 @@
>  
>  #include 
>  
> -extern const struct dma_map_ops powerpc_swiotlb_dma_ops;
> -
>  extern unsigned int ppc_swiotlb_enable;
>  int __init swiotlb_setup_bus_notifier(void);
>  
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index 25986fcd1e5e..0c269de61f39 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -24,7 +24,7 @@
>  
>  unsigned int ppc_swiotlb_enable;
>  
> -static u64 swiotlb_powerpc_get_required(struct device *dev)
> +u64 swiotlb_powerpc_get_required(struct device *dev)
>  {
>   u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
>  
> @@ -38,30 +38,6 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   return mask;
>  }
>  
> -/*
> - * At the moment, all platforms that use this code only require
> - * swiotlb to be used if we're operating on HIGHMEM.  Since
> - * we don't ever call anything other than map_sg, unmap_sg,
> - * map_page, and unmap_page on highmem, use normal dma_ops
> - * for everything else.
> - */
> -const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = dma_direct_alloc,
> - .free = dma_direct_free,
> - .mmap = dma_nommu_mmap_coherent,
> - .map_sg = swiotlb_map_sg_attrs,
> - .unmap_sg = swiotlb_unmap_sg_attrs,
> - .dma_supported = swiotlb_dma_supported,
> - .map_page = swiotlb_map_page,
> - .unmap_page = swiotlb_unmap_page,
> - .sync_single_for_cpu = swiotlb_sync_single_for_cpu,
> - .sync_single_for_device = swiotlb_sync_single_for_device,
> - .sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
> - .sync_sg_for_device = swiotlb_sync_sg_for_device,
> - .mapping_error = swiotlb_dma_mapping_error,
> - .get_required_mask = swiotlb_powerpc_get_required,
> -};
> -
>  void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
>  {
>   struct pci_controller *hose;
> @@ -88,7 +64,7 @@ static int ppc_swiotlb_bus_notify(struct notifier_block *nb,
>  
>   /* May need to bounce if the device can't address all of DRAM */
>   if ((dma_get_mask(dev) + 1) < memblock_end_of_DRAM())
> - set_dma_ops(dev, &powerpc_swiotlb_dma_ops);
> + set_dma_ops(dev, &swiotlb_dma_ops);
>  
>   return NOTIFY_DONE;
>  }
> diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
> index 918be816b097..daf44bc0108d 100644
> --- a/arch/powerpc/sysdev/fsl_pci.c
> +++ b/arch/powerpc/sysdev/fsl_pci.c
> @@ -118,7 +118,7 @@ static void setup_swiotlb_ops(struct pci_controller *hose)
>  {
>   if (ppc_swiotlb_enable) {
>   hose->controller_ops.dma_dev_setup = pci_dma_dev_setup_swiotlb;
> - set_pci_dma_ops(&powerpc_swiotlb_dma_ops);
> + set_pci_dma_ops(&swiotlb_dma_ops);
>   }
>  }
>  #else



Re: [PATCH 16/20] powerpc/dma: use dma_direct_{alloc,free}

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These do the same functionality as the existing helpers, but do it
> simpler, and also allow the (optional) use of CMA.
> 
> Note that the swiotlb code now calls into the dma_direct code directly,
> given that it doesn't work with noncoherent caches at all, and isn't called
> when we have an iommu either, so the iommu special case in
> dma_nommu_alloc_coherent isn't required for swiotlb.

I am not convinced that this will produce the same results due to
the way the zone picking works.

As for the interaction with swiotlb, we'll need the FSL guys to have
a look. Scott, do you remember what this is about ?

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/pgtable.h |  1 -
>  arch/powerpc/kernel/dma-swiotlb.c  |  4 +-
>  arch/powerpc/kernel/dma.c  | 78 --
>  arch/powerpc/mm/mem.c  | 19 
>  4 files changed, 11 insertions(+), 91 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 14c79a7dc855..123de4958d2e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -38,7 +38,6 @@ extern unsigned long empty_zero_page[];
>  extern pgd_t swapper_pg_dir[];
>  
>  void limit_zone_pfn(enum zone_type zone, unsigned long max_pfn);
> -int dma_pfn_limit_to_zone(u64 pfn_limit);
>  extern void paging_init(void);
>  
>  /*
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index f6e0701c5303..25986fcd1e5e 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -46,8 +46,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
>   * for everything else.
>   */
>  const struct dma_map_ops powerpc_swiotlb_dma_ops = {
> - .alloc = __dma_nommu_alloc_coherent,
> - .free = __dma_nommu_free_coherent,
> + .alloc = dma_direct_alloc,
> + .free = dma_direct_free,
>   .mmap = dma_nommu_mmap_coherent,
>   .map_sg = swiotlb_map_sg_attrs,
>   .unmap_sg = swiotlb_unmap_sg_attrs,
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 2cfc45acbb52..2b90a403cdac 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -26,75 +26,6 @@
>   * can set archdata.dma_data to an unsigned long holding the offset. By
>   * default the offset is PCI_DRAM_OFFSET.
>   */
> -
> -static u64 __maybe_unused get_pfn_limit(struct device *dev)
> -{
> - u64 pfn = (dev->coherent_dma_mask >> PAGE_SHIFT) + 1;
> - struct dev_archdata __maybe_unused *sd = &dev->archdata;
> -
> -#ifdef CONFIG_SWIOTLB
> - if (sd->max_direct_dma_addr && dev->dma_ops == &powerpc_swiotlb_dma_ops)
> - pfn = min_t(u64, pfn, sd->max_direct_dma_addr >> PAGE_SHIFT);
> -#endif
> -
> - return pfn;
> -}
> -
> -#ifndef CONFIG_NOT_COHERENT_CACHE
> -void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *dma_handle, gfp_t flag,
> -   unsigned long attrs)
> -{
> - void *ret;
> - struct page *page;
> - int node = dev_to_node(dev);
> -#ifdef CONFIG_FSL_SOC
> - u64 pfn = get_pfn_limit(dev);
> - int zone;
> -
> - /*
> -  * This code should be OK on other platforms, but we have drivers that
> -  * don't set coherent_dma_mask. As a workaround we just ifdef it. This
> -  * whole routine needs some serious cleanup.
> -  */
> -
> - zone = dma_pfn_limit_to_zone(pfn);
> - if (zone < 0) {
> - dev_err(dev, "%s: No suitable zone for pfn %#llx\n",
> - __func__, pfn);
> - return NULL;
> - }
> -
> - switch (zone) {
> - case ZONE_DMA:
> - flag |= GFP_DMA;
> - break;
> -#ifdef CONFIG_ZONE_DMA32
> - case ZONE_DMA32:
> - flag |= GFP_DMA32;
> - break;
> -#endif
> - };
> -#endif /* CONFIG_FSL_SOC */
> -
> - page = alloc_pages_node(node, flag, get_order(size));
> - if (page == NULL)
> - return NULL;
> - ret = page_address(page);
> - memset(ret, 0, size);
> - *dma_handle = phys_to_dma(dev,__pa(ret));
> -
> - return ret;
> -}
> -
> -void __dma_nommu_free_coherent(struct device *dev, size_t size,
> - void *vaddr, dma_addr_t dma_handle,
> - unsigned long attrs)
> -{
> - free_pages((unsigned long)vaddr, get_order(size));
> -}
> -#endif /* !CONFIG_NOT_COHERENT_CACHE */
> -
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
>  unsigned long attrs)
> @@ -105,8 +36,12 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>* we can really use the direct ops
>*/
>   if (dma_direct_supported(dev, dev->coherent_d

Re: [PATCH 15/20] powerpc/dma: remove the unused unmap_page and unmap_sg methods

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> These methods are optional to start with, no need to implement no-op
> versions.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 
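
The reason the no-op versions add nothing: the dma-mapping core only calls
these hooks when they are non-NULL. Roughly, as a simplified sketch of the
include/linux/dma-mapping.h wrapper of that era (debug hooks omitted):

    /* Simplified: a NULL ->unmap_page is simply skipped by the core. */
    static inline void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr,
                                            size_t size,
                                            enum dma_data_direction dir,
                                            unsigned long attrs)
    {
            const struct dma_map_ops *ops = get_dma_ops(dev);

            if (ops->unmap_page)
                    ops->unmap_page(dev, addr, size, dir, attrs);
    }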

> ---
>  arch/powerpc/kernel/dma.c | 16 
>  1 file changed, 16 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 511a4972560d..2cfc45acbb52 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -178,12 +178,6 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
>   return nents;
>  }
>  
> -static void dma_nommu_unmap_sg(struct device *dev, struct scatterlist *sg,
> - int nents, enum dma_data_direction direction,
> - unsigned long attrs)
> -{
> -}
> -
>  static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
>   u64 end, mask;
> @@ -209,14 +203,6 @@ static inline dma_addr_t dma_nommu_map_page(struct 
> device *dev,
>   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
> -static inline void dma_nommu_unmap_page(struct device *dev,
> -  dma_addr_t dma_address,
> -  size_t size,
> -  enum dma_data_direction direction,
> -  unsigned long attrs)
> -{
> -}
> -
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>  static inline void dma_nommu_sync_sg(struct device *dev,
>   struct scatterlist *sgl, int nents,
> @@ -242,10 +228,8 @@ const struct dma_map_ops dma_nommu_ops = {
>   .free   = dma_nommu_free_coherent,
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
> - .unmap_sg   = dma_nommu_unmap_sg,
>   .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
> - .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,
>  #ifdef CONFIG_NOT_COHERENT_CACHE
>   .sync_single_for_cpu= dma_nommu_sync_single,



Re: [PATCH 14/20] powerpc/dma: replace dma_nommu_dma_supported with dma_direct_supported

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The ppc32 case of dma_nommu_dma_supported already was a no-op, and the
> 64-bit case came to the same conclusion as dma_direct_supported, so
> replace it with the generic version.

It's not at all equivalent (see my review on your earlier patch) or
am I missing something ?

 - ppc32 always returns 1, but dma_direct_supported() will not for
devices with a <32-bit mask (and yes ppc32 isn't quite right to do
so, it should check against memory size, but in practice it worked
as the only limited devices we deal with on systems we still support
have a 31-bit limitation)

 - ppc64 needs to check against the end of DRAM as some devices will
fail the check; dma_direct_supported() doesn't seem to be doing that (a
sketch of that check follows below).

Also as I mentioned, I'm not sure about the business with ZONE_DMA,
and that arbitrary 24-bit limit since our entire memory is in ZONE_DMA
but that's a different can of worms I suppose.
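
For illustration, the end-of-DRAM check from the ppc64 point above amounts to
the test the removed ppc64 code performed; a sketch, not a proposed patch:

    /* Sketch: the mask must cover the DMA address of the last byte of DRAM,
     * otherwise a direct mapping cannot reach all of memory. */
    static bool mask_covers_all_dram(struct device *dev, u64 mask)
    {
            u64 limit = phys_to_dma(dev, memblock_end_of_DRAM() - 1);

            return mask >= limit;
    }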

> Signed-off-by: Christoph Hellwig 


> ---
>  arch/powerpc/Kconfig  |  1 +
>  arch/powerpc/kernel/dma.c | 28 +++-
>  2 files changed, 4 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index f9cae7edd735..bbfa6a8df4da 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -158,6 +158,7 @@ config PPC
>   select CLONE_BACKWARDS
>   select DCACHE_WORD_ACCESS   if PPC64 && CPU_LITTLE_ENDIAN
>   select DYNAMIC_FTRACE   if FUNCTION_TRACER
> + select DMA_DIRECT_OPS
>   select EDAC_ATOMIC_SCRUB
>   select EDAC_SUPPORT
>   select GENERIC_ATOMIC64 if PPC32
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3487de83bb37..511a4972560d 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -40,28 +40,6 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>   return pfn;
>  }
>  
> -static int dma_nommu_dma_supported(struct device *dev, u64 mask)
> -{
> -#ifdef CONFIG_PPC64
> - u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
> -
> - /* Limit fits in the mask, we are good */
> - if (mask >= limit)
> - return 1;
> -
> -#ifdef CONFIG_FSL_SOC
> - /* Freescale gets another chance via ZONE_DMA/ZONE_DMA32, however
> -  * that will have to be refined if/when they support iommus
> -  */
> - return 1;
> -#endif
> - /* Sorry ... */
> - return 0;
> -#else
> - return 1;
> -#endif
> -}
> -
>  #ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> @@ -126,7 +104,7 @@ static void *dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   /* The coherent mask may be smaller than the real mask, check if
>* we can really use the direct ops
>*/
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_alloc_coherent(dev, size, dma_handle,
>  flag, attrs);
>  
> @@ -148,7 +126,7 @@ static void dma_nommu_free_coherent(struct device *dev, 
> size_t size,
>   struct iommu_table *iommu;
>  
>   /* See comments in dma_nommu_alloc_coherent() */
> - if (dma_nommu_dma_supported(dev, dev->coherent_dma_mask))
> + if (dma_direct_supported(dev, dev->coherent_dma_mask))
>   return __dma_nommu_free_coherent(dev, size, vaddr, dma_handle,
> attrs);
>   /* Maybe we used an iommu ... */
> @@ -265,7 +243,7 @@ const struct dma_map_ops dma_nommu_ops = {
>   .mmap   = dma_nommu_mmap_coherent,
>   .map_sg = dma_nommu_map_sg,
>   .unmap_sg   = dma_nommu_unmap_sg,
> - .dma_supported  = dma_nommu_dma_supported,
> + .dma_supported  = dma_direct_supported,
>   .map_page   = dma_nommu_map_page,
>   .unmap_page = dma_nommu_unmap_page,
>   .get_required_mask  = dma_nommu_get_required_mask,



Re: [PATCH 13/20] powerpc/dma: remove get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Just fold the calculation into __phys_to_dma/__dma_to_phys as those are
> the only places that should know about it.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-direct.h  |  8 ++--
>  arch/powerpc/include/asm/dma-mapping.h | 16 
>  2 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-direct.h 
> b/arch/powerpc/include/asm/dma-direct.h
> index 7702875aabb7..0fba19445ae8 100644
> --- a/arch/powerpc/include/asm/dma-direct.h
> +++ b/arch/powerpc/include/asm/dma-direct.h
> @@ -19,11 +19,15 @@ static inline bool dma_capable(struct device *dev, 
> dma_addr_t addr, size_t size)
>  
>  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
>  {
> - return paddr + get_dma_offset(dev);
> + if (!dev)
> + return paddr + PCI_DRAM_OFFSET;
> + return paddr + dev->archdata.dma_offset;
>  }
>  
>  static inline phys_addr_t __dma_to_phys(struct device *dev, dma_addr_t daddr)
>  {
> - return daddr - get_dma_offset(dev);
> + if (!dev)
> + return daddr - PCI_DRAM_OFFSET;
> + return daddr - dev->archdata.dma_offset;
>  }
>  #endif /* ASM_POWERPC_DMA_DIRECT_H */
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index dacd0f93f2b2..f0bf7ac2686c 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -80,22 +80,6 @@ static inline const struct dma_map_ops 
> *get_arch_dma_ops(struct bus_type *bus)
>   return NULL;
>  }
>  
> -/*
> - * get_dma_offset()
> - *
> - * Get the dma offset on configurations where the dma address can be 
> determined
> - * from the physical address by looking at a simple offset.  Direct dma and
> - * swiotlb use this function, but it is typically not used by implementations
> - * with an iommu.
> - */
> -static inline dma_addr_t get_dma_offset(struct device *dev)
> -{
> - if (dev)
> - return dev->archdata.dma_offset;
> -
> - return PCI_DRAM_OFFSET;
> -}
> -
>  static inline void set_dma_offset(struct device *dev, dma_addr_t off)
>  {
>   if (dev)



Re: [PATCH 12/20] powerpc/dma: use phys_to_dma instead of get_dma_offset

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Use the standard portable helper instead of the powerpc specific one,
> which is about to go away.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/kernel/dma-swiotlb.c |  5 ++---
>  arch/powerpc/kernel/dma.c | 12 ++--
>  2 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/dma-swiotlb.c 
> b/arch/powerpc/kernel/dma-swiotlb.c
> index 88f3963ca30f..f6e0701c5303 100644
> --- a/arch/powerpc/kernel/dma-swiotlb.c
> +++ b/arch/powerpc/kernel/dma-swiotlb.c
> @@ -11,7 +11,7 @@
>   *
>   */
>  
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -31,9 +31,8 @@ static u64 swiotlb_powerpc_get_required(struct device *dev)
> end = memblock_end_of_DRAM();
> if (max_direct_dma_addr && end > max_direct_dma_addr)
> end = max_direct_dma_addr;
> -   end += get_dma_offset(dev);
>  
> -   mask = 1ULL << (fls64(end) - 1);
> +   mask = 1ULL << (fls64(phys_to_dma(dev, end)) - 1);
> mask += mask - 1;
>  
> return mask;
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index eceaa92e6986..3487de83bb37 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -6,7 +6,7 @@
>   */
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -43,7 +43,7 @@ static u64 __maybe_unused get_pfn_limit(struct device *dev)
>  static int dma_nommu_dma_supported(struct device *dev, u64 mask)
>  {
>  #ifdef CONFIG_PPC64
> -   u64 limit = get_dma_offset(dev) + (memblock_end_of_DRAM() - 1);
> +   u64 limit = phys_to_dma(dev, (memblock_end_of_DRAM() - 1));
>  
> /* Limit fits in the mask, we are good */
> if (mask >= limit)
> @@ -104,7 +104,7 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
> return NULL;
> ret = page_address(page);
> memset(ret, 0, size);
> -   *dma_handle = __pa(ret) + get_dma_offset(dev);
> +   *dma_handle = phys_to_dma(dev,__pa(ret));
>  
> return ret;
>  }
> @@ -188,7 +188,7 @@ static int dma_nommu_map_sg(struct device *dev, struct 
> scatterlist *sgl,
> int i;
>  
> for_each_sg(sgl, sg, nents, i) {
> -   sg->dma_address = sg_phys(sg) + get_dma_offset(dev);
> +   sg->dma_address = phys_to_dma(dev, sg_phys(sg));
> sg->dma_length = sg->length;
>  
> if (attrs & DMA_ATTR_SKIP_CPU_SYNC)
> @@ -210,7 +210,7 @@ static u64 dma_nommu_get_required_mask(struct device *dev)
>  {
> u64 end, mask;
>  
> -   end = memblock_end_of_DRAM() + get_dma_offset(dev);
> +   end = phys_to_dma(dev, memblock_end_of_DRAM());
>  
> mask = 1ULL << (fls64(end) - 1);
> mask += mask - 1;
> @@ -228,7 +228,7 @@ static inline dma_addr_t dma_nommu_map_page(struct device 
> *dev,
> if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
> __dma_sync_page(page, offset, size, dir);
>  
> -   return page_to_phys(page) + offset + get_dma_offset(dev);
> +   return phys_to_dma(dev, page_to_phys(page)) + offset;
>  }
>  
>  static inline void dma_nommu_unmap_page(struct device *dev,



Re: [PATCH 11/20] powerpc/dma: split the two __dma_alloc_coherent implementations

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The implemementation for the CONFIG_NOT_COHERENT_CACHE case doesn't share
> any code with the one for systems with coherent caches.  Split it off
> and merge it with the helpers in dma-noncoherent.c that have no other
> callers.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h |  5 -
>  arch/powerpc/kernel/dma.c  | 14 ++
>  arch/powerpc/mm/dma-noncoherent.c  | 15 +++
>  arch/powerpc/platforms/44x/warp.c  |  2 +-
>  4 files changed, 10 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index f2a4a7142b1e..dacd0f93f2b2 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -39,9 +39,6 @@ extern int dma_nommu_mmap_coherent(struct device *dev,
>   * to ensure it is consistent.
>   */
>  struct device;
> -extern void *__dma_alloc_coherent(struct device *dev, size_t size,
> -   dma_addr_t *handle, gfp_t gfp);
> -extern void __dma_free_coherent(size_t size, void *vaddr);
>  extern void __dma_sync(void *vaddr, size_t size, int direction);
>  extern void __dma_sync_page(struct page *page, unsigned long offset,
>size_t size, int direction);
> @@ -52,8 +49,6 @@ extern unsigned long __dma_get_coherent_pfn(unsigned long 
> cpu_addr);
>   * Cache coherent cores.
>   */
>  
> -#define __dma_alloc_coherent(dev, gfp, size, handle) NULL
> -#define __dma_free_coherent(size, addr)  ((void)0)
>  #define __dma_sync(addr, size, rw)   ((void)0)
>  #define __dma_sync_page(pg, off, sz, rw) ((void)0)
>  
> diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> index 3939589aab04..eceaa92e6986 100644
> --- a/arch/powerpc/kernel/dma.c
> +++ b/arch/powerpc/kernel/dma.c
> @@ -62,18 +62,12 @@ static int dma_nommu_dma_supported(struct device *dev, 
> u64 mask)
>  #endif
>  }
>  
> +#ifndef CONFIG_NOT_COHERENT_CACHE
>  void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> dma_addr_t *dma_handle, gfp_t flag,
> unsigned long attrs)
>  {
>   void *ret;
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - ret = __dma_alloc_coherent(dev, size, dma_handle, flag);
> - if (ret == NULL)
> - return NULL;
> - *dma_handle += get_dma_offset(dev);
> - return ret;
> -#else
>   struct page *page;
>   int node = dev_to_node(dev);
>  #ifdef CONFIG_FSL_SOC
> @@ -113,19 +107,15 @@ void *__dma_nommu_alloc_coherent(struct device *dev, 
> size_t size,
>   *dma_handle = __pa(ret) + get_dma_offset(dev);
>  
>   return ret;
> -#endif
>  }
>  
>  void __dma_nommu_free_coherent(struct device *dev, size_t size,
>   void *vaddr, dma_addr_t dma_handle,
>   unsigned long attrs)
>  {
> -#ifdef CONFIG_NOT_COHERENT_CACHE
> - __dma_free_coherent(size, vaddr);
> -#else
>   free_pages((unsigned long)vaddr, get_order(size));
> -#endif
>  }
> +#endif /* !CONFIG_NOT_COHERENT_CACHE */
>  
>  static void *dma_nommu_alloc_coherent(struct device *dev, size_t size,
>  dma_addr_t *dma_handle, gfp_t flag,
> diff --git a/arch/powerpc/mm/dma-noncoherent.c 
> b/arch/powerpc/mm/dma-noncoherent.c
> index d1c16456abac..cfc48a253707 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -29,7 +29,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  #include 
> @@ -151,8 +151,8 @@ static struct ppc_vm_region *ppc_vm_region_find(struct 
> ppc_vm_region *head, unsi
>   * Allocate DMA-coherent memory space and return both the kernel remapped
>   * virtual and bus address for that space.
>   */
> -void *
> -__dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *handle, 
> gfp_t gfp)
> +void *__dma_nommu_alloc_coherent(struct device *dev, size_t size,
> + dma_addr_t *dma_handle, gfp_t gfp, unsigned long attrs)
>  {
>   struct page *page;
>   struct ppc_vm_region *c;
> @@ -223,7 +223,7 @@ __dma_alloc_coherent(struct device *dev, size_t size, 
> dma_addr_t *handle, gfp_t
>   /*
>* Set the "dma handle"
>*/
> - *handle = page_to_phys(page);
> + *dma_handle = phys_to_dma(dev, page_to_phys(page));
>  
>   do {
>   SetPageReserved(page);
> @@ -249,12 +249,12 @@ __dma_alloc_coherent(struct device *dev, size_t size, 
> dma_addr_t *handle, gfp_t
>   no_page:
>   return NULL;
>  }
> -EXPORT_SYMBOL(__dma_alloc_coherent);
>  
>  /*
>   * free a page as defined by the above mapping.
>   */
> -void __dma_free_coherent(size_t size, void *vaddr)
> +void __dma_nommu_free_coherent(struct devi

Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> The requirement to disable local irqs over kmap_atomic is long gone,
> so remove those calls.

Really ? I'm trying to verify that and getting lost in a mess of macros
from hell in the per-cpu stuff but if you look at our implementation
of kmap_atomic_prot(), all it does is a preempt_disable(), and then
it uses kmap_atomic_idx_push():

int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;

Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
ie this is the non-interrupt safe version...
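
A generic illustration of the hazard being pointed at (hypothetical
interleaving, not the powerpc code): the double-underscore per-cpu ops are
plain read-modify-write sequences, so an interrupt that also runs
kmap_atomic() between the read and the write would have its increment lost,
whereas this_cpu_inc_return() is required to be interrupt-safe.

    #include <linux/highmem.h>
    #include <linux/percpu.h>

    /* Hypothetical expansion of the non-IRQ-safe index push. */
    static int racy_idx_push(void)
    {
            int idx = __this_cpu_read(__kmap_atomic_idx);   /* read */
            /* <-- an interrupt here may also bump __kmap_atomic_idx ... */
            __this_cpu_write(__kmap_atomic_idx, idx + 1);   /* ... and this
                                                             * write loses it */
            return idx;
    }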

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/mm/dma-noncoherent.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/dma-noncoherent.c 
> b/arch/powerpc/mm/dma-noncoherent.c
> index 382528475433..d1c16456abac 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -357,12 +357,10 @@ static inline void __dma_sync_page_highmem(struct page 
> *page,
>  {
>   size_t seg_size = min((size_t)(PAGE_SIZE - offset), size);
>   size_t cur_size = seg_size;
> - unsigned long flags, start, seg_offset = offset;
> + unsigned long start, seg_offset = offset;
>   int nr_segs = 1 + ((size - seg_size) + PAGE_SIZE - 1)/PAGE_SIZE;
>   int seg_nr = 0;
>  
> - local_irq_save(flags);
> -
>   do {
>   start = (unsigned long)kmap_atomic(page + seg_nr) + seg_offset;
>  
> @@ -378,8 +376,6 @@ static inline void __dma_sync_page_highmem(struct page 
> *page,
>   cur_size += seg_size;
>   seg_offset = 0;
>   } while (seg_nr < nr_segs);
> -
> - local_irq_restore(flags);
>  }
>  #endif /* CONFIG_HIGHMEM */
>  



Re: [PATCH 09/20] powerpc/dma: remove the unused ISA_DMA_THRESHOLD export

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 


> ---
>  arch/powerpc/kernel/setup_32.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
> index 74457485574b..3c2d093f74c7 100644
> --- a/arch/powerpc/kernel/setup_32.c
> +++ b/arch/powerpc/kernel/setup_32.c
> @@ -55,7 +55,6 @@ unsigned long ISA_DMA_THRESHOLD;
>  unsigned int DMA_MODE_READ;
>  unsigned int DMA_MODE_WRITE;
>  
> -EXPORT_SYMBOL(ISA_DMA_THRESHOLD);
>  EXPORT_SYMBOL(DMA_MODE_READ);
>  EXPORT_SYMBOL(DMA_MODE_WRITE);



Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-08 Thread Benjamin Herrenschmidt
On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> It turns out cxl actually uses it.  So for now skip this patch,
> although random code in drivers messing with dma ops will need to
> be sorted out sooner or later.

CXL devices are "special", they bypass the classic iommu in favor of
allowing the device to operate using the main processor page tables
using an MMU context (so basically the device can use userspace
addresses directly), akin to ATS.

I think the code currently uses the nommu ops as a way to do a simple
kernel mapping for kernel drivers using CXL (not userspace stuff)
though.
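
For illustration, the sort of thing such a kernel driver would do is pick the
direct/nommu ops and a linear offset on its struct device; a hypothetical
sketch, not a quote of the cxl code:

    /* Hypothetical: force plain kernel-virtual DMA on a device. */
    static void example_setup_kernel_dma(struct pci_dev *pdev)
    {
            set_dma_ops(&pdev->dev, &dma_nommu_ops);
            set_dma_offset(&pdev->dev, 0);  /* no bus offset in this example */
    }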

Ben.




Re: [PATCH 07/20] powerpc/dma: remove the unused ARCH_HAS_DMA_MMAP_COHERENT define

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 

Acked-by: Benjamin Herrenschmidt 

> ---
>  arch/powerpc/include/asm/dma-mapping.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> b/arch/powerpc/include/asm/dma-mapping.h
> index 8fa394520af6..f2a4a7142b1e 100644
> --- a/arch/powerpc/include/asm/dma-mapping.h
> +++ b/arch/powerpc/include/asm/dma-mapping.h
> @@ -112,7 +112,5 @@ extern int dma_set_mask(struct device *dev, u64 dma_mask);
>  
>  extern u64 __dma_get_required_mask(struct device *dev);
>  
> -#define ARCH_HAS_DMA_MMAP_COHERENT
> -
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_DMA_MAPPING_H */



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> We need to take the DMA offset and encryption bit into account when selecting
> a zone.  Add a helper that takes those into account and use it.

That whole "encryption" stuff seems to be completely specific to the
way x86 does memory encryption, or am I mistaken ? It's not clear to me
what that does in practice and how it relates to DMA mappings.

I'm also not sure about that whole business with ZONE_DMA and
ARCH_ZONE_DMA_BITS...

On ppc64, unless you enable swiotlb (which we only do currently on
some embedded platforms), you have all of memory in ZONE_DMA.

[0.00] Zone ranges:
[0.00]   DMA  [mem 0x-0x001f]
[0.00]   DMA32    empty
[0.00]   Normal   empty
[0.00]   Device   empty

I'm not sure how this will work with that dma direct code.

I also see a number of tests against a 64-bit mask rather than the
top of memory...

Ben.

> Signed-off-by: Christoph Hellwig 
> ---
>  kernel/dma/direct.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index d32d4f0d2c0c..c2c1df8827f2 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -58,6 +58,14 @@ static bool dma_coherent_ok(struct device *dev, 
> phys_addr_t phys, size_t size)
>   return addr + size - 1 <= dev->coherent_dma_mask;
>  }
>  
> +static bool dma_coherent_below(struct device *dev, u64 mask)
> +{
> + dma_addr_t addr = force_dma_unencrypted() ?
> + __phys_to_dma(dev, mask) : phys_to_dma(dev, mask);
> +
> + return dev->coherent_dma_mask <= addr;
> +}
> +
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t 
> *dma_handle,
>   gfp_t gfp, unsigned long attrs)
>  {
> @@ -70,9 +78,9 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   gfp &= ~__GFP_ZERO;
>  
>   /* GFP_DMA32 and GFP_DMA are no ops without the corresponding zones: */
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
>   gfp |= GFP_DMA;
> - if (dev->coherent_dma_mask <= DMA_BIT_MASK(32) && !(gfp & GFP_DMA))
> + if (dma_coherent_below(dev, DMA_BIT_MASK(32) && !(gfp & GFP_DMA)))
>   gfp |= GFP_DMA32;
>  
>  again:
> @@ -92,14 +100,14 @@ void *dma_direct_alloc(struct device *dev, size_t size, 
> dma_addr_t *dma_handle,
>   page = NULL;
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA32) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(64) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(64)) &&
>   !(gfp & (GFP_DMA32 | GFP_DMA))) {
>   gfp |= GFP_DMA32;
>   goto again;
>   }
>  
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
> - dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
> + dma_coherent_below(dev, DMA_BIT_MASK(32)) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;



Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-08 Thread Benjamin Herrenschmidt
On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> When a device has a DMA offset the dma capable result will change due
> to the difference between the physical and DMA address.  Take that into
> account.

The patch in itself makes sense.

However, there are a number of things in that dma_direct.c file that I
don't quite get:

 - looking more generally at what that function does, I worry about the
switch of ppc32 to this later on:

We do have the occasional device with things like a 31-bit DMA
limitation. We know they happen to work because those systems
can't have enough memory to be a problem. This is why our current
DMA direct ops in powerpc just unconditionally return true on ppc32.

The test against a full 32-bit mask here will break them I think.

Thing is, I'm not sure I still have access to one of these things
to test, I'll have to dig (from memory things like b43 wifi).

Also those platforms don't have an iommu.

 - What is this trying to achieve ?

/*
 * Various PCI/PCIe bridges have broken support for > 32bit DMA even
 * if the device itself might support it.
 */
if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
return 0;

IE, if the device has a 32-bit limit, we fail an attempt at checking
if a >32-bit mask works ? That doesn't quite seem to be the right thing
to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?

IE, dma_set_mask() is what a driver uses to establish the device capability,
so it makes sense to have dma_32bit_limit just reduce that capability, not
fail because the device can do more than what the bridge can (a sketch of
that alternative follows below).

Sorry if I'm a bit confused here.

 - How is that file supposed to work on 64-bit platforms ? From what I can
tell, dma_supported() will unconditionally return true if the mask is
32-bit or larger (apart from the above issue). This doesn't look right,
the mask needs to be compared to the max memory address. There are a bunch
of devices out there with masks anywhere between 40 and 64 bits, and
some of these will not work "out of the box" if the offseted top
of memory is beyond the mask limit. Or am I missing something ?
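
A sketch of the clamping alternative mentioned in the second point above
(illustrative only, not actual kernel code):

    /* Illustrative: reduce the requested mask to the bridge limit instead of
     * failing the dma_supported() check outright. */
    int dma_set_mask(struct device *dev, u64 mask)
    {
            if (dev->dma_32bit_limit)
                    mask = min_t(u64, mask, DMA_BIT_MASK(32));

            if (!dev->dma_mask || !dma_supported(dev, mask))
                    return -EIO;

            *dev->dma_mask = mask;
            return 0;
    }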

Cheers,
Ben.

> Signed-off-by: Christoph Hellwig 

Reviewed-by: Benjamin Herrenschmidt 

> ---
>  kernel/dma/direct.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 8be8106270c2..d32d4f0d2c0c 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -167,7 +167,7 @@ int dma_direct_map_sg(struct device *dev, struct 
> scatterlist *sgl, int nents,
>  int dma_direct_supported(struct device *dev, u64 mask)
>  {
>  #ifdef CONFIG_ZONE_DMA
> - if (mask < DMA_BIT_MASK(ARCH_ZONE_DMA_BITS))
> + if (mask < phys_to_dma(dev, DMA_BIT_MASK(ARCH_ZONE_DMA_BITS)))
>   return 0;
>  #else
>   /*
> @@ -176,14 +176,14 @@ int dma_direct_supported(struct device *dev, u64 mask)
>* memory, or by providing a ZONE_DMA32.  If neither is the case, the
>* architecture needs to use an IOMMU instead of the direct mapping.
>*/
> - if (mask < DMA_BIT_MASK(32))
> + if (mask < phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
>  #endif
>   /*
>* Various PCI/PCIe bridges have broken support for > 32bit DMA even
>* if the device itself might support it.
>*/
> - if (dev->dma_32bit_limit && mask > DMA_BIT_MASK(32))
> + if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
>   return 1;
>  }



Re: [v2, 1/6] powerpc/pm: Fix suspend=n in menuconfig for e500mc platforms.

2018-08-08 Thread Scott Wood
On Wed, Apr 11, 2018 at 02:35:46PM +0800, Ran Wang wrote:
> Also, unselect FSL_PMC which is for older platforms instead.
> 
> Signed-off-by: Ran Wang 
> ---
> Changes in v2:
>   - no change
> 
>  arch/powerpc/Kconfig |4 +---
>  1 files changed, 1 insertions(+), 3 deletions(-)

It's not "fixing" anything, but rather enabling something that was
previously unsupported -- and that should happen after the code that
adds support.

-Scott


Re: [2/3] powerpc/dts/fsl: t4240rdb: use the Cortina PHY driver compatible

2018-08-08 Thread Scott Wood
On Wed, Jul 18, 2018 at 02:46:50PM +0300, Camelia Groza wrote:
> The Cortina PHY requires the use of the dedicated Cortina PHY driver
> instead of the generic one.
> 
> Signed-off-by: Camelia Groza 
> ---
>  arch/powerpc/boot/dts/fsl/t4240rdb.dts | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)

Applied with the changelog updated to talk about hardware compatibility
("The Cortina PHY is not compatible with IEEE 802.3 clause 45") rather
than driver choice.

-Scott


Re: [PATCH 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-08 Thread Nicholas Piggin
On Wed, 8 Aug 2018 22:22:52 +0200
Michal Suchánek  wrote:

> On Fri,  3 Aug 2018 14:13:49 +1000
> Nicholas Piggin  wrote:
> 
> > The machine check code that flushes and restores bolted segments in
> > real mode belongs in mm/slb.c. This will be used by pseries machine
> > check and idle code.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
> >  arch/powerpc/kernel/mce_power.c   | 21 ++
> >  arch/powerpc/mm/slb.c | 38 +++
> >  3 files changed, 44 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> > b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index
> > 2f74bdc805e0..d4e398185b3a 100644 ---
> > a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++
> > b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -497,6 +497,9 @@
> > extern void hpte_init_native(void); 
> >  extern void slb_initialize(void);
> >  extern void slb_flush_and_rebolt(void);
> > +extern void slb_flush_all_realmode(void);
> > +extern void __slb_restore_bolted_realmode(void);
> > +extern void slb_restore_bolted_realmode(void);
> >  
> >  extern void slb_vmalloc_update(void);
> >  extern void slb_set_size(u16 size);
> > diff --git a/arch/powerpc/kernel/mce_power.c
> > b/arch/powerpc/kernel/mce_power.c index d6756af6ec78..50f7b9817246
> > 100644 --- a/arch/powerpc/kernel/mce_power.c
> > +++ b/arch/powerpc/kernel/mce_power.c
> > @@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs
> > *regs, unsigned long addr) #ifdef CONFIG_PPC_BOOK3S_64
> >  static void flush_and_reload_slb(void)
> >  {
> > -   struct slb_shadow *slb;
> > -   unsigned long i, n;
> > -
> > /* Invalidate all SLBs */
> > -   asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> > +   slb_flush_all_realmode();
> >  
> >  #ifdef CONFIG_KVM_BOOK3S_HANDLER
> > /*
> > @@ -76,22 +73,10 @@ static void flush_and_reload_slb(void)
> > if (get_paca()->kvm_hstate.in_guest)
> > return;
> >  #endif
> > -
> > -   /* For host kernel, reload the SLBs from shadow SLB buffer.
> > */
> > -   slb = get_slb_shadow();
> > -   if (!slb)
> > +   if (early_radix_enabled())
> > return;  
> 
> And we lose the check that the shadow slb exists. Is !slb equivalent to
> early_radix_enabled()?

Yeah, pretty close.

> 
> >  
> > -   n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
> > -
> > -   /* Load up the SLB entries from shadow SLB */
> > -   for (i = 0; i < n; i++) {
> > -   unsigned long rb =
> > be64_to_cpu(slb->save_area[i].esid);
> > -   unsigned long rs =
> > be64_to_cpu(slb->save_area[i].vsid); -
> > -   rb = (rb & ~0xFFFul) | i;
> > -   asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
> > -   }
> > +   slb_restore_bolted_realmode();
> >  }
> >  #endif
> >  
> > diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> > index cb796724a6fc..136db8652577 100644
> > --- a/arch/powerpc/mm/slb.c
> > +++ b/arch/powerpc/mm/slb.c
> > @@ -90,6 +90,44 @@ static inline void create_shadowed_slbe(unsigned
> > long ea, int ssize, : "memory" );
> >  }
> >  
> > +/*
> > + * Insert bolted entries into SLB (which may not be empty).
> > + */
> > +void __slb_restore_bolted_realmode(void)
> > +{
> > +   struct slb_shadow *p = get_slb_shadow();
> > +   enum slb_index index;  
> 
> or can we get here at some point when shadow slb is not populated?

We shouldn't because we won't turn the MMU on so we shouldn't get SLB
MCEs... But I don't think that's guaranteed anywhere, so yeah wouldn't
hurt to add that check back in.

I'll send out another revision.
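
For reference, restoring the check would amount to an early return in the new
helper; a sketch only, not the actual follow-up revision:

    /* Sketch: bail out if the shadow SLB buffer was never populated. */
    void __slb_restore_bolted_realmode(void)
    {
            struct slb_shadow *p = get_slb_shadow();
            enum slb_index index;

            if (!p)
                    return;

            /* No isync needed because realmode. */
            for (index = 0; index < SLB_NUM_BOLTED; index++) {
                    asm volatile("slbmte  %0,%1" :
                                 : "r" (be64_to_cpu(p->save_area[index].vsid)),
                                   "r" (be64_to_cpu(p->save_area[index].esid)));
            }
    }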

Thanks,
Nick


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Wed, 2018-08-08 at 23:31 +0300, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2018 at 11:18:13PM +1000, Benjamin Herrenschmidt wrote:
> > Sure, but all of this is just the configuration of the iommu. But I
> > think we agree here, and your point remains valid, indeed my proposed
> > hack:
> > 
> > >   if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> > 
> > Will only work if the IOMMU and non-IOMMU path are completely equivalent.
> > 
> > We can provide that guarantee for our secure VM case, but not generally so 
> > if
> > we were to go down the route of a quirk in virtio, it might be better to
> > make it painfully obvious that it's specific to that one case with a 
> > different
> > kind of turd:
> > 
> > -   if (xen_domain())
> > +   if (xen_domain() || pseries_secure_vm())
> > return true;
> 
> I don't think it's pseries specific actually. E.g. I suspect AMD SEV
> might benefit from the same kind of hack.

As long as they can provide the same guarantee that the DMA ops are
completely equivalent between virtio and other PCI devices, at least on
the same bus, ie, we don't have to go hack special DMA ops.

I think the latter is really what Christoph wants to avoid for good
reasons.
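
Spelled out, the quirk from approach 1 in the summary quoted below would live
in virtio's vring_use_dma_api(); a sketch only, where pseries_secure_vm() is
the hypothetical predicate from the quoted proposal:

    /* Sketch of the option-1 quirk; not actual virtio code. */
    static bool vring_use_dma_api(struct virtio_device *vdev)
    {
            if (!virtio_has_iommu_quirk(vdev))
                    return true;

            /* Otherwise we are left to guess whether the DMA API is safe. */
            if (xen_domain() || pseries_secure_vm())
                    return true;

            return false;
    }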

> > So to summarize, and make sure I'm not missing something, the two approaches
> > at hand are either:
> > 
> >  1- The above, which is a one liner and contained in the guest, so that's 
> > nice, but
> > also means another turd in virtio which isn't ...
> > 
> >  2- We force pseries to always set VIRTIO_F_IOMMU_PLATFORM, but with the 
> > current
> > architecture on our side that will force virtio to always go through an 
> > emulated
> > iommu, as pseries doesn't have the concept of a real bypass window, and 
> > thus will
> > impact performance for both secure and non-secure VMs.
> > 
> >  3- Invent a property that can be put in selected PCI device tree nodes that
> > indicates that for that device specifically, the iommu can be bypassed, 
> > along with
> > a hypercall to turn that bypass on/off. Virtio would then use 
> > VIRTIO_F_IOMMU_PLATFORM
> > but its DT nodes would also have that property and Linux would notice it 
> > and turn
> > bypass on.
> 
> For completeness, virtio could also have its own bounce buffer
> outside of DMA API one. I don't see lots of benefits to this
> though.

Not fan of that either...

> > The resulting properties of those options are:
> > 
> > 1- Is what I want because it's the simplest, provides the best performance 
> > now,
> >and works without code changes to qemu or non-secure Linux. However it 
> > does
> >add a tiny turd to virtio which is annoying.
> > 
> > 2- This works but it puts the iommu in the way always, thus reducing virtio 
> > performance
> >across the board for pseries unless we only do that for secure VMs but 
> > that is
> >difficult (as discussed earlier).
> > 
> > 3- This would recover the performance lost in -2-, however it requires qemu 
> > *and*
> >guest changes. Specifically, existing guests (RHEL 7 etc...) would get 
> > the
> >performance hit of -2- unless modified to call that 'enable bypass' 
> > call, which
> >isn't great.
> > 
> > So imho we have to choose one of 3 not-great solutions here... Unless I 
> > missed
> > something in your ideas of course.
> > 



Re: [PATCH] powerpc: fix size calculation using resource_size()

2018-08-08 Thread Tyrel Datwyler
On 08/08/2018 04:57 AM, Dan Carpenter wrote:
> The problem is that the calculation should be "end - start + 1" but the
> plus one is missing in this calculation.
> 
> Fixes: 8626816e905e ("powerpc: add support for MPIC message register API")
> Signed-off-by: Dan Carpenter 
> ---
> Static analysis.  Not tested.

Looks sane to me.

Reviewed-by: Tyrel Datwyler 

> 
> diff --git a/arch/powerpc/sysdev/mpic_msgr.c b/arch/powerpc/sysdev/mpic_msgr.c
> index eb69a5186243..280e964e1aa8 100644
> --- a/arch/powerpc/sysdev/mpic_msgr.c
> +++ b/arch/powerpc/sysdev/mpic_msgr.c
> @@ -196,7 +196,7 @@ static int mpic_msgr_probe(struct platform_device *dev)
> 
>   /* IO map the message register block. */
>   of_address_to_resource(np, 0, &rsrc);
> - msgr_block_addr = ioremap(rsrc.start, rsrc.end - rsrc.start);
> + msgr_block_addr = ioremap(rsrc.start, resource_size(&rsrc));
>   if (!msgr_block_addr) {
>   dev_err(&dev->dev, "Failed to iomap MPIC message registers");
>   return -EFAULT;
> 
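
For context, resource_size() in include/linux/ioport.h computes the inclusive
length, which is exactly the "+ 1" the old expression was missing:

    static inline resource_size_t resource_size(const struct resource *res)
    {
            return res->end - res->start + 1;
    }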



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Michael S. Tsirkin
On Wed, Aug 08, 2018 at 11:18:13PM +1000, Benjamin Herrenschmidt wrote:
> Sure, but all of this is just the configuration of the iommu. But I
> think we agree here, and your point remains valid, indeed my proposed
> hack:
> 
> >   if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> 
> Will only work if the IOMMU and non-IOMMU path are completely equivalent.
> 
> We can provide that guarantee for our secure VM case, but not generally so if
> we were to go down the route of a quirk in virtio, it might be better to
> make it painfully obvious that it's specific to that one case with a different
> kind of turd:
> 
> - if (xen_domain())
> + if (xen_domain() || pseries_secure_vm())
>   return true;

I don't think it's pseries specific actually. E.g. I suspect AMD SEV
might benefit from the same kind of hack.


> So to summarize, and make sure I'm not missing something, the two approaches
> at hand are either:
> 
>  1- The above, which is a one liner and contained in the guest, so that's 
> nice, but
> also means another turd in virtio which isn't ...
> 
>  2- We force pseries to always set VIRTIO_F_IOMMU_PLATFORM, but with the 
> current
> architecture on our side that will force virtio to always go through an 
> emulated
> iommu, as pseries doesn't have the concept of a real bypass window, and thus 
> will
> impact performance for both secure and non-secure VMs.
> 
>  3- Invent a property that can be put in selected PCI device tree nodes that
> indicates that for that device specifically, the iommu can be bypassed, along 
> with
> a hypercall to turn that bypass on/off. Virtio would then use 
> VIRTIO_F_IOMMU_PLATFORM
> but its DT nodes would also have that property and Linux would notice it and 
> turn
> bypass on.

For completeness, virtio could also have its own bounce buffer
outside of DMA API one. I don't see lots of benefits to this
though.


> The resulting properties of those options are:
> 
> 1- Is what I want because it's the simplest, provides the best performance 
> now,
>and works without code changes to qemu or non-secure Linux. However it does
>add a tiny turd to virtio which is annoying.
> 
> 2- This works but it puts the iommu in the way always, thus reducing virtio 
> performance
>across the board for pseries unless we only do that for secure VMs but 
> that is
>difficult (as discussed earlier).
> 
> 3- This would recover the performance lost in -2-, however it requires qemu 
> *and*
>guest changes. Specifically, existing guests (RHEL 7 etc...) would get the
>performance hit of -2- unless modified to call that 'enable bypass' call, 
> which
>isn't great.
> 
> So imho we have to choose one of 3 not-great solutions here... Unless I missed
> something in your ideas of course.
> 
> Cheers,
> Ben.
> 
> 


Re: [PATCH 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c

2018-08-08 Thread Michal Suchánek
On Fri,  3 Aug 2018 14:13:49 +1000
Nicholas Piggin  wrote:

> The machine check code that flushes and restores bolted segments in
> real mode belongs in mm/slb.c. This will be used by pseries machine
> check and idle code.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
>  arch/powerpc/kernel/mce_power.c   | 21 ++
>  arch/powerpc/mm/slb.c | 38 +++
>  3 files changed, 44 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index
> 2f74bdc805e0..d4e398185b3a 100644 ---
> a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++
> b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -497,6 +497,9 @@
> extern void hpte_init_native(void); 
>  extern void slb_initialize(void);
>  extern void slb_flush_and_rebolt(void);
> +extern void slb_flush_all_realmode(void);
> +extern void __slb_restore_bolted_realmode(void);
> +extern void slb_restore_bolted_realmode(void);
>  
>  extern void slb_vmalloc_update(void);
>  extern void slb_set_size(u16 size);
> diff --git a/arch/powerpc/kernel/mce_power.c
> b/arch/powerpc/kernel/mce_power.c index d6756af6ec78..50f7b9817246
> 100644 --- a/arch/powerpc/kernel/mce_power.c
> +++ b/arch/powerpc/kernel/mce_power.c
> @@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs
> *regs, unsigned long addr) #ifdef CONFIG_PPC_BOOK3S_64
>  static void flush_and_reload_slb(void)
>  {
> - struct slb_shadow *slb;
> - unsigned long i, n;
> -
>   /* Invalidate all SLBs */
> - asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> + slb_flush_all_realmode();
>  
>  #ifdef CONFIG_KVM_BOOK3S_HANDLER
>   /*
> @@ -76,22 +73,10 @@ static void flush_and_reload_slb(void)
>   if (get_paca()->kvm_hstate.in_guest)
>   return;
>  #endif
> -
> - /* For host kernel, reload the SLBs from shadow SLB buffer.
> */
> - slb = get_slb_shadow();
> - if (!slb)
> + if (early_radix_enabled())
>   return;

And we lose the check that the shadow slb exists. Is !slb equivalent to
early_radix_enabled()?

>  
> - n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
> -
> - /* Load up the SLB entries from shadow SLB */
> - for (i = 0; i < n; i++) {
> - unsigned long rb =
> be64_to_cpu(slb->save_area[i].esid);
> - unsigned long rs =
> be64_to_cpu(slb->save_area[i].vsid); -
> - rb = (rb & ~0xFFFul) | i;
> - asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
> - }
> + slb_restore_bolted_realmode();
>  }
>  #endif
>  
> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> index cb796724a6fc..136db8652577 100644
> --- a/arch/powerpc/mm/slb.c
> +++ b/arch/powerpc/mm/slb.c
> @@ -90,6 +90,44 @@ static inline void create_shadowed_slbe(unsigned
> long ea, int ssize, : "memory" );
>  }
>  
> +/*
> + * Insert bolted entries into SLB (which may not be empty).
> + */
> +void __slb_restore_bolted_realmode(void)
> +{
> + struct slb_shadow *p = get_slb_shadow();
> + enum slb_index index;

or can we get here at some point when shadow slb is not populated?

Thanks

Michal

> +
> +  /* No isync needed because realmode. */
> + for (index = 0; index < SLB_NUM_BOLTED; index++) {
> + asm volatile("slbmte  %0,%1" :
> +  : "r" (be64_to_cpu(p->save_area[index].vsid)),
> +"r" (be64_to_cpu(p->save_area[index].esid)));
> + }
> +}
> +
> +/*
> + * Insert bolted entries into an empty SLB.
> + * This is not the same as rebolt because the bolted segments
> + * (e.g., kstack) are not changed (rebolted).
> + */
> +void slb_restore_bolted_realmode(void)
> +{
> + __slb_restore_bolted_realmode();
> + get_paca()->slb_cache_ptr = 0;
> +}
> +
> +/*
> + * This flushes all SLB entries including 0, so it must be realmode.
> + */
> +void slb_flush_all_realmode(void)
> +{
> + /*
> +  * This flushes all SLB entries including 0, so it must be
> realmode.
> +  */
> + asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> +}
> +
>  static void __slb_flush_and_rebolt(void)
>  {
>   /* If you change this make sure you change SLB_NUM_BOLTED



Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020

2018-08-08 Thread Scott Wood
On Wed, 2018-08-08 at 06:28 +, Bharat Bhushan wrote:
> > -Original Message-
> > From: Scott Wood [mailto:o...@buserror.net]
> > Sent: Wednesday, August 8, 2018 11:26 AM
> > To: Bharat Bhushan ;
> > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > ker...@vger.kernel.org
> > Cc: r...@kernel.org; keesc...@chromium.org; tyr...@linux.vnet.ibm.com;
> > j...@perches.com
> > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for P2020
> > 
> > On Wed, 2018-08-08 at 03:44 +, Bharat Bhushan wrote:
> > > > -Original Message-
> > > > From: Scott Wood [mailto:o...@buserror.net]
> > > > Sent: Wednesday, August 8, 2018 2:44 AM
> > > > To: Bharat Bhushan ;
> > > > b...@kernel.crashing.org; pau...@samba.org; m...@ellerman.id.au;
> > > > ga...@kernel.crashing.org; mark.rutl...@arm.com;
> > > > kstew...@linuxfoundation.org; gre...@linuxfoundation.org;
> > > > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; linux-
> > > > ker...@vger.kernel.org
> > > > Cc: r...@kernel.org; keesc...@chromium.org;
> > > > tyr...@linux.vnet.ibm.com; j...@perches.com
> > > > Subject: Re: [RFC 5/5] powerpc/fsl: Add supported-irq-ranges for
> > > > P2020
> > > > 
> > > > On Fri, 2018-07-27 at 15:18 +0530, Bharat Bhushan wrote:
> > > > > MPIC on NXP (Freescale) P2020 supports following irq
> > > > > ranges:
> > > > >   > 0 - 11  (External interrupt)
> > > > >   > 16 - 79 (Internal interrupt)
> > > > >   > 176 - 183   (Messaging interrupt)
> > > > >   > 224 - 231   (Shared message signaled interrupt)
> > > > 
> > > > Why don't you convert to the 4-cell interrupt specifiers that make
> > > > dealing with these ranges less error-prone?
> > > 
> > > Ok , will do if we agree to have this series as per comment on other
> > > patch.
> > 
> > If you're concerned with errors, this would be a good things to do
> > regardless.
> >  Actually, it seems that p2020si-post.dtsi already uses 4-cell interrupts.
> > 
> > What is motivating this patchset?  Is there something wrong in the
> > existing
> > dts files?
> 
> There is no error in the device tree. The main motivation is to improve the
> code for the following reasons:
>   - While studying the code it was found that if a reserved irq number is used
> then there is no check in the driver; the irq will be configured as if it were
> valid and the interrupt will never fire.

Again, a wrong interrupt number won't fire, whether an interrupt by that
number exists or not.  I wouldn't mind a sanity check in the driver if the
programming model made it properly discoverable, but I don't think it's worth
messing with device trees just for this (and even less so given that there
don't seem to be new chips coming out that this would be relevant for).

> > > One other confusing observation I have is that "irq_count" from
> > > platform code is given precedence over "last-interrupt-source" in
> > > device-tree.
> > > Should not the device tree have precedence? Otherwise there is no
> > > point in using "last-interrupt-source" if platform code passes
> > > "irq_count" in mpic_alloc().
> > 
> > Maybe, though I don't think it matters much given that last-interrupt-source
> > was only added to avoid having to pass irq_count in platform code.
> 
> Thanks for clarifying;
> 
> My understanding was that "last-interrupt-source" was added to ensure that we
> can override the value passed from platform code. In that case we do not need
> to change code and can control it from the device tree.

The changelog says, "To avoid needing to write custom board-specific code to
detect that scenario, allow it to be easily overridden in the device-tree,"
where "it" means the value provided by hardware.  The goal was to pass in 256
without board code in the kernel, not to override the 256.

-Scott



Re: [PATCH 1/5] powerpc/asm: Add a patch_site macro & helpers for patching instructions

2018-08-08 Thread Christophe LEROY




Le 23/07/2018 à 17:07, Michael Ellerman a écrit :

Add a macro and some helper C functions for patching single asm
instructions.

The gas macro means we can do something like:

   1:   nop
patch_site 1b, patch__foo

Which is less visually distracting than defining a GLOBAL symbol at 1,
and also doesn't pollute the symbol table which can confuse eg. perf.

These are obviously similar to our existing feature sections, but are
not automatically patched based on CPU/MMU features, rather they are
designed to be manually patched by C code at some arbitrary point.
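
As an illustration of how such a site is then consumed from C, usage would look
roughly like the following (a sketch only; patch__foo is the name from the
example above, and PPC_INST_NOP / the branch target are merely stand-ins):

  /* The macro emits a self-relative s32 offset named patch__foo in .rodata,
   * pointing back at the 1: nop above. */
  extern s32 patch__foo;

  /* Later, at some arbitrary point, C code decides what the site holds: */
  patch_instruction_site(&patch__foo, PPC_INST_NOP);            /* keep it a nop */
  patch_branch_site(&patch__foo, (unsigned long)&some_func, 0); /* or branch out */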

Signed-off-by: Michael Ellerman 
---
  arch/powerpc/include/asm/code-patching-asm.h | 18 ++
  arch/powerpc/include/asm/code-patching.h |  2 ++
  arch/powerpc/lib/code-patching.c | 16 
  3 files changed, 36 insertions(+)
  create mode 100644 arch/powerpc/include/asm/code-patching-asm.h

diff --git a/arch/powerpc/include/asm/code-patching-asm.h 
b/arch/powerpc/include/asm/code-patching-asm.h
new file mode 100644
index ..ed7b1448493a
--- /dev/null
+++ b/arch/powerpc/include/asm/code-patching-asm.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * Copyright 2018, Michael Ellerman, IBM Corporation.
+ */
+#ifndef _ASM_POWERPC_CODE_PATCHING_ASM_H
+#define _ASM_POWERPC_CODE_PATCHING_ASM_H
+
+/* Define a "site" that can be patched */
+.macro patch_site label name
+   .pushsection ".rodata"
+   .balign 4
+   .global \name
+\name:
+   .4byte  \label - .
+   .popsection
+.endm
+
+#endif /* _ASM_POWERPC_CODE_PATCHING_ASM_H */
diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index 812535f40124..b2051234ada8 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -32,6 +32,8 @@ unsigned int create_cond_branch(const unsigned int *addr,
  int patch_branch(unsigned int *addr, unsigned long target, int flags);
  int patch_instruction(unsigned int *addr, unsigned int instr);
  int raw_patch_instruction(unsigned int *addr, unsigned int instr);
+int patch_instruction_site(s32 *addr, unsigned int instr);
+int patch_branch_site(s32 *site, unsigned long target, int flags);


Why use s32* instead of unsigned int* as usual for pointer to code ?

Christophe

  
  int instr_is_relative_branch(unsigned int instr);

  int instr_is_relative_link_branch(unsigned int instr);
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index e0d881ab304e..850f3b8f4da5 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -195,6 +195,22 @@ int patch_branch(unsigned int *addr, unsigned long target, 
int flags)
return patch_instruction(addr, create_branch(addr, target, flags));
  }
  
+int patch_branch_site(s32 *site, unsigned long target, int flags)

+{
+   unsigned int *addr;
+
+   addr = (unsigned int *)((unsigned long)site + *site);
+   return patch_instruction(addr, create_branch(addr, target, flags));
+}
+
+int patch_instruction_site(s32 *site, unsigned int instr)
+{
+   unsigned int *addr;
+
+   addr = (unsigned int *)((unsigned long)site + *site);
+   return patch_instruction(addr, instr);
+}
+
  bool is_offset_in_branch_range(long offset)
  {
/*



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-08 Thread Michal Suchánek
On Wed, 8 Aug 2018 21:07:11 +0530
"Aneesh Kumar K.V"  wrote:

> On 08/08/2018 08:26 PM, Michael Ellerman wrote:
> > Mahesh J Salgaonkar  writes:  
> >> From: Mahesh Salgaonkar 
> >>
> >> Introduce recovery action for recovered memory errors (MCEs).
> >> There are soft memory errors like SLB Multihit, which can be a
> >> result of a bad hardware OR software BUG. Kernel can easily
> >> recover from these soft errors by flushing SLB contents. After the
> >> recovery kernel can still continue to function without any issue.
> >> But in some scenarios we may keep getting these soft errors until
> >> the root cause is fixed. To be able to analyze and find the root
> >> cause, the best way is to gather enough data and system state at the
> >> time of the MCE. Hence this patch introduces a sysctl knob where the user
> >> can decide either to continue after recovery or panic the kernel
> >> to capture the dump.  
> > 
> > I'm not convinced we want this.
> > 
> > As we've discovered it's often not possible to reconstruct what
> > happened based on a dump anyway.
> > 
> > The key thing you need is the content of the SLB and that's not
> > included in a dump.
> > 
> > So I think we should dump the SLB content when we get the MCE (which
> > this series does) and any other useful info, and then if we can
> > recover we should.
> >   
> 
> The reasoning there is: what if we got a multi-hit due to some
> corruption in slb_cache_ptr, i.e. some part of the kernel is wrongly
> updating the paca data structure through a bad pointer. That is
> far-fetched, but still possible, right? Hence the idea that, if we
> don't have much insight into why an SLB multi-hit occurred from the
> dmesg (which includes the SLB contents, slb_cache contents, etc.),
> there should be an easy way to force a dump that might assist in
> further debugging.

Nonetheless this turns all MCEs into crashes. Are there any MCEs that
could happen during normal operation and should be handled by default?

Thanks

Michal


Re: [PATCH v2] powerpc/tm: Print 64-bits MSR

2018-08-08 Thread Breno Leitao
Hi Leroy,

On 08/07/2018 03:57 PM, LEROY Christophe wrote:
> Breno Leitao  a écrit :

>> On 08/07/2018 02:15 PM, Christophe LEROY wrote:
>>> Le 07/08/2018 à 15:35, Breno Leitao a écrit :

>>> I think it would be better to change 'reason' to 'unsigned long' instead of
>>> replacing it by regs->msr for the printk.
>>
>> That was my initial approach, but this code seems to run on 32-bit systems,
>> and I do not want to change the whole 'reason' bit width without having a
>> 32-bit system to test on, at least.
> 
> But 'unsigned long' is still 32 bits on ppc32, so it makes no difference
> compared to 'unsigned int'.
> And I will test it for you if needed

Cool, I really appreciate it, and I would definitely need it once I have a
more intrusive HTM patchset I am working on.

Regarding this one, I think the change is so simple as-is that  I would
prefer to continue with the v2 patch, if you do not mind.

Thank you!


Re: [PATCH v02] powerpc/mobility: Fix node detach/rename problem

2018-08-08 Thread Michael Bringmann
I will update the header files 'of_private.h' and 'of.h' and repost.

Michael

On 08/08/2018 10:37 AM, Michael Bringmann wrote:
> On 08/08/2018 09:02 AM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>>> diff --git a/arch/powerpc/platforms/pseries/mobility.c 
>>> b/arch/powerpc/platforms/pseries/mobility.c
>>> index e245a88..efc9442 100644
>>> --- a/arch/powerpc/platforms/pseries/mobility.c
>>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>>> @@ -22,6 +22,9 @@
>>>  #include 
>>>  #include "pseries.h"
>>>  
>>> +extern int of_free_phandle_cache(void);
>>> +extern void of_populate_phandle_cache(void);
>>
>> We don't do that, they should be in a header.
>>
>> But that's a minor problem given that the patch doesn't compile, because
>> both those functions are static.
> 
> I am building against the latest 'linux-ppc' kernel. It includes the patch
> 
> Commit b9952b5218added5577e4a3443969bc20884cea9 Mon Sep 17 00:00:00 2001
> From: Frank Rowand 
> Date: Thu, 12 Jul 2018 14:00:07 -0700
> Subject: of: overlay: update phandle cache on overlay apply and remove
> 
> which makes the functions static.  I will rebuild and test with an
> earlier version if you will specify which one.
> 
>>
>> Presumably you have a hack in your tree to make them non-static?
>> Please try and compile your patches in a clean tree before sending.
>>
>> cheers
> 
> Regards,
> Michael
> 

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH v02] powerpc/mobility: Fix node detach/rename problem

2018-08-08 Thread Michael Bringmann
On 08/08/2018 09:02 AM, Michael Ellerman wrote:
> Michael Bringmann  writes:
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c 
>> b/arch/powerpc/platforms/pseries/mobility.c
>> index e245a88..efc9442 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -22,6 +22,9 @@
>>  #include 
>>  #include "pseries.h"
>>  
>> +extern int of_free_phandle_cache(void);
>> +extern void of_populate_phandle_cache(void);
> 
> We don't do that, they should be in a header.
> 
> But that's a minor problem given that the patch doesn't compile, because
> both those functions are static.

I am building against the latest 'linux-ppc' kernel. It includes the patch

Commit b9952b5218added5577e4a3443969bc20884cea9 Mon Sep 17 00:00:00 2001
From: Frank Rowand 
Date: Thu, 12 Jul 2018 14:00:07 -0700
Subject: of: overlay: update phandle cache on overlay apply and remove

which makes the functions static.  I will rebuild and test with an
earlier version if you will specify which one.

> 
> Presumably you have a hack in your tree to make them non-static?
> Please try and compile your patches in a clean tree before sending.
> 
> cheers

Regards,
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:   (512) 466-0650
m...@linux.vnet.ibm.com



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-08 Thread Aneesh Kumar K.V

On 08/08/2018 08:26 PM, Michael Ellerman wrote:

Mahesh J Salgaonkar  writes:

From: Mahesh Salgaonkar 

Introduce recovery action for recovered memory errors (MCEs). There are
soft memory errors like SLB Multihit, which can be a result of a bad
hardware OR software BUG. Kernel can easily recover from these soft errors
by flushing SLB contents. After the recovery kernel can still continue to
function without any issue. But in some scenarios we may keep getting
these soft errors until the root cause is fixed. To be able to analyze and
find the root cause, the best way is to gather enough data and system state at
the time of the MCE. Hence this patch introduces a sysctl knob where the user can
decide either to continue after recovery or panic the kernel to capture the
dump.


I'm not convinced we want this.

As we've discovered it's often not possible to reconstruct what happened
based on a dump anyway.

The key thing you need is the content of the SLB and that's not included
in a dump.

So I think we should dump the SLB content when we get the MCE (which
this series does) and any other useful info, and then if we can recover
we should.



The reasoning there is: what if we got a multi-hit due to some corruption
in slb_cache_ptr, i.e. some part of the kernel is wrongly updating the paca
data structure through a bad pointer. That is far-fetched, but still
possible, right? Hence the idea that, if we don't have much insight into
why an SLB multi-hit occurred from the dmesg (which includes the SLB contents,
slb_cache contents, etc.), there should be an easy way to force a dump that
might assist in further debugging.


-aneesh



[PATCH] powerpc/mm: remove huge_pte_offset_and_shift() prototype

2018-08-08 Thread Christophe Leroy
huge_pte_offset_and_shift() has never existed

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/hugetlb.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 3225eb6402cc..2d00cc530083 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -84,9 +84,6 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
return dir + idx;
 }
 
-pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
-unsigned long addr, unsigned *shift);
-
 void flush_dcache_icache_hugepage(struct page *page);
 
 int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
-- 
2.13.3



[PATCH v3 1/2] powerpc/pseries: Avoid blocking rtas polling handling multiple PRRN events

2018-08-08 Thread John Allen
When a PRRN event is being handled and another PRRN event comes in, the
second event will block rtas polling waiting on the first to complete,
preventing any further rtas events from being handled. This can be
especially problematic when PRRN events are being queued continuously,
in which case rtas polling is blocked indefinitely.

This patch removes the blocking call to flush_work and allows the
default workqueue behavior to handle duplicate events.
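
The "default workqueue behavior" relied on here is that schedule_work() is a
no-op for work that is already pending, so back-to-back PRRN events collapse
into a single run of the handler. Roughly (an illustrative sketch, not part of
the diff below):

  static void prrn_work_fn(struct work_struct *work);
  static DECLARE_WORK(prrn_work, prrn_work_fn);

  schedule_work(&prrn_work);  /* returns true: work queued */
  schedule_work(&prrn_work);  /* before it runs: returns false, already pending */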

Signed-off-by: John Allen 
---
v3:
  -Scrap the mutex as it only replicates existing workqueue behavior.
v2:
  -Unlock prrn_lock when PRRN operations are complete, not after handler is
   scheduled.
  -Remove call to flush_work, the previous broken method of serializing
   PRRN events.
---
 arch/powerpc/kernel/rtasd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index 44d66c33d59d..2017934e5985 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -290,7 +290,6 @@ static DECLARE_WORK(prrn_work, prrn_work_fn);
 
 static void prrn_schedule_update(u32 scope)
 {
-   flush_work(&prrn_work);
prrn_update_scope = scope;
schedule_work(&prrn_work);
 }
-- 
2.17.1



[PATCH v3 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

2018-08-08 Thread John Allen
While handling PRRN events, the time to handle the actual hotplug events
dwarfs the time it takes to perform the device tree updates and queue the
hotplug events. In the case that PRRN events are being queued continuously,
hotplug events have been observed to be queued faster than the kernel can
actually handle them. This patch avoids the problem by waiting for a
hotplug request to complete before queueing more hotplug events.

Signed-off-by: John Allen 
---
 arch/powerpc/platforms/pseries/mobility.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c 
b/arch/powerpc/platforms/pseries/mobility.c
index 8a8033a249c7..49930848fa78 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -242,6 +242,7 @@ static int add_dt_node(__be32 parent_phandle, __be32 
drc_index)
 static void prrn_update_node(__be32 phandle)
 {
struct pseries_hp_errorlog *hp_elog;
+   struct completion hotplug_done;
struct device_node *dn;
 
/*
@@ -263,7 +264,9 @@ static void prrn_update_node(__be32 phandle)
hp_elog->id_type = PSERIES_HP_ELOG_ID_DRC_INDEX;
hp_elog->_drc_u.drc_index = phandle;
 
-   queue_hotplug_event(hp_elog, NULL, NULL);
+   init_completion(&hotplug_done);
+   queue_hotplug_event(hp_elog, &hotplug_done, NULL);
+   wait_for_completion(&hotplug_done);
 
kfree(hp_elog);
 }
-- 
2.17.1



[PATCH v3 0/2] powerpc/pseries: Improve serialization of PRRN events

2018-08-08 Thread John Allen
Stress testing has uncovered issues with handling continuously queued PRRN
events. Running PRRN events in this way can seriously load the system given
the sheer volume of dlpar being handled. This patchset ensures that PRRN
events are handled more synchronously, only allowing the PRRN handler to
queue a single dlpar event at any given time.  Additionally, it ensures
that rtas polling continues normally when multiple PRRN events are queued
simultaneously.

v3:
  -Scrap the PRRN mutex as it only replicates existing workqueue behavior.
v2:
  -Unlock prrn_lock when PRRN operations are complete, not after handler is
   scheduled.
  -Remove call to flush_work, the previous broken method of serializing
   PRRN events.

John Allen (2):
  powerpc/pseries: Avoid blocking rtas polling handling multiple PRRN
events
  powerpc/pseries: Wait for completion of hotplug events during PRRN
handling

 arch/powerpc/kernel/rtasd.c   | 10 +++---
 arch/powerpc/platforms/pseries/mobility.c |  5 -
 2 files changed, 11 insertions(+), 4 deletions(-)

-- 
2.17.1



Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.

2018-08-08 Thread Michael Ellerman
Mahesh J Salgaonkar  writes:
> From: Mahesh Salgaonkar 
>
> Introduce recovery action for recovered memory errors (MCEs). There are
> soft memory errors like SLB Multihit, which can be a result of a bad
> hardware OR software BUG. Kernel can easily recover from these soft errors
> by flushing SLB contents. After the recovery kernel can still continue to
> function without any issue. But in some scenarios we may keep getting
> these soft errors until the root cause is fixed. To be able to analyze and
> find the root cause, the best way is to gather enough data and system state at
> the time of the MCE. Hence this patch introduces a sysctl knob where the user can
> decide either to continue after recovery or panic the kernel to capture the
> dump.

I'm not convinced we want this.

As we've discovered it's often not possible to reconstruct what happened
based on a dump anyway.

The key thing you need is the content of the SLB and that's not included
in a dump.

So I think we should dump the SLB content when we get the MCE (which
this series does) and any other useful info, and then if we can recover
we should.

cheers


Re: [PATCH] powerpc/64s: Make unrecoverable SLB miss less confusing

2018-08-08 Thread Naveen N. Rao

Michael Ellerman wrote:

Nicholas Piggin  writes:

On Thu, 26 Jul 2018 23:01:51 +1000
Michael Ellerman  wrote:


If we take an SLB miss while MSR[RI]=0 we can't recover and have to
oops. Currently this is reported by faking up a 0x4100 exception, eg:

  Unrecoverable exception 4100 at 0
  Oops: Unrecoverable exception, sig: 6 [#1]
  ...
  CPU: 0 PID: 1262 Comm: sh Not tainted 
4.18.0-rc3-gcc-7.3.1-00098-g7fc2229fb2ab-dirty #9
  NIP:   LR: c000b9e4 CTR: 7fff8bb971b0
  REGS: c000ee02bbb0 TRAP: 4100
  ...
  LR [c000b9e4] system_call+0x5c/0x70

The 0x4100 value was chosen back in 2004 as part of the fix for the
"mega bug" - "ppc64: Fix SLB reload bug". Back then it was obvious
that 0x4100 was not a real trap value, as the highest actual trap was
less than 0x2000.

Since then however the architecture has changed and now we have
"virtual mode" or "relon" exceptions, in which exceptions can be
delivered with the MMU on starting at 0x4000.

At a glance 0x4100 looks like a virtual mode 0x100 exception, aka
system reset exception. A close reading of the architecture will show
that system reset exceptions can't be delivered in virtual mode, and
so 0x4100 is not a valid trap number. But that's not immediately
obvious. There's also nothing about 0x4100 that suggests SLB miss.

So to make things a bit less confusing switch to a fake but unique and
hopefully more helpful numbering. For data SLB misses we report a
0x390 trap and for instruction we report 0x490. Compared to 0x380 and
0x480 for the actual data & instruction SLB exceptions.

Also add a C handler that prints a more explicit message. The end
result is something like:

  Oops: Unrecoverable SLB miss (MSR[RI]=0), sig: 6 [#3]


This is all good, but allow me to nitpick. Our unrecoverable
exception messages (and other messages, but those) are becoming a bit
ad-hoc and messy.

It would be nice to go the other way eventually and consolidate them
into one. Would be nice to have a common function that takes regs and
returns the string of the corresponding exception name that makes
these more readable.


Yeah that's true, though some of them aren't simply a mapping from the
trap number, eg. the kernel bad stack one.

But in general our whole oops output, regs, stack trace etc. could use a
revamp.

I've been thinking of making the trap number more prominent and
providing a text description, because apparently not everyone knows the
trap numbers by heart :)


Yes please, guilty as charged :)
https://patchwork.ozlabs.org/patch/899980/

Thanks,
Naveen




Re: [PATCH v7 4/9] powerpc/pseries: Define MCE error event section.

2018-08-08 Thread Michael Ellerman
Hi Mahesh,

A few nitpicks.

Mahesh J Salgaonkar  writes:
> From: Mahesh Salgaonkar 
>
> On pseries, the machine check error details are part of RTAS extended
> event log passed under Machine check exception section. This patch adds
> the definition of rtas MCE event section and related helper
> functions.
>
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/rtas.h |  111 
> +++
>  1 file changed, 111 insertions(+)

AFAICS none of this ever gets used outside of ras.c, so it should
just go in there.

> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 71e393c46a49..adc677c5e3a4 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -326,6 +334,109 @@ struct pseries_hp_errorlog {
>  #define PSERIES_HP_ELOG_ID_DRC_COUNT 3
> #define PSERIES_HP_ELOG_ID_DRC_IC    4
>  
> +/* RTAS pseries MCE errorlog section */
> +#pragma pack(push, 1)
> +struct pseries_mc_errorlog {
> + __be32  fru_id;
> + __be32  proc_id;
> + uint8_t error_type;

Please use kernel types, so u8.

> + union {
> + struct {
> + uint8_t ue_err_type;
> +                /*
> +                 * X        1: Permanent or Transient UE.
> +                 *  X       1: Effective address provided.
> +                 *   X      1: Logical address provided.
> +                 *    XX    2: Reserved.
> +                 *      XXX 3: Type of UE error.
> +                 */

But which bit is bit 0? And is that the LSB or MSB?


> + uint8_t reserved_1[6];
> + __be64  effective_address;
> + __be64  logical_address;
> + } ue_error;
> + struct {
> + uint8_t soft_err_type;
> +                /*
> +                 * X        1: Effective address provided.
> +                 *  XXXXX   5: Reserved.
> +                 *       XX 2: Type of SLB/ERAT/TLB error.
> +                 */
> + uint8_t reserved_1[6];
> + __be64  effective_address;
> + uint8_t reserved_2[8];
> + } soft_error;
> + } u;
> +};
> +#pragma pack(pop)

Why not __packed ?

> +/* RTAS pseries MCE error types */
> +#define PSERIES_MC_ERROR_TYPE_UE       0x00
> +#define PSERIES_MC_ERROR_TYPE_SLB      0x01
> +#define PSERIES_MC_ERROR_TYPE_ERAT     0x02
> +#define PSERIES_MC_ERROR_TYPE_TLB      0x04
> +#define PSERIES_MC_ERROR_TYPE_D_CACHE  0x05
> +#define PSERIES_MC_ERROR_TYPE_I_CACHE  0x07

Once these are in ras.c they can have less unwieldy names, ie. the
PSERIES at least can be dropped.

> +/* RTAS pseries MCE error sub types */
> +#define PSERIES_MC_ERROR_UE_INDETERMINATE              0
> +#define PSERIES_MC_ERROR_UE_IFETCH                     1
> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_IFETCH     2
> +#define PSERIES_MC_ERROR_UE_LOAD_STORE                 3
> +#define PSERIES_MC_ERROR_UE_PAGE_TABLE_WALK_LOAD_STORE 4
> +
> +#define PSERIES_MC_ERROR_SLB_PARITY                    0
> +#define PSERIES_MC_ERROR_SLB_MULTIHIT                  1
> +#define PSERIES_MC_ERROR_SLB_INDETERMINATE             2
> +
> +#define PSERIES_MC_ERROR_ERAT_PARITY                   1
> +#define PSERIES_MC_ERROR_ERAT_MULTIHIT                 2
> +#define PSERIES_MC_ERROR_ERAT_INDETERMINATE            3
> +
> +#define PSERIES_MC_ERROR_TLB_PARITY                    1
> +#define PSERIES_MC_ERROR_TLB_MULTIHIT                  2
> +#define PSERIES_MC_ERROR_TLB_INDETERMINATE             3
> +
> +static inline uint8_t rtas_mc_error_type(const struct pseries_mc_errorlog 
> *mlog)
> +{
> + return mlog->error_type;
> +}

Why not just access it directly?

> +static inline uint8_t rtas_mc_error_sub_type(
> + const struct pseries_mc_errorlog *mlog)
> +{
> +        switch (mlog->error_type) {
> +        case PSERIES_MC_ERROR_TYPE_UE:
> +                return (mlog->u.ue_error.ue_err_type & 0x07);
> +        case PSERIES_MC_ERROR_TYPE_SLB:
> +        case PSERIES_MC_ERROR_TYPE_ERAT:
> +        case PSERIES_MC_ERROR_TYPE_TLB:
> +                return (mlog->u.soft_error.soft_err_type & 0x03);
> +        default:
> +                return 0;
> +        }
> +}
> +
> +static inline uint64_t rtas_mc_get_effective_addr(
> + const struct pseries_mc_errorlog *mlog)
> +{
> + uint64_t addr = 0;

That should be __be64.

> +
> +        switch (mlog->error_type) {
> +        case PSERIES_MC_ERROR_TYPE_UE:
> +                if (mlog->u.ue_error.ue_err_type & 0x40)
> +                        addr = mlog->u.ue_error.effective_address;
> +                break;
> +        case PSERIES_MC_ERROR_TYPE_SLB:
> +        case PSERIES_MC_ERROR_TYPE_ERAT:
> +        case PSERIES_MC_ERROR_TYPE_TLB:
> + if (mlog->u.soft_error.soft_err_ty

Re: powerpc/64: Disable irq restore warning for now

2018-08-08 Thread Michael Ellerman
On Tue, 2018-08-07 at 11:48:46 UTC, Michael Ellerman wrote:
> We recently added a warning in arch_local_irq_restore() to check that
> the soft masking state matches reality.
> 
> Unfortunately it trips in a few places, which are not entirely trivial
> to fix. The key problem is if we're doing function_graph tracing of
> restore_math(), the warning pops and then seems to recurse. It's not
> entirely clear because the system continuously oopses on all CPUs,
> with the output interleaved and unreadable.
> 
> It's also been observed on a G5 coming out of idle.
> 
> Until we can fix those cases disable the warning for now.
> 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/4da1f79227ad42550e3e70e7b4812e

cheers


Re: powerpc/Makefiles: Convert ifeq to ifdef where possible

2018-08-08 Thread Michael Ellerman
On Mon, 2018-08-06 at 16:42:03 UTC, "Rodrigo R. Galvao" wrote:
> In Makefiles if we're testing a CONFIG_FOO symbol for equality with 'y'
> we can instead just use ifdef. The latter reads easily, so convert to
> it where possible.
> 
> Signed-off-by: Rodrigo R. Galvao 
> Reviewed-by: Mauro S. M. Rodrigues 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/badf436f6fa5dc864d579d73fe7538

cheers


Re: misc: cxl: changed asterisk position

2018-08-08 Thread Michael Ellerman
On Fri, 2018-08-03 at 10:20:38 UTC, Parth Y Shah wrote:
> Resolved <"foo* bar" should be "foo *bar"> error
> 
> Signed-off-by: Parth Y Shah 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a0ac3687fb404d59296ecba4642071

cheers


Re: powpc:feature: of_node_put is not needed after iterator.

2018-08-08 Thread Michael Ellerman
On Sat, 2018-08-04 at 14:25:00 UTC, zhong jiang wrote:
> for_each_node_by_name iterators only exit normally when the loop
> cursor is NULL, so there is no point in calling of_node_put.
> 
> Signed-off-by: zhong jiang 
> Reviewed-by: Tyrel Datwyler 
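
The pattern being described, as a minimal sketch (the node name "memory" is
just an example):

  struct device_node *np;

  for_each_node_by_name(np, "memory") {
          /* use np; the iterator drops the reference to the previous
           * node on each step */
  }
  /* normal loop exit means np == NULL, so an of_node_put(np) here is
   * a no-op and can simply be dropped */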

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/81d7b08b3cec79a43411e7175401b3

cheers


Re: [v4, 1/6] powerpc/traps: Print unhandled signals in a separate function

2018-08-08 Thread Michael Ellerman
On Wed, 2018-08-01 at 21:33:15 UTC, Murilo Opsfelder Araujo wrote:
> Isolate the logic of printing unhandled signals out of _exception_pkey().
> No functional change, only code rearrangement.
> 
> Signed-off-by: Murilo Opsfelder Araujo 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/658b0f92bc7003bc734471f61bf7cd

cheers


Re: [v3, 1/4] powerpc/64: Make exception table clearer in __copy_tofrom_user_base

2018-08-08 Thread Michael Ellerman
On Fri, 2018-08-03 at 10:13:03 UTC, Paul Mackerras wrote:
> This aims to make the generation of exception table entries for the
> loads and stores in __copy_tofrom_user_base clearer and easier to
> verify.  Instead of having a series of local labels on the loads and
> stores, with a series of corresponding labels later for the exception
> handlers, we now use macros to generate exception table entries at the
> point of each load and store that could potentially trap.  We do this
> with the macros lex (load exception) and stex (store exception).
> These macros are used right before the load or store to which they
> apply.
> 
> Some complexity is introduced by the fact that we have some more work
> to do after hitting an exception, because we need to calculate and
> return the number of bytes not copied.  The code uses r3 as the
> current pointer into the destination buffer, that is, the address of
> the first byte of the destination that has not been modified.
> However, at various points in the copy loops, r3 can be 4, 8, 16 or 24
> bytes behind that point.
> 
> To express this offset in an understandable way, we define a symbol
> r3_offset which is updated at various points so that it equal to the
> difference between the address of the first unmodified byte of the
> destination and the value in r3.  (In fact it only needs to be
> accurate at the point of each lex or stex macro invocation.)
> 
> The rules for updating r3_offset are as follows:
> 
> * It starts out at 0
> * An addi r3,r3,N instruction decreases r3_offset by N
> * A store instruction (stb, sth, stw, std) to N(r3)
>   increases r3_offset by the width of the store (1, 2, 4, 8)
> * A store with update instruction (stbu, sthu, stwu, stdu) to N(r3)
>   sets r3_offset to the width of the store.
> 
> There is some trickiness to the way that the lex and stex macros and
> the associated exception handlers work.  I would have liked to use
> the current value of r3_offset in the name of the symbol used as
> the exception handler, as in ".Lld_exc_$(r3_offset)" and then
> have symbols .Lld_exc_0, .Lld_exc_8, .Lld_exc_16 etc. corresponding
> to the offsets that needed to be added to r3.  However, I couldn't
> see a way to do that with gas.
> 
> Instead, the exception handler address is .Lld_exc - r3_offset or
> .Lst_exc - r3_offset, that is, the distance ahead of .Lld_exc/.Lst_exc
> that we start executing is equal to the amount that we need to add to
> r3.  This works because r3_offset is always a small multiple of 4,
> and our instructions are 4 bytes long.  This means that before
> .Lld_exc and .Lst_exc, we have a sequence of instructions that
> increments r3 by 4, 8, 16 or 24 depending on where we start.  The
> sequence increments r3 by 4 per instruction (on average).
> 
> We also replace the exception table for the 4k copy loop by a
> macro per load or store.  These loads and stores all use exactly
> the same exception handler, which simply resets the argument registers
> r3, r4 and r5 to their original values and re-does the whole copy
> using the slower loop.
> 
> Signed-off-by: Paul Mackerras 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a7c81ce398e2ad304f61d6167155f3

cheers


Re: [v5,8/8] Documentation: Add nospectre_v1 parameter

2018-08-08 Thread Michael Ellerman
On Fri, 2018-07-27 at 23:06:39 UTC, Michael Ellerman wrote:
> From: Diana Craciun 
> 
> Currently only supported on powerpc.
> 
> Signed-off-by: Diana Craciun 
> Signed-off-by: Michael Ellerman 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/26cb1f36c43ee6e89d2a9f48a5a750

cheers


Re: [v8,1/4] selftests/powerpc: add test for 32 bits memcmp

2018-08-08 Thread Michael Ellerman
On Wed, 2018-08-01 at 09:01:10 UTC, Christophe Leroy wrote:
> This patch renames memcmp test to memcmp_64 and adds
> a memcmp_32 test for testing the 32 bits version of memcmp()
> 
> Signed-off-by: Christophe Leroy 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/1bb07b593adc1934a526eb04acfe8b

cheers


Re: [v2, 1/2] selftests/powerpc: Skip earlier in alignment_handler test

2018-08-08 Thread Michael Ellerman
On Tue, 2018-07-31 at 12:08:41 UTC, Michael Ellerman wrote:
> Currently the alignment_handler test prints "Can't open /dev/fb0"
> about 80 times per run, which is a little annoying.
> 
> Refactor it to check earlier if it can open /dev/fb0 and skip if not,
> this results in each test printing something like:
> 
>   test: test_alignment_handler_vsx_206
>   tags: git_version:v4.18-rc3-134-gfb21a48904aa
>   [SKIP] Test skipped on line 291
>   skip: test_alignment_handler_vsx_206
> 
> Signed-off-by: Michael Ellerman 
> Acked-by: Andrew Donnellan 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/edba42cd14dbb0cc8b48cf786b006a

cheers


Re: [v5, 1/8] powerpc/64: Disable the speculation barrier from the command line

2018-08-08 Thread Michael Ellerman
On Fri, 2018-07-27 at 23:06:32 UTC, Michael Ellerman wrote:
> From: Diana Craciun 
> 
> The speculation barrier can be disabled from the command line
> with the parameter: "nospectre_v1".
> 
> Signed-off-by: Diana Craciun 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/cf175dc315f90185128fb061dc05b6

cheers


Re: [resend] powerpc/64s: fix page table fragment refcount race vs speculative references

2018-08-08 Thread Michael Ellerman
On Fri, 2018-07-27 at 11:48:17 UTC, Nicholas Piggin wrote:
> The page table fragment allocator uses the main page refcount racily
> with respect to speculative references. A customer observed a BUG due
> to page table page refcount underflow in the fragment allocator. This
> can be caused by the fragment allocator set_page_count stomping on a
> speculative reference, and then the speculative failure handler
> decrements the new reference, and the underflow eventually pops when
> the page tables are freed.
> 
> Fix this by using a dedicated field in the struct page for the page
> table fragment allocator.
> 
> Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
> Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Nicholas Piggin 
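
The shape of the fix, schematically (the field and constant names here are
illustrative rather than a quote of the patch):

  /* Before: the fragment allocator reused the page's main refcount, which
   * can stomp on a concurrent speculative get_page(): */
  set_page_count(page, PTE_FRAG_NR);

  /* After: a dedicated counter, leaving page->_refcount alone: */
  atomic_set(&page->pt_frag_refcount, PTE_FRAG_NR);

  /* and on free: */
  if (atomic_dec_and_test(&page->pt_frag_refcount))
          __free_page(page);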

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/4231aba000f5a4583dd9f67057aadb

cheers


Re: [01/16] powerpc/64s: Move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES()

2018-08-08 Thread Michael Ellerman
On Thu, 2018-07-26 at 13:07:02 UTC, Michael Ellerman wrote:
> EXCEPTION_PROLOG_PSERIES() only has two users, STD_EXCEPTION_PSERIES()
> and STD_EXCEPTION_HV() both of which "call" SET_SCRATCH0(), so just
> move SET_SCRATCH0() into EXCEPTION_PROLOG_PSERIES().
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/4a7a0a8444ba4cebb3a6744e9c14fc

cheers


Re: [1/3] selftests/powerpc: Add a helper for checking if we're on ppc64le

2018-08-08 Thread Michael Ellerman
On Thu, 2018-07-26 at 12:24:57 UTC, Michael Ellerman wrote:
> Some of our selftests have only been tested on ppc64le and crash or
> behave weirdly on ppc64/ppc32. So add a helper for checking the UTS
> machine.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/95f9b3af401f5b4daeb908a2c658e8

cheers


Re: powerpc/64s: Make rfi_flush_fallback a little more robust

2018-08-08 Thread Michael Ellerman
On Thu, 2018-07-26 at 12:42:44 UTC, Michael Ellerman wrote:
> Because rfi_flush_fallback runs immediately before the return to
> userspace it currently runs with the user r1 (stack pointer). This
> means if we oops in there we will report a bad kernel stack pointer in
> the exception entry path, eg:
> 
>   Bad kernel stack pointer 77150e40 at c00023b4
>   Oops: Bad kernel stack pointer, sig: 6 [#1]
>   LE SMP NR_CPUS=32 NUMA PowerNV
>   Modules linked in:
>   CPU: 0 PID: 1246 Comm: klogd Not tainted 
> 4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3 #7
>   NIP:  c00023b4 LR: 10053e00 CTR: 0040
>   REGS: c000fffe7d40 TRAP: 4100   Not tainted  
> (4.18.0-rc2-gcc-7.3.1-00175-g0443f8a69ba3)
>   MSR:  92803031   CR: 44000442  XER: 
> 2000
>   CFAR: c000bac8 IRQMASK: c000f1e66a80
>   GPR00: 0200 77150e40 7fff93a99900 0020
>   ...
>   NIP [c00023b4] rfi_flush_fallback+0x34/0x80
>   LR [10053e00] 0x10053e00
> 
> Although the NIP tells us where we were, and the TRAP number tells us
> what happened, it would still be nicer if we could report the actual
> exception rather than barfing about the stack pointer.
> 
> We can do that fairly simply by loading the kernel stack pointer on
> entry and restoring the user value before returning. That way we see a
> regular oops such as:
> 
>   Unrecoverable exception 4100 at c000239c
>   Oops: Unrecoverable exception, sig: 6 [#1]
>   LE SMP NR_CPUS=32 NUMA PowerNV
>   Modules linked in:
>   CPU: 0 PID: 1251 Comm: klogd Not tainted 
> 4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty #40
>   NIP:  c000239c LR: 10053e00 CTR: 0040
>   REGS: c000f1e17bb0 TRAP: 4100   Not tainted  
> (4.18.0-rc3-gcc-7.3.1-00097-g4ebfcac65acd-dirty)
>   MSR:  92803031   CR: 44000442  XER: 
> 2000
>   CFAR: c000bac8 IRQMASK: 0
>   ...
>   NIP [c000239c] rfi_flush_fallback+0x3c/0x80
>   LR [10053e00] 0x10053e00
>   Call Trace:
>   [c000f1e17e30] [c000b9e4] system_call+0x5c/0x70 (unreliable)
> 
> Note this shouldn't make the kernel stack pointer vulnerable to a
> meltdown attack, because it should be flushed from the cache before we
> return to userspace. The user r1 value will be in the cache, because
> we load it in the return path, but that is harmless.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Nicholas Piggin 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/78ee9946371f5848ddfc88ab1a4386

cheers


Re: powerpc/pasemi: Use pr_err/pr_warn... for kernel messages

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-25 at 20:45:01 UTC, Darren Stevens wrote:
> Pasemi code still uses printk(KERN_ERR/KERN_WARN ... change these to
> pr_err(, pr_warn(... to match other powerpc arch code.
> 
> No functional changes.
> 
> Signed-off-by: Darren Stevens 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e13606d7321c0c08d4ac2d74a11026

cheers


Re: powerpc/pasemi: Seach for PCI root bus by compatible property

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-25 at 20:55:18 UTC, Darren Stevens wrote:
> Pasemi arch code finds the root of the PCI-e bus by searching the
> device-tree for a node called 'pxp'. But the root bus has a 
> compatible property of 'pasemi,rootbus' so search for that instead.
> 
> Signed-off-by: Darren Stevens 
> Acked-by: Olof Johansson 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/250a93501d6265bbb9ddf06af25ac9

cheers


Re: powerpc/64s: free page table caches at exit_mmap time

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-25 at 09:54:28 UTC, Nicholas Piggin wrote:
> The kernel page table caches are tied to init_mm, so there is no
> more need for them after userspace is finished.
> 
> Signed-off-by: Nicholas Piggin 
> Reviewed-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/34c604d27590fdc9a2c944be8c50ae

cheers


Re: powerpc/64s/radix: tlb do not flush on page size when fullmm

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-25 at 13:58:06 UTC, Nicholas Piggin wrote:
> When the mm is being torn down there will be a full PID flush so
> there is no need to flush the TLB on page size changes.
> 
> Signed-off-by: Nicholas Piggin 
> Reviewed-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/5a6099346c41792f1ba23aea6f74ad

cheers


Re: powerpc: Add a checkpatch wrapper with our preferred settings

2018-08-08 Thread Michael Ellerman
On Tue, 2018-07-24 at 14:03:46 UTC, Michael Ellerman wrote:
> This makes it easy to run checkpatch with settings that we have agreed
> on (bwhahahahah).
> 
> Usage is eg:
> 
>   $ ./arch/powerpc/tools/checkpatch.sh -g origin/master..
> 
> To check all commits since origin/master.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Russell Currey 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/7cd129b4b5370030a5c0b8031a54b2

cheers


Re: [1/5] powerpc/asm: Add a patch_site macro & helpers for patching instructions

2018-08-08 Thread Michael Ellerman
On Mon, 2018-07-23 at 15:07:52 UTC, Michael Ellerman wrote:
> Add a macro and some helper C functions for patching single asm
> instructions.
> 
> The gas macro means we can do something like:
> 
>   1:  nop
>   patch_site 1b, patch__foo
> 
> Which is less visually distracting than defining a GLOBAL symbol at 1,
> and also doesn't pollute the symbol table which can confuse eg. perf.
> 
> These are obviously similar to our existing feature sections, but are
> not automatically patched based on CPU/MMU features, rather they are
> designed to be manually patched by C code at some arbitrary point.
> 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/06d0bbc6d0f56dacac3a79900e9a9a

cheers


Re: powerpc/platforms/85xx: fix t1042rdb_diu.c build errors & warning

2018-08-08 Thread Michael Ellerman
On Sun, 2018-07-15 at 17:34:46 UTC, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Fix build errors and warnings in t1042rdb_diu.c by adding header files
> and MODULE_LICENSE().
> 
> ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: data definition 
> has no type or storage class
>  early_initcall(t1042rdb_diu_init);
> ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: error: type defaults to 
> 'int' in declaration of 'early_initcall' [-Werror=implicit-int]
> ../arch/powerpc/platforms/85xx/t1042rdb_diu.c:152:1: warning: parameter names 
> (without types) in function declaration
> 
> and
> WARNING: modpost: missing MODULE_LICENSE() in 
> arch/powerpc/platforms/85xx/t1042rdb_diu.o
> 
> Signed-off-by: Randy Dunlap 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: Scott Wood 
> Cc: Kumar Gala 
> Cc: linuxppc-dev@lists.ozlabs.org

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f5daf77a55ef0e695cc90c440ed650

cheers


Re: [v6, 2/8] powerpc/pseries: Defer the logging of rtas error to irq work queue.

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-04 at 17:57:21 UTC, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> rtas_log_buf is a buffer to hold RTAS event data that are communicated
> to kernel by hypervisor. This buffer is then used to pass RTAS event
> data to user through proc fs. This buffer is allocated from vmalloc
> (non-linear mapping) area.
> 
> On Machine check interrupt, register r3 points to RTAS extended event
> log passed by hypervisor that contains the MCE event. The pseries
> machine check handler then logs this error into rtas_log_buf. The
> rtas_log_buf is a vmalloc-ed (non-linear) buffer we end up taking up a
> page fault (vector 0x300) while accessing it. Since machine check
> interrupt handler runs in NMI context we can not afford to take any
> page fault. Page faults are not honored in NMI context and causes
> kernel panic. Apart from that, as Nick pointed out, pSeries_log_error()
> also takes a spin_lock while logging error which is not safe in NMI
> context. It may endup in deadlock if we get another MCE before releasing
> the lock. Fix this by deferring the logging of rtas error to irq work queue.
> 
> The current implementation uses two different buffers to hold the rtas error log
> depending on whether an extended log is provided or not. This makes it a bit
> difficult to identify which buffer has valid data that needs to be logged
> later in irq work. Simplify this by using a single buffer, one per paca, and
> copy the rtas log to it irrespective of whether an extended log is provided or
> not. Allocate this buffer below the RMA region so that it can be accessed
> in the real mode mce handler.
> 
> Fixes: b96672dd840f ("powerpc: Machine check interrupt is a non-maskable 
> interrupt")
> Cc: sta...@vger.kernel.org
> Reviewed-by: Nicholas Piggin 
> Signed-off-by: Mahesh Salgaonkar 
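
The deferral described above is the standard irq_work pattern; schematically
(names here are illustrative, not the exact ones used in the patch):

  #include <linux/irq_work.h>

  static void mce_process_errlog(struct irq_work *work)
  {
          /* safe (non-NMI) context: ok to take locks and touch the
           * vmalloc-ed rtas_log_buf here */
  }
  static DEFINE_IRQ_WORK(mce_errlog_work, mce_process_errlog);

  /* in the real-mode machine check path: copy the RTAS log into the
   * per-paca buffer allocated below the RMA, then defer the rest */
  irq_work_queue(&mce_errlog_work);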

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/94675cceacaec27a30eefb142c4c59

cheers


Re: [v6, 1/8] powerpc/pseries: Avoid using the size greater than RTAS_ERROR_LOG_MAX.

2018-08-08 Thread Michael Ellerman
On Wed, 2018-07-04 at 17:57:02 UTC, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar 
> 
> The global mce data buffer that is used to copy the rtas error log is 2048
> (RTAS_ERROR_LOG_MAX) bytes in size. Before the copy we read
> extended_log_length from the rtas error log header, then use the max of
> extended_log_length and RTAS_ERROR_LOG_MAX as the size of data to be copied.
> Ideally the platform (phyp) will never send an extended error log with
> size > 2048. But if that happens, then we have a risk of buffer overrun
> and corruption. Fix this by using min_t instead.
> 
> Fixes: d368514c3097 ("powerpc: Fix corruption when grabbing FWNMI data")
> Reported-by: Michal Suchanek 
> Signed-off-by: Mahesh Salgaonkar 
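
In other words the copy length is clamped rather than expanded, along the
lines of (sketch; the buffer and length names are placeholders):

  /* was effectively: len = max(extended_log_length, RTAS_ERROR_LOG_MAX),
   * which can overrun the 2048-byte buffer */
  len = min_t(int, extended_log_length, RTAS_ERROR_LOG_MAX);
  memcpy(mce_data_buf, rtas_errlog, len);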

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/74e96bf44f430cf7a01de19ba6cf49

cheers


Re: [V3, 2/2] crypto/nx: Initialize 842 high and normal RxFIFO control registers

2018-08-08 Thread Michael Ellerman
On Wed, 2018-06-13 at 07:32:40 UTC, Haren Myneni wrote:
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
> 
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
> 
> Signed-off-by: Haren Myneni 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/656ecc16e8fc2ab44b3d70e3fcc197

cheers


Re: [V3,1/2] powerpc/powernv: Export opal_check_token symbol

2018-08-08 Thread Michael Ellerman
On Wed, 2018-06-13 at 07:28:57 UTC, Haren Myneni wrote:
> Export opal_check_token symbol for modules to check the availability
> of OPAL calls before using them.
> 
> Signed-off-by: Haren Myneni 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6e708000ec2c93c2bde6a46aa2d6c3

cheers


Re: powerpc/perf: Remove sched_task function defined for thread-imc

2018-08-08 Thread Michael Ellerman
On Fri, 2018-05-18 at 07:35:25 UTC, Anju T Sudhakar wrote:
> Call trace observed while running perf-fuzzer:
> 
> [  329.228068] CPU: 43 PID: 9088 Comm: perf_fuzzer Not tainted 
> 4.13.0-32-generic #35~lp1746225
> [  329.228070] task: c03f776ac900 task.stack: c03f77728000
> [  329.228071] NIP: c0299b70 LR: c02a4534 CTR: 
> c029bb80
> [  329.228073] REGS: c03f7772b760 TRAP: 0700   Not tainted  
> (4.13.0-32-generic)
> [  329.228073] MSR: 9282b033 
> [  329.228079]   CR: 24008822  XER: 
> [  329.228080] CFAR: c0299a70 SOFTE: 0
> GPR00: c02a4534 c03f7772b9e0 c1606200 c03fef858908
> GPR04: c03f776ac900 0001  003fee73
> GPR08:   c11220d8 0002
> GPR12: c029bb80 c7a3d900  
> GPR16:    
> GPR20:   c03f776ad090 c0c71354
> GPR24: c03fef716780 003fee73 c03fe69d4200 c03f776ad330
> GPR28: c11220d8 0001 c14c6108 c03fef858900
> [  329.228098] NIP [c0299b70] perf_pmu_sched_task+0x170/0x180
> [  329.228100] LR [c02a4534] __perf_event_task_sched_in+0xc4/0x230
> [  329.228101] Call Trace:
> [  329.228102] [c03f7772b9e0] [c02a0678] 
> perf_iterate_sb+0x158/0x2a0 (unreliable)
> [  329.228105] [c03f7772ba30] [c02a4534] 
> __perf_event_task_sched_in+0xc4/0x230
> [  329.228107] [c03f7772bab0] [c01396dc] 
> finish_task_switch+0x21c/0x310
> [  329.228109] [c03f7772bb60] [c0c71354] __schedule+0x304/0xb80
> [  329.228111] [c03f7772bc40] [c0c71c10] schedule+0x40/0xc0
> [  329.228113] [c03f7772bc60] [c01033f4] do_wait+0x254/0x2e0
> [  329.228115] [c03f7772bcd0] [c0104ac0] kernel_wait4+0xa0/0x1a0
> [  329.228117] [c03f7772bd70] [c0104c24] SyS_wait4+0x64/0xc0
> [  329.228121] [c03f7772be30] [c000b184] system_call+0x58/0x6c
> [  329.228121] Instruction dump:
> [  329.228123] 3beafea0 7faa4800 409eff18 e8010060 eb610028 ebc10040 7c0803a6 
> 38210050
> [  329.228127] eb81ffe0 eba1ffe8 ebe1fff8 4e800020 <0fe0> 4bbc 
> 6000 6042
> [  329.228131] ---[ end trace 8c46856d314c1811 ]---
> [  375.755943] hrtimer: interrupt took 31601 ns
> 
> 
> The context switch callbacks for thread-imc are defined in the sched_task
> function.
> So when thread-imc events are grouped with software PMU events,
> perf_pmu_sched_task hits the WARN_ON_ONCE condition, since software PMUs are
> assumed not to have a sched_task defined.
> 
> Move the thread_imc enable/disable OPAL call from sched_task to the
> event_[add/del] functions.
> 
> Signed-off-by: Anju T Sudhakar 
> Reviewed-by: Madhavan Srinivasan 
> Tested-by: Joel Stanley 
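
The move described above amounts to hooking the per-thread IMC start/stop into
the pmu add/del callbacks instead of sched_task, roughly (callback names are
illustrative):

  static struct pmu thread_imc_pmu = {
          /* ... */
          .add = thread_imc_event_add,  /* enable thread IMC counting here */
          .del = thread_imc_event_del,  /* and disable it here */
          /* no .sched_task: software PMUs are assumed not to provide one,
           * which is what the WARN_ON_ONCE above is checking */
  };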

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7ccc4fe5ff9e3a134e863beed0dba1

cheers


Re: [1/3] powerpc: make CPU selection logic generic in Makefile

2018-08-08 Thread Michael Ellerman
On Thu, 2018-06-07 at 10:10:18 UTC, Christophe Leroy wrote:
> At the time being, when adding a new CPU for selection, both
> Kconfig.cputype and Makefile have to be modified.
> 
> This patch moves into Kconfig.cputype the name of the CPU to be
> passed to the -mcpu= argument.
> 
> Signed-off-by: Christophe Leroy 
> Reviewed-by: Nicholas Piggin 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/cc62d20ce4ca4fe73a09d571144b29

cheers


Re: [2/3] powerpc/xive: Remove now useless pr_debug statements

2018-08-08 Thread Michael Ellerman
On Wed, 2018-04-11 at 05:18:00 UTC, Benjamin Herrenschmidt wrote:
> Those overly verbose statements in the setup of the pool VP
> aren't particularly useful (esp. considering we don't actually
> use the pool, we only configure it because HW requires it). So
> remove them, which improves the code readability.
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/dbc5740247961d6b060737620520dc

cheers


Re: [3/3] powerpc/xive: Remove xive_kexec_teardown_cpu()

2018-08-08 Thread Michael Ellerman
On Wed, 2018-04-11 at 05:18:01 UTC, Benjamin Herrenschmidt wrote:
> It's identical to xive_teardown_cpu() so just use the latter
> 
> Signed-off-by: Benjamin Herrenschmidt 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e27e0a94651e30912443e88642e698

cheers


Re: powerpc/powernv/opal: Use standard interrupts property when available

2018-08-08 Thread Michael Ellerman
On Tue, 2018-04-10 at 07:16:10 UTC, Benjamin Herrenschmidt wrote:
> For (bad) historical reasons, OPAL used to create a non-standard pair of
> properties "opal-interrupts" and "opal-interrupts-names" for representing
> the list of interrupts it wants Linux to request on its behalf.
> 
> Among other issues, the opal-interrupts doesn't have a way to carry the
> type of interrupts, and they were assumed to be all level sensitive.
> 
> This is wrong on some recent systems where some of them are edge sensitive
> causing warnings in the XIVE code and possible misbehaviours if they need
> to be retriggered (typically the NPU2 TCE error interrupts).
> 
> This makes Linux switch to using the standard "interrupts" and
> "interrupt-names" properties instead when they are available, using standard
> of_irq helpers, which can carry all the desired type information.
> 
> Newer versions of OPAL will generate those properties in addition to the
> legacy ones.
> 
> Signed-off-by: Benjamin Herrenschmidt 
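
The approach described boils down to preferring the standard property and
falling back to the legacy one, something like (sketch only, not the exact
code):

  unsigned int virq;

  if (of_property_read_bool(opal_node, "interrupts")) {
          /* standard property: trigger type comes with the specifier */
          virq = irq_of_parse_and_map(opal_node, i);
  } else {
          /* legacy flat list of hwirqs, historically assumed level sensitive */
          u32 hwirq;

          of_property_read_u32_index(opal_node, "opal-interrupts", i, &hwirq);
          virq = irq_create_mapping(NULL, hwirq);
  }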

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/77b5f703dcc859915f0f20d92bc538

cheers


Re: [PATCH] powerpc/64s: Make unrecoverable SLB miss less confusing

2018-08-08 Thread Michael Ellerman
Nicholas Piggin  writes:
> On Thu, 26 Jul 2018 23:01:51 +1000
> Michael Ellerman  wrote:
>
>> If we take an SLB miss while MSR[RI]=0 we can't recover and have to
>> oops. Currently this is reported by faking up a 0x4100 exception, eg:
>> 
>>   Unrecoverable exception 4100 at 0
>>   Oops: Unrecoverable exception, sig: 6 [#1]
>>   ...
>>   CPU: 0 PID: 1262 Comm: sh Not tainted 
>> 4.18.0-rc3-gcc-7.3.1-00098-g7fc2229fb2ab-dirty #9
>>   NIP:   LR: c000b9e4 CTR: 7fff8bb971b0
>>   REGS: c000ee02bbb0 TRAP: 4100
>>   ...
>>   LR [c000b9e4] system_call+0x5c/0x70
>> 
>> The 0x4100 value was chosen back in 2004 as part of the fix for the
>> "mega bug" - "ppc64: Fix SLB reload bug". Back then it was obvious
>> that 0x4100 was not a real trap value, as the highest actual trap was
>> less than 0x2000.
>> 
>> Since then however the architecture has changed and now we have
>> "virtual mode" or "relon" exceptions, in which exceptions can be
>> delivered with the MMU on starting at 0x4000.
>> 
>> At a glance 0x4100 looks like a virtual mode 0x100 exception, aka
>> system reset exception. A close reading of the architecture will show
>> that system reset exceptions can't be delivered in virtual mode, and
>> so 0x4100 is not a valid trap number. But that's not immediately
>> obvious. There's also nothing about 0x4100 that suggests SLB miss.
>> 
>> So to make things a bit less confusing switch to a fake but unique and
>> hopefully more helpful numbering. For data SLB misses we report a
>> 0x390 trap and for instruction we report 0x490. Compared to 0x380 and
>> 0x480 for the actual data & instruction SLB exceptions.
>> 
>> Also add a C handler that prints a more explicit message. The end
>> result is something like:
>> 
>>   Oops: Unrecoverable SLB miss (MSR[RI]=0), sig: 6 [#3]
>
> This is all good, but allow me to nitpick. Our unrecoverable
> exception messages (and other messages, but those) are becoming a bit
> ad-hoc and messy.
>
> It would be nice to go the other way eventually and consolidate them
> into one. Would be nice to have a common function that takes regs and
> returns the string of the corresponding exception name that makes
> these more readable.

Yeah that's true, though some of them aren't simply a mapping from the
trap number, eg. the kernel bad stack one.

But in general our whole oops output, regs, stack trace etc. could use a
revamp.

I've been thinking of making the trap number more prominent and
providing a text description, because apparently not everyone knows the
trap numbers by heart :)

>>   ...
>>   CPU: 0 PID: 1262 Comm: sh Not tainted 
>> 4.18.0-rc3-gcc-7.3.1-00098-g7fc2229fb2ab-dirty #9
>>   NIP:   LR: c000b9e4 CTR: 
>>   REGS: c000f19a3bb0 TRAP: 0490
>
> Unless I'm mistaken, the fake trap number was only because the code
> couldn't distinguish between 380 and 480. Now that you do, I think you
> can just use them directly rather than 390/490.

Hmm. I always thought it was there so that you could differentiate the
unrecoverable miss vs a normal miss using the trap number.

But maybe that's a bit superfluous given we're printing "unrecoverable
slb miss".

cheers


Re: [PATCH v5 0/8] powerpc/fsl: Speculation barrier for NXP PowerPC Book3E

2018-08-08 Thread Michael Ellerman
Diana Madalina Craciun  writes:

> Hi Michael,
>
> Sorry for the late answer, I was out of the office last week.
>
> It looks fine to me, I have tested the patches on NXP PowerPC Book 3E
> platforms and it worked well.

Thanks.

cheers


Re: [PATCH v02] powerpc/mobility: Fix node detach/rename problem

2018-08-08 Thread Michael Ellerman
Michael Bringmann  writes:
> diff --git a/arch/powerpc/platforms/pseries/mobility.c 
> b/arch/powerpc/platforms/pseries/mobility.c
> index e245a88..efc9442 100644
> --- a/arch/powerpc/platforms/pseries/mobility.c
> +++ b/arch/powerpc/platforms/pseries/mobility.c
> @@ -22,6 +22,9 @@
>  #include 
>  #include "pseries.h"
>  
> +extern int of_free_phandle_cache(void);
> +extern void of_populate_phandle_cache(void);

We don't do that, they should be in a header.

But that's a minor problem given that the patch doesn't compile, because
both those functions are static.

Presumably you have a hack in your tree to make them non-static?
Please try and compile your patches in a clean tree before sending.

cheers


Re: is there still any need PPC checking for "chosen@0"?

2018-08-08 Thread Michael Ellerman
"Robert P. J. Day"  writes:
>   given that there are no .dts files in the current kernel code base
> that define the node name "/chosen@0" instead of the proper "/chosen",

A good portion of PPC machines get their device tree from firmware, not
from a dts that's in the kernel tree, so that's not a good indication.

> is there any need for arch/powerpc/boot/oflib.c to still make this
> test:
>
> chosen = of_finddevice("/chosen");
> if (chosen == (phandle) -1) {
> chosen = of_finddevice("/chosen@0");   <--- this
> if (chosen == (phandle) -1) {
> printf("no chosen\n");
> return 0;
> }
> }
>
> are there still PPC machines that require the recognition of
> "/chosen@0"?

It was added by Paul in:

66a45dd3620e ("powerpc: Make COFF zImages for old 32-bit powermacs")

So presumably there's an old powermac somewhere that needs it.

Given it's basically 2 extra lines of code I'd be inclined to leave it
alone.

cheers


Re: Several suspected memory leaks

2018-08-08 Thread Michael Ellerman
Catalin Marinas  writes:
> On Wed, 11 Jul 2018 at 00:40, Benjamin Herrenschmidt
>  wrote:
>> On Tue, 2018-07-10 at 17:17 +0200, Paul Menzel wrote:
>> > On a the IBM S822LC (8335-GTA) with Ubuntu 18.04 I built Linux master
>> > – 4.18-rc4+, commit 092150a2 (Merge branch 'for-linus'
>> > of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid) – with
>> > kmemleak. Several issues are found.
>>
>> Some of these are completely uninteresting though and look like
>> kmemleak bugs to me :-)
>>
>> > [] __pud_alloc+0x80/0x270
>> > [<07135d64>] hash__map_kernel_page+0x30c/0x4d0
>> > [<71677858>] __ioremap_at+0x108/0x140
>> > [<0023e921>] __ioremap_caller+0x130/0x180
>> > [<9dbc3923>] icp_native_init_one_node+0x5cc/0x760
>> > [<15f3168a>] icp_native_init+0x70/0x13c
>> > [<60ed>] xics_init+0x38/0x1ac
>> > [<88dbf9d1>] pnv_init_IRQ+0x30/0x5c
>>
>> This is the interrupt controller mapping its registers, why on earth
>> would that be considered a leak ? kmemleak needs to learn to ignore
>> kernel page tables allocations.
>
> Indeed, that's just a false positive for powerpc. Kmemleak ignores
> page allocations and most architectures use __get_free_pages() for the
> page table. In this particular case, the powerpc code uses
> kmem_cache_alloc() and that's tracked by kmemleak. Since the pgd
> stores the __pa(pud), kmemleak doesn't detect this pointer and reports
> it as a leak. To work around this, you can pass SLAB_NOLEAKTRACE to
> kmem_cache_create() in pgtable_cache_add()

Ah thanks, I didn't know we could do it that way.

I did this instead which seems to work:

  https://git.kernel.org/torvalds/c/a984506c542e
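
For reference, the SLAB_NOLEAKTRACE variant Catalin describes would look
roughly like the sketch below (illustrative only, with made-up names -- it is
not what the commit above does):

	#include <linux/slab.h>

	static struct kmem_cache *example_pgtable_cache;

	/* SLAB_NOLEAKTRACE makes kmemleak skip objects from this cache, so the
	 * __pa()-converted pointers stored in the pgd no longer look like leaks */
	static int __init example_pgtable_cache_add(const char *name, size_t size,
						    size_t align)
	{
		example_pgtable_cache = kmem_cache_create(name, size, align,
							  SLAB_NOLEAKTRACE, NULL);
		return example_pgtable_cache ? 0 : -ENOMEM;
	}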


cheers


Re: [PATCH v2 2/2] powerpc/pseries: Wait for completion of hotplug events during PRRN handling

2018-08-08 Thread Michael Ellerman
John Allen  writes:
> On Wed, Aug 01, 2018 at 11:16:22PM +1000, Michael Ellerman wrote:
>>John Allen  writes:
>>> On Mon, Jul 23, 2018 at 11:41:24PM +1000, Michael Ellerman wrote:
John Allen  writes:

> While handling PRRN events, the time to handle the actual hotplug events
> dwarfs the time it takes to perform the device tree updates and queue the
> hotplug events. In the case that PRRN events are being queued 
> continuously,
> hotplug events have been observed to be queued faster than the kernel can
> actually handle them. This patch avoids the problem by waiting for a
> hotplug request to complete before queueing more hotplug events.
>>
>>Have you tested this patch in isolation, ie. not with patch 1?
>
> While I was away on vacation, I believe a build was tested with just 
> this patch and not the first and it has been running with no problems.  
> However, I think they've had problems recreating the problem in general 
> so it may just be that the environment is not setup properly to recreate 
> the issue.

Yes I asked Haren to test it :)

From memory there were some warnings still about the work queue being
blocked for long periods, but they weren't fatal.

So do we need the hotplug work queue at all? Can we just call
handle_dlpar_errorlog() directly?

Or are we using the work queue to serialise things? And if so would a
mutex be better?
>>>
>>> Right, the workqueue is meant to serialize all hotplug events and it
>>> gets used for more than just PRRN events. I believe the motivation for
>>> using the workqueue over a mutex is that KVM guests initiate hotplug
>>> events through the hotplug interrupt and can queue fairly large requests
>>> meaning that in this scenario, waiting for a lock would block interrupts
>>> for a while.
>>
>>OK, but that just means that path needs to schedule work to run later.
>>
>>> Using the workqueue allows us to serialize hotplug events
>>> from different sources in the same way without worrying about the
>>> context in which the event is generated.
>>
>>A lock would be so much simpler.
>>
>>It looks like we have three callers of queue_hotplug_event(), the dlpar
>>code, the mobility code and the ras interrupt.
>>
>>The dlpar code already waits synchronously:
>>
>>  init_completion(&hotplug_done);
>>  queue_hotplug_event(hp_elog, &hotplug_done, &rc);
>>  wait_for_completion(&hotplug_done);
>>
>>You're changing mobility to do the same (this patch), leaving only the
>>ras interrupt that actually queues work and returns.
>>
>>
>>So it really seems like a mutex would do the trick, and the ras
>>interrupt would be the only case that needs to schedule work for later.
>
> I think you may be right, but I would need some feedback from Nathan 
> Fontenot before I redesign the queue. He's been thinking about that 
> design for longer than I have and may know something that I don't 
> regarding the reason we're using a workqueue rather than a mutex.

> Given that the bug this is meant to address is pretty high priority, 
> would you consider the wait_for_completion an acceptable stopgap while a 
> more substantial redesign of this code is discussed?

Yeah that would be OK.
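
For illustration, the mutex approach discussed above might look something like
this (a sketch only, the names are made up):

	/* assumes handle_dlpar_errorlog() and struct pseries_hp_errorlog are
	 * visible, as in the existing pseries hotplug code */
	#include <linux/mutex.h>

	static DEFINE_MUTEX(example_hotplug_mutex);

	static int example_handle_hotplug(struct pseries_hp_errorlog *hp_elog)
	{
		int rc;

		mutex_lock(&example_hotplug_mutex);
		rc = handle_dlpar_errorlog(hp_elog);
		mutex_unlock(&example_hotplug_mutex);

		return rc;
	}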

cheers


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Wed, 2018-08-08 at 05:30 -0700, Christoph Hellwig wrote:
> On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> > Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> > is not set (default) but there's nothing in the device-tree to tell the
> > guest about this since it's a violation of our pseries architecture, so
> > we just rely on Linux virtio "knowing" that it happens. It's a bit
> > yucky but that's now history...
> 
> That is ugly as hell, but it is how virtio works everywhere, so nothing
> special so far.

Yup.

> > Essentially pseries "architecturally" does not have the concept of not
> > having an iommu in the way and qemu violates that architecture today.
> > 
> > (Remember it comes from pHyp, our proprietary HV, which we are somewhat
> > mimicking here).
> 
> It shouldn't be too hard to have a dt property that communicates this,
> should it?

We could invent something I suppose. The additional problem then (yeah
I know ... what a mess) is that qemu doesn't create the DT for PCI
devices, the firmware (SLOF) inside the guest does using normal PCI
probing.

That said, that FW could know about all the virtio vendor/device IDs,
check the VIRTIO_F_IOMMU_PLATFORM and set that property accordingly...
messy but doable. It's not a bus property (see my other reply below as
this could complicate things with your bus mask).

But we are drifting from the problem at hand :-) You propose we do set
VIRTIO_F_IOMMU_PLATFORM so we aren't in the above case, and the bypass
stuff works, so no need to touch it.

See my recap at the end of the email to make sure I understand fully
what you suggest.

> > So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> > through that iommu and performance will suffer (esp vhost I suspect),
> > especially since adding/removing translations in the iommu is a
> > hypercall.

> Well, we'd need to make sure that for this particular bus we skip the
> actual iommu.

It's not a bus property. Qemu will happily mix up everything on the
same bus, that includes emulated devices that go through the emulated
iommu, real VFIO devices that go through an actual HW iommu and virtio
that bypasses everything.

This makes things tricky in general (not just in my powerpc secure VM
case) since, at least on powerpc but I suppose elsewhere too, iommu
related properties tend to be per "bus" while here, qemu will mix and
match.

But again, I think we are drifting away from the topic, see below

> > > It would not be the same effect.  The problem with that is that you must
> > > now assumes that your qemu knows that for example you might be passing
> > > a dma offset if the bus otherwise requires it. 
> > 
> > I would assume that arch_virtio_wants_dma_ops() only returns true when
> > no such offsets are involved, at least in our case that would be what
> > happens.
> 
> That would work, but we're really piling hacks on top of hacks here.

Sort-of :-) At least none of what we are discussing now involves
touching the dma_ops themselves so we are not in the way of your big
cleanup operation here. But yeah, let's continue discussing your other
solution below.

> > >  Or in other words:
> > > you potentially break the contract between qemu and the guest of always
> > > passing down physical addresses.  If we explicitly change that contract
> > > through using a flag that says you pass bus address everything is fine.
> > 
> > For us a "bus address" is behind the iommu so that's what
> > VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> > bus address that is different. I suppose it's an ARMism to have DMA
> > offsets that are separate from iommus ? 
> 
> No, a lot of platforms support a bus address that has an offset from
> the physical address. including a lot of power platforms:

Ok, just talking past each other :-) For all the powerpc ones, these
*do* go through the iommu, which is what I meant. It's just a window of
the iommu that provides some kind of direct mapping of memory.

For pseries, there is no such thing however. What we do to avoid
constant map/unmap of iommu PTEs in pseries guests is that we use
hypercalls to create a 64-bit window and populate all its PTEs with an
identity mapping. But that's not as efficient as a real bypass.

There are good historical reasons for that, since pseries is a guest
platform, its memory is never really where the guest thinks it is, so
you always need an iommu to remap. Even for virtual devices, since for
most of them, in the "IBM" pHyp model, the "peer" is actually another
partition, so the virtual iommu handles translating across the two
partitions.

Same goes with cell in HW, no real bypass, just the iommu being
configured with very large pages and a fixed mapping.

powernv has a separate physical window that can be configured as a real
bypass though, so does the U4 DART. Not sure about the FSL one.

But yeah, your point stands, this is just implementation details.


[PATCH] powerpc/mm: Fix address space layout randomization (ASLR)

2018-08-08 Thread Christophe Leroy
Today, when the stack size is set to unlimited (ulimit -s unlimited),
the address returned by mmap() is not randomised.

This patch fixes it by applying the random factor to
TASK_UNMAPPED_BASE when setting mm->mmap_base

Link: https://github.com/linuxppc/linux/issues/59
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/mmap.c  | 4 ++--
 arch/powerpc/mm/slice.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index b24ce40acd47..07de04c4fb56 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -201,7 +201,7 @@ static void radix__arch_pick_mmap_layout(struct mm_struct 
*mm,
struct rlimit *rlim_stack)
 {
if (mmap_is_legacy(rlim_stack)) {
-   mm->mmap_base = TASK_UNMAPPED_BASE;
+   mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = radix__arch_get_unmapped_area;
} else {
mm->mmap_base = mmap_base(random_factor, rlim_stack);
@@ -233,7 +233,7 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
rlimit *rlim_stack)
 * bit is set, or if the expected stack growth is unlimited:
 */
if (mmap_is_legacy(rlim_stack)) {
-   mm->mmap_base = TASK_UNMAPPED_BASE;
+   mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = arch_get_unmapped_area;
} else {
mm->mmap_base = mmap_base(random_factor, rlim_stack);
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 205fe557ca10..cc1af565b813 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -329,7 +329,7 @@ static unsigned long slice_find_area_bottomup(struct 
mm_struct *mm,
info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
info.align_offset = 0;
 
-   addr = TASK_UNMAPPED_BASE;
+   addr = mm->mmap_base;
/*
 * Check till the allow max value for this mmap request
 */
-- 
2.13.3



Re: Build regressions/improvements in v4.17-rc1

2018-08-08 Thread Michael Ellerman
Michael Ellerman  writes:
> Andrew Morton  writes:
>> On Mon, 6 Aug 2018 12:39:21 +0200 Geert Uytterhoeven  
>> wrote:
>>
>>> CC Dan, Michael, AKPM, powerpc
>>> 
>>> On Mon, Apr 16, 2018 at 3:10 PM Geert Uytterhoeven  
>>> wrote:
>>> > Below is the list of build error/warning regressions/improvements in
>>> > v4.17-rc1[1] compared to v4.16[2].
>>> 
>>> I'd like to point your attention to:
>>> 
>>> >   + warning: vmlinux.o(.text+0x376518): Section mismatch in reference 
>>> > from the function .devm_memremap_pages() to the function 
>>> > .meminit.text:.arch_add_memory():  => N/A
>>> >   + warning: vmlinux.o(.text+0x376d64): Section mismatch in reference 
>>> > from the function .devm_memremap_pages_release() to the function 
>>> > .meminit.text:.arch_remove_memory():  => N/A
>>
>> hm.  Dan isn't around at present so we're on our own with this one.
>>
>> x86 doesn't put arch_add_memory and arch_remove_memory into __meminit. 
>> x86 does
>>
>> #ifdef CONFIG_MEMORY_HOTPLUG
>> int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>>  bool want_memblock)
>> {
>>  ...
>>
>>
>> So I guess powerpc should do that as well?
>
> But we only recently added it to fix a section mismatch warning:
>
>   WARNING: vmlinux.o(.text+0x6da88): Section mismatch in reference from the 
> function .arch_add_memory() to the function 
> .meminit.text:.create_section_mapping()
>   The function .arch_add_memory() references
>   the function __meminit .create_section_mapping().
>   This is often because .arch_add_memory lacks a __meminit 
>   annotation or the annotation of .create_section_mapping is wrong.
>
>
> I think the problem is that the section mismatch logic isn't able to
> cope with __meminit's changing semantics.
>
> When CONFIG_MEMORY_HOTPLUG=y references from .text to .meminit.text
> should be allowed, because they're just folded in together in the linker
> script.
>
> When CONFIG_MEMORY_HOTPLUG=n references from .text to .meminit.text
> should NOT be allowed, because .meminit.text becomes .init.text and will
> be freed.
>
> I don't see anything in the section mismatch logic to cope with that
> difference.
>
> It looks like __meminit is saving us about 1K on powerpc, so I'm
> strongly inclined to just remove it entirely from arch/powerpc.

Gah that doesn't work.

Our arch routines call things in mm/ that are __meminit, so we have to
mark our code __meminit too, otherwise we get a section mismatch
warning.

cheers


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Christoph Hellwig
On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> is not set (default) but there's nothing in the device-tree to tell the
> guest about this since it's a violation of our pseries architecture, so
> we just rely on Linux virtio "knowing" that it happens. It's a bit
> yucky but that's now history...

That is ugly as hell, but it is how virtio works everywhere, so nothing
special so far.

> Essentially pseries "architecturally" does not have the concept of not
> having an iommu in the way and qemu violates that architecture today.
> 
> (Remember it comes from pHyp, our proprietary HV, which we are somewhat
> mimicking here).

It shouldn't be too hard to have a dt property that communicates this,
should it?

> So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> through that iommu and performance will suffer (esp vhost I suspect),
> especially since adding/removing translations in the iommu is a
> hypercall.

Well, we'd need to make sure that for this particular bus we skip the
actual iommu.

> > It would not be the same effect.  The problem with that is that you must
> > now assumes that your qemu knows that for example you might be passing
> > a dma offset if the bus otherwise requires it. 
> 
> I would assume that arch_virtio_wants_dma_ops() only returns true when
> no such offsets are involved, at least in our case that would be what
> happens.

That would work, but we're really piling hacks on top of hacks here.

> >  Or in other words:
> > you potentially break the contract between qemu and the guest of always
> > passing down physical addresses.  If we explicitly change that contract
> > through using a flag that says you pass bus address everything is fine.
> 
> For us a "bus address" is behind the iommu so that's what
> VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> bus address that is different. I suppose it's an ARMism to have DMA
> offsets that are separate from iommus ? 

No, a lot of platforms support a bus address that has an offset from
the physical address. including a lot of power platforms:

arch/powerpc/kernel/pci-common.c:   set_dma_offset(&dev->dev, 
PCI_DRAM_OFFSET);
arch/powerpc/platforms/cell/iommu.c:set_dma_offset(dev, 
cell_dma_nommu_offset);
arch/powerpc/platforms/cell/iommu.c:set_dma_offset(dev, addr);
arch/powerpc/platforms/powernv/pci-ioda.c:  set_dma_offset(&pdev->dev, 
pe->tce_bypass_base);
arch/powerpc/platforms/powernv/pci-ioda.c:  
set_dma_offset(&pdev->dev, (1ULL << 32));
arch/powerpc/platforms/powernv/pci-ioda.c:  
set_dma_offset(&dev->dev, pe->tce_bypass_base);
arch/powerpc/platforms/pseries/iommu.c: 
set_dma_offset(dev, dma_offset);
arch/powerpc/sysdev/dart_iommu.c:   set_dma_offset(&dev->dev, 
DART_U4_BYPASS_BASE);
arch/powerpc/sysdev/fsl_pci.c:  set_dma_offset(dev, pci64_dma_offset);

to make things worse some platforms (at least on arm/arm64/mips/x86) can
also require additional banking where it isn't even a single linear map
but multiple windows.
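
(As an aside, a constant offset like the ones above just means the
device-visible bus address is a linear shift of the CPU physical address --
roughly, as an illustration only and not any particular platform's code:)

	#include <linux/types.h>

	/* a fixed linear offset between CPU physical addresses and
	 * device-visible bus addresses, no iommu involved */
	static inline dma_addr_t example_phys_to_bus(phys_addr_t paddr, u64 offset)
	{
		return (dma_addr_t)paddr + offset;
	}

	static inline phys_addr_t example_bus_to_phys(dma_addr_t baddr, u64 offset)
	{
		return (phys_addr_t)(baddr - offset);
	}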


Re: Build regressions/improvements in v4.17-rc1

2018-08-08 Thread Geert Uytterhoeven
On Wed, Aug 8, 2018 at 12:32 PM Michael Ellerman  wrote:
> Also I haven't been seeing this in my local builds because I have:
>
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
>
> So I guess we need to work out why that's interfering with section
> mismatch analysis.

One other common case of missing section mismatch warnings is when some
versions of the compiler inline the called function, while other versions don't.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH] lib/test_hexdump: fix failure on big endian cpu

2018-08-08 Thread Andy Shevchenko
On Wed, 2018-08-08 at 06:37 +, Christophe Leroy wrote:
> On a big endian cpu, test_hexdump fails as follows. The logs show
> that bytes are expected in reversed order.
> 
> [...]
> [   16.643648] test_hexdump: Len: 24 buflen: 130 strlen: 97
> [   16.648681] test_hexdump: Result: 97 'be32db7b 0a1893b2 70bac424
> 7d83349b a69c31ad
> 9c0face9.2.{p..$}.4...1.'
> [   16.660951] test_hexdump: Expect: 97 '7bdb32be b293180a 24c4ba70
> 9b34837d ad319ca6
> e9ac0f9c.2.{p..$}.4...1.'
> [   16.673129] test_hexdump: Len: 8 buflen: 130 strlen: 77
> [   16.678113] test_hexdump: Result: 77
> 'be32db7b0a1893b2 
> .2.{'
> [   16.688660] test_hexdump: Expect: 77
> 'b293180a7bdb32be 
> .2.{'
> [   16.699170] test_hexdump: Len: 6 buflen: 131 strlen: 87
> [   16.704238] test_hexdump: Result: 87 'be32 db7b
> 0a18  
>  .2.{..'
> [   16.715511] test_hexdump: Expect: 87 '32be 7bdb
> 180a  
>  .2.{..'
> [   16.726864] test_hexdump: Len: 24 buflen: 131 strlen: 97
> [   16.731902] test_hexdump: Result: 97 'be32db7b 0a1893b2 70bac424
> 7d83349b a69c31ad
> 9c0face9.2.{p..$}.4...1.'
> [   16.744175] test_hexdump: Expect: 97 '7bdb32be b293180a 24c4ba70
> 9b34837d ad319ca6
> e9ac0f9c.2.{p..$}.4...1.'
> [   16.756379] test_hexdump: Len: 32 buflen: 131 strlen: 101
> [   16.761507] test_hexdump: Result: 101 'be32db7b0a1893b2
> 70bac4247d83349b a69c31ad9c0face9
> 4cd1199943b1af0c  .2.{p..$}.4...1.L...C...'
> [   16.774212] test_hexdump: Expect: 101 'b293180a7bdb32be
> 9b34837d24c4ba70 e9ac0f9cad319ca6
> 0cafb1439919d14c  .2.{p..$}.4...1.L...C...'
> [   16.786763] test_hexdump: failed 801 out of 1184 tests
> 
> This patch fixes it.

I like this approach because in the future we might introduce something
to print BE data on a LE cpu or vice versa, and thus the test data will
be used independently of cpu endianness.

Reviewed-by: Andy Shevchenko 

Thanks for adding the BE support.
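
(To make the endianness effect concrete: hex_dump_to_buffer() prints each
group as a native-endian integer, so the same bytes render differently on LE
and BE. A standalone user-space sketch, not kernel code:)

	#include <stdio.h>
	#include <string.h>
	#include <stdint.h>
	#include <inttypes.h>

	int main(void)
	{
		const unsigned char data[4] = { 0xbe, 0x32, 0xdb, 0x7b };
		uint32_t group;

		memcpy(&group, data, sizeof(group));
		/* prints 7bdb32be on little-endian, be32db7b on big-endian */
		printf("%08" PRIx32 "\n", group);
		return 0;
	}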

> 
> Fixes: 64d1d77a44697 ("hexdump: introduce test suite")
> Signed-off-by: Christophe Leroy 
> ---
>  lib/test_hexdump.c | 28 +++-
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/test_hexdump.c b/lib/test_hexdump.c
> index 3f415d8101f3..626f580b4ff7 100644
> --- a/lib/test_hexdump.c
> +++ b/lib/test_hexdump.c
> @@ -18,7 +18,7 @@ static const unsigned char data_b[] = {
>  
>  static const unsigned char data_a[] =
> ".2.{p..$}.4...1.L...C...";
>  
> -static const char * const test_data_1_le[] __initconst = {
> +static const char * const test_data_1[] __initconst = {
>   "be", "32", "db", "7b", "0a", "18", "93", "b2",
>   "70", "ba", "c4", "24", "7d", "83", "34", "9b",
>   "a6", "9c", "31", "ad", "9c", "0f", "ac", "e9",
> @@ -32,16 +32,33 @@ static const char * const test_data_2_le[]
> __initconst = {
>   "d14c", "9919", "b143", "0caf",
>  };
>  
> +static const char * const test_data_2_be[] __initconst = {
> + "be32", "db7b", "0a18", "93b2",
> + "70ba", "c424", "7d83", "349b",
> + "a69c", "31ad", "9c0f", "ace9",
> + "4cd1", "1999", "43b1", "af0c",
> +};
> +
>  static const char * const test_data_4_le[] __initconst = {
>   "7bdb32be", "b293180a", "24c4ba70", "9b34837d",
>   "ad319ca6", "e9ac0f9c", "9919d14c", "0cafb143",
>  };
>  
> +static const char * const test_data_4_be[] __initconst = {
> + "be32db7b", "0a1893b2", "70bac424", "7d83349b",
> + "a69c31ad", "9c0face9", "4cd11999", "43b1af0c",
> +};
> +
>  static const char * const test_data_8_le[] __initconst = {
>   "b293180a7bdb32be", "9b34837d24c4ba70",
>   "e9ac0f9cad319ca6", "0cafb1439919d14c",
>  };
>  
> +static const char * const test_data_8_be[] __initconst = {
> + "be32db7b0a1893b2", "70bac4247d83349b",
> + "a69c31ad9c0face9", "4cd1199943b1af0c",
> +};
> +
>  #define FILL_CHAR'#'
>  
>  static unsigned total_tests __initdata;
> @@ -56,6 +73,7 @@ static void __init test_hexdump_prepare_test(size_t
> len, int rowsize,
>   size_t l = len;
>   int gs = groupsize, rs = rowsize;
>   unsigned int i;
> + const bool is_be = IS_ENABLED(CONFIG_CPU_BIG_ENDIAN);
>  
>   if (rs != 16 && rs != 32)
>   rs = 16;
> @@ -67,13 +85,13 @@ static void __init
> test_hexdump_prepare_test(size_t len, int rowsize,
>   gs = 1;
>  
>   if (gs == 8)
> - result = test_data_8_le;
> + result = is_be ? test_data_8_be : test_data_8_le;
>   else if (gs == 4)
> - result = test_data_4_le;
> + result = is_be ? test_data_4_be : test_data_4_le;
>   else if (gs == 2)
> - result = test_data_2_le;
> + result = is_be ? test_data_2_be : test_data_2_l

[PATCH] powerpc: fix size calculation using resource_size()

2018-08-08 Thread Dan Carpenter
The problem is that the calculation should be "end - start + 1", but the
plus one is missing from this calculation.

Fixes: 8626816e905e ("powerpc: add support for MPIC message register API")
Signed-off-by: Dan Carpenter 
---
Static analysis.  Not tested.
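
For reference, resource_size() (include/linux/ioport.h) encodes exactly that
inclusive-range arithmetic:

	static inline resource_size_t resource_size(const struct resource *res)
	{
		return res->end - res->start + 1;
	}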

diff --git a/arch/powerpc/sysdev/mpic_msgr.c b/arch/powerpc/sysdev/mpic_msgr.c
index eb69a5186243..280e964e1aa8 100644
--- a/arch/powerpc/sysdev/mpic_msgr.c
+++ b/arch/powerpc/sysdev/mpic_msgr.c
@@ -196,7 +196,7 @@ static int mpic_msgr_probe(struct platform_device *dev)
 
/* IO map the message register block. */
of_address_to_resource(np, 0, &rsrc);
-   msgr_block_addr = ioremap(rsrc.start, rsrc.end - rsrc.start);
+   msgr_block_addr = ioremap(rsrc.start, resource_size(&rsrc));
if (!msgr_block_addr) {
dev_err(&dev->dev, "Failed to iomap MPIC message registers");
return -EFAULT;


Re: Build regressions/improvements in v4.17-rc1

2018-08-08 Thread Mathieu Malaterre
On Wed, Aug 8, 2018 at 12:34 PM Michael Ellerman  wrote:
>
> Andrew Morton  writes:
> > On Mon, 6 Aug 2018 12:39:21 +0200 Geert Uytterhoeven  
> > wrote:
> >
> >> CC Dan, Michael, AKPM, powerpc
> >>
> >> On Mon, Apr 16, 2018 at 3:10 PM Geert Uytterhoeven  
> >> wrote:
> >> > Below is the list of build error/warning regressions/improvements in
> >> > v4.17-rc1[1] compared to v4.16[2].
> >>
> >> I'd like to point your attention to:
> >>
> >> >   + warning: vmlinux.o(.text+0x376518): Section mismatch in reference 
> >> > from the function .devm_memremap_pages() to the function 
> >> > .meminit.text:.arch_add_memory():  => N/A
> >> >   + warning: vmlinux.o(.text+0x376d64): Section mismatch in reference 
> >> > from the function .devm_memremap_pages_release() to the function 
> >> > .meminit.text:.arch_remove_memory():  => N/A
> >
> > hm.  Dan isn't around at present so we're on our own with this one.
> >
> > x86 doesn't put arch_add_memory and arch_remove_memory into __meminit.
> > x86 does
> >
> > #ifdef CONFIG_MEMORY_HOTPLUG
> > int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap 
> > *altmap,
> >   bool want_memblock)
> > {
> >   ...
> >
> >
> > So I guess powerpc should do that as well?
>
> But we only recently added it to fix a section mismatch warning:
>
>   WARNING: vmlinux.o(.text+0x6da88): Section mismatch in reference from the 
> function .arch_add_memory() to the function 
> .meminit.text:.create_section_mapping()
>   The function .arch_add_memory() references
>   the function __meminit .create_section_mapping().
>   This is often because .arch_add_memory lacks a __meminit
>   annotation or the annotation of .create_section_mapping is wrong.
>
>
> I think the problem is that the section mismatch logic isn't able to
> cope with __meminit's changing semantics.
>
> When CONFIG_MEMORY_HOTPLUG=y references from .text to .meminit.text
> should be allowed, because they're just folded in together in the linker
> script.
>
> When CONFIG_MEMORY_HOTPLUG=n references from .text to .meminit.text
> should NOT be allowed, because .meminit.text becomes .init.text and will
> be freed.
>
> I don't see anything in the section mismatch logic to cope with that
> difference.
>
> It looks like __meminit is saving us about 1K on powerpc, so I'm
> strongly inclined to just remove it entirely from arch/powerpc.
>
> Also I haven't been seeing this in my local builds because I have:
>
> CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
>
> So I guess we need to work out why that's interfering with section
> mismatch analysis.

...

Well that's a good question actually. Section mismatch
analysis is done on the throwaway vmlinux.o which is not linked
with --gc-sections (and is not a final link), so the via_pmu_driver
symbol should exist and be picked up.

I wonder if something about the  -ffunction-sections is breaking
the reference detection.
...



ref:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg135431.html

>
> cheers


Re: [PATCH] of/fdt: Remove PPC32 longtrail hack in memory scan

2018-08-08 Thread Dominik Klein
Hi guys,

On 08.08.2018 11:29, Geert Uytterhoeven wrote:
> [...]
> Acked-by: Geert Uytterhoeven 
> 
> However, recently Dominik (CC) send me an enquiry about it, so perhaps
> he is a happy user?

Nope - don't have one either, so go ahead :)


Regards,
Dominik


Re: Build regressions/improvements in v4.17-rc1

2018-08-08 Thread Michael Ellerman
Andrew Morton  writes:
> On Mon, 6 Aug 2018 12:39:21 +0200 Geert Uytterhoeven  
> wrote:
>
>> CC Dan, Michael, AKPM, powerpc
>> 
>> On Mon, Apr 16, 2018 at 3:10 PM Geert Uytterhoeven  
>> wrote:
>> > Below is the list of build error/warning regressions/improvements in
>> > v4.17-rc1[1] compared to v4.16[2].
>> 
>> I'd like to point your attention to:
>> 
>> >   + warning: vmlinux.o(.text+0x376518): Section mismatch in reference from 
>> > the function .devm_memremap_pages() to the function 
>> > .meminit.text:.arch_add_memory():  => N/A
>> >   + warning: vmlinux.o(.text+0x376d64): Section mismatch in reference from 
>> > the function .devm_memremap_pages_release() to the function 
>> > .meminit.text:.arch_remove_memory():  => N/A
>
> hm.  Dan isn't around at present so we're on our own with this one.
>
> x86 doesn't put arch_add_memory and arch_remove_memory into __meminit. 
> x86 does
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
>   bool want_memblock)
> {
>   ...
>
>
> So I guess powerpc should do that as well?

But we only recently added it to fix a section mismatch warning:

  WARNING: vmlinux.o(.text+0x6da88): Section mismatch in reference from the 
function .arch_add_memory() to the function 
.meminit.text:.create_section_mapping()
  The function .arch_add_memory() references
  the function __meminit .create_section_mapping().
  This is often because .arch_add_memory lacks a __meminit 
  annotation or the annotation of .create_section_mapping is wrong.


I think the problem is that the section mismatch logic isn't able to
cope with __meminit's changing semantics.

When CONFIG_MEMORY_HOTPLUG=y references from .text to .meminit.text
should be allowed, because they're just folded in together in the linker
script.

When CONFIG_MEMORY_HOTPLUG=n references from .text to .meminit.text
should NOT be allowed, because .meminit.text becomes .init.text and will
be freed.

I don't see anything in the section mismatch logic to cope with that
difference.
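
(The folding is driven by these macros in include/asm-generic/vmlinux.lds.h,
roughly -- TEXT_TEXT uses MEM_KEEP(init.text) while INIT_TEXT uses
MEM_DISCARD(init.text):)

	#ifdef CONFIG_MEMORY_HOTPLUG
	#define MEM_KEEP(sec)    *(.mem##sec)
	#define MEM_DISCARD(sec)
	#else
	#define MEM_KEEP(sec)
	#define MEM_DISCARD(sec) *(.mem##sec)
	#endif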

It looks like __meminit is saving us about 1K on powerpc, so I'm
strongly inclined to just remove it entirely from arch/powerpc.

Also I haven't been seeing this in my local builds because I have:

CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y

So I guess we need to work out why that's interfering with section
mismatch analysis.

cheers


Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Tue, 2018-08-07 at 23:31 -0700, Christoph Hellwig wrote:
> 
> You don't need to set them the time you go secure.  You just need to
> set the flag from the beginning on any VM you might want to go secure.
> Or for simplicity just any VM - if the DT/ACPI tables exposed by
> qemu are good enough that will always exclude a iommu and not set a
> DMA offset, so nothing will change on the qemu side of he processing,
> and with the new direct calls for the direct dma ops performance in
> the guest won't change either.

So that's where I'm not sure things are "good enough" due to how
pseries works. (remember it's paravirtualized).

A pseries system starts with a default iommu on all devices, that uses
translation using 4k entries with a "pinhole" window (usually 2G with
qemu iirc). There's no "pass through" by default.

Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
is not set (default) but there's nothing in the device-tree to tell the
guest about this since it's a violation of our pseries architecture, so
we just rely on Linux virtio "knowing" that it happens. It's a bit
yucky but that's now history...

Essentially pseries "architecturally" does not have the concept of not
having an iommu in the way and qemu violates that architecture today.

(Remember it comes from pHyp, our proprietary HV, which we are somewhat
mimicking here).

So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
through that iommu and performance will suffer (esp vhost I suspect),
especially since adding/removing translations in the iommu is a
hypercall.

Now, we do have HV APIs to create a second window that's "permanently
mapped" to the guest memory, thus avoiding dynamic map/unmaps, and
Linux can make use of this but I don't know if that works with qemu and
the performance impact with vhost.

So the situation isn't that great. On the other hand, I think the
other approach works for us:

> > It's nicer if we have a way in the guest virtio driver to do something
> > along the lines of
> > 
> > if ((flags & VIRTIO_F_IOMMU_PLATFORM) || arch_virtio_wants_dma_ops())
> > 
> > Which would have the same effect and means the issue is entirely
> > contained in the guest.
> 
> It would not be the same effect.  The problem with that is that you must
> now assumes that your qemu knows that for example you might be passing
> a dma offset if the bus otherwise requires it. 

I would assume that arch_virtio_wants_dma_ops() only returns true when
no such offsets are involved, at least in our case that would be what
happens.
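
Roughly, the guest-side check being discussed would be (a sketch of the idea
only; arch_virtio_wants_dma_ops() is the hypothetical hook from above, not an
existing function):

	/* drivers/virtio/virtio_ring.c, sketch of the proposed check */
	static bool vring_use_dma_api(struct virtio_device *vdev)
	{
		if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
			return true;

		/* hypothetical arch opt-in, e.g. for secure pseries guests */
		if (arch_virtio_wants_dma_ops())
			return true;

		return false;
	}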

>  Or in other words:
> you potentially break the contract between qemu and the guest of always
> passing down physical addresses.  If we explicitly change that contract
> through using a flag that says you pass bus address everything is fine.

For us a "bus address" is behind the iommu so that's what
VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
bus address that is different. I suppose it's an ARMism to have DMA
offsets that are separate from iommus ? 

> Note that in practice your scheme will probably just work for your
> initial prototype, but chances are it will get us in trouble later on.

Not on pseries, at least not in any way I can think of mind you... but
maybe other architectures would abuse it... We could add a WARN_ON if
that calls returns true on a bus with an offset I suppose.

Cheers,
Ben.



